WIP

2026-06-09 21:22:36 +00:00 · 2024-10-28 15:50:47 -07:00 · 2024-10-28 15:35:10 -07:00 · 2024-10-28 14:34:30 -07:00 · 2024-10-28 14:21:08 -07:00 · 2024-10-25 14:45:52 -07:00
76 changed files with 8187 additions and 1668 deletions
@@ -1,65 +1,6 @@
 Versity ScoutFS Release Notes
 =============================

---
-v1.25
-\
-*Jun 3, 2025*
-
-Fix a bug that could cause indefinite retries of failed client commits.
-Under specific error conditions the client and server's understanding of
-the current client commit could get out of sync.  The client would retry
-commits indefinitely that could never succeed.  This manifested as
-infinite "critical transaction commit failure" messages in the kernel
-log on the client and matching "error <nr> committing client logs" on
-the server.
-
-Fix a bug in a specific case of server error handling that could result
-in sending references to unwritten blocks to the client.  The client
-would try to read blocks that hadn't been written and return spurious
-errors.  This was seen under low free space conditions on the server and
-resulted in error messages with error code 116 (The errno enum for
-ESTALE, the client's indication that it couldn't read the blocks that it
-expected.)
-
---
-v1.24
-\
-*Mar 14, 2025*
-
-Add support for coherent read and write mmap() mappings of regular file
-data between mounts.
-
-Fix a bug that was causing scoutfs utilities to parse and change some
-file names before passing them on to the kernel for processing.  This
-fixes spurious scoutfs command errors for files with the offending
-patterns in their names.
-
-Fix a bug where rename wasn't updating the ctime of the inode at the
-destination name if it existed.
-
---
-v1.23
-\
-*Dec 11, 2024*
-
-Add support for kernels in the RHEL 9.5 minor release.
-
---
-v1.22
-\
-*Nov 1, 2024*
-
-Add support for building against the RHEL9 family of kernels.
-
-Fix failure of the setattr\_more ioctl() to set the attributes of a
-zero-length file when restoring.
-
-Fix support for POSIX ACLs in the RHEL8 and later family of kernels.
-
-Fix a race condition in the lock server that could drop lock requests
-under heavy load and cause cluster lock attempts to hang.
-
 ---
 v1.21
 \
@@ -4,6 +4,9 @@
 %define kmod_git_describe @@GITDESCRIBE@@
 %define pkg_date %(date +%%Y%%m%%d)

+# Disable the building of the debug package(s).
+%define debug_package %{nil}
+
 # take kernel version or default to uname -r
 %{!?kversion: %global kversion %(uname -r)}
 %global kernel_version %{kversion}
@@ -53,18 +56,6 @@ Source:		%{kmod_name}-kmod-%{kmod_version}.tar
 %global flavors_to_build x86_64
 %endif

-# el9 sanity: make sure we lock to the minor release we built for and block upgrades
-%{lua:
-  if string.match(rpm.expand("%{dist}"), "%.el9") then
-    rpm.define("el9 1")
-  end
-}
-
-%if 0%{?el9}
-%define release_major_minor 9.%{lua: print(rpm.expand("%{dist}"):match("%.el9_(%d)"))}
-Requires: system-release = %{release_major_minor}
-%endif
-
 %description
 %{kmod_name} - kernel module

@@ -6,6 +6,26 @@

 ccflags-y += -include $(src)/kernelcompat.h

+#
+# v3.10-rc6-21-gbb6f619b3a49
+#
+# _readdir changes from fop->readdir() to fop->iterate() and from
+# filldir(dirent) to dir_emit(ctx).
+#
+ifneq (,$(shell grep 'iterate.*dir_context' include/linux/fs.h))
+ccflags-y += -DKC_ITERATE_DIR_CONTEXT
+endif
+
+#
+# v3.10-rc6-23-g5f99f4e79abc
+#
+# Helpers including dir_emit_dots() are added in the process of
+# switching dcache_readdir() from fop->readdir() to fop->iterate()
+#
+ifneq (,$(shell grep 'dir_emit_dots' include/linux/fs.h))
+ccflags-y += -DKC_DIR_EMIT_DOTS
+endif
+
 #
 # v3.18-rc2-19-gb5ae6b15bd73
 # 
@@ -172,7 +192,7 @@ endif
 #
 # Kernel has current_time(inode) to uniformly retreive timespec in the right unit
 #
-ifneq (,$(shell grep 'struct timespec64 current_time' include/linux/fs.h))
+ifneq (,$(shell grep 'extern struct timespec64 current_time' include/linux/fs.h))
 ccflags-y += -DKC_CURRENT_TIME_INODE=1
 endif

@@ -393,44 +413,3 @@ endif
 ifneq (,$(shell grep 'blkdev_put.struct block_device .bdev, void .holder' include/linux/blkdev.h))
 ccflags-y += -DKC_BLKDEV_PUT_HOLDER_ARG
 endif
-
-#
-# v6.4-rc4-163-g0d625446d0a4
-#
-# Entirely removes current->backing_dev_info to ultimately remove buffer_head
-# completely at some point.
-ifneq (,$(shell grep 'struct backing_dev_info.*backing_dev_info;' include/linux/sched.h))
-ccflags-y += -DKC_CURRENT_BACKING_DEV_INFO
-endif
-
-#
-# v6.8-rc1-4-gf3a608827d1f
-#
-# adds bdev_file_open_by_path() and later in v6.8-rc1-30-ge97d06a46526 removes bdev_open_by_path()
-# which requires us to use the file method from now on.
-ifneq (,$(shell grep 'struct file.*bdev_file_open_by_path.const char.*path' include/linux/blkdev.h))
-ccflags-y += -DKC_BDEV_FILE_OPEN_BY_PATH
-endif
-
-# v4.0-rc7-1796-gfe0f07d08ee3
-#
-# direct-io changes modify inode_dio_done to now be called inode_dio_end
-ifneq (,$(shell grep 'void inode_dio_end.struct inode' include/linux/fs.h))
-ccflags-y += -DKC_INODE_DIO_END
-endif
-
-#
-# v5.0-6476-g3d3539018d2c
-#
-# page fault handlers return a bitmask vm_fault_t instead
-# Note: el8's header has a slightly modified prefix here
-ifneq (,$(shell grep 'typedef.*__bitwise unsigned.*int vm_fault_t' include/linux/mm_types.h))
-ccflags-y += -DKC_MM_VM_FAULT_T
-endif
-
-# v3.19-499-gd83a08db5ba6
-#
-# .remap pages becomes obsolete
-ifneq (,$(shell grep 'int ..remap_pages..struct vm_area_struct' include/linux/mm.h))
-ccflags-y += -DKC_MM_REMAP_PAGES
-endif
@@ -560,7 +560,7 @@ static int scoutfs_get_block(struct inode *inode, sector_t iblock,
 	u64 offset;
 	int ret;

-	WARN_ON_ONCE(create && !rwsem_is_locked(&si->extent_sem));
+	WARN_ON_ONCE(create && !inode_is_locked(inode));

 	/* make sure caller holds a cluster lock */
 	lock = scoutfs_per_task_get(&si->pt_data_lock);
@@ -1551,17 +1551,13 @@ int scoutfs_data_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 	struct super_block *sb = inode->i_sb;
 	const u64 ino = scoutfs_ino(inode);
 	struct scoutfs_lock *lock = NULL;
-	struct scoutfs_extent *info = NULL;
-	struct page *page = NULL;
 	struct scoutfs_extent ext;
 	struct scoutfs_extent cur;
 	struct data_ext_args args;
 	u32 last_flags;
 	u64 iblock;
 	u64 last;
-	int entries = 0;
 	int ret;
-	int complete = 0;

 	if (len == 0) {
 		ret = 0;
@@ -1572,11 +1568,16 @@ int scoutfs_data_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 	if (ret)
 		goto out;

-	page = alloc_page(GFP_KERNEL);
-	if (!page) {
-		ret = -ENOMEM;
-		goto out;
-	}
+	inode_lock(inode);
+	down_read(&si->extent_sem);
+
+	ret = scoutfs_lock_inode(sb, SCOUTFS_LOCK_READ, 0, inode, &lock);
+	if (ret)
+		goto unlock;
+
+	args.ino = ino;
+	args.inode = inode;
+	args.lock = lock;

 	/* use a dummy extent to track */
 	memset(&cur, 0, sizeof(cur));
@@ -1585,93 +1586,48 @@ int scoutfs_data_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 	iblock = start >> SCOUTFS_BLOCK_SM_SHIFT;
 	last = (start + len - 1) >> SCOUTFS_BLOCK_SM_SHIFT;

-	args.ino = ino;
-	args.inode = inode;
-
-	/* outer loop */
 	while (iblock <= last) {
-		/* lock */
-		inode_lock(inode);
-		down_read(&si->extent_sem);
-
-		ret = scoutfs_lock_inode(sb, SCOUTFS_LOCK_READ, 0, inode, &lock);
-		if (ret) {
-			up_read(&si->extent_sem);
-			inode_unlock(inode);
-			break;
-		}
-
-		args.lock = lock;
-
-		/* collect entries */
-		info = page_address(page);
-		memset(info, 0, PAGE_SIZE);
-		while (entries < (PAGE_SIZE / sizeof(struct fiemap_extent)) - 1) {
-			ret = scoutfs_ext_next(sb, &data_ext_ops, &args,
-					       iblock, 1, &ext);
-			if (ret < 0) {
-				if (ret == -ENOENT)
-					ret = 0;
-				complete = 1;
-				last_flags = FIEMAP_EXTENT_LAST;
-				break;
-			}
-
-			trace_scoutfs_data_fiemap_extent(sb, ino, &ext);
-
-			if (ext.start > last) {
-				/* not setting _LAST, it's for end of file */
+		ret = scoutfs_ext_next(sb, &data_ext_ops, &args,
+				       iblock, 1, &ext);
+		if (ret < 0) {
+			if (ret == -ENOENT)
 				ret = 0;
-				complete = 1;
-				break;
-			}
-
-			if (scoutfs_ext_can_merge(&cur, &ext)) {
-				/* merged extents could be greater than input len */
-				cur.len += ext.len;
-			} else {
-				/* fill it */
-				memcpy(info, &cur, sizeof(cur));
-
-				entries++;
-				info++;
-
-				cur = ext;
-			}
-
-			iblock = ext.start + ext.len;
+			last_flags = FIEMAP_EXTENT_LAST;
+			break;
 		}

-		/* unlock */
-		scoutfs_unlock(sb, lock, SCOUTFS_LOCK_READ);
-		up_read(&si->extent_sem);
-		inode_unlock(inode);
+		trace_scoutfs_data_fiemap_extent(sb, ino, &ext);

-		if (ret)
+		if (ext.start > last) {
+			/* not setting _LAST, it's for end of file */
+			ret = 0;
 			break;
+		}

-		/* emit entries */
-		info = page_address(page);
-		for (; entries > 0; entries--) {
-			ret = fill_extent(fieinfo, info, 0);
+		if (scoutfs_ext_can_merge(&cur, &ext)) {
+			/* merged extents could be greater than input len */
+			cur.len += ext.len;
+		} else {
+			ret = fill_extent(fieinfo, &cur, 0);
 			if (ret != 0)
-				goto out;
-			info++;
+				goto unlock;
+			cur = ext;
 		}

-		if (complete)
-			break;
+		iblock = ext.start + ext.len;
 	}

-	/* still one left, it's in cur */
 	if (cur.len)
 		ret = fill_extent(fieinfo, &cur, last_flags);
+unlock:
+	scoutfs_unlock(sb, lock, SCOUTFS_LOCK_READ);
+	up_read(&si->extent_sem);
+	inode_unlock(inode);

 out:
 	if (ret == 1)
 		ret = 0;
-	if (page)
-		__free_page(page);
+
 	trace_scoutfs_data_fiemap(sb, start, len, ret);

 	return ret;
@@ -1958,236 +1914,6 @@ int scoutfs_data_waiting(struct super_block *sb, u64 ino, u64 iblock,
 	return ret;
 }

-#ifdef KC_MM_VM_FAULT_T
-static vm_fault_t scoutfs_data_page_mkwrite(struct vm_fault *vmf)
-{
-	struct vm_area_struct *vma = vmf->vma;
-#else
-static int scoutfs_data_page_mkwrite(struct vm_area_struct *vma,
-				     struct vm_fault *vmf)
-{
-#endif
-	struct page *page = vmf->page;
-	struct file *file = vma->vm_file;
-	struct inode *inode = file_inode(file);
-	struct scoutfs_inode_info *si = SCOUTFS_I(inode);
-	struct super_block *sb = inode->i_sb;
-	struct scoutfs_lock *lock = NULL;
-	SCOUTFS_DECLARE_PER_TASK_ENTRY(pt_ent);
-	DECLARE_DATA_WAIT(dw);
-	struct write_begin_data wbd;
-	u64 ind_seq;
-	loff_t pos;
-	loff_t size;
-	unsigned int len = PAGE_SIZE;
-	vm_fault_t ret = VM_FAULT_SIGBUS;
-	int err;
-
-	pos = vmf->pgoff << PAGE_SHIFT;
-
-	sb_start_pagefault(sb);
-
-	err = scoutfs_lock_inode(sb, SCOUTFS_LOCK_WRITE,
-				 SCOUTFS_LKF_REFRESH_INODE, inode, &lock);
-	if (err) {
-		ret = vmf_error(err);
-		goto out;
-	}
-
-	size = i_size_read(inode);
-
-	if (scoutfs_per_task_add_excl(&si->pt_data_lock, &pt_ent, lock)) {
-		/* data_version is per inode, whole file must be online */
-		err = scoutfs_data_wait_check(inode, 0, size,
-					      SEF_OFFLINE,
-					      SCOUTFS_IOC_DWO_WRITE,
-					      &dw, lock);
-		if (err != 0) {
-			if (err < 0)
-				ret = vmf_error(err);
-			goto out_unlock;
-		}
-	}
-
-
-	/* scoutfs_write_begin */
-	memset(&wbd, 0, sizeof(wbd));
-	INIT_LIST_HEAD(&wbd.ind_locks);
-	wbd.lock = lock;
-
-	/*
-	 * Start transaction before taking page locks - we want to make sure we're
-	 * not locking a page, then waiting for trans, because writeback might race
-	 * against it and cause a lock inversion hang - as demonstrated by both
-	 * holetest and fsstress tests in xfstests.
-	 */
-	do {
-		err = scoutfs_inode_index_start(sb, &ind_seq) ?:
-			scoutfs_inode_index_prepare(sb, &wbd.ind_locks, inode,
-						    true) ?:
-			scoutfs_inode_index_try_lock_hold(sb, &wbd.ind_locks,
-							  ind_seq, false);
-	} while (err > 0);
-	if (err < 0) {
-		ret = vmf_error(err);
-		goto out_trans;
-	}
-
-	down_write(&si->extent_sem);
-
-	if (!trylock_page(page)) {
-		ret = VM_FAULT_NOPAGE;
-		goto out_sem;
-	}
-	ret = VM_FAULT_LOCKED;
-
-	if ((page->mapping != inode->i_mapping) ||
-	    (!PageUptodate(page)) ||
-	    (page_offset(page) > size))	 {
-		unlock_page(page);
-		ret = VM_FAULT_NOPAGE;
-		goto out_sem;
-	}
-
-	if (page->index == (size - 1) >> PAGE_SHIFT)
-		len = ((size - 1) & ~PAGE_MASK) + 1;
-
-	err = __block_write_begin(page, pos, PAGE_SIZE, scoutfs_get_block);
-	if (err) {
-		ret = vmf_error(err);
-		unlock_page(page);
-		goto out_sem;
-	}
-	/* end scoutfs_write_begin */
-
-	/*
-	 * We mark the page dirty already here so that when freeze is in
-	 * progress, we are guaranteed that writeback during freezing will
-	 * see the dirty page and writeprotect it again.
-	 */
-	set_page_dirty(page);
-	wait_for_stable_page(page);
-
-	/* scoutfs_write_end */
-	scoutfs_inode_set_data_seq(inode);
-	scoutfs_inode_inc_data_version(inode);
-
-	file_update_time(vma->vm_file);
-
-	scoutfs_update_inode_item(inode, wbd.lock, &wbd.ind_locks);
-	scoutfs_inode_queue_writeback(inode);
-
-out_sem:
-	up_write(&si->extent_sem);
-out_trans:
-	scoutfs_release_trans(sb);
-	scoutfs_inode_index_unlock(sb, &wbd.ind_locks);
-	/* end scoutfs_write_end */
-
-out_unlock:
-	scoutfs_per_task_del(&si->pt_data_lock, &pt_ent);
-	scoutfs_unlock(sb, lock, SCOUTFS_LOCK_WRITE);
-
-out:
-	sb_end_pagefault(sb);
-
-	if (scoutfs_data_wait_found(&dw)) {
-		/*
-		 * It'd be really nice to not hold the mmap_sem lock here
-		 * before waiting for data, and then return VM_FAULT_RETRY
-		 */
-		err = scoutfs_data_wait(inode, &dw);
-		if (err == 0)
-			ret = VM_FAULT_NOPAGE;
-		else
-			ret = vmf_error(err);
-	}
-
-	trace_scoutfs_data_page_mkwrite(sb, scoutfs_ino(inode), pos, (__force u32)ret);
-
-	return ret;
-}
-
-#ifdef KC_MM_VM_FAULT_T
-static vm_fault_t scoutfs_data_filemap_fault(struct vm_fault *vmf)
-{
-	struct vm_area_struct *vma = vmf->vma;
-#else
-static int scoutfs_data_filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
-{
-#endif
-	struct file *file = vma->vm_file;
-	struct inode *inode = file_inode(file);
-	struct scoutfs_inode_info *si = SCOUTFS_I(inode);
-	struct super_block *sb = inode->i_sb;
-	struct scoutfs_lock *inode_lock = NULL;
-	SCOUTFS_DECLARE_PER_TASK_ENTRY(pt_ent);
-	DECLARE_DATA_WAIT(dw);
-	loff_t pos;
-	int err;
-	vm_fault_t ret = VM_FAULT_SIGBUS;
-
-	pos = vmf->pgoff;
-	pos <<= PAGE_SHIFT;
-
-retry:
-	err = scoutfs_lock_inode(sb, SCOUTFS_LOCK_READ,
-				 SCOUTFS_LKF_REFRESH_INODE, inode, &inode_lock);
-	if (err < 0)
-		return vmf_error(err);
-
-	if (scoutfs_per_task_add_excl(&si->pt_data_lock, &pt_ent, inode_lock)) {
-		/* protect checked extents from stage/release */
-		atomic_inc(&inode->i_dio_count);
-
-		err = scoutfs_data_wait_check(inode, pos, PAGE_SIZE,
-					      SEF_OFFLINE, SCOUTFS_IOC_DWO_READ,
-					      &dw, inode_lock);
-		if (err != 0) {
-			if (err < 0)
-				ret = vmf_error(err);
-			goto out;
-		}
-	}
-
-#ifdef KC_MM_VM_FAULT_T
-	ret = filemap_fault(vmf);
-#else
-	ret = filemap_fault(vma, vmf);
-#endif
-
-out:
-	if (scoutfs_per_task_del(&si->pt_data_lock, &pt_ent))
-		kc_inode_dio_end(inode);
-	scoutfs_unlock(sb, inode_lock, SCOUTFS_LOCK_READ);
-	if (scoutfs_data_wait_found(&dw)) {
-		err = scoutfs_data_wait(inode, &dw);
-		if (err == 0)
-			goto retry;
-
-		ret = VM_FAULT_RETRY;
-	}
-
-	trace_scoutfs_data_filemap_fault(sb, scoutfs_ino(inode), pos, (__force u32)ret);
-
-	return ret;
-}
-
-static const struct vm_operations_struct scoutfs_data_file_vm_ops = {
-	.fault		= scoutfs_data_filemap_fault,
-	.page_mkwrite	= scoutfs_data_page_mkwrite,
-#ifdef KC_MM_REMAP_PAGES
-	.remap_pages	= generic_file_remap_pages,
-#endif
-};
-
-static int scoutfs_file_mmap(struct file *file, struct vm_area_struct *vma)
-{
-	file_accessed(file);
-	vma->vm_ops = &scoutfs_data_file_vm_ops;
-	return 0;
-}
-
 const struct address_space_operations scoutfs_file_aops = {
 #ifdef KC_MPAGE_READ_FOLIO
 	.dirty_folio		= block_dirty_folio,
@@ -2219,7 +1945,6 @@ const struct file_operations scoutfs_file_fops = {
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= iter_file_splice_write,
 #endif
-	.mmap		= scoutfs_file_mmap,
 	.unlocked_ioctl	= scoutfs_ioctl,
 	.fsync		= scoutfs_file_fsync,
 	.llseek		= scoutfs_file_llseek,
@@ -11,13 +11,11 @@
 * General Public License for more details.
 */
 #include <linux/kernel.h>
-#include <linux/stddef.h>
 #include <linux/fs.h>
 #include <linux/slab.h>
 #include <linux/uio.h>
 #include <linux/xattr.h>
 #include <linux/namei.h>
-#include <linux/mm.h>

 #include "format.h"
 #include "file.h"
@@ -436,15 +434,6 @@ out:
 		return d_splice_alias(inode, dentry);
 }

-/*
- * Helper to make iterating through dirent ptrs aligned
- */
-static inline struct scoutfs_dirent *next_aligned_dirent(struct scoutfs_dirent *dent, u8 len)
-{
-	return (void *)dent +
-		ALIGN(offsetof(struct scoutfs_dirent, name[len]), __alignof__(struct scoutfs_dirent));
-}
-
 /*
 * readdir simply iterates over the dirent items for the dir inode and
 * uses their offset as the readdir position.
@@ -452,112 +441,76 @@ static inline struct scoutfs_dirent *next_aligned_dirent(struct scoutfs_dirent *
 * It will need to be careful not to read past the region of the dirent
 * hash offset keys that it has access to.
 */
-static int scoutfs_readdir(struct file *file, struct dir_context *ctx)
+static int KC_DECLARE_READDIR(scoutfs_readdir, struct file *file,
+			      void *dirent, kc_readdir_ctx_t ctx)
 {
 	struct inode *inode = file_inode(file);
 	struct super_block *sb = inode->i_sb;
 	struct scoutfs_lock *dir_lock = NULL;
 	struct scoutfs_dirent *dent = NULL;
-/* we'll store name_len in dent->__pad[0] */
-#define hacky_name_len __pad[0]
 	struct scoutfs_key last_key;
 	struct scoutfs_key key;
-	struct page *page = NULL;
 	int name_len;
 	u64 pos;
-	int entries = 0;
 	int ret;
-	int complete = 0;
-	struct scoutfs_dirent *end;

-	if (!dir_emit_dots(file, ctx))
+	if (!kc_dir_emit_dots(file, dirent, ctx))
 		return 0;

-	page = alloc_page(GFP_KERNEL);
-	if (!page)
+	dent = alloc_dirent(SCOUTFS_NAME_LEN);
+	if (!dent) {
 		return -ENOMEM;
-
-	end = page_address(page) + PAGE_SIZE;
+	}

 	init_dirent_key(&last_key, SCOUTFS_READDIR_TYPE, scoutfs_ino(inode),
 			SCOUTFS_DIRENT_LAST_POS, 0);

-	/*
-	 * lock and fetch dirent items, until the page no longer fits
-	 * a max size dirent (288b). Then unlock and dir_emit the ones
-	 * we stored in the page.
-	 */
+	ret = scoutfs_lock_inode(sb, SCOUTFS_LOCK_READ, 0, inode, &dir_lock);
+	if (ret)
+		goto out;
+
 	for (;;) {
-		/* lock */
-		ret = scoutfs_lock_inode(sb, SCOUTFS_LOCK_READ, 0, inode, &dir_lock);
-		if (ret)
-			break;
+		init_dirent_key(&key, SCOUTFS_READDIR_TYPE, scoutfs_ino(inode),
+				kc_readdir_pos(file, ctx), 0);

-		dent = page_address(page);
-		pos = ctx->pos;
-		while (next_aligned_dirent(dent, SCOUTFS_NAME_LEN) < end) {
-			init_dirent_key(&key, SCOUTFS_READDIR_TYPE, scoutfs_ino(inode),
-					pos, 0);
-
-			ret = scoutfs_item_next(sb, &key, &last_key, dent,
-						dirent_bytes(SCOUTFS_NAME_LEN),
-						dir_lock);
-			if (ret < 0) {
-				if (ret == -ENOENT) {
-					ret = 0;
-					complete = 1;
-				}
-				break;
-			}
-
-			name_len = ret - sizeof(struct scoutfs_dirent);
-			dent->hacky_name_len = name_len;
-			if (name_len < 1 || name_len > SCOUTFS_NAME_LEN) {
-				scoutfs_corruption(sb, SC_DIRENT_READDIR_NAME_LEN,
-						   corrupt_dirent_readdir_name_len,
-						   "dir_ino %llu pos %llu key "SK_FMT" len %d",
-						   scoutfs_ino(inode),
-						   pos,
-						   SK_ARG(&key), name_len);
-				ret = -EIO;
-				break;
-			}
-
-			pos = le64_to_cpu(dent->pos) + 1;
-
-			dent = next_aligned_dirent(dent, name_len);
-			entries++;
-		}
-
-		/* unlock */
-		scoutfs_unlock(sb, dir_lock, SCOUTFS_LOCK_READ);
-
-		if (ret < 0)
-			break;
-
-		dent = page_address(page);
-		for (; entries > 0; entries--) {
-			ctx->pos = le64_to_cpu(dent->pos);
-			if (!dir_emit(ctx, dent->name, dent->hacky_name_len,
-					le64_to_cpu(dent->ino),
-					dentry_type(dent->type))) {
+		ret = scoutfs_item_next(sb, &key, &last_key, dent,
+					dirent_bytes(SCOUTFS_NAME_LEN),
+					dir_lock);
+		if (ret < 0) {
+			if (ret == -ENOENT)
 				ret = 0;
-				goto out;
-			}
-
-			dent = next_aligned_dirent(dent, dent->hacky_name_len);
-
-			/* always advance ctx->pos past */
-			ctx->pos++;
+			break;
 		}

-		if (complete)
+		name_len = ret - sizeof(struct scoutfs_dirent);
+		if (name_len < 1 || name_len > SCOUTFS_NAME_LEN) {
+			scoutfs_corruption(sb, SC_DIRENT_READDIR_NAME_LEN,
+					   corrupt_dirent_readdir_name_len,
+					   "dir_ino %llu pos %llu key "SK_FMT" len %d",
+					   scoutfs_ino(inode),
+					   kc_readdir_pos(file, ctx),
+					   SK_ARG(&key), name_len);
+			ret = -EIO;
+			goto out;
+		}
+
+		pos = le64_to_cpu(key.skd_major);
+		kc_readdir_pos(file, ctx) = pos;
+
+		if (!kc_dir_emit(ctx, dirent, dent->name, name_len, pos,
+				le64_to_cpu(dent->ino),
+				dentry_type(dent->type))) {
+			ret = 0;
 			break;
+		}
+
+		kc_readdir_pos(file, ctx) = pos + 1;
 	}

 out:
-	if (page)
-		__free_page(page);
+	scoutfs_unlock(sb, dir_lock, SCOUTFS_LOCK_READ);
+
+	kfree(dent);
 	return ret;
 }

@@ -1812,7 +1765,7 @@ retry:
 	}
 	old_inode->i_ctime = now;
 	if (new_inode)
-		new_inode->i_ctime = now;
+		old_inode->i_ctime = now;

 	inode_inc_iversion(old_dir);
 	inode_inc_iversion(old_inode);
@@ -2020,7 +1973,7 @@ const struct inode_operations scoutfs_symlink_iops = {
 };

 const struct file_operations scoutfs_dir_fops = {
-	.iterate	= scoutfs_readdir,
+	.KC_FOP_READDIR	= scoutfs_readdir,
 #ifdef KC_FMODE_KABI_ITERATE
 	.open		= scoutfs_dir_open,
 #endif
@@ -58,23 +58,25 @@
 * key space after we find no items in a given lock region.  This is
 * relatively cheap because reading is going to check the segments
 * anyway.
+ *
+ * This is copying to userspace while holding a read lock.  This is safe
+ * because faulting can send a request for a write lock while the read
+ * lock is being used.  The cluster locks don't block tasks in a node;
+ * they match and the tasks fall back to local locking, which in this
+ * case is the spin locks around the item cache.
 */
 static long scoutfs_ioc_walk_inodes(struct file *file, unsigned long arg)
 {
 	struct super_block *sb = file_inode(file)->i_sb;
 	struct scoutfs_ioctl_walk_inodes __user *uwalk = (void __user *)arg;
 	struct scoutfs_ioctl_walk_inodes walk;
-	struct scoutfs_ioctl_walk_inodes_entry *ent = NULL;
-	struct scoutfs_ioctl_walk_inodes_entry *end;
+	struct scoutfs_ioctl_walk_inodes_entry ent;
 	struct scoutfs_key next_key;
 	struct scoutfs_key last_key;
 	struct scoutfs_key key;
 	struct scoutfs_lock *lock;
-	struct page *page = NULL;
 	u64 last_seq;
-	u64 entries = 0;
 	int ret = 0;
-	int complete = 0;
 	u32 nr = 0;
 	u8 type;

@@ -105,10 +107,6 @@ static long scoutfs_ioc_walk_inodes(struct file *file, unsigned long arg)
 		}
 	}

-	page = alloc_page(GFP_KERNEL);
-	if (!page)
-		return -ENOMEM;
-
 	scoutfs_inode_init_index_key(&key, type, walk.first.major,
 				     walk.first.minor, walk.first.ino);
 	scoutfs_inode_init_index_key(&last_key, type, walk.last.major,
@@ -117,107 +115,77 @@ static long scoutfs_ioc_walk_inodes(struct file *file, unsigned long arg)
 	/* cap nr to the max the ioctl can return to a compat task */
 	walk.nr_entries = min_t(u64, walk.nr_entries, INT_MAX);

-	end = page_address(page) + PAGE_SIZE;
+	ret = scoutfs_lock_inode_index(sb, SCOUTFS_LOCK_READ, type,
+				       walk.first.major, walk.first.ino,
+				       &lock);
+	if (ret < 0)
+		goto out;

-	/* outer loop */
-	for (nr = 0;;) {
-		ent = page_address(page);
-		/* make sure _pad and minor are zeroed */
-		memset(ent, 0, PAGE_SIZE);
+	for (nr = 0; nr < walk.nr_entries; ) {

-		ret = scoutfs_lock_inode_index(sb, SCOUTFS_LOCK_READ, type,
-					       le64_to_cpu(key.skii_major),
-					       le64_to_cpu(key.skii_ino),
-					       &lock);
-		if (ret)
+		ret = scoutfs_item_next(sb, &key, &last_key, NULL, 0, lock);
+		if (ret < 0 && ret != -ENOENT)
 			break;

-		/* inner loop 1 */
-		while (ent + 1 < end) {
-			ret = scoutfs_item_next(sb, &key, &last_key, NULL, 0, lock);
-			if (ret < 0 && ret != -ENOENT)
+		if (ret == -ENOENT) {
+
+			/* done if lock covers last iteration key */
+			if (scoutfs_key_compare(&last_key, &lock->end) <= 0) {
+				ret = 0;
 				break;
-
-			if (ret == -ENOENT) {
-				/* done if lock covers last iteration key */
-				if (scoutfs_key_compare(&last_key, &lock->end) <= 0) {
-					ret = 0;
-					complete = 1;
-					break;
-				}
-
-				/* continue iterating after locked empty region */
-				key = lock->end;
-				scoutfs_key_inc(&key);
-
-				scoutfs_unlock(sb, lock, SCOUTFS_LOCK_READ);
-				/* avoid double-unlocking here after break */
-				lock = NULL;
-
-				ret = scoutfs_forest_next_hint(sb, &key, &next_key);
-				if (ret < 0 && ret != -ENOENT)
-					break;
-
-				if (ret == -ENOENT ||
-				    scoutfs_key_compare(&next_key, &last_key) > 0) {
-					ret = 0;
-					complete = 1;
-					break;
-				}
-
-				key = next_key;
-
-				ret = scoutfs_lock_inode_index(sb, SCOUTFS_LOCK_READ,
-							type,
-							le64_to_cpu(key.skii_major),
-							le64_to_cpu(key.skii_ino),
-							&lock);
-				if (ret)
-					break;
-
-				continue;
 			}

-			ent->major = le64_to_cpu(key.skii_major);
-			ent->ino = le64_to_cpu(key.skii_ino);
-
+			/* continue iterating after locked empty region */
+			key = lock->end;
 			scoutfs_key_inc(&key);

-			ent++;
-			entries++;
+			scoutfs_unlock(sb, lock, SCOUTFS_LOCK_READ);

-			if (nr + entries >= walk.nr_entries) {
-				complete = 1;
-				break;
-			}
-		}
+			ret = scoutfs_forest_next_hint(sb, &key, &next_key);
+			if (ret < 0 && ret != -ENOENT)
+				goto out;

-		scoutfs_unlock(sb, lock, SCOUTFS_LOCK_READ);
-		if (ret < 0)
-			break;
-
-		/* inner loop 2 */
-		ent = page_address(page);
-		for (; entries > 0; entries--) {
-			if (copy_to_user((void __user *)walk.entries_ptr, ent,
-					 sizeof(struct scoutfs_ioctl_walk_inodes_entry))) {
-				ret = -EFAULT;
+			if (ret == -ENOENT ||
+			    scoutfs_key_compare(&next_key, &last_key) > 0) {
+				ret = 0;
 				goto out;
 			}
-			walk.entries_ptr += sizeof(struct scoutfs_ioctl_walk_inodes_entry);
-			ent++;
-			nr++;
+
+			key = next_key;
+
+			ret = scoutfs_lock_inode_index(sb, SCOUTFS_LOCK_READ,
+						key.sk_type,
+						le64_to_cpu(key.skii_major),
+						le64_to_cpu(key.skii_ino),
+						&lock);
+			if (ret < 0)
+				goto out;
+
+			continue;
 		}

-		if (complete)
+		ent.major = le64_to_cpu(key.skii_major);
+		ent.minor = 0;
+		ent.ino = le64_to_cpu(key.skii_ino);
+
+		if (copy_to_user((void __user *)walk.entries_ptr, &ent,
+				 sizeof(ent))) {
+			ret = -EFAULT;
 			break;
+		}
+
+		nr++;
+		walk.entries_ptr += sizeof(ent);
+
+		scoutfs_key_inc(&key);
 	}

+	scoutfs_unlock(sb, lock, SCOUTFS_LOCK_READ);
+
 out:
-	if (page)
-		__free_page(page);
 	if (nr > 0)
 		ret = nr;
+
 	return ret;
 }

@@ -556,9 +524,7 @@ static long scoutfs_ioc_stage(struct file *file, unsigned long arg)
 	}

 	si->staging = true;
-#ifdef KC_CURRENT_BACKING_DEV_INFO
 	current->backing_dev_info = inode_to_bdi(inode);
-#endif

 	pos = args.offset;
 	written = 0;
@@ -571,9 +537,7 @@ static long scoutfs_ioc_stage(struct file *file, unsigned long arg)
 	} while (ret > 0 && written < args.length);

 	si->staging = false;
-#ifdef KC_CURRENT_BACKING_DEV_INFO
 	current->backing_dev_info = NULL;
-#endif
 out:
 	scoutfs_per_task_del(&si->pt_data_lock, &pt_ent);
 	scoutfs_unlock(sb, lock, SCOUTFS_LOCK_WRITE);
@@ -1195,15 +1159,11 @@ static long scoutfs_ioc_get_allocated_inos(struct file *file, unsigned long arg)
 	struct scoutfs_lock *lock = NULL;
 	struct scoutfs_key key;
 	struct scoutfs_key end;
-	struct page *page = NULL;
 	u64 __user *uinos;
 	u64 bytes;
-	u64 *ino;
-	u64 *ino_end;
-	int entries = 0;
+	u64 ino;
 	int nr;
 	int ret;
-	int complete = 0;

 	if (!(file->f_mode & FMODE_READ)) {
 		ret = -EBADF;
@@ -1225,83 +1185,47 @@ static long scoutfs_ioc_get_allocated_inos(struct file *file, unsigned long arg)
 		goto out;
 	}

-	page = alloc_page(GFP_KERNEL);
-	if (!page) {
-		ret = -ENOMEM;
-		goto out;
-	}
-	ino_end = page_address(page) + PAGE_SIZE;
-
 	scoutfs_inode_init_key(&key, gai.start_ino);
 	scoutfs_inode_init_key(&end, gai.start_ino | SCOUTFS_LOCK_INODE_GROUP_MASK);
 	uinos = (void __user *)gai.inos_ptr;
 	bytes = gai.inos_bytes;
 	nr = 0;

-	for (;;) {
+	ret = scoutfs_lock_ino(sb, SCOUTFS_LOCK_READ, 0, gai.start_ino, &lock);
+	if (ret < 0)
+		goto out;

-		ret = scoutfs_lock_ino(sb, SCOUTFS_LOCK_READ, 0, gai.start_ino, &lock);
-		if (ret < 0)
-			goto out;
+	while (bytes >= sizeof(*uinos)) {

-		ino = page_address(page);
-		while (ino < ino_end) {
-
-			ret = scoutfs_item_next(sb, &key, &end, NULL, 0, lock);
-			if (ret < 0) {
-				if (ret == -ENOENT) {
-					ret = 0;
-					complete = 1;
-				}
-				break;
-			}
-
-			if (key.sk_zone != SCOUTFS_FS_ZONE) {
+		ret = scoutfs_item_next(sb, &key, &end, NULL, 0, lock);
+		if (ret < 0) {
+			if (ret == -ENOENT)
 				ret = 0;
-				complete = 1;
-				break;
-			}
-
-			/* all fs items are owned by allocated inodes, and _first is always ino */
-			*ino = le64_to_cpu(key._sk_first);
-			scoutfs_inode_init_key(&key, *ino + 1);
-
-			ino++;
-			entries++;
-			nr++;
-
-			bytes -= sizeof(*uinos);
-			if (bytes < sizeof(*uinos)) {
-				complete = 1;
-				break;
-			}
-
-			if (nr == INT_MAX) {
-				complete = 1;
-				break;
-			}
+			break;
 		}

-		scoutfs_unlock(sb, lock, SCOUTFS_LOCK_READ);
-
-		if (ret < 0)
+		if (key.sk_zone != SCOUTFS_FS_ZONE) {
+			ret = 0;
 			break;
+		}

-		ino = page_address(page);
-		if (copy_to_user(uinos, ino, entries * sizeof(*uinos))) {
+		/* all fs items are owned by allocated inodes, and _first is always ino */
+		ino = le64_to_cpu(key._sk_first);
+		if (put_user(ino, uinos)) {
 			ret = -EFAULT;
-			goto out;
+			break;
 		}

-		uinos += entries;
-		entries = 0;
-
-		if (complete)
+		uinos++;
+		bytes -= sizeof(*uinos);
+		if (++nr == INT_MAX)
 			break;
+
+		scoutfs_inode_init_key(&key, ino + 1);
 	}
+
+	scoutfs_unlock(sb, lock, SCOUTFS_LOCK_READ);
 out:
-	if (page)
-		__free_page(page);
 	return ret ?: nr;
 }

@@ -29,6 +29,50 @@ do {						\
 })
 #endif

+#ifndef KC_ITERATE_DIR_CONTEXT
+typedef filldir_t kc_readdir_ctx_t;
+#define KC_DECLARE_READDIR(name, file, dirent, ctx) name(file, dirent, ctx)
+#define KC_FOP_READDIR readdir
+#define kc_readdir_pos(filp, ctx) (filp)->f_pos
+#define kc_dir_emit_dots(file, dirent, ctx) dir_emit_dots(file, dirent, ctx)
+#define kc_dir_emit(ctx, dirent, name, name_len, pos, ino, dt) \
+	(ctx(dirent, name, name_len, pos, ino, dt) == 0)
+#else
+typedef struct dir_context * kc_readdir_ctx_t;
+#define KC_DECLARE_READDIR(name, file, dirent, ctx) name(file, ctx)
+#define KC_FOP_READDIR iterate
+#define kc_readdir_pos(filp, ctx) (ctx)->pos
+#define kc_dir_emit_dots(file, dirent, ctx) dir_emit_dots(file, ctx)
+#define kc_dir_emit(ctx, dirent, name, name_len, pos, ino, dt) \
+	dir_emit(ctx, name, name_len, ino, dt)
+#endif
+
+#ifndef KC_DIR_EMIT_DOTS
+/*
+ * Kernels before ->iterate don't have dir_emit_dots so we give them
+ * one that works with the ->readdir() filldir() method.
+ */
+static inline int dir_emit_dots(struct file *file, void *dirent,
+				filldir_t filldir)
+{
+	if (file->f_pos == 0) {
+		if (filldir(dirent, ".", 1, 1,
+			    file->f_path.dentry->d_inode->i_ino, DT_DIR))
+			return 0;
+		file->f_pos = 1;
+	}
+
+	if (file->f_pos == 1) {
+		if (filldir(dirent, "..", 2, 1,
+			    parent_ino(file->f_path.dentry), DT_DIR))
+			return 0;
+		file->f_pos = 2;
+	}
+
+	return 1;
+}
+#endif
+
 #ifdef KC_POSIX_ACL_VALID_USER_NS
 #define kc_posix_acl_valid(user_ns, acl) posix_acl_valid(user_ns, acl)
 #else
@@ -394,20 +438,4 @@ static inline int kc_tcp_sock_set_nodelay(struct socket *sock)
 }
 #endif

-#ifdef KC_INODE_DIO_END
-#define kc_inode_dio_end inode_dio_end
-#else
-#define kc_inode_dio_end inode_dio_done
-#endif
-
-#ifndef KC_MM_VM_FAULT_T
-typedef unsigned int vm_fault_t;
-static inline vm_fault_t vmf_error(int err)
-{
-	if (err == -ENOMEM)
-		return VM_FAULT_OOM;
-	return VM_FAULT_SIGBUS;
-}
-#endif
-
 #endif
@@ -302,7 +302,6 @@ static void lock_inc_count(unsigned int *counts, enum scoutfs_lock_mode mode)
 static void lock_dec_count(unsigned int *counts, enum scoutfs_lock_mode mode)
 {
 	BUG_ON(mode < 0 || mode >= SCOUTFS_LOCK_NR_MODES);
-	BUG_ON(counts[mode] == 0);
 	counts[mode]--;
 }

@@ -202,48 +202,21 @@ static u8 invalidation_mode(u8 granted, u8 requested)

 /*
 * Return true of the client lock instances described by the entries can
- * be granted at the same time.  There's only three cases where this is
- * true.
- *
- * First, the two locks are both of the same mode that allows full
- * sharing -- read and write only.  The only point of these modes is
- * that everyone can share them.
- *
- * Second, a write lock gives the client permission to read as well.
- * This means that a client can upgrade its read lock to a write lock
- * without having to invalidate the existing read and drop caches.
- *
- * Third, null locks are always compatible between clients.  It's as
- * though the client with the null lock has no lock at all.  But it's
- * never compatible with all locks on the client requesting null.
- * Sending invalidations for existing locks on a client when we get a
- * null request is how we resolve races in shrinking locks -- we turn it
- * into the unsolicited remote invalidation case.
- *
- * All other mode and client combinations can not be shared, most
- * typically a write lock invalidating all other non-write holders to
- * drop caches and force a read after the write has completed.
+ * be granted at the same time.  Typically this only means they're both
+ * modes that are compatible between nodes. In addition there's the
+ * special case where a read lock on a client is compatible with a write
+ * lock on the same client because the client's cache covered by the
+ * read lock is still valid if they get a write lock.
 */
 static bool client_entries_compatible(struct client_lock_entry *granted,
 				      struct client_lock_entry *requested)
 {
-	/* only read and write_only can be full shared */
-	if ((granted->mode == requested->mode) &&
-	    (granted->mode == SCOUTFS_LOCK_READ || granted->mode == SCOUTFS_LOCK_WRITE_ONLY))
-		return true;
-
-	/* _write includes reading, so a client can upgrade its read to write */
-	if (granted->rid == requested->rid &&
-	    granted->mode == SCOUTFS_LOCK_READ &&
-	    requested->mode == SCOUTFS_LOCK_WRITE)
-		return true;
-
-	/* null is always compatible across clients, never within a client */
-	if ((granted->rid != requested->rid) &&
-	    (granted->mode == SCOUTFS_LOCK_NULL || requested->mode == SCOUTFS_LOCK_NULL))
-		return true;
-
-	return false;
+	return (granted->mode == requested->mode &&
+		(granted->mode == SCOUTFS_LOCK_READ ||
+		 granted->mode == SCOUTFS_LOCK_WRITE_ONLY)) ||
+	       (granted->rid == requested->rid &&
+		granted->mode == SCOUTFS_LOCK_READ &&
+		requested->mode == SCOUTFS_LOCK_WRITE);
 }

 /*
@@ -344,18 +317,16 @@ static void put_server_lock(struct lock_server_info *inf,

 	BUG_ON(!mutex_is_locked(&snode->mutex));

-	spin_lock(&inf->lock);
-
 	if (atomic_dec_and_test(&snode->refcount) &&
 	    list_empty(&snode->granted) &&
 	    list_empty(&snode->requested) &&
 	    list_empty(&snode->invalidated)) {
+		spin_lock(&inf->lock);
 		rb_erase(&snode->node, &inf->locks_root);
+		spin_unlock(&inf->lock);
 		should_free = true;
 	}

-	spin_unlock(&inf->lock);
-
 	mutex_unlock(&snode->mutex);

 	if (should_free) {
@@ -502,12 +502,12 @@ static void scoutfs_net_proc_worker(struct work_struct *work)
 * Free live responses up to and including the seq by marking them dead
 * and moving them to the send queue to be freed.
 */
-static bool move_acked_responses(struct scoutfs_net_connection *conn,
-				 struct list_head *list, u64 seq)
+static int move_acked_responses(struct scoutfs_net_connection *conn,
+				struct list_head *list, u64 seq)
 {
 	struct message_send *msend;
 	struct message_send *tmp;
-	bool moved = false;
+	int ret = 0;

 	assert_spin_locked(&conn->lock);

@@ -519,20 +519,20 @@ static bool move_acked_responses(struct scoutfs_net_connection *conn,

 		msend->dead = 1;
 		list_move(&msend->head, &conn->send_queue);
-		moved = true;
+		ret = 1;
 	}

-	return moved;
+	return ret;
 }

 /* acks are processed inline in the recv worker */
 static void free_acked_responses(struct scoutfs_net_connection *conn, u64 seq)
 {
-	bool moved;
+	int moved;

 	spin_lock(&conn->lock);

-	moved = move_acked_responses(conn, &conn->send_queue, seq) |
+	moved = move_acked_responses(conn, &conn->send_queue, seq) +
 		move_acked_responses(conn, &conn->resend_queue, seq);

 	spin_unlock(&conn->lock);
@@ -286,52 +286,6 @@ TRACE_EVENT(scoutfs_data_alloc_block_enter,
 		  STE_ENTRY_ARGS(ext))
 );

-TRACE_EVENT(scoutfs_data_page_mkwrite,
-	TP_PROTO(struct super_block *sb, __u64 ino, __u64 pos, __u32 ret),
-
-	TP_ARGS(sb, ino, pos, ret),
-
-	TP_STRUCT__entry(
-		SCSB_TRACE_FIELDS
-		__field(__u64, ino)
-		__field(__u64, pos)
-		__field(__u32, ret)
-	),
-
-	TP_fast_assign(
-		SCSB_TRACE_ASSIGN(sb);
-		__entry->ino = ino;
-		__entry->pos = pos;
-		__entry->ret = ret;
-	),
-
-	TP_printk(SCSBF" ino %llu pos %llu ret %u ",
-		  SCSB_TRACE_ARGS, __entry->ino, __entry->pos, __entry->ret)
-);
-
-TRACE_EVENT(scoutfs_data_filemap_fault,
-	TP_PROTO(struct super_block *sb, __u64 ino, __u64 pos, __u32 ret),
-
-	TP_ARGS(sb, ino, pos, ret),
-
-	TP_STRUCT__entry(
-		SCSB_TRACE_FIELDS
-		__field(__u64, ino)
-		__field(__u64, pos)
-		__field(__u32, ret)
-	),
-
-	TP_fast_assign(
-		SCSB_TRACE_ASSIGN(sb);
-		__entry->ino = ino;
-		__entry->pos = pos;
-		__entry->ret = ret;
-	),
-
-	TP_printk(SCSBF" ino %llu pos %llu ret %u ",
-		  SCSB_TRACE_ARGS, __entry->ino, __entry->pos, __entry->ret)
-);
-
 DECLARE_EVENT_CLASS(scoutfs_data_file_extent_class,
 	TP_PROTO(struct super_block *sb, __u64 ino, struct scoutfs_extent *ext),

@@ -1092,12 +1046,9 @@ DECLARE_EVENT_CLASS(scoutfs_lock_class,
 		sk_trace_define(start)
 		sk_trace_define(end)
 		__field(u64, refresh_gen)
-		__field(u64, write_seq)
-		__field(u64, dirty_trans_seq)
 		__field(unsigned char, request_pending)
 		__field(unsigned char, invalidate_pending)
 		__field(int, mode)
-		__field(int, invalidating_mode)
 		__field(unsigned int, waiters_cw)
 		__field(unsigned int, waiters_pr)
 		__field(unsigned int, waiters_ex)
@@ -1110,12 +1061,9 @@ DECLARE_EVENT_CLASS(scoutfs_lock_class,
 		sk_trace_assign(start, &lck->start);
 		sk_trace_assign(end, &lck->end);
 		__entry->refresh_gen = lck->refresh_gen;
-		__entry->write_seq = lck->write_seq;
-		__entry->dirty_trans_seq = lck->dirty_trans_seq;
 		__entry->request_pending = lck->request_pending;
 		__entry->invalidate_pending = lck->invalidate_pending;
 		__entry->mode = lck->mode;
-		__entry->invalidating_mode = lck->invalidating_mode;
 		__entry->waiters_pr = lck->waiters[SCOUTFS_LOCK_READ];
 		__entry->waiters_ex = lck->waiters[SCOUTFS_LOCK_WRITE];
 		__entry->waiters_cw = lck->waiters[SCOUTFS_LOCK_WRITE_ONLY];
@@ -1123,11 +1071,10 @@ DECLARE_EVENT_CLASS(scoutfs_lock_class,
 		__entry->users_ex = lck->users[SCOUTFS_LOCK_WRITE];
 		__entry->users_cw = lck->users[SCOUTFS_LOCK_WRITE_ONLY];
        ),
-        TP_printk(SCSBF" start "SK_FMT" end "SK_FMT" mode %u invmd %u reqp %u invp %u refg %llu wris %llu dts %llu waiters: pr %u ex %u cw %u users: pr %u ex %u cw %u",
+        TP_printk(SCSBF" start "SK_FMT" end "SK_FMT" mode %u reqpnd %u invpnd %u rfrgen %llu waiters: pr %u ex %u cw %u users: pr %u ex %u cw %u",
 		  SCSB_TRACE_ARGS, sk_trace_args(start), sk_trace_args(end),
-		  __entry->mode, __entry->invalidating_mode, __entry->request_pending,
-		  __entry->invalidate_pending, __entry->refresh_gen, __entry->write_seq,
-		  __entry->dirty_trans_seq,
+		  __entry->mode, __entry->request_pending,
+		  __entry->invalidate_pending, __entry->refresh_gen,
 		  __entry->waiters_pr, __entry->waiters_ex, __entry->waiters_cw,
 		  __entry->users_pr, __entry->users_ex, __entry->users_cw)
 );
@@ -1299,10 +1299,12 @@ static int finalize_and_start_log_merge(struct super_block *sb, struct scoutfs_l
 * is nested inside holding commits so we recheck the persistent item
 * each time we commit to make sure it's still what we think.   The
 * caller is still going to send the item to the client so we update the
- * caller's each time we make progress.  If we hit an error applying the
- * changes we make then we can't send the log_trees to the client.
+ * caller's each time we make progress.  This is a best-effort attempt
+ * to clean up and it's valid to leave extents in data_freed, so we
+ * don't return errors to the caller.  The client will continue the work
+ * later in get_log_trees or as the rid is reclaimed.
 */
-static int try_drain_data_freed(struct super_block *sb, struct scoutfs_log_trees *lt)
+static void try_drain_data_freed(struct super_block *sb, struct scoutfs_log_trees *lt)
 {
 	DECLARE_SERVER_INFO(sb, server);
 	struct scoutfs_super_block *super = DIRTY_SUPER_SB(sb);
@@ -1311,7 +1313,6 @@ static int try_drain_data_freed(struct super_block *sb, struct scoutfs_log_trees
 	struct scoutfs_log_trees drain;
 	struct scoutfs_key key;
 	COMMIT_HOLD(hold);
-	bool apply = false;
 	int ret = 0;
 	int err;

@@ -1320,27 +1321,22 @@ static int try_drain_data_freed(struct super_block *sb, struct scoutfs_log_trees
 	while (lt->data_freed.total_len != 0) {
 		server_hold_commit(sb, &hold);
 		mutex_lock(&server->logs_mutex);
-		apply = true;

 		ret = find_log_trees_item(sb, &super->logs_root, false, rid, U64_MAX, &drain);
-		if (ret < 0) {
-			ret = 0;
+		if (ret < 0)
 			break;
-		}

 		/* careful to only keep draining the caller's specific open trans */
 		if (drain.nr != lt->nr || drain.get_trans_seq != lt->get_trans_seq ||
 		    drain.commit_trans_seq != lt->commit_trans_seq || drain.flags != lt->flags) {
-			ret = 0;
+			ret = -ENOENT;
 			break;
 		}

 		ret = scoutfs_btree_dirty(sb, &server->alloc, &server->wri,
 					  &super->logs_root, &key);
-		if (ret < 0) {
-			ret = 0;
+		if (ret < 0)
 			break;
-		}

 		/* moving can modify and return errors, always update caller and item */
 		mutex_lock(&server->alloc_mutex);
@@ -1356,19 +1352,19 @@ static int try_drain_data_freed(struct super_block *sb, struct scoutfs_log_trees
 		BUG_ON(err < 0); /* dirtying must guarantee success */

 		mutex_unlock(&server->logs_mutex);
+
 		ret = server_apply_commit(sb, &hold, ret);
-		apply = false;
-
-		if (ret < 0)
+		if (ret < 0) {
+			ret = 0; /* don't try to abort, ignoring ret */
 			break;
+		}
 	}

-	if (apply) {
+	/* try to cleanly abort and write any partial dirty btree blocks, but ignore result */
+	if (ret < 0) {
 		mutex_unlock(&server->logs_mutex);
-		server_apply_commit(sb, &hold, ret);
+		server_apply_commit(sb, &hold, 0);
 	}
-
-	return ret;
 }

 /*
@@ -1576,9 +1572,9 @@ out:
 		scoutfs_err(sb, "error %d getting log trees for rid %016llx: %s",
 			    ret, rid, err_str);

-	/* try to drain excessive data_freed with additional commits, if needed */
+	/* try to drain excessive data_freed with additional commits, if needed, ignoring err */
 	if (ret == 0)
-		ret = try_drain_data_freed(sb, &lt);
+		try_drain_data_freed(sb, &lt);

 	return scoutfs_net_response(sb, conn, cmd, id, ret, &lt, sizeof(lt));
 }
@@ -4153,7 +4149,7 @@ static void fence_pending_recov_worker(struct work_struct *work)
 	struct server_info *server = container_of(work, struct server_info,
 						  fence_pending_recov_work);
 	struct super_block *sb = server->sb;
-	union scoutfs_inet_addr addr = {{0,}};
+	union scoutfs_inet_addr addr;
 	u64 rid = 0;
 	int ret = 0;

@@ -160,17 +160,11 @@ static void scoutfs_metadev_close(struct super_block *sb)
 		 * from kill_sb->put_super.
 		 */
 		lockdep_off();
-
-#ifdef KC_BDEV_FILE_OPEN_BY_PATH
-		bdev_fput(sbi->meta_bdev_file);
-#else
 #ifdef KC_BLKDEV_PUT_HOLDER_ARG
 		blkdev_put(sbi->meta_bdev, sb);
 #else
 		blkdev_put(sbi->meta_bdev, SCOUTFS_META_BDEV_MODE);
 #endif
-#endif
-
 		lockdep_on();
 		sbi->meta_bdev = NULL;
 	}
@@ -487,11 +481,7 @@ out:
 static int scoutfs_fill_super(struct super_block *sb, void *data, int silent)
 {
 	struct scoutfs_mount_options opts;
-#ifdef KC_BDEV_FILE_OPEN_BY_PATH
-	struct file *meta_bdev_file;
-#else
 	struct block_device *meta_bdev;
-#endif
 	struct scoutfs_sb_info *sbi;
 	struct inode *inode;
 	int ret;
@@ -537,22 +527,6 @@ static int scoutfs_fill_super(struct super_block *sb, void *data, int silent)
 		goto out;
 	}

-#ifdef KC_BDEV_FILE_OPEN_BY_PATH
-	/*
-	 * pass sbi as holder, since dev_mount already passes sb, which triggers a
-	 * WARN_ON because dev_mount also passes non-NULL hops. By passing sbi
-	 * here we just get a simple error in our test cases.
-	 */
-	meta_bdev_file = bdev_file_open_by_path(opts.metadev_path, SCOUTFS_META_BDEV_MODE, sbi, NULL);
-	if (IS_ERR(meta_bdev_file)) {
-		scoutfs_err(sb, "could not open metadev: error %ld",
-			    PTR_ERR(meta_bdev_file));
-		ret = PTR_ERR(meta_bdev_file);
-		goto out;
-	}
-	sbi->meta_bdev_file = meta_bdev_file;
-	sbi->meta_bdev = file_bdev(meta_bdev_file);
-#else
 #ifdef KC_BLKDEV_PUT_HOLDER_ARG
 	meta_bdev = blkdev_get_by_path(opts.metadev_path, SCOUTFS_META_BDEV_MODE, sb, NULL);
 #else
@@ -565,8 +539,6 @@ static int scoutfs_fill_super(struct super_block *sb, void *data, int silent)
 		goto out;
 	}
 	sbi->meta_bdev = meta_bdev;
-#endif
-
 	ret = set_blocksize(sbi->meta_bdev, SCOUTFS_BLOCK_SM_SIZE);
 	if (ret != 0) {
 		scoutfs_err(sb, "failed to set metadev blocksize, returned %d",
@@ -42,9 +42,6 @@ struct scoutfs_sb_info {
 	u64 fmt_vers;

 	struct block_device *meta_bdev;
-#ifdef KC_BDEV_FILE_OPEN_BY_PATH
-	struct file *meta_bdev_file;
-#endif

 	spinlock_t next_ino_lock;

@@ -159,58 +159,6 @@ static bool drained_holders(struct trans_info *tri)
 	return holders == 0;
 }

-static int commit_current_log_trees(struct super_block *sb, char **str)
-{
-	DECLARE_TRANS_INFO(sb, tri);
-
-	return (*str = "data submit", scoutfs_inode_walk_writeback(sb, true)) ?:
-	       (*str = "item dirty", scoutfs_item_write_dirty(sb))  ?:
-	       (*str = "data prepare", scoutfs_data_prepare_commit(sb))  ?:
-	       (*str = "alloc prepare", scoutfs_alloc_prepare_commit(sb, &tri->alloc, &tri->wri)) ?:
-	       (*str = "meta write", scoutfs_block_writer_write(sb, &tri->wri))  ?:
-	       (*str = "data wait", scoutfs_inode_walk_writeback(sb, false)) ?:
-	       (*str = "commit log trees", commit_btrees(sb)) ?:
-	       scoutfs_item_write_done(sb);
-}
-
-static int get_next_log_trees(struct super_block *sb, char **str)
-{
-	return (*str = "get log trees", scoutfs_trans_get_log_trees(sb));
-}
-
-static int retry_forever(struct super_block *sb, int (*func)(struct super_block *sb, char **str))
-{
-	bool retrying = false;
-	char *str;
-	int ret;
-
-	do {
-		str = NULL;
-
-		ret = func(sb, &str);
-		if (ret < 0) {
-			if (!retrying) {
-				scoutfs_warn(sb, "critical transaction commit failure: %s = %d, retrying",
-					    str, ret);
-				retrying = true;
-			}
-
-			if (scoutfs_forcing_unmount(sb)) {
-				ret = -EIO;
-				break;
-			}
-
-			msleep(2 * MSEC_PER_SEC);
-
-		} else if (retrying) {
-			scoutfs_info(sb, "retried transaction commit succeeded");
-		}
-
-	} while (ret < 0);
-
-	return ret;
-}
-
 /*
 * This work func is responsible for writing out all the dirty blocks
 * that make up the current dirty transaction.  It prevents writers from
@@ -236,6 +184,8 @@ void scoutfs_trans_write_func(struct work_struct *work)
 	struct trans_info *tri = container_of(work, struct trans_info, write_work.work);
 	struct super_block *sb = tri->sb;
 	struct scoutfs_sb_info *sbi = SCOUTFS_SB(sb);
+	bool retrying = false;
+	char *s = NULL;
 	int ret = 0;

 	tri->task = current;
@@ -264,9 +214,37 @@ void scoutfs_trans_write_func(struct work_struct *work)

 	scoutfs_inc_counter(sb, trans_commit_written);

-	/* retry {commit,get}_log_trees until they succeeed, can only fail when forcing unmount */
-	ret = retry_forever(sb, commit_current_log_trees) ?:
-	      retry_forever(sb, get_next_log_trees);
+	do {
+		ret = (s = "data submit", scoutfs_inode_walk_writeback(sb, true)) ?:
+		      (s = "item dirty", scoutfs_item_write_dirty(sb))  ?:
+		      (s = "data prepare", scoutfs_data_prepare_commit(sb))  ?:
+		      (s = "alloc prepare", scoutfs_alloc_prepare_commit(sb, &tri->alloc,
+									 &tri->wri))  ?:
+		      (s = "meta write", scoutfs_block_writer_write(sb, &tri->wri))  ?:
+		      (s = "data wait", scoutfs_inode_walk_writeback(sb, false)) ?:
+		      (s = "commit log trees", commit_btrees(sb)) ?:
+		      scoutfs_item_write_done(sb) ?:
+		      (s = "get log trees", scoutfs_trans_get_log_trees(sb));
+		if (ret < 0) {
+			if (!retrying) {
+				scoutfs_warn(sb, "critical transaction commit failure: %s = %d, retrying",
+					    s, ret);
+				retrying = true;
+			}
+
+			if (scoutfs_forcing_unmount(sb)) {
+				ret = -EIO;
+				break;
+			}
+
+			msleep(2 * MSEC_PER_SEC);
+
+		} else if (retrying) {
+			scoutfs_info(sb, "retried transaction commit succeeded");
+		}
+
+	} while (ret < 0);
+
 out:
 	spin_lock(&tri->write_lock);
 	tri->write_count++;
@@ -0,0 +1,244 @@
+package main
+
+import (
+	"flag"
+	"fmt"
+	"log"
+	"os"
+	"path/filepath"
+	"sync"
+	"syscall"
+
+	"restore/pkg/restore"
+)
+
+type options struct {
+	metaPath   string
+	sourceDir  string
+	numWorkers int
+}
+
+// hardlinkTracker keeps track of inodes we've already processed
+type hardlinkTracker struct {
+	sync.Mutex
+	seen map[uint64]bool
+}
+
+func newHardlinkTracker() *hardlinkTracker {
+	return &hardlinkTracker{
+		seen: make(map[uint64]bool),
+	}
+}
+
+func (h *hardlinkTracker) isNewInode(ino uint64, nlink bool) bool {
+	if !nlink {
+		return true
+	}
+
+	h.Lock()
+	defer h.Unlock()
+
+	if _, exists := h.seen[ino]; exists {
+		return false
+	}
+
+	h.seen[ino] = true
+	return true
+}
+
+// getFileInfo extracts file information from os.FileInfo
+func getFileInfo(info os.FileInfo) restore.FileInfo {
+	stat := info.Sys().(*syscall.Stat_t)
+
+	// Use the file's actual inode number from stat
+	ino := uint64(stat.Ino)
+
+	return restore.FileInfo{
+		Ino:       ino,
+		Mode:      uint32(stat.Mode),
+		Uid:       uint32(stat.Uid),
+		Gid:       uint32(stat.Gid),
+		Size:      uint64(stat.Size),
+		Rdev:      uint64(stat.Rdev),
+		AtimeSec:  stat.Atim.Sec,
+		AtimeNsec: stat.Atim.Nsec,
+		MtimeSec:  stat.Mtim.Sec,
+		MtimeNsec: stat.Mtim.Nsec,
+		CtimeSec:  stat.Ctim.Sec,
+		CtimeNsec: stat.Ctim.Nsec,
+		IsDir:     info.IsDir(),
+		IsRegular: stat.Mode&syscall.S_IFMT == syscall.S_IFREG,
+	}
+}
+
+// getXAttrs gets extended attributes for a file/directory
+func getXAttrs(path string) ([]restore.XAttr, error) {
+	size, err := syscall.Listxattr(path, nil)
+	if err != nil || size == 0 {
+		return nil, err
+	}
+
+	buf := make([]byte, size)
+	size, err = syscall.Listxattr(path, buf)
+	if err != nil {
+		return nil, err
+	}
+
+	var xattrs []restore.XAttr
+	start := 0
+	for i := 0; i < size; i++ {
+		if buf[i] == 0 {
+			name := string(buf[start:i])
+			start = i + 1 // advance past this name even if it is skipped below
+			value, err := syscall.Getxattr(path, name, nil)
+			if err != nil {
+				continue
+			}
+
+			valueBuf := make([]byte, value)
+			_, err = syscall.Getxattr(path, name, valueBuf)
+			if err != nil {
+				continue
+			}
+
+			xattrs = append(xattrs, restore.XAttr{
+				Name:  name,
+				Value: valueBuf,
+			})
+		}
+	}
+
+	return xattrs, nil
+}
+
+func restorePath(writer *restore.WorkerWriter, hlTracker *hardlinkTracker, path string, parentIno uint64) error {
+	entries, err := os.ReadDir(path)
+	if err != nil {
+		return fmt.Errorf("failed to read directory: %v", err)
+	}
+	log.Printf("Restoring path: %s", path)
+	var subdirs int
+	var nameBytes int
+
+	for pos, entry := range entries {
+		if entry.Name() == "." || entry.Name() == ".." {
+			continue
+		}
+
+		info, err := entry.Info()
+		if err != nil {
+			return fmt.Errorf("failed to get entry info: %v", err)
+		}
+
+		stat, ok := info.Sys().(*syscall.Stat_t)
+		if !ok {
+			return fmt.Errorf("failed to get stat_t")
+		}
+		nameBytes += len(entry.Name())
+		fullPath := filepath.Join(path, entry.Name())
+
+		// Recurse into directories
+		if info.IsDir() {
+			subdirs++
+
+			if err := restorePath(writer, hlTracker, fullPath, uint64(stat.Ino)); err != nil {
+				return err
+			}
+
+		}
+
+		err = writer.CreateEntry(parentIno, uint64(pos), uint64(stat.Ino), uint32(info.Mode()), entry.Name())
+		if err != nil {
+			return fmt.Errorf("failed to create entry: %v", err)
+		}
+
+		// Handle inode
+		isHardlink := stat.Nlink > 1
+		if !info.IsDir() && hlTracker.isNewInode(uint64(stat.Ino), isHardlink) {
+			fileInfo := getFileInfo(info)
+			err = writer.CreateInode(fileInfo)
+			if err != nil {
+				return fmt.Errorf("failed to create inode: %v", err)
+			}
+
+			// Handle xattrs
+			xattrs, err := getXAttrs(fullPath)
+			if err == nil {
+				for pos, xattr := range xattrs {
+					err = writer.CreateXAttr(uint64(stat.Ino), uint64(pos), xattr)
+					if err != nil {
+						return fmt.Errorf("failed to create xattr: %v", err)
+					}
+				}
+			}
+		}
+	}
+	// Get directory info
+	dirInfo, err := os.Stat(path)
+	if err != nil {
+		return fmt.Errorf("failed to stat directory: %v", err)
+	}
+
+	// Create directory inode
+	dirFileInfo := getFileInfo(dirInfo)
+	dirFileInfo.NrSubdirs = uint64(subdirs)
+	dirFileInfo.NameBytes = uint64(nameBytes)
+
+	return writer.CreateInode(dirFileInfo)
+}
+
+func main() {
+	opts := options{}
+	flag.StringVar(&opts.metaPath, "m", "", "path to metadata device")
+	flag.StringVar(&opts.sourceDir, "s", "", "path to source directory")
+	flag.IntVar(&opts.numWorkers, "w", 4, "number of worker threads")
+	flag.Parse()
+
+	if opts.metaPath == "" || opts.sourceDir == "" {
+		flag.Usage()
+		os.Exit(1)
+	}
+
+	// Create master and worker writers
+	master, workers, err := restore.NewWriters(opts.metaPath, opts.numWorkers)
+	if err != nil {
+		fmt.Fprintf(os.Stderr, "Failed to create writers: %v\n", err)
+		os.Exit(1)
+	}
+	defer master.Destroy()
+
+	// Create hardlink tracker
+	hlTracker := newHardlinkTracker()
+
+	// Start workers
+	var wg sync.WaitGroup
+	for i, worker := range workers {
+		wg.Add(1)
+		go func(w *restore.WorkerWriter, workerNum int) {
+			defer wg.Done()
+
+			// Each worker processes a subset of the directory tree
+			if err := restorePath(w, hlTracker, opts.sourceDir, 1); err != nil {
+				fmt.Fprintf(os.Stderr, "Worker %d failed: %v\n", workerNum, err)
+				os.Exit(1)
+			}
+			// Create root inode for source directory
+			rootInfo, err := os.Stat(opts.sourceDir)
+			if err != nil {
+				fmt.Fprintf(os.Stderr, "Failed to stat source directory: %v\n", err)
+				os.Exit(1)
+			}
+			w.CreateInode(getFileInfo(rootInfo))
+			err = w.Destroy()
+			if err != nil {
+				fmt.Fprintf(os.Stderr, "Failed to destroy worker: %v\n", err)
+				os.Exit(1)
+			}
+		}(worker, i)
+	}
+
+	// Wait for all workers to complete
+	wg.Wait()
+
+	fmt.Println("Restore completed successfully")
+}
@@ -0,0 +1,3 @@
+module restore
+
+go 1.21.11
@@ -0,0 +1,472 @@
+package restore
+
+/*
+#cgo CFLAGS: -I${SRCDIR}/../../../utils/src -I${SRCDIR}/../../../kmod/src
+#cgo LDFLAGS: -L${SRCDIR}/../../../utils/src -l:scoutfs_parallel_restore.a -lm
+
+#include <stdlib.h>
+#include <linux/types.h>
+#include <stdbool.h>
+#include <math.h>
+#include "sparse.h"
+#include "util.h"
+#include "format.h"
+#include "parallel_restore.h"
+
+// If there are any type conflicts, you might need to add:
+// #include "kernel_types.h"
+*/
+import "C"
+import (
+    "errors"
+    "fmt"
+    "sync"
+    "syscall"
+    "unsafe"
+)
+
+const batchSize = 1000
+const bufSize = 2 * 1024 * 1024
+
+type WorkerWriter struct {
+    writer      *C.struct_scoutfs_parallel_restore_writer
+    progressCh  chan *ScoutfsParallelWriterProgress
+    fileCreated int64
+    devFd       int
+    buf         unsafe.Pointer
+    wg          *sync.WaitGroup
+}
+
+type MasterWriter struct {
+    writer     *C.struct_scoutfs_parallel_restore_writer
+    progressCh chan *ScoutfsParallelWriterProgress
+    workers    []*WorkerWriter
+    wg         sync.WaitGroup
+    slice      *C.struct_scoutfs_parallel_restore_slice // array of per-writer slices
+    progressWg sync.WaitGroup
+    devFd      int
+    super      *C.struct_scoutfs_super_block
+}
+
+type ScoutfsParallelWriterProgress struct {
+    Progress *C.struct_scoutfs_parallel_restore_progress
+    Slice    *C.struct_scoutfs_parallel_restore_slice
+}
+
+func (m *MasterWriter) aggregateProgress() {
+    defer m.progressWg.Done()
+    for progress := range m.progressCh {
+        ret := C.scoutfs_parallel_restore_add_progress(m.writer, progress.Progress)
+        if ret != 0 {
+            // Handle error appropriately, e.g., log it
+            fmt.Printf("Failed to add progress, error code: %d\n", ret)
+        }
+        if progress.Slice != nil {
+            ret = C.scoutfs_parallel_restore_add_slice(m.writer, progress.Slice)
+            C.free(unsafe.Pointer(progress.Slice))
+            if ret != 0 {
+                // Handle error appropriately, e.g., log it
+                fmt.Printf("Failed to add slice, error code: %d\n", ret)
+            }
+        }
+        // Free the C-allocated progress structures
+        C.free(unsafe.Pointer(progress.Progress))
+    }
+}
+
+func (m *MasterWriter) Destroy() {
+    m.wg.Wait()
+    close(m.progressCh)
+    m.progressWg.Wait()
+
+    if m.slice != nil {
+        C.free(unsafe.Pointer(m.slice)) // Free the slice array
+    }
+    if m.super != nil {
+        C.free(unsafe.Pointer(m.super)) // Free the superblock buffer
+    }
+    if m.devFd != 0 {
+        syscall.Close(m.devFd)
+    }
+    // Destroy master writer
+    C.scoutfs_parallel_restore_destroy_writer(&m.writer)
+}
+
+func NewWriters(path string, numWriters int) (*MasterWriter, []*WorkerWriter, error) {
+    if numWriters <= 1 {
+        return nil, nil, errors.New("number of writers must be at least 2")
+    }
+
+    devFd, err := syscall.Open(path, syscall.O_DIRECT|syscall.O_RDWR|syscall.O_EXCL, 0)
+    if err != nil {
+        return nil, nil, fmt.Errorf("failed to open metadata device '%s': %v", path, err)
+    }
+
+    var masterWriter MasterWriter
+    masterWriter.progressCh = make(chan *ScoutfsParallelWriterProgress, numWriters*2)
+    masterWriter.workers = make([]*WorkerWriter, 0, numWriters-1)
+    masterWriter.devFd = devFd
+
+    var ret C.int
+    // Allocate aligned memory for superblock
+    var super unsafe.Pointer
+    ret = C.posix_memalign(&super, 4096, C.SCOUTFS_BLOCK_SM_SIZE)
+    if ret != 0 {
+        masterWriter.Destroy()
+        return nil, nil, fmt.Errorf("failed to allocate aligned memory for superblock: %d", ret)
+    }
+    masterWriter.super = (*C.struct_scoutfs_super_block)(super)
+
+    // Read the superblock from devFd
+    superOffset := C.SCOUTFS_SUPER_BLKNO << C.SCOUTFS_BLOCK_SM_SHIFT
+    count, err := syscall.Pread(devFd, (*[1 << 30]byte)(super)[:C.SCOUTFS_BLOCK_SM_SIZE], int64(superOffset))
+    if err != nil {
+        masterWriter.Destroy()
+        return nil, nil, fmt.Errorf("failed to read superblock: %v", err)
+    }
+    if count != int(C.SCOUTFS_BLOCK_SM_SIZE) {
+        masterWriter.Destroy()
+        return nil, nil, fmt.Errorf("failed to read superblock, bytes read: %d", count)
+    }
+
+    // Check if the superblock is valid.
+    if C.le64_to_cpu(masterWriter.super.flags)&C.SCOUTFS_FLAG_IS_META_BDEV == 0 {
+        masterWriter.Destroy()
+        return nil, nil, errors.New("superblock is not a metadata device")
+    }
+
+    // Create master writer
+    ret = C.scoutfs_parallel_restore_create_writer(&masterWriter.writer)
+    if ret != 0 {
+        masterWriter.Destroy()
+        return nil, nil, errors.New("failed to create master writer")
+    }
+
+    ret = C.scoutfs_parallel_restore_import_super(masterWriter.writer, masterWriter.super, C.int(devFd))
+    if ret != 0 {
+        masterWriter.Destroy()
+        return nil, nil, fmt.Errorf("failed to import superblock, error code: %d", ret)
+    }
+
+    // Initialize slices for each worker
+    masterWriter.slice = (*C.struct_scoutfs_parallel_restore_slice)(C.malloc(C.size_t(numWriters) *
+        C.size_t(unsafe.Sizeof(C.struct_scoutfs_parallel_restore_slice{}))))
+    if masterWriter.slice == nil {
+        masterWriter.Destroy()
+        return nil, nil, errors.New("failed to allocate slices")
+    }
+
+    ret = C.scoutfs_parallel_restore_init_slices(masterWriter.writer,
+        masterWriter.slice,
+        C.int(numWriters))
+    if ret != 0 {
+        masterWriter.Destroy()
+        return nil, nil, errors.New("failed to initialize slices")
+    }
+
+    ret = C.scoutfs_parallel_restore_add_slice(masterWriter.writer, masterWriter.slice)
+    if ret != 0 {
+        masterWriter.Destroy()
+        return nil, nil, errors.New("failed to add slice to master writer")
+    }
+
+    // Create worker writers
+    for i := 1; i < numWriters; i++ {
+        var bufPtr unsafe.Pointer
+        if ret := C.posix_memalign(&bufPtr, 4096, bufSize); ret != 0 {
+            masterWriter.Destroy()
+            return nil, nil, fmt.Errorf("failed to allocate aligned worker buffer: %d", ret)
+        }
+
+        worker := &WorkerWriter{
+            progressCh: masterWriter.progressCh,
+            buf:        bufPtr,
+            wg:         &masterWriter.wg,
+        }
+        ret = C.scoutfs_parallel_restore_create_writer(&worker.writer)
+        if ret != 0 {
+            masterWriter.Destroy()
+            return nil, nil, errors.New("failed to create worker writer")
+        }
+
+        masterWriter.wg.Add(1)
+
+        // Use each slice for the corresponding worker
+        slice := (*C.struct_scoutfs_parallel_restore_slice)(unsafe.Pointer(uintptr(unsafe.Pointer(masterWriter.slice)) +
+            uintptr(i)*unsafe.Sizeof(C.struct_scoutfs_parallel_restore_slice{})))
+        ret = C.scoutfs_parallel_restore_add_slice(worker.writer, slice)
+        if ret != 0 {
+            C.scoutfs_parallel_restore_destroy_writer(&worker.writer)
+            masterWriter.Destroy()
+            return nil, nil, errors.New("failed to add slice to worker writer")
+        }
+
+        masterWriter.workers = append(masterWriter.workers, worker)
+    }
+    masterWriter.progressWg.Add(1)
+    go masterWriter.aggregateProgress()
+
+    return &masterWriter, masterWriter.workers, nil
+
+}
+
+func (w *WorkerWriter) getProgress(withSlice bool) (*ScoutfsParallelWriterProgress, error) {
+    progress := (*C.struct_scoutfs_parallel_restore_progress)(
+        C.malloc(C.size_t(unsafe.Sizeof(C.struct_scoutfs_parallel_restore_progress{}))),
+    )
+    if progress == nil {
+        return nil, errors.New("failed to allocate memory for progress")
+    }
+
+    // Fetch the current progress from the C library
+    ret := C.scoutfs_parallel_restore_get_progress(w.writer, progress)
+    if ret != 0 {
+        C.free(unsafe.Pointer(progress))
+        return nil, fmt.Errorf("failed to get progress, error code: %d", ret)
+    }
+
+    var slice *C.struct_scoutfs_parallel_restore_slice
+    if withSlice {
+        slice = (*C.struct_scoutfs_parallel_restore_slice)(
+            C.malloc(C.size_t(unsafe.Sizeof(C.struct_scoutfs_parallel_restore_slice{}))),
+        )
+        if slice == nil {
+            C.free(unsafe.Pointer(progress))
+            return nil, errors.New("failed to allocate memory for slice")
+        }
+
+        // Optionally fetch the slice information
+        ret = C.scoutfs_parallel_restore_get_slice(w.writer, slice)
+        if ret != 0 {
+            C.free(unsafe.Pointer(progress))
+            C.free(unsafe.Pointer(slice))
+            return nil, fmt.Errorf("failed to get slice, error code: %d", ret)
+        }
+    }
+
+    return &ScoutfsParallelWriterProgress{
+        Progress: progress,
+        Slice:    slice,
+    }, nil
+}
+
+// writeBuffer writes data from the buffer to the device file descriptor.
+// It uses scoutfs_parallel_restore_write_buf to get data and pwrite to write it.
+func (w *WorkerWriter) writeBuffer() (int64, error) {
+    var totalWritten int64
+    var count int64
+    var off int64
+    var ret C.int
+
+    // View off and count through C pointer types so the library can fill them in
+    offPtr := (*C.off_t)(unsafe.Pointer(&off))
+    countPtr := (*C.size_t)(unsafe.Pointer(&count))
+
+    for {
+        ret = C.scoutfs_parallel_restore_write_buf(w.writer, w.buf,
+            C.size_t(bufSize), offPtr, countPtr)
+
+        if ret != 0 {
+            return totalWritten, fmt.Errorf("failed to write buffer: error code %d", ret)
+        }
+
+        if count > 0 {
+            n, err := syscall.Pwrite(w.devFd, unsafe.Slice((*byte)(w.buf), count), off)
+            if err != nil {
+                return totalWritten, fmt.Errorf("pwrite failed: %v", err)
+            }
+            if n != int(count) {
+                return totalWritten, fmt.Errorf("pwrite wrote %d bytes; expected %d", n, count)
+            }
+            totalWritten += int64(n)
+        }
+
+        if count == 0 {
+            break
+        }
+    }
+
+    return totalWritten, nil
+}
+
+func (w *WorkerWriter) InsertEntry(entry *C.struct_scoutfs_parallel_restore_entry) error {
+    // Add the entry using the C library
+    ret := C.scoutfs_parallel_restore_add_entry(w.writer, entry)
+    if ret != 0 {
+        return fmt.Errorf("failed to add entry, error code: %d", ret)
+    }
+
+    // Increment the fileCreated counter
+    w.fileCreated++
+    if w.fileCreated >= batchSize {
+        _, err := w.writeBuffer()
+        if err != nil {
+            return fmt.Errorf("error writing buffers: %v", err)
+        }
+        // Fetch the current progress only; slice information is not requested here
+        progress, err := w.getProgress(false)
+        if err != nil {
+            return err
+        }
+        // Send the progress update to the shared progress channel
+        w.progressCh <- progress
+        // Reset the fileCreated counter
+        w.fileCreated = 0
+    }
+
+    return nil
+}
+
+func (w *WorkerWriter) InsertXattr(xattr *C.struct_scoutfs_parallel_restore_xattr) error {
+    ret := C.scoutfs_parallel_restore_add_xattr(w.writer, xattr)
+    if ret != 0 {
+        return fmt.Errorf("failed to add xattr, error code: %d", ret)
+    }
+    return nil
+}
+
+func (w *WorkerWriter) InsertInode(inode *C.struct_scoutfs_parallel_restore_inode) error {
+    ret := C.scoutfs_parallel_restore_add_inode(w.writer, inode)
+    if ret != 0 {
+        return fmt.Errorf("failed to add inode, error code: %d", ret)
+    }
+    return nil
+}
+
+// Destroy flushes pending data, frees the buffer, and destroys the C writer. Call it exactly once.
+func (w *WorkerWriter) Destroy() error {
+    defer w.wg.Done()
+    // Send final progress if there are remaining entries
+    if w.fileCreated > 0 {
+        _, err := w.writeBuffer()
+        if err != nil {
+            return err
+        }
+        progress, err := w.getProgress(true)
+        if err != nil {
+            return err
+        }
+        w.progressCh <- progress
+        w.fileCreated = 0
+    }
+
+    if w.buf != nil {
+        C.free(w.buf)
+        w.buf = nil
+    }
+
+    C.scoutfs_parallel_restore_destroy_writer(&w.writer)
+    return nil
+}
+
+// Go-side descriptions of restore items, converted to C structures by the Create* helpers below.
+
+type FileInfo struct {
+    Ino       uint64
+    Mode      uint32
+    Uid       uint32
+    Gid       uint32
+    Size      uint64
+    Rdev      uint64
+    AtimeSec  int64
+    AtimeNsec int64
+    MtimeSec  int64
+    MtimeNsec int64
+    CtimeSec  int64
+    CtimeNsec int64
+    NrSubdirs uint64
+    NameBytes uint64
+    IsDir     bool
+    IsRegular bool
+}
+
+type XAttr struct {
+    Name  string
+    Value []byte
+}
+
+// CreateInode builds a zeroed C inode from info and adds it to the writer.
+func (w *WorkerWriter) CreateInode(info FileInfo) error {
+    inode := (*C.struct_scoutfs_parallel_restore_inode)(C.calloc(1, C.size_t(unsafe.Sizeof(C.struct_scoutfs_parallel_restore_inode{}))))
+    if inode == nil {
+        return fmt.Errorf("failed to allocate inode")
+    }
+    defer C.free(unsafe.Pointer(inode))
+
+    inode.ino = C.__u64(info.Ino)
+    inode.mode = C.__u32(info.Mode)
+    inode.uid = C.__u32(info.Uid)
+    inode.gid = C.__u32(info.Gid)
+    inode.size = C.__u64(info.Size)
+    inode.rdev = C.uint(info.Rdev)
+
+    inode.atime.tv_sec = C.__time_t(info.AtimeSec)
+    inode.atime.tv_nsec = C.long(info.AtimeNsec)
+    inode.mtime.tv_sec = C.__time_t(info.MtimeSec)
+    inode.mtime.tv_nsec = C.long(info.MtimeNsec)
+    inode.ctime.tv_sec = C.__time_t(info.CtimeSec)
+    inode.ctime.tv_nsec = C.long(info.CtimeNsec)
+    inode.crtime = inode.ctime
+
+    if info.IsRegular && info.Size > 0 {
+        inode.offline = C.bool(true)
+    }
+
+    if info.IsDir {
+        inode.nr_subdirs = C.__u64(info.NrSubdirs)
+        inode.total_entry_name_bytes = C.__u64(info.NameBytes)
+    }
+
+    return w.InsertInode(inode)
+}
+
+// CreateEntry builds a zeroed C directory entry and adds it to the writer.
+func (w *WorkerWriter) CreateEntry(dirIno uint64, pos uint64, ino uint64, mode uint32, name string) error {
+    entryC := (*C.struct_scoutfs_parallel_restore_entry)(C.calloc(1, C.size_t(unsafe.Sizeof(C.struct_scoutfs_parallel_restore_entry{}))+C.size_t(len(name))))
+
+    if entryC == nil {
+        return fmt.Errorf("failed to allocate entry")
+    }
+    defer C.free(unsafe.Pointer(entryC))
+
+    entryC.dir_ino = C.__u64(dirIno)
+    entryC.pos = C.__u64(pos)
+    entryC.ino = C.__u64(ino)
+    entryC.mode = C.__u32(mode)
+    entryC.name_len = C.uint(len(name))
+
+    entryC.name = (*C.char)(C.malloc(C.size_t(len(name))))
+    if entryC.name == nil {
+        return fmt.Errorf("failed to allocate entry name")
+    }
+    defer C.free(unsafe.Pointer(entryC.name))
+    copy(unsafe.Slice((*byte)(unsafe.Pointer(entryC.name)), len(name)), name)
+
+    return w.InsertEntry(entryC)
+}
+
+// CreateXAttr builds a zeroed C extended attribute and adds it to the writer.
+func (w *WorkerWriter) CreateXAttr(ino uint64, pos uint64, xattr XAttr) error {
+    xattrC := (*C.struct_scoutfs_parallel_restore_xattr)(C.calloc(1, C.size_t(unsafe.Sizeof(C.struct_scoutfs_parallel_restore_xattr{}))+C.size_t(len(xattr.Name))+C.size_t(len(xattr.Value))))
+    if xattrC == nil {
+        return fmt.Errorf("failed to allocate xattr")
+    }
+    defer C.free(unsafe.Pointer(xattrC))
+
+    xattrC.ino = C.__u64(ino)
+    xattrC.pos = C.__u64(pos)
+    xattrC.name_len = C.uint(len(xattr.Name))
+    xattrC.value_len = C.__u32(len(xattr.Value))
+
+    xattrC.name = (*C.char)(C.malloc(C.size_t(len(xattr.Name))))
+    if xattrC.name == nil {
+        return fmt.Errorf("failed to allocate xattr name")
+    }
+    defer C.free(unsafe.Pointer(xattrC.name))
+
+    copy(unsafe.Slice((*byte)(unsafe.Pointer(xattrC.name)), len(xattr.Name)), xattr.Name)
+    xattrC.value = C.CBytes(xattr.Value) // C-side copy: safe for an empty Value, and no Go pointer is stored in C memory
+    defer C.free(xattrC.value)
+
+    return w.InsertXattr(xattrC)
+}
@@ -0,0 +1,10 @@
+package restore
+
+import "testing"
+
+func TestNewWriters(t *testing.T) {
+	_, _, err := NewWriters("/tmp", 2)
+	if err != nil {
+		t.Fatalf("failed to create master writer: %v", err)
+	}
+}
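Review note: TestNewWriters only covers construction. The drain loop in writeBuffer (fill the buffer, write it at the reported offset, stop when the count is zero) could be exercised without a device if the buffer source were abstracted. The sketch below assumes that refactoring; `fillFunc`, `drain`, and `memDev` are hypothetical names, and `fillFunc` merely stands in for `scoutfs_parallel_restore_write_buf`.

```go
package main

import (
	"fmt"
	"io"
)

// fillFunc fills buf and reports the device offset and byte count for the
// chunk. A count of zero means the source is drained. Hypothetical type.
type fillFunc func(buf []byte) (off int64, count int, err error)

// drain fills buf repeatedly and writes each chunk at its offset until the
// source reports no more data. It returns the total bytes written.
func drain(dst io.WriterAt, buf []byte, fill fillFunc) (int64, error) {
	var total int64
	for {
		off, count, err := fill(buf)
		if err != nil {
			return total, fmt.Errorf("fill failed: %w", err)
		}
		if count == 0 {
			return total, nil
		}
		n, err := dst.WriteAt(buf[:count], off)
		total += int64(n)
		if err != nil {
			return total, fmt.Errorf("write at offset %d failed: %w", off, err)
		}
	}
}

// memDev is an in-memory io.WriterAt standing in for the device fd.
type memDev struct{ data []byte }

func (m *memDev) WriteAt(p []byte, off int64) (int, error) {
	if off < 0 || off+int64(len(p)) > int64(len(m.data)) {
		return 0, io.ErrShortWrite
	}
	return copy(m.data[off:], p), nil
}

func main() {
	// Two chunks delivered out of offset order, as a restore writer might.
	chunks := []struct {
		off  int64
		data string
	}{{4, "meta"}, {0, "head"}}
	next := 0
	fill := func(buf []byte) (int64, int, error) {
		if next == len(chunks) {
			return 0, 0, nil
		}
		c := chunks[next]
		next++
		return c.off, copy(buf, c.data), nil
	}

	dev := &memDev{data: make([]byte, 8)}
	total, err := drain(dev, make([]byte, 16), fill)
	fmt.Println(total, string(dev.data), err)
}
```

Taking an `io.WriterAt` keeps the loop independent of `syscall.Pwrite`, so a test can substitute an in-memory device; `*os.File` satisfies the same interface for real use.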
@@ -10,5 +10,3 @@ src/stage_tmpfile
 src/create_xattr_loop
 src/o_tmpfile_umask
 src/o_tmpfile_linkat
-src/mmap_stress
-src/mmap_validate
@@ -14,8 +14,8 @@ BIN := src/createmany			\
 	src/fragmented_data_extents	\
 	src/o_tmpfile_umask		\
 	src/o_tmpfile_linkat		\
-	src/mmap_stress			\
-	src/mmap_validate
+	src/parallel_restore		\
+	src/restore_copy

 DEPS := $(wildcard src/*.d)

@@ -25,10 +25,12 @@ ifneq ($(DEPS),)
 -include $(DEPS)
 endif

-src/mmap_stress: LIBS+=-lpthread
+src/parallel_restore_cflags := ../utils/src/scoutfs_parallel_restore.a -lm
+src/restore_copy_cflags := ../utils/src/scoutfs_parallel_restore.a -lm

 $(BIN): %: %.c Makefile
-	gcc $(CFLAGS) -MD -MP -MF $*.d $< -o $@ $(LIBS)
+	gcc $(CFLAGS) -MD -MP -MF $*.d $< -o $@ $($(@)_cflags)
+

 .PHONY: clean
 clean:
@@ -80,15 +80,3 @@ t_compare_output()
 {
 	"$@" >&7 2>&1
 }
-
-#
-# usually bash prints an annoying output message when jobs
-# are killed.  We can avoid that by redirecting stderr for
-# the bash process when it reaps the jobs that are killed.
-#
-t_silent_kill() {
-	exec {ERR}>&2 2>/dev/null
-	kill "$@"
-	wait "$@"
-	exec 2>&$ERR {ERR}>&-
-}
@@ -160,9 +160,6 @@ t_filter_dmesg()
 	re="$re|Pipe handler or fully qualified core dump path required.*"
 	re="$re|Set kernel.core_pattern before fs.suid_dumpable.*"

-	# perf warning that it adjusted sample rate
-	re="$re|perf: interrupt took too long.*lowering kernel.perf_event_max_sample_rate.*"
-
 	egrep -v "($re)" | \
 		ignore_harmless_unwind_kasan_stack_oob
 }
@@ -1,88 +0,0 @@
-
-#
-# Generate TAP format test results
-#
-
-t_tap_header()
-{
-	local runid=$1
-	local sequence=( $(echo $tests) )
-	local count=${#sequence[@]}
-
-	# avoid recreating the same TAP result over again - harness sets this
-	[[ -z "$runid" ]] && runid="*test*"
-
-	cat > $T_RESULTS/scoutfs.tap <<TAPEOF
-TAP version 14
-1..${count}
-#
-# TAP results for run ${runid}
-#
-# host/run info:
-#
-#   hostname: ${HOSTNAME}
-#   test start time: $(date --utc)
-#   uname -r: $(uname -r)
-#   scoutfs commit id: $(git describe --tags)
-#
-# sequence for this run:
-#
-TAPEOF
-
-	# Sequence
-	for t in ${tests}; do
-		 echo ${t/.sh/}
-	done | cat -n | expand | column -c 120 | expand | sed 's/^ /#/' >> $T_RESULTS/scoutfs.tap
-	echo "#" >> $T_RESULTS/scoutfs.tap
-}
-
-t_tap_progress()
-{
-(
-	local i=$(( testcount + 1 ))
-	local testname=$1
-	local result=$2
-
-	local diff=""
-	local dmsg=""
-
-	if [[ -s "$T_RESULTS/tmp/${testname}/dmesg.new" ]]; then
-		dmsg="1"
-	fi
-
-	if ! cmp -s golden/${testname} $T_RESULTS/output/${testname}; then
-		diff="1"
-	fi
-
-	if [[ "${result}" == "100" ]] && [[ -z "${dmsg}" ]] && [[ -z "${diff}" ]]; then
-		echo "ok ${i} - ${testname}"
-	elif [[ "${result}" == "103" ]]; then
-		echo "ok ${i} - ${testname}"
-		echo "# ${testname} ** skipped - permitted **"
-	else
-		echo "not ok ${i} - ${testname}"
-		case ${result} in
-		101)
-			echo "# ${testname} ** skipped **"
-			;;
-		102)
-			echo "# ${testname} ** failed **"
-			;;
-		esac
-
-		if [[ -n "${diff}" ]]; then
-			echo "#"
-			echo "# diff:"
-			echo "#"
-			diff -u golden/${testname} $T_RESULTS/output/${testname} | expand | sed 's/^/#   /'
-		fi
-
-		if [[ -n "${dmsg}" ]]; then
-			echo "#"
-			echo "# dmesg:"
-			echo "#"
-			cat "$T_RESULTS/tmp/${testname}/dmesg.new" | sed 's/^/#   /'
-		fi
-	fi
-) >> $T_RESULTS/scoutfs.tap
-}
@@ -1,2 +0,0 @@
-=== setup
-=== spin reading and shrinking
@@ -1,27 +0,0 @@
-== mmap_stress
-thread 0 complete
-thread 1 complete
-thread 2 complete
-thread 3 complete
-thread 4 complete
-== basic mmap/read/write consistency checks
-== mmap read from offline extent
-0: offset: 0 length: 2 flags: O.L
-extents: 1
-1
-00000200:  ea ea ea ea ea ea ea ea ea ea ea ea ea ea ea ea  ................
-0
-0: offset: 0 length: 2 flags: ..L
-extents: 1
-== mmap write to an offline extent
-0: offset: 0 length: 2 flags: O.L
-extents: 1
-1
-0
-0: offset: 0 length: 2 flags: ..L
-extents: 1
-00000000  ea ea ea ea ea ea ea ea  ea ea ea ea ea ea ea ea  |................|
-00000010  11 11 11 11 11 11 11 11  11 11 11 11 11 11 11 11  |................|
-00000020  ea ea ea ea ea ea ea ea  ea ea ea ea ea ea ea ea  |................|
-00000030
-== done
@@ -0,0 +1,28 @@
+== simple mkfs/restore/mount
+committed_seq     1120
+total_meta_blocks 163840
+total_data_blocks 15728640
+   1440    1440   57120
+     80      80     400
+0: offset: 0 length: 1 flags: O.L
+extents: 1
+0: offset: 0 length: 1 flags: O.L
+extents: 1
+0: offset: 0 length: 1 flags: O.L
+extents: 1
+0: offset: 0 length: 1 flags: O.L
+extents: 1
+    Type  Size     Total   Used      Free  Use%  
+MetaData  64KB    163840  34722    129118    21  
+    Data   4KB  15728640     64  15728576     0  
+  7 13,L,- 15,L,- 17,L,- I 33 -
+== just under ENOSPC
+    Type  Size     Total    Used      Free  Use%  
+MetaData  64KB    163840  155666      8174    95  
+    Data   4KB  15728640      64  15728576     0  
+== just over ENOSPC
+== ENOSPC
+== attempt to restore data device
+== attempt format_v1 restore
+== test if previously mounted
+== cleanup
@@ -1,97 +0,0 @@
-== create content
-== readdir all
-00000000: d_off: 0x00000001 d_reclen: 0x18 d_type: DT_DIR d_name: .
-00000001: d_off: 0x00000002 d_reclen: 0x18 d_type: DT_DIR d_name: ..
-00000002: d_off: 0x00000003 d_reclen: 0x18 d_type: DT_REG d_name: a
-00000003: d_off: 0x00000004 d_reclen: 0x20 d_type: DT_REG d_name: aaaaaaaa
-00000004: d_off: 0x00000005 d_reclen: 0x28 d_type: DT_REG d_name: aaaaaaaaaaaaaaa
-00000005: d_off: 0x00000006 d_reclen: 0x30 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaa
-00000006: d_off: 0x00000007 d_reclen: 0x38 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-00000007: d_off: 0x00000008 d_reclen: 0x38 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-00000008: d_off: 0x00000009 d_reclen: 0x40 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-00000009: d_off: 0x0000000a d_reclen: 0x48 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-0000000a: d_off: 0x0000000b d_reclen: 0x50 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-0000000b: d_off: 0x0000000c d_reclen: 0x58 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-0000000c: d_off: 0x0000000d d_reclen: 0x60 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-0000000d: d_off: 0x0000000e d_reclen: 0x68 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-0000000e: d_off: 0x0000000f d_reclen: 0x70 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-0000000f: d_off: 0x00000010 d_reclen: 0x70 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-00000010: d_off: 0x00000011 d_reclen: 0x78 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-00000011: d_off: 0x00000012 d_reclen: 0x80 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-00000012: d_off: 0x00000013 d_reclen: 0x88 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-00000013: d_off: 0x00000014 d_reclen: 0x90 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-00000014: d_off: 0x00000015 d_reclen: 0x98 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-00000015: d_off: 0x00000016 d_reclen: 0xa0 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-00000016: d_off: 0x00000017 d_reclen: 0xa8 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-00000017: d_off: 0x00000018 d_reclen: 0xa8 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-00000018: d_off: 0x00000019 d_reclen: 0xb0 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-00000019: d_off: 0x0000001a d_reclen: 0xb8 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-0000001a: d_off: 0x0000001b d_reclen: 0xc0 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-0000001b: d_off: 0x0000001c d_reclen: 0xc8 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-0000001c: d_off: 0x0000001d d_reclen: 0xd0 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-0000001d: d_off: 0x0000001e d_reclen: 0xd8 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-0000001e: d_off: 0x0000001f d_reclen: 0xe0 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-0000001f: d_off: 0x00000020 d_reclen: 0xe0 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-00000020: d_off: 0x00000021 d_reclen: 0xe8 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-00000021: d_off: 0x00000022 d_reclen: 0xf0 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-00000022: d_off: 0x00000023 d_reclen: 0xf8 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-00000023: d_off: 0x00000024 d_reclen: 0x100 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-00000024: d_off: 0x00000025 d_reclen: 0x108 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-00000025: d_off: 0x00000026 d_reclen: 0x110 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-== readdir offset
-00000014: d_off: 0x00000015 d_reclen: 0x98 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-00000015: d_off: 0x00000016 d_reclen: 0xa0 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-00000016: d_off: 0x00000017 d_reclen: 0xa8 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-00000017: d_off: 0x00000018 d_reclen: 0xa8 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-00000018: d_off: 0x00000019 d_reclen: 0xb0 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-00000019: d_off: 0x0000001a d_reclen: 0xb8 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-0000001a: d_off: 0x0000001b d_reclen: 0xc0 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-0000001b: d_off: 0x0000001c d_reclen: 0xc8 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-0000001c: d_off: 0x0000001d d_reclen: 0xd0 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-0000001d: d_off: 0x0000001e d_reclen: 0xd8 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-0000001e: d_off: 0x0000001f d_reclen: 0xe0 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-0000001f: d_off: 0x00000020 d_reclen: 0xe0 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-00000020: d_off: 0x00000021 d_reclen: 0xe8 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-00000021: d_off: 0x00000022 d_reclen: 0xf0 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-00000022: d_off: 0x00000023 d_reclen: 0xf8 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-00000023: d_off: 0x00000024 d_reclen: 0x100 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-00000024: d_off: 0x00000025 d_reclen: 0x108 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-00000025: d_off: 0x00000026 d_reclen: 0x110 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-== readdir len (bytes)
-00000000: d_off: 0x00000001 d_reclen: 0x18 d_type: DT_DIR d_name: .
-00000001: d_off: 0x00000002 d_reclen: 0x18 d_type: DT_DIR d_name: ..
-00000002: d_off: 0x00000003 d_reclen: 0x18 d_type: DT_REG d_name: a
-00000003: d_off: 0x00000004 d_reclen: 0x20 d_type: DT_REG d_name: aaaaaaaa
-00000004: d_off: 0x00000005 d_reclen: 0x28 d_type: DT_REG d_name: aaaaaaaaaaaaaaa
-00000005: d_off: 0x00000006 d_reclen: 0x30 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaa
-00000006: d_off: 0x00000007 d_reclen: 0x38 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-== introduce gap
-00000000: d_off: 0x00000001 d_reclen: 0x18 d_type: DT_DIR d_name: .
-00000001: d_off: 0x00000002 d_reclen: 0x18 d_type: DT_DIR d_name: ..
-00000002: d_off: 0x00000003 d_reclen: 0x18 d_type: DT_REG d_name: a
-00000003: d_off: 0x00000004 d_reclen: 0x20 d_type: DT_REG d_name: aaaaaaaa
-00000004: d_off: 0x00000005 d_reclen: 0x28 d_type: DT_REG d_name: aaaaaaaaaaaaaaa
-00000005: d_off: 0x00000006 d_reclen: 0x30 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaa
-00000006: d_off: 0x00000007 d_reclen: 0x38 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-00000007: d_off: 0x00000008 d_reclen: 0x38 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-00000008: d_off: 0x00000009 d_reclen: 0x40 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-00000009: d_off: 0x00000014 d_reclen: 0x48 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-00000014: d_off: 0x00000015 d_reclen: 0x98 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-00000015: d_off: 0x00000016 d_reclen: 0xa0 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-00000016: d_off: 0x00000017 d_reclen: 0xa8 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-00000017: d_off: 0x00000018 d_reclen: 0xa8 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-00000018: d_off: 0x00000019 d_reclen: 0xb0 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-00000019: d_off: 0x0000001a d_reclen: 0xb8 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-0000001a: d_off: 0x0000001b d_reclen: 0xc0 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-0000001b: d_off: 0x0000001c d_reclen: 0xc8 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-0000001c: d_off: 0x0000001d d_reclen: 0xd0 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-0000001d: d_off: 0x0000001e d_reclen: 0xd8 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-0000001e: d_off: 0x0000001f d_reclen: 0xe0 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-0000001f: d_off: 0x00000020 d_reclen: 0xe0 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-00000020: d_off: 0x00000021 d_reclen: 0xe8 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-00000021: d_off: 0x00000022 d_reclen: 0xf0 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-00000022: d_off: 0x00000023 d_reclen: 0xf8 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-00000023: d_off: 0x00000024 d_reclen: 0x100 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-00000024: d_off: 0x00000025 d_reclen: 0x108 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-00000025: d_off: 0x00000026 d_reclen: 0x110 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-== cleanup
@@ -22,8 +22,6 @@ generic/024
 generic/025
 generic/026
 generic/028
-generic/029
-generic/030
 generic/031
 generic/032
 generic/033
@@ -55,7 +53,6 @@ generic/073
 generic/076
 generic/078
 generic/079
-generic/080
 generic/081
 generic/082
 generic/084
@@ -84,12 +81,10 @@ generic/116
 generic/117
 generic/118
 generic/119
-generic/120
 generic/121
 generic/122
 generic/123
 generic/124
-generic/126
 generic/128
 generic/129
 generic/130
@@ -100,7 +95,6 @@ generic/136
 generic/138
 generic/139
 generic/140
-generic/141
 generic/142
 generic/143
 generic/144
@@ -159,7 +153,6 @@ generic/210
 generic/211
 generic/212
 generic/214
-generic/215
 generic/216
 generic/217
 generic/218
@@ -180,9 +173,6 @@ generic/238
 generic/240
 generic/244
 generic/245
-generic/246
-generic/247
-generic/248
 generic/249
 generic/250
 generic/252
@@ -241,7 +231,6 @@ generic/317
 generic/319
 generic/322
 generic/324
-generic/325
 generic/326
 generic/327
 generic/328
@@ -255,7 +244,6 @@ generic/337
 generic/341
 generic/342
 generic/343
-generic/346
 generic/348
 generic/353
 generic/355
@@ -317,9 +305,7 @@ generic/424
 generic/425
 generic/426
 generic/427
-generic/428
 generic/436
-generic/437
 generic/439
 generic/440
 generic/443
@@ -329,7 +315,6 @@ generic/448
 generic/449
 generic/450
 generic/451
-generic/452
 generic/453
 generic/454
 generic/456
@@ -453,7 +438,6 @@ generic/610
 generic/611
 generic/612
 generic/613
-generic/614
 generic/618
 generic/621
 generic/623
@@ -467,7 +451,6 @@ generic/632
 generic/634
 generic/635
 generic/637
-generic/638
 generic/639
 generic/640
 generic/644
@@ -879,4 +862,4 @@ generic/688
 generic/689
 shared/002
 shared/032
-Passed all 512 tests
+Passed all 495 tests
@@ -512,11 +512,6 @@ msg "running tests"
 > "$T_RESULTS/skip.log"
 > "$T_RESULTS/fail.log"

-# generate a test ID to make sure we can de-duplicate TAP results in aggregation
-. funcs/tap.sh
-t_tap_header $(uuidgen)
-
-testcount=0
 passed=0
 skipped=0
 failed=0
@@ -532,15 +527,12 @@ for t in $tests; do
 	cmd rm -rf "$T_TMPDIR"
 	cmd mkdir -p "$T_TMPDIR"

-	# create a test name dir in the fs, clean up old data as needed
+	# create a test name dir in the fs
 	T_DS=""
 	for i in $(seq 0 $((T_NR_MOUNTS - 1))); do
 		dir="${T_M[$i]}/test/$test_name"

-		test $i == 0 && (
-			test -d "$dir" && cmd rm -rf "$dir"
-			cmd mkdir -p "$dir"
-		)
+		test $i == 0 && cmd mkdir -p "$dir"

 		eval T_D$i=$dir
 		T_D[$i]=$dir
@@ -645,11 +637,6 @@ for t in $tests; do

 		test -n "$T_ABORT" && die "aborting after first failure"
 	fi
-
-	# record results for TAP format output
-	t_tap_progress $test_name $sts
-	((testcount++))
-
 done

 msg "all tests run: $passed passed, $skipped skipped, $skipped_permitted skipped (permitted), $failed failed"
@@ -6,7 +6,6 @@ inode-items-updated.sh
 simple-inode-index.sh
 simple-staging.sh
 simple-release-extents.sh
-simple-readdir.sh
 get-referring-entries.sh
 fallocate.sh
 basic-truncate.sh
@@ -18,7 +17,6 @@ projects.sh
 large-fragmented-free.sh
 format-version-forward-back.sh
 enospc.sh
-mmap.sh
 srch-safe-merge-pos.sh
 srch-basic-functionality.sh
 simple-xattr-unit.sh
@@ -27,7 +25,6 @@ totl-xattr-tag.sh
 quota.sh
 lock-refleak.sh
 lock-shrink-consistency.sh
-lock-shrink-read-race.sh
 lock-pr-cw-conflict.sh
 lock-revoke-getcwd.sh
 lock-recover-invalidate.sh
@@ -57,4 +54,5 @@ archive-light-cycle.sh
 block-stale-reads.sh
 inode-deletion.sh
 renameat2-noreplace.sh
+parallel_restore.sh
 xfstests.sh
@@ -1,181 +0,0 @@
-#define _GNU_SOURCE
-/*
- * mmap() stress test for scoutfs
- *
- * This test exercises the scoutfs kernel module's locking by
- * repeatedly reading/writing using mmap and pread/write calls
- * across 5 clients (mounts).
- *
- * Each thread operates on a single thread/client, and performs
- * operations in a random order on the file.
- *
- * The goal is to assure that locking between _page_mkwrite vfs
- * calls and the normal read/write paths do not cause deadlocks.
- *
- * There is no content validation performed. All that is done is
- * assure that the programs continues without errors.
- */
-
-#include <sys/types.h>
-#include <stdio.h>
-#include <sys/stat.h>
-#include <fcntl.h>
-#include <unistd.h>
-#include <stdlib.h>
-#include <string.h>
-#include <stdbool.h>
-#include <sys/mman.h>
-#include <pthread.h>
-#include <errno.h>
-
-static int size = 0;
-static int count = 0; /* XXX make this duration instead */
-
-struct thread_info {
-	int nr;
-	int fd;
-};
-
-static void *run_test_func(void *ptr)
-{
-	void *buf = NULL;
-	char *addr = NULL;
-	struct thread_info *tinfo = ptr;
-	int c = 0;
-	int fd;
-	ssize_t read, written, ret;
-	int preads = 0, pwrites = 0, mreads = 0, mwrites = 0;
-
-	fd = tinfo->fd;
-
-	if (posix_memalign(&buf, 4096, size) != 0) {
-		perror("calloc");
-		exit(-1);
-	}
-
-	addr = mmap(NULL, size, PROT_WRITE | PROT_READ, MAP_SHARED, fd, 0);
-	if (addr == MAP_FAILED) {
-		perror("mmap");
-		exit(-1);
-	}
-
-	usleep(100000); /* 0.1sec to allow all threads to start roughly at the same time */
-
-	for (;;) {
-		if (++c > count)
-			break;
-
-		switch (rand() % 4) {
-		case 0: /* pread */
-			preads++;
-			for (read = 0; read < size;) {
-				ret = pread(fd, buf, size - read, read);
-				if (ret < 0) {
-					perror("pwrite");
-					exit(-1);
-				}
-				read += ret;
-			}
-			break;
-		case 1: /* pwrite */
-			pwrites++;
-			memset(buf, (char)(c & 0xff), size);
-			for (written = 0; written < size;) {
-				ret = pwrite(fd, buf, size - written, written);
-				if (ret < 0) {
-					perror("pwrite");
-					exit(-1);
-				}
-				written += ret;
-			}
-			break;
-		case 2: /* mmap read */
-			mreads++;
-			memcpy(buf, addr, size); /* noerr */
-			break;
-		case 3: /* mmap write */
-			mwrites++;
-			memset(buf, (char)(c & 0xff), size);
-			memcpy(addr, buf, size); /* noerr */
-			break;
-		}
-	}
-
-	munmap(addr, size);
-
-	free(buf);
-
-	printf("thread %u complete: preads %u pwrites %u mreads %u mwrites %u\n", tinfo->nr,
-		mreads, mwrites, preads, pwrites);
-
-	return NULL;
-}
-
-int main(int argc, char **argv)
-{
-	pthread_t thread[5];
-	struct thread_info tinfo[5];
-	int fd[5];
-	int ret;
-	int i;
-
-	if (argc != 8) {
-		fprintf(stderr, "%s requires 7 arguments - size count file1 file2 file3 file4 file5\n", argv[0]);
-		exit(-1);
-	}
-
-	size = atoi(argv[1]);
-	if (size <= 0) {
-		fprintf(stderr, "invalid size, must be greater than 0\n");
-		exit(-1);
-	}
-
-	count = atoi(argv[2]);
-	if (count < 0) {
-		fprintf(stderr, "invalid count, must be greater than 0\n");
-		exit(-1);
-	}
-
-	/* create and truncate one fd */
-	fd[0] = open(argv[3], O_RDWR | O_CREAT | O_TRUNC, 00644);
-	if (fd[0] < 0) {
-		perror("open");
-		exit(-1);
-	}
-
-	/* make it the test size */
-	if (posix_fallocate(fd[0], 0, size) != 0) {
-		perror("fallocate");
-		exit(-1);
-	}
-
-	/* now open the rest of the fds */
-	for (i = 1; i < 5; i++) {
-		fd[i] = open(argv[3+i], O_RDWR);
-		if (fd[i] < 0) {
-			perror("open");
-			exit(-1);
-		}
-	}
-
-	/* start threads */
-	for (i = 0; i < 5; i++) {
-		tinfo[i].fd = fd[i];
-		tinfo[i].nr = i;
-		ret = pthread_create(&thread[i], NULL, run_test_func, (void*)&tinfo[i]);
-
-		if (ret) {
-			perror("pthread_create");
-			exit(-1);
-		}
-	}
-
-	/* wait for complete */
-	for (i = 0; i < 5; i++)
-		pthread_join(thread[i], NULL);
-
-	for (i = 0; i < 5; i++)
-		close(fd[i]);
-
-	exit(0);
-}
@@ -1,159 +0,0 @@
-#define _GNU_SOURCE
-/*
- * mmap() content consistency checking for scoutfs
- *
- * This test program validates that content from memory mappings
- * are consistent across clients, whether written/read with mmap or
- * normal writes/reads.
- *
- * One side of (read/write) will always be memory mapped. It may
- * be that both sides do memory mapped (33% of the time).
- */
-
-#include <stdio.h>
-#include <stdlib.h>
-#include <unistd.h>
-#include <string.h>
-#include <sys/mman.h>
-#include <fcntl.h>
-#include <errno.h>
-
-static int count = 0;
-static int size = 0;
-
-static void run_test_func(int fd1, int fd2)
-{
-	void *buf1 = NULL;
-	void *buf2 = NULL;
-	char *addr1 = NULL;
-	char *addr2 = NULL;
-	int c = 0;
-	ssize_t read, written, ret;
-
-	/* buffers for both sides to compare */
-	if (posix_memalign(&buf1, 4096, size) != 0) {
-		perror("calloc1");
-		exit(-1);
-	}
-
-	if (posix_memalign(&buf2, 4096, size) != 0) {
-		perror("calloc1");
-		exit(-1);
-	}
-
-	/* memory maps for both sides */
-	addr1 = mmap(NULL, size, PROT_WRITE | PROT_READ, MAP_SHARED, fd1, 0);
-	if (addr1 == MAP_FAILED) {
-		perror("mmap1");
-		exit(-1);
-	}
-
-	addr2 = mmap(NULL, size, PROT_WRITE | PROT_READ, MAP_SHARED, fd2, 0);
-	if (addr2 == MAP_FAILED) {
-		perror("mmap2");
-		exit(-1);
-	}
-
-	for (;;) {
-		if (++c > count) /* 10k iterations */
-			break;
-
-		/* put a pattern in buf1 */
-		memset(buf1, c & 0xff, size);
-
-		/* pwrite or mmap write from buf1 */
-		switch (c % 3) {
-		case 0:	/* pwrite */
-			for (written = 0; written < size;) {
-				ret = pwrite(fd1, buf1, size - written, written);
-				if (ret < 0) {
-					perror("pwrite");
-					exit(-1);
-				}
-				written += ret;
-			}
-			break;
-		default: /* mmap write */
-			memcpy(addr1, buf1, size);
-			break;
-		}
-
-		/* pread or mmap read to buf2 */
-		switch (c % 3) {
-		case 2: /* pread */
-			for (read = 0; read < size;) {
-				ret = pread(fd2, buf2, size - read, read);
-				if (ret < 0) {
-					perror("pwrite");
-					exit(-1);
-				}
-				read += ret;
-			}
-			break;
-		default: /* mmap read */
-			memcpy(buf2, addr2, size);
-			break;
-		}
-
-		/* compare bufs */
-		if (memcmp(buf1, buf2, size) != 0) {
-			fprintf(stderr, "memcmp() failed\n");
-			exit(-1);
-		}
-	}
-
-	munmap(addr1, size);
-	munmap(addr2, size);
-
-	free(buf1);
-	free(buf2);
-}
-
-int main(int argc, char **argv)
-{
-	int fd[1];
-
-	if (argc != 5) {
-		fprintf(stderr, "%s requires 4 arguments - size count file1 file2\n", argv[0]);
-		exit(-1);
-	}
-
-	size = atoi(argv[1]);
-	if (size <= 0) {
-		fprintf(stderr, "invalid size, must be greater than 0\n");
-		exit(-1);
-	}
-
-	count = atoi(argv[2]);
-	if (count < 3) {
-		fprintf(stderr, "invalid count, must be greater than 3\n");
-		exit(-1);
-	}
-
-	/* create and truncate one fd */
-	fd[0] = open(argv[3], O_RDWR | O_CREAT | O_TRUNC, 00644);
-	if (fd[0] < 0) {
-		perror("open");
-		exit(-1);
-	}
-
-	fd[1] = open(argv[4], O_RDWR , 00644);
-	if (fd[1] < 0) {
-		perror("open");
-		exit(-1);
-	}
-
-	/* make it the test size */
-	if (posix_fallocate(fd[0], 0, size) != 0) {
-		perror("fallocate");
-		exit(-1);
-	}
-
-	/* run the test function */
-	run_test_func(fd[0], fd[1]);
-
-	close(fd[0]);
-	close(fd[1]);
-
-	exit(0);
-}
@@ -0,0 +1,838 @@
+#define _GNU_SOURCE /* O_DIRECT */
+#include <stdlib.h>
+#include <stdio.h>
+#include <stdbool.h>
+#include <unistd.h>
+#include <fcntl.h>
+#include <sys/types.h>
+#include <sys/xattr.h>
+#include <ctype.h>
+#include <string.h>
+#include <errno.h>
+#include <limits.h>
+#include <time.h>
+#include <sys/prctl.h>
+#include <signal.h>
+#include <sys/socket.h>
+
+#include "../../utils/src/sparse.h"
+#include "../../utils/src/util.h"
+#include "../../utils/src/list.h"
+#include "../../utils/src/parse.h"
+#include "../../kmod/src/format.h"
+#include "../../utils/src/parallel_restore.h"
+
+/*
+ * XXX:
+ *  - add a nice description of what's going on
+ *  - mention allocator contention
+ *  - test child process dying handling
+ *  - root dir entry name length is wrong
+ */
+
+#define ERRF " errno %d (%s)"
+#define ERRA errno, strerror(errno)
+
+#define error_exit(cond, fmt, args...)			\
+do {							\
+	if (cond) {					\
+		printf("error: "fmt"\n", ##args);	\
+		exit(1);				\
+	}						\
+} while (0)
+
+#define dprintf(fmt, args...)		\
+do {					\
+	if (0)				\
+		printf(fmt, ##args);	\
+} while (0)
+
+#define REG_MODE (S_IFREG | 0644)
+#define DIR_MODE (S_IFDIR | 0755)
+
+struct opts {
+	unsigned long long buf_size;
+
+	unsigned long long write_batch;
+	unsigned long long low_dirs;
+	unsigned long long high_dirs;
+	unsigned long long low_files;
+	unsigned long long high_files;
+	char *meta_path;
+	unsigned long long total_files;
+	bool read_only;
+	unsigned long long seed;
+	unsigned long long nr_writers;
+};
+
+static void usage(void)
+{
+	printf("usage:\n"
+	       " -b NR       | writers write blocks in batches of NR files (1000000)\n"
+	       " -d LOW:HIGH | range of subdirs per directory (5:10)\n"
+	       " -f LOW:HIGH | range of files per directory (10:20)\n"
+	       " -m PATH     | path to metadata device\n"
+	       " -n NR       | total number of files to create (100)\n"
+	       " -r          | read-only, all work except writing, measure cpu cost\n"
+	       " -s NR       | randomization seed (random)\n"
+	       " -w NR       | number of writing processes to fork (online cpus)\n"
+	       );
+}
+
+static size_t write_bufs(struct opts *opts, struct scoutfs_parallel_restore_writer *wri,
+			 void *buf, size_t buf_size, int dev_fd)
+{
+	size_t total = 0;
+	size_t count;
+	off_t off;
+	int ret;
+
+	do {
+		ret = scoutfs_parallel_restore_write_buf(wri, buf, buf_size, &off, &count);
+		error_exit(ret, "write buf %d", ret);
+
+		if (count > 0) {
+			if (!opts->read_only)
+				ret = pwrite(dev_fd, buf, count, off);
+			else
+				ret = count;
+			error_exit(ret != count, "pwrite count %zu ret %d", count, ret);
+			total += ret;
+		}
+	} while (count > 0);
+
+	return total;
+}
+
+struct gen_inode {
+	struct scoutfs_parallel_restore_inode inode;
+	struct scoutfs_parallel_restore_xattr **xattrs;
+	u64 nr_xattrs;
+	struct scoutfs_parallel_restore_entry **entries;
+	u64 nr_files;
+	u64 nr_entries;
+};
+
+static void free_gino(struct gen_inode *gino)
+{
+	u64 i;
+
+	if (gino) {
+		if (gino->entries) {
+			for (i = 0; i < gino->nr_entries; i++)
+				free(gino->entries[i]);
+			free(gino->entries);
+		}
+		if (gino->xattrs) {
+			for (i = 0; i < gino->nr_xattrs; i++)
+				free(gino->xattrs[i]);
+			free(gino->xattrs);
+		}
+		free(gino);
+	}
+}
+
+static struct scoutfs_parallel_restore_xattr *
+generate_xattr(struct opts *opts, u64 ino, u64 pos, char *name, int name_len, void *value,
+		int value_len)
+{
+	struct scoutfs_parallel_restore_xattr *xattr;
+
+	xattr = malloc(sizeof(struct scoutfs_parallel_restore_xattr) + name_len + value_len);
+	error_exit(!xattr, "error allocating generated xattr");
+
+	*xattr = (struct scoutfs_parallel_restore_xattr) {
+		.ino = ino,
+		.pos = pos,
+		.name_len = name_len,
+		.value_len = value_len,
+	};
+
+	xattr->name = (void *)(xattr + 1);
+	xattr->value = (void *)(xattr->name + name_len);
+
+	memcpy(xattr->name, name, name_len);
+	if (value_len)
+		memcpy(xattr->value, value, value_len);
+
+	return xattr;
+}
+
+static struct gen_inode *generate_inode(struct opts *opts, u64 ino, mode_t mode)
+{
+	struct gen_inode *gino;
+	struct timespec now;
+
+	clock_gettime(CLOCK_REALTIME, &now);
+
+	gino = calloc(1, sizeof(struct gen_inode));
+	error_exit(!gino, "failure allocating generated inode");
+
+	gino->inode = (struct scoutfs_parallel_restore_inode) {
+		.ino = ino,
+		.meta_seq = ino,
+		.data_seq = 0,
+		.mode = mode,
+		.atime = now,
+		.ctime = now,
+		.mtime = now,
+		.crtime = now,
+	};
+
+	/*
+	 * hacky creation of a bunch of xattrs for now.
+	 */
+	if ((mode & S_IFMT) == S_IFREG) {
+		#define NV(n, v) { n, sizeof(n) - 1, v, sizeof(v) - 1, }
+		struct name_val {
+			char *name;
+			int len;
+			char *value;
+			int value_len;
+		} nv[] = {
+			NV("scoutfs.hide.totl.acct.8314611887310466424.2.0", "1"),
+			NV("scoutfs.hide.srch.sam_vol_E01001L6_4", ""),
+			NV("scoutfs.hide.sam_reqcopies", ""),
+			NV("scoutfs.hide.sam_copy_2", ""),
+			NV("scoutfs.hide.totl.acct.F01030L6.8314611887310466424.7.30", "1"),
+			NV("scoutfs.hide.sam_copy_1", ""),
+			NV("scoutfs.hide.srch.sam_vol_F01030L6_4", ""),
+			NV("scoutfs.hide.srch.sam_release_cand", ""),
+			NV("scoutfs.hide.sam_restime", ""),
+			NV("scoutfs.hide.sam_uuid", ""),
+			NV("scoutfs.hide.totl.acct.8314611887310466424.3.0", "1"),
+			NV("scoutfs.hide.srch.sam_vol_F01030L6", ""),
+			NV("scoutfs.hide.srch.sam_uuid_865939b7-24d6-472f-b85c-7ce7afeb813a", ""),
+			NV("scoutfs.hide.srch.sam_vol_E01001L6", ""),
+			NV("scoutfs.hide.totl.acct.E01001L6.8314611887310466424.7.1", "1"),
+			NV("scoutfs.hide.totl.acct.8314611887310466424.4.0", "1"),
+			NV("scoutfs.hide.totl.acct.8314611887310466424.11.0", "1"),
+			NV("scoutfs.hide.totl.acct.8314611887310466424.1.0", "1"),
+		};
+		unsigned int nr = array_size(nv);
+		int i;
+
+		gino->xattrs = calloc(nr, sizeof(struct scoutfs_parallel_restore_xattr *));
+		error_exit(!gino->xattrs, "error allocating generated xattrs");
+		for (i = 0; i < nr; i++)
+			gino->xattrs[i] = generate_xattr(opts, ino, i, nv[i].name, nv[i].len,
+							 nv[i].value, nv[i].value_len);
+
+		gino->nr_xattrs = nr;
+		gino->inode.nr_xattrs = nr;
+
+		gino->inode.size = 4096;
+		gino->inode.offline = true;
+	}
+
+	return gino;
+}
+
+static struct scoutfs_parallel_restore_entry *
+generate_entry(struct opts *opts, char *prefix, u64 nr, u64 dir_ino, u64 pos, u64 ino, mode_t mode)
+{
+	struct scoutfs_parallel_restore_entry *entry;
+	char buf[PATH_MAX];
+	int bytes;
+
+	bytes = snprintf(buf, sizeof(buf), "%s-%llu", prefix, nr);
+
+	entry = malloc(sizeof(struct scoutfs_parallel_restore_entry) + bytes);
+	error_exit(!entry, "error allocating generated entry");
+
+	*entry = (struct scoutfs_parallel_restore_entry) {
+		.dir_ino = dir_ino,
+		.pos = pos,
+		.ino = ino,
+		.mode = mode,
+		.name = (void *)(entry + 1),
+		.name_len = bytes,
+	};
+
+	memcpy(entry->name, buf, bytes);
+
+	return entry;
+}
+
+/*
+ * since the _parallel_restore_quota_rule mimics the squota_rule found in the
+ * kernel we can also mimic its rule_to_irule function
+ */
+
+#define TEST_RULE_STR "7 13,L,- 15,L,- 17,L,- I 33 -"
+
+static struct scoutfs_parallel_restore_quota_rule *
+generate_quota(struct opts *opts)
+{
+	struct scoutfs_parallel_restore_quota_rule *prule;
+	int err;
+
+	prule = calloc(1, sizeof(struct scoutfs_parallel_restore_quota_rule));
+	error_exit(!prule, "Quota rule alloc failed");
+
+	err = sscanf(TEST_RULE_STR, " %hhu %llu,%c,%c %llu,%c,%c %llu,%c,%c %c %llu %c",
+		     &prule->prio,
+		     &prule->names[0].val, &prule->names[0].source, &prule->names[0].flags,
+		     &prule->names[1].val, &prule->names[1].source, &prule->names[1].flags,
+		     &prule->names[2].val, &prule->names[2].source, &prule->names[2].flags,
+		     &prule->op, &prule->limit, &prule->rule_flags);
+	error_exit(err != 13, "invalid quota rule, missing fields. nr fields: %d rule str: %s\n", err, TEST_RULE_STR);
+
+	return prule;
+}
+
+static u64 random64(void)
+{
+	return ((u64)lrand48() << 32) | lrand48();
+}
+
+static u64 random_range(u64 low, u64 high)
+{
+	return low + (random64() % (high - low + 1));
+}
+
+static struct gen_inode *generate_dir(struct opts *opts, u64 dir_ino, u64 ino_start, u64 ino_len,
+				      bool no_dirs)
+{
+	struct scoutfs_parallel_restore_entry *entry;
+	struct gen_inode *gino;
+	u64 nr_entries;
+	u64 nr_files;
+	u64 nr_dirs;
+	u64 ino;
+	char *prefix;
+	mode_t mode;
+	u64 i;
+
+	nr_dirs = no_dirs ? 0 : random_range(opts->low_dirs, opts->high_dirs);
+	nr_files = random_range(opts->low_files, opts->high_files);
+
+	if (1 + nr_dirs + nr_files > ino_len) {
+		nr_dirs = no_dirs ? 0 : (ino_len - 1) / 2;
+		nr_files = (ino_len - 1) - nr_dirs;
+	}
+
+	nr_entries = nr_dirs + nr_files;
+
+	gino = generate_inode(opts, dir_ino, DIR_MODE);
+	error_exit(!gino, "error allocating generated inode");
+
+	gino->inode.nr_subdirs = nr_dirs;
+	gino->nr_files = nr_files;
+
+	if (nr_entries) {
+		gino->entries = calloc(nr_entries, sizeof(struct scoutfs_parallel_restore_entry *));
+		error_exit(!gino->entries, "error allocating generated inode entries");
+
+		gino->nr_entries = nr_entries;
+	}
+
+	mode = DIR_MODE;
+	prefix = "dir";
+	for (i = 0; i < nr_entries; i++) {
+		if (i == nr_dirs) {
+			mode = REG_MODE;
+			prefix = "file";
+		}
+
+		ino = ino_start + i;
+		entry = generate_entry(opts, prefix, ino, gino->inode.ino,
+				       SCOUTFS_DIRENT_FIRST_POS + i, ino, mode);
+
+		gino->entries[i] = entry;
+		gino->inode.total_entry_name_bytes += entry->name_len;
+	}
+
+	return gino;
+}
+
+/*
+ * Restore a generated inode.  If it's a directory then we also restore
+ * all its entries.  The caller is going to descend into subdir entries and generate
+ * those dir inodes.  We have to generate and restore all non-dir inodes referenced
+ * by this inode's entries.
+ */
+static void restore_inode(struct opts *opts, struct scoutfs_parallel_restore_writer *wri,
+			  struct gen_inode *gino)
+{
+	struct gen_inode *nondir;
+	int ret;
+	u64 i;
+
+	ret = scoutfs_parallel_restore_add_inode(wri, &gino->inode);
+	error_exit(ret, "thread add inode %d", ret);
+
+	for (i = 0; i < gino->nr_entries; i++) {
+		ret = scoutfs_parallel_restore_add_entry(wri, gino->entries[i]);
+		error_exit(ret, "thread add entry %d", ret);
+
+		/* caller only needs subdir entries, generate and free others */
+		if ((gino->entries[i]->mode & S_IFMT) != S_IFDIR) {
+
+			nondir = generate_inode(opts, gino->entries[i]->ino,
+						gino->entries[i]->mode);
+			restore_inode(opts, wri, nondir);
+			free_gino(nondir);
+
+			free(gino->entries[i]);
+			if (i != gino->nr_entries - 1)
+				gino->entries[i] = gino->entries[gino->nr_entries - 1];
+			gino->nr_entries--;
+			gino->nr_files--;
+			i--;
+		}
+	}
+
+	for (i = 0; i < gino->nr_xattrs; i++) {
+		ret = scoutfs_parallel_restore_add_xattr(wri, gino->xattrs[i]);
+		error_exit(ret, "thread add xattr %d", ret);
+	}
+}
+
+struct writer_args {
+	struct list_head head;
+
+	int dev_fd;
+	int pair_fd;
+
+	struct scoutfs_parallel_restore_slice slice;
+	u64 writer_nr;
+	u64 dir_height;
+	u64 ino_start;
+	u64 ino_len;
+};
+
+struct write_result {
+	struct scoutfs_parallel_restore_progress prog;
+	struct scoutfs_parallel_restore_slice slice;
+	__le64 files_created;
+	__le64 bytes_written;
+};
+
+static void write_bufs_and_send(struct opts *opts, struct scoutfs_parallel_restore_writer *wri,
+				  void *buf, size_t buf_size, int dev_fd,
+				  struct write_result *res, bool get_slice, int pair_fd)
+{
+	size_t total;
+	int ret;
+
+	total = write_bufs(opts, wri, buf, buf_size, dev_fd);
+	le64_add_cpu(&res->bytes_written, total);
+
+	ret = scoutfs_parallel_restore_get_progress(wri, &res->prog);
+	error_exit(ret, "get prog %d", ret);
+
+	if (get_slice) {
+		ret = scoutfs_parallel_restore_get_slice(wri, &res->slice);
+		error_exit(ret, "thread get slice %d", ret);
+	}
+
+	ret = write(pair_fd, res, sizeof(struct write_result));
+	error_exit(ret != sizeof(struct write_result), "result send error");
+
+	memset(res, 0, sizeof(struct write_result));
+}
+
+/*
+ * Calculate the number of bytes in toplevel "top-%llu" entry names for the given
+ * number of writers.
+ */
+static u64 topdir_entry_bytes(u64 nr_writers)
+{
+	u64 bytes = (3 + 1) * nr_writers;
+	u64 limit;
+	u64 done;
+	u64 wid;
+	u64 nr;
+
+	for (done = 0, wid = 1, limit = 10; done < nr_writers; done += nr, wid++, limit *= 10) {
+		nr = min(limit - done, nr_writers - done);
+		bytes += nr * wid;
+	}
+
+	return bytes;
+}
+
+struct dir_pos {
+	struct gen_inode *gino;
+	u64 pos;
+};
+
+static void writer_proc(struct opts *opts, struct writer_args *args)
+{
+	struct scoutfs_parallel_restore_writer *wri = NULL;
+	struct scoutfs_parallel_restore_entry *entry;
+	struct dir_pos *dirs = NULL;
+	struct write_result res;
+	struct gen_inode *gino;
+	void *buf = NULL;
+	u64 level;
+	u64 ino;
+	int ret;
+
+	memset(&res, 0, sizeof(res));
+
+	dirs = calloc(args->dir_height, sizeof(struct dir_pos));
+	error_exit(!dirs, "error allocating parent dirs "ERRF, ERRA);
+
+	errno = posix_memalign((void **)&buf, 4096, opts->buf_size);
+	error_exit(errno, "error allocating block buf "ERRF, ERRA);
+
+	ret = scoutfs_parallel_restore_create_writer(&wri);
+	error_exit(ret, "create writer %d", ret);
+
+	ret = scoutfs_parallel_restore_add_slice(wri, &args->slice);
+	error_exit(ret, "add slice %d", ret);
+
+	/* writer 0 creates the root dir */
+	if (args->writer_nr == 0) {
+		gino = generate_inode(opts, SCOUTFS_ROOT_INO, DIR_MODE);
+		gino->inode.nr_subdirs = opts->nr_writers;
+		gino->inode.total_entry_name_bytes = topdir_entry_bytes(opts->nr_writers);
+
+		ret = scoutfs_parallel_restore_add_inode(wri, &gino->inode);
+		error_exit(ret, "thread add root inode %d", ret);
+		free_gino(gino);
+	}
+
+	/* create root entry for our top level dir */
+	ino = args->ino_start++;
+	args->ino_len--;
+
+	entry = generate_entry(opts, "top", args->writer_nr,
+			       SCOUTFS_ROOT_INO, SCOUTFS_DIRENT_FIRST_POS + args->writer_nr,
+			       ino, DIR_MODE);
+
+	ret = scoutfs_parallel_restore_add_entry(wri, entry);
+	error_exit(ret, "thread top entry %d", ret);
+	free(entry);
+
+	level = args->dir_height - 1;
+
+	while (args->ino_len > 0 && level < args->dir_height) {
+		gino = dirs[level].gino;
+
+		/* generate and restore if we follow entries */
+		if (!gino) {
+			gino = generate_dir(opts, ino, args->ino_start, args->ino_len, level == 0);
+			args->ino_start += gino->nr_entries;
+			args->ino_len -= gino->nr_entries;
+			le64_add_cpu(&res.files_created, gino->nr_files);
+
+			restore_inode(opts, wri, gino);
+			dirs[level].gino = gino;
+		}
+
+		if (dirs[level].pos == gino->nr_entries) {
+			/* ascend if we're done with this dir */
+			dirs[level].gino = NULL;
+			dirs[level].pos = 0;
+			free_gino(gino);
+			level++;
+
+		} else {
+			/* otherwise descend into subdir entry */
+			ino = gino->entries[dirs[level].pos]->ino;
+			dirs[level].pos++;
+			level--;
+		}
+
+		/* do a partial write at batch intervals when there's still more to do */
+		if (le64_to_cpu(res.files_created) >= opts->write_batch && args->ino_len > 0)
+			write_bufs_and_send(opts, wri, buf, opts->buf_size, args->dev_fd,
+					    &res, false, args->pair_fd);
+	}
+
+	write_bufs_and_send(opts, wri, buf, opts->buf_size, args->dev_fd,
+			    &res, true, args->pair_fd);
+
+	scoutfs_parallel_restore_destroy_writer(&wri);
+
+	free(dirs);
+	free(buf);
+}
+
+/*
+ * If any of our children exited with an error code, we hard exit.
+ * The child processes should themselves report out any errors
+ * encountered. Any remaining children will receive SIGHUP and
+ * terminate.
+ */
+static void sigchld_handler(int signo, siginfo_t *info, void *context)
+{
+	if (info->si_status)
+		exit(EXIT_FAILURE);
+}
+
+static void fork_writer(struct opts *opts, struct writer_args *args)
+{
+	pid_t parent = getpid();
+	pid_t pid;
+	int ret;
+
+	pid = fork();
+	error_exit(pid == -1, "fork error");
+
+	if (pid != 0)
+		return;
+
+	ret = prctl(PR_SET_PDEATHSIG, SIGHUP);
+	error_exit(ret < 0, "failed to set parent death sig");
+
+	printf("pid %u getpid() %u parent %u getppid() %u\n",
+		pid, getpid(), parent, getppid());
+	error_exit(getppid() != parent, "child parent already changed");
+
+	writer_proc(opts, args);
+	exit(0);
+}
+
+static int do_restore(struct opts *opts)
+{
+	struct scoutfs_parallel_restore_writer *wri = NULL;
+	struct scoutfs_parallel_restore_slice *slices = NULL;
+	struct scoutfs_parallel_restore_quota_rule *rule = NULL;
+	struct scoutfs_super_block *super = NULL;
+	struct write_result res;
+	struct writer_args *args;
+	struct timespec begin;
+	struct timespec end;
+	LIST_HEAD(writers);
+	u64 next_ino;
+	u64 ino_per;
+	u64 avg_dirs;
+	u64 avg_files;
+	u64 dir_height;
+	u64 tot_files;
+	u64 tot_bytes;
+	int pair[2] = {-1, -1};
+	float secs;
+	void *buf = NULL;
+	int dev_fd = -1;
+	int ret;
+	int i;
+
+	ret = socketpair(PF_LOCAL, SOCK_STREAM, 0, pair);
+	error_exit(ret, "socketpair error "ERRF, ERRA);
+
+	dev_fd = open(opts->meta_path, O_DIRECT | (opts->read_only ? O_RDONLY : (O_RDWR|O_EXCL)));
+	error_exit(dev_fd < 0, "error opening '%s': "ERRF, opts->meta_path, ERRA);
+
+	errno = posix_memalign((void **)&super, 4096, SCOUTFS_BLOCK_SM_SIZE) ?:
+		posix_memalign((void **)&buf, 4096, opts->buf_size);
+	error_exit(errno, "error allocating block bufs "ERRF, ERRA);
+
+	ret = pread(dev_fd, super, SCOUTFS_BLOCK_SM_SIZE,
+		    SCOUTFS_SUPER_BLKNO << SCOUTFS_BLOCK_SM_SHIFT);
+	error_exit(ret != SCOUTFS_BLOCK_SM_SIZE, "error reading super, ret %d", ret);
+
+	ret = scoutfs_parallel_restore_create_writer(&wri);
+	error_exit(ret, "create writer %d", ret);
+
+	ret = scoutfs_parallel_restore_import_super(wri, super, dev_fd);
+	error_exit(ret, "import super %d", ret);
+
+	rule = generate_quota(opts);
+	ret = scoutfs_parallel_restore_add_quota_rule(wri, rule);
+	free(rule);
+	error_exit(ret, "add quotas %d", ret);
+
+	slices = calloc(1 + opts->nr_writers, sizeof(struct scoutfs_parallel_restore_slice));
+	error_exit(!slices, "alloc slices");
+
+	scoutfs_parallel_restore_init_slices(wri, slices, 1 + opts->nr_writers);
+
+	ret = scoutfs_parallel_restore_add_slice(wri, &slices[0]);
+	error_exit(ret, "add slices[0] %d", ret);
+
+	next_ino = (SCOUTFS_ROOT_INO | SCOUTFS_LOCK_INODE_GROUP_MASK) + 1;
+	ino_per = opts->total_files / opts->nr_writers;
+	avg_dirs = (opts->low_dirs + opts->high_dirs) / 2;
+	avg_files = (opts->low_files + opts->high_files) / 2;
+
+	dir_height = 1;
+	tot_files = avg_files * opts->nr_writers;
+
+	while (tot_files < opts->total_files) {
+		dir_height++;
+		tot_files *= avg_dirs;
+	}
+
+	dprintf("height %llu tot %llu total %llu\n", dir_height, tot_files, opts->total_files);
+
+	clock_gettime(CLOCK_MONOTONIC_RAW, &begin);
+
+	/* start each writing process */
+	for (i = 0; i < opts->nr_writers; i++) {
+		args = calloc(1, sizeof(struct writer_args));
+		error_exit(!args, "alloc writer args");
+
+		args->dev_fd = dev_fd;
+		args->pair_fd = pair[1];
+		args->slice = slices[1 + i];
+		args->writer_nr = i;
+		args->dir_height = dir_height;
+		args->ino_start = next_ino;
+		args->ino_len = ino_per;
+
+		list_add_tail(&args->head, &writers);
+		next_ino += ino_per;
+
+		fork_writer(opts, args);
+	}
+
+	/* read results and watch for writers to finish */
+	tot_files = 0;
+	tot_bytes = 0;
+	i = 0;
+	while (i < opts->nr_writers) {
+		ret = read(pair[0], &res, sizeof(struct write_result));
+		error_exit(ret != sizeof(struct write_result), "result read error %d", ret);
+
+		ret = scoutfs_parallel_restore_add_progress(wri, &res.prog);
+		error_exit(ret, "add thr prog %d", ret);
+
+		if (res.slice.meta_len != 0) {
+			ret = scoutfs_parallel_restore_add_slice(wri, &res.slice);
+			error_exit(ret, "add thr slice %d", ret);
+			i++;
+		}
+
+		tot_files += le64_to_cpu(res.files_created);
+		tot_bytes += le64_to_cpu(res.bytes_written);
+	}
+
+	tot_bytes += write_bufs(opts, wri, buf, opts->buf_size, dev_fd);
+
+	ret = scoutfs_parallel_restore_export_super(wri, super);
+	error_exit(ret, "update super %d", ret);
+
+	if (!opts->read_only) {
+		ret = pwrite(dev_fd, super, SCOUTFS_BLOCK_SM_SIZE,
+			     SCOUTFS_SUPER_BLKNO << SCOUTFS_BLOCK_SM_SHIFT);
+		error_exit(ret != SCOUTFS_BLOCK_SM_SIZE, "error writing super, ret %d", ret);
+	}
+
+	clock_gettime(CLOCK_MONOTONIC_RAW, &end);
+
+	scoutfs_parallel_restore_destroy_writer(&wri);
+
+	secs = ((double)end.tv_sec + ((double)end.tv_nsec / NSEC_PER_SEC)) -
+	       ((double)begin.tv_sec + ((double)begin.tv_nsec / NSEC_PER_SEC));
+	printf("created %llu files in %llu bytes and %f secs => %f bytes/file, %f files/sec\n",
+		tot_files, tot_bytes, secs,
+		(float)tot_bytes / tot_files, (float)tot_files / secs);
+
+	if (dev_fd >= 0)
+		close(dev_fd);
+	if (pair[0] >= 0)
+		close(pair[0]);
+	if (pair[1] >= 0)
+		close(pair[1]);
+	free(super);
+	free(slices);
+	free(buf);
+
+	return 0;
+}
+
+static int parse_low_high(char *str, u64 *low_ret, u64 *high_ret)
+{
+	char *sep;
+	int ret = 0;
+
+	sep = index(str, ':');
+	if (sep) {
+		*sep = '\0';
+		ret = parse_u64(sep + 1, high_ret);
+	}
+
+	if (ret == 0)
+		ret = parse_u64(str, low_ret);
+
+	if (sep)
+		*sep = ':';
+
+	return ret;
+}
+
+int main(int argc, char **argv)
+{
+	struct opts opts = {
+		.buf_size = (32 * 1024 * 1024),
+
+		.write_batch = 1000000,
+		.low_dirs = 5,
+		.high_dirs = 10,
+		.low_files = 10,
+		.high_files = 20,
+		.total_files = 100,
+	};
+	struct sigaction act = { 0 };
+	int ret;
+	int c;
+
+	opts.seed = random64();
+	opts.nr_writers = sysconf(_SC_NPROCESSORS_ONLN);
+
+	while ((c = getopt(argc, argv, "b:d:f:m:n:rs:w:")) != -1) {
+		switch (c) {
+		case 'b':
+			ret = parse_u64(optarg, &opts.write_batch);
+			error_exit(ret, "error parsing -b '%s'\n", optarg);
+			error_exit(opts.write_batch == 0, "-b can't be 0");
+			break;
+		case 'd':
+			ret = parse_low_high(optarg, &opts.low_dirs, &opts.high_dirs);
+			error_exit(ret, "error parsing -d '%s'\n", optarg);
+			break;
+		case 'f':
+			ret = parse_low_high(optarg, &opts.low_files, &opts.high_files);
+			error_exit(ret, "error parsing -f '%s'\n", optarg);
+			break;
+		case 'm':
+			opts.meta_path = strdup(optarg);
+			break;
+		case 'n':
+			ret = parse_u64(optarg, &opts.total_files);
+			error_exit(ret, "error parsing -n '%s'\n", optarg);
+			break;
+		case 'r':
+			opts.read_only = true;
+			break;
+		case 's':
+			ret = parse_u64(optarg, &opts.seed);
+			error_exit(ret, "error parsing -s '%s'\n", optarg);
+			break;
+		case 'w':
+			ret = parse_u64(optarg, &opts.nr_writers);
+			error_exit(ret, "error parsing -w '%s'\n", optarg);
+			break;
+		case '?':
+			printf("Unknown option '%c'\n", optopt);
+			usage();
+			exit(1);
+		}
+	}
+
+	error_exit(opts.low_dirs > opts.high_dirs, "LOW > HIGH in -d %llu:%llu",
+		   opts.low_dirs, opts.high_dirs);
+	error_exit(opts.low_files > opts.high_files, "LOW > HIGH in -f %llu:%llu",
+		   opts.low_files, opts.high_files);
+	error_exit(!opts.meta_path, "must specify metadata device path with -m");
+	srand48(opts.seed);
+	printf("recreate with: -d %llu:%llu -f %llu:%llu -n %llu -s %llu -w %llu\n",
+		opts.low_dirs, opts.high_dirs, opts.low_files, opts.high_files,
+		opts.total_files, opts.seed, opts.nr_writers);
+
+	act.sa_flags = SA_SIGINFO | SA_RESTART;
+	act.sa_sigaction = &sigchld_handler;
+	ret = sigaction(SIGCHLD, &act, NULL);
+	error_exit(ret == -1, "error setting up signal handler"ERRF, ERRA);
+
+	ret = do_restore(&opts);
+
+	free(opts.meta_path);
+
+	return ret == 0 ? 0 : 1;
+}
@@ -0,0 +1,817 @@
+#define _GNU_SOURCE /* O_DIRECT */
+#include <stdlib.h>
+#include <stdio.h>
+#include <stdbool.h>
+#include <unistd.h>
+#include <fcntl.h>
+#include <sys/types.h>
+#include <sys/xattr.h>
+#include <sys/stat.h>
+#include <sys/ioctl.h>
+#include <ctype.h>
+#include <string.h>
+#include <errno.h>
+#include <limits.h>
+#include <time.h>
+#include <sys/prctl.h>
+#include <sys/socket.h>
+#include <sys/signal.h>
+#include <sys/statfs.h>
+#include <dirent.h>
+
+#include "../../utils/src/sparse.h"
+#include "../../utils/src/util.h"
+#include "../../utils/src/list.h"
+#include "../../utils/src/parse.h"
+#include "../../kmod/src/format.h"
+#include "../../kmod/src/ioctl.h"
+#include "../../utils/src/parallel_restore.h"
+
+/*
+ * XXX:
+ */
+
+#define ERRF " errno %d (%s)"
+#define ERRA errno, strerror(errno)
+
+#define error_exit(cond, fmt, args...)			\
+do {							\
+	if (cond) {					\
+		printf("error: "fmt"\n", ##args);	\
+		exit(1);				\
+	}						\
+} while (0)
+
+#define REG_MODE (S_IFREG | 0644)
+#define DIR_MODE (S_IFDIR | 0755)
+#define LNK_MODE (S_IFLNK | 0777)
+
+/*
+ * At about 1k files we seem to be writing about 1MB of data, so
+ * set buffer sizes adequately above that.
+ */
+#define BATCH_FILES 1024
+#define BUF_SIZ (2 * 1024 * 1024)
+
+/*
+ * We can't make duplicate inodes for hardlinked files, so we
+ * will need to track these as we generate them. Not too costly
+ * to do, since it's just an integer, and sorting shouldn't matter
+ * until we get into the millions of entries, hopefully.
+ */
+static struct list_head hardlinks;
+struct hardlink_head {
+	struct list_head head;
+	u64 ino;
+};
+
+struct opts {
+	char *meta_path;
+	char *source_dir;
+};
+
+static bool warn_scoutfs = false;
+
+static void usage(void)
+{
+	printf("usage:\n"
+	       " -m PATH     | path to metadata device\n"
+	       " -s PATH     | path to source directory\n"
+	       );
+}
+
+static size_t write_bufs(struct scoutfs_parallel_restore_writer *wri,
+			 void *buf, int dev_fd)
+{
+	size_t total = 0;
+	size_t count;
+	off_t off;
+	int ret;
+
+	do {
+		ret = scoutfs_parallel_restore_write_buf(wri, buf, BUF_SIZ, &off, &count);
+		error_exit(ret, "write buf %d", ret);
+
+		if (count > 0) {
+			ret = pwrite(dev_fd, buf, count, off);
+			error_exit(ret != count, "pwrite count %zu ret %d", count, ret);
+			total += ret;
+		}
+	} while (count > 0);
+
+	return total;
+}
+
+struct write_result {
+	struct scoutfs_parallel_restore_progress prog;
+	struct scoutfs_parallel_restore_slice slice;
+	__le64 files_created;
+	__le64 dirs_created;
+	__le64 bytes_written;
+	bool complete;
+};
+
+static void write_bufs_and_send(struct scoutfs_parallel_restore_writer *wri,
+				void *buf, int dev_fd,
+				struct write_result *res, bool get_slice, int pair_fd)
+{
+	size_t total;
+	int ret;
+
+	total = write_bufs(wri, buf, dev_fd);
+	le64_add_cpu(&res->bytes_written, total);
+
+	ret = scoutfs_parallel_restore_get_progress(wri, &res->prog);
+	error_exit(ret, "get prog %d", ret);
+
+	if (get_slice) {
+		ret = scoutfs_parallel_restore_get_slice(wri, &res->slice);
+		error_exit(ret, "thread get slice %d", ret);
+	}
+
+	ret = write(pair_fd, res, sizeof(struct write_result));
+	error_exit(ret != sizeof(struct write_result), "result send error");
+
+	memset(res, 0, sizeof(struct write_result));
+}
+
+/*
+ * Adding xattrs is supported for files and directories only.
+ *
+ * If the filesystem on which the path resides isn't scoutfs, we omit the
+ * scoutfs specific ioctl to fetch hidden xattrs.
+ *
+ * Untested if the hidden xattr ioctl works on directories or symlinks.
+ */
+static void add_xattrs(struct scoutfs_parallel_restore_writer *wri, char *path, u64 ino, bool is_scoutfs)
+{
+	struct scoutfs_ioctl_listxattr_hidden lxh;
+	struct scoutfs_parallel_restore_xattr *xattr;
+	char *buf = NULL;
+	char *name = NULL;
+	int fd = -1;
+	int bytes;
+	int len;
+	int value_len;
+	int ret;
+	int pos = 0;
+
+	if (!is_scoutfs)
+		goto normal_xattrs;
+
+	fd = open(path, O_RDONLY);
+	error_exit(fd < 0, "open"ERRF, ERRA);
+
+	memset(&lxh, 0, sizeof(lxh));
+	lxh.id_pos = 0;
+	lxh.hash_pos = 0;
+	lxh.buf_bytes = 256 * 1024;
+
+	buf = malloc(lxh.buf_bytes);
+	error_exit(!buf, "alloc xattr_hidden buf");
+	lxh.buf_ptr = (unsigned long)buf;
+
+	/* hidden */
+	for (;;) {
+		ret = ioctl(fd, SCOUTFS_IOC_LISTXATTR_HIDDEN, &lxh);
+		if (ret == 0) /* done */
+			break;
+		error_exit(ret < 0, "listxattr_hidden"ERRF, ERRA);
+		bytes = ret;
+		error_exit(bytes > lxh.buf_bytes, "listxattr_hidden overflow");
+		error_exit(buf[bytes - 1] != '\0', "listxattr_hidden didn't term");
+
+		name = buf;
+
+		do {
+			len = strlen(name);
+			error_exit(len == 0, "listxattr_hidden empty name");
+			error_exit(len > SCOUTFS_XATTR_MAX_NAME_LEN, "listxattr_hidden long name");
+
+			/* get value len */
+			value_len = fgetxattr(fd, name, NULL, 0);
+			error_exit(value_len < 0, "fgetxattr hidden value len"ERRF, ERRA);
+
+			/* allocate everything at once */
+			xattr = malloc(sizeof(struct scoutfs_parallel_restore_xattr) + len + value_len);
+			error_exit(!xattr, "error allocating generated xattr");
+
+			*xattr = (struct scoutfs_parallel_restore_xattr) {
+				.ino = ino,
+				.pos = pos++,
+				.name_len = len,
+				.value_len = value_len,
+			};
+			xattr->name = (void *)(xattr + 1);
+			xattr->value = (void *)(xattr->name + len);
+
+			/* get value into xattr directly */
+			ret = fgetxattr(fd, name, (void *)(xattr->name + len), value_len);
+			error_exit(ret != value_len, "fgetxattr value"ERRF, ERRA);
+
+			memcpy(xattr->name, name, len);
+
+			ret = scoutfs_parallel_restore_add_xattr(wri, xattr);
+			error_exit(ret, "add hidden xattr %d", ret);
+
+			free(xattr);
+
+			name += len + 1;
+			bytes -= len + 1;
+		} while (bytes > 0);
+	}
+
+	free(buf);
+	close(fd);
+
+normal_xattrs:
+	value_len = listxattr(path, NULL, 0);
+	error_exit(value_len < 0, "listxattr len"ERRF, ERRA);
+	if (value_len == 0)
+		return;
+
+	buf = calloc(1, value_len);
+	error_exit(!buf, "alloc xattr list buf"ERRF, ERRA);
+
+	ret = listxattr(path, buf, value_len);
+	error_exit(ret < 0, "listxattr"ERRF, ERRA);
+
+	name = buf;
+	bytes = ret;
+	do {
+		len = strlen(name);
+
+		error_exit(len == 0, "listxattr empty name");
+		error_exit(len > SCOUTFS_XATTR_MAX_NAME_LEN, "listxattr long name");
+
+		value_len = getxattr(path, name, NULL, 0);
+		error_exit(value_len < 0, "getxattr value len"ERRF, ERRA);
+
+		xattr = malloc(sizeof(struct scoutfs_parallel_restore_xattr) + len + value_len);
+		error_exit(!xattr, "error allocating generated xattr");
+
+		*xattr = (struct scoutfs_parallel_restore_xattr) {
+			.ino = ino,
+			.pos = pos++,
+			.name_len = len,
+			.value_len = value_len,
+		};
+		xattr->name = (void *)(xattr + 1);
+		xattr->value = (void *)(xattr->name + len);
+
+		ret = getxattr(path, name, (void *)(xattr->name + len), value_len);
+		error_exit(ret != value_len, "getxattr value"ERRF, ERRA);
+
+		memcpy(xattr->name, name, len);
+
+		ret = scoutfs_parallel_restore_add_xattr(wri, xattr);
+		error_exit(ret, "add xattr %d", ret);
+
+		free(xattr);
+
+		name += len + 1;
+		bytes -= len + 1;
+	} while (bytes > 0);
+
+	free(buf);
+}
+
+/*
+ * We can't store the same inode multiple times, so we need to make
+ * sure to account for hardlinks. Maintain a linked list that records the
+ * first hardlinked inode we encounter; every subsequent hardlink to that
+ * inode skips inserting the inode and only adds another entry.
+ */
+static bool is_new_inode_item(bool nlink, u64 ino)
+{
+	struct hardlink_head *hh_tmp;
+	struct hardlink_head *hh;
+
+	if (!nlink)
+		return true;
+
+	/* linear search, pretty awful, should be a binary tree */
+	list_for_each_entry_safe(hh, hh_tmp, &hardlinks, head) {
+		if (hh->ino == ino)
+			return false;
+	}
+
+	/* insert item */
+	hh = malloc(sizeof(struct hardlink_head));
+	error_exit(!hh, "malloc");
+	hh->ino = ino;
+	list_add_tail(&hh->head, &hardlinks);
+
+	/*
+	 *  XXX
+	 *
+	 * As long as we don't traverse filesystems, once we've created
+	 * N entries for an N-linked inode we can be confident that it
+	 * can be removed from the list. This would significantly
+	 * improve the manageability of the list.
+	 *
+	 * All we'd need to do is add a counter and compare it to the nr_links
+	 * field of the inode.
+	 */
+
+	return true;
+}
+
+/*
+ * Create the inode data for a given path, duplicating the data from
+ * the source path as closely as possible.
+ */
+static struct scoutfs_parallel_restore_inode *read_inode_data(char *path, u64 ino, bool *nlink, bool is_scoutfs)
+{
+	struct scoutfs_parallel_restore_inode *inode = NULL;
+	struct scoutfs_ioctl_stat_more stm;
+	struct stat st;
+	int ret;
+	int fd;
+
+	inode = calloc(1, sizeof(struct scoutfs_parallel_restore_inode));
+	error_exit(!inode, "failure allocating inode");
+
+	ret = lstat(path, &st);
+	error_exit(ret, "failure stat inode");
+
+	/* use exact inode numbers from path, except for root ino */
+	if (ino != SCOUTFS_ROOT_INO)
+		inode->ino = st.st_ino;
+	else
+		inode->ino = SCOUTFS_ROOT_INO;
+
+	inode->mode = st.st_mode;
+	inode->uid = st.st_uid;
+	inode->gid = st.st_gid;
+	inode->atime = st.st_atim;
+	inode->ctime = st.st_ctim;
+	inode->mtime = st.st_mtim;
+	inode->size = st.st_size;
+
+	inode->rdev = st.st_rdev;
+
+	/* scoutfs specific */
+	inode->meta_seq = 0;
+	inode->data_seq = 0;
+	inode->crtime = st.st_ctim;
+
+	if (S_ISREG(inode->mode)) {
+		if (inode->size > 0)
+			inode->offline = true;
+
+		if (is_scoutfs) {
+			fd = open(path, O_RDONLY);
+			error_exit(fd < 0, "open failure"ERRF, ERRA);
+
+			ret = ioctl(fd, SCOUTFS_IOC_STAT_MORE, &stm);
+			error_exit(ret, "failure SCOUTFS_IOC_STAT_MORE inode");
+
+			inode->meta_seq = stm.meta_seq;
+			inode->data_seq = stm.data_seq;
+			inode->crtime = (struct timespec){.tv_sec = stm.crtime_sec, .tv_nsec = stm.crtime_nsec};
+
+			close(fd);
+		}
+
+	}
+
+	/* pass whether item is hardlinked or not */
+	*nlink = (st.st_nlink > 1);
+
+	return inode;
+}
+
+struct writer_args {
+	struct list_head head;
+
+	int dev_fd;
+	int pair_fd;
+
+	struct scoutfs_parallel_restore_slice slice;
+};
+
+static void restore_path(struct scoutfs_parallel_restore_writer *wri, struct writer_args *args, struct write_result *res, void *buf, char *path, u64 ino)
+{
+	struct scoutfs_parallel_restore_inode *inode;
+	struct scoutfs_parallel_restore_entry *entry;
+	DIR *dirp = NULL;
+	char *subdir = NULL;
+	char link[PATH_MAX + 1];
+	struct dirent *ent;
+	struct statfs stf;
+	int ret = 0;
+	int subdir_count = 0, file_count = 0;
+	size_t ent_len = 0;
+	size_t pos = 0;
+	bool nlink = false;
+	char ind = '?';
+	u64 mode;
+	bool is_scoutfs = false;
+
+	/* get fs info once per path */
+	ret = statfs(path, &stf);
+	error_exit(ret != 0, "statfs"ERRF, ERRA);
+	is_scoutfs = (stf.f_type == 0x554f4353); /* ASCII "SCOU" in little-endian byte order */
+
+	if (!is_scoutfs && !warn_scoutfs) {
+		warn_scoutfs = true;
+		fprintf(stderr, "Non-scoutfs source path detected: scoutfs specific features disabled\n");
+	}
+
+	dirp = opendir(path); /* traverse the entire tree */
+	error_exit(!dirp, "opendir"ERRF, ERRA);
+	errno = 0;
+	while ((ent = readdir(dirp))) {
+		if (ent->d_type == DT_DIR) {
+			if ((strcmp(ent->d_name, ".") == 0) ||
+			    (strcmp(ent->d_name, "..") == 0)) {
+				/* position still matters */
+				pos++;
+				continue;
+			}
+
+			/* recurse into subdir */
+			ret = asprintf(&subdir, "%s/%s", path, ent->d_name);
+			error_exit(ret == -1, "asprintf subdir"ERRF, ERRA);
+			restore_path(wri, args, res, buf, subdir, ent->d_ino);
+
+			subdir_count++;
+
+			ent_len += strlen(ent->d_name);
+
+			entry = malloc(sizeof(struct scoutfs_parallel_restore_entry) + strlen(ent->d_name));
+			error_exit(!entry, "error allocating generated entry");
+
+			*entry = (struct scoutfs_parallel_restore_entry) {
+				.dir_ino = ino,
+				.pos = pos++,
+				.ino = ent->d_ino,
+				.mode = DIR_MODE,
+				.name = (void *)(entry + 1),
+				.name_len = strlen(ent->d_name),
+			};
+
+			memcpy(entry->name, ent->d_name, strlen(ent->d_name));
+			ret = scoutfs_parallel_restore_add_entry(wri, entry);
+			error_exit(ret, "add entry %d", ret);
+			free(entry);
+
+			add_xattrs(wri, subdir, ent->d_ino, is_scoutfs);
+
+			free(subdir);
+
+			le64_add_cpu(&res->dirs_created, 1);
+		} else if (ent->d_type == DT_REG) {
+
+			file_count++;
+
+			ent_len += strlen(ent->d_name);
+
+			entry = malloc(sizeof(struct scoutfs_parallel_restore_entry) + strlen(ent->d_name));
+			error_exit(!entry, "error allocating generated entry");
+
+			*entry = (struct scoutfs_parallel_restore_entry) {
+				.dir_ino = ino,
+				.pos = pos++,
+				.ino = ent->d_ino,
+				.mode = REG_MODE,
+				.name = (void *)(entry + 1),
+				.name_len = strlen(ent->d_name),
+			};
+
+			memcpy(entry->name, ent->d_name, strlen(ent->d_name));
+			ret = scoutfs_parallel_restore_add_entry(wri, entry);
+			error_exit(ret, "add entry %d", ret);
+			free(entry);
+
+			ret = asprintf(&subdir, "%s/%s", path, ent->d_name);
+			error_exit(ret == -1, "asprintf subdir"ERRF, ERRA);
+
+			/* file inode */
+			inode = read_inode_data(subdir, ent->d_ino, &nlink, is_scoutfs);
+			fprintf(stdout, "f %s/%s\n", path, ent->d_name);
+			if (is_new_inode_item(nlink, ent->d_ino)) {
+				ret = scoutfs_parallel_restore_add_inode(wri, inode);
+				error_exit(ret, "add reg file inode %d", ret);
+
+				/* xattrs */
+				add_xattrs(wri, subdir, ent->d_ino, is_scoutfs);
+			}
+			free(inode);
+
+			free(subdir);
+
+			le64_add_cpu(&res->files_created, 1);
+		} else if (ent->d_type == DT_LNK) {
+			/* readlink */
+
+			ret = asprintf(&subdir, "%s/%s", path, ent->d_name);
+			error_exit(ret == -1, "asprintf subdir"ERRF, ERRA);
+
+			ent_len += strlen(ent->d_name);
+
+			ret = readlink(subdir, link, PATH_MAX);
+			error_exit(ret < 0, "readlink %d", ret);
+			/* must 0-terminate if we want to print it */
+			link[ret] = 0;
+
+			entry = malloc(sizeof(struct scoutfs_parallel_restore_entry) + strlen(ent->d_name));
+			error_exit(!entry, "error allocating generated entry");
+
+			*entry = (struct scoutfs_parallel_restore_entry) {
+				.dir_ino = ino,
+				.pos = pos++,
+				.ino = ent->d_ino,
+				.mode = LNK_MODE,
+				.name = (void *)(entry + 1),
+				.name_len = strlen(ent->d_name),
+			};
+
+			memcpy(entry->name, ent->d_name, strlen(ent->d_name));
+			ret = scoutfs_parallel_restore_add_entry(wri, entry);
+			error_exit(ret, "add symlink entry %d", ret);
+			free(entry);
+			/* link inode */
+			inode = read_inode_data(subdir, ent->d_ino, &nlink, is_scoutfs);
+
+			fprintf(stdout, "l %s/%s -> %s\n", path, ent->d_name, link);
+
+			inode->mode = LNK_MODE;
+			inode->target = link;
+			inode->target_len = strlen(link) + 1; /* scoutfs null terminates symlinks */
+
+			ret = scoutfs_parallel_restore_add_inode(wri, inode);
+			error_exit(ret, "add syml inode %d", ret);
+
+			free(inode);
+			free(subdir);
+
+			le64_add_cpu(&res->files_created, 1);
+		} else {
+			/* odd stuff */
+			switch(ent->d_type) {
+			case DT_CHR:
+				ind = 'c';
+				mode = S_IFCHR;
+				break;
+			case DT_BLK:
+				ind = 'b';
+				mode = S_IFBLK;
+				break;
+			case DT_FIFO:
+				ind = 'p';
+				mode = S_IFIFO;
+				break;
+			case DT_SOCK:
+				ind = 's';
+				mode = S_IFSOCK;
+				break;
+			default:
+				error_exit(true, "Unknown readdir entry type");
+				break;
+			}
+
+			file_count++;
+
+			ent_len += strlen(ent->d_name);
+
+			entry = malloc(sizeof(struct scoutfs_parallel_restore_entry) + strlen(ent->d_name));
+			error_exit(!entry, "error allocating generated entry");
+
+			*entry = (struct scoutfs_parallel_restore_entry) {
+				.dir_ino = ino,
+				.pos = pos++,
+				.ino = ent->d_ino,
+				.mode = mode,
+				.name = (void *)(entry + 1),
+				.name_len = strlen(ent->d_name),
+			};
+
+			memcpy(entry->name, ent->d_name, strlen(ent->d_name));
+			ret = scoutfs_parallel_restore_add_entry(wri, entry);
+			error_exit(ret, "add entry %d", ret);
+
+			free(entry);
+
+			ret = asprintf(&subdir, "%s/%s", path, ent->d_name);
+			error_exit(ret == -1, "asprintf subdir"ERRF, ERRA);
+
+			/* file inode */
+			inode = read_inode_data(subdir, ent->d_ino, &nlink, is_scoutfs);
+			fprintf(stdout, "%c %llu %s/%s\n", ind, inode->ino, path, ent->d_name);
+			if (is_new_inode_item(nlink, ent->d_ino)) {
+				ret = scoutfs_parallel_restore_add_inode(wri, inode);
+				error_exit(ret, "add special file inode %d", ret);
+			}
+			free(inode);
+
+			free(subdir);
+
+			le64_add_cpu(&res->files_created, 1);
+		}
+
+		/* batch out changes, will be about 1M */
+		if (le64_to_cpu(res->files_created) > BATCH_FILES) {
+			write_bufs_and_send(wri, buf, args->dev_fd, res, false, args->pair_fd);
+		}
+		errno = 0; /* clear stale errno before the next readdir() */
+	}
+	/* ent is NULL here: errno tells end of directory from a readdir error */
+	error_exit(errno, "readdir"ERRF, ERRA);
+	closedir(dirp);
+
+	/* create the dir itself */
+	inode = read_inode_data(path, ino, &nlink, is_scoutfs);
+	inode->nr_subdirs = subdir_count;
+	inode->total_entry_name_bytes = ent_len;
+	fprintf(stdout, "d %s\n", path);
+
+	ret = scoutfs_parallel_restore_add_inode(wri, inode);
+	error_exit(ret, "add dir inode %d", ret);
+
+	free(inode);
+
+	/* No need to send, we'll send final after last directory is complete */
+}
+
+static int do_restore(struct opts *opts)
+{
+	struct scoutfs_parallel_restore_writer *pwri, *wri = NULL;
+	struct scoutfs_parallel_restore_slice *slices = NULL;
+	struct scoutfs_super_block *super = NULL;
+	struct writer_args *args;
+	struct write_result res;
+	int pair[2] = {-1, -1};
+	LIST_HEAD(writers);
+	void *buf = NULL;
+	void *bufp = NULL;
+	int dev_fd = -1;
+	pid_t pid;
+	int ret;
+	u64 tot_bytes;
+	u64 tot_dirs;
+	u64 tot_files;
+
+	ret = socketpair(PF_LOCAL, SOCK_STREAM, 0, pair);
+	error_exit(ret, "socketpair error "ERRF, ERRA);
+
+	dev_fd = open(opts->meta_path, O_DIRECT | (O_RDWR|O_EXCL));
+	error_exit(dev_fd < 0, "error opening '%s': "ERRF, opts->meta_path, ERRA);
+
+	errno = posix_memalign((void **)&super, 4096, SCOUTFS_BLOCK_SM_SIZE) ?:
+		posix_memalign((void **)&buf, 4096, BUF_SIZ);
+	error_exit(errno, "error allocating block bufs "ERRF, ERRA);
+
+	ret = pread(dev_fd, super, SCOUTFS_BLOCK_SM_SIZE,
+		    SCOUTFS_SUPER_BLKNO << SCOUTFS_BLOCK_SM_SHIFT);
+	error_exit(ret != SCOUTFS_BLOCK_SM_SIZE, "error reading super, ret %d", ret);
+
+	error_exit((super->flags & SCOUTFS_FLAG_IS_META_BDEV) == 0, "super block is not meta dev");
+
+	ret = scoutfs_parallel_restore_create_writer(&wri);
+	error_exit(ret, "create writer %d", ret);
+
+	ret = scoutfs_parallel_restore_import_super(wri, super, dev_fd);
+	error_exit(ret, "import super %d", ret);
+
+	slices = calloc(2, sizeof(struct scoutfs_parallel_restore_slice));
+	error_exit(!slices, "alloc slices");
+
+	scoutfs_parallel_restore_init_slices(wri, slices, 2);
+
+	ret = scoutfs_parallel_restore_add_slice(wri, &slices[0]);
+	error_exit(ret, "add slices[0] %d", ret);
+
+	args = calloc(1, sizeof(struct writer_args));
+	error_exit(!args, "alloc writer args");
+
+	args->dev_fd = dev_fd;
+	args->slice = slices[1];
+	args->pair_fd = pair[1];
+	list_add_tail(&args->head, &writers);
+
+	/* fork writer process */
+	pid = fork();
+	error_exit(pid == -1, "fork error");
+
+	if (pid == 0) {
+		ret = prctl(PR_SET_PDEATHSIG, SIGHUP);
+		error_exit(ret < 0, "failed to set parent death sig");
+
+		errno = posix_memalign((void **)&bufp, 4096, BUF_SIZ);
+		error_exit(errno, "error allocating block bufp "ERRF, ERRA);
+
+		ret = scoutfs_parallel_restore_create_writer(&pwri);
+		error_exit(ret, "create pwriter %d", ret);
+
+		ret = scoutfs_parallel_restore_add_slice(pwri, &args->slice);
+		error_exit(ret, "add pslice %d", ret);
+
+		memset(&res, 0, sizeof(res));
+
+		restore_path(pwri, args, &res, bufp, opts->source_dir, SCOUTFS_ROOT_INO);
+
+		res.complete = true;
+
+		write_bufs_and_send(pwri, bufp, args->dev_fd, &res, true, args->pair_fd);
+
+		scoutfs_parallel_restore_destroy_writer(&pwri);
+		free(bufp);
+
+		exit(0);
+	}
+
+	/* read results and wait for writer to finish */
+	tot_bytes = 0;
+	tot_dirs = 1;
+	tot_files = 0;
+	for (;;) {
+		ret = read(pair[0], &res, sizeof(struct write_result));
+		error_exit(ret != sizeof(struct write_result), "result read error %d", ret);
+
+		ret = scoutfs_parallel_restore_add_progress(wri, &res.prog);
+		error_exit(ret, "add thr prog %d", ret);
+
+		tot_bytes += le64_to_cpu(res.bytes_written);
+		tot_files += le64_to_cpu(res.files_created);
+		tot_dirs += le64_to_cpu(res.dirs_created);
+
+		if (res.slice.meta_len != 0) {
+			ret = scoutfs_parallel_restore_add_slice(wri, &res.slice);
+			error_exit(ret, "add thr slice %d", ret);
+
+			if (res.complete)
+				break;
+		}
+	}
+
+	tot_bytes += write_bufs(wri, buf, args->dev_fd);
+
+	fprintf(stdout, "Wrote %llu directories, %llu files, %llu bytes total\n",
+		tot_dirs, tot_files, tot_bytes);
+
+	/* write super to finalize */
+	ret = scoutfs_parallel_restore_export_super(wri, super);
+	error_exit(ret, "update super %d", ret);
+
+	ret = pwrite(dev_fd, super, SCOUTFS_BLOCK_SM_SIZE,
+		     SCOUTFS_SUPER_BLKNO << SCOUTFS_BLOCK_SM_SHIFT);
+	error_exit(ret != SCOUTFS_BLOCK_SM_SIZE, "error writing super, ret %d", ret);
+
+	scoutfs_parallel_restore_destroy_writer(&wri);
+
+	if (dev_fd >= 0)
+		close(dev_fd);
+	if (pair[0] >= 0)
+		close(pair[0]);
+	if (pair[1] >= 0)
+		close(pair[1]);
+	free(super);
+	free(args);
+	free(slices);
+	free(buf);
+
+	return 0;
+}
+
+int main(int argc, char **argv)
+{
+	struct opts opts = (struct opts){ 0 };
+	struct hardlink_head *hh_tmp;
+	struct hardlink_head *hh;
+	int ret;
+	int c;
+
+	INIT_LIST_HEAD(&hardlinks);
+
+        while ((c = getopt(argc, argv, "m:s:")) != -1) {
+                switch(c) {
+                case 'm':
+                        opts.meta_path = strdup(optarg);
+                        break;
+		case 's':
+			opts.source_dir = strdup(optarg);
+			break;
+                case '?':
+                        printf("Unknown option '%c'\n", optopt);
+                        usage();
+			exit(1);
+                }
+        }
+
+	error_exit(!opts.meta_path, "must specify metadata device path with -m");
+	error_exit(!opts.source_dir, "must specify source directory path with -s");
+
+	ret = do_restore(&opts);
+
+	free(opts.meta_path);
+	free(opts.source_dir);
+
+	list_for_each_entry_safe(hh, hh_tmp, &hardlinks, head) {
+		list_del_init(&hh->head);
+		free(hh);
+	}
+
+	return ret == 0 ? 0 : 1;
+}
@@ -11,7 +11,7 @@ FILE="$T_D0/file"
 # final block as we truncated past it.
 #
 echo "== truncate writes zeroed partial end of file block"
-yes 2>/dev/null | dd of="$FILE" bs=8K count=1 status=none iflag=fullblock
+yes | dd of="$FILE" bs=8K count=1 status=none iflag=fullblock
 sync

 # not passing iflag=fullblock causes the file occasionally to just be
@@ -88,11 +88,6 @@ rm -rf "$SCR/xattrs"

 echo "== make sure we can create again"
 file="$SCR/file-after"
-C=120
-while (( C-- )); do
-	touch $file 2> /dev/null && break
-	sleep 1
-done
 touch $file
 setfattr -n user.scoutfs-enospc -v 1 "$file"
 sync
@@ -38,6 +38,6 @@ while [ "$SECONDS" -lt "$END" ]; do
 done

 echo "== stopping background load"
-t_silent_kill $load_pids
+kill $load_pids

 t_pass
@@ -1,40 +0,0 @@
-#
-# We had a lock server refcounting bug that could let one thread get a
-# reference on a lock struct that was being freed by another thread.  We
-# were able to reproduce this by having all clients try and produce a
-# lot of read and null requests.
-#
-# This will manfiest as a hung lock and timed out test runs, probably
-# with hung task messages on the console.  Depending on how the race
-# turns out, it can trigger KASAN warnings in
-# process_waiting_requests().
-#
-
-READERS_PER=3
-SECS=30
-
-echo "=== setup"
-touch "$T_D0/file"
-
-echo "=== spin reading and shrinking"
-END=$((SECONDS + SECS))
-for m in $(t_fs_nrs); do
-	eval file="\$T_D${m}/file"
-
-	# lots of tasks reading as fast as they can
-	for t in $(seq 1 $READERS_PER); do
-		(while [ $SECONDS -lt $END ]; do
-			stat $file > /dev/null
-		 done) &
-	done
-	# one task shrinking (triggering null requests) and reading
-	(while [ $SECONDS -lt $END ]; do
-		stat $file > /dev/null
-		t_trigger_arm_silent statfs_lock_purge $m
-		stat -f "$file" > /dev/null
-	 done) &
-done
-
-wait
-
-t_pass
@@ -1,54 +0,0 @@
-#
-# test mmap() and normal read/write consistency between different nodes
-#
-
-t_require_commands mmap_stress mmap_validate scoutfs xfs_io
-
-echo "== mmap_stress"
-mmap_stress 8192 2000 "$T_D0/mmap_stress" "$T_D1/mmap_stress" "$T_D2/mmap_stress" "$T_D3/mmap_stress" "$T_D4/mmap_stress" | sed 's/:.*//g' | sort
-
-echo "== basic mmap/read/write consistency checks"
-mmap_validate 256 1000 "$T_D0/mmap_val1" "$T_D1/mmap_val1"
-mmap_validate 8192 1000 "$T_D0/mmap_val2" "$T_D1/mmap_val2"
-mmap_validate 88400 1000 "$T_D0/mmap_val3" "$T_D1/mmap_val3"
-
-echo "== mmap read from offline extent"
-F="$T_D0/mmap-offline"
-touch "$F"
-xfs_io -c "pwrite -S 0xEA 0 8192" "$F" > /dev/null
-cp "$F" "${F}-stage"
-vers=$(scoutfs stat -s data_version "$F")
-scoutfs release "$F" -V "$vers" -o 0 -l 8192
-scoutfs get-fiemap -L "$F"
-xfs_io -c "mmap -rwx 0 8192" \
-	-c "mread -v 512 16" "$F" &
-sleep 1
-# should be 1 - data waiting
-jobs | wc -l
-scoutfs stage "${F}-stage" "$F" -V "$vers" -o 0 -l 8192
-# xfs_io thread <here> will output 16 bytes of read data
-sleep 1
-# should be 0 - no more waiting jobs, xfs_io should have exited
-jobs | wc -l
-scoutfs get-fiemap -L "$F"
-
-echo "== mmap write to an offline extent"
-# reuse the same file
-scoutfs release "$F" -V "$vers" -o 0 -l 8192
-scoutfs get-fiemap -L "$F"
-xfs_io -c "mmap -rwx 0 8192" \
-	-c "mwrite -S 0x11 528 16" "$F" &
-sleep 1
-# should be 1 job waiting
-jobs | wc -l
-scoutfs stage "${F}-stage" "$F" -V "$vers" -o 0 -l 8192
-# no output here from write
-sleep 1
-# should be 0 - no more waiting jobs, xfs_io should have exited
-jobs | wc -l
-scoutfs get-fiemap -L "$F"
-# read back contents to assure write changed the file
-dd status=none if="$F" bs=1 count=48 skip=512 | hexdump -C
-
-echo "== done"
-t_pass
@@ -5,6 +5,18 @@
 t_require_commands sleep touch sync stat handle_cat kill rm
 t_require_mounts 2

+#
+# usually bash prints an annoying output message when jobs
+# are killed.  We can avoid that by redirecting stderr for
+# the bash process when it reaps the jobs that are killed.
+#
+silent_kill() {
+	exec {ERR}>&2 2>/dev/null
+	kill "$@"
+	wait "$@"
+	exec 2>&$ERR {ERR}>&-
+}
+
 #
 # We don't have a great way to test that inode items still exist.   We
 # don't prevent opening handles with nlink 0 today, so we'll use that.
@@ -40,7 +52,7 @@ inode_exists $ino || echo "$ino didn't exist"

 echo "== orphan from failed evict deletion is picked up"
 # pending kill signal stops evict from getting locks and deleting
-t_silent_kill $pid
+silent_kill $pid
 t_set_sysfs_mount_option 0 orphan_scan_delay_ms 1000
 sleep 5
 inode_exists $ino && echo "$ino still exists"
@@ -58,7 +70,7 @@ for nr in $(t_fs_nrs); do
 	rm -f "$path"
 done
 sync
-t_silent_kill $pids
+silent_kill $pids
 for nr in $(t_fs_nrs); do
 	t_force_umount $nr
 done
@@ -70,15 +82,7 @@ done
 # wait for orphan scans to run
 t_set_all_sysfs_mount_options orphan_scan_delay_ms 1000
 # also have to wait for delayed log merge work from mount
-C=120
-while (( C-- )); do
-	brk=1
-	for ino in $inos; do
-		inode_exists $ino && brk=0
-	done
-	test $brk -eq 1 && break
-	sleep 1
-done
+sleep 15
 for ino in $inos; do
 	inode_exists $ino && echo "$ino still exists"
 done
@@ -127,7 +131,7 @@ while [ $SECONDS -lt $END ]; do
 	done

 	# trigger eviction deletion of each file in each mount
-	t_silent_kill $pids
+	silent_kill $pids

 	wait || t_fail "handle_fsetxattr failed"

@@ -0,0 +1,78 @@
+#
+# validate parallel restore library
+#
+
+t_require_commands scoutfs parallel_restore find xargs
+
+SCR="$T_TMPDIR/mnt.scratch"
+mkdir -p "$SCR"
+
+scratch_mkfs() {
+	scoutfs mkfs "$@" \
+		-A -f -Q 0,127.0.0.1,53000 $T_EX_META_DEV $T_EX_DATA_DEV
+}
+
+scratch_check() {
+	# give ample time for writes to commit
+	sleep 1
+	sync
+	scoutfs check -d ${T_TMPDIR}/check.debug $T_EX_META_DEV $T_EX_DATA_DEV
+}
+
+scratch_mount() {
+	mount -t scoutfs -o metadev_path=$T_EX_META_DEV,quorum_slot_nr=0 $T_EX_DATA_DEV $SCR
+}
+
+
+echo "== simple mkfs/restore/mount"
+# meta device just big enough for reserves and the metadata we'll fill
+scratch_mkfs -V 2 -m 10G -d 60G > $T_TMP.mkfs.out 2>&1 || t_fail "mkfs failed"
+parallel_restore -m "$T_EX_META_DEV" > /dev/null || t_fail "parallel_restore"
+scratch_check || t_fail "check failed"
+scratch_mount
+
+scoutfs statfs -p "$SCR" | grep -v -e 'fsid' -e 'rid'
+find "$SCR" -exec scoutfs list-hidden-xattrs {} \; | wc
+scoutfs search-xattrs -p "$SCR" scoutfs.hide.srch.sam_vol_F01030L6 -p "$SCR" | wc
+find "$SCR" -type f -name "file-*" | head -n 4 | xargs -n 1 scoutfs get-fiemap -L
+scoutfs df -p "$SCR"
+scoutfs quota-list -p "$SCR"
+umount "$SCR"
+scratch_check || t_fail "check after mount failed"
+
+echo "== just under ENOSPC"
+scratch_mkfs -V 2 -m 10G -d 60G > $T_TMP.mkfs.out 2>&1 || t_fail "mkfs failed"
+parallel_restore -m "$T_EX_META_DEV" -n 3000000 > /dev/null || t_fail "parallel_restore"
+scratch_check || t_fail "check failed"
+scratch_mount
+scoutfs df -p "$SCR"
+umount "$SCR"
+scratch_check || t_fail "check after mount failed"
+
+echo "== just over ENOSPC"
+scratch_mkfs -V 2 -m 10G -d 60G > $T_TMP.mkfs.out 2>&1 || t_fail "mkfs failed"
+parallel_restore -m "$T_EX_META_DEV" -n 3500000 | grep died 2>&1 && t_fail "parallel_restore"
+scratch_check || t_fail "check failed"
+
+echo "== ENOSPC"
+scratch_mkfs -V 2 -m 10G -d 60G > $T_TMP.mkfs.out 2>&1 || t_fail "mkfs failed"
+parallel_restore -m "$T_EX_META_DEV" -d 600:1000 -f 600:1000 -n 4000000 | grep died 2>&1 && t_fail "parallel_restore"
+
+echo "== attempt to restore data device"
+scratch_mkfs -V 2 -m 10G -d 60G > $T_TMP.mkfs.out 2>&1 || t_fail "mkfs failed"
+parallel_restore -m "$T_EX_DATA_DEV" | grep died 2>&1 && t_fail "parallel_restore"
+
+echo "== attempt format_v1 restore"
+scratch_mkfs -V 1 -m 10G -d 60G > $T_TMP.mkfs.out 2>&1 || t_fail "mkfs failed"
+parallel_restore -m "$T_EX_META_DEV" | grep died 2>&1 && t_fail "parallel_restore"
+
+echo "== test if previously mounted"
+scratch_mkfs -V 2 -m 10G -d 60G > $T_TMP.mkfs.out 2>&1 || t_fail "mkfs failed"
+mount -t scoutfs -o metadev_path=$T_EX_META_DEV,quorum_slot_nr=0 \
+	"$T_EX_DATA_DEV" "$SCR"
+umount "$SCR"
+parallel_restore -m "$T_EX_META_DEV" | grep died 2>&1 && t_fail "parallel_restore"
+
+echo "== cleanup"
+rmdir "$SCR"
+t_pass
@@ -1,37 +0,0 @@
-#
-# verify d_off output of xfs_io is consistent.
-#
-
-t_require_commands xfs_io
-
-filt()
-{
-	grep d_off | cut -d ' ' -f 1,4-
-}
-
-echo "== create content"
-for s in $(seq 1 7 250); do
-	f=$(printf '%*s' $s | tr ' ' 'a')
-	touch ${T_D0}/$f
-done
-
-echo "== readdir all"
-xfs_io -c "readdir -v" $T_D0 | filt
-
-echo "== readdir offset"
-xfs_io -c "readdir -v -o 20" $T_D0 | filt
-
-echo "== readdir len (bytes)"
-xfs_io -c "readdir -v -l 193" $T_D0 | filt
-
-echo "== introduce gap"
-for s in $(seq 57 7 120); do
-	f=$(printf '%*s' $s | tr ' ' 'a')
-	rm -f ${T_D0}/$f
-done
-xfs_io -c "readdir -v" $T_D0 | filt
-
-echo "== cleanup"
-rm -rf $T_D0
-
-t_pass
@@ -65,14 +65,26 @@ EOF

 cat << EOF > local.exclude
 generic/003	# missing atime update in buffered read
+generic/029	# mmap missing
+generic/030	# mmap missing
 generic/075	# file content mismatch failures (fds, etc)
+generic/080	# mmap missing
 generic/103	# enospc causes trans commit failures
 generic/108	# mount fails on failing device?
 generic/112	# file content mismatch failures (fds, etc)
+generic/120	# (can't exec 'cause no mmap)
+generic/126	# (can't exec 'cause no mmap)
+generic/141	# mmap missing
 generic/213	# enospc causes trans commit failures
+generic/215	# mmap missing
+generic/246	# mmap missing
+generic/247	# mmap missing
+generic/248	# mmap missing
 generic/318	# can't support user namespaces until v5.11
 generic/321	# requires selinux enabled for '+' in ls?
+generic/325	# mmap missing
 generic/338	# BUG_ON update inode error handling
+generic/346	# mmap missing
 generic/347	# _dmthin_mount doesn't work?
 generic/356	# swap
 generic/357	# swap
@@ -80,13 +92,16 @@ generic/409	# bind mounts not scripted yet
 generic/410	# bind mounts not scripted yet
 generic/411	# bind mounts not scripted yet
 generic/423	# symlink inode size is strlen() + 1 on scoutfs
+generic/428	# mmap missing
 generic/430	# xfs_io copy_range missing in el7
 generic/431	# xfs_io copy_range missing in el7
 generic/432	# xfs_io copy_range missing in el7
 generic/433	# xfs_io copy_range missing in el7
 generic/434	# xfs_io copy_range missing in el7
+generic/437	# mmap missing
 generic/441	# dm-mapper
 generic/444	# el9's posix_acl_update_mode is buggy ?
+generic/452	# exec test - no mmap
 generic/467	# open_by_handle ESTALE
 generic/472	# swap
 generic/484	# dm-mapper
@@ -103,9 +118,11 @@ generic/565	# xfs_io copy_range missing in el7
 generic/568	# falloc not resulting in block count increase
 generic/569	# swap
 generic/570	# swap
+generic/614	# mmap missing
 generic/620	# dm-hugedisk
-generic/633	# id-mapped mounts missing in el7
+generic/633	# mmap, id-mapped mounts missing in el7
 generic/636	# swap
+generic/638	# mmap missing
 generic/641	# swap
 generic/643	# swap
 EOF
@@ -7,7 +7,7 @@ FMTIOC_H := format.h ioctl.h
 FMTIOC_KMOD := $(addprefix ../kmod/src/,$(FMTIOC_H))

 CFLAGS := -Wall -O2 -Werror -D_FILE_OFFSET_BITS=64 -g -msse4.2 \
-	-fno-strict-aliasing \
+	-I src/ -fno-strict-aliasing \
 	-DSCOUTFS_FORMAT_HASH=0x$(SCOUTFS_FORMAT_HASH)LLU

 ifneq ($(wildcard $(firstword $(FMTIOC_KMOD))),)
@@ -15,10 +15,13 @@ CFLAGS += -I../kmod/src
 endif

 BIN := src/scoutfs
-OBJ := $(patsubst %.c,%.o,$(wildcard src/*.c))
-DEPS := $(wildcard */*.d)
+OBJ_DIRS := src src/check
+OBJ := $(foreach dir,$(OBJ_DIRS),$(patsubst %.c,%.o,$(wildcard $(dir)/*.c)))
+DEPS := $(foreach dir,$(OBJ_DIRS),$(wildcard $(dir)/*.d))

-all: $(BIN)
+AR := src/scoutfs_parallel_restore.a
+
+all: $(BIN) $(AR)

 ifneq ($(DEPS),)
 -include $(DEPS)
@@ -36,6 +39,10 @@ $(BIN): $(OBJ)
 	$(QU)  [BIN $@]
 	$(VE)gcc -o $@ $^ -luuid -lm -lcrypto -lblkid

+$(AR): $(OBJ)
+	$(QU)  [AR $@]
+	$(VE)ar rcs $@ $^
+
 %.o %.d: %.c Makefile sparse.sh
 	$(QU)  [CC $<]
 	$(VE)gcc $(CFLAGS) -MD -MP -MF $*.d -c $< -o $*.o
@@ -76,6 +76,41 @@ run when the file system will not be mounted.
 .RE
 .PD

+.TP
+.BI "check META-DEVICE DATA-DEVICE [-d|--debug FILE]"
+.sp
+Performs an offline file system check. The program iterates through the
+on-disk data structures directly, so the file system must not be mounted
+while the check is running.
+.RS 1.0i
+.PD 0
+.sp
+.TP
+.B "-d, --debug FILE"
+A file to which the program writes debug information about the
+state of the file system as it performs the check. If
+.B FILE
+is "-", the debug output is written to standard error.
+.TP
+.RE
+.sp
+.B RETURN VALUE
+The check command exits with one of the following codes:
+.RS
+.TP
+\fB 0 \fR - no filesystem issues detected
+.TP
+\fB 4 \fR - file system issues were detected
+.TP
+\fB 8 \fR - operational error
+.TP
+\fB 16 \fR - usage error
+.TP
+\fB 32 \fR - cancelled by user (SIGINT)
+.TP
+.RE
+.PD
+
 .TP
 .BI "counters [-t|--table] SYSFS-DIR"
 .sp
@@ -54,6 +54,8 @@ cp man/*.8.gz $RPM_BUILD_ROOT%{_mandir}/man8/.
 install -m 755 -D src/scoutfs $RPM_BUILD_ROOT%{_sbindir}/scoutfs
 install -m 644 -D src/ioctl.h $RPM_BUILD_ROOT%{_includedir}/scoutfs/ioctl.h
 install -m 644 -D src/format.h $RPM_BUILD_ROOT%{_includedir}/scoutfs/format.h
+install -m 644 -D src/parallel_restore.h $RPM_BUILD_ROOT%{_includedir}/scoutfs/parallel_restore.h
+install -m 644 -D src/scoutfs_parallel_restore.a $RPM_BUILD_ROOT%{_libdir}/scoutfs/libscoutfs_parallel_restore.a
 install -m 755 -D fenced/scoutfs-fenced $RPM_BUILD_ROOT%{_libexecdir}/scoutfs-fenced/scoutfs-fenced
 install -m 644 -D fenced/scoutfs-fenced.service $RPM_BUILD_ROOT%{_unitdir}/scoutfs-fenced.service
 install -m 644 -D fenced/scoutfs-fenced.conf.example $RPM_BUILD_ROOT%{_sysconfdir}/scoutfs/scoutfs-fenced.conf.example
@@ -70,6 +72,7 @@ install -m 644 -D fenced/scoutfs-fenced.conf.example $RPM_BUILD_ROOT%{_sysconfdi
 %files -n scoutfs-devel
 %defattr(644,root,root,755)
 %{_includedir}/scoutfs
+%{_libdir}/scoutfs

 %clean
 rm -rf %{buildroot}
@@ -0,0 +1,166 @@
+#include <unistd.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <sys/mman.h>
+#include <errno.h>
+
+#include "sparse.h"
+#include "util.h"
+#include "format.h"
+#include "bitmap.h"
+#include "key.h"
+
+#include "alloc.h"
+#include "block.h"
+#include "btree.h"
+#include "extent.h"
+#include "iter.h"
+#include "sns.h"
+
+/*
+ * We check the list blocks serially.
+ *
+ * XXX:
+ *  - compare ref seqs
+ *  - detect cycles?
+ */
+int alloc_list_meta_iter(struct scoutfs_alloc_list_head *lhead, extent_cb_t cb, void *cb_arg)
+{
+	struct scoutfs_alloc_list_block *lblk;
+	struct scoutfs_block_ref ref;
+	struct block *blk = NULL;
+	u64 blkno;
+	int ret;
+
+	ref = lhead->ref;
+
+	while (ref.blkno) {
+		blkno = le64_to_cpu(ref.blkno);
+
+		ret = cb(blkno, 1, cb_arg);
+		if (ret < 0) {
+			ret = xlate_iter_errno(ret);
+			goto out;
+		}
+
+		ret = block_get(&blk, blkno, 0);
+		if (ret < 0)
+			goto out;
+
+		lblk = block_buf(blk);
+		/* XXX verify block */
+		ret = block_hdr_valid(blk, blkno, 0, SCOUTFS_BLOCK_MAGIC_ALLOC_LIST);
+		if (ret < 0)
+			goto out;
+
+		/* XXX sort?   maybe */
+
+		ref = lblk->next;
+
+		block_put(&blk);
+	}
+
+	ret = 0;
+out:
+	return ret;
+}
+
+int alloc_root_meta_iter(struct scoutfs_alloc_root *root, extent_cb_t cb, void *cb_arg)
+{
+	return btree_meta_iter(&root->root, cb, cb_arg);
+}
+
+int alloc_list_extent_iter(struct scoutfs_alloc_list_head *lhead, extent_cb_t cb, void *cb_arg)
+{
+	struct scoutfs_alloc_list_block *lblk;
+	struct scoutfs_block_ref ref;
+	struct block *blk = NULL;
+	u64 blkno;
+	int ret;
+	int i;
+
+	ref = lhead->ref;
+
+	while (ref.blkno) {
+		blkno = le64_to_cpu(ref.blkno);
+
+		ret = block_get(&blk, blkno, 0);
+		if (ret < 0)
+			goto out;
+
+		sns_push("alloc_list_block", blkno, 0);
+
+		lblk = block_buf(blk);
+		/* XXX verify block */
+		ret = block_hdr_valid(blk, blkno, 0, SCOUTFS_BLOCK_MAGIC_ALLOC_LIST);
+		if (ret < 0)
+			goto out;
+		/* XXX sort?   maybe */
+
+		ret = 0;
+		for (i = 0; i < le32_to_cpu(lblk->nr); i++) {
+			blkno = le64_to_cpu(lblk->blknos[le32_to_cpu(lblk->start) + i]);
+
+			ret = cb(blkno, 1, cb_arg);
+			if (ret < 0)
+				break;
+		}
+
+		ref = lblk->next;
+
+		block_put(&blk);
+		sns_pop();
+		if (ret < 0) {
+			ret = xlate_iter_errno(ret);
+			goto out;
+		}
+	}
+
+	ret = 0;
+out:
+	return ret;
+}
+
+static bool valid_free_extent_key(struct scoutfs_key *key)
+{
+	return (key->sk_zone == SCOUTFS_FREE_EXTENT_BLKNO_ZONE ||
+	        key->sk_zone == SCOUTFS_FREE_EXTENT_ORDER_ZONE) &&
+	       (!key->_sk_fourth && !key->sk_type &&
+		(key->sk_zone == SCOUTFS_FREE_EXTENT_ORDER_ZONE || !key->_sk_third));
+}
+
+static int free_item_cb(struct scoutfs_key *key, void *val, u16 val_len, void *cb_arg)
+{
+	struct extent_cb_arg_t *ecba = cb_arg;
+	u64 start;
+	u64 len;
+
+	/* XXX not sure these eios are what we want */
+
+	if (val_len != 0)
+		return -EIO;
+
+	if (!valid_free_extent_key(key))
+		return -EIO;
+
+	if (key->sk_zone == SCOUTFS_FREE_EXTENT_ORDER_ZONE)
+		return -ECHECK_ITER_DONE;
+
+	start = le64_to_cpu(key->skfb_end) - le64_to_cpu(key->skfb_len) + 1;
+	len = le64_to_cpu(key->skfb_len);
+
+	return ecba->cb(start, len, ecba->cb_arg);
+}
+
+/*
+ * Call the callback with each of the primary BLKNO free extents stored
+ * in items in the given alloc root.  It doesn't visit the secondary
+ * ORDER extents.
+ */
+int alloc_root_extent_iter(struct scoutfs_alloc_root *root, extent_cb_t cb, void *cb_arg)
+{
+	struct extent_cb_arg_t ecba = { .cb = cb, .cb_arg = cb_arg };
+
+	return btree_item_iter(&root->root, free_item_cb, &ecba);
+}
@@ -0,0 +1,12 @@
+#ifndef _SCOUTFS_UTILS_CHECK_ALLOC_H
+#define _SCOUTFS_UTILS_CHECK_ALLOC_H
+
+#include "extent.h"
+
+int alloc_list_meta_iter(struct scoutfs_alloc_list_head *lhead, extent_cb_t cb, void *cb_arg);
+int alloc_root_meta_iter(struct scoutfs_alloc_root *root, extent_cb_t cb, void *cb_arg);
+
+int alloc_list_extent_iter(struct scoutfs_alloc_list_head *lhead, extent_cb_t cb, void *cb_arg);
+int alloc_root_extent_iter(struct scoutfs_alloc_root *root, extent_cb_t cb, void *cb_arg);
+
+#endif
@@ -0,0 +1,613 @@
+#define _ISOC11_SOURCE /* aligned_alloc */
+#define _DEFAULT_SOURCE /* syscall() */
+#include <stdlib.h>
+#include <unistd.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <errno.h>
+#include <sys/syscall.h>
+#include <linux/aio_abi.h>
+
+#include "sparse.h"
+#include "util.h"
+#include "format.h"
+#include "list.h"
+#include "cmp.h"
+#include "hash.h"
+
+#include "block.h"
+#include "debug.h"
+#include "super.h"
+#include "eno.h"
+#include "crc.h"
+#include "sns.h"
+
+static struct block_data {
+	struct list_head *hash_lists;
+	size_t hash_nr;
+
+	struct list_head active_head;
+	struct list_head inactive_head;
+	struct list_head dirty_list;
+	size_t nr_active;
+	size_t nr_inactive;
+	size_t nr_dirty;
+
+	int meta_fd;
+	size_t max_cached;
+	size_t nr_events;
+
+	aio_context_t ctx;
+	struct iocb *iocbs;
+	struct iocb **iocbps;
+	struct io_event *events;
+} global_bdat;
+
+struct block {
+	struct list_head hash_head;
+	struct list_head lru_head;
+	struct list_head dirty_head;
+	struct list_head submit_head;
+	unsigned long refcount;
+	unsigned long uptodate:1,
+		      active:1;
+	u64 blkno;
+	void *buf;
+	size_t size;
+};
+
+#define BLK_FMT \
+	"blkno %llu rc %ld d %u a %u"
+#define BLK_ARG(blk) \
+	(blk)->blkno, (blk)->refcount, !list_empty(&(blk)->dirty_head), blk->active
+#define debug_blk(blk, fmt, args...) \
+	debug(fmt " " BLK_FMT, ##args, BLK_ARG(blk))
+
+/*
+ * This just allocates and initializes the block.  The caller is
+ * responsible for putting it on the appropriate initial lists and
+ * managing refcounts.
+ */
+static struct block *alloc_block(struct block_data *bdat, u64 blkno, size_t size)
+{
+	struct block *blk;
+
+	blk = calloc(1, sizeof(struct block));
+	if (blk) {
+		blk->buf = aligned_alloc(4096, size); /* XXX static alignment :/ */
+		if (!blk->buf) {
+			free(blk);
+			blk = NULL;
+		} else {
+			INIT_LIST_HEAD(&blk->hash_head);
+			INIT_LIST_HEAD(&blk->lru_head);
+			INIT_LIST_HEAD(&blk->dirty_head);
+			INIT_LIST_HEAD(&blk->submit_head);
+			blk->blkno = blkno;
+			blk->size = size;
+		}
+	}
+
+	return blk;
+}
+
+static void free_block(struct block_data *bdat, struct block *blk)
+{
+	debug_blk(blk, "free");
+
+	if (!list_empty(&blk->lru_head)) {
+		if (blk->active)
+			bdat->nr_active--;
+		else
+			bdat->nr_inactive--;
+		list_del(&blk->lru_head);
+	}
+
+	if (!list_empty(&blk->dirty_head)) {
+		bdat->nr_dirty--;
+		list_del(&blk->dirty_head);
+	}
+
+	if (!list_empty(&blk->hash_head))
+		list_del(&blk->hash_head);
+
+	if (!list_empty(&blk->submit_head))
+		list_del(&blk->submit_head);
+
+	free(blk->buf);
+	free(blk);
+}
+
+static bool blk_is_dirty(struct block *blk)
+{
+	return !list_empty(&blk->dirty_head);
+}
+
+/*
+ * Rebalance the cache.
+ *
+ * First we shrink the cache to limit it to max_cached blocks.
+ * Logically, we walk from oldest to newest in the inactive list and
+ * then in the active list.  Since these lists are physically one
+ * list_head list we achieve this with a reverse walk starting from the
+ * active head.
+ *
+ * Then we rebalance the size of the two lists.  The constraint is that
+ * we don't let the active list grow larger than the inactive list.  We
+ * move blocks from the oldest tail of the active list to the newest
+ * head of the inactive list.
+ *
+ * <- [active head] <-> [ .. active list .. ] <-> [inactive head] <-> [ .. inactive list .. ] ->
+ */
+static void rebalance_cache(struct block_data *bdat)
+{
+	struct block *blk;
+	struct block *blk_;
+
+	list_for_each_entry_safe_reverse(blk, blk_, &bdat->active_head, lru_head) {
+		if ((bdat->nr_active + bdat->nr_inactive) < bdat->max_cached)
+			break;
+
+		if (&blk->lru_head == &bdat->inactive_head || blk->refcount > 0 ||
+		    blk_is_dirty(blk))
+			continue;
+
+		free_block(bdat, blk);
+	}
+
+	list_for_each_entry_safe_reverse(blk, blk_, &bdat->inactive_head, lru_head) {
+		if (bdat->nr_active <= bdat->nr_inactive || &blk->lru_head == &bdat->active_head)
+			break;
+
+		list_move(&blk->lru_head, &bdat->inactive_head);
+		blk->active = 0;
+		bdat->nr_active--;
+		bdat->nr_inactive++;
+	}
+}
+
+static void make_active(struct block_data *bdat, struct block *blk)
+{
+	if (!blk->active) {
+		if (!list_empty(&blk->lru_head)) {
+			list_move(&blk->lru_head, &bdat->active_head);
+			bdat->nr_inactive--;
+		} else {
+			list_add(&blk->lru_head, &bdat->active_head);
+		}
+
+		blk->active = 1;
+		bdat->nr_active++;
+	}
+}
+
+static int compar_iocbp(const void *A, const void *B)
+{
+	struct iocb *a = *(struct iocb **)A;
+	struct iocb *b = *(struct iocb **)B;
+
+	return scoutfs_cmp(a->aio_offset, b->aio_offset);
+}
+
+static int submit_and_wait(struct block_data *bdat, struct list_head *list)
+{
+	struct io_event *event;
+	struct iocb *iocb;
+	struct block *blk;
+	int ret;
+	int err;
+	int nr;
+	int i;
+
+	err = 0;
+	nr = 0;
+	list_for_each_entry(blk, list, submit_head) {
+		iocb = &bdat->iocbs[nr];
+		bdat->iocbps[nr] = iocb;
+
+		memset(iocb, 0, sizeof(struct iocb));
+
+		iocb->aio_data = (intptr_t)blk;
+		iocb->aio_lio_opcode = blk_is_dirty(blk) ? IOCB_CMD_PWRITE : IOCB_CMD_PREAD;
+		iocb->aio_fildes = bdat->meta_fd;
+		iocb->aio_buf = (intptr_t)blk->buf;
+		iocb->aio_nbytes = blk->size;
+		iocb->aio_offset = blk->blkno * blk->size;
+
+		nr++;
+
+		debug_blk(blk, "submit");
+
+		if ((nr < bdat->nr_events) && blk->submit_head.next != list)
+			continue;
+
+		qsort(bdat->iocbps, nr, sizeof(bdat->iocbps[0]), compar_iocbp);
+
+		ret = syscall(__NR_io_submit, bdat->ctx, nr, bdat->iocbps);
+		if (ret != nr) {
+			if (ret >= 0)
+				errno = EIO;
+			ret = -errno;
+			fprintf(stderr, "fatal system error submitting async IO: "ENO_FMT"\n",
+				ENO_ARG(-ret));
+			goto out;
+		}
+
+		ret = syscall(__NR_io_getevents, bdat->ctx, nr, nr, bdat->events, NULL);
+		if (ret != nr) {
+			if (ret >= 0)
+				errno = EIO;
+			ret = -errno;
+			fprintf(stderr, "fatal system error getting IO events: "ENO_FMT"\n",
+				ENO_ARG(-ret));
+			goto out;
+		}
+
+		ret = 0;
+		for (i = 0; i < nr; i++) {
+			event = &bdat->events[i];
+			iocb = (struct iocb *)(intptr_t)event->obj;
+			blk = (struct block *)(intptr_t)event->data;
+
+			debug_blk(blk, "complete res %lld", (long long)event->res);
+
+			if (event->res >= 0 && event->res != blk->size)
+				event->res = -EIO;
+
+			/* io errors are fatal */
+			if (event->res < 0) {
+				ret = event->res;
+				goto out;
+			}
+
+			if (iocb->aio_lio_opcode == IOCB_CMD_PREAD) {
+				blk->uptodate = 1;
+			} else {
+				list_del_init(&blk->dirty_head);
+				bdat->nr_dirty--;
+			}
+		}
+		nr = 0;
+	}
+
+	ret = 0;
+out:
+	return ret ?: err;
+}
+
+static void inc_refcount(struct block *blk)
+{
+	blk->refcount++;
+}
+
+void block_put(struct block **blkp)
+{
+	struct block_data *bdat = &global_bdat;
+	struct block *blk = *blkp;
+
+	if (blk) {
+		blk->refcount--;
+		*blkp = NULL;
+
+		rebalance_cache(bdat);
+	}
+}
+
+static struct list_head *hash_bucket(struct block_data *bdat, u64 blkno)
+{
+	u32 hash = scoutfs_hash32(&blkno, sizeof(blkno));
+
+	return &bdat->hash_lists[hash % bdat->hash_nr];
+}
+
+int block_hdr_valid(struct block *blk, u64 blkno, int bf, u32 magic)
+{
+	struct scoutfs_block_header *hdr;
+	size_t size = (bf & BF_SM) ? SCOUTFS_BLOCK_SM_SIZE : SCOUTFS_BLOCK_LG_SIZE;
+	int ret;
+	u32 crc;
+
+	ret = block_get(&blk, blkno, bf);
+	if (ret < 0) {
+		fprintf(stderr, "error reading block %llu\n", blkno);
+		goto out;
+	}
+
+	hdr = block_buf(blk);
+
+	crc = crc_block(hdr, size);
+
+	/*
+	 * a bad CRC is easy to repair, so we pass a different error code
+	 * back. Unless the other data is also wrong - then it's EINVAL
+	 * to signal that this isn't a valid block hdr at all.
+	 */
+	if (le32_to_cpu(hdr->crc) != crc)
+		ret = -EIO; /* keep checking other fields */
+
+	if (le32_to_cpu(hdr->magic) != magic)
+		ret = -EINVAL;
+
+	/*
+	 * Our first caller fills in global_super. Until this completes,
+	 * we can't do this check.
+	 */
+	if ((blkno != SCOUTFS_SUPER_BLKNO) &&
+	    (hdr->fsid != global_super->hdr.fsid))
+		ret = -EINVAL;
+
+	block_put(&blk);
+
+	debug("%s blk_hdr_valid blkno %llu size %lu crc 0x%08x magic 0x%08x ret %d",
+	      sns_str(), blkno, size, le32_to_cpu(hdr->crc), le32_to_cpu(hdr->magic),
+	      ret);
+
+out:
+	return ret;
+}
+
+static struct block *get_or_alloc(struct block_data *bdat, u64 blkno, int bf)
+{
+	struct list_head *bucket = hash_bucket(bdat, blkno);
+	struct block *search;
+	struct block *blk;
+	size_t size;
+
+	size = (bf & BF_SM) ? SCOUTFS_BLOCK_SM_SIZE : SCOUTFS_BLOCK_LG_SIZE;
+
+	blk = NULL;
+	list_for_each_entry(search, bucket, hash_head) {
+		if (search->blkno == blkno && search->size == size) {
+			blk = search;
+			break;
+		}
+	}
+
+	if (!blk) {
+		blk = alloc_block(bdat, blkno, size);
+		if (blk) {
+			list_add(&blk->hash_head, bucket);
+			list_add(&blk->lru_head, &bdat->inactive_head);
+			bdat->nr_inactive++;
+		}
+	}
+	if (blk)
+		inc_refcount(blk);
+
+	return blk;
+}
+
+/*
+ * Get a block.
+ *
+ * The caller holds a refcount to the block while it's in use that
+ * prevents it from being removed from the cache.  It must be dropped
+ * with block_put();
+ */
+int block_get(struct block **blk_ret, u64 blkno, int bf)
+{
+	struct block_data *bdat = &global_bdat;
+	struct block *blk;
+	LIST_HEAD(list);
+	int ret;
+
+	blk = get_or_alloc(bdat, blkno, bf);
+	if (!blk) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	if ((bf & BF_ZERO)) {
+		memset(blk->buf, 0, blk->size);
+		blk->uptodate = 1;
+	}
+
+	if (bf & BF_OVERWRITE)
+		blk->uptodate = 1;
+
+	if (!blk->uptodate) {
+		list_add(&blk->submit_head, &list);
+		ret = submit_and_wait(bdat, &list);
+		list_del_init(&blk->submit_head);
+		if (ret < 0)
+			goto out;
+	}
+
+	if ((bf & BF_DIRTY) && !blk_is_dirty(blk)) {
+		list_add_tail(&blk->dirty_head, &bdat->dirty_list);
+		bdat->nr_dirty++;
+	}
+
+	make_active(bdat, blk);
+
+	rebalance_cache(bdat);
+	ret = 0;
+out:
+	if (ret < 0)
+		block_put(&blk);
+	*blk_ret = blk;
+	return ret;
+}
+
+void *block_buf(struct block *blk)
+{
+	return blk->buf;
+}
+
+size_t block_size(struct block *blk)
+{
+	return blk->size;
+}
+
+/*
+ * Drop the block from the cache, regardless of if it was free or not.
+ * This is used to avoid writing blocks which were dirtied but then
+ * later freed.
+ *
+ * The block is immediately freed and can't be referenced after this
+ * returns.
+ */
+void block_drop(struct block **blkp)
+{
+	struct block_data *bdat = &global_bdat;
+
+	free_block(bdat, *blkp);
+	*blkp = NULL;
+	rebalance_cache(bdat);
+}
+
+/*
+ * This doesn't quite work for mixing large and small blocks, but that's
+ * fine, we never do that.
+ */
+static int compar_u64(const void *A, const void *B)
+{
+	u64 a = *((u64 *)A);
+	u64 b = *((u64 *)B);
+
+	return scoutfs_cmp(a, b);
+}
+
+/*
+ * This read-ahead is synchronous and errors are ignored.  If any of the
+ * blknos aren't present in the cache then we issue concurrent reads for
+ * them and wait.  Any existing cached blocks will be left as is.
+ *
+ * We might be trying to read a lot more than the number of events so we
+ * sort the caller's blknos before iterating over them rather than
+ * relying on submission sorting the blocks in each submitted set.
+ */
+void block_readahead(u64 *blknos, size_t nr)
+{
+	struct block_data *bdat = &global_bdat;
+	struct block *blk;
+	struct block *blk_;
+	LIST_HEAD(list);
+	size_t i;
+
+	if (nr == 0)
+		return;
+
+	qsort(blknos, nr, sizeof(blknos[0]), compar_u64);
+
+	for (i = 0; i < nr; i++) {
+		blk = get_or_alloc(bdat, blknos[i], 0);
+		if (blk) {
+			if (!blk->uptodate)
+				list_add_tail(&blk->submit_head, &list);
+			else
+				block_put(&blk);
+		}
+	}
+
+	(void)submit_and_wait(bdat, &list);
+
+	list_for_each_entry_safe(blk, blk_, &list, submit_head) {
+		list_del_init(&blk->submit_head);
+		block_put(&blk);
+	}
+
+	rebalance_cache(bdat);
+}
+
+/*
+ * The caller's block changes form a consistent transaction.  If the number of dirty
+ * blocks is large enough we issue a write.
+ */
+int block_try_commit(bool force)
+{
+	struct block_data *bdat = &global_bdat;
+	struct block *blk;
+	struct block *blk_;
+	LIST_HEAD(list);
+	int ret;
+
+	if (!force && bdat->nr_dirty < bdat->nr_events)
+		return 0;
+
+	list_for_each_entry(blk, &bdat->dirty_list, dirty_head) {
+		list_add_tail(&blk->submit_head, &list);
+		inc_refcount(blk);
+	}
+
+	ret = submit_and_wait(bdat, &list);
+
+	list_for_each_entry_safe(blk, blk_, &list, submit_head) {
+		list_del_init(&blk->submit_head);
+		block_put(&blk);
+	}
+
+	if (ret < 0) {
+		fprintf(stderr, "error writing dirty transaction blocks\n");
+		goto out;
+	}
+
+	ret = block_get(&blk, SCOUTFS_SUPER_BLKNO, BF_SM | BF_OVERWRITE | BF_DIRTY);
+	if (ret == 0) {
+		list_add(&blk->submit_head, &list);
+		ret = submit_and_wait(bdat, &list);
+		list_del_init(&blk->submit_head);
+		block_put(&blk);
+	} else {
+		ret = -ENOMEM;
+	}
+	if (ret < 0)
+		fprintf(stderr, "error writing super block to commit transaction\n");
+
+out:
+	rebalance_cache(bdat);
+	return ret;
+}
+
+int block_setup(int meta_fd, size_t max_cached_bytes, size_t max_dirty_bytes)
+{
+	struct block_data *bdat = &global_bdat;
+	size_t i;
+	int ret;
+
+	bdat->max_cached = DIV_ROUND_UP(max_cached_bytes, SCOUTFS_BLOCK_LG_SIZE);
+	bdat->hash_nr = bdat->max_cached / 4;
+	bdat->nr_events = DIV_ROUND_UP(max_dirty_bytes, SCOUTFS_BLOCK_LG_SIZE);
+
+	bdat->iocbs = calloc(bdat->nr_events, sizeof(bdat->iocbs[0]));
+	bdat->iocbps = calloc(bdat->nr_events, sizeof(bdat->iocbps[0]));
+	bdat->events = calloc(bdat->nr_events, sizeof(bdat->events[0]));
+	bdat->hash_lists = calloc(bdat->hash_nr, sizeof(bdat->hash_lists[0]));
+	if (!bdat->iocbs || !bdat->iocbps || !bdat->events || !bdat->hash_lists) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	INIT_LIST_HEAD(&bdat->active_head);
+	INIT_LIST_HEAD(&bdat->inactive_head);
+	INIT_LIST_HEAD(&bdat->dirty_list);
+	bdat->meta_fd = meta_fd;
+	list_add(&bdat->inactive_head, &bdat->active_head);
+
+	for (i = 0; i < bdat->hash_nr; i++)
+		INIT_LIST_HEAD(&bdat->hash_lists[i]);
+
+	ret = syscall(__NR_io_setup, bdat->nr_events, &bdat->ctx);
+
+out:
+	if (ret < 0) {
+		free(bdat->iocbs);
+		free(bdat->iocbps);
+		free(bdat->events);
+		free(bdat->hash_lists);
+	}
+
+	return ret;
+}
+
+void block_shutdown(void)
+{
+	struct block_data *bdat = &global_bdat;
+
+	syscall(SYS_io_destroy, bdat->ctx);
+
+	free(bdat->iocbs);
+	free(bdat->iocbps);
+	free(bdat->events);
+	free(bdat->hash_lists);
+}
@@ -0,0 +1,34 @@
+#ifndef _SCOUTFS_UTILS_CHECK_BLOCK_H_
+#define _SCOUTFS_UTILS_CHECK_BLOCK_H_
+
+#include <unistd.h>
+#include <stdbool.h>
+
+struct block;
+
+#include "sparse.h"
+
+/* block flags passed to block_get() */
+enum {
+	BF_ZERO      = (1 << 0), /* zero contents buf as block is returned */
+	BF_DIRTY     = (1 << 1), /* block will be written with transaction */
+	BF_SM        = (1 << 2), /* small 4k block instead of large 64k block */
+	BF_OVERWRITE = (1 << 3), /* caller will overwrite contents, don't read */
+};
+
+int block_get(struct block **blk_ret, u64 blkno, int bf);
+void block_put(struct block **blkp);
+
+void *block_buf(struct block *blk);
+size_t block_size(struct block *blk);
+void block_drop(struct block **blkp);
+
+void block_readahead(u64 *blknos, size_t nr);
+int block_try_commit(bool force);
+
+int block_setup(int meta_fd, size_t max_cached_bytes, size_t max_dirty_bytes);
+void block_shutdown(void);
+
+int block_hdr_valid(struct block *blk, u64 blkno, int bf, u32 magic);
+
+#endif
@@ -0,0 +1,217 @@
+#include <unistd.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <errno.h>
+
+#include "sparse.h"
+#include "util.h"
+#include "format.h"
+#include "key.h"
+#include "avl.h"
+
+#include "block.h"
+#include "btree.h"
+#include "extent.h"
+#include "iter.h"
+#include "sns.h"
+#include "meta.h"
+#include "problem.h"
+
+static inline void *item_val(struct scoutfs_btree_block *bt, struct scoutfs_btree_item *item)
+{
+	return (void *)bt + le16_to_cpu(item->val_off);
+}
+
+static void readahead_refs(struct scoutfs_btree_block *bt)
+{
+	struct scoutfs_btree_item *item;
+	struct scoutfs_avl_node *node;
+	struct scoutfs_block_ref *ref;
+	u64 *blknos;
+	u64 blkno;
+	u16 valid = 0;
+	u16 nr = le16_to_cpu(bt->nr_items);
+	int i;
+
+	blknos = calloc(nr, sizeof(blknos[0]));
+	if (!blknos)
+		return;
+
+	node = avl_first(&bt->item_root);
+
+	for (i = 0; i < nr; i++) {
+		item = container_of(node, struct scoutfs_btree_item, node);
+		ref = item_val(bt, item);
+		blkno = le64_to_cpu(ref->blkno);
+
+		if (valid_meta_blkno(blkno))
+			blknos[valid++] = blkno;
+
+		node = avl_next(&bt->item_root, &item->node);
+	}
+
+	if (valid > 0)
+		block_readahead(blknos, valid);
+	free(blknos);
+}
+
+/*
+ * Call the callback on the referenced block.  Then if the block
+ * contains references read it and recurse into all its references.
+ */
+static int btree_ref_meta_iter(struct scoutfs_block_ref *ref, unsigned level, extent_cb_t cb,
+			       void *cb_arg)
+{
+	struct scoutfs_btree_item *item;
+	struct scoutfs_btree_block *bt;
+	struct scoutfs_avl_node *node;
+	struct block *blk = NULL;
+	u64 blkno;
+	int ret;
+	int i;
+
+	blkno = le64_to_cpu(ref->blkno);
+	if (!blkno)
+		return 0;
+
+	ret = cb(blkno, 1, cb_arg);
+	if (ret < 0) {
+		ret = xlate_iter_errno(ret);
+		return ret;
+	}
+
+	if (level == 0)
+		return 0;
+
+	ret = block_get(&blk, blkno, 0);
+	if (ret < 0)
+		return ret;
+
+	sns_push("btree_parent", blkno, 0);
+
+	ret = block_hdr_valid(blk, blkno, 0, SCOUTFS_BLOCK_MAGIC_BTREE);
+	if (ret < 0)
+		goto out;
+
+	bt = block_buf(blk);
+
+	/* XXX integrate verification with block cache */
+	if (bt->level != level) {
+		problem(PB_BTREE_BLOCK_BAD_LEVEL, "expected %u level %u", level, bt->level);
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/* read-ahead last level of parents */
+	if (level == 2)
+		readahead_refs(bt);
+
+	node = avl_first(&bt->item_root);
+
+	for (i = 0; i < le16_to_cpu(bt->nr_items); i++) {
+		item = container_of(node, struct scoutfs_btree_item, node);
+		ref = item_val(bt, item);
+
+		ret = btree_ref_meta_iter(ref, level - 1, cb, cb_arg);
+		if (ret < 0)
+			goto out;
+
+		node = avl_next(&bt->item_root, &item->node);
+	}
+
+	ret = 0;
+out:
+	block_put(&blk);
+	sns_pop();
+
+	return ret;
+}
+
+int btree_meta_iter(struct scoutfs_btree_root *root, extent_cb_t cb, void *cb_arg)
+{
+	/* XXX check root */
+	if (root->height == 0)
+		return 0;
+
+	return btree_ref_meta_iter(&root->ref, root->height - 1, cb, cb_arg);
+}
+
+static int btree_ref_item_iter(struct scoutfs_block_ref *ref, unsigned level,
+			       btree_item_cb_t cb, void *cb_arg)
+{
+	struct scoutfs_btree_item *item;
+	struct scoutfs_btree_block *bt;
+	struct scoutfs_avl_node *node;
+	struct block *blk = NULL;
+	u64 blkno;
+	int ret;
+	int i;
+
+	blkno = le64_to_cpu(ref->blkno);
+	if (!blkno)
+		return 0;
+
+	ret = block_get(&blk, blkno, 0);
+	if (ret < 0)
+		return ret;
+
+	if (level)
+		sns_push("btree_parent", blkno, 0);
+	else
+		sns_push("btree_leaf", blkno, 0);
+
+	ret = block_hdr_valid(blk, blkno, 0, SCOUTFS_BLOCK_MAGIC_BTREE);
+	if (ret < 0)
+		goto out;
+
+	bt = block_buf(blk);
+
+	/* XXX integrate verification with block cache */
+	if (bt->level != level) {
+		problem(PB_BTREE_BLOCK_BAD_LEVEL, "expected %u level %u", level, bt->level);
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/* read-ahead leaves that contain items */
+	if (level == 1)
+		readahead_refs(bt);
+
+	node = avl_first(&bt->item_root);
+
+	for (i = 0; i < le16_to_cpu(bt->nr_items); i++) {
+		item = container_of(node, struct scoutfs_btree_item, node);
+
+		if (level) {
+			ref = item_val(bt, item);
+			ret = btree_ref_item_iter(ref, level - 1, cb, cb_arg);
+		} else {
+			ret = cb(&item->key, item_val(bt, item),
+				 le16_to_cpu(item->val_len), cb_arg);
+			debug("free item key "SK_FMT" ret %d", SK_ARG(&item->key), ret);
+		}
+		if (ret < 0) {
+			ret = xlate_iter_errno(ret);
+			goto out;
+		}
+
+		node = avl_next(&bt->item_root, &item->node);
+	}
+
+	ret = 0;
+out:
+	block_put(&blk);
+	sns_pop();
+
+	return ret;
+}
+
+int btree_item_iter(struct scoutfs_btree_root *root, btree_item_cb_t cb, void *cb_arg)
+{
+	/* XXX check root */
+	if (root->height == 0)
+		return 0;
+
+	return btree_ref_item_iter(&root->ref, root->height - 1, cb, cb_arg);
+}
@@ -0,0 +1,14 @@
+#ifndef _SCOUTFS_UTILS_CHECK_BTREE_H_
+#define _SCOUTFS_UTILS_CHECK_BTREE_H_
+
+#include "util.h"
+#include "format.h"
+
+#include "extent.h"
+
+typedef int (*btree_item_cb_t)(struct scoutfs_key *key, void *val, u16 val_len, void *cb_arg);
+
+int btree_meta_iter(struct scoutfs_btree_root *root, extent_cb_t cb, void *cb_arg);
+int btree_item_iter(struct scoutfs_btree_root *root, btree_item_cb_t cb, void *cb_arg);
+
+#endif
@@ -0,0 +1,184 @@
+#define _GNU_SOURCE /* O_DIRECT */
+#include <unistd.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/ioctl.h>
+#include <fcntl.h>
+#include <errno.h>
+#include <string.h>
+#include <assert.h>
+#include <stdbool.h>
+#include <argp.h>
+
+#include "sparse.h"
+#include "parse.h"
+#include "util.h"
+#include "format.h"
+#include "ioctl.h"
+#include "cmd.h"
+#include "dev.h"
+
+#include "alloc.h"
+#include "block.h"
+#include "debug.h"
+#include "meta.h"
+#include "super.h"
+#include "problem.h"
+
+struct check_args {
+	char *meta_device;
+	char *data_device;
+	char *debug_path;
+};
+
+static int do_check(struct check_args *args)
+{
+	int debug_fd = -1;
+	int meta_fd = -1;
+	int data_fd = -1;
+	int ret;
+
+	if (args->debug_path) {
+		if (strcmp(args->debug_path, "-") == 0)
+			debug_fd = dup(STDERR_FILENO);
+		else
+			debug_fd = open(args->debug_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
+		if (debug_fd < 0) {
+			ret = -errno;
+			fprintf(stderr, "error opening debug output file '%s': %s (%d)\n",
+				args->debug_path, strerror(errno), errno);
+			goto out;
+		}
+
+		debug_enable(debug_fd);
+	}
+
+	meta_fd = open(args->meta_device, O_DIRECT | O_RDWR | O_EXCL);
+	if (meta_fd < 0) {
+		ret = -errno;
+		fprintf(stderr, "failed to open meta device '%s': %s (%d)\n",
+			args->meta_device, strerror(errno), errno);
+		goto out;
+	}
+
+	data_fd = open(args->data_device, O_DIRECT | O_RDWR | O_EXCL);
+	if (data_fd < 0) {
+		ret = -errno;
+		fprintf(stderr, "failed to open data device '%s': %s (%d)\n",
+			args->data_device, strerror(errno), errno);
+		goto out;
+	}
+
+	ret = block_setup(meta_fd, 128 * 1024 * 1024, 32 * 1024 * 1024);
+	if (ret < 0)
+		goto out;
+
+	/*
+	 * At some point we may convert this to a multi-pass system where we may
+	 * try and repair items, and, as long as repairs are made, we will rerun
+	 * the checks more times. We may need to start counting how many problems we
+	 * fix in the process of these loops, so that we don't stall on unrepairable
+	 * problems and are making actual repair progress. IOW - when we do a full
+	 * check loop without any problems fixed, we stop trying.
+	 */
+	ret = check_supers(data_fd) ?:
+	      check_super_in_use(meta_fd) ?:
+	      check_meta_alloc() ?:
+	      check_super_crc();
+
+	if (ret < 0)
+		goto out;
+
+	debug("problem count %lu", problems_count());
+	if (problems_count() > 0)
+		printf("Problems detected.\n");
+
+out:
+	/* and tear it all down */
+	block_shutdown();
+	super_shutdown();
+	debug_disable();
+
+	if (meta_fd >= 0)
+		close(meta_fd);
+	if (data_fd >= 0)
+		close(data_fd);
+	if (debug_fd >= 0)
+		close(debug_fd);
+
+	return ret;
+}
+
+static int parse_opt(int key, char *arg, struct argp_state *state)
+{
+	struct check_args *args = state->input;
+
+	switch (key) {
+	case 'd':
+		args->debug_path = strdup_or_error(state, arg);
+		break;
+	case 'e':
+	case ARGP_KEY_ARG:
+		if (!args->meta_device)
+			args->meta_device = strdup_or_error(state, arg);
+		else if (!args->data_device)
+			args->data_device = strdup_or_error(state, arg);
+		else
+			argp_error(state, "more than two device arguments given");
+		break;
+	case ARGP_KEY_FINI:
+		if (!args->meta_device)
+			argp_error(state, "no metadata device argument given");
+		if (!args->data_device)
+			argp_error(state, "no data device argument given");
+		break;
+	default:
+		break;
+	}
+
+	return 0;
+}
+
+static struct argp_option options[] = {
+	{ "debug", 'd', "FILE_PATH", 0, "Path to debug output file, will be created or truncated"},
+	{ NULL }
+};
+
+static struct argp argp = {
+	options,
+	parse_opt,
+	"META-DEVICE DATA-DEVICE",
+	"Check filesystem consistency"
+};
+
+/* Exit codes used by fsck-type programs */
+#define FSCK_EX_NONDESTRUCT	1	/* File system errors corrected */
+#define FSCK_EX_UNCORRECTED	4	/* File system errors left uncorrected */
+#define FSCK_EX_ERROR		8	/* Operational error */
+#define FSCK_EX_USAGE		16	/* Usage or syntax error */
+
+static int check_cmd(int argc, char **argv)
+{
+	struct check_args check_args = {NULL};
+	int ret;
+
+	ret = argp_parse(&argp, argc, argv, 0, NULL, &check_args);
+	if (ret)
+		exit(FSCK_EX_USAGE);
+
+	ret = do_check(&check_args);
+	if (ret < 0)
+		ret = FSCK_EX_ERROR;
+
+	if (problems_count() > 0)
+		ret |= FSCK_EX_UNCORRECTED;
+
+	exit(ret);
+}
+
+static void __attribute__((constructor)) check_ctor(void)
+{
+	cmd_register_argp("check", &argp, GROUP_CORE, check_cmd);
+}
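Review note: check_cmd() above builds its exit status by OR-ing fsck status bits together. The following is a minimal standalone sketch of that composition. The helper name `fsck_exit_status()` is hypothetical and not part of the patch, and it assumes do_check() only ever returns 0 or a negative errno.

```c
#include <assert.h>

/* Exit codes used by fsck-type programs (values copied from the patch). */
#define FSCK_EX_UNCORRECTED	4	/* File system errors left uncorrected */
#define FSCK_EX_ERROR		8	/* Operational error */

/*
 * Hypothetical helper: fold a negative errno-style result and a problem
 * count into a single exit status, as check_cmd() does inline.
 */
static int fsck_exit_status(int ret, unsigned long nr_problems)
{
	int status = ret < 0 ? FSCK_EX_ERROR : 0;

	if (nr_problems > 0)
		status |= FSCK_EX_UNCORRECTED;

	return status;
}
```

Because the codes are distinct bits, a run that both hits an operational error and finds problems reports both conditions in one status.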
@@ -0,0 +1,16 @@
+#include <stdlib.h>
+
+#include "debug.h"
+
+int debug_fd = -1;
+
+void debug_enable(int fd)
+{
+	debug_fd = fd;
+}
+
+void debug_disable(void)
+{
+	if (debug_fd >= 0)
+		debug_fd = -1;
+}
@@ -0,0 +1,17 @@
+#ifndef _SCOUTFS_UTILS_CHECK_DEBUG_H_
+#define _SCOUTFS_UTILS_CHECK_DEBUG_H_
+
+#include <stdio.h>
+
+#define debug(fmt, args...)				\
+do {							\
+	if (debug_fd >= 0)				\
+		dprintf(debug_fd, fmt"\n", ##args);	\
+} while (0)
+
+extern int debug_fd;
+
+void debug_enable(int fd);
+void debug_disable(void);
+
+#endif
@@ -0,0 +1,10 @@
+#ifndef _SCOUTFS_UTILS_CHECK_ENO_H_
+#define _SCOUTFS_UTILS_CHECK_ENO_H_
+
+#include <errno.h>
+#include <string.h>
+
+#define ENO_FMT		"%d (%s)"
+#define ENO_ARG(eno)	eno, strerror(eno)
+
+#endif
@@ -0,0 +1,313 @@
+#include <unistd.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <errno.h>
+
+#include "util.h"
+#include "lk_rbtree_wrapper.h"
+
+#include "debug.h"
+#include "extent.h"
+
+/*
+ * In-memory extent management in rbtree nodes.
+ */
+
+bool extents_overlap(u64 a_start, u64 a_len, u64 b_start, u64 b_len)
+{
+	u64 a_end = a_start + a_len;
+	u64 b_end = b_start + b_len;
+
+	return !((a_end <= b_start) || (b_end <= a_start));
+}
+
+static int ext_contains(struct extent_node *ext, u64 start, u64 len)
+{
+	return ext->start <= start && ext->start + ext->len >= start + len;
+}
+
+/*
+ * True if the given extent is bisected by the given range: parts of
+ * the extent are left over on both the left and right sides of the
+ * range.
+ */
+static int ext_bisected(struct extent_node *ext, u64 start, u64 len)
+{
+	return ext->start < start && ext->start + ext->len > start + len;
+}
+
+static struct extent_node *ext_from_rbnode(struct rb_node *rbnode)
+{
+	return rbnode ? container_of(rbnode, struct extent_node, rbnode) : NULL;
+}
+
+static struct extent_node *next_ext(struct extent_node *ext)
+{
+	return ext ? ext_from_rbnode(rb_next(&ext->rbnode)) : NULL;
+}
+
+static struct extent_node *prev_ext(struct extent_node *ext)
+{
+	return ext ? ext_from_rbnode(rb_prev(&ext->rbnode)) : NULL;
+}
+
+struct walk_results {
+	unsigned bisect_to_leaf:1;
+	struct extent_node *found;
+	struct extent_node *next;
+	struct rb_node *parent;
+	struct rb_node **node;
+};
+
+static void walk_extents(struct extent_root *root, u64 start, u64 len, struct walk_results *wlk)
+{
+	struct rb_node **node = &root->rbroot.rb_node;
+	struct extent_node *ext;
+	u64 end = start + len;
+	int cmp;
+
+	wlk->found = NULL;
+	wlk->next = NULL;
+	wlk->parent = NULL;
+
+	while (*node) {
+		wlk->parent = *node;
+		ext = ext_from_rbnode(*node);
+		cmp = end <= ext->start ? -1 :
+		      start >= ext->start + ext->len ? 1 : 0;
+
+		if (cmp < 0) {
+			node = &ext->rbnode.rb_left;
+			wlk->next = ext;
+		} else if (cmp > 0) {
+			node = &ext->rbnode.rb_right;
+		} else {
+			wlk->found = ext;
+			if (!(wlk->bisect_to_leaf && ext_bisected(ext, start, len)))
+				break;
+			/* walk right so we can insert greater right from bisection */
+			node = &ext->rbnode.rb_right;
+		}
+	}
+
+	wlk->node = node;
+}
+
+/*
+ * Return an extent that overlaps with the given range.
+ */
+int extent_lookup(struct extent_root *root, u64 start, u64 len, struct extent_node *found)
+{
+	struct walk_results wlk = { 0, };
+	int ret;
+
+	walk_extents(root, start, len, &wlk);
+	if (wlk.found) {
+		memset(found, 0, sizeof(struct extent_node));
+		found->start = wlk.found->start;
+		found->len = wlk.found->len;
+		ret = 0;
+	} else {
+		ret = -ENOENT;
+	}
+
+	return ret;
+}
+
+/*
+ * Callers can iterate through direct node references and are entirely
+ * responsible for consistency when doing so.
+ */
+struct extent_node *extent_first(struct extent_root *root)
+{
+	struct walk_results wlk = { 0, };
+
+	walk_extents(root, 0, 1, &wlk);
+
+	return wlk.found ?: wlk.next;
+}
+
+struct extent_node *extent_next(struct extent_node *ext)
+{
+	return next_ext(ext);
+}
+
+struct extent_node *extent_prev(struct extent_node *ext)
+{
+	return prev_ext(ext);
+}
+
+/*
+ * Insert a new extent into the tree.  We can extend existing nodes,
+ * merge with neighbours, or remove existing extents entirely if we
+ * insert a range that fully spans existing nodes.
+ */
+static int walk_insert(struct extent_root *root, u64 start, u64 len, int found_err)
+{
+	struct walk_results wlk = { 0, };
+	struct extent_node *ext;
+	struct extent_node *nei;
+	int ret;
+
+	walk_extents(root, start, len, &wlk);
+
+	ext = wlk.found;
+	if (ext && found_err) {
+		ret = found_err;
+		goto out;
+	}
+
+	if (!ext) {
+		ext = malloc(sizeof(struct extent_node));
+		if (!ext) {
+			ret = -ENOMEM;
+			goto out;
+		}
+
+		ext->start = start;
+		ext->len = len;
+
+		rb_link_node(&ext->rbnode, wlk.parent, wlk.node);
+		rb_insert_color(&ext->rbnode, &root->rbroot);
+	}
+
+	/* start by expanding an existing extent if our range is larger */
+	if (start < ext->start) {
+		ext->len += ext->start - start;
+		ext->start = start;
+	}
+	if (ext->start + ext->len < start + len)
+		ext->len += (start + len) - (ext->start + ext->len);
+
+	/* drop any fully spanned neighbors, possibly merging with a final adjacent one */
+
+	while ((nei = prev_ext(ext))) {
+		if (nei->start + nei->len < ext->start)
+			break;
+
+		if (nei->start < ext->start) {
+			ext->len += ext->start - nei->start;
+			ext->start = nei->start;
+		}
+
+		rb_erase(&nei->rbnode, &root->rbroot);
+		free(nei);
+	}
+
+	while ((nei = next_ext(ext))) {
+		if (ext->start + ext->len < nei->start)
+			break;
+
+		if (ext->start + ext->len < nei->start + nei->len)
+			ext->len += (nei->start + nei->len) - (ext->start + ext->len);
+
+		rb_erase(&nei->rbnode, &root->rbroot);
+		free(nei);
+	}
+
+	ret = 0;
+out:
+	if (ret < 0)
+		debug("start %llu len %llu ret %d", start, len, ret);
+	return ret;
+}
+
+/*
+ * Insert a new extent.  The specified extent must not overlap with any
+ * existing extents or -EEXIST is returned.
+ */
+int extent_insert_new(struct extent_root *root, u64 start, u64 len)
+{
+	return walk_insert(root, start, len, -EEXIST);
+}
+
+/*
+ * Insert an extent, extending any existing extents that may overlap.
+ */
+int extent_insert_extend(struct extent_root *root, u64 start, u64 len)
+{
+	return walk_insert(root, start, len, 0);
+}
+
+/*
+ * Remove the specified extent from an existing node.  The given extent must be fully
+ * contained in a single node or -ENOENT is returned.
+ */
+int extent_remove(struct extent_root *root, u64 start, u64 len)
+{
+	struct extent_node *ext;
+	struct extent_node *ins;
+	struct walk_results wlk = {
+		.bisect_to_leaf = 1,
+	};
+	int ret;
+
+	walk_extents(root, start, len, &wlk);
+
+	if (!(ext = wlk.found) || !ext_contains(ext, start, len)) {
+		ret = -ENOENT;
+		goto out;
+	}
+
+	if (ext_bisected(ext, start, len)) {
+		debug("found bisected start %llu len %llu", ext->start, ext->len);
+		ins = malloc(sizeof(struct extent_node));
+		if (!ins) {
+			ret = -ENOMEM;
+			goto out;
+		}
+
+		ins->start = start + len;
+		ins->len = (ext->start + ext->len) - ins->start;
+
+		rb_link_node(&ins->rbnode, wlk.parent, wlk.node);
+		rb_insert_color(&ins->rbnode, &root->rbroot);
+	}
+
+	if (start > ext->start) {
+		ext->len = start - ext->start;
+	} else if (len < ext->len) {
+		ext->start += len;
+		ext->len -= len;
+	} else {
+		rb_erase(&ext->rbnode, &root->rbroot);
+		free(ext);
+	}
+	ret = 0;
+out:
+	debug("start %llu len %llu ret %d", start, len, ret);
+
+	return ret;
+}
+
+void extent_root_init(struct extent_root *root)
+{
+	root->rbroot = RB_ROOT;
+	root->total = 0;
+}
+
+void extent_root_free(struct extent_root *root)
+{
+	struct extent_node *ext;
+	struct rb_node *node;
+	struct rb_node *tmp;
+
+	for (node = rb_first(&root->rbroot); node && ((tmp = rb_next(node)), 1); node = tmp) {
+		ext = rb_entry(node, struct extent_node, rbnode);
+		rb_erase(&ext->rbnode, &root->rbroot);
+		free(ext);
+	}
+}
+
+void extent_root_print(struct extent_root *root)
+{
+	struct extent_node *ext;
+	struct rb_node *node;
+	struct rb_node *tmp;
+
+	for (node = rb_first(&root->rbroot); node && ((tmp = rb_next(node)), 1); node = tmp) {
+		ext = rb_entry(node, struct extent_node, rbnode);
+		debug("  start %llu len %llu", ext->start, ext->len);
+	}
+}
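Review note: the extent helpers above treat every range as half-open, [start, start + len), so two ranges that merely touch do not overlap. The following is a small standalone sketch of the same test as extents_overlap(). The name `ranges_overlap()` and the use of `uint64_t` in place of the tree's `u64` are assumptions made so the example compiles on its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same test as extents_overlap(): ranges are half-open, [start, start + len). */
static bool ranges_overlap(uint64_t a_start, uint64_t a_len,
			   uint64_t b_start, uint64_t b_len)
{
	uint64_t a_end = a_start + a_len;
	uint64_t b_end = b_start + b_len;

	/* disjoint only if one range ends at or before the other starts */
	return !(a_end <= b_start || b_end <= a_start);
}
```

Note that walk_insert() still merges adjacent extents even though they do not overlap by this definition; its neighbour loops only stop when there is a gap between the two extents.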
@@ -0,0 +1,38 @@
+#ifndef _SCOUTFS_UTILS_CHECK_EXTENT_H_
+#define _SCOUTFS_UTILS_CHECK_EXTENT_H_
+
+#include "lk_rbtree_wrapper.h"
+
+struct extent_root {
+	struct rb_root rbroot;
+	u64 total;
+};
+
+struct extent_node {
+	struct rb_node rbnode;
+	u64 start;
+	u64 len;
+};
+
+typedef int (*extent_cb_t)(u64 start, u64 len, void *arg);
+
+struct extent_cb_arg_t {
+	extent_cb_t cb;
+	void *cb_arg;
+};
+
+bool extents_overlap(u64 a_start, u64 a_len, u64 b_start, u64 b_len);
+
+int extent_lookup(struct extent_root *root, u64 start, u64 len, struct extent_node *found);
+struct extent_node *extent_first(struct extent_root *root);
+struct extent_node *extent_next(struct extent_node *ext);
+struct extent_node *extent_prev(struct extent_node *ext);
+int extent_insert_new(struct extent_root *root, u64 start, u64 len);
+int extent_insert_extend(struct extent_root *root, u64 start, u64 len);
+int extent_remove(struct extent_root *root, u64 start, u64 len);
+
+void extent_root_init(struct extent_root *root);
+void extent_root_free(struct extent_root *root);
+void extent_root_print(struct extent_root *root);
+
+#endif
@@ -0,0 +1,540 @@
+#define _GNU_SOURCE /* O_DIRECT */
+#include <unistd.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <fcntl.h>
+#include <errno.h>
+#include <string.h>
+#include <stdbool.h>
+#include <argp.h>
+
+#include "sparse.h"
+#include "bitmap.h"
+#include "parse.h"
+#include "util.h"
+#include "format.h"
+#include "crc.h"
+#include "cmd.h"
+#include "dev.h"
+
+#include "alloc.h"
+#include "block.h"
+#include "btree.h"
+#include "log_trees.h"
+#include "super.h"
+
+/* Largest positive off_t; POSIX doesn't define an OFF_MAX. */
+#define OFF_MAX (off_t)((u64)((off_t)~0ULL) >> 1)
+
+#define SCOUTFS_META_IMAGE_HEADER_MAGIC		0x8aee00d098fa60c5ULL
+#define SCOUTFS_META_IMAGE_BLOCK_HEADER_MAGIC	0x70bd5e9269effd86ULL
+
+struct scoutfs_meta_image_header {
+	__le64 magic;
+	__le64 total_bytes;
+	__le32 version;
+} __packed;
+
+struct scoutfs_meta_image_block_header {
+	__le64 magic;
+	__le64 offset;
+	__le32 size;
+	__le32 crc;
+} __packed;
+
+struct image_args {
+	char *meta_device;
+	bool is_read;
+	bool show_header;
+	u64 ra_window;
+};
+
+struct block_bitmaps {
+	unsigned long *bits;
+	u64 size;
+	u64 count;
+};
+
+#define errf(fmt, args...) \
+	dprintf(STDERR_FILENO, fmt, ##args)
+
+static int set_meta_bit(u64 start, u64 len, void *arg)
+{
+	struct block_bitmaps *bm = arg;
+	int ret;
+
+	if (len != 1) {
+		ret = -EINVAL;
+	} else {
+		if (!test_bit(bm->bits, start)) {
+			set_bit(bm->bits, start);
+			bm->count++;
+		}
+		ret = 0;
+	}
+
+	return ret;
+}
+
+static int get_ref_bits(struct block_bitmaps *bm)
+{
+	struct scoutfs_super_block *super = global_super;
+	int ret;
+	u64 i;
+
+	/*
+	 * There are almost no small blocks we need to read, so we read
+	 * them as the large blocks that contain them to simplify the
+	 * block reading process.
+	 */
+	set_meta_bit(SCOUTFS_SUPER_BLKNO >> SCOUTFS_BLOCK_SM_LG_SHIFT, 1, bm);
+
+	for (i = 0; i < SCOUTFS_QUORUM_BLOCKS; i++)
+		set_meta_bit((SCOUTFS_QUORUM_BLKNO + i) >> SCOUTFS_BLOCK_SM_LG_SHIFT, 1, bm);
+
+	ret = alloc_root_meta_iter(&super->meta_alloc[0], set_meta_bit, bm) ?:
+	      alloc_root_meta_iter(&super->meta_alloc[1], set_meta_bit, bm) ?:
+	      alloc_root_meta_iter(&super->data_alloc, set_meta_bit, bm) ?:
+	      alloc_list_meta_iter(&super->server_meta_avail[0], set_meta_bit, bm) ?:
+	      alloc_list_meta_iter(&super->server_meta_avail[1], set_meta_bit, bm) ?:
+	      alloc_list_meta_iter(&super->server_meta_freed[0], set_meta_bit, bm) ?:
+	      alloc_list_meta_iter(&super->server_meta_freed[1], set_meta_bit, bm) ?:
+	      btree_meta_iter(&super->fs_root, set_meta_bit, bm) ?:
+	      btree_meta_iter(&super->logs_root, set_meta_bit, bm) ?:
+	      btree_meta_iter(&super->log_merge, set_meta_bit, bm) ?:
+	      btree_meta_iter(&super->mounted_clients, set_meta_bit, bm) ?:
+	      btree_meta_iter(&super->srch_root, set_meta_bit, bm) ?:
+	      log_trees_meta_iter(set_meta_bit, bm);
+
+	return ret;
+}
+
+/*
+ * Note that this temporarily modifies the header that it's given.
+ */
+static __le32 calc_crc(struct scoutfs_meta_image_block_header *bh, void *buf, size_t size)
+{
+	__le32 saved = bh->crc;
+	u32 crc = ~0;
+
+	bh->crc = 0;
+	crc = crc32c(crc, bh, sizeof(*bh));
+	crc = crc32c(crc, buf, size);
+	bh->crc = saved;
+
+	return cpu_to_le32(crc);
+}
+
+static void printf_header(struct scoutfs_meta_image_header *hdr)
+{
+	errf("magic: 0x%016llx\n"
+	     "total_bytes: %llu\n"
+	     "version: %u\n",
+	       le64_to_cpu(hdr->magic),
+	       le64_to_cpu(hdr->total_bytes),
+	       le32_to_cpu(hdr->version));
+}
+
+typedef ssize_t (*rw_func_t)(int fd, void *buf, size_t count, off_t offset);
+
+static inline ssize_t rw_read(int fd, void *buf, size_t count, off_t offset)
+{
+	return read(fd, buf, count);
+}
+
+static inline ssize_t rw_pread(int fd, void *buf, size_t count, off_t offset)
+{
+	return pread(fd, buf, count, offset);
+}
+
+static inline ssize_t rw_write(int fd, void *buf, size_t count, off_t offset)
+{
+	return write(fd, buf, count);
+}
+
+static inline ssize_t rw_pwrite(int fd, void *buf, size_t count, off_t offset)
+{
+	return pwrite(fd, buf, count, offset);
+}
+
+static int rw_full_count(rw_func_t func, u64 *tot, int fd, void *buf, size_t count, off_t offset)
+{
+	ssize_t sret;
+
+	while (count > 0) {
+		sret = func(fd, buf, count, offset);
+		if (sret <= 0 || sret > count) {
+			if (sret < 0)
+				return -errno;
+			else
+				return -EIO;
+		}
+		if (tot)
+			*tot += sret;
+		buf += sret;
+		offset += sret;
+		count -= sret;
+	}
+
+	return 0;
+}
+
+static int read_image(struct image_args *args, int fd, struct block_bitmaps *bm)
+{
+	struct scoutfs_meta_image_block_header bh;
+	struct scoutfs_meta_image_header hdr;
+	u64 opening;
+	void *buf;
+	off_t off;
+	u64 bit;
+	u64 ra;
+	int ret;
+
+	buf = malloc(SCOUTFS_BLOCK_LG_SIZE);
+	if (!buf) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	hdr.magic = cpu_to_le64(SCOUTFS_META_IMAGE_HEADER_MAGIC);
+	hdr.total_bytes = cpu_to_le64(sizeof(hdr) +
+				      (bm->count * (SCOUTFS_BLOCK_LG_SIZE + sizeof(bh))));
+	hdr.version = cpu_to_le32(1);
+
+	if (args->show_header) {
+		printf_header(&hdr);
+		ret = 0;
+		goto out;
+	}
+
+	ret = rw_full_count(rw_write, NULL, STDOUT_FILENO, &hdr, sizeof(hdr), 0);
+	if (ret < 0)
+		goto out;
+
+	opening = args->ra_window;
+	ra = 0;
+	bit = 0;
+
+	for (bit = 0; (bit = find_next_set_bit(bm->bits, bit, bm->size)) < bm->size; bit++) {
+
+		/* readahead to open the full window, then a block at a time */
+		do {
+			ra = find_next_set_bit(bm->bits, ra, bm->size);
+			if (ra < bm->size) {
+				off = ra << SCOUTFS_BLOCK_LG_SHIFT;
+				posix_fadvise(fd, off, SCOUTFS_BLOCK_LG_SIZE, POSIX_FADV_WILLNEED);
+				ra++;
+				if (opening)
+					opening -= min(opening, SCOUTFS_BLOCK_LG_SIZE);
+			}
+		} while (opening > 0 && ra < bm->size);
+
+		off = bit << SCOUTFS_BLOCK_LG_SHIFT;
+		ret = rw_full_count(rw_pread, NULL, fd, buf, SCOUTFS_BLOCK_LG_SIZE, off);
+		if (ret < 0)
+			goto out;
+
+		/*
+		 * Might as well try to drop the pages we've used to
+		 * reduce memory pressure on our read-ahead pages that
+		 * are waiting.
+		 */
+		posix_fadvise(fd, off, SCOUTFS_BLOCK_LG_SIZE, POSIX_FADV_DONTNEED);
+
+		bh.magic = cpu_to_le64(SCOUTFS_META_IMAGE_BLOCK_HEADER_MAGIC);
+		bh.offset = cpu_to_le64(off);
+		bh.size = cpu_to_le32(SCOUTFS_BLOCK_LG_SIZE);
+		bh.crc = calc_crc(&bh, buf, SCOUTFS_BLOCK_LG_SIZE);
+
+		ret = rw_full_count(rw_write, NULL, STDOUT_FILENO, &bh, sizeof(bh), 0) ?:
+		      rw_full_count(rw_write, NULL, STDOUT_FILENO, buf, SCOUTFS_BLOCK_LG_SIZE, 0);
+		if (ret < 0)
+			goto out;
+	}
+
+out:
+	free(buf);
+
+	return ret;
+}
+
+static int invalid_header(struct scoutfs_meta_image_header *hdr)
+{
+	if (le64_to_cpu(hdr->magic) != SCOUTFS_META_IMAGE_HEADER_MAGIC) {
+		errf("bad image header magic 0x%016llx (!= expected 0x%016llx)\n",
+		       le64_to_cpu(hdr->magic), SCOUTFS_META_IMAGE_HEADER_MAGIC);
+
+	} else if (le32_to_cpu(hdr->version) != 1) {
+		errf("unknown image header version %u\n", le32_to_cpu(hdr->version));
+
+	} else {
+		return 0;
+	}
+
+	return -EIO;
+}
+
+/*
+ * Doesn't catch offset+size overflowing, presumes pwrite() will return
+ * an error.
+ */
+static int invalid_block_header(struct scoutfs_meta_image_block_header *bh)
+{
+	if (le64_to_cpu(bh->magic) != SCOUTFS_META_IMAGE_BLOCK_HEADER_MAGIC) {
+		errf("bad block header magic 0x%016llx (!= expected 0x%016llx)\n",
+		       le64_to_cpu(bh->magic), SCOUTFS_META_IMAGE_BLOCK_HEADER_MAGIC);
+
+	} else if (le32_to_cpu(bh->size) == 0) {
+		errf("invalid block header size %u\n", le32_to_cpu(bh->size));
+
+	} else if (le32_to_cpu(bh->size) > SIZE_MAX) {
+		errf("block header size %u too large for size_t (> %zu)\n",
+		       le32_to_cpu(bh->size), (size_t)SIZE_MAX);
+
+	} else if (le64_to_cpu(bh->offset) > OFF_MAX) {
+		errf("block header offset %llu too large for off_t (> %llu)\n",
+		       le64_to_cpu(bh->offset), (u64)OFF_MAX);
+
+	} else {
+		return 0;
+	}
+
+	return -EIO;
+}
+
+static int write_image(struct image_args *args, int fd, struct block_bitmaps *bm)
+{
+	struct scoutfs_meta_image_block_header bh;
+	struct scoutfs_meta_image_header hdr;
+	size_t writeback_batch = (2 * 1024 * 1024);
+	size_t buf_size;
+	size_t dirty;
+	size_t size;
+	off_t first;
+	off_t last;
+	off_t off;
+	__le32 calc;
+	void *buf;
+	u64 tot;
+	int ret;
+
+	tot = 0;
+
+	ret = rw_full_count(rw_read, &tot, STDIN_FILENO, &hdr, sizeof(hdr), 0);
+	if (ret < 0)
+		goto out;
+
+	if (args->show_header) {
+		printf_header(&hdr);
+		ret = 0;
+		goto out;
+	}
+
+	ret = invalid_header(&hdr);
+	if (ret < 0)
+		goto out;
+
+	dirty = 0;
+	first = OFF_MAX;
+	last = 0;
+	buf = NULL;
+	buf_size = 0;
+
+	while (tot < le64_to_cpu(hdr.total_bytes)) {
+
+		ret = rw_full_count(rw_read, &tot, STDIN_FILENO, &bh, sizeof(bh), 0);
+		if (ret < 0)
+			goto out;
+
+		ret = invalid_block_header(&bh);
+		if (ret < 0)
+			goto out;
+
+		size = le32_to_cpu(bh.size);
+		if (buf_size < size) {
+			buf = realloc(buf, size);
+			if (!buf) {
+				ret = -ENOMEM;
+				goto out;
+			}
+
+			buf_size = size;
+		}
+
+		ret = rw_full_count(rw_read, &tot, STDIN_FILENO, buf, size, 0);
+		if (ret < 0)
+			goto out;
+
+		calc = calc_crc(&bh, buf, size);
+		if (calc != bh.crc) {
+			errf("crc mismatch for block at offset %llu\n", le64_to_cpu(bh.offset));
+			ret = -EIO;
+			goto out;
+		}
+
+		off = le64_to_cpu(bh.offset);
+
+		ret = rw_full_count(rw_pwrite, NULL, fd, buf, size, off);
+		if (ret < 0)
+			goto out;
+
+		dirty += size;
+		first = min(first, off);
+		last = max(last, off + (off_t)size);
+		if (dirty >= writeback_batch) {
+			posix_fadvise(fd, first, last - first, POSIX_FADV_DONTNEED);
+			dirty = 0;
+			first = OFF_MAX;
+			last = 0;
+		}
+	}
+
+	ret = fsync(fd);
+	if (ret < 0) {
+		ret = -errno;
+		goto out;
+	}
+
+out:
+	return ret;
+}
+
+static int do_image(struct image_args *args)
+{
+	struct block_bitmaps bm = { .bits = NULL };
+	int meta_fd = -1;
+	u64 dev_size;
+	int flags;
+	int ret;
+
+	flags = args->is_read ? O_RDONLY : O_RDWR;
+
+	meta_fd = open(args->meta_device, flags);
+	if (meta_fd < 0) {
+		ret = -errno;
+		errf("failed to open meta device '%s': %s (%d)\n",
+		     args->meta_device, strerror(errno), errno);
+		goto out;
+	}
+
+	if (args->is_read) {
+		ret = flush_device(meta_fd);
+		if (ret < 0)
+			goto out;
+
+		ret = get_device_size(args->meta_device, meta_fd, &dev_size);
+		if (ret < 0)
+			goto out;
+
+		bm.size = DIV_ROUND_UP(dev_size, SCOUTFS_BLOCK_LG_SIZE);
+		bm.bits = calloc(1, round_up(bm.size, BITS_PER_LONG) / 8);
+		if (!bm.bits) {
+			ret = -ENOMEM;
+			goto out;
+		}
+
+		ret = block_setup(meta_fd, 128 * 1024 * 1024, 32 * 1024 * 1024) ?:
+		      check_supers(-1) ?:
+		      get_ref_bits(&bm) ?:
+		      read_image(args, meta_fd, &bm);
+		block_shutdown();
+	} else {
+		ret = write_image(args, meta_fd, &bm);
+	}
+out:
+	free(bm.bits);
+
+	if (meta_fd >= 0)
+		close(meta_fd);
+
+	return ret;
+}
+
+static int parse_opt(int key, char *arg, struct argp_state *state)
+{
+	struct image_args *args = state->input;
+	int ret;
+
+	switch (key) {
+	case 'h':
+		args->show_header = true;
+		break;
+	case 'r':
+		ret = parse_u64(arg, &args->ra_window);
+		if (ret)
+			argp_error(state, "readahead window parse error");
+		break;
+	case ARGP_KEY_ARG:
+		if (!args->meta_device)
+			args->meta_device = strdup_or_error(state, arg);
+		else
+			argp_error(state, "more than one device argument given");
+		break;
+	case ARGP_KEY_FINI:
+		if (!args->meta_device)
+			argp_error(state, "no metadata device argument given");
+		break;
+	default:
+		break;
+	}
+
+	return 0;
+}
+
+static struct argp_option options[] = {
+	{ "show-header", 'h', NULL, 0, "Print image header and exit without processing stream" },
+	{ "readahead", 'r', "NR", 0, "Maintain read-ahead window of NR bytes" },
+	{ NULL }
+};
+
+static struct argp read_image_argp = {
+	options,
+	parse_opt,
+	"META-DEVICE",
+	"Read metadata image stream from metadata device file"
+};
+
+#define DEFAULT_RA_WINDOW (512 * 1024)
+
+static int read_image_cmd(int argc, char **argv)
+{
+	struct image_args image_args = {
+		.is_read = true,
+		.ra_window = DEFAULT_RA_WINDOW,
+	};
+	int ret;
+
+	ret = argp_parse(&read_image_argp, argc, argv, 0, NULL, &image_args);
+	if (ret)
+		return ret;
+
+	return do_image(&image_args);
+}
+
+static struct argp write_image_argp = {
+	options,
+	parse_opt,
+	"META-DEVICE",
+	"Write metadata image stream to metadata device file"
+};
+
+static int write_image_cmd(int argc, char **argv)
+{
+	struct image_args image_args = {
+		.is_read = false,
+		.ra_window = DEFAULT_RA_WINDOW,
+	};
+	int ret;
+
+	ret = argp_parse(&write_image_argp, argc, argv, 0, NULL, &image_args);
+	if (ret)
+		return ret;
+
+	return do_image(&image_args);
+}
+
+static void __attribute__((constructor)) image_ctor(void)
+{
+	cmd_register_argp("read-metadata-image", &read_image_argp, GROUP_CORE, read_image_cmd);
+	cmd_register_argp("write-metadata-image", &write_image_argp, GROUP_CORE, write_image_cmd);
+}
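Review note: read_image() above computes the stream length as one image header plus, for each referenced block, one block header and one large block. The following is a hedged standalone sketch of that arithmetic. The `example_` names are hypothetical stand-ins for the packed structs in the patch, and the 64KiB large block size is an assumption; the real value comes from format.h.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed large block size (64KiB); the real value comes from format.h. */
#define EXAMPLE_BLOCK_LG_SIZE	(64ULL * 1024)

/* Stand-ins for the packed on-stream headers defined in the patch. */
struct example_image_header {
	uint64_t magic;
	uint64_t total_bytes;
	uint32_t version;
} __attribute__((packed));

struct example_block_header {
	uint64_t magic;
	uint64_t offset;
	uint32_t size;
	uint32_t crc;
} __attribute__((packed));

/* Total stream length for nr_blocks referenced blocks. */
static uint64_t example_image_total_bytes(uint64_t nr_blocks)
{
	return sizeof(struct example_image_header) +
	       nr_blocks * (EXAMPLE_BLOCK_LG_SIZE +
			    sizeof(struct example_block_header));
}
```

write_image() relies on this total to know when to stop reading from stdin, so the reader and writer have to agree on the packed header sizes.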
@@ -0,0 +1,15 @@
+#ifndef _SCOUTFS_UTILS_CHECK_ITER_H_
+#define _SCOUTFS_UTILS_CHECK_ITER_H_
+
+/*
+ * Callbacks can return an otherwise unused -errno to indicate that
+ * iteration should stop early and return 0 for success.
+ */
+#define ECHECK_ITER_DONE EL2HLT
+
+static inline int xlate_iter_errno(int ret)
+{
+	return ret == -ECHECK_ITER_DONE ? 0 : ret;
+}
+
+#endif
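Review note: the sentinel errno above lets a callback end iteration early without the caller treating it as a failure. The following is a minimal sketch of the pattern, assuming a Linux errno.h that defines EL2HLT. The iterator `iter_blknos()`, the callback `find_blkno()`, and `struct find_arg` are hypothetical and only illustrate how xlate_iter_errno() would be applied.

```c
#include <assert.h>
#include <errno.h>

#define ECHECK_ITER_DONE EL2HLT

typedef int (*blk_cb_t)(unsigned long long blkno, void *arg);

static int xlate_iter_errno(int ret)
{
	return ret == -ECHECK_ITER_DONE ? 0 : ret;
}

/* Hypothetical iterator: visit blknos in order until the callback stops it. */
static int iter_blknos(const unsigned long long *blknos, int nr,
		       blk_cb_t cb, void *arg)
{
	int ret = 0;
	int i;

	for (i = 0; i < nr && ret == 0; i++)
		ret = cb(blknos[i], arg);

	/* the sentinel means "stop, but successfully" */
	return xlate_iter_errno(ret);
}

struct find_arg {
	unsigned long long want;
	int visited;
};

/* Hypothetical callback: stop early once the wanted block is seen. */
static int find_blkno(unsigned long long blkno, void *arg)
{
	struct find_arg *fa = arg;

	fa->visited++;
	return blkno == fa->want ? -ECHECK_ITER_DONE : 0;
}
```

Any other negative return from the callback still propagates to the caller unchanged, so real errors are not masked by the sentinel.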
@@ -0,0 +1,98 @@
+#include <unistd.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <errno.h>
+
+#include "sparse.h"
+#include "util.h"
+#include "format.h"
+#include "key.h"
+
+#include "alloc.h"
+#include "btree.h"
+#include "debug.h"
+#include "extent.h"
+#include "iter.h"
+#include "sns.h"
+#include "log_trees.h"
+#include "super.h"
+
+struct iter_args {
+	extent_cb_t cb;
+	void *cb_arg;
+};
+
+static int lt_meta_iter(struct scoutfs_key *key, void *val, u16 val_len, void *cb_arg)
+{
+	struct iter_args *ia = cb_arg;
+	struct scoutfs_log_trees *lt;
+	int ret;
+
+	if (val_len != sizeof(struct scoutfs_log_trees))
+		return -EINVAL; /* XXX should be reported as a problem */
+
+	lt = val;
+
+	sns_push("log_trees", le64_to_cpu(lt->rid), le64_to_cpu(lt->nr));
+
+	debug("lt rid 0x%016llx nr %llu", le64_to_cpu(lt->rid), le64_to_cpu(lt->nr));
+
+	sns_push("meta_avail", 0, 0);
+	ret = alloc_list_meta_iter(&lt->meta_avail, ia->cb, ia->cb_arg);
+	sns_pop();
+	if (ret < 0)
+		goto out;
+
+	sns_push("meta_freed", 0, 0);
+	ret = alloc_list_meta_iter(&lt->meta_freed, ia->cb, ia->cb_arg);
+	sns_pop();
+	if (ret < 0)
+		goto out;
+
+	sns_push("item_root", 0, 0);
+	ret = btree_meta_iter(&lt->item_root, ia->cb, ia->cb_arg);
+	sns_pop();
+	if (ret < 0)
+		goto out;
+
+	if (lt->bloom_ref.blkno) {
+		sns_push("bloom_ref", 0, 0);
+		ret = ia->cb(le64_to_cpu(lt->bloom_ref.blkno), 1, ia->cb_arg);
+		sns_pop();
+		if (ret < 0) {
+			ret = xlate_iter_errno(ret);
+			goto out;
+		}
+	}
+
+	sns_push("data_avail", 0, 0);
+	ret = alloc_root_meta_iter(&lt->data_avail, ia->cb, ia->cb_arg);
+	sns_pop();
+	if (ret < 0)
+		goto out;
+
+	sns_push("data_freed", 0, 0);
+	ret = alloc_root_meta_iter(&lt->data_freed, ia->cb, ia->cb_arg);
+	sns_pop();
+	if (ret < 0)
+		goto out;
+
+	ret = 0;
+out:
+	sns_pop();
+
+	return ret;
+}
+
+/*
+ * Call the caller's callback with the extent of all the metadata block references contained
+ * in log btrees.  We walk the logs_root btree items and walk all the metadata structures
+ * they reference.
+ */
+int log_trees_meta_iter(extent_cb_t cb, void *cb_arg)
+{
+	struct scoutfs_super_block *super = global_super;
+	struct iter_args ia = { .cb = cb, .cb_arg = cb_arg };
+
+	return btree_item_iter(&super->logs_root, lt_meta_iter, &ia);
+}
@@ -0,0 +1,8 @@
+#ifndef _SCOUTFS_UTILS_CHECK_LOG_TREES_H_
+#define _SCOUTFS_UTILS_CHECK_LOG_TREES_H_
+
+#include "extent.h"
+
+int log_trees_meta_iter(extent_cb_t cb, void *cb_arg);
+
+#endif
@@ -0,0 +1,367 @@
+#include <unistd.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <sys/mman.h>
+#include <errno.h>
+
+#include "sparse.h"
+#include "util.h"
+#include "format.h"
+#include "bitmap.h"
+#include "key.h"
+
+#include "alloc.h"
+#include "btree.h"
+#include "debug.h"
+#include "extent.h"
+#include "sns.h"
+#include "log_trees.h"
+#include "meta.h"
+#include "problem.h"
+#include "super.h"
+
+static struct meta_data {
+	struct extent_root meta_refed;
+	struct extent_root meta_free;
+	struct {
+		u64 ref_blocks;
+		u64 free_extents;
+		u64 free_blocks;
+	} stats;
+} global_mdat;
+
+bool valid_meta_blkno(u64 blkno)
+{
+	u64 tot = le64_to_cpu(global_super->total_meta_blocks);
+
+	return blkno >= SCOUTFS_META_DEV_START_BLKNO && blkno < tot;
+}
+
+static bool valid_meta_extent(u64 start, u64 len)
+{
+	u64 tot = le64_to_cpu(global_super->total_meta_blocks);
+	bool valid;
+
+	valid = len > 0 &&
+		start >= SCOUTFS_META_DEV_START_BLKNO &&
+		start < tot &&
+		len <= tot &&
+		((start + len) <= tot) &&
+		((start + len) > start);
+
+	debug("start %llu len %llu valid %u", start, len, !!valid);
+
+	if (!valid)
+		problem(PB_META_EXTENT_INVALID, "start %llu len %llu", start, len);
+
+	return valid;
+}
+
+/*
+ * Track references to individual metadata blocks.  This uses the extent
+ * callback type but is only ever called for single block references.
+ * Any reference to a block that has already been referenced is
+ * considered invalid and is ignored.  Later repair will resolve
+ * duplicate references.
+ */
+static int insert_meta_ref(u64 start, u64 len, void *arg)
+{
+	struct meta_data *mdat = &global_mdat;
+	struct extent_root *root = arg;
+	int ret = 0;
+
+	/* this is tracking single metadata block references */
+	if (len != 1) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (valid_meta_blkno(start)) {
+		ret = extent_insert_new(root, start, len);
+		if (ret == 0)
+			mdat->stats.ref_blocks++;
+		else if (ret == -EEXIST)
+			problem(PB_META_REF_OVERLAPS_EXISTING, "blkno %llu", start);
+	}
+
+out:
+	return ret;
+}
+
+static int insert_meta_free(u64 start, u64 len, void *arg)
+{
+	struct meta_data *mdat = &global_mdat;
+	struct extent_root *root = arg;
+	int ret = 0;
+
+	if (valid_meta_extent(start, len)) {
+		ret = extent_insert_new(root, start, len);
+		if (ret == 0) {
+			mdat->stats.free_extents++;
+			mdat->stats.free_blocks += len;
+
+		} else if (ret == -EEXIST) {
+			problem(PB_META_FREE_OVERLAPS_EXISTING,
+				"start %llu len %llu", start, len);
+		}
+
+	}
+
+	return ret;
+}
+
+/*
+ * Walk all metadata references in the system.  This walk doesn't need
+ * to read metadata that doesn't contain any metadata references so it
+ * can skip the bulk of metadata blocks.  This gives us the set of
+ * referenced metadata blocks which we can then use to repair metadata
+ * allocator structures.
+ */
+static int get_meta_refs(void)
+{
+	struct meta_data *mdat = &global_mdat;
+	struct scoutfs_super_block *super = global_super;
+	int ret;
+
+	extent_root_init(&mdat->meta_refed);
+
+	/* XXX record reserved blocks around super as referenced */
+
+	sns_push("meta_alloc", 0, 0);
+	ret = alloc_root_meta_iter(&super->meta_alloc[0], insert_meta_ref, &mdat->meta_refed);
+	sns_pop();
+	if (ret < 0)
+		goto out;
+
+	sns_push("meta_alloc", 1, 0);
+	ret = alloc_root_meta_iter(&super->meta_alloc[1], insert_meta_ref, &mdat->meta_refed);
+	sns_pop();
+	if (ret < 0)
+		goto out;
+
+	sns_push("data_alloc", 0, 0);
+	ret = alloc_root_meta_iter(&super->data_alloc, insert_meta_ref, &mdat->meta_refed);
+	sns_pop();
+	if (ret < 0)
+		goto out;
+
+	sns_push("server_meta_avail", 0, 0);
+	ret = alloc_list_meta_iter(&super->server_meta_avail[0],
+				   insert_meta_ref, &mdat->meta_refed);
+	sns_pop();
+	if (ret < 0)
+		goto out;
+
+	sns_push("server_meta_avail", 1, 0);
+	ret = alloc_list_meta_iter(&super->server_meta_avail[1],
+				   insert_meta_ref, &mdat->meta_refed);
+	sns_pop();
+	if (ret < 0)
+		goto out;
+
+	sns_push("server_meta_freed", 0, 0);
+	ret = alloc_list_meta_iter(&super->server_meta_freed[0],
+				   insert_meta_ref, &mdat->meta_refed);
+	sns_pop();
+	if (ret < 0)
+		goto out;
+
+	sns_push("server_meta_freed", 1, 0);
+	ret = alloc_list_meta_iter(&super->server_meta_freed[1],
+				   insert_meta_ref, &mdat->meta_refed);
+	sns_pop();
+	if (ret < 0)
+		goto out;
+
+	sns_push("fs_root", 0, 0);
+	ret = btree_meta_iter(&super->fs_root, insert_meta_ref, &mdat->meta_refed);
+	sns_pop();
+	if (ret < 0)
+		goto out;
+
+	sns_push("logs_root", 0, 0);
+	ret = btree_meta_iter(&super->logs_root, insert_meta_ref, &mdat->meta_refed);
+	sns_pop();
+	if (ret < 0)
+		goto out;
+
+	sns_push("log_merge", 0, 0);
+	ret = btree_meta_iter(&super->log_merge, insert_meta_ref, &mdat->meta_refed);
+	sns_pop();
+	if (ret < 0)
+		goto out;
+
+	sns_push("mounted_clients", 0, 0);
+	ret = btree_meta_iter(&super->mounted_clients, insert_meta_ref, &mdat->meta_refed);
+	sns_pop();
+	if (ret < 0)
+		goto out;
+
+	sns_push("srch_root", 0, 0);
+	ret = btree_meta_iter(&super->srch_root, insert_meta_ref, &mdat->meta_refed);
+	sns_pop();
+	if (ret < 0)
+		goto out;
+
+	ret = log_trees_meta_iter(insert_meta_ref, &mdat->meta_refed);
+	if (ret < 0)
+		goto out;
+
+	debug("found %llu referenced metadata blocks", mdat->stats.ref_blocks);
+	ret = 0;
+out:
+	return ret;
+}
+
+static int get_meta_free(void)
+{
+	struct meta_data *mdat = &global_mdat;
+	struct scoutfs_super_block *super = global_super;
+	int ret;
+
+	extent_root_init(&mdat->meta_free);
+
+	sns_push("meta_alloc", 0, 0);
+	ret = alloc_root_extent_iter(&super->meta_alloc[0], insert_meta_free, &mdat->meta_free);
+	sns_pop();
+	if (ret < 0)
+		goto out;
+
+	sns_push("meta_alloc", 1, 0);
+	ret = alloc_root_extent_iter(&super->meta_alloc[1], insert_meta_free, &mdat->meta_free);
+	sns_pop();
+	if (ret < 0)
+		goto out;
+
+	sns_push("server_meta_avail", 0, 0);
+	ret = alloc_list_extent_iter(&super->server_meta_avail[0],
+				     insert_meta_free, &mdat->meta_free);
+	sns_pop();
+	if (ret < 0)
+		goto out;
+
+	sns_push("server_meta_avail", 1, 0);
+	ret = alloc_list_extent_iter(&super->server_meta_avail[1],
+				     insert_meta_free, &mdat->meta_free);
+	sns_pop();
+	if (ret < 0)
+		goto out;
+
+	sns_push("server_meta_freed", 0, 0);
+	ret = alloc_list_extent_iter(&super->server_meta_freed[0],
+				     insert_meta_free, &mdat->meta_free);
+	sns_pop();
+	if (ret < 0)
+		goto out;
+
+	sns_push("server_meta_freed", 1, 0);
+	ret = alloc_list_extent_iter(&super->server_meta_freed[1],
+				     insert_meta_free, &mdat->meta_free);
+	sns_pop();
+	if (ret < 0)
+		goto out;
+
+	debug("found %llu free metadata blocks in %llu extents",
+	       mdat->stats.free_blocks, mdat->stats.free_extents);
+	ret = 0;
+out:
+	return ret;
+}
+
+/*
+ * All the space between referenced blocks must be recorded in the free
+ * extents.  The free extent walk didn't check whether the free extents
+ * overlapped with references, so we do that here.  Remember that
+ * metadata block references were merged into extents, so the refed
+ * extents aren't necessarily all a single block.
+ */
+static int compare_refs_and_free(void)
+{
+	struct meta_data *mdat = &global_mdat;
+	struct extent_node *ref;
+	struct extent_node *free;
+	struct extent_node *next;
+	struct extent_node *prev;
+	u64 expect;
+	u64 start;
+	u64 end;
+
+	expect = 0;
+	ref = extent_first(&mdat->meta_refed);
+	free = extent_first(&mdat->meta_free);
+	while (ref || free) {
+
+		debug("exp %llu ref %llu.%llu free %llu.%llu",
+			expect, ref ? ref->start : 0, ref ? ref->len : 0,
+			free ? free->start : 0, free ? free->len : 0);
+
+		/* referenced marked free, remove ref from free and continue from same point */
+		if (ref && free && extents_overlap(ref->start, ref->len, free->start, free->len)) {
+			debug("ref extent %llu.%llu overlaps free %llu.%llu",
+				ref->start, ref->len, free->start, free->len);
+
+			start = max(ref->start, free->start);
+			end = min(ref->start + ref->len, free->start + free->len);
+
+			prev = extent_prev(free);
+
+			extent_remove(&mdat->meta_free, start, end - start);
+
+			if (prev)
+				free = extent_next(prev);
+			else
+				free = extent_first(&mdat->meta_free);
+			continue;
+		}
+
+		/* see which extent starts earlier */
+		if (!free || (ref && ref->start <= free->start))
+			next = ref;
+		else
+			next = free;
+
+		/* untracked region before next extent */
+		if (expect < next->start) {
+			debug("missing free extent %llu.%llu", expect, next->start - expect);
+			expect = next->start;
+			continue;
+		}
+
+
+		/* didn't overlap, advance past next extent */
+		expect = next->start + next->len;
+		if (next == ref)
+			ref = extent_next(ref);
+		else
+			free = extent_next(free);
+	}
+
+	return 0;
+}
+
+/*
+ * Check the metadata allocators by comparing the set of referenced
+ * blocks with the set of free blocks that are stored in free btree
+ * items and alloc list blocks.
+ */
+int check_meta_alloc(void)
+{
+	int ret;
+
+	ret = get_meta_refs();
+	if (ret < 0)
+		goto out;
+
+	ret = get_meta_free();
+	if (ret < 0)
+		goto out;
+
+	ret = compare_refs_and_free();
+	if (ret < 0)
+		goto out;
+
+	ret = 0;
+out:
+	return ret;
+}
@@ -0,0 +1,9 @@
+#ifndef _SCOUTFS_UTILS_CHECK_META_H_
+#define _SCOUTFS_UTILS_CHECK_META_H_
+
+bool valid_meta_blkno(u64 blkno);
+
+int check_meta_alloc(void);
+
+#endif
+
@@ -0,0 +1,23 @@
+#include <string.h>
+#include <stdbool.h>
+
+#include "util.h"
+#include "padding.h"
+
+bool padding_is_zeros(const void *data, size_t sz)
+{
+	static char zeros[32] = {0,};
+	const size_t batch = array_size(zeros);
+
+	while (sz >= batch) {
+		if (memcmp(data, zeros, batch))
+			return false;
+		data += batch;
+		sz -= batch;
+	}
+
+	if (sz > 0 && memcmp(data, zeros, sz))
+		return false;
+
+	return true;
+}
@@ -0,0 +1,6 @@
+#ifndef _SCOUTFS_UTILS_CHECK_PADDING_H_
+#define _SCOUTFS_UTILS_CHECK_PADDING_H_
+
+bool padding_is_zeros(const void *data, size_t sz);
+
+#endif
@@ -0,0 +1,44 @@
+#include <stdio.h>
+#include <stdint.h>
+
+#include "problem.h"
+
+#define PROB_STR(pb) [pb] = #pb
+char *prob_strs[] = {
+	PROB_STR(PB_META_EXTENT_INVALID),
+	PROB_STR(PB_META_REF_OVERLAPS_EXISTING),
+	PROB_STR(PB_META_FREE_OVERLAPS_EXISTING),
+	PROB_STR(PB_BTREE_BLOCK_BAD_LEVEL),
+	PROB_STR(PB_SB_HDR_CRC_INVALID),
+	PROB_STR(PB_SB_HDR_MAGIC_INVALID),
+	PROB_STR(PB_FS_IN_USE),
+	PROB_STR(PB_MOUNTED_CLIENTS_REF_BLKNO),
+	PROB_STR(PB_SB_BAD_FLAG),
+	PROB_STR(PB_SB_BAD_FMT_VERS),
+	PROB_STR(PB_QCONF_WRONG_VERSION),
+	PROB_STR(PB_QSLOT_BAD_FAM),
+	PROB_STR(PB_QSLOT_BAD_PORT),
+	PROB_STR(PB_QSLOT_NO_ADDR),
+	PROB_STR(PB_QSLOT_BAD_ADDR),
+	PROB_STR(PB_DATA_DEV_SB_INVALID),
+};
+
+static struct problem_data {
+	uint64_t counts[PB__NR];
+	uint64_t count;
+} global_pdat;
+
+void problem_record(prob_t pb)
+{
+	struct problem_data *pdat = &global_pdat;
+
+	pdat->counts[pb]++;
+	pdat->count++;
+}
+
+uint64_t problems_count(void)
+{
+	struct problem_data *pdat = &global_pdat;
+
+	return pdat->count;
+}
@@ -0,0 +1,38 @@
+#ifndef _SCOUTFS_UTILS_CHECK_PROBLEM_H_
+#define _SCOUTFS_UTILS_CHECK_PROBLEM_H_
+
+#include "debug.h"
+#include "sns.h"
+
+typedef enum {
+	PB_META_EXTENT_INVALID,
+	PB_META_REF_OVERLAPS_EXISTING,
+	PB_META_FREE_OVERLAPS_EXISTING,
+	PB_BTREE_BLOCK_BAD_LEVEL,
+	PB_SB_HDR_CRC_INVALID,
+	PB_SB_HDR_MAGIC_INVALID,
+	PB_FS_IN_USE,
+	PB_MOUNTED_CLIENTS_REF_BLKNO,
+	PB_SB_BAD_FLAG,
+	PB_SB_BAD_FMT_VERS,
+	PB_QCONF_WRONG_VERSION,
+	PB_QSLOT_BAD_FAM,
+	PB_QSLOT_BAD_PORT,
+	PB_QSLOT_NO_ADDR,
+	PB_QSLOT_BAD_ADDR,
+	PB_DATA_DEV_SB_INVALID,
+	PB__NR,
+} prob_t;
+
+extern char *prob_strs[];
+
+#define problem(pb, fmt, ...)							\
+do {										\
+	debug("problem found: "#pb": %s: "fmt, sns_str(), __VA_ARGS__);	\
+	problem_record(pb);							\
+} while (0)
+
+void problem_record(prob_t pb);
+uint64_t problems_count(void);
+
+#endif
@@ -0,0 +1,118 @@
+#include <stdlib.h>
+#include <string.h>
+
+#include "sns.h"
+
+/*
+ * This "str num stack" is used to describe our location in metadata at
+ * any given time.
+ *
+ * As we descend into structures we push a string on describing them,
+ * perhaps with associated numbers.  Pushing and popping is very cheap
+ * and only rarely do we format the stack into a string, as an arbitrary
+ * example:
+ *   super.fs_root.btree_parent:1231.btree_leaf:3231
+ */
+
+#define SNS_MAX_DEPTH	1000
+#define SNS_STR_SIZE	(SNS_MAX_DEPTH * (1 + SNS_MAX_STR_LEN + 2 * (1 + 16)) + 1)
+
+static struct sns_data {
+	unsigned int depth;
+
+	struct sns_entry {
+		char *str;
+		size_t len;
+		u64 a;
+		u64 b;
+	} ents[SNS_MAX_DEPTH];
+
+	char str[SNS_STR_SIZE];
+
+} global_lsdat;
+
+void _sns_push(char *str, size_t len, u64 a, u64 b)
+{
+	struct sns_data *lsdat = &global_lsdat;
+
+	if (lsdat->depth < SNS_MAX_DEPTH) {
+		lsdat->ents[lsdat->depth++] = (struct sns_entry) {
+			.str = str,
+			.len = len,
+			.a = a,
+			.b = b,
+		};
+	}
+}
+
+void sns_pop(void)
+{
+	struct sns_data *lsdat = &global_lsdat;
+
+	if (lsdat->depth > 0)
+		lsdat->depth--;
+}
+
+static char *append_str(char *pos, char *str, size_t len)
+{
+	memcpy(pos, str, len);
+	return pos + len;
+}
+
+/*
+ * Emit hex digits most significant first, skipping leading zeros.  This
+ * is not called for x = 0 so we don't need to emit an initial 0.
+ */
+static char *append_u64x(char *pos, u64 x)
+{
+	static char hex[] = "0123456789abcdef";
+	int shift = 64;
+
+	while (shift > 0 && !(x >> (shift - 4)))
+		shift -= 4;
+	while ((shift -= 4) >= 0)
+		*pos++ = hex[(x >> shift) & 0xf];
+	return pos;
+}
+
+static char *append_char(char *pos, char c)
+{
+	*(pos++) = c;
+	return pos;
+}
+
+/*
+ * Return a pointer to a null terminated string that describes the
+ * current location stack.  The string buffer is global.
+ */
+char *sns_str(void)
+{
+	struct sns_data *lsdat = &global_lsdat;
+	struct sns_entry *ent;
+	char *pos;
+	int i;
+
+	pos = lsdat->str;
+	for (i = 0; i < lsdat->depth; i++) {
+		ent = &lsdat->ents[i];
+
+		if (i)
+			pos = append_char(pos, '.');
+
+		pos = append_str(pos, ent->str, ent->len);
+
+		if (ent->a) {
+			pos = append_char(pos, ':');
+			pos = append_u64x(pos, ent->a);
+		}
+
+		if (ent->b) {
+			pos = append_char(pos, ':');
+			pos = append_u64x(pos, ent->b);
+		}
+	}
+
+	*pos = '\0';
+
+	return lsdat->str;
+}
@@ -0,0 +1,20 @@
+#ifndef _SCOUTFS_UTILS_CHECK_SNS_H_
+#define _SCOUTFS_UTILS_CHECK_SNS_H_
+
+#include <assert.h>
+
+#include "sparse.h"
+
+#define SNS_MAX_STR_LEN 20
+
+#define sns_push(str, a, b)					\
+do {								\
+	build_assert(sizeof(str) - 1 <= SNS_MAX_STR_LEN);	\
+	_sns_push((str), sizeof(str) - 1, a, b);		\
+} while (0)
+
+void _sns_push(char *str, size_t len, u64 a, u64 b);
+void sns_pop(void);
+char *sns_str(void);
+
+#endif
@@ -0,0 +1,252 @@
+#include <unistd.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <errno.h>
+#include <sys/socket.h>
+#include <arpa/inet.h>
+
+#include "sparse.h"
+#include "util.h"
+#include "format.h"
+#include "crc.h"
+
+#include "block.h"
+#include "super.h"
+#include "problem.h"
+
+/*
+ * After we check the super blocks we provide a global buffer to track
+ * the current super block.  It is referenced to get static information
+ * about the system and is also modified and written as part of
+ * transactions.
+ */
+struct scoutfs_super_block *global_super;
+
+/*
+ * Check superblock crc. We can't use global_super here since it's not the
+ * whole block itself, but only the struct scoutfs_super_block, so it needs
+ * to reload a copy here.
+ */
+int check_super_crc(void)
+{
+	struct scoutfs_super_block *super = NULL;
+	struct scoutfs_block_header *hdr;
+	struct block *blk = NULL;
+	u32 crc;
+	int ret;
+
+	ret = block_get(&blk, SCOUTFS_SUPER_BLKNO, BF_SM | BF_DIRTY);
+	if (ret < 0) {
+		fprintf(stderr, "error reading super block\n");
+		return ret;
+	}
+
+	super = block_buf(blk);
+	crc = crc_block((struct scoutfs_block_header *)super, block_size(blk));
+	hdr = &global_super->hdr;
+	debug("superblock crc 0x%08x calculated 0x%08x %s", le32_to_cpu(hdr->crc), crc, le32_to_cpu(hdr->crc) == crc ? "(match)" : "(mismatch)");
+
+	if (crc != le32_to_cpu(hdr->crc))
+		problem(PB_SB_HDR_CRC_INVALID, "crc 0x%08x calculated 0x%08x", le32_to_cpu(hdr->crc), crc);
+	block_put(&blk);
+
+	return 0;
+}
+
+/*
+ * Crude check for the unlikely cases where the fs appears to still be mounted.
+ */
+int check_super_in_use(int meta_fd)
+{
+	int ret = meta_super_in_use(meta_fd, global_super);
+	debug("meta_super_in_use ret %d", ret);
+
+	if (ret < 0)
+		problem(PB_FS_IN_USE, "File system appears in use. ret %d", ret);
+
+	debug("global_super->mounted_clients.ref.blkno 0x%08llx", global_super->mounted_clients.ref.blkno);
+	if (global_super->mounted_clients.ref.blkno != 0)
+		problem(PB_MOUNTED_CLIENTS_REF_BLKNO, "Mounted clients ref blkno 0x%08llx",
+			 global_super->mounted_clients.ref.blkno);
+
+	return ret;
+}
+
+/*
+ * Quick-glance data device superblock checks.
+ *
+ * Returns -EIO for crc failures and -EINVAL for all other check failures.
+ *
+ * The caller must have run check_supers() first so that global_super is
+ * set up and we can cross-reference it.
+ */
+static int check_data_super(int data_fd)
+{
+	struct scoutfs_super_block *super = NULL;
+	char *buf;
+	int ret = 0;
+	u32 crc;
+	ssize_t size = SCOUTFS_BLOCK_SM_SIZE;
+	off_t off = SCOUTFS_SUPER_BLKNO << SCOUTFS_BLOCK_SM_SHIFT;
+
+	buf = aligned_alloc(4096, size); /* XXX static alignment :/ */
+	if (!buf)
+		return -ENOMEM;
+
+	memset(buf, 0, size);
+
+	/* every failure after the allocation must exit through out to free buf */
+	if (lseek(data_fd, off, SEEK_SET) != off ||
+	    read(data_fd, buf, size) < 0) {
+		ret = -errno;
+		goto out;
+	}
+
+	super = (struct scoutfs_super_block *)buf;
+
+	crc = crc_block((struct scoutfs_block_header *)buf, size);
+
+	debug("data fsid 0x%016llx", le64_to_cpu(super->hdr.fsid));
+	debug("data super magic 0x%08x", le32_to_cpu(super->hdr.magic));
+	debug("data crc calc 0x%08x exp 0x%08x %s", crc, le32_to_cpu(super->hdr.crc),
+	      crc == le32_to_cpu(super->hdr.crc) ? "(match)" : "(mismatch)");
+	debug("data flags %llu fmt_vers %llu", le64_to_cpu(super->flags), le64_to_cpu(super->fmt_vers));
+
+	if (crc != le32_to_cpu(super->hdr.crc))
+		/* tis but a scratch */
+		ret = -EIO;
+
+	if (le64_to_cpu(super->hdr.fsid) != le64_to_cpu(global_super->hdr.fsid))
+		/* mismatched data bdev? not good */
+		ret = -EINVAL;
+
+	if (le32_to_cpu(super->hdr.magic) != SCOUTFS_BLOCK_MAGIC_SUPER)
+		/* fsid matched but not a superblock? yikes */
+		ret = -EINVAL;
+
+	if (le64_to_cpu(super->flags) != 0) /* !SCOUTFS_FLAG_IS_META_BDEV */
+		ret = -EINVAL;
+
+	if ((le64_to_cpu(super->fmt_vers) < SCOUTFS_FORMAT_VERSION_MIN) ||
+	    (le64_to_cpu(super->fmt_vers) > SCOUTFS_FORMAT_VERSION_MAX))
+		ret = -EINVAL;
+
+	if (ret != 0)
+		problem(PB_DATA_DEV_SB_INVALID, "data device is invalid or corrupt (%d)", ret);
+out:
+	free(buf);
+	return ret;
+}
+
+/*
+ * After checking the supers we save a copy of it in a global buffer that's used by
+ * other modules to track the current super.  It can be modified and written during commits.
+ */
+int check_supers(int data_fd)
+{
+	struct scoutfs_super_block *super = NULL;
+	struct block *blk = NULL;
+	struct scoutfs_quorum_slot* slot = NULL;
+	struct in_addr in;
+	uint16_t family;
+	uint16_t port;
+	int ret;
+
+	sns_push("supers", 0, 0);
+
+	global_super = malloc(sizeof(struct scoutfs_super_block));
+	if (!global_super) {
+		fprintf(stderr, "error allocating super block buffer\n");
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	ret = block_get(&blk, SCOUTFS_SUPER_BLKNO, BF_SM);
+	if (ret < 0) {
+		fprintf(stderr, "error reading super block\n");
+		goto out;
+	}
+
+	ret = block_hdr_valid(blk, SCOUTFS_SUPER_BLKNO, BF_SM, SCOUTFS_BLOCK_MAGIC_SUPER);
+
+	super = block_buf(blk);
+
+	if (ret < 0) {
+		/* -EINVAL is a bad header, -EIO is only a crc mismatch */
+		if (ret == -EINVAL) {
+			/* that's really bad */
+			fprintf(stderr, "superblock invalid magic\n");
+			goto out;
+		} else if (ret == -EIO)
+			/* just report/count a CRC error */
+			problem(PB_SB_HDR_CRC_INVALID, "superblock crc invalid: stored 0x%08x",
+				le32_to_cpu(super->hdr.crc));
+	}
+
+	memcpy(global_super, super, sizeof(struct scoutfs_super_block));
+
+	debug("Superblock flag: %llu", le64_to_cpu(global_super->flags));
+	if (le64_to_cpu(global_super->flags) != SCOUTFS_FLAG_IS_META_BDEV)
+		problem(PB_SB_BAD_FLAG, "Bad flag: %llu expecting: %llu", le64_to_cpu(global_super->flags), (u64)SCOUTFS_FLAG_IS_META_BDEV);
+
+	debug("Superblock fmt_vers: %llu", le64_to_cpu(global_super->fmt_vers));
+	if ((le64_to_cpu(global_super->fmt_vers) < SCOUTFS_FORMAT_VERSION_MIN) ||
+	    (le64_to_cpu(global_super->fmt_vers) > SCOUTFS_FORMAT_VERSION_MAX))
+		problem(PB_SB_BAD_FMT_VERS, "Bad fmt_vers: %llu outside supported range (%d-%d)",
+			le64_to_cpu(global_super->fmt_vers), SCOUTFS_FORMAT_VERSION_MIN,
+			SCOUTFS_FORMAT_VERSION_MAX);
+
+	debug("Quorum Config Version: %llu", le64_to_cpu(global_super->qconf.version));
+	if (le64_to_cpu(global_super->qconf.version) != 1)
+		problem(PB_QCONF_WRONG_VERSION, "Wrong Version: %llu (expected 1)", le64_to_cpu(global_super->qconf.version));
+
+	for (int i = 0; i < SCOUTFS_QUORUM_MAX_SLOTS; i++) {
+		slot = &global_super->qconf.slots[i];
+		family = le16_to_cpu(slot->addr.v4.family);
+		port = le16_to_cpu(slot->addr.v4.port);
+		in.s_addr = le32_to_cpu(slot->addr.v4.addr);
+
+		if (family == SCOUTFS_AF_NONE) {
+			debug("Quorum slot %u is empty", i);
+			continue;
+		}
+
+		debug("Quorum slot %u family: %u, port: %u, address: %s", i, family, port, inet_ntoa(in));
+		if (family != SCOUTFS_AF_IPV4)
+			problem(PB_QSLOT_BAD_FAM, "Quorum Slot %u doesn't have a valid address family", i);
+
+		if (port == 0)
+			problem(PB_QSLOT_BAD_PORT, "Quorum Slot %u has bad port", i);
+
+		if (!in.s_addr) {
+			problem(PB_QSLOT_NO_ADDR, "Quorum Slot %u has not been assigned an ipv4 address", i);
+		} else if (!(in.s_addr & 0xff000000)) {
+			problem(PB_QSLOT_BAD_ADDR, "Quorum Slot %u has invalid ipv4 address", i);
+		} else if ((in.s_addr & 0xff) == 0xff) {
+			problem(PB_QSLOT_BAD_ADDR, "Quorum Slot %u has invalid ipv4 address", i);
+		}
+	}
+
+	debug("super magic 0x%08x", le32_to_cpu(global_super->hdr.magic));
+	if (le32_to_cpu(global_super->hdr.magic) != SCOUTFS_BLOCK_MAGIC_SUPER)
+		problem(PB_SB_HDR_MAGIC_INVALID, "superblock magic invalid: 0x%08x is not 0x%08x",
+			le32_to_cpu(global_super->hdr.magic), SCOUTFS_BLOCK_MAGIC_SUPER);
+
+	/* `scoutfs image` command doesn't open data_fd */
+	if (data_fd < 0)
+		ret = 0;
+	else
+		ret = check_data_super(data_fd);
+out:
+	block_put(&blk);
+
+	sns_pop();
+
+	return ret;
+}
+
+void super_shutdown(void)
+{
+	free(global_super);
+}
@@ -0,0 +1,12 @@
+#ifndef _SCOUTFS_UTILS_CHECK_SUPER_H_
+#define _SCOUTFS_UTILS_CHECK_SUPER_H_
+
+extern struct scoutfs_super_block *global_super;
+
+int check_super_crc(void);
+int check_supers(int data_fd);
+int super_commit(void);
+int check_super_in_use(int meta_fd);
+void super_shutdown(void);
+
+#endif
@@ -0,0 +1,125 @@
+#ifndef _SCOUTFS_PARALLEL_RESTORE_H_
+#define _SCOUTFS_PARALLEL_RESTORE_H_
+
+#include <errno.h>
+
+struct scoutfs_parallel_restore_progress {
+	struct scoutfs_btree_root fs_items;
+	struct scoutfs_btree_root root_items;
+	struct scoutfs_srch_file sfl;
+	struct scoutfs_block_ref bloom_ref;
+	__le64 inode_count;
+	__le64 max_ino;
+};
+
+struct scoutfs_parallel_restore_slice {
+	__le64 fsid;
+	__le64 meta_start;
+	__le64 meta_len;
+};
+
+struct scoutfs_parallel_restore_entry {
+	u64 dir_ino;
+	u64 pos;
+	u64 ino;
+	mode_t mode;
+	char *name;
+	unsigned int name_len;
+};
+
+struct scoutfs_parallel_restore_xattr {
+	u64 ino;
+	u64 pos;
+	char *name;
+	unsigned int name_len;
+	void *value;
+	unsigned int value_len;
+};
+
+struct scoutfs_parallel_restore_inode {
+	/* all inodes */
+	u64 ino;
+	u64 meta_seq;
+	u64 data_seq;
+	u64 nr_xattrs;
+	u32 uid;
+	u32 gid;
+	u32 mode;
+	u32 rdev;
+	u32 flags;
+	u8 pad[4];
+	struct timespec atime;
+	struct timespec ctime;
+	struct timespec mtime;
+	struct timespec crtime;
+	u64 proj;
+
+	/* regular files */
+	u64 data_version;
+	u64 size;
+	bool offline;
+
+	/* only used for directories */
+	u64 nr_subdirs;
+	u64 total_entry_name_bytes;
+
+	/* only used for symlinks */
+	char *target;
+	unsigned int target_len; /* not including null terminator */
+};
+
+struct scoutfs_parallel_restore_quota_rule {
+	u64 limit;
+	u8  prio;
+	u8  op;
+	u8  rule_flags;
+	struct quota_rule_name {
+		u64 val;
+		u8  source;
+		u8  flags;
+	} names[3];
+	char *value;
+	unsigned int value_len;
+};
+
+typedef __typeof__(EINVAL) spr_err_t;
+
+struct scoutfs_parallel_restore_writer;
+
+spr_err_t scoutfs_parallel_restore_create_writer(struct scoutfs_parallel_restore_writer **wrip);
+void scoutfs_parallel_restore_destroy_writer(struct scoutfs_parallel_restore_writer **wrip);
+
+spr_err_t scoutfs_parallel_restore_init_slices(struct scoutfs_parallel_restore_writer *wri,
+					       struct scoutfs_parallel_restore_slice *slices,
+					       int nr);
+spr_err_t scoutfs_parallel_restore_add_slice(struct scoutfs_parallel_restore_writer *wri,
+					    struct scoutfs_parallel_restore_slice *slice);
+spr_err_t scoutfs_parallel_restore_get_slice(struct scoutfs_parallel_restore_writer *wri,
+					    struct scoutfs_parallel_restore_slice *slice);
+
+spr_err_t scoutfs_parallel_restore_add_inode(struct scoutfs_parallel_restore_writer *wri,
+					     struct scoutfs_parallel_restore_inode *inode);
+spr_err_t scoutfs_parallel_restore_add_entry(struct scoutfs_parallel_restore_writer *wri,
+					     struct scoutfs_parallel_restore_entry *entry);
+spr_err_t scoutfs_parallel_restore_add_xattr(struct scoutfs_parallel_restore_writer *wri,
+					     struct scoutfs_parallel_restore_xattr *xattr);
+
+spr_err_t scoutfs_parallel_restore_get_progress(struct scoutfs_parallel_restore_writer *wri,
+						struct scoutfs_parallel_restore_progress *prog);
+spr_err_t scoutfs_parallel_restore_add_progress(struct scoutfs_parallel_restore_writer *wri,
+						struct scoutfs_parallel_restore_progress *prog);
+
+spr_err_t scoutfs_parallel_restore_add_quota_rule(struct scoutfs_parallel_restore_writer *wri,
+						struct scoutfs_parallel_restore_quota_rule *rule);
+
+spr_err_t scoutfs_parallel_restore_write_buf(struct scoutfs_parallel_restore_writer *wri,
+					     void *buf, size_t len, off_t *off_ret,
+					     size_t *count_ret);
+
+spr_err_t scoutfs_parallel_restore_import_super(struct scoutfs_parallel_restore_writer *wri,
+						struct scoutfs_super_block *super, int fd);
+spr_err_t scoutfs_parallel_restore_export_super(struct scoutfs_parallel_restore_writer *wri,
+						struct scoutfs_super_block *super);
+
+
+#endif
@@ -7,6 +7,7 @@
 #include <errno.h>
 #include <stdio.h>
 #include <stdlib.h>
+#include <wordexp.h>

 #include "util.h"
 #include "format.h"
@@ -17,15 +18,26 @@

 static int open_path(char *path, int flags)
 {
+	wordexp_t exp_result = { 0 };
 	int ret;

-	ret = open(path, flags);
+	ret = wordexp(path, &exp_result, WRDE_NOCMD | WRDE_SHOWERR | WRDE_UNDEF);
+	if (ret) {
+		fprintf(stderr, "wordexp() failure for \"%s\": %d\n", path, ret);
+		ret = -EINVAL;
+		goto out;
+	}
+
+	ret = open(exp_result.we_wordv[0], flags);
 	if (ret < 0) {
 		ret = -errno;
 		fprintf(stderr, "failed to open '%s': %s (%d)\n",
 			path, strerror(errno), errno);
 	}

+out:
+	wordfree(&exp_result);
+
 	return ret;
 }
Author	SHA1	Message	Date
Chao Wang	305361c5ea	WIP	2024-10-28 15:50:47 -07:00
Chao Wang	eb244065dc	WIP	2024-10-28 15:35:10 -07:00
Chao Wang	2de875e4d8	WIP	2024-10-28 14:34:30 -07:00
Chao Wang	f47dcba80c	WIP	2024-10-28 14:21:08 -07:00
Chao Wang	1f345e350d	WIP	2024-10-25 14:45:52 -07:00
Auke Kok	ae4b55a147	Add basic parallel_restore test script This script executes the basic parallel_restore test binary which incorporates the parallel restore library. The test binary creates a few files with xattrs. After restoring, we mount the filesystem and do some basic checks to see that the restore was complete. Added just under and just over ENOSPC cases to make sure that we are returning the final 5% of this disk that we reserver for log trees. Signed-off-by: Auke Kok <auke.kok@versity.com> Signed-off-by: Hunter Shaffer <hunter.shaffer@versity.com>	2024-10-17 13:35:35 -07:00
Hunter Shaffer	2be15d416d	Add Quota support Adds a function to to insert quota rules as filesystem items. This will then have an outward facing function that takes a writer and a mirror of the _squota_rule struct in quota.c and is called _parallel_restore_quota_rule. Adds testing to make sure we are restoring a test quota. Signed-off-by: Hunter Shaffer <hunter.shaffer@versity.com>	2024-10-17 12:01:04 -07:00
Hunter Shaffer	c4147a7e8d	Add Retention Flag support Adds a check in the inode creation path that checks whether the retention feature is present. If it is then we set the inode's flag value otherwise it stays as 0. Signed-off-by: Hunter Shaffer <hunter.shaffer@versity.com>	2024-10-17 12:01:04 -07:00
Hunter Shaffer	d653c78504	Add Project ID support Project IDs add a new field 'proj' to the inode struct. In this patch we simply check if the feature is present before we compile and if it is we allocate the field within the parallel_restore inode and make sure this value set in the restored inode. If it is not present we ignore it. Signed-off-by: Hunter Shaffer <hunter.shaffer@versity.com>	2024-10-17 12:01:04 -07:00
Hunter Shaffer	0d910eb7ab	Check if source device has been mounted The filesystem we are restoring into needs to be empty and never mounted. Here we check the all of the quorum blocks timestamps to see whether the device we are restoring into has been mounted before. Adds a test in the test script that attempts to restore a previously mounted device. Signed-off-by: Hunter Shaffer <hunter.shaffer@versity.com> Signed-off-by: Auke Kok <auke.kok@versity.com>	2024-10-17 12:01:04 -07:00
Hunter Shaffer	41b1d1180b	Check device format before restore When we import the superblock we first check whether the filesystem is at least format version 2. We then verify that the device given is a meta_dev. The test script now also initialize the new filesystem to format_V2. Adds coverage in the test script for the new checks. Signed-off-by: Hunter Shaffer <hunter.shaffer@versity.com>	2024-10-17 11:42:16 -07:00
Auke Kok	130e10626d	Copy a tree using parallel restore library. This tool compies a source tree (whether it's scoutfs or not) into an offline scoutfs meta device. It has only those 2 parameters and does a single-process walk of the tree to restore all items while preservice as much of the metadata as possible. Signed-off-by: Auke Kok <auke.kok@versity.com>	2024-10-17 11:42:16 -07:00
Auke Kok	281cd4f87a	Create a 4k offline extent for each regular file. After this change, all files have a single offline extent: ``` $ sudo src/filefrag-gc57857a5 -b4096 -v /mnt/scratch/top-0/file-1094 Filesystem type is: 554f4353 File size of /mnt/scratch/top-0/file-1094 is 4096 (1 block of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 0: 0.. 0: 1: last,unknown_loc,eof /mnt/scratch/top-0/file-1094: 1 extent found ``` Signed-off-by: Auke Kok <auke.kok@versity.com>	2024-10-17 11:42:16 -07:00
Auke Kok	c78b5cdecc	Detect child process exiting with errors. Adds a signal handler for SIGCHLD and sets up a signal handler with SA_SIGINFO. This way we can inspect the exit code emitted by the child and abort processing when a child process exits with an error. Without this handler, any child process that exits with e.g. ENOSPC will keep the parent hanging indefinitely. Signed-off-by: Auke Kok <auke.kok@versity.com>	2024-10-17 11:42:16 -07:00
Auke Kok	6da7034d48	Pass meta_seq and data_seq to _restore_inode. This allows callers to pass in seq values for generated inodes. The tester code initializes them now before calling, instead of being hard set in the library. Signed-off-by: Auke Kok <auke.kok@versity.com>	2024-10-17 11:42:16 -07:00
Auke Kok	60e14e20dc	Fix offline extents not being able to be created. While online extents (non-zero size) worked just fine with this code, the offline extent code inserted a btree item without the appropriate key, which results in duplicate (null) keys being inserted, hence the "duplicate" error. All that is needed to fix is to put the created key in the btree item to be inserted. Signed-off-by: Auke Kok <auke.kok@versity.com>	2024-10-17 11:42:16 -07:00
Auke Kok	5316905d12	Fix symlink insertion. This block never called insert_fs_item() creating dangling keys that never got inserted. Additionally, the _sk_second member is le64 and we have to use the proper intrinsic to increment it. Signed-off-by: Auke Kok <auke.kok@versity.com>	2024-10-17 11:42:16 -07:00
Zach Brown	ea41b198a4	Fix printing alloc list block extents The list alloc blocks have an array of blknos that are offset by a start field in the block header. The print code wasn't using that and was always referencing the beginning of the array, which could miss blocks. Signed-off-by: Zach Brown <zab@versity.com>	2024-10-17 11:42:16 -07:00
Zach Brown	027a6ebce6	Import a few more functions to our list.h Import a few more functions from the kernel's list.h into our imported copy. Signed-off-by: Zach Brown <zab@versity.com>	2024-10-17 11:42:16 -07:00
Zach Brown	1ac0e5bfd3	Add test for parallel restore Signed-off-by: Zach Brown <zab@versity.com>	2024-10-17 11:42:16 -07:00
Zach Brown	f6a40de3b0	Add parallel restore Signed-off-by: Zach Brown <zab@versity.com>	2024-10-17 11:42:16 -07:00
Zach Brown	17451841bf	Add userspace NSEC_PER_SEC Signed-off-by: Zach Brown <zab@versity.com>	2024-10-17 11:42:16 -07:00
Zach Brown	51f50529fc	Add bloom filter index calc for userspace utils Signed-off-by: Zach Brown <zab@versity.com>	2024-10-17 11:42:16 -07:00
Zach Brown	7707d98b54	Add srch_encode_entry() for userspace utils Signed-off-by: Zach Brown <zab@versity.com>	2024-10-17 11:42:16 -07:00
Zach Brown	8c195ee4ab	Add put_unaligned_leXX() for userspace Signed-off-by: Zach Brown <zab@versity.com>	2024-10-17 11:42:16 -07:00
Zach Brown	7b5f59ca53	Add fls64() alias for userspace flsll() Signed-off-by: Zach Brown <zab@versity.com>	2024-10-17 11:42:16 -07:00
Zach Brown	597ce6a4c0	Promote userspace btree block initialization Signed-off-by: Zach Brown <zab@versity.com>	2024-10-17 11:42:16 -07:00
Zach Brown	afeeb47918	Add userspace version of our mode to type Signed-off-by: Zach Brown <zab@versity.com>	2024-10-17 11:42:16 -07:00
Zach Brown	660f46a3b4	Add userspace version of our dirent name hash Signed-off-by: Zach Brown <zab@versity.com>	2024-10-17 11:42:16 -07:00
Zach Brown	4697424c7c	Add lk rbtree wrapper Import the kernel's rbtree implementation with a wrapper so we can use it from userspace. Signed-off-by: Zach Brown <zab@versity.com>	2024-10-17 11:42:16 -07:00
Auke Kok	11f624926b	Superblock checks for meta and data dev. We check superblock magic, crc, flags. data device superblock is checked but a little less thorough. We check whether the device is still mounted, since that would make checking invalid to begin with. Quorum blocks are validated to have sane contents. We add a global problem counter so we can trivially measure and report whether any problem was found at all, instead of iterating over all the problems and checking each individual count. We pick the standard exit code values from `fsck` and mirror their intentional behavior. This results in `fsck.scoutfs` can now be trivially created by making it a wrapper around `scoutfs check`. Signed-off-by: Auke Kok <auke.kok@versity.com> Signed-off-by: Hunter Shaffer <hunter.shaffer@versity.com>	2024-10-17 11:42:16 -07:00
Auke Kok	173e0f1edd	Add man page content for check. Adds basic man page content for the `check` subcommand. Signed-off-by: Auke Kok <auke.kok@versity.com>	2024-10-17 11:42:16 -07:00
Auke Kok	ca57794a00	Generic block header checks: crc, magic. Generally as we call block_get() we should validate that if the block has a hdr, at a minimum the crc is correct and the magic value is the expected value passed, and the fsid matches the superblock. This function implements just that. Returns -EINVAL, up to the caller to report a problem() and handle the outcome. For now the code just hard fails, which incedentally makes it fail the clobber-repair.sh tests I wrote. Signed-off-by: Auke Kok <auke.kok@versity.com>	2024-10-17 11:42:16 -07:00
Zach Brown	f5f39f4432	Add test_bit to utils bitmap Add test_bit() to the trivial utils bitmap.c implementation. Signed-off-by: Zach Brown <zab@versity.com>	2024-10-17 11:42:16 -07:00
Zach Brown	022e280f0b	Add {read,write}-metadata-image scoutfs commands Signed-off-by: Zach Brown <zab@versity.com>	2024-10-17 11:42:16 -07:00
Zach Brown	897f26c839	Fix partial rename to check_meta_alloc As I was committing the initial check command I had only partially completed a rename of the function that checks the metadata allocators. Signed-off-by: Zach Brown <zab@versity.com>	2024-10-17 11:42:16 -07:00
Zach Brown	25d5b507a1	Add check command Signed-off-by: Zach Brown <zab@versity.com> Signed-off-by: Auke Kok <auke.kok@versity.com>	2024-10-17 11:41:42 -07:00