Previously we added an ilookup variant that ignored I_FREEING inodes to avoid a deadlock between lock invalidation (lock->I_FREEING) and eviction (I_FREEING->lock).

Now we're seeing similar deadlocks between eviction (I_FREEING->lock) and fh_to_dentry's iget (lock->I_FREEING).

I think it's reasonable to ignore all inodes with I_FREEING set when we're using our _test callback in ilookup or iget. We can remove the _nofreeing ilookup variant and move its I_FREEING test into the iget_test callback provided to both ilookup and iget.

Callers will get the same result; it will just happen without waiting for a previously I_FREEING inode to leave. From ilookup they'll get NULL instead of waiting. From iget they'll allocate and start to initialize a newer instance of the inode and insert it alongside the previous instance.

We don't have inode number re-use, so we don't have the problem where a newly allocated inode number is relying on inode cache serialization to not find a previously allocated inode that is being evicted.

This change does allow for concurrent iget of an inode number that is being deleted on a local node. This could happen in fh_to_dentry with a raw inode number. But this was already a problem between mounts because they don't have a shared inode cache to serialize them. Once we fix that between nodes, we fix it on a single node as well.

Signed-off-by: Zach Brown <zab@versity.com>
Introduction
scoutfs is a clustered in-kernel Linux filesystem designed and built from the ground up to support large archival systems.
Its key differentiating features are:
- Integrated consistent indexing accelerates archival maintenance operations
- Commit logs allow nodes to write concurrently without contention
It meets best of breed expectations:
- Fully consistent POSIX semantics between nodes
- Rich metadata to ensure the integrity of metadata references
- Atomic transactions to maintain consistent persistent structures
- First class kernel implementation for high performance and low latency
- Open GPLv2 implementation
Learn more in the white paper.
Current Status
Alpha Open Source Development
scoutfs is under heavy active development. We're developing it in the open to give the community an opportunity to affect the design and implementation.
The core architectural design elements are in place. Much surrounding functionality hasn't been implemented. It's appropriate for early adopters and interested developers, not for production use.
In that vein, expect significant incompatible changes to the formats of both network messages and persistent structures. Since the format hash-checking has now been removed in preparation for release, making a new filesystem with mkfs is strongly recommended if there is any doubt.
The current kernel module is developed against the RHEL/CentOS 7.x kernel to minimize the friction of developing and testing with partners' existing infrastructure. Once we're happy with the design we'll shift development to the upstream kernel while maintaining distro compatibility branches.
Community Mailing List
Please join us on the open scoutfs-devel@scoutfs.org mailing list hosted on Google Groups for all discussion of scoutfs.
Quick Start
The following is a very rough example of the procedure to get up and running; experience will be needed to fill in the gaps. We're happy to help on the mailing list.
The requirements for running scoutfs on a small cluster are:
- One or more nodes running x86-64 CentOS/RHEL 7.4 (or 7.3)
- Access to two shared block devices
- IPv4 connectivity between the nodes
The steps for getting scoutfs mounted and operational are:
- Get the kernel module running on the nodes
- Make a new filesystem on the devices with the userspace utilities
- Mount the devices on all the nodes
In this example we use three nodes. The names of the block devices are the same on all the nodes. Two of the nodes will be quorum members. A majority of quorum members must be mounted to elect a leader to run a server that all the mounts connect to. Note that with two quorum members a majority is one, so each member is a majority by itself. Split-brain elections are therefore possible, but they are unlikely enough that this is fine for a demonstration.
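The majority arithmetic can be sketched as a small shell function. This is only an illustration: `votes_needed` is a hypothetical helper, not a scoutfs command, and the rule for three or more members (a strict majority) is an assumption. The text only states the two-member case.

```shell
# Hypothetical helper, not part of scoutfs: votes needed to elect a leader.
# Assumption: fewer than three members need a single vote (the two-member
# case described in the text); larger sets need a strict majority.
votes_needed() {
    members=$1
    if [ "$members" -lt 3 ]; then
        echo 1
    else
        echo $(( members / 2 + 1 ))
    fi
}

for n in 1 2 3 5; do
    echo "$n quorum members: $(votes_needed "$n") vote(s) needed"
done
```

Under this assumed rule, three quorum members would be the smallest configuration in which two members cannot each elect themselves.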
Get the Kernel Module and Userspace Binaries
- Either use snapshot RPMs built from git by Versity:
    rpm -i https://scoutfs.s3-us-west-2.amazonaws.com/scoutfs-repo-0.0.1-1.el7_4.noarch.rpm
    yum install scoutfs-utils kmod-scoutfs
- Or use the binaries built from checked out git repositories:
    yum install kernel-devel
    git clone git@github.com:versity/scoutfs.git
    make -C scoutfs
    modprobe libcrc32c
    insmod scoutfs/kmod/src/scoutfs.ko
    alias scoutfs=$PWD/scoutfs/utils/src/scoutfs
Make a New Filesystem (destroys contents)
We specify quorum slots with the addresses of each of the quorum member nodes, the metadata device, and the data device.
    scoutfs mkfs -Q 0,$NODE0_ADDR,12345 -Q 1,$NODE1_ADDR,12345 /dev/meta_dev /dev/data_dev
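To show how the slot numbers line up with the member addresses, here is a minimal sketch that only prints the mkfs command instead of running it. `print_mkfs_cmd` is a hypothetical helper and the addresses are placeholders. The `-Q slot,address,port` form, the port 12345, and the device paths are taken from the command above.

```shell
# Hypothetical helper: print (do not run) a mkfs command for the given
# quorum member addresses, assigning slot numbers in the order given.
# Port and device paths are copied from the example command.
print_mkfs_cmd() {
    port=12345
    slot=0
    args=""
    for addr in "$@"; do
        args="$args -Q $slot,$addr,$port"
        slot=$(( slot + 1 ))
    done
    echo "scoutfs mkfs$args /dev/meta_dev /dev/data_dev"
}

# Placeholder addresses; substitute the real quorum member addresses.
print_mkfs_cmd 192.0.2.10 192.0.2.11
```

Printing the command first makes it easy to review before running something that destroys the contents of the devices.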
Mount the Filesystem
First, mount each of the quorum nodes so that they can elect and start a server for the remaining node to connect to. The slot numbers were specified with the leading "0,..." and "1,..." in the mkfs options above.
    mount -t scoutfs -o quorum_slot_nr=$SLOT_NR,metadev_path=/dev/meta_dev /dev/data_dev /mnt/scoutfs
Then mount the remaining node which can now connect to the running server.
    mount -t scoutfs -o metadev_path=/dev/meta_dev /dev/data_dev /mnt/scoutfs
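The only difference between the two mount commands is the `quorum_slot_nr` option. A rough sketch that prints the right command for each kind of node follows. `print_mount_cmd` is a hypothetical helper, and the paths and option names are the ones used in the commands above.

```shell
# Hypothetical helper: print (do not run) the mount command for a node.
# Pass the node's quorum slot number, or an empty string for a node that
# is not a quorum member.
print_mount_cmd() {
    slot_nr=$1
    opts="metadev_path=/dev/meta_dev"
    if [ -n "$slot_nr" ]; then
        opts="quorum_slot_nr=$slot_nr,$opts"
    fi
    echo "mount -t scoutfs -o $opts /dev/data_dev /mnt/scoutfs"
}

print_mount_cmd 0     # first quorum member
print_mount_cmd 1     # second quorum member
print_mount_cmd ""    # remaining node, mounted last
```

The order of the calls mirrors the procedure: quorum members first, so that a server is running before the remaining node tries to connect.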
For Kicks, Observe the Metadata Change Index
The meta_seq index tracks the inodes that are changed in each transaction.
    scoutfs walk-inodes meta_seq 0 -1 /mnt/scoutfs
    touch /mnt/scoutfs/one; sync
    scoutfs walk-inodes meta_seq 0 -1 /mnt/scoutfs
    touch /mnt/scoutfs/two; sync
    scoutfs walk-inodes meta_seq 0 -1 /mnt/scoutfs
    touch /mnt/scoutfs/one; sync
    scoutfs walk-inodes meta_seq 0 -1 /mnt/scoutfs