With many concurrent writers we were seeing excessive commits forced because the transaction thought the data allocator was running low. The transaction was checking the raw total_len value in the data_avail alloc_root for the number of free data blocks. But this read wasn't locked, and allocators could completely remove a large free extent and then re-insert a slightly smaller free extent as they perform their allocation. The transaction could see a temporary very small total_len and trigger a commit.

Data allocations are serialized by a heavy mutex so we don't want to have the reader try and use that to see a consistent total_len. Instead we create a data allocator run-time struct that has a consistent total_len that is updated after all the extent items are manipulated. This also gives us a place to put the caller's cached extent so that it can be included in the total_len. Previously it wasn't included in the free total that the transaction saw.

The file data allocator can then initialize and use this struct instead of its raw use of the root and cached extent. Then the transaction can sample its consistent total_len that reflects the root and cached extent.

A subtle detail is that fallocate can't use _free_data to return an allocated extent on error to the avail pool. It instead frees into the data_free pool like normal frees. It doesn't really matter that this could prematurely drain the avail pool because it's in an error path.

Signed-off-by: Zach Brown <zab@versity.com>
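The approach described in the commit message above can be illustrated with a minimal user-space sketch. This is not the scoutfs kernel code: the struct and function names (`data_alloc`, `data_alloc_blocks`, `data_alloc_is_low`) are hypothetical, a pthread mutex stands in for the kernel mutex, and the extent tree is reduced to a single counter. It only shows the idea of publishing a total after the extent items are consistent again, so a reader never observes the transient small value.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical run-time allocator state; names are illustrative only. */
struct data_alloc {
	pthread_mutex_t mutex;      /* serializes allocations (the heavy lock) */
	uint64_t root_total;        /* free blocks in extent items, mutex held */
	uint64_t cached_len;        /* free blocks in the cached extent, mutex held */
	_Atomic uint64_t total_len; /* published total, read without the mutex */
};

static void data_alloc_init(struct data_alloc *da, uint64_t root_total,
			    uint64_t cached_len)
{
	assert(da != NULL);
	pthread_mutex_init(&da->mutex, NULL);
	da->root_total = root_total;
	da->cached_len = cached_len;
	atomic_init(&da->total_len, root_total + cached_len);
}

/* Allocate count blocks from one free extent that is extent_len blocks long. */
static int data_alloc_blocks(struct data_alloc *da, uint64_t extent_len,
			     uint64_t count)
{
	int ret = -1;

	pthread_mutex_lock(&da->mutex);
	if (count <= extent_len && extent_len <= da->root_total) {
		/* Remove the whole extent: root_total is transiently too small. */
		da->root_total -= extent_len;
		/* Re-insert the slightly smaller remainder. */
		da->root_total += extent_len - count;
		/* Publish only once the extent items are consistent again. */
		atomic_store(&da->total_len, da->root_total + da->cached_len);
		ret = 0;
	}
	pthread_mutex_unlock(&da->mutex);
	return ret;
}

/* Lockless sample used by the "is the allocator running low?" check. */
static uint64_t data_alloc_total_len(struct data_alloc *da)
{
	return atomic_load(&da->total_len);
}

static int data_alloc_is_low(struct data_alloc *da, uint64_t low_water)
{
	return data_alloc_total_len(da) < low_water;
}
```

A transaction-like caller would only ever call `data_alloc_is_low()`, which never takes the mutex and never sees the intermediate state between the removal and the re-insertion.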
Introduction
scoutfs is a clustered in-kernel Linux filesystem designed and built from the ground up to support large archival systems.
Its key differentiating features are:
- Integrated consistent indexing accelerates archival maintenance operations
- Commit logs allow nodes to write concurrently without contention
It meets best of breed expectations:
- Fully consistent POSIX semantics between nodes
- Rich metadata to ensure the integrity of metadata references
- Atomic transactions to maintain consistent persistent structures
- First class kernel implementation for high performance and low latency
- Open GPLv2 implementation
Learn more in the white paper.
Current Status
Alpha Open Source Development
scoutfs is under heavy active development. We're developing it in the open to give the community an opportunity to affect the design and implementation.
The core architectural design elements are in place. Much surrounding functionality hasn't been implemented. It's appropriate for early adopters and interested developers, not for production use.
In that vein, expect significant incompatible changes to both the format of network messages and persistent structures. To avoid mistakes the implementation currently calculates a hash of the format and ioctl header files in the source tree. The kernel module will refuse to mount a volume created by userspace utilities with a mismatched hash, and it will refuse to connect to a remote node with a mismatched hash. This means having to unmount, mkfs, and remount everything across many functional changes. Once the format is nailed down we'll wire up forward and back compat machinery and remove this temporary safety measure.
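The hash check described above can be sketched roughly as follows. Everything here is an assumption for illustration: the README does not say which hash function is used or how the header contents are fed to it, so FNV-1a is swapped in for brevity, and `format_hash` and `check_format_hash` are hypothetical names rather than scoutfs functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* FNV-1a, 64-bit: a stand-in hash, not necessarily what scoutfs uses. */
static uint64_t fnv1a64(uint64_t hash, const char *data, size_t len)
{
	for (size_t i = 0; i < len; i++) {
		hash ^= (unsigned char)data[i];
		hash *= 0x100000001b3ULL; /* FNV 64-bit prime */
	}
	return hash;
}

/* Hash the text of each format/ioctl header in a fixed order. */
static uint64_t format_hash(const char *const *headers, size_t nr)
{
	uint64_t hash = 0xcbf29ce484222325ULL; /* FNV 64-bit offset basis */

	assert(headers != NULL);
	for (size_t i = 0; i < nr; i++)
		hash = fnv1a64(hash, headers[i], strlen(headers[i]));
	return hash;
}

/*
 * Return 0 if the two sides were built from identical headers, -1 if not.
 * The same comparison would apply to a volume at mount time and to a
 * remote node at connect time.
 */
static int check_format_hash(uint64_t ours, uint64_t theirs)
{
	return ours == theirs ? 0 : -1;
}
```

Any edit to a header, even one that does not change the layout, changes the hash in this sketch, which matches the conservative behavior the paragraph describes: a mismatch means unmount, mkfs, and remount.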
The current kernel module is developed against the RHEL/CentOS 7.x kernel to minimize the friction of developing and testing with partners' existing infrastructure. Once we're happy with the design we'll shift development to the upstream kernel while maintaining distro compatibility branches.
Community Mailing List
Please join us on the open scoutfs-devel@scoutfs.org mailing list hosted on Google Groups for all discussion of scoutfs.
Quick Start
The following is a very rough example of the procedure to get up and running; experience will be needed to fill in the gaps. We're happy to help on the mailing list.
The requirements for running scoutfs on a small cluster are:
- One or more nodes running x86-64 CentOS/RHEL 7.4 (or 7.3)
- Access to two shared block devices
- IPv4 connectivity between the nodes
The steps for getting scoutfs mounted and operational are:
- Get the kernel module running on the nodes
- Make a new filesystem on the devices with the userspace utilities
- Mount the devices on all the nodes
In this example we run all of these commands on three nodes. The names of the block devices are the same on all the nodes.
- Get the Kernel Module and Userspace Binaries

  - Either use snapshot RPMs built from git by Versity:

        rpm -i https://scoutfs.s3-us-west-2.amazonaws.com/scoutfs-repo-0.0.1-1.el7_4.noarch.rpm
        yum install scoutfs-utils kmod-scoutfs

  - Or use the binaries built from checked out git repositories:

        yum install kernel-devel
        git clone git@github.com:versity/scoutfs.git
        make -C scoutfs
        modprobe libcrc32c
        insmod scoutfs/kmod/src/scoutfs.ko
        alias scoutfs=$PWD/scoutfs/utils/src/scoutfs

- Make a New Filesystem (destroys contents, no questions asked)

  We specify that two of our three nodes must be present to form a quorum for the system to function.

      scoutfs mkfs -Q 2 /dev/meta_dev /dev/data_dev

- Mount the Filesystem

  Each mounting node provides its local IP address on which it will run an internal server for the other mounts if it is elected the leader by the quorum.

      mkdir /mnt/scoutfs
      mount -t scoutfs -o server_addr=$NODE_ADDR,metadev_path=/dev/meta_dev /dev/data_dev /mnt/scoutfs

- For Kicks, Observe the Metadata Change Index

  The meta_seq index tracks the inodes that are changed in each transaction.

      scoutfs walk-inodes meta_seq 0 -1 /mnt/scoutfs
      touch /mnt/scoutfs/one; sync
      scoutfs walk-inodes meta_seq 0 -1 /mnt/scoutfs
      touch /mnt/scoutfs/two; sync
      scoutfs walk-inodes meta_seq 0 -1 /mnt/scoutfs
      touch /mnt/scoutfs/one; sync
      scoutfs walk-inodes meta_seq 0 -1 /mnt/scoutfs