Zach Brown b1b75cbe9f Fix block cache shrink and read racing crash
The block cache wasn't safely handling the race between readers walking
the rcu radix_tree and the shrinker walking the LRU list.  A reader
could get a reference to a block that had been removed from the radix
and was queued for freeing.  It would clobber the free's llist_head
union member by putting the block back on the LRU, and then the read
and the free would both crash as they corrupted each other's memory.
We only rarely saw this in heavy load testing.

The fix is to clean up the use of rcu, refcounting, and freeing.

First, we get rid of the LRU list.  Now we don't have to worry about
resolving racing accesses of blocks between two independent structures.
Instead of having the shrinker walk the LRU list, we mark blocks on
access so that the shrinker can walk all blocks in any order and still
expect to quickly find candidates to shrink.

To make it easier to concurrently walk all the blocks we switch to the
rhashtable instead of the radix tree.  It also has nice per-bucket
locking so we can get rid of the global lock that protected the LRU list
and radix insertion.  (And it isn't limited to 'long' keys so we can get
rid of the check for max meta blknos that couldn't be cached.)
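As a rough illustration of that switch (not the actual scoutfs
definitions; the struct and field names here are invented for the
example), a cached block and its rhashtable parameters could look
something like this:

    #include <linux/rhashtable.h>
    #include <linux/atomic.h>
    #include <linux/llist.h>

    struct cached_block {
            u64 blkno;                      /* full 64bit key, no 'long' limit */
            struct rhash_head ht_head;      /* linkage in the hash table */
            atomic_t refcount;              /* includes the "hashed" high bit */
            unsigned long accessed;         /* marked on access for the shrinker */
            struct llist_node free_node;    /* queued here for deferred freeing */
    };

    static const struct rhashtable_params block_ht_params = {
            .key_len     = sizeof(u64),
            .key_offset  = offsetof(struct cached_block, blkno),
            .head_offset = offsetof(struct cached_block, ht_head),
    };

    /* insertion only contends on the per-bucket lock, no global lock */
    static int block_insert(struct rhashtable *ht, struct cached_block *bl)
    {
            return rhashtable_lookup_insert_fast(ht, &bl->ht_head,
                                                 block_ht_params);
    }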

Now we need to tighten up when read can get a reference and when shrink
can remove blocks.  Presence in the hash table holds a refcount, but we
make it a magic high bit in the refcount so that it can be
differentiated from other references.  Lookup can then atomically get a
reference to blocks that are still in the hash table, and shrinking can
atomically remove a block when its hash table presence is the only
remaining reference.
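A minimal sketch of that refcount rule, operating on the cached_block
from the earlier example; the bit value and helper names are assumptions
for illustration, not the actual scoutfs code:

    /* high bit meaning "this block is still present in the hash table" */
    #define BLOCK_IN_HASH   (1 << 30)

    /* lookup, under rcu_read_lock(): only take a reference while hashed */
    static bool block_get_if_hashed(struct cached_block *bl)
    {
            int refs;

            do {
                    refs = atomic_read(&bl->refcount);
                    if (!(refs & BLOCK_IN_HASH))
                            return false;   /* being removed, don't resurrect */
            } while (atomic_cmpxchg(&bl->refcount, refs, refs + 1) != refs);

            return true;
    }

    /* shrink: unhash only when the hash bit is the sole remaining reference */
    static bool block_shrink_if_unused(struct cached_block *bl)
    {
            return atomic_cmpxchg(&bl->refcount, BLOCK_IN_HASH, 0) ==
                   BLOCK_IN_HASH;
    }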

We also clean up freeing a bit.  It has to wait for the rcu grace period
to ensure that no other rcu readers can reference the blocks it's
freeing.  It has to iterate over the list with _safe because it's
freeing as it goes.
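The freeing path might look something like this sketch, again with
invented names, using the free_node llist member from the first example:

    #include <linux/rcupdate.h>
    #include <linux/slab.h>

    static void block_free_pending(struct llist_head *pending)
    {
            struct llist_node *first = llist_del_all(pending);
            struct cached_block *bl, *tmp;

            /* wait out the grace period so no rcu lookup still sees these */
            synchronize_rcu();

            /* _safe because each entry is freed as we walk */
            llist_for_each_entry_safe(bl, tmp, first, free_node)
                    kfree(bl);
    }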

Interestingly, when reworking the shrinker I noticed that we weren't
scaling the nr_to_scan from the pages we returned in previous shrink
calls back to blocks.  We now divide the input from pages back into
blocks.
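Because we report freeable counts to the shrinker in pages, the
nr_to_scan it hands back is also in pages and has to be scaled back down
to blocks before walking.  Something like this sketch, where
SCOUTFS_BLOCK_PAGES and the 64KiB block size are stand-ins rather than
the actual scoutfs constants:

    #include <linux/kernel.h>

    /* stand-in: pages per cached block, example value only */
    #define SCOUTFS_BLOCK_PAGES     ((64 * 1024) / PAGE_SIZE)

    static unsigned long shrink_pages_to_blocks(unsigned long nr_to_scan)
    {
            return DIV_ROUND_UP(nr_to_scan, SCOUTFS_BLOCK_PAGES);
    }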

Signed-off-by: Zach Brown <zab@versity.com>

Introduction

scoutfs is a clustered in-kernel Linux filesystem designed and built from the ground up to support large archival systems.

Its key differentiating features are:

  • Integrated consistent indexing accelerates archival maintenance operations
  • Commit logs allow nodes to write concurrently without contention

It meets best-of-breed expectations:

  • Fully consistent POSIX semantics between nodes
  • Rich metadata to ensure the integrity of metadata references
  • Atomic transactions to maintain consistent persistent structures
  • First class kernel implementation for high performance and low latency
  • Open GPLv2 implementation

Learn more in the white paper.

Current Status

Alpha Open Source Development

scoutfs is under heavy active development. We're developing it in the open to give the community an opportunity to affect the design and implementation.

The core architectural design elements are in place. Much surrounding functionality hasn't been implemented. It's appropriate for early adopters and interested developers, not for production use.

In that vein, expect significant incompatible changes to both the format of network messages and persistent structures. The format hash checks have been removed in preparation for release, so if there is any doubt about compatibility, re-running mkfs is strongly recommended.

The current kernel module is developed against the RHEL/CentOS 7.x kernel to minimize the friction of developing and testing with partners' existing infrastructure. Once we're happy with the design we'll shift development to the upstream kernel while maintaining distro compatibility branches.

Community Mailing List

Please join us on the open scoutfs-devel@scoutfs.org mailing list hosted on Google Groups for all discussion of scoutfs.

Quick Start

The following is a very rough example of the procedure to get up and running; experience will be needed to fill in the gaps. We're happy to help on the mailing list.

The requirements for running scoutfs on a small cluster are:

  1. One or more nodes running x86-64 CentOS/RHEL 7.4 (or 7.3)
  2. Access to two shared block devices
  3. IPv4 connectivity between the nodes

The steps for getting scoutfs mounted and operational are:

  1. Get the kernel module running on the nodes
  2. Make a new filesystem on the devices with the userspace utilities
  3. Mount the devices on all the nodes

In this example we use three nodes. The names of the block devices are the same on all the nodes. Two of the nodes will be quorum members. A majority of quorum members must be mounted to elect a leader that runs a server which all the mounts connect to. Note that two quorum members results in a majority of one, each member by itself, so split-brain elections are possible but unlikely enough that it's fine for a demonstration.

  1. Get the Kernel Module and Userspace Binaries

    • Either use snapshot RPMs built from git by Versity:
    rpm -i https://scoutfs.s3-us-west-2.amazonaws.com/scoutfs-repo-0.0.1-1.el7_4.noarch.rpm
    yum install scoutfs-utils kmod-scoutfs
    
    • Or use the binaries built from checked out git repositories:
    yum install kernel-devel
    git clone git@github.com:versity/scoutfs.git
    make -C scoutfs
    modprobe libcrc32c
    insmod scoutfs/kmod/src/scoutfs.ko
    alias scoutfs=$PWD/scoutfs/utils/src/scoutfs
    
  2. Make a New Filesystem (destroys contents)

    We specify quorum slots with the addresses of each of the quorum member nodes, the metadata device, and the data device.

    scoutfs mkfs -Q 0,$NODE0_ADDR,12345 -Q 1,$NODE1_ADDR,12345 /dev/meta_dev /dev/data_dev
    
  3. Mount the Filesystem

    First, mount each of the quorum nodes so that they can elect and start a server for the remaining node to connect to. The slot numbers were specified with the leading "0,..." and "1,..." in the mkfs options above.

    mount -t scoutfs -o quorum_slot_nr=$SLOT_NR,metadev_path=/dev/meta_dev /dev/data_dev /mnt/scoutfs
    

    Then mount the remaining node which can now connect to the running server.

    mount -t scoutfs -o metadev_path=/dev/meta_dev /dev/data_dev /mnt/scoutfs
    
  4. For Kicks, Observe the Metadata Change Index

    The meta_seq index tracks the inodes that are changed in each transaction.

    scoutfs walk-inodes meta_seq 0 -1 /mnt/scoutfs
    touch /mnt/scoutfs/one; sync
    scoutfs walk-inodes meta_seq 0 -1 /mnt/scoutfs
    touch /mnt/scoutfs/two; sync
    scoutfs walk-inodes meta_seq 0 -1 /mnt/scoutfs
    touch /mnt/scoutfs/one; sync
    scoutfs walk-inodes meta_seq 0 -1 /mnt/scoutfs
    