% This was derived from the usenix templates, whose introductory
% comment is as follows:
%
% TEMPLATE for Usenix papers, specifically to meet requirements of
%  USENIX '05
% originally a template for producing IEEE-format articles using LaTeX.
%   written by Matthew Ward, CS Department, Worcester Polytechnic Institute.
% adapted by David Beazley for his excellent SWIG paper in Proceedings,
%   Tcl 96
% turned into a smartass generic template by De Clarke, with thanks to
%   both the above pioneers
% use at your own risk.  Complaints to /dev/null.
% make it two column with no page numbering, default is 10 point

% Munged by Fred Douglis <douglis@research.att.com> 10/97 to separate
% the .sty file from the LaTeX source template, so that people can
% more easily include the .sty file into an existing document.  Also
% changed to more closely follow the style guidelines as represented
% by the Word sample file. 

% Note that since 2010, USENIX does not require endnotes. If you want
% foot of page notes, don't include the endnotes package in the 
% usepackage command, below.

% This version uses the latex2e styles, not the very ancient 2.09 stuff.
\documentclass[letterpaper,twocolumn,10pt]{article}
\usepackage{usenix2019,epsfig}
\begin{document}

%don't want date printed
\date{}

%make title bold and 14 pt font (Latex default is non-bold, 16 pt)
\title{\Large \bf scoutfs : A Scalable Archival Filesystem}

%for single author (just remove % characters)
\author{
{\rm Zach Brown}\\
Versity Software, Inc.
}

\maketitle

% Use the following at camera-ready time to suppress page numbers.
% Comment it out when you first submit the paper for review.
% \thispagestyle{empty}

\section{Metadata Items}

scoutfs stores filesystem metadata in items that are identified by a
key and contain a variable length value payload.\\

Every key uses a generic structure with a fixed number of fields.

{\tt \small
\begin{verbatim}
struct scoutfs_key {
        __u8    sk_zone;
        __le64  _sk_first;
        __u8    sk_type;
        __le64  _sk_second;
        __le64  _sk_third;
        __u8    _sk_fourth;
};
\end{verbatim}
}

Using a shared key struct lets us sort all the metadata items in the
filesystem in one key space regardless of their form or function.  The
generic keys are displayed, sorted, and computed (incrementing, finding
difference) without needing to know the specific fields of each item
type.

Different structures are identified by their zone and type pair.  They
then map their type's fields to the remaining generic fields to
determine the sorting of the item keys within their type.

For example, when storing inodes we use the {\tt SCOUTFS\_FS\_ZONE} and
{\tt SCOUTFS\_INODE\_TYPE} and put the inode number in the first generic
key field.

{\tt \small
\begin{verbatim}
        #define ski_ino _sk_first
\end{verbatim}
}

{\tt \small
\begin{verbatim}
        key.sk_zone = SCOUTFS_FS_ZONE;
        key.ski_ino = ino;
        key.sk_type = SCOUTFS_INODE_TYPE;
\end{verbatim}
}

Continuing this example, metadata that is associated with inodes also
use the {\tt SCOUTFS\_FS\_ZONE} and store the inode number in {\tt
\_sk\_first} but then have different type values.  For example {\tt
SCOUTFS\_XATTR\_TYPE} or {\tt SCOUTFS\_SYMLINK\_TYPE}.  When the items'
keys are sorted we end up with all the items for a given inode stored
near each other.

\subsection{Directory Entries}

A directory entry is stored in three different metadata items, each with
a different key and used for a different purpose.  Each item shares the
same key format and directory entry value payload, however.

The key stores the entry's directory inode number and major and minor
values associated with the type of directory entry being stored.

{\tt \small
\begin{verbatim}
        #define skd_ino         _sk_first
        #define skd_major       _sk_second
        #define skd_minor       _sk_third
\end{verbatim}
}

The value contains a directory entry struct with all the metadata
associated with a directory entry, including the full entry name.

{\tt \small
\begin{verbatim}
struct scoutfs_dirent {
        __le64 ino;
        __le64 hash;
        __le64 pos;
        __u8 type;
        __u8 name[0];
};
\end{verbatim}
}

Each item contains a full copy of the item value.  This duplicates
storage across each item type but also lets each operation be satisfied
by one item lookup.  Once the item value is obtained its fields can be
used to construct the keys for each of the items associated with the
entry.

\subsubsection{Directory Entry Lookup Items}

{\tt \small
\begin{verbatim}
        key.sk_zone = SCOUTFS_FS_ZONE;
        key.skd_ino = dir_ino;
        key.sk_type = SCOUTFS_DIRENT_TYPE;
        key.skd_major = hash(entry_name);
        key.skd_minor = dir_pos;
\end{verbatim}
}

Lookup entries are stored in the parent directory at the hash of the
name of the entry.  These entries are used to map names to inode numbers
during path traversal.

The major key value is set to a 64bit hash of the file name.  These hash
values can collide so the minor key value is set to the readdir position
in the directory of the entry.  This readdir position is unique for
every entry and ensures that keys are unique when hash values collide.

A name lookup is performed by iterating over all the keys with the major
that matches the hashed name.  The full name in the dirent value struct
is compared to the search name.  It will be very rare to have more than
one item with a given hash value.

\subsubsection{Directory Entry Readdir Items}

{\tt \small
\begin{verbatim}
        key.sk_zone = SCOUTFS_FS_ZONE;
        key.skd_ino = dir_ino;
        key.sk_type = SCOUTFS_READDIR_TYPE;
        key.skd_major = dir_pos;
        key.skd_minor = 0;
\end{verbatim}
}

Readdir entries are used to iterate over entries for the readdir()
call.  By providing a unique 64bit {\tt dir\_pos} for each entry we avoid
having to track multiple entries for a given readdir position value.

readdir() returns entries in {\tt dir\_pos} order which depends on entry
creation order and matches inode allocation order.  Accessing the inodes
that are referenced by the entries returned from readdir() will result
in efficient forward iteration over the readdir and inode items,
assuming that files were simply created.

Renaming files or creating hard links to existing files creates a new
entry but can't reassign the inode number and can result in mismatched
access patterns of the readdir entry items and the inode items.

\subsubsection{Directory Entry Link Backref Items}

{\tt \small
\begin{verbatim}
        key.sk_zone = SCOUTFS_FS_ZONE;
        key.skd_ino = target_ino;
        key.sk_type = SCOUTFS_LINK_BACKREF_TYPE;
        key.skd_major = dir_ino;
        key.skd_minor = dir_pos;
\end{verbatim}
}

Link backref entry items are stored with the target inode number and the
inode number and readdir position of the entry in its directory.
They're used to iterate over all the entries that refer to a given
inode.  Full relative paths from the root directory to a target inode
can be constructed by walking up through each parent entry as its
discovered.

Both inode numbers and readdir positions are allocated by strictly
increasing the next free number.  Old inode numbers or readdir positions
are never reused.  This means that resolving paths for existing inodes
will always walk keys that are strictly sorted less than the keys that
will be created as new files are created.  This tends to isolate read
access patterns during backround archival policy processing from write
access patterns during new file creation and increases performance by
reducing contention.

\end{document}