scoutfs/utils/man/scoutfs.5

.TH scoutfs 5
.SH NAME
scoutfs \- high level overview of the scoutfs filesystem
.SH DESCRIPTION
A scoutfs filesystem is stored on two block devices.  Multiple mounts of
the filesystem are supported between hosts that share access to the
block device.  A new filesystem is created with the
.B mkfs
command in the
.BR scoutfs (8)
utility.
.SH MOUNT OPTIONS
The following mount options are supported by scoutfs in addition to the
general mount options described in the
.BR mount (8)
manual page.
.TP
.B acl
The acl mount option enables support for POSIX Access Control Lists
as detailed in
.BR acl (5) .
Support for POSIX ACLs is the default.
.TP
.B data_prealloc_blocks=<blocks>
Set the size of preallocation regions of data files, in 4KiB blocks.
Writes to these regions that contain no extents will attempt to
preallocate the size of the full region.  This can waste a lot of space
with small files, files with sparse regions, and files whose final
length isn't a multiple of the preallocation size.  The following
data_prealloc_contig_only option, which is the default, restricts this
behaviour to waste less space.
.sp
All the preallocation options can be changed in an active mount by
writing to their respective files in the options directory in the
mount's sysfs directory.
.sp
It is worth noting that it is always more efficient in every way to use
.BR fallocate (2)
to precisely allocate large extents for the resulting size of the file.
Always attempt to enable it in software that supports it.
.TP
.B data_prealloc_contig_only=<0|1>
This option, currently the default, limits file data preallocation in
two ways.  First, it will only preallocate when extending a fully
allocated file.  Second, it will limit the size of preallocation to the
existing length of the file.  These limits reduce the amount of
preallocation wasted per file at the cost of multiple initial extents in
all files.  It only supports simple streaming writes, any other write
pattern will not be recognized and could result in many fragmented
extent allocations.
.sp
This option can be disabled to encourage large allocated extents
regardless of write patterns.  This can be helpful if files are written
with initial sparse regions (perhaps by multiple threads writing to
different regions) and wasted space isn't an issue (perhaps because the
file population contains few small files).
.TP
.B log_merge_wait_timeout_ms=<number>
This option sets the amount of time, in milliseconds, that log merge
creation can wait before timing out.  This setting is per-mount, only
changes the behavior of that mount, and only affects the server when it
is running in that mount.
.sp
This determines how long it may take for mounts to synchronize
committing their log trees to create a log merge operation.  Setting it
too high can create long latencies in the event that a mount takes a
long time to commit their log.  Setting it too low can result in the
creation of excessive numbers of log trees that are never merged.  The
default is 500 and it can not be less than 100 nor greater than 60000.
.TP
.B metadev_path=<device>
The metadev_path option specifies the path to the block device that
contains the filesystem's metadata.
.sp
This option is required.
.TP
.B noacl
The noacl mount option disables the default support for POSIX Access
Control Lists.  Any existing system.posix_acl_default and
system.posix_acl_access extended attributes remain in inodes.   They
will appear in listings from
.BR listxattr (5)
but specific retrieval or reomval operations will fail.  They will be
used for enforcement again if ACL support is later enabled.
.TP
.B orphan_scan_delay_ms=<number>
This option sets the average expected delay, in milliseconds, between
each mount's scan of the global orphaned inode list.  Jitter is added to
avoid contention so each individual delay between scans is a random
value up to 20% less than or greater than this average expected delay.
.sp
The minimum value for this option is 100ms which is very short and is
only reasonable for testing or experiments.   The default is 10000ms (10
seconds) and the maximum is 60000ms (1 minute).
.sp
This option can be changed in an active mount by writing to its file in
the options directory in the mount's sysfs directory.  Writing a new
value will cause the next pending orphan scan to be rescheduled
with the newly written delay time.
.TP
.B quorum_heartbeat_timeout_ms=<number>
This option sets the amount of time, in milliseconds, that a quorum
member will wait without receiving heartbeat messages from the current
leader before trying to take over as leader.  This setting is per-mount
and only changes the behavior of that mount.
.sp
This determines how long it may take before a failed leader is replaced
by a waiting quorum member.  Setting it too low may lead to spurious
fencing as active leaders are prematurely replaced due to task or
network delays that prevent the quorum members from promptly sending and
receiving messages.  The ideal setting is the longest acceptable
downtime during server failover.  The default is 10000 (10s) and it can
not be less than 2000 greater than 60000.
.sp
This option can be changed in an active mount by writing to its file in
the options directory in the mount's sysfs directory.  Writing a new
value will take effect the next time the quorum agent receives a
heartbeat message and sets the next timeout.
.TP
.B quorum_slot_nr=<number>
The quorum_slot_nr option assigns a quorum member slot to the mount.
The mount will use the slot assignment to claim exclusive ownership of
the slot's configured address and an associated metadata device block.
Each slot number must be used by only one mount at any given time.
.sp
When a mount is assigned a quorum slot it becomes a quorum member and
will participate in the raft leader election process and could start
the server for the filesystem if it is elected leader.
.sp
The assigned number must match one of the slots defined with \-Q options
when the filesystem was created with mkfs.  If the number assigned
doesn't match a number created during mkfs then the mount will fail.
.TP
.B tcp_keepalive_timeout_ms=<number>
This option sets the amount of time, in milliseconds, that a client
connection will wait for active TCP packets, before deciding that
the connection is dead. This setting is per-mount and only changes
the behavior of that mount.
.sp
The default value of this setting is 60000msec (60s). Any precision
beyond a whole second is likely unrealistic due to the nature of
TCP keepalive mechanisms in the Linux kernel. Valid values are any
value higher than 3000 (3s).
.sp
The TCP keepalive mechanism is complex and observing a lost connection
quickly is important to maintain cluster stability. If the local
network suffers from intermittent outages this option may provide
some respite to overcome these outages without the cluster becoming
desynchronized.
.SH VOLUME OPTIONS
Volume options are persistent options which are stored in the super
block in the metadata device and which apply to all mounts of the volume.
.sp
Volume options may be initially specified as the volume is created
as described in the mkfs command in
.BR scoutfs (8).
.sp
Volume options may be changed at runtime by writing to files in sysfs
while the volume is mounted.  Volume options are found in the
volume_options/ directory with a file for each option.  Reading the
file provides the current setting of the option and an empty string
is returned if the option is not set.  To set the option, write
the new value ofthe option to the file.  To clear the option, write
a blank line with a newline to the file.  The write syscall will
return an error if the set operation fails and a message will be written
to the console.
.sp
The following volume options are supported:
.TP
.B data_alloc_zone_blocks=<zone size in 4KiB blocks>
When the data_alloc_zone_blocks option is set the data device is
logically divided into zones of equal length as specified by the value
of the option.  The size of the zones must be greater than a minimum
allocation pool size, large enough to result in no more than 1024 zones,
and not more than the total number of blocks in the data device.
.sp
When set, the server will try to provide each mount with free data
extents that don't share a zone with other mounts.  When a mount has free
extents in a given zone the server will try and find more free extents
in that zone.  When the mount is not in a zone, or its zone has no more
free extents, the server will try and find free extents in a zone that
no other mount currently occupies.  The result is to try and produce
write streams where only one mount is writing into each zone.
.SH FENCING
.B scoutfs
mounts coordinate exclusive access to shared resources through
comminication with the mount that was elected leader.
A mount can malfunction and stop participating at which point it needs
to be safely isolated ("fenced off") from shared resources before other mounts can
have their turn at exclusive access.
.sp
Only the elected leader can fence mounts.  As the leader decides that a
mount must be fenced, typically by timeouts expiring without
comminication from the mount, it creates a fence request.   Fence
requests are visible as directories in the leader mount's sysfs
directory.  The fence request directory is named for the RID of the
mount being fenced.  The directory contains the following files:

.RS
.TP
.B elapsec_secs
Reading this file gives the number of seconds that have passed since
this fence request was created.
.TP
.B error
This file contains 0 when the fence request is created.  Userspace
fencing agents write 1 into this file if they are unable to fence the
mount.  The volume can not make progress until the mount is fenced so
this will cause the server to stop and another mount will be elected
leader.
.TP
.B fenced
This file contains 0 when the fence request is created.  Userspace
fencing agents write 1 into this file once the mount has been fenced.
.TP
.B ipv4_addr
This file contains the dotted quad IPv4 peer address of the last
connected socket from the mount.  Userspace fencing agents can use this
to find the host that contains the mount.
.TP
.B reason
This file contains a text string that indicates the reason that the
mount is being fenced:

.B client_recovery
- During startup the server found persistent items recording the presence
of a mount that didn't reconnect to the server in time.
.sp
.B client_reconnect
- A mount disconnected from the server and didn't reconnect in time.
.sp
.B quorum_block_leader
- As a leader was elected it read persistent blocks that indicated that
a previous leader had not shut down and cleared their quorum block.
.TP
.B rid
This file contains the hex string of the RID of the mount to be fenced.
.RE

The request directories enable userspace processes to gather the
information to find the host with the mount to fence, isolate the mount
by whatever means are appropriate (f.e. cut off network and storage
communication, force unmount the mount, isolate storage fabric ports,
reboot the host) and write to the
.I fenced
file.
.sp
Once the
.I fenced
file is written to the server reclaims the resources
associated with the fenced mount and resumes normal operations.
.sp
If the
.I error
file is written to then the server cannot make forward progress and
shuts down.  The request can similarly enter an errored state if enough
time passes before userspace completes the request.

.SH EXTENDED ATTRIBUTE TAGS

.B scoutfs
adds the
.IB scoutfs.
extended attribute namespace which uses a system of tags to extend the
functionality of extended attributes.  Immediately following the
scoutfs. prefix are a series of tag words seperated by dots.
Any text starting after the last recognized tag is considered the xattr
name and is not parsed.
.sp
Tags may be combined in any order.   Specifying a tag more than once
will return an error.  There is no explicit boundary between the end of
tags and the start of the name so unknown or incorrect tags will be
successfully parsed as part of the name of the xattr.  Tags can only be
created, updated, or removed with the CAP_SYS_ADMIN capability.

The following tags are currently supported:

.RS
.TP
.B .hide.
Attributes with the .hide. tag are not visible to the
.BR listxattr(2)
system call.  They will instead be included in the output of the
.IB LISTXATTR_HIDDEN
ioctl.  This is meant to be used by archival management agents to store
metadata that is bound to a specific volume and should not be
transferred with the file by tools that read extended attributes, like
.BR tar(1) .
.TP
.B .indx.
Attributes with the .indx. tag dd the inode containing the attribute to
a filesystem-wide index.  The name of the extended attribute must end
with strings representing two values separated by dots.  The first value
is an unsigned 8bit value and the second is an unsigned 64bit value.
These attributes can only be modified with root privileges and the
attributes can not have a value.
.sp
The inodes in the index are stored in increasing sort order of the
values, with the first u8 value being most significant.  Inodes can be
at many positions as tracked by many extended attributes, and their
position follows the creation, renaming, or deletion of the attributes.
The index can be read with the read-xattr-index command which uses the
underlying READ_XATTR_INDEX ioctl.
.TP
.B .srch.
Attributes with the .srch. tag are indexed so that they can be
found by the
.IB SEARCH_XATTRS
ioctl.   The search ioctl takes an extended attribute name and returns
the inode number of all the inodes which contain an extended attribute
with that name.  The indexing structures behind .srch. tags are designed
to efficiently handle a large number of .srch. attributes per file with
no limits on the number of indexed files.
.TP
.B .totl.
Attributes with the .totl. flag are used to efficiently maintain counts
across all files in the system.  The attribute's name must end in three
64bit values seperated by dots that specify the global total that the
extended attribute will contribute to.   The value of the extended
attribute is a string representation of the 64bit quantity which will be
added to the total.   As attributes are added, updated, or removed (and
particularly as a file is finally deleted), the corresponding global
total is also updated by the file system.  All the totals with their
name, total value, and a count of contributing attributes can be read
with the
.IB READ_XATTR_TOTALS
ioctl.
.RE

.SH FILE RETENTION MODE
A file can be set to retention mode by setting the
.IB RETENTION
attribute with the
.IB SET_ATTR_X
ioctl.  This flag can only be set on regular files and requires root
permission (the
.IB CAP_SYS_ADMIN
capability).
.sp
Once in retention mode all modifications of the file will fail.  The
only exceptions are that system extended attributes (all those without
the "user." prefix) may be modified.  The retention bit may be cleared
with sufficient priveledges to remove the retention restrictions on
other modifications.
.RE

.SH PROJECT IDs
All inodes have a project ID attribute that can be set via the
SET_ATTR_X ioctl and displayed with the GET_ATTR_X ioctl.  Project IDs
are an unsigned 64bit value and the value of 0 is reserved to indicate
that no project ID is assigned.   If a project ID is set on a directory
then all inodes created with it as the initial parent inheret that ID,
for all file types.  This includes files initially unlinked from the
namespace when created with O_TMPFILE.  Project IDs are only
automatically inherited from the parent dir on initial creation.
They're not changed as directory entry linkes to the inode are created
or renamed.
.RE

.SH FORMAT VERSION
The format version defines the layout and use of structures stored on
devices and passed over the network.  The version is incremented for
every change in structures that is not backwards compatible with
previous versions.  A single version implies all changes, individual
changes can't be selectively adopted.
.sp
As a new file system is created the format version is stored in both of
the super blocks written to the metadata and data devices.  By default
the greatest supported version is written while an older supported
version may be specified.
.sp
During mount the kernel module verifies that the format versions stored
in both of the super blocks match and are supported.   That version
defines the set of features and behavior of all the mounts using the
file system, including the network protocol that is communicated over
the wire.
.sp
Any combination of software release versions that support the current
format version of the file system can safely be used concurrently.  This
allows for rolling software updates of multiple mounts using a shared
file system.
.sp
To use new incompatible features added in newer format versions the super blocks must
be updated.   This can currently only be safely performed on a
completely and cleanly unmounted file system.  The
.BR scoutfs (8)
.I change-format-version
command can be used with the
.I --offline
option to write a newer supported version into the super blocks.  It
will fail if it sees any indication of unresolved mounts that may be
using the devices: either active quorum members working with their
quorum blocks or persistent records of mounted clients that haven't been
resolved.  Like creating a new file system, there is no protection
against multiple invocations of the change command corrupting the
system.  Once the version is updated older software can no longer use
the file system so this change should be performed with care.  Once the
newer format version is successfully written it can be mounted and newer
features can be used.
.sp
Each layer of the system can show its supported format versions:
.RS
.TP
.B Userspace utilities
.B scoutfs --help
includes the range of supported format versions for a given release
of the userspace utilities.
.TP
.B Kernel module
.I modinfo MODULE
shows the range of supproted versions for a kernel module file in the
.I scoutfs_format_version_min
and
.I scoutfs_format_version_min
fields.
.TP
.B Inserted module
The supported version range of an inserted module can be found in
.I .note.scoutfs_format_version_min
and
.I .note.scoutfs_format_version_max
notes files in the sysfs notes directory for the inserted module,
typically
.I /sys/module/scoutfs/notes/
.TP
.B Metadata and data devices
.I scoutfs print DEVICE
shows the
.I fmt_vers
field in the initial output of the super block on the device.
.TP
.B Mounted filesystem
The version that a mount is using is shown in the
.I format_version
file in the mount's sysfs directory, typically
.I /sys/fs/scoutfs/f.FSID.r.RID/
.RE
.sp
The defined format versions are:
.RS
.TP
.sp
.B 1
Initial format version.
.TP
.B 2
Added retention mode by setting the retention attribute.  Added the
project ID inode attribute.  Added quota rules and enforcement.  Added
the .indx. extended attribute tag.
.RE

.SH CORRUPTION DETECTION
A
.B scoutfs
filesystem can detect corruption at runtime.  A catalog of kernel log
messages that indicate corruption can be found in
.BR scoutfs-corruption (8)
\&.

.SH SEE ALSO
.BR scoutfs (8),
.BR scoutfs-corruption (7).

.SH AUTHORS
Zach Brown <zab@versity.com>