mirror of
https://github.com/versity/scoutfs.git
synced 2026-01-07 12:35:28 +00:00
We had previously started on a relatively simple notion of an interoperability version which wasn't quite right. This fleshes out support for a more functional format version. The super blocks have a single version that defines behaviour of the running system. The code supports a range of versions and we add some initial interfaces for updating the version while the system is offline. All of this together should let us safely change the underlying format over time. Signed-off-by: Zach Brown <zab@versity.com>
297 lines
11 KiB
Groff
297 lines
11 KiB
Groff
.TH scoutfs 5
|
|
.SH NAME
|
|
scoutfs \- high level overview of the scoutfs filesystem
|
|
.SH DESCRIPTION
|
|
A scoutfs filesystem is stored on two block devices. Multiple mounts of
|
|
the filesystem are supported between hosts that share access to the
|
|
block device. A new filesystem is created with the
|
|
.B mkfs
|
|
command in the
|
|
.BR scoutfs (8)
|
|
utility.
|
|
.SH MOUNT OPTIONS
|
|
The following mount options are supported by scoutfs in addition to the
|
|
general mount options described in the
|
|
.BR mount (8)
|
|
manual page.
|
|
.TP
|
|
.B metadev_path=<device>
|
|
The metadev_path option specifies the path to the block device that
|
|
contains the filesystem's metadata.
|
|
.sp
|
|
This option is required.
|
|
.TP
|
|
.B quorum_slot_nr=<number>
|
|
The quorum_slot_nr option assigns a quorum member slot to the mount.
|
|
The mount will use the slot assignment to claim exclusive ownership of
|
|
the slot's configured address and an associated metadata device block.
|
|
Each slot number must be used by only one mount at any given time.
|
|
.sp
|
|
When a mount is assigned a quorum slot it becomes a quorum member and
|
|
will participate in the raft leader election process and could start
|
|
the server for the filesystem if it is elected leader.
|
|
.sp
|
|
The assigned number must match one of the slots defined with \-Q options
|
|
when the filesystem was created with mkfs. If the number assigned
|
|
doesn't match a number created during mkfs then the mount will fail.
|
|
.SH VOLUME OPTIONS
|
|
Volume options are persistent options which are stored in the super
|
|
block in the metadata device and which apply to all mounts of the volume.
|
|
.sp
|
|
Volume options may be initially specified as the volume is created
|
|
as described in the mkfs command in
|
|
.BR scoutfs (8).
|
|
.sp
|
|
Volume options may be changed at runtime by writing to files in sysfs
|
|
while the volume is mounted. Volume options are found in the
|
|
volume_options/ directory with a file for each option. Reading the
|
|
file provides the current setting of the option and an empty string
|
|
is returned if the option is not set. To set the option, write
|
|
the new value ofthe option to the file. To clear the option, write
|
|
a blank line with a newline to the file. The write syscall will
|
|
return an error if the set operation fails and a message will be written
|
|
to the console.
|
|
.sp
|
|
The following volume options are supported:
|
|
.TP
|
|
.B data_alloc_zone_blocks=<zone size in 4KiB blocks>
|
|
When the data_alloc_zone_blocks option is set the data device is
|
|
logically divided into zones of equal length as specified by the value
|
|
of the option. The size of the zones must be greater than a minimum
|
|
allocation pool size, large enough to result in no more than 1024 zones,
|
|
and not more than the total number of blocks in the data device.
|
|
.sp
|
|
When set, the server will try to provide each mount with free data
|
|
extents that don't share a zone with other mounts. When a mount has free
|
|
extents in a given zone the server will try and find more free extents
|
|
in that zone. When the mount is not in a zone, or its zone has no more
|
|
free extents, the server will try and find free extents in a zone that
|
|
no other mount currently occupies. The result is to try and produce
|
|
write streams where only one mount is writing into each zone.
|
|
.SH FENCING
|
|
.B scoutfs
|
|
mounts coordinate exclusive access to shared resources through
|
|
comminication with the mount that was elected leader.
|
|
A mount can malfunction and stop participating at which point it needs
|
|
to be safely isolated ("fenced off") from shared resources before other mounts can
|
|
have their turn at exclusive access.
|
|
.sp
|
|
Only the elected leader can fence mounts. As the leader decides that a
|
|
mount must be fenced, typically by timeouts expiring without
|
|
comminication from the mount, it creates a fence request. Fence
|
|
requests are visible as directories in the leader mount's sysfs
|
|
directory. The fence request directory is named for the RID of the
|
|
mount being fenced. The directory contains the following files:
|
|
|
|
.RS
|
|
.TP
|
|
.B elapsec_secs
|
|
Reading this file gives the number of seconds that have passed since
|
|
this fence request was created.
|
|
.TP
|
|
.B error
|
|
This file contains 0 when the fence request is created. Userspace
|
|
fencing agents write 1 into this file if they are unable to fence the
|
|
mount. The volume can not make progress until the mount is fenced so
|
|
this will cause the server to stop and another mount will be elected
|
|
leader.
|
|
.TP
|
|
.B fenced
|
|
This file contains 0 when the fence request is created. Userspace
|
|
fencing agents write 1 into this file once the mount has been fenced.
|
|
.TP
|
|
.B ipv4_addr
|
|
This file contains the dotted quad IPv4 peer address of the last
|
|
connected socket from the mount. Userspace fencing agents can use this
|
|
to find the host that contains the mount.
|
|
.TP
|
|
.B reason
|
|
This file contains a text string that indicates the reason that the
|
|
mount is being fenced:
|
|
|
|
.B client_recovery
|
|
- During startup the server found persistent items recording the presence
|
|
of a mount that didn't reconnect to the server in time.
|
|
.sp
|
|
.B client_reconnect
|
|
- A mount disconnected from the server and didn't reconnect in time.
|
|
.sp
|
|
.B quorum_block_leader
|
|
- As a leader was elected it read persistent blocks that indicated that
|
|
a previous leader had not shut down and cleared their quorum block.
|
|
.TP
|
|
.B rid
|
|
This file contains the hex string of the RID of the mount to be fenced.
|
|
.RE
|
|
|
|
The request directories enable userspace processes to gather the
|
|
information to find the host with the mount to fence, isolate the mount
|
|
by whatever means are appropriate (f.e. cut off network and storage
|
|
communication, force unmount the mount, isolate storage fabric ports,
|
|
reboot the host) and write to the
|
|
.I fenced
|
|
file.
|
|
.sp
|
|
Once the
|
|
.I fenced
|
|
file is written to the server reclaims the resources
|
|
associated with the fenced mount and resumes normal operations.
|
|
.sp
|
|
If the
|
|
.I error
|
|
file is written to then the server cannot make forward progress and
|
|
shuts down. The request can similarly enter an errored state if enough
|
|
time passes before userspace completes the request.
|
|
|
|
.SH EXTENDED ATTRIBUTE TAGS
|
|
|
|
.B scoutfs
|
|
adds the
|
|
.IB scoutfs.
|
|
extended attribute namespace which uses a system of tags to extend the
|
|
functionality of extended attributes. Immediately following the
|
|
scoutfs. prefix are a series of tag words seperated by dots.
|
|
Any text starting after the last recognized tag is considered the xattr
|
|
name and is not parsed.
|
|
.sp
|
|
Tags may be combined in any order. Specifying a tag more than once
|
|
will return an error. There is no explicit boundary between the end of
|
|
tags and the start of the name so unknown or incorrect tags will be
|
|
successfully parsed as part of the name of the xattr. Tags can only be
|
|
created, updated, or removed with the CAP_SYS_ADMIN capability.
|
|
|
|
The following tags are currently supported:
|
|
|
|
.RS
|
|
.TP
|
|
.B .hide.
|
|
Attributes with the .hide. tag are not visible to the
|
|
.BR listxattr(2)
|
|
system call. They will instead be included in the output of the
|
|
.IB LISTXATTR_HIDDEN
|
|
ioctl. This is meant to be used by archival management agents to store
|
|
metadata that is bound to a specific volume and should not be
|
|
transferred with the file by tools that read extended attributes, like
|
|
.BR tar(1) .
|
|
.TP
|
|
.B .srch.
|
|
Attributes with the .srch. tag are indexed so that they can be
|
|
found by the
|
|
.IB SEARCH_XATTRS
|
|
ioctl. The search ioctl takes an extended attribute name and returns
|
|
the inode number of all the inodes which contain an extended attribute
|
|
with that name. The indexing structures behind .srch. tags are designed
|
|
to efficiently handle a large number of .srch. attributes per file with
|
|
no limits on the number of indexed files.
|
|
.TP
|
|
.B .totl.
|
|
Attributes with the .totl. flag are used to efficiently maintain counts
|
|
across all files in the system. The attribute's name must end in three
|
|
64bit values seperated by dots that specify the global total that the
|
|
extended attribute will contribute to. The value of the extended
|
|
attribute is a string representation of the 64bit quantity which will be
|
|
added to the total. As attributes are added, updated, or removed (and
|
|
particularly as a file is finally deleted), the corresponding global
|
|
total is also updated by the file system. All the totals with their
|
|
name, total value, and a count of contributing attributes can be read
|
|
with the
|
|
.IB READ_XATTR_TOTALS
|
|
ioctl.
|
|
.RE
|
|
|
|
.SH FORMAT VERSION
|
|
The format version defines the layout and use of structures stored on
|
|
devices and passed over the network. The version is incremented for
|
|
every change in structures that is not backwards compatible with
|
|
previous versions. A single version implies all changes, individual
|
|
changes can't be selectively adopted.
|
|
.sp
|
|
As a new file system is created the format version is stored in both of
|
|
the super blocks written to the metadata and data devices. By default
|
|
the greatest supported version is written while an older supported
|
|
version may be specified.
|
|
.sp
|
|
During mount the kernel module verifies that the format versions stored
|
|
in both of the super blocks match and are supported. That version
|
|
defines the set of features and behavior of all the mounts using the
|
|
file system, including the network protocol that is communicated over
|
|
the wire.
|
|
.sp
|
|
Any combination of software release versions that support the current
|
|
format version of the file system can safely be used concurrently. This
|
|
allows for rolling software updates of multiple mounts using a shared
|
|
file system.
|
|
.sp
|
|
To use new incompatible features added in newer format versions the super blocks must
|
|
be updated. This can currently only be safely performed on a
|
|
completely and cleanly unmounted file system. The
|
|
.BR scoutfs (8)
|
|
.I change-format-version
|
|
command can be used with the
|
|
.I --offline
|
|
option to write a newer supported version into the super blocks. It
|
|
will fail if it sees any indication of unresolved mounts that may be
|
|
using the devices: either active quorum members working with their
|
|
quorum blocks or persistent records of mounted clients that haven't been
|
|
resolved. Like creating a new file system, there is no protection
|
|
against multiple invocations of the change command corrupting the
|
|
system. Once the version is updated older software can no longer use
|
|
the file system so this change should be performed with care. Once the
|
|
newer format version is successfully written it can be mounted and newer
|
|
features can be used.
|
|
.sp
|
|
Each layer of the system can show its supported format versions:
|
|
.RS
|
|
.TP
|
|
.B Userspace utilities
|
|
.B scoutfs --help
|
|
includes the range of supported format versions for a given release
|
|
of the userspace utilities.
|
|
.TP
|
|
.B Kernel module
|
|
.I modinfo MODULE
|
|
shows the range of supproted versions for a kernel module file in the
|
|
.I scoutfs_format_version_min
|
|
and
|
|
.I scoutfs_format_version_min
|
|
fields.
|
|
.TP
|
|
.B Inserted module
|
|
The supported version range of an inserted module can be found in
|
|
.I .note.scoutfs_format_version_min
|
|
and
|
|
.I .note.scoutfs_format_version_max
|
|
notes files in the sysfs notes directory for the inserted module,
|
|
typically
|
|
.I /sys/module/scoutfs/notes/
|
|
.TP
|
|
.B Metadata and data devices
|
|
.I scoutfs print DEVICE
|
|
shows the
|
|
.I fmt_vers
|
|
field in the initial output of the super block on the device.
|
|
.TP
|
|
.B Mounted filesystem
|
|
The version that a mount is using is shown in the
|
|
.I format_version
|
|
file in the mount's sysfs directory, typically
|
|
.I /sys/fs/scoutfs/f.FSID.r.RID/
|
|
.RE
|
|
|
|
.SH CORRUPTION DETECTION
|
|
A
|
|
.B scoutfs
|
|
filesystem can detect corruption at runtime. A catalog of kernel log
|
|
messages that indicate corruption can be found in
|
|
.BR scoutfs-corruption (8)
|
|
\&.
|
|
|
|
.SH SEE ALSO
|
|
.BR scoutfs (8),
|
|
.BR scoutfs-corruption (7).
|
|
|
|
.SH AUTHORS
|
|
Zach Brown <zab@versity.com>
|
|
|
|
|