mirror of
https://github.com/versity/scoutfs.git
synced 2025-12-23 05:25:18 +00:00
Add an option that can limit the number of inode numbers that are allocated per lock group. Signed-off-by: Zach Brown <zab@versity.com>
476 lines
19 KiB
Groff
.TH scoutfs 5
.SH NAME
scoutfs \- high level overview of the scoutfs filesystem
.SH DESCRIPTION
A scoutfs filesystem is stored on two block devices. Multiple mounts of
the filesystem are supported between hosts that share access to the
block devices. A new filesystem is created with the
.B mkfs
command in the
.BR scoutfs (8)
utility.
.SH MOUNT OPTIONS
The following mount options are supported by scoutfs in addition to the
general mount options described in the
.BR mount (8)
manual page.
.TP
.B acl
The acl mount option enables support for POSIX Access Control Lists
as detailed in
.BR acl (5) .
Support for POSIX ACLs is the default.
.TP
.B data_prealloc_blocks=<blocks>
Set the size of preallocation regions of data files, in 4KiB blocks.
A write to a region that contains no extents will attempt to
preallocate the size of the full region. This can waste a lot of space
with small files, files with sparse regions, and files whose final
length isn't a multiple of the preallocation size. The following
data_prealloc_contig_only option, which is the default, restricts this
behaviour to waste less space.
.sp
All the preallocation options can be changed in an active mount by
writing to their respective files in the options directory in the
mount's sysfs directory.
.sp
It is always more efficient to use
.BR fallocate (2)
to precisely allocate large extents for the resulting size of the file.
Always attempt to enable it in software that supports it.
.TP
.B data_prealloc_contig_only=<0|1>
This option, currently the default, limits file data preallocation in
two ways. First, it will only preallocate when extending a fully
allocated file. Second, it will limit the size of preallocation to the
existing length of the file. These limits reduce the amount of
preallocation wasted per file at the cost of multiple initial extents in
all files. It only supports simple streaming writes; any other write
pattern will not be recognized and could result in many fragmented
extent allocations.
.sp
This option can be disabled to encourage large allocated extents
regardless of write patterns. This can be helpful if files are written
with initial sparse regions (perhaps by multiple threads writing to
different regions) and wasted space isn't an issue (perhaps because the
file population contains few small files).
.TP
.B ino_alloc_per_lock=<number>
This option determines how many inode numbers are allocated in the same
cluster lock. The default, and maximum, is 1024. The minimum is 1.
Allocating fewer inodes per lock can allow more parallelism between
mounts because there are more locks that cover the same number of
created files. This can be helpful when working with smaller numbers of
large files.
.TP
.B log_merge_wait_timeout_ms=<number>
This option sets the amount of time, in milliseconds, that log merge
creation can wait before timing out. This setting is per-mount, only
changes the behavior of that mount, and only affects the server when it
is running in that mount.
.sp
This determines how long it may take for mounts to synchronize
committing their log trees to create a log merge operation. Setting it
too high can create long latencies in the event that a mount takes a
long time to commit its log. Setting it too low can result in the
creation of excessive numbers of log trees that are never merged. The
default is 500 and it cannot be less than 100 nor greater than 60000.
.TP
.B metadev_path=<device>
The metadev_path option specifies the path to the block device that
contains the filesystem's metadata.
.sp
This option is required.
.TP
.B noacl
The noacl mount option disables the default support for POSIX Access
Control Lists. Any existing system.posix_acl_default and
system.posix_acl_access extended attributes remain in inodes. They
will appear in listings from
.BR listxattr (2)
but specific retrieval or removal operations will fail. They will be
used for enforcement again if ACL support is later enabled.
.TP
.B orphan_scan_delay_ms=<number>
This option sets the average expected delay, in milliseconds, between
each mount's scan of the global orphaned inode list. Jitter is added to
avoid contention so each individual delay between scans is a random
value up to 20% less than or greater than this average expected delay.
.sp
The minimum value for this option is 100ms which is very short and is
only reasonable for testing or experiments. The default is 10000ms (10
seconds) and the maximum is 60000ms (1 minute).
.sp
This option can be changed in an active mount by writing to its file in
the options directory in the mount's sysfs directory. Writing a new
value will cause the next pending orphan scan to be rescheduled
with the newly written delay time.
.TP
.B quorum_heartbeat_timeout_ms=<number>
This option sets the amount of time, in milliseconds, that a quorum
member will wait without receiving heartbeat messages from the current
leader before trying to take over as leader. This setting is per-mount
and only changes the behavior of that mount.
.sp
This determines how long it may take before a failed leader is replaced
by a waiting quorum member. Setting it too low may lead to spurious
fencing as active leaders are prematurely replaced due to task or
network delays that prevent the quorum members from promptly sending and
receiving messages. The ideal setting is the longest acceptable
downtime during server failover. The default is 10000 (10s) and it
cannot be less than 2000 nor greater than 60000.
.sp
This option can be changed in an active mount by writing to its file in
the options directory in the mount's sysfs directory. Writing a new
value will take effect the next time the quorum agent receives a
heartbeat message and sets the next timeout.
.TP
.B quorum_slot_nr=<number>
The quorum_slot_nr option assigns a quorum member slot to the mount.
The mount will use the slot assignment to claim exclusive ownership of
the slot's configured address and an associated metadata device block.
Each slot number must be used by only one mount at any given time.
.sp
When a mount is assigned a quorum slot it becomes a quorum member and
will participate in the raft leader election process and could start
the server for the filesystem if it is elected leader.
.sp
The assigned number must match one of the slots defined with \-Q options
when the filesystem was created with mkfs. If the number assigned
doesn't match a number created during mkfs then the mount will fail.
.TP
.B tcp_keepalive_timeout_ms=<number>
This option sets the amount of time, in milliseconds, that a client
connection will wait for active TCP packets, before deciding that
the connection is dead. This setting is per-mount and only changes
the behavior of that mount.
.sp
The default value of this setting is 60000msec (60s). Any precision
beyond a whole second is likely unrealistic due to the nature of
TCP keepalive mechanisms in the Linux kernel. Valid values are any
value higher than 3000 (3s).
.sp
The TCP keepalive mechanism is complex and observing a lost connection
quickly is important to maintain cluster stability. If the local
network suffers from intermittent outages this option may provide
some respite to overcome these outages without the cluster becoming
desynchronized.
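.PP
The numeric limits documented for the mount options above can be checked
before a mount is attempted. The following Python sketch is illustrative
only and is not part of scoutfs: the table and function names are
hypothetical, and "higher than 3000" is read as strictly greater than 3000.
.sp
.nf
```python
# Documented (minimum, maximum) for each numeric mount option.
# None means the text above states no bound on that side.
DOCUMENTED_RANGES = {
    "ino_alloc_per_lock": (1, 1024),
    "log_merge_wait_timeout_ms": (100, 60000),
    "orphan_scan_delay_ms": (100, 60000),
    "quorum_heartbeat_timeout_ms": (2000, 60000),
    # Assumption: "higher than 3000" excludes 3000 itself.
    "tcp_keepalive_timeout_ms": (3001, None),
}


def check_option(name, value):
    """Return True if value lies inside the documented range for name."""
    low, high = DOCUMENTED_RANGES[name]
    if low is not None and value < low:
        return False
    if high is not None and value > high:
        return False
    return True


def mount_options(metadev_path, **numeric):
    """Build a mount -o string, rejecting out-of-range numeric values."""
    parts = ["metadev_path=" + metadev_path]
    for name, value in sorted(numeric.items()):
        if not check_option(name, value):
            raise ValueError("%s=%d is outside the documented range"
                             % (name, value))
        parts.append("%s=%d" % (name, value))
    return ",".join(parts)
```
.fi
.sp
The returned string is meant to be passed to the \-o option of
.BR mount (8) ;
the kernel module performs its own validation regardless.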
.SH VOLUME OPTIONS
Volume options are persistent options which are stored in the super
block in the metadata device and which apply to all mounts of the volume.
.sp
Volume options may be initially specified as the volume is created
as described in the mkfs command in
.BR scoutfs (8).
.sp
Volume options may be changed at runtime by writing to files in sysfs
while the volume is mounted. Volume options are found in the
volume_options/ directory with a file for each option. Reading the
file provides the current setting of the option and an empty string
is returned if the option is not set. To set the option, write
the new value of the option to the file. To clear the option, write
a blank line with a newline to the file. The write syscall will
return an error if the set operation fails and a message will be written
to the console.
.sp
The following volume options are supported:
.TP
.B data_alloc_zone_blocks=<zone size in 4KiB blocks>
When the data_alloc_zone_blocks option is set the data device is
logically divided into zones of equal length as specified by the value
of the option. The size of the zones must be greater than a minimum
allocation pool size, large enough to result in no more than 1024 zones,
and not more than the total number of blocks in the data device.
.sp
When set, the server will try to provide each mount with free data
extents that don't share a zone with other mounts. When a mount has free
extents in a given zone the server will try to find more free extents
in that zone. When the mount is not in a zone, or its zone has no more
free extents, the server will try to find free extents in a zone that
no other mount currently occupies. The intended result is write streams
where only one mount is writing into each zone.
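.PP
The read, set, and clear behaviour of the volume option files can be
sketched in Python as follows. This is a minimal illustration, not a
scoutfs tool: the function names are hypothetical, and it assumes that
the volume_options/ directory sits directly under the mount's sysfs
directory, which the caller passes in.
.sp
.nf
```python
import os


def volume_option_path(sysfs_dir, name):
    # Assumption: volume_options/ is a child of the mount's sysfs directory.
    return os.path.join(sysfs_dir, "volume_options", name)


def get_volume_option(sysfs_dir, name):
    """Return the current setting as a string, or None if it is not set."""
    with open(volume_option_path(sysfs_dir, name)) as f:
        text = f.read().strip()
    return text or None


def set_volume_option(sysfs_dir, name, value):
    """Write a new value; a failed set is reported as OSError."""
    with open(volume_option_path(sysfs_dir, name), "w") as f:
        print(value, file=f)  # value followed by a newline


def clear_volume_option(sysfs_dir, name):
    """Clear the option by writing a blank line."""
    with open(volume_option_path(sysfs_dir, name), "w") as f:
        print(file=f)  # a lone newline
```
.fi
.sp
A caller would pass the sysfs directory of any mount of the volume, for
example to set data_alloc_zone_blocks and later clear it again.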
.SH FENCING
.B scoutfs
mounts coordinate exclusive access to shared resources through
communication with the mount that was elected leader.
A mount can malfunction and stop participating at which point it needs
to be safely isolated ("fenced off") from shared resources before other mounts can
have their turn at exclusive access.
.sp
Only the elected leader can fence mounts. As the leader decides that a
mount must be fenced, typically by timeouts expiring without
communication from the mount, it creates a fence request. Fence
requests are visible as directories in the leader mount's sysfs
directory. The fence request directory is named for the RID of the
mount being fenced. The directory contains the following files:

.RS
.TP
.B elapsed_secs
Reading this file gives the number of seconds that have passed since
this fence request was created.
.TP
.B error
This file contains 0 when the fence request is created. Userspace
fencing agents write 1 into this file if they are unable to fence the
mount. The volume cannot make progress until the mount is fenced so
this will cause the server to stop and another mount will be elected
leader.
.TP
.B fenced
This file contains 0 when the fence request is created. Userspace
fencing agents write 1 into this file once the mount has been fenced.
.TP
.B ipv4_addr
This file contains the dotted quad IPv4 peer address of the last
connected socket from the mount. Userspace fencing agents can use this
to find the host that contains the mount.
.TP
.B reason
This file contains a text string that indicates the reason that the
mount is being fenced:

.B client_recovery
- During startup the server found persistent items recording the presence
of a mount that didn't reconnect to the server in time.
.sp
.B client_reconnect
- A mount disconnected from the server and didn't reconnect in time.
.sp
.B quorum_block_leader
- As a leader was elected it read persistent blocks that indicated that
a previous leader had not shut down and cleared its quorum block.
.TP
.B rid
This file contains the hex string of the RID of the mount to be fenced.
.RE

The request directories enable userspace processes to gather the
information to find the host with the mount to fence, isolate the mount
by whatever means are appropriate (e.g. cut off network and storage
communication, force unmount the mount, isolate storage fabric ports,
reboot the host) and write to the
.I fenced
file.
.sp
Once the
.I fenced
file is written to, the server reclaims the resources
associated with the fenced mount and resumes normal operations.
.sp
If the
.I error
file is written to then the server cannot make forward progress and
shuts down. The request can similarly enter an errored state if enough
time passes before userspace completes the request.
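.sp
The request handling described above can be sketched as a small userspace
agent in Python. This is an illustration under stated assumptions rather
than a supported agent: the function names are hypothetical, the path of
the request directory is supplied by the caller, and the actual isolation
step is a caller-provided callable because it is site specific.
.sp
.nf
```python
import os


def read_field(request_dir, name):
    """Read one file of a fence request directory, without the newline."""
    with open(os.path.join(request_dir, name)) as f:
        return f.read().strip()


def write_flag(request_dir, name):
    """Write 1 into the named file of the fence request directory."""
    with open(os.path.join(request_dir, name), "w") as f:
        f.write("1")


def handle_fence_request(request_dir, isolate):
    """Process one fence request directory.

    isolate is a hypothetical callable taking (rid, ipv4_addr, reason)
    and returning True once the mount can no longer reach shared
    resources.
    """
    rid = read_field(request_dir, "rid")
    addr = read_field(request_dir, "ipv4_addr")
    reason = read_field(request_dir, "reason")
    try:
        ok = bool(isolate(rid, addr, reason))
    except Exception:
        ok = False
    # Report success through fenced, and any failure through error.
    write_flag(request_dir, "fenced" if ok else "error")
    return ok
```
.fi
.sp
Writing to the error file on any failure is a conservative choice in this
sketch: it lets another mount be elected leader rather than leaving the
request unresolved.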

.SH EXTENDED ATTRIBUTE TAGS

.B scoutfs
adds the
.IB scoutfs.
extended attribute namespace which uses a system of tags to extend the
functionality of extended attributes. Immediately following the
scoutfs. prefix are a series of tag words separated by dots.
Any text starting after the last recognized tag is considered the xattr
name and is not parsed.
.sp
Tags may be combined in any order. Specifying a tag more than once
will return an error. There is no explicit boundary between the end of
tags and the start of the name so unknown or incorrect tags will be
successfully parsed as part of the name of the xattr. Tags can only be
created, updated, or removed with the CAP_SYS_ADMIN capability.

The following tags are currently supported:

.RS
.TP
.B .hide.
Attributes with the .hide. tag are not visible to the
.BR listxattr (2)
system call. They will instead be included in the output of the
.IB LISTXATTR_HIDDEN
ioctl. This is meant to be used by archival management agents to store
metadata that is bound to a specific volume and should not be
transferred with the file by tools that read extended attributes, like
.BR tar (1) .
.TP
.B .indx.
Attributes with the .indx. tag add the inode containing the attribute to
a filesystem-wide index. The name of the extended attribute must end
with strings representing two values separated by dots. The first value
is an unsigned 8bit value and the second is an unsigned 64bit value.
These attributes can only be modified with root privileges and the
attributes cannot have a value.
.sp
The inodes in the index are stored in increasing sort order of the
values, with the first u8 value being most significant. Inodes can be
at many positions as tracked by many extended attributes, and their
position follows the creation, renaming, or deletion of the attributes.
The index can be read with the read-xattr-index command which uses the
underlying READ_XATTR_INDEX ioctl.
.TP
.B .srch.
Attributes with the .srch. tag are indexed so that they can be
found by the
.IB SEARCH_XATTRS
ioctl. The search ioctl takes an extended attribute name and returns
the inode numbers of all the inodes which contain an extended attribute
with that name. The indexing structures behind .srch. tags are designed
to efficiently handle a large number of .srch. attributes per file with
no limits on the number of indexed files.
.TP
.B .totl.
Attributes with the .totl. tag are used to efficiently maintain counts
across all files in the system. The attribute's name must end in three
64bit values separated by dots that specify the global total that the
extended attribute will contribute to. The value of the extended
attribute is a string representation of the 64bit quantity which will be
added to the total. As attributes are added, updated, or removed (and
particularly as a file is finally deleted), the corresponding global
total is also updated by the file system. All the totals with their
name, total value, and a count of contributing attributes can be read
with the
.IB READ_XATTR_TOTALS
ioctl.
.RE
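.sp
The tag parsing rules of this section can be modelled in a few lines of
Python. The sketch below only illustrates the rules as written here; it
is not the kernel implementation, the function name is hypothetical, and
it assumes that a tag word counts only when it is followed by a dot.
.sp
.nf
```python
PREFIX = "scoutfs."
KNOWN_TAGS = ("hide", "indx", "srch", "totl")


def parse_tags(xattr_name):
    """Split a scoutfs. xattr name into (set of tags, remaining name)."""
    if not xattr_name.startswith(PREFIX):
        raise ValueError("not in the scoutfs. namespace")
    rest = xattr_name[len(PREFIX):]
    tags = set()
    while True:
        word, dot, tail = rest.partition(".")
        # Parsing stops at the first word that is not a known tag
        # followed by a dot; everything from there on is the name.
        if not dot or word not in KNOWN_TAGS:
            break
        if word in tags:
            raise ValueError("tag specified more than once: " + word)
        tags.add(word)
        rest = tail
    return tags, rest
```
.fi
.sp
Under these assumptions the name scoutfs.hide.srch.job yields the tags
hide and srch with the name job, while scoutfs.hide.bogus.srch.job yields
only the hide tag and the name bogus.srch.job, because the unknown word
ends tag parsing.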

.SH FILE RETENTION MODE
A file can be set to retention mode by setting the
.IB RETENTION
attribute with the
.IB SET_ATTR_X
ioctl. This flag can only be set on regular files and requires root
permission (the
.IB CAP_SYS_ADMIN
capability).
.sp
Once in retention mode all modifications of the file will fail. The
only exception is that system extended attributes (all those without
the "user." prefix) may be modified. The retention bit may be cleared
with sufficient privileges to remove the retention restrictions on
other modifications.
.RE

.SH PROJECT IDs
All inodes have a project ID attribute that can be set via the
SET_ATTR_X ioctl and displayed with the GET_ATTR_X ioctl. Project IDs
are an unsigned 64bit value and the value of 0 is reserved to indicate
that no project ID is assigned. If a project ID is set on a directory
then all inodes created with it as the initial parent inherit that ID,
for all file types. This includes files initially unlinked from the
namespace when created with O_TMPFILE. Project IDs are only
automatically inherited from the parent dir on initial creation.
They're not changed as directory entry links to the inode are created
or renamed.
.RE

.SH FORMAT VERSION
The format version defines the layout and use of structures stored on
devices and passed over the network. The version is incremented for
every change in structures that is not backwards compatible with
previous versions. A single version implies all changes; individual
changes can't be selectively adopted.
.sp
As a new file system is created the format version is stored in both of
the super blocks written to the metadata and data devices. By default
the greatest supported version is written while an older supported
version may be specified.
.sp
During mount the kernel module verifies that the format versions stored
in both of the super blocks match and are supported. That version
defines the set of features and behavior of all the mounts using the
file system, including the network protocol that is communicated over
the wire.
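.sp
The mount-time check described above amounts to two comparisons. The
following Python sketch is illustrative only; the function and parameter
names are hypothetical and do not correspond to kernel symbols.
.sp
.nf
```python
def check_format_versions(meta_vers, data_vers, min_supported, max_supported):
    """Return the format version a mount would use, or raise ValueError.

    meta_vers and data_vers are the versions read from the super blocks
    of the metadata and data devices; min_supported and max_supported
    describe the range supported by the running software.
    """
    if meta_vers != data_vers:
        raise ValueError("super block versions differ: %d and %d"
                         % (meta_vers, data_vers))
    if not min_supported <= meta_vers <= max_supported:
        raise ValueError("version %d is outside the supported range %d-%d"
                         % (meta_vers, min_supported, max_supported))
    return meta_vers
```
.fi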
.sp
Any combination of software release versions that support the current
format version of the file system can safely be used concurrently. This
allows for rolling software updates of multiple mounts using a shared
file system.
.sp
To use new incompatible features added in newer format versions the
super blocks must be updated. This can currently only be safely
performed on a completely and cleanly unmounted file system. The
.BR scoutfs (8)
.I change-format-version
command can be used with the
.I --offline
option to write a newer supported version into the super blocks. It
will fail if it sees any indication of unresolved mounts that may be
using the devices: either active quorum members working with their
quorum blocks or persistent records of mounted clients that haven't been
resolved. Like creating a new file system, there is no protection
against multiple invocations of the change command corrupting the
system. Once the version is updated older software can no longer use
the file system so this change should be performed with care. Once the
newer format version is successfully written it can be mounted and newer
features can be used.
.sp
Each layer of the system can show its supported format versions:
.RS
.TP
.B Userspace utilities
.B scoutfs --help
includes the range of supported format versions for a given release
of the userspace utilities.
.TP
.B Kernel module
.I modinfo MODULE
shows the range of supported versions for a kernel module file in the
.I scoutfs_format_version_min
and
.I scoutfs_format_version_max
fields.
.TP
.B Inserted module
The supported version range of an inserted module can be found in
.I .note.scoutfs_format_version_min
and
.I .note.scoutfs_format_version_max
notes files in the sysfs notes directory for the inserted module,
typically
.I /sys/module/scoutfs/notes/
.TP
.B Metadata and data devices
.I scoutfs print DEVICE
shows the
.I fmt_vers
field in the initial output of the super block on the device.
.TP
.B Mounted filesystem
The version that a mount is using is shown in the
.I format_version
file in the mount's sysfs directory, typically
.I /sys/fs/scoutfs/f.FSID.r.RID/
.RE
.sp
The defined format versions are:
.RS
.TP
.sp
.B 1
Initial format version.
.TP
.B 2
Added retention mode by setting the retention attribute. Added the
project ID inode attribute. Added quota rules and enforcement. Added
the .indx. extended attribute tag.
.RE

.SH CORRUPTION DETECTION
A
.B scoutfs
filesystem can detect corruption at runtime. A catalog of kernel log
messages that indicate corruption can be found in
.BR scoutfs-corruption (7) .

.SH SEE ALSO
.BR scoutfs (8),
.BR scoutfs-corruption (7).

.SH AUTHORS
Zach Brown <zab@versity.com>