mirror of
https://github.com/versity/scoutfs.git
synced 2026-01-05 03:44:05 +00:00
The default TCP keepalive timeout is currently 10s, so clients are disconnected after 10 seconds of not replying to a TCP keepalive packet. This value is reasonable most of the time, but we've seen client disconnects where the timeout was exceeded, resulting in fencing. The cause is unknown at this time, but intermittent network interruptions are suspected. This change adds a configurable value for this specific client socket timeout. It enforces that the value is above UNRESPONSIVE_PROBES, whose value remains unchanged. The default value changes from 10000ms (10s) to 60s. We assume this value is better suited to customers; it has been briefly trialed, and the trial suggests it may help ride out network-level interruptions. Signed-off-by: Auke Kok <auke.kok@versity.com>
468 lines
19 KiB
Groff
.TH scoutfs 5
.SH NAME
scoutfs \- high level overview of the scoutfs filesystem
.SH DESCRIPTION
A scoutfs filesystem is stored on two block devices. Multiple mounts of
the filesystem are supported between hosts that share access to the
block devices. A new filesystem is created with the
.B mkfs
command in the
.BR scoutfs (8)
utility.
.SH MOUNT OPTIONS
The following mount options are supported by scoutfs in addition to the
general mount options described in the
.BR mount (8)
manual page.
.TP
.B acl
The acl mount option enables support for POSIX Access Control Lists
as detailed in
.BR acl (5) .
Support for POSIX ACLs is the default.
.TP
.B data_prealloc_blocks=<blocks>
Set the size of preallocation regions of data files, in 4KiB blocks.
Writes to these regions that contain no extents will attempt to
preallocate the size of the full region. This can waste a lot of space
with small files, files with sparse regions, and files whose final
length isn't a multiple of the preallocation size. The following
data_prealloc_contig_only option, which is the default, restricts this
behaviour to waste less space.
.sp
All the preallocation options can be changed in an active mount by
writing to their respective files in the options directory in the
mount's sysfs directory.
.sp
It is worth noting that it is always more efficient in every way to use
.BR fallocate (2)
to precisely allocate large extents for the resulting size of the file.
Always attempt to enable it in software that supports it.
.TP
.B data_prealloc_contig_only=<0|1>
This option, currently the default, limits file data preallocation in
two ways. First, it will only preallocate when extending a fully
allocated file. Second, it will limit the size of preallocation to the
existing length of the file. These limits reduce the amount of
preallocation wasted per file at the cost of multiple initial extents in
all files. It only supports simple streaming writes; any other write
pattern will not be recognized and could result in many fragmented
extent allocations.
.sp
This option can be disabled to encourage large allocated extents
regardless of write patterns. This can be helpful if files are written
with initial sparse regions (perhaps by multiple threads writing to
different regions) and wasted space isn't an issue (perhaps because the
file population contains few small files).
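.sp
As a rough illustration of the two limits described for
data_prealloc_contig_only, the following Python sketch computes how many
blocks a write might preallocate. The function name, its parameters, and
the exact arithmetic are assumptions made for illustration; the kernel
implementation may differ.
.sp
.nf
```python
def prealloc_blocks(prealloc_max, file_blocks, fully_allocated,
                    extending, contig_only=True):
    """Return a hypothetical preallocation size in 4KiB blocks."""
    if not contig_only:
        # Without the option, a write attempts the full region size.
        return prealloc_max
    if not (extending and fully_allocated):
        # Limit one: only preallocate when extending a fully allocated file.
        return 0
    # Limit two: never preallocate more than the existing file length.
    return min(prealloc_max, file_blocks)
```
.fi
.sp
Under these assumptions a streaming writer sees its preallocation grow
with the file until it reaches data_prealloc_blocks, which is why small
files end up with several initial extents.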
.TP
.B log_merge_wait_timeout_ms=<number>
This option sets the amount of time, in milliseconds, that log merge
creation can wait before timing out. This setting is per-mount, only
changes the behavior of that mount, and only affects the server when it
is running in that mount.
.sp
This determines how long it may take for mounts to synchronize
committing their log trees to create a log merge operation. Setting it
too high can create long latencies in the event that a mount takes a
long time to commit its log. Setting it too low can result in the
creation of excessive numbers of log trees that are never merged. The
default is 500 and it cannot be less than 100 nor greater than 60000.
.TP
.B metadev_path=<device>
The metadev_path option specifies the path to the block device that
contains the filesystem's metadata.
.sp
This option is required.
.TP
.B noacl
The noacl mount option disables the default support for POSIX Access
Control Lists. Any existing system.posix_acl_default and
system.posix_acl_access extended attributes remain in inodes. They
will appear in listings from
.BR listxattr (2)
but specific retrieval or removal operations will fail. They will be
used for enforcement again if ACL support is later enabled.
.TP
.B orphan_scan_delay_ms=<number>
This option sets the average expected delay, in milliseconds, between
each mount's scan of the global orphaned inode list. Jitter is added to
avoid contention so each individual delay between scans is a random
value up to 20% less than or greater than this average expected delay.
.sp
The minimum value for this option is 100ms which is very short and is
only reasonable for testing or experiments. The default is 10000ms (10
seconds) and the maximum is 60000ms (1 minute).
.sp
This option can be changed in an active mount by writing to its file in
the options directory in the mount's sysfs directory. Writing a new
value will cause the next pending orphan scan to be rescheduled
with the newly written delay time.
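.sp
The jitter applied to orphan_scan_delay_ms can be pictured with a short
Python sketch. The function name is hypothetical and the uniform
distribution is an assumption; the text only states that each delay falls
within 20% of the configured average.
.sp
.nf
```python
import random

def next_orphan_scan_delay_ms(avg_ms, rng=random):
    """Pick a delay within 20% above or below the configured average."""
    jitter = avg_ms * 0.2
    return rng.uniform(avg_ms - jitter, avg_ms + jitter)
```
.fi
.sp
With the default of 10000ms, each delay produced by this sketch lies
between 8000ms and 12000ms.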
.TP
.B quorum_heartbeat_timeout_ms=<number>
This option sets the amount of time, in milliseconds, that a quorum
member will wait without receiving heartbeat messages from the current
leader before trying to take over as leader. This setting is per-mount
and only changes the behavior of that mount.
.sp
This determines how long it may take before a failed leader is replaced
by a waiting quorum member. Setting it too low may lead to spurious
fencing as active leaders are prematurely replaced due to task or
network delays that prevent the quorum members from promptly sending and
receiving messages. The ideal setting is the longest acceptable
downtime during server failover. The default is 10000 (10s) and it
cannot be less than 2000 nor greater than 60000.
.sp
This option can be changed in an active mount by writing to its file in
the options directory in the mount's sysfs directory. Writing a new
value will take effect the next time the quorum agent receives a
heartbeat message and sets the next timeout.
.TP
.B quorum_slot_nr=<number>
The quorum_slot_nr option assigns a quorum member slot to the mount.
The mount will use the slot assignment to claim exclusive ownership of
the slot's configured address and an associated metadata device block.
Each slot number must be used by only one mount at any given time.
.sp
When a mount is assigned a quorum slot it becomes a quorum member and
will participate in the raft leader election process and could start
the server for the filesystem if it is elected leader.
.sp
The assigned number must match one of the slots defined with \-Q options
when the filesystem was created with mkfs. If the number assigned
doesn't match a number created during mkfs then the mount will fail.
.TP
.B tcp_keepalive_timeout_ms=<number>
This option sets the amount of time, in milliseconds, that a client
connection will wait for active TCP packets before deciding that
the connection is dead. This setting is per-mount and only changes
the behavior of that mount.
.sp
The default value of this setting is 60000ms (60s). Any precision
beyond a whole second is likely unrealistic due to the nature of
TCP keepalive mechanisms in the Linux kernel. Valid values are any
value higher than 3000 (3s).
.sp
The TCP keepalive mechanism is complex and observing a lost connection
quickly is important to maintain cluster stability. If the local
network suffers from intermittent outages this option may provide
some respite to overcome these outages without the cluster becoming
desynchronized.
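.sp
The ranges documented for the millisecond timeout options in this section
can be collected into one table. The Python sketch below is only a
convenience for scripts that want to check a value before mounting; the
table and function names are hypothetical, and the kernel remains the
authority on what it accepts.
.sp
.nf
```python
# (minimum, maximum, default) in milliseconds, as documented above.
# None means the text gives no upper bound. "Higher than 3000" is
# expressed as a minimum of 3001 for whole-number values.
LIMITS_MS = {
    "log_merge_wait_timeout_ms": (100, 60000, 500),
    "orphan_scan_delay_ms": (100, 60000, 10000),
    "quorum_heartbeat_timeout_ms": (2000, 60000, 10000),
    "tcp_keepalive_timeout_ms": (3001, None, 60000),
}

def check_timeout_option(name, value_ms):
    """Return True if value_ms is within the documented range for name."""
    low, high, _default = LIMITS_MS[name]
    if value_ms < low:
        return False
    return high is None or value_ms <= high
```
.fi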
.SH VOLUME OPTIONS
Volume options are persistent options which are stored in the super
block in the metadata device and which apply to all mounts of the volume.
.sp
Volume options may be initially specified as the volume is created
as described in the mkfs command in
.BR scoutfs (8).
.sp
Volume options may be changed at runtime by writing to files in sysfs
while the volume is mounted. Volume options are found in the
volume_options/ directory with a file for each option. Reading the
file provides the current setting of the option and an empty string
is returned if the option is not set. To set the option, write
the new value of the option to the file. To clear the option, write
a blank line with a newline to the file. The write syscall will
return an error if the set operation fails and a message will be written
to the console.
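.sp
The read, set, and clear operations on volume option files can be sketched
in Python as follows. The helper names are hypothetical, and options_dir
stands for the volume_options/ directory, whose exact location in sysfs is
left to the caller.
.sp
.nf
```python
import os

def read_volume_option(options_dir, name):
    """Return the current setting, or None if the option is not set."""
    with open(os.path.join(options_dir, name)) as f:
        value = f.read().strip()
    return value or None

def set_volume_option(options_dir, name, value):
    """Write a new value; OSError is raised if the write fails."""
    with open(os.path.join(options_dir, name), "w") as f:
        print(value, file=f)

def clear_volume_option(options_dir, name):
    """Clear the option by writing a blank line with a newline."""
    with open(os.path.join(options_dir, name), "w") as f:
        print(file=f)
```
.fi
.sp
A failed set surfaces as an OSError from the write, matching the error
returned by the write syscall; the accompanying console message must be
read separately from the kernel log.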
.sp
The following volume options are supported:
.TP
.B data_alloc_zone_blocks=<zone size in 4KiB blocks>
When the data_alloc_zone_blocks option is set the data device is
logically divided into zones of equal length as specified by the value
of the option. The size of the zones must be greater than a minimum
allocation pool size, large enough to result in no more than 1024 zones,
and not more than the total number of blocks in the data device.
.sp
When set, the server will try to provide each mount with free data
extents that don't share a zone with other mounts. When a mount has free
extents in a given zone the server will try to find more free extents
in that zone. When the mount is not in a zone, or its zone has no more
free extents, the server will try to find free extents in a zone that
no other mount currently occupies. The result is to try to produce
write streams where only one mount is writing into each zone.
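.sp
The three size limits on data_alloc_zone_blocks can be checked ahead of
time with a sketch like the one below. The minimum allocation pool size
is not given in this page, so it is passed in as the hypothetical
min_pool_blocks parameter, and rounding a partial trailing zone up to a
whole zone is an assumption.
.sp
.nf
```python
MAX_ZONES = 1024

def check_zone_blocks(zone_blocks, total_data_blocks, min_pool_blocks):
    """Return True if zone_blocks satisfies the three documented limits."""
    if zone_blocks <= min_pool_blocks:
        return False  # must be greater than the minimum pool size
    if zone_blocks > total_data_blocks:
        return False  # a zone cannot exceed the data device
    # Round up so a partial trailing zone still counts as a zone.
    nr_zones = (total_data_blocks + zone_blocks - 1) // zone_blocks
    return nr_zones <= MAX_ZONES
```
.fi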
.SH FENCING
.B scoutfs
mounts coordinate exclusive access to shared resources through
communication with the mount that was elected leader.
A mount can malfunction and stop participating at which point it needs
to be safely isolated ("fenced off") from shared resources before other
mounts can have their turn at exclusive access.
.sp
Only the elected leader can fence mounts. As the leader decides that a
mount must be fenced, typically by timeouts expiring without
communication from the mount, it creates a fence request. Fence
requests are visible as directories in the leader mount's sysfs
directory. The fence request directory is named for the RID of the
mount being fenced. The directory contains the following files:

.RS
.TP
.B elapsed_secs
Reading this file gives the number of seconds that have passed since
this fence request was created.
.TP
.B error
This file contains 0 when the fence request is created. Userspace
fencing agents write 1 into this file if they are unable to fence the
mount. The volume cannot make progress until the mount is fenced so
this will cause the server to stop and another mount will be elected
leader.
.TP
.B fenced
This file contains 0 when the fence request is created. Userspace
fencing agents write 1 into this file once the mount has been fenced.
.TP
.B ipv4_addr
This file contains the dotted quad IPv4 peer address of the last
connected socket from the mount. Userspace fencing agents can use this
to find the host that contains the mount.
.TP
.B reason
This file contains a text string that indicates the reason that the
mount is being fenced:

.B client_recovery
- During startup the server found persistent items recording the presence
of a mount that didn't reconnect to the server in time.
.sp
.B client_reconnect
- A mount disconnected from the server and didn't reconnect in time.
.sp
.B quorum_block_leader
- As a leader was elected it read persistent blocks that indicated that
a previous leader had not shut down and cleared their quorum block.
.TP
.B rid
This file contains the hex string of the RID of the mount to be fenced.
.RE

The request directories enable userspace processes to gather the
information to find the host with the mount to fence, isolate the mount
by whatever means are appropriate (for example, cut off network and
storage communication, force unmount the mount, isolate storage fabric
ports, or reboot the host) and write to the
.I fenced
file.
.sp
Once the
.I fenced
file is written to, the server reclaims the resources
associated with the fenced mount and resumes normal operations.
.sp
If the
.I error
file is written to, the server cannot make forward progress and
shuts down. The request can similarly enter an errored state if enough
time passes before userspace completes the request.

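.sp
The handling of a single fence request by a userspace agent can be
sketched in Python as follows. This is a minimal illustration, not a
supplied agent: the helper names are hypothetical, req_dir is the path of
one fence request directory, and fence_host is a site-specific hook that
performs the actual isolation and returns True on success.
.sp
.nf
```python
import os

def read_attr(req_dir, name):
    with open(os.path.join(req_dir, name)) as f:
        return f.read().strip()

def write_attr(req_dir, name, value):
    with open(os.path.join(req_dir, name), "w") as f:
        print(value, file=f)

def handle_fence_request(req_dir, fence_host):
    """Fence the mount named by one request directory."""
    if read_attr(req_dir, "fenced") != "0":
        return False  # already fenced
    if read_attr(req_dir, "error") != "0":
        return False  # already failed
    addr = read_attr(req_dir, "ipv4_addr")
    rid = read_attr(req_dir, "rid")
    try:
        ok = bool(fence_host(addr, rid))
    except Exception:
        ok = False
    # Report the outcome through the file that matches it.
    write_attr(req_dir, "fenced" if ok else "error", 1)
    return ok
```
.fi
.sp
An agent built this way should only report success once the mount can no
longer reach shared storage, because the server reclaims the mount's
resources as soon as the fenced file is written.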
.SH EXTENDED ATTRIBUTE TAGS

.B scoutfs
adds the
.IB scoutfs.
extended attribute namespace which uses a system of tags to extend the
functionality of extended attributes. Immediately following the
scoutfs. prefix are a series of tag words separated by dots.
Any text starting after the last recognized tag is considered the xattr
name and is not parsed.
.sp
Tags may be combined in any order. Specifying a tag more than once
will return an error. There is no explicit boundary between the end of
tags and the start of the name so unknown or incorrect tags will be
successfully parsed as part of the name of the xattr. Tags can only be
created, updated, or removed with the CAP_SYS_ADMIN capability.

The following tags are currently supported:

.RS
.TP
.B .hide.
Attributes with the .hide. tag are not visible to the
.BR listxattr (2)
system call. They will instead be included in the output of the
.IB LISTXATTR_HIDDEN
ioctl. This is meant to be used by archival management agents to store
metadata that is bound to a specific volume and should not be
transferred with the file by tools that read extended attributes, like
.BR tar (1) .
.TP
.B .indx.
Attributes with the .indx. tag add the inode containing the attribute to
a filesystem-wide index. The name of the extended attribute must end
with strings representing two values separated by dots. The first value
is an unsigned 8-bit value and the second is an unsigned 64-bit value.
These attributes can only be modified with root privileges and the
attributes cannot have a value.
.sp
The inodes in the index are stored in increasing sort order of the
values, with the first u8 value being most significant. Inodes can be
at many positions as tracked by many extended attributes, and their
position follows the creation, renaming, or deletion of the attributes.
The index can be read with the read-xattr-index command which uses the
underlying READ_XATTR_INDEX ioctl.
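.sp
The ordering of the .indx. index can be modelled with a small Python
sketch. The function names are hypothetical, the values are assumed to be
written in decimal, and breaking ties by inode number is an assumption
that the text does not state.
.sp
.nf
```python
def indx_key(xattr_name):
    """Return the (u8, u64) sort key from the end of an .indx. name."""
    _prefix, major, minor = xattr_name.rsplit(".", 2)
    major, minor = int(major), int(minor)
    if not (0 <= major < 2**8 and 0 <= minor < 2**64):
        raise ValueError("index values out of range")
    return (major, minor)

def index_order(entries):
    """Sort (xattr_name, inode_number) pairs in index order."""
    return sorted(entries, key=lambda e: (indx_key(e[0]), e[1]))
```
.fi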
.TP
.B .srch.
Attributes with the .srch. tag are indexed so that they can be
found by the
.IB SEARCH_XATTRS
ioctl. The search ioctl takes an extended attribute name and returns
the inode number of all the inodes which contain an extended attribute
with that name. The indexing structures behind .srch. tags are designed
to efficiently handle a large number of .srch. attributes per file with
no limits on the number of indexed files.
.TP
.B .totl.
Attributes with the .totl. tag are used to efficiently maintain counts
across all files in the system. The attribute's name must end in three
64-bit values separated by dots that specify the global total that the
extended attribute will contribute to. The value of the extended
attribute is a string representation of the 64-bit quantity which will be
added to the total. As attributes are added, updated, or removed (and
particularly as a file is finally deleted), the corresponding global
total is also updated by the file system. All the totals with their
name, total value, and a count of contributing attributes can be read
with the
.IB READ_XATTR_TOTALS
ioctl.
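.sp
The bookkeeping behind .totl. attributes can be pictured with the
following Python sketch, which recomputes the totals from a list of
attributes. It is a model only: the function name is hypothetical, values
are assumed to be decimal strings, and keying each total solely by the
three trailing values is an assumption.
.sp
.nf
```python
from collections import defaultdict

def read_totals(xattrs):
    """Sum .totl. xattrs given as (name, value) pairs across all files.

    Returns {(a, b, c): (total, count)} keyed by the trailing values.
    """
    totals = defaultdict(lambda: [0, 0])
    for name, value in xattrs:
        key = tuple(int(part) for part in name.rsplit(".", 3)[1:])
        totals[key][0] += int(value)
        totals[key][1] += 1
    return {key: tuple(pair) for key, pair in totals.items()}
```
.fi
.sp
The file system maintains these totals incrementally as attributes
change, so reading them does not require a scan like the one above.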
.RE

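.sp
The tag parsing rules given at the start of this section (tags in any
order, no repeats, and the first unrecognized word starting the name) can
be sketched in Python as follows. The function and constant names are
hypothetical and the sketch only models the naming rules, not the
kernel's parser.
.sp
.nf
```python
KNOWN_TAGS = {"hide", "indx", "srch", "totl"}
PREFIX = "scoutfs."

def parse_scoutfs_xattr(full_name):
    """Split a scoutfs. xattr name into (set of tags, remaining name)."""
    if not full_name.startswith(PREFIX):
        raise ValueError("not in the scoutfs. namespace")
    rest = full_name[len(PREFIX):]
    tags = set()
    while True:
        word, dot, tail = rest.partition(".")
        if not dot or word not in KNOWN_TAGS:
            break  # the first unrecognized word starts the name
        if word in tags:
            raise ValueError("tag specified more than once: " + word)
        tags.add(word)
        rest = tail
    return tags, rest
```
.fi
.sp
Note how a misspelled tag is not an error in this model: it simply
becomes the start of the attribute name, as the text warns.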
.SH FILE RETENTION MODE
A file can be set to retention mode by setting the
.IB RETENTION
attribute with the
.IB SET_ATTR_X
ioctl. This flag can only be set on regular files and requires root
permission (the
.IB CAP_SYS_ADMIN
capability).
.sp
Once in retention mode all modifications of the file will fail. The
only exception is that system extended attributes (all those without
the "user." prefix) may be modified. The retention bit may be cleared
with sufficient privileges to remove the retention restrictions on
other modifications.

.SH PROJECT IDs
All inodes have a project ID attribute that can be set via the
SET_ATTR_X ioctl and displayed with the GET_ATTR_X ioctl. Project IDs
are an unsigned 64-bit value and the value of 0 is reserved to indicate
that no project ID is assigned. If a project ID is set on a directory
then all inodes created with it as the initial parent inherit that ID,
for all file types. This includes files initially unlinked from the
namespace when created with O_TMPFILE. Project IDs are only
automatically inherited from the parent dir on initial creation.
They're not changed as directory entry links to the inode are created
or renamed.

.SH FORMAT VERSION
The format version defines the layout and use of structures stored on
devices and passed over the network. The version is incremented for
every change in structures that is not backwards compatible with
previous versions. A single version implies all changes; individual
changes can't be selectively adopted.
.sp
As a new file system is created the format version is stored in both of
the super blocks written to the metadata and data devices. By default
the greatest supported version is written while an older supported
version may be specified.
.sp
During mount the kernel module verifies that the format versions stored
in both of the super blocks match and are supported. That version
defines the set of features and behavior of all the mounts using the
file system, including the network protocol that is communicated over
the wire.
.sp
Any combination of software release versions that support the current
format version of the file system can safely be used concurrently. This
allows for rolling software updates of multiple mounts using a shared
file system.
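.sp
The mount-time check and the rule for mixing software releases can be
expressed as a short Python sketch. The function names and error messages
are hypothetical; the sketch only restates the rules described above.
.sp
.nf
```python
def check_mount_versions(meta_vers, data_vers, supported_min, supported_max):
    """Return the version a mount would use, or raise ValueError."""
    if meta_vers != data_vers:
        raise ValueError("super block format versions do not match")
    if not supported_min <= meta_vers <= supported_max:
        raise ValueError("format version is not supported")
    return meta_vers

def can_share(fs_vers, releases):
    """True if every (min, max) release range supports fs_vers."""
    return all(low <= fs_vers <= high for low, high in releases)
```
.fi
.sp
For example, releases supporting versions 1 only and 1 through 2 can
share a version 1 file system, but not a version 2 file system.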
.sp
To use new incompatible features added in newer format versions the
super blocks must be updated. This can currently only be safely
performed on a completely and cleanly unmounted file system. The
.BR scoutfs (8)
.I change-format-version
command can be used with the
.I --offline
option to write a newer supported version into the super blocks. It
will fail if it sees any indication of unresolved mounts that may be
using the devices: either active quorum members working with their
quorum blocks or persistent records of mounted clients that haven't been
resolved. Like creating a new file system, there is no protection
against multiple invocations of the change command corrupting the
system. Once the version is updated older software can no longer use
the file system so this change should be performed with care. Once the
newer format version is successfully written it can be mounted and newer
features can be used.
Each layer of the system can show its supported format versions:
.RS
.TP
.B Userspace utilities
.B scoutfs --help
includes the range of supported format versions for a given release
of the userspace utilities.
.TP
.B Kernel module
.I modinfo MODULE
shows the range of supported versions for a kernel module file in the
.I scoutfs_format_version_min
and
.I scoutfs_format_version_max
fields.
.TP
.B Inserted module
The supported version range of an inserted module can be found in
.I .note.scoutfs_format_version_min
and
.I .note.scoutfs_format_version_max
notes files in the sysfs notes directory for the inserted module,
typically
.I /sys/module/scoutfs/notes/
.TP
.B Metadata and data devices
.I scoutfs print DEVICE
shows the
.I fmt_vers
field in the initial output of the super block on the device.
.TP
.B Mounted filesystem
The version that a mount is using is shown in the
.I format_version
file in the mount's sysfs directory, typically
.I /sys/fs/scoutfs/f.FSID.r.RID/
.RE
.sp
The defined format versions are:
.RS
.TP
.sp
.B 1
Initial format version.
.TP
.B 2
Added retention mode by setting the retention attribute. Added the
project ID inode attribute. Added quota rules and enforcement. Added
the .indx. extended attribute tag.
.RE

.SH CORRUPTION DETECTION
A
.B scoutfs
filesystem can detect corruption at runtime. A catalog of kernel log
messages that indicate corruption can be found in
.BR scoutfs-corruption (7) .

.SH SEE ALSO
.BR scoutfs (8),
.BR scoutfs-corruption (7).

.SH AUTHORS
Zach Brown <zab@versity.com>