.TH scoutfs 5
.SH NAME
scoutfs \- high level overview of the scoutfs filesystem
.SH DESCRIPTION
A scoutfs filesystem is stored on two block devices.  Multiple mounts of
the filesystem are supported between hosts that share access to the
block devices.  A new filesystem is created with the
.B mkfs
command in the
.BR scoutfs (8)
utility.
.SH MOUNT OPTIONS
The following mount options are supported by scoutfs in addition to the
general mount options described in the
.BR mount (8)
manual page.
.TP
.B acl
The acl mount option enables support for POSIX Access Control Lists
as detailed in
.BR acl (5) .
Support for POSIX ACLs is the default.
.TP
.B data_prealloc_blocks=
Set the size of preallocation regions of data files, in 4KiB
blocks.  Writes to these regions that contain no extents will attempt to
preallocate the size of the full region.  This can waste a lot of space
with small files, files with sparse regions, and files whose final
length isn't a multiple of the preallocation size.  The following
data_prealloc_contig_only option, which is the default, restricts this
behaviour to waste less space.
.sp
All the preallocation options can be changed in an active mount by
writing to their respective files in the options directory in the
mount's sysfs directory.
.sp
It is always more efficient to use
.BR fallocate (2)
to precisely allocate large extents for the resulting size of the file.
Always attempt to enable it in software that supports it.
.TP
.B data_prealloc_contig_only=<0|1>
This option, currently the default, limits file data preallocation in two
ways.  First, it will only preallocate when extending a fully allocated
file.  Second, it will limit the size of preallocation to the existing
length of the file.  These limits reduce the amount of preallocation
wasted per file at the cost of multiple initial extents in all files.
It only supports simple streaming writes; any other write pattern will
not be recognized and could result in many fragmented extent allocations.
.sp
This option can be disabled to encourage large allocated extents
regardless of write patterns.  This can be helpful if files are written
with initial sparse regions (perhaps by multiple threads writing to
different regions) and wasted space isn't an issue (perhaps because the
file population contains few small files).
.TP
.B ino_alloc_per_lock=
This option determines how many inode numbers are allocated in the same
cluster lock.  The default, and maximum, is 1024.  The minimum is 1.
Allocating fewer inodes per lock can allow more parallelism between
mounts because there are more locks that cover the same number of
created files.  This can be helpful when working with smaller numbers of
large files.
.TP
.B log_merge_wait_timeout_ms=
This option sets the amount of time, in milliseconds, that log merge
creation can wait before timing out.  This setting is per-mount, only
changes the behavior of that mount, and only affects the server when it
is running in that mount.
.sp
This determines how long it may take for mounts to synchronize
committing their log trees to create a log merge operation.  Setting it
too high can create long latencies in the event that a mount takes a
long time to commit its log.  Setting it too low can result in the
creation of excessive numbers of log trees that are never merged.  The
default is 500 and it can not be less than 100 nor greater than 60000.
.TP
.B metadev_path=
The metadev_path option specifies the path to the block device that
contains the filesystem's metadata.
.sp
This option is required.
.TP
.B noacl
The noacl mount option disables the default support for POSIX Access
Control Lists.  Any existing system.posix_acl_default and
system.posix_acl_access extended attributes remain in inodes.
They will appear in listings from
.BR listxattr (2)
but specific retrieval or removal operations will fail.  They will be
used for enforcement again if ACL support is later enabled.
.TP
.B orphan_scan_delay_ms=
This option sets the average expected delay, in milliseconds, between
each mount's scan of the global orphaned inode list.  Jitter is added to
avoid contention so each individual delay between scans is a random
value up to 20% less than or greater than this average expected delay.
.sp
The minimum value for this option is 100ms which is very short and is
only reasonable for testing or experiments.  The default is 10000ms (10
seconds) and the maximum is 60000ms (1 minute).
.sp
This option can be changed in an active mount by writing to its file in
the options directory in the mount's sysfs directory.  Writing a new
value will cause the next pending orphan scan to be rescheduled with the
newly written delay time.
.TP
.B quorum_heartbeat_timeout_ms=
This option sets the amount of time, in milliseconds, that a quorum
member will wait without receiving heartbeat messages from the current
leader before trying to take over as leader.  This setting is per-mount
and only changes the behavior of that mount.
.sp
This determines how long it may take before a failed leader is replaced
by a waiting quorum member.  Setting it too low may lead to spurious
fencing as active leaders are prematurely replaced due to task or
network delays that prevent the quorum members from promptly sending and
receiving messages.  The ideal setting is the longest acceptable
downtime during server failover.  The default is 10000 (10s) and it can
not be less than 2000 nor greater than 60000.
.sp
This option can be changed in an active mount by writing to its file in
the options directory in the mount's sysfs directory.  Writing a new
value will take effect the next time the quorum agent receives a
heartbeat message and sets the next timeout.
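.sp
As a minimal sketch of such a runtime change, assuming the mount's sysfs
directory is at its typical location of
.I /sys/fs/scoutfs/f.FSID.r.RID/
(FSID and RID are placeholders for the mount's actual identifiers) and
assuming the file in the options directory is named after the option,
the timeout could be raised to 15 seconds with:
.sp
.nf
# Hypothetical path: substitute the real FSID and RID of the mount.
echo 15000 > /sys/fs/scoutfs/f.FSID.r.RID/options/quorum_heartbeat_timeout_ms
.fi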
.TP
.B quorum_slot_nr=
The quorum_slot_nr option assigns a quorum member slot to the mount.
The mount will use the slot assignment to claim exclusive ownership of
the slot's configured address and an associated metadata device block.
Each slot number must be used by only one mount at any given time.
.sp
When a mount is assigned a quorum slot it becomes a quorum member and
will participate in the raft leader election process and could start
the server for the filesystem if it is elected leader.
.sp
The assigned number must match one of the slots defined with \-Q options
when the filesystem was created with mkfs.  If the number assigned
doesn't match a number created during mkfs then the mount will fail.
.TP
.B tcp_keepalive_timeout_ms=
This option sets the amount of time, in milliseconds, that a client
connection will wait for active TCP packets, before deciding that the
connection is dead.  This setting is per-mount and only changes the
behavior of that mount.
.sp
The default value of this setting is 60000msec (60s).  Any precision
beyond a whole second is likely unrealistic due to the nature of TCP
keepalive mechanisms in the Linux kernel.  Valid values are any value
higher than 3000 (3s).
.sp
The TCP keepalive mechanism is complex and observing a lost connection
quickly is important to maintain cluster stability.  If the local
network suffers from intermittent outages this option may provide some
respite to overcome these outages without the cluster becoming
desynchronized.
.SH VOLUME OPTIONS
Volume options are persistent options which are stored in the super
block in the metadata device and which apply to all mounts of the volume.
.sp
Volume options may be initially specified as the volume is created as
described in the mkfs command in
.BR scoutfs (8).
.sp
Volume options may be changed at runtime by writing to files in sysfs
while the volume is mounted.  Volume options are found in the
volume_options/ directory with a file for each option.
Reading the file provides the current setting of the option and an
empty string is returned if the option is not set.  To set the option,
write the new value of the option to the file.  To clear the option,
write a blank line with a newline to the file.  The write syscall will
return an error if the set operation fails and a message will be written
to the console.
.sp
The following volume options are supported:
.TP
.B data_alloc_zone_blocks=
When the data_alloc_zone_blocks option is set the data device is
logically divided into zones of equal length as specified by the value
of the option.  The size of the zones must be greater than a minimum
allocation pool size, large enough to result in no more than 1024 zones,
and not more than the total number of blocks in the data device.
.sp
When set, the server will try to provide each mount with free data
extents that don't share a zone with other mounts.  When a mount has
free extents in a given zone the server will try to find more free
extents in that zone.  When the mount is not in a zone, or its zone has
no more free extents, the server will try to find free extents in a zone
that no other mount currently occupies.  The result is to try to produce
write streams where only one mount is writing into each zone.
.SH FENCING
.B scoutfs
mounts coordinate exclusive access to shared resources through
communication with the mount that was elected leader.  A mount can
malfunction and stop participating at which point it needs to be safely
isolated ("fenced off") from shared resources before other mounts can
have their turn at exclusive access.
.sp
Only the elected leader can fence mounts.  As the leader decides that a
mount must be fenced, typically by timeouts expiring without
communication from the mount, it creates a fence request.  Fence
requests are visible as directories in the leader mount's sysfs
directory.  The fence request directory is named for the RID of the
mount being fenced.
The directory contains the following files:
.RS
.TP
.B elapsed_secs
Reading this file gives the number of seconds that have passed since
this fence request was created.
.TP
.B error
This file contains 0 when the fence request is created.  Userspace
fencing agents write 1 into this file if they are unable to fence the
mount.  The volume can not make progress until the mount is fenced so
this will cause the server to stop and another mount will be elected
leader.
.TP
.B fenced
This file contains 0 when the fence request is created.  Userspace
fencing agents write 1 into this file once the mount has been fenced.
.TP
.B ipv4_addr
This file contains the dotted quad IPv4 peer address of the last
connected socket from the mount.  Userspace fencing agents can use this
to find the host that contains the mount.
.TP
.B reason
This file contains a text string that indicates the reason that the
mount is being fenced:
.sp
.B client_recovery
- During startup the server found persistent items recording the
presence of a mount that didn't reconnect to the server in time.
.sp
.B client_reconnect
- A mount disconnected from the server and didn't reconnect in time.
.sp
.B quorum_block_leader
- As a leader was elected it read persistent blocks that indicated that
a previous leader had not shut down and cleared their quorum block.
.TP
.B rid
This file contains the hex string of the RID of the mount to be fenced.
.RE
The request directories enable userspace processes to gather the
information to find the host with the mount to fence, isolate the mount
by whatever means are appropriate (e.g. cut off network and storage
communication, force unmount the mount, isolate storage fabric ports,
reboot the host) and write to the
.I fenced
file.
.sp
Once the
.I fenced
file is written to the server reclaims the resources associated with the
fenced mount and resumes normal operations.
.sp
If the
.I error
file is written to then the server cannot make forward progress and
shuts down.
The request can similarly enter an errored state if enough time passes
before userspace completes the request.
.SH EXTENDED ATTRIBUTE TAGS
.B scoutfs
adds the
.IB scoutfs.
extended attribute namespace which uses a system of tags to extend the
functionality of extended attributes.  Immediately following the
scoutfs. prefix is a series of tag words separated by dots.  Any text
starting after the last recognized tag is considered the xattr name and
is not parsed.
.sp
Tags may be combined in any order.  Specifying a tag more than once
will return an error.  There is no explicit boundary between the end of
tags and the start of the name so unknown or incorrect tags will be
successfully parsed as part of the name of the xattr.  Tags can only be
created, updated, or removed with the CAP_SYS_ADMIN capability.
The following tags are currently supported:
.RS
.TP
.B .hide.
Attributes with the .hide. tag are not visible to the
.BR listxattr (2)
system call.  They will instead be included in the output of the
.IB LISTXATTR_HIDDEN
ioctl.  This is meant to be used by archival management agents to store
metadata that is bound to a specific volume and should not be
transferred with the file by tools that read extended attributes, like
.BR tar (1) .
.TP
.B .indx.
Attributes with the .indx. tag add the inode containing the attribute to
a filesystem-wide index.  The name of the extended attribute must end
with strings representing two values separated by dots.  The first value
is an unsigned 8bit value and the second is an unsigned 64bit value.
These attributes can only be modified with root privileges and the
attributes can not have a value.
.sp
The inodes in the index are stored in increasing sort order of the
values, with the first u8 value being most significant.  Inodes can be
at many positions as tracked by many extended attributes, and their
position follows the creation, renaming, or deletion of the attributes.
The index can be read with the read-xattr-index command which uses the
underlying READ_XATTR_INDEX ioctl.
.TP
.B .srch.
Attributes with the .srch. tag are indexed so that they can be found by
the
.IB SEARCH_XATTRS
ioctl.  The search ioctl takes an extended attribute name and returns
the inode numbers of all the inodes which contain an extended attribute
with that name.  The indexing structures behind .srch. tags are designed
to efficiently handle a large number of .srch. attributes per file with
no limits on the number of indexed files.
.TP
.B .totl.
Attributes with the .totl. tag are used to efficiently maintain counts
across all files in the system.  The attribute's name must end in three
64bit values separated by dots that specify the global total that the
extended attribute will contribute to.  The value of the extended
attribute is a string representation of the 64bit quantity which will be
added to the total.  As attributes are added, updated, or removed (and
particularly as a file is finally deleted), the corresponding global
total is also updated by the file system.  All the totals with their
name, total value, and a count of contributing attributes can be read
with the
.IB READ_XATTR_TOTALS
ioctl.
.RE
.SH FILE RETENTION MODE
A file can be set to retention mode by setting the
.IB RETENTION
attribute with the
.IB SET_ATTR_X
ioctl.  This flag can only be set on regular files and requires root
permission (the
.IB CAP_SYS_ADMIN
capability).
.sp
Once in retention mode all modifications of the file will fail.  The
only exception is that system extended attributes (all those without
the "user." prefix) may be modified.  The retention bit may be cleared
with sufficient privileges to remove the retention restrictions on other
modifications.
.SH PROJECT IDs
All inodes have a project ID attribute that can be set via the
SET_ATTR_X ioctl and displayed with the GET_ATTR_X ioctl.
Project IDs are an unsigned 64bit value and the value of 0 is reserved
to indicate that no project ID is assigned.  If a project ID is set on
a directory then all inodes created with it as the initial parent
inherit that ID, for all file types.  This includes files initially
unlinked from the namespace when created with O_TMPFILE.  Project IDs
are only automatically inherited from the parent dir on initial
creation.  They're not changed as directory entry links to the inode are
created or renamed.
.SH FORMAT VERSION
The format version defines the layout and use of structures stored on
devices and passed over the network.  The version is incremented for
every change in structures that is not backwards compatible with
previous versions.  A single version implies all changes; individual
changes can't be selectively adopted.
.sp
As a new file system is created the format version is stored in both of
the super blocks written to the metadata and data devices.  By default
the greatest supported version is written while an older supported
version may be specified.
.sp
During mount the kernel module verifies that the format versions stored
in both of the super blocks match and are supported.  That version
defines the set of features and behavior of all the mounts using the
file system, including the network protocol that is communicated over
the wire.
.sp
Any combination of software release versions that support the current
format version of the file system can safely be used concurrently.  This
allows for rolling software updates of multiple mounts using a shared
file system.
.sp
To use new incompatible features added in newer format versions the
super blocks must be updated.  This can currently only be safely
performed on a completely and cleanly unmounted file system.  The
.BR scoutfs (8)
.I change-format-version
command can be used with the
.I --offline
option to write a newer supported version into the super blocks.
It will fail if it sees any indication of unresolved mounts that may be
using the devices: either active quorum members working with their
quorum blocks or persistent records of mounted clients that haven't been
resolved.  Like creating a new file system, there is no protection
against multiple invocations of the change command corrupting the
system.  Once the version is updated older software can no longer use
the file system so this change should be performed with care.  Once the
newer format version is successfully written it can be mounted and newer
features can be used.
.sp
Each layer of the system can show its supported format versions:
.RS
.TP
.B Userspace utilities
.B scoutfs --help
includes the range of supported format versions for a given release of
the userspace utilities.
.TP
.B Kernel module
.I modinfo MODULE
shows the range of supported versions for a kernel module file in the
.I scoutfs_format_version_min
and
.I scoutfs_format_version_max
fields.
.TP
.B Inserted module
The supported version range of an inserted module can be found in
.I .note.scoutfs_format_version_min
and
.I .note.scoutfs_format_version_max
notes files in the sysfs notes directory for the inserted module,
typically
.I /sys/module/scoutfs/notes/
.TP
.B Metadata and data devices
.I scoutfs print DEVICE
shows the
.I fmt_vers
field in the initial output of the super block on the device.
.TP
.B Mounted filesystem
The version that a mount is using is shown in the
.I format_version
file in the mount's sysfs directory, typically
.I /sys/fs/scoutfs/f.FSID.r.RID/
.RE
.sp
The defined format versions are:
.RS
.TP
.B 1
Initial format version.
.TP
.B 2
Added retention mode by setting the retention attribute.  Added the
project ID inode attribute.  Added quota rules and enforcement.  Added
the .indx. extended attribute tag.
.RE
.SH CORRUPTION DETECTION
A
.B scoutfs
filesystem can detect corruption at runtime.
A catalog of kernel log messages that indicate corruption can be found in
.BR scoutfs-corruption (7) \&.
.SH SEE ALSO
.BR scoutfs (8),
.BR scoutfs-corruption (7).
.SH AUTHORS
Zach Brown