diff --git a/kmod/Documentation/scoutfs.md b/kmod/Documentation/scoutfs.md deleted file mode 100644 index e4387e7a..00000000 --- a/kmod/Documentation/scoutfs.md +++ /dev/null @@ -1,354 +0,0 @@ - -# scoutfs Engineering Compendium - ------ - -## Document Overview - -This document is intended to be a relatively unstructured but thorough -coverage of the design, implementation, and deployment of scoutfs. - -*Not Yet Discussed: repair, dump/restore, remote namespace -synchronization, compression, encryption, trim, dedup, hole punching, -SMR, iops v. bw, range locking, sorting keys by type/inode, enospc, -compaction priority, manifest server, manifest network protocol, inode -allocation, clustered open-unlink, seq queries, offline data, LSM, -forward/back compat.* - -## Raison D'ĂȘtre - -scoutfs is an archival posix file system. It's built to provide a posix -interface to petabytes of data in trillions of files through thousands -of nodes. - -scoutfs uses log-structured merge trees to achieve high operation -throughput with low device command rates. It uses ranged locking to -maintain consistent POSIX semantics amongst clustered nodes with minimum -synchronization overhead. It offers additional metadata indexing and -data residency interfaces for efficiently executing archival policies. -It is deployed on a shared block fabric for high bandwidth and low -latency. - -## Super Block - -The super block is the anchor of all the persistent storage in the block -device. It contains volume-wide configuration information and -references to the current stable versions of persistent data structures -in the rest of the block device. The super block is stored in two 4KB -blocks at a known location at the start of the device. - -To read the current super block both block locations are read. The -valid super block with the most recent sequence number is used. Either -of the super blocks can be corrupt because they're overwritten in place -and a crash during a write could scramble the block. - -Each new version of the super block is written to the block that doesn't -contain the current super block. If this new super block write fails -then the old super block can still be used and no data is lost. - -The super block, and indeed all file system data, doesn't touch a few -blocks at the start of the device to avoid corrupting blocks that are -used by host platforms that store data inside devices to manage them. - -## Inodes - -Inodes are stored in items identified by the inode number. - - key = struct scoutfs_inode_key { - .type = SCOUTFS_INODE_KEY, - .ino, - } - - val = struct scoutfs_inode { - size, nlink, uid, gid, atime, mtime, ..., - } - -The variable length value that stores the item struct gives us dense -inode packing without having to predefine an inode storage size when the -file system is created and gives us a future expansion mechanism that -uses the item length to determine the version of the inode struct that -is written. - -Inode numbers are 64bit and are never re-used. By never re-using inode -numbers we don't need to manage an inode number allocator that would -need to be consistent across nodes. We can grant large ranges of -numbers to mount clients for allocation. Each inode number uniquely -identify the lifetime of a file and avoids having to store a seperate -generation number for each inode number. - -## Extended Attributes - -Extended attributes are stored in items on the inode at the full name of -the attribute. The attribute name is limited to 255 bytes and the -attribute values is limited to 64KB. The max xattr value size is larger -than our max item size so we can store an xattr in multiple items, but -in the common case a single xattr is efficiently stored in a single -item. - - key = struct scoutfs_xattr_key { - .type = SCOUTFS_XATTR_KEY, - .ino, - .name, - struct scoutfs_xattr_key_footer { - .null = '\0', - .part, - } - } - -Storing the null after the attribute name, which can't be found in any -name, lets us accurately locate a given name in the presence of other -names that share partial prefixes. The part identifies each key's -position in the set of keys that make up the large value. Storing the -full name in each key ensures that all the keys that make up an -attribute are stored adjacent to each other. - -Each item's value starts with a header which describes portion of the -attribute value stored in the item. - - val = struct scoutfs_xattr_val_header { - .part_len, - .last_part, - .data, - } - -The result of all this is that operations on xattrs iterate over keys -starting with the name and part 0 and stop when they hit the final part -(or error on corruption if the parts aren't consistent.) - -## Directory Entries - -Directory entry items store the target inode number referred to by a -given entry name in a parent directory. The name is limited to 255 -non-null bytes. The large keys supported by our items let us store -directory entries in items indexed by the full entry name itself. - - key = struct scoutfs_dirent_key { - .type = SCOUTFS_DIRENT_KEY, - .ino, - .name, - } - - val = struct scoutfs_dirent { - .ino, - .readdir_pos, - .type, - } - -These full precision items let us work on each item for a given name -directly rather than scrambling their sorting by storing them at a hash -value of their name. Storing at a hash value not only adds the -complexity of collisions, it critically causes entry lock attempts in a -directory between mounts to be perfectly randomly distributed and -constantly conflicting with each other. Storing and range locking the -directory entries at their full name preserves non-overlapping patterns -between mounts and gives them a chance to efficiently operate on -disjoint sets of names. - -We index the directory entry items by the full name of the entry so -there is no limit imposed on the number of entries in a directory. The -system will run out of blocks to store entries long before the index is -incapable of storing them. - -While we can satisfy lookups with a full precision index, readdir -doesn't use a full precision iterator. It forces us to describe each -entry with a small scalar directory position. We use a separate item -that's indexed by this readdir position instead of the file name. - - key = struct scoutfs_readdir_key { - .type = SCOUTFS_DIRENT_KEY, - .ino, - .readdir_pos, - } - - val = struct scoutfs_dirent { - .ino, - .readdir_pos, - .type, - .name, - } - -The key's position is allocated as each entry is created. This results -in readdir returning entries ordered by creation time. Like inode -numbers, readdir positions are never re-used so that we don't have to -risk contention by maintaining a consistent free position index across -nodes. - -## Directory Entry Link Backrefs - -The third and final item used by each directory entry is an item that is -stored at the target inode instead of in the parent directory. These -backref items can be traversed to find the full paths from the root -inode to all the entries that link to the target inode. - - key = struct scoutfs_link_backref_key { - .type = SCOUTFS_LINK_BACKREF_KEY, - .ino, - .dir_ino, - .name, - } - - /* no value */ - -Iterating over these items for a given target ino yields the parent -dir_ino and full file name of every entry that references the target -inode. The entry items in the parent dir are stored at the full file -name so the only way for us to reference them is with another copy of -the file name, brining the total to three full copies of the name stored -for every directory entry. - -Because we store the full name for these backref items they do not -impose a limit on the number of hard links to an inode. - -## Regular File Data Extents - -scoutfs stores file data in block extents at 4KB granularity. Items -describe the extents of 4KB blocks that map logical file offsets to -physical block extents in the device: - - key = struct scoutfs_extent_key { - .type = SCOUTFS_EXTENT_KEY, - .ino, - .iblock, - .blkno, - .count, - .flags, - } - - /* no value */ - -The flags field indicates the state of the extent, for example it can be -preallocated but unwritten or offline. If the extent is offline then -the blkno is unused and should be zero. - -Checksums of file data are contained in items at the physical block -offset of the checksumed blocks. Each item contains a fixed number of -checksums for a given group of blocks. - - key = struct scoutfs_checksum_key { - .type = SCOUTFS_CHECKSUM_KEY, - .blkno, - } - - val = { - .crcs[8], - } - -The checksum items are keyed by the physical block number instead of the -logical file position so that the checksum items are only written as new -data is written. The checksum items are left alone as the file data -references change: truncate, unlink, hole punching, and cloning don't -have to modify checksum items. - -With these structures in place the file read and write paths in scoutfs -look very much like most other block file systems in Linux. The generic -buffer_head support code is used and our get_blocks callback reads and -writes the extent items that reference block extents. Write and sync -patterns, with the help of delalloc, preallocation, and fallocate, -determine the physical contiguity of extent allocations. Buffered -read-ahead and O_DIRECT reads walk the extent items and build large -efficient bios if the extents are physically contiguous. - -## Allocating Regular File Data Extents - -The primary persistent allocator for blocks on the device uses an -efficient bitmap with a bit for each 1MB segment. File data allocation -wants to track extents at 4KB granularity and also index them by the -size of the free extent, neither of which the segment bitmap allocator -supports. - -We have free extent items that track free block extents in the device at -the finer 4K granularity. There are two keys for each free extent: one -indexed by the block location and one by the size of the free extent. -Modifying a free extent can thus modify three different positions in the -key namespace: the block location, the old size location, and the new -size location. LSM lets us generate and merge these disjoint items -across different mounts efficiently. - -To avoid the prohibitively expensive lock contention of modifying these -items from multiple mounts, we first create groups of free extents and -assign a given mount to a group for the lifetime of its mount. - - key = struct scoutfs_free_extent_loc_key - .type = SCOUTFS_FREE_EXTENT_LOC_KEY, - .group, - .blkno, - .count, - } - - key = struct scoutfs_free_extent_len_key - .type = SCOUTFS_FREE_EXTENT_LEN_KEY, - .group, - .count, - .blkno, - } - -Mounts are responsible for mangement of the free extent items. They're -populated with the result from requests from the manifest server for -free segment blocks. They're consumed as file data is written and -logical extents are allocated. They're repopulated as file data is -truncated and its extents are freed. They're returned to the segment -allocator when they contain aligned 1MB free extents. - -Like all persistent filesystem items, the free extent items are -protected by range locks. In the common case a single mount will be -operating on its group and having all the lock operations satisfied by -range matches. Any mount can modify any group's extents by acquiring -the right locks, but this should be limited to rare attempts to -defragment or migrate free extents between groups. - -The manifest server is responsible for tracking the assigment of mounts -to groups as mounts come and go through clean mounts and unclean crashes -and recovery. Free extents can get stranded in groups that don't have -an assigned mount. A mount scrambling to find free space in other -groups would need a mechanism to discover other groups, perhaps with a -set of keys that record the presence of extents in each group. - -## Indexing Inodes by Modification Time - -As files are modified archival agents need to find these modified files -so that the archive can be updated. As inode counts explode it becomes -infeasible to scan the entire inode population and meet archival -deadlines. - -scoutfs maintains an index of inodes by modification time. An ioctl is -offered which iterates over the inodes in the order that they were -modified. The ioctl takes a timespec cursor from which to walk. It -fills a buffer with inodes and the time they were modified, sorted by -time. - -The ioctl results are inherently racey. There's nothing to stop an -inode from being modified and moved in the index between when the call -returns and the caller operates on the inode. - -This index is maintained by having time fields in the inode and -modification time items at those time values. The item key sorts the -items by time for the ioctl to iterate over. The items have no value. - - .type = SCOUTFS_MODTIME_KEY, - .ino = inode, - .ts.tv_sec = seconds, - .ts.tv_nsec = nanoseconds, - -As inodes are modified deletion items are created for the old time and -new items are inserted. LSM's ability to let us create items without -strictly locking their key value keeps these items from creating -unacceptable lock contention. If the modifying task has sufficient -locking on the inode it can modify these items and LSM will eventually -merge them into place. - -The index is keyed on real world time so that we don't have to create -our own consistent advancing clock. The clock only needs to be as -accurate as the users of the index require (this often doesn't add -unreasonable requirements, it's often already the case that arhicval -policies involve time and motivate a reasonably synchronized clock -across the cluster.) - -As inodes are deleted their modification items are deleted. - -> *XXX Need to figure out how to resolve multiple items created by -> concurrent writers. We want concurrent parallel writers, say, and -> they'll all way to create their own items at their write times. We'd -> need to be able to find those to delete them during future -> modification or deletion. Sort of sounds like we want -> per-node-identity backrefs for each to maintain and to purge as nodes -> leave the cluster.