Add a checker that can walk blocks from the super block to make sure
that all metadata block numbers are accounted for. This initial version
isn't suitable for use without further refinement, but we keep it
compiling so that it keeps up with structure changes.
Signed-off-by: Zach Brown <zab@versity.com>
The _READ_XATTR_TOTALS ioctl had manual code for merging the .totl.
total and value while reading fs items. We're going to want to do this
in another reader so let's put these in their own functions that clearly
isolate the logic of merging the fs items into a coherent result.
We can get rid of some of the totl_read_ counters that tracked which
items we were merging. They weren't adding much value and conflated the
reading ioctl interface with the merging logic.
Signed-off-by: Zach Brown <zab@versity.com>
Add the weak item cache that is used for reads that can handle results
being a little behind. This gives us a lot more freedom to implement a
cache that is biased towards concurrent reads.
Signed-off-by: Zach Brown <zab@versity.com>
The existing stat_more and setattr_more interfaces aren't extensible.
This solves that problem by adding attribute interfaces that explicitly
specify which fields to work with.
We're about to add a few more inode fields and it makes sense to add
them to this extensible structure rather than adding more ioctls or
relatively clumsy xattrs. This is modeled loosely on the upstream
kernel's statx support.
The ioctl entry points call core functions so that we can also implement
the existing stat_more and setattr_more interfaces in terms of these new
attr_x functions.
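A rough sketch of the shape such an interface could take (the struct,
field, and flag names here are illustrative assumptions, not the actual
format):

    /* hypothetical mask-based attr struct, in the spirit of statx(2) */
    struct scoutfs_ioctl_inode_attr_x {
            __u64 x_mask;           /* which fields are requested/valid */
            __u64 ino;
            __u64 data_version;
            __u64 online_blocks;
            __u64 offline_blocks;
    };

    #define SCOUTFS_ATTR_X_DATA_VERSION     (1ULL << 0)
    #define SCOUTFS_ATTR_X_ONLINE_BLOCKS    (1ULL << 1)
    #define SCOUTFS_ATTR_X_OFFLINE_BLOCKS   (1ULL << 2)

Unknown mask bits can be rejected with EINVAL, which is what lets us
add fields later without new ioctls.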
Signed-off-by: Zach Brown <zab@versity.com>
Move to the more recent interfaces for counting and scanning cached
objects to shrink.
Signed-off-by: Zach Brown <zab@versity.com>
Signed-off-by: Auke Kok <auke.kok@versity.com>
Add support for the POSIX ACLs as described in acl(5). Support is
enabled by default and can be explicitly enabled or disabled with the
acl or noacl mount options, respectively.
Signed-off-by: Zach Brown <zab@versity.com>
Introduce global volume options. They're stored in the superblock and
can be seen in sysfs files that use network commands to get and
set the options on the server.
Signed-off-by: Zach Brown <zab@versity.com>
Today an inode's items are deleted once its nlink reaches zero and the
final iput is called in a local mount. This can delete an inode out
from under other mounts that opened it before it was unlinked.
We fix this by adding cached inode tracking. Each mount maintains
groups of cached inode bitmaps at the same granularity as inode locking.
As a mount performs its final iput it gets a bitmap from the server
which indicates if any other mount has inodes in the group open.
This means the two fast paths, opening and closing linked files and
deleting a file that was unlinked locally, only pay a moderate cost:
either maintaining the bitmap locally or getting the open map once per
lock group. Removing many files in a group will only lock and get the
open map once per group.
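A minimal sketch of the per-group tracking (the struct, constant, and
field names here are assumptions, not the real format):

    #define INODES_PER_GROUP 64     /* illustrative lock group size */

    /* one bitmap of locally cached inodes per inode lock group */
    struct cached_ino_group {
            struct rb_node node;    /* indexed by group number */
            u64 group_nr;           /* ino / INODES_PER_GROUP */
            unsigned long bits[BITS_TO_LONGS(INODES_PER_GROUP)];
    };

On final iput the mount asks the server for the group's open map and
only deletes the inode's items if no other mount has the inode's bit
set.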
Signed-off-by: Zach Brown <zab@versity.com>
Add a little set of functions to help the server track which clients are
waiting to recover which state. The open map messages need to wait for
recovery, so recovery tracking can no longer live only in the lock
server.
Signed-off-by: Zach Brown <zab@versity.com>
Instead of hashing headers, define an interop version. Do not mount
superblocks that have a different version, either higher or lower.
Since this is pretty much the same as the format hash except it's a
constant, minimal code changes are needed.
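The mount-time check then amounts to something like this (the helper
and field names are approximations):

    if (le64_to_cpu(super->version) != SCOUTFS_INTEROP_VERSION) {
            scoutfs_err(sb, "super version %llu, module supports %llu",
                        le64_to_cpu(super->version),
                        (u64)SCOUTFS_INTEROP_VERSION);
            return -EINVAL;
    }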
The initial dev version is 0, with the intent that it will be bumped to
1 immediately prior to tagging the initial release.
Update README. Fix comments.
Add interop version to notes and modinfo.
Signed-off-by: Andy Grover <agrover@versity.com>
Instead, explicitly add padding fields, and adjust member ordering to
eliminate compiler-added padding between members and at the end of the
struct (if possible: some structs end in a u8[0] array).
This should prevent unaligned accesses. Not a big deal on x86_64, but
other archs like aarch64 really want this.
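The pattern looks like this (illustrative struct):

    struct scoutfs_example {
            __le64 big;
            __le32 small;
            __le32 _pad;    /* explicit, instead of compiler tail padding */
    };

With members ordered largest-first and the tail padded out explicitly,
the layout is identical on every arch and every member stays naturally
aligned.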
Signed-off-by: Andy Grover <agrover@versity.com>
Add an allocator which uses btree items to store extents. Both the
client and server will use this for btree blocks, the client will use it
for srch blocks and data extents, and the server will move extents
between the core fs allocator btree roots and the clients' roots.
Signed-off-by: Zach Brown <zab@versity.com>
Add infrastructure for working with extents. Callers provide callbacks
which operate on their extent storage while this code performs the
fiddly splitting and merging of extents. This layer doesn't have any
persistent structures itself; it only operates on native structs in
memory.
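A sketch of the layering (struct and member names are assumptions):

    struct scoutfs_extent {
            u64 start;      /* first logical block */
            u64 len;
            u64 map;        /* mapped physical block, if any */
            u8 flags;
    };

    /* callers implement the storage, this layer splits and merges */
    struct scoutfs_extent_io_ops {
            int (*next)(struct super_block *sb, void *arg,
                        struct scoutfs_extent *ext);
            int (*insert)(struct super_block *sb, void *arg,
                          struct scoutfs_extent *ext);
            int (*remove)(struct super_block *sb, void *arg,
                          struct scoutfs_extent *ext);
    };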
Signed-off-by: Zach Brown <zab@versity.com>
Add an item cache between fs callers and the forest of btrees. Calling
out to the btrees for every item operation was far too expensive. This
gives us a flexible in-memory structure for working with items that
isn't bound by the constraints of persistent block IO. We can stream
large groups of items to and from the btrees on rare occasions and then
use efficient kernel memory structures for more frequent item
operations.
This adds the infrastructure, nothing is calling it yet.
Signed-off-by: Zach Brown <zab@versity.com>
This introduces the srch mechanism that we'll use to accelerate finding
files based on the presence of a given named xattr. This is an
optimized version of the initial prototype that was using locked btree
items for .indx. xattrs.
This is built around specific compressed data structures, an operation
cost model that matches the reality of orders of magnitude more writers
than readers, and a relaxed locking model. With all of this combined,
maintaining the xattrs no longer tanks creation rates while still
providing excellent search latencies, given that searches are defined
as rare and relatively expensive.
The core data type is the srch entry which maps a hashed name to an
inode number. Mounts can append entries to the end of unsorted log
files during their transaction. The server tracks these files and
rotates them into a list of files as they get large enough. Mounts have
compaction work that regularly asks the server for a set of files to
read and combine into a single sorted output file. The server only
initiates compactions when it sees a number of files of roughly the same
size. Searches then walk all the committed srch files, both log files
and sorted compacted files, looking for entries that associate an xattr
name with an inode number.
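The entry itself is tiny, which is what keeps appending and compacting
cheap; roughly (field names are assumptions):

    struct scoutfs_srch_entry {
            __le64 hash;    /* hash of the xattr name */
            __le64 ino;     /* inode that has the xattr */
            __le64 id;      /* orders instances of the same pair */
    };

A search hashes the queried name and gathers the inode numbers of
matching entries from every file it walks.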
Signed-off-by: Zach Brown <zab@versity.com>
Convert metadata block and file data extent allocations to use the radix
allocator.
Most of this is simple transitions between types and calls. The server
no longer has to initialize blocks because mkfs can write a single
radix parent block with fully set parent refs to initialize a full
radix. We remove the code and fields that were responsible for adding
uninitialized data and metadata.
The rest of the unused block allocator code is only ifdefed out. It'll
be removed in a separate patch to reduce noise here.
Signed-off-by: Zach Brown <zab@versity.com>
Add the allocator that uses bits stored in the leaves of a cow radix.
It'll replace two metadata and data allocators that were previously
storing allocation bitmap fragments in btree items.
Signed-off-by: Zach Brown <zab@versity.com>
The btree forest item storage doesn't have as much item granular state
as the item cache did. The item cache could tell if a cached item was
populated from persistent storage or was created in memory. It could
simply remove created items rather than leaving behind a deletion item.
The btree forest item storage mechanism, working with cached btree
blocks, can't do this. It has to create deletion items when deleting
newly created items because it doesn't know whether the item already
exists in the persistent record.
This created a problem with the extent storage we were using. The
individual extent items were stored with a key set to the last logical
block of their extent. As extents grew or shrank they often were
deleted and created at different key values during a transaction. In
the btree forest log trees this left a huge stream of deletion items
behind, one for every previous version of the extent. Then searches for
an extent covering a block would have to skip over all these deleted
items before hitting the current stored extent.
Streaming writes would perform O(n) work for every extent operation,
which got out of hand. This large change solves the problem by using
more coarse and stable item storage to track free blocks and blocks
mapped into file data.
For file data we now have large packed extent items which store packed
representations of all the logical mappings of a fixed region of a file.
The data code has loading and storage functions which transfer that
persistent version to and from the version that is modified in memory.
Free blocks are stored in bitmaps that are similarly efficiently packed
into fixed size items. The client is no longer working with free extent
items managed by the forest, it's working with free block bitmap btrees
directly. It needs access to the client's metadata block allocator and
block write contexts so we move those two out of the forest code and up
into the transaction.
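A sketch of the two stable item shapes (names and sizes here are
assumptions):

    #define PACKED_EXTENT_BYTES 192 /* illustrative */
    #define FREE_BITMAP_U64S 8      /* illustrative */

    /* all mappings within a fixed region of a file's logical blocks */
    struct packed_extent_item {
            __le64 first_logical;
            __u8 packed[PACKED_EXTENT_BYTES];
    };

    /* one bit per free block in a fixed run of device blocks */
    struct free_bitmap_item {
            __le64 base_blkno;
            __le64 bits[FREE_BITMAP_U64S];
    };

Because both items live at stable keys, changing an extent dirties an
existing item in place instead of deleting and recreating items at
shifting key values.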
Previously the client and server would exchange extents with network
messages. Now the roots of the btrees that store the free block bitmap
items are communicated along with the roots of the other trees involved
in a transaction. The client doesn't need to send free extents back to
the server so we can remove those tasks and rpcs.
The server no longer has to manage free extents. It transfers block
bitmap items between trees around commits. All of its extent
manipulation can be removed.
The item size portion of transaction item counts is removed because
we're not using that level of granularity now that metadata transactions
are dirty btree blocks instead of dirty items we pack into fixed size
segments.
Signed-off-by: Zach Brown <zab@versity.com>
The forest code presents a consistent item interface that's implemented
on top of a forest of persistent btrees.
Signed-off-by: Zach Brown <zab@versity.com>
Add the core lock server code for providing a lock service from our
server. The lock messages are wired up but nothing calls them.
Signed-off-by: Zach Brown <zab@versity.com>
Add a quorum election implementation. The mounts that can participate
in the election are specified in a quorum config array in the super
block. Each configured participant is assigned a preallocated block
that it can write to.
All mounts read the quorum blocks to find the member who was elected the
leader and should be running the server. The voting mounts loop reading
voting blocks and writing their vote block until someone is elected with
a majority.
Nothing calls this code yet; this adds the initial implementation and
format.
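A sketch of what a participant's quorum block might hold (illustrative
only, not the real format):

    struct quorum_block {
            __le64 fsid;
            __le64 term;        /* election round, increases over time */
            __u8 voter_nr;      /* configured slot of the writer */
            __u8 vote_for;      /* slot this participant votes for */
            __u8 is_leader;     /* set once elected by a majority */
    };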
Signed-off-by: Zach Brown <zab@versity.com>
Reformat the scoutfs-y object list so that there's one object per line.
Diffs now clearly demonstrate what is changing instead of having word
wrapping constantly obscuring changes in the built objects.
(Did everyone spot the scoutfs_trace sorting mistake? Another reason
not to mash everything into wrapped lines :)).
Signed-off-by: Zach Brown <zab@versity.com>
The previous commit added shared networking code and disabled the old
unused code. This removes all that unused client and server code that
was refactored to become the shared networking code.
Signed-off-by: Zach Brown <zab@versity.com>
The client and server networking code was a bit too rudimentary.
The existing code only had support for the client synchronously and
actively sending requests that the server could only passively respond
to. We're going to need the server to be able to send requests to
connected clients and it can't block waiting for responses from each
one.
This refactors sending and receiving in both the client and server code
into shared networking code. It's built around a connection struct that
then holds the message state. Both peers on the connection can send
requests and send responses.
The existing code only retransmitted requests down newly established
connections. Requests could be processed twice.
This adds robust reliability guarantees. Requests are resent until
their response is received. Requests are only processed once by a given
peer, regardless of the connection's transport socket. Responses are
reliably resent until acknowledged.
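The core of those rules can be sketched like so (names are
assumptions):

    struct conn_info {
            u64 next_send_id;       /* ids survive socket reconnection */
            u64 greatest_recvd_id;  /* highest request id processed */
            struct list_head resend_queue;  /* sends awaiting response */
    };

    /* on receiving a request whose id has already been processed, the
     * saved response is resent rather than processing it again */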
This only adds the new refactored code and disables the old unused code
to keep the diff footprint minimal. A following commit will remove all
the unused code.
Signed-off-by: Zach Brown <zab@versity.com>
The code that works with the super block had drifted a bit. We still
had two super blocks from an old design and we weren't doing anything
with the crc.
Move to only using one super block at a fixed blkno and store and verify
its crc field by sharing code with the btree block checksumming.
Signed-off-by: Zach Brown <zab@versity.com>
The userspace trace event printing code has trouble with arguments that
refer to fields in entries. Add macros to make entries for all the
fields and use them as the formatted arguments.
We also remove the mapping of zone and type to strings. It's smaller to
print the values directly and gets rid of some silly code.
Signed-off-by: Zach Brown <zab@versity.com>
Add a file of extent functions that callers will use to manipulate and
store extents in different persistent formats.
Signed-off-by: Zach Brown <zab@versity.com>
Originally the item interfaces were written with full support for
vectored keys and values. Callers constructed keys and values made up
of header structs and data buffers. Segments supported much larger
values which could span pages when stored in memory.
But over time we've pulled that support back. Keys are described by a
key struct instead of a multi-element kvec. Values are now much smaller
and don't span pages. The item interfaces still use the kvec arrays but
everyone only uses a single element.
So let's make the world a whole lot less awful by having the item
interfaces support only a single value buffer specified by a kvec. A
bunch of code disappears and the result is much easier to understand.
Signed-off-by: Zach Brown <zab@versity.com>
We had an excessive number of layers between scoutfs and the dlm code in
the kernel. We had dlmglue, the scoutfs locks, and task refs. Each
layer had structs that track the lifetime of the layer below it. We
were about to add another layer to hold on to locks just a bit longer so
that we can avoid down conversion and transaction commit storms under
contention.
This collapses all those layers into a simple state machine in lock.c that
manages the mode of dlm locks on behalf of the file system.
The users of the lock interface are mainly unchanged. We did change
from a heavier trylock to a lighter nonblocking lock attempt, which
required changing the single rare readpage use. Lock fields change, so
a few external users of those fields change as well.
This not only removes a lot of code, it also contains functional
improvements. For example, it can now convert directly to CW locks with
a single lock request instead of having to use two by first converting
to NL.
It introduces the concept of an unlock grace period. Locks won't be
dropped on behalf of other nodes soon after being unlocked, so that
local tasks can batch up work before the other node gets its turn.
This can result in two orders of magnitude improvements in the time it
takes to, say, change a set of xattrs on the same file population from
two nodes concurrently.
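The grace check itself is simple; a sketch with invented names:

    static bool lock_in_grace_period(struct scoutfs_lock *lock)
    {
            /* ignore remote requests until a grace period after the
             * last local unlock has passed */
            return time_before(jiffies, lock->last_unlock_jiffies +
                                        SCOUTFS_LOCK_GRACE_JIFFIES);
    }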
There are significant changes to trace points, counters, and debug files
that follow the implementation changes.
Signed-off-by: Zach Brown <zab@versity.com>
This is implemented by filling in our export ops functions.
When we get those right, the VFS handles most of the details for us.
Internally, scoutfs handles are two u64's (ino and parent ino) and a
type which indicates whether the handle contains the parent ino or not.
Surprisingly enough, no existing type matches this pattern so we use our
own types to identify the handle.
Most of the export ops are self explanatory: scoutfs_encode_fh() takes
an inode and an optional parent and encodes those into the smallest
handle that will fit. scoutfs_fh_to_[dentry|parent] turn an existing
file handle into a dentry.
scoutfs_get_parent() is a bit different and would be called on
directory inodes to connect a disconnected dentry path. For
scoutfs_get_parent(), we can export add_next_linkref() and use the backref
mechanism to quickly find a parent directory.
scoutfs_get_name() is almost identical to scoutfs_get_parent(). Here we're
linking an inode to a name which exists in the parent directory. We can also
use add_next_linkref, and simply copy the name from the backref.
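All of this is wired up through the kernel's export_operations, with
the handlers being the functions described above:

    static const struct export_operations scoutfs_export_ops = {
            .encode_fh      = scoutfs_encode_fh,
            .fh_to_dentry   = scoutfs_fh_to_dentry,
            .fh_to_parent   = scoutfs_fh_to_parent,
            .get_parent     = scoutfs_get_parent,
            .get_name       = scoutfs_get_name,
    };

and sb->s_export_op pointed at it at mount time.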
As a result of this patch we can also now export scoutfs file systems
via NFS, however testing NFS thoroughly is outside the scope of this
work so export support should be considered experimental at best.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
[zab edited <= NAME_MAX]
I wanted to add a sysfs file that exports the fsid for the mount of a
given device. But our use of sysfs was confusing and spread through
super.c and counters.c.
This moves the core of our sysfs use to sysfs.c. Instead of defining
the per-mount dir as a kset we define it as an object with attributes
which gives us a place to add an fsid attribute.
counters still have their own whack of sysfs implementation. We'll let
them keep it for now, but we could move it into sysfs.c. It's just
counter iteration around the insane sysfs obj/attr/type nonsense. For
now it just needs to know to add its counters dir as a child of the
per-mount dir instead of adding it to the kset.
Signed-off-by: Zach Brown <zab@versity.com>
Calculate the hash of format.h and ioctl.h and make sure the hash stored
in the super during mkfs matches our calculated hash on mount.
Signed-off-by: Zach Brown <zab@versity.com>
Add some functions for storing and using per-task storage in a list.
Callers can use this to pass pointers to children in a given scope when
interfaces don't allow for passing individual arguments amongst
concurrent callers in the scope.
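A sketch of the idea (names are assumptions, locking elided):

    struct scoutfs_per_task_entry {
            struct list_head head;
            struct task_struct *task;
            void *ptr;
    };

    /* a caller deeper in the scope finds the pointer its task stored */
    static void *scoutfs_per_task_get(struct list_head *list)
    {
            struct scoutfs_per_task_entry *ent;

            list_for_each_entry(ent, list, head) {
                    if (ent->task == current)
                            return ent->ptr;
            }
            return NULL;
    }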
Signed-off-by: Zach Brown <zab@versity.com>
With trylock implemented we can add locking in readpage. After that it's
pretty easy to implement our own read/write functions which at this
point are more or less wrapping the kernel helpers in the correct
cluster locking.
Data invalidation is a bit interesting. If the lock we are invalidating
is an inode group lock, we use the lock boundaries to incrementally
search our inode cache. When an inode struct is found, we sync and
(optionally) truncate pages.
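That incremental search could look roughly like this (the lookup helper
and range variables are invented for illustration):

    u64 ino = lock_start_ino;
    struct inode *inode;

    /* find each cached inode in the lock's range, sync it, and
     * optionally drop its pages */
    while ((inode = scoutfs_ilookup_next(sb, ino, lock_end_ino))) {
            filemap_write_and_wait(inode->i_mapping);
            if (invalidate)
                    truncate_inode_pages(inode->i_mapping, 0);
            ino = scoutfs_ino(inode) + 1;
            iput(inode);
    }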
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
[zab: adapted to newer lock call, fixed some error handling]
Signed-off-by: Zach Brown <zab@versity.com>
Dlmglue is built on top of this. Bring in the portions we need, which
include the stackglue API as well as most of the fs/dlm implementation.
I left off the Ocfs2 specific version and connection handling. Also
left out is the old Ocfs2 dlm support which we'll never want.
Like dlmglue, we keep as much of the generic stackglue code intact
here. This will make translating to/from upstream patches much easier.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
The networking code was really suffering from trying to combine the
client and server processing paths into one file. The code can be a lot
simpler if we give the client and server their own processing paths that
take their different socket lifecycles into account.
The client maintains a single connection. Blocked senders work on the
socket under a sending mutex. The recv path runs in work that can be
canceled after first shutting down the socket.
A long running server work function acquires the listener lock, manages
the listening socket, and accepts new sockets. Each accepted socket has
a single recv work blocked waiting for requests. That then spawns
concurrent processing work which sends replies under a sending mutex.
All of this is torn down by shutting down sockets and canceling work
which frees its context.
All this restructuring makes it a lot easier to track what is happening
in mount and unmount between the client and server. This fixes bugs
where unmount was failing because the monolithic socket shutdown
function was queueing other work while it was draining.
Signed-off-by: Zach Brown <zab@versity.com>