scoutfs

mirror of https://github.com/versity/scoutfs.git synced 2026-04-30 01:46:54 +00:00

Author	SHA1	Message	Date
Auke Kok	5e2009f939	Avoid double counting deltas from non-input finalized log trees. Readers currently accumulate all finalized log tree deltas into a single bucket for deciding whether they are already in fs_root or not, but finalized trees that aren't inputs to a current merge will have higher seqs, and thus we may be double applying deltas already merged into fs_root. To distinguish, scoutfs_totl_merge_contribute() needs to know the merge status item seq. We change wkic's get_roots() from using the SCOUTFS_NET_CMD_GET_ROOTS RPC to reading the superblock directly. This is needed because totl merge resolution has to use the same data as the btree roots it is operating on, thus we can't grab it from a SCOUTFS_NET_CMD_GET_ROOTS packet, which is likely different. Signed-off-by: Auke Kok <auke.kok@versity.com>	2026-04-10 13:50:21 -07:00
Auke Kok	8bdc20af21	Rename/reword FINALIZED to MERGE_INPUT. These mislabeled members and enums were clearly not describing the actual data being handled and obfuscating the intent of avoiding mixing merge input items with non-merge input items. Signed-off-by: Auke Kok <auke.kok@versity.com>	2026-04-10 13:50:21 -07:00
Zach Brown	5f156b7a36	Add scoutfs_forest_read_items_roots Add a forest item reading interface that lets the caller specify the net roots instead of always getting them from a network request. Signed-off-by: Zach Brown <zab@versity.com>	2024-06-28 15:09:05 -07:00
Zach Brown	4b87045447	Pre-declare scoutfs_lock in forest.h Definitions in forest.h use lock pointers. Pre-declare the struct so it doesn't break inclusion without lock.h, following current practice in the header. Signed-off-by: Zach Brown <zab@versity.com>	2024-06-25 15:11:20 -07:00
Zach Brown	95ed36f9d3	Maintain inode count in super and log trees Add a count of used inodes to the super block and a change in the inode count to the log_trees struct. Client transactions track the change in inode count as they create and delete inodes. The log_trees delta is added to the count in the super as finalized log_trees are deleted. Signed-off-by: Zach Brown <zab@versity.com>	2021-10-28 12:30:47 -07:00
Zach Brown	b9a0f1709f	Add xattr .totl. tag Add the .totl. xattr tag. When the tag is set the end of the name specifies a total name with 3 encoded u64s separated by dots. The value of the xattr is a u64 that is added to the named total. An ioctl is added to read the totals. Signed-off-by: Zach Brown <zab@versity.com>	2021-09-13 14:41:07 -07:00
Zach Brown	a59fd5865d	Add seq and flags to btree items The fs log btrees have values that start with a header that stores the item's seq and flags. There's a lot of sketchy code that manipulates the value header as items are passed around. This adds the seq and flags as core item fields in the btree. They're only set by the interfaces that are used to store fs items: _insert_list and _merge. The rest of the btree items that use the main interface don't work with the fields. This was done to help delta items discover when logged items have been merged before the finalized log btrees are deleted and the code ends up being quite a bit cleaner. Signed-off-by: Zach Brown <zab@versity.com>	2021-09-09 14:44:55 -07:00
Zach Brown	d6bed7181f	Remove almost all interruptible waits As subsystems were built I tended to use interruptible waits in the hope that we'd let users break out of most waits. The reality is that we have significant code paths that have trouble unwinding. Final inode deletion during iput->evict in a task is a good example. It's madness to have a pending signal turn an inode deletion from an efficient inline operation to a deferred background orphan inode scan deletion. It also happens that golang built pre-emptive thread scheduling around signals. Under load we see a surprising amount of signal spam and it has created surprising error cases which would have otherwise been fine. This changes waits to expect that IOs (including network commands) will complete reasonably promptly. We remove all interruptible waits with the notable exception of breaking out of a pending mount. That requires shuffling setup around a little bit so that the first network message we wait for is the lock for getting the root inode. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-30 13:22:42 -07:00
Zach Brown	9db3b475c0	Stop log merge work earlier during unmount The forest log merge work calls into the client to send commit requests to the server. The forest is usually destroyed relatively late in the sequence and can still be running after the client is destroyed. Adding a _forest_stop call lets us stop the log merging work before the client is destroyed. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-02 10:54:56 -07:00
Zach Brown	65c39e5f97	Item seq is max of trans and lock write_seq Rename the item version to seq and set it to the max of the transaction seq and the lock's write_seq. This lets btree item merging choose a seq at which all dirty items written in future commits must have greater seqs. It can drop the seqs from items written to the fs tree during btree merging knowing that there aren't any older items out in transactions that could be mistaken for newer items. Signed-off-by: Zach Brown <zab@versity.com>	2021-06-15 15:25:14 -07:00
Zach Brown	ff532eba75	scoutfs: recover max lock write_version Write locks are given an increasing version number as they're granted which makes its way into items in the log btrees and is used to find the most recent version of an item. The initialization of the lock server's next write_version for granted locks dates back to the initial prototype of the forest of log btrees. It is only initialized to zero as the module is loaded. This means that reloading the module, perhaps by rebooting, resets all the item versions to 0 and can lead to newly written items being ignored in favour of older existing items with greater versions from a previous mount. To fix this we initialize the lock server's write_version to the greatest of all the versions in items in log btrees. We add a field to the log_trees struct which records the greatest version which is maintained as we write out items in transactions. These are read by the server as it starts. Then lock recovery needs to include the write_version so that the lock_server can be sure to set the next write_version past the greatest version in the currently granted locks. Signed-off-by: Zach Brown <zab@versity.com>	2020-10-30 11:14:10 -07:00
Zach Brown	e60f4e7082	scoutfs: use full extents for data and alloc Previously we'd avoided full extents in file data mapping items because we were deleting items from forest btrees directly. That created deletion items for every version of file extents as they were modified. Now we have the item cache which can remove deleted items from memory when deletion items aren't necessary. By layering file data extents on an extent layer, we can also transition allocators to use extents and fix a lot of problems in the radix block allocator. Most of this change is churn from changing allocator function and struct names. File data extents no longer have to manage loading and storing from and to packed extent items at a fixed granularity. All those loops are torn out and data operations now call the extent layer with their callbacks instead of calling its packed item extent functions. This now means that fallocate and especially restoring offline extents can use larger extents. Small file block allocation now comes from a cached extent which reduces item calls for small file data streaming writes. The big change in the server is to use more root structures to manage recursive modification instead of relying on the allocator to notice and do the right thing. The radix allocator tried to notice when it was actively operating on a root that it was also using to allocate and free metadata blocks. This resulted in a lot of bugs. Instead we now double buffer the server's avail and freed roots so that the server fills and drains the stable roots from the previous transaction. We also double buffer the core fs metadata avail root so that we can increase the time to reuse freed metadata blocks. The server now only moves free extents into client allocators when they fall below a low threshold. This reduces the shared modification of the client's allocator roots which requires cold block reads on both the client and server. Signed-off-by: Zach Brown <zab@versity.com>	2020-10-26 15:19:03 -07:00
Zach Brown	12067e99ab	scoutfs: remove item granular work from forest Now that the item cache is bearing the load of high frequency item calls, we can remove all the item granular work that the forest was trying to do. The item cache amortizes the cost of the forest so its remaining methods can go straight to the btrees and don't need complicated state to reduce the overhead of item calls. Signed-off-by: Zach Brown <zab@versity.com>	2020-08-26 14:39:12 -07:00
Zach Brown	b1757a061e	scoutfs: add forest methods for item cache Add forest calls that the item cache will use. It needs to read all the items in the leaf blocks of forest btree which could contain the key, write dirty items to the log btree, and dirty bits in the bloom block as items are dirtied. Signed-off-by: Zach Brown <zab@versity.com>	2020-08-26 14:39:12 -07:00
Zach Brown	f8e1812288	scoutfs: add srch infrastructure This introduces the srch mechanism that we'll use to accelerate finding files based on the presence of a given named xattr. This is an optimized version of the initial prototype that was using locked btree items for .indx. xattrs. This is built around specific compressed data structures, having the operation cost match the reality of orders of magnitude more writers than readers, and adopting a relaxed locking model. Combine all of this and maintaining the xattrs no longer tanks creation rates while maintaining excellent search latencies, given that searches are defined as rare and relatively expensive. The core data type is the srch entry which maps a hashed name to an inode number. Mounts can append entries to the end of unsorted log files during their transaction. The server tracks these files and rotates them into a list of files as they get large enough. Mounts have compaction work that regularly asks the server for a set of files to read and combine into a single sorted output file. The server only initiates compactions when it sees a number of files of roughly the same size. Searches then walk all the committed srch files, both log files and sorted compacted files, looking for entries that associate an xattr name with an inode number. Signed-off-by: Zach Brown <zab@versity.com>	2020-08-26 14:39:12 -07:00
Zach Brown	85142dcadf	scoutfs: use radix allocator Convert metadata block and file data extent allocations to use the radix allocator. Most of this is simple transitions between types and calls. The server no longer has to initialize blocks because mkfs can write a single radix parent block with fully set parent refs to initialize a full radix. We remove the code and fields that were responsible for adding uninitialized data and metadata. The rest of the unused block allocator code is only ifdefed out. It'll be removed in a separate patch to reduce noise here. Signed-off-by: Zach Brown <zab@versity.com>	2020-02-25 12:03:46 -08:00
Zach Brown	dee9fbcf66	scoutfs: use packed extents and bitmaps The btree forest item storage doesn't have as much item granular state as the item cache did. The item cache could tell if a cached item was populated from persistent storage or was created in memory. It could simply remove created items rather than leaving behind a deletion item. The cached btree blocks in the btree forest item storage mechanism can't do this. It has to create deletion items when deleting newly created items because it doesn't know if the item already exists in the persistent record or not. This created a problem with the extent storage we were using. The individual extent items were stored with a key set to the last logical block of their extent. As extents grew or shrank they often were deleted and created at different key values during a transaction. In the btree forest log trees this left a huge stream of deletion items behind, one for every previous version of the extent. Then searches for an extent covering a block would have to skip over all these deleted items before hitting the current stored extent. Streaming writes would operate on O(n) for every extent operation. It got to be out of hand. This large change solves the problem by using more coarse and stable item storage to track free blocks and blocks mapped into file data. For file data we now have large packed extent items which store packed representations of all the logical mappings of a fixed region of a file. The data code has loading and storage functions which transfer that persistent version to and from the version that is modified in memory. Free blocks are stored in bitmaps that are similarly efficiently packed into fixed size items. The client is no longer working with free extent items managed by the forest, it's working with free block bitmap btrees directly. It needs access to the client's metadata block allocator and block write contexts so we move those two out of the forest code and up into the transaction. Previously the client and server would exchange extents with network messages. Now the roots of the btrees that store the free block bitmap items are communicated along with the roots of the other trees involved in a transaction. The client doesn't need to send free extents back to the server so we can remove those tasks and rpcs. The server no longer has to manage free extents. It transfers block bitmap items between trees around commits. All of its extent manipulation can be removed. The item size portion of transaction item counts are removed because we're not using that level of granularity now that metadata transactions are dirty btree blocks instead of dirty items we pack into fixed sized segments. Signed-off-by: Zach Brown <zab@versity.com>	2020-01-17 11:21:36 -08:00
Zach Brown	858dad1d51	scoutfs: add forest subsystem The forest code presents a consistent item interface that's implemented on top of a forest of persistent btrees. Signed-off-by: Zach Brown <zab@versity.com>	2020-01-17 11:21:36 -08:00

18 Commits
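
Commit b9a0f1709f above describes .totl. xattr names as ending in three encoded u64s separated by dots. As a rough illustration only, the parsing it implies might look like the sketch below. Everything here is an assumption rather than scoutfs code: the struct totl_name type, the totl_parse_name() and parse_u64() helpers, the decimal encoding of the u64s, and the example name used afterwards are all hypothetical.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical container for the three u64s that name a total. */
struct totl_name {
	uint64_t a, b, c;
};

/*
 * Parse an unsigned decimal u64 from the bytes in [s, end).
 * Assumes decimal encoding; the commit message only says "encoded".
 * Returns 0 on success, -EINVAL on empty, non-digit, or overflowing input.
 */
static int parse_u64(const char *s, const char *end, uint64_t *out)
{
	uint64_t v = 0;

	if (s == end)
		return -EINVAL;

	for (; s < end; s++) {
		unsigned int d = (unsigned char)*s - '0';

		if (d > 9)
			return -EINVAL;
		/* reject values that would not fit in 64 bits */
		if (v > (UINT64_MAX - d) / 10)
			return -EINVAL;
		v = v * 10 + d;
	}

	*out = v;
	return 0;
}

/*
 * Walk backwards from the end of the xattr name, peeling off three
 * dot-prefixed fields.  Each field must be preceded by a '.', so a
 * bare "1.2.3" with no leading tag is rejected.
 */
int totl_parse_name(const char *name, struct totl_name *tn)
{
	const char *end = name + strlen(name);
	uint64_t *fields[3] = { &tn->c, &tn->b, &tn->a };
	int i;

	for (i = 0; i < 3; i++) {
		const char *start = end;

		/* find the start of the last dot-separated field */
		while (start > name && start[-1] != '.')
			start--;
		if (start == name)
			return -EINVAL;
		if (parse_u64(start, end, fields[i]))
			return -EINVAL;

		end = start - 1; /* step back over the separating dot */
	}

	return 0;
}
```

With a made-up name such as "scoutfs.totl.example.1.2.3", the sketch fills a, b and c with 1, 2 and 3. Per the commit message, the xattr's u64 value would then be added to the total identified by that triple.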