The btree_merge code was pinning leaf blocks for all input btrees as it
iterated over them. This doesn't work when there are a very large
number of input btrees. It can run out of memory trying to hold a
reference to a 64KiB leaf block for each input root.
This reworks the btree merging code. It reads a window of blocks from
all input trees to get a set of merged items. It can take multiple
passes to complete the merge but by setting the merge window large
enough this overhead is reduced. Merging now consumes a fixed amount of
memory rather than using memory proportional to the number of input
btrees.
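A minimal sketch of the windowed loop; merge_ctx, read_window() and
insert_merged() are hypothetical stand-ins for the real interfaces:

    #include <stdbool.h>

    struct merge_ctx;                         /* hypothetical merge state */
    int read_window(struct merge_ctx *ctx);   /* load a bounded item window */
    int insert_merged(struct merge_ctx *ctx); /* write the merged window out */
    bool inputs_drained(struct merge_ctx *ctx);

    int btree_merge_windowed(struct merge_ctx *ctx)
    {
        int ret;

        do {
            /* memory is bounded by the window, not the input count */
            ret = read_window(ctx);
            if (ret == 0)
                ret = insert_merged(ctx);
        } while (ret == 0 && !inputs_drained(ctx));

        return ret;
    }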
Signed-off-by: Zach Brown <zab@versity.com>
After we've merged a log btree back into the main fs tree we kick off
work to free all its blocks. This would fully fill the transaction's
free blocks list before stopping to apply the commit.
Consuming the entire free list makes it hard to have concurrent holders
of a commit who also want to free things. This changes the log btree
block freeing to limit itself to a fraction of the budget that each
holder gets. That coarse limit avoids having to precisely account for
the allocations and frees made while modifying the freeing item, while
still freeing many blocks per commit.
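A rough sketch of the coarse limit; the names and the fraction are
illustrative assumptions, not the real accounting:

    /* each holder of a commit gets a budget of free list entries (assumed) */
    #define HOLDER_FREE_BUDGET  4096
    /* log btree block freeing stops after a fraction of that budget */
    #define LOG_FREE_FRACTION   4

    static inline unsigned int log_free_limit(void)
    {
        return HOLDER_FREE_BUDGET / LOG_FREE_FRACTION;
    }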
Signed-off-by: Zach Brown <zab@versity.com>
The fs log btrees have values that start with a header that stores the
item's seq and flags. There's a lot of sketchy code that manipulates
the value header as items are passed around.
This adds the seq and flags as core item fields in the btree. They're
only set by the interfaces that are used to store fs items: _insert_list
and _merge. The rest of the btree items that use the main interface
don't work with the fields.
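An illustrative before/after of the layout; the field names here are
assumptions based on the description:

    #include <stdint.h>

    /* before: every fs item value carried its own header */
    struct old_val_header {
        uint64_t seq;
        uint8_t  flags;
        /* value payload followed and had to be skipped everywhere */
    };

    /* after: seq and flags live in the item itself */
    struct btree_item_fields {
        uint64_t seq;   /* only set by _insert_list and _merge */
        uint8_t  flags;
        /* values are now pure payload */
    };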
This was done to help delta items discover when logged items have been
merged before the finalized log btrees are deleted, and the code ends up
being quite a bit cleaner.
Signed-off-by: Zach Brown <zab@versity.com>
Add a btree function for freeing all the blocks in a btree without
having to cow the blocks to track which refs have been freed. We use a
key from the caller to track which portions of the tree have been freed.
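A plausible shape for the call, with the caller's key acting as a
resumable cursor; the signature is an assumption:

    struct btree_root;
    struct btree_key;

    /*
     * Free blocks in the tree at and after *key, advancing *key as
     * progress is made so the caller can resume across transactions
     * without cowing blocks to record which refs were freed.
     */
    int btree_free_blocks(struct btree_root *root, struct btree_key *key,
                          unsigned int block_limit);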
Signed-off-by: Zach Brown <zab@versity.com>
Add a btree call to just dirty a leaf block, joining and splitting
along the way so that the blocks in the path satisfy the balance
constraints.
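An assumed sketch of the shape: descend to the leaf for the key,
splitting and joining on the way down, and return with the leaf dirty:

    struct btree_root;
    struct btree_key;

    int btree_dirty(struct btree_root *root, struct btree_key *key);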
Signed-off-by: Zach Brown <zab@versity.com>
Add a btree function for merging the items in a range from a number of
read-only input btrees into a destination btree.
Signed-off-by: Zach Brown <zab@versity.com>
Add calls for working with subtrees built around references to blocks in
the last level of parents. This will let the server farm out btree
merging work where concurrency is built around safely working with all
the items and leaves that fall under a given parent block.
Signed-off-by: Zach Brown <zab@versity.com>
Add a btree helper for finding the range of keys covered by the leaves
referenced by the last parent block when searching for a given key.
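The helper might look something like this; the name and arguments are
assumptions:

    struct btree_root;
    struct btree_key;

    /* walk to key and report the key range covered by the leaves
     * referenced from the final parent block on the path */
    int btree_parent_range(struct btree_root *root,
                           struct btree_key *key,
                           struct btree_key *start,
                           struct btree_key *end);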
Signed-off-by: Zach Brown <zab@versity.com>
Previously we'd avoided full extents in file data mapping items because
we were deleting items from forest btrees directly. That created
deletion items for every version of file extents as they were modified.
Now we have the item cache which can remove deleted items from memory
when deletion items aren't necessary.
By layering file data extents on an extent layer, we can also transition
allocators to use extents and fix a lot of problems in the radix block
allocator.
Most of this change is churn from changing allocator function and struct
names.
File data extents no longer have to manage loading and storing from and
to packed extent items at a fixed granularity. All those loops are torn
out and data operations now call the extent layer with their callbacks
instead of calling its packed item extent functions. This now means
that fallocate and especially restoring offline extents can use larger
extents. Small file block allocation now comes from a cached extent
which reduces item calls for small file data streaming writes.
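A sketch of the callback-based shape described above, with illustrative
names; data operations hand the extent core a table like this instead of
looping over packed items themselves:

    #include <stdint.h>

    struct extent {
        uint64_t start;
        uint64_t len;
    };

    /* assumed callback table given to the extent layer by data ops */
    struct extent_ops {
        int (*next)(void *arg, uint64_t start, struct extent *ext);
        int (*insert)(void *arg, struct extent *ext);
        int (*remove)(void *arg, struct extent *ext);
    };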
The big change in the server is to use more root structures to manage
recursive modification instead of relying on the allocator to notice and
do the right thing. The radix allocator tried to notice when it was
actively operating on a root that it was also using to allocate and free
metadata blocks. This resulted in a lot of bugs. Instead we now double
buffer the server's avail and freed roots so that the server fills and
drains the stable roots from the previous transaction. We also double
buffer the core fs metadata avail root so that we can increase the time
to reuse freed metadata blocks.
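A sketch of the double buffering, with assumed names: the server fills
and drains only the stable roots committed by the previous transaction
while the current transaction dirties the other copies:

    #include <stdint.h>

    struct alloc_root { uint64_t blkno; };  /* placeholder root ref */

    struct server_alloc {
        struct alloc_root avail[2];
        struct alloc_root freed[2];
        int stable;     /* index of the previous transaction's roots */
    };

    /* at commit the roots being modified become the next stable copies */
    static void promote_stable(struct server_alloc *sa)
    {
        sa->stable ^= 1;
    }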
The server now only moves free extents into client allocators when they
fall below a low threshold. This reduces the shared modification of the
client's allocator roots which requires cold block reads on both the
client and server.
Signed-off-by: Zach Brown <zab@versity.com>
Add btree calls to invoke a callback for all items in a leaf, and to
insert a list of items into their leaf blocks. These will be used by
the item cache to populate the cache and to write dirty items into dirty
btree blocks.
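Assumed shapes for the two calls; the names and arguments are
illustrative:

    struct btree_root;
    struct btree_key;
    struct item_list;

    typedef int (*item_cb_t)(struct btree_key *key, void *val,
                             unsigned int val_len, void *arg);

    /* call cb for every item in the leaf that contains key */
    int btree_read_items(struct btree_root *root, struct btree_key *key,
                         item_cb_t cb, void *arg);

    /* insert a sorted list of items into their dirty leaf blocks */
    int btree_insert_list(struct btree_root *root, struct item_list *items);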
Signed-off-by: Zach Brown <zab@versity.com>
There are no users of these variants of _prev and _next so they can be
removed. Support for them was also dropped in the previous reworking of
the internal structure of the btree blocks.
Signed-off-by: Zach Brown <zab@versity.com>
The btree currently uses variable length big-endian buffers that are
compared with memcmp() as keys. This is a historical relic of the time
when keys could be very large. We had dirent keys that included the
name and manifest entries that included those fs keys.
But now all the btree callers are jumping through hoops to translate
their fs keys into big-endian btree keys. And the memcmp() of the
keys is showing up in profiles.
This makes the btree take native scoutfs_key structs as its key. The
forest callers which are working with fs keys can just pass their keys
straight through. The server btree callers with their private btrees
get key fields defined for their use instead of having individual
big-endian key structs.
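The effect on comparison, sketched with illustrative field names: a
field-by-field compare of native structs replaces memcmp() on big-endian
buffers:

    #include <stdint.h>

    struct scoutfs_key_sketch {     /* fields are illustrative */
        uint64_t zone;
        uint64_t type;
        uint64_t first;
        uint64_t second;
    };

    static int key_cmp(const struct scoutfs_key_sketch *a,
                       const struct scoutfs_key_sketch *b)
    {
        /* compare the most significant fields first */
        if (a->zone != b->zone)
            return a->zone < b->zone ? -1 : 1;
        if (a->type != b->type)
            return a->type < b->type ? -1 : 1;
        if (a->first != b->first)
            return a->first < b->first ? -1 : 1;
        if (a->second != b->second)
            return a->second < b->second ? -1 : 1;
        return 0;
    }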
A nice side-effect of this is that splitting parents doesn't have to
assume that a maximal key will be inserted by a child split. We can
have more keys in parents and wider trees.
Signed-off-by: Zach Brown <zab@versity.com>
Convert metadata block and file data extent allocations to use the radix
allocator.
Most of this is simple transitions between types and calls. The server
no longer has to initialize blocks because mkfs can write a single
radix parent block with fully set parent refs to initialize a full
radix. We remove the code and fields that were responsible for adding
uninitialized data and metadata.
The rest of the unused block allocator code is only ifdefed out. It'll
be removed in a separate patch to reduce noise here.
Signed-off-by: Zach Brown <zab@versity.com>
Add a btree_update variant which will insert the item if a previous
item wasn't found, instead of returning -ENOENT. This saves callers from
having to do a lookup before updating to discover if they should call _create
or _update.
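The variant might look like this; the name and signature are
assumptions:

    struct btree_root;
    struct btree_key;
    struct btree_val;

    /* update the item at key, inserting it if it doesn't exist yet */
    int btree_force(struct btree_root *root, struct btree_key *key,
                    struct btree_val *val);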
Signed-off-by: Zach Brown <zab@versity.com>
Convert the btree to use our block cache, block allocation, and the
caller's explicit dirty block tracking writer context instead of the
ring. This is in preparation for the btree forest format where there
are concurrent multiple writers of independent dynamically sized btrees
instead of only the server writing one btree with a fixed maximum size.
All the machinery for tracking dirty blocks in the ring and migrating is
no longer needed.
Signed-off-by: Zach Brown <zab@versity.com>
Add a cow btree whose blocks are stored in a persistently allocated
ring. This will let us incrementally index very large data sets
efficiently.
This is an adaptation of the previous btree code which now uses the
ring, stores variable length keys, and augments the items with bits
that are ORed up through the parents.
Signed-off-by: Zach Brown <zab@versity.com>
Add a *found_seq argument to _prev so that it can give the caller the
seq of the item that's returned. The extent code is going to use this
to find seqs of extents.
Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
Add a _lte val boolean so that -EOVERFLOW is returned if the item's
value is larger than the value vector.
Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
We haven't yet had a pressing need for a search for the previous item
before a given key value. File extent items offer the first strong
candidate. We'd like to naturally store the start of the extent in the
key, so finding an extent that overlaps a block means finding the
previous key before the search block offset.
The _prev search is easy enough to implement. We have to teach tree
walking to maintain the prev key and leaf block processing to find the
correct item position after the binary search.
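A sketch of the leaf positioning, assuming _prev returns the last item
at or before the key and that the binary search reports the insertion
position when the key is missing:

    #include <stdbool.h>

    /*
     * pos is the binary search result: the matching item's position,
     * or the position the key would be inserted at. Returns the
     * position of the previous item, or -1 if the leaf has no item at
     * or before the key.
     */
    static int prev_item_pos(int pos, bool found)
    {
        if (found)
            return pos;     /* an exact match is the prev item */
        return pos - 1;     /* step back from the insertion point */
    }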
Signed-off-by: Zach Brown <zab@versity.com>
We can trivially check the value length against what the caller expects
in btree.c.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
Signed-off-by: Zach Brown <zab@versity.com>
The btree cursor was built to address two problems. First it
accelerates iteration by avoiding full descents down the tree by holding
on to leaf blocks. Second it lets callers reference item value contents
directly to avoid copies.
But it also has serious complexity costs. It pushes refcounting and
locking out to the caller. There have already been a few bugs where
callers did things while holding the cursor without realizing that they
were holding a btree lock and couldn't perform certain btree operations
or even copies to user space.
Future changes to the allocator to use the btree motivate cleaning up
the tree locking, which is complicated by the cursor being a standalone
lock reference. Instead of continuing to layer complexity onto this
construct let's remove it.
The iteration acceleration will be addressed the same way we're going to
accelerate the other btree operations: with per-cpu cached leaf block
references. Unlike the cursor this doesn't push interface changes out
to callers who want repeated btree calls to perform well.
We'll leave the value copying for now. If it becomes an issue we can
add variants that call a function to operate on the value. Let's hope
we don't have to go there.
This change replaces the cursor with a vector describing the memory
that the value should be copied to and from. The vector has a fixed
number of elements and is wrapped in a struct for easy declaration and
initialization.
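A sketch of the value struct, with assumed names and a fixed element
count:

    #include <stddef.h>

    #define BTREE_VAL_NR 2      /* fixed number of vector elements */

    struct btree_val_vec {
        void   *ptr;
        size_t  len;
    };

    struct btree_val {
        struct btree_val_vec vec[BTREE_VAL_NR];
    };

    /* easy declaration and initialization at the call site */
    #define INIT_BTREE_VAL(p, l) \
            { .vec = { { .ptr = (p), .len = (l) } } }

A caller might then declare "struct btree_val val = INIT_BTREE_VAL(buf,
sizeof(buf));" and check the returned count of copied value bytes.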
This change to the interface looks noisy but each caller's change is
pretty mechanical. They tend to involve:
- replace the cursor with the value struct and initialization
- allocate some memory to copy the value in to
- reading functions return the number of value bytes copied
- verify that the copied bytes make sense for the item being read
- getting rid of confusing ((ret = _next())) looping
- _next now returns -ENOENT instead of 0 for no next item
- _next iterators now need to increase the key themselves
- make sure to free allocated mem
Sometimes the order of operations changes significantly. Now that we
can't modify in place we need to read, modify, write. This looks like
changing a modification of the item through the cursor to a
lookup/update pattern.
The symlink item iterators didn't need to use next because they walk a
contiguous set of keys. They're changed to use simple insert or lookup.
Signed-off-by: Zach Brown <zab@versity.com>
The btree functions currently don't take a specific root argument. They
assume, deep down in btree_walk, that there's only one btree in the
system. We're going to be adding a few more to support richer
allocation.
To prepare for this we have the btree functions take an explicit btree
root argument. This should make no functional difference.
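Sketched with assumed names: every entry point now names the tree it
operates on instead of btree_walk finding the single global root itself:

    struct btree_root;
    struct btree_key;
    struct btree_val;

    /* the root argument is new; the rest of the signature is unchanged */
    int btree_lookup(struct btree_root *root, struct btree_key *key,
                     struct btree_val *val);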
Signed-off-by: Zach Brown <zab@versity.com>
We can certainly have btree update callers that haven't yet dirtied the
blocks but who can deal with errors. So make it return errors and have
its only current caller freak out if it fails. This will let the file
data block mapping code attempt to get a dirty item without first
dirtying.
Signed-off-by: Zach Brown <zab@versity.com>
Now that we are using fixed smaller blocks we can make the btree format
significantly simpler. The fixed small block size limits the number of
items that will be stored in each block. We can use a simple sorted
array of item offsets to maintain the item sort order instead of
the treap.
Getting rid of the treap not only removes a bunch of code, it makes
tasks like verifying or repairing a btree block a lot simpler.
The main impact on the code is that an item no longer records its
position in the sort order. Users of sorted item position now need to
track an item's sorted position instead of just the item.
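A sketch of the resulting lookup, with assumed names: a binary search
over the sorted offset array that returns either the matching position
or the insertion position that keeps the array sorted:

    #include <stdbool.h>
    #include <stdint.h>

    struct btree_key;

    struct btree_block {
        uint16_t nr_items;
        uint16_t item_off[];    /* sorted by the items' keys */
    };

    /* compare the key of the item at off with key (assumed helper) */
    int cmp_item_key(struct btree_block *bt, uint16_t off,
                     const struct btree_key *key);

    static int find_pos(struct btree_block *bt,
                        const struct btree_key *key, bool *found)
    {
        int lo = 0, hi = bt->nr_items - 1, mid, cmp;

        *found = false;
        while (lo <= hi) {
            mid = lo + (hi - lo) / 2;
            cmp = cmp_item_key(bt, bt->item_off[mid], key);
            if (cmp == 0) {
                *found = true;
                return mid;
            }
            if (cmp < 0)
                lo = mid + 1;
            else
                hi = mid - 1;
        }
        return lo;      /* insertion position */
    }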
Signed-off-by: Zach Brown <zab@versity.com>
Now that we have a fixed small block size we don't need our own code for
tracking contiguous memory for blocks that are larger than the page
size. We can use buffer heads which support block sizes smaller than
the page size.
Our block API remains to enforce transactions, checksumming, cow, and
eventually invalidating and retrying reads of stale blocks.
We set the logical blocksize of the bdev buffer cache to our fixed block
size. We use a private bh state bit to indicate that the contents of a
block have had their checksum verified. We use a small structure stored
at b_private to track dirty blocks so that we can control when they're
written.
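The private bit follows the kernel's BUFFER_FNS convention; the names
here are assumptions:

    #include <linux/buffer_head.h>
    #include <linux/list.h>

    /* private bh state bit recording a verified checksum (assumed name) */
    enum {
        BH_ScoutfsVerified = BH_PrivateStart,
    };
    BUFFER_FNS(ScoutfsVerified, scoutfs_verified)

    /* hung off b_private so we control when dirty blocks are written */
    struct block_private {
        struct list_head dirty_entry;   /* on the writer's dirty list */
    };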
The btree block traversal code uses the buffer_head lock to serialize
access to btree block contents now that the block rwsem has gone
away. This isn't great but works for now.
Not being able to relocate blocks in the buffer cache (really fragments
of pages in the bdev page cache; the blkno determines the memory
location) means that the cow path always has to copy.
Callers are easily translated: use struct buffer_head instead of
scoutfs_block and use a little helper instead of dereferencing ->data
directly.
I took the opportunity to clean up some of the inconsistent block
function names. Now more of the functions follow the scoutfs_block_*()
pattern.
Signed-off-by: Zach Brown <zab@versity.com>
Add the ioctl that lets us find out about inodes that have changed
since a given sequence number.
A sequence number is added to the btree items so that we can track the
tree update in which each item last changed. We update this as we modify
items and maintain it across item copying for splits and merges.
The big change is using the parent item ref and item sequence numbers
to guide iteration over items in the tree. The easier change is to have
the current iteration skip over items whose sequence number is too old.
The more subtle change has to do with how iteration is terminated. The
current termination could stop when it doesn't find an item because that
could only happen at the final leaf. When we're ignoring items with old
seqs this can happen at the end of any leaf. So we change iteration to
keep advancing through leaf blocks until it crosses the last key value.
We add an argument to btree walking which communicates the next key that
can be used to continue iterating from the next leaf block. This works
for the normal walk case as well as the seq walking case where walking
terminates prematurely in an interior node full of parent items with old
seqs.
Now that we're more robustly advancing iteration with btree walk calls
and the next key we can get rid of the 'next_leaf' hack which was trying
to do the same thing inside the btree walk code. It wasn't right for
the seq walking case and was pretty fiddly.
The next_key increment could wrap past the maximal key at the right
spine of the tree, so we have _inc saturate instead of wrap.
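A sketch of the saturating increment over the big-endian key buffers
used at this point; the name is assumed:

    #include <stdint.h>
    #include <string.h>

    /* increment a big-endian key, saturating at all-ones instead of
     * wrapping around to zero */
    static void key_inc_sat(uint8_t *key, int len)
    {
        int i;

        for (i = len - 1; i >= 0; i--) {
            if (++key[i] != 0)
                return;         /* no carry out of this byte */
        }
        /* carried off the top: restore the maximal key */
        memset(key, 0xff, len);
    }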
And finally, we want these inode scans to not have to skip over all the
other items associated with each inode as it walks looking for inodes
with the given sequence number. We change the item sort order to first
sort by type instead of by inode. We've wanted this more generally to
isolate item types that have different access patterns.
Signed-off-by: Zach Brown <zab@versity.com>
Directory entries found a hole in the key range between the first and
last possible hash value for a new entry's key. The xattrs want
to do the same thing so let's extract this into a proper function.
Signed-off-by: Zach Brown <zab@versity.com>
Previously we had stubbed out the btree item API with static inlines.
Those are replaced with real functions in a reasonably functional btree
implementation.
The btree implementation itself is pretty straightforward. Operations
are performed top-down and we dirty, lock, and split/merge blocks as we
go. Callers are given a cursor to give them full access to the item.
Items in the btree blocks are stored in a treap. There are a lot of
comments in the code to help make things clear.
We add the notion of block references and some block functions for
reading and dirtying blocks by reference.
This passes tests up to the point where unmount tries to write out data
and the world catches fire. That's far enough to commit what we have
and iterate from there.
Signed-off-by: Zach Brown <zab@versity.com>
Starting to implement LSM merging made me really question if it is the
right approach. I'd like to try an experiment to see if we can get our
concurrent writes done with much simpler btrees.
This commit removes all the functionality that derives from the large
LSM segments and distributing the manifest.
What's left is a multi-page block layer and the husk of the btree
implementation which will give people access to items. Callers that
work with items get translated to the btree interface.
This gets as far as reading the super block but the format changes and
large block size mean that the crc check fails and the mount returns an
error.
Signed-off-by: Zach Brown <zab@versity.com>