Commit Graph

24 Commits

Author SHA1 Message Date
Zach Brown
ef2daf8857 Make data preallocation tunable
Add mount options for the size of preallocation and whether or not it
should be restricted to extending writes.  Disabling the default
restriction to extending (streaming) writes lets it preallocate in
aligned regions of the preallocation size when they contain no extents.
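
As a rough illustration (not the actual scoutfs code; the option names,
fields, and helper below are assumptions), the decision might look
something like this:

#include <linux/types.h>

/*
 * Hypothetical sketch of a preallocation decision driven by two mount
 * options; all names and fields are illustrative.
 */
struct prealloc_opts {
        u64 size;               /* preallocation size in blocks */
        bool extend_only;       /* restrict preallocation to extending writes */
};

static u64 prealloc_blocks(struct prealloc_opts *opts, u64 iblock,
                           u64 size_blocks, bool region_empty)
{
        /* default: only preallocate for writes extending past i_size */
        if (opts->extend_only)
                return iblock >= size_blocks ? opts->size : 1;

        /* otherwise fill the aligned prealloc-sized region if it's empty */
        if (region_empty)
                return opts->size - (iblock % opts->size);

        return 1;
}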

Signed-off-by: Zach Brown <zab@versity.com>
2022-10-14 14:03:35 -07:00
Zach Brown
233fbb39f3 Limit alloc_move per-call allocator consumption
Recently scoutfs_alloc_move() was changed to try to limit the number of
metadata blocks it could allocate or free.  The intent was to stop
concurrent holders of a transaction from fully consuming the available
allocator for the transaction.

The limiting logic was a bit off.  It stopped when the allocator had the
caller's limit remaining, not when it had consumed the caller's limit.
This is overly permissive and could still allow concurrent callers to
consume the allocator.  It was also triggering warning messages when a
call consumed more than its allowed budget while holding a transaction.

Unfortunately, we don't have per-caller tracking of allocator resource
consumption.  The best we can do is sample the allocators as we start
and return if they drop by the caller's limit.  This is overly
conservative in that it charges any consumption by concurrent callers
to every caller.

This isn't perfect but it makes the failure case less likely and the
impact shouldn't be significant.  We don't often have a lot of
concurrency and the limits are larger than callers will typically
consume.
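
A tiny sketch of the difference; the names and the sampled counter are
assumptions, not the real interfaces:

#include <linux/types.h>

/* old: stop while 'limit' blocks remain, regardless of what we consumed */
static bool stop_moving_old(u64 avail_now, u64 limit)
{
        return avail_now <= limit;
}

/* new: stop once our sampled consumption since entry reaches 'limit' */
static bool stop_moving_new(u64 avail_at_entry, u64 avail_now, u64 limit)
{
        return avail_at_entry - avail_now >= limit;
}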

Signed-off-by: Zach Brown <zab@versity.com>
2022-07-29 11:25:01 -07:00
Zach Brown
198d3cda32 Add scoutfs_alloc_meta_low_since()
Add scoutfs_alloc_meta_low_since() to test if the metadata avail or
freed resources have been used by a given amount since a previous
snapshot.
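
A minimal sketch of such a test, assuming a caller-held snapshot of the
avail and freed counters (the real structure and signature may differ):

#include <linux/types.h>

struct meta_snap {
        u64 avail;
        u64 freed;
};

/* true once avail or freed has dropped by at least 'nr' since the snapshot */
static bool meta_low_since(struct meta_snap *snap, u64 avail_now,
                           u64 freed_now, u64 nr)
{
        return (snap->avail - avail_now >= nr) ||
               (snap->freed - freed_now >= nr);
}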

Signed-off-by: Zach Brown <zab@versity.com>
2022-07-29 11:24:10 -07:00
Zach Brown
0d4bf83da3 Reclaim log_trees alloc roots in multiple commits
Client log_trees allocator btrees can build up quite a number of
extents.  In the right circumstances, moving fragmented extents can
require dirtying a large number of paths to leaf blocks in the core
allocator btrees.  It might not be possible to dirty all the blocks
necessary to move all the extents in one commit.

This reworks the extent motion so that it can be performed in multiple
commits if the meta allocator for the commit runs out while it is moving
extents.  It's a minimal fix with as little disruption to the ordering
of commits and locking as possible.  It simply bubbles up an error when
the allocators run out and retries functions that can already be retried
in other circumstances.
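
A rough sketch of the shape of that retry; all of the structures and
helpers below are placeholders for the real server code:

#include <linux/errno.h>
#include <linux/types.h>

struct server_info;
struct alloc_root;

/* placeholder helpers standing in for the real server and allocator calls */
int move_extents(struct server_info *server, struct alloc_root *root);
int commit_transaction(struct server_info *server);
bool alloc_root_is_empty(struct alloc_root *root);

static int reclaim_alloc_root(struct server_info *server, struct alloc_root *root)
{
        int ret;

        while (!alloc_root_is_empty(root)) {
                ret = move_extents(server, root);
                if (ret == -ENOSPC)
                        ret = 0;        /* allocator ran out, finish next commit */
                if (ret < 0)
                        return ret;

                ret = commit_transaction(server);
                if (ret < 0)
                        return ret;
        }

        return 0;
}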

Signed-off-by: Zach Brown <zab@versity.com>
2022-06-08 11:53:53 -07:00
Zach Brown
96ad8dd510 Add scoutfs_alloc_meta_remaining
Add a helper function that gives the caller the number of blocks remaining in
the first list block that's used for meta allocation and freeing.
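
The idea, sketched with made-up field names (the real list block format
differs):

#include <linux/kernel.h>
#include <linux/types.h>

/*
 * Blocks remaining in the first list block is the smaller of the entries
 * still available to allocate and the slots left for recording frees.
 */
static u32 meta_remaining(u32 avail_entries, u32 freed_entries, u32 max_entries)
{
        return min(avail_entries, max_entries - freed_entries);
}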

Signed-off-by: Zach Brown <zab@versity.com>
2022-04-01 15:21:44 -07:00
Zach Brown
a53d6d1a8e Add scoutfs_alloc_foreach_super which takes super
Add an alloc_foreach variant which uses the caller's super to walk the
allocators rather than always reading it off the device.

Signed-off-by: Zach Brown <zab@versity.com>
2021-10-28 12:30:47 -07:00
Zach Brown
6d0694f1b0 Add resize_devices ioctl and scoutfs command
Add a scoutfs command that uses an ioctl to send a request to the server
to safely use a device that has grown.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-30 13:26:32 -07:00
Zach Brown
73bf916182 Return ENOSPC as space gets low
Returning ENOSPC is challenging because we have clients working on
allocators which are a fraction of the whole, and we use COW transactions
so we need to be able to allocate in order to free.  This adds support
for returning ENOSPC to client POSIX allocators as free space gets low.

For metadata, we reserve a number of free blocks for making progress
with client and server transactions which can free space.  The server
sets the low flag in a client's allocator if we start to dip into
reserved blocks.  In the client we add an argument to entering a
transaction which indicates if we're allocating new space (as opposed to
just modifying existing data or freeing).  When an allocating
transaction runs low and the server low flag is set then we return
ENOSPC.
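
A hedged sketch of that client-side decision; the flag and helper names
here are assumptions, not the real transaction code:

#include <linux/errno.h>
#include <linux/types.h>

struct client_info;

/* placeholder helpers for the sketch */
bool meta_alloc_low(struct client_info *client);
bool low_flag_set(struct client_info *client);
int acquire_hold(struct client_info *client);
void commit_and_wait(struct client_info *client);

/*
 * Only allocating holders see the hard -ENOSPC, and only once the local
 * allocator is low and the server has set its low flag; otherwise we
 * force a commit to ask the server for a refill.
 */
static int enter_transaction(struct client_info *client, bool allocating)
{
        for (;;) {
                if (!meta_alloc_low(client))
                        return acquire_hold(client);

                if (allocating && low_flag_set(client))
                        return -ENOSPC;

                commit_and_wait(client);        /* wait for a commit to cycle */
        }
}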

Adding an argument to transaction holders and having it return ENOSPC
gave us the opportunity to clean it up and make it a little clearer.
More work is done outside the wait_event function and it now
specifically waits for a transaction to cycle when it forces a commit
rather than spinning until the transaction worker acquires the lock and
stops it.

For data the same pattern applies except there are no reserved blocks
and we don't COW data so it's a simple case of returning the hard ENOSPC
when the data allocator flag is set.

The server needs to consider the reserved count when refilling the
client's meta_avail allocator and when swapping between the two
meta_avail and meta_free allocators.

We add the reserved metadata block count to statfs_more so that df can
subtract it from the free meta blocks and make it clear when ENOSPC is
going to be returned for metadata allocations.

We increase the minimum device size in mkfs so that small testing
devices provide sufficient reserved blocks.

And finally we add a little test that makes sure we can fill both
metadata and data to ENOSPC and then recover by deleting what we filled.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-07 14:13:14 -07:00
Zach Brown
9c2122f7de Add server btree merge processing
This adds the server processing side of the btree merge functionality.
The client isn't yet sending the log_merge messages so no merging will
be performed.

The bulk of the work happens as the server processes a get_log_merge
message to build a merge request for the client.  It starts a log merge
if one isn't in flight.  If one is in flight it checks to see if it
should be spliced and maybe finished.  In the common case it finds the
next range to be merged and sends the request to the client to process.
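
A loose sketch of that decision flow; every structure and helper below
is a placeholder rather than the real handler:

#include <linux/err.h>
#include <linux/types.h>

struct server_info;
struct merge_request;
struct log_merge;

/* placeholder helpers for the sketch */
struct log_merge *current_log_merge(struct server_info *server);
struct log_merge *start_log_merge(struct server_info *server);
bool merge_ranges_done(struct log_merge *merge);
int splice_and_finish(struct server_info *server, struct log_merge *merge);
int fill_next_range(struct log_merge *merge, struct merge_request *req);

static int server_get_log_merge(struct server_info *server,
                                struct merge_request *req)
{
        struct log_merge *merge = current_log_merge(server);

        if (!merge) {
                /* nothing in flight, start a new log merge */
                merge = start_log_merge(server);
                if (IS_ERR(merge))
                        return PTR_ERR(merge);
        } else if (merge_ranges_done(merge)) {
                /* all ranges merged: splice the results and maybe finish */
                return splice_and_finish(server, merge);
        }

        /* common case: hand the next unmerged range to the client */
        return fill_next_range(merge, req);
}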

The commit_log_merge handler is the completion side of that request.  If
the request failed then we unwind its resources based on the stored
request item.  If it succeeds we record it in an item for get_log_merge
processing to splice eventually.

Then we modify two existing server code paths.

First, get_log_tree doesn't just create or use a single existing log
btree for a client mount.  If the existing log btree is large enough it
sets its finalized flag and advances the nr to use a new log btree.
That makes the old finalized log btree available for merging.

Then we need to be a bit more careful when reclaiming the open log btree
for a client.  We can't use next to find the only open log btree; instead
we use prev to find the last one and make sure that it isn't already
finalized.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
52c2a465db Add zone awareness to scoutfs_alloc_move()
Add parameters so that scoutfs_alloc_move() can first search for source
extents in specified zones.  It uses relatively cheap searches through
the order items to find extents that intersect with the regions
described by the zone bitmaps.
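
A minimal sketch of the search order, with the zone parameters and
helpers as assumptions:

#include <linux/errno.h>
#include <linux/types.h>

struct alloc_root;
struct free_extent;

/* placeholder order-item searches for the sketch */
int largest_extent_in_zones(struct alloc_root *src, unsigned long *zone_bits,
                            u64 zone_blocks, struct free_extent *ext);
int largest_extent(struct alloc_root *src, struct free_extent *ext);

/* prefer an extent intersecting the caller's zones, else take any extent */
static int find_move_source(struct alloc_root *src, unsigned long *zone_bits,
                            u64 zone_blocks, struct free_extent *ext)
{
        int ret;

        ret = largest_extent_in_zones(src, zone_bits, zone_blocks, ext);
        if (ret != -ENOENT)
                return ret;

        return largest_extent(src, ext);
}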

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-21 15:31:02 -07:00
Zach Brown
bc4975fad4 Add scoutfs_alloc_extents_cb()
Add an allocator call that invokes a callback for each of the extents
stored in btree items in an allocator root.
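
The callback shape might look roughly like this; only the idea of one
call per free extent comes from the commit, the prototype itself is an
assumption:

#include <linux/types.h>

/* one call per free extent found in the allocator root's btree items */
typedef int (*alloc_extent_cb_t)(u64 start, u64 len, void *arg);

struct extent_total {
        u64 blocks;
};

static int total_extents(u64 start, u64 len, void *arg)
{
        struct extent_total *tot = arg;

        tot->blocks += len;     /* e.g. sum free space across extents */
        return 0;               /* returning non-zero would stop the walk */
}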

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-21 15:31:02 -07:00
Zach Brown
c470c1c9f6 Allow read-mostly _alloc_meta_low
Each transaction hold makes multiple calls to _alloc_meta_low to see if
the transaction should be committed to refill allocators before the
caller's hold is acquired and they can dirty blocks in the transaction.

_alloc_meta_low was using a spinlock to sample the allocator list_head
blocks to determine if there was space available.  The lock and unlock
stores were creating significant cacheline contention.

The _alloc_meta_low calls are higher frequency than allocations.  We can
use a seqlock to have exclusive writers and allow concurrent
_alloc_meta_low readers who retry if a writer intervenes.
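
A minimal sketch of the pattern using the kernel's seqlock API; the
counter fields stand in for the allocator's list block state rather
than the real scoutfs structures:

#include <linux/seqlock.h>
#include <linux/types.h>

struct meta_counts {
        seqlock_t seqlock;
        u64 avail;      /* blocks available to allocate */
        u64 freed;      /* room left to record frees */
};

/* read-mostly path: no stores, so readers don't bounce the cacheline */
static bool meta_low(struct meta_counts *mc, u64 nr)
{
        unsigned int seq;
        u64 avail, freed;

        do {
                seq = read_seqbegin(&mc->seqlock);
                avail = mc->avail;
                freed = mc->freed;
        } while (read_seqretry(&mc->seqlock, seq));     /* writer intervened */

        return avail < nr || freed < nr;
}

/* writers were already exclusive; the seqlock just versions their updates */
static void meta_update(struct meta_counts *mc, u64 avail, u64 freed)
{
        write_seqlock(&mc->seqlock);
        mc->avail = avail;
        mc->freed = freed;
        write_sequnlock(&mc->seqlock);
}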

Signed-off-by: Zach Brown <zab@versity.com>
2021-03-04 11:39:30 -08:00
Zach Brown
fc003a5038 Consistently sample data alloc total_len
With many concurrent writers we were seeing excessive commits forced
because the transaction thought the data allocator was running low.  The
transaction was checking the raw total_len value in the data_avail
alloc_root for the number of free data blocks.  But this read wasn't
locked, and allocators could completely remove a large free extent and
then re-insert a slightly smaller free extent as they perform their
allocation.  The transaction could see a temporarily very small total_len
and trigger a commit.

Data allocations are serialized by a heavy mutex, so we don't want to
have the reader try to use that to see a consistent total_len.  Instead
we create a data allocator run-time struct that has a consistent
total_len that is updated after all the extent items are manipulated.
This also gives us a place to put the caller's cached extent so that it
can be included in the total_len; previously it wasn't included in the
free total that the transaction saw.
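
A sketch of that run-time struct with assumed names; the property that
matters is that total_len is only written after the extent items and
cached extent have settled, so the unlocked reader always sees a
consistent count:

#include <linux/compiler.h>
#include <linux/mutex.h>
#include <linux/types.h>

struct alloc_root;              /* persistent free extent items, placeholder */

struct free_extent {            /* assumed shape of the cached extent */
        u64 start;
        u64 len;
};

struct data_alloc {
        struct mutex mutex;     /* serializes data allocations */
        struct alloc_root *root;
        struct free_extent cached;
        u64 total_len;          /* root + cached, updated last */
};

/* transaction code can sample this without taking the heavy mutex */
static u64 data_alloc_free_blocks(struct data_alloc *da)
{
        return READ_ONCE(da->total_len);
}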

The file data allocator can then initialize and use this struct instead
of its raw use of the root and cached extent.  Then the transaction can
sample its consistent total_len that reflects the root and cached
extent.

A subtle detail is that fallocate can't use _free_data to return an
allocated extent on error to the avail pool.  It instead frees into the
data_free pool like normal frees.  It doesn't really matter that this
could prematurely drain the avail pool because it's in an error path.

Signed-off-by: Zach Brown <zab@versity.com>
2021-01-06 09:25:32 -08:00
Zach Brown
9375b9d3b7 scoutfs: commit while enough meta for dirty items
Dirty items in a client transaction are stored in OS pages.  When the
transaction is committed each item is stored in its position in a dirty
btree block in the client's existing log btree.  Allocators are refilled
between transaction commits so a given commit must have sufficient meta
allocator space (avail blocks and unused freed entries) for all the
btree blocks that are dirtied.

The number of btree blocks that are written, thus the number of cow
allocations and frees, depends on the number of blocks in the log btree
and the distribution of dirty items amongst those blocks.  In a typical
load items will be near each other and many dirty items in smaller
kernel pages will be stored in fewer larger btree blocks.

But with the right circumstances, the ratio of dirty pages to dirty
blocks can be much smaller.  With a very large directory and random
entry renames you can easily have 1 btree block dirtied for every page
of dirty items.

Our existing meta allocator fill targets and the number of dirty item
cache pages we allowed did not properly take this into account.  It was
possible (and, it turned out, relatively easy to test for with a huge
directory and random renames) to run out of meta avail blocks while
storing dirty items in dirtied btree blocks.

This rebalances our targets and thresholds to make it more likely that
we'll have enough allocator resources to commit dirty items.  Instead of
having an arbitrary limit on the number of dirty item cache pages, we
require that a given number of dirty item cache pages have a given
number of allocator blocks available.
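
In other words, something like the following, where the ratio is an
illustrative constant rather than the one the code actually uses:

#include <linux/types.h>

#define BLOCKS_PER_DIRTY_PAGE   2       /* assumed ratio for the sketch */

/* commit once the meta allocator can no longer cover the dirty pages */
static bool must_commit(u64 dirty_item_pages, u64 meta_blocks_avail)
{
        return meta_blocks_avail < dirty_item_pages * BLOCKS_PER_DIRTY_PAGE;
}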

We require a decent number of available blocks for each dirty page, so
we increase the server's target number of blocks to give the client so
that it can still build large transactions.

This code is conservative and should not be a problem in practice, but
it's theoretically possible to build a log btree and set of dirty items
that would dirty more blocks than this code assumes.  We will probably
revisit this as we add proper support for ENOSPC.

Signed-off-by: Zach Brown <zab@versity.com>
2020-12-02 09:25:13 -08:00
Zach Brown
a5d9ac5514 scoutfs: rework scoutfs_alloc_meta_low, takes arg
Previously, scoutfs_alloc_meta_lo_thresh() returned true when a small
static number of metadata blocks were either available to allocate or
had space for freeing.  This didn't make a lot of sense as the correct
number depends on how many allocations each caller will make during
their atomic transaction.

Rework the call to take an argument for the number of avail or freed
blocks available to test.  This first pass just uses the existing
number; we'll get to the callers later.

Signed-off-by: Zach Brown <zab@versity.com>
2020-12-02 09:25:13 -08:00
Zach Brown
fb66372988 scoutfs: add alloc foreach cb iterator
Add an alloc call which reads all the persistent allocators and calls a
callback for each.  This is going to be used to calculate free blocks
in clients for df, and in an ioctl to give a more detailed view of
allocators.
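
A hypothetical df-style caller might just total what the callback hands
it; the callback arguments here are assumptions:

#include <linux/types.h>

struct df_totals {
        u64 meta_free;
        u64 data_free;
};

/* called once per persistent allocator with its free block count */
static int add_alloc_free(void *arg, bool meta, u64 free_blocks)
{
        struct df_totals *df = arg;

        if (meta)
                df->meta_free += free_blocks;
        else
                df->data_free += free_blocks;
        return 0;
}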

Signed-off-by: Zach Brown <zab@versity.com>
2020-10-26 15:19:03 -07:00
Zach Brown
e60f4e7082 scoutfs: use full extents for data and alloc
Previously we'd avoided full extents in file data mapping items because
we were deleting items from forest btrees directly.  That created
deletion items for every version of file extents as they were modified.
Now we have the item cache which can remove deleted items from memory
when deletion items aren't necessary.

By layering file data extents on an extent layer, we can also transition
allocators to use extents and fix a lot of problems in the radix block
allocator.

Most of this change is churn from changing allocator function and struct
names.

File data extents no longer have to manage loading and storing from and
to packed extent items at a fixed granularity.  All those loops are torn
out and data operations now call the extent layer with their callbacks
instead of calling its packed item extent functions.  This now means
that fallocate and especially restoring offline extents can use larger
extents.  Small file block allocation now comes from a cached extent
which reduces item calls for small file data streaming writes.

The big change in the server is to use more root structures to manage
recursive modification instead of relying on the allocator to notice and
do the right thing.  The radix allocator tried to notice when it was
actively operating on a root that it was also using to allocate and free
metadata blocks.  This resulted in a lot of bugs.  Instead we now double
buffer the server's avail and freed roots so that the server fills and
drains the stable roots from the previous transaction.  We also double
buffer the core fs metadata avail root so that we can increase the time
to reuse freed metadata blocks.

The server now only moves free extents into client allocators when they
fall below a low threshold.  This reduces the shared modification of the
client's allocator roots which requires cold block reads on both the
client and server.

Signed-off-by: Zach Brown <zab@versity.com>
2020-10-26 15:19:03 -07:00
Zach Brown
8f946aa478 scoutfs: add btree item extent allocator
Add an allocator which uses btree items to store extents.  Both the
client and server will use this for btree blocks, the client will use it
for srch blocks and data extents, and the server will move extents
between the core fs allocator btree roots and the clients' roots.

Signed-off-by: Zach Brown <zab@versity.com>
2020-10-26 15:19:03 -07:00
Zach Brown
1b3645db8b scoutfs: remove dead server allocator code
Remove the bitmap segno allocator code that the server used to use to
manage allocations.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
ff5a094833 scoutfs: store allocator regions in btree
Convert the segment allocator to store its free region bitmaps in the
btree.

This is a very straightforward mechanical transformation.  We split the
allocator region into a big-endian index key and the bitmap value
payload.  We're careful to operate on aligned copies of the bitmaps so
that they're long aligned.
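
Roughly the layout and the aligned-copy step described above, with
illustrative names and sizes:

#include <linux/string.h>
#include <linux/types.h>

#define REGION_BITS     4096    /* illustrative bits per region */

/* the region index sorts as a btree key, so it's stored big-endian */
struct region_key {
        __be64 index;
};

/* item values aren't guaranteed to be long-aligned, so copy before bitops */
static void load_region_bitmap(const void *value, unsigned long *aligned)
{
        memcpy(aligned, value, REGION_BITS / 8);
}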

We can remove all the funky functions that were needed when writing the
ring.  All we're left with is a call to apply the pending allocations to
dirty btree blocks before writing the btree.

Signed-off-by: Zach Brown <zab@versity.com>
2017-07-08 10:59:40 -07:00
Zach Brown
5e0e9ac12e Move to much simpler manifest/alloc storage
Using the treap to be able to incrementally read and write the manifest
and allocation storage from all nodes wasn't quite ready for prime time.
The biggest problem is that invalidating cached nodes which are the
target of native pointers, either for consistency or memory pressure, is
problematic.  This was getting in the way of adding shared support as
readers and writers try to use as much of their treap caches as they
can.  There were other serious problems that we'd run into eventually:
memory pressure from duplicate caching in native nodes and the page
cache, small IOs from reading a page at a time, the risk of
pathologically imbalanced treaps, and the ring being corrupted if the
migration balancing doesn't work (the model assumed you could always
dirty an individual node in a transaction, but you actually have to
dirty all of its parents in each new transaction).

Let's back off to a much simpler mechanism while we build the rest of
the system around it.  We can revisit aggressively optimizing this when
it's our worst problem.

We'll store the indexes that the manifest server needs in simple
preallocated rings with log entries.   The server has to read the index
in its entirety into a native rbtree before it can work on it.  We won't
access the physical ring from mounts anymore, they'll send messages to
the server.

The ring callers are now working with a pinned tree in memory so the
interface can be a bit simpler.  By storing the indexes in their own
rings, the code and write path become a lot simpler: we have an IO
submission path for each index instead of "dirtying" calls per index and
then a writing call.

All this is much more robust and much less likely to get in our way as
we stand up the rest of the system around it.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:51:10 -07:00
Zach Brown
6516ce7d57 Report free blocks in statfs
Our statfs callback was still using the old buddy allocator.

We add a free segments field to the super and have it track the number
of free segments in the allocator.  We then use that to calculate the
number of free blocks for statfs.
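
The arithmetic amounts to something like this; the blocks-per-segment
constant and field names in the super are assumptions:

#include <linux/statfs.h>
#include <linux/types.h>

#define SEGMENT_BLOCKS  2048    /* illustrative blocks per segment */

/* translate the super's free segment count into statfs free blocks */
static void fill_free_blocks(struct kstatfs *kst, u64 free_segs, u64 total_blocks)
{
        kst->f_blocks = total_blocks;
        kst->f_bfree = free_segs * SEGMENT_BLOCKS;
        kst->f_bavail = kst->f_bfree;
}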

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
db9f2be728 Switch to indexed manifest using treap ring
The first pass manifest and allocator storage used a simple ring log
that was entirely replayed into memory to be used.  That risked the
manifest being too large to fit in memory, especially with large keys
and large volumes.

So we move to using an indexed persistent structure that can be read on
demand and cached.  We use a treap of byte-referenced nodes stored in a
circular ring.

The code interface is modeled a bit on the in-memory rbtree interface,
except that we can get IO errors and have to manage allocation, so we
return data pointers to the item payload instead of item structs and we
can return errors.

The manifest and allocator are converted over and the old ring code is
removed entirely.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
c4954eb6f4 Add initial LSM write implementation
Add all the core structural components to be able to modify metadata.  We
modify items in fs write operations, track dirty items in the cache,
allocate free segment block regions, stream dirty items into segments,
write out the segments, update the manifest to reference the written
segments, and write out a new ring that has the new manifest.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:42:30 -07:00