Commit Graph

43 Commits

Author SHA1 Message Date
Auke Kok
5e85f11e82 RIP bd_inode.
v6.9-rc4-29-g203c1ce0bb06 removes bd_inode. The canonical replacement is
bd_mapping->host, where applicable. We also have one use where we need the
mapping itself rather than the inode.
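
A minimal compat sketch of the replacement pattern; the detection macro
here is hypothetical, not what this patch actually uses:

    #include <linux/blkdev.h>

    /* HAVE_BD_MAPPING is a made-up stand-in for the real build-time check. */
    static inline struct inode *scoutfs_bdev_inode(struct block_device *bdev)
    {
    #ifdef HAVE_BD_MAPPING
        return bdev->bd_mapping->host;  /* bd_inode is gone */
    #else
        return bdev->bd_inode;
    #endif
    }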

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-05-08 18:07:09 -04:00
Auke Kok
8885486bc8 Add several low level includes.
Newer kernels pull in fewer header dependencies by default, so we have
to add these explicitly.
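
Purely as illustration, the kind of previously-implicit includes this
refers to (the exact list depends on what each file actually uses):

    #include <linux/slab.h>
    #include <linux/sched.h>
    #include <linux/uio.h>
    #include <linux/pagemap.h>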

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Zach Brown
01c8bba56d Merge pull request #109 from versity/zab/server_statfs_stable_blocks
Zab/server statfs stable blocks
2023-01-12 09:58:48 -08:00
Zach Brown
c1bd7bcce5 Allow partial extent motion
Refilling a client's data_avail is the only alloc_move call that doesn't
try to limit the number of blocks that it dirties.  If it doesn't find
sufficiently large extents it can exhaust the server's alloc budget
without hitting the target.  It then fails to dirty blocks and returns a
hard error.

This changes that behaviour to allow returning 0 if it moved any
extents.  Other callers can deal with partial progress as they already
limit the blocks they dirty.  It still returns ENOSPC if it didn't move
anything, just as the current code does.

The result is that a data fill won't necessarily hit its target.  It
might take multiple commits to fill the data_avail btree.
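
A sketch of the new return convention, with hypothetical helper names in
place of the real alloc_move internals:

    #include <linux/errno.h>
    #include <linux/types.h>

    /* Hypothetical: moves one extent, -ENOSPC when the commit's alloc budget runs out. */
    int move_one_extent_sketch(u64 *moved);

    int alloc_move_sketch(u64 target)
    {
        u64 moved = 0;
        int ret = 0;

        while (moved < target && ret == 0)
            ret = move_one_extent_sketch(&moved);

        /* partial progress is now success; only moving nothing is ENOSPC */
        if (ret == -ENOSPC && moved > 0)
            ret = 0;
        return ret;
    }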

Signed-off-by: Zach Brown <zab@versity.com>
2022-12-15 20:47:41 -08:00
Zach Brown
fff07ce19c Use stale block read retrying helper
Transition from manual checking for persistent ESTALE to the shared
helper that we just added.  This should not change behavior.

Signed-off-by: Zach Brown <zab@versity.com>
2022-12-12 14:59:22 -08:00
Zach Brown
233fbb39f3 Limit alloc_move per-call allocator consumption
Recently scoutfs_alloc_move() was changed to try to limit the number of
metadata blocks it could allocate or free.  The intent was to stop
concurrent holders of a transaction from fully consuming the available
allocator for the transaction.

The limiting logic was a bit off.  It stopped when the allocator had the
caller's limit remaining, not when it had consumed the caller's limit.
This is overly permissive and could still allow concurrent callers to
consume the allocator.  It was also triggering warning messages when a
call consumed more than its allowed budget while holding a transaction.

Unfortunately, we don't have per-caller tracking of allocator resource
consumption.  The best we can do is sample the allocators as we start
and return if they drop by the caller's limit.  This is overly
conservative in that it attributes any consumption by concurrent
callers to every caller.

This isn't perfect but it makes the failure case less likely and the
impact shouldn't be significant.  We don't often have a lot of
concurrency and the limits are larger than callers will typically
consume.
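
Roughly the shape of the sampling check, with invented names standing in
for the real allocator fields:

    #include <linux/types.h>

    /* Hypothetical snapshot of the allocator counters taken at call entry. */
    struct alloc_sample_sketch {
        u64 avail;
        u64 freed_space;
    };

    /*
     * True once the sampled counters have dropped by the caller's budget.
     * Consumption by concurrent callers counts against everyone, which is
     * the overly conservative behaviour described above.
     */
    static bool consumed_budget_sketch(struct alloc_sample_sketch *start,
                                       struct alloc_sample_sketch *now,
                                       u64 budget)
    {
        return (start->avail - now->avail) +
               (start->freed_space - now->freed_space) >= budget;
    }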

Signed-off-by: Zach Brown <zab@versity.com>
2022-07-29 11:25:01 -07:00
Zach Brown
198d3cda32 Add scoutfs_alloc_meta_low_since()
Add scoutfs_alloc_meta_low_since() to test if the metadata avail or
freed resources have been used by a given amount since a previous
snapshot.

Signed-off-by: Zach Brown <zab@versity.com>
2022-07-29 11:24:10 -07:00
Zach Brown
0d4bf83da3 Reclaim log_trees alloc roots in multiple commits
Client log_trees allocator btrees can build up quite a number of
extents.  In the right circumstances moving fragmented extents can
require dirtying a large number of paths to leaf blocks in the core
allocator btrees.  It might not be possible to dirty all the blocks
necessary to move all the extents in one commit.

This reworks the extent motion so that it can be performed in multiple
commits if the meta allocator for the commit runs out while it is moving
extents.  It's a minimal fix with as little disruption to the ordering
of commits and locking as possible.  It simply bubbles up an error when
the allocators run out and retries functions that can already be retried
in other circumstances.
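
Sketch of the retry pattern this describes, using hypothetical helpers
for the extent motion and commit steps:

    #include <linux/errno.h>

    /* Hypothetical: returns -ENOSPC when the commit's meta allocator runs out. */
    int move_some_extents_sketch(void);
    int commit_transaction_sketch(void);

    static int reclaim_log_trees_sketch(void)
    {
        int ret;

        do {
            ret = move_some_extents_sketch();
            if (ret == -ENOSPC) {
                /* allocator ran out mid-move: commit partial progress, retry */
                ret = commit_transaction_sketch();
                if (ret == 0)
                    ret = -ENOSPC;  /* keep looping */
            }
        } while (ret == -ENOSPC);

        return ret;     /* 0 once everything has been moved */
    }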

Signed-off-by: Zach Brown <zab@versity.com>
2022-06-08 11:53:53 -07:00
Zach Brown
96ad8dd510 Add scoutfs_alloc_meta_remaining
Add helper function to give the caller the number of blocks remaining in
the first list block that's used for meta allocation and freeing.

Signed-off-by: Zach Brown <zab@versity.com>
2022-04-01 15:21:44 -07:00
Zach Brown
7d71b610af Add server extent motion tracking
Add tracking in the alloc functions that the server uses to move extents
between allocator structures on behalf of client mounts.

Signed-off-by: Zach Brown <zab@versity.com>
2021-10-28 12:30:47 -07:00
Zach Brown
a53d6d1a8e Add scoutfs_alloc_foreach_super which takes super
Add an alloc_foreach variant which uses the caller's super to walk the
allocators rather than always reading it off the device.

Signed-off-by: Zach Brown <zab@versity.com>
2021-10-28 12:30:47 -07:00
Zach Brown
a59fd5865d Add seq and flags to btree items
The fs log btrees have values that start with a header that stores the
item's seq and flags.  There's a lot of sketchy code that manipulates
the value header as items are passed around.

This adds the seq and flags as core item fields in the btree.  They're
only set by the interfaces that are used to store fs items: _insert_list
and _merge.  The rest of the btree users that go through the main
interface don't touch the fields.

This was done to help delta items discover when logged items have been
merged before the finalized log btrees are deleted, and the code ends up
being quite a bit cleaner.
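
Purely for orientation, a hedged sketch of the on-disk idea (the real
scoutfs btree item layout differs):

    #include <linux/types.h>

    /* Illustrative only: seq and flags become core per-item fields. */
    struct btree_item_sketch {
        __le64 seq;     /* set only by _insert_list and _merge */
        __u8 flags;
        /* key and value follow in the real format */
    };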

Signed-off-by: Zach Brown <zab@versity.com>
2021-09-09 14:44:55 -07:00
Zach Brown
4d7191dc48 Print messages on extent ins/rem errors
Signed-off-by: Zach Brown <zab@versity.com>
2021-08-24 09:11:40 -07:00
Zach Brown
cdff272163 Fix alloc list exhaustion calculation
The last thing server commits do is move extents from the freed list
into freed extents.  It moves as many as it can until it runs out of
avail meta blocks and space for freed meta blocks in the current
allocator's lists.

The calculation for whether the lists had resources to move an extent
was quite off.  It missed that the first move might have to dirty the
current allocator or the list block, that the btree could join/split
blocks at each level down the paths, and boy does it look like the
height component of the calculation was just bonkers.

With the wrong calculation the server could overflow the freed list
while moving extents and trigger a BUG_ON.  We rarely saw this in
testing.
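
A hedged sketch of the kind of worst-case estimate the fix implies; the
real accounting and constants differ:

    #include <linux/types.h>

    /*
     * Moving one extent may have to dirty the current allocator and list
     * block on the first move, and the btree may split or join blocks at
     * every level down the path.
     */
    static u64 worst_case_move_blocks_sketch(u64 btree_height, bool first_move)
    {
        u64 per_level = 2;      /* possible split/join at each level */

        return (first_move ? 2 : 0) + btree_height * per_level;
    }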

Signed-off-by: Zach Brown <zab@versity.com>
2021-08-01 14:31:57 -07:00
Zach Brown
6d0694f1b0 Add resize_devices ioctl and scoutfs command
Add a scoutfs command that uses an ioctl to send a request to the server
to safely use a device that has grown.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-30 13:26:32 -07:00
Zach Brown
4c1181c055 Remove first_ and last_ super blkno fields
There are fields in the super block that specify the range of blocks
that would be used for metadata or data.  They are from the time when a
single block device was carved up into regions for metadata and data.

They don't make sense now that we have separate metadata and data block
devices.  The starting blkno is static and we go to the end of the
device.

This removes the fields now that they serve no purpose.  Their only
use, checking that freed extents fell within the correct bounds, can
still be performed by using the static starting number or roughly using
the size of the devices.  It's not perfect, but this is already only
a check to see that the blknos aren't utter nonsense.

We're removing the fields now to avoid having to update them while
worrying about users when resizing devices.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-30 13:22:42 -07:00
Zach Brown
73bf916182 Return ENOSPC as space gets low
Returning ENOSPC is challenging because we have clients working on
allocators which are a fraction of the whole and we use COW transactions
so we need to be able to allocate to free.  This adds support for
returning ENOSPC to client posix allocators as free space gets low.

For metadata, we reserve a number of free blocks for making progress
with client and server transactions which can free space.  The server
sets the low flag in a client's allocator if we start to dip into
reserved blocks.  In the client we add an argument to entering a
transaction which indicates if we're allocating new space (as opposed to
just modifying existing data or freeing).  When an allocating
transaction runs low and the server low flag is set then we return
ENOSPC.

Adding an argument to transaction holders and having it return ENOSPC
gave us the opportunity to clean it up and make it a little clearer.
More work is done outside the wait_event function and it now
specifically waits for a transaction to cycle when it forces a commit
rather than spinning until the transaction worker acquires the lock and
stops it.

For data the same pattern applies except there are no reserved blocks
and we don't COW data so it's a simple case of returning the hard ENOSPC
when the data allocator flag is set.

The server needs to consider the reserved count when refilling the
client's meta_avail allocator and when swapping between the two
meta_avail and meta_free allocators.

We add the reserved metadata block count to statfs_more so that df can
subtract it from the free meta blocks and make it clear when ENOSPC is
going to be returned for metadata allocations.

We increase the minimum device size in mkfs so that small testing
devices provide sufficient reserved blocks.

And finally we add a little test that makes sure we can fill both
metadata and data to ENOSPC and then recover by deleting what we filled.
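
A rough sketch of the client-side ENOSPC decision, with invented names
for the flag and threshold tests:

    #include <linux/errno.h>
    #include <linux/types.h>

    /*
     * Hypothetical: an allocating transaction hold returns ENOSPC only when
     * the server has set the low flag in our allocator and the allocator is
     * itself running low; non-allocating holds keep going so they can free.
     */
    static int enter_trans_sketch(bool allocating, bool server_low_flag,
                                  bool alloc_low)
    {
        if (allocating && server_low_flag && alloc_low)
            return -ENOSPC;
        return 0;
    }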

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-07 14:13:14 -07:00
Zach Brown
a7828a6410 Add log merge item allocators to alloc detail
The alloc iterator needs to find and include the totals of the avail and
freed allocator list heads in the log merge items.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
52c2a465db Add zone awareness to scoutfs_alloc_move()
Add parameters so that scoutfs_alloc_move() can first search for source
extents in specified zones.  It uses relatively cheap searches through
the order items to find extents that intersect with the regions
described by the zone bitmaps.
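
A hedged sketch of the zone intersection test implied here, with made-up
parameters for the zone geometry:

    #include <linux/types.h>
    #include <linux/bitops.h>

    /* Does an extent intersect any zone set in the caller's bitmap? */
    static bool extent_in_zones_sketch(u64 start, u64 len,
                                       const unsigned long *zones,
                                       u64 zone_blocks, unsigned int nr_zones)
    {
        u64 zone = start / zone_blocks;
        u64 last = (start + len - 1) / zone_blocks;

        for (; zone <= last && zone < nr_zones; zone++)
            if (test_bit(zone, zones))
                return true;
        return false;
    }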

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-21 15:31:02 -07:00
Zach Brown
bc4975fad4 Add scoutfs_alloc_extents_cb()
Add an allocator call for getting a callback for all the extents in
btree items in an allocator root.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-21 15:31:02 -07:00
Zach Brown
9de3ae6dcb Index free extents by order of length
Allocators store free extents in two items, one sorted by their blkno
position and the other by their precise length.

The length index makes it easy to search for precise extent lengths, but
it makes it hard to search for a large extent within a given blkno
region.  Skipping in the blkno dimension has to be done for every
precise length value.

We don't need that level of precision.  If we index the extents by a
coarser order of the length then we have a fixed number of orders in
which we have to skip in the blkno dimension when searching within a
specific region.

This changes the length item to be stored at the log(8) order of the
length of the extents.  This groups extents into orders that are close
to the human-friendly base 10 orders of magnitude.

With this change the order field in the key no longer stores the precise
extent length.  To preserve the length of the extent we need to use
another field.  The only 64bit field remaining is the first field, which
has a higher comparison priority than the type.  So we use the
highest-comparison-priority zone field to differentiate the position and
order indexes, and we can now use all three 64bit fields in the key.

Finally, we have to be careful when constructing a key to use _next when
searching for a large extent.  Previously keys were relying on the magic
property that building a key from an extent length of 0 ended up at the
key value -0 = 0.  That only worked because we never stored zero length
extents.  We now store zero length orders so we can't use the negative
trick anymore.  We explicitly treat 0 length extents carefully when
building keys and we subtract the order from U64_MAX to store the orders
from largest to smallest.
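
A hedged sketch of the order math described above; the real key building
helpers and zero-length convention may differ:

    #include <linux/types.h>
    #include <linux/kernel.h>

    /* Coarse base-8 order of an extent length. */
    static u64 extent_order_sketch(u64 len)
    {
        u64 order = 0;

        while (len >= 8) {
            len /= 8;
            order++;
        }
        return order;
    }

    /*
     * Subtract from U64_MAX so orders sort from largest to smallest, and
     * treat zero-length extents explicitly instead of relying on -0 == 0.
     */
    static u64 order_key_field_sketch(u64 len)
    {
        if (len == 0)
            return U64_MAX;     /* one possible zero-length convention */
        return U64_MAX - extent_order_sketch(len);
    }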

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-21 15:25:56 -07:00
Zach Brown
513d6b2734 Merge pull request #20 from versity/zab/remove_trans_spinlock
Zab/remove trans spinlock
2021-03-04 13:59:07 -08:00
Zach Brown
c470c1c9f6 Allow read-mostly _alloc_meta_low
Each transaction hold makes multiple calls to _alloc_meta_low to see if
the transaction should be committed to refill allocators before the
caller's hold is acquired and they can dirty blocks in the transaction.

_alloc_meta_low was using a spinlock to sample the allocator list_head
blocks to determine if there was space available.  The lock and unlock
stores were creating significant cacheline contention.

The _alloc_meta_low calls are higher frequency than allocations.  We can
use a seqlock to have exclusive writers and allow concurrent
_alloc_meta_low readers who retry if a writer intervenes.
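
A minimal sketch of the seqlock read pattern, using the kernel seqlock
API with illustrative field names:

    #include <linux/seqlock.h>
    #include <linux/types.h>

    /* Hypothetical allocator state guarded by a seqlock for read-mostly checks. */
    struct meta_alloc_sketch {
        seqlock_t seqlock;
        u64 avail_blocks;
        u64 freed_space;
    };

    static bool meta_low_sketch(struct meta_alloc_sketch *ma, u64 nr)
    {
        unsigned int seq;
        bool low;

        do {
            seq = read_seqbegin(&ma->seqlock);
            low = ma->avail_blocks < nr || ma->freed_space < nr;
        } while (read_seqretry(&ma->seqlock, seq));

        return low;
    }

The exclusive writers would wrap their updates in write_seqlock() and
write_sequnlock() on the same lock.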

Signed-off-by: Zach Brown <zab@versity.com>
2021-03-04 11:39:30 -08:00
Zach Brown
6237f0adc5 Add _block_dirty_ref to dirty blocks in one place
To create dirty blocks in memory each block type caller currently gets a
reference on a created block and then dirties it.  The reference it gets
could be an existing cached block that stale readers are currently
using.  This creates a problem with our block consistency protocol where
writers can dirty and modify cached blocks that readers are currently
reading in memory, leading to read corruption.

This commit is the first step in addressing that problem.  We add a
scoutfs_block_dirty_ref() call which returns a reference to a dirtied
block from the block core in one call.  We're only changing the callers
in this patch but we'll be reworking the dirtying mechanism in an
upcoming patch to avoid corrupting readers.

Signed-off-by: Zach Brown <zab@versity.com>
2021-03-01 09:49:17 -08:00
Zach Brown
0969a94bfc Check one block_ref struct in block core
Each of the different block types had a reading function that read a
block and then checked their reference struct for their block type.

This gets rid of each block reference type and has a single block_ref
type which is then checked by a single ref reading function in the block
core.  By putting ref checking in the core we no longer have to export
checking the block header crc, verifying headers, invalidating blocks,
or even reading raw blocks themselves.  Everyone reads refs and leaves
the checking up to the core.

The changes don't have a significant functional effect.  This is mostly
just changing types and moving code around.  (There are some changes to
visible counters.)

This shares code, which is nice, but this is putting the block reference
checking in one place in the block core so that in a few patches we can
fix problems with writers dirtying blocks that are being read.

Signed-off-by: Zach Brown <zab@versity.com>
2021-03-01 09:49:17 -08:00
Andy Grover
355eac79d2 Retry if transaction cannot alloc for fallocate or write
Add a new distinguishable return value (ENOBUFS) from the allocator for
when the transaction cannot alloc space. This doesn't mean the filesystem
is full -- opening a new transaction may result in forward progress.

Alter fallocate and get_blocks code to check for this err val and retry
with a new transaction. Handling actual ENOSPC can still happen, of
course.

Add counter called "alloc_trans_retry" and increment it from both spots.
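
Sketch of the retry loop in the callers; the helpers here are stand-ins
for the real transaction and allocation calls:

    #include <linux/errno.h>

    /* Hypothetical: returns -ENOBUFS when this transaction can't alloc. */
    int hold_trans_and_alloc_sketch(void);
    void release_trans_sketch(void);

    static int alloc_with_retry_sketch(void)
    {
        int ret;

        do {
            ret = hold_trans_and_alloc_sketch();
            if (ret == -ENOBUFS) {
                /* not true ENOSPC: a fresh transaction may succeed */
                release_trans_sketch();
                /* the patch also bumps the alloc_trans_retry counter here */
            }
        } while (ret == -ENOBUFS);

        return ret;
    }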

Signed-off-by: Andy Grover <agrover@versity.com>
[zab@versity.com: fixed up write_begin error paths]
2021-01-25 09:32:01 -08:00
Zach Brown
fc003a5038 Consistently sample data alloc total_len
With many concurrent writers we were seeing excessive commits forced
because the transaction thought the data allocator was running low.  The
transaction was checking the raw total_len value in the data_avail
alloc_root for the number of free data blocks.  But this read wasn't
locked, and allocators could completely remove a large free extent and
then re-insert a slightly smaller free extent as they performed their
allocation.  The transaction could see a temporarily very small total_len
and trigger a commit.

Data allocations are serialized by a heavy mutex so we don't want to
have the reader try and use that to see a consistent total_len.  Instead
we create a data allocator run-time struct that has a consistent
total_len that is updated after all the extent items are manipulated.
This also gives us a place to put the caller's cached extent so that it
can be included in the total_len; previously it wasn't included in the
free total that the transaction saw.

The file data allocator can then initialize and use this struct instead
of its raw use of the root and cached extent.  Then the transaction can
sample its consistent total_len that reflects the root and cached
extent.

A subtle detail is that fallocate can't use _free_data to return an
allocated extent on error to the avail pool.  It instead frees into the
data_free pool like normal frees.  It doesn't really matter that this
could prematurely drain the avail pool because it's in an error path.
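
Illustrative shape of the run-time struct and the sample the transaction
takes; the real scoutfs fields and access rules differ:

    #include <linux/types.h>
    #include <linux/compiler.h>

    /* total_len is only updated once the extent items and cached extent settle. */
    struct data_alloc_sketch {
        u64 total_len;
        /* alloc root and cached extent omitted */
    };

    /* The transaction samples a value that never dips mid-allocation. */
    static u64 data_alloc_free_blocks_sketch(struct data_alloc_sketch *da)
    {
        return READ_ONCE(da->total_len);
    }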

Signed-off-by: Zach Brown <zab@versity.com>
2021-01-06 09:25:32 -08:00
Zach Brown
a5d9ac5514 scoutfs: rework scoutfs_alloc_meta_low, takes arg
Previously, scoutfs_alloc_meta_lo_thresh() returned true when a small
static number of metadata blocks were either available to allocate or
had space for freeing.  This didn't make a lot of sense as the correct
number depends on how many allocations each caller will make during
their atomic transaction.

Rework the call to take an argument for the number of avail or freed
blocks available to test.  This first pass just uses the existing
number; we'll get to the callers later.

Signed-off-by: Zach Brown <zab@versity.com>
2020-12-02 09:25:13 -08:00
Zach Brown
736d9d7df8 scoutfs: remove struct scoutfs_log_trees_val
The log_trees structs store the data that is used by client commits.
The primary struct is communicated over the wire so it includes the rid
and nr that identify the log.  The _val struct was stored in btree item
values and was missing the rid and nr because those were stored in the
item's key.

It's madness to duplicate the entire struct just to shave off those two
fields.  We can remove the _val struct and store the main struct in item
values, including the rid and nr.

Signed-off-by: Zach Brown <zab@versity.com>
2020-10-30 11:14:10 -07:00
Zach Brown
7a3749d591 scoutfs: incremental srch compaction
Previously the srch compaction work would output the entire compacted
file and delete the input files in one atomic commit.  The server would
send the input files and an allocator to the client, and the client
would send back an output file and an allocator that included the
deletion of the input files.  The server would merge in the allocator
and replace the input file items with the output file item.

Doing it this way required giving an enormous allocation pool to the
client in a radix, which would deal with recursive operations
(allocating from and freeing to the radix that is being modified).  We
no longer have the radix allocator, and we use single block avail/free
lists instead of recursively modifying the btrees with free extent
items.  The compaction RPC needs to work with a finite amount of
allocator resources that can be stored in an alloc list block.

The compaction work now does a fixed amount of work and a compaction
operation spans multiple work iterations.

A single compaction struct is now sent between the client and server in
the get_compact and commit_compact messages.  The client records any
partial progress in the struct.  The server writes that position into
PENDING items.  It first searches for pending items to give to clients
before searching for files to start a new compaction operation.

The compact struct has flags to indicate whether the output file is
being written or the input files are being deleted.  The server manages
the flags and sets the input file deletion flag only once the result of
the compaction has been reflected in the btree items which record srch
files.

We added the progress fields to the compaction struct, making it even
bigger than it already was, so we take the time to allocate it rather
than declaring it on the stack.

It's worth mentioning that each operation now taking a reasonably
bounded amount of time will make it feasible to decide that it has
failed and needs to be fenced.
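
For orientation only, a sketch of what the compaction struct's new state
might look like; the real scoutfs format differs:

    #include <linux/types.h>

    struct srch_compact_sketch {
        __le64 pos;     /* partial progress recorded by the client */
        __u8 flags;
        /* input/output file references and allocator list heads follow */
    };

    #define SRCH_COMPACT_WRITING_SKETCH     (1 << 0)    /* output being written */
    #define SRCH_COMPACT_DELETING_SKETCH    (1 << 1)    /* inputs being deleted */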

Signed-off-by: Zach Brown <zab@versity.com>
2020-10-26 15:19:03 -07:00
Zach Brown
fb66372988 scoutfs: add alloc foreach cb iterator
Add an alloc call which reads all the persistent allocators and calls a
callback for each.  This is going to be used to calculate free blocks
in clients for df, and in an ioctl to give a more detailed view of
allocators.

Signed-off-by: Zach Brown <zab@versity.com>
2020-10-26 15:19:03 -07:00
Zach Brown
8f946aa478 scoutfs: add btree item extent allocator
Add an allocator which uses btree items to store extents.  Both the
client and server will use this for btree blocks, the client will use it
for srch blocks and data extents, and the server will move extents
between the core fs allocator btree roots and the clients' roots.

Signed-off-by: Zach Brown <zab@versity.com>
2020-10-26 15:19:03 -07:00
Zach Brown
1b3645db8b scoutfs: remove dead server allocator code
Remove the bitmap segno allocator code that the server used to use to
manage allocations.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Mark Fasheh
3a5093c6ae scoutfs: replace trace_printk in alloc.c
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-09-28 13:59:49 -07:00
Zach Brown
ff5a094833 scoutfs: store allocator regions in btree
Convert the segment allocator to store its free region bitmaps in the
btree.

This is a very straightforward mechanical transformation.  We split the
allocator region into a big-endian index key and the bitmap value
payload.  We're careful to operate on aligned copies of the bitmaps so
that they're long aligned.

We can remove all the funky functions that were needed when writing the
ring.  All we're left with is a call to apply the pending allocations to
dirty btree blocks before writing the btree.
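
A hedged sketch of the aligned-copy care mentioned above:

    #include <linux/string.h>
    #include <linux/kernel.h>

    /*
     * Copy the bitmap out of the btree value into a long-aligned buffer
     * before using the bitmap helpers on it.
     */
    static void copy_region_bitmap_sketch(unsigned long *aligned,
                                          const void *val, size_t bytes)
    {
        memset(aligned, 0, ALIGN(bytes, sizeof(long)));
        memcpy(aligned, val, bytes);
    }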

Signed-off-by: Zach Brown <zab@versity.com>
2017-07-08 10:59:40 -07:00
Zach Brown
8d59e6d071 scoutfs: fix alloc eio for free region
It's possible for the next segno to fall at the end of an allocation
region that doesn't have any bits set.  The code shouldn't return -EIO
in that case; it should carry on to the next region.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-27 14:04:38 -07:00
Zach Brown
cec3f9468a Further isolate rings and compaction
Each mount was still loading the manifest and allocator rings and
starting compaction, even if they were coordinating segment reads
and writes with the server.

This moves ring and compaction setup and teardown from mount and
unmount to server startup and shutdown.  Now only the server has the
rings resident and is running compaction.

We had to null some of the super info fields so that we can repeatedly
load and destroy the ring indices over the lifetime of a mount.

We also have to be careful not to call between item transactions and
compaction.  We'll restore this functionality with the server in the
future.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:51:10 -07:00
Zach Brown
5e0e9ac12e Move to much simpler manifest/alloc storage
Using the treap to be able to incrementally read and write the manifest
and allocation storage from all nodes wasn't quite ready for prime time.
The biggest problem is that invalidating cached nodes which are the
target of native pointers, either for consistency or memory pressure, is
problematic.  This was getting in the way of adding shared support as
readers and writers try to use as much of their treap caches as they
can.  There were other serious problems that we'd run into eventually:
memory pressure from duplicate caching in native nodes and the page
cache, small IOs from reading a page at a time, the risk of
pathologically imbalanced treaps, and the ring being corrupted if the
migration balancing doesn't work (the model assumed you could always
dirty an individual node in a transaction, but you actually have to
dirty all the parents in each new transaction).

Let's back off to a much simpler mechanism while we build the rest of
the system around it.  We can revisit aggressively optimizing this when
it's our worst problem.

We'll store the indexes that the manifest server needs in simple
preallocated rings with log entries.  The server has to read the index
in its entirety into a native rbtree before it can work on it.  We won't
access the physical ring from mounts anymore, they'll send messages to
the server.

The ring callers are now working with a pinned tree in memory so the
interface can be a bit simpler.  By storing the indexes in their own
rings the code and write path become a lot simper: we have an IO
submission path for each index instead of "dirtying" calls per index and
then a writing call.

All this is much more robust and much less likely to get in our way as
we stand up the rest of the system around it.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:51:10 -07:00
Zach Brown
6516ce7d57 Report free blocks in statfs
Our statfs callback was still using the old buddy allocator.

We add a free segments field to the super and have it track the number
of free segments in the allocator.  We then use that to calculate the
number of free blocks for statfs.
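
Roughly the calculation this enables, with made-up names:

    #include <linux/types.h>

    /* Free blocks for statfs derived from the super's free segment count. */
    static u64 statfs_free_blocks_sketch(u64 free_segs, u64 blocks_per_seg)
    {
        return free_segs * blocks_per_seg;
    }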

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
0a5fb7fd83 Add some counters
Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
a333c507fb Fix how dirty treap is tracked
The transaction writing thread tests if the manifest and alloc treaps
are dirty.  It did this by testing if there were any dirty nodes in the
treap.

But this misses the case where the treap has been modified and all nodes
have been removed.  In that case the root references no dirty nodes but
needs to be written.

Instead let's specifically mark the treap dirty when it's modified.
From then on sync will always try to write it out.  We also integrate
updating the persistent root as part of writing the dirty nodes to the
persistent ring.  It's required and every caller did it so it was silly
to make it a separate step.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
db9f2be728 Switch to indexed manifest using treap ring
The first pass manifest and allocator storage used a simple ring log
that was entirely replayed into memory to be used.  That risked the
manifest being too large to fit in memory, especially with large keys
and large volumes.

So we move to using an indexed persistent structure that can be read on
demand and cached.  We use a treap of byte-referenced nodes stored in a
circular ring.

The code interface is modeled a bit on the in-memory rbtree interface,
except that we can get IO errors and manage allocation, so we return
data pointers to the item payload instead of item structs and we can
return errors.

The manifest and allocator are converted over and the old ring code is
removed entirely.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
c4954eb6f4 Add initial LSM write implementation
Add all the core structural components to be able to modify metadata.  We
modify items in fs write operations, track dirty items in the cache,
allocate free segment block regions, stream dirty items into segments,
write out the segments, update the manifest to reference the written
segments, and write out a new ring that has the new manifest.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:42:30 -07:00