scoutfs_data_wait_check_iter() was checking the contiguous region of the
file starting at its pos and extending for iov_iter_count() bytes. The
caller can do that with the previous _data_wait_check() method by
providing the same count that _check_iter() was using.
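Roughly, on the caller side (a sketch; the surrounding variables and
the exact _data_wait_check() signature are assumptions):

    /* was: scoutfs_data_wait_check_iter(inode, pos, iter, sef, op, lock) */
    ret = scoutfs_data_wait_check(inode, pos, iov_iter_count(iter),
                                  sef, op, lock);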
Signed-off-by: Zach Brown <zab@versity.com>
The aio_read and aio_write callbacks are no longer used by newer
kernels, which now use iter-based readers and writers.
We can avoid implementing plain .read and .write as an iter will
be generated when needed for us automatically.
We add a new data_wait_check_iter() function accordingly.
With these methods removed from the kernel, the el8 kernel no
longer uses the extended ops wrapper struct and is now much closer
to upstream. A lot of methods move between inode_dir_operations and
inode_file_operations and the like, so things should look a bit more
structured as a result.
We also need a slightly different data_wait_check() that accounts
for the iter and offset properly.
Signed-off-by: Auke Kok <auke.kok@versity.com>
When we truncate away from a partial block we need to zero its tail that
was past i_size and dirty it so that it's written.
We missed the typical vfs boilerplate of calling block_truncate_page
from setattr->set_size that does this. We need to be a little careful
to pass our file lock down to get_block and then queue the inode for
writeback so it's written out with the transaction. This follows the
pattern in .write_end.
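The shape of the fix, sketched with hypothetical scoutfs helper
names; block_truncate_page() is the stock vfs helper and the file
lock travels down to get_block through the usual per-task plumbing:

    static int truncate_tail_block(struct inode *inode, loff_t new_size)
    {
            int ret;

            /* zero the now-partial tail block past i_size and dirty it */
            ret = block_truncate_page(inode->i_mapping, new_size,
                                      scoutfs_get_block);
            if (ret == 0) {
                    /* written out with the transaction, as .write_end does */
                    scoutfs_inode_queue_writeback(inode);
            }
            return ret;
    }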
Signed-off-by: Zach Brown <zab@versity.com>
Returning ENOSPC is challenging because we have clients working on
allocators which are a fraction of the whole and we use COW transactions
so we need to be able to allocate to free. This adds support for
returning ENOSPC to client posix allocators as free space gets low.
For metadata, we reserve a number of free blocks for making progress
with client and server transactions which can free space. The server
sets the low flag in a client's allocator if we start to dip into
reserved blocks. In the client we add an argument to entering a
transaction which indicates if we're allocating new space (as opposed to
just modifying existing data or freeing). When an allocating
transaction runs low and the server's low flag is set, we return
ENOSPC.
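The check at transaction entry looks roughly like this (a sketch
with assumed helper names; the real plumbing threads the flag
through all the holders):

    int scoutfs_hold_trans(struct super_block *sb, bool allocating)
    {
            /* hard ENOSPC only for holders allocating new space */
            if (allocating && server_low_flag_set(sb) &&
                trans_alloc_running_low(sb))
                    return -ENOSPC;

            /* ... otherwise wait to hold the transaction as usual ... */
            return 0;
    }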
Adding an argument to transaction holders and having it return ENOSPC
gave us the opportunity to clean it up and make it a little clearer.
More work is done outside the wait_event function and it now
specifically waits for a transaction to cycle when it forces a commit
rather than spinning until the transaction worker acquires the lock and
stops it.
For data the same pattern applies except there are no reserved blocks
and we don't COW data so it's a simple case of returning the hard ENOSPC
when the data allocator flag is set.
The server needs to consider the reserved count when refilling the
client's meta_avail allocator and when swapping between the
meta_avail and meta_free allocators.
We add the reserved metadata block count to statfs_more so that df can
subtract it from the free meta blocks and make it clear when ENOSPC is
going to be returned for metadata allocations.
We increase the minimum device size in mkfs so that small testing
devices provide sufficient reserved blocks.
And finally we add a little test that makes sure we can fill both
metadata and data to ENOSPC and then recover by deleting what we filled.
Signed-off-by: Zach Brown <zab@versity.com>
Support O_TMPFILE: Create an unlinked file and put it on the orphan list.
If it ever gains a link, take it off the orphan list.
Change MOVE_BLOCKS ioctl to allow moving blocks into offline extent ranges.
Ioctl callers must set a new flag to enable this operation mode.
RH-compat: tmpfile support is actually backported by RH into their
3.10 kernel. We need some of their kabi-maintaining wrappers to hook
into it: use a struct inode_operations_wrapper instead of the base
struct inode_operations, and set the S_IOPS_WRAPPER flag in i_flags.
This lets RH's modified vfs_tmpfile() find our tmpfile fn pointer.
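Roughly (a sketch; the wrapper struct and flag are RH's, the field
and function names on our side are assumptions):

    static const struct inode_operations_wrapper scoutfs_dir_iops = {
            .ops = {
                    .lookup = scoutfs_lookup,
                    /* ... the rest of the dir ops ... */
            },
            .tmpfile = scoutfs_tmpfile,
    };

    static void scoutfs_set_dir_ops(struct inode *inode)
    {
            inode->i_op = &scoutfs_dir_iops.ops;
            /* tells RH's vfs_tmpfile() that the wrapper is present */
            inode->i_flags |= S_IOPS_WRAPPER;
    }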
Add a test that covers both creating tmpfiles and moving their
contents into a destination file via MOVE_BLOCKS.
xfstests common/004 now runs because tmpfile is supported.
Signed-off-by: Andy Grover <agrover@versity.com>
Add a relatively constrained ioctl that moves extents between regular
files. This is intended to be used by tasks which combine many existing
files into a much larger file without reading and writing all the file
contents.
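From userspace this looks something like the following (hypothetical
sketch; the struct layout is an assumption, the real ABI lives in
the scoutfs ioctl header):

    static int move_into(int to_fd, int from_fd, __u64 from_off,
                         __u64 len, __u64 to_off)
    {
            struct scoutfs_ioctl_move_blocks mb = {
                    .from_fd  = from_fd,    /* layout assumed */
                    .from_off = from_off,
                    .len      = len,
                    .to_off   = to_off,
            };

            return ioctl(to_fd, SCOUTFS_IOC_MOVE_BLOCKS, &mb);
    }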
Signed-off-by: Zach Brown <zab@versity.com>
Previously we'd avoided full extents in file data mapping items because
we were deleting items from forest btrees directly. That created
deletion items for every version of file extents as they were modified.
Now we have the item cache which can remove deleted items from memory
when deletion items aren't necessary.
By layering file data extents on an extent layer, we can also transition
allocators to use extents and fix a lot of problems in the radix block
allocator.
Most of this change is churn from changing allocator function and struct
names.
File data extents no longer have to manage loading and storing from and
to packed extent items at a fixed granularity. All those loops are torn
out and data operations now call the extent layer with their callbacks
instead of calling its packed item extent functions. This now means
that fallocate and especially restoring offline extents can use larger
extents. Small file block allocation now comes from a cached extent
which reduces item calls for small file data streaming writes.
The big change in the server is to use more root structures to manage
recursive modification instead of relying on the allocator to notice and
do the right thing. The radix allocator tried to notice when it was
actively operating on a root that it was also using to allocate and free
metadata blocks. This resulted in a lot of bugs. Instead we now double
buffer the server's avail and freed roots so that the server fills and
drains the stable roots from the previous transaction. We also double
buffer the core fs metadata avail root so that we can increase the time
to reuse freed metadata blocks.
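The double buffering amounts to something like this (assumed shape;
the real structs carry more state):

    struct server_alloc_roots {
            struct scoutfs_alloc_root avail[2];
            struct scoutfs_alloc_root freed[2];
            int active;     /* pair modified by the current transaction */
    };

    static void swap_alloc_roots(struct server_alloc_roots *r)
    {
            /*
             * At commit the active pair becomes the stable pair that
             * the next transaction fills from and drains into.
             */
            r->active ^= 1;
    }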
The server now only moves free extents into client allocators when they
fall below a low threshold. This reduces the shared modification of the
client's allocator roots which requires cold block reads on both the
client and server.
Signed-off-by: Zach Brown <zab@versity.com>
Add support for reporting errors to data waiters via a new
SCOUTFS_IOC_DATA_WAIT_ERR ioctl. This allows waiters to return an error
to readers when staging fails.
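Userspace use looks roughly like this (hypothetical sketch; the
field names are assumptions):

    struct scoutfs_ioctl_data_wait_err dwe = {
            .ino    = ino,          /* from the waiting list */
            .offset = off,
            .count  = count,
            .op     = op,
            .err    = EIO,          /* handed to the blocked waiters */
    };

    ret = ioctl(fd, SCOUTFS_IOC_DATA_WAIT_ERR, &dwe);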
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
[zab: renamed to data_wait_err, took ino arg]
Signed-off-by: Zach Brown <zab@versity.com>
File data allocations come from radix allocators which are populated by
the server before each client transaction. It's possible to fully
consume the data allocator within one transaction if the number of dirty
metadata blocks is kept low. This could result in premature ENOSPC.
This was happening to the archive-light-cycle test. If the transactions
performed by previous tests lined up just right then the creation of the
initial test files could see ENOSPC and cause all sorts of nonsense in
the rest of the test, culminating in cmp commands stuck in offline
waits.
This introduces high and low data allocator water marks for
transactions. The server tries to fill data allocators for each
transaction to the high water mark and the client forces the commit of a
transaction if its data allocator falls below the low water mark.
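The client side of the check is roughly (assumed helper and water
mark names; the counter name is the real one from the test results
below):

    if (data_alloc_free_blocks(sb) < SCOUTFS_DATA_ALLOC_LO_WATER) {
            scoutfs_inc_counter(sb, trans_commit_data_alloc_low);
            scoutfs_force_trans_commit(sb);
    }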
The archive-light-cycle test now passes easily and we see the
trans_commit_data_alloc_low counter increasing during the test.
Signed-off-by: Zach Brown <zab@versity.com>
The identifier for data.h's include guard was brought over from an old
file and still had the old name. Update it to reflect its use in data,
not filerw.
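That is (exact identifier assumed):

    #ifndef _SCOUTFS_DATA_H_
    #define _SCOUTFS_DATA_H_

    /* ... declarations ... */

    #endif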
Signed-off-by: Zach Brown <zab@versity.com>
Now that we have the allocators that use radix blocks we can remove all
the code that was using btree items to store free block bitmaps.
Signed-off-by: Zach Brown <zab@versity.com>
Convert metadata block and file data extent allocations to use the radix
allocator.
Most of this is simple transitions between types and calls. The server
no longer has to initialize blocks because mkfs can write a single
radix parent block with fully set parent refs to initialize a full
radix. We remove the code and fields that were responsible for adding
uninitialized data and metadata.
The rest of the unused block allocator code is only ifdefed out. It'll
be removed in a separate patch to reduce noise here.
Signed-off-by: Zach Brown <zab@versity.com>
The btree forest item storage doesn't have as much item granular state
as the item cache did. The item cache could tell if a cached item was
populated from persistent storage or was created in memory. It could
simply remove created items rather than leaving behind a deletion item.
The btree forest item storage mechanism can't do this with its
cached btree blocks. It has to create deletion items when deleting
newly created items because it doesn't know whether the item already
exists in the persistent record.
This created a problem with the extent storage we were using. The
individual extent items were stored with a key set to the last logical
block of their extent. As extents grew or shrank they often were
deleted and created at different key values during a transaction. In
the btree forest log trees this left a huge stream of deletion items
behind, one for every previous version of the extent. Then searches for
an extent covering a block would have to skip over all these deleted
items before hitting the current stored extent.
Streaming writes became O(n) for every extent operation, which got
out of hand. This large change solves the problem by using
more coarse and stable item storage to track free blocks and blocks
mapped into file data.
For file data we now have large packed extent items which store packed
representations of all the logical mappings of a fixed region of a file.
The data code has loading and storage functions which transfer that
persistent version to and from the version that is modified in memory.
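The round-trip looks roughly like this (all names assumed):

    static int modify_region_extents(struct packed_extent_item *item)
    {
            struct native_extent_map map;
            int ret;

            load_packed_extents(item, &map);   /* persistent -> memory */
            ret = apply_extent_changes(&map);  /* split, merge, flag */
            if (ret == 0)                      /* memory -> persistent */
                    store_packed_extents(&map, item);
            return ret;
    }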
Free blocks are stored in bitmaps that are similarly efficiently packed
into fixed size items. The client is no longer working with free extent
items managed by the forest; it's working with free block bitmap btrees
directly. It needs access to the client's metadata block allocator and
block write contexts so we move those two out of the forest code and up
into the transaction.
Previously the client and server would exchange extents with network
messages. Now the roots of the btrees that store the free block bitmap
items are communicated along with the roots of the other trees involved
in a transaction. The client doesn't need to send free extents back to
the server so we can remove those tasks and RPCs.
The server no longer has to manage free extents. It transfers block
bitmap items between trees around commits. All of its extent
manipulation can be removed.
The item size portion of transaction item counts is removed because
we're not using that level of granularity now that metadata transactions
are dirty btree blocks instead of dirty items we pack into fixed sized
segments.
Signed-off-by: Zach Brown <zab@versity.com>
Add an ioctl that can be used by userspace to restore a file to its
offline state. To do that it needs to set inode fields that are
otherwise not exposed and create an offline extent.
Signed-off-by: Zach Brown <zab@versity.com>
One of the core features of scoutfs is the ability to transparently
migrate file contents to and from an archive tier. For this to be
transparent we need file system operations to trigger staging the file
contents back into the file system as needed.
This adds the infrastructure which operations use to wait for offline
extents to come online and which provides userspace with a list of
blocks that the operations are waiting for.
We add some waiting infrastructure that callers use to lock, check for
offline extents, and unlock and wait before checking again to see if
they're still offline. We add these checks and waiting to data io
operations that could encounter offline extents.
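The waiting pattern, sketched with assumed names:

    for (;;) {
            ret = scoutfs_lock_inode(sb, mode, inode, &lock);
            if (ret)
                    break;
            if (!extents_offline(inode, iblock, len))
                    break;          /* online: proceed under the lock */
            scoutfs_unlock_inode(sb, inode, lock);
            /* woken whenever the contents could have changed */
            ret = wait_for_data_change(inode, iblock, len);
            if (ret)
                    break;
    }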
This has to be done carefully so that we don't wait while holding locks
that would prevent staging. We use per-task structures to discover when
we are the first user of a cluster lock on an inode, indicating that
it's safe for us to wait because we don't hold any locks.
And while we're waiting our operation is tracked and reported to
userspace through an ioctl. This is a non-blocking ioctl; it's up to
userspace to decide how often to check and how large a region to stage.
Waiters are woken up when the file contents could have changed, not
specifically when we know that the extent has come online. This lets us
wake waiters when their lock is revoked so that they can block waiting
to reacquire the lock and test the extents again. It lets us provide
coherent demand staging across the cluster without fine grained waiting
protocols sent between the nodes. It may result in some spurious wakeups
and work but hopefully it won't, and it's a very simple and functional
first pass.
Signed-off-by: Zach Brown <zab@versity.com>
Add an fallocate operation.
This changes the possible combinations of flags in extents and makes it
possible to create extents beyond i_size. This will confuse the rest of
the code in a few places and that will be fixed up next.
Signed-off-by: Zach Brown <zab@versity.com>
Store file data mappings and free block ranges in extents instead of in
block mapping items and bitmaps.
This adds the new functionality and refactors the functions that use it.
The old functions are no longer called, and to keep the change small
we only ifdef them out. We'll remove all the dead code in a future
change.
Signed-off-by: Zach Brown <zab@versity.com>
Add cluster lock coverage to scoutfs_data_truncate_items() and plumb the
lock down into the item functions.
Signed-off-by: Zach Brown <zab@versity.com>
Move to static mapping items instead of unbounded extents.
We get more predictable data structures and simpler code but still get
reasonably dense metadata.
We no longer need all the extent code for splitting and merging
extents, testing for overlaps, and all that. The functions that use
the mappings
(get_block, fiemap, truncate) now have a pattern where they decode the
mapping item into an allocated native representation, do their work, and
encode the result back into the dense item.
We do have to grow the largest possible item value to fit the worst case
encoding expansion of random block numbers.
The local allocators are no longer two extents but are instead simple
bitmaps: one for full segments and one for individual blocks. There are
helper functions to free and allocate segments and blocks, with careful
coordination of, for example, freeing a segment once all of its
constituent blocks are free.
_fiemap is refactored a bit to make it more clear what's going on.
There's one function that either merges the next bit with the currently
building extent or fills the current and starts recording from a
non-mergeable additional block. The old loop worked this way but was
implemented with a single squirrelly iteration over the extents.
That's no longer feasible now that we're also iterating over blocks
inside the
mapping items. It's a lot clearer to call out to merge or fill the
fiemap entry.
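The helper looks something like this (assumed names; block-to-byte
conversion for fiemap_fill_next_extent() is elided):

    static int merge_or_fill(struct fiemap_extent_info *fieinfo,
                             struct fiemap_state *st,
                             u64 logical, u64 phys, u32 flags)
    {
            int ret;

            /* contiguous, compatible blocks extend the current extent */
            if (st->len && logical == st->logical + st->len &&
                phys == st->phys + st->len && flags == st->flags) {
                    st->len++;
                    return 0;
            }

            /* otherwise fill the finished extent and start a new one */
            if (st->len) {
                    ret = fiemap_fill_next_extent(fieinfo, st->logical,
                                                  st->phys, st->len,
                                                  st->flags);
                    if (ret)
                            return ret;
            }

            st->logical = logical;
            st->phys = phys;
            st->len = 1;
            st->flags = flags;
            return 0;
    }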
The dirty item reservation counts for using the mappings are reduced
significantly because each modification no longer has to assume that it
might merge with two adjacent contiguous neighbours.
Signed-off-by: Zach Brown <zab@versity.com>
Now that we have basic file extents we can add a flag to extents to
track offline extents. We have to initialize and test the flags as we
work with extents. Truncation can be told to leave removed extents
around with no block mapping and the offline bit set. Only staging with
the correct data version can write to the offline regions. Demand
staging isn't implemented yet. Reads from offline extents are treated
like sparse regions.
Truncation is a straightforward iteration over the portions of existing
extents which overlap with the truncated blocks.
Writing to offline extents has to first remove the existing offline
extent before then adding the new allocated extents. The 'changes'
mechanism relied on being able to search the current items to find the
changes that should be made before making any changes. This doesn't
work for finding merge candidates for the new allocated insertion
because the old offline extent change won't have been applied yet. We
replace the change mechanism with straightforward item modification and
unwinding.
The generic block fiemap can't communicate offline extents and iterates
over blocks instead of extents. We add our own fiemap that iterates
over extents and sets the 'UNKNOWN' flag on offline extents.
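The flag mapping is the simple part (our extent flag name is
assumed; FIEMAP_EXTENT_UNKNOWN is the stock vfs flag):

    u32 flags = 0;

    if (extent->flags & SEF_OFFLINE)    /* flag name assumed */
            flags |= FIEMAP_EXTENT_UNKNOWN;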
Signed-off-by: Zach Brown <zab@versity.com>
Our first attempt at storing file data put it in items. This was easy
to implement but won't be acceptable in the long term. The cost of the
power of LSM indexing is compaction overhead. That's acceptable for
fine grained metadata but is totally unacceptable for bulk file data.
This switches to storing file data in separate block allocations which
are referenced by extent items.
The bulk of the change is the mechanics of working with extents. We
have high level callers which add or remove logical extents and then
underlying mechanisms that insert, merge, or split the items that
the extents are stored in.
We have three types of extent items. The primary type maps logical file
regions to physical block extents. The next two store free extents
per-node so that clients don't create lock and LSM contention as they
try to allocate extents.
To fill those per-node free extents we add messages that communicate free
extents in the form of lists of segment allocations from the server.
We don't do any fancy multi-block allocation yet. We only allocate
blocks in get_blocks as writes find unmapped blocks. We do use some
per-task cursors to cache block allocation positions so that these
single block allocations are very likely to merge into larger extents as
tasks stream writes.
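The cursor idea, sketched with assumed names:

    static u64 cursor_alloc_block(struct task_cursor *curs)
    {
            u64 blkno;

            if (curs->blkno && block_is_free(curs->blkno))
                    blkno = curs->blkno;    /* keep streaming forward */
            else
                    blkno = alloc_free_block();

            curs->blkno = blkno + 1;        /* next write lands here */
            return blkno;
    }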
This is just the first chunk of the extent work that's coming. A later
patch adds offline flags and fixes up the change nonsense that seemed
like a good idea here.
The final moving part is that we initiate writeback on all newly
allocated extents before we commit the metadata that references the new
blocks. We do this with our own dirty inode tracking because the high
level vfs methods are unusably slow in some upstream kernels (they walk
all inodes, not just dirty inodes).
Signed-off-by: Zach Brown <zab@versity.com>
Add basic file data support by managing file data items from the page
cache address space callbacks.
Data is read by copying from cached items into page contents in
readpage.
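The read side amounts to something like this (the item copy helper
is an assumed name; the page calls are the stock ones):

    static int scoutfs_readpage(struct file *file, struct page *page)
    {
            int ret;

            /* copy cached item values into the page, zeroing holes */
            ret = copy_data_items_to_page(page->mapping->host, page);
            if (ret == 0)
                    SetPageUptodate(page);
            unlock_page(page);
            return ret;
    }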
Writes create new ephemeral items which reference dirty pages. The
items are deleted once they're written in a transaction or if
invalidatepage removes the dirty page they reference.
There's a lot more to do to remove data copies, avoid compaction bw
overhead, and add support for truncate, o_direct, and mmap.
Signed-off-by: Zach Brown <zab@versity.com>