Support O_TMPFILE: Create an unlinked file and put it on the orphan list.
If it ever gains a link, take it off the orphan list.
Change MOVE_BLOCKS ioctl to allow moving blocks into offline extent ranges.
Ioctl callers must set a new flag to enable this operation mode.
RH-compat: tmpfile support is actually backported by RH into the 3.10
kernel. We need to use some of their kabi-maintaining wrappers to use
it: use a struct inode_operations_wrapper instead of the base struct
inode_operations, and set the S_IOPS_WRAPPER flag in i_flags. This lets
RH's modified vfs_tmpfile() find our tmpfile fn pointer.
Add a test that covers both creating tmpfiles and moving their
contents into a destination file via MOVE_BLOCKS.
xfstests common/004 now runs because tmpfile is supported.
Signed-off-by: Andy Grover <agrover@versity.com>
Add a new distinguishable return value (ENOBUFS) from the allocator for
when the transaction cannot allocate space. This doesn't mean the
filesystem is full -- opening a new transaction may result in forward
progress.
Alter fallocate and get_blocks code to check for this err val and retry
with a new transaction. Handling actual ENOSPC can still happen, of
course.
Add counter called "alloc_trans_retry" and increment it from both spots.
Signed-off-by: Andy Grover <agrover@versity.com>
[zab@versity.com: fixed up write_begin error paths]
Add a relatively constrained ioctl that moves extents between regular
files. This is intended to be used by tasks which combine many existing
files into a much larger file without reading and writing all the file
contents.
Signed-off-by: Zach Brown <zab@versity.com>
With many concurrent writers we were seeing excessive commits forced
because it thought the data allocator was running low. The transaction
was checking the raw total_len value in the data_avail alloc_root for
the number of free data blocks. But this read wasn't locked, and
allocators could completely remove a large free extent and then
re-insert a slightly smaller free extent as they perform their
allocation. The transaction could see a temporarily very small total_len
and trigger a commit.
Data allocations are serialized by a heavy mutex so we don't want to
have the reader try to use that to see a consistent total_len. Instead
we create a data allocator run-time struct that has a consistent
total_len that is updated after all the extent items are manipulated.
This also gives us a place to put the caller's cached extent so that it
can be included in the total_len; previously it wasn't included in the
free total that the transaction saw.
The file data allocator can then initialize and use this struct instead
of its raw use of the root and cached extent. Then the transaction can
sample its consistent total_len that reflects the root and cached
extent.
A subtle detail is that fallocate can't use _free_data to return an
allocated extent on error to the avail pool. It instead frees into the
data_free pool like normal frees. It doesn't really matter that this
could prematurely drain the avail pool because it's in an error path.
Signed-off-by: Zach Brown <zab@versity.com>
Now that we have full precision extents a writer with i_mutex and a page
lock can be modifying large extent items which cover much of the
surrounding pages in the file. Readers can be in a different page with
only the page lock and try to work with extent items as the writer is
deleting and creating them.
We add a per-inode rwsem which just protects file extent item
manipulation. We try to acquire it as close to the item use as possible
in data.c which is the only place we work with file extent items.
This stops rare read corruption we were seeing where get_block in a
reader was racing with extent item deletion in a stager at a further
offset in the file.
Signed-off-by: Zach Brown <zab@versity.com>
Previously we'd avoided full extents in file data mapping items because
we were deleting items from forest btrees directly. That created
deletion items for every version of file extents as they were modified.
Now we have the item cache which can remove deleted items from memory
when deletion items aren't necessary.
By layering file data extents on an extent layer, we can also transition
allocators to use extents and fix a lot of problems in the radix block
allocator.
Most of this change is churn from changing allocator function and struct
names.
File data extents no longer have to manage loading and storing from and
to packed extent items at a fixed granularity. All those loops are torn
out and data operations now call the extent layer with their callbacks
instead of calling its packed item extent functions. This now means
that fallocate and especially restoring offline extents can use larger
extents. Small file block allocation now comes from a cached extent
which reduces item calls for small file data streaming writes.
The big change in the server is to use more root structures to manage
recursive modification instead of relying on the allocator to notice and
do the right thing. The radix allocator tried to notice when it was
actively operating on a root that it was also using to allocate and free
metadata blocks. This resulted in a lot of bugs. Instead we now double
buffer the server's avail and freed roots so that the server fills and
drains the stable roots from the previous transaction. We also double
buffer the core fs metadata avail root so that we can increase the time
to reuse freed metadata blocks.
The server now only moves free extents into client allocators when they
fall below a low threshold. This reduces the shared modification of the
client's allocator roots which requires cold block reads on both the
client and server.
Signed-off-by: Zach Brown <zab@versity.com>
Use the new item cache for all the item work in the fs instead of
calling into the forest of btrees. Most of this is mechanical
conversion from the _forest calls to the _item calls. The item cache
no longer supports the kvec argument for describing values so all the
callers pass in the value pointer and length directly.
The item cache doesn't support saving items as they're deleted and later
restoring them from an error unwinding path. There were only two users
of this. Directory entries can easily guarantee that deletion won't
fail by dirtying the items first in the item cache. Xattr updates were
a little trickier. They can combine dirtying, creating, updating, and
deleting to atomically switch between items that describe different
versions of a multi-item value. This also fixed a bug in the srch
xattrs where replacing an xattr would create a new id for the xattr and
leave existing srch items referencing a now deleted id. Replacing now
reuses the old id.
And finally we add back in the locking and transaction item cache
integration.
Signed-off-by: Zach Brown <zab@versity.com>
Introduce different constants for small and large metadata block
sizes.
The small 4KB size is used for the super block, quorum blocks, and as
the granularity of file data block allocation. The larger 64KB size is
used for the radix, btree, and forest bloom metadata block structures.
The bulk of this is obvious transitions from the old single constant to
the appropriate new constant. But there are a few slightly more
involved changes.
The block crc calculation now needs the caller to pass in the size of
the block. The radix function to return free bytes instead returns free
blocks and the caller is responsible for knowing how big its managed
blocks are.
Signed-off-by: Zach Brown <zab@versity.com>
Add support for reporting errors to data waiters via a new
SCOUTFS_IOC_DATA_WAIT_ERR ioctl. This allows waiters to return an error
to readers when staging fails.
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
[zab: renamed to data_wait_err, took ino arg]
Signed-off-by: Zach Brown <zab@versity.com>
File data allocations come from radix allocators which are populated by
the server before each client transaction. It's possible to fully
consume the data allocator within one transaction if the number of dirty
metadata blocks is kept low. This could result in premature ENOSPC.
This was happening to the archive-light-cycle test. If the transactions
performed by previous tests lined up just right then the creation of the
initial test files could see ENOSPC and cause all sorts of nonsense in
the rest of the test, culminating in cmp commands stuck in offline
waits.
This introduces high and low data allocator water marks for
transactions. The server tries to fill data allocators for each
transaction to the high water mark and the client forces the commit of a
transaction if its data allocator falls below the low water mark.
The archive-light-cycle test now passes easily and we see the
trans_commit_data_alloc_low counter increasing during the test.
Signed-off-by: Zach Brown <zab@versity.com>
An incorrect warning condition was added as fallocate was implemented.
It tried to warn against trying to read from the staging ioctl. But the
staging boolean is set on the inode when the staging ioctl has the inode
mutex. It protects against writes, but page reading doesn't use the
mutex. It's perfectly acceptable for reads to be attempted while the
staging ioctl is busy. We rely on that so a large read can consume
data as staging writes it.
The warning caused reads to fail while the stager ioctl was working.
Typically this would hit read-ahead and just force sync reads. But it
could hit sync reads and cause EIO.
Signed-off-by: Zach Brown <zab@versity.com>
We miscalculated the length of extents to create when initializing
offline extents for setattr_more. We were clamping the extent length in
each packed extent item by the full size of the offline extent, ignoring
the iblock position that we were starting from.
Signed-off-by: Zach Brown <zab@versity.com>
Don't return -ENOENT from fiemap on a file with no extents. The
operation is supposed to succeed with no extents.
Signed-off-by: Zach Brown <zab@versity.com>
The setattr_more ioctl has its own helper for creating uninitialized
extents when we know that there can't be any other existing extents. We
don't have to worry about freeing blocks they might have referenced.
This helper forgot to actually store the modified extents back into
packed extent items after setting extents offline.
Signed-off-by: Zach Brown <zab@versity.com>
Add a bit more tracing to stage, release, and unwritten extent
conversion so we can get a bit more visibility into the threads staging
and releasing.
Signed-off-by: Zach Brown <zab@versity.com>
Now that we have the allocators that use radix blocks we can remove all
the code that was using btree items to store free block bitmaps.
Signed-off-by: Zach Brown <zab@versity.com>
Convert metadata block and file data extent allocations to use the radix
allocator.
Most of this is simple transitions between types and calls. The server
no longer has to initialize blocks because mkfs can write a single
radix parent block with fully set parent refs to initialize a full
radix. We remove the code and fields that were responsible for adding
uninitialized data and metadata.
The rest of the unused block allocator code is only ifdefed out. It'll
be removed in a separate patch to reduce noise here.
Signed-off-by: Zach Brown <zab@versity.com>
The btree forest item storage doesn't have as much item granular state
as the item cache did. The item cache could tell if a cached item was
populated from persistent storage or was created in memory. It could
simply remove created items rather than leaving behind a deletion item.
The cached btree blocks in the btree forest item storage mechanism
can't do this. The forest has to create deletion items when deleting
newly created items because it doesn't know whether the item already
exists in the persistent record or not.
This created a problem with the extent storage we were using. The
individual extent items were stored with a key set to the last logical
block of their extent. As extents grew or shrank they often were
deleted and created at different key values during a transaction. In
the btree forest log trees this left a huge stream of deletion items
behind, one for every previous version of the extent. Then searches for
an extent covering a block would have to skip over all these deleted
items before hitting the current stored extent.
Streaming writes would pay an O(n) cost for every extent operation. It
got out of hand. This large change solves the problem by using
more coarse and stable item storage to track free blocks and blocks
mapped into file data.
For file data we now have large packed extent items which store packed
representations of all the logical mappings of a fixed region of a file.
The data code has loading and storage functions which transfer that
persistent version to and from the version that is modified in memory.
Free blocks are stored in bitmaps that are similarly efficiently packed
into fixed size items. The client is no longer working with free extent
items managed by the forest, it's working with free block bitmap btrees
directly. It needs access to the client's metadata block allocator and
block write contexts so we move those two out of the forest code and up
into the transaction.
Previously the client and server would exchange extents with network
messages. Now the roots of the btrees that store the free block bitmap
items are communicated along with the roots of the other trees involved
in a transaction. The client doesn't need to send free extents back to
the server so we can remove those tasks and rpcs.
The server no longer has to manage free extents. It transfers block
bitmap items between trees around commits. All of its extent
manipulation can be removed.
The item size portion of transaction item counts is removed because
we're not using that level of granularity now that metadata transactions
are dirty btree blocks instead of dirty items we pack into fixed sized
segments.
Signed-off-by: Zach Brown <zab@versity.com>
Convert fs callers to work with the btree forest calls instead of the
lsm item cache calls. This is mostly a mechanical syntax conversion.
The inode dirtying path does now update the item rather than simply
dirtying it.
Signed-off-by: Zach Brown <zab@versity.com>
Use the mount's generated random id in persistent items and the lock
that protects them instead of the assigned node_id.
Signed-off-by: Zach Brown <zab@versity.com>
Add an ioctl that can be used by userspace to restore a file to its
offline state. To do that it needs to set inode fields that are
otherwise not exposed and create an offline extent.
Signed-off-by: Zach Brown <zab@versity.com>
One of the core features of scoutfs is the ability to transparently
migrate file contents to and from an archive tier. For this to be
transparent we need file system operations to trigger staging the file
contents back into the file system as needed.
This adds the infrastructure which operations use to wait for offline
extents to come online and which provides userspace with a list of
blocks that the operations are waiting for.
We add some waiting infrastructure that callers use to lock, check for
offline extents, and unlock and wait before checking again to see if
they're still offline. We add these checks and waiting to data io
operations that could encounter offline extents.
This has to be done carefully so that we don't wait while holding locks
that would prevent staging. We use per-task structures to discover when
we are the first user of a cluster lock on an inode, indicating that
it's safe for us to wait because we don't hold any locks.
And while we're waiting our operation is tracked and reported to
userspace through an ioctl. This is a non-blocking ioctl, it's up to
userspace to decide how often to check and how large a region to stage.
Waiters are woken up when the file contents could have changed, not
specifically when we know that the extent has come online. This lets us
wake waiters when their lock is revoked so that they can block waiting
to reacquire the lock and test the extents again. It lets us provide
coherent demand staging across the cluster without fine grained waiting
protocols sent between the nodes. It may result in some spurious wakeups
and work but hopefully it won't, and it's a very simple and functional
first pass.
Signed-off-by: Zach Brown <zab@versity.com>
Convert client locking to call the server's lock service instead of
using a fs/dlm lockspace.
The client code gets some shims to send and receive lock messages to and
from the server. Callers use our lock mode constants instead of the
DLM's.
Locks are now identified by their starting key instead of an additional
scoped lock name so that we don't have more mapping structures to track.
The global rename lock uses keys that are defined by the format as only
used for locking.
The biggest change is in the client lock state machine. Instead of
calling the dlm and getting callbacks we send messages to our server and
get called from incoming message processing. We don't have everything
come through a per-lock work queue. Instead we send requests either
from the blocking lock caller or from a shrink work queue. Incoming
messages are called in the net layer's blocking work contexts so we
don't need to do any more work to defer to other contexts.
The different processing contexts lead to a slightly different lock
life cycle. We refactor and separate allocation and freeing from
tracking and removing locks in data structures. We add a _get and _put
to track active use of locks, and then async references to locks by
holders and requests are tracked separately.
Our lock service's rules are a bit simpler in that we'll only ever send
one request at a time and the server will only ever send one request at
a time. We do have to do a bit of work to make sure we process back to
back grant responses and invalidation requests from the server.
As of this change the lock setup and destruction paths are a little
wobbly. They'll be shored up as we add lock recovery between the client
and server.
Signed-off-by: Zach Brown <zab@versity.com>
We had gotten a bit sloppy with the workqueue flags. We needed _UNBOUND
in some workqueues where we wanted concurrency by scheduling across cpus
instead of waiting for the current (very long running) work on a cpu to
finish. We add NON_REENTRANT out of an abundance of caution. It has
gone away in modern kernels and is probably not needed here, but
according to the docs we would want it so we at least document that fact
by using it.
Signed-off-by: Zach Brown <zab@versity.com>
Freed file data extents are tracked in free extent items in each node.
They could only be re-used in the future for file data extent allocation
on that node. Allocations on other nodes or, critically, segment
allocation on the server could never see those free extents. With the
right allocation patterns, particularly allocating on node X and freeing
on node Y, all the free extents can build up on a node and starve other
allocations.
This adds a simple high water mark after which nodes start returning
free extents to the server. From there they can satisfy segment
allocations or be sent to other nodes for file data extent allocation.
Signed-off-by: Zach Brown <zab@versity.com>
Our simple transaction machinery causes high commit latencies if we let
too much dirty file data accumulate.
Small files have a natural limit on the amount of dirty data because
they have more dirty items per dirty page. They fill up the single
segment sooner and kick off a commit which finds a relatively small
amount of dirty file data.
But large files can reference quite a lot of dirty data with a small
amount of extent items which don't fill up the transaction's segment.
During large streaming writes we can fill up memory with dirty file data
before filling a segment with mapping extent metadata. This can lead to
high commit latencies when memory is full of dirty file pages.
Regularly kicking off background writeback behind streaming write
positions reduces the amount of dirty data that commits will find and
have to write out.
Signed-off-by: Zach Brown <zab@versity.com>
The previous fallocate and get_block allocators only looked for free
extents larger than the requested allocation size. This prematurely
returned -ENOSPC when a very large allocation was attempted. Some xfstests
stress low free space situations by fallocating almost all the free
space in the volume.
This adds an allocation helper function that finds the biggest free
extent to satisfy an allocation, possibly after trying to get more free
extents from the server. It looks for previous extents in the index of
extents by length. This builds on the previously added item and extent
_prev operations.
Allocators need to then know the size of the allocation they got instead
of assuming they got what they asked for. The server can also return a
smaller extent so it needs to communicate the extent length, not just
its start.
Signed-off-by: Zach Brown <zab@versity.com>
Add an extent function for iterating backwards through extents. We add
the wrapper and have the extent IO functions call their storage _prev
functions. Data extent IO can now call the new scoutfs_item_prev().
Signed-off-by: Zach Brown <zab@versity.com>
The addition of fallocate() now means that offline extents can be
unwritten and allocated and that extents can be found outside of
i_size.
Truncating needs to know about the possible flag combinations, writing
preallocation needs to know to update an existing extent or allocate up
to the next extent, get_block can't map unwritten extents for read,
extent conversion needs to also clear offline, and truncate needs to
drop extents outside i_size even if truncating to the existing file
size.
Signed-off-by: Zach Brown <zab@versity.com>
Add an fallocate operation.
This changes the possible combinations of flags in extents and makes it
possible to create extents beyond i_size. This will confuse the rest of
the code in a few places and that will be fixed up next.
Signed-off-by: Zach Brown <zab@versity.com>
The release ioctl forgot to update the inode item after truncating
online block mappings. This meant that the offline block count update
was lost when the inode was evicted and re-read, leading to inconsistent
offline block counts.
Signed-off-by: Zach Brown <zab@versity.com>
The extent code was originally written to panic if it hit errors during
cleanup that resulted in inconsistent metadata. The more reasonable
strategy is to warn about the corruption, act accordingly, and leave it
to corrective measures to resolve it. In this case we continue
returning the error that caused us to try to clean up.
Signed-off-by: Zach Brown <zab@versity.com>
This is no longer used now that we allocate large extents for
concurrently extending files by preallocating unwritten extents.
Signed-off-by: Zach Brown <zab@versity.com>
Now that we have extents we can address the fragmentation of concurrent
writes with large preallocated unwritten extents instead of trying to
allocate from disjoint free space with cursors.
First we add support for unwritten extents. Truncate needs to make sure
it doesn't treat truncated unwritten blocks as online just because
they're not offline. If we try to write into them we convert them to
written extents. And fiemap needs to flag them as unwritten and be sure
to check for extents past i_size.
Then we allocate unwritten extents only if we're extending a contiguous
file. We try to preallocate the size of the file and cap it to a meg.
This ends up with a power of two progression of preallocation sizes,
which nicely balances extent sizes and wasted allocation as file sizes
increase.
We need to be careful to truncate the preallocated regions if the entire
file is released. We take that as an indication that the user doesn't
want the file consuming any more space.
This removes most of the use of the cursor code. It will be completely
removed in a further patch.
Signed-off-by: Zach Brown <zab@versity.com>
Have the server use the extent core to maintain free extent items in the
allocation btree instead of the bitmap items.
We add a client request to allocate an extent of a given length. The
existing segment alloc and free now work with a segment's worth of
blocks.
The server maintains counters in the super block of free blocks instead
of free segments. We maintain an allocation cursor so that allocation
results tend to cycle through the device. It's stored in the super so
that it is maintained across server instances.
This doesn't remove unused dead code to keep the commit from getting too
noisy. It'll be removed in a future commit.
Signed-off-by: Zach Brown <zab@versity.com>
Store file data mappings and free block ranges in extents instead of in
block mapping items and bitmaps.
This adds the new functionality and refactors the functions that use it.
The old functions are no longer called and we stop at ifdeffing them out
to keep the change small. We'll remove all the dead code in a future
change.
Signed-off-by: Zach Brown <zab@versity.com>
Variable length keys lead to having a key struct point to the buffer
that contains the key. With dirents and xattrs now using small keys we
can convert everyone to using a single key struct and significantly
simplify the system.
We no longer have a separate generic key buf struct that points to
specific per-type key storage. All items use the key struct and fill
out the appropriate fields. All the code that paired a generic key buf
struct and a specific key type struct is collapsed down to a key struct.
There's no longer the difference between a key buf that shares a
read-only key, has its own precise allocation, or has a max size
allocation for incrementing and decrementing.
Each key user now has an init function that fills out its fields. It
looks a lot like the old pattern but we no longer have separate key
storage that the buf points to.
A bunch of code now takes the address of static key storage instead of
managing allocated keys. Conversely, swapping now uses the full keys
instead of pointers to the keys.
We don't need all the functions that worked on the generic key buf
struct because they had different lengths. Copy, clone, length init,
memcpy, all of that goes away.
The item API had some functions that tested the length of keys and
values. The key length tests vanish, and that gets rid of the _same()
call. The _same_min() call only had one user who didn't also test for
the value length being too large. Let's leave caller key constraints in
callers instead of trying to hide them on the other side of a bunch of
item calls.
We no longer have to track the number of key bytes when calculating if
an item population will fit in segments. This removes the key length
from reservations, transactions, and segment writing.
The item cache key querying ioctls no longer have to deal with variable
length keys. They simply specify the start key, the ioctls return the
number of keys copied instead of bytes, and the caller is responsible
for incrementing the next search key.
The segment no longer has to store the key length. It stores the key
struct in the item header.
The fancy variable length key formatting and printing can be removed.
We have a single format for the universal key struct. The SK_ wrappers
that bracketed calls to use preempt safe per cpu buffers can turn back
into their normal calls.
Manifest entries are now a fixed size. We can simply split them between
btree keys and values and initialize them instead of allocating them.
This means that level 0 entries don't have their own format that sorts
by the seq. They're sorted by the key like all the other levels.
Compaction needs to sweep all of them looking for the oldest and read
can stop sweeping once it can no longer overlap. This makes rare
compaction more expensive and common reading less expensive, which is
the right tradeoff.
Signed-off-by: Zach Brown <zab@versity.com>
Originally the item interfaces were written with full support for
vectored keys and values. Callers constructed keys and values made up
of header structs and data buffers. Segments supported much larger
values which could span pages when stored in memory.
But over time we've pulled that support back. Keys are described by a
key struct instead of a multi-element kvec. Values are now much smaller
and don't span pages. The item interfaces still use the kvec arrays but
everyone only uses a single element.
So let's make the world a whole lot less awful by having the item
interfaces support only a single value buffer specified by a kvec. A
bunch of code disappears and the result is much easier to understand.
Signed-off-by: Zach Brown <zab@versity.com>
Every caller of scoutfs_item_lookup_exact() provided a size that matches
the value buffer. Let's remove the redundant arg and use the value
buffer length as the exact size to match.
Signed-off-by: Zach Brown <zab@versity.com>
There were some mistakes in tracking offline blocks.
Online and offline block counts are meant to only refer to actual data
contents. Sparse blocks in an archived file shouldn't be counted as
offline.
But the code was marking unallocated blocks as offline. This could
corrupt the offline block count if a release extended past i_size and
marked the blocks in the mapping item as offline even though they're
past i_size.
We could have clamped the block walking to not go past i_size. But we
still would have had the problem of having offline blocks track sparse
blocks.
Instead we can fix the problem by only marking blocks offline if they
had allocated blocks. This means that sparse regions are never marked
offline and will always read zeros. Now a release that extends past
i_size will not do anything to the unallocated blocks in the mapping
item past i_size and the offline block count will be consistent.
(Also the 'modified' and 'dirty' booleans were redundant, we only need
one of the two.)
Signed-off-by: Zach Brown <zab@versity.com>
The super info's alloc_rwsem protects the local node free segment and
block bitmap items. The truncate code wasn't holding the rwsem so
it could race with other local node allocator item users and corrupt the
bitmaps. In the best case this could corrupt structures that trigger
EIO. The corrupt items could also create duplicate block allocations
that clobber each other and corrupt data.
Signed-off-by: Zach Brown <zab@versity.com>
We weren't doing anything with the inode blocks field. We weren't even
initializing it which explains why we'd sometimes see garbage i_blocks
values in scoutfs inodes in segments.
The logical blocks field reflects the contents of the file regardless of
whether it's online or not. It's the sum of our online and offline block
tracking.
So we can initialize it to our persistent online and offline counts and
then keep it in sync as blocks are allocated and freed.
Signed-off-by: Zach Brown <zab@versity.com>
We had an excessive number of layers between scoutfs and the dlm code in
the kernel. We had dlmglue, the scoutfs locks, and task refs. Each
layer had structs that track the lifetime of the layer below it. We
were about to add another layer to hold on to locks just a bit longer so
that we can avoid down conversion and transaction commit storms under
contention.
This collapses all those layers into a simple state machine in lock.c that
manages the mode of dlm locks on behalf of the file system.
The users of the lock interface are mainly unchanged. We did change
from a heavier trylock to a lighter nonblock lock attempt and have to
change the single rare readpage use. Lock fields change so a few
external users of those fields change.
This not only removes a lot of code it also contains functional
improvements. For example, it can now convert directly to CW locks with
a single lock request instead of having to use two by first converting
to NL.
It introduces the concept of an unlock grace period. Locks won't be
dropped on behalf of other nodes soon after being unlocked so that tasks
have a chance to batch up work before the other node gets a chance.
This can result in two orders of magnitude improvements in the time it
takes to, say, change a set of xattrs on the same file population from
two nodes concurrently.
There are significant changes to trace points, counters, and debug files
that follow the implementation changes.
Signed-off-by: Zach Brown <zab@versity.com>
We aren't using the size index. It has runtime and code maintenance
costs that aren't worth paying. Let's remove it.
Removing it from the format and no longer maintaining it are
straightforward.
The bulk of this patch is actually the act of removing it from the index
locking functions. We no longer have to predict the size that will be
stored during the transaction to lock the index items that will be
created during the transaction. A bunch of code to predict the size and
then pass it into locking and transactions goes away. Like other inode
fields we now update the size as it changes.
Signed-off-by: Zach Brown <zab@versity.com>
We weren't setting the new flag in the mapped buffer head. This tells
the caller that the buffer is newly allocated and needs to be zeroed.
Without this we expose unwritten newly allocated block contents.
fsx found this almost immediately. With this fixed fsx passes.
Signed-off-by: Zach Brown <zab@versity.com>
Call it scoutfs_inode_index_try_lock_hold since it may fail and unwind
as part of normal (not an error) operation. This lets us re-use the
name in an upcoming patch.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>