Add an item count call that lets the caller specify the exact item count
instead of having it derived from the operation they're performing.
Signed-off-by: Zach Brown <zab@versity.com>
The server send_reply interface is confusing. It uses errors to shut
down the connection, so delivering -ENOSPC to clients needs to happen in
the message reply payload instead.
The server's segno allocation processing needs to set the segno to 0 in
the reply so that the client sees it and translates it into -ENOSPC.
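Roughly, the client side translation looks like this sketch (the reply
struct and field names here are illustrative, not the real ones):

  /* a segno of 0 in the reply payload means the server had no
   * segments to hand out, which the client turns into -ENOSPC */
  static int segno_from_reply(struct net_segno_reply *ns, u64 *segno)
  {
          *segno = le64_to_cpu(ns->segno);
          if (*segno == 0)
                  return -ENOSPC;

          return 0;
  }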
Signed-off-by: Zach Brown <zab@versity.com>
We've had bugs in allocators that return success and crazy block
numbers. The bad block numbers eventually make their way down to the
context-free kernel warning that IO was attempted outside the device.
This at least gives us a stack trace to help find where it's coming
from.
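The check amounts to something like the following, with illustrative
names and return value:

  /* warn with a stack trace if an allocator hands back a block
   * number that can't possibly fit in the device */
  if (WARN_ON_ONCE(blkno >= total_device_blocks))
          return -EINVAL;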
Signed-off-by: Zach Brown <zab@versity.com>
The previous fallocate and get_block allocators only looked for free
extents larger than the requested allocation size. This prematurely
returns -ENOSPC if a very large allocation is attempted. Some xfstests
stress low free space situations by fallocating almost all the free
space in the volume.
This adds an allocation helper function that finds the biggest free
extent to satisfy an allocation, possibly after trying to get more free
extents from the server. It looks for previous extents in the index of
extents by length. This builds on the previously added item and extent
_prev operations.
Allocators then need to know the size of the allocation they got instead
of assuming they got what they asked for. The server can also return a
smaller extent so it needs to communicate the extent length, not just
its start.
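A rough sketch of the helper's shape; the helper and extent names here
are made up:

  /*
   * Find the biggest free extent to satisfy up to 'count' blocks,
   * asking the server for more free extents if we have none.  The
   * caller learns how much it actually got through *len_ret.
   */
  static int alloc_largest_extent(struct super_block *sb, u64 count,
                                  u64 *start_ret, u64 *len_ret)
  {
          struct free_extent ext;
          int ret;

          /* _prev from the largest length key finds the biggest extent */
          ret = free_extent_prev_by_len(sb, U64_MAX, &ext);
          if (ret == -ENOENT) {
                  ret = request_server_extents(sb, count);
                  if (ret == 0)
                          ret = free_extent_prev_by_len(sb, U64_MAX, &ext);
          }
          if (ret)
                  return ret;

          *start_ret = ext.start;
          *len_ret = min(count, ext.len);
          return 0;
  }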
Signed-off-by: Zach Brown <zab@versity.com>
Add an extent function for iterating backwards through extents. We add
the wrapper and have the extent IO functions call their storage _prev
functions. Data extent IO can now call the new scoutfs_item_prev().
Signed-off-by: Zach Brown <zab@versity.com>
Add scoutfs_item_prev() for searching for an item before a given key.
This wasn't initially implemented because it's rarely needed and for a
long time the segment reading and item cache populating code had a
strong bias for iterating forward from the given search key.
Since we now limit item cache reading to the keys covered by locks and
read in entire segments, it's very easy to iterate backwards through
keys just like scoutfs_item_next() iterates forwards.
The only remaining forward iteration bias was in check_range(). It had
to give callers the start of the cached range that it found.
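Its use mirrors _next; a purely illustrative call, the real prototype
may differ:

  /*
   * Walk backwards from *key, staying within the 'first' end of the
   * caller's locked range, just as scoutfs_item_next() walks forwards
   * within the 'last' end.
   */
  ret = scoutfs_item_prev(sb, key, first, val);
  if (ret == -ENOENT)
          ret = 0;        /* nothing at or before *key in the range */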
Signed-off-by: Zach Brown <zab@versity.com>
The addition of fallocate() means that offline extents can now also be
unwritten and allocated, and that extents can now be found outside of
i_size.
Truncating needs to know about the possible flag combinations; writing
into preallocation needs to know to update an existing extent or
allocate up to the next extent; get_block can't map unwritten extents
for read; extent conversion needs to also clear offline; and truncate
needs to drop extents outside i_size even if truncating to the existing
file size.
Signed-off-by: Zach Brown <zab@versity.com>
Add an fallocate operation.
This changes the possible combinations of flags in extents and makes it
possible to create extents beyond i_size. This will confuse the rest of
the code in a few places and that will be fixed up next.
Signed-off-by: Zach Brown <zab@versity.com>
The release ioctl forgot to update the inode item after truncating
online block mappings. This meant that the offline block count update
was lost when the inode was evicted and re-read, leading to inconsistent
offline block counts.
Signed-off-by: Zach Brown <zab@versity.com>
There was a typo in the addition of i_blocks tracking that would set
online blocks to the value of offline blocks when reading an existing
inode into memory.
Signed-off-by: Zach Brown <zab@versity.com>
Items deleted from the item cache would always write deletion items to
segments. We need to write deletion items so that compaction can
eventually combine them with the existing item and remove both. We
don't need them for items that were only created in the current
transaction. Writing a deletion item for them only results in a lot of
extra work compacting the item down to the final segment level so that
it can be removed.
The upcoming extent code really demonstrated the cost of this overhead.
It happens to create and delete quite a lot of temporary extent items
during the transaction as all the different kinds of indexed extents
change.
This change tracks whether a given item in the cache reflects an item
that is present in persistent storage. This lets us simply free items
that have only existed in the current transaction instead of writing
deletion items for them.
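The write-out decision then becomes something like this sketch (field
and helper names made up):

  /* only items that exist in persistent segments need a deletion
   * item written; purely-dirty deleted items can just be freed */
  if (item->deleted) {
          if (item->persistent)
                  write_deletion_item(seg, item);
          else
                  free_cached_item(cac, item);
  }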
This made a meaningful difference when writing a 4MB file with the
current block mapping items, but it made an enormous difference when
writing that same file with the extent items. It went from writing 1024
deletion items for 11 real items to only writing those real items.
                         items  deletions
  block mappings before:    25          5
  block mappings after:     25          0
  extents before:           11       1024
  extents after:            11          0
Signed-off-by: Zach Brown <zab@versity.com>
The btree writes its blocks to a fixed ring of preallocated blocks. We
added a trigger to force the index to advance to the next half of the
ring to test conditions where the cached btree blocks are out of date
with respect to the blocks on disk.
We have to be careful to only advance the index once all the live blocks
are migrated out of the half that we're about to advance to. The
trigger tested that condition.
But it missed the case where the normal btree block allocation *just*
advanced into the next half of the ring. In this case the migration
needs to occur
to make it safe to advance *again* to the previous half. But it missed
this case because the migration keys are reset after we test the
trigger.
This resulted in leaving live btree blocks in the half that we advance
to and start overwriting. The server got -ESTALE as it tried to read
through blocks that had been overwritten and hilarity ensued.
This precise condition of having the trigger fire just as we wrapped was
amazingly caught by scoutfs/505 in xfstests.
Signed-off-by: Zach Brown <zab@versity.com>
We limit the number of lower segments that a compaction will read. A
sticky compaction happens when the upper segment overlaps more lower
segments. The remaining items in the upper segment are written back to
the upper level -- they're stuck. A future compaction will attempt to
compact the remaining items with the next set of overlapping lower
segments.
Deletion items are rightly discarded as they're compacted to the lowest
level -- at that point they have no more matching items in lower
segments to destroy and are done.
Deletion items were being dropped instead of being written back into the
upper level of a sticky compaction. The test for discarding the
deletion items only considered the lowest level of the compaction, not
the level that the items were being written to. We need to be careful
to preserve the deletion items when a compaction that includes the
lowest level writes sticky items back to the upper level.
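The discard test has to look at the level the item is being written to,
roughly like this (names illustrative):

  /* only drop a deletion item when it is being written into the
   * lowest level itself; items written back to the upper level by
   * a sticky compaction must keep their deletions */
  drop = item_is_deletion(item) && (dst_level == last_level);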
Signed-off-by: Zach Brown <zab@versity.com>
Compaction has to find the oldest level 0 segment for compaction. It
iterates over the level 0 segments by their manifest entry's btree key.
It was incorrectly incrementing the btree search key. It was
incrementing the first key stored in the entry, but that's not the least
significant field. The seq is the least significant field so this
iteration could skip over segments written at different times with the
same first key.
The fix, so that it visits all the entries, is to increment the least
significant seq field.
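Conceptually the iteration key is advanced like this; the key layout and
big-endian fields here are assumed, purely for illustration:

  /* level 0 manifest keys sort with seq as the least significant
   * field, so advance seq to step to the next entry */
  key.seq = cpu_to_be64(be64_to_cpu(key.seq) + 1);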
Right now we have a single level 0 segment so this code never actually
matters.
Signed-off-by: Zach Brown <zab@versity.com>
The extent code was originally written to panic if it hit errors during
cleanup that resulted in inconsistent metadata. The more reasonable
strategy is to warn about the corruption, act accordingly, and leave
it to corrective measures to resolve it. In this case we
continue returning the error that caused us to try and clean up.
Signed-off-by: Zach Brown <zab@versity.com>
This is no longer used now that we allocate large extents for
concurrently extending files by preallocating unwritten extents.
Signed-off-by: Zach Brown <zab@versity.com>
Now that we have extents we can address the fragmentation of concurrent
writes with large preallocated unwritten extents instead of trying to
allocate from disjoint free space with cursors.
First we add support for unwritten extents. Truncate needs to make sure
it doesn't treat truncated unwritten blocks as online just because
they're not offline. If we try to write into them we convert them to
written extents. And fiemap needs to flag them as unwritten and be sure
to check for extents past i_size.
Then we allocate unwritten extents only if we're contiguously extending
the file. We try to preallocate the size of the file and cap it at a meg.
This ends up with a power of two progression of preallocation sizes,
which nicely balances extent sizes and wasted allocation as file sizes
increase.
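The sizing ends up looking roughly like this (constants and names here
are illustrative):

  /* preallocate about the current file size in blocks, capped at
   * 1MB, which doubles the preallocation as the file grows */
  prealloc = min_t(u64, isize_blocks, SZ_1M >> BLOCK_SHIFT);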
We need to be careful to truncate the preallocated regions if the entire
file is released. We take that as an indication that the user doesn't
want the file consuming any more space.
This removes most of the use of the cursor code. It will be completely
removed in a further patch.
Signed-off-by: Zach Brown <zab@versity.com>
Have the server use the extent core to maintain free extent items in the
allocation btree instead of the bitmap items.
We add a client request to allocate an extent of a given length. The
existing segment alloc and free now work with a segment's worth of
blocks.
The server maintains counters in the super block of free blocks instead
of free segments. We maintain an allocation cursor so that allocation
results tend to cycle through the device. It's stored in the super so
that it is maintained across server instances.
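The cursor handling is roughly this sketch, with assumed super block
field names:

  start = le64_to_cpu(super->alloc_cursor);
  /* ... find a free extent at or after 'start', wrapping to 0 ... */
  super->alloc_cursor = cpu_to_le64(ext.start + ext.len);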
This doesn't remove the now-unused dead code, to keep the commit from
getting too noisy. It'll be removed in a future commit.
Signed-off-by: Zach Brown <zab@versity.com>
Remove the functions that operate on online and offline blocks
independently now that the file data mapping code isn't using them any
more.
Signed-off-by: Zach Brown <zab@versity.com>
Store file data mappings and free block ranges in extents instead of in
block mapping items and bitmaps.
This adds the new functionality and refactors the functions that use it.
The old functions are no longer called; for now we just ifdef them out
to keep the change small. We'll remove all the dead code in a future
change.
Signed-off-by: Zach Brown <zab@versity.com>
Add a file of extent functions that callers will use to manipulate and
store extents in different persistent formats.
Signed-off-by: Zach Brown <zab@versity.com>
Add functions that atomically change and query the online and offline
block counts as a pair. They're semantically linked and we shouldn't
present counts that don't match if they're in the process of being
updated.
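A minimal sketch of the idea, assuming a per-inode spinlock guards the
pair (field and function names made up):

  static void add_onoff_blocks(struct scoutfs_inode_info *si,
                               s64 on, s64 off)
  {
          spin_lock(&si->block_lock);
          si->online_blocks += on;
          si->offline_blocks += off;
          spin_unlock(&si->block_lock);
  }

  static void get_onoff_blocks(struct scoutfs_inode_info *si,
                               u64 *on, u64 *off)
  {
          spin_lock(&si->block_lock);
          *on = si->online_blocks;
          *off = si->offline_blocks;
          spin_unlock(&si->block_lock);
  }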
Signed-off-by: Zach Brown <zab@versity.com>
Add the max possible logical block / physical blkno number given u64
bytes recorded at block size granularity.
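With u64 byte offsets the limit is simply a shift; macro names here are
illustrative:

  /* the largest block index that a u64 byte count can reference
   * at block size granularity */
  #define BLOCK_MAX       (U64_MAX >> BLOCK_SHIFT)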
Signed-off-by: Zach Brown <zab@versity.com>
The BUG_ON() at the start of scoutfs_lock_destroy() was intended to
ensure that scoutfs_lock_shutdown() had been called first. But that
doesn't happen in the case where we get an error during mount.
The _destroy() function is careful to notice active use and only tears
down resources that were created. The BUG_ON() can just be removed.
Signed-off-by: Zach Brown <zab@versity.com>
We hadn't yet implemented any error handling in the server when commits
fail.
Commit errors are serious and we take them as a sign that something has
gone horribly wrong. This patch prints commit error warnings to the
console and shuts down. Clients will try to reconnect and resend their
requests.
The hope is that another server will be able to make progress. But this
same node could become the server again and it could well be that the
errors are persistent.
The next steps are to implement server startup backoff, client retry
backoff, and hard failure policies.
Signed-off-by: Zach Brown <zab@versity.com>
The manifest root request processing samples the stable_manifest_root in
the server info. The stable_manifest_root is updated after a
commit has succeeded.
The read of stable_manifest_root in request processing was locking the
manifest. The update during commit doesn't lock the manifest so these
paths were racing. The race is very tight, a few cpu stores, but it
could in theory give a client a malformed root that could be
misinterpreted as corruption.
Add a seqcount around the store of the stable manifest root during
commit and its load during request processing. This ensures that
clients always get a consistent manifest root.
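The pattern is the usual seqcount publish/sample pair; a sketch with
assumed struct and field names:

  static void set_stable_manifest_root(struct server_info *server,
                                       struct scoutfs_btree_root *root)
  {
          write_seqcount_begin(&server->stable_seqcount);
          server->stable_manifest_root = *root;
          write_seqcount_end(&server->stable_seqcount);
  }

  static void get_stable_manifest_root(struct server_info *server,
                                       struct scoutfs_btree_root *root)
  {
          unsigned int seq;

          do {
                  seq = read_seqcount_begin(&server->stable_seqcount);
                  *root = server->stable_manifest_root;
          } while (read_seqcount_retry(&server->stable_seqcount, seq));
  }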
Signed-off-by: Zach Brown <zab@versity.com>
The management of _checked and _valid_crc private bits in the
buffer_head wasn't quite right.
_checked indicates that the block has been checked and that the
expensive crc verification doesn't need to be recalculated. _valid_crc
then indicates the result of the crc verification.
_checked is read without locks. First, we didn't make sure that
_valid_crc was stored before _checked. Multiple tasks could race to see
_checked before _valid_crc. So we add some memory barriers.
Then we didn't clear _checked when re-reading a stale block. This meant
that the moment the block was read its private flags could still
indicate that it had a valid crc. We clear the private bits before we
read so that we'll recalculate the crc.
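The ordering is the usual publish pattern; a sketch with assumed private
bit names:

  /* writer: publish the crc result before setting _checked */
  if (crc_ok)
          set_bit(BH_PrivValidCrc, &bh->b_state);
  smp_wmb();
  set_bit(BH_PrivChecked, &bh->b_state);

  /* reader: only trust _valid_crc after seeing _checked */
  if (test_bit(BH_PrivChecked, &bh->b_state)) {
          smp_rmb();
          crc_ok = test_bit(BH_PrivValidCrc, &bh->b_state);
  }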
Signed-off-by: Zach Brown <zab@versity.com>
Waiting for replies to sent requests wasn't interruptible. This was
preventing ctl-c from breaking out of mount when a server wasn't yet
around to accept connections.
The only complication was that the receive thread was accessing the
sender's struct outside of the lock. An interrupted sender could remove
their struct while receive was processing it. We rework recv processing
so that it only uses the sender struct under the lock. This introduces
a cpu copy of the payload but they're small and relatively infrequent
control messages.
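The wait itself becomes interruptible, along these lines (waitqueue and
condition names assumed):

  /* let a signal (e.g. ctl-c during mount) break out of the wait */
  ret = wait_event_interruptible(sender->waitq, sender->have_reply);
  /* ret is -ERESTARTSYS if a signal interrupted the wait */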
Signed-off-by: Zach Brown <zab@versity.com>
Emit an error message if the server fails to bind. It can mean that
there is a badly configured address. But we might still want to bind if
the address later becomes available, so we don't return a hard error.
We only emit
the message once for a series of failures.
Signed-off-by: Zach Brown <zab@versity.com>
The introduction of the helper to handle stale segment retrying was
masking errors. It's meant to pass through the caller's return status
when it doesn't return -EAGAIN to trigger stale read retries.
Signed-off-by: Zach Brown <zab@versity.com>
We were only issuing one kernel warning when we couldn't resolve a path
to an inode due to excessive retries. It was hard to capture and we
only saw details from the first instance.
This adds a counter for each time we see excessive retries and returns
-ELOOP in that case. We also extend the link backref adding trace point
to include the found entry, if any.
Signed-off-by: Zach Brown <zab@versity.com>
Add a tunable option to force using tiny btree blocks on an active
mount. This lets us quickly exercise large btrees.
Signed-off-by: Zach Brown <zab@versity.com>
Now that we're using small file system keys we can dramatically shrink
the maximum allowed btree keys and values. This more accurately matches
the current users and lets us fit more possible items in each block,
which in turn lets us turn the block size way down and still have
multiple worst case largest items per block.
Signed-off-by: Zach Brown <zab@versity.com>
Add a trigger that lets us force advancing btree block allocation to the
start of the next half of the ring. It's only safe to do this once
migration has moved
all the blocks out of the old half.
Signed-off-by: Zach Brown <zab@versity.com>
The stale block handling code only handled the case where we read
through a stale root into blocks that have been overwritten in the
persistent store. In this case you'll get a new root and the read will
be OK.
It didn't handle the case where we have stale blocks cached at the
blocks of the legitimate current root. In this case we get -ESTALE from
each stale block and because the root doesn't change when we retry we
assume the persistent structure is corrupt.
This case can happen when the btree ring wraps and there are still
blocks cached at the head of the ring. This became much more possible
when we moved to small fixed size keys.
The fix is to retry reading individual blocks or segments before
returning -ESTALE and expecting the caller to get a new root and try
again. In the stale cache case this will allow the more recent correct
blocks to be read.
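The retry amounts to something like this sketch (helper names made up):

  /* drop the cached stale block and re-read it once before giving
   * up and returning -ESTALE to the caller */
  ret = read_block(sb, blkno, &bl);
  if (ret == -ESTALE) {
          invalidate_cached_block(sb, blkno);
          ret = read_block(sb, blkno, &bl);
  }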
Signed-off-by: Zach Brown <zab@versity.com>
Inode allocations come from batches that are reserved for directories.
As the batch is exhausted a new one is acquired and allocated from.
The batch size was arbitrarily set to the human friendly 10000. This
doesn't interact well with the lock group size being a power of two.
Each allocation batch will straddle an inode group with its previous and
next inode batch.
This often doesn't matter because directories very rarely have more than
9000 entries. But as entries pass 10000 they'd see surprising
contention with other inode ranges in directories.
Tweak the allocation size to be a multiple of the lock group size to
stop this from happening.
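With made-up numbers: if a lock group covers 1024 inodes, a batch of
10000 is 9 * 1024 + 784, so consecutive batches straddle group
boundaries; rounding the batch to a multiple of the group size avoids
that:

  /* illustrative values only: keep the batch a whole number of
   * lock groups so batches never straddle a group boundary */
  #define LOCK_INO_GROUP_NR       1024
  #define INO_ALLOC_BATCH         (10 * LOCK_INO_GROUP_NR)   /* 10240 */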
Signed-off-by: Zach Brown <zab@versity.com>
Previously we changed item reading to try and read from the start of its
locked range instead of from the key that wasn't found in the cache.
This greatly improved the performance of access patterns that didn't
proceed in key order.
We rightly shrank the range of items that we'd claim to cache by the
segments that we read. But we missed the case where our search key
falls between two segments and we chose to read the next segment instead
of the previous. If the previous segment in this case overlapped with
the lock range then we were claiming to cache the segment's contents but
weren't reading it.
This would result in bad negative caching of items that existed.
scoutfs/500 was tripping over this as it tried to rename a file created
by another node. The local renaming node would try to look up a key
that only existed in level 0, and it would negatively cache the items
in the previous level 1 segment without reading them.
We fix this by shrinking the caching range down as we're considering
manifest entries instead of up as we process each segment read because
we have to shrink based on the segments in the manifest, not the ones we
chose to read.
With this fixed the rename can see those items in the level 1 segment
again.
Signed-off-by: Zach Brown <zab@versity.com>