scoutfs

mirror of https://github.com/versity/scoutfs.git synced 2026-06-02 17:56:20 +00:00

Author	SHA1	Message	Date
Zach Brown	7e935898ab	Avoid premature metadata enospc server_get_log_trees() sets the low flag in a mount's meta_avail allocator, triggering enospc for any space consuming allocations in the mount, if the server's global meta_avail pool falls below the reserved block count. Before each server transaction opens we swap the global meta_avail and meta_freed allocators to ensure that the transaction has at least the reserved count of blocks available. This creates a risk of premature enospc as the global meta_avail pool drains and swaps to the larger meta_freed. The pool can be close to the reserved count, perhaps at it exactly. _get_log_trees can fill the client's mount, even a little, and drop the global meta_avail total under the reserved count, triggering enospc, even though meta_freed could have had quite a lot of blocks. The fix is to ensure that the global meta_avail has 2x the reserved count and swapping if it falls under that. This ensures that a server transaction can consume an entire reserved count and still have enough to avoid triggering enospc. This fixes a scattering of rare premature enospc returns that were hitting during tests. It was rare for meta_avail to fall just at the reserved count and for get_log_trees to have to refill the client allocator, but it happened. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-30 13:26:32 -07:00
Zach Brown	6d0694f1b0	Add resize_devices ioctl and scoutfs command Add a scoutfs command that uses an ioctl to send a request to the server to safely use a device that has grown. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-30 13:26:32 -07:00
Zach Brown	fd686cab86	Fix total_data_blocks calculation in mkfs mkfs was incorrectly initializing total_data_blocks. The field is meant to record the number of blocks from the start of the device that the filesystem could access. mkfs was subtracting the initial reserved area of the device, meaning the number of blocks that the filesystem might access. This could allow accesses past devices if mount checks the device size against the smaller total_data_blocks. And we're about to use total_data_blocks as the start of a new extent to add when growing the volume. It needs to be fixed so that this new grown free extent doesn't overlap with the end of the existing free extents. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-30 13:26:32 -07:00
Zach Brown	4c1181c055	Remove first_ and last_ super blkno fields There are fields in the super block that specify the range of blocks that would be used for metadata or data. They are from the time when a single block device was carved up into regions for metadata and data. They don't make sense now that we have separate metadata and data block devices. The starting blkno is static and we go to the end of the device. This removes the fields now that they serve no purpose. The only use of them to check that freed extents fell within the correct bounds can still be performed by using the static starting number or roughly using the size of the devices. It's not perfect, but this is already only a check to see that the blknos aren't utter nonsense. We're removing the fields now to avoid having to update them while worrying about users when resizing devices. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-30 13:22:42 -07:00
Zach Brown	d6bed7181f	Remove almost all interruptible waits As subsystems were built I tended to use interruptible waits in the hope that we'd let users break out of most waits. The reality is that we have significant code paths that have trouble unwinding. Final inode deletion during iput->evict in a task is a good example. It's madness to have a pending signal turn an inode deletion from an efficient inline operation to a deferred background orphan inode scan deletion. It also happens that golang built pre-emptive thread scheduling around signals. Under load we see a surprising amount of signal spam and it has created surprising error cases which would have otherwise been fine. This changes waits to expect that IOs (including network commands) will complete reasonably promptly. We remove all interruptible waits with the notable exception of breaking out of a pending mount. That requires shuffling setup around a little bit so that the first network message we wait for is the lock for getting the root inode. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-30 13:22:42 -07:00
Zach Brown	4893a6f915	scoutfs_dirents_equal should return bool It looks like it returned u64 because it was derived from _name_hash(). Signed-off-by: Zach Brown <zab@versity.com>	2021-07-30 13:22:42 -07:00
Zach Brown	384590f016	Sync net shouldn't wait for errored submits If async network request submission fails then the response handler will never be called. The sync request wrapper made the mistake of trying to wait for completion when initial submission failed. This never happened in normal operation but we're able to trigger it with some regularity with forced unmount during tests. Unmount would hang waiting for work to shutdown which was waiting for request responses that would never happen. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-30 13:22:42 -07:00
Zach Brown	192f077c16	Update data_version when fallocate changes size Changing the file size can change the file contents -- reads will change when they stop returning data. fallocate can change the file size and if it does it should increment the data_version, just like setattr does. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-30 13:22:42 -07:00
Zach Brown	a9baeab22e	stage_tmpfile test gets current data_version The stage_tmpfile test util was written when fallocate didn't update data_version for size extensions. It is more correct to get the data_version after fallocate changes data_versions for however many transactions, extent allocations, and i_size extensions it took to allocate space. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-30 13:22:42 -07:00
Zach Brown	b7ab26539a	Avoid lockdep warning about upstream inversion Some kernels have blkdev_reread_part acquire the bd_mutex and then call into drop_partitions which calls fsync_bdev which acquires s_umount. This inverts the usual pattern of deactivate_super getting s_umount and then using blkdev_put in kill_sb->put_super to drop a second device. The inversion has been fixed upstream by years of rewrites. We can't go back in time to fix the kernels that we're testing against, unfortunately, so we disable lockdep around our valid leg of the inversion that lockdep is noticing in our testing. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-30 13:22:42 -07:00
Zach Brown	c51f0c37da	Defer dirty inode data writeback (and use list) iput() can only be used in contexts that could perform final inode deletion which requires cluster locks and transactions. This is absolutely true for the transaction committing worker. We can't have deletion during transaction commit trying to get locks and dirty more items in the transaction. Now that we're properly getting locks in final inode deletion and O_TMPFILE support has put pressure on deletion, we're seeing deadlocks between inode eviction during transaction commit getting a index lock and index lock invalidation trying to commit. We use the newly offered queued iput to defer the iput from walking our dirty inodes. The transaction commit will be able to proceed while the iput worker is off waiting for a lock. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-30 13:20:40 -07:00
Zach Brown	52107424dd	Promote deferred iput to inode call Lock invalidation had the ability to kick iput off to work context. We need to use it for inode writeback as well so we move the mechanism over to inode.c and give it a proper call. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-30 11:34:52 -07:00
Zach Brown	099a65ab07	Try recovering from truncate errors and more info We're seeing errors during truncate that are surprising. Let's try and recover from them and provide more info when they happen so that we can dig deeper. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-30 11:34:52 -07:00
Zach Brown	21c5724dd5	Update fenced service file StartLimitBurst The first draft was written against an older schema, StartLimitBurst is in [Service] now. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-30 11:34:52 -07:00
Zach Brown	3974d98f6b	Don't use "/dev/*" redirections near systemd It sets up stdout and stderr as sockets, not pipes, so these links don't work. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-30 11:34:52 -07:00
Zach Brown	2901b43906	Also allow omap requests to disconnected clients We recently fixed problems sending omap responses to originating clients which can race with the clients disconnecting. We need to handle the requests sent to clients on behalf of an origination request in exactly the same way. The send can race with the client being evicted. It'll be cleaned after the race is safely ignored by the client's rid being removed from the server's request tracking. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-30 11:34:52 -07:00
Zach Brown	03d7a4e7fe	Show relative times in quorum status file output The times in the quorum status file are in absolute monotonic kernel time since bootup. That's not particularly helpful especially when comparing across hosts with different boot times. This shows relative times in timespec64 seconds until or since the times in question. While we're at it we also collect the send and receive timestamps closer to each send or receive call. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-30 11:34:52 -07:00
Zach Brown	d5d3b12986	Specifically shutdown quorum during forced unmount Generally, forced unmount works by returning errors for all IO. Quorum is pretty resilient in that it can have the IO errors eaten by server startup and does its own messaging that won't return errors. Trying to force unmount can have the quorum service continually participate in electing a server that immediately fails and shuts down. This specifically shuts down the internal quorum service when it sees that unmount is being forced. This is easier and cleaner than having the network IO return errors and then having that trigger shutdown. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-30 11:34:52 -07:00
Zach Brown	e4dca8ddcc	Don't shutdown quorum if server startup fails The quorum service shuts down if it sees errors that mean that it can't do its job. This is mostly fatal errors gathering resources at startup or runtime IO errors but it was also shutting down if server startup fails. That's not quite right. This should be treated like the server shutting down on errors. Quorum needs to stay around to participate in electing the next server. Fence timeouts could trigger this. A quorum mount could crash, the next server without a fence script could have a fence request timeout and shutdown, and now the third remaining server is left to indefinitely send vote requests into the void. With this fixed, continuing that example, the quorum service in the second mount remains to elect the third server with a working fence script after the second server shuts down after its fence request times out. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-30 11:34:52 -07:00
Zach Brown	011b7d52e5	Merge pull request #45 from versity/ben/systemd_configs Add fenced systemd and example configs	2021-07-09 08:39:18 -07:00
Ben McClelland	3a9db45194	Add fenced systemd and example configs This should be good enough to get single node mounts up and running with fenced with minimal effort. The example config will need to be copied to /etc/scoutfs/scoutfs-fenced.conf for it to be functional, so this still requires specific opt-in and won't accidentally run for multi-node systems. Signed-off-by: Ben McClelland <ben.mcclelland@versity.com>	2021-07-09 08:22:39 -07:00
Zach Brown	53f11f5479	Merge pull request #46 from versity/zab/orphan_deletion_and_enospc Zab/orphan deletion and enospc	2021-07-08 10:52:53 -07:00
Zach Brown	b4ede2ac6a	Allow omap responses to disconnected originators The omap message lifecycle is a little different than the server's usual handling that sends a response from the request handler. The response is sent long after the initial receive handler is pinning the connection to the client. It's fine for the response to be dropped. The main server request handler handled this case but other response senders didn't. Put this error handling in the server response sender itself so that all callers are covered. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-08 09:36:07 -07:00
Zach Brown	cbe8d77f78	Prevent duplicate inode item deletion We hide I_FREEING inodes from inode lookup to avoid inversions with cluster locking. This can result in duplicate inode structs for a given inode number. They can both race to try and delete the same items for their shared inode number. This leads to error messages from evict_inode and could lead to corruption if they, for example, both try and free the same data extents. This adds very basic serialization so only one instance can try to delete items at a time. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-07 14:13:14 -07:00
Zach Brown	5f682dabb5	Item cache invalidation uses seqs to avoid readers The item cache has to be careful not to insert stale read items when previously dirty items have been written and invalidated while a read was in flight. This was previously done by recording the possible range of items that a reader could see based on the key range of its lock. This is disastrous when a workload operates entirely within one lock. I ran into this when testing a small number of files with massive amounts of xattrs. While any reader is in flight all pages can't be invalidated because they all intersect with the one lock that covers all the items in use. The fix is to more naturally reflect the problem by tracking the greatest item seq in pages and the earliest seq that any readers can't see. This lets invalidate only skip pages with items that weren't visible to the earliest reader. This more naturally reflects that the problem is due to the age of the items, not their position in the key space. Now only a few of the most recently modified pages could be skipped and they'll be at the end of the LRU and won't typically be visited. As an added benefit it's now much cheaper to add, delete, and test the active readers. This fix stopped rm -rf of a full system's worth of xattrs from taking minutes constantly spinning skipping all pages in the LRU to seconds of doing real removal work. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-07 14:13:14 -07:00
Zach Brown	120c2d342a	Add create_xattr_loop test tool Add a quick tool that creates xattrs in a tight loop. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-07 14:13:14 -07:00
Zach Brown	84454b38c5	Add mkfs -A for small device sizes Normally mkfs would fail if we specify meta or data devices that are too small. We'd like to use small devices for test scenarios, though, so add an option to allow specifying sizes smaller than the minimum required sizes. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-07 14:13:14 -07:00
Zach Brown	29cfa81574	Remove unused leftovers from quorum changes These forward declarations were for interfaces that have since been removed or changed and are no longer needed. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-07 14:13:14 -07:00
Zach Brown	73bf916182	Return ENOSPC as space gets low Returning ENOSPC is challenging because we have clients working on allocators which are a fraction of the whole and we use COW transactions so we need to be able to allocate to free. This adds support for returning ENOSPC to client posix allocators as free space gets low. For metadata, we reserve a number of free blocks for making progress with client and server transactions which can free space. The server sets the low flag in a client's allocator if we start to dip into reserved blocks. In the client we add an argument to entering a transaction which indicates if we're allocating new space (as opposed to just modifying existing data or freeing). When an allocating transaction runs low and the server low flag is set then we return ENOSPC. Adding an argument to transaction holders and having it return ENOSPC gave us the opportunity to clean it up and make it a little clearer. More work is done outside the wait_event function and it now specifically waits for a transaction to cycle when it forces a commit rather than spinning until the transaction worker acquires the lock and stops it. For data the same pattern applies except there are no reserved blocks and we don't COW data so it's a simple case of returning the hard ENOSPC when the data allocator flag is set. The server needs to consider the reserved count when refilling the client's meta_avail allocator and when swapping between the two meta_avail and meta_free allocators. We add the reserved metadata block count to statfs_more so that df can subtract it from the free meta blocks and make it clear when enospc is going to be returned for metadata allocations. We increase the minimum device size in mkfs so that small testing devices provide sufficient reserved blocks. And finally we add a little test that makes sure we can fill both metadata and data to ENOSPC and then recover by deleting what we filled. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-07 14:13:14 -07:00
Zach Brown	9db3b475c0	Stop log merge work earlier during unmount The forest log merge work calls into the client to send commit requests to the server. The forest is usually destroyed relatively late in the sequence and can still be running after the client is destroyed. Adding a _forest_stop call lets us stop the log merging work before the client is destroyed. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-02 10:54:56 -07:00
Zach Brown	24d682bf81	Add orphan-inodes test Signed-off-by: Zach Brown <zab@versity.com>	2021-07-02 10:54:56 -07:00
Zach Brown	2957f3e301	Avoid warnings when evict has signals pending Killing a task can end up in evict and break out of acquiring the locks to perform final inode deletion. This isn't necessarily fatal. The orphan task will come around and will delete the inode when it is truly no longer referenced. So let's silence the error and keep track of how many times it happens. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-02 10:54:56 -07:00
Zach Brown	07210b5734	Reliably delete orphaned inodes Orphaned items haven't been deleted for quite a while -- the call to the orphan inode scanner has been commented out for ages. The deletion of the orphan item didn't take rid zone locking into account as we moved deletion from being strictly local to being performed by whoever last used the inode. This reworks orphan item management and brings back orphan inode scanning to correctly delete orphaned inodes. We get rid of the rid zone that was always _WRITE locked by each mount. That made it impossible for other mounts to get a _WRITE lock to delete orphan items. Instead we rename it to the orphan zone and have orphan item callers get _WRITE_ONLY locks inside their inode locks. Now all nodes can create and delete orphan items as they have _WRITE locks on the associated inodes. Then we refresh the orphan inode scanning function. It now runs regularly in the background of all mounts. It avoids creating cluster lock contention by finding candidates with unlocked forest hint reads and by testing inode caches locally and via the open map before properly locking and trying to delete the inode's items. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-02 10:52:46 -07:00
Zach Brown	0374661a92	Merge pull request #43 from versity/zab/btree_merging Zab/btree merging	2021-06-22 13:16:30 -07:00
Zach Brown	28759f3269	Rotate srch files as log trees items are reclaimed The log merging work deletes log trees items once their item roots are merged back into the fs root. Those deleted items could still have populated srch files that would be lost. We force rotation of the srch files in the items as they're reclaimed to turn them into rotated srch files that can be compacted. Signed-off-by: Zach Brown <zab@versity.com>	2021-06-17 09:37:45 -07:00
Zach Brown	5c3fdb48af	Fix btree join item movement Refilling a btree block by moving items from its siblings as it falls under the join threshold had some pretty serious mistakes. It used the target block's total item count instead of the sibling's when deciding how many items to move. It didn't take item moving overruns into account when deciding to compact so it could run out of contiguous free space as it moved the last item. And once it compacted it returned without moving because the return was meant to be in the error case. This is all fixed by correctly examining the sibling block to determine if we should join a block up to 75% full or move a big chunk over, compacting if the free space doesn't have room for an excessive worst case overrun, and fixing the compaction error checking return typo. Signed-off-by: Zach Brown <zab@versity.com>	2021-06-17 09:36:00 -07:00
Zach Brown	a7828a6410	Add log merge item allocators to alloc detail The alloc iterator needs to find and include the totals of the avail and freed allocator list heads in the log merge items. Signed-off-by: Zach Brown <zab@versity.com>	2021-06-17 09:36:00 -07:00
Zach Brown	a1d46e1a92	Fix mkfs btree item offset calculation mkfs was miscalculating the offset of the start of the free region in the center of blocks as it populated blocks with items. It was using the length of the free region as its offset in the block. To find the offset of the end of the free region in the block it has to be taken relative to the end of the item array. Signed-off-by: Zach Brown <zab@versity.com>	2021-06-17 09:36:00 -07:00
Zach Brown	d67db6662b	Fix item cache val_len alignment math Some item_val_len() callers were applying alignment twice, which isn't needed. And additions to erased_bytes as value lengths change didn't take alignment into account. They could end up double counting if val_len changes within the alignment are then accounted for again as the full item and alignment is later deleted. Additions to erased_bytes based on val_len should always take alignment into account. Signed-off-by: Zach Brown <zab@versity.com>	2021-06-17 09:36:00 -07:00
Zach Brown	c5c050bef0	Item cache might free null page on alloc error The item cache allocates a page and a little tracking struct for each cached page. If the page allocation fails it might try to free a null page pointer, which isn't allowed. Signed-off-by: Zach Brown <zab@versity.com>	2021-06-17 09:36:00 -07:00
Zach Brown	96d286d6e5	Zero btree item padding as items are created Item creation, which fills out a new item at the end of the array of item structs at the start of the block, didn't explicitly zero the item struct padding to 0. It would only have been zero if the memory was already zero, which is likely for new blocks, but isn't necessarily true if the memory had previously been used by deleted values. Signed-off-by: Zach Brown <zab@versity.com>	2021-06-17 09:36:00 -07:00
Zach Brown	9febc6b5dc	Update btree block validator for 8byte alignment The change to aligning values didn't update the btree block verifier's total length calculation, and while we're in there we can also check that values are correctly aligned. Signed-off-by: Zach Brown <zab@versity.com>	2021-06-17 09:36:00 -07:00
Zach Brown	045b3ca8d4	Expand unused btree verifying walker Previously we had an unused function that could be flipped on to verify btree blocks during traversal. This refactors the block verifier a bit to be called by a verifying walker. This will let callers walk paths to leaves to verify the tree around operations, rather than verification being performed during the next walk. Signed-off-by: Zach Brown <zab@versity.com>	2021-06-17 09:36:00 -07:00
Zach Brown	ff882a4c4f	Add btree total_above_join_low_water() test Take the condition used to decide if a btree block needs to be joined and put it in total_above_join_low_water() so that btree_merging will be able to call it to see if the leaf block it's merging into needs to be joined. Signed-off-by: Zach Brown <zab@versity.com>	2021-06-17 09:36:00 -07:00
Zach Brown	3d1a0f06c0	Add scoutfs_btree_free_blocks Add a btree function for freeing all the blocks in a btree without having to cow the blocks to track which refs have been freed. We use a key from the caller to track which portions of the tree have been freed. Signed-off-by: Zach Brown <zab@versity.com>	2021-06-17 09:36:00 -07:00
Zach Brown	3488b4e6e0	Add scoutfs print support for log merge items Add support for printing all the items in the log_merge tree that the server uses to track log merging. Signed-off-by: Zach Brown <zab@versity.com>	2021-06-17 09:36:00 -07:00
Zach Brown	c482204fcf	Clean up btree root printing in superblock Over time the printing of the btree roots embedded in the super block has gotten a little out of hand. Add a helper macro for the printf format and args and re-order them to match their order in the superblock. Signed-off-by: Zach Brown <zab@versity.com>	2021-06-17 09:36:00 -07:00
Zach Brown	9711fef122	Update for core, trans, and item seq use We now have a core seq number in the super that is advanced for multiple users. The client transaction seq comes from the core seq so we remove the trans_seq from the super. The item version is also converted to use a seq that's derived from the core seq. Signed-off-by: Zach Brown <zab@versity.com>	2021-06-17 09:36:00 -07:00
Zach Brown	91acf92666	Add client btree merge processing Add the client work which is regularly scheduled to ask the server for log merging work to do. The relatively simple client work gets a request from the server, finds the log roots to merge given the request seq, performs the merge with a btree call and callbacks, and commits the result to the server. Signed-off-by: Zach Brown <zab@versity.com>	2021-06-17 09:36:00 -07:00
Zach Brown	9c2122f7de	Add server btree merge processing This adds the server processing side of the btree merge functionality. The client isn't yet sending the log_merge messages so no merging will be performed. The bulk of the work happens as the server processes a get_log_merge message to build a merge request for the client. It starts a log merge if one isn't in flight. If one is in flight it checks to see if it should be spliced and maybe finished. In the common case it finds the next range to be merged and sends the request to the client to process. The commit_log_merge handler is the completion side of that request. If the request failed then we unwind its resources based on the stored request item. If it succeeds we record it in an item for get_ processing to splice eventually. Then we modify two existing server code paths. First, get_log_tree doesn't just create or use a single existing log btree for a client mount. If the existing log btree is large enough it sets its finalized flag and advances the nr to use a new log btree. That makes the old finalized log btree available for merging. Then we need to be a bit more careful when reclaiming the open log btree for a client. We can't use next to find the only open log btree, we use prev to find the last and make sure that it isn't already finalized. Signed-off-by: Zach Brown <zab@versity.com>	2021-06-17 09:36:00 -07:00


1453 Commits