scoutfs

mirror of https://github.com/versity/scoutfs.git synced 2026-04-19 04:55:06 +00:00

Author	SHA1	Message	Date
Bryant Duffy-Ly	66b8c5fbd7	Enhance clarity of some kfree paths In some of the allocation paths there are goto statements that end up calling kfree(). That is fine, but in cases where the pointer is not initially set to NULL we might have undefined behavior. kfree on a NULL pointer does nothing, so these changes should not change behavior, but they make the code paths clearer. Signed-off-by: Bryant Duffy-Ly <bduffyly@versity.com>	2021-10-06 18:07:27 -05:00
Zach Brown	6ca8c0eec2	Consistently initialize dentry info Unfortunately, we're back in kernels that don't yet have d_op->d_init. We allocate our dentry info manually as we're given dentries. The recent verification work forgot to consistently make sure the info was allocated before using it. Fix that up, and while we're at it be a bit more robust in how we check to see that it's been initialized without grabbing the d_lock. Signed-off-by: Zach Brown <zab@versity.com>	2021-09-13 14:41:07 -07:00
Zach Brown	ea2b01434e	Add support for i_version This adds i_version to our inode and maintains it as we allocate, load, modify, and store inodes. We set the flag in the superblock so in-kernel users can use i_version to see changes in our inodes. Signed-off-by: Zach Brown <zab@versity.com>	2021-09-13 14:41:07 -07:00
Zach Brown	d5eec7d001	Fix uninitialized srch ret that won't happen More recent gcc notices that ret in delete_files can be undefined if nr is 0 while missing that we won't call delete_files in that case. Seems worth fixing, regardless. Signed-off-by: Zach Brown <zab@versity.com>	2021-09-13 14:41:07 -07:00
Zach Brown	b9a0f1709f	Add xattr .totl. tag Add the .totl. xattr tag. When the tag is set the end of the name specifies a total name with 3 encoded u64s separated by dots. The value of the xattr is a u64 that is added to the named total. An ioctl is added to read the totals. Signed-off-by: Zach Brown <zab@versity.com>	2021-09-13 14:41:07 -07:00
Zach Brown	a59fd5865d	Add seq and flags to btree items The fs log btrees have values that start with a header that stores the item's seq and flags. There's a lot of sketchy code that manipulates the value header as items are passed around. This adds the seq and flags as core item fields in the btree. They're only set by the interfaces that are used to store fs items: _insert_list and _merge. The rest of the btree items that use the main interface don't work with the fields. This was done to help delta items discover when logged items have been merged before the finalized log btrees are deleted and the code ends up being quite a bit cleaner. Signed-off-by: Zach Brown <zab@versity.com>	2021-09-09 14:44:55 -07:00
Zach Brown	46edf82b6b	Add inode crtime creation time Add an inode creation time field. It's created for all new inodes. It's visible to stat_more. setattr_more can set it during restore. Signed-off-by: Zach Brown <zab@versity.com>	2021-09-03 11:14:41 -07:00
Zach Brown	79fbaa6481	Verify dentries after locking Our dir methods were trusting dentry args. The vfs code paths use i_mutex to protect dentries across revalidate or lookup and method calls. But that doesn't protect methods running in other mounts. Multiple nodes can interleave the initial lookup or revalidate then actual method call. Rename got this right. It is very paranoid about verifying inputs after acquiring all the locks it needs. We extend this pattern to the rest of the methods that need to use the mapping of name to inode (and our hash and pos) in dentries. Once we acquire the parent dir lock we verify that the dentry is still current, returning -EEXIST or -ENOENT as appropriate. Along these lines, we tighten up dentry info correctness a bit by updating our dentry info (recording lock coverage and hash/pos) for negative dentries produced by lookup or as the result of unlink. Signed-off-by: Zach Brown <zab@versity.com>	2021-08-31 09:49:32 -07:00
Zach Brown	ad5662b892	Handle dupe invalidation requests during recovery Client lock invalidation handling was very strict about not receiving duplicate invalidation requests from the server because it could only track one pending request. The promise to only send one invalidate at a time is made by one server, it can't be enforced across server failover. Particularly because invalidation processing can have to do quite a lot of work with the server as it tears down state associated with the lock. We fix this by recording and processing each individual incoming invalidation request on the lock. The code that handled reordering of incoming grant responses and invalidation requests waited for the lock's mode to match the old mode in the invalidation request before proceeding. That would have prevented duplicate invalidation requests from making forward progress. To fix this we make lock client receive processing synchronous instead of going through async work which can reorder. Now grant responses are processed as they're received and will always be resolved before all the invalidation requests are queued and processed in order. Signed-off-by: Zach Brown <zab@versity.com>	2021-08-25 10:14:38 -07:00
Zach Brown	f5577e26b1	Reset item state when retrying stale forest reads The forest reader reads items from the fs_root and all log btrees and gives them to the caller who tracks them to resolve version differences. The reads can run into stale blocks which have been overwritten. The forest reader was implementing the retry under the item state in the caller. This can corrupt items that are only seen first in an old fs root before a merge and then only seen in the fs_root after a merge. In this case the item won't have any versioning and the existing version from the old fs_root is preferred. This is particularly bad when the new version was deleted -- in that case we have no metadata which would tell us to drop the old item that was read from the old fs_root. This is fixed by pushing the retry up to callers who wipe the item state before each retry. Now each set of items is related to a single snapshot of the fs_root and logs at one point in time. I haven't seen definitive evidence of this happening in practice. I found this problem after putting on my craziest thinking toque and auditing the code for places where we could lose item updates. Signed-off-by: Zach Brown <zab@versity.com>	2021-08-25 10:14:38 -07:00
Zach Brown	5f57785790	Fix btree merge input item iteration Btree merging attempted to build an rbtree of the input roots with only one version of an item present in the rbtree at a time. It really messed this up by completely dropping an input root when a root with a newer version of its item tried to take its place in the rbtree. What it should have done is advance to the next item in the older root, which itself could have required advancing some other older root. Dropping the root entirely is catastrophically wrong because it hides the rest of the items in the root from merging. This has been manifesting as occasional mysterious item loss during tests where memory pressure, item update patterns, and merging all lined up just so. This fixes the problem by more clearly keeping the next item in each root in the rbtree. We sort by newest to oldest version so that once we merge the most recent version of an item it's easy to skip all the older versions of the items in the next rbtree entries for the rest of the input roots. While we're at it we work with references to the static cached input btree blocks. The old code was a first pass that used an expensive btree walk per item and copied the value payload. Signed-off-by: Zach Brown <zab@versity.com>	2021-08-25 10:14:38 -07:00
Zach Brown	3740c0a995	More carefully scan for orphan inodes The current orphan scan uses the forest_next_hint to look for candidate orphan items to delete. It doesn't skip deleted items and checks the forest of log btrees so it'd return hints for every single item that existed in all the log btrees across the system. And we call the hint per-item. When the system is deleting a lot of files we end up generating a huge load where all mounts are constantly getting the btree roots from the server, reading all the newest log btree blocks, finding deleted orphan items for inodes that have already been deleted, and moving on to the next deleted orphan item. The fix is to use a read-only traversal of only one version of the fs root for all the items in one scan. This avoids all the deleted orphan items that exist in the log btrees which will disappear when they're merged. It lets the item iteration happen in a single read-only cached btree instead of constantly reading in the most recently written root block of every log btree. The result is an enormous speedup of large deletions. I don't want to describe exactly how enormous. Signed-off-by: Zach Brown <zab@versity.com>	2021-08-25 10:14:38 -07:00
Zach Brown	a4f5293e78	Flush invalidate and iput inode references We can be performing final deletion as inodes are evicted during unmount. We have to keep full locking, transactions, and networking up and running for the evict_inodes() call in generic_shutdown_super(). Unfortunately, this means that workers can be using inode references during evict_inodes() which prevents them from being evicted. Those workers can then remain running as we tear down the system, causing crashes and deadlocks as the final iputs try to use resources that have been destroyed. The fix is to first properly stop orphan scanning, which can instantiate new cached inodes, before the call to kill_block_super ends up trying to evict all inodes. Then we just need to wait for any pending iput and invalidate work to finish and perform the final iput, which will always evict because generic_shutdown_super has cleared MS_ACTIVE. Signed-off-by: Zach Brown <zab@versity.com>	2021-08-25 10:14:38 -07:00
Zach Brown	0c3026a2b7	Add simple per-lock server message count stats Add some simple tracking of message counts for each lock in the lock server so that we can start to see where conflicts may be happening in a running system. Signed-off-by: Zach Brown <zab@versity.com>	2021-08-25 10:14:38 -07:00
Zach Brown	5bc95fac7d	Add scoutfs_unmounting() Add a quick helper that can be used to avoid doing work if we know that we're already shutting down. This can be a single coarser indicator than adding functions to each subsystem to track that we're shutting down. Signed-off-by: Zach Brown <zab@versity.com>	2021-08-25 10:14:38 -07:00
Zach Brown	b0a08eb922	Remove lock grace period We had some logic to try and delay lock invalidation while the lock was still actively in use. This was trying to reduce the cost of pathological lock conflict cases but it had some severe fairness problems. It was first introduced to deal with bad patterns in userspace that no longer exist and it was built on top of the LSM transaction machinery that also no longer exists. It hasn't aged well. Instead of introducing invalidation latency in the hopes that it leads to more batched work, which it can't always, let's aim more towards reducing latency in all parts of the write-invalidate-read path and also aim towards reducing contention in the first place. Signed-off-by: Zach Brown <zab@versity.com>	2021-08-25 10:14:38 -07:00
Zach Brown	bb571377dc	Don't merge newer items past older We have a problem where items can appear to go backwards in time because of the way we chose which log btrees to finalize and merge. Because we don't have versions in items in the fs_root, and even might not have items at all if they were deleted, we always assume items in log btrees are newer than items in the fs root. This creates the requirement that we can't merge a log btree if it has items that are also present in older versions in other log btrees which are not being merged. The unmerged old item in the log btree would take precedence over the newer merged item in the fs root. We weren't enforcing this requirement at all. We used the max_item_seq to ensure that all items were older than the current stable seq but that says nothing about the relationship between older items in the finalized and active log btrees. Nothing at all stops an active btree from having an old version of a newer item that is present in another mount's finalized log btree. To reliably fix this we create a strict item seq discontinuity between all the finalized merge inputs and all the active log btrees. Once any log btree is naturally finalized the server forces all the clients to group up and finalize all their open log btrees. A merge operation can then safely operate on all the finalized trees before any new trees are given to clients who would start using increasing item seqs. Signed-off-by: Zach Brown <zab@versity.com>	2021-08-25 10:14:38 -07:00
Zach Brown	5897f4d889	Add a trivial trace_printk wrapper Make it a bit easier to include the fsid and rid in trace_printk messages when we're experimenting. Signed-off-by: Zach Brown <zab@versity.com>	2021-08-24 09:12:20 -07:00
Zach Brown	999093bfc9	Add sync log trees network command Add a command for the server to request that clients commit their open transaction. This will be used to create groups of finalized log btrees for consistent merging. Signed-off-by: Zach Brown <zab@versity.com>	2021-08-24 09:12:17 -07:00
Zach Brown	05b5d93365	Verify that quorum_slot_nr references valid slot We were checking that quorum_slot_nr was within the range of possible slots allowed by the format as it was parsed. We weren't checking that it referenced a configured slot. Make sure, and give a nice error message that shows the configured slots. Signed-off-by: Zach Brown <zab@versity.com>	2021-08-24 09:11:40 -07:00
Zach Brown	4d7191dc48	Print messages on extent ins/rem errors Signed-off-by: Zach Brown <zab@versity.com>	2021-08-24 09:11:40 -07:00
Zach Brown	4495dbdce6	Set initial quorum term from max of all blocks During rough forced unmount testing we saw a seemingly mysterious concurrent election. It could be explained if mounts coming up don't start with the same term. Let's try having mounts initialize their term to the greatest of all the terms they can see in the quorum blocks. This will prevent the situation where some new quorum actors with greater terms start out ignoring all the messages from others. Signed-off-by: Zach Brown <zab@versity.com>	2021-08-24 09:11:40 -07:00
Zach Brown	70569b0448	Trivial quorum test;set -> test_and_set Nothing interesting here, just a minor convenience to use test and set instead of testing and then setting. Signed-off-by: Zach Brown <zab@versity.com>	2021-08-24 09:11:40 -07:00
Zach Brown	823838cf01	Add more messages to server processing errors The server doesn't give us much to go on when it gets an error handling requests to work with log trees from the client. This adds a lot of specific error messages so we can get a better understanding of failures. Signed-off-by: Zach Brown <zab@versity.com>	2021-08-24 09:11:40 -07:00
Zach Brown	89b5865a4c	Verify that log tree commit is for sending rid We were trusting the rid in the log trees struct that the client sent. Compare it to our recorded rid on the connection and fail if the client sent the wrong rid. Signed-off-by: Zach Brown <zab@versity.com>	2021-08-17 12:13:01 -07:00
Zach Brown	65ac42831f	Queue invalidation during previous request The locking protocol only allows one outstanding invalidation request for a lock at a time. The client invalidation state is a bit hairy and involves removing the lock from the invalidation list while it is being processed which includes sending the response. This means that another request can arrive while the lock is not on the invalidation list. We have fields in the lock to record another incoming request which puts the lock back on the list. But the invalidation work wasn't always queued again in this case. It looks like the incoming request path would queue the work, but by definition the lock isn't on the invalidation list during this race. If it's the only lock in play then the invalidation list will be empty and the work won't be queued. The lock can get stuck with a pending invalidation if nothing else kicks the invalidation worker. We saw this in testing when the root inode lock group missed the wakeup. The fix is to have the work requeue itself after putting the lock back on the invalidation list when it notices that another request came in. Signed-off-by: Zach Brown <zab@versity.com>	2021-08-06 15:41:11 -07:00
Zach Brown	cb1726681c	Fix net BUG_ON if reconnection farewell send races When a client socket disconnects we save the connection state to re-use later if the client reconnects. A newly accepted connection finds the old connection associated with the reconnecting client and migrates state from the old idle connection to the newly accepted connection. While moving messages between the old and new send and resend queues the code had an aggressive BUG_ON that was asserting that the newly accepted connection couldn't have any messages in its resend queue. This BUG can be tripped due to the ordering of greeting processing and connection state migration. The server greeting processing path sends the farewell response to the client before it calls the net code to migrate connection state. When it "sends" the farewell response it puts the message on the send queue and kicks the send work. It's possible for the send work to execute and move the farewell response to the resend queue and trip the BUG_ON. This is harmless. The sent greeting response is going to end up on the resend queue either way, there's no reason for the reconnection migration to assert that it can't have happened yet. It is going to be dropped the moment we get a message from the client with a recv_seq that is necessarily past the greeting response which always gets a seq of 1 from the newly accepted connection. We remove the BUG_ON and try to splice the old resend queue after the possible response at the head of the resend_queue so that it is the first to be dropped. Signed-off-by: Zach Brown <zab@versity.com>	2021-08-02 11:15:57 -07:00
Zach Brown	cdff272163	Fix alloc list exhaustion calculation The last thing server commits do is move extents from the freed list into freed extents. It moves as many as it can until it runs out of avail meta blocks and space for freed meta blocks in the current allocator's lists. The calculation for whether the lists had resources to move an extent was quite off. It missed that the first move might have to dirty the current allocator or the list block, that the btree could join/split blocks at each level down the paths, and boy does it look like the height component of the calculation was just bonkers. With the wrong calculation the server could overflow the freed list while moving extents and trigger a BUG_ON. We rarely saw this in testing. Signed-off-by: Zach Brown <zab@versity.com>	2021-08-01 14:31:57 -07:00
Zach Brown	7e935898ab	Avoid premature metadata enospc server_get_log_trees() sets the low flag in a mount's meta_avail allocator, triggering enospc for any space consuming allocations in the mount, if the server's global meta_avail pool falls below the reserved block count. Before each server transaction opens we swap the global meta_avail and meta_freed allocators to ensure that the transaction has at least the reserved count of blocks available. This creates a risk of premature enospc as the global meta_avail pool drains and swaps to the larger meta_freed. The pool can be close to the reserved count, perhaps at it exactly. _get_log_trees can fill the client's mount, even a little, and drop the global meta_avail total under the reserved count, triggering enospc, even though meta_freed could have had quite a lot of blocks. The fix is to ensure that the global meta_avail has 2x the reserved count and swapping if it falls under that. This ensures that a server transaction can consume an entire reserved count and still have enough to avoid triggering enospc. This fixes a scattering of rare premature enospc returns that were hitting during tests. It was rare for meta_avail to fall just at the reserved count and for get_log_trees to have to refill the client allocator, but it happened. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-30 13:26:32 -07:00
Zach Brown	6d0694f1b0	Add resize_devices ioctl and scoutfs command Add a scoutfs command that uses an ioctl to send a request to the server to safely use a device that has grown. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-30 13:26:32 -07:00
Zach Brown	4c1181c055	Remove first_ and last_ super blkno fields There are fields in the super block that specify the range of blocks that would be used for metadata or data. They are from the time when a single block device was carved up into regions for metadata and data. They don't make sense now that we have separate metadata and data block devices. The starting blkno is static and we go to the end of the device. This removes the fields now that they serve no purpose. The only use of them to check that freed extents fell within the correct bounds can still be performed by using the static starting number or roughly using the size of the devices. It's not perfect, but this is already only a check to see that the blknos aren't utter nonsense. We're removing the fields now to avoid having to update them while worrying about users when resizing devices. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-30 13:22:42 -07:00
Zach Brown	d6bed7181f	Remove almost all interruptible waits As subsystems were built I tended to use interruptible waits in the hope that we'd let users break out of most waits. The reality is that we have significant code paths that have trouble unwinding. Final inode deletion during iput->evict in a task is a good example. It's madness to have a pending signal turn an inode deletion from an efficient inline operation to a deferred background orphan inode scan deletion. It also happens that golang built pre-emptive thread scheduling around signals. Under load we see a surprising amount of signal spam and it has created surprising error cases which would have otherwise been fine. This changes waits to expect that IOs (including network commands) will complete reasonably promptly. We remove all interruptible waits with the notable exception of breaking out of a pending mount. That requires shuffling setup around a little bit so that the first network message we wait for is the lock for getting the root inode. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-30 13:22:42 -07:00
Zach Brown	4893a6f915	scoutfs_dirents_equal should return bool It looks like it returned u64 because it was derived from _name_hash(). Signed-off-by: Zach Brown <zab@versity.com>	2021-07-30 13:22:42 -07:00
Zach Brown	384590f016	Sync net shouldn't wait for errored submits If async network request submission fails then the response handler will never be called. The sync request wrapper made the mistake of trying to wait for completion when initial submission failed. This never happened in normal operation but we're able to trigger it with some regularity with forced unmount during tests. Unmount would hang waiting for work to shutdown which was waiting for request responses that would never happen. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-30 13:22:42 -07:00
Zach Brown	192f077c16	Update data_version when fallocate changes size Changing the file size can change the file contents -- reads will change when they stop returning data. fallocate can change the file size and if it does it should increment the data_version, just like setattr does. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-30 13:22:42 -07:00
Zach Brown	b7ab26539a	Avoid lockdep warning about upstream inversion Some kernels have blkdev_reread_part acquire the bd_mutex and then call into drop_partitions which calls fsync_bdev which acquires s_umount. This inverts the usual pattern of deactivate_super getting s_umount and then using blkdev_put in kill_sb->put_super to drop a second device. The inversion has been fixed upstream by years of rewrites. We can't go back in time to fix the kernels that we're testing against, unfortunately, so we disable lockdep around our valid leg of the inversion that lockdep is noticing in our testing. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-30 13:22:42 -07:00
Zach Brown	c51f0c37da	Defer dirty inode data writeback (and use list) iput() can only be used in contexts that could perform final inode deletion which requires cluster locks and transactions. This is absolutely true for the transaction committing worker. We can't have deletion during transaction commit trying to get locks and dirty more items in the transaction. Now that we're properly getting locks in final inode deletion and O_TMPFILE support has put pressure on deletion, we're seeing deadlocks between inode eviction during transaction commit getting an index lock and index lock invalidation trying to commit. We use the newly offered queued iput to defer the iput from walking our dirty inodes. The transaction commit will be able to proceed while the iput worker is off waiting for a lock. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-30 13:20:40 -07:00
Zach Brown	52107424dd	Promote deferred iput to inode call Lock invalidation had the ability to kick iput off to work context. We need to use it for inode writeback as well so we move the mechanism over to inode.c and give it a proper call. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-30 11:34:52 -07:00
Zach Brown	099a65ab07	Try recovering from truncate errors and more info We're seeing errors during truncate that are surprising. Let's try and recover from them and provide more info when they happen so that we can dig deeper. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-30 11:34:52 -07:00
Zach Brown	2901b43906	Also allow omap requests to disconnected clients We recently fixed problems sending omap responses to originating clients which can race with the clients disconnecting. We need to handle the requests sent to clients on behalf of an origination request in exactly the same way. The send can race with the client being evicted. It'll be cleaned after the race is safely ignored by the client's rid being removed from the server's request tracking. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-30 11:34:52 -07:00
Zach Brown	03d7a4e7fe	Show relative times in quorum status file output The times in the quorum status file are in absolute monotonic kernel time since bootup. That's not particularly helpful especially when comparing across hosts with different boot times. This shows relative times in timespec64 seconds until or since the times in question. While we're at it we also collect the send and receive timestamps closer to each send or receive call. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-30 11:34:52 -07:00
Zach Brown	d5d3b12986	Specifically shutdown quorum during forced unmount Generally, forced unmount works by returning errors for all IO. Quorum is pretty resilient in that it can have the IO errors eaten by server startup and does its own messaging that won't return errors. Trying to force unmount can have the quorum service continually participate in electing a server that immediately fails and shuts down. This specifically shuts down the internal quorum service when it sees that unmount is being forced. This is easier and cleaner than having the network IO return errors and then having that trigger shutdown. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-30 11:34:52 -07:00
Zach Brown	e4dca8ddcc	Don't shutdown quorum if server startup fails The quorum service shuts down if it sees errors that mean that it can't do its job. This is mostly fatal errors gathering resources at startup or runtime IO errors but it was also shutting down if server startup fails. That's not quite right. This should be treated like the server shutting down on errors. Quorum needs to stay around to participate in electing the next server. Fence timeouts could trigger this. A quorum mount could crash, the next server without a fence script could have a fence request timeout and shutdown, and now the third remaining server is left to indefinitely send vote requests into the void. With this fixed, continuing that example, the quorum service in the second mount remains to elect the third server with a working fence script after the second server shuts down after its fence request times out. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-30 11:34:52 -07:00
Zach Brown	b4ede2ac6a	Allow omap responses to disconnected originators The omap message lifecycle is a little different than the server's usual handling that sends a response from the request handler. The response is sent long after the initial receive handler is pinning the connection to the client. It's fine for the response to be dropped. The main server request handler handled this case but other response senders didn't. Put this error handling in the server response sender itself so that all callers are covered. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-08 09:36:07 -07:00
Zach Brown	cbe8d77f78	Prevent duplicate inode item deletion We hide I_FREEING inodes from inode lookup to avoid inversions with cluster locking. This can result in duplicate inode structs for a given inode number. They can both race to try and delete the same items for their shared inode number. This leads to error messages from evict_inode and could lead to corruption if they, for example, both try and free the same data extents. This adds very basic serialization so only one instance can try to delete items at a time. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-07 14:13:14 -07:00
Zach Brown	5f682dabb5	Item cache invalidation uses seqs to avoid readers The item cache has to be careful not to insert stale read items when previously dirty items have been written and invalidated while a read was in flight. This was previously done by recording the possible range of items that a reader could see based on the key range of its lock. This is disastrous when a workload operates entirely within one lock. I ran into this when testing a small number of files with massive amounts of xattrs. While any reader is in flight all pages can't be invalidated because they all intersect with the one lock that covers all the items in use. The fix is to more naturally reflect the problem by tracking the greatest item seq in pages and the earliest seq that any readers can't see. This lets invalidate only skip pages with items that weren't visible to the earliest reader. This more naturally reflects that the problem is due to the age of the items, not their position in the key space. Now only a few of the most recently modified pages could be skipped and they'll be at the end of the LRU and won't typically be visited. As an added benefit it's now much cheaper to add, delete, and test the active readers. This fix stopped rm -rf of a full system's worth of xattrs from taking minutes constantly spinning skipping all pages in the LRU to seconds of doing real removal work. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-07 14:13:14 -07:00
Zach Brown	29cfa81574	Remove unused leftovers from quorum changes These forward declarations were for interfaces that have since been removed or changed and are no longer needed. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-07 14:13:14 -07:00
Zach Brown	73bf916182	Return ENOSPC as space gets low Returning ENOSPC is challenging because we have clients working on allocators which are a fraction of the whole and we use COW transactions so we need to be able to allocate to free. This adds support for returning ENOSPC to client posix allocators as free space gets low. For metadata, we reserve a number of free blocks for making progress with client and server transactions which can free space. The server sets the low flag in a client's allocator if we start to dip into reserved blocks. In the client we add an argument to entering a transaction which indicates if we're allocating new space (as opposed to just modifying existing data or freeing). When an allocating transaction runs low and the server low flag is set then we return ENOSPC. Adding an argument to transaction holders and having it return ENOSPC gave us the opportunity to clean it up and make it a little clearer. More work is done outside the wait_event function and it now specifically waits for a transaction to cycle when it forces a commit rather than spinning until the transaction worker acquires the lock and stops it. For data the same pattern applies except there are no reserved blocks and we don't COW data so it's a simple case of returning the hard ENOSPC when the data allocator flag is set. The server needs to consider the reserved count when refilling the client's meta_avail allocator and when swapping between the two meta_avail and meta_free allocators. We add the reserved metadata block count to statfs_more so that df can subtract it from the free meta blocks and make it clear when enospc is going to be returned for metadata allocations. We increase the minimum device size in mkfs so that small testing devices provide sufficient reserved blocks. And finally we add a little test that makes sure we can fill both metadata and data to ENOSPC and then recover by deleting what we filled. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-07 14:13:14 -07:00
Zach Brown	9db3b475c0	Stop log merge work earlier during unmount The forest log merge work calls into the client to send commit requests to the server. The forest is usually destroyed relatively late in the sequence and can still be running after the client is destroyed. Adding a _forest_stop call lets us stop the log merging work before the client is destroyed. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-02 10:54:56 -07:00
Zach Brown	2957f3e301	Avoid warnings when evict has signals pending Killing a task can end up in evict and break out of acquiring the locks to perform final inode deletion. This isn't necessarily fatal. The orphan task will come around and will delete the inode when it is truly no longer referenced. So let's silence the error and keep track of how many times it happens. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-02 10:54:56 -07:00


1064 Commits