The .get_acl() method now gets passed a mnt_idmap arg, and we can now
choose to implement either .get_acl() or .get_inode_acl(). Technically
.get_acl() is the new method and .get_inode_acl() is the old one. That
second method now also gets passed an rcu flag, but we should be fine
either way.
Deeper under the covers, however, we do need to hook up the .set_acl()
method for inodes, otherwise setfacl will just fail with -ENOTSUPP. To
keep this from getting super messy (it already is), we tack the
get_acl() changes on here.
This is all roughly ca. v6.1-rc1-4-g7420332a6ff4.
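For reference, the hookup on newer kernels ends up roughly shaped like
the sketch below; the function and helper names are illustrative rather
than the exact ones in the tree:

    /* sketch, assuming the ~v6.3+ signatures that take an mnt_idmap */
    static struct posix_acl *scoutfs_get_acl(struct mnt_idmap *idmap,
                                             struct dentry *dentry, int type)
    {
            return scoutfs_read_acl(d_inode(dentry), type); /* hypothetical helper */
    }

    static int scoutfs_set_acl(struct mnt_idmap *idmap, struct dentry *dentry,
                               struct posix_acl *acl, int type)
    {
            return scoutfs_write_acl(d_inode(dentry), acl, type); /* hypothetical helper */
    }

    static const struct inode_operations scoutfs_file_iops = {
            .get_acl = scoutfs_get_acl,
            .set_acl = scoutfs_set_acl,
    };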
Signed-off-by: Auke Kok <auke.kok@versity.com>
The value of `ret` is not initialized. If the writeback list is empty,
or if igrab() fails on the only inode on the list, `ret` is returned
without ever being set. This would cause the caller to needlessly
retry, possibly making things worse.
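A minimal sketch of the fix, with the surrounding names made up for
illustration:

    int ret = 0;    /* previously uninitialized */

    list_for_each_entry_safe(si, tmp, &wb_list, wb_entry) {
            inode = igrab(&si->inode);
            if (!inode)
                    continue;       /* an empty or fully-skipped walk returns 0 */
            ret = write_inode_items(inode);
            iput(inode);
            if (ret)
                    break;
    }
    return ret;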
Signed-off-by: Auke Kok <auke.kok@versity.com>
The iput worker can accumulate quite a bit of pending work to do. We've
seen hung task warnings while it's doing its work (admittedly in debug
kernels). There's no harm in throwing in a cond_resched so other tasks
get a chance to do work.
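Something along these lines (the loop shown is illustrative):

    while ((inode = pop_queued_iput(sbi)) != NULL) {
            iput(inode);
            cond_resched();         /* let other tasks run between puts */
    }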
Signed-off-by: Zach Brown <zab@versity.com>
The issue with the previous attempt to fix the orphan-inodes test was
that we would regularly exceed the 120s timeout value put in there.
Instead, in this commit, we change the code to add a new counter that
indicates orphan deletion progress. When orphan inodes are deleted,
this counter increments. Conversely, when the orphan scan attempts
counter increments without the progress counter moving, we know that
there was no more work to be done.
For safety, the test case waits until 2 consecutive scan attempts have
been made without forward progress.
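On the kernel side the change amounts to bumping two counters that the
test can compare across scans; the counter names here are illustrative:

    if (nr_deleted > 0)
            scoutfs_inc_counter(sb, orphan_inode_delete);   /* forward progress */
    scoutfs_inc_counter(sb, orphan_scan_attempt);           /* a scan completed */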
Signed-off-by: Auke Kok <auke.kok@versity.com>
v5.12-rc6-9-g4f0f586bf0c8
All list_sort functions use the list_cmp_func_t type, which compares
pointers to list_head members. These are now required to be `const`,
and the compiler now checks them. This propagates into our callers.
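Our comparison callbacks end up with this shape; the item struct below
is illustrative:

    static int cmp_by_seq(void *priv, const struct list_head *a,
                          const struct list_head *b)
    {
            const struct pending_item *pa = list_entry(a, struct pending_item, head);
            const struct pending_item *pb = list_entry(b, struct pending_item, head);

            return pa->seq < pb->seq ? -1 : pa->seq > pb->seq ? 1 : 0;
    }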
Signed-off-by: Auke Kok <auke.kok@versity.com>
Add support for project IDs. They're managed through the _attr_x
interfaces and are inherited from the parent directory during creation.
Signed-off-by: Zach Brown <zab@versity.com>
Add a bit to the private scoutfs inode flags which indicates that the
inode is in retention mode. The bit is visible through the _attr_x
interface. It can only be set on regular files and when set it prevents
modification to all but non-user xattrs. It can be cleared by root.
Signed-off-by: Zach Brown <zab@versity.com>
We're about to increase the inode size and increment the format version.
Inode reading and writing has to handle different valid inode sizes as
allowed by the format version. This is the initial skeletal work that
later patches which really increase the inode size will further refine
to add the specific known sizes and format versions.
Signed-off-by: Bryant G. Duffy-Ly <bduffyly@versity.com>
[zab@versity.com: reworded description, reworked to use _within]
Signed-off-by: Zach Brown <zab@versity.com>
We were using a seqcount to protect high frequency reads and writes to
some of our private inode fields. The writers were serialized by the
caller but that's a bit too easy to get wrong. We're already storing
the write seqcount update so the additional internal spinlock stores in
seqlocks aren't a significant additional overhead. The seqlocks also
handle preemption for us.
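Reads and writes of the covered fields then look roughly like this
(the field name is illustrative):

    static u64 scoutfs_inode_get_field(struct scoutfs_inode_info *si)
    {
            unsigned int seq;
            u64 val;

            do {
                    seq = read_seqbegin(&si->seqlock);
                    val = si->some_field;
            } while (read_seqretry(&si->seqlock, seq));

            return val;
    }

    static void scoutfs_inode_set_field(struct scoutfs_inode_info *si, u64 val)
    {
            write_seqlock(&si->seqlock);    /* internal spinlock serializes writers */
            si->some_field = val;
            write_sequnlock(&si->seqlock);
    }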
Signed-off-by: Zach Brown <zab@versity.com>
The aio_read and aio_write callbacks are no longer used by newer
kernels, which now use iter-based readers and writers.
We can avoid implementing plain .read and .write since the kernel
generates an iter for us when one is needed.
This means we need a slightly different data_wait_check() that
accounts for the iter and offset properly, so we add a new
data_wait_check_iter() function accordingly.
With these methods removed, the el8 kernel no longer uses the
extended ops wrapper struct and is much closer to upstream. A lot of
methods move between inode_dir_operations and inode_file_operations
etc., and things will hopefully look a bit more structured as a
result.
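The resulting file_operations are iter-only; the method names here are
illustrative of the shape rather than the exact entries:

    static const struct file_operations scoutfs_file_fops = {
            .llseek         = generic_file_llseek,
            .read_iter      = scoutfs_file_read_iter,   /* no plain .read/.write */
            .write_iter     = scoutfs_file_write_iter,
            .mmap           = generic_file_mmap,
            .fsync          = scoutfs_file_fsync,
            .unlocked_ioctl = scoutfs_ioctl,
    };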
Signed-off-by: Auke Kok <auke.kok@versity.com>
Provide a fallback in degraded mode for kernels before v4.15-rc3 by
directly manipulating the member as needed.
Signed-off-by: Auke Kok <auke.kok@versity.com>
Since v4.6-rc3-27-g9902af79c01a, inode->i_mutex has been replaced
with ->i_rwsem. However, inode_lock() and related functions have long
worked as intended either way and provide fully exclusive locking of
the inode.
To avoid a name clash on pre-rhel8 kernels, we have to rename a
stack variable in `src/file.c`.
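In other words we can lean on the wrappers everywhere and never touch
the underlying member directly:

    inode_lock(inode);      /* exclusive, whether backed by i_mutex or i_rwsem */
    /* ... modify the inode ... */
    inode_unlock(inode);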
Signed-off-by: Auke Kok <auke.kok@versity.com>
When we truncate away from a partial block we need to zero its tail that
was past i_size and dirty it so that it's written.
We missed the typical vfs boilerplate of calling block_truncate_page
from setattr->set_size that does this. We need to be a little careful
to pass our file lock down to get_block and then queue the inode for
writeback so it's written out with the transaction. This follows the
pattern in .write_end.
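The setattr path gains a call roughly like the following; the get_block
and writeback helper names are illustrative:

    /* zero and dirty the tail of the partial block at the new size */
    ret = block_truncate_page(inode->i_mapping, attr->ia_size,
                              scoutfs_get_block);
    if (ret)
            goto out;
    scoutfs_inode_queue_writeback(inode);   /* goes out with the transaction */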
Signed-off-by: Zach Brown <zab@versity.com>
The d_prune_aliases in lock invalidation was thought to be safe because
the caller had an inode reference, so surely it couldn't get into
iput_final. I missed the fundamental dcache pattern that dput can
ascend through parents and end up in inode eviction for entirely
unrelated inodes.
It's very easy for this to deadlock: imagine, if nothing else, that the
lock the invalidation is blocked on in dput->iput->evict->delete->lock
is itself in the list of locks to invalidate in the caller.
We fix this by always kicking off d_prune and dput into async work.
This increases the chance that inodes will still be referenced after
invalidation, preventing inline deletion. More deletions can be
deferred until the orphan scanner finds them. It should be rare,
though. We're still likely to put and drop invalidated inodes before a
writer gets around to removing the final unlink and asking us for the
omap that describes our cached inodes.
To perform the d_prune in work we make it a behavioural flag and make
our queued iputs a little more robust. We use much safer and more
understandable locking to cover the count and the new flags, and we put
the work in re-entrant work items in their own workqueue instead of one
work instance in the system_wq.
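The invalidation side then boils down to something like this, with the
flag, count, and work names made up for illustration:

    spin_lock(&si->iput_lock);
    si->iput_flags |= SCOUTFS_IPUT_PRUNE;   /* behavioural flag: d_prune in work */
    si->iput_count++;
    spin_unlock(&si->iput_lock);
    queue_work(sbi->iput_workq, &si->iput_work);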
Signed-off-by: Zach Brown <zab@versity.com>
FS items are deleted by logging a deletion item that has a greater item
version than the item to delete. The versions are usually maintained by
the write_seq of the exclusive write lock that protects the item. Any
newer write hold will have a greater version than all previous write
holds so any items created under the lock will have a greater vers than
all previous items under the lock. All deletion items will be merged
with the older item and both will be dropped.
This doesn't work for concurrent write-only locks. The write-only locks
match with each other so their write_seqs are assigned in the order
that they are granted. That grant order can be mismatched with item
creation order. We can get deletion items with lesser versions than the
item to delete because of when each creation's write-only lock was
granted.
Write only locks are used to maintain consistency between concurrent
writers and readers, not between writers. Consistency between writers
is done with another primary write lock. For example, if you're writing
seq items to a write-only region you need to have the write lock on the
inode for the specific seq item you're writing.
The fix, then, is to pass these primary write locks down to the item
cache so that it can choose an item version that is the greatest amongst
the transaction, the write-only lock, and the primary lock. This now
ensures that the primary lock's increasing write_seq makes it down to
the item, bringing item version ordering in line with exclusive holds of
the primary lock.
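In effect the item cache now does something like the following when
assigning versions (names are illustrative):

    /* the item version is the greatest of the three sequence sources */
    vers = max3(trans_seq, wo_lock->write_seq, primary_lock->write_seq);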
All of this to fix concurrent inode updates sometimes leaving behind
duplicate meta_seq items because old seq item deletions ended up with
older versions than the seq item they tried to delete, nullifying the
deletion.
Signed-off-by: Zach Brown <zab@versity.com>
Add support for POSIX ACLs as described in acl(5). Support is
enabled by default and can be explicitly enabled or disabled with the
acl or noacl mount options, respectively.
Signed-off-by: Zach Brown <zab@versity.com>
Move to the use of the array of xattr_handler structs on the super to
dispatch set and get from the generic_ entry points based on the xattr
prefix. This will make it easier to add handling of the pseudo
system.* ACL xattrs.
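The super ends up pointing at a prefix-keyed table along these lines
(handler names are illustrative):

    static const struct xattr_handler *scoutfs_xattr_handlers[] = {
            &scoutfs_xattr_user_handler,
            &scoutfs_xattr_trusted_handler,
            /* the system.* posix acl handlers slot in here with ACL support */
            NULL,
    };

    sb->s_xattr = scoutfs_xattr_handlers;   /* set at fill_super time */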
Signed-off-by: Zach Brown <zab@versity.com>
try_delete_inode_items() is responsible for making sure that it's safe
to delete an inode's persistent items. One of the things it has to
check is that there isn't another deletion attempt on the inode in this
mount. It sets a bit in lock data while it's working and backs off if
the bit is already set.
Unfortunately it was always clearing this bit as it exited, regardless
of whether it set it or not. This would let the next attempt perform
the deletion again before the working task had finished. This was often
not a problem because background orphan scanning is the only source of
regular concurrent deletion attempts.
But it's a big problem if a deletion attempt takes a very long time. It
gives enough time for an orphan scan attempt to clear the bit, then try
again and clobber whoever is performing the very slow deletion.
I hit this in a test that built files with an absurd number of
fragmented extents. The second concurrent orphan attempt was able to
proceed with deletion and performed a bunch of duplicate data extent
frees and caused corruption.
The fix is to only clear the bit if we set it. Now all concurrent
attempts will back off until the first task is done.
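In short, ownership of the bit now looks like this; the bit and lock
data names are illustrative:

    if (test_and_set_bit(LDATA_DELETING, &ldata->flags))
            return 0;       /* someone else set it; back off */

    ret = delete_inode_items(sb, ino, lock);

    /* only clear the bit because we're the task that set it */
    clear_bit(LDATA_DELETING, &ldata->flags);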
Signed-off-by: Zach Brown <zab@versity.com>
The final iput of an inode can delete items in cluster locked
transactions. It was never safe to call iput within locked
transactions but we never saw the problem. Recent work on inode
deletion raised the issue again.
This makes sure that we always perform iput outside of locked
transactions. The only interesting change is making scoutfs_new_inode()
return the allocated inode on error so that the caller can put the inode
after releasing the transaction.
Signed-off-by: Zach Brown <zab@versity.com>
We were seeing a number of problems coming from races that allowed tasks
in a mount to try and concurrently delete an inode's items. We could
see error messages indicating that deletion failed with -ENOENT, we
could see users of inodes behave erratically as inodes were deleted from
under them, and we could see eventual server errors trying to merge
overlapping data extents which were "freed" (added to transaction lists)
multiple times.
This commit addresses the problems in one relatively large patch. While
we could mechanically split up the fixes, they're all interdependent and
splitting them up (bisecting through them) could cause failures that
would be devilishly hard to diagnose.
First we stop allowing multiple cached vfs inodes. This was initially
done to avoid deadlocks between lock invalidation and final inode
deletion. We add a specific lookup that's used by invalidation which
ignores any inodes which are in I_NEW or I_FREEING. Now that iget can
wait on inode flags we call iget5_locked before acquiring the cluster
lock. This ensures that we can only have one cached vfs inode for a
given inode number in evict_inode trying to delete.
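The lookup ends up roughly shaped like this, with the callback names
made up for illustration:

    inode = iget5_locked(sb, ino, scoutfs_iget_test, scoutfs_iget_set, &ino);
    if (!inode)
            return ERR_PTR(-ENOMEM);
    if (!(inode->i_state & I_NEW))
            return inode;   /* existing cached inode; iget waited on the flags */

    /* only now acquire the cluster lock, read items, unlock_new_inode() */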
Now that we can only have one cached inode, we can rework the omap
tracking to use _set and _clear instead of _inc and _put. This isn't
strictly necessary but is a simplification and lets us issue warnings if
we see that we ever try to set an inode number's bit on behalf of
multiple cached inodes. We also add a _test helper.
Orphan scanning would try to perform deletion by instantiating a cached
inode and then putting it, triggering eviction and final deletion. This
was an attempt to simplify concurrency but ended up causing more
problems. It no longer tries to interact with inode cache at all and
attempts to safely delete inode items directly. It uses the omap test
to determine that it should skip an already cached inode.
We had attempted to forbid opening inodes by handle if they had an nlink
of 0. Since we allowed multiple cached inodes for an inode number this
was to prevent adding cached inodes that were being deleted. It was
only performing the check on newly allocated inodes, though, so it could
get a reference to the cached inode that the scanner had inserted for
deleting. We're choosing to keep restricting opening by handle to only
linked inodes so we also check existing inodes after they're refreshed.
We're left with a task evicting an inode and the orphan scanner racing
to delete an inode's items. We move the work of determining if it's safe
to delete out of scoutfs_omap_should_delete() and into
try_delete_inode_items() which is called directly from eviction and
scanning. This is mostly code motion but we do make three critical
changes. We get rid of the goofy concurrent deletion detection in
delete_inode_items() and instead use a bit in the lock data to serialize
multiple attempts to delete an inode's items. We no longer assume that
the inode must still be around because we were called from evict and
instead specifically check that the inode item is still present for
deletion. Finally, we use the omap test to discover that we shouldn't
delete an inode that is locally cached (and would not be included in
the omap response). We do all this under the inode write lock to serialize
between mounts.
Signed-off-by: Zach Brown <zab@versity.com>
Add a mount option to set the delay between scanning of the orphan list.
The sysfs file for the option is writable so this option can be set at
run time.
Signed-off-by: Zach Brown <zab@versity.com>
The inode caller of omap was manually calculating the group and bits,
which isn't fantastic. Export the little helper to calculate it so
the inode caller doesn't have to.
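Something like the following, with made-up constant names:

    static inline void scoutfs_omap_calc_group_bit(u64 ino, u64 *group_nr,
                                                   int *bit_nr)
    {
            *group_nr = ino >> SCOUTFS_OMAP_GROUP_SHIFT;
            *bit_nr = ino & ((1ULL << SCOUTFS_OMAP_GROUP_SHIFT) - 1);
    }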
Signed-off-by: Zach Brown <zab@versity.com>
You can almost feel the editing mistake that brought the delay
calculation into the conditional and forgot to remove the initial
calculation at declaration.
Signed-off-by: Zach Brown <zab@versity.com>
Our open by handle functions didn't care that the inode wasn't
referenced and let tasks open unlinked inodes by number. This
interacted badly with the inode deletion mechanisms which required that
inodes couldn't be cached on other nodes after the transaction which
removed their final reference.
If a task did accidentally open a file by inode while it was being
deleted it could see the inode items in an inconsistent state and return
very confusing errors that look like corruption.
The fix is to give the handle iget callers a flag to tell iget to only
get the inode if it has a positive nlink. If iget sees that the inode
has been unlinked it returns -ENOENT.
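Inside iget, once the inode is current under its cluster lock, the
check is roughly the following; the flag name is illustrative:

    if ((flags & SCOUTFS_IGET_LINKED) && inode->i_nlink == 0) {
            iput(inode);
            return ERR_PTR(-ENOENT);
    }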
Signed-off-by: Zach Brown <zab@versity.com>
We're adding an ioctl that wants to build inode item keys so let's
export the private inode key initializer.
Signed-off-by: Zach Brown <zab@versity.com>
The code that updates inode index items on behalf of indexed fields uses
an array to track changes in the fields. Those array indexes were the
raw key type values.
We're about to introduce some sparse space between all the key values so
that we have some room to add keys in the future at arbitrary sort
positions amongst the previous keys.
We don't want the inode index item updating code to keep using raw types
as array indices when the type values are no longer small dense values.
We introduce indirection from type values to array indices to keep the
tracking array in the in-memory inode struct small.
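A small mapping keeps the in-inode tracking array dense; the slot and
type names below are illustrative:

    enum { IND_META_SEQ, IND_DATA_SEQ, IND_NR };

    static int index_type_to_ind(u8 type)
    {
            switch (type) {
            case SCOUTFS_INODE_INDEX_META_SEQ_TYPE:
                    return IND_META_SEQ;
            case SCOUTFS_INODE_INDEX_DATA_SEQ_TYPE:
                    return IND_DATA_SEQ;
            default:
                    return -EINVAL;
            }
    }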
Signed-off-by: Zach Brown <zab@versity.com>
Add a count of used inodes to the super block and a change in the inode
count to the log_trees struct. Client transactions track the change in
inode count as they create and delete inodes. The log_trees delta is
added to the count in the super as finalized log_trees are deleted.
Signed-off-by: Zach Brown <zab@versity.com>
This adds i_version to our inode and maintains it as we allocate, load,
modify, and store inodes. We set the flag in the superblock so
in-kernel users can use i_version to see changes in our inodes.
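Roughly, the maintenance uses the stock kernel helpers:

    sb->s_flags |= SB_I_VERSION;            /* advertise i_version at mount */

    inode_set_iversion(inode, vers);        /* when allocating or loading */
    inode_inc_iversion(inode);              /* on each modification, before storing */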
Signed-off-by: Zach Brown <zab@versity.com>
Add an inode creation time field. It's created for all new inodes.
It's visible to stat_more. setattr_more can set it during
restore.
Signed-off-by: Zach Brown <zab@versity.com>
The current orphan scan uses the forest_next_hint to look for candidate
orphan items to delete. It doesn't skip deleted items and checks the
forest of log btrees so it'd return hints for every single item that
existed in all the log btrees across the system. And we call the hint
lookup once per item.
When the system is deleting a lot of files we end up generating a huge
load where all mounts are constantly getting the btree roots from the
server, reading all the newest log btree blocks, finding deleted orphan
items for inodes that have already been deleted, and moving on to the
next deleted orphan item.
The fix is to use a read-only traversal of only one version of the fs
root for all the items in one scan. This avoids all the deleted orphan
items that exist in the log btrees which will disappear when they're
merged. It lets the item iteration happen in a single read-only cached
btree instead of constantly reading in the most recently written root
block of every log btree.
The result is an enormous speedup of large deletions. I don't want to
describe exactly how enormous.
Signed-off-by: Zach Brown <zab@versity.com>
We can be performing final deletion as inodes are evicted during
unmount. We have to keep full locking, transactions, and networking up
and running for the evict_inodes() call in generic_shutdown_super().
Unfortunately, this means that workers can be using inode references
during evict_inodes() which prevents them from being evicted. Those
workers can then remain running as we tear down the system, causing
crashes and deadlocks as the final iputs try to use resources that have
been destroyed.
The fix is to first properly stop orphan scanning, which can instantiate
new cached inodes, before the call to kill_block_super ends up trying
to evict all inodes. Then we just need to wait for any pending iput and
invalidate work to finish and perform the final iput, which will always
evict because generic_shutdown_super has cleared MS_ACTIVE.
Signed-off-by: Zach Brown <zab@versity.com>
As subsystems were built I tended to use interruptible waits in the hope
that we'd let users break out of most waits.
The reality is that we have significant code paths that have trouble
unwinding. Final inode deletion during iput->evict in a task is a good
example. It's madness to have a pending signal turn an inode deletion
from an efficient inline operation to a deferred background orphan inode
scan deletion.
It also happens that golang built pre-emptive thread scheduling around
signals. Under load we see a surprising amount of signal spam and it
has created surprising error cases which would have otherwise been fine.
This changes waits to expect that IOs (including network commands) will
complete reasonably promptly. We remove all interruptible waits with
the notable exception of breaking out of a pending mount. That requires
shuffling setup around a little bit so that the first network message we
wait for is the lock for getting the root inode.
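The typical conversion is simply the following, sketched with
placeholder names:

    /* was: ret = wait_event_interruptible(waitq, request_done(req)); */
    wait_event(waitq, request_done(req));   /* IO/network completes promptly */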
Signed-off-by: Zach Brown <zab@versity.com>
iput() can only be used in contexts that could perform final inode
deletion which requires cluster locks and transactions. This is
absolutely true for the transaction committing worker. We can't have
deletion during transaction commit trying to get locks and dirty *more*
items in the transaction.
Now that we're properly getting locks in final inode deletion and
O_TMPFILE support has put pressure on deletion, we're seeing deadlocks
between inode eviction during transaction commit getting an index lock
and index lock invalidation trying to commit.
We use the newly offered queued iput to defer the iput from walking our
dirty inodes. The transaction commit will be able to proceed while
the iput worker is off waiting for a lock.
Signed-off-by: Zach Brown <zab@versity.com>
Lock invalidation had the ability to kick iput off to work context. We
need to use it for inode writeback as well so we move the mechanism over
to inode.c and give it a proper call.
Signed-off-by: Zach Brown <zab@versity.com>
We hide I_FREEING inodes from inode lookup to avoid inversions with
cluster locking. This can result in duplicate inode structs for a
given inode number. They can then both race to try and delete the same items
for their shared inode number. This leads to error messages from
evict_inode and could lead to corruption if they, for example, both try
and free the same data extents.
This adds very basic serialization so only one instance can try to
delete items at a time.
Signed-off-by: Zach Brown <zab@versity.com>
Returning ENOSPC is challenging because we have clients working on
allocators which are a fraction of the whole and we use COW transactions
so we need to be able to allocate to free. This adds support for
returning ENOSPC to client posix allocators as free space gets low.
For metadata, we reserve a number of free blocks for making progress
with client and server transactions which can free space. The server
sets the low flag in a client's allocator if we start to dip into
reserved blocks. In the client we add an argument to entering a
transaction which indicates if we're allocating new space (as opposed to
just modifying existing data or freeing). When an allocating
transaction runs low and the server low flag is set then we return
ENOSPC.
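The hold check reduces to something like this, with the names invented
for illustration:

    /* an allocating hold fails fast once the server marked us low */
    if ((flags & SCOUTFS_TRANS_ALLOCATING) &&
        meta_alloc_flagged_low(sb) && trans_running_low(sb))
            return -ENOSPC;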
Adding an argument to transaction holders and having it return ENOSPC
gave us the opportunity to clean it up and make it a little clearer.
More work is done outside the wait_event function and it now
specifically waits for a transaction to cycle when it forces a commit
rather than spinning until the transaction worker acquires the lock and
stops it.
For data the same pattern applies except there are no reserved blocks
and we don't COW data so it's a simple case of returning the hard ENOSPC
when the data allocator flag is set.
The server needs to consider the reserved count when refilling the
client's meta_avail allocator and when swapping between the two
meta_avail and meta_free allocators.
We add the reserved metadata block count to statfs_more so that df can
subtract it from the free meta blocks and make it clear when enospc is
going to be returned for metadata allocations.
We increase the minimum device size in mkfs so that small testing
devices provide sufficient reserved blocks.
And finally we add a little test that makes sure we can fill both
metadata and data to ENOSPC and then recover by deleting what we filled.
Signed-off-by: Zach Brown <zab@versity.com>
Killing a task can end up in evict and break out of acquiring the locks
to perform final inode deletion. This isn't necessarily fatal. The
orphan task will come around and will delete the inode when it is truly
no longer referenced.
So let's silence the error and keep track of how many times it happens.
Signed-off-by: Zach Brown <zab@versity.com>
Orphaned items haven't been deleted for quite a while -- the call to the
orphan inode scanner has been commented out for ages. The deletion of
the orphan item didn't take rid zone locking into account as we moved
deletion from being strictly local to being performed by whoever last
used the inode.
This reworks orphan item management and brings back orphan inode
scanning to correctly delete orphaned inodes.
We get rid of the rid zone that was always _WRITE locked by each mount.
That made it impossible for other mounts to get a _WRITE lock to delete
orphan items. Instead we rename it to the orphan zone and have orphan
item callers get _WRITE_ONLY locks inside their inode locks. Now all
nodes can create and delete orphan items as they have _WRITE locks on
the associated inodes.
Then we refresh the orphan inode scanning function. It now runs
regularly in the background of all mounts. It avoids creating cluster
lock contention by finding candidates with unlocked forest hint reads
and by testing inode caches locally and via the open map before properly
locking and trying to delete the inode's items.
Signed-off-by: Zach Brown <zab@versity.com>
Previously we added a ilookup variant that ignored I_FREEING inodes
to avoid a deadlock between lock invalidation (lock->I_FREEING) and
eviction (I_FREEING->lock).
Now we're seeing similar deadlocks between eviction (I_FREEING->lock)
and fh_to_dentry's iget (lock->I_FREEING).
I think it's reasonable to ignore all inodes with I_FREEING set when
we're using our _test callback in ilookup or iget. We can remove the
_nofreeing ilookup variant and move its I_FREEING test into the
iget_test callback provided to both ilookup and iget.
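The shared test callback ends up along these lines (names are
illustrative):

    static int scoutfs_iget_test(struct inode *inode, void *arg)
    {
            const u64 *ino = arg;

            return scoutfs_ino(inode) == *ino &&
                   !(inode->i_state & I_FREEING);
    }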
Callers will get the same result, it will just happen without waiting
for a previously I_FREEING inode to leave. They'll get NULL from
ilookup instead of waiting. They'll allocate and start to initialize a
newer instance of the inode and insert it alongside the previous instance.
We don't have inode number re-use so we don't have the problem where a
newly allocated inode number is relying on inode cache serialization to
not find a previously allocated inode that is being evicted.
This change does allow for concurrent iget of an inode number that is
being deleted on a local node. This could happen in fh_to_dentry with a
raw inode number. But this was already a problem between mounts because
they don't have a shared inode cache to serialize them. Once we fix
that between nodes, we fix it on a single node as well.
Signed-off-by: Zach Brown <zab@versity.com>
We've had a long-standing deadlock between lock invalidation and
eviction. Invalidating a lock wants to lookup inodes and drop their
resources while blocking locks. Eviction wants to get a lock to perform
final deletion while the inode has I_FREEING set which blocks lookups.
We only saw this deadlock a handful of times in all of the time we've
run the code, but it's much more common now that we're acquiring
locks in iput to test that nlink is zero instead of only when nlink is
zero. I see unmount hang regularly when testing final inode deletion.
This adds a lookup variant for invalidation which will refuse to
return freeing inodes so they won't be waited on. Once they're freeing
they can't be seen by future lock users so they don't need to be
invalidated. This keeps the lock invalidation promise and avoids
sleeping on freeing inodes which creates the deadlock.
Signed-off-by: Zach Brown <zab@versity.com>
Previously we wouldn't try and remove cached dentries and inodes as
lock revocation removed cluster lock coverage. The next time
we tried to use the cached dentries or inodes we'd acquire
a lock and refresh them.
But now cached inodes prevent final inode deletion. If they linger
outside cluster locking then any final deletion will need to be deferred
until all its cached inodes are naturally dropped at some point in the
future across the cluster. It might take refreshing the dentries or for
memory pressure to push out the old cached inodes.
This tries to proactively drop cached dentries and inodes as we lose
cluster lock coverage if they're not actively referenced. We need to be
careful not to perform final inode deletion during lock invalidation
because it will deadlock, so we defer an iput which could delete during
evict out to async work.
Now deletion can be done synchronously in the task that is performing
the unlink because previous use of the inode on remote mounts hasn't
left unused cached inodes sitting around.
Signed-off-by: Zach Brown <zab@versity.com>
Today an inode's items are deleted once its nlink reaches zero and the
final iput is called in a local mount. This can delete inodes from
under other mounts which have opened the inode before it was unlinked on
another mount.
We fix this by adding cached inode tracking. Each mount maintains
groups of cached inode bitmaps at the same granularity as inode locking.
As a mount performs its final iput it gets a bitmap from the server
which indicates if any other mount has inodes in the group open.
This makes the two fast paths, opening and closing linked files and
deleting a file that was unlinked locally, only pay a moderate cost of
either maintaining the bitmap locally or getting the open map once
per lock group. Removing many files in a group will only lock and get
the open map once per group.
Signed-off-by: Zach Brown <zab@versity.com>
Add lock coverage which tracks if the inode has been refreshed and is
covered by the inode group cluster lock. This will be used by
drop_inode and evict_inode to discover that the inode is current and
doesn't need to be refreshed.
Signed-off-by: Zach Brown <zab@versity.com>