scoutfs

mirror of https://github.com/versity/scoutfs.git synced 2026-01-07 04:26:29 +00:00

Author	SHA1	Message	Date
Chris Kirby	d38e41cb57	Add the inode number to scoutfs_xattr_set traces. Signed-off-by: Chris Kirby <ckirby@versity.com>	2025-10-15 17:35:17 -05:00
Chris Kirby	a896984f59	Only start new quorum election after a receive failure It's possible for the quorum worker to be preempted for a long period, especially on debug kernels. Since we only check for how much time has passed, it's possible for a clean receive to inadvertently trigger an election. This can cause the quorum-heartbeat-timeout test to fail due to observed delays outside of the expected bounds. Instead, make sure we had a receive failure before comparing timestamps. Signed-off-by: Chris Kirby <ckirby@versity.com>	2025-10-15 17:35:17 -05:00
Chris Kirby	35bcad91a6	Close window where we can lose search items In finalize_and_start_log_merge(), we overwrite the server mount's log tree with its finalized form and then later write out its next open log tree. This leaves a window where the mount's srch_file is nulled out, causing us to lose any search items in that log tree. This shows up as intermittent failures in the srch-basic-functionality test. Eliminate this timing window by doing what unmount/reclaim does when it finalizes, by moving the resources from the item that we finalize into server trees/items as it finalizes. Then there is no window where those resources exist only in memory until we create another transaction. Signed-off-by: Chris Kirby <ckirby@versity.com>	2025-10-06 12:27:25 -05:00
Auke Kok	0b7b9d4a5e	Avoid trigger munching of block_remove_stale trigger. It's entirely likely that the trigger here is munched by a read on a dirty block from any unrelated or background read. Avoid that by putting the trigger at the end of the condition list. Now that the order is swapped, we have to avoid a null deref in block_is_dirty(bp) here, as well. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-10-06 12:27:25 -05:00
Auke Kok	f86a7b4d3c	Fully wait for orphan inode scan to complete. The issue with the previous attempt to fix the orphan-inodes test was that we would regularly exceed the 120s timeout value put in there. Instead, in this commit, we change the code to add a new counter to indicate orphan deletion progress. When orphan inodes are deleted, the increment of this counter indicates progress happened. Inversely, every time the counter doesn't increment, and the orphan scan attempts counter increments, we know that there was no more work to be done. For safety, we wait until 2 consecutive scan attempts were made without forward progress in the test case. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-10-06 12:27:25 -05:00
Chris Kirby	bb3e1f3665	Fix commit budget calculation with multiple holders The try_drain_data_freed() path was generating errors about overrunning its commit budget: scoutfs f.2b8928.r.02689f error: 1 holders exceeded alloc budget av: bef 8185 now 8036, fr: bef 8185 now 7602 The budget overrun check was using the current number of commit holders (in this case one) instead of the the maximum number of concurrent holders (in this case two). So even well behaved paths like try_drain_data_freed() can appear to exceed their commit budget if other holders dirty some blocks and apply their commits before the try_drain_data_freed() thread does its final budget reconciliation. Signed-off-by: Chris Kirby <ckirby@versity.com>	2025-10-06 12:27:25 -05:00
Chris Kirby	0d262de4ac	Fix dirtied block calculation in extent_mod_blocks() Free extents are stored in two btrees: one sorted by block number, one by size. So if you insert a new extent between two existing extents, you can be modifying two items in the by-block-number tree. And depending on the size of those items, that can result in three items over in the -by-size tree. So that's a 5x multiplier per level. If we're shrinking the tree and adding more freed blocks, we're conceptually dirtying two blocks at each level to merge. (current 2 in the code). But if they fall under the low water mark then one of them is freed, so we can have 3 per level in this case. Signed-off-by: Chris Kirby <ckirby@versity.com>	2025-10-06 12:27:25 -05:00
Chris Kirby	3f786596e0	Don't overrun the block budget in server_log_merge_free_work(). This fixes a potential fence post failure like the following: error: 1 holders exceeded alloc budget av: bef 7407 now 7392, fr: bef 8185 now 7672 The code is only accounting for the freed btree blocks, not the dirtying of other items. So it's possible to be at exactly (COMMIT_HOLD_ALLOC_BUDGET / 2), dirty some log btree blocks, loop again, then consume another (COMMIT_HOLD_ALLOC_BUDGET / 2) and blow past the total budget. In this example, we went over by 13 blocks. By only consuming up to 1/8 of the budget on each loop, and committing when we have consumed 3/4 of the budget, we can avoid the fence post condition. Signed-off-by: Chris Kirby <ckirby@versity.com>	2025-10-06 12:27:25 -05:00
Zach Brown	e088424d70	Add initial filters for warnings in distro source Add a chunk of filters for sparse warnings that trigger on distro kernel source. Signed-off-by: Zach Brown <zab@versity.com>	2025-10-03 09:35:36 -07:00
Zach Brown	d0cf026298	Require sparse, and filter kernel sparse output Fail the build if we don't check with sparse in both the kernel and userspace utils. Add a filtering wrapper to the kernel build so that we have a place to filter out uninteresting errors from kernel sources that we're building against. Signed-off-by: Zach Brown <zab@versity.com>	2025-10-03 09:35:36 -07:00
Zach Brown	03fa1ce7c5	Avoid bad sparse warning in lock_invalidate() This is another example of refactoring a loop to avoid sparse warnings from doing something in the else of a failed trylock if. We want to drop and reacquire the lock if the trylock fails so we do it every loop iteration. This shouldn't be experiencing much contention because most of the cov users are usually done under locks and invalidation has excluded lock holders. So the additional lock and unlock noise should be local. Signed-off-by: Zach Brown <zab@versity.com>	2025-10-03 09:35:36 -07:00
Zach Brown	3d9f10de93	Work around sparse warning in _item_write_done scoutfs_item_write_done() acquires the cinf dirty_lock and pg rwlock out of order. It uses a trylock to detect failure and back off of both before retrying. sparse seems to have some peculiar sensitivity to following the else branch from a failed trylock while already in a context. Doing that consistently triggered the spurious mismatched context warning. This refactors the loop to always drop and reacquire the dirty_lock after attemping the trylock. It's not great, but this shouldn't be very contended because the transaction write has serialized write lock holderse that would be trying to dirty items. The silly lock noise will be mostly cached. Signed-off-by: Zach Brown <zab@versity.com>	2025-10-03 09:34:23 -07:00
Auke Kok	2f48a606e8	Fix -Wmaybe-uninitalized since rhel9.5 Looks like the compiler isn't smart enough to understand the pass by pointer value, and we can initialize it here easily. make[1]: Entering directory '/usr/src/kernels/5.14.0-503.26.1.el9_5.x86_64' CC [M] /home/auke/scoutfs/kmod/src/server.o /home/auke/scoutfs/kmod/src/server.c: In function ‘fence_pending_recov_worker’: /home/auke/scoutfs/kmod/src/server.c:4170:23: error: ‘addr.v4.addr’ may be used uninitialized in this function [-Werror=maybe-uninitialized] 4170 \| ret = scoutfs_fence_start(sb, rid, le32_to_be32(addr.v4.addr), \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 4171 \| SCOUTFS_FENCE_CLIENT_RECOVERY); \| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ cc1: all warnings being treated as errors There's still the obvious issue here that we'd intended to support ipv6 but just disregard that here. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-05-08 15:20:50 -07:00
Zach Brown	888b1394a6	Retry client commit and get log trees separately The client transaction commit worker has a series of functions that it calls to commit the current transaction and open the next one. If any of them fail, it retries all of them from the beginning each time until they all succeed. This pattern behaves badly since we added the strict get_trans_seq and commit_trans_seq latching in the log_trees. The server will only commit the items for a get or commit request once, and will fail a commit request if it isn't given the seq that matches the current item. If the server gets an error it can have persisted items while sending an error to the client. If this error was for a get request, then the client will retry all of its transaction write functions. This includes the commit request which is now using a stale seq and will fail indefinitely. This is visible in the server log as: error -5 committing client logs for rid e57e37132c919c4f: invalid log trees item get_trans_seq The solution is to retry the commit and get phases independently. This way a failed get will be retried on its own without running through the commit phase that had succeeded. The client will eventually get the next seq that it can then safely commit. Signed-off-by: Zach Brown <zab@versity.com>	2025-04-29 11:46:38 -07:00
Zach Brown	e457694f19	Don't send dirty data_freed blocks to client At the end of get_log_trees we can try and drain the data_freed extent tree, which can take multiple commits. If a commit fails then the blocks are still dirty in memory. We can't send references to those blocks to the client. We have to return an error and not send the log_trees, like the main get_log_trees does. The client will retry and eventually get a log_trees that references blocks that were successfully committed. Signed-off-by: Zach Brown <zab@versity.com>	2025-04-29 11:46:38 -07:00
Zach Brown	609fc56cd6	Merge pull request #203 from versity/auke/new_inode_ctime Fix new_inode ctime assignment.	2025-02-25 15:23:16 -08:00
Auke Kok	e3e2cfceec	Fix new_inode ctime assignment. Very old copy/paste bug here, we want to update new_inode's ctime instead. old_inode already is updated. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-02-18 13:15:49 -05:00
Auke Kok	e9d147260c	Fix ctx->pos updating to properly handle dent gaps We need to assure we're emitting dents with the proper position and we already have them as part of our dent. The only caveat is to increment ctx->pos once beyond the list to make sure the caller doesn't call us once more. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-01-27 14:49:04 -05:00
Auke Kok	6c85879489	Assert unlock doesn't underflow lock user count. While debugging a double unlock error we hit this condition and debugging would have been a lot easier had we enforced this simple constraint that we can't decrement the lock users count if it's already 0. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-01-27 14:49:04 -05:00
Auke Kok	8b76a53cf3	Avoid cluster locking while put_user() in _allocated_inos. Similar to fiemap, readdir and walk_inodes, this method could have put_user during a page fault, causing potentially a deadlock. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-01-27 14:49:04 -05:00
Auke Kok	e76a171c40	Avoid faulting while cluster locked in _walk_inodes. Similar to readdir and fiemap vfs methods, we can't copy to user while holding cluster locks. The previous comment about it being safe no longer applies, and this could deadlock. Rewrite the loop to iterate and store entries in a page, then flush the page contents while not holding a clusterlock. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-01-27 14:49:04 -05:00
Auke Kok	8cb08507d6	Do not copy to user while holding locks in scoutfs_data_fiemap() Now that we support mmap writes, at any point in time we could pagefault and lock for writes. That means - just like readdir - we can no longer lock and copy_to_user, since it also may page fault and thus deadlock. We statically allocate 32 extent entries on the stack and use these to shuffle out fiemap entries at a time, locking and unlocking around collecting and fiemap_fill_extent_next. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-01-27 14:49:04 -05:00
Auke Kok	cad12d5ce8	Avoid deadlock in _readdir() due to copy_to_user(). dir_emit() will copy_to_user, which can pagefault. If this happens while cluster locked, we could deadlock. We use a single page to stage dir_emit data, and iterate between fetching dirents while locked, and emitting them while not locked. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-01-27 14:49:04 -05:00
Auke Kok	1bcd1d4d00	Drop readdir pre-.iterate() compat (el7.5ish). These 2 sections of compat for readdir are wholly obsolete and can be hard dropped, which restores the method to look like current upstream code. This was added in `ddd1a4e`. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-01-23 14:28:40 -05:00
Auke Kok	b944f609aa	remap_pages ops becomes obsolete.	2025-01-23 14:28:40 -05:00
Auke Kok	519b47a53c	mmap() trace events. We merely trace exit values and position, and ignore length. Because vm_fault_t is __bitwise, sparse will loudly complain about a plain cast to u32, so we must __force (on el8). ret will be 512 in normal cases. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-01-23 14:28:40 -05:00
Benjamin LaHaise	3788d67101	Add support for writable shared mmap()ings Add support for writable MAP_SHARED mmap()ings. Avoid issues with late writepage()s building transactions by doing the block_write_begin() work in scoutfs_data_page_mkwrite(). Ensure the page is marked dirty and prepared for write, then let the VM complete the write when the page is flushed or invalidated. Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-01-23 14:28:40 -05:00
Benjamin LaHaise	b7a3d03711	Add support for read only mmap() Adds the required memory mapped ops struct and page fault handler for reads. Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-01-23 14:28:40 -05:00
Zach Brown	3e71f49260	Merge pull request #195 from versity/auke/el9_5 RHEL9.5 kernel support	2024-12-03 14:27:57 -08:00
Greg Cymbalski	110d5ea0d5	Block EL9 minor version upgrades Since kABI migrations across minor versions is a thing of the past going forward, we now: - Detect if we're on EL9 - If so, add a requirement on the various flavors of release package to that specific major.minor version This appropriately does not allow upgrades across minor versions.	2024-12-02 16:04:24 -08:00
Auke Kok	669de459a7	bdev_open_by_path is now removed as well. Additional blkdev/bdev changes now cause this call to be removed as well resulting in us having to use yet another API to do the same for el9_5. The changes are a little more subtle as now the bdev_mount() call passes a custom bd_holder_ops that we must match or else throw a WARN_ON, so we switch to using sbi as our holder arg instead. Make sure to bdev_fput and not fput, since we don't want to have our private data cleanup deferred, failing xfstests generic/604. Signed-off-by: Auke Kok <auke.kok@versity.com>	2024-11-27 18:52:39 -08:00
Auke Kok	621271f8cf	backing_dev_info is entirely removed. The assignments to it is no longer needed at all. All references can be dropped since v6.4-rc4-163-g0d625446d0a4. Signed-off-by: Auke Kok <auke.kok@versity.com>	2024-11-08 13:32:21 -05:00
Auke Kok	d1092cdbe9	current_time() is no longer extern. Since v6.5-rc1-7-g9b6304c1d537, current_time() is no longer extern, so we need to update this grep regex to continue to match. Signed-off-by: Auke Kok <auke.kok@versity.com>	2024-11-08 13:21:03 -05:00
Zach Brown	19e78c32fc	Allow null lock compatibility between nodes Right now a client requesting a null mode for a lock will cause invalidations of all existing granted modes of the lock across the cluster. This is unneccessarily broad. The absolute requirement is that a null request invalidates other existing granted modes on the client. That's how the client safely resolves shrinking's desire to free locks while the locks are in use. It relies on turning it into the race between use and remote invalidation. But that only requires invalidating existing grants from the requesting client, not all clients. It is always safe for null grants to coexist with all grants on other clients. Consider the existing mechanics involving null modes. First, null locks are instatiated on the client before sending any requests at all. At any given time newly allocated null locks are coexisting with all existing locks across the cluster. Second, the server frees the client entry tracking struct the moment it sends a null grant to the client. From that point on the client's null lock can not have any impact on the rest of the lock holders because the server has forgotten about it. So we add this case to the server's test that two client lock modes are compatible. We take the opportunity to comment the heck out of this function instead of making it a dense boolean composition. The only functional change is the addition of this case, the existing cases are refactored but unchanged. Signed-off-by: Zach Brown <zab@versity.com>	2024-10-31 15:34:59 -07:00
Zach Brown	8c1a45c9f5	Use bools instead of weird addition as or in net When freeing acked reesponses in the net layer we sweep the send and resend queues looking for queued responses up to the sequence number we've had acked. The code that did this used a weird pattern of returning ints and adding them which gave me pause. Clean it up to use bools and or (not short-circuiting \|\|) to more obviously communicate what's going on. Signed-off-by: Zach Brown <zab@versity.com>	2024-10-30 13:38:12 -07:00
Zach Brown	5a6eb569f3	Add some lock debugging trace fields Over time some fields have been added to the lock struct which haven't been added to the lock tracing output. Add some of the more relevant lock fields to tracing. Signed-off-by: Zach Brown <zab@versity.com>	2024-10-30 13:16:04 -07:00
Zach Brown	69d9040e68	Close lock server use-after-free race Lock object lifetimes in the lock server are protected by reference counts. References are acquired while holding a lock on an rbtree. Unfortunately, the decision to free lock objects wasn't tested while also holding that lock on the rbtree. A caller putting their object would test the refcount, then wait to get the rbtree lock to remove it from the tree. There's a possible race where the decision is made to remove the object but another reference is added before the object is removed. This was seen in testing and manifest as an incoming request handling path adding a request message to the object before it is freed, losing the message. Clients would then hang on a lock that never saw a response because their request was freed with the lock object. The fix is to hold the rbtree lock when testing the refcount and deciding to free. It adds a bit more contention but not significantly so, given the wild existing contention on a per-fs spinlocked rbtree. Signed-off-by: Zach Brown <zab@versity.com>	2024-10-30 13:04:13 -07:00
Greg Cymbalski	70c36ae394	Generate debug packages We had previously explicitly disabled this; let's start generating them.	2024-10-24 14:56:09 -07:00
Auke Kok	235ab133a7	We must provide a_ops->dirty_folio and invalidate_folio. In v5.17-rc4-53-g3a3bae50af5d, we can no longer omit having this method unhooked as the mm caller blindly calls it now. In-kernel filesystems all were fixed in this change. aops->invalidatepage was the old aops method that would free pages with private attached data. This method is replaced with the new invalidate_folio method. If this method is NULL, the memory will become orphaned. (v5.17-rc4-29-gf50015a596fa) Signed-off-by: Auke Kok <auke.kok@versity.com>	2024-10-03 15:38:34 -07:00
Auke Kok	7d0e7e29f8	Avoid integer wrapping pitfalls for (off, len) pairs. We use check_add_overflow(a, b, d) here to validate that (off, len) pairs do not exceed the max value type. The kernel conveniently has several macros to sort out the problems with signed or unsigned types. However, we're not interested in purely seeing whether (a + b) overflows, because we're using this for (off, len) overflow checks, where the bytes we read are from 0 to len -1. We must therefore call this check with (b) being "len - 1". I've made sure that we don't accidentally fail when (len == 0) in all cases by making sure we've already checked this condition before, and moving code around as needed to ensure that (len > 0) in all cases where we check. The macro check_add_overflow requires a (d) argument in which temporarily the result of the addition is stored and then checked to see if an overflow occurred. We put a `tmp` variable on the stack of the correct type as needed to make the checks function. simple-release-extents test mistakenly relied on this buggy wrap code, so it needs fixing. The move-blocks test also got it wrong. Signed-off-by: Auke Kok <auke.kok@versity.com>	2024-10-03 12:41:05 -07:00
Auke Kok	69de6d7a74	Check for zero len in scoutfs_data_wait_check We consistently enter scoutfs_data_wait_check when len == 0 from scoutfs_aio_write() which directly passes the i_size_read() value, and for cases where we `echo >> $FILE` this is always reached. This can cause the wrapping check to fail since `0 + (0 - 1) < 0` which triggers the WARN_ON_ONCE wrap check that needs updating to allow certain operations on huge files. More importantly we can just omit all these checks if `len == 0` anyway, since they should always succeed and never should require taking all the locks. Signed-off-by: Auke Kok <auke.kok@versity.com>	2024-10-03 12:41:05 -07:00
Auke Kok	c298360a49	blkdev api changes - pass holder to replace FMODE_EXCL Passing a holder ptr to these functions now replaces the FMODE_EXCL flag. _put no longer needs flags for this reason, but the holder instead. Signed-off-by: Auke Kok <auke.kok@versity.com>	2024-10-03 12:41:05 -07:00
Auke Kok	95f4e56546	Introduce blk_mode_t instead of abuse of fmode_t v6.4-rc2-198-g05bdb9965305 adds a new type for passing flags instead of abusing fmode_t flags. They are essentially the same flags just in a new type. Signed-off-by: Auke Kok <auke.kok@versity.com>	2024-10-03 12:41:05 -07:00
Auke Kok	d5c2768f04	.tmpfile method now passed a struct file, which must be opened. v6.0-rc6-9-g863f144f12ad changes the VFS method to pass in a struct file and not a dentry in preperation for tmpfile support in fuse. Signed-off-by: Auke Kok <auke.kok@versity.com>	2024-10-03 12:41:05 -07:00
Auke Kok	676d429264	Assume el9 is the same as el8 for rpmbuild purposes. The current spec template can't handle future major el releases gracefully and fails to build entirely. We isolate all changes so that they are either "el7 specific" or generic. This rids us entirely of el8 specific conditionals. Signed-off-by: Auke Kok <auke.kok@versity.com>	2024-10-03 12:41:05 -07:00
Auke Kok	5b260e6b54	block_write_begin() no longer is being passed aop_flags. The flag is now obsolete, we don't need to set flags here or pass them. Signed-off-by: Auke Kok <auke.kok@versity.com>	2024-10-03 12:41:05 -07:00
Auke Kok	e2b06f2c92	mpage_readpage() is now replaced with mpage_read_folio. Folios are the new data types used for passing pages. For now, folios only appear to have a single page. Future kernels will change that. Signed-off-by: Auke Kok <auke.kok@versity.com>	2024-10-03 12:41:05 -07:00
Auke Kok	546b437df7	Shrinkers are now registered with a name. v5.19-rc4-52-ge33c267ab70d Adds shrinker names to the registration call to aid with shrinker debugging, which is highly opaque. To enable you'll have to recompile the kernel with CONFIG_SHRINKER_DEBUG=y though, since it's disabled by default in OSV kernels. Signed-off-by: Auke Kok <auke.kok@versity.com>	2024-10-03 12:41:05 -07:00
Auke Kok	381f4543b7	Use iter based read/write to support splice and thus sendfile(). The iter based read/write calls can support splice in el9 if we hook up these calls, otherwise splice will stop working. ->write() similar to: v3.15-rc4-330-g8d0207652cbe. ->read() to generic implementation. Signed-off-by: Auke Kok <auke.kok@versity.com>	2024-10-03 12:41:05 -07:00
Auke Kok	418a441604	kernel_setsockopt no longer available. We instead opt to use sock_setsockopt which is generally exactly the same and can be easily converted to map to kernel_setsockopt without impacting the code significantly. There are 3 methods we're calling with usec timeval's, and that is significantly different now that this requires a bit more compat code so we split these out to separate compat functions to handle them. Some of the TCP sock functions also have a slightly different signature that we want to split them out (struct socket vs. sock). Some further no longer return success, either. Signed-off-by: Auke Kok <auke.kok@versity.com>	2024-10-03 12:41:05 -07:00

1 2 3 4 5 ...

1372 Commits