Add kernelcompat helpers for initial use of list_lru for shrinking. The
most complicated part is the change in the walk callback type.
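A sketch of the shape of the compat, assuming a hypothetical
KC_LRU_WALK_CB_HAS_LOCK symbol that detects the older callback
signature which still carried the lru spinlock:

    /* one macro declares the isolate callback for both signatures */
    #ifdef KC_LRU_WALK_CB_HAS_LOCK
    #define KC_LRU_ISOLATE_PROTO(name)                                \
            enum lru_status name(struct list_head *item,              \
                                 struct list_lru_one *list,           \
                                 spinlock_t *lock, void *cb_arg)
    #else
    #define KC_LRU_ISOLATE_PROTO(name)                                \
            enum lru_status name(struct list_head *item,              \
                                 struct list_lru_one *list,           \
                                 void *cb_arg)
    #endif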
Signed-off-by: Zach Brown <zab@versity.com>
Readers can read a set of items that is stale with respect to items that
were dirtied and written under a local cluster lock after the read
started.
The active reader mechanism addressed this by refusing to shrink pages
that could contain items that were dirtied while any readers were in
flight. Under the right circumstances this can result in refusing to
shrink quite a lot of pages indeed.
This changes the mechanism to allow pages to be reclaimed, and instead
forces stale readers to retry. The gamble is that reads are much faster
than writes. A small fraction should have to retry, and when they do
they can be satisfied by the block cache.
Signed-off-by: Zach Brown <zab@versity.com>
It's possible that scoutfs_net_alloc_conn() fails due to -ENOMEM, which
is legitimately a failure, thus the code here releases the sock again.
But the code block here sets `ret = -ENOMEM` and then restarts the loop,
which immediately sets `ret = kernel_accept()`, thus overwriting the
-ENOMEM error value.
We can argue that an -ENOMEM error situation here is not catastrophic.
If this is the first time we're hitting -ENOMEM while trying to accept
a new client, we can just release the socket and wait for the client to
try again. If the kernel at that point is still out of memory to handle
the new incoming connection, that will then cascade down and clean up
the whole listener at that point.
The alternative is to let this error path unwind out and break down the
listener immediately, something the code today doesn't do. We're
therefore keeping the behavior the same.
I've opted to replace the `ret = -ENOMEM` assignment with a comment
explaining why we're ignoring the error situation here.
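A sketch of the resulting accept loop, with the surrounding details
assumed:

    while (!stopping) {
            ret = kernel_accept(listen_sock, &acc_sock, 0);
            if (ret < 0)
                    break;

            conn = scoutfs_net_alloc_conn(sb, ...);
            if (!conn) {
                    /* drop the socket and let the client retry; a
                     * persistent -ENOMEM will eventually fail the
                     * accept itself and tear down the listener */
                    sock_release(acc_sock);
                    continue;
            }
            ...
    }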
Signed-off-by: Auke Kok <auke.kok@versity.com>
If scoutfs_send_omap_response fails for any reason, req is NULL and we
would hit a hard NULL deref during unwinding.
Signed-off-by: Auke Kok <auke.kok@versity.com>
This function returns a stack pointer to a struct scoutfs_extent, after
setting start and len to an extent found in the proper zone, but it
leaves the map and flags members unset.
Initializing the struct to {0,} avoids passing uninitialized values up
the call stack.
Signed-off-by: Auke Kok <auke.kok@versity.com>
Several of the inconsistency error paths already correctly `goto out`
but this one has a `break`. This would result in doing a whole lot of
work on corrupted data.
Make this error path go to `out` instead as the others do.
Signed-off-by: Auke Kok <auke.kok@versity.com>
In these two error conditions we explicitly set `ret = -EIO` but then
`break`, which immediately sets `ret = 0` again, masking a critical
error code that should be returned.
Instead, `goto out` retains the -EIO error value for the caller.
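The shape of the fix, with the surrounding code assumed:

    if (bad) {
            ret = -EIO;
            goto out;       /* was: break, falling through to ret = 0 */
    }
    ...
    out:
            return ret;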
Signed-off-by: Auke Kok <auke.kok@versity.com>
The value of `ret` is not initialized. If the writeback list is empty,
or if igrab() fails on the only inode on the list, the value of `ret`
is returned without being initialized. This would cause the caller to
needlessly retry, possibly making things worse.
Signed-off-by: Auke Kok <auke.kok@versity.com>
We shouldn't copy the entire _dirent struct and then copy in the name
again right after; just stop at offsetof(struct, name).
Now that we're no longer copying the uninitialized name[3] from ent,
the possible 1-byte leak here is gone, too.
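A sketch of the copy, with the struct and buffer names assumed:

    /* copy only the fixed header, then exactly the name bytes we have */
    memcpy(dent, ent, offsetof(struct scoutfs_dirent, name));
    memcpy(dent->name, name, name_len);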
Signed-off-by: Auke Kok <auke.kok@versity.com>
Ensure that we reschedule even if this happens. Maybe it'll recover. If
not, we'll have other issues elsewhere first.
Signed-off-by: Auke Kok <auke.kok@versity.com>
ARRAY_SIZE(...) will return `3` for this array with members from 0 to 2,
therefore arr[3] is out of bounds. The array length test is off by one
and needs fixing.
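The shape of the fix, with names assumed:

    /* valid indices are 0 .. ARRAY_SIZE(arr) - 1 */
    if (index >= ARRAY_SIZE(arr))   /* was: index > ARRAY_SIZE(arr) */
            return -EINVAL;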
Signed-off-by: Auke Kok <auke.kok@versity.com>
This removes the KC_MSGHDR_STRUCT_IOV_ITER kernel compat.
kernel_{send,recv}msg() initializes either msg_iov or msg_iter.
This isn't a clean revert of "69068ae2 Initialize msg.msg_iter from
iovec." because previous patches fixed the order of arguments, and the
net send caller was removed.
Signed-off-by: Zach Brown <zab@versity.com>
Previous work had the receiver try to receive multiple messages in bulk.
This does the same for the sender.
We walk the send queue and initialize a vector that we then send with
one call. This is intentionally similar to the single message sending
pattern to avoid unintended changes.
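A sketch of the vectored send, with the queue and header names assumed:

    struct kvec kv[NR_BATCH];
    struct msghdr msg = { .msg_flags = MSG_NOSIGNAL };
    size_t bytes = 0;
    int n = 0;

    list_for_each_entry(msend, &conn->send_queue, head) {
            if (n == NR_BATCH)
                    break;
            kv[n].iov_base = &msend->nh;
            kv[n].iov_len = msend->len;
            bytes += msend->len;
            n++;
    }

    /* one socket call sends the whole batch */
    ret = kernel_sendmsg(sock, &msg, kv, n, bytes);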
Along with the changes to receive in bulk, this ended up increasing the
message processing rate by about 6x when both send and receive were
going full throttle.
Signed-off-by: Zach Brown <zab@versity.com>
When the msg_iter compat was added, the iter was initialized with nr_segs
and count swapped. I'm not convinced this had any effect because the
kernel_{send,recv}msg() call would initialize msg_iter again with the
correct arguments.
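For reference, the correct order, with the direction flag spelled
differently across kernel versions:

    /* nr_segs (1) comes before the total byte count (len) */
    iov_iter_kvec(&msg.msg_iter, READ, &kv, 1, len);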
Signed-off-by: Zach Brown <zab@versity.com>
Our messaging layer is used for small control messages, not large data
payloads. By calling recvmsg twice for every incoming message we're
hitting the socket lock reasonably hard. With senders doing the same,
and a lot of messages flowing in each direction, the contention is
non-trivial.
This changes the receiver to copy as much of the incoming stream as it
can into a page, which is then framed and copied again into individual
allocated messages that can be processed concurrently. We're avoiding
contention with the sender on the socket at the cost of additional
copies of our small messages.
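A sketch of the single-call receive into a staging page, with the
framing details elided:

    struct kvec kv = {
            .iov_base = page_address(page),
            .iov_len = PAGE_SIZE,
    };
    struct msghdr msg = { .msg_flags = MSG_DONTWAIT };

    /* one socket call can pull in many small messages at once */
    ret = kernel_recvmsg(sock, &msg, &kv, 1, kv.iov_len, msg.msg_flags);

    /* then walk the page, framing each message and copying it out
     * into its own allocation for concurrent processing */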
Signed-off-by: Zach Brown <zab@versity.com>
The lock client can't handle some messages being processed out of
order. Previously it had detected message ordering itself, but had
missed some cases. Receive processing was then
changed to always call lock message processing from the recv work to
globally order all lock messages.
This inline processing was contributing to excessive latencies in making
our way through the incoming receive queue, delaying work that would
otherwise be parallel once we got it off the recv queue.
This was seen in practice as a giant flood of lock shrink messages
arrived at the client. It processed each in turn, starving a statfs
response long enough to trigger the hung task warning.
This fix does two things.
First, it moves ordered recv processing out of the recv work. It lets
the recv work drain the socket quickly and turn it into a list that the
ordered work is consuming. Other messages will have a chance to be
received and queued to their processing work without having to wait for
the ordered work to be processed.
Second, it adds parallelism to the ordered processing. The incoming
lock messages don't need global ordering; they need ordering within
each lock. We add an arbitrary but reasonable number of ordered workers
and hash lock messages to each worker based on the lock's key.
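A sketch of the dispatch, with the worker structures assumed:

    /* the same lock key always hashes to the same ordered worker, so
     * per-lock ordering holds while the workers run in parallel */
    u32 h = jhash(&nh->key, sizeof(nh->key), 0);
    struct ordered_worker *ow = &conn->ordered[h % NR_ORDERED_WORKERS];

    spin_lock(&ow->lock);
    list_add_tail(&msg->entry, &ow->list);
    spin_unlock(&ow->lock);
    queue_work(conn->workq, &ow->work);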
Signed-off-by: Zach Brown <zab@versity.com>
Make sure to log an error if the SCOUTFS_QUORUM_EVENT_END
update_quorum_block() call fails in scoutfs_quorum_worker().
Correctly print if the reader or writer failed when logging errors
in update_quorum_block().
Signed-off-by: Chris Kirby <ckirby@versity.com>
During log compaction, the SRCH_COMPACT_LOGS_PAD_SAFE trigger was
generating inode numbers that were not in sorted order. This resulted
in later failures during srch-basic-functionality, because we were
winding up with out of order first/last pairs and merging incorrectly.
Instead, reuse the single entry in the block repeatedly, generating
zero-padded pairs of this entry that are interpreted as create/delete
and vanish during searching and merging. These aren't encoded in the
normal way, but the extra zeroes are ignored during the decoding phase.
Signed-off-by: Chris Kirby <ckirby@versity.com>
Make sure that the orphan scanners can see deletions after forced
unmounts by waiting for reclaim_open_log_tree() to run on each mount,
and by waiting for finalize_and_start_log_merge() to run and not find
any finalized trees.
Do this by adding two new counters, reclaimed_open_logs and
log_merge_no_finalized, and by fixing the orphan-inodes test to check
those before waiting for the orphan scanners to complete.
Signed-off-by: Chris Kirby <ckirby@versity.com>
Tests such as quorum-heartbeat-timeout were failing with EIO messages in
dmesg output due to expected errors during forced unmount. Use ENOLINK
instead, and filter all errors from dmesg with this errno (67).
Signed-off-by: Chris Kirby <ckirby@versity.com>
The iput worker can accumulate quite a bit of pending work to do. We've
seen hung task warnings while it's doing its work (admittedly in debug
kernels). There's no harm in throwing in a cond_resched so other tasks
get a chance to do work.
Signed-off-by: Zach Brown <zab@versity.com>
It's possible for the quorum worker to be preempted for a long period,
especially on debug kernels. Since we only check for how much time
has passed, it's possible for a clean receive to inadvertently
trigger an election. This can cause the quorum-heartbeat-timeout
test to fail due to observed delays outside of the expected bounds.
Instead, make sure we had a receive failure before comparing timestamps.
Signed-off-by: Chris Kirby <ckirby@versity.com>
In finalize_and_start_log_merge(), we overwrite the server
mount's log tree with its finalized form and then later write out
its next open log tree. This leaves a window where the mount's
srch_file is nulled out, causing us to lose any search items in
that log tree.
This shows up as intermittent failures in the srch-basic-functionality
test.
Eliminate this timing window by doing what unmount/reclaim does: as the
item is finalized, move its resources into server trees/items. Then
there is no window where those resources exist only in memory until we
create another transaction.
Signed-off-by: Chris Kirby <ckirby@versity.com>
It's entirely likely that the trigger here is consumed by a read on a
dirty block from any unrelated or background read. Avoid that by
putting the trigger at the end of the condition list.
Now that the order is swapped, we also have to avoid a null deref in
block_is_dirty(bp) here.
Signed-off-by: Auke Kok <auke.kok@versity.com>
The issue with the previous attempt to fix the orphan-inodes test was
that we would regularly exceed the 120s timeout value put in there.
Instead, in this commit, we change the code to add a new counter that
indicates orphan deletion progress. When orphan inodes are deleted, the
increment of this counter indicates that progress happened. Conversely,
every time the counter doesn't increment while the orphan scan attempts
counter does, we know that there was no more work to be done.
For safety, we wait until 2 consecutive scan attempts are made without
forward progress in the test case.
Signed-off-by: Auke Kok <auke.kok@versity.com>
The try_drain_data_freed() path was generating errors about overrunning
its commit budget:
scoutfs f.2b8928.r.02689f error: 1 holders exceeded alloc budget av: bef 8185 now 8036, fr: bef 8185 now 7602
The budget overrun check was using the current number of commit holders
(in this case one) instead of the maximum number of concurrent holders
(in this case two). So even well behaved paths like
try_drain_data_freed() can appear to exceed their commit budget if
other holders dirty some blocks and apply their commits before the
try_drain_data_freed() thread does its final budget reconciliation.
Signed-off-by: Chris Kirby <ckirby@versity.com>
Free extents are stored in two btrees: one sorted by block number, one
by size. So if you insert a new extent between two existing extents, you can
be modifying two items in the by-block-number tree. And depending on
the size of those items, that can result in three items over in the
by-size tree. So that's a 5x multiplier per level.
If we're shrinking the tree and adding more freed blocks, we're
conceptually dirtying two blocks at each level to merge (currently *2
in the code). But if they fall under the low water mark then one of
them is freed, so we can have *3 per level in this case.
Signed-off-by: Chris Kirby <ckirby@versity.com>
This fixes a potential fence post failure like the following:
error: 1 holders exceeded alloc budget av: bef 7407 now 7392, fr: bef 8185 now 7672
The code is only accounting for the freed btree blocks, not the dirtying of
other items. So it's possible to be at exactly (COMMIT_HOLD_ALLOC_BUDGET / 2),
dirty some log btree blocks, loop again, then consume another
(COMMIT_HOLD_ALLOC_BUDGET / 2) and blow past the total budget.
In this example, we went over by 13 blocks.
By only consuming up to 1/8 of the budget on each loop, and committing when we
have consumed 3/4 of the budget, we can avoid the fence post condition.
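With a 1/8 step and a 3/4 commit threshold the worst case crosses the
threshold by one full step, 3/4 + 1/8 = 7/8 of the budget, still safely
under the total. A sketch of the loop bounds, with the helpers assumed:

    while (have_extents) {
            count = min(count, COMMIT_HOLD_ALLOC_BUDGET / 8);
            free_extents(count);    /* also dirties log btree blocks */
            if (consumed >= COMMIT_HOLD_ALLOC_BUDGET * 3 / 4)
                    break;          /* commit and re-enter */
    }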
Signed-off-by: Chris Kirby <ckirby@versity.com>
Fail the build if we don't check with sparse in both the kernel and
userspace utils. Add a filtering wrapper to the kernel build so that we
have a place to filter out uninteresting errors from kernel sources that
we're building against.
Signed-off-by: Zach Brown <zab@versity.com>
This is another example of refactoring a loop to avoid sparse warnings
from doing work in the else branch of a failed trylock. We want to
drop and reacquire the lock if the trylock fails, so we do it on every
loop iteration. This shouldn't experience much contention because most
of the cov users are usually done under locks and invalidation has
excluded lock holders, so the additional lock and unlock noise should
be local.
Signed-off-by: Zach Brown <zab@versity.com>
scoutfs_item_write_done() acquires the cinf dirty_lock and pg rwlock out
of order. It uses a trylock to detect failure and back off of both
before retrying.
sparse seems to have some peculiar sensitivity to following the else
branch from a failed trylock while already in a context. Doing that
consistently triggered the spurious mismatched context warning.
This refactors the loop to always drop and reacquire the dirty_lock
after attempting the trylock. It's not great, but this shouldn't be
very contended because the transaction write has serialized the write
lock holders that would be trying to dirty items. The silly lock noise
will be mostly cached.
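A sketch of the refactored loop, with the lock types assumed from the
description:

    for (;;) {
            spin_lock(&cinf->dirty_lock);
            if (write_trylock(&pg->rwlock))
                    break;
            /* always drop and retake rather than branching in an
             * else, which keeps sparse's context tracking happy */
            spin_unlock(&cinf->dirty_lock);
    }
    /* both locks held here */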
Signed-off-by: Zach Brown <zab@versity.com>
Looks like the compiler isn't smart enough to understand the
pass-by-pointer value, and we can initialize it here easily.
make[1]: Entering directory '/usr/src/kernels/5.14.0-503.26.1.el9_5.x86_64'
CC [M] /home/auke/scoutfs/kmod/src/server.o
/home/auke/scoutfs/kmod/src/server.c: In function ‘fence_pending_recov_worker’:
/home/auke/scoutfs/kmod/src/server.c:4170:23: error: ‘addr.v4.addr’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
4170 | ret = scoutfs_fence_start(sb, rid, le32_to_be32(addr.v4.addr),
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
4171 | SCOUTFS_FENCE_CLIENT_RECOVERY);
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors
There's still the obvious issue that we'd intended to support ipv6;
we just disregard that here.
Signed-off-by: Auke Kok <auke.kok@versity.com>
The client transaction commit worker has a series of functions that it
calls to commit the current transaction and open the next one. If any
of them fail, it retries all of them from the beginning each time until
they all succeed.
This pattern behaves badly since we added the strict get_trans_seq and
commit_trans_seq latching in the log_trees. The server will only commit
the items for a get or commit request once, and will fail a commit
request if it isn't given the seq that matches the current item.
If the server gets an error it can have persisted items while sending an
error to the client. If this error was for a get request, then the
client will retry all of its transaction write functions. This includes
the commit request which is now using a stale seq and will fail
indefinitely. This is visible in the server log as:
error -5 committing client logs for rid e57e37132c919c4f: invalid log trees item get_trans_seq
The solution is to retry the commit and get phases independently. This
way a failed get will be retried on its own without running through the
commit phase that had succeeded. The client will eventually get the
next seq that it can then safely commit.
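A sketch of the split, with the phase function names assumed:

    /* retry each phase on its own so a commit that succeeded isn't
     * re-sent with a stale seq after a later get fails */
    do {
            ret = commit_current_trans(sb);
    } while (ret && !shutting_down(sb));

    do {
            ret = get_next_trans(sb);
    } while (ret && !shutting_down(sb));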
Signed-off-by: Zach Brown <zab@versity.com>
At the end of get_log_trees we can try to drain the data_freed extent
tree, which can take multiple commits. If a commit fails then the
blocks are still dirty in memory. We can't send references to those
blocks to the client. We have to return an error and not send the
log_trees, like the main get_log_trees does. The client will retry and
eventually get a log_trees that references blocks that were successfully
committed.
Signed-off-by: Zach Brown <zab@versity.com>
Very old copy/paste bug here: we want to update new_inode's ctime
instead. old_inode is already updated.
Signed-off-by: Auke Kok <auke.kok@versity.com>
We need to ensure we're emitting dents with the proper position, and
we already have them as part of our dent. The only caveat is to
increment ctx->pos once beyond the list to make sure the caller
doesn't call us once more.
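A sketch of the emit loop, with the staged dent layout assumed:

    /* each staged dent carries its own readdir position */
    list_for_each_entry(dent, &staged, head) {
            ctx->pos = dent->pos;
            if (!dir_emit(ctx, dent->name, dent->name_len,
                          dent->ino, dent->type))
                    return 0;
    }
    /* park pos past the last entry so we aren't called again */
    ctx->pos = last_pos + 1;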
Signed-off-by: Auke Kok <auke.kok@versity.com>
While debugging a double unlock error we hit this condition. Debugging
would have been a lot easier had we enforced the simple constraint
that we can't decrement the lock users count if it's already 0.
Signed-off-by: Auke Kok <auke.kok@versity.com>
Similar to fiemap, readdir, and walk_inodes, this method could page
fault during put_user, potentially causing a deadlock.
Signed-off-by: Auke Kok <auke.kok@versity.com>
Similar to readdir and fiemap vfs methods, we can't copy to user while
holding cluster locks. The previous comment about it being safe no
longer applies, and this could deadlock.
Rewrite the loop to iterate and store entries in a page, then flush
the page contents while not holding a cluster lock.
Signed-off-by: Auke Kok <auke.kok@versity.com>
Now that we support mmap writes, at any point in time we could
pagefault and lock for writes. That means - just like readdir -
we can no longer lock and copy_to_user, since it also may page fault
and thus deadlock.
We statically allocate 32 extent entries on the stack and use these to
shuffle out fiemap entries a batch at a time, locking and unlocking
around collecting and fiemap_fill_next_extent().
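A sketch of the batching, with the extent gathering helpers assumed:

    struct cached_extent ext[32];   /* hypothetical on-stack batch */
    int ret = 0;
    int nr, i;

    do {
            lock_extents(inode);    /* cluster lock held */
            nr = gather_extents(inode, pos, ext, 32);
            unlock_extents(inode);  /* dropped before copying out */

            for (i = 0; i < nr; i++) {
                    /* safe to fault in the user buffer now */
                    ret = fiemap_fill_next_extent(fieinfo,
                                                  ext[i].logical,
                                                  ext[i].physical,
                                                  ext[i].len,
                                                  ext[i].flags);
                    if (ret != 0)   /* 1 when full, < 0 on error */
                            break;
            }
    } while (nr == 32 && ret == 0);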
Signed-off-by: Auke Kok <auke.kok@versity.com>
dir_emit() will copy_to_user, which can pagefault. If this happens while
cluster locked, we could deadlock.
We use a single page to stage dir_emit data, and iterate between
fetching dirents while locked, and emitting them while not locked.
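A sketch of the alternation, with the page fill helpers assumed:

    for (;;) {
            /* gather dirents into the page while cluster locked */
            nr = fill_dirent_page(inode, ctx->pos, page);
            if (nr <= 0)
                    break;

            /* emit with no locks held; dir_emit can page fault */
            if (!emit_staged_dirents(ctx, page, nr))
                    break;
    }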
Signed-off-by: Auke Kok <auke.kok@versity.com>
These two sections of compat for readdir are wholly obsolete and can
be dropped outright, which restores the method to look like current
upstream code.
This was added in ddd1a4e.
Signed-off-by: Auke Kok <auke.kok@versity.com>