scoutfs

mirror of https://github.com/versity/scoutfs.git synced 2025-12-23 05:25:18 +00:00

Author	SHA1	Message	Date
Zach Brown	1c66f9a9a5	Merge pull request #227 from versity/auke/el96 RHEL 9.6 support.	2025-10-30 13:39:05 -07:00
Auke Kok	afb6ba00ad	POSIX ACL changes. The .get_acl() method now gets passed a mnt_idmap arg, and we can now choose to implement either .get_acl() or .get_inode_acl(). Technically .get_acl() is a new implementation, and .get_inode_acl() is the old. That second method now also gets an rcu flag passed, but we should be fine either way. Deeper under the covers however we do need to hook up the .set_acl() method for inodes, otherwise setfacl will just fail with -ENOTSUPP. To make this not super messy (it already is) we tack on the get_acl() changes here. This is all roughly ca. v6.1-rc1-4-g7420332a6ff4. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-10-30 13:59:44 -04:00
Auke Kok	29e486e411	All vfs methods now take a mnt_idmap instead of user_namespace arg. Similar to before when namespaces were added, they are now translated to a mnt_idmap, since v6.2-rc1-2-gabf08576afe3. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-10-30 13:58:34 -04:00
Zach Brown	8f3177fe33	Merge pull request #254 from versity/zab/shrink_cleanup Zab/shrink cleanup	2025-10-30 08:56:33 -07:00
Zach Brown	419079e606	Merge pull request #239 from versity/auke/keepalive Add tcp_keepalive_timeout_ms option, change default to 60s	2025-10-29 17:15:17 -07:00
Zach Brown	6a70ee03b5	Dump block alloc stacks for leaked blocks The typical pattern of spinning isolating a list_lru results in a livelock if there are blocks with leaked refcounts. We're rarely seeing this in testing. We can have a modest array in each block that records the stack of the caller that initially allocated the block and dump that stack for any blocks that we're unable to shrink/isolate. Instead of spinning shrinking, we can give it a good try and then print the blocks that remain and carry on with unmount, leaking a few blocks. (Past events have had 2 blocks.) Signed-off-by: Zach Brown <zab@versity.com>	2025-10-29 16:16:58 -07:00
Zach Brown	38a2ffe0c7	Add stacktrace kernelcompat Signed-off-by: Zach Brown <zab@versity.com>	2025-10-29 10:12:52 -07:00
Zach Brown	4b41cf9789	Centralize port numbers and avoid ephemeral The tests were using high ephemeral port numbers for the mount server's listening port. This caused occasional failure if the client's ephemeral ports happened to collide with the ports used by the tests. This ports all the port number configuration in one place and has a quick check to make sure it doesn't wander into the current ephemeral range. Then it updates all the tests to use the chosen ports. Signed-off-by: Zach Brown <zab@versity.com>	2025-10-29 10:12:52 -07:00
Zach Brown	102899290e	Allow harmless srch compact commit errors The server's srch commit error warnings were a bit severe. The compaction operations are a function of persistent state. If they fail then the inputs still exist and the next attempt will retry whatever failed. Not all errors are a problem, only those that result in partial commits that leave inconsistent state. In particular, we have to support the case where a client retransmits a compaction request to a new server after a first server performed the commit but couldn't respond. Throwing warnings when the new server gets ENOENT looking for the busy compaction item isn't helpful. This came in tests as background compaction was in flight as tests unmounted and mounted servers repeatedly to test lock recovery. Signed-off-by: Zach Brown <zab@versity.com>	2025-10-29 10:12:52 -07:00
Zach Brown	89387fb192	Use list_lru for block cache shrinking The block cache had a bizarre cache eviction policy that was trying to avoid precise LRU updates at each block. It had pretty bad behaviour, including only allowing reclaim of maybe 20% of the blocks that were visited by the shrinker. We can use the existing list_lru facility in the kernel to do a better job. Blocks only exhibit contention as they're allocated and added to per-node lists. From then on we only set accessed bits and the private list walkers move blocks around on the list as we see the accessed bits. (It looks more like a fifo with lazy promotion than a "LRU" that is actively moving list items around as they're accessed.) Using the facility means changing how we remove blocks from the cache and hide them from lookup. We clean up the refcount inserted flag a bit to be expressed more as a base refcount that can be acquired by whoever's removing from the cache. It seems a lot clearer. Signed-off-by: Zach Brown <zab@versity.com>	2025-10-29 10:12:52 -07:00
Zach Brown	8b6418fb79	Add kernelcompat for list_lru Add kernelcompat helpers for initial use of list_lru for shrinking. The most complicated part is the walk callback type changing. Signed-off-by: Zach Brown <zab@versity.com>	2025-10-29 10:12:52 -07:00
Zach Brown	206c24c41f	Retry stale item reads instead of stopping reclaim Readers can read a set of items that is stale with respect to items that were dirtied and written under a local cluster lock after the read started. The active reader machanism addressed this by refusing to shrink pages that could contain items that were dirtied while any readers were in flight. Under the right circumstances this can result in refusing to shrink quite a lot of pages indeed. This changes the mechanism to allow pages to be reclaimed, and instead forces stale readers to retry. The gamble is that reads are much faster than writes. A small fraction should have to retry, and when they do they can be satisfied by the block cache. Signed-off-by: Zach Brown <zab@versity.com>	2025-10-29 10:12:52 -07:00
Auke Kok	f67462750b	Add tcp_keepalive_timeout_ms option, change default to 60s The default TCP keepalive value is currently 10s, resulting in clients being disconnected after 10 seconds of not replying to a TCP keepalive packet. These keepalive values are reasonable most of the times, but we've seen client disconnects where this timeout has been exceeded, resulting in fencing. The cause for this is unknown at this time, but it is suspected that network intermissions are happening. This change adds a configurable value for this specific client socket timeout. It enforces that its value is above UNRESPONSIVE_PROBES, whose value remains unchanged. The default value of 10000ms (10s) is changed to 60s. This is the value we're assuming is much better suited for customers and has been briefly trialed, showing that it may help to avoid network level interruptions better. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-10-28 18:45:43 -04:00
Zach Brown	fd8aaa0810	Merge pull request #205 from versity/auke/scar Changes from static analysis.	2025-10-27 16:06:32 -07:00
Auke Kok	a5dbe7f286	Don't set ret = -ENOMEM and immediately overwrite. It's possible that scoutfs_net_alloc_conn() fails due to -ENOMEM, which is legitimately a failure, thus the code here releases the sock again. But the code block here sets `ret = ENOMEM` and then restarts the loop, which immediately sets `ret = kernel_accept()`, thus overwriting the -ENOMEM error value. We can argue that an ENOMEM error situation here is not catastrophical. If this is the first that we're ever receiving an ENOMEM situation here while trying to accept a new client, we can just release the socket and wait for the client to try again. If the kernel at that point still is out of memory to handle the new incoming connection, that will then cascade down and clean up the while listener at that point. The alternative is to let this error path unwind out and break down the listener immediately, something the code today doesn't do. We're keeping the behavior therefore the same. I've opted therefore to replace the `ret = -ENOMEM` assignment with a comment explaining why we're ignoring the error situation here. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-10-24 14:26:44 -04:00
Auke Kok	c1e89d597d	Fix NULL dereference on error branch in handle_request. If scoutfs_send_omap_response fails for any reason, req is NULL and we would hit a hard NULL deref during unwinding. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-10-24 14:24:00 -04:00
Auke Kok	2c4316b096	Avoid uninitialized map, flags in ext. This function returns a stack pointer to a struct scoutfs_extent, after setting start, len to an extent found in the proper zone, but it leaves map and flags members unset. Initialize the struct to {0,} avoids passing uninitialized values up the callstack. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-10-24 14:24:00 -04:00
Auke Kok	e704cd7074	Fix masking of EIO in compact_logs. Several of the inconsistency error paths already correctly `goto out` but this one has a `break`. This would result in doing a whole lot of work on corrupted data. Make this error path go to `out` instead as the others do. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-10-24 14:23:59 -04:00
Auke Kok	8c5b09aee8	Prevent masking away inconsistent state in search_sorted_file. In these two error conditions we explicitly set `ret = -EIO` but then `break` to set `ret = 0` immediately again, masking away a critical error code that should be returned. Instead, `goto out` retains the EIO error value for the caller. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-10-24 14:23:28 -04:00
Auke Kok	d2cd610c53	Fix return of uninit value. The value of `ret` is not initialized. If the writeback list is empty, or, if igrab() fails on the only inode on the list, the value of `ret` is returned without being initialized. This would cause the caller to needlessly have to retry, perhaps possibly make things worse. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-10-24 14:21:06 -04:00
Auke Kok	52563d3f73	Address double copy_to_user, possible 1-byte leak. We shouldn't copy the entire _dirent struct and then copy in the name again right after, just stop at offsetoff(struct, name). Now that we're no longer copying the uninitialized name[3] from ent, there is no more possible 1-byte leak here, too. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-10-24 14:21:06 -04:00
Auke Kok	4358d57f55	Avoid possible NULL deref on ENOMEM. Assure that we reschedule even if this happens. Maybe it'll recover. If not, we'll have other issues elsewhere first. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-10-24 14:21:06 -04:00
Auke Kok	021830ab04	If kzalloc fails, avoid NULL deref. We still assign NULL to sbi->s_fs_info to aid checks in cleanup paths. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-10-24 14:21:06 -04:00
Auke Kok	3e63739711	plug `df` ioctl leaks. The `type` member and padding are not initialized before being copied to userspace. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-10-24 14:21:06 -04:00
Auke Kok	b25d8e8741	Plug super leak. We accidentally could leak super here, so make sure to free it. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-10-24 14:21:06 -04:00
Auke Kok	4a9760afe0	Incorrect array_size test. ARRAY_SIZE(...) will return `3` for this array with members from 0 to 2, therefore arr[3] is out of bounds. The array length test is off by one and needs fixing. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-10-24 14:21:06 -04:00
Zach Brown	33f6e9d0cd	Merge pull request #248 from versity/auke/shuffle-tests Add option to shuffle test order.	2025-10-23 09:33:34 -07:00
Zach Brown	f9780fc391	Merge pull request #253 from versity/zab/msg_rate_shrink Zab/msg rate shrink	2025-10-23 09:32:25 -07:00
Zach Brown	aa8517d29b	Remove msghdr iov_iter kernelcompat This removes the KC_MSGHDR_STRUCT_IOV_ITER kernel compat. kernel_{send,recv}msg() initializes either msg_iov or msg_iter. This isn't a clean revert of "69068ae2 Initialize msg.msg_iter from iovec." because previous patches fixed the order of arguments, and the net send caller was removed. Signed-off-by: Zach Brown <zab@versity.com>	2025-10-22 11:18:30 -07:00
Zach Brown	feae5757c4	Send messages in batches Previous work had the receiver try to receive multiple messages in bulk. This does the same for the sender. We walk the send queue and initialize a vector that we then send with one call. This is intentionally similar to the single message sending pattern to avoid unintended changes. Along with the changes to recieve in bulk this ended up increasing the message processing rate by about 6x when both send and receive were going full throttle. Signed-off-by: Zach Brown <zab@versity.com>	2025-10-22 11:18:30 -07:00
Zach Brown	e79086f381	Fix swapped sendmsg nr_segs/count When the msg_iter compat was added the iter was initialized with nr_segs and count swapped. I'm not convinced this had any effect because the kernel_{send,recv}msg() call would initialize msg_iter again with the correct arguments. Signed-off-by: Zach Brown <zab@versity.com>	2025-10-22 11:18:30 -07:00
Zach Brown	45e815bf76	Receive incoming messages in bulk Our messaging layer is used for small control messages, not large data payloads. By calling recvmsg twice for every incoming message we're hitting the socket lock reasonably hard. With senders doing the same, and a lot of messages flowing in each direction, the contention is non-trivial. This changes the receiver to copy as much of the incoming stream into a page that is then framed and copied again into individual allocated messages that can be processed concurrently. We're avoiding contention with the sender on the socket at the cost of additional copies of our small messages. Signed-off-by: Zach Brown <zab@versity.com>	2025-10-22 11:18:30 -07:00
Zach Brown	c313b71b2e	Process client lock messages in ordered work The lock client has a requirement that it can't handle some messages being processed out of order. Previously it had detected message ordering itself, but had missed some cases. Recieve processing was then changed to always call lock message processing from the recv work to globally order all lock messages. This inline processing was contributing to excessive latencies in making our way through the incoming receive queue, delaying work that would otherwise be parallel once we got it off the recv queue. This was seen in practice as a giant flood of lock shrink messages arrived at the client. It processed each in turn, starving a statfs response long enough to trigger the hung task warning. This fix does two things. First, it moves ordered recv processing out of the recv work. It lets the recv work drain the socket quickly and turn it into a list that the ordered work is consuming. Other messages will have a chance to be received and queued to their processing work without having to wait for the ordered work to be processed. Secondly, it adds parallelism to the ordered processing. The incoming lock messages don't need global ordering, they need ordering within each lock. We add an arbitrary but reasonable number of ordered workers and hash lock messages to each worker based on the lock's key. Signed-off-by: Zach Brown <zab@versity.com>	2025-10-22 11:18:30 -07:00
Zach Brown	0ecaceba14	Merge pull request #236 from versity/team/ci_green Team/ci green	2025-10-22 11:05:08 -07:00
Chris Kirby	b4d8323750	Quorum message cleanup Make sure to log an error if the SCOUTFS_QUORUM_EVENT_END update_quorum_block() call fails in scoutfs_quorum_worker(). Correctly print if the reader or writer failed when logging errors in update_quorum_block(). Signed-off-by: Chris Kirby <ckirby@versity.com>	2025-10-22 10:59:03 -07:00
Chris Kirby	aa48a8ccfc	Generate sorted srch-safe entry pairs During log compaction, the SRCH_COMPACT_LOGS_PAD_SAFE trigger was generating inode numbers that were not in sorted order. This resulted in later failures during srch-basic-functionality, because we were winding up with out of order first/last pairs and merging incorrectly. Instead, reuse the single entry in the block repeatedly, generating zero-padded pairs of this entry that are interpreted as create/delete and vanish during searching and merging. These aren't encoded in the normal way, but the extra zeroes are ignored during the decoding phase. Signed-off-by: Chris Kirby <ckirby@versity.com>	2025-10-22 10:59:03 -07:00
Chris Kirby	d277d7e955	Fix race condition in orphan-inodes test Make sure that the orphan scanners can see deletions after forced unmounts by waiting for reclaim_open_log_tree() to run on each mount; and waiting for finalize_and_start_log_merge() to run and not find any finalized trees. Do this by adding two new counters: reclaimed_open_logs and log_merge_no_finalized and fixing the orphan-inodes test to check those before waiting for the orphan scanners to complete. Signed-off-by: Chris Kirby <ckirby@versity.com>	2025-10-22 10:59:03 -07:00
Chris Kirby	c72bf915ae	Use ENOLINK as a special error code during forced unmount Tests such as quorum-heartbeat-timeout were failing with EIO messages in dmesg output due to expected errors during forced unmount. Use ENOLINK instead, and filter all errors from dmesg with this errno (67). Signed-off-by: Chris Kirby <ckirby@versity.com>	2025-10-22 10:58:44 -07:00
Auke Kok	c3e6f3cd54	Don't run format-version-forward-back on el8, either This test compiles an earlier commit from the tree that is starting to fail due to various changes on the OS level, most recently due to sparse issues with newer kernel headers. This problem will likely increase in the future as we add more supported releases. We opt to just only run this test on el7 for now. While we could have made this skip sparse checks that fail it on el8, it will suffice at this point if this just works on one of the supported OS versions during testing. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-10-15 17:35:17 -05:00
Zach Brown	c19280c83c	Add cond_resched to iput worker The iput worker can accumulate quite a bit of pending work to do. We've seen hung task warnings while it's doing its work (admitedly in debug kernels). There's no harm in throwing in a cond_resched so other tasks get a chance to do work. Signed-off-by: Zach Brown <zab@versity.com>	2025-10-15 17:35:17 -05:00
Chris Kirby	01847d9fb6	Add tracing for get_file_block() and scoutfs_ioc_search_xattrs(). Signed-off-by: Chris Kirby <ckirby@versity.com>	2025-10-15 17:35:17 -05:00
Chris Kirby	84a48ed8e2	Fix several cases in srch.c where the return value of EIO should have been -EIO. Signed-off-by: Chris Kirby <ckirby@versity.com>	2025-10-15 17:35:17 -05:00
Chris Kirby	d38e41cb57	Add the inode number to scoutfs_xattr_set traces. Signed-off-by: Chris Kirby <ckirby@versity.com>	2025-10-15 17:35:17 -05:00
Chris Kirby	a896984f59	Only start new quorum election after a receive failure It's possible for the quorum worker to be preempted for a long period, especially on debug kernels. Since we only check for how much time has passed, it's possible for a clean receive to inadvertently trigger an election. This can cause the quorum-heartbeat-timeout test to fail due to observed delays outside of the expected bounds. Instead, make sure we had a receive failure before comparing timestamps. Signed-off-by: Chris Kirby <ckirby@versity.com>	2025-10-15 17:35:17 -05:00
Chris Kirby	35bcad91a6	Close window where we can lose search items In finalize_and_start_log_merge(), we overwrite the server mount's log tree with its finalized form and then later write out its next open log tree. This leaves a window where the mount's srch_file is nulled out, causing us to lose any search items in that log tree. This shows up as intermittent failures in the srch-basic-functionality test. Eliminate this timing window by doing what unmount/reclaim does when it finalizes, by moving the resources from the item that we finalize into server trees/items as it finalizes. Then there is no window where those resources exist only in memory until we create another transaction. Signed-off-by: Chris Kirby <ckirby@versity.com>	2025-10-06 12:27:25 -05:00
Auke Kok	0b7b9d4a5e	Avoid trigger munching of block_remove_stale trigger. It's entirely likely that the trigger here is munched by a read on a dirty block from any unrelated or background read. Avoid that by putting the trigger at the end of the condition list. Now that the order is swapped, we have to avoid a null deref in block_is_dirty(bp) here, as well. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-10-06 12:27:25 -05:00
Auke Kok	f86a7b4d3c	Fully wait for orphan inode scan to complete. The issue with the previous attempt to fix the orphan-inodes test was that we would regularly exceed the 120s timeout value put in there. Instead, in this commit, we change the code to add a new counter to indicate orphan deletion progress. When orphan inodes are deleted, the increment of this counter indicates progress happened. Inversely, every time the counter doesn't increment, and the orphan scan attempts counter increments, we know that there was no more work to be done. For safety, we wait until 2 consecutive scan attempts were made without forward progress in the test case. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-10-06 12:27:25 -05:00
Auke Kok	96eb9662a1	Revert "Extend orphan-inodes timeout." This reverts commit `138c7c6b49`. The timeout value here is still exceeded by CI test jobs, and thus causing the test to fail. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-10-06 12:27:25 -05:00
Chris Kirby	47af90d078	Fix race in offline-extent-waiting test Before comparing file contents, wait for the background dd to complete. Also fix a typo. Signed-off-by: Chris Kirby <ckirby@versity.com>	2025-10-06 12:27:25 -05:00
Chris Kirby	669e37c636	Remove hung task workaround from large-fragmented-free test Adjusting hung_task_timeout_secs is still needed for this test to pass with a debug kernel. But the logic belongs on the platform side. Signed-off-by: Chris Kirby <ckirby@versity.com>	2025-10-06 12:27:25 -05:00

1 2 3 4 5 ...

2105 Commits