scoutfs

mirror of https://github.com/versity/scoutfs.git synced 2025-12-23 05:25:18 +00:00

Author	SHA1	Message	Date
Auke Kok	2884a92408	Avoid using bash special device nodes. Bash has special handling when these standard IO files, but there are cases where customers have special restrictions set on them. Likely to avoid leaking error data out of system logs as part of IDS software. In any case, we can just reopen existing file descriptors here in both these cases to avoid this entirely. This will always work. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-12-04 13:24:48 -05:00
Zach Brown	e194714004	Merge pull request #264 from versity/auke/findmnt_retval Findmnt returns 1 when no matching entries found	2025-12-03 14:29:31 -08:00
Auke Kok	8bb2f83cf9	Findmnt returns 1 when no matching entries found Our local fence script attempts to interpret errors executing `findmnt` as critical errors, but the program exit code explicitly returns EXIT_FAILURE when the total number of matching mount entries is zero. This can happen if the mount disappeared while we're attempting to fence the mount, but, the scoutfs sysfs files are still in place as we read them. It's a small window, but, it's a fork/exec plus full parse of /etc/fstab, and a lot can happen in the 0.015s findmnt takes on my system. There's no other exit codes from findmnt other than 0 and 1. At that point, we can only assume that if the stdout is empty, the mount isn't there anymore. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-12-02 12:55:11 -08:00
Zach Brown	6a9a6789d5	Merge pull request #267 from versity/clk/merge_enoent Handle ENOENT when getting log merge status item	2025-12-02 09:34:28 -08:00
Chris Kirby	ee630b164f	Handle ENOENT when getting log merge status item Tests that cause client retries can fail with this error from server_commit_log_merge(): error -2 committing log merge: getting merge status item This can happen if the server has already committed and resolved the log merge that is being retried. We can safely ignore ENOENT here just like we do a few lines later. Signed-off-by: Chris Kirby <ckirby@versity.com>	2025-12-01 08:58:24 -06:00
Zach Brown	1c7678b6f5	Merge pull request #263 from versity/zab/v1.26 v1.26 Release	2025-11-18 09:39:27 -08:00
Zach Brown	22b5e79bbd	v1.26 Release Finish the release notes for the 1.26 release. Signed-off-by: Zach Brown <zab@versity.com> v1.26	2025-11-17 14:42:14 -08:00
Zach Brown	259e639271	Merge pull request #262 from versity/zab/ino_alloc_per_lock Add ino_alloc_per_lock option	2025-11-14 13:57:49 -08:00
Zach Brown	4d66c38c71	Remove redundant WARN in commit_log_trees The server's commit_log_trees has an error message that includes the source of the error, but it's not used for all errors. The WARN_ON is redundant with the message and is removed because it isn't filtered out when we see errors from forced unmount. Signed-off-by: Zach Brown <zab@versity.com>	2025-11-14 10:04:30 -08:00
Zach Brown	7ef62894bd	Add ino_alloc_per_lock option Add an option that can limit the number of inode numbers that are allocated per lock group. Signed-off-by: Zach Brown <zab@versity.com>	2025-11-13 17:19:04 -08:00
Zach Brown	1f363a1ead	Merge pull request #261 from versity/zab/log_merge_double_free Zab/log merge double free	2025-11-13 17:18:30 -08:00
Zach Brown	8ddf9b8c8c	Handle disappearing fencing requests and targets The userspace fencing process wasn't careful about handling underlying directories that disappear while it was working. On the server/fenced side, fencing requests can linger after they've been resolved by writing 1 to fenced or error. The script could come back around to see the directory before the server finally removes it, causing all later uses of the request dir to fail. We saw this in the logs as a bunch of cat errors for the various request files. On the local fence script side, all the mounts can be in the process of being unmounted so both the /sys/fs dirs and the mount it self can be removed while we're working. For both, when we're working with the /sys/fs files we read them without logging errors and then test that the dir still exists before using what we read. When fencing a mount, we stop if findmnt doesn't find the mount and then raise a umount error if the /sys/fs dir exists after umount fails. And while we're at it, we have each scripts logging append instead of truncating (if, say, it's a log file instead of an interactive tty). Signed-off-by: Zach Brown <zab@versity.com>	2025-11-13 12:43:31 -08:00
Zach Brown	fd80c17ab6	Filter out kernel message when guests are slow Ignore more kernel messages when debug guests are being slow. Signed-off-by: Zach Brown <zab@versity.com>	2025-11-13 12:43:31 -08:00
Zach Brown	991e2cbdf8	Ignore slow quorum hb transfers in tests We're getting test failures from messages that our guests can be unresponsive. They sure can be. We don't need to fail for this one specific case. Signed-off-by: Zach Brown <zab@versity.com>	2025-11-13 12:43:31 -08:00
Zach Brown	92ac132873	Silence merge splice error when forcing Silence another error warning and assertion that's assuming that the result of the errors is going to be persistent. When we're forcing an unmount we've severed storage and networking. Signed-off-by: Zach Brown <zab@versity.com>	2025-11-13 12:43:31 -08:00
Auke Kok	ad078cd93c	Avoid lock stalling mmap_stress mmap_stress gets completely stalled in lock messaging and starving most of the mmap_stress threads, which causes it to delay and even time out in CI. Instead of spawning threads over all 5 test nodes, we reduce it to spawning over only 2 artificially. This still does a good number of operations on those node, and now the work is spread across the two nodes evenly. Additionaly, I've added a miniscule (10ms) delay in between operations that should hopefully be sufficient for other locking attempts to settle and allow the threads to better spread the work. This now shows that all the threads exit within < 0.25s on my test machine, which is a lot better than the 40s variation that I was seeing locally. Hopefully this fares better in CI. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-11-13 12:43:31 -08:00
Auke Kok	90cb458cd5	Make mmap_stress not exceed a fixed amount of time. There's a scenarion where mmap_stress gets enough resources that twoe of the threads will starve the others, which then all take a very long time catching up committing changes. Because this test program didn't finish until all the threads had completed a fixed amount of work, essentially these threads all ended up tripping over eachother. In CI this would exceed 6h+, while originally I intended this to run in about 100s or so. Instead, cap the run time to ~30s by default. If threads exceed this time, they will immediately exit, which causes any clog in contention between the threads to drain relatively quickly. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-11-13 12:43:31 -08:00
Zach Brown	1ab798e7eb	Silence inconsistent srch on forced unmount Assembling a srch compaction operation creates an item and populates it with allocator state. It doesn't cleanly unwind the allocation and undo the compaction item change if allocation filling fails and issues a warning. This warning isn't needed if the error shows that we're in forced unmount. The inconsistent state won't be applied, it will be dropped on the floor as the mount is torn down. Signed-off-by: Zach Brown <zab@versity.com>	2025-11-13 12:43:31 -08:00
Zach Brown	e182914e51	Fix double free of metadata blocks in log merging The log merging process is meant to provide parallelism across workers in mounts. The idea is that the server hands out a bunch of concurrent non-intersecting work that's based on the structure of the stable input fs_root btree. The nature of the parallel work (cow of the blocks that intersect a key range) means that the ranges of concurrently issued work can't overlap or the work will all cow the same input blocks, freeing that input stable block multiple times. We're seeing this in testing. Correctness was intended by having an advancing key that sweeps sorted ranges. Duplicate ranges would never be hit as the key advanced past each it visited. This was broken by the mapping of the fs item keys to log merge tree keys by clobbering the sk_zone key value. It effectively interleaves the ranges of each zone in the fs root (meta indexes, orphans, fs items). With just the right log merge conditions that involve logged items in the right places and partial completed work to insert remaining ranges behind the key, ranges can be stored at mapped keys that end up with ranges out of order. The server iterates over these and ends up issueing overlapping work, which results in duplicated frees of the input blocks. The fix, without changing the format of the stored log tree items, is to perform a full sweep of all the range items and determine the next item by looking at the full precision stored keys. This ensures that the processed ranges always advance and never overlap. Signed-off-by: Zach Brown <zab@versity.com>	2025-11-13 12:43:31 -08:00
Zach Brown	8484a58dd6	Have xfstest pass when using args The xfstests's golden output includes the full set of tests we expect to run when no args are specified. If we specify args then the set of tests can change and the test will always fail when they do. This fixes that by having the test check the set of tests itself, rather than relying on golden output. If args are specified then our xfstest only fails if any of the executed xfstest tests failed. Without args, we perform the same scraping of the check output and compare it against the expected results ourself. It would have been a bit much to put that large file inline in the test file, so we add a dir of per-test files in revision control. We can also put the list of exclusions there. We can also clean up the output redirection helper functions to make them more clear. After xfstests has executed we want to redirect output back to the compared output so that we can catch any unexpected output. Signed-off-by: Zach Brown <zab@versity.com>	2025-11-13 12:43:31 -08:00
Zach Brown	a077104531	Add crash monitor to run-tests Add a little background function that runs during the test which triggers a crash if it finds catastrophic failure conditions. This is the second bg task we want to kill and we can only have one function run on the EXIT trap, so we create a generic process killing trap function. We feed it the fenced pid as well. run-tests didn't log much of value into the fenced log, and we're not logging the kills into anymore, so we just remove run-tests fenced logging. Signed-off-by: Zach Brown <zab@versity.com>	2025-11-13 12:43:31 -08:00
Zach Brown	23aaa994df	Add -l to run-tests for looping over tests Add an option to run-tests to have it loop over each test that will be run a number of times. Looping stops if the test doesn't pass. Most of the change in the per-test execution is indenting as we add the for loop block. The stats and kmsg output are lifted up before of the loop. Signed-off-by: Zach Brown <zab@versity.com>	2025-11-06 12:07:42 -08:00
Zach Brown	7d14b57b2d	Export PATH once in run-tests Might as well just export the PATH once as we change it, no need to export it in every test iteration. Signed-off-by: Zach Brown <zab@versity.com>	2025-11-06 11:02:38 -08:00
Zach Brown	3f252be4be	Merge pull request #241 from versity/auke/waiter_err_data_version_obsolete Ignore data_version in scoutfs_ioc_data_wait_err.	2025-11-04 10:09:57 -08:00
Auke Kok	a4d25d9b55	Ignore data_version in scoutfs_ioc_data_wait_err. The data_wait_err ioctl currently requires the correct data_version for the inode to be passed in, or else the ioctl returns -ESTALE. But the ioctl itself is just a passthrough mechanism for notifying data waiters, which doesn't involve the data_version at all. Instead, we can just drop checking the value. The field remains in the headers, but we've marked it as being ignored from now on. The reason for the change is documented in the header file as well. This all is a lot simpler than having to modify/rev the data_waiters interface to support passing back the data_version, because there isn't any space left to easily do this, and then userspace would just pass it back to the data_wait_err ioctl. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-10-31 12:24:03 -04:00
Zach Brown	79cd25f693	Merge pull request #255 from versity/zab/compact_error_block_leak Don't leak alloc blocks on srch compact error	2025-10-31 09:08:17 -07:00
Zach Brown	f2646130ae	Don't leak alloc blocks on srch compact error scoutfs_alloc_prepare_commit() is badly named. All it really does is put the references to the two dirty alloc list blocks in the allocator. It must allways be called if allocation was attempted, but it's easier to require that it always be paired with _alloc_init(). If the srch compaction worker in the client sees an error it will send the error back to the server without writing its dirty blocks. In avoiding the write it also avoided putting the two block references, leading to leaked blocks. We've been seeing rare messages with leaked blocks in tests. Signed-off-by: Zach Brown <zab@versity.com>	2025-10-30 14:47:18 -07:00
Zach Brown	1c66f9a9a5	Merge pull request #227 from versity/auke/el96 RHEL 9.6 support.	2025-10-30 13:39:05 -07:00
Auke Kok	afb6ba00ad	POSIX ACL changes. The .get_acl() method now gets passed a mnt_idmap arg, and we can now choose to implement either .get_acl() or .get_inode_acl(). Technically .get_acl() is a new implementation, and .get_inode_acl() is the old. That second method now also gets an rcu flag passed, but we should be fine either way. Deeper under the covers however we do need to hook up the .set_acl() method for inodes, otherwise setfacl will just fail with -ENOTSUPP. To make this not super messy (it already is) we tack on the get_acl() changes here. This is all roughly ca. v6.1-rc1-4-g7420332a6ff4. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-10-30 13:59:44 -04:00
Auke Kok	29e486e411	All vfs methods now take a mnt_idmap instead of user_namespace arg. Similar to before when namespaces were added, they are now translated to a mnt_idmap, since v6.2-rc1-2-gabf08576afe3. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-10-30 13:58:34 -04:00
Zach Brown	8f3177fe33	Merge pull request #254 from versity/zab/shrink_cleanup Zab/shrink cleanup	2025-10-30 08:56:33 -07:00
Zach Brown	419079e606	Merge pull request #239 from versity/auke/keepalive Add tcp_keepalive_timeout_ms option, change default to 60s	2025-10-29 17:15:17 -07:00
Zach Brown	6a70ee03b5	Dump block alloc stacks for leaked blocks The typical pattern of spinning isolating a list_lru results in a livelock if there are blocks with leaked refcounts. We're rarely seeing this in testing. We can have a modest array in each block that records the stack of the caller that initially allocated the block and dump that stack for any blocks that we're unable to shrink/isolate. Instead of spinning shrinking, we can give it a good try and then print the blocks that remain and carry on with unmount, leaking a few blocks. (Past events have had 2 blocks.) Signed-off-by: Zach Brown <zab@versity.com>	2025-10-29 16:16:58 -07:00
Zach Brown	38a2ffe0c7	Add stacktrace kernelcompat Signed-off-by: Zach Brown <zab@versity.com>	2025-10-29 10:12:52 -07:00
Zach Brown	4b41cf9789	Centralize port numbers and avoid ephemeral The tests were using high ephemeral port numbers for the mount server's listening port. This caused occasional failure if the client's ephemeral ports happened to collide with the ports used by the tests. This ports all the port number configuration in one place and has a quick check to make sure it doesn't wander into the current ephemeral range. Then it updates all the tests to use the chosen ports. Signed-off-by: Zach Brown <zab@versity.com>	2025-10-29 10:12:52 -07:00
Zach Brown	102899290e	Allow harmless srch compact commit errors The server's srch commit error warnings were a bit severe. The compaction operations are a function of persistent state. If they fail then the inputs still exist and the next attempt will retry whatever failed. Not all errors are a problem, only those that result in partial commits that leave inconsistent state. In particular, we have to support the case where a client retransmits a compaction request to a new server after a first server performed the commit but couldn't respond. Throwing warnings when the new server gets ENOENT looking for the busy compaction item isn't helpful. This came in tests as background compaction was in flight as tests unmounted and mounted servers repeatedly to test lock recovery. Signed-off-by: Zach Brown <zab@versity.com>	2025-10-29 10:12:52 -07:00
Zach Brown	89387fb192	Use list_lru for block cache shrinking The block cache had a bizarre cache eviction policy that was trying to avoid precise LRU updates at each block. It had pretty bad behaviour, including only allowing reclaim of maybe 20% of the blocks that were visited by the shrinker. We can use the existing list_lru facility in the kernel to do a better job. Blocks only exhibit contention as they're allocated and added to per-node lists. From then on we only set accessed bits and the private list walkers move blocks around on the list as we see the accessed bits. (It looks more like a fifo with lazy promotion than a "LRU" that is actively moving list items around as they're accessed.) Using the facility means changing how we remove blocks from the cache and hide them from lookup. We clean up the refcount inserted flag a bit to be expressed more as a base refcount that can be acquired by whoever's removing from the cache. It seems a lot clearer. Signed-off-by: Zach Brown <zab@versity.com>	2025-10-29 10:12:52 -07:00
Zach Brown	8b6418fb79	Add kernelcompat for list_lru Add kernelcompat helpers for initial use of list_lru for shrinking. The most complicated part is the walk callback type changing. Signed-off-by: Zach Brown <zab@versity.com>	2025-10-29 10:12:52 -07:00
Zach Brown	206c24c41f	Retry stale item reads instead of stopping reclaim Readers can read a set of items that is stale with respect to items that were dirtied and written under a local cluster lock after the read started. The active reader machanism addressed this by refusing to shrink pages that could contain items that were dirtied while any readers were in flight. Under the right circumstances this can result in refusing to shrink quite a lot of pages indeed. This changes the mechanism to allow pages to be reclaimed, and instead forces stale readers to retry. The gamble is that reads are much faster than writes. A small fraction should have to retry, and when they do they can be satisfied by the block cache. Signed-off-by: Zach Brown <zab@versity.com>	2025-10-29 10:12:52 -07:00
Auke Kok	f67462750b	Add tcp_keepalive_timeout_ms option, change default to 60s The default TCP keepalive value is currently 10s, resulting in clients being disconnected after 10 seconds of not replying to a TCP keepalive packet. These keepalive values are reasonable most of the times, but we've seen client disconnects where this timeout has been exceeded, resulting in fencing. The cause for this is unknown at this time, but it is suspected that network intermissions are happening. This change adds a configurable value for this specific client socket timeout. It enforces that its value is above UNRESPONSIVE_PROBES, whose value remains unchanged. The default value of 10000ms (10s) is changed to 60s. This is the value we're assuming is much better suited for customers and has been briefly trialed, showing that it may help to avoid network level interruptions better. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-10-28 18:45:43 -04:00
Zach Brown	fd8aaa0810	Merge pull request #205 from versity/auke/scar Changes from static analysis.	2025-10-27 16:06:32 -07:00
Auke Kok	a5dbe7f286	Don't set ret = -ENOMEM and immediately overwrite. It's possible that scoutfs_net_alloc_conn() fails due to -ENOMEM, which is legitimately a failure, thus the code here releases the sock again. But the code block here sets `ret = ENOMEM` and then restarts the loop, which immediately sets `ret = kernel_accept()`, thus overwriting the -ENOMEM error value. We can argue that an ENOMEM error situation here is not catastrophical. If this is the first that we're ever receiving an ENOMEM situation here while trying to accept a new client, we can just release the socket and wait for the client to try again. If the kernel at that point still is out of memory to handle the new incoming connection, that will then cascade down and clean up the while listener at that point. The alternative is to let this error path unwind out and break down the listener immediately, something the code today doesn't do. We're keeping the behavior therefore the same. I've opted therefore to replace the `ret = -ENOMEM` assignment with a comment explaining why we're ignoring the error situation here. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-10-24 14:26:44 -04:00
Auke Kok	c1e89d597d	Fix NULL dereference on error branch in handle_request. If scoutfs_send_omap_response fails for any reason, req is NULL and we would hit a hard NULL deref during unwinding. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-10-24 14:24:00 -04:00
Auke Kok	2c4316b096	Avoid uninitialized map, flags in ext. This function returns a stack pointer to a struct scoutfs_extent, after setting start, len to an extent found in the proper zone, but it leaves map and flags members unset. Initialize the struct to {0,} avoids passing uninitialized values up the callstack. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-10-24 14:24:00 -04:00
Auke Kok	e704cd7074	Fix masking of EIO in compact_logs. Several of the inconsistency error paths already correctly `goto out` but this one has a `break`. This would result in doing a whole lot of work on corrupted data. Make this error path go to `out` instead as the others do. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-10-24 14:23:59 -04:00
Auke Kok	8c5b09aee8	Prevent masking away inconsistent state in search_sorted_file. In these two error conditions we explicitly set `ret = -EIO` but then `break` to set `ret = 0` immediately again, masking away a critical error code that should be returned. Instead, `goto out` retains the EIO error value for the caller. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-10-24 14:23:28 -04:00
Auke Kok	d2cd610c53	Fix return of uninit value. The value of `ret` is not initialized. If the writeback list is empty, or, if igrab() fails on the only inode on the list, the value of `ret` is returned without being initialized. This would cause the caller to needlessly have to retry, perhaps possibly make things worse. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-10-24 14:21:06 -04:00
Auke Kok	52563d3f73	Address double copy_to_user, possible 1-byte leak. We shouldn't copy the entire _dirent struct and then copy in the name again right after, just stop at offsetoff(struct, name). Now that we're no longer copying the uninitialized name[3] from ent, there is no more possible 1-byte leak here, too. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-10-24 14:21:06 -04:00
Auke Kok	4358d57f55	Avoid possible NULL deref on ENOMEM. Assure that we reschedule even if this happens. Maybe it'll recover. If not, we'll have other issues elsewhere first. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-10-24 14:21:06 -04:00
Auke Kok	021830ab04	If kzalloc fails, avoid NULL deref. We still assign NULL to sbi->s_fs_info to aid checks in cleanup paths. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-10-24 14:21:06 -04:00

1 2 3 4 5 ...

2132 Commits