scoutfs

mirror of https://github.com/versity/scoutfs.git synced 2026-01-03 10:55:20 +00:00

Author	SHA1	Message	Date
Auke Kok	8bb2f83cf9	Findmnt returns 1 when no matching entries found Our local fence script attempts to interpret errors executing `findmnt` as critical errors, but the program exit code explicitly returns EXIT_FAILURE when the total number of matching mount entries is zero. This can happen if the mount disappeared while we're attempting to fence the mount, but, the scoutfs sysfs files are still in place as we read them. It's a small window, but, it's a fork/exec plus full parse of /etc/fstab, and a lot can happen in the 0.015s findmnt takes on my system. There's no other exit codes from findmnt other than 0 and 1. At that point, we can only assume that if the stdout is empty, the mount isn't there anymore. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-12-02 12:55:11 -08:00
Zach Brown	8ddf9b8c8c	Handle disappearing fencing requests and targets The userspace fencing process wasn't careful about handling underlying directories that disappear while it was working. On the server/fenced side, fencing requests can linger after they've been resolved by writing 1 to fenced or error. The script could come back around to see the directory before the server finally removes it, causing all later uses of the request dir to fail. We saw this in the logs as a bunch of cat errors for the various request files. On the local fence script side, all the mounts can be in the process of being unmounted so both the /sys/fs dirs and the mount itself can be removed while we're working. For both, when we're working with the /sys/fs files we read them without logging errors and then test that the dir still exists before using what we read. When fencing a mount, we stop if findmnt doesn't find the mount and then raise a umount error if the /sys/fs dir exists after umount fails. And while we're at it, we have each script's logging append instead of truncating (if, say, it's a log file instead of an interactive tty). Signed-off-by: Zach Brown <zab@versity.com>	2025-11-13 12:43:31 -08:00
Zach Brown	fd80c17ab6	Filter out kernel message when guests are slow Ignore more kernel messages when debug guests are being slow. Signed-off-by: Zach Brown <zab@versity.com>	2025-11-13 12:43:31 -08:00
Zach Brown	991e2cbdf8	Ignore slow quorum hb transfers in tests We're getting test failures from messages that our guests can be unresponsive. They sure can be. We don't need to fail for this one specific case. Signed-off-by: Zach Brown <zab@versity.com>	2025-11-13 12:43:31 -08:00
Auke Kok	ad078cd93c	Avoid lock stalling mmap_stress mmap_stress gets completely stalled in lock messaging and starving most of the mmap_stress threads, which causes it to delay and even time out in CI. Instead of spawning threads over all 5 test nodes, we reduce it to spawning over only 2 artificially. This still does a good number of operations on those nodes, and now the work is spread across the two nodes evenly. Additionally, I've added a minuscule (10ms) delay in between operations that should hopefully be sufficient for other locking attempts to settle and allow the threads to better spread the work. This now shows that all the threads exit within < 0.25s on my test machine, which is a lot better than the 40s variation that I was seeing locally. Hopefully this fares better in CI. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-11-13 12:43:31 -08:00
Auke Kok	90cb458cd5	Make mmap_stress not exceed a fixed amount of time. There's a scenario where mmap_stress gets enough resources that two of the threads will starve the others, which then all take a very long time catching up committing changes. Because this test program didn't finish until all the threads had completed a fixed amount of work, essentially these threads all ended up tripping over each other. In CI this would exceed 6h+, while originally I intended this to run in about 100s or so. Instead, cap the run time to ~30s by default. If threads exceed this time, they will immediately exit, which causes any clog in contention between the threads to drain relatively quickly. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-11-13 12:43:31 -08:00
Zach Brown	8484a58dd6	Have xfstest pass when using args The xfstests's golden output includes the full set of tests we expect to run when no args are specified. If we specify args then the set of tests can change and the test will always fail when they do. This fixes that by having the test check the set of tests itself, rather than relying on golden output. If args are specified then our xfstest only fails if any of the executed xfstest tests failed. Without args, we perform the same scraping of the check output and compare it against the expected results ourself. It would have been a bit much to put that large file inline in the test file, so we add a dir of per-test files in revision control. We can also put the list of exclusions there. We can also clean up the output redirection helper functions to make them more clear. After xfstests has executed we want to redirect output back to the compared output so that we can catch any unexpected output. Signed-off-by: Zach Brown <zab@versity.com>	2025-11-13 12:43:31 -08:00
Zach Brown	a077104531	Add crash monitor to run-tests Add a little background function that runs during the test which triggers a crash if it finds catastrophic failure conditions. This is the second bg task we want to kill and we can only have one function run on the EXIT trap, so we create a generic process killing trap function. We feed it the fenced pid as well. run-tests didn't log much of value into the fenced log, and we're not logging the kills into it anymore, so we just remove run-tests fenced logging. Signed-off-by: Zach Brown <zab@versity.com>	2025-11-13 12:43:31 -08:00
Zach Brown	23aaa994df	Add -l to run-tests for looping over tests Add an option to run-tests to have it loop over each test that will be run a number of times. Looping stops if the test doesn't pass. Most of the change in the per-test execution is indenting as we add the for loop block. The stats and kmsg output are lifted up before the loop. Signed-off-by: Zach Brown <zab@versity.com>	2025-11-06 12:07:42 -08:00
Zach Brown	7d14b57b2d	Export PATH once in run-tests Might as well just export the PATH once as we change it, no need to export it in every test iteration. Signed-off-by: Zach Brown <zab@versity.com>	2025-11-06 11:02:38 -08:00
Zach Brown	4b41cf9789	Centralize port numbers and avoid ephemeral The tests were using high ephemeral port numbers for the mount server's listening port. This caused occasional failure if the client's ephemeral ports happened to collide with the ports used by the tests. This puts all the port number configuration in one place and has a quick check to make sure it doesn't wander into the current ephemeral range. Then it updates all the tests to use the chosen ports. Signed-off-by: Zach Brown <zab@versity.com>	2025-10-29 10:12:52 -07:00
Zach Brown	33f6e9d0cd	Merge pull request #248 from versity/auke/shuffle-tests Add option to shuffle test order.	2025-10-23 09:33:34 -07:00
Chris Kirby	d277d7e955	Fix race condition in orphan-inodes test Make sure that the orphan scanners can see deletions after forced unmounts by waiting for reclaim_open_log_tree() to run on each mount; and waiting for finalize_and_start_log_merge() to run and not find any finalized trees. Do this by adding two new counters: reclaimed_open_logs and log_merge_no_finalized and fixing the orphan-inodes test to check those before waiting for the orphan scanners to complete. Signed-off-by: Chris Kirby <ckirby@versity.com>	2025-10-22 10:59:03 -07:00
Chris Kirby	c72bf915ae	Use ENOLINK as a special error code during forced unmount Tests such as quorum-heartbeat-timeout were failing with EIO messages in dmesg output due to expected errors during forced unmount. Use ENOLINK instead, and filter all errors from dmesg with this errno (67). Signed-off-by: Chris Kirby <ckirby@versity.com>	2025-10-22 10:58:44 -07:00
Auke Kok	c3e6f3cd54	Don't run format-version-forward-back on el8, either This test compiles an earlier commit from the tree that is starting to fail due to various changes on the OS level, most recently due to sparse issues with newer kernel headers. This problem will likely increase in the future as we add more supported releases. We opt to just only run this test on el7 for now. While we could have made this skip sparse checks that fail it on el8, it will suffice at this point if this just works on one of the supported OS versions during testing. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-10-15 17:35:17 -05:00
Auke Kok	f86a7b4d3c	Fully wait for orphan inode scan to complete. The issue with the previous attempt to fix the orphan-inodes test was that we would regularly exceed the 120s timeout value put in there. Instead, in this commit, we change the code to add a new counter to indicate orphan deletion progress. When orphan inodes are deleted, the increment of this counter indicates progress happened. Inversely, every time the counter doesn't increment, and the orphan scan attempts counter increments, we know that there was no more work to be done. For safety, we wait until 2 consecutive scan attempts were made without forward progress in the test case. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-10-06 12:27:25 -05:00
Auke Kok	96eb9662a1	Revert "Extend orphan-inodes timeout." This reverts commit `138c7c6b49`. The timeout value here is still exceeded by CI test jobs, and thus causing the test to fail. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-10-06 12:27:25 -05:00
Chris Kirby	47af90d078	Fix race in offline-extent-waiting test Before comparing file contents, wait for the background dd to complete. Also fix a typo. Signed-off-by: Chris Kirby <ckirby@versity.com>	2025-10-06 12:27:25 -05:00
Chris Kirby	669e37c636	Remove hung task workaround from large-fragmented-free test Adjusting hung_task_timeout_secs is still needed for this test to pass with a debug kernel. But the logic belongs on the platform side. Signed-off-by: Chris Kirby <ckirby@versity.com>	2025-10-06 12:27:25 -05:00
Auke Kok	bf87ea0a1c	Add option to shuffle test order. The `-R` option will shuffle the order in which tests are executed. The testing order shouldn't affect the outcome of any of the tests, but in practice many of these tests will execute code slightly differently based on the history of the filesystem, resources allocated, memory usage etc. of tests that were executed before. Shuffling the order of tests therefore introduces small semi-random variations in the environment. The xfstests test is the only one that can't be shuffled yet into the mix, so it is kept at the end. This is because it leaves the filesystems unmounted. At a later point we may want to address this. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-10-03 14:55:32 -07:00
Zach Brown	624eb128c6	Merge pull request #221 from versity/auke/enospc-test Give enospc test more time to commit unlink.	2025-05-09 11:27:04 -07:00
Zach Brown	091eb3b683	Merge pull request #219 from versity/auke/fix-tests-failing-dirty-test-dirs Fix test cases that don't run cleanly in a semi-dirty env.	2025-05-09 11:17:24 -07:00
Zach Brown	04e8cc6295	Merge pull request #220 from versity/auke/orphan-inodes Extend orphan-inodes timeout.	2025-05-09 11:15:13 -07:00
Auke Kok	377e49caf1	Properly silently kill background tasks. Occasionally, we have some tests fail because these kills produce: tests/lock-recover-invalidate.sh: line 42: 9928 Terminated Even though we expected them to be silent. In these particular cases we already don't care about this output. We borrow the silent_kill() function from orphan-inodes and promote it to t_silent_kill() in funcs/exec.sh, and then use it everywhere where appropriate. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-05-08 12:03:04 -07:00
Auke Kok	d08eb66adc	Give enospc test more time to commit unlink. The current test sequence performs the unlink and immediately tests whether enough resources are available to create new files again, and this consistently fails. One of my crummy VMs takes a good 12 seconds before the `touch` actually succeeds. We care about the filesystem eventually returning from ENOSPC, and certainly we don't want it to take forever, but there is a period after our first ENOSPC error and cleanup that we expect ENOSPC to fail for a bit longer. Make the timeout 120s. As soon as the `touch` completes, exit the wait loop. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-05-08 11:40:13 -07:00
Auke Kok	1d0cde7cc3	Clean up old test data as needed. If run without `-m` (explicit mkfs) in subsequent testing, old test data files may break several tests. Most failures are -EEXIST, but there are some more subtle ones. This change erases any existing test dir as needed just before we run the tests, and avoids the issue entirely. I considered doing a `mv dir dir.$$ && rm -rf dir.$$ &` alternative solution but that likely will interfere disproportionally with tests that do disconnects and other thing that can be impacted by an unlink storm. This has an obvious performance aspect - tests will be a little slower to start on subsequent runs. In CI, this will effectively be a no-op though. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-05-08 10:10:01 -07:00
Auke Kok	138c7c6b49	Extend orphan-inodes timeout. This test regularly fails in CI when the 15 seconds elapses and the system still hasn't concluded the mount log merges and orphan inode scans needed to unlink the test files. Instead of just extending the timeout value, we test-and-retry for 120s. This hopefully is faster in most cases. My smallest VM needs about 6s-8s on average. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-05-08 09:56:45 -07:00
Zach Brown	8aa1a98901	Merge pull request #210 from versity/auke/perf-irq-took-too-long Filter out perf `interrupt took too long` dmesg.	2025-04-30 10:04:00 -07:00
Auke Kok	24031cde1d	TAP formatted output. Stored as `results/scoutfs.tap`, this file contains TAP format 14 generated test results. Embedded in the output are some metadata so that these files can be aggregated and stored in an unique and deduplicating way, but using a generated UUID at the start of testing. The file itself also catches git ID, date, and kernel version, as well as the (possibly altered) test sequence used. Any test that has diff or dmesg output will be considered failed, and a copy of the relevant data is included as comments. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-04-15 12:02:41 -07:00
Auke Kok	1b47e9429e	Filter out perf `interrupt took too long` dmesg. Example: ``` [ 2469.638414] perf: interrupt took too long (2507 > 2500), lowering kernel.perf_event_max_sample_rate to 79000 ``` Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-04-14 12:06:58 -07:00
Auke Kok	7ea084082d	Ignore pipefail alternative error when not a tty. This happens with the basic-truncate test, only. It's the only user of the `yes` program. The `yes` command normally fails gracefully under the usual runs that are attached to some terminal. But when the test script runs entirely under something else, it will throw a needless error message that pollutes the test output: `yes: standard output: Broken pipe` Adjust the redirect to omit all stderr for `yes` in this case. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-04-14 11:13:39 -07:00
Auke Kok	e59a5f8ebd	Readdir w/offset validation. Verify using xfs_io that readdir offsets match expected output. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-01-27 14:49:04 -05:00
Auke Kok	92f704d35a	Enable all xfstests mmap() tests. Now that all of these should be passing, we enable all mmap() tests in xfstests, and update the golden output with the new tests. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-01-23 14:28:40 -05:00
Auke Kok	311bf75902	Add mmap tests. Two test programs are added. The run time is about 1min on my el7 instance. The test script finishes up with a read/write mmap test on offline extents to verify the data wait paths in those functions. One program will perform vfs read/write and mmap read/write calls on the same file from across 5 threads (mounts) repeatedly. The goal is to assure there are no locking issues between read/write paths. The second test program performs consistency checking on a file that is repeatedly written/read using memory maps and normal reads and writes, and the content is verified after every operation. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-01-23 14:28:40 -05:00
Zach Brown	4a26059d00	Add lock-shrink-read-race test Add a quick test that races readers and shrinking to stress lock object refcount racing between concurrent lock request handling threads in the lock server. Signed-off-by: Zach Brown <zab@versity.com>	2024-10-31 15:35:11 -07:00
Auke Kok	fc7876e844	Allow certain tests to skip, but not fail exit condition. Previously, any t_skip would cause the final test result to be a failure because up until now no test should have been skipped. However, with format-version-forward-back not being compatible with el9, we are going to rely on el7/8 testing for that test solely, and therefore we have to allow skipping of this test on el9 and newer OS versions. We add `t_skip_permitted` to signal this from the test case to the run-tests.sh script. A new exit code is passed, and all accounting is updated to reflect that a test was skipped, but this was permitted. We modify format-version-forward-back to use this new exit path. Signed-off-by: Auke Kok <auke.kok@versity.com>	2024-10-03 15:38:34 -07:00
Auke Kok	5337b9e221	Ignore Process accounting resumed dmesg. I'm seeing more and more of these as audit is enabled in el8 and el9 images I am using for testing, and during ENOSPC tests this has a chance of triggering process accounting suspension, and subsequent resume. Signed-off-by: Auke Kok <auke.kok@versity.com>	2024-10-03 15:38:34 -07:00
Auke Kok	8a22bdd366	Ignore device mapper size change dmesg output. In v1.18-10-g5507ee5, we changed the test code away from loopback to device-mapper, which simplified our DUT setup code. However, this results in the occasional `device changed size` messages now being emitted by the `dm` driver instead of the `loop` kernel module. We have to additionally ignore these kernel messages from now as well. Signed-off-by: Auke Kok <auke.kok@versity.com>	2024-10-03 15:38:34 -07:00
Auke Kok	9335d2eb86	Don't --track when checking out a tag. I've pushed a tag/release to scoutfs-xfstests-dev instead of a full blown branch. This seems simpler and cleaner than using branches, because we're going to end up rebasing these things a lot. However, we can't --track tags, so, if the branch name passed to -x is actually a tag instead of a branch, we have to omit the --track option here. Signed-off-by: Auke Kok <auke.kok@versity.com>	2024-10-03 15:38:34 -07:00
Auke Kok	97b081de3f	Switch xfstests tag over in CI jobs using this marker file. CI testing needs to know which xfstests branch to use on all OSs. We can't just use the el9 xfstests branch on el9 only, because we need to run the same el9 xfstests on el8 and el7 as well, otherwise testing will just fail. So, we put a marker file in our git repo that tells us that we're not going to use the default `scoutfs` branch from scoutfs-xfstests-dev but our own special tag or branch. The CI job then should pass the proper -x {branch} flag to the run-tests.sh script. Signed-off-by: Auke Kok <auke.kok@versity.com>	2024-10-03 15:38:34 -07:00
Auke Kok	21b5032365	Add new xfstests that we won't support or don't pass The new version of xfstests adds a _lot_ more tests to our mix. Many of the new ones will auto enable or auto skip as needed. There are tests we can't or won't support that will be in future xfstests. Disable them now so we can avoid dealing with them later. Quite a few fall into "we don't support these types of mounting yet", mostly bind-mount or dm-mapper things. We disable all the swapfile tests flatout. A few tests fail on el7 but not el8/9 but we don't have a way to run them without failing yet, so disable them as well. Update golden with the proper new array of tests. This all requires the `auke/scoutfs-el9` branch in `versity/scoutfs-xfstests-dev`. Signed-off-by: Auke Kok <auke.kok@versity.com>	2024-10-03 15:38:34 -07:00
Auke Kok	4723f4f9ab	Disable format-version-forward-back test on el9+. Using t_skip, we just skip this test on el9. If we ever want to add a formatversion 2->3 test, perhaps we should just add a separate test script, instead of going over a static array. But let's not worry about this too much right now. Signed-off-by: Auke Kok <auke.kok@versity.com>	2024-10-03 15:38:34 -07:00
Auke Kok	0a8b3f4e94	Fix basic-posix-acl test output on el9 It turns out that on el9, `bash -c` prints out `bash: line 1: cd..` instead of `line 0:` on el7 or el8. So discard all the stderr from these `cd` lines entirely and just rely on the expected echo output to stdout. Signed-off-by: Auke Kok <auke.kok@versity.com>	2024-10-03 15:38:34 -07:00
Auke Kok	8a4b0967cb	Add fiemap output through scoutfs util. There's filefrag already, and that works, but its output is very inconsistent between various OS release versions, and it has already meant that we'd needed to adjust tests to account for these little but insignificant changes. A lot more work than useful. It's even more changed in el9. This adds `scoutfs get-fiemap FILE` and prints out block extent info with flags that we care about as an abbreviated letter: U for Unwritten, L for Last, and O for Unknown (as in, "offline"). The -P/--physical and -L/--logical options turn off logical or physical offset display, in case you only want to see the offsets in either units. You can pass -b/--byte to display offsets and lengths in byte values. The block size will then be obtained from fstat() of the queried file (4096 for scoutfs). I've removed all uses of filefrag from our scoutfs tests. Xfstests still calls it but their internal diff takes care of that issue. Where needed and appropriate, the tests are adjusted so that the output of `scoutfs get-fiemap` is as close as it can to what it used to be, so that reading the test results allows the quick view of what might have been going wrong. There are some output strings I have not bothered to update because there's no real value to updating every output string to match, and we just adjust the golden file accordingly. Signed-off-by: Auke Kok <auke.kok@versity.com>	2024-10-03 15:38:34 -07:00
Auke Kok	606c519e96	Simple-staging doesn't actually test overflow. This isn't a simple case where we can use u64_region_wraps because length is s32. Let's actually test an overflow case instead of a case that doesn't overflow, though. We still should properly add an overflow test here as well. Signed-off-by: Auke Kok <auke.kok@versity.com>	2024-10-03 12:41:05 -07:00
Auke Kok	7d0e7e29f8	Avoid integer wrapping pitfalls for (off, len) pairs. We use check_add_overflow(a, b, d) here to validate that (off, len) pairs do not exceed the max value type. The kernel conveniently has several macros to sort out the problems with signed or unsigned types. However, we're not interested in purely seeing whether (a + b) overflows, because we're using this for (off, len) overflow checks, where the bytes we read are from 0 to len -1. We must therefore call this check with (b) being "len - 1". I've made sure that we don't accidentally fail when (len == 0) in all cases by making sure we've already checked this condition before, and moving code around as needed to ensure that (len > 0) in all cases where we check. The macro check_add_overflow requires a (d) argument in which temporarily the result of the addition is stored and then checked to see if an overflow occurred. We put a `tmp` variable on the stack of the correct type as needed to make the checks function. simple-release-extents test mistakenly relied on this buggy wrap code, so it needs fixing. The move-blocks test also got it wrong. Signed-off-by: Auke Kok <auke.kok@versity.com>	2024-10-03 12:41:05 -07:00
Auke Kok	6d42d260cf	xargs option conflict now a warning in el9 The warnings thrown by el9's version of xargs are unexpected output and cause this test to fail. When using the -I option (replace) the -n 1 arguments are always assumed. In el7/8 no warnings were printed. We can just remove `-n 1` since the argument is never needed. Signed-off-by: Auke Kok <auke.kok@versity.com>	2024-10-03 12:41:05 -07:00
Auke Kok	b45fbe0bbb	Don't pass data version to attr_x unless the ioctl means to set it. The wrapper in setattr_more that translates the operations to attr_x needs to decide whether to ask attr_x to perform a change to any of the fields passed to it or not. For the date and size fields this is implicit - we always tell attr_x to change them. For any of the other fields, it should be explicit. The only field that is in the struct that this applies to is data_version. Because the data version field by default is zero, we use that as condition to decide whether to pass the data_version down to attr_x. Previously, the code would always pass a data_version=0 down to attr_x, triggering one of the validity checks, making it return -EINVAL. We add a simple test case to test for this issue. Signed-off-by: Auke Kok <auke.kok@versity.com>	2024-09-27 19:31:22 -04:00
Auke Kok	9d8ac2c7d7	Write to kmsg which test we're executing. This is done by xfstests and it's so much easier to follow what is going on from logs or e.g. serial console that I thought I should do this for scoutfs tests as well. It makes it so much easier to discern which test may have been cause for issues when running a bunch of tests and you're looking back at logs later. Signed-off-by: Auke Kok <auke.kok@versity.com>	2024-08-28 14:36:55 -07:00
Auke Kok	7b039a1d18	Add basic POSIX ACL tests. These are extremely limited and very quick basic ACL tests we can trivially do in under a second - purely basic functionality tests only. Signed-off-by: Auke Kok <auke.kok@versity.com>	2024-08-12 15:07:43 -04:00
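
Commit 8bb2f83cf9 in the log above observes that `findmnt` exits with 1 when no entries match, so an empty stdout is the only usable sign that the mount is gone. A minimal sketch of that check follows. The helper name `mount_is_present` and its device-path argument are assumptions for illustration and are not taken from the repository's fence script.

```shell
# Hypothetical helper, not the repository's fence script.
# Succeeds only if findmnt prints at least one matching mount target.
mount_is_present() {
    local dev="$1" out
    # findmnt exits 1 when no entry matches, so its exit status is
    # deliberately ignored and only stdout is inspected.
    out=$(findmnt --noheadings --output TARGET --source "$dev" 2>/dev/null) || true
    [ -n "$out" ]
}
```

A caller could then skip the rest of the fencing work when `mount_is_present "$dev"` fails, instead of treating the nonzero `findmnt` status as a critical error.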

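
Commit 4b41cf9789 in the log above mentions a quick check that the chosen test ports stay out of the current ephemeral range. One way such a check might look is sketched here. Both function names are hypothetical, and the sketch assumes a Linux host where the range is exposed in `/proc/sys/net/ipv4/ip_local_port_range`.

```shell
# Hypothetical helper: succeed if PORT lies outside the range [LOW, HIGH].
port_outside_range() {
    local port="$1" low="$2" high="$3"
    [ "$port" -lt "$low" ] || [ "$port" -gt "$high" ]
}

# Hypothetical wrapper: compare PORT against the kernel's current
# ephemeral range, which is stored as two numbers on one line.
check_test_port() {
    local port="$1" low high
    read -r low high < /proc/sys/net/ipv4/ip_local_port_range
    port_outside_range "$port" "$low" "$high"
}
```

Keeping the comparison in its own function lets it be exercised with fixed numbers, independent of the host's sysctl settings.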

229 Commits
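
Commit d08eb66adc in the log above replaces an immediate check with a wait loop that retries `touch` for up to 120s and exits as soon as it succeeds. The pattern can be sketched as a generic retry helper. The name `retry_until` and the one-second polling interval are assumptions, not the test's actual code.

```shell
# Hypothetical helper: rerun a command once per second until it succeeds
# or TIMEOUT seconds have elapsed. Returns 0 on success, 1 on timeout.
retry_until() {
    local timeout="$1" start now
    shift
    start=$(date +%s)
    while ! "$@"; do
        now=$(date +%s)
        if [ $((now - start)) -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
    done
    return 0
}
```

With this helper, the ENOSPC recovery wait would read roughly as `retry_until 120 touch "$file"`, where `$file` stands in for whatever path the test creates.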