Commit Graph

229 Commits

Author SHA1 Message Date
Auke Kok
8bb2f83cf9 Findmnt returns 1 when no matching entries found
Our local fence script attempts to interpret errors executing `findmnt`
as critical errors, but the program exit code explicitly returns
EXIT_FAILURE when the total number of matching mount entries is zero.

This can happen if the mount disappeared while we're attempting to
fence the mount, but, the scoutfs sysfs files are still in place as
we read them. It's a small window, but, it's a fork/exec plus full
parse of /etc/fstab, and a lot can happen in the 0.015s findmnt takes
on my system.

There's no other exit codes from findmnt other than 0 and 1. At that
point, we can only assume that if the stdout is empty, the mount
isn't there anymore.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-12-02 12:55:11 -08:00
Zach Brown
8ddf9b8c8c Handle disappearing fencing requests and targets
The userspace fencing process wasn't careful about handling underlying
directories that disappear while it was working.

On the server/fenced side, fencing requests can linger after they've
been resolved by writing 1 to fenced or error.  The script could come
back around to see the directory before the server finally removes it,
causing all later uses of the request dir to fail.  We saw this in the
logs as a bunch of cat errors for the various request files.

On the local fence script side, all the mounts can be in the process of
being unmounted so both the /sys/fs dirs and the mount it self can be
removed while we're working.

For both, when we're working with the /sys/fs files we read them without
logging errors and then test that the dir still exists before using what
we read.  When fencing a mount, we stop if findmnt doesn't find the
mount and then raise a umount error if the /sys/fs dir exists after
umount fails.

And while we're at it, we have each scripts logging append instead of
truncating (if, say, it's a log file instead of an interactive tty).

Signed-off-by: Zach Brown <zab@versity.com>
2025-11-13 12:43:31 -08:00
Zach Brown
fd80c17ab6 Filter out kernel message when guests are slow
Ignore more kernel messages when debug guests are being slow.

Signed-off-by: Zach Brown <zab@versity.com>
2025-11-13 12:43:31 -08:00
Zach Brown
991e2cbdf8 Ignore slow quorum hb transfers in tests
We're getting test failures from messages that our guests can be
unresponsive.  They sure can be.  We don't need to fail for this one
specific case.

Signed-off-by: Zach Brown <zab@versity.com>
2025-11-13 12:43:31 -08:00
Auke Kok
ad078cd93c Avoid lock stalling mmap_stress
mmap_stress gets completely stalled in lock messaging and starving
most of the mmap_stress threads, which causes it to delay and even
time out in CI.

Instead of spawning threads over all 5 test nodes, we reduce it
to spawning over only 2 artificially. This still does a good number
of operations on those node, and now the work is spread across the
two nodes evenly.

Additionaly, I've added a miniscule (10ms) delay in between operations
that should hopefully be sufficient for other locking attempts to
settle and allow the threads to better spread the work.

This now shows that all the threads exit within < 0.25s on my test
machine, which is a lot better than the 40s variation that I was seeing
locally. Hopefully this fares better in CI.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-11-13 12:43:31 -08:00
Auke Kok
90cb458cd5 Make mmap_stress not exceed a fixed amount of time.
There's a scenarion where mmap_stress gets enough resources that
twoe of the threads will starve the others, which then all take
a very long time catching up committing changes.

Because this test program didn't finish until all the threads had
completed a fixed amount of work, essentially these threads all
ended up tripping over eachother. In CI this would exceed 6h+,
while originally I intended this to run in about 100s or so.

Instead, cap the run time to ~30s by default. If threads exceed
this time, they will immediately exit, which causes any clog in
contention between the threads to drain relatively quickly.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-11-13 12:43:31 -08:00
Zach Brown
8484a58dd6 Have xfstest pass when using args
The xfstests's golden output includes the full set of tests we expect to
run when no args are specified.  If we specify args then the set of
tests can change and the test will always fail when they do.

This fixes that by having the test check the set of tests itself, rather
than relying on golden output.  If args are specified then our xfstest
only fails if any of the executed xfstest tests failed.  Without args,
we perform the same scraping of the check output and compare it against
the expected results ourself.

It would have been a bit much to put that large file inline in the test
file, so we add a dir of per-test files in revision control.  We can
also put the list of exclusions there.

We can also clean up the output redirection helper functions to make
them more clear.  After xfstests has executed we want to redirect output
back to the compared output so that we can catch any unexpected output.

Signed-off-by: Zach Brown <zab@versity.com>
2025-11-13 12:43:31 -08:00
Zach Brown
a077104531 Add crash monitor to run-tests
Add a little background function that runs during the test which
triggers a crash if it finds catastrophic failure conditions.

This is the second bg task we want to kill and we can only have one
function run on the EXIT trap, so we create a generic process killing
trap function.

We feed it the fenced pid as well.  run-tests didn't log much of value
into the fenced log, and we're not logging the kills into anymore, so we
just remove run-tests fenced logging.

Signed-off-by: Zach Brown <zab@versity.com>
2025-11-13 12:43:31 -08:00
Zach Brown
23aaa994df Add -l to run-tests for looping over tests
Add an option to run-tests to have it loop over each test that will be
run a number of times.  Looping stops if the test doesn't pass.

Most of the change in the per-test execution is indenting as we add the
for loop block.  The stats and kmsg output are lifted up before of the
loop.

Signed-off-by: Zach Brown <zab@versity.com>
2025-11-06 12:07:42 -08:00
Zach Brown
7d14b57b2d Export PATH once in run-tests
Might as well just export the PATH once as we change it, no need to
export it in every test iteration.

Signed-off-by: Zach Brown <zab@versity.com>
2025-11-06 11:02:38 -08:00
Zach Brown
4b41cf9789 Centralize port numbers and avoid ephemeral
The tests were using high ephemeral port numbers for the mount server's
listening port.  This caused occasional failure if the client's
ephemeral ports happened to collide with the ports used by the tests.

This ports all the port number configuration in one place and has a
quick check to make sure it doesn't wander into the current ephemeral
range.  Then it updates all the tests to use the chosen ports.

Signed-off-by: Zach Brown <zab@versity.com>
2025-10-29 10:12:52 -07:00
Zach Brown
33f6e9d0cd Merge pull request #248 from versity/auke/shuffle-tests
Add option to shuffle test order.
2025-10-23 09:33:34 -07:00
Chris Kirby
d277d7e955 Fix race condition in orphan-inodes test
Make sure that the orphan scanners can see deletions after forced unmounts
by waiting for reclaim_open_log_tree() to run on each mount; and waiting for
finalize_and_start_log_merge() to run and not find any finalized trees.

Do this by adding two new counters: reclaimed_open_logs and
log_merge_no_finalized and fixing the orphan-inodes test to check those
before waiting for the orphan scanners to complete.

Signed-off-by: Chris Kirby <ckirby@versity.com>
2025-10-22 10:59:03 -07:00
Chris Kirby
c72bf915ae Use ENOLINK as a special error code during forced unmount
Tests such as quorum-heartbeat-timeout were failing with EIO messages in
dmesg output due to expected errors during forced unmount. Use ENOLINK
instead, and filter all errors from dmesg with this errno (67).

Signed-off-by: Chris Kirby <ckirby@versity.com>
2025-10-22 10:58:44 -07:00
Auke Kok
c3e6f3cd54 Don't run format-version-forward-back on el8, either
This test compiles an earlier commit from the tree that is starting to
fail due to various changes on the OS level, most recently due to sparse
issues with newer kernel headers. This problem will likely increase
in the future as we add more supported releases.

We opt to just only run this test on el7 for now. While we could have
made this skip sparse checks that fail it on el8, it will suffice at
this point if this just works on one of the supported OS versions
during testing.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-10-15 17:35:17 -05:00
Auke Kok
f86a7b4d3c Fully wait for orphan inode scan to complete.
The issue with the previous attempt to fix the orphan-inodes test was
that we would regularly exceed the 120s timeout value put in there.

Instead, in this commit, we change the code to add a new counter to
indicate orphan deletion progress. When orphan inodes are deleted, the
increment of this counter indicates progress happened. Inversely,
every time the counter doesn't increment, and the orphan scan attempts
counter increments, we know that there was no more work to be done.

For safety, we wait until 2 consecutive scan attempts were made without
forward progress in the test case.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-10-06 12:27:25 -05:00
Auke Kok
96eb9662a1 Revert "Extend orphan-inodes timeout."
This reverts commit 138c7c6b49.

The timeout value here is still exceeded by CI test jobs, and thus
causing the test to fail.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-10-06 12:27:25 -05:00
Chris Kirby
47af90d078 Fix race in offline-extent-waiting test
Before comparing file contents, wait for the background dd to complete.
Also fix a typo.

Signed-off-by: Chris Kirby <ckirby@versity.com>
2025-10-06 12:27:25 -05:00
Chris Kirby
669e37c636 Remove hung task workaround from large-fragmented-free test
Adjusting hung_task_timeout_secs is still needed for this test to pass
with a debug kernel. But the logic belongs on the platform side.

Signed-off-by: Chris Kirby <ckirby@versity.com>
2025-10-06 12:27:25 -05:00
Auke Kok
bf87ea0a1c Add option to shuffle test order.
The `-R` option will shuffle the order in which tests are executed.

The testing order shouldn't affect the outcome of any of the tests, but
in practice many of these tests will execute code slightly different
based on the history of the filesystem, resources allocated, memory
usage etc. of tests that were executed before. Shuffling the order of
tests therefore introduces small semi-random variations in the
enviroment.

The xfstests test is the only one that can't be shuffled yet into the
mix, so it is kept at the end. This is because it leaves the filesystems
unmounted. At a later point we may want to address this.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-10-03 14:55:32 -07:00
Zach Brown
624eb128c6 Merge pull request #221 from versity/auke/enospc-test
Give enospc test more time to commit unlink.
2025-05-09 11:27:04 -07:00
Zach Brown
091eb3b683 Merge pull request #219 from versity/auke/fix-tests-failing-dirty-test-dirs
Fix test cases that don't run cleanly in a semi-dirty env.
2025-05-09 11:17:24 -07:00
Zach Brown
04e8cc6295 Merge pull request #220 from versity/auke/orphan-inodes
Extend orphan-inodes timeout.
2025-05-09 11:15:13 -07:00
Auke Kok
377e49caf1 Properly silently kill background tasks.
Occasionally, we have some tests fail because these kills produce:

tests/lock-recover-invalidate.sh: line 42:  9928 Terminated

Even though we expected them to be silent. In these particular cases we
already don't care about this output.

We borrow the silent_kill() function from orphan-inodes and promote it
to t_silent_kill() in funcs/exec.sh, and then use it everywhere where
appropriate.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-05-08 12:03:04 -07:00
Auke Kok
d08eb66adc Give enospc test more time to commit unlink.
The current test sequence performs the unlink and immediately tests
whether enough resources are available to create new files again, and
this consistently fails.

One of my crummy VMs takes a good 12 seconds before the `touch` actually
succeeds. We care about the filesystem eventually returning from ENOSPC,
and certainly we don't want it to take forever, but there is a period
after our first ENOSPC error and cleanup that we expect ENOSPC to fail
for a bit longer.

Make the timeout 120s. As soon as the `touch` completes, exit the wait
loop.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-05-08 11:40:13 -07:00
Auke Kok
1d0cde7cc3 Clean up old test data as needed.
If run without `-m` (explicit mkfs) in subsequent testing, old test
data files may break several tests. Most failures are -EEXIST, but
there are some more subtle ones.

This change erases any existing test dir as needed just before we
run the tests, and avoids the issue entirely.

I considered doing a `mv dir dir.$$ && rm -rf dir.$$ &` alternative
solution but that likely will interfere disproportionally with
tests that do disconnects and other thing that can be impacted by an
unlink storm.

This has an obvious performance aspect - tests will be a little
slower to start on subsequent runs. In CI, this will effectively be
a no-op though.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-05-08 10:10:01 -07:00
Auke Kok
138c7c6b49 Extend orphan-inodes timeout.
This test regularly fails in CI when the 15 seconds elapses and the
system still hasn't concluded the mount log merges and orphan inode
scans needed to unlink the test files.

Instead of just extending the timeout value, we test-and-retry for 120s.
This hopefully is faster in most cases. My smallest VM needs about 6s-8s
on average.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-05-08 09:56:45 -07:00
Zach Brown
8aa1a98901 Merge pull request #210 from versity/auke/perf-irq-took-too-long
Filter out perf `interrupt took too long` dmesg.
2025-04-30 10:04:00 -07:00
Auke Kok
24031cde1d TAP formatted output.
Stored as `results/scoutfs.tap`, this file contains TAP format 14
generated test results.

Embedded in the output are some metadata so that these files can be
aggregated and stored in an unique and deduplicating way, but using a
generated UUID at the start of testing. The file itself also catches git
ID, date, and kernel version, as well as the (possibly altered) test
sequence used.

Any test that has diff or dmesg output will be considered failed, and a
copy of the relevant data is included as comments.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-04-15 12:02:41 -07:00
Auke Kok
1b47e9429e Filter out perf interrupt took too long dmesg.
Example:

```
[ 2469.638414] perf: interrupt took too long (2507 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
```

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-04-14 12:06:58 -07:00
Auke Kok
7ea084082d Ignore pipefail alternative error when not a tty.
This happens with the basic-truncate test, only. It's the only user
of the `yes` program.

The `yes` command normally fails gracefully under the usual runs that
are attached to some terminal. But when the test script runs entirely
under something else, it will throw a needless error message that
pollutes the test output:

  `yes: standard output: Broken pipe`

Adjust the redirect to omit all stderr for `yes` in this case.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-04-14 11:13:39 -07:00
Auke Kok
e59a5f8ebd Readdir w/offset validation.
Verify using xfs_io that readdir offsets match expected output.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-01-27 14:49:04 -05:00
Auke Kok
92f704d35a Enable all xfstests mmap() tests.
Now that all of these should be passing, we enable all mmap() tests in
xfstests, and update the golden output with the new tests.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-01-23 14:28:40 -05:00
Auke Kok
311bf75902 Add mmap tests.
Two test programs are added. The run time is about 1min on my el7
instance.

The test script finishes up with a read/write mmap test on offline
extents to verify the data wait paths in those functions.

One program will perform vfs read/write and mmap read/write calls on
the same file from across 5 threads (mounts) repeatedly.  The goal
is to assure there are no locking issues between read/write paths.

The second test program performs consistency checking on a file that is
repeatedly written/read using memory maps and normal reads and writes,
and the content is verified after every operation.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-01-23 14:28:40 -05:00
Zach Brown
4a26059d00 Add lock-shrink-read-race test
Add a quick test that races readers and shrinking to stress lock object
refcount racing between concurrent lock request handling threads in the
lock server.

Signed-off-by: Zach Brown <zab@versity.com>
2024-10-31 15:35:11 -07:00
Auke Kok
fc7876e844 Allow certain tests to skip, but not fail exit condition.
Previously, any t_skip would cause the final test result to be a failure
because up until now no test should have been skipped.

However, with format-version-forward-back not being compatible with el9,
we are going to rely on el7/8 testing for that test soleley, and
therefore we have to allow skipping of this test on el9 and newer OS
versions.

We add `t_skip_permitted` to signal this from the test case to the
run-tests.sh script. A new exit code is passed, and all accounting is
updated to reflect that a test was skipped, but this was permitted. We
modify format-version-forward-back to use this new exit path.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 15:38:34 -07:00
Auke Kok
5337b9e221 Ingore Process accounting resumed dmesg.
I'm seeing more and more of these as audit is enabled in el8 and el9
images I am using for testing, and during ENOSPC tests this has a chance
of triggering process accounting suspension, and subsequent resume.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 15:38:34 -07:00
Auke Kok
8a22bdd366 Ignore device mapper size change dmesg output.
In v1.18-10-g5507ee5, we changed the test code away from loopback
to device-mapper, which simplified our DUT setup code.

However, this results in the occasional `device changed size` messages
now being emitted by the `dm` driver instead of the `loop` kernel
module. We have to additionally ignore these kernel messages from now as
well.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 15:38:34 -07:00
Auke Kok
9335d2eb86 Don't --track when checking out a tag.
I've pushed a tag/release to scoutfs-xfstests-dev instead of a full
blown branch. This seems simpler and cleaner than using branches,
because we're going to end up rebasing these things a lot. However, we
can't --track tags, so, if the branch name passed to -x is actually a
tag instead of a branch, we have to omit the --track option here.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 15:38:34 -07:00
Auke Kok
97b081de3f Switch xfstests tag over in CI jobs using this marker file.
CI testing needs to know which xfstests branch to use on all OSs.

We can't just use the el9 xfstests branch on el9 only, because we
need to run the same el9 xfstests on el8 and el7 as well, otherwise
testing will just fail.

So, we put a marker file in our git repo that tells us that we're
not going to use the default `scoutfs` branch from scoutfs-xfstests-dev
but our own special tag or branch. The CI job then should pass the
proper -x {branch} flag to the run-tests.sh script.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 15:38:34 -07:00
Auke Kok
21b5032365 Add new xfstests that we won't support or don't pass
The new version of xfstests adds a _lot_ more tests to our mix. Many
of the new ones will auto enable or auto skip as needed.

There are tests we can't or won't support that will be in future
xfstests. Disable them now so we can avoid dealing with them later.

Quite a few fall into "we don't support these types of mounting yet",
mostly bind-mount or dm-mapper things. We disable all the swapfile
tests flatout.

A few tests fail on el7 but not el8/9 but we don't have a way to run
them without failing yet, so disable them as well.

Update golden with the proper new array of tests. This all requires
the `auke/scoutfs-el9` branch in `versity/scoutfs-xfstests-dev`.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 15:38:34 -07:00
Auke Kok
4723f4f9ab Disable format-version-forward-back test on el9+.
Using t_skip, we just skip this test on el9.

If we ever want to add a formatversion 2->3 test, perhaps we should
just add a separate test script, instead of going over a static array.

But let's not worry about this too much right now.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 15:38:34 -07:00
Auke Kok
0a8b3f4e94 Fix basic-posix-acl test output on el9
It turns out that on el9, `bash -c` prints out `bash: line 1: cd..`
instead of `line 0:` on el7 or el8. So discard all the stderr from
these `cd` lines entirely and just rely on the expected echo
output to stdout.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 15:38:34 -07:00
Auke Kok
8a4b0967cb Add fiemap output through scoutfs util.
There's filefrag already, and that works, but, it's output is very
inconsistent between various OS release versions, and it has already
meant that we'd needed to adjust tests to account for these little
but insignificant changes. A lot more work than useful. It's even
more changed in el9.

This adds `scoutfs get-fiemap FILE` and prints out block extent
info with flags that we care about as an abbreviated letter: U for
Unwritten, L for Last, and O for Unknown (as in, "offline").

The -P/--physical and -L/--logical options turn off logical or physical
offset display, in case you only want to see the offsets in either
units. You can pass -b/--byte to display offsets and lengths in
byte values. The block size will then be obtained from fstat() of
the queried file (4096 for scoutfs).

I've removed all uses of filefrag from our scoutfs tests. Xfstests
still calls it but their internal diff takes care of that issue.

Where needed and appropriate, the tests are adjusted so that the output
of `scoutfs get-fiemap` is as close as it can to what it used to be,
so that reading the test results allows the quick view of what might
have been going wrong.

There are some output strings I have not bothered to update because
there's no real value to updating every output string to match,
and we just adjust the golden file accordingly.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 15:38:34 -07:00
Auke Kok
606c519e96 Simple-staging doesn't actually test overflow.
This isn't a simple case where we can use u64_region_wraps because
length is s32.

Let's actually test an overflow case instead of a case that doesn't
overflow, though. We still should properly add an overflow test here as
well.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
7d0e7e29f8 Avoid integer wrapping pitfalls for (off, len) pairs.
We use check_add_overflow(a, b, d) here to validate that (off, len)
pairs do not exceed the max value type. The kernel conveniently has
several macros to sort out the problems with signed or unsigned types.

However, we're not interested in purely seeing whether (a + b)
overflows, because we're using this for (off, len) overflow checks,
where the bytes we read are from 0 to len -1. We must therefore call
this check with (b) being "len - 1".

I've made sure that we don't accidentally fail when (len == 0)
in all cases by making sure we've already checked this condition
before, and moving code around as needed to ensure that (len > 0)
in all cases where we check.

The macro check_add_overflow requires a (d) argument in which
temporarily the result of the addition is stored and then checked to see
if an overflow occurred. We put a `tmp` variable on the stack of the
correct type as needed to make the checks function.

simple-release-extents test mistakenly relied on this buggy wrap code,
so it needs fixing. The move-blocks test also got it wrong.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
6d42d260cf xargs option conflict now a warning in el9
The warnings thrown by el9's version of xargs are unexpected output and
cause this test to fail. When using the -I option (replace) the -n 1
arguments are always assumed. In el7/8 no warnings were printed.

We can just remove `-n 1` since the argument is never needed.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
b45fbe0bbb Don't pass data version to attr_x unless the ioctl means to set it.
The wrapper in setattr_more that translates the operations to attr_x
needs to decide whether to ask attr_x to perform a change to any of
the fields passed to it or not. For the date and size fields this
is implicit - we always tell attr_x to change them. For any of the
other fields, it should be explicit.

The only field that is in the struct that this applies to is
data_version. Because the data version field by default is zero,
we use that as condition to decide whether to pass the data_version
down to attr_x.

Previously, the code would always pass a data_version=0 down to attr_x,
triggering one of the validity checks, making it return -EINVAL. We
add a simple test case to test for this issue.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-09-27 19:31:22 -04:00
Auke Kok
9d8ac2c7d7 Write to kmsg which test we're executing.
This is done by xfstests and it's so much easier to follow what is going
on from logs or e.g. serial console that I thought I should do this for
scoutfs tests as well. It makes it so much easier to discern which test
may have been cause for issues when running a bunch of tests and you're
looking back at logs later.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-08-28 14:36:55 -07:00
Auke Kok
7b039a1d18 Add basic POSIX ACL tests.
These are extremely limited and very quick basic ACL tests we can
trivially do in under a second - purely basic funtionality tests only.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-08-12 15:07:43 -04:00