It's possible to trigger the block device autoloading mechanism
with a mknod()/stat(), and this mechanism has long been declared
obsolete, thus triggering a dmesg warning since el9_7, which then
fails the test. You may need to `rmmod loop` to reproduce.
Avoid this by not triggering a loop autoload: we just make a
different blockdev. Choosing `42` here should avoid any autoload
mechanism as this number is explicitly for demo drivers and should
never trigger an autoload.
We also just ignore the warning line in dmesg. Other tests might
still trigger it, as might background activity running during the
test.
Signed-off-by: Auke Kok <auke.kok@versity.com>
The tap output file was not yet complete as it failed to include
the contents of `status.msg`. In a few cases, that would mean it
lacks important context.
Signed-off-by: Auke Kok <auke.kok@versity.com>
We're getting test failures from messages that our guests can be
unresponsive. They sure can be. We don't need to fail for this one
specific case.
Signed-off-by: Zach Brown <zab@versity.com>
The xfstests golden output includes the full set of tests we expect to
run when no args are specified. If we specify args then the set of
tests can change and the test will always fail when they do.
This fixes that by having the test check the set of tests itself, rather
than relying on golden output. If args are specified then our xfstest
only fails if any of the executed xfstest tests failed. Without args,
we perform the same scraping of the check output and compare it against
the expected results ourselves.
It would have been a bit much to put that large file inline in the test
file, so we add a dir of per-test files in revision control. We can
also put the list of exclusions there.
We can also clean up the output redirection helper functions to make
them more clear. After xfstests has executed we want to redirect output
back to the compared output so that we can catch any unexpected output.
Signed-off-by: Zach Brown <zab@versity.com>
Tests such as quorum-heartbeat-timeout were failing with EIO messages in
dmesg output due to expected errors during forced unmount. Use ENOLINK
instead, and filter all errors from dmesg with this errno (67).
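
Filtering by errno could be sketched roughly as below. The function
name and the assumed message layout (a scoutfs error line carrying the
negative errno) are illustrative assumptions, not the harness's actual
filter. ENOLINK is 67 on Linux.

```shell
# Hypothetical filter: drop scoutfs error lines that report errno 67
# (ENOLINK). The message layout matched here is an assumption.
filter_enolink() {
	# Match "-67" only when it is not followed by another digit.
	grep -v -E 'scoutfs .* error: .*-67([^0-9]|$)'
}
```

Lines reporting other errnos, such as -5 (EIO), pass through and would
still fail the test.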
Signed-off-by: Chris Kirby <ckirby@versity.com>
Occasionally, we have some tests fail because these kills produce:
tests/lock-recover-invalidate.sh: line 42: 9928 Terminated
even though we expected them to be silent. In these particular cases we
don't care about this output.
We borrow the silent_kill() function from orphan-inodes and promote it
to t_silent_kill() in funcs/exec.sh, and then use it wherever
appropriate.
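
A quiet kill helper can be sketched as follows. This is a minimal
illustration of the idea, not necessarily the body of the real
t_silent_kill(): the shell prints the "Terminated" notice on stderr
when it reaps the job, so both the kill and the wait run with stderr
discarded.

```shell
# Sketch only; the real t_silent_kill() in funcs/exec.sh may differ.
t_silent_kill() {
	{
		kill "$@"
		# Reap the children here so that the shell's "Terminated"
		# notice is written while stderr points at /dev/null.
		wait "$@"
	} 2>/dev/null
}
```

The function returns the status of wait, which is non-zero for a
process killed by a signal, so callers that don't care about it should
ignore the return code.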
Signed-off-by: Auke Kok <auke.kok@versity.com>
Stored as `results/scoutfs.tap`, this file contains TAP format 14
generated test results.
Embedded in the output is some metadata so that these files can be
aggregated and stored in a unique, deduplicated way by using a UUID
generated at the start of testing. The file itself also captures git
ID, date, and kernel version, as well as the (possibly altered) test
sequence used.
Any test that has diff or dmesg output will be considered failed, and a
copy of the relevant data is included as comments.
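
The general shape of such a file can be sketched with a small emitter.
The helper name, the "name:status" argument format, and the comment
text are assumptions made for this illustration. The metadata
described above is left out.

```shell
# Hypothetical helper: print a TAP version 14 report for a list of
# "name:status" arguments, where status is "pass" or "fail".
emit_tap() {
	echo "TAP version 14"
	echo "1..$#"
	n=0
	for entry in "$@"; do
		n=$((n + 1))
		name=${entry%%:*}
		status=${entry##*:}
		if [ "$status" = "pass" ]; then
			echo "ok $n - $name"
		else
			echo "not ok $n - $name"
			# Failure details are carried as TAP comment lines.
			echo "# $name produced diff or dmesg output"
		fi
	done
}
```

For example, `emit_tap alpha:pass beta:fail` prints the version line, a
`1..2` plan, one `ok` line, and one `not ok` line followed by a comment.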
Signed-off-by: Auke Kok <auke.kok@versity.com>
Previously, any t_skip would cause the final test result to be a failure
because up until now no test should have been skipped.
However, with format-version-forward-back not being compatible with el9,
we are going to rely solely on el7/8 testing for that test, and
therefore we have to allow skipping of this test on el9 and newer OS
versions.
We add `t_skip_permitted` to signal this from the test case to the
run-tests.sh script. A new exit code is passed, and all accounting is
updated to reflect that a test was skipped, but this was permitted. We
modify format-version-forward-back to use this new exit path.
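
The accounting side can be sketched as a mapping from exit status to
result. The exit code values and the function name below are
placeholders chosen for this sketch, not the values used by
run-tests.sh.

```shell
# Placeholder exit codes; the real values in the harness may differ.
T_SKIP_STATUS=101
T_SKIP_PERMITTED_STATUS=103

# Map a test's exit status to a result string. Only an unpermitted
# skip is counted as a failure.
classify_status() {
	case "$1" in
	0) echo "passed" ;;
	"$T_SKIP_PERMITTED_STATUS") echo "skipped (permitted)" ;;
	"$T_SKIP_STATUS") echo "skipped (failure)" ;;
	*) echo "failed" ;;
	esac
}
```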
Signed-off-by: Auke Kok <auke.kok@versity.com>
I'm seeing more and more of these as audit is enabled in el8 and el9
images I am using for testing, and during ENOSPC tests this has a chance
of triggering process accounting suspension, and subsequent resume.
Signed-off-by: Auke Kok <auke.kok@versity.com>
In v1.18-10-g5507ee5, we changed the test code away from loopback
to device-mapper, which simplified our DUT setup code.
However, this results in the occasional `device changed size` messages
now being emitted by the `dm` driver instead of the `loop` kernel
module. We now have to ignore these kernel messages as well.
Signed-off-by: Auke Kok <auke.kok@versity.com>
The new version of xfstests adds a _lot_ more tests to our mix. Many
of the new ones will auto enable or auto skip as needed.
There are tests we can't or won't support that will be in future
xfstests. Disable them now so we can avoid dealing with them later.
Quite a few fall into "we don't support these types of mounting yet",
mostly bind-mount or dm-mapper things. We disable all the swapfile
tests outright.
A few tests fail on el7 but not el8/9 but we don't have a way to run
them without failing yet, so disable them as well.
Update golden with the proper new array of tests. This all requires
the `auke/scoutfs-el9` branch in `versity/scoutfs-xfstests-dev`.
Signed-off-by: Auke Kok <auke.kok@versity.com>
We have some fs functions which return info based on the test mount nr
that the test has set up. This refactors those a bit to also provide
some of the info when the caller has a path in a given mount. This will
let tests work with scratch mounts a little more easily.
Signed-off-by: Zach Brown <zab@versity.com>
Signed-off-by: Bryant G. Duffy-Ly <bduffyly@versity.com>
KASAN could raise a spurious warning if the unwinder started in code
without ORC metadata and tried to access the KASAN stack frame
redzones. This was fixed upstream but we can occasionally still see it
in older kernels. We can ignore these messages.
Signed-off-by: Zach Brown <zab@versity.com>
Now that we're not setting up per-mount loopback devices, the loop
module might not be loaded until tests are running.
Signed-off-by: Zach Brown <zab@versity.com>
On el9 distros systemd-journald will log rotation events into kmsg.
Since the default logs on VM images are transient only, they are
rotated several times during a single test cycle, causing test failures.
Signed-off-by: Auke Kok <auke.kok@versity.com>
The t_quiet test command execution helper was constantly truncating the
quiet.log with the output of each command. It was meant to show each
command and its output as they're run.
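
The difference comes down to truncating versus appending redirection.
A minimal sketch, assuming a QUIET_LOG variable that names the log
file (the variable name and the function body are assumptions, not the
helper's actual code):

```shell
# Sketch only: QUIET_LOG and this body are assumptions.
QUIET_LOG=${QUIET_LOG:-./quiet.log}

t_quiet() {
	# ">>" appends. A plain ">" here would truncate the log on every
	# call and leave only the last command's output behind.
	echo "# $*" >> "$QUIET_LOG"
	"$@" >> "$QUIET_LOG" 2>&1
}
```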
Signed-off-by: Zach Brown <zab@versity.com>
If setting a sysfs option fails the bash write error is output. It
contains the script line number which can change over time, leading to
mismatched golden output failures if we used the output as an expected
indication of failure. Callers should test its rc and output
accordingly if they want the failure logged and compared.
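
One way to keep the shell's own error out of the compared output is to
wrap the write in a group with stderr discarded and let the caller
check the return code. The helper name and arguments below are
hypothetical.

```shell
# Hypothetical helper: write a value to an option file quietly and
# return the status of the write.
set_option() {
	file=$1
	value=$2
	# The redirection sits on the group so that the shell's own
	# "cannot open" or write error message is discarded as well.
	{ echo "$value" > "$file"; } 2>/dev/null
}
```

A caller that wants the failure in its golden output can then print
its own stable message, for example
`set_option "$path" 1 || echo "setting option failed"`.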
Signed-off-by: Zach Brown <zab@versity.com>
The test shell helpers for saving and restoring mount options were
trying to put each mount's option value in an array. It meant to build
the array key by concatenating the option name and the mount number.
But it didn't isolate the option "name" variable when evaluating it,
instead always evaluating "name_" to nothing and building keys for all
options that only contained the mount index. This then broke when tests
attempted to save and restore multiple options.
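
The expansion problem can be reproduced in isolation. The option name
below is a placeholder and the variable names are chosen for this
sketch. It assumes `name_` is unset and that `set -u` is not in effect.

```shell
name="some_option"	# placeholder option name
nr=2

# Buggy: the shell reads "$name_" as a variable called "name_", which
# is unset, so the key collapses to just the mount number.
bad_key="$name_$nr"

# Fixed: braces end the variable name before the underscore.
good_key="${name}_$nr"

echo "bad=$bad_key good=$good_key"
```

With the buggy form every option for mount 2 shares the key `2`, so
saving a second option overwrites the first.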
Signed-off-by: Zach Brown <zab@versity.com>
The t_server_nr and t_first_client_nr helpers iterated over all the fs
numbers examining their quorum/is_leader files, but clients don't have a
quorum/ directory. This was causing spurious output in tests that were
looking for servers but didn't find one in the first quorum fs number and
made it down into the clients.
Give them a helper that returns 0 for being a leader if the quorum/ dir
doesn't exist.
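
Such a helper can be sketched as below. The function name and the
argument (a mount's sysfs directory) are assumptions. Only the
quorum/is_leader file name comes from the description above.

```shell
# Hypothetical helper: print the contents of quorum/is_leader for a
# mount's sysfs dir, or 0 when the mount has no quorum/ directory.
is_leader() {
	sysfs_dir=$1
	if [ -e "$sysfs_dir/quorum/is_leader" ]; then
		cat "$sysfs_dir/quorum/is_leader"
	else
		# Clients have no quorum/ dir and are never the leader.
		echo 0
	fi
}
```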
Signed-off-by: Zach Brown <zab@versity.com>
[85164.299902] scoutfs f.8c19e1.r.facf2e error: server error writing btree blocks: -5
[144308.589596] scoutfs f.c9397a.r.8ae97f error: server error -5 freeing merged btree blocks: looping commit del/upd freeing item
[174646.005596] scoutfs f.15f0b3.r.1862df error: server error -5 freeing merged btree blocks: final commit del/upd freeing item
[146653.893676] scoutfs f.c7f188.r.34e23c error: server error writing super block: -5
[273218.436675] scoutfs f.dd4157.r.f0da7e error: server failed to bind to 127.0.0.1:42002, err -98
[376832.542823] scoutfs f.049985.r.1a8987 error: error -5 reading quorum block 19 to update event 1 term 3
The above is example output that will be filtered out.
Signed-off-by: Bryant G. Duffy-Ly <bduffyly@versity.com>
The quorum service shuts down if it sees errors that mean that it can't
do its job.
This is mostly fatal errors gathering resources at startup or runtime IO
errors but it was also shutting down if server startup fails. That's
not quite right. This should be treated like the server shutting down
on errors. Quorum needs to stay around to participate in electing the
next server.
Fence timeouts could trigger this. A quorum mount could crash, the
next server without a fence script could have a fence request time out
and shut down, and now the third remaining server is left to indefinitely
send vote requests into the void.
With this fixed, continuing that example, the quorum service in the
second mount remains to elect the third server with a working fence
script after the second server shuts down after its fence request times
out.
Signed-off-by: Zach Brown <zab@versity.com>
The core quorum work loop assumes that it has exclusive access to its
slot's quorum block. It uniquely marks blocks it writes and verifies
the marks on read to discover if another mount has written to its slot
under the assumption that this must be a configuration error that put
two mounts in the same slot.
But the design of the leader bit in the block violates the invariant
that only a slot will write to its block. As the server comes up and
fences previous leaders it writes to their block to clear their leader
bit.
The final hole in the design is that because we're fencing mounts, not
slots, each slot can have two mounts in play. An active mount can be
using the slot and there can still be a persistent record of a previous
mount in the slot that crashed that needs to be fenced.
All this comes together to have the server fence an old mount in a slot
while a new mount is coming up. The new mount sees the mark change and
freaks out and stops participating in quorum.
The fix is to rework the quorum blocks so that each slot only writes to
its own block. Instead of the server writing to each fenced mount's
slot, it writes a fence event to its block once all previous mounts have
been fenced. We add a bit of bookkeeping so that the server can
discover when all block leader fence operations have completed. Each
event gets its own term so we can compare events to discover live
servers.
We get rid of the write marks and instead have an event that is written
as a quorum agent starts up and is then checked on every read to make
sure it still matches.
Signed-off-by: Zach Brown <zab@versity.com>
Add a test which exercises the various reasons for fencing mounts and
checks that we reclaim the resources that they had.
Signed-off-by: Zach Brown <zab@versity.com>
The shared recovery layer outputs different messages than when it ran
only for lock_recovery in the lock server.
Signed-off-by: Zach Brown <zab@versity.com>
t_umount had a typo that had it try to unmount a mount based on a
caller's variable, which accidentally happened to work for its only
caller. Future callers would not have been so lucky.
Signed-off-by: Zach Brown <zab@versity.com>
t_trigger_arm always output the value of the trigger after arming on the
premise that tests required the trigger being armed. In the process of
showing the trigger it calls a bunch of t_ helpers that build the path
to the trigger file using statfs_more to get the rid of mounts.
If the trigger being armed is in the server's mount and the specific
trigger test is fired by the server's statfs_more request processing
then the trigger can be fired before we read its value. Tests can
inconsistently fail as the golden output shows the trigger being armed
or not depending on whether it was in the server's mount or not.
t_trigger_arm_silent doesn't output the value of the armed trigger. It
can be used for low level triggers that don't rely on reading the
trigger's value to discover that their effect has happened.
Signed-off-by: Zach Brown <zab@versity.com>
Tests can use t_counter_diff to put a message in their golden output
when a specific change in counters is expected. This adds
t_counter_diff_changed to output a message that indicates change or not,
for tests that want to see counters change but the amount of change
doesn't need to be precisely known.
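
The idea can be sketched as follows. The function name, argument
order, and message text are assumptions for illustration and may not
match the real helper.

```shell
# Sketch: report whether a counter changed between two samples without
# printing the amount, so the golden output stays stable.
counter_diff_changed() {
	name=$1
	before=$2
	after=$3
	if [ "$before" -ne "$after" ]; then
		echo "counter $name changed"
	else
		echo "counter $name did not change"
	fi
}
```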
Signed-off-by: Zach Brown <zab@versity.com>
We mask device numbers in command output to 0:0 so that we can have
consistent golden test output. The device number matching regex
responsible for this missed a few digits.
It didn't show up until we both tested enough mounts to get larger
device minor numbers and fixed multi-mount consistency so that the
affected tests didn't fail for other reasons.
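
A masking filter that accepts any number of digits on either side of
the colon might look like this. The function name and the exact regex
are assumptions, not the harness's actual filter. Note that a pattern
this simple would also rewrite other digit:digit text such as
timestamps.

```shell
# Hypothetical filter: rewrite every major:minor pair to 0:0.
mask_devnum() {
	# "+" allows any number of digits, so large minors are covered.
	sed -E 's/[0-9]+:[0-9]+/0:0/g'
}
```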
Signed-off-by: Zach Brown <zab@versity.com>
Our test unmount function unmounted the device instead of the mount
point. It was written this way back in an old version of the harness
which didn't track mount points.
Now that we have mount points, we can just unmount that. This stops the
umount command from having to search through all the current mounts
looking for the mountpoint for the device it was asked to unmount.
Signed-off-by: Zach Brown <zab@versity.com>
When running debug kernels in guests we can really bog things down
enough to trigger hrtimer warnings. I don't think there's much we can
reasonably do about that.
Signed-off-by: Zach Brown <zab@versity.com>