This test case detects and reproduces a customer issue we're seeing
where the new .get_acl() method API and underlying changes in el9_6+
cause ACL cache fetching to return inconsistent results, which shows
up as missing ACLs on directories.
This particular sequence reproduces consistently enough to warrant a
dedicated test.
Signed-off-by: Auke Kok <auke.kok@versity.com>
Adds basic mkfs/mount/umount helpers that handle making, mounting,
and unmounting scratch devices. The mount/unmount helpers create
"$T_MSCR", which lives in "$T_TMPDIR".
Signed-off-by: Auke Kok <auke.kok@versity.com>
There have been a few failures where output was generated just before
we crashed but didn't have a chance to be written. Add a best-effort
background sync before crashing. There's a good chance it'll hang if
the system is stuck, so we don't wait for it directly, we just let
some time pass.
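Roughly, and assuming the crash is triggered via sysrq (the exact
delay and trigger mechanism are assumptions here):

```
# best-effort: start a background sync but don't wait on it, it may
# hang if the system is already stuck
sync &
sleep 5                         # just let some time pass
echo c > /proc/sysrq-trigger    # then crash
```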
Signed-off-by: Zach Brown <zab@versity.com>
Errors from lock server calls typically shut the server down.
During a normal unmount a client's locks are reclaimed before the
connection is disconnected, so the lock server won't try to send to
unmounting clients.
Clients whose connections time out can cause ENOTCONN errors. Their
connection is freed before they're fenced and their locks are
reclaimed. The server can try to send to such a disconnected client
on behalf of a lock and get a send error.
These errors shouldn't shut down the server. The client is either
going to be fenced and have its locks reclaimed, ensuring forward
progress, or the server is going to shut down if it can't fence.
This was seen in testing as multiple clients were timed out.
Signed-off-by: Zach Brown <zab@versity.com>
The fence-and-reclaim test runs a bunch of scenarios and makes sure that
the fence agent was run on the appropriate mount's rids.
Unfortunately the checks were racy. The check itself only looked at
the log once to see if the rid had been fenced. Each check was
preceded by steps that were supposed to wait until the rid should
have been fenced and could be checked.
Those steps were racy. They'd do things like make sure a fence request
wasn't pending, but never waited for it to be created in the first
place. They'd falsely indicate that the log should be checked, and
when the rid wasn't found in the log the test would fail. In logs of
failures we'd see that the rids were fenced after this test had failed
and moved on to the next.
This simplifies the checks. It gets rid of all the intermediate steps
and just waits, with a timeout, for the rid to be fenced. This avoids
the flaky checks.
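The simplified check boils down to a wait loop along these lines (the
log path, message format, and t_fail helper are illustrative, not the
test's actual names):

```
wait_for_fenced_rid() {
        local rid="$1" timeout="${2:-60}"
        local deadline=$((SECONDS + timeout))

        while ! grep -q "fenced rid $rid" "$T_TMPDIR/fence-agent.log"; do
                if (( SECONDS >= deadline )); then
                        t_fail "rid $rid was not fenced within ${timeout}s"
                fi
                sleep 1
        done
}
```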
Signed-off-by: Zach Brown <zab@versity.com>
Lock invalidation has assertions for critical errors, but it doesn't
allow the synthetic errors that come from forced unmount severing the
client's connection to the world.
Signed-off-by: Zach Brown <zab@versity.com>
The net layer was initially built around send queue lists with the
presumption that there wouldn't be many messages in flight and that
responses would be sent roughly in order.
In the modern era, we can have tens of thousands of lock request
messages in flight. This led to O(n^2) processing in quite a few
places as recv processing searched for either requests to complete or
responses to free.
This adds messages to two rbtrees, indexing requests by their id and
responses by their send sequence, so recv processing can find messages
in O(log n). This patch is intended to be minimally disruptive: it
only replaces the search of the send and resend queues in the recv
path with rbtrees. Other uses of the two queue lists are untouched.
On a single node, with ~40k lock shrink attempts in flight, we go from
processing ~800 total request/grant request/response pairs per second to
~60,000 per second.
Signed-off-by: Zach Brown <zab@versity.com>
The use of the VM shrinker was a bad fit for locks. Shrinking a lock
requires a round trip with the server to request a null mode. The VM
treats the locks like a cache, as expected, which leads to huge
numbers of locks accumulating and then being shrunk in bulk. This
creates a huge backlog of locks making their way through the network
conversation with the server that implements invalidating to a null
mode and freeing. It starves other network and lock processing,
possibly for minutes.
This removes the VM shrinker and instead introduces an option that
sets a limit on the number of idle locks. Once the number of locks
exceeds that limit, we only try to free a single oldest lock at each
lock call. This results in a lock freeing pace that is proportional
to the allocation of new locks by callers and so is throttled by the
work done while callers hold locks. It avoids the bulk shrinking of
tens of thousands of locks that we see in the field.
Signed-off-by: Zach Brown <zab@versity.com>
We observe that unmount in this test can consume up to 10sec before
the test proceeds to record heartbeat timeout elections by followers.
When this happens, elections and new leaders happen before unmount
even completes. This indicates that heartbeat packets from the
unmounting mount cease immediately, but the unmount takes longer doing
other things. The timeouts then trigger, possibly during the unmount.
The result is that with timeouts of 3 seconds, we're not actually
waiting for an election at all. It already happened 7 seconds ago. The
code here just "sees" that it happens a few hundred ms after it started
looking for it.
There are a few ways to go about this fix. We could record the actual
timestamp of the election and compare it with the actual timestamp of
the last heartbeat packet. This would be conclusive, and could
disregard any complication from umount taking too long. But it also
means adding timestamping in various places, or having to rely on
tcpdump with packet processing.
We can't just record $start before unmount, either. We would still
violate the part of the test that checks that elections didn't happen
too late, especially in the 3sec test case if unmount takes 10sec.
The simplest solution is to unmount in a bg thread, and circle around
later to `wait` for it to ensure we can re-mount without ill effect.
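The shape of the fix, roughly (variable names are placeholders):

```
umount "$mnt" &
umount_pid=$!

start=$SECONDS
# ... watch for the heartbeat timeout election relative to $start ...

wait "$umount_pid"      # only re-mount once the unmount really finished
```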
Signed-off-by: Auke Kok <auke.kok@versity.com>
Due to an iput race, the "unlink wait for open on other mount"
subtest can fail. If the unlink happens inline, then the test
passes. But if the orphan scanner has to complete the unlink
work, it's possible that there won't be enough log merge work
for the scanner to do the cleanup before we look at the seq index.
Add SCOUTFS_TRIGGER_LOG_MERGE_FORCE_FINALIZE_OURS, to allow
forcing a log merge. Add new counters, log_merges_start and
log_merge_complete, so that tests can see that a merge has happened.
Then we have to wait for the orphan scanner to do its work.
Add a new counter, orphan_scan_empty, that increments each time
the scanner walks the entire inode space without finding any
orphans. Once the test sees that counter increment, it should be
safe to check the seq index and see that the unlinked inode is gone.
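In the test this turns into waiting for the new counters to tick over,
roughly like this (t_trigger and t_counter are hypothetical helpers
for arming triggers and reading counters):

```
merges=$(t_counter log_merge_complete)
t_trigger SCOUTFS_TRIGGER_LOG_MERGE_FORCE_FINALIZE_OURS
while [ "$(t_counter log_merge_complete)" -le "$merges" ]; do
        sleep 1
done

empties=$(t_counter orphan_scan_empty)
while [ "$(t_counter orphan_scan_empty)" -le "$empties" ]; do
        sleep 1
done
# now it should be safe to check the seq index for the unlinked inode
```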
Signed-off-by: Chris Kirby <ckirby@versity.com>
The /sys/kernel/debug/tracing/buffer_size_kb file always reads as
"7 (expanded: 1408)". So the -T option to run-test.sh won't work,
because it tries to multiply that string by the given factor.
It always defaults to 1408 on every platform we currently support.
Just use that value so we can specify -T in CI runs.
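That makes -T a simple multiplication of the known default
(T_TRACE_FACTOR is illustrative):

```
# buffer_size_kb reads back as "7 (expanded: 1408)", so scale the known
# 1408 KB default rather than parsing the file
trace_kb=$(( 1408 * T_TRACE_FACTOR ))
echo "$trace_kb" > /sys/kernel/debug/tracing/buffer_size_kb
```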
Signed-off-by: Chris Kirby <ckirby@versity.com>
A few callers of alloc_move_empty in the server were providing a budget
that was too small. Recent changes to extent_mod_blocks increased the
max budget that is necessary to move extents between btrees. The
existing WAG of 100 was too small for trees of height 2 and 3. This
caused looping in production.
We can increase the move budget to half the overall commit budget, which
leaves room for a height of around 7 each. This is much greater than we
see in practice because the size of the per-mount btrees is effectively
limited by both watermarks and thresholds to commit and drain.
Signed-off-by: Zach Brown <zab@versity.com>
It's possible to trigger the block device autoloading mechanism
with a mknod()/stat(), and this mechanism has long been declared
obsolete, thus triggering a dmesg warning since el9_7, which then
fails the test. You may need to `rmmod loop` to reproduce.
Avoid this by not triggering a loop autoload at all - we just make a
different blockdev. Choosing `42` here should avoid any autoload
mechanism, as this number is explicitly reserved for demo drivers and
should never trigger an autoload.
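A sketch of the idea (the path is illustrative, and it assumes 42 is
used as the block major):

```
# major 42 has no driver bound to it, so mknod/stat won't autoload anything
mknod "$T_TMPDIR/fake-bdev" b 42 0
stat "$T_TMPDIR/fake-bdev"
```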
We also just ignore the warning line in dmesg. Other tests, as well
as background noise running during the test, can and perhaps still
might trigger it.
Signed-off-by: Auke Kok <auke.kok@versity.com>
Use of T_M0 and variants should be reserved for e.g.
scoutfs <subcommand> -p <mountpoint> style usages. Tests should
create individual content files in the assigned subdirectory.
Signed-off-by: Auke Kok <auke.kok@versity.com>
The tap output file was not yet complete: it failed to include the
contents of `status.msg`. In a few cases that meant it lacked
important context.
Signed-off-by: Auke Kok <auke.kok@versity.com>
Bash has special handling for these standard IO files, but there are
cases where customers have special restrictions set on them, likely as
part of IDS software to avoid leaking error data out of system logs.
In any case, we can just reopen the existing file descriptors in both
these cases to avoid the problem entirely. This will always work.
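For example, rather than redirecting through /dev/stdout or
/dev/stderr, we can duplicate the descriptors that are already open (a
sketch of the idea, not the exact lines changed):

```
# >&1 and >&2 duplicate the already-open descriptors, so the restricted
# /dev/std* files are never touched
echo "progress message" >&1
echo "error message" >&2
```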
Signed-off-by: Auke Kok <auke.kok@versity.com>
Our local fence script interprets errors executing `findmnt` as
critical errors, but the program explicitly exits with EXIT_FAILURE
when the total number of matching mount entries is zero.
This can happen if the mount disappeared while we're attempting to
fence it but the scoutfs sysfs files are still in place as we read
them. It's a small window, but it's a fork/exec plus a full parse of
/etc/fstab, and a lot can happen in the 0.015s findmnt takes on my
system.
findmnt has no exit codes other than 0 and 1. At that point, we can
only assume that if the stdout is empty, the mount isn't there
anymore.
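So the script keys off the output instead of the exit code, something
like this ($dev and what we do when the mount is gone are assumptions):

```
# findmnt exits 1 both for real errors and for "no matching mounts"
target=$(findmnt -n -o TARGET "$dev")
if [ -z "$target" ]; then
        # empty output: the mount isn't there anymore, nothing to unmount
        exit 0
fi
```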
Signed-off-by: Auke Kok <auke.kok@versity.com>
Tests that cause client retries can fail with this error
from server_commit_log_merge():
error -2 committing log merge: getting merge status item
This can happen if the server has already committed and resolved
the log merge that is being retried. We can safely ignore ENOENT here
just like we do a few lines later.
Signed-off-by: Chris Kirby <ckirby@versity.com>
The server's commit_log_trees has an error message that includes the
source of the error, but it's not used for all errors. The WARN_ON is
redundant with the message and is removed because it isn't filtered out
when we see errors from forced unmount.
Signed-off-by: Zach Brown <zab@versity.com>
The userspace fencing process wasn't careful about handling underlying
directories that disappear while it was working.
On the server/fenced side, fencing requests can linger after they've
been resolved by writing 1 to fenced or error. The script could come
back around to see the directory before the server finally removes it,
causing all later uses of the request dir to fail. We saw this in the
logs as a bunch of cat errors for the various request files.
On the local fence script side, all the mounts can be in the process
of being unmounted so both the /sys/fs dirs and the mount itself can
be removed while we're working.
For both, when we're working with the /sys/fs files we read them without
logging errors and then test that the dir still exists before using what
we read. When fencing a mount, we stop if findmnt doesn't find the
mount and then raise a umount error if the /sys/fs dir exists after
umount fails.
And while we're at it, we have each script's logging append instead
of truncating (if, say, it's a log file instead of an interactive
tty).
Signed-off-by: Zach Brown <zab@versity.com>
We're getting test failures from messages that our guests can be
unresponsive. They sure can be. We don't need to fail for this one
specific case.
Signed-off-by: Zach Brown <zab@versity.com>
Silence another error warning and assertion that assume the result of
the errors is going to be persistent. When we're forcing an unmount
we've severed storage and networking.
Signed-off-by: Zach Brown <zab@versity.com>
mmap_stress gets completely stalled in lock messaging, starving most
of the mmap_stress threads, which causes it to delay and even time out
in CI.
Instead of spawning threads over all 5 test nodes, we artificially
reduce it to spawning over only 2. This still does a good number of
operations on those nodes, and now the work is spread evenly across
the two nodes.
Additionally, I've added a minuscule (10ms) delay in between
operations that should hopefully be sufficient for other locking
attempts to settle and allow the threads to better spread the work.
This now shows that all the threads exit within < 0.25s on my test
machine, which is a lot better than the 40s variation that I was seeing
locally. Hopefully this fares better in CI.
Signed-off-by: Auke Kok <auke.kok@versity.com>
There's a scenario where mmap_stress gets enough resources that two
of the threads will starve the others, which then all take a very long
time catching up committing changes.
Because this test program didn't finish until all the threads had
completed a fixed amount of work, these threads essentially all ended
up tripping over each other. In CI this would exceed 6h+, while
originally I intended this to run in about 100s or so.
Instead, cap the run time to ~30s by default. If threads exceed
this time, they will immediately exit, which causes any clog in
contention between the threads to drain relatively quickly.
Signed-off-by: Auke Kok <auke.kok@versity.com>
Assembling a srch compaction operation creates an item and populates
it with allocator state. If allocation filling fails, it doesn't
cleanly unwind the allocation or undo the compaction item change, and
it issues a warning.
This warning isn't needed if the error shows that we're in forced
unmount. The inconsistent state won't be applied, it will be dropped on
the floor as the mount is torn down.
Signed-off-by: Zach Brown <zab@versity.com>
The log merging process is meant to provide parallelism across workers
in mounts. The idea is that the server hands out a bunch of concurrent
non-intersecting work that's based on the structure of the stable input
fs_root btree.
The nature of the parallel work (cow of the blocks that intersect a key
range) means that the ranges of concurrently issued work can't overlap
or the work will all cow the same input blocks, freeing that input
stable block multiple times. We're seeing this in testing.
Correctness was intended by having an advancing key that sweeps
sorted ranges. Duplicate ranges would never be hit as the key
advanced past each one it visited. This was broken by the mapping of
fs item keys to log merge tree keys, which clobbers the sk_zone key
value. It effectively interleaves the ranges of each zone in the fs
root (meta indexes, orphans, fs items). With just the right log merge
conditions, involving logged items in the right places and partially
completed work inserting remaining ranges behind the key, ranges can
be stored at mapped keys that end up out of order. The server
iterates over these and ends up issuing overlapping work, which
results in duplicated frees of the input blocks.
The fix, without changing the format of the stored log tree items, is to
perform a full sweep of all the range items and determine the next item
by looking at the full precision stored keys. This ensures that the
processed ranges always advance and never overlap.
Signed-off-by: Zach Brown <zab@versity.com>
Our xfstest's golden output includes the full set of tests we expect
to run when no args are specified. If we specify args then the set of
tests can change, and the test will always fail when it does.
This fixes that by having the test check the set of tests itself, rather
than relying on golden output. If args are specified then our xfstest
only fails if any of the executed xfstest tests failed. Without args,
we perform the same scraping of the check output and compare it against
the expected results ourselves.
It would have been a bit much to put that large file inline in the test
file, so we add a dir of per-test files in revision control. We can
also put the list of exclusions there.
We can also clean up the output redirection helper functions to make
them clearer. After xfstests has executed we want to redirect output
back to the compared output so that we can catch any unexpected output.
Signed-off-by: Zach Brown <zab@versity.com>
Add a little background function that runs during the test which
triggers a crash if it finds catastrophic failure conditions.
This is the second bg task we want to kill and we can only have one
function run on the EXIT trap, so we create a generic process killing
trap function.
We feed it the fenced pid as well. run-tests didn't log much of value
into the fenced log, and we're not logging the kills into it anymore,
so we just remove run-tests' fenced logging.
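A sketch of the generic trap (helper and task names are illustrative):

```
bg_pids=()

kill_bg_pids() {
        kill "${bg_pids[@]}" 2>/dev/null
        wait "${bg_pids[@]}" 2>/dev/null
}
trap kill_bg_pids EXIT

watch_for_catastrophe &         # the new crash-triggering background check
bg_pids+=($!)
run_fenced &                    # fenced, reaped by the same trap
bg_pids+=($!)
```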
Signed-off-by: Zach Brown <zab@versity.com>
Add an option to run-tests that runs each test a number of times in a
loop. Looping stops if the test doesn't pass.
Most of the change in the per-test execution is indenting as we add
the for loop block. The stats and kmsg output are lifted up before
the loop.
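The loop itself is simple (names are placeholders):

```
for i in $(seq 1 "$nr_loops"); do
        run_one_test "$test" || break   # stop looping on the first failure
done
```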
Signed-off-by: Zach Brown <zab@versity.com>