Errors from lock server calls typically shut the server down.
During normal unmount a client's locks are reclaimed before the
connection is disconnected. The lock server won't try to send to
unmounting clients.
Clients whose connections time out can cause ENOTCONN errors. Their
connection is freed before they're fenced and their locks are reclaimed.
The server can try to send to a disconnected client for one of its locks
and get a send error.
These errors shouldn't shut down the server. The client is either going
to be fenced and have the locks reclaimed, ensuring forward progress, or
the server is going to shut down if it can't fence.
This was seen in testing as multiple clients were timed out.
Signed-off-by: Zach Brown <zab@versity.com>
The fence-and-reclaim test runs a bunch of scenarios and makes sure that
the fence agent was run on the appropriate mount's rids.
Unfortunately the checks were racy. The check itself only looked at
the log once to see if the rid had been fenced. Each check had steps
before that would wait until the rid should have been fenced and could
be checked.
Those steps were racy. They'd do things like make sure a fence request
wasn't pending, but never waited for it to be created in the first
place. They'd falsely indicate that the log should be checked and when
the rid wasn't found in the log the test would fail. In logs of
failures we'd see that the rids were fenced after this test failed and
moved on to the next.
This simplifies the checks. It gets rid of all the intermediate steps
and just waits around for the rid to be fenced, with a timeout. This
avoids the flaky tests.
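
As a rough illustration of the simplified check, here is a minimal Python sketch of polling for one condition with a timeout. It is not the test's actual shell code: `wait_until`, `rid_fenced`, and the log line format are hypothetical names chosen for this example.

```python
import time


def wait_until(check, timeout=10.0, interval=0.1):
    """Poll check() until it returns True or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def rid_fenced(log_lines, rid):
    """Hypothetical check: has the fence agent logged this rid yet?"""
    return any(rid in line for line in log_lines)
```

The point of the shape is that there is only one condition to wait for, so there are no intermediate steps that can report "ready" too early.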
Signed-off-by: Zach Brown <zab@versity.com>
Lock invalidation has assertions for critical errors, but it doesn't
allow the synthetic errors that come from forced unmount severing the
client's connection to the world.
Signed-off-by: Zach Brown <zab@versity.com>
The net layer was initially built around send queue lists with the
presumption that there wouldn't be many messages in flight and that
responses would be sent roughly in order.
In the modern era, we can have tens of thousands of lock request messages
in flight. This led to O(n^2) processing in quite a few places as recv
processing searched for either requests to complete or responses to
free.
This adds messages to two rbtrees, indexing either requests by their id
or responses by their send sequence. Recv processing can find messages
in O(log n). This patch intends to be minimally disruptive. It's only
replacing the search of the send and resend queues in the recv path with
rbtrees. Other uses of the two queue lists are untouched.
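
The shape of the change can be sketched in Python, with loud caveats. Python's standard library has no rbtree, so a `dict` stands in for the keyed index here. A dict is a hash table, so its lookups are O(1) on average rather than O(log n). The class and method names are hypothetical and none of this is the kernel code.

```python
class SendQueue:
    """Queue of in-flight requests with a keyed index kept beside it."""

    def __init__(self):
        self.queue = []   # send order, still used by the other paths
        self.by_id = {}   # request id -> message, used only by recv

    def send(self, req_id, payload):
        msg = {"id": req_id, "payload": payload}
        self.queue.append(msg)
        self.by_id[req_id] = msg
        return msg

    def find_scan(self, req_id):
        # Old recv path: walk the whole queue looking for the id.
        for msg in self.queue:
            if msg["id"] == req_id:
                return msg
        return None

    def find_indexed(self, req_id):
        # New recv path: ask the index directly.
        return self.by_id.get(req_id)
```

With n messages in flight and each response needing a lookup, the scan costs O(n) per response and O(n^2) overall, which is the behavior the patch removes from the recv path.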
On a single node, with ~40k lock shrink attempts in flight, we go from
processing ~800 total request/grant request/response pairs per second to
~60,000 per second.
Signed-off-by: Zach Brown <zab@versity.com>
The use of the VM shrinker was a bad fit for locks. Shrinking a lock
requires a round trip with the server to request a null mode. The VM
treats the locks like a cache, as expected, which leads to huge amounts
of locks accumulating and then being shrunk in bulk. This creates a
huge backlog of locks making their way through the network conversation
with the server that implements invalidating to a null mode and freeing.
It starves other network and lock processing, possibly for minutes.
This removes the VM shrinker and instead introduces an option that sets
a limit on the number of idle locks. As the number of locks exceeds the
count we only try to free the oldest lock at each lock call. This
results in a lock freeing pace that is proportional to the allocation of
new locks by callers and so is throttled by the work done while callers
hold locks. It avoids the bulk shrinking of 10s of thousands of locks
that we see in the field.
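
A minimal Python sketch of the pacing idea, under stated assumptions: the class, its fields, and the `freed` list are hypothetical, and the real freeing involves a round trip with the server that is not modeled here.

```python
from collections import OrderedDict


class LockCache:
    def __init__(self, idle_limit):
        self.idle_limit = idle_limit
        self.idle = OrderedDict()  # idle locks, oldest first
        self.users = {}            # lock name -> active holder count
        self.freed = []            # names handed off for freeing

    def lock(self, name):
        # Reuse an idle lock if there is one, otherwise create it.
        self.idle.pop(name, None)
        self.users[name] = self.users.get(name, 0) + 1
        # At most one free attempt per lock call, oldest idle lock first.
        if len(self.idle) > self.idle_limit:
            oldest, _ = self.idle.popitem(last=False)
            self.freed.append(oldest)

    def unlock(self, name):
        self.users[name] -= 1
        if self.users[name] == 0:
            del self.users[name]
            self.idle[name] = True  # becomes the newest idle lock
```

Because only one lock is freed per call, the idle count can sit slightly above the limit, but it can never build up into a bulk backlog the way the shrinker allowed.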
Signed-off-by: Zach Brown <zab@versity.com>
We observe that unmount in this test can consume up to 10sec of time
before proceeding to record heartbeat timeout elections by followers.
When this happens, elections and new leaders happen before unmount even
completes. This indicates that heartbeat packets from the unmount
cease immediately, but the unmount takes longer doing other things.
The timeouts then trigger, possibly during the unmount.
The result is that with timeouts of 3 seconds, we're not actually
waiting for an election at all. It already happened 7 seconds ago. The
code here just "sees" that it happens a few hundred ms after it started
looking for it.
There are a few ways to fix this. We could record the actual timestamp
of the election, and compare it with the actual timestamp of the last
heartbeat packet. This would be conclusive, and could disregard any
complication from umount taking too long. But it also means adding
timestamping in various places, or having to rely on tcpdump with packet
processing.
We can't just record $start before unmount. We would still violate the
part of the test that checks that elections didn't happen too late,
especially in the 3sec test case if unmount takes 10sec.
The simplest solution is to unmount in a bg thread, and circle around
later to `wait` for it to ensure we can re-mount without ill effect.
Signed-off-by: Auke Kok <auke.kok@versity.com>
Due to an iput race, the "unlink wait for open on other mount"
subtest can fail. If the unlink happens inline, then the test
passes. But if the orphan scanner has to complete the unlink
work, it's possible that there won't be enough log merge work
for the scanner to do the cleanup before we look at the seq index.
Add SCOUTFS_TRIGGER_LOG_MERGE_FORCE_FINALIZE_OURS, to allow
forcing a log merge. Add new counters, log_merges_start and
log_merge_complete, so that tests can see that a merge has happened.
Then we have to wait for the orphan scanner to do its work.
Add a new counter, orphan_scan_empty, that increments each time
the scanner walks the entire inode space without finding any
orphans. Once the test sees that counter increment, it should be
safe to check the seq index and see that the unlinked inode is gone.
Signed-off-by: Chris Kirby <ckirby@versity.com>
The /sys/kernel/debug/tracing/buffer_size_kb file always reads as
"7 (expanded: 1408)". So the -T option to run-test.sh won't work,
because it tries to multiply that string by the given factor.
It always defaults to 1408 on every platform we currently support.
Just use that value so we can specify -T in CI runs.
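
To make the failure concrete, here is a small Python sketch. The function names are hypothetical and the real script is shell. It only shows that the file's contents are not a number, and what using the fixed default looks like.

```python
DEFAULT_BUFFER_SIZE_KB = 1408  # the value quoted in the message above


def scaled_buffer_size_kb(factor):
    """Scale the fixed default instead of the file's contents."""
    return DEFAULT_BUFFER_SIZE_KB * factor


def naive_scaled(raw, factor):
    # Multiplying the raw file contents: int() rejects the string.
    return int(raw) * factor
```

Calling `naive_scaled("7 (expanded: 1408)", 2)` raises `ValueError`, while `scaled_buffer_size_kb(2)` gives 2816.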
Signed-off-by: Chris Kirby <ckirby@versity.com>
A few callers of alloc_move_empty in the server were providing a budget
that was too small. Recent changes to extent_mod_blocks increased the
max budget that is necessary to move extents between btrees. The
existing WAG of 100 was too small for trees of height 2 and 3. This
caused looping in production.
We can increase the move budget to half the overall commit budget, which
leaves room for a height of around 7 each. This is much greater than we
see in practice because the size of the per-mount btrees is effectively
limited by both watermarks and thresholds to commit and drain.
Signed-off-by: Zach Brown <zab@versity.com>
It's possible to trigger the block device autoloading mechanism
with a mknod()/stat(), and this mechanism has long been declared
obsolete, thus triggering a dmesg warning since el9_7, which then
fails the test. You may need to `rmmod loop` to reproduce.
Avoid this by not triggering a loop autoload - we just make a
different blockdev. Choosing `42` here should avoid any autoload
mechanism as this number is explicitly for demo drivers and should
never trigger an autoload.
We also just ignore the warning line in dmesg. Other tests can and
might perhaps still trigger this, as well as background noise running
during the test.
Signed-off-by: Auke Kok <auke.kok@versity.com>
Use of T_M0 and variants should be reserved for e.g. scoutfs
<subcommand> -p <mountpoint> type of usages. Tests should create
individual content files in the assigned subdirectory.
Signed-off-by: Auke Kok <auke.kok@versity.com>
The tap output file was not yet complete as it failed to include
the contents of `status.msg`. In a few cases, that would mean it
lacks important context.
Signed-off-by: Auke Kok <auke.kok@versity.com>
Bash has special handling for these standard IO files, but
there are cases where customers have special restrictions set
on them. Likely to avoid leaking error data out of system logs
as part of IDS software.
In any case, we can just reopen existing file descriptors here
in both these cases to avoid this entirely. This will always
work.
Signed-off-by: Auke Kok <auke.kok@versity.com>
Our local fence script attempts to interpret errors executing `findmnt`
as critical errors, but the program exit code explicitly returns
EXIT_FAILURE when the total number of matching mount entries is zero.
This can happen if the mount disappeared while we're attempting to
fence the mount but the scoutfs sysfs files are still in place as
we read them. It's a small window, but it's a fork/exec plus a full
parse of /etc/fstab, and a lot can happen in the 0.015s findmnt takes
on my system.
findmnt has no exit codes other than 0 and 1. At that
point, we can only assume that if the stdout is empty, the mount
isn't there anymore.
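
The decision described above can be sketched in Python. The function name and the sample mount path are hypothetical, and the real fence script is shell. The exit code can't distinguish "no match" from other failures, so only stdout is consulted.

```python
def mount_is_gone(stdout):
    """Treat empty findmnt output as 'the mount no longer exists'."""
    return stdout.strip() == ""


def fence_action(stdout):
    # Nothing to unmount if findmnt printed nothing.
    return "skip" if mount_is_gone(stdout) else "umount"
```

For example, `fence_action("")` yields `"skip"` and `fence_action("/mnt/test.0\n")` yields `"umount"`.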
Signed-off-by: Auke Kok <auke.kok@versity.com>
Tests that cause client retries can fail with this error
from server_commit_log_merge():
    error -2 committing log merge: getting merge status item
This can happen if the server has already committed and resolved
the log merge that is being retried. We can safely ignore ENOENT here
just like we do a few lines later.
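
For readers decoding the message, -2 is -ENOENT. A minimal Python sketch of the filtering follows. The helper name is hypothetical and this is not the server code. It only shows a retried commit treating "already gone" as success.

```python
import errno


def filter_retry_error(err):
    """Treat -ENOENT from a retried commit as success; pass others on."""
    if err == -errno.ENOENT:
        return 0
    return err
```

Other errors, such as `-errno.EIO`, are returned unchanged so they are still reported.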
Signed-off-by: Chris Kirby <ckirby@versity.com>
The server's commit_log_trees has an error message that includes the
source of the error, but it's not used for all errors. The WARN_ON is
redundant with the message and is removed because it isn't filtered out
when we see errors from forced unmount.
Signed-off-by: Zach Brown <zab@versity.com>
The userspace fencing process wasn't careful about handling underlying
directories that disappear while it was working.
On the server/fenced side, fencing requests can linger after they've
been resolved by writing 1 to fenced or error. The script could come
back around to see the directory before the server finally removes it,
causing all later uses of the request dir to fail. We saw this in the
logs as a bunch of cat errors for the various request files.
On the local fence script side, all the mounts can be in the process of
being unmounted so both the /sys/fs dirs and the mount itself can be
removed while we're working.
For both, when we're working with the /sys/fs files we read them without
logging errors and then test that the dir still exists before using what
we read. When fencing a mount, we stop if findmnt doesn't find the
mount and then raise a umount error if the /sys/fs dir exists after
umount fails.
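
The read-then-verify pattern can be sketched in Python. The function name and file layout are hypothetical and the real scripts are shell. The read is done quietly, and the result is only trusted if the directory still exists afterwards.

```python
import os


def read_request_file(req_dir, name):
    """Return stripped file contents, or None if the dir went away."""
    try:
        with open(os.path.join(req_dir, name)) as f:
            value = f.read().strip()
    except OSError:
        value = None  # read quietly; don't log the failure
    if not os.path.isdir(req_dir):
        return None   # dir vanished, discard whatever we read
    return value
```

Checking the directory after the read, rather than before it, is what closes the window where the directory disappears partway through.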
And while we're at it, we have each script's logging append instead of
truncating (if, say, it's a log file instead of an interactive tty).
Signed-off-by: Zach Brown <zab@versity.com>
We're getting test failures from messages that our guests can be
unresponsive. They sure can be. We don't need to fail for this one
specific case.
Signed-off-by: Zach Brown <zab@versity.com>
Silence another error warning and assertion that's assuming that the
result of the errors is going to be persistent. When we're forcing an
unmount we've severed storage and networking.
Signed-off-by: Zach Brown <zab@versity.com>