Adds an accompanying option to set a data preallocation minimum
threshold value. The value can be set through sysfs or at mount
time.
data_prealloc_blocks_min cannot be larger than data_prealloc_blocks,
and this is enforced. This should be fine for all common use
cases, since the _min option is expected to stay below 2048,
the default of data_prealloc_blocks.
Extra test cases are added to validate bad mount option values and
sysfs value writes, as well as to verify that the minimum
threshold is set and honored as expected.
Preallocation scales with the scoutfs_get_inode_onoff() online
value, so that each new extent doubles the online size until the
allocation reaches data_prealloc_blocks. The _onoff() value is
only fetched once if possible.
Signed-off-by: Auke Kok <auke.kok@versity.com>
Add a reclaim_skip_finalize trigger that prevents reclaim from
setting FINALIZED on log_trees entries. The test arms this trigger,
force-unmounts a client to create an orphan, and verifies the log
merge succeeds without timeout and the orphan reclaim message
appears in dmesg.
Signed-off-by: Auke Kok <auke.kok@versity.com>
An unfinalized log_trees entry whose rid is not in mounted_clients
is an orphan left behind by incomplete reclaim. Previously this
permanently blocked log merges because the finalize loop treated it
as an active client that would never commit.
Call reclaim_open_log_tree for orphaned rids before starting a log
merge. Once reclaimed, the existing merge and freeing paths include
them normally.
Also skip orphans in get_stable_trans_seq so their open transaction
doesn't artificially lower the stable sequence.
Signed-off-by: Auke Kok <auke.kok@versity.com>
Basic testing for the punch-offline ioctl code. The tests consist of a
set of negative tests making sure that expressly disallowed operations
fail, followed by known-outcome tests that punch holes in several
patterns and verify the results.
Signed-off-by: Auke Kok <auke.kok@versity.com>
A minimal punch_offline ioctl wrapper. Argument style is adopted from
stage/release.
Following the stage/release option syntax, this calls the
punch-offline ioctl, punching any offline extents within the range
designated by offset and length.
Signed-off-by: Auke Kok <auke.kok@versity.com>
Add an archive layer ioctl for converting offline extents into sparse
extents without relying on or modifying data_version. This is helpful
when working with files with very large sparse regions.
Signed-off-by: Zach Brown <zab@versity.com>
Signed-off-by: Auke Kok <auke.kok@versity.com>
The initialization here fails to clear __pad[], which then leaks
to disk. Use a struct initializer so the pad is zeroed.
Signed-off-by: Auke Kok <auke.kok@versity.com>
The allocation here currently leaks uninitialized memory through
__pad[7], which is written to disk. Use a struct initializer to
enforce zeroing the pad. The name member is written immediately after.
Signed-off-by: Auke Kok <auke.kok@versity.com>
The caller sends the return value of this inline function as a u8.
If we return -EINVAL (-22), it truncates to 234, which is outside
our enum range. Assume this was meant to return
SCOUTFS_NET_ERR_EINVAL, which is a defined constant.
Signed-off-by: Auke Kok <auke.kok@versity.com>
These boolean checks are all mutually exclusive, so with the
negation the check always succeeds. It needs to use || instead
of &&.
Signed-off-by: Auke Kok <auke.kok@versity.com>
The exact same two lines are repeated here. That suggests an
additional check may have been intended, but as far as I can see
there is nothing left that needs checking.
Signed-off-by: Auke Kok <auke.kok@versity.com>
This setup function always returned 0, even on error, causing
initialization to continue despite the error.
Signed-off-by: Auke Kok <auke.kok@versity.com>
This test regularly fails here because the grep pattern is unanchored
and can match inodes ending in the same digits as the one we're
looking for. Make it use the same awk pattern used below.
Signed-off-by: Auke Kok <auke.kok@versity.com>
The fix added in v1.26-17-gef0f6f8a does a good job of avoiding the
intermittent test failures in the section it was added to. The remote
unlink section could use it as well, as it suffers from the same
intermittent failures.
Signed-off-by: Auke Kok <auke.kok@versity.com>
This file was put into $CWD by the test scripts for no good
reason. Presumably $seqres was supposed to be set before these
writes happened. For now, just write them to the test temp folder.
Signed-off-by: Auke Kok <auke.kok@versity.com>
In el9.6, the kernel VFS no longer goes through xattr handlers to
retrieve ACLs, but instead calls the FS driver's .get_{inode_}acl
method. In the initial compat version we hooked up .get_acl, given
the identical name that was used in the past.
However, this results in caching issues, as was encountered by customers
and exposed in the added test case `basic-acl-consistency`. The result
is that some group ACL entries may appear randomly missing. Dropping
caches may temporarily fix the issue.
The root cause of the issue is that the VFS now has two separate
paths to retrieve ACLs from the FS driver, and they have conflicting
implications for caching. `.get_acl` is purely meant for filesystems
like overlay/ecryptfs, which are fully passthrough and where no
caching should ever occur. Filesystems with dentries (i.e. all normal
filesystems) should not expose this interface, and should instead
expose the .get_inode_acl method. And indeed, in introducing the new
interface, the upstream kernel converts all but a few filesystems to
use .get_inode_acl().
The functional change in the driver is to detach KC_GET_ACL_DENTRY and
introduce KC_GET_INODE_ACL to handle the new (and required) interface.
KC_SET_ACL_DENTRY is detached as well, since it corresponds to a
different changeset in the kernel and the two are best kept separate.
Signed-off-by: Auke Kok <auke.kok@versity.com>
This test case is used to detect and reproduce a customer issue we're
seeing where the new .get_acl() method API and underlying changes in
el9_6+ are causing ACL cache fetching to return inconsistent results,
which shows as missing ACLs on directories.
This particular sequence is consistent enough that it warrants making
it into a specific test.
Signed-off-by: Auke Kok <auke.kok@versity.com>
Add mkfs/mount/umount helpers that handle the basics of making,
mounting, and unmounting scratch devices. The mount/unmount helpers
create "$T_MSCR", which lives in "$T_TMPDIR".
Signed-off-by: Auke Kok <auke.kok@versity.com>
There have been a few failures where output is generated just before we
crash but it didn't have a chance to be written. Add a best-effort
background sync before crashing. There's a good chance it'll hang if
the system is stuck, so we don't wait for it directly, just for some
time to pass.
Signed-off-by: Zach Brown <zab@versity.com>
Errors from lock server calls typically shut the server down.
During normal unmount a client's locks are reclaimed before the
connection is disconnected. The lock server won't try to send to
unmounting clients.
Clients whose connections time out can cause ENOTCONN errors. Their
connection is freed before they're fenced and their locks are reclaimed.
The server can try to send to such a disconnected client for a lock
and get a send error.
These errors shouldn't shut down the server. The client is either going
to be fenced and have its locks reclaimed, ensuring forward progress,
or the server is going to shut down if it can't fence.
This was seen in testing as multiple clients were timed out.
Signed-off-by: Zach Brown <zab@versity.com>
The fence-and-reclaim test runs a bunch of scenarios and makes sure that
the fence agent was run on the appropriate mount's rids.
Unfortunately the checks were racy. The check itself only looked at
the log once to see if the rid had been fenced. Each check had steps
before that would wait until the rid should have been fenced and could
be checked.
Those steps were racy. They'd do things like make sure a fence request
wasn't pending, but never waited for it to be created in the first
place. They'd falsely indicate that the log should be checked and when
the rid wasn't found in the log the test would fail. In logs of
failures we'd see that the rids were fenced after this test failed and
moved on to the next.
This simplifies the checks. It gets rid of all the intermediate steps
and just waits around for the rid to be fenced, with a timeout. This
avoids the flaky tests.
Signed-off-by: Zach Brown <zab@versity.com>
Lock invalidation has assertions for critical errors, but it doesn't
allow the synthetic errors that come from forced unmount severing the
client's connection to the world.
Signed-off-by: Zach Brown <zab@versity.com>
The net layer was initially built around send queue lists with the
presumption that there wouldn't be many messages in flight and that
responses would be sent roughly in order.
In the modern era, we can have tens of thousands of lock request
messages in flight. This led to O(n^2) processing in quite a few
places as recv processing searched for either requests to complete
or responses to free.
This adds messages to two rbtrees, indexing either requests by their id
or responses by their send sequence. Recv processing can find messages
in O(log n). This patch intends to be minimally disruptive. It's only
replacing the search of the send and resend queues in the recv path with
rbtrees. Other uses of the two queue lists are untouched.
On a single node, with ~40k lock shrink attempts in flight, we go from
processing ~800 total request/grant request/response pairs per second to
~60,000 per second.
Signed-off-by: Zach Brown <zab@versity.com>
The use of the VM shrinker was a bad fit for locks. Shrinking a lock
requires a round trip with the server to request a null mode. The VM
treats the locks like a cache, as expected, which leads to huge numbers
of locks accumulating and then being shrunk in bulk. This creates a
huge backlog of locks making their way through the network conversation
with the server that implements invalidating to a null mode and freeing.
It starves other network and lock processing, possibly for minutes.
This removes the VM shrinker and instead introduces an option that sets
a limit on the number of idle locks. Once the number of locks exceeds
the limit, we try to free the oldest lock at each lock call. This
results in a lock freeing pace that is proportional to the allocation of
new locks by callers and so is throttled by the work done while callers
hold locks. It avoids the bulk shrinking of 10s of thousands of locks
that we see in the field.
Signed-off-by: Zach Brown <zab@versity.com>
We observe that unmount in this test can consume up to 10 seconds,
during which followers already record heartbeat timeout elections.
When this happens, elections and new leaders happen before the
unmount even completes. This indicates that heartbeat packets from
the unmounting node cease immediately, but the unmount takes longer
doing other things. The timeouts then trigger, possibly during the
unmount.
The result is that with timeouts of 3 seconds, we're not actually
waiting for an election at all. It already happened 7 seconds ago. The
code here just "sees" that it happens a few hundred ms after it started
looking for it.
There are a few ways to approach this fix. We could record the actual
timestamp of the election and compare it with the actual timestamp of
the last heartbeat packet. This would be conclusive and would
disregard any complication from umount taking too long. But it also
means adding timestamping in various places, or having to rely on
tcpdump with packet processing.
We can't just record $start before unmount. We'd still violate the
part of the test that checks that elections didn't happen too late,
especially in the 3sec test case if unmount takes 10sec.
The simplest solution is to unmount in a background thread and circle
around later to `wait` for it, ensuring we can re-mount without ill
effect.
Signed-off-by: Auke Kok <auke.kok@versity.com>
Due to an iput race, the "unlink wait for open on other mount"
subtest can fail. If the unlink happens inline, then the test
passes. But if the orphan scanner has to complete the unlink
work, it's possible that there won't be enough log merge work
for the scanner to do the cleanup before we look at the seq index.
Add SCOUTFS_TRIGGER_LOG_MERGE_FORCE_FINALIZE_OURS, to allow
forcing a log merge. Add new counters, log_merges_start and
log_merge_complete, so that tests can see that a merge has happened.
Then we have to wait for the orphan scanner to do its work.
Add a new counter, orphan_scan_empty, that increments each time
the scanner walks the entire inode space without finding any
orphans. Once the test sees that counter increment, it should be
safe to check the seq index and see that the unlinked inode is gone.
Signed-off-by: Chris Kirby <ckirby@versity.com>
The /sys/kernel/debug/tracing/buffer_size_kb file always reads as
"7 (expanded: 1408)". So the -T option to run-test.sh won't work,
because it tries to multiply that string by the given factor.
It always defaults to 1408 on every platform we currently support.
Just use that value so we can specify -T in CI runs.
Signed-off-by: Chris Kirby <ckirby@versity.com>