Commit Graph

2113 Commits

Author SHA1 Message Date
Zach Brown
ead8be6b8c Have xfstest pass when using args
The xfstests's golden output includes the full set of tests we expect to
run when no args are specified.  If we specify args then the set of
tests can change and the test will always fail when they do.

This fixes that by having the test check the set of tests itself, rather
than relying on golden output.  If args are specified then our xfstest
only fails if any of the executed xfstest tests failed.  Without args,
we perform the same scraping of the check output and compare it against
the expected results ourself.

It would have been a bit much to put that large file inline in the test
file, so we add a dir of per-test files in revision control.  We can
also put the list of exclusions there.

We can also clean up the output redirection helper functions to make
them more clear.  After xfstests has executed we want to redirect output
back to the compared output so that we can catch any unexpected output.

Signed-off-by: Zach Brown <zab@versity.com>
2025-11-06 16:03:46 -08:00
Zach Brown
ae84271b37 Add crash monitor to run-tests
Add a little background function that runs during the test which
triggers a crash if it finds catastrophic failure conditions.

This is the second bg task we want to kill and we can only have one
function run on the EXIT trap, so we create a generic process killing
trap function.

We feed it the fenced pid as well.  run-tests didn't log much of value
into the fenced log, and we're not logging the kills into anymore, so we
just remove run-tests fenced logging.

Signed-off-by: Zach Brown <zab@versity.com>
2025-11-06 12:11:46 -08:00
Zach Brown
23aaa994df Add -l to run-tests for looping over tests
Add an option to run-tests to have it loop over each test that will be
run a number of times.  Looping stops if the test doesn't pass.

Most of the change in the per-test execution is indenting as we add the
for loop block.  The stats and kmsg output are lifted up before of the
loop.

Signed-off-by: Zach Brown <zab@versity.com>
2025-11-06 12:07:42 -08:00
Zach Brown
7d14b57b2d Export PATH once in run-tests
Might as well just export the PATH once as we change it, no need to
export it in every test iteration.

Signed-off-by: Zach Brown <zab@versity.com>
2025-11-06 11:02:38 -08:00
Zach Brown
3f252be4be Merge pull request #241 from versity/auke/waiter_err_data_version_obsolete
Ignore data_version in scoutfs_ioc_data_wait_err.
2025-11-04 10:09:57 -08:00
Auke Kok
a4d25d9b55 Ignore data_version in scoutfs_ioc_data_wait_err.
The data_wait_err ioctl currently requires the correct data_version
for the inode to be passed in, or else the ioctl returns -ESTALE. But
the ioctl itself is just a passthrough mechanism for notifying data
waiters, which doesn't involve the data_version at all.

Instead, we can just drop checking the value. The field remains in the
headers, but we've marked it as being ignored from now on. The reason
for the change is documented in the header file as well.

This all is a lot simpler than having to modify/rev the data_waiters
interface to support passing back the data_version, because there isn't
any space left to easily do this, and then userspace would just pass it
back to the data_wait_err ioctl.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-10-31 12:24:03 -04:00
Zach Brown
79cd25f693 Merge pull request #255 from versity/zab/compact_error_block_leak
Don't leak alloc blocks on srch compact error
2025-10-31 09:08:17 -07:00
Zach Brown
f2646130ae Don't leak alloc blocks on srch compact error
scoutfs_alloc_prepare_commit() is badly named.  All it really does is
put the references to the two dirty alloc list blocks in the allocator.
It must allways be called if allocation was attempted, but it's easier
to require that it always be paired with _alloc_init().

If the srch compaction worker in the client sees an error it will send
the error back to the server without writing its dirty blocks.  In
avoiding the write it also avoided putting the two block references,
leading to leaked blocks.  We've been seeing rare messages with leaked
blocks in tests.

Signed-off-by: Zach Brown <zab@versity.com>
2025-10-30 14:47:18 -07:00
Zach Brown
1c66f9a9a5 Merge pull request #227 from versity/auke/el96
RHEL 9.6 support.
2025-10-30 13:39:05 -07:00
Auke Kok
afb6ba00ad POSIX ACL changes.
The .get_acl() method now gets passed a mnt_idmap arg, and we can now
choose to implement either .get_acl() or .get_inode_acl(). Technically
.get_acl() is a new implementation, and .get_inode_acl() is the old.
That second method now also gets an rcu flag passed, but we should be
fine either way.

Deeper under the covers however we do need to hook up the .set_acl()
method for inodes, otherwise setfacl will just fail with -ENOTSUPP. To
make this not super messy (it already is) we tack on the get_acl()
changes here.

This is all roughly ca. v6.1-rc1-4-g7420332a6ff4.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-10-30 13:59:44 -04:00
Auke Kok
29e486e411 All vfs methods now take a mnt_idmap instead of user_namespace arg.
Similar to before when namespaces were added, they are now translated to
a mnt_idmap, since v6.2-rc1-2-gabf08576afe3.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-10-30 13:58:34 -04:00
Zach Brown
8f3177fe33 Merge pull request #254 from versity/zab/shrink_cleanup
Zab/shrink cleanup
2025-10-30 08:56:33 -07:00
Zach Brown
419079e606 Merge pull request #239 from versity/auke/keepalive
Add tcp_keepalive_timeout_ms option, change default to 60s
2025-10-29 17:15:17 -07:00
Zach Brown
6a70ee03b5 Dump block alloc stacks for leaked blocks
The typical pattern of spinning isolating a list_lru results in a
livelock if there are blocks with leaked refcounts.  We're rarely seeing
this in testing.

We can have a modest array in each block that records the stack of the
caller that initially allocated the block and dump that stack for any
blocks that we're unable to shrink/isolate.  Instead of spinning
shrinking, we can give it a good try and then print the blocks that
remain and carry on with unmount, leaking a few blocks.  (Past events
have had 2 blocks.)

Signed-off-by: Zach Brown <zab@versity.com>
2025-10-29 16:16:58 -07:00
Zach Brown
38a2ffe0c7 Add stacktrace kernelcompat
Signed-off-by: Zach Brown <zab@versity.com>
2025-10-29 10:12:52 -07:00
Zach Brown
4b41cf9789 Centralize port numbers and avoid ephemeral
The tests were using high ephemeral port numbers for the mount server's
listening port.  This caused occasional failure if the client's
ephemeral ports happened to collide with the ports used by the tests.

This ports all the port number configuration in one place and has a
quick check to make sure it doesn't wander into the current ephemeral
range.  Then it updates all the tests to use the chosen ports.

Signed-off-by: Zach Brown <zab@versity.com>
2025-10-29 10:12:52 -07:00
Zach Brown
102899290e Allow harmless srch compact commit errors
The server's srch commit error warnings were a bit severe.  The
compaction operations are a function of persistent state.  If they fail
then the inputs still exist and the next attempt will retry whatever
failed.  Not all errors are a problem, only those that result in partial
commits that leave inconsistent state.

In particular, we have to support the case where a client retransmits a
compaction request to a new server after a first server performed the
commit but couldn't respond.  Throwing warnings when the new server gets
ENOENT looking for the busy compaction item isn't helpful.  This came in
tests as background compaction was in flight as tests unmounted and
mounted servers repeatedly to test lock recovery.

Signed-off-by: Zach Brown <zab@versity.com>
2025-10-29 10:12:52 -07:00
Zach Brown
89387fb192 Use list_lru for block cache shrinking
The block cache had a bizarre cache eviction policy that was trying to
avoid precise LRU updates at each block.  It had pretty bad behaviour,
including only allowing reclaim of maybe 20% of the blocks that were
visited by the shrinker.

We can use the existing list_lru facility in the kernel to do a better
job.  Blocks only exhibit contention as they're allocated and added to
per-node lists.  From then on we only set accessed bits and the private
list walkers move blocks around on the list as we see the accessed bits.
(It looks more like a fifo with lazy promotion than a "LRU" that is
actively moving list items around as they're accessed.)

Using the facility means changing how we remove blocks from the cache
and hide them from lookup.  We clean up the refcount inserted flag a bit
to be expressed more as a base refcount that can be acquired by
whoever's removing from the cache.  It seems a lot clearer.

Signed-off-by: Zach Brown <zab@versity.com>
2025-10-29 10:12:52 -07:00
Zach Brown
8b6418fb79 Add kernelcompat for list_lru
Add kernelcompat helpers for initial use of list_lru for shrinking.  The
most complicated part is the walk callback type changing.

Signed-off-by: Zach Brown <zab@versity.com>
2025-10-29 10:12:52 -07:00
Zach Brown
206c24c41f Retry stale item reads instead of stopping reclaim
Readers can read a set of items that is stale with respect to items that
were dirtied and written under a local cluster lock after the read
started.

The active reader machanism addressed this by refusing to shrink pages
that could contain items that were dirtied while any readers were in
flight.  Under the right circumstances this can result in refusing to
shrink quite a lot of pages indeed.

This changes the mechanism to allow pages to be reclaimed, and instead
forces stale readers to retry.  The gamble is that reads are much faster
than writes.  A small fraction should have to retry, and when they do
they can be satisfied by the block cache.

Signed-off-by: Zach Brown <zab@versity.com>
2025-10-29 10:12:52 -07:00
Auke Kok
f67462750b Add tcp_keepalive_timeout_ms option, change default to 60s
The default TCP keepalive value is currently 10s, resulting in clients
being disconnected after 10 seconds of not replying to a TCP keepalive
packet. These keepalive values are reasonable most of the times, but
we've seen client disconnects where this timeout has been exceeded,
resulting in fencing. The cause for this is unknown at this time, but it
is suspected that network intermissions are happening.

This change adds a configurable value for this specific client socket
timeout. It enforces that its value is above UNRESPONSIVE_PROBES, whose
value remains unchanged.

The default value of 10000ms (10s) is changed to 60s. This is the value
we're assuming is much better suited for customers and has been briefly
trialed, showing that it may help to avoid network level interruptions
better.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-10-28 18:45:43 -04:00
Zach Brown
fd8aaa0810 Merge pull request #205 from versity/auke/scar
Changes from static analysis.
2025-10-27 16:06:32 -07:00
Auke Kok
a5dbe7f286 Don't set ret = -ENOMEM and immediately overwrite.
It's possible that scoutfs_net_alloc_conn() fails due to -ENOMEM, which
is legitimately a failure, thus the code here releases the sock again.

But the code block here sets `ret = ENOMEM` and then restarts the loop,
which immediately sets `ret = kernel_accept()`, thus overwriting the
-ENOMEM error value.

We can argue that an ENOMEM error situation here is not catastrophical.
If this is the first that we're ever receiving an ENOMEM situation here
while trying to accept a new client, we can just release the socket and
wait for the client to try again. If the kernel at that point still is
out of memory to handle the new incoming connection, that will then
cascade down and clean up the while listener at that point.

The alternative is to let this error path unwind out and break down the
listener immediately, something the code today doesn't do. We're keeping
the behavior therefore the same.

I've opted therefore to replace the `ret = -ENOMEM` assignment with a
comment explaining why we're ignoring the error situation here.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-10-24 14:26:44 -04:00
Auke Kok
c1e89d597d Fix NULL dereference on error branch in handle_request.
If scoutfs_send_omap_response fails for any reason, req is NULL and we
would hit a hard NULL deref during unwinding.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-10-24 14:24:00 -04:00
Auke Kok
2c4316b096 Avoid uninitialized map, flags in ext.
This function returns a stack pointer to a struct scoutfs_extent, after
setting start, len to an extent found in the proper zone, but it leaves
map and flags members unset.

Initialize the struct to {0,} avoids passing uninitialized values up the
callstack.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-10-24 14:24:00 -04:00
Auke Kok
e704cd7074 Fix masking of EIO in compact_logs.
Several of the inconsistency error paths already correctly `goto out`
but this one has a `break`. This would result in doing a whole lot of
work on corrupted data.

Make this error path go to `out` instead as the others do.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-10-24 14:23:59 -04:00
Auke Kok
8c5b09aee8 Prevent masking away inconsistent state in search_sorted_file.
In these two error conditions we explicitly set `ret = -EIO` but then
`break` to set `ret = 0` immediately again, masking away a critical
error code that should be returned.

Instead, `goto out` retains the EIO error value for the caller.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-10-24 14:23:28 -04:00
Auke Kok
d2cd610c53 Fix return of uninit value.
The value of `ret` is not initialized. If the writeback list is empty,
or, if igrab() fails on the only inode on the list, the value
of `ret` is returned without being initialized. This would cause the
caller to needlessly have to retry, perhaps possibly make things worse.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-10-24 14:21:06 -04:00
Auke Kok
52563d3f73 Address double copy_to_user, possible 1-byte leak.
We shouldn't copy the entire _dirent struct and then copy in the name
again right after, just stop at offsetoff(struct, name).

Now that we're no longer copying the uninitialized name[3] from ent,
there is no more possible 1-byte leak here, too.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-10-24 14:21:06 -04:00
Auke Kok
4358d57f55 Avoid possible NULL deref on ENOMEM.
Assure that we reschedule even if this happens. Maybe it'll recover. If
not, we'll have other issues elsewhere first.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-10-24 14:21:06 -04:00
Auke Kok
021830ab04 If kzalloc fails, avoid NULL deref.
We still assign NULL to sbi->s_fs_info to aid checks in cleanup paths.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-10-24 14:21:06 -04:00
Auke Kok
3e63739711 plug df ioctl leaks.
The `type` member and padding are not initialized before being
copied to userspace.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-10-24 14:21:06 -04:00
Auke Kok
b25d8e8741 Plug super leak.
We accidentally could leak super here, so make sure to free it.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-10-24 14:21:06 -04:00
Auke Kok
4a9760afe0 Incorrect array_size test.
ARRAY_SIZE(...) will return `3` for this array with members from 0 to 2,
therefore arr[3] is out of bounds. The array length test is off by one
and needs fixing.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-10-24 14:21:06 -04:00
Zach Brown
33f6e9d0cd Merge pull request #248 from versity/auke/shuffle-tests
Add option to shuffle test order.
2025-10-23 09:33:34 -07:00
Zach Brown
f9780fc391 Merge pull request #253 from versity/zab/msg_rate_shrink
Zab/msg rate shrink
2025-10-23 09:32:25 -07:00
Zach Brown
aa8517d29b Remove msghdr iov_iter kernelcompat
This removes the KC_MSGHDR_STRUCT_IOV_ITER kernel compat.
kernel_{send,recv}msg() initializes either msg_iov or msg_iter.

This isn't a clean revert of "69068ae2 Initialize msg.msg_iter from
iovec." because previous patches fixed the order of arguments, and the
net send caller was removed.

Signed-off-by: Zach Brown <zab@versity.com>
2025-10-22 11:18:30 -07:00
Zach Brown
feae5757c4 Send messages in batches
Previous work had the receiver try to receive multiple messages in bulk.
This does the same for the sender.

We walk the send queue and initialize a vector that we then send with
one call.  This is intentionally similar to the single message sending
pattern to avoid unintended changes.

Along with the changes to recieve in bulk this ended up increasing the
message processing rate by about 6x when both send and receive were
going full throttle.

Signed-off-by: Zach Brown <zab@versity.com>
2025-10-22 11:18:30 -07:00
Zach Brown
e79086f381 Fix swapped sendmsg nr_segs/count
When the msg_iter compat was added the iter was initialized with nr_segs
and count swapped.  I'm not convinced this had any effect because the
kernel_{send,recv}msg() call would initialize msg_iter again with the
correct arguments.

Signed-off-by: Zach Brown <zab@versity.com>
2025-10-22 11:18:30 -07:00
Zach Brown
45e815bf76 Receive incoming messages in bulk
Our messaging layer is used for small control messages, not large data
payloads.  By calling recvmsg twice for every incoming message we're
hitting the socket lock reasonably hard.  With senders doing the same,
and a lot of messages flowing in each direction, the contention is
non-trivial.

This changes the receiver to copy as much of the incoming stream into a
page that is then framed and copied again into individual allocated
messages that can be processed concurrently.  We're avoiding contention
with the sender on the socket at the cost of additional copies of our
small messages.

Signed-off-by: Zach Brown <zab@versity.com>
2025-10-22 11:18:30 -07:00
Zach Brown
c313b71b2e Process client lock messages in ordered work
The lock client has a requirement that it can't handle some messages
being processed out of order.  Previously it had detected message
ordering itself, but had missed some cases.  Recieve processing was then
changed to always call lock message processing from the recv work to
globally order all lock messages.

This inline processing was contributing to excessive latencies in making
our way through the incoming receive queue, delaying work that would
otherwise be parallel once we got it off the recv queue.

This was seen in practice as a giant flood of lock shrink messages
arrived at the client.  It processed each in turn, starving a statfs
response long enough to trigger the hung task warning.

This fix does two things.

First, it moves ordered recv processing out of the recv work.  It lets
the recv work drain the socket quickly and turn it into a list that the
ordered work is consuming.  Other messages will have a chance to be
received and queued to their processing work without having to wait for
the ordered work to be processed.

Secondly, it adds parallelism to the ordered processing.  The incoming
lock messages don't need global ordering, they need ordering within each
lock.  We add an arbitrary but reasonable number of ordered workers and
hash lock messages to each worker based on the lock's key.

Signed-off-by: Zach Brown <zab@versity.com>
2025-10-22 11:18:30 -07:00
Zach Brown
0ecaceba14 Merge pull request #236 from versity/team/ci_green
Team/ci green
2025-10-22 11:05:08 -07:00
Chris Kirby
b4d8323750 Quorum message cleanup
Make sure to log an error if the SCOUTFS_QUORUM_EVENT_END
update_quorum_block() call fails in scoutfs_quorum_worker().

Correctly print if the reader or writer failed when logging errors
in update_quorum_block().

Signed-off-by: Chris Kirby <ckirby@versity.com>
2025-10-22 10:59:03 -07:00
Chris Kirby
aa48a8ccfc Generate sorted srch-safe entry pairs
During log compaction, the SRCH_COMPACT_LOGS_PAD_SAFE trigger was
generating inode numbers that were not in sorted order. This resulted
in later failures during srch-basic-functionality, because we were
winding up with out of order first/last pairs and merging incorrectly.

Instead, reuse the single entry in the block repeatedly, generating
zero-padded pairs of this entry that are interpreted as create/delete
and vanish during searching and merging. These aren't encoded in the
normal way, but the extra zeroes are ignored during the decoding phase.

Signed-off-by: Chris Kirby <ckirby@versity.com>
2025-10-22 10:59:03 -07:00
Chris Kirby
d277d7e955 Fix race condition in orphan-inodes test
Make sure that the orphan scanners can see deletions after forced unmounts
by waiting for reclaim_open_log_tree() to run on each mount; and waiting for
finalize_and_start_log_merge() to run and not find any finalized trees.

Do this by adding two new counters: reclaimed_open_logs and
log_merge_no_finalized and fixing the orphan-inodes test to check those
before waiting for the orphan scanners to complete.

Signed-off-by: Chris Kirby <ckirby@versity.com>
2025-10-22 10:59:03 -07:00
Chris Kirby
c72bf915ae Use ENOLINK as a special error code during forced unmount
Tests such as quorum-heartbeat-timeout were failing with EIO messages in
dmesg output due to expected errors during forced unmount. Use ENOLINK
instead, and filter all errors from dmesg with this errno (67).

Signed-off-by: Chris Kirby <ckirby@versity.com>
2025-10-22 10:58:44 -07:00
Auke Kok
c3e6f3cd54 Don't run format-version-forward-back on el8, either
This test compiles an earlier commit from the tree that is starting to
fail due to various changes on the OS level, most recently due to sparse
issues with newer kernel headers. This problem will likely increase
in the future as we add more supported releases.

We opt to just only run this test on el7 for now. While we could have
made this skip sparse checks that fail it on el8, it will suffice at
this point if this just works on one of the supported OS versions
during testing.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-10-15 17:35:17 -05:00
Zach Brown
c19280c83c Add cond_resched to iput worker
The iput worker can accumulate quite a bit of pending work to do.  We've
seen hung task warnings while it's doing its work (admitedly in debug
kernels).  There's no harm in throwing in a cond_resched so other tasks
get a chance to do work.

Signed-off-by: Zach Brown <zab@versity.com>
2025-10-15 17:35:17 -05:00
Chris Kirby
01847d9fb6 Add tracing for get_file_block() and scoutfs_ioc_search_xattrs().
Signed-off-by: Chris Kirby <ckirby@versity.com>
2025-10-15 17:35:17 -05:00
Chris Kirby
84a48ed8e2 Fix several cases in srch.c where the return value of EIO should have been -EIO.
Signed-off-by: Chris Kirby <ckirby@versity.com>
2025-10-15 17:35:17 -05:00