Commit Graph

2182 Commits

Author SHA1 Message Date
Auke Kok
f29a411bf1 generic_file_splice_read is removed.
Based on my reading of the gfs2 driver, copy_splice_read appears to be
the safer replacement, as filemap_splice_read may potentially lead to
cluster deadlocks.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2026-02-13 14:28:30 -08:00
Auke Kok
8ceedbd819 Obsolete scoutfs_writepage.
Due to folios, the kernel will call scoutfs_writepages() and this
becomes unused. It could be ported but the helper function to call isn't
exported anymore.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2026-02-13 14:27:11 -08:00
Auke Kok
bcde8b2169 Fix unlocked pt_excl in scoutfs_readahead.
This caller of scoutfs_get_block is now actively used in el10 and
the WARN_ON_ONCE(!lock) in data.c:567 triggers.

XXX FIXME XXX
However, in some of our testing this still hits the
`BUG_ON(!list_empty(pages));` a few lines further down, so it's still
not right.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2026-02-13 14:27:11 -08:00
Auke Kok
92e3c72d9d mv overwrite error format changes in el10
This is somewhat cumbersome, we want to see the error message, but the
format changes enough to make this messy. We opt to change the golden to
the new format, which only shows one of the arguments in its error
output: the thing that cannot be overwritten. We then add a filter that
rewrites the old output format with sed patterns to be exactly like the
new format, so this will work everywhere again, without changing or
adding filters to obscure error messages.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2026-02-13 14:27:11 -08:00
Auke Kok
e42705d757 Add sysfs default_groups usage.
Since v5.1-rc3-29-gaa30f47cf666, and in el9, there are changes to reduce
the amount of boilerplate code needed to hook up lots of attribute files
using a .default_groups member. In el10, this becomes the required
method as the .default_attrs member is removed. This touches
every sysfs part that we have.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2026-02-13 14:27:11 -08:00
Auke Kok
f08a338a6b set_blocksize() takes struct file argument.
In v6.9-rc4-8-gead083aeeed9, this now takes a struct file argument,
adding to the ifdef salad we've got going on here.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2026-02-13 14:27:11 -08:00
Auke Kok
38906a51da generic_fillattr() now wants the request_mask arg from caller.
Since ~v6.5-rc1-95-g0d72b92883c6, generic_fillattr() asks us to pass
through the request_mask from the caller. This allows it to only
request a subset.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2026-02-13 14:27:11 -08:00
Auke Kok
4ef1f56f76 Shrinker API v4.
Yet another major shrinker API evolution in v6.6-rc4-53-gc42d50aefd17.
The struct shrinker now has to be dynamically allocated. This is
purposely a backwards incompatible break. We add another KC_ wrapper
around the new shrinker_alloc() and move some initialization around to
make this as much as possible low impact, but compatible with the old
APIs through substitution.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2026-02-13 14:27:11 -08:00
Auke Kok
80926cfe55 bio_add_page is now __must_check
The return type has always been int, so we just need to check the
return value and do something with it. We could return -ENOMEM here as
well, either way it'll fall all the way through no matter what.

This is since v6.4-rc2-100-g83f2caaaf9cb.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2026-02-13 14:27:11 -08:00
Auke Kok
5f2f8f199b Adjust for __assign_str() losing second argument.
In v6.8-9146-gc759e609030c, the second argument for __assign_str() was
removed, as the second parameter is already derived from the __string()
definition and no longer needed. We have to do a little digging in
headers here to find the definition.

Note the missing `;` at a few places... it has to be added now.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2026-02-13 14:27:11 -08:00
Auke Kok
4c4a9c154d RIP bd_inode.
v6.9-rc4-29-g203c1ce0bb06 removes bd_inode. The canonical replacement is
bd_mapping->host, where applicable. We have one use where we directly
need the mapping instead of the inode, as well.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2026-02-11 11:30:23 -08:00
Auke Kok
b1bef1b7f5 Fix compiler warnings for flex array definitions.
The compiler now balks at structs that end with a flex array member
declared as `val[0]`, since technically the spec considers this
unsanitary. As a result, however, we can't memcpy to `struct->val`,
since that's a pointer and we'd be writing something of a different
length (u8's in our case) into something that's of pointer size. So
there we have to do the opposite, and memcpy to &struct->val[0].

Signed-off-by: Auke Kok <auke.kok@versity.com>
2026-02-11 11:30:23 -08:00
Auke Kok
34a78ae4a6 unaligned.h moved from asm/ to linux/
In v6.12-rc1-3-g5f60d5f6bbc1, asm/unaligned.h only included
asm-generic/unaligned.h and that was cleaned up from architecture
specific things. Everyone should now include linux/unaligned.h and the
former include was removed.

A quick peek at server.c shows that while included, it no longer uses
any function from this header at all, so it can just be dropped.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2026-02-11 11:30:23 -08:00
Auke Kok
c63b3188c3 Account for difference in stat output format for device nodes.
The new format in el10 has non-hex output, separated by a comma. Add the
additional filter string so this works as expected.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2026-02-11 11:30:23 -08:00
Auke Kok
a455d089e5 Fix el10 not skipping the format-version-forward-back test.
The logic only accounted for single-digit versions. With el10, that
breaks.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2026-02-11 11:30:23 -08:00
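
As an illustration of the single-digit problem described in a455d089e5, here is a minimal sketch of extracting and comparing a multi-digit el major version. The function names and the sample release strings are hypothetical; this log does not show the actual test logic.

```shell
# Hypothetical helper: print the numeric EL major version found in a
# kernel release string, or nothing if there is no ".elN" component.
el_major() {
    printf '%s\n' "$1" | sed -n 's/.*\.el\([0-9][0-9]*\).*/\1/p'
}

# Hypothetical helper: succeed if the release is at least EL major $2.
# The comparison is numeric, so 10 sorts after 9.
el_at_least() {
    major=$(el_major "$1")
    [ -n "$major" ] && [ "$major" -ge "$2" ]
}
```

Matching one or more digits, and comparing with `-ge` rather than as strings, is what keeps a two-digit major from being misread.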
Auke Kok
85e3424a63 Use a/m/c_time accessor functions.
In v6.6-rc5-1-g077c212f0344, one can no longer directly access the
inode m_time and a_time etc. We have to go through these static inline
functions to get to them. The compat is matched closely to mimic the
new functions.

Further back, ctime accessors were added in v6.5-rc1-7-g9b6304c1d537,
and need to be applied as well.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2026-02-11 11:30:23 -08:00
Auke Kok
8a953c9ba3 Stop using egrep.
egrep is no longer in el10, so replace it with `grep -E` everywhere.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2026-02-11 11:30:12 -08:00
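
The replacement described in 8a953c9ba3 is mechanical. A small sketch, with a hypothetical wrapper name and made-up input, looks like this:

```shell
# "grep -E" enables extended regular expressions, which is what
# "egrep" used to provide. match_ere is a hypothetical wrapper.
match_ere() {
    grep -E "$1"
}

# Prints the lines "foo" and "baz".
printf 'foo\nbar\nbaz\n' | match_ere '^(foo|baz)$'
```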
Auke Kok
40df1e078b prandom_bytes and family removed, switch to get_random_bytes variants.
In v6.1-rc5-2-ge9a688bcb193, get_random_u32_below() becomes available and
can start replacing prandom_bytes_max(). Switch to it where we can.

get_random_bytes() has been available since el7, so also replace
prandom_bytes() where we're using it.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2026-02-06 11:14:31 -08:00
Zach Brown
831faff7d2 Merge pull request #282 from versity/zab/v1.28
v1.28 Release
2026-02-06 09:28:52 -08:00
Zach Brown
8dad826f88 v1.28 Release
Finish the release notes for the 1.28 release.

Signed-off-by: Zach Brown <zab@versity.com>
v1.28
2026-02-05 09:47:05 -08:00
Zach Brown
3a05c69643 Merge pull request #279 from versity/auke/basic-acl-consistency
Auke/basic acl consistency (test/reproduction)
2026-02-02 10:32:30 -08:00
Auke Kok
533f309aec Switch to .get_inode_acl() to avoid rcu corruption.
In el9.6, the kernel VFS no longer goes through xattr handlers to
retrieve ACLs, but instead calls the FS drivers' .get_{inode_}acl
method.  In the initial compat version we hooked up to .get_acl given
the identical name that was used in the past.

However, this results in caching issues, as was encountered by customers
and exposed in the added test case `basic-acl-consistency`. The result
is that some group ACL entries may appear randomly missing. Dropping
caches may temporarily fix the issue.

The root cause of the issue is that the VFS now has 2 separate paths to
retrieve ACLs from the FS driver, and they have conflicting
implications for caching. `.get_acl` is purely meant for filesystems
like overlay/ecryptfs where no caching should ever go on as they are
fully passthrough only. Filesystems with dentries (i.e. all normal
filesystems) should not expose this interface, and instead expose the
.get_inode_acl method. And indeed, in introducing the new interface, the
upstream kernel converts all but a few fs's to use .get_inode_acl().

The functional change in the driver is to detach KC_GET_ACL_DENTRY and
introduce KC_GET_INODE_ACL to handle the new (and required) interface.
KC_SET_ACL_DENTRY is detached due to it being a different changeset in
the kernel and we should separate these for good measure now.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2026-01-30 11:31:43 -08:00
Auke Kok
0ef22b3c44 Add basic ACL consistency test case.
This test case is used to detect and reproduce a customer issue we're
seeing where the new .get_acl() method API and underlying changes in
el9_6+ are causing ACL cache fetching to return inconsistent results,
which shows as missing ACLs on directories.

This particular sequence is consistent enough that it warrants making
it into a specific test.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2026-01-22 12:23:38 -08:00
Auke Kok
85ffba5329 Update existing tests to use scratch helpers.
Signed-off-by: Auke Kok <auke.kok@versity.com>
2026-01-20 12:35:43 -08:00
Auke Kok
553e6e909e Scratch mount test helpers.
Adds basic mkfs/mount/umount helpers that handle all the basics
for making, mounting and unmounting scratch devices. The mount/unmount
create "$T_MSCR", which lives in "$T_TMPDIR".

Signed-off-by: Auke Kok <auke.kok@versity.com>
2026-01-20 12:35:09 -08:00
Zach Brown
9b569415f2 Merge pull request #276 from versity/zab/v1.27
v1.27 Release
2026-01-15 19:36:38 -08:00
Zach Brown
6a1e136085 v1.27 Release
Finish the release notes for the 1.27 release.

Signed-off-by: Zach Brown <zab@versity.com>
v1.27
2026-01-15 14:21:53 -08:00
Zach Brown
7ca789c837 Merge pull request #278 from versity/zab/test_sync_before_crash
Have run-tests monitor sync before crashing
2026-01-15 14:03:26 -08:00
Zach Brown
4d55fe6251 Have run-tests monitor sync before crashing
There have been a few failures where output is generated just before we
crash but it didn't have a chance to be written.  Add a best-effort
background sync before crashing.  There's a good chance it'll hang if
the system is stuck so we don't wait for it directly, just for .. some
time to pass.

Signed-off-by: Zach Brown <zab@versity.com>
2026-01-15 10:41:44 -08:00
Zach Brown
8f896d9783 Merge pull request #277 from versity/zab/avoid_lock_shrink_storm_hangs
Zab/avoid lock shrink storm hangs
2026-01-14 11:13:09 -08:00
Zach Brown
e54f8d3ec0 Don't shutdown server from sending to fencing client
Errors from lock server calls typically shut the server down.

During normal unmount a client's locks are reclaimed before the
connection is disconnected.  The lock server won't try to send to
unmounting clients.

Clients whose connections time out can cause ENOTCONN errors.  Their
connection is freed before they're fenced and their locks are reclaimed.
The server can try to send to the client for a lock that's disconnected
and get a send error.

These errors shouldn't shut down the server.  The client is either going
to be fenced and have the locks reclaimed, ensuring forward progress, or
the server is going to shutdown if it can't fence.

This was seen in testing as multiple clients were timed out.

Signed-off-by: Zach Brown <zab@versity.com>
2026-01-13 15:34:55 -08:00
Zach Brown
d89e16214d Simplify fence-and-reclaim fence execution check
The fence-and-reclaim test runs a bunch of scenarios and makes sure that
the fence agent was run on the appropriate mount's rids.

Unfortunately the checks were racy.  The check itself only looked at
the log once to see if the rid had been fenced.  Each check had steps
before that would wait until the rid should have been fenced and could
be checked.

Those steps were racy.  They'd do things like make sure a fence request
wasn't pending, but never waited for it to be created in the first
place.  They'd falsely indicate that the log should be checked and when
the rid wasn't found in the log the test would fail.  In logs of
failures we'd see that the rids were fenced after this test failed and
moved on to the next.

This simplifies the checks.  It gets rid of all the intermediate steps
and just waits around for the rid to be fenced, with a timeout.  This
avoids the flaky tests.

Signed-off-by: Zach Brown <zab@versity.com>
2026-01-13 15:34:55 -08:00
Zach Brown
b468352254 Add t_wait_until_timeout
Add a test helper for waiting for a command to return success which will
fail the test after a timeout.

Signed-off-by: Zach Brown <zab@versity.com>
2026-01-13 15:34:55 -08:00
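
The wait-with-timeout pattern from b468352254 and d89e16214d can be sketched as below. This is a hypothetical stand-in, not the real t_wait_until_timeout: the real helper's arguments and the way it fails the test are not shown in this log, so this version simply returns non-zero on timeout.

```shell
# Hypothetical stand-in for a helper like t_wait_until_timeout.
# Usage: wait_until_timeout SECONDS COMMAND [ARGS...]
wait_until_timeout() {
    timeout_secs=$1
    shift
    elapsed=0
    # Re-run the command about once per second until it succeeds.
    until "$@"; do
        if [ "$elapsed" -ge "$timeout_secs" ]; then
            echo "timed out after ${timeout_secs}s waiting for: $*" >&2
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}
```

Polling for the final condition with a deadline removes the need for intermediate "is it time to check yet" steps, which is where the races described above came from.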
Zach Brown
0eb9dfebdc Allow forced unmount errors in lock invalidation
Lock invalidation has assertions for critical errors, but it doesn't
allow the synthetic errors that come from forced unmount severing the
client's connection to the world.

Signed-off-by: Zach Brown <zab@versity.com>
2026-01-13 15:34:55 -08:00
Zach Brown
f5750de244 Search messages in rbtree instead of lists
The net layer was initially built around send queue lists with the
presumption that there wouldn't be many messages in flight and that
responses would be sent roughly in order.

In the modern era, we can have tens of thousands of lock request messages
in flight.  This led to O(n^2) processing in quite a few places as recv
processing searched for either requests to complete or responses to
free.

This adds messages to two rbtrees, indexing either requests by their id
or responses by their send sequence.  Recv processing can find messages
in O(log n).  This patch intends to be minimally disruptive.  It's only
replacing the search of the send and resend queues in the recv path with
rbtrees.  Other uses of the two queue lists are untouched.

On a single node, with ~40k lock shrink attempts in flight, we go from
processing ~800 total request/grant request/response pairs per second to
~60,000 per second.

Signed-off-by: Zach Brown <zab@versity.com>
2026-01-13 15:32:55 -08:00
Zach Brown
f0c7996612 Limit client locks with option instead of shrinker
The use of the VM shrinker was a bad fit for locks.  Shrinking a lock
requires a round trip with the server to request a null mode.  The VM
treats the locks like a cache, as expected, which leads to huge amounts
of locks accumulating and then being shrunk in bulk.  This creates a
huge backlog of locks making their way through the network conversation
with the server that implements invalidating to a null mode and freeing.
It starves other network and lock processing, possibly for minutes.

This removes the VM shrinker and instead introduces an option that sets
a limit on the number of idle locks.  As the number of locks exceeds the
count we only try to free an oldest lock at each lock call.  This
results in a lock freeing pace that is proportional to the allocation of
new locks by callers and so is throttled by the work done while callers
hold locks.  It avoids the bulk shrinking of 10s of thousands of locks
that we see in the field.

Signed-off-by: Zach Brown <zab@versity.com>
2026-01-08 10:58:50 -08:00
Zach Brown
5143927e07 Merge pull request #275 from versity/auke/qht_slow_umount_pr
Unmounts can be slow and break quorum-heartbeat-timeout
2026-01-08 09:35:23 -08:00
Auke Kok
f495f52ec9 Unmounts can be slow and break quorum-heartbeat-timeout
We observe that unmount in this test can consume up to 10sec of time
before proceeding to record heartbeat timeout elections by followers.

When this happens, elections and new leaders happen before unmount even
completes. This indicates that heartbeat packets from the unmounting
mount cease immediately, but the unmount is taking longer doing other
things. The timeouts then trigger, possibly during the unmount.

The result is that with timeouts of 3 seconds, we're not actually
waiting for an election at all. It already happened 7 seconds ago. The
code here just "sees" that it happens a few hundred ms after it started
looking for it.

There are a few ways to fix this. We could record the actual timestamp
of the election, and compare it with the actual timestamp of the last
heartbeat packet. This would be conclusive, and could disregard any
complication from umount taking too long. But it also means adding
timestamping in various places, or having to rely on tcpdump with packet
processing.

We can't just record $start before unmount. We will still violate the
part of the test that checks that elections didn't happen too late.
Especially in the 3sec test case if unmount takes 10sec.

The simplest solution is to unmount in a bg thread, and circle around
later to `wait` for it to assure we can re-mount without ill effect.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2026-01-08 09:05:40 -08:00
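
The background-unmount approach chosen in f495f52ec9 can be sketched as follows. `slow_umount` is a hypothetical stand-in that only sleeps; a real test would run umount on the mount under test, and the election check itself is omitted here.

```shell
# Hypothetical stand-in for an unmount that takes a while to finish.
slow_umount() {
    sleep 2
}

start=$(date +%s)

# Start the unmount in the background so timing can begin right away
# instead of after the unmount has finished.
slow_umount &
umount_pid=$!

# ... a test would watch for the election here, measured from $start ...

# Circle back and wait for the unmount before mounting again.
wait "$umount_pid"
umount_rc=$?
finished=$(date +%s)
```

Waiting on the saved pid both collects the unmount's exit status and guarantees it has completed before any re-mount.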
Zach Brown
3dafeaac5b Merge pull request #273 from versity/clk/inode_deletion
Clk/inode deletion
2026-01-07 12:20:12 -08:00
Chris Kirby
ef0f6f8ac2 Fix race in inode-deletion test
Due to an iput race, the "unlink wait for open on other mount"
subtest can fail. If the unlink happens inline, then the test
passes. But if the orphan scanner has to complete the unlink
work, it's possible that there won't be enough log merge work
for the scanner to do the cleanup before we look at the seq index.

Add SCOUTFS_TRIGGER_LOG_MERGE_FORCE_FINALIZE_OURS, to allow
forcing a log merge. Add new counters, log_merges_start and
log_merge_complete, so that tests can see that a merge has happened.

Then we have to wait for the orphan scanner to do its work.
Add a new counter, orphan_scan_empty, that increments each time
the scanner walks the entire inode space without finding any
orphans. Once the test sees that counter increment, it should be
safe to check the seq index and see that the unlinked inode is gone.

Signed-off-by: Chris Kirby <ckirby@versity.com>
2026-01-07 08:29:38 -06:00
Chris Kirby
c0cd29aa1b Fix run-test.sh buffer multiplier breakage
The /sys/kernel/debug/tracing/buffer_size_kb file always reads as
"7 (expanded: 1408)". So the -T option to run-test.sh won't work,
because it tries to multiply that string by the given factor.

It always defaults to 1408 on every platform we currently support.
Just use that value so we can specify -T in CI runs.

Signed-off-by: Chris Kirby <ckirby@versity.com>
2025-12-18 15:05:48 -06:00
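
To illustrate c0cd29aa1b: the string read from buffer_size_kb cannot be used in shell arithmetic, so the fixed default is multiplied instead. The variable names and the factor of 4 are hypothetical, and the `sed` parse at the end is an alternative that the commit does not take.

```shell
# The value the commit reports reading from buffer_size_kb.
raw="7 (expanded: 1408)"

# Hypothetical multiplier, like the factor given with -T.
factor=4

# "$raw" is not a number, so multiply the fixed default instead.
default_kb=1408
scaled_kb=$((default_kb * factor))
echo "$scaled_kb"

# Alternative (not what the commit does): pull out the expanded size.
parsed_kb=$(printf '%s\n' "$raw" | sed -n 's/.*expanded: \([0-9][0-9]*\)).*/\1/p')
echo "$parsed_kb"
```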
Zach Brown
50bff13f21 Merge pull request #266 from versity/zab/increase_move_empty_budget
Increase server commit block budget for alloc move
2025-12-18 12:44:20 -08:00
Zach Brown
de70ca2372 Increase server commit block budget for alloc move
A few callers of alloc_move_empty in the server were providing a budget
that was too small.  Recent changes to extent_mod_blocks increased the
max budget that is necessary to move extents between btrees.  The
existing WAG of 100 was too small for trees of height 2 and 3.  This
caused looping in production.

We can increase the move budget to half the overall commit budget, which
leaves room for a height of around 7 each.  This is much greater than we
see in practice because the size of the per-mount btrees is effectively
limited by both watermarks and thresholds to commit and drain.

Signed-off-by: Zach Brown <zab@versity.com>
2025-12-17 14:22:04 -06:00
Zach Brown
5af1412d5f Merge pull request #270 from versity/auke/bdev_autoloading
Avoid block device autoloading warning.
2025-12-17 11:06:32 -08:00
Zach Brown
0a2b2ad409 Merge pull request #269 from versity/auke/tap_status_msg
Include t_fail status in tap output.
2025-12-17 11:04:00 -08:00
Auke Kok
6c4590a8a0 Avoid block device autoloading warning.
It's possible to trigger the block device autoloading mechanism
with a mknod()/stat(), and this mechanism has long been declared
obsolete, thus triggering a dmesg warning since el9_7, which then
fails the test. You may need to `rmmod loop` to reproduce.

Avoid this by not triggering a loop autoload: we just make a
different blockdev. Choosing `42` here should avoid any autoload
mechanism as this number is explicitly for demo drivers and should
never trigger an autoload.

We also just ignore the warning line in dmesg. Other tests can and
might perhaps still trigger this, as well as background noise running
during the test.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-12-08 13:04:58 -08:00
Zach Brown
1768f69c3c Merge pull request #224 from versity/auke/renameat2-test-sub-dir
Use T_D0/1 instead of T_M0 here.
2025-12-08 10:05:46 -08:00
Zach Brown
dcb0fd5805 Merge pull request #268 from versity/auke/dont_use_bash_special_stdfiles
Avoid using bash special device nodes.
2025-12-08 09:47:19 -08:00
Auke Kok
660f874488 Use T_D0/1 instead of T_M0 here.
Use of T_M0 and variants should be reserved for e.g. scoutfs
<subcommand> -p <mountpoint> type of usages. Tests should create
individual content files in the assigned subdirectory.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-12-04 14:34:02 -05:00
Auke Kok
e1a6689a9b Include t_fail status in tap output.
The tap output file was not yet complete as it failed to include
the contents of `status.msg`. In a few cases, that would mean it
lacks important context.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-12-04 14:09:39 -05:00