When the fs_root is too short for subtrees, get_parent returns the
entire root for every request. Multiple concurrent merges would each
independently CoW and modify the same root tree. Processing their
completions would replace the root with each result, only keeping the
last and orphaning blocks allocated by earlier completions.
Limit to one outstanding request when fs_root.height <= 2 to prevent
this.
Signed-off-by: Auke Kok <auke.kok@versity.com>
bt = bl->data, but we just marked bl to be freed with scoutfs_block_put(),
so dereferencing bl afterwards would be a use-after-free. Save the blkno
before the put instead. Very hypothetical.
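A minimal userspace sketch of the pattern, with stand-in types (the real
scoutfs block structs differ):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* stand-in for the scoutfs block struct, for illustration only */
struct block {
	uint64_t blkno;
};

/* stand-in for scoutfs_block_put(): bl may be freed here */
static void block_put(struct block *bl)
{
	free(bl);
}

/*
 * The fix: read blkno into a local before dropping the reference,
 * so nothing touches bl after the put.
 */
static uint64_t put_and_report(struct block *bl)
{
	uint64_t blkno = bl->blkno;	/* save before the put */

	block_put(bl);
	return blkno;			/* bl is not dereferenced again */
}
```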
Signed-off-by: Auke Kok <auke.kok@versity.com>
The descent in free_blocks stops at level 1 and frees all leaf refs
in that parent block. Ancestor blocks reached through their final
child ref are recorded in blknos[] and freed after the leaves. But
the level-1 parent block itself was never freed — it wasn't added to
blknos[] during descent (that array only tracks levels 2+) and wasn't
freed explicitly after the leaf loop.
This leaked every level-1 parent block in every finalized log tree,
leaving them as meta blocks that were neither referenced by any btree
nor present in any free list. This caused CK_META_COVERAGE gap
failures in scoutfs check.
Free the level-1 parent block explicitly after all its leaf refs have
been freed. Adjust the budget check to reserve space for this
additional free (+1 for the parent alongside the existing +nr_par for
ancestors).
Signed-off-by: Auke Kok <auke.kok@versity.com>
The "server error emptying freed" error was causing a
fence-and-reclaim test failure. In this case, the error
was -ENOLINK, which we should ignore for messaging purposes.
Signed-off-by: Chris Kirby <ckirby@versity.com>
It's possible for a srch compaction to collapse down to nothing
if given evenly paired create/delete entries. In this case, we
were emitting an empty block. This could cause problems for
search_sorted_file(), which assumes that every block it sees
has a valid first and last entry.
Fix this by keeping a temp entry and only emitting it if it differs
from the next entry in the block. Be sure to flush out a straggling
temp entry if we have one when we're done with the last block of
the merge.
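The pairing logic can be sketched in userspace with ints standing in for
srch entries; the real entry comparison and block emit paths are more
involved:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hold a "temp" entry and only emit it when the next entry differs;
 * an equal pair (create/delete) cancels out entirely, and a straggling
 * temp entry is flushed once the input is exhausted.  Evenly paired
 * input collapses to nothing without emitting an empty block.
 */
static size_t compact(const int *in, size_t nr, int *out)
{
	size_t nr_out = 0;
	bool have_temp = false;
	int temp = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		if (have_temp && in[i] == temp) {
			have_temp = false;	/* pair cancels */
			continue;
		}
		if (have_temp)
			out[nr_out++] = temp;	/* differs: emit temp */
		temp = in[i];
		have_temp = true;
	}
	if (have_temp)
		out[nr_out++] = temp;		/* flush straggler */
	return nr_out;
}
```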
Signed-off-by: Chris Kirby <ckirby@versity.com>
Add unmounting checks to lock_wait_cond() and lock_key_range() so
that lock waiters wake up and new lock requests fail with -ESHUTDOWN
during unmount. Replace the unbounded wait_event() with a 60 second
timeout to prevent indefinite hangs. Relax the WARN_ON_ONCE at
lock_key_range entry to only warn when not unmounting, since late
lock attempts during shutdown are expected.
Signed-off-by: Auke Kok <auke.kok@versity.com>
Replace unbounded wait_for_completion() with a 120 second timeout
to prevent indefinite hangs during unmount if the server never
responds to the farewell request.
Signed-off-by: Auke Kok <auke.kok@versity.com>
When block_submit_bio() fails, set BLOCK_BIT_ERROR so that
waiters in wait_event(uptodate_or_error) will wake up rather
than waiting indefinitely for a completion.
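The shape of the fix, with illustrative flag names (the real bits live in
the block cache code):

```c
#include <assert.h>

/* illustrative bit names, not the driver's actual definitions */
#define BLOCK_BIT_UPTODATE	(1 << 0)
#define BLOCK_BIT_ERROR		(1 << 1)

/* the wait_event() condition: true once IO completed or failed */
static int uptodate_or_error(unsigned long flags)
{
	return flags & (BLOCK_BIT_UPTODATE | BLOCK_BIT_ERROR);
}

/*
 * On submit failure no completion will ever run, so mark the error
 * bit directly; the wait condition above then becomes true and
 * waiters wake up instead of sleeping forever.
 */
static unsigned long submit_failed(unsigned long flags)
{
	return flags | BLOCK_BIT_ERROR;
}
```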
Signed-off-by: Auke Kok <auke.kok@versity.com>
The initialization here avoids clearing __pad[], which leaks
uninitialized memory to disk. Use a struct initializer to avoid it.
Signed-off-by: Auke Kok <auke.kok@versity.com>
The allocation here currently leaks uninitialized memory through
__pad[7], which is written to disk. Use the initializer to enforce
zeroing the pad. The name member is written right after.
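A sketch of the initializer pattern with a made-up struct; members not
named in a designated initializer, including __pad[], are zeroed:

```c
#include <assert.h>
#include <string.h>

/* illustrative on-disk struct; the real layout is in the format header */
struct demo_key {
	unsigned char type;
	unsigned char __pad[7];
	unsigned long long id;
};

static void fill_key(struct demo_key *key, unsigned long long id)
{
	/*
	 * Members not mentioned in the initializer, __pad included,
	 * are zero-initialized, so no stack garbage reaches disk.
	 */
	struct demo_key k = {
		.type = 1,
		.id = id,
	};

	*key = k;
}
```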
Signed-off-by: Auke Kok <auke.kok@versity.com>
The caller sends the return value of this inline as a u8. If we return
-EINVAL, it truncates to 234, which is outside of our enum range. Assume
this was meant to return SCOUTFS_NET_ERR_EINVAL, which is a defined
constant.
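A userspace illustration of the truncation; the enum value used for
SCOUTFS_NET_ERR_EINVAL here is an assumption for the sketch, not the
actual wire value:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* illustrative wire enum; the real values are in the net headers */
enum {
	SCOUTFS_NET_ERR_NONE = 0,
	SCOUTFS_NET_ERR_EINVAL = 1,	/* value assumed for this sketch */
};

/* buggy: -EINVAL (-22) truncated to a u8 becomes 234 */
static uint8_t buggy_map(void)
{
	return (uint8_t)-EINVAL;
}

/* fixed: return the defined wire constant instead */
static uint8_t fixed_map(void)
{
	return SCOUTFS_NET_ERR_EINVAL;
}
```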
Signed-off-by: Auke Kok <auke.kok@versity.com>
These boolean checks are all mutually exclusive, so after the negation
this check will always succeed. Instead of && it needs to use ||.
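A minimal illustration of why the negated conjunction can't fail when the
flags are mutually exclusive:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * With mutually exclusive flags at most one can be true, so
 * (a && b && c) is always false and !(a && b && c) always passes,
 * even when one of the flags is set.
 */
static bool buggy_check(bool a, bool b, bool c)
{
	return !(a && b && c);
}

/* the intended "none of these is set" check needs || */
static bool fixed_check(bool a, bool b, bool c)
{
	return !(a || b || c);
}
```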
Signed-off-by: Auke Kok <auke.kok@versity.com>
These two lines are repeated verbatim. That suggests an additional check
may have been intended, but as far as I can see there isn't anything
left here that needs checking.
Signed-off-by: Auke Kok <auke.kok@versity.com>
This setup function always returned 0, even on error, causing
initialization to continue despite the error.
Signed-off-by: Auke Kok <auke.kok@versity.com>
In el9.6, the kernel VFS no longer goes through xattr handlers to
retrieve ACLs, but instead calls the FS driver's .get_{inode_}acl
method. In the initial compat version we hooked up to .get_acl, given
the identical name that was used in the past.
However, this results in caching issues, as was encountered by customers
and exposed in the added test case `basic-acl-consistency`. The result
is that some group ACL entries may appear randomly missing. Dropping
caches may temporarily fix the issue.
The root cause of the issue is that the VFS now has 2 separate paths to
retrieve ACLs from the FS driver, and they have conflicting implications
for caching. `.get_acl` is meant purely for filesystems like
overlay/ecryptfs, where no caching should ever happen as they are fully
passthrough. Filesystems with dentries (i.e. all normal filesystems)
should not expose this interface, and should instead expose the
.get_inode_acl method. And indeed, in introducing the new interface, the
upstream kernel converts all but a few fs's to use .get_inode_acl().
The functional change in the driver is to detach KC_GET_ACL_DENTRY and
introduce KC_GET_INODE_ACL to handle the new (and required) interface.
KC_SET_ACL_DENTRY is detached because it came from a different changeset
in the kernel, and we should separate these for good measure now.
Signed-off-by: Auke Kok <auke.kok@versity.com>
Errors from lock server calls typically shut the server down.
During normal unmount a client's locks are reclaimed before the
connection is disconnected. The lock server won't try to send to
unmounting clients.
Clients whose connections time out can cause ENOTCONN errors. Their
connection is freed before they're fenced and their locks are reclaimed.
The server can try to send to the client for a lock that's disconnected
and get a send error.
These errors shouldn't shut down the server. The client is either going
to be fenced and have the locks reclaimed, ensuring forward progress, or
the server is going to shutdown if it can't fence.
This was seen in testing as multiple clients were timed out.
Signed-off-by: Zach Brown <zab@versity.com>
Lock invalidation has assertions for critical errors, but it doesn't
allow the synthetic errors that come from forced unmount severing the
client's connection to the world.
Signed-off-by: Zach Brown <zab@versity.com>
The net layer was initially built around send queue lists with the
presumption that there wouldn't be many messages in flight and that
responses would be sent roughly in order.
In the modern era, we can have 10s of thousands of lock request messages
in flight. This led to O(n^2) processing in quite a few places as recv
processing searched for either requests to complete or responses to
free.
This adds messages to two rbtrees, indexing either requests by their id
or responses by their send sequence. Recv processing can find messages
in O(log n). This patch intends to be minimally disruptive. It's only
replacing the search of the send and resend queues in the recv path with
rbtrees. Other uses of the two queue lists are untouched.
On a single node, with ~40k lock shrink attempts in flight, we go from
processing ~800 total request/grant request/response pairs per second to
~60,000 per second.
Signed-off-by: Zach Brown <zab@versity.com>
The use of the VM shrinker was a bad fit for locks. Shrinking a lock
requires a round trip with the server to request a null mode. The VM
treats the locks like a cache, as expected, which leads to huge amounts
of locks accumulating and then being shrunk in bulk. This creates a
huge backlog of locks making their way through the network conversation
with the server that implements invalidating to a null mode and freeing.
It starves other network and lock processing, possibly for minutes.
This removes the VM shrinker and instead introduces an option that sets
a limit on the number of idle locks. As the number of locks exceeds the
count we only try to free an oldest lock at each lock call. This
results in a lock freeing pace that is proportional to the allocation of
new locks by callers and so is throttled by the work done while callers
hold locks. It avoids the bulk shrinking of 10s of thousands of locks
that we see in the field.
Signed-off-by: Zach Brown <zab@versity.com>
Due to an iput race, the "unlink wait for open on other mount"
subtest can fail. If the unlink happens inline, then the test
passes. But if the orphan scanner has to complete the unlink
work, it's possible that there won't be enough log merge work
for the scanner to do the cleanup before we look at the seq index.
Add SCOUTFS_TRIGGER_LOG_MERGE_FORCE_FINALIZE_OURS, to allow
forcing a log merge. Add new counters, log_merges_start and
log_merge_complete, so that tests can see that a merge has happened.
Then we have to wait for the orphan scanner to do its work.
Add a new counter, orphan_scan_empty, that increments each time
the scanner walks the entire inode space without finding any
orphans. Once the test sees that counter increment, it should be
safe to check the seq index and see that the unlinked inode is gone.
Signed-off-by: Chris Kirby <ckirby@versity.com>
A few callers of alloc_move_empty in the server were providing a budget
that was too small. Recent changes to extent_mod_blocks increased the
max budget that is necessary to move extents between btrees. The
existing WAG of 100 was too small for trees of height 2 and 3. This
caused looping in production.
We can increase the move budget to half the overall commit budget, which
leaves room for a height of around 7 each. This is much greater than we
see in practice because the size of the per-mount btrees is effectively
limited by both watermarks and thresholds to commit and drain.
Signed-off-by: Zach Brown <zab@versity.com>
Tests that cause client retries can fail with this error
from server_commit_log_merge():
error -2 committing log merge: getting merge status item
This can happen if the server has already committed and resolved
the log merge that is being retried. We can safely ignore ENOENT here
just like we do a few lines later.
Signed-off-by: Chris Kirby <ckirby@versity.com>
The server's commit_log_trees has an error message that includes the
source of the error, but it's not used for all errors. The WARN_ON is
redundant with the message and is removed because it isn't filtered out
when we see errors from forced unmount.
Signed-off-by: Zach Brown <zab@versity.com>
Silence another error warning and assertion that's assuming that the
result of the errors is going to be persistent. When we're forcing an
unmount we've severed storage and networking.
Signed-off-by: Zach Brown <zab@versity.com>
Assembling a srch compaction operation creates an item and populates it
with allocator state. If filling the allocators fails, it doesn't
cleanly unwind the allocation and undo the compaction item change, and
it issues a warning.
This warning isn't needed if the error shows that we're in forced
unmount. The inconsistent state won't be applied, it will be dropped on
the floor as the mount is torn down.
Signed-off-by: Zach Brown <zab@versity.com>
The log merging process is meant to provide parallelism across workers
in mounts. The idea is that the server hands out a bunch of concurrent
non-intersecting work that's based on the structure of the stable input
fs_root btree.
The nature of the parallel work (cow of the blocks that intersect a key
range) means that the ranges of concurrently issued work can't overlap
or the work will all cow the same input blocks, freeing that input
stable block multiple times. We're seeing this in testing.
Correctness was intended by having an advancing key that sweeps sorted
ranges. Duplicate ranges would never be hit as the key advanced past
each one it visited. This was broken by the mapping of the fs item keys to
log merge tree keys by clobbering the sk_zone key value. It effectively
interleaves the ranges of each zone in the fs root (meta indexes,
orphans, fs items). With just the right log merge conditions that
involve logged items in the right places and partial completed work to
insert remaining ranges behind the key, ranges can be stored at mapped
keys that end up out of order. The server iterates over these and ends
up issuing overlapping work, which results in duplicated
frees of the input blocks.
The fix, without changing the format of the stored log tree items, is to
perform a full sweep of all the range items and determine the next item
by looking at the full precision stored keys. This ensures that the
processed ranges always advance and never overlap.
Signed-off-by: Zach Brown <zab@versity.com>
The data_wait_err ioctl currently requires the correct data_version
for the inode to be passed in, or else the ioctl returns -ESTALE. But
the ioctl itself is just a passthrough mechanism for notifying data
waiters, which doesn't involve the data_version at all.
Instead, we can just drop checking the value. The field remains in the
headers, but we've marked it as being ignored from now on. The reason
for the change is documented in the header file as well.
This all is a lot simpler than having to modify/rev the data_waiters
interface to support passing back the data_version, because there isn't
any space left to easily do this, and then userspace would just pass it
back to the data_wait_err ioctl.
Signed-off-by: Auke Kok <auke.kok@versity.com>
scoutfs_alloc_prepare_commit() is badly named. All it really does is
put the references to the two dirty alloc list blocks in the allocator.
It must always be called if allocation was attempted, but it's easier
to require that it always be paired with _alloc_init().
If the srch compaction worker in the client sees an error it will send
the error back to the server without writing its dirty blocks. In
avoiding the write it also avoided putting the two block references,
leading to leaked blocks. We've been seeing rare messages with leaked
blocks in tests.
Signed-off-by: Zach Brown <zab@versity.com>
The .get_acl() method now gets passed a mnt_idmap arg, and we can now
choose to implement either .get_acl() or .get_inode_acl(). Technically
.get_acl() is a new implementation, and .get_inode_acl() is the old.
That second method now also gets an rcu flag passed, but we should be
fine either way.
Deeper under the covers however we do need to hook up the .set_acl()
method for inodes, otherwise setfacl will just fail with -ENOTSUPP. To
make this not super messy (it already is) we tack on the get_acl()
changes here.
This is all roughly ca. v6.1-rc1-4-g7420332a6ff4.
Signed-off-by: Auke Kok <auke.kok@versity.com>
Similar to before when namespaces were added, they are now translated to
a mnt_idmap, since v6.2-rc1-2-gabf08576afe3.
Signed-off-by: Auke Kok <auke.kok@versity.com>
The typical pattern of spinning isolating a list_lru results in a
livelock if there are blocks with leaked refcounts. We're rarely seeing
this in testing.
We can have a modest array in each block that records the stack of the
caller that initially allocated the block and dump that stack for any
blocks that we're unable to shrink/isolate. Instead of spinning
shrinking, we can give it a good try and then print the blocks that
remain and carry on with unmount, leaking a few blocks. (Past events
have had 2 blocks.)
Signed-off-by: Zach Brown <zab@versity.com>
The server's srch commit error warnings were a bit severe. The
compaction operations are a function of persistent state. If they fail
then the inputs still exist and the next attempt will retry whatever
failed. Not all errors are a problem, only those that result in partial
commits that leave inconsistent state.
In particular, we have to support the case where a client retransmits a
compaction request to a new server after a first server performed the
commit but couldn't respond. Throwing warnings when the new server gets
ENOENT looking for the busy compaction item isn't helpful. This came up in
tests as background compaction was in flight as tests unmounted and
mounted servers repeatedly to test lock recovery.
Signed-off-by: Zach Brown <zab@versity.com>
The block cache had a bizarre cache eviction policy that was trying to
avoid precise LRU updates at each block. It had pretty bad behaviour,
including only allowing reclaim of maybe 20% of the blocks that were
visited by the shrinker.
We can use the existing list_lru facility in the kernel to do a better
job. Blocks only exhibit contention as they're allocated and added to
per-node lists. From then on we only set accessed bits and the private
list walkers move blocks around on the list as we see the accessed bits.
(It looks more like a fifo with lazy promotion than a "LRU" that is
actively moving list items around as they're accessed.)
Using the facility means changing how we remove blocks from the cache
and hide them from lookup. We clean up the refcount inserted flag a bit
to be expressed more as a base refcount that can be acquired by
whoever's removing from the cache. It seems a lot clearer.
Signed-off-by: Zach Brown <zab@versity.com>
Add kernelcompat helpers for initial use of list_lru for shrinking. The
most complicated part is the walk callback type changing.
Signed-off-by: Zach Brown <zab@versity.com>
Readers can read a set of items that is stale with respect to items that
were dirtied and written under a local cluster lock after the read
started.
The active reader mechanism addressed this by refusing to shrink pages
that could contain items that were dirtied while any readers were in
flight. Under the right circumstances this can result in refusing to
shrink quite a lot of pages indeed.
This changes the mechanism to allow pages to be reclaimed, and instead
forces stale readers to retry. The gamble is that reads are much faster
than writes. A small fraction should have to retry, and when they do
they can be satisfied by the block cache.
Signed-off-by: Zach Brown <zab@versity.com>
The default TCP keepalive value is currently 10s, resulting in clients
being disconnected after 10 seconds of not replying to a TCP keepalive
packet. These keepalive values are reasonable most of the time, but
we've seen client disconnects where this timeout was exceeded, resulting
in fencing. The cause is unknown at this time, but we suspect transient
network interruptions.
This change adds a configurable value for this specific client socket
timeout. It enforces that its value is above UNRESPONSIVE_PROBES, whose
value remains unchanged.
The default value of 10000ms (10s) is changed to 60s. We're assuming
this value is much better suited for customers; it has been briefly
trialed and appears to better ride out network-level interruptions.
Signed-off-by: Auke Kok <auke.kok@versity.com>
It's possible that scoutfs_net_alloc_conn() fails due to -ENOMEM, which
is legitimately a failure, thus the code here releases the sock again.
But the code block here sets `ret = -ENOMEM` and then restarts the loop,
which immediately sets `ret = kernel_accept()`, thus overwriting the
-ENOMEM error value.
We can argue that an ENOMEM error situation here is not catastrophic.
If this is the first time we hit ENOMEM while trying to accept a new
client, we can just release the socket and wait for the client to try
again. If the kernel at that point is still out of memory to handle the
new incoming connection, that will cascade down and tear down the whole
listener at that point.
The alternative is to let this error path unwind out and break down the
listener immediately, something the code today doesn't do. We're
therefore keeping the behavior the same.
I've opted to replace the `ret = -ENOMEM` assignment with a comment
explaining why we're ignoring the error situation here.
Signed-off-by: Auke Kok <auke.kok@versity.com>
If scoutfs_send_omap_response fails for any reason, req is NULL and we
would hit a hard NULL deref during unwinding.
Signed-off-by: Auke Kok <auke.kok@versity.com>
This function returns a stack struct scoutfs_extent, setting start and
len to an extent found in the proper zone but leaving the map and flags
members unset.
Initializing the struct to {0,} avoids passing uninitialized values up
the call stack.
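A sketch with a stand-in extent struct; {0,} zeroes the members this
path never assigns:

```c
#include <assert.h>

/* illustrative stand-in for struct scoutfs_extent */
struct extent {
	unsigned long long start;
	unsigned long long len;
	unsigned long long map;
	unsigned char flags;
};

static struct extent find_extent(unsigned long long start,
				 unsigned long long len)
{
	/* {0,} zeroes map and flags, which this path never sets */
	struct extent ext = {0,};

	ext.start = start;
	ext.len = len;
	return ext;
}
```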
Signed-off-by: Auke Kok <auke.kok@versity.com>
Several of the inconsistency error paths already correctly `goto out`
but this one has a `break`. This would result in doing a whole lot of
work on corrupted data.
Make this error path go to `out` instead as the others do.
Signed-off-by: Auke Kok <auke.kok@versity.com>
In these two error conditions we explicitly set `ret = -EIO` but then
`break`, which immediately sets `ret = 0` again, masking a critical
error code that should be returned.
Instead, `goto out` retains the EIO error value for the caller.
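The control flow, sketched in userspace:

```c
#include <assert.h>
#include <errno.h>

/*
 * Sketch of the two shapes: `break` used to fall through into the
 * `ret = 0` below the loop, masking the error; `goto out` keeps it.
 */
static int walk(int corrupt)
{
	int ret;
	int i;

	for (i = 0; i < 4; i++) {
		if (corrupt && i == 2) {
			ret = -EIO;
			goto out;	/* was `break`, which lost -EIO */
		}
	}
	ret = 0;
out:
	return ret;
}
```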
Signed-off-by: Auke Kok <auke.kok@versity.com>
The value of `ret` is not initialized. If the writeback list is empty,
or if igrab() fails on the only inode on the list, the value of `ret`
is returned without being initialized. This would cause the caller to
needlessly retry, possibly making things worse.
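A sketch of the fix; the inode list and igrab() are stubbed with an int
array:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Start ret at 0 so the empty-list path (and the path where the
 * only inode can't be grabbed) reports success instead of returning
 * stack garbage.
 */
static int writeback_list(const int *inodes, size_t nr)
{
	int ret = 0;	/* was uninitialized: garbage on the empty path */
	size_t i;

	for (i = 0; i < nr; i++) {
		if (inodes[i] < 0)
			continue;	/* stand-in for igrab() failing */
		ret = 0;		/* stand-in for successful writeback */
	}
	return ret;
}
```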
Signed-off-by: Auke Kok <auke.kok@versity.com>
We shouldn't copy the entire _dirent struct and then copy the name in
again right after; stop at offsetof(struct, name) instead.
Now that we're no longer copying the uninitialized name[3] from ent,
there is no possible 1-byte leak here anymore, either.
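The copy pattern, sketched with a stand-in dirent struct:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* illustrative dirent shape with a short trailing name array */
struct dent {
	unsigned long long ino;
	unsigned char type;
	char name[3];
};

static void copy_dent(struct dent *dst, const struct dent *src,
		      const char *name, size_t name_len)
{
	/* copy only the header, not src's possibly stale name[] */
	memcpy(dst, src, offsetof(struct dent, name));
	memcpy(dst->name, name, name_len);
}
```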
Signed-off-by: Auke Kok <auke.kok@versity.com>
Ensure that we reschedule even if this happens. Maybe it'll recover. If
not, we'll have other issues elsewhere first.
Signed-off-by: Auke Kok <auke.kok@versity.com>