scoutfs_client_destroy nulled client->conn before scoutfs_net_free_conn
had a chance to drain the conn's workqueue. An in-flight proc_worker
running client_lock_recover dispatches scoutfs_lock_recover_request
synchronously, which in turn calls scoutfs_client_lock_recover_response.
That helper reads client->conn and hands it to scoutfs_net_response, so
a racing NULL made submit_send dereference conn->lock and trip a KASAN
null-ptr-deref followed by a GPF.
The race only became reachable in practice once reconnect started
draining pending client requests with -ECONNRESET, because the
farewell can now return while the server is still sending requests on
the re-established socket.
Reorder so scoutfs_net_free_conn runs first; its shutdown_worker drains
conn->workq before any memory is freed, then client->conn is nulled.
The original intent of nulling to catch buggy late callers is preserved.
Signed-off-by: Auke Kok <auke.kok@versity.com>
The srch compact and orphan scan workers called sync request RPCs that
would block indefinitely if the server stopped answering. Both workers
are idempotent and reschedule on error, so blocking forever buys
nothing compared to treating a stalled RPC as a failure and trying
again on the next tick.
Add scoutfs_net_sync_request_timeout, a bounded-wait variant that
returns -ETIMEDOUT if the response doesn't arrive in time. Response
state lives on a refcounted heap allocation rather than the caller's
stack so a late callback can't scribble into freed memory. On timeout
we race with an arriving response for the msend under conn->lock: if
find_request wins we queue_dead_free and drop the callback's ref;
otherwise we wait for the in-flight callback to complete before
returning.
Add _timeout typed wrappers for the four RPCs these workers use and
thread a 5-minute bound in from each worker. All other callers keep
the unbounded client_sync_request path with its reconnect retries.
Signed-off-by: Auke Kok <auke.kok@versity.com>
Previously, client_greeting spliced pending requests back onto send_queue
when reconnecting to a new server. Those requests carried state from
the old server (sequence numbers, log tree references, lock modes) that
was reclaimed at fence time, so resending against the new server was
incorrect.
Drain pending requests with -ECONNRESET at greeting time, mirroring the
forcing_unmount drain in the shutdown worker. Thread the lock pointer
through scoutfs_client_lock_request so the response callback can clear
request_pending and wake waiters on error; otherwise a lock_key_range
waiter would block forever because the new server's lock recovery only
reports granted modes, not pending requests.
Wrap the sync request senders in client_sync_request so userspace paths
(statfs, mkdir, sysfs volopt, resize ioctl, walk-inodes ioctl) retry
transparently across failover instead of surfacing a new -ECONNRESET
that callers never saw before.
Signed-off-by: Auke Kok <auke.kok@versity.com>
block_dirty_ref() skipped setting *ref_blkno when the block was
already dirty, leaving the caller with a stale value. Set it to 0
on the already-dirty fast path so callers do not try to free a
block that was not allocated.
Signed-off-by: Auke Kok <auke.kok@versity.com>
Add a WARN_ON_ONCE check that the freed list ref blkno matches the
block header blkno after dirtying alloc blocks. Also save and restore
freed.first_nr on the error path, and initialize av_old/fr_old to 0
so the diagnostic message has valid values.
Signed-off-by: Auke Kok <auke.kok@versity.com>
retry_forever() only checked scoutfs_forcing_unmount(), so a normal
unmount with a network error in the commit path would loop forever.
Also check scoutfs_unmounting() so the write worker can exit cleanly.
Signed-off-by: Auke Kok <auke.kok@versity.com>
During normal unmount, lock_invalidate_worker can hang in
scoutfs_trans_sync(sb, 1) because the trans commit path may
return network errors that cause an infinite retry loop.
Skip full lock_invalidate() during shutdown and unmount, and
extract lock_clear_coverage() to still clean up coverage items
in those paths and in scoutfs_lock_destroy(). Without this,
coverage items can remain attached to locks being freed.
Signed-off-by: Auke Kok <auke.kok@versity.com>
Replace unbounded wait_for_completion() in scoutfs_net_sync_request()
with a 60 second timeout loop that checks scoutfs_unmounting(). Cancel
the queued request before returning -ESHUTDOWN so that sync_response
cannot fire on freed stack memory after the caller returns.
Signed-off-by: Auke Kok <auke.kok@versity.com>
The "server error emptying freed" error was causing a
fence-and-reclaim test failure. In this case, the error
was -ENOLINK, which we should ignore for messaging purposes.
Signed-off-by: Chris Kirby <ckirby@versity.com>
Add unmounting checks to lock_wait_cond() and lock_key_range() so
that lock waiters wake up and new lock requests fail with -ESHUTDOWN
during unmount. Replace the unbounded wait_event() with a 60 second
timeout to prevent indefinite hangs. Relax the WARN_ON_ONCE at
lock_key_range entry to only warn when not unmounting, since late
lock attempts during shutdown are expected.
Signed-off-by: Auke Kok <auke.kok@versity.com>
Replace unbounded wait_for_completion() with a 120 second timeout
to prevent indefinite hangs during unmount if the server never
responds to the farewell request.
Signed-off-by: Auke Kok <auke.kok@versity.com>
I'm seeing consistent CPU soft lockups in block_free_work on my bare
metal system that VM instances don't hit. The bare metal machine has
far more memory available, so the block free work queue grows much
larger, and there is then so much queued work that it can take 30+
seconds to get through it all.
This is all with a debug kernel; a non-debug kernel would likely get
through the outstanding work here at a much faster rate.
Signed-off-by: Auke Kok <auke.kok@versity.com>
block_submit_bio will return -ENOLINK if called during a forced
shutdown; the bio is never submitted, and thus no completion callback
will fire to set BLOCK_BIT_ERROR. Any other task waiting for this
specific bp will end up waiting forever.
To fix, fall through to the existing block_end_io call on the
error path instead of returning directly. That means moving
the forcing_unmount check past the setup calls so block_end_io's
bookkeeping stays balanced. block_end_io then sets BLOCK_BIT_ERROR
and wakes up waiters just as it would on a failed async completion.
Signed-off-by: Auke Kok <auke.kok@versity.com>
scoutfs_quota_mod_rule calls scoutfs_item_create/delete, which use
the transaction allocator, but it never held the transaction. Without
the hold,
a concurrent transaction commit can call scoutfs_alloc_init to
reinitialize the allocator while dirty_alloc_blocks is in the middle
of setting up the freed list block. This overwrites alloc->freed with
the server's fresh (empty) state, causing a blkno mismatch BUG_ON
in list_block_add.
Reproduced by stressing concurrent quota add/del operations across
mounts. Crashdump analysis confirms dirty_list_block COW'd a freed
block (fr_old=9842, new blkno=9852) but by the time list_block_add
ran, freed.ref.blkno was 0 with first_nr=0 and total_nr=0: the freed
list head had been zeroed by a concurrent alloc_init.
Fix by adding scoutfs_hold_trans/scoutfs_release_trans around the
item modification in scoutfs_quota_mod_rule, preventing transaction
commit from racing with the allocator use.
Rename the 'unlock' label to 'release' since 'out' now directly
does the unlock. The unlock safely handles a NULL lock.
Signed-off-by: Auke Kok <auke.kok@versity.com>
Filter empty lines from dmesg so dmesg.new doesn't trigger a test
failure, without overly broad filtering. I don't want to process
dmesg too aggressively, so do this as late as possible.
The xfs lockdep patterns can leave a stray leading/trailing empty
line behind, causing a failure despite the explicit removal of the
lockdep false positive.
Signed-off-by: Auke Kok <auke.kok@versity.com>
This already caught xfs_nondir_ilock_class, but recent CI runs
have been hitting xfs_dir_ilock_class, too.
Signed-off-by: Auke Kok <auke.kok@versity.com>
The gcc version in el7 can't determine that scoutfs_block_check_stale
won't return ret = 0 when the input ret value is < 0, and it errors
out because we might call alloc_wpage with an uninitialized read_seq.
Initialize it to 0 to avoid the warning.
Signed-off-by: Auke Kok <auke.kok@versity.com>
Readers currently accumulate all finalized log tree deltas into a
single bucket when deciding whether they are already in fs_root, but
finalized trees that aren't inputs to a current merge will have
higher seqs, so we may be double applying deltas already merged into
fs_root.
To distinguish, scoutfs_totl_merge_contribute() needs to know the
merge status item seq. We change wkic's get_roots() from using the
SCOUTFS_NET_CMD_GET_ROOTS RPC to reading the superblock directly.
This is needed because totl merge resolution has to use the same data
as the btree roots it is operating on, so we can't grab it from a
SCOUTFS_NET_CMD_GET_ROOTS packet - it is likely different.
Signed-off-by: Auke Kok <auke.kok@versity.com>
These mislabeled members and enums clearly did not describe the
actual data being handled, and they obscured the intent of keeping
merge input items separate from non-merge input items.
Before deltas were added this code path was correct, but with
deltas we can't just retry this without clearing &root, since
it would potentially double count.
This can happen when there are deltas in several finalized log trees,
we've made progress towards reading some of them, and we then
encounter a stale btree block. The retry would not clear the
collected trees, would apply the same delta as was already applied
before the retry, and thus double count.
Signed-off-by: Auke Kok <auke.kok@versity.com>
Two different clients can write deltas for totl indexes at the same
time, recording their changes. When merged, a reader should apply
both in order, and only once; the seq determines whether a delta has
already been applied.
The code fails to update the seq while walking the trees for deltas
to apply, so when processing later trees it could re-process deltas
already applied. With a large negative delta (e.g. removal of large
numbers of files), the totl value could go negative, resulting in
quota lockout.
The fix is simple: advance the seq when reading partial delta merges
to avoid double counting.
Signed-off-by: Auke Kok <auke.kok@versity.com>
Add a trigger that forces btree_merge() to return -ERANGE after
modifying a leaf's worth of items, causing many small partial merges
per merge cycle. This is used by tests to reliably reproduce races
that depend on partial merges splicing items into fs_root while
finalized logs still exist.
The trigger check lives inside btree_merge() where it can observe
actual item modification progress, rather than overriding the
caller's dirty byte limit argument which applies to the whole
writer context.
Signed-off-by: Auke Kok <auke.kok@versity.com>
merge_read_item() fails to update found->seq when combining delta items
from multiple finalized log trees. Add a test case to replicate the
conditions of this issue.
Each of 5 mounts sets totl value 1 on 2500 shared keys, giving an
expected total of 5 per key. Any total > 5 proves double-counting
from a stale seq.
The log_merge_force_partial trigger forces many partial merges per
cycle, creating the conditions where stale-seq items get spliced into
fs_root while finalized logs still exist. Parallel readers on all
mounts race against this window to detect double-counted values.
Signed-off-by: Auke Kok <auke.kok@versity.com>
We had no basic testing for `scoutfs read-xattr-index` whatsoever.
This adds the basic negative-argument tests, lifecycle tests,
deduplicated reads, and partial removal.
This exposes a bug in deletion where the indx entry isn't cleaned up
on inode delete.
Signed-off-by: Auke Kok <auke.kok@versity.com>
During inode deletion, scoutfs_xattr_drop forgot to set the xid of
the xattr after calling parse_indx_key, which hardcodes xid=0 and
leaves it as the caller's responsibility. delete_force then deletes
the wrong key, and returns no error on nonexistent keys.
So now there is a pending deletion for a nonexistent indx and an
orphan indx entry in the tree. Subsequent calls to `scoutfs
read-xattr-index` will thus return entries for deleted inodes.
Signed-off-by: Auke Kok <auke.kok@versity.com>
This xfs lockdep stack trace has at least 2 variants around
fs_reclaim, so capture it loosely here rather than matching one
exact form.
We can remove "lockdep disabled" from the $re grep -v, because it
can affect both this pattern and the kasan one.
Signed-off-by: Auke Kok <auke.kok@versity.com>
Add a reclaim_skip_finalize trigger that prevents reclaim from
setting FINALIZED on log_trees entries. The test arms this trigger,
force-unmounts a client to create an orphan, and verifies the log
merge succeeds without timeout and the orphan reclaim message
appears in dmesg.
Signed-off-by: Auke Kok <auke.kok@versity.com>
An unfinalized log_trees entry whose rid is not in mounted_clients
is an orphan left behind by incomplete reclaim. Previously this
permanently blocked log merges because the finalize loop treated it
as an active client that would never commit.
Call reclaim_open_log_tree for orphaned rids before starting a log
merge. Once reclaimed, the existing merge and freeing paths include
them normally.
Also skip orphans in get_stable_trans_seq so their open transaction
doesn't artificially lower the stable sequence.
Signed-off-by: Auke Kok <auke.kok@versity.com>
Basic testing for the punch-offline ioctl code. The tests consist of
a set of negative tests to make sure things that are expressly not
allowed fail, followed by known-expected-outcome tests that punch
holes in several patterns and verify them.
Signed-off-by: Auke Kok <auke.kok@versity.com>
A minimal punch_offline ioctl wrapper. Argument style is adopted from
stage/release.
Following the option syntax of stage/release, this calls the
punch offline ioctl, punching any offline extent within the designated
range from offset with length.
Signed-off-by: Auke Kok <auke.kok@versity.com>
Add an archive layer ioctl for converting offline extents into sparse
extents without relying on or modifying data_version. This is helpful
when working with files with very large sparse regions.
Signed-off-by: Zach Brown <zab@versity.com>
Signed-off-by: Auke Kok <auke.kok@versity.com>
The initialization here fails to clear __pad[], which leaks stack
garbage to disk. Use a struct initializer to avoid it.
Signed-off-by: Auke Kok <auke.kok@versity.com>
This allocation here currently leaks through __pad[7] which
is written to disk. Use the initializer to enforce zeroing
the pad. The name member is written right after.
Signed-off-by: Auke Kok <auke.kok@versity.com>
The caller sends the return value of this inline as a u8. If we
return -EINVAL, it maps to 234, which is outside our enum range.
Assume this was meant to return SCOUTFS_NET_ERR_EINVAL, which is a
defined constant.
Signed-off-by: Auke Kok <auke.kok@versity.com>