A few callers of alloc_move_empty in the server were providing a budget
that was too small. Recent changes to extent_mod_blocks increased the
max budget that is necessary to move extents between btrees. The
existing WAG of 100 was too small for trees of height 2 and 3. This
caused looping in production.
We can increase the move budget to half the overall commit budget, which
leaves room for a height of around 7 each. This is much greater than we
see in practice because the size of the per-mount btrees is effectively
limited by both watermarks and thresholds to commit and drain.
Signed-off-by: Zach Brown <zab@versity.com>
Tests that cause client retries can fail with this error
from server_commit_log_merge():
error -2 committing log merge: getting merge status item
This can happen if the server has already committed and resolved
the log merge that is being retried. We can safely ignore ENOENT here
just like we do a few lines later.
Signed-off-by: Chris Kirby <ckirby@versity.com>
The server's commit_log_trees has an error message that includes the
source of the error, but it's not used for all errors. The WARN_ON is
redundant with the message and is removed because it isn't filtered out
when we see errors from forced unmount.
Signed-off-by: Zach Brown <zab@versity.com>
Silence another error warning and assertion that's assuming that the
result of the errors is going to be persistent. When we're forcing an
unmount we've severed storage and networking.
Signed-off-by: Zach Brown <zab@versity.com>
Assembling a srch compaction operation creates an item and populates it
with allocator state. It doesn't cleanly unwind the allocation and undo
the compaction item change if allocation filling fails and issues a
warning.
This warning isn't needed if the error shows that we're in forced
unmount. The inconsistent state won't be applied, it will be dropped on
the floor as the mount is torn down.
Signed-off-by: Zach Brown <zab@versity.com>
The log merging process is meant to provide parallelism across workers
in mounts. The idea is that the server hands out a bunch of concurrent
non-intersecting work that's based on the structure of the stable input
fs_root btree.
The nature of the parallel work (cow of the blocks that intersect a key
range) means that the ranges of concurrently issued work can't overlap
or the work will all cow the same input blocks, freeing that input
stable block multiple times. We're seeing this in testing.
Correctness was intended by having an advancing key that sweeps sorted
ranges. Duplicate ranges would never be hit as the key advanced past
each it visited. This was broken by the mapping of the fs item keys to
log merge tree keys by clobbering the sk_zone key value. It effectively
interleaves the ranges of each zone in the fs root (meta indexes,
orphans, fs items). With just the right log merge conditions that
involve logged items in the right places and partially completed work to
insert remaining ranges behind the key, ranges can be stored at mapped
keys that end up with ranges out of order. The server iterates over
these and ends up issuing overlapping work, which results in duplicated
frees of the input blocks.
The fix, without changing the format of the stored log tree items, is to
perform a full sweep of all the range items and determine the next item
by looking at the full precision stored keys. This ensures that the
processed ranges always advance and never overlap.
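
A minimal userspace sketch of the sweep described above, not the server
code itself. The struct key, struct range, and next_range() names are
hypothetical, and the array order stands in for the lossy mapped-key
order in which the range items are stored.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical full-precision key: zone first, then an offset in the zone. */
struct key {
	uint8_t zone;
	uint64_t off;
};

/* Hypothetical stored range item. */
struct range {
	struct key start;
	struct key end;
};

static int key_cmp(const struct key *a, const struct key *b)
{
	if (a->zone != b->zone)
		return a->zone < b->zone ? -1 : 1;
	if (a->off != b->off)
		return a->off < b->off ? -1 : 1;
	return 0;
}

/*
 * Sweep every range and return the one with the smallest full-precision
 * start key that is not before *pos, or NULL if none remain.
 */
static const struct range *next_range(const struct range *ranges, size_t nr,
				      const struct key *pos)
{
	const struct range *best = NULL;
	size_t i;

	for (i = 0; i < nr; i++) {
		if (key_cmp(&ranges[i].start, pos) < 0)
			continue;
		if (!best || key_cmp(&ranges[i].start, &best->start) < 0)
			best = &ranges[i];
	}
	return best;
}
```

The caller would advance pos past the end of each returned range, so the
storage order of the items no longer affects the order of issued work.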
Signed-off-by: Zach Brown <zab@versity.com>
The data_wait_err ioctl currently requires the correct data_version
for the inode to be passed in, or else the ioctl returns -ESTALE. But
the ioctl itself is just a passthrough mechanism for notifying data
waiters, which doesn't involve the data_version at all.
Instead, we can just drop checking the value. The field remains in the
headers, but we've marked it as being ignored from now on. The reason
for the change is documented in the header file as well.
This all is a lot simpler than having to modify/rev the data_waiters
interface to support passing back the data_version, because there isn't
any space left to easily do this, and then userspace would just pass it
back to the data_wait_err ioctl.
Signed-off-by: Auke Kok <auke.kok@versity.com>
scoutfs_alloc_prepare_commit() is badly named. All it really does is
put the references to the two dirty alloc list blocks in the allocator.
It must always be called if allocation was attempted, but it's easier
to require that it always be paired with _alloc_init().
If the srch compaction worker in the client sees an error it will send
the error back to the server without writing its dirty blocks. In
avoiding the write it also avoided putting the two block references,
leading to leaked blocks. We've been seeing rare messages with leaked
blocks in tests.
Signed-off-by: Zach Brown <zab@versity.com>
The .get_acl() method now gets passed a mnt_idmap arg, and we can now
choose to implement either .get_acl() or .get_inode_acl(). Technically
.get_acl() is a new implementation, and .get_inode_acl() is the old.
That second method now also gets an rcu flag passed, but we should be
fine either way.
Deeper under the covers however we do need to hook up the .set_acl()
method for inodes, otherwise setfacl will just fail with -ENOTSUPP. To
make this not super messy (it already is) we tack on the get_acl()
changes here.
This is all roughly v6.1-rc1-4-g7420332a6ff4.
Signed-off-by: Auke Kok <auke.kok@versity.com>
Similar to before when namespaces were added, they are now translated to
a mnt_idmap, since v6.2-rc1-2-gabf08576afe3.
Signed-off-by: Auke Kok <auke.kok@versity.com>
The typical pattern of spinning while isolating a list_lru results in a
livelock if there are blocks with leaked refcounts. We're rarely seeing
this in testing.
We can have a modest array in each block that records the stack of the
caller that initially allocated the block and dump that stack for any
blocks that we're unable to shrink/isolate. Instead of spinning
shrinking, we can give it a good try and then print the blocks that
remain and carry on with unmount, leaking a few blocks. (Past events
have had 2 blocks.)
Signed-off-by: Zach Brown <zab@versity.com>
The server's srch commit error warnings were a bit severe. The
compaction operations are a function of persistent state. If they fail
then the inputs still exist and the next attempt will retry whatever
failed. Not all errors are a problem, only those that result in partial
commits that leave inconsistent state.
In particular, we have to support the case where a client retransmits a
compaction request to a new server after a first server performed the
commit but couldn't respond. Throwing warnings when the new server gets
ENOENT looking for the busy compaction item isn't helpful. This came in
tests as background compaction was in flight as tests unmounted and
mounted servers repeatedly to test lock recovery.
Signed-off-by: Zach Brown <zab@versity.com>
The block cache had a bizarre cache eviction policy that was trying to
avoid precise LRU updates at each block. It had pretty bad behaviour,
including only allowing reclaim of maybe 20% of the blocks that were
visited by the shrinker.
We can use the existing list_lru facility in the kernel to do a better
job. Blocks only exhibit contention as they're allocated and added to
per-node lists. From then on we only set accessed bits and the private
list walkers move blocks around on the list as we see the accessed bits.
(It looks more like a fifo with lazy promotion than a "LRU" that is
actively moving list items around as they're accessed.)
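
A rough userspace sketch of the "fifo with lazy promotion" idea, under
the assumption that a plain array can stand in for one per-node list.
It does not use the kernel's list_lru API; struct block and
shrink_scan() are hypothetical names for illustration only.

```c
#include <stdbool.h>
#include <stddef.h>

struct block {
	int id;
	bool accessed;		/* set cheaply on use, no list movement */
};

/*
 * Walk up to nr_scan blocks from the head of list[0..*nr). Accessed
 * blocks get their bit cleared and rotate to the tail; the others are
 * removed and stored in evicted[]. Returns the number evicted.
 */
static size_t shrink_scan(struct block **list, size_t *nr, size_t nr_scan,
			  struct block **evicted)
{
	size_t nr_evicted = 0;
	size_t i, j;

	for (i = 0; i < nr_scan && *nr > 0; i++) {
		struct block *b = list[0];

		/* remove the head by shifting the rest down */
		for (j = 1; j < *nr; j++)
			list[j - 1] = list[j];

		if (b->accessed) {
			b->accessed = false;
			list[*nr - 1] = b;	/* second chance at the tail */
		} else {
			(*nr)--;
			evicted[nr_evicted++] = b;
		}
	}
	return nr_evicted;
}
```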
Using the facility means changing how we remove blocks from the cache
and hide them from lookup. We clean up the refcount inserted flag a bit
to be expressed more as a base refcount that can be acquired by
whoever's removing from the cache. It seems a lot clearer.
Signed-off-by: Zach Brown <zab@versity.com>
Add kernelcompat helpers for initial use of list_lru for shrinking. The
most complicated part is the walk callback type changing.
Signed-off-by: Zach Brown <zab@versity.com>
Readers can read a set of items that is stale with respect to items that
were dirtied and written under a local cluster lock after the read
started.
The active reader mechanism addressed this by refusing to shrink pages
that could contain items that were dirtied while any readers were in
flight. Under the right circumstances this can result in refusing to
shrink quite a lot of pages indeed.
This changes the mechanism to allow pages to be reclaimed, and instead
forces stale readers to retry. The gamble is that reads are much faster
than writes. A small fraction should have to retry, and when they do
they can be satisfied by the block cache.
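
The retry can be pictured with a sequence counter that readers sample
before and after reading. This is a single-threaded sketch with no
memory barriers; struct cache, dirty_seq, and the racer hook are all
assumptions made for illustration, not the names used in the item cache.

```c
#include <stdint.h>

/* Hypothetical cache state: a counter bumped whenever items are dirtied. */
struct cache {
	uint64_t dirty_seq;
	int value;
};

/* Hypothetical hook that lets a caller simulate a concurrent writer. */
typedef void (*racer_fn)(struct cache *c, int attempt);

/* Example racer: dirties the cache during the first attempt only. */
static void write_once(struct cache *c, int attempt)
{
	if (attempt == 0) {
		c->value = 2;
		c->dirty_seq++;
	}
}

static int read_retry(struct cache *c, racer_fn racer, int *attempts)
{
	uint64_t seq;
	int val;
	int n = 0;

	do {
		seq = c->dirty_seq;	/* sample before reading */
		val = c->value;
		if (racer)
			racer(c, n);	/* a writer may slip in here */
		n++;
	} while (seq != c->dirty_seq);	/* stale: discard and retry */

	*attempts = n;
	return val;
}
```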
Signed-off-by: Zach Brown <zab@versity.com>
The default TCP keepalive value is currently 10s, resulting in clients
being disconnected after 10 seconds of not replying to a TCP keepalive
packet. These keepalive values are reasonable most of the time, but
we've seen client disconnects where this timeout has been exceeded,
resulting in fencing. The cause for this is unknown at this time, but it
is suspected that network intermissions are happening.
This change adds a configurable value for this specific client socket
timeout. It enforces that its value is above UNRESPONSIVE_PROBES, whose
value remains unchanged.
The default value of 10000ms (10s) is changed to 60s. This is the value
we're assuming is much better suited for customers and has been briefly
trialed, showing that it may help to avoid network level interruptions
better.
Signed-off-by: Auke Kok <auke.kok@versity.com>
It's possible that scoutfs_net_alloc_conn() fails due to -ENOMEM, which
is legitimately a failure, thus the code here releases the sock again.
But the code block here sets `ret = -ENOMEM` and then restarts the loop,
which immediately sets `ret = kernel_accept()`, thus overwriting the
-ENOMEM error value.
We can argue that an ENOMEM error situation here is not catastrophic.
If this is the first time that we're receiving an ENOMEM situation here
while trying to accept a new client, we can just release the socket and
wait for the client to try again. If the kernel at that point still is
out of memory to handle the new incoming connection, that will then
cascade down and clean up the whole listener at that point.
The alternative is to let this error path unwind out and break down the
listener immediately, something the code today doesn't do. We're
therefore keeping the behavior the same.
I've opted therefore to replace the `ret = -ENOMEM` assignment with a
comment explaining why we're ignoring the error situation here.
Signed-off-by: Auke Kok <auke.kok@versity.com>
If scoutfs_send_omap_response fails for any reason, req is NULL and we
would hit a hard NULL deref during unwinding.
Signed-off-by: Auke Kok <auke.kok@versity.com>
This function returns a stack pointer to a struct scoutfs_extent, after
setting start, len to an extent found in the proper zone, but it leaves
map and flags members unset.
Initializing the struct to {0,} avoids passing uninitialized values up the
callstack.
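
A small sketch of the initialization, assuming a hypothetical struct
extent with the four members named above; the real struct
scoutfs_extent layout and the real function's signature may differ.

```c
#include <stdint.h>

/* Hypothetical stand-in for struct scoutfs_extent. */
struct extent {
	uint64_t start;
	uint64_t len;
	uint64_t map;
	uint8_t flags;
};

static struct extent find_extent(uint64_t start, uint64_t len)
{
	struct extent ext = {0,};	/* map and flags start out as zero */

	ext.start = start;
	ext.len = len;
	return ext;
}
```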
Signed-off-by: Auke Kok <auke.kok@versity.com>
Several of the inconsistency error paths already correctly `goto out`
but this one has a `break`. This would result in doing a whole lot of
work on corrupted data.
Make this error path go to `out` instead as the others do.
Signed-off-by: Auke Kok <auke.kok@versity.com>
In these two error conditions we explicitly set `ret = -EIO` but then
`break` to set `ret = 0` immediately again, masking away a critical
error code that should be returned.
Instead, `goto out` retains the EIO error value for the caller.
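
The pattern looks roughly like this. walk_items() and its "negative
means corrupt" check are made up for the sketch; only the shape of the
error path is meant to match.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical validity check: negative items count as corrupt. */
static int walk_items(const int *items, size_t nr, int *sum)
{
	int ret;
	size_t i;

	*sum = 0;
	for (i = 0; i < nr; i++) {
		if (items[i] < 0) {
			ret = -EIO;
			goto out;	/* a break here would fall into ret = 0 */
		}
		*sum += items[i];
	}

	ret = 0;
out:
	return ret;
}
```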
Signed-off-by: Auke Kok <auke.kok@versity.com>
The value of `ret` is not initialized. If the writeback list is empty,
or, if igrab() fails on the only inode on the list, the value
of `ret` is returned without being initialized. This would cause the
caller to needlessly retry, and possibly make things worse.
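
A sketch of the two cases, assuming a hypothetical struct wb_inode
whose grab_ok field stands in for igrab() succeeding:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical list entry; grab_ok stands in for igrab() succeeding. */
struct wb_inode {
	bool grab_ok;
	int write_err;
};

static int write_list(const struct wb_inode *list, size_t nr)
{
	int ret = 0;	/* defined result even if the loop body never runs */
	size_t i;

	for (i = 0; i < nr; i++) {
		if (!list[i].grab_ok)
			continue;	/* skipped entries must not leave ret unset */
		ret = list[i].write_err;
		if (ret < 0)
			break;
	}
	return ret;
}
```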
Signed-off-by: Auke Kok <auke.kok@versity.com>
We shouldn't copy the entire _dirent struct and then copy in the name
again right after, just stop at offsetof(struct, name).
Now that we're no longer copying the uninitialized name[3] from ent,
there is no more possible 1-byte leak here, too.
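
For illustration, with a hypothetical struct dirent_hdr layout (the
real dirent struct has different members), the copy would look
something like:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical dirent layout with a short placeholder name array. */
struct dirent_hdr {
	uint64_t ino;
	uint16_t name_len;
	uint8_t type;
	char name[3];
};

/*
 * Copy the fixed fields and then name_len bytes of the name into buf.
 * Returns the number of bytes written.
 */
static size_t pack_dirent(void *buf, const struct dirent_hdr *ent,
			  const char *name, size_t name_len)
{
	size_t hdr = offsetof(struct dirent_hdr, name);

	memcpy(buf, ent, hdr);			/* stops before name[] */
	memcpy((char *)buf + hdr, name, name_len);
	return hdr + name_len;
}
```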
Signed-off-by: Auke Kok <auke.kok@versity.com>
Ensure that we reschedule even if this happens. Maybe it'll recover. If
not, we'll have other issues elsewhere first.
Signed-off-by: Auke Kok <auke.kok@versity.com>
ARRAY_SIZE(...) will return `3` for this array with members from 0 to 2,
therefore arr[3] is out of bounds. The array length test is off by one
and needs fixing.
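
A sketch of the corrected bounds test, using a made-up names[] array of
three entries rather than the array in question:

```c
#include <stddef.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

static const char *const names[] = { "zero", "one", "two" };

static const char *name_for(size_t idx)
{
	/* valid indices are 0..2; "idx > ARRAY_SIZE(names)" lets 3 through */
	if (idx >= ARRAY_SIZE(names))
		return NULL;
	return names[idx];
}
```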
Signed-off-by: Auke Kok <auke.kok@versity.com>
This removes the KC_MSGHDR_STRUCT_IOV_ITER kernel compat.
kernel_{send,recv}msg() initializes either msg_iov or msg_iter.
This isn't a clean revert of "69068ae2 Initialize msg.msg_iter from
iovec." because previous patches fixed the order of arguments, and the
net send caller was removed.
Signed-off-by: Zach Brown <zab@versity.com>
Previous work had the receiver try to receive multiple messages in bulk.
This does the same for the sender.
We walk the send queue and initialize a vector that we then send with
one call. This is intentionally similar to the single message sending
pattern to avoid unintended changes.
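
As a userspace analogue only: the sketch below gathers queued messages
into an iovec array and hands them to writev() in one call. The kernel
code uses its own socket send path, and struct msg, MAX_VECS, and
send_bulk() are names invented for this example.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>

#define MAX_VECS 16	/* hypothetical cap on messages per call */

struct msg {
	const void *data;
	size_t len;
};

/* Gather up to MAX_VECS queued messages and send them with one writev(). */
static ssize_t send_bulk(int fd, const struct msg *queue, size_t nr)
{
	struct iovec iov[MAX_VECS];
	size_t i;

	if (nr > MAX_VECS)
		nr = MAX_VECS;

	for (i = 0; i < nr; i++) {
		iov[i].iov_base = (void *)queue[i].data;
		iov[i].iov_len = queue[i].len;
	}

	return writev(fd, iov, (int)nr);
}
```

A short return means only part of the vector was sent, so a real caller
would need to track how far into the queue it got.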
Along with the changes to receive in bulk this ended up increasing the
message processing rate by about 6x when both send and receive were
going full throttle.
Signed-off-by: Zach Brown <zab@versity.com>
When the msg_iter compat was added the iter was initialized with nr_segs
and count swapped. I'm not convinced this had any effect because the
kernel_{send,recv}msg() call would initialize msg_iter again with the
correct arguments.
Signed-off-by: Zach Brown <zab@versity.com>
Our messaging layer is used for small control messages, not large data
payloads. By calling recvmsg twice for every incoming message we're
hitting the socket lock reasonably hard. With senders doing the same,
and a lot of messages flowing in each direction, the contention is
non-trivial.
This changes the receiver to copy as much of the incoming stream into a
page that is then framed and copied again into individual allocated
messages that can be processed concurrently. We're avoiding contention
with the sender on the socket at the cost of additional copies of our
small messages.
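
The framing step might look like the following sketch. The two-byte
struct hdr is an assumption and is not the real wire header;
frame_messages() and the callback are likewise hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical wire header: just a payload length. */
struct hdr {
	uint16_t data_len;
};

typedef void (*msg_fn)(void *msg, size_t len, void *arg);

/* Example callback: counts messages and frees each one. */
static void count_and_free(void *msg, size_t len, void *arg)
{
	(void)len;
	(*(int *)arg)++;
	free(msg);
}

/*
 * Frame complete messages out of buf[0..len), copying each into its own
 * allocation and handing ownership to cb. Returns the bytes consumed;
 * the caller keeps any trailing partial message for the next receive.
 */
static size_t frame_messages(const uint8_t *buf, size_t len, msg_fn cb,
			     void *arg)
{
	size_t off = 0;

	while (len - off >= sizeof(struct hdr)) {
		struct hdr h;
		size_t total;
		void *msg;

		memcpy(&h, buf + off, sizeof(h));
		total = sizeof(h) + h.data_len;
		if (len - off < total)
			break;		/* partial message, wait for more */

		msg = malloc(total);
		if (!msg)
			break;
		memcpy(msg, buf + off, total);
		cb(msg, total, arg);
		off += total;
	}
	return off;
}
```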
Signed-off-by: Zach Brown <zab@versity.com>
The lock client has a requirement that it can't handle some messages
being processed out of order. Previously it had detected message
ordering itself, but had missed some cases. Receive processing was then
changed to always call lock message processing from the recv work to
globally order all lock messages.
This inline processing was contributing to excessive latencies in making
our way through the incoming receive queue, delaying work that would
otherwise be parallel once we got it off the recv queue.
This was seen in practice as a giant flood of lock shrink messages
arrived at the client. It processed each in turn, starving a statfs
response long enough to trigger the hung task warning.
This fix does two things.
First, it moves ordered recv processing out of the recv work. It lets
the recv work drain the socket quickly and turn it into a list that the
ordered work is consuming. Other messages will have a chance to be
received and queued to their processing work without having to wait for
the ordered work to be processed.
Secondly, it adds parallelism to the ordered processing. The incoming
lock messages don't need global ordering, they need ordering within each
lock. We add an arbitrary but reasonable number of ordered workers and
hash lock messages to each worker based on the lock's key.
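
A sketch of the key-to-worker mapping. The worker count of 8 and the
FNV-1a hash are stand-ins chosen for the example; the commit doesn't
say which count or hash function is used.

```c
#include <stddef.h>
#include <stdint.h>

#define NR_ORDERED_WORKERS 8	/* hypothetical worker count */

/* FNV-1a over the key bytes; a stand-in for whatever hash the code uses. */
static uint32_t hash_key(const void *key, size_t len)
{
	const uint8_t *p = key;
	uint32_t h = 2166136261u;
	size_t i;

	for (i = 0; i < len; i++) {
		h ^= p[i];
		h *= 16777619u;
	}
	return h;
}

/* Messages for the same lock key always land on the same worker. */
static unsigned int worker_for_key(const void *key, size_t len)
{
	return hash_key(key, len) % NR_ORDERED_WORKERS;
}
```

Per-lock ordering holds because each worker processes its own list in
arrival order, while different locks can proceed on different workers.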
Signed-off-by: Zach Brown <zab@versity.com>
Make sure to log an error if the SCOUTFS_QUORUM_EVENT_END
update_quorum_block() call fails in scoutfs_quorum_worker().
Correctly print if the reader or writer failed when logging errors
in update_quorum_block().
Signed-off-by: Chris Kirby <ckirby@versity.com>
During log compaction, the SRCH_COMPACT_LOGS_PAD_SAFE trigger was
generating inode numbers that were not in sorted order. This resulted
in later failures during srch-basic-functionality, because we were
winding up with out of order first/last pairs and merging incorrectly.
Instead, reuse the single entry in the block repeatedly, generating
zero-padded pairs of this entry that are interpreted as create/delete
and vanish during searching and merging. These aren't encoded in the
normal way, but the extra zeroes are ignored during the decoding phase.
Signed-off-by: Chris Kirby <ckirby@versity.com>
Make sure that the orphan scanners can see deletions after forced unmounts
by waiting for reclaim_open_log_tree() to run on each mount; and waiting for
finalize_and_start_log_merge() to run and not find any finalized trees.
Do this by adding two new counters: reclaimed_open_logs and
log_merge_no_finalized and fixing the orphan-inodes test to check those
before waiting for the orphan scanners to complete.
Signed-off-by: Chris Kirby <ckirby@versity.com>
Tests such as quorum-heartbeat-timeout were failing with EIO messages in
dmesg output due to expected errors during forced unmount. Use ENOLINK
instead, and filter all errors from dmesg with this errno (67).
Signed-off-by: Chris Kirby <ckirby@versity.com>
The iput worker can accumulate quite a bit of pending work to do. We've
seen hung task warnings while it's doing its work (admittedly in debug
kernels). There's no harm in throwing in a cond_resched so other tasks
get a chance to do work.
Signed-off-by: Zach Brown <zab@versity.com>
It's possible for the quorum worker to be preempted for a long period,
especially on debug kernels. Since we only check for how much time
has passed, it's possible for a clean receive to inadvertently
trigger an election. This can cause the quorum-heartbeat-timeout
test to fail due to observed delays outside of the expected bounds.
Instead, make sure we had a receive failure before comparing timestamps.
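
The resulting check can be sketched as a predicate. The function and
parameter names are hypothetical, as is the millisecond unit:

```c
#include <stdbool.h>
#include <stdint.h>

static bool should_start_election(bool recv_failed, uint64_t now_ms,
				  uint64_t last_recv_ms, uint64_t timeout_ms)
{
	/* a clean receive never triggers an election, however late we ran */
	if (!recv_failed)
		return false;
	return now_ms - last_recv_ms >= timeout_ms;
}
```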
Signed-off-by: Chris Kirby <ckirby@versity.com>
In finalize_and_start_log_merge(), we overwrite the server
mount's log tree with its finalized form and then later write out
its next open log tree. This leaves a window where the mount's
srch_file is nulled out, causing us to lose any search items in
that log tree.
This shows up as intermittent failures in the srch-basic-functionality
test.
Eliminate this timing window by doing what unmount/reclaim does when
it finalizes, by moving the resources from the item that we finalize
into server trees/items as it finalizes. Then there is no window
where those resources exist only in memory until we create another
transaction.
Signed-off-by: Chris Kirby <ckirby@versity.com>
It's entirely likely that the trigger here is munched by a read on a
dirty block from any unrelated or background read. Avoid that by putting
the trigger at the end of the condition list.
Now that the order is swapped, we have to avoid a null deref in
block_is_dirty(bp) here, as well.
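
The ordering matters because && short-circuits. In the sketch below the
trigger is a one-shot flag and the preceding conditions (a non-NULL,
clean block) are assumptions about what the real condition list checks;
all names other than block_is_dirty() are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

struct block {
	bool dirty;
};

static bool trigger_armed;	/* hypothetical one-shot test trigger */

/* Consumes the trigger: returns true once, then disarms. */
static bool test_and_clear_trigger(void)
{
	bool was = trigger_armed;

	trigger_armed = false;
	return was;
}

static bool block_is_dirty(const struct block *bp)
{
	return bp->dirty;
}

static bool should_inject(const struct block *bp)
{
	/* the trigger is last, so unrelated reads can't consume it */
	return bp != NULL && !block_is_dirty(bp) && test_and_clear_trigger();
}
```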
Signed-off-by: Auke Kok <auke.kok@versity.com>
The issue with the previous attempt to fix the orphan-inodes test was
that we would regularly exceed the 120s timeout value put in there.
Instead, in this commit, we change the code to add a new counter to
indicate orphan deletion progress. When orphan inodes are deleted, the
increment of this counter indicates progress happened. Inversely,
every time the counter doesn't increment, and the orphan scan attempts
counter increments, we know that there was no more work to be done.
For safety, we wait until 2 consecutive scan attempts were made without
forward progress in the test case.
Signed-off-by: Auke Kok <auke.kok@versity.com>
The try_drain_data_freed() path was generating errors about overrunning
its commit budget:
scoutfs f.2b8928.r.02689f error: 1 holders exceeded alloc budget av: bef 8185 now 8036, fr: bef 8185 now 7602
The budget overrun check was using the current number of commit holders
(in this case one) instead of the maximum number of concurrent holders
(in this case two). So even well behaved paths like try_drain_data_freed()
can appear to exceed their commit budget if other holders dirty some blocks
and apply their commits before the try_drain_data_freed() thread does its
final budget reconciliation.
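
To make the arithmetic concrete: the message above shows the "fr" count
dropping from 8185 to 7602, or 583 blocks. The sketch below uses a
made-up per-holder budget of 500 purely for illustration (the real
budget is not stated here) to show how the holder count changes the
result.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if the blocks consumed exceed what the given holders may use. */
static bool budget_exceeded(uint64_t before, uint64_t now,
			    unsigned int holders, uint64_t per_holder)
{
	return before - now > (uint64_t)holders * per_holder;
}
```

With the hypothetical budget of 500, budget_exceeded(8185, 7602, 1, 500)
is true while budget_exceeded(8185, 7602, 2, 500) is false, which is the
difference between checking against the current holder and checking
against the maximum number of concurrent holders.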
Signed-off-by: Chris Kirby <ckirby@versity.com>
Free extents are stored in two btrees: one sorted by block number, one
by size. So if you insert a new extent between two existing extents, you can
be modifying two items in the by-block-number tree. And depending on the size
of those items, that can result in three items over in the by-size tree.
So that's a 5x multiplier per level.
If we're shrinking the tree and adding more freed blocks, we're conceptually
dirtying two blocks at each level to merge. (current *2 in the code).
But if they fall under the low water mark then one of them is freed, so we
can have *3 per level in this case.
Signed-off-by: Chris Kirby <ckirby@versity.com>