ARRAY_SIZE(...) will return `3` for this array with members from 0 to 2,
therefore arr[3] is out of bounds. The array length test is off by one
and needs fixing.
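As a minimal sketch of the off-by-one (the array `arr` and the helper
`arr_get()` are hypothetical names, not taken from the patch), the bounds
test has to reject any index that is not strictly less than the size:

```c
#include <stdbool.h>
#include <stddef.h>

/* Same idea as the kernel macro, minus its type checking. */
#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical array with valid indices 0, 1 and 2. */
static const int arr[3] = { 10, 20, 30 };

/*
 * A test written as "i > ARRAY_SIZE(arr)" would let i == 3 through and
 * read one element past the end.  Using ">=" rejects it.
 */
static bool arr_get(size_t i, int *val)
{
	if (i >= ARRAY_SIZE(arr))
		return false;

	*val = arr[i];
	return true;
}
```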
Signed-off-by: Auke Kok <auke.kok@versity.com>
This removes the KC_MSGHDR_STRUCT_IOV_ITER kernel compat.
kernel_{send,recv}msg() initializes either msg_iov or msg_iter.
This isn't a clean revert of "69068ae2 Initialize msg.msg_iter from
iovec." because previous patches fixed the order of arguments, and the
net send caller was removed.
Signed-off-by: Zach Brown <zab@versity.com>
Previous work had the receiver try to receive multiple messages in bulk.
This does the same for the sender.
We walk the send queue and initialize a vector that we then send with
one call. This is intentionally similar to the single message sending
pattern to avoid unintended changes.
Along with the changes to receive in bulk this ended up increasing the
message processing rate by about 6x when both send and receive were
going full throttle.
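A rough userspace sketch of the idea follows. It assumes a hypothetical
singly linked `struct queued_msg` and helper `fill_send_vec()`; the real
send queue entries and the in-kernel vector type differ:

```c
#include <stddef.h>
#include <sys/uio.h>

/* Hypothetical queued message; the real send queue entries differ. */
struct queued_msg {
	struct queued_msg *next;
	void *data;
	size_t len;
};

/*
 * Walk the queue from head and describe up to max messages in iov.
 * Returns the number of vector entries filled and stores the total
 * byte count in *total so the caller can detect a partial send.
 */
static int fill_send_vec(struct queued_msg *head, struct iovec *iov,
			 int max, size_t *total)
{
	int nr = 0;

	*total = 0;
	for (; head && nr < max; head = head->next) {
		iov[nr].iov_base = head->data;
		iov[nr].iov_len = head->len;
		*total += head->len;
		nr++;
	}
	return nr;
}
```

A caller would then issue one `writev(fd, iov, nr)` (or `sendmsg()`) call
for the whole vector instead of one send per message, and compare the
return value against `total`.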
Signed-off-by: Zach Brown <zab@versity.com>
When the msg_iter compat was added the iter was initialized with nr_segs
and count swapped. I'm not convinced this had any effect because the
kernel_{send,recv}msg() call would initialize msg_iter again with the
correct arguments.
Signed-off-by: Zach Brown <zab@versity.com>
Our messaging layer is used for small control messages, not large data
payloads. By calling recvmsg twice for every incoming message we're
hitting the socket lock reasonably hard. With senders doing the same,
and a lot of messages flowing in each direction, the contention is
non-trivial.
This changes the receiver to copy as much of the incoming stream as it
can into a page, which is then framed and copied again into individually
allocated messages that can be processed concurrently. We're avoiding contention
with the sender on the socket at the cost of additional copies of our
small messages.
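The framing step can be sketched roughly as below. The header layout
(`struct msg_hdr` with only a `data_len` field) and the helper
`frame_one()` are assumptions for illustration, not the real wire format:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical wire header; the real message header has more fields. */
struct msg_hdr {
	uint32_t data_len;	/* payload bytes following the header */
};

/*
 * Try to frame one message at the front of buf[0..avail).  Returns a
 * newly allocated copy of header plus payload and sets *consumed, or
 * returns NULL with *consumed == 0 if the message is incomplete or the
 * allocation fails.
 */
static void *frame_one(const unsigned char *buf, size_t avail, size_t *consumed)
{
	struct msg_hdr hdr;
	size_t total;
	void *msg;

	*consumed = 0;
	if (avail < sizeof(hdr))
		return NULL;
	memcpy(&hdr, buf, sizeof(hdr));	/* buf may be unaligned */

	if (hdr.data_len > avail - sizeof(hdr))
		return NULL;		/* wait for more of the stream */
	total = sizeof(hdr) + hdr.data_len;

	msg = malloc(total);
	if (!msg)
		return NULL;
	memcpy(msg, buf, total);
	*consumed = total;
	return msg;
}
```

The receiver would call this in a loop over the page, advancing by
`consumed` each time, and go back to the socket only when a partial
message is left at the end.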
Signed-off-by: Zach Brown <zab@versity.com>
The lock client has a requirement that it can't handle some messages
being processed out of order. Previously it had detected message
ordering itself, but had missed some cases. Receive processing was then
changed to always call lock message processing from the recv work to
globally order all lock messages.
This inline processing was contributing to excessive latencies in making
our way through the incoming receive queue, delaying work that would
otherwise be parallel once we got it off the recv queue.
This was seen in practice as a giant flood of lock shrink messages
arrived at the client. It processed each in turn, starving a statfs
response long enough to trigger the hung task warning.
This fix does two things.
First, it moves ordered recv processing out of the recv work. It lets
the recv work drain the socket quickly and turn it into a list that the
ordered work is consuming. Other messages will have a chance to be
received and queued to their processing work without having to wait for
the ordered work to be processed.
Secondly, it adds parallelism to the ordered processing. The incoming
lock messages don't need global ordering, they need ordering within each
lock. We add an arbitrary but reasonable number of ordered workers and
hash lock messages to each worker based on the lock's key.
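A minimal sketch of the key-to-worker mapping, assuming a worker count of
8 and an FNV-1a hash (both are placeholders; the patch picks its own
count and hash, and `ordered_worker_for()` is a hypothetical name):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical worker count; the patch picks its own value. */
#define NR_ORDERED_WORKERS 8

/* FNV-1a, standing in for whatever hash the real code uses. */
static uint32_t hash_key(const void *key, size_t len)
{
	const unsigned char *p = key;
	uint32_t h = 2166136261u;

	while (len--) {
		h ^= *p++;
		h *= 16777619u;
	}
	return h;
}

/* Messages for the same lock key always land on the same worker. */
static unsigned int ordered_worker_for(const void *key, size_t len)
{
	return hash_key(key, len) % NR_ORDERED_WORKERS;
}
```

Because the mapping depends only on the key, two messages for the same
lock are queued to the same worker and stay ordered, while messages for
different locks can be processed in parallel.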
Signed-off-by: Zach Brown <zab@versity.com>
Make sure to log an error if the SCOUTFS_QUORUM_EVENT_END
update_quorum_block() call fails in scoutfs_quorum_worker().
Correctly print if the reader or writer failed when logging errors
in update_quorum_block().
Signed-off-by: Chris Kirby <ckirby@versity.com>
During log compaction, the SRCH_COMPACT_LOGS_PAD_SAFE trigger was
generating inode numbers that were not in sorted order. This resulted
in later failures during srch-basic-functionality, because we were
winding up with out of order first/last pairs and merging incorrectly.
Instead, reuse the single entry in the block repeatedly, generating
zero-padded pairs of this entry that are interpreted as create/delete
and vanish during searching and merging. These aren't encoded in the
normal way, but the extra zeroes are ignored during the decoding phase.
Signed-off-by: Chris Kirby <ckirby@versity.com>
Make sure that the orphan scanners can see deletions after forced unmounts
by waiting for reclaim_open_log_tree() to run on each mount, and by waiting
for finalize_and_start_log_merge() to run and not find any finalized trees.
Do this by adding two new counters: reclaimed_open_logs and
log_merge_no_finalized and fixing the orphan-inodes test to check those
before waiting for the orphan scanners to complete.
Signed-off-by: Chris Kirby <ckirby@versity.com>
Tests such as quorum-heartbeat-timeout were failing with EIO messages in
dmesg output due to expected errors during forced unmount. Use ENOLINK
instead, and filter all errors from dmesg with this errno (67).
Signed-off-by: Chris Kirby <ckirby@versity.com>
This test compiles an earlier commit from the tree that is starting to
fail due to various changes on the OS level, most recently due to sparse
issues with newer kernel headers. This problem will likely increase
in the future as we add more supported releases.
We opt to only run this test on el7 for now. While we could have
made this skip sparse checks that fail it on el8, it will suffice at
this point if this just works on one of the supported OS versions
during testing.
Signed-off-by: Auke Kok <auke.kok@versity.com>
The iput worker can accumulate quite a bit of pending work to do. We've
seen hung task warnings while it's doing its work (admittedly in debug
kernels). There's no harm in throwing in a cond_resched so other tasks
get a chance to do work.
Signed-off-by: Zach Brown <zab@versity.com>
It's possible for the quorum worker to be preempted for a long period,
especially on debug kernels. Since we only check for how much time
has passed, it's possible for a clean receive to inadvertently
trigger an election. This can cause the quorum-heartbeat-timeout
test to fail due to observed delays outside of the expected bounds.
Instead, make sure we had a receive failure before comparing timestamps.
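In sketch form, with a hypothetical helper `heartbeat_timed_out()` and
millisecond timestamps that are not taken from the patch, the check
becomes:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: decide whether the heartbeat timeout has really
 * expired.  A successful receive means the leader is alive even if this
 * thread was preempted past the deadline, so only a failed receive is
 * allowed to consult the clock.
 */
static bool heartbeat_timed_out(bool recv_failed, uint64_t now_ms,
				uint64_t deadline_ms)
{
	return recv_failed && now_ms >= deadline_ms;
}
```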
Signed-off-by: Chris Kirby <ckirby@versity.com>
In finalize_and_start_log_merge(), we overwrite the server
mount's log tree with its finalized form and then later write out
its next open log tree. This leaves a window where the mount's
srch_file is nulled out, causing us to lose any search items in
that log tree.
This shows up as intermittent failures in the srch-basic-functionality
test.
Eliminate this timing window by doing what unmount/reclaim does when
it finalizes, by moving the resources from the item that we finalize
into server trees/items as it finalizes. Then there is no window
where those resources exist only in memory until we create another
transaction.
Signed-off-by: Chris Kirby <ckirby@versity.com>
It's entirely likely that the trigger here is munched by a read on a
dirty block from any unrelated or background read. Avoid that by putting
the trigger at the end of the condition list.
Now that the order is swapped, we have to avoid a null deref in
block_is_dirty(bp) here, as well.
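A userspace sketch of why the ordering matters. All names here are
hypothetical stand-ins, and it assumes the intent is that the trigger
should only be consumed for a valid block that is not dirty:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the block and the one-shot trigger. */
struct block {
	bool dirty;
};

static bool trigger_armed;

/* Testing the trigger consumes it, like a one-shot debugging trigger. */
static bool test_and_clear_trigger(void)
{
	bool fired = trigger_armed;

	trigger_armed = false;
	return fired;
}

static bool block_is_dirty(const struct block *bp)
{
	return bp->dirty;
}

/*
 * && evaluates left to right and stops at the first false operand, so
 * the NULL check guards the dereference and the trigger is only tested,
 * and therefore only consumed, for a valid block that is not dirty.
 */
static bool should_inject(const struct block *bp)
{
	return bp != NULL && !block_is_dirty(bp) && test_and_clear_trigger();
}
```

With the trigger first in the list, a read of any block would clear it
even when the rest of the condition then fails.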
Signed-off-by: Auke Kok <auke.kok@versity.com>
The issue with the previous attempt to fix the orphan-inodes test was
that we would regularly exceed the 120s timeout value put in there.
Instead, in this commit, we change the code to add a new counter to
indicate orphan deletion progress. When orphan inodes are deleted, the
increment of this counter indicates progress happened. Inversely,
every time the counter doesn't increment, and the orphan scan attempts
counter increments, we know that there was no more work to be done.
For safety, we wait until 2 consecutive scan attempts were made without
forward progress in the test case.
Signed-off-by: Auke Kok <auke.kok@versity.com>
This reverts commit 138c7c6b49.
The timeout value here is still exceeded by CI test jobs, and thus
causing the test to fail.
Signed-off-by: Auke Kok <auke.kok@versity.com>
Adjusting hung_task_timeout_secs is still needed for this test to pass
with a debug kernel. But the logic belongs on the platform side.
Signed-off-by: Chris Kirby <ckirby@versity.com>
The try_drain_data_freed() path was generating errors about overrunning
its commit budget:
scoutfs f.2b8928.r.02689f error: 1 holders exceeded alloc budget av: bef 8185 now 8036, fr: bef 8185 now 7602
The budget overrun check was using the current number of commit holders
(in this case one) instead of the maximum number of concurrent holders
(in this case two). So even well behaved paths like try_drain_data_freed()
can appear to exceed their commit budget if other holders dirty some blocks
and apply their commits before the try_drain_data_freed() thread does its
final budget reconciliation.
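A simplified sketch of the corrected comparison. It assumes a per-holder
budget of 500 blocks and treats the drop in the `fr` count above
(8185 - 7602 = 583) as the consumption being checked; how the real check
combines its counters is not shown here, and `exceeded_budget()` is a
hypothetical name:

```c
#include <stdbool.h>

/* Assumed per-holder budget in blocks; stands in for COMMIT_HOLD_ALLOC_BUDGET. */
#define HOLD_BUDGET 500

/*
 * consumed is the number of blocks used since the commit opened.
 * Blocks may have been consumed by holders that have already released,
 * so the limit has to scale with the most holders that were ever
 * active at once, not with the number still holding.
 */
static bool exceeded_budget(unsigned int consumed, unsigned int max_holders)
{
	return consumed > max_holders * HOLD_BUDGET;
}
```

Under these assumptions 583 blocks trips the check when scaled by the one
current holder, but not when scaled by the two holders that were
concurrently active.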
Signed-off-by: Chris Kirby <ckirby@versity.com>
Free extents are stored in two btrees: one sorted by block number, one
by size. So if you insert a new extent between two existing extents, you can
be modifying two items in the by-block-number tree. And depending on the size
of those items, that can result in three items over in the by-size tree.
So that's a 5x multiplier per level.
If we're shrinking the tree and adding more freed blocks, we're conceptually
dirtying two blocks at each level to merge (currently *2 in the code).
But if they fall under the low water mark then one of them is freed, so we
can have *3 per level in this case.
Signed-off-by: Chris Kirby <ckirby@versity.com>
On el8, sparse is at 0.6.4 in epel-release, but it fails with:
```
[SP src/util.c]
src/util.c: note: in included file (through /usr/include/sys/stat.h):
/usr/include/bits/statx.h:30:6: error: not a function <noident>
/usr/include/bits/statx.h:30:6: error: bad constant expression type
```
This is due to us needing O_DIRECT from <fcntl.h>, so we set _GNU_SOURCE
before including it, but this causes (through _USE_GNU in sys/stat.h)
statx.h to be included, and that has __has_include, and sparse is too
dumb to understand it.
Just shut it up.
Signed-off-by: Auke Kok <auke.kok@versity.com>
This fixes a potential fence post failure like the following:
error: 1 holders exceeded alloc budget av: bef 7407 now 7392, fr: bef 8185 now 7672
The code is only accounting for the freed btree blocks, not the dirtying of
other items. So it's possible to be at exactly (COMMIT_HOLD_ALLOC_BUDGET / 2),
dirty some log btree blocks, loop again, then consume another
(COMMIT_HOLD_ALLOC_BUDGET / 2) and blow past the total budget.
In this example, we went over by 13 blocks.
By only consuming up to 1/8 of the budget on each loop, and committing when we
have consumed 3/4 of the budget, we can avoid the fence post condition.
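The arithmetic can be sketched as a small simulation. It assumes a budget
of 500 blocks, and the helper names are hypothetical; `extra` models the
unaccounted dirtying of other items on each pass:

```c
#include <stdbool.h>

/* Assumed budget in blocks; stands in for COMMIT_HOLD_ALLOC_BUDGET. */
#define BUDGET 500

/* Most blocks one loop iteration may free (integer division). */
static unsigned int per_loop_limit(void)
{
	return BUDGET / 8;
}

/* Commit before starting another iteration once 3/4 is used. */
static bool should_commit(unsigned int consumed)
{
	return consumed >= (BUDGET * 3) / 4;
}

/*
 * Worst case consumption when every iteration uses its full limit
 * plus `extra` unaccounted blocks of other dirtying.
 */
static unsigned int worst_case(unsigned int extra)
{
	unsigned int consumed = 0;

	while (!should_commit(consumed))
		consumed += per_loop_limit() + extra;
	return consumed;
}
```

The last iteration before a commit starts below 3/4 of the budget and
adds at most 1/8 plus the unaccounted blocks, which leaves roughly 1/8 of
the budget as slack.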
Signed-off-by: Chris Kirby <ckirby@versity.com>
The `-R` option will shuffle the order in which tests are executed.
The testing order shouldn't affect the outcome of any of the tests, but
in practice many of these tests will execute code slightly differently
based on the history of the filesystem, resources allocated, memory
usage etc. of tests that were executed before. Shuffling the order of
tests therefore introduces small semi-random variations in the
environment.
The xfstests test is the only one that can't be shuffled yet into the
mix, so it is kept at the end. This is because it leaves the filesystems
unmounted. At a later point we may want to address this.
Signed-off-by: Auke Kok <auke.kok@versity.com>
Fail the build if we don't check with sparse in both the kernel and
userspace utils. Add a filtering wrapper to the kernel build so that we
have a place to filter out uninteresting errors from kernel sources that
we're building against.
Signed-off-by: Zach Brown <zab@versity.com>
This is another example of refactoring a loop to avoid sparse warnings
from doing something in the else of a failed trylock if. We want to
drop and reacquire the lock if the trylock fails so we do it every loop
iteration. This shouldn't be experiencing much contention because most
of the cov users are usually done under locks and invalidation has
excluded lock holders. So the additional lock and unlock noise should
be local.
Signed-off-by: Zach Brown <zab@versity.com>
scoutfs_item_write_done() acquires the cinf dirty_lock and pg rwlock out
of order. It uses a trylock to detect failure and back off of both
before retrying.
sparse seems to have some peculiar sensitivity to following the else
branch from a failed trylock while already in a context. Doing that
consistently triggered the spurious mismatched context warning.
This refactors the loop to always drop and reacquire the dirty_lock
after attempting the trylock. It's not great, but this shouldn't be very
contended because the transaction write has serialized write lock
holders that would be trying to dirty items. The silly lock noise will
be mostly cached.
Signed-off-by: Zach Brown <zab@versity.com>
Looks like the compiler isn't smart enough to understand the pass by
pointer value, and we can initialize it here easily.
make[1]: Entering directory '/usr/src/kernels/5.14.0-503.26.1.el9_5.x86_64'
CC [M] /home/auke/scoutfs/kmod/src/server.o
/home/auke/scoutfs/kmod/src/server.c: In function ‘fence_pending_recov_worker’:
/home/auke/scoutfs/kmod/src/server.c:4170:23: error: ‘addr.v4.addr’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
4170 | ret = scoutfs_fence_start(sb, rid, le32_to_be32(addr.v4.addr),
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
4171 | SCOUTFS_FENCE_CLIENT_RECOVERY);
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors
There's still the obvious issue here that we'd intended to support ipv6
but just disregard that here.
Signed-off-by: Auke Kok <auke.kok@versity.com>
Occasionally, we have some tests fail because these kills produce:
tests/lock-recover-invalidate.sh: line 42: 9928 Terminated
Even though we expected them to be silent. In these particular cases we
already don't care about this output.
We borrow the silent_kill() function from orphan-inodes and promote it
to t_silent_kill() in funcs/exec.sh, and then use it everywhere where
appropriate.
Signed-off-by: Auke Kok <auke.kok@versity.com>
The current test sequence performs the unlink and immediately tests
whether enough resources are available to create new files again, and
this consistently fails.
One of my crummy VMs takes a good 12 seconds before the `touch` actually
succeeds. We care about the filesystem eventually returning from ENOSPC,
and certainly we don't want it to take forever, but there is a period
after our first ENOSPC error and cleanup during which we expect creates
to keep failing with ENOSPC for a bit longer.
Make the timeout 120s. As soon as the `touch` completes, exit the wait
loop.
Signed-off-by: Auke Kok <auke.kok@versity.com>
If run without `-m` (explicit mkfs) in subsequent testing, old test
data files may break several tests. Most failures are -EEXIST, but
there are some more subtle ones.
This change erases any existing test dir as needed just before we
run the tests, and avoids the issue entirely.
I considered doing a `mv dir dir.$$ && rm -rf dir.$$ &` alternative
solution but that likely will interfere disproportionately with
tests that do disconnects and other things that can be impacted by an
unlink storm.
This has an obvious performance aspect - tests will be a little
slower to start on subsequent runs. In CI, this will effectively be
a no-op though.
Signed-off-by: Auke Kok <auke.kok@versity.com>