Compare commits

...

81 Commits

Author SHA1 Message Date
Zach Brown
22b5e79bbd v1.26 Release
Finish the release notes for the 1.26 release.

Signed-off-by: Zach Brown <zab@versity.com>
2025-11-17 14:42:14 -08:00
Zach Brown
259e639271 Merge pull request #262 from versity/zab/ino_alloc_per_lock
Add ino_alloc_per_lock option
2025-11-14 13:57:49 -08:00
Zach Brown
4d66c38c71 Remove redundant WARN in commit_log_trees
The server's commit_log_trees has an error message that includes the
source of the error, though the message isn't emitted for all errors.
The WARN_ON is redundant with that message and is removed because it
isn't filtered out when we see errors from forced unmount.

Signed-off-by: Zach Brown <zab@versity.com>
2025-11-14 10:04:30 -08:00
Zach Brown
7ef62894bd Add ino_alloc_per_lock option
Add an option that can limit the number of inode numbers that are
allocated per lock group.

Signed-off-by: Zach Brown <zab@versity.com>
2025-11-13 17:19:04 -08:00
Zach Brown
1f363a1ead Merge pull request #261 from versity/zab/log_merge_double_free
Zab/log merge double free
2025-11-13 17:18:30 -08:00
Zach Brown
8ddf9b8c8c Handle disappearing fencing requests and targets
The userspace fencing process wasn't careful about handling underlying
directories that disappear while it was working.

On the server/fenced side, fencing requests can linger after they've
been resolved by writing 1 to fenced or error.  The script could come
back around to see the directory before the server finally removes it,
causing all later uses of the request dir to fail.  We saw this in the
logs as a bunch of cat errors for the various request files.

On the local fence script side, all the mounts can be in the process of
being unmounted so both the /sys/fs dirs and the mount itself can be
removed while we're working.

For both, when we're working with the /sys/fs files we read them without
logging errors and then test that the dir still exists before using what
we read.  When fencing a mount, we stop if findmnt doesn't find the
mount and then raise a umount error if the /sys/fs dir exists after
umount fails.

And while we're at it, we have each script's logging append instead of
truncating (if, say, it's a log file instead of an interactive tty).

Signed-off-by: Zach Brown <zab@versity.com>
2025-11-13 12:43:31 -08:00
Zach Brown
fd80c17ab6 Filter out kernel message when guests are slow
Ignore more kernel messages when debug guests are being slow.

Signed-off-by: Zach Brown <zab@versity.com>
2025-11-13 12:43:31 -08:00
Zach Brown
991e2cbdf8 Ignore slow quorum hb transfers in tests
We're getting test failures from messages saying that our guests can be
unresponsive.  They sure can be.  We don't need to fail for this one
specific case.

Signed-off-by: Zach Brown <zab@versity.com>
2025-11-13 12:43:31 -08:00
Zach Brown
92ac132873 Silence merge splice error when forcing
Silence another error warning and assertion that assume the result of
the errors is going to be persistent.  When we're forcing an
unmount we've severed storage and networking.

Signed-off-by: Zach Brown <zab@versity.com>
2025-11-13 12:43:31 -08:00
Auke Kok
ad078cd93c Avoid lock stalling mmap_stress
mmap_stress gets completely stalled in lock messaging, starving
most of the mmap_stress threads, which causes it to slow down and even
time out in CI.

Instead of spawning threads over all 5 test nodes, we artificially
reduce it to spawning over only 2. This still does a good number
of operations on those nodes, and now the work is spread evenly
across the two nodes.

Additionally, I've added a minuscule (10ms) delay in between operations
that should hopefully be sufficient for other locking attempts to
settle and allow the threads to better spread the work.

This now shows that all the threads exit within < 0.25s on my test
machine, which is a lot better than the 40s variation that I was seeing
locally. Hopefully this fares better in CI.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-11-13 12:43:31 -08:00
Auke Kok
90cb458cd5 Make mmap_stress not exceed a fixed amount of time.
There's a scenario where mmap_stress gets enough resources that
two of the threads will starve the others, which then all take
a very long time catching up committing changes.

Because this test program didn't finish until all the threads had
completed a fixed amount of work, these threads essentially all
ended up tripping over each other. In CI this could exceed 6 hours,
while originally I intended this to run in about 100s or so.

Instead, cap the run time to ~30s by default. If threads exceed
this time, they will immediately exit, which causes any clog in
contention between the threads to drain relatively quickly.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-11-13 12:43:31 -08:00
Zach Brown
1ab798e7eb Silence inconsistent srch on forced unmount
Assembling a srch compaction operation creates an item and populates it
with allocator state.  If filling the allocator state fails it doesn't
cleanly unwind the allocation and undo the compaction item change, and
it issues a warning.

This warning isn't needed if the error shows that we're in forced
unmount.  The inconsistent state won't be applied, it will be dropped on
the floor as the mount is torn down.

Signed-off-by: Zach Brown <zab@versity.com>
2025-11-13 12:43:31 -08:00
Zach Brown
e182914e51 Fix double free of metadata blocks in log merging
The log merging process is meant to provide parallelism across workers
in mounts.  The idea is that the server hands out a bunch of concurrent
non-intersecting work that's based on the structure of the stable input
fs_root btree.

The nature of the parallel work (cow of the blocks that intersect a key
range) means that the ranges of concurrently issued work can't overlap
or the work will all cow the same input blocks, freeing that input
stable block multiple times.  We're seeing this in testing.

Correctness was intended by having an advancing key that sweeps sorted
ranges.  Duplicate ranges would never be hit as the key advanced past
each one it visited.  This was broken by the mapping of the fs item keys
to log merge tree keys, which clobbers the sk_zone key value.  That
effectively interleaves the ranges of each zone in the fs root (meta
indexes, orphans, fs items).  With just the right log merge conditions,
involving logged items in the right places and partially completed work
inserting remaining ranges behind the key, ranges can be stored at
mapped keys that end up out of order.  The server iterates over these
and ends up issuing overlapping work, which results in duplicated frees
of the input blocks.

The fix, without changing the format of the stored log tree items, is to
perform a full sweep of all the range items and determine the next item
by looking at the full precision stored keys.  This ensures that the
processed ranges always advance and never overlap.

Signed-off-by: Zach Brown <zab@versity.com>
2025-11-13 12:43:31 -08:00
Zach Brown
8484a58dd6 Have xfstest pass when using args
The xfstests's golden output includes the full set of tests we expect to
run when no args are specified.  If we specify args then the set of
tests can change and the test will always fail when they do.

This fixes that by having the test check the set of tests itself, rather
than relying on golden output.  If args are specified then our xfstest
only fails if any of the executed xfstest tests failed.  Without args,
we perform the same scraping of the check output and compare it against
the expected results ourselves.

It would have been a bit much to put that large file inline in the test
file, so we add a dir of per-test files in revision control.  We can
also put the list of exclusions there.

We can also clean up the output redirection helper functions to make
them clearer.  After xfstests has executed we want to redirect output
back to the compared output so that we can catch any unexpected output.

Signed-off-by: Zach Brown <zab@versity.com>
2025-11-13 12:43:31 -08:00
Zach Brown
a077104531 Add crash monitor to run-tests
Add a little background function that runs during the test which
triggers a crash if it finds catastrophic failure conditions.

This is the second bg task we want to kill and we can only have one
function run on the EXIT trap, so we create a generic process killing
trap function.

We feed it the fenced pid as well.  run-tests didn't log much of value
into the fenced log, and we're not logging the kills into it anymore,
so we just remove run-tests' fenced logging.

Signed-off-by: Zach Brown <zab@versity.com>
2025-11-13 12:43:31 -08:00
Zach Brown
23aaa994df Add -l to run-tests for looping over tests
Add an option to run-tests to have it loop over each test that will be
run a number of times.  Looping stops if the test doesn't pass.

Most of the change in the per-test execution is indenting as we add the
for loop block.  The stats and kmsg output are lifted up to before the
loop.

Signed-off-by: Zach Brown <zab@versity.com>
2025-11-06 12:07:42 -08:00
Zach Brown
7d14b57b2d Export PATH once in run-tests
Might as well just export the PATH once as we change it, no need to
export it in every test iteration.

Signed-off-by: Zach Brown <zab@versity.com>
2025-11-06 11:02:38 -08:00
Zach Brown
3f252be4be Merge pull request #241 from versity/auke/waiter_err_data_version_obsolete
Ignore data_version in scoutfs_ioc_data_wait_err.
2025-11-04 10:09:57 -08:00
Auke Kok
a4d25d9b55 Ignore data_version in scoutfs_ioc_data_wait_err.
The data_wait_err ioctl currently requires the correct data_version
for the inode to be passed in, or else the ioctl returns -ESTALE. But
the ioctl itself is just a passthrough mechanism for notifying data
waiters, which doesn't involve the data_version at all.

Instead, we can just drop checking the value. The field remains in the
headers, but we've marked it as being ignored from now on. The reason
for the change is documented in the header file as well.

This all is a lot simpler than having to modify/rev the data_waiters
interface to support passing back the data_version, because there isn't
any space left to easily do this, and then userspace would just pass it
back to the data_wait_err ioctl.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-10-31 12:24:03 -04:00
Zach Brown
79cd25f693 Merge pull request #255 from versity/zab/compact_error_block_leak
Don't leak alloc blocks on srch compact error
2025-10-31 09:08:17 -07:00
Zach Brown
f2646130ae Don't leak alloc blocks on srch compact error
scoutfs_alloc_prepare_commit() is badly named.  All it really does is
put the references to the two dirty alloc list blocks in the allocator.
It must always be called if allocation was attempted, but it's easier
to require that it always be paired with _alloc_init().

If the srch compaction worker in the client sees an error it will send
the error back to the server without writing its dirty blocks.  In
avoiding the write it also avoided putting the two block references,
leading to leaked blocks.  We've been seeing rare messages with leaked
blocks in tests.

Signed-off-by: Zach Brown <zab@versity.com>
2025-10-30 14:47:18 -07:00
Zach Brown
1c66f9a9a5 Merge pull request #227 from versity/auke/el96
RHEL 9.6 support.
2025-10-30 13:39:05 -07:00
Auke Kok
afb6ba00ad POSIX ACL changes.
The .get_acl() method now gets passed a mnt_idmap arg, and we can now
choose to implement either .get_acl() or .get_inode_acl(). Technically
.get_acl() is a new implementation, and .get_inode_acl() is the old.
That second method now also gets an rcu flag passed, but we should be
fine either way.

Deeper under the covers however we do need to hook up the .set_acl()
method for inodes, otherwise setfacl will just fail with -ENOTSUPP. To
make this not super messy (it already is) we tack on the get_acl()
changes here.

This is all roughly ca. v6.1-rc1-4-g7420332a6ff4.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-10-30 13:59:44 -04:00
Auke Kok
29e486e411 All vfs methods now take a mnt_idmap instead of user_namespace arg.
Similar to before when namespaces were added, they are now translated to
a mnt_idmap, since v6.2-rc1-2-gabf08576afe3.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-10-30 13:58:34 -04:00
Zach Brown
8f3177fe33 Merge pull request #254 from versity/zab/shrink_cleanup
Zab/shrink cleanup
2025-10-30 08:56:33 -07:00
Zach Brown
419079e606 Merge pull request #239 from versity/auke/keepalive
Add tcp_keepalive_timeout_ms option, change default to 60s
2025-10-29 17:15:17 -07:00
Zach Brown
6a70ee03b5 Dump block alloc stacks for leaked blocks
The typical pattern of spinning to isolate a list_lru results in a
livelock if there are blocks with leaked refcounts.  We're rarely seeing
this in testing.

We can have a modest array in each block that records the stack of the
caller that initially allocated the block and dump that stack for any
blocks that we're unable to shrink/isolate.  Instead of spinning
shrinking, we can give it a good try and then print the blocks that
remain and carry on with unmount, leaking a few blocks.  (Past events
have had 2 blocks.)

Signed-off-by: Zach Brown <zab@versity.com>
2025-10-29 16:16:58 -07:00
Zach Brown
38a2ffe0c7 Add stacktrace kernelcompat
Signed-off-by: Zach Brown <zab@versity.com>
2025-10-29 10:12:52 -07:00
Zach Brown
4b41cf9789 Centralize port numbers and avoid ephemeral
The tests were using high ephemeral port numbers for the mount server's
listening port.  This caused occasional failure if the client's
ephemeral ports happened to collide with the ports used by the tests.

This puts all the port number configuration in one place and adds a
quick check to make sure it doesn't wander into the current ephemeral
range.  Then it updates all the tests to use the chosen ports.

Signed-off-by: Zach Brown <zab@versity.com>
2025-10-29 10:12:52 -07:00
Zach Brown
102899290e Allow harmless srch compact commit errors
The server's srch commit error warnings were a bit severe.  The
compaction operations are a function of persistent state.  If they fail
then the inputs still exist and the next attempt will retry whatever
failed.  Not all errors are a problem, only those that result in partial
commits that leave inconsistent state.

In particular, we have to support the case where a client retransmits a
compaction request to a new server after a first server performed the
commit but couldn't respond.  Throwing warnings when the new server gets
ENOENT looking for the busy compaction item isn't helpful.  This came
up in tests as background compaction was in flight while tests unmounted
and mounted servers repeatedly to test lock recovery.

Signed-off-by: Zach Brown <zab@versity.com>
2025-10-29 10:12:52 -07:00
Zach Brown
89387fb192 Use list_lru for block cache shrinking
The block cache had a bizarre cache eviction policy that was trying to
avoid precise LRU updates at each block.  It had pretty bad behaviour,
including only allowing reclaim of maybe 20% of the blocks that were
visited by the shrinker.

We can use the existing list_lru facility in the kernel to do a better
job.  Blocks only exhibit contention as they're allocated and added to
per-node lists.  From then on we only set accessed bits and the private
list walkers move blocks around on the list as we see the accessed bits.
(It looks more like a fifo with lazy promotion than a "LRU" that is
actively moving list items around as they're accessed.)

Using the facility means changing how we remove blocks from the cache
and hide them from lookup.  We clean up the refcount inserted flag a bit
to be expressed more as a base refcount that can be acquired by
whoever's removing from the cache.  It seems a lot clearer.
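
A minimal userspace sketch of that refcount-base idea, using C11
atomics; the names and the base value here are illustrative stand-ins,
not the module's actual code:

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Illustrative stand-in for the cached block's "live" base refcount. */
#define LIVE_BASE (INT_MAX ^ (INT_MAX >> 1))

/* Add 'a' to *v only while *v >= u, mirroring a cmpxchg retry loop. */
static bool add_unless_less(atomic_int *v, int a, int u)
{
	int c = atomic_load(v);

	while (c >= u) {
		if (atomic_compare_exchange_weak(v, &c, c + a))
			return true;
	}
	return false;
}

int main(void)
{
	atomic_int refcount = 1;		/* creator's initial reference */

	/* insert: adding the base makes the block visible to lookups */
	atomic_fetch_add(&refcount, LIVE_BASE);

	/* lookups succeed only while the live base is present */
	printf("lookup: %d\n", add_unless_less(&refcount, 1, LIVE_BASE));

	/* removal: convert the base into one normal ref, exactly once */
	printf("remove: %d\n", add_unless_less(&refcount, 1 - LIVE_BASE, LIVE_BASE));

	/* a second removal or a late lookup now fails */
	printf("late lookup: %d\n", add_unless_less(&refcount, 1, LIVE_BASE));
	return 0;
}
```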

Signed-off-by: Zach Brown <zab@versity.com>
2025-10-29 10:12:52 -07:00
Zach Brown
8b6418fb79 Add kernelcompat for list_lru
Add kernelcompat helpers for initial use of list_lru for shrinking.  The
most complicated part is the walk callback type changing.

Signed-off-by: Zach Brown <zab@versity.com>
2025-10-29 10:12:52 -07:00
Zach Brown
206c24c41f Retry stale item reads instead of stopping reclaim
Readers can read a set of items that is stale with respect to items that
were dirtied and written under a local cluster lock after the read
started.

The active reader mechanism addressed this by refusing to shrink pages
that could contain items that were dirtied while any readers were in
flight.  Under the right circumstances this can result in refusing to
shrink quite a lot of pages indeed.

This changes the mechanism to allow pages to be reclaimed, and instead
forces stale readers to retry.  The gamble is that reads are much faster
than writes.  A small fraction should have to retry, and when they do
they can be satisfied by the block cache.

Signed-off-by: Zach Brown <zab@versity.com>
2025-10-29 10:12:52 -07:00
Auke Kok
f67462750b Add tcp_keepalive_timeout_ms option, change default to 60s
The default TCP keepalive value is currently 10s, resulting in clients
being disconnected after 10 seconds of not replying to a TCP keepalive
packet. These keepalive values are reasonable most of the time, but
we've seen client disconnects where this timeout has been exceeded,
resulting in fencing. The cause for this is unknown at this time, but it
is suspected that brief network interruptions are happening.

This change adds a configurable value for this specific client socket
timeout. It enforces that its value is above UNRESPONSIVE_PROBES, whose
value remains unchanged.

The default value of 10000ms (10s) is changed to 60s. We're assuming this
value is much better suited for customers; it has been briefly trialed
and may help the system better ride out network-level interruptions.
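
For reference, a hedged userspace sketch of how a keepalive time budget
maps onto the standard Linux TCP keepalive socket options; the option
names are the generic kernel ones and the idle/interval split is an
assumption for illustration, not scoutfs's internal mechanism:

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>
#include <sys/socket.h>

/*
 * Total unresponsive time before teardown is roughly
 * idle + interval * probes; split a 60s budget across them.
 */
static int set_keepalive_budget(int fd, int total_secs, int probes)
{
	int on = 1;
	int idle = total_secs / 2;			/* quiet time before probing */
	int intvl = (total_secs - idle) / probes;	/* gap between probes */

	if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) ||
	    setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) ||
	    setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl)) ||
	    setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &probes, sizeof(probes)))
		return -1;
	return 0;
}

int main(void)
{
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0 || set_keepalive_budget(fd, 60, 5))
		perror("keepalive setup");
	else
		printf("60s keepalive budget configured\n");
	return 0;
}
```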

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-10-28 18:45:43 -04:00
Zach Brown
fd8aaa0810 Merge pull request #205 from versity/auke/scar
Changes from static analysis.
2025-10-27 16:06:32 -07:00
Auke Kok
a5dbe7f286 Don't set ret = -ENOMEM and immediately overwrite.
It's possible that scoutfs_net_alloc_conn() fails due to -ENOMEM, which
is legitimately a failure, thus the code here releases the sock again.

But the code block here sets `ret = ENOMEM` and then restarts the loop,
which immediately sets `ret = kernel_accept()`, thus overwriting the
-ENOMEM error value.

We can argue that an ENOMEM error situation here is not catastrophic.
If this is the first time we're ever seeing an ENOMEM here while trying
to accept a new client, we can just release the socket and wait for the
client to try again. If the kernel at that point is still out of memory
to handle the new incoming connection, that will cascade down and clean
up the whole listener at that point.

The alternative is to let this error path unwind out and break down the
listener immediately, something the code today doesn't do. We're
therefore keeping the behavior the same.

I've opted therefore to replace the `ret = -ENOMEM` assignment with a
comment explaining why we're ignoring the error situation here.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-10-24 14:26:44 -04:00
Auke Kok
c1e89d597d Fix NULL dereference on error branch in handle_request.
If scoutfs_send_omap_response fails for any reason, req is NULL and we
would hit a hard NULL deref during unwinding.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-10-24 14:24:00 -04:00
Auke Kok
2c4316b096 Avoid uninitialized map, flags in ext.
This function returns a stack pointer to a struct scoutfs_extent, after
setting start, len to an extent found in the proper zone, but it leaves
map and flags members unset.

Initializing the struct to {0,} avoids passing uninitialized values up
the call stack.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-10-24 14:24:00 -04:00
Auke Kok
e704cd7074 Fix masking of EIO in compact_logs.
Several of the inconsistency error paths already correctly `goto out`
but this one has a `break`. This would result in doing a whole lot of
work on corrupted data.

Make this error path go to `out` instead as the others do.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-10-24 14:23:59 -04:00
Auke Kok
8c5b09aee8 Prevent masking away inconsistent state in search_sorted_file.
In these two error conditions we explicitly set `ret = -EIO` but then
`break` to set `ret = 0` immediately again, masking away a critical
error code that should be returned.

Instead, `goto out` retains the EIO error value for the caller.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-10-24 14:23:28 -04:00
Auke Kok
d2cd610c53 Fix return of uninit value.
The value of `ret` is not initialized. If the writeback list is empty,
or if igrab() fails on the only inode on the list, the value
of `ret` is returned without being initialized. This would cause the
caller to needlessly retry, possibly making things worse.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-10-24 14:21:06 -04:00
Auke Kok
52563d3f73 Address double copy_to_user, possible 1-byte leak.
We shouldn't copy the entire _dirent struct and then copy in the name
again right after; just stop at offsetof(struct, name).

Now that we're no longer copying the uninitialized name[3] from ent,
there is no longer a possible 1-byte leak here either.
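
A standalone illustration of the pattern, with a hypothetical struct and
memcpy() standing in for copy_to_user(): copy only the fixed header up
to offsetof(..., name), then copy the name at its real length, so no
uninitialized tail bytes are ever copied out.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

struct dirent_like {
	unsigned long ino;
	unsigned short rec_len;
	char name[3];		/* the name is copied separately, at its real length */
};

int main(void)
{
	struct dirent_like ent = { .ino = 7, .rec_len = 32 };
	const char *name = "file.txt";
	char out[64];

	/* first copy: only the initialized fixed portion, stopping before name[] */
	memcpy(out, &ent, offsetof(struct dirent_like, name));

	/* second copy: the name itself, directly after the fixed portion */
	memcpy(out + offsetof(struct dirent_like, name), name, strlen(name) + 1);

	printf("copied %zu header bytes + %zu name bytes\n",
	       offsetof(struct dirent_like, name), strlen(name) + 1);
	return 0;
}
```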

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-10-24 14:21:06 -04:00
Auke Kok
4358d57f55 Avoid possible NULL deref on ENOMEM.
Ensure that we reschedule even if this happens. Maybe it'll recover. If
not, we'll have other issues elsewhere first.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-10-24 14:21:06 -04:00
Auke Kok
021830ab04 If kzalloc fails, avoid NULL deref.
We still assign NULL to sbi->s_fs_info to aid checks in cleanup paths.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-10-24 14:21:06 -04:00
Auke Kok
3e63739711 Plug df ioctl leaks.
The `type` member and padding are not initialized before being
copied to userspace.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-10-24 14:21:06 -04:00
Auke Kok
b25d8e8741 Plug super leak.
We accidentally could leak super here, so make sure to free it.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-10-24 14:21:06 -04:00
Auke Kok
4a9760afe0 Incorrect array_size test.
ARRAY_SIZE(...) will return `3` for this array with members from 0 to 2,
therefore arr[3] is out of bounds. The array length test is off by one
and needs fixing.
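
For illustration, the corrected bounds pattern in a small standalone
form (the array and index names are made up):

```c
#include <stdio.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

int main(void)
{
	int arr[3] = { 10, 20, 30 };	/* valid indexes are 0..2 */
	int idx = 3;

	/* ARRAY_SIZE(arr) is 3, so a "> ARRAY_SIZE(arr)" test wrongly accepts 3;
	 * the check must be ">=" to reject the out-of-bounds index. */
	if ((size_t)idx >= ARRAY_SIZE(arr))
		printf("index %d rejected\n", idx);
	else
		printf("arr[%d] = %d\n", idx, arr[idx]);
	return 0;
}
```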

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-10-24 14:21:06 -04:00
Zach Brown
33f6e9d0cd Merge pull request #248 from versity/auke/shuffle-tests
Add option to shuffle test order.
2025-10-23 09:33:34 -07:00
Zach Brown
f9780fc391 Merge pull request #253 from versity/zab/msg_rate_shrink
Zab/msg rate shrink
2025-10-23 09:32:25 -07:00
Zach Brown
aa8517d29b Remove msghdr iov_iter kernelcompat
This removes the KC_MSGHDR_STRUCT_IOV_ITER kernel compat.
kernel_{send,recv}msg() initializes either msg_iov or msg_iter.

This isn't a clean revert of "69068ae2 Initialize msg.msg_iter from
iovec." because previous patches fixed the order of arguments, and the
net send caller was removed.

Signed-off-by: Zach Brown <zab@versity.com>
2025-10-22 11:18:30 -07:00
Zach Brown
feae5757c4 Send messages in batches
Previous work had the receiver try to receive multiple messages in bulk.
This does the same for the sender.

We walk the send queue and initialize a vector that we then send with
one call.  This is intentionally similar to the single message sending
pattern to avoid unintended changes.

Along with the changes to receive in bulk this ended up increasing the
message processing rate by about 6x when both send and receive were
going full throttle.
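
A minimal userspace sketch of the gather-send idea, with a hypothetical
queue of small messages and writev() standing in for the single sendmsg
call (a real sender would also handle short writes):

```c
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#define MAX_BATCH 16

/* Hypothetical queued messages; the real code walks its send queue. */
static const char *queued[] = { "hello ", "scoutfs ", "batch\n" };

int main(void)
{
	struct iovec iov[MAX_BATCH];
	int nr = 0;

	/* build one vector covering every queued message... */
	for (unsigned i = 0; i < sizeof(queued) / sizeof(queued[0]) && nr < MAX_BATCH; i++) {
		iov[nr].iov_base = (void *)queued[i];
		iov[nr].iov_len = strlen(queued[i]);
		nr++;
	}

	/* ...and hand the whole batch to the kernel in one call */
	if (writev(STDOUT_FILENO, iov, nr) < 0)
		return 1;
	return 0;
}
```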

Signed-off-by: Zach Brown <zab@versity.com>
2025-10-22 11:18:30 -07:00
Zach Brown
e79086f381 Fix swapped sendmsg nr_segs/count
When the msg_iter compat was added the iter was initialized with nr_segs
and count swapped.  I'm not convinced this had any effect because the
kernel_{send,recv}msg() call would initialize msg_iter again with the
correct arguments.

Signed-off-by: Zach Brown <zab@versity.com>
2025-10-22 11:18:30 -07:00
Zach Brown
45e815bf76 Receive incoming messages in bulk
Our messaging layer is used for small control messages, not large data
payloads.  By calling recvmsg twice for every incoming message we're
hitting the socket lock reasonably hard.  With senders doing the same,
and a lot of messages flowing in each direction, the contention is
non-trivial.

This changes the receiver to copy as much of the incoming stream into a
page that is then framed and copied again into individual allocated
messages that can be processed concurrently.  We're avoiding contention
with the sender on the socket at the cost of additional copies of our
small messages.
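
A hedged userspace sketch of the framing step: one bulk read fills a
buffer, then individual messages are copied out of it. The 4-byte length
header here is an assumption for illustration, not the scoutfs wire
format:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Walk a buffer of concatenated [u32 len][payload] frames. */
static void frame_messages(const uint8_t *buf, size_t filled)
{
	size_t off = 0;

	while (off + sizeof(uint32_t) <= filled) {
		uint32_t len;

		memcpy(&len, buf + off, sizeof(len));
		if (off + sizeof(len) + len > filled)
			break;	/* partial message, wait for more stream data */

		/* a real receiver copies this into its own allocation and
		 * queues it so messages can be processed concurrently */
		printf("message of %u bytes at offset %zu\n", len, off + sizeof(len));
		off += sizeof(len) + len;
	}
}

int main(void)
{
	uint8_t buf[64] = { 0 };
	uint32_t len = 5;

	memcpy(buf, &len, sizeof(len));
	memcpy(buf + sizeof(len), "hello", 5);
	frame_messages(buf, sizeof(len) + 5);
	return 0;
}
```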

Signed-off-by: Zach Brown <zab@versity.com>
2025-10-22 11:18:30 -07:00
Zach Brown
c313b71b2e Process client lock messages in ordered work
The lock client has a requirement that it can't handle some messages
being processed out of order.  Previously it had detected message
ordering itself, but had missed some cases.  Receive processing was then
changed to always call lock message processing from the recv work to
globally order all lock messages.

This inline processing was contributing to excessive latencies in making
our way through the incoming receive queue, delaying work that would
otherwise be parallel once we got it off the recv queue.

This was seen in practice as a giant flood of lock shrink messages
arrived at the client.  It processed each in turn, starving a statfs
response long enough to trigger the hung task warning.

This fix does two things.

First, it moves ordered recv processing out of the recv work.  It lets
the recv work drain the socket quickly and turn it into a list that the
ordered work is consuming.  Other messages will have a chance to be
received and queued to their processing work without having to wait for
the ordered work to be processed.

Secondly, it adds parallelism to the ordered processing.  The incoming
lock messages don't need global ordering, they need ordering within each
lock.  We add an arbitrary but reasonable number of ordered workers and
hash lock messages to each worker based on the lock's key.
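
A small sketch of that per-key ordering: messages for the same lock key
always hash to the same worker, so ordering holds within a lock while
different locks proceed in parallel. The worker count and hash function
are placeholders, not the values chosen in the module:

```c
#include <stdint.h>
#include <stdio.h>

#define NR_ORDERED_WORKERS 8	/* "arbitrary but reasonable"; the real count is a tuning choice */

static unsigned int worker_for_lock(uint64_t lock_key)
{
	/* any stable hash works: equal keys always land on the same worker */
	return (unsigned int)((lock_key * 0x9e3779b97f4a7c15ULL) >> 32) % NR_ORDERED_WORKERS;
}

int main(void)
{
	uint64_t keys[] = { 1, 2, 2, 42 };

	for (unsigned i = 0; i < sizeof(keys) / sizeof(keys[0]); i++)
		printf("lock key %llu -> ordered worker %u\n",
		       (unsigned long long)keys[i], worker_for_lock(keys[i]));
	return 0;
}
```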

Signed-off-by: Zach Brown <zab@versity.com>
2025-10-22 11:18:30 -07:00
Zach Brown
0ecaceba14 Merge pull request #236 from versity/team/ci_green
Team/ci green
2025-10-22 11:05:08 -07:00
Chris Kirby
b4d8323750 Quorum message cleanup
Make sure to log an error if the SCOUTFS_QUORUM_EVENT_END
update_quorum_block() call fails in scoutfs_quorum_worker().

Correctly print if the reader or writer failed when logging errors
in update_quorum_block().

Signed-off-by: Chris Kirby <ckirby@versity.com>
2025-10-22 10:59:03 -07:00
Chris Kirby
aa48a8ccfc Generate sorted srch-safe entry pairs
During log compaction, the SRCH_COMPACT_LOGS_PAD_SAFE trigger was
generating inode numbers that were not in sorted order. This resulted
in later failures during srch-basic-functionality, because we were
winding up with out of order first/last pairs and merging incorrectly.

Instead, reuse the single entry in the block repeatedly, generating
zero-padded pairs of this entry that are interpreted as create/delete
and vanish during searching and merging. These aren't encoded in the
normal way, but the extra zeroes are ignored during the decoding phase.

Signed-off-by: Chris Kirby <ckirby@versity.com>
2025-10-22 10:59:03 -07:00
Chris Kirby
d277d7e955 Fix race condition in orphan-inodes test
Make sure that the orphan scanners can see deletions after forced unmounts
by waiting for reclaim_open_log_tree() to run on each mount; and waiting for
finalize_and_start_log_merge() to run and not find any finalized trees.

Do this by adding two new counters: reclaimed_open_logs and
log_merge_no_finalized and fixing the orphan-inodes test to check those
before waiting for the orphan scanners to complete.

Signed-off-by: Chris Kirby <ckirby@versity.com>
2025-10-22 10:59:03 -07:00
Chris Kirby
c72bf915ae Use ENOLINK as a special error code during forced unmount
Tests such as quorum-heartbeat-timeout were failing with EIO messages in
dmesg output due to expected errors during forced unmount. Use ENOLINK
instead, and filter all errors from dmesg with this errno (67).

Signed-off-by: Chris Kirby <ckirby@versity.com>
2025-10-22 10:58:44 -07:00
Auke Kok
c3e6f3cd54 Don't run format-version-forward-back on el8, either
This test compiles an earlier commit from the tree that is starting to
fail due to various changes on the OS level, most recently due to sparse
issues with newer kernel headers. This problem will likely increase
in the future as we add more supported releases.

We opt to run this test only on el7 for now. While we could have made
it skip the sparse checks that fail on el8, it will suffice at this
point if this just works on one of the supported OS versions during
testing.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-10-15 17:35:17 -05:00
Zach Brown
c19280c83c Add cond_resched to iput worker
The iput worker can accumulate quite a bit of pending work to do.  We've
seen hung task warnings while it's doing its work (admittedly in debug
kernels).  There's no harm in throwing in a cond_resched so other tasks
get a chance to do work.

Signed-off-by: Zach Brown <zab@versity.com>
2025-10-15 17:35:17 -05:00
Chris Kirby
01847d9fb6 Add tracing for get_file_block() and scoutfs_ioc_search_xattrs().
Signed-off-by: Chris Kirby <ckirby@versity.com>
2025-10-15 17:35:17 -05:00
Chris Kirby
84a48ed8e2 Fix several cases in srch.c where the return value of EIO should have been -EIO.
Signed-off-by: Chris Kirby <ckirby@versity.com>
2025-10-15 17:35:17 -05:00
Chris Kirby
d38e41cb57 Add the inode number to scoutfs_xattr_set traces.
Signed-off-by: Chris Kirby <ckirby@versity.com>
2025-10-15 17:35:17 -05:00
Chris Kirby
a896984f59 Only start new quorum election after a receive failure
It's possible for the quorum worker to be preempted for a long period,
especially on debug kernels. Since we only check for how much time
has passed, it's possible for a clean receive to inadvertently
trigger an election. This can cause the quorum-heartbeat-timeout
test to fail due to observed delays outside of the expected bounds.

Instead, make sure we had a receive failure before comparing timestamps.

Signed-off-by: Chris Kirby <ckirby@versity.com>
2025-10-15 17:35:17 -05:00
Chris Kirby
35bcad91a6 Close window where we can lose search items
In finalize_and_start_log_merge(), we overwrite the server
mount's log tree with its finalized form and then later write out
its next open log tree. This leaves a window where the mount's
srch_file is nulled out, causing us to lose any search items in
that log tree.

This shows up as intermittent failures in the srch-basic-functionality
test.

Eliminate this timing window by doing what unmount/reclaim does when
it finalizes: move the resources from the item that we finalize into
server trees/items as it is finalized. Then there is no window where
those resources exist only in memory until we create another
transaction.

Signed-off-by: Chris Kirby <ckirby@versity.com>
2025-10-06 12:27:25 -05:00
Auke Kok
0b7b9d4a5e Avoid trigger munching of block_remove_stale trigger.
It's entirely likely that the trigger here is munched by a read of a
dirty block from any unrelated or background reader. Avoid that by
putting the trigger at the end of the condition list.

Now that the order is swapped, we also have to avoid a null deref in
block_is_dirty(bp) here.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-10-06 12:27:25 -05:00
Auke Kok
f86a7b4d3c Fully wait for orphan inode scan to complete.
The issue with the previous attempt to fix the orphan-inodes test was
that we would regularly exceed the 120s timeout value put in there.

Instead, in this commit, we change the code to add a new counter to
indicate orphan deletion progress. When orphan inodes are deleted, the
increment of this counter indicates progress happened. Conversely,
every time the counter doesn't increment while the orphan scan attempts
counter does, we know that there was no more work to be done.

For safety, the test case waits until 2 consecutive scan attempts have
been made without forward progress.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-10-06 12:27:25 -05:00
Auke Kok
96eb9662a1 Revert "Extend orphan-inodes timeout."
This reverts commit 138c7c6b49.

The timeout value here is still exceeded by CI test jobs, and thus
causing the test to fail.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-10-06 12:27:25 -05:00
Chris Kirby
47af90d078 Fix race in offline-extent-waiting test
Before comparing file contents, wait for the background dd to complete.
Also fix a typo.

Signed-off-by: Chris Kirby <ckirby@versity.com>
2025-10-06 12:27:25 -05:00
Chris Kirby
669e37c636 Remove hung task workaround from large-fragmented-free test
Adjusting hung_task_timeout_secs is still needed for this test to pass
with a debug kernel. But the logic belongs on the platform side.

Signed-off-by: Chris Kirby <ckirby@versity.com>
2025-10-06 12:27:25 -05:00
Chris Kirby
bb3e1f3665 Fix commit budget calculation with multiple holders
The try_drain_data_freed() path was generating errors about overrunning
its commit budget:

scoutfs f.2b8928.r.02689f error: 1 holders exceeded alloc budget av: bef 8185 now 8036, fr: bef 8185 now 7602

The budget overrun check was using the current number of commit holders
(in this case one) instead of the maximum number of concurrent holders
(in this case two). So even well behaved paths like try_drain_data_freed()
can appear to exceed their commit budget if other holders dirty some blocks
and apply their commits before the try_drain_data_freed() thread does its
final budget reconciliation.

Signed-off-by: Chris Kirby <ckirby@versity.com>
2025-10-06 12:27:25 -05:00
Chris Kirby
0d262de4ac Fix dirtied block calculation in extent_mod_blocks()
Free extents are stored in two btrees: one sorted by block number, one
by size. So if you insert a new extent between two existing extents, you can
be modifying two items in the by-block-number tree. And depending on the size
of those items, that can result in three items over in the -by-size tree.
So that's a 5x multiplier per level.

If we're shrinking the tree and adding more freed blocks, we're conceptually
dirtying two blocks at each level to merge. (current *2 in the code).
But if they fall under the low water mark then one of them is freed, so we
can have *3 per level in this case.
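
A quick numeric check of the resulting budget, mirroring the updated
extent_mod_blocks() return value in the alloc.c hunk below (the height
is just an example):

```c
#include <stdio.h>

/* mirrors the updated formula in the alloc.c change below */
static unsigned int extent_mod_blocks(unsigned int height)
{
	return ((1 + height) * 3) * 5;
}

int main(void)
{
	/* a height-3 allocator btree budgets ((1 + 3) * 3) * 5 = 60 dirtied blocks */
	printf("height 3 -> budget of %u blocks\n", extent_mod_blocks(3));
	return 0;
}
```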

Signed-off-by: Chris Kirby <ckirby@versity.com>
2025-10-06 12:27:25 -05:00
Auke Kok
70bd936213 Ignore sparse error about stat.h on el8.
On el8, sparse is at 0.6.4 in epel-release, but it fails with:
```
[SP src/util.c]
src/util.c: note: in included file (through /usr/include/sys/stat.h):
/usr/include/bits/statx.h:30:6: error: not a function <noident>
/usr/include/bits/statx.h:30:6: error: bad constant expression type
```

This is due to us needing O_DIRECT from <fcntl.h>, so we set _GNU_SOURCE
before including it, but this causes (through _USE_GNU in sys/stat.h)
statx.h to be included, and that has __has_include, and sparse is too
dumb to understand it.

Just shut it up.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-10-06 12:27:25 -05:00
Chris Kirby
3f786596e0 Don't overrun the block budget in server_log_merge_free_work().
This fixes a potential fence post failure like the following:

error: 1 holders exceeded alloc budget av: bef 7407 now 7392, fr: bef 8185 now 7672

The code is only accounting for the freed btree blocks, not the dirtying of
other items. So it's possible to be at exactly (COMMIT_HOLD_ALLOC_BUDGET / 2),
dirty some log btree blocks, loop again, then consume another
(COMMIT_HOLD_ALLOC_BUDGET / 2) and blow past the total budget.

In this example, we went over by 13 blocks.

By only consuming up to 1/8 of the budget on each loop, and committing when we
have consumed 3/4 of the budget, we can avoid the fence post condition.
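
As a quick check on those fractions: committing once consumption reaches
3/4 of the budget leaves 1/4 in reserve, and a final loop can add at most
another 1/8, so the worst case is 3/4 + 1/8 = 7/8 of the budget, which
stays safely under the total.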

Signed-off-by: Chris Kirby <ckirby@versity.com>
2025-10-06 12:27:25 -05:00
Zach Brown
cad47ed1ed Merge pull request #247 from versity/zab/sparse_error
Zab/sparse error
2025-10-06 10:09:07 -07:00
Auke Kok
bf87ea0a1c Add option to shuffle test order.
The `-R` option will shuffle the order in which tests are executed.

The testing order shouldn't affect the outcome of any of the tests, but
in practice many of these tests will execute code slightly different
based on the history of the filesystem, resources allocated, memory
usage etc. of tests that were executed before. Shuffling the order of
tests therefore introduces small semi-random variations in the
enviroment.

The xfstests test is the only one that can't be shuffled yet into the
mix, so it is kept at the end. This is because it leaves the filesystems
unmounted. At a later point we may want to address this.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-10-03 14:55:32 -07:00
Zach Brown
e088424d70 Add initial filters for warnings in distro source
Add a chunk of filters for sparse warnings that trigger on distro kernel
source.

Signed-off-by: Zach Brown <zab@versity.com>
2025-10-03 09:35:36 -07:00
Zach Brown
d0cf026298 Require sparse, and filter kernel sparse output
Fail the build if we don't check with sparse in both the kernel and
userspace utils.  Add a filtering wrapper to the kernel build so that we
have a place to filter out uninteresting errors from kernel sources that
we're building against.

Signed-off-by: Zach Brown <zab@versity.com>
2025-10-03 09:35:36 -07:00
Zach Brown
03fa1ce7c5 Avoid bad sparse warning in lock_invalidate()
This is another example of refactoring a loop to avoid sparse warnings
caused by doing something in the else branch of a failed trylock.  We want to
drop and reacquire the lock if the trylock fails so we do it every loop
iteration.  This shouldn't be experiencing much contention because most
of the cov users are usually done under locks and invalidation has
excluded lock holders.  So the additional lock and unlock noise should
be local.

Signed-off-by: Zach Brown <zab@versity.com>
2025-10-03 09:35:36 -07:00
Zach Brown
3d9f10de93 Work around sparse warning in _item_write_done
scoutfs_item_write_done() acquires the cinf dirty_lock and pg rwlock out
of order.  It uses a trylock to detect failure and back off of both
before retrying.

sparse seems to have some peculiar sensitivity to following the else
branch from a failed trylock while already in a context.  Doing that
consistently triggered the spurious mismatched context warning.

This refactors the loop to always drop and reacquire the dirty_lock
after attemping the trylock.  It's not great, but this shouldn't be very
contended because the transaction write has serialized write lock
holderse that would be trying to dirty items.  The silly lock noise will
be mostly cached.

Signed-off-by: Zach Brown <zab@versity.com>
2025-10-03 09:34:23 -07:00
56 changed files with 2809 additions and 1963 deletions

View File

@@ -1,6 +1,41 @@
Versity ScoutFS Release Notes
=============================
---
v1.26
\
*Nov 17, 2025*
Add the ino\_alloc\_per\_lock mount option. This changes the number of
inode numbers allocated under each cluster lock and can alleviate lock
contention for some patterns of larger file creation.
Add the tcp\_keepalive\_timeout\_ms mount option. This can enable the
system to survive longer periods of networking outages.
Fix a rare double free of internal btree metadata blocks when merging
log trees. The duplicated freed metadata block numbers would cause
persistent errors in the server, preventing the server from starting and
hanging the system.
Fix the data\_wait interface to not require the correct data\_version of
the inode when raising an error. This lets callers raise errors when
they're unable to recall the details of the inode to discover its
data\_version.
Change scoutfs to more aggressively reclaim cached memory when under
memory pressure. This makes scoutfs behave more like other kernel
components and it integrates better with the reclaim policy heuristics
in the VM core of the kernel.
Change scoutfs to more efficiently transmit and receive socket messages.
Under heavy load this can process messages quickly enough to avoid
hung task messages for tasks that were waiting for cluster lock
messages to be processed.
Fix faulty server block commit budget calculations that were generating
spurious "holders exceeded alloc budget" console messages.
---
v1.25
\

View File

@@ -5,13 +5,6 @@ ifeq ($(SK_KSRC),)
SK_KSRC := $(shell echo /lib/modules/`uname -r`/build)
endif
# fail if sparse fails if we find it
ifeq ($(shell sparse && echo found),found)
SP =
else
SP = @:
endif
SCOUTFS_GIT_DESCRIBE ?= \
$(shell git describe --all --abbrev=6 --long 2>/dev/null || \
echo no-git)
@@ -36,9 +29,7 @@ TARFILE = scoutfs-kmod-$(RPM_VERSION).tar
all: module
module:
$(MAKE) $(SCOUTFS_ARGS)
$(SP) $(MAKE) C=2 CF="-D__CHECK_ENDIAN__" $(SCOUTFS_ARGS)
$(MAKE) CHECK=$(CURDIR)/src/sparse-filtered.sh C=1 CF="-D__CHECK_ENDIAN__" $(SCOUTFS_ARGS)
modules_install:
$(MAKE) $(SCOUTFS_ARGS) modules_install

View File

@@ -158,15 +158,6 @@ ifneq (,$(shell grep 'sock_create_kern.*struct net' include/linux/net.h))
ccflags-y += -DKC_SOCK_CREATE_KERN_NET=1
endif
#
# v3.18-rc6-1619-gc0371da6047a
#
# iov_iter is now part of struct msghdr
#
ifneq (,$(shell grep 'struct iov_iter.*msg_iter' include/linux/socket.h))
ccflags-y += -DKC_MSGHDR_STRUCT_IOV_ITER=1
endif
#
# v4.17-rc6-7-g95582b008388
#
@@ -287,6 +278,14 @@ ifneq (,$(shell grep 'int ..mknod. .struct user_namespace' include/linux/fs.h))
ccflags-y += -DKC_VFS_METHOD_USER_NAMESPACE_ARG
endif
#
# v6.2-rc1-2-gabf08576afe3
#
# fs: vfs methods use struct mnt_idmap instead of struct user_namespace
ifneq (,$(shell grep 'int vfs_mknod.struct mnt_idmap' include/linux/fs.h))
ccflags-y += -DKC_VFS_METHOD_MNT_IDMAP_ARG
endif
#
# v5.17-rc2-21-g07888c665b40
#
@@ -434,3 +433,56 @@ endif
ifneq (,$(shell grep 'int ..remap_pages..struct vm_area_struct' include/linux/mm.h))
ccflags-y += -DKC_MM_REMAP_PAGES
endif
#
# v3.19-4742-g503c358cf192
#
# list_lru_shrink_count() and list_lru_shrink_walk() introduced
#
ifneq (,$(shell grep 'list_lru_shrink_count.*struct list_lru' include/linux/list_lru.h))
ccflags-y += -DKC_LIST_LRU_SHRINK_COUNT_WALK
endif
#
# v3.19-4757-g3f97b163207c
#
# lru_list_walk_cb lru arg added
#
ifneq (,$(shell grep 'struct list_head \*item, spinlock_t \*lock, void \*cb_arg' include/linux/list_lru.h))
ccflags-y += -DKC_LIST_LRU_WALK_CB_ITEM_LOCK
endif
#
# v6.7-rc4-153-g0a97c01cd20b
#
# list_lru_{add,del} -> list_lru_{add,del}_obj
#
ifneq (,$(shell grep '^bool list_lru_add_obj' include/linux/list_lru.h))
ccflags-y += -DKC_LIST_LRU_ADD_OBJ
endif
#
# v6.12-rc6-227-gda0c02516c50
#
# lru_list_walk_cb lock arg removed
#
ifneq (,$(shell grep 'struct list_lru_one \*list, spinlock_t \*lock, void \*cb_arg' include/linux/list_lru.h))
ccflags-y += -DKC_LIST_LRU_WALK_CB_LIST_LOCK
endif
#
# v5.1-rc4-273-ge9b98e162aa5
#
# introduce stack trace helpers
#
ifneq (,$(shell grep '^unsigned int stack_trace_save' include/linux/stacktrace.h))
ccflags-y += -DKC_STACK_TRACE_SAVE
endif
# v6.1-rc1-4-g7420332a6ff4
#
# .get_acl() method now has dentry arg (and mnt_idmap). The old get_acl has been renamed
# to get_inode_acl() and is still available as well, but has an extra rcu param.
ifneq (,$(shell grep 'struct posix_acl ...get_acl..struct mnt_idmap ., struct dentry' include/linux/fs.h))
ccflags-y += -DKC_GET_ACL_DENTRY
endif

View File

@@ -107,8 +107,15 @@ struct posix_acl *scoutfs_get_acl_locked(struct inode *inode, int type, struct s
return acl;
}
#ifdef KC_GET_ACL_DENTRY
struct posix_acl *scoutfs_get_acl(KC_VFS_NS_DEF
struct dentry *dentry, int type)
{
struct inode *inode = dentry->d_inode;
#else
struct posix_acl *scoutfs_get_acl(struct inode *inode, int type)
{
#endif
struct super_block *sb = inode->i_sb;
struct scoutfs_lock *lock = NULL;
struct posix_acl *acl;
@@ -201,8 +208,15 @@ out:
return ret;
}
#ifdef KC_GET_ACL_DENTRY
int scoutfs_set_acl(KC_VFS_NS_DEF
struct dentry *dentry, struct posix_acl *acl, int type)
{
struct inode *inode = dentry->d_inode;
#else
int scoutfs_set_acl(struct inode *inode, struct posix_acl *acl, int type)
{
#endif
struct super_block *sb = inode->i_sb;
struct scoutfs_lock *lock = NULL;
LIST_HEAD(ind_locks);
@@ -240,7 +254,12 @@ int scoutfs_acl_get_xattr(struct dentry *dentry, const char *name, void *value,
if (!IS_POSIXACL(dentry->d_inode))
return -EOPNOTSUPP;
#ifdef KC_GET_ACL_DENTRY
acl = scoutfs_get_acl(KC_VFS_INIT_NS
dentry, type);
#else
acl = scoutfs_get_acl(dentry->d_inode, type);
#endif
if (IS_ERR(acl))
return PTR_ERR(acl);
if (acl == NULL)
@@ -286,7 +305,11 @@ int scoutfs_acl_set_xattr(struct dentry *dentry, const char *name, const void *v
}
}
#ifdef KC_GET_ACL_DENTRY
ret = scoutfs_set_acl(KC_VFS_INIT_NS dentry, acl, type);
#else
ret = scoutfs_set_acl(dentry->d_inode, acl, type);
#endif
out:
posix_acl_release(acl);

View File

@@ -1,9 +1,14 @@
#ifndef _SCOUTFS_ACL_H_
#define _SCOUTFS_ACL_H_
#ifdef KC_GET_ACL_DENTRY
struct posix_acl *scoutfs_get_acl(KC_VFS_NS_DEF struct dentry *dentry, int type);
int scoutfs_set_acl(KC_VFS_NS_DEF struct dentry *dentry, struct posix_acl *acl, int type);
#else
struct posix_acl *scoutfs_get_acl(struct inode *inode, int type);
struct posix_acl *scoutfs_get_acl_locked(struct inode *inode, int type, struct scoutfs_lock *lock);
int scoutfs_set_acl(struct inode *inode, struct posix_acl *acl, int type);
#endif
struct posix_acl *scoutfs_get_acl_locked(struct inode *inode, int type, struct scoutfs_lock *lock);
int scoutfs_set_acl_locked(struct inode *inode, struct posix_acl *acl, int type,
struct scoutfs_lock *lock, struct list_head *ind_locks);
#ifdef KC_XATTR_STRUCT_XATTR_HANDLER

View File

@@ -86,18 +86,47 @@ static u64 smallest_order_length(u64 len)
}
/*
* An extent modification dirties three distinct leaves of an allocator
* btree as it adds and removes the blkno and size sorted items for the
* old and new lengths of the extent. Dirtying the paths to these
* leaves can grow the tree and grow/shrink neighbours at each level.
* We over-estimate the number of blocks allocated and freed (the paths
* share a root, growth doesn't free) to err on the simpler and safer
* side. The overhead is minimal given the relatively large list blocks
* and relatively short allocator trees.
* Moving an extent between trees can dirty blocks in several ways. This
function calculates the worst-case number of blocks across these scenarios.
* We treat the alloc and free counts independently, so the values below are
* max(allocated, freed), not the sum.
*
* We track extents with two separate btree items: by block number and by size.
*
* If we're removing an extent from the btree (allocating), we can dirty
* two blocks if the keys are in different leaves. If we wind up merging
* leaves because we fall below the low water mark, we can wind up freeing
* three leaves.
*
* That sequence is as follows, assuming the original keys are removed from
* blocks A and B:
*
* Allocate new dirty A' and B'
* Free old stable A and B
* B' has fallen below the low water mark, so copy B' into A'
* Free B'
*
* An extent insertion (freeing an extent) can dirty up to five distinct items
* in the btree as it adds and removes the blkno and size sorted items for the
* old and new lengths of the extent:
*
* In the by-blkno portion of the btree, we can dirty (allocate for COW) up
* to two blocks- either by merging adjacent extents, which can cause us to
* join leaf blocks; or by an insertion that causes a split.
*
* In the by-size portion, we never merge extents, so normally we just dirty
* a single item with a size insertion. But if we merged adjacent extents in
the by-blkno portion of the tree, we might be working with three by-size
* items: removing the two old ones that were combined in the merge; and
* adding the new one for the larger, merged size.
*
* Finally, dirtying the paths to these leaves can grow the tree and grow/shrink
* neighbours at each level, so we multiply by the height of the tree after
* accounting for a possible new level.
*/
static u32 extent_mod_blocks(u32 height)
{
return ((1 + height) * 2) * 3;
return ((1 + height) * 3) * 5;
}
/*
@@ -828,7 +857,7 @@ static int find_zone_extent(struct super_block *sb, struct scoutfs_alloc_root *r
.zone = SCOUTFS_FREE_EXTENT_ORDER_ZONE,
};
struct scoutfs_extent found;
struct scoutfs_extent ext;
struct scoutfs_extent ext = {0,};
u64 start;
u64 len;
int nr;

View File

@@ -22,6 +22,8 @@
#include <linux/rhashtable.h>
#include <linux/random.h>
#include <linux/sched/mm.h>
#include <linux/list_lru.h>
#include <linux/stacktrace.h>
#include "format.h"
#include "super.h"
@@ -38,26 +40,12 @@
* than the page size. Callers can have their own contexts for tracking
* dirty blocks that are written together. We pin dirty blocks in
* memory and only checksum them all as they're all written.
*
* Memory reclaim is driven by maintaining two very coarse groups of
* blocks. As we access blocks we mark them with an increasing counter
* to discourage them from being reclaimed. We then define a threshold
* at the current counter minus half the population. Recent blocks have
* a counter greater than the threshold, and all other blocks with
* counters less than it are considered older and are candidates for
* reclaim. This results in access updates rarely modifying an atomic
* counter as blocks need to be moved into the recent group, and shrink
* can randomly scan blocks looking for the half of the population that
* will be in the old group. It's reasonably effective, but is
* particularly efficient and avoids contention between concurrent
* accesses and shrinking.
*/
struct block_info {
struct super_block *sb;
atomic_t total_inserted;
atomic64_t access_counter;
struct rhashtable ht;
struct list_lru lru;
wait_queue_head_t waitq;
KC_DEFINE_SHRINKER(shrinker);
struct work_struct free_work;
@@ -76,28 +64,15 @@ enum block_status_bits {
BLOCK_BIT_PAGE_ALLOC, /* page (possibly high order) allocation */
BLOCK_BIT_VIRT, /* mapped virt allocation */
BLOCK_BIT_CRC_VALID, /* crc has been verified */
BLOCK_BIT_ACCESSED, /* seen by lookup since last lru add/walk */
};
/*
* We want to tie atomic changes in refcounts to whether or not the
* block is still visible in the hash table, so we store the hash
* table's reference up at a known high bit. We could naturally set the
* inserted bit through excessive refcount increments. We don't do
* anything about that but at least warn if we get close.
*
* We're avoiding the high byte for no real good reason, just out of a
* historical fear of implementations that don't provide the full
* precision.
*/
#define BLOCK_REF_INSERTED (1U << 23)
#define BLOCK_REF_FULL (BLOCK_REF_INSERTED >> 1)
struct block_private {
struct scoutfs_block bl;
struct super_block *sb;
atomic_t refcount;
u64 accessed;
struct rhash_head ht_head;
struct list_head lru_head;
struct list_head dirty_entry;
struct llist_node free_node;
unsigned long bits;
@@ -106,13 +81,15 @@ struct block_private {
struct page *page;
void *virt;
};
unsigned int stack_len;
unsigned long stack[10];
};
#define TRACE_BLOCK(which, bp) \
do { \
__typeof__(bp) _bp = (bp); \
trace_scoutfs_block_##which(_bp->sb, _bp, _bp->bl.blkno, atomic_read(&_bp->refcount), \
atomic_read(&_bp->io_count), _bp->bits, _bp->accessed); \
atomic_read(&_bp->io_count), _bp->bits); \
} while (0)
#define BLOCK_PRIVATE(_bl) \
@@ -126,7 +103,17 @@ static __le32 block_calc_crc(struct scoutfs_block_header *hdr, u32 size)
return cpu_to_le32(calc);
}
static struct block_private *block_alloc(struct super_block *sb, u64 blkno)
static noinline void save_block_stack(struct block_private *bp)
{
bp->stack_len = stack_trace_save(bp->stack, ARRAY_SIZE(bp->stack), 2);
}
static void print_block_stack(struct block_private *bp)
{
stack_trace_print(bp->stack, bp->stack_len, 1);
}
static noinline struct block_private *block_alloc(struct super_block *sb, u64 blkno)
{
struct block_private *bp;
unsigned int nofs_flags;
@@ -176,11 +163,13 @@ static struct block_private *block_alloc(struct super_block *sb, u64 blkno)
bp->bl.blkno = blkno;
bp->sb = sb;
atomic_set(&bp->refcount, 1);
INIT_LIST_HEAD(&bp->lru_head);
INIT_LIST_HEAD(&bp->dirty_entry);
set_bit(BLOCK_BIT_NEW, &bp->bits);
atomic_set(&bp->io_count, 0);
TRACE_BLOCK(allocate, bp);
save_block_stack(bp);
out:
if (!bp)
@@ -233,32 +222,85 @@ static void block_free_work(struct work_struct *work)
}
/*
* Get a reference to a block while holding an existing reference.
* Users of blocks hold a refcount. If putting a refcount drops to zero
* then the block is freed.
*
* Acquiring new references and claiming the exclusive right to tear
* down a block is built around this LIVE_REFCOUNT_BASE refcount value.
* As blocks are initially cached they have the live base added to their
* refcount. Lookups will only increment the refcount and return blocks
for reference holders while the refcount is >= the base.
*
* To remove a block from the cache and eventually free it, either by
* the lru walk in the shrinker, or by reference holders, the live base
* is removed and turned into a normal refcount increment that will be
* put by the caller. This can only be done once for a block, and once
it's done, lookup will not return any more references.
*/
#define LIVE_REFCOUNT_BASE (INT_MAX ^ (INT_MAX >> 1))
/*
* Inc the refcount while holding an incremented refcount. We can't
* have so many individual reference holders that they pass the live
* base.
*/
static void block_get(struct block_private *bp)
{
WARN_ON_ONCE((atomic_read(&bp->refcount) & ~BLOCK_REF_INSERTED) <= 0);
int now = atomic_inc_return(&bp->refcount);
atomic_inc(&bp->refcount);
BUG_ON(now <= 1);
BUG_ON(now == LIVE_REFCOUNT_BASE);
}
/*
* Get a reference to a block as long as it's been inserted in the hash
* table and hasn't been removed.
*/
static struct block_private *block_get_if_inserted(struct block_private *bp)
* if (*v >= u) {
* *v += a;
* return true;
* }
*/
static bool atomic_add_unless_less(atomic_t *v, int a, int u)
{
int cnt;
int c;
do {
cnt = atomic_read(&bp->refcount);
WARN_ON_ONCE(cnt & BLOCK_REF_FULL);
if (!(cnt & BLOCK_REF_INSERTED))
return NULL;
c = atomic_read(v);
if (c < u)
return false;
} while (atomic_cmpxchg(v, c, c + a) != c);
} while (atomic_cmpxchg(&bp->refcount, cnt, cnt + 1) != cnt);
return true;
}
return bp;
static bool block_get_if_live(struct block_private *bp)
{
return atomic_add_unless_less(&bp->refcount, 1, LIVE_REFCOUNT_BASE);
}
/*
* If the refcount still has the live base, subtract it and increment
* the callers refcount that they'll put.
*/
static bool block_get_remove_live(struct block_private *bp)
{
return atomic_add_unless_less(&bp->refcount, (1 - LIVE_REFCOUNT_BASE), LIVE_REFCOUNT_BASE);
}
/*
* Only get the live base refcount if it is the only refcount remaining.
* This means that there are no active refcount holders and the block
* can't be dirty or under IO, which both hold references.
*/
static bool block_get_remove_live_only(struct block_private *bp)
{
int c;
do {
c = atomic_read(&bp->refcount);
if (c != LIVE_REFCOUNT_BASE)
return false;
} while (atomic_cmpxchg(&bp->refcount, c, c - LIVE_REFCOUNT_BASE + 1) != c);
return true;
}
/*
@@ -290,104 +332,73 @@ static const struct rhashtable_params block_ht_params = {
};
/*
* Insert a new block into the hash table. Once it is inserted in the
* hash table readers can start getting references. The caller may have
* multiple refs but the block can't already be inserted.
* Insert the block into the cache so that it's visible for lookups.
* The caller can hold references (including for a dirty block).
*
* We make sure the base is added and the block is in the lru once it's
* in the hash. If hash table insertion fails it'll be briefly visible
* in the lru, but won't be isolated/evicted because we hold an
* incremented refcount in addition to the live base.
*/
static int block_insert(struct super_block *sb, struct block_private *bp)
{
DECLARE_BLOCK_INFO(sb, binf);
int ret;
WARN_ON_ONCE(atomic_read(&bp->refcount) & BLOCK_REF_INSERTED);
BUG_ON(atomic_read(&bp->refcount) >= LIVE_REFCOUNT_BASE);
atomic_add(LIVE_REFCOUNT_BASE, &bp->refcount);
smp_mb__after_atomic(); /* make sure live base is visible to list_lru walk */
list_lru_add_obj(&binf->lru, &bp->lru_head);
retry:
atomic_add(BLOCK_REF_INSERTED, &bp->refcount);
ret = rhashtable_lookup_insert_fast(&binf->ht, &bp->ht_head, block_ht_params);
if (ret < 0) {
atomic_sub(BLOCK_REF_INSERTED, &bp->refcount);
if (ret == -EBUSY) {
/* wait for pending rebalance to finish */
synchronize_rcu();
goto retry;
} else {
atomic_sub(LIVE_REFCOUNT_BASE, &bp->refcount);
BUG_ON(atomic_read(&bp->refcount) >= LIVE_REFCOUNT_BASE);
list_lru_del_obj(&binf->lru, &bp->lru_head);
}
} else {
atomic_inc(&binf->total_inserted);
TRACE_BLOCK(insert, bp);
}
return ret;
}
/*
 * Indicate to the lru walker that this block has been accessed since it
 * was added or last walked.
 */
static void block_accessed(struct super_block *sb, struct block_private *bp)
{
	if (!test_and_set_bit(BLOCK_BIT_ACCESSED, &bp->bits))
		scoutfs_inc_counter(sb, block_cache_access_update);
}
/*
 * Remove the block from the cache.  When this returns the block won't
 * be visible for additional references from lookup.
 *
 * The caller makes sure that it's safe to be referencing this block,
 * either with their own held reference (most everything) or by being in
 * an rcu grace period (shrink).
 *
 * We always try and remove from the hash table.  It's safe to remove a
 * block that isn't hashed, it just returns -ENOENT.
 *
 * This is racing with the lru walk in the shrinker also trying to
 * remove idle blocks from the cache.  They both try to remove the live
 * refcount base and perform their removal and put if they get it.
 */
static void block_remove(struct super_block *sb, struct block_private *bp)
{
	DECLARE_BLOCK_INFO(sb, binf);

	TRACE_BLOCK(remove, bp);

	rhashtable_remove_fast(&binf->ht, &bp->ht_head, block_ht_params);

	if (block_get_remove_live(bp)) {
		list_lru_del_obj(&binf->lru, &bp->lru_head);
		block_put(sb, bp);
	}
}
static bool io_busy(struct block_private *bp)
@@ -396,37 +407,6 @@ static bool io_busy(struct block_private *bp)
return test_bit(BLOCK_BIT_IO_BUSY, &bp->bits);
}
/*
* Called during shutdown with no other users.
*/
static void block_remove_all(struct super_block *sb)
{
DECLARE_BLOCK_INFO(sb, binf);
struct rhashtable_iter iter;
struct block_private *bp;
rhashtable_walk_enter(&binf->ht, &iter);
rhashtable_walk_start(&iter);
for (;;) {
bp = rhashtable_walk_next(&iter);
if (bp == NULL)
break;
if (bp == ERR_PTR(-EAGAIN))
continue;
if (block_get_if_inserted(bp)) {
block_remove(sb, bp);
WARN_ON_ONCE(atomic_read(&bp->refcount) != 1);
block_put(sb, bp);
}
}
rhashtable_walk_stop(&iter);
rhashtable_walk_exit(&iter);
WARN_ON_ONCE(atomic_read(&binf->total_inserted) != 0);
}
/*
* XXX The io_count and sb fields in the block_private are only used
@@ -488,7 +468,7 @@ static int block_submit_bio(struct super_block *sb, struct block_private *bp,
int ret = 0;
if (scoutfs_forcing_unmount(sb))
return -ENOLINK;
sector = bp->bl.blkno << (SCOUTFS_BLOCK_LG_SHIFT - 9);
@@ -543,6 +523,10 @@ static int block_submit_bio(struct super_block *sb, struct block_private *bp,
return ret;
}
/*
* Return a block with an elevated refcount if it was present in the
* hash table and its refcount didn't indicate that it was being freed.
*/
static struct block_private *block_lookup(struct super_block *sb, u64 blkno)
{
DECLARE_BLOCK_INFO(sb, binf);
@@ -550,8 +534,8 @@ static struct block_private *block_lookup(struct super_block *sb, u64 blkno)
rcu_read_lock();
bp = rhashtable_lookup(&binf->ht, &blkno, block_ht_params);
if (bp && !block_get_if_live(bp))
	bp = NULL;
rcu_read_unlock();
return bp;
@@ -712,8 +696,8 @@ retry:
ret = 0;
out:
if (!retried && !IS_ERR_OR_NULL(bp) && !block_is_dirty(bp) &&
    (ret == -ESTALE || scoutfs_trigger(sb, BLOCK_REMOVE_STALE))) {
retried = true;
scoutfs_inc_counter(sb, block_cache_remove_stale);
block_remove(sb, bp);
@@ -1078,100 +1062,106 @@ static unsigned long block_count_objects(struct shrinker *shrink, struct shrink_
struct super_block *sb = binf->sb;
scoutfs_inc_counter(sb, block_cache_count_objects);
return list_lru_shrink_count(&binf->lru, sc);
}
struct isolate_args {
struct super_block *sb;
struct list_head dispose;
};
#define DECLARE_ISOLATE_ARGS(sb_, name_) \
struct isolate_args name_ = { \
.sb = sb_, \
.dispose = LIST_HEAD_INIT(name_.dispose), \
}
static enum lru_status isolate_lru_block(struct list_head *item, struct list_lru_one *list,
void *cb_arg)
{
struct block_private *bp = container_of(item, struct block_private, lru_head);
struct isolate_args *ia = cb_arg;
TRACE_BLOCK(isolate, bp);
/* rotate accessed blocks to the tail of the list (lazy promotion) */
if (test_and_clear_bit(BLOCK_BIT_ACCESSED, &bp->bits)) {
scoutfs_inc_counter(ia->sb, block_cache_isolate_rotate);
return LRU_ROTATE;
}
/* any refs, including dirty/io, stop us from acquiring lru refcount */
if (!block_get_remove_live_only(bp)) {
scoutfs_inc_counter(ia->sb, block_cache_isolate_skip);
return LRU_SKIP;
}
scoutfs_inc_counter(ia->sb, block_cache_isolate_removed);
list_lru_isolate_move(list, &bp->lru_head, &ia->dispose);
return LRU_REMOVED;
}
static void shrink_dispose_blocks(struct super_block *sb, struct list_head *dispose)
{
struct block_private *bp;
struct block_private *bp__;
list_for_each_entry_safe(bp, bp__, dispose, lru_head) {
list_del_init(&bp->lru_head);
block_remove(sb, bp);
block_put(sb, bp);
}
}
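The two functions above follow the standard isolate-then-dispose shrinker shape: blocks are unlinked onto a private list while the lru is locked, and the expensive teardown happens afterwards with no lru lock held. A stripped-down sketch of that shape, with all names invented and no kernel APIs:

#include <stdlib.h>

struct node {
	struct node *next;
	int idle;	/* stand-in for "only the lru and hash hold references" */
};

/* phase 1: while the (imaginary) lru lock is held, move idle nodes to dispose */
static void isolate_idle(struct node **lru, struct node **dispose)
{
	struct node **p = lru;

	while (*p) {
		struct node *n = *p;

		if (n->idle) {
			*p = n->next;
			n->next = *dispose;
			*dispose = n;
		} else {
			p = &n->next;
		}
	}
}

/* phase 2: with no locks held, tear down everything that was isolated */
static void dispose_all(struct node *dispose)
{
	while (dispose) {
		struct node *n = dispose;

		dispose = n->next;
		free(n);
	}
}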
/*
 * Remove a number of cached blocks that haven't been used recently.
 * The lru walk isolates idle blocks onto a private dispose list and
 * they're removed and freed outside of the lru lock.
 */
static unsigned long block_scan_objects(struct shrinker *shrink, struct shrink_control *sc)
{
	struct block_info *binf = KC_SHRINKER_CONTAINER_OF(shrink, struct block_info);
	struct super_block *sb = binf->sb;
	DECLARE_ISOLATE_ARGS(sb, ia);
	unsigned long freed;

	scoutfs_inc_counter(sb, block_cache_scan_objects);

	freed = kc_list_lru_shrink_walk(&binf->lru, sc, isolate_lru_block, &ia);
	shrink_dispose_blocks(sb, &ia.dispose);

	return freed;
}

static enum lru_status dump_lru_block(struct list_head *item, struct list_lru_one *list,
				      void *cb_arg)
{
	struct block_private *bp = container_of(item, struct block_private, lru_head);

	printk("blkno %llu refcount 0x%x io_count %d bits 0x%lx\n",
	       bp->bl.blkno, atomic_read(&bp->refcount), atomic_read(&bp->io_count),
	       bp->bits);
	print_block_stack(bp);

	return LRU_SKIP;
}

/*
 * Called during shutdown with no other users.  The isolating walk must
 * find blocks on the lru that only have references for presence on the
 * lru and in the hash table.
 */
static void block_shrink_all(struct super_block *sb)
{
	DECLARE_BLOCK_INFO(sb, binf);
	DECLARE_ISOLATE_ARGS(sb, ia);
	long count;

	count = DIV_ROUND_UP(list_lru_count(&binf->lru), 128) * 2;
	do {
		kc_list_lru_walk(&binf->lru, isolate_lru_block, &ia, 128);
		shrink_dispose_blocks(sb, &ia.dispose);
	} while (list_lru_count(&binf->lru) > 0 && --count > 0);

	count = list_lru_count(&binf->lru);
	if (count > 0) {
		scoutfs_err(sb, "failed to isolate/dispose %ld blocks", count);
		kc_list_lru_walk(&binf->lru, dump_lru_block, sb, count);
	}
}
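As a worked example of the bound in block_shrink_all() above: the loop allows 2 * DIV_ROUND_UP(count, 128) passes of up to 128 lru entries each, so shutting down with, say, 1000 cached blocks gives at most 16 passes before the remaining blocks are reported and dumped. A one-line model of that bound:

/* mirrors the pass bound computed in block_shrink_all() */
static long max_shrink_passes(long lru_count)
{
	return ((lru_count + 127) / 128) * 2;	/* e.g. 1000 blocks => 16 passes */
}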
struct sm_block_completion {
@@ -1210,7 +1200,7 @@ static int sm_block_io(struct super_block *sb, struct block_device *bdev, blk_op
BUILD_BUG_ON(PAGE_SIZE < SCOUTFS_BLOCK_SM_SIZE);
if (scoutfs_forcing_unmount(sb))
return -ENOLINK;
if (WARN_ON_ONCE(len > SCOUTFS_BLOCK_SM_SIZE) ||
WARN_ON_ONCE(!op_is_write(opf) && !blk_crc))
@@ -1276,7 +1266,7 @@ int scoutfs_block_write_sm(struct super_block *sb,
int scoutfs_block_setup(struct super_block *sb)
{
struct scoutfs_sb_info *sbi = SCOUTFS_SB(sb);
struct block_info *binf = NULL;
int ret;
binf = kzalloc(sizeof(struct block_info), GFP_KERNEL);
@@ -1285,15 +1275,15 @@ int scoutfs_block_setup(struct super_block *sb)
goto out;
}
	ret = list_lru_init(&binf->lru);
	if (ret < 0)
		goto out;

	ret = rhashtable_init(&binf->ht, &block_ht_params);
	if (ret < 0)
		goto out;

	binf->sb = sb;
	init_waitqueue_head(&binf->waitq);
KC_INIT_SHRINKER_FUNCS(&binf->shrinker, block_count_objects,
block_scan_objects);
@@ -1305,8 +1295,10 @@ int scoutfs_block_setup(struct super_block *sb)
ret = 0;
out:
	if (ret < 0 && binf) {
		list_lru_destroy(&binf->lru);
		kfree(binf);
	}
return ret;
}
@@ -1318,9 +1310,10 @@ void scoutfs_block_destroy(struct super_block *sb)
if (binf) {
KC_UNREGISTER_SHRINKER(&binf->shrinker);
block_shrink_all(sb);
flush_work(&binf->free_work);
rhashtable_destroy(&binf->ht);
list_lru_destroy(&binf->lru);
kfree(binf);
sbi->block_info = NULL;

View File

@@ -435,8 +435,8 @@ static int lookup_mounted_client_item(struct super_block *sb, u64 rid)
if (ret == -ENOENT)
ret = 0;
kfree(super);
out:
kfree(super);
return ret;
}

View File

@@ -26,17 +26,15 @@
EXPAND_COUNTER(block_cache_alloc_page_order) \
EXPAND_COUNTER(block_cache_alloc_virt) \
EXPAND_COUNTER(block_cache_end_io_error) \
EXPAND_COUNTER(block_cache_isolate_removed) \
EXPAND_COUNTER(block_cache_isolate_rotate) \
EXPAND_COUNTER(block_cache_isolate_skip) \
EXPAND_COUNTER(block_cache_forget) \
EXPAND_COUNTER(block_cache_free) \
EXPAND_COUNTER(block_cache_free_work) \
EXPAND_COUNTER(block_cache_remove_stale) \
EXPAND_COUNTER(block_cache_count_objects) \
EXPAND_COUNTER(block_cache_scan_objects) \
EXPAND_COUNTER(block_cache_shrink) \
EXPAND_COUNTER(block_cache_shrink_next) \
EXPAND_COUNTER(block_cache_shrink_recent) \
EXPAND_COUNTER(block_cache_shrink_remove) \
EXPAND_COUNTER(block_cache_shrink_stop) \
EXPAND_COUNTER(btree_compact_values) \
EXPAND_COUNTER(btree_compact_values_enomem) \
EXPAND_COUNTER(btree_delete) \
@@ -90,6 +88,7 @@
EXPAND_COUNTER(forest_read_items) \
EXPAND_COUNTER(forest_roots_next_hint) \
EXPAND_COUNTER(forest_set_bloom_bits) \
EXPAND_COUNTER(inode_deleted) \
EXPAND_COUNTER(item_cache_count_objects) \
EXPAND_COUNTER(item_cache_scan_objects) \
EXPAND_COUNTER(item_clear_dirty) \
@@ -117,10 +116,11 @@
EXPAND_COUNTER(item_pcpu_page_hit) \
EXPAND_COUNTER(item_pcpu_page_miss) \
EXPAND_COUNTER(item_pcpu_page_miss_keys) \
EXPAND_COUNTER(item_read_pages_barrier) \
EXPAND_COUNTER(item_read_pages_retry) \
EXPAND_COUNTER(item_read_pages_split) \
EXPAND_COUNTER(item_shrink_page) \
EXPAND_COUNTER(item_shrink_page_dirty) \
EXPAND_COUNTER(item_shrink_page_reader) \
EXPAND_COUNTER(item_shrink_page_trylock) \
EXPAND_COUNTER(item_update) \
EXPAND_COUNTER(item_write_dirty) \
@@ -145,6 +145,7 @@
EXPAND_COUNTER(lock_shrink_work) \
EXPAND_COUNTER(lock_unlock) \
EXPAND_COUNTER(lock_wait) \
EXPAND_COUNTER(log_merge_no_finalized) \
EXPAND_COUNTER(log_merge_wait_timeout) \
EXPAND_COUNTER(net_dropped_response) \
EXPAND_COUNTER(net_send_bytes) \
@@ -181,6 +182,7 @@
EXPAND_COUNTER(quorum_send_vote) \
EXPAND_COUNTER(quorum_server_shutdown) \
EXPAND_COUNTER(quorum_term_follower) \
EXPAND_COUNTER(reclaimed_open_logs) \
EXPAND_COUNTER(server_commit_hold) \
EXPAND_COUNTER(server_commit_queue) \
EXPAND_COUNTER(server_commit_worker) \

View File

@@ -2053,6 +2053,9 @@ const struct inode_operations scoutfs_dir_iops = {
#endif
.listxattr = scoutfs_listxattr,
.get_acl = scoutfs_get_acl,
#ifdef KC_GET_ACL_DENTRY
.set_acl = scoutfs_set_acl,
#endif
.symlink = scoutfs_symlink,
.permission = scoutfs_permission,
#ifdef KC_LINUX_HAVE_RHEL_IOPS_WRAPPER

View File

@@ -470,7 +470,7 @@ struct scoutfs_srch_compact {
* @get_trans_seq, @commit_trans_seq: These pair of sequence numbers
* determine if a transaction is currently open for the mount that owns
* the log_trees struct. get_trans_seq is advanced by the server as the
* transaction is opened. The server sets comimt_trans_seq equal to
* transaction is opened. The server sets commit_trans_seq equal to
* get_ as the transaction is committed.
*/
struct scoutfs_log_trees {
@@ -1091,7 +1091,8 @@ enum scoutfs_net_cmd {
EXPAND_NET_ERRNO(ENOMEM) \
EXPAND_NET_ERRNO(EIO) \
EXPAND_NET_ERRNO(ENOSPC) \
EXPAND_NET_ERRNO(EINVAL)
EXPAND_NET_ERRNO(EINVAL) \
EXPAND_NET_ERRNO(ENOLINK)
#undef EXPAND_NET_ERRNO
#define EXPAND_NET_ERRNO(which) SCOUTFS_NET_ERR_##which,

View File

@@ -150,6 +150,9 @@ static const struct inode_operations scoutfs_file_iops = {
#endif
.listxattr = scoutfs_listxattr,
.get_acl = scoutfs_get_acl,
#ifdef KC_GET_ACL_DENTRY
.set_acl = scoutfs_set_acl,
#endif
.fiemap = scoutfs_data_fiemap,
};
@@ -163,6 +166,9 @@ static const struct inode_operations scoutfs_special_iops = {
#endif
.listxattr = scoutfs_listxattr,
.get_acl = scoutfs_get_acl,
#ifdef KC_GET_ACL_DENTRY
.set_acl = scoutfs_set_acl,
#endif
};
/*
@@ -1476,12 +1482,6 @@ static int remove_index_items(struct super_block *sb, u64 ino,
* Return an allocated and unused inode number. Returns -ENOSPC if
* we're out of inodes.
*
* Each parent directory has its own pool of free inode numbers. Items
* are sorted by their inode numbers as they're stored in segments.
* This will tend to group together files that are created in a
* directory at the same time in segments. Concurrent creation across
* different directories will be stored in their own regions.
*
* Inode numbers are never reclaimed. If the inode is evicted or we're
* unmounted the pending inode numbers will be lost. Asking for a
* relatively small number from the server each time will tend to
@@ -1491,12 +1491,18 @@ static int remove_index_items(struct super_block *sb, u64 ino,
int scoutfs_alloc_ino(struct super_block *sb, bool is_dir, u64 *ino_ret)
{
DECLARE_INODE_SB_INFO(sb, inf);
struct scoutfs_mount_options opts;
struct inode_allocator *ia;
u64 ino;
u64 nr;
int ret;
scoutfs_options_read(sb, &opts);

if (is_dir && opts.ino_alloc_per_lock == SCOUTFS_LOCK_INODE_GROUP_NR)
	ia = &inf->dir_ino_alloc;
else
	ia = &inf->ino_alloc;
spin_lock(&ia->lock);
@@ -1517,6 +1523,17 @@ int scoutfs_alloc_ino(struct super_block *sb, bool is_dir, u64 *ino_ret)
*ino_ret = ia->ino++;
ia->nr--;
if (opts.ino_alloc_per_lock != SCOUTFS_LOCK_INODE_GROUP_NR) {
nr = ia->ino & SCOUTFS_LOCK_INODE_GROUP_MASK;
if (nr >= opts.ino_alloc_per_lock) {
nr = SCOUTFS_LOCK_INODE_GROUP_NR - nr;
if (nr > ia->nr)
nr = ia->nr;
ia->ino += nr;
ia->nr -= nr;
}
}
spin_unlock(&ia->lock);
ret = 0;
out:
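To make the skipping arithmetic in scoutfs_alloc_ino() concrete, here is a toy model (not the kernel code; a lock-group size of 1024 and the sample numbers are assumed for the example). Once the option's per-group cap is reached, the allocator discards the rest of the current group and jumps to the next group boundary.

#include <stdio.h>
#include <stdint.h>

#define GROUP_NR	1024ULL			/* assumed lock group size */
#define GROUP_MASK	(GROUP_NR - 1)

static void cap_alloc(uint64_t *ino, uint64_t *nr, uint64_t per_lock)
{
	uint64_t used = *ino & GROUP_MASK;	/* inos handed out in this lock group */

	if (per_lock != GROUP_NR && used >= per_lock) {
		uint64_t skip = GROUP_NR - used;	/* jump to the next group boundary */

		if (skip > *nr)
			skip = *nr;
		*ino += skip;
		*nr -= skip;
	}
}

int main(void)
{
	uint64_t ino = 4096 + 32, nr = 5000;

	cap_alloc(&ino, &nr, 32);	/* cap of 32 inos per lock group */
	printf("next ino %llu, %llu remaining\n",
	       (unsigned long long)ino, (unsigned long long)nr);	/* 5120, 4008 */
	return 0;
}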
@@ -1854,6 +1871,9 @@ static int try_delete_inode_items(struct super_block *sb, u64 ino)
goto out;
ret = delete_inode_items(sb, ino, &sinode, lock, orph_lock);
if (ret == 0)
scoutfs_inc_counter(sb, inode_deleted);
out:
if (clear_trying)
clear_bit(bit_nr, ldata->trying);
@@ -1962,6 +1982,8 @@ static void iput_worker(struct work_struct *work)
while (count-- > 0)
iput(inode);
cond_resched();
/* can't touch inode after final iput */
spin_lock(&inf->iput_lock);
@@ -2183,7 +2205,7 @@ int scoutfs_inode_walk_writeback(struct super_block *sb, bool write)
struct scoutfs_inode_info *si;
struct scoutfs_inode_info *tmp;
struct inode *inode;
int ret = 0;
spin_lock(&inf->writeback_lock);

View File

@@ -441,8 +441,6 @@ static long scoutfs_ioc_data_wait_err(struct file *file, unsigned long arg)
if (!S_ISREG(inode->i_mode)) {
ret = -EINVAL;
} else if (scoutfs_inode_data_version(inode) != args.data_version) {
ret = -ESTALE;
} else {
ret = scoutfs_data_wait_err(inode, sblock, eblock, args.op,
args.err);
@@ -954,6 +952,9 @@ static int copy_alloc_detail_to_user(struct super_block *sb, void *arg,
if (args->copied == args->nr)
return -EOVERFLOW;
/* .type and .pad need clearing */
memset(&ade, 0, sizeof(struct scoutfs_ioctl_alloc_detail_entry));
ade.blocks = blocks;
ade.id = id;
ade.meta = !!meta;
@@ -1369,7 +1370,7 @@ static long scoutfs_ioc_get_referring_entries(struct file *file, unsigned long a
ent.d_type = bref->d_type;
ent.name_len = name_len;
if (copy_to_user(uent, &ent, offsetof(struct scoutfs_ioctl_dirent, name[0])) ||
copy_to_user(&uent->name[0], bref->dent.name, name_len) ||
put_user('\0', &uent->name[name_len])) {
ret = -EFAULT;

View File

@@ -366,10 +366,15 @@ struct scoutfs_ioctl_statfs_more {
*
* Find current waiters that match the inode, op, and block range to wake
* up and return an error.
*
* (*) v1.25 and earlier required that the data_version passed here
* match that of the waiter, but this check has been removed.  It was
* never needed because no data is modified during this ioctl, so any
* data_version value passed here is now ignored.
*/
struct scoutfs_ioctl_data_wait_err {
__u64 ino;
__u64 data_version;	/* Ignored, see above (*) */
__u64 offset;
__u64 count;
__u64 op;

View File

@@ -86,6 +86,8 @@ struct item_cache_info {
/* often walked, but per-cpu refs are fast path */
rwlock_t rwlock;
struct rb_root pg_root;
/* stop readers from caching stale items behind reclaimed cleaned written items */
u64 read_dirty_barrier;
/* page-granular modification by writers, then exclusive to commit */
spinlock_t dirty_lock;
@@ -96,10 +98,6 @@ struct item_cache_info {
spinlock_t lru_lock;
struct list_head lru_list;
unsigned long lru_pages;
/* written by page readers, read by shrink */
spinlock_t active_lock;
struct list_head active_list;
};
#define DECLARE_ITEM_CACHE_INFO(sb, name) \
@@ -1285,78 +1283,6 @@ static int cache_empty_page(struct super_block *sb,
return 0;
}
/*
* Readers operate independently from dirty items and transactions.
* They read a set of persistent items and insert them into the cache
* when there aren't already pages whose key range contains the items.
* This naturally prefers cached dirty items over stale read items.
*
* We have to deal with the case where dirty items are written and
* invalidated while a read is in flight. The reader won't have seen
* the items that were dirty in their persistent roots as they started
* reading. By the time they insert their read pages the previously
* dirty items have been reclaimed and are not in the cache. The old
* stale items will be inserted in their place, effectively corrupting
* by having the dirty items disappear.
*
* We fix this by tracking the max seq of items in pages. As readers
* start they record the current transaction seq. Invalidation skips
* pages with a max seq greater than the first reader seq because the
* items in the page have to stick around to prevent the readers stale
* items from being inserted.
*
* This naturally only affects a small set of pages with items that were
* written relatively recently. If we're in memory pressure then we
* probably have a lot of pages and they'll naturally have items that
* were visible to any readers. We don't bother with the complicated and
* expensive further refinement of tracking the ranges that are being
* read and comparing those with pages to invalidate.
*/
struct active_reader {
struct list_head head;
u64 seq;
};
#define INIT_ACTIVE_READER(rdr) \
struct active_reader rdr = { .head = LIST_HEAD_INIT(rdr.head) }
static void add_active_reader(struct super_block *sb, struct active_reader *active)
{
DECLARE_ITEM_CACHE_INFO(sb, cinf);
BUG_ON(!list_empty(&active->head));
active->seq = scoutfs_trans_sample_seq(sb);
spin_lock(&cinf->active_lock);
list_add_tail(&active->head, &cinf->active_list);
spin_unlock(&cinf->active_lock);
}
static u64 first_active_reader_seq(struct item_cache_info *cinf)
{
struct active_reader *active;
u64 first;
/* only the calling task adds or deletes this active */
spin_lock(&cinf->active_lock);
active = list_first_entry_or_null(&cinf->active_list, struct active_reader, head);
first = active ? active->seq : U64_MAX;
spin_unlock(&cinf->active_lock);
return first;
}
static void del_active_reader(struct item_cache_info *cinf, struct active_reader *active)
{
/* only the calling task adds or deletes this active */
if (!list_empty(&active->head)) {
spin_lock(&cinf->active_lock);
list_del_init(&active->head);
spin_unlock(&cinf->active_lock);
}
}
/*
* Add a newly read item to the pages that we're assembling for
* insertion into the cache. These pages are private, they only exist
@@ -1450,24 +1376,34 @@ static int read_page_item(struct super_block *sb, struct scoutfs_key *key, u64 s
* and duplicates, we insert any resulting pages which don't overlap
* with existing cached pages.
*
* The forest item reader is reading stable trees that could be
* overwritten. It can return -ESTALE which we return to the caller who
* will retry the operation and work with a new set of more recent
* btrees.
*
* We only insert uncached regions because this is called with cluster
* locks held, but without locking the cache. The regions we read can
* be stale with respect to the current cache, which can be read and
* dirtied by other cluster lock holders on our node, but the cluster
* locks protect the stable items we read.
*
* Using the presence of locally written dirty pages to override stale
* read pages only works if, well, the more recent locally written pages
* are still present. Readers are totally decoupled from writers and
* can have a set of items that is very old indeed. In the mean time
* more recent items would have been dirtied locally, committed,
* cleaned, and reclaimed. We have a coarse barrier which ensures that
* readers can't insert items read from old roots from before local data
* was written. If a write completes while a read is in progress the
* read will have to retry. The retried read can use cached blocks so
* we're relying on reads being much faster than writes to reduce the
* overhead to mostly cpu work of recollecting the items from cached
* blocks via a more recent root from the server.
*/
static int read_pages(struct super_block *sb, struct item_cache_info *cinf,
struct scoutfs_key *key, struct scoutfs_lock *lock)
{
struct rb_root root = RB_ROOT;
struct cached_page *right = NULL;
struct cached_page *pg;
struct cached_page *rd;
@@ -1480,6 +1416,7 @@ static int read_pages(struct super_block *sb, struct item_cache_info *cinf,
struct rb_node *par;
struct rb_node *pg_tmp;
struct rb_node *item_tmp;
u64 rdbar;
int pgi;
int ret;
@@ -1493,8 +1430,9 @@ static int read_pages(struct super_block *sb, struct item_cache_info *cinf,
pg->end = lock->end;
rbtree_insert(&pg->node, NULL, &root.rb_node, &root);
/* sample the dirty barrier before reading persistent roots */
read_lock(&cinf->rwlock);
rdbar = cinf->read_dirty_barrier;
read_unlock(&cinf->rwlock);
start = lock->start;
end = lock->end;
@@ -1533,6 +1471,13 @@ static int read_pages(struct super_block *sb, struct item_cache_info *cinf,
retry:
write_lock(&cinf->rwlock);
/* can't insert if write has cleaned since we read */
if (cinf->read_dirty_barrier != rdbar) {
scoutfs_inc_counter(sb, item_read_pages_barrier);
ret = -ESTALE;
goto unlock;
}
while ((rd = first_page(&root))) {
pg = page_rbtree_walk(sb, &cinf->pg_root, &rd->start, &rd->end,
@@ -1570,12 +1515,12 @@ retry:
}
}
ret = 0;
unlock:
write_unlock(&cinf->rwlock);
out:
/* free any pages we left dangling on error */
for_each_page_safe(&root, rd, pg_tmp) {
rbtree_erase(&rd->node, &root);
@@ -1635,6 +1580,7 @@ retry:
ret = read_pages(sb, cinf, key, lock);
if (ret < 0 && ret != -ESTALE)
goto out;
scoutfs_inc_counter(sb, item_read_pages_retry);
goto retry;
}
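The read_dirty_barrier is essentially a generation counter for the "write completed and cleaned items" event: readers sample it before collecting items from persistent roots and refuse to insert if it moved. A minimal sketch of that protocol, ignoring the real rwlock and retry plumbing (names assumed):

#include <stdbool.h>

static unsigned long long read_dirty_barrier;	/* held under cinf->rwlock in the kernel */

/* writer: after cleaning written dirty items, invalidate in-flight reads */
static void write_done_bump(void)
{
	read_dirty_barrier++;
}

/* reader: sample before reading roots... */
static unsigned long long read_begin(void)
{
	return read_dirty_barrier;
}

/* ...and recheck before inserting read pages; a mismatch means retry (-ESTALE) */
static bool read_can_insert(unsigned long long sampled)
{
	return read_dirty_barrier == sampled;
}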
@@ -2401,6 +2347,12 @@ out:
* The caller has successfully committed all the dirty btree blocks that
* contained the currently dirty items. Clear all the dirty items and
* pages.
*
* This strange lock/trylock loop comes from sparse issuing spurious
* mismatched context warnings if we do anything (like unlock and relax)
* in the else branch of the failed trylock. We're jumping through
* hoops to not use the else but still drop and reacquire the dirty_lock
* if the trylock fails.
*/
int scoutfs_item_write_done(struct super_block *sb)
{
@@ -2409,40 +2361,35 @@ int scoutfs_item_write_done(struct super_block *sb)
struct cached_item *tmp;
struct cached_page *pg;
retry:
/* don't let read_pages miss written+cleaned items */
write_lock(&cinf->rwlock);
cinf->read_dirty_barrier++;
write_unlock(&cinf->rwlock);
spin_lock(&cinf->dirty_lock);
while ((pg = list_first_entry_or_null(&cinf->dirty_list,
struct cached_page,
dirty_head))) {
if (!write_trylock(&pg->rwlock)) {
while ((pg = list_first_entry_or_null(&cinf->dirty_list, struct cached_page, dirty_head))) {
if (write_trylock(&pg->rwlock)) {
spin_unlock(&cinf->dirty_lock);
cpu_relax();
goto retry;
}
list_for_each_entry_safe(item, tmp, &pg->dirty_list,
dirty_head) {
clear_item_dirty(sb, cinf, pg, item);
if (item->delta)
scoutfs_inc_counter(sb, item_delta_written);
/* free deletion items */
if (item->deletion || item->delta)
erase_item(pg, item);
else
item->persistent = 1;
}
write_unlock(&pg->rwlock);
spin_lock(&cinf->dirty_lock);
}
spin_unlock(&cinf->dirty_lock);
list_for_each_entry_safe(item, tmp, &pg->dirty_list,
dirty_head) {
clear_item_dirty(sb, cinf, pg, item);
if (item->delta)
scoutfs_inc_counter(sb, item_delta_written);
/* free deletion items */
if (item->deletion || item->delta)
erase_item(pg, item);
else
item->persistent = 1;
}
write_unlock(&pg->rwlock);
spin_lock(&cinf->dirty_lock);
}
} while (pg);
spin_unlock(&cinf->dirty_lock);
return 0;
@@ -2597,24 +2544,15 @@ static unsigned long item_cache_scan_objects(struct shrinker *shrink,
struct cached_page *tmp;
struct cached_page *pg;
unsigned long freed = 0;
u64 first_reader_seq;
int nr = sc->nr_to_scan;
scoutfs_inc_counter(sb, item_cache_scan_objects);
/* can't invalidate pages with items that weren't visible to first reader */
first_reader_seq = first_active_reader_seq(cinf);
write_lock(&cinf->rwlock);
spin_lock(&cinf->lru_lock);
list_for_each_entry_safe(pg, tmp, &cinf->lru_list, lru_head) {
if (first_reader_seq <= pg->max_seq) {
scoutfs_inc_counter(sb, item_shrink_page_reader);
continue;
}
if (!write_trylock(&pg->rwlock)) {
scoutfs_inc_counter(sb, item_shrink_page_trylock);
continue;
@@ -2681,8 +2619,6 @@ int scoutfs_item_setup(struct super_block *sb)
atomic_set(&cinf->dirty_pages, 0);
spin_lock_init(&cinf->lru_lock);
INIT_LIST_HEAD(&cinf->lru_list);
spin_lock_init(&cinf->active_lock);
INIT_LIST_HEAD(&cinf->active_list);
cinf->pcpu_pages = alloc_percpu(struct item_percpu_pages);
if (!cinf->pcpu_pages)
@@ -2715,8 +2651,6 @@ void scoutfs_item_destroy(struct super_block *sb)
int cpu;
if (cinf) {
BUG_ON(!list_empty(&cinf->active_list));
#ifdef KC_CPU_NOTIFIER
unregister_hotcpu_notifier(&cinf->notifier);
#endif

View File

@@ -81,3 +81,69 @@ kc_generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
return written ? written : status;
}
#endif
#include <linux/list_lru.h>
#ifdef KC_LIST_LRU_WALK_CB_ITEM_LOCK
static enum lru_status kc_isolate(struct list_head *item, spinlock_t *lock, void *cb_arg)
{
struct kc_isolate_args *args = cb_arg;
/* isolate doesn't use list, nr_items updated in caller */
return args->isolate(item, NULL, args->cb_arg);
}
unsigned long kc_list_lru_walk(struct list_lru *lru, kc_list_lru_walk_cb_t isolate, void *cb_arg,
unsigned long nr_to_walk)
{
struct kc_isolate_args args = {
.isolate = isolate,
.cb_arg = cb_arg,
};
return list_lru_walk(lru, kc_isolate, &args, nr_to_walk);
}
unsigned long kc_list_lru_shrink_walk(struct list_lru *lru, struct shrink_control *sc,
kc_list_lru_walk_cb_t isolate, void *cb_arg)
{
struct kc_isolate_args args = {
.isolate = isolate,
.cb_arg = cb_arg,
};
return list_lru_shrink_walk(lru, sc, kc_isolate, &args);
}
#endif
#ifdef KC_LIST_LRU_WALK_CB_LIST_LOCK
static enum lru_status kc_isolate(struct list_head *item, struct list_lru_one *list,
spinlock_t *lock, void *cb_arg)
{
struct kc_isolate_args *args = cb_arg;
return args->isolate(item, list, args->cb_arg);
}
unsigned long kc_list_lru_walk(struct list_lru *lru, kc_list_lru_walk_cb_t isolate, void *cb_arg,
unsigned long nr_to_walk)
{
struct kc_isolate_args args = {
.isolate = isolate,
.cb_arg = cb_arg,
};
return list_lru_walk(lru, kc_isolate, &args, nr_to_walk);
}
unsigned long kc_list_lru_shrink_walk(struct list_lru *lru, struct shrink_control *sc,
kc_list_lru_walk_cb_t isolate, void *cb_arg)
{
struct kc_isolate_args args = {
.isolate = isolate,
.cb_arg = cb_arg,
};
return list_lru_shrink_walk(lru, sc, kc_isolate, &args);
}
#endif

View File

@@ -263,6 +263,11 @@ typedef unsigned int blk_opf_t;
#define kc__vmalloc __vmalloc
#endif
#ifdef KC_VFS_METHOD_MNT_IDMAP_ARG
#define KC_VFS_NS_DEF struct mnt_idmap *mnt_idmap,
#define KC_VFS_NS mnt_idmap,
#define KC_VFS_INIT_NS &nop_mnt_idmap,
#else
#ifdef KC_VFS_METHOD_USER_NAMESPACE_ARG
#define KC_VFS_NS_DEF struct user_namespace *mnt_user_ns,
#define KC_VFS_NS mnt_user_ns,
@@ -272,6 +277,7 @@ typedef unsigned int blk_opf_t;
#define KC_VFS_NS
#define KC_VFS_INIT_NS
#endif
#endif /* KC_VFS_METHOD_MNT_IDMAP_ARG */
#ifdef KC_BIO_ALLOC_DEV_OPF_ARGS
#define kc_bio_alloc bio_alloc
@@ -410,4 +416,77 @@ static inline vm_fault_t vmf_error(int err)
}
#endif
#include <linux/list_lru.h>
#ifndef KC_LIST_LRU_SHRINK_COUNT_WALK
/* we don't bother with sc->{nid,memcg} (which doesn't exist in oldest kernels) */
static inline unsigned long list_lru_shrink_count(struct list_lru *lru,
struct shrink_control *sc)
{
return list_lru_count(lru);
}
static inline unsigned long
list_lru_shrink_walk(struct list_lru *lru, struct shrink_control *sc,
list_lru_walk_cb isolate, void *cb_arg)
{
return list_lru_walk(lru, isolate, cb_arg, sc->nr_to_scan);
}
#endif
#ifndef KC_LIST_LRU_ADD_OBJ
#define list_lru_add_obj list_lru_add
#define list_lru_del_obj list_lru_del
#endif
#if defined(KC_LIST_LRU_WALK_CB_LIST_LOCK) || defined(KC_LIST_LRU_WALK_CB_ITEM_LOCK)
struct list_lru_one;
typedef enum lru_status (*kc_list_lru_walk_cb_t)(struct list_head *item, struct list_lru_one *list,
void *cb_arg);
struct kc_isolate_args {
kc_list_lru_walk_cb_t isolate;
void *cb_arg;
};
unsigned long kc_list_lru_walk(struct list_lru *lru, kc_list_lru_walk_cb_t isolate, void *cb_arg,
unsigned long nr_to_walk);
unsigned long kc_list_lru_shrink_walk(struct list_lru *lru, struct shrink_control *sc,
kc_list_lru_walk_cb_t isolate, void *cb_arg);
#else
#define kc_list_lru_shrink_walk list_lru_shrink_walk
#endif
#if defined(KC_LIST_LRU_WALK_CB_ITEM_LOCK)
/* the isolate callback moves the item itself; the walk updates nr_items when LRU_REMOVED is returned */
static inline void list_lru_isolate_move(struct list_lru_one *list, struct list_head *item,
struct list_head *head)
{
list_move(item, head);
}
#endif
#ifndef KC_STACK_TRACE_SAVE
#include <linux/stacktrace.h>
static inline unsigned int stack_trace_save(unsigned long *store, unsigned int size,
unsigned int skipnr)
{
struct stack_trace trace = {
.entries = store,
.max_entries = size,
.skip = skipnr,
};
save_stack_trace(&trace);
return trace.nr_entries;
}
static inline void stack_trace_print(unsigned long *entries, unsigned int nr_entries, int spaces)
{
struct stack_trace trace = {
.entries = entries,
.nr_entries = nr_entries,
};
print_stack_trace(&trace, spaces);
}
#endif
#endif

View File

@@ -168,7 +168,6 @@ static int lock_invalidate(struct super_block *sb, struct scoutfs_lock *lock,
enum scoutfs_lock_mode prev, enum scoutfs_lock_mode mode)
{
struct scoutfs_lock_coverage *cov;
u64 ino, last;
int ret = 0;
@@ -192,19 +191,22 @@ static int lock_invalidate(struct super_block *sb, struct scoutfs_lock *lock,
/* have to invalidate if we're not in the only usable case */
if (!(prev == SCOUTFS_LOCK_WRITE && mode == SCOUTFS_LOCK_READ)) {
		/*
		 * Remove cov items to tell users that their cache is
		 * stale.  The unlock pattern comes from avoiding bad
		 * sparse warnings when taking else in a failed trylock.
		 */
		spin_lock(&lock->cov_list_lock);
		while ((cov = list_first_entry_or_null(&lock->cov_list,
						       struct scoutfs_lock_coverage, head))) {
			if (spin_trylock(&cov->cov_lock)) {
				list_del_init(&cov->head);
				cov->lock = NULL;
				spin_unlock(&cov->cov_lock);
				scoutfs_inc_counter(sb, lock_invalidate_coverage);
			}
			spin_unlock(&lock->cov_list_lock);
			spin_lock(&lock->cov_list_lock);
		}
		spin_unlock(&lock->cov_list_lock);

View File

@@ -35,6 +35,12 @@ do { \
} \
} while (0) \
#define scoutfs_bug_on_err(sb, err, fmt, args...) \
do { \
__typeof__(err) _err = (err); \
scoutfs_bug_on(sb, _err < 0 && _err != -ENOLINK, fmt, ##args); \
} while (0)
/*
* Each message is only generated once per volume. Remounting resets
* the messages.

View File

@@ -20,6 +20,7 @@
#include <net/sock.h>
#include <net/tcp.h>
#include <linux/log2.h>
#include <linux/jhash.h>
#include "format.h"
#include "counters.h"
@@ -31,6 +32,7 @@
#include "endian_swap.h"
#include "tseq.h"
#include "fence.h"
#include "options.h"
/*
* scoutfs networking delivers requests and responses between nodes.
@@ -134,6 +136,7 @@ struct message_send {
struct message_recv {
struct scoutfs_tseq_entry tseq_entry;
struct work_struct proc_work;
struct list_head ordered_head;
struct scoutfs_net_connection *conn;
struct scoutfs_net_header nh;
};
@@ -332,7 +335,7 @@ static int submit_send(struct super_block *sb,
return -EINVAL;
if (scoutfs_forcing_unmount(sb))
return -ENOLINK;
msend = kmalloc(offsetof(struct message_send,
nh.data[data_len]), GFP_NOFS);
@@ -498,6 +501,51 @@ static void scoutfs_net_proc_worker(struct work_struct *work)
trace_scoutfs_net_proc_work_exit(sb, 0, ret);
}
static void scoutfs_net_ordered_proc_worker(struct work_struct *work)
{
struct scoutfs_work_list *wlist = container_of(work, struct scoutfs_work_list, work);
struct message_recv *mrecv;
struct message_recv *mrecv__;
LIST_HEAD(list);
spin_lock(&wlist->lock);
list_splice_init(&wlist->list, &list);
spin_unlock(&wlist->lock);
list_for_each_entry_safe(mrecv, mrecv__, &list, ordered_head) {
list_del_init(&mrecv->ordered_head);
scoutfs_net_proc_worker(&mrecv->proc_work);
}
}
/*
* Some messages require in-order processing. But the scope of the
* ordering isn't global. In the case of lock messages, it's per lock.
* So for these messages we hash them to a number of ordered workers who
* walk a list and call the usual work function in order.  This replaces
* earlier schemes where the proc work detected out-of-order messages and
* re-ordered them, and where proc was only called from the single recv
* work context.
*/
static void queue_ordered_proc(struct scoutfs_net_connection *conn, struct message_recv *mrecv)
{
struct scoutfs_work_list *wlist;
struct scoutfs_net_lock *nl;
u32 h;
if (WARN_ON_ONCE(mrecv->nh.cmd != SCOUTFS_NET_CMD_LOCK ||
le16_to_cpu(mrecv->nh.data_len) != sizeof(struct scoutfs_net_lock)))
return scoutfs_net_proc_worker(&mrecv->proc_work);
nl = (void *)mrecv->nh.data;
h = jhash(&nl->key, sizeof(struct scoutfs_key), 0x6fdd3cd5);
wlist = &conn->ordered_proc_wlists[h % conn->ordered_proc_nr];
spin_lock(&wlist->lock);
list_add_tail(&mrecv->ordered_head, &wlist->list);
spin_unlock(&wlist->lock);
queue_work(conn->workq, &wlist->work);
}
/*
* Free live responses up to and including the seq by marking them dead
* and moving them to the send queue to be freed.
@@ -541,33 +589,17 @@ static void free_acked_responses(struct scoutfs_net_connection *conn, u64 seq)
queue_work(conn->workq, &conn->send_work);
}
static int k_recvmsg(struct socket *sock, void *buf, unsigned len)
{
	struct kvec kv = {
		.iov_base = buf,
		.iov_len = len,
	};
	struct msghdr msg = {
		.msg_flags = MSG_NOSIGNAL,
	};

	return kernel_recvmsg(sock, &msg, &kv, 1, len, msg.msg_flags);
}
static bool invalid_message(struct scoutfs_net_connection *conn,
@@ -604,6 +636,72 @@ static bool invalid_message(struct scoutfs_net_connection *conn,
return false;
}
static int recv_one_message(struct super_block *sb, struct net_info *ninf,
struct scoutfs_net_connection *conn, struct scoutfs_net_header *nh,
unsigned int data_len)
{
struct message_recv *mrecv;
int ret;
scoutfs_inc_counter(sb, net_recv_messages);
scoutfs_add_counter(sb, net_recv_bytes, nh_bytes(data_len));
trace_scoutfs_net_recv_message(sb, &conn->sockname, &conn->peername, nh);
/* caller's invalid message checked data len */
mrecv = kmalloc(offsetof(struct message_recv, nh.data[data_len]), GFP_NOFS);
if (!mrecv) {
ret = -ENOMEM;
goto out;
}
mrecv->conn = conn;
INIT_WORK(&mrecv->proc_work, scoutfs_net_proc_worker);
INIT_LIST_HEAD(&mrecv->ordered_head);
mrecv->nh = *nh;
if (data_len)
memcpy(mrecv->nh.data, (nh + 1), data_len);
if (nh->cmd == SCOUTFS_NET_CMD_GREETING) {
/* greetings are out of band, no seq mechanics */
set_conn_fl(conn, saw_greeting);
} else if (le64_to_cpu(nh->seq) <=
atomic64_read(&conn->recv_seq)) {
/* drop any resent duplicated messages */
scoutfs_inc_counter(sb, net_recv_dropped_duplicate);
kfree(mrecv);
ret = 0;
goto out;
} else {
/* record that we've received sender's seq */
atomic64_set(&conn->recv_seq, le64_to_cpu(nh->seq));
/* and free our responses that sender has received */
free_acked_responses(conn, le64_to_cpu(nh->recv_seq));
}
scoutfs_tseq_add(&ninf->msg_tseq_tree, &mrecv->tseq_entry);
/*
* Initial received greetings are processed inline
* before any other incoming messages.
*
* Incoming requests or responses to the lock client
* can't handle re-ordering, so they're queued to
* ordered receive processing work.
*/
if (nh->cmd == SCOUTFS_NET_CMD_GREETING)
scoutfs_net_proc_worker(&mrecv->proc_work);
else if (nh->cmd == SCOUTFS_NET_CMD_LOCK && !conn->listening_conn)
queue_ordered_proc(conn, mrecv);
else
queue_work(conn->workq, &mrecv->proc_work);
ret = 0;
out:
return ret;
}
/*
* Always block receiving from the socket. Errors trigger shutting down
* the connection.
@@ -614,86 +712,72 @@ static void scoutfs_net_recv_worker(struct work_struct *work)
struct super_block *sb = conn->sb;
struct net_info *ninf = SCOUTFS_SB(sb)->net_info;
struct socket *sock = conn->sock;
struct scoutfs_net_header nh;
struct message_recv *mrecv;
struct scoutfs_net_header *nh;
struct page *page = NULL;
unsigned int data_len;
int hdr_off;
int rx_off;
int size;
int ret;
trace_scoutfs_net_recv_work_enter(sb, 0, 0);
page = alloc_page(GFP_NOFS);
if (!page) {
ret = -ENOMEM;
goto out;
}
hdr_off = 0;
rx_off = 0;
for (;;) {
/* receive the header */
ret = recvmsg_full(sock, &nh, sizeof(nh));
if (ret)
break;
/* receiving an invalid message breaks the connection */
if (invalid_message(conn, &nh)) {
scoutfs_inc_counter(sb, net_recv_invalid_message);
ret = -EBADMSG;
break;
ret = k_recvmsg(sock, page_address(page) + rx_off, PAGE_SIZE - rx_off);
if (ret <= 0) {
ret = -ECONNABORTED;
goto out;
}
data_len = le16_to_cpu(nh.data_len);
rx_off += ret;
scoutfs_inc_counter(sb, net_recv_messages);
scoutfs_add_counter(sb, net_recv_bytes, nh_bytes(data_len));
trace_scoutfs_net_recv_message(sb, &conn->sockname,
&conn->peername, &nh);
for (;;) {
size = rx_off - hdr_off;
if (size < sizeof(struct scoutfs_net_header))
break;
/* invalid message checked data len */
mrecv = kmalloc(offsetof(struct message_recv,
nh.data[data_len]), GFP_NOFS);
if (!mrecv) {
ret = -ENOMEM;
break;
nh = page_address(page) + hdr_off;
/* receiving an invalid message breaks the connection */
if (invalid_message(conn, nh)) {
scoutfs_inc_counter(sb, net_recv_invalid_message);
ret = -EBADMSG;
break;
}
data_len = le16_to_cpu(nh->data_len);
if (sizeof(struct scoutfs_net_header) + data_len > size)
break;
ret = recv_one_message(sb, ninf, conn, nh, data_len);
if (ret < 0)
goto out;
hdr_off += sizeof(struct scoutfs_net_header) + data_len;
}
mrecv->conn = conn;
INIT_WORK(&mrecv->proc_work, scoutfs_net_proc_worker);
mrecv->nh = nh;
/* receive the data payload */
ret = recvmsg_full(sock, mrecv->nh.data, data_len);
if (ret) {
kfree(mrecv);
break;
if ((PAGE_SIZE - rx_off) <
(sizeof(struct scoutfs_net_header) + SCOUTFS_NET_MAX_DATA_LEN)) {
if (size)
memmove(page_address(page), page_address(page) + hdr_off, size);
hdr_off = 0;
rx_off = size;
}
if (nh.cmd == SCOUTFS_NET_CMD_GREETING) {
/* greetings are out of band, no seq mechanics */
set_conn_fl(conn, saw_greeting);
} else if (le64_to_cpu(nh.seq) <=
atomic64_read(&conn->recv_seq)) {
/* drop any resent duplicated messages */
scoutfs_inc_counter(sb, net_recv_dropped_duplicate);
kfree(mrecv);
continue;
} else {
/* record that we've received sender's seq */
atomic64_set(&conn->recv_seq, le64_to_cpu(nh.seq));
/* and free our responses that sender has received */
free_acked_responses(conn, le64_to_cpu(nh.recv_seq));
}
scoutfs_tseq_add(&ninf->msg_tseq_tree, &mrecv->tseq_entry);
/*
* Initial received greetings are processed
* synchronously before any other incoming messages.
*
* Incoming requests or responses to the lock client are
* called synchronously to avoid reordering.
*/
if (nh.cmd == SCOUTFS_NET_CMD_GREETING ||
(nh.cmd == SCOUTFS_NET_CMD_LOCK && !conn->listening_conn))
scoutfs_net_proc_worker(&mrecv->proc_work);
else
queue_work(conn->workq, &mrecv->proc_work);
}
out:
__free_page(page);
if (ret)
scoutfs_inc_counter(sb, net_recv_error);
@@ -703,33 +787,41 @@ static void scoutfs_net_recv_worker(struct work_struct *work)
trace_scoutfs_net_recv_work_exit(sb, 0, ret);
}
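The reworked receive path buffers socket data into a page and peels complete header+payload frames out of it, compacting the unparsed tail whenever a maximal frame might no longer fit. A toy model of that framing, with a made-up 2-byte length header instead of scoutfs_net_header and invented sizes:

#include <stdint.h>
#include <string.h>

#define BUF_SIZE	4096
#define HDR_SIZE	2
#define MAX_DATA	1024

struct framer {
	uint8_t buf[BUF_SIZE];
	int hdr_off;	/* start of the next unparsed frame */
	int rx_off;	/* end of bytes received so far */
};

/* peel one complete frame if buffered; returns its total length or 0 */
static int next_frame(struct framer *f)
{
	int size = f->rx_off - f->hdr_off;
	int frame;

	if (size < HDR_SIZE)
		return 0;
	frame = HDR_SIZE + (f->buf[f->hdr_off] | (f->buf[f->hdr_off + 1] << 8));
	if (frame > size)
		return 0;

	/* ... a real implementation would process f->buf + f->hdr_off here ... */
	f->hdr_off += frame;
	return frame;
}

/* after draining frames, compact so a maximal frame always fits at the tail */
static void maybe_compact(struct framer *f)
{
	int size = f->rx_off - f->hdr_off;

	if (BUF_SIZE - f->rx_off < HDR_SIZE + MAX_DATA) {
		memmove(f->buf, f->buf + f->hdr_off, size);
		f->hdr_off = 0;
		f->rx_off = size;
	}
}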
/*
 * This consumes the kvec.
 */
static int k_sendmsg_full(struct socket *sock, struct kvec *kv, unsigned long nr_segs, size_t count)
{
	int ret = 0;

	while (count > 0) {
		struct msghdr msg = {
			.msg_flags = MSG_NOSIGNAL,
		};

		ret = kernel_sendmsg(sock, &msg, kv, nr_segs, count);
		if (ret <= 0) {
			ret = -ECONNABORTED;
			break;
		}

		count -= ret;
		if (count) {
			while (nr_segs > 0 && ret >= kv->iov_len) {
				ret -= kv->iov_len;
				kv++;
				nr_segs--;
			}
			if (nr_segs > 0 && ret > 0) {
				kv->iov_base += ret;
				kv->iov_len -= ret;
			}
			BUG_ON(nr_segs == 0);
		}

		ret = 0;
	}

	return ret;
}
static void free_msend(struct net_info *ninf, struct message_send *msend)
@@ -760,54 +852,73 @@ static void scoutfs_net_send_worker(struct work_struct *work)
struct super_block *sb = conn->sb;
struct net_info *ninf = SCOUTFS_SB(sb)->net_info;
struct message_send *msend;
int ret = 0;
struct message_send *_msend_;
struct kvec kv[16];
unsigned long nr_segs;
size_t count;
int len;
int ret;
trace_scoutfs_net_send_work_enter(sb, 0, 0);
spin_lock(&conn->lock);
while ((msend = list_first_entry_or_null(&conn->send_queue,
struct message_send, head))) {
if (msend->dead) {
free_msend(ninf, msend);
continue;
}
if ((msend->nh.cmd == SCOUTFS_NET_CMD_FAREWELL) &&
nh_is_response(&msend->nh)) {
set_conn_fl(conn, saw_farewell);
}
msend->nh.recv_seq =
cpu_to_le64(atomic64_read(&conn->recv_seq));
spin_unlock(&conn->lock);
len = nh_bytes(le16_to_cpu(msend->nh.data_len));
scoutfs_inc_counter(sb, net_send_messages);
scoutfs_add_counter(sb, net_send_bytes, len);
trace_scoutfs_net_send_message(sb, &conn->sockname,
&conn->peername, &msend->nh);
ret = sendmsg_full(conn->sock, &msend->nh, len);
for (;;) {
nr_segs = 0;
count = 0;
spin_lock(&conn->lock);
list_for_each_entry_safe(msend, _msend_, &conn->send_queue, head) {
if (msend->dead) {
free_msend(ninf, msend);
continue;
}
msend->nh.recv_seq = 0;
len = nh_bytes(le16_to_cpu(msend->nh.data_len));
if (ret)
break;
if ((msend->nh.cmd == SCOUTFS_NET_CMD_FAREWELL) &&
nh_is_response(&msend->nh)) {
set_conn_fl(conn, saw_farewell);
}
/* resend if it wasn't freed while we sent */
if (!msend->dead)
list_move_tail(&msend->head, &conn->resend_queue);
msend->nh.recv_seq = cpu_to_le64(atomic64_read(&conn->recv_seq));
scoutfs_inc_counter(sb, net_send_messages);
scoutfs_add_counter(sb, net_send_bytes, len);
trace_scoutfs_net_send_message(sb, &conn->sockname,
&conn->peername, &msend->nh);
count += len;
kv[nr_segs].iov_base = &msend->nh;
kv[nr_segs].iov_len = len;
if (++nr_segs == ARRAY_SIZE(kv))
break;
}
spin_unlock(&conn->lock);
if (nr_segs == 0) {
ret = 0;
goto out;
}
ret = k_sendmsg_full(conn->sock, kv, nr_segs, count);
if (ret < 0)
goto out;
spin_lock(&conn->lock);
list_for_each_entry_safe(msend, _msend_, &conn->send_queue, head) {
msend->nh.recv_seq = 0;
/* resend if it wasn't freed while we sent */
if (!msend->dead)
list_move_tail(&msend->head, &conn->resend_queue);
if (--nr_segs == 0)
break;
}
spin_unlock(&conn->lock);
}
spin_unlock(&conn->lock);
out:
if (ret) {
scoutfs_inc_counter(sb, net_send_error);
shutdown_conn(conn);
@@ -862,6 +973,7 @@ static void scoutfs_net_destroy_worker(struct work_struct *work)
destroy_workqueue(conn->workq);
scoutfs_tseq_del(&ninf->conn_tseq_tree, &conn->tseq_entry);
kfree(conn->info);
kfree(conn->ordered_proc_wlists);
trace_scoutfs_conn_destroy_free(conn);
kfree(conn);
@@ -887,7 +999,7 @@ static void destroy_conn(struct scoutfs_net_connection *conn)
* The TCP_KEEP* and TCP_USER_TIMEOUT option interaction is subtle.
* TCP_USER_TIMEOUT only applies if there is unacked written data in the
* send queue. It doesn't work if the connection is idle. Adding
* keepalice probes with user_timeout set changes how the keepalive
* keepalive probes with user_timeout set changes how the keepalive
* timeout is calculated. CNT no longer matters. Each time
* additional probes (not the first) are sent the user timeout is
* checked against the last time data was received. If none of the
@@ -899,14 +1011,16 @@ static void destroy_conn(struct scoutfs_net_connection *conn)
* elapses during the probe timer processing after the unsuccessful
* probes.
*/
#define UNRESPONSIVE_TIMEOUT_SECS 10
#define UNRESPONSIVE_PROBES 3
static int sock_opts_and_names(struct scoutfs_net_connection *conn,
static int sock_opts_and_names(struct super_block *sb,
struct scoutfs_net_connection *conn,
struct socket *sock)
{
struct scoutfs_mount_options opts;
int optval;
int ret;
scoutfs_options_read(sb, &opts);
/* we use a keepalive timeout instead of send timeout */
ret = kc_sock_set_sndtimeo(sock, 0);
if (ret)
@@ -919,8 +1033,7 @@ static int sock_opts_and_names(struct scoutfs_net_connection *conn,
if (ret)
goto out;
optval = (opts.tcp_keepalive_timeout_ms / MSEC_PER_SEC) - UNRESPONSIVE_PROBES;
ret = kc_tcp_sock_set_keepidle(sock, optval);
if (ret)
goto out;
@@ -930,7 +1043,7 @@ static int sock_opts_and_names(struct scoutfs_net_connection *conn,
if (ret)
goto out;
optval = opts.tcp_keepalive_timeout_ms;
ret = kc_tcp_sock_set_user_timeout(sock, optval);
if (ret)
goto out;
@@ -992,13 +1105,19 @@ static void scoutfs_net_listen_worker(struct work_struct *work)
conn->notify_down,
conn->info_size,
conn->req_funcs, "accepted");
/*
 * scoutfs_net_alloc_conn() can only fail due to ENOMEM.  There's no
 * harm in going back around to see if kernel_accept() can get enough
 * memory to accept a new connection.  If that also fails with ENOMEM
 * it'll shut down the conn anyway, so just retry here.
 */
if (!acc_conn) {
sock_release(acc_sock);
ret = -ENOMEM;
continue;
}
ret = sock_opts_and_names(sb, acc_conn, acc_sock);
if (ret) {
sock_release(acc_sock);
destroy_conn(acc_conn);
@@ -1069,7 +1188,7 @@ static void scoutfs_net_connect_worker(struct work_struct *work)
if (ret)
goto out;
ret = sock_opts_and_names(sb, conn, sock);
if (ret)
goto out;
@@ -1330,25 +1449,30 @@ scoutfs_net_alloc_conn(struct super_block *sb,
{
struct net_info *ninf = SCOUTFS_SB(sb)->net_info;
struct scoutfs_net_connection *conn;
unsigned int nr;
unsigned int i;
nr = min_t(unsigned int, num_possible_cpus(),
PAGE_SIZE / sizeof(struct scoutfs_work_list));
conn = kzalloc(sizeof(struct scoutfs_net_connection), GFP_NOFS);
	if (conn) {
		if (info_size)
			conn->info = kzalloc(info_size, GFP_NOFS);
		conn->ordered_proc_wlists = kmalloc_array(nr, sizeof(struct scoutfs_work_list),
							  GFP_NOFS);
		conn->workq = alloc_workqueue("scoutfs_net_%s",
					      WQ_UNBOUND | WQ_NON_REENTRANT, 0,
					      name_suffix);
	}

	if (!conn || (info_size && !conn->info) || !conn->workq || !conn->ordered_proc_wlists) {
		if (conn) {
			kfree(conn->info);
			kfree(conn->ordered_proc_wlists);
			if (conn->workq)
				destroy_workqueue(conn->workq);
			kfree(conn);
		}
		return NULL;
	}
@@ -1378,6 +1502,13 @@ scoutfs_net_alloc_conn(struct super_block *sb,
INIT_DELAYED_WORK(&conn->reconn_free_dwork,
scoutfs_net_reconn_free_worker);
conn->ordered_proc_nr = nr;
for (i = 0; i < nr; i++) {
INIT_WORK(&conn->ordered_proc_wlists[i].work, scoutfs_net_ordered_proc_worker);
spin_lock_init(&conn->ordered_proc_wlists[i].lock);
INIT_LIST_HEAD(&conn->ordered_proc_wlists[i].list);
}
scoutfs_tseq_add(&ninf->conn_tseq_tree, &conn->tseq_entry);
trace_scoutfs_conn_alloc(conn);

View File

@@ -1,10 +1,18 @@
#ifndef _SCOUTFS_NET_H_
#define _SCOUTFS_NET_H_
#include <linux/spinlock.h>
#include <linux/list.h>
#include <linux/in.h>
#include "endian_swap.h"
#include "tseq.h"
struct scoutfs_work_list {
struct work_struct work;
spinlock_t lock;
struct list_head list;
};
struct scoutfs_net_connection;
/* These are called in their own blocking context */
@@ -61,6 +69,8 @@ struct scoutfs_net_connection {
struct list_head resend_queue;
atomic64_t recv_seq;
unsigned int ordered_proc_nr;
struct scoutfs_work_list *ordered_proc_wlists;
struct workqueue_struct *workq;
struct work_struct listen_work;

View File

@@ -592,7 +592,7 @@ static int handle_request(struct super_block *sb, struct omap_request *req)
ret = 0;
out:
free_rids(&priv_rids);
if ((ret < 0) && (req != NULL)) {
ret = scoutfs_server_send_omap_response(sb, req->client_rid, req->client_id,
NULL, ret);
free_req(req);

View File

@@ -33,12 +33,14 @@ enum {
Opt_acl,
Opt_data_prealloc_blocks,
Opt_data_prealloc_contig_only,
Opt_ino_alloc_per_lock,
Opt_log_merge_wait_timeout_ms,
Opt_metadev_path,
Opt_noacl,
Opt_orphan_scan_delay_ms,
Opt_quorum_heartbeat_timeout_ms,
Opt_quorum_slot_nr,
Opt_tcp_keepalive_timeout_ms,
Opt_err,
};
@@ -46,12 +48,14 @@ static const match_table_t tokens = {
{Opt_acl, "acl"},
{Opt_data_prealloc_blocks, "data_prealloc_blocks=%s"},
{Opt_data_prealloc_contig_only, "data_prealloc_contig_only=%s"},
{Opt_ino_alloc_per_lock, "ino_alloc_per_lock=%s"},
{Opt_log_merge_wait_timeout_ms, "log_merge_wait_timeout_ms=%s"},
{Opt_metadev_path, "metadev_path=%s"},
{Opt_noacl, "noacl"},
{Opt_orphan_scan_delay_ms, "orphan_scan_delay_ms=%s"},
{Opt_quorum_heartbeat_timeout_ms, "quorum_heartbeat_timeout_ms=%s"},
{Opt_quorum_slot_nr, "quorum_slot_nr=%s"},
{Opt_tcp_keepalive_timeout_ms, "tcp_keepalive_timeout_ms=%s"},
{Opt_err, NULL}
};
@@ -126,16 +130,20 @@ static void free_options(struct scoutfs_mount_options *opts)
#define MIN_DATA_PREALLOC_BLOCKS 1ULL
#define MAX_DATA_PREALLOC_BLOCKS ((unsigned long long)SCOUTFS_BLOCK_SM_MAX)
#define DEFAULT_TCP_KEEPALIVE_TIMEOUT_MS (60 * MSEC_PER_SEC)
static void init_default_options(struct scoutfs_mount_options *opts)
{
memset(opts, 0, sizeof(*opts));
opts->data_prealloc_blocks = SCOUTFS_DATA_PREALLOC_DEFAULT_BLOCKS;
opts->data_prealloc_contig_only = 1;
opts->ino_alloc_per_lock = SCOUTFS_LOCK_INODE_GROUP_NR;
opts->log_merge_wait_timeout_ms = DEFAULT_LOG_MERGE_WAIT_TIMEOUT_MS;
opts->orphan_scan_delay_ms = -1;
opts->quorum_heartbeat_timeout_ms = SCOUTFS_QUORUM_DEF_HB_TIMEO_MS;
opts->quorum_slot_nr = -1;
opts->tcp_keepalive_timeout_ms = DEFAULT_TCP_KEEPALIVE_TIMEOUT_MS;
}
static int verify_log_merge_wait_timeout_ms(struct super_block *sb, int ret, int val)
@@ -168,6 +176,21 @@ static int verify_quorum_heartbeat_timeout_ms(struct super_block *sb, int ret, u
return 0;
}
static int verify_tcp_keepalive_timeout_ms(struct super_block *sb, int ret, int val)
{
if (ret < 0) {
scoutfs_err(sb, "failed to parse tcp_keepalive_timeout_ms value");
return -EINVAL;
}
if (val <= (UNRESPONSIVE_PROBES * MSEC_PER_SEC)) {
scoutfs_err(sb, "invalid tcp_keepalive_timeout_ms value %d, must be larger than %lu",
val, (UNRESPONSIVE_PROBES * MSEC_PER_SEC));
return -EINVAL;
}
return 0;
}
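A worked example of the option's effect on the socket settings in sock_opts_and_names(): with the default tcp_keepalive_timeout_ms of 60000 and UNRESPONSIVE_PROBES of 3, TCP_KEEPIDLE becomes 57 seconds and TCP_USER_TIMEOUT stays at 60000 ms, so an unresponsive peer is detected roughly one timeout after traffic stops. In sketch form (example defines only, not the kernel macros):

#define EXAMPLE_MSEC_PER_SEC	1000
#define EXAMPLE_PROBES		3	/* UNRESPONSIVE_PROBES */

/* mirrors the keepidle/user-timeout math above; example only */
static void keepalive_values(int timeout_ms, int *keepidle_secs, int *user_timeout_ms)
{
	*keepidle_secs = (timeout_ms / EXAMPLE_MSEC_PER_SEC) - EXAMPLE_PROBES;	/* 60000 => 57 */
	*user_timeout_ms = timeout_ms;						/* 60000 => 60000 */
}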
/*
* Parse the option string into our options struct. This can allocate
* memory in the struct. The caller is responsible for always calling
@@ -218,6 +241,26 @@ static int parse_options(struct super_block *sb, char *options, struct scoutfs_m
opts->data_prealloc_contig_only = nr;
break;
case Opt_ino_alloc_per_lock:
ret = match_int(args, &nr);
if (ret < 0 || nr < 1 || nr > SCOUTFS_LOCK_INODE_GROUP_NR) {
scoutfs_err(sb, "invalid ino_alloc_per_lock option, must be between 1 and %u",
SCOUTFS_LOCK_INODE_GROUP_NR);
if (ret == 0)
ret = -EINVAL;
return ret;
}
opts->ino_alloc_per_lock = nr;
break;
case Opt_tcp_keepalive_timeout_ms:
ret = match_int(args, &nr);
ret = verify_tcp_keepalive_timeout_ms(sb, ret, nr);
if (ret < 0)
return ret;
opts->tcp_keepalive_timeout_ms = nr;
break;
case Opt_log_merge_wait_timeout_ms:
ret = match_int(args, &nr);
ret = verify_log_merge_wait_timeout_ms(sb, ret, nr);
@@ -365,12 +408,14 @@ int scoutfs_options_show(struct seq_file *seq, struct dentry *root)
seq_puts(seq, ",acl");
seq_printf(seq, ",data_prealloc_blocks=%llu", opts.data_prealloc_blocks);
seq_printf(seq, ",data_prealloc_contig_only=%u", opts.data_prealloc_contig_only);
seq_printf(seq, ",ino_alloc_per_lock=%u", opts.ino_alloc_per_lock);
seq_printf(seq, ",metadev_path=%s", opts.metadev_path);
if (!is_acl)
seq_puts(seq, ",noacl");
seq_printf(seq, ",orphan_scan_delay_ms=%u", opts.orphan_scan_delay_ms);
if (opts.quorum_slot_nr >= 0)
seq_printf(seq, ",quorum_slot_nr=%d", opts.quorum_slot_nr);
seq_printf(seq, ",tcp_keepalive_timeout_ms=%d", opts.tcp_keepalive_timeout_ms);
return 0;
}
@@ -452,6 +497,45 @@ static ssize_t data_prealloc_contig_only_store(struct kobject *kobj, struct kobj
}
SCOUTFS_ATTR_RW(data_prealloc_contig_only);
static ssize_t ino_alloc_per_lock_show(struct kobject *kobj, struct kobj_attribute *attr,
char *buf)
{
struct super_block *sb = SCOUTFS_SYSFS_ATTRS_SB(kobj);
struct scoutfs_mount_options opts;
scoutfs_options_read(sb, &opts);
return snprintf(buf, PAGE_SIZE, "%u", opts.ino_alloc_per_lock);
}
static ssize_t ino_alloc_per_lock_store(struct kobject *kobj, struct kobj_attribute *attr,
const char *buf, size_t count)
{
struct super_block *sb = SCOUTFS_SYSFS_ATTRS_SB(kobj);
DECLARE_OPTIONS_INFO(sb, optinf);
char nullterm[20]; /* more than enough for octal -U32_MAX */
long val;
int len;
int ret;
len = min(count, sizeof(nullterm) - 1);
memcpy(nullterm, buf, len);
nullterm[len] = '\0';
ret = kstrtol(nullterm, 0, &val);
if (ret < 0 || val < 1 || val > SCOUTFS_LOCK_INODE_GROUP_NR) {
scoutfs_err(sb, "invalid ino_alloc_per_lock option, must be between 1 and %u",
SCOUTFS_LOCK_INODE_GROUP_NR);
return -EINVAL;
}
write_seqlock(&optinf->seqlock);
optinf->opts.ino_alloc_per_lock = val;
write_sequnlock(&optinf->seqlock);
return count;
}
SCOUTFS_ATTR_RW(ino_alloc_per_lock);
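A note on the store handler above: sysfs write buffers aren't guaranteed to be null-terminated, so the value is copied into a small stack buffer before parsing and then range checked against the same 1..SCOUTFS_LOCK_INODE_GROUP_NR bound as the mount option. A minimal userspace sketch of that bounded parse, with strtol standing in for the kernel's kstrtol and every name here being illustrative only:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>

/* copy an unterminated buffer, parse it, and reject out-of-range values */
static int parse_bounded(const char *buf, size_t count, long min, long max, long *val)
{
        char tmp[20];   /* plenty for a 64-bit value in any base */
        size_t len = count < sizeof(tmp) - 1 ? count : sizeof(tmp) - 1;
        char *end;

        memcpy(tmp, buf, len);
        tmp[len] = '\0';

        errno = 0;
        *val = strtol(tmp, &end, 0);
        if (errno || end == tmp || *val < min || *val > max)
                return -EINVAL;
        return 0;
}

int main(void)
{
        long v;

        printf("%d\n", parse_bounded("64\n", 3, 1, 1024, &v));  /* 0, v == 64 */
        printf("%d\n", parse_bounded("0", 1, 1, 1024, &v));     /* -EINVAL: below the minimum */
        return 0;
}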
static ssize_t log_merge_wait_timeout_ms_show(struct kobject *kobj, struct kobj_attribute *attr,
char *buf)
{
@@ -592,6 +676,7 @@ SCOUTFS_ATTR_RO(quorum_slot_nr);
static struct attribute *options_attrs[] = {
SCOUTFS_ATTR_PTR(data_prealloc_blocks),
SCOUTFS_ATTR_PTR(data_prealloc_contig_only),
SCOUTFS_ATTR_PTR(ino_alloc_per_lock),
SCOUTFS_ATTR_PTR(log_merge_wait_timeout_ms),
SCOUTFS_ATTR_PTR(metadev_path),
SCOUTFS_ATTR_PTR(orphan_scan_delay_ms),

View File

@@ -8,13 +8,17 @@
struct scoutfs_mount_options {
u64 data_prealloc_blocks;
bool data_prealloc_contig_only;
unsigned int ino_alloc_per_lock;
unsigned int log_merge_wait_timeout_ms;
char *metadev_path;
unsigned int orphan_scan_delay_ms;
int quorum_slot_nr;
u64 quorum_heartbeat_timeout_ms;
int tcp_keepalive_timeout_ms;
};
#define UNRESPONSIVE_PROBES 3
void scoutfs_options_read(struct super_block *sb, struct scoutfs_mount_options *opts);
int scoutfs_options_show(struct seq_file *seq, struct dentry *root);

View File

@@ -243,10 +243,6 @@ static int send_msg_members(struct super_block *sb, int type, u64 term, int only
};
struct sockaddr_in sin;
struct msghdr mh = {
#ifndef KC_MSGHDR_STRUCT_IOV_ITER
.msg_iov = (struct iovec *)&kv,
.msg_iovlen = 1,
#endif
.msg_flags = MSG_DONTWAIT | MSG_NOSIGNAL,
.msg_name = &sin,
.msg_namelen = sizeof(sin),
@@ -268,9 +264,7 @@ static int send_msg_members(struct super_block *sb, int type, u64 term, int only
scoutfs_quorum_slot_sin(&qinf->qconf, i, &sin);
now = ktime_get();
#ifdef KC_MSGHDR_STRUCT_IOV_ITER
iov_iter_init(&mh.msg_iter, WRITE, (struct iovec *)&kv, sizeof(qmes), 1);
#endif
ret = kernel_sendmsg(qinf->sock, &mh, &kv, 1, kv.iov_len);
if (ret != kv.iov_len)
failed++;
@@ -312,10 +306,6 @@ static int recv_msg(struct super_block *sb, struct quorum_host_msg *msg,
.iov_len = sizeof(struct scoutfs_quorum_message),
};
struct msghdr mh = {
#ifndef KC_MSGHDR_STRUCT_IOV_ITER
.msg_iov = (struct iovec *)&kv,
.msg_iovlen = 1,
#endif
.msg_flags = MSG_NOSIGNAL,
};
@@ -333,9 +323,6 @@ static int recv_msg(struct super_block *sb, struct quorum_host_msg *msg,
ret = kc_tcp_sock_set_rcvtimeo(qinf->sock, rel_to);
}
#ifdef KC_MSGHDR_STRUCT_IOV_ITER
iov_iter_init(&mh.msg_iter, READ, (struct iovec *)&kv, sizeof(struct scoutfs_quorum_message), 1);
#endif
ret = kernel_recvmsg(qinf->sock, &mh, &kv, 1, kv.iov_len, mh.msg_flags);
if (ret < 0)
return ret;
@@ -520,10 +507,10 @@ static int update_quorum_block(struct super_block *sb, int event, u64 term, bool
set_quorum_block_event(sb, &blk, event, term);
ret = write_quorum_block(sb, blkno, &blk);
if (ret < 0)
scoutfs_err(sb, "error %d reading quorum block %llu to update event %d term %llu",
scoutfs_err(sb, "error %d writing quorum block %llu after updating event %d term %llu",
ret, blkno, event, term);
} else {
scoutfs_err(sb, "error %d writing quorum block %llu after updating event %d term %llu",
scoutfs_err(sb, "error %d reading quorum block %llu to update event %d term %llu",
ret, blkno, event, term);
}
@@ -822,6 +809,7 @@ static void scoutfs_quorum_worker(struct work_struct *work)
/* followers and candidates start new election on timeout */
if (qst.role != LEADER &&
msg.type == SCOUTFS_QUORUM_MSG_INVALID &&
ktime_after(ktime_get(), qst.timeout)) {
/* .. but only if their server has stopped */
if (!scoutfs_server_is_down(sb)) {
@@ -982,7 +970,10 @@ static void scoutfs_quorum_worker(struct work_struct *work)
}
/* record that this slot no longer has an active quorum */
update_quorum_block(sb, SCOUTFS_QUORUM_EVENT_END, qst.term, true);
err = update_quorum_block(sb, SCOUTFS_QUORUM_EVENT_END, qst.term, true);
if (err < 0 && ret == 0)
ret = err;
out:
if (ret < 0) {
scoutfs_err(sb, "quorum service saw error %d, shutting down. This mount is no longer participating in quorum. It should be remounted to restore service.",
@@ -1071,7 +1062,7 @@ static char *role_str(int role)
[LEADER] = "leader",
};
if (role < 0 || role > ARRAY_SIZE(roles) || !roles[role])
if (role < 0 || role >= ARRAY_SIZE(roles) || !roles[role])
return "invalid";
return roles[role];
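An aside on the bounds fix above (role > ARRAY_SIZE becoming role >= ARRAY_SIZE): an N-entry array has valid indices 0..N-1, so index N has to be rejected along with anything larger. A small standalone sketch with a simplified roles table (the real table uses designated initializers and also checks for NULL gaps):

#include <stdio.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

int main(void)
{
        const char *roles[] = { "follower", "candidate", "leader" };
        int role = ARRAY_SIZE(roles);   /* 3: one past the end */

        /* with '>' instead of '>=', roles[3] would be an out-of-bounds read */
        printf("%s\n", (role < 0 || role >= (int)ARRAY_SIZE(roles)) ? "invalid" : roles[role]);
        return 0;
}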

View File

@@ -823,13 +823,14 @@ DEFINE_EVENT(scoutfs_lock_info_class, scoutfs_lock_destroy,
);
TRACE_EVENT(scoutfs_xattr_set,
TP_PROTO(struct super_block *sb, size_t name_len, const void *value,
size_t size, int flags),
TP_PROTO(struct super_block *sb, __u64 ino, size_t name_len,
const void *value, size_t size, int flags),
TP_ARGS(sb, name_len, value, size, flags),
TP_ARGS(sb, ino, name_len, value, size, flags),
TP_STRUCT__entry(
SCSB_TRACE_FIELDS
__field(__u64, ino)
__field(size_t, name_len)
__field(const void *, value)
__field(size_t, size)
@@ -838,15 +839,16 @@ TRACE_EVENT(scoutfs_xattr_set,
TP_fast_assign(
SCSB_TRACE_ASSIGN(sb);
__entry->ino = ino;
__entry->name_len = name_len;
__entry->value = value;
__entry->size = size;
__entry->flags = flags;
),
TP_printk(SCSBF" name_len %zu value %p size %zu flags 0x%x",
SCSB_TRACE_ARGS, __entry->name_len, __entry->value,
__entry->size, __entry->flags)
TP_printk(SCSBF" ino %llu name_len %zu value %p size %zu flags 0x%x",
SCSB_TRACE_ARGS, __entry->ino, __entry->name_len,
__entry->value, __entry->size, __entry->flags)
);
TRACE_EVENT(scoutfs_advance_dirty_super,
@@ -1966,15 +1968,17 @@ DEFINE_EVENT(scoutfs_server_client_count_class, scoutfs_server_client_down,
);
DECLARE_EVENT_CLASS(scoutfs_server_commit_users_class,
TP_PROTO(struct super_block *sb, int holding, int applying, int nr_holders,
u32 avail_before, u32 freed_before, int committing, int exceeded),
TP_ARGS(sb, holding, applying, nr_holders, avail_before, freed_before, committing,
exceeded),
TP_PROTO(struct super_block *sb, int holding, int applying,
int nr_holders, u32 budget,
u32 avail_before, u32 freed_before,
int committing, int exceeded),
TP_ARGS(sb, holding, applying, nr_holders, budget, avail_before, freed_before, committing, exceeded),
TP_STRUCT__entry(
SCSB_TRACE_FIELDS
__field(int, holding)
__field(int, applying)
__field(int, nr_holders)
__field(u32, budget)
__field(__u32, avail_before)
__field(__u32, freed_before)
__field(int, committing)
@@ -1985,35 +1989,45 @@ DECLARE_EVENT_CLASS(scoutfs_server_commit_users_class,
__entry->holding = !!holding;
__entry->applying = !!applying;
__entry->nr_holders = nr_holders;
__entry->budget = budget;
__entry->avail_before = avail_before;
__entry->freed_before = freed_before;
__entry->committing = !!committing;
__entry->exceeded = !!exceeded;
),
TP_printk(SCSBF" holding %u applying %u nr %u avail_before %u freed_before %u committing %u exceeded %u",
SCSB_TRACE_ARGS, __entry->holding, __entry->applying, __entry->nr_holders,
__entry->avail_before, __entry->freed_before, __entry->committing,
__entry->exceeded)
TP_printk(SCSBF" holding %u applying %u nr %u budget %u avail_before %u freed_before %u committing %u exceeded %u",
SCSB_TRACE_ARGS, __entry->holding, __entry->applying,
__entry->nr_holders, __entry->budget,
__entry->avail_before, __entry->freed_before,
__entry->committing, __entry->exceeded)
);
DEFINE_EVENT(scoutfs_server_commit_users_class, scoutfs_server_commit_hold,
TP_PROTO(struct super_block *sb, int holding, int applying, int nr_holders,
u32 avail_before, u32 freed_before, int committing, int exceeded),
TP_ARGS(sb, holding, applying, nr_holders, avail_before, freed_before, committing, exceeded)
TP_PROTO(struct super_block *sb, int holding, int applying,
int nr_holders, u32 budget,
u32 avail_before, u32 freed_before,
int committing, int exceeded),
TP_ARGS(sb, holding, applying, nr_holders, budget, avail_before, freed_before, committing, exceeded)
);
DEFINE_EVENT(scoutfs_server_commit_users_class, scoutfs_server_commit_apply,
TP_PROTO(struct super_block *sb, int holding, int applying, int nr_holders,
u32 avail_before, u32 freed_before, int committing, int exceeded),
TP_ARGS(sb, holding, applying, nr_holders, avail_before, freed_before, committing, exceeded)
TP_PROTO(struct super_block *sb, int holding, int applying,
int nr_holders, u32 budget,
u32 avail_before, u32 freed_before,
int committing, int exceeded),
TP_ARGS(sb, holding, applying, nr_holders, budget, avail_before, freed_before, committing, exceeded)
);
DEFINE_EVENT(scoutfs_server_commit_users_class, scoutfs_server_commit_start,
TP_PROTO(struct super_block *sb, int holding, int applying, int nr_holders,
u32 avail_before, u32 freed_before, int committing, int exceeded),
TP_ARGS(sb, holding, applying, nr_holders, avail_before, freed_before, committing, exceeded)
TP_PROTO(struct super_block *sb, int holding, int applying,
int nr_holders, u32 budget,
u32 avail_before, u32 freed_before,
int committing, int exceeded),
TP_ARGS(sb, holding, applying, nr_holders, budget, avail_before, freed_before, committing, exceeded)
);
DEFINE_EVENT(scoutfs_server_commit_users_class, scoutfs_server_commit_end,
TP_PROTO(struct super_block *sb, int holding, int applying, int nr_holders,
u32 avail_before, u32 freed_before, int committing, int exceeded),
TP_ARGS(sb, holding, applying, nr_holders, avail_before, freed_before, committing, exceeded)
TP_PROTO(struct super_block *sb, int holding, int applying,
int nr_holders, u32 budget,
u32 avail_before, u32 freed_before,
int committing, int exceeded),
TP_ARGS(sb, holding, applying, nr_holders, budget, avail_before, freed_before, committing, exceeded)
);
#define slt_symbolic(mode) \
@@ -2451,6 +2465,27 @@ TRACE_EVENT(scoutfs_block_dirty_ref,
__entry->block_blkno, __entry->block_seq)
);
TRACE_EVENT(scoutfs_get_file_block,
TP_PROTO(struct super_block *sb, u64 blkno, int flags),
TP_ARGS(sb, blkno, flags),
TP_STRUCT__entry(
SCSB_TRACE_FIELDS
__field(__u64, blkno)
__field(int, flags)
),
TP_fast_assign(
SCSB_TRACE_ASSIGN(sb);
__entry->blkno = blkno;
__entry->flags = flags;
),
TP_printk(SCSBF" blkno %llu flags 0x%x",
SCSB_TRACE_ARGS, __entry->blkno, __entry->flags)
);
TRACE_EVENT(scoutfs_block_stale,
TP_PROTO(struct super_block *sb, struct scoutfs_block_ref *ref,
struct scoutfs_block_header *hdr, u32 magic, u32 crc),
@@ -2491,8 +2526,8 @@ TRACE_EVENT(scoutfs_block_stale,
DECLARE_EVENT_CLASS(scoutfs_block_class,
TP_PROTO(struct super_block *sb, void *bp, u64 blkno, int refcount, int io_count,
unsigned long bits, __u64 accessed),
TP_ARGS(sb, bp, blkno, refcount, io_count, bits, accessed),
unsigned long bits),
TP_ARGS(sb, bp, blkno, refcount, io_count, bits),
TP_STRUCT__entry(
SCSB_TRACE_FIELDS
__field(void *, bp)
@@ -2500,7 +2535,6 @@ DECLARE_EVENT_CLASS(scoutfs_block_class,
__field(int, refcount)
__field(int, io_count)
__field(long, bits)
__field(__u64, accessed)
),
TP_fast_assign(
SCSB_TRACE_ASSIGN(sb);
@@ -2509,71 +2543,65 @@ DECLARE_EVENT_CLASS(scoutfs_block_class,
__entry->refcount = refcount;
__entry->io_count = io_count;
__entry->bits = bits;
__entry->accessed = accessed;
),
TP_printk(SCSBF" bp %p blkno %llu refcount %d io_count %d bits 0x%lx accessed %llu",
TP_printk(SCSBF" bp %p blkno %llu refcount %x io_count %d bits 0x%lx",
SCSB_TRACE_ARGS, __entry->bp, __entry->blkno, __entry->refcount,
__entry->io_count, __entry->bits, __entry->accessed)
__entry->io_count, __entry->bits)
);
DEFINE_EVENT(scoutfs_block_class, scoutfs_block_allocate,
TP_PROTO(struct super_block *sb, void *bp, u64 blkno,
int refcount, int io_count, unsigned long bits,
__u64 accessed),
TP_ARGS(sb, bp, blkno, refcount, io_count, bits, accessed)
int refcount, int io_count, unsigned long bits),
TP_ARGS(sb, bp, blkno, refcount, io_count, bits)
);
DEFINE_EVENT(scoutfs_block_class, scoutfs_block_free,
TP_PROTO(struct super_block *sb, void *bp, u64 blkno,
int refcount, int io_count, unsigned long bits,
__u64 accessed),
TP_ARGS(sb, bp, blkno, refcount, io_count, bits, accessed)
int refcount, int io_count, unsigned long bits),
TP_ARGS(sb, bp, blkno, refcount, io_count, bits)
);
DEFINE_EVENT(scoutfs_block_class, scoutfs_block_insert,
TP_PROTO(struct super_block *sb, void *bp, u64 blkno,
int refcount, int io_count, unsigned long bits,
__u64 accessed),
TP_ARGS(sb, bp, blkno, refcount, io_count, bits, accessed)
int refcount, int io_count, unsigned long bits),
TP_ARGS(sb, bp, blkno, refcount, io_count, bits)
);
DEFINE_EVENT(scoutfs_block_class, scoutfs_block_remove,
TP_PROTO(struct super_block *sb, void *bp, u64 blkno,
int refcount, int io_count, unsigned long bits,
__u64 accessed),
TP_ARGS(sb, bp, blkno, refcount, io_count, bits, accessed)
int refcount, int io_count, unsigned long bits),
TP_ARGS(sb, bp, blkno, refcount, io_count, bits)
);
DEFINE_EVENT(scoutfs_block_class, scoutfs_block_end_io,
TP_PROTO(struct super_block *sb, void *bp, u64 blkno,
int refcount, int io_count, unsigned long bits,
__u64 accessed),
TP_ARGS(sb, bp, blkno, refcount, io_count, bits, accessed)
int refcount, int io_count, unsigned long bits),
TP_ARGS(sb, bp, blkno, refcount, io_count, bits)
);
DEFINE_EVENT(scoutfs_block_class, scoutfs_block_submit,
TP_PROTO(struct super_block *sb, void *bp, u64 blkno,
int refcount, int io_count, unsigned long bits,
__u64 accessed),
TP_ARGS(sb, bp, blkno, refcount, io_count, bits, accessed)
int refcount, int io_count, unsigned long bits),
TP_ARGS(sb, bp, blkno, refcount, io_count, bits)
);
DEFINE_EVENT(scoutfs_block_class, scoutfs_block_invalidate,
TP_PROTO(struct super_block *sb, void *bp, u64 blkno,
int refcount, int io_count, unsigned long bits,
__u64 accessed),
TP_ARGS(sb, bp, blkno, refcount, io_count, bits, accessed)
int refcount, int io_count, unsigned long bits),
TP_ARGS(sb, bp, blkno, refcount, io_count, bits)
);
DEFINE_EVENT(scoutfs_block_class, scoutfs_block_mark_dirty,
TP_PROTO(struct super_block *sb, void *bp, u64 blkno,
int refcount, int io_count, unsigned long bits,
__u64 accessed),
TP_ARGS(sb, bp, blkno, refcount, io_count, bits, accessed)
int refcount, int io_count, unsigned long bits),
TP_ARGS(sb, bp, blkno, refcount, io_count, bits)
);
DEFINE_EVENT(scoutfs_block_class, scoutfs_block_forget,
TP_PROTO(struct super_block *sb, void *bp, u64 blkno,
int refcount, int io_count, unsigned long bits,
__u64 accessed),
TP_ARGS(sb, bp, blkno, refcount, io_count, bits, accessed)
int refcount, int io_count, unsigned long bits),
TP_ARGS(sb, bp, blkno, refcount, io_count, bits)
);
DEFINE_EVENT(scoutfs_block_class, scoutfs_block_shrink,
TP_PROTO(struct super_block *sb, void *bp, u64 blkno,
int refcount, int io_count, unsigned long bits,
__u64 accessed),
TP_ARGS(sb, bp, blkno, refcount, io_count, bits, accessed)
int refcount, int io_count, unsigned long bits),
TP_ARGS(sb, bp, blkno, refcount, io_count, bits)
);
DEFINE_EVENT(scoutfs_block_class, scoutfs_block_isolate,
TP_PROTO(struct super_block *sb, void *bp, u64 blkno,
int refcount, int io_count, unsigned long bits),
TP_ARGS(sb, bp, blkno, refcount, io_count, bits)
);
DECLARE_EVENT_CLASS(scoutfs_ext_next_class,
@@ -3048,6 +3076,27 @@ DEFINE_EVENT(scoutfs_srch_compact_class, scoutfs_srch_compact_client_recv,
TP_ARGS(sb, sc)
);
TRACE_EVENT(scoutfs_ioc_search_xattrs,
TP_PROTO(struct super_block *sb, u64 ino, u64 last_ino),
TP_ARGS(sb, ino, last_ino),
TP_STRUCT__entry(
SCSB_TRACE_FIELDS
__field(u64, ino)
__field(u64, last_ino)
),
TP_fast_assign(
SCSB_TRACE_ASSIGN(sb);
__entry->ino = ino;
__entry->last_ino = last_ino;
),
TP_printk(SCSBF" ino %llu last_ino %llu", SCSB_TRACE_ARGS,
__entry->ino, __entry->last_ino)
);
#endif /* _TRACE_SCOUTFS_H */
/* This part must be outside protection */

View File

@@ -65,6 +65,7 @@ struct commit_users {
struct list_head holding;
struct list_head applying;
unsigned int nr_holders;
u32 budget;
u32 avail_before;
u32 freed_before;
bool committing;
@@ -84,8 +85,9 @@ static void init_commit_users(struct commit_users *cusers)
do { \
__typeof__(cusers) _cusers = (cusers); \
trace_scoutfs_server_commit_##which(sb, !list_empty(&_cusers->holding), \
!list_empty(&_cusers->applying), _cusers->nr_holders, _cusers->avail_before, \
_cusers->freed_before, _cusers->committing, _cusers->exceeded); \
!list_empty(&_cusers->applying), _cusers->nr_holders, _cusers->budget, \
_cusers->avail_before, _cusers->freed_before, _cusers->committing, \
_cusers->exceeded); \
} while (0)
struct server_info {
@@ -303,7 +305,6 @@ static void check_holder_budget(struct super_block *sb, struct server_info *serv
u32 freed_used;
u32 avail_now;
u32 freed_now;
u32 budget;
assert_spin_locked(&cusers->lock);
@@ -318,15 +319,14 @@ static void check_holder_budget(struct super_block *sb, struct server_info *serv
else
freed_used = SCOUTFS_ALLOC_LIST_MAX_BLOCKS - freed_now;
budget = cusers->nr_holders * COMMIT_HOLD_ALLOC_BUDGET;
if (avail_used <= budget && freed_used <= budget)
if (avail_used <= cusers->budget && freed_used <= cusers->budget)
return;
exceeded_once = true;
cusers->exceeded = cusers->nr_holders;
scoutfs_err(sb, "%u holders exceeded alloc budget av: bef %u now %u, fr: bef %u now %u",
cusers->nr_holders, cusers->avail_before, avail_now,
scoutfs_err(sb, "holders exceeded alloc budget %u av: bef %u now %u, fr: bef %u now %u",
cusers->budget, cusers->avail_before, avail_now,
cusers->freed_before, freed_now);
list_for_each_entry(hold, &cusers->holding, entry) {
@@ -349,7 +349,7 @@ static bool hold_commit(struct super_block *sb, struct server_info *server,
{
bool has_room;
bool held;
u32 budget;
u32 new_budget;
u32 av;
u32 fr;
@@ -367,8 +367,8 @@ static bool hold_commit(struct super_block *sb, struct server_info *server,
}
/* +2 for our additional hold and then for the final commit work the server does */
budget = (cusers->nr_holders + 2) * COMMIT_HOLD_ALLOC_BUDGET;
has_room = av >= budget && fr >= budget;
new_budget = max(cusers->budget, (cusers->nr_holders + 2) * COMMIT_HOLD_ALLOC_BUDGET);
has_room = av >= new_budget && fr >= new_budget;
/* checking applying so holders drain once an apply caller starts waiting */
held = !cusers->committing && has_room && list_empty(&cusers->applying);
@@ -388,6 +388,7 @@ static bool hold_commit(struct super_block *sb, struct server_info *server,
list_add_tail(&hold->entry, &cusers->holding);
cusers->nr_holders++;
cusers->budget = new_budget;
} else if (!has_room && cusers->nr_holders == 0 && !cusers->committing) {
cusers->committing = true;
@@ -516,6 +517,7 @@ static void commit_end(struct super_block *sb, struct commit_users *cusers, int
list_for_each_entry_safe(hold, tmp, &cusers->applying, entry)
list_del_init(&hold->entry);
cusers->committing = false;
cusers->budget = 0;
spin_unlock(&cusers->lock);
wake_up(&cusers->waitq);
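The budget change in the hunks above makes the tracked commit budget monotonic: hold_commit() only ever raises cusers->budget through max(), and commit_end() clears it, so the budget can't shrink underneath holders that already sized their work against it. A small userspace sketch of that behavior, with PER_HOLDER_BUDGET as a hypothetical stand-in for COMMIT_HOLD_ALLOC_BUDGET:

#include <stdio.h>

#define PER_HOLDER_BUDGET 32    /* hypothetical stand-in for COMMIT_HOLD_ALLOC_BUDGET */

static unsigned int budget;     /* grows while holders arrive, cleared when the commit ends */

static unsigned int hold(unsigned int nr_holders)
{
        /* +2 mirrors the extra hold plus the server's final commit work */
        unsigned int want = (nr_holders + 2) * PER_HOLDER_BUDGET;

        if (want > budget)
                budget = want;
        return budget;
}

static void commit_end(void)
{
        budget = 0;
}

int main(void)
{
        printf("%u\n", hold(0));        /* 64 */
        printf("%u\n", hold(3));        /* 160 */
        printf("%u\n", hold(1));        /* still 160: never shrinks mid-commit */
        commit_end();
        printf("%u\n", hold(0));        /* 64 again for the next commit */
        return 0;
}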
@@ -608,7 +610,7 @@ static void scoutfs_server_commit_func(struct work_struct *work)
goto out;
if (scoutfs_forcing_unmount(sb)) {
ret = -EIO;
ret = -ENOLINK;
goto out;
}
@@ -992,10 +994,11 @@ static int for_each_rid_last_lt(struct super_block *sb, struct scoutfs_btree_roo
}
/*
* Log merge range items are stored at the starting fs key of the range.
* The only fs key field that doesn't hold information is the zone, so
* we use the zone to differentiate all types that we store in the log
* merge tree.
* Log merge range items are stored at the starting fs key of the range
* with the zone overwritten to indicate the log merge item type. This
* day0 mistake loses sorting information for items in the different
* zones in the fs root, so the range items aren't strictly sorted by
* the starting key of their range.
*/
static void init_log_merge_key(struct scoutfs_key *key, u8 zone, u64 first,
u64 second)
@@ -1027,6 +1030,51 @@ static int next_log_merge_item_key(struct super_block *sb, struct scoutfs_btree_
return ret;
}
/*
* The range items aren't sorted by their range.start because
* _RANGE_ZONE clobbers the range's zone. We sweep all the items and
* find the range with the next least starting key that's greater than
* the caller's starting key. We have to be careful to iterate over the
* log_merge tree keys because the ranges can overlap as they're mapped
* to the log_merge keys by clobbering their zone.
*/
static int next_log_merge_range(struct super_block *sb, struct scoutfs_btree_root *root,
struct scoutfs_key *start, struct scoutfs_log_merge_range *rng)
{
struct scoutfs_log_merge_range *next;
SCOUTFS_BTREE_ITEM_REF(iref);
struct scoutfs_key key;
int ret;
key = *start;
key.sk_zone = SCOUTFS_LOG_MERGE_RANGE_ZONE;
scoutfs_key_set_ones(&rng->start);
do {
ret = scoutfs_btree_next(sb, root, &key, &iref);
if (ret == 0) {
if (iref.key->sk_zone != SCOUTFS_LOG_MERGE_RANGE_ZONE) {
ret = -ENOENT;
} else if (iref.val_len != sizeof(struct scoutfs_log_merge_range)) {
ret = -EIO;
} else {
next = iref.val;
if (scoutfs_key_compare(&next->start, &rng->start) < 0 &&
scoutfs_key_compare(&next->start, start) >= 0)
*rng = *next;
key = *iref.key;
scoutfs_key_inc(&key);
}
scoutfs_btree_put_iref(&iref);
}
} while (ret == 0);
if (ret == -ENOENT && !scoutfs_key_is_ones(&rng->start))
ret = 0;
return ret;
}
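Reduced to plain integers, the selection next_log_merge_range() performs looks like the sketch below: because the range items aren't sorted by their true start, every item is visited and the least start at or after the caller's position wins, with an all-ones sentinel (UINT_MAX here) meaning nothing was found. Names and types are illustrative only:

#include <stdio.h>
#include <limits.h>

/* sweep unsorted starts and return the least one that is >= from */
static unsigned int next_range_start(const unsigned int *starts, int nr, unsigned int from)
{
        unsigned int best = UINT_MAX;   /* "all ones" sentinel: nothing found yet */
        int i;

        for (i = 0; i < nr; i++) {
                if (starts[i] >= from && starts[i] < best)
                        best = starts[i];
        }
        return best;
}

int main(void)
{
        unsigned int starts[] = { 40, 10, 30, 20 };

        printf("%u\n", next_range_start(starts, 4, 15));        /* 20 */
        return 0;
}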
static int next_log_merge_item(struct super_block *sb,
struct scoutfs_btree_root *root,
u8 zone, u64 first, u64 second,
@@ -1038,6 +1086,101 @@ static int next_log_merge_item(struct super_block *sb,
return next_log_merge_item_key(sb, root, zone, &key, val, val_len);
}
static int do_finalize_ours(struct super_block *sb,
struct scoutfs_log_trees *lt,
struct commit_hold *hold)
{
struct server_info *server = SCOUTFS_SB(sb)->server_info;
struct scoutfs_super_block *super = DIRTY_SUPER_SB(sb);
struct scoutfs_key key;
char *err_str = NULL;
u64 rid = le64_to_cpu(lt->rid);
bool more;
int ret;
int err;
mutex_lock(&server->srch_mutex);
ret = scoutfs_srch_rotate_log(sb, &server->alloc, &server->wri,
&super->srch_root, &lt->srch_file, true);
mutex_unlock(&server->srch_mutex);
if (ret < 0) {
scoutfs_err(sb, "error rotating srch log for rid %016llx: %d",
rid, ret);
return ret;
}
do {
more = false;
/*
* All of these can return errors, perhaps indicating successful
* partial progress, after having modified the allocator trees.
* We always have to update the roots in the log item.
*/
mutex_lock(&server->alloc_mutex);
ret = (err_str = "splice meta_freed to other_freed",
scoutfs_alloc_splice_list(sb, &server->alloc,
&server->wri, server->other_freed,
&lt->meta_freed)) ?:
(err_str = "splice meta_avail",
scoutfs_alloc_splice_list(sb, &server->alloc,
&server->wri, server->other_freed,
&lt->meta_avail)) ?:
(err_str = "empty data_avail",
alloc_move_empty(sb, &super->data_alloc,
&lt->data_avail,
COMMIT_HOLD_ALLOC_BUDGET / 2)) ?:
(err_str = "empty data_freed",
alloc_move_empty(sb, &super->data_alloc,
&lt->data_freed,
COMMIT_HOLD_ALLOC_BUDGET / 2));
mutex_unlock(&server->alloc_mutex);
/*
* only finalize, allowing merging, once the allocators are
* fully freed
*/
if (ret == 0) {
/* the transaction is no longer open */
le64_add_cpu(&lt->flags, SCOUTFS_LOG_TREES_FINALIZED);
lt->finalize_seq = cpu_to_le64(scoutfs_server_next_seq(sb));
}
scoutfs_key_init_log_trees(&key, rid, le64_to_cpu(lt->nr));
err = scoutfs_btree_update(sb, &server->alloc, &server->wri,
&super->logs_root, &key, lt,
sizeof(*lt));
BUG_ON(err != 0); /* alloc, log, srch items out of sync */
if (ret == -EINPROGRESS) {
more = true;
mutex_unlock(&server->logs_mutex);
ret = server_apply_commit(sb, hold, 0);
if (ret < 0)
WARN_ON_ONCE(ret < 0);
server_hold_commit(sb, hold);
mutex_lock(&server->logs_mutex);
} else if (ret == 0) {
memset(&lt->item_root, 0, sizeof(lt->item_root));
memset(&lt->bloom_ref, 0, sizeof(lt->bloom_ref));
lt->inode_count_delta = 0;
lt->max_item_seq = 0;
lt->finalize_seq = 0;
le64_add_cpu(&lt->nr, 1);
lt->flags = 0;
}
} while (more);
if (ret < 0) {
scoutfs_err(sb,
"error %d finalizing log trees for rid %016llx: %s",
ret, rid, err_str);
}
return ret;
}
/*
* Finalizing the log btrees for merging needs to be done carefully so
* that items don't appear to go backwards in time.
@@ -1089,7 +1232,6 @@ static int finalize_and_start_log_merge(struct super_block *sb, struct scoutfs_l
struct scoutfs_log_merge_range rng;
struct scoutfs_mount_options opts;
struct scoutfs_log_trees each_lt;
struct scoutfs_log_trees fin;
unsigned int delay_ms;
unsigned long timeo;
bool saw_finalized;
@@ -1160,6 +1302,7 @@ static int finalize_and_start_log_merge(struct super_block *sb, struct scoutfs_l
/* done if we're not finalizing and there's no finalized */
if (!finalize_ours && !saw_finalized) {
ret = 0;
scoutfs_inc_counter(sb, log_merge_no_finalized);
break;
}
@@ -1194,32 +1337,11 @@ static int finalize_and_start_log_merge(struct super_block *sb, struct scoutfs_l
/* Finalize ours if it's visible to others */
if (ours_visible) {
fin = *lt;
memset(&fin.meta_avail, 0, sizeof(fin.meta_avail));
memset(&fin.meta_freed, 0, sizeof(fin.meta_freed));
memset(&fin.data_avail, 0, sizeof(fin.data_avail));
memset(&fin.data_freed, 0, sizeof(fin.data_freed));
memset(&fin.srch_file, 0, sizeof(fin.srch_file));
le64_add_cpu(&fin.flags, SCOUTFS_LOG_TREES_FINALIZED);
fin.finalize_seq = cpu_to_le64(scoutfs_server_next_seq(sb));
scoutfs_key_init_log_trees(&key, le64_to_cpu(fin.rid),
le64_to_cpu(fin.nr));
ret = scoutfs_btree_update(sb, &server->alloc, &server->wri,
&super->logs_root, &key, &fin,
sizeof(fin));
ret = do_finalize_ours(sb, lt, hold);
if (ret < 0) {
err_str = "updating finalized log_trees";
err_str = "finalizing ours";
break;
}
memset(&lt->item_root, 0, sizeof(lt->item_root));
memset(&lt->bloom_ref, 0, sizeof(lt->bloom_ref));
lt->inode_count_delta = 0;
lt->max_item_seq = 0;
lt->finalize_seq = 0;
le64_add_cpu(&lt->nr, 1);
lt->flags = 0;
}
/* wait a bit for mounts to arrive */
@@ -1606,6 +1728,7 @@ static int server_commit_log_trees(struct super_block *sb,
int ret;
if (arg_len != sizeof(struct scoutfs_log_trees)) {
err_str = "invalid message log_trees size";
ret = -EINVAL;
goto out;
}
@@ -1669,7 +1792,7 @@ static int server_commit_log_trees(struct super_block *sb,
ret = scoutfs_btree_update(sb, &server->alloc, &server->wri,
&super->logs_root, &key, &lt, sizeof(lt));
BUG_ON(ret < 0); /* dirtying should have guaranteed success */
BUG_ON(ret < 0); /* dirtying should have guaranteed success, srch item inconsistent */
if (ret < 0)
err_str = "updating log trees item";
@@ -1677,11 +1800,10 @@ unlock:
mutex_unlock(&server->logs_mutex);
ret = server_apply_commit(sb, &hold, ret);
if (ret < 0)
scoutfs_err(sb, "server error %d committing client logs for rid %016llx: %s",
ret, rid, err_str);
out:
WARN_ON_ONCE(ret < 0);
if (ret < 0)
scoutfs_err(sb, "server error %d committing client logs for rid %016llx, nr %llu: %s",
ret, rid, le64_to_cpu(lt.nr), err_str);
return scoutfs_net_response(sb, conn, cmd, id, ret, NULL, 0);
}
@@ -1814,6 +1936,9 @@ static int reclaim_open_log_tree(struct super_block *sb, u64 rid)
out:
mutex_unlock(&server->logs_mutex);
if (ret == 0)
scoutfs_inc_counter(sb, reclaimed_open_logs);
if (ret < 0 && ret != -EINPROGRESS)
scoutfs_err(sb, "server error %d reclaiming log trees for rid %016llx: %s",
ret, rid, err_str);
@@ -2015,7 +2140,7 @@ static int server_srch_get_compact(struct super_block *sb,
apply:
ret = server_apply_commit(sb, &hold, ret);
WARN_ON_ONCE(ret < 0 && ret != -ENOENT); /* XXX leaked busy item */
WARN_ON_ONCE(ret < 0 && ret != -ENOENT && ret != -ENOLINK); /* XXX leaked busy item */
out:
ret = scoutfs_net_response(sb, conn, cmd, id, ret,
sc, sizeof(struct scoutfs_srch_compact));
@@ -2055,7 +2180,7 @@ static int server_srch_commit_compact(struct super_block *sb,
&super->srch_root, rid, sc,
&av, &fr);
mutex_unlock(&server->srch_mutex);
if (ret < 0) /* XXX very bad, leaks allocators */
if (ret < 0)
goto apply;
/* reclaim allocators if they were set by _srch_commit_ */
@@ -2065,10 +2190,10 @@ static int server_srch_commit_compact(struct super_block *sb,
scoutfs_alloc_splice_list(sb, &server->alloc, &server->wri,
server->other_freed, &fr);
mutex_unlock(&server->alloc_mutex);
WARN_ON(ret < 0); /* XXX leaks allocators */
apply:
ret = server_apply_commit(sb, &hold, ret);
out:
WARN_ON(ret < 0); /* XXX leaks allocators */
return scoutfs_net_response(sb, conn, cmd, id, ret, NULL, 0);
}
@@ -2393,10 +2518,9 @@ out:
}
}
if (ret < 0)
scoutfs_err(sb, "server error %d splicing log merge completion: %s", ret, err_str);
BUG_ON(ret); /* inconsistent */
/* inconsistent */
scoutfs_bug_on_err(sb, ret,
"server error %d splicing log merge completion: %s", ret, err_str);
return ret ?: einprogress;
}
@@ -2531,7 +2655,7 @@ static void server_log_merge_free_work(struct work_struct *work)
ret = scoutfs_btree_free_blocks(sb, &server->alloc,
&server->wri, &fr.key,
&fr.root, COMMIT_HOLD_ALLOC_BUDGET / 2);
&fr.root, COMMIT_HOLD_ALLOC_BUDGET / 8);
if (ret < 0) {
err_str = "freeing log btree";
break;
@@ -2550,7 +2674,7 @@ static void server_log_merge_free_work(struct work_struct *work)
/* freed blocks are in allocator, we *have* to update fr */
BUG_ON(ret < 0);
if (server_hold_alloc_used_since(sb, &hold) >= COMMIT_HOLD_ALLOC_BUDGET / 2) {
if (server_hold_alloc_used_since(sb, &hold) >= (COMMIT_HOLD_ALLOC_BUDGET * 3) / 4) {
mutex_unlock(&server->logs_mutex);
ret = server_apply_commit(sb, &hold, ret);
commit = false;
@@ -2641,10 +2765,7 @@ restart:
/* find the next range, always checking for splicing */
for (;;) {
key = stat.next_range_key;
key.sk_zone = SCOUTFS_LOG_MERGE_RANGE_ZONE;
ret = next_log_merge_item_key(sb, &super->log_merge, SCOUTFS_LOG_MERGE_RANGE_ZONE,
&key, &rng, sizeof(rng));
ret = next_log_merge_range(sb, &super->log_merge, &stat.next_range_key, &rng);
if (ret < 0 && ret != -ENOENT) {
err_str = "finding merge range item";
goto out;

45
kmod/src/sparse-filtered.sh Executable file
View File

@@ -0,0 +1,45 @@
#!/bin/bash
#
# Unfortunately, kernels can ship which contain sparse errors that are
# unrelated to us.
#
# The exit status of this filtering wrapper will indicate an error if
# sparse wasn't found or if there were any unfiltered output lines. It
# can hide error exit status from sparse or grep if they don't produce
# output that makes it past the filters.
#
# must have sparse. Fail with error message, mask success path.
which sparse > /dev/null || exit 1
# initial unmatchable, additional added as RE+="|..."
RE="$^"
#
# Darn. sparse has multi-line error messages, and I'd rather not bother
# with multi-line filters. So we'll just drop this context.
#
# command-line: note: in included file (through include/linux/netlink.h, include/linux/ethtool.h, include/linux/netdevice.h, include/net/sock.h, /root/scoutfs/kmod/src/kernelcompat.h, builtin):
# fprintf(stderr, "%s: note: in included file%s:\n",
#
RE+="|: note: in included file"
# 3.10.0-1160.119.1.el7.x86_64.debug
# include/linux/posix_acl.h:138:9: warning: incorrect type in assignment (different address spaces)
# include/linux/posix_acl.h:138:9: expected struct posix_acl *<noident>
# include/linux/posix_acl.h:138:9: got struct posix_acl [noderef] <asn:4>*<noident>
RE+="|include/linux/posix_acl.h:"
# 3.10.0-1160.119.1.el7.x86_64.debug
#include/uapi/linux/perf_event.h:146:56: warning: cast truncates bits from constant value (8000000000000000 becomes 0)
RE+="|include/uapi/linux/perf_event.h:"
# 4.18.0-513.24.1.el8_9.x86_64+debug'
#./include/linux/skbuff.h:824:1: warning: directive in macro's argument list
RE+="|include/linux/skbuff.h:"
sparse "$@" |& \
grep -E -v "($RE)" |& \
awk '{ print $0 } END { exit NR > 0 }'
exit $?

View File

@@ -62,7 +62,7 @@
* re-allocated and re-written. Search can restart by checking the
* btree for the current set of files. Compaction reads log files which
* are protected from other compactions by the persistent busy items
* created by the server. Compaction won't see it's blocks reused out
* created by the server. Compaction won't see its blocks reused out
* from under it, but it can encounter stale cached blocks that need to
* be invalidated.
*/
@@ -442,6 +442,10 @@ out:
if (ret == 0 && (flags & GFB_INSERT) && blk >= le64_to_cpu(sfl->blocks))
sfl->blocks = cpu_to_le64(blk + 1);
if (bl) {
trace_scoutfs_get_file_block(sb, bl->blkno, flags);
}
*bl_ret = bl;
return ret;
}
@@ -533,23 +537,35 @@ out:
* the pairs cancel each other out by all readers (the second encoding
* looks like deletion) so they aren't visible to the first/last bounds of
* the block or file.
*
* We use the same entry repeatedly, so the diff between them will be empty.
* This lets us just emit the two-byte count word, leaving the other bytes
* as zero.
*
* Split the desired total len into two pieces, adding any remainder to the
* first four-bit value.
*/
static int append_padded_entry(struct scoutfs_srch_file *sfl, u64 blk,
struct scoutfs_srch_block *srb, struct scoutfs_srch_entry *sre)
static void append_padded_entry(struct scoutfs_srch_file *sfl,
struct scoutfs_srch_block *srb,
int len)
{
int ret;
int each;
int rem;
u16 lengths = 0;
u8 *buf = srb->entries + le32_to_cpu(srb->entry_bytes);
ret = encode_entry(srb->entries + le32_to_cpu(srb->entry_bytes),
sre, &srb->tail);
if (ret > 0) {
srb->tail = *sre;
le32_add_cpu(&srb->entry_nr, 1);
le32_add_cpu(&srb->entry_bytes, ret);
le64_add_cpu(&sfl->entries, 1);
ret = 0;
}
each = (len - 2) >> 1;
rem = (len - 2) & 1;
return ret;
lengths |= each + rem;
lengths |= each << 4;
memset(buf, 0, len);
put_unaligned_le16(lengths, buf);
le32_add_cpu(&srb->entry_nr, 1);
le32_add_cpu(&srb->entry_bytes, len);
le64_add_cpu(&sfl->entries, 1);
}
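A worked example of the count word built above: the two nibbles split len - 2 between the pair of identical entries, with any odd byte going to the first (low) nibble, so len = 10 encodes as 0x44 and len = 9 as 0x34, followed in either case by zero bytes. A standalone sketch of just that packing:

#include <stdio.h>

static unsigned int padded_lengths(int len)
{
        int each = (len - 2) >> 1;
        int rem = (len - 2) & 1;

        /* low nibble is the first entry's length, high nibble the second's */
        return (unsigned int)((each + rem) | (each << 4));
}

int main(void)
{
        printf("0x%02x 0x%02x\n", padded_lengths(10), padded_lengths(9));       /* 0x44 0x34 */
        return 0;
}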
/*
@@ -560,61 +576,41 @@ static int append_padded_entry(struct scoutfs_srch_file *sfl, u64 blk,
* This is called when there is a single existing entry in the block.
* We have the entire block to work with. We encode pairs of matching
* entries. This hides them from readers (both searches and merging) as
* they're interpreted as creation and deletion and are deleted. We use
* the existing hash value of the first entry in the block but then set
* the inode to an impossibly large number so it doesn't interfere with
* anything.
* they're interpreted as creation and deletion and are deleted.
*
* To hit the specific offset we very carefully manage the amount of
* bytes of change between fields in the entry. We know that if we
* change all the byte of the ino and id we end up with a 20 byte
* (2+8+8,2) encoding of the pair of entries. To have the last entry
* start at the _SAFE_POS offset we know that the final 20 byte pair
* encoding needs to end at 2 bytes (second entry encoding) after the
* _SAFE_POS offset.
* For simplicity and to maintain sort ordering within the block, we reuse
* the existing entry. This lets us skip the encoding step, because we know
* the diff will be zero. We can zero-pad the resulting entries to hit the
* target offset exactly.
*
* So as we encode pairs we watch the delta of our current offset from
* that desired final offset of 2 past _SAFE_POS. If we're a multiple
* of 20 away then we encode the full 20 byte pairs. If we're not, then
* we drop a byte to encode 19 bytes. That'll slowly change the offset
* to be a multiple of 20 again while encoding large entries.
* Because we can't predict the exact number of entry_bytes when we start,
* we adjust the byte count of subsequent entries until we wind up at a
* multiple of 20 bytes away from our goal and then use that length for
* the remaining entries.
*
* We could just use a single pair of unnaturally large entries to consume
* the needed space, adjusting for an odd number of entry_bytes if necessary.
* The use of 19 or 20 bytes for the entry pair matches what we would see with
* real (non-zero) entries that vary from the existing entry.
*/
static void pad_entries_at_safe(struct scoutfs_srch_file *sfl, u64 blk,
static void pad_entries_at_safe(struct scoutfs_srch_file *sfl,
struct scoutfs_srch_block *srb)
{
struct scoutfs_srch_entry sre;
u32 target;
s32 diff;
u64 hash;
u64 ino;
u64 id;
int ret;
hash = le64_to_cpu(srb->tail.hash);
ino = le64_to_cpu(srb->tail.ino) | (1ULL << 62);
id = le64_to_cpu(srb->tail.id);
target = SCOUTFS_SRCH_BLOCK_SAFE_BYTES + 2;
while ((diff = target - le32_to_cpu(srb->entry_bytes)) > 0) {
ino ^= 1ULL << (7 * 8);
append_padded_entry(sfl, srb, 10);
if (diff % 20 == 0) {
id ^= 1ULL << (7 * 8);
append_padded_entry(sfl, srb, 10);
} else {
id ^= 1ULL << (6 * 8);
append_padded_entry(sfl, srb, 9);
}
sre.hash = cpu_to_le64(hash);
sre.ino = cpu_to_le64(ino);
sre.id = cpu_to_le64(id);
ret = append_padded_entry(sfl, blk, srb, &sre);
if (ret == 0)
ret = append_padded_entry(sfl, blk, srb, &sre);
BUG_ON(ret != 0);
diff = target - le32_to_cpu(srb->entry_bytes);
}
WARN_ON_ONCE(diff != 0);
}
/*
@@ -749,14 +745,14 @@ static int search_log_file(struct super_block *sb,
for (i = 0; i < le32_to_cpu(srb->entry_nr); i++) {
if (pos > SCOUTFS_SRCH_BLOCK_SAFE_BYTES) {
/* can only be inconsistency :/ */
ret = EIO;
ret = -EIO;
break;
}
ret = decode_entry(srb->entries + pos, &sre, &prev);
if (ret <= 0) {
/* can only be inconsistency :/ */
ret = EIO;
ret = -EIO;
break;
}
pos += ret;
@@ -859,15 +855,15 @@ static int search_sorted_file(struct super_block *sb,
if (pos > SCOUTFS_SRCH_BLOCK_SAFE_BYTES) {
/* can only be inconsistency :/ */
ret = EIO;
break;
ret = -EIO;
goto out;
}
ret = decode_entry(srb->entries + pos, &sre, &prev);
if (ret <= 0) {
/* can only be inconsistency :/ */
ret = EIO;
break;
ret = -EIO;
goto out;
}
pos += ret;
prev = sre;
@@ -972,6 +968,8 @@ int scoutfs_srch_search_xattrs(struct super_block *sb,
scoutfs_inc_counter(sb, srch_search_xattrs);
trace_scoutfs_ioc_search_xattrs(sb, ino, last_ino);
*done = false;
srch_init_rb_root(sroot);
@@ -1408,7 +1406,7 @@ int scoutfs_srch_commit_compact(struct super_block *sb,
ret = -EIO;
scoutfs_btree_put_iref(&iref);
}
if (ret < 0) /* XXX leaks allocators */
if (ret < 0)
goto out;
/* restore busy to pending if the operation failed */
@@ -1428,10 +1426,8 @@ int scoutfs_srch_commit_compact(struct super_block *sb,
/* update file references if we finished compaction (!deleting) */
if (!(res->flags & SCOUTFS_SRCH_COMPACT_FLAG_DELETE)) {
ret = commit_files(sb, alloc, wri, root, res);
if (ret < 0) {
/* XXX we can't commit, shutdown? */
if (ret < 0)
goto out;
}
/* transition flags for deleting input files */
for (i = 0; i < res->nr; i++) {
@@ -1458,7 +1454,7 @@ update:
le64_to_cpu(pending->id), 0);
ret = scoutfs_btree_insert(sb, alloc, wri, root, &key,
pending, sizeof(*pending));
if (ret < 0)
if (WARN_ON_ONCE(ret < 0)) /* XXX inconsistency */
goto out;
}
@@ -1471,7 +1467,6 @@ update:
BUG_ON(err); /* both busy and pending present */
}
out:
WARN_ON_ONCE(ret < 0); /* XXX inconsistency */
kfree(busy);
return ret;
}
@@ -1669,7 +1664,7 @@ static int kway_merge(struct super_block *sb,
/* end sorted block on _SAFE offset for testing */
if (bl && le32_to_cpu(srb->entry_nr) == 1 && logs_input &&
scoutfs_trigger(sb, SRCH_COMPACT_LOGS_PAD_SAFE)) {
pad_entries_at_safe(sfl, blk, srb);
pad_entries_at_safe(sfl, srb);
scoutfs_block_put(sb, bl);
bl = NULL;
blk++;
@@ -1802,7 +1797,7 @@ static void swap_page_sre(void *A, void *B, int size)
* typically, ~10x worst case).
*
* Because we read and sort all the input files we must perform the full
* compaction in one operation. The server must have given us a
* compaction in one operation. The server must have given us
* sufficiently large avail/freed lists, otherwise we'll return ENOSPC.
*/
static int compact_logs(struct super_block *sb,
@@ -1866,14 +1861,14 @@ static int compact_logs(struct super_block *sb,
if (pos > SCOUTFS_SRCH_BLOCK_SAFE_BYTES) {
/* can only be inconsistency :/ */
ret = EIO;
break;
ret = -EIO;
goto out;
}
ret = decode_entry(srb->entries + pos, sre, &prev);
if (ret <= 0) {
/* can only be inconsistency :/ */
ret = EIO;
ret = -EIO;
goto out;
}
prev = *sre;
@@ -2281,12 +2276,11 @@ static void scoutfs_srch_compact_worker(struct work_struct *work)
} else {
ret = -EINVAL;
}
if (ret < 0)
goto commit;
ret = scoutfs_alloc_prepare_commit(sb, &alloc, &wri) ?:
scoutfs_alloc_prepare_commit(sb, &alloc, &wri);
if (ret == 0)
scoutfs_block_writer_write(sb, &wri);
commit:
/* the server won't use our partial compact if _ERROR is set */
sc->meta_avail = alloc.avail;
sc->meta_freed = alloc.freed;
@@ -2303,7 +2297,7 @@ out:
scoutfs_inc_counter(sb, srch_compact_error);
scoutfs_block_writer_forget_all(sb, &wri);
queue_compact_work(srinf, sc->nr > 0 && ret == 0);
queue_compact_work(srinf, sc != NULL && sc->nr > 0 && ret == 0);
kfree(sc);
}

View File

@@ -512,9 +512,9 @@ static int scoutfs_fill_super(struct super_block *sb, void *data, int silent)
sbi = kzalloc(sizeof(struct scoutfs_sb_info), GFP_KERNEL);
sb->s_fs_info = sbi;
sbi->sb = sb;
if (!sbi)
return -ENOMEM;
sbi->sb = sb;
ret = assign_random_id(sbi);
if (ret < 0)

View File

@@ -196,7 +196,7 @@ static int retry_forever(struct super_block *sb, int (*func)(struct super_block
}
if (scoutfs_forcing_unmount(sb)) {
ret = -EIO;
ret = -ENOLINK;
break;
}
@@ -252,7 +252,7 @@ void scoutfs_trans_write_func(struct work_struct *work)
}
if (scoutfs_forcing_unmount(sb)) {
ret = -EIO;
ret = -ENOLINK;
goto out;
}

View File

@@ -742,7 +742,7 @@ int scoutfs_xattr_set_locked(struct inode *inode, const char *name, size_t name_
int ret;
int err;
trace_scoutfs_xattr_set(sb, name_len, value, size, flags);
trace_scoutfs_xattr_set(sb, ino, name_len, value, size, flags);
if (WARN_ON_ONCE(tgs->totl && tgs->indx) ||
WARN_ON_ONCE((tgs->totl | tgs->indx) && !tag_lock))

View File

@@ -117,6 +117,7 @@ used during the test.
| T\_NR\_MOUNTS | number of mounts | -n | 3 |
| T\_O[0-9] | mount options | created per run | -o server\_addr= |
| T\_QUORUM | quorum count | -q | 2 |
| T\_EXTRA | per-test file dir | revision ctled | tests/extra/t |
| T\_TMP | per-test tmp prefix | made for test | results/tmp/t/tmp |
| T\_TMPDIR | per-test tmp dir dir | made for test | results/tmp/t |

View File

@@ -0,0 +1,882 @@
Ran:
generic/001
generic/002
generic/004
generic/005
generic/006
generic/007
generic/008
generic/009
generic/011
generic/012
generic/013
generic/014
generic/015
generic/016
generic/018
generic/020
generic/021
generic/022
generic/023
generic/024
generic/025
generic/026
generic/028
generic/029
generic/030
generic/031
generic/032
generic/033
generic/034
generic/035
generic/037
generic/039
generic/040
generic/041
generic/050
generic/052
generic/053
generic/056
generic/057
generic/058
generic/059
generic/060
generic/061
generic/062
generic/063
generic/064
generic/065
generic/066
generic/067
generic/069
generic/070
generic/071
generic/073
generic/076
generic/078
generic/079
generic/080
generic/081
generic/082
generic/084
generic/086
generic/087
generic/088
generic/090
generic/091
generic/092
generic/094
generic/096
generic/097
generic/098
generic/099
generic/101
generic/104
generic/105
generic/106
generic/107
generic/110
generic/111
generic/113
generic/114
generic/115
generic/116
generic/117
generic/118
generic/119
generic/120
generic/121
generic/122
generic/123
generic/124
generic/126
generic/128
generic/129
generic/130
generic/131
generic/134
generic/135
generic/136
generic/138
generic/139
generic/140
generic/141
generic/142
generic/143
generic/144
generic/145
generic/146
generic/147
generic/148
generic/149
generic/150
generic/151
generic/152
generic/153
generic/154
generic/155
generic/156
generic/157
generic/158
generic/159
generic/160
generic/161
generic/162
generic/163
generic/169
generic/171
generic/172
generic/173
generic/174
generic/177
generic/178
generic/179
generic/180
generic/181
generic/182
generic/183
generic/184
generic/185
generic/188
generic/189
generic/190
generic/191
generic/193
generic/194
generic/195
generic/196
generic/197
generic/198
generic/199
generic/200
generic/201
generic/202
generic/203
generic/205
generic/206
generic/207
generic/210
generic/211
generic/212
generic/214
generic/215
generic/216
generic/217
generic/218
generic/219
generic/220
generic/221
generic/222
generic/223
generic/225
generic/227
generic/228
generic/229
generic/230
generic/235
generic/236
generic/237
generic/238
generic/240
generic/244
generic/245
generic/246
generic/247
generic/248
generic/249
generic/250
generic/252
generic/253
generic/254
generic/255
generic/256
generic/257
generic/258
generic/259
generic/260
generic/261
generic/262
generic/263
generic/264
generic/265
generic/266
generic/267
generic/268
generic/271
generic/272
generic/276
generic/277
generic/278
generic/279
generic/281
generic/282
generic/283
generic/284
generic/286
generic/287
generic/288
generic/289
generic/290
generic/291
generic/292
generic/293
generic/294
generic/295
generic/296
generic/301
generic/302
generic/303
generic/304
generic/305
generic/306
generic/307
generic/308
generic/309
generic/312
generic/313
generic/314
generic/315
generic/316
generic/317
generic/319
generic/322
generic/324
generic/325
generic/326
generic/327
generic/328
generic/329
generic/330
generic/331
generic/332
generic/335
generic/336
generic/337
generic/341
generic/342
generic/343
generic/346
generic/348
generic/353
generic/355
generic/358
generic/359
generic/360
generic/361
generic/362
generic/363
generic/364
generic/365
generic/366
generic/367
generic/368
generic/369
generic/370
generic/371
generic/372
generic/373
generic/374
generic/375
generic/376
generic/377
generic/378
generic/379
generic/380
generic/381
generic/382
generic/383
generic/384
generic/385
generic/386
generic/389
generic/391
generic/392
generic/393
generic/394
generic/395
generic/396
generic/397
generic/398
generic/400
generic/401
generic/402
generic/403
generic/404
generic/406
generic/407
generic/408
generic/412
generic/413
generic/414
generic/417
generic/419
generic/420
generic/421
generic/422
generic/424
generic/425
generic/426
generic/427
generic/428
generic/436
generic/437
generic/439
generic/440
generic/443
generic/445
generic/446
generic/448
generic/449
generic/450
generic/451
generic/452
generic/453
generic/454
generic/456
generic/458
generic/460
generic/462
generic/463
generic/465
generic/466
generic/468
generic/469
generic/470
generic/471
generic/474
generic/477
generic/478
generic/479
generic/480
generic/481
generic/483
generic/485
generic/486
generic/487
generic/488
generic/489
generic/490
generic/491
generic/492
generic/498
generic/499
generic/501
generic/502
generic/503
generic/504
generic/505
generic/506
generic/507
generic/508
generic/509
generic/510
generic/511
generic/512
generic/513
generic/514
generic/515
generic/516
generic/517
generic/518
generic/519
generic/520
generic/523
generic/524
generic/525
generic/526
generic/527
generic/528
generic/529
generic/530
generic/531
generic/533
generic/534
generic/535
generic/536
generic/537
generic/538
generic/539
generic/540
generic/541
generic/542
generic/543
generic/544
generic/545
generic/546
generic/547
generic/548
generic/549
generic/550
generic/552
generic/553
generic/555
generic/556
generic/557
generic/566
generic/567
generic/571
generic/572
generic/573
generic/574
generic/575
generic/576
generic/577
generic/578
generic/580
generic/581
generic/582
generic/583
generic/584
generic/586
generic/587
generic/588
generic/591
generic/592
generic/593
generic/594
generic/595
generic/596
generic/597
generic/598
generic/599
generic/600
generic/601
generic/602
generic/603
generic/604
generic/605
generic/606
generic/607
generic/608
generic/609
generic/610
generic/611
generic/612
generic/613
generic/614
generic/618
generic/621
generic/623
generic/624
generic/625
generic/626
generic/628
generic/629
generic/630
generic/632
generic/634
generic/635
generic/637
generic/638
generic/639
generic/640
generic/644
generic/645
generic/646
generic/647
generic/651
generic/652
generic/653
generic/654
generic/655
generic/657
generic/658
generic/659
generic/660
generic/661
generic/662
generic/663
generic/664
generic/665
generic/666
generic/667
generic/668
generic/669
generic/673
generic/674
generic/675
generic/676
generic/677
generic/678
generic/679
generic/680
generic/681
generic/682
generic/683
generic/684
generic/685
generic/686
generic/687
generic/688
generic/689
shared/002
shared/032
Not run:
generic/008
generic/009
generic/012
generic/015
generic/016
generic/018
generic/021
generic/022
generic/025
generic/026
generic/031
generic/033
generic/050
generic/052
generic/058
generic/059
generic/060
generic/061
generic/063
generic/064
generic/078
generic/079
generic/081
generic/082
generic/091
generic/094
generic/096
generic/110
generic/111
generic/113
generic/114
generic/115
generic/116
generic/118
generic/119
generic/121
generic/122
generic/123
generic/128
generic/130
generic/134
generic/135
generic/136
generic/138
generic/139
generic/140
generic/142
generic/143
generic/144
generic/145
generic/146
generic/147
generic/148
generic/149
generic/150
generic/151
generic/152
generic/153
generic/154
generic/155
generic/156
generic/157
generic/158
generic/159
generic/160
generic/161
generic/162
generic/163
generic/171
generic/172
generic/173
generic/174
generic/177
generic/178
generic/179
generic/180
generic/181
generic/182
generic/183
generic/185
generic/188
generic/189
generic/190
generic/191
generic/193
generic/194
generic/195
generic/196
generic/197
generic/198
generic/199
generic/200
generic/201
generic/202
generic/203
generic/205
generic/206
generic/207
generic/210
generic/211
generic/212
generic/214
generic/216
generic/217
generic/218
generic/219
generic/220
generic/222
generic/223
generic/225
generic/227
generic/229
generic/230
generic/235
generic/238
generic/240
generic/244
generic/250
generic/252
generic/253
generic/254
generic/255
generic/256
generic/259
generic/260
generic/261
generic/262
generic/263
generic/264
generic/265
generic/266
generic/267
generic/268
generic/271
generic/272
generic/276
generic/277
generic/278
generic/279
generic/281
generic/282
generic/283
generic/284
generic/287
generic/288
generic/289
generic/290
generic/291
generic/292
generic/293
generic/295
generic/296
generic/301
generic/302
generic/303
generic/304
generic/305
generic/312
generic/314
generic/316
generic/317
generic/324
generic/326
generic/327
generic/328
generic/329
generic/330
generic/331
generic/332
generic/353
generic/355
generic/358
generic/359
generic/361
generic/362
generic/363
generic/364
generic/365
generic/366
generic/367
generic/368
generic/369
generic/370
generic/371
generic/372
generic/373
generic/374
generic/378
generic/379
generic/380
generic/381
generic/382
generic/383
generic/384
generic/385
generic/386
generic/391
generic/392
generic/395
generic/396
generic/397
generic/398
generic/400
generic/402
generic/404
generic/406
generic/407
generic/408
generic/412
generic/413
generic/414
generic/417
generic/419
generic/420
generic/421
generic/422
generic/424
generic/425
generic/427
generic/439
generic/440
generic/446
generic/449
generic/450
generic/451
generic/453
generic/454
generic/456
generic/458
generic/462
generic/463
generic/465
generic/466
generic/468
generic/469
generic/470
generic/471
generic/474
generic/485
generic/487
generic/488
generic/491
generic/492
generic/499
generic/501
generic/503
generic/505
generic/506
generic/507
generic/508
generic/511
generic/513
generic/514
generic/515
generic/516
generic/517
generic/518
generic/519
generic/520
generic/528
generic/530
generic/536
generic/537
generic/538
generic/539
generic/540
generic/541
generic/542
generic/543
generic/544
generic/545
generic/546
generic/548
generic/549
generic/550
generic/552
generic/553
generic/555
generic/556
generic/566
generic/567
generic/572
generic/573
generic/574
generic/575
generic/576
generic/577
generic/578
generic/580
generic/581
generic/582
generic/583
generic/584
generic/586
generic/587
generic/588
generic/591
generic/592
generic/593
generic/594
generic/595
generic/596
generic/597
generic/598
generic/599
generic/600
generic/601
generic/602
generic/603
generic/605
generic/606
generic/607
generic/608
generic/609
generic/610
generic/612
generic/613
generic/621
generic/623
generic/624
generic/625
generic/626
generic/628
generic/629
generic/630
generic/635
generic/644
generic/645
generic/646
generic/647
generic/651
generic/652
generic/653
generic/654
generic/655
generic/657
generic/658
generic/659
generic/660
generic/661
generic/662
generic/663
generic/664
generic/665
generic/666
generic/667
generic/668
generic/669
generic/673
generic/674
generic/675
generic/677
generic/678
generic/679
generic/680
generic/681
generic/682
generic/683
generic/684
generic/685
generic/686
generic/687
generic/688
generic/689
shared/002
shared/032
Passed all 512 tests

View File

@@ -0,0 +1,44 @@
generic/003 # missing atime update in buffered read
generic/075 # file content mismatch failures (fds, etc)
generic/103 # enospc causes trans commit failures
generic/108 # mount fails on failing device?
generic/112 # file content mismatch failures (fds, etc)
generic/213 # enospc causes trans commit failures
generic/318 # can't support user namespaces until v5.11
generic/321 # requires selinux enabled for '+' in ls?
generic/338 # BUG_ON update inode error handling
generic/347 # _dmthin_mount doesn't work?
generic/356 # swap
generic/357 # swap
generic/409 # bind mounts not scripted yet
generic/410 # bind mounts not scripted yet
generic/411 # bind mounts not scripted yet
generic/423 # symlink inode size is strlen() + 1 on scoutfs
generic/430 # xfs_io copy_range missing in el7
generic/431 # xfs_io copy_range missing in el7
generic/432 # xfs_io copy_range missing in el7
generic/433 # xfs_io copy_range missing in el7
generic/434 # xfs_io copy_range missing in el7
generic/441 # dm-mapper
generic/444 # el9's posix_acl_update_mode is buggy ?
generic/467 # open_by_handle ESTALE
generic/472 # swap
generic/484 # dm-mapper
generic/493 # swap
generic/494 # swap
generic/495 # swap
generic/496 # swap
generic/497 # swap
generic/532 # xfs_io statx attrib_mask missing in el7
generic/554 # swap
generic/563 # cgroup+loopdev
generic/564 # xfs_io copy_range missing in el7
generic/565 # xfs_io copy_range missing in el7
generic/568 # falloc not resulting in block count increase
generic/569 # swap
generic/570 # swap
generic/620 # dm-hugedisk
generic/633 # id-mapped mounts missing in el7
generic/636 # swap
generic/641 # swap
generic/643 # swap

View File

@@ -8,36 +8,34 @@
echo "$0 running rid '$SCOUTFS_FENCED_REQ_RID' ip '$SCOUTFS_FENCED_REQ_IP' args '$@'"
log() {
echo "$@" > /dev/stderr
echo_fail() {
echo "$@" >> /dev/stderr
exit 1
}
echo_fail() {
echo "$@" > /dev/stderr
exit 1
# silence error messages
quiet_cat()
{
cat "$@" 2>/dev/null
}
rid="$SCOUTFS_FENCED_REQ_RID"
shopt -s nullglob
for fs in /sys/fs/scoutfs/*; do
[ ! -d "$fs" ] && continue
fs_rid="$(quiet_cat $fs/rid)"
nr="$(quiet_cat $fs/data_device_maj_min)"
[ ! -d "$fs" -o "$fs_rid" != "$rid" ] && continue
fs_rid="$(cat $fs/rid)" || \
echo_fail "failed to get rid in $fs"
if [ "$fs_rid" != "$rid" ]; then
continue
fi
nr="$(cat $fs/data_device_maj_min)" || \
echo_fail "failed to get data device major:minor in $fs"
mnts=$(findmnt -l -n -t scoutfs -o TARGET -S $nr) || \
mnt=$(findmnt -l -n -t scoutfs -o TARGET -S $nr) || \
echo_fail "findmnt -t scoutfs -S $nr failed"
for mnt in $mnts; do
umount -f "$mnt" || \
echo_fail "umout -f $mnt failed"
done
[ -z "$mnt" ] && continue
if ! umount -qf "$mnt"; then
if [ -d "$fs" ]; then
echo_fail "umount -qf $mnt failed"
fi
fi
done
exit 0

@@ -64,21 +64,27 @@ t_rc()
}
#
# redirect test output back to the output of the invoking script intead
# of the compared output.
# As run, stdout/err are redirected to a file that will be compared with
# the stored expected golden output of the test. This redirects
# stdout/err in the script to stdout of the invoking run-tests. It's
# intended to give visible output of tests without being included in the
# golden output.
#
t_restore_output()
# (see the goofy "exec" fd manipulation in the main run-tests as it runs
# each test)
#
t_stdout_invoked()
{
exec >&6 2>&1
}
#
# redirect a command's output back to the compared output after the
# test has restored its output
# This undoes t_stdout_invoked, returning the test's stdout/err to the
# output file as it was when it was launched.
#
t_compare_output()
t_stdout_compare()
{
"$@" >&7 2>&1
exec >&7 2>&1
}
#
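For reference (not part of the patch), a test might use the renamed helpers roughly as follows; long_running_command is a hypothetical placeholder:

    # send progress output to the run-tests console instead of the compared output
    t_stdout_invoked
    echo " (progress messages are visible but not compared against golden output)"
    long_running_command
    # switch back so later output is compared against the golden file again
    t_stdout_compare
    echo "== comparable output resumes here"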

@@ -121,6 +121,7 @@ t_filter_dmesg()
# in debugging kernels we can slow things down a bit
re="$re|hrtimer: interrupt took .*"
re="$re|clocksource: Long readout interval"
# fencing tests force unmounts and trigger timeouts
re="$re|scoutfs .* forcing unmount"
@@ -140,6 +141,9 @@ t_filter_dmesg()
re="$re|scoutfs .* error.*server failed to bind to.*"
re="$re|scoutfs .* critical transaction commit failure.*"
# ENOLINK (-67) indicates an expected forced unmount error
re="$re|scoutfs .* error -67 .*"
# change-devices causes loop device resizing
re="$re|loop: module loaded"
re="$re|loop[0-9].* detected capacity change from.*"
@@ -163,6 +167,9 @@ t_filter_dmesg()
# perf warning that it adjusted sample rate
re="$re|perf: interrupt took too long.*lowering kernel.perf_event_max_sample_rate.*"
# some ci test guests are unresponsive
re="$re|longest quorum heartbeat .* delay"
egrep -v "($re)" | \
ignore_harmless_unwind_kasan_stack_oob
}

@@ -1,4 +1,3 @@
== setting longer hung task timeout
== creating fragmented extents
== unlink file with moved extents to free extents per block
== cleanup

@@ -49,7 +49,7 @@ offline wating should be empty:
0
== truncating does wait
truncate should be waiting for first block:
trunate should no longer be waiting:
truncate should no longer be waiting:
0
== writing waits
should be waiting for write

@@ -1,882 +0,0 @@
Ran:
generic/001
generic/002
generic/004
generic/005
generic/006
generic/007
generic/008
generic/009
generic/011
generic/012
generic/013
generic/014
generic/015
generic/016
generic/018
generic/020
generic/021
generic/022
generic/023
generic/024
generic/025
generic/026
generic/028
generic/029
generic/030
generic/031
generic/032
generic/033
generic/034
generic/035
generic/037
generic/039
generic/040
generic/041
generic/050
generic/052
generic/053
generic/056
generic/057
generic/058
generic/059
generic/060
generic/061
generic/062
generic/063
generic/064
generic/065
generic/066
generic/067
generic/069
generic/070
generic/071
generic/073
generic/076
generic/078
generic/079
generic/080
generic/081
generic/082
generic/084
generic/086
generic/087
generic/088
generic/090
generic/091
generic/092
generic/094
generic/096
generic/097
generic/098
generic/099
generic/101
generic/104
generic/105
generic/106
generic/107
generic/110
generic/111
generic/113
generic/114
generic/115
generic/116
generic/117
generic/118
generic/119
generic/120
generic/121
generic/122
generic/123
generic/124
generic/126
generic/128
generic/129
generic/130
generic/131
generic/134
generic/135
generic/136
generic/138
generic/139
generic/140
generic/141
generic/142
generic/143
generic/144
generic/145
generic/146
generic/147
generic/148
generic/149
generic/150
generic/151
generic/152
generic/153
generic/154
generic/155
generic/156
generic/157
generic/158
generic/159
generic/160
generic/161
generic/162
generic/163
generic/169
generic/171
generic/172
generic/173
generic/174
generic/177
generic/178
generic/179
generic/180
generic/181
generic/182
generic/183
generic/184
generic/185
generic/188
generic/189
generic/190
generic/191
generic/193
generic/194
generic/195
generic/196
generic/197
generic/198
generic/199
generic/200
generic/201
generic/202
generic/203
generic/205
generic/206
generic/207
generic/210
generic/211
generic/212
generic/214
generic/215
generic/216
generic/217
generic/218
generic/219
generic/220
generic/221
generic/222
generic/223
generic/225
generic/227
generic/228
generic/229
generic/230
generic/235
generic/236
generic/237
generic/238
generic/240
generic/244
generic/245
generic/246
generic/247
generic/248
generic/249
generic/250
generic/252
generic/253
generic/254
generic/255
generic/256
generic/257
generic/258
generic/259
generic/260
generic/261
generic/262
generic/263
generic/264
generic/265
generic/266
generic/267
generic/268
generic/271
generic/272
generic/276
generic/277
generic/278
generic/279
generic/281
generic/282
generic/283
generic/284
generic/286
generic/287
generic/288
generic/289
generic/290
generic/291
generic/292
generic/293
generic/294
generic/295
generic/296
generic/301
generic/302
generic/303
generic/304
generic/305
generic/306
generic/307
generic/308
generic/309
generic/312
generic/313
generic/314
generic/315
generic/316
generic/317
generic/319
generic/322
generic/324
generic/325
generic/326
generic/327
generic/328
generic/329
generic/330
generic/331
generic/332
generic/335
generic/336
generic/337
generic/341
generic/342
generic/343
generic/346
generic/348
generic/353
generic/355
generic/358
generic/359
generic/360
generic/361
generic/362
generic/363
generic/364
generic/365
generic/366
generic/367
generic/368
generic/369
generic/370
generic/371
generic/372
generic/373
generic/374
generic/375
generic/376
generic/377
generic/378
generic/379
generic/380
generic/381
generic/382
generic/383
generic/384
generic/385
generic/386
generic/389
generic/391
generic/392
generic/393
generic/394
generic/395
generic/396
generic/397
generic/398
generic/400
generic/401
generic/402
generic/403
generic/404
generic/406
generic/407
generic/408
generic/412
generic/413
generic/414
generic/417
generic/419
generic/420
generic/421
generic/422
generic/424
generic/425
generic/426
generic/427
generic/428
generic/436
generic/437
generic/439
generic/440
generic/443
generic/445
generic/446
generic/448
generic/449
generic/450
generic/451
generic/452
generic/453
generic/454
generic/456
generic/458
generic/460
generic/462
generic/463
generic/465
generic/466
generic/468
generic/469
generic/470
generic/471
generic/474
generic/477
generic/478
generic/479
generic/480
generic/481
generic/483
generic/485
generic/486
generic/487
generic/488
generic/489
generic/490
generic/491
generic/492
generic/498
generic/499
generic/501
generic/502
generic/503
generic/504
generic/505
generic/506
generic/507
generic/508
generic/509
generic/510
generic/511
generic/512
generic/513
generic/514
generic/515
generic/516
generic/517
generic/518
generic/519
generic/520
generic/523
generic/524
generic/525
generic/526
generic/527
generic/528
generic/529
generic/530
generic/531
generic/533
generic/534
generic/535
generic/536
generic/537
generic/538
generic/539
generic/540
generic/541
generic/542
generic/543
generic/544
generic/545
generic/546
generic/547
generic/548
generic/549
generic/550
generic/552
generic/553
generic/555
generic/556
generic/557
generic/566
generic/567
generic/571
generic/572
generic/573
generic/574
generic/575
generic/576
generic/577
generic/578
generic/580
generic/581
generic/582
generic/583
generic/584
generic/586
generic/587
generic/588
generic/591
generic/592
generic/593
generic/594
generic/595
generic/596
generic/597
generic/598
generic/599
generic/600
generic/601
generic/602
generic/603
generic/604
generic/605
generic/606
generic/607
generic/608
generic/609
generic/610
generic/611
generic/612
generic/613
generic/614
generic/618
generic/621
generic/623
generic/624
generic/625
generic/626
generic/628
generic/629
generic/630
generic/632
generic/634
generic/635
generic/637
generic/638
generic/639
generic/640
generic/644
generic/645
generic/646
generic/647
generic/651
generic/652
generic/653
generic/654
generic/655
generic/657
generic/658
generic/659
generic/660
generic/661
generic/662
generic/663
generic/664
generic/665
generic/666
generic/667
generic/668
generic/669
generic/673
generic/674
generic/675
generic/676
generic/677
generic/678
generic/679
generic/680
generic/681
generic/682
generic/683
generic/684
generic/685
generic/686
generic/687
generic/688
generic/689
shared/002
shared/032
Not
run:
generic/008
generic/009
generic/012
generic/015
generic/016
generic/018
generic/021
generic/022
generic/025
generic/026
generic/031
generic/033
generic/050
generic/052
generic/058
generic/059
generic/060
generic/061
generic/063
generic/064
generic/078
generic/079
generic/081
generic/082
generic/091
generic/094
generic/096
generic/110
generic/111
generic/113
generic/114
generic/115
generic/116
generic/118
generic/119
generic/121
generic/122
generic/123
generic/128
generic/130
generic/134
generic/135
generic/136
generic/138
generic/139
generic/140
generic/142
generic/143
generic/144
generic/145
generic/146
generic/147
generic/148
generic/149
generic/150
generic/151
generic/152
generic/153
generic/154
generic/155
generic/156
generic/157
generic/158
generic/159
generic/160
generic/161
generic/162
generic/163
generic/171
generic/172
generic/173
generic/174
generic/177
generic/178
generic/179
generic/180
generic/181
generic/182
generic/183
generic/185
generic/188
generic/189
generic/190
generic/191
generic/193
generic/194
generic/195
generic/196
generic/197
generic/198
generic/199
generic/200
generic/201
generic/202
generic/203
generic/205
generic/206
generic/207
generic/210
generic/211
generic/212
generic/214
generic/216
generic/217
generic/218
generic/219
generic/220
generic/222
generic/223
generic/225
generic/227
generic/229
generic/230
generic/235
generic/238
generic/240
generic/244
generic/250
generic/252
generic/253
generic/254
generic/255
generic/256
generic/259
generic/260
generic/261
generic/262
generic/263
generic/264
generic/265
generic/266
generic/267
generic/268
generic/271
generic/272
generic/276
generic/277
generic/278
generic/279
generic/281
generic/282
generic/283
generic/284
generic/287
generic/288
generic/289
generic/290
generic/291
generic/292
generic/293
generic/295
generic/296
generic/301
generic/302
generic/303
generic/304
generic/305
generic/312
generic/314
generic/316
generic/317
generic/324
generic/326
generic/327
generic/328
generic/329
generic/330
generic/331
generic/332
generic/353
generic/355
generic/358
generic/359
generic/361
generic/362
generic/363
generic/364
generic/365
generic/366
generic/367
generic/368
generic/369
generic/370
generic/371
generic/372
generic/373
generic/374
generic/378
generic/379
generic/380
generic/381
generic/382
generic/383
generic/384
generic/385
generic/386
generic/391
generic/392
generic/395
generic/396
generic/397
generic/398
generic/400
generic/402
generic/404
generic/406
generic/407
generic/408
generic/412
generic/413
generic/414
generic/417
generic/419
generic/420
generic/421
generic/422
generic/424
generic/425
generic/427
generic/439
generic/440
generic/446
generic/449
generic/450
generic/451
generic/453
generic/454
generic/456
generic/458
generic/462
generic/463
generic/465
generic/466
generic/468
generic/469
generic/470
generic/471
generic/474
generic/485
generic/487
generic/488
generic/491
generic/492
generic/499
generic/501
generic/503
generic/505
generic/506
generic/507
generic/508
generic/511
generic/513
generic/514
generic/515
generic/516
generic/517
generic/518
generic/519
generic/520
generic/528
generic/530
generic/536
generic/537
generic/538
generic/539
generic/540
generic/541
generic/542
generic/543
generic/544
generic/545
generic/546
generic/548
generic/549
generic/550
generic/552
generic/553
generic/555
generic/556
generic/566
generic/567
generic/572
generic/573
generic/574
generic/575
generic/576
generic/577
generic/578
generic/580
generic/581
generic/582
generic/583
generic/584
generic/586
generic/587
generic/588
generic/591
generic/592
generic/593
generic/594
generic/595
generic/596
generic/597
generic/598
generic/599
generic/600
generic/601
generic/602
generic/603
generic/605
generic/606
generic/607
generic/608
generic/609
generic/610
generic/612
generic/613
generic/621
generic/623
generic/624
generic/625
generic/626
generic/628
generic/629
generic/630
generic/635
generic/644
generic/645
generic/646
generic/647
generic/651
generic/652
generic/653
generic/654
generic/655
generic/657
generic/658
generic/659
generic/660
generic/661
generic/662
generic/663
generic/664
generic/665
generic/666
generic/667
generic/668
generic/669
generic/673
generic/674
generic/675
generic/677
generic/678
generic/679
generic/680
generic/681
generic/682
generic/683
generic/684
generic/685
generic/686
generic/687
generic/688
generic/689
shared/002
shared/032
Passed all 512 tests

@@ -56,6 +56,7 @@ $(basename $0) options:
| only tests matching will be run. Can be provided multiple
| times
-i | Force removing and inserting the built scoutfs.ko module.
-l <nr> | Loop each test <nr> times while passing, last run counts.
-M <file> | Specify the filesystem's meta data device path that contains
| the file system to be tested. Will be clobbered by -m mkfs.
-m | Run mkfs on the device before mounting and running
@@ -69,6 +70,7 @@ $(basename $0) options:
-r <dir> | Specify the directory in which to store results of
| test runs. The directory will be created if it doesn't
| exist. Previous results will be deleted as each test runs.
-R | Shuffle the test order randomly using shuf.
-s | Skip git repo checkouts.
-t | Enabled trace events that match the given glob argument.
| Multiple options enable multiple globbed events.
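As an illustration (not part of the patch), the new -l and -R switches combine with the existing options; the script name, device path, and results directory below are placeholders, and other required options are omitted:

    # mkfs the configured devices, shuffle the test order (xfstests stays last),
    # and loop each selected test 5 times for as long as it keeps passing
    ./run-tests.sh -m -M /dev/mapper/test-meta -r /var/tmp/scoutfs-results -R -l 5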
@@ -89,6 +91,8 @@ done
# set some T_ defaults
T_TRACE_DUMP="0"
T_TRACE_PRINTK="0"
T_PORT_START="19700"
T_LOOP_ITER="1"
# array declarations to be able to use array ops
declare -a T_TRACE_GLOB
@@ -129,6 +133,12 @@ while true; do
-i)
T_INSMOD="1"
;;
-l)
test -n "$2" || die "-l must have a nr iterations argument"
test "$2" -eq "$2" 2>/dev/null || die "-l <nr> argument must be an integer"
T_LOOP_ITER="$2"
shift
;;
-M)
test -n "$2" || die "-z must have meta device file argument"
T_META_DEVICE="$2"
@@ -164,6 +174,9 @@ while true; do
T_RESULTS="$2"
shift
;;
-R)
T_SHUF="1"
;;
-s)
T_SKIP_CHECKOUT="1"
;;
@@ -261,13 +274,37 @@ for e in T_META_DEVICE T_DATA_DEVICE T_EX_META_DEV T_EX_DATA_DEV T_KMOD T_RESULT
eval $e=\"$(readlink -f "${!e}")\"
done
# try and check ports, but not necessary
T_TEST_PORT="$T_PORT_START"
T_SCRATCH_PORT="$((T_PORT_START + 100))"
T_DEV_PORT="$((T_PORT_START + 200))"
read local_start local_end < /proc/sys/net/ipv4/ip_local_port_range
if [ -n "$local_start" -a -n "$local_end" -a "$local_start" -lt "$local_end" ]; then
if [ ! "$T_DEV_PORT" -lt "$local_start" -a ! "$T_TEST_PORT" -gt "$local_end" ]; then
die "listening port range $T_TEST_PORT - $T_DEV_PORT is within local dynamic port range $local_start - $local_end in /proc/sys/net/ipv4/ip_local_port_range"
fi
fi
# permute sequence?
T_SEQUENCE=sequence
if [ -n "$T_SHUF" ]; then
msg "shuffling test order"
shuf sequence -o sequence.shuf
# keep xfstests at the end
if grep -q 'xfstests.sh' sequence.shuf ; then
sed -i '/xfstests.sh/d' sequence.shuf
echo "xfstests.sh" >> sequence.shuf
fi
T_SEQUENCE=sequence.shuf
fi
# include everything by default
test -z "$T_INCLUDE" && T_INCLUDE="-e '.*'"
# (quickly) exclude nothing by default
test -z "$T_EXCLUDE" && T_EXCLUDE="-e '\Zx'"
# eval to strip re ticks but not expand
tests=$(grep -v "^#" sequence |
tests=$(grep -v "^#" $T_SEQUENCE |
eval grep "$T_INCLUDE" | eval grep -v "$T_EXCLUDE")
test -z "$tests" && \
die "no tests found by including $T_INCLUDE and excluding $T_EXCLUDE"
@@ -346,7 +383,7 @@ fi
quo=""
if [ -n "$T_MKFS" ]; then
for i in $(seq -0 $((T_QUORUM - 1))); do
quo="$quo -Q $i,127.0.0.1,$((42000 + i))"
quo="$quo -Q $i,127.0.0.1,$((T_TEST_PORT + i))"
done
msg "making new filesystem with $T_QUORUM quorum members"
@@ -401,6 +438,30 @@ cmd grep . /sys/kernel/debug/tracing/options/trace_printk \
/sys/kernel/debug/tracing/buffer_size_kb \
/proc/sys/kernel/ftrace_dump_on_oops
# we can record pids to kill as we exit, we kill in reverse added order
atexit_kill_pids=""
add_atexit_kill_pid()
{
atexit_kill_pids="$1 $atexit_kill_pids"
}
atexit_kill()
{
local pid
# suppress bg function exited messages
exec {ERR}>&2 2>/dev/null
for pid in $atexit_kill_pids; do
if test -e "/proc/$pid/status" ; then
kill "$pid"
wait "$pid"
fi
done
exec 2>&$ERR {ERR}>&-
}
trap atexit_kill EXIT
#
# Build a fenced config that runs scripts out of the repository rather
# than the default system directory
@@ -414,26 +475,43 @@ EOF
export SCOUTFS_FENCED_CONFIG_FILE="$conf"
T_FENCED_LOG="$T_RESULTS/fenced.log"
#
# Run the agent in the background, log its output, and kill it if we
# exit
#
fenced_log()
{
echo "[$(timestamp)] $*" >> "$T_FENCED_LOG"
}
fenced_pid=""
kill_fenced()
{
if test -n "$fenced_pid" -a -d "/proc/$fenced_pid" ; then
fenced_log "killing fenced pid $fenced_pid"
kill "$fenced_pid"
fi
}
trap kill_fenced EXIT
$T_UTILS/fenced/scoutfs-fenced > "$T_FENCED_LOG" 2>&1 &
fenced_pid=$!
fenced_log "started fenced pid $fenced_pid in the background"
add_atexit_kill_pid $fenced_pid
#
# some critical failures will cause fs operations to hang. We can watch
# for evidence of them and cause the system to crash, at least.
#
crash_monitor()
{
local bad=0
while sleep 1; do
if dmesg | grep -q "inserting extent.*overlaps existing"; then
echo "run-tests monitor saw overlapping extent message"
bad=1
fi
if dmesg | grep -q "error indicated by fence action" ; then
echo "run-tests monitor saw fence agent error message"
bad=1
fi
if [ ! -e "/proc/${fenced_pid}/status" ]; then
echo "run-tests monitor didn't see fenced pid $fenced_pid /proc dir"
bad=1
fi
if [ "$bad" != 0 ]; then
echo "run-tests monitor triggering crash"
echo c > /proc/sysrq-trigger
exit 1
fi
done
}
crash_monitor &
add_atexit_kill_pid $!
# setup dm tables
echo "0 $(blockdev --getsz $T_META_DEVICE) linear $T_META_DEVICE 0" > \
@@ -506,7 +584,7 @@ fi
. funcs/filter.sh
# give tests access to built binaries in src/, prefer over installed
PATH="$PWD/src:$PATH"
export PATH="$PWD/src:$PATH"
msg "running tests"
> "$T_RESULTS/skip.log"
@@ -526,101 +604,110 @@ for t in $tests; do
t="tests/$t"
test_name=$(basename "$t" | sed -e 's/.sh$//')
# create a temporary dir and file path for the test
T_TMPDIR="$T_RESULTS/tmp/$test_name"
T_TMP="$T_TMPDIR/tmp"
cmd rm -rf "$T_TMPDIR"
cmd mkdir -p "$T_TMPDIR"
# create a test name dir in the fs, clean up old data as needed
T_DS=""
for i in $(seq 0 $((T_NR_MOUNTS - 1))); do
dir="${T_M[$i]}/test/$test_name"
test $i == 0 && (
test -d "$dir" && cmd rm -rf "$dir"
cmd mkdir -p "$dir"
)
eval T_D$i=$dir
T_D[$i]=$dir
T_DS+="$dir "
done
# export all our T_ variables
for v in ${!T_*}; do
eval export $v
done
export PATH # give test access to scoutfs binary
# prepare to compare output to golden output
test -e "$T_RESULTS/output" || cmd mkdir -p "$T_RESULTS/output"
out="$T_RESULTS/output/$test_name"
> "$T_TMPDIR/status.msg"
golden="golden/$test_name"
# get stats from previous pass
last="$T_RESULTS/last-passed-test-stats"
stats=$(grep -s "^$test_name " "$last" | cut -d " " -f 2-)
test -n "$stats" && stats="last: $stats"
printf " %-30s $stats" "$test_name"
# mark in dmesg as to what test we are running
echo "run scoutfs test $test_name" > /dev/kmsg
# record dmesg before
dmesg | t_filter_dmesg > "$T_TMPDIR/dmesg.before"
# let the test get at its extra files
T_EXTRA="$T_TESTS/extra/$test_name"
# give tests stdout and compared output on specific fds
exec 6>&1
exec 7>$out
for iter in $(seq 1 $T_LOOP_ITER); do
# run the test with access to our functions
start_secs=$SECONDS
bash -c "for f in funcs/*.sh; do . \$f; done; . $t" >&7 2>&1
sts="$?"
log "test $t exited with status $sts"
stats="$((SECONDS - start_secs))s"
# create a temporary dir and file path for the test
T_TMPDIR="$T_RESULTS/tmp/$test_name"
T_TMP="$T_TMPDIR/tmp"
cmd rm -rf "$T_TMPDIR"
cmd mkdir -p "$T_TMPDIR"
# close our weird descriptors
exec 6>&-
exec 7>&-
# create a test name dir in the fs, clean up old data as needed
T_DS=""
for i in $(seq 0 $((T_NR_MOUNTS - 1))); do
dir="${T_M[$i]}/test/$test_name"
# compare output if the test returned passed status
if [ "$sts" == "$T_PASS_STATUS" ]; then
if [ ! -e "$golden" ]; then
message="no golden output"
sts=$T_FAIL_STATUS
elif ! cmp -s "$golden" "$out"; then
message="output differs"
sts=$T_FAIL_STATUS
diff -u "$golden" "$out" >> "$T_RESULTS/fail.log"
test $i == 0 && (
test -d "$dir" && cmd rm -rf "$dir"
cmd mkdir -p "$dir"
)
eval T_D$i=$dir
T_D[$i]=$dir
T_DS+="$dir "
done
# export all our T_ variables
for v in ${!T_*}; do
eval export $v
done
# prepare to compare output to golden output
test -e "$T_RESULTS/output" || cmd mkdir -p "$T_RESULTS/output"
out="$T_RESULTS/output/$test_name"
> "$T_TMPDIR/status.msg"
golden="golden/$test_name"
# record dmesg before
dmesg | t_filter_dmesg > "$T_TMPDIR/dmesg.before"
# give tests stdout and compared output on specific fds
exec 6>&1
exec 7>$out
# run the test with access to our functions
start_secs=$SECONDS
bash -c "for f in funcs/*.sh; do . \$f; done; . $t" >&7 2>&1
sts="$?"
log "test $t exited with status $sts"
stats="$((SECONDS - start_secs))s"
# close our weird descriptors
exec 6>&-
exec 7>&-
# compare output if the test returned passed status
if [ "$sts" == "$T_PASS_STATUS" ]; then
if [ ! -e "$golden" ]; then
message="no golden output"
sts=$T_FAIL_STATUS
elif ! cmp -s "$golden" "$out"; then
message="output differs"
sts=$T_FAIL_STATUS
diff -u "$golden" "$out" >> "$T_RESULTS/fail.log"
fi
else
# get message from t_*() functions
message=$(cat "$T_TMPDIR/status.msg")
fi
else
# get message from t_*() functions
message=$(cat "$T_TMPDIR/status.msg")
fi
# see if anything unexpected was added to dmesg
if [ "$sts" == "$T_PASS_STATUS" ]; then
dmesg | t_filter_dmesg > "$T_TMPDIR/dmesg.after"
diff --old-line-format="" --unchanged-line-format="" \
"$T_TMPDIR/dmesg.before" "$T_TMPDIR/dmesg.after" > \
"$T_TMPDIR/dmesg.new"
# see if anything unexpected was added to dmesg
if [ "$sts" == "$T_PASS_STATUS" ]; then
dmesg | t_filter_dmesg > "$T_TMPDIR/dmesg.after"
diff --old-line-format="" --unchanged-line-format="" \
"$T_TMPDIR/dmesg.before" "$T_TMPDIR/dmesg.after" > \
"$T_TMPDIR/dmesg.new"
if [ -s "$T_TMPDIR/dmesg.new" ]; then
message="unexpected messages in dmesg"
sts=$T_FAIL_STATUS
cat "$T_TMPDIR/dmesg.new" >> "$T_RESULTS/fail.log"
if [ -s "$T_TMPDIR/dmesg.new" ]; then
message="unexpected messages in dmesg"
sts=$T_FAIL_STATUS
cat "$T_TMPDIR/dmesg.new" >> "$T_RESULTS/fail.log"
fi
fi
fi
# record unknown exit status
if [ "$sts" -lt "$T_FIRST_STATUS" -o "$sts" -gt "$T_LAST_STATUS" ]; then
message="unknown status: $sts"
sts=$T_FAIL_STATUS
fi
# record unknown exit status
if [ "$sts" -lt "$T_FIRST_STATUS" -o "$sts" -gt "$T_LAST_STATUS" ]; then
message="unknown status: $sts"
sts=$T_FAIL_STATUS
fi
# stop looping if we didn't pass
if [ "$sts" != "$T_PASS_STATUS" ]; then
break;
fi
done
# show and record the result of the test
if [ "$sts" == "$T_PASS_STATUS" ]; then

@@ -19,6 +19,7 @@
#include <sys/types.h>
#include <stdio.h>
#include <sys/stat.h>
#include <inttypes.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
@@ -29,7 +30,7 @@
#include <errno.h>
static int size = 0;
static int count = 0; /* XXX make this duration instead */
static int duration = 0;
struct thread_info {
int nr;
@@ -41,6 +42,8 @@ static void *run_test_func(void *ptr)
void *buf = NULL;
char *addr = NULL;
struct thread_info *tinfo = ptr;
uint64_t seconds = 0;
struct timespec ts;
int c = 0;
int fd;
ssize_t read, written, ret;
@@ -61,9 +64,15 @@ static void *run_test_func(void *ptr)
usleep(100000); /* 0.1sec to allow all threads to start roughly at the same time */
clock_gettime(CLOCK_REALTIME, &ts); /* record start time */
seconds = ts.tv_sec + duration;
for (;;) {
if (++c > count)
break;
if (++c % 16 == 0) {
clock_gettime(CLOCK_REALTIME, &ts);
if (ts.tv_sec >= seconds)
break;
}
switch (rand() % 4) {
case 0: /* pread */
@@ -99,6 +108,8 @@ static void *run_test_func(void *ptr)
memcpy(addr, buf, size); /* noerr */
break;
}
usleep(10000);
}
munmap(addr, size);
@@ -120,7 +131,7 @@ int main(int argc, char **argv)
int i;
if (argc != 8) {
fprintf(stderr, "%s requires 7 arguments - size count file1 file2 file3 file4 file5\n", argv[0]);
fprintf(stderr, "%s requires 7 arguments - size duration file1 file2 file3 file4 file5\n", argv[0]);
exit(-1);
}
@@ -130,9 +141,9 @@ int main(int argc, char **argv)
exit(-1);
}
count = atoi(argv[2]);
if (count < 0) {
fprintf(stderr, "invalid count, must be greater than 0\n");
duration = atoi(argv[2]);
if (duration < 0) {
fprintf(stderr, "invalid duration, must be greater than or equal to 0\n");
exit(-1);
}
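For reference (not part of the patch), the second argument is now a duration in seconds rather than an iteration count, so an invocation looks like the one used later in this diff:

    # perform random 8192-byte mmap/read/write operations for roughly 30 seconds
    mmap_stress 8192 30 file1 file2 file3 file4 file5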

@@ -15,7 +15,7 @@ echo "== prepare devices, mount point, and logs"
SCR="$T_TMPDIR/mnt.scratch"
mkdir -p "$SCR"
> $T_TMP.mount.out
scoutfs mkfs -f -Q 0,127.0.0.1,53000 "$T_EX_META_DEV" "$T_EX_DATA_DEV" > $T_TMP.mkfs.out 2>&1 \
scoutfs mkfs -f -Q 0,127.0.0.1,$T_SCRATCH_PORT "$T_EX_META_DEV" "$T_EX_DATA_DEV" > $T_TMP.mkfs.out 2>&1 \
|| t_fail "mkfs failed"
echo "== bad devices, bad options"

@@ -11,7 +11,7 @@ truncate -s $sz "$T_TMP.equal"
truncate -s $large_sz "$T_TMP.large"
echo "== make scratch fs"
t_quiet scoutfs mkfs -f -Q 0,127.0.0.1,53000 "$T_EX_META_DEV" "$T_EX_DATA_DEV"
t_quiet scoutfs mkfs -f -Q 0,127.0.0.1,$T_SCRATCH_PORT "$T_EX_META_DEV" "$T_EX_DATA_DEV"
SCR="$T_TMPDIR/mnt.scratch"
mkdir -p "$SCR"

@@ -57,7 +57,7 @@ test "$before" == "$after" || \
# XXX this is all pretty manual, would be nice to have helpers
echo "== make small meta fs"
# meta device just big enough for reserves and the metadata we'll fill
scoutfs mkfs -A -f -Q 0,127.0.0.1,53000 -m 10G "$T_EX_META_DEV" "$T_EX_DATA_DEV" > $T_TMP.mkfs.out 2>&1 || \
scoutfs mkfs -A -f -Q 0,127.0.0.1,$T_SCRATCH_PORT -m 10G "$T_EX_META_DEV" "$T_EX_DATA_DEV" > $T_TMP.mkfs.out 2>&1 || \
t_fail "mkfs failed"
SCR="$T_TMPDIR/mnt.scratch"
mkdir -p "$SCR"

@@ -11,8 +11,8 @@
# format version.
#
# not supported on el9!
if [ $(source /etc/os-release ; echo ${VERSION_ID:0:1}) -gt 8 ]; then
# not supported on el8 or higher
if [ $(source /etc/os-release ; echo ${VERSION_ID:0:1}) -gt 7 ]; then
t_skip_permitted "Unsupported OS version"
fi
@@ -89,7 +89,7 @@ for vers in $(seq $MIN $((MAX - 1))); do
old_module="$builds/$vers/scoutfs.ko"
echo "mkfs $vers" >> "$T_TMP.log"
t_quiet $old_scoutfs mkfs -f -Q 0,127.0.0.1,53000 "$T_EX_META_DEV" "$T_EX_DATA_DEV" \
t_quiet $old_scoutfs mkfs -f -Q 0,127.0.0.1,$T_SCRATCH_PORT "$T_EX_META_DEV" "$T_EX_DATA_DEV" \
|| t_fail "mkfs $vers failed"
echo "mount $vers with $vers" >> "$T_TMP.log"

@@ -10,30 +10,6 @@ EXTENTS_PER_BTREE_BLOCK=600
EXTENTS_PER_LIST_BLOCK=8192
FREED_EXTENTS=$((EXTENTS_PER_BTREE_BLOCK * EXTENTS_PER_LIST_BLOCK))
#
# This test specifically creates a pathologically sparse file that will
# be as expensive as possible to free. This is usually fine on
# dedicated or reasonable hardware, but trying to run this in
# virtualized debug kernels can take a very long time. This test is
# about making sure that the server doesn't fail, not that the platform
# can handle the scale of work that our btree formats happen to require
# while execution is bogged down with use-after-free memory reference
# tracking. So we give the test a lot more breathing room before
# deciding that its hung.
#
echo "== setting longer hung task timeout"
if [ -w /proc/sys/kernel/hung_task_timeout_secs ]; then
secs=$(cat /proc/sys/kernel/hung_task_timeout_secs)
test "$secs" -gt 0 || \
t_fail "confusing value '$secs' from /proc/sys/kernel/hung_task_timeout_secs"
restore_hung_task_timeout()
{
echo "$secs" > /proc/sys/kernel/hung_task_timeout_secs
}
trap restore_hung_task_timeout EXIT
echo "$((secs * 5))" > /proc/sys/kernel/hung_task_timeout_secs
fi
echo "== creating fragmented extents"
fragmented_data_extents $FREED_EXTENTS $EXTENTS_PER_BTREE_BLOCK "$T_D0/alloc" "$T_D0/move"

@@ -5,7 +5,7 @@
t_require_commands mmap_stress mmap_validate scoutfs xfs_io
echo "== mmap_stress"
mmap_stress 8192 2000 "$T_D0/mmap_stress" "$T_D1/mmap_stress" "$T_D2/mmap_stress" "$T_D3/mmap_stress" "$T_D4/mmap_stress" | sed 's/:.*//g' | sort
mmap_stress 8192 30 "$T_D0/mmap_stress" "$T_D0/mmap_stress" "$T_D0/mmap_stress" "$T_D3/mmap_stress" "$T_D3/mmap_stress" | sed 's/:.*//g' | sort
echo "== basic mmap/read/write consistency checks"
mmap_validate 256 1000 "$T_D0/mmap_val1" "$T_D1/mmap_val1"

@@ -157,7 +157,7 @@ echo "truncate should be waiting for first block:"
expect_wait "$DIR/file" "change_size" $ino 0
scoutfs stage "$DIR/golden" "$DIR/file" -V "$vers" -o 0 -l $BYTES
sleep .1
echo "trunate should no longer be waiting:"
echo "truncate should no longer be waiting:"
scoutfs data-waiting -B 0 -I 0 -p "$DIR" | wc -l
cat "$DIR/golden" > "$DIR/file"
vers=$(scoutfs stat -s data_version "$DIR/file")
@@ -168,10 +168,13 @@ scoutfs release "$DIR/file" -V "$vers" -o 0 -l $BYTES
# overwrite, not truncate+write
dd if="$DIR/other" of="$DIR/file" \
bs=$BS count=$BLOCKS conv=notrunc status=none &
pid="$!"
sleep .1
echo "should be waiting for write"
expect_wait "$DIR/file" "write" $ino 0
scoutfs stage "$DIR/golden" "$DIR/file" -V "$vers" -o 0 -l $BYTES
# wait for the background dd to complete
wait "$pid" 2> /dev/null
cmp "$DIR/file" "$DIR/other"
echo "== cleanup"

@@ -67,18 +67,49 @@ t_mount_all
while test -d $(echo /sys/fs/scoutfs/*/fence/* | cut -d " " -f 1); do
sleep .5
done
# wait for orphan scans to run
t_set_all_sysfs_mount_options orphan_scan_delay_ms 1000
# also have to wait for delayed log merge work from mount
C=120
while (( C-- )); do
brk=1
for ino in $inos; do
inode_exists $ino && brk=0
done
test $brk -eq 1 && break
sv=$(t_server_nr)
# wait for reclaim_open_log_tree() to complete for each mount
while [ $(t_counter reclaimed_open_logs $sv) -lt $T_NR_MOUNTS ]; do
sleep 1
done
# wait for finalize_and_start_log_merge() to find no active merges in flight
# and not find any finalized trees
while [ $(t_counter log_merge_no_finalized $sv) -lt 1 ]; do
sleep 1
done
# wait for orphan scans to run
t_set_all_sysfs_mount_options orphan_scan_delay_ms 1000
# wait until we see two consecutive orphan scan attempts without
# any inode deletion forward progress in each mount
for nr in $(t_fs_nrs); do
C=0
LOSA=$(t_counter orphan_scan_attempts $nr)
LDOP=$(t_counter inode_deleted $nr)
while [ $C -lt 2 ]; do
sleep 1
OSA=$(t_counter orphan_scan_attempts $nr)
DOP=$(t_counter inode_deleted $nr)
if [ $OSA != $LOSA ]; then
if [ $DOP == $LDOP ]; then
(( C++ ))
else
C=0
fi
fi
LOSA=$OSA
LDOP=$DOP
done
done
for ino in $inos; do
inode_exists $ino && echo "$ino still exists"
done

@@ -72,7 +72,7 @@ quarter_data=$(echo "$size_data / 4" | bc)
# XXX this is all pretty manual, would be nice to have helpers
echo "== make initial small fs"
scoutfs mkfs -A -f -Q 0,127.0.0.1,53000 -m $quarter_meta -d $quarter_data \
scoutfs mkfs -A -f -Q 0,127.0.0.1,$T_SCRATCH_PORT -m $quarter_meta -d $quarter_data \
"$T_EX_META_DEV" "$T_EX_DATA_DEV" > $T_TMP.mkfs.out 2>&1 || \
t_fail "mkfs failed"
SCR="$T_TMPDIR/mnt.scratch"

@@ -50,9 +50,9 @@ t_quiet sync
cat << EOF > local.config
export FSTYP=scoutfs
export MKFS_OPTIONS="-f"
export MKFS_TEST_OPTIONS="-Q 0,127.0.0.1,42000"
export MKFS_SCRATCH_OPTIONS="-Q 0,127.0.0.1,43000"
export MKFS_DEV_OPTIONS="-Q 0,127.0.0.1,44000"
export MKFS_TEST_OPTIONS="-Q 0,127.0.0.1,$T_TEST_PORT"
export MKFS_SCRATCH_OPTIONS="-Q 0,127.0.0.1,$T_SCRATCH_PORT"
export MKFS_DEV_OPTIONS="-Q 0,127.0.0.1,$T_DEV_PORT"
export TEST_DEV=$T_DB0
export TEST_DIR=$T_M0
export SCRATCH_META_DEV=$T_EX_META_DEV
@@ -63,73 +63,47 @@ export MOUNT_OPTIONS="-o quorum_slot_nr=0,metadev_path=$T_MB0"
export TEST_FS_MOUNT_OPTS="-o quorum_slot_nr=0,metadev_path=$T_MB0"
EOF
cat << EOF > local.exclude
generic/003 # missing atime update in buffered read
generic/075 # file content mismatch failures (fds, etc)
generic/103 # enospc causes trans commit failures
generic/108 # mount fails on failing device?
generic/112 # file content mismatch failures (fds, etc)
generic/213 # enospc causes trans commit failures
generic/318 # can't support user namespaces until v5.11
generic/321 # requires selinux enabled for '+' in ls?
generic/338 # BUG_ON update inode error handling
generic/347 # _dmthin_mount doesn't work?
generic/356 # swap
generic/357 # swap
generic/409 # bind mounts not scripted yet
generic/410 # bind mounts not scripted yet
generic/411 # bind mounts not scripted yet
generic/423 # symlink inode size is strlen() + 1 on scoutfs
generic/430 # xfs_io copy_range missing in el7
generic/431 # xfs_io copy_range missing in el7
generic/432 # xfs_io copy_range missing in el7
generic/433 # xfs_io copy_range missing in el7
generic/434 # xfs_io copy_range missing in el7
generic/441 # dm-mapper
generic/444 # el9's posix_acl_update_mode is buggy ?
generic/467 # open_by_handle ESTALE
generic/472 # swap
generic/484 # dm-mapper
generic/493 # swap
generic/494 # swap
generic/495 # swap
generic/496 # swap
generic/497 # swap
generic/532 # xfs_io statx attrib_mask missing in el7
generic/554 # swap
generic/563 # cgroup+loopdev
generic/564 # xfs_io copy_range missing in el7
generic/565 # xfs_io copy_range missing in el7
generic/568 # falloc not resulting in block count increase
generic/569 # swap
generic/570 # swap
generic/620 # dm-hugedisk
generic/633 # id-mapped mounts missing in el7
generic/636 # swap
generic/641 # swap
generic/643 # swap
EOF
cp "$T_EXTRA/local.exclude" local.exclude
t_restore_output
t_stdout_invoked
echo " (showing output of xfstests)"
args="-E local.exclude ${T_XFSTESTS_ARGS:--g quick}"
./check $args
# the fs is unmounted when check finishes
t_stdout_compare
#
# ./check writes the results of the run to check.log. It lists
# the tests it ran, skipped, or failed. Then it writes a line saying
# everything passed or some failed. We scrape the most recent run and
# use it as the output to compare to make sure that we run the right
# tests and get the right results.
# ./check writes the results of the run to check.log. It lists the
# tests it ran, skipped, or failed. Then it writes a line saying
# everything passed or some failed.
#
#
# If XFSTESTS_ARGS were specified then we just pass/fail to match the
# check run.
#
if [ -n "$T_XFSTESTS_ARGS" ]; then
if tail -1 results/check.log | grep -q "Failed"; then
t_fail
else
t_pass
fi
fi
#
# Otherwise, typically, when there were no args then we scrape the most
# recent run and use it as the output to compare to make sure that we
# run the right tests and get the right results.
#
awk '
/^(Ran|Not run|Failures):.*/ {
if (pf) {
res=""
pf=""
} res = res "\n" $0
}
res = res "\n" $0
}
/^(Passed|Failed).*tests$/ {
pf=$0
@@ -139,10 +113,14 @@ awk '
}' < results/check.log > "$T_TMPDIR/results"
# put a test per line so diff shows tests that differ
egrep "^(Ran|Not run|Failures):" "$T_TMPDIR/results" | \
fmt -w 1 > "$T_TMPDIR/results.fmt"
egrep "^(Passed|Failed).*tests$" "$T_TMPDIR/results" >> "$T_TMPDIR/results.fmt"
grep -E "^(Ran|Not run|Failures):" "$T_TMPDIR/results" | fmt -w 1 > "$T_TMPDIR/results.fmt"
grep -E "^(Passed|Failed).*tests$" "$T_TMPDIR/results" >> "$T_TMPDIR/results.fmt"
t_compare_output cat "$T_TMPDIR/results.fmt"
diff -u "$T_EXTRA/expected-results" "$T_TMPDIR/results.fmt" > "$T_TMPDIR/results.diff"
if [ -s "$T_TMPDIR/results.diff" ]; then
echo "tests that were skipped/run differed from expected:"
cat "$T_TMPDIR/results.diff"
t_fail
fi
t_pass

@@ -7,7 +7,7 @@ message_output()
error_message()
{
message_output "$@" >&2
message_output "$@" >> /dev/stderr
}
error_exit()
@@ -62,31 +62,27 @@ test -x "$SCOUTFS_FENCED_RUN" || \
# files disappear.
#
# generate failure messages to stderr while still echoing 0 for the caller
careful_cat()
# silence error messages
quiet_cat()
{
local path="$@"
cat "$@" || echo 0
cat "$@" 2>/dev/null
}
while sleep $SCOUTFS_FENCED_DELAY; do
shopt -s nullglob
for fence in /sys/fs/scoutfs/*/fence/*; do
# catches unmatched regex when no dirs
if [ ! -d "$fence" ]; then
continue
fi
# skip requests that have been handled
if [ "$(careful_cat $fence/fenced)" == 1 -o \
"$(careful_cat $fence/error)" == 1 ]; then
continue
fi
srv=$(basename $(dirname $(dirname $fence)))
rid="$(cat $fence/rid)"
ip="$(cat $fence/ipv4_addr)"
reason="$(cat $fence/reason)"
fenced="$(quiet_cat $fence/fenced)"
error="$(quiet_cat $fence/error)"
rid="$(quiet_cat $fence/rid)"
ip="$(quiet_cat $fence/ipv4_addr)"
reason="$(quiet_cat $fence/reason)"
# request dirs can linger then disappear after fenced/error is set
if [ ! -d "$fence" -o "$fenced" == "1" -o "$error" == "1" ]; then
continue
fi
log_message "server $srv fencing rid $rid at IP $ip for $reason"

@@ -55,6 +55,14 @@ with initial sparse regions (perhaps by multiple threads writing to
different regions) and wasted space isn't an issue (perhaps because the
file population contains few small files).
.TP
.B ino_alloc_per_lock=<number>
This option determines how many inode numbers are allocated in the same
cluster lock. The default, and maximum, is 1024. The minimum is 1.
Allocating fewer inodes per lock can allow more parallelism between
mounts because there are more locks that cover the same number of
created files. This can be helpful when working with smaller numbers of
large files.
.TP
.B log_merge_wait_timeout_ms=<number>
This option sets the amount of time, in milliseconds, that log merge
creation can wait before timing out. This setting is per-mount, only
@@ -130,6 +138,23 @@ the server for the filesystem if it is elected leader.
The assigned number must match one of the slots defined with \-Q options
when the filesystem was created with mkfs. If the number assigned
doesn't match a number created during mkfs then the mount will fail.
.TP
.B tcp_keepalive_timeout_ms=<number>
This option sets the amount of time, in milliseconds, that a client
connection will wait for active TCP packets before deciding that the
connection is dead. This setting is per-mount and only changes the
behavior of that mount.
.sp
The default value of this setting is 60000 msec (60s). Precision
beyond a whole second is unrealistic given how the TCP keepalive
mechanism is implemented in the Linux kernel. Any value higher than
3000 (3s) is valid.
.sp
The TCP keepalive mechanism is complex, and quickly noticing a lost
connection is important for maintaining cluster stability. If the
local network suffers from intermittent outages, this option may give
the cluster enough leeway to ride out those outages without becoming
desynchronized.
.SH VOLUME OPTIONS
Volume options are persistent options which are stored in the super
block in the metadata device and which apply to all mounts of the volume.
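As an illustration of the two mount options documented above (not part of the patch; the device paths are placeholders), they can be combined with the existing per-mount options:

    # hypothetical mount: smaller inode allocation batches per lock and a 10s keepalive timeout
    mount -t scoutfs \
        -o quorum_slot_nr=0,metadev_path=/dev/meta_dev,ino_alloc_per_lock=64,tcp_keepalive_timeout_ms=10000 \
        /dev/data_dev /mnt/scoutfs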

@@ -1,7 +1,7 @@
#!/bin/bash
# can we find sparse? If not, we're done.
which sparse > /dev/null 2>&1 || exit 0
# must have sparse. Fail with error message, mask success path.
which sparse > /dev/null || exit 1
#
# one of the problems with using sparse in userspace is that it picks up
@@ -22,6 +22,11 @@ RE="$RE|warning: memset with byte count of 4194304"
# some sparse versions don't know about some builtins
RE="$RE|error: undefined identifier '__builtin_fpclassify'"
# on el8, sparse can't handle __has_include for some reason when _GNU_SOURCE
# is defined, and we need that for O_DIRECT.
RE="$RE|note: in included file .through /usr/include/sys/stat.h.:"
RE="$RE|/usr/include/bits/statx.h:30:6: error: "
#
# don't filter out 'too many errors' here, it can signify that
# sparse doesn't understand something and is throwing a *ton*