Commit Graph

1237 Commits

Author SHA1 Message Date
Zach Brown
70a5b6ffe2 Use percpu_counter_add_batch
__percpu_counter_add_batch was renamed to make it clear that the __
doesn't mean it's less safe, as it means in other calls in the API, but
just that it takes an additional parameter.

Signed-off-by: Zach Brown <zab@versity.com>
Signed-off-by: Auke Kok <auke.kok@versity.com>
2023-10-09 15:35:40 -04:00
Auke Kok
b89ecd47b4 Use __posix_acl_create/_chmod and add backwards compatibility
There are new interfaces available but the old one has been retained
for us to use. In case of older kernels, we will need to fall back
to the previous name of these functions.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2023-10-09 15:35:40 -04:00
Auke Kok
4293816764 Fix argument test for __posix_acl_valid.
The argument is fixed to be user_namespace, instead of user_ns.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2023-10-09 15:35:40 -04:00
Auke Kok
f0de59a9a3 Use setattr_preapre() as inode_change_ok() was removed in v4.8-rc1
Instead, we can call setattr_prepare() directly. We provide a fallback
for older kernels.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2023-10-09 15:35:40 -04:00
Auke Kok
1f0a08eacb Use the new inode->i_version manipulation methods.
Provide fallback in degraded mode for kernels pre-v4.15-rc3 by directly
manipulating the member as needed.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2023-10-09 15:35:40 -04:00
Auke Kok
dac3f056a5 inode->i_mutex has been replaced with inode->i_rwsem.
Since v4.6-rc3-27-g9902af79c01a, inode->i_mutex has been replaced
with ->i_rwsem. However, long since whenever, inode_lock() and
related functions already worked as intended and provided fully
exclusive locking to the inode.

To avoid a name clash on pre-rhel8 kernels, we have to rename a
stack variable in `src/file.c`.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2023-10-09 15:35:40 -04:00
Auke Kok
af868aad9b New inode->i_version API requires <iversion.h>
Since v4.15-rc3-4-gae5e165d855d, <linux/iversion.h> contains a new
inode->i_version API and it is not included by default.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2023-10-09 15:35:40 -04:00
Auke Kok
cf4df0ef9f use $(MAKE) to allow passing jobserver flags.
With this, we can `make -jX` to speed up compiles a bit from
the kmod folder.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2023-10-09 15:35:40 -04:00
Auke Kok
81aa58253e module_init/_exit should have a semicolon at eol.
In the past this was not needed but since el7 onwards these macros
should require the semicolon.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2023-10-09 15:35:40 -04:00
Auke Kok
c683ded0e6 Adjust for new augmented rbtree compute callback function signature
The new variant of the code that recomputes the augmented value
is designed to handle non-scalar types and to facilitate that, it
has new semantics for the _compute callback. It is now passed a
boolean flag `exit` that indicates that if the value isn't changed,
it should exit and halt propagation.

The callback function now shall return whether that propagation should
stop or not, and not the computed new value. The callback can now
directly update the new computed value in the node.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2023-10-09 15:35:40 -04:00
Auke Kok
f27431b3ae Add include <blkdev.h>.
Fixes: Error: implicit declaration of function ‘blkdev_put’

Previously this was an `extern` in <fs.h> and included implicitly,
hence the need to hard include it now.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2023-10-09 15:35:40 -04:00
Auke Kok
28c3cee995 preempt_mask.h is removed entirely.
v4.1-rc4-22-g92cf211874e9 merges this into preempt.h, and on
rhel7 kernels we don't need this include anymore either.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2023-10-09 15:35:40 -04:00
Auke Kok
430960ef3c page_cache_release() is removed. put_page() instead.
Even in 3.x, this already was equivalent.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2023-10-09 15:35:40 -04:00
Auke Kok
7006a84d96 flush_work_sync is equivalent to flush_work.
v3.15-rc1-6-g1a56f2aa4752 removes flush_work_sync entirely, but
ever since v3.6-rc1-25-g606a5020b9bd which made all workqueues
non-reentrant, it has been equivalent to flush_work.

This is safe because in all cases only one server->work can be
in flight at a time.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2023-10-09 15:35:40 -04:00
Auke Kok
eafb8621da d_materialise_unique replaced with d_splice_alias.
Note argument order reversal.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2023-10-09 15:35:40 -04:00
Auke Kok
006555d42a READ_ONCE() replaces ACCESS_ONCE()
v3.18-rc3-2-g230fa253df63 forces us to remove ACCESS_ONCE() with
READ_ONCE(), but it is probably the better interface and works with
non-scalar types.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2023-10-09 15:35:40 -04:00
Auke Kok
8e458f9230 PAGE_CACHE_SIZE was removed, replace with PAGE_SIZE.
PAGE_CACHE_SIZE was previously defined to be equivalent to PAGE_SIZE.

This symbol was removed in v4.6-rc1-32-g1fa64f198b9f.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2023-10-09 15:35:40 -04:00
Auke Kok
32c0dbce09 Include kernel.h and fs.h at the top of kernelcompat.h
Because we `-include src/kernelcompat.h` from the command line,
this header gets included before any of the kernel includes in
most .c and .h files. We should at least make sure we pull in
<fs> and <kernel> since they're required.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2023-10-09 15:35:40 -04:00
Zach Brown
4784ccdfd5 Start server commits when holds wait for alloc
Server code that wants to dirty blocks by holding a commit won't be
allowed to until the current allocators for the server transaction have
enough space for the holder.  As an active holder applies the commit the
allocators are refilled and the waiting holders will proceed.

But the current allocators can have no resources as the server starts
up.  There will never be active holders to apply the commit and refill
the allocators.  In this case all the holders will block indefinitely.

The fix is to trigger a server commit when a holder doesn't have room.
It used to be that commits were only triggered when apply callers were
waiting.  We transfer some of that logic into a new 'committing' field
so that we can have commits in flight without apply callers waiting.  We
add it to the server commit tracing.

While we're at it we clean up the logic that tests if a hold can
proceed.  It used to be confusingly split across two functions that both
could sample the current allocator space remaining.  This could lead to
weird cases where the first holder could use the second alloc remaining
call, not the one whose values were tested to see if the holder could
fit.  Now each hold check only samples the allocators once.

And finally we fix a subtle case where the budget exceeded message can
spuriously trigger in the case where dirtying the freed list created a
new empty block after the holder recorded the amount of space in the
freed block.

Signed-off-by: Zach Brown <zab@versity.com>
2023-10-03 13:32:09 -07:00
Zach Brown
1672b3ecec Merge pull request #130 from versity/zab/noncontig_alloc_einval
Fix partial preallocation when _contig_only = 0
2023-07-17 10:21:18 -07:00
Zach Brown
55f9435fad Fix partial preallocation when _contig_only = 0
Data preallocation attempts to allocate large aligned regions of
extents.  It tried to fill the hole around a write offset that
didn't contain an extent.  It missed the case where there can be
multiple extents between the start of the region and the hole.
It could try to overwrite these additional existing extents and writes
could return EINVAL.

We fix this by trimming the preallocation to start at the write offset
if there are any extents in the region before the write offset.  The
data preallocation test output has to be updated now that allocation
extents won't grow towards the start of the region when there are
existing extents.

Signed-off-by: Zach Brown <zab@versity.com>
2023-07-17 09:36:09 -07:00
Zach Brown
8a64b46a2f Process log merge splicing in many commits
Log merge completions were spliced in one server commit.  It's possible
to get enough completion work pending that it all can't be completed in
one server commit.  Operations fail with ENOSPC and because these
changes can't be unwound cleanly the server asserts.

This allows the completion splicing to break the work up into multiple
commits.

Processing completions in multiple commits means that request creation
can observe the merge status in states that weren't possible before.
Splicing is careful to maintain an elevated nr_complete count while the
client can't get requests because the tree is rebalancing.

Signed-off-by: Zach Brown <zab@versity.com>
2023-07-14 13:28:29 -07:00
Zach Brown
a9da27444f Merge pull request #128 from versity/zab/prealloc_fragmentation
Zab/prealloc fragmentation
2023-06-29 09:57:32 -07:00
Zach Brown
49fe89741d Merge pull request #125 from versity/zab/get_referring_entries
Zab/get referring entries
2023-06-29 09:57:06 -07:00
Zach Brown
847916860d Advance move_blocks extent search offset
The move_blocks ioctl finds extents to move in the source file by
searching from the starting block offset of the region to move.
Logically, this is fine.  After each extent item is deleted the next
search will find the next extent.

The problem is that deleted items still exist in the item cache.  The
next iteration has to skip over all the deleted extents from the start
of the region.  This is fine with large extents, but with heavily
fragmented extents this creates a huge amplification of the number of
items to traverse when moving the fragmented extents in a large file.
(It's not quite O(n^2)/2 for the total extents, deleted items are purged
as we write out the dirty items in each transaction.. but it's still
immense.)

The fix is to simply start searching for the next extent after the one
we just moved.

Signed-off-by: Zach Brown <zab@versity.com>
2023-06-28 16:54:28 -07:00
Zach Brown
3d99fda0f6 Preallocate data around iblock when noncontig
If the _contig_only option isn't set then we try to preallocate aligned
regions of files.  The initial implementation naively only allowed one
preallocation attempt in each aligned region.  If it got a small
allocation that didn't fill the region then every future allocation
in the region would be a single block.

This changes every preallocation in the region to attempt to fill the
hole in the region that iblock fell in.  It uses an extra extent search
(item cache search) to try and avoid thousands of single block
allocations.

Signed-off-by: Zach Brown <zab@versity.com>
2023-06-28 12:21:25 -07:00
Zach Brown
acafb869e7 Avoid deadlock from block reclaim in rht resize
The RCU hash table uses deferred work to resize the hash table.  There's
a time during resize when hash table iteration will return EAGAIN until
resize makes more progress.  During this time resize can perform
GFP_KERNEL allocations.

Our shrinker tries to iterate over its RCU hash table to find blocks to
reclaim.  It tries to restart iteration if it gets EAGAIN on the
assumption that it will be usable again soon.

Combine the two and our shrinker can get stuck retrying iteration
indefinitely because it's shrinking on behalf of the hash table resizing
that is trying to allocate the next table before making iteration work
again.  We have to stop shrinking in this case so that the resizing
caller can proceed.

Signed-off-by: Zach Brown <zab@versity.com>
2023-06-15 14:45:26 -07:00
Zach Brown
707752a7bf Add get_referring_entries ioctl
Add an ioctl that gives the callers all entries that refer to an inode.
It's like a backwards readdir.  It's a light bit of translation between
the internal _add_next_linkrefs() list of entries and the ioctl
interface of a buffer of entry structs.

Signed-off-by: Zach Brown <zab@versity.com>
2023-06-14 14:12:10 -07:00
Zach Brown
0316c22026 Extend scoutfs_dir_add_next_linkrefs
Extend scoutfs_dir_add_next_linkref() to be able to return multiple
backrefs under the lock for each call and have it take an argument to
limit the number of backrefs that can be added and returned.

Its return code changes a bit in that it returns 1 on success instead of
0 so we have to be a little careful with callers who were expecting 0.
It still returns -ENOENT when no entries are found.

We break up its tracepoint into one that records each entry added and
one that records the result of each call.

This will be used by an ioctl to give callers just the entries that
point to an inode instead of assembling full paths from the root.

Signed-off-by: Zach Brown <zab@versity.com>
2023-06-14 14:12:10 -07:00
Zach Brown
2b72c57cb0 Fix crash in quorum_heartbeat_timeout_ms parsing
Mount option parsing runs early enough that the rest of the option
read/write serialization infrastructure isn't set up yet.  The
quorum_heartbeat_timeout_ms mount option tried to use a helper that
updated the stored option but it wasn't initialized yet so it crashed.

The helper was really only to have the option validity test in one
place.  It's reworked to only verify the option and the actual setting
is left to the callers.

Signed-off-by: Zach Brown <zab@versity.com>
2023-05-22 16:29:56 -07:00
Zach Brown
15de0c21c1 Have quorum drop messages on force unmount
Forced unmount is supposed to isolate the mount from the world.  The
net.c TCP messaging returns errors when sending during forced unmount.
The quorum code has its own UDP messaging and wasn't taking forced
unmount into account.

This lead to quorum still being able to send resignation messages to
other quorum peers during forced unmount, making it hard to test
heartbeat timeouts with forced unmount.

The quorum messaging is already unreliable so we can easily make it drop
messages during forced unmount.  Now forced unmount more fully isolates
the quorum code and it becomes easier to test.

Signed-off-by: Zach Brown <zab@versity.com>
2023-05-18 10:01:19 -07:00
Zach Brown
7b65767803 Track and log quorum heartbeat delays
Add tracking and reporting of delays in sending or receiving quorum
heartbeat messages.  We measure the time between back to back sends or
receives of heartbeat messages.  We record these delays truncated down
to second granularity in the quorum sysfs status file.  We log messages
to the console for each longest measured delay up to the maximum
configurable heartbeat timeout.

Signed-off-by: Zach Brown <zab@versity.com>
2023-05-17 14:44:27 -07:00
Zach Brown
46640e4ff9 Add counter for quorum heartbeat send failures
Add a counter which tracks the number of heartbeat message send attempts
which fail.

Signed-off-by: Zach Brown <zab@versity.com>
2023-05-17 14:44:27 -07:00
Zach Brown
912906f050 Make quorum heartbeat timeout tunable
Add mount and sysfs options for changing the quorum heartbeat timeout.
This allows setting a longer delay in taking over for failed hosts that
has a greater chance of surviving temporary non-fatal delays.

We also double the existing default timeout to 10s which is still
reasonably responsive.

Signed-off-by: Zach Brown <zab@versity.com>
2023-05-17 14:44:27 -07:00
Zach Brown
ec02cf442b Use lower latency allocation in quorum socket
The quorum udp socket allocation still allowed starting io which can
trigger longer latencies trying to free memory.  We change the flags to
prefer dipping into emergency pools and then failing rather than
blocking trying to satisfy an allocation.  We'd much rather have a given
heartbeat attempt fail and have the opportunity to succeed at the next
interval rather than running the risk of blocking across multiple
intervals.

Signed-off-by: Zach Brown <zab@versity.com>
2023-05-17 14:44:27 -07:00
Zach Brown
0e9cd1eea5 Use specific work queue for quorum work
The quorum work was using the system workq.  While that's mostly fine,
we can create a dedicated workqueue with the specific flags that we
need.  The quorum work needs to run promptly to avoid fencing so we set
it to high priority.

Signed-off-by: Zach Brown <zab@versity.com>
2023-05-17 14:44:27 -07:00
Zach Brown
e18ea24561 Move quorum recv that sets timeout before check
In the quorum work loop some message receive actions extend the timeout
after the timeout expiration is checked.  This is usually fine when the
work runs soon after the messages are received and before the timeout
expires.  But under load the work might not schedule until long after
both the message has been received and the timeout has expired.

If the message was a heartbeat message then the wakeup delay would be
mistaken for lack of activity on the server and it would try to take
over for an otherwise active server.

This moves the extension of the heartbeat on message receive to before
the timeout is checked.  In our case of a delayed heartbeat message it
would still find it in the recv queue and extend the timeout, avoiding
fencing an active server.

Signed-off-by: Zach Brown <zab@versity.com>
2023-05-17 09:56:53 -07:00
Zach Brown
bb01a3990f Set sb->s_time_gran to support nsecs
We missed initializing sb->s_time_gran which controls how some parts of
the kernel truncate the granularity of nsec in timespec.  Some paths
don't use it at all so time would be maintained at full precision.  But
other paths, particularly setattr_copy() from userspace and
notify_change() from the kernel use it to truncate as times are set.

Setting s_time_gran to 1 maintains full nsec precision.

Signed-off-by: Zach Brown <zab@versity.com>
2023-03-24 10:50:34 -07:00
Zach Brown
a61b8d9961 Fix renaming into root directory
The VFS performs a lot of checks on renames before calling the fs
method.  We acquire locks and refresh inodes in the rename method so we
have to duplciate a lot of the vfs checks.

One of the checks involves loops with ancestors and subdirectories.  We
missed the case where the root directory is the destination and doesn't
have any parent directories.  The backref walker it calls returns
-ENOENT instead of 0 with an empty set of parents and that error bubbled
up to rename.

The fix is to notice when we're asking for ancestors of the one
directory that can't have ancestors and short circuit the test.

Signed-off-by: Zach Brown <zab@versity.com>
2023-03-08 11:00:59 -08:00
Zach Brown
2e2ccb6f61 Allow replaying srch file rotation
When a client no longer needs to append to a srch file, for whatever
reason, we move the reference from the log_trees item into a specific
srch file btree item in the server's srch file tracking btree.

Zeroing the log_trees item and inserting the server's btree item are
done in a server commit and should be written atomically.

But commit_log_trees had an error handling case that could leave the
newly inserted item dirty in memory without zeroing the srch file
reference in the existing log_trees item.  Future attempts to rotate the
file reference, perhaps by retrying the commit or by reclaiming the
client's rid, would get EEXIST and fail.

This fixes the error handling path to ensure that we'll keep the dirty
srch file btree and log_trees item in sync.  The desynced items can
still exist in the world so we'll tolerate getting EEXIST on insertion.
After enough time has passed, or if repair zeroed the duplicate
reference, we could remove this special case from insertion.

Signed-off-by: Zach Brown <zab@versity.com>
2023-01-17 14:33:27 -08:00
Zach Brown
01c8bba56d Merge pull request #109 from versity/zab/server_statfs_stable_blocks
Zab/server statfs stable blocks
2023-01-12 09:58:48 -08:00
Zach Brown
17cb1fe84b Merge pull request #110 from versity/zab/partial_alloc_move
Allow partial extent motion
2023-01-12 09:58:12 -08:00
Zach Brown
a23e7478a0 Fix move_blocks loop exit conditions
The move_blocks ioctl intends to only move extents whose bytes fall
inside i_size.  This is easy except for a final extent that straddles an
i_size that isn't aligned to 4K data blocks.

The code that either checked for an extent being entirely past i_size or
for limiting the number of blocks to move by i_size clumsily compared
i_size offsets in bytes with extent counts in 4KB blocks.  In just the
right circumstances, probably with the help of a byte length to move
that is much larger than i_size, the length calculation could result in
trying to move 0 blocks.  Once this hit the loop would keep finding that
extent and calculating 0 blocks to move and would be stuck.

We fix this by clamping the count of blocks in extents to move in terms
of byte offsets at the start of the loop.  This gets rid of the extra
size checks and byte offset use in the loop.  We also add a sanity check
to make sure that we can't get stuck if, say, corruption resulted in an
otherwise impossible zero length extent.

Signed-off-by: Zach Brown <zab@versity.com>
2023-01-10 09:34:52 -08:00
Zach Brown
7c2d83e2f8 Remove saved super block in scoutfs_sb_info
Now that we've removed its users we can remove the global saved copy of
the super block from scoutfs_sb_info.

Signed-off-by: Zach Brown <zab@versity.com>
2023-01-06 11:15:45 -08:00
Zach Brown
40aa47c888 Have the server keep a private dirty super block
As the server does its work its transactions modify a dirty super block
in memory.  This used the global super block in scoutfs_sb_info which
was visible to everything, including the client.  Move the dirty super
block over to the private server info so that only the server can see
it.

This is mostly boring storage motion but we do change that the quorum
code hands the server a static copy of the quorum config to use as it
starts up before it reads the most recent super block.

Signed-off-by: Zach Brown <zab@versity.com>
2023-01-06 11:15:45 -08:00
Zach Brown
c1bd7bcce5 Allow partial extent motion
Refilling a client's data_avail is the only alloc_move call that doesn't
try and limit the number of blocks that it dirties.  If it doesn't find
sufficiently large extents it can exhaust the server's alloc budget
without hitting the target.  It'll try to dirty blocks and return a hard
error.

This changes that behaviour to allow returning 0 if it moved any
extents.  Other callers can deal with partial progress as they already
limit the blocks they dirty.  This will also return ENOSPC if it hadn't
moved anything just as the current code would.

The result is that data fill can not necessarily hit the target.  It
might take multiple commits to fill the data_avail btree.

Signed-off-by: Zach Brown <zab@versity.com>
2022-12-15 20:47:41 -08:00
Zach Brown
7720222588 Have statfs use unlocked stable roots
The server's statfs request handler was intending to lock dirty
structures as they were walked to get sums used for statfs fields.
Other callers walk stable structures, though, so the summation calls had
grown iteration over other structures that the server didn't know it had
to lock.

This meant that the server was walking unlocked dirty structures as they
were being modified.  The races are very tight, but it can result in
request handling errors that shut down connections and IO errors from
trying to read inconsistent refs as they were modified by the locked
writer.

We've built up infrastructure so the server can now walk stable
structures just like the other callers.  It will no longer wander into
dirty blocks so it doesn't need to lock them and it will retry if its
walk of stale data crosses a broken reference.

Signed-off-by: Zach Brown <zab@versity.com>
2022-12-12 14:59:22 -08:00
Zach Brown
fff07ce19c Use stale block read retrying helper
Transition from manual checking for persistent ESTALE to the shared
helper that we just added.  This should not change behavior.

Signed-off-by: Zach Brown <zab@versity.com>
2022-12-12 14:59:22 -08:00
Zach Brown
464de56d28 Add stale block read retrying helper
Many readers had little implementations of the logic to decide to retry
stale reads with different refs or decide that they're persistent and
return hard errors.  Let's move that into a small helper.

Signed-off-by: Zach Brown <zab@versity.com>
2022-12-12 14:59:22 -08:00
Zach Brown
342c206550 Have scoutfs_forest_inode_count return stale reads
scoutfs_forest_inode_count() assumed it was called with stable refs and
would always translate ESTALE to EIO.  Change it so that it passes
ESTALE to the caller who is responsible for handling it.

The server will use this to retry reading from stable supers that it's
storing in memory.

Signed-off-by: Zach Brown <zab@versity.com>
2022-12-12 14:59:22 -08:00