The move_blocks ioctl finds extents to move in the source file by
searching from the starting block offset of the region to move.
Logically, this is fine. After each extent item is deleted the next
search will find the next extent.
The problem is that deleted items still exist in the item cache. The
next iteration has to skip over all the deleted extents from the start
of the region. This is fine with large extents, but with heavily
fragmented extents this creates a huge amplification of the number of
items to traverse when moving the fragmented extents in a large file.
(It's not quite O(n^2)/2 over the total extents, since deleted items
are purged as we write out the dirty items in each transaction, but
it's still immense.)
The fix is to simply start searching for the next extent after the one
we just moved.
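The amplification can be sketched with a toy model of the item cache. All names and types here are hypothetical, and extents are single blocks for brevity:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Toy model of the item cache: moved extents are marked deleted but
 * stay cached until the transaction writes them out.
 */
struct extent {
	uint64_t start;
	int deleted;
};

/*
 * Move every extent, counting how many cached items the searches
 * visit in total.  With advance == 0 each search restarts from the
 * region start and walks over every deleted item again; with
 * advance == 1 it starts just past the extent we moved.
 */
static size_t move_all(struct extent *ext, size_t nr, int advance)
{
	uint64_t pos = 0;
	size_t moved = 0;
	size_t visits = 0;
	size_t i;

	while (moved < nr) {
		for (i = 0; i < nr; i++) {
			if (ext[i].start < pos)
				continue;
			visits++;
			if (!ext[i].deleted)
				break;
		}
		if (i == nr)
			break;
		ext[i].deleted = 1;
		moved++;
		if (advance)
			pos = ext[i].start + 1;
	}
	return visits;
}
```

With four fragmented extents the restart-from-start search visits 10 cached items while the advancing search visits 4, and the gap grows roughly quadratically with fragmentation.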
Signed-off-by: Zach Brown <zab@versity.com>
Add a test which exercises filling holes in prealloc regions when the
_contig_only prealloc option is not set.
Signed-off-by: Zach Brown <zab@versity.com>
If the _contig_only option isn't set then we try to preallocate aligned
regions of files. The initial implementation naively only allowed one
preallocation attempt in each aligned region. If it got a small
allocation that didn't fill the region then every future allocation
in the region would be a single block.
This changes every preallocation in the region to attempt to fill the
hole in the region that iblock fell in. It uses an extra extent search
(item cache search) to try and avoid thousands of single block
allocations.
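The hole-filling arithmetic might look something like the following sketch. The function name, parameters, and conventions are hypothetical, not the actual scoutfs code:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: clamp a preallocation to the hole that
 * 'iblock' falls in within its aligned region.  'prev_end' is one
 * past the extent before iblock (0 if none) and 'next_start' is the
 * start of the extent after it (or a huge value), both found by the
 * extra extent search.
 */
static void fill_region_hole(uint64_t iblock, uint64_t region_blocks,
			     uint64_t prev_end, uint64_t next_start,
			     uint64_t *start, uint64_t *count)
{
	uint64_t reg_start = iblock - (iblock % region_blocks);
	uint64_t reg_end = reg_start + region_blocks;
	uint64_t hole_start = prev_end > reg_start ? prev_end : reg_start;
	uint64_t hole_end = next_start < reg_end ? next_start : reg_end;

	*start = hole_start;
	*count = hole_end - hole_start;
}
```

One allocation attempt per hole, rather than one per block, is what avoids the thousands of single block allocations.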
Signed-off-by: Zach Brown <zab@versity.com>
The RCU hash table uses deferred work to resize the hash table. There's
a time during resize when hash table iteration will return EAGAIN until
resize makes more progress. During this time resize can perform
GFP_KERNEL allocations.
Our shrinker tries to iterate over its RCU hash table to find blocks to
reclaim. It tries to restart iteration if it gets EAGAIN on the
assumption that it will be usable again soon.
Combine the two and our shrinker can get stuck retrying iteration
indefinitely because it's shrinking on behalf of the hash table resizing
that is trying to allocate the next table before making iteration work
again. We have to stop shrinking in this case so that the resizing
caller can proceed.
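The deadlock shape can be modeled in a few lines. This is a toy simulation with invented names, not the real shrinker:

```c
#include <assert.h>
#include <errno.h>

/*
 * Toy model: iteration returns -EAGAIN while a resize is in
 * progress, and that resize may itself be waiting on our shrinker.
 */
static int resizing;

static int iter_next(int *block)
{
	if (resizing)
		return -EAGAIN;
	*block = 1;
	return 0;
}

/* reclaim up to nr blocks, bailing out instead of retrying on -EAGAIN */
static int shrink(int nr_to_scan)
{
	int freed = 0;
	int blk;

	while (nr_to_scan-- > 0) {
		if (iter_next(&blk) == -EAGAIN)
			break;		/* was: restart the iteration */
		freed++;
	}
	return freed;
}
```

Returning a partial (possibly zero) count lets the allocation that triggered shrinking proceed and finish the resize.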
Signed-off-by: Zach Brown <zab@versity.com>
Add an ioctl that gives the callers all entries that refer to an inode.
It's like a backwards readdir. It's a light bit of translation between
the internal _add_next_linkrefs() list of entries and the ioctl
interface of a buffer of entry structs.
Signed-off-by: Zach Brown <zab@versity.com>
Extend scoutfs_dir_add_next_linkref() to be able to return multiple
backrefs under the lock for each call and have it take an argument to
limit the number of backrefs that can be added and returned.
Its return code changes a bit in that it returns 1 on success instead
of 0, so we have to be a little careful with callers that were
expecting 0.
It still returns -ENOENT when no entries are found.
We break up its tracepoint into one that records each entry added and
one that records the result of each call.
This will be used by an ioctl to give callers just the entries that
point to an inode instead of assembling full paths from the root.
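The convention change for callers can be sketched as follows. This is a hypothetical model of the interface, not the actual function signature:

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical model of the new calling convention: up to 'limit'
 * entries are added per call, the count comes back through 'added',
 * and the return is 1 on success or -ENOENT when no entries remain,
 * where callers used to test for 0.
 */
static int add_next_linkrefs(int remaining, int limit, int *added)
{
	int nr = remaining < limit ? remaining : limit;

	if (nr == 0)
		return -ENOENT;
	*added = nr;
	return 1;
}

/* a caller updated from 'ret == 0' success tests to 'ret > 0' */
static int count_all_entries(int total, int limit)
{
	int found = 0, added, ret;

	while ((ret = add_next_linkrefs(total - found, limit, &added)) > 0)
		found += added;
	return ret == -ENOENT ? found : ret;
}
```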
Signed-off-by: Zach Brown <zab@versity.com>
Update the quorum_heartbeat_timeout_ms test to also test the mount
option, not just updating the timeout via sysfs. This takes some
reworking as we have to avoid the active leader/server when setting the
timeout via the mount option. We also allow for a bit more slack around
comparing kernel sleeps and userspace wall clocks.
Signed-off-by: Zach Brown <zab@versity.com>
Mount option parsing runs early enough that the rest of the option
read/write serialization infrastructure isn't set up yet. The
quorum_heartbeat_timeout_ms mount option tried to use a helper that
updated the stored option, but that serialization wasn't initialized
yet, so it crashed.
The helper was really only to have the option validity test in one
place. It's reworked to only verify the option and the actual setting
is left to the callers.
Signed-off-by: Zach Brown <zab@versity.com>
If setting a sysfs option fails, the bash write error is output. It
contains the script line number, which can change over time, leading
to mismatched golden output failures if we used the output as an
expected indication of failure. Callers should test its rc and emit
output accordingly if they want the failure logged and compared.
Signed-off-by: Zach Brown <zab@versity.com>
Forced unmount is supposed to isolate the mount from the world. The
net.c TCP messaging returns errors when sending during forced unmount.
The quorum code has its own UDP messaging and wasn't taking forced
unmount into account.
This led to quorum still being able to send resignation messages to
other quorum peers during forced unmount, making it hard to test
heartbeat timeouts with forced unmount.
The quorum messaging is already unreliable so we can easily make it drop
messages during forced unmount. Now forced unmount more fully isolates
the quorum code and it becomes easier to test.
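Because drops are already tolerated, the send-side change is essentially a single early return. A toy sketch with invented names:

```c
#include <assert.h>

/*
 * Hypothetical sketch: quorum messaging is already unreliable, so
 * the send path can simply discard messages once forced unmount
 * begins.
 */
static int forced_unmount;
static int sent_count;

static void quorum_send_message(void)
{
	if (forced_unmount)
		return;		/* drop: peers must tolerate lost messages */
	sent_count++;
}
```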
Signed-off-by: Zach Brown <zab@versity.com>
Add tracking and reporting of delays in sending or receiving quorum
heartbeat messages. We measure the time between back to back sends or
receives of heartbeat messages. We record these delays truncated down
to second granularity in the quorum sysfs status file. We log messages
to the console for each longest measured delay up to the maximum
configurable heartbeat timeout.
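The tracking could look roughly like this sketch, with hypothetical types and a made-up cap standing in for the configurable maximum:

```c
#include <assert.h>
#include <stdint.h>

#define MAX_TIMEOUT_SECS 60	/* assumed cap for the sketch */

/*
 * Hypothetical tracker: given the timestamps of back to back
 * heartbeat sends (or receives), record the delay truncated down to
 * whole seconds and note whether it set a new longest, which the
 * real code would log to the console.
 */
struct delay_track {
	uint64_t last_ns;
	uint64_t counts[MAX_TIMEOUT_SECS + 1];
	uint64_t longest_secs;
};

/* returns 1 if this delay is a new longest, 0 otherwise */
static int record_delay(struct delay_track *dt, uint64_t now_ns)
{
	uint64_t secs = (now_ns - dt->last_ns) / 1000000000ULL;
	int new_longest = 0;

	dt->last_ns = now_ns;
	if (secs > MAX_TIMEOUT_SECS)
		secs = MAX_TIMEOUT_SECS;
	dt->counts[secs]++;
	if (secs > dt->longest_secs) {
		dt->longest_secs = secs;
		new_longest = 1;
	}
	return new_longest;
}
```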
Signed-off-by: Zach Brown <zab@versity.com>
Add mount and sysfs options for changing the quorum heartbeat timeout.
This allows setting a longer delay before taking over for failed
hosts, which has a greater chance of surviving temporary non-fatal
delays.
We also double the existing default timeout to 10s which is still
reasonably responsive.
Signed-off-by: Zach Brown <zab@versity.com>
The quorum UDP socket allocation still allowed starting I/O, which
can trigger long latencies trying to free memory. We change the flags to
prefer dipping into emergency pools and then failing rather than
blocking trying to satisfy an allocation. We'd much rather have a given
heartbeat attempt fail and have the opportunity to succeed at the next
interval rather than running the risk of blocking across multiple
intervals.
Signed-off-by: Zach Brown <zab@versity.com>
The quorum work was using the system workq. While that's mostly fine,
we can create a dedicated workqueue with the specific flags that we
need. The quorum work needs to run promptly to avoid fencing so we set
it to high priority.
Signed-off-by: Zach Brown <zab@versity.com>
In the quorum work loop some message receive actions extend the timeout
after the timeout expiration is checked. This is usually fine when the
work runs soon after the messages are received and before the timeout
expires. But under load the work might not schedule until long after
both the message has been received and the timeout has expired.
If the message was a heartbeat message then the wakeup delay would be
mistaken for lack of activity on the server and it would try to take
over for an otherwise active server.
This moves the extension of the heartbeat on message receive to before
the timeout is checked. In our case of a delayed heartbeat message it
would still find it in the recv queue and extend the timeout, avoiding
fencing an active server.
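The reordering can be illustrated with a toy model. The struct and function here are invented for the sketch, not the real quorum code:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model of the reordering: messages may sit in the receive queue
 * long before the work runs, so extend the deadline from queued
 * heartbeats before comparing against it.
 */
struct quorum {
	uint64_t deadline;
	uint64_t timeout;
	int queued_heartbeats;
};

/* returns 1 if we'd decide the server timed out at 'now' */
static int check_timeout(struct quorum *q, uint64_t now)
{
	/* drain received heartbeats first, extending the deadline */
	if (q->queued_heartbeats > 0) {
		q->queued_heartbeats = 0;
		q->deadline = now + q->timeout;
	}
	return now >= q->deadline;
}
```

Checking expiry first, as the old code did, would declare a timeout at now = 15 even though a heartbeat was already sitting in the queue.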
Signed-off-by: Zach Brown <zab@versity.com>
Add a command for writing a super block to a new data device after
reading the metadata device to ensure that there's no existing
data on the old data device.
Signed-off-by: Zach Brown <zab@versity.com>
Some tests had grown a bad pattern of making a mount point for the
scratch mount in the root /mnt directory. Change them to use a mount
point in their test's temp directory outside the testing fs.
Signed-off-by: Zach Brown <zab@versity.com>
Split the existing device_size() into get_device_size() and
limit_device_size(). An upcoming command wants to get the device size
without applying limiting policy.
Signed-off-by: Zach Brown <zab@versity.com>
We missed initializing sb->s_time_gran which controls how some parts of
the kernel truncate the granularity of nsec in timespec. Some paths
don't use it at all so time would be maintained at full precision. But
other paths, particularly setattr_copy() from userspace and
notify_change() from the kernel, use it to truncate as times are set.
Setting s_time_gran to 1 maintains full nsec precision.
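A userspace sketch of the truncation those paths apply, assuming a granularity given in nanoseconds:

```c
#include <assert.h>
#include <time.h>

/*
 * tv_nsec is rounded down to a multiple of s_time_gran, so a
 * granularity of 1 keeps full nsec precision while, say, 1000
 * truncates to microseconds.
 */
static struct timespec truncate_ts(struct timespec ts, long gran)
{
	if (gran > 1)
		ts.tv_nsec -= ts.tv_nsec % gran;
	return ts;
}
```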
Signed-off-by: Zach Brown <zab@versity.com>
The VFS performs a lot of checks on renames before calling the fs
method. We acquire locks and refresh inodes in the rename method so we
have to duplicate a lot of the VFS checks.
One of the checks looks for loops between ancestors and
subdirectories. We missed the case where the root directory is the
destination and doesn't have any parent directories. The backref
walker it calls returns -ENOENT instead of 0 with an empty set of
parents, and that error bubbled up to rename.
The fix is to notice when we're asking for ancestors of the one
directory that can't have ancestors and short circuit the test.
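The shape of the fix, sketched with a hypothetical root inode number and stand-in functions:

```c
#include <assert.h>
#include <errno.h>

#define ROOT_INO 1ULL	/* hypothetical root inode number */

/* stand-in for the backref walk: the root has no parent entries */
static int walk_backrefs(unsigned long long ino)
{
	return ino == ROOT_INO ? -ENOENT : 0;
}

/*
 * The loop check asks for a directory's ancestors; the root is the
 * one directory that can't have any, so answer before walking.
 */
static int check_ancestors(unsigned long long dir_ino)
{
	if (dir_ino == ROOT_INO)
		return 0;
	return walk_backrefs(dir_ino);
}
```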
Signed-off-by: Zach Brown <zab@versity.com>