Add tracking and reporting of delays in sending or receiving quorum
heartbeat messages. We measure the time between back-to-back sends or
receives of heartbeat messages. We record these delays, truncated down
to second granularity, in the quorum sysfs status file. We log a
message to the console each time a new longest delay is measured, up to
the maximum configurable heartbeat timeout.
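The tracking above can be sketched in user-space C. All of these names
and types are invented for illustration and don't come from the actual
scoutfs code:

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

struct hb_delay_stats {
	uint64_t last_ns;	/* timestamp of previous send/recv */
	uint64_t max_secs;	/* longest delay seen, in whole seconds */
};

/* returns nonzero when a new longest delay was recorded */
static int hb_record(struct hb_delay_stats *st, uint64_t now_ns)
{
	uint64_t secs;
	int is_new_max = 0;

	if (st->last_ns) {
		/* truncate the delay down to second granularity */
		secs = (now_ns - st->last_ns) / NSEC_PER_SEC;
		if (secs > st->max_secs) {
			st->max_secs = secs;
			is_new_max = 1;	/* caller would log to the console */
		}
	}
	st->last_ns = now_ns;
	return is_new_max;
}
```

Truncating down means sub-second delays record as 0 and only whole
elapsed seconds show up in the sysfs status file.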
Signed-off-by: Zach Brown <zab@versity.com>
Add mount and sysfs options for changing the quorum heartbeat timeout.
This allows setting a longer delay before taking over for failed hosts,
which has a greater chance of surviving temporary non-fatal delays.
We also double the existing default timeout to 10s which is still
reasonably responsive.
Signed-off-by: Zach Brown <zab@versity.com>
The quorum UDP socket allocation still allowed starting IO, which can
trigger long latencies while trying to free memory. We change the flags
to prefer dipping into emergency pools, and then failing, rather than
blocking to satisfy the allocation. We'd much rather have a given
heartbeat attempt fail, with the opportunity to succeed at the next
interval, than run the risk of blocking across multiple intervals.
Signed-off-by: Zach Brown <zab@versity.com>
The quorum work was using the system workqueue. While that's mostly fine,
we can create a dedicated workqueue with the specific flags that we
need. The quorum work needs to run promptly to avoid fencing so we set
it to high priority.
Signed-off-by: Zach Brown <zab@versity.com>
In the quorum work loop some message receive actions extend the timeout
after the timeout expiration is checked. This is usually fine when the
work runs soon after the messages are received and before the timeout
expires. But under load the work might not schedule until long after
both the message has been received and the timeout has expired.
If the message was a heartbeat message then the wakeup delay would be
mistaken for a lack of activity from the server, and we would try to
take over for an otherwise active server.
This moves the extension of the heartbeat timeout on message receive to
before the timeout is checked. In the case of a delayed heartbeat
message, the work would still find it in the recv queue and extend the
timeout, avoiding fencing an active server.
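The reordering can be sketched with a simplified model of the work loop.
The names and structure here are invented; the point is only the order
of operations: drain the recv queue, extending the deadline for each
heartbeat found, and only then compare the deadline against the current
time:

```c
#include <assert.h>
#include <stdint.h>

struct quorum_state {
	uint64_t deadline;	/* heartbeat timeout expiration */
	uint64_t timeout;	/* heartbeat timeout interval */
	int nr_heartbeats;	/* heartbeats waiting in the recv queue */
};

static void drain_recv_queue(struct quorum_state *qs, uint64_t now)
{
	while (qs->nr_heartbeats > 0) {
		/* each received heartbeat extends the deadline */
		qs->deadline = now + qs->timeout;
		qs->nr_heartbeats--;
	}
}

/* returns nonzero if we'd try to take over for the server */
static int quorum_work(struct quorum_state *qs, uint64_t now)
{
	/* extend the deadline from queued messages *before* checking it */
	drain_recv_queue(qs, now);
	return now >= qs->deadline;
}
```

With the old order, checking `now >= qs->deadline` before draining the
queue, a work item that ran late under load would see an expired
deadline even though a heartbeat was already sitting in the queue.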
Signed-off-by: Zach Brown <zab@versity.com>
Add a command for writing a super block to a new data device after
reading the metadata device to ensure that there's no existing
data on the old data device.
Signed-off-by: Zach Brown <zab@versity.com>
Some tests had grown a bad pattern of making a mount point for the
scratch mount in the root /mnt directory. Change them to use a mount
point in their test's temp directory outside the testing fs.
Signed-off-by: Zach Brown <zab@versity.com>
Split the existing device_size() into get_device_size() and
limit_device_size(). An upcoming command wants to get the device size
without applying limiting policy.
Signed-off-by: Zach Brown <zab@versity.com>
We missed initializing sb->s_time_gran, which controls how some parts
of the kernel truncate the nanosecond granularity of timespec values.
Some paths don't use it at all, so time would be maintained at full
precision. But other paths, particularly setattr_copy() from userspace
and notify_change() from the kernel, use it to truncate times as they
are set. Setting s_time_gran to 1 maintains full nanosecond precision.
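A user-space sketch of the truncation, loosely modeled on the kernel's
timespec truncation logic (the struct and function names here are
invented), shows why a granularity of 1 is the full-precision case:

```c
#include <assert.h>
#include <stdint.h>

struct ts {
	int64_t sec;
	int32_t nsec;
};

/* truncate nsec to a superblock granularity given in nanoseconds */
static struct ts trunc_to_gran(struct ts t, uint32_t gran)
{
	if (gran == 1) {
		/* nothing to do, nsec passes through untouched */
	} else if (gran == 1000000000) {
		t.nsec = 0;		/* whole-second granularity */
	} else {
		t.nsec -= t.nsec % (int32_t)gran;
	}
	return t;
}
```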
Signed-off-by: Zach Brown <zab@versity.com>
The VFS performs a lot of checks on renames before calling the fs
method. We acquire locks and refresh inodes in the rename method so we
have to duplicate a lot of the VFS checks.
One of the checks guards against loops between ancestors and
subdirectories. We missed the case where the root directory is the
destination and doesn't have any parent directories. The backref walker
it calls returns -ENOENT instead of 0 with an empty set of parents, and
that error bubbled up to rename.
The fix is to notice when we're asking for ancestors of the one
directory that can't have ancestors and short circuit the test.
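A simplified sketch of the short circuit, with invented names and a
stubbed-out walker standing in for the real backref code:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define ROOT_INO 1

/* stand-in for the backref walker: it finds no parents for the root */
static int walk_backrefs(uint64_t ino, int *nr_parents)
{
	if (ino == ROOT_INO)
		return -ENOENT;
	*nr_parents = 1;	/* pretend every other dir has one parent */
	return 0;
}

static int get_ancestors(uint64_t ino, int *nr_parents)
{
	if (ino == ROOT_INO) {
		/* the one directory that can't have ancestors */
		*nr_parents = 0;
		return 0;
	}
	return walk_backrefs(ino, nr_parents);
}
```

Without the early return, the walker's -ENOENT for the root would bubble
up and fail an otherwise valid rename into the root directory.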
Signed-off-by: Zach Brown <zab@versity.com>
When a client no longer needs to append to a srch file, for whatever
reason, we move the reference from the log_trees item into a specific
srch file btree item in the server's srch file tracking btree.
Zeroing the log_trees item and inserting the server's btree item are
done in a server commit and should be written atomically.
But commit_log_trees had an error handling case that could leave the
newly inserted item dirty in memory without zeroing the srch file
reference in the existing log_trees item. Future attempts to rotate the
file reference, perhaps by retrying the commit or by reclaiming the
client's rid, would get EEXIST and fail.
This fixes the error handling path to ensure that we'll keep the dirty
srch file btree and log_trees item in sync. The desynced items can
still exist in the world so we'll tolerate getting EEXIST on insertion.
After enough time has passed, or if repair zeroed the duplicate
reference, we could remove this special case from insertion.
Signed-off-by: Zach Brown <zab@versity.com>
The move_blocks ioctl intends to only move extents whose bytes fall
inside i_size. This is easy except for a final extent that straddles an
i_size that isn't aligned to 4K data blocks.
The code that checked for an extent being entirely past i_size, or that
limited the number of blocks to move by i_size, clumsily compared
i_size offsets in bytes with extent counts in 4KB blocks. In just the
right circumstances, probably with the help of a byte length to move
that is much larger than i_size, the length calculation could result in
trying to move 0 blocks. Once this hit, the loop would keep finding
that extent, calculating 0 blocks to move, and getting stuck.
We fix this by clamping the count of blocks to move, in terms of byte
offsets, at the start of the loop. This gets rid of the extra
size checks and byte offset use in the loop. We also add a sanity check
to make sure that we can't get stuck if, say, corruption resulted in an
otherwise impossible zero length extent.
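The clamping can be sketched as a small helper. The function and its
shape are invented for illustration; the point is that the byte count
is clamped to i_size first and only then converted to whole 4K blocks,
so a straddling extent always yields a nonzero count:

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_SHIFT 12
#define BLOCK_SIZE (1ULL << BLOCK_SHIFT)

/* how many 4K blocks of an extent at pos/len fall inside i_size */
static uint64_t blocks_to_move(uint64_t pos, uint64_t len, uint64_t i_size)
{
	uint64_t bytes;

	if (pos >= i_size)
		return 0;		/* entirely past i_size */

	bytes = len;
	if (pos + bytes > i_size)
		bytes = i_size - pos;	/* clamp in bytes, not blocks */

	/* round the clamped byte count up to whole blocks */
	return (bytes + BLOCK_SIZE - 1) >> BLOCK_SHIFT;
}
```

An extent straddling an unaligned i_size rounds up to include its final
partial block, so the loop always makes forward progress.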
Signed-off-by: Zach Brown <zab@versity.com>
There were kernels that didn't apply the current umask to inode modes
created with O_TMPFILE without acls. Let's have a test running to make
sure that we're not surprised if we come across one.
Signed-off-by: Zach Brown <zab@versity.com>
We had a one-off test that was overly specific to staging from a
tmpfile. This renames it to a more generic test where we can add more
tests of O_TMPFILE behavior in general.
Signed-off-by: Zach Brown <zab@versity.com>
Now that we've removed its users we can remove the global saved copy of
the super block from scoutfs_sb_info.
Signed-off-by: Zach Brown <zab@versity.com>
As the server does its work its transactions modify a dirty super block
in memory. This used the global super block in scoutfs_sb_info which
was visible to everything, including the client. Move the dirty super
block over to the private server info so that only the server can see
it.
This is mostly boring storage motion, but we do change the quorum code
to hand the server a static copy of the quorum config to use as it
starts up, before it reads the most recent super block.
Signed-off-by: Zach Brown <zab@versity.com>
Refilling a client's data_avail is the only alloc_move call that
doesn't try to limit the number of blocks that it dirties. If it
doesn't find sufficiently large extents it can exhaust the server's
alloc budget without hitting the target. It'll try to dirty blocks and
return a hard error.
This changes that behaviour to allow returning 0 if it moved any
extents. Other callers can deal with partial progress as they already
limit the blocks they dirty. It will still return ENOSPC if it hasn't
moved anything, just as the current code would.
The result is that a data fill may not hit the target in one pass. It
might take multiple commits to fill the data_avail btree.
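The return-value change amounts to the following, with invented names:
partial progress short of the target is now success, and only moving
nothing at all is ENOSPC:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* result of one alloc_move pass that moved 'moved' of 'target' blocks */
static int alloc_move_result(uint64_t moved, uint64_t target)
{
	if (moved == 0)
		return -ENOSPC;	/* nothing could be moved at all */
	(void)target;		/* falling short of the target is OK */
	return 0;
}
```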
Signed-off-by: Zach Brown <zab@versity.com>
The server's statfs request handler was intending to lock dirty
structures as they were walked to get sums used for statfs fields.
Other callers walk stable structures, though, so the summation calls had
grown iteration over other structures that the server didn't know it had
to lock.
This meant that the server was walking unlocked dirty structures as they
were being modified. The races are very tight, but it can result in
request handling errors that shut down connections and IO errors from
trying to read inconsistent refs as they were modified by the locked
writer.
We've built up infrastructure so the server can now walk stable
structures just like the other callers. It will no longer wander into
dirty blocks so it doesn't need to lock them and it will retry if its
walk of stale data crosses a broken reference.
Signed-off-by: Zach Brown <zab@versity.com>
Transition from manual checking for persistent ESTALE to the shared
helper that we just added. This should not change behavior.
Signed-off-by: Zach Brown <zab@versity.com>
Many readers had small open-coded implementations of the logic to
decide to retry stale reads with different refs, or to decide that
they're persistent and return hard errors. Let's move that into a
small shared helper.
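The core of such a helper might look like the following. This is a
guess at the shape, with invented names, not the actual scoutfs helper:
a stale read is worth retrying only if the refs have changed since the
read, otherwise the staleness is persistent and becomes a hard error:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* decide what a reader should do after an ESTALE read */
static int stale_retry_or_error(uint64_t used_ref, uint64_t current_ref)
{
	if (used_ref != current_ref)
		return -EAGAIN;	/* newer refs exist, retry the read */
	return -EIO;		/* persistent stale ref, hard error */
}
```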
Signed-off-by: Zach Brown <zab@versity.com>
scoutfs_forest_inode_count() assumed it was called with stable refs and
would always translate ESTALE to EIO. Change it so that it passes
ESTALE to the caller, which is responsible for handling it.
The server will use this to retry reading from stable supers that it's
storing in memory.
Signed-off-by: Zach Brown <zab@versity.com>
The server has a mechanism for tracking the last stable roots used by
network RPCs. We expand it a bit to include the entire super so
that we can add users in the server which want the last full stable
super. We can still use the stable super to give out the stable
roots.
Signed-off-by: Zach Brown <zab@versity.com>
The quorum code was using the copy of the super block in the sb info
for its config. With that going away, we make each user reference the
config more carefully. The quorum agent has a copy that it reads on
setup, the client reads a copy on the rare occasion that it tries to
connect, and the server uses its super.
This is about data access isolation and should have no functional effect
other than to cause more super reads.
Signed-off-by: Zach Brown <zab@versity.com>
A few paths throughout the code get the fsid for the current mount by
using the copy of the super block that we store in the scoutfs_sb_info
for the mount. We'd like to remove the super block from the sbi, and
it's cleaner to have a dedicated constant field for the fsid of the
mount, which will not change.
Signed-off-by: Zach Brown <zab@versity.com>
When we truncate away from a partial block we need to zero the tail
that was past i_size and dirty the block so that it's written.
We missed the typical VFS boilerplate of calling block_truncate_page()
from setattr->set_size that does this. We need to be a little careful
to pass our file lock down to get_block, and then queue the inode for
writeback so it's written out with the transaction. This follows the
pattern in .write_end.
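The tail zeroing that block_truncate_page() does for us can be sketched
in user space. The helper and its shape are invented for illustration;
only the arithmetic matters: the bytes from i_size to the end of its
block are zeroed, and the block is then dirtied so it's written out:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 4096

/* zero the tail of the block containing i_size, if it's partial */
static int zero_block_tail(uint8_t *block, uint64_t i_size)
{
	uint64_t off = i_size % BLOCK_SIZE;

	if (off == 0)
		return 0;	/* block-aligned size, nothing to zero */

	memset(block + off, 0, BLOCK_SIZE - off);
	return 1;		/* caller marks the block dirty */
}
```

Without this, stale bytes past i_size would survive in the partial
block and could reappear if the file later grows back over them.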
Signed-off-by: Zach Brown <zab@versity.com>