Commit Graph

1605 Commits

Author SHA1 Message Date
Zach Brown
fb2ff753ad Merge pull request #83 from versity/zab/heartbeat_during_fencing
Send quorum heartbeats while fencing
2022-04-01 09:12:41 -07:00
Zach Brown
bb3db7e272 Send quorum heartbeats while fencing
Quorum members will try to elect a new leader when they don't receive
heartbeats from the currently elected leader.   This timeout is short to
encourage restoring service promptly.

Heartbeats are sent from the quorum worker thread and are delayed while
it synchronously starts up the server, which includes fencing previous
servers.  If fence requests take too long then heartbeats will be
delayed long enough for remaining quorum members to elect a new leader
while the recently elected server is still busy fencing.

To fix this we decouple server startup from the quorum main thread.
Server starting and stopping becomes asynchronous so the quorum thread
is able to send heartbeats while the server work is off starting up and
fencing.

The server used to call into quorum to clear a flag as it exited.   We
remove that mechanism and have the server maintain a running status that
quorum can query.

We add some state to the quorum work to track the asynchronous state of
the server.   This lets the quorum protocol change roles immediately as
needed while remembering that there is a server running that needs to be
acted on.

The server used to also call into quorum to update quorum blocks.   This
is a read-modify-write operation that has to be serialized.  Now that we
have both the server starting up and the quorum work running they both
can't perform these read-modify-write cycles.  Instead we have the
quorum work own all the block updates and it queries the server status
to determine when it should update the quorum block to indicate that the
server has fenced or shut down.

Signed-off-by: Zach Brown <zab@versity.com>
2022-03-31 10:29:43 -07:00
Zach Brown
c94b072925 Merge pull request #81 from versity/zab/fenced_test
Zab/fenced test
2022-03-29 09:05:09 -07:00
Zach Brown
26ae9c6e04 Verify local unmount testing fence script
The fence script we use for our single node multi-mount tests only knows
how to fence by using forced unmount to destroy a mount.  As of now, the
tests only generate failing nodes that need to be fenced by using forced
unmount as well.  This results in the awkward situation where the
testing fence script doesn't have anything to do because the mount is
already gone.

When the test fence script has nothing to do we might not notice if it
isn't run.  This adds explicit verification to the fencing tests that
the script was really run.  It adds per-invocation logging to the fence
script and the test makes sure that it was run.

While we're at it, we take the opportunity to tidy up some of the
scripting around this.  We use a sysfs file with the data device
major:minor numbers so that the fencing script can find and unmount
mounts without having to ask them for their rid.  They may not be
operational.

Signed-off-by: Zach Brown <zab@versity.com>
2022-03-28 14:52:08 -07:00
Zach Brown
c8d7221ec5 Show data device numbers in sysfs file
It can be handy to associate mounts with their sysfs directory by their
data device number.

Signed-off-by: Zach Brown <zab@versity.com>
2022-03-25 14:43:25 -07:00
Zach Brown
7fd03dc311 Merge pull request #80 from versity/zab/avoid_xattr_vmalloc
Don't use vmalloc in get/set xattr
2022-03-22 12:00:51 -07:00
Zach Brown
4e8a088cc5 Don't use vmalloc in get/set xattr
Extended attribute values can be larger than a reasonable maximum size
for our btree items so we store xattrs in many items.  The first pass at
this code used vmalloc to make it relatively easy to work with a
contiguous buffer that was cut up into multiple items.

The problem, of course, is that vmalloc() is expensive.  Well, the
problem is that I always forget just how expensive it can be and use it
when I shouldn't.  We had loads on high cpu count machines that were
catastrophically cpu bound on all the contentious work that vmalloc does
to maintain a coherent global address space.

This removes the use of vmalloc and only allocates a small buffer for
the first compound item.  The later items directly reference regions of
value buffer rather than copying it to and from the large intermediate
vmalloced buffer.

Signed-off-by: Zach Brown <zab@versity.com>
2022-03-21 21:44:11 -07:00
Zach Brown
9c751c1197 Merge pull request #78 from versity/zab/quorum_leader_visibility
Zab/quorum leader visibility
2022-03-16 09:16:57 -07:00
Zach Brown
875583b7ef Add t_fs_is_leader test helper
The t_server_nr and t_first_client_nr helpers iterated over all the fs
numbers examining their quorum/is_leader files, but clients don't have a
quorum/ directory.  This was causing spurious outputs in tests that were
looking for servers but didn't find it in the first quorum fs number and
made it down into the clients.

Give them a helper that returns 0 for being a leader if the quorum/ dir
doesn't exist.

Signed-off-by: Zach Brown <zab@versity.com>
2022-03-15 16:09:55 -07:00
Zach Brown
38e5aa77c4 Update quorum status files more frequently
We were seeing rare test failures where it looked like is_leader wasn't
set for any of the mounts.   The test that couldn't find a set is_leader
file had just perfomed some mounts so we know that a server was up and
processing requests.

The quorum task wasn't updating the status that's shown in sysfs and
debugfs until after the server started up.  This opened the race where
the server was able to serve mount requests and have the test run to
find no is_leader file set before the quorum task was able to update the
stats and make its election visible.

This updates the quorum task to make its status visible more often,
typically before it does something that will take a while.  The
is_leader will now be visible before the server is started so the test
will always see the file after server starts up and lets mounts finish.

Signed-off-by: Zach Brown <zab@versity.com>
2022-03-15 15:07:57 -07:00
Zach Brown
57a1d75e52 Merge pull request #77 from versity/zab/v1_2_release
Zab/v1 2 release
2022-03-14 18:10:16 -07:00
Zach Brown
51d19d797f Start v1.3-rc release notes
Create the 1.3 section in the release notes for commits to fill.

Signed-off-by: Zach Brown <zab@versity.com>
2022-03-14 17:15:24 -07:00
Zach Brown
029a684c25 v1.2 Release
Cut the release notes for the 1.2 release.

Signed-off-by: Zach Brown <zab@versity.com>
v1.2
2022-03-14 17:15:05 -07:00
Zach Brown
f2679d9598 Merge pull request #76 from versity/zab/inode_deletion_fixes
Zab/inode deletion fixes
2022-03-11 16:23:21 -08:00
Zach Brown
bddca171ee Call iput outside cluster locked transactions
The final iput of an inode can delete items in cluster locked
transactions.   It was never safe to call iput within locked
transactions but we never saw the problem.   Recent work on inode
deletion raised the issue again.

This makes sure that we always perform iput outside of locked
transactions.  The only interesting change is making scoutfs_new_inode()
return the allocated inode on error so that the caller can put the inode
after releasing the transaction.

Signed-off-by: Zach Brown <zab@versity.com>
2022-03-11 15:29:20 -08:00
Zach Brown
18171b8543 Put allocator block references on forced unmount
During forced unmount commits abort due to errors and the open
transaction is left in a dirty state that is cleaned up by
scoutfs_shutdown_trans().   It cleans all the dirty blocks in the commit
write context with scoutfs_block_writer_forget_all(), but it forgot to
call scoutfs_alloc_prepare_commit() to put the block references held by
the allocator.

This was generating leaked block warnings during testing that used
forced unmount.  It wouldn't affect regular operations.

Signed-off-by: Zach Brown <zab@versity.com>
2022-03-11 15:29:20 -08:00
Zach Brown
d846eec5e8 Harden final inode deletion
We were seeing a number of problems coming from races that allowed tasks
in a mount to try and concurrently delete an inode's items.  We could
see error messages indicating that deletion failed with -ENOENT, we
could see users of inodes behave erratically as inodes were deleted from
under them, and we could see eventual server errors trying to merge
overlapping data extents which were "freed" (add to transaction lists)
multiple times.

This commit addresses the problems in one relatively large patch.  While
we could mechanically split up the fixes, they're all interdependent and
splitting them up (bisecting through them) could cause failures that
would be devilishly hard to diagnose.

First we stop allowing multiple cached vfs inodes.  This was initially
done to avoid deadlocks between lock invalidation and final inode
deletion.  We add a specific lookup that's used by invalidation which
ignores any inodes which are in I_NEW or I_FREEING.  Now that iget can
wait on inode flags we call iget5_locked before acquiring the cluster
lock.  This ensures that we can only have one cached vfs inode for a
given inode number in evict_inode trying to delete.

Now that we can only have one cached inode, we can rework the omap
tracking to use _set and _clear instead of _inc and _put.  This isn't
strictly necessary but is a simplification and lets us issue warnings if
we see that we ever try to set an inode numbers bit on behalf of
multiple cached inodes.  We also add a _test helper.

Orphan scanning would try to perform deletion by instantiating a cached
inode and then putting it, triggering eviction and final deletion.  This
was an attempt to simplify concurrency but ended up causing more
problems.  It no longer tries to interact with inode cache at all and
attempts to safely delete inode items directly.  It uses the omap test
to determine that it should skip an already cached inode.

We had attempted to forbid opening inodes by handle if they had an nlink
of 0.  Since we allowed multiple cached inodes for an inode number this
was to prevent adding cached inodes that were being deleted.  It was
only performing the check on newly allocated inodes, though, so it could
get a reference to the cached inode that the scanner had inserted for
deleting.  We're chosing to keep restricting opening by handle to only
linked inodes so we also check existing inodes after they're refreshed.

We're left with a task evicting an inode and the orphan scanner racing
to delete an inode's items.  We move the work of determining if its safe
to delete out of scoutfs_omap_should_delete() and into
try_delete_inode_items() which is called directly from eviction and
scanning.  This is mostly code motion but we do make three critical
changes.  We get rid of the goofy concurrent deletion detection in
delete_inode_items() and instead use a bit in the lock data to serialize
multiple attempts to delete an inode's items.  We no longer assume that
the inode must still be around because we were called from evict and
specifically check that inode item is still present for deleting.
Finally, we use the omap test to discover that we shouldn't delete an
inode that is locally cached (and would be not be included to the omap
response).  We do all this under the inode write lock to serialize
between mounts.

Signed-off-by: Zach Brown <zab@versity.com>
2022-03-11 15:28:58 -08:00
Zach Brown
e2c90339c5 Add orphan-inodes to race final deletion
We're seeing some trouble with very specific race conditions.   This
updates the orphan-inodes test to try and force final inode deletion
during eviction, the orphan scan worker, and opening inodes by handle to
all race and hit an inode number at the same time.

Signed-off-by: Zach Brown <zab@versity.com>
2022-03-11 14:30:17 -08:00
Zach Brown
4a0b14a4f2 Wait for stdin open in orphan-inodes test
The orphan inode test often uses a trick where it runs sleep in the
abckground with a file as stdin as a means of holding files open.  This
can very rarely fail if the background sleep happens to be first
schedled after the unlink of the file it's reading as stdin.  A small
delay gives it a chance to run and open the file before its unlinked.
It's still possible to lose the race, of course, but so far this has
been good enough.

Signed-off-by: Zach Brown <zab@versity.com>
2022-03-10 11:43:11 -08:00
Zach Brown
90518a0fbd Add handle_fsetxattr test utility
Add a quick little binary that spins opening an inode by a handle and
calling fsetxattr.

Signed-off-by: Zach Brown <zab@versity.com>
2022-03-10 11:43:11 -08:00
Zach Brown
cd23cc61ca Add mount option test bash functions
Add some test functions which work with mount options.

Signed-off-by: Zach Brown <zab@versity.com>
2022-03-10 11:43:11 -08:00
Zach Brown
a67ea30bb7 Add orphan_scan_delay_ms mount option
Add a mount option to set the delay betwen scanning of the orphan list.
The sysfs file for the option is writable so this option can be set at
run time.

Signed-off-by: Zach Brown <zab@versity.com>
2022-03-10 11:43:11 -08:00
Zach Brown
f3b7c683f0 Fix quorum_server_nr syfs file name typo
The quorum_slot_nr mount option was being mistakenly shown in a sysfs
file named quorum_server_nr.

Signed-off-by: Zach Brown <zab@versity.com>
2022-03-09 11:12:36 -08:00
Zach Brown
8decc54467 Clean up mount option handling
The mount options code is some of the oldest in the tree and is weirdly
split between options.c and super.c.  This cleans up the options code,
moves it all to options.c, and reworks it to be more in line with the
modern subsystem convenction of storing state in an allocated info
struct.

Rather than putting the parsed options in the super for everyone to
directly reference we put them in the private options info struct and
add a locked read function.  This will let us add sysfs files to change
mount options while safely serializing with readers.

All the users of mount options that used to directly reference the
parsed struct now call the read function to get a copy.  They're all
small local changes except for quorum which saves a static copy of the
quorum slot number because it references it in so many places and relies
on it not changing.

Finally, we remove the empty debugfs "options" directory.

Signed-off-by: Zach Brown <zab@versity.com>
2022-03-09 11:12:36 -08:00
Zach Brown
5adcf7677f Export omap group calc helper
The inode caller of omap was manually calculating the group and bits,
which isn't fantastic.   Export the little helper to calculate it so
the inode caller doesn't have to.

Signed-off-by: Zach Brown <zab@versity.com>
2022-03-09 11:12:36 -08:00
Zach Brown
07f03d499f Remove duplicate orphan work delay calculation
You can almost feel the editing mistake that brought the delay
calculation into the conditional and forgot to remove the initial
calculation at declaration.

Signed-off-by: Zach Brown <zab@versity.com>
2022-03-09 11:12:23 -08:00
Zach Brown
c5068efef0 Merge pull request #75 from versity/zab/bad_mount_option
Zab/bad mount option
2022-02-28 09:07:15 -08:00
Zach Brown
66678dc63b Fail mounts with unknown options
Weirdly, the mout option parser silently returned when it found mount
options that weren't recognized.

Signed-off-by: Zach Brown <zab@versity.com>
2022-02-21 10:44:56 -08:00
Zach Brown
b2834d3c28 Add basic bad mount testing
Add some tests which exercise the kinds of reasonable mistakes that
people will make in the field.

Signed-off-by: Zach Brown <zab@versity.com>
2022-02-21 10:44:38 -08:00
Zach Brown
cff50bec6b Merge pull request #74 from versity/zab/fallocate_read_inversion
Zab/fallocate read inversion
2022-02-21 09:58:49 -08:00
Zach Brown
4d6350b3b0 Fix lock ordering in fallocate
We were seeing ABBA deadlocks on the dio_count wait and extent_sem
between fallocate and reads.  It turns out that fallocate got lock
ordering wrong.

This brings fallocate in line with the rest of the adherents to the lock
heirarchy.   Most importantly, the extent_sem is used after the
dio_count.   While we're at it we bring the i_mutex down to just before
the cluster lock for consistency.

Signed-off-by: Zach Brown <zab@versity.com>
2022-02-17 14:48:13 -08:00
Zach Brown
48966b42bb Add simple fallocate test
Signed-off-by: Zach Brown <zab@versity.com>
2022-02-17 11:20:08 -08:00
Zach Brown
97cb8ad50d Merge pull request #72 from versity/zab/quick_man_fix
Clean quorum and format change command docs
2022-02-09 09:22:50 -08:00
Zach Brown
ae08a797ae Clean quorum and format change command docs
The man pages and inline help blurbs for the recently added format
version and quorum config commands incorrectly described the device
arguments which are needed.

Signed-off-by: Zach Brown <zab@versity.com>
2022-02-08 11:23:27 -08:00
Zach Brown
2634fadfcb Merge pull request #71 from versity/zab/v1_1_release
Zab/v1 1 release
v1.1
2022-02-04 11:35:39 -08:00
Zach Brown
0c1f19556d Prepare v1.2-rc release
Add the v1.2-rc section to the release notes so that we can add entries
with commits as needed.

Signed-off-by: Zach Brown <zab@versity.com>
2022-02-04 11:32:53 -08:00
Zach Brown
19caae3da8 v1.1 Release
Finish off the release notes for the 1.1 release.

Signed-off-by: Zach Brown <zab@versity.com>
2022-02-04 11:32:37 -08:00
Zach Brown
2989afbf46 Merge pull request #70 from versity/zab/silence_duplicate_log_merge_complete_error
Silence resent log merge commit error
2022-02-02 14:35:01 -08:00
Zach Brown
730a84af92 Silence resent log merge commit error
The server's log merge complete request handler was considering the
absence of the client's original request as a failure.  Unfortunately,
this case is possible if a previous server successfully completed the
client's request but the response was lost because it stopped for
whatever reason.

The failure was being logged as a hard error to the console which was
causing tests to occasionally fail during server failover that hit just
as the log merge completion was being processed.

The error was being sent to the client as a response, we just need to
silence the message for these expected but rare errors.

We also fix the related case where the server printed the even more
harsh WARN_ON if there was a next original request but it wasn't the one
we expected to find from our requesting client.

Signed-off-by: Zach Brown <zab@versity.com>
2022-02-02 11:26:36 -08:00
Zach Brown
5b77133c3b Merge pull request #68 from versity/zab/collection_of_fixes
Zab/collection of fixes
2022-01-24 11:22:41 -08:00
Zach Brown
329ac0347d Remove unused scoutfs_net_cancel_request()
The net _cancel_request call hasn't been used or tested in approximately
a bazillion years.   Best to get rid of it and have to add and test it
if we think we need it again.

Signed-off-by: Zach Brown <zab@versity.com>
2022-01-24 09:40:08 -08:00
Zach Brown
15d7eec1f9 Disallow openening unlinked files by handle
Our open by handle functions didn't care that the inode wasn't
referenced and let tasks open unlinked inodes by number.  This
interacted badly with the inode deletion mechanisms which required that
inodes couldn't be cached on other nodes after the transaction which
removed their final reference.

If a task did accidentally open a file by inode while it was being
deleted it could see the inode items in an inconsistent state and return
very confusing errors that look like corruption.

The fix is to give the handle iget callers a flag to tell iget to only
get the inode if it has a positive nlink.   If iget sees that the inode
has been unlinked it returns enoent.

Signed-off-by: Zach Brown <zab@versity.com>
2022-01-24 09:40:08 -08:00
Zach Brown
cff17a4cae Remove unused flags scoutfs_inode_refresh arg
The flags argument to scoutfs_inode_refresh wasn't being used.

Signed-off-by: Zach Brown <zab@versity.com>
2022-01-24 09:40:08 -08:00
Zach Brown
9fa2c6af89 Use get-allocated-inos in orphan-inodes test
The orphan inodes test needs to test if inode items exist as it
manipulates inodes.  It used to open the inode by a handle but we're
fixing that to not allow opening unlinked files.   The
get-allocated-inos ioctl tests for the presence of items owned by the
inode regardless of any other vfs state so we can use it to verify what
scoutfs is doing as we work with the vfs inodes.

Signed-off-by: Zach Brown <zab@versity.com>
2022-01-24 09:40:08 -08:00
Zach Brown
e067961714 Add get-allocated-inos scoutfs command
Add the get-allocated-inos scoutfs command which wraps the
GET_ALLOCATED_INOS ioctl.   It'll be used by tests to find items
associated with an inode instead of trying to open the inode by a
constructed handle after it was unlinked.

Signed-off-by: Zach Brown <zab@versity.com>
2022-01-24 09:40:08 -08:00
Zach Brown
7a96e03148 Add get_allocated_inos ioctl
Add an ioctl that can give some indication of inodes that have inode
items.   We're exposing this for tests that verify the handling of open
unlinked inodes.

Signed-off-by: Zach Brown <zab@versity.com>
2022-01-24 09:40:08 -08:00
Zach Brown
e9b3cc873a Export scoutfs_inode_init_key
We're adding an ioctl that wants to build inode item keys so let's
export the private inode key initializer.

Signed-off-by: Zach Brown <zab@versity.com>
2022-01-24 09:40:08 -08:00
Zach Brown
5f2259c48f Revert "Fix client/server race btwn lock recov and farewell"
This reverts commit 61ad844891.

This fix was trying to ensure that lock recovery response handling
can't run after farewell calls reclaim_rid() by jumping through a bunch
of hoops to tear down locking state as the first farewell request
arrived.

It introduced very slippery use after free during shutdown.  It appears
that it was from drain_workqueue() previously being able to stop
chaining work.   That's no longer possible when you're trying to drain
two workqueues that can queue work in each other.

We found a much clearer way to solve the problem so we can toss this.

Signed-off-by: Zach Brown <zab@versity.com>
2022-01-24 09:40:08 -08:00
Zach Brown
e14912974d Wait for lock recovery before sending farewell
We recently found that the server can send a farewell response and try
to tear down a client's lock state while it was still in lock recovery
with the client.   The lock recovery response could add a lock
for the client after farell's reclaim_rid() had thought the client was
gone forever and tore down its locks.

This left a lock in the lock server that wasn't associated with any
clients and so could never be invalidated.   Attempts to acquire
conflicting locks with it would hang forever, which we saw as hangs in
testing with lots of unmounting.

We tried to fix it by serializing incoming request handling and
forcefully clobbering the client's lock state as we first got
the farewell request.   That went very badly.

This takes another approach of trying to explicitly wait for lock
recovery to finish before sending farewell responses.   It's more in
line with the overall pattern of having the client be up and functional
until farewell tears it down.

With this in place we can revert the other attempted fix that was
causing so many problems.

Signed-off-by: Zach Brown <zab@versity.com>
2022-01-24 09:39:51 -08:00
Zach Brown
813ce24d79 Move local-force-unmount test script into tests/
The local-force-unmount fenced fencing script only works when all the
mounts are on the local host and it uses force unmount.   It is only
used in our specific local testing scripts.  Packaging it as an example
lead people to believe that it could be used to cobble together a
multi-host testing network, however temporary.

Move it from being in utils and packged to being private to our tests so
that it doesn't present an attractive nuisance.

Signed-off-by: Zach Brown <zab@versity.com>
2022-01-19 11:33:34 -08:00