Extended attribute values can be larger than a reasonable maximum size
for our btree items so we store xattrs in many items. The first pass at
this code used vmalloc to make it relatively easy to work with a
contiguous buffer that was cut up into multiple items.
The problem, of course, is that vmalloc() is expensive. Well, the
problem is that I always forget just how expensive it can be and use it
when I shouldn't. We had loads on high cpu count machines that were
catastrophically cpu bound on all the contentious work that vmalloc does
to maintain a coherent global address space.
This removes the use of vmalloc and only allocates a small buffer for
the first compound item. The later items directly reference regions of
value buffer rather than copying it to and from the large intermediate
vmalloced buffer.
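As a rough userspace sketch of the idea, not the code in this patch: only the first item is assembled in a small buffer and every later item is described as a pointer and length into the caller's value. PART_MAX, HDR_BYTES, struct part_ref, nr_parts() and get_part() are hypothetical names and sizes chosen for illustration, and the assumption that the first item mixes a header with the start of the value is ours.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical limits, chosen only for illustration. */
#define PART_MAX  64	/* max payload bytes in one item */
#define HDR_BYTES 8	/* stand-in for a header stored in the first item */

struct part_ref {
	const unsigned char *data;	/* first_buf or a region of the value */
	size_t len;
};

/* Number of items needed to store the header plus val_len value bytes. */
static size_t nr_parts(size_t val_len)
{
	return (HDR_BYTES + val_len + PART_MAX - 1) / PART_MAX;
}

/*
 * Describe item 'i'.  Only the first item is built in a small buffer
 * because it combines the header with the start of the value.  Later
 * items reference the value buffer directly, so nothing is copied.
 */
static struct part_ref get_part(const unsigned char *val, size_t val_len,
				unsigned char first_buf[PART_MAX], size_t i)
{
	const size_t first_val = PART_MAX - HDR_BYTES;	/* value bytes in item 0 */
	struct part_ref ref;
	size_t off;

	assert(val != NULL && first_buf != NULL);
	assert(i < nr_parts(val_len));

	if (i == 0) {
		size_t n = val_len < first_val ? val_len : first_val;

		memset(first_buf, 0, HDR_BYTES);	/* placeholder header */
		memcpy(first_buf + HDR_BYTES, val, n);
		ref.data = first_buf;
		ref.len = HDR_BYTES + n;
		return ref;
	}

	off = first_val + (i - 1) * PART_MAX;
	ref.data = val + off;
	ref.len = val_len - off < PART_MAX ? val_len - off : PART_MAX;
	return ref;
}
```

The only allocation a caller needs is the PART_MAX sized first_buf, regardless of how large the value is.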
Signed-off-by: Zach Brown <zab@versity.com>
The t_server_nr and t_first_client_nr helpers iterated over all the fs
numbers examining their quorum/is_leader files, but clients don't have a
quorum/ directory. This was causing spurious outputs in tests that were
looking for a server but didn't find one in the first quorum fs number and
made it down into the clients.
Give them a helper that returns 0 for being a leader if the quorum/ dir
doesn't exist.
Signed-off-by: Zach Brown <zab@versity.com>
We were seeing rare test failures where it looked like is_leader wasn't
set for any of the mounts. The test that couldn't find a set is_leader
file had just performed some mounts so we know that a server was up and
processing requests.
The quorum task wasn't updating the status that's shown in sysfs and
debugfs until after the server started up. This opened the race where
the server was able to serve mount requests and have the test run to
find no is_leader file set before the quorum task was able to update the
stats and make its election visible.
This updates the quorum task to make its status visible more often,
typically before it does something that will take a while. The
is_leader will now be visible before the server is started so the test
will always see the file after server starts up and lets mounts finish.
Signed-off-by: Zach Brown <zab@versity.com>
The final iput of an inode can delete items in cluster locked
transactions. It was never safe to call iput within locked
transactions but we never saw the problem. Recent work on inode
deletion raised the issue again.
This makes sure that we always perform iput outside of locked
transactions. The only interesting change is making scoutfs_new_inode()
return the allocated inode on error so that the caller can put the inode
after releasing the transaction.
Signed-off-by: Zach Brown <zab@versity.com>
During forced unmount, commits abort due to errors and the open
transaction is left in a dirty state that is cleaned up by
scoutfs_shutdown_trans(). It cleans all the dirty blocks in the commit
write context with scoutfs_block_writer_forget_all(), but it forgot to
call scoutfs_alloc_prepare_commit() to put the block references held by
the allocator.
This was generating leaked block warnings during testing that used
forced unmount. It wouldn't affect regular operations.
Signed-off-by: Zach Brown <zab@versity.com>
We were seeing a number of problems coming from races that allowed tasks
in a mount to try and concurrently delete an inode's items. We could
see error messages indicating that deletion failed with -ENOENT, we
could see users of inodes behave erratically as inodes were deleted from
under them, and we could see eventual server errors trying to merge
overlapping data extents which were "freed" (added to transaction lists)
multiple times.
This commit addresses the problems in one relatively large patch. While
we could mechanically split up the fixes, they're all interdependent and
splitting them up (bisecting through them) could cause failures that
would be devilishly hard to diagnose.
First we stop allowing multiple cached vfs inodes. This was initially
done to avoid deadlocks between lock invalidation and final inode
deletion. We add a specific lookup that's used by invalidation which
ignores any inodes which are in I_NEW or I_FREEING. Now that iget can
wait on inode flags we call iget5_locked before acquiring the cluster
lock. This ensures that we can only have one cached vfs inode for a
given inode number in evict_inode trying to delete.
Now that we can only have one cached inode, we can rework the omap
tracking to use _set and _clear instead of _inc and _put. This isn't
strictly necessary but is a simplification and lets us issue warnings if
we see that we ever try to set an inode number's bit on behalf of
multiple cached inodes. We also add a _test helper.
Orphan scanning would try to perform deletion by instantiating a cached
inode and then putting it, triggering eviction and final deletion. This
was an attempt to simplify concurrency but ended up causing more
problems. It no longer tries to interact with the inode cache at all and
attempts to safely delete inode items directly. It uses the omap test
to determine that it should skip an already cached inode.
We had attempted to forbid opening inodes by handle if they had an nlink
of 0. Since we allowed multiple cached inodes for an inode number this
was to prevent adding cached inodes that were being deleted. It was
only performing the check on newly allocated inodes, though, so it could
get a reference to the cached inode that the scanner had inserted for
deleting. We're choosing to keep restricting opening by handle to only
linked inodes so we also check existing inodes after they're refreshed.
We're left with a task evicting an inode and the orphan scanner racing
to delete an inode's items. We move the work of determining if it's safe
to delete out of scoutfs_omap_should_delete() and into
try_delete_inode_items() which is called directly from eviction and
scanning. This is mostly code motion but we do make three critical
changes. We get rid of the goofy concurrent deletion detection in
delete_inode_items() and instead use a bit in the lock data to serialize
multiple attempts to delete an inode's items. We no longer assume that
the inode must still be around because we were called from evict and
specifically check that the inode item is still present for deleting.
Finally, we use the omap test to discover that we shouldn't delete an
inode that is locally cached (and would not be included in the omap
response). We do all this under the inode write lock to serialize
between mounts.
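One way to picture the lock data bit, as a userspace sketch under our own assumptions rather than the code in this patch: a test-and-set on a flag word lets exactly one task own deletion at a time. DELETING_BIT, struct lock_data_sketch, try_begin_delete() and end_delete() are hypothetical names, and the sketch has the loser back off where the real code might choose to wait.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define DELETING_BIT 0x1UL	/* hypothetical flag bit in the lock data */

struct lock_data_sketch {
	atomic_ulong flags;
};

/*
 * Returns true if the caller set the bit and now owns deletion.  A
 * caller that finds the bit already set lost the race and must not
 * touch the inode's items.
 */
static bool try_begin_delete(struct lock_data_sketch *ld)
{
	unsigned long old;

	assert(ld != NULL);
	old = atomic_fetch_or(&ld->flags, DELETING_BIT);
	return !(old & DELETING_BIT);
}

/* Clear the bit once the attempt has finished, successful or not. */
static void end_delete(struct lock_data_sketch *ld)
{
	assert(ld != NULL);
	atomic_fetch_and(&ld->flags, ~DELETING_BIT);
}
```

Both eviction and the orphan scanner would bracket their attempt with these two calls, so only one of them walks the items at any moment.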
Signed-off-by: Zach Brown <zab@versity.com>
We're seeing some trouble with very specific race conditions. This
updates the orphan-inodes test to try and force final inode deletion
during eviction, the orphan scan worker, and opening inodes by handle to
all race and hit an inode number at the same time.
Signed-off-by: Zach Brown <zab@versity.com>
The orphan inode test often uses a trick where it runs sleep in the
background with a file as stdin as a means of holding files open. This
can very rarely fail if the background sleep happens to be first
scheduled after the unlink of the file it's reading as stdin. A small
delay gives it a chance to run and open the file before it's unlinked.
It's still possible to lose the race, of course, but so far this has
been good enough.
Signed-off-by: Zach Brown <zab@versity.com>
Add a mount option to set the delay between scanning of the orphan list.
The sysfs file for the option is writable so this option can be set at
run time.
Signed-off-by: Zach Brown <zab@versity.com>
The mount options code is some of the oldest in the tree and is weirdly
split between options.c and super.c. This cleans up the options code,
moves it all to options.c, and reworks it to be more in line with the
modern subsystem convention of storing state in an allocated info
struct.
Rather than putting the parsed options in the super for everyone to
directly reference we put them in the private options info struct and
add a locked read function. This will let us add sysfs files to change
mount options while safely serializing with readers.
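A minimal userspace sketch of the locked read pattern, assuming names that are not in the tree: struct mount_options_sketch, its fields, options_read() and options_set_scan_delay() are all hypothetical, and a tiny atomic_flag spinlock stands in for whatever kernel lock the real code uses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical option fields, named only for this sketch. */
struct mount_options_sketch {
	int quorum_slot_nr;
	unsigned int orphan_scan_delay_ms;
};

/* Private info struct: callers never reference 'opts' directly. */
struct options_info_sketch {
	atomic_flag lock;	/* stand-in for a kernel lock */
	struct mount_options_sketch opts;
};

static void options_lock(struct options_info_sketch *oi)
{
	while (atomic_flag_test_and_set_explicit(&oi->lock,
						 memory_order_acquire))
		;	/* spin until the holder clears the flag */
}

static void options_unlock(struct options_info_sketch *oi)
{
	atomic_flag_clear_explicit(&oi->lock, memory_order_release);
}

/* Readers get a consistent copy taken under the lock. */
static void options_read(struct options_info_sketch *oi,
			 struct mount_options_sketch *ret)
{
	assert(oi != NULL && ret != NULL);
	options_lock(oi);
	*ret = oi->opts;
	options_unlock(oi);
}

/* A writer, such as a sysfs store handler, updates under the same lock. */
static void options_set_scan_delay(struct options_info_sketch *oi,
				   unsigned int ms)
{
	assert(oi != NULL);
	options_lock(oi);
	oi->opts.orphan_scan_delay_ms = ms;
	options_unlock(oi);
}
```

Because readers work on their own copy, a later write can't change a value out from under a caller that is part way through using it.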
All the users of mount options that used to directly reference the
parsed struct now call the read function to get a copy. They're all
small local changes except for quorum which saves a static copy of the
quorum slot number because it references it in so many places and relies
on it not changing.
Finally, we remove the empty debugfs "options" directory.
Signed-off-by: Zach Brown <zab@versity.com>
The inode caller of omap was manually calculating the group and bits,
which isn't fantastic. Export the little helper to calculate it so
the inode caller doesn't have to.
Signed-off-by: Zach Brown <zab@versity.com>
You can almost feel the editing mistake that brought the delay
calculation into the conditional and forgot to remove the initial
calculation at declaration.
Signed-off-by: Zach Brown <zab@versity.com>
We were seeing ABBA deadlocks on the dio_count wait and extent_sem
between fallocate and reads. It turns out that fallocate got lock
ordering wrong.
This brings fallocate in line with the rest of the adherents to the lock
hierarchy. Most importantly, the extent_sem is used after the
dio_count. While we're at it we bring the i_mutex down to just before
the cluster lock for consistency.
Signed-off-by: Zach Brown <zab@versity.com>
The man pages and inline help blurbs for the recently added format
version and quorum config commands incorrectly described the device
arguments which are needed.
Signed-off-by: Zach Brown <zab@versity.com>
The server's log merge complete request handler was considering the
absence of the client's original request as a failure. Unfortunately,
this case is possible if a previous server successfully completed the
client's request but the response was lost because it stopped for
whatever reason.
The failure was being logged as a hard error to the console which was
causing tests to occasionally fail during server failover that hit just
as the log merge completion was being processed.
The error was already being sent to the client as a response; we just need to
silence the message for these expected but rare errors.
We also fix the related case where the server printed the even more
harsh WARN_ON if there was a next original request but it wasn't the one
we expected to find from our requesting client.
Signed-off-by: Zach Brown <zab@versity.com>
The net _cancel_request call hasn't been used or tested in approximately
a bazillion years. Best to get rid of it and have to add and test it
if we think we need it again.
Signed-off-by: Zach Brown <zab@versity.com>
Our open by handle functions didn't care that the inode wasn't
referenced and let tasks open unlinked inodes by number. This
interacted badly with the inode deletion mechanisms which required that
inodes couldn't be cached on other nodes after the transaction which
removed their final reference.
If a task did accidentally open a file by inode while it was being
deleted it could see the inode items in an inconsistent state and return
very confusing errors that look like corruption.
The fix is to give the handle iget callers a flag to tell iget to only
get the inode if it has a positive nlink. If iget sees that the inode
has been unlinked it returns -ENOENT.
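The check itself is small. A sketch of its shape, assuming a flag and a struct that are not the real ones: IGET_LINKED, struct inode_sketch and iget_check_linked() are hypothetical names used only here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define IGET_LINKED (1 << 0)	/* hypothetical flag passed by handle callers */

struct inode_sketch {
	unsigned int nlink;
};

/*
 * Refuse to hand back an unlinked inode when the caller asked for
 * linked inodes only.  Callers that don't pass the flag are unaffected.
 */
static int iget_check_linked(const struct inode_sketch *inode, int flags)
{
	assert(inode != NULL);

	if ((flags & IGET_LINKED) && inode->nlink == 0)
		return -ENOENT;
	return 0;
}
```

Only the open by handle paths would pass the flag, so lookups that legitimately reach unlinked inodes keep working.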
Signed-off-by: Zach Brown <zab@versity.com>
The orphan inodes test needs to test if inode items exist as it
manipulates inodes. It used to open the inode by a handle but we're
fixing that to not allow opening unlinked files. The
get-allocated-inos ioctl tests for the presence of items owned by the
inode regardless of any other vfs state so we can use it to verify what
scoutfs is doing as we work with the vfs inodes.
Signed-off-by: Zach Brown <zab@versity.com>
Add the get-allocated-inos scoutfs command which wraps the
GET_ALLOCATED_INOS ioctl. It'll be used by tests to find items
associated with an inode instead of trying to open the inode by a
constructed handle after it was unlinked.
Signed-off-by: Zach Brown <zab@versity.com>
Add an ioctl that can give some indication of inodes that have inode
items. We're exposing this for tests that verify the handling of open
unlinked inodes.
Signed-off-by: Zach Brown <zab@versity.com>
We're adding an ioctl that wants to build inode item keys so let's
export the private inode key initializer.
Signed-off-by: Zach Brown <zab@versity.com>
This reverts commit 61ad844891.
This fix was trying to ensure that lock recovery response handling
can't run after farewell calls reclaim_rid() by jumping through a bunch
of hoops to tear down locking state as the first farewell request
arrived.
It introduced very slippery use after free during shutdown. It appears
that it was from drain_workqueue() previously being able to stop
chaining work. That's no longer possible when you're trying to drain
two workqueues that can queue work in each other.
We found a much clearer way to solve the problem so we can toss this.
Signed-off-by: Zach Brown <zab@versity.com>
We recently found that the server can send a farewell response and try
to tear down a client's lock state while it was still in lock recovery
with the client. The lock recovery response could add a lock
for the client after farewell's reclaim_rid() had thought the client was
gone forever and tore down its locks.
This left a lock in the lock server that wasn't associated with any
clients and so could never be invalidated. Attempts to acquire
conflicting locks with it would hang forever, which we saw as hangs in
testing with lots of unmounting.
We tried to fix it by serializing incoming request handling and
forcefully clobbering the client's lock state as we first got
the farewell request. That went very badly.
This takes another approach of trying to explicitly wait for lock
recovery to finish before sending farewell responses. It's more in
line with the overall pattern of having the client be up and functional
until farewell tears it down.
With this in place we can revert the other attempted fix that was
causing so many problems.
Signed-off-by: Zach Brown <zab@versity.com>
The local-force-unmount fenced fencing script only works when all the
mounts are on the local host and it uses force unmount. It is only
used in our specific local testing scripts. Packaging it as an example
led people to believe that it could be used to cobble together a
multi-host testing network, however temporary.
Move it from being in utils and packaged to being private to our tests so
that it doesn't present an attractive nuisance.
Signed-off-by: Zach Brown <zab@versity.com>
scoutfs_recov_shutdown() tried to move the recovery tracking structs off
the shared list and into a private list so they could be freed. But
then it went and walked the now empty shared list to free entries. It
should walk the private list.
This would leak a small amount of memory in the rare cases where the
server was shut down while recovery was still pending.
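The shape of the bug and the fix, as a userspace sketch with a plain singly linked list instead of the kernel list helpers: struct pending_sketch, struct recov_info_sketch, recov_add() and recov_shutdown_sketch() are hypothetical names, and the locking around the shared list is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct pending_sketch {
	struct pending_sketch *next;
};

struct recov_info_sketch {
	struct pending_sketch *shared;	/* protected by a lock in real code */
};

/* Track one more pending entry on the shared list. */
static int recov_add(struct recov_info_sketch *ri)
{
	struct pending_sketch *p = malloc(sizeof(*p));

	assert(ri != NULL);
	if (!p)
		return -1;
	p->next = ri->shared;
	ri->shared = p;
	return 0;
}

/* Free all pending entries, returning how many were freed. */
static size_t recov_shutdown_sketch(struct recov_info_sketch *ri)
{
	struct pending_sketch *private_list;
	struct pending_sketch *next;
	size_t freed = 0;

	assert(ri != NULL);

	/* Detach everything from the shared list. */
	private_list = ri->shared;
	ri->shared = NULL;

	/*
	 * Walk the private list.  Walking ri->shared here, as the buggy
	 * version effectively did, would find it empty and leak it all.
	 */
	while (private_list) {
		next = private_list->next;
		free(private_list);
		private_list = next;
		freed++;
	}
	return freed;
}
```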
Signed-off-by: Zach Brown <zab@versity.com>
Back when we added the get/commit transaction sequence numbers to the
log_trees we forgot to add them to the scoutfs print output.
Signed-off-by: Zach Brown <zab@versity.com>
The server's little set_shutting_down() helper accidentally used a read
barrier instead of a write barrier.
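For readers who don't juggle barriers daily, here is a userspace C11 analogue of the pairing, not the helper itself: the side that sets the flag needs write (release) ordering and the side that tests it needs read (acquire) ordering. struct server_sketch, shutdown_err, set_shutting_down_sketch() and is_shutting_down_sketch() are hypothetical, and the idea that an error value is published along with the flag is purely our assumption for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct server_sketch {
	int shutdown_err;		/* hypothetical data published with the flag */
	atomic_bool shutting_down;
};

static void set_shutting_down_sketch(struct server_sketch *s, int err)
{
	assert(s != NULL);
	s->shutdown_err = err;
	/* Write side: order the store above before the flag store. */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&s->shutting_down, true, memory_order_relaxed);
}

static bool is_shutting_down_sketch(struct server_sketch *s, int *err)
{
	assert(s != NULL && err != NULL);
	if (!atomic_load_explicit(&s->shutting_down, memory_order_relaxed))
		return false;
	/* Read side: order the flag load before the load below. */
	atomic_thread_fence(memory_order_acquire);
	*err = s->shutdown_err;
	return true;
}
```

Using the read-side fence in the setter would leave its stores free to be reordered, which is the kind of mistake this commit corrects.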
Signed-off-by: Zach Brown <zab@versity.com>