try_delete_inode_items() is responsible for making sure that it's safe
to delete an inode's persistent items. One of the things it has to
check is that there isn't another deletion attempt on the inode in this
mount. It sets a bit in lock data while it's working and backs off if
the bit is already set.
Unfortunately it was always clearing this bit as it exited, regardless
of whether it set it or not. This would let the next attempt perform
the deletion again before the working task had finished. This was often
not a problem because background orphan scanning is the only source of
regular concurrent deletion attempts.
But it's a big problem if a deletion attempt takes a very long time. It
gives enough time for an orphan scan attempt to clear the bit, then try
again and clobber whoever is performing the very slow deletion.
I hit this in a test that built files with an absurd number of
fragmented extents. The second concurrent orphan attempt was able to
proceed with deletion and performed a bunch of duplicate data extent
frees and caused corruption.
The fix is to only clear the bit if we set it. Now all concurrent
attempts will back off until the first task is done.
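The pattern can be sketched in userspace; the flag word and names below are illustrative stand-ins for the real inode lock data, not the scoutfs implementation:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

static unsigned long lock_data;		/* stands in for inode lock data */
#define INO_DELETING 0x1UL

static bool try_set_deleting(void)
{
	if (lock_data & INO_DELETING)
		return false;
	lock_data |= INO_DELETING;
	return true;
}

static int try_delete_inode_items(void)
{
	if (!try_set_deleting())
		return -EAGAIN;	/* another deletion in flight, back off */

	/* ... delete the inode's persistent items ... */

	/* The fix: only the task that set the bit clears it.  Clearing
	 * unconditionally on exit let the next attempt start deleting
	 * while the first task was still working. */
	lock_data &= ~INO_DELETING;
	return 0;
}
```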
Signed-off-by: Zach Brown <zab@versity.com>
Add a test which gives the server a transaction with a free list block
that contains blknos that each dirty an individual btree block in the
global data free extent btree.
Signed-off-by: Zach Brown <zab@versity.com>
Recently scoutfs_alloc_move() was changed to try and limit the amount of
metadata blocks it could allocate or free. The intent was to stop
concurrent holders of a transaction from fully consuming the available
allocator for the transaction.
The limiting logic was a bit off. It stopped when the allocator had the
caller's limit remaining, not when it had consumed the caller's limit.
This is overly permissive and could still allow concurrent callers to
consume the allocator. It was also triggering warning messages when a
call consumed more than its allowed budget while holding a transaction.
Unfortunately, we don't have per-caller tracking of allocator resource
consumption. The best we can do is sample the allocators as we start
and return if they drop by the caller's limit. This is overly
conservative in that it attributes any consumption by concurrent
callers to all callers.
This isn't perfect but it makes the failure case less likely and the
impact shouldn't be significant. We don't often have a lot of
concurrency and the limits are larger than callers will typically
consume.
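The difference between the buggy check and the sampling fix can be shown with a small userspace model; the struct and field names are illustrative, not the real scoutfs types:

```c
#include <assert.h>
#include <stdbool.h>

struct alloc {
	long avail;	/* free metadata blocks in the allocator */
};

struct mover {
	long start_avail;	/* sampled when the call starts */
	long budget;		/* caller's consumption limit */
};

static void mover_start(struct mover *m, struct alloc *a, long budget)
{
	m->start_avail = a->avail;
	m->budget = budget;
}

/* The buggy check stopped when a->avail dropped *to* the budget, i.e.
 * when that many blocks remained.  The fix stops once consumption
 * since the sample reaches the budget. */
static bool mover_over_budget(struct mover *m, struct alloc *a)
{
	return (m->start_avail - a->avail) >= m->budget;
}
```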
Signed-off-by: Zach Brown <zab@versity.com>
Add scoutfs_alloc_meta_low_since() to test if the metadata avail or
freed resources have been used by a given amount since a previous
snapshot.
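A minimal sketch of what such a helper could look like; the signature, struct, and counters here are assumptions for illustration:

```c
#include <assert.h>
#include <stdbool.h>

struct meta_sample {
	long avail;	/* available metadata blocks */
	long freed;	/* free slots left in the freed block list */
};

/* True if either resource has dropped by at least @amount since
 * @since was sampled. */
static bool meta_low_since(const struct meta_sample *since,
			   const struct meta_sample *now, long amount)
{
	return (since->avail - now->avail) >= amount ||
	       (since->freed - now->freed) >= amount;
}
```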
Signed-off-by: Zach Brown <zab@versity.com>
As _get_log_trees() in the server prepares the log_trees item for the
client's commit, it moves all the freed data extents from the log_trees
item into core data extent allocator btree items. If the freed blocks
are very fragmented then it can exceed a commit's metadata allocation
budget trying to dirty blocks in the free data extent btree.
The fix is to move the freed data extents in multiple commits. First we
move a limited number in the main commit that does all the rest of the
work preparing the commit. Then we try to move the remaining freed
extents in multiple additional commits.
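The multi-commit approach reduces to a simple drain loop; the batch limit and counters below are an illustrative userspace model, not the server code:

```c
#include <assert.h>

#define BATCH_LIMIT 16	/* illustrative per-commit extent budget */

static long extents;	/* freed data extents left in the log_trees item */
static long commits;	/* commits performed */

/* Move a bounded batch of freed extents into the core allocator
 * items, commit, and repeat until the source is drained. */
static void move_freed_extents(void)
{
	while (extents > 0) {
		long batch = extents < BATCH_LIMIT ? extents : BATCH_LIMIT;

		extents -= batch;	/* dirty core allocator btree items */
		commits++;		/* apply this commit before more */
	}
}
```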
Signed-off-by: Zach Brown <zab@versity.com>
Callers who send to specific client connections can get -ENOTCONN if
their client has gone away. We forgot to free the send tracking struct
in that case.
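The fixed error path looks roughly like this userspace sketch; the struct and leak counter are stand-ins for the real send tracking allocation:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

struct send_track { int id; };

static long live_tracks;	/* outstanding allocations, for the sketch */

static int send_to_client(bool connected, struct send_track *st)
{
	if (!connected) {
		/* The fix: free the tracking struct instead of
		 * leaking it when the client has gone away. */
		free(st);
		live_tracks--;
		return -ENOTCONN;
	}
	/* ... queue st on the connection's send list ... */
	return 0;
}
```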
Signed-off-by: Zach Brown <zab@versity.com>
The omap code keeps track of rids that are connected to the server. It
only freed the tracked rids as the server told it that rids were being
removed. But that removal only happened as clients were evicted. If
the server shut down it'd leave the old rid entries around. They'd be
leaked as the mount was unmounted and could linger and create duplicate
entries if the server started back up and the same clients reconnected.
The fix is to free the tracking rids as the server shuts down. They'll
be rebuilt as clients reconnect if the server restarts.
Signed-off-by: Zach Brown <zab@versity.com>
If we return an error from .fill_super without having set sb->s_root
then the vfs won't call our put_super. Our fill_super is careful to
call put_super so that it can tear down partial state, but we weren't
doing this with a few very early errors in fill_super. This tripped
leak detection when we weren't freeing the sbi when returning errors
from bad option parsing.
Signed-off-by: Zach Brown <zab@versity.com>
Clients don't use the net conn info and specified that it has 0 size.
The net layer would try and allocate a zero size region which returns
the magic ZERO_SIZE_PTR, which it would then later try and free. While
that works, it's a little goofy. We can avoid the allocation when the
size is 0. The pointer will remain null which kfree also accepts.
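The shape of the fix, sketched with calloc/free standing in for the kernel allocators:

```c
#include <assert.h>
#include <stdlib.h>

/* Skip the allocation when the connection info size is 0 and rely on
 * free(NULL) being a no-op, as kfree(NULL) is in the kernel. */
static void *alloc_info(size_t size)
{
	if (size == 0)
		return NULL;	/* no ZERO_SIZE_PTR to carry around */
	return calloc(1, size);
}
```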
Signed-off-by: Zach Brown <zab@versity.com>
Add an option to skip printing structures that are likely to be so huge
that the print output becomes completely unwieldy on large systems.
Signed-off-by: Zach Brown <zab@versity.com>
Like a lot of places in the server, get_log_trees() doesn't have the
tools it needs to safely unwind partial changes in the face of an error.
In the worst case, it can have moved extents from the mount's log_trees
item into the server's main data allocator. The dirty data allocator
reference is in the super block so it can be written later. The dirty
log_trees reference is on stack, though, so it will be thrown away on
error. This ends up duplicating extents in the persistent structures
because they're written in the new dirty allocator but still remain in
the unwritten source log_trees allocator.
This change makes it harder for that to happen. It dirties the
log_trees item and always tries to update it so that the dirty blocks are
consistent if they're later written out. If we do get an error updating
the item we throw an assertion. It's not great, but it matches other
similar circumstances in other parts of the server.
Signed-off-by: Zach Brown <zab@versity.com>
We were setting sk_allocation on the quorum UDP sockets to prevent
entering reclaim while using sockets but we missed setting it on the
regular messaging TCP sockets. This could create deadlocks where the
sending socket could enter scoutfs reclaim and wait for server messages
while holding the socket lock, preventing the receive thread from
receiving messages while it blocked on the socket lock.
The fix is to prevent entering the FS to reclaim during socket
allocations.
Signed-off-by: Zach Brown <zab@versity.com>
Client log_trees allocator btrees can build up quite a number of
extents. In the right circumstances moving fragmented extents can
require dirtying a large number of paths to leaf blocks in the core allocator
btrees. It might not be possible to dirty all the blocks necessary to
move all the extents in one commit.
This reworks the extent motion so that it can be performed in multiple
commits if the meta allocator for the commit runs out while it is moving
extents. It's a minimal fix with as little disruption to the ordering
of commits and locking as possible. It simply bubbles up an error when
the allocators run out and retries functions that can already be retried
in other circumstances.
Signed-off-by: Zach Brown <zab@versity.com>
We're seeing allocator motion during get_log_trees dirty quite a lot of
blocks, which makes sense. Let's continue to up the budget. If we
still need significantly larger budgets we'll want to look into capping
the dirty block use of the allocator extent movers which will mean
changing callers to support partial progress.
Signed-off-by: Zach Brown <zab@versity.com>
When a new server starts up it rebuilds its view of all the granted
locks with lock recovery messages. Clients give the server their
granted lock modes which the server then uses to process all the resent
lock requests from clients.
The lock invalidation work in the client is responsible for
transitioning an old granted mode to a new invalidated mode from an
unsolicited message from the server. It has to process any client state
that'd be incompatible with the new mode (write dirty data, drop
caches). While it is doing this work, as an implementation short cut,
it sets the granted lock mode to the new mode so that users that are
compatible with the new invalidated mode can use the lock while it's
being invalidated. Picture readers reading data while a write lock is
invalidating and writing dirty data.
A problem arises when a lock recover request is processed during lock
invalidation. The client lock recover request handler sends a response
with the current granted mode. The server takes this to mean that the
invalidation is done but the client invalidation worker might still be
writing data, dropping caches, etc. The server will allow the state
machine to advance which can send grants to pending client requests
which believed that the invalidation was done.
All of this can lead to a grant response handler in the client tripping
the assertion that there can not be cached items that were incompatible
with the old mode in a grant from the server. Invalidation might still
be invalidating caches. Hitting this bug is very rare and requires a
new server starting up while a client has both a request outstanding and
an invalidation being processed when the lock recover request arrives.
The fix is to record the old mode during invalidation and send that in
lock recover responses. This can lead the lock server to resend
invalidation requests to the client. The client already safely handles
duplicate invalidation requests from other failover cases.
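The fix can be modeled with a few lines of userspace C; the mode values and field names are illustrative, not the real scoutfs lock structures:

```c
#include <assert.h>

enum { MODE_NULL, MODE_READ, MODE_WRITE };

struct held_lock {
	int mode;		/* granted mode, set to the new mode
				 * early so compatible users proceed */
	int inval_old_mode;	/* recorded old mode while invalidating,
				 * MODE_NULL when idle */
};

static void start_invalidate(struct held_lock *lk, int new_mode)
{
	lk->inval_old_mode = lk->mode;
	lk->mode = new_mode;
}

static void finish_invalidate(struct held_lock *lk)
{
	lk->inval_old_mode = MODE_NULL;
}

/* Recovery responses report the old mode while invalidation is in
 * flight, so a new server resends the invalidation request rather
 * than assuming it completed. */
static int recover_mode(struct held_lock *lk)
{
	return lk->inval_old_mode != MODE_NULL ? lk->inval_old_mode :
						 lk->mode;
}
```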
Signed-off-by: Zach Brown <zab@versity.com>
The change to only allocate a buffer for the first xattr item with
kmalloc instead of the entire logical xattr payload with vmalloc
included a regression for getting large xattrs.
getxattr used to copy the entire payload into the large vmalloc so it
could unlock just after get_next_xattr. The change to only getting the
first item buffer added a call to copy from the rest of the items but
those copies weren't covered by the locks. This would often work
because the lock pointer still pointed to a valid lock. But if the lock
was invalidated then the mode would no longer be compatible and
_item_lookup would return EINVAL.
The fix is to extend xattr_rwsem and cluster lock coverage to the rest
of the function body, which includes the value item copies. This also
makes getxattr's lock coverage consistent with setxattr and listxattr
which might reduce the risk of similar mistakes in the future.
Signed-off-by: Zach Brown <zab@versity.com>
After we've merged a log btree back into the main fs tree we kick off
work to free all its blocks. This would fully fill the transaction's
free blocks list before stopping to apply the commit.
Consuming the entire free list makes it hard to have concurrent holders
of a commit who also want to free things. This changes the log btree
block freeing to limit itself to a fraction of the budget that each
holder gets. That coarse limit avoids us having to precisely account
for the allocations and frees while modifying the freeing item while
still freeing many blocks per commit.
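The coarse limit amounts to capping each holder at a fraction of the shared budget; the specific numbers below are illustrative:

```c
#include <assert.h>

#define COMMIT_FREE_BUDGET	1024	/* free list capacity per commit */
#define HOLDER_FRACTION		4	/* each holder gets 1/4 */

static long free_limit(void)
{
	return COMMIT_FREE_BUDGET / HOLDER_FRACTION;
}

/* Free up to the per-holder limit in this commit and return how many
 * blocks are left for a later commit to pick up. */
static long free_some_blocks(long remaining)
{
	long n = remaining < free_limit() ? remaining : free_limit();

	return remaining - n;
}
```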
Signed-off-by: Zach Brown <zab@versity.com>
Server commits use an allocator that has a limited number of available
metadata blocks and entries in a list for freed blocks. The allocator
is refilled between commits. Holders can't fully consume the allocator
during the commit and that tended to work out because server commit
holders commit before sending responses. We'd tend to commit frequently
enough that we'd get a chance to refill the allocators before they were
consumed.
But there was no mechanism to ensure that this would be the case.
Enough concurrent server holders were able to fully consume the
allocators before committing. This caused scoutfs_meta_alloc and _free
to return errors, leading the server to fail in the worst cases.
This changes the server commit tracking to use more robust structures
which limit the number of concurrent holders so that the allocators
aren't exhausted. The commit_users struct stops holders from making
progress once the allocators don't have room for more holders. It also
lets us stop future holders from making progress once the commit work
has been queued. The previous cute use of a rwsem didn't allow for
either of these protections.
We don't have precise tracking of each holder's allocation consumption
so we don't try and reserve blocks for each holder. Instead we have a
maximum consumption per holder and make sure that all the holders can't
consume the allocators if they all use their full limit.
All of this requires the holding code paths to be well behaved and not
use more than the per-hold limit. We add some debugging code to print
the stacks of holders that were active when the total holder limit was
exceeded. This is the motivation for having state in the holders. We
can record some data at the time their hold started that'll make it a
little easier to track down which of the holders exceeded their limit.
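The admission logic can be sketched in userspace; the struct layout and names here are assumptions standing in for the real commit_users implementation:

```c
#include <assert.h>
#include <stdbool.h>

struct commit_users {
	long avail;		/* allocator blocks left for this commit */
	long per_hold_limit;	/* worst-case consumption per holder */
	long holders;		/* currently admitted holders */
	bool queued;		/* commit work queued; no new holders */
};

/* Admit a holder only while the allocators could absorb every
 * admitted holder's worst case, and refuse once the commit work has
 * been queued. */
static bool commit_hold(struct commit_users *cu)
{
	long worst = (cu->holders + 1) * cu->per_hold_limit;

	if (cu->queued || worst > cu->avail)
		return false;	/* caller waits for the next commit */
	cu->holders++;
	return true;
}

static void commit_release(struct commit_users *cu)
{
	cu->holders--;
}
```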
Signed-off-by: Zach Brown <zab@versity.com>
Add helper function to give the caller the number of blocks remaining in
the first list block that's used for meta allocation and freeing.
Signed-off-by: Zach Brown <zab@versity.com>
There was a brief time where we exported the ability to hold and apply
commits outside of the main server code. That wasn't a great idea, and
the few users have since been reworked to not require directly
manipulating server transactions, so we can reduce risk and make these
functions private again.
Signed-off-by: Zach Brown <zab@versity.com>
Quorum members will try to elect a new leader when they don't receive
heartbeats from the currently elected leader. This timeout is short to
encourage restoring service promptly.
Heartbeats are sent from the quorum worker thread and are delayed while
it synchronously starts up the server, which includes fencing previous
servers. If fence requests take too long then heartbeats will be
delayed long enough for remaining quorum members to elect a new leader
while the recently elected server is still busy fencing.
To fix this we decouple server startup from the quorum main thread.
Server starting and stopping becomes asynchronous so the quorum thread
is able to send heartbeats while the server work is off starting up and
fencing.
The server used to call into quorum to clear a flag as it exited. We
remove that mechanism and have the server maintain a running status that
quorum can query.
We add some state to the quorum work to track the asynchronous state of
the server. This lets the quorum protocol change roles immediately as
needed while remembering that there is a server running that needs to be
acted on.
The server used to also call into quorum to update quorum blocks. This
is a read-modify-write operation that has to be serialized. Now that we
have both the server starting up and the quorum work running they both
can't perform these read-modify-write cycles. Instead we have the
quorum work own all the block updates and it queries the server status
to determine when it should update the quorum block to indicate that the
server has fenced or shut down.
Signed-off-by: Zach Brown <zab@versity.com>
The fence script we use for our single node multi-mount tests only knows
how to fence by using forced unmount to destroy a mount. As of now, the
tests only generate failing nodes that need to be fenced by using forced
unmount as well. This results in the awkward situation where the
testing fence script doesn't have anything to do because the mount is
already gone.
When the test fence script has nothing to do we might not notice if it
isn't run. This adds explicit verification to the fencing tests that
the script was really run. It adds per-invocation logging to the fence
script and the test makes sure that it was run.
While we're at it, we take the opportunity to tidy up some of the
scripting around this. We use a sysfs file with the data device
major:minor numbers so that the fencing script can find and unmount
mounts without having to ask them for their rid. They may not be
operational.
Signed-off-by: Zach Brown <zab@versity.com>
Extended attribute values can be larger than a reasonable maximum size
for our btree items so we store xattrs in many items. The first pass at
this code used vmalloc to make it relatively easy to work with a
contiguous buffer that was cut up into multiple items.
The problem, of course, is that vmalloc() is expensive. Well, the
problem is that I always forget just how expensive it can be and use it
when I shouldn't. We had loads on high cpu count machines that were
catastrophically cpu bound on all the contentious work that vmalloc does
to maintain a coherent global address space.
This removes the use of vmalloc and only allocates a small buffer for
the first compound item. The later items directly reference regions of
value buffer rather than copying it to and from the large intermediate
vmalloced buffer.
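The copy pattern reduces to walking the items and copying each chunk straight to its destination; the item size and helper below are illustrative, not the real on-disk layout:

```c
#include <assert.h>
#include <string.h>

#define ITEM_SIZE 8	/* illustrative per-item payload size */

/* The logical value is split across fixed-size items and copied chunk
 * by chunk, with no large contiguous staging buffer in between. */
static void copy_value(const char *items[], int nr, size_t len, char *dst)
{
	size_t off = 0;

	for (int i = 0; i < nr && off < len; i++) {
		size_t n = (len - off) < ITEM_SIZE ? (len - off) : ITEM_SIZE;

		memcpy(dst + off, items[i], n);
		off += n;
	}
}
```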
Signed-off-by: Zach Brown <zab@versity.com>
The t_server_nr and t_first_client_nr helpers iterated over all the fs
numbers examining their quorum/is_leader files, but clients don't have a
quorum/ directory. This was causing spurious output in tests that were
looking for servers but didn't find one in the first quorum fs numbers
and made it down into the clients.
Give them a helper that returns 0 for being a leader if the quorum/ dir
doesn't exist.
Signed-off-by: Zach Brown <zab@versity.com>
We were seeing rare test failures where it looked like is_leader wasn't
set for any of the mounts. The test that couldn't find a set is_leader
file had just performed some mounts so we know that a server was up and
processing requests.
The quorum task wasn't updating the status that's shown in sysfs and
debugfs until after the server started up. This opened the race where
the server was able to serve mount requests and have the test run to
find no is_leader file set before the quorum task was able to update the
status and make its election visible.
This updates the quorum task to make its status visible more often,
typically before it does something that will take a while. The
is_leader file will now be visible before the server is started, so the
test will always see it after the server starts up and lets mounts finish.
Signed-off-by: Zach Brown <zab@versity.com>