Add mount options for the size of preallocation and for whether it
should be restricted to extending writes. Disabling the default
restriction to streaming writes lets preallocation fill aligned regions
of the preallocation size when they contain no extents.
Signed-off-by: Zach Brown <zab@versity.com>
The orphan_scan_delay_ms option setting code mistakenly set the default
before testing whether the option was still -1 (not yet set) to detect
whether the option had been given multiple times. This made any attempt
to set the option fail. Initialize the option to -1 so the first set
succeeds, and apply the default if the option was never set.
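The sentinel-then-default pattern can be sketched in userspace C; the
names and the duplicate-detection return value here are hypothetical,
not the actual option-parsing code:

```c
#include <assert.h>

#define OPT_UNSET      -1L    /* sentinel: option not yet seen */
#define OPT_DEFAULT_MS 10000L /* hypothetical default value */

/* accept one occurrence of the option, reject duplicates */
static int set_option(long *opt, long val)
{
	if (*opt != OPT_UNSET)
		return -1;	/* option given more than once */
	*opt = val;
	return 0;
}

/* apply the default only after all options have been parsed */
static void finish_options(long *opt)
{
	if (*opt == OPT_UNSET)
		*opt = OPT_DEFAULT_MS;
}
```

The bug was the equivalent of running finish_options() before
set_option(), so the first set always looked like a duplicate.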
Signed-off-by: Zach Brown <zab@versity.com>
The simple-xattr-unit test had a helper that failed by exiting with
non-zero instead of emitting a message. Let's make it a bit easier to
see what's going on.
Signed-off-by: Zach Brown <zab@versity.com>
Add support for POSIX ACLs as described in acl(5). Support is
enabled by default and can be explicitly enabled or disabled with the
acl or noacl mount options, respectively.
Signed-off-by: Zach Brown <zab@versity.com>
The upcoming acl support wants to be able to get and set xattrs from
callers who already have cluster locks and transactions. We refactor
the existing xattr get and set calls into locked and unlocked variants.
It's mostly boring code motion, with the unfortunate wrinkle that the
caller needs to acquire the totl cluster lock before holding a
transaction and calling into the xattr code. We push the parsing of
the tags to the caller of the locked get and set so that they can know
to acquire the right lock. (The acl callers will never be setting
scoutfs. prefixed xattrs so they will never have tags.)
Signed-off-by: Zach Brown <zab@versity.com>
Move to the use of the array of xattr_handler structs on the super to
dispatch set and get from generic_ based on the xattr prefix. This
will make it easier to add handling of the pseudo "system." ACL xattrs.
Signed-off-by: Zach Brown <zab@versity.com>
try_delete_inode_items() is responsible for making sure that it's safe
to delete an inode's persistent items. One of the things it has to
check is that there isn't another deletion attempt on the inode in this
mount. It sets a bit in lock data while it's working and backs off if
the bit is already set.
Unfortunately it was always clearing this bit as it exited, regardless
of whether it set it or not. This would let the next attempt perform
the deletion again before the working task had finished. This was often
not a problem because background orphan scanning is the only source of
regular concurrent deletion attempts.
But it's a big problem if a deletion attempt takes a very long time. It
gives an orphan scan attempt enough time to clear the bit, try again,
and clobber whoever is performing the very slow deletion.
I hit this in a test that built files with an absurd number of
fragmented extents. The second concurrent orphan attempt was able to
proceed with deletion and performed a bunch of duplicate data extent
frees and caused corruption.
The fix is to only clear the bit if we set it. Now all concurrent
attempts will back off until the first task is done.
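A minimal userspace sketch of the fixed behavior (the kernel code would
use atomic bit operations on lock data; the names here are
illustrative):

```c
#include <assert.h>
#include <stdbool.h>

#define DELETING_BIT 0x1	/* hypothetical "deletion in progress" flag */

/* only the task that set the busy bit may clear it */
static bool try_delete(unsigned int *lock_flags)
{
	if (*lock_flags & DELETING_BIT)
		return false;	/* another deleter is active: back off */

	*lock_flags |= DELETING_BIT;
	/* ... perform the (possibly very slow) item deletion ... */
	*lock_flags &= ~DELETING_BIT;	/* clear only because we set it */
	return true;
}
```

The bug was clearing the bit unconditionally on exit, so a backed-off
caller destroyed the working task's exclusion.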
Signed-off-by: Zach Brown <zab@versity.com>
Add a test which gives the server a transaction with a free list block
that contains blknos that each dirty an individual btree block in the
global data free extent btree.
Signed-off-by: Zach Brown <zab@versity.com>
Recently scoutfs_alloc_move() was changed to try to limit the number of
metadata blocks it could allocate or free. The intent was to stop
concurrent holders of a transaction from fully consuming the available
allocator for the transaction.
The limiting logic was a bit off. It stopped when the allocator had the
caller's limit remaining, not when it had consumed the caller's limit.
This is overly permissive and could still allow concurrent callers to
consume the allocator. It was also triggering warning messages when a
call consumed more than its allowed budget while holding a transaction.
Unfortunately, we don't have per-caller tracking of allocator resource
consumption. The best we can do is sample the allocators as we start
and return if they drop by the caller's limit. This is overly
conservative in that it accounts any consumption during concurrent
callers to all callers.
This isn't perfect but it makes the failure case less likely and the
impact shouldn't be significant. We don't often have a lot of
concurrency and the limits are larger than callers will typically
consume.
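The sample-and-compare approach might look roughly like this userspace
sketch (the struct and function names are assumptions, not the actual
scoutfs allocator API):

```c
#include <assert.h>
#include <stdbool.h>

/* sample free counts at entry, then stop once the observed drop reaches
 * the caller's budget; this over-accounts concurrent callers'
 * consumption to everyone, which is the conservative behavior described */
struct alloc_sample {
	long avail_start;
};

static void sample_start(struct alloc_sample *s, long avail_now)
{
	s->avail_start = avail_now;
}

static bool budget_exhausted(const struct alloc_sample *s, long avail_now,
			     long limit)
{
	return (s->avail_start - avail_now) >= limit;
}
```

The buggy version effectively compared against the allocator's absolute
remaining count rather than the drop since the caller started.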
Signed-off-by: Zach Brown <zab@versity.com>
Add scoutfs_alloc_meta_low_since() to test if the metadata avail or
freed resources have been consumed by a given amount since a previous
snapshot.
Signed-off-by: Zach Brown <zab@versity.com>
As _get_log_trees() in the server prepares the log_trees item for the
client's commit, it moves all the freed data extents from the log_trees
item into core data extent allocator btree items. If the freed blocks
are very fragmented then it can exceed a commit's metadata allocation
budget trying to dirty blocks in the free data extent btree.
The fix is to move the freed data extents in multiple commits. First we
move a limited number in the main commit that does all the rest of the
work preparing the commit. Then we try to move the remaining freed
extents in multiple additional commits.
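The batched motion could be sketched as a loop, one batch per commit;
the names and the flat batch accounting here are illustrative only:

```c
#include <assert.h>

/* move freed data extents a limited batch at a time so no single
 * commit exceeds its metadata allocation budget; returns the number
 * of commits used */
static int move_freed_extents(int *remaining, int batch_limit)
{
	int commits = 0;

	while (*remaining > 0) {
		int moved = *remaining < batch_limit ? *remaining
						     : batch_limit;
		/* ... dirty free-extent btree blocks, apply one commit ... */
		*remaining -= moved;
		commits++;
	}
	return commits;
}
```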
Signed-off-by: Zach Brown <zab@versity.com>
Callers who send to specific client connections can get -ENOTCONN if
their client has gone away. We forgot to free the send tracking struct
in that case.
Signed-off-by: Zach Brown <zab@versity.com>
The omap code keeps track of rids that are connected to the server. It
only freed the tracked rids as the server told it that rids were being
removed. But that removal only happened as clients were evicted. If
the server shut down, it'd leave the old rid entries around. They'd be
leaked as the mount was unmounted and could linger and create duplicate
entries if the server started back up and the same clients reconnected.
The fix is to free the tracking rids as the server shuts down. They'll
be rebuilt as clients reconnect if the server restarts.
Signed-off-by: Zach Brown <zab@versity.com>
If we return an error from .fill_super without having set sb->s_root
then the vfs won't call our put_super. Our fill_super is careful to
call put_super so that it can tear down partial state, but we weren't
doing this with a few very early errors in fill_super. This tripped
leak detection when we weren't freeing the sbi when returning errors
from bad option parsing.
Signed-off-by: Zach Brown <zab@versity.com>
Clients don't use the net conn info and specified that it has 0 size.
The net layer would try and allocate a zero size region which returns
the magic ZERO_SIZE_PTR, which it would then later try and free. While
that works, it's a little goofy. We can avoid the allocation when the
size is 0. The pointer will remain null which kfree also accepts.
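A minimal sketch of the fix, using userspace calloc()/free() in place of
the kernel allocation calls (the function name is hypothetical):

```c
#include <assert.h>
#include <stdlib.h>

/* skip the allocation entirely when the size is 0, leaving the pointer
 * NULL instead of a ZERO_SIZE_PTR-style magic value; free() (like
 * kfree) accepts NULL */
static void *alloc_conn_info(size_t size)
{
	if (size == 0)
		return NULL;	/* clients specify a 0-size info region */
	return calloc(1, size);
}
```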
Signed-off-by: Zach Brown <zab@versity.com>
Add an option to skip printing structures that are likely to be so huge
that the print output becomes completely unwieldy on large systems.
Signed-off-by: Zach Brown <zab@versity.com>
Like a lot of places in the server, get_log_trees() doesn't have the
tools it needs to safely unwind partial changes in the face of an error.
In the worst case, it can have moved extents from the mount's log_trees
item into the server's main data allocator. The dirty data allocator
reference is in the super block so it can be written later. The dirty
log_trees reference is on stack, though, so it will be thrown away on
error. This ends up duplicating extents in the persistent structures
because they're written in the new dirty allocator but still remain in
the unwritten source log_trees allocator.
This change makes it harder for that to happen. It dirties the
log_trees item and always tries to update so that the dirty blocks are
consistent if they're later written out. If we do get an error updating
the item we throw an assertion. It's not great, but it matches other
similar circumstances in other parts of the server.
Signed-off-by: Zach Brown <zab@versity.com>
We were setting sk_allocation on the quorum UDP sockets to prevent
entering reclaim while using sockets but we missed setting it on the
regular messaging TCP sockets. This could create deadlocks where the
sending socket could enter scoutfs reclaim and wait for server messages
while holding the socket lock, preventing the receive thread from
receiving messages while it blocked on the socket lock.
The fix is to prevent entering the FS to reclaim during socket
allocations.
Signed-off-by: Zach Brown <zab@versity.com>
Client log_trees allocator btrees can build up quite a number of
extents. In the right circumstances, moving fragmented extents can
require dirtying a large number of paths to leaf blocks in the core
allocator btrees. It might not be possible to dirty all the blocks
necessary to move all the extents in one commit.
This reworks the extent motion so that it can be performed in multiple
commits if the meta allocator for the commit runs out while it is moving
extents. It's a minimal fix with as little disruption to the ordering
of commits and locking as possible. It simply bubbles up an error when
the allocators run out and retries functions that can already be retried
in other circumstances.
Signed-off-by: Zach Brown <zab@versity.com>
We're seeing allocator motion during get_log_trees dirty quite a lot of
blocks, which makes sense. Let's continue to up the budget. If we
still need significantly larger budgets we'll want to look into capping
the dirty block use of the allocator extent movers which will mean
changing callers to support partial progress.
Signed-off-by: Zach Brown <zab@versity.com>
When a new server starts up it rebuilds its view of all the granted
locks with lock recovery messages. Clients give the server their
granted lock modes which the server then uses to process all the resent
lock requests from clients.
The lock invalidation work in the client is responsible for
transitioning an old granted mode to a new invalidated mode from an
unsolicited message from the server. It has to process any client state
that'd be incompatible with the new mode (write dirty data, drop
caches). While it is doing this work, as an implementation short cut,
it sets the granted lock mode to the new mode so that users that are
compatible with the new invalidated mode can use the lock while it's
being invalidated. Picture readers reading data while a write lock is
invalidating and writing dirty data.
A problem arises when a lock recover request is processed during lock
invalidation. The client lock recover request handler sends a response
with the current granted mode. The server takes this to mean that the
invalidation is done but the client invalidation worker might still be
writing data, dropping caches, etc. The server will allow the state
machine to advance which can send grants to pending client requests
which believed that the invalidation was done.
All of this can lead to a grant response handler in the client tripping
the assertion that there can not be cached items that were incompatible
with the old mode in a grant from the server. Invalidation might still
be invalidating caches. Hitting this bug is very rare and requires a
new server starting up while a client has both a request outstanding and
an invalidation being processed when the lock recover request arrives.
The fix is to record the old mode during invalidation and send that in
lock recover responses. This can lead the lock server to resend
invalidation requests to the client. The client already safely handles
duplicate invalidation requests from other failover cases.
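The recorded-old-mode idea can be sketched like this; the mode values
and field names are hypothetical, not the actual scoutfs lock
structures:

```c
#include <assert.h>
#include <stdbool.h>

enum mode { MODE_READ, MODE_WRITE };

struct lock_state {
	enum mode granted;	/* mode users are checked against */
	enum mode old_mode;	/* mode before invalidation started */
	bool invalidating;
};

/* the implementation shortcut: lower the granted mode early so
 * compatible users can proceed while invalidation work runs */
static void start_invalidate(struct lock_state *lk, enum mode new_mode)
{
	lk->old_mode = lk->granted;
	lk->granted = new_mode;
	lk->invalidating = true;
}

/* recovery responses report the pre-invalidation mode until the
 * invalidation work has actually finished */
static enum mode recover_mode(const struct lock_state *lk)
{
	return lk->invalidating ? lk->old_mode : lk->granted;
}
```

Reporting old_mode means the new server may resend an invalidation
request, which the client already handles as a duplicate.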
Signed-off-by: Zach Brown <zab@versity.com>
The change to only allocate a buffer for the first xattr item with
kmalloc instead of the entire logical xattr payload with vmalloc
included a regression for getting large xattrs.
getxattr used to copy the entire payload into the large vmalloc so it
could unlock just after get_next_xattr. The change to only getting the
first item buffer added a call to copy from the rest of the items but
those copies weren't covered by the locks. This would often work
because the lock pointer still pointed to a valid lock. But if the lock
was invalidated then the mode would no longer be compatible and
_item_lookup would return EINVAL.
The fix is to extend xattr_rwsem and cluster lock coverage to the rest
of the function body, which includes the value item copies. This also
makes getxattr's lock coverage consistent with setxattr and listxattr
which might reduce the risk of similar mistakes in the future.
Signed-off-by: Zach Brown <zab@versity.com>
After we've merged a log btree back into the main fs tree we kick off
work to free all its blocks. This would fully fill the transaction's
free block list before stopping to apply the commit.
Consuming the entire free list makes it hard to have concurrent holders
of a commit who also want to free things. This changes the log btree
block freeing to limit itself to a fraction of the budget that each
holder gets. That coarse limit avoids having to precisely account
for the allocations and frees while modifying the freeing item, while
still freeing many blocks per commit.
Signed-off-by: Zach Brown <zab@versity.com>
Server commits use an allocator that has a limited number of available
metadata blocks and entries in a list for freed blocks. The allocator
is refilled between commits. Holders can't fully consume the allocator
during the commit and that tended to work out because server commit
holders commit before sending responses. We'd tend to commit frequently
enough that we'd get a chance to refill the allocators before they were
consumed.
But there was no mechanism to ensure that this would be the case.
Enough concurrent server holders were able to fully consume the
allocators before committing. This causes scoutfs_meta_alloc and _free
to return errors, leading the server to fail in the worst cases.
This changes the server commit tracking to use more robust structures
which limit the number of concurrent holders so that the allocators
aren't exhausted. The commit_users struct stops holders from making
progress once the allocators don't have room for more holders. It also
lets us stop future holders from making progress once the commit work
has been queued. The previous cute use of a rwsem didn't allow for
either of these protections.
We don't have precise tracking of each holder's allocation consumption
so we don't try to reserve blocks for each holder. Instead we have a
maximum consumption per holder and make sure that the holders can't
exhaust the allocators even if they all use their full limit.
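A rough userspace sketch of the admission check (the struct and field
names are assumptions; the real commit_users code would also handle
waiting, wakeups, and the debugging state described below):

```c
#include <assert.h>
#include <stdbool.h>

struct commit_users {
	int holders;	/* currently admitted holders */
	bool queued;	/* commit work has been queued */
};

/* admit a holder only if every admitted holder could consume its full
 * per-hold limit without exhausting the available allocator blocks,
 * and refuse new holders once the commit has been queued */
static bool hold_commit(struct commit_users *cu, long avail,
			long per_hold_limit)
{
	if (cu->queued)
		return false;
	if ((long)(cu->holders + 1) * per_hold_limit > avail)
		return false;
	cu->holders++;
	return true;
}
```

This is the reservation-free compromise the text describes: no
per-holder accounting, just a worst-case bound across all holders.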
All of this requires the holding code paths to be well behaved and not
use more than the per-hold limit. We add some debugging code to print
the stacks of holders that were active when the total holder limit was
exceeded. This is the motivation for having state in the holders. We
can record some data at the time their hold started that'll make it a
little easier to track down which of the holders exceeded their limit.
Signed-off-by: Zach Brown <zab@versity.com>
Add a helper function to give the caller the number of blocks remaining
in the first list block that's used for meta allocation and freeing.
Signed-off-by: Zach Brown <zab@versity.com>