Add a lock call to get the current refresh_gen of a held lock. If the
lock doesn't exist or isn't readable then we return 0. This can be used
to track lock coverage of structures without the overhead and lifetime
binding of the lock coverage struct.
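A minimal userspace sketch of how a caller might use such a call to
track coverage. The names toy_lock, toy_lock_refresh_gen() and
cached_item are hypothetical stand-ins, not the scoutfs API. A structure
records the generation it saw when it was filled and is treated as
covered only while the lock still reports the same non-zero generation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a held cluster lock; not the scoutfs struct. */
struct toy_lock {
	bool readable;		/* lock currently grants read access */
	uint64_t refresh_gen;	/* changes each time the lock is refreshed */
};

/* Return the current generation, or 0 if the lock is missing or unreadable. */
static uint64_t toy_lock_refresh_gen(const struct toy_lock *lck)
{
	if (lck == NULL || !lck->readable)
		return 0;
	return lck->refresh_gen;
}

/* A cached structure remembers the generation it was filled under. */
struct cached_item {
	uint64_t covered_gen;
};

static void cached_item_fill(struct cached_item *item,
			     const struct toy_lock *lck)
{
	assert(item != NULL);
	item->covered_gen = toy_lock_refresh_gen(lck);
}

/* Covered only if the lock reports a non-zero gen that still matches. */
static bool cached_item_is_covered(const struct cached_item *item,
				   const struct toy_lock *lck)
{
	uint64_t gen = toy_lock_refresh_gen(lck);

	return gen != 0 && gen == item->covered_gen;
}
```

The item only stores a 64-bit value, so it holds no reference on the
lock and its lifetime isn't tied to the lock's.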
Signed-off-by: Zach Brown <zab@versity.com>
scoutfs_sysfs_exit() is called during error handling in module init.
When scoutfs is built-in (so, never.) the __exit section won't be
loaded. Remove the __exit annotation so it's always available to be
called.
Signed-off-by: Zach Brown <zab@versity.com>
The dentry cache life cycles are far too crazy to rely on d_fsdata being
kept in sync with the rest of the dentry fields. Callers can do all
sorts of crazy things with dentries. Only unlink and rename need these
fields and those operations are already so expensive that item lookups
to get the current actual hash and pos are lost in the noise.
Signed-off-by: Zach Brown <zab@versity.com>
The test shell helpers for saving and restoring mount options were
trying to put each mount's option value in an array. They meant to
build the array key by concatenating the option name and the mount
number. But they didn't isolate the option "name" variable when
evaluating it, instead always evaluating "name_" to nothing and building
keys for all options that only contained the mount index. This then
broke when tests attempted to save and restore multiple options.
Signed-off-by: Zach Brown <zab@versity.com>
Make mount options for the size of preallocation and whether or not it
should be restricted to extending writes. Disabling the default
restriction to streaming writes lets it preallocate in aligned regions
of the preallocation size when they contain no extents.
Signed-off-by: Zach Brown <zab@versity.com>
The orphan_scan_delay_ms option setting code mistakenly set the default
before testing the option for -1 (not the default) to discover if
multiple options had been set. This made any attempt to set it fail.
Initialize the option to -1 so the first set succeeds and apply the
default if we don't set the value.
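The ordering described above can be sketched in userspace C. The
toy_opts names and the default value are hypothetical and are not taken
from the scoutfs option code. The option starts at the -1 sentinel, the
first set is accepted, a second set is rejected, and the default is
applied only after parsing has finished.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical default for illustration; not the scoutfs value. */
#define TOY_DEFAULT_DELAY_MS 10000

struct toy_opts {
	int delay_ms;
};

/* Start at the "unset" sentinel so the first explicit set is accepted. */
static void toy_opts_init(struct toy_opts *opts)
{
	opts->delay_ms = -1;
}

/* Accept the first set, reject any later attempt to set it again. */
static int toy_opts_set_delay(struct toy_opts *opts, int ms)
{
	assert(ms >= 0);	/* -1 is reserved as the sentinel */
	if (opts->delay_ms != -1)
		return -EINVAL;
	opts->delay_ms = ms;
	return 0;
}

/* Apply the default after parsing, and only if nothing was set. */
static void toy_opts_finish(struct toy_opts *opts)
{
	if (opts->delay_ms == -1)
		opts->delay_ms = TOY_DEFAULT_DELAY_MS;
}
```

Setting the default in toy_opts_init() instead would reproduce the bug:
the sentinel test in toy_opts_set_delay() would fail on the first set.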
Signed-off-by: Zach Brown <zab@versity.com>
The simple-xattr-unit test had a helper that failed by exiting with
non-zero instead of emitting a message. Let's make it a bit easier to
see what's going on.
Signed-off-by: Zach Brown <zab@versity.com>
Add support for the POSIX ACLs as described in acl(5). Support is
enabled by default and can be explicitly enabled or disabled with the
acl or noacl mount options, respectively.
Signed-off-by: Zach Brown <zab@versity.com>
The upcoming acl support wants to be able to get and set xattrs from
callers who already have cluster locks and transactions. We refactor
the existing xattr get and set calls into locked and unlocked variants.
It's mostly boring code motion with the unfortunate situation that the
caller needs to acquire the totl cluster lock before holding a
transaction before calling into the xattr code. We push the parsing of
the tags to the caller of the locked get and set so that they can know
to acquire the right lock. (The acl callers will never be setting
scoutfs. prefixed xattrs so they will never have tags.)
Signed-off-by: Zach Brown <zab@versity.com>
Move to the use of the array of xattr_handler structs on the super to
dispatch set and get from generic_ based on the xattr prefix. This
will make it easier to add handling of the pseudo system. ACL xattrs.
Signed-off-by: Zach Brown <zab@versity.com>
try_delete_inode_items() is responsible for making sure that it's safe
to delete an inode's persistent items. One of the things it has to
check is that there isn't another deletion attempt on the inode in this
mount. It sets a bit in lock data while it's working and backs off if
the bit is already set.
Unfortunately it was always clearing this bit as it exited, regardless
of whether it set it or not. This would let the next attempt perform
the deletion again before the working task had finished. This was often
not a problem because background orphan scanning is the only source of
regular concurrent deletion attempts.
But it's a big problem if a deletion attempt takes a very long time. It
gives enough time for an orphan scan attempt to clear the bit then try
again and clobber whoever is performing the very slow deletion.
I hit this in a test that built files with an absurd number of
fragmented extents. The second concurrent orphan attempt was able to
proceed with deletion and performed a bunch of duplicate data extent
frees and caused corruption.
The fix is to only clear the bit if we set it. Now all concurrent
attempts will back off until the first task is done.
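The pattern can be sketched with C11 atomics in userspace. The
toy_lock_data struct and toy_delete_begin()/toy_delete_end() helpers are
hypothetical and only model the bit in lock data. The caller remembers
whether its own test-and-set succeeded and clears the flag only in that
case.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-lock state standing in for the bit in lock data. */
struct toy_lock_data {
	atomic_bool deleting;
};

static void toy_lock_data_init(struct toy_lock_data *ld)
{
	atomic_init(&ld->deleting, false);
}

/* Returns true if this caller set the flag and so owns the deletion. */
static bool toy_delete_begin(struct toy_lock_data *ld)
{
	/* atomic_exchange() returns the previous value of the flag. */
	return !atomic_exchange(&ld->deleting, true);
}

/* Clear the flag only when this caller was the one that set it. */
static void toy_delete_end(struct toy_lock_data *ld, bool we_set_it)
{
	if (we_set_it)
		atomic_store(&ld->deleting, false);
}
```

A caller that backed off passes false to toy_delete_end(), so the flag
stays set until the task that owns the deletion is done.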
Signed-off-by: Zach Brown <zab@versity.com>
Add a test which gives the server a transaction with a free list block
that contains blknos that each dirty an individual btree block in the
global data free extent btree.
Signed-off-by: Zach Brown <zab@versity.com>
Recently scoutfs_alloc_move() was changed to try and limit the amount of
metadata blocks it could allocate or free. The intent was to stop
concurrent holders of a transaction from fully consuming the available
allocator for the transaction.
The limiting logic was a bit off. It stopped when the allocator had the
caller's limit remaining, not when it had consumed the caller's limit.
This is overly permissive and could still allow concurrent callers to
consume the allocator. It was also triggering warning messages when a
call consumed more than its allowed budget while holding a transaction.
Unfortunately, we don't have per-caller tracking of allocator resource
consumption. The best we can do is sample the allocators as we start
and return if they drop by the caller's limit. This is overly
conservative in that it accounts any consumption during concurrent
callers to all callers.
This isn't perfect but it makes the failure case less likely and the
impact shouldn't be significant. We don't often have a lot of
concurrency and the limits are larger than callers will typically
consume.
Signed-off-by: Zach Brown <zab@versity.com>
Add scoutfs_alloc_meta_low_since() to test if the metadata avail or
freed resources have been used by a given amount since a previous
snapshot.
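A rough userspace sketch of that kind of test. The toy_alloc fields and
function names are hypothetical, and it assumes both resources are
tracked as remaining counts that shrink as they are used, which may not
match the real allocator layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical allocator state: remaining counts that shrink with use. */
struct toy_alloc {
	uint64_t avail_blocks;	/* blocks left to allocate */
	uint64_t freed_room;	/* room left to record freed blocks */
};

struct toy_alloc_snap {
	uint64_t avail_blocks;
	uint64_t freed_room;
};

/* Sample the allocator as the caller starts its work. */
static struct toy_alloc_snap toy_alloc_snapshot(const struct toy_alloc *alloc)
{
	struct toy_alloc_snap snap = { alloc->avail_blocks, alloc->freed_room };

	return snap;
}

/* Amount used between two samples; 0 if the count went back up. */
static uint64_t toy_used_since(uint64_t then, uint64_t now)
{
	return then > now ? then - now : 0;
}

/* True once either resource has dropped by its limit since the sample. */
static bool toy_alloc_low_since(const struct toy_alloc *alloc,
				const struct toy_alloc_snap *snap,
				uint64_t avail_limit, uint64_t freed_limit)
{
	return toy_used_since(snap->avail_blocks, alloc->avail_blocks) >= avail_limit ||
	       toy_used_since(snap->freed_room, alloc->freed_room) >= freed_limit;
}
```

The comparison is against the shared counters, so use by any concurrent
holder since the sample counts against every caller that sampled.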
Signed-off-by: Zach Brown <zab@versity.com>
As _get_log_trees() in the server prepares the log_trees item for the
client's commit, it moves all the freed data extents from the log_trees
item into core data extent allocator btree items. If the freed blocks
are very fragmented then it can exceed a commit's metadata allocation
budget trying to dirty blocks in the free data extent btree.
The fix is to move the freed data extents in multiple commits. First we
move a limited number in the main commit that does all the rest of the
work preparing the commit. Then we try to move the remaining freed
extents in multiple additional commits.
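The shape of the loop can be sketched in userspace as follows. The
toy_extents counters and the per-commit budget are hypothetical
simplifications; the real code moves btree items and commits server
transactions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in: just a count of extents in each allocator. */
struct toy_extents {
	size_t nr;
};

/* Move up to `budget` extents from src to dst; returns the number moved. */
static size_t toy_move_some(struct toy_extents *src, struct toy_extents *dst,
			    size_t budget)
{
	size_t nr = src->nr < budget ? src->nr : budget;

	src->nr -= nr;
	dst->nr += nr;
	return nr;
}

/* Drain src over as many "commits" as needed; returns commits used. */
static size_t toy_move_all(struct toy_extents *src, struct toy_extents *dst,
			   size_t per_commit)
{
	size_t commits = 0;

	assert(per_commit > 0);	/* a zero budget would never make progress */

	/* The main commit moves a limited number alongside its other work. */
	toy_move_some(src, dst, per_commit);
	commits++;

	/* Whatever remains is moved in additional commits. */
	while (src->nr > 0) {
		toy_move_some(src, dst, per_commit);
		commits++;
	}

	return commits;
}
```

With 25 freed extents and a budget of 10 per commit this uses three
commits, and no single commit moves more than its budget.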
Signed-off-by: Zach Brown <zab@versity.com>
Callers who send to specific client connections can get -ENOTCONN if
their client has gone away. We forgot to free the send tracking struct
in that case.
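A small userspace sketch of the error path. The toy_send_track struct,
toy_submit() and the live counter are hypothetical and exist only to
make the leak visible. On a submit error nothing will ever complete the
send, so the sender has to free the tracking struct itself.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical tracking struct for one outstanding send. */
struct toy_send_track {
	int id;
};

/* Counts live tracking structs so a leak is observable. */
static int toy_live_tracks;

/* Hypothetical transport: fails with -ENOTCONN when the client is gone. */
static int toy_submit(struct toy_send_track *trk, bool connected)
{
	(void)trk;
	return connected ? 0 : -ENOTCONN;
}

/* Normal completion path frees the tracking struct. */
static void toy_send_complete(struct toy_send_track *trk)
{
	free(trk);
	toy_live_tracks--;
}

static int toy_send(int id, bool connected, struct toy_send_track **ret_trk)
{
	struct toy_send_track *trk = malloc(sizeof(*trk));
	int ret;

	*ret_trk = NULL;
	if (trk == NULL)
		return -ENOMEM;
	trk->id = id;
	toy_live_tracks++;

	ret = toy_submit(trk, connected);
	if (ret < 0) {
		/* No completion will arrive, so free the tracking here. */
		toy_send_complete(trk);
		return ret;
	}

	*ret_trk = trk;
	return 0;
}
```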
Signed-off-by: Zach Brown <zab@versity.com>
The omap code keeps track of rids that are connected to the server. It
only freed the tracked rids as the server told it that rids were being
removed. But that removal only happened as clients were evicted. If
the server shut down it'd leave the old rid entries around. They'd be
leaked as the mount was unmounted and could linger and create duplicate
entries if the server started back up and the same clients reconnected.
The fix is to free the tracking rids as the server shuts down. They'll
be rebuilt as clients reconnect if the server restarts.
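A userspace sketch of the lifecycle, using a hypothetical singly linked
list of rids rather than the omap code's real structures. Freeing every
tracked rid at server shutdown means a reconnecting client after restart
is added fresh instead of colliding with a stale entry.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical tracking of rids connected to the server. */
struct toy_rid {
	uint64_t rid;
	struct toy_rid *next;
};

struct toy_omap {
	struct toy_rid *rids;
	size_t nr;
};

/* Track a newly connected rid; a stale entry shows up as -EEXIST. */
static int toy_omap_add_rid(struct toy_omap *omap, uint64_t rid)
{
	struct toy_rid *ent;

	for (ent = omap->rids; ent != NULL; ent = ent->next) {
		if (ent->rid == rid)
			return -EEXIST;
	}

	ent = malloc(sizeof(*ent));
	if (ent == NULL)
		return -ENOMEM;
	ent->rid = rid;
	ent->next = omap->rids;
	omap->rids = ent;
	omap->nr++;
	return 0;
}

/* Free every tracked rid as the server shuts down. */
static void toy_omap_server_shutdown(struct toy_omap *omap)
{
	struct toy_rid *ent, *next;

	for (ent = omap->rids; ent != NULL; ent = next) {
		next = ent->next;
		free(ent);
	}
	omap->rids = NULL;
	omap->nr = 0;
}
```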
Signed-off-by: Zach Brown <zab@versity.com>
If we return an error from .fill_super without having set sb->s_root
then the vfs won't call our put_super. Our fill_super is careful to
call put_super so that it can tear down partial state, but we weren't
doing this with a few very early errors in fill_super. This tripped
leak detection when we weren't freeing the sbi when returning errors
from bad option parsing.
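The structure of the fix can be sketched in userspace. All toy_ names
here are hypothetical and the allocation counter only exists to make
leaks visible. Every error after the first allocation, including the
earliest ones, goes through the same teardown, which tolerates partially
built state.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Counts live allocations so a leaked sbi is observable. */
static int toy_live_allocs;

static void *toy_alloc(size_t size)
{
	void *p = calloc(1, size);

	if (p != NULL)
		toy_live_allocs++;
	return p;
}

static void toy_free(void *p)
{
	if (p != NULL) {
		toy_live_allocs--;
		free(p);
	}
}

/* Hypothetical per-super info with state built up in stages. */
struct toy_sbi {
	char *opts;
	char *root;
};

/* Tear down whatever was set up; safe on partially built state. */
static void toy_put_super(struct toy_sbi *sbi)
{
	if (sbi == NULL)
		return;
	toy_free(sbi->root);
	toy_free(sbi->opts);
	toy_free(sbi);
}

static int toy_fill_super(struct toy_sbi **ret_sbi, bool bad_options)
{
	struct toy_sbi *sbi = toy_alloc(sizeof(*sbi));
	int ret;

	*ret_sbi = NULL;
	if (sbi == NULL)
		return -ENOMEM;

	if (bad_options) {
		ret = -EINVAL;	/* a very early error still needs teardown */
		goto out;
	}

	sbi->opts = toy_alloc(16);
	sbi->root = toy_alloc(16);
	if (sbi->opts == NULL || sbi->root == NULL) {
		ret = -ENOMEM;
		goto out;
	}

	*ret_sbi = sbi;
	return 0;
out:
	/* Nobody else will clean up for us on error, so do it here. */
	toy_put_super(sbi);
	return ret;
}
```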
Signed-off-by: Zach Brown <zab@versity.com>
Clients don't use the net conn info and specified that it has 0 size.
The net layer would try and allocate a zero size region which returns
the magic ZERO_SIZE_PTR, which it would then later try and free. While
that works, it's a little goofy. We can avoid the allocation when the
size is 0. The pointer will remain null which kfree also accepts.
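A userspace sketch of the same idea, with calloc() and free() standing
in for the kernel allocator and a hypothetical toy_conn struct. Userspace
malloc(0) behaves differently from kmalloc(0), so this only illustrates
skipping the allocation and relying on free(NULL) being a no-op.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical connection with an optional caller-sized info region. */
struct toy_conn {
	void *info;
	size_t info_size;
};

static int toy_conn_init(struct toy_conn *conn, size_t info_size)
{
	conn->info = NULL;
	conn->info_size = info_size;

	/* Callers that don't use the info skip the allocation entirely. */
	if (info_size == 0)
		return 0;

	conn->info = calloc(1, info_size);
	return conn->info != NULL ? 0 : -ENOMEM;
}

static void toy_conn_destroy(struct toy_conn *conn)
{
	/* free(NULL) is a no-op, so no size check is needed here. */
	free(conn->info);
	conn->info = NULL;
}
```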
Signed-off-by: Zach Brown <zab@versity.com>
Add an option to skip printing structures that are likely to be so huge
that the print output becomes completely unwieldy on large systems.
Signed-off-by: Zach Brown <zab@versity.com>
Like a lot of places in the server, get_log_trees() doesn't have the
tools it needs to safely unwind partial changes in the face of an error.
In the worst case, it can have moved extents from the mount's log_trees
item into the server's main data allocator. The dirty data allocator
reference is in the super block so it can be written later. The dirty
log_trees reference is on stack, though, so it will be thrown away on
error. This ends up duplicating extents in the persistent structures
because they're written in the new dirty allocator but still remain in
the unwritten source log_trees allocator.
This change makes it harder for that to happen. It dirties the
log_trees item and always tries to update so that the dirty blocks are
consistent if they're later written out. If we do get an error updating
the item we throw an assertion. It's not great, but it matches other
similar circumstances in other parts of the server.
Signed-off-by: Zach Brown <zab@versity.com>
We were setting sk_allocation on the quorum UDP sockets to prevent
entering reclaim while using sockets but we missed setting it on the
regular messaging TCP sockets. This could create deadlocks where the
sending socket could enter scoutfs reclaim and wait for server messages
while holding the socket lock, preventing the receive thread from
receiving messages while it blocked on the socket lock.
The fix is to prevent entering the FS to reclaim during socket
allocations.
Signed-off-by: Zach Brown <zab@versity.com>
Client log_trees allocator btrees can build up quite a number of
extents. In the right circumstances moving fragmented extents can
require dirtying a large number of paths to leaf blocks in the core
allocator btrees. It might not be possible to dirty all the blocks
necessary to move all the extents in one commit.
This reworks the extent motion so that it can be performed in multiple
commits if the meta allocator for the commit runs out while it is moving
extents. It's a minimal fix with as little disruption to the ordering
of commits and locking as possible. It simply bubbles up an error when
the allocators run out and retries functions that can already be retried
in other circumstances.
Signed-off-by: Zach Brown <zab@versity.com>
We're seeing allocator motion during get_log_trees dirty quite a lot of
blocks, which makes sense. Let's continue to up the budget. If we
still need significantly larger budgets we'll want to look into capping
the dirty block use of the allocator extent movers which will mean
changing callers to support partial progress.
Signed-off-by: Zach Brown <zab@versity.com>