The move_blocks ioctl intends to only move extents whose bytes fall
inside i_size. This is easy except for a final extent that straddles an
i_size that isn't aligned to 4K data blocks.
The code that checked for an extent being entirely past i_size, and
that limited the number of blocks to move by i_size, clumsily compared
i_size offsets in bytes with extent counts in 4KB blocks. In just the
right circumstances, probably with the help of a byte length to move
that is much larger than i_size, the length calculation could result in
trying to move 0 blocks. Once this hit, the loop would keep finding
that extent, calculating 0 blocks to move, and would be stuck.
We fix this by clamping the count of blocks in extents to move in terms
of byte offsets at the start of the loop. This gets rid of the extra
size checks and byte offset use in the loop. We also add a sanity check
to make sure that we can't get stuck if, say, corruption resulted in an
otherwise impossible zero length extent.
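The clamping can be sketched as follows (a minimal model with
illustrative names, not the actual scoutfs code): comparing byte
offsets rounded to whole blocks means a final block straddling i_size
still moves, and an extent entirely past i_size yields 0 up front
rather than mid-loop.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_SHIFT 12                   /* 4KB data blocks */
#define BLOCK_SIZE  (1ULL << BLOCK_SHIFT)

/*
 * Clamp a count of blocks to move so we never move blocks that start
 * at or past i_size.  Working entirely in block units derived from
 * byte offsets avoids the mixed-unit math that could compute 0 blocks
 * to move and leave the loop stuck on the same extent.
 */
static uint64_t clamp_move_blocks(uint64_t start_block, uint64_t count,
				  uint64_t i_size)
{
	/* round i_size up to whole blocks; a straddling final block moves */
	uint64_t size_blocks = (i_size + BLOCK_SIZE - 1) >> BLOCK_SHIFT;

	if (start_block >= size_blocks)
		return 0;		/* extent entirely past i_size */
	if (start_block + count > size_blocks)
		count = size_blocks - start_block;
	return count;
}
```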
Signed-off-by: Zach Brown <zab@versity.com>
When we truncate away from a partial block we need to zero its tail that
was past i_size and dirty it so that it's written.
We missed the typical vfs boilerplate of calling block_truncate_page
from setattr->set_size that does this. We need to be a little careful
to pass our file lock down to get_block and then queue the inode for
writeback so it's written out with the transaction. This follows the
pattern in .write_end.
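The tail zeroing can be sketched as a toy model over an in-memory
block (block_truncate_page does the real page-cache version; the name
here is illustrative):

```c
#include <assert.h>
#include <string.h>

#define BLOCK_SIZE 4096

/*
 * Zero the tail of a partial final block past the new i_size so that
 * stale bytes aren't left behind when the block is written out.  A
 * block-aligned i_size has no tail to zero.
 */
static void zero_block_tail(unsigned char *block, unsigned long i_size)
{
	unsigned long off = i_size & (BLOCK_SIZE - 1);

	if (off)		/* only a partial final block has a tail */
		memset(block + off, 0, BLOCK_SIZE - off);
}
```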
Signed-off-by: Zach Brown <zab@versity.com>
The d_prune_aliases in lock invalidation was thought to be safe because
the caller held an inode reference, so surely it couldn't get into
iput_final.
I missed the fundamental dcache pattern that dput can ascend through
parents and end up in inode eviction for entirely unrelated inodes.
It's very easy for this to deadlock: imagine, if nothing else, that the
lock that inode invalidation blocks on in
dput->iput->evict->delete->lock is itself in the list of locks to
invalidate in the caller.
We fix this by always kicking off d_prune and dput into async work.
This increases the chance that inodes will still be referenced after
invalidation, preventing inline deletion. More deletions can be
deferred until the orphan scanner finds them. It should be rare,
though. We're still likely to put and drop invalidated inodes before a
writer gets around to removing the final unlink and asking us for the
omap that describes our cached inodes.
To perform the d_prune in work we make it a behavioural flag and make
our queued iputs a little more robust. We use much safer and more
understandable locking to cover the count and the new flags, and we
queue re-entrant work items on their own workqueue instead of using one
work instance in the system_wq.
Signed-off-by: Zach Brown <zab@versity.com>
Add a quick test of the index items to make sure that rapid inode
updates don't create duplicate meta_seq items.
Signed-off-by: Zach Brown <zab@versity.com>
FS items are deleted by logging a deletion item that has a greater item
version than the item to delete. The versions are usually maintained by
the write_seq of the exclusive write lock that protects the item. Any
newer write hold will have a greater version than all previous write
holds, so any items created under the lock will have a greater version
than all previous items under the lock. All deletion items will be merged
with the older item and both will be dropped.
This doesn't work for concurrent write-only locks. The write-only locks
match with each other, so their write_seqs are assigned in the order
that they are granted. That grant order can be mismatched with item
creation order. We can get deletion items with lesser versions than the
item to delete because of when each creation's write-only lock was
granted.
Write-only locks are used to maintain consistency between concurrent
writers and readers, not between writers. Consistency between writers
is done with another primary write lock. For example, if you're writing
seq items to a write-only region you need to have the write lock on the
inode for the specific seq item you're writing.
The fix, then, is to pass these primary write locks down to the item
cache so that it can choose an item version that is the greatest amongst
the transaction, the write-only lock, and the primary lock. This now
ensures that the primary lock's increasing write_seq makes it down to
the item, bringing item version ordering in line with exclusive holds of
the primary lock.
All of this to fix concurrent inode updates sometimes leaving behind
duplicate meta_seq items because old seq item deletions ended up with
older versions than the seq item they tried to delete, nullifying the
deletion.
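The version selection can be sketched as follows (illustrative names,
not the actual item cache code): taking the greatest of the three
sequences means a deletion performed under a later hold of the primary
lock always gets a version greater than the item it deletes, even when
write-only lock grants were reordered relative to item creation.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Pick an item version as the greatest of the transaction's seq, the
 * write-only lock's write_seq, and the primary write lock's write_seq.
 * The primary lock's increasing write_seq is what restores ordering
 * between item creation and later deletion.
 */
static uint64_t item_version(uint64_t trans_seq, uint64_t wronly_seq,
			     uint64_t primary_seq)
{
	uint64_t vers = trans_seq;

	if (wronly_seq > vers)
		vers = wronly_seq;
	if (primary_seq > vers)
		vers = primary_seq;
	return vers;
}
```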
Signed-off-by: Zach Brown <zab@versity.com>
Now that we've removed the hash and pos from the dentry_info struct we
can do without it. We can store the refresh gen in the d_fsdata pointer
(sorry, 64-bit only for now... we could allocate if we needed to.) This gets
rid of the lock coverage spinlocks and puts a bit more pressure on lock
lookup, which we already know we have to make more efficient. We can
get rid of all the dentry info allocation calls.
Now that we're not setting d_op as we allocate d_fsdata we put the ops
on the super block so that we get d_revalidate called on all our
dentries.
We also are a bit more precise about the errors we can return from
verification. If the target of a dentry link changes then we return
-ESTALE rather than silently performing the caller's operation on
another inode.
Signed-off-by: Zach Brown <zab@versity.com>
Add a lock call to get the current refresh_gen of a held lock. If the
lock doesn't exist or isn't readable then we return 0. This can be used
to track lock coverage of structures without the overhead and lifetime
binding of the lock coverage struct.
Signed-off-by: Zach Brown <zab@versity.com>
scoutfs_sysfs_exit() is called during error handling in module init.
When scoutfs is built-in (so, never.) the __exit section won't be
loaded. Remove the __exit annotation so it's always available to be
called.
Signed-off-by: Zach Brown <zab@versity.com>
The dentry cache life cycles are far too unpredictable to rely on
d_fsdata being kept in sync with the rest of the dentry fields.
Callers can do all sorts of crazy things with dentries. Only unlink
and rename need these
fields and those operations are already so expensive that item lookups
to get the current actual hash and pos are lost in the noise.
Signed-off-by: Zach Brown <zab@versity.com>
The test shell helpers for saving and restoring mount options were
trying to put each mount's option value in an array. It meant to build
the array key by concatenating the option name and the mount number.
But it didn't isolate the option "name" variable when evaluating it,
instead always evaluating "name_" to nothing and building keys for all
options that only contained the mount index. This then broke when tests
attempted to save and restore multiple options.
Signed-off-by: Zach Brown <zab@versity.com>
Add mount options for the size of preallocation and for whether or not
it should be restricted to extending writes. Disabling the default
restriction to streaming writes lets preallocation happen in aligned
regions of the preallocation size when they contain no extents.
Signed-off-by: Zach Brown <zab@versity.com>
The orphan_scan_delay_ms option setting code mistakenly set the default
before testing the option for -1 (not the default) to discover if
multiple options had been set. This made any attempt to set the option fail.
Initialize the option to -1 so the first set succeeds and apply the
default if we don't set the value.
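The corrected ordering can be sketched as follows (a minimal model;
the names and default value are illustrative, not the actual option
code): the duplicate test only works if -1 still means "unset" when
the first set arrives, so the default must be applied after parsing.

```c
#include <assert.h>

#define OPT_UNSET			(-1)
#define ORPHAN_SCAN_DELAY_DEFAULT	10000	/* illustrative default */

/*
 * The option starts at -1 ("unset").  A duplicate set is detected by
 * seeing a value other than -1 before storing.  The bug was applying
 * the default before this test, which made every first set look like a
 * duplicate and fail.
 */
static int set_delay(long *opt, long val)
{
	if (*opt != OPT_UNSET)
		return -1;	/* option given more than once */
	*opt = val;
	return 0;
}

/* only after parsing do we fall back to the default */
static void apply_default(long *opt)
{
	if (*opt == OPT_UNSET)
		*opt = ORPHAN_SCAN_DELAY_DEFAULT;
}
```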
Signed-off-by: Zach Brown <zab@versity.com>
The simple-xattr-unit test had a helper that failed by exiting with
non-zero instead of emitting a message. Let's make it a bit easier to
see what's going on.
Signed-off-by: Zach Brown <zab@versity.com>
Add support for POSIX ACLs as described in acl(5). Support is
enabled by default and can be explicitly enabled or disabled with the
acl or noacl mount options, respectively.
Signed-off-by: Zach Brown <zab@versity.com>
The upcoming acl support wants to be able to get and set xattrs from
callers who already have cluster locks and transactions. We refactor
the existing xattr get and set calls into locked and unlocked variants.
It's mostly boring code motion with the unfortunate situation that the
caller needs to acquire the totl cluster lock before holding a
transaction before calling into the xattr code. We push the parsing of
the tags to the caller of the locked get and set so that they can know
to acquire the right lock. (The acl callers will never be setting
scoutfs. prefixed xattrs so they will never have tags.)
Signed-off-by: Zach Brown <zab@versity.com>
Move to the use of the array of xattr_handler structs on the super to
dispatch set and get from generic_ based on the xattr prefix. This
will make it easier to add handling of the pseudo "system." ACL xattrs.
Signed-off-by: Zach Brown <zab@versity.com>
try_delete_inode_items() is responsible for making sure that it's safe
to delete an inode's persistent items. One of the things it has to
check is that there isn't another deletion attempt on the inode in this
mount. It sets a bit in lock data while it's working and backs off if
the bit is already set.
Unfortunately it was always clearing this bit as it exited, regardless
of whether it set it or not. This would let the next attempt perform
the deletion again before the working task had finished. This was often
not a problem because background orphan scanning is the only source of
regular concurrent deletion attempts.
But it's a big problem if a deletion attempt takes a very long time. It
gives enough time for an orphan scan attempt to clear the bit, try
again, and clobber whoever is performing the very slow deletion.
I hit this in a test that built files with an absurd number of
fragmented extents. The second concurrent orphan attempt was able to
proceed with deletion and performed a bunch of duplicate data extent
frees and caused corruption.
The fix is to only clear the bit if we set it. Now all concurrent
attempts will back off until the first task is done.
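The back-off fix can be sketched as follows (a simplified,
single-threaded model of the lock-data bit; names are illustrative):
each attempt records whether it set the bit, and only the attempt that
set it clears it on exit.

```c
#include <assert.h>
#include <stdbool.h>

static bool deleting;	/* stands in for the bit in lock data */

/* returns 0 and claims the bit, or -1 to tell the caller to back off */
static int try_delete(bool *we_set_it)
{
	if (deleting) {
		*we_set_it = false;
		return -1;	/* another attempt in flight, back off */
	}
	deleting = true;
	*we_set_it = true;
	return 0;
}

/*
 * The bug was clearing unconditionally here, which let the next
 * attempt start deleting while the slow first task was still working.
 */
static void finish_delete(bool we_set_it)
{
	if (we_set_it)
		deleting = false;
}
```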
Signed-off-by: Zach Brown <zab@versity.com>
Add a test which gives the server a transaction with a free list block
that contains blknos that each dirty an individual btree block in the
global data free extent btree.
Signed-off-by: Zach Brown <zab@versity.com>
Recently scoutfs_alloc_move() was changed to try and limit the amount of
metadata blocks it could allocate or free. The intent was to stop
concurrent holders of a transaction from fully consuming the available
allocator for the transaction.
The limiting logic was a bit off. It stopped when the allocator had the
caller's limit remaining, not when it had consumed the caller's limit.
This is overly permissive and could still allow concurrent callers to
consume the allocator. It was also triggering warning messages when a
call consumed more than its allowed budget while holding a transaction.
Unfortunately, we don't have per-caller tracking of allocator resource
consumption. The best we can do is sample the allocators as we start
and return if they drop by the caller's limit. This is overly
conservative in that it charges any consumption by concurrent callers
to all callers.
This isn't perfect but it makes the failure case less likely and the
impact shouldn't be significant. We don't often have a lot of
concurrency and the limits are larger than callers will typically
consume.
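The corrected check can be sketched as follows (illustrative names,
not the actual allocator code): the old test stopped when the pool had
the budget remaining, which let each caller run it nearly dry; the fix
snapshots the available count at the start and stops once the pool has
dropped by the caller's budget.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Stop once the pool has dropped by the caller's budget since the
 * caller's starting snapshot.  Consumption by concurrent callers is
 * charged to everyone, which errs on the conservative side.
 */
static int hit_limit(uint64_t avail_at_start, uint64_t avail_now,
		     uint64_t budget)
{
	return avail_at_start - avail_now >= budget;
}

/*
 * The buggy version, for contrast: with a large pool this never fires
 * until the pool is nearly exhausted, no matter how much one caller
 * consumed.
 */
static int hit_limit_buggy(uint64_t avail_now, uint64_t budget)
{
	return avail_now <= budget;
}
```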
Signed-off-by: Zach Brown <zab@versity.com>
Add scoutfs_alloc_meta_low_since() to test if the metadata avail or
freed resources have been used by a given amount since a previous
snapshot.
Signed-off-by: Zach Brown <zab@versity.com>
As _get_log_trees() in the server prepares the log_trees item for the
client's commit, it moves all the freed data extents from the log_trees
item into core data extent allocator btree items. If the freed blocks
are very fragmented then it can exceed a commit's metadata allocation
budget trying to dirty blocks in the free data extent btree.
The fix is to move the freed data extents in multiple commits. First we
move a limited number in the main commit that does all the rest of the
work preparing the commit. Then we try to move the remaining freed
extents in multiple additional commits.
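The batching can be sketched as follows (a toy model with illustrative
names; the real code dirties free-extent btree blocks inside server
commits): draining a bounded number of extents per commit keeps any
single commit within its metadata allocation budget.

```c
#include <assert.h>

/*
 * Move freed data extents into the core free extent btree a bounded
 * batch at a time, one batch per commit, until none remain.  Returns
 * the number of commits used.
 */
static int drain_freed_extents(int total, int per_commit)
{
	int commits = 0;

	while (total > 0) {
		int n = total < per_commit ? total : per_commit;

		/* ...dirty free-extent btree blocks for n extents... */
		total -= n;
		commits++;	/* each batch is its own commit */
	}
	return commits;
}
```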
Signed-off-by: Zach Brown <zab@versity.com>
Callers who send to specific client connections can get -ENOTCONN if
their client has gone away. We forgot to free the send tracking struct
in that case.
Signed-off-by: Zach Brown <zab@versity.com>
The omap code keeps track of rids that are connected to the server. It
only freed the tracked rids as the server told it that rids were being
removed. But that removal only happened as clients were evicted. If
the server shut down it'd leave the old rid entries around. They'd be
leaked as the mount was unmounted and could linger and create duplicate
entries if the server started back up and the same clients reconnected.
The fix is to free the tracking rids as the server shuts down. They'll
be rebuilt as clients reconnect if the server restarts.
Signed-off-by: Zach Brown <zab@versity.com>
If we return an error from .fill_super without having set sb->s_root
then the vfs won't call our put_super. Our fill_super is careful to
call put_super so that it can tear down partial state, but we weren't
doing this with a few very early errors in fill_super. This tripped
leak detection when we weren't freeing the sbi when returning errors
from bad option parsing.
Signed-off-by: Zach Brown <zab@versity.com>