Now that we've removed its users, we can remove the global saved copy
of the super block from scoutfs_sb_info.
Signed-off-by: Zach Brown <zab@versity.com>
As the server does its work, its transactions modify a dirty super
block in memory. This used the global super block in scoutfs_sb_info,
which was visible to everything, including the client. Move the dirty
super block over to the private server info so that only the server
can see it.
This is mostly boring storage motion, but there is one behavioral
change: the quorum code now hands the server a static copy of the
quorum config to use as it starts up, before it reads the most recent
super block.
Signed-off-by: Zach Brown <zab@versity.com>
The server's statfs request handler intended to lock dirty structures
as it walked them to get the sums used for statfs fields. Other
callers walk stable structures, though, so the summation calls had
grown to iterate over other structures that the server didn't know it
had to lock.
This meant that the server was walking unlocked dirty structures as
they were being modified. The races are very tight, but they can
result in request handling errors that shut down connections and IO
errors from trying to read inconsistent refs as they're modified by
the locked writer.
We've built up infrastructure so the server can now walk stable
structures just like the other callers. It will no longer wander into
dirty blocks, so it doesn't need to lock them, and it will retry if
its walk of stale data crosses a broken reference.
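
The summation now has roughly this shape, sketched with illustrative
names rather than the actual scoutfs calls:

  /*
   * Sketch: sample the last stable super, walk it without locking,
   * and retry with a fresh sample if the walk crossed a ref that a
   * writer has since rewritten.  Persistent staleness comes back as
   * a hard error rather than -ESTALE.
   */
  static int statfs_sum_stable(struct super_block *sb, u64 *sum)
  {
          struct scoutfs_super_block stable;
          int ret;

          do {
                  server_get_stable_super(sb, &stable);
                  ret = sum_frees_from_super(sb, &stable, sum);
          } while (ret == -ESTALE);

          return ret;
  }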
Signed-off-by: Zach Brown <zab@versity.com>
Transition from manual checking for persistent ESTALE to the shared
helper that we just added. This should not change behavior.
Signed-off-by: Zach Brown <zab@versity.com>
Many readers had little private implementations of the logic that
decides whether to retry stale reads with different refs or to treat
the staleness as persistent and return hard errors. Let's move that
into a small helper.
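
A sketch of the helper's logic, with hypothetical names; the real
helper may compare refs differently:

  /*
   * Sketch: a read that hit a stale ref can be retried if the ref
   * has changed since it was sampled; if the ref is unchanged then
   * the staleness is persistent and becomes a hard error.
   */
  static int stale_error(struct scoutfs_block_ref *sampled,
                         struct scoutfs_block_ref *cur)
  {
          if (memcmp(sampled, cur, sizeof(*cur)) != 0)
                  return -ESTALE; /* ref moved, caller retries */
          return -EIO;            /* nothing changed, persistent */
  }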
Signed-off-by: Zach Brown <zab@versity.com>
scoutfs_forest_inode_count() assumed it was called with stable refs
and would always translate ESTALE to EIO. Change it so that it passes
ESTALE to the caller, who is responsible for handling it.
The server will use this to retry reading from stable supers that it's
storing in memory.
Signed-off-by: Zach Brown <zab@versity.com>
The server has a mechanism for tracking the last stable roots used by
network RPCs. We expand it a bit to include the entire super so that
we can add users in the server which want the last full stable super.
We can still use the stable super to give out the stable roots.
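
A sketch of the expanded tracking; the struct layout, field names,
and set of roots are illustrative:

  struct server_info {
          spinlock_t lock;
          struct scoutfs_super_block stable_super; /* last committed */
  };

  /* new users that want the whole last stable super */
  static void get_stable_super(struct server_info *server,
                               struct scoutfs_super_block *super)
  {
          spin_lock(&server->lock);
          *super = server->stable_super;
          spin_unlock(&server->lock);
  }

  /* the existing stable roots are now served from the stable super */
  static void get_stable_roots(struct server_info *server,
                               struct scoutfs_net_roots *roots)
  {
          struct scoutfs_super_block super;

          get_stable_super(server, &super);
          roots->fs_root = super.fs_root;
          roots->logs_root = super.logs_root;
          roots->srch_root = super.srch_root;
  }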
Signed-off-by: Zach Brown <zab@versity.com>
The quorum code was using the copy of the super block in the sb info
for its config. With that going away, we make the different users
reference the config more carefully. The quorum agent has a copy that
it reads on setup, the client rarely reads a copy when trying to
connect, and the server uses its super.
This is about data access isolation and should have no functional effect
other than to cause more super reads.
Signed-off-by: Zach Brown <zab@versity.com>
A few paths throughout the code get the fsid for the current mount by
using the copy of the super block that we store in the scoutfs_sb_info
for the mount. We'd like to remove the super block from the sbi and
it's cleaner to have a specific constant field for the fsid of the mount
which will not change.
Signed-off-by: Zach Brown <zab@versity.com>
When we truncate away from a partial block we need to zero its tail that
was past i_size and dirty it so that it's written.
We missed the typical vfs boilerplate of calling block_truncate_page
from setattr->set_size that does this. We need to be a little careful
to pass our file lock down to get_block and then queue the inode for
writeback so it's written out with the transaction. This follows the
pattern in .write_end.
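
The setattr path ends up looking roughly like this sketch;
scoutfs_get_block, the block size constant, and the writeback helper
are stand-ins for the real names, with the file lock made visible to
get_block as in .write_end:

  /* in setattr, when the new smaller i_size lands inside a block */
  if ((attr->ia_valid & ATTR_SIZE) &&
      (attr->ia_size & (SCOUTFS_BLOCK_SIZE - 1))) {
          /* zero the tail of the now-final partial block */
          ret = block_truncate_page(inode->i_mapping, attr->ia_size,
                                    scoutfs_get_block);
          if (ret < 0)
                  goto out;
          /* make sure the dirtied page is written with this trans */
          scoutfs_inode_queue_writeback(inode);
  }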
Signed-off-by: Zach Brown <zab@versity.com>
The d_prune_aliases in lock invalidation was thought to be safe
because the caller had an inode reference: surely it couldn't get into
iput_final. I missed the fundamental dcache pattern that dput can
ascend through parents and end up in inode eviction for entirely
unrelated inodes.
It's very easy for this to deadlock. Imagine, if nothing else, that
the lock the inode invalidation blocks on in
dput->iput->evict->delete->lock is itself in the list of locks to
invalidate in the caller.
We fix this by always kicking off the d_prune and dput into async
work. This increases the chance that inodes will still be referenced
after invalidation, preventing inline deletion, so more deletions can
be deferred until the orphan scanner finds them. It should be rare,
though. We're still likely to put and drop invalidated inodes before
a writer gets around to removing the final link and asking us for the
omap that describes our cached inodes.
To perform the d_prune in work, we make it a behavioural flag and
make our queued iputs a little more robust. We use much safer and
more understandable locking to cover the count and the new flags, and
we queue re-entrant work items on their own workqueue instead of
using one work instance in the system_wq.
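
A sketch of the deferral, with hypothetical flag, field, and
workqueue names:

  /*
   * Sketch: invalidation only marks the inode and queues work.  The
   * work is where dput can safely ascend through parents and fall
   * into eviction of unrelated inodes without deadlocking the
   * invalidation caller.
   */
  static void iput_worker(struct work_struct *work)
  {
          struct scoutfs_inode_info *si = container_of(work,
                          struct scoutfs_inode_info, iput_work);
          struct inode *inode = &si->inode;

          if (test_and_clear_bit(SI_PRUNE_ALIASES, &si->flags))
                  d_prune_aliases(inode);
          iput(inode); /* reference handed over by the queueing caller */
  }

  /* called from lock invalidation instead of pruning inline */
  static void queue_prune_iput(struct scoutfs_inode_info *si)
  {
          set_bit(SI_PRUNE_ALIASES, &si->flags);
          queue_work(si->iput_workq, &si->iput_work);
  }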
Signed-off-by: Zach Brown <zab@versity.com>
Add a quick test of the index items to make sure that rapid inode
updates don't create duplicate meta_seq items.
Signed-off-by: Zach Brown <zab@versity.com>
FS items are deleted by logging a deletion item that has a greater
item version than the item to delete. The versions are usually
maintained by the write_seq of the exclusive write lock that protects
the item. Any newer write hold will have a greater version than all
previous write holds, so any items created under the lock will have a
greater version than all previous items under the lock. Each deletion
item will eventually be merged with the older item and both will be
dropped.
This doesn't work for concurrent write-only locks. The write-only
locks match with each other, so their write_seqs are assigned in the
order that they are granted. That grant order can be mismatched with
item creation order. We can get deletion items with lesser versions
than the item to delete because of when each creation's write-only
lock was granted.
Write-only locks are used to maintain consistency between concurrent
writers and readers, not between writers. Consistency between writers
is done with another primary write lock. For example, if you're
writing seq items to a write-only region you need to hold the write
lock on the inode for the specific seq item you're writing.
The fix, then, is to pass these primary write locks down to the item
cache so that it can choose an item version that is the greatest
amongst the transaction, the write-only lock, and the primary lock.
This ensures that the primary lock's increasing write_seq makes it
down to the item, bringing item version ordering in line with
exclusive holds of the primary lock.
All of this to fix concurrent inode updates sometimes leaving behind
duplicate meta_seq items because old seq item deletions ended up with
older versions than the seq item they tried to delete, nullifying the
deletion.
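
The version choice in the item cache becomes a max across everything
the caller holds, roughly as in this sketch with illustrative names:

  /*
   * Sketch: a deletion item logged under the primary write lock now
   * always gets a version greater than items created under earlier
   * exclusive holds of that lock, regardless of write-only lock
   * grant order.
   */
  static u64 choose_item_version(struct scoutfs_lock *wronly,
                                 struct scoutfs_lock *primary,
                                 u64 trans_seq)
  {
          u64 vers = trans_seq;

          if (wronly)
                  vers = max(vers, wronly->write_seq);
          if (primary)
                  vers = max(vers, primary->write_seq);

          return vers;
  }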
Signed-off-by: Zach Brown <zab@versity.com>
Now that we've removed the hash and pos from the dentry_info struct
we can do without it entirely. We can store the refresh gen in the
d_fsdata pointer (sorry, 64-bit only for now; we could allocate if we
needed to). This gets rid of the lock coverage spinlocks and puts a
bit more pressure on lock lookup, which we already know we have to
make more efficient. We can get rid of all the dentry info allocation
calls.
Now that we're not setting d_op as we allocate d_fsdata, we put the
ops on the super block so that we get d_revalidate called on all our
dentries.
We are also a bit more precise about the errors we can return from
verification. If the target of a dentry link changes then we return
-ESTALE rather than silently performing the caller's operation on
another inode.
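
Storing the gen in the pointer itself looks like this sketch, and
setting sb->s_d_op at mount replaces the per-dentry d_op assignment;
the accessor names are illustrative:

  /* d_fsdata holds the refresh_gen value itself, not an allocation */
  static void dentry_set_refresh_gen(struct dentry *dentry, u64 gen)
  {
          /* 64-bit only: the gen must fit in the pointer */
          dentry->d_fsdata = (void *)(unsigned long)gen;
  }

  static u64 dentry_refresh_gen(struct dentry *dentry)
  {
          return (unsigned long)dentry->d_fsdata;
  }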
Signed-off-by: Zach Brown <zab@versity.com>
Add a lock call to get the current refresh_gen of a held lock. If
the lock doesn't exist or isn't readable then we return 0. This can
be used to track lock coverage of structures without the overhead and
lifetime binding of the lock coverage struct.
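
A sketch of how a caller might use it; the call name here is
illustrative:

  static bool still_covered(struct super_block *sb,
                            struct scoutfs_key *key, u64 recorded_gen)
  {
          u64 gen = scoutfs_lock_refresh_gen(sb, key);

          /* 0 means the lock doesn't exist or isn't readable */
          return gen != 0 && gen == recorded_gen;
  }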
Signed-off-by: Zach Brown <zab@versity.com>
scoutfs_sysfs_exit() is called during error handling in module init.
When scoutfs is built-in (so, never) the __exit section won't be
loaded. Remove the __exit annotation so it's always available to be
called.
Signed-off-by: Zach Brown <zab@versity.com>
The dentry cache life cycles are far too crazy to rely on d_fsdata
being kept in sync with the rest of the dentry fields. Callers can do
all sorts of surprising things with dentries. Only unlink and rename
need these fields, and those operations are already so expensive that
item lookups to get the current actual hash and pos are lost in the
noise.
Signed-off-by: Zach Brown <zab@versity.com>
The test shell helpers for saving and restoring mount options were
trying to put each mount's option value in an array. They meant to
build the array key by concatenating the option name and the mount
number. But they didn't isolate the option "name" variable when
evaluating it, instead always evaluating "name_" to nothing and
building keys for all options that contained only the mount index.
This then broke when tests attempted to save and restore multiple
options.
Signed-off-by: Zach Brown <zab@versity.com>
Add mount options for the preallocation size and for whether or not
preallocation should be restricted to extending writes. Disabling the
default restriction to streaming writes lets it preallocate aligned
regions of the preallocation size when they contain no extents.
Signed-off-by: Zach Brown <zab@versity.com>
The orphan_scan_delay_ms option setting code mistakenly set the
default before testing the option for -1 (not yet set) to discover if
multiple options had been set. This made any attempt to set the
option fail. Initialize the option to -1 so the first set succeeds
and apply the default only if the value was never set.
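
The fixed ordering, sketched with illustrative names:

  /* -1 means "not set yet"; a second explicit set is an error */
  static int parse_orphan_scan_delay(struct mount_options *opts, int val)
  {
          if (opts->orphan_scan_delay_ms != -1)
                  return -EINVAL; /* option given more than once */
          opts->orphan_scan_delay_ms = val;
          return 0;
  }

  /* only applied after all options have been parsed */
  static void apply_defaults(struct mount_options *opts)
  {
          if (opts->orphan_scan_delay_ms == -1)
                  opts->orphan_scan_delay_ms = DEFAULT_ORPHAN_SCAN_DELAY_MS;
  }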
Signed-off-by: Zach Brown <zab@versity.com>
The simple-xattr-unit test had a helper that failed by silently
exiting non-zero instead of emitting a message. Let's make it a bit
easier to see what's going on.
Signed-off-by: Zach Brown <zab@versity.com>
Add support for POSIX ACLs as described in acl(5). Support is
enabled by default and can be explicitly enabled or disabled with the
acl or noacl mount options, respectively.
Signed-off-by: Zach Brown <zab@versity.com>
The upcoming acl support wants to be able to get and set xattrs from
callers who already hold cluster locks and transactions. We refactor
the existing xattr get and set calls into locked and unlocked
variants. It's mostly boring code motion, with the unfortunate
wrinkle that the caller needs to acquire the totl cluster lock before
holding a transaction and calling into the xattr code. We push the
parsing of the tags to the callers of the locked get and set so that
they know to acquire the right lock. (The acl callers will never be
setting scoutfs. prefixed xattrs so they will never have tags.)
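
A sketch of the split; the lock and transaction helper names are
illustrative, not the exact scoutfs calls:

  /* callers of the _locked variant already hold lock and trans */
  static int xattr_set_locked(struct inode *inode, const char *name,
                              const void *value, size_t size,
                              struct scoutfs_lock *lock)
  {
          return set_xattr_items(inode, name, value, size, lock);
  }

  /* the unlocked wrapper preserves the old entry point */
  static int xattr_set(struct inode *inode, const char *name,
                       const void *value, size_t size)
  {
          struct scoutfs_lock *lock = NULL;
          int ret;

          ret = scoutfs_lock_inode(inode, SCOUTFS_LOCK_WRITE, &lock);
          if (ret < 0)
                  return ret;

          ret = scoutfs_hold_trans(inode->i_sb);
          if (ret == 0) {
                  ret = xattr_set_locked(inode, name, value, size, lock);
                  scoutfs_release_trans(inode->i_sb);
          }

          scoutfs_unlock_inode(inode, lock);
          return ret;
  }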
Signed-off-by: Zach Brown <zab@versity.com>
Move to using the array of xattr_handler structs on the super block
to dispatch set and get from the generic_ helpers based on the xattr
prefix. This will make it easier to add handling of the pseudo
"system." ACL xattrs.
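
The dispatch pattern, sketched; handler internals are elided, the
get/set handler signatures vary by kernel version, and the ACL
entries show where the upcoming change slots in.  The array is
installed with sb->s_xattr at mount:

  static const struct xattr_handler scoutfs_xattr_handler = {
          .prefix = "",   /* catch-all for the scoutfs xattr paths */
          .get    = scoutfs_xattr_get,
          .set    = scoutfs_xattr_set,
  };

  static const struct xattr_handler *scoutfs_xattr_handlers[] = {
          &posix_acl_access_xattr_handler,  /* system.posix_acl_access */
          &posix_acl_default_xattr_handler, /* system.posix_acl_default */
          &scoutfs_xattr_handler,
          NULL,
  };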
Signed-off-by: Zach Brown <zab@versity.com>
try_delete_inode_items() is responsible for making sure that it's safe
to delete an inode's persistent items. One of the things it has to
check is that there isn't another deletion attempt on the inode in this
mount. It sets a bit in lock data while it's working and backs off if
the bit is already set.
Unfortunately it was always clearing this bit as it exited, regardless
of whether it set it or not. This would let the next attempt perform
the deletion again before the working task had finished. This was often
not a problem because background orphan scanning is the only source of
regular concurrent deletion attempts.
But it's a big problem if a deletion attempt takes a very long time.
That gives an orphan scan attempt enough time to clear the bit, try
again, and clobber whoever is performing the very slow deletion.
I hit this in a test that built files with an absurd number of
fragmented extents. The second concurrent orphan attempt was able to
proceed with deletion and performed a bunch of duplicate data extent
frees and caused corruption.
The fix is to only clear the bit if we set it. Now all concurrent
attempts will back off until the first task is done.
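
The fixed pattern, reduced to a sketch that elides the function's
other safety checks; the flag and helper names are illustrative:

  /* only the task that set the bit may clear it */
  static int try_delete(struct lock_data *ldata, struct inode *inode)
  {
          int ret;

          if (test_and_set_bit(LDATA_DELETING, &ldata->flags))
                  return 0; /* another attempt is working, back off */

          ret = delete_inode_items(inode);

          clear_bit(LDATA_DELETING, &ldata->flags);
          return ret;
  }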
Signed-off-by: Zach Brown <zab@versity.com>
Add a test which gives the server a transaction with a free list
block containing blknos that each dirty an individual btree block in
the global data free extent btree.
Signed-off-by: Zach Brown <zab@versity.com>
Recently scoutfs_alloc_move() was changed to try to limit the number
of metadata blocks it could allocate or free. The intent was to stop
concurrent holders of a transaction from fully consuming the
available allocator for the transaction.
The limiting logic was a bit off. It stopped when the allocator had the
caller's limit remaining, not when it had consumed the caller's limit.
This is overly permissive and could still allow concurrent callers to
consume the allocator. It was also triggering warning messages when a
call consumed more than its allowed budget while holding a transaction.
Unfortunately, we don't have per-caller tracking of allocator
resource consumption. The best we can do is sample the allocators as
we start and return once they've dropped by the caller's limit. This
is overly conservative in that it accounts any consumption by
concurrent callers to all callers.
This isn't perfect but it makes the failure case less likely and the
impact shouldn't be significant. We don't often have a lot of
concurrency and the limits are larger than callers will typically
consume.
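
The sampling, sketched with illustrative helpers around the
scoutfs_alloc_meta_low_since() call described in the following
message; the real signature may differ:

  static int move_extents_budgeted(struct scoutfs_alloc *alloc,
                                   u64 budget)
  {
          struct meta_sample start;
          int ret = 0;

          /* sample once on entry; no per-caller consumption tracking */
          sample_meta_counters(alloc, &start);

          while (ret == 0 && have_extents_to_move(alloc)) {
                  /* stop once a counter has dropped by our budget */
                  if (scoutfs_alloc_meta_low_since(alloc, &start, budget))
                          break;
                  ret = move_one_extent(alloc);
          }

          return ret;
  }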
Signed-off-by: Zach Brown <zab@versity.com>
Add scoutfs_alloc_meta_low_since() to test if the metadata avail or
freed resources have been consumed by a given amount since a previous
snapshot.
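
A sketch of the semantics with a hypothetical snapshot struct; the
real call's signature and counter handling may differ:

  struct meta_sample {
          u64 avail;
          u64 freed;
  };

  static bool meta_low_since(struct meta_sample *snap,
                             struct meta_sample *now, u64 amount)
  {
          /*
           * Avail is drawn down by allocation and the freed list
           * fills as blocks are freed; either moving by the amount
           * counts as that much of the budget being consumed.
           */
          return (snap->avail - now->avail >= amount) ||
                 (now->freed - snap->freed >= amount);
  }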
Signed-off-by: Zach Brown <zab@versity.com>