The .get_acl() method now gets passed a mnt_idmap arg, and we can now
choose to implement either .get_acl() or .get_inode_acl(). Technically
.get_acl() is the new method and .get_inode_acl() is the old one. That
second method now also gets passed an rcu flag, but we should be fine
either way.
Deeper under the covers, however, we do need to hook up the .set_acl()
method for inodes, otherwise setfacl will just fail with -ENOTSUPP. To
keep this from getting super messy (it already is), we tack the
get_acl() changes on here.
This is all roughly ca. v6.1-rc1-4-g7420332a6ff4.
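For reference, the hookup on newer kernels ends up roughly shaped like
the sketch below; the function and helper names are illustrative rather
than the exact ones in the tree:

    /* sketch, assuming the ~v6.3+ signatures that take an mnt_idmap */
    static struct posix_acl *scoutfs_get_acl(struct mnt_idmap *idmap,
                                             struct dentry *dentry, int type)
    {
            return scoutfs_read_acl(d_inode(dentry), type); /* hypothetical helper */
    }

    static int scoutfs_set_acl(struct mnt_idmap *idmap, struct dentry *dentry,
                               struct posix_acl *acl, int type)
    {
            return scoutfs_write_acl(d_inode(dentry), acl, type); /* hypothetical helper */
    }

    static const struct inode_operations scoutfs_file_iops = {
            .get_acl = scoutfs_get_acl,
            .set_acl = scoutfs_set_acl,
    };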
Signed-off-by: Auke Kok <auke.kok@versity.com>
The value of `ret` is not initialized. If the writeback list is empty,
or if igrab() fails on the only inode on the list, `ret` is returned
without ever being set. This would cause the caller to needlessly
retry, possibly making things worse.
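A minimal sketch of the fix, with the surrounding names made up for
illustration:

    int ret = 0;    /* previously uninitialized */

    list_for_each_entry_safe(si, tmp, &wb_list, wb_entry) {
            inode = igrab(&si->inode);
            if (!inode)
                    continue;       /* an empty or fully-skipped walk returns 0 */
            ret = write_inode_items(inode);
            iput(inode);
            if (ret)
                    break;
    }
    return ret;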
Signed-off-by: Auke Kok <auke.kok@versity.com>
The iput worker can accumulate quite a bit of pending work to do. We've
seen hung task warnings while it's doing its work (admittedly in debug
kernels). There's no harm in throwing in a cond_resched so other tasks
get a chance to do work.
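Something along these lines (the loop shown is illustrative):

    while ((inode = pop_queued_iput(sbi)) != NULL) {
            iput(inode);
            cond_resched();         /* let other tasks run between puts */
    }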
Signed-off-by: Zach Brown <zab@versity.com>
The issue with the previous attempt to fix the orphan-inodes test was
that we would regularly exceed the 120s timeout value put in there.
Instead, in this commit, we change the code to add a new counter that
indicates orphan deletion progress. When orphan inodes are deleted,
this counter increments. Conversely, when the orphan scan attempts
counter increments without the progress counter moving, we know that
there was no more work to be done.
For safety, the test case waits until 2 consecutive scan attempts have
been made without forward progress.
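On the kernel side the change amounts to bumping two counters that the
test can compare across scans; the counter names here are illustrative:

    if (nr_deleted > 0)
            scoutfs_inc_counter(sb, orphan_inode_delete);   /* forward progress */
    scoutfs_inc_counter(sb, orphan_scan_attempt);           /* a scan completed */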
Signed-off-by: Auke Kok <auke.kok@versity.com>
v5.12-rc6-9-g4f0f586bf0c8
All list_sort functions use the list_cmp_func_t type, which compares
pointers to list_head members. These are now required to be `const`,
and the compiler now checks them. This propagates into our callers.
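Our comparison callbacks end up with this shape; the item struct below
is illustrative:

    static int cmp_by_seq(void *priv, const struct list_head *a,
                          const struct list_head *b)
    {
            const struct pending_item *pa = list_entry(a, struct pending_item, head);
            const struct pending_item *pb = list_entry(b, struct pending_item, head);

            return pa->seq < pb->seq ? -1 : pa->seq > pb->seq ? 1 : 0;
    }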
Signed-off-by: Auke Kok <auke.kok@versity.com>
Add support for project IDs. They're managed through the _attr_x
interfaces and are inherited from the parent directory during creation.
Signed-off-by: Zach Brown <zab@versity.com>
Add a bit to the private scoutfs inode flags which indicates that the
inode is in retention mode. The bit is visible through the _attr_x
interface. It can only be set on regular files and when set it prevents
modification to all but non-user xattrs. It can be cleared by root.
Signed-off-by: Zach Brown <zab@versity.com>
We're about to increase the inode size and increment the format version.
Inode reading and writing has to handle different valid inode sizes as
allowed by the format version. This is the initial skeletal work that
later patches which really increase the inode size will further refine
to add the specific known sizes and format versions.
Signed-off-by: Bryant G. Duffy-Ly <bduffyly@versity.com>
[zab@versity.com: reworded description, reworked to use _within]
Signed-off-by: Zach Brown <zab@versity.com>
We were using a seqcount to protect high frequency reads and writes to
some of our private inode fields. The writers were serialized by the
caller but that's a bit too easy to get wrong. We're already storing
the write seqcount update so the additional internal spinlock stores in
seqlocks aren't a significant additional overhead. The seqlocks also
handle preemption for us.
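Reads and writes of the covered fields then look roughly like this
(the field name is illustrative):

    static u64 scoutfs_inode_get_field(struct scoutfs_inode_info *si)
    {
            unsigned int seq;
            u64 val;

            do {
                    seq = read_seqbegin(&si->seqlock);
                    val = si->some_field;
            } while (read_seqretry(&si->seqlock, seq));

            return val;
    }

    static void scoutfs_inode_set_field(struct scoutfs_inode_info *si, u64 val)
    {
            write_seqlock(&si->seqlock);    /* internal spinlock serializes writers */
            si->some_field = val;
            write_sequnlock(&si->seqlock);
    }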
Signed-off-by: Zach Brown <zab@versity.com>
The aio_read and aio_write callbacks are no longer used by newer
kernels, which now use iter-based readers and writers.
We can avoid implementing plain .read and .write since the kernel
generates an iter for us when one is needed.
This means we need a slightly different data_wait_check() that
accounts for the iter and offset properly, so we add a new
data_wait_check_iter() function accordingly.
With these methods removed, the el8 kernel no longer uses the
extended ops wrapper struct and is much closer to upstream. A lot of
methods move between inode_dir_operations and inode_file_operations
etc., and things will hopefully look a bit more structured as a
result.
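The resulting file_operations are iter-only; the method names here are
illustrative of the shape rather than the exact entries:

    static const struct file_operations scoutfs_file_fops = {
            .llseek         = generic_file_llseek,
            .read_iter      = scoutfs_file_read_iter,   /* no plain .read/.write */
            .write_iter     = scoutfs_file_write_iter,
            .mmap           = generic_file_mmap,
            .fsync          = scoutfs_file_fsync,
            .unlocked_ioctl = scoutfs_ioctl,
    };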
Signed-off-by: Auke Kok <auke.kok@versity.com>
Provide a fallback in degraded mode for kernels before v4.15-rc3 by
directly manipulating the member as needed.
Signed-off-by: Auke Kok <auke.kok@versity.com>
Since v4.6-rc3-27-g9902af79c01a, inode->i_mutex has been replaced
with ->i_rwsem. However, inode_lock() and related functions have long
worked as intended either way and provide fully exclusive locking of
the inode.
To avoid a name clash on pre-rhel8 kernels, we have to rename a
stack variable in `src/file.c`.
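In other words we can lean on the wrappers everywhere and never touch
the underlying member directly:

    inode_lock(inode);      /* exclusive, whether backed by i_mutex or i_rwsem */
    /* ... modify the inode ... */
    inode_unlock(inode);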
Signed-off-by: Auke Kok <auke.kok@versity.com>
When we truncate away from a partial block we need to zero its tail that
was past i_size and dirty it so that it's written.
We missed the typical vfs boilerplate of calling block_truncate_page
from setattr->set_size that does this. We need to be a little careful
to pass our file lock down to get_block and then queue the inode for
writeback so it's written out with the transaction. This follows the
pattern in .write_end.
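The setattr path gains a call roughly like the following; the get_block
and writeback helper names are illustrative:

    /* zero and dirty the tail of the partial block at the new size */
    ret = block_truncate_page(inode->i_mapping, attr->ia_size,
                              scoutfs_get_block);
    if (ret)
            goto out;
    scoutfs_inode_queue_writeback(inode);   /* goes out with the transaction */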
Signed-off-by: Zach Brown <zab@versity.com>
The d_prune_aliases in lock invalidation was thought to be safe because
the caller had an inode reference, so surely it couldn't get into
iput_final. I missed the fundamental dcache pattern that dput can
ascend through parents and end up in inode eviction for entirely
unrelated inodes.
It's very easy for this to deadlock: imagine, if nothing else, that the
lock the invalidation is blocked on in dput->iput->evict->delete->lock
is itself in the list of locks to invalidate in the caller.
We fix this by always kicking off d_prune and dput into async work.
This increases the chance that inodes will still be referenced after
invalidation, preventing inline deletion. More deletions can be
deferred until the orphan scanner finds them. It should be rare,
though. We're still likely to put and drop invalidated inodes before a
writer gets around to removing the final unlink and asking us for the
omap that describes our cached inodes.
To perform the d_prune in work we make it a behavioural flag and make
our queued iputs a little more robust. We use much safer and more
understandable locking to cover the count and the new flags, and we put
the work in re-entrant work items in their own workqueue instead of one
work instance in the system_wq.
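The invalidation side then boils down to something like this, with the
flag, count, and work names made up for illustration:

    spin_lock(&si->iput_lock);
    si->iput_flags |= SCOUTFS_IPUT_PRUNE;   /* behavioural flag: d_prune in work */
    si->iput_count++;
    spin_unlock(&si->iput_lock);
    queue_work(sbi->iput_workq, &si->iput_work);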
Signed-off-by: Zach Brown <zab@versity.com>
FS items are deleted by logging a deletion item that has a greater item
version than the item to delete. The versions are usually maintained by
the write_seq of the exclusive write lock that protects the item. Any
newer write hold will have a greater version than all previous write
holds so any items created under the lock will have a greater vers than
all previous items under the lock. All deletion items will be merged
with the older item and both will be dropped.
This doesn't work for concurrent write-only locks. The write-only locks
match with each other so their write_seqs are assigned in the order
that they are granted. That grant order can be mismatched with item
creation order. We can get deletion items with lesser versions than the
item to delete because of when each creation's write-only lock was
granted.
Write only locks are used to maintain consistency between concurrent
writers and readers, not between writers. Consistency between writers
is done with another primary write lock. For example, if you're writing
seq items to a write-only region you need to have the write lock on the
inode for the specific seq item you're writing.
The fix, then, is to pass these primary write locks down to the item
cache so that it can choose an item version that is the greatest amongst
the transaction, the write-only lock, and the primary lock. This now
ensures that the primary lock's increasing write_seq makes it down to
the item, bringing item version ordering in line with exclusive holds of
the primary lock.
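In effect the item cache now does something like the following when
assigning versions (names are illustrative):

    /* the item version is the greatest of the three sequence sources */
    vers = max3(trans_seq, wo_lock->write_seq, primary_lock->write_seq);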
All of this to fix concurrent inode updates sometimes leaving behind
duplicate meta_seq items because old seq item deletions ended up with
older versions than the seq item they tried to delete, nullifying the
deletion.
Signed-off-by: Zach Brown <zab@versity.com>
Add support for POSIX ACLs as described in acl(5). Support is
enabled by default and can be explicitly enabled or disabled with the
acl or noacl mount options, respectively.
Signed-off-by: Zach Brown <zab@versity.com>
Move to the use of the array of xattr_handler structs on the super to
dispatch set and get from the generic_ entry points based on the xattr
prefix. This will make it easier to add handling of the pseudo
system.* ACL xattrs.
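The super ends up pointing at a prefix-keyed table along these lines
(handler names are illustrative):

    static const struct xattr_handler *scoutfs_xattr_handlers[] = {
            &scoutfs_xattr_user_handler,
            &scoutfs_xattr_trusted_handler,
            /* the system.* posix acl handlers slot in here with ACL support */
            NULL,
    };

    sb->s_xattr = scoutfs_xattr_handlers;   /* set at fill_super time */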
Signed-off-by: Zach Brown <zab@versity.com>
try_delete_inode_items() is responsible for making sure that it's safe
to delete an inode's persistent items. One of the things it has to
check is that there isn't another deletion attempt on the inode in this
mount. It sets a bit in lock data while it's working and backs off if
the bit is already set.
Unfortunately it was always clearing this bit as it exited, regardless
of whether it set it or not. This would let the next attempt perform
the deletion again before the working task had finished. This was often
not a problem because background orphan scanning is the only source of
regular concurrent deletion attempts.
But it's a big problem if a deletion attempt takes a very long time. It
gives enough time for an orphan scan attempt to clear the bit, then try
again and clobber whoever is performing the very slow deletion.
I hit this in a test that built files with an absurd number of
fragmented extents. The second concurrent orphan attempt was able to
proceed with deletion and performed a bunch of duplicate data extent
frees and caused corruption.
The fix is to only clear the bit if we set it. Now all concurrent
attempts will back off until the first task is done.
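In short, ownership of the bit now looks like this; the bit and lock
data names are illustrative:

    if (test_and_set_bit(LDATA_DELETING, &ldata->flags))
            return 0;       /* someone else set it; back off */

    ret = delete_inode_items(sb, ino, lock);

    /* only clear the bit because we're the task that set it */
    clear_bit(LDATA_DELETING, &ldata->flags);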
Signed-off-by: Zach Brown <zab@versity.com>
The final iput of an inode can delete items in cluster locked
transactions. It was never safe to call iput within locked
transactions but we never saw the problem. Recent work on inode
deletion raised the issue again.
This makes sure that we always perform iput outside of locked
transactions. The only interesting change is making scoutfs_new_inode()
return the allocated inode on error so that the caller can put the inode
after releasing the transaction.
Signed-off-by: Zach Brown <zab@versity.com>
We were seeing a number of problems coming from races that allowed tasks
in a mount to try and concurrently delete an inode's items. We could
see error messages indicating that deletion failed with -ENOENT, we
could see users of inodes behave erratically as inodes were deleted from
under them, and we could see eventual server errors trying to merge
overlapping data extents which were "freed" (added to transaction lists)
multiple times.
This commit addresses the problems in one relatively large patch. While
we could mechanically split up the fixes, they're all interdependent and
splitting them up (bisecting through them) could cause failures that
would be devilishly hard to diagnose.
First we stop allowing multiple cached vfs inodes. This was initially
done to avoid deadlocks between lock invalidation and final inode
deletion. We add a specific lookup that's used by invalidation which
ignores any inodes which are in I_NEW or I_FREEING. Now that iget can
wait on inode flags we call iget5_locked before acquiring the cluster
lock. This ensures that we can only have one cached vfs inode for a
given inode number in evict_inode trying to delete.
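The lookup ends up roughly shaped like this, with the callback names
made up for illustration:

    inode = iget5_locked(sb, ino, scoutfs_iget_test, scoutfs_iget_set, &ino);
    if (!inode)
            return ERR_PTR(-ENOMEM);
    if (!(inode->i_state & I_NEW))
            return inode;   /* existing cached inode; iget waited on the flags */

    /* only now acquire the cluster lock, read items, unlock_new_inode() */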
Now that we can only have one cached inode, we can rework the omap
tracking to use _set and _clear instead of _inc and _put. This isn't
strictly necessary but is a simplification and lets us issue warnings if
we see that we ever try to set an inode number's bit on behalf of
multiple cached inodes. We also add a _test helper.
Orphan scanning would try to perform deletion by instantiating a cached
inode and then putting it, triggering eviction and final deletion. This
was an attempt to simplify concurrency but ended up causing more
problems. It no longer tries to interact with inode cache at all and
attempts to safely delete inode items directly. It uses the omap test
to determine that it should skip an already cached inode.
We had attempted to forbid opening inodes by handle if they had an nlink
of 0. Since we allowed multiple cached inodes for an inode number this
was to prevent adding cached inodes that were being deleted. It was
only performing the check on newly allocated inodes, though, so it could
get a reference to the cached inode that the scanner had inserted for
deleting. We're choosing to keep restricting opening by handle to only
linked inodes so we also check existing inodes after they're refreshed.
We're left with a task evicting an inode and the orphan scanner racing
to delete an inode's items. We move the work of determining if it's safe
to delete out of scoutfs_omap_should_delete() and into
try_delete_inode_items() which is called directly from eviction and
scanning. This is mostly code motion but we do make three critical
changes. We get rid of the goofy concurrent deletion detection in
delete_inode_items() and instead use a bit in the lock data to serialize
multiple attempts to delete an inode's items. We no longer assume that
the inode must still be around because we were called from evict and
instead specifically check that the inode item is still present for
deletion. Finally, we use the omap test to discover that we shouldn't
delete an inode that is locally cached (and would not be included in
the omap response). We do all this under the inode write lock to serialize
between mounts.
Signed-off-by: Zach Brown <zab@versity.com>
Add a mount option to set the delay between scanning of the orphan list.
The sysfs file for the option is writable so this option can be set at
run time.
Signed-off-by: Zach Brown <zab@versity.com>
The inode caller of omap was manually calculating the group and bits,
which isn't fantastic. Export the little helper to calculate it so
the inode caller doesn't have to.
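Something like the following, with made-up constant names:

    static inline void scoutfs_omap_calc_group_bit(u64 ino, u64 *group_nr,
                                                   int *bit_nr)
    {
            *group_nr = ino >> SCOUTFS_OMAP_GROUP_SHIFT;
            *bit_nr = ino & ((1ULL << SCOUTFS_OMAP_GROUP_SHIFT) - 1);
    }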
Signed-off-by: Zach Brown <zab@versity.com>
You can almost feel the editing mistake that brought the delay
calculation into the conditional and forgot to remove the initial
calculation at declaration.
Signed-off-by: Zach Brown <zab@versity.com>
Our open by handle functions didn't care that the inode wasn't
referenced and let tasks open unlinked inodes by number. This
interacted badly with the inode deletion mechanisms which required that
inodes couldn't be cached on other nodes after the transaction which
removed their final reference.
If a task did accidentally open a file by inode while it was being
deleted it could see the inode items in an inconsistent state and return
very confusing errors that look like corruption.
The fix is to give the handle iget callers a flag to tell iget to only
get the inode if it has a positive nlink. If iget sees that the inode
has been unlinked it returns -ENOENT.
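Inside iget, once the inode is current under its cluster lock, the
check is roughly the following; the flag name is illustrative:

    if ((flags & SCOUTFS_IGET_LINKED) && inode->i_nlink == 0) {
            iput(inode);
            return ERR_PTR(-ENOENT);
    }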
Signed-off-by: Zach Brown <zab@versity.com>
We're adding an ioctl that wants to build inode item keys so let's
export the private inode key initializer.
Signed-off-by: Zach Brown <zab@versity.com>
The code that updates inode index items on behalf of indexed fields uses
an array to track changes in the fields. Those array indexes were the
raw key type values.
We're about to introduce some sparse space between all the key values so
that we have some room to add keys in the future at arbitrary sort
positions amongst the previous keys.
We don't want the inode index item updating code to keep using raw types
as array indices when the type values are no longer small dense values.
We introduce indirection from type values to array indices to keep the
tracking array in the in-memory inode struct small.
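A small mapping keeps the in-inode tracking array dense; the slot and
type names below are illustrative:

    enum { IND_META_SEQ, IND_DATA_SEQ, IND_NR };

    static int index_type_to_ind(u8 type)
    {
            switch (type) {
            case SCOUTFS_INODE_INDEX_META_SEQ_TYPE:
                    return IND_META_SEQ;
            case SCOUTFS_INODE_INDEX_DATA_SEQ_TYPE:
                    return IND_DATA_SEQ;
            default:
                    return -EINVAL;
            }
    }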
Signed-off-by: Zach Brown <zab@versity.com>
Add a count of used inodes to the super block and a change in the inode
count to the log_trees struct. Client transactions track the change in
inode count as they create and delete inodes. The log_trees delta is
added to the count in the super as finalized log_trees are deleted.
Signed-off-by: Zach Brown <zab@versity.com>
This adds i_version to our inode and maintains it as we allocate, load,
modify, and store inodes. We set the flag in the superblock so
in-kernel users can use i_version to see changes in our inodes.
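Roughly, the maintenance uses the stock kernel helpers:

    sb->s_flags |= SB_I_VERSION;            /* advertise i_version at mount */

    inode_set_iversion(inode, vers);        /* when allocating or loading */
    inode_inc_iversion(inode);              /* on each modification, before storing */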
Signed-off-by: Zach Brown <zab@versity.com>
Add an inode creation time field. It's created for all new inodes.
It's visible to stat_more. setattr_more can set it during
restore.
Signed-off-by: Zach Brown <zab@versity.com>
The current orphan scan uses the forest_next_hint to look for candidate
orphan items to delete. It doesn't skip deleted items and checks the
forest of log btrees so it'd return hints for every single item that
existed in all the log btrees across the system. And we call the hint
lookup once per item.
When the system is deleting a lot of files we end up generating a huge
load where all mounts are constantly getting the btree roots from the
server, reading all the newest log btree blocks, finding deleted orphan
items for inodes that have already been deleted, and moving on to the
next deleted orphan item.
The fix is to use a read-only traversal of only one version of the fs
root for all the items in one scan. This avoids all the deleted orphan
items that exist in the log btrees which will disappear when they're
merged. It lets the item iteration happen in a single read-only cached
btree instead of constantly reading in the most recently written root
block of every log btree.
The result is an enormous speedup of large deletions. I don't want to
describe exactly how enormous.
Signed-off-by: Zach Brown <zab@versity.com>
We can be performing final deletion as inodes are evicted during
unmount. We have to keep full locking, transactions, and networking up
and running for the evict_inodes() call in generic_shutdown_super().
Unfortunately, this means that workers can be using inode references
during evict_inodes() which prevents them from being evicted. Those
workers can then remain running as we tear down the system, causing
crashes and deadlocks as the final iputs try to use resources that have
been destroyed.
The fix is to first properly stop orphan scanning, which can instantiate
new cached inodes, before the call to kill_block_super ends up trying
to evict all inodes. Then we just need to wait for any pending iput and
invalidate work to finish and perform the final iput, which will always
evict because generic_shutdown_super has cleared MS_ACTIVE.
Signed-off-by: Zach Brown <zab@versity.com>
As subsystems were built I tended to use interruptible waits in the hope
that we'd let users break out of most waits.
The reality is that we have significant code paths that have trouble
unwinding. Final inode deletion during iput->evict in a task is a good
example. It's madness to have a pending signal turn an inode deletion
from an efficient inline operation to a deferred background orphan inode
scan deletion.
It also happens that golang built pre-emptive thread scheduling around
signals. Under load we see a surprising amount of signal spam and it
has created surprising error cases which would have otherwise been fine.
This changes waits to expect that IOs (including network commands) will
complete reasonably promptly. We remove all interruptible waits with
the notable exception of breaking out of a pending mount. That requires
shuffling setup around a little bit so that the first network message we
wait for is the lock for getting the root inode.
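The typical conversion is simply the following, sketched with
placeholder names:

    /* was: ret = wait_event_interruptible(waitq, request_done(req)); */
    wait_event(waitq, request_done(req));   /* IO/network completes promptly */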
Signed-off-by: Zach Brown <zab@versity.com>
iput() can only be used in contexts that could perform final inode
deletion which requires cluster locks and transactions. This is
absolutely true for the transaction committing worker. We can't have
deletion during transaction commit trying to get locks and dirty *more*
items in the transaction.
Now that we're properly getting locks in final inode deletion and
O_TMPFILE support has put pressure on deletion, we're seeing deadlocks
between inode eviction during transaction commit getting an index lock
and index lock invalidation trying to commit.
We use the newly offered queued iput to defer the iput from walking our
dirty inodes. The transaction commit will be able to proceed while
the iput worker is off waiting for a lock.
Signed-off-by: Zach Brown <zab@versity.com>
Lock invalidation had the ability to kick iput off to work context. We
need to use it for inode writeback as well so we move the mechanism over
to inode.c and give it a proper call.
Signed-off-by: Zach Brown <zab@versity.com>
We hide I_FREEING inodes from inode lookup to avoid inversions with
cluster locking. This can result in duplicate inode structs for a
given inode number. They can then both race to try and delete the same items
for their shared inode number. This leads to error messages from
evict_inode and could lead to corruption if they, for example, both try
and free the same data extents.
This adds very basic serialization so only one instance can try to
delete items at a time.
Signed-off-by: Zach Brown <zab@versity.com>
Returning ENOSPC is challenging because we have clients working on
allocators which are a fraction of the whole and we use COW transactions
so we need to be able to allocate to free. This adds support for
returning ENOSPC to client posix allocators as free space gets low.
For metadata, we reserve a number of free blocks for making progress
with client and server transactions which can free space. The server
sets the low flag in a client's allocator if we start to dip into
reserved blocks. In the client we add an argument to entering a
transaction which indicates if we're allocating new space (as opposed to
just modifying existing data or freeing). When an allocating
transaction runs low and the server low flag is set then we return
ENOSPC.
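The hold check reduces to something like this, with the names invented
for illustration:

    /* an allocating hold fails fast once the server marked us low */
    if ((flags & SCOUTFS_TRANS_ALLOCATING) &&
        meta_alloc_flagged_low(sb) && trans_running_low(sb))
            return -ENOSPC;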
Adding an argument to transaction holders and having it return ENOSPC
gave us the opportunity to clean it up and make it a little clearer.
More work is done outside the wait_event function and it now
specifically waits for a transaction to cycle when it forces a commit
rather than spinning until the transaction worker acquires the lock and
stops it.
For data the same pattern applies except there are no reserved blocks
and we don't COW data so it's a simple case of returning the hard ENOSPC
when the data allocator flag is set.
The server needs to consider the reserved count when refilling the
client's meta_avail allocator and when swapping between the two
meta_avail and meta_free allocators.
We add the reserved metadata block count to statfs_more so that df can
subtract it from the free meta blocks and make it clear when enospc is
going to be returned for metadata allocations.
We increase the minimum device size in mkfs so that small testing
devices provide sufficient reserved blocks.
And finally we add a little test that makes sure we can fill both
metadata and data to ENOSPC and then recover by deleting what we filled.
Signed-off-by: Zach Brown <zab@versity.com>
Killing a task can end up in evict and break out of acquiring the locks
to perform final inode deletion. This isn't necessarily fatal. The
orphan task will come around and will delete the inode when it is truly
no longer referenced.
So let's silence the error and keep track of how many times it happens.
Signed-off-by: Zach Brown <zab@versity.com>
Orphaned items haven't been deleted for quite a while -- the call to the
orphan inode scanner has been commented out for ages. The deletion of
the orphan item didn't take rid zone locking into account as we moved
deletion from being strictly local to being performed by whoever last
used the inode.
This reworks orphan item management and brings back orphan inode
scanning to correctly delete orphaned inodes.
We get rid of the rid zone that was always _WRITE locked by each mount.
That made it impossible for other mounts to get a _WRITE lock to delete
orphan items. Instead we rename it to the orphan zone and have orphan
item callers get _WRITE_ONLY locks inside their inode locks. Now all
nodes can create and delete orphan items as they have _WRITE locks on
the associated inodes.
Then we refresh the orphan inode scanning function. It now runs
regularly in the background of all mounts. It avoids creating cluster
lock contention by finding candidates with unlocked forest hint reads
and by testing inode caches locally and via the open map before properly
locking and trying to delete the inode's items.
Signed-off-by: Zach Brown <zab@versity.com>
Previously we added a ilookup variant that ignored I_FREEING inodes
to avoid a deadlock between lock invalidation (lock->I_FREEING) and
eviction (I_FREEING->lock).
Now we're seeing similar deadlocks between eviction (I_FREEING->lock)
and fh_to_dentry's iget (lock->I_FREEING).
I think it's reasonable to ignore all inodes with I_FREEING set when
we're using our _test callback in ilookup or iget. We can remove the
_nofreeing ilookup variant and move its I_FREEING test into the
iget_test callback provided to both ilookup and iget.
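The shared test callback ends up along these lines (names are
illustrative):

    static int scoutfs_iget_test(struct inode *inode, void *arg)
    {
            const u64 *ino = arg;

            return scoutfs_ino(inode) == *ino &&
                   !(inode->i_state & I_FREEING);
    }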
Callers will get the same result, it will just happen without waiting
for a previously I_FREEING inode to leave. They'll get NULL from
ilookup instead of waiting. They'll allocate and start to initialize a
newer instance of the inode and insert it alongside the previous instance.
We don't have inode number re-use so we don't have the problem where a
newly allocated inode number is relying on inode cache serialization to
not find a previously allocated inode that is being evicted.
This change does allow for concurrent iget of an inode number that is
being deleted on a local node. This could happen in fh_to_dentry with a
raw inode number. But this was already a problem between mounts because
they don't have a shared inode cache to serialize them. Once we fix
that between nodes, we fix it on a single node as well.
Signed-off-by: Zach Brown <zab@versity.com>
We've had a long-standing deadlock between lock invalidation and
eviction. Invalidating a lock wants to lookup inodes and drop their
resources while blocking locks. Eviction wants to get a lock to perform
final deletion while the inode has I_FREEING set which blocks lookups.
We only saw this deadlock a handful of times in all of the time we've
run the code, but it's much more common now that we're acquiring
locks in iput to test that nlink is zero instead of only when nlink is
zero. I see unmount hang regularly when testing final inode deletion.
This adds a lookup variant for invalidation which will refuse to
return freeing inodes so they won't be waited on. Once they're freeing
they can't be seen by future lock users so they don't need to be
invalidated. This keeps the lock invalidation promise and avoids
sleeping on freeing inodes which creates the deadlock.
Signed-off-by: Zach Brown <zab@versity.com>
Previously we wouldn't try and remove cached dentries and inodes as
lock revocation removed cluster lock coverage. The next time
we tried to use the cached dentries or inodes we'd acquire
a lock and refresh them.
But now cached inodes prevent final inode deletion. If they linger
outside cluster locking then any final deletion will need to be deferred
until all its cached inodes are naturally dropped at some point in the
future across the cluster. It might take refreshing the dentries or for
memory pressure to push out the old cached inodes.
This tries to proactively drop cached dentries and inodes as we lose
cluster lock coverage if they're not actively referenced. We need to be
careful not to perform final inode deletion during lock invalidation
because it will deadlock, so we defer an iput which could delete during
evict out to async work.
Now deletion can be done synchronously in the task that is performing
the unlink because previous use of the inode on remote mounts hasn't
left unused cached inodes sitting around.
Signed-off-by: Zach Brown <zab@versity.com>
Today an inode's items are deleted once its nlink reaches zero and the
final iput is called in a local mount. This can delete inodes from
under other mounts which have opened the inode before it was unlinked on
another mount.
We fix this by adding cached inode tracking. Each mount maintains
groups of cached inode bitmaps at the same granularity as inode locking.
As a mount performs its final iput it gets a bitmap from the server
which indicates if any other mount has inodes in the group open.
This makes the two fast paths, opening and closing linked files and
deleting a file that was unlinked locally, only pay a moderate cost of
either maintaining the bitmap locally or getting the open map once
per lock group. Removing many files in a group will only lock and get
the open map once per group.
Signed-off-by: Zach Brown <zab@versity.com>
Add lock coverage which tracks if the inode has been refreshed and is
covered by the inode group cluster lock. This will be used by
drop_inode and evict_inode to discover that the inode is current and
doesn't need to be refreshed.
Signed-off-by: Zach Brown <zab@versity.com>