Add support for project IDs. They're managed through the _attr_x
interfaces and are inherited from the parent directory during creation.
Signed-off-by: Zach Brown <zab@versity.com>
Add a bit to the private scoutfs inode flags which indicates that the
inode is in retention mode. The bit is visible through the _attr_x
interface. It can only be set on regular files and when set it prevents
modification to all but non-user xattrs. It can be cleared by root.
Signed-off-by: Zach Brown <zab@versity.com>
We were using a seqcount to protect high frequency reads and writes to
some of our private inode fields. The writers were serialized by the
caller but that's a bit too easy to get wrong. We're already storing
the write seqcount update so the additional internal spinlock stores in
seqlocks aren't a significant additional overhead. The seqlocks also
handle preemption for us.
Signed-off-by: Zach Brown <zab@versity.com>
The aio_read and aio_write callbacks are no longer used by newer
kernels, which now use iter-based readers and writers.
We can avoid implementing plain .read and .write as an iter will
be generated when needed for us automatically.
We add a new data_wait_check_iter() function accordingly.
With these methods removed from the kernel, the el8 kernel no
longer uses the extended ops wrapper struct and is much closer now
to upstream. As a result, a lot of methods are moving around from
inode_dir_operations to and from inode_file_operations etc, and
perhaps things will look a bit more structured as a result.
As a result, we need a slightly different data_wait_check() that
accounts for the iter and offset properly.
Signed-off-by: Auke Kok <auke.kok@versity.com>
The d_prune_aliases in lock invalidation was thought to be safe because
the caller had an inode reference; surely it can't get into iput_final.
I missed the fundamental dcache pattern that dput can ascend through
parents and end up in inode eviction for entirely unrelated inodes.
It's very easy for this to deadlock. Imagine, if nothing else, that the
lock that inode invalidation is blocked on in
dput->iput->evict->delete->lock is itself in the list of locks to
invalidate in the caller.
We fix this by always kicking off d_prune and dput into async work.
This increases the chance that inodes will still be referenced after
invalidation and prevent inline deletion. More deletions can be
deferred until the orphan scanner finds them. It should be rare,
though. We're still likely to put and drop invalidated inodes before a
writer gets around to removing the final unlink and asking us for the
omap that describes our cached inodes.
To perform the d_prune in work we make it a behavioural flag and make
our queued iputs a little more robust. We use much safer and more
understandable locking to cover the count and the new flags, and we
queue the iputs as re-entrant work in their own workqueue instead of as
one work instance in the system_wq.
Signed-off-by: Zach Brown <zab@versity.com>
FS items are deleted by logging a deletion item that has a greater item
version than the item to delete. The versions are usually maintained by
the write_seq of the exclusive write lock that protects the item. Any
newer write hold will have a greater version than all previous write
holds so any items created under the lock will have a greater vers than
all previous items under the lock. All deletion items will be merged
with the older item and both will be dropped.
This doesn't work for concurrent write-only locks. The write-only locks
match with each other so their write_seqs are assigned in the order
that they are granted. That grant order can be mismatched with item
creation order. We can get deletion items with lesser versions than the
item to delete because of when each creation's write-only lock was
granted.
Write-only locks are used to maintain consistency between concurrent
writers and readers, not between writers. Consistency between writers
is done with another primary write lock. For example, if you're writing
seq items to a write-only region you need to have the write lock on the
inode for the specific seq item you're writing.
The fix, then, is to pass these primary write locks down to the item
cache so that it can choose an item version that is the greatest amongst
the transaction, the write-only lock, and the primary lock. This now
ensures that the primary lock's increasing write_seq makes it down to
the item, bringing item version ordering in line with exclusive holds of
the primary lock.
All of this to fix concurrent inode updates sometimes leaving behind
duplicate meta_seq items because old seq item deletions ended up with
older versions than the seq item they tried to delete, nullifying the
deletion.
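A minimal sketch of the version choice described above, assuming
hypothetical names (struct demo_lock, demo_item_vers) and assuming the
primary lock is optional for callers that don't have one:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a cluster lock that carries a write_seq. */
struct demo_lock {
	uint64_t write_seq;
};

/*
 * Pick the greatest of the transaction seq, the write-only lock's
 * write_seq and, when the caller has one, the primary lock's write_seq.
 */
static uint64_t demo_item_vers(uint64_t trans_seq,
			       const struct demo_lock *lock,
			       const struct demo_lock *primary)
{
	uint64_t vers = trans_seq;

	assert(lock != NULL);
	if (lock->write_seq > vers)
		vers = lock->write_seq;
	if (primary && primary->write_seq > vers)
		vers = primary->write_seq;
	return vers;
}
```

Because the primary lock is held exclusively, its write_seq only grows
between holds, so a later deletion item gets a version at least as
large as the primary write_seq that was current when the item it
deletes was created.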
Signed-off-by: Zach Brown <zab@versity.com>
The final iput of an inode can delete items in cluster locked
transactions. It was never safe to call iput within locked
transactions but we never saw the problem. Recent work on inode
deletion raised the issue again.
This makes sure that we always perform iput outside of locked
transactions. The only interesting change is making scoutfs_new_inode()
return the allocated inode on error so that the caller can put the inode
after releasing the transaction.
Signed-off-by: Zach Brown <zab@versity.com>
We were seeing a number of problems coming from races that allowed tasks
in a mount to try and concurrently delete an inode's items. We could
see error messages indicating that deletion failed with -ENOENT, we
could see users of inodes behave erratically as inodes were deleted from
under them, and we could see eventual server errors trying to merge
overlapping data extents which were "freed" (added to transaction lists)
multiple times.
This commit addresses the problems in one relatively large patch. While
we could mechanically split up the fixes, they're all interdependent and
splitting them up (bisecting through them) could cause failures that
would be devilishly hard to diagnose.
First we stop allowing multiple cached vfs inodes. This was initially
done to avoid deadlocks between lock invalidation and final inode
deletion. We add a specific lookup that's used by invalidation which
ignores any inodes which are in I_NEW or I_FREEING. Now that iget can
wait on inode flags we call iget5_locked before acquiring the cluster
lock. This ensures that we can only have one cached vfs inode for a
given inode number in evict_inode trying to delete.
Now that we can only have one cached inode, we can rework the omap
tracking to use _set and _clear instead of _inc and _put. This isn't
strictly necessary but is a simplification and lets us issue warnings if
we see that we ever try to set an inode number's bit on behalf of
multiple cached inodes. We also add a _test helper.
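To illustrate the _set, _clear and _test shape, here is a small
userspace sketch of a per-group bitmap. The group size, struct and
function names are hypothetical and the real omap tracking is more
involved than this.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define DEMO_GROUP_NR	1024	/* hypothetical inodes tracked per group */
#define DEMO_LONG_BITS	(sizeof(unsigned long) * CHAR_BIT)

struct demo_omap_group {
	unsigned long bits[DEMO_GROUP_NR / DEMO_LONG_BITS];
};

static bool demo_omap_test(const struct demo_omap_group *grp, uint64_t ino)
{
	uint64_t nr = ino % DEMO_GROUP_NR;

	return (grp->bits[nr / DEMO_LONG_BITS] >> (nr % DEMO_LONG_BITS)) & 1UL;
}

/* Returns false, after warning, if a cached inode already set the bit. */
static bool demo_omap_set(struct demo_omap_group *grp, uint64_t ino)
{
	uint64_t nr = ino % DEMO_GROUP_NR;

	assert(grp != NULL);
	if (demo_omap_test(grp, ino)) {
		fprintf(stderr, "ino %llu bit already set\n",
			(unsigned long long)ino);
		return false;
	}
	grp->bits[nr / DEMO_LONG_BITS] |= 1UL << (nr % DEMO_LONG_BITS);
	return true;
}

static void demo_omap_clear(struct demo_omap_group *grp, uint64_t ino)
{
	uint64_t nr = ino % DEMO_GROUP_NR;

	grp->bits[nr / DEMO_LONG_BITS] &= ~(1UL << (nr % DEMO_LONG_BITS));
}
```

With a single cached inode per inode number a plain bit is enough, and
a second _set on the same bit is a sign that something went wrong.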
Orphan scanning would try to perform deletion by instantiating a cached
inode and then putting it, triggering eviction and final deletion. This
was an attempt to simplify concurrency but ended up causing more
problems. It no longer tries to interact with inode cache at all and
attempts to safely delete inode items directly. It uses the omap test
to determine that it should skip an already cached inode.
We had attempted to forbid opening inodes by handle if they had an nlink
of 0. Since we allowed multiple cached inodes for an inode number this
was to prevent adding cached inodes that were being deleted. It was
only performing the check on newly allocated inodes, though, so it could
get a reference to the cached inode that the scanner had inserted for
deleting. We're choosing to keep restricting opening by handle to only
linked inodes so we also check existing inodes after they're refreshed.
We're left with a task evicting an inode and the orphan scanner racing
to delete an inode's items. We move the work of determining if it's safe
to delete out of scoutfs_omap_should_delete() and into
try_delete_inode_items() which is called directly from eviction and
scanning. This is mostly code motion but we do make three critical
changes. We get rid of the goofy concurrent deletion detection in
delete_inode_items() and instead use a bit in the lock data to serialize
multiple attempts to delete an inode's items. We no longer assume that
the inode must still be around because we were called from evict and
specifically check that the inode item is still present for deleting.
Finally, we use the omap test to discover that we shouldn't delete an
inode that is locally cached (and would not be included in the omap
response). We do all this under the inode write lock to serialize
between mounts.
Signed-off-by: Zach Brown <zab@versity.com>
Add a mount option to set the delay between scanning of the orphan list.
The sysfs file for the option is writable so this option can be set at
run time.
Signed-off-by: Zach Brown <zab@versity.com>
Our open by handle functions didn't care that the inode wasn't
referenced and let tasks open unlinked inodes by number. This
interacted badly with the inode deletion mechanisms which required that
inodes couldn't be cached on other nodes after the transaction which
removed their final reference.
If a task did accidentally open a file by inode while it was being
deleted it could see the inode items in an inconsistent state and return
very confusing errors that look like corruption.
The fix is to give the handle iget callers a flag to tell iget to only
get the inode if it has a positive nlink. If iget sees that the inode
has been unlinked it returns -ENOENT.
Signed-off-by: Zach Brown <zab@versity.com>
We're adding an ioctl that wants to build inode item keys so let's
export the private inode key initializer.
Signed-off-by: Zach Brown <zab@versity.com>
The code that updates inode index items on behalf of indexed fields uses
an array to track changes in the fields. Those array indexes were the
raw key type values.
We're about to introduce some sparse space between all the key values so
that we have some room to add keys in the future at arbitrary sort
positions amongst the previous keys.
We don't want the inode index item updating code to keep using raw types
as array indices when the type values are no longer small dense values.
We introduce indirection from type values to array indices to keep the
tracking array in the in-memory inode struct small.
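A minimal sketch of that indirection, with made-up sparse type values
and hypothetical names; the real key types and the tracked fields
differ:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sparse key type values with room between them. */
#define DEMO_INODE_INDEX_META_SEQ_TYPE	4
#define DEMO_INODE_INDEX_DATA_SEQ_TYPE	8
#define DEMO_INODE_INDEX_NR		2

struct demo_inode_info {
	/* stays dense and small no matter how sparse the types get */
	uint64_t item_majors[DEMO_INODE_INDEX_NR];
};

/* Map a raw key type to its slot in the tracking array, or -1. */
static int demo_index_from_type(uint8_t type)
{
	switch (type) {
	case DEMO_INODE_INDEX_META_SEQ_TYPE:
		return 0;
	case DEMO_INODE_INDEX_DATA_SEQ_TYPE:
		return 1;
	default:
		return -1;
	}
}

static void demo_set_index_major(struct demo_inode_info *si, uint8_t type,
				 uint64_t major)
{
	int ind = demo_index_from_type(type);

	assert(ind >= 0);
	si->item_majors[ind] = major;
}
```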
Signed-off-by: Zach Brown <zab@versity.com>
Add an inode creation time field. It's created for all new inodes.
It's visible to stat_more. setattr_more can set it during
restore.
Signed-off-by: Zach Brown <zab@versity.com>
We can be performing final deletion as inodes are evicted during
unmount. We have to keep full locking, transactions, and networking up
and running for the evict_inodes() call in generic_shutdown_super().
Unfortunately, this means that workers can be using inode references
during evict_inodes() which prevents them from being evicted. Those
workers can then remain running as we tear down the system, causing
crashes and deadlocks as the final iputs try to use resources that have
been destroyed.
The fix is to first properly stop orphan scanning, which can instantiate
new cached inodes, before the call to kill_block_super ends up trying
to evict all inodes. Then we just need to wait for any pending iput and
invalidate work to finish and perform the final iput, which will always
evict because generic_shutdown_super has cleared MS_ACTIVE.
Signed-off-by: Zach Brown <zab@versity.com>
As subsystems were built I tended to use interruptible waits in the hope
that we'd let users break out of most waits.
The reality is that we have significant code paths that have trouble
unwinding. Final inode deletion during iput->evict in a task is a good
example. It's madness to have a pending signal turn an inode deletion
from an efficient inline operation to a deferred background orphan inode
scan deletion.
It also happens that golang built pre-emptive thread scheduling around
signals. Under load we see a surprising amount of signal spam and it
has created surprising error cases which would have otherwise been fine.
This changes waits to expect that IOs (including network commands) will
complete reasonably promptly. We remove all interruptible waits with
the notable exception of breaking out of a pending mount. That requires
shuffling setup around a little bit so that the first network message we
wait for is the lock for getting the root inode.
Signed-off-by: Zach Brown <zab@versity.com>
iput() can only be used in contexts that could perform final inode
deletion which requires cluster locks and transactions. This is
absolutely true for the transaction committing worker. We can't have
deletion during transaction commit trying to get locks and dirty *more*
items in the transaction.
Now that we're properly getting locks in final inode deletion and
O_TMPFILE support has put pressure on deletion, we're seeing deadlocks
between inode eviction during transaction commit getting an index lock
and index lock invalidation trying to commit.
We use the newly offered queued iput to defer the iput from walking our
dirty inodes. The transaction commit will be able to proceed while
the iput worker is off waiting for a lock.
Signed-off-by: Zach Brown <zab@versity.com>
Lock invalidation had the ability to kick iput off to work context. We
need to use it for inode writeback as well so we move the mechanism over
to inode.c and give it a proper call.
Signed-off-by: Zach Brown <zab@versity.com>
Returning ENOSPC is challenging because we have clients working on
allocators which are a fraction of the whole and we use COW transactions
so we need to be able to allocate to free. This adds support for
returning ENOSPC to client posix allocators as free space gets low.
For metadata, we reserve a number of free blocks for making progress
with client and server transactions which can free space. The server
sets the low flag in a client's allocator if we start to dip into
reserved blocks. In the client we add an argument to entering a
transaction which indicates if we're allocating new space (as opposed to
just modifying existing data or freeing). When an allocating
transaction runs low and the server low flag is set then we return
ENOSPC.
Adding an argument to transaction holders and having it return ENOSPC
gave us the opportunity to clean it up and make it a little clearer.
More work is done outside the wait_event function and it now
specifically waits for a transaction to cycle when it forces a commit
rather than spinning until the transaction worker acquires the lock and
stops it.
For data the same pattern applies except there are no reserved blocks
and we don't COW data so it's a simple case of returning the hard ENOSPC
when the data allocator flag is set.
The server needs to consider the reserved count when refilling the
client's meta_avail allocator and when swapping between the two
meta_avail and meta_free allocators.
We add the reserved metadata block count to statfs_more so that df can
subtract it from the free meta blocks and make it clear when ENOSPC is
going to be returned for metadata allocations.
We increase the minimum device size in mkfs so that small testing
devices provide sufficient reserved blocks.
And finally we add a little test that makes sure we can fill both
metadata and data to ENOSPC and then recover by deleting what we filled.
Signed-off-by: Zach Brown <zab@versity.com>
Orphaned items haven't been deleted for quite a while -- the call to the
orphan inode scanner has been commented out for ages. The deletion of
the orphan item didn't take rid zone locking into account as we moved
deletion from being strictly local to being performed by whoever last
used the inode.
This reworks orphan item management and brings back orphan inode
scanning to correctly delete orphaned inodes.
We get rid of the rid zone that was always _WRITE locked by each mount.
That made it impossible for other mounts to get a _WRITE lock to delete
orphan items. Instead we rename it to the orphan zone and have orphan
item callers get _WRITE_ONLY locks inside their inode locks. Now all
nodes can create and delete orphan items as they have _WRITE locks on
the associated inodes.
Then we refresh the orphan inode scanning function. It now runs
regularly in the background of all mounts. It avoids creating cluster
lock contention by finding candidates with unlocked forest hint reads
and by testing inode caches locally and via the open map before properly
locking and trying to delete the inode's items.
Signed-off-by: Zach Brown <zab@versity.com>
Previously we added an ilookup variant that ignored I_FREEING inodes
to avoid a deadlock between lock invalidation (lock->I_FREEING) and
eviction (I_FREEING->lock).
Now we're seeing similar deadlocks between eviction (I_FREEING->lock)
and fh_to_dentry's iget (lock->I_FREEING).
I think it's reasonable to ignore all inodes with I_FREEING set when
we're using our _test callback in ilookup or iget. We can remove the
_nofreeing ilookup variant and move its I_FREEING test into the
iget_test callback provided to both ilookup and iget.
Callers will get the same result, it will just happen without waiting
for a previously I_FREEING inode to leave. They'll get NULL instead of
waiting from ilookup. They'll allocate and start to initialize a newer
instance of the inode and insert it alongside the previous instance.
We don't have inode number re-use so we don't have the problem where a
newly allocated inode number is relying on inode cache serialization to
not find a previously allocated inode that is being evicted.
This change does allow for concurrent iget of an inode number that is
being deleted on a local node. This could happen in fh_to_dentry with a
raw inode number. But this was already a problem between mounts because
they don't have a shared inode cache to serialize them. Once we fix
that between nodes, we fix it on a single node as well.
Signed-off-by: Zach Brown <zab@versity.com>
We've had a long-standing deadlock between lock invalidation and
eviction. Invalidating a lock wants to lookup inodes and drop their
resources while blocking locks. Eviction wants to get a lock to perform
final deletion while the inode has I_FREEING set which blocks lookups.
We only saw this deadlock a handful of times in all of the time we've
run the code, but it's much more common now that we're acquiring
locks in iput to test that nlink is zero instead of only when nlink is
zero. I see unmount hang regularly when testing final inode deletion.
This adds a lookup variant for invalidation which will refuse to
return freeing inodes so they won't be waited on. Once they're freeing
they can't be seen by future lock users so they don't need to be
invalidated. This keeps the lock invalidation promise and avoids
sleeping on freeing inodes which creates the deadlock.
Signed-off-by: Zach Brown <zab@versity.com>
Previously we wouldn't try and remove cached dentries and inodes as
lock revocation removed cluster lock coverage. The next time
we tried to use the cached dentries or inodes we'd acquire
a lock and refresh them.
But now cached inodes prevent final inode deletion. If they linger
outside cluster locking then any final deletion will need to be deferred
until all its cached inodes are naturally dropped at some point in the
future across the cluster. It might take refreshing the dentries or for
memory pressure to push out the old cached inodes.
This tries to proactively drop cached dentries and inodes as we lose
cluster lock coverage if they're not actively referenced. We need to be
careful not to perform final inode deletion during lock invalidation
because it will deadlock, so we defer an iput which could delete during
evict out to async work.
Now deletion can be done synchronously in the task that is performing
the unlink because previous use of the inode on remote mounts hasn't
left unused cached inodes sitting around.
Signed-off-by: Zach Brown <zab@versity.com>
Add lock coverage which tracks if the inode has been refreshed and is
covered by the inode group cluster lock. This will be used by
drop_inode and evict_inode to discover that the inode is current and
doesn't need to be refreshed.
Signed-off-by: Zach Brown <zab@versity.com>
Support O_TMPFILE: Create an unlinked file and put it on the orphan list.
If it ever gains a link, take it off the orphan list.
Change MOVE_BLOCKS ioctl to allow moving blocks into offline extent ranges.
Ioctl callers must set a new flag to enable this operation mode.
RH-compat: tmpfile support is actually backported by RH into 3.10 kernel.
We need to use some of their kabi-maintaining wrappers to use it:
use a struct inode_operations_wrapper instead of base struct
inode_operations, set S_IOPS_WRAPPER flag in i_flags. This lets
RH's modified vfs_tmpfile() find our tmpfile fn pointer.
Add a test that tests both creating tmpfiles as well as moving their
contents into a destination file via MOVE_BLOCKS.
xfstests common/004 now runs because tmpfile is supported.
Signed-off-by: Andy Grover <agrover@versity.com>
Now that we have full precision extents a writer with i_mutex and a page
lock can be modifying large extent items which cover much of the
surrounding pages in the file. Readers can be in a different page with
only the page lock and try to work with extent items as the writer is
deleting and creating them.
We add a per-inode rwsem which just protects file extent item
manipulation. We try to acquire it as close to the item use as possible
in data.c which is the only place we work with file extent items.
This stops rare read corruption we were seeing where get_block in a
reader was racing with extent item deletion in a stager at a further
offset in the file.
Signed-off-by: Zach Brown <zab@versity.com>
We had previously seen lock contention between mounts that were either
resolving paths by looking up entries in directories or writing xattrs
in file inodes as they did archiving work.
The previous attempt to avoid this contention was to give each directory
its own inode number allocator which ensured that inodes created for
entries in the directory wouldn't share lock groups with inodes in other
directories.
But this creates the problem of operating on few files per lock for
reasonably small directories. It also creates more server commits as
each new directory gets its inode allocation reservation.
The fix is to have mount-wide separate allocators for directories and
for everything else. This puts directories and files in separate groups
and locks, regardless of directory population.
Signed-off-by: Zach Brown <zab@versity.com>
Add an ioctl that can be used by userspace to restore a file to its
offline state. To do that it needs to set inode fields that are
otherwise not exposed and create an offline extent.
Signed-off-by: Zach Brown <zab@versity.com>
One of the core features of scoutfs is the ability to transparently
migrate file contents to and from an archive tier. For this to be
transparent we need file system operations to trigger staging the file
contents back into the file system as needed.
This adds the infrastructure which operations use to wait for offline
extents to come online and which provides userspace with a list of
blocks that the operations are waiting for.
We add some waiting infrastructure that callers use to lock, check for
offline extents, and unlock and wait before checking again to see if
they're still offline. We add these checks and waiting to data io
operations that could encounter offline extents.
This has to be done carefully so that we don't wait while holding locks
that would prevent staging. We use per-task structures to discover when
we are the first user of a cluster lock on an inode, indicating that
it's safe for us to wait because we don't hold any locks.
And while we're waiting our operation is tracked and reported to
userspace through an ioctl. This is a non-blocking ioctl, it's up to
userspace to decide how often to check and how large a region to stage.
Waiters are woken up when the file contents could have changed, not
specifically when we know that the extent has come online. This lets us
wake waiters when their lock is revoked so that they can block waiting
to reacquire the lock and test the extents again. It lets us provide
coherent demand staging across the cluster without fine grained waiting
protocols sent between the nodes. It may result in some spurious wakeups
and work but hopefully it won't, and it's a very simple and functional
first pass.
Signed-off-by: Zach Brown <zab@versity.com>
Remove the functions that operate on online and offline blocks
independently now that the file data mapping code isn't using them any
more.
Signed-off-by: Zach Brown <zab@versity.com>
Add functions that atomically change and query the online and offline
block counts as a pair. They're semantically linked and we shouldn't
present counts that don't match if they're in the process of being
updated.
Signed-off-by: Zach Brown <zab@versity.com>
Variable length keys lead to having a key struct point to the buffer
that contains the key. With dirents and xattrs now using small keys we
can convert everyone to using a single key struct and significantly
simplify the system.
We no longer have a separate generic key buf struct that points to
specific per-type key storage. All items use the key struct and fill
out the appropriate fields. All the code that paired a generic key buf
struct and a specific key type struct is collapsed down to a key struct.
There's no longer the difference between a key buf that shares a
read-only key, has its own precise allocation, or has a max size
allocation for incrementing and decrementing.
Each key user now has an init function that fills out its fields. It
looks a lot like the old pattern but we no longer have separate key
storage that
the buf points to.
A bunch of code now takes the address of static key storage instead of
managing allocated keys. Conversely, swapping now uses the full keys
instead of pointers to the keys.
We don't need all the functions that worked on the generic key buf
struct because they had different lengths. Copy, clone, length init,
memcpy, all of that goes away.
The item API had some functions that tested the length of keys and
values. The key length tests vanish, and that gets rid of the _same()
call. The _same_min() call only had one user who didn't also test for
the value length being too large. Let's leave caller key constraints in
callers instead of trying to hide them on the other side of a bunch of
item calls.
We no longer have to track the number of key bytes when calculating if
an item population will fit in segments. This removes the key length
from reservations, transactions, and segment writing.
The item cache key querying ioctls no longer have to deal with variable
length keys. They simply specify the start key, the ioctls return the
number of keys copied instead of bytes, and the caller is responsible
for incrementing the next search key.
The segment no longer has to store the key length. It stores the key
struct in the item header.
The fancy variable length key formatting and printing can be removed.
We have a single format for the universal key struct. The SK_ wrappers
that bracketed calls to use preempt safe per cpu buffers can turn back
into their normal calls.
Manifest entries are now a fixed size. We can simply split them between
btree keys and values and initialize them instead of allocating them.
This means that level 0 entries don't have their own format that sorts
by the seq. They're sorted by the key like all the other levels.
Compaction needs to sweep all of them looking for the oldest and read
can stop sweeping once it can no longer overlap. This makes rare
compaction more expensive and common reading less expensive, which is
the right tradeoff.
Signed-off-by: Zach Brown <zab@versity.com>
Honoring the XATTR_REMOVE flag in xattr deletion exposed an interesting
bug in getxattr(). We were unconditionally returning the max xattr value
size when someone tried to probe an existing xattr's value size by
calling getxattr with size == 0. Some kernel paths did this to probe
the existence of xattrs. They expected to get an error if the xattr
didn't exist, but we were giving them the max possible size. This
kernel path then tried to remove the xattrs with XATTR_REMOVE and that
now failed and caused a bunch of errors in xfstests.
The fix is to return the real xattr value size when getxattr is called
with size == 0. To do that with the old format we'd have to iterate
over all the items which happened to be pretty awkward in the current
code paths.
So we're taking this opportunity to land a change that had been brewing
for a while. We now form the xattr keys from the hash of the name and
the item values now store a logically contiguous header, the name, and the
value. This makes it very easy for us to have the full xattr value
length in the header and return it from getxattr when size == 0.
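To make the size probe concrete, here is a userspace sketch of an xattr
stored as a header followed by the name and the value. The layout, the
size cap and the demo_ names are assumptions for illustration, and the
hashing of the name into the key is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define DEMO_XATTR_MAX_DATA	128	/* hypothetical cap for the sketch */

struct demo_xattr {
	uint8_t name_len;
	uint16_t val_len;			/* full value length */
	uint8_t data[DEMO_XATTR_MAX_DATA];	/* name bytes, then value */
};

static int demo_xattr_init(struct demo_xattr *xat, const char *name,
			   const void *val, size_t val_len)
{
	size_t name_len;

	assert(name != NULL);
	name_len = strlen(name);
	if (name_len > DEMO_XATTR_MAX_DATA ||
	    val_len > DEMO_XATTR_MAX_DATA - name_len)
		return -E2BIG;
	xat->name_len = (uint8_t)name_len;
	xat->val_len = (uint16_t)val_len;
	memcpy(xat->data, name, name_len);
	memcpy(xat->data + name_len, val, val_len);
	return 0;
}

/* A NULL xat stands in for "no item found for this name". */
static long demo_getxattr(const struct demo_xattr *xat, void *buf, size_t size)
{
	if (!xat)
		return -ENODATA;
	if (size == 0)
		return xat->val_len;	/* probe: report the real size */
	if (size < xat->val_len)
		return -ERANGE;
	memcpy(buf, xat->data + xat->name_len, xat->val_len);
	return xat->val_len;
}
```

A size == 0 probe of a missing xattr now fails instead of reporting the
maximum size, which is what the existence checks rely on.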
Now all tests pass while honoring the XATTR_CREATE and XATTR_REMOVE
flags.
And the code is a whole lot easier to follow. And we've removed another
barrier for moving to small fixed size keys.
Signed-off-by: Zach Brown <zab@versity.com>
Having an inode number allocation pool in the super block meant that all
allocations across the mount are interleaved. This means that
concurrent file creation in different directories will create
overlapping inode numbers. This leads to lock contention as reasonable
work loads will tend to distribute work by directories.
The easy fix is to have per-directory inode number allocation pools. We
take the opportunity to clean up the network request so that the caller
gets the allocation instead of having it be fed back in via a weird
callback.
Signed-off-by: Zach Brown <zab@versity.com>
We aren't using the size index. It has runtime and code maintenance
costs that aren't worth paying. Let's remove it.
Removing it from the format and no longer maintaining it are
straightforward.
The bulk of this patch is actually the act of removing it from the index
locking functions. We no longer have to predict the size that will be
stored during the transaction to lock the index items that will be
created during the transaction. A bunch of code to predict the size and
then pass it into locking and transactions goes away. Like other inode
fields we now update the size as it changes.
Signed-off-by: Zach Brown <zab@versity.com>
Simple attr changes are mostly handled by the VFS, we just have to mirror
them into our inode. Truncates are done in a separate set of transactions.
We use a flag to indicate an in-progress truncate. This allows us to
detect and continue the truncate should the node crash.
Index locking is a bit complicated, so we add a helper function to grab
index locks and start a transaction.
With this patch we now pass the following xfstests:
generic/014
generic/101
generic/313
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
Call it scoutfs_inode_index_try_lock_hold since it may fail and unwind
as part of normal (not an error) operation. This lets us re-use the
name in an upcoming patch.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
We only set the .getattr method to our locked getattr filler for regular
files. Set it for all files so that stat, etc, will see the current
inode for all file types.
Signed-off-by: Zach Brown <zab@versity.com>
Add lock coverage for inode index items.
Sadly, this isn't trivial. We have to predict the value of the indexed
fields before the operation to lock those items. One value in
particular we can't reliably predict: the sequence of the transaction we
enter after locking. Also operations can create an absolute ton of
index item updates -- rename can modify nr_inodes * items_per_inode * 2
items, so maybe 24 today. And these items can be arbitrarily positioned
in the key space.
So to handle all this we add functions to gather predicted item values
we'll need to lock, sort and lock them all, then pass appropriate locks
down to the item functions during inode updates.
The trickiest bit of the index locking code is having to retry if the
sequence number changes. Preparing locks has to guess the sequence
number of its upcoming trans and then makes item update decisions based
on that. If we enter and have a different sequence number then we need
to back off and retry with the correct sequence number (we may find that
we'll need to update the indexed meta seq and need to have it locked).
The use of the functions is straightforward. Sites figure out the
predicted sizes, lock, pass the locks to inode updates, and unlock.
While we're at it we replace the individual item field tracking
variables in the inode info with an array of indexed values. The code
ends up a bit nicer. It also gets rid of the indexed time fields that
were left behind and were unused.
It's worth noting that we're getting exclusive locks on the index
updates. Locking the meta/data seq updates results in complete global
serialization of all changes. We'll need concurrent writer locks to get
concurrency back.
Signed-off-by: Zach Brown <zab@versity.com>
Use per_task storage on the inode to pass locks from high level read and
write lock holders down into the callbacks that operate under the locks
so that the locks can then be passed to the item functions.
Signed-off-by: Zach Brown <zab@versity.com>
Add a full lock argument to scoutfs_update_inode_item() and use it to
pass the lock's end key into item_update(). This'll get changed into
passing the full lock into _update soon.
Signed-off-by: Zach Brown <zab@versity.com>
Add the full lock argument to _item_dirty() so that it can verify lock
coverage in addition to limiting item cache population to the range
covered by the lock.
This also ropes in scoutfs_dirty_inode_item() which is a thin wrapper
around _item_dirty().
Signed-off-by: Zach Brown <zab@versity.com>
We lock multiple inodes by order of their inode number. This fixes
the directory entry paths that hold parent dir and target inode locks.
Link and unlink are easy because they just acquire the existing parent
dir and target inode locks.
Lookup is a little squirrely because we don't want to try and order
the parent dir lock with locks down in iget. It turns out that it's
safe to drop the dir lock before calling iget as long as iget handles
racing the inode cache instantiation with inode deletion.
Creation is the remaining pattern and it's a little weird because we
want to lock the newly created inode before we create it and the items
that store it. We add a function that correctly orders the locks,
transaction, and inode cache instantiation.
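The ordering rule itself is small. The sketch below sorts the inode
numbers involved in an operation so that every task acquires their
locks in the same ascending order. Dropping duplicates is an assumption
made here for the case where one inode plays two roles, and the demo_
names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

static int demo_cmp_u64(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a;
	uint64_t y = *(const uint64_t *)b;

	return (x > y) - (x < y);
}

/*
 * Sort the inode numbers in place and drop duplicates.  Returns how
 * many locks to acquire; inos[0..ret) is the acquisition order.
 */
static size_t demo_lock_order(uint64_t *inos, size_t nr)
{
	size_t i, out = 0;

	assert(inos != NULL);
	qsort(inos, nr, sizeof(inos[0]), demo_cmp_u64);
	for (i = 0; i < nr; i++) {
		if (out == 0 || inos[out - 1] != inos[i])
			inos[out++] = inos[i];
	}
	return out;
}
```

Two tasks that need overlapping sets of inode locks then always
contend on the lowest shared inode number first, which rules out ABBA
deadlocks between them.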
Signed-off-by: Zach Brown <zab@versity.com>
With trylock implemented we can add locking in readpage. After that it's
pretty easy to implement our own read/write functions which at this
point are more or less wrapping the kernel helpers in the correct
cluster locking.
Data invalidation is a bit interesting. If the lock we are invalidating
is an inode group lock, we use the lock boundaries to incrementally
search our inode cache. When an inode struct is found, we sync and
(optionally) truncate pages.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
[zab: adapted to newer lock call, fixed some error handling]
Signed-off-by: Zach Brown <zab@versity.com>