Orphaned items haven't been deleted for quite a while -- the call to the
orphan inode scanner has been commented out for ages. The deletion of
the orphan item didn't take rid zone locking into account as we moved
deletion from being strictly local to being performed by whoever last
used the inode.
This reworks orphan item management and brings back orphan inode
scanning to correctly delete orphaned inodes.
We get rid of the rid zone that was always _WRITE locked by each mount.
That made it impossible for other mounts to get a _WRITE lock to delete
orphan items. Instead we rename it to the orphan zone and have orphan
item callers get _WRITE_ONLY locks inside their inode locks. Now all
nodes can create and delete orphan items as they have _WRITE locks on
the associated inodes.
Then we refresh the orphan inode scanning function. It now runs
regularly in the background of all mounts. It avoids creating cluster
lock contention by finding candidates with unlocked forest hint reads
and by testing inode caches locally and via the open map before properly
locking and trying to delete the inode's items.
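As a rough sketch, the scanner's filtering looks something like the
following (all helper names here are illustrative, not the real scoutfs
symbols): only orphans that pass the cheap local checks get the
expensive locked deletion attempt.

static int maybe_delete_orphan(struct super_block *sb, u64 ino)
{
        /* skip if this mount still has the inode cached and in use */
        if (inode_cached_locally(sb, ino))
                return 0;

        /* skip if the open map says another mount still has it open */
        if (ino_open_on_other_mounts(sb, ino))
                return 0;

        /* only now take the cluster lock and try to delete its items */
        return lock_and_delete_inode_items(sb, ino);
}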
Signed-off-by: Zach Brown <zab@versity.com>
Previously we added an ilookup variant that ignored I_FREEING inodes
to avoid a deadlock between lock invalidation (lock->I_FREEING) and
eviction (I_FREEING->lock).
Now we're seeing similar deadlocks between eviction (I_FREEING->lock)
and fh_to_dentry's iget (lock->I_FREEING).
I think it's reasonable to ignore all inodes with I_FREEING set when
we're using our _test callback in ilookup or iget. We can remove the
_nofreeing ilookup variant and move its I_FREEING test into the
iget_test callback provided to both ilookup and iget.
Callers will get the same result; it will just happen without waiting
for a previously I_FREEING inode to leave. They'll get NULL from
ilookup instead of waiting. They'll allocate and start to initialize a
new instance of the inode and insert it alongside the previous instance.
We don't have inode number re-use so we don't have the problem where a
newly allocated inode number is relying on inode cache serialization to
not find a previously allocated inode that is being evicted.
This change does allow for concurrent iget of an inode number that is
being deleted on a local node. This could happen in fh_to_dentry with a
raw inode number. But this was already a problem between mounts because
they don't have a shared inode cache to serialize them. Once we fix
that between nodes, we fix it on a single node as well.
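A minimal sketch of the shared test callback this describes;
scoutfs_ino() is assumed here and the rest of the names are
illustrative. Both ilookup and iget use the same callback, so freeing
inodes are skipped rather than waited on.

static int scoutfs_iget_test(struct inode *inode, void *arg)
{
        u64 *ino = arg;

        /* act as though freeing inodes aren't in the cache at all */
        if (inode->i_state & I_FREEING)
                return 0;

        return scoutfs_ino(inode) == *ino;
}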
Signed-off-by: Zach Brown <zab@versity.com>
We've had a long-standing deadlock between lock invalidation and
eviction. Invalidating a lock wants to lookup inodes and drop their
resources while blocking locks. Eviction wants to get a lock to perform
final deletion while the inode has I_FREEING set, which blocks lookups.
We only saw this deadlock a handful of times in all the time we've
run the code, but it's much more common now that we're acquiring
locks in iput to test whether nlink is zero instead of only when nlink is
zero. I see unmount hang regularly when testing final inode deletion.
This adds a lookup variant for invalidation which will refuse to
return freeing inodes so they won't be waited on. Once they're freeing
they can't be seen by future lock users so they don't need to be
invalidated. This keeps the lock invalidation promise and avoids
sleeping on freeing inodes which creates the deadlock.
Signed-off-by: Zach Brown <zab@versity.com>
Previously we wouldn't try to remove cached dentries and inodes as
lock revocation removed cluster lock coverage. The next time
we tried to use the cached dentries or inodes we'd acquire
a lock and refresh them.
But now cached inodes prevent final inode deletion. If they linger
outside cluster locking then any final deletion will need to be deferred
until all its cached inodes are naturally dropped at some point in the
future across the cluster. It might take refreshing the dentries or for
memory pressure to push out the old cached inodes.
This tries to proactively drop cached dentries and inodes as we lose
cluster lock coverage, if they're not actively referenced. We need to be
careful not to perform final inode deletion during lock invalidation
because it would deadlock, so we defer any iput that could delete during
evict out to async work.
Now deletion can be done synchronously in the task that is performing
the unlink because previous use of the inode on remote mounts hasn't
left unused cached inodes sitting around.
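A minimal sketch of the deferral, assuming hypothetical iput_work and
inode fields on scoutfs_inode_info; the point is only that invalidation
queues the final iput instead of calling it.

static void iput_worker(struct work_struct *work)
{
        struct scoutfs_inode_info *si =
                container_of(work, struct scoutfs_inode_info, iput_work);

        /* may drop the last reference and evict, safely outside invalidation */
        iput(&si->inode);
}

static void defer_final_iput(struct inode *inode)
{
        struct scoutfs_inode_info *si = SCOUTFS_I(inode);

        /* iput_work is assumed to be INIT_WORK()ed at inode init time;
         * the caller's reference is handed to the async work */
        queue_work(system_unbound_wq, &si->iput_work);
}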
Signed-off-by: Zach Brown <zab@versity.com>
Today an inode's items are deleted once its nlink reaches zero and the
final iput is called in a local mount. This can delete inodes from
under other mounts which have opened the inode before it was unlinked on
another mount.
We fix this by adding cached inode tracking. Each mount maintains
groups of cached inode bitmaps at the same granularity as inode locking.
As a mount performs its final iput it gets a bitmap from the server
which indicates if any other mount has inodes in the group open.
This keeps the two fast paths, opening and closing linked files and
deleting a file that was unlinked locally, reasonably cheap: they pay
only the moderate cost of either maintaining the bitmap locally or
getting the open map once per lock group. Removing many files in a
group will only lock and get the open map once per group.
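The shape of the check is roughly the following sketch; the group size
and map layout here are assumptions for illustration, not the real
format.

#define OPEN_MAP_GROUP_NR       1024    /* assumed to match inode lock groups */

/* one bit per inode in the group, as returned by the server */
struct open_ino_map {
        __le64 group_nr;
        __le64 bits[OPEN_MAP_GROUP_NR / 64];
};

static bool ino_open_elsewhere(struct open_ino_map *map, u64 ino)
{
        return test_bit_le(ino % OPEN_MAP_GROUP_NR, map->bits);
}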
Signed-off-by: Zach Brown <zab@versity.com>
Add lock coverage which tracks if the inode has been refreshed and is
covered by the inode group cluster lock. This will be used by
drop_inode and evict_inode to discover that the inode is current and
doesn't need to be refreshed.
Signed-off-by: Zach Brown <zab@versity.com>
Support O_TMPFILE: Create an unlinked file and put it on the orphan list.
If it ever gains a link, take it off the orphan list.
Change MOVE_BLOCKS ioctl to allow moving blocks into offline extent ranges.
Ioctl callers must set a new flag to enable this operation mode.
RH-compat: tmpfile support is actually backported by RH into the 3.10 kernel.
We need to use some of their kabi-maintaining wrappers to use it:
use a struct inode_operations_wrapper instead of base struct
inode_operations, set S_IOPS_WRAPPER flag in i_flags. This lets
RH's modified vfs_tmpfile() find our tmpfile fn pointer.
Add a test that tests both creating tmpfiles as well as moving their
contents into a destination file via MOVE_BLOCKS.
xfstests common/004 now runs because tmpfile is supported.
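Roughly, the wiring looks like the sketch below; the wrapper layout
follows RH's struct as we understand it, and the scoutfs function and
helper names are illustrative.

static const struct inode_operations_wrapper scoutfs_dir_iops_wrapper = {
        .ops = {
                .lookup = scoutfs_lookup,
                /* ... the rest of the existing dir inode_operations ... */
        },
        .tmpfile = scoutfs_tmpfile,
};

static void set_dir_ops(struct inode *inode)
{
        inode->i_op = &scoutfs_dir_iops_wrapper.ops;
        /* S_IOPS_WRAPPER tells RH's vfs_tmpfile() the wrapper is there */
        inode->i_flags |= S_IOPS_WRAPPER;
}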
Signed-off-by: Andy Grover <agrover@versity.com>
Each transaction maintains a global list of inodes to sync. It checks
the inode and adds it in each write_end call per OS page. Locking and
unlocking the global spinlock was showing up in profiles. At the very
least, we can get the lock just once per large file that's written
during a transaction. This will reduce spinlock traffic on the lock by
the number of pages written per file. We'll want a better solution in
the long run, but this helps for now.
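A minimal sketch of the cheaper path, with illustrative field names:
write_end only takes the global lock the first time an inode joins the
transaction's list, not once per page.

static void trans_track_inode(struct scoutfs_inode_info *si,
                              struct list_head *trans_inodes,
                              spinlock_t *trans_lock)
{
        /* unlocked fast path: already on this transaction's list */
        if (!list_empty(&si->trans_entry))
                return;

        spin_lock(trans_lock);
        if (list_empty(&si->trans_entry))
                list_add_tail(&si->trans_entry, trans_inodes);
        spin_unlock(trans_lock);
}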
Signed-off-by: Zach Brown <zab@versity.com>
Finally get rid of the last silly vestige of the ancient 'ci' name and
update the scoutfs_inode_info pointers to si. This is just a global
search and replace; nothing functional changes.
Signed-off-by: Zach Brown <zab@versity.com>
Now that we have full precision extents a writer with i_mutex and a page
lock can be modifying large extent items which cover much of the
surrounding pages in the file. Readers can be in a different page with
only the page lock and try to work with extent items as the writer is
deleting and creating them.
We add a per-inode rwsem which just protects file extent item
manipulation. We try to acquire it as close to the item use as possible
in data.c which is the only place we work with file extent items.
This stops rare read corruption we were seeing where get_block in a
reader was racing with extent item deletion in a stager at a further
offset in the file.
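A hedged sketch of the locking shape; the rwsem field and the extent
helpers are illustrative stand-ins for the calls in data.c.

static int read_extent_mapping(struct inode *inode, u64 iblock,
                               struct scoutfs_extent *ext)
{
        struct scoutfs_inode_info *si = SCOUTFS_I(inode);
        int ret;

        down_read(&si->extent_sem);     /* e.g. get_block walking items */
        ret = lookup_file_extent(inode, iblock, ext);
        up_read(&si->extent_sem);

        return ret;
}

static int drop_extent_items(struct inode *inode, u64 start, u64 end)
{
        struct scoutfs_inode_info *si = SCOUTFS_I(inode);
        int ret;

        down_write(&si->extent_sem);    /* e.g. truncate or stage rewriting items */
        ret = delete_file_extents(inode, start, end);
        up_write(&si->extent_sem);

        return ret;
}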
Signed-off-by: Zach Brown <zab@versity.com>
Audit code for structs allocated on stack without initialization, or
using kmalloc() instead of kzalloc().
- avl.c: zero padding in avl_node on insert.
- btree.c: Verify item padding is zero, or WARN_ONCE.
- inode.c: scoutfs_inode contains scoutfs_timespecs, which have padding.
- net.c: zero pad in net header.
- net.h: scoutfs_net_addr has padding, zero it in scoutfs_addr_from_sin().
- xattr.c: scoutfs_xattr has padding, zero it.
- forest.c: item_root in forest_next_hint() appears to either be
assigned-to or unused, so no need to zero it.
- key.h: Ensure padding is zeroed in scoutfs_key_set_{zeros,ones}
Signed-off-by: Andy Grover <agrover@versity.com>
Use the new item cache for all the item work in the fs instead of
calling into the forest of btrees. Most of this is mechanical
conversion from the _forest calls to the _item calls. The item cache
no longer supports the kvec argument for describing values so all the
callers pass in the value pointer and length directly.
The item cache doesn't support saving items as they're deleted and later
restoring them from an error unwinding path. There were only two users
of this. Directory entries can easily guarantee that deletion won't
fail by dirtying the items first in the item cache. Xattr updates were
a little trickier. They can combine dirtying, creating, updating, and
deleting to atomically switch between items that describe different
versions of a multi-item value. This also fixed a bug in the srch
xattrs where replacing an xattr would create a new id for the xattr and
leave existing srch items referencing a now deleted id. Replacing now
reuses the old id.
And finally we add back in the locking and transaction item cache
integration.
Signed-off-by: Zach Brown <zab@versity.com>
We had previously seen lock contention between mounts that were either
resolving paths by looking up entries in directories or writing xattrs
in file inodes as they did archiving work.
The previous attempt to avoid this contention was to give each directory
its own inode number allocator which ensured that inodes created for
entries in the directory wouldn't share lock groups with inodes in other
directories.
But this creates the problem of operating on few files per lock for
reasonably small directories. It also creates more server commits as
each new directory gets its inode allocation reservation.
The fix is to have mount-wide separate allocators for directories and
for everything else. This puts directories and files in separate groups
and locks, regardless of directory population.
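A minimal sketch of the split, with made-up allocator names: the mount
keeps two pools and picks one based on what's being created.

static struct ino_allocator *ino_alloc_for_mode(struct scoutfs_sb_info *sbi,
                                                umode_t mode)
{
        /* directories and everything else draw from separate pools so they
         * never share inode groups or their locks */
        return S_ISDIR(mode) ? &sbi->dir_ino_alloc : &sbi->ino_alloc;
}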
Signed-off-by: Zach Brown <zab@versity.com>
Introduce different constants for small and large metadata block
sizes.
The small 4KB size is used for the super block, quorum blocks, and as
the granularity of file data block allocation. The larger 64KB size is
used for the radix, btree, and forest bloom metadata block structures.
The bulk of this is obvious transitions from the old single constant to
the appropriate new constant, but there are a few more involved
changes, though just barely.
The block crc calculation now needs the caller to pass in the size of
the block. The radix function to return free bytes instead returns free
blocks and the caller is responsible for knowing how big its managed
blocks are.
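Illustratively, the split and the new crc signature look something like
this; the real constant names and the crc helper may differ.

#define SCOUTFS_BLOCK_SM_SHIFT  12                              /* 4KB */
#define SCOUTFS_BLOCK_SM_SIZE   (1 << SCOUTFS_BLOCK_SM_SHIFT)
#define SCOUTFS_BLOCK_LG_SHIFT  16                              /* 64KB */
#define SCOUTFS_BLOCK_LG_SIZE   (1 << SCOUTFS_BLOCK_LG_SHIFT)

/* callers now tell the crc helper how large the block is */
__le32 scoutfs_block_calc_crc(struct scoutfs_block_header *hdr, u32 size);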
Signed-off-by: Zach Brown <zab@versity.com>
Convert fs callers to work with the btree forest calls instead of the
lsm item cache calls. This is mostly a mechanical syntax conversion.
The inode dirtying path does now update the item rather than simply
dirtying it.
Signed-off-by: Zach Brown <zab@versity.com>
Use the mount's generated random id in persistent items and the lock
that protects them instead of the assigned node_id.
Signed-off-by: Zach Brown <zab@versity.com>
Add an ioctl that can be used by userspace to restore a file to its
offline state. To do that it needs to set inode fields that are
otherwise not exposed and create an offline extent.
Signed-off-by: Zach Brown <zab@versity.com>
One of the core features of scoutfs is the ability to transparently
migrate file contents to and from an archive tier. For this to be
transparent we need file system operations to trigger staging the file
contents back into the file system as needed.
This adds the infrastructure which operations use to wait for offline
extents to come online and which provides userspace with a list of
blocks that the operations are waiting for.
We add some waiting infrastructure that callers use to lock, check for
offline extents, and unlock and wait before checking again to see if
they're still offline. We add these checks and waiting to data io
operations that could encounter offline extents.
This has to be done carefully so that we don't wait while holding locks
that would prevent staging. We use per-task structures to discover when
we are the first user of a cluster lock on an inode, indicating that
it's safe for us to wait because we don't hold any locks.
And while we're waiting our operation is tracked and reported to
userspace through an ioctl. This is a non-blocking ioctl, it's up to
userspace to decide how often to check and how large a region to stage.
Waiters are woken up when the file contents could have changed, not
specifically when we know that the extent has come online. This lets us
wake waiters when their lock is revoked so that they can block waiting
to reacquire the lock and test the extents again. It lets us provide
coherent demand staging across the cluster without fine grained waiting
protocols sent between the nodes. It may result in some spurious wakeups
and work but hopefully it won't, and it's a very simple and functional
first pass.
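A hedged sketch of the lock/check/unlock/wait loop; the locking calls
are shown in roughly their scoutfs shape, but every name here should be
read as illustrative.

static int lock_and_wait_for_online(struct inode *inode, u64 iblock, u64 last,
                                    struct scoutfs_lock **lock_ret)
{
        struct scoutfs_inode_info *si = SCOUTFS_I(inode);
        struct scoutfs_lock *lock;
        int seq;
        int ret;

        for (;;) {
                ret = scoutfs_lock_inode(inode->i_sb, SCOUTFS_LOCK_READ, 0,
                                         inode, &lock);
                if (ret)
                        return ret;

                /* sample before checking so a wake between the check and
                 * the sleep isn't lost */
                seq = atomic_read(&si->data_wait_seq);

                if (!extents_offline(inode, iblock, last)) {
                        *lock_ret = lock;       /* caller proceeds under the lock */
                        return 0;
                }

                /* record the wait so the ioctl can report it, then drop the
                 * lock before sleeping so that staging can make progress */
                track_data_waiter(inode, iblock, last);
                scoutfs_unlock(inode->i_sb, lock, SCOUTFS_LOCK_READ);

                ret = wait_event_interruptible(si->data_waitq,
                                atomic_read(&si->data_wait_seq) != seq);
                untrack_data_waiter(inode, iblock, last);
                if (ret)
                        return ret;
        }
}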
Signed-off-by: Zach Brown <zab@versity.com>
Convert client locking to call the server's lock service instead of
using a fs/dlm lockspace.
The client code gets some shims to send and receive lock messages to and
from the server. Callers use our lock mode constants instead of the
DLM's.
Locks are now identified by their starting key instead of an additional
scoped lock name so that we don't have more mapping structures to track.
The global rename lock uses keys that are defined by the format as only
used for locking.
The biggest change is in the client lock state machine. Instead of
calling the dlm and getting callbacks we send messages to our server and
get called from incoming message processing. We don't have everything
come through a per-lock work queue. Instead we send requests either
from the blocking lock caller or from a shrink work queue. Incoming
messages are called in the net layer's blocking work contexts so we
don't need to do any more work to defer to other contexts.
The different processing contexts lead to a slightly different lock
life cycle. We refactor and separate allocation and freeing from
tracking and removing locks in data structures. We add a _get and _put
to track active use of locks, and then async references to locks by
holders and requests are tracked separately.
Our lock service's rules are a bit simpler in that we'll only ever send
one request at a time and the server will only ever send one request at
a time. We do have to do a bit of work to make sure we process back to
back grant responses and invalidation requests from the server.
As of this change the lock setup and destruction paths are a little
wobbly. They'll be shored up as we add lock recovery between the client
and server.
Signed-off-by: Zach Brown <zab@versity.com>
The inode deletion path had bit rotted. Delete the ifdefs that were
stopping it from deleting all the items associated with an inode. There
can be a lot of xattr and data mapping items so we have them manage
their own transactions (data already did). The xattr deletion code was
trying to get a lock while the caller already held it so delete that.
Then we accurately account for the small number of remaining items that
finally delete the inode.
Signed-off-by: Zach Brown <zab@versity.com>
The addition of fallocate() means that offline extents can now be
unwritten and allocated and that extents can be found outside of
i_size.
Truncating needs to know about the possible flag combinations, writing
preallocation needs to know to update an existing extent or allocate up
to the next extent, get_block can't map unwritten extents for read,
extent conversion needs to also clear offline, and truncate needs to
drop extents outside i_size even if truncating to the existing file
size.
Signed-off-by: Zach Brown <zab@versity.com>
There was a typo in the addition of i_blocks tracking that would set
online blocks to the value of offline blocks when reading an existing
inode into memory.
Signed-off-by: Zach Brown <zab@versity.com>
Remove the functions that operate on online and offline blocks
independently now that the file data mapping code isn't using it any
more.
Signed-off-by: Zach Brown <zab@versity.com>
Add functions that atomically change and query the online and offline
block counts as a pair. They're semantically linked and we shouldn't
present counts that don't match if they're in the process of being
updated.
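A minimal sketch of the pair, assuming a hypothetical spinlock on the
inode info so the two counts can't be observed mid-update.

static void inode_add_onoff(struct inode *inode, s64 online, s64 offline)
{
        struct scoutfs_inode_info *si = SCOUTFS_I(inode);

        spin_lock(&si->blocks_lock);
        si->online_blocks += online;
        si->offline_blocks += offline;
        spin_unlock(&si->blocks_lock);
}

static void inode_get_onoff(struct inode *inode, s64 *online, s64 *offline)
{
        struct scoutfs_inode_info *si = SCOUTFS_I(inode);

        spin_lock(&si->blocks_lock);
        *online = si->online_blocks;
        *offline = si->offline_blocks;
        spin_unlock(&si->blocks_lock);
}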
Signed-off-by: Zach Brown <zab@versity.com>
Inode allocations come from batches that are reserved for directories.
As the batch is exhausted a new one is acquired and allocated from.
The batch size was arbitrarily set to the human friendly 10000. This
doesn't interact well with the lock group size being a power of two.
Each allocation batch will straddle an inode group with its previous and
next inode batch.
This often doesn't matter because directories very rarely have more than
9000 entries. But as entries pass 10000 they'd see surprising
contention with other inode ranges in directories.
Tweak the allocation size to be a multiple of the lock group size to
stop this from happening.
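Illustratively, with an assumed power-of-two group size:

/* assumed: inode lock groups cover 1024 inode numbers */
#define SCOUTFS_LOCK_INODE_GROUP_NR     1024

/* an exact multiple of the group size so a batch never straddles groups */
#define SCOUTFS_INO_ALLOC_BATCH         (8 * SCOUTFS_LOCK_INODE_GROUP_NR)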
Signed-off-by: Zach Brown <zab@versity.com>
Variable length keys lead to having a key struct point to the buffer
that contains the key. With dirents and xattrs now using small keys we
can convert everyone to using a single key struct and significantly
simplify the system.
We no longer have a separate generic key buf struct that points to
specific per-type key storage. All items use the key struct and fill
out the appropriate fields. All the code that paired a generic key buf
struct and a specific key type struct is collapsed down to a key struct.
There's no longer the difference between a key buf that shares a
read-only key, has its own precise allocation, or has a max size
allocation for incrementing and decrementing.
Each key user now has an init function that fills out its fields. It looks a
lot like the old pattern but we no longer have separate key storage that
the buf points to.
A bunch of code now takes the address of static key storage instead of
managing allocated keys. Conversely, swapping now uses the full keys
instead of pointers to the keys.
We don't need all the functions that worked on the generic key buf
struct because they had different lengths. Copy, clone, length init,
memcpy, all of that goes away.
The item API had some functions that tested the length of keys and
values. The key length tests vanish, and that gets rid of the _same()
call. The _same_min() call only had one user who didn't also test for
the value length being too large. Let's leave caller key constraints in
callers instead of trying to hide them on the other side of a bunch of
item calls.
We no longer have to track the number of key bytes when calculating if
an item population will fit in segments. This removes the key length
from reservations, transactions, and segment writing.
The item cache key querying ioctls no longer have to deal with variable
length keys. They simply specify the start key, the ioctls return the
number of keys copied instead of bytes, and the caller is responsible
for incrementing the next search key.
The segment no longer has to store the key length. It stores the key
struct in the item header.
The fancy variable length key formatting and printing can be removed.
We have a single format for the universal key struct. The SK_ wrappers
that bracketed calls to use preempt safe per cpu buffers can turn back
into their normal calls.
Manifest entries are now a fixed size. We can simply split them between
btree keys and values and initialize them instead of allocating them.
This means that level 0 entries don't have their own format that sorts
by the seq. They're sorted by the key like all the other levels.
Compaction needs to sweep all of them looking for the oldest and read
can stop sweeping once it can no longer overlap. This makes rare
compaction more expensive and common reading less expensive, which is
the right tradeoff.
Signed-off-by: Zach Brown <zab@versity.com>
Originally the item interfaces were written with full support for
vectored keys and values. Callers constructed keys and values made up
of header structs and data buffers. Segments supported much larger
values which could span pages when stored in memory.
But over time we've pulled that support back. Keys are described by a
key struct instead of a multi-element kvec. Values are now much smaller
and don't span pages. The item interfaces still use the kvec arrays but
everyone only uses a single element.
So let's make the world a whole lot less awful by having the item
interfaces support only a single value buffer specified by a kvec. A
bunch of code disappears and the result is much easier to understand.
Signed-off-by: Zach Brown <zab@versity.com>
Every caller of scoutfs_item_lookup_exact() provided a size that matches
the value buffer. Let's remove the redundant arg and use the value
buffer length as the exact size to match.
Signed-off-by: Zach Brown <zab@versity.com>
Honoring the XATTR_REMOVE flag in xattr deletion exposed an interesting
bug in getxattr(). We were unconditionally returning the max xattr value
size when someone tried to probe an existing xattr's value size by
calling getxattr with size == 0. Some kernel paths did this to probe
the existence of xattrs. They expected to get an error if the xattr
didn't exist, but we were giving them the max possible size. This
kernel path then tried to remove the xattrs with XATTR_REMOVE and that
now failed and caused a bunch of errors in xfstests.
The fix is to return the real xattr value size when getxattr is called
with size == 0. To do that with the old format we'd have to iterate
over all the items which happened to be pretty awkward in the current
code paths.
So we're taking this opportunity to land a change that had been brewing
for a while. We now form the xattr keys from the hash of the name and
the item values now store a logically contiguous header, the name, and the
value. This makes it very easy for us to have the full xattr value
length in the header and return it from getxattr when size == 0.
Now all tests pass while honoring the XATTR_CREATE and XATTR_REMOVE
flags.
And the code is a whole lot easier to follow. And we've removed another
barrier for moving to small fixed size keys.
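A hedged sketch of the new layout and the size == 0 probe; the struct
fields here are illustrative, not the exact on-disk format.

struct xattr_value_hdr {
        __u8    name_len;
        __le16  val_len;
        __u8    data[];         /* name, then value, logically contiguous */
} __packed;

static ssize_t xattr_value_size(struct xattr_value_hdr *hdr, size_t size)
{
        size_t val_len = le16_to_cpu(hdr->val_len);

        if (size == 0)          /* probe: return the real value length */
                return val_len;
        if (size < val_len)
                return -ERANGE;
        return val_len;
}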
Signed-off-by: Zach Brown <zab@versity.com>
We weren't doing anything with the inode blocks field. We weren't even
initializing it which explains why we'd sometimes see garbage i_blocks
values in scoutfs inodes in segments.
The logical blocks field reflects the contents of the file regardless of
whether it's online or not. It's the sum of our online and offline block
tracking.
So we can initialize it to our persistent online and offline counts and
then keep it in sync as blocks are allocated and freed.
Signed-off-by: Zach Brown <zab@versity.com>
We had an excessive number of layers between scoutfs and the dlm code in
the kernel. We had dlmglue, the scoutfs locks, and task refs. Each
layer had structs that track the lifetime of the layer below it. We
were about to add another layer to hold on to locks just a bit longer so
that we can avoid down conversion and transaction commit storms under
contention.
This collapses all those layers into a simple state machine in lock.c that
manages the mode of dlm locks on behalf of the file system.
The users of the lock interface are mainly unchanged. We did change
from a heavier trylock to a lighter nonblock lock attempt and have to
change the single rare readpage use. Lock fields change so a few
external users of those fields change.
This not only removes a lot of code, it also contains functional
improvements. For example, it can now convert directly to CW locks with
a single lock request instead of having to use two by first converting
to NL.
It introduces the concept of an unlock grace period. Locks won't be
dropped on behalf of other nodes soon after being unlocked so that tasks
have a chance to batch up work before the other node gets a chance.
This can result in two orders of magnitude improvements in the time it
takes to, say, change a set of xattrs on the same file population from
two nodes concurrently.
There are significant changes to trace points, counters, and debug files
that follow the implementation changes.
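A hedged sketch of the grace period, with an assumed per-lock deadline
field that's refreshed at each unlock:

static void note_unlocked(struct scoutfs_lock *lck)
{
        /* the grace window length is an assumption for illustration */
        lck->grace_deadline = jiffies + msecs_to_jiffies(10);
}

/* don't downconvert for another node while local tasks can still batch work */
static bool lock_in_grace_period(struct scoutfs_lock *lck)
{
        return time_before(jiffies, lck->grace_deadline);
}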
Signed-off-by: Zach Brown <zab@versity.com>
Having an inode number allocation pool in the super block meant that all
allocations across the mount are interleaved. This means that
concurrent file creation in different directories will create
overlapping inode numbers. This leads to lock contention as reasonable
work loads will tend to distribute work by directories.
The easy fix is to have per-directory inode number allocation pools. We
take the opportunity to clean up the network request so that the caller
gets the allocation instead of having it be fed back in via a weird
callback.
Signed-off-by: Zach Brown <zab@versity.com>
We aren't using the size index. It has runtime and code maintenance
costs that aren't worth paying. Let's remove it.
Removing it from the format and no longer maintaining it are
straightforward.
The bulk of this patch is actually the act of removing it from the index
locking functions. We no longer have to predict the size that will be
stored during the transaction to lock the index items that will be
created during the transaction. A bunch of code to predict the size and
then pass it into locking and transactions goes away. Like other inode
fields we now update the size as it changes.
Signed-off-by: Zach Brown <zab@versity.com>
We have to map many index item keys down to a lock that then has a start
and end key range. We also use this mapping over in index item locking
to avoid trying to acquire locks multiple times.
We were duplicating the mapping calculation in these two places. This
refactors these functions to use one range calculation function. It's
going to be used in future patches to fix the mapping of the size index
items.
This should result in no functional changes.
Signed-off-by: Zach Brown <zab@versity.com>
This will give us concurrency yet still allow our ioctls to drive cache
syncing/invalidation on other nodes. Our lock_coverage() checks evolve
to handle direct dlm modes, allowing us to verify correct usage of CW
locks.
As a test, we can run createmany on two nodes at the same time, each
working in their own directory. The following commands were run on each
node:
$ mkdir /scoutfs/`uname -n`
$ cd /scoutfs/`uname -n`
$ /root/createmany -o ./file_$i 100000
Before this patch that test wouldn't finish in any reasonable amount of
time and I would kill it after some number of hours.
After this patch, we make swift progress through the test:
[root@fstest3 fstest3.site]# /root/createmany -o ./file_$i 100000
- created 10000 (time 1509394646.11 total 0.31 last 0.31)
- created 20000 (time 1509394646.38 total 0.59 last 0.28)
- created 30000 (time 1509394646.81 total 1.01 last 0.43)
- created 40000 (time 1509394647.31 total 1.51 last 0.50)
- created 50000 (time 1509394647.82 total 2.02 last 0.51)
- created 60000 (time 1509394648.40 total 2.60 last 0.58)
- created 70000 (time 1509394649.06 total 3.26 last 0.66)
- created 80000 (time 1509394649.72 total 3.93 last 0.66)
- created 90000 (time 1509394650.36 total 4.56 last 0.64)
total: 100000 creates in 35.02 seconds: 2855.80 creates/second
[root@fstest4 fstest4.fstestnet]# /root/createmany -o ./file_$i 100000
- created 10000 (time 1509394647.35 total 0.75 last 0.75)
- created 20000 (time 1509394647.89 total 1.28 last 0.54)
- created 30000 (time 1509394648.46 total 1.86 last 0.58)
- created 40000 (time 1509394648.96 total 2.35 last 0.49)
- created 50000 (time 1509394649.51 total 2.90 last 0.55)
- created 60000 (time 1509394650.07 total 3.46 last 0.56)
- created 70000 (time 1509394650.79 total 4.19 last 0.72)
- created 80000 (time 1509394681.26 total 34.66 last 30.47)
- created 90000 (time 1509394681.63 total 35.03 last 0.37)
total: 100000 creates in 35.50 seconds: 2816.76 creates/second
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
Simple attr changes are mostly handled by the VFS; we just have to mirror
them into our inode. Truncates are done in a separate set of transactions.
We use a flag to indicate an in-progress truncate. This allows us to
detect and continue the truncate should the node crash.
Index locking is a bit complicated, so we add a helper function to grab
index locks and start a transaction.
With this patch we now pass the following xfstests:
generic/014
generic/101
generic/313
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
Call it scoutfs_inode_index_try_lock_hold since it may fail and unwind
as part of normal (not an error) operation. This lets us re-use the
name in an upcoming patch.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
We only set the .getattr method to our locked getattr filler for regular
files. Set it for all files so that stat, etc, will see the current
inode for all file types.
Signed-off-by: Zach Brown <zab@versity.com>
scoutfs_item_create() hasn't been working with lock coverage. It
wouldn't return -ENOENT if it didn't have the lock cached. It would
create items outside lock coverage so they wouldn't be invalidated and
re-read if another node modified the item.
Add a lock arg and teach it to populate the cache so that it's correctly
consistent.
Signed-off-by: Zach Brown <zab@versity.com>
Add lock coverage for inode index items.
Sadly, this isn't trivial. We have to predict the value of the indexed
fields before the operation to lock those items. One value in
particular we can't reliably predict: the sequence of the transaction we
enter after locking. Also operations can create an absolute ton of
index item updates -- rename can modify nr_inodes * items_per_inode * 2
items, so maybe 24 today. And these items can be arbitrarily positioned
in the key space.
So to handle all this we add functions to gather the predicted item values
we'll need to lock, sort and lock them all, then pass appropriate locks
down to the item functions during inode updates.
The trickiest bit of the index locking code is having to retry if the
sequence number changes. Preparing locks has to guess the sequence
number of its upcoming trans and then make item update decisions based
on that. If we enter with a different sequence number then we need
to back off and retry with the correct sequence number (we may find that
we'll need to update the indexed meta seq and need to have it locked).
The use of the functions is straightforward. Sites figure out the
predicted sizes, lock, pass the locks to inode updates, and unlock.
While we're at it we replace the individual item field tracking
variables in the inode info with an array of indexed values. The code
ends up a bit nicer. It also gets rid of the indexed time fields that
were left behind and were unused.
It's worth noting that we're getting exclusive locks on the index
updates. Locking the meta/data seq updates results in complete global
serialization of all changes. We'll need concurrent writer locks to get
concurrency back.
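A hedged sketch of the retry, with illustrative names for the
prediction, trans, and index lock helpers:

static int lock_index_and_hold_trans(struct super_block *sb,
                                     struct inode *inode,
                                     struct index_locks *ind_locks)
{
        u64 pred_seq;
        int ret;

retry:
        pred_seq = predicted_trans_seq(sb);
        prepare_index_locks(ind_locks, inode, pred_seq);

        ret = lock_index_items(sb, ind_locks);
        if (ret)
                return ret;

        ret = scoutfs_hold_trans(sb);
        if (ret) {
                unlock_index_items(sb, ind_locks);
                return ret;
        }

        if (current_trans_seq(sb) != pred_seq) {
                /* guessed wrong: back off and retry with the seq we'll
                 * actually be updating index items under */
                scoutfs_release_trans(sb);
                unlock_index_items(sb, ind_locks);
                goto retry;
        }

        return 0;
}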
Signed-off-by: Zach Brown <zab@versity.com>
Use per_task storage on the inode to pass locks from high level read and
write lock holders down into the callbacks that operate under the locks
so that the locks can then be passed to the item functions.
Signed-off-by: Zach Brown <zab@versity.com>
Add a full lock argument to scoutfs_update_inode_item() and use it to
pass the lock's end key into item_update(). This'll get changed into
passing the full lock into _update soon.
Signed-off-by: Zach Brown <zab@versity.com>
Add the full lock argument to _item_dirty() so that it can verify lock
coverage in addition to limiting item cache population to the range
covered by the lock.
This also ropes in scoutfs_dirty_inode_item() which is a thin wrapper
around _item_dirty().
Signed-off-by: Zach Brown <zab@versity.com>
Add the full lock argument to _item_next*() so that it can verify lock
coverage in addition to limiting item cache population to the range
covered by the lock.
Signed-off-by: Zach Brown <zab@versity.com>