Commit Graph

119 Commits

Author SHA1 Message Date
Zach Brown
07210b5734 Reliably delete orphaned inodes
Orphaned items haven't been deleted for quite a while -- the call to the
orphan inode scanner has been commented out for ages.  The deletion of
the orphan item didn't take rid zone locking into account as we moved
deletion from being strictly local to being performed by whoever last
used the inode.

This reworks orphan item management and brings back orphan inode
scanning to correctly delete orphaned inodes.

We get rid of the rid zone that was always _WRITE locked by each mount.
That made it impossible for other mounts to get a _WRITE lock to delete
orphan items.  Instead we rename it to the orphan zone and have orphan
item callers get _WRITE_ONLY locks inside their inode locks.  Now all
nodes can create and delete orphan items as they have _WRITE locks on
the associated inodes.

Then we refresh the orphan inode scanning function.  It now runs
regularly in the background of all mounts.  It avoids creating cluster
lock contention by finding candidates with unlocked forest hint reads
and by testing inode caches locally and via the open map before properly
locking and trying to delete the inode's items.
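
Roughly, the scan's filtering looks like the following sketch, where every helper
name is a stand-in for the real calls:

  /* illustrative only; all helper names are stand-ins */
  static void orphan_scan(struct super_block *sb)
  {
          u64 ino = 0;

          /* unlocked forest hint read returns the next candidate after ino */
          while (next_orphan_hint_unlocked(sb, &ino) == 0) {
                  if (inode_cached_locally(sb, ino))      /* cheap local check */
                          continue;
                  if (inode_open_elsewhere(sb, ino))      /* open map check */
                          continue;

                  /* only now take cluster locks and try the real deletion */
                  try_delete_orphan(sb, ino);
          }
  }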

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-02 10:52:46 -07:00
Zach Brown
603af327ac Ignore I_FREEING in all inode hash lookups
Previously we added an ilookup variant that ignored I_FREEING inodes
to avoid a deadlock between lock invalidation (lock->I_FREEING) and
eviction (I_FREEING->lock).

Now we're seeing similar deadlocks between eviction (I_FREEING->lock)
and fh_to_dentry's iget (lock->I_FREEING).

I think it's reasonable to ignore all inodes with I_FREEING set when
we're using our _test callback in ilookup or iget.  We can remove the
_nofreeing ilookup variant and move its I_FREEING test into the
iget_test callback provided to both ilookup and iget.
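
A minimal sketch of that shared test callback (the ino accessor name is an
assumption):

  static int scoutfs_iget_test(struct inode *inode, void *arg)
  {
          u64 *ino = arg;

          /* never match an inode that's being torn down; the caller sees
           * NULL from ilookup5() or builds a new inode in iget5_locked()
           * instead of waiting on I_FREEING */
          if (inode->i_state & I_FREEING)
                  return 0;

          return scoutfs_ino(inode) == *ino;      /* accessor name assumed */
  }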

Callers will get the same result; it will just happen without waiting
for a previously I_FREEING inode to leave.  They'll get NULL from
ilookup instead of waiting.  They'll allocate and start to initialize a
newer instance of the inode and insert it alongside the previous
instance.

We don't have inode number re-use so we don't have the problem where a
newly allocated inode number is relying on inode cache serialization to
not find a previously allocated inode that is being evicted.

This change does allow for concurrent iget of an inode number that is
being deleted on a local node.  This could happen in fh_to_dentry with a
raw inode number.  But this was already a problem between mounts because
they don't have a shared inode cache to serialize them.  Once we fix
that between nodes, we fix it on a single node as well.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-28 12:22:10 -07:00
Zach Brown
4389c73c14 Fix deadlock between lock invalidate and evict
We've had a long-standing deadlock between lock invalidation and
eviction.  Invalidating a lock wants to lookup inodes and drop their
resources while blocking locks.  Eviction wants to get a lock to perform
final deletion while the inode has I_FREEING set, which blocks lookups.

We only saw this deadlock a handful of times in all the time we've run
the code, but it's much more common now that we're acquiring locks in
iput to test that nlink is zero instead of only when nlink is zero.  I
see unmounts hang regularly when testing final inode deletion.

This adds a lookup variant for invalidation which will refuse to
return freeing inodes so they won't be waited on.  Once they're freeing
they can't be seen by future lock users so they don't need to be
invalidated.  This keeps the lock invalidation promise and avoids
sleeping on freeing inodes which creates the deadlock.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-21 12:17:33 -07:00
Zach Brown
715c29aad3 Proactively drop dentry/inode caches outside locks
Previously we wouldn't try to remove cached dentries and inodes as
lock revocation removed cluster lock coverage.  The next time
we tried to use the cached dentries or inodes we'd acquire
a lock and refresh them.

But now cached inodes prevent final inode deletion.  If they linger
outside cluster locking then any final deletion has to be deferred until
all of the inode's cached copies across the cluster are naturally
dropped at some point in the future.  That might not happen until the
dentries are refreshed or memory pressure pushes out the old cached
inodes.

This tries to proactively drop cached dentries and inodes as we lose
cluster lock coverage if they're not actively referenced.  We need to be
careful not to perform final inode deletion during lock invalidation
because it will deadlock, so any iput that could delete during evict is
deferred out to async work.

Now deletion can be done synchronously in the task that is performing
the unlink because previous use of the inode on remote mounts hasn't
left unused cached inodes sitting around.
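
The deferral itself is small; roughly (struct and helper names are invented
for the sketch):

  struct deferred_iput {
          struct work_struct work;
          struct inode *inode;
  };

  static void deferred_iput_worker(struct work_struct *work)
  {
          struct deferred_iput *di = container_of(work, struct deferred_iput, work);

          iput(di->inode);        /* may evict and perform final deletion */
          kfree(di);
  }

  static void defer_iput(struct inode *inode)
  {
          struct deferred_iput *di = kmalloc(sizeof(*di), GFP_NOFS);

          if (di) {
                  di->inode = inode;
                  INIT_WORK(&di->work, deferred_iput_worker);
                  schedule_work(&di->work);
          } else {
                  iput(inode);    /* allocation failure fallback */
          }
  }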

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-21 12:17:33 -07:00
Zach Brown
22371fe5bd Fully destroy inodes after all mounts evict
Today an inode's items are deleted once its nlink reaches zero and the
final iput is called in a local mount.  This can delete inodes from
under other mounts which have opened the inode before it was unlinked on
another mount.

We fix this by adding cached inode tracking.  Each mount maintains
groups of cached inode bitmaps at the same granularity as inode locking.
As a mount performs its final iput it gets a bitmap from the server
which indicates if any other mount has inodes in the group open.

This keeps the two fast paths of opening and closing linked files and
of deleting a file that was unlinked locally to a moderate cost: the
bitmap is maintained locally, and the open map is only fetched from the
server once per lock group.  Removing many files in a group will only
lock and get the open map once per group.
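
Conceptually the open map is just a bitmap per lock group, something like
the following (the group size and names are assumptions):

  #define INODES_PER_GROUP 1024          /* assumed: matches inode lock granularity */

  struct open_map {
          u64 group_nr;                          /* ino / INODES_PER_GROUP */
          DECLARE_BITMAP(bits, INODES_PER_GROUP);
  };

  /* after the final local iput: does any other mount still hold this inode? */
  static bool ino_open_elsewhere(struct open_map *map, u64 ino)
  {
          return test_bit((int)(ino % INODES_PER_GROUP), map->bits);
  }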

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-21 12:17:33 -07:00
Zach Brown
da1af9b841 Add scoutfs inode ino lock coverage
Add lock coverage which tracks if the inode has been refreshed and is
covered by the inode group cluster lock.  This will be used by
drop_inode and evict_inode to discover that the inode is current and
doesn't need to be refreshed.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-13 12:10:35 -07:00
Andy Grover
0deb232d3f Support O_TMPFILE and allow MOVE_BLOCKS into released extents
Support O_TMPFILE: Create an unlinked file and put it on the orphan list.
If it ever gains a link, take it off the orphan list.
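
The tmpfile path boils down to creating an inode with no entry and an orphan
item; roughly (the scoutfs helpers below are stand-ins, not the real calls):

  static int scoutfs_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
  {
          struct inode *inode;
          int ret;

          inode = scoutfs_new_inode(dir, mode);   /* stand-in: alloc inode, nlink 0 */
          if (IS_ERR(inode))
                  return PTR_ERR(inode);

          ret = scoutfs_orphan_inode(inode);      /* stand-in: create the orphan item */
          if (ret) {
                  iput(inode);
                  return ret;
          }

          d_tmpfile(dentry, inode);               /* vfs: unlinked but open */
          return 0;
  }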

Change MOVE_BLOCKS ioctl to allow moving blocks into offline extent ranges.
Ioctl callers must set a new flag to enable this operation mode.

RH-compat: tmpfile support is actually backported by RH into the 3.10
kernel.  We need to use some of their kabi-maintaining wrappers to use
it: use a struct inode_operations_wrapper instead of the base struct
inode_operations, and set the S_IOPS_WRAPPER flag in i_flags.  This lets
RH's modified vfs_tmpfile() find our tmpfile fn pointer.

Add a test that covers both creating tmpfiles and moving their contents
into a destination file via MOVE_BLOCKS.

xfstests common/004 now runs because tmpfile is supported.

Signed-off-by: Andy Grover <agrover@versity.com>
2021-04-05 14:23:44 -07:00
Zach Brown
f8d39610a2 Only get inode writeback_lock when adding inodes
Each transaction maintains a global list of inodes to sync.  It checks
the inode and adds it in each write_end call per OS page.  Locking and
unlocking the global spinlock was showing up in profiles.  At the very
least, we can get the lock just once per large file that's written
during a transaction.  This will reduce spinlock traffic on the lock by
a factor of the number of pages written per file.  We'll want a better
solution in the long run, but this helps for now.
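
The cheap version of that is the usual double-checked list add (the field
and helper names here are assumptions):

  void scoutfs_inode_queue_writeback(struct inode *inode)
  {
          struct scoutfs_inode_info *si = SCOUTFS_I(inode);       /* name assumed */
          struct trans_info *trans = current_trans(inode->i_sb);  /* stand-in */

          /* already on the transaction's list: skip the global lock entirely */
          if (!list_empty(&si->writeback_entry))
                  return;

          spin_lock(&trans->writeback_lock);
          if (list_empty(&si->writeback_entry))
                  list_add_tail(&si->writeback_entry, &trans->writeback_list);
          spin_unlock(&trans->writeback_lock);
  }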

Signed-off-by: Zach Brown <zab@versity.com>
2021-03-04 11:39:30 -08:00
Andy Grover
bed33c7ffd Remove item accounting
Remove kmod/src/count.h
Remove scoutfs_trans_track_item()
Remove reserved/actual fields from scoutfs_reservation

Signed-off-by: Andy Grover <agrover@versity.com>
2021-01-20 17:01:08 -08:00
Zach Brown
1e0f8ee27a Finally change all 'ci' inode info ptrs to 'si'
Finally get rid of the last silly vestige of the ancient 'ci' name and
update the scoutfs_inode_info pointers to si.  This is just a global
search and replace; no functional change.

Signed-off-by: Zach Brown <zab@versity.com>
2020-12-15 15:20:02 -08:00
Zach Brown
807ae11ee9 Protect per-inode extent items with extent_sem
Now that we have full precision extents a writer with i_mutex and a page
lock can be modifying large extent items which cover much of the
surrounding pages in the file.  Readers can be in a different page with
only the page lock and try to work with extent items as the writer is
deleting and creating them.

We add a per-inode rwsem which just protects file extent item
manipulation.  We try to acquire it as close to the item use as possible
in data.c which is the only place we work with file extent items.

This stops rare read corruption we were seeing where get_block in a
reader was racing with extent item deletion in a stager at a further
offset in the file.
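
The usage pattern is a plain rwsem around the extent item work in data.c;
roughly (the surrounding helpers are stand-ins):

  /* in scoutfs_inode_info */
          struct rw_semaphore extent_sem;

  /* inode init */
          init_rwsem(&si->extent_sem);

  /* reader, e.g. get_block */
          down_read(&si->extent_sem);
          ret = lookup_file_extent(inode, iblock, &ext);  /* stand-in */
          up_read(&si->extent_sem);

  /* writer, e.g. allocation, truncate, or staging in data.c */
          down_write(&si->extent_sem);
          ret = modify_file_extents(inode, start, len);   /* stand-in */
          up_write(&si->extent_sem);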

Signed-off-by: Zach Brown <zab@versity.com>
2020-12-15 11:56:50 -08:00
Andy Grover
e6228ead73 scoutfs: Ensure padding in structs remains zeroed
Audit code for structs allocated on stack without initialization, or
using kmalloc() instead of kzalloc().

- avl.c: zero padding in avl_node on insert.
- btree.c: Verify item padding is zero, or WARN_ONCE.
- inode.c: scoutfs_inode contains scoutfs_timespecs, which have padding.
- net.c: zero pad in net header.
- net.h: scoutfs_net_addr has padding, zero it in scoutfs_addr_from_sin().
- xattr.c: scoutfs_xattr has padding, zero it.
- forest.c: item_root in forest_next_hint() appears to either be
    assigned-to or unused, so no need to zero it.
- key.h: Ensure padding is zeroed in scoutfs_key_set_{zeros,ones}
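
The two recurring fixes from the audit look like this (struct and field
names are placeholders):

  /* stack structs with padding: zero the whole thing before filling it in */
          struct scoutfs_timespec ts;     /* placeholder struct */

          memset(&ts, 0, sizeof(ts));
          ts.sec = cpu_to_le64(sec);
          ts.nsec = cpu_to_le32(nsec);

  /* heap structs with padding: kzalloc() instead of kmalloc() */
          struct scoutfs_xattr *xat = kzalloc(bytes, GFP_NOFS);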

Signed-off-by: Andy Grover <agrover@versity.com>
2020-10-29 14:15:33 -07:00
Zach Brown
6bacd95aea scoutfs: fs uses item cache instead of forest
Use the new item cache for all the item work in the fs instead of
calling into the forest of btrees.  Most of this is mechanical
conversion from the _forest calls to the _item calls.  The item cache
no longer supports the kvec argument for describing values so all the
callers pass in the value pointer and length directly.

The item cache doesn't support saving items as they're deleted and later
restoring them from an error unwinding path.  There were only two users
of this.  Directory entries can easily guarantee that deletion won't
fail by dirtying the items first in the item cache.  Xattr updates were
a little trickier.  They can combine dirtying, creating, updating, and
deleting to atomically switch between items that describe different
versions of a multi-item value.  This also fixed a bug in the srch
xattrs where replacing an xattr would create a new id for the xattr and
leave existing srch items referencing a now deleted id.  Replacing now
reuses the old id.

And finally we add back in the locking and transaction item cache
integration.

Signed-off-by: Zach Brown <zab@versity.com>
2020-08-26 14:39:12 -07:00
Zach Brown
3a82090ab1 scoutfs: have per-fs inode nr allocators
We had previously seen lock contention between mounts that were either
resolving paths by looking up entries in directories or writing xattrs
in file inodes as they did archiving work.

The previous attempt to avoid this contention was to give each directory
its own inode number allocator which ensured that inodes created for
entries in the directory wouldn't share lock groups with inodes in other
directories.

But this creates the problem of operating on few files per lock for
reasonably small directories.  It also creates more server commits as
each new directory gets its inode allocation reservation.

The fix is to have mount-wide separate allocators for directories and
for everything else.  This puts directories and files in separate groups
and locks, regardless of directory population.

Signed-off-by: Zach Brown <zab@versity.com>
2020-08-26 14:39:12 -07:00
Zach Brown
177af7f746 scoutfs: use larger metadata blocks
Introduce different constants for small and large metadata block
sizes.

The small 4KB size is used for the super block, quorum blocks, and as
the granularity of file data block allocation.  The larger 64KB size is
used for the radix, btree, and forest bloom metadata block structures.

The bulk of this is obvious transitions from the old single constant to
the appropriate new constant.  But there are a few more involved
changes, though just barely.

The block crc calculation now needs the caller to pass in the size of
the block.  The radix function that returned free bytes now returns free
blocks, and the caller is responsible for knowing how big its managed
blocks are.
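
Something like the following, where the constant names and the header layout
are assumptions rather than the exact format definitions:

  #define SCOUTFS_BLOCK_SM_SHIFT  12      /* 4KB: super, quorum, data alloc granularity */
  #define SCOUTFS_BLOCK_SM_SIZE   (1 << SCOUTFS_BLOCK_SM_SHIFT)
  #define SCOUTFS_BLOCK_LG_SHIFT  16      /* 64KB: radix, btree, bloom blocks */
  #define SCOUTFS_BLOCK_LG_SIZE   (1 << SCOUTFS_BLOCK_LG_SHIFT)

  /* the caller now says how big the block is; assumes the crc leads the header */
  static u32 block_crc(struct scoutfs_block_header *hdr, u32 size)
  {
          return crc32c(~0, (char *)hdr + sizeof(hdr->crc),
                        size - sizeof(hdr->crc));
  }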

Signed-off-by: Zach Brown <zab@versity.com>
2020-08-26 14:39:12 -07:00
Zach Brown
48448d3926 scoutfs: convert fs callers to forest
Convert fs callers to work with the btree forest calls instead of the
lsm item cache calls.  This is mostly a mechanical syntax conversion.
The inode dirtying path does now update the item rather than simply
dirtying it.

Signed-off-by: Zach Brown <zab@versity.com>
2020-01-17 11:21:36 -08:00
Zach Brown
6f5cfd8cc2 scoutfs: use rid instead of node_id in items
Use the mount's generated random id in persistent items and the lock
that protects them instead of the assigned node_id.

Signed-off-by: Zach Brown <zab@versity.com>
2019-08-20 15:52:13 -07:00
Zach Brown
c010afa8ff scoutfs: add setattr_more ioctl
Add an ioctl that can be used by userspace to restore a file to its
offline state.  To do that it needs to set inode fields that are
otherwise not exposed and create an offline extent.

Signed-off-by: Zach Brown <zab@versity.com>
2019-05-30 13:45:52 -07:00
Zach Brown
a6782fc03f scoutfs: add data waiting
One of the core features of scoutfs is the ability to transparently
migrate file contents to and from an archive tier.  For this to be
transparent we need file system operations to trigger staging the file
contents back into the file system as needed.

This adds the infrastructure which operations use to wait for offline
extents to come online and which provides userspace with a list of
blocks that the operations are waiting for.

We add some waiting infrastructure that callers use to lock, check for
offline extents, and unlock and wait before checking again to see if
they're still offline.  We add these checks and waiting to data io
operations that could encounter offline extents.

This has to be done carefully so that we don't wait while holding locks
that would prevent staging.  We use per-task structures to discover when
we are the first user of a cluster lock on an inode, indicating that
it's safe for us to wait because we don't hold any locks.

And while we're waiting our operation is tracked and reported to
userspace through an ioctl.  This is a non-blocking ioctl, it's up to
userspace to decide how often to check and how large a region to stage.

Waiters are woken up when the file contents could have changed, not
specifically when we know that the extent has come online.  This lets us
wake waiters when their lock is revoked so that they can block waiting
to reacquire the lock and test the extents again.  It lets us provide
coherent demand staging across the cluster without fine grained waiting
protocols sent between the nodes.  It may result in some spurious wakeups
and work but hopefully it won't, and it's a very simple and functional
first pass.
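
The waiting loop in the callers is roughly the following sketch, where every
call is a stand-in for the real interfaces:

  static int wait_for_online(struct inode *inode, u64 start, u64 len)
  {
          int ret;

          for (;;) {
                  ret = lock_inode(inode);                  /* stand-in */
                  if (ret)
                          return ret;

                  if (!extents_offline(inode, start, len))  /* stand-in */
                          return 0;                         /* proceed, lock held */

                  /* record what we're waiting for so the ioctl can report it */
                  record_data_waiter(inode, start, len);    /* stand-in */
                  unlock_inode(inode);                      /* don't block staging */

                  /* woken when contents may have changed, not only when online */
                  wait_for_data_wakeup(inode);              /* stand-in */
          }
  }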

Signed-off-by: Zach Brown <zab@versity.com>
2019-05-21 11:33:26 -07:00
Zach Brown
08a140c8b0 scoutfs: use our locking service
Convert client locking to call the server's lock service instead of
using a fs/dlm lockspace.

The client code gets some shims to send and receive lock messages to and
from the server.  Callers use our lock mode constants instead of the
DLM's.

Locks are now identified by their starting key instead of an additional
scoped lock name so that we don't have more mapping structures to track.
The global rename lock uses keys that are defined by the format as only
used for locking.

The biggest change is in the client lock state machine.  Instead of
calling the dlm and getting callbacks we send messages to our server and
get called from incoming message processing.  We don't have everything
come through a per-lock work queue.  Instead we send requests either
from the blocking lock caller or from a shrink work queue.  Incoming
messages are called in the net layer's blocking work contexts so we
don't need to do any more work to defer to other contexts.

The different processing contexts lead to a slightly different lock
life cycle.  We refactor and separate allocation and freeing from
tracking and removing locks in data structures.  We add a _get and _put
to track active use of locks, and async references to locks by holders
and requests are tracked separately.

Our lock service's rules are a bit simpler in that we'll only ever send
one request at a time and the server will only ever send one request at
a time.  We do have to do a bit of work to make sure we process
back-to-back grant responses and invalidation requests from the server.

As of this change the lock setup and destruction paths are a little
wobbly.  They'll be shored up as we add lock recovery between the client
and server.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
59170f41b1 scoutfs: revive item deletion path
The inode deletion path had bit rotted.  Delete the ifdefs that were
stopping it from deleting all the items associated with an inode.  There
can be a lot of xattr and data mapping items so we have them manage
their own transactions (data already did).  The xattr deletion code was
trying to get a lock while the caller already held it so delete that.
Then we accurately account for the small number of remaining items that
finally delete the inode.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
600ecd9fad scoutfs: adapt to fallocated extents
The addition of fallocate() now means that offline extents can be
unwritten and allocated and that extents can now be found outside of
i_size.

Truncating needs to know about the possible flag combinations, writing
preallocation needs to know to update an existing extent or allocate up
to the next extent, get_block can't map unwritten extents for read,
extent conversion needs to also clear offline, and truncate needs to
drop extents outside i_size even if truncating to the existing file
size.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
27d1f3bcf7 scoutfs: inode read shouldn't modify online blocks
There was a typo in the addition of i_blocks tracking that would set
online blocks to the value of offline blocks when reading an existing
inode into memory.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
19f7e0284b scoutfs: add online/offline block trace event
Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
70b2a50c9a scoutfs: remove individual online/offline calls
Remove the functions that operate on online and offline blocks
independently now that the file data mapping code isn't using them any
more.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
036577890f scoutfs: add atomic online/offline blocks calls
Add functions that atomically change and query the online and offline
block counts as a pair.  They're semantically linked and we shouldn't
present counts that don't match if they're in the process of being
updated.
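
In other words the pair is only ever touched under one lock; roughly
(the per-inode lock name is an assumption):

  void scoutfs_inode_add_onoff(struct inode *inode, s64 on, s64 off)
  {
          struct scoutfs_inode_info *si = SCOUTFS_I(inode);

          spin_lock(&si->lock);           /* lock name assumed */
          si->online_blocks += on;
          si->offline_blocks += off;
          spin_unlock(&si->lock);
  }

  void scoutfs_inode_get_onoff(struct inode *inode, s64 *on, s64 *off)
  {
          struct scoutfs_inode_info *si = SCOUTFS_I(inode);

          spin_lock(&si->lock);
          *on = si->online_blocks;
          *off = si->offline_blocks;
          spin_unlock(&si->lock);
  }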

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
966c8b8cbc scoutfs: alloc inos at multiple of lock group
Inode allocations come from batches that are reserved for directories.
As the batch is exhausted a new one is acquired and allocated from.

The batch size was arbitrarily set to the human friendly 10000.  This
doesn't interact well with the lock group size being a power of two.
Each allocation batch will share an inode group with its previous and
next batches.

This often doesn't matter because directories very rarely have more than
9000 entries.  But as entries pass 10000 they'd see surprising
contention with inode ranges from other directories.

Tweak the allocation size to be a multiple of the lock group size to
stop this from happening.
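
As a worked example with an assumed group size: with 1024 inodes per lock
group, a 10000-inode batch ends 784 inodes into a group, so consecutive
batches keep sharing groups; a multiple of the group size keeps every batch
on group boundaries given an aligned starting point:

  #define LOCK_GROUP_NR   1024U                           /* assumed group size */

  /* old: 10000 % 1024 == 784, so batches straddle group boundaries */
  /* new: a multiple of the group size never does */
  #define INO_ALLOC_BATCH (10 * LOCK_GROUP_NR)            /* 10240 */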

Signed-off-by: Zach Brown <zab@versity.com>
2018-04-04 09:53:58 -07:00
Zach Brown
9148f24aa2 scoutfs: use single small key struct
Variable length keys lead to having a key struct point to the buffer
that contains the key.  With dirents and xattrs now using small keys we
can convert everyone to using a single key struct and significantly
simplify the system.

We no longer have a seperate generic key buf struct that points to
specific per-type key storage.  All items use the key struct and fill
out the appropriate fields.  All the code that paired a generic key buf
struct and a specific key type struct is collapsed down to a key struct.
There's no longer the difference between a key buf that shares a
read-only key, has its own precise allocation, or has a max size
allocation for incrementing and decrementing.

Each key user now has an init function that fills out its fields.  It
looks a lot like the old pattern but we no longer have separate key
storage that the buf points to.
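
The shape of the pattern is roughly this (the field and constant names are
invented for illustration, not the real key layout):

  struct scoutfs_key {            /* one fixed-size struct for every item type */
          __u8    zone;
          __u8    type;
          __le64  first;
          __le64  second;
          __le64  third;
  } __packed;

  static void init_inode_key(struct scoutfs_key *key, u64 ino)
  {
          memset(key, 0, sizeof(*key));
          key->zone = FS_ZONE;            /* invented constant */
          key->type = INODE_TYPE;         /* invented constant */
          key->first = cpu_to_le64(ino);
  }

  /* callers just point at stack storage */
          struct scoutfs_key key;

          init_inode_key(&key, ino);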

A bunch of code now takes the address of static key storage instead of
managing allocated keys.  Conversely, swapping now uses the full keys
instead of pointers to the keys.

We don't need all the functions that worked on the generic key buf
struct because they had different lengths.  Copy, clone, length init,
memcpy, all of that goes away.

The item API had some functions that tested the length of keys and
values.  The key length tests vanish, and that gets rid of the _same()
call.  The _same_min() call only had one user who didn't also test for
the value length being too large.  Let's leave caller key constraints in
callers instead of trying to hide them on the other side of a bunch of
item calls.

We no longer have to track the number of key bytes when calculating if
an item population will fit in segments.  This removes the key length
from reservations, transactions, and segment writing.

The item cache key querying ioctls no longer have to deal with variable
length keys.  They simply specify the start key, the ioctls return the
number of keys copied instead of bytes, and the caller is responsible
for incrementing the next search key.

The segment no longer has to store the key length.  It stores the key
struct in the item header.

The fancy variable length key formatting and printing can be removed.
We have a single format for the universal key struct.  The SK_ wrappers
that bracketed calls to use preempt-safe per-cpu buffers can turn back
into their normal calls.

Manifest entries are now a fixed size.  We can simply split them between
btree keys and values and initialize them instead of allocating them.
This means that level 0 entries don't have their own format that sorts
by the seq.  They're sorted by the key like all the other levels.
Compaction needs to sweep all of them looking for the oldest and read
can stop sweeping once it can no longer overlap.  This makes rare
compaction more expensive and common reading less expensive, which is
the right tradeoff.

Signed-off-by: Zach Brown <zab@versity.com>
2018-04-04 09:15:27 -05:00
Zach Brown
b0bd273acc scoutfs: remove support for multi-element kvecs
Originally the item interfaces were written with full support for
vectored keys and values.  Callers constructed keys and values made up
of header structs and data buffers.  Segments supported much larger
values which could span pages when stored in memory.

But over time we've pulled that support back.  Keys are described by a
key struct instead of a multi-element kvec.  Values are now much smaller
and don't span pages.  The item interfaces still use the kvec arrays but
everyone only uses a single element.

So let's make the world a whole lot less awful by having the item
interfaces support only a single value buffer specified by a kvec.  A
bunch of code disappears and the result is much easier to understand.

Signed-off-by: Zach Brown <zab@versity.com>
2018-04-04 09:15:27 -05:00
Zach Brown
08f544cc15 scoutfs: remove scoutfs_item_lookup_exact() size
Every caller of scoutfs_item_lookup_exact() provided a size that matches
the value buffer.  Let's remove the redundant arg and use the value
buffer length as the exact size to match.

Signed-off-by: Zach Brown <zab@versity.com>
2018-04-04 09:15:27 -05:00
Zach Brown
c4de85fd82 scoutfs: cleanup xattr item storage
Honoring the XATTR_REMOVE flag in xattr deletion exposed an interesting
bug in getxattr().  We were unconditionally returning the max xattr
value size when someone tried to probe an existing xattr's value size by
calling getxattr with size == 0.  Some kernel paths did this to probe
the existence of xattrs.  They expected to get an error if the xattr
didn't exist, but we were giving them the max possible size.  These
kernel paths then tried to remove the xattrs with XATTR_REMOVE, and that
now failed and caused a bunch of errors in xfstests.

The fix is to return the real xattr value size when getxattr is called
with size == 0.  To do that with the old format we'd have to iterate
over all the items which happened to be pretty awkward in the current
code paths.

So we're taking this opportunity to land a change that had been brewing
for a while.  We now form the xattr keys from the hash of the name, and
the item values now store a logically contiguous header, the name, and
the value.  This makes it very easy for us to have the full xattr value
length in the header and return it from getxattr when size == 0.
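
So the item value is now shaped roughly like this (field names are
assumptions):

  struct scoutfs_xattr {
          __u8    name_len;
          __le16  val_len;        /* full value length, even when split across items */
          __u8    data[];         /* name bytes followed by value bytes */
  } __packed;

  /* getxattr(..., size == 0) probes now return the stored length */
          if (size == 0) {
                  ret = le16_to_cpu(xat->val_len);
                  goto out;
          }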

Now all tests pass while honoring the XATTR_CREATE and XATTR_REMOVE
flags.

And the code is a whole lot easier to follow.  And we've removed another
barrier for moving to small fixed size keys.

Signed-off-by: Zach Brown <zab@versity.com>
2018-03-15 09:23:57 -07:00
Zach Brown
302b0f5316 scoutfs: track inode 512b block count
We weren't doing anything with the inode blocks field.  We weren't even
initializing it, which explains why we'd sometimes see garbage i_blocks
values in scoutfs inodes in segments.

The logical blocks field reflects the contents of the file regardless of
whether it's online or not.  It's the sum of our online and offline block
tracking.

So we can initialize it to our persistent online and offline counts and
then keep it in sync as blocks are allocated and freed.
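
That is, with an assumed 4KB file block size, i_blocks (512-byte sectors) is
just:

  /* 4KB blocks -> 512B sectors is a shift of 3; block size assumed */
          inode->i_blocks = (online_blocks + offline_blocks) << 3;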

Signed-off-by: Zach Brown <zab@versity.com>
2018-02-21 09:36:44 -08:00
Zach Brown
c1311783d5 scoutfs: add tracking of online and offline blocks
Signed-off-by: Zach Brown <zab@versity.com>
2018-02-21 09:36:44 -08:00
Zach Brown
f52dc28322 scoutfs: simplify lock use of kernel dlm
We had an excessive number of layers between scoutfs and the dlm code in
the kernel.  We had dlmglue, the scoutfs locks, and task refs.  Each
layer had structs that track the lifetime of the layer below it.  We
were about to add another layer to hold on to locks just a bit longer so
that we can avoid down conversion and transaction commit storms under
contention.

This collapses all those layers into a simple state machine in lock.c
that manages the mode of dlm locks on behalf of the file system.

The users of the lock interface are mainly unchanged.  We did change
from a heavier trylock to a lighter nonblock lock attempt and have to
change the single rare readpage use.  Lock fields change so a few
external users of those fields change.

This not only removes a lot of code, it also contains functional
improvements.  For example, it can now convert directly to CW locks with
a single lock request instead of having to use two by first converting
to NL.

It introduces the concept of an unlock grace period.  Locks won't be
dropped on behalf of other nodes soon after being unlocked so that tasks
have a chance to batch up work before the other node gets a chance.
This can result in a two-order-of-magnitude improvement in the time it
takes to, say, change a set of xattrs on the same file population from
two nodes concurrently.
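
The grace period is just a timestamp check before honoring a remote request;
roughly (the field names and delay are assumptions):

  #define LOCK_GRACE_JIFFIES      msecs_to_jiffies(10)    /* assumed length */

  /* on unlock */
          lck->unlock_jiffies = jiffies;

  /* when asked to downconvert for another node */
          if (time_before(jiffies, lck->unlock_jiffies + LOCK_GRACE_JIFFIES)) {
                  /* try again once the grace period has passed */
                  queue_delayed_work(lock_wq, &lck->downconvert_dwork,
                                     LOCK_GRACE_JIFFIES);
                  return;
          }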

There are significant changes to trace points, counters, and debug files
that follow the implementation changes.

Signed-off-by: Zach Brown <zab@versity.com>
2018-02-14 15:00:17 -08:00
Zach Brown
4ff1e3020f scoutfs: allocate inode numbers per directory
Having an inode number allocation pool in the super block meant that all
allocations across the mount were interleaved.  This means that
concurrent file creation in different directories will create
overlapping inode numbers.  This leads to lock contention as reasonable
workloads will tend to distribute work by directories.

The easy fix is to have per-directory inode number allocation pools.  We
take the opportunity to clean up the network request so that the caller
gets the allocation instead of having it be fed back in via a weird
callback.

Signed-off-by: Zach Brown <zab@versity.com>
2018-02-09 17:58:19 -08:00
Zach Brown
a49061a7d9 scoutfs: remove the size index
We aren't using the size index. It has runtime and code maintenance
costs that aren't worth paying.  Let's remove it.

Removing it from the format and no longer maintaining it are
straightforward.

The bulk of this patch is actually the act of removing it from the index
locking functions.  We no longer have to predict the size that will be
stored during the transaction to lock the index items that will be
created during the transaction.  A bunch of code to predict the size and
then pass it into locking and transactions goes away.  Like other inode
fields we now update the size as it changes.

Signed-off-by: Zach Brown <zab@versity.com>
2018-01-30 15:03:35 -08:00
Zach Brown
c36d90e216 scoutfs: map inode index item locks in one place
We have to map many index item keys down to a lock that then has a start
and end key range.  We also use this mapping over in index item locking
to avoid trying to acquire locks multiple times.

We were duplicating the mapping calculation in these two places.  This
refactors these functions to use one range calculation function.  It's
going to be used in future patches to fix the mapping of the size index
items.

This should result in no functional changes.

Signed-off-by: Zach Brown <zab@versity.com>
2017-11-21 13:11:43 -08:00
Mark Fasheh
e8f87ff90a scoutfs: use CW locks for inode index updates
This will give us concurrency yet still allow our ioctls to drive cache
syncing/invalidation on other nodes. Our lock_coverage() checks evolve
to handle direct dlm modes, allowing us to verify correct usage of CW
locks.

As a test, we can run createmany on two nodes at the same time, each
working in their own directory. The following commands were run on each
node:
  $ mkdir /scoutfs/`uname -n`
  $ cd /scoutfs/`uname -n`
  $ /root/createmany -o ./file_$i 100000

Before this patch that test wouldn't finish in any reasonable amount of
time and I would kill it after some number of hours.

After this patch, we make swift progress through the test:

[root@fstest3 fstest3.site]# /root/createmany -o ./file_$i 100000
 - created 10000 (time 1509394646.11 total 0.31 last 0.31)
 - created 20000 (time 1509394646.38 total 0.59 last 0.28)
 - created 30000 (time 1509394646.81 total 1.01 last 0.43)
 - created 40000 (time 1509394647.31 total 1.51 last 0.50)
 - created 50000 (time 1509394647.82 total 2.02 last 0.51)
 - created 60000 (time 1509394648.40 total 2.60 last 0.58)
 - created 70000 (time 1509394649.06 total 3.26 last 0.66)
 - created 80000 (time 1509394649.72 total 3.93 last 0.66)
 - created 90000 (time 1509394650.36 total 4.56 last 0.64)
 total: 100000 creates in 35.02 seconds: 2855.80 creates/second

[root@fstest4 fstest4.fstestnet]# /root/createmany -o ./file_$i 100000
 - created 10000 (time 1509394647.35 total 0.75 last 0.75)
 - created 20000 (time 1509394647.89 total 1.28 last 0.54)
 - created 30000 (time 1509394648.46 total 1.86 last 0.58)
 - created 40000 (time 1509394648.96 total 2.35 last 0.49)
 - created 50000 (time 1509394649.51 total 2.90 last 0.55)
 - created 60000 (time 1509394650.07 total 3.46 last 0.56)
 - created 70000 (time 1509394650.79 total 4.19 last 0.72)
 - created 80000 (time 1509394681.26 total 34.66 last 30.47)
 - created 90000 (time 1509394681.63 total 35.03 last 0.37)
 total: 100000 creates in 35.50 seconds: 2816.76 creates/second

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-11-16 16:14:38 -08:00
Zach Brown
5c3962d223 scoutfs: trace correct index item deletion
The trace point for deleting index items was using the wrong major and
minor.

Signed-off-by: Zach Brown <zab@versity.com>
2017-11-08 13:37:16 -08:00
Mark Fasheh
20a22ddc6b scoutfs: provide ->setattr
Simple attr changes are mostly handled by the VFS; we just have to
mirror them into our inode. Truncates are done in a separate set of
transactions. We use a flag to indicate an in-progress truncate. This
allows us to detect and continue the truncate should the node crash.

Index locking is a bit complicated, so we add a helper function to grab
index locks and start a transaction.

With this patch we now pass the following xfstests:

generic/014
generic/101
generic/313

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-10-18 13:23:01 -07:00
Mark Fasheh
dd99a0127e scoutfs: rename scoutfs_inode_index_lock_hold
Call it scoutfs_inode_index_try_lock_hold since it may fail and unwind
as part of normal (not an error) operation. This lets us re-use the
name in an upcoming patch.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-10-18 13:23:01 -07:00
Zach Brown
856f257085 scoutfs: use locked getattr for all inodes
We only set the .getattr method to our locked getattr filler for regular
files.  Set it for all file types so that stat, etc., will see the
current inode.

Signed-off-by: Zach Brown <zab@versity.com>
2017-10-12 14:51:30 -07:00
Zach Brown
9b31c9795b scoutfs: add full lock arg to _item_delete()
Add the full lock arg to _item_delete() so that it can verify lock
coverage.

Signed-off-by: Zach Brown <zab@versity.com>
2017-10-09 15:31:29 -07:00
Zach Brown
6cd64f3228 scoutfs: add full lock arg to _item_update()
Add the full lock arg to _item_update() so that it can verify lock
coverage.

Signed-off-by: Zach Brown <zab@versity.com>
2017-10-09 15:31:29 -07:00
Zach Brown
0aa16f5ef6 scoutfs: add lock arg to _item_create()
scoutfs_item_create() hasn't been working with lock coverage.  It
wouldn't return -ENOENT if it didn't have the lock cached.  It would
create items outside lock coverate so they wouldn't be invalidated and
re-read if another node modified the item.

Add a lock arg and teach it to populate the cache so that it's correctly
consistent.

Signed-off-by: Zach Brown <zab@versity.com>
2017-10-09 15:31:29 -07:00
Zach Brown
950436461a scoutfs: add lock coverage for inode index items
Add lock coverage for inode index items.

Sadly, this isn't trivial. We have to predict the value of the indexed
fields before the operation to lock those items.  One value in
particular we can't reliably predict: the sequence of the transaction we
enter after locking.  Also operations can create an absolute ton of
index item updates -- rename can modify nr_inodes * items_per_inode * 2
items, so maybe 24 today.  And these items can be arbitrarily positioned
in the key space.

So to handle all this we add functions to gather the predicted item
values we'll need to lock, sort and lock them all, then pass the
appropriate locks down to the item functions during inode updates.

The trickiest bit of the index locking code is having to retry if the
sequence number changes.  Preparing locks has to guess the sequence
number of its upcoming trans and then make item update decisions based
on that.  If we enter and have a different sequence number then we need
to back off and retry with the correct sequence number (we may find that
we'll need to update the indexed meta seq and need to have it locked).

The use of the functions is straightforward.  Sites figure out the
predicted sizes, lock, pass the locks to inode updates, and unlock.
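
The retry ends up looking roughly like this (the helper names are stand-ins
for the real calls):

  retry:
          seq = predicted_trans_seq(sb);                  /* stand-in */
          ret = prepare_index_locks(&locks, inode, seq);  /* gather, sort, lock */
          if (ret)
                  goto out;

          ret = enter_trans(sb);                          /* stand-in */
          if (ret)
                  goto unlock;

          if (trans_seq(sb) != seq) {
                  /* guessed wrong: unwind and redo the lock set with the real seq */
                  exit_trans(sb);
                  unlock_index_locks(&locks);
                  goto retry;
          }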

While we're at it we replace the individual item field tracking
variables in the inode info with an array of indexed values.  The code
ends up a bit nicer.  It also gets rid of the indexed time fields that
were left behind and were unused.

It's worth noting that we're getting exclusive locks on the index
updates.  Locking the meta/data seq updates results in complete global
serialization of all changes.  We'll need concurrent writer locks to get
concurrency back.

Signed-off-by: Zach Brown <zab@versity.com>
2017-10-09 15:31:29 -07:00
Zach Brown
aa70903154 scoutfs: add lock coverage for data paths
Use per_task storage on the inode to pass locks from high level read and
write lock holders down into the callbacks that operate under the locks
so that the locks can then be passed to the item functions.

Signed-off-by: Zach Brown <zab@versity.com>
2017-10-09 15:31:29 -07:00
Zach Brown
0535e249d1 scoutfs: add lock arg to scoutfs_update_inode_item
Add a full lock argument to scoutfs_update_inode_item() and use it to
pass the lock's end key into item_update().  This'll get changed into
passing the full lock into _update soon.

Signed-off-by: Zach Brown <zab@versity.com>
2017-10-09 15:31:29 -07:00
Zach Brown
32a68e84cf scoutfs: add full lock coverage to _item_dirty()
Add the full lock argument to _item_dirty() so that it can verify lock
coverage in addition to limiting item cache population to the range
covered by the lock.

This also ropes in scoutfs_dirty_inode_item() which is a thin wrapper
around _item_dirty().

Signed-off-by: Zach Brown <zab@versity.com>
2017-10-09 15:31:29 -07:00
Zach Brown
1c6e3e39bf scoutfs: add full lock coverage to _item_next*()
Add the full lock argument to _item_next*() so that it can verify lock
coverage in addition to limiting item cache population to the range
covered by the lock.

Signed-off-by: Zach Brown <zab@versity.com>
2017-10-09 15:31:29 -07:00