Commit Graph

137 Commits

Author SHA1 Message Date
Auke Kok
64fcbdc15e Zero out dirent padding to avoid leaking to disk.
This allocation here currently leaks through __pad[7] which
is written to disk. Use the initializer to enforce zeroing
the pad. The name member is written right after.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2026-03-05 16:20:06 -08:00
Auke Kok
533f309aec Switch to .get_inode_acl() to avoid rcu corruption.
In el9.6, the kernel VFS no longer goes through xattr handlers to
retrieve ACLs, but instead calls the FS drivers' .get_{inode_}acl
method.  In the initial compat version we hooked up .get_acl, given
that its name is identical to the one used in the past.

However, this results in caching issues, as was encountered by customers
and exposed in the added test case `basic-acl-consistency`. The result
is that some group ACL entries may appear randomly missing. Dropping
caches may temporarily fix the issue.

The root cause of the issue is that the VFS now has 2 separate paths to
retrieve ACLs from the FS driver, and they have conflicting
implications for caching. `.get_acl` is purely meant for filesystems
like overlay/ecryptfs where no caching should ever go on as they are
fully passthrough only. Filesystems with dentries (i.e. all normal
filesystems) should not expose this interface, and should instead expose
the .get_inode_acl method. And indeed, in introducing the new interface,
the upstream kernel converts all but a few fs's to use .get_inode_acl().

The functional change in the driver is to detach KC_GET_ACL_DENTRY and
introduce KC_GET_INODE_ACL to handle the new (and required) interface.
KC_SET_ACL_DENTRY is detached as well because it was a different
changeset in the kernel, and we should keep the two separated for good
measure now.
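
A minimal sketch of the wiring the commit describes, assuming the el9.6 prototype; KC_GET_INODE_ACL is the commit's config symbol, while scoutfs_get_acl() and the ops struct name are placeholders:

```c
/* Non-runnable fragment: sketch of the compat switch, not scoutfs's code. */
#ifdef KC_GET_INODE_ACL
static struct posix_acl *scoutfs_get_inode_acl(struct inode *inode, int type,
					       bool rcu)
{
	if (rcu)
		return ERR_PTR(-ECHILD);	/* retry outside the RCU walk */
	return scoutfs_get_acl(inode, type);
}

static const struct inode_operations scoutfs_file_iops = {
	/* .get_acl is deliberately left unset: it is only for
	 * non-caching, passthrough filesystems like overlay/ecryptfs */
	.get_inode_acl	= scoutfs_get_inode_acl,
};
#endif
```

With .get_inode_acl in place the VFS caches the result in the inode, which is what restores the missing group-entry behavior seen in `basic-acl-consistency`.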

Signed-off-by: Auke Kok <auke.kok@versity.com>
2026-01-30 11:31:43 -08:00
Auke Kok
afb6ba00ad POSIX ACL changes.
The .get_acl() method now gets passed a mnt_idmap arg, and we can now
choose to implement either .get_acl() or .get_inode_acl(). Technically
.get_acl() is a new implementation, and .get_inode_acl() is the old.
That second method now also gets an rcu flag passed, but we should be
fine either way.

Deeper under the covers however we do need to hook up the .set_acl()
method for inodes, otherwise setfacl will just fail with -ENOTSUPP. To
make this not super messy (it already is) we tack on the get_acl()
changes here.

This is all roughly ca. v6.1-rc1-4-g7420332a6ff4.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-10-30 13:59:44 -04:00
Zach Brown
609fc56cd6 Merge pull request #203 from versity/auke/new_inode_ctime
Fix new_inode ctime assignment.
2025-02-25 15:23:16 -08:00
Auke Kok
e3e2cfceec Fix new_inode ctime assignment.
Very old copy/paste bug here: we want to update new_inode's ctime
instead. old_inode is already updated.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-02-18 13:15:49 -05:00
Auke Kok
e9d147260c Fix ctx->pos updating to properly handle dent gaps
We need to ensure we're emitting dents with the proper position
and we already have them as part of our dent. The only caveat is
to increment ctx->pos once beyond the list to make sure the caller
doesn't call us once more.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-01-27 14:49:04 -05:00
Auke Kok
cad12d5ce8 Avoid deadlock in _readdir() due to copy_to_user().
dir_emit() will copy_to_user, which can pagefault. If this happens while
cluster locked, we could deadlock.

We use a single page to stage dir_emit data, and iterate between
fetching dirents while locked, and emitting them while not locked.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-01-27 14:49:04 -05:00
Auke Kok
1bcd1d4d00 Drop readdir pre-.iterate() compat (el7.5ish).
These 2 sections of compat for readdir are wholly obsolete and can be
hard dropped, which restores the method to look like current upstream
code.

This was added in ddd1a4e.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-01-23 14:28:40 -05:00
Auke Kok
d5c2768f04 .tmpfile method now passed a struct file, which must be opened.
v6.0-rc6-9-g863f144f12ad changes the VFS method to pass in a struct
file and not a dentry in preparation for tmpfile support in fuse.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
4ef64c6fcf Vfs methods become user namespace mount aware.
v5.11-rc4-24-g549c7297717c

All of these VFS methods are now passed a user_namespace.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Zach Brown
38e6f11ee4 Add quota support
Signed-off-by: Zach Brown <zab@versity.com>
2024-06-28 15:09:05 -07:00
Zach Brown
4a8240748e Add project ID support
Add support for project IDs.  They're managed through the _attr_x
interfaces and are inherited from the parent directory during creation.

Signed-off-by: Zach Brown <zab@versity.com>
2024-06-28 15:09:05 -07:00
Zach Brown
fb5331a1d9 Add inode retention bit
Add a bit to the private scoutfs inode flags which indicates that the
inode is in retention mode.  The bit is visible through the _attr_x
interface.  It can only be set on regular files and when set it prevents
modification to all but non-user xattrs.  It can be cleared by root.

Signed-off-by: Zach Brown <zab@versity.com>
2024-06-28 15:09:05 -07:00
Zach Brown
2af6f47c8b Fix bad error exit path in unlink
Unlink looks up the entry items for the name it is removing because we
no longer store the extra key material in dentries.  If this lookup
fails it will use an error path which releases a transaction that wasn't
held.  Thankfully this error path is unlikely (corruption or systemic
errors like eio or enomem) so we haven't hit this in practice.

Signed-off-by: Zach Brown <zab@versity.com>
2024-06-25 15:11:20 -07:00
Auke Kok
d480243c11 Support .read/write_iter callbacks in lieu of .aio_read/write
The aio_read and aio_write callbacks are no longer used by newer
kernels, which now use iter-based readers and writers.

We can avoid implementing plain .read and .write as an iter will
be generated when needed for us automatically.

We add a new data_wait_check_iter() function accordingly.

With these methods removed from the kernel, the el8 kernel no
longer uses the extended ops wrapper struct and is much closer now
to upstream. As a result, a lot of methods are moving around from
inode_dir_operations to and from inode_file_operations etc, and
perhaps things will look a bit more structured as a result.

Finally, we need a slightly different data_wait_check() that
accounts for the iter and offset properly.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2023-10-09 15:35:40 -04:00
Auke Kok
ec50e66fff Timespec64 changes for yr2038.
Provide a fallback `current_time(inode)` implementation for older
kernels.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2023-10-09 15:35:40 -04:00
Auke Kok
006555d42a READ_ONCE() replaces ACCESS_ONCE()
v3.18-rc3-2-g230fa253df63 forces us to replace ACCESS_ONCE() with
READ_ONCE(), but the latter is probably the better interface and works with
non-scalar types.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2023-10-09 15:35:40 -04:00
Zach Brown
0316c22026 Extend scoutfs_dir_add_next_linkrefs
Extend scoutfs_dir_add_next_linkref() to be able to return multiple
backrefs under the lock for each call and have it take an argument to
limit the number of backrefs that can be added and returned.

Its return code changes a bit in that it returns 1 on success instead of
0 so we have to be a little careful with callers who were expecting 0.
It still returns -ENOENT when no entries are found.

We break up its tracepoint into one that records each entry added and
one that records the result of each call.

This will be used by an ioctl to give callers just the entries that
point to an inode instead of assembling full paths from the root.

Signed-off-by: Zach Brown <zab@versity.com>
2023-06-14 14:12:10 -07:00
Zach Brown
a61b8d9961 Fix renaming into root directory
The VFS performs a lot of checks on renames before calling the fs
method.  We acquire locks and refresh inodes in the rename method so we
have to duplicate a lot of the vfs checks.

One of the checks involves loops with ancestors and subdirectories.  We
missed the case where the root directory is the destination and doesn't
have any parent directories.  The backref walker it calls returns
-ENOENT instead of 0 with an empty set of parents and that error bubbled
up to rename.

The fix is to notice when we're asking for ancestors of the one
directory that can't have ancestors and short circuit the test.

Signed-off-by: Zach Brown <zab@versity.com>
2023-03-08 11:00:59 -08:00
Zach Brown
71ed4512dc Include primary lock write_seq for write_only vers
FS items are deleted by logging a deletion item that has a greater item
version than the item to delete.  The versions are usually maintained by
the write_seq of the exclusive write lock that protects the item.  Any
newer write hold will have a greater version than all previous write
holds so any items created under the lock will have a greater vers than
all previous items under the lock.  All deletion items will be merged
with the older item and both will be dropped.

This doesn't work for concurrent write-only locks.  The write-only locks
match with each other so their write_seqs are assigned in the order
that they are granted.  That grant order can be mismatched with item
creation order.  We can get deletion items with lesser versions than the
item to delete because of when each creation's write-only lock was
granted.

Write only locks are used to maintain consistency between concurrent
writers and readers, not between writers.  Consistency between writers
is done with another primary write lock.  For example, if you're writing
seq items to a write-only region you need to have the write lock on the
inode for the specific seq item you're writing.

The fix, then, is to pass these primary write locks down to the item
cache so that it can choose an item version that is the greatest amongst
the transaction, the write-only lock, and the primary lock.  This now
ensures that the primary lock's increasing write_seq makes it down to
the item, bringing item version ordering in line with exclusive holds of
the primary lock.

All of this to fix concurrent inode updates sometimes leaving behind
duplicate meta_seq items because old seq item deletions ended up with
older versions than the seq item they tried to delete, nullifying the
deletion.

Signed-off-by: Zach Brown <zab@versity.com>
2022-11-15 13:26:32 -08:00
Zach Brown
aed4313995 Simplify dentry verification
Now that we've removed the hash and pos from the dentry_info struct we
can do without it.  We can store the refresh gen in the d_fsdata pointer
(sorry, 64bit only for now... could allocate if we needed to.)  This gets
rid of the lock coverage spinlocks and puts a bit more pressure on lock
lookup, which we already know we have to make more efficient.  We can
get rid of all the dentry info allocation calls.

Now that we're not setting d_op as we allocate d_fsdata we put the ops
on the super block so that we get d_revalidate called on all our
dentries.

We also are a bit more precise about the errors we can return from
verification.  If the target of a dentry link changes then we return
-ESTALE rather than silently performing the caller's operation on
another inode.

Signed-off-by: Zach Brown <zab@versity.com>
2022-10-27 14:32:06 -07:00
Zach Brown
c92a7ff705 Don't use dentry private hash/pos for deletion
The dentry cache life cycles are far too crazy to rely on d_fsdata being
kept in sync with the rest of the dentry fields.  Callers can do all
sorts of crazy things with dentries.  Only unlink and rename need these
fields and those operations are already so expensive that item lookups
to get the current actual hash and pos are lost in the noise.

Signed-off-by: Zach Brown <zab@versity.com>
2022-10-26 16:42:26 -07:00
Zach Brown
29538a9f45 Add POSIX ACL support
Add support for the POSIX ACLs as described in acl(5).  Support is
enabled by default and can be explicitly enabled or disabled with the
acl or noacl mount options, respectively.

Signed-off-by: Zach Brown <zab@versity.com>
2022-09-28 10:36:10 -07:00
Zach Brown
798fbb793e Move to xattr_handler xattr prefix dispatch
Move to the use of the array of xattr_handler structs on the super to
dispatch set and get from generic_ based on the xattr prefix.  This
will make it easier to add handling of the pseudo "system." ACL xattrs.

Signed-off-by: Zach Brown <zab@versity.com>
2022-09-21 14:24:52 -07:00
Zach Brown
bddca171ee Call iput outside cluster locked transactions
The final iput of an inode can delete items in cluster locked
transactions.   It was never safe to call iput within locked
transactions but we never saw the problem.   Recent work on inode
deletion raised the issue again.

This makes sure that we always perform iput outside of locked
transactions.  The only interesting change is making scoutfs_new_inode()
return the allocated inode on error so that the caller can put the inode
after releasing the transaction.

Signed-off-by: Zach Brown <zab@versity.com>
2022-03-11 15:29:20 -08:00
Zach Brown
15d7eec1f9 Disallow opening unlinked files by handle
Our open by handle functions didn't care that the inode wasn't
referenced and let tasks open unlinked inodes by number.  This
interacted badly with the inode deletion mechanisms which required that
inodes couldn't be cached on other nodes after the transaction which
removed their final reference.

If a task did accidentally open a file by inode while it was being
deleted it could see the inode items in an inconsistent state and return
very confusing errors that look like corruption.

The fix is to give the handle iget callers a flag to tell iget to only
get the inode if it has a positive nlink.   If iget sees that the inode
has been unlinked it returns enoent.

Signed-off-by: Zach Brown <zab@versity.com>
2022-01-24 09:40:08 -08:00
Bryant G. Duffy-Ly
1b8e3f7c05 Add basic renameat2 syscall support
Support the generic renameat2 syscall, then add support for the
RENAME_NOREPLACE flag. To support the flag we need to check
the existence of both entries and return -EEXIST.

Signed-off-by: Bryant G. Duffy-Ly <bduffyly@versity.com>
2021-11-19 17:54:02 -06:00
Zach Brown
95ed36f9d3 Maintain inode count in super and log trees
Add a count of used inodes to the super block and a change in the inode
count to the log_trees struct.   Client transactions track the change in
inode count as they create and delete inodes.   The log_trees delta is
added to the count in the super as finalized log_trees are deleted.

Signed-off-by: Zach Brown <zab@versity.com>
2021-10-28 12:30:47 -07:00
Bryant Duffy-Ly
66b8c5fbd7 Enhance clarity of some kfree paths
In some of the allocation paths there are goto statements
that end up calling kfree(). That is fine, but in cases
where the pointer is not initially set to NULL we
might have undefined behavior. kfree() on a NULL pointer
does nothing, so essentially these changes should not
change behavior, but they clarify the code paths.

Signed-off-by: Bryant Duffy-Ly <bduffyly@versity.com>
2021-10-06 18:07:27 -05:00
Zach Brown
6ca8c0eec2 Consistently initialize dentry info
Unfortunately, we're back in kernels that don't yet have d_op->d_init.
We allocate our dentry info manually as we're given dentries.  The
recent verification work forgot to consistently make sure the info was
allocated before using it.   Fix that up, and while we're at it be a bit
more robust in how we check to see that it's been initialized without
grabbing the d_lock.

Signed-off-by: Zach Brown <zab@versity.com>
2021-09-13 14:41:07 -07:00
Zach Brown
ea2b01434e Add support for i_version
This adds i_version to our inode and maintains it as we allocate, load,
modify, and store inodes.  We set the flag in the superblock so
in-kernel users can use i_version to see changes in our inodes.

Signed-off-by: Zach Brown <zab@versity.com>
2021-09-13 14:41:07 -07:00
Zach Brown
46edf82b6b Add inode crtime creation time
Add an inode creation time field.  It's created for all new inodes.
It's visible to stat_more.  setattr_more can set it during
restore.

Signed-off-by: Zach Brown <zab@versity.com>
2021-09-03 11:14:41 -07:00
Zach Brown
79fbaa6481 Verify dentries after locking
Our dir methods were trusting dentry args.  The vfs code paths use
i_mutex to protect dentries across revalidate or lookup and method
calls.  But that doesn't protect methods running in other mounts.
Multiple nodes can interleave the initial lookup or revalidate then
actual method call.

Rename got this right.  It is very paranoid about verifying inputs after
acquiring all the locks it needs.

We extend this pattern to the rest of the methods that need to use the
mapping of name to inode (and our hash and pos) in dentries.  Once we
acquire the parent dir lock we verify that the dentry is still current,
returning -EEXIST or -ENOENT as appropriate.

Along these lines, we tighten up dentry info correctness a bit by
updating our dentry info (recording lock coverage and hash/pos) for
negative dentries produced by lookup or as the result of unlink.

Signed-off-by: Zach Brown <zab@versity.com>
2021-08-31 09:49:32 -07:00
Zach Brown
d6bed7181f Remove almost all interruptible waits
As subsystems were built I tended to use interruptible waits in the hope
that we'd let users break out of most waits.

The reality is that we have significant code paths that have trouble
unwinding.  Final inode deletion during iput->evict in a task is a good
example.  It's madness to have a pending signal turn an inode deletion
from an efficient inline operation to a deferred background orphan inode
scan deletion.

It also happens that golang built pre-emptive thread scheduling around
signals.  Under load we see a surprising amount of signal spam and it
has created surprising error cases which would have otherwise been fine.

This changes waits to expect that IOs (including network commands) will
complete reasonably promptly.  We remove all interruptible waits with
the notable exception of breaking out of a pending mount.  That requires
shuffling setup around a little bit so that the first network message we
wait for is the lock for getting the root inode.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-30 13:22:42 -07:00
Zach Brown
4893a6f915 scoutfs_dirents_equal should return bool
It looks like it returned u64 because it was derived from _name_hash().

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-30 13:22:42 -07:00
Zach Brown
73bf916182 Return ENOSPC as space gets low
Returning ENOSPC is challenging because we have clients working on
allocators which are a fraction of the whole and we use COW transactions
so we need to be able to allocate to free.  This adds support for
returning ENOSPC to client posix allocators as free space gets low.

For metadata, we reserve a number of free blocks for making progress
with client and server transactions which can free space.  The server
sets the low flag in a client's allocator if we start to dip into
reserved blocks.  In the client we add an argument to entering a
transaction which indicates if we're allocating new space (as opposed to
just modifying existing data or freeing).  When an allocating
transaction runs low and the server low flag is set then we return
ENOSPC.

Adding an argument to transaction holders and having it return ENOSPC
gave us the opportunity to clean it up and make it a little clearer.
More work is done outside the wait_event function and it now
specifically waits for a transaction to cycle when it forces a commit
rather than spinning until the transaction worker acquires the lock and
stops it.

For data the same pattern applies except there are no reserved blocks
and we don't COW data so it's a simple case of returning the hard ENOSPC
when the data allocator flag is set.

The server needs to consider the reserved count when refilling the
client's meta_avail allocator and when swapping between the two
meta_avail and meta_free allocators.

We add the reserved metadata block count to statfs_more so that df can
subtract it from the free meta blocks and make it clear when enospc is
going to be returned for metadata allocations.

We increase the minimum device size in mkfs so that small testing
devices provide sufficient reserved blocks.

And finally we add a little test that makes sure we can fill both
metadata and data to ENOSPC and then recover by deleting what we filled.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-07 14:13:14 -07:00
Zach Brown
07210b5734 Reliably delete orphaned inodes
Orphaned items haven't been deleted for quite a while -- the call to the
orphan inode scanner has been commented out for ages.  The deletion of
the orphan item didn't take rid zone locking into account as we moved
deletion from being strictly local to being performed by whoever last
used the inode.

This reworks orphan item management and brings back orphan inode
scanning to correctly delete orphaned inodes.

We get rid of the rid zone that was always _WRITE locked by each mount.
That made it impossible for other mounts to get a _WRITE lock to delete
orphan items.  Instead we rename it to the orphan zone and have orphan
item callers get _WRITE_ONLY locks inside their inode locks.  Now all
nodes can create and delete orphan items as they have _WRITE locks on
the associated inodes.

Then we refresh the orphan inode scanning function.  It now runs
regularly in the background of all mounts.  It avoids creating cluster
lock contention by finding candidates with unlocked forest hint reads
and by testing inode caches locally and via the open map before properly
locking and trying to delete the inode's items.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-02 10:52:46 -07:00
Zach Brown
22371fe5bd Fully destroy inodes after all mounts evict
Today an inode's items are deleted once its nlink reaches zero and the
final iput is called in a local mount.  This can delete inodes from
under other mounts which have opened the inode before it was unlinked on
another mount.

We fix this by adding cached inode tracking.  Each mount maintains
groups of cached inode bitmaps at the same granularity as inode locking.
As a mount performs its final iput it gets a bitmap from the server
which indicates if any other mount has inodes in the group open.

This makes the two fast paths of opening and closing linked files and of
deleting a file that was unlinked locally only pay a moderate cost of
either maintaining the bitmap locally and only getting the open map once
per lock group.  Removing many files in a group will only lock and get
the open map once per group.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-21 12:17:33 -07:00
Andy Grover
0deb232d3f Support O_TMPFILE and allow MOVE_BLOCKS into released extents
Support O_TMPFILE: Create an unlinked file and put it on the orphan list.
If it ever gains a link, take it off the orphan list.

Change MOVE_BLOCKS ioctl to allow moving blocks into offline extent ranges.
Ioctl callers must set a new flag to enable this operation mode.

RH-compat: tmpfile support is actually backported by RH into the 3.10 kernel.
We need to use some of their kabi-maintaining wrappers to use it:
use a struct inode_operations_wrapper instead of base struct
inode_operations, set S_IOPS_WRAPPER flag in i_flags. This lets
RH's modified vfs_tmpfile() find our tmpfile fn pointer.

Add a test that tests both creating tmpfiles as well as moving their
contents into a destination file via MOVE_BLOCKS.

xfstests common/004 now runs because tmpfile is supported.

Signed-off-by: Andy Grover <agrover@versity.com>
2021-04-05 14:23:44 -07:00
Zach Brown
da5911c311 Use d_materialise_unique to splice dir dentries
When we're splicing in dentries in lookup we can be splicing the result
of changes on other nodes into a stale dcache.  The stale dcache might
contain dir entries and the dcache does not allow aliased directories.

Use d_materialise_unique() to splice in dir inodes so that we remove all
aliased dentries which must be stale.

We can still use d_splice_alias() for all other inode types.  Any
existing stale dentries will fail revalidation before they're used.

Signed-off-by: Zach Brown <zab@versity.com>
2021-01-26 14:46:07 -08:00
Andy Grover
bed33c7ffd Remove item accounting
Remove kmod/src/count.h
Remove scoutfs_trans_track_item()
Remove reserved/actual fields from scoutfs_reservation

Signed-off-by: Andy Grover <agrover@versity.com>
2021-01-20 17:01:08 -08:00
Andy Grover
cf278f5fa0 scoutfs: Tidy some enum usage
Prefer named to anonymous enums. This helps readability a little.

Use enum as param type if possible (a couple spots).

Remove unused enum in lock_server.c.

Define enum spbm_flags using shift notation for consistency.

Rename get_file_block()'s "gfb" parameter to "flags" for consistency.

Signed-off-by: Andy Grover <agrover@versity.com>
2020-11-30 13:35:44 -08:00
Zach Brown
6bacd95aea scoutfs: fs uses item cache instead of forest
Use the new item cache for all the item work in the fs instead of
calling into the forest of btrees.  Most of this is mechanical
conversion from the _forest calls to the _item calls.  The item cache
no longer supports the kvec argument for describing values so all the
callers pass in the value pointer and length directly.

The item cache doesn't support saving items as they're deleted and later
restoring them from an error unwinding path.  There were only two users
of this.  Directory entries can easily guarantee that deletion won't
fail by dirtying the items first in the item cache.  Xattr updates were
a little trickier.  They can combine dirtying, creating, updating, and
deleting to atomically switch between items that describe different
versions of a multi-item value.  This also fixed a bug in the srch
xattrs where replacing an xattr would create a new id for the xattr and
leave existing srch items referencing a now deleted id.  Replacing now
reuses the old id.

And finally we add back in the locking and transaction item cache
integration.

Signed-off-by: Zach Brown <zab@versity.com>
2020-08-26 14:39:12 -07:00
Zach Brown
42e7fbb4f7 scoutfs: switch to using fnv1a for hashing
We had a few uses of crc for hashing.  That was fine enough for initial
testing but the huge number of xattrs that srch is recording was
seeing very bad collisions from the clumsy combination of crc32c into
a 64bit hash.  Replace it with FNV for now.

This also takes the opportunity to use 3 hash functions in the forest
bloom filter so that we can extract them from the 64bit hash of the key
rather than iterating and recalculating hashes for each function.

Signed-off-by: Zach Brown <zab@versity.com>
2020-08-26 14:39:12 -07:00
Zach Brown
f9ff25db23 scoutfs: add dirent name fingerprint
Entries in a directory are indexed by the hash of their name.  This
introduces a perfectly random access pattern.  And this results in a cow
storm as directories get large enough such that the leaf blocks that
store their entries are larger than our commits.  Each commit ends up
being full of cowed leaf blocks that contain a single new entry.

The dirent name fingerprints change the dirent key to first start with a
fingerprint of the name.  This reduces the scope of hash randomization
from the entire directory to entries with the same fingerprint.

On real customer dir sizes and file names we saw roughly 3x create rate
improvements from being able to create more entries in leaf blocks
within a commit.

Signed-off-by: Zach Brown <zab@versity.com>
2020-08-26 14:39:12 -07:00
Zach Brown
3a82090ab1 scoutfs: have per-fs inode nr allocators
We had previously seen lock contention between mounts that were either
resolving paths by looking up entries in directories or writing xattrs
in file inodes as they did archiving work.

The previous attempt to avoid this contention was to give each directory
its own inode number allocator which ensured that inodes created for
entries in the directory wouldn't share lock groups with inodes in other
directories.

But this creates the problem of operating on few files per lock for
reasonably small directories.  It also creates more server commits as
each new directory gets its inode allocation reservation.

The fix is to have mount-wide separate allocators for directories and
for everything else.  This puts directories and files in separate groups
and locks, regardless of directory population.

Signed-off-by: Zach Brown <zab@versity.com>
2020-08-26 14:39:12 -07:00
Zach Brown
495358996c scoutfs: fix older kc readdir emit
When we added the kernelcompat layer around the old and new readdir
interfaces there was some confusion in the old readdir interface filldir
arguments.  We were passing in our scoutfs dent item struct pointer
instead of the filldir callback buf pointer.  This prevented readdir
from working in older kernels because filldir would immediately see a
corrupt buf and return an error.

This renames the emit compat macro arguments to make them consistent
with the other calls and readdir now provides the correct pointer to the
emit wrapper.

Signed-off-by: Zach Brown <zab@versity.com>
2020-04-21 16:28:06 -07:00
Zach Brown
48448d3926 scoutfs: convert fs callers to forest
Convert fs callers to work with the btree forest calls instead of the
lsm item cache calls.  This is mostly a mechanical syntax conversion.
The inode dirtying path does now update the item rather than simply
dirtying it.

Signed-off-by: Zach Brown <zab@versity.com>
2020-01-17 11:21:36 -08:00
Zach Brown
ddd1a4ef5a scoutfs: support newer ->iterate readdir
The modern upstream kernel has a ->iterate() readdir file_operations
method which takes a context and calls dir_emit().   We add some
kernelcompat helpers to juggle the various function definitions, types,
and arguments to support both the old ->readdir(filldir) and the new
->iterate(ctx) interfaces.

Signed-off-by: Zach Brown <zab@versity.com>
2020-01-15 14:57:57 -08:00
Zach Brown
08a140c8b0 scoutfs: use our locking service
Convert client locking to call the server's lock service instead of
using a fs/dlm lockspace.

The client code gets some shims to send and receive lock messages to and
from the server.  Callers use our lock mode constants instead of the
DLM's.

Locks are now identified by their starting key instead of an additional
scoped lock name so that we don't have more mapping structures to track.
The global rename lock uses keys that are defined by the format as only
used for locking.

The biggest change is in the client lock state machine.  Instead of
calling the dlm and getting callbacks we send messages to our server and
get called from incoming message processing.  We don't have everything
come through a per-lock work queue.  Instead we send requests either
from the blocking lock caller or from a shrink work queue.  Incoming
messages are called in the net layer's blocking work contexts so we
don't need to do any more work to defer to other contexts.

The different processing contexts leads to a slightly different lock
life cycle.  We refactor and separate allocation and freeing from
tracking and removing locks in data structures.  We add a _get and _put
to track active use of locks and then async references to locks by
holders and requests are tracked separately.

Our lock service's rules are a bit simpler in that we'll only ever send
one request at a time and the server will only ever send one request at
a time.  We do have to do a bit of work to make sure we process back to
back grant reponses and invalidation requests from the server.

As of this change the lock setup and destruction paths are a little
wobbly.  They'll be shored up as we add lock recovery between the client
and server.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00