Convert client locking to call the server's lock service instead of
using a fs/dlm lockspace.
The client code gets some shims to send and receive lock messages to and
from the server. Callers use our lock mode constants instead of the
DLM's.
Locks are now identified by their starting key instead of an additional
scoped lock name so that we don't have more mapping structures to track.
The global rename lock uses keys that are defined by the format as only
used for locking.
The biggest change is in the client lock state machine. Instead of
calling the dlm and getting callbacks we send messages to our server and
get called from incoming message processing. We don't have everything
come through a per-lock work queue. Instead we send requests either
from the blocking lock caller or from a shrink work queue. Incoming
messages are handled in the net layer's blocking work contexts so we
don't need to do any more work to defer to other contexts.
The different processing contexts lead to a slightly different lock
life cycle. We refactor and separate allocation and freeing from
tracking and removing locks in data structures. We add a _get and _put
to track active use of locks, and async references to locks by holders
and requests are tracked separately.
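Roughly, the split looks like this (a sketch only; the struct and
function names here are illustrative, not the actual client code):

    struct client_lock {
            struct rb_node node;        /* tracked in the lock tree */
            atomic_t refcount;          /* holders, requests, and callers */
            int mode;                   /* our lock mode constants */
            struct scoutfs_key start;   /* locks are identified by start key */
    };

    static struct client_lock *lock_get(struct client_lock *lck)
    {
            atomic_inc(&lck->refcount);
            return lck;
    }

    static void lock_put(struct client_lock *lck)
    {
            /* freed only after it's been removed from tracking structures */
            if (atomic_dec_and_test(&lck->refcount))
                    kfree(lck);
    }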
Our lock service's rules are a bit simpler in that we'll only ever send
one request at a time and the server will only ever send one request at
a time. We do have to do a bit of work to make sure we process
back-to-back grant responses and invalidation requests from the server.
As of this change the lock setup and destruction paths are a little
wobbly. They'll be shored up as we add lock recovery between the client
and server.
Signed-off-by: Zach Brown <zab@versity.com>
Add a specific lock method for locking the global rename lock instead of
having the caller specify it as a global lock. We're getting rid of the
notion of lock scopes and requiring all locks to be related to keys.
The rename lock will use magic keys at the end of the volume.
Signed-off-by: Zach Brown <zab@versity.com>
We were only issuing one kernel warning when we couldn't resolve a path
to an inode due to excessive retries. It was hard to capture and we
only saw details from the first instance.
This adds a counter for each time we see excessive retries and returns
-ELOOP in that case. We also extend the link backref adding trace point
to include the found entry, if any.
Signed-off-by: Zach Brown <zab@versity.com>
Variable length keys lead to having a key struct point to the buffer
that contains the key. With dirents and xattrs now using small keys we
can convert everyone to using a single key struct and significantly
simplify the system.
We no longer have a separate generic key buf struct that points to
specific per-type key storage. All items use the key struct and fill
out the appropriate fields. All the code that paired a generic key buf
struct and a specific key type struct is collapsed down to a key struct.
There's no longer the difference between a key buf that shares a
read-only key, has its own precise allocation, or has a max size
allocation for incrementing and decrementing.
Each key user now has an init function that fills out its fields. It
looks a lot like the old pattern but we no longer have separate key
storage that the buf points to.
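For example, a dirent key init might now look something like this
(field and constant names are approximate, not the real definitions):

    static void init_dirent_key(struct scoutfs_key *key, u64 ino, u64 hash,
                                u64 pos)
    {
            memset(key, 0, sizeof(*key));
            key->zone = SCOUTFS_FS_ZONE;        /* illustrative field names */
            key->ino = cpu_to_le64(ino);
            key->type = SCOUTFS_DIRENT_TYPE;
            key->first = cpu_to_le64(hash);
            key->second = cpu_to_le64(pos);
    }

    /* callers can keep keys on the stack instead of managing allocations */
    struct scoutfs_key key;

    init_dirent_key(&key, ino, hash, pos);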
A bunch of code now takes the address of static key storage instead of
managing allocated keys. Conversely, swapping now uses the full keys
instead of pointers to the keys.
We don't need all the functions that worked on the generic key buf
struct because they had different lengths. Copy, clone, length init,
memcpy, all of that goes away.
The item API had some functions that tested the length of keys and
values. The key length tests vanish, and that gets rid of the _same()
call. The _same_min() call only had one user who didn't also test for
the value length being too large. Let's leave caller key constraints in
callers instead of trying to hide them on the other side of a bunch of
item calls.
We no longer have to track the number of key bytes when calculating if
an item population will fit in segments. This removes the key length
from reservations, transactions, and segment writing.
The item cache key querying ioctls no longer have to deal with variable
length keys. They simply specify the start key, the ioctls return the
number of keys copied instead of bytes, and the caller is responsible
for incrementing the next search key.
The segment no longer has to store the key length. It stores the key
struct in the item header.
The fancy variable length key formatting and printing can be removed.
We have a single format for the universal key struct. The SK_ wrappers
that bracketed calls to use preempt-safe per-cpu buffers can turn back
into their normal calls.
Manifest entries are now a fixed size. We can simply split them between
btree keys and values and initialize them instead of allocating them.
This means that level 0 entries don't have their own format that sorts
by the seq. They're sorted by the key like all the other levels.
Compaction needs to sweep all of them looking for the oldest and read
can stop sweeping once it can no longer overlap. This makes rare
compaction more expensive and common reading less expensive, which is
the right tradeoff.
Signed-off-by: Zach Brown <zab@versity.com>
Directory entries were the last items that had large variable length
keys because they stored the entry name in the key. We'd like to have
small fixed size keys so let's store dirents with small keys.
Entries for lookup are stored at the hash of the name instead of the
full name. The key also contains the unique readdir pos so that we
don't have to deal with collision on creation. The lookup procedure now
does need to iterate over all the readdir positions for the hash value
and compare the names.
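A rough sketch of that walk, with stand-in names and signatures (the
real helpers and the dirent value layout will differ):

    static int lookup_dirent(struct super_block *sb, u64 dir_ino, u64 hash,
                             const char *name, unsigned int name_len,
                             u64 *ino_ret)
    {
            u8 buf[SCOUTFS_MAX_DENT_BYTES];         /* placeholder size */
            struct scoutfs_dirent *dent = (void *)buf;
            struct kvec val = { buf, sizeof(buf) };
            struct scoutfs_key key;
            struct scoutfs_key last;
            int ret;

            /* every readdir pos stored under this name's hash */
            init_dirent_key(&key, dir_ino, hash, 0);
            init_dirent_key(&last, dir_ino, hash, U64_MAX);

            for (;;) {
                    /* assume _next returns the value length it copied */
                    ret = scoutfs_item_next(sb, &key, &last, &val);
                    if (ret < 0)
                            break;  /* -ENOENT once the hash is exhausted */

                    /* only a name comparison tells us if this entry matches */
                    if (ret == offsetof(struct scoutfs_dirent, name) + name_len &&
                        !memcmp(dent->name, name, name_len)) {
                            *ino_ret = le64_to_cpu(dent->ino);
                            ret = 0;
                            break;
                    }

                    scoutfs_key_inc(&key);  /* on to the next colliding pos */
            }

            return ret;
    }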
Entries for link backref walking are stored with the entry's position in
the parent dir instead of the entry's name. The name is then stored in
the value. Inode to path conversion can still walk the backref items
without having to lookup dirent items.
These changes mean that all directory entry items are now stored at a
small key with some u64s (hash, pos, parent dir, etc) and have a value
with the dirent struct and full entry name. This lets us use the same
key and value format for the three entry key types. We no longer have
to allocate keys, we can store them on the stack.
We store the entry's hash and pos in the dirent struct in the item value
so that any item has all the fields to reference all the other item
keys. We store the same values in the dentry_info so that deletion
(unlink and rename) can find all the entries.
The ino_path ioctl can now much more clearly iterate over parent
directories and entry positions instead of oh so cleverly iterating over
null terminated names in the parent directories. The ioctl interface
structs and implementation become simpler.
Signed-off-by: Zach Brown <zab@versity.com>
Originally the item interfaces were written with full support for
vectored keys and values. Callers constructed keys and values made up
of header structs and data buffers. Segments supported much larger
values which could span pages when stored in memory.
But over time we've pulled that support back. Keys are described by a
key struct instead of a multi-element kvec. Values are now much smaller
and don't span pages. The item interfaces still use the kvec arrays but
everyone only uses a single element.
So let's make the world a whole lot less awful by having the item
interfaces only support a single value buffer specified by a kvec. A
bunch of code disappears and the result is much easier to understand.
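The call pattern ends up looking roughly like this (the init helper and
exact signatures are approximations):

    static int read_inode_item(struct super_block *sb, u64 ino,
                               struct scoutfs_inode *sinode)
    {
            struct scoutfs_key key;
            struct kvec val = { sinode, sizeof(*sinode) };

            init_inode_key(&key, ino);      /* hypothetical init helper */

            /* one key struct and one contiguous value buffer */
            return scoutfs_item_lookup(sb, &key, &val);
    }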
Signed-off-by: Zach Brown <zab@versity.com>
The values used in dirent item creation are one of the few places we
have value kvecs with multiple entries. Let's instead allocate and copy
the dirent struct and name into a contiguous buffer so that we can move
towards single vector values.
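Something like the following sketch, with approximate names and fields:

    static int create_dirent_item(struct super_block *sb,
                                  struct scoutfs_key *key, const char *name,
                                  unsigned int name_len, u64 ino, u8 type)
    {
            struct scoutfs_dirent *dent;
            unsigned int val_len = sizeof(*dent) + name_len;
            struct kvec val;
            int ret;

            /* pack the dirent header and the name into one contiguous buffer */
            dent = kmalloc(val_len, GFP_NOFS);
            if (!dent)
                    return -ENOMEM;

            dent->ino = cpu_to_le64(ino);
            dent->type = type;
            memcpy(dent->name, name, name_len);     /* name[] flexible array */

            val.iov_base = dent;
            val.iov_len = val_len;
            ret = scoutfs_item_create(sb, key, &val);

            kfree(dent);
            return ret;
    }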
Signed-off-by: Zach Brown <zab@versity.com>
Every caller of scoutfs_item_lookup_exact() provided a size that matches
the value buffer. Let's remove the redundant arg and use the value
buffer length as the exact size to match.
Signed-off-by: Zach Brown <zab@versity.com>
Initially we had d_revalidate always return that the dentry was invalid.
This avoids dentry cache consistency problems across the cluster by
always performing lookups. That's slow by itself, but it turns out that
the dentry invalidation that happens on revalidation failure is very
expensive if you have lots of dentries.
So we switched to forcefully dropping dirents as we revoked their lock.
That avoided the cost of revalidation failure but it added the problem
that dentries are unhashed when their locks are dropped. This causes
paths like getcwd() to return errors when they see unhashed dentries
instead of trying to revalidate them.
This implements a d_revalidate which actually does work to determine if
the dentry is still valid. When we populate dentries under a lock we
add them to a list on the lock. As we drop the lock we remove them from
the list. But the dentry is not modified. This lets paths like
getcwd() still work. Then we implement revalidation that does the
actual item lookups if the dentry's lock has been dropped. This lets
revalidation return success and avoid the terrible invalidation costs
from returning failure and then calling lookup to populate a new dentry.
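In rough sketch form (the real structures, locking, and helpers will
differ; dentry_info and revalidate_entry_items() are stand-ins):

    static int scoutfs_d_revalidate(struct dentry *dentry, unsigned int flags)
    {
            struct dentry_info *di = dentry->d_fsdata;

            if (flags & LOOKUP_RCU)
                    return -ECHILD;

            /* still on its lock's list: the covering lock hasn't been dropped */
            if (!list_empty(&di->lock_entry))
                    return 1;

            /*
             * The lock was dropped.  Look the items up again and return 1
             * if the entry still matches, instead of failing and paying
             * for invalidation plus a fresh lookup.
             */
            return revalidate_entry_items(dentry);
    }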
This brings us more in line with the revalidation behaviour of other
systems that maintain multi-node dcache consistency.
Signed-off-by: Zach Brown <zab@versity.com>
Having an inode number allocation pool in the super block meant that all
allocations across the mount are interleaved. This means that
concurrent file creation in different directories will create
overlapping inode numbers. This leads to lock contention as reasonable
workloads will tend to distribute work by directories.
The easy fix is to have per-directory inode number allocation pools. We
take the opportunity to clean up the network request so that the caller
gets the allocation instead of having it be fed back in via a weird
callback.
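A sketch of the per-directory pool (names and fields are illustrative,
and refill races are glossed over here):

    struct ino_alloc_pool {
            spinlock_t lock;
            u64 next_ino;   /* next number to hand out */
            u64 nr_free;    /* numbers left in the batch from the server */
    };

    static int alloc_ino(struct super_block *sb, struct ino_alloc_pool *pool,
                         u64 *ino_ret)
    {
            int ret;

            spin_lock(&pool->lock);
            if (pool->nr_free == 0) {
                    spin_unlock(&pool->lock);
                    /* the caller gets the new range back in the reply */
                    ret = request_ino_batch(sb, pool);
                    if (ret)
                            return ret;
                    spin_lock(&pool->lock);
            }
            *ino_ret = pool->next_ino++;
            pool->nr_free--;
            spin_unlock(&pool->lock);

            return 0;
    }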
Signed-off-by: Zach Brown <zab@versity.com>
We aren't using the size index. It has runtime and code maintenance
costs that aren't worth paying. Let's remove it.
Removing it from the format and no longer maintaining it are
straightforward.
The bulk of this patch is actually the act of removing it from the index
locking functions. We no longer have to predict the size that will be
stored during the transaction to lock the index items that will be
created during the transaction. A bunch of code to predict the size and
then pass it into locking and transactions goes away. Like other inode
fields we now update the size as it changes.
Signed-off-by: Zach Brown <zab@versity.com>
This is implemented by filling in our export ops functions.
When we get those right, the VFS handles most of the details for us.
Internally, scoutfs handles are two u64's (ino and parent ino) and a
type which indicates whether the handle contains the parent ino or not.
Surprisingly enough, no existing type matches this pattern so we use our
own types to identify the handle.
Most of the export ops are self-explanatory. scoutfs_encode_fh() takes
an inode and an optional parent and encodes those into the smallest
handle that would fit. scoutfs_fh_to_[dentry|parent] turn an existing
file handle into a dentry.
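A sketch of the encoding; the FILEID_ type values are made up per the
above, and the exact export op signature depends on the kernel version:

    struct scoutfs_fid {
            __u64 ino;
            __u64 parent_ino;       /* only present in the parent variant */
    };

    static int scoutfs_encode_fh(struct inode *inode, __u32 *fh, int *max_len,
                                 struct inode *parent)
    {
            struct scoutfs_fid *fid = (struct scoutfs_fid *)fh;
            int type = parent ? FILEID_SCOUTFS_WITH_PARENT : FILEID_SCOUTFS;
            int len = parent ? 4 : 2;               /* in __u32 units */

            if (*max_len < len) {
                    *max_len = len;
                    return FILEID_INVALID;
            }

            fid->ino = inode->i_ino;
            if (parent)
                    fid->parent_ino = parent->i_ino;
            *max_len = len;

            return type;
    }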
scoutfs_get_parent() is a bit different and would be called on
directory inodes to connect a disconnected dentry path. For
scoutfs_get_parent(), we can export add_next_linkref() and use the backref
mechanism to quickly find a parent directory.
scoutfs_get_name() is almost identical to scoutfs_get_parent(). Here we're
linking an inode to a name which exists in the parent directory. We can also
use add_next_linkref, and simply copy the name from the backref.
As a result of this patch we can also now export scoutfs file systems
via NFS, however testing NFS thoroughly is outside the scope of this
work so export support should be considered experimental at best.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
[zab edited <= NAME_MAX]
We have a bug filed where the fs got stuck spinning in
scoutfs_dir_get_backref_path(). There have been enough changes lately that
we're not sure if this issue still exists. Catch if we have an excessive
number of iterations through our loop there and exit with some debug info.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
Today we use unconditional dentry revalidation to provide directory
entry consistency. Any time the vfs tries to use a cached dentry we
tell it to drop it and perform a lookup. This hits our item cache which
is kept consistent by the locks.
This would just be a waste of cpu if it weren't for how heavyweight the
vfs revalidation->lookup path is here. It doesn't just invalidate the
entry; it uses shrink_dcache_parent() to drop all the cached entries in
the subtree rooted at the cached entry.
We saw 22 second long cpu livelocks in this shrink_dcache_parent() when
creating and archiving empty files.
Instead let's let the vfs use dcache entries. We only invalidate them as
we're dropping the lock that covers them. (Today coarse inode locks
cover all the entries in batches of inodes.) We can use d_drop() to
remove entries from the cache to stop them from satisfying lookup
without trying to free all the dentries under them.
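Roughly (the structure and field names are illustrative):

    /* as a lock is invalidated, unhash the dentries it covered */
    static void drop_covered_dentries(struct held_lock *lck)
    {
            struct dentry_info *di, *tmp;

            spin_lock(&lck->lock);
            list_for_each_entry_safe(di, tmp, &lck->dentries, lock_entry) {
                    list_del_init(&di->lock_entry);
                    /*
                     * d_drop() just unhashes so future lookups miss; it
                     * doesn't walk and free the whole subtree the way
                     * revalidation failure does via shrink_dcache_parent().
                     */
                    d_drop(di->dentry);
            }
            spin_unlock(&lck->lock);
    }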
Signed-off-by: Zach Brown <zab@versity.com>
Simple attr changes are mostly handled by the VFS; we just have to
mirror them into our inode. Truncates are done in a separate set of
transactions.
We use a flag to indicate an in-progress truncate. This allows us to
detect and continue the truncate should the node crash.
Index locking is a bit complicated, so we add a helper function to grab
index locks and start a transaction.
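The rough shape of the size change, with approximate helper names and
transaction boundaries elided:

    static int set_size(struct inode *inode, u64 new_size)
    {
            struct scoutfs_inode_info *si = SCOUTFS_I(inode);
            int ret;

            /* one transaction: persist the flag and the new size */
            si->flags |= SCOUTFS_INO_FLAG_TRUNCATE;
            i_size_write(inode, new_size);
            ret = scoutfs_update_inode_item(inode);
            if (ret)
                    return ret;

            /* the data items are deleted in their own series of transactions */
            ret = truncate_data_items(inode, new_size);
            if (ret)
                    return ret;

            /* clear the flag so recovery knows it doesn't need to finish this */
            si->flags &= ~SCOUTFS_INO_FLAG_TRUNCATE;
            return scoutfs_update_inode_item(inode);
    }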
With this patch we now pass the following xfstests:
generic/014
generic/101
generic/313
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
Call it scoutfs_inode_index_try_lock_hold since it may fail and unwind
as part of normal (not an error) operation. This lets us re-use the
name in an upcoming patch.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
Renaming a dir between parents and clobbering an existing empty dir
wasn't correctly updating the parent link counts. Updating parent link
counts when dirs are moved between parents is an independent operation
from decreasing the link count of an existing target that the rename
clobbers.
Signed-off-by: Zach Brown <zab@versity.com>
We only set the .getattr method to our locked getattr filler for regular
files. Set it for all files so that stat, etc, will see the current
inode for all file types.
Signed-off-by: Zach Brown <zab@versity.com>
scoutfs_item_create() hasn't been working with lock coverage. It
wouldn't return -ENOENT if it didn't have the lock cached. It would
create items outside lock coverage so they wouldn't be invalidated and
re-read if another node modified the item.
Add a lock arg and teach it to populate the cache so that it's correctly
consistent.
Signed-off-by: Zach Brown <zab@versity.com>
Add lock coverage for inode index items.
Sadly, this isn't trivial. We have to predict the value of the indexed
fields before the operation to lock those items. One value in
particular we can't reliably predict: the sequence of the transaction we
enter after locking. Also operations can create an absolute ton of
index item updates -- rename can modify nr_inodes * items_per_inode * 2
items, so maybe 24 today. And these items can be arbitrarily positioned
in the key space.
So to handle all this we add functions to gather the predicted item
values we'll need to lock, sort and lock them all, then pass appropriate
locks down to the item functions during inode updates.
The trickiest bit of the index locking code is having to retry if the
sequence number changes. Preparing locks has to guess the sequence
number of its upcoming trans and then makes item update decisions based
on that. If we enter and have a different sequence number then we need
to back off and retry with the correct sequence number (we may find that
we'll need to update the indexed meta seq and need to have it locked).
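The retry loop looks something like this (function names are
approximations of the real helpers):

    static int lock_hold_index(struct super_block *sb, struct inode *inode,
                               struct scoutfs_item_count cnt,
                               struct list_head *ind_locks)
    {
            u64 seq;
            int ret;

            for (;;) {
                    seq = scoutfs_trans_sample_seq(sb);     /* guess the next seq */

                    ret = prepare_index_locks(sb, ind_locks, inode, seq);
                    if (ret)
                            return ret;

                    ret = scoutfs_hold_trans(sb, cnt);
                    if (ret) {
                            unlock_index_locks(sb, ind_locks);
                            return ret;
                    }

                    /* a different seq can change which index items need locking */
                    if (scoutfs_trans_seq(sb) == seq)
                            return 0;

                    scoutfs_release_trans(sb);
                    unlock_index_locks(sb, ind_locks);
            }
    }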
The use of the functions is straightforward. Sites figure out the
predicted sizes, lock, pass the locks to inode updates, and unlock.
While we're at it we replace the individual item field tracking
variables in the inode info with an array of indexed values. The code
ends up a bit nicer. It also gets rid of the indexed time fields that
were left behind and were unused.
It's worth noting that we're getting exclusive locks on the index
updates. Locking the meta/data seq updates results in complete global
serialization of all changes. We'll need concurrent writer locks to get
concurrency back.
Signed-off-by: Zach Brown <zab@versity.com>
Add a full lock argument to scoutfs_update_inode_item() and use it to
pass the lock's end key into item_update(). This'll get changed into
passing the full lock into _update soon.
Signed-off-by: Zach Brown <zab@versity.com>
Add the full lock argument to _item_dirty() so that it can verify lock
coverage in addition to limiting item cache population to the range
covered by the lock.
This also ropes in scoutfs_dirty_inode_item() which is a thin wrapper
around _item_dirty().
Signed-off-by: Zach Brown <zab@versity.com>
Add the full lock argument to _item_next*() so that it can verify lock
coverage in addition to limiting item cache population to the range
covered by the lock.
Signed-off-by: Zach Brown <zab@versity.com>
Let's give the item functions the full lock so that they can make sure
that the lock has coverage for the keys involved in the operation.
This _lookup*() conversion is first so it adds the
lock_coverager() helper.
Signed-off-by: Zach Brown <zab@versity.com>
scoutfs_rename() looks for dirents again after acquiring cluster locks.
It needs to pass in the lock end keys to limit the items that are read
into the cache.
Signed-off-by: Zach Brown <zab@versity.com>
Move to static mapping items instead of unbounded extents.
We get more predictable data structures and simpler code but still get
reasonably dense metadata.
We no longer need all the extent code needed to split and merge extents,
test for overlaps, and all that. The functions that use the mappings
(get_block, fiemap, truncate) now have a pattern where they decode the
mapping item into an allocated native representation, do their work, and
encode the result back into the dense item.
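The pattern is roughly as follows (helpers, masks, and sizes here are
placeholders):

    static int set_mapped_block(struct super_block *sb, struct scoutfs_key *key,
                                u64 iblock, u64 blkno)
    {
            struct native_mapping *map;
            u8 enc[MAX_MAPPING_VAL];        /* worst case encoded size */
            int len;
            int ret;

            map = kzalloc(sizeof(*map), GFP_NOFS);
            if (!map)
                    return -ENOMEM;

            /* decode the dense item into the native representation */
            ret = read_and_decode_mapping(sb, key, map);
            if (ret)
                    goto out;

            /* do the work on the native form */
            map->blocks[iblock & MAPPING_BLOCK_MASK] = blkno;

            /* pack it back down and update the item */
            len = encode_mapping(enc, map);
            ret = scoutfs_item_update_value(sb, key, enc, len);
    out:
            kfree(map);
            return ret;
    }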
We do have to grow the largest possible item value to fit the worst case
encoding expansion of random block numbers.
The local allocators are no longer two extents but are instead simple
bitmaps: one for full segments and one for individual blocks. There are
helper functions to free and allocate segments and blocks, with careful
coordination of, for example, freeing a segment once all of its
constituent blocks are free.
_fiemap is refactored a bit to make it more clear what's going on.
There's one function that either merges the next bit with the currently
building extent or fills the current and starts recording from a
non-mergeable additional block. The old loop worked this way but was
implemented with a single squirrelly iteration over the extents. This
wasn't feasible now that we're also iterating over blocks inside the
mapping items. It's a lot clearer to call out to merge or fill the
fiemap entry.
The dirty item reservation counts for using the mappings are reduced
significantly because each modification no longer has to assume that it
might merge with two adjacent contiguous neighbours.
Signed-off-by: Zach Brown <zab@versity.com>
The item count estimate functions didn't obviously differentiate between
adding to a count and resetting it. Most callers initialized the count
struct to 0 on the stack, incremented their estimate once, and passed it
in. The problem is that those same functions that increment once in
callers are also used in other estimates to build counts based on
multiple operations.
This tripped up the data truncate path. It looped and kept incrementing
its count while truncating a file with lots of extents until the count
got so large that it didn't fit in a segment by itself and blocked
forever.
This cleans up the item count code so that it's much harder to get
wrong. We differentiate between the SIC_*() high level count estimates
that are meant to be passed in to _hold_trans(), and the internal
__count_*() functions which are used to add up the item counts that make
up an aggregate operation.
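The layering looks roughly like this (fields, sizes, and the helpers
called are simplified stand-ins):

    /* internal helpers only ever add to a caller's count */
    static void __count_dirent_items(struct scoutfs_item_count *cnt,
                                     unsigned int name_len)
    {
            cnt->items += 3;        /* entry, readdir, and link backref items */
            cnt->keys += 3 * MAX_KEY_SIZE;
            cnt->vals += 3 * (sizeof(struct scoutfs_dirent) + name_len);
    }

    /* SIC_*() estimates start from zero and describe one whole operation */
    static inline struct scoutfs_item_count SIC_create(unsigned int name_len)
    {
            struct scoutfs_item_count cnt = {0,};

            __count_inode_item(&cnt);
            __count_dirent_items(&cnt, name_len);

            return cnt;
    }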
With this fix the only way to use the count in extent truncation is to
correctly reset it to the item count for each transaction.
Signed-off-by: Zach Brown <zab@versity.com>
We lock multiple inodes by order of their inode number. This fixes
the directory entry paths that hold parent dir and target inode locks.
Link and unlink are easy because they just acquire the existing parent
dir and target inode locks.
Lookup is a little squirrelly because we don't want to try and order
the parent dir lock with locks down in iget. It turns out that it's
safe to drop the dir lock before calling iget as long as iget handles
racing the inode cache instantiation with inode deletion.
Creation is the remaining pattern and it's a little weird because we
want to lock the newly created inode before we create it and the items
that store it. We add a function that correctly orders the locks,
transaction, and inode cache instantiation.
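The ordering rule itself is simple enough to sketch; lock_inode() and
unlock_inode() stand in for the real calls:

    static int lock_inode_pair(struct super_block *sb, struct inode *a,
                               struct inode *b)
    {
            struct inode *first = a;
            struct inode *second = b;
            int ret;

            /* always acquire the two locks in increasing inode number order */
            if (second->i_ino < first->i_ino)
                    swap(first, second);

            ret = lock_inode(sb, first);
            if (ret == 0 && second != first) {
                    ret = lock_inode(sb, second);
                    if (ret)
                            unlock_inode(sb, first);
            }

            return ret;
    }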
Signed-off-by: Zach Brown <zab@versity.com>
Previously we had lots of inode creation callers that used a function to
create the dirent items and we had unlink remove entries by hand.
Rename is different because it wants to remove and add multiple links as
it does its work, including recreating links that it has deleted.
We rework add_entry_item() so that it gets the specific fields it needs
instead of getting them from the vfs structs. This makes it clear that
callers are responsible for the source of the fields. Specifically we
need to be able to add entries during failed rename cleanup without
allocating a new readdir pos from the parent dir.
With callers now responsible for the inputs to add_entry_items() we move
some of its code out into all callers: checking name length, dirtying
the parent dir inode, and allocating a readdir pos from the parent.
We then refactor most of _unlink() into a del_entry_items() to match
addition. This removes the last user of scoutfs_item_delete_many() and
it will be removed in a future commit.
With the entry item helpers taking specific fields all the helpers they
use also need to use specific fields instead of the vfs structs.
To make rename cluster safe we need to get cluster locks for all the
inodes that we work with. We also have to check that the locally cached
vfs input is still valid after acquiring the locks. We only check the
basic structural correctness of the args: that parent dirs don't violate
ancestor rules to create loops and that the entries assumed by the
rename arguments still exist, or not.
Signed-off-by: Zach Brown <zab@versity.com>
We need to lock and refresh the VFS inode before it checks permissions in
system calls, otherwise we risk checking against stale inode metadata.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
[zab: adapted to newer lock call]
Signed-off-by: Zach Brown <zab@versity.com>
Now that we have the inode refreshing flags let's add them to the
callers that want to have a current inode after they have their lock.
Callers locking newly created items use the new inode flag to reset the
refresh gen.
A few inode tests are moved down to after locking so that they can test
the current refreshed inode.
Signed-off-by: Zach Brown <zab@versity.com>
Lock callers can specify that they want inode fields reread from items
after the lock is acquired. dlmglue sets a refresh_gen in the locks
that we store in inodes to track when they were last refreshed and if
they need a refresh.
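A sketch of the check (field and helper names are approximate):

    /* after acquiring the lock, reread the inode items if the lock is newer */
    static int refresh_inode(struct inode *inode, struct held_lock *lock)
    {
            struct scoutfs_inode_info *si = SCOUTFS_I(inode);
            int ret = 0;

            if (si->refresh_gen != lock->refresh_gen) {
                    ret = read_inode_items(inode);
                    if (ret == 0)
                            si->refresh_gen = lock->refresh_gen;
            }

            return ret;
    }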
Signed-off-by: Zach Brown <zab@versity.com>
This is based on Mark Fasheh <mfasheh@versity.com>'s series that
introduced inode refreshing after locking and a trylock for readpage.
Rework the inode locking function so that it's more clearly named and
takes flags and the inode struct.
We have callers that want to lock the logical inode but aren't doing
anything with the vfs inode so we provide that specific entry point.
Signed-off-by: Zach Brown <zab@versity.com>
We move struct ocfs2_lock_res_ops and flags to dlmglue.c so that
locks.c can get access to it. Similarly, we export
ocfs2_lock_res_init_common() so that locks.c can initialize each lockres
before use. Also, free_lock_tree() now has to happen before we shut
down the dlm - this gives dlmglue the opportunity to unlock their
underlying dlm locks before we go off freeing the structures.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
This gives us cluster locking for the overwhelming majority of metadata ops
that scoutfs supports. In particular, we can create and modify file metadata
from one node and immediately see the changes reflected on another node.
In addition to synchronization the cluster locks here are providing an I/O
endpoint for our item cache, ensuring that it doesn't read stale items.
Readdir and file read/write are notable exceptions - they require a more
specific approach and will be implemented in a future patch.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
[fixed iget unlock and truncated commit message summary]
Signed-off-by: Zach Brown <zab@versity.com>
These transformations are mechanical and there aren't many callers of
these so we combine them into one commit.
Signed-off-by: Zach Brown <zab@versity.com>
Add an end key to the item_next calls to limit how many items will be
read into the cache. Callers typically get this from the lock they hold
that covers the iteration. We differentiate between iteration and
caching so that a series of small iterations (listxattr on inodes,
namespace walk in small dirs) can be satisfied by a single read of
adjacent items from segments.
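A small xattr iteration sketches the idea (names and signatures are
approximate):

    static int count_xattr_items(struct super_block *sb, u64 ino)
    {
            struct scoutfs_key key;
            struct scoutfs_key last;
            struct kvec val = { NULL, 0 };  /* only the keys matter here */
            int nr = 0;
            int ret;

            init_xattr_key(&key, ino, 0);
            init_xattr_key(&last, ino, U64_MAX);    /* typically the lock's end key */

            for (;;) {
                    ret = scoutfs_item_next(sb, &key, &last, &val);
                    if (ret == -ENOENT)
                            return nr;      /* walked past the final xattr */
                    if (ret < 0)
                            return ret;

                    nr++;
                    scoutfs_key_inc(&key);  /* advance past the returned item */
            }
    }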
Signed-off-by: Zach Brown <zab@versity.com>
The item cache can only be populated with items that are covered by
locks. Require callers to provide the farthest key that can be covered
by the locks. Locks provide a key for exactly this purpose.
Signed-off-by: Zach Brown <zab@versity.com>
Holding a DLM lock protects a range of the key space. The DLM locks
span inodes or regions of inodes. We need the sort order in LSM items
to match the DLM range keys so that we can read all the items covered by
a lock into the cache from a region of LSM segments. If their orders
differed then we'd have to jump around segments to find all the items
covered by a given DLM lock.
Previously we were sorting by type then, within types, by inode. Now we
want to sort by inode then by type. But there are structures which
previously had a type but weren't then sorted by inode. We introduce
zones as the primary sort key. Inode index and node zones are sorted by
the inode fields and node ids respectively. Then comes the fs zone,
sorted first by inode and then by the type of the key.
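The resulting order, shown with illustrative fields:

    static int cmp_u64(u64 a, u64 b)
    {
            return a < b ? -1 : a > b ? 1 : 0;
    }

    /* zone first, then the primary value (inode, node id), then the type */
    static int compare_keys(struct item_key *a, struct item_key *b)
    {
            return cmp_u64(a->zone, b->zone) ?:
                   cmp_u64(a->primary, b->primary) ?:
                   cmp_u64(a->type, b->type) ?:
                   cmp_u64(a->offset, b->offset);
    }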
The bulk of this is the mechanical introduction of the zone field to the
keys, moving the type field down, and a bulk rename of _KEY to _TYPE.
But there are some more substantial changes.
The orphan keys needed to be put in a zone. They fit in the NODE zone
which is all about resources that nodes hold and would need to be
cleaned up if the node went away.
The key formatting is significantly changed to match the new sort order.
Formatted keys are now generally of the form "zone.primary.type..."
And finally with the keys now properly sorted by inodes we can correctly
construct a single range of item cache keys to invalidate when unlocking
the inode group locks.
Signed-off-by: Zach Brown <zab@versity.com>
Add a relatively small universal value size limit. This will be needed
by more dense item packing to predict the worst case padding to avoid
full items crossing block boundaries.
We refactor the existing symlink and xattr item value limit to use this
new limit.
Signed-off-by: Zach Brown <zab@versity.com>
We're shrinking the max item value size so we need to store symlinks
with large target paths in multiple items. The arbitrary max value size
defined here will be replaced in the future with the new global maximum
value size.
Signed-off-by: Zach Brown <zab@versity.com>
We had a simple mechanism for ensuring that transaction didn't create
more items than would fit in a single written segment. We calculated
the most dirty items that a holder could generate and assumed that all
holders dirtied that much.
This had two big problems.
The first was that it wasn't accounting for nested holds.
write_begin/end calls the generic inode dirtying path while holding a
transaction. This ended up deadlocking as the dirty inode waited to be
able to write while its trans, held back in write_begin, prevented
writeout.
The second was that the worst case (full size xattr) item dirtying is
enormous and meaningfully restricts concurrent transaction holders.
With no currently dirty items you can have less than 16 full size xattr
writes. This concurrency limit only gets worse as the transaction fills
up with dirty items.
This fixes those problems. It adds precise accounting of the dirty
items that can be created while a transaction is held. These
reservations are tracked in journal_info so that they can be used by
nested holds. The precision allows much greater concurrency as
something like a create will try to reserve a few hundred bytes instead
of 64k. Normal sized xattr operations won't try to reserve the largest
possible space.
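A sketch of the journal_info nesting (the fields and helpers here are
assumed, not the real ones):

    struct trans_reservation {
            struct scoutfs_item_count cnt;  /* what this holder reserved */
            int holders;                    /* nesting depth */
    };

    static int hold_trans(struct super_block *sb, struct scoutfs_item_count cnt)
    {
            struct trans_reservation *res = current->journal_info;

            if (res) {
                    /* nested hold: it's covered by the outer reservation */
                    res->holders++;
                    return 0;
            }

            res = kzalloc(sizeof(*res), GFP_NOFS);
            if (!res)
                    return -ENOMEM;

            res->cnt = cnt;
            res->holders = 1;
            current->journal_info = res;

            /* wait until the currently dirty items plus cnt fit in a segment */
            return wait_for_reservation(sb, &cnt);
    }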
We add some feedback from the item cache to the transaction to issue
warnings if a holder dirties more items than it reserved.
Now that we have precise item/key/value counts (segment space
consumption is a function of all three :/) we can't have a single atomic
track transaction holders. We add a long-overdue trans_info and put a
proper lock and fields there and much more clearly track transaction
serialization amongst the holders and writer.
Signed-off-by: Zach Brown <zab@versity.com>
The use of the Scout ioctls for inode-since and data-since on the root
directory is a rather helpful boost. This allows user code to start on
blank filesystems and monitor activity without needing to create files.
The ioctl code was already present, so wiring it into the directory
file operations was all that needed to happen.
Signed-off-by: Nic Henke <nic.henke@versity.com>
Reviewed-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>