This was inadvertently left out of the main CW locking commit. We
simply need to seq_print the new fields. We add them to the end of
the line, thus preserving backwards compatibility with old versions
of the debug format.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
This will give us concurrency yet still allow our ioctls to drive cache
syncing/invalidation on other nodes. Our lock_coverage() checks evolve
to handle direct dlm modes, allowing us to verify correct usage of CW
locks.
As a test, we can run createmany on two nodes at the same time, each
working in their own directory. The following commands were run on each
node:
$ mkdir /scoutfs/`uname -n`
$ cd /scoutfs/`uname -n`
$ /root/createmany -o ./file_$i 100000
Before this patch that test wouldn't finish in any reasonable amount of
time and I would kill it after some number of hours.
After this patch, we make swift progress through the test:
[root@fstest3 fstest3.site]# /root/createmany -o ./file_$i 100000
- created 10000 (time 1509394646.11 total 0.31 last 0.31)
- created 20000 (time 1509394646.38 total 0.59 last 0.28)
- created 30000 (time 1509394646.81 total 1.01 last 0.43)
- created 40000 (time 1509394647.31 total 1.51 last 0.50)
- created 50000 (time 1509394647.82 total 2.02 last 0.51)
- created 60000 (time 1509394648.40 total 2.60 last 0.58)
- created 70000 (time 1509394649.06 total 3.26 last 0.66)
- created 80000 (time 1509394649.72 total 3.93 last 0.66)
- created 90000 (time 1509394650.36 total 4.56 last 0.64)
total: 100000 creates in 35.02 seconds: 2855.80 creates/second
[root@fstest4 fstest4.fstestnet]# /root/createmany -o ./file_$i 100000
- created 10000 (time 1509394647.35 total 0.75 last 0.75)
- created 20000 (time 1509394647.89 total 1.28 last 0.54)
- created 30000 (time 1509394648.46 total 1.86 last 0.58)
- created 40000 (time 1509394648.96 total 2.35 last 0.49)
- created 50000 (time 1509394649.51 total 2.90 last 0.55)
- created 60000 (time 1509394650.07 total 3.46 last 0.56)
- created 70000 (time 1509394650.79 total 4.19 last 0.72)
- created 80000 (time 1509394681.26 total 34.66 last 30.47)
- created 90000 (time 1509394681.63 total 35.03 last 0.37)
total: 100000 creates in 35.50 seconds: 2816.76 creates/second
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
These variants will unconditionally overwrite any existing cached
items, making them appropriate for use with CW locked inode index
items.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
This is a bit trickier than just dropping in a CW holders count.
dlmglue was comparing levels with a simple greater-than or less-than check.
Since CW locks are not compatible with PR or EX, this check breaks down.
Instead we provide a function which can tell us whether a conversion to a
given lock level is compatible (cache-wise) with the level we have.
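To sketch the shape of that check (a hedged, user-space illustration; the
helper name and layout are made up, not the actual dlmglue code):

/* Illustrative only: mode names mirror the DLM's, the helper is invented. */
enum mode { MODE_NL, MODE_CR, MODE_CW, MODE_PR, MODE_EX };

/* Does the cache built under "held" remain usable at "wanted"? */
static int cache_compatible(enum mode held, enum mode wanted)
{
        /* identical modes always share a cache */
        if (held == wanted)
                return 1;
        /* CW is incompatible with both PR and EX, so a simple level
         * comparison can no longer express compatibility */
        if (held == MODE_CW || wanted == MODE_CW)
                return 0;
        /* otherwise the old ordering still holds: a stronger held mode
         * covers a weaker wanted mode */
        return held > wanted;
}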
We also have some slightly more complicated logic in downconvert. As a
result we update the helper that dlmglue uses to choose a downconvert level.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
dlmglue does some holder checks that can become unwieldy, especially
with the upcoming CW patch. Put them in a helper function.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
I accidentally left this off with the initial dlmglue commit. I enabled
it here so that I could see our CW locks happening in real time. We
don't print the lock name yet but that will be remedied in a future patch.
Turning this on gives us a debugfs file,
/sys/kernel/debug/scoutfs/<fsid>/locking_state
which exports the full lock state to userspace. The information
exported on each lock is extensive. The export includes each lock's name,
level, blocking level, request state, flags, etc. We also get a count of
lock attempts and failures for each level (cw, pr, ex). In addition, we
get the total time and max time waited on a given lock request.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
I noticed while working on other code that we weren't trying to
free potentially allocated btree iter keys if one of them saw an
allocation failure.
Signed-off-by: Zach Brown <zab@versity.com>
The augmenting of the btree to track items with bits set was too fiddly
for its own good. We were able to migrate old btree blocks with a simple
stored key instead, which also fixed livelocks that arose as the parent
and item bits got out of sync. The bit tracking is now unused buggy code
that can be removed.
Signed-off-by: Zach Brown <zab@versity.com>
The bit tracking code was a bit much (HA). It introduced a lot of
complexity just to provide a way to migrate blocks from the old
half of the ring into the current half of the ring.
We can get rid of a ton of code and potential for bugs if we simply
store a persistent migration key in the super and use it to
sweep the tree looking for old blocks to dirty. A simple tree walk
that dirties and returns the next key is all we need.
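A toy model of the sweep (purely illustrative; in scoutfs the key is
stored in the super and the walk happens in the btree code):

/* Toy sweep: blocks remember which half of the ring last wrote them and
 * the only persistent state is the migration "key" (an index here). */
#define NR_BLOCKS 8

struct toy_ring {
        int written_by[NR_BLOCKS];      /* which half wrote each block */
        int current_half;
        int migration_key;              /* persistent resume point */
};

/* Dirty the next old block at or after the stored key and advance the
 * key; returns 0 once the sweep has walked past the end of the tree. */
static int toy_migrate_one(struct toy_ring *r)
{
        int i;

        for (i = r->migration_key; i < NR_BLOCKS; i++) {
                if (r->written_by[i] != r->current_half) {
                        r->written_by[i] = r->current_half;     /* "dirty" */
                        r->migration_key = i + 1;               /* resume here */
                        return 1;
                }
        }
        r->migration_key = 0;   /* nothing old left */
        return 0;
}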
Signed-off-by: Zach Brown <zab@versity.com>
scoutfs_xattr_set() refreshes the cached inode item with its current vfs
inode. It has to refresh the vfs inode as it acquires the lock before it
can treat that vfs inode as current.
Signed-off-by: Zach Brown <zab@versity.com>
Expand the generic lock tracing event to trace the level and holders,
add an event for acquiring a lock, and switch the invalidation event
over to using the lock class.
Signed-off-by: Zach Brown <zab@versity.com>
We had callers using the initialization macro, but it just didn't do
anything. Trying to delete one of the resulting uninitialized entries
triggered a bug. fsx-mpi tripped over this on shutdown after seeing a
consistency error.
Signed-off-by: Zach Brown <zab@versity.com>
fsx-mpi spins, creating contention between EX lock holders on different
nodes. It was tripping assertions in item invalidation as it tried to
invalidate dirty items. Tracing showed that we were allowing holders of
locks while we were invalidating. Our invalidation function would
commit the current transaction, another task would hold the lock and
dirty an item, and then invalidation would continue on and try to
invalidate the dirty item. The invalidation code has always assumed
that it's not running concurrently with item dirtying.
The recursive locking change allowed acquiring blocked locks if the
recursive flag was set. It'd then check holders after calling
downconvert_worker (invalidation for us) and retry the downconvert if a
holder appeared. Because it allowed recursive holders regardless of who
was already holding the lock, holders could arrive once downconvert
started on the blocked lock. Not only did this create our problem with
invalidation, it could also leave items behind if the holder dirtied an
item and dropped the lock after invalidation but before downconvert
checked the holders again.
The fix is to only allow recursive holders on blocked locks that already
have holders. This ensures that holders will never increase past zero
on blocked locks. Once the downconvert sees the holders drain it will
call invalidation which won't have racing dirtiers. We can remove the
holder check after invalidation entirely.
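The heart of the check, as a hedged sketch (the struct and names here are
made up, not the dlmglue fields):

/* Only admit a recursive holder if the blocked lock already has holders;
 * that way holders can never climb back above zero once the downconvert
 * has seen them drain, and invalidation runs without racing dirtiers. */
struct toy_lock {
        int holders;    /* tasks currently holding the lock */
        int blocked;    /* a conversion to a lower level is pending */
};

static int may_hold_recursive(struct toy_lock *lk)
{
        return !lk->blocked || lk->holders > 0;
}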
With this fixed fsx-mpi no longer tries to invalidate dirty items as it
bounces locks back and forth.
Signed-off-by: Zach Brown <zab@versity.com>
We weren't setting the new flag in the mapped buffer head. This tells
the caller that the buffer is newly allocated and needs to be zeroed.
Without this we expose unwritten newly allocated block contents.
fsx found this almost immediately. With this fixed fsx passes.
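For reference, the shape of the fix in a get_block style callback (a
sketch with illustrative names, not the literal scoutfs change):

#include <linux/buffer_head.h>

/* Sketch of a get_block callback reporting a newly allocated block. */
static int example_get_block(struct inode *inode, sector_t iblock,
                             struct buffer_head *bh, int create)
{
        sector_t blkno = 0;             /* stand-in for the real mapping */
        bool allocated = false;         /* set when we allocate a new block */

        /* ... look up or allocate blkno for iblock ... */

        map_bh(bh, inode->i_sb, blkno);
        if (allocated)
                set_buffer_new(bh);     /* tell callers to zero the new block */
        return 0;
}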
Signed-off-by: Zach Brown <zab@versity.com>
Simple attr changes are mostly handled by the VFS; we just have to mirror
them into our inode. Truncates are done in a separate set of transactions.
We use a flag to indicate an in-progress truncate. This allows us to
detect and continue the truncate should the node crash.
Index locking is a bit complicated, so we add a helper function to grab
index locks and start a transaction.
With this patch we now pass the following xfstests:
generic/014
generic/101
generic/313
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
Call it scoutfs_inode_index_try_lock_hold since it may fail and unwind
as part of normal (not an error) operation. This lets us re-use the
name in an upcoming patch.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
Renaming a dir between parents and clobbering an existing empty dir
wasn't correctly updating the parent link counts. Updating parent link
counts when dirs are moved between parents is an independent operation
from decreasing the link count of an existing victim target of the
rename.
Signed-off-by: Zach Brown <zab@versity.com>
We only set the .getattr method to our locked getattr filler for regular
files. Set it for all file types so that stat, etc., will see the current
inode.
Signed-off-by: Zach Brown <zab@versity.com>
The fs/dlm code has a harmless but unannotated inversion between
connection and socket locking that triggers during shutdown and disables
lockdep. We don't want it to mask our warnings during testing that may
happen after the first shared unmount so we disable lockdep around the
dlm shutdown. It's not ideal but then neither are distro kernels that
ship with lockdep warnings.
Signed-off-by: Zach Brown <zab@versity.com>
The xattr trans reservation assumed that it was only dirtying items for
the new xattr size. It didn't account for the deletion items dirtied for
the parts of a larger previous xattr.
With this fixed generic/070 no longer triggers warnings.
Signed-off-by: Zach Brown <zab@versity.com>
Add a network greeting message that's exchanged between the client and
server on every connection to make sure that we have the correct file
system and format hash.
Signed-off-by: Zach Brown <zab@versity.com>
Calculate the hash of format.h and ioctl.h and make sure the hash stored
in the super during mkfs matches our calculated hash on mount.
Signed-off-by: Zach Brown <zab@versity.com>
mkfs needs to know the size of the largest btree when figuring out how
big to make the ring. That means knowing how few items we can have in
parent blocks, and to know that it needs to know how empty the blocks
can get.
Signed-off-by: Zach Brown <zab@versity.com>
We're going to be strictly enforcing matching format.h and ioctl.h
between userspace and kernel space. Let's get the exported kernel
function definition out of ioctl.h.
Signed-off-by: Zach Brown <zab@versity.com>
All the item ops now know the limit of the items they're allowed to read
into the cache. Warn if someone asks to read items without knowing
how much they're allowed to read based on their lock coverage.
Signed-off-by: Zach Brown <zab@versity.com>
scoutfs_item_create() hasn't been working with lock coverage. It
wouldn't return -ENOENT if it didn't have the lock cached. It would
create items outside lock coverage so they wouldn't be invalidated and
re-read if another node modified the item.
Add a lock arg and teach it to populate the cache so that it's correctly
consistent.
Signed-off-by: Zach Brown <zab@versity.com>
The lock name comparison had a typo where it didn't compare the second
fields between the two names. Only inode index items used the second
field. This bug could cause lock matching when the names don't match
and trigger lock coverage warnings.
While we're in there don't rely so heavily on readers knowing the
relative precedence of subtraction and (magical gcc empty) ternary
operators.
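A hedged sketch of the fixed comparison, with made-up field names and
explicit conditionals instead of leaning on subtraction and ternary
precedence:

struct toy_lock_name {
        unsigned long long first;
        unsigned long long second;
};

/* returns <0, 0, >0; both fields participate in the comparison */
static int toy_cmp_lock_names(const struct toy_lock_name *a,
                              const struct toy_lock_name *b)
{
        if (a->first != b->first)
                return a->first < b->first ? -1 : 1;
        if (a->second != b->second)
                return a->second < b->second ? -1 : 1;
        return 0;
}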
Signed-off-by: Zach Brown <zab@versity.com>
Add lock coverage for inode index items.
Sadly, this isn't trivial. We have to predict the value of the indexed
fields before the operation to lock those items. One value in
particular we can't reliably predict: the sequence of the transaction we
enter after locking. Also operations can create an absolute ton of
index item updates -- rename can modify nr_inodes * items_per_inode * 2
items, so maybe 24 today. And these items can be arbitrarily positioned
in the key space.
So to handle all this we add functions to gather the predicted item values
we'll need to lock, sort and lock them all, and then pass the appropriate
locks down to the item functions during inode updates.
The trickiest bit of the index locking code is having to retry if the
sequence number changes. Preparing locks has to guess the sequence
number of its upcoming trans and then make item update decisions based
on that. If we enter the trans and get a different sequence number then
we need to back off and retry with the correct sequence number (we may
find that we now need to update the indexed meta seq and need to have it
locked).
The use of the functions is straightforward. Sites figure out the
predicted sizes, lock, pass the locks to inode updates, and unlock.
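Roughly, the call sites end up with this shape (a hedged sketch; every
name below is made up rather than the actual scoutfs API):

/* Prepare-and-retry loop around index item locking; all names invented. */
struct toy_ctx {
        unsigned long long trans_seq;   /* set when we enter the trans */
};

extern unsigned long long guess_next_seq(struct toy_ctx *c);
extern void lock_index_items_for_seq(struct toy_ctx *c, unsigned long long seq);
extern void unlock_index_items(struct toy_ctx *c);
extern void enter_trans(struct toy_ctx *c);
extern void exit_trans(struct toy_ctx *c);
extern void update_items(struct toy_ctx *c);

static void toy_locked_update(struct toy_ctx *c)
{
        unsigned long long seq = guess_next_seq(c);

        for (;;) {
                lock_index_items_for_seq(c, seq);       /* predicted values */
                enter_trans(c);
                if (c->trans_seq == seq)
                        break;
                /* guessed wrong: back off and redo the lock set with the
                 * sequence number we actually got */
                seq = c->trans_seq;
                exit_trans(c);
                unlock_index_items(c);
        }
        update_items(c);        /* locks are passed down to the item calls */
        exit_trans(c);
        unlock_index_items(c);
}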
While we're at it we replace the individual item field tracking
variables in the inode info with an array of indexed values. The code
ends up a bit nicer. It also gets rid of the indexed time fields that
were left behind and were unused.
It's worth noting that we're getting exclusive locks on the index
updates. Locking the meta/data seq updates results in complete global
serialization of all changes. We'll need concurrent writer locks to get
concurrency back.
Signed-off-by: Zach Brown <zab@versity.com>
Use per_task storage on the inode to pass locks from high level read and
write lock holders down into the callbacks that operate under the locks
so that the locks can then be passed to the item functions.
Signed-off-by: Zach Brown <zab@versity.com>
Add some functions for storing and using per-task storage in a list.
Callers can use this to pass pointers to children in a given scope when
interfaces don't allow for passing individual arguments amongst
concurrent callers in the scope.
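A user-space analogue of the idea (hedged; the names and the pthread
keying are illustrative, not the kernel implementation):

#include <pthread.h>
#include <stddef.h>

/* Entries are tagged with the owning task and chained on a list hanging
 * off some shared object; a callee later in the same scope looks up the
 * pointer its caller stashed.  List locking is omitted for brevity. */
struct per_task_entry {
        struct per_task_entry *next;
        pthread_t task;
        void *ptr;
};

struct per_task_list {
        struct per_task_entry *head;
};

static void per_task_add(struct per_task_list *l,
                         struct per_task_entry *ent, void *ptr)
{
        ent->task = pthread_self();
        ent->ptr = ptr;
        ent->next = l->head;
        l->head = ent;
}

static void *per_task_get(struct per_task_list *l)
{
        struct per_task_entry *ent;

        for (ent = l->head; ent; ent = ent->next)
                if (pthread_equal(ent->task, pthread_self()))
                        return ent->ptr;
        return NULL;
}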
Signed-off-by: Zach Brown <zab@versity.com>
Add a full lock argument to scoutfs_update_inode_item() and use it to
pass the lock's end key into item_update(). This'll get changed into
passing the full lock into _update soon.
Signed-off-by: Zach Brown <zab@versity.com>
Add the full lock argument to _item_dirty() so that it can verify lock
coverage in addition to limiting item cache population to the range
covered by the lock.
This also ropes in scoutfs_dirty_inode_item(), which is a thin wrapper
around _item_dirty().
Signed-off-by: Zach Brown <zab@versity.com>
Add the full lock argument to _item_next*() so that it can verify lock
coverage in addition to limiting item cache population to the range
covered by the lock.
Signed-off-by: Zach Brown <zab@versity.com>
Orphan processing only works with orphans on its node today. Protect
that orphan item use with the node_id lock.
Signed-off-by: Zach Brown <zab@versity.com>
Add cluster lock coverage to scoutfs_data_truncate_items() and plumb the
lock down into the item functions.
Signed-off-by: Zach Brown <zab@versity.com>
Let's give the item functions the full lock so that they can make sure
that the lock has coverage for the keys involved in the operation.
This _lookup*() conversion comes first, so it adds the
lock_coverage() helper.
Signed-off-by: Zach Brown <zab@versity.com>