We have a bug filed where the fs got stuck spinning in
scoutfs_dir_get_backref_path(). There have been enough changes lately
that we're not sure whether this issue still exists. Catch the case where
we spin through an excessive number of iterations of that loop and exit
with some debug info.
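The guard amounts to something like the sketch below (the cap and names
are placeholders, not the real dir.c symbols):

  #include <linux/kernel.h>
  #include <linux/errno.h>
  #include <linux/printk.h>

  /* illustrative only: bail out of the backref walk once it gets silly */
  #define BACKREF_MAX_ITER 100000

  static int backref_check_iter(unsigned long *iter, u64 ino, u64 dir_ino)
  {
          if (++(*iter) <= BACKREF_MAX_ITER)
                  return 0;

          printk(KERN_ERR "scoutfs: backref path stuck: ino %llu dir %llu after %lu iterations\n",
                 ino, dir_ino, *iter);
          return -ELOOP;
  }

  /* called at the top of each pass through the backref loop:
   *         ret = backref_check_iter(&iter, ino, dir_ino);
   *         if (ret)
   *                 goto out;
   */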
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
If scoutfs_unlock() sees that it isn't the last task using a lock it
just returns. It doesn't unlock the lock and it doesn't drop the lock
refcnt and users.
This leaks the lock refcnt and users because find_alloc_scoutfs_lock()
always increments them when it finds a lock. Inflated counts will stop
the shrinker from freeing the locks and eventually the counts will wrap
and could cause locks to be freed while they're still in use.
We can either always drop the refcnt/users in unlock or we can drop them
in lock as we notice that our task already has the lock. I chose to
have the task ref hold one refcnt/users which are only dropped as the
final task unlocks.
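In outline that looks like the following simplified model (invented
names, not the real scoutfs structures): the per-task ref pins exactly
one refcnt/users on its first acquire and drops that pin only on the
task's final unlock.

  #include <linux/atomic.h>

  struct demo_lock {
          atomic_t refcnt;
          atomic_t users;
  };

  struct demo_task_ref {
          struct demo_lock *lock;
          unsigned int nested;    /* acquires by this task */
  };

  static void demo_lock_acquire(struct demo_task_ref *ref, struct demo_lock *lck)
  {
          if (ref->nested++ == 0) {
                  /* first acquire by this task pins the lock once */
                  atomic_inc(&lck->refcnt);
                  atomic_inc(&lck->users);
                  ref->lock = lck;
          }
  }

  static void demo_lock_release(struct demo_task_ref *ref)
  {
          if (--ref->nested == 0) {
                  /* the final unlock for this task drops that one pin */
                  atomic_dec(&ref->lock->users);
                  atomic_dec(&ref->lock->refcnt);
                  ref->lock = NULL;
          }
  }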
Signed-off-by: Zach Brown <zab@versity.com>
Add a file for showing the scoutfs_lock struct contents. This is the
layer above the detailed dlmglue/dlm info provided in the existing
"locking_state" file.
Signed-off-by: Zach Brown <zab@versity.com>
It samples fields that are only consistent under the lock. We also want
to see the fields every time it rechecks the conditions that stop it
from downconverting.
Signed-off-by: Zach Brown <zab@versity.com>
We weren't invalidating our cache before freeing locks due to memory
pressure. This would cause stale data on the node which originally held
the lock. Fix this by firing a callback from dlmglue before we free a
lock from the system. On the scoutfs side, the callback is wired to
call our invalidate function. This will ensure that the right data and
metadata hit disk before another node is allowed to acquire that lock.
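The wiring is roughly the following sketch (the callback type and names
are assumptions for illustration, not the actual dlmglue interface):

  /* assumed hook: dlmglue calls this just before it frees a lock */
  typedef void (*lock_free_cb_t)(void *fs_data, void *lock_data);

  struct demo_lockres {
          void *fs_data;
          void *lock_data;
          lock_free_cb_t before_free;
  };

  /* dlmglue side: run the fs hook before tearing the lock down */
  static void demo_lockres_free(struct demo_lockres *res)
  {
          if (res->before_free)
                  res->before_free(res->fs_data, res->lock_data);

          /* ... actual lockres teardown elided ... */
  }

  /* scoutfs side: the hook points at our invalidation, which writes back
   * the dirty data/metadata covered by the lock and drops cached items so
   * another node sees current state when it acquires the lock */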
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
We have a corruption that can happen when a lock is reclaimed while its
cache is still dirty. Detect this corruption by placing a trigger in
statfs which fires off lock reclaim. Statfs is nice because for scoutfs
it's lockless, which means there should not be any references on locks
when the trigger is fired.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
We use a new event callback in dlmglue so that scout has a chance to keep
some per-lock-type counters. I included the most important dlmglue
events - basically those which can cost us network or disk traffic.
Right now scout just counts downconvert events since those are the
most interesting to us. We also just count on the ino and index locks
for now.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
We don't have strict consistency protocols protecting the "physical"
caches that hold btree blocks and segments. We have metadata that tells
a reader that it's hit a stale cached entry and needs to invalidate and
read the current version from the media.
This implements the retrying. If we get stale sequence numbers in
segments or btree blocks we invalidate them from the cache and return
-ESTALE.
This can only happen when reading structures that could have been
modified remotely. This means btree reads in the clients and segment
reads for everyone. btree reads on the server are always consistent
because it is the only writer.
Adding retrying to item reading and compaction catches all of these
cases.
Stale reads are triggered by transient inconsistency, but they could also
be caused by real corruption in persistent media. Callers need to be
careful to turn their retries into hard errors if the staleness persists.
Item
reading can do this because it knows the btree root seq that anchored
the walk. Compaction doesn't do this today. That gets addressed in a
big sweep of error handling at some point in the not too distant future.
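Item reading's use of the retry then looks roughly like the sketch below
(the helper names are invented): retry on -ESTALE, but if the btree root
seq that anchored the walk hasn't changed, treat the staleness as a hard
error.

  #include <linux/errno.h>
  #include <linux/fs.h>
  #include <linux/types.h>

  /* assumed helpers: read_items() returns -ESTALE after invalidating the
   * stale cached blocks/segments it hit, root_seq() returns the btree
   * root seq that anchored the walk */
  u64 root_seq(struct super_block *sb);
  int read_items(struct super_block *sb, u64 seq);

  static int read_items_retrying(struct super_block *sb)
  {
          bool retried = false;
          u64 stale_seq = 0;
          u64 seq;
          int ret;

          for (;;) {
                  seq = root_seq(sb);
                  ret = read_items(sb, seq);
                  if (ret != -ESTALE)
                          return ret;

                  /*
                   * -ESTALE again without the root advancing means the
                   * staleness is persistent corruption rather than a
                   * racing writer, so turn it into a hard error.
                   */
                  if (retried && seq == stale_seq)
                          return -EIO;
                  stale_seq = seq;
                  retried = true;
          }
  }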
Signed-off-by: Zach Brown <zab@versity.com>
I wanted to add a sysfs file that exports the fsid for the mount of a
given device. But our use of sysfs was confusing and spread through
super.c and counters.c.
This moves the core of our sysfs use to sysfs.c. Instead of defining
the per-mount dir as a kset we define it as an object with attributes
which gives us a place to add an fsid attribute.
The counters still have their own whack of sysfs implementation. We'll
let them keep it for now, but we could move it into sysfs.c; it's just
counter iteration around the insane sysfs obj/attr/type nonsense. For now
it just needs to know to add its counters dir as a child of the per-mount
dir instead of adding it to the kset.
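The per-mount object itself follows the usual embedded-kobject pattern,
roughly like this sketch (names are illustrative, and default_attrs
matches the older kernels we target; newer kernels use default_groups):

  #include <linux/kobject.h>
  #include <linux/sysfs.h>

  struct mount_sysfs {
          struct kobject kobj;
          u64 fsid;
  };

  static ssize_t mount_attr_show(struct kobject *kobj, struct attribute *attr,
                                 char *buf)
  {
          struct mount_sysfs *msf = container_of(kobj, struct mount_sysfs, kobj);

          /* only one attribute so far, so no need to switch on attr */
          return snprintf(buf, PAGE_SIZE, "%016llx\n", msf->fsid);
  }

  static struct attribute fsid_attr = {
          .name = "fsid",
          .mode = 0444,
  };

  static struct attribute *mount_attrs[] = {
          &fsid_attr,
          NULL,
  };

  static const struct sysfs_ops mount_sysfs_ops = {
          .show = mount_attr_show,
  };

  static void mount_kobj_release(struct kobject *kobj)
  {
          /* lifetime is tied to the mount's sb info in the real code */
  }

  static struct kobj_type mount_ktype = {
          .release = mount_kobj_release,
          .sysfs_ops = &mount_sysfs_ops,
          .default_attrs = mount_attrs,
  };

  /* registered with something like:
   *   kobject_init_and_add(&msf->kobj, &mount_ktype, parent, "%llx", fsid);
   */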
Signed-off-by: Zach Brown <zab@versity.com>
Clean up the counter definition macro. Sort the entries and clean up
whitespace so that adding counters in the future will be more orderly
and satisfying.
Signed-off-by: Zach Brown <zab@versity.com>
We weren't using the right string macros in the recent lock traces, fix
that. Also osb->cconn->cc_name is NULL terminated so we don't need to
keep the string length around.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
We can use the excellent code in counters.h to easily place a whole set
of useful counters in dlmglue:
- one for every kind of wait in cluster_lock (blocked, busy, etc)
- one for each type of dlm operation (lock/unlock requests,
converts, etc)
- one for each type of downconvert (cw/pr/ex)
These will give us a decent idea of the amount and type of lock traffic a
given node is seeing.
In addition, we add a second trace at the bottom of invalidate_caches.
By turning on both traces in invalidate_caches, we can look at our trace
log to see how long a given lock's downconvert took.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
We've yet to really wire up the eventual consistency of btree ring
blocks and segments. The btree block reading code has long had a warning
that fires if it sees stale blocks (which we've yet to hit), but we have
no such warning in the segment code. If we hit stale
segments we could have very unpredictable results. So let's add a quick
warning to highlight the case to save us heartache if we hit it before
implementing full retrying.
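Something like this sketch (invented names; the real check sits in the
segment reading path): compare the cached segment's seq with what the
caller's metadata says it should be and warn once when they disagree.

  #include <linux/bug.h>
  #include <linux/types.h>

  static bool seg_seq_is_stale(u64 cached_seq, u64 expected_seq)
  {
          if (cached_seq == expected_seq)
                  return false;

          WARN_ONCE(1, "scoutfs: stale cached segment: seq %llu expected %llu\n",
                    cached_seq, expected_seq);
          return true;
  }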
Signed-off-by: Zach Brown <zab@versity.com>
The cluster_lock and cluster_unlock traces are close to each other but not
quite identical, so they have to be two different trace events (thanks,
tracepoints!).
The rest (ocfs2_unblock_lock, ocfs2_simple_drop_lock) can use a shared trace
class.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
We had disabled ocfs2_log_dlm_error() during the initial import.
Re-enable it so the kernel can log dlm errors. One problem is that our
binary lock names don't lend themselves to legible printing. Add a buffer to
the lockres to hold a pretty-printed version of the lock name. We fill
it from the ->print callback.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
The versity/rpm-build container has all the bits that scout needs along
with our tooling for building RPMs. Switching allows us to start adding
rpm builds soon. This also picks up sparse and other nice bits that we
are now iterating on in a separate repository from the original omnibus
versity docker repository.
We walk the list of dentries in subdirs on lock invalidation. This can
be a large number so we were trying to back off and give other tasks a
chance to schedule and other processes a chance to grab the parent lock
while we were iterating.
The method for backing off saved our position in the list by getting a
reference on a child dentry. It dropped that reference after resuming
iteration.
But it dropped the reference while holding the parent's lock. This is a
deadlock if the put tries to finally remove the dentry because it's been
unhashed. We saw this deadlock in practice; the crash dump showed us in
the final dentry_kill with the parent locked.
Let's just get rid of this premature optimization entirely. Both memory
pressure and site logistics will tend to keep child lists in parents
reasonably small. A CPU can burn through the locks and list entries for
quite a few entries before anything will notice. We can revisit the hot
spot later if it bubbles to the surface.
Signed-off-by: Zach Brown <zab@versity.com>
Turns out the server wasn't explicitly unlocking the listen lock! This
ended up working because we only shut down an active server on unmount
and unmount will tear down the lock space which will drop the still held
listen lock.
That's just dumb.
But it also forced using an awkward lock flag to avoid setting up a task
ref for the lock hold which wouldn't have been torn down otherwise. By
adding the explicit unlock we restore balance to the force and can get rid
of that
flag.
Cool, cool, cool.
Signed-off-by: Zach Brown <zab@versity.com>
Today we use unconditional dentry revalidation to provide directory
entry consistency. Any time the vfs tries to use a cached dentry we
tell it to drop it and perform a lookup. This hits our item cache which
is kept consistent by the locks.
This would just be a waste of cpu if it weren't for how heavyweight the
vfs revalidation->lookup path is here. It doesn't just invalidate the
entry; it uses shrink_dcache_parent() to drop all the cached entries in
the subtree rooted at the cached entry.
We saw 22 second long cpu livelocks in this shrink_dcache_parent() when
creating and archiving empty files.
Instead let's let the vfs use dcache entries. We only invalidate them as
we're dropping the lock that covers them. (Today coarse inode locks
cover all the entries in batches of inodes.) We can use d_drop() to
remove entries from the cache to stop them from satisfying lookup
without trying to free all the dentries under them.
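The invalidation pass over a parent then reduces to unhashing its
children, roughly like the sketch below (simplified; the child list field
is d_child or d_u.d_child depending on kernel version, and the real code
handles more than this):

  #include <linux/dcache.h>
  #include <linux/list.h>
  #include <linux/spinlock.h>

  static void drop_child_dentries(struct dentry *parent)
  {
          struct dentry *child;

          spin_lock(&parent->d_lock);
          list_for_each_entry(child, &parent->d_subdirs, d_child) {
                  spin_lock_nested(&child->d_lock, DENTRY_D_LOCK_NESTED);
                  /* unhash so future lookups miss and fall through to our
                   * item cache, without freeing the subtree under the child */
                  __d_drop(child);
                  spin_unlock(&child->d_lock);
          }
          spin_unlock(&parent->d_lock);
  }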
Signed-off-by: Zach Brown <zab@versity.com>
Hoist the per-inode invalidation up into a function because we're about
to add invalidating dentries in parent directories. This should result
in no functional change.
Signed-off-by: Zach Brown <zab@versity.com>
We were trying to tear down our mounted file system resources in the
->kill_sb() callback. This happens relatively early in the unmount
process. We call kill_block_super() in our teardown which syncs the
mount and tears down the vfs structures. By tearing down in ->kill_sb()
we were forced to juggle tearing down before and after the call to
kill_block_super().
When we got that wrong we'd tear down too many resources and crash in
kill_block_super() or we wouldn't tear down enough and leave work still
pending that'd explode as we tried to shut down after
kill_block_super().
It turns out the vfs has a callback specifically to solve this ordering
problem. The put_super callback is called after the mount has been synced
but before it's totally torn down. By putting all our shutdown in there
we no longer have to worry about racing with active use.
Auditing the shutdown dependencies also found some bad cases where we
were tearing down subsystems that were still in use. The biggest
problem was shutting down locking and networking before shutting down
the transaction processing which relies on both. Now we first shut
down all the client processing, then all the server processing, then the
lowest level common infrastructure.
The trickiest part in understanding this is knowing that
kill_block_super() only calls put_super during mount failure if mount
got far enough to assign the root dentry to s_root. We call put_super
manually ourselves in mount failure if it didn't get far enough so that
all teardown goes through put_super. (You'll see this s_root test in
other upstream file system error paths.)
Finally, while auditing the setup and shutdown paths I noticed a few
subsystems, trans and counters, that needed simple fixes to properly
clean up on errors and only shut down if they'd been set up.
This all was stressed with an xfstests test that races mount and unmount
across the cluster. Before this change it'd crash/hang almost instantly
and with this change it runs to completion.
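In outline the result looks like the sketch below (teardown calls elided;
the function names and the s_fs_info guard are illustrative): everything
runs from ->put_super, and ->kill_sb only calls it by hand when mount
failed before s_root was assigned.

  #include <linux/fs.h>

  static void scoutfs_put_super(struct super_block *sb)
  {
          /* client processing first, then server processing, then the
           * lowest level shared infrastructure (locking, net, etc) */
          /* ... subsystem shutdown calls elided ... */
  }

  static void scoutfs_kill_sb(struct super_block *sb)
  {
          /*
           * kill_block_super() only calls ->put_super if mount got far
           * enough to assign s_root, so call it ourselves for early
           * mount failures and keep all teardown in one path.
           */
          if (!sb->s_root && sb->s_fs_info)
                  scoutfs_put_super(sb);
          kill_block_super(sb);
  }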
Signed-off-by: Zach Brown <zab@versity.com>
This replaces the fragile recursive locking logic in dlmglue. In particular
that code fails when we have a pending downconvert and a process comes in
for a level that's compatible with the existing level. The downconvert will
still happen which causes us to now believe we are holding a lock that we
are not! We could go back to checking for holders that raced our downconvert
worker but that had problems of its own (see commit e8f7ef0).
Instead of trying to infer from lock state what we are allowed to do, let's
be explicit. Each lock now has a tree of task refs. If you come in to
acquire a lock, we look for our task in that tree. If it's not there, we
know this is the first time this task wanted that lock, so we can continue.
Otherwise we increment a count on the task ref and return the already
locked lock. Unlock does the opposite - it finds the task ref and decreases
the count. On zero it will proceed with the actual unlock.
The owning task is the only process allowed to manipulate a task ref, so we
only have to lock manipulation of the tree. We make an exception for
global locks which might be unlocked from another process context (in this
case that means the node id lock).
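The lookup itself is a plain rbtree keyed by the task, roughly as in this
sketch (struct and field names invented for illustration):

  #include <linux/rbtree.h>
  #include <linux/sched.h>

  struct task_ref {
          struct rb_node node;
          struct task_struct *task;       /* rbtree key */
          unsigned int count;             /* nested acquires by this task */
  };

  /* caller holds whatever protects the tree */
  static struct task_ref *task_ref_find(struct rb_root *root,
                                        struct task_struct *task)
  {
          struct rb_node *n = root->rb_node;

          while (n) {
                  struct task_ref *ref = rb_entry(n, struct task_ref, node);

                  if (task < ref->task)
                          n = n->rb_left;
                  else if (task > ref->task)
                          n = n->rb_right;
                  else
                          return ref;
          }
          return NULL;
  }

  /* on lock: a found ref means we already hold it, just count the acquire;
   * on unlock: --count, and only on zero do the real dlm unlock */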
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
We can't have locks with keys that overlap. This adds an rbtree of
locks that are sorted by their key range so that we can find out if we
create overlapping locks before they cause item cache consistency
problems.
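The check at insert time is just interval intersection during the rbtree
descent, something like this sketch (invented names; the real code
compares scoutfs keys rather than u64s):

  #include <linux/rbtree.h>
  #include <linux/types.h>

  struct range_lock {
          struct rb_node node;
          u64 start;      /* stand-in for the real scoutfs key type */
          u64 end;
  };

  /*
   * Because existing locks never overlap, ordering by start also orders
   * by end, so a single descent finds any intersection with a new range.
   */
  static struct range_lock *find_overlap(struct rb_root *root,
                                         u64 start, u64 end)
  {
          struct rb_node *n = root->rb_node;

          while (n) {
                  struct range_lock *rl = rb_entry(n, struct range_lock, node);

                  if (end < rl->start)
                          n = n->rb_left;
                  else if (start > rl->end)
                          n = n->rb_right;
                  else
                          return rl;      /* ranges intersect: bug */
          }
          return NULL;
  }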
Signed-off-by: Zach Brown <zab@versity.com>
Add some _sk suffix variants of the message printing calls so that we
can use per-cpu key buffer arguments without the full SK_PCPU() wrapper.
Signed-off-by: Zach Brown <zab@versity.com>
The mapping of size index item keys to lock names and key ranges was
completely bonkers. Its method of setting variable length masks could
easily create locks with different names whose key ranges overlapped.
We now map ranges of sizes to locks, and the big change is that all the
inodes within those size ranges are covered. We can't try to have groups of
inodes per size because that would result in too many full precision
size locks.
With this fix the size index item locks no longer trigger warnings that
we're creating locks with overlapping keys.
Signed-off-by: Zach Brown <zab@versity.com>
We have to map many index item keys down to a lock that then has a start
and end key range. We also use this mapping over in index item locking
to avoid trying to acquire locks multiple times.
We were duplicating the mapping calculation in these two places. This
refactors these functions to use one range calculation function. It's
going to be used in future patches to fix the mapping of the size index
items.
This should result in no functional changes.
Signed-off-by: Zach Brown <zab@versity.com>
Lock names don't have a minor. They're a unique position in
type.major.ino with ino masked to groups. Any index item is mapped to a
single lock.
But then each lock has a range of items that it covers. The index item
key still has a minor from the bad old days of indexing time. When
setting the range of keys covered by the lock name we set it to 0/~0 for
the range.
This is dead wrong because the minor is a higher priority than the inode
in the key space. By setting the minor to 0/~0 we are saying that each
lock name covers *all the minors and inodes for that major*. This is
wrong because there are multiple lock names for different inode groups
for each major. We're in effect having the different lock names
associated with ranges that all overlap.
And this is very bad because it means that a lock can cache keys that
are covered by other locks. An index item lock for a small inode can
accidentally create a negative item cache region for later inodes
covered by an entirely different lock.
We saw failures in scoutfs/500 because of this. A node trying to
read an existing item would get ENOENT because it had a false negative
cached region from an unrelated lock that overlapped with the lock
that it just acquired from a writer and was trying to read the contents
from.
The fix is to just set the minor to 0. We're not using it. This stops
the lock names with fixed majors and inode ranges from accidentally
overlapping with each other.
Signed-off-by: Zach Brown <zab@versity.com>
Add tracepoints for allocated lock structs entering the tree and finally
being freed. This gives visibility into the lifetime of locks without
using the much higher frequency per-operation tracing that blows out
other events.
Signed-off-by: Zach Brown <zab@versity.com>
This was inadvertently left out of the main CW locking commit. We
simply need to seq_print the new fields. We add them to the end of
the line, thus preserving backwards compatibility with old versions
of the debug format.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
This will give us concurrency yet still allow our ioctls to drive cache
syncing/invalidation on other nodes. Our lock_coverage() checks evolve
to handle direct dlm modes, allowing us to verify correct usage of CW
locks.
As a test, we can run createmany on two nodes at the same time, each
working in their own directory. The following commands were run on each
node:
$ mkdir /scoutfs/`uname -n`
$ cd /scoutfs/`uname -n`
$ /root/createmany -o ./file_$i 100000
Before this patch that test wouldn't finish in any reasonable amount of
time and I would kill it after some number of hours.
After this patch, we make swift progress through the test:
[root@fstest3 fstest3.site]# /root/createmany -o ./file_$i 100000
- created 10000 (time 1509394646.11 total 0.31 last 0.31)
- created 20000 (time 1509394646.38 total 0.59 last 0.28)
- created 30000 (time 1509394646.81 total 1.01 last 0.43)
- created 40000 (time 1509394647.31 total 1.51 last 0.50)
- created 50000 (time 1509394647.82 total 2.02 last 0.51)
- created 60000 (time 1509394648.40 total 2.60 last 0.58)
- created 70000 (time 1509394649.06 total 3.26 last 0.66)
- created 80000 (time 1509394649.72 total 3.93 last 0.66)
- created 90000 (time 1509394650.36 total 4.56 last 0.64)
total: 100000 creates in 35.02 seconds: 2855.80 creates/second
[root@fstest4 fstest4.fstestnet]# /root/createmany -o ./file_$i 100000
- created 10000 (time 1509394647.35 total 0.75 last 0.75)
- created 20000 (time 1509394647.89 total 1.28 last 0.54)
- created 30000 (time 1509394648.46 total 1.86 last 0.58)
- created 40000 (time 1509394648.96 total 2.35 last 0.49)
- created 50000 (time 1509394649.51 total 2.90 last 0.55)
- created 60000 (time 1509394650.07 total 3.46 last 0.56)
- created 70000 (time 1509394650.79 total 4.19 last 0.72)
- created 80000 (time 1509394681.26 total 34.66 last 30.47)
- created 90000 (time 1509394681.63 total 35.03 last 0.37)
total: 100000 creates in 35.50 seconds: 2816.76 creates/second
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
These variants will unconditionally overwrite any existing cached
items, making them appropriate for use with CW locked inode index
items.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
This is a bit trickier than just dropping in a cw holders count.
dlmglue was comparing levels by a simple greater than or less than check.
Since CW locks are not compatible with PR or EX, this check breaks down.
Instead we provide a function which can tell us whether a conversion to a
given lock level is compatible (cache-wise) with the level we have.
We also have some slightly more complicated logic in downconvert. As a
result we update the helper that dlmglue uses to choose a downconvert level.
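Conceptually the compatibility check is along these lines (a sketch using
the generic dlm mode constants; the real helper, its name, and the exact
rules in dlmglue differ):

  #include <linux/dlmconstants.h>
  #include <linux/types.h>

  /*
   * Decide whether the level we already hold covers a request for
   * "wanted" without giving up what our cache relies on.  CW shares
   * nothing cache-wise with PR or EX, so it only covers CW (or NL);
   * for NL/PR/EX a plain numeric comparison still works.
   */
  static bool level_covers(int have, int wanted)
  {
          if (have == DLM_LOCK_CW || wanted == DLM_LOCK_CW)
                  return have == wanted || wanted == DLM_LOCK_NL;

          return have >= wanted;
  }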
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
dlmglue does some holder checks that can become unwieldy, especially
with the upcoming CW patch. Put them in a helper function.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
I accidentally left this off with the initial dlmglue commit. I enabled
it here so that I could see our CW locks happening in real time. We
don't print lock name yet but that will be remedied in a future patch.
Turning this on gives us a debugfs file,
/sys/kernel/debug/scoutfs/<fsid>/locking_state
which exports the full lock state to userspace. The information
exported on each lock is extensive. The export includes each lock's name,
level, blocking level, request state, flags, etc. We also get a count of
lock attempts and failures for each level (cw, pr, ex), as well as the
total time and max time waited on a given lock request.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
I noticed while working on other code that we weren't trying to
free potentially allocated btree iter keys if one of them saw an
allocation failure.
Signed-off-by: Zach Brown <zab@versity.com>
The augmenting of the btree to track items with bits set was too fiddly
for its own good. We're now able to migrate old btree blocks with a
simple stored key, which also fixed the livelocks we hit as the parent
and item bits got out of sync. The bit tracking is now unused buggy code
that can be removed.
Signed-off-by: Zach Brown <zab@versity.com>
The bit tracking code was a bit much (HA). It introduced a lot of
complexity just to provide a way to migrate blocks from the old
half of the ring into the current half of the ring.
We can get rid of a ton of code and potential for bugs if we simply
store a persistent migration key in the super and use it to
sweep the tree looking for old blocks to dirty. A simple tree walk
that dirties and returns the next key is all we need.
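The sweep is conceptually the loop below (a sketch; the helper names and
types are invented): dirty the next old block the walk finds, persist the
key to resume from, and stop when the walk runs off the end of the tree.

  #include <linux/errno.h>
  #include <linux/fs.h>

  /* assumed helpers, for illustration only */
  struct btree_root;
  struct btree_key;
  int btree_dirty_next_old(struct super_block *sb, struct btree_root *root,
                           struct btree_key *key);
  void store_migration_key(struct super_block *sb, struct btree_key *key);

  static int migrate_some(struct super_block *sb, struct btree_root *root,
                          struct btree_key *key)
  {
          int ret;

          ret = btree_dirty_next_old(sb, root, key);
          if (ret == -ENOENT) {
                  /* sweep finished, nothing old left to migrate */
                  return 0;
          }
          if (ret < 0)
                  return ret;

          /* persist our progress so the next transaction resumes here */
          store_migration_key(sb, key);
          return 1;       /* more to do */
  }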
Signed-off-by: Zach Brown <zab@versity.com>