Commit Graph

20 Commits

Zach Brown
04660dbfee scoutfs: add scoutfs_extent_prev()
Add an extent function for iterating backwards through extents.  We add
the wrapper and have the extent IO functions call their storage _prev
functions.  Data extent IO can now call the new scoutfs_item_prev().

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
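
The commit above describes a thin dispatch wrapper.  A minimal sketch of
the shape such a wrapper could take, with purely hypothetical struct and
function names (not the actual scoutfs interfaces):

    #include <linux/types.h>

    /* hypothetical extent and storage ops, for illustration only */
    struct ext {
            u64 start;
            u64 len;
    };

    struct extent_io_ops {
            int (*next)(void *arg, struct ext *ext);
            int (*prev)(void *arg, struct ext *ext);
    };

    /* mirror of the existing _next wrapper: hand off to the storage
     * backend's own reverse iterator (items, btree, etc) */
    static int extent_prev(struct extent_io_ops *ops, void *arg,
                           struct ext *ext)
    {
            return ops->prev(arg, ext);
    }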
Zach Brown
9c74f2011d scoutfs: add server work tracing
Add some server workqueue and work tracing to chase down the destruction
of an active workqueue.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
41c29c48dd scoutfs: add extent corruption cases
The extent code was originally written to panic if it hit errors during
cleanup that resulted in inconsistent metadata.  The more reasonable
strategy is to warn about the corruption, act accordingly, and leave
it to corrective measures to resolve it.  In this case we continue
returning the error that caused us to try to clean up.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
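
As a rough illustration of the warn-and-return strategy described above
(the helper name and message text are assumptions, not the actual
scoutfs code):

    #include <linux/kernel.h>
    #include <linux/fs.h>

    /* hypothetical: cleanup after a failed extent operation hit
     * inconsistent metadata; warn instead of panicking and keep
     * returning the error that got us here */
    static int extent_cleanup_failed(struct super_block *sb, int err,
                                     int cleanup_err)
    {
            if (cleanup_err)
                    WARN_ONCE(1, "scoutfs: inconsistent extent metadata during cleanup (err %d)\n",
                              cleanup_err);

            return err;  /* the original failure, not the cleanup failure */
    }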
Zach Brown
1b3645db8b scoutfs: remove dead server allocator code
Remove the bitmap segno allocator code that the server used to use to
manage allocations.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
c01a715852 scoutfs: use extents in the server allocator
Have the server use the extent core to maintain free extent items in the
allocation btree instead of the bitmap items.

We add a client request to allocate an extent of a given length.  The
existing segment alloc and free now work with a segment's worth of
blocks.

The server maintains counters of free blocks in the super block instead
of free segments.  We maintain an allocation cursor so that allocation
results tend to cycle through the device.  It's stored in the super so
that it is maintained across server instances.

This doesn't remove the now-unused dead code, so as to keep the commit
from getting too noisy.  It'll be removed in a future commit.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
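
A simplified sketch of the cursor behaviour the message describes:
allocation searches at or after the cursor stored in the super, wraps
around once, then advances the cursor and the free block count.  All
struct and helper names here are assumptions:

    /* hypothetical cursor-based extent allocation on the server */
    static int alloc_extent(struct server_info *server, u64 len, u64 *start)
    {
            struct my_super *super = &server->super;
            int ret;

            /* find a free extent item at or after the cursor */
            ret = find_free_extent(server, super->alloc_cursor, len, start);
            if (ret == -ENOSPC)
                    /* wrap around to the start of the device once */
                    ret = find_free_extent(server, 0, len, start);
            if (ret)
                    return ret;

            remove_free_extent(server, *start, len);
            super->free_blocks -= len;           /* counted in blocks now */
            super->alloc_cursor = *start + len;  /* persists across servers */
            return 0;
    }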
Zach Brown
f3007f10ca scoutfs: shut down server on commit errors
We hadn't yet implemented any error handling in the server when commits
fail.

Commit errors are serious and we take them as a sign that something has
gone horribly wrong.  This patch prints commit error warnings to the
console and shuts down.  Clients will try to reconnect and resend their
requests.

The hope is that another server will be able to make progress.  But this
same node could become the server again and it could well be that the
errors are persistent.

The next steps are to implement server startup backoff, client retry
backoff, and hard failure policies.

Signed-off-by: Zach Brown <zab@versity.com>
2018-05-01 11:48:19 -07:00
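
The policy in the message amounts to something like the following
sketch (all names are assumed for illustration):

    /* hypothetical: a failed commit is treated as fatal to this server */
    static void commit_done(struct server_info *server, int ret)
    {
            if (ret) {
                    printk(KERN_ERR "scoutfs: server commit failed, err %d; shutting down server\n",
                           ret);
                    /* clients will reconnect and resend their requests,
                     * hopefully to a healthier server */
                    queue_work(server->wq, &server->shutdown_work);
            }
    }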
Zach Brown
24cc5cc296 scoutfs: lock manifest root request
The manifest root request processing samples the stable_manifest_root in
the server info.  The stable_manifest_root is updated after a
commit has succeeded.

The read of stable_manifest_root in request processing was locking the
manifest.  The update during commit doesn't lock the manifest so these
paths were racing.  The race is very tight, a few cpu stores, but it
could in theory give a client a malformed root that could be
misinterpreted as corruption.

Add a seqcount around the store of the stable manifest root during
commit and its load during request processing.  This ensures that
clients always get a consistent manifest root.

Signed-off-by: Zach Brown <zab@versity.com>
2018-04-27 09:06:35 -07:00
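
The store/load pairing described above is the standard seqcount
pattern; a minimal sketch, assuming field names like stable_seqcount
and a single writer (the commit path):

    #include <linux/seqlock.h>

    /* writer: publish the new stable manifest root after a commit */
    static void set_stable_root(struct server_info *server,
                                struct manifest_root *root)
    {
            write_seqcount_begin(&server->stable_seqcount);
            server->stable_manifest_root = *root;
            write_seqcount_end(&server->stable_seqcount);
    }

    /* reader: request processing retries until it sees a consistent copy */
    static void get_stable_root(struct server_info *server,
                                struct manifest_root *root)
    {
            unsigned int seq;

            do {
                    seq = read_seqcount_begin(&server->stable_seqcount);
                    *root = server->stable_manifest_root;
            } while (read_seqcount_retry(&server->stable_seqcount, seq));
    }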
Zach Brown
8061a5cd28 scoutfs: add server bind warning
Emit an error message if the server fails to bind.  It can mean that
the address is misconfigured.  But we might still be able to bind later
if the address becomes available, so we don't hard error.  We only emit
the message once for a series of failures.

Signed-off-by: Zach Brown <zab@versity.com>
2018-04-13 15:49:14 -07:00
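
A sketch of the warn-once behaviour (the bind_warned flag and the
message text are assumptions):

    /* hypothetical: only the first bind failure in a series is logged */
    static int server_bind(struct server_info *server, struct sockaddr *addr,
                           int addrlen)
    {
            int ret = kernel_bind(server->listen_sock, addr, addrlen);

            if (ret) {
                    if (!server->bind_warned) {
                            printk(KERN_ERR "scoutfs: server failed to bind, err %d; will keep retrying\n",
                                   ret);
                            server->bind_warned = true;
                    }
                    return ret;  /* soft failure: the address may show up later */
            }

            server->bind_warned = false;
            return 0;
    }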
Zach Brown
9148f24aa2 scoutfs: use single small key struct
Variable length keys lead to having a key struct point to the buffer
that contains the key.  With dirents and xattrs now using small keys we
can convert everyone to using a single key struct and significantly
simplify the system.

We no longer have a separate generic key buf struct that points to
specific per-type key storage.  All items use the key struct and fill
out the appropriate fields.  All the code that paired a generic key buf
struct and a specific key type struct is collapsed down to a key struct.
There's no longer a difference between a key buf that shares a
read-only key, has its own precise allocation, or has a max size
allocation for incrementing and decrementing.

Each key user now has an init function that fills out its fields.  It
looks a lot like the old pattern but we no longer have separate key
storage that the buf points to.

A bunch of code now takes the address of static key storage instead of
managing allocated keys.  Conversely, swapping now uses the full keys
instead of pointers to the keys.

We don't need all the functions that worked on the generic key buf
struct because they had different lengths.  Copy, clone, length init,
memcpy, all of that goes away.

The item API had some functions that tested the length of keys and
values.  The key length tests vanish, and that gets rid of the _same()
call.  The _same_min() call only had one user who didn't also test for
the value length being too large.  Let's leave caller key constraints in
callers instead of trying to hide them on the other side of a bunch of
item calls.

We no longer have to track the number of key bytes when calculating if
an item population will fit in segments.  This removes the key length
from reservations, transactions, and segment writing.

The item cache key querying ioctls no longer have to deal with variable
length keys.  They simply specify the start key, the ioctls return the
number of keys copied instead of bytes, and the caller is responsible
for incrementing the next search key.

The segment no longer has to store the key length.  It stores the key
struct in the item header.

The fancy variable length key formatting and printing can be removed.
We have a single format for the universal key struct.  The SK_ wrappers
that bracketed calls to use preempt-safe per-cpu buffers can turn back
into their normal calls.

Manifest entries are now a fixed size.  We can simply split them between
btree keys and values and initialize them instead of allocating them.
This means that level 0 entries don't have their own format that sorts
by the seq.  They're sorted by the key like all the other levels.
Compaction needs to sweep all of them looking for the oldest and read
can stop sweeping once it can no longer overlap.  This makes rare
compaction more expensive and common reading less expensive, which is
the right tradeoff.

Signed-off-by: Zach Brown <zab@versity.com>
2018-04-04 09:15:27 -05:00
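
A rough sketch of the kind of fixed-size universal key struct and
per-type init function described above (the layout, field names, and
type values are illustrative assumptions, not the real scoutfs key
format):

    #include <linux/types.h>
    #include <linux/string.h>

    /* hypothetical fixed-size key: every item type fills the same struct */
    struct my_key {
            __u8    zone;
            __u8    type;
            __le64  first;
            __le64  second;
            __le64  third;
    } __packed;

    /* per-type init fills the fields directly; there's no longer separate
     * key storage for a generic key buf to point at */
    static void init_dirent_key(struct my_key *key, u64 dir_ino, u64 hash)
    {
            memset(key, 0, sizeof(*key));
            key->type = 1;                       /* illustrative type value */
            key->first = cpu_to_le64(dir_ino);
            key->second = cpu_to_le64(hash);
    }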
Zach Brown
c76c6582f0 scoutfs: release server conn under mutex
I was rarely seeing null derefs during unmount.  The per-mount listening
scoutfs_server_func() was seeing null sock->ops as it called
kernel_sock_shutdown() to shutdown the connected client sockets.
sock_release() sets the ops to null.  We're not supposed to use a socket
after we call it.

The per-connection scoutfs_server_recv_func() calls sock_release() as it
tears down its connection.  But it does this before it removes the
connection from the listener's list.  There's a brief window where the
connection's socket has been released but is still visible on the list.
If the listener tries to shutdown during this time it will crash.

Hitting this window depends on scheduling races during unmount.  The
unmount path has the client close its connection to the server then the
server closes all its connected clients.  If the local mount is the
server then it will have recv work see an error as the client
disconnects and it will be racing to shut down the connection with the
listening thread during unmount.

I think I only saw this in my guests because they're running slower
debug kernels on my slower laptop.  The window of vulnerability while
the released socket is on the list is longer.

The fix is to release the socket while we hold the mutex and are
removing the connection from the list.  A released socket is never
visible on the list.

While we're at it don't use list_for_each_entry_safe() to iterate over
the connection list.  We're not modifying it.  This is a lingering
artifact from previous versions of the server code.

Signed-off-by: Zach Brown <zab@versity.com>
2018-02-22 14:27:01 -08:00
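
The fix amounts to making the release and the list removal one step
with respect to the listener, roughly like this sketch (struct and
field names are assumed):

    /* hypothetical: a released socket is never visible on the list */
    static void destroy_conn(struct server_info *server, struct conn *conn)
    {
            mutex_lock(&server->mutex);
            list_del_init(&conn->head);
            sock_release(conn->sock);       /* released while off the list */
            conn->sock = NULL;
            mutex_unlock(&server->mutex);

            kfree(conn);
    }

    /* the listener walks the list under the same mutex, so it only ever
     * sees sockets that are still safe to shut down */
    static void shutdown_conns(struct server_info *server)
    {
            struct conn *conn;

            mutex_lock(&server->mutex);
            list_for_each_entry(conn, &server->conn_list, head)
                    kernel_sock_shutdown(conn->sock, SHUT_RDWR);
            mutex_unlock(&server->mutex);
    }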
Zach Brown
f52dc28322 scoutfs: simplify lock use of kernel dlm
We had an excessive number of layers between scoutfs and the dlm code in
the kernel.  We had dlmglue, the scoutfs locks, and task refs.  Each
layer had structs that track the lifetime of the layer below it.  We
were about to add another layer to hold on to locks just a bit longer so
that we can avoid down conversion and transaction commit storms under
contention.

This collapses all those layers into a simple state machine in lock.c
that
manages the mode of dlm locks on behalf of the file system.

The users of the lock interface are mainly unchanged.  We did change
from a heavier trylock to a lighter nonblock lock attempt and have to
change the single rare readpage use.  Lock fields change so a few
external users of those fields change.

This not only removes a lot of code, it also contains functional
improvements.  For example, it can now convert directly to CW locks with
a single lock request instead of having to use two by first converting
to NL.

It introduces the concept of an unlock grace period.  Locks won't be
dropped on behalf of other nodes soon after being unlocked so that tasks
have a chance to batch up work before the other node gets a chance.
This can result in two orders of magnitude improvements in the time it
takes to, say, change a set of xattrs on the same file population from
two nodes concurrently.

There are significant changes to trace points, counters, and debug files
that follow the implementation changes.

Signed-off-by: Zach Brown <zab@versity.com>
2018-02-14 15:00:17 -08:00
Zach Brown
4ff1e3020f scoutfs: allocate inode numbers per directory
Having an inode number allocation pool in the super block meant that all
allocations across the mount are interleaved.  This means that
concurrent file creation in different directories will create
overlapping inode numbers.  This leads to lock contention as reasonable
work loads will tend to distribute work by directories.

The easy fix is to have per-directory inode number allocation pools.  We
take the opportunity to clean up the network request so that the caller
gets the allocation instead of having it be fed back in via a weird
callback.

Signed-off-by: Zach Brown <zab@versity.com>
2018-02-09 17:58:19 -08:00
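
A simplified sketch of a per-directory pool of the kind described
(struct layout and the batch request helper are assumptions):

    /* hypothetical: each directory caches a batch of inode numbers */
    struct ino_pool {
            u64 next_ino;
            u64 nr;
    };

    static int alloc_ino(struct inode *dir, struct ino_pool *pool, u64 *ino)
    {
            int ret;

            if (pool->nr == 0) {
                    /* ask the server for a fresh batch; the reply comes
                     * back to the caller rather than through a callback */
                    ret = request_ino_batch(dir->i_sb, &pool->next_ino,
                                            &pool->nr);
                    if (ret)
                            return ret;
            }

            *ino = pool->next_ino++;
            pool->nr--;
            return 0;
    }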
Zach Brown
ec91a4375f scoutfs: unlock the server listen lock
Turns out the server wasn't explicitly unlocking the listen lock!  This
ended up working because we only shut down an active server on unmount
and unmount will tear down the lock space which will drop the still held
listen lock.

That's just dumb.

But it also forced using an awkward lock flag to avoid setting up a task
ref for the lock hold, which wouldn't have been torn down otherwise.  By
adding the unlock we restore balance to the force and can get rid of
that flag.

Cool, cool, cool.

Signed-off-by: Zach Brown <zab@versity.com>
2017-12-08 17:00:44 -06:00
Mark Fasheh
8064a161f0 scoutfs: better tracking of recursive lock holders
This replaces the fragile recursive locking logic in dlmglue. In particular
that code fails when we have a pending downconvert and a process comes in
for a level that's compatible with the existing level. The downconvert will
still happen which causes us to now believe we are holding a lock that we
are not! We could go back to checking for holders that raced our downconvert
worker but that had problems of its own (see commit e8f7ef0).

Instead of trying to infer from lock state what we are allowed to do, let's
be explicit. Each lock now has a tree of task refs. If you come in to
acquire a lock, we look for our task in that tree. If it's not there, we
know this is the first time this task wanted that lock, so we can continue.
Otherwise we increment a count on the task ref and return the already
locked lock. Unlock does the opposite - it finds the task ref and decreases
the count. On zero it will proceed with the actual unlock.

The owning task is the only process allowed to manipulate a task ref, so we
only have to lock manipulation of the tree. We make an exception for
global locks which might be unlocked from another process context (in this
case that means the node id lock).

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-12-08 10:25:30 -08:00
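
A sketch of the lookup side of that task ref tree, keyed by the owning
task (names and locking details are illustrative assumptions):

    #include <linux/rbtree.h>
    #include <linux/sched.h>

    /* hypothetical per-lock record of a task that already holds the lock */
    struct task_ref {
            struct rb_node node;
            struct task_struct *task;
            unsigned int count;
    };

    static struct task_ref *find_task_ref(struct rb_root *root,
                                          struct task_struct *task)
    {
            struct rb_node *n = root->rb_node;

            while (n) {
                    struct task_ref *ref = rb_entry(n, struct task_ref, node);

                    if (task < ref->task)
                            n = n->rb_left;
                    else if (task > ref->task)
                            n = n->rb_right;
                    else
                            return ref;
            }
            return NULL;
    }

    /* lock: no ref found means this is the task's first acquire, so take
     * the lock and insert a ref with count 1; a found ref just gets its
     * count bumped.  unlock: decrement, and only do the real unlock and
     * free the ref when the count hits zero. */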
Zach Brown
cb879d9f37 scoutfs: add network greeting message
Add a network greeting message that's exchanged between the client and
server on every connection to make sure that we have the correct file
system and format hash.

Signed-off-by: Zach Brown <zab@versity.com>
2017-10-12 13:57:31 -07:00
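
A sketch of the sort of greeting exchange described (struct layout and
field names are assumptions, not the real wire format):

    #include <linux/types.h>

    /* hypothetical greeting sent by both sides on every connection */
    struct net_greeting {
            __le64 fsid;
            __le64 format_hash;
    } __packed;

    /* reject the peer if it isn't the same fs with the same format */
    static int check_greeting(struct net_greeting *gr, u64 fsid,
                              u64 format_hash)
    {
            if (le64_to_cpu(gr->fsid) != fsid ||
                le64_to_cpu(gr->format_hash) != format_hash)
                    return -EINVAL;

            return 0;
    }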
Zach Brown
1da18d17cf scoutfs: use trylock for global server lock
Shared unmount hasn't worked for a long time because we had no way to
wake the server work out of blocking while trying to acquire the lock.
In the old lock code the wait conditions didn't test ->shutdown.

dlmglue doesn't give us a reasonable way to break a caller out of a
blocked lock.  We could add some code to do it with a global context
that'd have to wake all locks or add a call with a lock resource name,
not a held lock, that'd wake that specific lock.  Neither sounds great.

So instead we'll use trylock to get the server lock.  It's guaranteed to
make reasonable forward progress.  The server work is already requeued
with a delay to retry.

While we're at it we add a global server lock instead of using the weird
magical inode lock in the fs space.  The server lock doesn't need keys
or to participate in item cache consistency, etc.

With this unmount works.  All mounts will now generate regular
background trylock requests.

Signed-off-by: Zach Brown <zab@versity.com>
2017-10-09 15:31:29 -07:00
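
The retry behaviour described above amounts to something like this
sketch (names are assumed):

    /* hypothetical: the server work attempts a trylock and relies on its
     * existing delayed requeue rather than blocking on the lock */
    static void server_work_func(struct work_struct *work)
    {
            struct server_info *server = container_of(to_delayed_work(work),
                                                      struct server_info,
                                                      dwork);

            if (server->shutdown)
                    return;

            if (!try_server_lock(server)) {
                    /* didn't get it; try again later instead of blocking */
                    queue_delayed_work(server->wq, &server->dwork, HZ);
                    return;
            }

            run_server(server);    /* assumed: serves while holding the lock */
    }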
Zach Brown
7854471475 scoutfs: fix server wq destroy warning
We were seeing warnings in destroy_workqueue() which meant that work was
queued on the server workqueue after it was drained and before it was
finally destroyed.

The only work that wasn't properly waited for was the commit work.  It
looks like it'd be idle because the server receive threads all wait for
their request processing work to finish.  But the way the commit work is
batched means that a request can have its commit processed by executing
commit work while leaving the work queued for another run.

Fix this by specifically waiting for the commit work to finish after the
server work has waited for all the recv and compaction work to finish.

I wasn't able to reliably trigger the assertion in repeated xfstests
runs.  This survived many runs as well; let's see if it stops the
destroy_workqueue() assertion from triggering in the future.

Signed-off-by: Zach Brown <zab@versity.com>
2017-09-12 15:22:03 -07:00
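
The fix described is an ordering change in the shutdown path, roughly
(helper and field names are assumptions):

    /* hypothetical shutdown ordering: quiesce the work that can queue
     * commits first, then wait for the commit work itself */
    static void server_shutdown(struct server_info *server)
    {
            /* recv and compaction work stop queueing new commit work */
            wait_for_recv_and_compaction(server);       /* assumed helper */

            /* a request can leave the commit work queued for one more
             * run even after its commit was processed, so wait for it */
            flush_work(&server->commit_work);

            destroy_workqueue(server->wq);
    }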
Zach Brown
51e03dcb7a scoutfs: refactor inode locking function
This is based on Mark Fasheh <mfasheh@versity.com>'s series that
introduced inode refreshing after locking and a trylock for readpage.

Rework the inode locking function so that it's more clearly named and
takes flags and the inode struct.

We have callers that want to lock the logical inode but aren't doing
anything with the vfs inode so we provide that specific entry point.

Signed-off-by: Zach Brown <zab@versity.com>
2017-08-30 10:37:59 -07:00
Zach Brown
87ab27beb1 scoutfs: add statfs network message
The ->statfs method was still using the super_block in the super_info
that was read during mount.  This will get progressively more out
of date.

We add a network message to ask the server for the current fields that
impact statfs.  This is always racy and the fields are mostly nonsense,
but we try our best.

Signed-off-by: Zach Brown <zab@versity.com>
2017-08-11 10:43:35 -07:00
Zach Brown
c1b2ad9421 scoutfs: separate client and server net processing
The networking code was really suffering by trying to combine the client
and server processing paths into one file.  The code can be a lot
simpler by giving the client and server their own processing paths that
take their different socket lifecycles into account.

The client maintains a single connection.  Blocked senders work on the
socket under a sending mutex.  The recv path runs in work that can be
canceled after first shutting down the socket.

A long running server work function acquires the listener lock, manages
the listening socket, and accepts new sockets.  Each accepted socket has
a single recv work blocked waiting for requests.  That then spawns
concurrent processing work which sends replies under a sending mutex.
All of this is torn down by shutting down sockets and canceling work
which frees its context.

All this restructuring makes it a lot easier to track what is happening
in mount and unmount between the client and server.  This fixes bugs
where unmount was failing because the monolithic socket shutdown
function was queueing other work while it was being drained.

Signed-off-by: Zach Brown <zab@versity.com>
2017-08-04 10:47:42 -07:00