Commit Graph

42 Commits

Author SHA1 Message Date
Zach Brown
c061ada671 scoutfs: mounts connect once server is listening
An elected leader writes a quorum block showing that it's elected before
it assumes exclusive access to the device and starts bringing up the
server.  This lets another later elected leader find and fence it if
something happens.

Other mounts were trying to connect to the server once this elected
quorum block was written and before the server was listening.  They'd
get connection refused, decide to elect a new leader, and try to fence
the server that's still running.

Now, they should have tried much harder to connect to the elected leader
instead of taking a single failed attempt as fatal.  But that's a
problem for another day that involves more work in balancing timeouts
and retries.

But mounts should not have tried to connect to the server until it's
listening.  That's easy to signal by adding a simple listening flag to
the quorum block.  Now mounts only try to connect once they see the
listening flag and don't hit these racy refused connections.
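The flag check above can be sketched in plain C.  This is a minimal illustration with hypothetical field and flag names (`term`, `flags`, `QUORUM_FLAG_LISTENING`); the real scoutfs quorum block format differs:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical quorum block layout; the real scoutfs format differs. */
#define QUORUM_FLAG_LISTENING (1u << 0)

struct quorum_block {
	uint64_t term;   /* election term that wrote this block */
	uint32_t flags;  /* QUORUM_FLAG_LISTENING once accept() is up */
};

/* Mounts only try to connect once the elected leader has published the
 * listening flag, instead of racing a refused connection and fencing a
 * server that is still coming up. */
static bool leader_ready(const struct quorum_block *blk)
{
	return (blk->flags & QUORUM_FLAG_LISTENING) != 0;
}
```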

Signed-off-by: Zach Brown <zab@versity.com>
2019-05-30 15:01:00 -07:00
Zach Brown
36b0df336b scoutfs: add unmount barrier
Now that a mount's client is responsible for electing and starting a
server we need to be careful about coordinating unmount.  We can't
let unmounting clients leave the remaining mounted clients without
quorum.

The server carefully tracks who is mounted and who is unmounting while
it is processing farewell requests.  It only sends responses to voting
mounts while quorum remains or once all of the voting clients are
trying to unmount.

We use a field in the quorum blocks to communicate to the final set of
unmounting voters that their farewells have been processed and that they
can finish unmounting without trying to re-establish quorum.

The commit introduces and maintains the unmount_barrier field in the
quorum blocks.  It is passed to the server from the election, the
server sends it to the client and writes new versions, and the client
compares what it received with what it sees in quorum blocks.
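The barrier comparison can be sketched as follows.  This is a hedged illustration with assumed names, not the actual scoutfs code: the client records the unmount_barrier value the server sent it and later compares it with the value it reads from quorum blocks:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative sketch of the unmount_barrier handshake.  Once the
 * server has written a barrier newer than the one the client was
 * handed, the client's farewell has been processed and it may finish
 * unmounting without trying to re-establish quorum. */
static bool farewell_processed(uint64_t barrier_from_server,
			       uint64_t barrier_in_quorum_block)
{
	return barrier_in_quorum_block > barrier_from_server;
}
```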

The commit then has the clients send their unique name to the server
who stores it in persistent mounted client records and compares the
names to the quorum config when deciding which farewell requests
can be responded to.

Now that farewell response processing can block for a very long time it
is moved off into async work so that it doesn't prevent net connections
from being shut down and re-established.  This also makes it easier to
make global decisions based on the count of pending farewell requests.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
fe63b566c9 scoutfs: use _unaligned instead of __packed
We were relying on a cute (and probably broken) trick of defining
pointers to unaligned base types with __packed.  Modern versions of gcc
warn about this.

Instead we either directly access unaligned types with get_ and
put_unaligned, or we copy unaligned data into aligned copies before
working with it.
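The get_/put_unaligned pattern can be expressed in portable C: rather than dereferencing a __packed pointer to an unaligned integer, copy the bytes through memcpy and let the compiler emit safe loads and stores.  A minimal sketch:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* memcpy-based unaligned access; the compiler lowers these to plain
 * loads/stores on architectures that allow them, and to byte accesses
 * where they don't, without undefined behavior either way. */
static uint64_t get_unaligned_u64(const void *p)
{
	uint64_t v;

	memcpy(&v, p, sizeof(v));
	return v;
}

static void put_unaligned_u64(uint64_t v, void *p)
{
	memcpy(p, &v, sizeof(v));
}
```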

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
e88b5732ad scoutfs: track trans seq in btree
Currently the server tracks the outstanding transaction sequence numbers
that clients have open in a simple list in memory.  That list isn't
properly cleaned up when a client unmounts, and a new server that takes
over after a crash won't know about open transaction sequence numbers.

This stores open transaction sequence numbers in a shared persistent
btree instead of in memory.  It removes tracking for clients as they
send their farewell during unmount.  A new server that starts up will
see existing entries for clients that were created by old servers.

This fixes a bug where a client who unmounts could leave behind a
pending sequence number that would never be cleaned up and would
indefinitely limit the visibility of index items that came after it.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
ec0fb5380a scoutfs: implement lock recovery
When a server crashes all the connected clients still have operational
locks and can be using them to protect IO.  As a new server starts up
its lock service needs to account for those outstanding locks before
granting new locks to clients.

This implements lock recovery by having the lock service recover locks
from clients as it starts up.

First the lock service stores records of connected clients in a btree
off the super block.  Records are added as the server receives their
greeting and are removed as the server receives their farewell.

Then the server checks for existing persistent records as it starts up.
If it finds any it enters recovery and waits for all the old clients to
reconnect before resuming normal processing.

We add lock recover request and response messages that are used to
communicate locks from the clients to the server.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
74366f0df1 scoutfs: make networking more reliable
The current networking code has loose reliability guarantees.  If a
connection between the client and server is broken then the client
reconnects as though it's an entirely new connection.  The client resends
requests but no responses are resent.  A client's requests could be
processed twice on the same server.  The server throws away disconnected
client state.

This was fine, sort of, for the simple requests we had implemented so
far.  It's not good enough for the locking service which would prefer to
let networking worry about reliable message delivery so it doesn't have
to track and replay partial state across reconnection between the same
client and server.

This adds the infrastructure to ensure that requests and responses
between a given client and server will be delivered across reconnected
sockets and will only be processed once.

The server keeps track of disconnected clients and restores state if the
same client reconnects.  This required some work around the greetings so
that clients and servers can recognize each other.  Now that the server
remembers disconnected clients we add a farewell request so that servers
can forget about clients that are shutting down and won't be
reconnecting.

Now that connections between the client and server are preserved we can
resend responses across reconnection.  We add outgoing message sequence
numbers which are used to drop duplicates and communicate the received
sequence back to the sender to free responses once they're received.
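The receive-side bookkeeping can be sketched like this.  Field and function names are assumptions for illustration, not scoutfs's actual identifiers:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Each peer stamps outgoing messages with an increasing seq.  The
 * receiver drops anything at or below the highest seq it has already
 * processed, so messages resent across a reconnected socket are only
 * processed once.  The acked seq travels back to the sender, which can
 * then free its retained responses. */
struct recv_state {
	uint64_t last_seq;   /* highest seq processed so far */
};

static bool should_process(struct recv_state *st, uint64_t seq)
{
	if (seq <= st->last_seq)
		return false;   /* duplicate from a resend, drop it */
	st->last_seq = seq;
	return true;
}
```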

When the client is reconnecting to a new server it resets its receive
state that was dependent on the old server and it drops responses which
were being sent to a server instance which no longer exists.

This stronger reliable messaging guarantee will make it much easier
to implement lock recovery which can now rewind state relative to
requests that are in flight and replay existing state on a new server
instance.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
675275fbf1 scoutfs: use hdr.fsid in greeting instead of id
The network greeting exchange was mistakenly using the global super
block magic number instead of the per-volume fsid to identify the
volumes that the endpoints are working with.  This prevented the check
from doing its only job: to fail when clients in one volume try to
connect to a server in another.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
288d781645 scoutfs: start and stop server with quorum
Currently all mounts try to get a dlm lock which gives them exclusive
access to become the server for the filesystem.  That isn't going to
work if we're moving to locking provided by the server.

This uses quorum election to determine who should run the server.  We
switch from long running server work blocked trying to get a lock to
calls which start and stop the server.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
34b8950bca scoutfs: initial lock server core
Add the core lock server code for providing a lock service from our
server.  The lock messages are wired up but nothing calls them.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
7e9d40d65a scoutfs: init ret when freeing zero extents
The server forgot to initialize ret to 0 and might return
undefined errnos if a client asked it to free zero extents.

Signed-off-by: Zach Brown <zab@versity.com>
2018-09-12 15:37:45 -07:00
Zach Brown
2cc990406a scoutfs: compact using net requests
Currently compaction is only performed by one thread running in the
server.  Total metadata throughput of the system is limited by only
having one compaction operation in flight at a time.

This refactors the compaction code to have the server send compaction
requests to clients who then perform the compaction and send responses
to the server.  This spreads compaction load out amongst all the clients
and greatly increases total compaction throughput.

The manifest keeps track of compactions that are in flight at a given
level so that we maintain segment count invariants with multiple
compactions in flight.  It also uses the sparse bitmap to lock down
segments that are being used as inputs to avoid duplicating items across
two concurrent compactions.

A server thread still coordinates which segments are compacted.  The
search for a candidate compaction operation is largely unchanged.  It
now has to deal with being unable to process a compaction because its
segments are busy.  We add some logic to keep searching in a level until
we find a compaction that doesn't intersect with current compaction
requests.  If there are none at the level we move up to the next level.

The server will only issue a given number of compaction requests to a
client at a time.  When it needs to send a compaction request it rotates
through the current clients until it finds one that doesn't have the max
in flight.
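The rotation above can be sketched as a simple rotor over per-client in-flight counts.  The limit and names here are illustrative assumptions, not scoutfs's actual values:

```c
#include <assert.h>

/* in_flight[i] counts compaction requests outstanding at client i and
 * *rotor remembers where the last search stopped, so requests spread
 * across clients rather than piling onto the first one. */
#define MAX_IN_FLIGHT 2

static int pick_compaction_client(const int *in_flight, int nr, int *rotor)
{
	for (int i = 0; i < nr; i++) {
		int c = (*rotor + i) % nr;

		if (in_flight[c] < MAX_IN_FLIGHT) {
			*rotor = (c + 1) % nr;
			return c;
		}
	}
	return -1;   /* everyone is at the limit */
}
```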

If a client disconnects the server forgets the compactions it had sent
to that client.  If those compactions still need to be processed they'll
be sent to the next client.

The segnos that are allocated for compaction are not reclaimed if a
client disconnects or the server crashes.  This is a known deficiency
that will be addressed with the broader work to add crash recovery to
the multiple points in the protocol where the server and client trade
ownership of persistent state.

The server needs to block as it does work for compaction in the
notify_up and response callbacks.  We move them out from under spin
locks.

The server needs to clean up allocated segnos for a compaction request
that fails.  We let the client send a data payload along with an error
response so that it can give the server the id of the compaction that
failed.

Signed-off-by: Zach Brown <zab@versity.com>
2018-08-28 15:34:30 -07:00
Zach Brown
62d6c11e3c scoutfs: clean up workqueue flags
We had gotten a bit sloppy with the workqueue flags.  We needed _UNBOUND
in some workqueues where we wanted concurrency by scheduling across cpus
instead of waiting for the current (very long running) work on a cpu to
finish.  We add NON_REENTRANT out of an abundance of caution.  It has
gone away in modern kernels and is probably not needed here, but
according to the docs we would want it so we at least document that fact
by using it.

Signed-off-by: Zach Brown <zab@versity.com>
2018-08-28 15:34:30 -07:00
Zach Brown
0adbd7e439 scoutfs: have server track connected clients
This extends the notify up and down calls to let the server keep track
of connected clients.

It adds the notion of per-connection info that is allocated for each
connection.  It's passed to the notification callbacks so that callers
can have per-client storage without having to manage allocations in the
callbacks.

It adds the node_id argument to the notification callbacks to indicate
if the call is for the listening socket itself or an accepted client
connection on that listening socket.

Signed-off-by: Zach Brown <zab@versity.com>
2018-08-28 15:34:30 -07:00
Zach Brown
746293987c scoutfs: let server send msg to specific node_id
The current sending interfaces only send a message to the peer of a
given connection.  For the server to send to a specific connected client
it'd have to track connections itself and send to them.

This adds a sending interface that uses the node_id to send to a
specific connected client.  The conn argument is the listening socket
and its accepted sockets are searched for the destination node_id.

Signed-off-by: Zach Brown <zab@versity.com>
2018-08-28 15:34:30 -07:00
Zach Brown
8b3193ea72 scoutfs: server allocates node_id
Today node_ids are randomly assigned.  This adds the risk of failure
from random number generation and still allows for the risk of
collisions.

Switch to assigning strictly advancing node_ids on the server during the
initial connection greeting message exchange.  This simplifies the
system and allows us to derive information from the relative values of
node_ids in the system.

To do this we refactor the greeting code from internal to the net layer
to proper client and server request and response processing.  This lets
the server manage persistent node_id storage and allows the client to
wait for a node_id during mount.
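The assignment itself reduces to a strictly advancing counter.  A minimal sketch, with the caveat that the real server persists the next value (here it just lives in a struct) so ids keep advancing across server instances:

```c
#include <assert.h>
#include <stdint.h>

/* Strictly advancing node_id assignment during the greeting exchange.
 * Monotonic ids remove the collision risk of random assignment and let
 * the relative values of node_ids carry meaning. */
struct server_ids {
	uint64_t next_node_id;
};

static uint64_t greeting_assign_node_id(struct server_ids *ids)
{
	return ids->next_node_id++;
}
```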

Now that net_connect is sync in the client we don't need the notify_up
callback anymore.  The client can perform those duties when the connect
returns.

The net code still has to snoop on request and response processing to
see when the greetings have been exchanged and allow messages to flow.

Signed-off-by: Zach Brown <zab@versity.com>
2018-08-28 15:34:30 -07:00
Zach Brown
a25b6324d2 scoutfs: maintain free_blocks in one place
The free_blocks counter in the super is meant to track the number of
total blocks in the primary free extent index.  Callers of extent
manipulation were trying to keep it in sync with the extents.

Segment allocation was allocating extents manually using a cursor.  It
forgot to update free_blocks.  Segment freeing then freed the segment as
an extent, which did update free_blocks.  The count thus accumulated
over time until it exceeded total blocks and caused df to report
negative usage.

This updates the free_blocks count in server extent io which is the only
place we update the extent items themselves.  This ensures that we'll
keep the count in sync with the extent items.  Callers don't have to
worry about it.

Signed-off-by: Zach Brown <zab@versity.com>

2018-08-21 13:25:05 -07:00
Zach Brown
d708421cfb scoutfs: remove unused client and server code
The previous commit added shared networking code and disabled the old
unused code.  This removes all that unused client and server code that
was refactored to become the shared networking code.

Signed-off-by: Zach Brown <zab@versity.com>
2018-07-27 09:50:21 -07:00
Zach Brown
17dec65a52 scoutfs: add bidirectional network messages
The client and server networking code was a bit too rudimentary.

The existing code only had support for the client synchronously and
actively sending requests that the server could only passively respond
to.  We're going to need the server to be able to send requests to
connected clients and it can't block waiting for responses from each
one.

This refactors sending and receiving in both the client and server code
into shared networking code.  It's built around a connection struct that
then holds the message state.  Both peers on the connection can send
requests and send responses.

The existing code only retransmitted requests down newly established
connections.  Requests could be processed twice.

This adds robust reliability guarantees.  Requests are resent until
their response is received.  Requests are only processed once by a given
peer, regardless of the connection's transport socket.  Responses are
reliably resent until acknowledged.

This only adds the new refactored code and disables the old unused code
to keep the diff footprint minimal.  A following commit will remove all
the unused code.

Signed-off-by: Zach Brown <zab@versity.com>
2018-07-27 09:50:21 -07:00
Zach Brown
295bf6b73b scoutfs: return free extents to server
Freed file data extents are tracked in free extent items in each node.
They could only be re-used in the future for file data extent allocation
on that node.  Allocations on other nodes or, critically, segment
allocation on the server could never see those free extents.  With the
right allocation patterns, particularly allocating on node X and freeing
on node Y, all the free extents can build up on a node and starve other
allocations.

This adds a simple high water mark after which nodes start returning
free extents to the server.  From there they can satisfy segment
allocations or be sent to other nodes for file data extent allocation.
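The high water mark policy can be sketched as a threshold check.  The threshold value here is an illustrative assumption, not scoutfs's actual mark:

```c
#include <assert.h>
#include <stdint.h>

/* Once a node's pool of free extent blocks exceeds the mark, the
 * excess is handed back to the server, where it can satisfy segment
 * allocations or be redistributed to other nodes. */
#define FREE_BLOCKS_HIGH_WATER 4096

static uint64_t blocks_to_return(uint64_t local_free_blocks)
{
	if (local_free_blocks <= FREE_BLOCKS_HIGH_WATER)
		return 0;
	return local_free_blocks - FREE_BLOCKS_HIGH_WATER;
}
```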

Signed-off-by: Zach Brown <zab@versity.com>
2018-07-05 16:19:31 -07:00
Zach Brown
e19716a0f2 scoutfs: clean up super block use
The code that works with the super block had drifted a bit.  We still
had two super blocks from an old design and we weren't doing anything
with the crc.
Move to only using one super block at a fixed blkno and store and verify
its crc field by sharing code with the btree block checksumming.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 15:56:42 -07:00
Zach Brown
002daf3c1c scoutfs: return -ENOSPC to client alloc segno
The server send_reply interface is confusing.  It uses errors to shut
down the connection, so a client's -ENOSPC has to be communicated in
the message reply payload instead.

The segno allocation server processing needs to set the segno to 0 so
that the client gets it and translates that into -ENOSPC.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
2efba47b77 scoutfs: satisfy large allocs with smaller extents
The previous fallocate and get_block allocators only looked for free
extents larger than the requested allocation size.  This prematurely
returns -ENOSPC if a very large allocation is attempted.  Some xfstests
stress low free space situations by fallocating almost all the free
space in the volume.

This adds an allocation helper function that finds the biggest free
extent to satisfy an allocation, possibly after trying to get more free
extents from the server.  It looks for previous extents in the index of
extents by length.  This builds on the previously added item and extent
_prev operations.

Allocators need to then know the size of the allocation they got instead
of assuming they got what they asked for.  The server can also return a
smaller extent so it needs to communicate the extent length, not just
its start.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
04660dbfee scoutfs: add scoutfs_extent_prev()
Add an extent function for iterating backwards through extents.  We add
the wrapper and have the extent IO functions call their storage _prev
functions.  Data extent IO can now call the new scoutfs_item_prev().

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
9c74f2011d scoutfs: add server work tracing
Add some server workqueue and work tracing to chase down the destruction
of an active workqueue.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
41c29c48dd scoutfs: add extent corruption cases
The extent code was originally written to panic if it hit errors during
cleanup that resulted in inconsistent metadata.  The more reasonable
strategy is to warn about the corruption and leave it to corrective
measures to resolve.  In this case we continue returning the error that
caused us to try to clean up.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
1b3645db8b scoutfs: remove dead server allocator code
Remove the bitmap segno allocator code that the server used to use to
manage allocations.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
c01a715852 scoutfs: use extents in the server allocator
Have the server use the extent core to maintain free extent items in the
allocation btree instead of the bitmap items.

We add a client request to allocate an extent of a given length.  The
existing segment alloc and free now work with a segment's worth of
blocks.

The server maintains counters in the super block of free blocks instead
of free segments.  We maintain an allocation cursor so that allocation
results tend to cycle through the device.  It's stored in the super so
that it is maintained across server instances.

This doesn't remove unused dead code to keep the commit from getting too
noisy.  It'll be removed in a future commit.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
f3007f10ca scoutfs: shut down server on commit errors
We hadn't yet implemented any error handling in the server when commits
fail.

Commit errors are serious and we take them as a sign that something has
gone horribly wrong.  This patch prints commit error warnings to the
console and shuts down.  Clients will try to reconnect and resend their
requests.

The hope is that another server will be able to make progress.  But this
same node could become the server again and it could well be that the
errors are persistent.

The next steps are to implement server startup backoff, client retry
backoff, and hard failure policies.

Signed-off-by: Zach Brown <zab@versity.com>
2018-05-01 11:48:19 -07:00
Zach Brown
24cc5cc296 scoutfs: lock manifest root request
The manifest root request processing samples the stable_manifest_root in
the server info.  The stable_manifest_root is updated after a
commit has succeeded.

The read of stable_manifest_root in request processing was locking the
manifest.  The update during commit doesn't lock the manifest so these
paths were racing.  The race is very tight, a few cpu stores, but it
could in theory give a client a malformed root that could be
misinterpreted as corruption.

Add a seqcount around the store of the stable manifest root during
commit and its load during request processing.  This ensures that
clients always get a consistent manifest root.
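The seqcount pattern can be sketched in userspace C (the kernel's seqcount_t behaves this way, though this stand-in omits the memory barriers a real multi-threaded version needs): the single writer bumps the counter to odd before updating and to even after, and readers retry until they see a stable even counter on both sides of their load:

```c
#include <assert.h>
#include <stdint.h>

struct stable_root {
	unsigned int seq;
	uint64_t blkno;   /* stands in for the manifest root */
};

static void store_stable_root(struct stable_root *r, uint64_t blkno)
{
	r->seq++;         /* odd: update in progress */
	r->blkno = blkno;
	r->seq++;         /* even: update complete */
}

static uint64_t load_stable_root(const struct stable_root *r)
{
	unsigned int start;
	uint64_t blkno;

	do {
		start = r->seq;
		blkno = r->blkno;
	} while ((start & 1) || start != r->seq);

	return blkno;
}
```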

Signed-off-by: Zach Brown <zab@versity.com>
2018-04-27 09:06:35 -07:00
Zach Brown
8061a5cd28 scoutfs: add server bind warning
Emit an error message if the server fails to bind.  It can mean that
an address is badly configured.  But we might want to be able to bind
once the address becomes available, so we don't hard error.  We only
emit the message once for a series of failures.

Signed-off-by: Zach Brown <zab@versity.com>
2018-04-13 15:49:14 -07:00
Zach Brown
9148f24aa2 scoutfs: use single small key struct
Variable length keys lead to having a key struct point to the buffer
that contains the key.  With dirents and xattrs now using small keys we
can convert everyone to using a single key struct and significantly
simplify the system.

We no longer have a separate generic key buf struct that points to
specific per-type key storage.  All items use the key struct and fill
out the appropriate fields.  All the code that paired a generic key buf
struct and a specific key type struct is collapsed down to a key struct.
There's no longer the difference between a key buf that shares a
read-only key, has its own precise allocation, or has a max size
allocation for incrementing and decrementing.

Each key user now has an init function that fills out its fields.  It
looks a lot like the old pattern but we no longer have separate key
storage that the buf points to.

A bunch of code now takes the address of static key storage instead of
managing allocated keys.  Conversely, swapping now uses the full keys
instead of pointers to the keys.

We don't need all the functions that worked on the generic key buf
struct because they had different lengths.  Copy, clone, length init,
memcpy, all of that goes away.

The item API had some functions that tested the length of keys and
values.  The key length tests vanish, and that gets rid of the _same()
call.  The _same_min() call only had one user who didn't also test for
the value length being too large.  Let's leave caller key constraints in
callers instead of trying to hide them on the other side of a bunch of
item calls.

We no longer have to track the number of key bytes when calculating if
an item population will fit in segments.  This removes the key length
from reservations, transactions, and segment writing.

The item cache key querying ioctls no longer have to deal with variable
length keys.  They simply specify the start key, the ioctls return the
number of keys copied instead of bytes, and the caller is responsible
for incrementing the next search key.

The segment no longer has to store the key length.  It stores the key
struct in the item header.

The fancy variable length key formatting and printing can be removed.
We have a single format for the universal key struct.  The SK_ wrappers
that bracketed calls to use preempt safe per cpu buffers can turn back
into their normal calls.

Manifest entries are now a fixed size.  We can simply split them between
btree keys and values and initialize them instead of allocating them.
This means that level 0 entries don't have their own format that sorts
by the seq.  They're sorted by the key like all the other levels.
Compaction needs to sweep all of them looking for the oldest and read
can stop sweeping once it can no longer overlap.  This makes rare
compaction more expensive and common reading less expensive, which is
the right tradeoff.

Signed-off-by: Zach Brown <zab@versity.com>
2018-04-04 09:15:27 -05:00
Zach Brown
c76c6582f0 scoutfs: release server conn under mutex
I was rarely seeing null derefs during unmount.  The per-mount listening
scoutfs_server_func() was seeing null sock->ops as it called
kernel_sock_shutdown() to shutdown the connected client sockets.
sock_release() sets the ops to null.  We're not supposed to use a socket
after we call it.

The per-connection scoutfs_server_recv_func() calls sock_release() as it
tears down its connection.  But it does this before it removes the
connection from the listener's list.  There's a brief window where the
connection's socket has been released but is still visible on the list.
If the listener tries to shutdown during this time it will crash.

Hitting this window depends on scheduling races during unmount.  The
unmount path has the client close its connection to the server then the
server closes all its connected clients.  If the local mount is the
server then it will have recv work see an error as the client
disconnects and it will be racing to shut down the connection with the
listening thread during unmount.

I think I only saw this in my guests because they're running slower
debug kernels on my slower laptop.  The window of vulnerability while
the released socket is on the list is longer.

The fix is to release the socket while we hold the mutex and are
removing the connection from the list.  A released socket is never
visible on the list.
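The ordering fix can be sketched single-threaded, with the "mutex" reduced to a flag so the invariant is assertable without real threads.  The types and names are illustrative stand-ins, not the scoutfs structs:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* The point of the fix: the socket is released while the connection is
 * being unlinked under the lock, so a released socket is never visible
 * to the shutdown path walking the list. */
static bool conn_lock_held;

struct conn {
	struct conn *next;
	bool sock_released;
};

static struct conn *conn_list;

static void conn_teardown(struct conn *c)
{
	conn_lock_held = true;         /* mutex_lock(&conn_mutex) */

	struct conn **p = &conn_list;  /* unlink from the listener's list */
	while (*p && *p != c)
		p = &(*p)->next;
	if (*p)
		*p = c->next;

	c->sock_released = true;       /* sock_release() under the lock */
	assert(conn_lock_held);

	conn_lock_held = false;        /* mutex_unlock(&conn_mutex) */
}
```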

While we're at it, don't use list_for_each_entry_safe() to iterate over
the connection list.  We're not modifying it.  This is a lingering
artifact from previous versions of the server code.

Signed-off-by: Zach Brown <zab@versity.com>
2018-02-22 14:27:01 -08:00
Zach Brown
f52dc28322 scoutfs: simplify lock use of kernel dlm
We had an excessive number of layers between scoutfs and the dlm code in
the kernel.  We had dlmglue, the scoutfs locks, and task refs.  Each
layer had structs that track the lifetime of the layer below it.  We
were about to add another layer to hold on to locks just a bit longer so
that we can avoid down conversion and transaction commit storms under
contention.

This collapses all those layers into simple state machine in lock.c that
manages the mode of dlm locks on behalf of the file system.

The users of the lock interface are mainly unchanged.  We did change
from a heavier trylock to a lighter nonblock lock attempt and have to
change the single rare readpage use.  Lock fields change so a few
external users of those fields change.

This not only removes a lot of code it also contains functional
improvements.  For example, it can now convert directly to CW locks with
a single lock request instead of having to use two by first converting
to NL.

It introduces the concept of an unlock grace period.  Locks won't be
dropped on behalf of other nodes soon after being unlocked so that tasks
have a chance to batch up work before the other node gets a chance.
This can result in two orders of magnitude improvements in the time it
takes to, say, change a set of xattrs on the same file population from
two nodes concurrently.

There are significant changes to trace points, counters, and debug files
that follow the implementation changes.

Signed-off-by: Zach Brown <zab@versity.com>
2018-02-14 15:00:17 -08:00
Zach Brown
4ff1e3020f scoutfs: allocate inode numbers per directory
Having an inode number allocation pool in the super block meant that all
allocations across the mount are interleaved.  This means that
concurrent file creation in different directories will create
overlapping inode numbers.  This leads to lock contention as reasonable
work loads will tend to distribute work by directories.

The easy fix is to have per-directory inode number allocation pools.  We
take the opportunity to clean up the network request so that the caller
gets the allocation instead of having it be fed back in via a weird
callback.

Signed-off-by: Zach Brown <zab@versity.com>
2018-02-09 17:58:19 -08:00
Zach Brown
ec91a4375f scoutfs: unlock the server listen lock
Turns out the server wasn't explicitly unlocking the listen lock!  This
ended up working because we only shut down an active server on unmount
and unmount will tear down the lock space which will drop the still held
listen lock.

That's just dumb.

But it also forced using an awkward lock flag to avoid setting up a task
ref for the lock hold which wouldn't have been torn down otherwise.  By
adding the unlock we restore balance to the force and can get rid of that
flag.

Cool, cool, cool.

Signed-off-by: Zach Brown <zab@versity.com>
2017-12-08 17:00:44 -06:00
Mark Fasheh
8064a161f0 scoutfs: better tracking of recursive lock holders
This replaces the fragile recursive locking logic in dlmglue. In particular
that code fails when we have a pending downconvert and a process comes in
for a level that's compatible with the existing level. The downconvert will
still happen which causes us to now believe we are holding a lock that we
are not! We could go back to checking for holders that raced our downconvert
worker but that had problems of its own (see commit e8f7ef0).

Instead of trying to infer from lock state what we are allowed to do, let's
be explicit. Each lock now has a tree of task refs. If you come in to
acquire a lock, we look for our task in that tree. If it's not there, we
know this is the first time this task wanted that lock, so we can continue.
Otherwise we increment a count on the task ref and return the already
locked lock. Unlock does the opposite - it finds the task ref and decreases
the count. On zero it will proceed with the actual unlock.

The owning task is the only process allowed to manipulate a task ref, so we
only have to lock manipulation of the tree. We make an exception for
global locks which might be unlocked from another process context (in this
case that means the node id lock).

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-12-08 10:25:30 -08:00
Zach Brown
cb879d9f37 scoutfs: add network greeting message
Add a network greeting message that's exchanged between the client and
server on every connection to make sure that we have the correct file
system and format hash.

Signed-off-by: Zach Brown <zab@versity.com>
2017-10-12 13:57:31 -07:00
Zach Brown
1da18d17cf scoutfs: use trylock for global server lock
Shared unmount hasn't worked for a long time because nothing would wake
the server work out of blocking while trying to acquire the lock.  In
the old lock code the wait conditions didn't test ->shutdown.

dlmglue doesn't give us a reasonable way to break a caller out of a
blocked lock.  We could add some code to do it with a global context
that'd have to wake all locks or add a call with a lock resource name,
not a held lock, that'd wake that specific lock.  Neither sound great.

So instead we'll use trylock to get the server lock.  It's guaranteed to
make reasonable forward progress.  The server work is already requeued
with a delay to retry.
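The trylock-and-requeue pattern might look like this in a userspace model (Python sketch with hypothetical names; the kernel version uses delayed workqueue requeueing rather than a timer):

```python
import threading

RETRY_DELAY = 0.01  # stand-in for the delayed work requeue interval

def server_work(lock, shutdown, did_run):
    """One pass of the server work: trylock, then either run or requeue."""
    if shutdown.is_set():
        return                       # unmount can always get the work to exit
    if not lock.acquire(blocking=False):
        # Didn't get the lock; requeue ourselves with a delay instead of
        # blocking inside the lock call where shutdown can't reach us.
        threading.Timer(RETRY_DELAY, server_work,
                        (lock, shutdown, did_run)).start()
        return
    try:
        did_run.set()                # ... run the server ...
    finally:
        lock.release()
```

Because the work never blocks inside the lock call, shutdown only has to stop the pending work from requeueing rather than break a waiter out of the lock manager.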

While we're at it we add a global server lock instead of using the weird
magical inode lock in the fs space.  The server lock doesn't need keys
or to participate in item cache consistency, etc.

With this, unmount works.  All mounts will now generate regular
background trylock requests.

Signed-off-by: Zach Brown <zab@versity.com>
2017-10-09 15:31:29 -07:00
Zach Brown
7854471475 scoutfs: fix server wq destroy warning
We were seeing warnings in destroy_workqueue() which meant that work was
queued on the server workqueue after it was drained and before it was
finally destroyed.

The only work that wasn't properly waited for was the commit work.  It
looks like it'd be idle because the server receive threads all wait for
their request processing work to finish.  But the way the commit work is
batched means that a request can have its commit processed by executing
commit work while leaving the work queued for another run.

Fix this by specifically waiting for the commit work to finish after the
server work has waited for all the recv and compaction work to finish.
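The resulting tear-down ordering can be sketched like so (a simplified Python model using futures in place of workqueue items; the names are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def stop_server(pool, recv_tasks, get_pending_commit):
    """Shutdown ordering: wait out the recv/processing tasks first, since
    any of them may queue new commit work, and only then wait for the
    commit work they may have left queued for another run."""
    for task in recv_tasks:
        task.result()                   # all recv and processing work done
    commit = get_pending_commit()
    if commit is not None:
        commit.result()                 # the possibly still-queued commit pass
    pool.shutdown()                     # safe: nothing can queue more work
```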

I wasn't able to reliably trigger the warning in repeated xfstests
runs.  This fix survived many runs as well; let's see if it stops the
destroy_workqueue() warning from triggering in the future.

Signed-off-by: Zach Brown <zab@versity.com>
2017-09-12 15:22:03 -07:00
Zach Brown
51e03dcb7a scoutfs: refactor inode locking function
This is based on Mark Fasheh <mfasheh@versity.com>'s series that
introduced inode refreshing after locking and a trylock for readpage.

Rework the inode locking function so that it's more clearly named and
takes flags and the inode struct.

We have callers that want to lock the logical inode but aren't doing
anything with the vfs inode, so we provide that specific entry point.

Signed-off-by: Zach Brown <zab@versity.com>
2017-08-30 10:37:59 -07:00
Zach Brown
87ab27beb1 scoutfs: add statfs network message
The ->statfs method was still using the super_block in the super_info
that was read during mount.  This will get progressively more out
of date.

We add a network message to ask the server for the current fields that
impact statfs.  This is always racy and the fields are mostly nonsense,
but we try our best.

Signed-off-by: Zach Brown <zab@versity.com>
2017-08-11 10:43:35 -07:00
Zach Brown
c1b2ad9421 scoutfs: separate client and server net processing
The networking code was really suffering by trying to combine the client
and server processing paths into one file.  The code can be a lot
simpler by giving the client and server their own processing paths that
take their different socket lifecycles into account.

The client maintains a single connection.  Blocked senders work on the
socket under a sending mutex.  The recv path runs in work that can be
canceled after first shutting down the socket.

A long running server work function acquires the listener lock, manages
the listening socket, and accepts new sockets.  Each accepted socket has
a single recv work blocked waiting for requests.  That then spawns
concurrent processing work which sends replies under a sending mutex.
All of this is torn down by shutting down sockets and canceling work
which frees its context.
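The per-connection structure might be modeled like this (userspace Python sketch; a list of requests and a thread pool stand in for the socket and the workqueue):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def serve_connection(requests, replies, pool):
    """Model of one accepted socket: a single recv loop spawns concurrent
    processing work whose replies serialize under a sending mutex."""
    send_mutex = threading.Lock()

    def process(req):
        with send_mutex:                 # concurrent replies don't interleave
            replies.append(("reply", req))

    futures = [pool.submit(process, req) for req in requests]  # recv loop
    for f in futures:
        f.result()                       # tear-down waits out the processing
```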

All this restructuring makes it a lot easier to track what is happening
in mount and unmount between the client and server.  It fixes bugs where
unmount was failing because the monolithic socket shutdown function was
queueing other work while it was being drained.

Signed-off-by: Zach Brown <zab@versity.com>
2017-08-04 10:47:42 -07:00