Commit Graph

27 Commits

Author SHA1 Message Date
Zach Brown
ab7bde9e2c scoutfs: replace node_id with rid in networking
Use the client's rid in networking instead of the node_id.

The node_id no longer has to be allocated by the server and sent in the
greeting.  Instead the client sends it to the server in its greeting.

The server then uses the client's announced rid just like it used to use
the node_id.  It's used to record clients in the btree and to
identify clients in sending and receive processing.

The use of the rid in networking calls makes its way to locking and
compaction which now use the rid to identify clients instead of the
node_id.

Signed-off-by: Zach Brown <zab@versity.com>
2019-08-20 15:52:13 -07:00
Zach Brown
5b258cee3b scoutfs: refine quorum voting
The current quorum voting implementation had some rough edges that
increased the complexity of the system and introduced undesirable
failure modes.  We can keep the same basic pattern but move
functionality around a few places, and rethink the quorum voting, to end
up with a meaningfully simpler system.

The motivation for this work was to remove the need to provide a
uniq_name option for every mount instance.

The first big change is to remove the idea of static configuration slots
for mounts.  This removes the use of uniq_name.  Mounts now simply have
a server_addr mount option instead of using their uniq_name to find
their address in the configuration.

The server can't check the configuration to see if a given connected
client's name is found in the quorum config.  Clients can set a flag in
their sent greeting which indicates that they're a voter.  This removes
the uniq_name from the greeting and mounted client records.

Without a static configuration mounts no longer have dedicated block
locations to write to.  We increase the size of the region of quorum
blocks and have voters simply write to a random block.  Overwriting vote
blocks is OK because we move from heartbeating design patterns to a
protocol strongly based on raft's election.  We're using quorum blocks
to communicate votes instead of network messages and overwriting blocks
is analogous to lossy networks dropping vote messages in the raft
election protocol.

We were using the dedicated per-mount quorum blocks to track mounts that
had been elected and needed to be fenced.  We no longer have that
storage so instead we add the idea of an election log that is stored in
every voting block.  Readers merge the logs from all the blocks they
read and write the resulting merged log in their block.
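
A minimal sketch of that merge, not the scoutfs code.  The entry layout
(term, elected id) and the max_entries bound are assumptions made for
illustration; the message only says that readers merge the logs they
read and write the result to their own block.

```python
from typing import Iterable, List, Tuple

# Hypothetical log entry: (election term, id of the elected mount).
LogEntry = Tuple[int, int]


def merge_election_logs(logs: Iterable[List[LogEntry]],
                        max_entries: int = 8) -> List[LogEntry]:
    """Union the entries from every block read, keeping the newest terms."""
    merged = set()
    for log in logs:
        merged.update(log)
    # Sort by term so the most recent elections come last, then keep only
    # as many entries as would fit in one block (max_entries is assumed).
    return sorted(merged)[-max_entries:]


block_a = [(1, 10), (2, 11)]
block_b = [(2, 11), (3, 12)]
merged = merge_election_logs([block_a, block_b])
print(merged)
```

Because the merge is a union, a voter whose block gets overwritten loses
nothing as long as some other block still carries the same entries.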

With no static quorum configuration we no longer have to worry about the
complexity of changing the slot configurations while they're in use.
The only persistent configuration is the number of votes a candidate
needs to be elected by a quorum.

It was a mistake to use quorum voting blocks to communicate state
between the server and the quorum voters.  We can easily move the
unmount_barrier, server address, and fencing state from the quorum
blocks into the super block.  The server no longer needs the quorum
election info struct to be able to later write its quorum block.  It
instead writes a few fields in the super.  There's only one place where
clients need to look to find out who they should connect to or if they
can finish unmount.

Signed-off-by: Zach Brown <zab@versity.com>
2019-08-20 15:52:13 -07:00
Zach Brown
36b0df336b scoutfs: add unmount barrier
Now that a mount's client is responsible for electing and starting a
server we need to be careful about coordinating unmount.  We can't
let unmounting clients leave the remaining mounted clients without
quorum.

The server carefully tracks who is mounted and who is unmounting while
it is processing farewell requests.  It only sends responses to voting
mounts while quorum remains or once all the voting clients are
trying to unmount.
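
One reading of that rule as a sketch.  The function name, its arguments,
and the exact comparison are assumptions; the commit only states the two
conditions under which a voter's farewell gets a response.

```python
def can_respond_to_voter_farewell(mounted_voters: int,
                                  pending_voter_farewells: int,
                                  quorum_count: int) -> bool:
    """Decide whether one more voting mount may be allowed to leave."""
    # Letting one voter go is fine while enough voters would remain.
    if mounted_voters - 1 >= quorum_count:
        return True
    # Otherwise hold the response until every remaining voter is unmounting.
    return pending_voter_farewells == mounted_voters


print(can_respond_to_voter_farewell(3, 1, 2))
print(can_respond_to_voter_farewell(2, 1, 2))
print(can_respond_to_voter_farewell(2, 2, 2))
```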

We use a field in the quorum blocks to communicate to the final set of
unmounting voters that their farewells have been processed and that they
can finish unmounting without trying to reestablish quorum.

The commit introduces and maintains the unmount_barrier field in the
quorum blocks.  It is passed to the server from the election, the
server sends it to the client and writes new versions, and the client
compares what it received with what it sees in quorum blocks.

The commit then has the clients send their unique name to the server
who stores it in persistent mounted client records and compares the
names to the quorum config when deciding which farewell requests
can be responded to.

Now that farewell response processing can block for a very long time it
is moved off into async work so that it doesn't prevent net connections
from being shut down and re-established.  This also makes it easier to
make global decisions based on the count of pending farewell requests.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
a546bd0aab scoutfs: check for newlines in msg.h wrappers
The message formatter adds a newline so callers don't have to.  But
sometimes they do and we get double newlines.  Add a build check that
the format string doesn't end in a newline so that we stop adding these.
And fix up all the current offenders.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
ec0fb5380a scoutfs: implement lock recovery
When a server crashes all the connected clients still have operational
locks and can be using them to protect IO.  As a new server starts up
its lock service needs to account for those outstanding locks before
granting new locks to clients.

This implements lock recovery by having the lock service recover locks
from clients as it starts up.

First the lock service stores records of connected clients in a btree
off the super block.  Records are added as the server receives their
greeting and are removed as the server receives their farewell.

Then the server checks for existing persistent records as it starts up.
If it finds any it enters recovery and waits for all the old clients to
reconnect before resuming normal processing.

We add lock recover request and response messages that are used to
communicate locks from the clients to the server.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
74366f0df1 scoutfs: make networking more reliable
The current networking code has loose reliability guarantees.  If a
connection between the client and server is broken then the client
reconnects as though it's an entirely new connection.  The client resends
requests but no responses are resent.  A client's requests could be
processed twice on the same server.  The server throws away disconnected
client state.

This was fine, sort of, for the simple requests we had implemented so
far.  It's not good enough for the locking service which would prefer to
let networking worry about reliable message delivery so it doesn't have
to track and replay partial state across reconnection between the same
client and server.

This adds the infrastructure to ensure that requests and responses
between a given client and server will be delivered across reconnected
sockets and will only be processed once.

The server keeps track of disconnected clients and restores state if the
same client reconnects.  This required some work around the greetings so
that clients and servers can recognize each other.  Now that the server
remembers disconnected clients we add a farewell request so that servers
can forget about clients that are shutting down and won't be
reconnecting.

Now that connections between the client and server are preserved we can
resend responses across reconnection.  We add outgoing message sequence
numbers which are used to drop duplicates and communicate the received
sequence back to the sender to free responses once they're received.
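
The sequence number handling can be sketched roughly as follows.  The
Peer class and its field names are hypothetical, and the sketch assumes
in-order delivery on each socket; it is not the scoutfs message format.

```python
from collections import deque


class Peer:
    """One end of a connection; all names here are illustrative."""

    def __init__(self):
        self.next_send_seq = 1        # seq given to the next outgoing message
        self.recv_seq = 0             # highest seq received from the other end
        self.resend_queue = deque()   # sent messages not yet known received

    def send(self, payload):
        # Every message carries our seq and the latest seq we have received.
        msg = {"seq": self.next_send_seq,
               "recv_seq": self.recv_seq,
               "payload": payload}
        self.next_send_seq += 1
        self.resend_queue.append(msg)
        return msg

    def receive(self, msg):
        # The other end reported how far it has received; free those messages.
        while self.resend_queue and self.resend_queue[0]["seq"] <= msg["recv_seq"]:
            self.resend_queue.popleft()
        # Drop duplicates that were resent across a reconnect.
        if msg["seq"] <= self.recv_seq:
            return None
        self.recv_seq = msg["seq"]
        return msg["payload"]


client, server = Peer(), Peer()
req = client.send("lock request")
first = server.receive(req)       # processed
second = server.receive(req)      # resent copy is dropped
resp = server.send("grant")       # carries recv_seq == 1
granted = client.receive(resp)    # also frees the request on the client
```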

When the client is reconnecting to a new server it resets its receive
state that was dependent on the old server and it drops responses which
were being sent to a server instance which no longer exists.

This stronger reliable messaging guarantee will make it much easier
to implement lock recovery which can now rewind state relative to
requests that are in flight and replay existing state on a new server
instance.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
675275fbf1 scoutfs: use hdr.fsid in greeting instead of id
The network greeting exchange was mistakenly using the global super
block magic number instead of the per-volume fsid to identify the
volumes that the endpoints are working with.  This prevented the check
from doing its only job: to fail when clients in one volume try to
connect to a server in another.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
288d781645 scoutfs: start and stop server with quorum
Currently all mounts try to get a dlm lock which gives them exclusive
access to become the server for the filesystem.  That isn't going to
work if we're moving to locking provided by the server.

This uses quorum election to determine who should run the server.  We
switch from long running server work blocked trying to get a lock to
calls which start and stop the server.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
08a140c8b0 scoutfs: use our locking service
Convert client locking to call the server's lock service instead of
using a fs/dlm lockspace.

The client code gets some shims to send and receive lock messages to and
from the server.  Callers use our lock mode constants instead of the
DLM's.

Locks are now identified by their starting key instead of an additional
scoped lock name so that we don't have more mapping structures to track.
The global rename lock uses keys that are defined by the format as only
used for locking.

The biggest change is in the client lock state machine.  Instead of
calling the dlm and getting callbacks we send messages to our server and
get called from incoming message processing.  We don't have everything
come through a per-lock work queue.  Instead we send requests either
from the blocking lock caller or from a shrink work queue.  Incoming
messages are called in the net layer's blocking work contexts so we
don't need to do any more work to defer to other contexts.

The different processing contexts lead to a slightly different lock
life cycle.  We refactor and separate allocation and freeing from
tracking and removing locks in data structures.  We add a _get and _put
to track active use of locks and then async references to locks by
holders and requests are tracked separately.

Our lock service's rules are a bit simpler in that we'll only ever send
one request at a time and the server will only ever send one request at
a time.  We do have to do a bit of work to make sure we process back to
back grant responses and invalidation requests from the server.

As of this change the lock setup and destruction paths are a little
wobbly.  They'll be shored up as we add lock recovery between the client
and server.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
2cc990406a scoutfs: compact using net requests
Currently compaction is only performed by one thread running in the
server.  Total metadata throughput of the system is limited by only
having one compaction operation in flight at a time.

This refactors the compaction code to have the server send compaction
requests to clients who then perform the compaction and send responses
to the server.  This spreads compaction load out amongst all the clients
and greatly increases total compaction throughput.

The manifest keeps track of compactions that are in flight at a given
level so that we maintain segment count invariants with multiple
compactions in flight.  It also uses the sparse bitmap to lock down
segments that are being used as inputs to avoid duplicating items across
two concurrent compactions.

A server thread still coordinates which segments are compacted.  The
search for a candidate compaction operation is largely unchanged.  It
now has to deal with being unable to process a compaction because its
segments are busy.  We add some logic to keep searching in a level until
we find a compaction that doesn't intersect with current compaction
requests.  If there are none at the level we move up to the next level.

The server will only issue a given number of compaction requests to a
client at a time.  When it needs to send a compaction request it rotates
through the current clients until it finds one that doesn't have the max
in flight.
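
A sketch of that rotation under stated assumptions: MAX_IN_FLIGHT, the
pick_client name, and the dict of per-client counts are made up for the
example and say nothing about how the server stores this state.

```python
from typing import Dict, List, Optional

MAX_IN_FLIGHT = 2  # hypothetical per-client limit


def pick_client(clients: List[int], in_flight: Dict[int, int],
                start: int) -> Optional[int]:
    """Return the index of the next client below the limit, or None."""
    for i in range(len(clients)):
        idx = (start + i) % len(clients)
        if in_flight.get(clients[idx], 0) < MAX_IN_FLIGHT:
            return idx
    # Every client already has the maximum number of requests in flight.
    return None


clients = [101, 102, 103]
in_flight = {101: 2, 102: 1, 103: 0}
idx = pick_client(clients, in_flight, start=0)
print(clients[idx])
```

The caller would start the next search at idx + 1 so that requests keep
rotating through the clients instead of piling onto the first idle one.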

If a client disconnects the server forgets the compactions it had sent
to that client.  If those compactions still need to be processed they'll
be sent to the next client.

The segnos that are allocated for compaction are not reclaimed if a
client disconnects or the server crashes.  This is a known deficiency
that will be addressed with the broader work to add crash recovery to
the multiple points in the protocol where the server and client trade
ownership of persistent state.

The server needs to block as it does work for compaction in the
notify_up and response callbacks.  We move them out from under spin
locks.

The server needs to clean up allocated segnos for a compaction request
that fails.  We let the client send a data payload along with an error
response so that it can give the server the id of the compaction that
failed.

Signed-off-by: Zach Brown <zab@versity.com>
2018-08-28 15:34:30 -07:00
Zach Brown
07eec357ee scoutfs: simplify reliable request delivery
It was a bit of an overreach to try and limit duplicate request
processing in the network layer.  It introduced acks and the necessity
to resync last_processed_id on reconnect.

In testing compaction requests we saw that request processing stopped if
a client reconnected to a new server.  The new server sent low request
ids which the client dropped because they were lower than the ids it got
from the last server.  To fix this we'd need to add smarts to reset
ids when connecting to new servers but not existing servers.

In thinking about this, though, there's a bigger problem.  Duplicate
request processing protection only works up in memory in the networking
connections.  If the server makes persistent changes, then crashes, the
client will resend the request to the new server.  It will need to
discover that the persistent changes have already been made.

So while we protected duplicate network request processing between nodes
that reconnected, we didn't protect duplicate persistent side-effects
of request processing when reconnecting to a new server.  Once you see
that the request implementations have to take this into account then
duplicate request delivery becomes a simpler instance of this same case
and will be taken care of already.  There's no need to implement the
complexity of protecting duplicate delivery between running nodes.

This removes the last_processed_id on the server.  It removes resending
of responses and acks.  Now that ids can be processed out of order we
remove the special known ID of greeting commands.  They can be processed
as usual.  When there's only request and response packets we can
differentiate them with a flag instead of a u8 message type.

Signed-off-by: Zach Brown <zab@versity.com>
2018-08-28 15:34:30 -07:00
Zach Brown
0adbd7e439 scoutfs: have server track connected clients
This extends the notify up and down calls to let the server keep track
of connected clients.

It adds the notion of per-connection info that is allocated for each
connection.  It's passed to the notification callbacks so that callers
can have per-client storage without having to manage allocations in the
callbacks.

It adds the node_id argument to the notification callbacks to indicate
if the call is for the listening socket itself or an accepted client
connection on that listening socket.

Signed-off-by: Zach Brown <zab@versity.com>
2018-08-28 15:34:30 -07:00
Zach Brown
8b3193ea72 scoutfs: server allocates node_id
Today node_ids are randomly assigned.  This adds the risk of failure
from random number generation and still allows for the risk of
collisions.

Switch to assigning strictly advancing node_ids on the server during the
initial connection greeting message exchange.  This simplifies the
system and allows us to derive information from the relative values of
node_ids in the system.

To do this we refactor the greeting code from internal to the net layer
to proper client and server request and response processing.  This lets
the server manage persistent node_id storage and allows the client to
wait for a node_id during mount.

Now that net_connect is sync in the client we don't need the notify_up
callback anymore.  The client can perform those duties when the connect
returns.

The net code still has to snoop on request and response processing to
see when the greetings have been exchanged and allow messages to flow.

Signed-off-by: Zach Brown <zab@versity.com>
2018-08-28 15:34:30 -07:00
Zach Brown
d708421cfb scoutfs: remove unused client and server code
The previous commit added shared networking code and disabled the old
unused code.  This removes all that unused client and server code that
was refactored to become the shared networking code.

Signed-off-by: Zach Brown <zab@versity.com>
2018-07-27 09:50:21 -07:00
Zach Brown
17dec65a52 scoutfs: add bidirectional network messages
The client and server networking code was a bit too rudimentary.

The existing code only had support for the client synchronously and
actively sending requests that the server could only passively respond
to.  We're going to need the server to be able to send requests to
connected clients and it can't block waiting for responses from each
one.

This refactors sending and receiving in both the client and server code
into shared networking code.  It's built around a connection struct that
then holds the message state.  Both peers on the connection can send
requests and send responses.

The existing code only retransmitted requests down newly established
connections.  Requests could be processed twice.

This adds robust reliability guarantees.  Requests are resent until
their response is received.  Requests are only processed once by a given
peer, regardless of the connection's transport socket.  Responses are
reliably resent until acknowledged.

This only adds the new refactored code and disables the old unused code
to keep the diff footprint minimal.  A following commit will remove all
the unused code.

Signed-off-by: Zach Brown <zab@versity.com>
2018-07-27 09:50:21 -07:00
Zach Brown
295bf6b73b scoutfs: return free extents to server
Freed file data extents are tracked in free extent items in each node.
They could only be re-used in the future for file data extent allocation
on that node.  Allocations on other nodes or, critically, segment
allocation on the server could never see those free extents.  With the
right allocation patterns, particularly allocating on node X and freeing
on node Y, all the free extents can build up on a node and starve other
allocations.

This adds a simple high water mark after which nodes start returning
free extents to the server.  From there they can satisfy segment
allocations or be sent to other nodes for file data extent allocation.
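
As an illustration only, a node could decide what to give back like
this.  Counting in blocks and returning the largest extents first are
assumptions, not details taken from the commit.

```python
from typing import List, Tuple

Extent = Tuple[int, int]  # (start block, length in blocks)


def extents_to_return(free_extents: List[Extent],
                      high_water_blocks: int) -> List[Extent]:
    """Pick extents to hand back until the node is at or below the mark."""
    total = sum(length for _, length in free_extents)
    returned = []
    # Largest first is an arbitrary choice made for this sketch.
    for ext in sorted(free_extents, key=lambda e: e[1], reverse=True):
        if total <= high_water_blocks:
            break
        returned.append(ext)
        total -= ext[1]
    return returned


free = [(0, 4), (100, 16), (200, 8)]
to_server = extents_to_return(free, high_water_blocks=10)
print(to_server)
```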

Signed-off-by: Zach Brown <zab@versity.com>
2018-07-05 16:19:31 -07:00
Zach Brown
e19716a0f2 scoutfs: clean up super block use
The code that works with the super block had drifted a bit.  We still
had two from an old design and we weren't doing anything with its crc.

Move to only using one super block at a fixed blkno and store and verify
its crc field by sharing code with the btree block checksumming.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 15:56:42 -07:00
Zach Brown
2efba47b77 scoutfs: satisfy large allocs with smaller extents
The previous fallocate and get_block allocators only looked for free
extents larger than the requested allocation size.  This prematurely
returns -ENOSPC if a very large allocation is attempted.  Some xfstests
stress low free space situations by fallocating almost all the free
space in the volume.

This adds an allocation helper function that finds the biggest free
extent to satisfy an allocation, possibly after trying to get more free
extents from the server.  It looks for previous extents in the index of
extents by length.  This builds on the previously added item and extent
_prev operations.
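
The fallback to a smaller extent might look like the sketch below.  A
sorted Python list stands in for the index of extents by length, and
alloc_extent is a hypothetical name; stepping back one entry plays the
role of the _prev operation.

```python
import bisect
from typing import List, Optional, Tuple


def alloc_extent(by_length: List[Tuple[int, int]],
                 want: int) -> Optional[Tuple[int, int]]:
    """by_length is a sorted list of (length, start) free extents.

    Returns (start, length actually allocated), or None if nothing is free.
    """
    if not by_length:
        return None
    # First free extent that is at least as long as the request, if any.
    i = bisect.bisect_left(by_length, (want, 0))
    if i == len(by_length):
        # Nothing is long enough: step back to the largest free extent.
        i -= 1
    length, start = by_length.pop(i)
    got = min(length, want)
    if length > got:
        # Put the unused tail back in the index.
        bisect.insort(by_length, (length - got, start + got))
    return start, got


free = [(4, 100), (8, 0)]
big = alloc_extent(free, 32)     # only 8 blocks are available in one extent
small = alloc_extent(free, 2)    # carved from the remaining 4 block extent
```

The caller has to look at the returned length, which is the point of the
following paragraph in the message.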

Allocators need to then know the size of the allocation they got instead
of assuming they got what they asked for.  The server can also return a
smaller extent so it needs to communicate the extent length, not just
its start.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
1b3645db8b scoutfs: remove dead server allocator code
Remove the bitmap segno allocator code that the server used to use to
manage allocations.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
c01a715852 scoutfs: use extents in the server allocator
Have the server use the extent core to maintain free extent items in the
allocation btree instead of the bitmap items.

We add a client request to allocate an extent of a given length.  The
existing segment alloc and free now work with a segment's worth of
blocks.

The server maintains counters in the super block of free blocks instead
of free segments.  We maintain an allocation cursor so that allocation
results tend to cycle through the device.  It's stored in the super so
that it is maintained across server instances.
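
A sketch of cursor-driven allocation, assuming a sorted list of free
(start, length) extents.  The function name and the choice to skip an
extent that straddles the cursor are simplifications for the example.

```python
import bisect
from typing import List, Optional, Tuple


def alloc_from_cursor(free: List[Tuple[int, int]], cursor: int,
                      length: int) -> Optional[Tuple[int, int]]:
    """free is a sorted list of (start, length) extents.

    Returns (allocated start, new cursor), or None if nothing fits.
    """
    if not free:
        return None
    # Begin at the first extent starting at or after the cursor and wrap.
    first = bisect.bisect_left(free, (cursor, 0))
    for k in range(len(free)):
        j = (first + k) % len(free)
        start, flen = free[j]
        if flen < length:
            continue
        if flen == length:
            free.pop(j)
        else:
            free[j] = (start + length, flen - length)
        return start, start + length
    return None


free = [(0, 8), (64, 8)]
a = alloc_from_cursor(free, cursor=10, length=4)
b = alloc_from_cursor(free, cursor=a[1], length=8)   # wraps to the front
```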

This doesn't remove unused dead code to keep the commit from getting too
noisy.  It'll be removed in a future commit.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
ac259c82a0 scoutfs: allow interrupting client sends
Waiting for replies to sent requests wasn't interruptible.  This was
preventing ctl-c from breaking out of mount when a server wasn't yet
around to accept connections.

The only complication was that the receive thread was accessing the
sender's struct outside of the lock.  An interrupted sender could remove
their struct while receive was processing it.  We rework recv processing
so that it only uses the sender struct under the lock.  This introduces
a cpu copy of the payload but they're small and relatively infrequent
control messages.

Signed-off-by: Zach Brown <zab@versity.com>
2018-04-13 15:49:14 -07:00
Zach Brown
9148f24aa2 scoutfs: use single small key struct
Variable length keys lead to having a key struct point to the buffer
that contains the key.  With dirents and xattrs now using small keys we
can convert everyone to using a single key struct and significantly
simplify the system.

We no longer have a separate generic key buf struct that points to
specific per-type key storage.  All items use the key struct and fill
out the appropriate fields.  All the code that paired a generic key buf
struct and a specific key type struct is collapsed down to a key struct.
There's no longer the difference between a key buf that shares a
read-only key, has its own precise allocation, or has a max size
allocation for incrementing and decrementing.

Each key user now has an init function that fills out its fields.  It
looks a lot like the old pattern but we no longer have separate key
storage that the buf points to.

A bunch of code now takes the address of static key storage instead of
managing allocated keys.  Conversely, swapping now uses the full keys
instead of pointers to the keys.

We don't need all the functions that worked on the generic key buf
struct because they had different lengths.  Copy, clone, length init,
memcpy, all of that goes away.

The item API had some functions that tested the length of keys and
values.  The key length tests vanish, and that gets rid of the _same()
call.  The _same_min() call only had one user who didn't also test for
the value length being too large.  Let's leave caller key constraints in
callers instead of trying to hide them on the other side of a bunch of
item calls.

We no longer have to track the number of key bytes when calculating if
an item population will fit in segments.  This removes the key length
from reservations, transactions, and segment writing.

The item cache key querying ioctls no longer have to deal with variable
length keys.  They simply specify the start key, the ioctls return the
number of keys copied instead of bytes, and the caller is responsible
for incrementing the next search key.

The segment no longer has to store the key length.  It stores the key
struct in the item header.

The fancy variable length key formatting and printing can be removed.
We have a single format for the universal key struct.  The SK_ wrappers
that bracketed calls to use preempt safe per cpu buffers can turn back
into their normal calls.

Manifest entries are now a fixed size.  We can simply split them between
btree keys and values and initialize them instead of allocating them.
This means that level 0 entries don't have their own format that sorts
by the seq.  They're sorted by the key like all the other levels.
Compaction needs to sweep all of them looking for the oldest and read
can stop sweeping once it can no longer overlap.  This makes rare
compaction more expensive and common reading less expensive, which is
the right tradeoff.

Signed-off-by: Zach Brown <zab@versity.com>
2018-04-04 09:15:27 -05:00
Zach Brown
4ff1e3020f scoutfs: allocate inode numbers per directory
Having an inode number allocation pool in the super block meant that all
allocations across the mount are interleaved.  This means that
concurrent file creation in different directories will create
overlapping inode numbers.  This leads to lock contention as reasonable
work loads will tend to distribute work by directories.

The easy fix is to have per-directory inode number allocation pools.  We
take the opportunity to clean up the network request so that the caller
gets the allocation instead of having it be fed back in via a weird
callback.

Signed-off-by: Zach Brown <zab@versity.com>
2018-02-09 17:58:19 -08:00
Zach Brown
cb879d9f37 scoutfs: add network greeting message
Add a network greeting message that's exchanged between the client and
server on every connection to make sure that we have the correct file
system and format hash.

Signed-off-by: Zach Brown <zab@versity.com>
2017-10-12 13:57:31 -07:00
Zach Brown
ca78757ca5 scoutfs: more careful client connect timeouts
The client connection loop was a bit of a mess.  It only slept between
retries in one particular case.  Other failures to connect would spin
and livelock.  It would spin forever.

This fixed loop now has a much more orderly reconnect procedure.  Each
connecting sender always tries once.  Then retry attempts back off
exponentially, settling at a nice long timeout.  After long enough it'll
return errors.
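
The retry schedule can be sketched as a generator.  The millisecond
values are placeholders chosen for the example, not the timeouts the
client uses.

```python
from typing import Iterator


def backoff_delays(initial_ms: int = 100, max_ms: int = 5000,
                   give_up_ms: int = 30000) -> Iterator[int]:
    """Yield the delay before each retry, stopping once we've waited enough."""
    delay = initial_ms
    waited = 0
    while waited < give_up_ms:
        yield delay
        waited += delay
        # Double the delay each time until it settles at the maximum.
        delay = min(delay * 2, max_ms)


delays = list(backoff_delays(initial_ms=100, max_ms=800, give_up_ms=3000))
print(delays)
```

A caller would make its first attempt immediately, sleep for each
yielded delay before trying again, and return an error once the
generator is exhausted.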

This fixes livelocks in the xfstests that mount and unmount around
dm-flakey config.  generic/{034,039,040} would easily livelock before
this fix.

Signed-off-by: Zach Brown <zab@versity.com>
2017-08-30 10:38:00 -07:00
Zach Brown
87ab27beb1 scoutfs: add statfs network message
The ->statfs method was still using the super_block in the super_info
that was read during mount.  This will get progressively more out
of date.

We add a network message to ask the server for the current fields that
impact statfs.  This is always racy and the fields are mostly nonsense,
but we try our best.

Signed-off-by: Zach Brown <zab@versity.com>
2017-08-11 10:43:35 -07:00
Zach Brown
c1b2ad9421 scoutfs: separate client and server net processing
The networking code was really suffering by trying to combine the client
and server processing paths into one file.  The code can be a lot
simpler by giving the client and server their own processing paths that
take their different socket lifecycles into account.

The client maintains a single connection.  Blocked senders work on the
socket under a sending mutex.  The recv path runs in work that can be
canceled after first shutting down the socket.

A long running server work function acquires the listener lock, manages
the listening socket, and accepts new sockets.  Each accepted socket has
a single recv work blocked waiting for requests.  That then spawns
concurrent processing work which sends replies under a sending mutex.
All of this is torn down by shutting down sockets and canceling work
which frees its context.

All this restructuring makes it a lot easier to track what is happening
in mount and unmount between the client and server.  This fixes bugs
where unmount was failing because the monolithic socket shutdown
function was queueing other work while it was running and draining.

Signed-off-by: Zach Brown <zab@versity.com>
2017-08-04 10:47:42 -07:00