Commit Graph

688 Commits

Author SHA1 Message Date
Zach Brown
74366f0df1 scoutfs: make networking more reliable
The current networking code has loose reliability guarantees.  If a
connection between the client and server is broken then the client
reconnects as though it's an entirely new connection.  The client resends
requests but no responses are resent.  A client's requests could be
processed twice on the same server.  The server throws away disconnected
client state.

This was fine, sort of, for the simple requests we had implemented so
far.  It's not good enough for the locking service which would prefer to
let networking worry about reliable message delivery so it doesn't have
to track and replay partial state across reconnection between the same
client and server.

This adds the infrastructure to ensure that requests and responses
between a given client and server will be delivered across reconnected
sockets and will only be processed once.

The server keeps track of disconnected clients and restores state if the
same client reconnects.  This required some work around the greetings so
that clients and servers can recognize each other.  Now that the server
remembers disconnected clients we add a farewell request so that servers
can forget about clients that are shutting down and won't be
reconnecting.

Now that connections between the client and server are preserved we can
resend responses across reconnection.  We add outgoing message sequence
numbers which are used to drop duplicates and communicate the received
sequence back to the sender to free responses once they're received.

When the client is reconnecting to a new server it resets its receive
state that was dependent on the old server and it drops responses which
were being sent to a server instance which no longer exists.

This stronger reliable messaging guarantee will make it much easier
to implement lock recovery which can now rewind state relative to
requests that are in flight and replay existing state on a new server
instance.
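
A rough sketch of the sequence number idea (names here are illustrative,
not the actual scoutfs structures): each outgoing message is stamped with
an advancing sequence, duplicates at or below the last processed sequence
are dropped, and the acknowledged sequence lets the sender free saved
responses.

    #include <stdbool.h>
    #include <stdint.h>

    /* illustrative per-peer reliability state, not the real scoutfs structs */
    struct peer_seq_state {
            uint64_t next_send_seq; /* assigned to each outgoing message */
            uint64_t recv_seq;      /* highest sequence we have processed */
    };

    /* stamp an outgoing message with the next sequence number */
    static uint64_t assign_send_seq(struct peer_seq_state *st)
    {
            return st->next_send_seq++;
    }

    /*
     * Resent messages arrive after reconnection with sequence numbers we
     * have already processed; they're recognized and dropped here.
     */
    static bool should_process(struct peer_seq_state *st, uint64_t seq)
    {
            if (seq <= st->recv_seq)
                    return false;   /* duplicate, already processed */
            st->recv_seq = seq;
            return true;
    }

    /*
     * The receiver sends recv_seq back to the sender, which can then free
     * any saved response at or below the acknowledged sequence.
     */
    static bool can_free_response(uint64_t response_seq, uint64_t acked_seq)
    {
            return response_seq <= acked_seq;
    }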

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
20f4e1c338 scoutfs: put magic value in block header
The super block had a magic value that was used to identify that the
block should contain our data structure.  But it was called an 'id'
which was confused with the header fsid in the past.  Also, the btree
blocks aren't using a similar magic value at all.

This moves the magic value into the header and creates values for the
super block and btree blocks.  Both are written but the btree block
reads don't check the value.
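
Roughly, the change amounts to an on-disk layout like the following sketch
(the magic values and field names here are illustrative; the real
definitions live in format.h):

    #include <linux/types.h>

    /* example magic values only; the real constants live in format.h */
    #define BLOCK_MAGIC_SUPER       0x103c428bU
    #define BLOCK_MAGIC_BTREE       0xe597f96dU

    /* shared header at the start of every metadata block */
    struct example_block_header {
            __le32 crc;
            __le32 magic;   /* identifies the structure the block holds */
            __le64 fsid;
            __le64 seq;
            __le64 blkno;
    } __packed;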

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
675275fbf1 scoutfs: use hdr.fsid in greeting instead of id
The network greeting exchange was mistakenly using the global super
block magic number instead of the per-volume fsid to identify the
volumes that the endpoints are working with.  This prevented the check
from doing its only job: to fail when clients in one volume try to
connect to a server in another.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
288d781645 scoutfs: start and stop server with quorum
Currently all mounts try to get a dlm lock which gives them exclusive
access to become the server for the filesystem.  That isn't going to
work if we're moving to locking provided by the server.

This uses quorum election to determine who should run the server.  We
switch from long running server work blocked trying to get a lock to
calls which start and stop the server.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
08a140c8b0 scoutfs: use our locking service
Convert client locking to call the server's lock service instead of
using a fs/dlm lockspace.

The client code gets some shims to send and receive lock messages to and
from the server.  Callers use our lock mode constants instead of the
DLM's.

Locks are now identified by their starting key instead of an additional
scoped lock name so that we don't have more mapping structures to track.
The global rename lock uses keys that are defined by the format as only
used for locking.

The biggest change is in the client lock state machine.  Instead of
calling the dlm and getting callbacks we send messages to our server and
get called from incoming message processing.  We don't have everything
come through a per-lock work queue.  Instead we send requests either
from the blocking lock caller or from a shrink work queue.  Incoming
messages are called in the net layer's blocking work contexts so we
don't need to do any more work to defer to other contexts.

The different processing contexts leads to a slightly different lock
life cycle.  We refactor and separate allocation and freeing from
tracking and removing locks in data structures.  We add a _get and _put
to track active use of locks and then async references to locks by
holders and requests are tracked separately.
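
Roughly, the separation looks like the following sketch (names and
locking are simplified; the real code protects these counts with the
lock's spinlock):

    #include <linux/atomic.h>
    #include <linux/slab.h>

    /* simplified client lock life cycle, not the actual scoutfs structures */
    struct example_lock {
            atomic_t users;         /* active use by lookups and messages */
            unsigned int holders;   /* callers currently holding the mode */
            unsigned int requests;  /* outstanding requests referencing it */
    };

    static struct example_lock *lock_get(struct example_lock *lck)
    {
            atomic_inc(&lck->users);
            return lck;
    }

    static void lock_put(struct example_lock *lck)
    {
            /* a lock is only freed once nothing references or uses it */
            if (atomic_dec_and_test(&lck->users) &&
                lck->holders == 0 && lck->requests == 0)
                    kfree(lck);
    }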

Our lock service's rules are a bit simpler in that we'll only ever send
one request at a time and the server will only ever send one request at
a time.  We do have to do a bit of work to make sure we process back to
back grant responses and invalidation requests from the server.

As of this change the lock setup and destruction paths are a little
wobbly.  They'll be shored up as we add lock recovery between the client
and server.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
7c8383eddd scoutfs: add scoutfs_lock_rename()
Add a specific lock method for locking the global rename lock instead of
having the caller specify it as a global lock.  We're getting rid of the
notion of lock scopes and requiring all locks to be related to keys.
The rename lock will use magic keys at the end of the volume.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
34b8950bca scoutfs: initial lock server core
Add the core lock server code for providing a lock service from our
server.  The lock messages are wired up but nothing calls them.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
f472c0bc87 scoutfs: add scoutfs_net_response_node()
Today a response can only be sent down the connection that delivered the
request, and only while the request is being processed.  We'll be adding
subsystems that need to send responses asynchronously after initial
request processing.  Give them a call to send a response to a node id
instead of to a node's connection.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
c34dd452a7 scoutfs: add quorum voting
Add a quorum election implementation.  The mounts that can participate
in the election are specified in a quorum config array in the super
block.  Each configured participant is assigned a preallocated block
that it can write to.

All mounts read the quorum blocks to find the member who was elected the
leader and should be running the server.  The voting mounts loop reading
voting blocks and writing their vote block until someone is elected with
a majority.
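
In rough pseudocode the voting loop looks something like this (the helper
names, member limit, and block layout are illustrative, not the real
quorum format):

    /* illustrative voting loop; real code adds timeouts and random backoff */
    static int vote_until_elected(int our_slot, int nr_members)
    {
            int votes[MAX_QUORUM_MEMBERS];  /* votes[i]: member i's candidate */
            int counts[MAX_QUORUM_MEMBERS];
            int i;

            for (;;) {
                    read_all_quorum_blocks(votes, nr_members);  /* hypothetical */

                    memset(counts, 0, sizeof(counts));
                    for (i = 0; i < nr_members; i++)
                            counts[votes[i]]++;

                    /* a majority of configured members elects the leader */
                    for (i = 0; i < nr_members; i++) {
                            if (counts[i] > nr_members / 2)
                                    return i;   /* member i runs the server */
                    }

                    /* no majority yet: cast our vote and try again */
                    write_our_quorum_block(our_slot,
                                           pick_candidate(votes, nr_members));
                    sleep_for_vote_interval();
            }
    }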

Nothing calls this code yet, this adds the initial implementation and
format.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
d57b8232ee scoutfs: move base types in format.h
We had scattered some base types throughout the format file which made
them annoying to reference in higher level structs.  Let's put them at
the top so we can use them without declarations or moving things around
in unrelated commits.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
f75e1e1322 scoutfs: reformat Makefile to one object per line
Reformat the scoutfs-y object list so that there's one object per line.
Diffs now clearly demonstrate what is changing instead of having word
wrapping constantly obscuring changes in the built objects.

(Did everyone spot the scoutfs_trace sorting mistake?  Another reason
not to mash everything into wrapped lines :)).

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
6caa87458b scoutfs: add scoutfs_net_client_node_id()
Some upcoming network request processing paths need access to the
connected client's node_id.  We could add it to the arguments but that'd
be a lot of churn so we'll add an accessor function for now.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
e9f6e79d67 scoutfs: add uniq_name mount option
Each mount is given a unique name specified as a mount option.  When a
mount reconnects with the same unique name it indicates that the old
instance can no longer exist and doesn't need to be fenced.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
8fedfef1cc scoutfs: remove stale net response data comment
There was a time when responding with an error wouldn't include the
caller's data payload.  That hasn't been the case since we added
compaction network requests which include a reference to the compaction
operation with the error response.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
91d190622d scoutfs: remove scoutfs.md file
The current plan is to maintain a nice paper describing the system in
the scoutfs-utils repository.

Signed-off-by: Zach Brown <zab@versity.com>
2018-09-25 13:02:11 -07:00
Brandon Philips
9bb0c60c63 README: add whitepaper link
The white paper is helpful and not linked from the Github README, which will be a primary landing spot for folks discovering the project.
2018-09-19 11:03:11 -07:00
Zach Brown
f8d1489415 scoutfs: add README.md
Add a README.md for github.

Signed-off-by: Zach Brown <zab@versity.com>
2018-09-14 15:18:27 -07:00
Zach Brown
5616175041 scoutfs: update rpm building infrastructure
Update the makefile and spec to our current method of building rpms.

Signed-off-by: Zach Brown <zab@versity.com>
2018-09-14 15:07:10 -07:00
Zach Brown
7e9d40d65a scoutfs: init ret when freeing zero extents
The server forgot to initialize ret to 0 and might return
undefined errnos if a client asked it to free zero extents.

Signed-off-by: Zach Brown <zab@versity.com>
2018-09-12 15:37:45 -07:00
Zach Brown
2cc990406a scoutfs: compact using net requests
Currently compaction is only performed by one thread running in the
server.  Total metadata throughput of the system is limited by only
having one compaction operation in flight at a time.

This refactors the compaction code to have the server send compaction
requests to clients who then perform the compaction and send responses
to the server.  This spreads compaction load out amongst all the clients
and greatly increases total compaction throughput.

The manifest keeps track of compactions that are in flight at a given
level so that we maintain segment count invariants with multiple
compactions in flight.  It also uses the sparse bitmap to lock down
segments that are being used as inputs to avoid duplicating items across
two concurrent compactions.

A server thread still coordinates which segments are compacted.  The
search for a candidate compaction operation is largely unchanged.  It
now has to deal with being unable to process a compaction because its
segments are busy.  We add some logic to keep searching in a level until
we find a compaction that doesn't intersect with current compaction
requests.  If there are none at the level we move up to the next level.

The server will only issue a given number of compaction requests to a
client at a time.  When it needs to send a compaction request it rotates
through the current clients until it finds one that doesn't have the max
in flight.
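
A sketch of that selection (the structure, list handling, and limit are
illustrative, not the real server code):

    #include <linux/list.h>
    #include <linux/types.h>

    #define MAX_CLIENT_COMPACTS     2       /* illustrative per-client limit */

    struct compact_client {
            struct list_head head;          /* entry on the server's client list */
            u64 node_id;
            unsigned int nr_in_flight;      /* compaction requests outstanding */
    };

    /* rotate through connected clients, skipping any at the in-flight limit */
    static struct compact_client *next_compact_client(struct list_head *clients)
    {
            struct compact_client *client;

            list_for_each_entry(client, clients, head) {
                    if (client->nr_in_flight < MAX_CLIENT_COMPACTS) {
                            /* rotate so the next search starts after this one */
                            list_move_tail(&client->head, clients);
                            client->nr_in_flight++;
                            return client;
                    }
            }

            return NULL;    /* every client is at its limit */
    }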

If a client disconnects the server forgets the compactions it had sent
to that client.  If those compactions still need to be processed they'll
be sent to the next client.

The segnos that are allocated for compaction are not reclaimed if a
client disconnects or the server crashes.  This is a known deficiency
that will be addressed with the broader work to add crash recovery to
the multiple points in the protocol where the server and client trade
ownership of persistent state.

The server needs to block as it does work for compaction in the
notify_up and response callbacks.  We move them out from under spin
locks.

The server needs to clean up allocated segnos for a compaction request
that fails.  We let the client send a data payload along with an error
response so that it can give the server the id of the compaction that
failed.

Signed-off-by: Zach Brown <zab@versity.com>
2018-08-28 15:34:30 -07:00
Zach Brown
07eec357ee scoutfs: simplify reliable request delivery
It was a bit of an overreach to try and limit duplicate request
processing in the network layer.  It introduced acks and the necessity
to resync last_processed_id on reconnect.

In testing compaction requests we saw that request processing stopped if
a client reconnected to a new server.  The new server sent low request
ids which the client dropped because they were lower than the ids it got
from the last server.  To fix this we'd need to add smarts to reset
ids when connecting to new servers but not existing servers.

In thinking about this, though, there's a bigger problem.  Duplicate
request processing protection only works up in memory in the networking
connections.  If the server makes persistent changes, then crashes, the
client will resend the request to the new server.  It will need to
discover that the persistent changes have already been made.

So while we protected duplicate network request processing between nodes
that reconnected, we didn't protect duplicate persistent side-effects
of request processing when reconnecting to a new server.  Once you see
that the request implementations have to take this into account then
duplicate request delivery becomes a simpler instance of this same case
and will be taken care of already.  There's no need to implement the
complexity of protecting duplicate delivery between running nodes.

This removes the last_processed_id on the server.  It removes resending
of responses and acks.  Now that ids can be processed out of order we
remove the special known ID of greeting commands.  They can be processed
as usual.  When there are only request and response packets we can
differentiate them with a flag instead of a u8 message type.
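
The resulting header is roughly the following (fields and flag value are
illustrative; the point is that a single flag bit now distinguishes
requests from responses):

    #include <linux/types.h>

    #define NET_FLAG_RESPONSE       (1 << 0)        /* illustrative flag bit */

    struct example_net_header {
            __le64 id;      /* matches a response to its request */
            __le16 data_len;
            __u8 cmd;
            __u8 flags;
    } __packed;

    static inline bool hdr_is_response(struct example_net_header *nh)
    {
            return (nh->flags & NET_FLAG_RESPONSE) != 0;
    }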

Signed-off-by: Zach Brown <zab@versity.com>
2018-08-28 15:34:30 -07:00
Zach Brown
62d6c11e3c scoutfs: clean up workqueue flags
We had gotten a bit sloppy with the workqueue flags.  We needed _UNBOUND
in some workqueues where we wanted concurrency by scheduling across cpus
instead of waiting for the current (very long running) work on a cpu to
finish.  We add NON_REENTRANT out of an abundance of caution.  It has
gone away in modern kernels and is probably not needed here, but
according to the docs we would want it so we at least document that fact
by using it.
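
For example, a workqueue created with those flags would look roughly like
this (the name and error handling are illustrative, and WQ_NON_REENTRANT
only exists on older kernels):

    #include <linux/workqueue.h>
    #include <linux/errno.h>

    static struct workqueue_struct *example_wq;

    static int example_workqueue_init(void)
    {
            /*
             * WQ_UNBOUND runs works concurrently across cpus instead of
             * queueing behind a long running work on one cpu.
             * WQ_NON_REENTRANT documents that a work item shouldn't run on
             * two cpus at once; it became a no-op and was later removed in
             * modern kernels.
             */
            example_wq = alloc_workqueue("scoutfs_example",
                                         WQ_UNBOUND | WQ_NON_REENTRANT, 0);
            if (!example_wq)
                    return -ENOMEM;
            return 0;
    }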

Signed-off-by: Zach Brown <zab@versity.com>
2018-08-28 15:34:30 -07:00
Zach Brown
30d5471e4a scoutfs: call net response func outside lock
Today response processing calls a request's response callback from
inside the net spinlock.  This happened to work for the synchronous
blocking request handler who only had to record the result and wake
their waiter.

It doesn't work for server compact response processing which needs to
use IO to commit the result of the compaction.

This lifts the call to the response function out of complete_send() and
into the response processing work function.  Other complete_send()
callers now won't trigger the response function call and can't see
errors, which they all ignored anyway.

Signed-off-by: Zach Brown <zab@versity.com>
2018-08-28 15:34:30 -07:00
Zach Brown
00adbd31be scoutfs: add sparse bitmap library
Add a quick library for maintaining a very large bitmap with sparse
allocation.
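
A rough sketch of the approach (chunk size and helper names are
illustrative): the bitmap is a tree of small fixed-size chunks that are
only allocated once a bit in their range is set.

    #include <linux/rbtree.h>
    #include <linux/slab.h>
    #include <linux/bitops.h>
    #include <linux/errno.h>

    #define CHUNK_BITS      512     /* illustrative chunk size */

    /* one allocated chunk of an otherwise empty, very large bitmap */
    struct sbm_chunk {
            struct rb_node node;    /* indexed by start in an rbtree */
            u64 start;              /* first bit covered, CHUNK_BITS aligned */
            unsigned long bits[CHUNK_BITS / BITS_PER_LONG];
    };

    /* set a bit, allocating its chunk on first use */
    static int sbm_set(struct rb_root *root, u64 bit)
    {
            u64 start = bit - (bit % CHUNK_BITS);
            struct sbm_chunk *chunk;

            chunk = sbm_find_chunk(root, start);    /* hypothetical lookup */
            if (!chunk) {
                    chunk = kzalloc(sizeof(*chunk), GFP_NOFS);
                    if (!chunk)
                            return -ENOMEM;
                    chunk->start = start;
                    sbm_insert_chunk(root, chunk);  /* hypothetical insert */
            }

            set_bit(bit - chunk->start, chunk->bits);
            return 0;
    }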

Signed-off-by: Zach Brown <zab@versity.com>
2018-08-28 15:34:30 -07:00
Zach Brown
1ed0c6017f scoutfs: remove unused keys manifest field
Keys used to be variable length so the manifest struct on the wire ended
in key payloads.  The keys are now fixed size so that field is no longer
necessary or used.  It's an artifact that should have been removed when
the keys were made fixed length.

Signed-off-by: Zach Brown <zab@versity.com>
2018-08-28 15:34:30 -07:00
Zach Brown
0adbd7e439 scoutfs: have server track connected clients
This extends the notify up and down calls to let the server keep track
of connected clients.

It adds the notion of per-connection info that is allocated for each
connection.  It's passed to the notification callbacks so that callers
can have per-client storage without having to manage allocations in the
callbacks.

It adds the node_id argument to the notification callbacks to indicate
if the call is for the listening socket itself or an accepted client
connection on that listening socket.

Signed-off-by: Zach Brown <zab@versity.com>
2018-08-28 15:34:30 -07:00
Zach Brown
746293987c scoutfs: let server send msg to specific node_id
The current sending interfaces only send a message to the peer of a
given connection.  For the server to send to a specific connected client
it'd have to track connections itself and send to them.

This adds a sending interface that uses the node_id to send to a
specific connected client.  The conn argument is the listening socket
and its accepted sockets are searched for the destination node_id.

Signed-off-by: Zach Brown <zab@versity.com>
2018-08-28 15:34:30 -07:00
Zach Brown
8b3193ea72 scoutfs: server allocates node_id
Today node_ids are randomly assigned.  This adds the risk of failure
from random number generation and still allows for the risk of
collisions.

Switch to assigning strictly advancing node_ids on the server during the
initial connection greeting message exchange.  This simplifies the
system and allows us to derive information from the relative values of
node_ids in the system.
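
Conceptually the assignment is as simple as the following sketch (the
struct and field names are illustrative; the real code persists the new
value with a server commit before replying to the greeting):

    /* sketch: hand out strictly advancing node_ids during the greeting */
    static u64 alloc_node_id(struct example_server *server)
    {
            u64 node_id;

            spin_lock(&server->lock);
            node_id = le64_to_cpu(server->super.next_node_id);
            server->super.next_node_id = cpu_to_le64(node_id + 1);
            spin_unlock(&server->lock);

            return node_id;
    }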

To do this we refactor the greeting code from internal to the net layer
to proper client and server request and response processing.  This lets
the server manage persistent node_id storage and allows the client to
wait for a node_id during mount.

Now that net_connect is sync in the client we don't need the notify_up
callback anymore.  The client can perform those duties when the connect
returns.

The net code still has to snoop on request and response processing to
see when the greetings have been exchanged and allow messages to flow.

Signed-off-by: Zach Brown <zab@versity.com>
2018-08-28 15:34:30 -07:00
Zach Brown
f06b39cd7e scoutfs: destroy items after locks
We were destroying the item subsystem before shutting down locking.
This is wrong because locking shutdown invalidates items covered by the
locks.  It can walk into freed memory and crash or corrupt other memory.

The fix is to tear down the item subsystem after tearing down locks.

Signed-off-by: Zach Brown <zab@versity.com>
2018-08-24 15:40:53 -07:00
Zach Brown
ed9f4b6a22 scoutfs: calculate and enforce segment csum
We had fields in the segment header for the crc but weren't using them.
This calculates the crc on write and verifies it on read.  The crc
covers the used bytes in the segment as indicated by the total_bytes
field.
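
Schematically the calculation looks like this (struct and field names are
illustrative; total_bytes gives the used length and the crc field itself
is skipped):

    #include <linux/crc32c.h>

    static u32 example_segment_crc(struct example_segment_header *hdr)
    {
            u32 len = le32_to_cpu(hdr->total_bytes) - sizeof(hdr->crc);

            /* checksum the used bytes of the segment, skipping the crc field */
            return crc32c(~0, (char *)hdr + sizeof(hdr->crc), len);
    }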

Signed-off-by: Zach Brown <zab@versity.com>
2018-08-21 13:28:36 -07:00
Zach Brown
a25b6324d2 scoutfs: maintain free_blocks in one place
The free_blocks counter in the super is meant to track the number of
total blocks in the primary free extent index.  Callers of extent
manipulation were trying to keep it in sync with the extents.

Segment allocation was allocating extents manually using a cursor.  It
forgot to update free_blocks.  Segment freeing then freed the segment as
an extent which did update free_blocks.  This inflated the free_blocks
count over time, eventually pushing it past the total block count and
causing df to report negative usage.

This updates the free_blocks count in server extent io which is the only
place we update the extent items themselves.  This ensures that we'll
keep the count in sync with the extent items.  Callers don't have to
worry about it.
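
In effect the accounting collapses to one helper along these lines (names
are illustrative), called only where free extent items are inserted or
removed:

    /* adjust the super's free_blocks count alongside the extent items */
    static void account_free_extent(struct example_super *super,
                                    u64 blocks, bool inserting)
    {
            if (inserting)
                    le64_add_cpu(&super->free_blocks, blocks);
            else
                    le64_add_cpu(&super->free_blocks, -blocks);
    }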

Signed-off-by: Zach Brown <zab@versity.com>
2018-08-21 13:25:05 -07:00
Zach Brown
a72b7a9001 scoutfs: convert locks seq to trivial seq
Signed-off-by: Zach Brown <zab@versity.com>
2018-07-27 09:50:21 -07:00
Zach Brown
07df8816e3 scoutfs: add trivial seq file for net messages
Signed-off-by: Zach Brown <zab@versity.com>
2018-07-27 09:50:21 -07:00
Zach Brown
bafa4a6720 scoutfs: add net header printk args
We have macros for creating and printing trace arguments for our network
header struct.  Add a macro for making simple printk call args for
normal formatted output callers.

Signed-off-by: Zach Brown <zab@versity.com>
2018-07-27 09:50:21 -07:00
Zach Brown
8ff3ef3131 scoutfs: add trivial seq file for net connections
Signed-off-by: Zach Brown <zab@versity.com>
2018-07-27 09:50:21 -07:00
Zach Brown
c4cb5c0651 scoutfs: add trivial seq file wrapper
Add a seq file wrapper which lets callers track objects easily.
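
A minimal sketch of what such a wrapper builds on (the object and its
fields are hypothetical; the kernel's single_open() seq_file helpers do
the heavy lifting):

    #include <linux/seq_file.h>
    #include <linux/fs.h>

    struct example_obj {
            u64 id;
            unsigned int count;
    };

    /* the caller only supplies a show function for one tracked object */
    static int example_show(struct seq_file *m, void *unused)
    {
            struct example_obj *obj = m->private;

            seq_printf(m, "id %llu count %u\n", obj->id, obj->count);
            return 0;
    }

    static int example_open(struct inode *inode, struct file *file)
    {
            return single_open(file, example_show, inode->i_private);
    }

    static const struct file_operations example_fops = {
            .open           = example_open,
            .read           = seq_read,
            .llseek         = seq_lseek,
            .release        = single_release,
    };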

Signed-off-by: Zach Brown <zab@versity.com>
2018-07-27 09:50:21 -07:00
Zach Brown
d708421cfb scoutfs: remove unused client and server code
The previous commit added shared networking code and disabled the old
unused code.  This removes all that unused client and server code that
was refactored to become the shared networking code.

Signed-off-by: Zach Brown <zab@versity.com>
2018-07-27 09:50:21 -07:00
Zach Brown
17dec65a52 scoutfs: add bidirectional network messages
The client and server networking code was a bit too rudimentary.

The existing code only had support for the client synchronously and
actively sending requests that the server could only passively respond
to.  We're going to need the server to be able to send requests to
connected clients and it can't block waiting for responses from each
one.

This refactors sending and receiving in both the client and server code
into shared networking code.  It's built around a connection struct that
then holds the message state.  Both peers on the connection can send
requests and send responses.

The existing code only retransmitted requests down newly established
connections.  Requests could be processed twice.

This adds robust reliability guarantees.  Requests are resent until
their response is received.  Requests are only processed once by a given
peer, regardless of the connection's transport socket.  Responses are
reliably resent until acknowledged.

This only adds the new refactored code and disables the old unused code
to keep the diff footprint minimal.  A following commit will remove all
the unused code.

Signed-off-by: Zach Brown <zab@versity.com>
2018-07-27 09:50:21 -07:00
Zach Brown
295bf6b73b scoutfs: return free extents to server
Freed file data extents are tracked in free extent items in each node.
They could only be re-used in the future for file data extent allocation
on that node.  Allocations on other nodes or, critically, segment
allocation on the server could never see those free extents.  With the
right allocation patterns, particularly allocating on node X and freeing
on node Y, all the free extents can build up on a node and starve other
allocations.

This adds a simple high water mark after which nodes start returning
free extents to the server.  From there they can satisfy segment
allocations or be sent to other nodes for file data extent allocation.
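
The policy is roughly the following (the threshold, struct, and helper
are hypothetical):

    #define FREE_BLOCKS_HIGH_WATER  (64 * 1024)     /* illustrative threshold */

    /* once a node's free extents exceed the mark, return some to the server */
    static int maybe_return_free_extents(struct example_node_info *nfi)
    {
            int ret = 0;

            while (nfi->node_free_blocks > FREE_BLOCKS_HIGH_WATER && ret == 0)
                    ret = return_extent_to_server(nfi);     /* hypothetical */

            return ret;
    }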

Signed-off-by: Zach Brown <zab@versity.com>
2018-07-05 16:19:31 -07:00
Zach Brown
784cda9bee scoutfs: more carefully set lock bast mode
Locks get a bast call from the dlm when a remote node is blocked waiting
for the mode of a lock to change.  We'd set the mode that we need to
convert to and kick off lock work to make forward progress.

The bast calls can happen at any old time.  If a call came in as we were
unlocking a lock we'd set its bast mode even though it was being
unlocked and would not need to be down converted.

Usually this bad mode would be fine because the lock was idle and would
just be freed after being unlocked.

But if someone was actively waiting for the lock it would get stuck in
an unlocked state.  The bad bast mode would prevent it from being
upconverted, but the waiters would stop it from being freed.

We fix this by only setting the mode from the bast call if there is
really work to do.  This avoids setting the bast for unlocked locks
which will let the lock state machine re-acquire them and make forward
progress on behalf of the waiters.
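
The shape of the fix is roughly the following (the modes, locking, and
helpers are illustrative):

    /* only record a bast mode when there's really a conversion to perform */
    static void example_bast(struct example_lock *lck, int blocked_mode)
    {
            bool kick = false;

            spin_lock(&lck->spinlock);
            if (lck->granted_mode != MODE_UNLOCKED &&
                modes_conflict(lck->granted_mode, blocked_mode)) {
                    lck->bast_mode = blocked_mode;
                    kick = true;
            }
            spin_unlock(&lck->spinlock);

            if (kick)
                    queue_lock_work(lck);   /* hypothetical: kick the state machine */
    }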

Signed-off-by: Zach Brown <zab@versity.com>
2018-07-02 14:16:50 -07:00
Zach Brown
e19716a0f2 scoutfs: clean up super block use
The code that works with the super block had drifted a bit.  We still
had two super blocks from an old design and we weren't doing anything
with the crc.

Move to only using one super block at a fixed blkno and store and verify
its crc field by sharing code with the btree block checksumming.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 15:56:42 -07:00
Zach Brown
5d9ad0923a scoutfs: trace net structs
The userspace trace event printing code has trouble with arguments that
refer to fields in entries.  Add macros to make entries for all the
fields and use them as the formatted arguments.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
53e8ab0f7b scoutfs: trace extent struct
The userspace trace event printing code has trouble with arguments that
refer to fields in entries.  Add macros to make entries for all the
fields and use them as the formatted arguments.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
dfac36a9aa scoutfs: trace key struct
The userspace trace event printing code has trouble with arguments that
refer to fields in entries.  Add macros to make entries for all the
fields and use them as the formatted arguments.

We also remove the mapping of zone and type to strings.  It's smaller to
print the values directly and gets rid of some silly code.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
5935a3f43e scoutfs: remove unused trace events
These trace events were all orphaned long ago by commits which removed
their callers but forgot to remove their definitions.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
fddc3a7a75 scoutfs: minimize commit writeback latencies
Our simple transaction machinery causes high commit latencies if we let
too much dirty file data accumulate.

Small files have a natural limit on the amount of dirty data because
they have more dirty items per dirty page.  They fill up the single
segment sooner and kick off a commit which finds a relatively small
amount of dirty file data.

But large files can reference quite a lot of dirty data with a small
amount of extent items which don't fill up the transaction's segment.
During large streaming writes we can fill up memory with dirty file data
before filling a segment with mapping extent metadata.  This can lead to
high commit latencies when memory is full of dirty file pages.

Regularly kicking off background writeback behind streaming write
positions reduces the amount of dirty data that commits will find and
have to write out.
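
A sketch of the idea (the window size and trigger are illustrative; the
real code tunes when and how much to write back):

    #include <linux/fs.h>

    #define WRITEBACK_BEHIND_BYTES  (16 * 1024 * 1024)      /* illustrative */

    /* kick off writeback well behind a streaming write position */
    static void example_kick_writeback(struct address_space *mapping, loff_t pos)
    {
            if (pos > WRITEBACK_BEHIND_BYTES)
                    filemap_fdatawrite_range(mapping, 0,
                                             pos - WRITEBACK_BEHIND_BYTES);
    }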

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
59170f41b1 scoutfs: revive item deletion path
The inode deletion path had bit rotted.  Delete the ifdefs that were
stopping it from deleting all the items associated with an inode.  There
can be a lot of xattr and data mapping items so we have them manage
their own transactions (data already did).  The xattr deletion code was
trying to get a lock while the caller already held it so delete that.
Then we accurately account for the small number of remaining items that
finally delete the inode.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
0c7ea66f57 scoutfs: add SIC_EXACT
Add an item count call that lets the caller give the exact item count
instead of basing it on the operation they're performing.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
002daf3c1c scoutfs: return -ENOSPC to client alloc segno
The server send_reply interface is confusing.  It uses errors to shut
down the connection.  Clients need to get -ENOSPC through the message
reply payload instead.

The segno allocation server processing needs to set the segno to 0 so
that the client gets it and translates that into -ENOSPC.
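
On the client side that translation amounts to something like this (the
request helper and structs are hypothetical):

    /* sketch: a zero segno in the server's reply means no space */
    static int example_alloc_segno(struct example_client *client, u64 *segno)
    {
            __le64 lesegno;
            int ret;

            ret = send_alloc_segno_request(client, &lesegno);   /* hypothetical */
            if (ret)
                    return ret;

            *segno = le64_to_cpu(lesegno);
            if (*segno == 0)
                    return -ENOSPC; /* the server had no free segments */

            return 0;
    }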

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
876414065b scoutfs: warn if we try IO outside the device
We've had bugs in allocators that return success and crazy block
numbers.  The bad block numbers eventually make their way down to the
context-free kernel warning that IO was attempted outside the device.
This at least gives us a stack trace to help find where it's coming
from.
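
The check boils down to something like this (function name illustrative;
get_capacity() returns the device size in 512 byte sectors):

    #include <linux/genhd.h>
    #include <linux/bug.h>

    /* warn with a stack trace before submitting IO past the end of the device */
    static bool example_io_in_device(struct block_device *bdev, sector_t sector,
                                     unsigned int nr_sectors)
    {
            return !WARN_ON_ONCE(sector + nr_sectors >
                                 get_capacity(bdev->bd_disk));
    }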

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00