Add some sysfs files which show quorum state. We store the state in
quorum_info off the super which is updated as we participate in
elections.
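
As a rough sketch (attribute and field names here are illustrative,
not the exact ones), each file is a read-only attribute that formats a
field out of quorum_info:

    struct quorum_info {
            bool is_leader;
            struct kobject kobj;
    };

    static ssize_t is_leader_show(struct kobject *kobj,
                                  struct kobj_attribute *attr, char *buf)
    {
            struct quorum_info *qinf = container_of(kobj,
                                                    struct quorum_info,
                                                    kobj);

            return snprintf(buf, PAGE_SIZE, "%u\n", !!qinf->is_leader);
    }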
Signed-off-by: Zach Brown <zab@versity.com>
Add some helpers to manage the lifetime of groups of attributes in
sysfs. We can wait until the sysfs files are no longer in use
before tearing down the data that they rely on.
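
A minimal sketch of the pattern, assuming a kobject paired with a
completion (the names are made up): the kobject release callback can't
run until all sysfs users are gone, so teardown blocks on it before
freeing anything the attributes use:

    struct scoutfs_sysfs_attrs {
            struct kobject kobj;
            struct completion comp;
    };

    static void ssa_release(struct kobject *kobj)
    {
            struct scoutfs_sysfs_attrs *ssa =
                    container_of(kobj, struct scoutfs_sysfs_attrs, kobj);

            complete(&ssa->comp);
    }

    /* after this returns no attribute show/store can be running */
    kobject_put(&ssa->kobj);
    wait_for_completion(&ssa->comp);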
Signed-off-by: Zach Brown <zab@versity.com>
Add an ioctl that can be used by userspace to restore a file to its
offline state. To do that it needs to set inode fields that are
otherwise not exposed and create an offline extent.
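
The argument struct might look something like this sketch; the struct
and field names are assumptions, not the final interface:

    struct scoutfs_ioctl_restore {          /* hypothetical name */
            __u64 data_version;     /* inode field not otherwise settable */
            __u64 offline_start;    /* first block of the offline extent */
            __u64 offline_count;    /* number of offline blocks */
    };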
Signed-off-by: Zach Brown <zab@versity.com>
Somewhere in the mists of time (around when we removed path tracking
which held refs to blocks?) walking blocks to migrate started leaking
btree block references. The walk was given a block pointer so it
returned the block it found, but the caller never dropped that ref.
The caller wasn't doing anything with the result of the walk, so we
just don't provide a block pointer and the walk drops the ref for us.
This stops the ref leak, which was effectively pinning the ring in
memory.
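
The change amounts to the following, sketched with an illustrative
walk signature:

    /* before: the walk handed back a referenced block we never put */
    ret = scoutfs_btree_walk(sb, key, &bl);

    /* after: pass NULL and the walk drops its own ref internally */
    ret = scoutfs_btree_walk(sb, key, NULL);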
Signed-off-by: Zach Brown <zab@versity.com>
To avoid overwriting live btree blocks we have to migrate them between
halves of the ring. Each time we cross into a new half of the ring we
start migration all over again.
The intent was to slowly migrate the blocks over time. We'd track dirty
blocks that came from the old and current halves and keep them in
balance. This would keep the overhead of the migration low and spread
it out through the half instead of concentrating it all at the start of
each half.
But the calculation of current blocks was completely wrong. It checked
the newly allocated block which is always in the current half. It never
thought it was dirtying old blocks so it'd constantly migrate trying to
find them. We'd effectively migrate every btree block during the first
transaction in each half.
This calculates if we're dirtying old or new blocks by the source of the
cow operation. We now recognize when we dirty old blocks and will stop
migrating once we've migrated at least as many old blocks as we've
written new blocks.
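
In sketch form (helper and counter names are assumptions), the
decision now classifies each dirtied block by where its cow source
lived and keeps the two counts in balance:

    /* account the cow by the half its source block came from */
    if (blkno_in_old_half(ring, src_blkno))
            ring->old_dirtied++;
    else
            ring->new_dirtied++;

    /* migrate until old blocks moved keep pace with new blocks written */
    while (ring->old_dirtied < ring->new_dirtied)
            migrate_one_block(ring);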
Signed-off-by: Zach Brown <zab@versity.com>
We were seeing ring btree corruption that manifested as the server seeing
stale btree blocks as it tried to read all the btrees to migrate blocks
during a write. A block it tried to read didn't match its reference.
It turned out that block wasn't being migrated. It would get stuck
at a position in the ring. Eventually new block writes would overwrite
it and then the next read would see corruption.
It wasn't being migrated because the block reading function didn't
realize that it had to migrate a dirty block. The block was written in
a transaction at the end of the ring. The ring wrapped during
the transaction and then migration tried to migrate the dirty block.
It wouldn't be redirtied, and thus migrated, because it was already
dirty in the transaction.
The fix is to add more cases to the dirtying decision which takes
migration specifically into account. We'll no longer short circuit
dirtying blocks for migration when they're in the old half of the ring
even though they're dirty.
Signed-off-by: Zach Brown <zab@versity.com>
One of the core features of scoutfs is the ability to transparently
migrate file contents to and from an archive tier. For this to be
transparent we need file system operations to trigger staging the file
contents back into the file system as needed.
This adds the infrastructure which operations use to wait for offline
extents to come online and which provides userspace with a list of
blocks that the operations are waiting for.
We add some waiting infrastructure that callers use to lock, check for
offline extents, and unlock and wait before checking again to see if
they're still offline. We add these checks and waiting to data io
operations that could encounter offline extents.
This has to be done carefully so that we don't wait while holding locks
that would prevent staging. We use per-task structures to discover when
we are the first user of a cluster lock on an inode, indicating that
it's safe for us to wait because we don't hold any locks.
And while we're waiting our operation is tracked and reported to
userspace through an ioctl. This is a non-blocking ioctl; it's up to
userspace to decide how often to check and how large a region to stage.
Waiters are woken up when the file contents could have changed, not
specifically when we know that the extent has come online. This lets us
wake waiters when their lock is revoked so that they can block waiting
to reacquire the lock and test the extents again. It lets us provide
coherent demand staging across the cluster without fine grained waiting
protocols sent between the nodes. It may result in some spurious wakeups
and work but hopefully it won't, and it's a very simple and functional
first pass.
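
A condensed sketch of the waiting pattern, with hypothetical helper
names:

    for (;;) {
            ret = scoutfs_lock_inode(sb, mode, inode, &lock);
            if (ret)
                    break;

            if (!scoutfs_data_is_offline(inode, iblock)) {
                    ret = do_the_io(inode, iblock); /* still locked */
                    scoutfs_unlock(sb, lock);
                    break;
            }

            /* track the waiter for the ioctl, drop locks, then sleep */
            scoutfs_data_wait_add(inode, iblock, &dw);
            scoutfs_unlock(sb, lock);
            ret = scoutfs_data_wait(&dw);   /* woken on any change */
            if (ret)
                    break;
    }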
Signed-off-by: Zach Brown <zab@versity.com>
This adds some minor functionality to the per_task API for use by the
upcoming offline waiting work.
Add scoutfs_per_task_add_excl() so that a caller can tell if their task
was already put on a per-task list by their caller.
Make scoutfs_per_task_del() return a bool to indicate if the entry was
found on a list and was in fact deleted, or not.
Add scoutfs_per_task_init_entry() for initializing entries that aren't
declared on the stack.
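
Sketched signatures for the three changes (the struct names are
guesses at the API, not verified):

    /* returns false if the task already has an entry on the list */
    bool scoutfs_per_task_add_excl(struct scoutfs_per_task *pt,
                                   struct scoutfs_per_task_entry *ent);

    /* now returns true if the entry was found and deleted */
    bool scoutfs_per_task_del(struct scoutfs_per_task *pt,
                              struct scoutfs_per_task_entry *ent);

    /* for entries embedded in structs, not declared on the stack */
    void scoutfs_per_task_init_entry(struct scoutfs_per_task_entry *ent);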
Signed-off-by: Zach Brown <zab@versity.com>
Since fill_super was originally written we've added use of buffer_head
IO by the btree and quorum voting. We forgot to set the block size so
devices that didn't have the common 4k default, matching our block size,
would see errors. Explicitly set it.
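
The fix is a one-liner in fill_super along these lines (the constant
name is illustrative):

    /* buffer_head IO in the btree and quorum code assumes our size */
    if (!sb_set_blocksize(sb, SCOUTFS_BLOCK_SIZE))
            return -EINVAL;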
Signed-off-by: Zach Brown <zab@versity.com>
It was a mistake to use a non-zero elected_nr as the indication that a
slot is considered actively elected. Zeroing it as the server shuts
down wipes the elected_nr and means that it doesn't advance as each
server is elected. This then causes a client connecting to a new server
to be confused for a client reconnecting to a server after the server
has timed it out and destroyed its state. This caused reconnection
after shutting down a server to fail and clients to loop reconnecting
indefinitely.
This instead adds flags to the quorum block and assigns a flag to
indicate that the slot should be considered active. It's cleared by
fencing and by the client as the server shuts down.
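
Sketched with an assumed flag name:

    /* set as the slot's server is elected, cleared by fencing or by
     * the client as its server shuts down cleanly */
    #define SCOUTFS_QUORUM_FLAG_SERVER (1 << 0)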
Signed-off-by: Zach Brown <zab@versity.com>
Now that a mount's client is responsible for electing and starting a
server we need to be careful about coordinating unmount. We can't
let unmounting clients leave the remaining mounted clients without
quorum.
The server carefully tracks who is mounted and who is unmounting while
it is processing farewell requests. It only sends responses to voting
mounts while quorum remains or once all the voting clients are trying
to unmount.
We use a field in the quorum blocks to communicate to the final set of
unmounting voters that their farewells have been processed and that they
can finish unmounting without trying to re-establish quorum.
The commit introduces and maintains the unmount_barrier field in the
quorum blocks. It is passed to the server from the election, the
server sends it to the client and writes new versions, and the client
compares what it received with what it sees in quorum blocks.
The commit then has the clients send their unique name to the server
who stores it in persistent mounted client records and compares the
names to the quorum config when deciding which farewell requests
can be responded to.
Now that farewell response processing can block for a very long time it
is moved off into async work so that it doesn't prevent net connections
from being shut down and re-established. This also makes it easier to
make global decisions based on the count of pending farewell requests.
Signed-off-by: Zach Brown <zab@versity.com>
We were relying on a cute (and probably broken) trick of defining
pointers to unaligned base types with __packed. Modern versions of gcc
warn about this.
Instead we either directly access unaligned types with get_ and
put_unaligned, or we copy unaligned data into aligned copies before
working with it.
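
Concretely, instead of dereferencing a __packed pointer the code now
does something like:

    #include <asm/unaligned.h>

    /* read and write a little-endian u64 at an unaligned offset */
    u64 seq = get_unaligned_le64(buf + off);
    put_unaligned_le64(seq + 1, buf + off);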
Signed-off-by: Zach Brown <zab@versity.com>
Currently the server tracks the outstanding transaction sequence numbers
that clients have open in a simple list in memory. It's not properly
cleaned up if a client unmounts, and a new server that takes over
after a crash won't know about open transaction sequence numbers.
This stores open transaction sequence numbers in a shared persistent
btree instead of in memory. It removes tracking for clients as they
send their farewell during unmount. A new server that starts up will
see existing entries for clients that were created by old servers.
This fixes a bug where a client who unmounts could leave behind a
pending sequence number that would never be cleaned up and would
indefinitely limit the visibility of index items that came after it.
Signed-off-by: Zach Brown <zab@versity.com>
The macro for producing trace args for an ipv4 address had a typo when
shifting the third octet down before masking.
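
The corrected shifts look like this sketch (the macro name is
illustrative; the bug used the wrong shift count for the third octet):

    #define SIN_OCTETS(a)                                   \
            (((a) >> 24) & 0xff), (((a) >> 16) & 0xff),     \
            (((a) >> 8) & 0xff), ((a) & 0xff)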
Signed-off-by: Zach Brown <zab@versity.com>
The server's listening address is fixed by the raft config in the super
block. If it shuts down and rapidly starts back up it needs to bind to
the currently lingering address.
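
The usual fix, sketched here against the kernel socket API, is to mark
the socket reusable before binding:

    /* allow binding while old sockets for the address linger */
    sock->sk->sk_reuse = 1;
    ret = kernel_bind(sock, (struct sockaddr *)&sin, sizeof(sin));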
Signed-off-by: Zach Brown <zab@versity.com>
Generate unique trace events on the send and recv side of each message
sent between nodes. This can be used to reasonably efficiently
synchronize the monotonic clock in trace events between nodes given only
their captured trace events.
Signed-off-by: Zach Brown <zab@versity.com>
The message formatter adds a newline so callers don't have to. But
sometimes they do and we get double newlines. Add a build check that
the format string doesn't end in a newline so that we stop adding these.
And fix up all the current offenders.
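
For literal format strings the check can be done at compile time,
something like this sketch (the macro name is made up):

    #define SCOUTFS_CHECK_FMT(fmt)                          \
            BUILD_BUG_ON(__builtin_constant_p(fmt) &&       \
                         (fmt)[sizeof(fmt) - 2] == '\n')

For a string literal sizeof(fmt) - 2 indexes the final character before
the terminating NUL, so a format ending in a newline breaks the build.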
Signed-off-by: Zach Brown <zab@versity.com>
When a server crashes all the connected clients still have operational
locks and can be using them to protect IO. As a new server starts up
its lock service needs to account for those outstanding locks before
granting new locks to clients.
This implements lock recovery by having the lock service recover locks
from clients as it starts up.
First the lock service stores records of connected clients in a btree
off the super block. Records are added as the server receives their
greeting and are removed as the server receives their farewell.
Then the server checks for existing persistent records as it starts up.
If it finds any it enters recovery and waits for all the old clients to
reconnect before resuming normal processing.
We add lock recover request and response messages that are used to
communicate locks from the clients to the server.
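
Sketched, the startup check looks something like this (the names are
illustrative):

    /* enter recovery if old clients' records survive in the btree */
    if (btree_has_items(sb, &super->mounted_clients))
            server->recovering = true;

    /* hold new lock grants until every old client has resent its locks */
    wait_event(server->waitq, all_clients_recovered(server));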
Signed-off-by: Zach Brown <zab@versity.com>
The current networking code has loose reliability guarantees. If a
connection between the client and server is broken then the client
reconnects as though it's an entirely new connection. The client resends
requests but no responses are resent. A client's requests could be
processed twice on the same server. The server throws away disconnected
client state.
This was fine, sort of, for the simple requests we had implemented so
far. It's not good enough for the locking service which would prefer to
let networking worry about reliable message delivery so it doesn't have
to track and replay partial state across reconnection between the same
client and server.
This adds the infrastructure to ensure that requests and responses
between a given client and server will be delivered across reconnected
sockets and will only be processed once.
The server keeps track of disconnected clients and restores state if the
same client reconnects. This required some work around the greetings so
that clients and servers can recognize each other. Now that the server
remembers disconnected clients we add a farewell request so that servers
can forget about clients that are shutting down and won't be
reconnecting.
Now that connections between the client and server are preserved we can
resend responses across reconnection. We add outgoing message sequence
numbers which are used to drop duplicates and communicate the received
sequence back to the sender to free responses once they're received.
When the client is reconnecting to a new server it resets its receive
state that was dependent on the old server and it drops responses which
were being sent to a server instance which no longer exists.
This stronger reliable messaging guarantee will make it much easier
to implement lock recovery which can now rewind state relative to
requests that are in flight and replay existing state on a new server
instance.
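
On the receive side the duplicate dropping reduces to a comparison
against the greatest sequence seen, sketched with assumed field names:

    /* drop messages we processed before the socket was rebuilt */
    if (le64_to_cpu(nh->seq) <= conn->greatest_recv_seq)
            return 0;       /* duplicate, already processed */

    conn->greatest_recv_seq = le64_to_cpu(nh->seq);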
Signed-off-by: Zach Brown <zab@versity.com>
The super block had a magic value that was used to identify that the
block should contain our data structure. But it was called an 'id'
which was confused with the header fsid in the past. Also, the btree
blocks aren't using a similar magic value at all.
This moves the magic value in to the header and creates values for the
super block and btree blocks. Both are written but the btree block
reads don't check the value.
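
The header might now look roughly like this; the fields around the
magic are assumptions for illustration:

    struct scoutfs_block_header {
            __le32 crc;
            __le32 magic;   /* SCOUTFS_BLOCK_MAGIC_SUPER or _BTREE */
            __le64 fsid;
            __le64 seq;
            __le64 blkno;
    };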
Signed-off-by: Zach Brown <zab@versity.com>
The network greeting exchange was mistakenly using the global super
block magic number instead of the per-volume fsid to identify the
volumes that the endpoints are working with. This prevented the check
from doing its only job: to fail when clients in one volume try to
connect to a server in another.
Signed-off-by: Zach Brown <zab@versity.com>
Currently all mounts try to get a dlm lock which gives them exclusive
access to become the server for the filesystem. That isn't going to
work if we're moving to locking provided by the server.
This uses quorum election to determine who should run the server. We
switch from long running server work blocked trying to get a lock to
calls which start and stop the server.
Signed-off-by: Zach Brown <zab@versity.com>
Convert client locking to call the server's lock service instead of
using a fs/dlm lockspace.
The client code gets some shims to send and receive lock messages to and
from the server. Callers use our lock mode constants instead of the
DLM's.
Locks are now identified by their starting key instead of an additional
scoped lock name so that we don't have more mapping structures to track.
The global rename lock uses keys that are defined by the format as only
used for locking.
The biggest change is in the client lock state machine. Instead of
calling the dlm and getting callbacks we send messages to our server and
get called from incoming message processing. We don't have everything
come through a per-lock work queue. Instead we send requests either
from the blocking lock caller or from a shrink work queue. Incoming
messages are called in the net layer's blocking work contexts so we
don't need to do any more work to defer to other contexts.
The different processing contexts leads to a slightly different lock
life cycle. We refactor and separate allocation and freeing from
tracking and removing locks in data structures. We add a _get and _put
to track active use of locks and then async references to locks by
holders and requests are tracked separately.
Our lock service's rules are a bit simpler in that we'll only ever send
one request at a time and the server will only ever send one request at
a time. We do have to do a bit of work to make sure we process back to
back grant responses and invalidation requests from the server.
As of this change the lock setup and destruction paths are a little
wobbly. They'll be shored up as we add lock recovery between the client
and server.
Signed-off-by: Zach Brown <zab@versity.com>
Add a specific lock method for locking the global rename lock instead of
having the caller specify it as a global lock. We're getting rid of the
notion of lock scopes and requiring all locks to be related to keys.
The rename lock will use magic keys at the end of the volume.
Signed-off-by: Zach Brown <zab@versity.com>
Add the core lock server code for providing a lock service from our
server. The lock messages are wired up but nothing calls them.
Signed-off-by: Zach Brown <zab@versity.com>
Today a response can only be sent down the connection that delivered
its request, and only while the request is being processed. We'll be
adding
subsystems that need to send responses asynchronously after initial
request processing. Give them a call to send a response to a node id
instead of to a node's connection.
Signed-off-by: Zach Brown <zab@versity.com>
Add a quorum election implementation. The mounts that can participate
in the election are specified in a quorum config array in the super
block. Each configured participant is assigned a preallocated block
that it can write to.
All mounts read the quorum blocks to find the member who was elected the
leader and should be running the server. The voting mounts loop reading
voting blocks and writing their vote block until someone is elected with
a majority.
Nothing calls this code yet, this adds the initial implementation and
format.
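
The heart of the election is just counting votes across the slots'
blocks, as in this sketch with illustrative names:

    /* count the slots whose current blocks vote for us */
    for (i = 0; i < nr_slots; i++) {
            if (blks[i].vote_for_slot == our_slot)
                    votes++;
    }

    /* a majority of the configured voters elects the leader */
    if (votes > nr_slots / 2)
            write_elected_block(our_slot);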
Signed-off-by: Zach Brown <zab@versity.com>
We had scattered some base types throughout the format file which made
them annoying to reference in higher level structs. Let's put them at
the top so we can use them without declarations or moving things around
in unrelated commits.
Signed-off-by: Zach Brown <zab@versity.com>
Reformat the scoutfs-y object list so that there's one object per line.
Diffs now clearly demonstrate what is changing instead of having word
wrapping constantly obscuring changes in the built objects.
(Did everyone spot the scoutfs_trace sorting mistake? Another reason
not to mash everything into wrapped lines :)).
Signed-off-by: Zach Brown <zab@versity.com>
Some upcoming network request processing paths need access to the
connected client's node_id. We could add it to the arguments but that'd
be a lot of churn so we'll add an accessor function for now.
Signed-off-by: Zach Brown <zab@versity.com>
Each mount is getting a specified unique name. This can be used to
identify a reconnecting mount, which indicates that an old instance of
the same unique name can no longer exist and doesn't need to be fenced.
Signed-off-by: Zach Brown <zab@versity.com>
There was a time when responding with an error wouldn't include the
caller's data payload. That hasn't been the case since we added
compaction network requests which include a reference to the compaction
operation with the error response.
Signed-off-by: Zach Brown <zab@versity.com>
The server forgot to initialize ret to 0 and might return
undefined errnos if a client asked it to free zero extents.
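
The shape of the bug, in sketch form (the function names are made up):

    static int free_extents(struct super_block *sb,
                            struct extent *ext, unsigned int nr)
    {
            int ret = 0;    /* was uninitialized: nr == 0 returned junk */
            unsigned int i;

            for (i = 0; i < nr; i++) {
                    ret = free_one_extent(sb, &ext[i]);
                    if (ret < 0)
                            break;
            }

            return ret;
    }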
Signed-off-by: Zach Brown <zab@versity.com>
Currently compaction is only performed by one thread running in the
server. Total metadata throughput of the system is limited by only
having one compaction operation in flight at a time.
This refactors the compaction code to have the server send compaction
requests to clients who then perform the compaction and send responses
to the server. This spreads compaction load out amongst all the clients
and greatly increases total compaction throughput.
The manifest keeps track of compactions that are in flight at a given
level so that we maintain segment count invariants with multiple
compactions in flight. It also uses the sparse bitmap to lock down
segments that are being used as inputs to avoid duplicating items across
two concurrent compactions.
A server thread still coordinates which segments are compacted. The
search for a candidate compaction operation is largely unchanged. It
now has to deal with being unable to process a compaction because its
segments are busy. We add some logic to keep searching in a level until
we find a compaction that doesn't intersect with current compaction
requests. If there are none at the level we move up to the next level.
The server will only issue a given number of compaction requests to a
client at a time. When it needs to send a compaction request it rotates
through the current clients until it finds one that doesn't have the max
in flight.
If a client disconnects the server forgets the compactions it had sent
to that client. If those compactions still need to be processed they'll
be sent to the next client.
The segnos that are allocated for compaction are not reclaimed if a
client disconnects or the server crashes. This is a known deficiency
that will be addressed with the broader work to add crash recovery to
the multiple points in the protocol where the server and client trade
ownership of persistent state.
The server needs to block as it does work for compaction in the
notify_up and response callbacks. We move them out from under spin
locks.
The server needs to clean up allocated segnos for a compaction request
that fails. We let the client send a data payload along with an error
response so that it can give the server the id of the compaction that
failed.
Signed-off-by: Zach Brown <zab@versity.com>
It was a bit of an overreach to try and limit duplicate request
processing in the network layer. It introduced acks and the necessity
to resync last_processed_id on reconnect.
In testing compaction requests we saw that request processing stopped if
a client reconnected to a new server. The new server sent low request
ids which the client dropped because they were lower than the ids it got
from the last server. To fix this we'd need to add smarts to reset
ids when connecting to new servers but not existing servers.
In thinking about this, though, there's a bigger problem. Duplicate
request processing protection only works up in memory in the networking
connections. If the server makes persistent changes, then crashes, the
client will resend the request to the new server. It will need to
discover that the persistent changes have already been made.
So while we protected duplicate network request processing between nodes
that reconnected, we didn't protect duplicate persistent side-effects
of request processing when reconnecting to a new server. Once you see
that the request implementations have to take this into account then
duplicate request delivery becomes a simpler instance of this same case
and will be taken care of already. There's no need to implement the
complexity of protecting duplicate delivery between running nodes.
This removes the last_processed_id on the server. It removes resending
of responses and acks. Now that ids can be processed out of order we
remove the special known ID of greeting commands. They can be processed
as usual. When there's only request and response packets we can
differentiate them with a flag instead of a u8 message type.
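
Sketched with assumed field names and widths:

    #define SCOUTFS_NET_FLAG_RESPONSE (1 << 0)

    struct scoutfs_net_header {
            __le64 id;
            __le16 data_len;
            __u8 cmd;
            __u8 flags;     /* replaces the request/response type byte */
    };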
Signed-off-by: Zach Brown <zab@versity.com>
We had gotten a bit sloppy with the workqueue flags. We needed _UNBOUND
in some workqueues where we wanted concurrency by scheduling across cpus
instead of waiting for the current (very long running) work on a cpu to
finish. We add NON_REENTRANT out of an abundance of caution. It has
gone away in modern kernels and is probably not needed here, but
according to the docs we would want it so we at least document that fact
by using it.
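
The resulting allocations look like this (the queue name is just an
example):

    wq = alloc_workqueue("scoutfs_net",
                         WQ_UNBOUND | WQ_NON_REENTRANT, 0);
    if (!wq)
            return -ENOMEM;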
Signed-off-by: Zach Brown <zab@versity.com>
Today response processing calls a request's response callback from
inside the net spinlock. This happened to work for the synchronous
blocking request handler who only had to record the result and wake
their waiter.
It doesn't work for server compact response processing which needs to
use IO to commit the result of the compaction.
This lifts the call to the response function out of complete_send() and
into the response processing work function. Other complete_send()
callers now won't trigger the response function call and can't see
errors, which they all ignored anyway.
Signed-off-by: Zach Brown <zab@versity.com>
Keys used to be variable length so the manifest struct on the wire ended
in key payloads. The keys are now fixed size so that field is no longer
necessary or used. It's an artifact that should have been removed when
the keys were made fixed length.
Signed-off-by: Zach Brown <zab@versity.com>
This extends the notify up and down calls to let the server keep track
of connected clients.
It adds the notion of per-connection info that is allocated for each
connection. It's passed to the notification callbacks so that callers
can have per-client storage without having to manage allocations in the
callbacks.
It adds the node_id argument to the notification callbacks to indicate
if the call is for the listening socket itself or an accepted client
connection on that listening socket.
Signed-off-by: Zach Brown <zab@versity.com>
The current sending interfaces only send a message to the peer of a
given connection. For the server to send to a specific connected client
it'd have to track connections itself and send to them.
This adds a sending interface that uses the node_id to send to a
specific connected client. The conn argument is the listening socket
and its accepted sockets are searched for the destination node_id.
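
A sketched signature for the new call (the argument order is a guess):

    int scoutfs_net_submit_request_node(struct super_block *sb,
                                        struct scoutfs_net_connection *conn,
                                        u64 node_id, u8 cmd,
                                        void *arg, u16 arg_len,
                                        scoutfs_net_response_t resp_func,
                                        void *resp_data);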
Signed-off-by: Zach Brown <zab@versity.com>