Use the client's rid in networking instead of the node_id.
The node_id no longer has to be allocated by the server and sent in the
greeting. Instead the client sends it to the server in its greeting.
The server then uses the client's announced rid just like it used to use
its node_id. It's used to record clients in the btree and to identify
clients in send and receive processing.
The use of the rid in networking calls makes its way to locking and
compaction, which now use the rid to identify clients instead of the
node_id.
Signed-off-by: Zach Brown <zab@versity.com>
The current quorum voting implementation had some rough edges that
increased the complexity of the system and introduced undesirable
failure modes. We can keep the same basic pattern but move
functionality between a few places and rethink the quorum voting to end
up with a meaningfully simpler system.
The motivation for this work was to remove the need to provide a
uniq_name option for every mount instance.
The first big change is to remove the idea of static configuration slots
for mounts. This removes the use of uniq_name. Mounts now simply have
a server_addr mount option instead of using their uniq_name to find
their address in the configuration.
The server can no longer check the configuration to see if a given
connected client's name is found in the quorum config. Instead, clients
set a flag in their sent greeting which indicates that they're a voter.
This removes the uniq_name from the greeting and from mounted client
records.
Without a static configuration, mounts no longer have dedicated block
locations to write to. We increase the size of the region of quorum
blocks and have voters simply write to a random block. Overwriting vote
blocks is OK because we move from heartbeating design patterns to a
protocol strongly based on raft's election. We're using quorum blocks
to communicate votes instead of network messages, and overwriting blocks
is analogous to lossy networks dropping vote messages in the raft
election protocol.
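The vote-by-overwrite scheme above can be sketched as a small userspace simulation. All names here (quorum_vote, cast_vote, count_votes, QUORUM_NR_BLOCKS) are illustrative assumptions, not the actual scoutfs structures:

```c
/* Sketch of voting via a shared region of quorum blocks.  A voter
 * overwrites a random block; lost overwrites are analogous to raft's
 * dropped vote messages and only delay, never corrupt, the election. */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define QUORUM_NR_BLOCKS 128

struct quorum_vote {
	uint64_t term;		/* raft-style election term */
	uint64_t vote_for_rid;	/* candidate this voter supports */
};

/* a voter casts its vote by overwriting a random block in the region */
static void cast_vote(struct quorum_vote *blocks, uint64_t term, uint64_t rid)
{
	blocks[rand() % QUORUM_NR_BLOCKS] = (struct quorum_vote){ term, rid };
}

/* a candidate reads every block and counts votes for itself in its term */
static int count_votes(const struct quorum_vote *blocks, uint64_t term,
		       uint64_t rid)
{
	int votes = 0;

	for (int i = 0; i < QUORUM_NR_BLOCKS; i++)
		if (blocks[i].term == term && blocks[i].vote_for_rid == rid)
			votes++;
	return votes;
}
```

A candidate would be elected once count_votes() reaches the configured number of votes needed; collisions where two voters overwrite the same block just look like a dropped vote.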
We were using the dedicated per-mount quorum blocks to track mounts that
had been elected and needed to be fenced. We no longer have that
storage so instead we add the idea of an election log that is stored in
every voting block. Readers merge the logs from all the blocks they
read and write the resulting merged log in their block.
With no static quorum configuration we no longer have to worry about the
complexity of changing the slot configurations while they're in use.
The only persistent configuration is the number of votes a candidate
needs to be elected by a quorum.
It was a mistake to use quorum voting blocks to communicate state
between the server and the quorum voters. We can easily move the
unmount_barrier, server address, and fencing state from the quorum
blocks into the super block. The server no longer needs the quorum
election info struct to be able to later write its quorum block. It
instead writes a few fields in the super. There's only one place where
clients need to look to find out who they should connect to or if they
can finish unmount.
Signed-off-by: Zach Brown <zab@versity.com>
Add a server_addr mount option that takes an ipv4 address. This will be
used by the upcoming changes to quorum voting to indicate that a mount
should participate in voting and to specify the address that its server
should listen on.
Signed-off-by: Zach Brown <zab@versity.com>
The pattern of advancing and writing a "dirty super" comes from the time
when the format had two persistent super blocks. One was kept in memory
and modified as changes were made. Advancing it changed which of the
two supers would be eventually written.
This no longer makes sense now that we only have one super block.
Remove the idea of advancing and writing an implicit dirty super block
that's stored in the super block info. Instead use a single
scoutfs_write_super() which takes the super block struct to write.
We still store and increment the hdr.gen in the super block. It used to
tell which of the two super blocks was more recent; now it's just some
information that can tell us something about the life of the super
block.
Signed-off-by: Zach Brown <zab@versity.com>
Add some quick functions that let us convert between our persistent
packed inet addr struct and native sockaddr_in structs.
Signed-off-by: Zach Brown <zab@versity.com>
Use the mount's generated random id in persistent items and the lock
that protects them instead of the assigned node_id.
Signed-off-by: Zach Brown <zab@versity.com>
Add the mount rid to traces which included the fsid by converting them
to use the super block message format and args.
Signed-off-by: Zach Brown <zab@versity.com>
Change the console message output to show the fsid:rid mount identity
instead of the block device name and device major and minor numbers.
Signed-off-by: Zach Brown <zab@versity.com>
Add macros which provide printk format and args for a little string
which identifies a specific mount. This will be used in kernel logs and
trace messages.
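The macros could look something like the following sketch; the macro names, field widths, and the mount_ids struct are illustrative assumptions, not the actual scoutfs definitions:

```c
/* Sketch of format/arg macros that identify a mount by fsid and rid,
 * used as: printk("scoutfs " SCSB_FMT ": ...", SCSB_ARGS(ids), ...) */
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct mount_ids {
	uint64_t fsid;
	uint64_t rid;
};

#define SCSB_FMT "%016" PRIx64 ".%016" PRIx64
#define SCSB_ARGS(ids) (ids)->fsid, (ids)->rid
```

Keeping the format and args in paired macros means every log and trace site prints the same fsid:rid identity string.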
Signed-off-by: Zach Brown <zab@versity.com>
Calculate a random id which identifies the life of a particular mount.
This will be visible in messages and tracing and will replace the
server-assigned node_id in persistent structures and protocols.
Signed-off-by: Zach Brown <zab@versity.com>
Our hidden attributes are hidden so that they don't leak out of
the system when archiving tools transfer xattrs from listxattr along
with the file. They're not intended to be secret; in fact, users want
to see their contents just as they want to see other fs metadata that
describes the system but that they can't update.
Make our listxattr ioctl only return hidden xattrs and allow anyone to
see the results if they can read the file. Rename it to more
accurately describe its intended use.
Signed-off-by: Zach Brown <zab@versity.com>
Order the ioctl struct field definitions and add padding so that
runtimes with different word sizes don't add different padding.
Userspace is spared having to deal with packing and we don't
have to worry about compat translation in the kernel.
We had two persistent structures that crossed the ioctl, a key and a
timespec, so we explicitly translate to and from their persistent types
in the ioctl.
Signed-off-by: Zach Brown <zab@versity.com>
Add a .indx. xattr tag which adds the inode to an index of inodes keyed
by the hash of xattr names. An ioctl is added which then returns all
the inodes which may contain an xattr of the given name. Dropping all
of an inode's xattrs now has to parse each name to find out whether it
also has to delete an index item.
Signed-off-by: Zach Brown <zab@versity.com>
Add a quick header which calculates 64bit hashes by calculating the crc
of the two halves of a byte region.
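A runnable sketch of the two-half construction; a plain bitwise crc32 stands in here for the kernel's crc implementation, and the function names are illustrative:

```c
/* Sketch of building a 64-bit hash from the crc of each half of a
 * byte region. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* simple bitwise crc32 (standard reflected 0xEDB88320 polynomial) */
static uint32_t crc32_sw(const uint8_t *buf, size_t len)
{
	uint32_t crc = ~0u;

	while (len--) {
		crc ^= *buf++;
		for (int i = 0; i < 8; i++)
			crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1));
	}
	return ~crc;
}

/* crc each half of the region and pack the two results into 64 bits */
static uint64_t hash64(const void *data, size_t len)
{
	const uint8_t *bytes = data;
	size_t half = len / 2;

	return ((uint64_t)crc32_sw(bytes, half) << 32) |
	       crc32_sw(bytes + half, len - half);
}
```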
Signed-off-by: Zach Brown <zab@versity.com>
Add an ioctl which can be used to iterate over the keys for all the
xattrs on an inode. It is privileged, can see hidden inodes, and has an
iteration cursor so that it can make its way through very large numbers
of xattrs.
Signed-off-by: Zach Brown <zab@versity.com>
Add a scoutfs. xattr prefix which is followed by a series of tags that
can change the behaviour of the xattr. We start with .hide., which
stops the xattr from showing up in listxattr.
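Parsing the prefix and tag could look like this sketch; the prefix and tag strings follow the commit text, while the function name and flag values are illustrative assumptions:

```c
/* Sketch of recognizing the scoutfs. prefix and a following tag in an
 * xattr name. */
#include <assert.h>
#include <string.h>

#define TAG_HIDE 0x1	/* xattr doesn't show up in listxattr */

/* returns a mask of recognized tags after the "scoutfs." prefix, or 0
 * if the name isn't in our prefix or has no known tag */
static int parse_xattr_tags(const char *name)
{
	static const char prefix[] = "scoutfs.";
	int flags = 0;

	if (strncmp(name, prefix, sizeof(prefix) - 1) != 0)
		return 0;
	name += sizeof(prefix) - 1;

	if (strncmp(name, "hide.", 5) == 0)
		flags |= TAG_HIDE;

	return flags;
}
```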
Signed-off-by: Zach Brown <zab@versity.com>
An elected leader writes a quorum block showing that it's elected before
it assumes exclusive access to the device and starts bringing up the
server. This lets another later elected leader find and fence it if
something happens.
Other mounts were trying to connect to the server once this elected
quorum block was written and before the server was listening. They'd
get connection refused, decide to elect a new leader, and try to fence
the server that's still running.
Now, they should have tried much harder to connect to the elected leader
instead of taking a single failed attempt as fatal. But that's a
problem for another day that involves more work in balancing timeouts
and retries.
But mounts shouldn't try to connect to the server until it's
listening. That's easy to signal by adding a simple listening flag to
the quorum block. Now mounts will only try to connect once they see the
listening flag and don't see these racy refused connections.
Signed-off-by: Zach Brown <zab@versity.com>
We have trace points as blocks are read, but the reads are cached as
buffer heads. The iteration helpers reference cached blocks a few
times in each voting cycle, so we end up tracing cached read blocks
multiple times. This uses a bit on the buffer_head to only trace a
cached block the first time it's read.
Signed-off-by: Zach Brown <zab@versity.com>
Add some sysfs files which show quorum state. We store the state in
quorum_info off the super, which is updated as we participate in
elections.
Signed-off-by: Zach Brown <zab@versity.com>
Add some helpers to manage the lifetime of groups of attributes in
sysfs. We can wait until the sysfs files are no longer in use
before tearing down the data that they rely on.
Signed-off-by: Zach Brown <zab@versity.com>
Add an ioctl that can be used by userspace to restore a file to its
offline state. To do that it needs to set inode fields that are
otherwise not exposed and create an offline extent.
Signed-off-by: Zach Brown <zab@versity.com>
Somewhere in the mists of time (around when we removed path tracking
which held refs to blocks?) walking blocks to migrate started leaking
btree block references. The caller provided a pointer so the walk gave
it the block it found, but the caller was never dropping that ref.
The caller wasn't doing anything with the result of the walk, so we
just don't provide a block pointer and the walk drops the ref for us.
This stops the ref leak, which was effectively pinning the ring in
memory.
Signed-off-by: Zach Brown <zab@versity.com>
To avoid overwriting live btree blocks we have to migrate them between
halves of the ring. Each time we cross into a new half of the ring we
start migration all over again.
The intent was to slowly migrate the blocks over time. We'd track
dirty blocks that came from the old and current halves and keep them in
balance. This would keep the overhead of migration low and spread it
out through the half instead of concentrating it all at the start of
the half.
But the calculation of current blocks was completely wrong. It checked
the newly allocated block which is always in the current half. It never
thought it was dirtying old blocks so it'd constantly migrate trying to
find them. We'd effectively migrate every btree block during the first
transaction in each half.
This calculates if we're dirtying old or new blocks by the source of the
cow operation. We now recognize when we dirty old blocks and will stop
migrating once we've migrated at least as many old blocks as we've
written new blocks.
Signed-off-by: Zach Brown <zab@versity.com>
We were seeing ring btree corruption that manifested as the server seeing
stale btree blocks as it tried to read all the btrees to migrate blocks
during a write. A block it tried to read didn't match its reference.
It turned out that block wasn't being migrated. It would get stuck
at a position in the ring. Eventually new block writes would overwrite
it and then the next read would see corruption.
It wasn't being migrated because the block reading function didn't
realize that it had to migrate a dirty block. The block was written in
a transaction at the end of the ring. The ring wrapped during
the transaction and then migration tried to migrate the dirty block.
It wouldn't be dirtied, and thus migrated, because it was already
dirty in the transaction.
The fix is to add more cases to the dirtying decision that take
migration specifically into account. We no longer short circuit
dirtying blocks for migration when they're in the old half of the ring,
even when they're already dirty.
Signed-off-by: Zach Brown <zab@versity.com>
One of the core features of scoutfs is the ability to transparently
migrate file contents to and from an archive tier. For this to be
transparent we need file system operations to trigger staging the file
contents back into the file system as needed.
This adds the infrastructure which operations use to wait for offline
extents to come online and which provides userspace with a list of
blocks that the operations are waiting for.
We add some waiting infrastructure that callers use to lock, check for
offline extents, and unlock and wait before checking again to see if
they're still offline. We add these checks and waiting to data io
operations that could encounter offline extents.
This has to be done carefully so that we don't wait while holding locks
that would prevent staging. We use per-task structures to discover when
we are the first user of a cluster lock on an inode, indicating that
it's safe for us to wait because we don't hold any locks.
And while we're waiting our operation is tracked and reported to
userspace through an ioctl. This is a non-blocking ioctl, it's up to
userspace to decide how often to check and how large a region to stage.
Waiters are woken up when the file contents could have changed, not
specifically when we know that the extent has come online. This lets us
wake waiters when their lock is revoked so that they can block waiting
to reacquire the lock and test the extents again. It lets us provide
coherent demand staging across the cluster without fine grained waiting
protocols sent between the nodes. It may result in some spurious wakeups
and work but hopefully it won't, and it's a very simple and functional
first pass.
Signed-off-by: Zach Brown <zab@versity.com>
This adds some minor functionality to the per_task API for use by the
upcoming offline waiting work.
Add scoutfs_per_task_add_excl() so that a caller can tell if their task
was already put on a per-task list by their caller.
Make scoutfs_per_task_del() return a bool to indicate if the entry was
found on a list and was in fact deleted, or not.
Add scoutfs_per_task_init_entry() for initializing entries that aren't
declared on the stack.
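The three additions could be sketched as follows in userspace, with a plain linked list keyed by a task pointer standing in for the kernel implementation; the struct layouts are illustrative even though the helper names mirror the commit:

```c
/* Sketch of the per_task API additions described above. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct per_task_entry {
	struct per_task_entry *next;
	void *task;	/* stands in for struct task_struct * */
};

struct per_task_list {
	struct per_task_entry *head;
};

/* initialize an entry that wasn't declared on the stack */
static void per_task_init_entry(struct per_task_entry *ent)
{
	ent->next = NULL;
	ent->task = NULL;
}

/* add only if the task isn't already on the list; returns true if the
 * entry was added, false if a caller already added one for this task */
static bool per_task_add_excl(struct per_task_list *list,
			      struct per_task_entry *ent, void *task)
{
	for (struct per_task_entry *i = list->head; i; i = i->next)
		if (i->task == task)
			return false;

	ent->task = task;
	ent->next = list->head;
	list->head = ent;
	return true;
}

/* returns true if the entry was found on the list and deleted */
static bool per_task_del(struct per_task_list *list,
			 struct per_task_entry *ent)
{
	for (struct per_task_entry **p = &list->head; *p; p = &(*p)->next) {
		if (*p == ent) {
			*p = ent->next;
			return true;
		}
	}
	return false;
}
```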
Signed-off-by: Zach Brown <zab@versity.com>
Since fill_super was originally written we've added use of buffer_head
IO by the btree and quorum voting. We forgot to set the block size so
devices that didn't have the common 4k default, matching our block size,
would see errors. Explicitly set it.
Signed-off-by: Zach Brown <zab@versity.com>
It was a mistake to use a non-zero elected_nr as the indication that a
slot is considered actively elected. Zeroing it as the server shuts
down wipes the elected_nr and means that it doesn't advance as each
server is elected. This then causes a client connecting to a new server
to be confused for a client reconnecting to a server after the server
has timed it out and destroyed its state. This caused reconnection
after shutting down a server to fail and clients to loop reconnecting
indefinitely.
This instead adds flags to the quorum block and assigns a flag to
indicate that the slot should be considered active. It's cleared by
fencing and by the client as the server shuts down.
Signed-off-by: Zach Brown <zab@versity.com>
Now that a mount's client is responsible for electing and starting a
server we need to be careful about coordinating unmount. We can't
let unmounting clients leave the remaining mounted clients without
quorum.
The server carefully tracks who is mounted and who is unmounting while
it is processing farewell requests. It only sends responses to voting
mounts while quorum remains or once all the voting clients are trying
to unmount.
We use a field in the quorum blocks to communicate to the final set of
unmounting voters that their farewells have been processed and that they
can finish unmounting without trying to re-establish quorum.
The commit introduces and maintains the unmount_barrier field in the
quorum blocks. It is passed to the server from the election, the
server sends it to the client and writes new versions, and the client
compares what it received with what it sees in quorum blocks.
The commit then has the clients send their unique name to the server,
which stores it in persistent mounted client records and compares the
names to the quorum config when deciding which farewell requests
can be responded to.
Now that farewell response processing can block for a very long time it
is moved off into async work so that it doesn't prevent net connections
from being shut down and re-established. This also makes it easier to
make global decisions based on the count of pending farewell requests.
Signed-off-by: Zach Brown <zab@versity.com>
We were relying on a cute (and probably broken) trick of defining
pointers to unaligned base types with __packed. Modern versions of gcc
warn about this.
Instead we either directly access unaligned types with get_ and
put_unaligned, or we copy unaligned data into aligned copies before
working with it.
Signed-off-by: Zach Brown <zab@versity.com>
Currently the server tracks the outstanding transaction sequence numbers
that clients have open in a simple list in memory. The list isn't
properly cleaned up if a client unmounts, and a new server that takes
over after a crash won't know about open transaction sequence numbers.
This stores open transaction sequence numbers in a shared persistent
btree instead of in memory. It removes tracking for clients as they
send their farewell during unmount. A new server that starts up will
see existing entries for clients that were created by old servers.
This fixes a bug where a client who unmounts could leave behind a
pending sequence number that would never be cleaned up and would
indefinitely limit the visibility of index items that came after it.
Signed-off-by: Zach Brown <zab@versity.com>
The macro for producing trace args for an ipv4 address had a typo when
shifting the third octet down before masking.
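A corrected version of such a macro could look like this sketch, where each octet is shifted down before masking; the macro name is an illustrative assumption:

```c
/* Sketch of a trace arg macro that expands an ipv4 address into four
 * octets for a "%u.%u.%u.%u" format; every octet must be shifted down
 * before the 0xff mask is applied. */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define IPV4_OCTETS(a)                  \
	(unsigned)(((a) >> 24) & 0xff), \
	(unsigned)(((a) >> 16) & 0xff), \
	(unsigned)(((a) >> 8) & 0xff),  \
	(unsigned)((a) & 0xff)
```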
Signed-off-by: Zach Brown <zab@versity.com>
The server's listening address is fixed by the raft config in the super
block. If it shuts down and rapidly starts back up, it needs to bind
to the still-lingering address.
Signed-off-by: Zach Brown <zab@versity.com>
Generate unique trace events on the send and recv side of each message
sent between nodes. This can be used to reasonably efficiently
synchronize the monotonic clock in trace events between nodes given only
their captured trace events.
Signed-off-by: Zach Brown <zab@versity.com>
The message formatter adds a newline so callers don't have to. But
sometimes they do and we get double newlines. Add a build check that
the format string doesn't end in a newline so that we stop adding these.
And fix up all the current offenders.
Signed-off-by: Zach Brown <zab@versity.com>
When a server crashes all the connected clients still have operational
locks and can be using them to protect IO. As a new server starts up
its lock service needs to account for those outstanding locks before
granting new locks to clients.
This implements lock recovery by having the lock service recover locks
from clients as it starts up.
First the lock service stores records of connected clients in a btree
off the super block. Records are added as the server receives their
greeting and are removed as the server receives their farewell.
Then the server checks for existing persistent records as it starts up.
If it finds any it enters recovery and waits for all the old clients to
reconnect before resuming normal processing.
We add lock recovery request and response messages that are used to
communicate locks from the clients to the server.
Signed-off-by: Zach Brown <zab@versity.com>
The current networking code has loose reliability guarantees. If a
connection between the client and server is broken then the client
reconnects as though it's an entirely new connection. The client resends
requests but no responses are resent. A client's requests could be
processed twice on the same server. The server throws away disconnected
client state.
This was fine, sort of, for the simple requests we had implemented so
far. It's not good enough for the locking service which would prefer to
let networking worry about reliable message delivery so it doesn't have
to track and replay partial state across reconnection between the same
client and server.
This adds the infrastructure to ensure that requests and responses
between a given client and server will be delivered across reconnected
sockets and will only be processed once.
The server keeps track of disconnected clients and restores state if the
same client reconnects. This required some work around the greetings so
that clients and servers can recognize each other. Now that the server
remembers disconnected clients we add a farewell request so that servers
can forget about clients that are shutting down and won't be
reconnecting.
Now that connections between the client and server are preserved we can
resend responses across reconnection. We add outgoing message sequence
numbers which are used to drop duplicates and communicate the received
sequence back to the sender to free responses once they're received.
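The receive side of the duplicate-dropping described above can be sketched as follows; it assumes in-order delivery within a socket (as TCP provides), and the names are illustrative:

```c
/* Sketch of dropping duplicate messages by outgoing sequence number,
 * as resent messages can arrive again after reconnection. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct recv_state {
	uint64_t last_seq;	/* highest sequence processed so far */
};

/* returns true if the message should be processed, false if it's a
 * duplicate of one already seen; last_seq is also communicated back to
 * the sender so it can free responses once they're received */
static bool recv_should_process(struct recv_state *rs, uint64_t seq)
{
	if (seq <= rs->last_seq)
		return false;
	rs->last_seq = seq;
	return true;
}
```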
When the client is reconnecting to a new server it resets its receive
state that was dependent on the old server and it drops responses which
were being sent to a server instance which no longer exists.
This stronger reliable messaging guarantee will make it much easier
to implement lock recovery which can now rewind state relative to
requests that are in flight and replay existing state on a new server
instance.
Signed-off-by: Zach Brown <zab@versity.com>
The super block had a magic value that was used to identify that the
block should contain our data structure. But it was called an 'id'
which was confused with the header fsid in the past. Also, the btree
blocks aren't using a similar magic value at all.
This moves the magic value into the header and creates values for the
super block and btree blocks. Both are written but the btree block
reads don't check the value.
Signed-off-by: Zach Brown <zab@versity.com>
The network greeting exchange was mistakenly using the global super
block magic number instead of the per-volume fsid to identify the
volumes that the endpoints are working with. This prevented the check
from doing its only job: to fail when clients in one volume try to
connect to a server in another.
Signed-off-by: Zach Brown <zab@versity.com>