Commit Graph

722 Commits

Zach Brown
d8bc962fc5 scoutfs: unpriv listxattr_hidden only shows .hide.
Our hidden attributes are hidden so that they don't leak out of the
system when archiving tools transfer xattrs from listxattr along
with the file.  They're not intended to be secret; in fact, users want
to see their contents just as they want to see other fs metadata that
describes the system but that they can't update.

Make our listxattr ioctl only return hidden xattrs and allow anyone to
see the results if they can read the file.  Rename it to more
accurately describe its intended use.

Signed-off-by: Zach Brown <zab@versity.com>
2019-06-28 10:23:55 -07:00
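A sketch of the access rule this sets up, with hypothetical helper
names and the permission check as kernels of this era spelled it:

    #include <linux/fs.h>

    /* anyone who can read the file may list its hidden xattrs */
    static long ex_ioc_listxattr_hidden(struct file *file,
                                        void __user *arg)
    {
            struct inode *inode = file_inode(file);
            int ret;

            ret = inode_permission(inode, MAY_READ);
            if (ret)
                    return ret;

            /* ... walk the inode's xattrs, copying out only the
             * names that carry the hidden tag ... */
            return 0;
    }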
Zach Brown
663ce53109 scoutfs: clean up _IO ioctl macro usage
Accurately set the direction bits, pack down the used numbers, and
remove stale old ioctl definitions.

Signed-off-by: Zach Brown <zab@versity.com>
2019-06-28 10:23:55 -07:00
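For reference, a sketch of accurate direction-bit usage; the magic
letter, numbers, and struct below are illustrative, not scoutfs's
actual definitions:

    #include <linux/ioctl.h>
    #include <linux/types.h>

    #define EX_IOCTL_MAGIC 'e'

    struct ex_ioctl_args {
            __u64 pos;
            __u64 count;
    };

    /* _IOR: the kernel only copies data out to userspace */
    #define EX_IOC_GET  _IOR(EX_IOCTL_MAGIC, 0, struct ex_ioctl_args)
    /* _IOW: the kernel only copies data in from userspace */
    #define EX_IOC_SET  _IOW(EX_IOCTL_MAGIC, 1, struct ex_ioctl_args)
    /* _IOWR: data flows both ways, as with an iteration cursor */
    #define EX_IOC_WALK _IOWR(EX_IOCTL_MAGIC, 2, struct ex_ioctl_args)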
Zach Brown
4a29cb5888 scoutfs: naturally align ioctl structs
Order the ioctl struct field definitions and add padding so that
runtimes with different word sizes don't add different padding.
Userspace is spared having to deal with packing and we don't
have to worry about compat translation in the kernel.

We had two persistent structures that crossed the ioctl, a key and a
timespec, so we explicitly translate to and from their persistent types
in the ioctl.

Signed-off-by: Zach Brown <zab@versity.com>
2019-06-27 11:39:11 -07:00
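A minimal sketch of the layout rule, with hypothetical fields: order
members from largest to smallest and pad explicitly so 32- and 64-bit
runtimes compute identical offsets and sizes without __packed:

    #include <linux/types.h>

    struct ex_ioctl_entry {
            __u64 ino;      /* 64-bit fields first, naturally aligned */
            __u64 offset;
            __u32 flags;    /* then 32-bit */
            __u8 type;      /* then smaller */
            __u8 _pad[3];   /* explicit pad to an 8-byte multiple (24) */
    };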
Zach Brown
7dfbd3950f scoutfs: add index of inodes by xattr names
Add a .indx. xattr tag which adds the inode to an index of inodes keyed
by the hash of xattr names.  An ioctl is added which then returns all
the inodes which may contain an xattr of the given name.  Dropping all
xattrs now has to parse the name to find out if it also has to delete an
index item.

Signed-off-by: Zach Brown <zab@versity.com>
2019-06-24 09:58:22 -07:00
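One plausible shape for such an index item, purely as a sketch; the
commit doesn't spell out the layout, so these fields are assumptions:

    #include <linux/types.h>

    /* map hash(xattr name) -> inode so a lookup by name hash finds
     * every inode that may carry that xattr */
    struct ex_xattr_index_key {
            __be64 name_hash;
            __be64 ino;
    };

Since different names can collide on a hash, the ioctl can only
promise inodes that may contain the xattr, as the message says.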
Zach Brown
aee017903b scoutfs: add hash helper
Add a quick header which builds 64bit hashes from the crcs of the two
halves of a byte region.

Signed-off-by: Zach Brown <zab@versity.com>
2019-06-24 09:58:22 -07:00
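A sketch of that construction, assuming crc32c as the crc; the name
and seeds are illustrative:

    #include <linux/crc32c.h>
    #include <linux/types.h>

    /* build a 64bit hash from the crcs of each half of the region */
    static inline u64 ex_hash64(const void *data, unsigned int len)
    {
            unsigned int half = len / 2;
            u32 lo = crc32c(~0, data, half);
            u32 hi = crc32c(~0, (const char *)data + half, len - half);

            return ((u64)hi << 32) | lo;
    }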
Zach Brown
a7fef3d7dd scoutfs: add listxattr_raw ioctl
Add an ioctl which can be used to iterate over the keys for all the
xattrs on an inode.  It is privileged, can see hidden xattrs, and has an
iteration cursor so that it can make its way through very large numbers
of xattrs.

Signed-off-by: Zach Brown <zab@versity.com>
2019-06-24 09:58:22 -07:00
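A hedged sketch of what a cursor-driven argument struct for this might
look like; the field names are assumptions, not the real ABI:

    #include <linux/types.h>

    /* userspace repeats the call, feeding back the cursor from the
     * previous call, until no more names are returned */
    struct ex_listxattr_args {
            __u64 id_cursor;   /* opaque position to resume after */
            __u64 buf_ptr;     /* user buffer for concatenated names */
            __u32 buf_bytes;
            __u32 _pad;
    };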
Zach Brown
019b5f6d6b scoutfs: add scoutfs xattr prefix and name tags
Add a scoutfs. xattr prefix which then defines a series of following
tags which can change the behaviour of the xattr.  We start with .hide.
which stops the xattr from showing up in listxattr.

Signed-off-by: Zach Brown <zab@versity.com>
2019-06-24 09:58:22 -07:00
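A small sketch of the parsing this implies, with illustrative names:
check for the scoutfs. prefix and then match the tag that follows:

    #include <linux/string.h>
    #include <linux/types.h>

    #define EX_XATTR_PREFIX "scoutfs."
    #define EX_TAG_HIDE     "hide."

    /* does this name carry the hidden tag after the prefix? */
    static bool ex_xattr_is_hidden(const char *name)
    {
            if (strncmp(name, EX_XATTR_PREFIX, strlen(EX_XATTR_PREFIX)))
                    return false;
            return !strncmp(name + strlen(EX_XATTR_PREFIX), EX_TAG_HIDE,
                            strlen(EX_TAG_HIDE));
    }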
Zach Brown
a239f6093d scoutfs: add mount_options/ sysfs dir
Add a directory per mount that shows the values of all the mount
options.

Signed-off-by: Zach Brown <zab@versity.com>
2019-06-05 14:30:11 -07:00
Zach Brown
7d56d8f34f scoutfs: add .show_options
Add the vfs callback that prints mount options in /proc files.

Signed-off-by: Zach Brown <zab@versity.com>
2019-06-05 14:30:11 -07:00
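A minimal sketch of such a callback, assuming a hypothetical sb_info
with a single option; the real scoutfs options differ:

    #include <linux/fs.h>
    #include <linux/seq_file.h>

    struct ex_sb_info {
            char uniq_name[32];
    };

    /* called for /proc/mounts and friends; emits ",opt=value" pairs */
    static int ex_show_options(struct seq_file *seq, struct dentry *root)
    {
            struct ex_sb_info *sbi = root->d_sb->s_fs_info;

            seq_printf(seq, ",uniq_name=%s", sbi->uniq_name);
            return 0;
    }

It's wired up as .show_options in super_operations.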
Zach Brown
c061ada671 scoutfs: mounts connect once server is listening
An elected leader writes a quorum block showing that it's elected before
it assumes exclusive access to the device and starts bringing up the
server.  This lets another later elected leader find and fence it if
something happens.

Other mounts were trying to connect to the server once this elected
quorum block was written and before the server was listening.  They'd
get connection refused, decide to elect a new leader, and try to fence
the server that's still running.

Now, they should have tried much harder to connect to the elected leader
instead of taking a single failed attempt as fatal.  But that's a
problem for another day that involves more work in balancing timeouts
and retries.

But mounts should not have tried to connect to the server until it's
listening.  That's easy to signal by adding a simple listening flag to
the quorum block.  Now mounts will only try to connect once they see the
listening flag and don't see these racy refused connections.

Signed-off-by: Zach Brown <zab@versity.com>
2019-05-30 15:01:00 -07:00
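A sketch of the handshake this adds, with assumed field and flag
names: the leader sets the flag only after its listen socket is up,
and mounts poll for it before connecting:

    #include <asm/byteorder.h>
    #include <linux/types.h>

    #define EX_QUORUM_FLAG_LISTENING (1ULL << 0)

    struct ex_quorum_block {
            __le64 flags;
            /* ... elected_nr, votes, etc ... */
    };

    /* mounts only attempt a connection once this returns true */
    static bool ex_leader_listening(const struct ex_quorum_block *blk)
    {
            return !!(le64_to_cpu(blk->flags) & EX_QUORUM_FLAG_LISTENING);
    }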
Zach Brown
abd7ffc247 scoutfs: only trace read quorum blocks after io
We have trace points as blocks are read, but the reads are cached as
buffer heads.  The iteration helpers are used to reference cached
blocks a few times in each voting cycle and we end up tracing cached
read blocks multiple times.  This uses a bit on the buffer_head to only
trace a cached block the first time it's read.

Signed-off-by: Zach Brown <zab@versity.com>
2019-05-30 15:01:00 -07:00
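A sketch of the technique, claiming a private buffer_head state bit;
the bit and trace names are illustrative:

    #include <linux/buffer_head.h>

    /* fs-private bh state bits start at BH_PrivateStart */
    enum {
            EX_BH_TracedRead = BH_PrivateStart,
    };

    static void ex_trace_read_block(struct buffer_head *bh)
    {
            /* only trace the first read of a cached block */
            if (test_and_set_bit(EX_BH_TracedRead, &bh->b_state))
                    return;
            /* trace_ex_block_read(bh); -- hypothetical trace point */
    }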
Zach Brown
4df35efbc0 scoutfs: show quorum state in sysfs
Add some sysfs files which show quorum state.  We store the state in
quorum_info off the super, which is updated as we participate in
elections.

Signed-off-by: Zach Brown <zab@versity.com>
2019-05-30 13:51:02 -07:00
Zach Brown
2cc4f89ad5 scoutfs: add sysfs attrs wrappers
Add some helpers to manage the lifetime of groups of attributes in
sysfs.  We can wait until the sysfs files are no longer in use
before tearing down the data that they rely on.

Signed-off-by: Zach Brown <zab@versity.com>
2019-05-30 13:51:02 -07:00
Zach Brown
c010afa8ff scoutfs: add setattr_more ioctl
Add an ioctl that can be used by userspace to restore a file to its
offline state.  To do that it needs to set inode fields that are
otherwise not exposed and create an offline extent.

Signed-off-by: Zach Brown <zab@versity.com>
2019-05-30 13:45:52 -07:00
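A hedged guess at the shape of such an argument struct; the message
doesn't list the actual fields, so these are assumptions:

    #include <linux/types.h>

    /* restore a file to its offline state in one call */
    struct ex_setattr_more {
            __u64 data_version; /* otherwise-unsettable inode field */
            __u64 i_size;
            __u64 flags;        /* e.g. create an offline extent */
            __u64 ctime_sec;
            __u32 ctime_nsec;
            __u32 _pad;
    };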
Zach Brown
0b6bc8789c scoutfs: don't leak btree block refs
Somewhere in the mists of time (around when we removed path tracking
which held refs to blocks?) walking blocks to migrate started leaking
btree block references.  It was providing a pointer so the walk gave it
the block it found but the caller was never dropping that ref.

It wasn't doing anything with the result of the walk so we just don't
provide a block pointer and the walk will drop the ref for us.  This
stops the leak, which was effectively pinning the ring in memory.

Signed-off-by: Zach Brown <zab@versity.com>
2019-05-21 11:41:07 -07:00
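The shape of the fix, sketched with an assumed walk signature: the
walk only hands back a referenced block when the caller passes an out
pointer, so a caller that ignores the block passes NULL:

    struct super_block;
    struct ex_key;
    struct ex_block;

    /* assumed: fills *bl_ret with a referenced block when non-NULL,
     * otherwise drops its own ref before returning */
    int ex_btree_walk(struct super_block *sb, struct ex_key *key,
                      struct ex_block **bl_ret);

    /* before: ex_btree_walk(sb, key, &bl) leaked bl's ref */
    /* after:  ex_btree_walk(sb, key, NULL) keeps no ref at all */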
Zach Brown
0988cbe1e9 scoutfs: track old and cur dirty btree blocks
To avoid overwriting live btree blocks we have to migrate them between
halves of the ring.  Each time we cross into a new half of the ring we
start migration all over again.

The intent was to slowly migrate the blocks over time.  We'd track dirty
blocks that came from the old and current halves and keep them in
balance.  This would keep the overhead of migration low and spread it
out through all the transactions at the start of the half that include
migration.

But the calculation of current blocks was completely wrong.  It checked
the newly allocated block which is always in the current half.  It never
thought it was dirtying old blocks so it'd constantly migrate trying to
find them.  We'd effectively migrate every btree block during the first
transaction in each half.

This determines whether we're dirtying old or new blocks from the
source of the cow operation.  We now recognize when we dirty old blocks
and will stop
migrating once we've migrated at least as many old blocks as we've
written new blocks.

Signed-off-by: Zach Brown <zab@versity.com>
2019-05-21 11:41:07 -07:00
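A sketch of the corrected accounting: classify by the cow source block
and migrate only while old-half dirtying lags new writes; the names
and the half test are assumptions:

    #include <linux/types.h>

    struct ex_ring_counts {
            u64 dirtied_old;  /* cow sources from the old half */
            u64 dirtied_new;  /* cow sources from the current half */
    };

    /* the bug tested the newly allocated block, which is always in
     * the current half; test the cow source block instead */
    static void ex_account_cow(struct ex_ring_counts *rc, u64 src_blkno,
                               u64 cur_start, u64 cur_end)
    {
            if (src_blkno >= cur_start && src_blkno < cur_end)
                    rc->dirtied_new++;
            else
                    rc->dirtied_old++;
    }

    /* migrate until old blocks moved match new blocks written */
    static bool ex_keep_migrating(const struct ex_ring_counts *rc)
    {
            return rc->dirtied_old < rc->dirtied_new;
    }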
Zach Brown
e10033b34d scoutfs: migrate dirty btree blocks during wrap
We were seeing ring btree corruption that manifested as the server seeing
stale btree blocks as it tried to read all the btrees to migrate blocks
during a write.  A block it tried to read didn't match its reference.

It turned out that block wasn't being migrated.  It would get stuck
at a position in the ring.  Eventually new block writes would overwrite
it and then the next read would see corruption.

It wasn't being migrated because the block reading function didn't
realize that it had to migrate a dirty block.  The block was written in
a transaction at the end of the ring.  The ring wrapped during the
transaction and then migration tried to migrate the dirty block.  It
wouldn't be dirtied, and thus migrated, because it was already dirty in
the transaction.

The fix is to add more cases to the dirtying decision which takes
migration specifically into account.  We'll no longer short circuit
dirtying blocks for migration when they're in the old half of the ring
even though they're dirty.

Signed-off-by: Zach Brown <zab@versity.com>
2019-05-21 11:41:07 -07:00
Zach Brown
e150ebc8d2 scoutfs: trace btree dirty blocks
Signed-off-by: Zach Brown <zab@versity.com>
2019-05-21 11:41:07 -07:00
Zach Brown
806ac0d8e6 scoutfs: fix mkfs option in README
Fix a quick option typo in the mkfs invocations in the readme.

Signed-off-by: Zach Brown <zab@versity.com>
2019-05-21 11:33:26 -07:00
Zach Brown
a6782fc03f scoutfs: add data waiting
One of the core features of scoutfs is the ability to transparently
migrate file contents to and from an archive tier.  For this to be
transparent we need file system operations to trigger staging the file
contents back into the file system as needed.

This adds the infrastructure which operations use to wait for offline
extents to come online and which provides userspace with a list of
blocks that the operations are waiting for.

We add some waiting infrastructure that callers use to lock, check for
offline extents, and unlock and wait before checking again to see if
they're still offline.  We add these checks and waiting to data io
operations that could encounter offline extents.

This has to be done carefully so that we don't wait while holding locks
that would prevent staging.  We use per-task structures to discover when
we are the first user of a cluster lock on an inode, indicating that
it's safe for us to wait because we don't hold any locks.

And while we're waiting our operation is tracked and reported to
userspace through an ioctl.  This is a non-blocking ioctl, it's up to
userspace to decide how often to check and how large a region to stage.

Waiters are woken up when the file contents could have changed, not
specifically when we know that the extent has come online.  This lets us
wake waiters when their lock is revoked so that they can block waiting
to reacquire the lock and test the extents again.  It lets us provide
coherent demand staging across the cluster without fine grained waiting
protocols sent between the nodes.  It may result in some spurious
wakeups and extra work, but hopefully it won't, and it's a very simple
and functional first pass.

Signed-off-by: Zach Brown <zab@versity.com>
2019-05-21 11:33:26 -07:00
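A sketch of the lock, check, then unlock-and-wait pattern described
above; every helper here is hypothetical:

    #include <linux/fs.h>

    int ex_lock_inode(struct inode *inode);
    void ex_unlock_inode(struct inode *inode);
    bool ex_extent_offline(struct inode *inode, u64 start, u64 len);
    void ex_record_data_wait(struct inode *inode, u64 start, u64 len);
    int ex_wait_data_changed(struct inode *inode);

    static int ex_wait_until_online(struct inode *inode, u64 start,
                                    u64 len)
    {
            int ret;

            for (;;) {
                    ret = ex_lock_inode(inode);
                    if (ret)
                            return ret;
                    if (!ex_extent_offline(inode, start, len))
                            return 0; /* still locked; caller does io */

                    /* publish the waited-on region for userspace,
                     * then drop the lock so staging can proceed */
                    ex_record_data_wait(inode, start, len);
                    ex_unlock_inode(inode);

                    /* woken whenever contents could have changed, not
                     * only when the extent is known to be online */
                    ret = ex_wait_data_changed(inode);
                    if (ret)
                            return ret;
            }
    }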
Zach Brown
cfa563a4a4 scoutfs: expand the per_task API
This adds some minor functionality to the per_task API for use by the
upcoming offline waiting work.

Add scoutfs_per_task_add_excl() so that a caller can tell if their task
was already put on a per-task list by their caller.

Make scoutfs_per_task_del() return a bool to indicate if the entry was
found on a list and was in fact deleted, or not.

Add scoutfs_per_task_init_entry() for initializing entries that aren't
declared on the stack.

Signed-off-by: Zach Brown <zab@versity.com>
2019-05-21 11:33:26 -07:00
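A hedged sketch of the three additions, assuming per_task is a
spinlock-protected list of task-tagged entries:

    #include <linux/list.h>
    #include <linux/sched.h>
    #include <linux/spinlock.h>

    struct ex_per_task {
            spinlock_t lock;
            struct list_head list;
    };

    struct ex_per_task_entry {
            struct list_head head;
            struct task_struct *task;
    };

    /* for entries that aren't declared on the stack */
    static void ex_per_task_init_entry(struct ex_per_task_entry *ent)
    {
            INIT_LIST_HEAD(&ent->head);
            ent->task = NULL;
    }

    /* returns false if current was already listed by a caller */
    static bool ex_per_task_add_excl(struct ex_per_task *pt,
                                     struct ex_per_task_entry *ent)
    {
            struct ex_per_task_entry *pos;
            bool added = true;

            spin_lock(&pt->lock);
            list_for_each_entry(pos, &pt->list, head) {
                    if (pos->task == current) {
                            added = false;
                            break;
                    }
            }
            if (added) {
                    ent->task = current;
                    list_add(&ent->head, &pt->list);
            }
            spin_unlock(&pt->lock);

            return added;
    }

    /* returns true if the entry was found and in fact deleted */
    static bool ex_per_task_del(struct ex_per_task *pt,
                                struct ex_per_task_entry *ent)
    {
            bool deleted = false;

            spin_lock(&pt->lock);
            if (!list_empty(&ent->head)) {
                    list_del_init(&ent->head);
                    deleted = true;
            }
            spin_unlock(&pt->lock);

            return deleted;
    }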
Zach Brown
3a6392aee6 scoutfs: remove scoutfs_unlock_flags() prototype
There was an old prototype for an unlock variant that hasn't been around
for a while.  Remove it.

Signed-off-by: Zach Brown <zab@versity.com>
2019-05-21 11:33:26 -07:00
Zach Brown
7097d545cf scoutfs: make sure to set the sb blocksize
Since fill_super was originally written we've added use of buffer_head
IO by the btree and quorum voting.  We forgot to set the block size so
devices that didn't have the common 4k default, matching our block size,
would see errors.  Explicitly set it.

Signed-off-by: Zach Brown <zab@versity.com>
2019-05-21 11:33:26 -07:00
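The fix is small; a sketch assuming a 4k filesystem block size:

    #include <linux/fs.h>

    #define EX_BLOCK_SIZE 4096

    static int ex_fill_super(struct super_block *sb, void *data,
                             int silent)
    {
            /* must precede buffer_head io; returns 0 on failure */
            if (!sb_set_blocksize(sb, EX_BLOCK_SIZE))
                    return -EINVAL;

            /* ... read the super, start btree and quorum io ... */
            return 0;
    }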
Zach Brown
6342bd5679 scoutfs: update README.md for quorum
Update the github README to describe the recent addition of integrated
quorum voting and locking.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
b5133bfc98 scoutfs: add elected flag to quorum block
It was a mistake to use a non-zero elected_nr as the indication that a
slot is considered actively elected.  Zeroing it as the server shuts
down wipes the elected_nr and means that it doesn't advance as each
server is elected.  This then causes a client connecting to a new server
to be confused for a client reconnecting to a server after the server
has timed it out and destroyed its state.  This caused reconnection
after shutting down a server to fail and clients to loop reconnecting
indefinitely.

This instead adds flags to the quorum block and assigns a flag to
indicate that the slot should be considered active.  It's cleared by
fencing and by the client as the server shuts down.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
36b0df336b scoutfs: add unmount barrier
Now that a mount's client is responsible for electing and starting a
server we need to be careful about coordinating unmount.  We can't
let unmounting clients leave the remaining mounted clients without
quorum.

The server carefully tracks who is mounted and who is unmounting while
it is processing farewell requests.  It only sends responses to voting
mounts while quorum remains or once all the voting clients are trying
to unmount.

We use a field in the quorum blocks to communicate to the final set of
unmounting voters that their farewells have been processed and that they
can finish unmounting without trying to reestablish quorum.

The commit introduces and maintains the unmount_barrier field in the
quorum blocks.  It is passed to the server from the election, the
server sends it to the client and writes new versions, and the client
compares what it received with what it sees in quorum blocks.

The commit then has the clients send their unique name to the server,
which stores it in persistent mounted client records and compares the
names to the quorum config when deciding which farewell requests
can be responded to.

Now that farewell response processing can block for a very long time it
is moved off into async work so that it doesn't prevent net connections
from being shut down and re-established.  This also makes it easier to
make global decisions based on the count of pending farewell requests.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
fe63b566c9 scoutfs: use _unaligned instead of __packed
We were relying on a cute (and probably broken) trick of defining
pointers to unaligned base types with __packed.  Modern versions of gcc
warn about this.

Instead we either directly access unaligned types with get_ and
put_unaligned, or we copy unaligned data into aligned copies before
working with it.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
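Sketches of the two replacement patterns the message names, with
illustrative types:

    #include <asm/unaligned.h>
    #include <linux/string.h>
    #include <linux/types.h>

    struct ex_key {
            u64 ino;
            u64 offset;
    };

    /* pattern 1: access an unaligned little-endian field directly */
    static u64 ex_read_seq(const void *field)
    {
            return get_unaligned_le64(field);
    }

    /* pattern 2: copy unaligned data into an aligned struct first */
    static void ex_load_key(struct ex_key *dst, const void *src)
    {
            memcpy(dst, src, sizeof(*dst));
    }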
Zach Brown
e88b5732ad scoutfs: track trans seq in btree
Currently the server tracks the outstanding transaction sequence numbers
that clients have open in a simple list in memory.  It's not properly
cleaned up if a client unmounts, and a new server that takes over
after a crash won't know about open transaction sequence numbers.

This stores open transaction sequence numbers in a shared persistent
btree instead of in memory.  It removes tracking for clients as they
send their farewell during unmount.  A new server that starts up will
see existing entries for clients that were created by old servers.

This fixes a bug where a client who unmounts could leave behind a
pending sequence number that would never be cleaned up and would
indefinitely limit the visibility of index items that came after it.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
3d82dd3a46 scoutfs: fix bad octet in tracing ipv4 address
The macro for producing trace args for an ipv4 address had a typo when
shifting the third octet down before masking.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
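For reference, the correct per-octet shifts for a host-order ipv4
address; the macro name is illustrative:

    /* expands to four "%u" args, most significant octet first */
    #define EX_IPV4_ARGS(a)                 \
            (((a) >> 24) & 0xff),           \
            (((a) >> 16) & 0xff),           \
            (((a) >> 8) & 0xff),            \
            ((a) & 0xff)

    /* usage: pr_info("peer %u.%u.%u.%u\n", EX_IPV4_ARGS(addr)); */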
Zach Brown
fa3e0a31c7 scoutfs: use SO_REUSEADDR for server socket
The server's listening address is fixed by the raft config in the super
block.  If it shuts down and rapidly starts back up it needs to bind to
the currently lingering address.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
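A sketch of setting the option on a kernel socket, via the
kernel_setsockopt() helper that kernels of this era provided:

    #include <linux/net.h>
    #include <linux/socket.h>

    /* let bind() reuse the fixed address while the old socket
     * lingers in TIME_WAIT after a fast restart */
    static int ex_set_reuseaddr(struct socket *sock)
    {
            int one = 1;

            return kernel_setsockopt(sock, SOL_SOCKET, SO_REUSEADDR,
                                     (char *)&one, sizeof(one));
    }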
Zach Brown
0bc0ff9300 scoutfs: add clock sync trace events
Generate unique trace events on the send and recv side of each message
sent between nodes.  This can be used to reasonably efficiently
synchronize the monotonic clock in trace events between nodes given only
their captured trace events.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
a546bd0aab scoutfs: check for newlines in msg.h wrappers
The message formatter adds a newline so callers don't have to.  But
sometimes they do and we get double newlines.  Add a build check that
the format string doesn't end in a newline so that we stop adding these.
And fix up all the current offenders.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
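One way such a check can work for string-literal formats, sketched
with assumed macro names: index the literal's final character through
sizeof:

    #include <linux/bug.h>
    #include <linux/printk.h>

    /* sizeof(fmt) - 2 is the last char before the NUL; this is only
     * valid when fmt is a string literal */
    #define ex_check_no_newline(fmt)                          \
            BUILD_BUG_ON(sizeof(fmt) > 1 &&                   \
                         (fmt)[sizeof(fmt) - 2] == '\n')

    #define ex_info(fmt, ...)                                 \
    do {                                                      \
            ex_check_no_newline(fmt);                         \
            printk(KERN_INFO "ex: " fmt "\n", ##__VA_ARGS__); \
    } while (0)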
Zach Brown
ec0fb5380a scoutfs: implement lock recovery
When a server crashes all the connected clients still have operational
locks and can be using them to protect IO.  As a new server starts up
its lock service needs to account for those outstanding locks before
granting new locks to clients.

This implements lock recovery by having the lock service recover locks
from clients as it starts up.

First the lock service stores records of connected clients in a btree
off the super block.  Records are added as the server receives their
greeting and are removed as the server receives their farewell.

Then the server checks for existing persistent records as it starts up.
If it finds any it enters recovery and waits for all the old clients to
reconnect before resuming normal processing.

We add lock recover request and response messages that are used to
communicate locks from the clients to the server.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
801f6ad9be scoutfs: add scoutfs_spbm_empty()
Add a quick function that determines if a sparse bitmap has no bits set.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
74366f0df1 scoutfs: make networking more reliable
The current networking code has loose reliability guarantees.  If a
connection between the client and server is broken then the client
reconnects as though it's an entirely new connection.  The client resends
requests but no responses are resent.  A client's requests could be
processed twice on the same server.  The server throws away disconnected
client state.

This was fine, sort of, for the simple requests we had implemented so
far.  It's not good enough for the locking service which would prefer to
let networking worry about reliable message delivery so it doesn't have
to track and replay partial state across reconnection between the same
client and server.

This adds the infrastructure to ensure that requests and responses
between a given client and server will be delivered across reconnected
sockets and will only be processed once.

The server keeps track of disconnected clients and restores state if the
same client reconnects.  This required some work around the greetings so
that clients and servers can recognize each other.  Now that the server
remembers disconnected clients we add a farewell request so that servers
can forget about clients that are shutting down and won't be
reconnecting.

Now that connections between the client and server are preserved we can
resend responses across reconnection.  We add outgoing message sequence
numbers which are used to drop duplicates and communicate the received
sequence back to the sender to free responses once they're received.

When the client is reconnecting to a new server it resets its receive
state that was dependent on the old server and it drops responses which
were being sent to a server instance which no longer exists.

This stronger reliable messaging guarantee will make it much easier
to implement lock recovery which can now rewind state relative to
requests that are in flight and replay existing state on a new server
instance.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
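A sketch of the duplicate-drop half of this, with assumed names: each
side keeps the greatest sequence it has processed and ignores resends
at or below it:

    #include <linux/types.h>

    struct ex_recv_state {
            u64 max_seen_seq; /* greatest sequence processed so far */
    };

    /* resent messages keep their original sequence numbers, so
     * anything at or below the mark was already processed once */
    static bool ex_should_process(struct ex_recv_state *rs, u64 seq)
    {
            if (seq <= rs->max_seen_seq)
                    return false;
            rs->max_seen_seq = seq;
            return true;
    }

The same high-water mark flows back to the sender so it can free
responses that are known to have been received.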
Zach Brown
20f4e1c338 scoutfs: put magic value in block header
The super block had a magic value that was used to identify that the
block should contain our data structure.  But it was called an 'id'
which was confused with the header fsid in the past.  Also, the btree
blocks aren't using a similar magic value at all.

This moves the magic value into the header and creates values for the
super block and btree blocks.  Both are written but the btree block
reads don't check the value.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
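A sketch of the resulting layout; the field order and values are
invented:

    #include <linux/types.h>

    #define EX_BLOCK_MAGIC_SUPER 0x73757065 /* illustrative values */
    #define EX_BLOCK_MAGIC_BTREE 0x62747265

    /* the shared header now carries the magic, so each block type
     * can write (and eventually check) its own expected value */
    struct ex_block_header {
            __le32 crc;
            __le32 magic;
            __le64 fsid;
            __le64 seq;
            __le64 blkno;
    };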
Zach Brown
675275fbf1 scoutfs: use hdr.fsid in greeting instead of id
The network greeting exchange was mistakenly using the global super
block magic number instead of the per-volume fsid to identify the
volumes that the endpoints are working with.  This prevented the check
from doing its only job: to fail when clients in one volume try to
connect to a server in another.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
288d781645 scoutfs: start and stop server with quorum
Currently all mounts try to get a dlm lock which gives them exclusive
access to become the server for the filesystem.  That isn't going to
work if we're moving to locking provided by the server.

This uses quorum election to determine who should run the server.  We
switch from long running server work blocked trying to get a lock to
calls which start and stop the server.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
08a140c8b0 scoutfs: use our locking service
Convert client locking to call the server's lock service instead of
using a fs/dlm lockspace.

The client code gets some shims to send and receive lock messages to and
from the server.  Callers use our lock mode constants instead of the
DLM's.

Locks are now identified by their starting key instead of an additional
scoped lock name so that we don't have more mapping structures to track.
The global rename lock uses keys that are defined by the format as only
used for locking.

The biggest change is in the client lock state machine.  Instead of
calling the dlm and getting callbacks we send messages to our server and
get called from incoming message processing.  We don't have everything
come through a per-lock work queue.  Instead we send requests either
from the blocking lock caller or from a shrink work queue.  Incoming
messages are called in the net layer's blocking work contexts so we
don't need to do any more work to defer to other contexts.

The different processing contexts lead to a slightly different lock
life cycle.  We refactor and separate allocation and freeing from
tracking and removing locks in data structures.  We add a _get and _put
to track active use of locks, and then async references to locks by
holders and requests are tracked separately.

Our lock service's rules are a bit simpler in that we'll only ever send
one request at a time and the server will only ever send one request at
a time.  We do have to do a bit of work to make sure we process
back-to-back grant responses and invalidation requests from the server.

As of this change the lock setup and destruction paths are a little
wobbly.  They'll be shored up as we add lock recovery between the client
and server.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
7c8383eddd scoutfs: add scoutfs_lock_rename()
Add a specific lock method for locking the global rename lock instead of
having the caller specify it as a global lock.  We're getting rid of the
notion of lock scopes and requiring all locks to be related to keys.
The rename lock will use magic keys at the end of the volume.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
34b8950bca scoutfs: initial lock server core
Add the core lock server code for providing a lock service from our
server.  The lock messages are wired up but nothing calls them.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
f472c0bc87 scoutfs: add scoutfs_net_response_node()
Today all responses can only be sent down the connection that carried
the request, while the request is being processed.  We'll be adding
subsystems that need to send responses asynchronously after initial
request processing.  Give them a call to send a response to a node id
instead of to a node's connection.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
c34dd452a7 scoutfs: add quorum voting
Add a quorum election implementation.  The mounts that can participate
in the election are specified in a quorum config array in the super
block.  Each configured participant is assigned a preallocated block
that it can write to.

All mounts read the quorum blocks to find the member who was elected the
leader and should be running the server.  The voting mounts loop reading
voting blocks and writing their vote block until someone is elected with
a majority.

Nothing calls this code yet, this adds the initial implementation and
format.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
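The winning condition is a strict majority of the configured slots;
as a sketch:

    #include <linux/types.h>

    /* elected once votes reach a majority of the configured quorum
     * members, e.g. 2 of 3 or 3 of 5 */
    static bool ex_won_election(unsigned int votes, unsigned int members)
    {
            return votes >= members / 2 + 1;
    }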
Zach Brown
d57b8232ee scoutfs: move base types in format.h
We had scattered some base types throughout the format file which made
them annoying to reference in higher level structs.  Let's put them at
the top so we can use them without declarations or moving things around
in unrelated commits.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
f75e1e1322 scoutfs: reformat Makefile to one object per line
Reformat the scoutfs-y object list so that there's one object per line.
Diffs now clearly demonstrate what is changing instead of having word
wrapping constantly obscuring changes in the built objects.

(Did everyone spot the scoutfs_trace sorting mistake?  Another reason
not to mash everything into wrapped lines :)).

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
6caa87458b scoutfs: add scoutfs_net_client_node_id()
Some upcoming network request processing paths need access to the
connected client's node_id.  We could add it to the arguments but that'd
be a lot of churn so we'll add an accessor function for now.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
e9f6e79d67 scoutfs: add uniq_name mount option
Each mount is getting a specified unique name.  This can be used to
identify a reconnecting mount, which indicates that an old instance of the
same unique name can no longer exist and doesn't need to be fenced.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
8fedfef1cc scoutfs: remove stale net response data comment
There was a time when responding with an error wouldn't include the
caller's data payload.  That hasn't been the case since we added
compaction network requests which include a reference to the compaction
operation with the error response.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
91d190622d scoutfs: remove scoutfs.md file
The current plan is to maintain a nice paper describing the system in
the scoutfs-utils repository.

Signed-off-by: Zach Brown <zab@versity.com>
2018-09-25 13:02:11 -07:00
Brandon Philips
9bb0c60c63 README: add whitepaper link
The white paper is helpful but not linked from the GitHub README, which will be a primary landing spot for folks discovering the project.
2018-09-19 11:03:11 -07:00