Use the mount's generated random id in persistent items and the lock
that protects them instead of the assigned node_id.
Signed-off-by: Zach Brown <zab@versity.com>
Add a .indx. xattr tag which adds the inode to an index of inodes keyed
by the hash of xattr names. An ioctl is added which then returns all
the inodes which may contain an xattr of the given name. Dropping all
xattrs now has to parse the name to find out if it also has to delete an
index item.
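
As a rough illustration of the extra parsing that xattr deletion now
does, a minimal sketch follows; the exact tag prefix and helper name are
assumptions, the real tag format lives in the kernel's xattr code.

    #include <stdbool.h>
    #include <string.h>

    /*
     * Sketch: does this xattr name carry the index tag?  If so,
     * dropping the xattr also has to delete its index item.  The
     * "scoutfs.indx." prefix here is only a stand-in for the real
     * tag format.
     */
    static bool xattr_name_is_indexed(const char *name)
    {
            static const char prefix[] = "scoutfs.indx.";

            return strncmp(name, prefix, sizeof(prefix) - 1) == 0;
    }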
Signed-off-by: Zach Brown <zab@versity.com>
An elected leader writes a quorum block showing that it's elected before
it assumes exclusive access to the device and starts bringing up the
server. This lets another later elected leader find and fence it if
something happens.
Other mounts were trying to connect to the server once this elected
quorum block was written and before the server was listening. They'd
get connection refused, decide to elect a new leader, and try to fence
the server that's still running.
Now, they should have tried much harder to connect to the elected leader
instead of taking a single failed attempt as fatal. But that's a
problem for another day that involves more work in balancing timeouts
and retries.
But mounts should not have tried to connect to the server until it's
listening. That's easy to signal by adding a simple listening flag to
the quorum block. Now mounts will only try to connect once they see the
listening flag and don't see these racy refused connections.
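
A small sketch of the connect gate this adds, assuming hypothetical
names for the quorum block's flags field and the listening bit; the
on-disk definitions are in the format header.

    #include <stdbool.h>
    #include <stdint.h>

    /* simplified view of the fields a mount reads from a quorum block */
    struct quorum_block_view {
            uint64_t elected_nr;
            uint64_t flags;
    };

    #define QUORUM_FLAG_LISTENING (1ULL << 0)       /* assumed bit */

    /* only try to connect once the elected leader says it's listening */
    static bool should_try_connect(const struct quorum_block_view *blk)
    {
            return (blk->flags & QUORUM_FLAG_LISTENING) != 0;
    }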
Signed-off-by: Zach Brown <zab@versity.com>
One of the core features of scoutfs is the ability to transparently
migrate file contents to and from an archive tier. For this to be
transparent we need file system operations to trigger staging the file
contents back into the file system as needed.
This adds the infrastructure which operations use to wait for offline
extents to come online and which provides userspace with a list of
blocks that the operations are waiting for.
We add some waiting infrastructure that callers use to lock, check for
offline extents, and unlock and wait before checking again to see if
they're still offline. We add these checks and waiting to data io
operations that could encounter offline extents.
This has to be done carefully so that we don't wait while holding locks
that would prevent staging. We use per-task structures to discover when
we are the first user of a cluster lock on an inode, indicating that
it's safe for us to wait because we don't hold any locks.
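
The shape of that lock, check, unlock, and wait cycle looks roughly like
the following sketch; the structures and helpers are placeholders, not
the kernel's real API.

    #include <stdint.h>

    struct inode_ref;       /* stands in for the in-memory inode */

    /* placeholder helpers for the real cluster lock and waiting calls */
    void cluster_lock(struct inode_ref *inode);
    void cluster_unlock(struct inode_ref *inode);
    int extents_offline(struct inode_ref *inode, uint64_t start, uint64_t len);
    int wait_for_stage(struct inode_ref *inode, uint64_t start, uint64_t len);

    /*
     * Check for offline extents while holding the lock, but drop the
     * lock before blocking so staging can make progress, then check
     * again after waking.
     */
    static int wait_until_online(struct inode_ref *inode, uint64_t start,
                                 uint64_t len)
    {
            int ret;

            for (;;) {
                    cluster_lock(inode);
                    ret = extents_offline(inode, start, len);
                    cluster_unlock(inode);

                    if (ret <= 0)           /* online, or an error */
                            return ret;

                    ret = wait_for_stage(inode, start, len);
                    if (ret)                /* e.g. interrupted */
                            return ret;
            }
    }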
And while we're waiting our operation is tracked and reported to
userspace through an ioctl. This is a non-blocking ioctl, it's up to
userspace to decide how often to check and how large a region to stage.
Waiters are woken up when the file contents could have changed, not
specifically when we know that the extent has come online. This lets us
wake waiters when their lock is revoked so that they can block waiting
to reacquire the lock and test the extents again. It lets us provide
coherent demand staging across the cluster without fine grained waiting
protocols sent between the nodes. It may result in some spurious wakeups
and work but hopefully it won't, and it's a very simple and functional
first pass.
Signed-off-by: Zach Brown <zab@versity.com>
It was a mistake to use a non-zero elected_nr as the indication that a
slot is considered actively elected. Zeroing it as the server shuts
down wipes the elected_nr and means that it doesn't advance as each
server is elected. This then causes a client connecting to a new server
to be confused for a client reconnecting to a server after the server
has timed it out and destroyed its state. This caused reconnection
after shutting down a server to fail and clients to loop reconnecting
indefinitely.
This instead adds flags to the quorum block and assigns a flag to
indicate that the slot should be considered active. It's cleared by
fencing and by the client as the server shuts down.
Signed-off-by: Zach Brown <zab@versity.com>
Now that a mount's client is responsible for electing and starting a
server we need to be careful about coordinating unmount. We can't
let unmounting clients leave the remaining mounted clients without
quorum.
The server carefully tracks who is mounted and who is unmounting while
it is processing farewell requests. It only sends responses to voting
mounts while quorum remains or once all the voting clients are trying
to unmount.
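
The server's decision about whether a farewell can be answered reduces
to roughly the following check; the counts and their names are
illustrative, and the treatment of non-voting mounts is an assumption.

    #include <stdbool.h>

    /*
     * Can we answer this farewell?  Non-voting mounts are assumed to
     * always get an answer; voting mounts are only answered while a
     * majority of voters would still remain, or once every voter is
     * trying to unmount.
     */
    static bool can_answer_farewell(bool sender_votes, int nr_voters,
                                    int voters_remaining_after,
                                    int voters_unmounting)
    {
            if (!sender_votes)
                    return true;

            if (voters_unmounting == nr_voters)
                    return true;

            return voters_remaining_after > nr_voters / 2;
    }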
We use a field in the quorum blocks to communicate to the final set of
unmounting voters that their farewells have been processed and that they
can finish unmounting without trying to reestablish quorum.
The commit introduces and maintains the unmount_barrier field in the
quorum blocks. It is passed to the server from the election, the
server sends it to the client and writes new versions, and the client
compares what it received with what it sees in quorum blocks.
The commit then has the clients send their unique name to the server
who stores it in persistent mounted client records and compares the
names to the quorum config when deciding which farewell requests
can be responded to.
Now that farewell response processing can block for a very long time it
is moved off into async work so that it doesn't prevent net connections
from being shutdown and re-established. This also makes it easier to
make global decisions based on the count of pending farewell requests.
Signed-off-by: Zach Brown <zab@versity.com>
Currently the server tracks the outstanding transaction sequence numbers
that clients have open in a simple list in memory. It's not properly
cleaned up if a client unmounts, and a new server that takes over
after a crash won't know about open transaction sequence numbers.
This stores open transaction sequence numbers in a shared persistent
btree instead of in memory. It removes tracking for clients as they
send their farewell during unmount. A new server that starts up will
see existing entries for clients that were created by old servers.
This fixes a bug where a client who unmounts could leave behind a
pending sequence number that would never be cleaned up and would
indefinitely limit the visibility of index items that came after it.
Signed-off-by: Zach Brown <zab@versity.com>
Generate unique trace events on the send and recv side of each message
sent between nodes. This can be used to reasonably efficiently
synchronize the monotonic clock in trace events between nodes given only
their captured trace events.
Signed-off-by: Zach Brown <zab@versity.com>
When a server crashes all the connected clients still have operational
locks and can be using them to protect IO. As a new server starts up
its lock service needs to account for those outstanding locks before
granting new locks to clients.
This implements lock recovery by having the lock service recover locks
from clients as it starts up.
First the lock service stores records of connected clients in a btree
off the super block. Records are added as the server receives their
greeting and are removed as the server receives their farewell.
Then the server checks for existing persistent records as it starts up.
If it finds any it enters recovery and waits for all the old clients to
reconnect before resuming normal processing.
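
Sketch of the startup check described above, using placeholder names for
the persistent record lookup and the recovery machinery.

    #include <stdbool.h>

    struct lock_server;     /* placeholder for the server's lock state */

    /* placeholders for the real btree and recovery calls */
    bool have_client_records(struct lock_server *srv);
    void enter_recovery(struct lock_server *srv);
    int wait_for_clients_to_reconnect(struct lock_server *srv);
    void resume_normal_processing(struct lock_server *srv);

    /* if old clients left records behind, recover their locks first */
    static int lock_server_startup(struct lock_server *srv)
    {
            int ret = 0;

            if (have_client_records(srv)) {
                    enter_recovery(srv);
                    ret = wait_for_clients_to_reconnect(srv);
            }
            if (!ret)
                    resume_normal_processing(srv);

            return ret;
    }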
We add lock recovery request and response messages that are used to
communicate locks from the clients to the server.
Signed-off-by: Zach Brown <zab@versity.com>
The current networking code has loose reliability guarantees. If a
connection between the client and server is broken then the client
reconnects as though it's an entirely new connection. The client resends
requests but no responses are resent. A client's requests could be
processed twice on the same server. The server throws away disconnected
client state.
This was fine, sort of, for the simple requests we had implemented so
far. It's not good enough for the locking service which would prefer to
let networking worry about reliable message delivery so it doesn't have
to track and replay partial state across reconnection between the same
client and server.
This adds the infrastructure to ensure that requests and responses
between a given client and server will be delivered across reconnected
sockets and will only be processed once.
The server keeps track of disconnected clients and restores state if the
same client reconnects. This required some work around the greetings so
that clients and servers can recognize each other. Now that the server
remembers disconnected clients we add a farewell request so that servers
can forget about clients that are shutting down and won't be
reconnecting.
Now that connections between the client and server are preserved we can
resend responses across reconnection. We add outgoing message sequence
numbers which are used to drop duplicates and communicate the received
sequence back to the sender to free responses once they're received.
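
A simplified sketch of how sequence numbers can drop duplicates and free
acknowledged responses; the field names are assumptions and the real
bookkeeping is per connection in the net layer.

    #include <stdbool.h>
    #include <stdint.h>

    /* simplified per-peer sequence state */
    struct msg_seq_state {
            uint64_t next_send_seq;     /* stamped on each outgoing message */
            uint64_t last_recv_seq;     /* greatest sequence processed      */
    };

    /* drop messages that were already processed before the reconnect */
    static bool should_process(struct msg_seq_state *st, uint64_t seq)
    {
            if (seq <= st->last_recv_seq)
                    return false;

            st->last_recv_seq = seq;
            return true;
    }

    /* the receiver's last_recv_seq is echoed back so the sender can
     * free saved responses up to and including that sequence */
    static bool can_free_response(uint64_t response_seq, uint64_t acked_seq)
    {
            return response_seq <= acked_seq;
    }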
When the client is reconnecting to a new server it resets its receive
state that was dependent on the old server and it drops responses which
were being sent to a server instance which no longer exists.
This stronger reliable messaging guarantee will make it much easier
to implement lock recovery which can now rewind state relative to
requests that are in flight and replay existing state on a new server
instance.
Signed-off-by: Zach Brown <zab@versity.com>
The super block had a magic value that was used to identify that the
block should contain our data structure. But it was called an 'id'
which was confused with the header fsid in the past. Also, the btree
blocks aren't using a similar magic value at all.
This moves the magic value into the header and creates values for the
super block and btree blocks. Both are written but the btree block
reads don't check the value.
Signed-off-by: Zach Brown <zab@versity.com>
Currently all mounts try to get a dlm lock which gives them exclusive
access to become the server for the filesystem. That isn't going to
work if we're moving to locking provided by the server.
This uses quorum election to determine who should run the server. We
switch from long running server work blocked trying to get a lock to
calls which start and stop the server.
Signed-off-by: Zach Brown <zab@versity.com>
Convert client locking to call the server's lock service instead of
using a fs/dlm lockspace.
The client code gets some shims to send and receive lock messages to and
from the server. Callers use our lock mode constants instead of the
DLM's.
Locks are now identified by their starting key instead of an additional
scoped lock name so that we don't have more mapping structures to track.
The global rename lock uses keys that are defined by the format as only
used for locking.
The biggest change is in the client lock state machine. Instead of
calling the dlm and getting callbacks we send messages to our server and
get called from incoming message processing. We don't have everything
come through a per-lock work queue. Instead we send requests either
from the blocking lock caller or from a shrink work queue. Incoming
messages are called in the net layer's blocking work contexts so we
don't need to do any more work to defer to other contexts.
The different processing contexts lead to a slightly different lock
life cycle. We refactor and separate allocation and freeing from
tracking and removing locks in data structures. We add a _get and _put
to track active use of locks, and then async references to locks by
holders and requests are tracked separately.
Our lock service's rules are a bit simpler in that we'll only ever send
one request at a time and the server will only ever send one request at
a time. We do have to do a bit of work to make sure we process back to
back grant responses and invalidation requests from the server.
As of this change the lock setup and destruction paths are a little
wobbly. They'll be shored up as we add lock recovery between the client
and server.
Signed-off-by: Zach Brown <zab@versity.com>
Add the core lock server code for providing a lock service from our
server. The lock messages are wired up but nothing calls them.
Signed-off-by: Zach Brown <zab@versity.com>
Add a quorum election implementation. The mounts that can participate
in the election are specified in a quorum config array in the super
block. Each configured participant is assigned a preallocated block
that it can write to.
All mounts read the quorum blocks to find the member who was elected the
leader and should be running the server. The voting mounts loop reading
voting blocks and writing their vote block until someone is elected with
a majority.
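
The majority test at the heart of the voting loop is sketched below; the
block fields and counts are simplified stand-ins for the real quorum
format.

    #include <stdbool.h>
    #include <stdint.h>

    /* simplified view of one participant's vote */
    struct vote_view {
            uint64_t term;          /* voting round */
            int      voted_for;     /* quorum slot the vote is for */
    };

    /* has 'slot' collected votes from more than half the voters? */
    static bool elected_by_majority(const struct vote_view *votes,
                                    int nr_voters, uint64_t term, int slot)
    {
            int count = 0;
            int i;

            for (i = 0; i < nr_voters; i++) {
                    if (votes[i].term == term && votes[i].voted_for == slot)
                            count++;
            }

            return count > nr_voters / 2;
    }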
Nothing calls this code yet, this adds the initial implementation and
format.
Signed-off-by: Zach Brown <zab@versity.com>
We had scattered some base types throughout the format file which made
them annoying to reference in higher level structs. Let's put them at
the top so we can use them without declarations or moving things around
in unrelated commits.
Signed-off-by: Zach Brown <zab@versity.com>
Each mount is given a specified unique name. It can be used to
identify a reconnecting mount, which indicates that an old instance with
the same unique name can no longer exist and doesn't need to be fenced.
Signed-off-by: Zach Brown <zab@versity.com>
Currently compaction is only performed by one thread running in the
server. Total metadata throughput of the system is limited by only
having one compaction operation in flight at a time.
This refactors the compaction code to have the server send compaction
requests to clients who then perform the compaction and send responses
to the server. This spreads compaction load out amongst all the clients
and greatly increases total compaction throughput.
The manifest keeps track of compactions that are in flight at a given
level so that we maintain segment count invariants with multiple
compactions in flight. It also uses the sparse bitmap to lock down
segments that are being used as inputs to avoid duplicating items across
two concurrent compactions.
A server thread still coordinates which segments are compacted. The
search for a candidate compaction operation is largely unchanged. It
now has to deal with being unable to process a compaction because its
segments are busy. We add some logic to keep searching in a level until
we find a compaction that doesn't intersect with current compaction
requests. If there are none at the level we move up to the next level.
The server will only issue a given number of compaction requests to a
client at a time. When it needs to send a compaction request it rotates
through the current clients until it finds one that doesn't have the max
in flight.
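
The rotation through clients can be pictured with the following sketch;
the per-client limit and structure are placeholders for the server's
real accounting.

    /* assumed per-client limit on outstanding compaction requests */
    #define MAX_COMPACTIONS_IN_FLIGHT 2

    struct compact_client {
            int in_flight;          /* requests sent, responses pending */
    };

    /*
     * Rotate through the connected clients, starting after the last
     * one used, and return the index of the first with room for
     * another request, or -1 if they're all at the limit.
     */
    static int next_compaction_client(struct compact_client *clients,
                                      int nr_clients, int last_used)
    {
            int i;

            for (i = 1; i <= nr_clients; i++) {
                    int idx = (last_used + i) % nr_clients;

                    if (clients[idx].in_flight < MAX_COMPACTIONS_IN_FLIGHT)
                            return idx;
            }

            return -1;
    }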
If a client disconnects the server forgets the compactions it had sent
to that client. If those compactions still need to be processed they'll
be sent to the next client.
The segnos that are allocated for compaction are not reclaimed if a
client disconnects or the server crashes. This is a known deficiency
that will be addressed with the broader work to add crash recovery to
the multiple points in the protocol where the server and client trade
ownership of persistent state.
The server needs to block as it does work for compaction in the
notify_up and response callbacks. We move them out from under spin
locks.
The server needs to clean up allocated segnos for a compaction request
that fails. We let the client send a data payload along with an error
response so that it can give the server the id of the compaction that
failed.
Signed-off-by: Zach Brown <zab@versity.com>
It was a bit of an overreach to try and limit duplicate request
processing in the network layer. It introduced acks and the necessity
to resync last_processed_id on reconnect.
In testing compaction requests we saw that request processing stopped if
a client reconnected to a new server. The new server sent low request
ids which the client dropped because they were lower than the ids it got
from the last server. To fix this we'd need to add smarts to reset
ids when connecting to new servers but not existing servers.
In thinking about this, though, there's a bigger problem. Duplicate
request processing protection only works up in memory in the networking
connections. If the server makes persistent changes, then crashes, the
client will resend the request to the new server. It will need to
discover that the persistent changes have already been made.
So while we protected duplicate network request processing between nodes
that reconnected, we didn't protect duplicate persistent side-effects
of request processing when reconnecting to a new server. Once you see
that the request implementations have to take this into account then
duplicate request delivery becomes a simpler instance of this same case
and will be taken care of already. There's no need to implement the
complexity of protecting duplicate delivery between running nodes.
This removes the last_processed_id on the server. It removes resending
of responses and acks. Now that ids can be processed out of order we
remove the special known ID of greeting commands. They can be processed
as usual. When there's only request and response packets we can
differentiate them with a flag instead of a u8 message type.
Signed-off-by: Zach Brown <zab@versity.com>
Keys used to be variable length so the manifest struct on the wire ended
in key payloads. The keys are now fixed size so that field is no longer
necessary or used. It's an artifact that should have been removed when
the keys were made fixed length.
Signed-off-by: Zach Brown <zab@versity.com>
Today node_ids are randomly assigned. This adds the risk of failure
from random number generation and still allows for the risk of
collisions.
Switch to assigning strictly advancing node_ids on the server during the
initial connection greeting message exchange. This simplifies the
system and allows us to derive information from the relative values of
node_ids in the system.
To do this we refactor the greeting code from being internal to the net
layer into proper client and server request and response processing.
This lets
the server manage persistent node_id storage and allows the client to
wait for a node_id during mount.
Now that net_connect is sync in the client we don't need the notify_up
callback anymore. The client can perform those duties when the connect
returns.
The net code still has to snoop on request and response processing to
see when the greetings have been exchanged and allow messages to flow.
Signed-off-by: Zach Brown <zab@versity.com>
We had a crc field in the segment header but weren't using it.
This calculates the crc on write and verifies it on read. The crc
covers the used bytes in the segment as indicated by the total_bytes
field.
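
Roughly, the checksum covers the used bytes after the crc field itself,
as sketched here with placeholder field names and a declared-only
crc32c helper.

    #include <stddef.h>
    #include <stdint.h>

    /* simplified segment header, the real field layout is in the format */
    struct seg_header {
            uint32_t crc;
            uint32_t _pad;
            uint64_t total_bytes;   /* used bytes in the segment */
    };

    uint32_t crc32c(uint32_t seed, const void *data, size_t len);   /* placeholder */

    /* crc everything after the crc field, up to total_bytes */
    static uint32_t calc_segment_crc(const struct seg_header *hdr)
    {
            size_t skip = offsetof(struct seg_header, _pad);

            return crc32c(~0U, (const char *)hdr + skip,
                          hdr->total_bytes - skip);
    }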
Signed-off-by: Zach Brown <zab@versity.com>
The client and server networking code was a bit too rudimentary.
The existing code only had support for the client synchronously and
actively sending requests that the server could only passively respond
to. We're going to need the server to be able to send requests to
connected clients and it can't block waiting for responses from each
one.
This refactors sending and receiving in both the client and server code
into shared networking code. It's built around a connection struct that
then holds the message state. Both peers on the connection can send
requests and send responses.
The existing code only retransmitted requests down newly established
connections. Requests could be processed twice.
This adds robust reliability guarantees. Requests are resent until
their response is received. Requests are only processed once by a given
peer, regardless of the connection's transport socket. Responses are
reliably resent until acknowledged.
This only adds the new refactored code and disables the old unused code
to keep the diff footprint minimal. A following commit will remove all
the unused code.
Signed-off-by: Zach Brown <zab@versity.com>
Freed file data extents are tracked in free extent items in each node.
They could only be re-used in the future for file data extent allocation
on that node. Allocations on other nodes or, critically, segment
allocation on the server could never see those free extents. With the
right allocation patterns, particularly allocating on node X and freeing
on node Y, all the free extents can build up on a node and starve other
allocations.
This adds a simple high water mark after which nodes start returning
free extents to the server. From there they can satisfy segment
allocations or be sent to other nodes for file data extent allocation.
Signed-off-by: Zach Brown <zab@versity.com>
The code that works with the super block had drifted a bit. We still
had two super blocks from an old design and we weren't doing anything
with the crc.
Move to only using one super block at a fixed blkno and store and verify
its crc field by sharing code with the btree block checksumming.
Signed-off-by: Zach Brown <zab@versity.com>
The previous fallocate and get_block allocators only looked for free
extents larger than the requested allocation size. This prematurely
returns -ENOSPC if a very large allocation is attempted. Some xfstests
stress low free space situations by fallocating almost all the free
space in the volume.
This adds an allocation helper function that finds the biggest free
extent to satisfy an allocation, possibly after trying to get more free
extents from the server. It looks for previous extents in the index of
extents by length. This builds on the previously added item and extent
_prev operations.
Allocators need to then know the size of the allocation they got instead
of assuming they got what they asked for. The server can also return a
smaller extent so it needs to communicate the extent length, not just
its start.
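
The helper's fallback behaviour amounts to something like this sketch;
the extent index call and the server request are placeholders for the
real item and net code.

    #include <errno.h>
    #include <stdint.h>

    struct alloc_ctx;       /* placeholder allocation context */

    /* placeholders: return 0 and fill start/got, -ENOENT if no free
     * extent was found, or another negative errno on failure */
    int find_biggest_extent(struct alloc_ctx *ctx, uint64_t *start,
                            uint64_t *got);
    int request_extents_from_server(struct alloc_ctx *ctx);

    /*
     * Find the biggest free extent for an allocation of len blocks,
     * possibly after asking the server for more free extents, rather
     * than returning -ENOSPC just because no single extent is large
     * enough.  Callers must use the returned *got, which may be
     * smaller than len.
     */
    static int alloc_data_extent(struct alloc_ctx *ctx, uint64_t len,
                                 uint64_t *start, uint64_t *got)
    {
            int ret;

            ret = find_biggest_extent(ctx, start, got);
            if (ret == 0 && *got >= len)
                    return 0;
            if (ret < 0 && ret != -ENOENT)
                    return ret;

            ret = request_extents_from_server(ctx);
            if (ret < 0)
                    return ret;

            return find_biggest_extent(ctx, start, got);
    }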
Signed-off-by: Zach Brown <zab@versity.com>
Add an fallocate operation.
This changes the possible combinations of flags in extents and makes it
possible to create extents beyond i_size. This will confuse the rest of
the code in a few places and that will be fixed up next.
Signed-off-by: Zach Brown <zab@versity.com>
The extent code was originally written to panic if it hit errors during
cleanup that resulted in inconsistent metadata. The more reasonable
strategy is to warn about the corruption, act accordingly, and leave
it to corrective measures to resolve the corruption. In this case we
continue returning the error that caused us to try and clean up.
Signed-off-by: Zach Brown <zab@versity.com>
Now that we have extents we can address the fragmentation of concurrent
writes with large preallocated unwritten extents instead of trying to
allocate from disjoint free space with cursors.
First we add support for unwritten extents. Truncate needs to make sure
it doesn't treat truncated unwritten blocks as online just because
they're not offline. If we try to write into them we convert them to
written extents. And fiemap needs to flag them as unwritten and be sure
to check for extents past i_size.
Then we allocate unwritten extents only if we're extending a contiguous
file. We try to preallocate the size of the file and cap it to a meg.
This ends up with a power of two progression of preallocation sizes,
which nicely balances extent sizes and wasted allocation as file sizes
increase.
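
The preallocation size choice can be sketched as follows; the one
megabyte cap comes from the text above, while the names, block size, and
exact rounding are illustrative only.

    #include <stdint.h>

    #define BLOCK_SHIFT 12                                      /* assumed 4KB blocks */
    #define PREALLOC_CAP_BLOCKS ((1024 * 1024) >> BLOCK_SHIFT)  /* cap at a meg */

    static inline uint64_t min_u64(uint64_t a, uint64_t b)
    {
            return a < b ? a : b;
    }

    /*
     * When extending a contiguous file at logical block 'iblock',
     * preallocate roughly the file's current size, capped at a meg.
     * Doubling the file each time gives the power of two progression
     * of preallocation sizes.
     */
    static uint64_t prealloc_blocks(uint64_t iblock)
    {
            return min_u64(iblock ? iblock : 1, PREALLOC_CAP_BLOCKS);
    }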
We need to be careful to truncate the preallocated regions if the entire
file is released. We take that as an indication that the user doesn't
want the file consuming any more space.
This removes most of the use of the cursor code. It will be completely
removed in a further patch.
Signed-off-by: Zach Brown <zab@versity.com>
Have the server use the extent core to maintain free extent items in the
allocation btree instead of the bitmap items.
We add a client request to allocate an extent of a given length. The
existing segment alloc and free now work with a segment's worth of
blocks.
The server maintains counters in the super block of free blocks instead
of free segments. We maintain an allocation cursor so that allocation
results tend to cycle through the device. It's stored in the super so
that it is maintained across server instances.
This doesn't remove unused dead code to keep the commit from getting too
noisy. It'll be removed in a future commit.
Signed-off-by: Zach Brown <zab@versity.com>
Store file data mappings and free block ranges in extents instead of in
block mapping items and bitmaps.
This adds the new functionality and refactors the functions that use it.
The old functions are no longer called, but we only ifdef them out
to keep the change small. We'll remove all the dead code in a future
change.
Signed-off-by: Zach Brown <zab@versity.com>
Add a file of extent functions that callers will use to manipulate and
store extents in different persistent formats.
Signed-off-by: Zach Brown <zab@versity.com>
Add functions that atomically change and query the online and offline
block counts as a pair. They're semantically linked and we shouldn't
present counts that don't match if they're in the process of being
updated.
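
A minimal sketch of keeping the pair coherent, using a plain mutex as a
stand-in for whatever lock the inode actually uses.

    #include <pthread.h>
    #include <stdint.h>

    /* simplified per-inode counts, protected by a single lock so the
     * pair can never be seen half-updated */
    struct block_counts {
            pthread_mutex_t lock;
            int64_t online;
            int64_t offline;
    };

    static void add_onoff(struct block_counts *c, int64_t on_delta,
                          int64_t off_delta)
    {
            pthread_mutex_lock(&c->lock);
            c->online += on_delta;
            c->offline += off_delta;
            pthread_mutex_unlock(&c->lock);
    }

    static void get_onoff(struct block_counts *c, int64_t *online,
                          int64_t *offline)
    {
            pthread_mutex_lock(&c->lock);
            *online = c->online;
            *offline = c->offline;
            pthread_mutex_unlock(&c->lock);
    }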
Signed-off-by: Zach Brown <zab@versity.com>
Add the max possible logical block / physical blkno number given u64
bytes recorded at block size granularity.
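
In other words, the largest block number that a u64 byte count can still
describe, as a quick sketch; the constant name and 4KB block size are
assumptions.

    #include <stdint.h>

    #define BLOCK_SHIFT 12      /* assumed 4KB blocks */

    /* the max possible logical block / physical blkno given u64 bytes
     * recorded at block size granularity */
    #define BLOCK_MAX (UINT64_MAX >> BLOCK_SHIFT)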
Signed-off-by: Zach Brown <zab@versity.com>
Add a tunable option to force using tiny btree blocks on an active
mount. This lets us quickly exercise large btrees.
Signed-off-by: Zach Brown <zab@versity.com>
Now that we're using small file system keys we can dramatically shrink
the maximum allowed btree keys and values. This more accurately matches
the current users and lets us fit more possible items in each block,
which lets us turn the block size way down and still have multiple worst
case largest items per block.
Signed-off-by: Zach Brown <zab@versity.com>
Variable length keys lead to having a key struct point to the buffer
that contains the key. With dirents and xattrs now using small keys we
can convert everyone to using a single key struct and significantly
simplify the system.
We no longer have a separate generic key buf struct that points to
specific per-type key storage. All items use the key struct and fill
out the appropriate fields. All the code that paired a generic key buf
struct and a specific key type struct is collapsed down to a key struct.
There's no longer the difference between a key buf that shares a
read-only key, has its own precise allocation, or has a max size
allocation for incrementing and decrementing.
Each key user now has an init function that fills out its fields. It
looks a lot like the old pattern but we no longer have separate key
storage that
the buf points to.
A bunch of code now takes the address of static key storage instead of
managing allocated keys. Conversely, swapping now uses the full keys
instead of pointers to the keys.
We don't need all the functions that worked on the generic key buf
struct because they had different lengths. Copy, clone, length init,
memcpy, all of that goes away.
The item API had some functions that tested the length of keys and
values. The key length tests vanish, and that gets rid of the _same()
call. The _same_min() call only had one user who didn't also test for
the value length being too large. Let's leave caller key constraints in
callers instead of trying to hide them on the other side of a bunch of
item calls.
We no longer have to track the number of key bytes when calculating if
an item population will fit in segments. This removes the key length
from reservations, transactions, and segment writing.
The item cache key querying ioctls no longer have to deal with variable
length keys. They simply specify the start key, the ioctls return the
number of keys copied instead of bytes, and the caller is responsible
for incrementing the next search key.
The segment no longer has to store the key length. It stores the key
struct in the item header.
The fancy variable length key formatting and printing can be removed.
We have a single format for the universal key struct. The SK_ wrappers
that bracketed calls to use preempt safe per cpu buffers can turn back
into their normal calls.
Manifest entries are now a fixed size. We can simply split them between
btree keys and values and initialize them instead of allocating them.
This means that level 0 entries don't have their own format that sorts
by the seq. They're sorted by the key like all the other levels.
Compaction needs to sweep all of them looking for the oldest and read
can stop sweeping once it can no longer overlap. This makes rare
compaction more expensive and common reading less expensive, which is
the right tradeoff.
Signed-off-by: Zach Brown <zab@versity.com>
Directory entries were the last items that had large variable length
keys because they stored the entry name in the key. We'd like to have
small fixed size keys so let's store dirents with small keys.
Entries for lookup are stored at the hash of the name instead of the
full name. The key also contains the unique readdir pos so that we
don't have to deal with collision on creation. The lookup procedure now
does need to iterate over all the readdir positions for the hash value
and compare the names.
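
The lookup walk described above has roughly this shape; the key fields
and item calls are placeholders for the real dirent item code.

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    /* simplified dirent value: header fields then the full name */
    struct dirent_val {
            uint64_t ino;
            uint16_t name_len;
            char     name[];
    };

    struct dir_ref;     /* placeholder for the directory inode */

    /* placeholder: return the next dirent item at (hash, >= *pos), or
     * false when there are no more entries for that hash */
    bool next_dirent_at_hash(struct dir_ref *dir, uint64_t hash,
                             uint64_t *pos, struct dirent_val **dent);

    /*
     * Entries are keyed by the hash of the name plus a unique readdir
     * pos, so lookup iterates the positions for the hash and compares
     * the names stored in the values.
     */
    static uint64_t lookup_ino(struct dir_ref *dir, uint64_t hash,
                               const char *name, size_t name_len)
    {
            struct dirent_val *dent;
            uint64_t pos = 0;

            while (next_dirent_at_hash(dir, hash, &pos, &dent)) {
                    if (dent->name_len == name_len &&
                        memcmp(dent->name, name, name_len) == 0)
                            return dent->ino;
                    pos++;
            }

            return 0;   /* not found */
    }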
Entries for link backref walking are stored with the entry's position in
the parent dir instead of the entry's name. The name is then stored in
the value. Inode to path conversion can still walk the backref items
without having to lookup dirent items.
These changes mean that all directory entry items are now stored at a
small key with some u64s (hash, pos, parent dir, etc) and have a value
with the dirent struct and full entry name. This lets us use the same
key and value format for the three entry key types. We no longer have
to allocate keys, we can store them on the stack.
We store the entry's hash and pos in the dirent struct in the item value
so that any item has all the fields to reference all the other item
keys. We store the same values in the dentry_info so that deletion
(unlink and rename) can find all the entries.
The ino_path ioctl can now much more clearly iterate over parent
directories and entry positions instead of oh so cleverly iterating over
null terminated names in the parent directories. The ioctl interface
structs and implementation become simpler.
Signed-off-by: Zach Brown <zab@versity.com>
Honoring the XATTR_REMOVE flag in xattr deletion exposed an interesting
bug in getxattr(). We were unconditionally returning the max xattr value
size when someone tried to probe an existing xattrs' value size by
calling getxattr with size == 0. Some kernel paths did this to probe
the existence of xattrs. They expected to get an error if the xattr
didn't exist, but we were giving them the max possible size. This
kernel path then tried to remove the xattrs with XATTR_REMOVE and that
now failed and caused a bunch of errors in xfstests.
The fix is to return the real xattr value size when getxattr is called
with size == 0. To do that with the old format we'd have to iterate
over all the items which happened to be pretty awkward in the current
code paths.
So we're taking this opportunity to land a change that had been brewing
for a while. We now form the xattr keys from the hash of the name and
the item values now store a logically contiguous header, the name, and the
value. This makes it very easy for us to have the full xattr value
length in the header and return it from getxattr when size == 0.
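
The new item value layout amounts to something like this, with
hypothetical field names; the point is that the full value length is
right there in the header so a getxattr with size == 0 can return it
directly.

    #include <stdint.h>

    /*
     * Sketch of an xattr item value: a small header followed by the
     * name and then the value, all logically contiguous.
     */
    struct xattr_val_header {
            uint16_t name_len;
            uint16_t val_len;       /* full value length for size == 0 probes */
            uint8_t  payload[];     /* name bytes then value bytes */
    };

    /* what getxattr(name, NULL, 0) returns once the item is found */
    static inline uint16_t xattr_probe_size(const struct xattr_val_header *hdr)
    {
            return hdr->val_len;
    }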
Now all tests pass while honoring the XATTR_CREATE and XATTR_REMOVE
flags.
And the code is a whole lot easier to follow. And we've removed another
barrier for moving to small fixed size keys.
Signed-off-by: Zach Brown <zab@versity.com>
We weren't doing anything with the inode blocks field. We weren't even
initializing it which explains why we'd sometimes see garbage i_blocks
values in scoutfs inodes in segments.
The logical blocks field reflects the contents of the file regardless of
whether it's online or not. It's the sum of our online and offline block
tracking.
So we can initialize it to our persistent online and offline counts and
then keep it in sync as blocks are allocated and freed.
Signed-off-by: Zach Brown <zab@versity.com>
We aren't using the size index. It has runtime and code maintenance
costs that aren't worth paying. Let's remove it.
Removing it from the format and no longer maintaining it are
straightforward.
The bulk of this patch is actually the act of removing it from the index
locking functions. We no longer have to predict the size that will be
stored during the transaction to lock the index items that will be
created during the transaction. A bunch of code to predict the size and
then pass it into locking and transactions goes away. Like other inode
fields we now update the size as it changes.
Signed-off-by: Zach Brown <zab@versity.com>
This is implemented by filling in our export ops functions.
When we get those right, the VFS handles most of the details for us.
Internally, scoutfs handles are two u64's (ino and parent ino) and a
type which indicates whether the handle contains the parent ino or not.
Surprisingly enough, no existing type matches this pattern so we use our
own types to identify the handle.
Most of the export ops are self-explanatory. scoutfs_encode_fh() takes
an inode and an optional parent and encodes those into the smallest
handle that would fit. scoutfs_fh_to_[dentry|parent] turn an existing
file handle into a dentry.
scoutfs_get_parent() is a bit different and would be called on
directory inodes to connect a disconnected dentry path. For
scoutfs_get_parent(), we can export add_next_linkref() and use the backref
mechanism to quickly find a parent directory.
scoutfs_get_name() is almost identical to scoutfs_get_parent(). Here we're
linking an inode to a name which exists in the parent directory. We can also
use add_next_linkref, and simply copy the name from the backref.
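
Wiring the ops up looks like the usual export_operations table; a sketch
assuming the handlers keep the names used above and a kernel recent
enough to use the inode-based encode_fh() prototype, with the handler
bodies omitted.

    #include <linux/exportfs.h>

    /* the handlers described above; bodies omitted in this sketch */
    int scoutfs_encode_fh(struct inode *inode, __u32 *fh, int *max_len,
                          struct inode *parent);
    struct dentry *scoutfs_fh_to_dentry(struct super_block *sb,
                                        struct fid *fid, int fh_len,
                                        int fh_type);
    struct dentry *scoutfs_fh_to_parent(struct super_block *sb,
                                        struct fid *fid, int fh_len,
                                        int fh_type);
    struct dentry *scoutfs_get_parent(struct dentry *child);
    int scoutfs_get_name(struct dentry *parent, char *name,
                         struct dentry *child);

    const struct export_operations scoutfs_export_ops = {
            .encode_fh      = scoutfs_encode_fh,
            .fh_to_dentry   = scoutfs_fh_to_dentry,
            .fh_to_parent   = scoutfs_fh_to_parent,
            .get_parent     = scoutfs_get_parent,
            .get_name       = scoutfs_get_name,
    };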
As a result of this patch we can also now export scoutfs file systems
via NFS, however testing NFS thoroughly is outside the scope of this
work so export support should be considered experimental at best.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
[zab edited <= NAME_MAX]
The augmenting of the btree to track items with bits set was too fiddly
for its own good. We were able to migrate old btree blocks with a
simple stored key instead, which also fixed livelocks that occurred as
the parent and item bits got out of sync. This is now unused buggy code
that can be
removed.
Signed-off-by: Zach Brown <zab@versity.com>