Add a checker that can walk blocks from the super block to make sure
that all metadata block numbers are accounted for. This initial version
isn't suitable for use without further refinement, but we keep it
compiling so that it keeps up with structure changes.
Signed-off-by: Zach Brown <zab@versity.com>
The _READ_XATTR_TOTALS ioctl had manual code for merging the .totl.
total and value while reading fs items. We're going to want to do this
in another reader so let's put these in their own functions that clearly
isolate the logic of merging the fs items into a coherent result.
We can get rid of some of the totl_read_ counters that tracked which
items we were merging. They weren't adding much value and conflated the
reading ioctl interface with the merging logic.
Signed-off-by: Zach Brown <zab@versity.com>
Add the weak item cache that is used for reads that can handle results
being a little behind. This gives us a lot more freedom to implement a
cache that is biased towards concurrent reads.
Signed-off-by: Zach Brown <zab@versity.com>
The existing stat_more and setattr_more interfaces aren't extensible.
This solves that problem by adding attribute interfaces that explicitly
specify which fields to work with.
We're about to add a few more inode fields and it makes sense to add
them to this extensible structure rather than adding more ioctls or
relatively clumsy xattrs. This is modeled loosely on the upstream
kernel's statx support.
The ioctl entry points call core functions so that we can also implement
the existing stat_more and setattr_more interfaces in terms of these new
attr_x functions.
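A rough sketch of the shape such an interface could take (the struct,
field, and flag names here are illustrative assumptions, not the actual
format):

    /* hypothetical mask-based attr struct, in the spirit of statx(2) */
    struct scoutfs_ioctl_inode_attr_x {
            __u64 x_mask;           /* which fields are requested/valid */
            __u64 ino;
            __u64 data_version;
            __u64 online_blocks;
            __u64 offline_blocks;
    };

    #define SCOUTFS_ATTR_X_DATA_VERSION     (1ULL << 0)
    #define SCOUTFS_ATTR_X_ONLINE_BLOCKS    (1ULL << 1)
    #define SCOUTFS_ATTR_X_OFFLINE_BLOCKS   (1ULL << 2)

Unknown mask bits can be rejected with EINVAL, which is what lets us
add fields later without new ioctls.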
Signed-off-by: Zach Brown <zab@versity.com>
Move to the more recent interfaces for counting and scanning cached
objects to shrink.
Signed-off-by: Zach Brown <zab@versity.com>
Signed-off-by: Auke Kok <auke.kok@versity.com>
Add support for the POSIX ACLs as described in acl(5). Support is
enabled by default and can be explicitly enabled or disabled with the
acl or noacl mount options, respectively.
Signed-off-by: Zach Brown <zab@versity.com>
Introduce global volume options. They're stored in the superblock and
can be seen in sysfs files that use network commands to get and
set the options on the server.
Signed-off-by: Zach Brown <zab@versity.com>
Today an inode's items are deleted once its nlink reaches zero and the
final iput is called in a local mount. This can delete an inode out
from under other mounts that opened it before it was unlinked.
We fix this by adding cached inode tracking. Each mount maintains
groups of cached inode bitmaps at the same granularity as inode locking.
As a mount performs its final iput it gets a bitmap from the server
which indicates if any other mount has inodes in the group open.
This means the two fast paths, opening and closing linked files and
deleting a file that was unlinked locally, only pay a moderate cost:
either maintaining the bitmap locally or getting the open map once per
lock group. Removing many files in a group will only lock and get the
open map once per group.
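A minimal sketch of the per-group tracking (the struct, constant, and
field names here are assumptions, not the real format):

    #define INODES_PER_GROUP 64     /* illustrative lock group size */

    /* one bitmap of locally cached inodes per inode lock group */
    struct cached_ino_group {
            struct rb_node node;    /* indexed by group number */
            u64 group_nr;           /* ino / INODES_PER_GROUP */
            unsigned long bits[BITS_TO_LONGS(INODES_PER_GROUP)];
    };

On final iput the mount asks the server for the group's open map and
only deletes the inode's items if no other mount has the inode's bit
set.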
Signed-off-by: Zach Brown <zab@versity.com>
Add a little set of functions to help the server track which clients are
waiting to recover which state. The open map messages need to wait for
recovery, so recovery tracking can no longer live only in the lock
server.
Signed-off-by: Zach Brown <zab@versity.com>
Instead of hashing headers, define an interop version. Do not mount
superblocks that have a different version, either higher or lower.
Since this is pretty much the same as the format hash except it's a
constant, minimal code changes are needed.
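The mount-time check then amounts to something like this (the helper
and field names are approximations):

    if (le64_to_cpu(super->version) != SCOUTFS_INTEROP_VERSION) {
            scoutfs_err(sb, "super version %llu, module supports %llu",
                        le64_to_cpu(super->version),
                        (u64)SCOUTFS_INTEROP_VERSION);
            return -EINVAL;
    }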
The initial dev version is 0, with the intent that it will be bumped to
1 immediately prior to tagging the initial release.
Update README. Fix comments.
Add interop version to notes and modinfo.
Signed-off-by: Andy Grover <agrover@versity.com>
Instead, explicitly add padding fields, and adjust member ordering to
eliminate compiler-added padding between members and at the end of the
struct (if possible: some structs end in a u8[0] array).
This should prevent unaligned accesses. Not a big deal on x86_64, but
other archs like aarch64 really want this.
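The pattern looks like this (illustrative struct):

    struct scoutfs_example {
            __le64 big;
            __le32 small;
            __le32 _pad;    /* explicit, instead of compiler tail padding */
    };

With members ordered largest-first and the tail padded out explicitly,
the layout is identical on every arch and every member stays naturally
aligned.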
Signed-off-by: Andy Grover <agrover@versity.com>
Add an allocator which uses btree items to store extents. Both the
client and server will use this for btree blocks, the client will use it
for srch blocks and data extents, and the server will move extents
between the core fs allocator btree roots and the clients' roots.
Signed-off-by: Zach Brown <zab@versity.com>
Add infrastructure for working with extents. Callers provide callbacks
which operate on their extent storage while this code performs the
fiddly splitting and merging of extents. This layer doesn't have any
persistent structures itself; it only operates on native structs in
memory.
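A sketch of the layering (struct and member names are assumptions):

    struct scoutfs_extent {
            u64 start;      /* first logical block */
            u64 len;
            u64 map;        /* mapped physical block, if any */
            u8 flags;
    };

    /* callers implement the storage, this layer splits and merges */
    struct scoutfs_extent_io_ops {
            int (*next)(struct super_block *sb, void *arg,
                        struct scoutfs_extent *ext);
            int (*insert)(struct super_block *sb, void *arg,
                          struct scoutfs_extent *ext);
            int (*remove)(struct super_block *sb, void *arg,
                          struct scoutfs_extent *ext);
    };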
Signed-off-by: Zach Brown <zab@versity.com>
Add an item cache between fs callers and the forest of btrees. Calling
out to the btrees for every item operation was far too expensive. This
gives us a flexible in-memory structure for working with items that
isn't bound by the constraints of persistent block IO. We can stream
large groups of items to and from the btrees on rare occasions and then
use efficient kernel memory structures for more frequent item
operations.
This adds the infrastructure, nothing is calling it yet.
Signed-off-by: Zach Brown <zab@versity.com>
This introduces the srch mechanism that we'll use to accelerate finding
files based on the presence of a given named xattr. This is an
optimized version of the initial prototype that was using locked btree
items for .indx. xattrs.
This is built around specific compressed data structures, an operation
cost model that matches the reality of orders of magnitude more writers
than readers, and a relaxed locking model. With all of this combined,
maintaining the xattrs no longer tanks creation rates while still
providing excellent search latencies, given that searches are defined
as rare and relatively expensive.
The core data type is the srch entry which maps a hashed name to an
inode number. Mounts can append entries to the end of unsorted log
files during their transaction. The server tracks these files and
rotates them into a list of files as they get large enough. Mounts have
compaction work that regularly asks the server for a set of files to
read and combine into a single sorted output file. The server only
initiates compactions when it sees a number of files of roughly the same
size. Searches then walk all the committed srch files, both log files
and sorted compacted files, looking for entries that associate an xattr
name with an inode number.
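The entry itself is tiny, which is what keeps appending and compacting
cheap; roughly (field names are assumptions):

    struct scoutfs_srch_entry {
            __le64 hash;    /* hash of the xattr name */
            __le64 ino;     /* inode that has the xattr */
            __le64 id;      /* orders instances of the same pair */
    };

A search hashes the queried name and gathers the inode numbers of
matching entries from every file it walks.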
Signed-off-by: Zach Brown <zab@versity.com>
Convert metadata block and file data extent allocations to use the radix
allocator.
Most of this is simple transitions between types and calls. The server
no longer has to initialize blocks because mkfs can write a single
radix parent block with fully set parent refs to initialize a full
radix. We remove the code and fields that were responsible for adding
uninitialized data and metadata.
The rest of the unused block allocator code is only ifdefed out. It'll
be removed in a separate patch to reduce noise here.
Signed-off-by: Zach Brown <zab@versity.com>
Add the allocator that uses bits stored in the leaves of a cow radix.
It'll replace two metadata and data allocators that were previously
storing allocation bitmap fragments in btree items.
Signed-off-by: Zach Brown <zab@versity.com>
The btree forest item storage doesn't have as much item granular state
as the item cache did. The item cache could tell if a cached item was
populated from persistent storage or was created in memory. It could
simply remove created items rather than leaving behind a deletion item.
The btree forest item storage mechanism, working with cached btree
blocks, can't do this. It has to create deletion items when deleting
newly created items because it doesn't know whether the item already
exists in the persistent record.
This created a problem with the extent storage we were using. The
individual extent items were stored with a key set to the last logical
block of their extent. As extents grew or shrank they often were
deleted and created at different key values during a transaction. In
the btree forest log trees this left a huge stream of deletion items
behind, one for every previous version of the extent. Then searches for
an extent covering a block would have to skip over all these deleted
items before hitting the current stored extent.
Streaming writes would perform O(n) work for every extent operation,
which got out of hand. This large change solves the problem by using
more coarse and stable item storage to track free blocks and blocks
mapped into file data.
For file data we now have large packed extent items which store packed
representations of all the logical mappings of a fixed region of a file.
The data code has loading and storage functions which transfer that
persistent version to and from the version that is modified in memory.
Free blocks are stored in bitmaps that are similarly efficiently packed
into fixed size items. The client is no longer working with free extent
items managed by the forest, it's working with free block bitmap btrees
directly. It needs access to the client's metadata block allocator and
block write contexts so we move those two out of the forest code and up
into the transaction.
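A sketch of the two stable item shapes (names and sizes here are
assumptions):

    #define PACKED_EXTENT_BYTES 192 /* illustrative */
    #define FREE_BITMAP_U64S 8      /* illustrative */

    /* all mappings within a fixed region of a file's logical blocks */
    struct packed_extent_item {
            __le64 first_logical;
            __u8 packed[PACKED_EXTENT_BYTES];
    };

    /* one bit per free block in a fixed run of device blocks */
    struct free_bitmap_item {
            __le64 base_blkno;
            __le64 bits[FREE_BITMAP_U64S];
    };

Because both items live at stable keys, changing an extent dirties an
existing item in place instead of deleting and recreating items at
shifting key values.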
Previously the client and server would exchange extents with network
messages. Now the roots of the btrees that store the free block bitmap
items are communicated along with the roots of the other trees involved
in a transaction. The client doesn't need to send free extents back to
the server so we can remove those tasks and rpcs.
The server no longer has to manage free extents. It transfers block
bitmap items between trees around commits. All of its extent
manipulation can be removed.
The item size portion of transaction item counts is removed because
we're not using that level of granularity now that metadata transactions
are dirty btree blocks instead of dirty items we pack into fixed size
segments.
Signed-off-by: Zach Brown <zab@versity.com>
The forest code presents a consistent item interface that's implemented
on top of a forest of persistent btrees.
Signed-off-by: Zach Brown <zab@versity.com>
Add the core lock server code for providing a lock service from our
server. The lock messages are wired up but nothing calls them.
Signed-off-by: Zach Brown <zab@versity.com>
Add a quorum election implementation. The mounts that can participate
in the election are specified in a quorum config array in the super
block. Each configured participant is assigned a preallocated block
that it can write to.
All mounts read the quorum blocks to find the member who was elected the
leader and should be running the server. The voting mounts loop reading
voting blocks and writing their vote block until someone is elected with
a majority.
Nothing calls this code yet; this adds the initial implementation and
format.
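A sketch of what a participant's quorum block might hold (illustrative
only, not the real format):

    struct quorum_block {
            __le64 fsid;
            __le64 term;        /* election round, increases over time */
            __u8 voter_nr;      /* configured slot of the writer */
            __u8 vote_for;      /* slot this participant votes for */
            __u8 is_leader;     /* set once elected by a majority */
    };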
Signed-off-by: Zach Brown <zab@versity.com>
Reformat the scoutfs-y object list so that there's one object per line.
Diffs now clearly demonstrate what is changing instead of having word
wrapping constantly obscuring changes in the built objects.
(Did everyone spot the scoutfs_trace sorting mistake? Another reason
not to mash everything into wrapped lines :)).
Signed-off-by: Zach Brown <zab@versity.com>
The previous commit added shared networking code and disabled the old
unused code. This removes all that unused client and server code that
was refactored to become the shared networking code.
Signed-off-by: Zach Brown <zab@versity.com>
The client and server networking code was a bit too rudimentary.
The existing code only had support for the client synchronously and
actively sending requests that the server could only passively respond
to. We're going to need the server to be able to send requests to
connected clients and it can't block waiting for responses from each
one.
This refactors sending and receiving in both the client and server code
into shared networking code. It's built around a connection struct that
then holds the message state. Both peers on the connection can send
requests and send responses.
The existing code only retransmitted requests down newly established
connections. Requests could be processed twice.
This adds robust reliability guarantees. Requests are resent until
their response is received. Requests are only processed once by a given
peer, regardless of the connection's transport socket. Responses are
reliably resent until acknowledged.
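The core of those rules can be sketched like so (names are
assumptions):

    struct conn_info {
            u64 next_send_id;       /* ids survive socket reconnection */
            u64 greatest_recvd_id;  /* highest request id processed */
            struct list_head resend_queue;  /* sends awaiting response */
    };

    /* on receiving a request whose id has already been processed, the
     * saved response is resent rather than processing it again */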
This only adds the new refactored code and disables the old unused code
to keep the diff footprint minimal. A following commit will remove all
the unused code.
Signed-off-by: Zach Brown <zab@versity.com>
The code that works with the super block had drifted a bit. We still
had two super blocks from an old design and we weren't doing anything
with the crc.
Move to only using one super block at a fixed blkno and store and verify
its crc field by sharing code with the btree block checksumming.
Signed-off-by: Zach Brown <zab@versity.com>
The userspace trace event printing code has trouble with arguments that
refer to fields in entries. Add macros to make entries for all the
fields and use them as the formatted arguments.
We also remove the mapping of zone and type to strings. It's smaller to
print the values directly and gets rid of some silly code.
Signed-off-by: Zach Brown <zab@versity.com>
Add a file of extent functions that callers will use to manipulate and
store extents in different persistent formats.
Signed-off-by: Zach Brown <zab@versity.com>
Originally the item interfaces were written with full support for
vectored keys and values. Callers constructed keys and values made up
of header structs and data buffers. Segments supported much larger
values which could span pages when stored in memory.
But over time we've pulled that support back. Keys are described by a
key struct instead of a multi-element kvec. Values are now much smaller
and don't span pages. The item interfaces still use the kvec arrays but
everyone only uses a single element.
So let's make the world a whole lot less awful by having the item
interfaces support only a single value buffer specified by a kvec. A
bunch of code disappears and the result is much easier to understand.
Signed-off-by: Zach Brown <zab@versity.com>
We had an excessive number of layers between scoutfs and the dlm code in
the kernel. We had dlmglue, the scoutfs locks, and task refs. Each
layer had structs that track the lifetime of the layer below it. We
were about to add another layer to hold on to locks just a bit longer so
that we can avoid down conversion and transaction commit storms under
contention.
This collapses all those layers into a simple state machine in lock.c that
manages the mode of dlm locks on behalf of the file system.
The users of the lock interface are mainly unchanged. We did change
from a heavier trylock to a lighter nonblocking lock attempt, which
required changing the single rare readpage use. Lock fields change, so
a few external users of those fields change as well.
This not only removes a lot of code, it also contains functional
improvements. For example, it can now convert directly to CW locks with
a single lock request instead of having to use two by first converting
to NL.
It introduces the concept of an unlock grace period. Locks won't be
dropped on behalf of other nodes soon after being unlocked, so that
local tasks can batch up work before the other node gets its turn.
This can result in two orders of magnitude improvements in the time it
takes to, say, change a set of xattrs on the same file population from
two nodes concurrently.
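The grace check itself is simple; a sketch with invented names:

    static bool lock_in_grace_period(struct scoutfs_lock *lock)
    {
            /* ignore remote requests until a grace period after the
             * last local unlock has passed */
            return time_before(jiffies, lock->last_unlock_jiffies +
                                        SCOUTFS_LOCK_GRACE_JIFFIES);
    }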
There are significant changes to trace points, counters, and debug files
that follow the implementation changes.
Signed-off-by: Zach Brown <zab@versity.com>
This is implemented by filling in our export ops functions.
When we get those right, the VFS handles most of the details for us.
Internally, scoutfs handles are two u64's (ino and parent ino) and a
type which indicates whether the handle contains the parent ino or not.
Surprisingly enough, no existing type matches this pattern so we use our
own types to identify the handle.
Most of the export ops are self explanatory: scoutfs_encode_fh() takes
an inode and an optional parent and encodes those into the smallest
handle that will fit. scoutfs_fh_to_[dentry|parent] turn an existing
file handle into a dentry.
scoutfs_get_parent() is a bit different and would be called on
directory inodes to connect a disconnected dentry path. For
scoutfs_get_parent(), we can export add_next_linkref() and use the backref
mechanism to quickly find a parent directory.
scoutfs_get_name() is almost identical to scoutfs_get_parent(). Here we're
linking an inode to a name which exists in the parent directory. We can also
use add_next_linkref, and simply copy the name from the backref.
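All of this is wired up through the kernel's export_operations, with
the handlers being the functions described above:

    static const struct export_operations scoutfs_export_ops = {
            .encode_fh      = scoutfs_encode_fh,
            .fh_to_dentry   = scoutfs_fh_to_dentry,
            .fh_to_parent   = scoutfs_fh_to_parent,
            .get_parent     = scoutfs_get_parent,
            .get_name       = scoutfs_get_name,
    };

and sb->s_export_op pointed at it at mount time.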
As a result of this patch we can also now export scoutfs file systems
via NFS, however testing NFS thoroughly is outside the scope of this
work so export support should be considered experimental at best.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
[zab edited <= NAME_MAX]
I wanted to add a sysfs file that exports the fsid for the mount of a
given device. But our use of sysfs was confusing and spread through
super.c and counters.c.
This moves the core of our sysfs use to sysfs.c. Instead of defining
the per-mount dir as a kset we define it as an object with attributes
which gives us a place to add an fsid attribute.
counters still have their own whack of sysfs implementation. We'll let
them keep it for now, but we could move it into sysfs.c. It's just
counter iteration around the insane sysfs obj/attr/type nonsense. For
now it just needs to know to add its counters dir as a child of the
per-mount dir instead of adding it to the kset.
Signed-off-by: Zach Brown <zab@versity.com>
Calculate the hash of format.h and ioctl.h and make sure the hash stored
in the super during mkfs matches our calculated hash on mount.
Signed-off-by: Zach Brown <zab@versity.com>
Add some functions for storing and using per-task storage in a list.
Callers can use this to pass pointers to children in a given scope when
interfaces don't allow for passing individual arguments amongst
concurrent callers in the scope.
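A sketch of the idea (names are assumptions, locking elided):

    struct scoutfs_per_task_entry {
            struct list_head head;
            struct task_struct *task;
            void *ptr;
    };

    /* a caller deeper in the scope finds the pointer its task stored */
    static void *scoutfs_per_task_get(struct list_head *list)
    {
            struct scoutfs_per_task_entry *ent;

            list_for_each_entry(ent, list, head) {
                    if (ent->task == current)
                            return ent->ptr;
            }
            return NULL;
    }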
Signed-off-by: Zach Brown <zab@versity.com>
With trylock implemented we can add locking in readpage. After that it's
pretty easy to implement our own read/write functions which at this
point are more or less wrapping the kernel helpers in the correct
cluster locking.
Data invalidation is a bit interesting. If the lock we are invalidating
is an inode group lock, we use the lock boundaries to incrementally
search our inode cache. When an inode struct is found, we sync and
(optionally) truncate pages.
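That incremental search could look roughly like this (the lookup helper
and range variables are invented for illustration):

    u64 ino = lock_start_ino;
    struct inode *inode;

    /* find each cached inode in the lock's range, sync it, and
     * optionally drop its pages */
    while ((inode = scoutfs_ilookup_next(sb, ino, lock_end_ino))) {
            filemap_write_and_wait(inode->i_mapping);
            if (invalidate)
                    truncate_inode_pages(inode->i_mapping, 0);
            ino = scoutfs_ino(inode) + 1;
            iput(inode);
    }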
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
[zab: adapted to newer lock call, fixed some error handling]
Signed-off-by: Zach Brown <zab@versity.com>
Dlmglue is built on top of this. Bring in the portions we need, which
include the stackglue API as well as most of the fs/dlm implementation.
I left off the Ocfs2 specific version and connection handling. Also
left out is the old Ocfs2 dlm support which we'll never want.
Like dlmglue, we keep as much of the generic stackglue code intact
here. This will make translating to/from upstream patches much easier.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
The networking code was really suffering from trying to combine the
client and server processing paths into one file. The code can be a lot
simpler if we give the client and server their own processing paths that
take their different socket lifecycles into account.
The client maintains a single connection. Blocked senders work on the
socket under a sending mutex. The recv path runs in work that can be
canceled after first shutting down the socket.
A long running server work function acquires the listener lock, manages
the listening socket, and accepts new sockets. Each accepted socket has
a single recv work blocked waiting for requests. That then spawns
concurrent processing work which sends replies under a sending mutex.
All of this is torn down by shutting down sockets and canceling work
which frees its context.
All this restructuring makes it a lot easier to track what is happening
in mount and unmount between the client and server. This fixes bugs
where unmount was failing because the monolithic socket shutdown
function was queueing other work while it was draining.
Signed-off-by: Zach Brown <zab@versity.com>