Commit Graph

87 Commits

Zach Brown
70d6f3b042 Add initial metadata ref checking
Add a checker that can walk blocks from the super block to make sure
that all metadata block numbers are accounted for.  This initial version
isn't suitable for use without further refinement, but we keep it
compiling to keep up with structures.

Signed-off-by: Zach Brown <zab@versity.com>
2025-11-03 14:16:52 -06:00
Zach Brown
38e6f11ee4 Add quota support
Signed-off-by: Zach Brown <zab@versity.com>
2024-06-28 15:09:05 -07:00
Zach Brown
ee9e8c3e1a Extract .totl. item merging into own functions
The _READ_XATTR_TOTALS ioctl had manual code for merging the .totl.
total and value while reading fs items.  We're going to want to do this
in another reader so let's put these in their own functions that clearly
isolate the logic of merging the fs items into a coherent result.

We can get rid of some of the totl_read_ counters that tracked which
items we were merging.  They weren't adding much value and conflated the
reading ioctl interface with the merging logic.

Signed-off-by: Zach Brown <zab@versity.com>
2024-06-28 15:09:05 -07:00
Zach Brown
3a51ca369b Add the weak item cache
Add the weak item cache that is used for reads that can handle results
being a little behind.  This gives us a lot more freedom to implement
the cache that biases concurrent reads.

Signed-off-by: Zach Brown <zab@versity.com>
2024-06-28 15:09:05 -07:00
Zach Brown
6a99ca9ede Add attr_x core and ioctls
The existing stat_more and setattr_more interfaces aren't extensible.
This solves that problem by adding attribute interfaces which let
callers specify the exact fields to work with.

We're about to add a few more inode fields and it makes sense to add
them to this extensible structure rather than adding more ioctls or
relatively clumsy xattrs.  This is modeled loosely on the upstream
kernel's statx support.

The ioctl entry points call core functions so that we can also implement
the existing stat_more and setattr_more interfaces in terms of these new
attr_x functions.

Signed-off-by: Zach Brown <zab@versity.com>
2024-06-28 14:53:49 -07:00
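The field-mask idea this commit describes can be sketched in a few lines. Everything here is illustrative (the flag names, struct layout, and `attr_x_apply()` helper are made up for the example, not scoutfs's real ABI):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical field-mask flags, modeled on the statx() style the
 * commit mentions -- names are illustrative, not scoutfs's ABI. */
#define ATTR_X_CTIME        (1u << 0)
#define ATTR_X_CRTIME       (1u << 1)
#define ATTR_X_DATA_VERSION (1u << 2)

struct attr_x {
    uint32_t x_mask;        /* which fields below are valid */
    uint64_t ctime_sec;
    uint64_t crtime_sec;
    uint64_t data_version;
};

/* Core setter: apply only the fields named by the mask, so new
 * fields can be added later without new ioctls. */
static void attr_x_apply(struct attr_x *dst, const struct attr_x *src)
{
    if (src->x_mask & ATTR_X_CTIME)
        dst->ctime_sec = src->ctime_sec;
    if (src->x_mask & ATTR_X_CRTIME)
        dst->crtime_sec = src->crtime_sec;
    if (src->x_mask & ATTR_X_DATA_VERSION)
        dst->data_version = src->data_version;
    dst->x_mask |= src->x_mask;
}
```

Because callers name the fields they touch, the older stat_more/setattr_more entry points can be expressed as fixed-mask calls into the same core function.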
Zach Brown
cca4fcb788 Use count/scan objects shrinking interface
Move to the more recent interfaces for counting and scanning cached
objects to shrink.

Signed-off-by: Zach Brown <zab@versity.com>
Signed-off-by: Auke Kok <auke.kok@versity.com>
2023-10-09 15:35:40 -04:00
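The count/scan split this commit moves to can be modeled in userspace. This is a schematic of the pattern only, with made-up names, not the kernel's actual `struct shrinker` API: counting reports how many cached objects are reclaimable, while scanning frees up to a requested number and reports how many it actually freed.

```c
#include <assert.h>

/* Toy cache standing in for a shrinkable object cache. */
struct mock_cache {
    unsigned long nr_cached;
};

/* "count_objects": cheap estimate of how much could be freed. */
static unsigned long count_objects(struct mock_cache *c)
{
    return c->nr_cached;
}

/* "scan_objects": free up to nr_to_scan objects, return the number
 * actually freed so the caller can track reclaim progress. */
static unsigned long scan_objects(struct mock_cache *c,
                                  unsigned long nr_to_scan)
{
    unsigned long freed =
        nr_to_scan < c->nr_cached ? nr_to_scan : c->nr_cached;

    c->nr_cached -= freed;
    return freed;
}
```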
Zach Brown
29538a9f45 Add POSIX ACL support
Add support for the POSIX ACLs as described in acl(5).  Support is
enabled by default and can be explicitly enabled or disabled with the
acl or noacl mount options, respectively.

Signed-off-by: Zach Brown <zab@versity.com>
2022-09-28 10:36:10 -07:00
Zach Brown
b060eb4f5d Add fencing subsystem
Add the subsystem which tracks pending fence requests and exposes them
to userspace for processing.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:25 -07:00
Zach Brown
0aa6005c99 Add volume options super, server, and sysfs
Introduce global volume options.  They're stored in the superblock and
can be seen in sysfs files that use network commands to get and
set the options on the server.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-19 14:15:06 -07:00
Zach Brown
22371fe5bd Fully destroy inodes after all mounts evict
Today an inode's items are deleted once its nlink reaches zero and the
final iput is called in a local mount.  This can delete inodes from
under other mounts which have opened the inode before it was unlinked on
another mount.

We fix this by adding cached inode tracking.  Each mount maintains
groups of cached inode bitmaps at the same granularity as inode locking.
As a mount performs its final iput it gets a bitmap from the server
which indicates if any other mount has inodes in the group open.

This keeps the two fast paths cheap: opening and closing linked files
only pays the moderate cost of maintaining the bitmap locally, and
deleting a file that was unlinked locally only gets the open map once
per lock group.  Removing many files in a group will only lock and get
the open map once per group.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-21 12:17:33 -07:00
Zach Brown
a65775588f Add server recovery helpers
Add a little set of functions to help the server track which clients are
waiting to recover which state.  The open map messages need to wait for
recovery so we're moving recovery out of being only in the lock server.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-13 12:10:35 -07:00
Andy Grover
d731c1577e Filesystem version instead of format hash check
Instead of hashing headers, define an interop version. Do not mount
superblocks that have a different version, either higher or lower.

Since this is pretty much the same as the format hash except it's a
constant, minimal code changes are needed.

Initial dev version is 0, with the intent that version will be bumped to
1 immediately prior to tagging initial release version.

Update README. Fix comments.

Add interop version to notes and modinfo.

Signed-off-by: Andy Grover <agrover@versity.com>
2021-01-15 10:53:00 -08:00
Andy Grover
d9d9b65f14 scoutfs: remove __packed from all struct definitions
Instead, explicitly add padding field, and adjust member ordering to
eliminate compiler-added padding between members, and at the end of the
struct (if possible: some structs end in a u8[0] array.)

This should prevent unaligned accesses. Not a big deal on x86_64, but
other archs like aarch64 really want this.

Signed-off-by: Andy Grover <agrover@versity.com>
2020-10-29 14:15:33 -07:00
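The explicit-padding approach can be shown with a small userspace example. The struct below is illustrative, not an actual scoutfs structure: without `__packed`, a trailing u32 after a u64 member would leave the compiler free to insert invisible tail padding, so the padding is made an explicit field instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Members ordered largest-first and padded explicitly, so the layout
 * is stable without __packed and every member is naturally aligned. */
struct on_disk_example {
    uint64_t blkno;
    uint32_t flags;
    uint32_t _pad;   /* explicit, instead of compiler-added padding */
};
```

With this layout, `sizeof` and every `offsetof` are the same on x86_64 and aarch64, and no access is unaligned.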
Zach Brown
dbea353b92 scoutfs: bring back sort_priv
Bring back sort_priv, we have need for sorting with a caller argument.

Signed-off-by: Zach Brown <zab@versity.com>
2020-10-29 14:15:33 -07:00
Zach Brown
c61175e796 scoutfs: remove unused radix code
Remove the radix allocator that was added as we experimented with packed
extent items.  It didn't work out.

Signed-off-by: Zach Brown <zab@versity.com>
2020-10-26 15:19:03 -07:00
Zach Brown
8f946aa478 scoutfs: add btree item extent allocator
Add an allocator which uses btree items to store extents.  Both the
client and server will use this for btree blocks, the client will use it
for srch blocks and data extents, and the server will move extents
between the core fs allocator btree roots and the clients' roots.

Signed-off-by: Zach Brown <zab@versity.com>
2020-10-26 15:19:03 -07:00
Zach Brown
b605407c29 scoutfs: add extent layer
Add infrastructure for working with extents.  Callers provide callbacks
which operate on their extent storage while this code performs the
fiddly splitting and merging of extents.  This layer doesn't have any
persistent structures itself; it only operates on native structs in
memory.

Signed-off-by: Zach Brown <zab@versity.com>
2020-10-26 15:19:03 -07:00
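The fiddly part the commit refers to is adjacency testing and merging on native in-memory structs. A minimal sketch, with illustrative names (the real layer drives caller-provided storage callbacks around logic like this):

```c
#include <assert.h>
#include <stdint.h>

/* Minimal in-memory extent: a run of logically contiguous blocks. */
struct ext {
    uint64_t start;
    uint64_t len;
};

/* Merge b into a if b starts exactly where a ends; returns 1 on
 * merge, 0 if the extents aren't adjacent.  A full layer would also
 * handle splitting and merging on the leading edge. */
static int ext_try_merge(struct ext *a, const struct ext *b)
{
    if (a->start + a->len == b->start) {
        a->len += b->len;
        return 1;
    }
    return 0;
}
```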
Zach Brown
45e594396f scoutfs: add an item cache above the btrees
Add an item cache between fs callers and the forest of btrees.  Calling
out to the btrees for every item operation was far too expensive.  This
gives us a flexible in-memory structure for working with items that
isn't bound by the constraints of persistent block IO.  We can rarely
stream large groups of items to and from the btrees and then use
efficient kernel memory structures for more frequent item operations.

This adds the infrastructure, nothing is calling it yet.

Signed-off-by: Zach Brown <zab@versity.com>
2020-08-26 14:39:12 -07:00
Zach Brown
f8e1812288 scoutfs: add srch infrastructure
This introduces the srch mechanism that we'll use to accelerate finding
files based on the presence of a given named xattr.  This is an
optimized version of the initial prototype that was using locked btree
items for .indx. xattrs.

This is built around specific compressed data structures, having the
operation cost match the reality of orders of magnitude more writers
than readers, and adopting a relaxed locking model.  Combine all of this
and maintaining the xattrs no longer tanks creation rates while
maintaining excellent search latencies, given that searches are defined
as rare and relatively expensive.

The core data type is the srch entry which maps a hashed name to an
inode number.  Mounts can append entries to the end of unsorted log
files during their transaction.  The server tracks these files and
rotates them into a list of files as they get large enough.  Mounts have
compaction work that regularly asks the server for a set of files to
read and combine into a single sorted output file.  The server only
initiates compactions when it sees a number of files of roughly the same
size.  Searches then walk all the committed srch files, both log files
and sorted compacted files, looking for entries that associate an xattr
name with an inode number.

Signed-off-by: Zach Brown <zab@versity.com>
2020-08-26 14:39:12 -07:00
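The core data type and flow can be sketched in userspace. This is a simplification under assumed names: real compaction merges several sorted files rather than sorting one in-memory log, but the shape (append unsorted, sort during compaction, scan for a hash when searching) is the same.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Schematic srch entry: a hashed xattr name paired with an inode
 * number.  Illustrative, not the on-disk format. */
struct srch_entry {
    uint64_t hash;
    uint64_t ino;
};

static int srch_cmp(const void *a, const void *b)
{
    const struct srch_entry *ea = a, *eb = b;

    if (ea->hash != eb->hash)
        return ea->hash < eb->hash ? -1 : 1;
    return ea->ino < eb->ino ? -1 : ea->ino > eb->ino;
}

/* "Compaction": turn an unsorted append-only log into sorted output. */
static void srch_compact(struct srch_entry *ents, size_t n)
{
    qsort(ents, n, sizeof(*ents), srch_cmp);
}

/* "Search": count entries whose hash matches the named xattr. */
static size_t srch_count(const struct srch_entry *ents, size_t n,
                         uint64_t hash)
{
    size_t i, cnt = 0;

    for (i = 0; i < n; i++)
        if (ents[i].hash == hash)
            cnt++;
    return cnt;
}
```

Appending is O(1) for the many writers; sorting is deferred to the rare compaction and search paths, matching the cost model the commit describes.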
Zach Brown
f59336085d scoutfs: add avl
Add the little avl implementation that we're going to use for indexing
items within the btree blocks.

Signed-off-by: Zach Brown <zab@versity.com>
2020-08-26 14:39:12 -07:00
Zach Brown
85142dcadf scoutfs: use radix allocator
Convert metadata block and file data extent allocations to use the radix
allocator.

Most of this is simple transitions between types and calls.  The server
no longer has to initialize blocks because mkfs can write a single
radix parent block with fully set parent refs to initialize a full
radix.  We remove the code and fields that were responsible for adding
uninitialized data and metadata.

The rest of the unused block allocator code is only ifdefed out.  It'll
be removed in a separate patch to reduce noise here.

Signed-off-by: Zach Brown <zab@versity.com>
2020-02-25 12:03:46 -08:00
Zach Brown
455a547e8e scoutfs: add radix allocator
Add the allocator that uses bits stored in the leaves of a cow radix.
It'll replace two metadata and data allocators that were previously
storing allocation bitmap fragments in btree items.

Signed-off-by: Zach Brown <zab@versity.com>
2020-02-25 12:03:46 -08:00
Zach Brown
dee9fbcf66 scoutfs: use packed extents and bitmaps
The btree forest item storage doesn't have as much item granular state
as the item cache did.  The item cache could tell if a cached item was
populated from persistent storage or was created in memory.  It could
simply remove created items rather than leaving behind a deletion item.

The cached btree blocks in the btree forest item storage mechanism can't
do this.  It has to create deletion items when deleting newly created
items because it doesn't know if the item already exists in the
persistent record or not.

This created a problem with the extent storage we were using.  The
individual extent items were stored with a key set to the last logical
block of their extent.  As extents grew or shrank they often were
deleted and created at different key values during a transaction.  In
the btree forest log trees this left a huge stream of deletion items
behind, one for every previous version of the extent.  Then searches for
an extent covering a block would have to skip over all these deleted
items before hitting the current stored extent.

Streaming writes paid an O(n) cost for every extent operation.  It
got out of hand.  This large change solves the problem by using
more coarse and stable item storage to track free blocks and blocks
mapped into file data.

For file data we now have large packed extent items which store packed
representations of all the logical mappings of a fixed region of a file.
The data code has loading and storage functions which transfer that
persistent version to and from the version that is modified in memory.

Free blocks are stored in bitmaps that are similarly efficiently packed
into fixed size items.  The client is no longer working with free extent
items managed by the forest, it's working with free block bitmap btrees
directly.  It needs access to the client's metadata block allocator and
block write contexts so we move those two out of the forest code and up
into the transaction.

Previously the client and server would exchange extents with network
messages.  Now the roots of the btrees that store the free block bitmap
items are communicated along with the roots of the other trees involved
in a transaction.  The client doesn't need to send free extents back to
the server so we can remove those tasks and rpcs.

The server no longer has to manage free extents.  It transfers block
bitmap items between trees around commits.   All of its extent
manipulation can be removed.

The item size portion of transaction item counts are removed because
we're not using that level of granularity now that metadata transactions
are dirty btree blocks instead of dirty items we pack into fixed sized
segments.

Signed-off-by: Zach Brown <zab@versity.com>
2020-01-17 11:21:36 -08:00
Zach Brown
edd8fe075c scoutfs: remove lsm code
Remove all the now unused code that deals with lsm: segment IO, the item
cache, and the manifest.

Signed-off-by: Zach Brown <zab@versity.com>
2020-01-17 11:21:36 -08:00
Zach Brown
858dad1d51 scoutfs: add forest subsystem
The forest code presents a consistent item interface that's implemented
on top of a forest of persistent btrees.

Signed-off-by: Zach Brown <zab@versity.com>
2020-01-17 11:21:36 -08:00
Zach Brown
bdafa6ede6 scoutfs: add block allocator
Add our block allocator core.  It'll be used shortly.

Signed-off-by: Zach Brown <zab@versity.com>
2020-01-17 11:21:36 -08:00
Zach Brown
e444c2b8c2 scoutfs: remove sort_priv
The only user was item compaction in the btree and it has been removed.

Signed-off-by: Zach Brown <zab@versity.com>
2020-01-17 11:21:36 -08:00
Zach Brown
2a6d209854 scoutfs: add kernelcompat files
Add files that we'll use to detect and work around incompatibilities
between kernel versions.

Signed-off-by: Zach Brown <zab@versity.com>
2020-01-15 14:57:57 -08:00
Zach Brown
34b8950bca scoutfs: initial lock server core
Add the core lock server code for providing a lock service from our
server.  The lock messages are wired up but nothing calls them.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
c34dd452a7 scoutfs: add quorum voting
Add a quorum election implementation.  The mounts that can participate
in the election are specified in a quorum config array in the super
block.  Each configured participant is assigned a preallocated block
that it can write to.

All mounts read the quorum blocks to find the member who was elected the
leader and should be running the server.  The voting mounts loop reading
voting blocks and writing their vote block until someone is elected with
a majority.

Nothing calls this code yet, this adds the initial implementation and
format.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
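The majority test at the heart of the election can be sketched as a tally over the preallocated vote blocks. This is a toy model with assumed names, not the on-disk quorum block format:

```c
#include <assert.h>
#include <stddef.h>

/* Each slot holds the member id that quorum participant currently
 * votes for.  A member wins once it holds a strict majority of the
 * configured slots; until then there is no leader. */
static int quorum_winner(const int *votes, size_t nr)
{
    size_t i, j;

    for (i = 0; i < nr; i++) {
        size_t cnt = 0;

        for (j = 0; j < nr; j++)
            if (votes[j] == votes[i])
                cnt++;
        if (cnt * 2 > nr)        /* strict majority */
            return votes[i];
    }
    return -1; /* no majority yet; keep voting */
}
```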
Zach Brown
f75e1e1322 scoutfs: reformat Makefile to one object per line
Reformat the scoutfs-y object list so that there's one object per line.
Diffs now clearly demonstrate what is changing instead of having word
wrapping constantly obscuring changes in the built objects.

(Did everyone spot the scoutfs_trace sorting mistake?  Another reason
not to mash everything into wrapped lines :)).

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
00adbd31be scoutfs: add sparse bitmap library
Add a quick library for maintaining a very large bitmap with sparse
allocation.

Signed-off-by: Zach Brown <zab@versity.com>
2018-08-28 15:34:30 -07:00
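The sparse-allocation idea can be sketched as fixed-size chunks allocated on first use. A toy version with illustrative names (a flat array stands in for whatever indexed structure the real library uses, and error handling is elided):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define CHUNK_BITS 64

/* One allocated word of an otherwise absent, very large bitmap. */
struct sparse_chunk {
    uint64_t index;    /* which 64-bit word of the huge bitmap */
    uint64_t bits;
};

struct sparse_bitmap {
    struct sparse_chunk *chunks;
    size_t nr;
};

static struct sparse_chunk *sb_find(struct sparse_bitmap *sb,
                                    uint64_t idx, int alloc)
{
    size_t i;

    for (i = 0; i < sb->nr; i++)
        if (sb->chunks[i].index == idx)
            return &sb->chunks[i];
    if (!alloc)
        return NULL;
    sb->chunks = realloc(sb->chunks,
                         (sb->nr + 1) * sizeof(*sb->chunks));
    sb->chunks[sb->nr].index = idx;
    sb->chunks[sb->nr].bits = 0;
    return &sb->chunks[sb->nr++];
}

static void sb_set(struct sparse_bitmap *sb, uint64_t bit)
{
    struct sparse_chunk *c = sb_find(sb, bit / CHUNK_BITS, 1);

    c->bits |= 1ull << (bit % CHUNK_BITS);
}

static int sb_test(struct sparse_bitmap *sb, uint64_t bit)
{
    struct sparse_chunk *c = sb_find(sb, bit / CHUNK_BITS, 0);

    return c ? !!(c->bits & (1ull << (bit % CHUNK_BITS))) : 0;
}
```

Memory use scales with the number of set regions rather than the bitmap's address range; a clear bit in an absent chunk costs nothing to store.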
Zach Brown
c4cb5c0651 scoutfs: add trivial seq file wrapper
Add a seq file wrapper which lets callers track objects easily.

Signed-off-by: Zach Brown <zab@versity.com>
2018-07-27 09:50:21 -07:00
Zach Brown
d708421cfb scoutfs: remove unused client and server code
The previous commit added shared networking code and disabled the old
unused code.  This removes all that unused client and server code that
was refactored to become the shared networking code.

Signed-off-by: Zach Brown <zab@versity.com>
2018-07-27 09:50:21 -07:00
Zach Brown
17dec65a52 scoutfs: add bidirectional network messages
The client and server networking code was a bit too rudimentary.

The existing code only had support for the client synchronously and
actively sending requests that the server could only passively respond
to.  We're going to need the server to be able to send requests to
connected clients and it can't block waiting for responses from each
one.

This refactors sending and receiving in both the client and server code
into shared networking code.  It's built around a connection struct that
then holds the message state.  Both peers on the connection can send
requests and send responses.

The existing code only retransmitted requests down newly established
connections.  Requests could be processed twice.

This adds robust reliability guarantees.  Requests are resent until
their response is received.  Requests are only processed once by a given
peer, regardless of the connection's transport socket.  Responses are
reliably resent until acknowledged.

This only adds the new refactored code and disables the old unused code
to keep the diff footprint minimal.  A following commit will remove all
the unused code.

Signed-off-by: Zach Brown <zab@versity.com>
2018-07-27 09:50:21 -07:00
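The process-once guarantee can be sketched with per-peer request ids. This is a simplification under assumed names, not the real scoutfs message format: the receiver remembers the next id it expects, so a request resent over a freshly established socket is recognized as a duplicate and only re-acknowledged, never re-executed.

```c
#include <assert.h>
#include <stdint.h>

struct peer {
    uint64_t next_id;    /* next request id we expect to process */
    int processed;       /* how many requests actually ran */
};

/* Returns 1 if the request ran, 0 if it was a duplicate resend. */
static int recv_request(struct peer *p, uint64_t id)
{
    if (id < p->next_id)
        return 0;        /* already processed: just resend the ack */
    p->next_id = id + 1;
    p->processed++;
    return 1;
}
```

The sender side mirrors this: it keeps each request queued, resending it on every new connection, until the matching response arrives.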
Zach Brown
e19716a0f2 scoutfs: clean up super block use
The code that works with the super block had drifted a bit.  We still
had two from an old design and we weren't doing anything with its crc.

Move to only using one super block at a fixed blkno and store and verify
its crc field by sharing code with the btree block checksumming.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 15:56:42 -07:00
Zach Brown
dfac36a9aa scoutfs: trace key struct
The userspace trace event printing code has trouble with arguments that
refer to fields in entries.  Add macros to make entries for all the
fields and use them as the formatted arguments.

We also remove the mapping of zone and type to strings.  It's smaller to
print the values directly and gets rid of some silly code.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
1b3645db8b scoutfs: remove dead server allocator code
Remove the bitmap segno allocator code that the server used to use to
manage allocations.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
869d11fd0f scoutfs: add core extent functions
Add a file of extent functions that callers will use to manipulate and
store extents in different persistent formats.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
b0bd273acc scoutfs: remove support for multi-element kvecs
Originally the item interfaces were written with full support for
vectored keys and values.  Callers constructed keys and values made up
of header structs and data buffers.  Segments supported much larger
values which could span pages when stored in memory.

But over time we've pulled that support back.  Keys are described by a
key struct instead of a multi-element kvec.  Values are now much smaller
and don't span pages.  The item interfaces still use the kvec arrays but
everyone only uses a single element.

So let's make the world a whole lot less awful by having the item
interfaces support only a single value buffer specified by a kvec.  A
bunch of code disappears and the result is much easier to understand.

Signed-off-by: Zach Brown <zab@versity.com>
2018-04-04 09:15:27 -05:00
Zach Brown
f52dc28322 scoutfs: simplify lock use of kernel dlm
We had an excessive number of layers between scoutfs and the dlm code in
the kernel.  We had dlmglue, the scoutfs locks, and task refs.  Each
layer had structs that track the lifetime of the layer below it.  We
were about to add another layer to hold on to locks just a bit longer so
that we can avoid down conversion and transaction commit storms under
contention.

This collapses all those layers into a simple state machine in lock.c that
manages the mode of dlm locks on behalf of the file system.

The users of the lock interface are mainly unchanged.  We did change
from a heavier trylock to a lighter nonblock lock attempt and have to
change the single rare readpage use.  Lock fields change so a few
external users of those fields change.

This not only removes a lot of code, it also contains functional
improvements.  For example, it can now convert directly to CW locks with
a single lock request instead of having to use two by first converting
to NL.

It introduces the concept of an unlock grace period.  Locks won't be
dropped on behalf of other nodes soon after being unlocked so that tasks
have a chance to batch up work before the other node gets a chance.
This can result in two orders of magnitude improvements in the time it
takes to, say, change a set of xattrs on the same file population from
two nodes concurrently.

There are significant changes to trace points, counters, and debug files
that follow the implementation changes.

Signed-off-by: Zach Brown <zab@versity.com>
2018-02-14 15:00:17 -08:00
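The unlock grace period reduces to a simple time comparison. A minimal sketch with illustrative names (plain integers stand in for jiffies, and the surrounding lock state machine is omitted):

```c
#include <assert.h>
#include <stdint.h>

/* After the last local holder unlocks, the lock resists being dropped
 * for a remote node until the grace window elapses, letting local
 * tasks batch more work under the lock first. */
struct graced_lock {
    uint64_t unlocked_at;   /* when the last holder unlocked */
    uint64_t grace;         /* how long to hold off remote requests */
};

/* May we drop the lock for a remote node at time 'now'? */
static int grace_expired(const struct graced_lock *gl, uint64_t now)
{
    return now >= gl->unlocked_at + gl->grace;
}
```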
Mark Fasheh
ac09f03327 scoutfs: open by handle
This is implemented by filling in our export ops functions.
When we get those right, the VFS handles most of the details for us.

Internally, scoutfs handles are two u64's (ino and parent ino) and a
type which indicates whether the handle contains the parent ino or not.
Surprisingly enough, no existing type matches this pattern so we use our
own types to identify the handle.

Most of the export ops are self-explanatory: scoutfs_encode_fh() takes
an inode and an optional parent and encodes those into the smallest
handle that would fit. scoutfs_fh_to_[dentry|parent] turn an existing
file handle into a dentry.

scoutfs_get_parent() is a bit different and would be called on
directory inodes to connect a disconnected dentry path. For
scoutfs_get_parent(), we can export add_next_linkref() and use the backref
mechanism to quickly find a parent directory.

scoutfs_get_name() is almost identical to scoutfs_get_parent(). Here we're
linking an inode to a name which exists in the parent directory. We can also
use add_next_linkref, and simply copy the name from the backref.

As a result of this patch we can also now export scoutfs file systems
via NFS, however testing NFS thoroughly is outside the scope of this
work so export support should be considered experimental at best.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
[zab edited <= NAME_MAX]
2018-01-26 11:59:47 -08:00
Zach Brown
e354fd18b1 scoutfs: add sysfs.c, fsid file
I wanted to add a sysfs file that exports the fsid for the mount of a
given device.  But our use of sysfs was confusing and spread through
super.c and counters.c.

This moves the core of our sysfs use to sysfs.c.  Instead of defining
the per-mount dir as a kset we define it as an object with attributes
which gives us a place to add an fsid attribute.

counters still have their own whack of sysfs implementation.  We'll let
them keep it for now but we could move it into sysfs.c.  It's just counter
iteration around the insane sysfs obj/attr/type nonsense.  For now it
just needs to know to add its counters dir as a child of the per-mount
dir instead of adding it to the kset.

Signed-off-by: Zach Brown <zab@versity.com>
2017-12-20 12:21:13 -08:00
Zach Brown
9ed34f8892 scoutfs: add triggers
Signed-off-by: Zach Brown <zab@versity.com>
2017-12-20 12:21:13 -08:00
Zach Brown
ce4daa817a scoutfs: add support for format_hash
Calculate the hash of format.h and ioctl.h and make sure the hash stored
in the super during mkfs matches our calculated hash on mount.

Signed-off-by: Zach Brown <zab@versity.com>
2017-10-12 13:57:31 -07:00
Zach Brown
c3e690a1ac scoutfs: add per_task storage helper
Add some functions for storing and using per-task storage in a list.
Callers can use this to pass pointers to children in a given scope when
interfaces don't allow for passing individual arguments amongst
concurrent callers in the scope.

Signed-off-by: Zach Brown <zab@versity.com>
2017-10-09 15:31:29 -07:00
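The mechanism described can be sketched in userspace: entries pair an opaque task key with a pointer, live on a list for the duration of a scope, and callees look up their task's pointer instead of receiving it as an argument threaded through interfaces that don't allow one. Names here are illustrative, not the scoutfs API:

```c
#include <assert.h>
#include <stddef.h>

struct per_task_entry {
    const void *task;    /* key: the current task */
    void *ptr;           /* value passed down the call chain */
    struct per_task_entry *next;
};

/* Single global list; the kernel code would lock this. */
static struct per_task_entry *pt_list;

/* Caller embeds the entry in its stack frame for the scope. */
static void per_task_add(struct per_task_entry *ent,
                         const void *task, void *ptr)
{
    ent->task = task;
    ent->ptr = ptr;
    ent->next = pt_list;
    pt_list = ent;
}

/* Callees find the pointer their task stored, if any. */
static void *per_task_get(const void *task)
{
    struct per_task_entry *e;

    for (e = pt_list; e; e = e->next)
        if (e->task == task)
            return e->ptr;
    return NULL;
}

static void per_task_del(struct per_task_entry *ent)
{
    struct per_task_entry **p;

    for (p = &pt_list; *p; p = &(*p)->next) {
        if (*p == ent) {
            *p = ent->next;
            return;
        }
    }
}
```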
Mark Fasheh
c0d3f99a6e scoutfs: Cluster coherent read/write
With trylock implemented we can add locking in readpage. After that it's
pretty easy to implement our own read/write functions which at this
point are more or less wrapping the kernel helpers in the correct
cluster locking.

Data invalidation is a bit interesting. If the lock we are invalidating
is an inode group lock, we use the lock boundaries to incrementally
search our inode cache. When an inode struct is found, we sync and
(optionally) truncate pages.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
[zab: adapted to newer lock call, fixed some error handling]
Signed-off-by: Zach Brown <zab@versity.com>
2017-08-30 10:38:00 -07:00
Mark Fasheh
72a8e9e171 scoutfs: pull in some of ocfs2 stackglue
Dlmglue is built on top of this. Bring in the portions we need which
includes the stackglue API as well as most of the fs/dlm implementation.
I left off the Ocfs2 specific version and connection handling. Also
left out is the old Ocfs2 dlm support which we'll never want.

Like dlmglue, we keep as much of the generic stackglue code intact
here. This will make translating to/from upstream patches much easier.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-08-23 21:40:20 -05:00
Mark Fasheh
fc21a0253c scoutfs: Hook dlmglue into our build system
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-08-23 15:54:08 -05:00
Zach Brown
c1b2ad9421 scoutfs: separate client and server net processing
The networking code was really suffering by trying to combine the client
and server processing paths into one file.  The code can be a lot
simpler by giving the client and server their own processing paths that
take their different socket lifecycles into account.

The client maintains a single connection.  Blocked senders work on the
socket under a sending mutex.  The recv path runs in work that can be
canceled after first shutting down the socket.

A long running server work function acquires the listener lock, manages
the listening socket, and accepts new sockets.  Each accepted socket has
a single recv work blocked waiting for requests.  That then spawns
concurrent processing work which sends replies under a sending mutex.
All of this is torn down by shutting down sockets and canceling work
which frees its context.

All this restructuring makes it a lot easier to track what is happening
in mount and unmount between the client and server.  This fixes bugs
where unmount was failing because the monolithic socket shutdown
function was queueing other work while it was draining.

Signed-off-by: Zach Brown <zab@versity.com>
2017-08-04 10:47:42 -07:00