Commit Graph

257 Commits

Author SHA1 Message Date
Zach Brown
2591e54fdc Make it easier to build scoutfs.ko
We were duplicating the make args a few times so make a little ARGS
variable.

Default to the /lib/modules/$(uname -r) installed kernel source if
SK_KSRC isn't set.

And only try the sparse build, which can fail, when we can actually
execute the sparse command.

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2017-04-18 14:03:24 -07:00
Nic Henke
9fc47dedf8 Add unlocked ioctls for directories.
The use of the scoutfs ioctls for inode-since and data-since on the root
directory is a rather helpful boost. This allows user code to start on
blank filesystems and monitor activity without needing to create files.

The ioctl code was already present, so wiring it into the directory
file operations was all that needed to happen.

Signed-off-by: Nic Henke <nic.henke@versity.com>
Reviewed-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2017-04-18 14:03:24 -07:00
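
A minimal sketch of that wiring, assuming the shared handler is named
scoutfs_ioctl (a hypothetical name; the real dir file_operations has
more methods):

    /* dir.c: give directories the same unlocked ioctl entry point as files */
    const struct file_operations scoutfs_dir_fops = {
            .unlocked_ioctl = scoutfs_ioctl,
            /* ... existing readdir and friends ... */
    };
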
Zach Brown
e61697a54e Add generic file and dir seek methods
Two more xfstests pass when we can seek in files and dirs.

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2017-04-18 14:03:22 -07:00
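
This likely amounts to pointing the llseek methods at a generic helper;
a hedged sketch, the helper actually chosen in the tree may differ:

    const struct file_operations scoutfs_file_fops = {
            .llseek = generic_file_llseek,
            /* ... */
    };

    const struct file_operations scoutfs_dir_fops = {
            .llseek = generic_file_llseek,
            /* ... */
    };
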
Zach Brown
efd95688d3 Add printf format checking to scoutfs msg funcs
scoutfs_msg() was missing the attribute to check printf formats and
arguments.

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2017-04-18 13:59:54 -07:00
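
The fix is the usual kernel __printf() annotation; a sketch assuming
scoutfs_msg() takes the super block as its first argument:

    /* gcc now warns when the format string and arguments disagree */
    __printf(2, 3)
    void scoutfs_msg(struct super_block *sb, const char *fmt, ...);
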
Zach Brown
cec3f9468a Further isolate rings and compaction
Each mount was still loading the manifest and allocator rings and
starting compaction, even when it was coordinating segment reads
and writes with the server.

This moves ring and compaction setup and teardown from mount and
unmount to server startup and shutdown.  Now only the server has the
rings resident and is running compaction.

We had to null some of the super info fields so that we can repeatedly
load and destroy the ring indices over the lifetime of a mount.

We also have to be careful not to call from item transactions into
compaction.  We'll restore this functionality with the server in the
future.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:51:10 -07:00
Zach Brown
5eefaf34f8 Server updates ring for level0 segment writes
Transaction commits currently directly modify the ring and super block
as segments are written.  As we introduce shared mounts, only the server
can modify the ring and super blocks.

This adds network messages to let mounts write items in a level 0
segment while the server modifies the allocator and manifest.

The item transaction commit now sends a message to the server to get an
allocated segno for its new level0 segment and sends a manifest entry to
the server once the segment is written.  The request and reply handlers
for the functions are straightforward.  The processing paths are simple
wrappers around the allocation and update functions that transaction
writing used to call directly.

Now that the item transactions aren't updating the super, sync can't
work with the super sequence numbers.

The server needs to make both allocations and manifest updates
persistent before it sends replies to the client.  We add the ability
for the server processing paths to queue and wait for commits of the
rings and super block.  We can hopefully get reasonable batching by using
a work struct for the commit.  We update the other processing path
callers that modify the rings to use the new commit mechanism.

We add a few segment and manifest functions to work with manifest
entries that describe segments.  This creates a bit of similar looking
code throughout the segment and manifest code but we'll come back and
clean this up once we see what the final shared support looks like.

scoutfs_seg_alloc() now takes the segno from the caller for the segment
it's allocating and inserting into the cache.  Transaction commit uses
the segno it got from the server while compaction still allocates
locally.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:51:10 -07:00
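
A hedged sketch of the client-side commit flow described above; apart
from scoutfs_seg_alloc(), which the commit names, the helper names here
are hypothetical:

    static int commit_level0_segment(struct super_block *sb)
    {
            struct scoutfs_segment *seg = NULL;
            u64 segno;
            int ret;

            /* request/reply: the server allocates and persists the segno */
            ret = scoutfs_net_alloc_segno(sb, &segno);
            if (ret)
                    return ret;

            /* the segno now comes from the caller, not a local allocator */
            ret = scoutfs_seg_alloc(sb, segno, &seg);
            if (ret)
                    return ret;

            ret = scoutfs_seg_submit_write(sb, seg);
            if (ret == 0)
                    /* request/reply: the server adds the manifest entry */
                    ret = scoutfs_net_record_segment(sb, seg);

            scoutfs_seg_put(seg);
            return ret;
    }
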
Zach Brown
5487aee6a7 Read items with manifest entries from server
Item reading tries to directly walk the manifest to find segments to
read.  That doesn't work when only the server has read the ring and
loaded the manifest.

This adds a network message to ask the server for the manifest entries
that describe the segments that will be needed to read items.

Previously item reading would walk the manifest and build up native
manifest references in a list that it'd use to read.   To implement the
network message we add request sending, processing, and reply parsing
around those original functions.  Item reading now packs its key range
and sends it to the server.  The server walks the manifest and sends the
entries that intersect with the key range.  Then the reply function
builds up the native manifest references that item reading will use.

The net reply functions needed an argument so that the manifest reading
request could pass in the caller's list that the native manifest
references should be added to.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:51:10 -07:00
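
A hedged sketch of the request side, with entirely hypothetical names
and structures; the point is that the caller's list is handed to the
reply handler as its argument:

    struct manifest_range_req {
            struct scoutfs_key first;
            struct scoutfs_key last;
    };

    static int get_manifest_refs(struct super_block *sb,
                                 struct scoutfs_key *first,
                                 struct scoutfs_key *last,
                                 struct list_head *ref_list)
    {
            struct manifest_range_req req = {
                    .first = *first,
                    .last = *last,
            };

            /* ref_list is passed through to the reply function */
            return scoutfs_net_request(sb, SCOUTFS_NET_MANIFEST_RANGE,
                                       &req, sizeof(req), ref_list);
    }
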
Zach Brown
b50de90196 Alloc inodes from pool from server
Inode allocation was always modifying the in-memory super block.  This
doesn't work when the server is solely responsible for modifying the
super blocks.  We add network messages so that mounts can ask the
server for inodes that they can use to satisfy allocation.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:51:10 -07:00
Zach Brown
453715a78d Only shutdown locks that were setup
Lock shutdown was crashing trying to deref a null linf on cleanup from
mount errors that happened before locks were set up.  Make sure lock
shutdown only tries to do work if the locks have been set up.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:51:10 -07:00
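
The fix pattern, assuming the lock info hangs off the scoutfs super
info; the accessor and field names here are guesses:

    void scoutfs_lock_shutdown(struct super_block *sb)
    {
            struct lock_info *linf = SCOUTFS_SB(sb)->lock_info;

            /* mount can fail before lock setup ever runs */
            if (!linf)
                    return;

            /* ... tear down held locks and free linf ... */
    }
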
Zach Brown
45882f5a77 Add some ring tracing
Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:51:10 -07:00
Zach Brown
5e0e9ac12e Move to much simpler manifest/alloc storage
Using the treap to be able to incrementally read and write the manifest
and allocation storage from all nodes wasn't quite ready for prime time.
The biggest problem is invalidating cached nodes which are the target
of native pointers, either for consistency or under memory pressure.
This was getting in the way of adding shared support as
readers and writers try to use as much of their treap caches as they
can.  There were other serious problems that we'd run into eventually:
memory pressure from duplicate caching in native nodes and the page
cache, small IOs from reading a page at a time, the risk of
pathologically imbalanced treaps, and the ring being corrupted if the
migration balancing doesn't work (the model assumed you could always
dirty an individual node in a transaction, but in fact you have to
dirty all the parents in each new transaction).

Let's back off to a much simpler mechanism while we build the rest of
the system around it.  We can revisit aggressively optimizing this when
it's our worst problem.

We'll store the indexes that the manifest server needs in simple
preallocated rings with log entries.   The server has to read the index
in its entirety into a native rbtree before it can work on it.  We won't
access the physical ring from mounts anymore, they'll send messages to
the server.

The ring callers are now working with a pinned tree in memory so the
interface can be a bit simpler.  By storing the indexes in their own
rings, the code and write path become a lot simpler: we have an IO
submission path for each index instead of "dirtying" calls per index and
then a writing call.

All this is much more robust and much less likely to get in our way as
we stand up the rest of the system around it.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:51:10 -07:00
Zach Brown
86d3090982 Tighten lock range error handling
If lock_range returns an error then the caller won't unlock the range.
Make sure to unlock the range if we have it locked when we get errors
that we're going to return to the caller.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:51:10 -07:00
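
The pattern being enforced, as a hedged fragment with hypothetical
names:

    static int locked_op(struct super_block *sb, struct scoutfs_key *start,
                         struct scoutfs_key *end)
    {
            int ret;

            ret = scoutfs_lock_range(sb, start, end);
            if (ret)
                    return ret;

            ret = do_the_op(sb, start, end);

            /* never return to the caller with the range still held */
            scoutfs_unlock_range(sb, start, end);
            return ret;
    }
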
Zach Brown
104bbb06a9 Remove cached range when invalidating items
When invalidating items we need to remove the cached
range that covers the range of keys that we're removing so that
the removed items aren't then considered negative cached items.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:51:10 -07:00
Zach Brown
2ea5f1d734 invalidate_others could return uninit ret
Make sure to initialize ret in case there aren't other mounts.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:51:10 -07:00
Zach Brown
8c59902b70 scoutfs: cleanup socket callbacks
The first attempt at wiring up the socket callbacks was a bit too
precious.  We can simplify and do what other modern socket callback
users do: don't bother with the callback locks and call shutdown before
release.

We also protect against spurious callbacks by only doing work in the
callbacks when the sk user_data points to a sock_info which points back
to the socket.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:51:10 -07:00
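
A hedged sketch of the spurious-callback guard; the callback signature
varies by kernel version and sock_info is the hypothetical per-socket
struct:

    static void scoutfs_data_ready(struct sock *sk)
    {
            struct sock_info *sinf = sk->sk_user_data;

            /* only do work if user_data still points back at this socket */
            if (!sinf || sinf->sock->sk != sk)
                    return;

            queue_work(sinf->wq, &sinf->recv_work);
    }
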
Zach Brown
27e55eb43c Flesh out some pieces of the scoutfs.md doc
Trying to keep adding coverage across the design.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:51:10 -07:00
Zach Brown
39ae89d85f Add network messaging between mounts
We're going to need communication between mounts to update and
distribute the manifest and allocators in the treap ring.

This adds a networking core where one mount becomes the server and other
mounts send requests to it.  The messaging semantics are pretty simple
in that clients reliably send requests and the server passively replies to
requests.  Complexity beyond that is up to the callers implementing the
requests.

It relies on locking to establish the server role and to broadcast the
address of the server socket.  We add a trivial lvb back to our local
test locking implementation to store the address.  We also add the
ability to shut down locking so that the locking networking work stops
blocking.

A little demonstration request is included which just gives visibility
into client and server clocks in the trace logs.  Next up we'll add the
requests that do real work.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:51:10 -07:00
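
A hedged guess at what a minimal wire header for these request/reply
messages might look like; the real layout in the tree may differ:

    struct scoutfs_net_header {
            __le64 id;        /* request id, echoed back in the reply */
            __le16 type;      /* which request this is */
            __le16 status;    /* 0 in requests, result in replies */
            __le16 data_len;  /* payload bytes that follow */
            __u8   data[];
    } __packed;
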
Zach Brown
392ed81c43 Add some simple lock/invalidation tracing
Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:51:10 -07:00
Zach Brown
955d940c64 Restore key tracing
Now that the keys are a contiguous buffer we can format them for the
trace buffers with a much more straightforward type check around
per-key snprintfs.  We can get rid of all the weird kvec code that tried
to deal with keys that straddled vectors.

With that fixed we can uncomment the tracing statements that were
waiting on the key formatting.

I was testing with xattr keys so they're added as the code is updated.
The rest of the key types will be added separately as they're used.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:51:10 -07:00
Zach Brown
607eff9b7c Add range locking to xattr ops
We can use easy xattrs to test range locking and item consistency
between mounts.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:49:16 -07:00
Zach Brown
b3b2693939 Add simple debugging range locking layer
We can work on shared mechanics without requiring a full locking server.
We can stand up a simple layer which uses shared data structures in a
kernel image to lock between mounts in the same kernel.

On mount we add supers to a list.  Held locks are tracked in an rbtree.
A lock attempt blocks until it doesn't conflict with anything in the
rbtree.

As locks are acquired we walk all the other supers and write/invalidate
any items they have which intersect with the acquired range.  This is
easier to implement and less efficient than caching locks after they're
unlocked and implementing downconvert/blocking/revoke.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:55 -07:00
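
A hedged sketch of the conflict test against the rbtree of held locks;
the walk is deliberately naive since this is only a debugging layer,
and the struct and helper names are hypothetical:

    static bool range_conflicts(struct rb_root *held, struct scoutfs_key *start,
                                struct scoutfs_key *end)
    {
            struct held_lock *hl;
            struct rb_node *node;

            for (node = rb_first(held); node; node = rb_next(node)) {
                    hl = rb_entry(node, struct held_lock, node);
                    /* ranges overlap unless one ends before the other starts */
                    if (scoutfs_key_cmp(start, &hl->end) <= 0 &&
                        scoutfs_key_cmp(end, &hl->start) >= 0)
                            return true;
            }

            return false;
    }
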
Zach Brown
f373f05fb7 Add engineering markdown document
Let's put the engineering doc in the source tree so that eventually
it'll be easily found upstream.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:55 -07:00
Zach Brown
97cb75bd88 Remove dead btree, block, and buddy code
Remove all the unused dead code from the previous btree block design.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:55 -07:00
Zach Brown
6bcdca3cf9 Update dirent last pos and update first comment
The last valid pos for us is now a full u64 because we're storing
entries at an increasing counter instead of at a hash of the entry name.

And might as well add a clarifying comment to the first pos while we're
here.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:55 -07:00
Zach Brown
00fed84c68 Build statfs f_blocks from total_segs
Use the current total_segs field to calculate the total number of blocks
in the system instead of the old and redundant field which is going
away.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
02af35a98e Convert inode since ioctl to the item API
The inode since ioctl was the last user of the btree.  It doesn't yet
work because the item cache doesn't know how to search for items by
sequence.

It's not yet clear exactly how we'll build the data since ioctls.  It'll
be easy enough to refactor the inode since item walk if they follow a
similar pattern again.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
429e1b6eb4 Truncate data items
scoutfs_data_truncate_items() was still using the btree.  This updates
it to use the item cache but doesn't yet support regions being offline.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
92b10e8270 Write super with bio functions
Write our super block from an allocated page with our bio functions
instead of relying on the old block cache layer which is going away.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
75b018a0e7 Add symlinks back
Convert symlinks to use the new item cache API.  This is so much easier
because our max item size matches the symlink size.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
54e07470f1 Update xattrs to use the item cache
Update the xattrs to use the item cache.  Because we now have large keys
we can store the xattr at its full name instead of having to deal with
hashing the name and addressing collisions.

Now that we don't have the find xattr ioctls we don't need to maintain
backrefs.

We also add support for large xattrs that span multiple items.  The key
footer and value header give us the metadata we need to iterate over the
items that make up an xattr.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
64bc145e3c Add scoutfs_item_set_batch()
We're about to update xattrs to use the item cache API and xattrs want
to be pretty big.  scoutfs_item_set_batch() lets the xattr code
atomically update xattrs made up of multiple items.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
a310027380 Remove the find xattr ioctls
The current plan for finding populations of inodes to search no longer
involves xattr backrefs.  We're about to change the xattr storage format
so let's remove these interfaces so we don't have to update them.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
fff6fb4740 Restore link backref items
Convert the link backref code from btree items to the item cache.

Now that the backref items have the full entry name we can traverse a
link with one item lookup.  We don't need to lock the inode and verify
that the entry at the backref offset really points to our inode.  The
link backref walk gets a lot simpler.

But we have to widen the ioctl cursor to store a full dir ino and path
name instead of just the dir's backref counter.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
8def9141bc Add scoutfs_key_init_buf_len()
So far the static key users have key and buffer lengths that match.
We're about to add a link backref caller who searches with a small key
but gets a result copied into a larger buffer.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
6516ce7d57 Report free blocks in statfs
Our statfs callback was still using the old buddy allocator.

We add a free segments field to the super and have it track the number
of free segments in the allocator.  We then use that to calculate the
number of free blocks for statfs.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
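
A hedged sketch of the resulting statfs callback, combining this with
the earlier f_blocks change; the constant and field names other than
total_segs and free_segs are assumptions:

    static int scoutfs_statfs(struct dentry *dentry, struct kstatfs *kst)
    {
            struct scoutfs_super_block *super = &SCOUTFS_SB(dentry->d_sb)->super;

            kst->f_bsize  = SCOUTFS_BLOCK_SIZE;
            kst->f_blocks = le64_to_cpu(super->total_segs) * SCOUTFS_SEGMENT_BLOCKS;
            kst->f_bfree  = le64_to_cpu(super->free_segs) * SCOUTFS_SEGMENT_BLOCKS;
            kst->f_bavail = kst->f_bfree;

            return 0;
    }
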
Zach Brown
9f5e42f7dd Add simple data items
Add basic file data support by managing file data items from the page
cache address space callbacks.

Data is read by copying from cached items into page contents in
readpage.

Writes create new ephemeral items which reference dirty pages.  The
items are deleted once they're written in a transaction or if
invalidatepage removes the dirty page they reference.

There's a lot more to do to remove data copies, avoid compaction bw
overhead, and add support for truncate, o_direct, and mmap.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
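
A hedged sketch of the read side; scoutfs_item_copy_data() is a made-up
helper standing in for however the item cache hands back file data:

    static int scoutfs_readpage(struct file *file, struct page *page)
    {
            struct inode *inode = page->mapping->host;
            void *addr = kmap(page);
            int ret;

            /* fill the page from cached data items, zeroing any holes */
            ret = scoutfs_item_copy_data(inode, page_offset(page), addr, PAGE_SIZE);
            kunmap(page);

            if (ret == 0)
                    SetPageUptodate(page);
            unlock_page(page);
            return ret;
    }
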
Zach Brown
1ad479a1af Add ephemeral items
Ephemeral items exist to reference external values.  They're going to be
used by the page cache to reference dirty pages for writeback.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
c3307e941b Add scoutfs_item_forget()
Add a forget call which forcefully removes an item, no matter its
state.  The page cache will use this in invalidatepage to drop
ephemeral items that reference a dirty page that's being truncated.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
9f885b4c12 Fix item erase augmentation
The item cache was getting inconsistent as items were removed.  This
would manifest in failing to find dirty items that it had counted as it
was writing items into the segment and removing deletion items.

For a start it wasn't using the augmented rb_erase().  We make a
function that everyone uses.  There's no augmented rb_replace() so we
just augment erase, restart, and insert.  (We could probably augment on
descent and replace/propagate but that can come later.)

Then the augmentation callbacks got the semantics slightly wrong.  The
rotation callback is named after a caller that happens to use it, not on
any implied relationship between the nodes.  It actually just
recalculates the augmentation value for the two subtrees.  Mischief
managed.

(We'll probably rework the augmentation so the value is for the node and
its children and we can get rid of the extra code we have today to
support our augmentation value that is sensitive to the difference
between the left and right subtrees.)

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
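
The single erase helper amounts to always going through the kernel's
rb_erase_augmented() with the augment callbacks; the struct and
callback names here are guesses:

    static void erase_item(struct item_cache *cac, struct cached_item *item)
    {
            /* everyone erases through here so the augmentation stays correct */
            rb_erase_augmented(&item->node, &cac->items, &scoutfs_item_rb_cb);
    }
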
Zach Brown
568cefa4db Add some item debugging tracing to seg writing
Trace the items that we count and then write to the segment.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
7045e3a6e8 More efficiently destroy item rbtrees
I was auditing rb_erase() use and noticed that we don't need to fully
tear down the item trees.  We can just blow them away with postorder
traversal and raw frees of the nodes.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
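
The postorder teardown is likely just the kernel helper plus kfree; a
sketch assuming a hypothetical cached_item struct with an rb_node
member called node:

    static void destroy_item_tree(struct rb_root *root)
    {
            struct cached_item *item;
            struct cached_item *tmp;

            /* no per-node rebalancing, just free everything bottom-up */
            rbtree_postorder_for_each_entry_safe(item, tmp, root, node)
                    kfree(item);

            *root = RB_ROOT;
    }
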
Zach Brown
0298cbb562 Fix compact cleanup on mount failure
scoutfs_compact_destroy() was testing the wrong pointer to see if
_setup() had built up resources that needed to be torn down.  It'd crash
on mount failure.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
67aec72c77 Add readdir items
Restore readdir functionality by adding readdir items.

The readdir items are keyed by an increasing position in the parent
dir's inode.  We track it in our inode info.  To delete the readdir
items we restore the dentry_info and put the pos in the dentry so unlink
can build the readdir item key.  And finally we put the pos in the
lookup dirent so that it can populate the dentry info on lookup.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
9a293bfa75 Add item delete dirty and many interfaces
Add item functions for deleting items that we know to be dirty and add a
user in another function that deletes many items without leaving partial
deletions behind in the case of errors.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
f139cf4a5e Convert unlink and orphan processing
Restore unlink functionality by converting unlink and orphan item
processing from the old btree interface to the new item cache interface.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
9d6d70bd89 Add an item next for key len ignoring val
Add scoutfs_item_next_same() which requires that the key lengths be
identical but which allows any values, including no value by way of a
null kvec.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
9d68e272cc Allow creation of items with no value
Item creation always tried to allocate a value.  We have some item types
which don't have values.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
8f63196318 Add key inc/dec variants for partial keys
Some callers know that it's safe to increment their partial keys.  Let
them do so without trying to expand the keys to full precision and
triggering warnings that their buffers aren't large enough.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
2ac239a4cb Add deletion items
So far we were only able to add items to the segments.  To support
deletion we have to insert deletion items and then remove them and the
item they reference when their segments are compacted.

As callers attempt to delete items from the item cache we replace the
existing item with a deletion marker with the key but no value.  Now
that there are deletion items in the cache we have to teach the other
item cache operations to skip them.  There's some noise in the patch
from moving functions around so that item insertion can free a deletion
item it finds.

The deletion items are written out to the segment as usual except now
the in-segment item struct has a flag to mark a deletion item and the
deletion item is removed from the cache once it's written to the segment.

Item reading knows to skip deletion items and not add them back into
the cache.

Compaction proceeds as usual for most of the levels with the deletion
item clobbering any older higher level items with the same key.
Eventually the deletion item itself is removed by skipping over it when
compacting into the final level.  We support this by adding a little
call that describes the max level of the tree at the time the compaction
starts so that compaction can tell when it should skip copying the
deletion item into that final level.

All of this is for deletion of items with a precise key.  In the future
we'll expand the deletion items so that they can reference a contiguous
range of keys.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
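
A hedged sketch of the compaction decision described above, with
hypothetical names; the "little call" mentioned would set
ctx->last_level before the compaction starts:

    static bool copy_item_to_output(struct compact_ctx *ctx,
                                    struct native_item *item)
    {
            if (!item_is_deletion(item))
                    return true;

            /*
             * A deletion item still has to clobber older items at higher
             * levels, but once we're writing the final level there is
             * nothing left to shadow and the marker itself is dropped.
             */
            return ctx->output_level < ctx->last_level;
    }
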
Zach Brown
cfc6d72263 Remove item off and len packing
The key and value offsets and lengths were aggressively packed into the
item structs in the segments.   This saved a few bytes per item but
didn't leave any room for expansion without growing the item.  We
want to add a deletion item flag so let's just grow the item struct.  It
now has room for full precision offsets and lengths that we can access
natively so we can get rid of the packing and unpacking functions.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
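
A hedged guess at the grown on-disk item struct; the exact widths and
field names in the format header may differ, the point is the unpacked
offsets and lengths plus the new flags byte:

    struct scoutfs_segment_item {
            __le32 key_off;
            __le32 val_off;
            __le16 key_len;
            __le16 val_len;
            __u8   flags;     /* room for a deletion flag */
    } __packed;
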