Commit Graph

234 Commits

Author SHA1 Message Date
Zach Brown
6bcdca3cf9 Update dirent last pos and update first comment
The last valid pos for us is now a full u64 because we're storing
entries at an increasing counter instead of at a hash of the entry name.

And might as well add a clarifying comment to the first pos while we're
here.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:55 -07:00
Zach Brown
00fed84c68 Build statfs f_blocks from total_segs
Use the current total_segs field to calculate the total number of blocks
in the system instead of the old and redundant total block count field which is
going away.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
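
As a rough illustration of the arithmetic (the block and segment sizes
here are assumptions, not scoutfs's real constants):

    #include <stdint.h>

    #define BLOCK_SHIFT    12                       /* assumed 4KB blocks */
    #define SEGMENT_SHIFT  20                       /* assumed 1MB segments */
    #define SEGMENT_BLOCKS (1ULL << (SEGMENT_SHIFT - BLOCK_SHIFT))

    /* statfs f_blocks derived from the segment count in the super */
    static uint64_t statfs_total_blocks(uint64_t total_segs)
    {
        return total_segs * SEGMENT_BLOCKS;
    }
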
Zach Brown
02af35a98e Convert inode since ioctl to the item API
The inode since ioctl was the last user of the btree.  It doesn't yet
work because the item cache doesn't know how to search for items by
sequence.

It's not yet clear exactly how we'll build the data since ioctls.  It'll
be easy enough to refactor the inode since item walk if they follow a
similar pattern again.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
429e1b6eb4 Truncate data items
scoutfs_data_truncate_items() was still using the btree.  This updates
it to use the item cache but doesn't yet support regions being offline.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
92b10e8270 Write super with bio functions
Write our super block from an allocated page with our bio functions
instead of relying on the old block cache layer which is going away.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
75b018a0e7 Add symlinks back
Convert symlinks to use the new item cache API.  This is so much easier
because our max item size matches the symlink size.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
54e07470f1 Update xattrs to use the item cache
Update the xattrs to use the item cache.  Because we now have large keys
we can store the xattr at its full name instead of having to deal with
hashing the name and addressing collisions.

Now that we don't have the find xattr ioctls we don't need to maintain
backrefs.

We also add support for large xattrs that span multiple items.  The key
footer and value header give us the metadata we need to iterate over the
items that make up an xattr.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
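
A rough userspace model of splitting one xattr value across several
fixed-size items; the struct layout, sizes, and names below are
illustrative assumptions, not the on-disk format:

    #include <stdint.h>
    #include <stddef.h>

    #define MAX_ITEM_VAL 256                /* assumed per-item value limit */

    struct xattr_key_footer {               /* appended to the item key */
        uint8_t part;                       /* which piece of the xattr */
    };

    struct xattr_val_header {               /* starts each item value */
        uint16_t part_len;                  /* xattr bytes in this item */
        uint16_t last_part;                 /* nonzero on the final item */
    };

    /* how many items a value of a given size will span */
    static unsigned int xattr_nr_parts(size_t size)
    {
        size_t payload = MAX_ITEM_VAL - sizeof(struct xattr_val_header);

        return size ? (size + payload - 1) / payload : 1;
    }
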
Zach Brown
64bc145e3c Add scoutfs_item_set_batch()
We're about to update xattrs to use the item cache API and xattrs want
to be pretty big.  scoutfs_item_set_batch() lets the xattr code
atomically update xattrs made up of multiple items.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
a310027380 Remove the find xattr ioctls
The current plan for finding populations of inodes to search no longer
involves xattr backrefs.  We're about to change the xattr storage format
so let's remove these interfaces so we don't have to update them.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
fff6fb4740 Restore link backref items
Convert the link backref code from btree items to the item cache.

Now that the backref items have the full entry name we can traverse a
link with one item lookup.  We don't need to lock the inode and verify
that the entry at the backref offset really points to our inode.  The
link backref walk gets a lot simpler.

But we have to widen the ioctl cursor to store a full dir ino and path
name instead of just the dir's backref counter.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
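
A sketch of why a single lookup is now enough: the entry name lives in
the backref key itself, so resolving a link doesn't have to re-read and
verify the directory entry.  The key layout below is a guess for
illustration only:

    #include <stdint.h>

    /* hypothetical backref key: sorts all of an inode's links together,
     * then by parent dir, then by the full entry name (no hash) */
    struct link_backref_key {
        uint8_t  type;
        uint64_t ino;                       /* inode the entry points to */
        uint64_t dir_ino;                   /* directory holding the entry */
        char     name[];                    /* full entry name */
    };
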
Zach Brown
8def9141bc Add scoutfs_key_init_buf_len()
So far, the static key users have key and buffer lengths that match.
We're about to add a link backref caller who searches with a small key
but gets a result copied into a larger buffer.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
6516ce7d57 Report free blocks in statfs
Our statfs callback was still using the old buddy allocator.

We add a free segments field to the super and have it track the number
of free segments in the allocator.  We then use that to calculate the
number of free blocks for statfs.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
9f5e42f7dd Add simple data items
Add basic file data support by managing file data items from the page
cache address space callbacks.

Data is read by copying from cached items into page contents in
readpage.

Writes create new ephemeral items which reference dirty pages.  The
items are deleted once they're written in a transaction or if
invalidatepage removes the dirty page they reference.

There's a lot more to do to remove data copies, avoid compaction bandwidth
overhead, and add support for truncate, o_direct, and mmap.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
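
A simplified model of the readpage copy described above, in plain
userspace C with hypothetical names (the real code works on struct page
and the item cache's kvecs):

    #include <stddef.h>
    #include <string.h>

    #define PAGE_SIZE 4096

    /* fill one page-sized buffer from a cached data item's value; the
     * tail past the item is zeroed, and a missing item reads as a hole */
    static void fill_page_from_item(char *page, const char *val, size_t val_len)
    {
        size_t n = val ? (val_len < PAGE_SIZE ? val_len : PAGE_SIZE) : 0;

        if (n)
            memcpy(page, val, n);
        memset(page + n, 0, PAGE_SIZE - n);
    }
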
Zach Brown
1ad479a1af Add ephemeral items
Ephemeral items exist to reference external values.  They're going to be
used by the page cache to reference dirty pages for writeback.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
c3307e941b Add scoutfs_item_forget()
Add a forget call which forcefully removes an item, no matter its
state.  The page cache will use this in invalidatepage to drop
ephemeral items that reference a dirty page that's being truncated.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
9f885b4c12 Fix item erase augmentation
The item cache was getting inconsistent as items were removed.  This
would manifest in failing to find dirty items that it had counted as it
was writing items into the segment and removing deletion items.

For a start it wasn't using the augmented rb_erase().  We make a
function that everyone uses.  There's no augmented rb_replace() so we
just augment erase, restart, and insert.  (We could probably augment on
descent and replace/propagate but that can come later.)

Then the augmentation callbacks got the semantics slightly wrong.  The
rotation callback is named after a caller that happens to use it, not on
any implied relationship between the nodes.  It actually just
recalculates the augmentation value for the two subtrees.  Mischief
managed.

(We'll probably rework the augmentation so the value is for the node and
its children and we can get rid of the extra code we have today to
support our augmentation value that is sensitive to the difference
between the left and right subtrees.)

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
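
The shape of the erase part of the fix, assuming a callbacks struct like
the one the kernel's rbtree_augmented.h expects (the callback name is
made up):

    #include <linux/rbtree_augmented.h>

    /* augmented callbacks declared elsewhere (hypothetical name) */
    extern const struct rb_augment_callbacks scoutfs_item_rb_cb;

    /* every deletion path erases through this helper so the augmented
     * callbacks always run; plain rb_erase() would leave the subtree
     * counts stale and the cache inconsistent */
    static void item_rb_erase(struct rb_node *node, struct rb_root *root)
    {
        rb_erase_augmented(node, root, &scoutfs_item_rb_cb);
    }
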
Zach Brown
568cefa4db Add some item debugging tracing to seg writing
Trace the items that we count and then write to the segment.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
7045e3a6e8 More efficiently destroy item rbtrees
I was auditing rb_erase() use and noticed that we don't need to fully
tear down the item trees.  We can just blow them away with postorder
traversal and raw frees of the nodes.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
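
For reference, the kernel's postorder iterator makes this pattern very
short; a sketch with a stand-in item struct:

    #include <linux/rbtree.h>
    #include <linux/slab.h>

    struct cached_item {                    /* stand-in for the real item */
        struct rb_node node;
    };

    static void destroy_item_tree(struct rb_root *root)
    {
        struct cached_item *item, *tmp;

        /* children are visited before parents, so each node can be
         * freed directly without the rebalancing work of rb_erase() */
        rbtree_postorder_for_each_entry_safe(item, tmp, root, node)
            kfree(item);

        *root = RB_ROOT;
    }
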
Zach Brown
0298cbb562 Fix compact cleanup on mount failure
scoutfs_compact_destroy() was testing the wrong pointer to see if
_setup() had built up resources that needed to be torn down.  It'd crash
on mount failure.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
67aec72c77 Add readdir items
Restore readdir functionality by adding readdir items.

The readdir items are keyed by an increasing position in the parent
dir's inode.  We track it in our inode info.  To delete the readdir
items we restore the dentry_info and put the pos in the dentry so unlink
can build the readdir item key.  And finally we put the pos in the
lookup dirent so that it can populate the dentry info on lookup.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
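
A sketch of the counter-keyed scheme with assumed field names; the point
is that the readdir position is just the key offset, with no hashing or
collision handling:

    #include <stdint.h>

    struct readdir_key {                    /* illustrative, not on-disk */
        uint8_t  type;
        uint64_t dir_ino;
        uint64_t pos;                       /* increasing per-dir counter */
    };

    /* creating an entry consumes the directory's next position; unlink
     * and lookup carry this pos in the dentry info so they can rebuild
     * the readdir item key later */
    static uint64_t alloc_dirent_pos(uint64_t *next_readdir_pos)
    {
        return (*next_readdir_pos)++;
    }
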
Zach Brown
9a293bfa75 Add item delete dirty and many interfaces
Add item functions for deleting items that we know to be dirty and add a
user in another function that deletes many items without leaving partial
deletions behind in the case of errors.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
f139cf4a5e Convert unlink and orphan processing
Restore unlink functionality by converting unlink and orphan item
processing from the old btree interface to the new item cache interface.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
9d6d70bd89 Add an item next for key len ignoring val
Add scoutfs_item_next_same() which requires that the key lengths be
identical but which allows any values, including no value by way of a
null kvec.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
9d68e272cc Allow creation of items with no value
Item creation always tried to allocate a value.  We have some item types
which don't have values.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
8f63196318 Add key inc/dec variants for partial keys
Some callers know that it's safe to increment their partial keys.  Let
them do so without trying to expand the keys to full precision and
triggering warnings that their buffers aren't large enough.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
2ac239a4cb Add deletion items
So far we were only able to add items to the segments.  To support
deletion we have to insert deletion items and then remove them and the
item they reference when their segments are compacted.

As callers attempt to delete items from the item cache we replace the
existing item with a deletion marker with the key but no value.  Now
that there are deletion items in the cache we have to teach the other
item cache operations to skip them.  There's some noise in the patch
from moving functions around so that item insertion can free a deletion
item it finds.

The deletion items are written out to the segment as usual except now
the in-segment item struct has a flag to mark a deletion item and the
deletion item is removed from the cache once it's written to the segment.

Item reading knows to skip deletion items and not add them back into
the cache.

Compaction proceeds as usual for most of the levels with the deletion
item clobbering any older higher level items with the same key.
Eventually the deletion item itself is removed by skipping over it when
compacting to the largest final level.  We support this by adding a
little call that describes the max level of the tree at the time the
compaction starts so that compaction can tell when it should skip
copying the deletion item to the final lower level.

All of this is for deletion of items with a precise key.  In the future
we'll expand the deletion items so that they can reference a contiguous
range of keys.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
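
The compaction decision described above, reduced to a sketch (the item
struct and level bookkeeping are assumptions):

    #include <stdbool.h>

    struct compact_item {                   /* illustrative only */
        bool deletion;                      /* segment item flag */
    };

    /* a deletion item masks older items with the same key; once we're
     * writing into the final level there's nothing left underneath it
     * to mask, so it can be dropped instead of copied */
    static bool copy_item_to_output(const struct compact_item *item,
                                    int output_level, int max_level)
    {
        if (item->deletion && output_level == max_level)
            return false;
        return true;
    }
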
Zach Brown
cfc6d72263 Remove item off and len packing
The key and value offsets and lengths were aggressively packed into the
item structs in the segments.  This saved a few bytes per item but
didn't leave any room for expansion without growing the item.  We
want to add a deletion item flag so let's just grow the item struct.  It
now has room for full precision offsets and lengths that we can access
natively, so we can get rid of the packing and unpacking functions.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
685eb1f2dc Fix segment block item alignment build bug
The BUILD_BUG_ON() to test that the start of the items in the segment
header is naturally aligned had a typo that masked the length instead of
checking the remainder of division by the length.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
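
The difference in a line, with hypothetical struct, field, and alignment
names:

    /* broken: masking with the length only tests one bit of the offset,
     * so misaligned starts can slip through */
    BUILD_BUG_ON(offsetof(struct scoutfs_segment_block, items) & sizeof(u64));

    /* fixed: the remainder of dividing by the length tests alignment */
    BUILD_BUG_ON(offsetof(struct scoutfs_segment_block, items) % sizeof(u64));
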
Zach Brown
736d5765fc Add a shrinker for the segment cache
After segments have finished IO and while they're in the rbtree we track
them with an LRU.  Under memory pressure we can remove the oldest
segments from the rbtree and free them.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
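
A condensed sketch of the shrinker hookup; the counter, the helper that
drops segments, and the names are assumptions, and the cache locking and
LRU details are omitted:

    #include <linux/shrinker.h>
    #include <linux/atomic.h>

    static atomic_long_t nr_cached_segments;    /* maintained by the cache */

    /* stand-in for walking the LRU, erasing segments from the rbtree,
     * and freeing their pages; returns how many were freed */
    unsigned long drop_oldest_segments(unsigned long nr_to_scan);

    static unsigned long seg_count_objects(struct shrinker *shrink,
                                           struct shrink_control *sc)
    {
        return atomic_long_read(&nr_cached_segments);
    }

    static unsigned long seg_scan_objects(struct shrinker *shrink,
                                          struct shrink_control *sc)
    {
        return drop_oldest_segments(sc->nr_to_scan);
    }

    static struct shrinker seg_shrinker = {
        .count_objects = seg_count_objects,
        .scan_objects  = seg_scan_objects,
        .seeks         = DEFAULT_SEEKS,
    };

    /* registered at mount: register_shrinker(&seg_shrinker); */
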
Zach Brown
2bc1617280 Use contiguous key struct instead of kvecs
Using kvecs for keys seemed like a good idea because there were a few
uses that had keys in fragmented memory: dirent keys made up of an
on-stack struct and the file name in the dentry, and keys straddling the
pages that make up a cached segment.

But it hasn't worked out very well.  The code to perform ops on keys
by iterating over vectors is pretty fiddly.  And the raw kvecs only
describe the actively referenced key, they know nothing about the total
size of the buffer that the key resides in.  Some ops can't check that
they're not clobbering things; they're relying on callers not to mess
up.

And critically, the kvec iteration's become a bottleneck.  It turns out
that comparing keys is a very hot path in the item cache.  All the code
to initialize and iterate over two key vectors adds up when each high
level fs operation is a few tree descents and each tree descent is a
bunch of compares.

So let's back off and have a specific struct for tracking keys that are
stored in contiguous memory regions.  Users ensure that keys are
contiguous.  The code ends up being a lot clearer, callers can now see how
big the full key buffer is, and the rbtree node comparison fast path is
now just a memcmp.

Almost all of the changes in the patch are mechanical semantic changes
involving types, function names, args, and occasionally slightly
different return conventions.

A slightly more involved change is that now dirent key users have to
manage an allocated contiguous key with a copy of the path from the
dentry.

Item reading is now a little more clever about calculating the greatest
range it can cache by initially walking all the segments instead of
trying to do it as it runs out of items in each segment.

The largest meaningful change is that now keys can't straddle page
boundaries in memory which means they can't cross block boundaries in
the segment.  We align key offsets to the next block as we write keys to
segments that would have straddled a block.

We then also have to account for that padding when building segments.
We add a helper that calculates whether a given number of items will fit
in a segment, which is used by item dirtying, segment writing, and compaction.

I left the tracepoint formatting for another patch.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
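
The gist of the contiguous key struct and its comparison fast path, with
field names that are illustrative rather than exact:

    #include <string.h>
    #include <stdint.h>

    struct scoutfs_key_buf {                /* names are illustrative */
        void    *data;                      /* contiguous key bytes */
        uint16_t key_len;                   /* length of the valid key */
        uint16_t buf_len;                   /* size of the whole buffer */
    };

    /* keys sort like big endian byte strings, so the item cache's rbtree
     * descent is a memcmp plus a length tie-break instead of a kvec walk */
    static int key_compare(const struct scoutfs_key_buf *a,
                           const struct scoutfs_key_buf *b)
    {
        uint16_t len = a->key_len < b->key_len ? a->key_len : b->key_len;
        int cmp = memcmp(a->data, b->data, len);

        if (cmp)
            return cmp;
        return (int)a->key_len - (int)b->key_len;
    }
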
Zach Brown
8a302609f2 Add some item cache/range counters
Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
822ce205c5 Let compaction skip segments as needed
Previously the only clever compaction avoidance we'd try was in the
manifest walk.  If we found that there were no overlapping segments in
the next level we'd just move the entry down a level and skip compaction
entirely.

But that's just one specific instance of the general case: the lower and
upper segments don't overlap with each other.  There can be
many lower level segments that intersect with the full range of keys in
the upper level segment but which don't actually intersect with any
items in the upper segment.

So we refactor the compaction to notice this case.  We get the first and
last keys and use them to skip each segment as we first start to iterate
through it.

We don't want to read segments that we never actually have to copy items
from so we read each segment on demand instead of concurrently as the
compaction starts.  This means that item iteration can now have to read
a segment and can now return errors.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
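
The skip test amounts to a range disjointness check; a sketch reusing the
key_compare() and key struct from the earlier sketch, with the first and
last key arguments assumed:

    /* true if none of this segment's keys fall inside the range of items
     * that the compaction actually has to copy */
    static int can_skip_segment(const struct scoutfs_key_buf *first,
                                const struct scoutfs_key_buf *last,
                                const struct scoutfs_key_buf *seg_first,
                                const struct scoutfs_key_buf *seg_last)
    {
        return key_compare(seg_last, first) < 0 ||
               key_compare(seg_first, last) > 0;
    }
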
Zach Brown
519b9c35c4 Correctly wrap when finding compaction entries
Compaction looks for the next entry at a given level to compact.  It
only tested for not finding a next entry when it needs to wrap the key
and start over in the level; it missed the case where the next entry is
at a greater level.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
930f541c7b Add a scoutfs_seg_get
Compaction is going to want to get additional references on a segment.
It could just "read" it again while holding a reference but this is more
clear.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
3407576ced Don't use bio size in end_io
Some drivers don't set bi_size so just track the number of IOs.  (And the
size argument to end_io has been removed in recent kernels.)

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
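
A sketch of counting completions instead of summing sizes, using an
assumed per-submission struct; the end_io prototype shown is the older
two-argument style:

    #include <linux/bio.h>
    #include <linux/atomic.h>
    #include <linux/completion.h>

    struct bio_batch {                      /* illustrative */
        atomic_t pending;                   /* one count per submitted bio */
        int err;
        struct completion done;
    };

    static void batch_end_io(struct bio *bio, int err)
    {
        struct bio_batch *batch = bio->bi_private;

        if (err)
            batch->err = err;
        /* don't trust bio->bi_size here; just count the completion */
        if (atomic_dec_and_test(&batch->pending))
            complete(&batch->done);
        bio_put(bio);
    }
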
Zach Brown
963b04701f Add some bio tracing
Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
0a5fb7fd83 Add some counters
Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
ded184b481 Add a pile of tracing printks
Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
3f812fa9a7 More thoroughly integrate compaction
The first pass at compaction just kicked a thread any time we added a
segment that brought its level's count over the limit.  Tasks could
create dirty items and write level0 segments regardless of the progress
of compaction.

This ties the writing rate to compaction.  Writers have to wait to hold
a transaction until the dirty item count is under a segment and there are
no level0 segments.  Usually more level0 segments are allowed but we're
aggressively pushing compaction; we'll relax this later.

This also more forcefully ensures that compaction makes forward
progress.  We kick the compaction thread if we exceed the level count,
wait for level0 to drain, or successfully complete a compaction.  We
tweak scoutfs_manifest_next_compact() to return 0 if there's no
compaction work to do so that the compaction thread can exit without
triggering another.

For clarity we also kick off a sync after compaction so that we don't
sit around with a dirty manifest until the next sync.  This may not be
wise.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
aad5a34290 Don't prematurely write dirty super
A previous refactoring messed up and had scoutfs_trans_write_func()
always write the dirty super even when nothing was dirty and there was
nothing for the sync attempt to do.  This was very confusing and made it
look like the segment and treap writes were being lost when in fact it
was the super write that shouldn't have happened.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
a333c507fb Fix how dirty treap is tracked
The transaction writing thread tests if the manifest and alloc treaps
are dirty.  It did this by testing if there were any dirty nodes in the
treap.

But this misses the case where the treap has been modified and all nodes
have been removed.  In that case the root references no dirty nodes but
needs to be written.

Instead let's specifically mark the treap dirty when it's modified.
From then on sync will always try to write it out.  We also integrate
updating the persistent root as part of writing the dirty nodes to the
persistent ring.  It's required and every caller did it so it was silly
to make it a separate step.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
30b088377f Fix setting trans_task
Some recent refactoring accidentally set the trans task to null instead
of the current task.  It's not used but until it's removed it should be
correct.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
c21dc4ec20 Refactor level_count and protect with seqcount
We were manually manipulating the level counts in the super in a bunch
of places under the manifest rwsem.  This refactors them into simple get
and add functions.  We protect them with a seqcount so that we'll be
able to read them without blocking (from trans hold attempts).  We also add
a helper for testing that a level is full because we already used
different comparisons in two call sites.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
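
The read side that this enables, sketched with an assumed manifest
struct; writers still update the counts under the rwsem, bracketed by
write_seqcount_begin()/write_seqcount_end():

    #include <linux/seqlock.h>

    struct manifest {                       /* illustrative fields only */
        seqcount_t seqcount;
        u32 level_counts[8];
    };

    /* read a level's segment count without taking the manifest rwsem,
     * retrying if a writer updated the counts underneath us */
    static u32 manifest_level_count(struct manifest *mani, int level)
    {
        unsigned int seq;
        u32 count;

        do {
            seq = read_seqcount_begin(&mani->seqcount);
            count = mani->level_counts[level];
        } while (read_seqcount_retry(&mani->seqcount, seq));

        return count;
    }
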
Zach Brown
2083793ae0 Add first pass at segment compaction
This is the first draft of compaction which has the core mechanics.

Add segment functions to free a segment's segno and to delete the entry
that refers to the given segment.

Add manifest functions that lock the manifest and dirty and delete
manifest entries.  These are used by the compaction thread to atomically
modify the manifest with the result of a compaction.

Sort the level 0 entries in the manifest by their sequence.  This lets
compaction use the first (oldest) entry and lets reading walk them
backwards to get them in order without having to sort.  We also more
carefully use the sequence field in the manifest search key to
differentiate between finding high level entries that overlap and
finding specific entries identified by their seq.

Add some fields to the per-super compact_info struct which support
compaction.  We need to know the limit on the number of segments per
level and we record keys per level which tell us which segment to use
next time that level is compacted.

We kick a compaction thread when we add a manifest entry and that brings
the level count over the limit.

scoutfs_manifest_next_compact() is the first meaty function.  The
compaction thread uses this to get all the segments involved in a
compaction.  It does a quick manifest update if the next manifest
candidate doesn't overlap with any segments in the next level.

The compaction operation itself is a pretty straightforward
read-modify-write operation.  It asks the manifest to give it references
to the segments it'll need, reads them in, iterates over them to count
and copy items in order into output segments, and atomically updates the
manifest.

Now that the manifest can be dirty without any dirty segments we need to
fix the transaction writing function's assumption that everything flows
from dirty segments.  It also has to now lock and unlock the manifest as
it adds the entry for its level 0 segment.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
a45661e5b6 Add _prev version of treap lookup and iteration
_lookup() and _lookup_next() each had nearly identical loops that took a
dirty boolean.  We combine them into one walker with flags for dirty and
next and add a prev flag as well, giving us all the exported functions
with combinations of the flags.

We also add _last() to match _first() and _prev() to match _next().

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
2522509ec8 Fix scoutfs_treap_next() parent walk comparison
While walking up parents looking for the next node we were comparing the
child with the wrong parent pointer.  This is easily verified by
glancing at rb_next() :).

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
80da7fefa7 fix treap deletion
Treap deletion was pretty messed up.  It forgot to reset parent and ref
for the swapped node before using them to finally delete.  And it didn't
get all the weird cases right where the child node to swap is the direct
child of the node.  In that case we can't just swap the parent pointers
and node pointers, they need to be special cased.

So nuts to all that.  We'll just rotate the node down until it doesn't
have both children.  They result in pretty similar patterns and the
rotation mechanism is much simpler to understand.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
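
A userspace model of the rotate-down approach on a textbook treap;
scoutfs's treap nodes and references differ, so this only shows the
shape of the algorithm:

    #include <stdlib.h>

    struct tnode {
        struct tnode *left, *right;
        unsigned int prio;                  /* heap priority */
        int key;
    };

    static struct tnode *rotate_right(struct tnode *n)
    {
        struct tnode *l = n->left;

        n->left = l->right;
        l->right = n;
        return l;
    }

    static struct tnode *rotate_left(struct tnode *n)
    {
        struct tnode *r = n->right;

        n->right = r->left;
        r->left = n;
        return r;
    }

    /* rotate the doomed node down until it no longer has both children,
     * then splice in whichever child remains; no successor swapping */
    static struct tnode *treap_delete(struct tnode *node, int key)
    {
        if (!node)
            return NULL;

        if (key < node->key) {
            node->left = treap_delete(node->left, key);
        } else if (key > node->key) {
            node->right = treap_delete(node->right, key);
        } else if (!node->left || !node->right) {
            struct tnode *child = node->left ? node->left : node->right;

            free(node);
            return child;
        } else if (node->left->prio > node->right->prio) {
            node = rotate_right(node);
            node->right = treap_delete(node->right, key);
        } else {
            node = rotate_left(node);
            node->left = treap_delete(node->left, key);
        }
        return node;
    }
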
Zach Brown
e5e7a25ecd Don't use null node when repairing aug
We were derefing the null parent when deleting a single node in a tree.
There's no need to use parent_ref() here, we know that there's no node
and we can just clear the root's aug bits.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
cd6cd000ce Add ifdefed out quick treap printer
This was pretty handy for debugging weird failure cases.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
eb94092f2f Add kvec big endian inc and dec
Add helpers that increment or decrement kvec vectors as though they're
big endian values.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
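
The core of the operation, modeled on a single contiguous buffer; the
real helpers walk the fragments of a kvec:

    #include <stddef.h>

    /* add one to a buffer interpreted as a big endian integer, carrying
     * from the least significant byte (the end) toward the front */
    static void be_buf_inc(unsigned char *buf, size_t len)
    {
        while (len--) {
            if (++buf[len] != 0)
                break;                      /* no carry out of this byte */
        }
    }

    /* decrement is the mirror image: keep borrowing while bytes
     * underflow from 0x00 to 0xff */
    static void be_buf_dec(unsigned char *buf, size_t len)
    {
        while (len--) {
            if (buf[len]-- != 0)
                break;
        }
    }
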