Commit Graph

323 Commits

Zach Brown
fc50072cf9 scoutfs: store manifest entries in the btree
Convert the manifest to store entries in persistent btree keys and
values instead of using the rbtree in memory from the ring.

The btree doesn't have a sort function.  It just compares variable
length keys.  The most complicated part of this transformation is
dealing with the fallout of this.  The compare function can't compare
search keys of a different form against item keys, so searches need to
construct full synthetic btree keys to search with.  It also can't
return richer comparison results, like overlapping, so the caller needs
to do a bit more work with plain key comparisons to find overlapping
segments.  And it can't compare differently depending on the level of
the manifest, so we store the manifest keys differently depending on
whether an entry is in level 0 or not.

All mount clients can now see the manifest blocks.  They can query the
manifest directly when trying to find segments to read.  We can get rid
of all the networking calls that were finding the segments for readers.

We change the manifest functions that relied on the ring to make their
changes to the manifest persistent.  We don't touch the allocator
or the rest of the manifest server, though, so this commit breaks the
world.  It'll be restored in future patches as we update the segment
allocator and server to work with the btree.

Signed-off-by: Zach Brown <zab@versity.com>
2017-07-08 10:59:40 -07:00
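A toy sketch of the kind of synthetic search key construction described
above.  The struct layout, field names, and init_search_key() are
invented for illustration and are not the real scoutfs on-disk format:

    #include <linux/types.h>
    #include <linux/string.h>
    #include <asm/byteorder.h>

    /* Invented packed key: level 0 entries also carry a seq because their
     * key ranges can overlap, while higher levels are ordered by their
     * first item key alone. */
    struct manifest_btree_key {
            __u8 level;
            __be64 seq;                     /* only meaningful at level 0 */
            __u8 first_key[32];             /* item key, zero padded */
    } __attribute__((packed));

    /* Searches build a complete synthetic key because the btree compare
     * function can only compare two full keys of the same form. */
    static void init_search_key(struct manifest_btree_key *key, u8 level,
                                u64 seq, const void *first, size_t len)
    {
            memset(key, 0, sizeof(*key));
            key->level = level;
            if (level == 0)
                    key->seq = cpu_to_be64(seq);
            if (len > sizeof(key->first_key))
                    len = sizeof(key->first_key);
            memcpy(key->first_key, first, len);
    }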
Zach Brown
3eaabe81de scoutfs: add btree stored in persistent ring
Add a cow btree whose blocks are stored in a persistently allocated
ring.   This will let us incrementally index very large data sets
efficiently.

This is an adaptation of the previous btree code which now uses the
ring, stores variable length keys, and augments the items with bits
that are ORed up through parents.

Signed-off-by: Zach Brown <zab@versity.com>
2017-07-08 10:59:40 -07:00
Mark Fasheh
eb439ccc01 scoutfs: s/lck/lock/ lock.[ch]
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-07-06 18:18:58 -05:00
Mark Fasheh
136cbbed29 scoutfs: only release lockspace/workqueues in lock_destroy if they exist
Mount failure means these might be NULL.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-07-06 18:17:22 -05:00
Mark Fasheh
19f6f40fee scoutfs: get rid of held_locks construct
Now that we have a dlm, this is a needless redirection. Merge all fields
back into the lock_info struct.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-07-06 18:17:22 -05:00
Mark Fasheh
250e9d2701 scoutfs: remove unused function, can_complete_shutdown
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-07-06 18:17:22 -05:00
Mark Fasheh
e6f3b3ca8f scoutfs: add lock caching
We refcount our locks and hold them across system calls. If another node
wants access to a given lock we'll mark it as blocking in the bast and queue
a work item so that the lock can later be released. Otherwise locks are
freed under memory pressure, at unmount, or after a timer fires.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-07-06 18:15:11 -05:00
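A minimal sketch of the blocking-ast pattern described above, with
invented names (cached_lock, CL_BLOCKING, example_bast); the real
scoutfs lock caching code differs:

    #include <linux/workqueue.h>
    #include <linux/kref.h>
    #include <linux/bitops.h>

    struct cached_lock {
            struct kref refcount;           /* held across system calls */
            unsigned long flags;
            struct work_struct drop_work;   /* drops the dlm lock later */
    };
    #define CL_BLOCKING 0

    /* The blocking ast only marks the lock and queues work; the work item
     * releases the dlm lock once local holders have gone away. */
    static void example_bast(void *astarg, int mode)
    {
            struct cached_lock *cl = astarg;

            set_bit(CL_BLOCKING, &cl->flags);
            schedule_work(&cl->drop_work);
    }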
Zach Brown
76a73baefd scoutfs: don't lose items between segments
The refactoring of compaction to use the skip lists changed the nature
of item insertion.  Previously it would precisely count the number of
items to insert.  Now it discovers that the current output segment is
full by having _append_item() return false.

In this case the cursors still point to the item that would have been
inserted but failed.  The compact_items() caller loops around to
allocate the next segment.  Then it calls compact_items() again, which
mistakenly advances *past* the current item that still needed to be
inserted.

Hiding next_item() away from the segment loop made it hard to see this
mechanism.  Let's drop the compact_items() function and bring item
iteration and item appending into the main loop so we can more carefully
advance or not as we write and allocate new segments.

This stops losing items at segment boundaries.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-27 14:04:38 -07:00
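A self-contained toy model of the loop shape this commit moves to, where
the cursor is only advanced after a successful append; the item and
segment types here are stand-ins, not the scoutfs structures:

    #include <stdio.h>
    #include <stdbool.h>

    #define SEG_CAP 4                       /* toy segment capacity */

    struct seg {
            int items[SEG_CAP];
            int nr;
    };

    static bool append_item(struct seg *seg, int item)
    {
            if (seg->nr == SEG_CAP)
                    return false;           /* segment full, caller retries */
            seg->items[seg->nr++] = item;
            return true;
    }

    static void compact(const int *input, int count)
    {
            struct seg seg = { .nr = 0 };
            int i = 0;

            while (i < count) {
                    if (!append_item(&seg, input[i])) {
                            printf("wrote segment with %d items\n", seg.nr);
                            seg.nr = 0;     /* "allocate" the next segment */
                            continue;       /* retry the same item */
                    }
                    i++;                    /* advance only after appending */
            }
            if (seg.nr)
                    printf("wrote segment with %d items\n", seg.nr);
    }

Feeding compact() more items than fit in one segment shows the item at
each boundary landing in the following segment instead of being skipped.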
Zach Brown
e7655f00ee scoutfs: read items from next segment in level
If the starting key for a segment read doesn't fall in a segment then we
have to include the next segment from that level in the read.  If we
don't then the read can think that there are no more items at that level
and assume that the items in the upper level are all that exist.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-27 14:04:38 -07:00
Zach Brown
f7701177d2 scoutfs: throttle addition of level 0 segments
Writers can add level 0 segments much faster (~20x) than compaction can
compact them down into the lower levels.  Without a limit on the number
of level 0 segments, item reading can try to read an extraordinary number
of level 0 segments and wedge the box with nonreclaimable page allocations.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-27 14:04:38 -07:00
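One way the throttle could be shaped, as a hedged sketch with invented
names and an arbitrary limit; the real patch may use different
primitives:

    #include <linux/wait.h>
    #include <linux/atomic.h>

    #define LEVEL0_LIMIT 8                  /* arbitrary for illustration */

    static atomic_t level0_count = ATOMIC_INIT(0);
    static DECLARE_WAIT_QUEUE_HEAD(level0_wq);

    /* writers wait for compaction to drain level 0 before adding more */
    static int add_level0_segment(void)
    {
            int ret;

            ret = wait_event_interruptible(level0_wq,
                            atomic_read(&level0_count) < LEVEL0_LIMIT);
            if (ret)
                    return ret;
            atomic_inc(&level0_count);
            return 0;
    }

    /* compaction calls this after merging a level 0 segment away */
    static void level0_segment_removed(void)
    {
            atomic_dec(&level0_count);
            wake_up(&level0_wq);
    }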
Zach Brown
9f545782fb scoutfs: add missing segment put
Back when we changed the transaction commit to ask the server to update
the commit, we accidentally lost the put of the level 0 segment that was
just written.  This leaked refcount would pin segments over time and
eventually drag the box into crippling OOM.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-27 14:04:38 -07:00
Zach Brown
823a5bed34 scoutfs: add some segment cache life cycle tracing
Signed-off-by: Zach Brown <zab@versity.com>
2017-06-27 14:04:38 -07:00
Zach Brown
70c7178e6a scoutfs: index segment items with skip list
We want to be able to read a region of items from a segment by searching
for the key that starts the item.  In the first version of the segment
format we find a key by performing a binary search across an array of
offsets that point to the items.

Unfortunately the current format requires that we know the number of
items before we start writing.  With thousands of items per segment it's
a little bonkers to ask compaction to walk through all the items twice.

Worse still, we didn't want the item offset array entries to span pages,
so they're rounded up to a power of two once they hold seqs, offsets,
and lengths.  This makes them surprisingly large and sometimes they can
consume up to 60% (!) of a segment.

We know that we're inserting in sort order so it's very easy to build an
index as we insert.  Skip lists give us a nice simple way to ensure
O(log n) lookups with only an average of two links per node.

CPU use is greatly reduced by removing a full redundant item walk and we
now use up almost all of the space in segments.  There are still little
gaps at the ends of blocks as items still won't cross block boundaries.

Most of this change is safely mechanical.  The big difference is in how
the compaction loop is built.  It used to count the items beforehand.
It would never try to append when out of segments, and writing would stop
after the exact number of items.  Now it discovers it's out of items by
allocating and trying to append and finding that there's no more work to
do.  This required rethinking the loop exit, segment allocation, and
stopping conditions.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-27 14:04:38 -07:00
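A toy userspace skip list showing the properties the commit relies on:
items arrive in sorted order so insertion just links at the tails, a
coin flip per level gives an average of two links per node, and search
is O(log n).  None of this is the actual segment format:

    #include <stdlib.h>
    #include <string.h>

    #define MAX_LEVEL 16

    struct node {
            const char *key;
            struct node *next[MAX_LEVEL];
    };

    static struct node head;                /* sentinel, items hang off next[] */
    static struct node *tail[MAX_LEVEL];

    static int random_height(void)
    {
            int h = 1;

            while (h < MAX_LEVEL && (rand() & 1))   /* p = 1/2: ~2 links/node */
                    h++;
            return h;
    }

    /* items are appended in sorted order, so we only link at the tails */
    static void append(struct node *n)
    {
            int h = random_height();
            int i;

            memset(n->next, 0, sizeof(n->next));
            for (i = 0; i < h; i++) {
                    struct node *prev = tail[i] ? tail[i] : &head;

                    prev->next[i] = n;
                    tail[i] = n;
            }
    }

    /* O(log n): drop down a level whenever the next key is too big */
    static struct node *search(const char *key)
    {
            struct node *n = &head;
            int i;

            for (i = MAX_LEVEL - 1; i >= 0; i--)
                    while (n->next[i] && strcmp(n->next[i]->key, key) < 0)
                            n = n->next[i];
            n = n->next[0];
            return (n && strcmp(n->key, key) == 0) ? n : NULL;
    }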
Zach Brown
463a696575 scoutfs: add value length limit
Add a relatively small universal value size limit.  This will be needed
by denser item packing to predict the worst case padding needed to avoid
full items crossing block boundaries.

We refactor the existing symlink and xattr item value limit to use this
new limit.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-27 14:04:38 -07:00
Zach Brown
1724bab8ea scoutfs: store large symlinks in multiple items
We're shrinking the max item value size so we need to store symlinks
with large target paths in multiple items.  The arbitrary max value size
defined here will be replaced in the future with the new global maximum
value size.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-27 14:04:38 -07:00
Zach Brown
8d59e6d071 scoutfs: fix alloc eio for free region
It's possible for the next segno to fall at the end of an allocation
region that doesn't have any bits set.  The code shouldn't return -EIO
in that case, it should carry on to the next region.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-27 14:04:38 -07:00
Zach Brown
793f84b86b scoutfs: remove item reading limit
The item reading limit was intended to minimize latency when we were
directly reading cached manifests.  We're now asking the server to walk
the manifest for us and that's a lot more expensive than querying local
cached blocks.

Let's gulp in an entire segment's worth of items if we can.  We'll have
plenty of opportunity to tune this down later.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-27 14:04:38 -07:00
Zach Brown
3b56161ed3 scoutfs: fix item read seg walk limit
When reading items into the cache we have the range of keys that were
missing from the cache.  The item walk was stopping when it hit the end
of the missing cache range, not when it hit the end of the keys that
were covered by all the segments.

This would manifest as huge regions of missing items.  The read would
walk off the relatively closed end of the highest level segment.  It
would keep reading while there were items in the upper levels but all
those keys that would have been found in additional lower level segments
are missing.  Eventually it'd hit the end of the higher level segment
and mark that region as cached.

With it fixed it now stops the read appropriately and will come around
next time to read the range that covers the next lowest level segment.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-27 14:04:38 -07:00
Zach Brown
71711c8b56 scoutfs: add manifest and item tracing
Add some tracing to get visibility into compaction and item reading.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-27 14:04:38 -07:00
Zach Brown
ef551776ae scoutfs: item cache reads shouldn't clobber dirty
We read into the item cache when we hit a region that isn't cached.
Inode index items are created without cached range coverage.   It's easy
to trigger read attempts that overlap with dirty inode index items.

insert_item() had a bug where its notion of overwriting only applied to
logical presence.  It always let an insertion overwrite an existing item
if it was a deletion.

But that only makes sense for new item creation.  Item cache population
can't do this.  In this inode index case it can replace a correct dirty
inode index item with its old pre-deletion item from the read.  This
clobbers the deletions and leaks the old inode index item versions.

So teach the item insertion for caching to never, ever, replace an
existing item.   This fixes assertion failures from trying to
immediately walk meta seq items after creating a few thousand dirty
entries.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-27 14:04:38 -07:00
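The policy distinction boils down to something like the following
sketch; the enum and helper are invented to make the rule explicit:

    #include <stdbool.h>

    enum insert_mode {
            INSERT_CREATE,          /* new item creation in a transaction */
            INSERT_READ_POPULATE,   /* filling the cache from segments */
    };

    /* Creation may overwrite an existing deletion item, but read
     * population must never replace anything already cached: the cached
     * item can be dirty and newer than what was read from segments. */
    static bool may_replace(enum insert_mode mode, bool existing_is_deletion)
    {
            if (mode == INSERT_READ_POPULATE)
                    return false;
            return existing_is_deletion;
    }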
Zach Brown
bf7b3ac506 scoutfs: fix ring first_seq calculation
As we write ring blocks we need to update the first_seq to point at
the first live block in the ring.

The existing calculation gets it wrong and stores the seq of the first
block that we wrote in this commit, not the first ring block that
is still live and would need to be read.

Fix the calculation so that we set first_seq to the first live block
in the ring.

This fixes the bug where a mount can spin printing the super it's using.
This is the server constantly trying to start up, as each server start
fails because it can't read the ring.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-27 14:04:38 -07:00
Zach Brown
2eecbbe78a scoutfs: add item cache key ioctls
These ioctls let userspace see the items and ranges that are cached.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-27 14:04:38 -07:00
Zach Brown
5f5729b2a4 scoutfs: add sticky compaction
As we write segments we're not limiting the number of segments they
intersect at the next level.  Compactions are limited to a fanout's
worth of overlapping segments.  This means that we can get a compaction
where the upper level segment overlaps more segments than are included
in the compaction.  In this case we can't write the remaining upper
level items at the lower level because now we can have a level with
segments whose keys intersect.

Instead we detect this compaction case.  We call it sticky because after
merging with the lower level segments the remaining items in the upper
level need to stick to the upper level.  The next time compaction comes
around it'll compact the remaining items with the additional lower
overlapping segments.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-27 14:04:38 -07:00
Zach Brown
d52f09449d scoutfs: reclaim item cache
Add an LRU and shrinker to reclaim old cached items under memory
pressure.  This is pretty awful today because of the separate cached
range structs and rbtree.  We do our best to blow away enough of the
cache and range to try and make progress.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-27 14:04:38 -07:00
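A rough sketch of an item LRU wired to a shrinker.  The names are
invented, the separate cached range structs are ignored, and the
count/scan shrinker API is assumed, which may not match the kernel this
was written against:

    #include <linux/list.h>
    #include <linux/spinlock.h>
    #include <linux/shrinker.h>

    struct cached_item {
            struct list_head lru_entry;
    };

    static LIST_HEAD(item_lru);
    static DEFINE_SPINLOCK(item_lru_lock);
    static unsigned long item_lru_count;

    static unsigned long item_cache_count(struct shrinker *shrink,
                                          struct shrink_control *sc)
    {
            return item_lru_count;
    }

    static unsigned long item_cache_scan(struct shrinker *shrink,
                                         struct shrink_control *sc)
    {
            unsigned long freed = 0;

            spin_lock(&item_lru_lock);
            while (freed < sc->nr_to_scan && !list_empty(&item_lru)) {
                    struct cached_item *item;

                    item = list_entry(item_lru.prev, struct cached_item,
                                      lru_entry);
                    list_del_init(&item->lru_entry);
                    item_lru_count--;
                    freed++;
                    /* the real code also drops the item from its rbtree
                     * and trims the covering cached range */
            }
            spin_unlock(&item_lru_lock);

            return freed;
    }

    static struct shrinker item_shrinker = {
            .count_objects  = item_cache_count,
            .scan_objects   = item_cache_scan,
            .seeks          = DEFAULT_SEEKS,
    };

Registering item_shrinker at mount and unregistering it at unmount would
wire the cache into memory pressure.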
Zach Brown
94e78414f9 scoutfs: add key trace class
Some item tracing functions were really just tracing a key.  Refactor
them into a trace class with event users.  Later patches can then use the key
trace class.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-27 14:04:38 -07:00
Mark Fasheh
e711c15acf scoutfs: use dlm for locking
To actually use it, we first have to copy symbols over from the dlm build
into the scoutfs source directory. Make that happen automatically for us in
the Makefile.

The only users of locking at the moment are mount, unmount and xattr
read/write. Adding more locking calls should be a straightforward endeavor.

The LVB based server IP communication didn't work out, and LVBs as they are
written don't make sense in a range locking world. So instead, we record the
server IP address in the superblock. This is protected by the listen lock,
which also arbitrates which node will be the manifest server.

We take and drop the dlm lock on each lock/unlock call. Lock caching will
come in a future patch.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-06-23 15:08:02 -05:00
Mark Fasheh
08bf1fea79 dlm: Give fs/dlm the notion of ranges
Using the new interval tree code we add a tree for each lock status list to
efficiently track ranged requests. Internally, most operations on a
resource's lock status list (granted, waiting, converting) are then turned
into operations within a given range.

There is no API change other than a new call, dlm_lock_range() and a new
structure, 'struct dlm_key' to define our range endpoints. Keys can have
arbitrary lengths and are compared via memcmp. A ranged blocking ast type is
defined so that users of dlm_lock_range() can know which range they are
blocking.

A rudimentary test, dlmtest.ko, is included.

TODO:
 - Update userspace entry points, need to add one for new lock call
 - Manage backwards compatibility with network protocol

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-06-23 15:07:10 -05:00
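A sketch of what memcmp-ordered, variable-length range endpoints look
like and how overlap would be tested.  The struct here is deliberately
named dlm_key_example; it is not the actual 'struct dlm_key' definition
from the patch:

    #include <string.h>

    struct dlm_key_example {
            unsigned int len;
            unsigned char val[32];
    };

    /* memcmp over the shared prefix, shorter key sorts first on a tie */
    static int key_cmp(const struct dlm_key_example *a,
                       const struct dlm_key_example *b)
    {
            unsigned int len = a->len < b->len ? a->len : b->len;
            int cmp = memcmp(a->val, b->val, len);

            if (cmp)
                    return cmp;
            return (a->len > b->len) - (a->len < b->len);
    }

    /* two lock requests conflict only if their [start, end] ranges overlap */
    static int ranges_overlap(const struct dlm_key_example *a_start,
                              const struct dlm_key_example *a_end,
                              const struct dlm_key_example *b_start,
                              const struct dlm_key_example *b_end)
    {
            return key_cmp(a_start, b_end) <= 0 && key_cmp(b_start, a_end) <= 0;
    }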
Mark Fasheh
0c1c2691e0 interval-tree: Allow user defined objects as endpoints
Users pass in a comparison function which is used when endpoints need to be
checked against each other. We also put each ITTYPE local definition on its
own line to facilitate the use of pointers. An upcoming dlm patch will make
use of this to allow for keyed, ranged locking.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-06-08 18:10:40 -05:00
Mark Fasheh
dfc220ad6f Import fs/dlm/* from linux-3.10.0-327.36.1.el7
Also wire it into the build system. We have to figure out how to get scoutfs
pulling in the right headers but that can wait until we have something more
usable.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-06-08 18:10:40 -05:00
Zach Brown
85cbe7dc97 scoutfs: add a counter add macro to match inc
Just a quick wrapper around the related percpu_counter call.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-07 08:52:11 -07:00
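The wrapper is presumably along these lines; the counters struct and
macro names below are made up, and only percpu_counter_add() itself is
the real kernel call:

    #include <linux/percpu_counter.h>

    struct example_counters {
            struct percpu_counter dirty_item_bytes;
            /* ... one percpu_counter per counted stat ... */
    };

    /* add an arbitrary amount to a named counter */
    #define example_add_counter(counters, which, cnt) \
            percpu_counter_add(&(counters)->which, cnt)

    /* the existing inc macro becomes a special case of add */
    #define example_inc_counter(counters, which) \
            example_add_counter(counters, which, 1)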
Zach Brown
0280971fab scoutfs: add bug on for out of order seg items
We've seen some cases where compaction writes a new segment that
contains items that aren't sorted.  This eventually leads to read being
misled in its binary search of the items in a segment and failing to
find the items it was looking for.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-06 15:45:15 -07:00
Zach Brown
a1dadd9763 scoutfs: lock around dirty item writing
Writing dirty items into a segment wasn't protected by locking.  It's
not racing with item dirtying, but it's absolutely racing with reads
while modifying the rbtree.  And shrinking will be modifying the item
cache at any old time in the future.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-06 15:35:05 -07:00
Zach Brown
1652512af7 scoutfs: remove ancient dirty item comment
This is just old and wrong.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-06 15:34:42 -07:00
Zach Brown
43e9d2caa2 scoutfs: trace compaction manifest entries
Trace the manifest entries compaction received from the server.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-06 15:31:47 -07:00
Zach Brown
b5ee282f6b scoutfs: minor manifest ring comparison tracing
It was nice to watch the ring compare nodes so leave behind the trace
and clean up the callers so that uninitialized keys are cleanly null.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-06 15:30:21 -07:00
Zach Brown
54d286d91c scoutfs: format strings for all key types
Add all the missing key types to scoutfs_key_str() so that we can get
traces and printks of all key types.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-06 15:01:37 -07:00
Zach Brown
1485b02554 scoutfs: add SK_ helpers for printing keys
Add some percpu string buffers so that we can pass formatted strings as
arguments when printing keys.  The percpu struct uses a different buffer
for each argument.  We wrap the whole print call in a wrapper that
disables and enables preemption.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-06 14:48:09 -07:00
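A sketch of the per-cpu buffer pattern this describes, with invented
names and a stand-in formatter.  Because the arguments are evaluated
inside the wrapper, the buffers are only touched while preemption is
disabled:

    #include <linux/percpu.h>
    #include <linux/preempt.h>
    #include <linux/kernel.h>

    #define SK_BUF_LEN 80
    #define SK_NR_ARGS 4

    struct sk_bufs {
            char buf[SK_NR_ARGS][SK_BUF_LEN];
    };
    static DEFINE_PER_CPU(struct sk_bufs, sk_bufs);

    /* format a key into this cpu's buffer for argument slot 'slot' */
    static char *example_key_str(const void *key, int slot)
    {
            char *buf = this_cpu_ptr(&sk_bufs)->buf[slot];

            snprintf(buf, SK_BUF_LEN, "key %p", key);   /* stand-in format */
            return buf;
    }

    /* wrap the whole print so the percpu buffers can't be preempted away */
    #define SK_PRINTK(fmt, args...)         \
    do {                                    \
            preempt_disable();              \
            printk(fmt, ##args);            \
            preempt_enable();               \
    } while (0)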
Zach Brown
a050152254 scoutfs: fix ring next/prev walk comparison test
The scoutfs_ring_next() and _prev() functions had a really dumb bug
where they checked the sign of comparisons by comparing with 1.  For
example, next would miss that the walk traversed a lesser item
and wouldn't return the next item.

This was causing compaction to miss underlying segments, creating
segments in levels that had overlapping keys, which then totally
confused reading and kept it from finding the items it was looking for.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-06 14:32:29 -07:00
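The bug class in miniature, as a hedged userspace example rather than
the actual ring code: comparators return any negative, zero, or positive
value, so testing for equality with 1 silently misses most greater-than
results:

    #include <string.h>

    static int compare_keys(const char *a, const char *b)
    {
            return strcmp(a, b);    /* may return 2, 37, -5, ... not just +/-1 */
    }

    static int next_is_greater_buggy(const char *a, const char *b)
    {
            return compare_keys(a, b) == 1;         /* WRONG */
    }

    static int next_is_greater_fixed(const char *a, const char *b)
    {
            return compare_keys(a, b) > 0;          /* only the sign matters */
    }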
Zach Brown
79de18443b scoutfs: don't extend key in dec_cur_len
A copy and paste bug had us extending the length of keys that were
decremented at their previous length.  The whole point of the _cur_len
functions is that they don't have to extend the key buf out to full
precision.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-06 14:30:43 -07:00
Zach Brown
2bd698b604 scoutfs: set NODELAY and REUSEADDR on net sockets
Add a helper that creates a socket, sets nodelay on all sockets, and
sets reuseaddr on listening sockets.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-06 14:29:05 -07:00
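A sketch of such a helper against the older in-kernel socket interfaces
(sock_create_kern() without a namespace argument, kernel_setsockopt());
error handling is trimmed and the function name is invented:

    #include <linux/types.h>
    #include <linux/net.h>
    #include <net/sock.h>
    #include <linux/tcp.h>
    #include <linux/in.h>

    static int example_create_sock(struct socket **sockp, bool listener)
    {
            struct socket *sock;
            int one = 1;
            int ret;

            ret = sock_create_kern(AF_INET, SOCK_STREAM, IPPROTO_TCP, &sock);
            if (ret)
                    return ret;

            /* nodelay on every socket, reuseaddr only when listening */
            kernel_setsockopt(sock, SOL_TCP, TCP_NODELAY,
                              (char *)&one, sizeof(one));
            if (listener)
                    kernel_setsockopt(sock, SOL_SOCKET, SO_REUSEADDR,
                                      (char *)&one, sizeof(one));

            *sockp = sock;
            return 0;
    }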
Zach Brown
c84250b8c6 scoutfs: add item_set_batch trace point
Restore the item_set_batch trace point by changing the current
insert_batch tracepoint to a class and defining insert and set as class
trace points.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-02 09:20:02 -07:00
Zach Brown
a2ef5ecb33 scoutfs: remove item_forget
It's pretty dangerous to forcefully remove items without writing
deletion items to LSM segments.  This was only used for magical
ephemeral items when we were having them store file data.

Signed-off-by: Zach Brown <zab@versity.com>
2017-05-23 12:59:24 -07:00
Zach Brown
1f933016f0 scoutfs: remove ephemeral items
Ephemeral items were only used by the page cache which tracked page
contents in items whose values pointed to the pages.  Remove their
special case.

Signed-off-by: Zach Brown <zab@versity.com>
2017-05-23 12:58:09 -07:00
Zach Brown
b7bbad1fba scoutfs: add precise transaction item reservations
We had a simple mechanism for ensuring that a transaction didn't create
more items than would fit in a single written segment.  We calculated
the most dirty items that a holder could generate and assumed that all
holders dirtied that much.

This had two big problems.

The first was that it wasn't accounting for nested holds.
write_begin/end calls the generic inode dirtying path while holding a
transaction.  This ended up deadlocking as the dirty inode waited to be
able to write while its trans hold, taken back in write_begin, prevented
writeout.

The second was that the worst case (full size xattr) item dirtying is
enormous and meaningfully restricts concurrent transaction holders.
With no currently dirty items you can have fewer than 16 full size xattr
writes.  This concurrency limit only gets worse as the transaction fills
up with dirty items.

This fixes those problems.  It adds precise accounting of the dirty
items that can be created while a transaction is held.  These
reservations are tracked in journal_info so that they can be used by
nested holds.  The precision allows much greater concurrency as
something like a create will try to reserve a few hundred bytes instead
of 64k.  Normal sized xattr operations won't try to reserve the largest
possible space.

We add some feedback from the item cache to the transaction to issue
warnings if a holder dirties more items than it reserved.

Now that we have precise item/key/value counts (segment space
consumption is a function of all three :/) we can't have a single atomic
track transaction holders.  We add a long-overdue trans_info and put a
proper lock and fields there and much more clearly track transaction
serialization amongst the holders and writer.

Signed-off-by: Zach Brown <zab@versity.com>
2017-05-23 12:15:13 -07:00
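A simplified sketch of reservation tracking through journal_info so that
nested holds ride on the outer reservation; the struct and function
names are invented and the actual accounting of item/key/value space is
elided:

    #include <linux/sched.h>
    #include <linux/slab.h>
    #include <linux/errno.h>

    struct trans_reservation {
            int holders;            /* nesting depth on this task */
            unsigned int items;     /* precise reserved item count */
            unsigned int keys;      /* ... key bytes */
            unsigned int vals;      /* ... value bytes */
    };

    static int example_hold_trans(unsigned int items, unsigned int keys,
                                  unsigned int vals)
    {
            struct trans_reservation *res = current->journal_info;

            if (res) {
                    /* nested hold: covered by the outer reservation */
                    res->holders++;
                    return 0;
            }

            res = kzalloc(sizeof(*res), GFP_NOFS);
            if (!res)
                    return -ENOMEM;

            /* the real code waits here until the segment has room */
            res->holders = 1;
            res->items = items;
            res->keys = keys;
            res->vals = vals;
            current->journal_info = res;
            return 0;
    }

    static void example_release_trans(void)
    {
            struct trans_reservation *res = current->journal_info;

            if (--res->holders == 0) {
                    current->journal_info = NULL;
                    kfree(res);
            }
    }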
Zach Brown
297b859577 scoutfs: deletion items maintain counts
When we turned existing items into deletion items we'd remove their
values.  But we didn't update the count of dirty values to reflect that
removal so the dirty value count would slowly grow without bound.

Signed-off-by: Zach Brown <zab@versity.com>
2017-05-23 12:12:30 -07:00
Zach Brown
5f11cdbfe5 scoutfs: add and index inode meta and data seqs
For each transaction we send a message to the server asking for a
unique sequence number to associate with the transaction.  When we
change metadata or data of an inode we store the current transaction seq
in the inode and we index it with index items like the other inode
fields.

The server remembers the sequences it gives out.  When we go to walk the
inode sequence indexes we ask the server for the largest stable seq and
limit results to that seq.  This ensures that we never return seqs that
are past dirty items, so we never have inodes and seqs appear in the past.

Nodes use the sync timer to regularly cycle through seqs and ensure that
inode seq index walks don't get stuck on their otherwise idle seq.

Signed-off-by: Zach Brown <zab@versity.com>
2017-05-23 12:12:24 -07:00
Zach Brown
b291818448 scoutfs: add sync deadline timer
Make sure that data is regularly synced.  We switch to a delayed work
struct that is always queued with the sync deadline.  If we need an
immediate sync we mod it to now.

Signed-off-by: Zach Brown <zab@versity.com>
2017-05-19 11:19:56 -07:00
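The delayed work pattern described above looks roughly like this; the
names and the deadline value are placeholders:

    #include <linux/workqueue.h>
    #include <linux/jiffies.h>

    #define SYNC_DEADLINE (10 * HZ)         /* placeholder deadline */

    static void example_sync_func(struct work_struct *work);
    static DECLARE_DELAYED_WORK(example_sync_dwork, example_sync_func);

    static void example_sync_func(struct work_struct *work)
    {
            /* ... write out the dirty transaction ... */

            /* always re-arm so data is synced within the deadline */
            schedule_delayed_work(&example_sync_dwork, SYNC_DEADLINE);
    }

    /* callers needing an immediate sync pull the deadline in to "now" */
    static void example_sync_now(void)
    {
            mod_delayed_work(system_wq, &example_sync_dwork, 0);
    }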
Zach Brown
373def02f0 scoutfs: remove trade_time message
This was mostly just a demonstration for how to add messages.  We're
about to add a message that we always send on mount so this becomes
completely redundant.

Signed-off-by: Zach Brown <zab@versity.com>
2017-05-18 10:52:04 -07:00
Zach Brown
8ea414ac68 scoutfs: clear seg rb node after replacing
When inserting a newly allocated segment we might find an existing
cached stale segment.  We replace it in the cache so that its user can
keep using its stale contents while we work on the new segment.

Replacing doesn't clear the rb_node, though, so we trip over a warning
when we finally free the segment and it looks like it's still present in
the rb tree.

Clear the node after we replace it so that freeing sees a clear node and
doesn't issue a warning.

Signed-off-by: Zach Brown <zab@versity.com>
2017-05-16 14:51:36 -07:00
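The fix amounts to the pattern below, using the stock rbtree helpers;
the function wrapping them is invented:

    #include <linux/rbtree.h>

    /* after swapping a stale segment's node out of the tree, clear it so
     * the eventual free doesn't look like it's freeing a linked node */
    static void replace_stale_node(struct rb_node *stale, struct rb_node *new,
                                   struct rb_root *root)
    {
            rb_replace_node(stale, new, root);
            RB_CLEAR_NODE(stale);
    }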
Zach Brown
5307c56954 scoutfs: add a stat_more ioctl
We have inode fields that we want to return to userspace with very low
overhead.

Signed-off-by: Zach Brown <zab@versity.com>
2017-05-16 14:28:10 -07:00