2141 Commits

Author SHA1 Message Date
Zach Brown
b8d7e04262 Add negative caching of item ranges
The item cache only knew about present items in the rbtree.  Attempts to
read items that didn't exist would always trigger expensive manifest
and segment searches.

This reworks the item cache and item reading code to support the notion
of cached ranges of keys.  When we read items we also communicate the
range of keys that we searched.  This lets the cache return negative lookups
for keys in the searched range that don't have items.

The item cache gets an rbtree of key ranges.  Each item lookup method
now uses it to determine if a missing item needs to trigger a read.
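
A rough sketch of that check (hypothetical names and a simplified u64
key; the real code compares kvec keys):

#include <linux/rbtree.h>
#include <linux/types.h>

struct cached_range {
        struct rb_node node;
        u64 start;              /* smallest key covered by a past search */
        u64 end;                /* largest key covered by a past search */
};

/*
 * Return true if @key falls inside a range that a previous read has
 * already searched.  A missing item inside such a range is a cached
 * negative lookup and doesn't need a manifest or segment read.
 */
static bool key_range_is_cached(struct rb_root *root, u64 key)
{
        struct rb_node *n = root->rb_node;

        while (n) {
                struct cached_range *rng =
                        rb_entry(n, struct cached_range, node);

                if (key < rng->start)
                        n = n->rb_left;
                else if (key > rng->end)
                        n = n->rb_right;
                else
                        return true;    /* covered, negative hit */
        }
        return false;                   /* unknown, caller must read */
}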

Item reading is now performed in batches instead of one at a time.  This
lets us specify the cache range along with the batch and apply them all
atomically under the lock.

The item range code is much more robust now that it has to track
the range of keys that it searches.  The read items call now takes a range.
It knows to look for all level0 segments that intersect that range, not
just the first key.  The manifest segment references now include the
min and max keys for the segment so we can use those to define the
item search range.

Since the refs now include keys we no longer store them as a dumb allocated
array but instead as a list of allocated ref structs.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
cbb4282429 Add kvec key helpers
Add a suite of simple kvec functions to work with kvecs that point to
file system keys.

The ones worth mentioning format keys into strings.  They're used to
add formatted strings for the keys to tracepoints.  They're still a
little rough but this is a functional first step.
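
A sketch of the kind of helper this adds (hypothetical name, and only a
single vector; the real helpers work with whole kvec arrays):

#include <linux/kernel.h>
#include <linux/uio.h>

/* hex-format the key bytes behind @kv into @buf for a tracepoint */
static size_t kvec_key_str(char *buf, size_t len, struct kvec *kv)
{
        u8 *key = kv->iov_base;
        size_t off = 0;
        size_t i;

        for (i = 0; i < kv->iov_len && off < len; i++)
                off += scnprintf(buf + off, len - off, "%02x", key[i]);

        return off;
}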

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
57a6ff087f Add max key and max key size to format
We're going to use these to support tracking cached item ranges.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
1b4bab3217 Fix ring read nr/part confusion
Some parts of the ring reading were still using the old 'nr' for the
number of blocks to read, but it's now the total number of blocks
in the ring.  Use part instead.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
c8d61c2e01 Manifest item reading tracked wrong key
When iterating over items the manifest would always insert
whatever values it found at the caller's key, instead of the
key that it found in the segment.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
c74787a848 Harden, simplify, and shrink the kvecs
The initial kvec code was a bit wobbly.  It had raw loops, some weird
constructs, and more elements than we need.

Add some iterator helpers that make it less likely that we'll screw up
iterating over different-length vectors.  Get rid of the reliance on a
trailing null pointer and always use the count of elements to stop
iterating.  With that in place we can shrink the number of elements to
just what the greatest user needs.
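
A sketch of the shape of such a helper (hypothetical macro, not the
actual code): iteration is bounded by the element count rather than by
a terminating NULL pointer.

#include <linux/uio.h>

/* visit only the populated vectors, bounded by the element count */
#define sketch_kvec_for_each(i, kv, nr) \
        for ((i) = 0; (i) < (nr) && (kv)[(i)].iov_len; (i)++)

static size_t sketch_kvec_length(struct kvec *kv, unsigned int nr)
{
        size_t len = 0;
        unsigned int i;

        sketch_kvec_for_each(i, kv, nr)
                len += kv[i].iov_len;

        return len;
}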

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
51a84447dd Fix calculation of number of dirty segments
The estimate of the number of dirty segment bytes was wildly
over-counting the segment headers by confusing the length of the
segment header with the length of segments.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
acee97ba2a Fix tail ring entry zeroing
The space calculation didn't include the terminating zero entry.  That
ensured that the space for the entry would never be consumed.  But the
remaining space was used to zero the end of the block so the final entry
wasn't being zeroed.

So have the space remaining include the terminating entry and factor
that into the space consumption of each entry being appended.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
967e90e5ef Fix kvec overlapping comparison
The comparisons were a bit wrong when comparing overlapping kvec
endpoints.  We want to compare the starts and ends with the ends and
starts, respectively.
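
The check boils down to this shape (sketched with plain integer
endpoints; the real code compares kvec keys):

#include <linux/types.h>

/*
 * [a_start, a_end] and [b_start, b_end] overlap iff each start is at
 * or before the *other* range's end.
 */
static bool ranges_overlap(u64 a_start, u64 a_end, u64 b_start, u64 b_end)
{
        return a_start <= b_end && b_start <= a_end;
}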

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
b251f91842 Fix updating parent dirty bits
We were trying to propagate dirty bits starting from the node itself when
its dirty bit is set.  But its bits are already consistent, so propagation
stops immediately.  We need to propagate from the parent of the node that
changed.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
b7b43de8c7 Queue trans work in our work queue
We went to the trouble of allocating a work queue with one work in
flight but then didn't use it.  That left us open to concurrent trans
write func execution.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
a5cac107a1 Set END_IO on allocated segs
A reader that hits an allocated segment would wait on IO forever.
Setting the end_io bit lets readers use written segments.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
f9ca1885f9 Specify ring blocks with index,nr
Specifying the ring blocks with a head and tail index led to pretty
confusing code to figure out how many blocks to read and if we had
passed the tail.

Instead specify the ring with a starting index and number of blocks.
The code to read and write the ring blocks naturally falls out and is a
lot more clear.
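
A sketch of the arithmetic (hypothetical field names): with a starting
index and a count, the i'th live block is just a modular offset from
the start.

#include <linux/types.h>

struct ring_sketch {
        u64 first_blkno;        /* device blkno of ring block 0 */
        u64 total_blocks;       /* blocks in the whole ring */
        u64 index;              /* ring index of the first live block */
        u64 nr;                 /* number of live blocks */
};

/* map the i'th live block, 0 <= i < nr, to its device blkno */
static u64 ring_sketch_blkno(struct ring_sketch *ring, u64 i)
{
        return ring->first_blkno +
               ((ring->index + i) % ring->total_blocks);
}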

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
b598bf855d Fix ring appending and writing
When appending to the ring, the next entry header cursor assignment
pointed past the caller's src header instead of at the next header to
write to in the block.

The writing block index and blkno calculations were just bad.  Pretend
they never happened.

And finally we need to point the dirty super at the ring index for the
commit and we need to reset the append state for the next commit.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
b8ede1f6ee Fix ring block tail zeroing
The ring block tail zeroing memset treated the value as the offset to
zero from, not the number of bytes at the tail to zero.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:52 -07:00
Zach Brown
fd7b09b4e4 Fix written manifest entry length
Manifest entries were being written with the size of their in-memory
nodes, not the smaller persistent add_manifest structure size.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:52 -07:00
Zach Brown
e418629bea Remove items from trees before freeing
Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:52 -07:00
Zach Brown
454767e992 Stop first dirty search looping
first_dirty() forgot to stop if the tree had no dirty items at all.
It'd just spin forever.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:52 -07:00
Zach Brown
a8a6d3697b Fix dirty item counter calculations
Unfortunately the generic augmented callbacks don't work for our
augmented node bits which specifically reflect the left and right nodes.
We need our own rotation callback and then we have boilerplate for the
other two copy and propagate callbacks.  Once we have to provide
.propagate we can call it instead of our own update_dirty_parents()
equivalent.
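
For reference, the rough shape of such callbacks (a sketch with
hypothetical fields, not the actual scoutfs code).  Because the
augmented bits summarize the left and right children rather than the
whole subtree, the rotate hook has to recompute both nodes itself:

#include <linux/rbtree_augmented.h>

struct sketch_item {
        struct rb_node node;
        bool dirty;
        bool left_dirty;        /* any dirty item in the left subtree */
        bool right_dirty;       /* any dirty item in the right subtree */
};

static bool subtree_dirty(struct rb_node *n)
{
        struct sketch_item *it;

        if (!n)
                return false;
        it = rb_entry(n, struct sketch_item, node);
        return it->dirty || it->left_dirty || it->right_dirty;
}

/* recompute one node's child summaries, return true if they changed */
static bool sketch_update(struct sketch_item *it)
{
        bool left = subtree_dirty(it->node.rb_left);
        bool right = subtree_dirty(it->node.rb_right);
        bool changed = left != it->left_dirty || right != it->right_dirty;

        it->left_dirty = left;
        it->right_dirty = right;
        return changed;
}

static void sketch_propagate(struct rb_node *node, struct rb_node *stop)
{
        while (node != stop &&
               sketch_update(rb_entry(node, struct sketch_item, node)))
                node = rb_parent(node);
}

static void sketch_copy(struct rb_node *old, struct rb_node *new)
{
        struct sketch_item *o = rb_entry(old, struct sketch_item, node);
        struct sketch_item *n = rb_entry(new, struct sketch_item, node);

        n->left_dirty = o->left_dirty;
        n->right_dirty = o->right_dirty;
}

static void sketch_rotate(struct rb_node *old, struct rb_node *new)
{
        /* old is now a child of new, so recompute old before new */
        sketch_update(rb_entry(old, struct sketch_item, node));
        sketch_update(rb_entry(new, struct sketch_item, node));
}

static const struct rb_augment_callbacks sketch_cb = {
        .propagate      = sketch_propagate,
        .copy           = sketch_copy,
        .rotate         = sketch_rotate,
};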

In addition some callers messed up marking and clearing dirty.  We only
want to mark dirty item insertions, not all inserted items.  And if we
update an item's keys and values we need to clear and mark it to keep
the counters consistent.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:52 -07:00
Zach Brown
fbd12b4dda Fix existing item insertion
Inserting an item over an existing key was super broken.  Now that we're
not replacing we can't stop descent if we find an existing item.  We
need to keep descending and then insert.  And the caller needs to, you
know, actually remove the existing item when it's found -- not the item
it just inserted :P.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:52 -07:00
Zach Brown
5eb388ae6e Fix seg item filling
The two functions that added items had little bugs.  They initialized
the item vectors incorrectly and didn't actually store the keys and
values.  Appending was always overwriting the first segment.  Rename
its 'nr' to 'pos', like the rest of the code, to make it clearer.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:52 -07:00
Zach Brown
07ba01f6b0 Initialize segment header when writing item
Initialize the segment header as the items are written.  This isn't a
great place to do it.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:52 -07:00
Zach Brown
48f9be8455 Free key and value in the right order!
Always key then value!

*twitch*

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:52 -07:00
Zach Brown
9e02573e06 Rename scoutfs_seg_manifest_add
Rename scoutfs_seg_add_ment to _manifest_add as that makes it a lot more
clear that it's a wrapper around scoutfs_manifest_add() that gets its
arguments from the segment.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:52 -07:00
Zach Brown
471405f8cd Fix kvec iterators
A few of the kvec iterators that work with byte offsets forgot to reset
the offsets as they advanced to the next vec.

These should probably be refactored into a set of iterator helpers.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:52 -07:00
Zach Brown
be98c4dfd8 Fix up manifest key use
The manifest entries were changed to be a single contiguous allocation.

The calculation of the vec that points to the last key was adding
the key length in units of the add manifest struct rather than bytes.

Adding the manifest entry wasn't setting the key lengths or copying the
keys into their position in the entry allocation.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:52 -07:00
Zach Brown
c4954eb6f4 Add initial LSM write implementation
Add all the core structural components to be able to modify metadata.  We
modify items in fs write operations, track dirty items in the cache,
allocate free segment block regions, stream dirty items into segments,
write out the segments, update the manifest to reference the written
segments, and write out a new ring that has the new manifest.
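
In outline (hypothetical function names, not the actual call chain),
a commit walks roughly this pipeline:

#include <linux/fs.h>
#include <linux/types.h>

/* hypothetical stand-ins for the steps described above */
int cache_has_dirty_items(struct super_block *sb);
int alloc_free_segment(struct super_block *sb, u64 *segno);
int stream_dirty_items_into_segment(struct super_block *sb, u64 segno);
int write_segment(struct super_block *sb, u64 segno);
int manifest_add_segment(struct super_block *sb, u64 segno);
int write_ring_with_new_manifest(struct super_block *sb);

/* sketch of the write path: dirty items -> segments -> manifest -> ring */
static int sketch_commit(struct super_block *sb)
{
        u64 segno;
        int ret;

        while (cache_has_dirty_items(sb)) {
                ret = alloc_free_segment(sb, &segno) ?:
                      stream_dirty_items_into_segment(sb, segno) ?:
                      write_segment(sb, segno) ?:
                      manifest_add_segment(sb, segno);
                if (ret)
                        return ret;
        }

        return write_ring_with_new_manifest(sb);
}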

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:42:30 -07:00
Zach Brown
b45ec8824b Add a bunch of trace_printk()s
There's nothing particularly coherent about these; they're
just what I added while debugging.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:42:30 -07:00
Zach Brown
641aae50ed Fix ring block replay
The ring block replay walk messed up the blkno it read
from and its exit condition.  It needed to test for having
just replayed the tail before moving on.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:42:30 -07:00
Zach Brown
21d313e0f6 Correctly wait on submitted segment reads
The segment waiting loop was rewritten to use 'n' to
iterate up to 'i', but the body of the loop still used 'i'.  Take that as
a signal to Always Iterate With 'i': store the last 'i' and then
iterate towards it with 'i' again.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:42:30 -07:00
Zach Brown
d1f36e2165 Correctly store last manifest key
A copy+paste error led us to overwrite the first
key in the manifest with the last, leaving the
last uninitialized.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:42:30 -07:00
Zach Brown
f3288f27c6 Declare full kvecs in manifest
The manifest had silly single kvecs instead of the macros that define
our maximal kvecs.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:42:30 -07:00
Zach Brown
6957c73aba Have _lookup_exact return 0
scoutfs_item_lookup_exact() exists to look up an item of exactly one
size.  Have it just return 0 on success so callers don't have to remember
that it
returns > 0.
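
One plausible shape for that (a sketch around a hypothetical lower-level
lookup; the real functions and errno choice may differ):

#include <linux/errno.h>
#include <linux/fs.h>

/* hypothetical lower-level lookup that returns the matched length */
int sketch_item_lookup(struct super_block *sb, void *key, void *val,
                       int size);

/* succeed only for an exact size and return 0 so callers never have to
 * think about a positive return value */
static int sketch_item_lookup_exact(struct super_block *sb, void *key,
                                    void *val, int size)
{
        int ret = sketch_item_lookup(sb, key, val, size);

        if (ret < 0)
                return ret;
        if (ret != size)
                return -EIO;
        return 0;
}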

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:42:30 -07:00
Zach Brown
3f27de0b2c Fix hilarious BLOCK_SIZE typo
Turns out BLOCK_SIZE is already a thing and confused scoutfs into thinking
it had many more blocks per segment than it did.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:42:30 -07:00
Zach Brown
a201cff5ad Read supers with bios instead of bl blocks
Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:42:30 -07:00
Zach Brown
f7f840a342 Fix bio read completion init
Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:42:30 -07:00
Zach Brown
43d0d44e48 Add initial LSM implementation
Add the initial core components of the LSM implementation to be able to
read the root inode:

 - bio.c: read big block regions
 - seg.c: cache logical segments
 - ring.c: read the manifest from storage
 - manifest.c: organize segments into an LSM
 - kvec.c: work with arbitrary memory vectors
 - item.c: cache fs metadata items read from segments

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:38:50 -07:00
Zach Brown
9d3fe27929 Add data_since command
Add a data_since command that operates just like inodes-since.

Signed-off-by: Zach Brown <zab@versity.com>
2016-11-18 10:52:25 -08:00
Zach Brown
5fcf70b53e Catch up to kernel's scoutfs_extent
Signed-off-by: Zach Brown <zab@versity.com>
2016-11-17 19:48:05 -08:00
Zach Brown
41e3ca0f41 Consistently use __u8 in format.h
Signed-off-by: Zach Brown <zab@versity.com>
2016-11-17 19:47:39 -08:00
Zach Brown
ec702b9bb3 Update the data_version ioctl to return the u64
We updated the code to use the new iteration of the data_version ioctl
but forgot to update the ioctl definition, so it didn't actually work.

Signed-off-by: Zach Brown <zab@versity.com>
2016-11-17 15:45:56 -08:00
Zach Brown
c3f122a5f1 Fix mkfs buddy initialization
mkfs was setting free blk bits starting from 0 instead of from
the blkno offset of the first free block.  This resulted in
the highest order above a used blkno being marked free.  Freeing
that blkno would then also set its lowest order bit, so that blkno could
be allocated from two orders.  That, eventually, can lead to blocks
being doubly allocated and users trampling on each other.

While auditing the code to chase this bug down I also noticed that
write_buddy_blocks() was using a min() that makes no sense at all.  Here
'blk' is inclusive, so the modulo math works on its own.

Signed-off-by: Zach Brown <zab@versity.com>
2016-11-17 15:43:23 -08:00
Zach Brown
c6b688c2bf Add staging ioctl
This adds the ioctl for writing archived file contents back into the
file if the data_version still matches.

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2016-11-16 14:45:08 -08:00
Zach Brown
df561bbd19 Add offline extent flag and release ioctl
Add the _OFFLINE flag to indicate offline extents.  The release ioctl
frees extents within the release range and sets their _OFFLINE flag if
the data_version still matches.

We tweak the existing truncate item function just a bit to support
making extents offline.  We make it take an explicit range of blocks to
remove instead of just giving it the size and it learns to mark extents
offline and update them instead of always deleting them.

Reads from offline extents return zeros like reading from a sparse
region (later it will trigger demand staging) and writing to offline
extents clears the offline flag (later only staging can do that).
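
A sketch of the mapping behaviour described here (hypothetical flag and
structure names): an offline extent maps like a hole so reads see
zeros, and a write clears the flag.

#include <linux/types.h>

#define SKETCH_EXTENT_OFFLINE   (1 << 0)        /* hypothetical flag bit */

struct sketch_extent {
        u64 iblock;     /* first logical block */
        u64 blkno;      /* first physical block, 0 if sparse */
        u64 len;
        u8 flags;
};

/* return the physical block backing @iblock, or 0 to read as a hole */
static u64 sketch_map_block(struct sketch_extent *ext, u64 iblock,
                            bool write)
{
        if (ext->flags & SKETCH_EXTENT_OFFLINE) {
                if (!write)
                        return 0;       /* offline reads return zeros */
                ext->flags &= ~SKETCH_EXTENT_OFFLINE;
        }

        return ext->blkno ? ext->blkno + (iblock - ext->iblock) : 0;
}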

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2016-11-16 14:45:08 -08:00
Zach Brown
5d87418925 Add ioctl for sampling inode data version
Add an ioctl that samples the inode's data_version.

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2016-11-16 14:45:08 -08:00
Zach Brown
f86fab1162 Add an inode data_version field
The data_version field is changed every time the contents of the file
could have changed.

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2016-11-16 14:45:08 -08:00
Mark Fasheh
467801de73 scoutfs: use extents for file data
We're keeping things very basic at this stage and simply put a single-block
extent item where we would have previously had a multi-block bmap item.
Multi-block extents will come in future patches.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
Signed-off-by: Zach Brown <zab@versity.com>
2016-11-16 14:45:08 -08:00
Zach Brown
f1b29c8372 scoutfs_btree_prev() searches prev block, not next
Oops, scoutfs_btree_prev() asked btree_walk() for the key for the next
block, not the previous block, to search when its walk lands in the
space before all the items in the leaf block.

I saw it when truncate's check_size_eq constraint failed on items
outside the range which stopped the truncate and left inodes, extents,
and the orphan item around after rm -rf.

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2016-11-16 14:45:08 -08:00
Zach Brown
af5955e95a Add found_seq argument to scoutfs_btree_prev
Add a *found_seq argument to _prev so that it can give the caller the
seq of the item that's returned.  The extent code is going to use this
to find seqs of extents.

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2016-11-16 14:45:08 -08:00
Zach Brown
ae6cc83d01 Raise the nlink limit
A few xfstests tests were failing because they tried to create a decent
number of hard links to a file.

We had a small nlink limit because the inode-paths ioctl copied all the
paths for all the hard links to a userspace buffer which could be
enormous if there was a larger nlink limit.

The hard link backref disk format already has a natural counter that
could be used as a cursor to iterate over all the hard links that point
to a given inode.

This refactors the inode_paths ioctl into an ino_path ioctl that returns
a single path for the given counter and returns the counter for the next
path that links to the inode.  Happily this lets us get rid of all the
weird path component lists and allocations.  Now there's just the kernel
path buffer that gets null terminated path components and the userspace
buffer that those are copied to.
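
From userspace the iteration looks roughly like this (hypothetical
struct and wrapper; the actual scoutfs ioctl definition differs):

#include <limits.h>
#include <stdint.h>
#include <stdio.h>

struct sketch_ino_path {
        uint64_t ino;
        uint64_t ctr;           /* in: cursor, out: cursor of the next link */
        char path[PATH_MAX];
};

/* hypothetical wrapper: returns 1 when a path was filled in, 0 when no
 * links remain past the cursor, negative on error */
int sketch_ino_path(int fd, struct sketch_ino_path *args);

static int print_all_paths(int fd, uint64_t ino)
{
        struct sketch_ino_path args = { .ino = ino, .ctr = 0 };
        int ret;

        while ((ret = sketch_ino_path(fd, &args)) > 0)
                printf("%s\n", args.path);

        return ret;
}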

We don't fully relax the nlink limit.  stat(2) returns the link count as
a u32.  We go a step further and limit it to S32_MAX so that apps might
avoid sign bugs.  That still gives us a more generous limit than ext4
and btrfs which are around U16_MAX.

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2016-11-16 14:45:08 -08:00