Commit Graph

159 Commits

Zach Brown
fbd12b4dda Fix existing item insertion
Inserting an item over an existing key was super broken.  Now that we're
not replacing we can't stop descent if we find an existing item.  We
need to keep descending and then insert.  And the caller needs to, you
know, actually remove the existing item when it's found -- not the item
it just inserted :P.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:52 -07:00
Zach Brown
5eb388ae6e Fix seg item filling
The two functions that added to items had little bugs.  They initialized
the item vectors incorrectly and didn't actually store the keys and
values.  Appending was always overwriting the first segment.  Have it
call 'nr' 'pos', like the rest of the code does, to make it clearer.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:52 -07:00
Zach Brown
07ba01f6b0 Initialize segment header when writing item
Initialize the segment header as the items are written.  This isn't a
great place to do it.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:52 -07:00
Zach Brown
48f9be8455 Free key and value in the right order!
Always key then value!

*twitch*

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:52 -07:00
Zach Brown
9e02573e06 Rename scoutfs_seg_manifest_add
Rename scoutfs_seg_add_ment to _manifest_add as that makes it a lot more
clear that it's a wrapper around scoutfs_manifest_add() that gets its
arguments from the segment.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:52 -07:00
Zach Brown
471405f8cd Fix kvec iterators
A few of the kvec iterators that work with byte offsets forgot to reset
the offsets as they advanced to the next vec.

These should probably be refactored into a set of iterator helpers.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:52 -07:00
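
The bug class above is easy to picture with a minimal userspace sketch; the
iterator struct and helper below are illustrative and not taken from the
scoutfs source.  The fix is the single line that resets the byte offset
when stepping to the next vec.

#include <stddef.h>
#include <string.h>

/* hypothetical stand-in for the kernel's struct kvec */
struct kvec {
        void *iov_base;
        size_t iov_len;
};

struct kvec_iter {
        struct kvec *vec;       /* array of vectors */
        unsigned nr;            /* number of vectors */
        unsigned i;             /* current vector */
        size_t off;             /* byte offset within the current vector */
};

/* copy len bytes out of the iterator, advancing across vector boundaries */
static size_t kvec_iter_copy(struct kvec_iter *it, void *dst, size_t len)
{
        size_t copied = 0;

        while (len && it->i < it->nr) {
                struct kvec *kv = &it->vec[it->i];
                size_t part = kv->iov_len - it->off;

                if (part > len)
                        part = len;

                memcpy((char *)dst + copied, (char *)kv->iov_base + it->off, part);
                copied += part;
                len -= part;
                it->off += part;

                if (it->off == kv->iov_len) {
                        it->i++;
                        it->off = 0;    /* the fix: reset for the next vec */
                }
        }

        return copied;
}
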
Zach Brown
be98c4dfd8 Fix up manifest key use
The manifest entries were changed to be a single contiguous allocation.

The calculation of the vec that points to the last key vec was adding
the key length in units of the add manifest struct rather than in bytes.

Adding the manifest wasn't setting the key lengths nor copying the keys
into their position in the entry alloc.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:52 -07:00
Zach Brown
c4954eb6f4 Add initial LSM write implementation
Add all the core structural components to be able to modify metadata.  We
modify items in fs write operations, track dirty items in the cache,
allocate free segment block regions, stream dirty items into segments,
write out the segments, update the manifest to reference the written
segments, and write out a new ring that has the new manifest.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:42:30 -07:00
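
As a rough, hedged model of the write path described above (every name and
size here is made up, and a real commit also has to write the segments and
the new ring durably), streaming sorted dirty items into fixed-size
segments and recording each segment in the manifest looks roughly like
this:

#include <stdio.h>

/* toy model of the commit pipeline: not scoutfs code, just its shape */

#define SEG_CAP 4                       /* items per segment in this toy */

struct item { int key; int val; };

struct segment {
        unsigned long segno;
        int nr;
        struct item items[SEG_CAP];
};

struct manifest_entry {                 /* records a written segment's keys */
        unsigned long segno;
        int first_key, last_key;
};

static unsigned long next_segno = 1;    /* stand-in for the segment allocator */

/* stream a sorted array of dirty items into segments and add manifest entries */
static int commit_dirty_items(const struct item *dirty, int nr,
                              struct manifest_entry *manifest, int *nr_manifest)
{
        int i = 0;

        while (i < nr) {
                struct segment seg = { .segno = next_segno++, .nr = 0 };

                while (i < nr && seg.nr < SEG_CAP)      /* fill the segment */
                        seg.items[seg.nr++] = dirty[i++];

                /* "write" the segment, then reference it from the manifest */
                manifest[*nr_manifest] = (struct manifest_entry){
                        .segno = seg.segno,
                        .first_key = seg.items[0].key,
                        .last_key = seg.items[seg.nr - 1].key,
                };
                (*nr_manifest)++;
        }

        /* a real commit would now write a new ring describing the manifest */
        return 0;
}

int main(void)
{
        struct item dirty[] = { {1, 10}, {2, 20}, {5, 50}, {7, 70}, {9, 90} };
        struct manifest_entry manifest[8];
        int nr_manifest = 0;

        commit_dirty_items(dirty, 5, manifest, &nr_manifest);
        for (int i = 0; i < nr_manifest; i++)
                printf("segno %lu keys [%d, %d]\n", manifest[i].segno,
                       manifest[i].first_key, manifest[i].last_key);
        return 0;
}
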
Zach Brown
b45ec8824b Add a bunch of trace_printk()s
There's nothing particularly coherent about these; they're
what I added while debugging.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:42:30 -07:00
Zach Brown
641aae50ed Fix ring block replay
The ring block replay walk messed up the blkno it read
from and its exit condition.  It needed to test for having
just replayed the tail before moving on.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:42:30 -07:00
Zach Brown
21d313e0f6 Correctly wait on submitted segment reads
The segment waiting loop was rewritten to use n to
iterate up to i, but the body of the loop still had i.  Take that as a
signal to Always Iterate With 'i' and store the last i and then
iterate towards it with i again.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:42:30 -07:00
Zach Brown
d1f36e2165 Correctly store last manifest key
A copy+paste error led us to overwrite the first
key in the manifest with the last, leaving the
last uninitialized.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:42:30 -07:00
Zach Brown
f3288f27c6 Declare full kvecs in manifest
The manifest had silly single kvecs instead of the macros that define
our maximal kvecs.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:42:30 -07:00
Zach Brown
6957c73aba Have _lookup_exact return 0
scoutfs_item_lookup_exact() exists only to return a single size.  Have it
just return 0 on success so callers don't have to remember that it
returns > 0.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:42:30 -07:00
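
A small sketch of the calling convention, with toy stand-ins for the item
store: the underlying lookup returns the value length (> 0) or a negative
errno, and the _exact wrapper collapses success to 0 so callers only ever
test for zero or an error.

#include <errno.h>
#include <stdio.h>
#include <string.h>

static const char stored_val[] = "abcd";        /* pretend item value */

/* returns the number of value bytes found (> 0) or a negative errno */
static int item_lookup(int key, void *val, int val_len)
{
        int len = (int)sizeof(stored_val);

        if (key != 1)
                return -ENOENT;
        if (val_len < len)
                return -EOVERFLOW;
        memcpy(val, stored_val, len);
        return len;                             /* callers must remember > 0 */
}

static int item_lookup_exact(int key, void *val, int val_len)
{
        int ret = item_lookup(key, val, val_len);

        if (ret < 0)
                return ret;
        if (ret != val_len)                     /* only one size is acceptable */
                return -EIO;
        return 0;                               /* success is simply 0 */
}

int main(void)
{
        char buf[sizeof(stored_val)];

        printf("exact lookup: %d\n", item_lookup_exact(1, buf, sizeof(buf)));
        return 0;
}
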
Zach Brown
3f27de0b2c Fix hilarious BLOCK_SIZE typo
Turns out BLOCK_SIZE is a thing and confused scoutfs into thinking it
had many more blocks per segment than it did.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:42:30 -07:00
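
For context, the kernel's <linux/fs.h> defines a generic BLOCK_SIZE of 1024
bytes, so picking up that identifier instead of the filesystem's own block
size inflates the blocks-per-segment math.  The sizes below are
illustrative only, not the scoutfs on-disk values:

#include <stdio.h>

/* the kernel header <linux/fs.h> defines this generic constant */
#define BLOCK_SIZE              1024u

/* illustrative values only; not taken from the scoutfs headers */
#define SCOUTFS_BLOCK_SIZE      4096u
#define SCOUTFS_SEGMENT_SIZE    (1024u * 1024u)

int main(void)
{
        /* intended calculation */
        printf("blocks per segment: %u\n",
               SCOUTFS_SEGMENT_SIZE / SCOUTFS_BLOCK_SIZE);      /* 256 */

        /* the typo picks up the unrelated kernel constant instead */
        printf("with BLOCK_SIZE typo: %u\n",
               SCOUTFS_SEGMENT_SIZE / BLOCK_SIZE);              /* 1024 */
        return 0;
}
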
Zach Brown
a201cff5ad Read supers with bios instead of bl blocks
Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:42:30 -07:00
Zach Brown
f7f840a342 Fix bio read completion init
Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:42:30 -07:00
Zach Brown
43d0d44e48 Add initial LSM implementation
Add the initial core components of the LSM implementation to be able to
read the root inode:

 - bio.c: read big block regions
 - seg.c: cache logical segments
 - ring.c: read the manifest from storage
 - manifest.c: organize segments into an LSM
 - kvec.c: work with arbitrary memory vectors
 - item.c: cache fs metadata items read from segments

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:38:50 -07:00
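
A very rough model of the read side these components add up to: a lookup
consults the manifest from the newest data to the oldest and reads the
first covering segment it finds.  This toy ignores details the real code
has to handle, such as overlapping level-0 segments that must be ordered by
sequence:

#include <stdio.h>

struct entry { int level; int first_key, last_key; const char *val; };

/* manifest entries ordered newest-to-oldest */
static const struct entry manifest[] = {
        { 0, 5, 9, "newer" },
        { 1, 1, 9, "older" },
};

static const char *lookup(int key)
{
        /* lower levels hold older data, so stop at the first covering entry */
        for (size_t i = 0; i < sizeof(manifest) / sizeof(manifest[0]); i++) {
                const struct entry *ent = &manifest[i];

                if (key >= ent->first_key && key <= ent->last_key)
                        return ent->val;        /* would read the segment here */
        }
        return "hole";
}

int main(void)
{
        printf("key 7 -> %s\n", lookup(7));     /* "newer" shadows "older" */
        printf("key 2 -> %s\n", lookup(2));
        return 0;
}
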
Zach Brown
c6b688c2bf Add staging ioctl
This adds the ioctl for writing archived file contents back into the
file if the data_version still matches.

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2016-11-16 14:45:08 -08:00
Zach Brown
df561bbd19 Add offline extent flag and release ioctl
Add the _OFFLINE flag to indicate offline extents.  The release ioctl
frees extents within the release range and sets their _OFFLINE flag if
the data_version still matches.

We tweak the existing truncate item function just a bit to support
making extents offline.  We make it take an explicit range of blocks to
remove instead of just giving it the size, and it learns to mark extents
offline and update them instead of always deleting them.

Reads from offline extents return zeros like reading from a sparse
region (later it will trigger demand staging) and writing to offline
extents clears the offline flag (later only staging can do that).

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2016-11-16 14:45:08 -08:00
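
A toy sketch of the extent-flag behaviour described above; the flag and
field names are illustrative rather than the scoutfs on-disk format:

#include <stdio.h>
#include <string.h>

#define EXTENT_FLAG_OFFLINE     (1u << 0)

struct extent {
        unsigned long long blkno;       /* 0 when no block is allocated */
        unsigned int flags;
};

/* reading an offline extent behaves like a hole: the caller sees zeros */
static void read_extent(const struct extent *ext, char *buf, size_t len)
{
        if (!ext->blkno || (ext->flags & EXTENT_FLAG_OFFLINE)) {
                memset(buf, 0, len);
                return;
        }
        /* otherwise a real read of ext->blkno from the device would go here */
}

/* writing (later, only staging) brings the extent back online */
static void write_extent(struct extent *ext, unsigned long long new_blkno)
{
        ext->blkno = new_blkno;
        ext->flags &= ~EXTENT_FLAG_OFFLINE;
}

int main(void)
{
        struct extent ext = { .blkno = 0, .flags = EXTENT_FLAG_OFFLINE };
        char buf[8];

        read_extent(&ext, buf, sizeof(buf));    /* zeros, like a sparse read */
        write_extent(&ext, 1234);               /* clears _OFFLINE */
        printf("flags after write: %u\n", ext.flags);
        return 0;
}
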
Zach Brown
5d87418925 Add ioctl for sampling inode data version
Add an ioctl that samples the inode's data_version.

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2016-11-16 14:45:08 -08:00
Zach Brown
f86fab1162 Add an inode data_version field
The data_version field is changed every time the contents of the file
could have changed.

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2016-11-16 14:45:08 -08:00
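
A minimal model of how the field is meant to be used, with hypothetical
names: any operation that could change contents bumps data_version, and
archive-style operations compare a previously sampled value before acting:

#include <stdio.h>

struct toy_inode {
        unsigned long long size;
        unsigned long long data_version;
};

static void write_data(struct toy_inode *inode, unsigned long long new_size)
{
        inode->size = new_size;
        inode->data_version++;          /* contents may differ now */
}

/* release/stage style check: only act if contents haven't changed since */
static int release_if_unchanged(struct toy_inode *inode,
                                unsigned long long sampled_version)
{
        if (inode->data_version != sampled_version)
                return -1;              /* a real ioctl would return an errno */
        /* safe to free (or restore) the data here */
        return 0;
}

int main(void)
{
        struct toy_inode inode = { .size = 0, .data_version = 1 };
        unsigned long long sampled = inode.data_version;

        write_data(&inode, 4096);
        printf("release ok? %d\n", release_if_unchanged(&inode, sampled));
        return 0;
}
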
Mark Fasheh
467801de73 scoutfs: use extents for file data
We're very basic here at this stage and simply put a single-block extent
item where we would have previously had a multi-block bmap item.
Multi-block extents will come in future patches.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
Signed-off-by: Zach Brown <zab@versity.com>
2016-11-16 14:45:08 -08:00
Zach Brown
f1b29c8372 scoutfs_btree_prev() searches prev block, not next
Oops, scoutfs_btree_prev() asked btree_walk() for the key for the next
block, not the previous block, to search when its walk lands in the
space before all the items in the leaf block.

I saw it when truncate's check_size_eq constraint failed on items
outside the range, which stopped the truncate and left inodes, extents,
and the orphan item around after rm -rf.

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2016-11-16 14:45:08 -08:00
Zach Brown
af5955e95a Add found_seq argument to scoutfs_btree_prev
Add a *found_seq argument to _prev so that it can give the caller the
seq of the item that's returned.  The extent code is going to use this
to find seqs of extents.

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2016-11-16 14:45:08 -08:00
Zach Brown
ae6cc83d01 Raise the nlink limit
A few xfstests tests were failing because they tried to create a decent
number of hard links to a file.

We had a small nlink limit because the inode-paths ioctl copied all the
paths for all the hard links to a userspace buffer which could be
enormous if there was a larger nlink limit.

The hard link backref disk format already has a natural counter that
could be used as a cursor to iterate over all the hard links that point
to a given inode.

This refactors the inode_paths ioctl into a ino_path ioctl that returns
a single path for the given counter and returns the counter for the next
path that links to the inode.  Happily this lets us get rid of all the
weird path component lists and allocations.  Now there's just the kernel
path buffer that gets null terminated path components and the userspace
buffer that those are copied to.

We don't fully relax the nlink limit.  stat(2) returns the link count as
a u32.  We go a step further and limit it to S32_MAX so that apps might
avoid sign bugs.  That still gives us a more generous limit than ext4
and btrfs which are around U16_MAX.

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2016-11-16 14:45:08 -08:00
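
The cursor-based iteration is the interesting part for userspace.  The
sketch below only shows the shape of that loop; the request struct, the
ioctl number, and the exact return convention (here, > 0 while links
remain) are assumptions, with the real definitions living in the scoutfs
ioctl header:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <unistd.h>

struct ino_path_req {
        uint64_t ino;
        uint64_t cursor;                /* 0 to start; advanced by the kernel */
        char path[4096];
};

#define HYPOTHETICAL_INO_PATH   _IOWR('s', 1, struct ino_path_req)

int main(int argc, char **argv)
{
        struct ino_path_req req;
        struct stat st;
        int fd;

        if (argc != 2 || (fd = open(argv[1], O_RDONLY)) < 0 || fstat(fd, &st))
                return 1;

        memset(&req, 0, sizeof(req));
        req.ino = st.st_ino;

        /* one path per call instead of copying every path into one huge buffer */
        while (ioctl(fd, HYPOTHETICAL_INO_PATH, &req) > 0)
                printf("%s\n", req.path);

        close(fd);
        return 0;
}
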
Zach Brown
37bc86b558 Add check_size_lte
Add a _lte val boolean so that -EOVERFLOW is returned if the item's
value is larger than the value vector.

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2016-11-16 14:45:08 -08:00
Zach Brown
243a36e405 Add fsync file operation method
Add a scoutfs_file_fsync() which synchronously commits the current
transaction and call it to fsync files and directories.  This fixes a
number of generic xfstests in the quick group which were failing because
fsync wasn't supported.

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2016-11-16 14:45:07 -08:00
Zach Brown
1d0cd95b55 Let the commit task hold transactions
The commit work sets trans_holds so that all hold attempts block while
it's doing its work.  Now that it's calling in to generic vfs functions
to write out dirty file data it can end up in generic write functions
that try to hold the trans and can deadlock.

This adds tracking of the commit task so that holds know to let it
proceed without deadlocking.

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2016-11-16 14:45:07 -08:00
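
A userspace analogue of the idea, using pthreads rather than kernel tasks:
the commit path records which thread it is, and the hold path lets that
thread through while everyone else waits:

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int trans_holds;                 /* blocks new holds while committing */
static pthread_t commit_task;
static int commit_task_set;

static void hold_trans(void)
{
        pthread_mutex_lock(&lock);
        /* the commit task itself is allowed through; everyone else waits */
        while (trans_holds &&
               !(commit_task_set && pthread_equal(commit_task, pthread_self())))
                pthread_cond_wait(&cond, &lock);
        /* ...account the hold here... */
        pthread_mutex_unlock(&lock);
}

static void commit_trans(void)
{
        pthread_mutex_lock(&lock);
        trans_holds = 1;
        commit_task = pthread_self();
        commit_task_set = 1;
        pthread_mutex_unlock(&lock);

        /* writing dirty file data can call back into hold_trans() safely now */
        hold_trans();

        pthread_mutex_lock(&lock);
        trans_holds = 0;
        commit_task_set = 0;
        pthread_cond_broadcast(&cond);
        pthread_mutex_unlock(&lock);
}

int main(void)
{
        commit_trans();
        printf("commit finished without deadlocking on itself\n");
        return 0;
}
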
Zach Brown
256166db32 Fix write trans/page lock inversions
scoutfs_write_begin() was riddled with lock ordering and cleanup bugs:

 - blocked holding the trans with the page lock held
 - dirtied the inode with the page lock held
 - didn't release the trans on error
 - tried to double unlock and release pages on readpage error

We fix all this up by reordering things so we hold the trans, dirty the
inode, then work the pages, all while cleaning up more carefully.

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2016-11-16 14:45:07 -08:00
Zach Brown
f54f61f064 Always initialize btree lockdep class
We have to set the lock class to the btree level to keep lockdep from
building long dependency chains.  We initialized allocated blocks for
tree growth but not for splitting.  We fix this by moving the init up
into allocation instead of in tree growth.  Now the class is set on
blocks from all the places we get them from the block calls.

This silences a lockdep warning during merge during rm -rf which is the
first place where multiple blocks in a level are locked.

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2016-11-16 14:45:07 -08:00
Zach Brown
6fd5396fbe Add block cache shrinker
Now that we have our own allocated block cache struct we need to add a
shrinker so that it's reclaimed under memory pressure.  We keep clean
blocks in a simple lru list that the shrinker walks to free the oldest
blocks.

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2016-11-16 14:45:07 -08:00
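
The clean-block LRU is simple enough to sketch in plain C; a real shrinker
would be driven by the kernel's count/scan callbacks rather than a direct
call, and would also have to skip blocks that are dirty or in use:

#include <stdio.h>
#include <stdlib.h>

struct block {
        unsigned long blkno;
        struct block *prev, *next;      /* position on the LRU list */
};

static struct block *lru_head;          /* most recently used */
static struct block *lru_tail;          /* oldest, freed first */
static unsigned long nr_cached;

static void lru_add_mru(struct block *bl)
{
        bl->prev = NULL;
        bl->next = lru_head;
        if (lru_head)
                lru_head->prev = bl;
        lru_head = bl;
        if (!lru_tail)
                lru_tail = bl;
        nr_cached++;
}

/* shrinker scan: free up to nr_to_scan of the oldest clean blocks */
static unsigned long shrink_blocks(unsigned long nr_to_scan)
{
        unsigned long freed = 0;

        while (nr_to_scan-- && lru_tail) {
                struct block *bl = lru_tail;

                lru_tail = bl->prev;
                if (lru_tail)
                        lru_tail->next = NULL;
                else
                        lru_head = NULL;
                free(bl);
                nr_cached--;
                freed++;
        }
        return freed;
}

int main(void)
{
        for (unsigned long i = 0; i < 8; i++) {
                struct block *bl = calloc(1, sizeof(*bl));

                if (!bl)
                        break;
                bl->blkno = i;
                lru_add_mru(bl);
        }
        printf("freed %lu, %lu left\n", shrink_blocks(3), nr_cached);
        return 0;
}
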
Zach Brown
b612438abc Buddy forgot to put blocks in a few places
The buddy code missed putting the block in a few error cases.

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2016-11-16 14:45:07 -08:00
Zach Brown
d71f7a24ec Don't check meta seq before locking
The block lock functions were trying to compare the block header seq and
the super seq to decide if the block is stable and if it should lock, or
not.  Readers trying to lock race with transaction commits.
Transaction commit can update the super after the reader locks and
before it unlocks.  The unlock will then fail the test and fail to
unlock.  fsstress triggered this in xfstests generic/013.

Instead we can always acquire the read lock on stable blocks.  We'll be
bouncing the rwsem cacheline around like the refcount cacheline.  If
this is a problem we can carefully maintain bits in the block to safely
indicate if it should be locked or unlocked but let's not go there if we
don't have to.

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2016-11-16 14:45:07 -08:00
Zach Brown
0c67dd51ef Forget freed btree blocks
When the btree stops referencing a block and frees it we can also forget
it so that it isn't uselessly written to disk.

Callers who forget are careful to only unlock and release the block ref
after freeing it.  They won't be confused if something allocates the
block and starts using it.

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2016-11-16 14:45:07 -08:00
Zach Brown
d4571b6db3 Add scoutfs_block_forget()
Add scoutfs_block_forget() which ensures that a block won't satisfy
future lookups and will not be written out.

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2016-11-16 14:45:07 -08:00
Zach Brown
f57c07381a Go back to having our own scoutfs_block cache
We used to have 16k blocks in our own radix_tree cache.  When we
introduced the simple file block mapping code it preferred to have block
size == page size.  That let us remove a bunch of code and reuse all the
kernel's buffer head code.

But it turns out that the buffer heads are just a bit too inflexible.

We'd like to have blocks larger than page size, obviously, but it turns
out there are real functional differences.

Resolving the problem of unlocked readers and allocating writers working
with the same blkno is the most powerful example of this.  It's trivial
to fix by always inserting newly allocated blocks into the cache. But
solving it with buffer heads requires expensive and risky locking around
the buffer head cache which can only support a single physical instance
of a given blkno because there can be multiple blocks per page.

So this restores the simple block cache that was removed back in commit
'c8e76e2 scoutfs: use buffer heads'.  There's still work to do to get
this fully functional but it's worth it.

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2016-11-16 14:45:07 -08:00
Zach Brown
4042927519 Make btree nr_items le16
If we increase the block size the btree is going to need to be able to
store more than 255 items in a btree block.

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2016-11-16 14:45:07 -08:00
Zach Brown
03787f23d3 Add scoutfs_block_data_from_contents()
The btree code needs to get a pointer to a whole block from just
pointers to elements that it's sorting.  It had some manual code that
assumed details of the blocks.  Let's give it a real block interface to
do what it wants and make it the block API's problem to figure out how
to do it.

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2016-11-16 14:44:54 -08:00
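
One plausible way such an interface can work, assuming block memory is
allocated aligned to the block size (the real implementation may recover
the block differently): mask the low bits off the interior pointer's
address.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define BLOCK_SZ        4096u           /* illustrative block size */

/* land on the block's first byte from a pointer to an element inside it */
static void *block_from_contents(void *ptr)
{
        return (void *)((uintptr_t)ptr & ~((uintptr_t)BLOCK_SZ - 1));
}

int main(void)
{
        void *block;
        char *item;

        if (posix_memalign(&block, BLOCK_SZ, BLOCK_SZ))
                return 1;

        item = (char *)block + 100;     /* some element the sort is looking at */
        printf("recovered the block? %s\n",
               block_from_contents(item) == block ? "yes" : "no");

        free(block);
        return 0;
}
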
Zach Brown
3d66a4b3dd Block API offers scoutfs_block instead of bh
Our thin block wrappers exposed buffer heads to all the callers.  We're
about to revert back to the block interface that uses its own
scoutfs_block struct instead of buffer heads.  Let's reduce the churn of
that patch by first having the block API give callers an opaque struct
scoutfs_block.  Internally it's still buffer heads but the callers don't
know that.

scoutfs_write_dirty_super() is the exception that has magical knowledge
of buffer heads.  That's fixed once the new block API offers a function
for writing a single block.

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2016-11-16 14:44:40 -08:00
Zach Brown
c65b70f2aa Use full radix for buddy and record first set
The first pass of the buddy allocator had a fixed indirect block so it
couldn't address large devices.  It didn't index set bits or slots for
each order so we spent a lot of cpu searching for free space.  And it
didn't precisely account for stable free space so it could spend a lot
of cpu time discovering that free space can't be used because it wasn't
stable.

This fixes these initial critical flaws in the buddy allocator.  Before
it could only address a few hundred megs and now it can address 2^64
blocks.  Before it limited bulk inode creation searching for slots and
leaf bits and now other components are much higher in the profiles with
greater create rates.

First we remove the special case single indirect block.  The root now
references a block that can be at any height.  The root records the
height and each block records its level.  We descend until we hit the
leaf.  We add a stack of the blocks traversed so that we can ascend and
fix up parent indexing after we modify a leaf.

Now that we can have quite a lot of parent indirect blocks we can no
longer have a static bitmap for allocating buddy blocks.  We instead
precisely preallocate two blocks for every buddy block that will be used
to address all the device blocks.  The blkno offset of these pairs of
buddy blocks can be calculated for a given position in the tree.
Allocating a blkno xors the low bit of the blkno and freeing is a nop.
This happily gets rid of the specific allocation of buddy blocks with
its regions and worrying about stable free blocks itself.

Then we record the first set index in a block for each order.  In parent
blocks this tells you the slot you can traverse to find a free region of
that order.  In leaf blocks it tells you the specific block offset of
the first free extent.  This is kept up to date as we set and clear
buddy bits in leaves and free_order bits in parent slots.  Allocation
now is a simple matter of block reads and array dereferencing.

And we now precisely account for frees that should not satisfy
allocation until after a transaction commit.  We record frees of stable
data in extent nodes in an rbtree after their buddy blocks have been
dirtied.  Because their blocks are dirtied we can free them as the
transaction commits without errors.  Similarly, we can also revert them
if the transaction commit fails so that they don't satisfy allocation.
This prevents us from having to hang or go read-only if a transaction
commit fails.

The two changes visible to callers are easy argument changes:
scoutfs_buddy_free() now takes a seq to specify when the allocation was
first allocated, and scoutfs_buddy_alloc_same() has its arguments match
that it only makes sense for single block allocations.

Unfortunately all these changes are interrelated so the resulting patch
amounts to a rewrite.  The core buddy bitmap helper functions and loops
are the same but the surrounding block container code changes
significantly.

Signed-off-by: Zach Brown <zab@versity.com>
2016-11-08 16:05:37 -08:00
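
The paired buddy-block trick is worth a tiny illustration: each buddy block
owns two adjacent blknos, the dirty copy goes to the other member of the
pair by xoring the low bit, and freeing the old copy is a no-op because
that blkno simply comes back the next time the bit flips.  A toy reading:

#include <stdio.h>

/* the other block of the pair becomes the new dirty copy */
static unsigned long long alloc_buddy_blkno(unsigned long long cur_blkno)
{
        return cur_blkno ^ 1ULL;
}

int main(void)
{
        unsigned long long blkno = 100; /* the pair occupies blknos 100 and 101 */

        for (int i = 0; i < 4; i++) {
                blkno = alloc_buddy_blkno(blkno);
                printf("dirty copy written to blkno %llu\n", blkno);
        }
        return 0;
}
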
Zach Brown
17ec4a1480 Add seq field to block map item
The file block mapping code needs to know if an existing block mapping
is dirty in the current transaction or not.  It was doing that by
calling in to the allocator.

Instead of calling in to the allocator we can store the seq of the
block in the mapping item.  We also probably want to know the seq of
data blocks to make it possible to discover regions of files that have
changed since a previous seq.

This does increase the size of the block mapping item but they're not
long for this world.  We're going to replace them with proper extent
items in the near future.

Signed-off-by: Zach Brown <zab@versity.com>
2016-11-08 16:05:37 -08:00
Zach Brown
c8d1703196 Add blkno and level to bad btree printk
Add the blkno and level to the output for a btree block that fails
verification.

Signed-off-by: Zach Brown <zab@versity.com>
2016-11-08 16:05:37 -08:00
Zach Brown
44588c1d8b Lock btree merges
The btree locking wasn't covering the merge candidate block before the
siblings were locked.  In that unlocked code it could compact the block,
corrupting it for whatever other tree walk might only have the merge
candidate locked after having unlocked the parent.

This extends locking coverage to merge and split attempts by acquiring
the block lock immediately after we read it.  Split doesn't have to lock
its destination block but it does have to know to unlock the block on
errors.  Merge has to more carefully lock both of its existing blocks in
a consistent order.

To clearly implement this we simplify the locking helpers to just unlock
and lock a given block, falling back to the btree rwsem if there isn't a
block.

I started down this road while chasing allocator bugs that manifested as
tree corruption.

Signed-off-by: Zach Brown <zab@versity.com>
2016-11-08 16:05:37 -08:00
Zach Brown
a77f88386c Add scoutfs_btree_prev()
We haven't yet had a pressing need for a search for the previous item
before a given key value.  File extent items offer the first strong
candidate.  We'd like to naturally store the start of the extent in the
key, so to find an extent that overlaps a block we need to find the
previous key before the searched block offset.

The _prev search is easy enough to implement.  We have to update tree
walking to update the prev key and update leaf block processing to find
the correct item position after the binary search.

Signed-off-by: Zach Brown <zab@versity.com>
2016-11-08 16:05:37 -08:00
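
The leaf-level half of the change is a classic predecessor search: after
the binary search, return the position of the last key <= the search key,
or signal that the walk has to back up to the previous leaf.  A standalone
sketch:

#include <stdio.h>

/* return the index of the last key <= key, or -1 if every key is greater */
static int find_prev_pos(const int *keys, int nr, int key)
{
        int lo = 0, hi = nr - 1, pos = -1;

        while (lo <= hi) {
                int mid = lo + (hi - lo) / 2;

                if (keys[mid] <= key) {
                        pos = mid;      /* candidate; look for a later one */
                        lo = mid + 1;
                } else {
                        hi = mid - 1;
                }
        }
        return pos;
}

int main(void)
{
        int keys[] = { 10, 20, 30, 40 };

        printf("%d %d %d\n",
               find_prev_pos(keys, 4, 25),      /* 1: key 20 */
               find_prev_pos(keys, 4, 40),      /* 3: exact match counts */
               find_prev_pos(keys, 4, 5));      /* -1: back up to prev leaf */
        return 0;
}
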
Zach Brown
f32365321d Remove unused btree internal WALK_NEXT
In the past the WALK_NEXT enum was used to tell the walking core that
the caller was iterating and that they'd need to advance to sibling
blocks if their key landed off the end of a leaf.  In the current code
that's now handled by giving the walk caller a next_key which will
continue the search from the next leaf.  WALK_NEXT is unused and we can
remove it.

Signed-off-by: Zach Brown <zab@versity.com>
2016-11-08 16:05:37 -08:00
Zach Brown
1cbd84eece scoutfs: wire up sop->dirty_inode
We're using the generic block buffer_head write_begin and write_end
functions.  They call sop->dirty_inode() to update the inode i_size.  We
didn't have that method wired up so updates to the inode in the write
path wasn't dirtying the inode item.  Lost i_size updates would
trivially lose data but we first noticed this when looking at inode item
sequence numbers while overwriting.

Signed-off-by: Zach Brown <zab@versity.com>
2016-11-08 16:05:36 -08:00
Zach Brown
165d833c46 Walk stable trees in _since ioctls
The _since ioctls walk btrees and return items that are newer than a
given sequence number.  The intended behaviour is that items will
appear with a greater sequence number if they change after appearing
in the queries.  This promise doesn't hold for items that are being
modified in the current transaction.  The caller would have to always
ask for seq X + 1 after seeing seq X to make sure it got all the changes
that happened in seq X while it was the current dirty transaction.

This is fixed by having the interfaces walk the stable btrees from the
previous transaction.  The results will always be a little stale but
userspace already has to deal with stale results because it can't lock
out change, and it can use sync (and a commit age tunable we'll add) to
limit how stale the results can be.

Signed-off-by: Zach Brown <zab@versity.com>
2016-11-08 16:05:36 -08:00
Mark Fasheh
2fc1b99698 scoutfs: replace some open coded corruption checks
We can trivially do the simple check of value length against what the caller
expects in btree.c.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
Signed-off-by: Zach Brown <zab@versity.com>
2016-10-27 17:25:05 -05:00
Mark Fasheh
ebbb2e842e scoutfs: implement inode orphaning
This is pretty straightforward - we define a new item type,
SCOUTFS_ORPHAN_KEY. We don't need to store any value with this; the inode
and type fields are enough for us to find which inode has been orphaned.

Otherwise this works as one would expect. Unlink sets the item, and
->evict_inode removes it. On mount, we scan for orphan items and remove any
corresponding inodes.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
Signed-off-by: Zach Brown <zab@versity.com>
2016-10-24 16:41:45 -05:00
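
A toy model of the orphan flow, with arrays standing in for the item store:
unlink records an orphan item keyed only by type and inode number, normal
eviction clears it, and the mount-time scan removes inodes whose orphan
items survived a crash:

#include <stdbool.h>
#include <stdio.h>

#define MAX_INOS        16

static bool orphan_item[MAX_INOS];      /* "SCOUTFS_ORPHAN_KEY" present? */
static bool inode_exists[MAX_INOS];

static void unlink_last_ref(unsigned ino)
{
        orphan_item[ino] = true;        /* unlink sets the orphan item */
}

static void evict_inode(unsigned ino)
{
        inode_exists[ino] = false;
        orphan_item[ino] = false;       /* normal eviction clears it again */
}

/* mount-time scan: remove inodes whose orphan items survived a crash */
static void scan_orphans(void)
{
        for (unsigned ino = 0; ino < MAX_INOS; ino++) {
                if (orphan_item[ino] && inode_exists[ino]) {
                        printf("removing orphaned inode %u\n", ino);
                        evict_inode(ino);
                }
        }
}

int main(void)
{
        inode_exists[3] = true;
        unlink_last_ref(3);             /* crash before evict_inode() ran */
        scan_orphans();
        return 0;
}
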