2141 Commits

Author SHA1 Message Date
Zach Brown
37bc86b558 Add check_size_lte
Add an _lte val boolean so that -EOVERFLOW is returned if the item's
value is larger than the value vector.

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2016-11-16 14:45:08 -08:00
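
A rough sketch of the kind of check described above; the argument names
are illustrative, not taken from the patch:

    /* hypothetical sketch of the _lte case only: an item value larger
     * than the caller's value vector is an overflow */
    static int check_size_lte(unsigned int item_len, unsigned int val_len)
    {
            if (item_len > val_len)
                    return -EOVERFLOW;
            return 0;
    }
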
Zach Brown
243a36e405 Add fsync file operation method
Add a scoutfs_file_fsync() which synchronously commits the current
transaction and call it to fsync files and directories.  This fixes a
number of generic xfstests in the quick group which were failing because
fsync wasn't supported.

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2016-11-16 14:45:07 -08:00
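
A hedged sketch of the wiring; the .fsync signature below matches kernels
of this era, while the helper call and the fops struct name are
assumptions:

    /* sketch: make the file durable by committing the current transaction */
    static int scoutfs_file_fsync(struct file *file, loff_t start, loff_t end,
                                  int datasync)
    {
            /* hypothetical helper that synchronously commits the dirty
             * transaction for this super block */
            return scoutfs_sync_super(file_inode(file)->i_sb);
    }

    const struct file_operations scoutfs_file_fops = {
            /* ... */
            .fsync = scoutfs_file_fsync,
    };
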
Zach Brown
1d0cd95b55 Let the commit task hold transactions
The commit work sets trans_holds so that all hold attempts block while
it's doing its work.  Now that it calls into generic vfs functions
to write out dirty file data, it can end up in generic write functions
that try to hold the trans and deadlock.

This adds tracking of the commit task so that holds know to let it
proceed without deadlocking.

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2016-11-16 14:45:07 -08:00
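
One plausible shape for the tracking described above; the struct and
field names are invented for illustration:

    /* sketch: the committing task records itself so that its own hold
     * attempts are allowed through instead of deadlocking */
    static bool hold_should_block(struct scoutfs_sb_info *sbi)
    {
            if (sbi->trans_commit_task == current)
                    return false;

            return sbi->trans_holds != 0;
    }
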
Zach Brown
256166db32 Fix write trans/page lock inversions
scoutfs_write_begin() was riddled with lock ordering and cleanup bugs:

 - blocked holding the trans with the page lock held
 - dirtied the inode with the page lock held
 - didn't release the trans on error
 - tried to double unlock and release pages on readpage error

We fix all this up by reordering things so that we hold the trans, dirty
the inode, and then work on the pages, all while cleaning up more
carefully on errors.

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2016-11-16 14:45:07 -08:00
Zach Brown
f54f61f064 Always initialize btree lockdep class
We have to set the lock class to the btree level to keep lockdep from
building long dependency chains.  We initialized allocated blocks for
tree growth but not for splitting.  We fix this by moving the init up
into allocation instead of doing it in tree growth.  Now every place we
get blocks from the block calls has the class set.

This silences a lockdep warning during a merge triggered by rm -rf,
which is the first place where multiple blocks in a level are locked.

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2016-11-16 14:45:07 -08:00
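
A sketch of per-level lockdep classes being assigned where blocks are
handed out; the array length and helper names are assumptions, but
lockdep_set_class() is the standard kernel idiom:

    static struct lock_class_key btree_block_class_keys[MAX_BTREE_HEIGHT];

    /* sketch: called for every block returned by the block allocation
     * calls, whether for tree growth or for splitting */
    static void set_block_lock_class(struct rw_semaphore *rwsem, int level)
    {
            lockdep_set_class(rwsem, &btree_block_class_keys[level]);
    }
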
Zach Brown
6fd5396fbe Add block cache shrinker
Now that we have our own allocated block cache struct we need to add a
shrinker so that it's reclaimed under memory pressure.  We keep clean
blocks in a simple lru list that the shrinker walks to free the oldest
blocks.

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2016-11-16 14:45:07 -08:00
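
A minimal sketch of the count/scan shrinker interface from kernels of
this era; the cache layout, the block struct, and the put helper are
assumptions, and removing a freed block from the cache index is glossed
over:

    struct block_cache {
            spinlock_t lock;
            struct list_head lru;           /* clean blocks, oldest at the tail */
            unsigned long lru_count;
            struct shrinker shrinker;
    };

    static unsigned long block_count_objects(struct shrinker *shrink,
                                             struct shrink_control *sc)
    {
            struct block_cache *cache = container_of(shrink, struct block_cache,
                                                     shrinker);

            return cache->lru_count;        /* clean blocks on the lru */
    }

    static unsigned long block_scan_objects(struct shrinker *shrink,
                                            struct shrink_control *sc)
    {
            struct block_cache *cache = container_of(shrink, struct block_cache,
                                                     shrinker);
            struct scoutfs_block *bl;
            unsigned long freed = 0;

            spin_lock(&cache->lock);
            while (freed < sc->nr_to_scan && !list_empty(&cache->lru)) {
                    /* the oldest clean blocks live at the tail of the lru */
                    bl = list_last_entry(&cache->lru, struct scoutfs_block,
                                         lru_entry);
                    list_del_init(&bl->lru_entry);
                    cache->lru_count--;
                    scoutfs_block_put(bl);  /* drop the cache's reference */
                    freed++;
            }
            spin_unlock(&cache->lock);

            return freed;
    }

    static struct shrinker block_shrinker = {
            .count_objects  = block_count_objects,
            .scan_objects   = block_scan_objects,
            .seeks          = DEFAULT_SEEKS,
    };
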
Zach Brown
b612438abc Buddy forgot to put blocks in a few places
The buddy code missed putting the block in a few error cases.

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2016-11-16 14:45:07 -08:00
Zach Brown
d71f7a24ec Don't check meta seq before locking
The block lock functions were trying to compare the block header seq and
the super seq to decide whether the block is stable and whether it should
be locked or not.  A reader trying to lock races with transaction
commits: a commit can update the super after the reader locks and before
it unlocks.  The unlock will then fail the test and fail to unlock.
fsstress triggered this in xfstests generic/013.

Instead we can always acquire the read lock on stable blocks.  We'll be
bouncing the rwsem cacheline around like the refcount cacheline.  If
this is a problem we can carefully maintain bits in the block to safely
indicate if it should be locked or unlocked but let's not go there if we
don't have to.

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2016-11-16 14:45:07 -08:00
Zach Brown
0c67dd51ef Forget freed btree blocks
When the btree stops referencing a block and frees it we can also forget
it so that it isn't uselessly written to disk.

Callers who forget are careful to only unlock and release the block ref
after freeing it.  They won't be confused if something allocates the
block and starts using it.

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2016-11-16 14:45:07 -08:00
Zach Brown
d4571b6db3 Add scoutfs_block_forget()
Add scoutfs_block_forget() which ensures that a block won't satisfy
future lookups and will not be written out.

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2016-11-16 14:45:07 -08:00
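
A hedged sketch of what "forget" amounts to; the two helpers called here
are invented names standing in for whatever the cache actually does:

    /* sketch: ensure a block misses future lookups and is never written */
    void scoutfs_block_forget(struct scoutfs_block *bl)
    {
            block_cache_remove(bl);         /* future lookups won't find it */
            block_clear_dirty(bl);          /* writeback will skip it */
            /* callers still hold their reference until they put it */
    }
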
Zach Brown
f57c07381a Go back to having our own scoutfs_block cache
We used to have 16k blocks in our own radix_tree cache.  When we
introduced the simple file block mapping code it preferred to have block
size == page size.  That let us remove a bunch of code and reuse all the
kernel's buffer head code.

But it turns out that the buffer heads are just a bit too inflexible.

We'd like to have blocks larger than page size, obviously, but it turns
out there are real functional differences.

Resolving the problem of unlocked readers and allocating writers working
with the same blkno is the most powerful example of this.  It's trivial
to fix by always inserting newly allocated blocks into the cache.  But
solving it with buffer heads requires expensive and risky locking around
the buffer head cache, which can only support a single physical instance
of a given blkno because there can be multiple blocks per page.

So this restores the simple block cache that was removed back in commit
'c8e76e2 scoutfs: use buffer heads'.  There's still work to do to get
this fully functional but it's worth it.

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2016-11-16 14:45:07 -08:00
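
To make the shape of this concrete, a cached block entry might look
roughly like the following; every field here is an assumption rather
than the actual layout:

    /* sketch: a privately cached metadata block, replacing buffer heads */
    struct scoutfs_block {
            u64 blkno;
            atomic_t refcount;
            unsigned long bits;             /* uptodate, dirty, error, ... */
            struct rw_semaphore rwsem;      /* lock for writable blocks */
            struct page *page;              /* backing pages for the data */
            void *data;
            struct list_head lru_entry;     /* clean blocks for the shrinker */
    };

Presumably the key flexibility is that an allocating writer can insert a
brand new struct for a blkno while an unlocked reader still holds a
reference to an older one, which a single-instance buffer head cache
can't allow.
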
Zach Brown
4042927519 Make btree nr_items le16
If we increase the block size the btree is going to need to be able to
store more than 255 items in a btree block.

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2016-11-16 14:45:07 -08:00
Zach Brown
03787f23d3 Add scoutfs_block_data_from_contents()
The btree code needs to get a pointer to a whole block from just
pointers to elements that it's sorting.  It had some manual code that
assumed details of the blocks.  Let's give it a real block interface to
do what it wants and make it the block API's problem to figure out how
to do it.

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2016-11-16 14:44:54 -08:00
Zach Brown
3d66a4b3dd Block API offers scoutfs_block instead of bh
Our thin block wrappers exposed buffer heads to all the callers.  We're
about to revert back to the block interface that uses its own
scoutfs_block struct instead of buffer heads.  Let's reduce the churn of
that patch by first having the block API give callers an opaque struct
scoutfs_block.  Internally it's still buffer heads but the callers don't
know that.

scoutfs_write_dirty_super() is the exception, which still has magical
knowledge of buffer heads.  That will be fixed once the new block API
offers a function for writing a single block.

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2016-11-16 14:44:40 -08:00
Zach Brown
932b0776d1 Add commands for working with offline data
Add the data_version, stage, and release commands for working with
offline extents of file data.

Signed-off-by: Zach Brown <zab@versity.com>
2016-11-16 14:40:09 -08:00
Zach Brown
22140c93d1 Print extents instead of bmap items
Print the extent items now that we're not using bmap items any more.

Signed-off-by: Zach Brown <zab@versity.com>
2016-11-16 14:40:04 -08:00
Zach Brown
c2cfb0227f Print the new inode data_version field
We've added a data_version field to the inode for tracking changes to
the file's data.

Signed-off-by: Zach Brown <zab@versity.com>
2016-11-16 14:39:39 -08:00
Zach Brown
f1d8955303 Support the ino_path ioctl
We updated the inode_paths ioctl to return one path and use a counter
cursor and renamed it to ino_path.

Signed-off-by: Zach Brown <zab@versity.com>
2016-11-16 14:39:00 -08:00
Zach Brown
fb16af7b7d btree nr_items is now a le16
The btree block now has a le16 nr_items field to make room for the
number of items that larger blocks can hold.

Signed-off-by: Zach Brown <zab@versity.com>
2016-11-16 14:38:04 -08:00
Zach Brown
c65b70f2aa Use full radix for buddy and record first set
The first pass of the buddy allocator had a fixed indirect block so it
couldn't address large devices.  It didn't index set bits or slots for
each order so we spent a lot of cpu searching for free space.  And it
didn't precisely account for stable free space so it could spend a lot
of cpu time discovering that free space can't be used because it wasn't
stable.

This fixes these initial critical flaws in the buddy allocator.  Before,
it could only address a few hundred megs; now it can address 2^64
blocks.  Before, searching for slots and leaf bits limited bulk inode
creation; now other components are much higher in the profiles and the
create rates are greater.

First we remove the special case single indirect block.  The root now
references a block that can be at any height.  The root records the
height and each block records its level.  We descend until we hit the
leaf.  We add a stack of the blocks traversed so that we can ascend and
fix up parent indexing after we modify a leaf.

Now that we can have quite a lot of parent indirect blocks we can no
longer have a static bitmap for allocating buddy blocks.  We instead
precisely preallocate two blocks for every buddy block that will be used
to address all the device blocks.  The blkno offset of these pairs of
buddy blocks can be calculated for a given position in the tree.
Allocating a blkno xors the low bit of the blkno and freeing is a nop.
This happily gets rid of the special-case allocation of buddy blocks,
with its regions and its own worrying about stable free blocks.

Then we record the first set index in a block for each order.  In parent
blocks this tells you the slot you can traverse to find a free region of
that order.  In leaf blocks it tells you the specific block offset of
the first free extent.  This is kept up to date as we set and clear
buddy bits in leaves and free_order bits in parent slots.  Allocation
now is a simple matter of block reads and array dereferencing.

And we now precisely account for frees that should not satisfy
allocation until after a transaction commit.  We record frees of stable
data in extent nodes in an rbtree after their buddy blocks have been
dirtied.  Because their blocks are dirtied we can free them as the
transaction commits without errors.  Similarly, we can also revert them
if the transaction commit fails so that they don't satisfy allocation.
This prevents us from having to hang or go read-only if a transaction
commit fails.

The two changes visible to callers are easy argument changes:
scoutfs_buddy_free() now takes a seq to specify when the block was
first allocated, and scoutfs_buddy_alloc_same() has its arguments changed
to reflect that it only makes sense for single block allocations.

Unfortunately all these changes are interrelated so the resulting patch
amounts to a rewrite.  The core buddy bitmap helper functions and loops
are the same but the surrounding block container code changes
significantly.

Signed-off-by: Zach Brown <zab@versity.com>
2016-11-08 16:05:37 -08:00
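
The pairing trick for the buddy metadata blocks can be shown as a tiny
helper; this only sketches the arithmetic described above:

    /*
     * sketch: every position in the radix has two preallocated buddy
     * blocks.  "Allocating" just flips the low bit of the previously
     * written blkno so the dirty copy never overwrites the stable copy,
     * and "freeing" the old copy is a nop because the pair is
     * permanently reserved for this position.
     */
    static u64 buddy_block_alloc(u64 prev_blkno)
    {
            return prev_blkno ^ 1ULL;
    }
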
Zach Brown
17ec4a1480 Add seq field to block map item
The file block mapping code needs to know if an existing block mapping
is dirty in the current transaction or not.  It was doing that by
calling in to the allocator.

Instead of calling in to the allocator we can instead store the seq of
the block in the mapping item.  We also probably want to know the seq of
data blocks to make it possible to discover regions of files that have
changed since a previous seq.

This does increase the size of the block mapping item but they're not
long for this world.  We're going to replace them with proper extent
items in the near future.

Signed-off-by: Zach Brown <zab@versity.com>
2016-11-08 16:05:37 -08:00
Zach Brown
c8d1703196 Add blkno and level to bad btree printk
Add the blkno and level to the output for a btree block that fails
verification.

Signed-off-by: Zach Brown <zab@versity.com>
2016-11-08 16:05:37 -08:00
Zach Brown
44588c1d8b Lock btree merges
The btree locking wasn't covering the merge candidate block before the
siblings were locked.  In that unlocked window it could compact the
block, corrupting it for whatever other tree walk might have only the
merge candidate locked after having unlocked the parent.

This extends locking coverage to merge and split attempts by acquiring
the block lock immediately after we read it.  Split doesn't have to lock
its destination block but it does have to know to unlock the block on
errors.  Merge has to more carefully lock both of its existing blocks in
a consistent order.

To clearly implement this we simplify the locking helpers to just unlock
and lock a given block, falling back to the btree rwsem if there isn't a
block.

I started down this road while chasing allocator bugs that manifested as
tree corruption.

Signed-off-by: Zach Brown <zab@versity.com>
2016-11-08 16:05:37 -08:00
Zach Brown
a77f88386c Add scoutfs_btree_prev()
We haven't yet had a pressing need for a search for the previous item
before a given key value.  File extent items offer the first strong
candidate.  We'd like to naturally store the start of the extent
in the key, so to find an extent that overlaps a block we want to
find the previous key before the search block offset.

The _prev search is easy enough to implement.  We have to update tree
walking to update the prev key and update leaf block processing to find
the correct item position after the binary search.

Signed-off-by: Zach Brown <zab@versity.com>
2016-11-08 16:05:37 -08:00
Zach Brown
f32365321d Remove unused btree internal WALK_NEXT
In the past the WALK_NEXT enum was used to tell the walking core that
the caller was iterating and that they'd need to advance to sibling
blocks if their key landed off the end of a leaf.  In the current code
that's now handled by giving the walk caller a next_key which will
continue the search from the next leaf.  WALK_NEXT is unused and we can
remove it.

Signed-off-by: Zach Brown <zab@versity.com>
2016-11-08 16:05:37 -08:00
Zach Brown
1cbd84eece scoutfs: wire up sop->dirty_inode
We're using the generic block buffer_head write_begin and write_end
functions.  They call sop->dirty_inode() to update the inode i_size.  We
didn't have that method wired up so updates to the inode in the write
path weren't dirtying the inode item.  Lost i_size updates would
trivially lose data but we first noticed this when looking at inode item
sequence numbers while overwriting.

Signed-off-by: Zach Brown <zab@versity.com>
2016-11-08 16:05:36 -08:00
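
The wiring itself is small; a sketch with the scoutfs-specific work
reduced to a hypothetical helper call:

    /* sketch: dirty the inode's item so i_size updates made by the
     * generic write path are recorded in the current transaction */
    static void scoutfs_dirty_inode(struct inode *inode, int flags)
    {
            scoutfs_update_inode_item(inode);       /* hypothetical helper */
    }

    static const struct super_operations scoutfs_super_ops = {
            /* ... */
            .dirty_inode    = scoutfs_dirty_inode,
    };
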
Zach Brown
165d833c46 Walk stable trees in _since ioctls
The _since ioctls walk btrees and return items that are newer than a
given sequence number.  The intended behaviour is that items will
appear in a greater sequence number if they change after appearing
in the queries.  This promise doesn't hold for items that are being
modified in the current transaction.  The caller would have to always
ask for seq X + 1 after seeing seq X to make sure it got all the changes
that happened in seq X while it was the current dirty transaction.

This is fixed by having the interfaces walk the stable btrees from the
previous transaction.  The results will always be a little stale but
userspace already has to deal with stale results because it can't lock
out change, and it can use sync (and a commit age tunable we'll add) to
limit how stale the results can be.

Signed-off-by: Zach Brown <zab@versity.com>
2016-11-08 16:05:36 -08:00
Zach Brown
cd0d045c93 Add support for full radix buddy blocks
Update mkfs and print for the full radix buddy allocators.  mkfs has to
calculate the number of blocks and the height of the tree and has to
initialize the paths down the left and right side of the tree.
Print needs to dump the new radix block and super block fields.

Signed-off-by: Zach Brown <zab@versity.com>
2016-11-04 14:16:25 -07:00
Zach Brown
40b9f19ec4 Add bitops.c for find_next_bit_le()
The upcoming buddy changes are going to need a find_next_bit_le().

Signed-off-by: Zach Brown <zab@versity.com>
2016-11-04 14:16:25 -07:00
Zach Brown
871db60fb2 Add U16_MAX
Add a simple U16_MAX define for upcoming buddy changes.

Signed-off-by: Zach Brown <zab@versity.com>
2016-11-04 14:16:25 -07:00
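
The define itself is presumably a one-liner mirroring the kernel's own:

    #define U16_MAX ((u16)~0U)
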
Zach Brown
a901db2ff7 Print seqs in bmap items
The bmap items now have the sequence number that wrote each mapped
block.

Signed-off-by: Zach Brown <zab@versity.com>
2016-11-04 14:16:25 -07:00
Zach Brown
b436772376 Add orphan key
Printing the raw item is enough; it doesn't have a value to decode.

Signed-off-by: Zach Brown <zab@versity.com>
2016-11-04 14:16:25 -07:00
Zach Brown
e6222223c2 Update format.h for kernel code helpers
Update format.h for some format defines that have so far only been used
by the kernel code.

Signed-off-by: Zach Brown <zab@versity.com>
2016-11-04 14:16:22 -07:00
Mark Fasheh
2fc1b99698 scoutfs: replace some open coded corruption checks
We can trivially do the simple check of value length against what the caller
expects in btree.c.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
Signed-off-by: Zach Brown <zab@versity.com>
2016-10-27 17:25:05 -05:00
Mark Fasheh
ebbb2e842e scoutfs: implement inode orphaning
This is pretty straightforward - we define a new item type,
SCOUTFS_ORPHAN_KEY. We don't need to store any value with it; the inode
and type fields are enough for us to find which inode has been orphaned.

Otherwise this works as one would expect. Unlink sets the item, and
->evict_inode removes it. On mount, we scan for orphan items and remove any
corresponding inodes.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
Signed-off-by: Zach Brown <zab@versity.com>
2016-10-24 16:41:45 -05:00
Nic Henke
d4355dd587 Add all target for make
Adding in an 'all' target allows us to use canned build scripts for any
of the scoutfs related repositories.

Signed-off-by: Nic Henke <nic.henke@versity.com>
Signed-off-by: Zach Brown <zab@zabbo.net>
2016-10-20 13:55:31 -07:00
Nic Henke
ad2f5b33ee Use make variable CURDIR instead of PWD
When running make in a limited shell or in docker, there is no PWD set by
the shell. By using CURDIR we avoid worrying about the environment and let
make take care of this for us.

Signed-off-by: Nic Henke <nic.henke@versity.com>
Signed-off-by: Zach Brown <zab@zabbo.net>
2016-10-20 13:55:26 -07:00
Zach Brown
16e94f6b7c Search for file data that has changed
We don't overwrite existing data.  Every file data write has to allocate
new blocks and update block mapping items.

We can search for inodes whose data has changed by filtering block
mapping item walks by the sequence number.  We do this by using the
exact same code for finding changed inodes but using the block mapping
key type.

Signed-off-by: Zach Brown <zab@versity.com>
2016-10-20 13:55:14 -07:00
Mark Fasheh
5b7f9ddbe2 Trace scoutfs btree functions
We make an event class for the two most common btree op patterns, and reuse
that to make our tracepoints for each function. This covers all the entry
points listed in btree.h. We don't get every single parameter of every
function but this is enough that we can see which keys are being queried /
inserted.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
Signed-off-by: Zach Brown <zab@versity.com>
2016-10-13 14:08:08 -07:00
Mark Fasheh
31d182e2db Add 'make clean' target
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
Signed-off-by: Zach Brown <zab@versity.com>
2016-10-13 13:52:34 -07:00
Zach Brown
5601f8cef5 scoutfs: add scoutfs_block_forget()
The upcoming allocator changes need to forget dirty blocks so that
they're not written.  They probably won't be the only user.

Signed-off-by: Zach Brown <zab@versity.com>
2016-09-28 13:46:18 -07:00
Zach Brown
9d08b34791 scoutfs: remove excessive block locking tracing
I accidentally left some lock tracing in the btree locking commit that
is very noisy and not particularly useful.  Let's remove it.

Signed-off-by: Zach Brown <zab@versity.com>
2016-09-28 13:44:31 -07:00
Zach Brown
f7f7a2e53f scoutfs: add scoutfs_block_zero_from()
We already have a function that zeros the end of a block starting at a
given offset.  Some callers have a pointer to the byte to zero from so
let's add a convenience function that calculates the offset from the
pointer.

Signed-off-by: Zach Brown <zab@versity.com>
2016-09-28 13:42:27 -07:00
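
A sketch of the convenience wrapper being described; the names of the
existing offset-based helper and the data accessor are assumptions:

    /* sketch: zero from a pointer inside the block to the end of the block */
    static inline void scoutfs_block_zero_from(struct scoutfs_block *bl,
                                               void *ptr)
    {
            /* turn the caller's pointer into an offset within the block */
            scoutfs_block_zero(bl, ptr - scoutfs_block_data(bl));
    }
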
Zach Brown
0dff7f55a6 Use openssl for pseudo random bytes
The pseudo random byte wrapper function used the intel instructions
so that it could deal with high call rates, like initializing random
node priorities for a large treap.

But this is obviously not remotely portable and has the annoying habit
of tripping up versions of valgrind that haven't yet learned about these
instructions.

We don't actually have high bandwidth callers so let's back off and just
let openssl take care of this for us.

Signed-off-by: Zach Brown <zab@versity.com>
2016-09-27 09:47:50 -07:00
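
A minimal userspace sketch of leaning on OpenSSL instead of CPU
instructions; the wrapper name is illustrative:

    #include <openssl/rand.h>

    /* sketch: fill buf with pseudo random bytes via openssl */
    static int pseudo_random_bytes(void *buf, size_t len)
    {
            /* RAND_bytes() returns 1 on success */
            return RAND_bytes(buf, len) == 1 ? 0 : -1;
    }
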
Zach Brown
cf0199da00 scoutfs: allow more concurrent btree locking
The btree locking so far was a quick interim measure to get the rest of
the system going.  We want to clean it up both for correctness and
performance but also to make way for using the btree for block
allocation.

We were unconditionally using the buffer head lock for tree block
locking.  This is bad for at least four reasons: it's invisible to
lockdep, it doesn't allow concurrent reads, it doesn't allow reading
while a block is being written during the transaction, and it's not
necessary at all for stable read-only blocks.

Instead we add a rwsem to the buffer head private which we use to lock
the block when it's writable.  We clean up the locking functions to make
it clearer that btree_walk holds one lock at a time and either returns
it to the caller with the buffer head or unlocks the parent if it's
returning an error.

We also add the missing sibling block locking during splits and merges.
Locking the parent prevented walks from descending down our path but it
didn't protect against previous walks that were already down at our
sibling's level.

Getting all this working with lockdep adds a few more class/subclass
plumbing calls but nothing too onerous.

Signed-off-by: Zach Brown <zab@versity.com>
2016-09-21 13:41:18 -07:00
Zach Brown
bb3a5742f4 scoutfs: drop sib bh ref in split
We forgot to drop the sibling bh reference while splitting.  Oopsie!

Signed-off-by: Zach Brown <zab@versity.com>
2016-09-21 10:04:07 -07:00
Zach Brown
84f23296fd scoutfs: remove btree cursor
The btree cursor was built to address two problems.  First it
accelerates iteration by avoiding full descents down the tree by holding
on to leaf blocks.  Second it lets callers reference item value contents
directly to avoid copies.

But it also has serious complexity costs.  It pushes refcounting and
locking out to the caller.  There have already been a few bugs where
callers did things while holding the cursor without realizing that
they're holding a btree lock and can't perform certain btree operations
or even copies to user space.

Future changes to the allocator to use the btree motivate cleaning up
the tree locking, which is complicated by the cursor being a standalone
lock reference.  Instead of continuing to layer complexity onto this
construct let's remove it.

The iteration acceleration will be addressed the same way we're going to
accelerate the other btree operations: with per-cpu cached leaf block
references.  Unlike the cursor this doesn't push interface changes out
to callers who want repeated btree calls to perform well.

We'll leave the value copying for now.  If it becomes an issue we can
add variants that call a function to operate on the value.  Let's hope
we don't have to go there.

This change replaces the cursor with a vector to memory that the value
should be copied to and from.  The vector has a fixed number of elements
and is wrapped in a struct for easy declaration and initialization.

This change to the interface looks noisy but each caller's change is
pretty mechanical.  They tend to involve:

 - replace the cursor with the value struct and initialization
 - allocate some memory to copy the value in to
 - reading functions return the number of value bytes copied
 - verify copied bytes makes sense for item being read
 - getting rid of confusing ((ret = _next())) looping
 - _next now returns -ENOENT instead of 0 for no next item
 - _next iterators now need to increase the key themselves
 - make sure to free allocated mem

Sometimes the order of operations changes significantly.  Now that we
can't modify in place we need to read, modify, write.  This looks like
changing a modification of the item through the cursor to a
lookup/update pattern.

The symlink item iterators didn't need to use next because they walk a
contiguous set of keys.  They're changed to use simple insert or lookup.

Signed-off-by: Zach Brown <zab@versity.com>
2016-09-21 10:04:07 -07:00
Zach Brown
a9afa92482 scoutfs: correctly set the last symlink item
The final symlink item insertion was taking the min of the entire path
and the max symlink item size, not the min of the remaining length of
the path after having created all the previous items.  For paths larger
than the max item size this could use too much space.

Signed-off-by: Zach Brown <zab@versity.com>
2016-09-21 10:04:07 -07:00
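
To put numbers on the bug above: with a 300 byte path and a hypothetical
255 byte max item size, the final item should copy min(300 - 255, 255) =
45 bytes; taking min(300, 255) = 255 instead reserves far more space than
the remaining path needs.
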
Zach Brown
10a42724a9 scoutfs: add scoutfs_dec_key()
This is analogous to scoutfs_inc_key().  It decreases the next highest
order key value each time its decrement wraps.

Signed-off-by: Zach Brown <zab@versity.com>
2016-09-21 10:04:07 -07:00
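
A hedged sketch of the borrow idea over a big endian byte string; the
real scoutfs key is more structured than a flat buffer, so this is only
the shape of the decrement:

    /* sketch: decrement a big endian byte string, borrowing from the next
     * higher order byte each time a lower byte wraps from 0x00 to 0xff */
    static void dec_key_bytes(u8 *bytes, int len)
    {
            int i;

            for (i = len - 1; i >= 0; i--) {
                    if (bytes[i]--)
                            break;  /* no wrap, the borrow stops here */
            }
    }
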
Zach Brown
161063c8d6 scoutfs: remove very noisy bh ref tracing
This wasn't adding much value and was exceptionally noisy.

Signed-off-by: Zach Brown <zab@versity.com>
2016-09-21 10:04:07 -07:00