Commit Graph

1907 Commits

Author SHA1 Message Date
Mark Fasheh
ebbb2e842e scoutfs: implement inode orphaning
This is pretty straightforward - we define a new item type,
SCOUTFS_ORPHAN_KEY. We don't need to store any value with this, the inode
and type fields are enough for us to find what inode has been orphaned.

Otherwise this works as one would expect. Unlink sets the item, and
->evict_inode removes it. On mount, we scan for orphan items and remove any
corresponding inodes.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
Signed-off-by: Zach Brown <zab@versity.com>
2016-10-24 16:41:45 -05:00
Nic Henke
d4355dd587 Add all target for make
Adding in an 'all' target allows us to use canned build scripts for any
of the scoutfs related repositories.

Signed-off-by: Nic Henke <nic.henke@versity.com>
Signed-off-by: Zach Brown <zab@zabbo.net>
2016-10-20 13:55:31 -07:00
Nic Henke
ad2f5b33ee Use make variable CURDIR instead of PWD
When running make in a limited shell or in docker, there is no PWD set
by the shell. By using CURDIR we avoid worrying about the environment
and let make take care of this for us.

Signed-off-by: Nic Henke <nic.henke@versity.com>
Signed-off-by: Zach Brown <zab@zabbo.net>
2016-10-20 13:55:26 -07:00
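For illustration, a hypothetical external-module Makefile fragment in this style; CURDIR is always set by make itself, while PWD depends on the invoking shell (the KDIR variable name is an assumption, not necessarily what the repository uses):

```make
KDIR ?= /lib/modules/$(shell uname -r)/build

all:
	$(MAKE) -C $(KDIR) M=$(CURDIR) modules
```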
Zach Brown
16e94f6b7c Search for file data that has changed
We don't overwrite existing data.  Every file data write has to allocate
new blocks and update block mapping items.

We can search for inodes whose data has changed by filtering block
mapping item walks by the sequence number.  We do this by using the
exact same code for finding changed inodes but using the block mapping
key type.

Signed-off-by: Zach Brown <zab@versity.com>
2016-10-20 13:55:14 -07:00
Mark Fasheh
5b7f9ddbe2 Trace scoutfs btree functions
We make an event class for the two most common btree op patterns, and reuse
that to make our tracepoints for each function. This covers all the entry
points listed in btree.h. We don't get every single parameter of every
function but this is enough that we can see which keys are being queried /
inserted.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
Signed-off-by: Zach Brown <zab@versity.com>
2016-10-13 14:08:08 -07:00
Mark Fasheh
31d182e2db Add 'make clean' target
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
Signed-off-by: Zach Brown <zab@versity.com>
2016-10-13 13:52:34 -07:00
Zach Brown
5601f8cef5 scoutfs: add scoutfs_block_forget()
The upcoming allocator changes have a need to forget dirty blocks so
they're not written.  It probably won't be the only one.

Signed-off-by: Zach Brown <zab@versity.com>
2016-09-28 13:46:18 -07:00
Zach Brown
9d08b34791 scoutfs: remove excessive block locking tracing
I accidentally left some lock tracing in the btree locking commit that
is very noisy and not particularly useful.  Let's remove it.

Signed-off-by: Zach Brown <zab@versity.com>
2016-09-28 13:44:31 -07:00
Zach Brown
f7f7a2e53f scoutfs: add scoutfs_block_zero_from()
We already have a function that zeros the end of a block starting at a
given offset.  Some callers have a pointer to the byte to zero from so
let's add a convenience function that calculates the offset from the
pointer.

Signed-off-by: Zach Brown <zab@versity.com>
2016-09-28 13:42:27 -07:00
Zach Brown
0dff7f55a6 Use openssl for pseudo random bytes
The pseudo random byte wrapper function used the Intel instructions
so that it could deal with high call rates, like initializing random
node priorities for a large treap.

But this is obviously not remotely portable and has the annoying habit
of tripping up versions of valgrind that haven't yet learned about these
instructions.

We don't actually have high bandwidth callers so let's back off and just
let openssl take care of this for us.

Signed-off-by: Zach Brown <zab@versity.com>
2016-09-27 09:47:50 -07:00
Zach Brown
cf0199da00 scoutfs: allow more concurrent btree locking
The btree locking so far was a quick interim measure to get the rest of
the system going.  We want to clean it up both for correctness and
performance but also to make way for using the btree for block
allocation.

We were unconditionally using the buffer head lock for tree block
locking.  This is bad for at least four reasons:  it's invisible to
lockdep, it doesn't allow concurrent reads, it doesn't allow reading
while a block is being written during the transaction, and it's not
necessary at all for stable read-only blocks.

Instead we add a rwsem to the buffer head private which we use to lock
the block when it's writable.  We clean up the locking functions to make
it clearer that btree_walk holds one lock at a time and either returns
it to the caller with the buffer head or unlocks the parent if it's
returning an error.

We also add the missing sibling block locking during splits and merges.
Locking the parent prevented walks from descending down our path but it
didn't protect against previous walks that were already down at our
sibling's level.

Getting all this working with lockdep adds a bit more class/subclass
plumbing calls but nothing too onerous.

Signed-off-by: Zach Brown <zab@versity.com>
2016-09-21 13:41:18 -07:00
Zach Brown
bb3a5742f4 scoutfs: drop sib bh ref in split
We forgot to drop the sibling bh reference while splitting.  Oopsie!

Signed-off-by: Zach Brown <zab@versity.com>
2016-09-21 10:04:07 -07:00
Zach Brown
84f23296fd scoutfs: remove btree cursor
The btree cursor was built to address two problems.  First it
accelerates iteration by avoiding full descents down the tree by holding
on to leaf blocks.  Second it lets callers reference item value contents
directly to avoid copies.

But it also has serious complexity costs.  It pushes refcounting and
locking out to the caller.  There have already been a few bugs where
callers did things while holding the cursor without realizing that
they're holding a btree lock and can't perform certain btree operations
or even copies to user space.

Future changes that make the allocator use the btree motivate cleaning
up the tree locking, which is complicated by the cursor being a
stand-alone lock reference.  Instead of continuing to layer complexity
onto this construct, let's remove it.

The iteration acceleration will be addressed the same way we're going to
accelerate the other btree operations: with per-cpu cached leaf block
references.  Unlike the cursor this doesn't push interface changes out
to callers who want repeated btree calls to perform well.

We'll leave the value copying for now.  If it becomes an issue we can
add variants that call a function to operate on the value.  Let's hope
we don't have to go there.

This change replaces the cursor with a vector describing memory that the value
should be copied to and from.  The vector has a fixed number of elements
and is wrapped in a struct for easy declaration and initialization.

This change to the interface looks noisy but each caller's change is
pretty mechanical.  They tend to involve:

 - replace the cursor with the value struct and initialization
 - allocate some memory to copy the value in to
 - reading functions return the number of value bytes copied
 - verify that the copied bytes make sense for the item being read
 - getting rid of confusing ((ret = _next())) looping
 - _next now returns -ENOENT instead of 0 for no next item
 - _next iterators now need to increment the key themselves
 - make sure to free allocated mem

Sometimes the order of operations changes significantly.  Now that we
can't modify in place we need to read, modify, write.  This looks like
changing a modification of the item through the cursor to a
lookup/update pattern.

The symlink item iterators didn't need to use next because they walk a
contiguous set of keys.  They're changed to use simple insert or lookup.

Signed-off-by: Zach Brown <zab@versity.com>
2016-09-21 10:04:07 -07:00
Zach Brown
a9afa92482 scoutfs: correctly set the last symlink item
The final symlink item insertion was taking the min of the entire path
and the max symlink item size, not the min of the remaining length of
the path after having created all the previous items.  For paths larger
than the max item size this could use too much space.

Signed-off-by: Zach Brown <zab@versity.com>
2016-09-21 10:04:07 -07:00
Zach Brown
10a42724a9 scoutfs: add scoutfs_dec_key()
This is analogous to scoutfs_inc_key().  It decreases the next highest
order key value each time a decrement wraps.

Signed-off-by: Zach Brown <zab@versity.com>
2016-09-21 10:04:07 -07:00
Zach Brown
161063c8d6 scoutfs: remove very noisy bh ref tracing
This wasn't adding much value and was exceptionally noisy.

Signed-off-by: Zach Brown <zab@versity.com>
2016-09-21 10:04:07 -07:00
Zach Brown
2bed78c269 scoutfs: specify btree root
The btree functions currently don't take a specific root argument.  They
assume, deep down in btree_walk, that there's only one btree in the
system.  We're going to be adding a few more to support richer
allocation.

To prepare for this we have the btree functions take an explicit btree
root argument.  This should make no functional difference.

Signed-off-by: Zach Brown <zab@versity.com>
2016-09-21 10:04:07 -07:00
Zach Brown
d2a696f4bd scoutfs: add zero key set and test functions
Add some quick functions to set a key to all zeros and to test if a key
is all zeros.

Signed-off-by: Zach Brown <zab@versity.com>
2016-09-21 10:04:07 -07:00
Zach Brown
3bb0c80686 scoutfs: fix buddy stable bit test
The buddy allocator had the test for non-existent stable bitmap blocks
backwards.  An uninitialized block implies that all the bits are marked
free and we don't need to test that the specific bits are free.

Signed-off-by: Zach Brown <zab@versity.com>
2016-09-21 10:02:19 -07:00
Zach Brown
4ccb80a8ec Initialize all the buddy slot free order fields
Initialize the free_order field in all the slots of the buddy index
block so that the kernel will try to allocate from them and will
initialize and populate the first block.

Signed-off-by: Zach Brown <zab@versity.com>
2016-09-08 16:40:39 -07:00
Zach Brown
1dd4a14d04 scoutfs: don't dereference IS_ERR buffer_head
The check for aligned buffer head data pointers was trying to
dereference a bad IS_ERR pointer when allocation of a new block failed
with ENOSPC.

Signed-off-by: Zach Brown <zab@versity.com>
2016-09-08 15:36:25 -07:00
Zach Brown
49c3d5ed34 scoutfs: add btree block verification
Add a function to verify that a btree block is valid.  It's disabled for
now because it's expensive.

Signed-off-by: Zach Brown <zab@versity.com>
2016-09-08 14:49:37 -07:00
Zach Brown
f44306757c scoutfs: add btree deletion trace message
Add a simple trace message with the result of item deletion calls.

Signed-off-by: Zach Brown <zab@versity.com>
2016-09-08 14:43:13 -07:00
Zach Brown
b55da5ecb7 scoutfs: compact btree more carefully when merging
The btree block merging code knew to try and compact the destination
block if it was going to move more bytes worth of items than there was
contiguous free space in the destination block.  But it missed the case
where item movement moves more than the hint because the last item it
moves was big.  In the worst case this creates an item which overlaps
the item offsets and ends up looking like corrupt items.

Signed-off-by: Zach Brown <zab@versity.com>
2016-09-08 14:36:35 -07:00
Zach Brown
164bcb5d99 scoutfs: bug if btree item creation corrupts
Add a BUG_ON() assertion for the case where we create an item that
starts in the item offset array.  This happens if the caller's free space
calculations are incorrect.  It shouldn't be triggerable by corrupt
blocks if we're verifying the blocks as we read them in.

Signed-off-by: Zach Brown <zab@versity.com>
2016-09-08 14:28:22 -07:00
Zach Brown
5375ed5f38 scoutfs: fill nameidata with symlink path
Our follow_link method forgot to fill the nameidata with the target path
of the symlink.  The uninitialized nameidata tripped up the generic
readlink code in a debugging kernel.

Signed-off-by: Zach Brown <zab@versity.com>
2016-09-08 13:47:19 -07:00
Zach Brown
04e0df4f36 scoutfs: forgot to initialize file alloc lock
Thank goodness for lockdep!

Signed-off-by: Zach Brown <zab@versity.com>
2016-09-08 13:38:58 -07:00
Zach Brown
b2e12a9f27 scoutfs: sync large transactions as released
We don't want very large transactions to build up and create huge commit
latencies.  All blocks are written to free space so we use a count of
allocations to count dirty blocks.  We arbitrarily limit the transaction
to 128MB and try to kick off commits when we release transactions that
have gotten that big.

Signed-off-by: Zach Brown <zab@versity.com>
2016-09-06 15:16:50 -07:00
Zach Brown
06c718e16a scoutfs: remove unlinked inode items
Wire up the inode callbacks that let us remove all the persistent items
associated with an unlinked inode as its final reference is dropped.
This is the first part of full truncate and orphan inode support.

Signed-off-by: Zach Brown <zab@versity.com>
2016-08-31 09:31:23 -07:00
Zach Brown
86ffdf24a2 Add symlink support
Print out the raw symlink items.

Signed-off-by: Zach Brown <zab@versity.com>
2016-08-29 10:25:46 -07:00
Zach Brown
64b82e1ac3 scoutfs: add symlink support
Symlinks are easily implemented by storing the target path in btree
items.

Signed-off-by: Zach Brown <zab@versity.com>
2016-08-29 10:21:27 -07:00
Zach Brown
df93073971 scoutfs: don't unlock err bh after validation
If block validation failed then we'd end up trying to unlock an IS_ERR
buffer_head pointer.  Fix it so that we drop the ref and set the
pointer after unlocking.

Signed-off-by: Zach Brown <zab@versity.com>
2016-08-26 16:51:47 -07:00
Zach Brown
cb318982c9 scoutfs: add support for statfs
To do a credible job of this we need to track the number of free blocks.
We add counters of free allocations at each order to the indirect
blocks so that we can quickly scan them.  We also need a bit of help to
count inodes.

Finally I noticed that we were miscalculating the number of slots in the
indirect blocks because we were using the size of the buddy block
header, not the size of the indirect block header.

Signed-off-by: Zach Brown <zab@versity.com>
2016-08-24 15:52:54 -07:00
Zach Brown
a89f6c10b1 Add buddy indirect order totals
The total counts of all the set order bits in all the child buddy
blocks are needed for statfs.

Signed-off-by: Zach Brown <zab@versity.com>
2016-08-23 16:41:57 -07:00
Zach Brown
2f91a9a735 Make command listing less noisy
It's still not great, but at least it's a little clearer.

Signed-off-by: Zach Brown <zab@versity.com>
2016-08-23 12:31:03 -07:00
Zach Brown
c17a7036ed Add find xattr commands
Add commands that use the find-xattr ioctls to show the inode numbers of
inodes which probably contain xattrs matching the specified name or
value.

Signed-off-by: Zach Brown <zab@versity.com>
2016-08-23 12:21:47 -07:00
Zach Brown
c90710d26b scoutfs: add find xattr ioctls
Add ioctls that return the inode numbers that probably contain the given
xattr name or value.  To support these we add items that index inodes by
the presence of xattr items whose names or values hash to a given hash
value.

Signed-off-by: Zach Brown <zab@versity.com>
2016-08-23 12:14:55 -07:00
Zach Brown
634114f364 scoutfs: update CKF key format
The previous %llu for the key type came from the weird tracing functions
that cast all the arguments to long long.  Those have since been
removed.

Signed-off-by: Zach Brown <zab@versity.com>
2016-08-23 12:05:59 -07:00
Zach Brown
6c12e7c38b scoutfs: add hard link support
Now that we have the link backrefs let's add support for hard links so
we can verify that an inode can have multiple backrefs.  (It can.)

It's a straightforward refactoring of mknod to let callers either
allocate or use existing inodes.  We push all the btree item specific
work into a function called by mknod and link.

The only surprising bit is the small max link count.  It's limiting
the worst case buffer size for the inode_paths ioctl.

Signed-off-by: Zach Brown <zab@versity.com>
2016-08-17 16:22:00 -07:00
Zach Brown
43619a245d Add inode-paths via link backrefs
Add the inode-paths command which uses the ioctl to display all the
paths that lead to the given inode.  We add support for printing
the new link backref items and inode and dirent fields.

Signed-off-by: Zach Brown <zab@versity.com>
2016-08-11 16:50:57 -07:00
Zach Brown
0991622a21 scoutfs: add inode_paths ioctl
This adds the ioctl that returns all the paths from the root to a given
inode.  The implementation only traverses btree items to keep it
isolated from the vfs object locking and life cycles, but that could be
a performance problem.  This is another motivation to accelerate the
btree code.

Signed-off-by: Zach Brown <zab@versity.com>
2016-08-11 16:46:18 -07:00
Zach Brown
be4a137479 Add support for printing block map items
Signed-off-by: Zach Brown <zab@versity.com>
2016-08-10 15:19:09 -07:00
Zach Brown
77e0ffb981 scoutfs: track data blocks in bmap items
Up to this point we'd been storing file data in large fixed size items.
This obviously needed to change to get decent large file IO patterns.

This wires the file IO into the usual page cache and buffer head paths
so that we write data blocks into allocations referenced by btree items.
We're aggressively trying to find the highest ratio of performance to
implementation complexity.

Writing dirty metadata blocks during transaction commit changes a bit.
We need to discover if we have dirty blocks before trying to sync the
inodes.  We add our _block_has_dirty() function back and use it to avoid
write attempts during transaction commit.

Signed-off-by: Zach Brown <zab@versity.com>
2016-08-10 15:18:45 -07:00
Zach Brown
198ec2ed5b scoutfs: have btree_update return errors
We can certainly have btree update callers that haven't yet dirtied the
blocks but who can deal with errors.  So make it return errors and have
its only current caller freak out if it fails.  This will let the file
data block mapping code attempt to get a dirty item without first
dirtying.

Signed-off-by: Zach Brown <zab@versity.com>
2016-08-09 17:03:30 -07:00
Zach Brown
8a6715ff02 scoutfs: add buddy was_free and free_extent
Add helpers to discover if a given allocation was free and to free all
the buddy order allocations that make up an arbitrary block extent.
These are going to be used by the file data block mapping code.

Signed-off-by: Zach Brown <zab@versity.com>
2016-08-09 16:56:27 -07:00
Zach Brown
25e3b03d94 Add support for simpler btree block
Update mkfs and print to the new simpler btree block format.

Signed-off-by: Zach Brown <zab@versity.com>
2016-08-02 13:31:06 -07:00
Zach Brown
0af40547b5 Update to smaller block size
We're going to try using a smaller fixed block size to reduce complexity
in the file data extent code.

Signed-off-by: Zach Brown <zab@versity.com>
2016-08-02 13:30:40 -07:00
Zach Brown
6a97aa3c9a Add support for the radix buddy bitmaps
Update mkfs and print to support the buddy allocator that's indexed by
radix blocks.

Signed-off-by: Zach Brown <zab@versity.com>
2016-08-02 13:29:51 -07:00
Zach Brown
1fde47170b scoutfs: simplify btree block format
Now that we are using fixed smaller blocks we can make the btree format
significantly simpler.  The fixed small block size limits the number of
items that will be stored in each block.  We can use a simple sorted
array of item offsets to maintain the item sort order instead of
the treap.

Getting rid of the treap not only removes a bunch of code, it makes
tasks like verifying or repairing a btree block a lot simpler.

The main impact on the code is that now an item doesn't record its
position in the sort order.  Users of sorted item position now need to
track an item's sorted position instead of just the item.

Signed-off-by: Zach Brown <zab@versity.com>
2016-08-02 13:28:08 -07:00
Zach Brown
8bc2b15e3d scoutfs: remove scoutfs_buddy_dirty
The buffer head rewrite got rid of the only caller who needed to ensure
that a free couldn't fail.  Let's get rid of this.  We can always bring
it back if it's needed again.

Signed-off-by: Zach Brown <zab@versity.com>
2016-08-02 13:28:08 -07:00