Commit Graph

795 Commits

Zach Brown
debac8ab06 scoutfs: free all forest iter pos
Forest item iteration allocates iterator positions for each tree root
it reads from.  The postorder destruction of the iterator nodes wasn't
quite right because we were balancing the nodes as they were freed.
That can change parent/child relationships and cause postorder iteration
to skip some nodes, leaking memory.  It would have worked if we just
freed the nodes without using rb_erase to balance.

The fix is to iterate over the rbnodes themselves, using the destroy
helper which rebalances as it frees each node.
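
A minimal sketch of that kind of destruction loop, with made-up names
rather than the actual scoutfs structures:

    #include <linux/rbtree.h>
    #include <linux/slab.h>

    struct iter_pos {
        struct rb_node node;
        /* per-root iterator position state */
    };

    static void free_iter_positions(struct rb_root *root)
    {
        struct rb_node *rb;

        /* rb_first()/rb_erase() stays correct as erasing rebalances */
        while ((rb = rb_first(root)) != NULL) {
            rb_erase(rb, root);
            kfree(rb_entry(rb, struct iter_pos, node));
        }
    }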

Signed-off-by: Zach Brown <zab@versity.com>
2020-03-05 09:02:06 -08:00
Zach Brown
e9e515524b scoutfs: remove unused corruption sources
Remove a bunch of constants for sources of corruption that are no longer
used in the code.

Signed-off-by: Zach Brown <zab@versity.com>
2020-03-05 09:02:06 -08:00
Zach Brown
7cf8d01c1b scoutfs: fix super read error race
The conversion to reading the super with buffer_head IO caused racing
readers to risk spurious errors.  Clearing uptodate to force device
access could race with a reader that was in the middle of waking.  That
reader could find uptodate cleared and think that an IO error had
occurred.

The buffer_head functions generally require higher level serialization
of this kind of use of the uptodate bit.  We use bh_private as a counter
to ensure that we don't clear uptodate while there are active readers.
We then also use a private buffer_head bit to satisfy batches of waiting
readers with each IO.

Signed-off-by: Zach Brown <zab@versity.com>
2020-02-28 11:34:02 -08:00
Zach Brown
d374a7c06f scoutfs: fix up radix block _first tracking
Updating the _first tracking in leaf bits was pretty confusing because
we tried to mash all the tracking updates from all leaf modifications
into one shared code path.

It had a bug where merging would advance _first tracking by the number
of bits merged in the leaf rather than the number of contiguous set bits
after the new first.  This eventually led to allocation failures as
_first ended up past actual set bits in the leaf.

This fixes that by moving _first tracking updates into the leaf callers
that modify bits and to the parent ref updating code.

In the process we also fix little bugs in the support code that were
found by the radix block consistency checking.
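
As an illustration of the idea (not the exact scoutfs fix), a leaf
caller can derive _first from the bitmap contents rather than by
adjusting it with counts; all of these names and sizes are assumptions:

    #include <linux/bitmap.h>

    #define RADIX_LEAF_BITS 4096    /* made-up leaf size */

    struct radix_leaf {
        u32 first;
        unsigned long bits[RADIX_LEAF_BITS / BITS_PER_LONG];
    };

    static void leaf_update_first(struct radix_leaf *leaf)
    {
        /* derive _first from the bits that are actually set */
        leaf->first = find_first_bit(leaf->bits, RADIX_LEAF_BITS);
    }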

Signed-off-by: Zach Brown <zab@versity.com>
2020-02-28 11:34:02 -08:00
Zach Brown
6eac823bd3 scoutfs: add radix block metadata checker
Add a quick runtime check of the consistency of the radix block and
reference metadata fields.

Signed-off-by: Zach Brown <zab@versity.com>
2020-02-28 11:34:02 -08:00
Zach Brown
c10c7d9748 scoutfs: clean up forest lock data
The client lock code forgot to call into the forest to clear its
per-lock tracking before freeing the lock.  This would result in a slow
memory leak over time as locks were reclaimed by memory pressure.  It
shouldn't have affected consistency.

Signed-off-by: Zach Brown <zab@versity.com>
2020-02-28 11:34:02 -08:00
Zach Brown
757ee85520 scoutfs: don't lose block wakeups
The block end_io path could lose wakeups.  Both the bio submission
task and a bio's end_io completion could see an io_count > 1 and neither
would set the block uptodate before dropping their io_count and waking.

It got into this mess because readers were waiting for io_count to drop
to 0.  We add an io_busy bit which indicates that io is still in flight,
and waiters now wait on that bit.  This gives the final io_count drop a
chance to do its work before clearing io_busy, dropping its reference,
and waking waiters.
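
An illustrative sketch of that ordering, with assumed field and bit
names rather than the real scoutfs block cache code:

    #include <linux/atomic.h>
    #include <linux/bitops.h>
    #include <linux/wait.h>

    enum { BLOCK_BIT_UPTODATE, BLOCK_BIT_ERROR, BLOCK_BIT_IO_BUSY };

    struct cached_block {
        unsigned long bits;
        atomic_t io_count;
        wait_queue_head_t waitq;
    };

    static void block_io_done(struct cached_block *blk, bool error)
    {
        if (error)
            set_bit(BLOCK_BIT_ERROR, &blk->bits);

        /* only the final io_count drop finishes the block */
        if (atomic_dec_and_test(&blk->io_count)) {
            if (!test_bit(BLOCK_BIT_ERROR, &blk->bits))
                set_bit(BLOCK_BIT_UPTODATE, &blk->bits);
            clear_bit(BLOCK_BIT_IO_BUSY, &blk->bits);
            smp_mb__after_atomic();
            wake_up(&blk->waitq);
        }
    }

Waiters then sleep on io_busy rather than on io_count reaching zero:

    wait_event(blk->waitq, !test_bit(BLOCK_BIT_IO_BUSY, &blk->bits));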

Signed-off-by: Zach Brown <zab@versity.com>
2020-02-28 11:34:02 -08:00
Zach Brown
44a7e2ab56 scoutfs: more carefully handle alloc cursors
The first pass at the radix allocator wasn't paying a lot of attention
to the allocation cursors.

This more carefully manages them.  They're only advanced after
allocating.  Previously the metadata alloc cursor was advanced as it
searched through leaves that it might allocate from.  We test for
wrapping past the specific final allocatable bit, rather than the limit
of what the radix height can store.  This required pushing knowledge of
metadata or data allocs down through some of the code paths.

Signed-off-by: Zach Brown <zab@versity.com>
2020-02-25 12:03:46 -08:00
Zach Brown
76ed627548 scoutfs: reclaim freed metadata blocks in server
Reclaim freed metadata blocks in the server by merging the stable freed
tree into the allocator as a commit opens, at which point we can trust
that the stable version of the freed allocator in the super is a strict
subset of the allocator's dirty freed tree.

Signed-off-by: Zach Brown <zab@versity.com>
2020-02-25 12:03:46 -08:00
Zach Brown
093f8ead58 scoutfs: refactor server commit locking
Server processing paths had open coded management of holding and
applying transactions.  Refactor that into hold_commit() and
apply_commit() helpers.  It makes the code a whole lot clearer and gives
us a place in hold_commit() to add code that needs to be run before
anything is modified in a commit on the server.

Signed-off-by: Zach Brown <zab@versity.com>
2020-02-25 12:03:46 -08:00
Zach Brown
ce7f7bdbd3 scoutfs: reclaim client log allocators
The server now consistently reclaims free space in client allocator
radix trees.  It merges the client's freed trees as the client
opens a new transaction.  And it reclaims all the client's trees
when it is removed.

Signed-off-by: Zach Brown <zab@versity.com>
2020-02-25 12:03:46 -08:00
Zach Brown
5b6401b5cd scoutfs: add missed btree block freeing
The conversion of the btree to using allocators missed freeing blocks in
two places.  When overwriting dirty new blocks we weren't freeing the
old stable block as its reference was overwritten.  And when removing
the final item in the tree we weren't freeing the final empty block as
it was removed.

Signed-off-by: Zach Brown <zab@versity.com>
2020-02-25 12:03:46 -08:00
Zach Brown
128a2c64f4 scoutfs: restore df/statfs block counts
The removal of extent allocators in the server removed the tracking of
total free blocks in the system as extents were allocated and freed.

This restores tracking of total free blocks by observing the difference
in each allocator's sm_total count as a new version is stored during a
commit on the server.

We change the single free_blocks counter in the super to separate counts
of free metadata and data blocks to reflect the metadata and data
allocators.  The statfs net command is updated.

Signed-off-by: Zach Brown <zab@versity.com>
2020-02-25 12:03:46 -08:00
Zach Brown
300b7bc3ba scoutfs: remove allocators that used btree items
Now that we have the allocators that use radix blocks we can remove all
the code that was using btree items to store free block bitmaps.

Signed-off-by: Zach Brown <zab@versity.com>
2020-02-25 12:03:46 -08:00
Zach Brown
85142dcadf scoutfs: use radix allocator
Convert metadata block and file data extent allocations to use the radix
allocator.

Most of this is simple transitions between types and calls.  The server
no longer has to initialize blocks because mkfs can write a single
radix parent block with fully set parent refs to initialize a full
radix.  We remove the code and fields that were responsible for adding
uninitialized data and metadata.

The rest of the unused block allocator code is only ifdefed out.  It'll
be removed in a separate patch to reduce noise here.

Signed-off-by: Zach Brown <zab@versity.com>
2020-02-25 12:03:46 -08:00
Zach Brown
455a547e8e scoutfs: add radix allocator
Add the allocator that uses bits stored in the leaves of a cow radix.
It'll replace two metadata and data allocators that were previously
storing allocation bitmap fragments in btree items.

Signed-off-by: Zach Brown <zab@versity.com>
2020-02-25 12:03:46 -08:00
Zach Brown
8681f920e0 scoutfs: add scoutfs_block_move
Add a call to move a block's location in the cache without failure.  The
radix allocator is going to use this to dirty radix blocks while making
atomic changes to multiple paths through multiple radix trees.

Signed-off-by: Zach Brown <zab@versity.com>
2020-02-25 12:03:46 -08:00
Zach Brown
809d4be58e scoutfs: switch block cache to rbtree
Switch the block cache from indexing blocks in a radix tree to using an
rbtree.  We lose the RCU lookups but we gain being able to move blocks
around in the cache without allocation failure.  And we no longer have
the problem of being unable to index large block numbers with a 32bit
long radix key.
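
The indexing itself is the conventional rbtree walk keyed on a 64-bit
block number; a sketch with assumed names, not the scoutfs code:

    struct cached_block {
        struct rb_node node;
        u64 blkno;
    };

    static struct cached_block *block_lookup(struct rb_root *root, u64 blkno)
    {
        struct rb_node *n = root->rb_node;

        while (n) {
            struct cached_block *bl = rb_entry(n, struct cached_block, node);

            if (blkno < bl->blkno)
                n = n->rb_left;
            else if (blkno > bl->blkno)
                n = n->rb_right;
            else
                return bl;
        }

        return NULL;
    }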

Signed-off-by: Zach Brown <zab@versity.com>
2020-02-25 12:03:46 -08:00
Zach Brown
05a8573054 scoutfs: add block visited bit
Add functions for callers to maintain a visited bit in cached blocks.
The radix allocator is going to use this to count the number of clean
blocks it sees across paths through the radix which can share parent
blocks.

Signed-off-by: Zach Brown <zab@versity.com>
2020-02-25 12:03:46 -08:00
Zach Brown
10fd4fcec0 scoutfs: verify read bloom block ref
The bloom block reading code forgot to test if the read block was stale.
It would trust whatever it read.  Now the read that's done while
building up the roots to use can return stale and be retried.

Signed-off-by: Zach Brown <zab@versity.com>
2020-02-25 12:03:46 -08:00
Zach Brown
5ed1cb3aaf scoutfs: remove LSM from README.md
Update the summary of the benefit we get from concurrent per-mount
commits.  Instead of describing it specifically in terms of LSM we
abstract it out a bit to make it also true of writing per-mount log
btrees.

Signed-off-by: Zach Brown <zab@versity.com>
2020-01-17 11:21:36 -08:00
Zach Brown
e034ffa7e9 scoutfs: fix forest iteration
The forest item iterator was missing items.  Picture the following
search pattern:

 - find a candidate item to return in a root
 - ignore a greater candidate to return in another root
 - find the first candidate item's deletion in another root

The problem was that finding the deletion item didn't reset the notion
that we'd found a key.  The next item from the second root was never
used because the found key wasn't reset and that root had already
searched past the found key.

The core architectural problem is that iteration can't examine each item
only once given that keys and deletions can be randomly distributed
across the roots.

The most efficient way to solve the problem is to sort the per-root
iteration positions and then walk them in order.  We
get the right answer and pay some data structure overhead to perform
the minimum number of btree searches.
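
A simplified illustration of the merge semantics; arrays and a linear
scan stand in for the sorted positions and btree searches, and none of
these names are the scoutfs code:

    #include <linux/types.h>

    struct demo_item {
        u64 key;
        u64 vers;       /* write_version of the lock it was written under */
        bool deletion;
    };

    struct demo_pos {
        const struct demo_item *items;  /* sorted by key, unique per root */
        int nr;
        int idx;
    };

    /* return the next live key greater than *key across all roots */
    static bool demo_next(struct demo_pos *roots, int nr_roots, u64 *key)
    {
        for (;;) {
            const struct demo_item *best = NULL;
            int i;

            for (i = 0; i < nr_roots; i++) {
                struct demo_pos *p = &roots[i];
                const struct demo_item *it;

                /* advance each root past keys we've already resolved */
                while (p->idx < p->nr && p->items[p->idx].key <= *key)
                    p->idx++;
                if (p->idx == p->nr)
                    continue;

                it = &p->items[p->idx];
                if (!best || it->key < best->key ||
                    (it->key == best->key && it->vers > best->vers))
                    best = it;
            }

            if (!best)
                return false;

            *key = best->key;
            if (!best->deletion)
                return true;
            /* the newest version was a deletion; keep searching */
        }
    }

Here a deletion found in one root never hides a greater candidate in
another root; it only masks older versions of its own key.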

Signed-off-by: Zach Brown <zab@versity.com>
2020-01-17 11:21:36 -08:00
Zach Brown
85178efa19 scoutfs: add more forest tracing
Signed-off-by: Zach Brown <zab@versity.com>
2020-01-17 11:21:36 -08:00
Zach Brown
587120830d scoutfs: initialize transaction block writer
As we shut down, the transaction tries to destroy any remaining dirty
blocks in its writer context.  The block writer context was only
initialized by the client as it asked the server for the log trees.

This makes sure the writer is always initialized.

Signed-off-by: Zach Brown <zab@versity.com>
2020-01-17 11:21:36 -08:00
Zach Brown
3978bbd23f scoutfs: have xattr use max val size
The xattr code had a static definition of the largest part item that it
would create.  Change it to be a function of the largest fs item
value that can be created and clean up the code a bit in the process.

Signed-off-by: Zach Brown <zab@versity.com>
2020-01-17 11:21:36 -08:00
Zach Brown
55fa73f407 scoutfs: add packed extent and bitmap tracing
Signed-off-by: Zach Brown <zab@versity.com>
2020-01-17 11:21:36 -08:00
Zach Brown
0de6cade19 scoutfs: remove generic extents storage
We are no longer storing individual extents in items from multiple
places and indexed in multiple ways.  We can remove this extent support
code.

Signed-off-by: Zach Brown <zab@versity.com>
2020-01-17 11:21:36 -08:00
Zach Brown
dee9fbcf66 scoutfs: use packed extents and bitmaps
The btree forest item storage doesn't have as much item granular state
as the item cache did.  The item cache could tell if a cached item was
populated from persistent storage or was created in memory.  It could
simply remove created items rather than leaving behind a deletion item.

The cached btree blocks in the btree forest item storage mechanism can't
do this.  It has to create deletion items when deleting newly created
items because it doesn't know if the item already exists in the
persistent record or not.

This created a problem with the extent storage we were using.  The
individual extent items were stored with a key set to the last logical
block of their extent.  As extents grew or shrank they often were
deleted and created at different key values during a transaction.  In
the btree forest log trees this left a huge stream of deletion items
behind, one for every previous version of the extent.  Then searches for
an extent covering a block would have to skip over all these deleted
items before hitting the current stored extent.

Streaming writes would do O(n) work for every extent operation.  It
got to be out of hand.  This large change solves the problem by using
more coarse and stable item storage to track free blocks and blocks
mapped into file data.

For file data we now have large packed extent items which store packed
representations of all the logical mappings of a fixed region of a file.
The data code has loading and storage functions which transfer that
persistent version to and from the version that is modified in memory.

Free blocks are stored in bitmaps that are similarly efficiently packed
into fixed size items.  The client is no longer working with free extent
items managed by the forest, it's working with free block bitmap btrees
directly.  It needs access to the client's metadata block allocator and
block write contexts so we move those two out of the forest code and up
into the transaction.

Previously the client and server would exchange extents with network
messages.  Now the roots of the btrees that store the free block bitmap
items are communicated along with the roots of the other trees involved
in a transaction.  The client doesn't need to send free extents back to
the server so we can remove those tasks and rpcs.

The server no longer has to manage free extents.  It transfers block
bitmap items between trees around commits.   All of its extent
manipulation can be removed.

The item size portion of transaction item counts is removed because
we're not using that level of granularity now that metadata transactions
are dirty btree blocks instead of dirty items we pack into fixed sized
segments.
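
For a rough idea of the shape of such an item, here's a hypothetical
packed layout; the field names and widths are illustrative and the real
on-disk format may differ:

    #include <linux/types.h>

    struct packed_extent {
        __le64 blkno;       /* 0 for a hole */
        __le32 blocks;      /* extent length in blocks */
    } __packed;

    struct packed_extent_item {
        __le64 start;       /* first logical block of the fixed region */
        __le16 nr;          /* number of packed extents that follow */
        struct packed_extent ext[];
    } __packed;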

Signed-off-by: Zach Brown <zab@versity.com>
2020-01-17 11:21:36 -08:00
Zach Brown
986e66d6c6 scoutfs: add block tracing
Add tracing of operations on our block cache.

Signed-off-by: Zach Brown <zab@versity.com>
2020-01-17 11:21:36 -08:00
Zach Brown
388175fc6a scoutfs: add forest tracing
Add some tracing events to the forest subsystem.

Signed-off-by: Zach Brown <zab@versity.com>
2020-01-17 11:21:36 -08:00
Zach Brown
fbffad1d51 scoutfs: add initial lock write_version
We need a way to compare two items in different log btrees and learn
which is the most recent.  Each time we grant a new write lock we give
it a larger write version.  Items store the version of the lock they're
written under.  Readers can now easily see which item is newer.

This is a trivial initial implementation which is not consistent across
unmount or server failover.   We'll need to recover the greatest
write_version from locks during recovery and from log trees as the
server starts up.
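
A reader's comparison then reduces to something like this sketch, with
assumed type and field names:

    struct forest_item {
        u64 write_vers;     /* write_version of the granting lock */
        /* key, value, deletion flag, ... */
    };

    /* return the newer of two versions of an item; either may be NULL */
    static struct forest_item *newer_item(struct forest_item *a,
                                          struct forest_item *b)
    {
        if (!a || !b)
            return a ?: b;

        return a->write_vers > b->write_vers ? a : b;
    }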

Signed-off-by: Zach Brown <zab@versity.com>
2020-01-17 11:21:36 -08:00
Zach Brown
43d416003a scoutfs: add scoutfs_btree_force
Add a btree_update variant which will insert the item if a previous one
wasn't found instead of returning -ENOENT.  This saves callers from
having to look up before updating to discover if they should call
_create or _update.
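
Conceptually it's an upsert; a sketch assuming the _update and _insert
signatures, which may not match the real ones exactly:

    int scoutfs_btree_force(struct super_block *sb,
                            struct scoutfs_btree_root *root,
                            struct scoutfs_key *key,
                            void *val, unsigned int val_len)
    {
        int ret;

        ret = scoutfs_btree_update(sb, root, key, val, val_len);
        if (ret == -ENOENT)
            ret = scoutfs_btree_insert(sb, root, key, val, val_len);

        return ret;
    }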

Signed-off-by: Zach Brown <zab@versity.com>
2020-01-17 11:21:36 -08:00
Zach Brown
2fab3b4377 scoutfs: allow larger 8MB transactions
Try using larger transactions.  This will probably be tweaked over time.

Signed-off-by: Zach Brown <zab@versity.com>
2020-01-17 11:21:36 -08:00
Zach Brown
edd8fe075c scoutfs: remove lsm code
Remove all the now unused code that deals with lsm: segment IO, the item
cache, and the manifest.

Signed-off-by: Zach Brown <zab@versity.com>
2020-01-17 11:21:36 -08:00
Zach Brown
43f451d015 scoutfs: read and write super with buffer_head
Use simple buffer_heads to read and write the super.  After getting rid
of the lsm code this would be the last user of our bio helpers.  With
this converted we can remove the bio helpers along with the rest of the
lsm code.

Signed-off-by: Zach Brown <zab@versity.com>
2020-01-17 11:21:36 -08:00
Zach Brown
58f062a2c1 scoutfs: use forest in locking and transaction
Transaction commit now has to ask the forest to write the btrees
instead of writing dirty items in segments.  It
also determines if holds fit in the dirty transaction by looking at
dirty btree blocks instead of item counts.

Locking no longer has to invalidate a private item cache because the
forest paths use the btree block cache where inconsistency is discovered
and invalidated as blocks are read.

Signed-off-by: Zach Brown <zab@versity.com>
2020-01-17 11:21:36 -08:00
Zach Brown
48448d3926 scoutfs: convert fs callers to forest
Convert fs callers to work with the btree forest calls instead of the
lsm item cache calls.  This is mostly a mechanical syntax conversion.
The inode dirtying path does now update the item rather than simply
dirtying it.

Signed-off-by: Zach Brown <zab@versity.com>
2020-01-17 11:21:36 -08:00
Zach Brown
858dad1d51 scoutfs: add forest subsystem
The forest code presents a consistent item interface that's implemented
on top of a forest of persistent btrees.

Signed-off-by: Zach Brown <zab@versity.com>
2020-01-17 11:21:36 -08:00
Zach Brown
e6af174c79 scoutfs: add commit btree net command
Add a simple start of a command that the client will use to commit its
dirty trees.  This'll be expanded in the future to include more trees
and block allocation.

Signed-off-by: Zach Brown <zab@versity.com>
2020-01-17 11:21:36 -08:00
Zach Brown
0f83dfd512 scoutfs: update block btree interfaces in server
Teach the server to maintain and use its block allocator and writer
contexts when operating on its btrees.

The manifest tree operations aren't updated because they're about to be
removed.

Signed-off-by: Zach Brown <zab@versity.com>
2020-01-17 11:21:36 -08:00
Zach Brown
8775826d7e scoutfs: have btree use blocks, allocator, writer
Convert the btree to use our block cache, block allocation, and the
caller's explicit dirty block tracking writer context instead of the
ring.  This is in preparation for the btree forest format where there
are concurrent multiple writers of independent dynamically sized btrees
instead of only the server writing one btree with a fixed maximum size.

All the machinery for tracking dirty blocks in the ring and migrating is
no longer needed.

Signed-off-by: Zach Brown <zab@versity.com>
2020-01-17 11:21:36 -08:00
Zach Brown
bdafa6ede6 scoutfs: add block allocator
Add our block allocator core.  It'll be used shortly.

Signed-off-by: Zach Brown <zab@versity.com>
2020-01-17 11:21:36 -08:00
Zach Brown
d20c950c17 scoutfs: restore our block cache
Previous versions of the system had a simple block cache.  This brings
it back with support for blocks that are larger than page size, a more
efficient LRU, and an explicit writer context.

Signed-off-by: Zach Brown <zab@versity.com>
2020-01-17 11:21:36 -08:00
Zach Brown
9456eda583 scoutfs: support larger btree block sizes
The btree block header had some aggressively small values that limited
the largest block size that could be supported.  Use larger 32bit values
so that we can support larger block sizes.

Signed-off-by: Zach Brown <zab@versity.com>
2020-01-17 11:21:36 -08:00
Zach Brown
e444c2b8c2 scoutfs: remove sort_priv
The only user was item compaction in the btree and it has been removed.

Signed-off-by: Zach Brown <zab@versity.com>
2020-01-17 11:21:36 -08:00
Zach Brown
42b311c5be scoutfs: memmove deleted btree items
It turns out that the sorting performed by btree block item compaction
was pretty expensive.  It's cheaper to keep the items packed at the end
of the block by moving earlier items towards the back of the block as
interior items are deleted.  When the items are always packed at the end
of the block we no longer need to track fragmented free space and can
remove the 'free_reclaim' btree block field.

This brought the bulk empty file create rate up by about 20%.
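
The deletion itself is just a shift of item bytes toward the tail of
the block; a sketch with assumed fields, and with the per-item offset
bookkeeping omitted:

    struct btree_block {
        __le16 first_off;   /* offset of the lowest item bytes in the block */
        /* header fields, item offset array, ... */
    };

    /*
     * Items are packed at the tail of the block in [first_off, block_size).
     * Deleting the item bytes at [off, off + len) slides everything before
     * them back by len.
     */
    static void shift_out_item(struct btree_block *bt, unsigned int off,
                               unsigned int len)
    {
        unsigned int first = le16_to_cpu(bt->first_off);

        memmove((char *)bt + first + len, (char *)bt + first, off - first);
        bt->first_off = cpu_to_le16(first + len);
    }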

Signed-off-by: Zach Brown <zab@versity.com>
2020-01-17 11:21:36 -08:00
Zach Brown
f3a8a5110e scoutfs: allow btree update with different lengths
The previous _btree_update implementation required that the new value be
the same length as the old value.  This change allows a new updated item
to be a different length.  It performs the btree walk assuming that the
item will be larger so that there's room for the difference.   It
doesn't search for the size of the existing item so it doesn't know if
the new item is smaller.  It might leave the dirty leaf under the low
water mark, which is fine.

Signed-off-by: Zach Brown <zab@versity.com>
2020-01-17 11:21:36 -08:00
Zach Brown
4265ecedb0 scoutfs: increase max btree value length
Now that we're storing fs items in the btree we need a larger max value
length.

Signed-off-by: Zach Brown <zab@versity.com>
2020-01-17 11:21:36 -08:00
Zach Brown
ac2d00629c scoutfs: add scoutfs_lock_protected()
The item code had a manual comparison of lock modes when testing if a
given access was protected by a held lock.  Let's offer a proper
interface from the lock code.

Signed-off-by: Zach Brown <zab@versity.com>
2020-01-17 11:21:36 -08:00
Zach Brown
ddd1a4ef5a scoutfs: support newer ->iterate readdir
The modern upstream kernel has a ->iterate() readdir file_operations
method which takes a context and calls dir_emit().   We add some
kernelcompat helpers to juggle the various function definitions, types,
and arguments to support both the old ->readdir(filldir) and the new
->iterate(ctx) interfaces.

Signed-off-by: Zach Brown <zab@versity.com>
2020-01-15 14:57:57 -08:00