Commit Graph

30 Commits

Author SHA1 Message Date
Zach Brown
91bbf90f71 Don't pin input btrees when merging
The btree_merge code was pinning leaf blocks for all input btrees as it
iterated over them.  This doesn't work when there are a very large
number of input btrees.  It can run out of memory trying to hold a
reference to a 64KiB leaf block for each input root.

This reworks the btree merging code.  It reads a window of blocks from
all input trees to get a set of merged items.  It can take multiple
passes to complete the merge, but with a large enough merge window that
overhead is small.  Merging now consumes a fixed amount of memory
rather than memory proportional to the number of input btrees.
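The windowed merge can be sketched in user space.  Everything here is an illustrative stand-in for the kernel code: `merge_window`, the int keys, and the flat input arrays are hypothetical, but the shape is the same — each pass only considers a bounded window of each input, and only emits keys at or below the smallest per-window limit, so an item that might still be beaten by unread input waits for a later pass.

```c
/* Multi-pass bounded-window k-way merge (toy sketch). */
static int merge_window(const int *in[], const int len[], int n,
                        int window, int *out)
{
    int pos[16] = {0};          /* per-input cursors; n <= 16 here */
    int emitted = 0;

    for (;;) {
        int have_limit = 0, limit = 0;

        /* the pass limit is the smallest last-key of any window */
        for (int i = 0; i < n; i++) {
            int last = pos[i] + window;
            if (last > len[i])
                last = len[i];
            if (last == pos[i])
                continue;       /* this input is exhausted */
            if (!have_limit || in[i][last - 1] < limit) {
                limit = in[i][last - 1];
                have_limit = 1;
            }
        }
        if (!have_limit)
            return emitted;     /* all inputs consumed */

        /* merge every item at or below the limit */
        for (;;) {
            int best = -1;
            for (int i = 0; i < n; i++)
                if (pos[i] < len[i] && in[i][pos[i]] <= limit &&
                    (best < 0 || in[i][pos[i]] < in[best][pos[best]]))
                    best = i;
            if (best < 0)
                break;
            out[emitted++] = in[best][pos[best]++];
        }
    }
}
```

With three inputs and a window of two this takes several passes, but the scratch state is just the cursors and the window bound, not a pinned leaf block per input.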

Signed-off-by: Zach Brown <zab@versity.com>
2024-01-25 11:30:17 -08:00
Zach Brown
d8231016f8 Free fewer log btree blocks per server commit
After we've merged a log btree back into the main fs tree we kick off
work to free all its blocks.  This would fully fill the transaction's
free blocks list before stopping to apply the commit.

Consuming the entire free list makes it hard to have concurrent holders
of a commit who also want to free things.  This changes the log btree
block freeing to limit itself to a fraction of the budget that each
holder gets.  That coarse limit avoids having to precisely account
for the allocations and frees made while modifying the freeing item,
while still freeing many blocks per commit.

Signed-off-by: Zach Brown <zab@versity.com>
2022-04-01 15:28:20 -07:00
Zach Brown
a59fd5865d Add seq and flags to btree items
The fs log btrees have values that start with a header that stores the
item's seq and flags.  There's a lot of sketchy code that manipulates
the value header as items are passed around.

This adds the seq and flags as core item fields in the btree.  They're
only set by the interfaces that are used to store fs items: _insert_list
and _merge.  The rest of the btree callers that use the main interface
don't touch the fields.

This was done to help delta items discover when logged items have been
merged before the finalized log btrees are deleted, and the code ends up
being quite a bit cleaner.

Signed-off-by: Zach Brown <zab@versity.com>
2021-09-09 14:44:55 -07:00
Zach Brown
3d1a0f06c0 Add scoutfs_btree_free_blocks
Add a btree function for freeing all the blocks in a btree without
having to cow the blocks to track which refs have been freed.  We use a
key from the caller to track which portions of the tree have been freed.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
d8478ed6f1 Add scoutfs_btree_rebalance()
Add a btree call to just dirty a path to a leaf block, joining and
splitting along the way so that the blocks in the path satisfy the
balance constraints.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
0538c882bc Add btree_merge()
Add a btree function for merging the items in a range from a number of
read-only input btrees into a destination btree.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
b6d0a45f6d Add btree_{get,set}_parent
Add calls for working with subtrees built around references to blocks in
the last level of parents.  This will let the server farm out btree
merging work where concurrency is built around safely working with all
the items and leaves that fall under a given parent block.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-15 15:25:14 -07:00
Zach Brown
d7f8896fac Add scoutfs_btree_parent_range
Add a btree helper for finding the range of keys which are found in
leaves referenced by the last parent block when searching for a given
key.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-15 15:25:14 -07:00
Zach Brown
e60f4e7082 scoutfs: use full extents for data and alloc
Previously we'd avoided full extents in file data mapping items because
we were deleting items from forest btrees directly.  That created
deletion items for every version of file extents as they were modified.
Now we have the item cache which can remove deleted items from memory
when deletion items aren't necessary.

By layering file data extents on an extent layer, we can also transition
allocators to use extents and fix a lot of problems in the radix block
allocator.

Most of this change is churn from changing allocator function and struct
names.

File data extents no longer have to manage loading and storing from and
to packed extent items at a fixed granularity.  All those loops are torn
out and data operations now call the extent layer with their callbacks
instead of calling its packed item extent functions.  This now means
that fallocate and especially restoring offline extents can use larger
extents.  Small file block allocation now comes from a cached extent
which reduces item calls for small file data streaming writes.

The big change in the server is to use more root structures to manage
recursive modification instead of relying on the allocator to notice and
do the right thing.  The radix allocator tried to notice when it was
actively operating on a root that it was also using to allocate and free
metadata blocks.  This resulted in a lot of bugs.  Instead we now double
buffer the server's avail and freed roots so that the server fills and
drains the stable roots from the previous transaction.  We also double
buffer the core fs metadata avail root so that we can increase the time
to reuse freed metadata blocks.

The server now only moves free extents into client allocators when they
fall below a low threshold.  This reduces the shared modification of the
client's allocator roots which requires cold block reads on both the
client and server.

Signed-off-by: Zach Brown <zab@versity.com>
2020-10-26 15:19:03 -07:00
Zach Brown
1a994137f4 scoutfs: add btree methods for item cache
Add btree calls to call a callback for all items in a leaf, and to
insert a list of items into their leaf blocks.  These will be used by
the item cache to populate the cache and to write dirty items into dirty
btree blocks.

Signed-off-by: Zach Brown <zab@versity.com>
2020-08-26 14:39:12 -07:00
Zach Brown
ac0e58839d scoutfs: remove btree _before and _after
There are no users of these variants of _prev and _next, so they can be
removed.  Support for them was also dropped in the previous reworking of
the internal structure of the btree blocks.

Signed-off-by: Zach Brown <zab@versity.com>
2020-08-26 14:39:12 -07:00
Zach Brown
ad99636af8 scoutfs: use scoutfs_key as btree key
The btree currently uses variable length big-endian buffers that are
compared with memcmp() as keys.  This is a historical relic of the time
when keys could be very large.  We had dirent keys that included the
name and manifest entries that included those fs keys.

But now all the btree callers are jumping through hoops to translate
their fs keys into big-endian btree keys.  And the memcmp() of the
keys is showing up in profiles.

This makes the btree take native scoutfs_key structs as its keys.  The
forest callers which are working with fs keys can just pass their keys
straight through.  The server btree callers with their private btrees
get key fields defined for their use instead of having individual
big-endian key structs.

A nice side-effect of this is that splitting parents doesn't have to
assume that a maximal key will be inserted by a child split.  We can
have more keys in parents and wider trees.
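The difference is easy to see in a toy sketch.  `struct key` and its fields here are hypothetical, not the real scoutfs_key layout, but the field-by-field compare shows the idea: native integer fields compared most-significant first, with no big-endian translation and no memcmp() over an opaque buffer.

```c
/* Hypothetical fixed-field key; the real struct scoutfs_key differs. */
struct key {
    unsigned char zone;
    unsigned char type;
    unsigned long long first;
    unsigned long long second;
};

/* Compare native fields in order of significance. */
static int key_cmp(const struct key *a, const struct key *b)
{
    if (a->zone != b->zone)
        return a->zone < b->zone ? -1 : 1;
    if (a->type != b->type)
        return a->type < b->type ? -1 : 1;
    if (a->first != b->first)
        return a->first < b->first ? -1 : 1;
    if (a->second != b->second)
        return a->second < b->second ? -1 : 1;
    return 0;
}
```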

Signed-off-by: Zach Brown <zab@versity.com>
2020-08-26 14:39:12 -07:00
Zach Brown
85142dcadf scoutfs: use radix allocator
Convert metadata block and file data extent allocations to use the radix
allocator.

Most of this is simple transitions between types and calls.  The server
no longer has to initialize blocks because mkfs can write a single
radix parent block with fully set parent refs to initialize a full
radix.  We remove the code and fields that were responsible for adding
uninitialized data and metadata.

The rest of the unused block allocator code is only ifdefed out.  It'll
be removed in a separate patch to reduce noise here.

Signed-off-by: Zach Brown <zab@versity.com>
2020-02-25 12:03:46 -08:00
Zach Brown
43d416003a scoutfs: add scoutfs_btree_force
Add a btree_update variant which will insert the item if a previous one
wasn't found instead of returning -ENOENT.  This saves callers from
having to look up before updating to discover whether they should call
_create or _update.
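The upsert pattern can be sketched against a toy in-memory structure; all of the names here (`toy_tree`, `toy_update`, `toy_create`, `toy_force`) are illustrative stand-ins, not the scoutfs API, but `toy_force` shows the exact fallback the commit describes.

```c
#include <errno.h>

struct item { int key, val, used; };
struct toy_tree { struct item items[8]; };

static struct item *toy_find(struct toy_tree *t, int key)
{
    for (int i = 0; i < 8; i++)
        if (t->items[i].used && t->items[i].key == key)
            return &t->items[i];
    return 0;
}

static int toy_update(struct toy_tree *t, int key, int val)
{
    struct item *it = toy_find(t, key);

    if (!it)
        return -ENOENT;
    it->val = val;
    return 0;
}

static int toy_create(struct toy_tree *t, int key, int val)
{
    for (int i = 0; i < 8; i++)
        if (!t->items[i].used) {
            t->items[i] = (struct item){ key, val, 1 };
            return 0;
        }
    return -ENOSPC;
}

/* the _force pattern: update, falling back to create on -ENOENT */
static int toy_force(struct toy_tree *t, int key, int val)
{
    int ret = toy_update(t, key, val);

    if (ret == -ENOENT)
        ret = toy_create(t, key, val);
    return ret;
}
```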

Signed-off-by: Zach Brown <zab@versity.com>
2020-01-17 11:21:36 -08:00
Zach Brown
8775826d7e scoutfs: have btree use blocks, allocator, writer
Convert the btree to use our block cache, block allocation, and the
caller's explicit dirty block tracking writer context instead of the
ring.  This is in preparation for the btree forest format where there
are concurrent multiple writers of independent dynamically sized btrees
instead of only the server writing one btree with a fixed maximum size.

All the machinery for tracking dirty blocks in the ring and migrating is
no longer needed.

Signed-off-by: Zach Brown <zab@versity.com>
2020-01-17 11:21:36 -08:00
Zach Brown
3eaabe81de scoutfs: add btree stored in persistent ring
Add a cow btree whose blocks are stored in a persistently allocated
ring.  This will let us incrementally index very large data sets
efficiently.

This is an adaptation of the previous btree code which now uses the
ring, stores variable length keys, and augments the items with bits that
ored up through parents.

Signed-off-by: Zach Brown <zab@versity.com>
2017-07-08 10:59:40 -07:00
Zach Brown
97cb75bd88 Remove dead btree, block, and buddy code
Remove all the unused dead code from the previous btree block design.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:55 -07:00
Zach Brown
af5955e95a Add found_seq argument to scoutfs_btree_prev
Add a *found_seq argument to _prev so that it can give the caller the
seq of the item that's returned.  The extent code is going to use this
to find seqs of extents.

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2016-11-16 14:45:08 -08:00
Zach Brown
37bc86b558 Add check_size_lte
Add a _lte val boolean so that -EOVERFLOW is returned if the item's
value is larger than the value vector.
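A hedged sketch of the check, with a hypothetical `copy_val` helper standing in for the btree's value copy path: when the caller asks for the size check and the stored value won't fit, fail with -EOVERFLOW instead of silently truncating.

```c
#include <errno.h>
#include <string.h>

/* Copy an item value into the caller's buffer, optionally failing
 * with -EOVERFLOW when the value is larger than the buffer.
 * Returns the number of bytes copied on success. */
static int copy_val(void *dst, size_t dst_len,
                    const void *val, size_t val_len, int size_lte)
{
    size_t n;

    if (size_lte && val_len > dst_len)
        return -EOVERFLOW;

    n = val_len < dst_len ? val_len : dst_len;
    memcpy(dst, val, n);
    return (int)n;
}
```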

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2016-11-16 14:45:08 -08:00
Zach Brown
a77f88386c Add scoutfs_btree_prev()
We haven't yet had a pressing need for a search for the previous item
before a given key value.  File extent items offer the first strong
candidate.  We'd like to naturally store the start of the extent
in the key, so to find an extent that overlaps a block we'd like to
find the previous key before the search block offset.

The _prev search is easy enough to implement.  We have to update tree
walking to update the prev key and update leaf block processing to find
the correct item position after the binary search.
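The leaf-block half of that search reduces to finding the last item at or below the search key after a binary search.  A minimal stand-alone sketch (`item_prev` and the flat int keys are illustrative, not the leaf format):

```c
/* Return the position of the last item with key <= target in a
 * sorted leaf, or -1 when every item is greater.  The binary
 * search finds the first key > target; the answer is just before it. */
static int item_prev(const int *keys, int nr, int target)
{
    int lo = 0, hi = nr;

    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;

        if (keys[mid] <= target)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo - 1;
}
```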

Signed-off-by: Zach Brown <zab@versity.com>
2016-11-08 16:05:37 -08:00
Mark Fasheh
2fc1b99698 scoutfs: replace some open coded corruption checks
We can trivially do the simple check of value length against what the caller
expects in btree.c.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
Signed-off-by: Zach Brown <zab@versity.com>
2016-10-27 17:25:05 -05:00
Zach Brown
84f23296fd scoutfs: remove btree cursor
The btree cursor was built to address two problems.  First it
accelerates iteration by avoiding full descents down the tree by holding
on to leaf blocks.  Second it lets callers reference item value contents
directly to avoid copies.

But it also has serious complexity costs.  It pushes refcounting and
locking out to the caller.  There have already been a few bugs where
callers did things while holding the cursor without realizing that
they're holding a btree lock and can't perform certain btree operations
or even copies to user space.

Future changes to the allocator to use the btree motivate cleaning up
the tree locking, which is complicated by the cursor being a stand-alone
lock reference.  Instead of continuing to layer complexity onto this
construct, let's remove it.

The iteration acceleration will be addressed the same way we're going to
accelerate the other btree operations: with per-cpu cached leaf block
references.  Unlike the cursor this doesn't push interface changes out
to callers who want repeated btree calls to perform well.

We'll leave the value copying for now.  If it becomes an issue we can
add variants that call a function to operate on the value.  Let's hope
we don't have to go there.

This change replaces the cursor with a vector to memory that the value
should be copied to and from.  The vector has a fixed number of elements
and is wrapped in a struct for easy declaration and initialization.

This change to the interface looks noisy but each caller's change is
pretty mechanical.  They tend to involve:

 - replace the cursor with the value struct and initialization
 - allocate some memory to copy the value in to
 - reading functions return the number of value bytes copied
 - verify copied bytes makes sense for item being read
 - getting rid of confusing ((ret = _next())) looping
 - _next now returns -ENOENT instead of 0 for no next item
 - _next iterators now need to increase the key themselves
 - make sure to free allocated mem

Sometimes the order of operations changes significantly.  Now that we
can't modify in place we need to read, modify, write.  This looks like
changing a modification of the item through the cursor to a
lookup/update pattern.

The symlink item iterators didn't need to use next because they walk a
contiguous set of keys.  They're changed to use simple insert or lookup.
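A minimal sketch of that value vector, with hypothetical names: a fixed number of (pointer, length) elements wrapped in a struct, where readers copy the stored value out across the elements and check the returned byte count, rather than referencing the value in place through a cursor.

```c
#include <string.h>

/* Fixed-size value vector (toy version with two elements). */
struct val_vec {
    void *ptr[2];
    size_t len[2];
    int nr;
};

/* Copy a stored value out across the vector's elements; returns the
 * number of bytes copied, which callers verify against the item. */
static size_t copy_to_vec(struct val_vec *v, const void *val, size_t len)
{
    const char *src = val;
    size_t copied = 0;

    for (int i = 0; i < v->nr && copied < len; i++) {
        size_t n = len - copied < v->len[i] ? len - copied : v->len[i];

        memcpy(v->ptr[i], src + copied, n);
        copied += n;
    }
    return copied;
}
```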

Signed-off-by: Zach Brown <zab@versity.com>
2016-09-21 10:04:07 -07:00
Zach Brown
2bed78c269 scoutfs: specify btree root
The btree functions currently don't take a specific root argument.  They
assume, deep down in btree_walk, that there's only one btree in the
system.  We're going to be adding a few more to support richer
allocation.

To prepare for this we have the btree functions take an explicit btree
argument.  This should make no functional difference.

Signed-off-by: Zach Brown <zab@versity.com>
2016-09-21 10:04:07 -07:00
Zach Brown
198ec2ed5b scoutfs: have btree_update return errors
We can certainly have btree update callers that haven't yet dirtied the
blocks but who can deal with errors.  So make it return errors and have
its only current caller freak out if it fails.  This will let the file
data block mapping code attempt to get a dirty item without first
dirtying.

Signed-off-by: Zach Brown <zab@versity.com>
2016-08-09 17:03:30 -07:00
Zach Brown
1fde47170b scoutfs: simplify btree block format
Now that we are using fixed smaller blocks we can make the btree format
significantly simpler.  The fixed small block size limits the number of
items that will be stored in each block.  We can use a simple sorted
array of item offsets to maintain the item sort order instead of
the treap.

Getting rid of the treap not only removes a bunch of code, it makes
tasks like verifying or repairing a btree block a lot simpler.

The main impact on the code is that an item no longer records its
position in the sort order.  Users of sorted item position now need to
track an item's sorted position instead of just the item.
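The offset-array scheme is simple enough to sketch in a few lines.  The struct layout here is illustrative, not the on-disk format: items live anywhere in the block, and a small sorted array of offsets records the key order, so an insert is one memmove instead of a treap rebalance.

```c
#include <string.h>

/* Toy leaf: key_at[] stands in for item storage at each offset. */
struct leaf {
    int nr;
    unsigned short off[32];   /* offsets sorted by the items' keys */
    int key_at[64];           /* key stored at each item offset */
};

static void insert_item(struct leaf *l, unsigned short off, int key)
{
    int pos = 0;

    /* find the insertion position in key order */
    while (pos < l->nr && l->key_at[l->off[pos]] < key)
        pos++;

    /* a single memmove keeps the offset array sorted */
    memmove(&l->off[pos + 1], &l->off[pos],
            (l->nr - pos) * sizeof(l->off[0]));
    l->off[pos] = off;
    l->key_at[off] = key;
    l->nr++;
}
```

Verifying a block is equally direct: walk the offset array and check that the keys it references are ascending.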

Signed-off-by: Zach Brown <zab@versity.com>
2016-08-02 13:28:08 -07:00
Zach Brown
7b18bce2e2 scoutfs: use buffer heads
Now that we have a fixed small block size we don't need our own code for
tracking contiguous memory for blocks that are larger than the page
size.  We can use buffer heads which support block sizes smaller than
the page size.

Our block API remains to enforce transactions, checksumming, cow, and
eventually invalidating and retrying reads of stale blocks.

We set the logical blocksize of the bdev buffer cache to our fixed block
size.  We use a private bh state bit to indicate that the contents of a
block have had their checksum verified.  We use a small structure stored
at b_private to track dirty blocks so that we can control when they're
written.

The btree block traversal code uses the buffer_head lock to serialize
access to btree block contents now that the block rwsem has gone
away.  This isn't great but works for now.

Not being able to relocate blocks in the buffer cache (really fragments
of pages in the bdev page cache.. blkno determines memory location)
means that the cow path always has to copy.

Callers are easily translated: use struct buffer_head instead of
scoutfs_block and use a little helper instead of dereferencing ->data
directly.

I took the opportunity to clean up some of the inconsistent block
function names.  Now more of the functions follow the scoutfs_block_*()
pattern.

Signed-off-by: Zach Brown <zab@versity.com>
2016-08-02 13:28:08 -07:00
Zach Brown
b51511466a scoutfs: add inodes_since ioctl
Add the ioctl that lets us find out about inodes that have changed
since a given sequence number.

A sequence number is added to the btree items so that we can track the
tree update that it last changed in.  We update this as we modify
items and maintain it across item copying for splits and merges.

The big change is using the parent item ref and item sequence numbers
to guide iteration over items in the tree.  The easier change is to have
the current iteration skip over items whose sequence number is too old.

The more subtle change has to do with how iteration is terminated.  The
current termination could stop when it doesn't find an item because that
could only happen at the final leaf.  When we're ignoring items with old
seqs this can happen at the end of any leaf.  So we change iteration to
keep advancing through leaf blocks until it crosses the last key value.

We add an argument to btree walking which communicates the next key that
can be used to continue iterating from the next leaf block.  This works
for the normal walk case as well as the seq walking case where walking
terminates prematurely in an interior node full of parent items with old
seqs.

Now that we're more robustly advancing iteration with btree walk calls
and the next key, we can get rid of the 'next_leaf' hack which was
trying to do the same thing inside the btree walk code.  It wasn't right
for the seq walking case and was pretty fiddly.

The next_key increment could wrap the maximal key at the right spine of
the tree so we have _inc saturate instead of wrap.
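The saturating increment is a one-liner.  This sketch uses a plain 64-bit value where the real _inc works on a multi-field key, but the behavior is the same: the maximal key stays maximal instead of wrapping to zero, so iteration off the right spine terminates.

```c
#include <limits.h>

/* Increment a key, saturating at the maximum instead of wrapping. */
static unsigned long long key_inc(unsigned long long k)
{
    return k == ULLONG_MAX ? k : k + 1;
}
```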

And finally, we want these inode scans to not have to skip over all the
other items associated with each inode as it walks looking for inodes
with the given sequence number.  We change the item sort order to first
sort by type instead of by inode.  We've wanted this more generally to
isolate item types that have different access patterns.

Signed-off-by: Zach Brown <zab@versity.com>
2016-07-05 14:46:20 -07:00
Zach Brown
a64ca8018a scoutfs: add scoutfs_btree_hole() for finding keys
Directory entries found a hole in the key range between the first and
last possible hash value for a new entry's key.  The xattrs want
to do the same thing so let's extract this into a proper function.
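A toy stand-in for the hole search; `find_hole` and the flat key array are illustrative only, and the real function searches btree items rather than scanning every candidate, but the contract is the same: return a key in the caller's range that no existing item uses, or fail when the range is full.

```c
#include <errno.h>

/* Find a free key in [first, last], or -ENOSPC when none exists. */
static long long find_hole(const int *used, int nr, int first, int last)
{
    for (int k = first; k <= last; k++) {
        int taken = 0;

        for (int i = 0; i < nr; i++)
            if (used[i] == k) {
                taken = 1;
                break;
            }
        if (!taken)
            return k;
    }
    return -ENOSPC;
}
```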

Signed-off-by: Zach Brown <zab@versity.com>
2016-07-04 10:45:17 -07:00
Zach Brown
5651d48c18 scoutfs: add core btree functionality
Previously we had stubbed out the btree item API with static inlines.
Those are replaced with real functions in a reasonably functional btree
implementation.

The btree implementation itself is pretty straightforward.  Operations
are performed top-down and we dirty, lock, and split/merge blocks as we
go.  Callers are given a cursor giving them full access to the item.
Items in the btree blocks are stored in a treap.  There are a lot of
comments in the code to help make things clear.

We add the notion of block references and some block functions for
reading and dirtying blocks by reference.

This passes tests up to the point where unmount tries to write out data
and the world catches fire.  That's far enough to commit what we have
and iterate from there.

Signed-off-by: Zach Brown <zab@versity.com>
2016-04-12 19:33:09 -07:00
Zach Brown
5369fa1e05 scoutfs: first step towards multiple btrees
Starting to implement LSM merging made me really question if it is the
right approach.  I'd like to try an experiment to see if we can get our
concurrent writes done with much simpler btrees.

This commit removes all the functionality that derives from the large
LSM segments and distributing the manifest.

What's left is a multi-page block layer and the husk of the btree
implementation which will give people access to items.  Callers that
work with items get translated to the btree interface.

This gets as far as reading the super block but the format changes and
large block size mean that the crc check fails and the mount returns an
error.

Signed-off-by: Zach Brown <zab@versity.com>
2016-04-11 11:35:37 -07:00