The btree_merge code was pinning leaf blocks for all input btrees as it
iterated over them. This doesn't work when there are a very large
number of input btrees. It can run out of memory trying to hold a
reference to a 64KiB leaf block for each input root.
This reworks the btree merging code. It reads a window of blocks from
all input trees to get a set of merged items. It can take multiple
passes to complete the merge but by setting the merge window large
enough this overhead is reduced. Merging now consumes a fixed amount of
memory rather than using memory proportional to the number of input
btrees.
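A minimal sketch of the windowed loop; merge_ctx, read_window() and
insert_merged() are hypothetical stand-ins for the real interfaces:

    #include <stdbool.h>

    struct merge_ctx;                         /* hypothetical merge state */
    int read_window(struct merge_ctx *ctx);   /* load a bounded item window */
    int insert_merged(struct merge_ctx *ctx); /* write the merged window out */
    bool inputs_drained(struct merge_ctx *ctx);

    int btree_merge_windowed(struct merge_ctx *ctx)
    {
        int ret;

        do {
            /* memory is bounded by the window, not the input count */
            ret = read_window(ctx);
            if (ret == 0)
                ret = insert_merged(ctx);
        } while (ret == 0 && !inputs_drained(ctx));

        return ret;
    }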
Signed-off-by: Zach Brown <zab@versity.com>
After we've merged a log btree back into the main fs tree we kick off
work to free all its blocks. This would fully fill the transaction's
free blocks list before stopping to apply the commit.
Consuming the entire free list makes it hard to have concurrent holders
of a commit who also want to free things. This changes the log btree
block freeing to limit itself to a fraction of the budget that each
holder gets. That coarse limit avoids having to precisely account for
the allocations and frees made while modifying the freeing item, while
still freeing many blocks per commit.
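A rough sketch of the coarse limit; the names and the fraction are
illustrative assumptions, not the real accounting:

    /* each holder of a commit gets a budget of free list entries (assumed) */
    #define HOLDER_FREE_BUDGET  4096
    /* log btree block freeing stops after a fraction of that budget */
    #define LOG_FREE_FRACTION   4

    static inline unsigned int log_free_limit(void)
    {
        return HOLDER_FREE_BUDGET / LOG_FREE_FRACTION;
    }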
Signed-off-by: Zach Brown <zab@versity.com>
The fs log btrees have values that start with a header that stores the
item's seq and flags. There's a lot of sketchy code that manipulates
the value header as items are passed around.
This adds the seq and flags as core item fields in the btree. They're
only set by the interfaces that are used to store fs items: _insert_list
and _merge. The rest of the btree items that use the main interface
don't work with the fields.
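An illustrative before/after of the layout; the field names here are
assumptions based on the description:

    #include <stdint.h>

    /* before: every fs item value carried its own header */
    struct old_val_header {
        uint64_t seq;
        uint8_t  flags;
        /* value payload followed and had to be skipped everywhere */
    };

    /* after: seq and flags live in the item itself */
    struct btree_item_fields {
        uint64_t seq;   /* only set by _insert_list and _merge */
        uint8_t  flags;
        /* values are now pure payload */
    };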
This was done to help delta items discover when logged items have been
merged before the finalized log btrees are deleted, and the code ends up
being quite a bit cleaner.
Signed-off-by: Zach Brown <zab@versity.com>
Add a btree function for freeing all the blocks in a btree without
having to cow the blocks to track which refs have been freed. We use a
key from the caller to track which portions of the tree have been freed.
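A plausible shape for the call, with the caller's key acting as a
resumable cursor; the signature is an assumption:

    struct btree_root;
    struct btree_key;

    /*
     * Free blocks in the tree at and after *key, advancing *key as
     * progress is made so the caller can resume across transactions
     * without cowing blocks to record which refs were freed.
     */
    int btree_free_blocks(struct btree_root *root, struct btree_key *key,
                          unsigned int block_limit);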
Signed-off-by: Zach Brown <zab@versity.com>
Add a btree call to just dirty a leaf block, joining and splitting
along the way so that the blocks in the path satisfy the balance
constraints.
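An assumed sketch of the shape: descend to the leaf for the key,
splitting and joining on the way down, and return with the leaf dirty:

    struct btree_root;
    struct btree_key;

    int btree_dirty(struct btree_root *root, struct btree_key *key);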
Signed-off-by: Zach Brown <zab@versity.com>
Add a btree function for merging the items in a range from a number of
read-only input btrees into a destination btree.
Signed-off-by: Zach Brown <zab@versity.com>
Add calls for working with subtrees built around references to blocks in
the last level of parents. This will let the server farm out btree
merging work where concurrency is built around safely working with all
the items and leaves that fall under a given parent block.
Signed-off-by: Zach Brown <zab@versity.com>
Add a btree helper for finding the range of keys covered by the leaves
referenced by the last parent block when searching for a given key.
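The helper might look something like this; the name and arguments are
assumptions:

    struct btree_root;
    struct btree_key;

    /* walk to key and report the key range covered by the leaves
     * referenced from the final parent block on the path */
    int btree_parent_range(struct btree_root *root,
                           struct btree_key *key,
                           struct btree_key *start,
                           struct btree_key *end);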
Signed-off-by: Zach Brown <zab@versity.com>
Previously we'd avoided full extents in file data mapping items because
we were deleting items from forest btrees directly. That created
deletion items for every version of file extents as they were modified.
Now we have the item cache which can remove deleted items from memory
when deletion items aren't necessary.
By layering file data extents on an extent layer, we can also transition
allocators to use extents and fix a lot of problems in the radix block
allocator.
Most of this change is churn from changing allocator function and struct
names.
File data extents no longer have to manage loading and storing from and
to packed extent items at a fixed granularity. All those loops are torn
out and data operations now call the extent layer with their callbacks
instead of calling its packed item extent functions. This now means
that fallocate and especially restoring offline extents can use larger
extents. Small file block allocation now comes from a cached extent
which reduces item calls for small file data streaming writes.
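A sketch of the callback-based shape described above, with illustrative
names; data operations hand the extent core a table like this instead of
looping over packed items themselves:

    #include <stdint.h>

    struct extent {
        uint64_t start;
        uint64_t len;
    };

    /* assumed callback table given to the extent layer by data ops */
    struct extent_ops {
        int (*next)(void *arg, uint64_t start, struct extent *ext);
        int (*insert)(void *arg, struct extent *ext);
        int (*remove)(void *arg, struct extent *ext);
    };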
The big change in the server is to use more root structures to manage
recursive modification instead of relying on the allocator to notice and
do the right thing. The radix allocator tried to notice when it was
actively operating on a root that it was also using to allocate and free
metadata blocks. This resulted in a lot of bugs. Instead we now double
buffer the server's avail and freed roots so that the server fills and
drains the stable roots from the previous transaction. We also double
buffer the core fs metadata avail root so that we can increase the time
to reuse freed metadata blocks.
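A sketch of the double buffering, with assumed names: the server fills
and drains only the stable roots committed by the previous transaction
while the current transaction dirties the other copies:

    #include <stdint.h>

    struct alloc_root { uint64_t blkno; };  /* placeholder root ref */

    struct server_alloc {
        struct alloc_root avail[2];
        struct alloc_root freed[2];
        int stable;     /* index of the previous transaction's roots */
    };

    /* at commit the roots being modified become the next stable copies */
    static void promote_stable(struct server_alloc *sa)
    {
        sa->stable ^= 1;
    }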
The server now only moves free extents into client allocators when they
fall below a low threshold. This reduces the shared modification of the
client's allocator roots which requires cold block reads on both the
client and server.
Signed-off-by: Zach Brown <zab@versity.com>
Add btree calls to invoke a callback for all items in a leaf, and to
insert a list of items into their leaf blocks. These will be used by
the item cache to populate the cache and to write dirty items into dirty
btree blocks.
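Assumed shapes for the two calls; the names and arguments are
illustrative:

    struct btree_root;
    struct btree_key;
    struct item_list;

    typedef int (*item_cb_t)(struct btree_key *key, void *val,
                             unsigned int val_len, void *arg);

    /* call cb for every item in the leaf that contains key */
    int btree_read_items(struct btree_root *root, struct btree_key *key,
                         item_cb_t cb, void *arg);

    /* insert a sorted list of items into their dirty leaf blocks */
    int btree_insert_list(struct btree_root *root, struct item_list *items);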
Signed-off-by: Zach Brown <zab@versity.com>
There are no users of these variants of _prev and _next so they can be
removed. Support for them was also dropped in the previous reworking of
the internal structure of the btree blocks.
Signed-off-by: Zach Brown <zab@versity.com>
The btree currently uses variable length big-endian buffers that are
compared with memcmp() as keys. This is a historical relic of the time
when keys could be very large. We had dirent keys that included the
name and manifest entries that included those fs keys.
But now all the btree callers are jumping through hoops to translate
their fs keys into big-endian btree keys. And the memcmp() of the
keys is showing up in profiles.
This makes the btree take native scoutfs_key structs as its key. The
forest callers which are working with fs keys can just pass their keys
straight through. The server btree callers with their private btrees
get key fields defined for their use instead of having individual
big-endian key structs.
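The effect on comparison, sketched with illustrative field names: a
field-by-field compare of native structs replaces memcmp() on big-endian
buffers:

    #include <stdint.h>

    struct scoutfs_key_sketch {     /* fields are illustrative */
        uint64_t zone;
        uint64_t type;
        uint64_t first;
        uint64_t second;
    };

    static int key_cmp(const struct scoutfs_key_sketch *a,
                       const struct scoutfs_key_sketch *b)
    {
        /* compare the most significant fields first */
        if (a->zone != b->zone)
            return a->zone < b->zone ? -1 : 1;
        if (a->type != b->type)
            return a->type < b->type ? -1 : 1;
        if (a->first != b->first)
            return a->first < b->first ? -1 : 1;
        if (a->second != b->second)
            return a->second < b->second ? -1 : 1;
        return 0;
    }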
A nice side-effect of this is that splitting parents doesn't have to
assume that a maximal key will be inserted by a child split. We can
have more keys in parents and wider trees.
Signed-off-by: Zach Brown <zab@versity.com>
Convert metadata block and file data extent allocations to use the radix
allocator.
Most of this is simple transitions between types and calls. The server
no longer has to initialize blocks because mkfs can write a single
radix parent block with fully set parent refs to initialize a full
radix. We remove the code and fields that were responsible for adding
uninitialized data and metadata.
The rest of the unused block allocator code is only ifdefed out. It'll
be removed in a separate patch to reduce noise here.
Signed-off-by: Zach Brown <zab@versity.com>
Add a btree_update variant which will insert the item if a previous
item wasn't found, instead of returning -ENOENT. This saves callers from
having to do a lookup before updating to discover if they should call _create
or _update.
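The variant might look like this; the name and signature are
assumptions:

    struct btree_root;
    struct btree_key;
    struct btree_val;

    /* update the item at key, inserting it if it doesn't exist yet */
    int btree_force(struct btree_root *root, struct btree_key *key,
                    struct btree_val *val);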
Signed-off-by: Zach Brown <zab@versity.com>
Convert the btree to use our block cache, block allocation, and the
caller's explicit dirty block tracking writer context instead of the
ring. This is in preparation for the btree forest format where there
are concurrent multiple writers of independent dynamically sized btrees
instead of only the server writing one btree with a fixed maximum size.
All the machinery for tracking dirty blocks in the ring and migrating is
no longer needed.
Signed-off-by: Zach Brown <zab@versity.com>
Add a cow btree whose blocks are stored in a persistently allocated
ring. This will let us incrementally index very large data sets
efficiently.
This is an adaptation of the previous btree code which now uses the
ring, stores variable length keys, and augments the items with bits
that are ORed up through the parents.
Signed-off-by: Zach Brown <zab@versity.com>
Add a *found_seq argument to _prev so that it can give the caller the
seq of the item that's returned. The extent code is going to use this
to find seqs of extents.
Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
Add a _lte val boolean so that -EOVERFLOW is returned if the item's
value is larger than the value vector.
Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
We haven't yet had a pressing need for a search for the previous item
before a given key value. File extent items offer the first strong
candidate. We'd like to naturally store the start of the extent in the
key, so finding an extent that overlaps a block means finding the
previous key before the search block offset.
The _prev search is easy enough to implement. We have to teach tree
walking to maintain the prev key and leaf block processing to find the
correct item position after the binary search.
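A sketch of the leaf positioning, assuming _prev returns the last item
at or before the key and that the binary search reports the insertion
position when the key is missing:

    #include <stdbool.h>

    /*
     * pos is the binary search result: the matching item's position,
     * or the position the key would be inserted at. Returns the
     * position of the previous item, or -1 if the leaf has no item at
     * or before the key.
     */
    static int prev_item_pos(int pos, bool found)
    {
        if (found)
            return pos;     /* an exact match is the prev item */
        return pos - 1;     /* step back from the insertion point */
    }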
Signed-off-by: Zach Brown <zab@versity.com>
We can trivially check the value length against what the caller expects
in btree.c.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
Signed-off-by: Zach Brown <zab@versity.com>
The btree cursor was built to address two problems. First it
accelerates iteration by avoiding full descents down the tree by holding
on to leaf blocks. Second it lets callers reference item value contents
directly to avoid copies.
But it also has serious complexity costs. It pushes refcounting and
locking out to the caller. There have already been a few bugs where
callers did things while holding the cursor without realizing that they
were holding a btree lock and couldn't perform certain btree operations
or even copies to user space.
Future changes to the allocator to use the btree motivate cleaning up
the tree locking, which is complicated by the cursor being a standalone
lock reference. Instead of continuing to layer complexity onto this
construct let's remove it.
The iteration acceleration will be addressed the same way we're going to
accelerate the other btree operations: with per-cpu cached leaf block
references. Unlike the cursor this doesn't push interface changes out
to callers who want repeated btree calls to perform well.
We'll leave the value copying for now. If it becomes an issue we can
add variants that call a function to operate on the value. Let's hope
we don't have to go there.
This change replaces the cursor with a vector describing the memory
that the value should be copied to and from. The vector has a fixed
number of elements and is wrapped in a struct for easy declaration and
initialization.
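A sketch of the value struct, with assumed names and a fixed element
count:

    #include <stddef.h>

    #define BTREE_VAL_NR 2      /* fixed number of vector elements */

    struct btree_val_vec {
        void   *ptr;
        size_t  len;
    };

    struct btree_val {
        struct btree_val_vec vec[BTREE_VAL_NR];
    };

    /* easy declaration and initialization at the call site */
    #define INIT_BTREE_VAL(p, l) \
            { .vec = { { .ptr = (p), .len = (l) } } }

A caller might then declare "struct btree_val val = INIT_BTREE_VAL(buf,
sizeof(buf));" and check the returned count of copied value bytes.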
This change to the interface looks noisy but each caller's change is
pretty mechanical. They tend to involve:
- replace the cursor with the value struct and initialization
- allocate some memory to copy the value in to
- reading functions return the number of value bytes copied
- verify that the copied bytes make sense for the item being read
- getting rid of confusing ((ret = _next())) looping
- _next now returns -ENOENT instead of 0 for no next item
- _next iterators now need to increase the key themselves
- make sure to free allocated mem
Sometimes the order of operations changes significantly. Now that we
can't modify in place we need to read, modify, write. This looks like
changing a modification of the item through the cursor to a
lookup/update pattern.
The symlink item iterators didn't need to use next because they walk a
contiguous set of keys. They're changed to use simple insert or lookup.
Signed-off-by: Zach Brown <zab@versity.com>
The btree functions currently don't take a specific root argument. They
assume, deep down in btree_walk, that there's only one btree in the
system. We're going to be adding a few more to support richer
allocation.
To prepare for this we have the btree functions take an explicit btree
root argument. This should make no functional difference.
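Sketched with assumed names: every entry point now names the tree it
operates on instead of btree_walk finding the single global root itself:

    struct btree_root;
    struct btree_key;
    struct btree_val;

    /* the root argument is new; the rest of the signature is unchanged */
    int btree_lookup(struct btree_root *root, struct btree_key *key,
                     struct btree_val *val);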
Signed-off-by: Zach Brown <zab@versity.com>
We can certainly have btree update callers that haven't yet dirtied the
blocks but who can deal with errors. So make it return errors and have
its only current caller freak out if it fails. This will let the file
data block mapping code attempt to get a dirty item without first
dirtying.
Signed-off-by: Zach Brown <zab@versity.com>
Now that we are using fixed smaller blocks we can make the btree format
significantly simpler. The fixed small block size limits the number of
items that will be stored in each block. We can use a simple sorted
array of item offsets to maintain the item sort order instead of
the treap.
Getting rid of the treap not only removes a bunch of code, it makes
tasks like verifying or repairing a btree block a lot simpler.
The main impact on the code is that an item no longer records its
position in the sort order. Users of sorted item position now need to
track an item's sorted position instead of just the item.
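A sketch of the resulting lookup, with assumed names: a binary search
over the sorted offset array that returns either the matching position
or the insertion position that keeps the array sorted:

    #include <stdbool.h>
    #include <stdint.h>

    struct btree_key;

    struct btree_block {
        uint16_t nr_items;
        uint16_t item_off[];    /* sorted by the items' keys */
    };

    /* compare the key of the item at off with key (assumed helper) */
    int cmp_item_key(struct btree_block *bt, uint16_t off,
                     const struct btree_key *key);

    static int find_pos(struct btree_block *bt,
                        const struct btree_key *key, bool *found)
    {
        int lo = 0, hi = bt->nr_items - 1, mid, cmp;

        *found = false;
        while (lo <= hi) {
            mid = lo + (hi - lo) / 2;
            cmp = cmp_item_key(bt, bt->item_off[mid], key);
            if (cmp == 0) {
                *found = true;
                return mid;
            }
            if (cmp < 0)
                lo = mid + 1;
            else
                hi = mid - 1;
        }
        return lo;      /* insertion position */
    }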
Signed-off-by: Zach Brown <zab@versity.com>
Now that we have a fixed small block size we don't need our own code for
tracking contiguous memory for blocks that are larger than the page
size. We can use buffer heads which support block sizes smaller than
the page size.
Our block API remains to enforce transactions, checksumming, cow, and
eventually invalidating and retrying reads of stale blocks.
We set the logical blocksize of the bdev buffer cache to our fixed block
size. We use a private bh state bit to indicate that the contents of a
block have had their checksum verified. We use a small structure stored
at b_private to track dirty blocks so that we can control when they're
written.
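The private bit follows the kernel's BUFFER_FNS convention; the names
here are assumptions:

    #include <linux/buffer_head.h>
    #include <linux/list.h>

    /* private bh state bit recording a verified checksum (assumed name) */
    enum {
        BH_ScoutfsVerified = BH_PrivateStart,
    };
    BUFFER_FNS(ScoutfsVerified, scoutfs_verified)

    /* hung off b_private so we control when dirty blocks are written */
    struct block_private {
        struct list_head dirty_entry;   /* on the writer's dirty list */
    };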
The btree block traversal code uses the buffer_head lock to serialize
access to btree block contents now that the block rwsem has gone
away. This isn't great but works for now.
Not being able to relocate blocks in the buffer cache (really fragments
of pages in the bdev page cache; the blkno determines the memory
location) means that the cow path always has to copy.
Callers are easily translated: use struct buffer_head instead of
scoutfs_block and use a little helper instead of dereferencing ->data
directly.
I took the opportunity to clean up some of the inconsistent block
function names. Now more of the functions follow the scoutfs_block_*()
pattern.
Signed-off-by: Zach Brown <zab@versity.com>
Add the ioctl that lets us find out about inodes that have changed
since a given sequence number.
A sequence number is added to the btree items so that we can track the
tree update in which each item last changed. We update this as we modify
items and maintain it across item copying for splits and merges.
The big change is using the parent item ref and item sequence numbers
to guide iteration over items in the tree. The easier change is to have
the current iteration skip over items whose sequence number is too old.
The more subtle change has to do with how iteration is terminated. The
current termination could stop when it doesn't find an item because that
could only happen at the final leaf. When we're ignoring items with old
seqs this can happen at the end of any leaf. So we change iteration to
keep advancing through leaf blocks until it crosses the last key value.
We add an argument to btree walking which communicates the next key that
can be used to continue iterating from the next leaf block. This works
for the normal walk case as well as the seq walking case where walking
terminates prematurely in an interior node full of parent items with old
seqs.
Now that we're more robustly advancing iteration with btree walk calls
and the next key we can get rid of the 'next_leaf' hack which was trying
to do the same thing inside the btree walk code. It wasn't right for
the seq walking case and was pretty fiddly.
The next_key increment could wrap past the maximal key at the right
spine of the tree, so we have _inc saturate instead of wrap.
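A sketch of the saturating increment over the big-endian key buffers
used at this point; the name is assumed:

    #include <stdint.h>
    #include <string.h>

    /* increment a big-endian key, saturating at all-ones instead of
     * wrapping around to zero */
    static void key_inc_sat(uint8_t *key, int len)
    {
        int i;

        for (i = len - 1; i >= 0; i--) {
            if (++key[i] != 0)
                return;         /* no carry out of this byte */
        }
        /* carried off the top: restore the maximal key */
        memset(key, 0xff, len);
    }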
And finally, we want these inode scans to not have to skip over all the
other items associated with each inode as it walks looking for inodes
with the given sequence number. We change the item sort order to first
sort by type instead of by inode. We've wanted this more generally to
isolate item types that have different access patterns.
Signed-off-by: Zach Brown <zab@versity.com>
Directory entries found a hole in the key range between the first and
last possible hash value for a new entry's key. The xattrs want
to do the same thing so let's extract this into a proper function.
Signed-off-by: Zach Brown <zab@versity.com>
Previously we had stubbed out the btree item API with static inlines.
Those are replaced with real functions in a reasonably functional btree
implementation.
The btree implementation itself is pretty straightforward. Operations
are performed top-down and we dirty, lock, and split/merge blocks as we
go. Callers are given a cursor to give them full access to the item.
Items in the btree blocks are stored in a treap. There are a lot of
comments in the code to help make things clear.
We add the notion of block references and some block functions for
reading and dirtying blocks by reference.
This passes tests up to the point where unmount tries to write out data
and the world catches fire. That's far enough to commit what we have
and iterate from there.
Signed-off-by: Zach Brown <zab@versity.com>
Starting to implement LSM merging made me really question if it is the
right approach. I'd like to try an experiment to see if we can get our
concurrent writes done with much simpler btrees.
This commit removes all the functionality that derives from the large
LSM segments and distributing the manifest.
What's left is a multi-page block layer and the husk of the btree
implementation which will give people access to items. Callers that
work with items get translated to the btree interface.
This gets as far as reading the super block but the format changes and
large block size mean that the crc check fails and the mount returns an
error.
Signed-off-by: Zach Brown <zab@versity.com>