The first pass at compaction just kicked a thread any time we added a
segment that brought its level's count over the limit. Tasks could
create dirty items and write level0 segments regardless of the progress
of compaction.
This ties the writing rate to compaction. Writers have to wait to hold
a transaction until the dirty item count fits in a segment and there
are no level0 segments. Usually more level0 segments would be allowed,
but we're aggressively pushing compaction for now; we'll relax this later.
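As a rough sketch of what the hold-side check could look like (the
function and field names here are illustrative, not the actual scoutfs
symbols):

    /*
     * Hedged sketch: writers block until their dirty items still fit in
     * one segment and level 0 has drained.  All names are made up.
     */
    static bool trans_hold_ok(struct super_block *sb)
    {
            /* the dirty items must still fit in a single segment ... */
            if (dirty_item_bytes(sb) >= SCOUTFS_SEGMENT_SIZE)
                    return false;

            /* ... and there must be no level 0 segments left */
            if (scoutfs_manifest_level_count(sb, 0) > 0)
                    return false;

            return true;
    }

    static int hold_trans(struct super_block *sb)
    {
            struct scoutfs_sb_info *sbi = SCOUTFS_SB(sb);

            return wait_event_interruptible(sbi->trans_hold_wq,
                                            trans_hold_ok(sb));
    }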
This also more forcefully ensures that compaction makes forward
progress. We kick the compaction thread if we exceed the level count,
wait for level0 to drain, or successfully complete a compaction. We
tweak scoutfs_manifest_next_compact() to return 0 if there's no
compaction work to do so that the compaction thread can exit without
triggering another.
For clarity we also kick off a sync after compaction so that we don't
sit around with a dirty manifest until the next sync. This may not be
wise.
Signed-off-by: Zach Brown <zab@versity.com>
A previous refactoring messed up and had scoutfs_trans_write_func()
always write the dirty super even when nothing was dirty and there was
nothing for the sync attempt to do. This was very confusing and made it
look like the segment and treap writes were being lost when in fact it
was the super write that shouldn't have happened.
Signed-off-by: Zach Brown <zab@versity.com>
The transaction writing thread tests if the manifest and alloc treaps
are dirty. It did this by testing if there were any dirty nodes in the
treap.
But this misses the case where the treap has been modified and all nodes
have been removed. In that case the root references no dirty nodes but
needs to be written.
Instead let's specifically mark the treap dirty when it's modified.
From then on sync will always try to write it out. We also integrate
updating the persistent root as part of writing the dirty nodes to the
persistent ring. It's required and every caller did it so it was silly
to make it a separate step.
Signed-off-by: Zach Brown <zab@versity.com>
Some recent refactoring accidentally set the trans task to null instead
of the current task. It's not used but until it's removed it should be
correct.
Signed-off-by: Zach Brown <zab@versity.com>
We were manually manipulating the level counts in the super in a bunch
of places under the manifest rwsem. This refactors them into simple get
and add functions. We protect them with a seqcount so that we'll be
able to read them without blocking (from trans hold attempts). We also add
a helper for testing that a level is full because we already used
different comparisons in two call sites.
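A minimal sketch of what the seqcount-protected helpers look like (the
structure, array bound, and function names are assumptions, not the
exact scoutfs code, using the kernel seqcount API):

    /* illustrative only: the real counts live in the super/manifest */
    struct level_counts {
            seqcount_t seqcount;
            u64 counts[MANIFEST_MAX_LEVEL + 1];
    };

    /* lockless read, safe to call from trans hold attempts */
    static u64 manifest_level_count(struct level_counts *lc, int level)
    {
            unsigned int seq;
            u64 count;

            do {
                    seq = read_seqcount_begin(&lc->seqcount);
                    count = lc->counts[level];
            } while (read_seqcount_retry(&lc->seqcount, seq));

            return count;
    }

    /* callers hold the manifest rwsem for writing */
    static void manifest_add_level_count(struct level_counts *lc, int level,
                                         s64 val)
    {
            write_seqcount_begin(&lc->seqcount);
            lc->counts[level] += val;
            write_seqcount_end(&lc->seqcount);
    }

The level-is-full helper then just compares the read count against the
per-level limit in one place.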
Signed-off-by: Zach Brown <zab@versity.com>
This is the first draft of compaction which has the core mechanics.
Add segment functions to free a segment's segno and to delete the entry
that refers to the given segment.
Add manifest functions that lock the manifest and dirty and delete
manifest entries. These are used by the compaction thread to atomically
modify the manifest with the result of a compaction.
Sort the level 0 entries in the manifest by their sequence. This lets
compaction use the oldest entry first and lets reading walk them
backwards to get them in order without having to sort. We also more
carefully use the sequence field in the manifest search key to
differentiate between finding high level entries that overlap and
finding specific entries identified by their seq.
Add some fields to the per-super compact_info struct which support
compaction. We need to know the limit on the number of segments per
level and we record keys per level which tell us which segment to use
next time that level is compacted.
We kick a compaction thread when we add a manifest entry and that brings
the level count over the limit.
scoutfs_manifest_next_compact() is the first meaty function. The
compaction thread uses this to get all the segments involved in a
compaction. It does a quick manifest update if the next manifest
candidate doesn't overlap with any segments in the next level.
The compaction operation itself is a pretty straightforward
read-modify-write operation. It asks the manifest to give it references
to the segments it'll need, reads them in, iterates over them to count
and copies items in order to output segments, and atomically updates the
manifest.
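In outline (with invented helper names and error handling trimmed) the
compaction pass reads roughly like:

    /* hedged sketch of the read-modify-write compaction pass */
    static int compact_one(struct super_block *sb, struct compact_cursor *curs)
    {
            int ret;

            /* get refs to the upper segment and overlapping lower segments */
            ret = scoutfs_manifest_next_compact(sb, curs);
            if (ret <= 0)
                    return ret;     /* 0: no work, the thread can exit */

            /* read all the input segments */
            ret = read_input_segments(sb, curs);
            if (ret)
                    return ret;

            /* merge items in key order into allocated output segments */
            ret = merge_items(sb, curs);
            if (ret)
                    return ret;

            /* atomically delete the input entries and add the outputs */
            return update_manifest(sb, curs);
    }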
Now that the manifest can be dirty without any dirty segments we need to
fix the transaction writing function's assumption that everything flows
from dirty segments. It also has to now lock and unlock the manifest as
it adds the entry for its level 0 segment.
Signed-off-by: Zach Brown <zab@versity.com>
_lookup() and _lookup_next() each had nearly identical loops that took a
dirty boolean. We combine them into one walker with flags for dirty and
next and add a prev as well, giving us all the exported functions
with combinations of the flags.
We also add _last() to match _first() and _prev() to match _next().
Signed-off-by: Zach Brown <zab@versity.com>
While walking up parents looking for the next node we were comparing the
child with the wrong parent pointer. This is easily verified by
glancing at rb_next() :).
Signed-off-by: Zach Brown <zab@versity.com>
Treap deletion was pretty messed up. It forgot to reset parent and ref
for the swapped node before using them to finally delete. And it didn't
get all the weird cases right where the child node to swap is the direct
child of the node. In that case we can't just swap the parent pointers
and node pointers, they need to be special cased.
So nuts to all that. We'll just rotate the node down until it doesn't
have both children. They result in pretty similar patterns and the
rotation mechanism is much simpler to understand.
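The rotate-down pattern is simple enough to show outside the treap
itself. This stand-alone sketch (plain C, invented field names, a
priority standing in for the treap's heap order) captures the idea:

    struct node {
            struct node *left, *right, *parent;
            int prio;       /* treap heap priority */
    };

    /* rotate child above its parent, fixing the surrounding pointers */
    static void rotate_up(struct node **root, struct node *child)
    {
            struct node *node = child->parent;

            child->parent = node->parent;
            if (!node->parent)
                    *root = child;
            else if (node->parent->left == node)
                    node->parent->left = child;
            else
                    node->parent->right = child;

            if (node->left == child) {
                    node->left = child->right;
                    if (child->right)
                            child->right->parent = node;
                    child->right = node;
            } else {
                    node->right = child->left;
                    if (child->left)
                            child->left->parent = node;
                    child->left = node;
            }
            node->parent = child;
    }

    /* rotate the node down until it has at most one child, then splice */
    static void delete_node(struct node **root, struct node *node)
    {
            struct node *child;

            while (node->left && node->right)
                    rotate_up(root, node->left->prio > node->right->prio ?
                                    node->left : node->right);

            child = node->left ? node->left : node->right;
            if (child)
                    child->parent = node->parent;
            if (!node->parent)
                    *root = child;
            else if (node->parent->left == node)
                    node->parent->left = child;
            else
                    node->parent->right = child;
    }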
Signed-off-by: Zach Brown <zab@versity.com>
We were derefing the null parent when deleting a single node in a tree.
There's no need to use parent_ref() here, we know that there's no node
and we can just clear the root's aug bits.
Signed-off-by: Zach Brown <zab@versity.com>
The code was using parent_ref() to set the parent ref's node pointer.
But parent_ref() uses the parent's left node pointer to determine which
ref points to the node. If we were setting the left it would return the
right because the left isn't set yet. This messed up the tree shape and
all hell broke loose.
Just set it through the ref, we have it anyway.
Signed-off-by: Zach Brown <zab@versity.com>
The segment writing loop was assuming that the currently dirty items
will fit in a segment. That's not true.
Signed-off-by: Zach Brown <zab@versity.com>
We forgot to OR in a node's children's augmentation bits when setting
the augmentation bits up in the parent's ref. This stopped ring
dirtying from finding all the dirty nodes in the treap.
Signed-off-by: Zach Brown <zab@versity.com>
We hadn't yet assigned real sequence numbers to the segments.
Let's track the next sequence in the super block and assign it to
segments as we write the first new item in each.
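A hedged sketch of the shape of that assignment (the field name in the
super block is a guess):

    /* hand out the next segment sequence number from the super block */
    static u64 next_seg_seq(struct scoutfs_super_block *super)
    {
            u64 seq = le64_to_cpu(super->next_seg_seq);

            le64_add_cpu(&super->next_seg_seq, 1);
            return seq;
    }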
Signed-off-by: Zach Brown <zab@versity.com>
The first pass manifest and allocator storage used a simple ring log
that was entirely replayed into memory to be used. That risked the
manifest being too large to fit in memory, especially with large keys
and large volumes.
So we move to using an indexed persistent structure that can be read on
demand and cached. We use a treap of byte referenced nodes stored in a
circular ring.
The code interface is modeled a bit on the in-memory rbtree interface.
Except that we can get IO errors and manage allocation so we return data
pointers to the item payload instead of item structs and we can return
errors.
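So a caller sees a shape roughly like this hedged sketch (the lookup
name and the entry type are illustrative, not the exact interface):

    /* lookups can return an item payload pointer, NULL, or an ERR_PTR */
    static struct manifest_entry *manifest_lookup(struct scoutfs_treap *treap,
                                                  struct kvec *key)
    {
            struct manifest_entry *ment;

            ment = scoutfs_treap_lookup(treap, key);
            if (IS_ERR(ment))
                    return ment;            /* -EIO, -ENOMEM, ... */
            if (!ment)
                    return ERR_PTR(-ENOENT);

            return ment;
    }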
The manifest and allocator are converted over and the old ring code is
removed entirely.
Signed-off-by: Zach Brown <zab@versity.com>
We have use of this little u64 comparison function in a few more places
so let's make it a proper usable inline function in a header.
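Something along these lines in a shared header (the exact name may
differ):

    /* three-way comparison of u64s for sort and search callbacks */
    static inline int scoutfs_cmp_u64s(u64 a, u64 b)
    {
            return a < b ? -1 : a > b ? 1 : 0;
    }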
Signed-off-by: Zach Brown <zab@versity.com>
The item cache only knew about present items in the rbtree. Attempts to
read items that didn't exist would always trigger expensive manifest
and segment searches.
This reworks the item cache and item reading code to support the notion
of cached ranges of keys. When we read items we also communicate the
range of keys that we searched. This lets the cache return negative lookups
for key values in the search that don't have items.
The item cache gets an rbtree of key ranges. Each item lookup method
now uses it to determine if a missing item needs to trigger a read.
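A hedged sketch of that check (the range struct, rbtree walker, and
kvec comparison helper are invented names):

    /* return true if the key falls in a cached range; lock held by caller */
    static bool key_is_cached(struct item_cache *cac, struct kvec *key)
    {
            struct cached_range *rng;

            /* rbtree walk that finds the range containing or after the key */
            rng = find_range(&cac->range_root, key);

            return rng && kvec_cmp(key, rng->start) >= 0 &&
                          kvec_cmp(key, rng->end) <= 0;
    }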
Item reading is now performed in batches instead of one at a time. This
lets us specify the cache range along with the batch and apply them all
atomically under the lock.
The item range code is much more robust now that it has to track
the range of keys that it searches. The read items call now takes a range.
It knows to look for all level0 segments that intersect that range, not
just the first key. The manifest segment references now include the
min and max keys for the segment so we can use those to define the
item search range.
Since the refs now include keys we no longer have them as a dumb allocated
array but instead have a list of allocated ref structs.
Signed-off-by: Zach Brown <zab@versity.com>
Add a suite of simple kvec functions to work with kvecs that point to
file system keys.
The ones worth mentioning format keys into strings. They're used to
add formatted strings for the keys to tracepoints. They're still a
little rough but this is a functional first step.
Signed-off-by: Zach Brown <zab@versity.com>
Some parts of the ring reading were still using the old 'nr' for the
number of blocks to read, but it's now the total number of blocks
in the ring. Use 'part' instead.
Signed-off-by: Zach Brown <zab@versity.com>
When iterating over items the manifest would always insert
whatever values it found at the caller's key, instead of the
key that it found in the segment.
Signed-off-by: Zach Brown <zab@versity.com>
The initial kvec code was a bit wobbly. It had raw loops, some weird
constructs, and had more elements than we need.
Add some iterator helpers that make it less likely that we'll screw up
iterating over different length vectors. Get rid of reliance on a
trailing null pointer and always use the count of elements to stop
iterating. With that in place we can shrink the number of elements to
just the greatest user.
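The helpers might look roughly like this sketch (the struct, fields,
and names are assumptions, not the real scoutfs helpers):

    /* iterator state: always bounded by the element count, no trailing NULL */
    struct kvec_iter {
            struct kvec *vec;
            unsigned int nr;        /* total elements */
            unsigned int i;         /* current element */
            size_t off;             /* byte offset within the current element */
    };

    /* advance by bytes, resetting the offset as we cross into the next vec */
    static size_t kvec_iter_advance(struct kvec_iter *it, size_t bytes)
    {
            while (bytes && it->i < it->nr) {
                    size_t step = min(bytes, it->vec[it->i].iov_len - it->off);

                    it->off += step;
                    bytes -= step;
                    if (it->off == it->vec[it->i].iov_len) {
                            it->i++;
                            it->off = 0;
                    }
            }

            return bytes;   /* bytes left over that didn't fit */
    }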
Signed-off-by: Zach Brown <zab@versity.com>
The estimate for the number of dirty segment bytes was wildly
overestimating the number of segment headers by confusing the length of the
segment header with the length of segments.
Signed-off-by: Zach Brown <zab@versity.com>
The space calculation didn't include the terminating zero entry. That
ensured that the space for the entry would never be consumed. But the
remaining space was used to zero the end of the block so the final entry
wasn't being zeroed.
So have the space remaining include the terminating entry and factor
that into the space consumption of each entry being appended.
Signed-off-by: Zach Brown <zab@versity.com>
The comparisons were a bit wrong when comparing overlapping kvec
endpoints. We want to compare the starts and ends with the ends and
starts, respectively.
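For the record, two ranges overlap iff each one starts at or before the
other ends; a hedged sketch with an assumed kvec comparison helper:

    /* [a_start, a_end] and [b_start, b_end] overlap */
    static bool kvecs_overlap(struct kvec *a_start, struct kvec *a_end,
                              struct kvec *b_start, struct kvec *b_end)
    {
            return kvec_cmp(a_start, b_end) <= 0 &&
                   kvec_cmp(b_start, a_end) <= 0;
    }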
Signed-off-by: Zach Brown <zab@versity.com>
We were trying to propagate dirty bits from a node itself when its dirty
bit is set. But its bits are consistent so it stops immediately. We
need to propagate from the parent of the node that changed.
Signed-off-by: Zach Brown <zab@versity.com>
We went to the trouble of allocating a work queue with one work in
flight but then didn't use it. We could have concurrent trans write
func execution.
Signed-off-by: Zach Brown <zab@versity.com>
A reader that hits an allocated segment would wait on IO forever.
Setting the end_io bit lets readers use written segments.
Signed-off-by: Zach Brown <zab@versity.com>
Specifying the ring blocks with a head and tail index led to pretty
confusing code to figure out how many blocks to read and if we had
passed the tail.
Instead specify the ring with a starting index and number of blocks.
The code to read and write the ring blocks naturally falls out and is a
lot more clear.
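With a starting index and count, reading becomes a bounded loop that
only has to handle the wrap (an illustrative sketch; the helpers are
made up and we assume the index and count are within the ring size):

    static int read_ring_blocks(struct super_block *sb, u64 index, u64 nr,
                                u64 total)
    {
            u64 i;
            int ret;

            for (i = 0; i < nr; i++) {
                    u64 block = index + i;

                    if (block >= total)
                            block -= total; /* wrap around the ring */

                    ret = read_ring_block(sb, ring_blkno(sb, block));
                    if (ret)
                            return ret;
            }

            return 0;
    }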
Signed-off-by: Zach Brown <zab@versity.com>
When appending to the ring, the next entry header cursor was pointed past
the caller's src header, not at the next header to write to in the
block.
The writing block index and blkno calculations were just bad. Pretend
they never happened.
And finally we need to point the dirty super at the ring index for the
commit and we need to reset the append state for the next commit.
Signed-off-by: Zach Brown <zab@versity.com>
The ring block tail zeroing memset treated the value as the offset to
zero from, not the number of bytes at the tail to zero.
Signed-off-by: Zach Brown <zab@versity.com>
Manifest entries were being written with the size of their in-memory
nodes, not the smaller persistent add_manifest structure size.
Signed-off-by: Zach Brown <zab@versity.com>
Unfortunately the generic augmented callbacks don't work for our
augmented node bits which specifically reflect the left and right nodes.
We need our own rotation callback and then we have boilerplate for the
other two copy and propagate callbacks. Once we have to provide
.propagate we can call it instead of our own update_dirty_parents()
equivalent.
In addition some callers messed up marking and clearing dirty. We only
want to mark dirty item insertions, not all inserted items. And if we
update an item's keys and values we need to clear and mark it to keep
the counters consistent.
Signed-off-by: Zach Brown <zab@versity.com>
Inserting an item over an existing key was super broken. Now that we're
not replacing we can't stop descent if we find an existing item. We
need to keep descending and then insert. And the caller needs to, you
know, actually remove the existing item when it's found -- not the item
it just inserted :P.
Signed-off-by: Zach Brown <zab@versity.com>
The two functions that added items had little bugs. They initialized
the item vectors incorrectly and didn't actually store the keys and
values. Appending was always overwriting the first segment. Rename
its 'nr' to 'pos' like the rest of the code to make it clearer.
Signed-off-by: Zach Brown <zab@versity.com>
Rename scoutfs_seg_add_ment to _manifest_add as that makes it a lot more
clear that it's a wrapper around scoutfs_manifest_add() that gets its
arguments from the segment.
Signed-off-by: Zach Brown <zab@versity.com>
A few of the kvec iterators that work with byte offsets forgot to reset
the offsets as they advanced to the next vec.
These should probably be refactored into a set of iterator helpers.
Signed-off-by: Zach Brown <zab@versity.com>
The manifest entries were changed to be a single contiguous allocation.
The calculation of the vec that points to the last key vec was adding
the key length in units of the add manifest struct.
Adding to the manifest wasn't setting the key lengths nor copying the keys
into their position in the entry alloc.
Signed-off-by: Zach Brown <zab@versity.com>
Add all the core structural components to be able to modify metadata. We
modify items in fs write operations, track dirty items in the cache,
allocate free segment block regions, stream dirty items into segments,
write out the segments, update the manifest to reference the written
segments, and write out a new ring that has the new manifest.
Signed-off-by: Zach Brown <zab@versity.com>
The ring block replay walk messed up the blkno it read
from and its exit condition. It needed to test for having
just replayed the tail before moving on.
Signed-off-by: Zach Brown <zab@versity.com>
The segment waiting loop was rewritten to use 'n' to
iterate up to 'i', but the body of the loop still used 'i'. Take that as a
signal to Always Iterate With 'i' and store the last i and then
iterate towards it with i again.
Signed-off-by: Zach Brown <zab@versity.com>
A copy+paste error led us to overwrite the first
key in the manifest with the last, leaving the
last uninitialized.
Signed-off-by: Zach Brown <zab@versity.com>