get_manifest_refs was using the btree root in its stale copy of the
super block. It is supposed to use the btree root that it was given by
its caller, who went to the trouble of finding a sufficiently current
btree root.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
[zab: added commit message and fixed formatting]
Signed-off-by: Zach Brown <zab@versity.com>
Otherwise we get into a problem where the listen lock conflicts with
regular inode group requests. Since we never drop the listen lock and it (by
design) blocks progress on another node, those inode group requests may
hang.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
The delayed downconvert work wasn't being canceled on shutdown. 60s
after unmount, at least the net lock's timer would fire and crash trying
to queue the delayed work on the destroyed workqueue.
Proactively unlocking the locks isn't always beneficial to begin with.
The relative costs of mispredicting the future are wildly different if
we have to re-read item caches from segments or have to downconvert a
blocking read lock.
So we can just remove the delayed work to fix the bug and remove a
moving piece that would need to be considered and tuned. There's still
a race where we can get basts after destroying the workqueue but before
we destroy the lockspace; we'll get there.
Signed-off-by: Zach Brown <zab@versity.com>
These transformations are mechanical and there aren't many callers of
these so we combine them into one commit.
Signed-off-by: Zach Brown <zab@versity.com>
Add an end argument to _set_batch to specify the limit of
items we'll read into the cache.
And it turns out that the loop in _set_batch that was meant to cache all the
items covered by the batch didn't try hard enough. It would stop once
the first key was covered but didn't make sure that the coverage
extended to cover last. This can happen if segment boundaries happen to
fall within the items that make up the batch. Fix it up while we're in
here.
Signed-off-by: Zach Brown <zab@versity.com>
Add locks around inode index item iteration. This is tricky because the
inode index items are enormous and we can't default to coarse locks that
let it read and iterate over the entire key space. We use the manifest
to find the next small fixed size region to lock and iterate from.
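Roughly, the walk locks one small region at a time, something like the
sketch below (the u64 key model and all of the helper names here are
hypothetical stand-ins, just to show the shape of the loop):

  #include <stdbool.h>
  #include <stdint.h>

  #define REGION_KEYS 1024ULL     /* assumed small fixed size lock region */

  /* hypothetical stand-ins for the manifest query and lock calls */
  bool manifest_next_key(uint64_t from, uint64_t *found);
  void lock_index_region(uint64_t start, uint64_t end);
  void unlock_index_region(uint64_t start, uint64_t end);
  int iterate_region(uint64_t key, uint64_t end);

  /* lock one small region of the index key space at a time and walk it */
  int iterate_inode_index(uint64_t key, uint64_t last)
  {
          uint64_t found, start, end;
          int ret = 0;

          while (key <= last) {
                  /* ask the manifest where the next index item might be */
                  if (!manifest_next_key(key, &found) || found > last)
                          break;

                  /* lock a fixed size region, never the whole key space */
                  start = found - (found % REGION_KEYS);
                  end = start + REGION_KEYS - 1;

                  lock_index_region(start, end);
                  ret = iterate_region(found, end < last ? end : last);
                  unlock_index_region(start, end);

                  if (ret < 0)
                          break;

                  key = end + 1;
          }

          return ret;
  }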
Signed-off-by: Zach Brown <zab@versity.com>
Add an end key to the item_next calls to limit how many items will be
read into the cache. Callers typically get this from the lock they hold
that covers the iteration. We differentiate between iteration and
caching so that a series of small iterations (listxattr on inodes,
namespace walk in small dirs) can be satisfied by a single read of
adjacent items from segments.
Signed-off-by: Zach Brown <zab@versity.com>
Add a locking wrapper for the inode index items. It maps the index
fields to a lock name for each index type.
Signed-off-by: Zach Brown <zab@versity.com>
Add an item reading variant that just returns the next key that it finds
in segments after the given key. This will be used while iterating
to find the next key to lock and then try to iterate towards.
Signed-off-by: Zach Brown <zab@versity.com>
The item cache can only be populated with items that are covered by
locks. Require callers to provide the farthest key that can be covered
by the locks. Locks provide a key for exactly this purpose.
Signed-off-by: Zach Brown <zab@versity.com>
Let both check_range and read_items take a NULL end. check_range just
doesn't do anything with the end of the range. read_items defaults
to trying to read as many items as it can but clamps to the extent of
the segments that intersect with the key.
This will let us incrementally add end arguments to the item functions
that are initially passed in as NULL by callers as we add lock coverage.
Signed-off-by: Zach Brown <zab@versity.com>
We don't need the dlm to track key ranges if we implement ranges by
mapping keys to resources which represent ranges of the key space.
Signed-off-by: Zach Brown <zab@versity.com>
Instead of locking one resource with ranges we'll have callers map their
logical resources to a tuple name that we'll store in lock resources.
The names still map to ranges for cache reading and cache invalidation
but the ranges aren't exposed to the DLM. This lets us use the stock
DLM and distribute resources across masters.
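As a rough sketch of the idea (the field names, sizes, and group size
below are made up, not the actual lock name format): the name stored in
the resource is a small fixed tuple, and the same tuple deterministically
maps back to the range of keys the lock covers:

  #include <stdint.h>

  /* hypothetical tuple packed into each DLM resource name */
  struct lock_name {
          uint8_t scope;                  /* which logical resource type */
          uint8_t zone;                   /* key zone the lock covers */
          uint64_t first;                 /* e.g. first inode in the group */
          uint64_t second;                /* e.g. index value, if any */
  };

  #define INODE_GROUP_SIZE 64ULL          /* assumed inodes per lock */

  /*
   * The name alone tells us which item cache keys to read under the
   * lock and which to invalidate when it's dropped; the DLM itself only
   * ever sees the opaque name.
   */
  static void lock_name_to_key_range(const struct lock_name *name,
                                     uint64_t *start_ino, uint64_t *end_ino)
  {
          *start_ino = name->first - (name->first % INODE_GROUP_SIZE);
          *end_ino = *start_ino + INODE_GROUP_SIZE - 1;
  }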
Signed-off-by: Zach Brown <zab@versity.com>
We intend to use more of the dlm lock levels. Let's use its modes
directly so we don't have to maintain a mental map between differently
named modes.
Signed-off-by: Zach Brown <zab@versity.com>
Holding a DLM lock protects a range of the key space. The DLM locks
span inodes or regions of inodes. We need the sort order in LSM items
to match the DLM range keys so that we can read all the items covered by
a lock into the cache from a region of LSM segments. If their orders
differed then we'd have to jump around segments to find all the items
covered by a given DLM lock.
Previously we were sorting by type then, within types, by inode. Now we
want to sort by inode then by type. But there are structures which
previously had a type but weren't then sorted by inode. We introduce
zones as the primary sort key. Inode index and node zones are sorted by
the inode fields and node ids respectively. Then comes the fs zone,
sorted first by inode and then by the type of the key.
The bulk of this is the mechanical introduction of the zone field to the
keys, moving the type field down, and a bulk rename of _KEY to _TYPE.
But there are some more substantial changes.
The orphan keys needed to be put in a zone. They fit in the NODE zone
which is all about resources that nodes hold and would need to be
cleaned up if the node went away.
The key formatting is significantly changed to match the new layout.
Formatted keys are now generally of the form "zone.primary.type..."
And finally with the keys now properly sorted by inodes we can correctly
construct a single range of item cache keys to invalidate when unlocking
the inode group locks.
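The resulting ordering can be pictured with a comparison like the
following sketch (the struct layout and field widths are illustrative,
not the on-disk key format):

  #include <stdint.h>

  /* illustrative key layout: zone is the primary sort field */
  struct key {
          uint8_t zone;           /* inode index, node, or fs zone */
          uint64_t primary;       /* inode number or node id */
          uint8_t type;           /* item type within the zone */
          uint64_t offset;        /* remaining type-specific fields */
  };

  /*
   * Sort by zone, then inode/node id, then type: all the items for one
   * inode are adjacent, so one lock covers one contiguous run of keys,
   * and formatted keys read as "zone.primary.type..."
   */
  static int compare_keys(const struct key *a, const struct key *b)
  {
          if (a->zone != b->zone)
                  return a->zone < b->zone ? -1 : 1;
          if (a->primary != b->primary)
                  return a->primary < b->primary ? -1 : 1;
          if (a->type != b->type)
                  return a->type < b->type ? -1 : 1;
          if (a->offset != b->offset)
                  return a->offset < b->offset ? -1 : 1;
          return 0;
  }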
Signed-off-by: Zach Brown <zab@versity.com>
We're going to need to be able to sample the current stable manifest
root occasionally. We're adding it now because we don't yet
have the lock plumbing that would provide the lvb. Eventually
this call will bubble up into the locking and the root will be
stored in the lock instead of always requested.
Signed-off-by: Zach Brown <zab@versity.com>
Convert the net server metadata dirtying and committing code to use the
btree instead of the ring. It has to be careful to set up and tear down
the btree info as it starts up and shuts down the server.
This fixes up some questionable setup/teardown changes made in the
previous patches to convert the manifest and allocator. We could rebase
the patches to merge those together. But given that the previous
patches don't work at all without the net updates it might not be worth
the trouble.
Signed-off-by: Zach Brown <zab@versity.com>
Convert the segment allocator to store its free region bitmaps in the
btree.
This is a very straightforward mechanical transformation. We split the
allocator region into a big-endian index key and the bitmap value
payload. We're careful to operate on aligned copies of the bitmaps so
that they're long aligned.
We can remove all the funky functions that were needed when writing the
ring. All we're left with is a call to apply the pending allocations to
dirty btree blocks before writing the btree.
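The shape of the items looks roughly like this sketch (sizes and names
are made up; the point is the big-endian index key and copying the
bitmap into a long aligned buffer before operating on it):

  #include <endian.h>
  #include <stdint.h>
  #include <string.h>

  #define REGION_BITS     4096                            /* segnos per region */
  #define REGION_LONGS    (REGION_BITS / (8 * sizeof(long)))

  /* illustrative btree key: one region index per bitmap value */
  struct alloc_key {
          uint64_t region_index_be;
  };

  static void init_alloc_key(struct alloc_key *key, uint64_t region_index)
  {
          /* big-endian so the btree's memcmp ordering sorts regions */
          key->region_index_be = htobe64(region_index);
  }

  /*
   * The btree value is just the raw bitmap bytes with no alignment
   * guarantee, so copy it into a long aligned buffer before using bit
   * operations on it and copy it back afterwards.  A set bit b in
   * region r means segno r * REGION_BITS + b is free.
   */
  static void load_region(const uint8_t *value, unsigned long *bits)
  {
          memcpy(bits, value, REGION_LONGS * sizeof(long));
  }

  static void store_region(uint8_t *value, const unsigned long *bits)
  {
          memcpy(value, bits, REGION_LONGS * sizeof(long));
  }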
Signed-off-by: Zach Brown <zab@versity.com>
Convert the manifest to store entries in persistent btree keys and
values instead of using the rbtree in memory from the ring.
The btree doesn't have a sort function. It just compares variable
length keys. The most complicated part of this transformation is
dealing with the fallout of this. The compare function can't compare
different search keys and item keys so searches need to construct full
synthetic btree keys to search. It also can't return different
comparisons, like overlapping, so the caller needs to do a bit more work
to use key comparisons to find overlapping segments. And it can't
compare differently depending on the level of the manifest so we store
the manifest in keys differently depending on whether it's in level 0 or
not.
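For example, finding overlapping segments ends up looking something like
the sketch below, where searches build full synthetic first/last keys
and the caller does the interval test that the btree compare can't
express (memcmp keys and made-up names, not the real code):

  #include <string.h>

  #define KEY_LEN 24      /* made-up fixed synthetic key length */

  struct seg_entry {
          unsigned char first[KEY_LEN];   /* first key in the segment */
          unsigned char last[KEY_LEN];    /* last key in the segment */
  };

  /*
   * The btree compare can only say less/equal/greater, so the caller
   * tests overlap itself: a segment overlaps the search range if its
   * first key isn't past the range's end and its last key isn't before
   * the range's start.
   */
  static int seg_overlaps(const struct seg_entry *ent,
                          const unsigned char *start, const unsigned char *end)
  {
          return memcmp(ent->first, end, KEY_LEN) <= 0 &&
                 memcmp(ent->last, start, KEY_LEN) >= 0;
  }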
All mount clients can now see the manifest blocks. They can query the
manifest directly when trying to find segments to read. We can get rid
of all the networking calls that were finding the segments for readers.
We change the manifest functions that relied on the ring so that they
make changes in the manifest persistent. We don't touch the allocator
or the rest of the manifest server, though, so this commit breaks the
world. It'll be restored in future patches as we update the segment
allocator and server to work with the btree.
Signed-off-by: Zach Brown <zab@versity.com>
Add a cow btree whose blocks are stored in a persistently allocated
ring. This will let us incrementally index very large data sets
efficiently.
This is an adaptation of the previous btree code which now uses the
ring, stores variable length keys, and augments the items with bits that
are or'd up through parents.
Signed-off-by: Zach Brown <zab@versity.com>
Now that we have a dlm, this is a needless indirection. Merge all fields
back into the lock_info struct.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
We refcount our locks and hold them across system calls. If another node
wants access to a given lock we'll mark it as blocking in the bast and queue
a work item so that the lock can later be released. Otherwise locks are
freed under memory pressure, at unmount, or after a timer fires.
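In outline the pattern looks like the sketch below, using the stock
fs/dlm bast callback shape; the scoutfs_lock fields and the workqueue
are hypothetical names for illustration:

  #include <linux/bitops.h>
  #include <linux/dlm.h>
  #include <linux/kref.h>
  #include <linux/workqueue.h>

  /* hypothetical lock struct; only the shape of the pattern is real */
  struct scoutfs_lock {
          struct kref ref;                /* held across system calls */
          unsigned long flags;
          struct work_struct dc_work;     /* drops the lock later */
  };
  #define LOCK_BLOCKING 0

  extern struct workqueue_struct *lock_wq;        /* assumed lock workqueue */

  /* blocking ast: another node wants a conflicting mode */
  static void scoutfs_bast(void *astarg, int mode)
  {
          struct scoutfs_lock *lock = astarg;

          /* just mark it and queue work; the lock is released once the
           * local holders that have it refcounted are done with it */
          set_bit(LOCK_BLOCKING, &lock->flags);
          queue_work(lock_wq, &lock->dc_work);
  }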
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
The refactoring of compaction to use the skip lists changed the nature
of item insertion. Previously it would precisely count the number of
items to insert. Now it discovers that the current output segment is
full by having _append_item() return false.
In this case the cursors currently point to the item that would have
been inserted but failed. The compact_items() caller loops around to
allocate the next segment. Then it calls compact_items() again and it
mistakenly advances *past* the current item that still needed to be
inserted.
Hiding next_item() away from the segment loop made it hard to see this
mechanism. Let's drop the compact_items() function and bring item
iteration and item appending into the main loop so we can more carefully
advance or not as we write and allocate new segments.
This stops losing items at segment boundaries.
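The important property of the reworked loop is that the cursor only
advances once an item has actually been appended; a standalone model of
that shape (made-up types, not the scoutfs code) looks like:

  #include <stdbool.h>
  #include <stddef.h>

  struct item { int key; };
  struct seg { struct item items[4]; size_t nr; };

  /* returns false when the current output segment is full */
  static bool append_item(struct seg *seg, const struct item *item)
  {
          if (seg->nr == 4)
                  return false;
          seg->items[seg->nr++] = *item;
          return true;
  }

  /* write items into as many output segments as it takes; returns the
   * number of items written */
  static size_t write_items(const struct item *items, size_t nr,
                            struct seg *segs, size_t nr_segs)
  {
          size_t i = 0, s = 0;

          while (i < nr) {
                  if (append_item(&segs[s], &items[i])) {
                          /* only advance after a real append */
                          i++;
                  } else if (++s == nr_segs) {
                          /* out of output segments */
                          break;
                  }
                  /* else: retry the very same item in the new segment */
          }
          return i;
  }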
Signed-off-by: Zach Brown <zab@versity.com>
If the starting key for a segment read doesn't fall in a segment then we
have to include the next segment from that level in the read. If we
don't then the read can think that there are no more items at that level
and assume that all the items in the upper level are all that exist.
Signed-off-by: Zach Brown <zab@versity.com>
Writers can add level 0 segments much faster (~20x) than compaction can
compact them down into the lower levels. Without a limit on the number
of level 0 segments item reading can try to read an extraordinary number
of level 0 segments and wedge the box with nonreclaimable page allocations.
Signed-off-by: Zach Brown <zab@versity.com>
Back when we changed the transaction commit to ask the server to update
the commit we accidentally lost the put of the level0 segment that was
just written. This leaked refcount would pin segments over time and
eventually drag the box into crippling oom.
Signed-off-by: Zach Brown <zab@versity.com>
We want to be able to read a region of items from a segment by searching
for the key that starts the item. In the first version of the segment
format we find a key by performing a binary search across an array of
offsets that point to the items.
Unfortunately the current format requires that we know the number of
items before we start writing. With thousands of items per segment it's
a little bonkers to ask compaction to walk through all the items twice.
Worse still, we didn't want the item offset array entries to span pages
so they're rounded up to a power of two once they hold seqs, offsets,
and lengths. This makes them surprisingly large and sometimes they can
consume up to 60% (!) of a segment.
We know that we're inserting in sort order so it's very easy to build an
index as we insert. Skip lists give us a nice simple way to ensure
O(log n) lookups with only an average of two links per node.
CPU use is greatly reduced by removing a full redundant item walk and we
now use up almost all of the space in segments. There are still little
gaps at the ends of blocks as items still won't cross block boundaries.
Most of this change is safely mechanical. The big difference is in how
the compaction loop is built. It used to count the items beforehand.
It would never try to append when out of segments and writing would stop
after the exact number of items. Now it discovers it's out of items by
allocating and trying to append and finding that there's no more work to
do. It required rethinking the loop exit and segment allocation and
stopping conditions.
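The indexing idea itself, independent of the segment layout, is just an
ordinary skip list built during the in-order append; a toy standalone
version (not the on-disk format) looks like:

  #include <stdlib.h>

  #define MAX_LEVEL 16

  struct node {
          int key;
          struct node *next[MAX_LEVEL];
  };

  static struct node head;                        /* sentinel, all levels */
  static struct node *tails[MAX_LEVEL];           /* last node per level */

  /* flip coins: level i is used with probability 1/2^i, ~2 links/node */
  static int random_level(void)
  {
          int level = 1;

          while (level < MAX_LEVEL && (rand() & 1))
                  level++;
          return level;
  }

  /* append a key that sorts after everything inserted so far */
  static void append_key(int key)
  {
          struct node *node = calloc(1, sizeof(*node));
          int level = random_level();
          int i;

          node->key = key;
          for (i = 0; i < level; i++) {
                  struct node *tail = tails[i] ? tails[i] : &head;

                  tail->next[i] = node;
                  tails[i] = node;
          }
  }

  /* O(log n) expected search: descend levels, walking right when we can */
  static struct node *search(int key)
  {
          struct node *node = &head;
          int i;

          for (i = MAX_LEVEL - 1; i >= 0; i--)
                  while (node->next[i] && node->next[i]->key < key)
                          node = node->next[i];
          node = node->next[0];
          return node && node->key == key ? node : NULL;
  }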
Signed-off-by: Zach Brown <zab@versity.com>
Add a relatively small universal value size limit. This will be needed
by more dense item packing to predict the worst case padding to avoid
full items crossing block boundaries.
We refactor the existing symlink and xattr item value limit to use this
new limit.
Signed-off-by: Zach Brown <zab@versity.com>
We're shrinking the max item value size so we need to store symlinks
with large target paths in multiple items. The arbitrary max value size
defined here will be replaced in the future with the new global maximum
value size.
Signed-off-by: Zach Brown <zab@versity.com>
It's possible for the next segno to fall at the end of an allocation
region that doesn't have any bits set. The code shouldn't return -EIO
in that case, it should carry on to the next region.
Signed-off-by: Zach Brown <zab@versity.com>
The item reading limit was intended to minimize latency when we were
directly reading cached manifests. We're now asking the server to walk
the manifest for us and that's a lot more expensive than querying local
cached blocks.
Let's gulp in an entire segment's worth of items if we can. We'll have
plenty of opportunity to tune this down later.
Signed-off-by: Zach Brown <zab@versity.com>
When reading items into the cache we have the range of keys that were
missing from the cache. The item walk was stopping when it hit the end
of the missing cache range, not when it hit the end of the keys that
were covered by all the segments.
This would manifest as huge regions of missing items. The read would
walk off the relatively closed end of the highest level segment. It
would keep reading while there were items in the upper levels but all
those keys that would have been found in additional lower level segments
are missing. Eventually it'd hit the end of the higher level segment
and mark that region as cached.
With it fixed it now stops the read appropriately and will come around
next time to read the range that covers the next lowest level segment.
Signed-off-by: Zach Brown <zab@versity.com>
We read into the item cache when we hit a region that isn't cached.
Inode index items are created without cached range coverage. It's easy
to trigger read attempts that overlap with dirty inode index items.
insert_item() had a bug where its notion of overwriting only applied to
logical presence. It always let an insertion overwrite an existing item
if it was a deletion.
But that only makes sense for new item creation. Item cache population
can't do this. In this inode index case it can replace a correct dirty
inode index item with its old pre-deletion item from the read. This
clobbers the deletions and leaks the old inode index item versions.
So teach the item insertion for caching to never, ever, replace an
existing item. This fixes assertion failures from trying to
immediately walk meta seq items after creating a few thousand dirty
entries.
Signed-off-by: Zach Brown <zab@versity.com>
As we write ring blocks we need to update the first_seq to point at
the first live block in the ring.
The existing calculation gets it wrong and stores the seq of the first
block that we wrote in this commit, not the first ring block that
is still live and would need to be read.
Fix the calculation so that we set first_seq to the first live block
in the ring.
This fixes the bug where a mount can spin printing the super it's using.
This is the server constantly trying to start up, as each server start
fails because it can't read the ring.
Signed-off-by: Zach Brown <zab@versity.com>
As we write segments we're not limiting the number of segments they
intersect at the next level. Compactions are limited to a fanout's
worth of overlapping segments. This means that we can get a compaction
where the upper level segment overlaps more lower level segments than are
part of the compaction. In this case we can't write the remaining upper
level items at the lower level because then we could have a level with
segments whose keys intersect.
Instead we detect this compaction case. We call it sticky because after
merging with the lower level segments the remaining items in the upper
level need to stick to the upper level. The next time compaction comes
around it'll compact the remaining items with the additional lower
overlapping segments.
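Detecting the case boils down to a key comparison along these lines
(u64 keys and the names here are illustrative only):

  #include <stdbool.h>
  #include <stddef.h>
  #include <stdint.h>

  struct seg_range {
          uint64_t first;         /* first key in the segment */
          uint64_t last;          /* last key in the segment */
  };

  /*
   * If the upper level segment's keys extend past the last lower level
   * segment included in this compaction then the leftover upper items
   * can't be written to the lower level without creating intersecting
   * lower segments, so they stick to the upper level for now.
   */
  static bool compaction_is_sticky(const struct seg_range *upper,
                                   const struct seg_range *lower,
                                   size_t nr_lower)
  {
          return nr_lower > 0 && upper->last > lower[nr_lower - 1].last;
  }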
Signed-off-by: Zach Brown <zab@versity.com>
Add an LRU and shrinker to reclaim old cached items under memory
pressure. This is pretty awful today because of the separate cached
range structs and rbtree. We do our best to blow away enough of the
cache and range to try and make progress.
Signed-off-by: Zach Brown <zab@versity.com>
Some item tracing functions were really just tracing a key. Refactor it
into a trace class with event users. Later patches can then use the key
trace class.
Signed-off-by: Zach Brown <zab@versity.com>
To actually use it, we first have to copy symbols over from the dlm build
into the scoutfs source directory. Make that happen automatically for us in
the Makefile.
The only users of locking at the moment are mount, unmount and xattr
read/write. Adding more locking calls should be a straight-forward endeavor.
The LVB based server IP communication didn't work out, and LVBs as they are
written don't make sense in a range locking world. So instead, we record the
server IP address in the superblock. This is protected by the listen lock,
which also arbitrates which node will be the manifest server.
We take and drop the dlm lock on each lock/unlock call. Lock caching will
come in a future patch.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
Using the new interval tree code we add a tree for each lock status list to
efficiently track ranged requests. Internally, most operations on a
resource's lock status list (granted, waiting, converting) are then turned
into operations within a given range.
There is no API change other than a new call, dlm_lock_range() and a new
structure, 'struct dlm_key' to define our range endpoints. Keys can have
arbitrary lengths and are compared via memcmp. A ranged blocking ast type is
defined so that users of dlm_lock_range() can know which range they are
blocking.
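Conceptually the endpoints are just length-prefixed byte strings
compared with memcmp, along the lines of this sketch (the exact field
names in the patch may differ):

  #include <string.h>

  /* conceptual sketch of a range endpoint */
  struct dlm_key {
          unsigned int len;
          const void *val;
  };

  /* memcmp ordering: shared prefix first, then the shorter key sorts first */
  static int dlm_key_cmp(const struct dlm_key *a, const struct dlm_key *b)
  {
          unsigned int len = a->len < b->len ? a->len : b->len;
          int cmp = memcmp(a->val, b->val, len);

          if (cmp)
                  return cmp;
          return a->len < b->len ? -1 : a->len > b->len ? 1 : 0;
  }

  /* two requests can only conflict if their [start, end] ranges overlap */
  static int ranges_overlap(const struct dlm_key *a_start,
                            const struct dlm_key *a_end,
                            const struct dlm_key *b_start,
                            const struct dlm_key *b_end)
  {
          return dlm_key_cmp(a_start, b_end) <= 0 &&
                 dlm_key_cmp(b_start, a_end) <= 0;
  }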
A rudimentary test, dlmtest.ko is included.
TODO:
- Update userspace entry points, need to add one for new lock call
- Manage backwards compatibility with network protocol
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
Users pass in a comparison function which is used when endpoints need to be
checked against each other. We also put each ITTYPE local definition on its
own line to facilitate the use of pointers. An upcoming dlm patch will make
use of this to allow for keyed, ranged locking.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
Also wire it into the build system. We have to figure out how to get scoutfs
pulling in the right headers but that can wait until we have something more
usable.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>