This adds tracing functionality that's cheap and easy to
use. By constantly gathering traces we'll always have rich
history to analyze when something goes wrong.
Signed-off-by: Zach Brown <zab@versity.com>
Introduce the concept of acquiring write locks around write operations.
The core idea is that reads are unlocked and that write lock contention
between nodes should be rare. This first pass simply broadcasts write
lock requests to all the mounts in the volume. It achieves a reasonable
degree of fairness and doesn't require centralizing state in a lock
server.
We have to flesh out a bit of initial infrastructure to support the
write locking protocol. The roster manages cluster membership and
messaging and only understands mounts in the same kernel for now.
Creation needs to know which inodes to try and lock so we see the start
of per-mount free inode reservations.
The transformation of users is straightforward: they acquire the write
lock on the inodes they're working with instead of holding a
transaction. The write lock machinery now manages transactions.
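A rough sketch of the caller-side change, with illustrative rather than
actual function names:

  /* before: callers held the transaction themselves */
  ret = hold_trans(sb);
  if (ret)
          return ret;
  /* modify items */
  release_trans(sb);

  /* after: callers lock the inodes they modify and the write lock
   * machinery holds the transaction underneath */
  ret = write_lock_inode(sb, inode);
  if (ret)
          return ret;
  /* modify items */
  write_unlock_inode(sb, inode);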
This passes single mount testing but that isn't saying much. The next
step is to run multi-mount tests.
Signed-off-by: Zach Brown <zab@versity.com>
The current mechanism for dealing with dirent name hash collisions is to
use multiple hash functions. This won't work great with the btree where
it's expensive to search multiple distant items for a given entry.
Instead of having multiple full precision functions we linearly probe a
given number of hash values after the initial name hash. Now the slow
colliding path walks adjacent items in the tree instead of bouncing
around the tree.
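Lookup then becomes a short probe loop over adjacent hash values; a
minimal sketch, assuming hypothetical helper names and probe count:

  static int lookup_name(struct inode *dir, u64 hash,
                         const char *name, unsigned int len)
  {
          unsigned int i;
          int ret;

          for (i = 0; i < NR_PROBES; i++) {
                  /* adjacent keys keep collisions next to each
                   * other instead of scattering them */
                  ret = lookup_dirent_item(dir, hash + i, name, len);
                  if (ret != -ENOENT)
                          return ret;
          }
          return -ENOENT;
  }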
Signed-off-by: Zach Brown <zab@versity.com>
The next inode number to be allocated has been stored only in the
in-memory super block and hasn't survived across mounts. This sometimes
accidentally worked if the tests removed the initial inodes but often
would cause failures when inode allocation returned existing inodes.
This tracks the next inode to allocate in the super block and maintains
it across mounts. Tests now consistently pass as inode allocations
consistently return free inode numbers.
Signed-off-by: Zach Brown <zab@versity.com>
Add the block allocator.
Logically we use a buddy allocator that's built from bitmaps for
allocators of each order up to the largest allocator that fits in the
device. This ends up using two bits per block.
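The two bits per block falls out of summing the per-order bitmaps; a
rough sketch of the sizing math, with assumed names:

  /* order 0 needs a bit per block, order 1 a bit per pair, and so
   * on, so the total is nr + nr/2 + nr/4 + ... < 2 * nr bits */
  static u64 buddy_bitmap_bits(u64 nr_blocks, int max_order)
  {
          u64 bits = 0;
          int order;

          for (order = 0; order <= max_order; order++)
                  bits += DIV_ROUND_UP(nr_blocks, 1ULL << order);

          return bits;
  }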
On disk we log modified regions of these bitmaps as chunks in the
blocks of a preallocated ring. We carefully coordinate logging the
chunks and the ring size so that we can always write to the tail of the
ring.
There's one allocator and it's only read on mount today. We'll
eventually have multiple of these allocators covering the device and
nodes will coordinate exclusive access to them.
Signed-off-by: Zach Brown <zab@versity.com>
Add the transaction machinery that writes out dirty metadata blocks as
atomic transactions.
The block radix tracks dirty blocks with a dirty radix tag.
Blocks are written with bios whose completion marks them clean and
propagates errors through the super info. The blocks are left tagged
during writeout so that they won't be (someday) mistaken for clean by
eviction. Since we're modifying the radix from io completion we change
all block lock acquisitions to be interrupt safe.
All the operations that modify blocks hold and release the transaction
while they're doing their work.
sync kicks off work that waits for the transaction to be released so
that it can write out all the dirty blocks and then the new supers that
reference them.
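A rough sketch of the ordering that sync enforces, with illustrative
names:

  static int sync_fs_sketch(struct super_block *sb)
  {
          int ret;

          /* wait for holders to release the transaction */
          ret = wait_for_trans_release(sb);
          if (ret)
                  return ret;

          /* write every block tagged dirty in the radix... */
          ret = write_dirty_blocks(sb);
          if (ret)
                  return ret;

          /* ...and only then the new supers that reference them */
          return write_new_supers(sb);
  }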
Signed-off-by: Zach Brown <zab@versity.com>
The btree walk was storing a pointer to the current btree block that it
was working on. It later used this pointer when the walk continued and
the block became a parent. But it didn't update this pointer if splitting
changed the block to traverse. By removing this pointer and using the
block data pointers directly we remove the risk of the pointer going
stale.
Signed-off-by: Zach Brown <zab@versity.com>
The wild casting in the treap code can cause memory corruption if it's
fed bad offsets. Add some assertions so that we can see when this is
happening.
Signed-off-by: Zach Brown <zab@versity.com>
The format doesn't yet record free blocks. We've been relying on the
scary initialization of the block allocator past the blocks that are
written by mkfs. And it was wrong. This garbage will be replaced with
an allocator in a few commits.
Signed-off-by: Zach Brown <zab@versity.com>
Not surprisingly, testing the btree code shook out a few bugs:
- the treap root wasn't initialized
- existing split source block wasn't compacted
- item movement used item treap fields after deletion
All of these had the consequence of feeding the treap code bad offsets
so its node/u16 casts could lead it to scribble over memory.
Signed-off-by: Zach Brown <zab@versity.com>
The conversion of the filerw item callers to the btree cursor wasn't
updated to consistently release the cursors. This was causing block
refcounting problems that could scribble on freed and realloced memory.
Signed-off-by: Zach Brown <zab@versity.com>
Previously we had stubbed out the btree item API with static inlines.
Those are replaced with real functions in a reasonably functional btree
implementation.
The btree implementation itself is pretty straightforward. Operations
are performed top-down and we dirty, lock, and split/merge blocks as we
go. Callers are given a cursor to give them full access to the item.
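A sketch of the caller pattern with assumed names; the cursor
references the block so the caller can work on the item in place and
has to release it when it's done:

  struct btree_cursor curs = { NULL };
  int ret;

  ret = btree_lookup(sb, &key, &curs);
  if (ret == 0) {
          /* curs.val points into the referenced btree block */
          memcpy(buf, curs.val, curs.val_len);
          btree_release(&curs);
  }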
Items in the btree blocks are stored in a treap. There are a lot of
comments in the code to help make things clear.
We add the notion of block references and some block functions for
reading and dirtying blocks by reference.
This passes tests up to the point where unmount tries to write out data
and the world catches fire. That's far enough to commit what we have
and iterate from there.
Signed-off-by: Zach Brown <zab@versity.com>
Starting to implement LSM merging made me really question if it is the
right approach. I'd like to try an experiment to see if we can get our
concurrent writes done with much simpler btrees.
This commit removes all the functionality that derives from the large
LSM segments and distributing the manifest.
What's left is a multi-page block layer and the husk of the btree
implementation which will give people access to items. Callers that
work with items get translated to the btree interface.
This gets as far as reading the super block but the format changes and
large block size mean that the crc check fails and the mount returns an
error.
Signed-off-by: Zach Brown <zab@versity.com>
Now that we have the interval tree we can use it to store the manifest.
Instead of having different indexes for each level we store all the
levels in one index.
This simplifies the code quite a bit. In particular, we won't have to
special case merging between level 0 and 1 quite as much because level 0
is no longer a special list.
We have a strong motivation to keep the manifest small. So we get
rid of the blkno radix. It wasn't wise to trade off more manifest
storage to make the ring a bit smaller. We can store full manifests
in the ring instead of just the block numbers.
We rework the new_manifest interface that adds a final manifest entry
and logs it. The ring entry addition and manifest update are atomic.
We're about to implement merging which will permute the manifest. Read
methods won't be able to iterate over levels while racing with merging.
We change the manifest key search interface to return a full set of all
the segments that intersect the key.
The next item interface now knows how to restart the search if it hits
the end of a segment on one level and the next least key is in another
segment and greater than the end of that completed segment.
There was also a very crazy cut+paste bug where next item was testing
that the item is past the last search key with a while instead of an if.
It'd spin throwing list_del_init() and brelse() debugging warnings.
Signed-off-by: Zach Brown <zab@versity.com>
The next interval interface didn't set the ival to return to null when
it found a null next node. The caller would continuously get the same
interval.
This is what I get for programming late at night, I guess.
Signed-off-by: Zach Brown <zab@versity.com>
Add an interval tree that lets us efficiently discover intervals that
overlap a given search region. We're going to need this now to sanely
implement merging and in the future to implement granting access to
ranges.
It's easy to implement an interval tree by using the kernel's augmented
rbtree to track the max end value of the subtree of intervals. The
tricky bit is that the augmented interface assumes that it can directly
compare the augmented value.
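The shape of the node is the usual one for augmented interval trees,
sketched here with assumed names; each node caches the greatest
interval end in its subtree so searches can skip subtrees that can't
overlap:

  #include <linux/rbtree.h>

  struct ival_node {
          struct rb_node node;
          u64 start;
          u64 end;
          u64 subtree_last;       /* max end below and including us */
  };

  static u64 compute_subtree_last(struct ival_node *n)
  {
          u64 max = n->end;
          struct ival_node *child;

          if (n->node.rb_left) {
                  child = rb_entry(n->node.rb_left,
                                   struct ival_node, node);
                  if (child->subtree_last > max)
                          max = child->subtree_last;
          }
          if (n->node.rb_right) {
                  child = rb_entry(n->node.rb_right,
                                   struct ival_node, node);
                  if (child->subtree_last > max)
                          max = child->subtree_last;
          }
          return max;
  }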
If we were developing against mainline we'd just patch the interface.
But we're developing against distro kernels that development partners
deploy so the kernel is frozen in amber.
We deploy a giant stinky hack to import a private tweaked version of the
interface. It's isolated so we can trivially drop it once we merge with
the fixed upstream interface. We also add some build time checks to
make sure that we don't accidentally combine rb structures between the
private import and the main kernel interface.
Signed-off-by: Zach Brown <zab@versity.com>
Add percpu counters that will let us track all manner of things.
To report them we add a directory full of attribute files under each
mount's sysfs dir:
# (cd /sys/fs/scoutfs/loop0/counters && grep . *)
skip_delete:0
skip_insert:3218
skip_lookup:8439
skip_next:1190
skip_search:156
The implementation is careful to define each counter in only one place.
We don't have to make sure that a bunch of definitions and arrays are in
sync.
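One way to get that single point of definition is an x-macro style
list; a sketch of the idea, not necessarily the exact macros we use:

  /* the one and only list of counter names */
  #define EXPAND_EACH_COUNTER             \
          EXPAND_COUNTER(skip_delete)     \
          EXPAND_COUNTER(skip_insert)     \
          EXPAND_COUNTER(skip_lookup)     \
          EXPAND_COUNTER(skip_next)       \
          EXPAND_COUNTER(skip_search)

  /* the struct of percpu counters comes from the list... */
  struct scoutfs_counters {
  #define EXPAND_COUNTER(which) struct percpu_counter which;
          EXPAND_EACH_COUNTER
  #undef EXPAND_COUNTER
  };

  /* ...and so does the array of sysfs attributes, so the two can
   * never drift apart */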
This builds off of Ben's initial patches that added sysfs dirs.
Signed-off-by: Zach Brown <zab@versity.com>
Signed-off-by: Ben McClelland <ben.mcclelland@versity.com>
Manifests for newly written segments can be inserted at the highest
level that doesn't have segments they intersect. This avoids ring and
merging churn.
The change cleans up the code a little bit, which is nice, and adds
tracepoints for manifests entering and leaving the in memory structures.
Signed-off-by: Zach Brown <zab@versity.com>
The manifests for level 0 blocks always claimed that they could contain
all keys. That causes a lot of extra bloom filter lookups when in fact
the blocks contain a very small range of keys.
It's true that we don't know what items a dirty segment is going to
contain, don't want to update the manifest at every insertion, and have
to find the items in the segments in regular searching.
But when they're finalized we know the items they'll contain and can
update the manifest. We do that by initializing the item block range to
nonsense and extending it as items are added. When it's finalized we
update the manifest in memory and in the ring.
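A minimal sketch of the range tracking, with assumed field names:

  /* start with an impossible, inverted range... */
  seg->first_key = MAX_KEY;
  seg->last_key = MIN_KEY;

  /* ...and widen it as each item is added to the dirty segment */
  if (key_cmp(key, &seg->first_key) < 0)
          seg->first_key = *key;
  if (key_cmp(key, &seg->last_key) > 0)
          seg->last_key = *key;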
Signed-off-by: Zach Brown <zab@versity.com>
Item headers are written from the front of the block to the tail.
Item values are written from the tail of the block towards the head.
The math to detect their overlapping in the center forgot to take the
length of the item header into account. We could have the final item
header and value overwriting each other, which caused file data to
appear as an item header.
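The fixed test has to account for the new item's header as well as its
value; a sketch with assumed names:

  /* headers grow from the front, values from the tail; the new
   * header and the new value both have to fit in the gap */
  static bool item_fits(u32 next_hdr_off, u32 lowest_val_off,
                        u32 val_len)
  {
          return next_hdr_off + sizeof(struct item_header) + val_len <=
                 lowest_val_off;
  }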
Signed-off-by: Zach Brown <zab@versity.com>
The bloom filter was much too large for the current typical limit on the
number of items that fit in a segment. Having them too large decreases
storage efficiency, has us read more data from a cold cache, and pins
too much data during bloom tests.
We can cut it down to 25% for our current segment and item sizes.
Signed-off-by: Zach Brown <zab@versity.com>
The segment code wasn't always locking around concurrent accesses to the
dirty segment. This is mostly a problem for updating all the next
elements in skip list modification. But we also want to serialize dirty
block writing.
Add a little helper function to acquire the dirty mutex when we're
reading from the current dirty segment.
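The helper is tiny; a sketch assuming the dirty segment hangs off the
per-super info:

  static struct scoutfs_segment *lock_dirty_segment(struct super_info *sbi)
  {
          /* callers unlock the mutex when they're done with it */
          mutex_lock(&sbi->dirty_mutex);
          return sbi->dirty_seg;
  }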
Bring sync into segment.c so it's clear that it's intimately related to
the dirty segment.
The item deletion hack was totally unlocked.
Signed-off-by: Zach Brown <zab@versity.com>
Extended file data wasn't persistent because we weren't writing out the
inode with the i_size update that covered the newly written data.
Signed-off-by: Zach Brown <zab@versity.com>
I added these tracepoints to verify that file data isn't reachable after
mount because we're not writing out the inode with the current i_size.
Signed-off-by: Zach Brown <zab@versity.com>
Add the infrastructure for tracepoints. We include an example user that
traces bloom filter hits and misses.
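The events use the stock TRACE_EVENT() machinery; the rough shape of a
definition, with the event name and fields assumed here and the usual
trace header boilerplate omitted:

  TRACE_EVENT(scoutfs_bloom_check,
          TP_PROTO(__u64 blkno, int hit),
          TP_ARGS(blkno, hit),
          TP_STRUCT__entry(
                  __field(__u64, blkno)
                  __field(int, hit)
          ),
          TP_fast_assign(
                  __entry->blkno = blkno;
                  __entry->hit = hit;
          ),
          TP_printk("blkno %llu hit %d", __entry->blkno, __entry->hit)
  );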
Signed-off-by: Zach Brown <zab@versity.com>
Add basic file data support by implementing the address space file and
page read and write methods. This passes basic read/write tests but is
only the seed of a final implementation.
Signed-off-by: Zach Brown <zab@versity.com>
The value offset allocation knew to skip block headers at the start of
each segment block but, weirdly, the item offset allocation didn't.
We make item offset calculation skip the header and we add some tracing
to help see the problem.
Signed-off-by: Zach Brown <zab@versity.com>
Stop leaking dentry_info allocations by adding a dentry_op with a
d_release that frees our dentry info allocation. rmmod tests no longer
fail when dmesg screams that we have slab caches that still have
allocated objects.
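The hookup is the standard d_release method; a sketch assuming the
info hangs off d_fsdata and comes from our dentry info cache:

  static void scoutfs_d_release(struct dentry *dentry)
  {
          /* free the per-dentry info instead of leaking it when
           * the VFS drops the dentry */
          if (dentry->d_fsdata)
                  kmem_cache_free(dentry_info_cache, dentry->d_fsdata);
  }

  static const struct dentry_operations scoutfs_dentry_ops = {
          .d_release = scoutfs_d_release,
  };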
Signed-off-by: Zach Brown <zab@versity.com>
Inode updates weren't persistent because they were being stored in clean
segments in memory. This was triggered by the new hashed dirent
mechanism returning -ENOENT when the inode still had a 0 max dirent hash
nr.
We make sure that there is a dirty item in the dirty segment at the
start of inode modification so that later updates will store in the
dirty segment. Nothing ensures that the dirty segment won't be written
out today but that will be added soon.
Signed-off-by: Zach Brown <zab@versity.com>
Now that we know that it's easy to fix sparse build failures against
RHEL kernel headers we can require sparse builds when developing.
Signed-off-by: Zach Brown <zab@versity.com>
I was building against a RHEL tree that broke sparse builds. With that
fixed I can now see and fix sparse errors.
Signed-off-by: Zach Brown <zab@versity.com>
Previously we dealt with colliding dirent hash values by storing all the
dirents that share a hash value in a big item with multiple dirents.
This complicated the code and strongly encouraged resizing items as
dirents come and go. Resizing items isn't very easy with our simple log
segment item creation mechanism.
Instead let's deal with collisions by allowing a dirent to be stored at
multiple hash values. The code is much simpler.
Lookup has to iterate over all possible hash values. We can track the
greatest hash iteration stored in the directory inode to limit the
overhead of negative lookups in small directories.
Signed-off-by: Zach Brown <zab@versity.com>
Initially items were stored in memory with an rbtree. That let us build
up the API above items without worrying about their storage. That gave
us dirty items in memory and we could start working on writing them to
and reading them from the log segment blocks.
Now that we have the code on either side we can get rid of the item
cache in between. It had some nice properties but it's fundamentally
duplicating the item storage in cached log segment blocks. We'd also
have to teach it to differentiate between negative cache entries and
missing entries that need to be filled from blocks. And the giant item
index becomes a bottleneck.
We have to index items in log segments anyway so we rewrite the item
APIs to read and write the items in the log segments directly. Creation
writes to dirty blocks in memory and reading and iteration walk through
the cached blocks in the buffer cache.
I've tried to comment the files and functions appropriately so most of
the commentary for the new methods is in the body of the commit.
The overall theme is making it relatively efficient to operate on
individual items in log segments. Previously we could only walk all the
items in an existing segment or write all the dirty items to a new
segment. Now we have bloom filters and sorted item headers to let us
test for the presence of an item's key with progressively more expensive
methods. We hold on to a dirty segment and fill it as we create new
items.
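A sketch of the progressively more expensive presence tests when
searching a cached segment, with illustrative names:

  static int segment_lookup(struct segment *seg, struct key *key,
                            struct item_ref *ref)
  {
          /* cheapest: a bloom miss means the key can't be here */
          if (!bloom_may_contain(seg, key))
                  return -ENOENT;

          /* then binary search the sorted item headers and only
           * touch the item itself if the key is really there */
          return find_item(seg, key, ref);
  }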
This needs more fleshing out and testing but this is a solid first pass
and it passes our existing tests.
Signed-off-by: Zach Brown <zab@versity.com>
Add code to walk all the block segments that intersect a key range to
find the next item after that key value.
It is easier to just return failure from the next item reader and have
the caller retry the searches, so we change the specific item reading
path to use the same convention to keep the caller consistent.
This still warns as it falls off the last block but that's fine for now.
We're going to be changing all this in the next few commits.
Signed-off-by: Zach Brown <zab@versity.com>
In mknod the newly created inode's times were being set down in the new
inode creation path instead of up in the mknod path where they'd match
the parent dir's ctime and mtime.
This is strictly legal but it's easier to test that all the times have
been set in the mknod by having them equal. This stops mkdir-interface
test failures when enough time passes between inode creation and parent
dir timestamp updates to have them differ.
Signed-off-by: Zach Brown <zab@versity.com>
Wire up the code to update dirty inode items as inodes are modified in
memory. We had a bit of the code but it wasn't being called.
Signed-off-by: Zach Brown <zab@versity.com>
Add a sync_fs method that writes dirty items into level 0 item blocks.
Add chunk allocator code to allocate new item blocks in free chunks. As
the allocator bitmap is modified it adds bitmap entries to the ring.
As new item blocks are allocated we create manifest entries that
describe their block location and keys. The entry is added to the
in-memory manifest and to entries in the ring.
This isn't complete and there's still bugs but this is enough to start
building on.
Signed-off-by: Zach Brown <zab@versity.com>
I had forgotten to translate from the scoutfs types in items to the vfs
types for filldir() so userspace was seeing garbage d_type values.
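The translation is a small table from the on-disk type to the DT_*
values that filldir expects; a sketch with assumed scoutfs type names:

  static unsigned char dentry_type(u8 type)
  {
          static const unsigned char vfs_types[] = {
                  [SCOUTFS_DT_FIFO] = DT_FIFO,
                  [SCOUTFS_DT_CHR]  = DT_CHR,
                  [SCOUTFS_DT_DIR]  = DT_DIR,
                  [SCOUTFS_DT_BLK]  = DT_BLK,
                  [SCOUTFS_DT_REG]  = DT_REG,
                  [SCOUTFS_DT_LNK]  = DT_LNK,
                  [SCOUTFS_DT_SOCK] = DT_SOCK,
          };

          if (type < ARRAY_SIZE(vfs_types))
                  return vfs_types[type];
          return DT_UNKNOWN;
  }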
Signed-off-by: Zach Brown <zab@versity.com>
The migration from the new iterator interface in upstream to the old
readdir interface in rhel7 got the sense of the filldir return code
wrong. Any readdir would livelock as the dot entry was returned at
offset 0 over and over without advancing f_pos.
Signed-off-by: Zach Brown <zab@versity.com>
We had forgotten to actually insert manifest nodes into the blkno
radix. This hasn't mattered yet because there's only been one manifest
in the level 0 list.
Signed-off-by: Zach Brown <zab@versity.com>
Add the most basic ability to read items from log segment blocks. If
an item isn't in the cache then we walk segments in the manifest and
check for the item in each one.
This is just the core fundamental code. There's still a lot to do:
basic corruption validation, multi-block segments, bloom filters and
arrays to optimize segment misses, and some day the ability to read file
data items directly into page cache pages. The manifest locking is also
super broken.
But this is enough to let us mount and stat the root inode!
Signed-off-by: Zach Brown <zab@versity.com>
Read the ring described by the super block and replay its entries to
rebuild the in-memory state of the chunk allocator and log segment
manifest.
We add just enough of the chunk allocator to set the free bits to the
contents of the ring bitmap entries.
We start to build out the basic manifest data structure. It'll
certainly evolve when we later add code to actually query it.
Signed-off-by: Zach Brown <zab@versity.com>
Update to the format.h from the recent -utils changes that moved from
the clumsy 'brick' terminology to the more reasonable
'block/chunk/segment' terminology.
Signed-off-by: Zach Brown <zab@versity.com>
The sync implementation was a quick demonstration of packing items into
large log blocks. We'll be doing things very differently in the actual
system. So tear this code out so we can build up more functional
structures. It'll still be in revision control so we'll be able
to reuse the parts that make sense in the new code.
Signed-off-by: Zach Brown <zab@versity.com>
Now that we have a working userspace mkfs we can read the supers on
mount instead of always initializing a new file system. We still don't
know how to read items from blocks so mount fails when it can't find the
root dir inode.
Signed-off-by: Zach Brown <zab@versity.com>