Commit Graph

70 Commits

Author SHA1 Message Date
Zach Brown
f024c70802 scoutfs: decrease block size
File data extent tracking can get very complicated if we have to worry
about page sized writes that are less than the block size.  We can avoid
all that complexity if we define the block size to be the smallest
possible page size.

Signed-off-by: Zach Brown <zab@versity.com>
2016-08-02 13:27:14 -07:00
Zach Brown
0e017ff0dc scoutfs: free unused btree blocks
The btree code wasn't freeing blocks either when it had removed
references to them or when an operation failed after having allocated a
new block.  Now that the allocator is more capable we can add in these
free calls.

Signed-off-by: Zach Brown <zab@versity.com>
2016-07-27 16:40:22 -07:00
Zach Brown
ad34f40744 scoutfs: free source blkno after cow
As we update references to point to newly allocated dirty blocks in a
transaction we need to free the old referenced blknos.  By using a
two-phase dirty/free interface we can avoid having the free fail after
we've made it through stages of the cow processing which can't be
easily undone.

Signed-off-by: Zach Brown <zab@versity.com>
2016-07-27 16:13:37 -07:00
Zach Brown
dcef9c0ada scoutfs: store the buddy allocator in a radix
The current implementation of the allocator was built for a world
where blocks were much, much larger.  It could get away with keeping
the entire bitmap resident and having to read it in its entirety before
being able to use it for the first time.

That will not work in the current architecture that's built around a
smaller metadata block size.  The raw size of the allocator gets large
enough that all of those behaviours become problematic at scale.

This shifts the buddy allocator to be stored in a radix of blocks
instead of in a ring log.  This brings it more in line with the
structure of the btree item indexes.  It can be initially read, cached,
and invalidated at block granularity.

In addition, it cleverly uses the cow block structures to solve the
unreferenced space allocation constraint that the previous allocator
hadn't solved.  It can compare the dirty and stable blocks to discover free
blocks that aren't referenced by the old stable state.  The old
allocator would have grown a bunch of extra special complexity to
address this.

There's still work to be done but this is a solid start.

Signed-off-by: Zach Brown <zab@versity.com>
2016-07-27 12:09:29 -07:00
Zach Brown
e226927174 scoutfs: add support for cowing blocks
The current block interface for allocating a dirty copy of a given
stable block didn't cow.  It moved the existing stable block into
its new dirty location.

This is fine for the btree which will never reference old stable blocks.
It's not optimal for the allocator which is going to want to combine the
previous stable allocator blocks with the current dirty allocator blocks
to determine which free regions can satisfy allocations.  If we
invalidate the old stable cached copy it'll immediately read it back in.

And it turns out that it was a little buggy in how it moved the stable
block to its new dirty location.  It didn't remove any old blocks at the
new blkno.

So we offer two high level interfaces for either moving or copying
the stable block's contents into the dirty block.  And we're sure to
always invalidate
old cached blocks at the new dirty blkno location.

Signed-off-by: Zach Brown <zab@versity.com>
2016-07-27 11:30:53 -07:00
Zach Brown
90a73506c1 scoutfs: remove homebrew tracing
Oh, thank goodness.  It turns out that there's a crash extension for
working with tracepoints in crash dumps.  Let's use standard tracepoints
and pretend this tracing hack never happened.

Signed-off-by: Zach Brown <zab@versity.com>
2016-07-20 12:08:12 -07:00
Zach Brown
b51511466a scoutfs: add inodes_since ioctl
Add the ioctl that lets us find out about inodes that have changed
since a given sequence number.

A sequence number is added to the btree items so that we can track the
tree update that it last changed in.  We update this as we modify
items and maintain it across item copying for splits and merges.

The big change is using the parent item ref and item sequence numbers
to guide iteration over items in the tree.  The easier change is to have
the current iteration skip over items whose sequence number is too old.

The more subtle change has to do with how iteration is terminated.  The
current termination could stop when it doesn't find an item because that
could only happen at the final leaf.  When we're ignoring items with old
seqs this can happen at the end of any leaf.  So we change iteration to
keep advancing through leaf blocks until it crosses the last key value.

We add an argument to btree walking which communicates the next key that
can be used to continue iterating from the next leaf block.  This works
for the normal walk case as well as the seq walking case where walking
terminates prematurely in an interior node full of parent items with old
seqs.

Now that we're more robustly advancing iteration with btree walk calls
and the next key we can get rid of the 'next_leaf' hack which was trying
to do the same thing inside the btree walk code.  It wasn't right for
the seq walking case and was pretty fiddly.

The next_key increment could wrap the maximal key at the right spine of
the tree so we have _inc saturate instead of wrap.

And finally, we want these inode scans to not have to skip over all the
other items associated with each inode as it walks looking for inodes
with the given sequence number.  We change the item sort order to first
sort by type instead of by inode.  We've wanted this more generally to
isolate item types that have different access patterns.

Signed-off-by: Zach Brown <zab@versity.com>
2016-07-05 14:46:20 -07:00
Zach Brown
3efec0c094 scoutfs: add scoutfs_set_max_key()
It's nice to have a helper that sets the max possible key instead of
messing around with memset or ~0 manually.

Signed-off-by: Zach Brown <zab@versity.com>
2016-07-05 14:24:10 -07:00
Zach Brown
ae748f0ebc scoutfs: allow tracing with a null sb
The sb counter field isn't necessary; allow a null sb pointer arg which
then results in a counter output of 0.

Signed-off-by: Zach Brown <zab@versity.com>
2016-07-05 14:20:19 -07:00
Zach Brown
59b1f62df8 scoutfs: add basic xattr support
Add basic support for extended attributes.  The next steps are
to add support for more prefixes, including ACLs, and to properly
delete them on unlink.

Signed-off-by: Zach Brown <zab@versity.com>
2016-07-04 10:59:43 -07:00
Zach Brown
cedeacacb8 scoutfs: add file with simple name functions
Directory entries and extended attributes similarly hash and compare
strings so we'll give them some shared functions for doing so.

Signed-off-by: Zach Brown <zab@versity.com>
2016-07-04 10:49:41 -07:00
Zach Brown
a64ca8018a scoutfs: add scoutfs_btree_hole() for finding keys
Directory entries found a hole in the key range between the first and
last possible hash value for a new entry's key.  The xattrs want
to do the same thing so let's extract this into a proper function.

Signed-off-by: Zach Brown <zab@versity.com>
2016-07-04 10:45:17 -07:00
Zach Brown
5c7ba5ed39 scoutfs: remove wrlock and roster
These were interesting experiments in how to manage locks across the
cluster but we'll be going in a more flexible direction.

Signed-off-by: Zach Brown <zab@versity.com>
2016-07-01 21:03:40 -07:00
Zach Brown
4689bf0881 scoutfs: free once granted wrlock entries
The entry free routine only frees entries that don't have any references
from its context.  Callers are supposed to try to free entries after
removing references to them.

Callers that were removing entries from a shard's granted pointer were
trying to free the entry before removing the pointer to the entry.
Entries that were last removed from shard granted pointers were never
freed.

Signed-off-by: Zach Brown <zab@versity.com>
2016-06-01 21:17:30 -07:00
Zach Brown
171aea62cd scoutfs: add some wrlock tracing
Add a bunch of tracing to the wrlock code paths.

Signed-off-by: Zach Brown <zab@versity.com>
2016-06-01 21:10:16 -07:00
Zach Brown
c9caebc117 scoutfs: remove unused held_trans
The held lock struct had an unused 'held_trans' field from a previous
version of the code that specifically tried to track if a held lock had
the trans open.

Signed-off-by: Zach Brown <zab@versity.com>
2016-06-01 21:07:05 -07:00
Zach Brown
ad5a58c348 scoutfs: make trace format a little nicer
The first trace format was pretty noisy.  Now the time is printed in a
gettimeofday timeval so that it can be correlated with other time
stamps.  The super block gets a counter instead of a pointer.  The pid
and cpu are printed without a label and we add the line number so that
it's easy to grep the source to find a trace caller.

Signed-off-by: Zach Brown <zab@versity.com>
2016-06-01 20:34:31 -07:00
Zach Brown
7d6dd91a24 scoutfs: add tracing messages
This adds tracing functionality that's cheap and easy to
use.  By constantly gathering traces we'll always have rich
history to analyze when something goes wrong.

Signed-off-by: Zach Brown <zab@versity.com>
2016-05-28 11:11:15 -07:00
Zach Brown
0820a7b5bd scoutfs: introduce write locking
Introduce the concept of acquiring write locks around write operations.

The core idea is that reads are unlocked and that write lock contention
between nodes should be rare.  This first pass simply broadcasts write
lock requests to all the mounts in the volume.  It achieves a reasonable
degree of fairness and doesn't require centralizing state in a lock
server.

We have to flesh out a bit of initial infrastructure to support the
write locking protocol.  The roster manages cluster membership and
messaging and only understands mounts in the same kernel for now.
Creation needs to know which inodes to try and lock so we see the start
of per-mount free inode reservations.

The transformation of users is straightforward: they acquire the write
lock on the inodes they're working with instead of holding a
transaction.  The write lock machinery now manages transactions.

This passes single mount testing but that isn't saying much.  The next
step is to run multi-mount tests.

Signed-off-by: Zach Brown <zab@versity.com>
2016-05-23 17:25:06 -07:00
Zach Brown
4163236fc1 scoutfs: dirent hashes use linear probing
The current mechanism for dealing with dirent name hash collisions is to
use multiple hash functions.  This won't work great with the btree where
it's expensive to search multiple distant items for a given entry.

Instead of having multiple full precision functions we linearly probe a
given number of hash values after the initial name hash.  Now the slow
colliding path walks adjacent items in the tree instead of bouncing
around the tree.

Signed-off-by: Zach Brown <zab@versity.com>
2016-05-02 21:55:39 -07:00
Zach Brown
e0f38231b3 scoutfs: store next allocated inode in super
The next inode number to be allocated has been stored only in the
in-memory super block and hasn't survived across mounts.  This sometimes
accidentally worked if the tests removed the initial inodes but often
would cause failures when inode allocation returned existing inodes.

This tracks the next inode to allocate in the super block and maintains
it across mounts.  Tests now consistently pass as inode allocations
consistently return free inode numbers.

Signed-off-by: Zach Brown <zab@versity.com>
2016-05-01 09:16:40 -07:00
Zach Brown
979a36e175 scoutfs: add buddy block allocator
Add the block allocator.

Logically we use a buddy allocator that's built from bitmaps for
allocators of each order up to the largest allocator that fits in the
device.  This ends up using two bits per block.

On disk we log modified regions of these bitmaps in chunks in blocks in
a preallocated ring.  We carefully coordinate logging the chunks and the
ring size so that we can always write to the tail of the ring.

There's one allocator and it's only read on mount today.  We'll
eventually have multiple of these allocators covering the device and
nodes will coordinate exclusive access to them.

Signed-off-by: Zach Brown <zab@versity.com>
2016-04-30 12:20:18 -07:00
Zach Brown
e3b308c0d0 scoutfs: add transactions and metadata writing
Add the transaction machinery that writes out dirty metadata blocks as
atomic transactions.

The block radix tracks dirty blocks with a dirty radix tag.

Blocks are written with bios whose completion marks them clean and
propagates errors through the super info.  The blocks are left tagged
during writeout so that they won't be (someday) mistaken for clean by
eviction.  Since we're modifying the radix from io completion we change
all block lock acquisitions to be interrupt safe.

All the operations that modify blocks hold and release the transaction
while they're doing their work.

sync kicks off work that waits for the transaction to be released so
that it can write out all the dirty blocks and then the new supers that
reference them.

Signed-off-by: Zach Brown <zab@versity.com>
2016-04-14 14:35:32 -07:00
Zach Brown
a2f55f02a1 scoutfs: avoid stale btree block pointer
The btree walk was storing a pointer to the current btree block that it
was working on.  It later used this when the walk continues and the
block becomes a parent.  But it didn't update this pointer if splitting
changed the block to traverse.  By removing this pointer and using the
block data pointers directly we remove the risk of the pointer going
stale.

Signed-off-by: Zach Brown <zab@versity.com>
2016-04-14 14:32:21 -07:00
Zach Brown
1c284af854 scoutfs: add assertions for bad treap offsets
The wild casting in the treap code can cause memory corruption if it's
fed bad offsets.  Add some assertions so that we can see when this is
happening.

Signed-off-by: Zach Brown <zab@versity.com>
2016-04-14 13:15:12 -07:00
Zach Brown
3e5eeaa80c scoutfs: initialize block alloc past mkfs blocks
The format doesn't yet record free blocks.  We've been relying on the
scary initialization of the block allocator past the blocks that are
written by mkfs.  And it was wrong.  This garbage will be replaced with
an allocator in a few commits.

Signed-off-by: Zach Brown <zab@versity.com>
2016-04-14 13:13:18 -07:00
Zach Brown
5d77fa4f18 scoutfs: fix serious but small btree bugs
Not surprisingly, testing the btree code shook out a few bugs:

 - the treap root wasn't initialized
 - existing split source block wasn't compacted
 - item movement used item treap fields after deletion

All of these had the consequence of feeding the treap code bad offsets
so its node/u16 casts could lead it to scribble over memory.

Signed-off-by: Zach Brown <zab@versity.com>
2016-04-14 13:08:56 -07:00
Zach Brown
0234abf098 scoutfs: update filerw cursor use
The conversion of the filerw item callers to the btree cursor wasn't
updated to consistently release the cursors.  This was causing block
refcounting problems that could scribble on freed and realloced memory.

Signed-off-by: Zach Brown <zab@versity.com>
2016-04-13 09:49:13 -07:00
Zach Brown
affee9da7c scoutfs: add cscope noise to .gitignore
Signed-off-by: Zach Brown <zab@versity.com>
2016-04-12 19:37:31 -07:00
Zach Brown
5651d48c18 scoutfs: add core btree functionality
Previously we had stubbed out the btree item API with static inlines.
Those are replaced with real functions in a reasonably functional btree
implementation.

The btree implementation itself is pretty straightforward.  Operations
are performed top-down and we dirty, lock, and split/merge blocks as we
go.  Callers are given a cursor to give them full access to the item.
Items in the btree blocks are stored in a treap.  There are a lot of
comments in the code to help make things clear.

We add the notion of block references and some block functions for
reading and dirtying blocks by reference.

This passes tests up to the point where unmount tries to write out data
and the world catches fire.  That's far enough to commit what we have
and iterate from there.

Signed-off-by: Zach Brown <zab@versity.com>
2016-04-12 19:33:09 -07:00
Zach Brown
5369fa1e05 scoutfs: first step towards multiple btrees
Starting to implement LSM merging made me really question if it is the
right approach.  I'd like to try an experiment to see if we can get our
concurrent writes done with much simpler btrees.

This commit removes all the functionality that derives from the large
LSM segments and distributing the manifest.

What's left is a multi-page block layer and the husk of the btree
implementation which will give people access to items.  Callers that
work with items get translated to the btree interface.

This gets as far as reading the super block but the format changes and
large block size mean that the crc check fails and the mount returns an
error.

Signed-off-by: Zach Brown <zab@versity.com>
2016-04-11 11:35:37 -07:00
Zach Brown
a07b41fa8b scoutfs: store the manifest in an interval tree
Now that we have the interval tree we can use it to store the manifest.
Instead of having different indexes for each level we store all the
levels in one index.

This simplifies the code quite a bit.  In particular, we won't have to
special case merging between level 0 and 1 quite as much because level 0
is no longer a special list.

We have a strong motivation to keep the manifest small.  So we get
rid of the blkno radix.  It wasn't wise to trade off more manifest
storage to make the ring a bit smaller.  We can store full manifests
in the ring instead of just the block numbers.

We rework the new_manifest interface that adds a final manifest entry
and logs it.  The ring entry addition and manifest update are atomic.

We're about to implement merging which will permute the manifest.  Read
methods won't be able to iterate over levels while racing with merging.
We change the manifest key search interface to return a full set of all
the segments that intersect the key.

The next item interface now knows how to restart the search if it hits
the end of a segment on one level and the next least key is in another
segment and greater than the end of that completed segment.

There was also a very crazy cut+paste bug where next item was testing
that the item is past the last search key with a while instead of an if.
It'd spin throwing list_del_init() and brelse() debugging warnings.

Signed-off-by: Zach Brown <zab@versity.com>
2016-04-02 17:27:58 -07:00
Zach Brown
20cc8c220c scoutfs: fix next ival busy loop
The next interval interface didn't set the ival it returns to null when
it finds a null next node.  The caller would continuously get the same
interval.

This is what I get for programming late at night, I guess.

Signed-off-by: Zach Brown <zab@versity.com>
2016-04-02 17:17:19 -07:00
Zach Brown
eb790a7761 scoutfs: remove nonsense comment
I think the range comparisons are correct here.

Signed-off-by: Zach Brown <zab@versity.com>
2016-04-02 17:16:51 -07:00
Zach Brown
d91dc45368 scoutfs: add interval tree
Add an interval tree that lets us efficiently discover intervals that
overlap a given search region.  We're going to need this now to sanely
implement merging and in the future to implement granting access
ranges.

It's easy to implement an interval tree by using the kernel's augmented
rbtree to track the max end value of the subtree of intervals.  The
tricky bit is that the augmented interface assumes that it can directly
compare the augmented value.

If we were developing against mainline we'd just patch the interface.
But we're developing against distro kernels that development partners
deploy so the kernel is frozen in amber.

We deploy a giant stinky hack to import a private tweaked version of the
interface.  It's isolated so we can trivially drop it once we merge with
the fixed upstream interface.  We also add some build time checks to
make sure that we don't accidentally combine rb structures between the
private import and the main kernel interface.

Signed-off-by: Zach Brown <zab@versity.com>
2016-04-01 14:53:06 -07:00
Zach Brown
7a565a69df scoutfs: add percpu counters with sysfs files
Add percpu counters that will let us track all manner of things.

To report them we add a sysfs directory full of attribute files in a
sysfs dir for each mount:

    # (cd /sys/fs/scoutfs/loop0/counters && grep . *)
    skip_delete:0
    skip_insert:3218
    skip_lookup:8439
    skip_next:1190
    skip_search:156

The implementation is careful to define each counter in only one place.
We don't have to make sure that a bunch of definitions and arrays are in
sync.

This builds off of Ben's initial patches that added sysfs dirs.

Signed-off-by: Zach Brown <zab@versity.com>
Signed-off-by: Ben McClelland <ben.mcclelland@versity.com>
2016-03-31 16:44:37 -07:00
Zach Brown
6e20913661 scoutfs: insert new manifests at highest level
Manifests for newly written segments can be inserted at the highest
level that doesn't have segments they intersect.  This avoids ring and
merging churn.

The change cleans up the code a little bit, which is nice, and adds
tracepoints for manifests entering and leaving the in memory structures.

Signed-off-by: Zach Brown <zab@versity.com>
2016-03-29 16:15:09 -07:00
Zach Brown
52c315942f scoutfs: update item block and manifest item range
The manifests for level 0 blocks always claimed that they could contain
all keys.  That causes a lot of extra bloom filter lookups when in fact
the blocks contain a very small range of keys.

It's true that we don't know what items a dirty segment is going to
contain, don't want to update the manifest at every insertion, and have
to find the items in the segments in regular searching.

But when they're finalized we know the items they'll contain and can
update the manifest.  We do that by initializing the item block range to
nonsense and extending it as items are added.  When it's finalized we
update the manifest in memory and in the ring.

Signed-off-by: Zach Brown <zab@versity.com>
2016-03-29 11:27:27 -07:00
Zach Brown
97e6c1e605 scoutfs: fix final overlapping item/val
Item headers are written from the front of the block to the tail.
Item values are written from the tail of the block towards the head.

The math to detect their overlapping in the center forgot to take the
length of the item header into account.  We could have the final item
header and value overwriting each other, which causes file data to
appear as an item header.

Signed-off-by: Zach Brown <zab@versity.com>
2016-03-29 11:25:36 -07:00
Zach Brown
c7c8969704 scoutfs: adjust bloom size for segment item max
The bloom filter was much too large for the current typical limit on the
number of items that fit in a segment.  Having them too large decreases
storage efficiency, has us read more data from a cold cache, and bloom
tests pin too much data.

We can cut it down to 25% for our current segment and item sizes.

Signed-off-by: Zach Brown <zab@versity.com>
2016-03-29 10:08:37 -07:00
Zach Brown
f1b5eb8a80 scoutfs: more dirty segment locking
The segment code wasn't always locking around concurrent accesses to the
dirty segment.  This is mostly a problem for updating all the next
elements in skip list modification.  But we also want to serialize dirty
block writing.

Add a little helper function to acquire the dirty mutex when we're
reading from the current dirty segment.

Bring sync in to segment.c so it's clear that it's intimately related to
the dirty segment.

The item deletion hack was totally unlocked.

Signed-off-by: Zach Brown <zab@versity.com>
2016-03-27 19:29:38 -07:00
Zach Brown
9c3918b576 scoutfs: remove accidentally committed notes
Some brainstorming notes in a comment accidentally made their way into a
commit.

Signed-off-by: Zach Brown <zab@versity.com>
2016-03-27 16:19:19 -07:00
Zach Brown
eff3d78cb1 scoutfs: update inode when write changes i_size
Extended file data wasn't persistent because we weren't writing out the
inode with the i_size update that covered the newly written data.

Signed-off-by: Zach Brown <zab@versity.com>
2016-03-26 22:28:45 -07:00
Zach Brown
059212d50e scoutfs: add some basic tracepoints
I added these tracepoints to verify that file data isn't reachable after
mount because we're not writing out the inode with the current i_size.

Signed-off-by: Zach Brown <zab@versity.com>
2016-03-26 22:28:42 -07:00
Zach Brown
402dd2969f scoutfs: add tracepoint support with bloom example
Add the infrastructure for tracepoints.  We include an example user that
traces bloom filter hits and misses.

Signed-off-by: Zach Brown <zab@versity.com>
2016-03-26 20:58:31 -07:00
Zach Brown
9cf87ee571 scoutfs: add basic file page cache read and write
Add basic file data support by implementing the address space file and
page read and write methods.  This passes basic read/write tests but is
only the seed of a final implementation.

Signed-off-by: Zach Brown <zab@versity.com>
2016-03-26 10:58:06 -07:00
Zach Brown
867d717d2b scoutfs: item offsets need to skip block headers
The value offset allocation knew to skip block headers at the start of
each segment block but, weirdly, the item offset allocation didn't.

We make item offset calculation skip the header and we add some tracing
to help see the problem.

Signed-off-by: Zach Brown <zab@versity.com>
2016-03-25 19:28:21 -07:00
Zach Brown
6834100251 scoutfs: free our dentry info
Stop leaking dentry_info allocations by adding a dentry_op with a
d_release that frees our dentry info allocation.  rmmod tests no longer
fail when dmesg screams that we have slab caches that still have
allocated objects.

Signed-off-by: Zach Brown <zab@versity.com>
2016-03-25 11:08:20 -07:00
Zach Brown
434cbb9c78 scoutfs: create dirty items for inode updates
Inode updates weren't persistent because they were being stored in clean
segments in memory.  This was triggered by the new hashed dirent
mechanism returning -ENOENT when the inode still had a 0 max dirent hash
nr.

We make sure that there is a dirty item in the dirty segment at the
start of inode modification so that later updates will be stored in the
dirty segment.  Nothing ensures that the dirty segment won't be written
out today but that will be added soon.

Signed-off-by: Zach Brown <zab@versity.com>
2016-03-25 10:08:34 -07:00
Zach Brown
3bb00fafdc scoutfs: require sparse builds
Now that we know that it's easy to fix sparse build failures against
RHEL kernel headers we can require sparse builds when developing.

Signed-off-by: Zach Brown <zab@versity.com>
2016-03-24 21:45:08 -07:00