Commit Graph

35 Commits

Zach Brown
79110a74eb scoutfs: prevent partial block stage, except final
The staging ioctl is just a thin wrapper around writing.  If we allowed
partial-block staging then the write would zero a newly allocated block
and only stage in the partial region of the block, leaving zeros in the
file that didn't exist before.

We prevent staging when the starting offset isn't block aligned.  We
prevent staging when the final offset isn't block aligned unless it
matches the size because the stage ends in the final partial block of
the file.
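
As a rough illustration of that rule (the block size constant and variable names here are invented for the sketch, not taken from the scoutfs code), the check amounts to something like:

#include <stdbool.h>
#include <stdint.h>

#define BLOCK_SIZE 4096ULL	/* assumed block size for the sketch */

/* return true if staging count bytes at pos into a file of i_size is allowed */
static bool stage_range_allowed(uint64_t pos, uint64_t count, uint64_t i_size)
{
	uint64_t end = pos + count;

	/* the starting offset must always be block aligned */
	if (pos % BLOCK_SIZE)
		return false;

	/* the final offset must be block aligned unless it matches the size,
	 * which covers a stage that ends in the file's final partial block */
	if ((end % BLOCK_SIZE) && end != i_size)
		return false;

	return true;
}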

This is verified by an xfstest (scoutfs/003) that is in flight.

Signed-off-by: Zach Brown <zab@versity.com>
2017-09-07 13:49:37 -07:00
Zach Brown
599269e539 scoutfs: don't return uninit index entries
Initially the index walking ioctl only ever output a single entry per
iteration.  So the number of entries to return and the next entry
pointer to copy to userspace were maintained in the post-increment of
the for loop.

When we added locking of the index item results we made it possible to
not copy any entries in a loop iteration.  When that happened the nr and
pointer would be incremented without initializing the entry.  The ioctl
caller would see a garbage entry in the results.
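
In other words, the fix is to only advance the count and output pointer once an entry has actually been filled in.  A minimal sketch of the corrected shape, with invented names and a stub standing in for the locked lookup:

struct index_entry {
	unsigned long long ino;
	unsigned long long major;
	unsigned int minor;
};

static int copy_next_entry(struct index_entry *ent)
{
	/* stand-in for the locked item lookup; a real version may copy
	 * nothing on a given pass, which is the case the fix handles */
	(void)ent;
	return 0;
}

static int walk_index(struct index_entry *ents, int max)
{
	int nr = 0;

	while (nr < max) {
		/* only count the slot after something was copied into it */
		if (copy_next_entry(&ents[nr]) != 1)
			break;
		nr++;
	}

	return nr;	/* every returned entry was initialized */
}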

This was visible in scoutfs/002 test results on a volume that had an
interesting file population after having run through all the other
scoutfs tests.  The uninitialized entries would show up as garbage in
the size index portion of the test.

Signed-off-by: Zach Brown <zab@versity.com>
2017-08-30 10:38:00 -07:00
Mark Fasheh
0011c185a9 scoutfs: plug the rest of our locking into dlmglue
We move struct ocfs2_lock_res_ops and flags to dlmglue.c so that
locks.c can get access to them. Similarly, we export
ocfs2_lock_res_init_common() so that locks.c can initialize each lockres
before use. Also, free_lock_tree() now has to happen before we shut
down the dlm - this gives dlmglue the opportunity to unlock the
underlying dlm locks before we go off freeing the structures.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-08-24 11:45:15 -05:00
Mark Fasheh
021404bb6a scoutfs: remove inode ctime index
Like the mtime index, this index is unused. Removing it is a near
identical task. Running the same createmany test from our last
patch gives us the following:

 $ createmany -o '/scoutfs/file_%lu' 10000000

 total: 10000000 creates in 598.28 seconds: 16714.59 creates/second

 real    9m58.292s
 user    0m7.420s
 sys     5m44.632s

So after both indices are gone, we go from a 12m56s run time to 9m58s,
saving almost 3 minutes, which translates into a reduction in total run
time of about 23%.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-08-22 15:59:13 -07:00
Mark Fasheh
d59367262d scoutfs: remove inode mtime index
This index is unused - we can gain some create performance by removing it.

To verify this, I ran createmany for 10 million files:

 $ createmany -o '/scoutfs/file_%lu' 10000000

Before this patch:
 total: 10000000 creates in 776.54 seconds: 12877.56 creates/second

 real    12m56.557s
 user    0m7.861s
 sys     6m56.986s

After this patch:
 total: 10000000 creates in 691.92 seconds: 14452.46 creates/second

 real    11m31.936s
 user    0m7.785s
 sys     6m19.328s

So removing the index gained us about a minute and a half on the test, or
about a 12% increase in creates per second.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-08-22 15:59:13 -07:00
Zach Brown
c7ad9fe772 scoutfs: make release block granular
The existing release interface specified byte regions to release but
that didn't match what the underlying file data mapping structure was
capable of.  What happens if you specify a single byte to release?  Does
it release the whole block?  Does it release nothing?  Does it return an
error?

By making the interface match the capability of the operation we make
the functioning of the system that much more predictable.  Callers are
forced to think about implementing their desires in terms of block
granular releasing.
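
A hypothetical sketch of what a block granular argument struct and a caller's rounding might look like (the struct layout and names below are invented for illustration, not the actual scoutfs ioctl ABI):

#include <stdint.h>

struct release_blocks_args {
	uint64_t block;		/* first block to release */
	uint64_t count;		/* number of blocks to release */
	uint64_t data_version;	/* only release while this still matches */
};

/*
 * A caller that thinks in bytes has to decide the rounding itself, for
 * example by shrinking the range to the whole blocks it fully covers.
 */
static void bytes_to_whole_blocks(uint64_t start, uint64_t len, uint64_t bs,
				  struct release_blocks_args *args)
{
	uint64_t first = (start + bs - 1) / bs;	/* round the start up */
	uint64_t last = (start + len) / bs;	/* round the end down */

	args->block = first;
	args->count = last > first ? last - first : 0;
}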

Signed-off-by: Zach Brown <zab@versity.com>
2017-08-14 09:19:03 -07:00
Zach Brown
c1b2ad9421 scoutfs: separate client and server net processing
The networking code was really suffering by trying to combine the client
and server processing paths into one file.  The code can be a lot
simpler by giving the client and server their own processing paths that
take their different socket lifecycles into account.

The client maintains a single connection.  Blocked senders work on the
socket under a sending mutex.  The recv path runs in work that can be
canceled after first shutting down the socket.

A long running server work function acquires the listener lock, manages
the listening socket, and accepts new sockets.  Each accepted socket has
a single recv work blocked waiting for requests.  That then spawns
concurrent processing work which sends replies under a sending mutex.
All of this is torn down by shutting down sockets and canceling work
which frees its context.
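
The teardown ordering is the important part: shut the socket down first so blocked recv work returns, then cancel the work.  A rough kernel-style sketch with invented structure names (only kernel_sock_shutdown() and cancel_work_sync() are real kernel calls):

#include <linux/net.h>
#include <linux/workqueue.h>

struct conn_info {
	struct socket *sock;
	struct work_struct recv_work;	/* blocks receiving messages */
};

static void conn_shutdown(struct conn_info *conn)
{
	/* shutting down the socket unblocks recv work stuck in recvmsg ... */
	kernel_sock_shutdown(conn->sock, SHUT_RDWR);

	/* ... so cancel can wait for it to finish without queueing more work */
	cancel_work_sync(&conn->recv_work);
}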

All this restructuring makes it a lot easier to track what is happening
in mount and unmount between the client and server.  This fixes bugs
where unmount was failing because the monolithic socket shutdown
function was queueing other work while it was itself being drained.

Signed-off-by: Zach Brown <zab@versity.com>
2017-08-04 10:47:42 -07:00
Zach Brown
0b64a4c83f scoutfs: lock inode index item iteration
Add locks around inode index item iteration.  This is tricky because the
inode index items are enormous and we can't default to coarse locks that
let us read and iterate over the entire key space.  We use the manifest
to find the next small fixed size region to lock and iterate from.

Signed-off-by: Zach Brown <zab@versity.com>
2017-07-19 13:30:03 -07:00
Zach Brown
8d29c82306 scoutfs: sort keys by zone, then inode, then type
Holding a DLM lock protects a range of the key space.  The DLM locks
span inodes or regions of inodes.  We need the sort order in LSM items
to match the DLM range keys so that we can read all the items covered by
a lock into the cache from a region of LSM segments.  If their orders
differed then we'd have to jump around segments to find all the items
covered by a given DLM lock.

Previously we were sorting by type then, within types, by inode.  Now we
want to sort by inode then by type.  But there are structures which
previously had a type but weren't then sorted by inode.  We introduce
zones as the primary sort key.  Inode index and node zones are sorted by
the inode fields and node ids respectively.  Then comes the fs zone,
sorted first by inode and then by the type of the key.
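
A small sketch of that comparison order (the struct fields and their widths here are invented for illustration):

#include <stdint.h>

struct item_key {
	uint8_t zone;		/* inode index, node, or fs zone */
	uint64_t ino;		/* primary sort within the fs zone */
	uint8_t type;		/* item type within the inode */
	uint64_t offset;
};

static int cmp_u64(uint64_t a, uint64_t b)
{
	return a < b ? -1 : a > b ? 1 : 0;
}

/* zone first, then inode, then type, so one lock covers a contiguous run */
static int item_key_cmp(const struct item_key *a, const struct item_key *b)
{
	int cmp;

	if ((cmp = (int)a->zone - (int)b->zone))
		return cmp;
	if ((cmp = cmp_u64(a->ino, b->ino)))
		return cmp;
	if ((cmp = (int)a->type - (int)b->type))
		return cmp;
	return cmp_u64(a->offset, b->offset);
}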

The bulk of this is the mechanical introduction of the zone field to the
keys, moving the type field down, and a bulk rename of _KEY to _TYPE.
But there are some more substantial changes.

The orphan keys needed to be put in a zone.   They fit in the NODE zone
which is all about resources that nodes hold and would need to be
cleaned up if the node went away.

The key formatting is significantly changed to match the new sort order.
Formatted keys are now generally of the form "zone.primary.type..."

And finally with the keys now properly sorted by inodes we can correctly
construct a single range of item cache keys to invalidate when unlocking
the inode group locks.

Signed-off-by: Zach Brown <zab@versity.com>
2017-07-19 13:30:03 -07:00
Zach Brown
2eecbbe78a scoutfs: add item cache key ioctls
These ioctls let userspace see the items and ranges that are cached.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-27 14:04:38 -07:00
Zach Brown
b7bbad1fba scoutfs: add precise transaction item reservations
We had a simple mechanism for ensuring that a transaction didn't create
more items than would fit in a single written segment.  We calculated
the most dirty items that a holder could generate and assumed that all
holders dirtied that much.

This had two big problems.

The first was that it wasn't accounting for nested holds.
write_begin/end calls the generic inode dirtying path while holding a
transaction.  This ended up deadlocking: the dirty inode waited to be
able to write while the transaction hold it took back in write_begin
prevented writeout.

The second was that the worst case (full size xattr) item dirtying is
enormous and meaningfully restricts concurrent transaction holders.
With no currently dirty items you can have fewer than 16 full size xattr
writes.  This concurrency limit only gets worse as the transaction fills
up with dirty items.

This fixes those problems.  It adds precise accounting of the dirty
items that can be created while a transaction is held.  These
reservations are tracked in journal_info so that they can be used by
nested holds.  The precision allows much greater concurrency as
something like a create will try to reserve a few hundred bytes instead
of 64k.  Normal sized xattr operations won't try to reserve the largest
possible space.
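
The nested-hold part leans on the task's journal_info pointer; a very rough sketch with invented types and fields (only current->journal_info itself is the real kernel facility):

#include <linux/sched.h>

struct trans_reservation {
	unsigned int holders;	/* nested hold depth in this task */
	unsigned int items;	/* reserved dirty item count */
	unsigned int keys;	/* reserved key bytes */
	unsigned int vals;	/* reserved value bytes */
};

static void trans_hold(struct trans_reservation *res)
{
	struct trans_reservation *cur = current->journal_info;

	if (cur) {
		/* a nested hold reuses the outer reservation */
		cur->holders++;
		return;
	}

	res->holders = 1;
	current->journal_info = res;
}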

We add some feedback from the item cache to the transaction to issue
warnings if a holder dirties more items than it reserved.

Now that we have precise item/key/value counts (segment space
consumption is a function of all three :/) we can no longer track
transaction holders with a single atomic.  We add a long-overdue
trans_info, put a proper lock and fields there, and much more clearly
track transaction serialization amongst the holders and the writer.

Signed-off-by: Zach Brown <zab@versity.com>
2017-05-23 12:15:13 -07:00
Zach Brown
5f11cdbfe5 scoutfs: add and index inode meta and data seqs
For each transaction we send a message to the server asking for a
unique sequence number to associate with the transaction.  When we
change metadata or data of an inode we store the current transaction seq
in the inode and we index it with index items like the other inode
fields.

The server remembers the sequences it gives out.  When we go to walk the
inode sequence indexes we ask the server for the largest stable seq and
limit results to that seq.  This ensures that we never return seqs that
could still be dirty, so inodes and seqs never appear to change in the past.
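
The walk-side rule reduces to a simple visibility test; sketched here with invented names:

#include <stdbool.h>
#include <stdint.h>

/* seqs past the server's stable seq might still be dirty, so hide them */
static bool seq_index_entry_visible(uint64_t entry_seq, uint64_t stable_seq)
{
	return entry_seq <= stable_seq;
}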

Nodes use the sync timer to regularly cycle through seqs and ensure that
inode seq index walks don't get stuck on their otherwise idle seq.

Signed-off-by: Zach Brown <zab@versity.com>
2017-05-23 12:12:24 -07:00
Zach Brown
5307c56954 scoutfs: add a stat_more ioctl
We have inode fields that we want to return to userspace with very low
overhead.
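
A hypothetical userspace caller, just to show the shape of such an ioctl (the ioctl number and struct fields below are invented for the sketch and are not the real scoutfs interface):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

struct stat_more_example {
	uint64_t data_version;
	uint64_t meta_seq;
	uint64_t data_seq;
};

#define EXAMPLE_IOC_STAT_MORE _IOR('e', 1, struct stat_more_example)

int main(int argc, char **argv)
{
	struct stat_more_example stm = { 0 };
	int fd;

	if (argc != 2)
		return 1;

	fd = open(argv[1], O_RDONLY);
	if (fd < 0 || ioctl(fd, EXAMPLE_IOC_STAT_MORE, &stm) < 0) {
		perror("stat_more");
		return 1;
	}

	printf("data_version %llu\n", (unsigned long long)stm.data_version);
	close(fd);
	return 0;
}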

Signed-off-by: Zach Brown <zab@versity.com>
2017-05-16 14:28:10 -07:00
Zach Brown
b97587b8fa scoutfs: add indexing of inodes by fields
Add items for indexing inodes by their fields.  When we update the inode
item we also delete the old index items and create the new items.  We
rename and refactor the old inode since ioctl to now walk the inode
index items.

Signed-off-by: Zach Brown <zab@versity.com>
2017-05-16 10:48:12 -07:00
Zach Brown
e34f8db4a9 scoutfs: add release argument and result tracing
Signed-off-by: Zach Brown <zab@versity.com>
2017-05-16 10:48:12 -07:00
Zach Brown
a262a158ce scoutfs: fix single block release
The offset comparison in release that was meant to catch wrapping was
inverted and accidentally prevented releasing a single block.
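
Roughly, the corrected shape of the check (not the literal code) is that a range is only rejected when its last block is strictly before its first, so a single block range where the two are equal is allowed:

#include <stdbool.h>
#include <stdint.h>

static bool release_range_valid(uint64_t start_block, uint64_t last_block)
{
	/* last == start is a single block and must be allowed;
	 * an inverted comparison here rejects exactly that case */
	return last_block >= start_block;
}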

Signed-off-by: Zach Brown <zab@versity.com>
2017-05-16 10:48:12 -07:00
Zach Brown
97cb75bd88 Remove dead btree, block, and buddy code
Remove all the unused dead code from the previous btree block design.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:55 -07:00
Zach Brown
02af35a98e Convert inode since ioctl to the item API
The inode since ioctl was the last user of the btree.  It doesn't work
yet because the item cache doesn't know how to search for items by
sequence.

It's not yet clear exactly how we'll build the data since ioctls.  It'll
be easy enough to refactor the inode since item walk if they follow a
similar pattern again.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
a310027380 Remove the find xattr ioctls
The current plan for finding populations of inodes to search no longer
involves xattr backrefs.  We're about to change the xattr storage format
so let's remove these interfaces so we don't have to update them.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
fff6fb4740 Restore link backref items
Convert the link backref code from btree items to the item cache.

Now that the backref items have the full entry name we can traverse a
link with one item lookup.  We don't need to lock the inode and verify
that the entry at the backref offset really points to our inode.  The
link backref walk gets a lot simpler.

But we have to widen the ioctl cursor to store a full dir ino and path
name instead of just the dir's backref counter.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
9f5e42f7dd Add simple data items
Add basic file data support by managing file data items from the page
cache address space callbacks.

Data is read by copying from cached items into page contents in
readpage.

Writes create new ephemeral items which reference dirty pages.  The
items are deleted once they're written in a transaction or if
invalidatepage removes the dirty page they reference.

There's a lot more to do to remove data copies, avoid compaction bw
overhead, and add support for truncate, o_direct, and mmap.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
c6b688c2bf Add staging ioctl
This adds the ioctl for writing archived file contents back into the
file if the data_version still matches.

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2016-11-16 14:45:08 -08:00
Zach Brown
df561bbd19 Add offline extent flag and release ioctl
Add the _OFFLINE flag to indicate offline extents.  The release ioctl
frees extents within the release range and sets their _OFFLINE flag if
the data_version still matches.

We tweak the existing truncate item function just a bit to support
making extents offline.  We make it take an explicit range of blocks to
remove instead of just giving it the size and it learns to mark extents
offline and update them instead of always deleting them.

Reads from offline extents return zeros like reading from a sparse
region (later it will trigger demand staging) and writing to offline
extents clears the offline flag (later only staging can do that).

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2016-11-16 14:45:08 -08:00
Zach Brown
5d87418925 Add ioctl for sampling inode data version
Add an ioctl that samples the inode's data_version.

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2016-11-16 14:45:08 -08:00
Mark Fasheh
467801de73 scoutfs: use extents for file data
We're very basic here at this stage and simply put a single-block extent
item where we would have previously had a multi-block bmap item.
Multi-block extents will come in future patches.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
Signed-off-by: Zach Brown <zab@versity.com>
2016-11-16 14:45:08 -08:00
Zach Brown
ae6cc83d01 Raise the nlink limit
A few xfstests tests were failing because they tried to create a decent
number of hard links to a file.

We had a small nlink limit because the inode-paths ioctl copied all the
paths for all the hard links to a userspace buffer which could be
enormous if there was a larger nlink limit.

The hard link backref disk format already has a natural counter that
could be used as a cursor to iterate over all the hard links that point
to a given inode.

This refactors the inode_paths ioctl into a ino_path ioctl that returns
a single path for the given counter and returns the counter for the next
path that links to the inode.  Happily this lets us get rid of all the
weird path component lists and allocations.  Now there's just the kernel
path buffer that gets null terminated path components and the userspace
buffer that those are copied to.
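
A hypothetical userspace loop over such a cursor interface (the ioctl number, struct layout, and field names below are invented for illustration):

#include <errno.h>
#include <limits.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>

struct ino_path_example {
	uint64_t ino;		/* inode to resolve */
	uint64_t cursor;	/* counter returned by the previous call */
	char path[PATH_MAX];	/* one null terminated path per call */
};

#define EXAMPLE_IOC_INO_PATH _IOWR('e', 2, struct ino_path_example)

static void print_all_paths(int fd, uint64_t ino)
{
	struct ino_path_example args = { .ino = ino, .cursor = 0 };

	/* each call fills in one path and the cursor for the next link */
	while (ioctl(fd, EXAMPLE_IOC_INO_PATH, &args) == 0)
		printf("%s\n", args.path);

	if (errno != ENOENT)
		perror("ino_path");
}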

We don't fully relax the nlink limit.  stat(2) returns the link count as
a u32.  We go a step further and limit it to S32_MAX so that apps might
avoid sign bugs.  That still gives us a more generous limit than ext4
and btrfs which are around U16_MAX.

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2016-11-16 14:45:08 -08:00
Zach Brown
165d833c46 Walk stable trees in _since ioctls
The _since ioctls walk btrees and return items that are newer than a
given sequence number.  The intended behaviour is that items will
appear at a greater sequence number if they change after appearing
in the queries.  This promise doesn't hold for items that are being
modified in the current transaction.  The caller would have to always
ask for seq X + 1 after seeing seq X to make sure it got all the changes
that happened in seq X while it was the current dirty transaction.

This is fixed by having the interfaces walk the stable btrees from the
previous transaction.  The results will always be a little stale but
userspace already has to deal with stale results because it can't lock
out change, and it can use sync (and a commit age tunable we'll add) to
limit how stale the results can be.

Signed-off-by: Zach Brown <zab@versity.com>
2016-11-08 16:05:36 -08:00
Zach Brown
16e94f6b7c Search for file data that has changed
We don't overwrite existing data.  Every file data write has to allocate
new blocks and update block mapping items.

We can search for inodes whose data has changed by filtering block
mapping item walks by the sequence number.  We do this by using the
exact same code for finding changed inodes but using the block mapping
key type.

Signed-off-by: Zach Brown <zab@versity.com>
2016-10-20 13:55:14 -07:00
Zach Brown
84f23296fd scoutfs: remove btree cursor
The btree cursor was built to address two problems.  First it
accelerates iteration by avoiding full descents down the tree by holding
on to leaf blocks.  Second it lets callers reference item value contents
directly to avoid copies.

But it also has serious complexity costs.  It pushes refcounting and
locking out to the caller.  There have already been a few bugs where
callers did things while holding the cursor without realizing that
they're holding a btree lock and can't perform certain btree operations
or even copies to user space.

Future changes to the allocator to use the btree motivate cleaning up
the tree locking, which is complicated by the cursor being a stand-alone
lock reference.  Instead of continuing to layer complexity onto this
construct let's remove it.

The iteration acceleration will be addressed the same way we're going to
accelerate the other btree operations: with per-cpu cached leaf block
references.  Unlike the cursor this doesn't push interface changes out
to callers who want repeated btree calls to perform well.

We'll leave the value copying for now.  If it becomes an issue we can
add variants that call a function to operate on the value.  Let's hope
we don't have to go there.

This change replaces the cursor with a vector describing the memory that
the value should be copied to and from.  The vector has a fixed number of elements
and is wrapped in a struct for easy declaration and initialization.
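
Something along these lines, with placeholder names and an arbitrary element count (not the actual scoutfs definitions):

#include <stddef.h>

#define VAL_NR_VECS 4

struct val_vec {
	struct {
		void *ptr;	/* memory the value is copied to or from */
		size_t len;
	} vec[VAL_NR_VECS];
	unsigned int nr;	/* elements actually in use */
};

/* declare and initialize a single element vector around one buffer */
#define DECLARE_VAL_SINGLE(name, buf, size)			\
	struct val_vec name = {					\
		.vec = { { .ptr = (buf), .len = (size) } },	\
		.nr = 1,					\
	}

Reading calls then return the number of value bytes they copied into the vector, which is what the per-caller verification listed below checks.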

This change to the interface looks noisy but each caller's change is
pretty mechanical.  They tend to involve:

 - replace the cursor with the value struct and initialization
 - allocate some memory to copy the value in to
 - reading functions return the number of value bytes copied
 - verify that the copied bytes make sense for the item being read
 - getting rid of confusing ((ret = _next())) looping
 - _next now returns -ENOENT instead of 0 for no next item
 - _next iterators now need to advance the key themselves
 - make sure to free allocated memory

Sometimes the order of operations changes significantly.  Now that we
can't modify in place we need to read, modify, write.  This looks like
changing a modification of the item through the cursor to a
lookup/update pattern.

The symlink item iterators didn't need to use next because they walk a
contiguous set of keys.  They're changed to use simple insert or lookup.

Signed-off-by: Zach Brown <zab@versity.com>
2016-09-21 10:04:07 -07:00
Zach Brown
2bed78c269 scoutfs: specify btree root
The btree functions currently don't take a specific root argument.  They
assume, deep down in btree_walk, that there's only one btree in the
system.  We're going to be adding a few more to support richer
allocation.

To prepare for this we have the btree functions take an explicit btree
root argument.  This should make no functional difference.

Signed-off-by: Zach Brown <zab@versity.com>
2016-09-21 10:04:07 -07:00
Zach Brown
c90710d26b scoutfs: add find xattr ioctls
Add ioctls that return the inode numbers that probably contain the given
xattr name or value.  To support these we add items that index inodes by
the presence of xattr items whose names or values hash to a given hash
value.

Signed-off-by: Zach Brown <zab@versity.com>
2016-08-23 12:14:55 -07:00
Zach Brown
0991622a21 scoutfs: add inode_paths ioctl
This adds the ioctl that returns all the paths from the root to a given
inode.  The implementation only traverses btree items to keep it
isolated from the vfs object locking and life cycles, but that could be
a performance problem.  This is another motivation to accelerate the
btree code.

Signed-off-by: Zach Brown <zab@versity.com>
2016-08-11 16:46:18 -07:00
Zach Brown
90a73506c1 scoutfs: remove homebrew tracing
Oh, thank goodness.  It turns out that there's a crash extension for
working with tracepoints in crash dumps.  Let's use standard tracepoints
and pretend this tracing hack never happened.

Signed-off-by: Zach Brown <zab@versity.com>
2016-07-20 12:08:12 -07:00
Zach Brown
b51511466a scoutfs: add inodes_since ioctl
Add the ioctl that lets us find out about inodes that have changed
since a given sequence number.

A sequence number is added to the btree items so that we can track the
tree update that it last changed in.  We update this as we modify
items and maintain it across item copying for splits and merges.

The big change is using the parent item ref and item sequence numbers
to guide iteration over items in the tree.  The easier change is to have
the current iteration skip over items whose sequence number is too old.

The more subtle change has to do with how iteration is terminated.  The
current termination could stop when it doesn't find an item because that
could only happen at the final leaf.  When we're ignoring items with old
seqs this can happen at the end of any leaf.  So we change iteration to
keep advancing through leaf blocks until it crosses the last key value.

We add an argument to btree walking which communicates the next key that
can be used to continue iterating from the next leaf block.  This works
for the normal walk case as well as the seq walking case where walking
terminates prematurely in an interior node full of parent items with old
seqs.

Now that we're more robustly advancing iteration with btree walk calls
and the next key we can get rid of the 'next_leaf' hack which was trying
to do the same thing inside the btree walk code.  It wasn't right for
the seq walking case and was pretty fiddly.

The next_key increment could wrap the maximal key at the right spine of
the tree so we have _inc saturate instead of wrap.
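
A sketch of a saturating increment over a multi-field key (the field names and widths are invented; the real key has more fields):

#include <stdint.h>

struct example_key {
	uint64_t ino;
	uint8_t type;
	uint64_t offset;
};

static void example_key_inc(struct example_key *key)
{
	/* carry from the least significant field upwards */
	if (++key->offset)
		return;
	if (++key->type)
		return;
	if (++key->ino)
		return;

	/* the key was already maximal: saturate instead of wrapping to zero */
	key->offset = UINT64_MAX;
	key->type = UINT8_MAX;
	key->ino = UINT64_MAX;
}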

And finally, we want these inode scans to not have to skip over all the
other items associated with each inode as it walks looking for inodes
with the given sequence number.  We change the item sort order to first
sort by type instead of by inode.  We've wanted this more generally to
isolate item types that have different access patterns.

Signed-off-by: Zach Brown <zab@versity.com>
2016-07-05 14:46:20 -07:00
Zach Brown
7d6dd91a24 scoutfs: add tracing messages
This adds tracing functionality that's cheap and easy to
use.  By constantly gathering traces we'll always have rich
history to analyze when something goes wrong.

Signed-off-by: Zach Brown <zab@versity.com>
2016-05-28 11:11:15 -07:00