Commit Graph

368 Commits

Mark Fasheh
bc2fef7fc8 scoutfs: ifdef out ocfs2 specific callbacks and functions
We only want the generic stuff. Long term, the ocfs2-specific code would be
what's left in fs/ocfs2/dlmglue.[ch].

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-08-23 16:05:24 -05:00
Mark Fasheh
fc21a0253c scoutfs: Hook dlmglue into our build system
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-08-23 15:54:08 -05:00
Mark Fasheh
f7e3f6f9e6 scoutfs: import fs/ocfs2/dlmglue.[ch] from Linux v4.13-rc6
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-08-22 19:07:53 -05:00
Mark Fasheh
021404bb6a scoutfs: remove inode ctime index
Like the mtime index, this index is unused. Removing it is a near-identical
task. Running the same createmany test from our last patch gives us the
following:

 $ createmany -o '/scoutfs/file_%lu' 10000000

 total: 10000000 creates in 598.28 seconds: 16714.59 creates/second

 real    9m58.292s
 user    0m7.420s
 sys     5m44.632s

So after both indices are gone, we go from a 12m56s run time to 9m58s,
saving almost 3 minutes, which cuts the total run time by about 23%.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-08-22 15:59:13 -07:00
Mark Fasheh
d59367262d scoutfs: remove inode mtime index
This index is unused - we can gain some create performance by removing it.

To verify this, I ran createmany for 10 million files:

 $ createmany -o '/scoutfs/file_%lu' 10000000

Before this patch:
 total: 10000000 creates in 776.54 seconds: 12877.56 creates/second

 real    12m56.557s
 user    0m7.861s
 sys     6m56.986s

After this patch:
 total: 10000000 creates in 691.92 seconds: 14452.46 creates/second

 real    11m31.936s
 user    0m7.785s
 sys     6m19.328s

So removing the index gained us about a minute and a half on the test, or
a 12% performance increase.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-08-22 15:59:13 -07:00
Zach Brown
8135b18c76 scoutfs: start truncate from first block
Truncation updates extents that intersect with the input range.  It
starts with the first block in the range and iterates until it has
searched for all the extents that could cover the range.

Extents are stored in items at their final block location so that we can
use _next to find intersections.  Truncation was searching for the next
extent after the full extent that it was still searching for.  That
means it was starting the search at the last block in the extent, not
the first.  It would miss all the extents that didn't overlap with the
last block it was searching for.

This fixed by searching from a temporary single block extent at the
start of the search range.
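
Roughly, the fix looks like this - a compilable userspace sketch of the
search logic rather than the actual scoutfs code (extent_next is a
stand-in for the item _next search):

 /* extents are keyed by their final block, so a next-style search
  * from key k returns the first extent whose last block is >= k */
 #include <stdio.h>

 struct extent {
         unsigned long long start;       /* first logical block */
         unsigned long long last;        /* final logical block, the key */
 };

 static struct extent *extent_next(struct extent *ents, int nr,
                                   unsigned long long key)
 {
         for (int i = 0; i < nr; i++)
                 if (ents[i].last >= key)
                         return &ents[i];
         return NULL;
 }

 int main(void)
 {
         struct extent ents[] = { {0, 9}, {10, 19}, {20, 29} };

         /* the bug: searching from the last block of the extent
          * still being truncated (say 29) skips {0, 9} and {10, 19} */

         /* the fix: search from a temporary single-block extent at
          * the start of the range, here block 5 */
         struct extent *ext = extent_next(ents, 3, 5);

         printf("first intersecting extent: [%llu, %llu]\n",
                ext->start, ext->last);
         return 0;
 }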

Signed-off-by: Zach Brown <zab@versity.com>
2017-08-17 15:29:08 -07:00
Mark Fasheh
d1ae486d83 scoutfs: provide ->llseek
Without this we return -ESPIPE when a process tries to seek on a regular
file.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-08-14 19:57:13 -07:00
Zach Brown
07bbc418c3 scoutfs: merge offline extents
Offline extents weren't being merged because they all had their physical
blkno set to 0 and all the extent calculations didn't treat them
specially.  They would only merge if the physical blocks of two extent
were contiguous.  Instead of special casing offline extents everywhere
we store them with a physical blkno set to the logical blk_off.  This
lets all the current extent calculations work as expected.
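
A sketch of the idea, assuming a generic merge test that requires both
the logical and physical runs to be contiguous (names are hypothetical):

 #include <stdbool.h>

 struct extent {
         unsigned long long blk_off;     /* logical start */
         unsigned long long blkno;       /* physical start */
         unsigned long long blocks;
 };

 static bool extents_merge(struct extent *a, struct extent *b)
 {
         /* the one generic test, no offline special case */
         return a->blk_off + a->blocks == b->blk_off &&
                a->blkno + a->blocks == b->blkno;
 }

 /* storing offline extents with blkno == blk_off means two adjacent
  * offline extents, e.g. {0, 0, 8} and {8, 8, 4}, pass the same
  * contiguity test that allocated extents do */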

Signed-off-by: Zach Brown <zab@versity.com>
2017-08-14 09:19:03 -07:00
Zach Brown
7cc09761f5 scoutfs: release item cleanup needs transaction
Release tries to reinstate extents if it sees an error during release.
Those item manipulations need to be covered by the transaction.

Signed-off-by: Zach Brown <zab@versity.com>
2017-08-14 09:19:03 -07:00
Zach Brown
c7ad9fe772 scoutfs: make release block granular
The existing release interface specified byte regions to release but
that didn't match what the underlying file data mapping structure is
capable of.  What happens if you specify a single byte to release?  Does
it release the whole block?  Does it release nothing?  Does it return an
error?

By making the interface match the capability of the operation we make
the functioning of the system that much more predictable.  Callers are
forced to think about implementing their desires in terms of block
granular releasing.
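
The interface might then look something like this (a sketch; the struct
and field names are hypothetical, not the actual ioctl definition):

 #include <linux/types.h>

 /* callers name whole blocks, so there is no question of what
  * happens to a partially covered block */
 struct scoutfs_ioctl_release {
         __u64 block;            /* first file block to release */
         __u64 count;            /* number of blocks to release */
 };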

Signed-off-by: Zach Brown <zab@versity.com>
2017-08-14 09:19:03 -07:00
Zach Brown
87ab27beb1 scoutfs: add statfs network message
The ->statfs method was still using the super_block in the super_info
that was read during mount.  This will get progressively more out
of date.

We add a network message to ask the server for the current fields that
impact statfs.  This is always racy and the fields are mostly nonsense,
but we try our best.
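
A guess at the shape of the reply payload (hypothetical names, not the
actual protocol definition):

 #include <linux/types.h>

 /* the server fills these from its current super; the client only
  * uses them to answer statfs, so staleness is tolerable */
 struct scoutfs_net_statfs {
         __le64 total_segs;      /* capacity */
         __le64 free_segs;       /* free space */
         __le64 next_ino;        /* rough input for the files fields */
 };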

Signed-off-by: Zach Brown <zab@versity.com>
2017-08-11 10:43:35 -07:00
Zach Brown
ba7bde30fc scoutfs: delete inode index items
Delete inode index items when deleting all the items associated with an
inode after it's been unlinked and had all its references dropped.

The index items should always match the fields in the inode item, so we
read it to determine the index items that should be deleted, regardless
of whether we have the vfs inode cached.  We take the opportunity to
fold the two callers that looked up the inode into item deletion itself
so that it can use the inode fields.

The deletion of index items is partially verified by an inode index test
in xfstests which makes sure that unlinked files are no longer present
in the index.

Signed-off-by: Zach Brown <zab@versity.com>
2017-08-11 10:13:56 -07:00
Zach Brown
3768e3c41c scoutfs: don't add dirs to data_seq index
Directories were getting added to the data_seq index.  It might have
looked like they weren't because their data_seqs were always 0, but when
inodes are created they don't have 'have_item' set, so all the fields
are added regardless of their current value.

We'd rather not have to wade through directories when looking for regular
file data in the data_seq index, so let's explicitly test for regular
files when updating the data_seq index items.
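
The test itself is the standard mode check, as this little userspace
demonstration shows:

 #include <sys/stat.h>
 #include <stdio.h>

 int main(void)
 {
         mode_t dir = S_IFDIR | 0755;
         mode_t reg = S_IFREG | 0644;

         /* only regular files should land in the data_seq index */
         printf("index dir:  %d\n", S_ISREG(dir)); /* 0 */
         printf("index file: %d\n", S_ISREG(reg)); /* 1 */
         return 0;
 }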

Signed-off-by: Zach Brown <zab@versity.com>
2017-08-11 10:13:56 -07:00
Zach Brown
1398b2316d scoutfs: clean up racy inode index updates
The updating of the inode index items was racy.  It loaded the inode
values, updated the items, loaded the fields again, and then stored the
fields in the inode info.  All without locking.  Concurrent attempts
could get the fields scrambled and racing with other paths that update
the fields could get the items and inode info out of sync.

This fixes up the two races by only reading the inode fields once and
performing the multi-stage update under a mutex.  We add a new lock to
avoid ordering problems with trying to add an existing lock at these
points in the locking hierarchy.  We specifically use a mutex because
the item functions can block.
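
The shape of the fix, as a sketch (the lock and helper names here are
illustrative, not the actual scoutfs code):

 #include <linux/types.h>
 #include <linux/mutex.h>

 struct scoutfs_inode_info {
         struct mutex item_mutex;        /* serializes index updates */
         u64 meta_seq;
         u64 data_seq;
 };

 static void update_index(struct scoutfs_inode_info *si,
                          u64 meta_seq, u64 data_seq)
 {
         mutex_lock(&si->item_mutex);
         /* delete the old index items, create the new ones, then
          * store the fields in the inode info -- one atomic step as
          * far as concurrent updaters are concerned */
         si->meta_seq = meta_seq;
         si->data_seq = data_seq;
         mutex_unlock(&si->item_mutex);
 }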

Now the inode index field update just has to safely race with concurrent
access to the fields.

This was found by generic/037 once getattr started refreshing the inode.
It now passes again.

Signed-off-by: Zach Brown <zab@versity.com>
2017-08-11 10:07:42 -07:00
Zach Brown
cdb58a967a scoutfs: give module fs scoutfs alias
Use MODULE_ALIAS_FS() to register the "scoutfs" fs alias so that
modprobe can find the module if it's installed and visible to depmod.
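
The change itself is a one-liner along these lines:

 #include <linux/fs.h>
 #include <linux/module.h>

 /* "mount -t scoutfs" makes the kernel request the "fs-scoutfs"
  * alias, which modprobe resolves to this module */
 MODULE_ALIAS_FS("scoutfs");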

We don't yet have clever enough xfstests to mess around with modules.  I
manually verified this by installing the module in /lib/modules and
trying mount -t scoutfs before and after the change.

Signed-off-by: Zach Brown <zab@versity.com>
2017-08-10 18:07:26 -07:00
Zach Brown
c1b2ad9421 scoutfs: separate client and server net processing
The networking code was really suffering by trying to combine the client
and server processing paths into one file.  The code can be a lot
simpler by giving the client and server their own processing paths that
take their different socket lifecycles into account.

The client maintains a single connection.  Blocked senders work on the
socket under a sending mutex.  The recv path runs in work that can be
canceled after first shutting down the socket.

A long running server work function acquires the listener lock, manages
the listening socket, and accepts new sockets.  Each accepted socket has
a single recv work blocked waiting for requests.  That then spawns
concurrent processing work which sends replies under a sending mutex.
All of this is torn down by shutting down sockets and canceling work
which frees its context.

All this restructuring makes it a lot easier to track what is happening
in mount and unmount between the client and server.  This fixes bugs
where unmount was failing because the monolithic socket shutdown
function was queueing other work while it was being drained.

Signed-off-by: Zach Brown <zab@versity.com>
2017-08-04 10:47:42 -07:00
Zach Brown
74a80b772e scoutfs: add endian_swap.h
Add a helper header for conversions between little and big endian.
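
The header presumably holds helpers along these lines (a sketch; the
helper names are guesses):

 #include <linux/types.h>
 #include <asm/byteorder.h>

 /* convert between little- and big-endian fields by round-tripping
  * through cpu byte order */
 static inline __be64 le64_to_be64(__le64 x)
 {
         return cpu_to_be64(le64_to_cpu(x));
 }

 static inline __le64 be64_to_le64(__be64 x)
 {
         return cpu_to_le64(be64_to_cpu(x));
 }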

Signed-off-by: Zach Brown <zab@versity.com>
2017-08-04 10:44:06 -07:00
Zach Brown
b98f97e143 scoutfs: use hlist hash for data cursors
The rhashtable API has changed over time.  Continuing to use it means
having to worry about maintaining different APIs in different kernel
generations.

We have a static pool of cursors so we don't need the flexibility of the
resizable rhashtable.  We can roll a simple array of hlist heads to use
as a hash table.
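
Such a table is only a few lines (a sketch with hypothetical names and
sizes):

 #include <linux/list.h>
 #include <linux/hash.h>

 #define CURSOR_HASH_BITS        6

 /* a fixed-size table is fine for a static pool of cursors */
 static struct hlist_head cursor_heads[1 << CURSOR_HASH_BITS];

 static struct hlist_head *cursor_head(u64 key)
 {
         return &cursor_heads[hash_64(key, CURSOR_HASH_BITS)];
 }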

And finally, these cursors will probably disappear eventually anyway.
Let's not invest too much in them.

Signed-off-by: Zach Brown <zab@versity.com>
2017-08-04 10:44:06 -07:00
Zach Brown
9f4095bffb scoutfs: break the build if we export raw types
Raw [su]{8,16,32,64} types keep leaking into our exported headers where
they break userspace builds.  Make sure that we only use the exported __
types and add a check to break our build if we get it wrong.
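
One way to build such a check, as a sketch of the poison-macro approach
(an assumption, not necessarily what this commit does):

 /* check.c: compiled during the build; any raw type in an exported
  * header expands to an undeclared identifier and breaks the build */
 #define u8      raw_types_must_not_appear_in_exported_headers
 #define u16     raw_types_must_not_appear_in_exported_headers
 #define u32     raw_types_must_not_appear_in_exported_headers
 #define u64     raw_types_must_not_appear_in_exported_headers

 #include "scoutfs_format.h"     /* hypothetical exported header */

 int main(void) { return 0; }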

Signed-off-by: Zach Brown <zab@versity.com>
2017-08-04 10:37:49 -07:00
Zach Brown
cefe06af61 scoutfs: add git describe to built module
It's handy to quickly find the git commit that built a given module.  We
add a MODULE_INFO() tag for it so we can see it in modinfo on the built
module.  We add an ELF note that the kernel exposes in
/sys/module/$m/notes/ when the module is loaded.

Signed-off-by: Zach Brown <zab@versity.com>
2017-08-03 15:07:23 -07:00
Zach Brown
6d16034112 scoutfs: remove old dlm make -I
We no longer need the -I include arguments for a dlm build.

Signed-off-by: Zach Brown <zab@versity.com>
2017-08-03 15:07:23 -07:00
Zach Brown
65c3ac5043 scoutfs: Add cluster locking to node/file ops
This gives us cluster locking for the overwhelming majority of metadata ops
that scoutfs supports. In particular, we can create and modify file metadata
from one node and immediately see the changes reflected on another node.

In addition to synchronization, the cluster locks here provide an I/O
endpoint for our item cache, ensuring that it doesn't read stale items.

Readdir and file read/write are notable exceptions - they require a more
specific approach and will be implemented in a future patch.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
[fixed iget unlock and truncated commit message summary]
Signed-off-by: Zach Brown <zab@versity.com>
2017-08-03 11:16:35 -07:00
Zach Brown
172cff5537 scoutfs: return -ENODATA from getxattr
The conversion to the multi-item xattrs accidentally returned -EIO when
an attribute wasn't found instead of -ENODATA.  That broke a huge number
of xfstests because ls can look up xattrs and then fail with EIO.

Signed-off-by: Zach Brown <zab@versity.com>
2017-08-02 11:16:12 -07:00
Mark Fasheh
325eadca9f scoutfs: check for NULL lock in scoutfs_unlock
This reduces the amount of duplicate code in callers and makes error
handling easier. The alternative is to sprinkle the code with 'if (lock)'
lines at the end of our functions.
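
The change is just an early return (a sketch; the exact signature is
approximate):

 struct super_block;
 struct scoutfs_lock;

 void scoutfs_unlock(struct super_block *sb, struct scoutfs_lock *lock)
 {
         /* error paths can now call this unconditionally */
         if (!lock)
                 return;

         /* drop the dlm lock and free as before */
 }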

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-07-27 12:33:21 -07:00
Mark Fasheh
4ff2148f10 scoutfs: Don't use stale root in get_manifest_refs
get_manifest_refs was using the btree root in its stale copy of the
super block.  It is supposed to use the btree root that it was given by
its caller who went to the trouble of finding a sufficiently current
btree root.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
[zab: added commit message and fixed formatting]
Signed-off-by: Zach Brown <zab@versity.com>
2017-07-27 12:32:05 -07:00
Mark Fasheh
a65b28d440 scoutfs: lock impossible ino group for listen lock
Otherwise we get into a problem where the listen lock is conflicting with
regular inode group requests. Since we never drop the listen lock and it (by
design) blocks progress on another node, those inode group requests may
hang.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-07-19 19:04:41 -05:00
Mark Fasheh
2d11f08f5e scoutfs: Remove unused functions, scoutfs_[un]lock_addr
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-07-19 19:04:41 -05:00
Zach Brown
13ebd8d18c scoutfs: don't use delayed downconvert work
The delayed downconvert work wasn't being canceled on shutdown.  60s
after unmount, at least the net lock's timer would fire and crash trying
to queue the delayed work on the destroyed workqueue.

Proactively unlocking the locks isn't always beneficial to begin with.
The relative costs of mispredicting the future are wildly different if
we have to re-read item caches from segments or have to downconvert a
blocking read lock.

So we can just remove the delayed work to fix the bug and remove a
moving piece that would need to be considered and tuned.  There's still
a race where we can get basts after destroying the workqueue but before
we destroy the lockspace; we'll get there.

Signed-off-by: Zach Brown <zab@versity.com>
2017-07-19 13:30:03 -07:00
Zach Brown
47b26d7888 scoutfs: add end to _item_delete
Add the end argument to scoutfs_item_delete() to limit how many items it
will read into the cache.

Signed-off-by: Zach Brown <zab@versity.com>
2017-07-19 13:30:03 -07:00
Zach Brown
d5b4677e7f scoutfs: add end to _dirty, _delete_many, _update
These transformations are mechanical and there aren't many callers of
these so we combine them into one commit.

Signed-off-by: Zach Brown <zab@versity.com>
2017-07-19 13:30:03 -07:00
Zach Brown
d78ed098a7 scoutfs: add cache reading limit to _set_batch
Add an end argument to _set_batch to specify the limit of
items we'll read into the cache.

And it turns out that the loop in _set_batch that was meant to cache all
the items covered by the batch didn't try hard enough.  It would stop
once the first key was covered but didn't make sure that the coverage
extended to cover the last key.  This can happen if segment boundaries happen to
fall within the items that make up the batch.  Fix it up while we're in
here.

Signed-off-by: Zach Brown <zab@versity.com>
2017-07-19 13:30:03 -07:00
Zach Brown
0b64a4c83f scoutfs: lock inode index item iteration
Add locks around inode index item iteration.  This is tricky because the
inode index items are enormous and we can't default to coarse locks that
let it read and iterate over the entire key space.  We use the manifest
to find the next small fixed size region to lock and iterate from.

Signed-off-by: Zach Brown <zab@versity.com>
2017-07-19 13:30:03 -07:00
Zach Brown
f611c769e2 scoutfs: add 'end' to item_next to limit reads
Add an end key to the item_next calls to limit how many items will be
read into the cache.  Callers typically get this from the lock they hold
that covers the iteration.  We differentiate between iteration and
caching so that a series of small iterations (listxattr on inodes,
namespace walk in small dirs) can be satisfied by a single read of
adjacent items from segments.
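
The calling pattern then looks roughly like this (a fragment with
approximate names; lock->end is the lock's covered range and 'last' is
the caller's own iteration limit):

 /* the lock's end key caps caching; 'last' bounds this iteration */
 key = first;
 for (;;) {
         ret = scoutfs_item_next(sb, &key, &lock->end, &val);
         if (ret == -ENOENT || scoutfs_key_cmp(&key, &last) > 0)
                 break;
         /* use the item, then advance key and continue */
 }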

Signed-off-by: Zach Brown <zab@versity.com>
2017-07-19 13:30:03 -07:00
Zach Brown
4f6f842efa scoutfs: add inode index item locking
Add a locking wrapper for the inode index items.  It maps the index
fields to a lock name for each index type.

Signed-off-by: Zach Brown <zab@versity.com>
2017-07-19 13:30:03 -07:00
Zach Brown
c80dd579e1 scoutfs: add scoutfs_manifest_next_key
Add an item reading variant that just returns the next key that it finds
in segments after the given key.  This will be used during iteration
to find the next key to lock and then iterate towards.

Signed-off-by: Zach Brown <zab@versity.com>
2017-07-19 13:30:03 -07:00
Zach Brown
19171f7a25 scoutfs: add end to _item_lookup
The item cache can only be populated with items that are covered by
locks.  Require callers to provide the farthest key that can be covered
by the locks.  Locks provide a key for exactly this purpose.

Signed-off-by: Zach Brown <zab@versity.com>
2017-07-19 13:30:03 -07:00
Zach Brown
67cc4fb697 scoutfs: allow NULL end around read_items
Let both check_range and read_items take a NULL end.  check_range just
doesn't do anything with the end of the range.  read_items defaults
to trying to read as many items as it can but clamps to the extent of
the segments that intersect with the key.

This will let us incrementally add end arguments to the item functions
that are initially passed in as NULL in callers as we add lock coverage.

Signed-off-by: Zach Brown <zab@versity.com>
2017-07-19 13:30:03 -07:00
Zach Brown
11a8570117 scoutfs: remove our copy of the dlm
We don't need the dlm to track key ranges if we implement ranges by
mapping keys to resources which represent ranges of the key space.

Signed-off-by: Zach Brown <zab@versity.com>
2017-07-19 13:30:03 -07:00
Zach Brown
8a42a4d75a scoutfs: introduce lock names
Instead of locking one resource with ranges we'll have callers map their
logical resources to a tuple name that we'll store in lock resources.
The names still map to ranges for cache reading and cache invalidation
but the ranges aren't exposed to the DLM.  This lets us use the stock
DLM and distribute resources across masters.

Signed-off-by: Zach Brown <zab@versity.com>
2017-07-19 13:30:03 -07:00
Zach Brown
6de2bfc1c5 scoutfs: use the dlm mode/levels directly
We intend to use more of the dlm lock levels.  Let's use its modes
directly so we don't have to maintain a mental map from differently
named modes.

Signed-off-by: Zach Brown <zab@versity.com>
2017-07-19 13:30:03 -07:00
Zach Brown
8d29c82306 scoutfs: sort keys by zone, then inode, then type
Holding a DLM lock protects a range of the key space.  The DLM locks
span inodes or regions of inodes.  We need the sort order in LSM items
to match the DLM range keys so that we can read all the items covered by
a lock into the cache from a region of LSM segments.  If their orders
differered then we'd have to jump around segments to find all the items
covered by a given DLM lock.

Previously we were sorting by type then, within types, by inode.  Now we
want to sort by inode then by type.  But there are structures which
previously had a type but weren't then sorted by inode.  We introduce
zones as the primary sort key.  Inode index and node zones are sorted by
the inode fields and node ids respectively.  Then comes the fs zone,
sorted first by inode and then by the type of the key.

The bulk of this is the mechanical introduction of the zone field to the
keys, moving the type field down, and a bulk rename of _KEY to _TYPE.
But there are some more substantial changes.

The orphan keys needed to be put in a zone.   They fit in the NODE zone
which is all about resources that nodes hold and would need to be
cleaned up if the node went away.

The key formatting is significantly changed to match the new sort order.
Formatted keys are now generally of the form "zone.primary.type..."

And finally with the keys now properly sorted by inodes we can correctly
construct a single range of item cache keys to invalidate when unlocking
the inode group locks.
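
The resulting key layout is roughly (field widths are illustrative):

 #include <linux/types.h>

 struct scoutfs_key {
         __u8    zone;           /* inode index, node, or fs */
         __be64  ino;            /* primary sort within the fs zone */
         __u8    type;           /* dirent, xattr, extent, ... */
         __be64  offset;
 } __attribute__((packed));

 /* with big-endian fields in this order a plain memcmp() sorts by
  * zone, then inode, then type, so one contiguous cache range can
  * cover everything under an inode group lock */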

Signed-off-by: Zach Brown <zab@versity.com>
2017-07-19 13:30:03 -07:00
Zach Brown
690049c293 scoutfs: add GET_MANIFEST_ROOT network op
We're going to need to be able to sample the current stable manifest
root occasionally.  We're adding it now because we don't yet
have the lock plumbing that would provide the lvb.  Eventually
this call will bubble up into the locking and the root will be
stored in the lock instead of always requested.

Signed-off-by: Zach Brown <zab@versity.com>
2017-07-19 13:30:03 -07:00
Zach Brown
412e7a7e3b scoutfs: remove unused ring log storage
Remove the old unused ring now that all of its previous callers use the
btree.

Signed-off-by: Zach Brown <zab@versity.com>
2017-07-08 10:59:40 -07:00
Zach Brown
c2f13ccf24 scoutfs: have net.c commit btree blocks
Convert the net server metadata dirtying and committing code to use the
btree instead of the ring.  It has to be careful to set up and tear down
the btree info as it starts up and shuts down the server.

This fixes up some questionable setup/teardown changes made in the
previous patches to convert the manifest and allocator.  We could rebase
the patches to merge those together.  But given that the previous
patches don't work at all without the net updates it might not be worth
the trouble.

Signed-off-by: Zach Brown <zab@versity.com>
2017-07-08 10:59:40 -07:00
Zach Brown
ff5a094833 scoutfs: store allocator regions in btree
Convert the segment allocator to store its free region bitmaps in the
btree.

This is a very straightforward mechanical transformation.  We split the
allocator region into a big-endian index key and the bitmap value
payload.  We're careful to operate on aligned copies of the bitmaps so
that they're long aligned.

We can remove all the funky functions that were needed when writing the
ring.  All we're left with is a call to apply the pending allocations to
dirty btree blocks before writing the btree.
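
The split looks roughly like this (names and sizes are illustrative):

 #include <linux/types.h>

 #define REGION_BITS     4096

 struct alloc_region_key {
         __be64  index;          /* big-endian so raw keys sort */
 } __attribute__((packed));

 struct alloc_region_val {
         /* copied to a naturally aligned buffer before bitops */
         __le64  bits[REGION_BITS / 64];
 };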

Signed-off-by: Zach Brown <zab@versity.com>
2017-07-08 10:59:40 -07:00
Zach Brown
fc50072cf9 scoutfs: store manifest entries in the btree
Convert the manifest to store entries in persistent btree keys and
values instead of using the rbtree in memory from the ring.

The btree doesn't have a sort function.  It just compares variable
length keys.  The most complicated part of this transformation is
dealing with the fallout of this.  The compare function can't compare
different search keys and item keys so searches need to construct full
synthetic btree keys to search.  It also can't return different
comparisons, like overlapping, so the caller needs to do a bit more work
to use key comparisons to find overlapping segments.  And it can't
compare differently depending on the level of the manifest so we store
the manifest in keys differently depending on whether it's in level 0 or
not.

All mount clients can now see the manifest blocks.  They can query the
manifest directly when trying to find segments to read.  We can get rid
of all the networking calls that were finding the segments for readers.

We change the manifest functions that relied on the ring to make
changes in the manifest persistent.  We don't touch the allocator
or the rest of the manifest server, though, so this commit breaks the
world.  It'll be restored in future patches as we update the segment
allocator and server to work with the btree.

Signed-off-by: Zach Brown <zab@versity.com>
2017-07-08 10:59:40 -07:00
Zach Brown
3eaabe81de scoutfs: add btree stored in persistent ring
Add a cow btree whose blocks are stored in a persistently allocated
ring.   This will let us incrementally index very large data sets
efficiently.

This is an adaptation of the previous btree code which now uses the
ring, stores variable length keys, and augments the items with bits that
are ORed up through parents.

Signed-off-by: Zach Brown <zab@versity.com>
2017-07-08 10:59:40 -07:00
Mark Fasheh
eb439ccc01 scoutfs: s/lck/lock/ lock.[ch]
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-07-06 18:18:58 -05:00
Mark Fasheh
136cbbed29 scoutfs: only release lockspace/workqueues in lock_destroy if they exist
Mount failure means these might be NULL.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-07-06 18:17:22 -05:00
Mark Fasheh
19f6f40fee scoutfs: get rid of held_locks construct
Now that we have a dlm, this is a needless indirection. Merge all fields
back into the lock_info struct.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-07-06 18:17:22 -05:00