We only want the generic stuff. Long term, the ocfs2-specific code would be
what's left in fs/ocfs2/dlmglue.[ch].
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
Like the mtime index, this index is unused. Removing it is a near-identical
task. Running the same createmany test from our last
patch gives us the following:
$ createmany -o '/scoutfs/file_%lu' 10000000
total: 10000000 creates in 598.28 seconds: 16714.59 creates/second
real 9m58.292s
user 0m7.420s
sys 5m44.632s
So after both indices are gone, we go from a 12m56s run time to 9m58s,
saving almost 3 minutes, or about a 23% reduction in total run time.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
This index is unused - we can gain some create performance by removing it.
To verify this, I ran createmany for 10 million files:
$ createmany -o '/scoutfs/file_%lu' 10000000
Before this patch:
total: 10000000 creates in 776.54 seconds: 12877.56 creates/second
real 12m56.557s
user 0m7.861s
sys 6m56.986s
After this patch:
total: 10000000 creates in 691.92 seconds: 14452.46 creates/second
real 11m31.936s
user 0m7.785s
sys 6m19.328s
So removing the index gained us about a minute and a half on the test or a
12% performance increase.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
Truncation updates extents that intersect with the input range. It
starts with the first block in the range and iterates until it has
searched for all the extents that could cover the range.
Extents are stored in items keyed by their final block location so that
we can use _next to find intersections. Truncation was searching for the
next extent after the full extent covering the range it was still
searching. That means it was starting the search at the last block in
the range, not the first, so it would miss all the extents that didn't
overlap with that last block.
This is fixed by searching from a temporary single-block extent at the
start of the search range.
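A minimal sketch of the fix, with hypothetical names for the extent form
and the helper:

#include <linux/types.h>

/* hypothetical extent form; the real scoutfs extent items differ */
struct extent_range {
	u64 ino;
	u64 blk_off;	/* logical starting block */
	u64 blocks;	/* length in blocks */
};

/*
 * Build the search key from a temporary single-block extent at the start
 * of the remaining truncate range.  Extent items are keyed by their final
 * block, so a one-block extent at 'start' sorts at or before any stored
 * extent whose last block is >= start and _next finds every intersection.
 */
static void init_search_extent(struct extent_range *ext, u64 ino, u64 start)
{
	ext->ino = ino;
	ext->blk_off = start;
	ext->blocks = 1;
}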
Signed-off-by: Zach Brown <zab@versity.com>
Offline extents weren't being merged because they all had their physical
blkno set to 0 and none of the extent calculations treated them
specially. They would only merge if the physical blocks of two extents
were contiguous. Instead of special casing offline extents everywhere we
store them with a physical blkno set to the logical blk_off. This lets
all the current extent calculations work as expected.
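A sketch of why this works, with made-up field names: once an offline
extent's blkno equals its blk_off, the unchanged physical contiguity test
merges adjacent offline extents for free.

#include <linux/types.h>

/* hypothetical fields; the real scoutfs extent items differ */
struct ext {
	u64 blk_off;	/* logical start */
	u64 blkno;	/* physical start; equals blk_off when offline */
	u64 blocks;	/* length in blocks */
	bool offline;
};

/* the existing contiguity test, untouched, now merges offline neighbours */
static bool extents_mergeable(struct ext *left, struct ext *right)
{
	return left->offline == right->offline &&
	       left->blk_off + left->blocks == right->blk_off &&
	       left->blkno + left->blocks == right->blkno;
}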
Signed-off-by: Zach Brown <zab@versity.com>
Release tries to reinstate extents if it sees an error during release.
Those item manipulations need to be covered by the transaction.
Signed-off-by: Zach Brown <zab@versity.com>
The existing release interface specified byte regions to release but
that didn't match what the underlying file data mapping structure is
capable of. What happens if you specify a single byte to release? Does
it release the whole block? Does it release nothing? Does it return an
error?
By making the interface match the capability of the operation we make
the functioning of the system that much more predictable. Callers are
forced to think about implementing their desires in terms of block
granular releasing.
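For illustration, a hypothetical block-granular argument struct; the real
scoutfs ioctl layout may differ:

#include <linux/types.h>
#include <linux/errno.h>

/* callers name whole blocks, so there's no partial-block ambiguity left */
struct release_blocks_args {
	__u64 block;	/* first file block to release */
	__u64 count;	/* number of blocks to release */
};

static int check_release_args(struct release_blocks_args *args)
{
	if (args->block + args->count < args->block)
		return -EINVAL;	/* only overflow is left to reject */
	return 0;
}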
Signed-off-by: Zach Brown <zab@versity.com>
The ->statfs method was still using the super_block in the super_info
that was read during mount. This will get progressively more out
of date.
We add a network message to ask the server for the current fields that
impact statfs. This is always racy and the fields are mostly nonsense,
but we try our best.
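Roughly, the new path looks like this; the request helper, reply struct,
and fields are hypothetical:

#include <linux/fs.h>
#include <linux/statfs.h>

/* hypothetical reply payload filled in by the net client */
struct net_statfs_reply {
	__le64 total_blocks;
	__le64 free_blocks;
	__le64 total_inodes;
};

/* hypothetical request helper implemented by the net client code */
int client_statfs_request(struct super_block *sb,
			  struct net_statfs_reply *reply);

/* ask the server at call time instead of using the mount-time super copy */
static int scoutfs_statfs(struct dentry *dentry, struct kstatfs *kst)
{
	struct super_block *sb = dentry->d_sb;
	struct net_statfs_reply reply;
	int ret;

	ret = client_statfs_request(sb, &reply);
	if (ret)
		return ret;

	kst->f_blocks = le64_to_cpu(reply.total_blocks);
	kst->f_bfree = le64_to_cpu(reply.free_blocks);
	kst->f_bavail = kst->f_bfree;
	kst->f_files = le64_to_cpu(reply.total_inodes);

	return 0;
}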
Signed-off-by: Zach Brown <zab@versity.com>
Delete inode index items when deleting all the items associated with an
inode after it's been unlinked and had all its references dropped.
The index items should always match the fields in the inode item so we
read it to determine the index items that should be deleted, regardless
of whether we have the vfs inode cached or not. We take the opportunity
to collapse the two callers that looked up the inode before deleting its
items into the item deletion path itself so that it can use the inode
fields.
The deletion of index items is partially verified by an inode index test
in xfstests which makes sure that unlinked files are no longer present
in the index.
Signed-off-by: Zach Brown <zab@versity.com>
Directories were getting added to the data_seq index. It might have
looked like they weren't because their data_seqs were always 0 but when
inodes are created they don't have 'have_item' set so all the fields are
added regardless of their current value.
We'd rather not have to wade through directories when looking for
regular file data in the data_seq index, so let's explicitly test for
regular files when updating the data_seq index items.
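The fix boils down to a check like this when deciding whether to maintain
data_seq index items (the helper name is made up):

#include <linux/fs.h>

/* only regular files carry data_seq index items */
static bool should_index_data_seq(struct inode *inode)
{
	return S_ISREG(inode->i_mode);
}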
Signed-off-by: Zach Brown <zab@versity.com>
The updating of the inode index items was racy. It loaded the inode
values, updated the items, loaded the fields again, and then stored the
fields in the inode info. All without locking. Concurrent attempts
could get the fields scrambled and racing with other paths that update
the fields could get the items and inode info out of sync.
This fixes up the two races by only reading the inode fields once and
performing the multi-stage update under a mutex. We add a new lock to
avoid ordering problems with trying to add an existing lock at these
points in the locking hierarchy. We specifically use a mutex because
the item functions can block.
Now the inode index field update just has to safely race with concurrent
access to the fields.
This was found by generic/037 once getattr started refreshing the inode.
It now passes again.
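Roughly, the locked update looks like this; all the names here are
hypothetical stand-ins for the real scoutfs inode info and item calls:

#include <linux/mutex.h>
#include <linux/types.h>

/* hypothetical stand-in for the per-inode info the commit describes */
struct idx_info {
	struct mutex index_mutex;	/* new lock covering the update */
	u64 meta_seq;
	u64 data_seq;
};

int update_index_items(u64 meta_seq, u64 data_seq);	/* hypothetical */

/* read the fields once, then do the whole multi-stage update under the mutex */
static int update_index(struct idx_info *si)
{
	u64 meta_seq, data_seq;
	int ret;

	mutex_lock(&si->index_mutex);
	meta_seq = si->meta_seq;
	data_seq = si->data_seq;
	ret = update_index_items(meta_seq, data_seq);
	mutex_unlock(&si->index_mutex);

	return ret;
}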
Signed-off-by: Zach Brown <zab@versity.com>
Use MODULE_ALIAS_FS() to register the "scoutfs" fs alias so that
modprobe can find the module if it's installed and visible to depmod.
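The change amounts to a one-liner alongside the module's other MODULE_*
declarations (MODULE_ALIAS_FS() comes from linux/fs.h):

MODULE_ALIAS_FS("scoutfs");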
We don't yet have clever enough xfstests to mess around with modules. I
manually verified this by installing the module in /lib/modules and
trying mount -t scoutfs before and after the change.
Signed-off-by: Zach Brown <zab@versity.com>
The networking code was really suffering from trying to combine the
client and server processing paths into one file. The code can be a lot
simpler by giving the client and server their own processing paths that
take their different socket lifecycles into account.
The client maintains a single connection. Blocked senders work on the
socket under a sending mutex. The recv path runs in work that can be
canceled after first shutting down the socket.
A long running server work function acquires the listener lock, manages
the listening socket, and accepts new sockets. Each accepted socket has
a single recv work blocked waiting for requests. That then spawns
concurrent processing work which sends replies under a sending mutex.
All of this is torn down by shutting down sockets and canceling work
which frees its context.
All this restructuring makes it a lot easier to track what is happening
in mount and unmount between the client and server. This fixes bugs
where unmount was failing because the monolithic socket shutdown
function was queueing other work while it was itself being drained.
Signed-off-by: Zach Brown <zab@versity.com>
The rhashtable API has changed over time. Continuing to use it means
having to worry about maintaining different APIs in different kernel
generations.
We have a static pool of cursors so we don't need the flexibility of the
resizable rhashtable. We can roll a simple array of hlist heads to use
as a hash table.
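A minimal sketch of the replacement, assuming a fixed bucket array; the
sizes and struct fields here are illustrative:

#include <linux/hash.h>
#include <linux/list.h>
#include <linux/types.h>

#define CURSOR_HASH_BITS 6	/* illustrative; sized for the static pool */

/* hypothetical cursor shape; only the hashing matters here */
struct cursor {
	struct hlist_node hnode;
	u64 key;
};

static struct hlist_head cursor_hash[1 << CURSOR_HASH_BITS];

static struct hlist_head *cursor_bucket(u64 key)
{
	return &cursor_hash[hash_64(key, CURSOR_HASH_BITS)];
}

/* callers already serialize around the pool, so no bucket locking is shown */
static struct cursor *cursor_lookup(u64 key)
{
	struct cursor *curs;

	hlist_for_each_entry(curs, cursor_bucket(key), hnode)
		if (curs->key == key)
			return curs;

	return NULL;
}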
And finally, these cursors will probably disappear eventually anyway.
Let's not invest too much in them.
Signed-off-by: Zach Brown <zab@versity.com>
Raw [su]{8,16,32,64} types keep leaking into our exported headers where
they break userspace builds. Make sure that we only use the exported __
types and add a check to break our build if we get it wrong.
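For example, a struct in an exported header should only use the __
prefixed types that userspace sees (illustrative struct, not one of ours):

#include <linux/types.h>

struct example_ondisk {
	__le64 blkno;
	__le32 flags;
	__u8   level;
	__u8   _pad[3];
};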
Signed-off-by: Zach Brown <zab@versity.com>
It's handy to quickly find the git commit that built a given module. We
add a MODULE_INFO() tag for it so we can see it in modinfo on the built
module. We also add an ELF note that the kernel exposes in
/sys/module/$m/notes/ when the module is loaded.
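The modinfo side is just a MODULE_INFO() entry, roughly like this; the
tag name is illustrative and the value would normally be generated from
git describe by the build:

#include <linux/module.h>

#define SCOUTFS_GIT_DESCRIBE "v0.0-0-g0000000"	/* normally set by the build */

MODULE_INFO(scoutfs_git_describe, SCOUTFS_GIT_DESCRIBE);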
Signed-off-by: Zach Brown <zab@versity.com>
This gives us cluster locking for the overwhelming majority of metadata ops
that scoutfs supports. In particular, we can create and modify file metadata
from one node and immediately see the changes reflected on another node.
In addition to synchronization, the cluster locks here are providing an
I/O endpoint for our item cache, ensuring that it doesn't read stale
items.
Readdir and file read/write are notable exceptions - they require a more
specific approach and will be implemented in a future patch.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
[fixed iget unlock and truncated commit message summary]
Signed-off-by: Zach Brown <zab@versity.com>
The conversion to the multi-item xattrs accidentally returned -EIO when
an attribute wasn't found instead of -ENODATA. That broke a huge number
of xfstests because ls can look up xattrs and return EIO.
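The fix is the usual translation where the item lookup misses; a sketch
with a hypothetical lookup helper:

#include <linux/errno.h>
#include <linux/types.h>

/* hypothetical item lookup; the real multi-item xattr code differs */
int lookup_xattr_items(const char *name, void *buffer, size_t size);

static int getxattr_sketch(const char *name, void *buffer, size_t size)
{
	int ret = lookup_xattr_items(name, buffer, size);

	/* a missing item means the attribute doesn't exist: that's
	 * -ENODATA to the vfs, not -EIO */
	if (ret == -ENOENT)
		ret = -ENODATA;

	return ret;
}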
Signed-off-by: Zach Brown <zab@versity.com>
This reduces the amount of duplicate code in callers and makes error
handling easier. The alternative is to sprinkle the code with 'if (lock)'
lines at the end of our functions.
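One plausible shape of the change, with hypothetical names: the unlock
helper tolerates a NULL lock so error paths can call it unconditionally.

#include <linux/fs.h>

struct scoutfs_lock;	/* hypothetical; only the pointer matters here */

void scoutfs_unlock(struct super_block *sb, struct scoutfs_lock *lock)
{
	if (!lock)
		return;

	/* ... drop the cluster lock and free it ... */
}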
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
get_manifest_refs was using the btree root in its stale copy of the
super block. It is supposed to use the btree root that it was given by
its caller who went to the trouble of finding a sufficiently current
btree root.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
[zab: added commit message and fixed formatting]
Signed-off-by: Zach Brown <zab@versity.com>
Otherwise we get into a problem where the listen lock is conflicting with
regular inode group requests. Since we never drop the listen lock and it (by
design) blocks progress on another node, those inode group requests may
hang.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
The delayed downconvert work wasn't being canceled on shutdown. 60s
after unmount, at least the net lock's timer would fire and crash trying
to queue the delayed work on the destroyed workqueue.
Proactively unlocking the locks isn't always beneficial to begin with.
The relative costs of mispredicting the future are wildly different if
we have to re-read item caches from segments or have to downconvert a
blocking read lock.
So we can just remove the delayed work to fix the bug and remove a
moving piece that would need to be considered and tuned. There's still
a race where we can get basts after destroying the workqueue but before
we destroy the lockspace; we'll get there.
Signed-off-by: Zach Brown <zab@versity.com>
These transformations are mechanical and there aren't many callers of
these so we combine them into one commit.
Signed-off-by: Zach Brown <zab@versity.com>
Add an end argument to _set_batch to specify the limit of
items we'll read into the cache.
And it turns out that the loop in _set_batch that was meant to cache all
the items covered by the batch didn't try hard enough. It would stop once
the first key was covered but didn't make sure that the coverage
extended to cover last. This can happen if segment boundaries happen to
fall within the items that make up the batch. Fix it up while we're in
here.
Signed-off-by: Zach Brown <zab@versity.com>
Add locks around inode index item iteration. This is tricky because the
inode index items are enormous and we can't default to coarse locks that
let it read and iterate over the entire key space. We use the manifest
to find the next small fixed size region to lock and iterate from.
Signed-off-by: Zach Brown <zab@versity.com>
Add an end key to the item_next calls to limit how many items will be
read into the cache. Callers typically get this from the lock they hold
that covers the iteration. We differentiate between iteration and
caching so that a series of small iterations (listxattr on inodes,
namespace walk in small dirs) can be satisfied by a single read of
adjacent items from segments.
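A hedged sketch of the resulting call shape; the prototype and type names
are illustrative rather than the exact scoutfs API:

#include <linux/fs.h>
#include <linux/uio.h>

struct item_key;	/* hypothetical key type */

/*
 * 'last' bounds the caller's iteration while 'end', usually taken from
 * the covering lock, bounds how far ahead items may be read into the
 * cache so nearby small iterations hit the cache.
 */
int item_next(struct super_block *sb, struct item_key *key,
	      struct item_key *last, struct item_key *end, struct kvec *val);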
Signed-off-by: Zach Brown <zab@versity.com>
Add a locking wrapper for the inode index items. It maps the index
fields to a lock name for each index type.
Signed-off-by: Zach Brown <zab@versity.com>
Add an item reading variant that just returns the next key that it finds
in segments after the given key. This will be used while iterating
to find the next key to lock and then try to iterate towards.
Signed-off-by: Zach Brown <zab@versity.com>
The item cache can only be populated with items that are covered by
locks. Require callers to provide the farthest key that can be covered
by the locks. Locks provide a key for exactly this purpose.
Signed-off-by: Zach Brown <zab@versity.com>
Let both check_range and read_items take a NULL end. check_range just
doesn't do anything with the end of the range. read_items defaults
to trying to read as many items as it can but clamps to the extent of
the segments that intersect with the key.
This will let us incrementally add end arguments to the item functions
that are initially passed in as NULL in callers as we add lock coverage.
Signed-off-by: Zach Brown <zab@versity.com>
We don't need the dlm to track key ranges if we implement ranges by
mapping keys to resources which represent ranges of the key space.
Signed-off-by: Zach Brown <zab@versity.com>
Instead of locking one resource with ranges we'll have callers map their
logical resources to a tuple name that we'll store in lock resources.
The names still map to ranges for cache reading and cache invalidation
but the ranges aren't exposed to the DLM. This lets us use the stock
DLM and distribute resources across masters.
Signed-off-by: Zach Brown <zab@versity.com>
We intend to use more of the dlm lock levels. Let's use its modes
directly so we don't have to maintain a mental map from differently
named modes.
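The stock modes come straight from linux/dlmconstants.h; for example (the
particular mapping shown is illustrative):

#include <linux/dlmconstants.h>

/* dlm modes used directly instead of our own differently named modes */
static const int mode_read  = DLM_LOCK_PR;	/* protected read, shared */
static const int mode_write = DLM_LOCK_EX;	/* exclusive */
static const int mode_null  = DLM_LOCK_NL;	/* compatible with everything */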
Signed-off-by: Zach Brown <zab@versity.com>
Holding a DLM lock protects a range of the key space. The DLM locks
span inodes or regions of inodes. We need the sort order in LSM items
to match the DLM range keys so that we can read all the items covered by
a lock into the cache from a region of LSM segments. If their orders
differered then we'd have to jump around segments to find all the items
covered by a given DLM lock.
Previously we were sorting by type then, within types, by inode. Now we
want to sort by inode then by type. But there are structures which
previously had a type but weren't then sorted by inode. We introduce
zones as the primary sort key. Inode index and node zones are sorted by
the inode fields and node ids respectively. Then comes the fs zone,
sorted first by inode and then by the type of the key.
The bulk of this is the mechanical introduction of the zone field to the
keys, moving the type field down, and a bulk rename of _KEY to _TYPE.
But there are some more substantial changes.
The orphan keys needed to be put in a zone. They fit in the NODE zone
which is all about resources that nodes hold and would need to be
cleaned up if the node went away.
The key formatting is significantly changed to match the new sort order.
Formatted keys are now generally of the form "zone.primary.type..."
And finally with the keys now properly sorted by inodes we can correctly
construct a single range of item cache keys to invalidate when unlocking
the inode group locks.
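A hedged sketch of the resulting key layout; field widths and trailing
fields are illustrative, the point is the sort order:

#include <linux/types.h>

/*
 * zone sorts first, then the zone's primary value (index value, node id,
 * or inode number), then the key type.  Big-endian fields let a simple
 * memcmp() of the packed bytes give the sort order.
 */
struct key_sketch {
	__u8   zone;
	__be64 primary;
	__u8   type;
	/* type-specific fields follow */
} __attribute__((packed));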
Signed-off-by: Zach Brown <zab@versity.com>
We're going to need to be able to sample the current stable manifest
root occasionally. We're adding it now because we don't yet
have the lock plumbing that would provide the lvb. Eventually
this call will bubble up into the locking and the root will be
stored in the lock instead of always requested.
Signed-off-by: Zach Brown <zab@versity.com>
Convert the net server metadata dirtying and committing code to use the
btree instead of the ring. It has to be careful to set up and tear down
the btree info as it starts up and shuts down the server.
This fixes up some questionable setup/teardown changes made in the
previous patches to convert the manifest and allocator. We could rebase
the patches to merge those together. But given that the previous
patches don't work at all without the net updates it might not be worth
the trouble.
Signed-off-by: Zach Brown <zab@versity.com>
Convert the segment allocator to store its free region bitmaps in the
btree.
This is a very straightforward mechanical transformation. We split the
allocator region into a big-endian index key and the bitmap value
payload. We're careful to operate on aligned copies of the bitmaps so
that they're long aligned.
We can remove all the funky functions that were needed when writing the
ring. All we're left with is a call to apply the pending allocations to
dirty btree blocks before writing the btree.
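The aligned-copy part looks roughly like this; the region size and the
meaning of set bits are assumptions here:

#include <linux/bitmap.h>
#include <linux/string.h>

#define REGION_BITS 1024	/* illustrative region size */

/* the bitmap lives in a btree item value that may not be long aligned,
 * so work on an aligned copy and store it back */
static int alloc_from_region(void *item_val)
{
	unsigned long bits[BITS_TO_LONGS(REGION_BITS)];
	unsigned long nr;

	memcpy(bits, item_val, sizeof(bits));
	nr = find_next_bit(bits, REGION_BITS, 0);	/* assume set == free */
	if (nr >= REGION_BITS)
		return -1;

	clear_bit(nr, bits);
	memcpy(item_val, bits, sizeof(bits));
	return nr;
}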
Signed-off-by: Zach Brown <zab@versity.com>
Convert the manifest to store entries in persistent btree keys and
values instead of using the rbtree in memory from the ring.
The btree doesn't have a sort function. It just compares variable
length keys. The most complicated part of this transformation is
dealing with the fallout of this. The compare function can't compare
different search keys and item keys so searches need to construct full
synthetic btree keys to search. It also can't return different
comparisons, like overlapping, so the caller needs to do a bit more work
to use key comparisons to find overlapping segments. And it can't
compare differently depending on the level of the manifest so we store
the manifest in keys differently depending on whether it's in level 0 or
not.
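For illustration only, and ignoring the level 0 case just described, the
searches build complete synthetic keys along these lines; the struct and
sizes are made up:

#include <linux/kernel.h>
#include <linux/string.h>
#include <linux/types.h>

#define MAX_ITEM_KEY_LEN 32	/* illustrative */

/* a full, zero-padded key in the stored form that the compare can use */
struct manifest_skey {
	__u8 level;
	__u8 key[MAX_ITEM_KEY_LEN];
} __attribute__((packed));

static void init_manifest_search_key(struct manifest_skey *skey, u8 level,
				     const void *key, unsigned int len)
{
	memset(skey, 0, sizeof(*skey));
	skey->level = level;
	memcpy(skey->key, key, min_t(unsigned int, len, sizeof(skey->key)));
}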
All mount clients can now see the manifest blocks. They can query the
manifest directly when trying to find segments to read. We can get rid
of all the networking calls that were finding the segments for readers.
We change the manifest functions that relied on the ring to instead make
their changes to the manifest persistent. We don't touch the allocator
or the rest of the manifest server, though, so this commit breaks the
world. It'll be restored in future patches as we update the segment
allocator and server to work with the btree.
Signed-off-by: Zach Brown <zab@versity.com>
Add a cow btree whose blocks are stored in a persistently allocated
ring. This will let us incrementally index very large data sets
efficiently.
This is an adaptation of the previous btree code which now uses the
ring, stores variable length keys, and augments the items with bits that
are ORed up through parents.
Signed-off-by: Zach Brown <zab@versity.com>
Now that we have a dlm, this is a needless indirection. Merge all fields
back into the lock_info struct.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>