Also clean up these traces a bit and make a lock_info trace class
which we can expand in a future patch.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
Most paths correctly calculated the bits to set in a parent item by
combining the bits set in the child with the half bit of the child
block's position in the ring.
The exception was the code that fixed up the parent item bits after
descent by walking the path. This mistake caused the parent item half
bits to be zero and prevented migration of blocks from the old half of
the ring.
Fix it by introducing a helper function to calculate the parent ref item
bits and consistently using it.
Signed-off-by: Zach Brown <zab@versity.com>
store_pos_bits() was trying to iterate over bits that were different
between the existing bits set in the item and the new bits that will be
set. It used an overly clever for_each helper that tried to only iterate
as many times as there were bits. But it messed up and only used ffs to
find the next bit on the first iteration. From then on it would iterate
over bits that weren't different. This caused the counts to be changed
when the bits hadn't changed, leaving them wildly wrong.
Fix this by using a much clearer loop. It still breaks out when there
are no more different bits and we're only using a few low bits so the
number of iterations is tiny.
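For reference, a minimal user-space style sketch of the clearer loop
shape, with hypothetical names standing in for the real item accounting:

#include <strings.h>    /* ffs() */

/*
 * Walk only the bits that differ between the old and new masks, adjusting
 * a per-bit count up or down, and stop as soon as no different bits remain.
 */
static void adjust_counts(unsigned int old_bits, unsigned int new_bits,
                          int *counts)
{
        unsigned int diff = old_bits ^ new_bits;
        int bit;

        while (diff) {
                bit = ffs(diff) - 1;
                if (new_bits & (1U << bit))
                        counts[bit]++;
                else
                        counts[bit]--;
                diff &= ~(1U << bit);
        }
}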
Signed-off-by: Zach Brown <zab@versity.com>
We were hitting asserts during deletion because the insertion that
updates a parent item didn't have enough room in the block.
Our btree has variable length keys. During a merge it's possible that
the items moved between the blocks can result in the final key of a
block changing from a small key to a large key. To update the parent
ref the parent block must have as much free space as the difference in
the key sizes.
We ensure free space in parents during descent by trying to split the
block. Deletion wasn't doing that, it was only trying to merge blocks.
We need to try to split as well as merge during deletion.
And we have to update the merge threshold so that we don't just split
the resulting block again if it doesn't have the min free space for a
new parent item.
Signed-off-by: Zach Brown <zab@versity.com>
Move to static mapping items instead of unbounded extents.
We get more predictable data structures and simpler code but still get
reasonably dense metadata.
We no longer need all the extent code needed to split and merge extents,
test for overlaps, and all that. The functions that use the mappings
(get_block, fiemap, truncate) now have a pattern where they decode the
mapping item into an allocated native representation, do their work, and
encode the result back into the dense item.
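As a rough sketch of that decode/modify/encode pattern (the mapping
structure and the decode/encode helpers here are illustrative
assumptions, not the real interfaces):

#include <linux/fs.h>
#include <linux/slab.h>

/* hypothetical native representation of one dense mapping item */
struct native_mapping {
        u64 first_iblock;
        u64 blknos[64];
};

/* assumed helpers that unpack and repack the dense item value */
int decode_mapping_item(struct inode *inode, u64 iblock,
                        struct native_mapping *map);
int encode_mapping_item(struct inode *inode, struct native_mapping *map);

static int set_mapped_block(struct inode *inode, u64 iblock, u64 blkno)
{
        struct native_mapping *map;
        int ret;

        map = kmalloc(sizeof(*map), GFP_NOFS);
        if (!map)
                return -ENOMEM;

        /* decode the dense item value into the allocated native form */
        ret = decode_mapping_item(inode, iblock, map);
        if (ret < 0)
                goto out;

        /* do the work; iblock is assumed to fall inside this item's range */
        map->blknos[iblock - map->first_iblock] = blkno;

        /* encode the result back into the dense item */
        ret = encode_mapping_item(inode, map);
out:
        kfree(map);
        return ret;
}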
We do have to grow the largest possible item value to fit the worst case
encoding expansion of random block numbers.
The local allocators are no longer two extents but are instead simple
bitmaps: one for full segments and one for individual blocks. There are
helper functions to free and allocate segments and blocks, with careful
coordination of, for example, freeing a segment once all of its
constituent blocks are free.
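A hedged sketch of that coordination on the free path, using the
kernel's bitmap helpers (the structure, names, and segment size are
assumptions):

#include <linux/bitmap.h>
#include <linux/bitops.h>

#define BLOCKS_PER_SEGMENT 64UL         /* assumed blocks per segment */

struct local_alloc {
        unsigned long *segment_bitmap;  /* bit set: whole segment is free */
        unsigned long *block_bitmap;    /* bit set: individual block is free */
};

/* the caller is assumed to hold the allocator's lock */
static void local_free_block(struct local_alloc *la, unsigned long blkno)
{
        unsigned long segno = blkno / BLOCKS_PER_SEGMENT;
        unsigned long first = segno * BLOCKS_PER_SEGMENT;

        set_bit(blkno, la->block_bitmap);

        /* once every block in the segment is free again, hand the whole
         * segment back to the segment bitmap and drop its per-block bits */
        if (find_next_zero_bit(la->block_bitmap, first + BLOCKS_PER_SEGMENT,
                               first) >= first + BLOCKS_PER_SEGMENT) {
                bitmap_clear(la->block_bitmap, first, BLOCKS_PER_SEGMENT);
                set_bit(segno, la->segment_bitmap);
        }
}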
_fiemap is refactored a bit to make it clearer what's going on.
There's one function that either merges the next block with the currently
building extent or fills the current one and starts recording from a
non-mergeable additional block. The old loop worked this way but was
implemented with a single squirrely iteration over the extents. This
wasn't feasible now that we're also iterating over blocks inside the
mapping items. It's a lot clearer to call out to merge or fill the
fiemap entry.
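A sketch of that merge-or-fill helper (the pending extent struct is
illustrative; fiemap_fill_next_extent() is the kernel interface the fill
would use):

#include <linux/fiemap.h>

/* the fiemap extent we're currently building up */
struct pending_fiemap {
        u64 logical;
        u64 phys;
        u64 len;
        u32 flags;
};

/*
 * Either extend the extent we're building with a contiguous block, or emit
 * the built extent and start recording from the non-mergeable block.
 */
static int merge_or_fill(struct fiemap_extent_info *fieinfo,
                         struct pending_fiemap *cur, u64 logical, u64 phys,
                         u64 len, u32 flags)
{
        int ret = 0;

        if (cur->len && flags == cur->flags &&
            logical == cur->logical + cur->len &&
            phys == cur->phys + cur->len) {
                cur->len += len;
                return 0;
        }

        if (cur->len)
                ret = fiemap_fill_next_extent(fieinfo, cur->logical,
                                              cur->phys, cur->len,
                                              cur->flags);

        cur->logical = logical;
        cur->phys = phys;
        cur->len = len;
        cur->flags = flags;

        return ret;
}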
The dirty item reservation counts for using the mappings are reduced
significantly because each modification no longer has to assume that it
might merge with two adjacent contiguous neighbours.
Signed-off-by: Zach Brown <zab@versity.com>
Add a function for updating the value of a dirty item. Callers can use
this to make changes without having to worry about failure.
Signed-off-by: Zach Brown <zab@versity.com>
The item cache maintains a count of the number of dirty items, keys, and
values. It updates the counts as it dirties and cleans items. There
are callers who want to modify the accounting directly instead of having
the accounting updated as a side effect of cleaning and re-dirtying the
item.
Signed-off-by: Zach Brown <zab@versity.com>
We were seeing warnings in destroy_workqueue() which meant that work was
queued on the server workqueue after it was drained and before it was
finally destroyed.
The only work that wasn't properly waited for was the commit work. It
looks like it'd be idle because the server receive threads all wait for
their request processing work to finish. But the way the commit work is
batched means that a request can have its commit processed by executing
commit work while leaving the work queued for another run.
Fix this by specifically waiting for the commit work to finish after the
server work has waited for all the recv and compaction work to finish.
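A minimal sketch of that shutdown ordering, with assumed struct and
field names; flush_work() and destroy_workqueue() are the kernel
primitives involved:

#include <linux/workqueue.h>

/* assumed server state holding the workqueue and the batched commit work */
struct server_info {
        struct workqueue_struct *wq;
        struct work_struct commit_work;
};

static void shutdown_server_workqueue(struct server_info *server)
{
        /* by this point all recv and compaction work has been waited for */

        /* the batched commit work can have processed a request's commit and
         * still be queued for another run, so wait for it explicitly */
        flush_work(&server->commit_work);

        destroy_workqueue(server->wq);
}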
I wasn't able to reliably trigger the assertion in repeated xfstests
runs. This change also survived many runs; let's see if it stops the
destroy_workqueue() assertion from triggering in the future.
Signed-off-by: Zach Brown <zab@versity.com>
The item count estimate functions didn't obviously differentiate between
adding to a count and resetting it. Most callers initialized the count
struct to 0 on the stack, incremented their estimate once, and passed it
in. The problem is that those same functions that increment once in
callers are also used in other estimates to build counts based on
multiple operations.
This tripped up the data truncate path. It looped and kept incrementing
its count while truncating a file with lots of extents until the count
got so large that it didn't fit in a segment by itself and blocked
forever.
This cleans up the item count code so that it's much harder to get
wrong. We differentiate between the SIC_*() high level count estimates
that are meant to be passed in to _hold_trans(), and the internal
__count_*() functions which are used to add up the item counts that make
up an aggregate operation.
With this fix the only way to use the count in extent truncation is to
correctly reset it to the item count for each transaction.
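The usage pattern the cleanup enforces looks roughly like this sketch;
the SIC macro, hold/release calls, and truncate helper are stand-ins
based on the description above, not the exact interfaces:

static int truncate_all_extents(struct super_block *sb, struct inode *inode)
{
        bool done = false;
        int ret = 0;

        while (!done) {
                /* each pass builds a fresh SIC_*() estimate for exactly one
                 * transaction instead of incrementing a count that lives
                 * across loop iterations */
                ret = scoutfs_hold_trans(sb, SIC_TRUNC_EXTENTS());
                if (ret)
                        break;

                ret = truncate_some_extents(sb, inode, &done);
                scoutfs_release_trans(sb);
                if (ret)
                        break;
        }

        return ret;
}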
Signed-off-by: Zach Brown <zab@versity.com>
We had a bug where a caller was slowly increasing their item count for
every transaction they attempted in a loop. Eventually the item count
grew to be too large to fit in a segment and they slept indefinitely.
Let's warn on invalid and impossibly large item counts as we enter
transactions.
Signed-off-by: Zach Brown <zab@versity.com>
This was used by compaction to wake local holders who were waiting for
compaction to free up level 0 segments for them to enter the
transaction. Throttling level 0 segment writes works differently now
and doesn't involve blocking trans holders.
Signed-off-by: Zach Brown <zab@versity.com>
All this does is uninitialize the dlmglue debug infrastructure.
However, we already do that in ocfs2_dlm_shutdown(), resulting in some
double frees on unmount.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
xfstests generic/028 was crashing dereferencing NULL locks. It'd hit
either rename trying to refresh an inode with a NULL lock or lookup
trying to pass a NULL lock's end to item lookup.
The addition of the lock LRU fixed a bug in lock_name_keys() where it
wouldn't drop a lock when _cluster_lock() returned an error. But it
always returned 0 instead of returning the error. Returning 0 without
setting the lock caused the callers to deref their NULL locks.
We also forcefully NULL the lock at the start of the function. It was
lucky that callers had already NULLed their locks. If they hadn't they
would have been following random on-stack memory and it might have been
harder to debug.
Signed-off-by: Zach Brown <zab@versity.com>
We're missing initialization of this field. It should never cause a
problem today because we always do a list add immediately afterwards
but let's be extra careful here and initialize it just in case. We
also add a sanity check in ocfs2_add_lockres_tracking() that the
lockres hasn't already been put on the debug list.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
The dlmglue lockres gets put on a debugging list at initialization time,
and taken off the list at uninit time. We were missing the uninit
portion of this cycle, causing some list debugging warnings.
Call ocfs2_lock_res_free() in free_scoutfs_lock(). In addition, we had
a raw kfree() of a scoutfs lock in find_alloc_scoutfs_lock() which also
needed to be replaced.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
We saw inode index queries spinning. They were finding no cached
entries in their locked region but the next key in the segments was in
the region. This can happen if an item has been deleted in the current
transaction. The query won't walk up into the new dirty seq but it
will try to walk the old seq. The item will still be in the segments
but won't be visible to item_next because it's marked deleted. The
query will spin finding the next stale key to read from and finding it
missing in the cache.
This is fixed by taking the current coherent cache at its word. When it
tells us there are no entries we advance the manifest check key past the
locked region. In this case it'll skip past the cached deletion item and
look for the next key in the segments.
Signed-off-by: Zach Brown <zab@versity.com>
The move to dlmglue necessitated removing the old LRU code. That's
fine; it didn't work anyway.
We can't drop locks in the shrinker directly, so instead we have the
shrinker put them on a workqueue where they are dropped.
The rules for the LRU are simple. Locks get a users count. Any process
that holds the lock or is in the process of acquiring the lock
increments this count. When unlock is called, the count is decremented.
We can use the value of this count to manage the LRU - scoutfs_unlock
puts locks on the LRU when the count reaches zero, lock_name_keys()
always takes them off.
If the lock is selected for reclaim, callers wanting to use the lock
will need to wait. We achieve this with a pair of flags.
SCOUTFS_LOCK_RECLAIM is used to indicate that the lock is now queued for
reclaim. Once the lock is ready to be destroyed, we set the
SCOUTFS_LOCK_DROPPED flag, telling callers to put the lock and retry
their rbtree search.
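A hedged sketch of the users-count rule; the structures and field names
are assumptions, only the list and spinlock primitives are the kernel's:

#include <linux/list.h>
#include <linux/spinlock.h>

struct lock_info {
        spinlock_t lock;
        struct list_head lru_list;
};

struct scoutfs_lock {
        unsigned int users;
        struct list_head lru_entry;
};

/* called from scoutfs_unlock(): the last user puts the lock on the LRU,
 * lock_name_keys() always takes it back off when a new user arrives */
static void put_lock_user(struct lock_info *linfo, struct scoutfs_lock *lck)
{
        spin_lock(&linfo->lock);
        if (--lck->users == 0)
                list_add_tail(&lck->lru_entry, &linfo->lru_list);
        spin_unlock(&linfo->lock);
}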
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
With the dlmglue transition over, we can finally remove some of the no
longer used portions of lock.[ch].
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
The staging ioctl is just a thin wrapper around writing. If we allowed
partial-block staging then the write would zero a newly allocated block
and only stage in the partial region of the block, leaving zeros in the
file that didn't exist before.
We prevent staging when the starting offset isn't block aligned. We
prevent staging when the final offset isn't block aligned unless it
matches the size because the stage ends in the final partial block of
the file.
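The offset rule boils down to a check like this sketch, where
SCOUTFS_BLOCK_SIZE (assumed to be a power of two) and the argument names
are assumptions:

#include <linux/types.h>

/* return true if the staging write's offsets are acceptable */
static bool stage_aligned(loff_t pos, size_t count, loff_t i_size)
{
        loff_t end = pos + count;

        /* the starting offset must be block aligned */
        if (pos & (SCOUTFS_BLOCK_SIZE - 1))
                return false;

        /* the final offset must be block aligned unless it matches the file
         * size, i.e. the stage ends in the final partial block of the file */
        if ((end & (SCOUTFS_BLOCK_SIZE - 1)) && end != i_size)
                return false;

        return true;
}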
This is verified by an xfstest (scoutfs/003) that is in flight.
Signed-off-by: Zach Brown <zab@versity.com>
We were seeing __block_write_begin spin when staging writes were called
after a read of an offline region saw an error. It turns out that
the way __block_write_begin iterates through buffer heads on a page will
livelock if b_size is 0.
Our get_block was clearing b_blocknr and b_size before doing anything.
It'd set them when it allocated blocks or found existing mapped
blocks. But it'd leave them 0 on an error and trigger this hang.
So we'll back off and only do the same things to the result bh that
ext2/3 do; presumably that's what's actually supported. We only set
mapped, set or clear new, and set b_size to less than the input b_size.
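A sketch of how constrained the result handling ends up; the mapping
lookup is a placeholder, while map_bh() and the buffer flag helpers are
the kernel's:

#include <linux/buffer_head.h>

/* assumed helper: returns < 0 on error, 0 for a hole, > 0 when mapped,
 * filling blkno/len and whether the blocks were newly allocated */
int lookup_or_alloc_mapping(struct inode *inode, sector_t iblock, int create,
                            u64 *blkno, u64 *len, bool *allocated);

static int sketch_get_block(struct inode *inode, sector_t iblock,
                            struct buffer_head *bh_result, int create)
{
        bool allocated;
        u64 blkno;
        u64 len;
        int ret;

        ret = lookup_or_alloc_mapping(inode, iblock, create,
                                      &blkno, &len, &allocated);
        if (ret <= 0)
                return ret;     /* leave bh_result untouched on error or hole */

        map_bh(bh_result, inode->i_sb, blkno);
        if (allocated)
                set_buffer_new(bh_result);
        else
                clear_buffer_new(bh_result);

        /* only ever shrink b_size, never zero it or grow it */
        bh_result->b_size = min_t(u64, bh_result->b_size,
                                  len << inode->i_blkbits);
        return 0;
}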
While we're at it we remove a totally bogus extent flag check that's
done before seeing if the next extent we found even intersects with the
logical block that we're searching for. The extra test is performed
again correctly inside the check for the extents overlapping. It is an
artifact from the days when the "extents" were a single block and didn't
need to check for overlaps.
Signed-off-by: Zach Brown <zab@versity.com>
scoutfs_item_next() could livelock given the right key and segment key
boundaries. This was easiest to trigger with an fio command that wrote
a lot of data:
fio --filesize=100m --nrfiles=25 --name=100m --numjobs=1 \
--iodepth=1 --ioengine=sync --fallocate=0 \
--rw=write --openfiles=256 \
--directory=$TEST_DIR
There were two problems.
First, if it found a cached region that didn't contain a next item it
would try to read the *end* of the existing cached region instead of
trying to populate more items by reading from the key past the existing
cached region. This is fixed by incrementing the key to read from after
setting it to the end of the cached region.
Second, it got totally confused by non-merged but adjacent cached
regions. It would find a cached region that contains the search key and
try to read from the key after that region, but that key could also be
cached and just not merged with its previous region. This is fixed by
(duh) having an allocated pos key that we set as we walk through cached
regions. It used to always try and read from the search key which was
bonkers.
With these fixes fio now completes.
Signed-off-by: Zach Brown <zab@versity.com>
The key printing functions only output the key material that's described
by the format. We have some callers that need to increment or decrement
keys, so they expand them to full size keys. This expansion, and the
extra high-precision but low-significance bytes it adds, was hidden from
the traces.
This adds a helper that prints the key material with the format and then
appends an encoding of the trailing bytes.
The key printer was a huge mess of cases and ifs that made it hard to
integrate a sane helper. We also take the opportunity to break it up
into zone|type key printer functions. The isolation makes it much
clearer to see what's going on.
Signed-off-by: Zach Brown <zab@versity.com>
We can't block on a lock while holding the transaction open because
that'd stop lock downconversion from syncing to write out items while it
is converting from EX. Add a warning if we try to acquire a blocking
lock while holding a transaction.
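The check itself can be as small as this sketch suggests; the
trans-holding test is an assumed helper, not an existing call:

/* warn if we're about to block on a cluster lock with a transaction held,
 * since downconversion needs to sync dirty items to get out of EX */
static void warn_blocking_lock_in_trans(struct super_block *sb, bool trylock)
{
        WARN_ON_ONCE(!trylock && scoutfs_trans_held(sb));  /* assumed helper */
}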
Signed-off-by: Zach Brown <zab@versity.com>
Initially the index walking ioctl only ever output a single entry per
iteration. So the number of entries to return and the next entry
pointer to copy to userspace were maintained in the post-increment of
the for loop.
When we added locking of the index item results we made it possible to
not copy any entries in a loop iteration. When that happened the nr and
pointer would be incremented without initializing the entry. The ioctl
caller would see a garbage entry in the results.
This was visible in scoutfs/002 test results on a volume that had an
interesting file population after having run through all the other
scoutfs tests. The uninitialized entries would show up as garbage in
the size index portion of the test.
Signed-off-by: Zach Brown <zab@versity.com>
If item iteration finds a hole in the cache it tries to read items.
After the items are read it can look at the cached region and return
items or -ENOENT. We recently added an end key to limit how far we can
read and cache items.
The end key addition correctly limited the cache read to the lock end
value. It could never read item cache ranges beyond that. This means
that it can't iterate past the end value and should return -ENOENT if it
gets past end. But the code forgot to do that; it only checked for
iteration past last before returning -ENOENT. It spins continually: it
finds a hole past end but inside last, tries to read items but limits
them to end, then finds the same hole again.
Triggering this requires a lock end that's nearer than the last
iteration key. That's hard to do because most of our item reads are
covered by inode group locks which extend well past iteration inside a
given inode. Inode index items can easily trigger this if there are no
items. I tripped over it when walking empty indexes (data_seq or
online_blocks with no regular files).
Signed-off-by: Zach Brown <zab@versity.com>
The client connection loop was a bit of a mess. It only slept between
retries in one particular case. Other failures to connect would spin
forever and livelock.
The fixed loop now has a much more orderly reconnect procedure. Each
connecting sender always tries once. Then retry attempts back off
exponentially, settling at a nice long timeout. After long enough it'll
return errors.
This fixes livelocks in the xfstests that mount and unmount around
dm-flakey config. generic/{034,039,040} would easily livelock before
this fix.
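The reconnect pacing could look roughly like this sketch; the constants
and helpers are assumptions rather than the client's real code:

#include <linux/delay.h>

#define MIN_CONNECT_DELAY_MS    10UL
#define MAX_CONNECT_DELAY_MS    (60UL * 1000)
#define MAX_CONNECT_TRIES       100

struct client_info;
int try_connect(struct client_info *client);    /* assumed helper */

static int connect_with_backoff(struct client_info *client)
{
        unsigned long delay_ms = 0;
        int tries = 0;
        int ret;

        for (;;) {
                ret = try_connect(client);      /* always try at least once */
                if (ret == 0 || ++tries > MAX_CONNECT_TRIES)
                        return ret;             /* eventually return errors */

                /* back off exponentially, settling at a long ceiling */
                if (delay_ms)
                        delay_ms = min_t(unsigned long, delay_ms * 2,
                                         MAX_CONNECT_DELAY_MS);
                else
                        delay_ms = MIN_CONNECT_DELAY_MS;
                msleep(delay_ms);
        }
}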
Signed-off-by: Zach Brown <zab@versity.com>
stackglue is trying to call dlm posix symbols that don't exist in some
RHEL dlm kernels. We're not using this functionality yet so let's just
tear it out for now.
Signed-off-by: Zach Brown <zab@versity.com>
There's a fair amount of lock.c that's dead code now that we're using
dlmglue. Some of the dead code is seen as unused and is throwing
warnings. This silences the warnings by removing the dead code.
Signed-off-by: Zach Brown <zab@versity.com>
Some dlmglue functions are unused by the current ifdefery. They're
throwing warnings that obscure other warnings in the build. This
broadens the ifdef coverage so that we don't get warnings. The unused
code will either be promoted to an interface or removed as dlmglue
evolves into a reusable component.
Signed-off-by: Zach Brown <zab@versity.com>
We lock multiple inodes by order of their inode number. This fixes
the directory entry paths that hold parent dir and target inode locks.
Link and unlink are easy because they just acquire the existing parent
dir and target inode locks.
Lookup is a little squirrely because we don't want to try and order
the parent dir lock with locks down in iget. It turns out that it's
safe to drop the dir lock before calling iget as long as iget handles
racing the inode cache instantiation with inode deletion.
Creation is the remaining pattern and it's a little weird because we
want to lock the newly created inode before we create it and the items
that store it. We add a function that correctly orders the locks,
transaction, and inode cache instantiation.
Signed-off-by: Zach Brown <zab@versity.com>
It looked like it was easier to have a helper dirty and delete items.
But now that we also have to pass in locks the interface gets messy
enough that it's easier to have the caller take care of it.
Signed-off-by: Zach Brown <zab@versity.com>
Previously we had lots of inode creation callers that used a function to
create the dirent items and we had unlink remove entries by hand.
Rename is different because it wants to remove and add multiple links as
it does its work, including recreating links that it has deleted.
We rework add_entry_item() so that it gets the specific fields it needs
instead of getting them from the vfs structs. This makes it clear that
callers are responsible for the source of the fields. Specifically we
need to be able to add entries during failed rename cleanup without
allocating a new readdir pos from the parent dir.
With callers now responsible for the inputs to add_entry_items() we move
some of its code out into all callers: checking name length, dirtying
the parent dir inode, and allocating a readdir pos from the parent.
We then refactor most of _unlink() into a del_entry_items() to match
addition. This removes the last user of scoutfs_item_delete_many() and
it will be removed in a future commit.
With the entry item helpers taking specific fields all the helpers they
use also need to use specific fields instead of the vfs structs.
To make rename cluster safe we need to get cluster locks for all the
inodes that we work with. We also have to check that the locally cached
vfs input is still valid after acquiring the locks. We only check the
basic structural correctness of the args: that parent dirs don't violate
ancestor rules by creating loops and that the entries assumed by the
rename arguments still exist, or still don't, as the arguments expect.
Signed-off-by: Zach Brown <zab@versity.com>
Add a lock name that has a global scope in a given lockspace. It's not
associated with any file system items. We add a scope to the lock name
to indicate if a lock is global or not and set that in other lock naming
intitialization. We permit lock allocation to accept null start and end
keys.
Signed-off-by: Zach Brown <zab@versity.com>
Add a function that can lock multiple inodes in order of their inode
numbers. It handles nulls and duplicate inodes.
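For two inodes, the ordering and the null/duplicate handling could look
like this sketch, with lock_inode()/unlock_inode() standing in for the
real calls:

#include <linux/fs.h>
#include <linux/kernel.h>

struct scoutfs_lock;
int lock_inode(struct super_block *sb, int mode, struct inode *inode,
               struct scoutfs_lock **ret_lock);
void unlock_inode(struct super_block *sb, int mode, struct scoutfs_lock *lock);

static int lock_inodes_in_order(struct super_block *sb, int mode,
                                struct inode *a, struct inode *b,
                                struct scoutfs_lock **a_lock,
                                struct scoutfs_lock **b_lock)
{
        int ret = 0;

        *a_lock = NULL;
        *b_lock = NULL;

        /* lock in inode number order, keeping locks matched to callers' args */
        if (a && b && a->i_ino > b->i_ino) {
                swap(a, b);
                swap(a_lock, b_lock);
        }
        /* a duplicate only needs one lock; its second pointer stays null */
        if (a == b)
                b = NULL;

        if (a) {
                ret = lock_inode(sb, mode, a, a_lock);
                if (ret)
                        return ret;
        }

        if (b) {
                ret = lock_inode(sb, mode, b, b_lock);
                if (ret && *a_lock) {
                        unlock_inode(sb, mode, *a_lock);
                        *a_lock = NULL;
                }
        }

        return ret;
}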
Signed-off-by: Zach Brown <zab@versity.com>
Without this we return -ESPIPE when a process tries to seek on a regular
file.
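Wiring the generic helper into the file operations is all that's
needed; a sketch of the kind of change, with the fops name here being an
assumption:

#include <linux/fs.h>

const struct file_operations scoutfs_file_fops = {
        .llseek         = generic_file_llseek,
        /* the existing read/write/mmap/etc. entries are unchanged */
};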
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
[zab: adapted to new lock call]
Signed-off-by: Zach Brown <zab@zabbo.net>
We need to lock and refresh the VFS inode before it checks permissions in
system calls, otherwise we risk checking against stale inode metadata.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
[zab: adapted to newer lock call]
Signed-off-by: Zach Brown <zab@versity.com>
With trylock implemented we can add locking in readpage. After that it's
pretty easy to implement our own read/write functions which at this
point are more or less wrapping the kernel helpers in the correct
cluster locking.
Data invalidation is a bit interesting. If the lock we are invalidating
is an inode group lock, we use the lock boundaries to incrementally
search our inode cache. When an inode struct is found, we sync and
(optionally) truncate pages.
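A sketch of that incremental search; ilookup() and the page cache calls
are the kernel's, while mapping the lock boundaries to an inode number
range is assumed to have been done by the caller:

#include <linux/fs.h>
#include <linux/mm.h>

static int sync_and_drop_inode_range(struct super_block *sb, u64 first_ino,
                                     u64 last_ino, bool drop_pages)
{
        struct inode *inode;
        int ret = 0;
        u64 ino;

        for (ino = first_ino; ino <= last_ino; ino++) {
                /* only touch inodes that are actually in the inode cache */
                inode = ilookup(sb, ino);
                if (!inode)
                        continue;

                ret = filemap_write_and_wait(inode->i_mapping);
                if (ret == 0 && drop_pages)
                        truncate_inode_pages(inode->i_mapping, 0);
                iput(inode);
                if (ret)
                        break;
        }

        return ret;
}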
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
[zab: adapted to newer lock call, fixed some error handling]
Signed-off-by: Zach Brown <zab@versity.com>
Now that we have the inode refreshing flags let's add them to the
callers that want to have a current inode after they have their lock.
Callers locking newly created items use the new inode flag to reset the
refresh gen.
A few inode tests are moved down to after locking so that they test
the current refreshed inode.
Signed-off-by: Zach Brown <zab@versity.com>