Commit Graph

459 Commits

Author SHA1 Message Date
Mark Fasheh
c5e6676b04 scoutfs: remove some ifdef'd out dlmglue code
We can dump the ocfs2-specifics as well as any definitions that have been
exported via the header file. This makes reading through and modifying
dlmglue much more palatable.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-10-04 08:55:14 -07:00
Mark Fasheh
17c6025cb7 scoutfs: clean up some comments in lock.c
We finished the lock lru work and can remove these TODO comments.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-10-04 08:55:14 -07:00
Zach Brown
ccf5301c37 scoutfs: add -Werror for build errors
We insist on a warning free build but it's up to human diligence to
discover and address warnings.  We've also caught errors when compilers
in automated testing saw problems that the compilers in developer
environments didn't.  That is, a human could only have noticed by
investigating the output from successful test runs.

Let's put some weight behind our promise of a warning free build and
turn gcc warnings into errors.

Signed-off-by: Zach Brown <zab@versity.com>
2017-09-28 13:59:49 -07:00
Zach Brown
b6c592f099 scoutfs: don't dirty btree buffers
The btree sets some buffer head flags before it writes them to satisfy
submit_bh() requirements.  It was setting dirty which wasn't required.
Nothing ever cleared dirty so those buffers sat around and were never
freed.  Each btree block we wrote sat around forever.  Eventually the vm
gets clogged up and the world backs up trying to allocate pages to
write and we see massive stalls.

With this fix we no longer see the 'buffers' vm stat continuously grow
and IO rates are consistent over time.

Signed-off-by: Zach Brown <zab@versity.com>
2017-09-28 13:59:49 -07:00
Zach Brown
4bc565be39 scoutfs: silence bulk_alloc gcc warning
Some versions of gcc correctly noticed that bulk_alloc() had a
case where it wouldn't initialize ret if the first segno was 0.  This
won't happen because the client response processing returns an error in
this case.  So this just shuts up the warning.

Signed-off-by: Zach Brown <zab@versity.com>
2017-09-28 13:59:49 -07:00
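
A minimal userspace sketch of the warning and the fix, with hypothetical
names (the real bulk_alloc() lives in the scoutfs client code and
differs in detail): ret is only assigned inside the loop, so gcc can't
prove it's initialized on every path.

	/* maybe-uninit.c: build with gcc -O2 -Wall to see this class of warning */
	#include <stdio.h>

	static int bulk_alloc_sketch(const unsigned long *segnos, int nr)
	{
		int ret = -22;		/* -EINVAL; previously left uninitialized here */
		int i;

		for (i = 0; i < nr; i++) {
			if (!segnos[i])
				break;	/* the server response should never contain 0 */
			ret = 0;	/* the real code records the segno here */
		}

		return ret;
	}

	int main(void)
	{
		unsigned long segnos[] = { 1, 2, 3 };

		printf("%d\n", bulk_alloc_sketch(segnos, 3));
		return 0;
	}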
Zach Brown
c5ddec7058 scoutfs: more aggressively shrink items
The old item shrinking was very conservative.  It would only try to
reclaim items from the front of the range of the oldest items in the
lru.  It would stop making progress if all the items at the front of the
lru were in a range whose initial item couldn't be reclaimed.

With item shrinking not making progress, memory fills with items.
Eventually the system backs up behind an allocation during segment
writing, blocked waiting for free pages.

We fix this by much more aggressively shrinking items.  We now look for
a region of items around the oldest items to shrink.  If those fall in
the middle of a range then we use the memory from the items to construct
a new range and split the existing range.  Now the only way we'll refuse
to shrink items is if they're dirty.  We have a reasonably small cap on
the number of dirty items so we shouldn't get stuck.

Signed-off-by: Zach Brown <zab@versity.com>
2017-09-28 13:59:49 -07:00
Zach Brown
fd509840d4 scoutfs: use pages for seg shrink object count
The VM wasn't very excited about trying to reclaim our segments when we
returned a small count of large segment objects available for reclaim.
Each segment represents a ton of memory so we want to give the VM more
visibility into the scale of the cache to encourage it to shrink it.  We
define the object count for the seg shrinker as the number of pages of
segments in the lru.

Signed-off-by: Zach Brown <zab@versity.com>
2017-09-28 13:59:49 -07:00
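
A sketch of the unit conversion described above, with illustrative sizes
(the real segment size and shrinker wiring are scoutfs specifics):
reporting pages rather than whole segments gives the VM a number
proportional to the memory actually held.

	#include <stdio.h>

	#define SEG_SIZE	(1024 * 1024)	/* hypothetical segment size */
	#define PAGE_SZ		4096

	/* report the seg lru to the shrinker in units of pages, not segments */
	static unsigned long seg_shrink_count(unsigned long segs_on_lru)
	{
		return segs_on_lru * (SEG_SIZE / PAGE_SZ);
	}

	int main(void)
	{
		printf("%lu objects\n", seg_shrink_count(10));
		return 0;
	}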
Zach Brown
ccefffe74f scoutfs: add item, range, lock alloc/free counters
Add some counters to track allocation and freeing of our structures that
are subject to shrinking.  This lets us eyeball the counters to see if
we have runaway leaks.

Signed-off-by: Zach Brown <zab@versity.com>
2017-09-28 13:59:49 -07:00
Zach Brown
15aa09b0c2 scoutfs: add shrink exit trace points
Add trace points that show the incoming nr_to_scan and resulting object
count for shrinker calls.

Signed-off-by: Zach Brown <zab@versity.com>
2017-09-28 13:59:49 -07:00
Mark Fasheh
43a2d63f79 scoutfs: replace trace_printk in bio.c
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-09-28 13:59:49 -07:00
Mark Fasheh
e67e500940 scoutfs: turn off tracing in dlmglue.c
Put this behind a #define. Leave the asserts (mlog_bug_on_msg) though
and redefine their macro to printk instead of going to the trace
buffer.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-09-28 13:59:49 -07:00
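
A sketch of the approach, modeled in userspace with illustrative macro
definitions (the real dlmglue macros wrap the kernel's trace_printk()
and printk(); here stderr and abort() stand in for them): tracing
compiles out behind a #define while the assert macro stays and prints
instead of tracing.

	#include <stdio.h>
	#include <stdlib.h>

	#define DLMGLUE_TRACE 0

	#if DLMGLUE_TRACE
	#define mlog(fmt, ...) fprintf(stderr, fmt, ##__VA_ARGS__)
	#else
	#define mlog(fmt, ...) do { } while (0)	/* compiled out */
	#endif

	/* asserts stay enabled and report to the console, not the trace buffer */
	#define mlog_bug_on_msg(cond, fmt, ...)				\
		do {							\
			if (cond) {					\
				fprintf(stderr, fmt, ##__VA_ARGS__);	\
				abort();				\
			}						\
		} while (0)

	int main(void)
	{
		mlog("this call vanishes when DLMGLUE_TRACE is 0\n");
		mlog_bug_on_msg(0, "this assert never fires\n");
		return 0;
	}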
Mark Fasheh
3430edb60b scoutfs: replace trace_printk in item.c
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-09-28 13:59:49 -07:00
Mark Fasheh
2a07e6f642 scoutfs: replace trace_printk in data.c
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-09-28 13:59:49 -07:00
Mark Fasheh
deecfa0ad5 scoutfs: replace trace_printk in trans.c
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-09-28 13:59:49 -07:00
Mark Fasheh
285842086d scoutfs: replace trace_printk in ioctl.c
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-09-28 13:59:49 -07:00
Mark Fasheh
8ad6ff9d41 scoutfs: replace trace_printk in inode.c
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-09-28 13:59:49 -07:00
Mark Fasheh
44a19b63c0 scoutfs: replace trace_printk in segment.c
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-09-28 13:59:49 -07:00
Mark Fasheh
cf3f9fee75 scoutfs: replace trace_printk in lock.c
Also clean up these traces a bit and make a lock_info trace class
which we can expand in a future patch.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-09-28 13:59:49 -07:00
Mark Fasheh
7739a0084e scoutfs: replace trace_printk in xattr.c
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-09-28 13:59:49 -07:00
Mark Fasheh
87adeb9306 scoutfs: replace trace_printk in manifest.c
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-09-28 13:59:49 -07:00
Mark Fasheh
0d28930271 scoutfs: replace trace_printk in super.c
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-09-28 13:59:49 -07:00
Mark Fasheh
a5283e6f2c scoutfs: replace trace_printk in dir.c
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-09-28 13:59:49 -07:00
Mark Fasheh
2c1f117bef scoutfs: replace trace_printk in compact.c
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-09-28 13:59:49 -07:00
Mark Fasheh
3a5093c6ae scoutfs: replace trace_printk in alloc.c
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-09-28 13:59:49 -07:00
Zach Brown
215ba7d4ad scoutfs: more reliably set btree parent item bits
Most paths correctly calculated the bits to set in a parent item by
combining the bits set in the child with the half bit of the child
block's position in the ring.

The exception was fixing up the parent item bits after descent by
walking the path.  This mistake caused the parent item half bits to be
zero and prevented migration of blocks from the old half of the ring.

Fix it by introducing a helper function to calculate the parent ref item
bits and consistently using it.

Signed-off-by: Zach Brown <zab@versity.com>
2017-09-20 10:26:40 -07:00
Zach Brown
42b33d616e scoutfs: fix btree bit iteration
store_pos_bits() was trying to iterate over bits that were different
between the existing bits set in the item and the new bits that will be
set.  It used an overly clever for_each helper that tried to iterate
only as many times as there were bits.  But it messed up and only used
ffs to find the next bit on the first iteration.  From then on it would
iterate over bits that weren't different.  This caused the counts to be
changed when the bits didn't change, ending up wildly wrong.

Fix this by using a much clearer loop.  It still breaks out when there
are no more different bits and we're only using a few low bits so the
number of iterations is tiny.

Signed-off-by: Zach Brown <zab@versity.com>
2017-09-20 10:26:40 -07:00
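
A userspace sketch of the clearer loop, with hypothetical names standing
in for store_pos_bits() and its counts: walk every bit of the xor of the
old and new masks instead of trusting a clever ffs-based iterator.

	#include <stdio.h>

	/* adjust per-bit counts only for bits that actually changed */
	static void adjust_counts(unsigned int old_bits, unsigned int new_bits,
				  int *counts)
	{
		unsigned int diff = old_bits ^ new_bits;
		int bit;

		for (bit = 0; diff; bit++, diff >>= 1) {
			if (!(diff & 1))
				continue;
			if (new_bits & (1u << bit))
				counts[bit]++;	/* bit newly set */
			else
				counts[bit]--;	/* bit newly cleared */
		}
	}

	int main(void)
	{
		int counts[8] = { 0 };

		adjust_counts(0x5, 0x6, counts);	/* bits 0 and 1 differ */
		printf("%d %d %d\n", counts[0], counts[1], counts[2]);
		return 0;
	}

The loop still breaks out as soon as no differing bits remain, and only
a few low bits are in use, so the iteration count stays tiny.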
Zach Brown
0b15cfe7f8 scoutfs: split on btree deletion
We were hitting asserts during deletion: the insertion done while
updating a parent item didn't have enough room in the block.

Our btree has variable length keys.  During a merge it's possible that
the items moved between the blocks can result in the final key of a
block changing from a small key to a large key.  To update the parent
ref the parent block must have as much free space as the difference in
the key sizes.

We ensure free space in parents during descent by trying to split the
block.  Deletion wasn't doing that, it was only trying to merge blocks.
We need to try to split as well as merge during deletion.

And we have to update the merge threshold so that we don't just split
the resulting block again if it doesn't have the min free space for a
new parent item.

Signed-off-by: Zach Brown <zab@versity.com>
2017-09-19 21:19:25 -07:00
Zach Brown
1012ee5e8f scoutfs: use block mapping items
Move to static mapping items instead of unbounded extents.
We get more predictable data structures and simpler code but still get
reasonably dense metadata.

We no longer need all the extent code needed to split and merge extents,
test for overlaps, and all that.  The functions that use the mappings
(get_block, fiemap, truncate) now have a pattern where they decode the
mapping item into an allocated native representation, do their work, and
encode the result back into the dense item.

We do have to grow the largest possible item value to fit the worst case
encoding expansion of random block numbers.

The local allocators are no longer two extents but are instead simple
bitmaps: one for full segments and one for individual blocks.  There are
helper functions to free and allocate segments and blocks, with careful
coordination of, for example, freeing a segment once all of its
constituent blocks are free.

_fiemap is refactored a bit to make it more clear what's going on.
There's one function that either merges the next bit with the currently
building extent or fills the current and starts recording from a
non-mergable additional block.  The old loop worked this way but was
implemented with a single squirrely iteration over the extents.  This
wasn't feasible now that we're also iterating over blocks inside the
mapping items.  It's a lot clearer to call out to merge or fill the
fiemap entry.

The dirty item reservation counts for using the mappings are reduced
significantly because each modification no longer has to assume that it
might merge with two adjacent contiguous neighbours.

Signed-off-by: Zach Brown <zab@versity.com>
2017-09-19 11:25:38 -07:00
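
A toy version of the decode/modify/encode pattern described above, with
a deliberately naive one-byte-per-block encoding (the real mapping item
uses a denser variable-length encoding and real block numbers):

	#include <stdio.h>

	#define MAP_BLOCKS 8

	/* unpack the item into a native array the callers can work with */
	static void decode_map(const unsigned char *item, unsigned long long *map)
	{
		int i;

		for (i = 0; i < MAP_BLOCKS; i++)
			map[i] = item[i];
	}

	/* pack the modified mapping back into the dense item */
	static void encode_map(unsigned char *item, const unsigned long long *map)
	{
		int i;

		for (i = 0; i < MAP_BLOCKS; i++)
			item[i] = (unsigned char)map[i];
	}

	int main(void)
	{
		unsigned char item[MAP_BLOCKS] = { 0 };
		unsigned long long map[MAP_BLOCKS];

		decode_map(item, map);
		map[3] = 42;		/* e.g. get_block mapping logical block 3 */
		encode_map(item, map);

		printf("block 3 -> %u\n", item[3]);
		return 0;
	}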
Zach Brown
c4f3c26343 scoutfs: add scoutfs_item_update_dirty()
Add a function for updating the value of a dirty item.  Callers can use
this to make changes without having to worry about failure.

Signed-off-by: Zach Brown <zab@versity.com>
2017-09-19 11:25:38 -07:00
Zach Brown
5325aff698 scoutfs: add item count update function
The item cache maintains a count of the number of dirty items, keys, and
values.  It updates the counts as it dirties and cleans items.  There
are callers who want to modify the accounting directly instead of having
the accounting updated as a side effect of cleaning and re-dirtying the
item.

Signed-off-by: Zach Brown <zab@versity.com>
2017-09-19 11:25:38 -07:00
Zach Brown
7854471475 scoutfs: fix server wq destroy warning
We were seeing warnings in destroy_workqueue() which meant that work was
queued on the server workqueue after it was drained and before it was
finally destroyed.

The only work that wasn't properly waited for was the commit work.  It
looks like it'd be idle because the server receive threads all wait for
their request processing work to finish.  But the way the commit work is
batched means that a request can have its commit processed by executing
commit work while leaving the work queued for another run.

Fix this by specifically waiting for the commit work to finish after the
server work has waited for all the recv and compaction work to finish.

I wasn't able to reliably trigger the assertion in repeated xfstests
runs.  This fix also survived many runs; let's see if it stops the
destroy_workqueue() assertion from triggering in the future.

Signed-off-by: Zach Brown <zab@versity.com>
2017-09-12 15:22:03 -07:00
Zach Brown
f0a7c4f294 scoutfs: make trans item count const rhs
The item count estimate functions didn't obviously differentiate between
adding to a count and resetting it.  Most callers initialized the count
struct to 0 on the stack, incremented their estimate once, and passed it
in.  The problem is that those same functions that increment once in
callers are also used in other estimates to build counts based on
multiple operations.

This tripped up the data truncate path.  It looped and kept incrementing
its count while truncating a file with lots of extents until the count
got so large that it didn't fit in a segment by itself and blocked
forever.

This cleans up the item count code so that it's much harder to get
wrong.  We differentiate between the SIC_*() high level count estimates
that are meant to be passed in to _hold_trans(), and the internal
__count_*() functions which are used to add up the item counts that make
up an aggregate operation.

With this fix the only way to use the count in extent truncation is to
correctly reset it to the item count for each transaction.

Signed-off-by: Zach Brown <zab@versity.com>
2017-09-12 10:41:55 -07:00
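
A userspace sketch of the split between the SIC_*()-style estimates and
the internal __count_*() helpers; the struct fields, names, and numbers
are illustrative, not the real scoutfs counts.  Because the high-level
estimate is returned by value, a looping caller gets a fresh count every
time and can't accidentally accumulate one.

	#include <stdio.h>

	struct item_count {
		long items;
		long keys;
		long vals;
	};

	/* internal helpers add into an aggregate count */
	static void __count_dirty_inode(struct item_count *cnt)
	{
		cnt->items += 1;
		cnt->keys += 1;
		cnt->vals += 32;
	}

	static void __count_extent_item(struct item_count *cnt)
	{
		cnt->items += 1;
		cnt->keys += 2;
		cnt->vals += 16;
	}

	/* high-level estimate: always built from zero, returned by value */
	static struct item_count SIC_truncate_extent(void)
	{
		struct item_count cnt = { 0, 0, 0 };

		__count_dirty_inode(&cnt);
		__count_extent_item(&cnt);
		return cnt;
	}

	int main(void)
	{
		/* each pass of a truncate loop gets a freshly reset estimate */
		struct item_count cnt = SIC_truncate_extent();

		printf("items %ld keys %ld vals %ld\n",
		       cnt.items, cnt.keys, cnt.vals);
		return 0;
	}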
Zach Brown
e165d89f7f scoutfs: warn on invalid item counts
We had a bug where a caller was slowly increasing their item count for
every transaction they attempted in a loop.  Eventually the item count
grew to be too large to fit in a segment and they slept indefinitely.

Let's warn on invalid and impossibly large item counts as we enter
transactions.

Signed-off-by: Zach Brown <zab@versity.com>
2017-09-12 10:41:55 -07:00
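
A sketch of the kind of sanity check described, with an entirely
illustrative cap (the real limit would be derived from the segment
size):

	#include <assert.h>
	#include <stdio.h>

	#define MAX_TRANS_ITEMS	10000	/* hypothetical: must fit in one segment */

	static int hold_trans_sketch(long items)
	{
		if (items <= 0 || items > MAX_TRANS_ITEMS) {
			fprintf(stderr, "warning: bad trans item count %ld\n", items);
			return -1;
		}
		/* ... enter the transaction ... */
		return 0;
	}

	int main(void)
	{
		assert(hold_trans_sketch(100) == 0);
		assert(hold_trans_sketch(1L << 30) < 0);
		return 0;
	}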
Zach Brown
ba40899e84 scoutfs: remove scoutfs_trans_wake_holders()
This was used by compaction to wake local holders who were waiting for
compaction to free up level 0 segments for them to enter the
transaction.  Throttling level 0 segment writes works differently now
and doesn't involve blocking trans holders.

Signed-off-by: Zach Brown <zab@versity.com>
2017-09-12 10:41:55 -07:00
Mark Fasheh
5fa97018e5 scoutfs: get rid of dlmglues ocfs2_uninit_super
All this does is uninitialize the dlmglue debug infrastructure, but we
already do that in ocfs2_dlm_shutdown(), which resulted in some double
frees on unmount.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-09-11 21:57:31 -07:00
Zach Brown
fbfbe910aa scoutfs: return error from lock_name_keys
xfstests generic/028 was crashing dereferencing NULL locks.   It'd hit
either rename trying to refresh an inode with a NULL lock or lookup
trying to pass a NULL lock's end to item lookup.

The addition of the lock LRU fixed a bug in lock_name_keys() where it
wouldn't drop a lock when _cluster_lock() returned an error.  But it
always returned 0 instead of returning the error.  Returning 0 without
setting the lock caused the callers to deref their NULL locks.

We also forcefully NULL the lock at the start of the function.  It was
lucky that callers had already NULLed their locks.  If they hadn't they
would have been following random on-stack memory and it might have been
harder to debug.

Signed-off-by: Zach Brown <zab@versity.com>
2017-09-11 09:27:57 -07:00
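
A userspace sketch of the shape of the fix, with stand-in names and a
stubbed lock call: clear the caller's pointer up front and propagate the
real error instead of returning 0 with a NULL lock.

	#include <stdio.h>

	struct lock { int unused; };

	/* stub standing in for the clustered lock call failing */
	static struct lock *cluster_lock_stub(int *err)
	{
		*err = -5;	/* pretend -EIO */
		return NULL;
	}

	static int lock_name_keys_sketch(struct lock **ret_lock)
	{
		struct lock *lck;
		int err;

		*ret_lock = NULL;	/* don't leave callers with stack garbage */

		lck = cluster_lock_stub(&err);
		if (!lck)
			return err;	/* the bug was returning 0 here */

		*ret_lock = lck;
		return 0;
	}

	int main(void)
	{
		struct lock *lck;
		int ret = lock_name_keys_sketch(&lck);

		printf("ret %d lock %p\n", ret, (void *)lck);
		return 0;
	}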
Mark Fasheh
b1fff0997e scoutfs: dlmglue should initialize res->l_debug_list
We're missing initialization of this field. It should never cause a
problem today because we always do a list add immediately afterwards
but let's be extra careful here and initialize it just in case. We
also add a sanity check in ocfs2_add_lockres_tracking() that the
lockres hasn't already been put on the debug list.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-09-08 13:29:56 -07:00
Mark Fasheh
f276771d8c scoutfs: we need to uninitialize the dlmglue lockres
The dlmglue lockres gets put on a debugging list at initialization time,
and taken off the list at uninit time. We were missing the uninit
portion of this cycle, causing some list debugging warnings.

Call ocfs2_lock_res_free() in free_scoutfs_lock(). In addition, we had
a raw kfree() of a scoutfs lock in find_alloc_scoutfs_lock() which also
needed to be replaced.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-09-08 13:29:56 -07:00
Zach Brown
2d6d113e03 scoutfs: continue index walk after lock
We saw inode index queries spinning.  They were finding no cached
entries in their locked region but the next key in the segments was in
the region.  This can happen if an item has been deleted in the current
transaction.  The query won't walk up into the new dirty seq but it
will try to walk the old seq.  The item will still be in the segments
but won't be visible to item_next because it's marked deleted.  The
query will spin finding the next stale key to read from and finding it
missing in the cache.

This is fixed by taking the current coherent cache at its word.  When
it tells us there are no entries we advance the key that we check the
manifest for past the locked region.  In this case it'll skip past the
cached delete item and look for the next key in the segments.

Signed-off-by: Zach Brown <zab@versity.com>
2017-09-08 10:27:07 -07:00
Mark Fasheh
9461104f8e scoutfs: Use LRU for locks
The move to dlmglue necessitated removing the old lru code.  That's
fine; it didn't work anyway.

We can't drop locks in the shrinker directly, so instead we have the
shrinker put them on a workqueue where they are dropped.

The rules for the LRU are simple. Locks get a users count. Any process
that holds the lock or is in the process of acquiring the lock
increments this count. When unlock is called, the count is decremented.
We can use the value of this count to manage the LRU - scoutfs_unlock
puts locks on the LRU when the count reaches zero, lock_name_keys()
always takes them off.

If the lock is selected for reclaim, callers wanting to use the lock
will need to wait.  We achieve this with a pair of flags.
SCOUTFS_LOCK_RECLAIM is used to indicate that the lock is now queued for
reclaim.  Once the lock is ready to be destroyed, we set the
SCOUTFS_LOCK_DROPPED flag, telling callers to put the lock and retry
their rbtree search.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-09-08 09:57:04 -07:00
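
A toy model of the LRU rules spelled out above; the real code uses an
rbtree, spinlocks, shrinker callbacks, and a reclaim workqueue, so
everything here is illustrative.

	#include <stdbool.h>
	#include <stdio.h>

	struct toy_lock {
		int users;	/* holders plus in-progress acquirers */
		bool on_lru;
		bool reclaim;	/* queued for reclaim: new users must wait */
		bool dropped;	/* destroyed: callers retry their rbtree search */
	};

	/* lock_name_keys() always takes the lock off the LRU */
	static void toy_lock_get(struct toy_lock *lck)
	{
		lck->users++;
		lck->on_lru = false;
	}

	/* scoutfs_unlock puts idle locks on the LRU */
	static void toy_unlock(struct toy_lock *lck)
	{
		if (--lck->users == 0)
			lck->on_lru = true;
	}

	int main(void)
	{
		struct toy_lock lck = { 0 };

		toy_lock_get(&lck);
		toy_unlock(&lck);
		printf("users %d on_lru %d\n", lck.users, (int)lck.on_lru);
		return 0;
	}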
Mark Fasheh
4bb5cadaf0 scoutfs: remove dead code in lock.[ch]
With the dlmglue transition over, we can finally remove some of the no
longer used portions of lock.[ch].

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-09-08 09:57:04 -07:00
Zach Brown
79110a74eb scoutfs: prevent partial block stage, except final
The staging ioctl is just a thin wrapper around writing.  If we allowed
partial-block staging then the write would zero a newly allocated block
and only stage in the partial region of the block, leaving zeros in the
file that didn't exist before.

We prevent staging when the starting offset isn't block aligned.  We
prevent staging when the final offset isn't block aligned unless it
matches the size because the stage ends in the final partial block of
the file.

This is verified by an xfstest (scoutfs/003) that is in flight.

Signed-off-by: Zach Brown <zab@versity.com>
2017-09-07 13:49:37 -07:00
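
The alignment rule is easy to state in code.  A sketch with an
illustrative block size:

	#include <stdbool.h>
	#include <stdio.h>

	#define BLOCK_SIZE 4096

	/* start must be block aligned; the end must be block aligned unless
	 * it lands exactly on i_size, the final partial block of the file */
	static bool stage_allowed(long long start, long long len, long long i_size)
	{
		long long end = start + len;

		if (start & (BLOCK_SIZE - 1))
			return false;
		if ((end & (BLOCK_SIZE - 1)) && end != i_size)
			return false;
		return true;
	}

	int main(void)
	{
		printf("%d\n", stage_allowed(0, 4096, 10000));	  /* 1: aligned */
		printf("%d\n", stage_allowed(0, 100, 10000));	  /* 0: partial, not at EOF */
		printf("%d\n", stage_allowed(8192, 1808, 10000)); /* 1: ends at i_size */
		return 0;
	}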
Zach Brown
e7b5cd4c66 scoutfs: limit get_block bh use
We were seeing __block_write_begin spin when staging writes were called
after a read of an offline region saw an error.  It turns out that
the way __block_write_begin iterates through buffer heads on a page will
livelock if b_size is 0.

Our get_block was clearing b_blocknr and b_size before doing anything.
It'd set them when it allocated blocks or found existing mapped
blocks.  But it'd leave them 0 on an error and trigger this hang.

So we'll back off and only do the same things to the result bh that
ext2/3 do; presumably that's what's actually supported.  We only set
mapped, set or clear new, and set b_size to less than the input b_size.

While we're at it we remove a totally bogus extent flag check that's
done before seeing if the next extent we found even intersects with the
logical block that we're searching for.  The extra test is performed
again correctly inside the check for the extents overlapping.  It is an
artifact from the days when the "extents" were a single block and didn't
need to check for overlaps.

Signed-off-by: Zach Brown <zab@versity.com>
2017-09-07 13:49:37 -07:00
Zach Brown
51e8b614e5 scoutfs: stop livelocking in item_next
scoutfs_item_next() could livelock given the right key and segment key
boundaries.  This was easiest to trigger with an fio command that wrote
a lot of data:

	fio --filesize=100m --nrfiles=25 --name=100m --numjobs=1 \
		--iodepth=1 --ioengine=sync --fallocate=0 \
		--rw=write --openfiles=256 \
		--directory=$TEST_DIR

There were two problems.

First, if it found a cached region that didn't contain a next item it
would try to read the *end* of the existing cached region instead of
trying to populate more items by reading from the key past the existing
cached region.  This is fixed by incrementing the key to read from after
setting it to the end of the cached region.

Second, it got totally confused by non-merged but adjacent cached
regions.  It would find a cached region that contains the search key and
try to read from the key after that region, but that key could also be
cached and just not merged with its previous region.  This is fixed by
(duh) having an allocated pos key that we set as we walk through cached
regions.  It used to always try to read from the search key, which was
bonkers.

With these fixes fio now completes.

Signed-off-by: Zach Brown <zab@versity.com>
2017-09-05 14:22:47 -07:00
Zach Brown
82f8daaebf scoutfs: print trailing key bytes
The key printing functions only output the key material that's described
by the format.  We have some callers that need to increment or decrement
keys so they expand them to full size keys.  This expansion, and the
extra high-precision but low-significance bytes, was hidden from the
traces.

This adds a helper that prints the key material with the format and then
appends an encoding of the trailing bytes.

The key printer was a huge mess of cases and ifs that made it hard to
integrate a sane helper.  We also take the opportunity to break it up
into zone|type key printer functions.  The isolation makes it much
clearer to see what's going on.

Signed-off-by: Zach Brown <zab@versity.com>
2017-09-05 14:22:47 -07:00
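
A sketch of the helper's behaviour with hypothetical names: print the
fields the format describes, then hex-encode whatever trailing bytes are
left over from expanding the key to full size.

	#include <stdio.h>

	static void print_key_and_trailer(const char *formatted,
					  const unsigned char *key, size_t key_len,
					  size_t fmt_len)
	{
		size_t i;

		printf("%s", formatted);
		for (i = fmt_len; i < key_len; i++)
			printf("%s%02x", i == fmt_len ? " trailer " : "", key[i]);
		printf("\n");
	}

	int main(void)
	{
		unsigned char key[] = { 0x01, 0x02, 0x03, 0xff, 0xff };

		/* pretend the format only describes the first 3 bytes */
		print_key_and_trailer("ino 0x10203", key, sizeof(key), 3);
		return 0;
	}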
Zach Brown
2eddeb5db4 scoutfs: delete unused net key types
Remove these key types from the format which haven't been used for a
while.

Signed-off-by: Zach Brown <zab@versity.com>
2017-09-05 14:22:47 -07:00
Zach Brown
76cf28b442 scoutfs: warn if lock with trans held
We can't block on a lock while holding the transaction open because
that'd stop lock downconversion from syncing to write out items while it
is converting from EX.  Add a warning if we try to acquire a blocking
lock while holding a transaction.

Signed-off-by: Zach Brown <zab@versity.com>
2017-08-30 10:38:00 -07:00
Zach Brown
599269e539 scoutfs: don't return uninit index entries
Initially the index walking ioctl only ever output a single entry per
iteration.  So the number of entries to return and the next entry
pointer to copy to userspace were maintained in the post-increment of
the for loop.

When we added locking of the index item results we made it possible to
not copy any entries in a loop iteration.  When that happened the nr and
pointer would be incremented without initializing the entry.  The ioctl
caller would see a garbage entry in the results.

This was visible in scoutfs/002 test results on a volume that had an
interesting file population after having run through all the other
scoutfs tests.  The uninitialized entries would show up as garbage in
the size index portion of the test.

Signed-off-by: Zach Brown <zab@versity.com>
2017-08-30 10:38:00 -07:00
Zach Brown
a8db7e5b74 scoutfs: stop iteration at lock end value
If item iteration finds a hole in the cache it tries to read items.
After the items are read it can look at the cached region and return
items or -ENOENT.  We recently added an end key to limit how far we can
read and cache items.

The end key addition correctly limited the cache read to the lock end
value.  It could never read item cache ranges beyond that.  This means
that it can't iterate past the end value and should return -ENOENT if it
gets past end.  But the code forgot to do that; it only checked for
iteration past last before returning -ENOENT.  It spins continually,
finding a hole past end but inside last, trying to read items but
limiting them to end, then finding the same hole again.

Triggering this requires a lock end that's nearer than the last
iteration key.  That's hard to do because most of our item reads are
covered by inode group locks which extend well past iteration inside a
given inode.  Inode index items can easily trigger this if there are no
items.  I tripped over it when walking empty indexes (data_seq or
online_blocks with no regular files).

Signed-off-by: Zach Brown <zab@versity.com>
2017-08-30 10:38:00 -07:00
Zach Brown
ca78757ca5 scoutfs: more careful client connect timeouts
The client connection loop was a bit of a mess.  It only slept between
retries in one particular case.  Other failures to connect would spin
and livelock forever.

The fixed loop now has a much more orderly reconnect procedure.  Each
connecting sender always tries once.  Then retry attempts back off
exponentially, settling at a nice long timeout.  After long enough it'll
return errors.

This fixes livelocks in the xfstests that mount and unmount around
dm-flakey config.  generic/{034,039,040} would easily livelock before
this fix.

Signed-off-by: Zach Brown <zab@versity.com>
2017-08-30 10:38:00 -07:00
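
A sketch of the backoff schedule described above; the delays and cap
here are illustrative, not the client's actual timeouts.

	#include <stdio.h>

	#define MIN_DELAY_MS	100
	#define MAX_DELAY_MS	(60 * 1000)

	/* first attempt is immediate, then the wait doubles up to a cap */
	static unsigned int next_delay_ms(unsigned int attempt)
	{
		unsigned long delay;

		if (attempt == 0)
			return 0;
		if (attempt > 20)		/* avoid shifting past the cap anyway */
			return MAX_DELAY_MS;

		delay = (unsigned long)MIN_DELAY_MS << (attempt - 1);
		return delay > MAX_DELAY_MS ? MAX_DELAY_MS : (unsigned int)delay;
	}

	int main(void)
	{
		unsigned int i;

		for (i = 0; i < 12; i++)
			printf("attempt %u: wait %u ms\n", i, next_delay_ms(i));
		return 0;
	}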