Commit Graph

645 Commits

Author SHA1 Message Date
Zach Brown
dfac36a9aa scoutfs: trace key struct
The userspace trace event printing code has trouble with arguments that
refer to fields in entries.  Add macros to make entries for all the
fields and use them as the formatted arguments.

We also remove the mapping of zone and type to strings.  It's smaller to
print the values directly and gets rid of some silly code.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
5935a3f43e scoutfs: remove unused trace events
These trace events were all orphaned long ago by commits which removed
their callers but forgot to remove their definitions.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
fddc3a7a75 scoutfs: minimize commit writeback latencies
Our simple transaction machinery causes high commit latencies if we let
too much dirty file data accumulate.

Small files have a natural limit on the amount of dirty data because
they have more dirty items per dirty page.  They fill up the single
segment sooner and kick off a commit which finds a relatively small
amount of dirty file data.

But large files can reference quite a lot of dirty data with a small
amount of extent items which don't fill up the transaction's segment.
During large streaming writes we can fill up memory with dirty file data
before filling a segment with mapping extent metadata.  This can lead to
high commit latencies when memory is full of dirty file pages.

Regularly kicking off background writeback behind streaming write
positions reduces the amount of dirty data that commits will find and
have to write out.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
59170f41b1 scoutfs: revive item deletion path
The inode deletion path had bit rotted.  Delete the ifdefs that were
stopping it from deleting all the items associated with an inode.  There
can be a lot of xattr and data mapping items so we have them manage
their own transactions (data already did).  The xattr deletion code was
trying to take a lock that the caller already held, so we delete that.
Then we accurately account for the small number of remaining items that
finally delete the inode.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
0c7ea66f57 scoutfs: add SIC_EXACT
Add an item count call that lets the caller give the exact item count
instead of basing it on the operation they're performing.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
002daf3c1c scoutfs: return -ENOSPC to client alloc segno
The server's send_reply interface is confusing: it uses errors to shut
down the connection.  Reporting -ENOSPC to clients instead needs to
happen in the message reply payload.

The segno allocation server processing needs to set the segno to 0 so
that the client gets it and translates that into -ENOSPC.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
876414065b scoutfs: warn if we try IO outside the device
We've had bugs in allocators that return success and crazy block
numbers.  The bad block numbers eventually make their way down to the
context-free kernel warning that IO was attempted outside the device.
This at least gives us a stack trace to help find where it's coming
from.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
2efba47b77 scoutfs: satisfy large allocs with smaller extents
The previous fallocate and get_block allocators only looked for free
extents larger than the requested allocation size.  This prematurely
returns -ENOSPC if a very large allocation is attempted.  Some xfstests
stress low free space situations by fallocating almost all the free
space in the volume.

This adds an allocation helper function that finds the biggest free
extent to satisfy an allocation, possibly after trying to get more free
extents from the server.  It looks for previous extents in the index of
extents by length.  This builds on the previously added item and extent
_prev operations.

Allocators need to then know the size of the allocation they got instead
of assuming they got what they asked for.  The server can also return a
smaller extent so it needs to communicate the extent length, not just
its start.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
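The fallback described above can be sketched in userspace C.  This is illustrative only: the names and the array stand-in for the by-length extent index are invented, and the caller is assumed to loop until the request is filled or nothing is returned.

```c
#include <stdint.h>
#include <stddef.h>

/*
 * Hypothetical sketch: instead of failing when no single free extent
 * covers the request, take the largest extent available and let the
 * caller loop for the remainder.  The sorted array models the index
 * of free extents by length.
 */
static uint64_t alloc_from_largest(const uint64_t *lens_sorted, size_t nr,
				   uint64_t want)
{
	uint64_t largest;

	if (nr == 0)
		return 0; /* caller maps this to -ENOSPC */

	largest = lens_sorted[nr - 1];
	return want < largest ? want : largest;
}
```

A request for 1000 blocks against free extents of 4, 16, and 64 blocks would be satisfied 64 blocks at a time rather than returning -ENOSPC outright.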
Zach Brown
04660dbfee scoutfs: add scoutfs_extent_prev()
Add an extent function for iterating backwards through extents.  We add
the wrapper and have the extent IO functions call their storage _prev
functions.  Data extent IO can now call the new scoutfs_item_prev().

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
d53ec115bc scoutfs: add scoutfs_item_prev()
Add scoutfs_item_prev() for searching for an item before a given key.

This wasn't initially implemented because it's rarely needed and for a
long time the segment reading and item cache populating code had a
strong bias for iterating forward from the given search key.

Since we've added limiting item cache reading to the keys covered by
locks and reading in entire segments, it's now very easy to iterate
backwards through keys just like scoutfs_item_next() iterates forwards.

The only remaining forward iteration bias was in check_range().  It had
to give callers the start of the cached range that it found.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
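The _prev lookup is the mirror image of _next.  A minimal model, with the item cache stood in by a sorted array of u64 keys (the real code walks cached items under locks):

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/*
 * Illustrative only: return the greatest key strictly before the
 * search key, the mirror of a _next that returns the least key at
 * or after it.
 */
static bool item_prev(const uint64_t *keys, size_t nr, uint64_t key,
		      uint64_t *found)
{
	size_t lo = 0, hi = nr; /* find first index with keys[i] >= key */

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (keys[mid] < key)
			lo = mid + 1;
		else
			hi = mid;
	}
	if (lo == 0)
		return false; /* nothing before the search key */
	*found = keys[lo - 1];
	return true;
}
```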
Zach Brown
600ecd9fad scoutfs: adapt to fallocated extents
The addition of fallocate() now means that offline extents can be
unwritten and allocated and that extents can now be found outside of
i_size.

Truncating needs to know about the possible flag combinations, writing
preallocation needs to know to update an existing extent or allocate up
to the next extent, get_block can't map unwritten extents for read,
extent conversion needs to also clear offline, and truncate needs to
drop extents outside i_size even if truncating to the existing file
size.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
1fca13b092 scoutfs: add fallocate
Add an fallocate operation.

This changes the possible combinations of flags in extents and makes it
possible to create extents beyond i_size.  This will confuse the rest of
the code in a few places and that will be fixed up next.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
dab0fd7d9a scoutfs: update inode item after releasing
The release ioctl forgot to update the inode item after truncating
online block mappings.  This meant that the offline block count update
was lost when the inode was evicted and re-read, leading to inconsistent
offline block counts.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
9c74f2011d scoutfs: add server work tracing
Add some server workqueue and work tracing to chase down the destruction
of an active workqueue.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
08a6fab725 scoutfs: always trace item create/delete ret
Add a trace event for item creation and always trace the return value of
create and delete events.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
27d1f3bcf7 scoutfs: inode read shouldn't modify online blocks
There was a typo in the addition of i_blocks tracking that would set
online blocks to the value of offline blocks when reading an existing
inode into memory.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
9c80f109d5 scoutfs: don't always write deletion items
Items deleted from the item cache would always write deletion items to
segments.  We need to write deletion items so that compaction can
eventually combine them with the existing item and remove both.  We
don't need them for items that were only created in the current
transaction.  Writing a deletion item for them only results in a lot of
extra work compacting the item down to the final segment level so that
it can be removed.

The upcoming extent code really demonstrated the cost of this overhead.
It happens to create and delete quite a lot of temporary extent items
during the transaction as all the different kinds of indexed extents
change.

This change tracks whether a given item in the cache reflects an item
that is present in the persistent storage.  This lets us free items
that have only existed in the current transaction.

This made a meaningful difference when writing a 4MB file with the
current block mapping items, but it made an enormous difference when
writing that same file with the extent items.  It went from writing 1024
deletion items for 11 real items to only writing those real items.

                        items  deletions
block mappings before:     25          5
block mappings after:      25          0
extents before:            11       1024
extents after:             11          0

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
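The rule the commit describes reduces to one bit of state per cached item.  A sketch with invented names:

```c
#include <stdbool.h>

/*
 * Hypothetical model: a deletion item only needs to be written if
 * some older segment still contains the item.  Items created and
 * deleted entirely within the current transaction can simply be
 * freed from the cache.
 */
struct cached_item {
	bool persistent; /* set when the item was read from segments */
};

static bool delete_needs_deletion_item(const struct cached_item *item)
{
	return item->persistent;
}
```

With this, the 1024 temporary extent items from the example above are freed on deletion instead of each emitting a deletion item that must be compacted away.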
Zach Brown
1c5d84fa3e scoutfs: add counters for items written in level 0
Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
e227c6446e scoutfs: don't advance btree after wrapping
The btree writes its blocks to a fixed ring of preallocated blocks.  We
added a trigger to force the index to advance to the next half of the
ring to test conditions where the cached btree blocks are out of date
with respect to the blocks on disk.

We have to be careful to only advance the index once all the live blocks
are migrated out of the half that we're about to advance to.  The
trigger tested that condition.

But it missed the case where the normal btree block allocation *just*
advanced into the next half of the ring.  In this case the migration
needs to occur to make it safe to advance *again* into the previous
half.  But it missed
this case because the migration keys are reset after we test the
trigger.

This resulted in leaving live btree blocks in the half that we advance
to and start overwriting.  The server got -ESTALE as it tried to read
through blocks that had been overwritten and hilarity ensued.

This precise condition of having the trigger fire just as we wrapped was
amazingly caught by scoutfs/505 in xfstests.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
345721c933 scoutfs: preserve sticky deletion items
We limit the number of lower segments that a compaction will read.  A
sticky compaction happens when the upper segment overlaps more lower
segments.  The remaining items in the upper segment are written back to
the upper level -- they're stuck.  A future compaction will attempt to
compact the remaining items with the next set of overlapping lower
segments.

Deletion items are rightly discarded as they're compacted to the lowest
level -- at that point they have no more matching items in lower
segments to destroy and are done.

Deletion items were being dropped instead of being written back into the
upper level of a sticky compaction.  The test for discarding the
deletion items only considered the lowest level of the compaction, not
the level that the items were being written to.  We need to be careful
to preserve the deletion items in the case of compaction to the lowest
level writing sticky items back to the upper segment.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
5f0c87970c scoutfs: fix level 0 key iteration increment
Compaction has to find the oldest level 0 segment for compaction.  It
iterates over the level 0 segments by their manifest entry's btree key.

It was incorrectly incrementing the btree search key.  It was
incrementing the first key stored in the entry, but that's not the least
significant field.  The seq is the least significant field so this
iteration could skip over segments written at different times with the
same first key.

The fix to have it visit all the entries is to increment the lowest
precision seq field.

Right now we have a single level 0 segment so this code never actually
matters.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
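The fix amounts to incrementing the least significant field of the compound key, carrying into the more significant field on wraparound.  A sketch with invented field names:

```c
#include <stdint.h>

/*
 * Illustrative model of the level 0 manifest search key described
 * above: entries are keyed by (first_key, seq) with seq as the least
 * significant field.  Incrementing first_key directly would skip
 * entries that share a first key but have different seqs.
 */
struct l0_key {
	uint64_t first_key;
	uint64_t seq;
};

static void l0_key_inc(struct l0_key *key)
{
	if (++key->seq == 0)	/* carry on seq wraparound */
		key->first_key++;
}
```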
Zach Brown
41c29c48dd scoutfs: add extent corruption cases
The extent code was originally written to panic if it hit errors during
cleanup that resulted in inconsistent metadata.  The more reasonable
strategy is to warn about the corruption, act accordingly, and leave
resolving it to corrective measures.  In this case we
continue returning the error that caused us to try and clean up.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
874a44aef0 scoutfs: remove dead file allocation cursor code
This is no longer used now that we allocate large extents for
concurrently extending files by preallocating unwritten extents.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
fe94eb7363 scoutfs: add unwritten extents
Now that we have extents we can address the fragmentation of concurrent
writes with large preallocated unwritten extents instead of trying to
allocate from disjoint free space with cursors.

First we add support for unwritten extents.  Truncate needs to make sure
it doesn't treat truncated unwritten blocks as online just because
they're not offline.  If we try to write into them we convert them to
written extents.  And fiemap needs to flag them as unwritten and be sure
to check for extents past i_size.

Then we allocate unwritten extents only if we're extending a contiguous
file.  We try to preallocate the size of the file and cap it to a meg.
This ends up with a power of two progression of preallocation sizes,
which nicely balances extent sizes and wasted allocation as file sizes
increase.

We need to be careful to truncate the preallocated regions if the entire
file is released.  We take that as an indication that the user doesn't
want the file consuming any more space.

This removes most of the use of the cursor code.  It will be completely
removed in a further patch.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
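The preallocation sizing above can be sketched as min(current file size, 1MB).  The 4KB block size here is an assumption for illustration; the cap and names are taken from the message:

```c
#include <stdint.h>

/* assumed 4KB blocks, for illustration */
#define BLOCK_SHIFT		12
/* cap preallocation at 1MB worth of blocks: 256 */
#define PREALLOC_CAP_BLOCKS	((1024ULL * 1024) >> BLOCK_SHIFT)

/*
 * Preallocate as many blocks as the file already has, capped at 1MB.
 * A file that keeps extending doubles each time, giving the
 * power-of-two progression of preallocation sizes.
 */
static uint64_t prealloc_blocks(uint64_t isize_blocks)
{
	return isize_blocks < PREALLOC_CAP_BLOCKS ? isize_blocks
						  : PREALLOC_CAP_BLOCKS;
}
```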
Zach Brown
dd091e18a9 scoutfs: add trans item tracking trace
Add a trace event that records the changes to a reservation's dirty item
count.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
1b3645db8b scoutfs: remove dead server allocator code
Remove the bitmap segno allocator code that the server used to use to
manage allocations.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
c01a715852 scoutfs: use extents in the server allocator
Have the server use the extent core to maintain free extent items in the
allocation btree instead of the bitmap items.

We add a client request to allocate an extent of a given length.  The
existing segment alloc and free now work with a segment's worth of
blocks.

The server maintains counters in the super block of free blocks instead
of free segments.  We maintain an allocation cursor so that allocation
results tend to cycle through the device.  It's stored in the super so
that it is maintained across server instances.

This doesn't remove the newly unused code, to keep the commit from
getting too noisy.  It'll be removed in a future commit.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
19f7e0284b scoutfs: add online/offline block trace event
Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
5eddd10eb7 scoutfs: remove dead block mapping code
Remove all the code for tracking block mapping items and storing free
blocks in bitmaps.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
70b2a50c9a scoutfs: remove individual online/offline calls
Remove the functions that operate on online and offline blocks
independently now that the file data mapping code isn't using it any
more.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
abbe76093b scoutfs: store file data in extents
Store file data mappings and free block ranges in extents instead of in
block mapping items and bitmaps.

This adds the new functionality and refactors the functions that use it.
The old functions are no longer called, but for now we only ifdef them
out to keep the change small.  We'll remove all the dead code in a
future change.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
869d11fd0f scoutfs: add core extent functions
Add a file of extent functions that callers will use to manipulate and
store extents in different persistent formats.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
036577890f scoutfs: add atomic online/offline blocks calls
Add functions that atomically change and query the online and offline
block counts as a pair.  They're semantically linked and we shouldn't
present counts that don't match if they're in the process of being
updated.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
4ceb123473 scoutfs: include counters.h for messages
The corruption helpers use counters and callers shouldn't have to
include the counters header themselves.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
4fc554584a scoutfs: add SCOUTFS_BLOCK_MAX
Add the max possible logical block / physical blkno number given u64
bytes recorded at block size granularity.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
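The limit follows directly from the arithmetic in the message: with byte offsets held in a u64 and blocks of `1 << BLOCK_SHIFT` bytes, the largest addressable block number is U64_MAX shifted down by the block shift.  The 4KB shift is assumed for illustration:

```c
#include <stdint.h>

/* assumed 4KB blocks, for illustration */
#define BLOCK_SHIFT	12
/*
 * Max possible logical block / physical blkno given u64 bytes
 * recorded at block-size granularity: with a 12-bit shift this is
 * 2^52 - 1.
 */
#define BLOCK_MAX	(UINT64_MAX >> BLOCK_SHIFT)
```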
Zach Brown
55e063d2a1 scoutfs: get rid of silly lock destroy BUG_ON
The BUG_ON() at the start of scoutfs_lock_destroy() was intended to
ensure that scoutfs_lock_shutdown() had been called first.  But that
doesn't happen in the case where we get an error during mount.

The _destroy() function is careful to notice active use and only tears
down resources that were created.  The BUG_ON() can just be removed.

Signed-off-by: Zach Brown <zab@versity.com>
2018-05-04 09:21:44 -07:00
Zach Brown
f3007f10ca scoutfs: shut down server on commit errors
We hadn't yet implemented any error handling in the server when commits
fail.

Commit errors are serious and we take them as a sign that something has
gone horribly wrong.  This patch prints commit error warnings to the
console and shuts down.  Clients will try to reconnect and resend their
requests.

The hope is that another server will be able to make progress.  But this
same node could become the server again and it could well be that the
errors are persistent.

The next steps are to implement server startup backoff, client retry
backoff, and hard failure policies.

Signed-off-by: Zach Brown <zab@versity.com>
2018-05-01 11:48:19 -07:00
Zach Brown
ae6907623c scoutfs: add btree rw error traces and counters
Add some trivial traces and counters around btree block IO errors.

Signed-off-by: Zach Brown <zab@versity.com>
2018-05-01 11:48:19 -07:00
Zach Brown
24cc5cc296 scoutfs: lock manifest root request
The manifest root request processing samples the stable_manifest_root in
the server info.  The stable_manifest_root is updated after a
commit has succeeded.

The read of stable_manifest_root in request processing was locking the
manifest.  The update during commit doesn't lock the manifest so these
paths were racing.  The race is very tight, a few cpu stores, but it
could in theory give a client a malformed root that could be
misinterpreted as corruption.

Add a seqcount around the store of the stable manifest root during
commit and its load during request processing.  This ensures that
clients always get a consistent manifest root.

Signed-off-by: Zach Brown <zab@versity.com>
2018-04-27 09:06:35 -07:00
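A single-threaded userspace model of the seqcount pattern the commit describes; the kernel's write_seqcount_begin()/read_seqcount_retry() primitives add the memory barriers this sketch leaves out, and the u64 stands in for the real manifest root structure:

```c
#include <stdint.h>

static unsigned int seq;	/* even: stable, odd: write in progress */
static uint64_t stable_root;	/* stand-in for stable_manifest_root */

/* commit path: bracket the store so concurrent readers retry */
static void store_stable_root(uint64_t root)
{
	seq++;			/* now odd: readers will retry */
	stable_root = root;
	seq++;			/* even again: new value published */
}

/* request processing: retry until a consistent value is read */
static uint64_t load_stable_root(void)
{
	unsigned int start;
	uint64_t root;

	do {
		start = seq;
		root = stable_root;
	} while ((start & 1) || start != seq); /* torn read: retry */
	return root;
}
```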
Zach Brown
7d7f8e45b7 scoutfs: more carefully manage private bh bits
The management of _checked and _valid_crc private bits in the
buffer_head wasn't quite right.

_checked indicates that the block has been checked and that the
expensive crc verification doesn't need to be recalculated.  _valid_crc
then indicates the result of the crc verification.

_checked is read without locks.  First, we didn't make sure that
_valid_crc was stored before _checked.  Multiple tasks could race to see
_checked before _valid_crc.  So we add some memory barriers.

Then we didn't clear _checked when re-reading a stale block.  This meant
that the moment the block was read its private flags could still
indicate that it had a valid crc.  We clear the private bits before we
read so that we'll recalculate the crc.

Signed-off-by: Zach Brown <zab@versity.com>
2018-04-27 09:06:35 -07:00
Zach Brown
fe8b155061 scoutfs: add btree corruption messages
Signed-off-by: Zach Brown <zab@versity.com>
2018-04-27 09:06:35 -07:00
Zach Brown
3efcc87413 scoutfs: add corruption messages for namei
Add scoutfs_corruption() calls for corruption associated with mapping
names to inodes.

Signed-off-by: Zach Brown <zab@versity.com>
2018-04-27 09:06:35 -07:00
Zach Brown
c9573d13bb scoutfs: add scoutfs_corruption()
Add a helper for printing a message warning about corruption.

Signed-off-by: Zach Brown <zab@versity.com>
2018-04-27 09:06:35 -07:00
Zach Brown
ac259c82a0 scoutfs: allow interrupting client sends
Waiting for replies to sent requests wasn't interruptible.  This was
preventing ctl-c from breaking out of mount when a server wasn't yet
around to accept connections.

The only complication was that the receive thread was accessing the
sender's struct outside of the lock.  An interrupted sender could remove
their struct while receive was processing it.  We rework recv processing
so that it only uses the sender struct under the lock.  This introduces
a cpu copy of the payload but they're small and relatively infrequent
control messages.

Signed-off-by: Zach Brown <zab@versity.com>
2018-04-13 15:49:14 -07:00
Zach Brown
8061a5cd28 scoutfs: add server bind warning
Emit an error message if the server fails to bind.  It can mean that
the configured address is bad.  But we might want to be able to bind
if the address becomes available, so we don't hard error.  We only emit
the message once for a series of failures.

Signed-off-by: Zach Brown <zab@versity.com>
2018-04-13 15:49:14 -07:00
Zach Brown
81b3159508 scoutfs: return errors from read_items
The introduction of the helper to handle stale segment retrying was
masking errors.  It's meant to pass through the caller's return status
when it doesn't return -EAGAIN to trigger stale read retries.

Signed-off-by: Zach Brown <zab@versity.com>
2018-04-13 15:49:14 -07:00
Zach Brown
676d1e32ef scoutfs: more carefully trace backref walk loop
We were only issuing one kernel warning when we couldn't resolve a path
to an inode due to excessive retries.  It was hard to capture and we
only saw details from the first instance.

This adds a counter for each time we see excessive retries and returns
-ELOOP in that case.  We also extend the link backref adding trace point
to include the found entry, if any.

Signed-off-by: Zach Brown <zab@versity.com>
2018-04-13 10:09:31 -07:00
Zach Brown
c118f7cc03 scoutfs: add option to force tiny btree blocks
Add a tunable option to force using tiny btree blocks on an active
mount.  This lets us quickly exercise large btrees.

Signed-off-by: Zach Brown <zab@versity.com>
2018-04-13 08:59:03 -07:00
Zach Brown
e145267c05 scoutfs: allow smaller btree keys and values
Now that we're using small file system keys we can dramatically shrink
the maximum allowed btree keys and values.  This more accurately matches
the current users and lets us fit more possible items in each block,
which lets us turn the block size way down and still have multiple
worst-case largest items per block.

Signed-off-by: Zach Brown <zab@versity.com>
2018-04-13 08:59:03 -07:00
Zach Brown
31286ad714 scoutfs: add options debugfs dir
Add a debugfs dir that will offer debugging options for an actively
mounted volume.

Signed-off-by: Zach Brown <zab@versity.com>
2018-04-13 08:59:03 -07:00