Commit Graph

1453 Commits

Author SHA1 Message Date
Zach Brown
7e935898ab Avoid premature metadata enospc
server_get_log_trees() sets the low flag in a mount's meta_avail
allocator, triggering enospc for any space consuming allocatins in the
mount, if the server's global meta_vail pool falls below the reserved
block count.  Before each server transaction opens we swap the global
meta_avail and meta_freed allocators to ensure that the transaction has
at least the reserved count of blocks available.

This creates a risk of premature enospc as the global meta_avail pool
drains and swaps to the larger meta_freed.  The pool can be close to the
reserved count, perhaps at it exactly.  _get_log_trees can fill the
client's mount, even a little, and drop the global meta_avail total
under the reserved count, triggering enospc, even though meta_Freed
could have had quite a lot of blocks.

The fix is to ensure that the global meta_avail has 2x the reserved
count and swapping if it falls under that.  This ensures that a server
transaction can consume an entire reserved count and still have enough
to avoid triggering enospc.

This fixes a scattering of rare premature enospc returns that were
hitting during tests.  It was rare for meta_avail to fall just at the
reserved count and for get_log_trees to have to refill the client
allocator, but it happened.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-30 13:26:32 -07:00
Zach Brown
6d0694f1b0 Add resize_devices ioctl and scoutfs command
Add a scoutfs command that uses an ioctl to send a request to the server
to safely use a device that has grown.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-30 13:26:32 -07:00
Zach Brown
fd686cab86 Fix total_data_blocks calculation in mkfs
mkfs was incorrectly initializing total_data_blocks.  The field is meant
to record the number of blocks from the start of the device that the
filesystem could access.  mkfs was subtracting the initial reserved area
of the device, meaning the number of blocks that the filesystem might
access.

This could allow accesses past devices if mount checks the device size
against the smaller total_data_blocks.

And we're about to use total_data_blocks as the start of a new extent to
add when growing the volume.  It needs to be fixed so that this new
grown free extent doesn't overlap with the end of the existing free
extents.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-30 13:26:32 -07:00
Zach Brown
4c1181c055 Remove first_ and last_ super blkno fields
There are fields in the super block that specify the range of blocks
that would be used for metadata or data.  They are from the time when a
single block device was carved up into regions for metadata and data.

They don't make sense now that we have separate metadata and data block
devices.  The starting blkno is static and we go to the end of the
device.

This removes the fields now that they serve no purpose.   The only use
of them to check that freed extents fell within the correct bounds can
still be performed by using the static starting number or roughly using
the size of the devices.  It's not perfect, but this is already only
a check to see that the blknos aren't utter nonsense.

We're removing the fields now to avoid having to update them while
worrying about users when resizing devices.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-30 13:22:42 -07:00
Zach Brown
d6bed7181f Remove almost all interruptible waits
As subsystems were built I tended to use interruptible waits in the hope
that we'd let users break out of most waits.

The reality is that we have significant code paths that have trouble
unwinding.  Final inode deletion during iput->evict in a task is a good
example.  It's madness to have a pending signal turn an inode deletion
from an efficient inline operation to a deferred background orphan inode
scan deletion.

It also happens that golang built pre-emptive thread scheduling around
signals.  Under load we see a surprising amount of signal spam and it
has created surprising error cases which would have otherwise been fine.

This changes waits to expect that IOs (including network commands) will
complete reasonably promptly.  We remove all interruptible waits with
the notable exception of breaking out of a pending mount.  That requires
shuffling setup around a little bit so that the first network message we
wait for is the lock for getting the root inode.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-30 13:22:42 -07:00
Zach Brown
4893a6f915 scoutfs_dirents_equal should return bool
It looks like it returned u64 because it was derived from _name_hash().

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-30 13:22:42 -07:00
Zach Brown
384590f016 Sync net shouldn't wait for errored submits
If async network request submission fails then the response handler will
never be called.  The sync request wrapper made the mistake of trying to
wait for completion when initial submission failed.  This never happened
in normal operation but we're able to trigger it with some regularity
with forced unmount during tests.  Unmount would hang waiting for work
to shutdown which was waiting for request responses that would never
happen.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-30 13:22:42 -07:00
Zach Brown
192f077c16 Update data_version when fallocate changes size
Changing the file size can changes the file contents -- reads will
change when they stop returning data.  fallocate can change the file
size and if it does it should increment the data_version, just like
setattr does.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-30 13:22:42 -07:00
Zach Brown
a9baeab22e stage_tmpfile test gets current data_version
The stage_tmpfile test util was written when fallocate didn't update
data_version for size extensions.  It is more correct to get the
data_version after fallocate changes data_versions for however many
transactions, extent allocations, and i_size extensions it took to
allocate space.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-30 13:22:42 -07:00
Zach Brown
b7ab26539a Avoid lockdep warning about upstream inversion
Some kernels have blkdev_reread_part acquire the bd_mutex and then call
into drop_partitions which calls fsync_bdev which acquires s_umount.
This inverts the usual pattern of deactivate_super getting s_umount and
then using blkdev_put in kill_sb->put_super to drop a second device.

The inversion has been fixed upstream by years of rewrites.  We can't go
back in time to fix the kernels that we're testing against,
unfortunately, so we disable lockdep around our valid leg of the
inversion that lockdep is noticing in our testing.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-30 13:22:42 -07:00
Zach Brown
c51f0c37da Defer dirty inode data writeback (and use list)
iput() can only be used in contexts that could perform final inode
deletion which requires cluster locks and transactions.  This is
absolutely true for the transaction committing worker.  We can't have
deletion during transaction commit trying to get locks and dirty *more*
items in the transaction.

Now that we're properly getting locks in final inode deletion and
O_TMPFILE support has put pressure on deletion, we're seeing deadlocks
between inode eviction during transaction commit getting a index lock
and index lock invalidation trying to commit.

We use the newly offered queued iput to defer the iput from walking our
dirty inodes.   The transaction commit will be able to proceed while
the iput worker is off waiting for a lock.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-30 13:20:40 -07:00
Zach Brown
52107424dd Promote deferred iput to inode call
Lock invalidation had the ability to kick iput off to work context.  We
need to use it for inode writeback as well so we move the mechanism over
to inode.c and give it a proper call.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-30 11:34:52 -07:00
Zach Brown
099a65ab07 Try recovering from truncate errors and more info
We're seeing errors during truncate that are surprising.  Let's try and
recover from them and provide more info when they happen so that we can
dig deeper.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-30 11:34:52 -07:00
Zach Brown
21c5724dd5 Update fenced service file StartLimitBurst
The first draft was written against an older schema, StartLimitBurst is
in [Service] now.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-30 11:34:52 -07:00
Zach Brown
3974d98f6b Don't use "/dev/*" redirections near systemd
It sets up stdout and stderr as sockets, not pipes, so these links don't
work.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-30 11:34:52 -07:00
Zach Brown
2901b43906 Also allow omap requests to disconnected clients
We recently fixed problems sending omap responses to originating clients
which can race with the clients disconnecting.  We need to handle the
requests sent to clients on behalf of an origination request in exactly
the same way.  The send can race with the client being evicted.  It'll
be cleaned after the race is safely ignored by the client's rid being
removed from the server's request tracking.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-30 11:34:52 -07:00
Zach Brown
03d7a4e7fe Show relative times in quorum status file output
The times in the quorum status file are in absolute monotinic kernel
time since bootup.  That's not particularly helpful especially when
comparing across hosts with different boot times.

This shows relative times in timespec64 seconds until or since the times
in question.   While we're at it we also collect the send and receive
timestamps closer to each send or receive call.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-30 11:34:52 -07:00
Zach Brown
d5d3b12986 Specficially shutdown quorum during forced unmount
Generally, forced unmount works by returning errors for all IO.  Quorum
is pretty resilient in that it can have the IO errors eaten by server
startup and does its own messaging that won't return errors.  Trying to
force unmount can have the quorum service continually participate in
electing a server that immediately fails and shutds down.

This specifically shuts down the internal quorum service when it sees
that unmount is being forced.  This is easier and cleaner than having
the network IO return errors and then having that trigger shutdown.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-30 11:34:52 -07:00
Zach Brown
e4dca8ddcc Don't shutdown quorum if server startup fails
The quorum service shuts down if it sees errors that mean that it can't
do its job.

This is mostly fatal errors gathering resources at startup or runtime IO
errors but it was also shutting down if server startup fails.   That's
not quite right.  This should be treated like the server shutting down
on errors.  Quorum needs to stay around to participate in electing the
next server.

Fence timeouts could trigger this.   A quorum mount could crash, the
next server without a fence script could have a fence request timeout
and shutdown, and now the third remaining server is left to indefinitely
send vote requests into the void.

With this fixed, continuing that example, the quorum service in the
second mount remains to elect the third server with a working fence
script after the second server shuts down after its fence request times
out.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-30 11:34:52 -07:00
Zach Brown
011b7d52e5 Merge pull request #45 from versity/ben/systemd_configs
Add fenced systemd and example configs
2021-07-09 08:39:18 -07:00
Ben McClelland
3a9db45194 Add fenced systemd and example configs
This should be good enough to get single node mounts up and running with
fenced with minimal effort.  The example config will need to be copied
to /etc/scoutfs/scoutfs-fenced.conf for it to be functional, so this
still requires specific opt-in and wont accidentally run for multi-node
systems.

Signed-off-by: Ben McClelland <ben.mcclelland@versity.com>
2021-07-09 08:22:39 -07:00
Zach Brown
53f11f5479 Merge pull request #46 from versity/zab/orphan_deletion_and_enospc
Zab/orphan deletion and enospc
2021-07-08 10:52:53 -07:00
Zach Brown
b4ede2ac6a Allow omap responses to disconnected originators
The omap message lifecycle is a little different than the server's usual
handling that sends a response from the request handler.  The response
is sent long after the initial receive handler is pinning the connection
to the client.   It's fine for the response to be dropped.

The main server request handler handled this case but other response
senders didn't.  Put this error handling in the server response sender
itself so that all callers are covered.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-08 09:36:07 -07:00
Zach Brown
cbe8d77f78 Prevent duplicate inode item deletion
We hide I_FREEING inodes from inode lookup to avoid inversions with
cluster locking.  This can result in duplicate inodes structs for a
given inode number.  Then can both race to try and delete the same items
for their shared inode number.  This leads to error messages from
evict_inode and could lead to corruption if they, for example, both try
and free the same data extents.

This adds very basic serialization so only one instance can try to
delete items at a time.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-07 14:13:14 -07:00
Zach Brown
5f682dabb5 Item cache invalidation uses seqs to avoid readers
The item cache has to be careful not to insert stale read items when
previously dirty items have been written and invalidated while a read
was in flight.

This was previously done by recording the possible range of items that a
reader could see based on the key range of its lock.   This is
disasterous when a workload operates entirely within one lock.  I ran
into this when testing a small number of files with massive amounts of
xattrs.  While any reader is in flight all pages can't be invalidated
because they all intersect with the one lock that covers all the items
in use.

The fix is to more naturally reflect the problem by tracking the
greatest item seq in pages and the earliest seq that any readers
can't see.  This lets invalidate only skip pages with items
that weren't visible to the earliest reader.

This more naturally reflects that the problem is due to the age of the
items, not their position in the key space.  Now only a few of the most
recently modified pages could be skipped and they'll be at the end
of the LRU and won't typically be visited.  As an added benefit it's
now much cheaper to add, delete, and test the active readers.

This fix stopped rm -rf of a full system's worth of xattrs from taking
minutes constantly spinning skipping all pages in the LRU to seconds of
doing real removal work.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-07 14:13:14 -07:00
Zach Brown
120c2d342a Add create_xattr_loop test tool
Add a quick tool that creates xattrs in a tight loop.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-07 14:13:14 -07:00
Zach Brown
84454b38c5 Add mkfs -A for small device sizes
Normally mkfs would fail if we specify meta or data devices that are too
small.  We'd like to use small devices for test scenarios, though, so
add an option to allow specifying sizes smaller than the minumum
required sizes.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-07 14:13:14 -07:00
Zach Brown
29cfa81574 Remove unused leftovers from quorum changes
These forward declarations were for interfaces that have since been
removed or changed and are no longer needed.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-07 14:13:14 -07:00
Zach Brown
73bf916182 Return ENOSPC as space gets low
Returning ENOSPC is challenging because we have clients working on
allocators which are a fraction of the whole and we use COW transactions
so we need to be able to allocate to free.  This adds support for
returning ENOSPC to client posix allocators as free space gets low.

For metadata, we reserve a number of free blocks for making progress
with client and server transactions which can free space.  The server
sets the low flag in a client's allocator if we start to dip into
reserved blocks.  In the client we add an argument to entering a
transaction which indicates if we're allocating new space (as opposed to
just modifying existing data or freeing).  When an allocating
transaction runs low and the server low flag is set then we return
ENOSPC.

Adding an argument to transaciton holders and having it return ENOSPC
gave us the opportunity to clean it up and make it a little clearer.
More work is done outside the wait_event function and it now
specifically waits for a transaction to cycle when it forces a commit
rather than spinning until the transaction worker acquires the lock and
stops it.

For data the same pattern applies except there are no reserved blocks
and we don't COW data so it's a simple case of returning the hard ENOSPC
when the data allocator flag is set.

The server needs to consider the reserved count when refilling the
client's meta_avail allocator and when swapping between the two
meta_avail and meta_free allocators.

We add the reserved metadata block count to statfs_more so that df can
subtract it from the free meta blocks and make it clear when enospc is
going to be returned for metadata allocations.

We increase the minimum device size in mkfs so that small testing
devices provide sufficient reserved blocks.

And finally we add a little test that makes sure we can fill both
metadata and data to ENOSPC and then recover by deleting what we filled.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-07 14:13:14 -07:00
Zach Brown
9db3b475c0 Stop log merge work earlier during unmount
The forest log merge work calls into the client to send commit requests
to the server.  The forest is usually destroyed relatively late in the
sequence and can still be running after the client is destroyed.

Adding a _forest_stop call lets us stop the log merging work
before the client is destroyed.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-02 10:54:56 -07:00
Zach Brown
24d682bf81 Add orphan-inodes test
Signed-off-by: Zach Brown <zab@versity.com>
2021-07-02 10:54:56 -07:00
Zach Brown
2957f3e301 Avoid warnings when evict has signals pending
Killing a task can end up in evict and break out of acquiring the locks
to perform final inode deletion.  This isn't necessarily fatal.  The
orphan task will come around and will delete the inode when it is truly
no longer referenced.

So let's silence the error and keep track of how many times it happens.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-02 10:54:56 -07:00
Zach Brown
07210b5734 Reliably delete orphaned inodes
Orphaned items haven't been deleted for quite a while -- the call to the
orphan inode scanner has been commented out for ages.  The deletion of
the orphan item didn't take rid zone locking into account as we moved
deletion from being strictly local to being performed by whoever last
used the inode.

This reworks orphan item management and brings back orphan inode
scanning to correctly delete orphaned inodes.

We get rid of the rid zone that was always _WRITE locked by each mount.
That made it impossible for other mounts to get a _WRITE lock to delete
orphan items.  Instead we rename it to the orphan zone and have orphan
item callers get _WRITE_ONLY locks inside their inode locks.  Now all
nodes can create and delete orphan items as they have _WRITE locks on
the associated inodes.

Then we refresh the orphan inode scanning function.  It now runs
regularly in the background of all mounts.  It avoids creating cluster
lock contention by finding candidates with unlocked forest hint reads
and by testing inode caches locally and via the open map before properly
locking and trying to delete the inode's items.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-02 10:52:46 -07:00
Zach Brown
0374661a92 Merge pull request #43 from versity/zab/btree_merging
Zab/btree merging
2021-06-22 13:16:30 -07:00
Zach Brown
28759f3269 Rotate srch files as log trees items are reclaimed
The log merging work deletes log trees items once their item roots are
merged back into the fs root.  Those deleted items could still have
populated srch files that would be lost.  We force rotation of the srch
files in the items as they're reclaimed to turn them into rotated srch
files that can be compacted.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:37:45 -07:00
Zach Brown
5c3fdb48af Fix btree join item movement
Refilling a btree block by moving items from its siblings as it falls
under the join threshold had some pretty serious mistakes.  It used the
target block's total item count instead of the siblings when deciding
how many items to move.  It didn't take item moving overruns into
account when deciding to compact so it could run out of contiguous free
space as it moved the last item.  And once it compacted it returned
without moving because the return was meant to be in the error case.

This is all fixed by correctly examining the sibling block to determine
if we should join a block up to 75% full or move a big chunk over,
compacting if the free space doesn't have room for an excessive worst
case overrun, and fixing the compaction error checking return typo.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
a7828a6410 Add log merge item allocators to alloc detail
The alloc iterator needs to find and include the totals of the avail and
freed allocator list heads in the log merge items.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
a1d46e1a92 Fix mkfs btree item offset calculation
mkfs was miscalculating the offset of the start of the free region in
the center of blocks as it populated blocks with items.  It was using
the length of the free region as its offset in the block.  To find
the offset of the end of the free region in the block it has to be
taken relative to the end of the item array.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
d67db6662b Fix item cache val_len alignment math
Some item_val_len() callers were applying alignment twice, which isn't
needed.

And additions to erased_bytes as value lengths change  didn't take
alignment into account.  They could end up double counting if val_len
changes within the alignment are then accounted for again as the full
item and alignment is later deleted.  Additions to erased_bytes based on
val_len should always take alignment into account.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
c5c050bef0 Item cache might free null page on alloc error
The item cache allocates a page and a little tracking struct for each
cached page.  If the page allocation fails it might try to free a null
page pointer, which isn't allowed.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
96d286d6e5 Zero btree item padding as items are created
Item creation, which fills out a new item at the end of the array of
item structs at the start of the block, didn't explicitly zero the item
struct padding to 0.  It would only have been zero if the memory was
already zero, which is likely for new blocks, but isn't necessarily true
if the memory had previously been used by deleted values.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
9febc6b5dc Update btree block validator for 8byte alignment
The change to aligning values didn't update the btree block verifier's
total length calculation, and while we're in there we can also check
that values are correctly aligned.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
045b3ca8d4 Expand unused btree verifying walker
Previously we had an unused function that could be flipped on to verify
btree blocks during traversal.   This refactors the block verifier a bit
to be called by a verifying walker.  This will let callers walk paths to
leaves to verify the tree around operations, rather than verification
being performed during the next walk.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
ff882a4c4f Add btree total_above_join_low_water() test
Take the condition used to decide if a btree block needs to be joined
and put it in total_above_join_low_water() so that btree_merging will be
able to call it to see if the leaf block it's merging into needs to be
joined.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
3d1a0f06c0 Add scoutfs_btree_free_blocks
Add a btree function for freeing all the blocks in a btree without
having to cow the blocks to track which refs have been freed.  We use a
key from the caller to track which portions of the tree have been freed.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
3488b4e6e0 Add scoutfs print support for log merge items
Add support for printing all the items in the log_merge tree that the
server uses to track log merging.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
c482204fcf Clean up btree root printing in superblock
Over time the printing of the btree roots embedded in the super block
has gotten a little out of hand.  Add a helper macro for the printf
format and args and re-order them to match their order in the
superblock.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
9711fef122 Update for core, trans, and item seq use
We now have a core seq number in the super that is advanced for multiple
users.    The client transaction seq comes from the core seq so we
remove the trans_seq from the super.  The item version is also converted
to use a seq that's derived from the core seq.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
91acf92666 Add client btree merge processing
Add the client work which is regularly scheduled to ask the server for
log merging work to do.  The relatively simple client work gets a
request from the server, finds the log roots to merge given the reqeust
seq, performs the merge with a btree call and callbacks, and commits the
result to the server.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
9c2122f7de Add server btree merge processing
This adds the server processing side of the btree merge functionality.
The client isn't yet sending the log_merge messages so no merging will
be performed.

The bulk of the work happens as the server processess a get_log_merge
message to build a merge request for the client.  It starts a log merge
if one isn't in flight.  If one is in flight it checks to see if it
should be spliced and maybe finished.  In the common case it finds the
next range to be merged and sends the request to the client to process.

The commit_log_merge handler is the completion side of that request.  If
the request failed then we unwind its resources based on the stored
request item.  If it succeeds we record it in an item for get_
processing to splice eventually.

Then we modify two existing server code paths.

First, get_log_tree doesn't just create or use a single existing log
btree for a client mount.  If the existing log btree is large enough it
sets its finalized flag and advances the nr to use a new log btree.
That makes the old finalized log btree available for merging.

Then we need to be a bit more careful when reclaiming the open log btree
for a client.  We can't use next to find the only open log btree, we use
prev to find the last and make sure that it isn't already finalized.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00