Commit Graph

1064 Commits

Author SHA1 Message Date
Bryant Duffy-Ly
66b8c5fbd7 Enhance clarify of some kfree paths
In some of the allocation paths there are goto statements
that end up calling kfree(). That is fine, but in cases
where the pointer is not initially set to NULL then we
might have an undefined behavior. kfree on a NULL pointer
does nothing, so essentially these changes should not
change behavior, but clarifies the code path better.

Signed-off-by: Bryant Duffy-Ly <bduffyly@versity.com>
2021-10-06 18:07:27 -05:00
Zach Brown
6ca8c0eec2 Consistently initialize dentry info
Unfortunately, we're back in kernels that don't yet have d_op->d_init.
We allocate our dentry info manually as we're given dentries.  The
recent verification work forgot to consistently make sure the info was
allocated before using it.   Fix that up, and while we're at it be a bit
more robust in how we check to see that it's been initialized without
grabbing the d_lock.

Signed-off-by: Zach Brown <zab@versity.com>
2021-09-13 14:41:07 -07:00
Zach Brown
ea2b01434e Add support for i_version
This adds i_version to our inode and maintains it as we allocate, load,
modify, and store inodes.  We set the flag in the superblock so
in-kernel users can use i_version to see changes in our inodes.

Signed-off-by: Zach Brown <zab@versity.com>
2021-09-13 14:41:07 -07:00
Zach Brown
d5eec7d001 Fix uninitialized srch ret that won't happen
More recent gcc notices that ret in delete_files can be undefined if nr
is 0 while missing that we won't call delete_files in that case.  Seems
worth fixing, regardless.

Signed-off-by: Zach Brown <zab@versity.com>
2021-09-13 14:41:07 -07:00
Zach Brown
b9a0f1709f Add xattr .totl. tag
Add the .totl. xattr tag.  When the tag is set the end of the name
specifies a total name with 3 encoded u64s separated by dots.  The value
of the xattr is a u64 that is added to the named total.   An ioctl is
added to read the totals.

Signed-off-by: Zach Brown <zab@versity.com>
2021-09-13 14:41:07 -07:00
Zach Brown
a59fd5865d Add seq and flags to btree items
The fs log btrees have values that start with a header that stores the
item's seq and flags.  There's a lot of sketchy code that manipulates
the value header as items are passed around.

This adds the seq and flags as core item fields in the btree.   They're
only set by the interfaces that are used to store fs items: _insert_list
and _merge.  The rest of the btree items that use the main interface
don't work with the fields.

This was done to help delta items discover when logged items have been
merged before the finalized lob btrees are deleted and the code ends up
being quite a bit cleaner.

Signed-off-by: Zach Brown <zab@versity.com>
2021-09-09 14:44:55 -07:00
Zach Brown
46edf82b6b Add inode crtime creation time
Add an inode creation time field.  It's created for all new inodes.
It's visible to stat_more.  setattr_more can set it during
restore.

Signed-off-by: Zach Brown <zab@versity.com>
2021-09-03 11:14:41 -07:00
Zach Brown
79fbaa6481 Verify dentries after locking
Our dir methods were trusting dentry args.  The vfs code paths use
i_mutex to protect dentries across revalidate or lookup and method
calls.  But that doesn't protect methods running in other mounts.
Multiple nodes can interleave the initial lookup or revalidate then
actual method call.

Rename got this right.  It is very paranoid about verifying inputs after
acquiring all the locks it needs.

We extend this pattern to the rest of the methods that need to use the
mapping of name to inode (and our hash and pos) in dentries.  Once we
acquire the parent dir lock we verify that the dentry is still current,
returning -EEXIST or -ENOENT as appropriate.

Along these lines, we tighten up dentry info correctness a bit by
updating our dentry info (recording lock coverage and hash/pos) for
negative dentries produced by lookup or as the result of unlink.

Signed-off-by: Zach Brown <zab@versity.com>
2021-08-31 09:49:32 -07:00
Zach Brown
ad5662b892 Handle dupe invalidation requests during recovery
Client lock invalidation handling was very strict about not receiving
duplicate invalidation requests from the server because it could only
track one pending request.  The promise to only send one invalidate at a
time is made by one server, it can't be enforced across server failover.
Particularly because invalidation processing can have to do quite a lot
of work with the server as it tears down state associated with the lock.

We fix this by recording and processing each individual incoming
invalidation request on the lock.

The code that handled reordering of incoming grant responses and
invalidation requests waited for the lock's mode to match the old mode
in the invalidation request before proceeding.  That would have
prevented duplicate invalidation requests from making forward progress.

To fix this we make lock client recieve processing synchronous instead
of going through async work which can reorder.  Now grant responses are
processed as they're received and will always be resolved before all the
invalidation requests are queued and processed in order.

Signed-off-by: Zach Brown <zab@versity.com>
2021-08-25 10:14:38 -07:00
Zach Brown
f5577e26b1 Reset item state when retrying stale forest reads
The forest reader reads items from the fs_root and all log btrees and
gives them to the caller who tracks them to resolve version differences.

The reads can run into stale blocks which have been overwritten.  The
forest reader was implementing the retry under the item state in the
caller.  This can corrupt items that are only seen firest in an old fs
root before a merge and then only seen in the fs_root after a merge.  In
this case the item won't have any versioning and the existing version
from the old fs_root is preferred.  This is particularly bad when the
new version was deleted -- in that case we have no metadata which would
tell us to drop the old item that was read from the old fs_root.

This is fixed by pushing the retry up to callers who wipe the item state
before each retry.  Now each set of items is related to a single
snapshot of the fs_root and logs at one point in time.

I haven't seen definitive evidence of this happening in practice.  I
found this problem after putting on my craziest thinking toque and
auditing the code for places where we could lose item updates.

Signed-off-by: Zach Brown <zab@versity.com>
2021-08-25 10:14:38 -07:00
Zach Brown
5f57785790 Fix btree merge input item iteration
Btree merging attempted to build an rbtree of the input roots with only
one version of an item present in the rbtree at a time.  It really
messed this up by completely dropping an input root when a root with a
newer version of its item tried to take its place in the rbtree.  What
it should have done is advance to the next item in the older root, which
itself could have required advancing some other older root.  Dropping
the root entirely is catastrophically wrong because it hides the rest of
the items in the root from merging.  This has been manifesting as
occasional mysterious item loss during tests where memory pressure, item
update patterns, and merging all lined up just so.

This fixes the problem by more clearly keeping the next item in each
root in the rbtree.   We sort by newest to oldest version so that once
we merge the most recent version of an item its easy to skip all the
older versions of the items in the next rbtree entries for the
rest of the input roots.

While we're at it we work with references to the static cached input
btree blocks.  The old code was a first pass that used an expensive
btree walk per item and copied the value payload.

Signed-off-by: Zach Brown <zab@versity.com>
2021-08-25 10:14:38 -07:00
Zach Brown
3740c0a995 More carefully scan for orphan inodes
The current orphan scan uses the forest_next_hint to look for candidate
orphan items to delete.  It doesn't skip deleted items and checks the
forest of log btrees so it'd return hints for every single item that
existed in all the log btrees across the system.  And we call the hint
per-item.

When the system is deleting a lot of files we end up generating a huge
load where all mounts are constantly getting the btree roots from the
server, reading all the newest log btree blocks, finding deleted orphan
items for inodes that have already been deleted, and moving on to the
next deleted orphan item.

The fix is to use a read-only traversal of only one version of the fs
root for all the items in one scan.   This avoids all the deleted orphan
items that exist in the log btrees which will disappear when they're
merged.  It lets the item iteration happen in a single read-only cached
btree instead of constantly reading in the most recently written root
block of every log btree.

The result is an enormous speedup of large deletions.  I don't want to
describe exactly how enormous.

Signed-off-by: Zach Brown <zab@versity.com>
2021-08-25 10:14:38 -07:00
Zach Brown
a4f5293e78 Flush invalidate and iput inode references
We can be performing final deletion as inodes are evicted during
unmount.  We have to keep full locking, transactions, and networking up
and running for the evict_inodes() call in generic_shutdown_super().
Unfortunately, this means that workers can be using inode references
during evict_inodes() which prevents them from being evicted.  Those
workers can then remain running as we tear down the system, causing
crashes and deadlocks as the final iputs try to use resources that have
been destroyed.

The fix is to first properly stop orphan scanning, which can instantiate
new cached inodes, up before the call to kill_block_super ends up trying
to evict all inodes.  Then we just need to wait for any pending iput and
invalidate work to finish and perform the final iput, which will always
evict because generic_shutdown_super has cleared MS_ACTIVE.

Signed-off-by: Zach Brown <zab@versity.com>
2021-08-25 10:14:38 -07:00
Zach Brown
0c3026a2b7 Add simple per-lock server message count stats
Add some simple tracking of message counts for each lock in the lock
server so that we can start to see where conflicts may be happening in a
running system.

Signed-off-by: Zach Brown <zab@versity.com>
2021-08-25 10:14:38 -07:00
Zach Brown
5bc95fac7d Add scoutfs_unmounting()
Add a quick helper that can be used to avoid doing work if we know that
we're already shutting down.  This can be a single coarser indicator
than adding functions to each subsystem to track that we're shutting
down.

Signed-off-by: Zach Brown <zab@versity.com>
2021-08-25 10:14:38 -07:00
Zach Brown
b0a08eb922 Remove lock grace period
We had some logic to try and delay lock invalidation while the lock was
still actively in use.  This was trying to reduce the cost of
pathological lock conflict cases but it had some severe fairness
problems.

It was first introduced to deal with bad patterns in userspace that no
longer exist and it was built on top of the LSM transaction machinery
that also no longer exists.   It hasn't aged well.

Instead of introducing invalidation latency in the hopes that it leads
to more batched work, which it can't always, let's aim more towards
reducing latency in all parts of the write-invalidate-read path and
also aim towards reducing contention in the first place.

Signed-off-by: Zach Brown <zab@versity.com>
2021-08-25 10:14:38 -07:00
Zach Brown
bb571377dc Don't merge newer items past older
We have a problem where items can appear to go backwards in time because
of the way we chose which log btrees to finalize and merge.

Because we don't have versions in items in the fs_root, and even might
not have items at all if they were deleted, we always assume items in
log btrees are newer than items in the fs root.

This creates the requirement that we can't merge a log btree if it has
items that are also present in older versions in other log btrees which
are not being merged.  The unmerged old item in the log btree would take
precedent over the newer merged item in the fs root.

We weren't enforcing this requirement at all.  We used the max_item_seq
to ensure that all items were older than the current stable seq but that
says nothing about the relationship between older items in the finalized
and active log btrees.  Nothing at all stops an active btree from having
an old version of a newer item that is present in another mount's
finalized log btree.

To reliably fix this we create a strict item seq discontinuity between
all the finalized merge inputs and all the active log btrees.  Once any
log btree is naturally finalized the server forced all the clients to
group up and finalize all their open log btrees.   A merge operation can
then safely operate on all the finalized trees before any new trees are
given to clients who would start using increasing items seqs.

Signed-off-by: Zach Brown <zab@versity.com>
2021-08-25 10:14:38 -07:00
Zach Brown
5897f4d889 Add a trivial trace_printk wrapper
Make it a bit easier to include the fsid and rid in trace_printk
messages when we're experimenting.

Signed-off-by: Zach Brown <zab@versity.com>
2021-08-24 09:12:20 -07:00
Zach Brown
999093bfc9 Add sync log trees network command
Add a command for the server to request that clients commit their open
transaction.   This will be used to create groups of finalized log
btrees for consistent merging.

Signed-off-by: Zach Brown <zab@versity.com>
2021-08-24 09:12:17 -07:00
Zach Brown
05b5d93365 Verify that quorum_slot_nr references valid slot
We were checking that quorum_slot_nr was within the range of possible
slots allowed by the format as it was parsed.  We weren't checking that
it referenced a configured slot.  Make sure, and give a nice error
message that shows the configured slots.

Signed-off-by: Zach Brown <zab@versity.com>
2021-08-24 09:11:40 -07:00
Zach Brown
4d7191dc48 Print messages on extent ins/rem errors
Signed-off-by: Zach Brown <zab@versity.com>
2021-08-24 09:11:40 -07:00
Zach Brown
4495dbdce6 Set initial quorum term from max of all blocks
During rough forced unmount testing we saw a seemingly mysterious
concurrent election.  It could be explained if mounts coming up don't
start with the same term.  Let's try having mounts initialize their term
to the greatest of all the terms they can see in the quorum blocks.
This will prevent the situation where some new quorum actors with
greater terms start out ignoring all the messages from others.

Signed-off-by: Zach Brown <zab@versity.com>
2021-08-24 09:11:40 -07:00
Zach Brown
70569b0448 Trivial quorum test;set -> test_and_set
Nothing interesting here, just a minor convenience to use test and set
instead of testing and then setting.

Signed-off-by: Zach Brown <zab@versity.com>
2021-08-24 09:11:40 -07:00
Zach Brown
823838cf01 Add more messages to server processing errors
The server doesn't give us much to go on when it gets an error handling
requests to work with log trees from the client.  This adds a lot of
specific error messages so we can get a better understanding of
failures.

Signed-off-by: Zach Brown <zab@versity.com>
2021-08-24 09:11:40 -07:00
Zach Brown
89b5865a4c Verify that log tree commit is for sending rid
We were trusting the rid in the log trees struct that the client sent.
Compare it to our recorded rid on the connection and fail if the client
sent the wrong rid.

Signed-off-by: Zach Brown <zab@versity.com>
2021-08-17 12:13:01 -07:00
Zach Brown
65ac42831f Queue invalidation during previous request
The locking protocol only allows one outstanding invalidation request
for a lock at a time.  The client invalidation state is a bit hairy and
involves removing the lock from the invalidation list while it is being
processed which includes sending the response.  This means that another
request can arrive while the lock is not on the invalidation list.  We
have fields in the lock to record another incoming request which puts
the lock back on the list.

But the invalidation work wasn't always queued again in this case.  It
*looks* like the incoming request path would queue the work, but by
definition the lock isn't on the invalidation list during this race.  If
it's the only lock in play then the invalidation list will be empty and
the work won't be queued.  The lock can get stuck with a pending
invalidation if nothing else kicks the invaliation worker.  We saw this
in testing when the root inode lock group missed the wakeup.

The fix is to have the work requeue itself after putting the lock back
on the invalidation list when it notices that another request came in.

Signed-off-by: Zach Brown <zab@versity.com>
2021-08-06 15:41:11 -07:00
Zach Brown
cb1726681c Fix net BUG_ON if reconnection farewell send races
When a client socket disconnects we save the connection state to re-use
later if the client reconnects.  A newly accepted connection finds the
old connection associated with the reconnecting client and migrates
state from the old idle connection to the newly accepted connection.

While moving messages between the old and new send and resend queues the
code had an aggressive BUG_ON that was asserting that the newly accepted
connection couldn't have any messages in its resend queue.

This BUG can be tripped due to the ordering of greeting processing and
connection state migration.  The server greeting processing path sends
the farewell response to the client before it calls the net code to
migrate connection state.  When it "sends" the farewell response it puts
the message on the send queue and kicks the send work.  It's possible
for the send work to execute and move the farewell response to the
resend queue and trip the BUG_ON.

This is harmless.   The sent greeting response is going to end up on the
resend queue either way, there's no reason for the reconnection
migration to assert that it can't have happened yet.  It is going to be
dropped the moment we get a message from the client with a recv_seq that
is necessarily past the greeting response which always gets a seq of 1
from the newly accepted connection.

We remove the BUG_ON and try to splice the old resend queue after the
possible response at the head of the resend_queue so that it is the
first to be dropped.

Signed-off-by: Zach Brown <zab@versity.com>
2021-08-02 11:15:57 -07:00
Zach Brown
cdff272163 Fix alloc list exhaustion calculation
The last thing server commits do is move extents from the freed list
into freed extents.  It moves as many as it can until it runs out of
avail meta blocks and space fore freed meta blocks in the current
allocator's lists.

The calculation for whether the lists had resources to move an extent
was quite off.  It missed that the first move might have to dirty the
current allocator or the list block, that the btree could join/split
blocks at each level down the paths, and boy does it look like the
height component of the calculation was just bonkers.

With the wrong calculation the server could overflow the freed list
while moving extents and trigger a BUG_ON.   We rarely saw this in
testing.

Signed-off-by: Zach Brown <zab@versity.com>
2021-08-01 14:31:57 -07:00
Zach Brown
7e935898ab Avoid premature metadata enospc
server_get_log_trees() sets the low flag in a mount's meta_avail
allocator, triggering enospc for any space consuming allocatins in the
mount, if the server's global meta_vail pool falls below the reserved
block count.  Before each server transaction opens we swap the global
meta_avail and meta_freed allocators to ensure that the transaction has
at least the reserved count of blocks available.

This creates a risk of premature enospc as the global meta_avail pool
drains and swaps to the larger meta_freed.  The pool can be close to the
reserved count, perhaps at it exactly.  _get_log_trees can fill the
client's mount, even a little, and drop the global meta_avail total
under the reserved count, triggering enospc, even though meta_Freed
could have had quite a lot of blocks.

The fix is to ensure that the global meta_avail has 2x the reserved
count and swapping if it falls under that.  This ensures that a server
transaction can consume an entire reserved count and still have enough
to avoid triggering enospc.

This fixes a scattering of rare premature enospc returns that were
hitting during tests.  It was rare for meta_avail to fall just at the
reserved count and for get_log_trees to have to refill the client
allocator, but it happened.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-30 13:26:32 -07:00
Zach Brown
6d0694f1b0 Add resize_devices ioctl and scoutfs command
Add a scoutfs command that uses an ioctl to send a request to the server
to safely use a device that has grown.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-30 13:26:32 -07:00
Zach Brown
4c1181c055 Remove first_ and last_ super blkno fields
There are fields in the super block that specify the range of blocks
that would be used for metadata or data.  They are from the time when a
single block device was carved up into regions for metadata and data.

They don't make sense now that we have separate metadata and data block
devices.  The starting blkno is static and we go to the end of the
device.

This removes the fields now that they serve no purpose.   The only use
of them to check that freed extents fell within the correct bounds can
still be performed by using the static starting number or roughly using
the size of the devices.  It's not perfect, but this is already only
a check to see that the blknos aren't utter nonsense.

We're removing the fields now to avoid having to update them while
worrying about users when resizing devices.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-30 13:22:42 -07:00
Zach Brown
d6bed7181f Remove almost all interruptible waits
As subsystems were built I tended to use interruptible waits in the hope
that we'd let users break out of most waits.

The reality is that we have significant code paths that have trouble
unwinding.  Final inode deletion during iput->evict in a task is a good
example.  It's madness to have a pending signal turn an inode deletion
from an efficient inline operation to a deferred background orphan inode
scan deletion.

It also happens that golang built pre-emptive thread scheduling around
signals.  Under load we see a surprising amount of signal spam and it
has created surprising error cases which would have otherwise been fine.

This changes waits to expect that IOs (including network commands) will
complete reasonably promptly.  We remove all interruptible waits with
the notable exception of breaking out of a pending mount.  That requires
shuffling setup around a little bit so that the first network message we
wait for is the lock for getting the root inode.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-30 13:22:42 -07:00
Zach Brown
4893a6f915 scoutfs_dirents_equal should return bool
It looks like it returned u64 because it was derived from _name_hash().

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-30 13:22:42 -07:00
Zach Brown
384590f016 Sync net shouldn't wait for errored submits
If async network request submission fails then the response handler will
never be called.  The sync request wrapper made the mistake of trying to
wait for completion when initial submission failed.  This never happened
in normal operation but we're able to trigger it with some regularity
with forced unmount during tests.  Unmount would hang waiting for work
to shutdown which was waiting for request responses that would never
happen.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-30 13:22:42 -07:00
Zach Brown
192f077c16 Update data_version when fallocate changes size
Changing the file size can changes the file contents -- reads will
change when they stop returning data.  fallocate can change the file
size and if it does it should increment the data_version, just like
setattr does.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-30 13:22:42 -07:00
Zach Brown
b7ab26539a Avoid lockdep warning about upstream inversion
Some kernels have blkdev_reread_part acquire the bd_mutex and then call
into drop_partitions which calls fsync_bdev which acquires s_umount.
This inverts the usual pattern of deactivate_super getting s_umount and
then using blkdev_put in kill_sb->put_super to drop a second device.

The inversion has been fixed upstream by years of rewrites.  We can't go
back in time to fix the kernels that we're testing against,
unfortunately, so we disable lockdep around our valid leg of the
inversion that lockdep is noticing in our testing.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-30 13:22:42 -07:00
Zach Brown
c51f0c37da Defer dirty inode data writeback (and use list)
iput() can only be used in contexts that could perform final inode
deletion which requires cluster locks and transactions.  This is
absolutely true for the transaction committing worker.  We can't have
deletion during transaction commit trying to get locks and dirty *more*
items in the transaction.

Now that we're properly getting locks in final inode deletion and
O_TMPFILE support has put pressure on deletion, we're seeing deadlocks
between inode eviction during transaction commit getting a index lock
and index lock invalidation trying to commit.

We use the newly offered queued iput to defer the iput from walking our
dirty inodes.   The transaction commit will be able to proceed while
the iput worker is off waiting for a lock.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-30 13:20:40 -07:00
Zach Brown
52107424dd Promote deferred iput to inode call
Lock invalidation had the ability to kick iput off to work context.  We
need to use it for inode writeback as well so we move the mechanism over
to inode.c and give it a proper call.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-30 11:34:52 -07:00
Zach Brown
099a65ab07 Try recovering from truncate errors and more info
We're seeing errors during truncate that are surprising.  Let's try and
recover from them and provide more info when they happen so that we can
dig deeper.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-30 11:34:52 -07:00
Zach Brown
2901b43906 Also allow omap requests to disconnected clients
We recently fixed problems sending omap responses to originating clients
which can race with the clients disconnecting.  We need to handle the
requests sent to clients on behalf of an origination request in exactly
the same way.  The send can race with the client being evicted.  It'll
be cleaned after the race is safely ignored by the client's rid being
removed from the server's request tracking.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-30 11:34:52 -07:00
Zach Brown
03d7a4e7fe Show relative times in quorum status file output
The times in the quorum status file are in absolute monotinic kernel
time since bootup.  That's not particularly helpful especially when
comparing across hosts with different boot times.

This shows relative times in timespec64 seconds until or since the times
in question.   While we're at it we also collect the send and receive
timestamps closer to each send or receive call.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-30 11:34:52 -07:00
Zach Brown
d5d3b12986 Specficially shutdown quorum during forced unmount
Generally, forced unmount works by returning errors for all IO.  Quorum
is pretty resilient in that it can have the IO errors eaten by server
startup and does its own messaging that won't return errors.  Trying to
force unmount can have the quorum service continually participate in
electing a server that immediately fails and shutds down.

This specifically shuts down the internal quorum service when it sees
that unmount is being forced.  This is easier and cleaner than having
the network IO return errors and then having that trigger shutdown.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-30 11:34:52 -07:00
Zach Brown
e4dca8ddcc Don't shutdown quorum if server startup fails
The quorum service shuts down if it sees errors that mean that it can't
do its job.

This is mostly fatal errors gathering resources at startup or runtime IO
errors but it was also shutting down if server startup fails.   That's
not quite right.  This should be treated like the server shutting down
on errors.  Quorum needs to stay around to participate in electing the
next server.

Fence timeouts could trigger this.   A quorum mount could crash, the
next server without a fence script could have a fence request timeout
and shutdown, and now the third remaining server is left to indefinitely
send vote requests into the void.

With this fixed, continuing that example, the quorum service in the
second mount remains to elect the third server with a working fence
script after the second server shuts down after its fence request times
out.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-30 11:34:52 -07:00
Zach Brown
b4ede2ac6a Allow omap responses to disconnected originators
The omap message lifecycle is a little different than the server's usual
handling that sends a response from the request handler.  The response
is sent long after the initial receive handler is pinning the connection
to the client.   It's fine for the response to be dropped.

The main server request handler handled this case but other response
senders didn't.  Put this error handling in the server response sender
itself so that all callers are covered.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-08 09:36:07 -07:00
Zach Brown
cbe8d77f78 Prevent duplicate inode item deletion
We hide I_FREEING inodes from inode lookup to avoid inversions with
cluster locking.  This can result in duplicate inodes structs for a
given inode number.  Then can both race to try and delete the same items
for their shared inode number.  This leads to error messages from
evict_inode and could lead to corruption if they, for example, both try
and free the same data extents.

This adds very basic serialization so only one instance can try to
delete items at a time.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-07 14:13:14 -07:00
Zach Brown
5f682dabb5 Item cache invalidation uses seqs to avoid readers
The item cache has to be careful not to insert stale read items when
previously dirty items have been written and invalidated while a read
was in flight.

This was previously done by recording the possible range of items that a
reader could see based on the key range of its lock.   This is
disasterous when a workload operates entirely within one lock.  I ran
into this when testing a small number of files with massive amounts of
xattrs.  While any reader is in flight all pages can't be invalidated
because they all intersect with the one lock that covers all the items
in use.

The fix is to more naturally reflect the problem by tracking the
greatest item seq in pages and the earliest seq that any readers
can't see.  This lets invalidate only skip pages with items
that weren't visible to the earliest reader.

This more naturally reflects that the problem is due to the age of the
items, not their position in the key space.  Now only a few of the most
recently modified pages could be skipped and they'll be at the end
of the LRU and won't typically be visited.  As an added benefit it's
now much cheaper to add, delete, and test the active readers.

This fix stopped rm -rf of a full system's worth of xattrs from taking
minutes constantly spinning skipping all pages in the LRU to seconds of
doing real removal work.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-07 14:13:14 -07:00
Zach Brown
29cfa81574 Remove unused leftovers from quorum changes
These forward declarations were for interfaces that have since been
removed or changed and are no longer needed.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-07 14:13:14 -07:00
Zach Brown
73bf916182 Return ENOSPC as space gets low
Returning ENOSPC is challenging because we have clients working on
allocators which are a fraction of the whole and we use COW transactions
so we need to be able to allocate to free.  This adds support for
returning ENOSPC to client posix allocators as free space gets low.

For metadata, we reserve a number of free blocks for making progress
with client and server transactions which can free space.  The server
sets the low flag in a client's allocator if we start to dip into
reserved blocks.  In the client we add an argument to entering a
transaction which indicates if we're allocating new space (as opposed to
just modifying existing data or freeing).  When an allocating
transaction runs low and the server low flag is set then we return
ENOSPC.

Adding an argument to transaciton holders and having it return ENOSPC
gave us the opportunity to clean it up and make it a little clearer.
More work is done outside the wait_event function and it now
specifically waits for a transaction to cycle when it forces a commit
rather than spinning until the transaction worker acquires the lock and
stops it.

For data the same pattern applies except there are no reserved blocks
and we don't COW data so it's a simple case of returning the hard ENOSPC
when the data allocator flag is set.

The server needs to consider the reserved count when refilling the
client's meta_avail allocator and when swapping between the two
meta_avail and meta_free allocators.

We add the reserved metadata block count to statfs_more so that df can
subtract it from the free meta blocks and make it clear when enospc is
going to be returned for metadata allocations.

We increase the minimum device size in mkfs so that small testing
devices provide sufficient reserved blocks.

And finally we add a little test that makes sure we can fill both
metadata and data to ENOSPC and then recover by deleting what we filled.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-07 14:13:14 -07:00
Zach Brown
9db3b475c0 Stop log merge work earlier during unmount
The forest log merge work calls into the client to send commit requests
to the server.  The forest is usually destroyed relatively late in the
sequence and can still be running after the client is destroyed.

Adding a _forest_stop call lets us stop the log merging work
before the client is destroyed.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-02 10:54:56 -07:00
Zach Brown
2957f3e301 Avoid warnings when evict has signals pending
Killing a task can end up in evict and break out of acquiring the locks
to perform final inode deletion.  This isn't necessarily fatal.  The
orphan task will come around and will delete the inode when it is truly
no longer referenced.

So let's silence the error and keep track of how many times it happens.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-02 10:54:56 -07:00