1415 Commits

Author SHA1 Message Date
Zach Brown
a9da27444f Merge pull request #128 from versity/zab/prealloc_fragmentation
Zab/prealloc fragmentation
2023-06-29 09:57:32 -07:00
Zach Brown
49fe89741d Merge pull request #125 from versity/zab/get_referring_entries
Zab/get referring entries
2023-06-29 09:57:06 -07:00
Zach Brown
847916860d Advance move_blocks extent search offset
The move_blocks ioctl finds extents to move in the source file by
searching from the starting block offset of the region to move.
Logically, this is fine.  After each extent item is deleted the next
search will find the next extent.

The problem is that deleted items still exist in the item cache.  The
next iteration has to skip over all the deleted extents from the start
of the region.  This is fine with large extents, but with heavily
fragmented extents this creates a huge amplification of the number of
items to traverse when moving the fragmented extents in a large file.
(It's not quite O(n^2)/2 for the total extents because deleted items are
purged as we write out the dirty items in each transaction, but it's
still immense.)

The fix is to simply start searching for the next extent after the one
we just moved.
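
The effect can be illustrated with a small user-space model.  This is
only a sketch under stated assumptions, not the scoutfs code: the item
cache is modeled as a sorted array in which deleted extents linger as
tombstones, and the names struct ext, lower_bound() and move_all() are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cached extent item; 'deleted' marks a lingering tombstone. */
struct ext {
	uint64_t start;		/* first block of the extent */
	uint64_t len;		/* length in blocks */
	int deleted;
};

/* Index of the first item whose start is >= from (items sorted by start). */
static size_t lower_bound(const struct ext *items, size_t nr, uint64_t from)
{
	size_t lo = 0, hi = nr;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (items[mid].start < from)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

/*
 * "Move" every extent at or after region_start by marking it deleted.
 * Returns how many cached items were examined, tombstones included.
 * With advance == 0 every search restarts at region_start; with
 * advance != 0 the next search starts after the extent just moved.
 */
static uint64_t move_all(struct ext *items, size_t nr,
			 uint64_t region_start, int advance)
{
	uint64_t from = region_start;
	uint64_t visited = 0;

	for (;;) {
		size_t i = lower_bound(items, nr, from);

		/* walk forward past tombstones to the next live extent */
		while (i < nr) {
			visited++;
			if (!items[i].deleted)
				break;
			i++;
		}
		if (i == nr)
			break;

		assert(items[i].len > 0);
		items[i].deleted = 1;
		if (advance)
			from = items[i].start + items[i].len;
	}
	return visited;
}
```

With four single-block extents, restarting from the region start
examines 14 items in this model, while advancing the search offset
examines 4.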

Signed-off-by: Zach Brown <zab@versity.com>
2023-06-28 16:54:28 -07:00
Zach Brown
3d99fda0f6 Preallocate data around iblock when noncontig
If the _contig_only option isn't set then we try to preallocate aligned
regions of files.  The initial implementation naively only allowed one
preallocation attempt in each aligned region.  If it got a small
allocation that didn't fill the region then every future allocation
in the region would be a single block.

This changes every preallocation in the region to attempt to fill the
hole in the region that iblock fell in.  It uses an extra extent search
(item cache search) to try and avoid thousands of single block
allocations.

Signed-off-by: Zach Brown <zab@versity.com>
2023-06-28 12:21:25 -07:00
Zach Brown
acafb869e7 Avoid deadlock from block reclaim in rht resize
The RCU hash table uses deferred work to resize the hash table.  There's
a time during resize when hash table iteration will return EAGAIN until
resize makes more progress.  During this time resize can perform
GFP_KERNEL allocations.

Our shrinker tries to iterate over its RCU hash table to find blocks to
reclaim.  It tries to restart iteration if it gets EAGAIN on the
assumption that it will be usable again soon.

Combine the two and our shrinker can get stuck retrying iteration
indefinitely because it's shrinking on behalf of the hash table resizing
that is trying to allocate the next table before making iteration work
again.  We have to stop shrinking in this case so that the resizing
caller can proceed.
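
A minimal user-space sketch of the idea, assuming a hypothetical
walk_next() that stands in for hash table iteration and reports -EAGAIN
while a resize is pending.  The struct and function names are
illustrative and are not taken from the scoutfs source.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for a table walk that fails during resize. */
struct table_walk {
	int resize_pending;	/* nonzero: iteration returns -EAGAIN */
	int blocks_left;	/* reclaimable blocks still in the table */
};

/* Returns 1 after consuming a block, 0 at the end, -EAGAIN during resize. */
static int walk_next(struct table_walk *w)
{
	if (w->resize_pending)
		return -EAGAIN;
	if (w->blocks_left == 0)
		return 0;
	w->blocks_left--;
	return 1;
}

/*
 * Reclaim up to nr_to_scan blocks.  On -EAGAIN we stop instead of
 * retrying: the resize may be waiting on the very allocation that
 * triggered this reclaim, so retrying could spin forever.
 */
static int shrink_scan(struct table_walk *w, int nr_to_scan)
{
	int freed = 0;

	assert(nr_to_scan >= 0);
	while (freed < nr_to_scan) {
		int ret = walk_next(w);

		if (ret == -EAGAIN || ret == 0)
			break;
		freed++;
	}
	return freed;
}
```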

Signed-off-by: Zach Brown <zab@versity.com>
2023-06-15 14:45:26 -07:00
Zach Brown
707752a7bf Add get_referring_entries ioctl
Add an ioctl that gives the callers all entries that refer to an inode.
It's like a backwards readdir.  It's a light bit of translation between
the internal _add_next_linkrefs() list of entries and the ioctl
interface of a buffer of entry structs.

Signed-off-by: Zach Brown <zab@versity.com>
2023-06-14 14:12:10 -07:00
Zach Brown
0316c22026 Extend scoutfs_dir_add_next_linkrefs
Extend scoutfs_dir_add_next_linkref() to be able to return multiple
backrefs under the lock for each call and have it take an argument to
limit the number of backrefs that can be added and returned.

Its return code changes a bit in that it returns 1 on success instead of
0 so we have to be a little careful with callers who were expecting 0.
It still returns -ENOENT when no entries are found.

We break up its tracepoint into one that records each entry added and
one that records the result of each call.

This will be used by an ioctl to give callers just the entries that
point to an inode instead of assembling full paths from the root.

Signed-off-by: Zach Brown <zab@versity.com>
2023-06-14 14:12:10 -07:00
Zach Brown
2b72c57cb0 Fix crash in quorum_heartbeat_timeout_ms parsing
Mount option parsing runs early enough that the rest of the option
read/write serialization infrastructure isn't set up yet.  The
quorum_heartbeat_timeout_ms mount option tried to use a helper that
updated the stored option but it wasn't initialized yet so it crashed.

The helper was really only to have the option validity test in one
place.  It's reworked to only verify the option and the actual setting
is left to the callers.

Signed-off-by: Zach Brown <zab@versity.com>
2023-05-22 16:29:56 -07:00
Zach Brown
15de0c21c1 Have quorum drop messages on force unmount
Forced unmount is supposed to isolate the mount from the world.  The
net.c TCP messaging returns errors when sending during forced unmount.
The quorum code has its own UDP messaging and wasn't taking forced
unmount into account.

This led to quorum still being able to send resignation messages to
other quorum peers during forced unmount, making it hard to test
heartbeat timeouts with forced unmount.

The quorum messaging is already unreliable so we can easily make it drop
messages during forced unmount.  Now forced unmount more fully isolates
the quorum code and it becomes easier to test.

Signed-off-by: Zach Brown <zab@versity.com>
2023-05-18 10:01:19 -07:00
Zach Brown
7b65767803 Track and log quorum heartbeat delays
Add tracking and reporting of delays in sending or receiving quorum
heartbeat messages.  We measure the time between back to back sends or
receives of heartbeat messages.  We record these delays truncated down
to second granularity in the quorum sysfs status file.  We log messages
to the console for each longest measured delay up to the maximum
configurable heartbeat timeout.

Signed-off-by: Zach Brown <zab@versity.com>
2023-05-17 14:44:27 -07:00
Zach Brown
46640e4ff9 Add counter for quorum heartbeat send failures
Add a counter which tracks the number of heartbeat message send attempts
which fail.

Signed-off-by: Zach Brown <zab@versity.com>
2023-05-17 14:44:27 -07:00
Zach Brown
912906f050 Make quorum heartbeat timeout tunable
Add mount and sysfs options for changing the quorum heartbeat timeout.
This allows setting a longer delay in taking over for failed hosts that
has a greater chance of surviving temporary non-fatal delays.

We also double the existing default timeout to 10s which is still
reasonably responsive.

Signed-off-by: Zach Brown <zab@versity.com>
2023-05-17 14:44:27 -07:00
Zach Brown
ec02cf442b Use lower latency allocation in quorum socket
The quorum UDP socket allocation still allowed starting IO which can
trigger longer latencies trying to free memory.  We change the flags to
prefer dipping into emergency pools and then failing rather than
blocking trying to satisfy an allocation.  We'd much rather have a given
heartbeat attempt fail and have the opportunity to succeed at the next
interval rather than running the risk of blocking across multiple
intervals.

Signed-off-by: Zach Brown <zab@versity.com>
2023-05-17 14:44:27 -07:00
Zach Brown
0e9cd1eea5 Use specific work queue for quorum work
The quorum work was using the system workq.  While that's mostly fine,
we can create a dedicated workqueue with the specific flags that we
need.  The quorum work needs to run promptly to avoid fencing so we set
it to high priority.

Signed-off-by: Zach Brown <zab@versity.com>
2023-05-17 14:44:27 -07:00
Zach Brown
e18ea24561 Move quorum recv that sets timeout before check
In the quorum work loop some message receive actions extend the timeout
after the timeout expiration is checked.  This is usually fine when the
work runs soon after the messages are received and before the timeout
expires.  But under load the work might not schedule until long after
both the message has been received and the timeout has expired.

If the message was a heartbeat message then the wakeup delay would be
mistaken for lack of activity on the server and it would try to take
over for an otherwise active server.

This moves the extension of the heartbeat on message receive to before
the timeout is checked.  In our case of a delayed heartbeat message it
would still find it in the recv queue and extend the timeout, avoiding
fencing an active server.
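
The ordering problem can be sketched in user space as follows.  This is
an illustrative model only: struct quorum_state, recv_heartbeats() and
work_pass() are hypothetical names, times are plain integers, and the
timeout is extended from the time the work runs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct quorum_state {
	uint64_t timeout_at;	/* time at which the heartbeat timeout expires */
	uint64_t timeout_len;	/* how far each heartbeat extends the timeout */
	bool heartbeat_queued;	/* a heartbeat is waiting in the recv queue */
};

/* Consume a queued heartbeat, if any, and extend the timeout. */
static void recv_heartbeats(struct quorum_state *qs, uint64_t now)
{
	if (qs->heartbeat_queued) {
		qs->heartbeat_queued = false;
		qs->timeout_at = now + qs->timeout_len;
	}
}

/*
 * One pass of the work loop.  Returns true if the pass concludes that
 * the timeout expired.  recv_first selects the ordering: false checks
 * the timeout before receiving, true receives before checking.
 */
static bool work_pass(struct quorum_state *qs, uint64_t now, bool recv_first)
{
	bool expired;

	assert(qs->timeout_len > 0);
	if (recv_first)
		recv_heartbeats(qs, now);
	expired = now >= qs->timeout_at;
	if (!recv_first)
		recv_heartbeats(qs, now);
	return expired;
}
```

If the work runs at time 25 with a timeout that expired at 10 and a
heartbeat already queued, checking first reports an expiry while
receiving first does not.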

Signed-off-by: Zach Brown <zab@versity.com>
2023-05-17 09:56:53 -07:00
Zach Brown
bb01a3990f Set sb->s_time_gran to support nsecs
We missed initializing sb->s_time_gran which controls how some parts of
the kernel truncate the granularity of nsec in timespec.  Some paths
don't use it at all so time would be maintained at full precision.  But
other paths, particularly setattr_copy() from userspace and
notify_change() from the kernel use it to truncate as times are set.

Setting s_time_gran to 1 maintains full nsec precision.
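
As a rough illustration of what granularity-based truncation does to
the nanosecond field, here is a user-space sketch.  struct ts and
truncate_time() are hypothetical names and only approximate the
kernel's behaviour.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NSEC_PER_SEC 1000000000L

struct ts {
	int64_t sec;
	long nsec;
};

/* Round nsec down to a multiple of gran, where gran is in nanoseconds. */
static struct ts truncate_time(struct ts t, long gran)
{
	assert(gran >= 1 && gran <= SKETCH_NSEC_PER_SEC);
	t.nsec -= t.nsec % gran;
	return t;
}
```

A granularity of 1 leaves every nanosecond value unchanged, while a
granularity of one second discards the nanosecond field entirely.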

Signed-off-by: Zach Brown <zab@versity.com>
2023-03-24 10:50:34 -07:00
Zach Brown
a61b8d9961 Fix renaming into root directory
The VFS performs a lot of checks on renames before calling the fs
method.  We acquire locks and refresh inodes in the rename method so we
have to duplicate a lot of the vfs checks.

One of the checks involves loops with ancestors and subdirectories.  We
missed the case where the root directory is the destination and doesn't
have any parent directories.  The backref walker it calls returns
-ENOENT instead of 0 with an empty set of parents and that error bubbled
up to rename.

The fix is to notice when we're asking for ancestors of the one
directory that can't have ancestors and short circuit the test.

Signed-off-by: Zach Brown <zab@versity.com>
2023-03-08 11:00:59 -08:00
Zach Brown
2e2ccb6f61 Allow replaying srch file rotation
When a client no longer needs to append to a srch file, for whatever
reason, we move the reference from the log_trees item into a specific
srch file btree item in the server's srch file tracking btree.

Zeroing the log_trees item and inserting the server's btree item are
done in a server commit and should be written atomically.

But commit_log_trees had an error handling case that could leave the
newly inserted item dirty in memory without zeroing the srch file
reference in the existing log_trees item.  Future attempts to rotate the
file reference, perhaps by retrying the commit or by reclaiming the
client's rid, would get EEXIST and fail.

This fixes the error handling path to ensure that we'll keep the dirty
srch file btree and log_trees item in sync.  The desynced items can
still exist in the world so we'll tolerate getting EEXIST on insertion.
After enough time has passed, or if repair zeroed the duplicate
reference, we could remove this special case from insertion.

Signed-off-by: Zach Brown <zab@versity.com>
2023-01-17 14:33:27 -08:00
Zach Brown
01c8bba56d Merge pull request #109 from versity/zab/server_statfs_stable_blocks
Zab/server statfs stable blocks
2023-01-12 09:58:48 -08:00
Zach Brown
17cb1fe84b Merge pull request #110 from versity/zab/partial_alloc_move
Allow partial extent motion
2023-01-12 09:58:12 -08:00
Zach Brown
a23e7478a0 Fix move_blocks loop exit conditions
The move_blocks ioctl intends to only move extents whose bytes fall
inside i_size.  This is easy except for a final extent that straddles an
i_size that isn't aligned to 4K data blocks.

The code that either checked for an extent being entirely past i_size or
for limiting the number of blocks to move by i_size clumsily compared
i_size offsets in bytes with extent counts in 4KB blocks.  In just the
right circumstances, probably with the help of a byte length to move
that is much larger than i_size, the length calculation could result in
trying to move 0 blocks.  Once this hit the loop would keep finding that
extent and calculating 0 blocks to move and would be stuck.

We fix this by clamping the count of blocks in extents to move in terms
of byte offsets at the start of the loop.  This gets rid of the extra
size checks and byte offset use in the loop.  We also add a sanity check
to make sure that we can't get stuck if, say, corruption resulted in an
otherwise impossible zero length extent.
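
One way to do the clamping entirely in byte offsets before converting
to blocks is sketched below.  This is an assumption about the shape of
the calculation, not the actual patch: blocks_to_move() is a
hypothetical helper, the starting offset is assumed to be block
aligned, and a final block that straddles i_size is counted.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_BLOCK_SHIFT 12			/* 4KB data blocks */
#define SKETCH_BLOCK_SIZE (1ULL << SKETCH_BLOCK_SHIFT)

/*
 * Number of blocks to move for a request of 'len' bytes starting at the
 * block aligned byte offset 'off', never reaching past the block that
 * holds the last byte inside i_size.  Assumes i_size is far enough
 * below 2^64 that rounding up cannot overflow.
 */
static uint64_t blocks_to_move(uint64_t off, uint64_t len, uint64_t i_size)
{
	uint64_t end;

	assert((off & (SKETCH_BLOCK_SIZE - 1)) == 0);
	if (off >= i_size || len == 0)
		return 0;

	/* clamp the byte range to i_size without overflowing off + len */
	end = (len > i_size - off) ? i_size : off + len;

	/* round the clamped end up so a partial final block is counted */
	return ((end + SKETCH_BLOCK_SIZE - 1) >> SKETCH_BLOCK_SHIFT) -
	       (off >> SKETCH_BLOCK_SHIFT);
}
```

Because the range is clamped before the conversion, a byte length much
larger than i_size yields either a positive block count or an explicit
0 that the caller can treat as the end of the loop.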

Signed-off-by: Zach Brown <zab@versity.com>
2023-01-10 09:34:52 -08:00
Zach Brown
7c2d83e2f8 Remove saved super block in scoutfs_sb_info
Now that we've removed its users we can remove the global saved copy of
the super block from scoutfs_sb_info.

Signed-off-by: Zach Brown <zab@versity.com>
2023-01-06 11:15:45 -08:00
Zach Brown
40aa47c888 Have the server keep a private dirty super block
As the server does its work its transactions modify a dirty super block
in memory.  This used the global super block in scoutfs_sb_info which
was visible to everything, including the client.  Move the dirty super
block over to the private server info so that only the server can see
it.

This is mostly boring storage motion but we do change the quorum code
to hand the server a static copy of the quorum config to use as it
starts up before it reads the most recent super block.

Signed-off-by: Zach Brown <zab@versity.com>
2023-01-06 11:15:45 -08:00
Zach Brown
c1bd7bcce5 Allow partial extent motion
Refilling a client's data_avail is the only alloc_move call that doesn't
try and limit the number of blocks that it dirties.  If it doesn't find
sufficiently large extents it can exhaust the server's alloc budget
without hitting the target.  It'll try to dirty blocks and return a hard
error.

This changes that behaviour to allow returning 0 if it moved any
extents.  Other callers can deal with partial progress as they already
limit the blocks they dirty.  This will also return ENOSPC if it hadn't
moved anything just as the current code would.

The result is that data fill will not necessarily hit the target.  It
might take multiple commits to fill the data_avail btree.
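
The return convention described above can be sketched like this.  It
is a simplified user-space model, not the scoutfs allocator:
alloc_move_partial() is a hypothetical name and each extent is assumed
to cost one unit of the dirtying budget.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Move blocks from the given extents toward 'target' while spending at
 * most 'budget' units, one per extent touched.  Returns 0 if anything
 * was moved, even short of the target, and -ENOSPC if nothing moved.
 */
static int alloc_move_partial(const uint64_t *ext_lens, int nr,
			      uint64_t target, int budget, uint64_t *moved)
{
	assert(moved != NULL);
	*moved = 0;

	for (int i = 0; i < nr && *moved < target && budget > 0; i++, budget--) {
		uint64_t want = target - *moved;

		/* take the whole extent or only what is still needed */
		*moved += ext_lens[i] < want ? ext_lens[i] : want;
	}

	return *moved ? 0 : -ENOSPC;
}
```

A caller that gets 0 with fewer blocks than it asked for simply tries
again in a later commit.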

Signed-off-by: Zach Brown <zab@versity.com>
2022-12-15 20:47:41 -08:00
Zach Brown
7720222588 Have statfs use unlocked stable roots
The server's statfs request handler was intending to lock dirty
structures as they were walked to get sums used for statfs fields.
Other callers walk stable structures, though, so the summation calls had
grown iteration over other structures that the server didn't know it had
to lock.

This meant that the server was walking unlocked dirty structures as they
were being modified.  The races are very tight, but it can result in
request handling errors that shut down connections and IO errors from
trying to read inconsistent refs as they were modified by the locked
writer.

We've built up infrastructure so the server can now walk stable
structures just like the other callers.  It will no longer wander into
dirty blocks so it doesn't need to lock them and it will retry if its
walk of stale data crosses a broken reference.

Signed-off-by: Zach Brown <zab@versity.com>
2022-12-12 14:59:22 -08:00
Zach Brown
fff07ce19c Use stale block read retrying helper
Transition from manual checking for persistent ESTALE to the shared
helper that we just added.  This should not change behavior.

Signed-off-by: Zach Brown <zab@versity.com>
2022-12-12 14:59:22 -08:00
Zach Brown
464de56d28 Add stale block read retrying helper
Many readers had little implementations of the logic to decide to retry
stale reads with different refs or decide that they're persistent and
return hard errors.  Let's move that into a small helper.

Signed-off-by: Zach Brown <zab@versity.com>
2022-12-12 14:59:22 -08:00
Zach Brown
342c206550 Have scoutfs_forest_inode_count return stale reads
scoutfs_forest_inode_count() assumed it was called with stable refs and
would always translate ESTALE to EIO.  Change it so that it passes
ESTALE to the caller who is responsible for handling it.

The server will use this to retry reading from stable supers that it's
storing in memory.

Signed-off-by: Zach Brown <zab@versity.com>
2022-12-12 14:59:22 -08:00
Zach Brown
fe4734d019 Save a full stable super in the server
The server has a mechanism for tracking the last stable roots used by
network rpcs.  We expand it a bit to include the entire super so
that we can add users in the server which want the last full stable
super.  We can still use the stable super to give out the stable
roots.

Signed-off-by: Zach Brown <zab@versity.com>
2022-12-12 14:59:22 -08:00
Zach Brown
b1a43bb312 Make quorum config use more precise
The quorum code was using the copy of the super block in the sb info for
its config.  With that going away we make different users more carefully
reference the config.  The quorum agent has a copy that it reads on
setup, the client rarely reads a copy when trying to connect, and the
server uses its super.

This is about data access isolation and should have no functional effect
other than to cause more super reads.

Signed-off-by: Zach Brown <zab@versity.com>
2022-12-12 14:59:22 -08:00
Zach Brown
929703213f Add fsid sbi field
A few paths throughout the code get the fsid for the current mount by
using the copy of the super block that we store in the scoutfs_sb_info
for the mount.  We'd like to remove the super block from the sbi and
it's cleaner to have a specific constant field for the fsid of the mount
which will not change.

Signed-off-by: Zach Brown <zab@versity.com>
2022-12-12 14:59:22 -08:00
Zach Brown
8e067b3d3f Truncate dirties zero tail extension
When we truncate away from a partial block we need to zero its tail that
was past i_size and dirty it so that it's written.

We missed the typical vfs boilerplate of calling block_truncate_page
from setattr->set_size that does this.  We need to be a little careful
to pass our file lock down to get_block and then queue the inode for
writeback so it's written out with the transaction.  This follows the
pattern in .write_end.

Signed-off-by: Zach Brown <zab@versity.com>
2022-12-06 10:31:31 -08:00
Zach Brown
276fbebdac Avoid dput in lock invalidation
The d_prune_aliases in lock invalidation was thought to be safe because
the caller had an inode reference, surely it can't get into iput_final.

I missed the fundamental dcache pattern that dput can ascend through
parents and end up in inode eviction for entirely unrelated inodes.
It's very easy for this to deadlock.  Imagine, if nothing else, that the
lock that inode invalidation is blocked on in
dput->iput->evict->delete->lock is itself in the list of locks to
invalidate in the caller.

We fix this by always kicking off d_prune and dput into async work.
This increases the chance that inodes will still be referenced after
invalidation and prevent inline deletion.  More deletions can be
deferred until the orphan scanner finds them.  It should be rare,
though.  We're still likely to put and drop invalidated inodes before a
writer gets around to removing the final unlink and asking us for the
omap that describes our cached inodes.

To perform the d_prune in work we make it a behavioural flag and make
our queued iputs a little more robust.   We use much safer and
understandable locking to cover the count and the new flags and we put
the work in re-entrant work in their own workqueue instead of one work
instance in the system_wq.

Signed-off-by: Zach Brown <zab@versity.com>
2022-12-02 12:28:13 -08:00
Zach Brown
71ed4512dc Include primary lock write_seq for write_only vers
FS items are deleted by logging a deletion item that has a greater item
version than the item to delete.  The versions are usually maintained by
the write_seq of the exclusive write lock that protects the item.  Any
newer write hold will have a greater version than all previous write
holds so any items created under the lock will have a greater vers than
all previous items under the lock.  All deletion items will be merged
with the older item and both will be dropped.

This doesn't work for concurrent write-only locks.  The write-only locks
match with each other so their write_seqs are assigned in the order
that they are granted.  That grant order can be mismatched with item
creation order.  We can get deletion items with lesser versions than the
item to delete because of when each creation's write-only lock was
granted.

Write only locks are used to maintain consistency between concurrent
writers and readers, not between writers.  Consistency between writers
is done with another primary write lock.  For example, if you're writing
seq items to a write-only region you need to have the write lock on the
inode for the specific seq item you're writing.

The fix, then, is to pass these primary write locks down to the item
cache so that it can choose an item version that is the greatest amongst
the transaction, the write-only lock, and the primary lock.  This now
ensures that the primary lock's increasing write_seq makes it down to
the item, bringing item version ordering in line with exclusive holds of
the primary lock.

All of this to fix concurrent inode updates sometimes leaving behind
duplicate meta_seq items because old seq item deletions ended up with
older versions than the seq item they tried to delete, nullifying the
deletion.
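
The version choice can be sketched as a simple maximum.  item_vers()
and its parameter names are hypothetical, and a primary write_seq of 0
stands for "no primary lock was passed down".

```c
#include <assert.h>
#include <stdint.h>

static uint64_t max_u64(uint64_t a, uint64_t b)
{
	return a > b ? a : b;
}

/*
 * Pick an item version that is the greatest amongst the transaction
 * seq, the write-only lock's write_seq, and the primary lock's
 * write_seq.
 */
static uint64_t item_vers(uint64_t trans_seq, uint64_t wo_write_seq,
			  uint64_t primary_write_seq)
{
	return max_u64(trans_seq, max_u64(wo_write_seq, primary_write_seq));
}
```

Suppose a creation holds the primary lock at write_seq 10 and a later
deletion holds it at 11, but their write-only locks were granted in the
opposite order with write_seqs 6 and 5.  Without the primary lock the
deletion gets the lesser version (5 against 6).  With it the deletion
gets 11 against the creation's 10.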

Signed-off-by: Zach Brown <zab@versity.com>
2022-11-15 13:26:32 -08:00
Zach Brown
aed4313995 Simplify dentry verification
Now that we've removed the hash and pos from the dentry_info struct we
can do without it.  We can store the refresh gen in the d_fsdata pointer
(sorry, 64bit only for now.. could allocate if we needed to.)  This gets
rid of the lock coverage spinlocks and puts a bit more pressure on lock
lookup, which we already know we have to make more efficient.  We can
get rid of all the dentry info allocation calls.

Now that we're not setting d_op as we allocate d_fsdata we put the ops
on the super block so that we get d_revalidate called on all our
dentries.

We also are a bit more precise about the errors we can return from
verification.  If the target of a dentry link changes then we return
-ESTALE rather than silently performing the caller's operation on
another inode.

Signed-off-by: Zach Brown <zab@versity.com>
2022-10-27 14:32:06 -07:00
Zach Brown
61d86f7718 Add scoutfs_lock_ino_refresh_gen
Add a lock call to get the current refresh_gen of a held lock.   If the
lock doesn't exist or isn't readable then we return 0.  This can be used
to track lock coverage of structures without the overhead and lifetime
binding of the lock coverage struct.

Signed-off-by: Zach Brown <zab@versity.com>
2022-10-27 14:16:07 -07:00
Zach Brown
717b56698a Remove __exit from scoutfs_sysfs_exit()
scoutfs_sysfs_exit() is called during error handling in module init.
When scoutfs is built-in (so, never.) the __exit section won't be
loaded.  Remove the __exit annotation so it's always available to be
called.

Signed-off-by: Zach Brown <zab@versity.com>
2022-10-26 16:42:27 -07:00
Zach Brown
c92a7ff705 Don't use dentry private hash/pos for deletion
The dentry cache life cycles are far too crazy to rely on d_fsdata being
kept in sync with the rest of the dentry fields.  Callers can do all
sorts of crazy things with dentries.  Only unlink and rename need these
fields and those operations are already so expensive that item lookups
to get the current actual hash and pos are lost in the noise.

Signed-off-by: Zach Brown <zab@versity.com>
2022-10-26 16:42:26 -07:00
Zach Brown
ef2daf8857 Make data preallocation tunable
Make mount options for the size of preallocation and whether or not it
should be restricted to extending writes.  Disabling the default
restriction to streaming writes lets it preallocate in aligned regions
of the preallocation size when they contain no extents.

Signed-off-by: Zach Brown <zab@versity.com>
2022-10-14 14:03:35 -07:00
Zach Brown
ddc5d9f04d Allow setting orphan_scan_delay_ms option
The orphan_scan_delay_ms option setting code mistakenly set the default
before testing the option for -1 (not the default) to discover if
multiple options had been set.  This made any attempt to set fail.

Initialize the option to -1 so the first set succeeds and apply the
default if we don't set the value.
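
The sentinel pattern can be sketched as follows.  The struct, the
function names, and the default value are hypothetical placeholders
rather than the real option parsing code.

```c
#include <assert.h>
#include <errno.h>

#define SKETCH_DEFAULT_DELAY_MS 10000	/* hypothetical default */

struct sketch_opts {
	int orphan_scan_delay_ms;	/* -1 means "not set yet" */
};

/* Start with the sentinel so the first explicit set can succeed. */
static void opts_init(struct sketch_opts *o)
{
	o->orphan_scan_delay_ms = -1;
}

/* Reject negative values and repeated settings of the same option. */
static int opts_set_delay(struct sketch_opts *o, int ms)
{
	if (ms < 0)
		return -EINVAL;
	if (o->orphan_scan_delay_ms != -1)
		return -EINVAL;
	o->orphan_scan_delay_ms = ms;
	return 0;
}

/* After parsing, apply the default only if nothing was set. */
static void opts_finish(struct sketch_opts *o)
{
	assert(o->orphan_scan_delay_ms >= -1);
	if (o->orphan_scan_delay_ms == -1)
		o->orphan_scan_delay_ms = SKETCH_DEFAULT_DELAY_MS;
}
```

Applying the default before parsing, as the old code did, makes the
"already set" test fire on the very first explicit setting.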

Signed-off-by: Zach Brown <zab@versity.com>
2022-09-28 10:36:10 -07:00
Zach Brown
433a80c6fc Add compat for changing posix_acl_valid arguments
Signed-off-by: Zach Brown <zab@versity.com>
2022-09-28 10:36:10 -07:00
Zach Brown
29538a9f45 Add POSIX ACL support
Add support for the POSIX ACLs as described in acl(5).  Support is
enabled by default and can be explicitly enabled or disabled with the
acl or noacl mount options, respectively.

Signed-off-by: Zach Brown <zab@versity.com>
2022-09-28 10:36:10 -07:00
Zach Brown
1826048ca3 Add _locked xattr get and set calls
The upcoming acl support wants to be able to get and set xattrs from
callers who already have cluster locks and transactions.   We refactor
the existing xattr get and set calls into locked and unlocked variants.

It's mostly boring code motion with the unfortunate situation that the
caller needs to acquire the totl cluster lock before holding a
transaction before calling into the xattr code.   We push the parsing of
the tags to the caller of the locked get and set so that they can know
to acquire the right lock.  (The acl callers will never be setting
scoutfs. prefixed xattrs so they will never have tags.)

Signed-off-by: Zach Brown <zab@versity.com>
2022-09-28 10:11:24 -07:00
Zach Brown
798fbb793e Move to xattr_handler xattr prefix dispatch
Move to the use of the array of xattr_handler structs on the super to
dispatch set and get from generic_ based on the xattr prefix.   This
will make it easier to add handling of the pseudo system. ACL xattrs.

Signed-off-by: Zach Brown <zab@versity.com>
2022-09-21 14:24:52 -07:00
Zach Brown
1cbc927ccb Only clear trying inode deletion bit when set
try_delete_inode_items() is responsible for making sure that it's safe
to delete an inode's persistent items.  One of the things it has to
check is that there isn't another deletion attempt on the inode in this
mount.  It sets a bit in lock data while it's working and backs off if
the bit is already set.

Unfortunately it was always clearing this bit as it exited, regardless
of whether it set it or not.  This would let the next attempt perform
the deletion again before the working task had finished.  This was often
not a problem because background orphan scanning is the only source of
regular concurrent deletion attempts.

But it's a big problem if a deletion attempt takes a very long time.  It
gives enough time for an orphan scan attempt to clear the bit then try
again and clobber whoever is performing the very slow deletion.

I hit this in a test that built files with an absurd number of
fragmented extents.  The second concurrent orphan attempt was able to
proceed with deletion and performed a bunch of duplicate data extent
frees and caused corruption.

The fix is to only clear the bit if we set it.  Now all concurrent
attempts will back off until the first task is done.
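
A single-threaded sketch of the corrected pattern follows.  The names
are hypothetical, and real code would need an atomic test-and-set on
the bit rather than the plain bool used here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stands in for the "trying deletion" bit kept in lock data. */
struct sketch_lock_data {
	bool trying;
};

/*
 * Returns true if this caller performed the deletion and false if it
 * backed off because another attempt was already in progress.
 */
static bool try_delete(struct sketch_lock_data *ld, int *deletions)
{
	bool set_by_us = false;
	bool deleted = false;

	assert(ld != NULL && deletions != NULL);

	if (ld->trying)
		goto out;		/* someone else is working: back off */
	ld->trying = true;
	set_by_us = true;

	(*deletions)++;			/* the potentially slow deletion */
	deleted = true;
out:
	if (set_by_us)			/* never clear a bit we didn't set */
		ld->trying = false;
	return deleted;
}
```

The buggy version cleared the bit unconditionally at out, so a caller
that backed off still reopened the door for the next attempt.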

Signed-off-by: Zach Brown <zab@versity.com>
2022-07-29 11:25:01 -07:00
Zach Brown
233fbb39f3 Limit alloc_move per-call allocator consumption
Recently scoutfs_alloc_move() was changed to try and limit the amount of
metadata blocks it could allocate or free.  The intent was to stop
concurrent holders of a transaction from fully consuming the available
allocator for the transaction.

The limiting logic was a bit off.  It stopped when the allocator had the
caller's limit remaining, not when it had consumed the caller's limit.
This is overly permissive and could still allow concurrent callers to
consume the allocator.  It was also triggering warning messages when a
call consumed more than its allowed budget while holding a transaction.

Unfortunately, we don't have per-caller tracking of allocator resource
consumption.  The best we can do is sample the allocators as we start
and return if they drop by the caller's limit.  This is overly
conservative in that it accounts any consumption during concurrent
callers to all callers.

This isn't perfect but it makes the failure case less likely and the
impact shouldn't be significant.  We don't often have a lot of
concurrency and the limits are larger than callers will typically
consume.
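
The difference between the two checks can be sketched as below.  Both
helpers are hypothetical and look only at a single "avail" count,
whereas the real allocator tracks both avail and freed resources.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Earlier, overly permissive test: stop only once 'limit' remains. */
static bool low_remaining(uint64_t avail_now, uint64_t limit)
{
	return avail_now <= limit;
}

/*
 * Sampled test: stop once avail has dropped by at least 'limit' since
 * the snapshot taken when the call started.  Consumption by concurrent
 * callers is charged to everyone, which is conservative.
 */
static bool low_since(uint64_t avail_start, uint64_t avail_now,
		      uint64_t limit)
{
	assert(limit > 0);
	if (avail_now >= avail_start)
		return false;
	return avail_start - avail_now >= limit;
}
```

With a limit of 16 and an allocator that fell from 1000 to 100 blocks,
the first test still allows more consumption while the second stops the
caller.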

Signed-off-by: Zach Brown <zab@versity.com>
2022-07-29 11:25:01 -07:00
Zach Brown
198d3cda32 Add scoutfs_alloc_meta_low_since()
Add scoutfs_alloc_meta_low_since() to test if the metadata avail or
freed resources have been used by a given amount since a previous
snapshot.

Signed-off-by: Zach Brown <zab@versity.com>
2022-07-29 11:24:10 -07:00
Zach Brown
e8c64b4217 Move freed data extents in multiple server commits
As _get_log_trees() in the server prepares the log_trees item for the
client's commit, it moves all the freed data extents from the log_trees
item into core data extent allocator btree items.  If the freed blocks
are very fragmented then it can exceed a commit's metadata allocation
budget trying to dirty blocks in the free data extent btree.

The fix is to move the freed data extents in multiple commits.  First we
move a limited number in the main commit that does all the rest of the
work preparing the commit.  Then we try to move the remaining freed
extents in multiple additional commits.

Signed-off-by: Zach Brown <zab@versity.com>
2022-07-28 11:42:33 -07:00
Zach Brown
ba9a106f72 Free send attempts to disconnected clients
Callers who send to specific client connections can get -ENOTCONN if
their client has gone away.   We forgot to free the send tracking struct
in that case.

Signed-off-by: Zach Brown <zab@versity.com>
2022-07-06 15:16:20 -07:00
Zach Brown
310725eb72 Free omap rid list as server exits
The omap code keeps track of rids that are connected to the server.  It
only freed the tracked rids as the server told it that rids were being
removed.   But that removal only happened as clients were evicted.  If
the server shut down it'd leave the old rid entries around.   They'd be
leaked as the mount was unmounted and could linger and create duplicate
entries if the server started back up and the same clients reconnected.

The fix is to free the tracking rids as the server shuts down.   They'll
be rebuilt as clients reconnect if the server restarts.

Signed-off-by: Zach Brown <zab@versity.com>
2022-07-06 15:16:19 -07:00