Compare commits

..

35 Commits

Author SHA1 Message Date
Zach Brown
e42e806554 Add cond_resched to iput worker
The iput worker can accumulate quite a bit of pending work to do.  We've
seen hung task warnings while it's doing its work (admitedly in debug
kernels).  There's no harm in throwing in a cond_resched so other tasks
get a chance to do work.

Signed-off-by: Zach Brown <zab@versity.com>
2025-10-01 12:10:26 -07:00
Chris Kirby
406ef631b6 Add tracing for get_file_block() and scoutfs_ioc_search_xattrs().
Signed-off-by: Chris Kirby <ckirby@versity.com>
2025-10-01 12:10:26 -07:00
Chris Kirby
0aaef899ca Fix several cases in srch.c where the return value of EIO should have been -EIO.
Signed-off-by: Chris Kirby <ckirby@versity.com>
2025-10-01 12:10:26 -07:00
Chris Kirby
70cf41c1ed Add the inode number to scoutfs_xattr_set traces.
Signed-off-by: Chris Kirby <ckirby@versity.com>
2025-10-01 12:10:26 -07:00
Chris Kirby
4f6a867083 Use log merge counters to fix the orphan-inodes test
log_merges_started is incremented in finalize_and_start_log_merge()
after we decide that we're going to do a merge; and in
reclaim_open_log_tree().

log_merges_completed is incremented when we've successfully completed
a merge (no errors and we did some work).

The orphan-inodes test uses these to make sure the merge work from
unlinks has completed before checking to see if the orphan scans
have run.

Signed-off-by: Chris Kirby <ckirby@versity.com>
2025-10-01 12:10:26 -07:00
Chris Kirby
4603d0f771 Suppress -EIO error messages during forced unmount
The quorum-heartbeat-timeout test was failing due to messages like
"error -5 getting log trees for rid". These are expected, since we
intentionally return -EIO during forced unmount.

Add scoutfs_err_maybe() which will suppress those and allow through
all errors during non-forced unmount, as well as non -EIO forced
unmount errors.

Signed-off-by: Chris Kirby <ckirby@versity.com>
2025-10-01 12:10:26 -07:00
Chris Kirby
69a08491a7 Only start new quorum election after a receive failure
It's possible for the quorum worker to be preempted for a long period,
especially on debug kernels. Since we only check for how much time
has passed, it's possible for a clean receive to inadvertently
trigger an election. This can cause the quorum-heartbeat-timeout
test to fail due to observed delays outside of the expected bounds.

Instead, make sure we had a receive failure before comparing timestamps.

Signed-off-by: Chris Kirby <ckirby@versity.com>
2025-10-01 12:10:26 -07:00
Chris Kirby
710c888913 Close window where we can lose search items
In finalize_and_start_log_merge(), we overwrite the server
mount's log tree with its finalized form and then later write out
its next open log tree. This leaves a window where the mount's
srch_file is nulled out, causing us to lose any search items in
that log tree.

This shows up as intermittent failures in the srch-basic-functionality
test.

Eliminate this timing window by doing what unmount/reclaim does when
it finalizes, by moving the resources from the item that we finalize
into server trees/items as it finalizes. Then there is no window
where those resources exist only in memory until we create another
transaction.

Signed-off-by: Chris Kirby <ckirby@versity.com>
2025-10-01 12:10:26 -07:00
Auke Kok
6f8a81d2ef Avoid trigger munching of block_remove_stale trigger.
It's entirely likely that the trigger here is munched by a read on a
dirty block from any unrelated or background read. Avoid that by putting
the trigger at the end of the condition list.

Now that the order is swapped, we have to avoid a null deref in
block_is_dirty(bp) here, as well.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-10-01 12:10:26 -07:00
Auke Kok
7315cdcfb9 Fully wait for orphan inode scan to complete.
The issue with the previous attempt to fix the orphan-inodes test was
that we would regularly exceed the 120s timeout value put in there.

Instead, in this commit, we change the code to add a new counter to
indicate orphan deletion progress. When orphan inodes are deleted, the
increment of this counter indicates progress happened. Inversely,
every time the counter doesn't increment, and the orphan scan attempts
counter increments, we know that there was no more work to be done.

For safety, we wait until 2 consecutive scan attempts were made without
forward progress in the test case.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-10-01 12:10:26 -07:00
Auke Kok
141b8bd028 Revert "Extend orphan-inodes timeout."
This reverts commit 138c7c6b49.

The timeout value here is still exceeded by CI test jobs, and thus
causing the test to fail.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-10-01 12:10:26 -07:00
Chris Kirby
5cba43559f Fix race in offline-extent-waiting test
Before comparing file contents, wait for the background dd to complete.
Also fix a typo.

Signed-off-by: Chris Kirby <ckirby@versity.com>
2025-10-01 12:10:26 -07:00
Chris Kirby
f314b9a0fd Remove hung task workaround from large-fragmented-free test
Adjusting hung_task_timeout_secs is still needed for this test to pass
with a debug kernel. But the logic belongs on the platform side.

Signed-off-by: Chris Kirby <ckirby@versity.com>
2025-10-01 12:10:26 -07:00
Chris Kirby
980e1db231 Fix commit budget calculation with multiple holders
The try_drain_data_freed() path was generating errors about overrunning
its commit budget:

scoutfs f.2b8928.r.02689f error: 1 holders exceeded alloc budget av: bef 8185 now 8036, fr: bef 8185 now 7602

The budget overrun check was using the current number of commit holders
(in this case one) instead of the the maximum number of concurrent holders
(in this case two). So even well behaved paths like try_drain_data_freed()
can appear to exceed their commit budget if other holders dirty some blocks
and apply their commits before the try_drain_data_freed() thread does its
final budget reconciliation.

Signed-off-by: Chris Kirby <ckirby@versity.com>
2025-10-01 12:10:26 -07:00
Chris Kirby
585516a2a3 Fix dirtied block calculation in extent_mod_blocks()
Free extents are stored in two btrees: one sorted by block number, one
by size. So if you insert a new extent between two existing extents, you can
be modifying two items in the by-block-number tree. And depending on the size
of those items, that can result in three items over in the -by-size tree.
So that's a 5x multiplier per level.

If we're shrinking the tree and adding more freed blocks, we're conceptually
dirtying two blocks at each level to merge. (current *2 in the code).
But if they fall under the low water mark then one of them is freed, so we
can have *3 per level in this case.

Signed-off-by: Chris Kirby <ckirby@versity.com>
2025-10-01 12:10:26 -07:00
Auke Kok
53616762a3 Ignore sparse error about stat.h on el8.
On el8, sparse is at 0.6.4 in epel-release, but it fails with:
```
[SP src/util.c]
src/util.c: note: in included file (through /usr/include/sys/stat.h):
/usr/include/bits/statx.h:30:6: error: not a function <noident>
/usr/include/bits/statx.h:30:6: error: bad constant expression type
```

This is due to us needing O_DIRECT from <fcntl.h>, so we set _GNU_SOURCE
before including it, but this causes (through _USE_GNU in sys/stat.h)
statx.h to be included, and that has __has_include, and sparse is too
dumb to understand it.

Just shut it up.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-10-01 12:10:26 -07:00
Auke Kok
ccb340d841 Don't run format-version-forward-back on el8, either
This test compiles an earlier commit from the tree that is starting to
fail due to various changes on the OS level, most recently due to sparse
issues with newer kernel headers. This problem will likely increase
in the future as we add more supported releases.

We opt to just only run this test on el7 for now. While we could have
made this skip sparse checks that fail it on el8, it will suffice at
this point if this just works on one of the supported OS versions
during testing.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-10-01 12:06:38 -07:00
Chris Kirby
5aed6159f7 Don't overrun the block budget in server_log_merge_free_work().
This fixes a potential fence post failure like the following:

error: 1 holders exceeded alloc budget av: bef 7407 now 7392, fr: bef 8185 now 7672

The code is only accounting for the freed btree blocks, not the dirtying of
other items. So it's possible to be at exactly (COMMIT_HOLD_ALLOC_BUDGET / 2),
dirty some log btree blocks, loop again, then consume another
(COMMIT_HOLD_ALLOC_BUDGET / 2) and blow past the total budget.

In this example, we went over by 13 blocks.

By only consuming up to 1/8 of the budget on each loop, and committing when we
have consumed 3/4 of the budget, we can avoid the fence post condition.

Signed-off-by: Chris Kirby <ckirby@versity.com>
2025-06-16 08:36:35 -05:00
Zach Brown
9741d40e10 Merge pull request #229 from versity/zab/v1.25
v1.25 Release
2025-06-04 11:21:25 -07:00
Zach Brown
48ac7bdf7c v1.25 Release
Finish the release notes for the 1.25 release.

Signed-off-by: Zach Brown <zab@versity.com>
2025-06-03 13:35:42 -07:00
Zach Brown
7865ee9f54 Merge pull request #223 from versity/auke/el9_5_wmaybe-uninit
Fix -Wmaybe-uninitalized since rhel9.5
2025-05-12 12:21:02 -07:00
Zach Brown
624eb128c6 Merge pull request #221 from versity/auke/enospc-test
Give enospc test more time to commit unlink.
2025-05-09 11:27:04 -07:00
Zach Brown
091eb3b683 Merge pull request #219 from versity/auke/fix-tests-failing-dirty-test-dirs
Fix test cases that don't run cleanly in a semi-dirty env.
2025-05-09 11:17:24 -07:00
Zach Brown
04e8cc6295 Merge pull request #220 from versity/auke/orphan-inodes
Extend orphan-inodes timeout.
2025-05-09 11:15:13 -07:00
Zach Brown
0f6fdb3eb5 Merge pull request #222 from versity/auke/t_kill_silent
Properly silently kill background tasks.
2025-05-09 11:11:24 -07:00
Auke Kok
2f48a606e8 Fix -Wmaybe-uninitalized since rhel9.5
Looks like the compiler isn't smart enough to understand the pass by
pointer value, and we can initialize it here easily.

make[1]: Entering directory '/usr/src/kernels/5.14.0-503.26.1.el9_5.x86_64'
  CC [M]  /home/auke/scoutfs/kmod/src/server.o
/home/auke/scoutfs/kmod/src/server.c: In function ‘fence_pending_recov_worker’:
/home/auke/scoutfs/kmod/src/server.c:4170:23: error: ‘addr.v4.addr’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
 4170 |                 ret = scoutfs_fence_start(sb, rid, le32_to_be32(addr.v4.addr),
      |                       ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 4171 |                                           SCOUTFS_FENCE_CLIENT_RECOVERY);
      |                                           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors

There's still the obvious issue here that we'd intended to support ipv6
but just disregard that here.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-05-08 15:20:50 -07:00
Auke Kok
377e49caf1 Properly silently kill background tasks.
Occasionally, we have some tests fail because these kills produce:

tests/lock-recover-invalidate.sh: line 42:  9928 Terminated

Even though we expected them to be silent. In these particular cases we
already don't care about this output.

We borrow the silent_kill() function from orphan-inodes and promote it
to t_silent_kill() in funcs/exec.sh, and then use it everywhere where
appropriate.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-05-08 12:03:04 -07:00
Auke Kok
d08eb66adc Give enospc test more time to commit unlink.
The current test sequence performs the unlink and immediately tests
whether enough resources are available to create new files again, and
this consistently fails.

One of my crummy VMs takes a good 12 seconds before the `touch` actually
succeeds. We care about the filesystem eventually returning from ENOSPC,
and certainly we don't want it to take forever, but there is a period
after our first ENOSPC error and cleanup that we expect ENOSPC to fail
for a bit longer.

Make the timeout 120s. As soon as the `touch` completes, exit the wait
loop.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-05-08 11:40:13 -07:00
Zach Brown
6f19d0bd36 Merge pull request #216 from versity/zab/stop_ending_dirty_data_freed
Zab/stop ending dirty data freed
2025-05-08 11:18:23 -07:00
Auke Kok
1d0cde7cc3 Clean up old test data as needed.
If run without `-m` (explicit mkfs) in subsequent testing, old test
data files may break several tests. Most failures are -EEXIST, but
there are some more subtle ones.

This change erases any existing test dir as needed just before we
run the tests, and avoids the issue entirely.

I considered doing a `mv dir dir.$$ && rm -rf dir.$$ &` alternative
solution but that likely will interfere disproportionally with
tests that do disconnects and other thing that can be impacted by an
unlink storm.

This has an obvious performance aspect - tests will be a little
slower to start on subsequent runs. In CI, this will effectively be
a no-op though.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-05-08 10:10:01 -07:00
Auke Kok
138c7c6b49 Extend orphan-inodes timeout.
This test regularly fails in CI when the 15 seconds elapses and the
system still hasn't concluded the mount log merges and orphan inode
scans needed to unlink the test files.

Instead of just extending the timeout value, we test-and-retry for 120s.
This hopefully is faster in most cases. My smallest VM needs about 6s-8s
on average.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-05-08 09:56:45 -07:00
Zach Brown
8aa1a98901 Merge pull request #210 from versity/auke/perf-irq-took-too-long
Filter out perf `interrupt took too long` dmesg.
2025-04-30 10:04:00 -07:00
Zach Brown
888b1394a6 Retry client commit and get log trees separately
The client transaction commit worker has a series of functions that it
calls to commit the current transaction and open the next one.  If any
of them fail, it retries all of them from the beginning each time until
they all succeed.

This pattern behaves badly since we added the strict get_trans_seq and
commit_trans_seq latching in the log_trees.  The server will only commit
the items for a get or commit request once, and will fail a commit
request if it isn't given the seq that matches the current item.

If the server gets an error it can have persisted items while sending an
error to the client.  If this error was for a get request, then the
client will retry all of its transaction write functions.  This includes
the commit request which is now using a stale seq and will fail
indefinitely.  This is visible in the server log as:

  error -5 committing client logs for rid e57e37132c919c4f: invalid log trees item get_trans_seq

The solution is to retry the commit and get phases independently.  This
way a failed get will be retried on its own without running through the
commit phase that had succeeded.  The client will eventually get the
next seq that it can then safely commit.

Signed-off-by: Zach Brown <zab@versity.com>
2025-04-29 11:46:38 -07:00
Zach Brown
e457694f19 Don't send dirty data_freed blocks to client
At the end of get_log_trees we can try and drain the data_freed extent
tree, which can take multiple commits.  If a commit fails then the
blocks are still dirty in memory.  We can't send references to those
blocks to the client.  We have to return an error and not send the
log_trees, like the main get_log_trees does.  The client will retry and
eventually get a log_trees that references blocks that were successfully
committed.

Signed-off-by: Zach Brown <zab@versity.com>
2025-04-29 11:46:38 -07:00
Auke Kok
1b47e9429e Filter out perf interrupt took too long dmesg.
Example:

```
[ 2469.638414] perf: interrupt took too long (2507 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
```

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-04-14 12:06:58 -07:00
27 changed files with 509 additions and 248 deletions

View File

@@ -1,6 +1,27 @@
Versity ScoutFS Release Notes
=============================
---
v1.25
\
*Jun 3, 2025*
Fix a bug that could cause indefinite retries of failed client commits.
Under specific error conditions the client and server's understanding of
the current client commit could get out of sync. The client would retry
commits indefinitely that could never succeed. This manifested as
infinite "critical transaction commit failure" messages in the kernel
log on the client and matching "error <nr> committing client logs" on
the server.
Fix a bug in a specific case of server error handling that could result
in sending references to unwritten blocks to the client. The client
would try to read blocks that hadn't been written and return spurious
errors. This was seen under low free space conditions on the server and
resulted in error messages with error code 116 (The errno enum for
ESTALE, the client's indication that it couldn't read the blocks that it
expected.)
---
v1.24
\

View File

@@ -86,18 +86,47 @@ static u64 smallest_order_length(u64 len)
}
/*
* An extent modification dirties three distinct leaves of an allocator
* btree as it adds and removes the blkno and size sorted items for the
* old and new lengths of the extent. Dirtying the paths to these
* leaves can grow the tree and grow/shrink neighbours at each level.
* We over-estimate the number of blocks allocated and freed (the paths
* share a root, growth doesn't free) to err on the simpler and safer
* side. The overhead is minimal given the relatively large list blocks
* and relatively short allocator trees.
* Moving an extent between trees can dirty blocks in several ways. This
* function calculates worst case number of blocks across these scenarions.
* We treat the alloc and free counts independently, so the values below are
* max(allocated, freed), not the sum.
*
* We track extents with two separate btree items: by block number and by size.
*
* If we're removing an extent from the btree (allocating), we can dirty
* two blocks if the keys are in different leaves. If we wind up merging
* leaves because we fall below the low water mark, we can wind up freeing
* three leaves.
*
* That sequence is as follows, assuming the original keys are removed from
* blocks A and B:
*
* Allocate new dirty A' and B'
* Free old stable A and B
* B' has fallen below the low water mark, so copy B' into A'
* Free B'
*
* An extent insertion (freeing an extent) can dirty up to five distinct items
* in the btree as it adds and removes the blkno and size sorted items for the
* old and new lengths of the extent:
*
* In the by-blkno portion of the btree, we can dirty (allocate for COW) up
* to two blocks- either by merging adjacent extents, which can cause us to
* join leaf blocks; or by an insertion that causes a split.
*
* In the by-size portion, we never merge extents, so normally we just dirty
* a single item with a size insertion. But if we merged adjacent extents in
* the by-blkno portion of the tree, we might be working with three by-sizex
* items: removing the two old ones that were combined in the merge; and
* adding the new one for the larger, merged size.
*
* Finally, dirtying the paths to these leaves can grow the tree and grow/shrink
* neighbours at each level, so we multiply by the height of the tree after
* accounting for a possible new level.
*/
static u32 extent_mod_blocks(u32 height)
{
return ((1 + height) * 2) * 3;
return ((1 + height) * 3) * 5;
}
/*

View File

@@ -712,8 +712,8 @@ retry:
ret = 0;
out:
if ((ret == -ESTALE || scoutfs_trigger(sb, BLOCK_REMOVE_STALE)) &&
!retried && !block_is_dirty(bp)) {
if (!retried && !IS_ERR_OR_NULL(bp) && !block_is_dirty(bp) &&
(ret == -ESTALE || scoutfs_trigger(sb, BLOCK_REMOVE_STALE))) {
retried = true;
scoutfs_inc_counter(sb, block_cache_remove_stale);
block_remove(sb, bp);

View File

@@ -90,6 +90,7 @@
EXPAND_COUNTER(forest_read_items) \
EXPAND_COUNTER(forest_roots_next_hint) \
EXPAND_COUNTER(forest_set_bloom_bits) \
EXPAND_COUNTER(inode_deleted) \
EXPAND_COUNTER(item_cache_count_objects) \
EXPAND_COUNTER(item_cache_scan_objects) \
EXPAND_COUNTER(item_clear_dirty) \
@@ -146,6 +147,8 @@
EXPAND_COUNTER(lock_unlock) \
EXPAND_COUNTER(lock_wait) \
EXPAND_COUNTER(log_merge_wait_timeout) \
EXPAND_COUNTER(log_merges_completed) \
EXPAND_COUNTER(log_merges_started) \
EXPAND_COUNTER(net_dropped_response) \
EXPAND_COUNTER(net_send_bytes) \
EXPAND_COUNTER(net_send_error) \

View File

@@ -470,7 +470,7 @@ struct scoutfs_srch_compact {
* @get_trans_seq, @commit_trans_seq: These pair of sequence numbers
* determine if a transaction is currently open for the mount that owns
* the log_trees struct. get_trans_seq is advanced by the server as the
* transaction is opened. The server sets comimt_trans_seq equal to
* transaction is opened. The server sets commit_trans_seq equal to
* get_ as the transaction is committed.
*/
struct scoutfs_log_trees {

View File

@@ -1854,6 +1854,9 @@ static int try_delete_inode_items(struct super_block *sb, u64 ino)
goto out;
ret = delete_inode_items(sb, ino, &sinode, lock, orph_lock);
if (ret == 0)
scoutfs_inc_counter(sb, inode_deleted);
out:
if (clear_trying)
clear_bit(bit_nr, ldata->trying);
@@ -1962,6 +1965,8 @@ static void iput_worker(struct work_struct *work)
while (count-- > 0)
iput(inode);
cond_resched();
/* can't touch inode after final iput */
spin_lock(&inf->iput_lock);
@@ -2123,8 +2128,8 @@ static void inode_orphan_scan_worker(struct work_struct *work)
}
/* seemingly orphaned and unused, get locks and check for sure */
scoutfs_inc_counter(sb, orphan_scan_attempts);
ret = try_delete_inode_items(sb, ino);
scoutfs_inc_counter(sb, orphan_scan_attempts);
}
ret = 0;

View File

@@ -18,6 +18,19 @@ do { \
#define scoutfs_err(sb, fmt, args...) \
scoutfs_msg_check(sb, KERN_ERR, " error", fmt, ##args)
/*
* This can be used to suppress the expected -EIO error messages that are
* generated during forced unmount.
*/
#define ignore_err(sb, error) \
((error) == -EIO && unlikely(scoutfs_forcing_unmount(sb)))
#define scoutfs_err_maybe(sb, error, fmt, args...) \
do { \
if (!ignore_err(sb, error)) \
scoutfs_err(sb, fmt, args); \
} while (0)
#define scoutfs_warn(sb, fmt, args...) \
scoutfs_msg_check(sb, KERN_WARNING, " warning", fmt, ##args)

View File

@@ -39,7 +39,6 @@ enum {
Opt_orphan_scan_delay_ms,
Opt_quorum_heartbeat_timeout_ms,
Opt_quorum_slot_nr,
Opt_meta_reserve_blocks,
Opt_err,
};
@@ -53,7 +52,6 @@ static const match_table_t tokens = {
{Opt_orphan_scan_delay_ms, "orphan_scan_delay_ms=%s"},
{Opt_quorum_heartbeat_timeout_ms, "quorum_heartbeat_timeout_ms=%s"},
{Opt_quorum_slot_nr, "quorum_slot_nr=%s"},
{Opt_meta_reserve_blocks, "meta_reserve_blocks=%s"},
{Opt_err, NULL}
};
@@ -128,9 +126,6 @@ static void free_options(struct scoutfs_mount_options *opts)
#define MIN_DATA_PREALLOC_BLOCKS 1ULL
#define MAX_DATA_PREALLOC_BLOCKS ((unsigned long long)SCOUTFS_BLOCK_SM_MAX)
#define SCOUTFS_META_RESERVE_DEFAULT_BLOCKS 16384
static void init_default_options(struct scoutfs_mount_options *opts)
{
memset(opts, 0, sizeof(*opts));
@@ -141,7 +136,6 @@ static void init_default_options(struct scoutfs_mount_options *opts)
opts->orphan_scan_delay_ms = -1;
opts->quorum_heartbeat_timeout_ms = SCOUTFS_QUORUM_DEF_HB_TIMEO_MS;
opts->quorum_slot_nr = -1;
opts->meta_reserve_blocks = SCOUTFS_META_RESERVE_DEFAULT_BLOCKS;
}
static int verify_log_merge_wait_timeout_ms(struct super_block *sb, int ret, int val)
@@ -173,24 +167,6 @@ static int verify_quorum_heartbeat_timeout_ms(struct super_block *sb, int ret, u
return 0;
}
static int verify_meta_reserve_blocks(struct super_block *sb, int ret, int val)
{
/*
* Ideally we set a limit to something reasonable like 1/2 the actual
* total_meta_blocks, but we can't yet get this info when mount is called
*/
if (ret < 0) {
scoutfs_err(sb, "failed to parse meta_reserve_blocks value");
return -EINVAL;
}
if (val < 0 || val > INT_MAX) {
scoutfs_err(sb, "invalid meta_reserve_blocks value %d, must be between 0 and %d",
val, INT_MAX);
return -EINVAL;
}
return 0;
}
/*
* Parse the option string into our options struct. This can allocate
@@ -303,14 +279,6 @@ static int parse_options(struct super_block *sb, char *options, struct scoutfs_m
opts->quorum_slot_nr = nr;
break;
case Opt_meta_reserve_blocks:
ret = match_int(args, &nr);
ret = verify_meta_reserve_blocks(sb, ret, nr);
if (ret < 0)
return ret;
opts->meta_reserve_blocks = nr;
break;
default:
scoutfs_err(sb, "Unknown or malformed option, \"%s\"", p);
return -EINVAL;
@@ -403,7 +371,6 @@ int scoutfs_options_show(struct seq_file *seq, struct dentry *root)
seq_printf(seq, ",orphan_scan_delay_ms=%u", opts.orphan_scan_delay_ms);
if (opts.quorum_slot_nr >= 0)
seq_printf(seq, ",quorum_slot_nr=%d", opts.quorum_slot_nr);
seq_printf(seq, ".meta_reserve_blocks=%llu", opts.meta_reserve_blocks);
return 0;
}
@@ -622,17 +589,6 @@ static ssize_t quorum_slot_nr_show(struct kobject *kobj, struct kobj_attribute *
}
SCOUTFS_ATTR_RO(quorum_slot_nr);
static ssize_t meta_reserve_blocks_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
{
struct super_block *sb = SCOUTFS_SYSFS_ATTRS_SB(kobj);
struct scoutfs_mount_options opts;
scoutfs_options_read(sb, &opts);
return snprintf(buf, PAGE_SIZE, "%lld\n", opts.meta_reserve_blocks);
}
SCOUTFS_ATTR_RO(meta_reserve_blocks);
static struct attribute *options_attrs[] = {
SCOUTFS_ATTR_PTR(data_prealloc_blocks),
SCOUTFS_ATTR_PTR(data_prealloc_contig_only),
@@ -641,7 +597,6 @@ static struct attribute *options_attrs[] = {
SCOUTFS_ATTR_PTR(orphan_scan_delay_ms),
SCOUTFS_ATTR_PTR(quorum_heartbeat_timeout_ms),
SCOUTFS_ATTR_PTR(quorum_slot_nr),
SCOUTFS_ATTR_PTR(meta_reserve_blocks),
NULL,
};

View File

@@ -13,7 +13,6 @@ struct scoutfs_mount_options {
unsigned int orphan_scan_delay_ms;
int quorum_slot_nr;
u64 quorum_heartbeat_timeout_ms;
u64 meta_reserve_blocks;
};
void scoutfs_options_read(struct super_block *sb, struct scoutfs_mount_options *opts);

View File

@@ -726,6 +726,8 @@ static void scoutfs_quorum_worker(struct work_struct *work)
struct quorum_status qst = {0,};
struct hb_recording hbr;
bool record_hb;
bool recv_failed;
bool initializing = true;
int ret;
int err;
@@ -758,6 +760,8 @@ static void scoutfs_quorum_worker(struct work_struct *work)
update_show_status(qinf, &qst);
recv_failed = false;
ret = recv_msg(sb, &msg, qst.timeout);
if (ret < 0) {
if (ret != -ETIMEDOUT && ret != -EAGAIN) {
@@ -765,6 +769,9 @@ static void scoutfs_quorum_worker(struct work_struct *work)
scoutfs_inc_counter(sb, quorum_recv_error);
goto out;
}
recv_failed = true;
msg.type = SCOUTFS_QUORUM_MSG_INVALID;
ret = 0;
}
@@ -822,12 +829,13 @@ static void scoutfs_quorum_worker(struct work_struct *work)
/* followers and candidates start new election on timeout */
if (qst.role != LEADER &&
(initializing || recv_failed) &&
ktime_after(ktime_get(), qst.timeout)) {
/* .. but only if their server has stopped */
if (!scoutfs_server_is_down(sb)) {
qst.timeout = election_timeout();
scoutfs_inc_counter(sb, quorum_candidate_server_stopping);
continue;
goto again;
}
qst.role = CANDIDATE;
@@ -964,6 +972,9 @@ static void scoutfs_quorum_worker(struct work_struct *work)
}
record_hb_delay(sb, qinf, &hbr, record_hb, qst.role);
again:
initializing = false;
}
update_show_status(qinf, &qst);

View File

@@ -823,13 +823,14 @@ DEFINE_EVENT(scoutfs_lock_info_class, scoutfs_lock_destroy,
);
TRACE_EVENT(scoutfs_xattr_set,
TP_PROTO(struct super_block *sb, size_t name_len, const void *value,
size_t size, int flags),
TP_PROTO(struct super_block *sb, __u64 ino, size_t name_len,
const void *value, size_t size, int flags),
TP_ARGS(sb, name_len, value, size, flags),
TP_ARGS(sb, ino, name_len, value, size, flags),
TP_STRUCT__entry(
SCSB_TRACE_FIELDS
__field(__u64, ino)
__field(size_t, name_len)
__field(const void *, value)
__field(size_t, size)
@@ -838,15 +839,16 @@ TRACE_EVENT(scoutfs_xattr_set,
TP_fast_assign(
SCSB_TRACE_ASSIGN(sb);
__entry->ino = ino;
__entry->name_len = name_len;
__entry->value = value;
__entry->size = size;
__entry->flags = flags;
),
TP_printk(SCSBF" name_len %zu value %p size %zu flags 0x%x",
SCSB_TRACE_ARGS, __entry->name_len, __entry->value,
__entry->size, __entry->flags)
TP_printk(SCSBF" ino %llu name_len %zu value %p size %zu flags 0x%x",
SCSB_TRACE_ARGS, __entry->ino, __entry->name_len,
__entry->value, __entry->size, __entry->flags)
);
TRACE_EVENT(scoutfs_advance_dirty_super,
@@ -1966,15 +1968,17 @@ DEFINE_EVENT(scoutfs_server_client_count_class, scoutfs_server_client_down,
);
DECLARE_EVENT_CLASS(scoutfs_server_commit_users_class,
TP_PROTO(struct super_block *sb, int holding, int applying, int nr_holders,
u32 avail_before, u32 freed_before, int committing, int exceeded),
TP_ARGS(sb, holding, applying, nr_holders, avail_before, freed_before, committing,
exceeded),
TP_PROTO(struct super_block *sb, int holding, int applying,
int nr_holders, u32 budget,
u32 avail_before, u32 freed_before,
int committing, int exceeded),
TP_ARGS(sb, holding, applying, nr_holders, budget, avail_before, freed_before, committing, exceeded),
TP_STRUCT__entry(
SCSB_TRACE_FIELDS
__field(int, holding)
__field(int, applying)
__field(int, nr_holders)
__field(u32, budget)
__field(__u32, avail_before)
__field(__u32, freed_before)
__field(int, committing)
@@ -1985,35 +1989,45 @@ DECLARE_EVENT_CLASS(scoutfs_server_commit_users_class,
__entry->holding = !!holding;
__entry->applying = !!applying;
__entry->nr_holders = nr_holders;
__entry->budget = budget;
__entry->avail_before = avail_before;
__entry->freed_before = freed_before;
__entry->committing = !!committing;
__entry->exceeded = !!exceeded;
),
TP_printk(SCSBF" holding %u applying %u nr %u avail_before %u freed_before %u committing %u exceeded %u",
SCSB_TRACE_ARGS, __entry->holding, __entry->applying, __entry->nr_holders,
__entry->avail_before, __entry->freed_before, __entry->committing,
__entry->exceeded)
TP_printk(SCSBF" holding %u applying %u nr %u budget %u avail_before %u freed_before %u committing %u exceeded %u",
SCSB_TRACE_ARGS, __entry->holding, __entry->applying,
__entry->nr_holders, __entry->budget,
__entry->avail_before, __entry->freed_before,
__entry->committing, __entry->exceeded)
);
DEFINE_EVENT(scoutfs_server_commit_users_class, scoutfs_server_commit_hold,
TP_PROTO(struct super_block *sb, int holding, int applying, int nr_holders,
u32 avail_before, u32 freed_before, int committing, int exceeded),
TP_ARGS(sb, holding, applying, nr_holders, avail_before, freed_before, committing, exceeded)
TP_PROTO(struct super_block *sb, int holding, int applying,
int nr_holders, u32 budget,
u32 avail_before, u32 freed_before,
int committing, int exceeded),
TP_ARGS(sb, holding, applying, nr_holders, budget, avail_before, freed_before, committing, exceeded)
);
DEFINE_EVENT(scoutfs_server_commit_users_class, scoutfs_server_commit_apply,
TP_PROTO(struct super_block *sb, int holding, int applying, int nr_holders,
u32 avail_before, u32 freed_before, int committing, int exceeded),
TP_ARGS(sb, holding, applying, nr_holders, avail_before, freed_before, committing, exceeded)
TP_PROTO(struct super_block *sb, int holding, int applying,
int nr_holders, u32 budget,
u32 avail_before, u32 freed_before,
int committing, int exceeded),
TP_ARGS(sb, holding, applying, nr_holders, budget, avail_before, freed_before, committing, exceeded)
);
DEFINE_EVENT(scoutfs_server_commit_users_class, scoutfs_server_commit_start,
TP_PROTO(struct super_block *sb, int holding, int applying, int nr_holders,
u32 avail_before, u32 freed_before, int committing, int exceeded),
TP_ARGS(sb, holding, applying, nr_holders, avail_before, freed_before, committing, exceeded)
TP_PROTO(struct super_block *sb, int holding, int applying,
int nr_holders, u32 budget,
u32 avail_before, u32 freed_before,
int committing, int exceeded),
TP_ARGS(sb, holding, applying, nr_holders, budget, avail_before, freed_before, committing, exceeded)
);
DEFINE_EVENT(scoutfs_server_commit_users_class, scoutfs_server_commit_end,
TP_PROTO(struct super_block *sb, int holding, int applying, int nr_holders,
u32 avail_before, u32 freed_before, int committing, int exceeded),
TP_ARGS(sb, holding, applying, nr_holders, avail_before, freed_before, committing, exceeded)
TP_PROTO(struct super_block *sb, int holding, int applying,
int nr_holders, u32 budget,
u32 avail_before, u32 freed_before,
int committing, int exceeded),
TP_ARGS(sb, holding, applying, nr_holders, budget, avail_before, freed_before, committing, exceeded)
);
#define slt_symbolic(mode) \
@@ -2451,6 +2465,27 @@ TRACE_EVENT(scoutfs_block_dirty_ref,
__entry->block_blkno, __entry->block_seq)
);
TRACE_EVENT(scoutfs_get_file_block,
TP_PROTO(struct super_block *sb, u64 blkno, int flags),
TP_ARGS(sb, blkno, flags),
TP_STRUCT__entry(
SCSB_TRACE_FIELDS
__field(__u64, blkno)
__field(int, flags)
),
TP_fast_assign(
SCSB_TRACE_ASSIGN(sb);
__entry->blkno = blkno;
__entry->flags = flags;
),
TP_printk(SCSBF" blkno %llu flags 0x%x",
SCSB_TRACE_ARGS, __entry->blkno, __entry->flags)
);
TRACE_EVENT(scoutfs_block_stale,
TP_PROTO(struct super_block *sb, struct scoutfs_block_ref *ref,
struct scoutfs_block_header *hdr, u32 magic, u32 crc),
@@ -3048,6 +3083,27 @@ DEFINE_EVENT(scoutfs_srch_compact_class, scoutfs_srch_compact_client_recv,
TP_ARGS(sb, sc)
);
TRACE_EVENT(scoutfs_ioc_search_xattrs,
TP_PROTO(struct super_block *sb, u64 ino, u64 last_ino),
TP_ARGS(sb, ino, last_ino),
TP_STRUCT__entry(
SCSB_TRACE_FIELDS
__field(u64, ino)
__field(u64, last_ino)
),
TP_fast_assign(
SCSB_TRACE_ASSIGN(sb);
__entry->ino = ino;
__entry->last_ino = last_ino;
),
TP_printk(SCSBF" ino %llu last_ino %llu", SCSB_TRACE_ARGS,
__entry->ino, __entry->last_ino)
);
#endif /* _TRACE_SCOUTFS_H */
/* This part must be outside protection */

View File

@@ -65,6 +65,7 @@ struct commit_users {
struct list_head holding;
struct list_head applying;
unsigned int nr_holders;
u32 budget;
u32 avail_before;
u32 freed_before;
bool committing;
@@ -84,8 +85,9 @@ static void init_commit_users(struct commit_users *cusers)
do { \
__typeof__(cusers) _cusers = (cusers); \
trace_scoutfs_server_commit_##which(sb, !list_empty(&_cusers->holding), \
!list_empty(&_cusers->applying), _cusers->nr_holders, _cusers->avail_before, \
_cusers->freed_before, _cusers->committing, _cusers->exceeded); \
!list_empty(&_cusers->applying), _cusers->nr_holders, _cusers->budget, \
_cusers->avail_before, _cusers->freed_before, _cusers->committing, \
_cusers->exceeded); \
} while (0)
struct server_info {
@@ -303,7 +305,6 @@ static void check_holder_budget(struct super_block *sb, struct server_info *serv
u32 freed_used;
u32 avail_now;
u32 freed_now;
u32 budget;
assert_spin_locked(&cusers->lock);
@@ -318,15 +319,14 @@ static void check_holder_budget(struct super_block *sb, struct server_info *serv
else
freed_used = SCOUTFS_ALLOC_LIST_MAX_BLOCKS - freed_now;
budget = cusers->nr_holders * COMMIT_HOLD_ALLOC_BUDGET;
if (avail_used <= budget && freed_used <= budget)
if (avail_used <= cusers->budget && freed_used <= cusers->budget)
return;
exceeded_once = true;
cusers->exceeded = cusers->nr_holders;
scoutfs_err(sb, "%u holders exceeded alloc budget av: bef %u now %u, fr: bef %u now %u",
cusers->nr_holders, cusers->avail_before, avail_now,
scoutfs_err(sb, "holders exceeded alloc budget %u av: bef %u now %u, fr: bef %u now %u",
cusers->budget, cusers->avail_before, avail_now,
cusers->freed_before, freed_now);
list_for_each_entry(hold, &cusers->holding, entry) {
@@ -349,7 +349,7 @@ static bool hold_commit(struct super_block *sb, struct server_info *server,
{
bool has_room;
bool held;
u32 budget;
u32 new_budget;
u32 av;
u32 fr;
@@ -367,8 +367,8 @@ static bool hold_commit(struct super_block *sb, struct server_info *server,
}
/* +2 for our additional hold and then for the final commit work the server does */
budget = (cusers->nr_holders + 2) * COMMIT_HOLD_ALLOC_BUDGET;
has_room = av >= budget && fr >= budget;
new_budget = max(cusers->budget, (cusers->nr_holders + 2) * COMMIT_HOLD_ALLOC_BUDGET);
has_room = av >= new_budget && fr >= new_budget;
/* checking applying so holders drain once an apply caller starts waiting */
held = !cusers->committing && has_room && list_empty(&cusers->applying);
@@ -388,6 +388,7 @@ static bool hold_commit(struct super_block *sb, struct server_info *server,
list_add_tail(&hold->entry, &cusers->holding);
cusers->nr_holders++;
cusers->budget = new_budget;
} else if (!has_room && cusers->nr_holders == 0 && !cusers->committing) {
cusers->committing = true;
@@ -516,6 +517,7 @@ static void commit_end(struct super_block *sb, struct commit_users *cusers, int
list_for_each_entry_safe(hold, tmp, &cusers->applying, entry)
list_del_init(&hold->entry);
cusers->committing = false;
cusers->budget = 0;
spin_unlock(&cusers->lock);
wake_up(&cusers->waitq);
@@ -772,14 +774,11 @@ static int alloc_move_empty(struct super_block *sb,
u64 scoutfs_server_reserved_meta_blocks(struct super_block *sb)
{
DECLARE_SERVER_INFO(sb, server);
struct scoutfs_mount_options opts;
u64 server_blocks;
u64 client_blocks;
u64 log_blocks;
u64 nr_clients;
scoutfs_options_read(sb, &opts);
/* server has two meta_avail lists it swaps between */
server_blocks = SCOUTFS_SERVER_META_FILL_TARGET * 2;
@@ -804,7 +803,7 @@ u64 scoutfs_server_reserved_meta_blocks(struct super_block *sb)
nr_clients = server->nr_clients;
spin_unlock(&server->lock);
return server_blocks + (max(1ULL, nr_clients) * client_blocks) + opts.meta_reserve_blocks;
return server_blocks + (max(1ULL, nr_clients) * client_blocks);
}
/*
@@ -1041,6 +1040,101 @@ static int next_log_merge_item(struct super_block *sb,
return next_log_merge_item_key(sb, root, zone, &key, val, val_len);
}
static int do_finalize_ours(struct super_block *sb,
struct scoutfs_log_trees *lt,
struct commit_hold *hold)
{
struct server_info *server = SCOUTFS_SB(sb)->server_info;
struct scoutfs_super_block *super = DIRTY_SUPER_SB(sb);
struct scoutfs_key key;
char *err_str = NULL;
u64 rid = le64_to_cpu(lt->rid);
bool more;
int ret;
int err;
mutex_lock(&server->srch_mutex);
ret = scoutfs_srch_rotate_log(sb, &server->alloc, &server->wri,
&super->srch_root, &lt->srch_file, true);
mutex_unlock(&server->srch_mutex);
if (ret < 0) {
scoutfs_err(sb, "error rotating srch log for rid %016llx: %d",
rid, ret);
return ret;
}
do {
more = false;
/*
* All of these can return errors, perhaps indicating successful
* partial progress, after having modified the allocator trees.
* We always have to update the roots in the log item.
*/
mutex_lock(&server->alloc_mutex);
ret = (err_str = "splice meta_freed to other_freed",
scoutfs_alloc_splice_list(sb, &server->alloc,
&server->wri, server->other_freed,
&lt->meta_freed)) ?:
(err_str = "splice meta_avail",
scoutfs_alloc_splice_list(sb, &server->alloc,
&server->wri, server->other_freed,
&lt->meta_avail)) ?:
(err_str = "empty data_avail",
alloc_move_empty(sb, &super->data_alloc,
&lt->data_avail,
COMMIT_HOLD_ALLOC_BUDGET / 2)) ?:
(err_str = "empty data_freed",
alloc_move_empty(sb, &super->data_alloc,
&lt->data_freed,
COMMIT_HOLD_ALLOC_BUDGET / 2));
mutex_unlock(&server->alloc_mutex);
/*
* only finalize, allowing merging, once the allocators are
* fully freed
*/
if (ret == 0) {
/* the transaction is no longer open */
le64_add_cpu(&lt->flags, SCOUTFS_LOG_TREES_FINALIZED);
lt->finalize_seq = cpu_to_le64(scoutfs_server_next_seq(sb));
}
scoutfs_key_init_log_trees(&key, rid, le64_to_cpu(lt->nr));
err = scoutfs_btree_update(sb, &server->alloc, &server->wri,
&super->logs_root, &key, lt,
sizeof(*lt));
BUG_ON(err != 0); /* alloc, log, srch items out of sync */
if (ret == -EINPROGRESS) {
more = true;
mutex_unlock(&server->logs_mutex);
ret = server_apply_commit(sb, hold, 0);
if (ret < 0)
WARN_ON_ONCE(ret < 0);
server_hold_commit(sb, hold);
mutex_lock(&server->logs_mutex);
} else if (ret == 0) {
memset(&lt->item_root, 0, sizeof(lt->item_root));
memset(&lt->bloom_ref, 0, sizeof(lt->bloom_ref));
lt->inode_count_delta = 0;
lt->max_item_seq = 0;
lt->finalize_seq = 0;
le64_add_cpu(&lt->nr, 1);
lt->flags = 0;
}
} while (more);
if (ret < 0) {
scoutfs_err(sb,
"error %d finalizing log trees for rid %016llx: %s",
ret, rid, err_str);
}
return ret;
}
/*
* Finalizing the log btrees for merging needs to be done carefully so
* that items don't appear to go backwards in time.
@@ -1092,7 +1186,6 @@ static int finalize_and_start_log_merge(struct super_block *sb, struct scoutfs_l
struct scoutfs_log_merge_range rng;
struct scoutfs_mount_options opts;
struct scoutfs_log_trees each_lt;
struct scoutfs_log_trees fin;
unsigned int delay_ms;
unsigned long timeo;
bool saw_finalized;
@@ -1121,6 +1214,8 @@ static int finalize_and_start_log_merge(struct super_block *sb, struct scoutfs_l
break;
}
scoutfs_inc_counter(sb, log_merges_started);
/* look for finalized and other active log btrees */
saw_finalized = false;
others_active = false;
@@ -1197,32 +1292,11 @@ static int finalize_and_start_log_merge(struct super_block *sb, struct scoutfs_l
/* Finalize ours if it's visible to others */
if (ours_visible) {
fin = *lt;
memset(&fin.meta_avail, 0, sizeof(fin.meta_avail));
memset(&fin.meta_freed, 0, sizeof(fin.meta_freed));
memset(&fin.data_avail, 0, sizeof(fin.data_avail));
memset(&fin.data_freed, 0, sizeof(fin.data_freed));
memset(&fin.srch_file, 0, sizeof(fin.srch_file));
le64_add_cpu(&fin.flags, SCOUTFS_LOG_TREES_FINALIZED);
fin.finalize_seq = cpu_to_le64(scoutfs_server_next_seq(sb));
scoutfs_key_init_log_trees(&key, le64_to_cpu(fin.rid),
le64_to_cpu(fin.nr));
ret = scoutfs_btree_update(sb, &server->alloc, &server->wri,
&super->logs_root, &key, &fin,
sizeof(fin));
ret = do_finalize_ours(sb, lt, hold);
if (ret < 0) {
err_str = "updating finalized log_trees";
err_str = "finalizing ours";
break;
}
memset(&lt->item_root, 0, sizeof(lt->item_root));
memset(&lt->bloom_ref, 0, sizeof(lt->bloom_ref));
lt->inode_count_delta = 0;
lt->max_item_seq = 0;
lt->finalize_seq = 0;
le64_add_cpu(&lt->nr, 1);
lt->flags = 0;
}
/* wait a bit for mounts to arrive */
@@ -1284,6 +1358,7 @@ static int finalize_and_start_log_merge(struct super_block *sb, struct scoutfs_l
}
/* we're done, caller can make forward progress */
scoutfs_inc_counter(sb, log_merges_completed);
break;
}
@@ -1302,12 +1377,10 @@ static int finalize_and_start_log_merge(struct super_block *sb, struct scoutfs_l
* is nested inside holding commits so we recheck the persistent item
* each time we commit to make sure it's still what we think. The
* caller is still going to send the item to the client so we update the
* caller's each time we make progress. This is a best-effort attempt
* to clean up and it's valid to leave extents in data_freed we don't
* return errors to the caller. The client will continue the work later
* in get_log_trees or as the rid is reclaimed.
* caller's each time we make progress. If we hit an error applying the
* changes we make then we can't send the log_trees to the client.
*/
static void try_drain_data_freed(struct super_block *sb, struct scoutfs_log_trees *lt)
static int try_drain_data_freed(struct super_block *sb, struct scoutfs_log_trees *lt)
{
DECLARE_SERVER_INFO(sb, server);
struct scoutfs_super_block *super = DIRTY_SUPER_SB(sb);
@@ -1316,6 +1389,7 @@ static void try_drain_data_freed(struct super_block *sb, struct scoutfs_log_tree
struct scoutfs_log_trees drain;
struct scoutfs_key key;
COMMIT_HOLD(hold);
bool apply = false;
int ret = 0;
int err;
@@ -1324,22 +1398,27 @@ static void try_drain_data_freed(struct super_block *sb, struct scoutfs_log_tree
while (lt->data_freed.total_len != 0) {
server_hold_commit(sb, &hold);
mutex_lock(&server->logs_mutex);
apply = true;
ret = find_log_trees_item(sb, &super->logs_root, false, rid, U64_MAX, &drain);
if (ret < 0)
if (ret < 0) {
ret = 0;
break;
}
/* careful to only keep draining the caller's specific open trans */
if (drain.nr != lt->nr || drain.get_trans_seq != lt->get_trans_seq ||
drain.commit_trans_seq != lt->commit_trans_seq || drain.flags != lt->flags) {
ret = -ENOENT;
ret = 0;
break;
}
ret = scoutfs_btree_dirty(sb, &server->alloc, &server->wri,
&super->logs_root, &key);
if (ret < 0)
if (ret < 0) {
ret = 0;
break;
}
/* moving can modify and return errors, always update caller and item */
mutex_lock(&server->alloc_mutex);
@@ -1355,19 +1434,19 @@ static void try_drain_data_freed(struct super_block *sb, struct scoutfs_log_tree
BUG_ON(err < 0); /* dirtying must guarantee success */
mutex_unlock(&server->logs_mutex);
ret = server_apply_commit(sb, &hold, ret);
if (ret < 0) {
ret = 0; /* don't try to abort, ignoring ret */
apply = false;
if (ret < 0)
break;
}
}
/* try to cleanly abort and write any partial dirty btree blocks, but ignore result */
if (ret < 0) {
if (apply) {
mutex_unlock(&server->logs_mutex);
server_apply_commit(sb, &hold, 0);
server_apply_commit(sb, &hold, ret);
}
return ret;
}
/*
@@ -1571,13 +1650,15 @@ unlock:
ret = server_apply_commit(sb, &hold, ret);
out:
if (ret < 0)
scoutfs_err(sb, "error %d getting log trees for rid %016llx: %s",
ret, rid, err_str);
if (ret < 0) {
scoutfs_err_maybe(sb, ret,
"error %d getting log trees for rid %016llx: %s",
ret, rid, err_str);
}
/* try to drain excessive data_freed with additional commits, if needed, ignoring err */
/* try to drain excessive data_freed with additional commits, if needed */
if (ret == 0)
try_drain_data_freed(sb, &lt);
ret = try_drain_data_freed(sb, &lt);
return scoutfs_net_response(sb, conn, cmd, id, ret, &lt, sizeof(lt));
}
@@ -1676,11 +1757,14 @@ unlock:
mutex_unlock(&server->logs_mutex);
ret = server_apply_commit(sb, &hold, ret);
if (ret < 0)
scoutfs_err(sb, "server error %d committing client logs for rid %016llx: %s",
ret, rid, err_str);
if (ret < 0) {
scoutfs_err_maybe(sb, ret,
"server error %d committing client logs for rid %016llx, nr %llu: %s",
ret, rid, le64_to_cpu(lt.nr), err_str);
}
out:
WARN_ON_ONCE(ret < 0);
WARN_ON_ONCE((ret < 0) && !ignore_err(sb, ret));
return scoutfs_net_response(sb, conn, cmd, id, ret, NULL, 0);
}
@@ -1715,7 +1799,7 @@ static int server_get_roots(struct super_block *sb,
* all the allocator items.
*
* The caller holds the commit rwsem which means we have to do our work
* in one commit. The alocator btrees can be very large and very
* in one commit. The allocator btrees can be very large and very
* fragmented. We return -EINPROGRESS if we couldn't fully reclaim the
* allocators in one commit. The caller should apply the current
* commit and call again in a new commit.
@@ -1741,6 +1825,8 @@ static int reclaim_open_log_tree(struct super_block *sb, u64 rid)
mutex_lock(&server->logs_mutex);
scoutfs_inc_counter(sb, log_merges_started);
/* find the client's last open log_tree */
scoutfs_key_init_log_trees(&key, rid, U64_MAX);
ret = scoutfs_btree_prev(sb, &super->logs_root, &key, &iref);
@@ -1810,6 +1896,7 @@ static int reclaim_open_log_tree(struct super_block *sb, u64 rid)
&super->logs_root, &key, &lt, sizeof(lt));
BUG_ON(err != 0); /* alloc, log, srch items out of sync */
scoutfs_inc_counter(sb, log_merges_completed);
out:
mutex_unlock(&server->logs_mutex);
@@ -2014,7 +2101,9 @@ static int server_srch_get_compact(struct super_block *sb,
apply:
ret = server_apply_commit(sb, &hold, ret);
WARN_ON_ONCE(ret < 0 && ret != -ENOENT); /* XXX leaked busy item */
/* XXX leaked busy item */
WARN_ON_ONCE(ret < 0 && ret != -ENOENT && !ignore_err(sb, ret));
out:
ret = scoutfs_net_response(sb, conn, cmd, id, ret,
sc, sizeof(struct scoutfs_srch_compact));
@@ -2530,7 +2619,7 @@ static void server_log_merge_free_work(struct work_struct *work)
ret = scoutfs_btree_free_blocks(sb, &server->alloc,
&server->wri, &fr.key,
&fr.root, COMMIT_HOLD_ALLOC_BUDGET / 2);
&fr.root, COMMIT_HOLD_ALLOC_BUDGET / 8);
if (ret < 0) {
err_str = "freeing log btree";
break;
@@ -2549,7 +2638,7 @@ static void server_log_merge_free_work(struct work_struct *work)
/* freed blocks are in allocator, we *have* to update fr */
BUG_ON(ret < 0);
if (server_hold_alloc_used_since(sb, &hold) >= COMMIT_HOLD_ALLOC_BUDGET / 2) {
if (server_hold_alloc_used_since(sb, &hold) >= (COMMIT_HOLD_ALLOC_BUDGET * 3) / 4) {
mutex_unlock(&server->logs_mutex);
ret = server_apply_commit(sb, &hold, ret);
commit = false;
@@ -2858,9 +2947,10 @@ out:
mutex_unlock(&server->alloc_mutex);
BUG_ON(err); /* inconsistent */
if (ret < 0 && ret != -ENOENT)
scoutfs_err(sb, "error %d getting merge req rid %016llx: %s",
ret, rid, err_str);
if (ret < 0 && ret != -ENOENT) {
scoutfs_err_maybe(sb, ret, "error %d getting merge req rid %016llx: %s",
ret, rid, err_str);
}
}
mutex_unlock(&server->logs_mutex);
@@ -4152,7 +4242,7 @@ static void fence_pending_recov_worker(struct work_struct *work)
struct server_info *server = container_of(work, struct server_info,
fence_pending_recov_work);
struct super_block *sb = server->sb;
union scoutfs_inet_addr addr;
union scoutfs_inet_addr addr = {{0,}};
u64 rid = 0;
int ret = 0;

View File

@@ -62,7 +62,7 @@
* re-allocated and re-written. Search can restart by checking the
* btree for the current set of files. Compaction reads log files which
* are protected from other compactions by the persistent busy items
* created by the server. Compaction won't see it's blocks reused out
* created by the server. Compaction won't see its blocks reused out
* from under it, but it can encounter stale cached blocks that need to
* be invalidated.
*/
@@ -442,6 +442,10 @@ out:
if (ret == 0 && (flags & GFB_INSERT) && blk >= le64_to_cpu(sfl->blocks))
sfl->blocks = cpu_to_le64(blk + 1);
if (bl) {
trace_scoutfs_get_file_block(sb, bl->blkno, flags);
}
*bl_ret = bl;
return ret;
}
@@ -749,14 +753,14 @@ static int search_log_file(struct super_block *sb,
for (i = 0; i < le32_to_cpu(srb->entry_nr); i++) {
if (pos > SCOUTFS_SRCH_BLOCK_SAFE_BYTES) {
/* can only be inconsistency :/ */
ret = EIO;
ret = -EIO;
break;
}
ret = decode_entry(srb->entries + pos, &sre, &prev);
if (ret <= 0) {
/* can only be inconsistency :/ */
ret = EIO;
ret = -EIO;
break;
}
pos += ret;
@@ -859,14 +863,14 @@ static int search_sorted_file(struct super_block *sb,
if (pos > SCOUTFS_SRCH_BLOCK_SAFE_BYTES) {
/* can only be inconsistency :/ */
ret = EIO;
ret = -EIO;
break;
}
ret = decode_entry(srb->entries + pos, &sre, &prev);
if (ret <= 0) {
/* can only be inconsistency :/ */
ret = EIO;
ret = -EIO;
break;
}
pos += ret;
@@ -972,6 +976,8 @@ int scoutfs_srch_search_xattrs(struct super_block *sb,
scoutfs_inc_counter(sb, srch_search_xattrs);
trace_scoutfs_ioc_search_xattrs(sb, ino, last_ino);
*done = false;
srch_init_rb_root(sroot);
@@ -1802,7 +1808,7 @@ static void swap_page_sre(void *A, void *B, int size)
* typically, ~10x worst case).
*
* Because we read and sort all the input files we must perform the full
* compaction in one operation. The server must have given us a
* compaction in one operation. The server must have given us
* sufficiently large avail/freed lists, otherwise we'll return ENOSPC.
*/
static int compact_logs(struct super_block *sb,
@@ -1866,14 +1872,14 @@ static int compact_logs(struct super_block *sb,
if (pos > SCOUTFS_SRCH_BLOCK_SAFE_BYTES) {
/* can only be inconsistency :/ */
ret = EIO;
ret = -EIO;
break;
}
ret = decode_entry(srb->entries + pos, sre, &prev);
if (ret <= 0) {
/* can only be inconsistency :/ */
ret = EIO;
ret = -EIO;
goto out;
}
prev = *sre;

View File

@@ -159,6 +159,58 @@ static bool drained_holders(struct trans_info *tri)
return holders == 0;
}
static int commit_current_log_trees(struct super_block *sb, char **str)
{
DECLARE_TRANS_INFO(sb, tri);
return (*str = "data submit", scoutfs_inode_walk_writeback(sb, true)) ?:
(*str = "item dirty", scoutfs_item_write_dirty(sb)) ?:
(*str = "data prepare", scoutfs_data_prepare_commit(sb)) ?:
(*str = "alloc prepare", scoutfs_alloc_prepare_commit(sb, &tri->alloc, &tri->wri)) ?:
(*str = "meta write", scoutfs_block_writer_write(sb, &tri->wri)) ?:
(*str = "data wait", scoutfs_inode_walk_writeback(sb, false)) ?:
(*str = "commit log trees", commit_btrees(sb)) ?:
scoutfs_item_write_done(sb);
}
static int get_next_log_trees(struct super_block *sb, char **str)
{
return (*str = "get log trees", scoutfs_trans_get_log_trees(sb));
}
static int retry_forever(struct super_block *sb, int (*func)(struct super_block *sb, char **str))
{
bool retrying = false;
char *str;
int ret;
do {
str = NULL;
ret = func(sb, &str);
if (ret < 0) {
if (!retrying) {
scoutfs_warn(sb, "critical transaction commit failure: %s = %d, retrying",
str, ret);
retrying = true;
}
if (scoutfs_forcing_unmount(sb)) {
ret = -EIO;
break;
}
msleep(2 * MSEC_PER_SEC);
} else if (retrying) {
scoutfs_info(sb, "retried transaction commit succeeded");
}
} while (ret < 0);
return ret;
}
/*
* This work func is responsible for writing out all the dirty blocks
* that make up the current dirty transaction. It prevents writers from
@@ -184,8 +236,6 @@ void scoutfs_trans_write_func(struct work_struct *work)
struct trans_info *tri = container_of(work, struct trans_info, write_work.work);
struct super_block *sb = tri->sb;
struct scoutfs_sb_info *sbi = SCOUTFS_SB(sb);
bool retrying = false;
char *s = NULL;
int ret = 0;
tri->task = current;
@@ -214,37 +264,9 @@ void scoutfs_trans_write_func(struct work_struct *work)
scoutfs_inc_counter(sb, trans_commit_written);
do {
ret = (s = "data submit", scoutfs_inode_walk_writeback(sb, true)) ?:
(s = "item dirty", scoutfs_item_write_dirty(sb)) ?:
(s = "data prepare", scoutfs_data_prepare_commit(sb)) ?:
(s = "alloc prepare", scoutfs_alloc_prepare_commit(sb, &tri->alloc,
&tri->wri)) ?:
(s = "meta write", scoutfs_block_writer_write(sb, &tri->wri)) ?:
(s = "data wait", scoutfs_inode_walk_writeback(sb, false)) ?:
(s = "commit log trees", commit_btrees(sb)) ?:
scoutfs_item_write_done(sb) ?:
(s = "get log trees", scoutfs_trans_get_log_trees(sb));
if (ret < 0) {
if (!retrying) {
scoutfs_warn(sb, "critical transaction commit failure: %s = %d, retrying",
s, ret);
retrying = true;
}
if (scoutfs_forcing_unmount(sb)) {
ret = -EIO;
break;
}
msleep(2 * MSEC_PER_SEC);
} else if (retrying) {
scoutfs_info(sb, "retried transaction commit succeeded");
}
} while (ret < 0);
/* retry {commit,get}_log_trees until they succeeed, can only fail when forcing unmount */
ret = retry_forever(sb, commit_current_log_trees) ?:
retry_forever(sb, get_next_log_trees);
out:
spin_lock(&tri->write_lock);
tri->write_count++;

View File

@@ -742,7 +742,7 @@ int scoutfs_xattr_set_locked(struct inode *inode, const char *name, size_t name_
int ret;
int err;
trace_scoutfs_xattr_set(sb, name_len, value, size, flags);
trace_scoutfs_xattr_set(sb, ino, name_len, value, size, flags);
if (WARN_ON_ONCE(tgs->totl && tgs->indx) ||
WARN_ON_ONCE((tgs->totl | tgs->indx) && !tag_lock))

View File

@@ -80,3 +80,15 @@ t_compare_output()
{
"$@" >&7 2>&1
}
#
# usually bash prints an annoying output message when jobs
# are killed. We can avoid that by redirecting stderr for
# the bash process when it reaps the jobs that are killed.
#
t_silent_kill() {
exec {ERR}>&2 2>/dev/null
kill "$@"
wait "$@"
exec 2>&$ERR {ERR}>&-
}

View File

@@ -160,6 +160,9 @@ t_filter_dmesg()
re="$re|Pipe handler or fully qualified core dump path required.*"
re="$re|Set kernel.core_pattern before fs.suid_dumpable.*"
# perf warning that it adjusted sample rate
re="$re|perf: interrupt took too long.*lowering kernel.perf_event_max_sample_rate.*"
egrep -v "($re)" | \
ignore_harmless_unwind_kasan_stack_oob
}

View File

@@ -1,4 +1,3 @@
== setting longer hung task timeout
== creating fragmented extents
== unlink file with moved extents to free extents per block
== cleanup

View File

@@ -49,7 +49,7 @@ offline wating should be empty:
0
== truncating does wait
truncate should be waiting for first block:
trunate should no longer be waiting:
truncate should no longer be waiting:
0
== writing waits
should be waiting for write

View File

@@ -464,7 +464,6 @@ for i in $(seq 0 $((T_NR_MOUNTS - 1))); do
if [ "$i" -lt "$T_QUORUM" ]; then
opts="$opts,quorum_slot_nr=$i"
fi
opts="$opts,meta_reserve_blocks=0"
opts="${opts}${T_MNT_OPTIONS}"
msg "mounting $meta_dev|$data_dev on $dir"
@@ -533,12 +532,15 @@ for t in $tests; do
cmd rm -rf "$T_TMPDIR"
cmd mkdir -p "$T_TMPDIR"
# create a test name dir in the fs
# create a test name dir in the fs, clean up old data as needed
T_DS=""
for i in $(seq 0 $((T_NR_MOUNTS - 1))); do
dir="${T_M[$i]}/test/$test_name"
test $i == 0 && cmd mkdir -p "$dir"
test $i == 0 && (
test -d "$dir" && cmd rm -rf "$dir"
cmd mkdir -p "$dir"
)
eval T_D$i=$dir
T_D[$i]=$dir

View File

@@ -88,6 +88,11 @@ rm -rf "$SCR/xattrs"
echo "== make sure we can create again"
file="$SCR/file-after"
C=120
while (( C-- )); do
touch $file 2> /dev/null && break
sleep 1
done
touch $file
setfattr -n user.scoutfs-enospc -v 1 "$file"
sync

View File

@@ -11,8 +11,8 @@
# format version.
#
# not supported on el9!
if [ $(source /etc/os-release ; echo ${VERSION_ID:0:1}) -gt 8 ]; then
# not supported on el8 or higher
if [ $(source /etc/os-release ; echo ${VERSION_ID:0:1}) -gt 7 ]; then
t_skip_permitted "Unsupported OS version"
fi

View File

@@ -10,30 +10,6 @@ EXTENTS_PER_BTREE_BLOCK=600
EXTENTS_PER_LIST_BLOCK=8192
FREED_EXTENTS=$((EXTENTS_PER_BTREE_BLOCK * EXTENTS_PER_LIST_BLOCK))
#
# This test specifically creates a pathologically sparse file that will
# be as expensive as possible to free. This is usually fine on
# dedicated or reasonable hardware, but trying to run this in
# virtualized debug kernels can take a very long time. This test is
# about making sure that the server doesn't fail, not that the platform
# can handle the scale of work that our btree formats happen to require
# while execution is bogged down with use-after-free memory reference
# tracking. So we give the test a lot more breathing room before
# deciding that its hung.
#
echo "== setting longer hung task timeout"
if [ -w /proc/sys/kernel/hung_task_timeout_secs ]; then
secs=$(cat /proc/sys/kernel/hung_task_timeout_secs)
test "$secs" -gt 0 || \
t_fail "confusing value '$secs' from /proc/sys/kernel/hung_task_timeout_secs"
restore_hung_task_timeout()
{
echo "$secs" > /proc/sys/kernel/hung_task_timeout_secs
}
trap restore_hung_task_timeout EXIT
echo "$((secs * 5))" > /proc/sys/kernel/hung_task_timeout_secs
fi
echo "== creating fragmented extents"
fragmented_data_extents $FREED_EXTENTS $EXTENTS_PER_BTREE_BLOCK "$T_D0/alloc" "$T_D0/move"

View File

@@ -38,6 +38,6 @@ while [ "$SECONDS" -lt "$END" ]; do
done
echo "== stopping background load"
kill $load_pids
t_silent_kill $load_pids
t_pass

View File

@@ -157,7 +157,7 @@ echo "truncate should be waiting for first block:"
expect_wait "$DIR/file" "change_size" $ino 0
scoutfs stage "$DIR/golden" "$DIR/file" -V "$vers" -o 0 -l $BYTES
sleep .1
echo "trunate should no longer be waiting:"
echo "truncate should no longer be waiting:"
scoutfs data-waiting -B 0 -I 0 -p "$DIR" | wc -l
cat "$DIR/golden" > "$DIR/file"
vers=$(scoutfs stat -s data_version "$DIR/file")
@@ -168,10 +168,13 @@ scoutfs release "$DIR/file" -V "$vers" -o 0 -l $BYTES
# overwrite, not truncate+write
dd if="$DIR/other" of="$DIR/file" \
bs=$BS count=$BLOCKS conv=notrunc status=none &
pid="$!"
sleep .1
echo "should be waiting for write"
expect_wait "$DIR/file" "write" $ino 0
scoutfs stage "$DIR/golden" "$DIR/file" -V "$vers" -o 0 -l $BYTES
# wait for the background dd to complete
wait "$pid" 2> /dev/null
cmp "$DIR/file" "$DIR/other"
echo "== cleanup"

View File

@@ -5,18 +5,6 @@
t_require_commands sleep touch sync stat handle_cat kill rm
t_require_mounts 2
#
# usually bash prints an annoying output message when jobs
# are killed. We can avoid that by redirecting stderr for
# the bash process when it reaps the jobs that are killed.
#
silent_kill() {
exec {ERR}>&2 2>/dev/null
kill "$@"
wait "$@"
exec 2>&$ERR {ERR}>&-
}
#
# We don't have a great way to test that inode items still exist. We
# don't prevent opening handles with nlink 0 today, so we'll use that.
@@ -52,7 +40,7 @@ inode_exists $ino || echo "$ino didn't exist"
echo "== orphan from failed evict deletion is picked up"
# pending kill signal stops evict from getting locks and deleting
silent_kill $pid
t_silent_kill $pid
t_set_sysfs_mount_option 0 orphan_scan_delay_ms 1000
sleep 5
inode_exists $ino && echo "$ino still exists"
@@ -70,7 +58,11 @@ for nr in $(t_fs_nrs); do
rm -f "$path"
done
sync
silent_kill $pids
t_silent_kill $pids
LLMS=$(t_counter log_merges_started $sv)
LLMC=$(t_counter log_merges_completed $sv)
for nr in $(t_fs_nrs); do
t_force_umount $nr
done
@@ -79,10 +71,64 @@ t_mount_all
while test -d $(echo /sys/fs/scoutfs/*/fence/* | cut -d " " -f 1); do
sleep .5
done
# wait for the orphan inode cleanup changes to be merged
S=0
C=0
sv=$(t_server_nr)
while sleep 1; do
LMS=$(t_counter log_merges_started $sv)
LMC=$(t_counter log_merges_completed $sv)
if [ $LMS != $LLMS ]; then
(( S++ ))
LLMS=$LMS
fi
if [ $LMC != $LLMC ]; then
(( C++ ))
LLMC=$LMC
fi
# If we've completed more than one merge, we're done
if [ $C -gt 1 ]; then
break
fi
# If we've started more than one merge, we're done
if [ $S -gt 1 ]; then
break
fi
done
# wait for orphan scans to run
t_set_all_sysfs_mount_options orphan_scan_delay_ms 1000
# also have to wait for delayed log merge work from mount
sleep 15
# wait until we see two consecutive orphan scan attempts without
# any inode deletion forward progress in each mount
for nr in $(t_fs_nrs); do
C=0
LOSA=$(t_counter orphan_scan_attempts $nr)
LDOP=$(t_counter inode_deleted $nr)
while [ $C -lt 2 ]; do
sleep 1
OSA=$(t_counter orphan_scan_attempts $nr)
DOP=$(t_counter inode_deleted $nr)
if [ $OSA != $LOSA ]; then
if [ $DOP == $LDOP ]; then
(( C++ ))
else
C=0
fi
fi
LOSA=$OSA
LDOP=$DOP
done
done
for ino in $inos; do
inode_exists $ino && echo "$ino still exists"
done
@@ -131,7 +177,7 @@ while [ $SECONDS -lt $END ]; do
done
# trigger eviction deletion of each file in each mount
silent_kill $pids
t_silent_kill $pids
wait || t_fail "handle_fsetxattr failed"

View File

@@ -22,6 +22,11 @@ RE="$RE|warning: memset with byte count of 4194304"
# some sparse versions don't know about some builtins
RE="$RE|error: undefined identifier '__builtin_fpclassify'"
# on el8, sparse can't handle __has_include for some reason when _GNU_SOURCE
# is defined, and we need that for O_DIRECT.
RE="$RE|note: in included file .through /usr/include/sys/stat.h.:"
RE="$RE|/usr/include/bits/statx.h:30:6: error: "
#
# don't filter out 'too many errors' here, it can signify that
# sparse doesn't understand something and is throwing a *ton*