Compare commits

..

19 Commits

Author SHA1 Message Date
Auke Kok
4e2c1e83be Optionally print out xattr values
We cannot validate that totl keys have the correct count/value without
also extracting and printing the value for xattrs, which by default
is omitted (a sane default). Add a default-disabled --xattr-values/-V
flag to enable printing these out.

Because xattr values can span multiple items, this only will print
out the first one, and ellipsize it if it continues elsewhere. It is
filtered through isprint() to avoid printing non-printable characters.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2026-04-16 11:26:29 -07:00
Auke Kok
8c4e9bfa3e Add print filters for remaining types.
Chris already added print filters for most common types, but
omitted totl, indx, orphan, quota, and inode_index items. This
adds those as well, completing the set.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2026-04-16 11:01:47 -07:00
Chris Kirby
91638191de Add finer grained options to scoutfs print
The default output from scoutfs print can be very large, even
when using the -S option. Add three new command line options
to allow more targeted selection of btrees and their items.

--allocs prints the metadata and data allocators
--roots allows the selection of btree roots to walk (logs, srch, fs)
--items allows the selection of items to print from the selected btrees

Signed-off-by: Chris Kirby <ckirby@versity.com>
2025-11-21 10:39:58 -06:00
Zach Brown
1c7678b6f5 Merge pull request #263 from versity/zab/v1.26
v1.26 Release
2025-11-18 09:39:27 -08:00
Zach Brown
22b5e79bbd v1.26 Release
Finish the release notes for the 1.26 release.

Signed-off-by: Zach Brown <zab@versity.com>
2025-11-17 14:42:14 -08:00
Zach Brown
259e639271 Merge pull request #262 from versity/zab/ino_alloc_per_lock
Add ino_alloc_per_lock option
2025-11-14 13:57:49 -08:00
Zach Brown
4d66c38c71 Remove redundant WARN in commit_log_trees
The server's commit_log_trees has an error message that includes the
source of the error, but it's not used for all errors.  The WARN_ON is
redundant with the message and is removed because it isn't filtered out
when we see errors from forced unmount.

Signed-off-by: Zach Brown <zab@versity.com>
2025-11-14 10:04:30 -08:00
Zach Brown
7ef62894bd Add ino_alloc_per_lock option
Add an option that can limit the number of inode numbers that are
allocated per lock group.

Signed-off-by: Zach Brown <zab@versity.com>
2025-11-13 17:19:04 -08:00
Zach Brown
1f363a1ead Merge pull request #261 from versity/zab/log_merge_double_free
Zab/log merge double free
2025-11-13 17:18:30 -08:00
Zach Brown
8ddf9b8c8c Handle disappearing fencing requests and targets
The userspace fencing process wasn't careful about handling underlying
directories that disappear while it was working.

On the server/fenced side, fencing requests can linger after they've
been resolved by writing 1 to fenced or error.  The script could come
back around to see the directory before the server finally removes it,
causing all later uses of the request dir to fail.  We saw this in the
logs as a bunch of cat errors for the various request files.

On the local fence script side, all the mounts can be in the process of
being unmounted so both the /sys/fs dirs and the mount it self can be
removed while we're working.

For both, when we're working with the /sys/fs files we read them without
logging errors and then test that the dir still exists before using what
we read.  When fencing a mount, we stop if findmnt doesn't find the
mount and then raise a umount error if the /sys/fs dir exists after
umount fails.

And while we're at it, we have each scripts logging append instead of
truncating (if, say, it's a log file instead of an interactive tty).

Signed-off-by: Zach Brown <zab@versity.com>
2025-11-13 12:43:31 -08:00
Zach Brown
fd80c17ab6 Filter out kernel message when guests are slow
Ignore more kernel messages when debug guests are being slow.

Signed-off-by: Zach Brown <zab@versity.com>
2025-11-13 12:43:31 -08:00
Zach Brown
991e2cbdf8 Ignore slow quorum hb transfers in tests
We're getting test failures from messages that our guests can be
unresponsive.  They sure can be.  We don't need to fail for this one
specific case.

Signed-off-by: Zach Brown <zab@versity.com>
2025-11-13 12:43:31 -08:00
Zach Brown
92ac132873 Silence merge splice error when forcing
Silence another error warning and assertion that's assuming that the
result of the errors is going to be persistent.  When we're forcing an
unmount we've severed storage and networking.

Signed-off-by: Zach Brown <zab@versity.com>
2025-11-13 12:43:31 -08:00
Auke Kok
ad078cd93c Avoid lock stalling mmap_stress
mmap_stress gets completely stalled in lock messaging and starving
most of the mmap_stress threads, which causes it to delay and even
time out in CI.

Instead of spawning threads over all 5 test nodes, we reduce it
to spawning over only 2 artificially. This still does a good number
of operations on those node, and now the work is spread across the
two nodes evenly.

Additionaly, I've added a miniscule (10ms) delay in between operations
that should hopefully be sufficient for other locking attempts to
settle and allow the threads to better spread the work.

This now shows that all the threads exit within < 0.25s on my test
machine, which is a lot better than the 40s variation that I was seeing
locally. Hopefully this fares better in CI.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-11-13 12:43:31 -08:00
Auke Kok
90cb458cd5 Make mmap_stress not exceed a fixed amount of time.
There's a scenarion where mmap_stress gets enough resources that
twoe of the threads will starve the others, which then all take
a very long time catching up committing changes.

Because this test program didn't finish until all the threads had
completed a fixed amount of work, essentially these threads all
ended up tripping over eachother. In CI this would exceed 6h+,
while originally I intended this to run in about 100s or so.

Instead, cap the run time to ~30s by default. If threads exceed
this time, they will immediately exit, which causes any clog in
contention between the threads to drain relatively quickly.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-11-13 12:43:31 -08:00
Zach Brown
1ab798e7eb Silence inconsistent srch on forced unmount
Assembling a srch compaction operation creates an item and populates it
with allocator state.  It doesn't cleanly unwind the allocation and undo
the compaction item change if allocation filling fails and issues a
warning.

This warning isn't needed if the error shows that we're in forced
unmount.  The inconsistent state won't be applied, it will be dropped on
the floor as the mount is torn down.

Signed-off-by: Zach Brown <zab@versity.com>
2025-11-13 12:43:31 -08:00
Zach Brown
e182914e51 Fix double free of metadata blocks in log merging
The log merging process is meant to provide parallelism across workers
in mounts.  The idea is that the server hands out a bunch of concurrent
non-intersecting work that's based on the structure of the stable input
fs_root btree.

The nature of the parallel work (cow of the blocks that intersect a key
range) means that the ranges of concurrently issued work can't overlap
or the work will all cow the same input blocks, freeing that input
stable block multiple times.  We're seeing this in testing.

Correctness was intended by having an advancing key that sweeps sorted
ranges.  Duplicate ranges would never be hit as the key advanced past
each it visited.  This was broken by the mapping of the fs item keys to
log merge tree keys by clobbering the sk_zone key value.  It effectively
interleaves the ranges of each zone in the fs root (meta indexes,
orphans, fs items).  With just the right log merge conditions that
involve logged items in the right places and partial completed work to
insert remaining ranges behind the key, ranges can be stored at mapped
keys that end up with ranges out of order.  The server iterates over
these and ends up issueing overlapping work, which results in duplicated
frees of the input blocks.

The fix, without changing the format of the stored log tree items, is to
perform a full sweep of all the range items and determine the next item
by looking at the full precision stored keys.  This ensures that the
processed ranges always advance and never overlap.

Signed-off-by: Zach Brown <zab@versity.com>
2025-11-13 12:43:31 -08:00
Zach Brown
8484a58dd6 Have xfstest pass when using args
The xfstests's golden output includes the full set of tests we expect to
run when no args are specified.  If we specify args then the set of
tests can change and the test will always fail when they do.

This fixes that by having the test check the set of tests itself, rather
than relying on golden output.  If args are specified then our xfstest
only fails if any of the executed xfstest tests failed.  Without args,
we perform the same scraping of the check output and compare it against
the expected results ourself.

It would have been a bit much to put that large file inline in the test
file, so we add a dir of per-test files in revision control.  We can
also put the list of exclusions there.

We can also clean up the output redirection helper functions to make
them more clear.  After xfstests has executed we want to redirect output
back to the compared output so that we can catch any unexpected output.

Signed-off-by: Zach Brown <zab@versity.com>
2025-11-13 12:43:31 -08:00
Zach Brown
a077104531 Add crash monitor to run-tests
Add a little background function that runs during the test which
triggers a crash if it finds catastrophic failure conditions.

This is the second bg task we want to kill and we can only have one
function run on the EXIT trap, so we create a generic process killing
trap function.

We feed it the fenced pid as well.  run-tests didn't log much of value
into the fenced log, and we're not logging the kills into anymore, so we
just remove run-tests fenced logging.

Signed-off-by: Zach Brown <zab@versity.com>
2025-11-13 12:43:31 -08:00
15 changed files with 625 additions and 140 deletions

View File

@@ -1,6 +1,41 @@
Versity ScoutFS Release Notes
=============================
---
v1.26
\
*Nov 17, 2025*
Add the ino\_alloc\_per\_lock mount option. This changes the number of
inode numbers allocated under each cluster lock and can alleviate lock
contention for some patterns of larger file creation.
Add the tcp\_keepalive\_timeout\_ms mount option. This can enable the
system to survive longer periods of networking outages.
Fix a rare double free of internal btree metadata blocks when merging
log trees. The duplicated freed metadata block numbers would cause
persistent errors in the server, preventing the server from starting and
hanging the system.
Fix the data\_wait interface to not require the correct data\_version of
the inode when raising an error. This lets callers raise errors when
they're unable to recall the details of the inode to discover its
data\_version.
Change scoutfs to more aggressively reclaim cached memory when under
memory pressure. This makes scoutfs behave more like other kernel
components and it integrates better with the reclaim policy heuristics
in the VM core of the kernel.
Change scoutfs to more efficiently transmit and receive socket messages.
Under heavy load this can process messages sufficiently more quickly to
avoid hung task messages for tasks that were waiting for cluster lock
messages to be processed.
Fix faulty server block commit budget calculations that were generating
spurious "holders exceeded alloc budget" console messages.
---
v1.25
\

View File

@@ -1482,12 +1482,6 @@ static int remove_index_items(struct super_block *sb, u64 ino,
* Return an allocated and unused inode number. Returns -ENOSPC if
* we're out of inode.
*
* Each parent directory has its own pool of free inode numbers. Items
* are sorted by their inode numbers as they're stored in segments.
* This will tend to group together files that are created in a
* directory at the same time in segments. Concurrent creation across
* different directories will be stored in their own regions.
*
* Inode numbers are never reclaimed. If the inode is evicted or we're
* unmounted the pending inode numbers will be lost. Asking for a
* relatively small number from the server each time will tend to
@@ -1497,12 +1491,18 @@ static int remove_index_items(struct super_block *sb, u64 ino,
int scoutfs_alloc_ino(struct super_block *sb, bool is_dir, u64 *ino_ret)
{
DECLARE_INODE_SB_INFO(sb, inf);
struct scoutfs_mount_options opts;
struct inode_allocator *ia;
u64 ino;
u64 nr;
int ret;
ia = is_dir ? &inf->dir_ino_alloc : &inf->ino_alloc;
scoutfs_options_read(sb, &opts);
if (is_dir && opts.ino_alloc_per_lock == SCOUTFS_LOCK_INODE_GROUP_NR)
ia = &inf->dir_ino_alloc;
else
ia = &inf->ino_alloc;
spin_lock(&ia->lock);
@@ -1523,6 +1523,17 @@ int scoutfs_alloc_ino(struct super_block *sb, bool is_dir, u64 *ino_ret)
*ino_ret = ia->ino++;
ia->nr--;
if (opts.ino_alloc_per_lock != SCOUTFS_LOCK_INODE_GROUP_NR) {
nr = ia->ino & SCOUTFS_LOCK_INODE_GROUP_MASK;
if (nr >= opts.ino_alloc_per_lock) {
nr = SCOUTFS_LOCK_INODE_GROUP_NR - nr;
if (nr > ia->nr)
nr = ia->nr;
ia->ino += nr;
ia->nr -= nr;
}
}
spin_unlock(&ia->lock);
ret = 0;
out:

View File

@@ -35,6 +35,12 @@ do { \
} \
} while (0) \
#define scoutfs_bug_on_err(sb, err, fmt, args...) \
do { \
__typeof__(err) _err = (err); \
scoutfs_bug_on(sb, _err < 0 && _err != -ENOLINK, fmt, ##args); \
} while (0)
/*
* Each message is only generated once per volume. Remounting resets
* the messages.

View File

@@ -33,6 +33,7 @@ enum {
Opt_acl,
Opt_data_prealloc_blocks,
Opt_data_prealloc_contig_only,
Opt_ino_alloc_per_lock,
Opt_log_merge_wait_timeout_ms,
Opt_metadev_path,
Opt_noacl,
@@ -47,6 +48,7 @@ static const match_table_t tokens = {
{Opt_acl, "acl"},
{Opt_data_prealloc_blocks, "data_prealloc_blocks=%s"},
{Opt_data_prealloc_contig_only, "data_prealloc_contig_only=%s"},
{Opt_ino_alloc_per_lock, "ino_alloc_per_lock=%s"},
{Opt_log_merge_wait_timeout_ms, "log_merge_wait_timeout_ms=%s"},
{Opt_metadev_path, "metadev_path=%s"},
{Opt_noacl, "noacl"},
@@ -136,6 +138,7 @@ static void init_default_options(struct scoutfs_mount_options *opts)
opts->data_prealloc_blocks = SCOUTFS_DATA_PREALLOC_DEFAULT_BLOCKS;
opts->data_prealloc_contig_only = 1;
opts->ino_alloc_per_lock = SCOUTFS_LOCK_INODE_GROUP_NR;
opts->log_merge_wait_timeout_ms = DEFAULT_LOG_MERGE_WAIT_TIMEOUT_MS;
opts->orphan_scan_delay_ms = -1;
opts->quorum_heartbeat_timeout_ms = SCOUTFS_QUORUM_DEF_HB_TIMEO_MS;
@@ -238,6 +241,18 @@ static int parse_options(struct super_block *sb, char *options, struct scoutfs_m
opts->data_prealloc_contig_only = nr;
break;
case Opt_ino_alloc_per_lock:
ret = match_int(args, &nr);
if (ret < 0 || nr < 1 || nr > SCOUTFS_LOCK_INODE_GROUP_NR) {
scoutfs_err(sb, "invalid ino_alloc_per_lock option, must be between 1 and %u",
SCOUTFS_LOCK_INODE_GROUP_NR);
if (ret == 0)
ret = -EINVAL;
return ret;
}
opts->ino_alloc_per_lock = nr;
break;
case Opt_tcp_keepalive_timeout_ms:
ret = match_int(args, &nr);
ret = verify_tcp_keepalive_timeout_ms(sb, ret, nr);
@@ -393,6 +408,7 @@ int scoutfs_options_show(struct seq_file *seq, struct dentry *root)
seq_puts(seq, ",acl");
seq_printf(seq, ",data_prealloc_blocks=%llu", opts.data_prealloc_blocks);
seq_printf(seq, ",data_prealloc_contig_only=%u", opts.data_prealloc_contig_only);
seq_printf(seq, ",ino_alloc_per_lock=%u", opts.ino_alloc_per_lock);
seq_printf(seq, ",metadev_path=%s", opts.metadev_path);
if (!is_acl)
seq_puts(seq, ",noacl");
@@ -481,6 +497,45 @@ static ssize_t data_prealloc_contig_only_store(struct kobject *kobj, struct kobj
}
SCOUTFS_ATTR_RW(data_prealloc_contig_only);
static ssize_t ino_alloc_per_lock_show(struct kobject *kobj, struct kobj_attribute *attr,
char *buf)
{
struct super_block *sb = SCOUTFS_SYSFS_ATTRS_SB(kobj);
struct scoutfs_mount_options opts;
scoutfs_options_read(sb, &opts);
return snprintf(buf, PAGE_SIZE, "%u", opts.ino_alloc_per_lock);
}
static ssize_t ino_alloc_per_lock_store(struct kobject *kobj, struct kobj_attribute *attr,
const char *buf, size_t count)
{
struct super_block *sb = SCOUTFS_SYSFS_ATTRS_SB(kobj);
DECLARE_OPTIONS_INFO(sb, optinf);
char nullterm[20]; /* more than enough for octal -U32_MAX */
long val;
int len;
int ret;
len = min(count, sizeof(nullterm) - 1);
memcpy(nullterm, buf, len);
nullterm[len] = '\0';
ret = kstrtol(nullterm, 0, &val);
if (ret < 0 || val < 1 || val > SCOUTFS_LOCK_INODE_GROUP_NR) {
scoutfs_err(sb, "invalid ino_alloc_per_lock option, must be between 1 and %u",
SCOUTFS_LOCK_INODE_GROUP_NR);
return -EINVAL;
}
write_seqlock(&optinf->seqlock);
optinf->opts.ino_alloc_per_lock = val;
write_sequnlock(&optinf->seqlock);
return count;
}
SCOUTFS_ATTR_RW(ino_alloc_per_lock);
static ssize_t log_merge_wait_timeout_ms_show(struct kobject *kobj, struct kobj_attribute *attr,
char *buf)
{
@@ -621,6 +676,7 @@ SCOUTFS_ATTR_RO(quorum_slot_nr);
static struct attribute *options_attrs[] = {
SCOUTFS_ATTR_PTR(data_prealloc_blocks),
SCOUTFS_ATTR_PTR(data_prealloc_contig_only),
SCOUTFS_ATTR_PTR(ino_alloc_per_lock),
SCOUTFS_ATTR_PTR(log_merge_wait_timeout_ms),
SCOUTFS_ATTR_PTR(metadev_path),
SCOUTFS_ATTR_PTR(orphan_scan_delay_ms),

View File

@@ -8,6 +8,7 @@
struct scoutfs_mount_options {
u64 data_prealloc_blocks;
bool data_prealloc_contig_only;
unsigned int ino_alloc_per_lock;
unsigned int log_merge_wait_timeout_ms;
char *metadev_path;
unsigned int orphan_scan_delay_ms;

View File

@@ -994,10 +994,11 @@ static int for_each_rid_last_lt(struct super_block *sb, struct scoutfs_btree_roo
}
/*
* Log merge range items are stored at the starting fs key of the range.
* The only fs key field that doesn't hold information is the zone, so
* we use the zone to differentiate all types that we store in the log
* merge tree.
* Log merge range items are stored at the starting fs key of the range
* with the zone overwritten to indicate the log merge item type. This
* day0 mistake loses sorting information for items in the different
* zones in the fs root, so the range items aren't strictly sorted by
* the starting key of their range.
*/
static void init_log_merge_key(struct scoutfs_key *key, u8 zone, u64 first,
u64 second)
@@ -1029,6 +1030,51 @@ static int next_log_merge_item_key(struct super_block *sb, struct scoutfs_btree_
return ret;
}
/*
* The range items aren't sorted by their range.start because
* _RANGE_ZONE clobbers the range's zone. We sweep all the items and
* find the range with the next least starting key that's greater than
* the caller's starting key. We have to be careful to iterate over the
* log_merge tree keys because the ranges can overlap as they're mapped
* to the log_merge keys by clobbering their zone.
*/
static int next_log_merge_range(struct super_block *sb, struct scoutfs_btree_root *root,
struct scoutfs_key *start, struct scoutfs_log_merge_range *rng)
{
struct scoutfs_log_merge_range *next;
SCOUTFS_BTREE_ITEM_REF(iref);
struct scoutfs_key key;
int ret;
key = *start;
key.sk_zone = SCOUTFS_LOG_MERGE_RANGE_ZONE;
scoutfs_key_set_ones(&rng->start);
do {
ret = scoutfs_btree_next(sb, root, &key, &iref);
if (ret == 0) {
if (iref.key->sk_zone != SCOUTFS_LOG_MERGE_RANGE_ZONE) {
ret = -ENOENT;
} else if (iref.val_len != sizeof(struct scoutfs_log_merge_range)) {
ret = -EIO;
} else {
next = iref.val;
if (scoutfs_key_compare(&next->start, &rng->start) < 0 &&
scoutfs_key_compare(&next->start, start) >= 0)
*rng = *next;
key = *iref.key;
scoutfs_key_inc(&key);
}
scoutfs_btree_put_iref(&iref);
}
} while (ret == 0);
if (ret == -ENOENT && !scoutfs_key_is_ones(&rng->start))
ret = 0;
return ret;
}
static int next_log_merge_item(struct super_block *sb,
struct scoutfs_btree_root *root,
u8 zone, u64 first, u64 second,
@@ -1682,6 +1728,7 @@ static int server_commit_log_trees(struct super_block *sb,
int ret;
if (arg_len != sizeof(struct scoutfs_log_trees)) {
err_str = "invalid message log_trees size";
ret = -EINVAL;
goto out;
}
@@ -1745,7 +1792,7 @@ static int server_commit_log_trees(struct super_block *sb,
ret = scoutfs_btree_update(sb, &server->alloc, &server->wri,
&super->logs_root, &key, &lt, sizeof(lt));
BUG_ON(ret < 0); /* dirtying should have guaranteed success */
BUG_ON(ret < 0); /* dirtying should have guaranteed success, srch item inconsistent */
if (ret < 0)
err_str = "updating log trees item";
@@ -1753,11 +1800,10 @@ unlock:
mutex_unlock(&server->logs_mutex);
ret = server_apply_commit(sb, &hold, ret);
out:
if (ret < 0)
scoutfs_err(sb, "server error %d committing client logs for rid %016llx, nr %llu: %s",
ret, rid, le64_to_cpu(lt.nr), err_str);
out:
WARN_ON_ONCE(ret < 0);
return scoutfs_net_response(sb, conn, cmd, id, ret, NULL, 0);
}
@@ -2094,7 +2140,7 @@ static int server_srch_get_compact(struct super_block *sb,
apply:
ret = server_apply_commit(sb, &hold, ret);
WARN_ON_ONCE(ret < 0 && ret != -ENOENT); /* XXX leaked busy item */
WARN_ON_ONCE(ret < 0 && ret != -ENOENT && ret != -ENOLINK); /* XXX leaked busy item */
out:
ret = scoutfs_net_response(sb, conn, cmd, id, ret,
sc, sizeof(struct scoutfs_srch_compact));
@@ -2472,10 +2518,9 @@ out:
}
}
if (ret < 0)
scoutfs_err(sb, "server error %d splicing log merge completion: %s", ret, err_str);
BUG_ON(ret); /* inconsistent */
/* inconsistent */
scoutfs_bug_on_err(sb, ret,
"server error %d splicing log merge completion: %s", ret, err_str);
return ret ?: einprogress;
}
@@ -2720,10 +2765,7 @@ restart:
/* find the next range, always checking for splicing */
for (;;) {
key = stat.next_range_key;
key.sk_zone = SCOUTFS_LOG_MERGE_RANGE_ZONE;
ret = next_log_merge_item_key(sb, &super->log_merge, SCOUTFS_LOG_MERGE_RANGE_ZONE,
&key, &rng, sizeof(rng));
ret = next_log_merge_range(sb, &super->log_merge, &stat.next_range_key, &rng);
if (ret < 0 && ret != -ENOENT) {
err_str = "finding merge range item";
goto out;

View File

@@ -8,36 +8,34 @@
echo "$0 running rid '$SCOUTFS_FENCED_REQ_RID' ip '$SCOUTFS_FENCED_REQ_IP' args '$@'"
log() {
echo "$@" > /dev/stderr
echo_fail() {
echo "$@" >> /dev/stderr
exit 1
}
echo_fail() {
echo "$@" > /dev/stderr
exit 1
# silence error messages
quiet_cat()
{
cat "$@" 2>/dev/null
}
rid="$SCOUTFS_FENCED_REQ_RID"
shopt -s nullglob
for fs in /sys/fs/scoutfs/*; do
[ ! -d "$fs" ] && continue
fs_rid="$(quiet_cat $fs/rid)"
nr="$(quiet_cat $fs/data_device_maj_min)"
[ ! -d "$fs" -o "$fs_rid" != "$rid" ] && continue
fs_rid="$(cat $fs/rid)" || \
echo_fail "failed to get rid in $fs"
if [ "$fs_rid" != "$rid" ]; then
continue
fi
nr="$(cat $fs/data_device_maj_min)" || \
echo_fail "failed to get data device major:minor in $fs"
mnts=$(findmnt -l -n -t scoutfs -o TARGET -S $nr) || \
mnt=$(findmnt -l -n -t scoutfs -o TARGET -S $nr) || \
echo_fail "findmnt -t scoutfs -S $nr failed"
for mnt in $mnts; do
umount -f "$mnt" || \
echo_fail "umout -f $mnt failed"
done
[ -z "$mnt" ] && continue
if ! umount -qf "$mnt"; then
if [ -d "$fs" ]; then
echo_fail "umount -qf $mnt failed"
fi
fi
done
exit 0

View File

@@ -121,6 +121,7 @@ t_filter_dmesg()
# in debugging kernels we can slow things down a bit
re="$re|hrtimer: interrupt took .*"
re="$re|clocksource: Long readout interval"
# fencing tests force unmounts and trigger timeouts
re="$re|scoutfs .* forcing unmount"
@@ -166,6 +167,9 @@ t_filter_dmesg()
# perf warning that it adjusted sample rate
re="$re|perf: interrupt took too long.*lowering kernel.perf_event_max_sample_rate.*"
# some ci test guests are unresponsive
re="$re|longest quorum heartbeat .* delay"
egrep -v "($re)" | \
ignore_harmless_unwind_kasan_stack_oob
}

View File

@@ -39,20 +39,6 @@ cmd() {
die "cmd failed (check the run.log)"
}
# we can record pids to kill as we exit, we kill in reverse added order
declare -a atexit_kill_pids
atexit_kill()
{
local pid
for pid in $(echo ${atexit_kill_pids[*]} | rev); do
if test -e "/proc/$pid/status" ; then
kill "$pid"
fi
done
}
trap atexit_kill EXIT
show_help()
{
cat << EOF
@@ -452,6 +438,30 @@ cmd grep . /sys/kernel/debug/tracing/options/trace_printk \
/sys/kernel/debug/tracing/buffer_size_kb \
/proc/sys/kernel/ftrace_dump_on_oops
# we can record pids to kill as we exit, we kill in reverse added order
atexit_kill_pids=""
add_atexit_kill_pid()
{
atexit_kill_pids="$1 $atexit_kill_pids"
}
atexit_kill()
{
local pid
# suppress bg function exited messages
exec {ERR}>&2 2>/dev/null
for pid in $atexit_kill_pids; do
if test -e "/proc/$pid/status" ; then
kill "$pid"
wait "$pid"
fi
done
exec 2>&$ERR {ERR}>&-
}
trap atexit_kill EXIT
#
# Build a fenced config that runs scripts out of the repository rather
# than the default system directory
@@ -467,7 +477,7 @@ T_FENCED_LOG="$T_RESULTS/fenced.log"
$T_UTILS/fenced/scoutfs-fenced > "$T_FENCED_LOG" 2>&1 &
fenced_pid=$!
atexit_kill_pids+=($fenced_pid)
add_atexit_kill_pid $fenced_pid
#
# some critical failures will cause fs operations to hang. We can watch
@@ -496,13 +506,12 @@ crash_monitor()
if [ "$bad" != 0 ]; then
echo "run-tests monitor triggering crash"
echo c > /proc/sysrq-trigger
# bg function doesn't reload bash, $$ is parent run-tests.sh
kill -9 $$
exit 1
fi
done
}
crash_monitor &
atexit_kill_pids+=($!)
add_atexit_kill_pid $!
# setup dm tables
echo "0 $(blockdev --getsz $T_META_DEVICE) linear $T_META_DEVICE 0" > \

View File

@@ -19,6 +19,7 @@
#include <sys/types.h>
#include <stdio.h>
#include <sys/stat.h>
#include <inttypes.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
@@ -29,7 +30,7 @@
#include <errno.h>
static int size = 0;
static int count = 0; /* XXX make this duration instead */
static int duration = 0;
struct thread_info {
int nr;
@@ -41,6 +42,8 @@ static void *run_test_func(void *ptr)
void *buf = NULL;
char *addr = NULL;
struct thread_info *tinfo = ptr;
uint64_t seconds = 0;
struct timespec ts;
int c = 0;
int fd;
ssize_t read, written, ret;
@@ -61,9 +64,15 @@ static void *run_test_func(void *ptr)
usleep(100000); /* 0.1sec to allow all threads to start roughly at the same time */
clock_gettime(CLOCK_REALTIME, &ts); /* record start time */
seconds = ts.tv_sec + duration;
for (;;) {
if (++c > count)
break;
if (++c % 16 == 0) {
clock_gettime(CLOCK_REALTIME, &ts);
if (ts.tv_sec >= seconds)
break;
}
switch (rand() % 4) {
case 0: /* pread */
@@ -99,6 +108,8 @@ static void *run_test_func(void *ptr)
memcpy(addr, buf, size); /* noerr */
break;
}
usleep(10000);
}
munmap(addr, size);
@@ -120,7 +131,7 @@ int main(int argc, char **argv)
int i;
if (argc != 8) {
fprintf(stderr, "%s requires 7 arguments - size count file1 file2 file3 file4 file5\n", argv[0]);
fprintf(stderr, "%s requires 7 arguments - size duration file1 file2 file3 file4 file5\n", argv[0]);
exit(-1);
}
@@ -130,9 +141,9 @@ int main(int argc, char **argv)
exit(-1);
}
count = atoi(argv[2]);
if (count < 0) {
fprintf(stderr, "invalid count, must be greater than 0\n");
duration = atoi(argv[2]);
if (duration < 0) {
fprintf(stderr, "invalid duration, must be greater than or equal to 0\n");
exit(-1);
}

View File

@@ -5,7 +5,7 @@
t_require_commands mmap_stress mmap_validate scoutfs xfs_io
echo "== mmap_stress"
mmap_stress 8192 2000 "$T_D0/mmap_stress" "$T_D1/mmap_stress" "$T_D2/mmap_stress" "$T_D3/mmap_stress" "$T_D4/mmap_stress" | sed 's/:.*//g' | sort
mmap_stress 8192 30 "$T_D0/mmap_stress" "$T_D0/mmap_stress" "$T_D0/mmap_stress" "$T_D3/mmap_stress" "$T_D3/mmap_stress" | sed 's/:.*//g' | sort
echo "== basic mmap/read/write consistency checks"
mmap_validate 256 1000 "$T_D0/mmap_val1" "$T_D1/mmap_val1"

View File

@@ -7,7 +7,7 @@ message_output()
error_message()
{
message_output "$@" >&2
message_output "$@" >> /dev/stderr
}
error_exit()
@@ -62,31 +62,27 @@ test -x "$SCOUTFS_FENCED_RUN" || \
# files disappear.
#
# generate failure messages to stderr while still echoing 0 for the caller
careful_cat()
# silence error messages
quiet_cat()
{
local path="$@"
cat "$@" || echo 0
cat "$@" 2>/dev/null
}
while sleep $SCOUTFS_FENCED_DELAY; do
shopt -s nullglob
for fence in /sys/fs/scoutfs/*/fence/*; do
# catches unmatched regex when no dirs
if [ ! -d "$fence" ]; then
continue
fi
# skip requests that have been handled
if [ "$(careful_cat $fence/fenced)" == 1 -o \
"$(careful_cat $fence/error)" == 1 ]; then
continue
fi
srv=$(basename $(dirname $(dirname $fence)))
rid="$(cat $fence/rid)"
ip="$(cat $fence/ipv4_addr)"
reason="$(cat $fence/reason)"
fenced="$(quiet_cat $fence/fenced)"
error="$(quiet_cat $fence/error)"
rid="$(quiet_cat $fence/rid)"
ip="$(quiet_cat $fence/ipv4_addr)"
reason="$(quiet_cat $fence/reason)"
# request dirs can linger then disappear after fenced/error is set
if [ ! -d "$fence" -o "$fenced" == "1" -o "$error" == "1" ]; then
continue
fi
log_message "server $srv fencing rid $rid at IP $ip for $reason"

View File

@@ -55,6 +55,14 @@ with initial sparse regions (perhaps by multiple threads writing to
different regions) and wasted space isn't an issue (perhaps because the
file population contains few small files).
.TP
.B ino_alloc_per_lock=<number>
This option determines how many inode numbers are allocated in the same
cluster lock. The default, and maximum, is 1024. The minimum is 1.
Allocating fewer inodes per lock can allow more parallelism between
mounts because there are more locks that cover the same number of
created files. This can be helpful when working with smaller numbers of
large files.
.TP
.B log_merge_wait_timeout_ms=<number>
This option sets the amount of time, in milliseconds, that log merge
creation can wait before timing out. This setting is per-mount, only

View File

@@ -402,25 +402,45 @@ before destroying an old empty data device.
.PD
.TP
.BI "print {-S|--skip-likely-huge} META-DEVICE"
.BI "print {-a|--allocs} {-i|--items ITEMS} {-r|--roots ROOTS} {-S|--skip-likely-huge} {-V|--xattr-values} META-DEVICE"
.sp
Prints out all of the metadata in the file system. This makes no effort
Prints out some or all of the metadata in the file system. This makes no effort
to ensure that the structures are consistent as they're traversed and
can present structures that seem corrupt as they change as they're
output.
.sp
Structures that are related to the number of mounts and are maintained at a
relatively reasonable size are always printed. These include per-mount log
trees, srch files, allocators, and the metadata allocators used by server
commits. Other btrees and their items can be selected as desired.
.RS 1.0i
.PD 0
.TP
.sp
.TP
.B "-a, --allocs"
Print the metadata and data allocators. Enabled by default.
.TP
.B "-r, --roots ROOTS"
This option can be used to select which btrees are traversed. It is a comma-separated list containing one or more of the following btree roots: logs, srch, fs. Default is all roots.
.TP
.B "-i, --items ITEMS"
This option can be used to choose which btree items are printed from the
selected btree roots. It is a comma-separated list containing one or
more of the following items: inode, xattr, dirent, symlink, backref, extent,
totl, indx, inoindex, orphan, quota.
Default is all items.
.TP
.B "-V, --xattr-values"
Print xattr values alongside the xattr item. Non-printable bytes are
rendered as '.'. A trailing '...' indicates the value continues in
additional item parts that aren't shown.
.TP
.B "-S, --skip-likely-huge"
Skip printing structures that are likely to be very large. The
structures that are skipped tend to be global and whose size tends to be
related to the size of the volume. Examples of skipped structures include
the global fs items, srch files, and metadata and data
allocators. Similar structures that are not skipped are related to the
number of mounts and are maintained at a relatively reasonable size.
These include per-mount log trees, srch files, allocators, and the
metadata allocators used by server commits.
allocators.
.sp
Skipping the larger structures limits the print output to a relatively
constant size rather than being a large multiple of the used metadata

View File

@@ -29,6 +29,54 @@
#include "leaf_item_hash.h"
#include "dev.h"
struct print_args {
char *meta_device;
bool skip_likely_huge;
bool roots_requested;
bool items_requested;
bool allocs_requested;
bool walk_allocs;
bool walk_logs_root;
bool walk_fs_root;
bool walk_srch_root;
bool print_inodes;
bool print_xattrs;
bool print_dirents;
bool print_symlinks;
bool print_backrefs;
bool print_extents;
bool print_totl;
bool print_indx;
bool print_inode_index;
bool print_orphan;
bool print_quota;
bool print_xattr_values;
};
static struct print_args print_args = {
.meta_device = NULL,
.skip_likely_huge = false,
.roots_requested = false,
.items_requested = false,
.allocs_requested = false,
.walk_allocs = true,
.walk_logs_root = true,
.walk_fs_root = true,
.walk_srch_root = true,
.print_inodes = true,
.print_xattrs = true,
.print_dirents = true,
.print_symlinks = true,
.print_backrefs = true,
.print_extents = true,
.print_totl = true,
.print_indx = true,
.print_inode_index = true,
.print_orphan = true,
.print_quota = true,
.print_xattr_values = false
};
static void print_block_header(struct scoutfs_block_header *hdr, int size)
{
u32 crc = crc_block(hdr, size);
@@ -135,15 +183,42 @@ static u8 *global_printable_name(u8 *name, int name_len)
static void print_xattr(struct scoutfs_key *key, void *val, int val_len)
{
struct scoutfs_xattr *xat = val;
unsigned int full_val_len;
int avail;
int show;
int i;
printf(" xattr: ino %llu name_hash %08x id %llu part %u\n",
le64_to_cpu(key->skx_ino), (u32)le64_to_cpu(key->skx_name_hash),
le64_to_cpu(key->skx_id), key->skx_part);
if (key->skx_part == 0)
printf(" name_len %u val_len %u name %s\n",
xat->name_len, le16_to_cpu(xat->val_len),
global_printable_name(xat->name, xat->name_len));
if (key->skx_part != 0)
return;
full_val_len = le16_to_cpu(xat->val_len);
printf(" name_len %u val_len %u name %s",
xat->name_len, full_val_len,
global_printable_name(xat->name, xat->name_len));
if (!print_args.print_xattr_values) {
putchar('\n');
return;
}
avail = val_len - (int)sizeof(*xat) - xat->name_len;
if (avail < 0)
avail = 0;
show = avail < (int)full_val_len ? avail : (int)full_val_len;
printf(" value ");
for (i = 0; i < show; i++) {
u8 c = xat->name[xat->name_len + i];
putchar(isprint(c) ? c : '.');
}
if (show < (int)full_val_len)
printf("...");
putchar('\n');
}
static void print_dirent(struct scoutfs_key *key, void *val, int val_len)
@@ -195,36 +270,72 @@ static void print_inode_index(struct scoutfs_key *key, void *val, int val_len)
typedef void (*print_func_t)(struct scoutfs_key *key, void *val, int val_len);
static print_func_t find_printer(u8 zone, u8 type)
static print_func_t find_printer(u8 zone, u8 type, bool *suppress)
{
if (zone == SCOUTFS_INODE_INDEX_ZONE &&
type >= SCOUTFS_INODE_INDEX_META_SEQ_TYPE &&
type <= SCOUTFS_INODE_INDEX_DATA_SEQ_TYPE)
type <= SCOUTFS_INODE_INDEX_DATA_SEQ_TYPE) {
if (!print_args.print_inode_index)
*suppress = true;
return print_inode_index;
if (zone == SCOUTFS_ORPHAN_ZONE) {
if (type == SCOUTFS_ORPHAN_TYPE)
return print_orphan;
}
if (zone == SCOUTFS_QUOTA_ZONE)
if (zone == SCOUTFS_ORPHAN_ZONE) {
if (type == SCOUTFS_ORPHAN_TYPE) {
if (!print_args.print_orphan)
*suppress = true;
return print_orphan;
}
}
if (zone == SCOUTFS_QUOTA_ZONE) {
if (!print_args.print_quota)
*suppress = true;
return print_quota;
}
if (zone == SCOUTFS_XATTR_TOTL_ZONE)
if (zone == SCOUTFS_XATTR_TOTL_ZONE) {
if (!print_args.print_totl)
*suppress = true;
return print_xattr_totl;
}
if (zone == SCOUTFS_XATTR_INDX_ZONE)
if (zone == SCOUTFS_XATTR_INDX_ZONE) {
if (!print_args.print_indx)
*suppress = true;
return print_xattr_indx;
}
if (zone == SCOUTFS_FS_ZONE) {
switch(type) {
case SCOUTFS_INODE_TYPE: return print_inode;
case SCOUTFS_XATTR_TYPE: return print_xattr;
case SCOUTFS_DIRENT_TYPE: return print_dirent;
case SCOUTFS_READDIR_TYPE: return print_dirent;
case SCOUTFS_SYMLINK_TYPE: return print_symlink;
case SCOUTFS_LINK_BACKREF_TYPE: return print_dirent;
case SCOUTFS_DATA_EXTENT_TYPE: return print_data_extent;
case SCOUTFS_INODE_TYPE:
if (!print_args.print_inodes)
*suppress = true;
return print_inode;
case SCOUTFS_XATTR_TYPE:
if (!print_args.print_xattrs)
*suppress = true;
return print_xattr;
case SCOUTFS_DIRENT_TYPE:
if (!print_args.print_dirents)
*suppress = true;
return print_dirent;
case SCOUTFS_READDIR_TYPE:
if (!print_args.print_dirents)
*suppress = true;
return print_dirent;
case SCOUTFS_SYMLINK_TYPE:
if (!print_args.print_symlinks)
*suppress = true;
return print_symlink;
case SCOUTFS_LINK_BACKREF_TYPE:
if (!print_args.print_backrefs)
*suppress = true;
return print_dirent;
case SCOUTFS_DATA_EXTENT_TYPE:
if (!print_args.print_extents)
*suppress = true;
return print_data_extent;
}
}
@@ -244,12 +355,16 @@ static int print_fs_item(struct scoutfs_key *key, u64 seq, u8 flags, void *val,
/* only items in leaf blocks have values */
if (val != NULL && !(flags & SCOUTFS_ITEM_FLAG_DELETION)) {
printer = find_printer(key->sk_zone, key->sk_type);
if (printer)
printer(key, val, val_len);
else
bool suppress = false;
printer = find_printer(key->sk_zone, key->sk_type, &suppress);
if (printer) {
if (!suppress)
printer(key, val, val_len);
} else {
printf(" (unknown zone %u type %u)\n",
key->sk_zone, key->sk_type);
}
}
return 0;
@@ -1037,12 +1152,7 @@ static void print_super_block(struct scoutfs_super_block *super, u64 blkno)
}
}
struct print_args {
char *meta_device;
bool skip_likely_huge;
};
static int print_volume(int fd, struct print_args *args)
static int print_volume(int fd)
{
struct scoutfs_super_block *super = NULL;
struct print_recursion_args pa;
@@ -1092,7 +1202,7 @@ static int print_volume(int fd, struct print_args *args)
ret = err;
}
if (!args->skip_likely_huge) {
if (print_args.walk_allocs) {
for (i = 0; i < array_size(super->meta_alloc); i++) {
snprintf(str, sizeof(str), "meta_alloc[%u]", i);
err = print_btree(fd, super, str, &super->meta_alloc[i].root,
@@ -1119,18 +1229,21 @@ static int print_volume(int fd, struct print_args *args)
pa.super = super;
pa.fd = fd;
if (!args->skip_likely_huge) {
if (print_args.walk_srch_root) {
err = print_btree_leaf_items(fd, super, &super->srch_root.ref,
print_srch_root_files, &pa);
if (err && !ret)
ret = err;
}
err = print_btree_leaf_items(fd, super, &super->logs_root.ref,
print_log_trees_roots, &pa);
if (err && !ret)
ret = err;
if (!args->skip_likely_huge) {
if (print_args.walk_logs_root) {
err = print_btree_leaf_items(fd, super, &super->logs_root.ref,
print_log_trees_roots, &pa);
if (err && !ret)
ret = err;
}
if (print_args.walk_fs_root) {
err = print_btree(fd, super, "fs_root", &super->fs_root,
print_fs_item, NULL);
if (err && !ret)
@@ -1143,16 +1256,16 @@ out:
return ret;
}
static int do_print(struct print_args *args)
static int do_print(void)
{
int ret;
int fd;
fd = open(args->meta_device, O_RDONLY);
fd = open(print_args.meta_device, O_RDONLY);
if (fd < 0) {
ret = -errno;
fprintf(stderr, "failed to open '%s': %s (%d)\n",
args->meta_device, strerror(errno), errno);
print_args.meta_device, strerror(errno), errno);
return ret;
}
@@ -1160,30 +1273,203 @@ static int do_print(struct print_args *args)
if (ret < 0)
goto out;
ret = print_volume(fd, args);
ret = print_volume(fd);
out:
close(fd);
return ret;
};
enum {
LOGS_OPT = 0,
FS_OPT,
SRCH_OPT
};
static char *const root_tokens[] = {
[LOGS_OPT] = "logs",
[FS_OPT] = "fs",
[SRCH_OPT] = "srch",
NULL
};
enum {
INODE_OPT = 0,
XATTR_OPT,
DIRENT_OPT,
SYMLINK_OPT,
BACKREF_OPT,
EXTENT_OPT,
TOTL_OPT,
INDX_OPT,
INOINDEX_OPT,
ORPHAN_OPT,
QUOTA_OPT
};
static char *const item_tokens[] = {
[INODE_OPT] = "inode",
[XATTR_OPT] = "xattr",
[DIRENT_OPT] = "dirent",
[SYMLINK_OPT] = "symlink",
[BACKREF_OPT] = "backref",
[EXTENT_OPT] = "extent",
[TOTL_OPT] = "totl",
[INDX_OPT] = "indx",
[INOINDEX_OPT] = "inoindex",
[ORPHAN_OPT] = "orphan",
[QUOTA_OPT] = "quota",
NULL
};
static void clear_items(void)
{
print_args.print_inodes = false;
print_args.print_xattrs = false;
print_args.print_dirents = false;
print_args.print_symlinks = false;
print_args.print_backrefs = false;
print_args.print_extents = false;
print_args.print_totl = false;
print_args.print_indx = false;
print_args.print_inode_index = false;
print_args.print_orphan = false;
print_args.print_quota = false;
}
static void clear_roots(void)
{
print_args.walk_logs_root = false;
print_args.walk_fs_root = false;
print_args.walk_srch_root = false;
}
static int parse_opt(int key, char *arg, struct argp_state *state)
{
struct print_args *args = state->input;
char *subopts;
char *value;
bool parse_err = false;
switch (key) {
case 'S':
args->skip_likely_huge = true;
break;
case 'a':
args->allocs_requested = true;
args->walk_allocs = true;
break;
case 'V':
args->print_xattr_values = true;
break;
case 'i':
/* Specific items being requested- clear them all to start */
if (!args->items_requested) {
clear_items();
if (!args->allocs_requested)
args->walk_allocs = false;
args->items_requested = true;
}
subopts = arg;
while (*subopts != '\0' && !parse_err) {
switch (getsubopt(&subopts, item_tokens, &value)) {
case INODE_OPT:
args->print_inodes = true;
break;
case XATTR_OPT:
args->print_xattrs = true;
break;
case DIRENT_OPT:
args->print_dirents = true;
break;
case SYMLINK_OPT:
args->print_symlinks = true;
break;
case BACKREF_OPT:
args->print_backrefs = true;
break;
case EXTENT_OPT:
args->print_extents = true;
break;
case TOTL_OPT:
args->print_totl = true;
break;
case INDX_OPT:
args->print_indx = true;
break;
case INOINDEX_OPT:
args->print_inode_index = true;
break;
case ORPHAN_OPT:
args->print_orphan = true;
break;
case QUOTA_OPT:
args->print_quota = true;
break;
default:
argp_usage(state);
parse_err = true;
break;
}
}
break;
case 'r':
/* Specific roots being requested- clear them all to start */
if (!args->roots_requested) {
clear_roots();
if (!args->allocs_requested)
args->walk_allocs = false;
args->roots_requested = true;
}
subopts = arg;
while (*subopts != '\0' && !parse_err) {
switch (getsubopt(&subopts, root_tokens, &value)) {
case LOGS_OPT:
args->walk_logs_root = true;
break;
case FS_OPT:
args->walk_fs_root = true;
break;
case SRCH_OPT:
args->walk_srch_root = true;
break;
default:
argp_usage(state);
parse_err = true;
break;
}
}
break;
case ARGP_KEY_ARG:
if (!args->meta_device)
args->meta_device = strdup_or_error(state, arg);
else
argp_error(state, "more than one argument given");
break;
case ARGP_KEY_FINI:
if (!args->meta_device)
argp_error(state, "no metadata device argument given");
/*
* For backwards compatibility, translate -S. Should we warn if
* this conflicts with other explicit options?
*/
if (args->skip_likely_huge) {
if (!args->allocs_requested)
args->walk_allocs = false;
args->walk_fs_root = false;
args->walk_srch_root = false;
}
break;
default:
break;
}
@@ -1192,7 +1478,11 @@ static int parse_opt(int key, char *arg, struct argp_state *state)
}
static struct argp_option options[] = {
{ "skip-likely-huge", 'S', NULL, 0, "Skip large structures to minimize output size"},
{ "allocs", 'a', NULL, 0, "Print metadata and data alloc lists" },
{ "items", 'i', "ITEMS", 0, "Item(s) to print (inode, xattr, dirent, symlink, backref, extent, totl, indx, inoindex, orphan, quota)" },
{ "roots", 'r', "ROOTS", 0, "Tree root(s) to walk (logs, srch, fs)" },
{ "skip-likely-huge", 'S', NULL, 0, "Skip allocs, srch root and fs root to minimize output size" },
{ "xattr-values", 'V', NULL, 0, "Print xattr values (non-printable bytes rendered as '.')" },
{ NULL }
};
@@ -1205,17 +1495,15 @@ static struct argp argp = {
static int print_cmd(int argc, char **argv)
{
struct print_args print_args = {NULL};
int ret;
ret = argp_parse(&argp, argc, argv, 0, NULL, &print_args);
if (ret)
return ret;
return do_print(&print_args);
return do_print();
}
static void __attribute__((constructor)) print_ctor(void)
{
cmd_register_argp("print", &argp, GROUP_DEBUG, print_cmd);