Compare commits

..

186 Commits

Author SHA1 Message Date
Chao Wang
305361c5ea WIP 2024-10-28 15:50:47 -07:00
Chao Wang
eb244065dc WIP 2024-10-28 15:35:10 -07:00
Chao Wang
2de875e4d8 WIP 2024-10-28 14:34:30 -07:00
Chao Wang
f47dcba80c WIP 2024-10-28 14:21:08 -07:00
Chao Wang
1f345e350d WIP 2024-10-25 14:45:52 -07:00
Auke Kok
ae4b55a147 Add basic parallel_restore test script
This script executes the basic parallel_restore test binary which
incorporates the parallel restore library. The test binary creates
a few files with xattrs. After restoring, we mount the filesystem
and do some basic checks to see that the restore was complete.

Added just under and just over ENOSPC cases to make sure that we are
returning the final 5% of this disk that we reserver for log trees.

Signed-off-by: Auke Kok <auke.kok@versity.com>
Signed-off-by: Hunter Shaffer <hunter.shaffer@versity.com>
2024-10-17 13:35:35 -07:00
Hunter Shaffer
2be15d416d Add Quota support
Adds a function to to insert quota rules as filesystem items.
This will then have an outward facing function that takes a
writer and a mirror of the _squota_rule struct in quota.c
and is called _parallel_restore_quota_rule. Adds testing
to make sure we are restoring a test quota.

Signed-off-by: Hunter Shaffer <hunter.shaffer@versity.com>
2024-10-17 12:01:04 -07:00
Hunter Shaffer
c4147a7e8d Add Retention Flag support
Adds a check in the inode creation path that checks whether
the retention feature is present. If it is then we set the
inode's flag value otherwise it stays as 0.

Signed-off-by: Hunter Shaffer <hunter.shaffer@versity.com>
2024-10-17 12:01:04 -07:00
Hunter Shaffer
d653c78504 Add Project ID support
Project IDs add a new field 'proj' to the inode struct.
In this patch we simply check if the feature is present
before we compile and if it is we allocate the field
within the parallel_restore inode and make sure this
value set in the restored inode. If it is not present
we ignore it.

Signed-off-by: Hunter Shaffer <hunter.shaffer@versity.com>
2024-10-17 12:01:04 -07:00
Hunter Shaffer
0d910eb7ab Check if source device has been mounted
The filesystem we are restoring into needs to be empty
and never mounted. Here we check the all of the quorum blocks
timestamps to see whether the device we are restoring into has
been mounted before. Adds a test in the test script that attempts
to restore a previously mounted device.

Signed-off-by: Hunter Shaffer <hunter.shaffer@versity.com>
Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-17 12:01:04 -07:00
Hunter Shaffer
41b1d1180b Check device format before restore
When we import the superblock we first check whether
the filesystem is at least format version 2. We then
verify that the device given is a meta_dev. The
test script now also initialize the new filesystem
to format_V2. Adds coverage in the test script for the
new checks.

Signed-off-by: Hunter Shaffer <hunter.shaffer@versity.com>
2024-10-17 11:42:16 -07:00
Auke Kok
130e10626d Copy a tree using parallel restore library.
This tool compies a source tree (whether it's scoutfs or not)
into an offline scoutfs meta device. It has only those 2 parameters
and does a single-process walk of the tree to restore all items
while preservice as much of the metadata as possible.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-17 11:42:16 -07:00
Auke Kok
281cd4f87a Create a 4k offline extent for each regular file.
After this change, all files have a single offline extent:

```
$ sudo src/filefrag-gc57857a5 -b4096 -v /mnt/scratch/top-0/file-1094
Filesystem type is: 554f4353
File size of /mnt/scratch/top-0/file-1094 is 4096 (1 block of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..       0:          0..         0:      1:             last,unknown_loc,eof
/mnt/scratch/top-0/file-1094: 1 extent found
```

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-17 11:42:16 -07:00
Auke Kok
c78b5cdecc Detect child process exiting with errors.
Adds a signal handler for SIGCHLD and sets up a signal handler with
SA_SIGINFO. This way we can inspect the exit code emitted by the
child and abort processing when a child process exits with an error.

Without this handler, any child process that exits with e.g. ENOSPC
will keep the parent hanging indefinitely.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-17 11:42:16 -07:00
Auke Kok
6da7034d48 Pass meta_seq and data_seq to _restore_inode.
This allows callers to pass in seq values for generated inodes. The
tester code initializes them now before calling, instead of being
hard set in the library.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-17 11:42:16 -07:00
Auke Kok
60e14e20dc Fix offline extents not being able to be created.
While online extents (non-zero size) worked just fine with this
code, the offline extent code inserted a btree item without the
appropriate key, which results in duplicate (null) keys being
inserted, hence the "duplicate" error.

All that is needed to fix is to put the created key in the btree
item to be inserted.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-17 11:42:16 -07:00
Auke Kok
5316905d12 Fix symlink insertion.
This block never called insert_fs_item() creating dangling keys
that never got inserted. Additionally, the _sk_second member is
le64 and we have to use the proper intrinsic to increment it.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-17 11:42:16 -07:00
Zach Brown
ea41b198a4 Fix printing alloc list block extents
The list alloc blocks have an array of blknos that are offset by a start
field in the block header.  The print code wasn't using that and was
always referencing the beginning of the array, which could miss blocks.

Signed-off-by: Zach Brown <zab@versity.com>
2024-10-17 11:42:16 -07:00
Zach Brown
027a6ebce6 Import a few more functions to our list.h
Import a few more functions from the kernel's list.h into our imported
copy.

Signed-off-by: Zach Brown <zab@versity.com>
2024-10-17 11:42:16 -07:00
Zach Brown
1ac0e5bfd3 Add test for parallel restore
Signed-off-by: Zach Brown <zab@versity.com>
2024-10-17 11:42:16 -07:00
Zach Brown
f6a40de3b0 Add parallel restore
Signed-off-by: Zach Brown <zab@versity.com>
2024-10-17 11:42:16 -07:00
Zach Brown
17451841bf Add userspace NSEC_PER_SEC
Signed-off-by: Zach Brown <zab@versity.com>
2024-10-17 11:42:16 -07:00
Zach Brown
51f50529fc Add bloom filter index calc for userspace utils
Signed-off-by: Zach Brown <zab@versity.com>
2024-10-17 11:42:16 -07:00
Zach Brown
7707d98b54 Add srch_encode_entry() for userspace utils
Signed-off-by: Zach Brown <zab@versity.com>
2024-10-17 11:42:16 -07:00
Zach Brown
8c195ee4ab Add put_unaligned_leXX() for userspace
Signed-off-by: Zach Brown <zab@versity.com>
2024-10-17 11:42:16 -07:00
Zach Brown
7b5f59ca53 Add fls64() alias for userspace flsll()
Signed-off-by: Zach Brown <zab@versity.com>
2024-10-17 11:42:16 -07:00
Zach Brown
597ce6a4c0 Promote userspace btree block initialization
Signed-off-by: Zach Brown <zab@versity.com>
2024-10-17 11:42:16 -07:00
Zach Brown
afeeb47918 Add userspace version of our mode to type
Signed-off-by: Zach Brown <zab@versity.com>
2024-10-17 11:42:16 -07:00
Zach Brown
660f46a3b4 Add userspace version of our dirent name hash
Signed-off-by: Zach Brown <zab@versity.com>
2024-10-17 11:42:16 -07:00
Zach Brown
4697424c7c Add lk rbtree wrapper
Import the kernel's rbtree implementation with a wrapper so we can use
it from userspace.

Signed-off-by: Zach Brown <zab@versity.com>
2024-10-17 11:42:16 -07:00
Auke Kok
11f624926b Superblock checks for meta and data dev.
We check superblock magic, crc, flags. data device superblock is
checked but a little less thorough.  We check whether the device is
still mounted, since that would make checking invalid to begin with.
Quorum blocks are validated to have sane contents.

We add a global problem counter so we can trivially measure and
report whether any problem was found at all, instead of iterating
over all the problems and checking each individual count.

We pick the standard exit code values from `fsck` and mirror their
intentional behavior. This results in `fsck.scoutfs` can now be
trivially created by making it a wrapper around `scoutfs check`.

Signed-off-by: Auke Kok <auke.kok@versity.com>
Signed-off-by: Hunter Shaffer <hunter.shaffer@versity.com>
2024-10-17 11:42:16 -07:00
Auke Kok
173e0f1edd Add man page content for check.
Adds basic man page content for the `check` subcommand.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-17 11:42:16 -07:00
Auke Kok
ca57794a00 Generic block header checks: crc, magic.
Generally as we call block_get() we should validate that if the block
has a hdr, at a minimum the crc is correct and the magic value is
the expected value passed, and the fsid matches the superblock. This
function implements just that. Returns -EINVAL, up to the caller to
report a problem() and handle the outcome. For now the code just hard
fails, which incedentally makes it fail the clobber-repair.sh tests
I wrote.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-17 11:42:16 -07:00
Zach Brown
f5f39f4432 Add test_bit to utils bitmap
Add test_bit() to the trivial utils bitmap.c implementation.

Signed-off-by: Zach Brown <zab@versity.com>
2024-10-17 11:42:16 -07:00
Zach Brown
022e280f0b Add {read,write}-metadata-image scoutfs commands
Signed-off-by: Zach Brown <zab@versity.com>
2024-10-17 11:42:16 -07:00
Zach Brown
897f26c839 Fix partial rename to check_meta_alloc
As I was committing the initial check command I had only partially
completed a rename of the function that checks the metadata allocators.

Signed-off-by: Zach Brown <zab@versity.com>
2024-10-17 11:42:16 -07:00
Zach Brown
25d5b507a1 Add check command
Signed-off-by: Zach Brown <zab@versity.com>
Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-17 11:41:42 -07:00
Zach Brown
1d08a58add Merge pull request #151 from versity/auke/el9
EL9 support.
2024-10-04 11:46:47 -07:00
Auke Kok
fc7876e844 Allow certain tests to skip, but not fail exit condition.
Previously, any t_skip would cause the final test result to be a failure
because up until now no test should have been skipped.

However, with format-version-forward-back not being compatible with el9,
we are going to rely on el7/8 testing for that test soleley, and
therefore we have to allow skipping of this test on el9 and newer OS
versions.

We add `t_skip_permitted` to signal this from the test case to the
run-tests.sh script. A new exit code is passed, and all accounting is
updated to reflect that a test was skipped, but this was permitted. We
modify format-version-forward-back to use this new exit path.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 15:38:34 -07:00
Auke Kok
5337b9e221 Ingore Process accounting resumed dmesg.
I'm seeing more and more of these as audit is enabled in el8 and el9
images I am using for testing, and during ENOSPC tests this has a chance
of triggering process accounting suspension, and subsequent resume.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 15:38:34 -07:00
Auke Kok
8a22bdd366 Ignore device mapper size change dmesg output.
In v1.18-10-g5507ee5, we changed the test code away from loopback
to device-mapper, which simplified our DUT setup code.

However, this results in the occasional `device changed size` messages
now being emitted by the `dm` driver instead of the `loop` kernel
module. We have to additionally ignore these kernel messages from now as
well.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 15:38:34 -07:00
Auke Kok
235ab133a7 We must provide a_ops->dirty_folio and invalidate_folio.
In v5.17-rc4-53-g3a3bae50af5d, we can no longer omit having this
method unhooked as the mm caller blindly calls it now. In-kernel
filesystems all were fixed in this change.

aops->invalidatepage was the old aops method that would free pages
with private attached data. This method is replaced with the
new invalidate_folio method. If this method is NULL, the memory
will become orphaned. (v5.17-rc4-29-gf50015a596fa)

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 15:38:34 -07:00
Auke Kok
9335d2eb86 Don't --track when checking out a tag.
I've pushed a tag/release to scoutfs-xfstests-dev instead of a full
blown branch. This seems simpler and cleaner than using branches,
because we're going to end up rebasing these things a lot. However, we
can't --track tags, so, if the branch name passed to -x is actually a
tag instead of a branch, we have to omit the --track option here.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 15:38:34 -07:00
Auke Kok
97b081de3f Switch xfstests tag over in CI jobs using this marker file.
CI testing needs to know which xfstests branch to use on all OSs.

We can't just use the el9 xfstests branch on el9 only, because we
need to run the same el9 xfstests on el8 and el7 as well, otherwise
testing will just fail.

So, we put a marker file in our git repo that tells us that we're
not going to use the default `scoutfs` branch from scoutfs-xfstests-dev
but our own special tag or branch. The CI job then should pass the
proper -x {branch} flag to the run-tests.sh script.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 15:38:34 -07:00
Auke Kok
21b5032365 Add new xfstests that we won't support or don't pass
The new version of xfstests adds a _lot_ more tests to our mix. Many
of the new ones will auto enable or auto skip as needed.

There are tests we can't or won't support that will be in future
xfstests. Disable them now so we can avoid dealing with them later.

Quite a few fall into "we don't support these types of mounting yet",
mostly bind-mount or dm-mapper things. We disable all the swapfile
tests flatout.

A few tests fail on el7 but not el8/9 but we don't have a way to run
them without failing yet, so disable them as well.

Update golden with the proper new array of tests. This all requires
the `auke/scoutfs-el9` branch in `versity/scoutfs-xfstests-dev`.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 15:38:34 -07:00
Auke Kok
4723f4f9ab Disable format-version-forward-back test on el9+.
Using t_skip, we just skip this test on el9.

If we ever want to add a formatversion 2->3 test, perhaps we should
just add a separate test script, instead of going over a static array.

But let's not worry about this too much right now.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 15:38:34 -07:00
Auke Kok
0a8b3f4e94 Fix basic-posix-acl test output on el9
It turns out that on el9, `bash -c` prints out `bash: line 1: cd..`
instead of `line 0:` on el7 or el8. So discard all the stderr from
these `cd` lines entirely and just rely on the expected echo
output to stdout.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 15:38:34 -07:00
Auke Kok
8a4b0967cb Add fiemap output through scoutfs util.
There's filefrag already, and that works, but, it's output is very
inconsistent between various OS release versions, and it has already
meant that we'd needed to adjust tests to account for these little
but insignificant changes. A lot more work than useful. It's even
more changed in el9.

This adds `scoutfs get-fiemap FILE` and prints out block extent
info with flags that we care about as an abbreviated letter: U for
Unwritten, L for Last, and O for Unknown (as in, "offline").

The -P/--physical and -L/--logical options turn off logical or physical
offset display, in case you only want to see the offsets in either
units. You can pass -b/--byte to display offsets and lengths in
byte values. The block size will then be obtained from fstat() of
the queried file (4096 for scoutfs).

I've removed all uses of filefrag from our scoutfs tests. Xfstests
still calls it but their internal diff takes care of that issue.

Where needed and appropriate, the tests are adjusted so that the output
of `scoutfs get-fiemap` is as close as it can to what it used to be,
so that reading the test results allows the quick view of what might
have been going wrong.

There are some output strings I have not bothered to update because
there's no real value to updating every output string to match,
and we just adjust the golden file accordingly.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 15:38:34 -07:00
Auke Kok
606c519e96 Simple-staging doesn't actually test overflow.
This isn't a simple case where we can use u64_region_wraps because
length is s32.

Let's actually test an overflow case instead of a case that doesn't
overflow, though. We still should properly add an overflow test here as
well.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
7d0e7e29f8 Avoid integer wrapping pitfalls for (off, len) pairs.
We use check_add_overflow(a, b, d) here to validate that (off, len)
pairs do not exceed the max value type. The kernel conveniently has
several macros to sort out the problems with signed or unsigned types.

However, we're not interested in purely seeing whether (a + b)
overflows, because we're using this for (off, len) overflow checks,
where the bytes we read are from 0 to len -1. We must therefore call
this check with (b) being "len - 1".

I've made sure that we don't accidentally fail when (len == 0)
in all cases by making sure we've already checked this condition
before, and moving code around as needed to ensure that (len > 0)
in all cases where we check.

The macro check_add_overflow requires a (d) argument in which
temporarily the result of the addition is stored and then checked to see
if an overflow occurred. We put a `tmp` variable on the stack of the
correct type as needed to make the checks function.

simple-release-extents test mistakenly relied on this buggy wrap code,
so it needs fixing. The move-blocks test also got it wrong.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
69de6d7a74 Check for zero len in scoutfs_data_wait_check
We consistently enter scoutfs_data_wait_check when len == 0 from
scoutfs_aio_write() which directly passes the i_size_read() value,
and for cases where we `echo >> $FILE` this is always reached.

This can cause the wrapping check to fail since `0 + (0 - 1) < 0` which
triggers the WARN_ON_ONCE wrap check that needs updating to allow
certain operations on huge files.

More importantly we can just omit all these checks if `len == 0` anyway,
since they should always succeed and never should require taking all the
locks.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
ac00f5cedb Free after getline(), even if fail, and catch eof() on el9
getline() allocates the space for the return value even if there is an
error, so when it returns an error, we still have to free() it.

In el9, when reading stdin we will get errno=0 returned (no error) when
we hit the end of stdin. This behavior is different from el7/8. We don't
want to throw an error here to avoid failing the test, since it doesn't.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
6d42d260cf xargs option conflict now a warning in el9
The warnings thrown by el9's version of xargs are unexpected output and
cause this test to fail. When using the -I option (replace) the -n 1
arguments are always assumed. In el7/8 no warnings were printed.

We can just remove `-n 1` since the argument is never needed.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
00ebe92186 Add stddef.h to util.h to avoid duplicate offsetof() def.
In el9 releases, our includes declare offsetof() before our header
chain includes stddef.h, which doesn't properly check if offsetof
is already defined, leading to a redefinition. Just include stddef
at all times here.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
570c05898c Correct endian conversion length (blkno is le64)
Trivial correction of wrong bitlength conversion.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
c298360a49 blkdev api changes - pass holder to replace FMODE_EXCL
Passing a holder ptr to these functions now replaces the FMODE_EXCL
flag. _put no longer needs flags for this reason, but the holder
instead.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
95f4e56546 Introduce blk_mode_t instead of abuse of fmode_t
v6.4-rc2-198-g05bdb9965305 adds a new type for passing flags instead
of abusing fmode_t flags. They are essentially the same flags just
in a new type.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
d5c2768f04 .tmpfile method now passed a struct file, which must be opened.
v6.0-rc6-9-g863f144f12ad changes the VFS method to pass in a struct
file and not a dentry in preperation for tmpfile support in fuse.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
676d429264 Assume el9 is the same as el8 for rpmbuild purposes.
The current spec template can't handle future major el releases
gracefully and fails to build entirely. We isolate all changes
so that they are either "el7 specific" or generic. This rids us
entirely of el8 specific conditionals.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
5b260e6b54 block_write_begin() no longer is being passed aop_flags.
The flag is now obsolete, we don't need to set flags here or
pass them.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
e2b06f2c92 mpage_readpage() is now replaced with mpage_read_folio.
Folios are the new data types used for passing pages. For now,
folios only appear to have a single page. Future kernels will
change that.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
546b437df7 Shrinkers are now registered with a name.
v5.19-rc4-52-ge33c267ab70d Adds shrinker names to the registration
call to aid with shrinker debugging, which is highly opaque.

To enable you'll have to recompile the kernel with
CONFIG_SHRINKER_DEBUG=y though, since it's disabled by default in
OSV kernels.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
381f4543b7 Use iter based read/write to support splice and thus sendfile().
The iter based read/write calls can support splice in el9 if we
hook up these calls, otherwise splice will stop working.

->write() similar to: v3.15-rc4-330-g8d0207652cbe. ->read() to
generic implementation.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
418a441604 kernel_setsockopt no longer available.
We instead opt to use sock_setsockopt which is generally exactly the
same and can be easily converted to map to kernel_setsockopt without
impacting the code significantly.

There are 3 methods we're calling with usec timeval's, and that is
significantly different now that this requires a bit more compat code
so we split these out to separate compat functions to handle them.

Some of the TCP sock functions also have a slightly different signature
that we want to split them out (struct socket vs. sock). Some further
no longer return success, either.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
f3abf9710b generic_perform_write signature changed
It now only needs the iocb and no longer the flip.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
8a45c2baff Deprecate struct timeval.
We switch to using 64bit usec structs and recommended replacement
functions from Documentation/core-api/timekeeping.rst.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
345ebd0876 fiemap_prep replaces fiemap_check_flags.
v5.7-rc4-53-gcddf8a2c4a82

The prep helper replaces the sanity checks.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
b718cf09de Handle idmapped mounts in xattr_handler
In v5.11-rc4-8-ge65ce2a50cf6 the *set handler is passed a
user_namespace struct pointing to the map from the mount.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
e4721366ff Added user_ns argument to posix_acl_update_mode, set_posix_acl
v5.11-rc4-8-ge65ce2a50cf6 adds idmap support to these calls.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
4ef64c6fcf Vfs methods become user namespace mount aware.
v5.11-rc4-24-g549c7297717c

All of these VFS methods are now passed a user_namespace.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
2d58ee2a37 Account for new bio_alloc() args.
Block device and opf are now passed through and set. We mimic compat
code to do the same.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
1f0dd7f025 __vmalloc defaults to PAGE_KERNEL everywhere, so the arg was removed.
v5.7-523-g88dca4ca5a93

__vmalloc no longer has the 3rd argument.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
077468ac1e debugfs_create_atomic_t now returns void, don't check result
Greg KH tells us to do just this in v5.4-rc5-31-g9927c6fa3e1d:

	No one checks the return value of debugfs_create_atomic_t(),
	as it's not needed, so make the return value void, so that no
	one tries to do so in the future.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
c951713ab2 list_cmp_func_t introduced, using const.
v5.12-rc6-9-g4f0f586bf0c8

All list_sort functions use the list_cmp_func_t type, which compares
list_head member types. These are now required to be `const` as the
compiler will now check them. This propagates into our callers.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
ad82a5e52a Squelch warning from bpf_iter.c.
v5.7-rc2-1174-gfd4f12bc38c3 significantly rewrites the bpf iterator
which hits this _next() function. It also adds a check that verifies
that the *pos is incremented after every call, even if it goes beyond
the last member (in which case it's not used).

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
d3c5328909 setattr_prepare no longer extern in fs.h
v5.11-rc4-7-g2f221d6f7b88 Changes setattr_prepare from an extern
to plain int. There's no impact further to the compat to keep it
working except for the detection regex.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
c30172210f Use blk_opf_t to pass bio op flags
Compat is back to unsigned int.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
19af6e28fb "unaligned/access_ok.h" is not needed, and removed.
Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
8885486bc8 Add several low level includes.
Newer kernels include less header dependencies by default, so we have
to add these.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
0204e092e4 FIELD_SIZEOF was deprecated.
We could use sizeof_field as a direct replacement (which is the same)
except that this entire thing can directly use offsetofend().

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Zach Brown
3b816cfd01 Merge pull request #183 from versity/auke/setattr_more_data_version_zero
Don't pass data version to attr_x unless the ioctl means to set it.
2024-10-03 12:25:46 -07:00
Auke Kok
b45fbe0bbb Don't pass data version to attr_x unless the ioctl means to set it.
The wrapper in setattr_more that translates the operations to attr_x
needs to decide whether to ask attr_x to perform a change to any of
the fields passed to it or not. For the date and size fields this
is implicit - we always tell attr_x to change them. For any of the
other fields, it should be explicit.

The only field that is in the struct that this applies to is
data_version. Because the data version field by default is zero,
we use that as condition to decide whether to pass the data_version
down to attr_x.

Previously, the code would always pass a data_version=0 down to attr_x,
triggering one of the validity checks, making it return -EINVAL. We
add a simple test case to test for this issue.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-09-27 19:31:22 -04:00
Zach Brown
c5d9b93b96 Merge pull request #187 from versity/auke/sparsefilt
Sparse fix for epel 0.6.4 sparse - redefines
2024-09-27 14:13:28 -07:00
Zach Brown
2984f4d3a8 Merge pull request #189 from versity/greg/el9-spec-fix
Use path-inspecific weak-modules (EL9 fix)
2024-09-27 14:05:40 -07:00
Auke Kok
3b8d2eab8e Sparse fix for epel 0.6.4 sparse - redefines
We should rely on sparse from epel to do automated sparse checking and
not a git tag. But the 0.6.4 build currently fails on sparse/gcc
redefines.

This magic Awk from Zach script processes sparse and gcc internal defines
and leaves the one intact that sparse doesn't have.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-09-27 15:37:47 -04:00
Greg Cymbalski
4dde57dc27 Rely on $PATH for weak-modules
This avoids having to deal with EL-specific path differences for
the weak-modules script.
2024-09-27 12:30:25 -07:00
Zach Brown
a4be74f4b1 Merge pull request #182 from versity/auke/write_test_name_to_kmsg
Write to kmsg which test we're executing.
2024-09-27 09:09:36 -07:00
Zach Brown
b66e52f3f8 Merge pull request #186 from versity/auke/add_quota_wkic_shrinker_counters
Add shrinker counters for wkic and quota_info.
2024-09-19 09:59:44 -07:00
Auke Kok
fb93d82b1e Add shrinker counters for wkic and quota_info.
These new shrinkers were recently added. Because there's very little
ways to debug them, or even see them properly function, we should at
least add counters for them.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-09-18 13:40:54 -04:00
Auke Kok
9d8ac2c7d7 Write to kmsg which test we're executing.
This is done by xfstests and it's so much easier to follow what is going
on from logs or e.g. serial console that I thought I should do this for
scoutfs tests as well. It makes it so much easier to discern which test
may have been cause for issues when running a bunch of tests and you're
looking back at logs later.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-08-28 14:36:55 -07:00
Zach Brown
49acbb4415 Merge pull request #181 from versity/auke/basic-posix-acl
POSIX ACL fixes for el8, plus tests
2024-08-26 14:19:45 -07:00
Auke Kok
7b039a1d18 Add basic POSIX ACL tests.
These are extremely limited and very quick basic ACL tests we can
trivially do in under a second - purely basic funtionality tests only.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-08-12 15:07:43 -04:00
Auke Kok
ccd65b9a61 Fix POSIX ACL use in el8+.
In 29160b0b I mistakenly disabled all caching of ACLs for el8
instead of only disabling cache lookups. The correct change
should have been to disable cache lookups only, and leave setting the
acl cache after storing or fetching, as the kernel needs this data
to resolve acls when doing permission checks.

Restore the acl cache insertions fixes.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-08-09 17:57:23 -04:00
Zach Brown
aeb1dbc5f5 Merge pull request #180 from versity/zab/cleanup_large_test_tmp_output
Clean up large test files
2024-07-23 10:44:07 -07:00
Zach Brown
e20d3ae1e8 Clean up large test files
The test harness provides a TMP directory for tests to use.  It's badly
named.  It's meant to be more of a scratch directory that is not on the
FS being tested.

Tests use it both for small log files that give insight into the
platform and for large generated files that are not worth saving.  We
want to save the directory after test runs to get at the log files, but
we don't want to burn a ton of space also saving large generated files

This updates the handful of tests to remove their handful of files that
are large enough to be a problem.  With these out of the way we can save
the tmp/ directory without its space consumption getting out of hand.

Signed-off-by: Zach Brown <zab@versity.com>
2024-07-22 14:08:32 -07:00
Zach Brown
3228749957 Merge pull request #137 from versity/auke/client-unmount-test-debug-data
Fix the debug output of client-unmount-recovery
2024-07-22 14:03:15 -07:00
Auke Kok
db445ce517 Fix the debug output of client-unmount-recovery
The script really wants to print rid instead of pid. But in case
of failure, we can just dump the arrays as well.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-07-22 14:02:53 -04:00
Zach Brown
bb5d98730b Merge pull request #167 from versity/auke/test-zero-time-createmany-parallel-mounts
Avoid math issues on fast test machines.
2024-07-22 10:12:09 -07:00
Zach Brown
cb0838a0ef Merge pull request #179 from versity/auke/extra_version_device_checks
Extra device checks when changing formatversion.
2024-07-12 13:24:42 -07:00
Auke Kok
7eaed848ed Increase time measurement accuracy beyond whole seconds.
We can rely on `bc` and `date` to record, manipulate and compare
time data with nanosecond precision. This fixes timing issues on
faster systems where this test completes a single pass of createmany in
under 1.0 second, causing the math to always fail.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-07-12 15:28:28 -04:00
Auke Kok
267c1cc2d5 Check meta flags bit set/unset for devices.
This extra check assures the passed meta device and data device
are indeed what they should be, and prevents against unwanted
swapping or repeated duplicate device arguments.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-07-12 15:22:45 -04:00
Zach Brown
c6b92329b3 Merge pull request #178 from versity/zab/v1.21
v1.21 Release
2024-07-01 14:31:27 -07:00
Zach Brown
91e7f051cf v1.21 Release
Finish the release notes for the 1.21 release.

Signed-off-by: Zach Brown <zab@versity.com>
2024-07-01 13:49:35 -07:00
Zach Brown
7645f04363 Merge pull request #177 from versity/zab/retention_quota_project_indx
Zab/retention quota project indx
2024-06-28 17:21:15 -07:00
Zach Brown
8c06302984 Let run-tests specify mkfs format version
Add a run-tests -V option that passes through the -V option to mkfs so
that runs can specify the format version that the primary volume will
have.  This doesn't affect the scratch file system versions.

Signed-off-by: Zach Brown <zab@versity.com>
2024-06-28 15:09:05 -07:00
Zach Brown
1bc83e9e2d Add indx xattr tag support to utils
Signed-off-by: Zach Brown <zab@versity.com>
2024-06-28 15:09:05 -07:00
Zach Brown
38c6d66ffc Add indx xattr tag support
Add support for the indx xattr tag which lets xattrs determine the sort
order of by their inode number in a global index.

Signed-off-by: Zach Brown <zab@versity.com>
2024-06-28 15:09:05 -07:00
Zach Brown
6a17dc335f Add quota tests
Signed-off-by: Zach Brown <zab@versity.com>
2024-06-28 15:09:05 -07:00
Zach Brown
e0bb6ca481 Add quota support to utils
Add scoutfs cli commands for managing quotas and add its persistent
structures to the print command.

Signed-off-by: Zach Brown <zab@versity.com>
2024-06-28 15:09:05 -07:00
Zach Brown
38e6f11ee4 Add quota support
Signed-off-by: Zach Brown <zab@versity.com>
2024-06-28 15:09:05 -07:00
Zach Brown
442980f1c9 Add project ID tests
Signed-off-by: Zach Brown <zab@versity.com>
2024-06-28 15:09:05 -07:00
Zach Brown
82c2d0b1d0 Add o_tmpfile_linkat test binary
Add a test binary that uses o_tmpfile and linkat to create a file in a
given dir.  We have something similar, but it's weirdly specific to a
given test.  This is a simpler building block that could be used by more
tests.

Signed-off-by: Zach Brown <zab@versity.com>
2024-06-28 15:09:05 -07:00
Zach Brown
4a8240748e Add project ID support
Add support for project IDs.  They're managed through the _attr_x
interfaces and are inherited from the parent directory during creation.

Signed-off-by: Zach Brown <zab@versity.com>
2024-06-28 15:09:05 -07:00
Zach Brown
60ca950f42 Drop caches in totl test
Now that the _READ_XATTR_TOTALS ioctl uses the weak item cache we have
to drop caches before each attempt to read the xattrs that we just wrote
and synced.

Signed-off-by: Zach Brown <zab@versity.com>
2024-06-28 15:09:05 -07:00
Zach Brown
9c45e8b7ef read_xattr_totls ioctl uses weak item cache
Change the read_xattr_totls ioctl to use the weak item cache instead of
manually reading and merging the fs items for the xattr totals on every
call.

Signed-off-by: Zach Brown <zab@versity.com>
2024-06-28 15:09:05 -07:00
Zach Brown
ee9e8c3e1a Extract .totl. item merging into own functions
The _READ_XATTR_TOTALS ioctl had manual code for merging the .totl.
total and value while reading fs items.  We're going to want to do this
in another reader so let's put these in their own funcions that clearly
isolate the logic of merging the fs items into a coherent result.

We can get rid of some of the totl_read_ counters that tracked which
items we were merging.  They weren't adding much value and conflated the
reading ioctl interface with the merging logic.

Signed-off-by: Zach Brown <zab@versity.com>
2024-06-28 15:09:05 -07:00
Zach Brown
5f156b7a36 Add scoutfs_forest_read_items_roots
Add a forest item reading interface that lets the caller specify the net
roots instead of always getting them from a network request.

Signed-off-by: Zach Brown <zab@versity.com>
2024-06-28 15:09:05 -07:00
Zach Brown
3a51ca369b Add the weak item cache
Add the weak item cache that is used for reads that can handle results
being a little behind.  This gives us a lot more freedom to implement
the cache that biases concurrent reads.

Signed-off-by: Zach Brown <zab@versity.com>
2024-06-28 15:09:05 -07:00
Bryant G. Duffy-Ly
460f3ce503 Add unit tests for retention
Signed-off-by: Bryant G. Duffy-Ly <bduffyly@versity.com>
[zab@versity.com: refactored for retention, added test cases]
Signed-off-by: Zach Brown <zab@versity.com>
2024-06-28 15:09:05 -07:00
Zach Brown
fb5331a1d9 Add inode retention bit
Add a bit to the private scoutfs inode flags which indicates that the
inode is in retention mode.  The bit is visible through the _attr_x
interface.  It can only be set on regular files and when set it prevents
modification to all but non-user xattrs.  It can be cleared by root.

Signed-off-by: Zach Brown <zab@versity.com>
2024-06-28 15:09:05 -07:00
Zach Brown
5a53e7144d Add format-version back/forward compat test
Signed-off-by: Zach Brown <zab@versity.com>
Signed-off-by: Bryant G. Duffy-Ly <bduffyly@versity.com>
2024-06-28 14:53:49 -07:00
Zach Brown
a23877b150 Add fs test functions for mounted paths
We have some fs functions which return info based on the test mount nr
as the test has setup. This refactors those a bit to also provide
some of the info when the caller has a path in a given mount. This will
let tests work with scratch mounts a little more easily.

Signed-off-by: Zach Brown <zab@versity.com>
Signed-off-by: Bryant G. Duffy-Ly <bduffyly@versity.com>
2024-06-28 14:53:49 -07:00
Zach Brown
5ccdf3c9f0 Add T_MODULE for tests
Signed-off-by: Zach Brown <zab@versity.com>
2024-06-28 14:53:49 -07:00
Zach Brown
270726a6ea Implement stat_more and setattr_more with attr_x
Now that we have the attr_x calls we can implement stat_more with
get_attr_x and setattr_more with set_attr_x.

The conversion of stat_more fixes a surprising consistency bug.
stat_more wasn't acquiring a cluster lock for the inode nore refreshing
it so it could have returned stale data if modifications were made in
another mount.

Signed-off-by: Zach Brown <zab@versity.com>
2024-06-28 14:53:49 -07:00
Zach Brown
de304628ea Add attr_x commands and documentation to utils
Signed-off-by: Zach Brown <zab@versity.com>
2024-06-28 14:53:49 -07:00
Zach Brown
6a99ca9ede Add attr_x core and ioctls
The existing stat_more and setattr_more interfaces aren't extensible.
This solves that problem by adding attribute interfaces which specify
the specific fields to work with.

We're about to add a few more inode fields and it makes sense to add
them to this extensible structure rather than adding more ioctls or
relatively clumsy xattrs.  This is modeled loosely on the upstream
kernel's statx support.

The ioctl entry points call core functions so that we can also implement
the existing stat_more and setattr_more interfaces in terms of these new
attr_x functions.

Signed-off-by: Zach Brown <zab@versity.com>
2024-06-28 14:53:49 -07:00
Zach Brown
0521bd0e6b Make offline extent creation use one transaction
Initially setattr_more followed the general pattern where extent
manipulation might require multiple transactions if there are lots of
extent items to work with.   The scoutfs_data_init_offline_extent()
function that creates an offline extent handled transactions itself.

But in this case the call only supports adding a single offline extent.
It will always use a small fixed amount of metadata and could be
combined with other metadata changes in one atomic transaction.

This changes scoutfs_data_init_offline_extent() to have the caller
handle transactions, inode updates, etc.  This lets the caller perform
all the restore changes in one transaction.  This interface change will
then be used as we add another caller that adds a single offline extent
in the same way.

Signed-off-by: Zach Brown <zab@versity.com>
2024-06-28 14:53:49 -07:00
Zach Brown
361491846d Add scoutfs_fmt_vers_unsupported()
Add a little inline helper to test whether the mounted format version
supports a feature or not, returning an errno that callers can use when
they can return a shared expected error.

Signed-off-by: Zach Brown <zab@versity.com>
2024-06-28 14:53:49 -07:00
Bryant G. Duffy-Ly
9ba4271c26 Add new max format version of 2
We're about to add new format structures so increment the max version to
2.  Future commits will add the features before we release version 2 in
the wild.

Signed-off-by: Zach Brown <zab@zabbo.net>
2024-06-28 14:53:49 -07:00
Bryant G. Duffy-Ly
90cfaf17d1 Initial support for different inode sizes
We're about to increase the inode size and increment the format version.
Inode reading and writing has to handle different valid inode sizes as
allowed by the format version.   This is the initial skeletal work that
later patches which really increase the inode size will further refine
to add the specific known sizes and format versions.

Signed-off-by: Bryant G. Duffy-Ly <bduffyly@versity.com>
[zab@versity.com: reworded description, reworked to use _within]
Signed-off-by: Zach Brown <zab@versity.com>
2024-06-28 14:53:49 -07:00
Zach Brown
6931cb7b0e Add scoutfs_inode_[gs]et_flags
Add functions for getting and setting our private inode flags.

Signed-off-by: Zach Brown <zab@versity.com>
2024-06-28 14:53:49 -07:00
Zach Brown
7d4db05445 Add scoutfs_item_lookup_smaller_zero
Add a lookup variant that returns an error if the item value is larger
than the caller's value buffer size and which zeros the rest of the
caller's buffer if the returned value is smaller.

Signed-off-by: Zach Brown <zab@versity.com>
2024-06-28 14:53:49 -07:00
Zach Brown
7b71250072 Merge pull request #176 from versity/zab/accumulated_fixes
Zab/accumulated fixes
2024-06-26 13:21:50 -07:00
Zach Brown
8e37be279c Use seqlock to protect inode fields
We were using a seqcount to protect high frequency reads and writes to
some of our private inode fields.  The writers were serialized by the
caller but that's a bit too easy to get wrong.  We're already storing
the write seqcount update so the additional internal spinlock stores in
seqlocks isn't a significant additional overhead.  The seqlocks also
handle preemption for us.

Signed-off-by: Zach Brown <zab@versity.com>
2024-06-25 15:11:20 -07:00
Zach Brown
d6642da44d Prevent downgrade of format version
Don't let change-format-version decrease the format version.  It doesn't
have the machinery to go back and migrate newer structures to older
structures that would be compatible with code expecting the older
version.

Signed-off-by: Bryant G. Duffy-Ly <bduffyly@versity.com>
[zab@versity.com: split from initial patch with other changes]
Signed-off-by: Zach Brown <zab@versity.com>
2024-06-25 15:11:20 -07:00
Zach Brown
4b87045447 Pre-declare scoutfs_lock in forest.h
Definitions in forest.h use lock pointers.  Pre-declare the struct so it
doesn't break inclusion without lock.h, following current practice in
the header.

Signed-off-by: Zach Brown <zab@versity.com>
2024-06-25 15:11:20 -07:00
Zach Brown
3f773a8594 Fix uninit written in scoutfs_file_write_iter
scoutfs_file_write_iter tried to track written bytes and return those
unless there was an error.  But written was uninitialized if we got
errors in any of the calls leading up to performing the write.  The
bytes written were also not being passed to the generic_write_sync
helper.  This fixes up all those inconsistencies and makes it look like
the write_iter path in other filesystems.

Signed-off-by: Zach Brown <zab@versity.com>
2024-06-25 15:11:20 -07:00
Zach Brown
c385eea9a1 Check for all offline in scoutfs_file_write_iter
When we write to file contents we change the data_version.  To stage old
contents into an offline region the data_version of the file must match
the archived copy.  When writing we have to make sure that there is no
offline data so that we don't increase the data_version which will
prevent staging of any other file regions because the data_versions no
longer match.

scoutfs_file_write_iter was only checking for offline data in its write
region, not the entire file.  Fix it to match the _aio_write method and
check the whole file.

Signed-off-by: Zach Brown <zab@versity.com>
2024-06-25 15:11:20 -07:00
Zach Brown
c296bc1959 Remove scoutfs_data_wait_check_iter
scoutfs_data_wait_check_iter() was checking the contiguous region of the
file starting at its pos and extending for iter_iov_count() bytes.  The
caller can do that with the previous _data_wait_check() method by
providing the same count that _check_iter() was using.

Signed-off-by: Zach Brown <zab@versity.com>
2024-06-25 15:11:20 -07:00
Zach Brown
3052feac29 Have item cache show unprotected lock
The item cache has a bit of safety checks that make sure that an
operation is performed while holding a lock that covers the item.  It
dumped a stack trace via WARN when that wasn't true, but it didn't
include any details about the keys or lock modes involved.

This adds a message that's printed once which includes the keys and
modes when an operation is attempted that isn't protected.

Signed-off-by: Zach Brown <zab@versity.com>
2024-06-25 15:11:20 -07:00
Zach Brown
1fa0d7727c scoutfs_item_create checks wrong lock mode
scoutfs_item_create() was checking that its lock had a read mode, when
it should have been checking for a write mode.  This worked out because
callers with write mode locks are also protecting reads.

Signed-off-by: Zach Brown <zab@versity.com>
2024-06-25 15:11:20 -07:00
Zach Brown
2af6f47c8b Fix bad error exit path in unlink
Unlink looks up the entry items for the name it is removing because we
no longer store the extra key material in dentries.  If this lookup
fails it will use an error path which release a transaction which wasn't
held.  Thankfully this error path is unlikely (corruption or systemic
errors like eio or enomem) so we haven't hit this in practice.

Signed-off-by: Zach Brown <zab@versity.com>
2024-06-25 15:11:20 -07:00
Zach Brown
6db69b7a4f Set root inode crtime in mkfs
When we added the crtime creation timestamp to the inode we forgot to
update mkfs to set the crtime of the root inode.

Signed-off-by: Zach Brown <zab@versity.com>
2024-06-25 15:11:18 -07:00
Zach Brown
8ca1f1994d Merge pull request #174 from versity/zab/trace_block_estale
Add tracepoint as block read returns ESTALE
2024-06-11 09:42:45 -07:00
Zach Brown
48716461e4 Add tracepoint as block read returns ESTALE
Block reads can return ESTALE naturally as mounts read through old
cached blocks.  We won't always log it as an error but we should add a
tracepoint that can be inspected.

Signed-off-by: Zach Brown <zab@versity.com>
2024-06-10 11:03:38 -07:00
Zach Brown
965b692bdc Merge pull request #171 from versity/zab/v1.20
v1.20 Release
2024-04-22 14:51:39 -07:00
Zach Brown
c3c4b08038 v1.20 Release
Finish the release notes for the 1.20 release.

Signed-off-by: Zach Brown <zab@versity.com>
2024-04-22 13:20:42 -07:00
Zach Brown
0519830229 Merge pull request #165 from versity/greg/kmod-uninstall-cleanup
More cleanly drive weak-modules on install/uninstall
2024-04-11 14:32:06 -07:00
Greg Cymbalski
4d6e1a14ae More safely install/uninstall with weak-modules
This addresses some minor issues with how we handle driving the
weak-modules infrastructure for handling running on kernels not
explicitly built for.

For one, we now drive weak-modules at install-time more explicitly (it
was adding symlinks for all modules into the right place for the running
kernel, whereas now it only handles that for scoutfs against all
installed kernels).

Also we no longer leave stale modules on the filesystem after an
uninstall/upgrade, similar to what's done for vsm's kmods right now.
RPM's pre/postinstall scriptlets are used to drive weak-modules to clean
things up.

Note that this (intentionally) does not (re)generate initrds of any
kind.

Finally, this was tested on both the native kernel version and on
updates that would need the migrated modules. As a result, installs are
a little quicker, the module still gets migrated successfully, and
uninstalls correctly remove (only) the packaged module.
2024-04-11 13:20:50 -07:00
Greg Cymbalski
fc3e061ea8 Merge pull request #164 from versity/greg/preserve-git-describe
Encode git info into spec to keep git info in final kmod
2024-03-29 13:48:33 -07:00
Greg Cymbalski
a4bc3fb27d Capture git info at spec creation time, pass into make 2024-02-05 15:44:10 -08:00
Zach Brown
67990a7007 Merge pull request #162 from versity/zab/v1.19
v1.19 Release
2024-01-30 15:46:49 -08:00
Zach Brown
ba819be8f9 v1.19 Release
Finish the release notes for the 1.19 release.

Signed-off-by: Zach Brown <zab@versity.com>
2024-01-30 12:11:23 -08:00
Zach Brown
1b103184ca Merge pull request #161 from versity/zab/merge_timeout_option_fix
Correctly set the log_merge_wait_timeout_ms option
2024-01-30 12:07:10 -08:00
Zach Brown
c3890abd7b Correctly set the log_merge_wait_timeout_ms option
The initial code for setting the timeout used the wrong parsed variable.

Signed-off-by: Zach Brown <zab@versity.com>
2024-01-30 12:01:35 -08:00
Zach Brown
5ab38bfa48 Merge pull request #160 from versity/zab/log_merging_speedups
Zab/log merging speedups
2024-01-29 12:26:55 -08:00
Zach Brown
e9ad61b444 Delete multiple log trees items per server commit
server_log_merge_free_work() is responsible for freeing all the input
log trees for a log merge operation that has finished.  It looks for the
next item to free, frees the log btree it references, and then deletes
the item.  It was doing this with a full server commit for each item
which can take an agonizingly long time.

This changes it perform multiple deletions in a commit as long as
there's plenty of alloc space.  The moment the commit gets low it
applies the commit and opens a new one.  This sped up the deletion of a
few hundred thousand log tree items from taking hours to seconds.

Signed-off-by: Zach Brown <zab@versity.com>
2024-01-25 11:30:17 -08:00
Zach Brown
91bbf90f71 Don't pin input btrees when merging
The btree_merge code was pinning leaf blocks for all input btrees as it
iterated over them.  This doesn't work when there are a very large
number of input btrees.  It can run out of memory trying to hold a
reference to a 64KiB leaf block for each input root.

This reworks the btree merging code.  It reads a window of blocks from
all input trees to get a set of merged items.  It can take multiple
passes to complete the merge but by setting the merge window large
enough this overhead is reduced.  Merging now consumes a fixed amount of
memory rather than using memory proportional to the number of input
btrees.

Signed-off-by: Zach Brown <zab@versity.com>
2024-01-25 11:30:17 -08:00
Zach Brown
b5630f540d Add tracing of the log merge finalizing decision
Signed-off-by: Zach Brown <zab@versity.com>
2024-01-25 11:30:17 -08:00
Zach Brown
90a4c82363 Make log merge wait timeout tunable
Add a mount option for the amount of time that log merge creation can
wait before giving up.  We add some counters so we can see how often
the timeout is being hit and what the average successfull wait time is.

Signed-off-by: Zach Brown <zab@versity.com>
2024-01-25 11:25:56 -08:00
Zach Brown
f654fa0fda Send syncs once when starting to merge
The server sends sync requests to clients when it sees that they have
open log trees that need to be committed for log merging to proceed.

These are currently sent in the context of each client's get_log_trees
request, resulting in sync requests queued for one client from all
clients.  Depending on message delivery and commit latencies, this can
create a sync storm.

The server's sends are reliable and the open commits are marked with the
seq when they opened.  It's easy for us to record having sent syncs to
all open commits so that future attempts can be avoided.  Later open
commits will have higher seqs and will get a new round of syncs sent.

Signed-off-by: Zach Brown <zab@versity.com>
2024-01-25 11:25:20 -08:00
Zach Brown
50168a2d2a Check each client's last log item for stable seq
The server was checking all client log_trees items to search for the
lowest commit seq that was still open.  This can be expensive when there
are a lot of finalized log_trees items that won't have open seqs.  Only
the last log_trees item for each client rid can be open, and the items
are sorted by rid and nr, so we can easily only check the last item for
each client rid.

Signed-off-by: Zach Brown <zab@versity.com>
2024-01-25 11:24:50 -08:00
Zach Brown
3c0616524a Only search last log_trees per rid for finalizing
During get_log_trees the server checks log_trees items to see if it
should start a log merge operation.  It did this by iterating over all
log_trees items and there can be quite a lot of them.

It doesn't need to see all of the items.  It only needs to see the most
recent log_trees item for each mount.  That's enough to make the
decisions that start the log merging process.

Signed-off-by: Zach Brown <zab@versity.com>
2024-01-25 11:23:59 -08:00
Zach Brown
8d3e6883c6 Merge pull request #159 from versity/auke/trans_hold
Fix ret output for scoutfs_trans_hold trace pt.
2024-01-09 09:23:32 -08:00
Auke Kok
8747dae61c Fix ret output for scoutfs_trans_hold trace pt.
Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-01-08 16:27:41 -08:00
Zach Brown
fffcf4a9bb Merge pull request #158 from versity/zab/kasan_stack_oob_get_reg
Ignore spurious KASAN unwind warning
2023-11-22 10:04:18 -08:00
Zach Brown
b552406427 Ignore spurious KASAN unwind warning
KASAN could raise a spurious warning if the unwinder started in code
without ORC metadata and tried to access in the KASAN stack frame
redzones.  This was fixed upstream but we can rarely see it in older
kernels.  We can ignore these messages.

Signed-off-by: Zach Brown <zab@versity.com>
2023-11-21 12:25:16 -08:00
Zach Brown
d812599e6b Merge pull request #157 from versity/zab/dmsetup_test_devices
Zab/dmsetup test devices
2023-11-21 10:13:02 -08:00
Zach Brown
03ab5cedb6 clean up createmany-parallel-mounts test
This test is trying to make sure that concurrent work isn't much, much,
slower than individual work.  It does this by timing creating a bunch of
files in a dir on a mount and then timing doing the same in two mounts
concurrently.  But it messed it up the concurrency pretty badly.

It had the concurrent createmany tasks creating files with a full path.
That means that every create is trying to read all the parent
directories.  The way inode number allocation works means that one of
the mounts is likely to be getting a write lock that includes a shared
parent.  This created a ton of cluster lock contention between the two
tasks.

Then it didn't sync the creates between phases.  It could be
accidentally recording the time it took to write out the dirty single
creates as time taken during the parallel creates.

By syncing between phases and having the createmany tasks create files
relative to their per-mount directories we actually perform concurrent
work and test that we're not creating contention outside of the task
load.

This became a problem as we switched from loopback devices to device
mapper devices.  The loopback writers were using buffered writes so we
were masking the io cost of constantly invalidating and refilling the
item cache by turning the reads into memory copies out of the page
cache.

While we're in here we actually clean up the created files and then use
t_fail to fail the test while the files still exist so they can be
examined.

Signed-off-by: Zach Brown <zab@versity.com>
2023-11-15 15:12:57 -08:00
Zach Brown
2b94cd6468 Add loop module kernel message filter
Now that we're not setting up per-mount loopback devices we can not have
the loop module loaded until tests are running.

Signed-off-by: Zach Brown <zab@versity.com>
2023-11-15 13:39:38 -08:00
Zach Brown
5507ee5351 Use device-mapper for per-mount test devices
We don't directly mount the underlying devices for each mount because
the kernel notices multiple mounts and doesn't setup a new super block
for each.

Previously the script used loopback devices to create the local shared
block construct 'cause it was easy.  This introduced corruption of
blocks that saw concurrent read and write IOs.  The buffered kernel file
IO paths that loopback eventually degrades into by default (via splice)
could have buffered readers copying out of pages without the page lock
while writers modified the page.  This manifest as occasional crc
failure of blocks that we knowingly issue concurrent reads and writes to
from multiple mounts (the quorum and super blocks).

This changes the script to use device-mapper linear passthrough devices.
Their IOs don't hit a caching layer and don't provide an opportunity to
corrupt blocks.

Signed-off-by: Zach Brown <zab@versity.com>
2023-11-15 13:39:38 -08:00
Zach Brown
1600a121d9 Merge pull request #156 from versity/zab/large_fragmented_free_hung_task
Extend hung task timeout for large-fragmented-free
2023-11-15 09:49:13 -08:00
Zach Brown
6daf24ff37 Extend hung task timeout for large-fragmented-free
Our large fragmented free test creates pathologically file extents which
are as expensive as possible to free.  We know that debugging kernels
can take a long time to do this so we can extend the hung task timeout.

Signed-off-by: Zach Brown <zab@versity.com>
2023-11-14 15:01:37 -08:00
Zach Brown
cd5d9ff3e0 Merge pull request #154 from versity/zab/srch_test_fixes
Zab/srch test fixes
2023-11-13 09:47:46 -08:00
Zach Brown
d94e49eb63 Fix quoted glob in srch-basic-functionality
One of the phases of this test wanted to delete files but got the glob
quoting wrong.  This didn't matter for the original test but when we
changed the test to use its own xattr name then those existing undeleted
files got confused with other files in later phases of the test.

This changes the test to delete the files with a more reliable find
pattern instead of using shell glob expansion.

Signed-off-by: Zach Brown <zab@versity.com>
2023-11-09 14:16:36 -08:00
Zach Brown
1dbe408539 Add tracing of srch compact struct communication
Signed-off-by: Zach Brown <zab@versity.com>
2023-11-09 14:16:33 -08:00
Zach Brown
bf21699ad7 bulk_create_paths test tool takes xattr name
Previously the bulk_create_paths test tool used the same xattr name for
each category of xattrs it was creating.

This created a problem where two tests got their xattrs confused with
each other.  The first test created a bunch of srch xattrs, failed, and
didn't clean up after itself.  The second test saw these search xattrs
as its own and got very confused when there were far more srch xattrs
than it thought it had created.

This lets each test specify the srch xattr names that are created by
bulk_create_paths so that tests can work with their xattrs independent
of each other.

Signed-off-by: Zach Brown <zab@versity.com>
2023-11-09 14:15:44 -08:00
Zach Brown
c7c67a173d Specifically wait for compaction in srch test
We just added a test to try and get srch compaction stuck by having an
input file continue at a specific offset.  To exercise the bug the test
needs to perform 6 compactions.  It needs to merge 4 sets of logs into 4
sorted files, it needs to make partial progress merging those 4 sorted
files into another file, and then finall attempt to continue compacting
from the partial progress offset.

The first version of the test didn't necessarily ensure that these
compactions happened.  It created far too many log files then just
waited for time to pass.  If the host was slow then the mounts may not
make it through the initial logs to try and compact the sorted files.
The triggers wouldn't fire and the test would fail.

These changes much more carefully orchestrate and watch the various
steps of compaction to make sure that we trigger the bug.

Signed-off-by: Zach Brown <zab@versity.com>
2023-11-09 14:13:13 -08:00
Zach Brown
0d10189f58 Make srch compact request delay tunable
Add a sysfs file for getting and setting the delay between srch
compaction requests from the client.  We'll use this in testing to
ensure compaction runs promptly.

Signed-off-by: Zach Brown <zab@versity.com>
2023-11-09 14:13:07 -08:00
Zach Brown
6b88f3268e Merge pull request #153 from versity/zab/v1.18
v1.18 Release
2023-11-08 10:57:56 -08:00
Zach Brown
4b2afa61b8 v1.18 Release
Finish the release notes for the 1.18 release.

Signed-off-by: Zach Brown <zab@versity.com>
2023-11-07 16:01:59 -08:00
Zach Brown
222ba2cede Merge pull request #152 from versity/zab/stuck_srch_compact
Zab/stuck srch compact
2023-11-07 15:56:39 -08:00
Zach Brown
c7e97eeb1f Allow srch compaction from _SAFE_BYTES
Compacting sorted srch files can take multiple transactions because they
can be very large.  Each transaction resumes at a byte offset in a block
where the previous transaction stopped.

The resuming code tests that the byte offsets are sane but had a mistake
in testing the offset to skip to.  It returned an error if the
compaction resumed from the last possible safe offset for decoding
entries.

If a system is unlucky enough to have a compaction transaction stop at
just this offset then compaction stops making forward progress as each
attempt to resume returns an error.

The fix allows continuation from this last safe offset while returning
errors for attempts to continue *past* that offset.  This matches all
the encoding code which allows encoding the last entry in the block at
this offset.

Signed-off-by: Zach Brown <zab@versity.com>
2023-11-07 12:34:00 -08:00
Zach Brown
21c070b42d Add test for srch continutation safe pos errors
Add a test for srch compaction getting stuck hitting errors continuing a
partial operation.  It ensures that a block has an encoded entry at
the _SAFE_BYTES offset, that an operaton stops precisely at that
offset, and then watches for errors.

Signed-off-by: Zach Brown <zab@versity.com>
2023-11-07 12:34:00 -08:00
Zach Brown
77fbf92968 Add t_trigger_set helper
Add a helper to arm or disarm a trigger with a value argument.

Signed-off-by: Zach Brown <zab@versity.com>
2023-11-07 12:12:10 -08:00
Zach Brown
d5c699c3b4 Don't respond with ENOENT for no srch compaction
The srch compaction request building function and the srch compaction
worker both have logic to recognize a valid response with no input files
indicating that there's no work to do.  The server unfortunately
translated nr == 0 into ENOENT and send that error response to the
client.  This caused the client to increment error counters in the
common case when there's no compaction work to perform.  We'd like the
error counter to reflect actual errors, we're about to check it in a
test, so let's fix this up to the server sends a sucessful response with
nr == 0 to indicate that there's no work to do.

Signed-off-by: Zach Brown <zab@versity.com>
2023-11-07 10:30:38 -08:00
174 changed files with 18634 additions and 1182 deletions

View File

@@ -1,6 +1,62 @@
Versity ScoutFS Release Notes
=============================
---
v1.21
\
*Jul 1, 2024*
This release adds features that rely on incompatible changes to
structure the file system. The process of advancing the format version
to enable these features is described in scoutfs(5).
Added the ".indx." extended attribute tag which can be used to determine
the sorting of files in a global index.
Added ScoutFS quotas which let rules define file size and count limits
in terms of ".totl." extended attribute totals.
Added the project ID file attribute which is inherited from parent
directories on creation. ScoutFS quota rules can reference project IDs.
Add a retention attribute for files which prevents modification once
enabled.
---
v1.20
\
*Apr 22, 2024*
Minor changes to packaging to better support "weak" module linking of
the kernel module, and to including git hashes in the built package. No
changes in runtime behaviour.
---
v1.19
\
*Jan 30, 2024*
Added the log\_merge\_wait\_timeout\_ms mount option to set the timeout
for creating log merge operations. The previous timeout, now the
default, was too short for some systems and was resulting in consistent
timeouts which created an excessive number of log trees waiting to be
merged.
Improved performance of many in-mount server operations when there are a
large number of log trees waiting to be merged.
---
v1.18
\
*Nov 7, 2023*
Fixed a bug where background srch file compaction could stop making
forward progress if a partial compaction operation was committed at a
specific byte offset in a block. This would cause srch file searches to
be progressively more expensive over time. Once this fix is running
background compaction will resume, bringing the cost of searches back
down.
---
v1.17
\

View File

@@ -12,17 +12,22 @@ else
SP = @:
endif
SCOUTFS_GIT_DESCRIBE := \
SCOUTFS_GIT_DESCRIBE ?= \
$(shell git describe --all --abbrev=6 --long 2>/dev/null || \
echo no-git)
ESCAPED_GIT_DESCRIBE := \
$(shell echo $(SCOUTFS_GIT_DESCRIBE) |sed -e 's/\//\\\//g')
RPM_GITHASH ?= $(shell git rev-parse --short HEAD)
SCOUTFS_ARGS := SCOUTFS_GIT_DESCRIBE=$(SCOUTFS_GIT_DESCRIBE) \
RPM_GITHASH=$(RPM_GITHASH) \
CONFIG_SCOUTFS_FS=m -C $(SK_KSRC) M=$(CURDIR)/src \
EXTRA_CFLAGS="-Werror"
# - We use the git describe from tags to set up the RPM versioning
RPM_VERSION := $(shell git describe --long --tags | awk -F '-' '{gsub(/^v/,""); print $$1}')
RPM_GITHASH := $(shell git rev-parse --short HEAD)
TARFILE = scoutfs-kmod-$(RPM_VERSION).tar
@@ -41,7 +46,8 @@ modules_install:
%.spec: %.spec.in .FORCE
sed -e 's/@@VERSION@@/$(RPM_VERSION)/g' \
-e 's/@@GITHASH@@/$(RPM_GITHASH)/g' < $< > $@+
-e 's/@@GITHASH@@/$(RPM_GITHASH)/g' \
-e 's/@@GITDESCRIBE@@/$(ESCAPED_GIT_DESCRIBE)/g' < $< > $@+
mv $@+ $@

View File

@@ -1,6 +1,7 @@
%define kmod_name scoutfs
%define kmod_version @@VERSION@@
%define kmod_git_hash @@GITHASH@@
%define kmod_git_describe @@GITDESCRIBE@@
%define pkg_date %(date +%%Y%%m%%d)
# Disable the building of the debug package(s).
@@ -12,8 +13,7 @@
%if 0%{?el7}
%global kernel_source() /usr/src/kernels/%{kernel_version}.$(arch)
%endif
%if 0%{?el8}
%else
%global kernel_source() /usr/src/kernels/%{kernel_version}
%endif
@@ -21,8 +21,7 @@
%if 0%{?el7}
Name: %{kmod_name}
%endif
%if 0%{?el8}
%else
Name: kmod-%{kmod_name}
%endif
Summary: %{kmod_name} kernel module
@@ -34,8 +33,7 @@ URL: http://scoutfs.org/
%if 0%{?el7}
BuildRequires: %{kernel_module_package_buildreqs}
%endif
%if 0%{?el8}
%else
BuildRequires: elfutils-libelf-devel
%endif
BuildRequires: kernel-devel-uname-r = %{kernel_version}
@@ -53,7 +51,8 @@ Source: %{kmod_name}-kmod-%{kmod_version}.tar
%endif
%global install_mod_dir extra/%{kmod_name}
%if 0%{?el8}
%if ! 0%{?el7}
%global flavors_to_build x86_64
%endif
@@ -75,7 +74,7 @@ echo "Building for kernel: %{kernel_version} flavors: '%{flavors_to_build}'"
for flavor in %flavors_to_build; do
rm -rf obj/$flavor
cp -r source obj/$flavor
make SK_KSRC=%{kernel_source $flavor} -C obj/$flavor module
make RPM_GITHASH=%{kmod_git_hash} SCOUTFS_GIT_DESCRIBE=%{kmod_git_describe} SK_KSRC=%{kernel_source $flavor} -C obj/$flavor module
done
%install
@@ -92,15 +91,23 @@ done
# mark modules executable so that strip-to-file can strip them
find %{buildroot} -type f -name \*.ko -exec %{__chmod} u+x \{\} \;
%if 0%{?el8}
%if ! 0%{?el7}
%files
/lib/modules
%post
weak-modules --add-kernel --no-initramfs
echo /lib/modules/%{kversion}/%{install_mod_dir}/scoutfs.ko | weak-modules --add-modules --no-initramfs
depmod -a
%endif
%clean
rm -rf %{buildroot}
%preun
# stash our modules for postun cleanup
SCOUTFS_RPM_NAME=$(rpm -q %{name} | grep "%{version}-%{release}")
rpm -ql $SCOUTFS_RPM_NAME | grep '\.ko$' > /var/run/%{name}-modules-%{version}-%{release} || true
%postun
cat /var/run/%{name}-modules-%{version}-%{release} | weak-modules --remove-modules --no-initramfs
rm /var/run/%{name}-modules-%{version}-%{release} || true

View File

@@ -9,6 +9,7 @@ CFLAGS_scoutfs_trace.o = -I$(src) # define_trace.h double include
scoutfs-y += \
acl.o \
attr_x.o \
avl.o \
alloc.o \
block.o \
@@ -34,6 +35,7 @@ scoutfs-y += \
options.o \
per_task.o \
quorum.o \
quota.o \
recov.o \
scoutfs_trace.o \
server.o \
@@ -42,10 +44,12 @@ scoutfs-y += \
srch.o \
super.o \
sysfs.o \
totl.o \
trans.o \
triggers.o \
tseq.o \
volopt.o \
wkic.o \
xattr.o
#

View File

@@ -78,8 +78,9 @@ endif
# v4.8-rc1-29-g31051c85b5e2
#
# inode_change_ok() removed - replace with setattr_prepare()
# v5.11-rc4-7-g2f221d6f7b88 removes extern attribute
#
ifneq (,$(shell grep 'extern int setattr_prepare' include/linux/fs.h))
ifneq (,$(shell grep 'int setattr_prepare' include/linux/fs.h))
ccflags-y += -DKC_SETATTR_PREPARE
endif
@@ -258,3 +259,157 @@ endif
ifneq (,$(shell grep 'static inline const char .xattr_prefix' include/linux/xattr.h))
ccflags-y += -DKC_XATTR_HANDLER_NAME=1
endif
#
# v5.19-rc4-96-g342a72a33407
#
# Adds `typedef __u32 __bitwise blk_opf_t` to aid flag checking
ifneq (,$(shell grep 'typedef __u32 __bitwise blk_opf_t' include/linux/blk_types.h))
ccflags-y += -DKC_HAVE_BLK_OPF_T=1
endif
#
# v5.12-rc6-9-g4f0f586bf0c8
#
# list_sort cmp function takes const list_head args
ifneq (,$(shell grep 'const struct list_head ., const struct list_head .' include/linux/list_sort.h))
ccflags-y += -DKC_LIST_CMP_CONST_ARG_LIST_HEAD
endif
# v5.7-523-g88dca4ca5a93
#
# The pgprot argument to vmalloc is always PAGE_KERNEL, so it is removed.
ifneq (,$(shell grep 'extern void .__vmalloc.unsigned long size, gfp_t gfp_mask, pgprot_t prot' include/linux/vmalloc.h))
ccflags-y += -DKC_VMALLOC_PGPROT_T
endif
# v6.2-rc1-18-g01beba7957a2
#
# fs: port inode_owner_or_capable() to mnt_idmap
ifneq (,$(shell grep 'bool inode_owner_or_capable.struct user_namespace .mnt_userns' include/linux/fs.h))
ccflags-y += -DKC_INODE_OWNER_OR_CAPABLE_USERNS
endif
#
# v5.11-rc4-5-g47291baa8ddf
#
# namei: make permission helpers idmapped mount aware
ifneq (,$(shell grep 'int inode_permission.struct user_namespace' include/linux/fs.h))
ccflags-y += -DKC_INODE_PERMISSION_USERNS
endif
#
# v5.11-rc4-24-g549c7297717c
#
# fs: make helpers idmap mount aware
# Enlarges the VFS API methods to include user namespace argument.
ifneq (,$(shell grep 'int ..mknod. .struct user_namespace' include/linux/fs.h))
ccflags-y += -DKC_VFS_METHOD_USER_NAMESPACE_ARG
endif
#
# v5.17-rc2-21-g07888c665b40
#
# Detect new style bio_alloc - pass bdev and opf.
ifneq (,$(shell grep 'struct bio .bio_alloc.struct block_device .bdev' include/linux/bio.h))
ccflags-y += -DKC_BIO_ALLOC_DEV_OPF_ARGS
endif
#
# v5.7-rc4-53-gcddf8a2c4a82
#
# fiemap_prep() replaces fiemap_check_flags()
ifneq (,$(shell grep -s 'int fiemap_prep.struct inode' include/linux/fiemap.h))
ccflags-y += -DKC_FIEMAP_PREP
endif
#
# v5.17-13043-g800ba29547e1
#
# generic_perform_write args use kiocb for passing filp and pos
ifneq (,$(shell grep 'ssize_t generic_perform_write.struct kiocb ., struct iov_iter' include/linux/fs.h))
ccflags-y += -DKC_GENERIC_PERFORM_WRITE_KIOCB_IOV_ITER
endif
#
# v5.7-rc6-2496-g76ee0785f42a
#
# net: add sock_set_sndtimeo
ifneq (,$(shell grep 'void sock_set_sndtimeo.struct sock' include/net/sock.h))
ccflags-y += -DKC_SOCK_SET_SNDTIMEO
endif
#
# v5.8-rc4-1931-gba423fdaa589
#
# setsockopt functions are now passed a sockptr_t value instead of char*
ifneq (,$(shell grep -s 'include .linux/sockptr.h.' include/linux/net.h))
ccflags-y += -DKC_SETSOCKOPT_SOCKPTR_T
endif
#
# v5.7-rc6-2507-g71c48eb81c9e
#
# Adds a bunch of low level TCP sock parameter functions that we want to use.
ifneq (,$(shell grep 'int tcp_sock_set_keepintvl' include/linux/tcp.h))
ccflags-y += -DKC_HAVE_TCP_SET_SOCKFN
endif
#
# v4.16-rc3-13-ga84d1169164b
#
# Fixes y2038 issues with struct timeval.
ifneq (,$(shell grep -s '^struct __kernel_old_timeval .' include/uapi/linux/time_types.h))
ccflags-y += -DKC_KERNEL_OLD_TIMEVAL_STRUCT
endif
#
# v5.19-rc4-52-ge33c267ab70d
#
# register_shrinker now requires a name, used for debug stats etc.
ifneq (,$(shell grep 'int __printf.*register_shrinker.struct shrinker .shrinker,' include/linux/shrinker.h))
ccflags-y += -DKC_SHRINKER_NAME
endif
#
# v5.18-rc5-246-gf132ab7d3ab0
#
# mpage_readpage() is now replaced with mpage_read_folio.
ifneq (,$(shell grep 'int mpage_read_folio.struct folio .folio' include/linux/mpage.h))
ccflags-y += -DKC_MPAGE_READ_FOLIO
endif
#
# v5.18-rc5-219-gb3992d1e2ebc
#
# block_write_begin() no longer is being passed aop_flags
ifneq (,$(shell grep -C1 'int block_write_begin' include/linux/buffer_head.h | tail -n 2 | grep 'unsigned flags'))
ccflags-y += -DKC_BLOCK_WRITE_BEGIN_AOP_FLAGS
endif
#
# v6.0-rc6-9-g863f144f12ad
#
# the .tmpfile() vfs method calling convention changed and now a struct
# file* is passed to this metiond instead of a dentry. The function also
# should open the created file and call finish_open_simple() before returning.
ifneq (,$(shell grep 'extern void d_tmpfile.struct dentry' include/linux/dcache.h))
ccflags-y += -DKC_D_TMPFILE_DENTRY
endif
#
# v6.4-rc2-201-g0733ad800291
#
# New blk_mode_t replaces abuse of fmode_t
ifneq (,$(shell grep 'typedef unsigned int __bitwise blk_mode_t' include/linux/blkdev.h))
ccflags-y += -DKC_HAVE_BLK_MODE_T
endif
#
# v6.4-rc2-186-g2736e8eeb0cc
#
# Reworks FMODE_EXCL kludge and instead modifies the blkdev_put() call to pass in
# the (exclusive) holder to implement FMODE_EXCL handling.
ifneq (,$(shell grep 'blkdev_put.struct block_device .bdev, void .holder' include/linux/blkdev.h))
ccflags-y += -DKC_BLKDEV_PUT_HOLDER_ARG
endif

View File

@@ -98,11 +98,9 @@ struct posix_acl *scoutfs_get_acl_locked(struct inode *inode, int type, struct s
acl = ERR_PTR(ret);
}
#ifndef KC___POSIX_ACL_CREATE
/* can set null negative cache */
if (!IS_ERR(acl))
set_cached_acl(inode, type, acl);
#endif
kfree(value);
@@ -155,7 +153,8 @@ int scoutfs_set_acl_locked(struct inode *inode, struct posix_acl *acl, int type,
switch (type) {
case ACL_TYPE_ACCESS:
if (acl) {
ret = posix_acl_update_mode(inode, &new_mode, &acl);
ret = posix_acl_update_mode(KC_VFS_INIT_NS
inode, &new_mode, &acl);
if (ret < 0)
goto out;
set_mode = true;
@@ -194,10 +193,8 @@ int scoutfs_set_acl_locked(struct inode *inode, struct posix_acl *acl, int type,
}
out:
#ifndef KC___POSIX_ACL_CREATE
if (!ret)
set_cached_acl(inode, type, acl);
#endif
kfree(value);
@@ -256,7 +253,9 @@ int scoutfs_acl_get_xattr(struct dentry *dentry, const char *name, void *value,
}
#ifdef KC_XATTR_STRUCT_XATTR_HANDLER
int scoutfs_acl_set_xattr(const struct xattr_handler *handler, struct dentry *dentry,
int scoutfs_acl_set_xattr(const struct xattr_handler *handler,
KC_VFS_NS_DEF
struct dentry *dentry,
struct inode *inode, const char *name, const void *value,
size_t size, int flags)
{
@@ -269,7 +268,7 @@ int scoutfs_acl_set_xattr(struct dentry *dentry, const char *name, const void *v
struct posix_acl *acl = NULL;
int ret;
if (!inode_owner_or_capable(dentry->d_inode))
if (!inode_owner_or_capable(KC_VFS_INIT_NS dentry->d_inode))
return -EPERM;
if (!IS_POSIXACL(dentry->d_inode))

View File

@@ -10,7 +10,9 @@ int scoutfs_set_acl_locked(struct inode *inode, struct posix_acl *acl, int type,
int scoutfs_acl_get_xattr(const struct xattr_handler *, struct dentry *dentry,
struct inode *inode, const char *name, void *value,
size_t size);
int scoutfs_acl_set_xattr(const struct xattr_handler *, struct dentry *dentry,
int scoutfs_acl_set_xattr(const struct xattr_handler *,
KC_VFS_NS_DEF
struct dentry *dentry,
struct inode *inode, const char *name, const void *value,
size_t size, int flags);
#else

View File

@@ -14,6 +14,7 @@
#include <linux/module.h>
#include <linux/fs.h>
#include <linux/slab.h>
#include <linux/blkdev.h>
#include <linux/sort.h>
#include <linux/random.h>

252
kmod/src/attr_x.c Normal file
View File

@@ -0,0 +1,252 @@
/*
* Copyright (C) 2024 Versity Software, Inc. All rights reserved.
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public
* License v2 as published by the Free Software Foundation.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
* General Public License for more details.
*/
#include <linux/kernel.h>
#include <linux/fs.h>
#include "format.h"
#include "super.h"
#include "inode.h"
#include "ioctl.h"
#include "lock.h"
#include "trans.h"
#include "attr_x.h"
static int validate_attr_x_input(struct super_block *sb, struct scoutfs_ioctl_inode_attr_x *iax)
{
int ret;
if ((iax->x_mask & SCOUTFS_IOC_IAX__UNKNOWN) ||
(iax->x_flags & SCOUTFS_IOC_IAX_F__UNKNOWN))
return -EINVAL;
if ((iax->x_mask & SCOUTFS_IOC_IAX_RETENTION) &&
(ret = scoutfs_fmt_vers_unsupported(sb, SCOUTFS_FORMAT_VERSION_FEAT_RETENTION)))
return ret;
if ((iax->x_mask & SCOUTFS_IOC_IAX_PROJECT_ID) &&
(ret = scoutfs_fmt_vers_unsupported(sb, SCOUTFS_FORMAT_VERSION_FEAT_PROJECT_ID)))
return ret;
return 0;
}
/*
* If the mask indicates interest in the given attr then set the field
* to the caller's value and return the new size if it didn't already
* include the attr field.
*/
#define fill_attr(size, iax, bit, field, val) \
({ \
__typeof__(iax) _iax = (iax); \
__typeof__(size) _size = (size); \
\
if (_iax->x_mask & (bit)) { \
_iax->field = (val); \
_size = max(_size, offsetof(struct scoutfs_ioctl_inode_attr_x, field) + \
sizeof_field(struct scoutfs_ioctl_inode_attr_x, field)); \
} \
\
_size; \
})
/*
* Returns -errno on error, or >= number of bytes filled by the
* response. 0 can be returned if no attributes are requested in the
* input x_mask.
*/
int scoutfs_get_attr_x(struct inode *inode, struct scoutfs_ioctl_inode_attr_x *iax)
{
struct super_block *sb = inode->i_sb;
struct scoutfs_inode_info *si = SCOUTFS_I(inode);
struct scoutfs_lock *lock = NULL;
size_t size = 0;
u64 offline;
u64 online;
u64 bits;
int ret;
if (iax->x_mask == 0) {
ret = 0;
goto out;
}
ret = validate_attr_x_input(sb, iax);
if (ret < 0)
goto out;
inode_lock(inode);
ret = scoutfs_lock_inode(sb, SCOUTFS_LOCK_READ, SCOUTFS_LKF_REFRESH_INODE, inode, &lock);
if (ret)
goto unlock;
size = fill_attr(size, iax, SCOUTFS_IOC_IAX_META_SEQ,
meta_seq, scoutfs_inode_meta_seq(inode));
size = fill_attr(size, iax, SCOUTFS_IOC_IAX_DATA_SEQ,
data_seq, scoutfs_inode_data_seq(inode));
size = fill_attr(size, iax, SCOUTFS_IOC_IAX_DATA_VERSION,
data_version, scoutfs_inode_data_version(inode));
if (iax->x_mask & (SCOUTFS_IOC_IAX_ONLINE_BLOCKS | SCOUTFS_IOC_IAX_OFFLINE_BLOCKS)) {
scoutfs_inode_get_onoff(inode, &online, &offline);
size = fill_attr(size, iax, SCOUTFS_IOC_IAX_ONLINE_BLOCKS,
online_blocks, online);
size = fill_attr(size, iax, SCOUTFS_IOC_IAX_OFFLINE_BLOCKS,
offline_blocks, offline);
}
size = fill_attr(size, iax, SCOUTFS_IOC_IAX_CTIME, ctime_sec, inode->i_ctime.tv_sec);
size = fill_attr(size, iax, SCOUTFS_IOC_IAX_CTIME, ctime_nsec, inode->i_ctime.tv_nsec);
size = fill_attr(size, iax, SCOUTFS_IOC_IAX_CRTIME, crtime_sec, si->crtime.tv_sec);
size = fill_attr(size, iax, SCOUTFS_IOC_IAX_CRTIME, crtime_nsec, si->crtime.tv_nsec);
size = fill_attr(size, iax, SCOUTFS_IOC_IAX_SIZE, size, i_size_read(inode));
if (iax->x_mask & SCOUTFS_IOC_IAX__BITS) {
bits = 0;
if ((iax->x_mask & SCOUTFS_IOC_IAX_RETENTION) &&
(scoutfs_inode_get_flags(inode) & SCOUTFS_INO_FLAG_RETENTION))
bits |= SCOUTFS_IOC_IAX_B_RETENTION;
size = fill_attr(size, iax, SCOUTFS_IOC_IAX__BITS, bits, bits);
}
size = fill_attr(size, iax, SCOUTFS_IOC_IAX_PROJECT_ID,
project_id, scoutfs_inode_get_proj(inode));
ret = size;
unlock:
scoutfs_unlock(sb, lock, SCOUTFS_LOCK_READ);
inode_unlock(inode);
out:
return ret;
}
static bool valid_attr_changes(struct inode *inode, struct scoutfs_ioctl_inode_attr_x *iax)
{
/* provided data_version must be non-zero */
if ((iax->x_mask & SCOUTFS_IOC_IAX_DATA_VERSION) && (iax->data_version == 0))
return false;
/* can only set size or data version in new regular files */
if (((iax->x_mask & SCOUTFS_IOC_IAX_SIZE) ||
(iax->x_mask & SCOUTFS_IOC_IAX_DATA_VERSION)) &&
(!S_ISREG(inode->i_mode) || scoutfs_inode_data_version(inode) != 0))
return false;
/* must provide non-zero data_version with non-zero size */
if (((iax->x_mask & SCOUTFS_IOC_IAX_SIZE) && (iax->size > 0)) &&
(!(iax->x_mask & SCOUTFS_IOC_IAX_DATA_VERSION) || (iax->data_version == 0)))
return false;
/* must provide non-zero size when setting offline extents to that size */
if ((iax->x_flags & SCOUTFS_IOC_IAX_F_SIZE_OFFLINE) &&
(!(iax->x_mask & SCOUTFS_IOC_IAX_SIZE) || (iax->size == 0)))
return false;
/* the retention bit only applies to regular files */
if ((iax->x_mask & SCOUTFS_IOC_IAX_RETENTION) && !S_ISREG(inode->i_mode))
return false;
return true;
}
int scoutfs_set_attr_x(struct inode *inode, struct scoutfs_ioctl_inode_attr_x *iax)
{
struct super_block *sb = inode->i_sb;
struct scoutfs_inode_info *si = SCOUTFS_I(inode);
struct scoutfs_lock *lock = NULL;
LIST_HEAD(ind_locks);
bool set_data_seq;
int ret;
/* initially all setting is root only, could loosen with finer grained checks */
if (!capable(CAP_SYS_ADMIN)) {
ret = -EPERM;
goto out;
}
if (iax->x_mask == 0) {
ret = 0;
goto out;
}
ret = validate_attr_x_input(sb, iax);
if (ret < 0)
goto out;
inode_lock(inode);
ret = scoutfs_lock_inode(sb, SCOUTFS_LOCK_WRITE, SCOUTFS_LKF_REFRESH_INODE, inode, &lock);
if (ret)
goto unlock;
/* check for errors before making any changes */
if (!valid_attr_changes(inode, iax)) {
ret = -EINVAL;
goto unlock;
}
/* retention prevents modification unless also clearing retention */
ret = scoutfs_inode_check_retention(inode);
if (ret < 0 && !((iax->x_mask & SCOUTFS_IOC_IAX_RETENTION) &&
!(iax->bits & SCOUTFS_IOC_IAX_B_RETENTION)))
goto unlock;
/* setting only so we don't see 0 data seq with nonzero data_version */
if ((iax->x_mask & SCOUTFS_IOC_IAX_DATA_VERSION) && (iax->data_version > 0))
set_data_seq = true;
else
set_data_seq = false;
ret = scoutfs_inode_index_lock_hold(inode, &ind_locks, set_data_seq, true);
if (ret)
goto unlock;
ret = scoutfs_dirty_inode_item(inode, lock);
if (ret < 0)
goto release;
/* creating offline extent first, it might fail */
if (iax->x_flags & SCOUTFS_IOC_IAX_F_SIZE_OFFLINE) {
ret = scoutfs_data_init_offline_extent(inode, iax->size, lock);
if (ret)
goto release;
}
/* make all changes once they're all checked and will succeed */
if (iax->x_mask & SCOUTFS_IOC_IAX_DATA_VERSION)
scoutfs_inode_set_data_version(inode, iax->data_version);
if (iax->x_mask & SCOUTFS_IOC_IAX_SIZE)
i_size_write(inode, iax->size);
if (iax->x_mask & SCOUTFS_IOC_IAX_CTIME) {
inode->i_ctime.tv_sec = iax->ctime_sec;
inode->i_ctime.tv_nsec = iax->ctime_nsec;
}
if (iax->x_mask & SCOUTFS_IOC_IAX_CRTIME) {
si->crtime.tv_sec = iax->crtime_sec;
si->crtime.tv_nsec = iax->crtime_nsec;
}
if (iax->x_mask & SCOUTFS_IOC_IAX_RETENTION) {
scoutfs_inode_set_flags(inode, ~SCOUTFS_INO_FLAG_RETENTION,
(iax->bits & SCOUTFS_IOC_IAX_B_RETENTION) ?
SCOUTFS_INO_FLAG_RETENTION : 0);
}
if (iax->x_mask & SCOUTFS_IOC_IAX_PROJECT_ID)
scoutfs_inode_set_proj(inode, iax->project_id);
scoutfs_update_inode_item(inode, lock, &ind_locks);
ret = 0;
release:
scoutfs_release_trans(sb);
unlock:
scoutfs_inode_index_unlock(sb, &ind_locks);
scoutfs_unlock(sb, lock, SCOUTFS_LOCK_WRITE);
inode_unlock(inode);
out:
return ret;
}

11
kmod/src/attr_x.h Normal file
View File

@@ -0,0 +1,11 @@
#ifndef _SCOUTFS_ATTR_X_H_
#define _SCOUTFS_ATTR_X_H_
#include <linux/kernel.h>
#include <linux/fs.h>
#include "ioctl.h"
int scoutfs_get_attr_x(struct inode *inode, struct scoutfs_ioctl_inode_attr_x *iax);
int scoutfs_set_attr_x(struct inode *inode, struct scoutfs_ioctl_inode_attr_x *iax);
#endif

View File

@@ -120,8 +120,7 @@ do { \
static __le32 block_calc_crc(struct scoutfs_block_header *hdr, u32 size)
{
int off = offsetof(struct scoutfs_block_header, crc) +
FIELD_SIZEOF(struct scoutfs_block_header, crc);
int off = offsetofend(struct scoutfs_block_header, crc);
u32 calc = crc32c(~0, (char *)hdr + off, size - off);
return cpu_to_le32(calc);
@@ -159,7 +158,7 @@ static struct block_private *block_alloc(struct super_block *sb, u64 blkno)
*/
lockdep_off();
nofs_flags = memalloc_nofs_save();
bp->virt = __vmalloc(SCOUTFS_BLOCK_LG_SIZE, GFP_NOFS | __GFP_HIGHMEM, PAGE_KERNEL);
bp->virt = kc__vmalloc(SCOUTFS_BLOCK_LG_SIZE, GFP_NOFS | __GFP_HIGHMEM);
memalloc_nofs_restore(nofs_flags);
lockdep_on();
@@ -438,7 +437,7 @@ static void block_remove_all(struct super_block *sb)
* possible. Final freeing, verifying checksums, and unlinking errored
* blocks are all done by future users of the blocks.
*/
static void block_end_io(struct super_block *sb, unsigned int opf,
static void block_end_io(struct super_block *sb, blk_opf_t opf,
struct block_private *bp, int err)
{
DECLARE_BLOCK_INFO(sb, binf);
@@ -478,7 +477,7 @@ static void KC_DECLARE_BIO_END_IO(block_bio_end_io, struct bio *bio)
* Kick off IO for a single block.
*/
static int block_submit_bio(struct super_block *sb, struct block_private *bp,
unsigned int opf)
blk_opf_t opf)
{
struct scoutfs_sb_info *sbi = SCOUTFS_SB(sb);
struct bio *bio = NULL;
@@ -505,15 +504,13 @@ static int block_submit_bio(struct super_block *sb, struct block_private *bp,
for (off = 0; off < SCOUTFS_BLOCK_LG_SIZE; off += PAGE_SIZE) {
if (!bio) {
bio = bio_alloc(GFP_NOFS, SCOUTFS_BLOCK_LG_PAGES_PER);
bio = kc_bio_alloc(sbi->meta_bdev, SCOUTFS_BLOCK_LG_PAGES_PER, opf, GFP_NOFS);
if (!bio) {
ret = -ENOMEM;
break;
}
kc_bio_set_opf(bio, opf);
kc_bio_set_sector(bio, sector + (off >> 9));
bio_set_dev(bio, sbi->meta_bdev);
bio->bi_end_io = block_bio_end_io;
bio->bi_private = bp;
@@ -683,6 +680,7 @@ int scoutfs_block_read_ref(struct super_block *sb, struct scoutfs_block_ref *ref
struct scoutfs_block_header *hdr;
struct block_private *bp = NULL;
bool retried = false;
__le32 crc = 0;
int ret;
retry:
@@ -695,7 +693,9 @@ retry:
/* corrupted writes might be a sign of a stale reference */
if (!test_bit(BLOCK_BIT_CRC_VALID, &bp->bits)) {
if (hdr->crc != block_calc_crc(hdr, SCOUTFS_BLOCK_LG_SIZE)) {
crc = block_calc_crc(hdr, SCOUTFS_BLOCK_LG_SIZE);
if (hdr->crc != crc) {
trace_scoutfs_block_stale(sb, ref, hdr, magic, le32_to_cpu(crc));
ret = -ESTALE;
goto out;
}
@@ -705,6 +705,7 @@ retry:
if (hdr->magic != cpu_to_le32(magic) || hdr->fsid != cpu_to_le64(sbi->fsid) ||
hdr->seq != ref->seq || hdr->blkno != ref->blkno) {
trace_scoutfs_block_stale(sb, ref, hdr, magic, 0);
ret = -ESTALE;
goto out;
}
@@ -1197,7 +1198,7 @@ static void KC_DECLARE_BIO_END_IO(sm_block_bio_end_io, struct bio *bio)
* only layer that sees the full block buffer so we pass the calculated
* crc to the caller for them to check in their context.
*/
static int sm_block_io(struct super_block *sb, struct block_device *bdev, unsigned int opf,
static int sm_block_io(struct super_block *sb, struct block_device *bdev, blk_opf_t opf,
u64 blkno, struct scoutfs_block_header *hdr, size_t len, __le32 *blk_crc)
{
struct scoutfs_block_header *pg_hdr;
@@ -1229,15 +1230,13 @@ static int sm_block_io(struct super_block *sb, struct block_device *bdev, unsign
pg_hdr->crc = block_calc_crc(pg_hdr, SCOUTFS_BLOCK_SM_SIZE);
}
bio = bio_alloc(GFP_NOFS, 1);
bio = kc_bio_alloc(bdev, 1, opf, GFP_NOFS);
if (!bio) {
ret = -ENOMEM;
goto out;
}
kc_bio_set_opf(bio, opf | REQ_SYNC);
kc_bio_set_sector(bio, blkno << (SCOUTFS_BLOCK_SM_SHIFT - 9));
bio_set_dev(bio, bdev);
bio->bi_end_io = sm_block_bio_end_io;
bio->bi_private = &sbc;
bio_add_page(bio, page, SCOUTFS_BLOCK_SM_SIZE, 0);
@@ -1298,7 +1297,7 @@ int scoutfs_block_setup(struct super_block *sb)
init_waitqueue_head(&binf->waitq);
KC_INIT_SHRINKER_FUNCS(&binf->shrinker, block_count_objects,
block_scan_objects);
KC_REGISTER_SHRINKER(&binf->shrinker);
KC_REGISTER_SHRINKER(&binf->shrinker, "scoutfs-block:" SCSBF, SCSB_ARGS(sb));
INIT_WORK(&binf->free_work, block_free_work);
init_llist_head(&binf->free_llist);

View File

@@ -2029,187 +2029,253 @@ int scoutfs_btree_rebalance(struct super_block *sb,
key, SCOUTFS_BTREE_MAX_VAL_LEN, NULL, NULL, NULL);
}
struct merge_pos {
struct merged_range {
struct scoutfs_key start;
struct scoutfs_key end;
struct rb_root root;
int size;
};
struct merged_item {
struct rb_node node;
struct scoutfs_btree_root *root;
struct scoutfs_block *bl;
struct scoutfs_btree_block *bt;
struct scoutfs_avl_node *avl;
struct scoutfs_key *key;
struct scoutfs_key key;
u64 seq;
u8 flags;
unsigned int val_len;
u8 *val;
u8 val[0];
};
static struct merge_pos *first_mpos(struct rb_root *root)
static inline struct merged_item *mitem_container(struct rb_node *node)
{
struct rb_node *node = rb_first(root);
if (node)
return container_of(node, struct merge_pos, node);
return node ? container_of(node, struct merged_item, node) : NULL;
}
static inline struct merged_item *first_mitem(struct rb_root *root)
{
return mitem_container(rb_first(root));
}
static inline struct merged_item *last_mitem(struct rb_root *root)
{
return mitem_container(rb_last(root));
}
static inline struct merged_item *next_mitem(struct merged_item *mitem)
{
return mitem_container(mitem ? rb_next(&mitem->node) : NULL);
}
static inline struct merged_item *prev_mitem(struct merged_item *mitem)
{
return mitem_container(mitem ? rb_prev(&mitem->node) : NULL);
}
static struct merged_item *find_mitem(struct rb_root *root, struct scoutfs_key *key,
struct rb_node **parent_ret, struct rb_node ***link_ret)
{
struct rb_node **node = &root->rb_node;
struct rb_node *parent = NULL;
struct merged_item *mitem;
int cmp;
while (*node) {
parent = *node;
mitem = container_of(*node, struct merged_item, node);
cmp = scoutfs_key_compare(key, &mitem->key);
if (cmp < 0) {
node = &(*node)->rb_left;
} else if (cmp > 0) {
node = &(*node)->rb_right;
} else {
*parent_ret = NULL;
*link_ret = NULL;
return mitem;
}
}
*parent_ret = parent;
*link_ret = node;
return NULL;
}
static struct merge_pos *next_mpos(struct merge_pos *mpos)
static void insert_mitem(struct merged_range *rng, struct merged_item *mitem,
struct rb_node *parent, struct rb_node **link)
{
struct rb_node *node;
if (mpos && (node = rb_next(&mpos->node)))
return container_of(node, struct merge_pos, node);
else
return NULL;
rb_link_node(&mitem->node, parent, link);
rb_insert_color(&mitem->node, &rng->root);
rng->size += item_len_bytes(mitem->val_len);
}
static void free_mpos(struct super_block *sb, struct merge_pos *mpos)
static void replace_mitem(struct merged_range *rng, struct merged_item *victim,
struct merged_item *new)
{
scoutfs_block_put(sb, mpos->bl);
kfree(mpos);
rb_replace_node(&victim->node, &new->node, &rng->root);
RB_CLEAR_NODE(&victim->node);
rng->size -= item_len_bytes(victim->val_len);
rng->size += item_len_bytes(new->val_len);
}
static void insert_mpos(struct rb_root *pos_root, struct merge_pos *ins)
static void free_mitem(struct merged_range *rng, struct merged_item *mitem)
{
struct rb_node **node = &pos_root->rb_node;
struct rb_node *parent = NULL;
struct merge_pos *mpos;
int cmp;
if (IS_ERR_OR_NULL(mitem))
return;
parent = NULL;
while (*node) {
parent = *node;
mpos = container_of(*node, struct merge_pos, node);
/* sort merge items by key then newest to oldest */
cmp = scoutfs_key_compare(ins->key, mpos->key) ?:
-scoutfs_cmp(ins->seq, mpos->seq);
if (cmp < 0)
node = &(*node)->rb_left;
else
node = &(*node)->rb_right;
if (!RB_EMPTY_NODE(&mitem->node)) {
rng->size -= item_len_bytes(mitem->val_len);
rb_erase(&mitem->node, &rng->root);
}
rb_link_node(&ins->node, parent, node);
rb_insert_color(&ins->node, pos_root);
kfree(mitem);
}
static void trim_range_size(struct merged_range *rng, int merge_window)
{
struct merged_item *mitem;
struct merged_item *tmp;
mitem = last_mitem(&rng->root);
while (mitem && rng->size > merge_window) {
rng->end = mitem->key;
scoutfs_key_dec(&rng->end);
tmp = mitem;
mitem = prev_mitem(mitem);
free_mitem(rng, tmp);
}
}
static void trim_range_end(struct merged_range *rng)
{
struct merged_item *mitem;
struct merged_item *tmp;
mitem = last_mitem(&rng->root);
while (mitem && scoutfs_key_compare(&mitem->key, &rng->end) > 0) {
tmp = mitem;
mitem = prev_mitem(mitem);
free_mitem(rng, tmp);
}
}
/*
* Find the next item in the merge_pos root in the caller's range and
* insert it into the rbtree sorted by key and version so that merging
* can find the next newest item at the front of the rbtree. We free
* the mpos on error or if there are no more items in the range.
* Record and combine logged items from log roots for merging with the
* writable destination root. The caller is responsible for trimming
* the range if it gets too large or if the key range shrinks.
*/
static int reset_mpos(struct super_block *sb, struct rb_root *pos_root, struct merge_pos *mpos,
struct scoutfs_key *start, struct scoutfs_key *end)
static int merge_read_item(struct super_block *sb, struct scoutfs_key *key, u64 seq, u8 flags,
void *val, int val_len, void *arg)
{
struct scoutfs_btree_item *item;
struct scoutfs_avl_node *next;
struct btree_walk_key_range kr;
struct scoutfs_key walk_key;
int ret = 0;
struct merged_range *rng = arg;
struct merged_item *mitem;
struct merged_item *found;
struct rb_node *parent;
struct rb_node **link;
int ret;
/* always erase before freeing or inserting */
if (!RB_EMPTY_NODE(&mpos->node)) {
rb_erase(&mpos->node, pos_root);
RB_CLEAR_NODE(&mpos->node);
}
/*
* advance to next item via the avl tree. The caller's pos is
* only ever incremented past the last key so we can use next to
* iterate rather than using search to skip past multiple items.
*/
if (mpos->avl)
mpos->avl = scoutfs_avl_next(&mpos->bt->item_root, mpos->avl);
/* find the next leaf with the key if we run out of items */
walk_key = *start;
while (!mpos->avl && !scoutfs_key_is_zeros(&walk_key)) {
scoutfs_block_put(sb, mpos->bl);
mpos->bl = NULL;
ret = btree_walk(sb, NULL, NULL, mpos->root, BTW_NEXT, &walk_key,
0, &mpos->bl, &kr, NULL);
if (ret < 0) {
if (ret == -ENOENT)
ret = 0;
free_mpos(sb, mpos);
found = find_mitem(&rng->root, key, &parent, &link);
if (found) {
ret = scoutfs_forest_combine_deltas(key, found->val, found->val_len, val, val_len);
if (ret < 0)
goto out;
if (ret > 0) {
if (ret == SCOUTFS_DELTA_COMBINED) {
scoutfs_inc_counter(sb, btree_merge_delta_combined);
} else if (ret == SCOUTFS_DELTA_COMBINED_NULL) {
scoutfs_inc_counter(sb, btree_merge_delta_null);
free_mitem(rng, found);
}
ret = 0;
goto out;
}
mpos->bt = mpos->bl->data;
mpos->avl = scoutfs_avl_search(&mpos->bt->item_root, cmp_key_item,
start, NULL, NULL, &next, NULL) ?: next;
if (mpos->avl == NULL)
walk_key = kr.iter_next;
if (found->seq >= seq) {
ret = 0;
goto out;
}
}
/* see if we're out of items within the range */
item = node_item(mpos->avl);
if (!item || scoutfs_key_compare(item_key(item), end) > 0) {
free_mpos(sb, mpos);
ret = 0;
mitem = kmalloc(offsetof(struct merged_item, val[val_len]), GFP_NOFS);
if (!mitem) {
ret = -ENOMEM;
goto out;
}
/* insert the next item within range at its version */
mpos->key = item_key(item);
mpos->seq = le64_to_cpu(item->seq);
mpos->flags = item->flags;
mpos->val_len = item_val_len(item);
mpos->val = item_val(mpos->bt, item);
mitem->key = *key;
mitem->seq = seq;
mitem->flags = flags;
mitem->val_len = val_len;
if (val_len)
memcpy(mitem->val, val, val_len);
if (found) {
replace_mitem(rng, found, mitem);
free_mitem(rng, found);
} else {
insert_mitem(rng, mitem, parent, link);
}
insert_mpos(pos_root, mpos);
ret = 0;
out:
return ret;
}
/*
* The caller has reset all the merge positions for all the input log
* btree roots and wants the next logged item it should try and merge
* with the items in the fs_root.
* Read a range of merged items. The caller has set the key bounds of
* the range. We read a merge window's worth of items from blocks in
* each input btree.
*
* We look ahead in the logged item stream to see if we should merge any
* older logged delta items into one result for the caller. We also
* take this opportunity to skip and reset the mpos for any older
* versions of the first item.
* The caller can only use the smallest range that overlaps with all the
* blocks that we read. We start reading from the range's start key so
* it will always be present and we don't need to adjust it. The final
* block we read from each input might not cover the range's end so it
* needs to be adjusted.
*
* The end range can also shrink if we have to drop items because the
* items exceeded the merge window size.
*/
static int next_resolved_mpos(struct super_block *sb, struct rb_root *pos_root,
struct scoutfs_key *end, struct merge_pos **mpos_ret)
static int read_merged_range(struct super_block *sb, struct merged_range *rng,
struct list_head *inputs, int merge_window)
{
struct merge_pos *mpos;
struct merge_pos *next;
struct scoutfs_btree_root_head *rhead;
struct scoutfs_key start;
struct scoutfs_key end;
struct scoutfs_key key;
int ret = 0;
int i;
while ((mpos = first_mpos(pos_root)) && (next = next_mpos(mpos)) &&
!scoutfs_key_compare(mpos->key, next->key)) {
list_for_each_entry(rhead, inputs, head) {
key = rng->start;
ret = scoutfs_forest_combine_deltas(mpos->key, mpos->val, mpos->val_len,
next->val, next->val_len);
if (ret < 0)
break;
/* reset advances to the next item */
key = *mpos->key;
scoutfs_key_inc(&key);
/* always skip next combined or older version */
ret = reset_mpos(sb, pos_root, next, &key, end);
if (ret < 0)
break;
if (ret == SCOUTFS_DELTA_COMBINED) {
scoutfs_inc_counter(sb, btree_merge_delta_combined);
} else if (ret == SCOUTFS_DELTA_COMBINED_NULL) {
scoutfs_inc_counter(sb, btree_merge_delta_null);
/* if merging resulted in no info, skip current */
ret = reset_mpos(sb, pos_root, mpos, &key, end);
for (i = 0; i < merge_window; i += SCOUTFS_BLOCK_LG_SIZE) {
start = key;
end = rng->end;
ret = scoutfs_btree_read_items(sb, &rhead->root, &key, &start, &end,
merge_read_item, rng);
if (ret < 0)
goto out;
if (scoutfs_key_compare(&end, &rng->end) >= 0)
break;
key = end;
scoutfs_key_inc(&key);
}
if (scoutfs_key_compare(&end, &rng->end) < 0) {
rng->end = end;
trim_range_end(rng);
}
if (rng->size > merge_window)
trim_range_size(rng, merge_window);
}
*mpos_ret = mpos;
trace_scoutfs_btree_merge_read_range(sb, &rng->start, &rng->end, rng->size);
ret = 0;
out:
return ret;
}
@@ -2226,6 +2292,13 @@ static int next_resolved_mpos(struct super_block *sb, struct rb_root *pos_root,
* to allocators running low or needing to join/split the parent.
* *next_ret is set to the next key which hasn't been merged so that the
* caller can retry with a new allocator and subtree.
*
* The number of input roots can be immense. The merge_window specifies
* the size of the set of merged items that we'll maintain as we iterate
* over all the input roots. Once we've merged items into the window
* from all the input roots the merged input items are then merged to
* the writable destination root. It may take multiple passes of
* windows of merged items to cover the input key range.
*/
int scoutfs_btree_merge(struct super_block *sb,
struct scoutfs_alloc *alloc,
@@ -2235,18 +2308,16 @@ int scoutfs_btree_merge(struct super_block *sb,
struct scoutfs_key *next_ret,
struct scoutfs_btree_root *root,
struct list_head *inputs,
bool subtree, int dirty_limit, int alloc_low)
bool subtree, int dirty_limit, int alloc_low, int merge_window)
{
struct scoutfs_btree_root_head *rhead;
struct rb_root pos_root = RB_ROOT;
struct scoutfs_btree_item *item;
struct scoutfs_btree_block *bt;
struct scoutfs_block *bl = NULL;
struct btree_walk_key_range kr;
struct scoutfs_avl_node *par;
struct scoutfs_key next;
struct merge_pos *mpos;
struct merge_pos *tmp;
struct merged_item *mitem;
struct merged_item *tmp;
struct merged_range rng;
int walk_val_len;
int walk_flags;
bool is_del;
@@ -2257,49 +2328,59 @@ int scoutfs_btree_merge(struct super_block *sb,
trace_scoutfs_btree_merge(sb, root, start, end);
scoutfs_inc_counter(sb, btree_merge);
list_for_each_entry(rhead, inputs, head) {
mpos = kzalloc(sizeof(*mpos), GFP_NOFS);
if (!mpos) {
ret = -ENOMEM;
goto out;
}
RB_CLEAR_NODE(&mpos->node);
mpos->root = &rhead->root;
ret = reset_mpos(sb, &pos_root, mpos, start, end);
if (ret < 0)
goto out;
}
walk_flags = BTW_DIRTY;
if (subtree)
walk_flags |= BTW_SUBTREE;
walk_val_len = 0;
while ((ret = next_resolved_mpos(sb, &pos_root, end, &mpos)) == 0 && mpos) {
rng.start = *start;
rng.end = *end;
rng.root = RB_ROOT;
rng.size = 0;
ret = read_merged_range(sb, &rng, inputs, merge_window);
if (ret < 0)
goto out;
for (;;) {
/* read next window as it empties (and it is possible to read an empty range) */
mitem = first_mitem(&rng.root);
if (!mitem) {
/* done if the read range hit the end */
if (scoutfs_key_compare(&rng.end, end) >= 0)
break;
/* read next batch of merged items */
rng.start = rng.end;
scoutfs_key_inc(&rng.start);
rng.end = *end;
ret = read_merged_range(sb, &rng, inputs, merge_window);
if (ret < 0)
break;
continue;
}
if (scoutfs_block_writer_dirty_bytes(sb, wri) >= dirty_limit) {
scoutfs_inc_counter(sb, btree_merge_dirty_limit);
ret = -ERANGE;
*next_ret = *mpos->key;
*next_ret = mitem->key;
goto out;
}
if (scoutfs_alloc_meta_low(sb, alloc, alloc_low)) {
scoutfs_inc_counter(sb, btree_merge_alloc_low);
ret = -ERANGE;
*next_ret = *mpos->key;
*next_ret = mitem->key;
goto out;
}
scoutfs_block_put(sb, bl);
bl = NULL;
ret = btree_walk(sb, alloc, wri, root, walk_flags,
mpos->key, walk_val_len, &bl, &kr, NULL);
&mitem->key, walk_val_len, &bl, &kr, NULL);
if (ret < 0) {
if (ret == -ERANGE)
*next_ret = *mpos->key;
*next_ret = mitem->key;
goto out;
}
bt = bl->data;
@@ -2311,22 +2392,21 @@ int scoutfs_btree_merge(struct super_block *sb,
continue;
}
while ((ret = next_resolved_mpos(sb, &pos_root, end, &mpos)) == 0 && mpos) {
while (mitem) {
/* walk to new leaf if we exceed parent ref key */
if (scoutfs_key_compare(mpos->key, &kr.end) > 0)
if (scoutfs_key_compare(&mitem->key, &kr.end) > 0)
break;
/* see if there's an existing item */
item = leaf_item_hash_search(sb, bt, mpos->key);
is_del = !!(mpos->flags & SCOUTFS_ITEM_FLAG_DELETION);
item = leaf_item_hash_search(sb, bt, &mitem->key);
is_del = !!(mitem->flags & SCOUTFS_ITEM_FLAG_DELETION);
/* see if we're merging delta items */
if (item && !is_del)
delta = scoutfs_forest_combine_deltas(mpos->key,
delta = scoutfs_forest_combine_deltas(&mitem->key,
item_val(bt, item),
item_val_len(item),
mpos->val, mpos->val_len);
mitem->val, mitem->val_len);
else
delta = 0;
if (delta < 0) {
@@ -2338,40 +2418,38 @@ int scoutfs_btree_merge(struct super_block *sb,
scoutfs_inc_counter(sb, btree_merge_delta_null);
}
trace_scoutfs_btree_merge_items(sb, mpos->root,
mpos->key, mpos->val_len,
trace_scoutfs_btree_merge_items(sb, &mitem->key, mitem->val_len,
item ? root : NULL,
item ? item_key(item) : NULL,
item ? item_val_len(item) : 0, is_del);
/* rewalk and split if ins/update needs room */
if (!is_del && !delta && !mid_free_item_room(bt, mpos->val_len)) {
if (!is_del && !delta && !mid_free_item_room(bt, mitem->val_len)) {
walk_flags |= BTW_INSERT;
walk_val_len = mpos->val_len;
walk_val_len = mitem->val_len;
break;
}
/* insert missing non-deletion merge items */
if (!item && !is_del) {
scoutfs_avl_search(&bt->item_root,
cmp_key_item, mpos->key,
scoutfs_avl_search(&bt->item_root, cmp_key_item, &mitem->key,
&cmp, &par, NULL, NULL);
create_item(bt, mpos->key, mpos->seq, mpos->flags,
mpos->val, mpos->val_len, par, cmp);
create_item(bt, &mitem->key, mitem->seq, mitem->flags,
mitem->val, mitem->val_len, par, cmp);
scoutfs_inc_counter(sb, btree_merge_insert);
}
/* update existing items */
if (item && !is_del && !delta) {
item->seq = cpu_to_le64(mpos->seq);
item->flags = mpos->flags;
update_item_value(bt, item, mpos->val, mpos->val_len);
item->seq = cpu_to_le64(mitem->seq);
item->flags = mitem->flags;
update_item_value(bt, item, mitem->val, mitem->val_len);
scoutfs_inc_counter(sb, btree_merge_update);
}
/* update combined delta item seq */
if (delta == SCOUTFS_DELTA_COMBINED) {
item->seq = cpu_to_le64(mpos->seq);
item->seq = cpu_to_le64(mitem->seq);
}
/*
@@ -2403,21 +2481,18 @@ int scoutfs_btree_merge(struct super_block *sb,
walk_flags &= ~(BTW_INSERT | BTW_DELETE);
walk_val_len = 0;
/* finished with this key, skip any older items */
next = *mpos->key;
scoutfs_key_inc(&next);
ret = reset_mpos(sb, &pos_root, mpos, &next, end);
if (ret < 0)
goto out;
/* finished with this merged item */
tmp = mitem;
mitem = next_mitem(mitem);
free_mitem(&rng, tmp);
}
}
ret = 0;
out:
scoutfs_block_put(sb, bl);
rbtree_postorder_for_each_entry_safe(mpos, tmp, &pos_root, node) {
free_mpos(sb, mpos);
}
rbtree_postorder_for_each_entry_safe(mitem, tmp, &rng.root, node)
free_mitem(&rng, mitem);
return ret;
}

View File

@@ -119,7 +119,7 @@ int scoutfs_btree_merge(struct super_block *sb,
struct scoutfs_key *next_ret,
struct scoutfs_btree_root *root,
struct list_head *input_list,
bool subtree, int dirty_limit, int alloc_low);
bool subtree, int dirty_limit, int alloc_low, int merge_window);
int scoutfs_btree_free_blocks(struct super_block *sb,
struct scoutfs_alloc *alloc,

View File

@@ -20,6 +20,7 @@
#include <net/sock.h>
#include <net/tcp.h>
#include <asm/barrier.h>
#include <linux/overflow.h>
#include "format.h"
#include "counters.h"
@@ -68,6 +69,7 @@ int scoutfs_client_alloc_inodes(struct super_block *sb, u64 count,
struct client_info *client = SCOUTFS_SB(sb)->client_info;
struct scoutfs_net_inode_alloc ial;
__le64 lecount = cpu_to_le64(count);
u64 tmp;
int ret;
ret = scoutfs_net_sync_request(sb, client->conn,
@@ -80,7 +82,7 @@ int scoutfs_client_alloc_inodes(struct super_block *sb, u64 count,
if (*nr == 0)
ret = -ENOSPC;
else if (*ino + *nr < *ino)
else if (check_add_overflow(*ino, *nr - 1, &tmp))
ret = -EINVAL;
}

View File

@@ -145,6 +145,7 @@
EXPAND_COUNTER(lock_shrink_work) \
EXPAND_COUNTER(lock_unlock) \
EXPAND_COUNTER(lock_wait) \
EXPAND_COUNTER(log_merge_wait_timeout) \
EXPAND_COUNTER(net_dropped_response) \
EXPAND_COUNTER(net_send_bytes) \
EXPAND_COUNTER(net_send_error) \
@@ -161,6 +162,8 @@
EXPAND_COUNTER(orphan_scan_error) \
EXPAND_COUNTER(orphan_scan_item) \
EXPAND_COUNTER(orphan_scan_omap_set) \
EXPAND_COUNTER(quota_info_count_objects) \
EXPAND_COUNTER(quota_info_scan_objects) \
EXPAND_COUNTER(quorum_candidate_server_stopping) \
EXPAND_COUNTER(quorum_elected) \
EXPAND_COUNTER(quorum_fence_error) \
@@ -198,20 +201,19 @@
EXPAND_COUNTER(srch_read_stale) \
EXPAND_COUNTER(statfs) \
EXPAND_COUNTER(totl_read_copied) \
EXPAND_COUNTER(totl_read_finalized) \
EXPAND_COUNTER(totl_read_fs) \
EXPAND_COUNTER(totl_read_item) \
EXPAND_COUNTER(totl_read_logged) \
EXPAND_COUNTER(trans_commit_data_alloc_low) \
EXPAND_COUNTER(trans_commit_dirty_meta_full) \
EXPAND_COUNTER(trans_commit_fsync) \
EXPAND_COUNTER(trans_commit_meta_alloc_low) \
EXPAND_COUNTER(trans_commit_sync_fs) \
EXPAND_COUNTER(trans_commit_timer) \
EXPAND_COUNTER(trans_commit_written)
EXPAND_COUNTER(trans_commit_written) \
EXPAND_COUNTER(wkic_count_objects) \
EXPAND_COUNTER(wkic_scan_objects)
#define FIRST_COUNTER alloc_alloc_data
#define LAST_COUNTER trans_commit_written
#define LAST_COUNTER wkic_scan_objects
#undef EXPAND_COUNTER
#define EXPAND_COUNTER(which) struct percpu_counter which;

View File

@@ -20,7 +20,9 @@
#include <linux/hash.h>
#include <linux/log2.h>
#include <linux/falloc.h>
#include <linux/fiemap.h>
#include <linux/writeback.h>
#include <linux/overflow.h>
#include "format.h"
#include "super.h"
@@ -586,6 +588,12 @@ static int scoutfs_get_block(struct inode *inode, sector_t iblock,
goto out;
}
if (create && !si->staging) {
ret = scoutfs_inode_check_retention(inode);
if (ret < 0)
goto out;
}
/* convert unwritten to written, could be staging */
if (create && ext.map && (ext.flags & SEF_UNWRITTEN)) {
un.start = iblock;
@@ -673,8 +681,14 @@ int scoutfs_get_block_write(struct inode *inode, sector_t iblock, struct buffer_
* We can return errors from locking and checking offline extents. The
* page is unlocked if we return an error.
*/
#ifdef KC_MPAGE_READ_FOLIO
static int scoutfs_read_folio(struct file *file, struct folio *folio)
{
struct page *page = &folio->page;
#else
static int scoutfs_readpage(struct file *file, struct page *page)
{
#endif
struct inode *inode = file->f_inode;
struct scoutfs_inode_info *si = SCOUTFS_I(inode);
struct super_block *sb = inode->i_sb;
@@ -721,7 +735,11 @@ static int scoutfs_readpage(struct file *file, struct page *page)
return ret;
}
#ifdef KC_MPAGE_READ_FOLIO
ret = mpage_read_folio(folio, scoutfs_get_block_read);
#else
ret = mpage_readpage(page, scoutfs_get_block_read);
#endif
scoutfs_unlock(sb, inode_lock, SCOUTFS_LOCK_READ);
scoutfs_per_task_del(&si->pt_data_lock, &pt_ent);
@@ -819,7 +837,10 @@ struct write_begin_data {
static int scoutfs_write_begin(struct file *file,
struct address_space *mapping, loff_t pos,
unsigned len, unsigned flags,
unsigned len,
#ifdef KC_BLOCK_WRITE_BEGIN_AOP_FLAGS
unsigned flags,
#endif
struct page **pagep, void **fsdata)
{
struct inode *inode = mapping->host;
@@ -854,13 +875,18 @@ retry:
if (ret < 0)
goto out;
#ifdef KC_BLOCK_WRITE_BEGIN_AOP_FLAGS
/* can't re-enter fs, have trans */
flags |= AOP_FLAG_NOFS;
#endif
/* generic write_end updates i_size and calls dirty_inode */
ret = scoutfs_dirty_inode_item(inode, wbd->lock) ?:
block_write_begin(mapping, pos, len, flags, pagep,
scoutfs_get_block_write);
block_write_begin(mapping, pos, len,
#ifdef KC_BLOCK_WRITE_BEGIN_AOP_FLAGS
flags,
#endif
pagep, scoutfs_get_block_write);
if (ret < 0) {
scoutfs_release_trans(sb);
scoutfs_inode_index_unlock(sb, &wbd->ind_locks);
@@ -1062,6 +1088,7 @@ long scoutfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
loff_t end;
u64 iblock;
u64 last;
loff_t tmp;
s64 ret;
/* XXX support more flags */
@@ -1070,14 +1097,14 @@ long scoutfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
goto out;
}
/* catch wrapping */
if (offset + len < offset) {
ret = -EINVAL;
if (len == 0) {
ret = 0;
goto out;
}
if (len == 0) {
ret = 0;
/* catch wrapping */
if (check_add_overflow(offset, len - 1, &tmp)) {
ret = -EINVAL;
goto out;
}
@@ -1104,6 +1131,10 @@ long scoutfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
while(iblock <= last) {
ret = scoutfs_quota_check_data(sb, inode);
if (ret)
goto out_extent;
ret = scoutfs_inode_index_lock_hold(inode, &ind_locks, false, true);
if (ret)
goto out_extent;
@@ -1155,9 +1186,9 @@ out:
* on regular files with no data extents. It's used to restore a file
* with an offline extent which can then trigger staging.
*
* The caller has taken care of locking the inode. We're updating the
* inode offline count as we create the offline extent so we take care
* of the index locking, updating, and transaction.
* The caller must take care of cluster locking, transactions, inode
* updates, and index updates (so that they can atomically make this
* change along with other metadata changes).
*/
int scoutfs_data_init_offline_extent(struct inode *inode, u64 size,
struct scoutfs_lock *lock)
@@ -1171,7 +1202,6 @@ int scoutfs_data_init_offline_extent(struct inode *inode, u64 size,
.lock = lock,
};
const u64 count = DIV_ROUND_UP(size, SCOUTFS_BLOCK_SM_SIZE);
LIST_HEAD(ind_locks);
u64 on;
u64 off;
int ret;
@@ -1184,28 +1214,10 @@ int scoutfs_data_init_offline_extent(struct inode *inode, u64 size,
goto out;
}
/* we're updating meta_seq with offline block count */
ret = scoutfs_inode_index_lock_hold(inode, &ind_locks, false, true);
if (ret < 0)
goto out;
ret = scoutfs_dirty_inode_item(inode, lock);
if (ret < 0)
goto unlock;
down_write(&si->extent_sem);
ret = scoutfs_ext_insert(sb, &data_ext_ops, &args,
0, count, 0, SEF_OFFLINE);
up_write(&si->extent_sem);
if (ret < 0)
goto unlock;
scoutfs_update_inode_item(inode, lock, &ind_locks);
unlock:
scoutfs_release_trans(sb);
scoutfs_inode_index_unlock(sb, &ind_locks);
ret = 0;
out:
return ret;
}
@@ -1273,6 +1285,9 @@ int scoutfs_data_move_blocks(struct inode *from, u64 from_off,
if (ret)
goto out;
if (!is_stage && (ret = scoutfs_inode_check_retention(to)))
goto out;
if ((from_off & SCOUTFS_BLOCK_SM_MASK) ||
(to_off & SCOUTFS_BLOCK_SM_MASK) ||
((byte_len & SCOUTFS_BLOCK_SM_MASK) &&
@@ -1310,8 +1325,8 @@ int scoutfs_data_move_blocks(struct inode *from, u64 from_off,
goto out;
}
ret = inode_permission(from, MAY_WRITE) ?:
inode_permission(to, MAY_WRITE);
ret = inode_permission(KC_VFS_INIT_NS from, MAY_WRITE) ?:
inode_permission(KC_VFS_INIT_NS to, MAY_WRITE);
if (ret < 0)
goto out;
@@ -1549,7 +1564,7 @@ int scoutfs_data_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
goto out;
}
ret = fiemap_check_flags(fieinfo, FIEMAP_FLAG_SYNC);
ret = fiemap_prep(inode, fieinfo, start, &len, FIEMAP_FLAG_SYNC);
if (ret)
goto out;
@@ -1715,12 +1730,16 @@ int scoutfs_data_wait_check(struct inode *inode, loff_t pos, loff_t len,
u64 last_block;
u64 on;
u64 off;
loff_t tmp;
int ret = 0;
if (len == 0)
goto out;
if (WARN_ON_ONCE(sef & SEF_UNKNOWN) ||
WARN_ON_ONCE(op & SCOUTFS_IOC_DWO_UNKNOWN) ||
WARN_ON_ONCE(dw && !RB_EMPTY_NODE(&dw->node)) ||
WARN_ON_ONCE(pos + len < pos)) {
WARN_ON_ONCE(check_add_overflow(pos, len - 1, &tmp))) {
ret = -EINVAL;
goto out;
}
@@ -1807,37 +1826,6 @@ int scoutfs_data_wait_check_iov(struct inode *inode, const struct iovec *iov,
return ret;
}
int scoutfs_data_wait_check_iter(struct inode *inode, loff_t pos, struct iov_iter *iter,
u8 sef, u8 op, struct scoutfs_data_wait *dw,
struct scoutfs_lock *lock)
{
size_t count = iov_iter_count(iter);
size_t off = iter->iov_offset;
const struct iovec *iov;
size_t len;
int ret = 0;
for (iov = iter->iov; count > 0; iov++) {
len = iov->iov_len - off;
if (len == 0)
continue;
/* aren't we waiting on too much data here ? */
ret = scoutfs_data_wait_check(inode, pos, len,
sef, op, dw, lock);
if (ret != 0)
break;
pos += len;
count -= len;
off = 0;
}
return ret;
}
int scoutfs_data_wait(struct inode *inode, struct scoutfs_data_wait *dw)
{
DECLARE_DATA_WAIT_ROOT(inode->i_sb, rt);
@@ -1927,7 +1915,13 @@ int scoutfs_data_waiting(struct super_block *sb, u64 ino, u64 iblock,
}
const struct address_space_operations scoutfs_file_aops = {
#ifdef KC_MPAGE_READ_FOLIO
.dirty_folio = block_dirty_folio,
.invalidate_folio = block_invalidate_folio,
.read_folio = scoutfs_read_folio,
#else
.readpage = scoutfs_readpage,
#endif
#ifndef KC_FILE_AOPS_READAHEAD
.readpages = scoutfs_readpages,
#else
@@ -1948,6 +1942,8 @@ const struct file_operations scoutfs_file_fops = {
#else
.read_iter = scoutfs_file_read_iter,
.write_iter = scoutfs_file_write_iter,
.splice_read = generic_file_splice_read,
.splice_write = iter_file_splice_write,
#endif
.unlocked_ioctl = scoutfs_ioctl,
.fsync = scoutfs_file_fsync,

View File

@@ -65,9 +65,6 @@ int scoutfs_data_wait_check_iov(struct inode *inode, const struct iovec *iov,
unsigned long nr_segs, loff_t pos, u8 sef,
u8 op, struct scoutfs_data_wait *ow,
struct scoutfs_lock *lock);
int scoutfs_data_wait_check_iter(struct inode *inode, loff_t pos, struct iov_iter *iter,
u8 sef, u8 op, struct scoutfs_data_wait *ow,
struct scoutfs_lock *lock);
bool scoutfs_data_wait_found(struct scoutfs_data_wait *ow);
int scoutfs_data_wait(struct inode *inode,
struct scoutfs_data_wait *ow);

View File

@@ -34,6 +34,7 @@
#include "forest.h"
#include "acl.h"
#include "counters.h"
#include "quota.h"
#include "scoutfs_trace.h"
/*
@@ -651,6 +652,10 @@ static struct inode *lock_hold_create(struct inode *dir, struct dentry *dentry,
if (ret)
goto out_unlock;
ret = scoutfs_quota_check_inode(sb, dir);
if (ret)
goto out_unlock;
if (orph_lock) {
ret = scoutfs_lock_orphan(sb, SCOUTFS_LOCK_WRITE_ONLY, 0, ino, orph_lock);
if (ret < 0)
@@ -672,6 +677,8 @@ retry:
if (ret < 0)
goto out;
scoutfs_inode_set_proj(inode, scoutfs_inode_get_proj(dir));
ret = scoutfs_dirty_inode_item(dir, *dir_lock);
out:
if (ret)
@@ -696,8 +703,9 @@ out_unlock:
return inode;
}
static int scoutfs_mknod(struct inode *dir, struct dentry *dentry, umode_t mode,
dev_t rdev)
static int scoutfs_mknod(KC_VFS_NS_DEF
struct inode *dir,
struct dentry *dentry, umode_t mode, dev_t rdev)
{
struct super_block *sb = dir->i_sb;
struct inode *inode = NULL;
@@ -766,15 +774,20 @@ out:
}
/* XXX hmm, do something with excl? */
static int scoutfs_create(struct inode *dir, struct dentry *dentry,
umode_t mode, bool excl)
static int scoutfs_create(KC_VFS_NS_DEF
struct inode *dir,
struct dentry *dentry, umode_t mode, bool excl)
{
return scoutfs_mknod(dir, dentry, mode | S_IFREG, 0);
return scoutfs_mknod(KC_VFS_NS
dir, dentry, mode | S_IFREG, 0);
}
static int scoutfs_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
static int scoutfs_mkdir(KC_VFS_NS_DEF
struct inode *dir,
struct dentry *dentry, umode_t mode)
{
return scoutfs_mknod(dir, dentry, mode | S_IFDIR, 0);
return scoutfs_mknod(KC_VFS_NS
dir, dentry, mode | S_IFDIR, 0);
}
static int scoutfs_link(struct dentry *old_dentry,
@@ -926,12 +939,16 @@ static int scoutfs_unlink(struct inode *dir, struct dentry *dentry)
goto unlock;
}
ret = scoutfs_inode_check_retention(inode);
if (ret < 0)
goto unlock;
hash = dirent_name_hash(dentry->d_name.name, dentry->d_name.len);
ret = lookup_dirent(sb, scoutfs_ino(dir), dentry->d_name.name, dentry->d_name.len, hash,
&dent, dir_lock);
if (ret < 0)
goto out;
goto unlock;
if (should_orphan(inode)) {
ret = scoutfs_lock_orphan(sb, SCOUTFS_LOCK_WRITE_ONLY, 0, scoutfs_ino(inode),
@@ -1165,7 +1182,8 @@ static const char *scoutfs_get_link(struct dentry *dentry, struct inode *inode,
* Symlink target paths can be annoyingly large. We store relatively
* rare large paths in multiple items.
*/
static int scoutfs_symlink(struct inode *dir, struct dentry *dentry,
static int scoutfs_symlink(KC_VFS_NS_DEF
struct inode *dir, struct dentry *dentry,
const char *symname)
{
struct super_block *sb = dir->i_sb;
@@ -1552,7 +1570,8 @@ static int verify_ancestors(struct super_block *sb, u64 p1, u64 p2,
* from using parent/child locking orders as two groups can have both
* parent and child relationships to each other.
*/
static int scoutfs_rename_common(struct inode *old_dir,
static int scoutfs_rename_common(KC_VFS_NS_DEF
struct inode *old_dir,
struct dentry *old_dentry, struct inode *new_dir,
struct dentry *new_dentry, unsigned int flags)
{
@@ -1632,6 +1651,10 @@ static int scoutfs_rename_common(struct inode *old_dir,
goto out_unlock;
}
if ((old_inode && (ret = scoutfs_inode_check_retention(old_inode))) ||
(new_inode && (ret = scoutfs_inode_check_retention(new_inode))))
goto out_unlock;
if (should_orphan(new_inode)) {
ret = scoutfs_lock_orphan(sb, SCOUTFS_LOCK_WRITE_ONLY, 0, scoutfs_ino(new_inode),
&orph_lock);
@@ -1825,18 +1848,21 @@ static int scoutfs_rename(struct inode *old_dir,
struct dentry *old_dentry, struct inode *new_dir,
struct dentry *new_dentry)
{
return scoutfs_rename_common(old_dir, old_dentry, new_dir, new_dentry, 0);
return scoutfs_rename_common(KC_VFS_INIT_NS
old_dir, old_dentry, new_dir, new_dentry, 0);
}
#endif
static int scoutfs_rename2(struct inode *old_dir,
static int scoutfs_rename2(KC_VFS_NS_DEF
struct inode *old_dir,
struct dentry *old_dentry, struct inode *new_dir,
struct dentry *new_dentry, unsigned int flags)
{
if (flags & ~RENAME_NOREPLACE)
return -EINVAL;
return scoutfs_rename_common(old_dir, old_dentry, new_dir, new_dentry, flags);
return scoutfs_rename_common(KC_VFS_NS
old_dir, old_dentry, new_dir, new_dentry, flags);
}
#ifdef KC_FMODE_KABI_ITERATE
@@ -1848,8 +1874,18 @@ static int scoutfs_dir_open(struct inode *inode, struct file *file)
}
#endif
static int scoutfs_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
static int scoutfs_tmpfile(KC_VFS_NS_DEF
struct inode *dir,
#ifdef KC_D_TMPFILE_DENTRY
struct dentry *dentry,
#else
struct file *file,
#endif
umode_t mode)
{
#ifndef KC_D_TMPFILE_DENTRY
struct dentry *dentry = file->f_path.dentry;
#endif
struct super_block *sb = dir->i_sb;
struct inode *inode = NULL;
struct scoutfs_lock *dir_lock = NULL;
@@ -1876,7 +1912,11 @@ static int scoutfs_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mod
si->crtime = inode->i_mtime;
insert_inode_hash(inode);
ihold(inode); /* need to update inode modifications in d_tmpfile */
#ifdef KC_D_TMPFILE_DENTRY
d_tmpfile(dentry, inode);
#else
d_tmpfile(file, inode);
#endif
inode_inc_iversion(inode);
scoutfs_forest_inc_inode_count(sb);
@@ -1884,6 +1924,10 @@ static int scoutfs_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mod
scoutfs_update_inode_item(dir, dir_lock, &ind_locks);
scoutfs_inode_index_unlock(sb, &ind_locks);
#ifndef KC_D_TMPFILE_DENTRY
ret = finish_open_simple(file, 0);
#endif
out:
scoutfs_release_trans(sb);
scoutfs_inode_index_unlock(sb, &ind_locks);

View File

@@ -105,12 +105,12 @@ static ssize_t elapsed_secs_show(struct kobject *kobj,
{
DECLARE_FENCE_FROM_KOBJ(fence, kobj);
ktime_t now = ktime_get();
struct timeval tv = { 0, };
ktime_t t = ns_to_ktime(0);
if (ktime_after(now, fence->start_kt))
tv = ktime_to_timeval(ktime_sub(now, fence->start_kt));
t = ktime_sub(now, fence->start_kt);
return snprintf(buf, PAGE_SIZE, "%llu", (long long)tv.tv_sec);
return snprintf(buf, PAGE_SIZE, "%llu", (long long)ktime_divns(t, NSEC_PER_SEC));
}
SCOUTFS_ATTR_RO(elapsed_secs);

View File

@@ -28,6 +28,7 @@
#include "inode.h"
#include "per_task.h"
#include "omap.h"
#include "quota.h"
#ifdef KC_LINUX_HAVE_FOP_AIO_READ
/*
@@ -108,6 +109,10 @@ retry:
if (ret)
goto out;
ret = scoutfs_inode_check_retention(inode);
if (ret < 0)
goto out;
ret = scoutfs_complete_truncate(inode, scoutfs_inode_lock);
if (ret)
goto out;
@@ -122,6 +127,10 @@ retry:
goto out;
}
ret = scoutfs_quota_check_data(sb, inode);
if (ret)
goto out;
/* XXX: remove SUID bit */
ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
@@ -171,10 +180,8 @@ retry:
goto out;
if (scoutfs_per_task_add_excl(&si->pt_data_lock, &pt_ent, scoutfs_inode_lock)) {
ret = scoutfs_data_wait_check_iter(inode, iocb->ki_pos, to,
SEF_OFFLINE,
SCOUTFS_IOC_DWO_READ,
&dw, scoutfs_inode_lock);
ret = scoutfs_data_wait_check(inode, iocb->ki_pos, iov_iter_count(to), SEF_OFFLINE,
SCOUTFS_IOC_DWO_READ, &dw, scoutfs_inode_lock);
if (ret != 0)
goto out;
} else {
@@ -205,8 +212,7 @@ ssize_t scoutfs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
struct scoutfs_lock *scoutfs_inode_lock = NULL;
SCOUTFS_DECLARE_PER_TASK_ENTRY(pt_ent);
DECLARE_DATA_WAIT(dw);
int ret;
int written;
ssize_t ret;
retry:
inode_lock(inode);
@@ -219,23 +225,29 @@ retry:
if (ret <= 0)
goto out;
ret = scoutfs_inode_check_retention(inode);
if (ret < 0)
goto out;
ret = scoutfs_complete_truncate(inode, scoutfs_inode_lock);
if (ret)
goto out;
ret = scoutfs_quota_check_data(sb, inode);
if (ret)
goto out;
if (scoutfs_per_task_add_excl(&si->pt_data_lock, &pt_ent, scoutfs_inode_lock)) {
/* data_version is per inode, whole file must be online */
ret = scoutfs_data_wait_check_iter(inode, iocb->ki_pos, from,
SEF_OFFLINE,
SCOUTFS_IOC_DWO_WRITE,
&dw, scoutfs_inode_lock);
ret = scoutfs_data_wait_check(inode, 0, i_size_read(inode), SEF_OFFLINE,
SCOUTFS_IOC_DWO_WRITE, &dw, scoutfs_inode_lock);
if (ret != 0)
goto out;
}
/* XXX: remove SUID bit */
written = __generic_file_write_iter(iocb, from);
ret = __generic_file_write_iter(iocb, from);
out:
scoutfs_per_task_del(&si->pt_data_lock, &pt_ent);
@@ -248,14 +260,15 @@ out:
goto retry;
}
if (ret > 0 || ret == -EIOCBQUEUED)
ret = generic_write_sync(iocb, written);
if (ret > 0)
ret = generic_write_sync(iocb, ret);
return written ? written : ret;
return ret;
}
#endif
int scoutfs_permission(struct inode *inode, int mask)
int scoutfs_permission(KC_VFS_NS_DEF
struct inode *inode, int mask)
{
struct super_block *sb = inode->i_sb;
struct scoutfs_lock *inode_lock = NULL;
@@ -269,7 +282,8 @@ int scoutfs_permission(struct inode *inode, int mask)
if (ret)
return ret;
ret = generic_permission(inode, mask);
ret = generic_permission(KC_VFS_INIT_NS
inode, mask);
scoutfs_unlock(sb, inode_lock, SCOUTFS_LOCK_READ);

View File

@@ -10,7 +10,8 @@ ssize_t scoutfs_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
ssize_t scoutfs_file_read_iter(struct kiocb *, struct iov_iter *);
ssize_t scoutfs_file_write_iter(struct kiocb *, struct iov_iter *);
#endif
int scoutfs_permission(struct inode *inode, int mask);
int scoutfs_permission(KC_VFS_NS_DEF
struct inode *inode, int mask);
loff_t scoutfs_file_llseek(struct file *file, loff_t offset, int whence);
#endif /* _SCOUTFS_FILE_H_ */

View File

@@ -238,19 +238,16 @@ static int forest_read_items(struct super_block *sb, struct scoutfs_key *key, u6
* We return -ESTALE if we hit stale blocks to give the caller a chance
* to reset their state and retry with a newer version of the btrees.
*/
int scoutfs_forest_read_items(struct super_block *sb,
struct scoutfs_key *key,
struct scoutfs_key *bloom_key,
struct scoutfs_key *start,
struct scoutfs_key *end,
scoutfs_forest_item_cb cb, void *arg)
int scoutfs_forest_read_items_roots(struct super_block *sb, struct scoutfs_net_roots *roots,
struct scoutfs_key *key, struct scoutfs_key *bloom_key,
struct scoutfs_key *start, struct scoutfs_key *end,
scoutfs_forest_item_cb cb, void *arg)
{
struct forest_read_items_data rid = {
.cb = cb,
.cb_arg = arg,
};
struct scoutfs_log_trees lt;
struct scoutfs_net_roots roots;
struct scoutfs_bloom_block *bb;
struct forest_bloom_nrs bloom;
SCOUTFS_BTREE_ITEM_REF(iref);
@@ -264,18 +261,14 @@ int scoutfs_forest_read_items(struct super_block *sb,
scoutfs_inc_counter(sb, forest_read_items);
calc_bloom_nrs(&bloom, bloom_key);
ret = scoutfs_client_get_roots(sb, &roots);
if (ret)
goto out;
trace_scoutfs_forest_using_roots(sb, &roots.fs_root, &roots.logs_root);
trace_scoutfs_forest_using_roots(sb, &roots->fs_root, &roots->logs_root);
*start = orig_start;
*end = orig_end;
/* start with fs root items */
rid.fic |= FIC_FS_ROOT;
ret = scoutfs_btree_read_items(sb, &roots.fs_root, key, start, end,
ret = scoutfs_btree_read_items(sb, &roots->fs_root, key, start, end,
forest_read_items, &rid);
if (ret < 0)
goto out;
@@ -283,7 +276,7 @@ int scoutfs_forest_read_items(struct super_block *sb,
scoutfs_key_init_log_trees(&ltk, 0, 0);
for (;; scoutfs_key_inc(&ltk)) {
ret = scoutfs_btree_next(sb, &roots.logs_root, &ltk, &iref);
ret = scoutfs_btree_next(sb, &roots->logs_root, &ltk, &iref);
if (ret == 0) {
if (iref.val_len == sizeof(lt)) {
ltk = *iref.key;
@@ -340,6 +333,23 @@ out:
return ret;
}
int scoutfs_forest_read_items(struct super_block *sb,
struct scoutfs_key *key,
struct scoutfs_key *bloom_key,
struct scoutfs_key *start,
struct scoutfs_key *end,
scoutfs_forest_item_cb cb, void *arg)
{
struct scoutfs_net_roots roots;
int ret;
ret = scoutfs_client_get_roots(sb, &roots);
if (ret == 0)
ret = scoutfs_forest_read_items_roots(sb, &roots, key, bloom_key, start, end,
cb, arg);
return ret;
}
/*
* If the items are deltas then combine the src with the destination
* value and store the result in the destination.
@@ -721,7 +731,8 @@ static void scoutfs_forest_log_merge_worker(struct work_struct *work)
ret = scoutfs_btree_merge(sb, &alloc, &wri, &req.start, &req.end,
&next, &comp.root, &inputs,
!!(req.flags & cpu_to_le64(SCOUTFS_LOG_MERGE_REQUEST_SUBTREE)),
SCOUTFS_LOG_MERGE_DIRTY_BYTE_LIMIT, 10);
SCOUTFS_LOG_MERGE_DIRTY_BYTE_LIMIT, 10,
(2 * 1024 * 1024));
if (ret == -ERANGE) {
comp.remain = next;
le64_add_cpu(&comp.flags, SCOUTFS_LOG_MERGE_COMP_REMAIN);

View File

@@ -4,6 +4,7 @@
struct scoutfs_alloc;
struct scoutfs_block_writer;
struct scoutfs_block;
struct scoutfs_lock;
#include "btree.h"
@@ -23,6 +24,10 @@ int scoutfs_forest_read_items(struct super_block *sb,
struct scoutfs_key *start,
struct scoutfs_key *end,
scoutfs_forest_item_cb cb, void *arg);
int scoutfs_forest_read_items_roots(struct super_block *sb, struct scoutfs_net_roots *roots,
struct scoutfs_key *key, struct scoutfs_key *bloom_key,
struct scoutfs_key *start, struct scoutfs_key *end,
scoutfs_forest_item_cb cb, void *arg);
int scoutfs_forest_set_bloom_bits(struct super_block *sb,
struct scoutfs_lock *lock);
void scoutfs_forest_set_max_seq(struct super_block *sb, u64 max_seq);

View File

@@ -8,9 +8,14 @@
*/
#define SCOUTFS_FORMAT_VERSION_MIN 1
#define SCOUTFS_FORMAT_VERSION_MIN_STR __stringify(SCOUTFS_FORMAT_VERSION_MIN)
#define SCOUTFS_FORMAT_VERSION_MAX 1
#define SCOUTFS_FORMAT_VERSION_MAX 2
#define SCOUTFS_FORMAT_VERSION_MAX_STR __stringify(SCOUTFS_FORMAT_VERSION_MAX)
#define SCOUTFS_FORMAT_VERSION_FEAT_RETENTION 2
#define SCOUTFS_FORMAT_VERSION_FEAT_PROJECT_ID 2
#define SCOUTFS_FORMAT_VERSION_FEAT_QUOTA 2
#define SCOUTFS_FORMAT_VERSION_FEAT_INDX_TAG 2
/* statfs(2) f_type */
#define SCOUTFS_SUPER_MAGIC 0x554f4353 /* "SCOU" */
@@ -175,6 +180,10 @@ struct scoutfs_key {
#define sko_rid _sk_first
#define sko_ino _sk_second
/* quota rules */
#define skqr_hash _sk_second
#define skqr_coll_nr _sk_third
/* xattr totl */
#define skxt_a _sk_first
#define skxt_b _sk_second
@@ -585,7 +594,9 @@ struct scoutfs_log_merge_freeing {
*/
#define SCOUTFS_INODE_INDEX_ZONE 4
#define SCOUTFS_ORPHAN_ZONE 8
#define SCOUTFS_QUOTA_ZONE 10
#define SCOUTFS_XATTR_TOTL_ZONE 12
#define SCOUTFS_XATTR_INDX_ZONE 14
#define SCOUTFS_FS_ZONE 16
#define SCOUTFS_LOCK_ZONE 20
/* Items only stored in server btrees */
@@ -608,6 +619,9 @@ struct scoutfs_log_merge_freeing {
/* orphan zone, redundant type used for clarity */
#define SCOUTFS_ORPHAN_TYPE 4
/* quota zone */
#define SCOUTFS_QUOTA_RULE_TYPE 4
/* fs zone */
#define SCOUTFS_INODE_TYPE 4
#define SCOUTFS_XATTR_TYPE 8
@@ -661,6 +675,34 @@ struct scoutfs_xattr_totl_val {
__le64 count;
};
#define SQ_RF_TOTL_COUNT (1 << 0)
#define SQ_RF__UNKNOWN (~((1 << 1) - 1))
#define SQ_NS_LITERAL 0
#define SQ_NS_PROJ 1
#define SQ_NS_UID 2
#define SQ_NS_GID 3
#define SQ_NS__NR 4
#define SQ_NS__NR_SELECT (SQ_NS__NR - 1) /* !literal */
#define SQ_NF_SELECT (1 << 0)
#define SQ_NF__UNKNOWN (~((1 << 1) - 1))
#define SQ_OP_INODE 0
#define SQ_OP_DATA 1
#define SQ_OP__NR 2
struct scoutfs_quota_rule_val {
__le64 name_val[3];
__le64 limit;
__u8 prio;
__u8 op;
__u8 rule_flags;
__u8 name_source[3];
__u8 name_flags[3];
__u8 _pad[7];
};
/* XXX does this exist upstream somewhere? */
#define member_sizeof(TYPE, MEMBER) (sizeof(((TYPE *)0)->MEMBER))
@@ -859,9 +901,38 @@ struct scoutfs_inode {
struct scoutfs_timespec ctime;
struct scoutfs_timespec mtime;
struct scoutfs_timespec crtime;
__le64 proj;
};
#define SCOUTFS_INO_FLAG_TRUNCATE 0x1
#define SCOUTFS_INODE_FMT_V1_BYTES offsetof(struct scoutfs_inode, proj)
/*
* There are so few versions that we don't mind doing this work inline
* so that both utils and kernel can share these. Mounting has already
* checked that the format version is within the supported min and max,
* so these functions only deal with size variance within that band.
*/
/* Returns the native written inode size for the given format version, 0 for bad version */
static inline int scoutfs_inode_vers_bytes(__u64 fmt_vers)
{
if (fmt_vers == 1)
return SCOUTFS_INODE_FMT_V1_BYTES;
else
return sizeof(struct scoutfs_inode);
}
/*
* Returns true if bytes is a valid inode size to read from the given
* version. The given version must be greater than the version that
* introduced the size.
*/
static inline int scoutfs_inode_valid_vers_bytes(__u64 fmt_vers, int bytes)
{
return (bytes == sizeof(struct scoutfs_inode) && fmt_vers == SCOUTFS_FORMAT_VERSION_MAX) ||
(bytes == SCOUTFS_INODE_FMT_V1_BYTES);
}
#define SCOUTFS_INO_FLAG_TRUNCATE 0x1
#define SCOUTFS_INO_FLAG_RETENTION 0x2
#define SCOUTFS_ROOT_INO 1

View File

@@ -91,7 +91,7 @@ static void scoutfs_inode_ctor(void *obj)
init_rwsem(&si->extent_sem);
mutex_init(&si->item_mutex);
seqcount_init(&si->seqcount);
seqlock_init(&si->seqlock);
si->staging = false;
scoutfs_per_task_init(&si->pt_data_lock);
atomic64_set(&si->data_waitq.changed, 0);
@@ -250,7 +250,7 @@ static void set_item_info(struct scoutfs_inode_info *si,
set_item_major(si, SCOUTFS_INODE_INDEX_DATA_SEQ_TYPE, sinode->data_seq);
}
static void load_inode(struct inode *inode, struct scoutfs_inode *cinode)
static void load_inode(struct inode *inode, struct scoutfs_inode *cinode, int inode_bytes)
{
struct scoutfs_inode_info *si = SCOUTFS_I(inode);
@@ -278,6 +278,7 @@ static void load_inode(struct inode *inode, struct scoutfs_inode *cinode)
si->flags = le32_to_cpu(cinode->flags);
si->crtime.tv_sec = le64_to_cpu(cinode->crtime.sec);
si->crtime.tv_nsec = le32_to_cpu(cinode->crtime.nsec);
si->proj = le64_to_cpu(cinode->proj);
/*
* i_blocks is initialized from online and offline and is then
@@ -298,6 +299,24 @@ void scoutfs_inode_init_key(struct scoutfs_key *key, u64 ino)
};
}
/*
* Read an inode item into the caller's buffer and return the size that
* we read. Returns errors if the inode size is unsupported or doesn't
* make sense for the format version.
*/
static int lookup_inode_item(struct super_block *sb, struct scoutfs_key *key,
struct scoutfs_inode *sinode, struct scoutfs_lock *lock)
{
struct scoutfs_sb_info *sbi = SCOUTFS_SB(sb);
int ret;
ret = scoutfs_item_lookup_smaller_zero(sb, key, sinode, sizeof(struct scoutfs_inode), lock);
if (ret >= 0 && !scoutfs_inode_valid_vers_bytes(sbi->fmt_vers, ret))
return -EIO;
return ret;
}
/*
* Refresh the vfs inode fields if the lock indicates that the current
* contents could be stale.
@@ -333,12 +352,12 @@ int scoutfs_inode_refresh(struct inode *inode, struct scoutfs_lock *lock)
mutex_lock(&si->item_mutex);
if (atomic64_read(&si->last_refreshed) < refresh_gen) {
ret = scoutfs_item_lookup_exact(sb, &key, &sinode,
sizeof(sinode), lock);
if (ret == 0) {
load_inode(inode, &sinode);
ret = lookup_inode_item(sb, &key, &sinode, lock);
if (ret > 0) {
load_inode(inode, &sinode, ret);
atomic64_set(&si->last_refreshed, refresh_gen);
scoutfs_lock_add_coverage(sb, lock, &si->ino_lock_cov);
ret = 0;
}
} else {
ret = 0;
@@ -354,7 +373,8 @@ int scoutfs_getattr(struct vfsmount *mnt, struct dentry *dentry,
{
struct inode *inode = dentry->d_inode;
#else
int scoutfs_getattr(const struct path *path, struct kstat *stat,
int scoutfs_getattr(KC_VFS_NS_DEF
const struct path *path, struct kstat *stat,
u32 request_mask, unsigned int query_flags)
{
struct inode *inode = d_inode(path->dentry);
@@ -366,7 +386,8 @@ int scoutfs_getattr(const struct path *path, struct kstat *stat,
ret = scoutfs_lock_inode(sb, SCOUTFS_LOCK_READ,
SCOUTFS_LKF_REFRESH_INODE, inode, &lock);
if (ret == 0) {
generic_fillattr(inode, stat);
generic_fillattr(KC_VFS_INIT_NS
inode, stat);
scoutfs_unlock(sb, lock, SCOUTFS_LOCK_READ);
}
return ret;
@@ -464,7 +485,8 @@ int scoutfs_complete_truncate(struct inode *inode, struct scoutfs_lock *lock)
* re-acquire it. Ideally we'd fix this so that we can acquire the lock
* instead of the caller.
*/
int scoutfs_setattr(struct dentry *dentry, struct iattr *attr)
int scoutfs_setattr(KC_VFS_NS_DEF
struct dentry *dentry, struct iattr *attr)
{
struct inode *inode = dentry->d_inode;
struct super_block *sb = inode->i_sb;
@@ -482,10 +504,15 @@ retry:
SCOUTFS_LKF_REFRESH_INODE, inode, &lock);
if (ret)
return ret;
ret = setattr_prepare(dentry, attr);
ret = setattr_prepare(KC_VFS_INIT_NS
dentry, attr);
if (ret)
goto out;
ret = scoutfs_inode_check_retention(inode);
if (ret < 0)
goto out;
attr_size = (attr->ia_valid & ATTR_SIZE) ? attr->ia_size :
i_size_read(inode);
@@ -542,7 +569,8 @@ retry:
if (ret < 0)
goto release;
setattr_copy(inode, attr);
setattr_copy(KC_VFS_INIT_NS
inode, attr);
inode_inc_iversion(inode);
scoutfs_update_inode_item(inode, lock, &ind_locks);
@@ -566,11 +594,9 @@ static void set_trans_seq(struct inode *inode, u64 *seq)
struct scoutfs_sb_info *sbi = SCOUTFS_SB(sb);
if (*seq != sbi->trans_seq) {
preempt_disable();
write_seqcount_begin(&si->seqcount);
write_seqlock(&si->seqlock);
*seq = sbi->trans_seq;
write_seqcount_end(&si->seqcount);
preempt_enable();
write_sequnlock(&si->seqlock);
}
}
@@ -592,22 +618,18 @@ void scoutfs_inode_inc_data_version(struct inode *inode)
{
struct scoutfs_inode_info *si = SCOUTFS_I(inode);
preempt_disable();
write_seqcount_begin(&si->seqcount);
write_seqlock(&si->seqlock);
si->data_version++;
write_seqcount_end(&si->seqcount);
preempt_enable();
write_sequnlock(&si->seqlock);
}
void scoutfs_inode_set_data_version(struct inode *inode, u64 data_version)
{
struct scoutfs_inode_info *si = SCOUTFS_I(inode);
preempt_disable();
write_seqcount_begin(&si->seqcount);
write_seqlock(&si->seqlock);
si->data_version = data_version;
write_seqcount_end(&si->seqcount);
preempt_enable();
write_sequnlock(&si->seqlock);
}
void scoutfs_inode_add_onoff(struct inode *inode, s64 on, s64 off)
@@ -616,8 +638,7 @@ void scoutfs_inode_add_onoff(struct inode *inode, s64 on, s64 off)
if (inode && (on || off)) {
si = SCOUTFS_I(inode);
preempt_disable();
write_seqcount_begin(&si->seqcount);
write_seqlock(&si->seqlock);
/* inode and extents out of sync, bad callers */
if (((s64)si->online_blocks + on < 0) ||
@@ -638,8 +659,7 @@ void scoutfs_inode_add_onoff(struct inode *inode, s64 on, s64 off)
si->online_blocks,
si->offline_blocks);
write_seqcount_end(&si->seqcount);
preempt_enable();
write_sequnlock(&si->seqlock);
}
/* any time offline extents decreased we try and wake waiters */
@@ -647,16 +667,16 @@ void scoutfs_inode_add_onoff(struct inode *inode, s64 on, s64 off)
scoutfs_data_wait_changed(inode);
}
static u64 read_seqcount_u64(struct inode *inode, u64 *val)
static u64 read_seqlock_u64(struct inode *inode, u64 *val)
{
struct scoutfs_inode_info *si = SCOUTFS_I(inode);
unsigned int seq;
unsigned seq;
u64 v;
do {
seq = read_seqcount_begin(&si->seqcount);
seq = read_seqbegin(&si->seqlock);
v = *val;
} while (read_seqcount_retry(&si->seqcount, seq));
} while (read_seqretry(&si->seqlock, seq));
return v;
}
@@ -665,33 +685,82 @@ u64 scoutfs_inode_meta_seq(struct inode *inode)
{
struct scoutfs_inode_info *si = SCOUTFS_I(inode);
return read_seqcount_u64(inode, &si->meta_seq);
return read_seqlock_u64(inode, &si->meta_seq);
}
u64 scoutfs_inode_data_seq(struct inode *inode)
{
struct scoutfs_inode_info *si = SCOUTFS_I(inode);
return read_seqcount_u64(inode, &si->data_seq);
return read_seqlock_u64(inode, &si->data_seq);
}
u64 scoutfs_inode_data_version(struct inode *inode)
{
struct scoutfs_inode_info *si = SCOUTFS_I(inode);
return read_seqcount_u64(inode, &si->data_version);
return read_seqlock_u64(inode, &si->data_version);
}
void scoutfs_inode_get_onoff(struct inode *inode, s64 *on, s64 *off)
{
struct scoutfs_inode_info *si = SCOUTFS_I(inode);
unsigned int seq;
unsigned seq;
do {
seq = read_seqcount_begin(&si->seqcount);
seq = read_seqbegin(&si->seqlock);
*on = SCOUTFS_I(inode)->online_blocks;
*off = SCOUTFS_I(inode)->offline_blocks;
} while (read_seqcount_retry(&si->seqcount, seq));
} while (read_seqretry(&si->seqlock, seq));
}
/*
* Get our private scoutfs inode flags, not the vfs i_flags.
*/
u32 scoutfs_inode_get_flags(struct inode *inode)
{
struct scoutfs_inode_info *si = SCOUTFS_I(inode);
unsigned seq;
u32 flags;
do {
seq = read_seqbegin(&si->seqlock);
flags = si->flags;
} while (read_seqretry(&si->seqlock, seq));
return flags;
}
void scoutfs_inode_set_flags(struct inode *inode, u32 and, u32 or)
{
struct scoutfs_inode_info *si = SCOUTFS_I(inode);
write_seqlock(&si->seqlock);
si->flags = (si->flags & and) | or;
write_sequnlock(&si->seqlock);
}
u64 scoutfs_inode_get_proj(struct inode *inode)
{
struct scoutfs_inode_info *si = SCOUTFS_I(inode);
unsigned seq;
u64 proj;
do {
seq = read_seqbegin(&si->seqlock);
proj = si->proj;
} while (read_seqretry(&si->seqlock, seq));
return proj;
}
void scoutfs_inode_set_proj(struct inode *inode, u64 proj)
{
struct scoutfs_inode_info *si = SCOUTFS_I(inode);
write_seqlock(&si->seqlock);
si->proj = proj;
write_sequnlock(&si->seqlock);
}
static int scoutfs_iget_test(struct inode *inode, void *arg)
@@ -803,7 +872,7 @@ out:
return inode;
}
static void store_inode(struct scoutfs_inode *cinode, struct inode *inode)
static void store_inode(struct scoutfs_inode *cinode, struct inode *inode, int inode_bytes)
{
struct scoutfs_inode_info *si = SCOUTFS_I(inode);
u64 online_blocks;
@@ -839,6 +908,7 @@ static void store_inode(struct scoutfs_inode *cinode, struct inode *inode)
cinode->crtime.sec = cpu_to_le64(si->crtime.tv_sec);
cinode->crtime.nsec = cpu_to_le32(si->crtime.tv_nsec);
memset(cinode->crtime.__pad, 0, sizeof(cinode->crtime.__pad));
cinode->proj = cpu_to_le64(si->proj);
}
/*
@@ -862,15 +932,18 @@ static void store_inode(struct scoutfs_inode *cinode, struct inode *inode)
int scoutfs_dirty_inode_item(struct inode *inode, struct scoutfs_lock *lock)
{
struct super_block *sb = inode->i_sb;
struct scoutfs_sb_info *sbi = SCOUTFS_SB(sb);
struct scoutfs_inode sinode;
struct scoutfs_key key;
int inode_bytes;
int ret;
store_inode(&sinode, inode);
inode_bytes = scoutfs_inode_vers_bytes(sbi->fmt_vers);
store_inode(&sinode, inode, inode_bytes);
scoutfs_inode_init_key(&key, scoutfs_ino(inode));
ret = scoutfs_item_update(sb, &key, &sinode, sizeof(sinode), lock);
ret = scoutfs_item_update(sb, &key, &sinode, inode_bytes, lock);
if (!ret)
trace_scoutfs_dirty_inode(inode);
return ret;
@@ -911,10 +984,10 @@ static bool inode_has_index(umode_t mode, u8 type)
}
}
static int cmp_index_lock(void *priv, struct list_head *A, struct list_head *B)
static int cmp_index_lock(void *priv, KC_LIST_CMP_CONST struct list_head *A, KC_LIST_CMP_CONST struct list_head *B)
{
struct index_lock *a = list_entry(A, struct index_lock, head);
struct index_lock *b = list_entry(B, struct index_lock, head);
KC_LIST_CMP_CONST struct index_lock *a = list_entry(A, KC_LIST_CMP_CONST struct index_lock, head);
KC_LIST_CMP_CONST struct index_lock *b = list_entry(B, KC_LIST_CMP_CONST struct index_lock, head);
return ((int)a->type - (int)b->type) ?:
scoutfs_cmp_u64s(a->major, b->major) ?:
@@ -1072,9 +1145,11 @@ void scoutfs_update_inode_item(struct inode *inode, struct scoutfs_lock *lock,
{
struct scoutfs_inode_info *si = SCOUTFS_I(inode);
struct super_block *sb = inode->i_sb;
struct scoutfs_sb_info *sbi = SCOUTFS_SB(sb);
const u64 ino = scoutfs_ino(inode);
struct scoutfs_key key;
struct scoutfs_inode sinode;
struct scoutfs_key key;
int inode_bytes;
int ret;
int err;
@@ -1083,15 +1158,17 @@ void scoutfs_update_inode_item(struct inode *inode, struct scoutfs_lock *lock,
/* set the meta version once per trans for any inode updates */
scoutfs_inode_set_meta_seq(inode);
inode_bytes = scoutfs_inode_vers_bytes(sbi->fmt_vers);
/* only race with other inode field stores once */
store_inode(&sinode, inode);
store_inode(&sinode, inode, inode_bytes);
ret = update_indices(sb, si, ino, inode->i_mode, &sinode, lock_list, lock);
BUG_ON(ret);
scoutfs_inode_init_key(&key, ino);
err = scoutfs_item_update(sb, &key, &sinode, sizeof(sinode), lock);
err = scoutfs_item_update(sb, &key, &sinode, inode_bytes, lock);
if (err) {
scoutfs_err(sb, "inode %llu update err %d", ino, err);
BUG_ON(err);
@@ -1459,10 +1536,12 @@ out:
int scoutfs_new_inode(struct super_block *sb, struct inode *dir, umode_t mode, dev_t rdev,
u64 ino, struct scoutfs_lock *lock, struct inode **inode_ret)
{
struct scoutfs_sb_info *sbi = SCOUTFS_SB(sb);
struct scoutfs_inode_info *si;
struct scoutfs_key key;
struct scoutfs_inode sinode;
struct scoutfs_key key;
struct inode *inode;
int inode_bytes;
int ret;
inode = new_inode(sb);
@@ -1478,6 +1557,7 @@ int scoutfs_new_inode(struct super_block *sb, struct inode *dir, umode_t mode, d
si->offline_blocks = 0;
si->next_readdir_pos = SCOUTFS_DIRENT_FIRST_POS;
si->next_xattr_id = 0;
si->proj = 0;
si->have_item = false;
atomic64_set(&si->last_refreshed, lock->refresh_gen);
scoutfs_lock_add_coverage(sb, lock, &si->ino_lock_cov);
@@ -1487,20 +1567,23 @@ int scoutfs_new_inode(struct super_block *sb, struct inode *dir, umode_t mode, d
scoutfs_inode_set_data_seq(inode);
inode->i_ino = ino; /* XXX overflow */
inode_init_owner(inode, dir, mode);
inode_init_owner(KC_VFS_INIT_NS
inode, dir, mode);
inode_set_bytes(inode, 0);
inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode);
inode->i_rdev = rdev;
set_inode_ops(inode);
store_inode(&sinode, inode);
inode_bytes = scoutfs_inode_vers_bytes(sbi->fmt_vers);
store_inode(&sinode, inode, inode_bytes);
scoutfs_inode_init_key(&key, scoutfs_ino(inode));
ret = scoutfs_omap_set(sb, ino);
if (ret < 0)
goto out;
ret = scoutfs_item_create(sb, &key, &sinode, sizeof(sinode), lock);
ret = scoutfs_item_create(sb, &key, &sinode, inode_bytes, lock);
if (ret < 0)
scoutfs_omap_clear(sb, ino);
out:
@@ -1754,7 +1837,7 @@ static int try_delete_inode_items(struct super_block *sb, u64 ino)
}
scoutfs_inode_init_key(&key, ino);
ret = scoutfs_item_lookup_exact(sb, &key, &sinode, sizeof(sinode), lock);
ret = lookup_inode_item(sb, &key, &sinode, lock);
if (ret < 0) {
if (ret == -ENOENT)
ret = 0;
@@ -2143,6 +2226,17 @@ out:
return ret;
}
/*
* Return an error if the inode has the retention flag set and can not
* be modified. This mimics the errno returned by the vfs whan an
* inode's immutable flag is set. The flag won't be set on older format
* versions so we don't check the mounted format version here.
*/
int scoutfs_inode_check_retention(struct inode *inode)
{
return (scoutfs_inode_get_flags(inode) & SCOUTFS_INO_FLAG_RETENTION) ? -EPERM : 0;
}
int scoutfs_inode_setup(struct super_block *sb)
{
struct scoutfs_sb_info *sbi = SCOUTFS_SB(sb);

View File

@@ -21,6 +21,7 @@ struct scoutfs_inode_info {
u64 data_version;
u64 online_blocks;
u64 offline_blocks;
u64 proj;
u32 flags;
struct kc_timespec crtime;
@@ -47,7 +48,7 @@ struct scoutfs_inode_info {
atomic64_t last_refreshed;
/* initialized once for slab object */
seqcount_t seqcount;
seqlock_t seqlock;
bool staging; /* holder of i_mutex is staging */
struct scoutfs_per_task pt_data_lock;
struct scoutfs_data_waitq data_waitq;
@@ -120,17 +121,26 @@ u64 scoutfs_inode_meta_seq(struct inode *inode);
u64 scoutfs_inode_data_seq(struct inode *inode);
u64 scoutfs_inode_data_version(struct inode *inode);
void scoutfs_inode_get_onoff(struct inode *inode, s64 *on, s64 *off);
u32 scoutfs_inode_get_flags(struct inode *inode);
void scoutfs_inode_set_flags(struct inode *inode, u32 and, u32 or);
u64 scoutfs_inode_get_proj(struct inode *inode);
void scoutfs_inode_set_proj(struct inode *inode, u64 proj);
int scoutfs_complete_truncate(struct inode *inode, struct scoutfs_lock *lock);
int scoutfs_inode_check_retention(struct inode *inode);
int scoutfs_inode_refresh(struct inode *inode, struct scoutfs_lock *lock);
#ifdef KC_LINUX_HAVE_RHEL_IOPS_WRAPPER
int scoutfs_getattr(struct vfsmount *mnt, struct dentry *dentry,
struct kstat *stat);
#else
int scoutfs_getattr(const struct path *path, struct kstat *stat,
int scoutfs_getattr(KC_VFS_NS_DEF
const struct path *path, struct kstat *stat,
u32 request_mask, unsigned int query_flags);
#endif
int scoutfs_setattr(struct dentry *dentry, struct iattr *attr);
int scoutfs_setattr(KC_VFS_NS_DEF
struct dentry *dentry, struct iattr *attr);
int scoutfs_inode_orphan_create(struct super_block *sb, u64 ino, struct scoutfs_lock *lock,
struct scoutfs_lock *primary);

View File

@@ -23,6 +23,7 @@
#include <linux/aio.h>
#include <linux/list_sort.h>
#include <linux/backing-dev.h>
#include <linux/overflow.h>
#include "format.h"
#include "key.h"
@@ -42,7 +43,12 @@
#include "alloc.h"
#include "server.h"
#include "counters.h"
#include "attr_x.h"
#include "totl.h"
#include "wkic.h"
#include "quota.h"
#include "scoutfs_trace.h"
#include "util.h"
/*
* We make inode index items coherent by locking fixed size regions of
@@ -284,6 +290,7 @@ static long scoutfs_ioc_release(struct file *file, unsigned long arg)
u64 online;
u64 offline;
u64 isize;
__u64 tmp;
int ret;
if (copy_from_user(&args, (void __user *)arg, sizeof(args)))
@@ -293,12 +300,11 @@ static long scoutfs_ioc_release(struct file *file, unsigned long arg)
if (args.length == 0)
return 0;
if (((args.offset + args.length) < args.offset) ||
if ((check_add_overflow(args.offset, args.length - 1, &tmp)) ||
(args.offset & SCOUTFS_BLOCK_SM_MASK) ||
(args.length & SCOUTFS_BLOCK_SM_MASK))
return -EINVAL;
ret = mnt_want_write_file(file);
if (ret)
return ret;
@@ -545,20 +551,41 @@ out:
static long scoutfs_ioc_stat_more(struct file *file, unsigned long arg)
{
struct inode *inode = file_inode(file);
struct scoutfs_inode_info *si = SCOUTFS_I(inode);
struct scoutfs_ioctl_stat_more stm;
struct scoutfs_ioctl_inode_attr_x *iax = NULL;
struct scoutfs_ioctl_stat_more *stm = NULL;
int ret;
stm.meta_seq = scoutfs_inode_meta_seq(inode);
stm.data_seq = scoutfs_inode_data_seq(inode);
stm.data_version = scoutfs_inode_data_version(inode);
scoutfs_inode_get_onoff(inode, &stm.online_blocks, &stm.offline_blocks);
stm.crtime_sec = si->crtime.tv_sec;
stm.crtime_nsec = si->crtime.tv_nsec;
iax = kmalloc(sizeof(struct scoutfs_ioctl_inode_attr_x), GFP_KERNEL);
stm = kmalloc(sizeof(struct scoutfs_ioctl_stat_more), GFP_KERNEL);
if (!iax || !stm) {
ret = -ENOMEM;
goto out;
}
if (copy_to_user((void __user *)arg, &stm, sizeof(stm)))
return -EFAULT;
iax->x_mask = SCOUTFS_IOC_IAX_META_SEQ | SCOUTFS_IOC_IAX_DATA_SEQ |
SCOUTFS_IOC_IAX_DATA_VERSION | SCOUTFS_IOC_IAX_ONLINE_BLOCKS |
SCOUTFS_IOC_IAX_OFFLINE_BLOCKS | SCOUTFS_IOC_IAX_CRTIME;
iax->x_flags = 0;
ret = scoutfs_get_attr_x(inode, iax);
if (ret < 0)
goto out;
return 0;
stm->meta_seq = iax->meta_seq;
stm->data_seq = iax->data_seq;
stm->data_version = iax->data_version;
stm->online_blocks = iax->online_blocks;
stm->offline_blocks = iax->offline_blocks;
stm->crtime_sec = iax->crtime_sec;
stm->crtime_nsec = iax->crtime_nsec;
if (copy_to_user((void __user *)arg, stm, sizeof(struct scoutfs_ioctl_stat_more)))
ret = -EFAULT;
else
ret = 0;
out:
kfree(iax);
kfree(stm);
return ret;
}
static bool inc_wrapped(u64 *ino, u64 *iblock)
@@ -615,24 +642,19 @@ static long scoutfs_ioc_data_waiting(struct file *file, unsigned long arg)
* This is used when restoring files, it lets the caller set all the
* inode attributes which are otherwise unreachable. Changing the file
* size can only be done for regular files with a data_version of 0.
*
* We unconditionally fill the iax attributes from the sm set and let
* set_attr_x check them.
*/
static long scoutfs_ioc_setattr_more(struct file *file, unsigned long arg)
{
struct inode *inode = file->f_inode;
struct scoutfs_inode_info *si = SCOUTFS_I(inode);
struct super_block *sb = inode->i_sb;
struct inode *inode = file_inode(file);
struct scoutfs_ioctl_setattr_more __user *usm = (void __user *)arg;
struct scoutfs_ioctl_inode_attr_x *iax = NULL;
struct scoutfs_ioctl_setattr_more sm;
struct scoutfs_lock *lock = NULL;
LIST_HEAD(ind_locks);
bool set_data_seq;
int ret;
if (!capable(CAP_SYS_ADMIN)) {
ret = -EPERM;
goto out;
}
if (!(file->f_mode & FMODE_WRITE)) {
ret = -EBADF;
goto out;
@@ -643,65 +665,41 @@ static long scoutfs_ioc_setattr_more(struct file *file, unsigned long arg)
goto out;
}
if ((sm.i_size > 0 && sm.data_version == 0) ||
((sm.flags & SCOUTFS_IOC_SETATTR_MORE_OFFLINE) && !sm.i_size) ||
(sm.flags & SCOUTFS_IOC_SETATTR_MORE_UNKNOWN)) {
if (sm.flags & SCOUTFS_IOC_SETATTR_MORE_UNKNOWN) {
ret = -EINVAL;
goto out;
}
iax = kzalloc(sizeof(struct scoutfs_ioctl_inode_attr_x), GFP_KERNEL);
if (!iax) {
ret = -ENOMEM;
goto out;
}
iax->x_mask = SCOUTFS_IOC_IAX_CTIME | SCOUTFS_IOC_IAX_CRTIME |
SCOUTFS_IOC_IAX_SIZE;
iax->data_version = sm.data_version;
iax->ctime_sec = sm.ctime_sec;
iax->ctime_nsec = sm.ctime_nsec;
iax->crtime_sec = sm.crtime_sec;
iax->crtime_nsec = sm.crtime_nsec;
iax->size = sm.i_size;
if (sm.flags & SCOUTFS_IOC_SETATTR_MORE_OFFLINE)
iax->x_flags |= SCOUTFS_IOC_IAX_F_SIZE_OFFLINE;
if (sm.data_version != 0)
iax->x_mask |= SCOUTFS_IOC_IAX_DATA_VERSION;
ret = mnt_want_write_file(file);
if (ret)
if (ret < 0)
goto out;
inode_lock(inode);
ret = scoutfs_set_attr_x(inode, iax);
ret = scoutfs_lock_inode(sb, SCOUTFS_LOCK_WRITE,
SCOUTFS_LKF_REFRESH_INODE, inode, &lock);
if (ret)
goto unlock;
/* can only change size/dv on untouched regular files */
if ((sm.i_size != 0 || sm.data_version != 0) &&
((!S_ISREG(inode->i_mode) ||
scoutfs_inode_data_version(inode) != 0))) {
ret = -EINVAL;
goto unlock;
}
/* create offline extents in potentially many transactions */
if (sm.flags & SCOUTFS_IOC_SETATTR_MORE_OFFLINE) {
ret = scoutfs_data_init_offline_extent(inode, sm.i_size, lock);
if (ret)
goto unlock;
}
/* setting only so we don't see 0 data seq with nonzero data_version */
set_data_seq = sm.data_version != 0 ? true : false;
ret = scoutfs_inode_index_lock_hold(inode, &ind_locks, set_data_seq, false);
if (ret)
goto unlock;
if (sm.data_version)
scoutfs_inode_set_data_version(inode, sm.data_version);
if (sm.i_size)
i_size_write(inode, sm.i_size);
inode->i_ctime.tv_sec = sm.ctime_sec;
inode->i_ctime.tv_nsec = sm.ctime_nsec;
si->crtime.tv_sec = sm.crtime_sec;
si->crtime.tv_nsec = sm.crtime_nsec;
scoutfs_update_inode_item(inode, lock, &ind_locks);
ret = 0;
scoutfs_release_trans(sb);
unlock:
scoutfs_inode_index_unlock(sb, &ind_locks);
scoutfs_unlock(sb, lock, SCOUTFS_LOCK_WRITE);
inode_unlock(inode);
mnt_drop_write_file(file);
out:
kfree(iax);
return ret;
}
@@ -720,7 +718,8 @@ static long scoutfs_ioc_listxattr_hidden(struct file *file, unsigned long arg)
int total = 0;
int ret;
ret = inode_permission(inode, MAY_READ);
ret = inode_permission(KC_VFS_INIT_NS
inode, MAY_READ);
if (ret < 0)
goto out;
@@ -958,6 +957,7 @@ static long scoutfs_ioc_move_blocks(struct file *file, unsigned long arg)
struct scoutfs_ioctl_move_blocks mb;
struct file *from_file;
struct inode *from;
u64 tmp;
int ret;
if (copy_from_user(&mb, umb, sizeof(mb)))
@@ -966,8 +966,8 @@ static long scoutfs_ioc_move_blocks(struct file *file, unsigned long arg)
if (mb.len == 0)
return 0;
if (mb.from_off + mb.len < mb.from_off ||
mb.to_off + mb.len < mb.to_off)
if ((check_add_overflow(mb.from_off, mb.len - 1, &tmp)) ||
(check_add_overflow(mb.to_off, mb.len - 1, &tmp)))
return -EOVERFLOW;
from_file = fget(mb.from_fd);
@@ -1035,124 +1035,32 @@ out:
return ret;
}
struct xattr_total_entry {
struct rb_node node;
struct scoutfs_ioctl_xattr_total xt;
u64 fs_seq;
u64 fs_total;
u64 fs_count;
u64 fin_seq;
u64 fin_total;
s64 fin_count;
u64 log_seq;
u64 log_total;
s64 log_count;
struct read_xattr_total_iter_cb_args {
struct scoutfs_ioctl_xattr_total *xt;
unsigned int copied;
unsigned int total;
};
static int cmp_xt_entry_name(const struct xattr_total_entry *a,
const struct xattr_total_entry *b)
{
return scoutfs_cmp_u64s(a->xt.name[0], b->xt.name[0]) ?:
scoutfs_cmp_u64s(a->xt.name[1], b->xt.name[1]) ?:
scoutfs_cmp_u64s(a->xt.name[2], b->xt.name[2]);
}
/*
* Record the contribution of the three classes of logged items we can
* see: the item in the fs_root, items from finalized log btrees, and
* items from active log btrees. Once we have the full set the caller
* can decide which of the items contribute to the total it sends to the
* user.
* This is called under an RCU read lock so it can't copy to userspace.
*/
static int read_xattr_total_item(struct super_block *sb, struct scoutfs_key *key,
u64 seq, u8 flags, void *val, int val_len, int fic, void *arg)
static int read_xattr_total_iter_cb(struct scoutfs_key *key, void *val, unsigned int val_len,
void *cb_arg)
{
struct read_xattr_total_iter_cb_args *cba = cb_arg;
struct scoutfs_xattr_totl_val *tval = val;
struct xattr_total_entry *ent;
struct xattr_total_entry rd;
struct rb_root *root = arg;
struct rb_node *parent;
struct rb_node **node;
int cmp;
struct scoutfs_ioctl_xattr_total *xt = &cba->xt[cba->copied];
rd.xt.name[0] = le64_to_cpu(key->skxt_a);
rd.xt.name[1] = le64_to_cpu(key->skxt_b);
rd.xt.name[2] = le64_to_cpu(key->skxt_c);
xt->name[0] = le64_to_cpu(key->skxt_a);
xt->name[1] = le64_to_cpu(key->skxt_b);
xt->name[2] = le64_to_cpu(key->skxt_c);
xt->total = le64_to_cpu(tval->total);
xt->count = le64_to_cpu(tval->count);
/* find entry matching name */
node = &root->rb_node;
parent = NULL;
cmp = -1;
while (*node) {
parent = *node;
ent = container_of(*node, struct xattr_total_entry, node);
/* sort merge items by key then newest to oldest */
cmp = cmp_xt_entry_name(&rd, ent);
if (cmp < 0)
node = &(*node)->rb_left;
else if (cmp > 0)
node = &(*node)->rb_right;
else
break;
}
/* allocate and insert new node if we need to */
if (cmp != 0) {
ent = kzalloc(sizeof(*ent), GFP_KERNEL);
if (!ent)
return -ENOMEM;
memcpy(&ent->xt.name, &rd.xt.name, sizeof(ent->xt.name));
rb_link_node(&ent->node, parent, node);
rb_insert_color(&ent->node, root);
}
if (fic & FIC_FS_ROOT) {
ent->fs_seq = seq;
ent->fs_total = le64_to_cpu(tval->total);
ent->fs_count = le64_to_cpu(tval->count);
} else if (fic & FIC_FINALIZED) {
ent->fin_seq = seq;
ent->fin_total += le64_to_cpu(tval->total);
ent->fin_count += le64_to_cpu(tval->count);
} else {
ent->log_seq = seq;
ent->log_total += le64_to_cpu(tval->total);
ent->log_count += le64_to_cpu(tval->count);
}
scoutfs_inc_counter(sb, totl_read_item);
return 0;
}
/* these are always _safe, node stores next */
#define for_each_xt_ent(ent, node, root) \
for (node = rb_first(root); \
node && (ent = rb_entry(node, struct xattr_total_entry, node), \
node = rb_next(node), 1); )
#define for_each_xt_ent_reverse(ent, node, root) \
for (node = rb_last(root); \
node && (ent = rb_entry(node, struct xattr_total_entry, node), \
node = rb_prev(node), 1); )
static void free_xt_ent(struct rb_root *root, struct xattr_total_entry *ent)
{
rb_erase(&ent->node, root);
kfree(ent);
}
static void free_all_xt_ents(struct rb_root *root)
{
struct xattr_total_entry *ent;
struct rb_node *node;
for_each_xt_ent(ent, node, root)
free_xt_ent(root, ent);
if (++cba->copied < cba->total)
return -EAGAIN;
else
return 0;
}
/*
@@ -1162,30 +1070,6 @@ static void free_all_xt_ents(struct rb_root *root)
* have been committed. It doesn't use locking to force commits and
* block writers so it can be a little bit out of date with respect to
* dirty xattrs in memory across the system.
*
* Our reader has to be careful because the log btree merging code can
* write partial results to the fs_root. This means that a reader can
* see both cases where new finalized logs should be applied to the old
* fs items and where old finalized logs have already been applied to
* the partially merged fs items. Currently active logged items are
* always applied on top of all cases.
*
* These cases are differentiated with a combination of sequence numbers
* in items, the count of contributing xattrs, and a flag
* differentiating finalized and active logged items. This lets us
* recognize all cases, including when finalized logs were merged and
* deleted the fs item.
*
* We're allocating a tracking struct for each totl name we see while
* traversing the item btrees. The forest reader is providing the items
* it finds in leaf blocks that contain the search key. In the worst
* case all of these blocks are full and none of the items overlap. At
* most, figure order a thousand names per mount. But in practice many
* of these factors fall away: leaf blocks aren't fill, leaf items
* overlap, there aren't finalized log btrees, and not all mounts are
* actively changing totals. We're much more likely to only read a
* leaf block's worth of totals that have been long since merged into
* the fs_root.
*/
static long scoutfs_ioc_read_xattr_totals(struct file *file, unsigned long arg)
{
@@ -1193,14 +1077,13 @@ static long scoutfs_ioc_read_xattr_totals(struct file *file, unsigned long arg)
struct scoutfs_ioctl_read_xattr_totals __user *urxt = (void __user *)arg;
struct scoutfs_ioctl_read_xattr_totals rxt;
struct scoutfs_ioctl_xattr_total __user *uxt;
struct xattr_total_entry *ent;
struct read_xattr_total_iter_cb_args cba = {NULL, };
struct scoutfs_key range_start;
struct scoutfs_key range_end;
struct scoutfs_key key;
struct scoutfs_key bloom_key;
struct scoutfs_key start;
struct scoutfs_key end;
struct rb_root root = RB_ROOT;
struct rb_node *node;
int count = 0;
unsigned int copied = 0;
unsigned int total;
unsigned int ready;
int ret;
if (!(file->f_mode & FMODE_READ)) {
@@ -1213,6 +1096,13 @@ static long scoutfs_ioc_read_xattr_totals(struct file *file, unsigned long arg)
goto out;
}
cba.xt = (void *)__get_free_page(GFP_KERNEL);
if (!cba.xt) {
ret = -ENOMEM;
goto out;
}
cba.total = PAGE_SIZE / sizeof(struct scoutfs_ioctl_xattr_total);
if (copy_from_user(&rxt, urxt, sizeof(rxt))) {
ret = -EFAULT;
goto out;
@@ -1225,101 +1115,40 @@ static long scoutfs_ioc_read_xattr_totals(struct file *file, unsigned long arg)
goto out;
}
scoutfs_key_set_zeros(&bloom_key);
bloom_key.sk_zone = SCOUTFS_XATTR_TOTL_ZONE;
scoutfs_xattr_init_totl_key(&start, rxt.pos_name);
total = div_u64(min_t(u64, rxt.totals_bytes, INT_MAX),
sizeof(struct scoutfs_ioctl_xattr_total));
while (rxt.totals_bytes >= sizeof(struct scoutfs_ioctl_xattr_total)) {
scoutfs_totl_set_range(&range_start, &range_end);
scoutfs_xattr_init_totl_key(&key, rxt.pos_name);
scoutfs_key_set_ones(&end);
end.sk_zone = SCOUTFS_XATTR_TOTL_ZONE;
if (scoutfs_key_compare(&start, &end) > 0)
while (copied < total) {
cba.copied = 0;
ret = scoutfs_wkic_iterate(sb, &key, &range_end, &range_start, &range_end,
read_xattr_total_iter_cb, &cba);
if (ret < 0)
goto out;
if (cba.copied == 0)
break;
key = start;
ret = scoutfs_forest_read_items(sb, &key, &bloom_key, &start, &end,
read_xattr_total_item, &root);
if (ret < 0) {
if (ret == -ESTALE) {
free_all_xt_ents(&root);
continue;
}
ready = min(total - copied, cba.copied);
if (copy_to_user(&uxt[copied], cba.xt, ready * sizeof(cba.xt[0]))) {
ret = -EFAULT;
goto out;
}
if (RB_EMPTY_ROOT(&root))
break;
/* trim totals that fall outside of the consistent range */
for_each_xt_ent(ent, node, &root) {
scoutfs_xattr_init_totl_key(&key, ent->xt.name);
if (scoutfs_key_compare(&key, &start) < 0) {
free_xt_ent(&root, ent);
} else {
break;
}
}
for_each_xt_ent_reverse(ent, node, &root) {
scoutfs_xattr_init_totl_key(&key, ent->xt.name);
if (scoutfs_key_compare(&key, &end) > 0) {
free_xt_ent(&root, ent);
} else {
break;
}
}
/* copy resulting unique non-zero totals to userspace */
for_each_xt_ent(ent, node, &root) {
if (rxt.totals_bytes < sizeof(ent->xt))
break;
/* start with the fs item if we have it */
if (ent->fs_seq != 0) {
ent->xt.total = ent->fs_total;
ent->xt.count = ent->fs_count;
scoutfs_inc_counter(sb, totl_read_fs);
}
/* apply finalized logs if they're newer or creating */
if (((ent->fs_seq != 0) && (ent->fin_seq > ent->fs_seq)) ||
((ent->fs_seq == 0) && (ent->fin_count > 0))) {
ent->xt.total += ent->fin_total;
ent->xt.count += ent->fin_count;
scoutfs_inc_counter(sb, totl_read_finalized);
}
/* always apply active logs which must be newer than fs and finalized */
if (ent->log_seq > 0) {
ent->xt.total += ent->log_total;
ent->xt.count += ent->log_count;
scoutfs_inc_counter(sb, totl_read_logged);
}
if (ent->xt.total != 0 || ent->xt.count != 0) {
if (copy_to_user(uxt, &ent->xt, sizeof(ent->xt))) {
ret = -EFAULT;
goto out;
}
uxt++;
rxt.totals_bytes -= sizeof(ent->xt);
count++;
scoutfs_inc_counter(sb, totl_read_copied);
}
free_xt_ent(&root, ent);
}
/* continue after the last possible key read */
start = end;
scoutfs_key_inc(&start);
scoutfs_xattr_init_totl_key(&key, cba.xt[ready - 1].name);
scoutfs_key_inc(&key);
copied += ready;
}
ret = 0;
out:
free_all_xt_ents(&root);
if (cba.xt)
free_page((long)cba.xt);
return ret ?: count;
return ret ?: copied;
}
static long scoutfs_ioc_get_allocated_inos(struct file *file, unsigned long arg)
@@ -1504,6 +1333,265 @@ out:
return nr ?: ret;
}
static long scoutfs_ioc_get_attr_x(struct file *file, unsigned long arg)
{
struct inode *inode = file_inode(file);
struct scoutfs_ioctl_inode_attr_x __user *uiax = (void __user *)arg;
struct scoutfs_ioctl_inode_attr_x *iax = NULL;
int ret;
iax = kmalloc(sizeof(struct scoutfs_ioctl_inode_attr_x), GFP_KERNEL);
if (!iax) {
ret = -ENOMEM;
goto out;
}
ret = get_user(iax->x_mask, &uiax->x_mask) ?:
get_user(iax->x_flags, &uiax->x_flags);
if (ret < 0)
goto out;
ret = scoutfs_get_attr_x(inode, iax);
if (ret < 0)
goto out;
/* only copy results after dropping cluster locks (could fault) */
if (ret > 0 && copy_to_user(uiax, iax, ret) != 0)
ret = -EFAULT;
else
ret = 0;
out:
kfree(iax);
return ret;
}
static long scoutfs_ioc_set_attr_x(struct file *file, unsigned long arg)
{
struct inode *inode = file_inode(file);
struct scoutfs_ioctl_inode_attr_x __user *uiax = (void __user *)arg;
struct scoutfs_ioctl_inode_attr_x *iax = NULL;
int ret;
iax = kmalloc(sizeof(struct scoutfs_ioctl_inode_attr_x), GFP_KERNEL);
if (!iax) {
ret = -ENOMEM;
goto out;
}
if (copy_from_user(iax, uiax, sizeof(struct scoutfs_ioctl_inode_attr_x))) {
ret = -EFAULT;
goto out;
}
ret = mnt_want_write_file(file);
if (ret < 0)
goto out;
ret = scoutfs_set_attr_x(inode, iax);
mnt_drop_write_file(file);
out:
kfree(iax);
return ret;
}
static long scoutfs_ioc_get_quota_rules(struct file *file, unsigned long arg)
{
struct super_block *sb = file_inode(file)->i_sb;
struct scoutfs_ioctl_get_quota_rules __user *ugqr = (void __user *)arg;
struct scoutfs_ioctl_get_quota_rules gqr;
struct scoutfs_ioctl_quota_rule __user *uirules;
struct scoutfs_ioctl_quota_rule *irules;
struct page *page = NULL;
int copied = 0;
int nr;
int ret;
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
if (copy_from_user(&gqr, ugqr, sizeof(gqr)))
return -EFAULT;
if (gqr.rules_nr == 0)
return 0;
uirules = (void __user *)gqr.rules_ptr;
/* limit rules copied per call */
gqr.rules_nr = min_t(u64, gqr.rules_nr, INT_MAX);
page = alloc_page(GFP_KERNEL | __GFP_ZERO);
if (!page) {
ret = -ENOMEM;
goto out;
}
irules = page_address(page);
while (copied < gqr.rules_nr) {
nr = min_t(u64, gqr.rules_nr - copied,
PAGE_SIZE / sizeof(struct scoutfs_ioctl_quota_rule));
ret = scoutfs_quota_get_rules(sb, gqr.iterator, page_address(page), nr);
if (ret <= 0)
goto out;
if (copy_to_user(&uirules[copied], irules, ret * sizeof(irules[0]))) {
ret = -EFAULT;
goto out;
}
copied += ret;
}
ret = 0;
out:
if (page)
__free_page(page);
if (ret == 0 && copy_to_user(ugqr->iterator, gqr.iterator, sizeof(gqr.iterator)))
ret = -EFAULT;
return ret ?: copied;
}
static long scoutfs_ioc_mod_quota_rule(struct file *file, unsigned long arg, bool is_add)
{
struct super_block *sb = file_inode(file)->i_sb;
struct scoutfs_ioctl_quota_rule __user *uirule = (void __user *)arg;
struct scoutfs_ioctl_quota_rule irule;
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
if (copy_from_user(&irule, uirule, sizeof(irule)))
return -EFAULT;
return scoutfs_quota_mod_rule(sb, is_add, &irule);
}
struct read_index_buf {
int nr;
int size;
struct scoutfs_ioctl_xattr_index_entry ents[0];
};
#define READ_INDEX_BUF_MAX_ENTS \
((PAGE_SIZE - sizeof(struct read_index_buf)) / \
sizeof(struct scoutfs_ioctl_xattr_index_entry))
/*
* This doesn't filter out duplicates, the caller filters them out to
* catch duplicates between iteration calls.
*/
static int read_index_cb(struct scoutfs_key *key, void *val, unsigned int val_len, void *cb_arg)
{
struct read_index_buf *rib = cb_arg;
struct scoutfs_ioctl_xattr_index_entry *ent = &rib->ents[rib->nr];
u64 xid;
if (val_len != 0)
return -EIO;
/* discard the xid, they're not exposed to ioctl callers */
scoutfs_xattr_get_indx_key(key, &ent->major, &ent->minor, &ent->ino, &xid);
if (++rib->nr == rib->size)
return rib->nr;
return -EAGAIN;
}
static long scoutfs_ioc_read_xattr_index(struct file *file, unsigned long arg)
{
struct super_block *sb = file_inode(file)->i_sb;
struct scoutfs_ioctl_read_xattr_index __user *urxi = (void __user *)arg;
struct scoutfs_ioctl_xattr_index_entry __user *uents;
struct scoutfs_ioctl_xattr_index_entry *ent;
struct scoutfs_ioctl_xattr_index_entry prev;
struct scoutfs_ioctl_read_xattr_index rxi;
struct read_index_buf *rib;
struct page *page = NULL;
struct scoutfs_key first;
struct scoutfs_key last;
struct scoutfs_key start;
struct scoutfs_key end;
int copied = 0;
int ret;
int i;
if (!capable(CAP_SYS_ADMIN)) {
ret = -EPERM;
goto out;
}
if (copy_from_user(&rxi, urxi, sizeof(rxi))) {
ret = -EFAULT;
goto out;
}
uents = (void __user *)rxi.entries_ptr;
rxi.entries_nr = min_t(u64, rxi.entries_nr, INT_MAX);
page = alloc_page(GFP_KERNEL);
if (!page) {
ret = -ENOMEM;
goto out;
}
rib = page_address(page);
scoutfs_xattr_init_indx_key(&first, rxi.first.major, rxi.first.minor, rxi.first.ino, 0);
scoutfs_xattr_init_indx_key(&last, rxi.last.major, rxi.last.minor, rxi.last.ino, U64_MAX);
scoutfs_xattr_indx_get_range(&start, &end);
if (scoutfs_key_compare(&first, &last) > 0) {
ret = -EINVAL;
goto out;
}
/* 0 ino doesn't exist, can't ever match entry to return */
memset(&prev, 0, sizeof(prev));
while (copied < rxi.entries_nr) {
rib->nr = 0;
rib->size = min_t(u64, rxi.entries_nr - copied, READ_INDEX_BUF_MAX_ENTS);
ret = scoutfs_wkic_iterate(sb, &first, &last, &start, &end,
read_index_cb, rib);
if (ret < 0)
goto out;
if (rib->nr == 0)
break;
/*
* Copy entries to userspace, skipping duplicate entries
* that can result from multiple xattrs indexing an
* inode at the same position and which can span
* multiple cache iterations. (Comparing in order of
* most likely to change to fail fast.)
*/
for (i = 0, ent = rib->ents; i < rib->nr; i++, ent++) {
if (ent->ino == prev.ino && ent->minor == prev.minor &&
ent->major == prev.major)
continue;
if (copy_to_user(&uents[copied], ent, sizeof(*ent))) {
ret = -EFAULT;
goto out;
}
prev = *ent;
copied++;
}
scoutfs_xattr_init_indx_key(&first, prev.major, prev.minor, prev.ino, U64_MAX);
scoutfs_key_inc(&first);
}
ret = copied;
out:
if (page)
__free_page(page);
return ret;
}
long scoutfs_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
{
switch (cmd) {
@@ -1541,6 +1629,18 @@ long scoutfs_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
return scoutfs_ioc_get_allocated_inos(file, arg);
case SCOUTFS_IOC_GET_REFERRING_ENTRIES:
return scoutfs_ioc_get_referring_entries(file, arg);
case SCOUTFS_IOC_GET_ATTR_X:
return scoutfs_ioc_get_attr_x(file, arg);
case SCOUTFS_IOC_SET_ATTR_X:
return scoutfs_ioc_set_attr_x(file, arg);
case SCOUTFS_IOC_GET_QUOTA_RULES:
return scoutfs_ioc_get_quota_rules(file, arg);
case SCOUTFS_IOC_ADD_QUOTA_RULE:
return scoutfs_ioc_mod_quota_rule(file, arg, true);
case SCOUTFS_IOC_DEL_QUOTA_RULE:
return scoutfs_ioc_mod_quota_rule(file, arg, false);
case SCOUTFS_IOC_READ_XATTR_INDEX:
return scoutfs_ioc_read_xattr_index(file, arg);
}
return -ENOTTY;

View File

@@ -673,4 +673,174 @@ struct scoutfs_ioctl_dirent {
#define SCOUTFS_IOC_GET_REFERRING_ENTRIES \
_IOW(SCOUTFS_IOCTL_MAGIC, 17, struct scoutfs_ioctl_get_referring_entries)
struct scoutfs_ioctl_inode_attr_x {
__u64 x_mask;
__u64 x_flags;
__u64 meta_seq;
__u64 data_seq;
__u64 data_version;
__u64 online_blocks;
__u64 offline_blocks;
__u64 ctime_sec;
__u32 ctime_nsec;
__u32 crtime_nsec;
__u64 crtime_sec;
__u64 size;
__u64 bits;
__u64 project_id;
};
/*
* Behavioral flags set in the x_flags field. These flags don't
* necessarily correspond to specific attributes, but instead change the
* behaviour of a _get_ or _set_ operation.
*
* @SCOUTFS_IOC_IAX_F_SIZE_OFFLINE: When setting i_size, also create
* extents which are marked offline for the region of the file from
* offset 0 to the new set size. This can only be set when setting the
* size and has no effect if setting the size fails.
*/
#define SCOUTFS_IOC_IAX_F_SIZE_OFFLINE (1ULL << 0)
#define SCOUTFS_IOC_IAX_F__UNKNOWN (U64_MAX << 1)
/*
* Single-bit values stored in the @bits field. These indicate whether
* the bit is set, or not. The main _IAX_ bits set in the mask indicate
* whether this value bit is populated by _get or stored by _set.
*/
#define SCOUTFS_IOC_IAX_B_RETENTION (1ULL << 0)
/*
* x_mask bits which indicate which attributes of the inode to populate
* on return for _get or to set on the inode for _set. Each mask bit
* corresponds to the matching named field in the attr_x struct passed
* to the _get_ and _set_ calls.
*
* Each field can have different permissions or other attribute
* requirements which can cause calls to fail. If _set_ fails then no
* other attribute changes will have been made by the same call.
*
* @SCOUTFS_IOC_IAX_RETENTION: Mark a file for retention. When marked,
* no modification can be made to the file other than changing extended
* attributes outside the "user." prefix and clearing the retention
* mark. This can only be set on regular files and requires root (the
* CAP_SYS_ADMIN capability). Other attributes can be set with a
* set_attr_x call on a retention inode as long as that call also
* successfully clears the retention mark.
*/
#define SCOUTFS_IOC_IAX_META_SEQ (1ULL << 0)
#define SCOUTFS_IOC_IAX_DATA_SEQ (1ULL << 1)
#define SCOUTFS_IOC_IAX_DATA_VERSION (1ULL << 2)
#define SCOUTFS_IOC_IAX_ONLINE_BLOCKS (1ULL << 3)
#define SCOUTFS_IOC_IAX_OFFLINE_BLOCKS (1ULL << 4)
#define SCOUTFS_IOC_IAX_CTIME (1ULL << 5)
#define SCOUTFS_IOC_IAX_CRTIME (1ULL << 6)
#define SCOUTFS_IOC_IAX_SIZE (1ULL << 7)
#define SCOUTFS_IOC_IAX_RETENTION (1ULL << 8)
#define SCOUTFS_IOC_IAX_PROJECT_ID (1ULL << 9)
/* single bit attributes that are packed in the bits field as _B_ */
#define SCOUTFS_IOC_IAX__BITS (SCOUTFS_IOC_IAX_RETENTION)
/* inverse of all the bits we understand */
#define SCOUTFS_IOC_IAX__UNKNOWN (U64_MAX << 10)
#define SCOUTFS_IOC_GET_ATTR_X \
_IOW(SCOUTFS_IOCTL_MAGIC, 18, struct scoutfs_ioctl_inode_attr_x)
#define SCOUTFS_IOC_SET_ATTR_X \
_IOW(SCOUTFS_IOCTL_MAGIC, 19, struct scoutfs_ioctl_inode_attr_x)
/*
* (These fields are documented in the order that they're displayed by
* the scoutfs cli utility which matches the sort order of the rules.)
*
* @prio: The priority of the rule. Rules are sorted by their fields
* with prio at the highest magnitude. When multiple rules match the
* rule with the highest sort order is enforced. The priority field
* lets rules override the default field sort order.
*
* @name_val[3]: The three 64bit values that make up the name of the
* totl xattr whose total will be checked against the rule's limit to
* see if the quota rule has been exceeded. The behavior of the values
* can be changed by their corresponding name_source and name_flags.
*
* @name_source[3]: The SQ_NS_ enums that control where the value comes
* from. _LITERAL uses the value from name_val. Inode attribute
* sources (_PROJ, _UID, _GID) are taken from the inode of the operation
* that is being checked against the rule.
*
* @name_flags[3]: The SQ_NF_ enums that alter the name values. _SELECT
* makes the rule only match if the inode attribute of the operation
* matches the attribute value stored in name_val. This lets rules
* match a specific value of an attribute rather than mapping all
* attribute values of to totl names.
*
* @op: The SQ_OP_ enums which specify the operation that can't exceed
* the rule's limit. _INODE checks inode creation and the inode
* attributes are taken from the inode that would be created. _DATA
* checks file data block allocation and the inode fields come from the
* inode that is allocating the blocks.
*
* @limit: The 64bit value that is checked against the totl value
* described by the rule. If the totl value is greater than or equal to
* this value of the matching rule then the operation will return
* -EDQUOT.
*
* @rule_flags: SQ_RF_TOTL_COUNT indicates that the rule's limit should
* be checked against the number of xattrs contributing to a totl value
* instead of the sum of the xattrs.
*/
struct scoutfs_ioctl_quota_rule {
__u64 name_val[3];
__u64 limit;
__u8 prio;
__u8 op;
__u8 rule_flags;
__u8 name_source[3];
__u8 name_flags[3];
__u8 _pad[7];
};
struct scoutfs_ioctl_get_quota_rules {
__u64 iterator[2];
__u64 rules_ptr;
__u64 rules_nr;
};
/*
* Rules are uniquely identified by their non-padded fields. Addition will fail
* with -EEXIST if the specified rule already exists and deletion must find a rule
* with all matching fields to delete.
*/
#define SCOUTFS_IOC_GET_QUOTA_RULES \
_IOR(SCOUTFS_IOCTL_MAGIC, 20, struct scoutfs_ioctl_get_quota_rules)
#define SCOUTFS_IOC_ADD_QUOTA_RULE \
_IOW(SCOUTFS_IOCTL_MAGIC, 21, struct scoutfs_ioctl_quota_rule)
#define SCOUTFS_IOC_DEL_QUOTA_RULE \
_IOW(SCOUTFS_IOCTL_MAGIC, 22, struct scoutfs_ioctl_quota_rule)
/*
* Inodes can be indexed in a global key space at a position determined
* by a .indx. tagged xattr. The xattr name specifies the two index
* position values, with major having the more significant comparison
* order.
*/
struct scoutfs_ioctl_xattr_index_entry {
__u64 minor;
__u64 ino;
__u8 major;
__u8 _pad[7];
};
struct scoutfs_ioctl_read_xattr_index {
__u64 flags;
struct scoutfs_ioctl_xattr_index_entry first;
struct scoutfs_ioctl_xattr_index_entry last;
__u64 entries_ptr;
__u64 entries_nr;
};
#define SCOUTFS_IOC_READ_XATTR_INDEX \
_IOR(SCOUTFS_IOCTL_MAGIC, 23, struct scoutfs_ioctl_read_xattr_index)
#endif

View File

@@ -24,6 +24,7 @@
#include "item.h"
#include "forest.h"
#include "block.h"
#include "msg.h"
#include "trans.h"
#include "counters.h"
#include "scoutfs_trace.h"
@@ -1670,13 +1671,24 @@ out:
return ret;
}
static int lock_safe(struct scoutfs_lock *lock, struct scoutfs_key *key,
static int lock_safe(struct super_block *sb, struct scoutfs_lock *lock, struct scoutfs_key *key,
int mode)
{
if (WARN_ON_ONCE(!scoutfs_lock_protected(lock, key, mode)))
bool prot = scoutfs_lock_protected(lock, key, mode);
if (!prot) {
static bool once = false;
if (!once) {
scoutfs_err(sb, "lock (start "SK_FMT" end "SK_FMT" mode 0x%x) does not protect operation (key "SK_FMT" mode 0x%x)",
SK_ARG(&lock->start), SK_ARG(&lock->end), lock->mode,
SK_ARG(key), mode);
dump_stack();
once = true;
}
return -EINVAL;
else
return 0;
}
return 0;
}
static int optional_lock_mode_match(struct scoutfs_lock *lock, int mode)
@@ -1708,8 +1720,8 @@ static int copy_val(void *dst, int dst_len, void *src, int src_len)
* The amount of bytes copied is returned which can be 0 or truncated if
* the caller's buffer isn't big enough.
*/
int scoutfs_item_lookup(struct super_block *sb, struct scoutfs_key *key,
void *val, int val_len, struct scoutfs_lock *lock)
static int item_lookup(struct super_block *sb, struct scoutfs_key *key,
void *val, int val_len, int len_limit, struct scoutfs_lock *lock)
{
DECLARE_ITEM_CACHE_INFO(sb, cinf);
struct cached_item *item;
@@ -1718,7 +1730,7 @@ int scoutfs_item_lookup(struct super_block *sb, struct scoutfs_key *key,
scoutfs_inc_counter(sb, item_lookup);
if ((ret = lock_safe(lock, key, SCOUTFS_LOCK_READ)))
if ((ret = lock_safe(sb, lock, key, SCOUTFS_LOCK_READ)))
goto out;
ret = get_cached_page(sb, cinf, lock, key, false, false, 0, &pg);
@@ -1729,6 +1741,8 @@ int scoutfs_item_lookup(struct super_block *sb, struct scoutfs_key *key,
item = item_rbtree_walk(&pg->item_root, key, NULL, NULL, NULL);
if (!item || item->deletion)
ret = -ENOENT;
else if (len_limit > 0 && item->val_len > len_limit)
ret = -EIO;
else
ret = copy_val(val, val_len, item->val, item->val_len);
@@ -1737,13 +1751,38 @@ out:
return ret;
}
int scoutfs_item_lookup(struct super_block *sb, struct scoutfs_key *key,
void *val, int val_len, struct scoutfs_lock *lock)
{
return item_lookup(sb, key, val, val_len, 0, lock);
}
/*
* Copy an item's value into the caller's buffer. If the item's value
* is larger than the caller's buffer then -EIO is returned. If the
* item is smaller then the bytes from the end of the copied value to
* the end of the buffer are zeroed. The number of value bytes copied
* is returned, and 0 can be returned for an item with no value.
*/
int scoutfs_item_lookup_smaller_zero(struct super_block *sb, struct scoutfs_key *key,
void *val, int val_len, struct scoutfs_lock *lock)
{
int ret;
ret = item_lookup(sb, key, val, val_len, val_len, lock);
if (ret >= 0 && ret < val_len)
memset(val + ret, 0, val_len - ret);
return ret;
}
int scoutfs_item_lookup_exact(struct super_block *sb, struct scoutfs_key *key,
void *val, int val_len,
struct scoutfs_lock *lock)
{
int ret;
ret = scoutfs_item_lookup(sb, key, val, val_len, lock);
ret = item_lookup(sb, key, val, val_len, 0, lock);
if (ret == val_len)
ret = 0;
else if (ret >= 0)
@@ -1793,7 +1832,7 @@ int scoutfs_item_next(struct super_block *sb, struct scoutfs_key *key,
goto out;
}
if ((ret = lock_safe(lock, key, SCOUTFS_LOCK_READ)))
if ((ret = lock_safe(sb, lock, key, SCOUTFS_LOCK_READ)))
goto out;
pos = *key;
@@ -1874,7 +1913,7 @@ int scoutfs_item_dirty(struct super_block *sb, struct scoutfs_key *key,
scoutfs_inc_counter(sb, item_dirty);
if ((ret = lock_safe(lock, key, SCOUTFS_LOCK_WRITE)))
if ((ret = lock_safe(sb, lock, key, SCOUTFS_LOCK_WRITE)))
goto out;
ret = scoutfs_forest_set_bloom_bits(sb, lock);
@@ -1920,7 +1959,7 @@ static int item_create(struct super_block *sb, struct scoutfs_key *key,
scoutfs_inc_counter(sb, item_create);
if ((ret = lock_safe(lock, key, mode)) ||
if ((ret = lock_safe(sb, lock, key, mode)) ||
(ret = optional_lock_mode_match(primary, SCOUTFS_LOCK_WRITE)))
goto out;
@@ -1963,7 +2002,7 @@ int scoutfs_item_create(struct super_block *sb, struct scoutfs_key *key,
void *val, int val_len, struct scoutfs_lock *lock)
{
return item_create(sb, key, val, val_len, lock, NULL,
SCOUTFS_LOCK_READ, false);
SCOUTFS_LOCK_WRITE, false);
}
int scoutfs_item_create_force(struct super_block *sb, struct scoutfs_key *key,
@@ -1994,7 +2033,7 @@ int scoutfs_item_update(struct super_block *sb, struct scoutfs_key *key,
scoutfs_inc_counter(sb, item_update);
if ((ret = lock_safe(lock, key, SCOUTFS_LOCK_WRITE)))
if ((ret = lock_safe(sb, lock, key, SCOUTFS_LOCK_WRITE)))
goto out;
ret = scoutfs_forest_set_bloom_bits(sb, lock);
@@ -2062,7 +2101,7 @@ int scoutfs_item_delta(struct super_block *sb, struct scoutfs_key *key,
scoutfs_inc_counter(sb, item_delta);
if ((ret = lock_safe(lock, key, SCOUTFS_LOCK_WRITE_ONLY)))
if ((ret = lock_safe(sb, lock, key, SCOUTFS_LOCK_WRITE_ONLY)))
goto out;
ret = scoutfs_forest_set_bloom_bits(sb, lock);
@@ -2135,7 +2174,7 @@ static int item_delete(struct super_block *sb, struct scoutfs_key *key,
scoutfs_inc_counter(sb, item_delete);
if ((ret = lock_safe(lock, key, mode)) ||
if ((ret = lock_safe(sb, lock, key, mode)) ||
(ret = optional_lock_mode_match(primary, SCOUTFS_LOCK_WRITE)))
goto out;
@@ -2202,18 +2241,18 @@ u64 scoutfs_item_dirty_pages(struct super_block *sb)
return (u64)atomic_read(&cinf->dirty_pages);
}
static int cmp_pg_start(void *priv, struct list_head *A, struct list_head *B)
static int cmp_pg_start(void *priv, KC_LIST_CMP_CONST struct list_head *A, KC_LIST_CMP_CONST struct list_head *B)
{
struct cached_page *a = list_entry(A, struct cached_page, dirty_head);
struct cached_page *b = list_entry(B, struct cached_page, dirty_head);
KC_LIST_CMP_CONST struct cached_page *a = list_entry(A, KC_LIST_CMP_CONST struct cached_page, dirty_head);
KC_LIST_CMP_CONST struct cached_page *b = list_entry(B, KC_LIST_CMP_CONST struct cached_page, dirty_head);
return scoutfs_key_compare(&a->start, &b->start);
}
static int cmp_item_key(void *priv, struct list_head *A, struct list_head *B)
static int cmp_item_key(void *priv, KC_LIST_CMP_CONST struct list_head *A, KC_LIST_CMP_CONST struct list_head *B)
{
struct cached_item *a = list_entry(A, struct cached_item, dirty_head);
struct cached_item *b = list_entry(B, struct cached_item, dirty_head);
KC_LIST_CMP_CONST struct cached_item *a = list_entry(A, KC_LIST_CMP_CONST struct cached_item, dirty_head);
KC_LIST_CMP_CONST struct cached_item *b = list_entry(B, KC_LIST_CMP_CONST struct cached_item, dirty_head);
return scoutfs_key_compare(&a->key, &b->key);
}
@@ -2654,7 +2693,7 @@ int scoutfs_item_setup(struct super_block *sb)
KC_INIT_SHRINKER_FUNCS(&cinf->shrinker, item_cache_count_objects,
item_cache_scan_objects);
KC_REGISTER_SHRINKER(&cinf->shrinker);
KC_REGISTER_SHRINKER(&cinf->shrinker, "scoutfs-item:" SCSBF, SCSB_ARGS(sb));
#ifdef KC_CPU_NOTIFIER
cinf->notifier.notifier_call = item_cpu_callback;
register_hotcpu_notifier(&cinf->notifier);

View File

@@ -3,6 +3,8 @@
int scoutfs_item_lookup(struct super_block *sb, struct scoutfs_key *key,
void *val, int val_len, struct scoutfs_lock *lock);
int scoutfs_item_lookup_smaller_zero(struct super_block *sb, struct scoutfs_key *key,
void *val, int val_len, struct scoutfs_lock *lock);
int scoutfs_item_lookup_exact(struct super_block *sb, struct scoutfs_key *key,
void *val, int val_len,
struct scoutfs_lock *lock);

View File

@@ -67,12 +67,11 @@ kc_generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
unsigned long nr_segs, loff_t pos, loff_t *ppos,
size_t count, ssize_t written)
{
struct file *file = iocb->ki_filp;
ssize_t status;
struct iov_iter i;
iov_iter_init(&i, WRITE, iov, nr_segs, count);
status = generic_perform_write(file, &i, pos);
status = kc_generic_perform_write(iocb, &i, pos);
if (likely(status >= 0)) {
written += status;

View File

@@ -197,7 +197,11 @@ struct timespec64 kc_current_time(struct inode *inode);
} while (0)
#define KC_SHRINKER_CONTAINER_OF(ptr, type) container_of(ptr, type, shrinker)
#define KC_REGISTER_SHRINKER(ptr) (register_shrinker(ptr))
#ifdef KC_SHRINKER_NAME
#define KC_REGISTER_SHRINKER register_shrinker
#else
#define KC_REGISTER_SHRINKER(ptr, fmt, ...) (register_shrinker(ptr))
#endif /* KC_SHRINKER_NAME */
#define KC_UNREGISTER_SHRINKER(ptr) (unregister_shrinker(ptr))
#define KC_SHRINKER_FN(ptr) (ptr)
#else
@@ -224,7 +228,7 @@ struct kc_shrinker_wrapper {
_wrap->shrink.seeks = DEFAULT_SEEKS; \
} while (0)
#define KC_SHRINKER_CONTAINER_OF(ptr, type) container_of(container_of(ptr, struct kc_shrinker_wrapper, shrink), type, shrinker)
#define KC_REGISTER_SHRINKER(ptr) (register_shrinker(ptr.shrink))
#define KC_REGISTER_SHRINKER(ptr, fmt, ...) (register_shrinker(ptr.shrink))
#define KC_UNREGISTER_SHRINKER(ptr) (unregister_shrinker(ptr.shrink))
#define KC_SHRINKER_FN(ptr) (ptr.shrink)
@@ -271,6 +275,167 @@ ssize_t kc_generic_file_buffered_write(struct kiocb *iocb, const struct iovec *i
unsigned long nr_segs, loff_t pos, loff_t *ppos,
size_t count, ssize_t written);
#define generic_file_buffered_write kc_generic_file_buffered_write
#ifdef KC_GENERIC_PERFORM_WRITE_KIOCB_IOV_ITER
static inline int kc_generic_perform_write(struct kiocb *iocb, struct iov_iter *iter, loff_t pos)
{
iocb->ki_pos = pos;
return generic_perform_write(iocb, iter);
}
#else
static inline int kc_generic_perform_write(struct kiocb *iocb, struct iov_iter *iter, loff_t pos)
{
struct file *file = iocb->ki_filp;
return generic_perform_write(file, iter, pos);
}
#endif
#endif // KC_GENERIC_FILE_BUFFERED_WRITE
#ifndef KC_HAVE_BLK_OPF_T
/* typedef __u32 __bitwise blk_opf_t; */
typedef unsigned int blk_opf_t;
#endif
#ifdef KC_LIST_CMP_CONST_ARG_LIST_HEAD
#define KC_LIST_CMP_CONST const
#else
#define KC_LIST_CMP_CONST
#endif
#ifdef KC_VMALLOC_PGPROT_T
#define kc__vmalloc(size, gfp_mask) __vmalloc(size, gfp_mask, PAGE_KERNEL)
#else
#define kc__vmalloc __vmalloc
#endif
#ifdef KC_VFS_METHOD_USER_NAMESPACE_ARG
#define KC_VFS_NS_DEF struct user_namespace *mnt_user_ns,
#define KC_VFS_NS mnt_user_ns,
#define KC_VFS_INIT_NS &init_user_ns,
#else
#define KC_VFS_NS_DEF
#define KC_VFS_NS
#define KC_VFS_INIT_NS
#endif
#ifdef KC_BIO_ALLOC_DEV_OPF_ARGS
#define kc_bio_alloc bio_alloc
#else
#include <linux/bio.h>
static inline struct bio *kc_bio_alloc(struct block_device *bdev, unsigned short nr_vecs,
blk_opf_t opf, gfp_t gfp_mask)
{
struct bio *b = bio_alloc(gfp_mask, nr_vecs);
if (b) {
kc_bio_set_opf(b, opf);
bio_set_dev(b, bdev);
}
return b;
}
#endif
#ifndef KC_FIEMAP_PREP
#define fiemap_prep(inode, fieinfo, start, len, flags) fiemap_check_flags(fieinfo, flags)
#endif
#ifndef KC_KERNEL_OLD_TIMEVAL_STRUCT
#define __kernel_old_timeval timeval
#define ns_to_kernel_old_timeval(ktime) ns_to_timeval(ktime.tv64)
#endif
#ifdef KC_SOCK_SET_SNDTIMEO
#include <net/sock.h>
static inline int kc_sock_set_sndtimeo(struct socket *sock, s64 secs)
{
sock_set_sndtimeo(sock->sk, secs);
return 0;
}
static inline int kc_tcp_sock_set_rcvtimeo(struct socket *sock, ktime_t to)
{
struct __kernel_old_timeval tv;
sockptr_t kopt;
tv = ns_to_kernel_old_timeval(to);
kopt = KERNEL_SOCKPTR(&tv);
return sock_setsockopt(sock, SOL_SOCKET, SO_RCVTIMEO_NEW,
kopt, sizeof(tv));
}
#else
#include <net/sock.h>
static inline int kc_sock_set_sndtimeo(struct socket *sock, s64 secs)
{
struct timeval tv = { .tv_sec = secs, .tv_usec = 0 };
return kernel_setsockopt(sock, SOL_SOCKET, SO_SNDTIMEO,
(char *)&tv, sizeof(tv));
}
static inline int kc_tcp_sock_set_rcvtimeo(struct socket *sock, ktime_t to)
{
struct __kernel_old_timeval tv;
tv = ns_to_kernel_old_timeval(to);
return kernel_setsockopt(sock, SOL_SOCKET, SO_RCVTIMEO,
(char *)&tv, sizeof(tv));
}
#endif
#ifdef KC_SETSOCKOPT_SOCKPTR_T
static inline int kc_sock_setsockopt(struct socket *sock, int level, int op, int *optval, unsigned int optlen)
{
sockptr_t kopt = KERNEL_SOCKPTR(optval);
return sock_setsockopt(sock, level, op, kopt, sizeof(optval));
}
#else
static inline int kc_sock_setsockopt(struct socket *sock, int level, int op, int *optval, unsigned int optlen)
{
return kernel_setsockopt(sock, level, op, (char *)optval, sizeof(optval));
}
#endif
#ifdef KC_HAVE_TCP_SET_SOCKFN
#include <linux/net.h>
#include <net/tcp.h>
static inline int kc_tcp_sock_set_keepintvl(struct socket *sock, int val)
{
return tcp_sock_set_keepintvl(sock->sk, val);
}
static inline int kc_tcp_sock_set_keepidle(struct socket *sock, int val)
{
return tcp_sock_set_keepidle(sock->sk, val);
}
static inline int kc_tcp_sock_set_user_timeout(struct socket *sock, int val)
{
tcp_sock_set_user_timeout(sock->sk, val);
return 0;
}
static inline int kc_tcp_sock_set_nodelay(struct socket *sock)
{
tcp_sock_set_nodelay(sock->sk);
return 0;
}
#else
#include <linux/net.h>
#include <net/tcp.h>
static inline int kc_tcp_sock_set_keepintvl(struct socket *sock, int val)
{
int optval = val;
return kernel_setsockopt(sock, SOL_TCP, TCP_KEEPINTVL, (char *)&optval, sizeof(optval));
}
static inline int kc_tcp_sock_set_keepidle(struct socket *sock, int val)
{
int optval = val;
return kernel_setsockopt(sock, SOL_TCP, TCP_KEEPIDLE, (char *)&optval, sizeof(optval));
}
static inline int kc_tcp_sock_set_user_timeout(struct socket *sock, int val)
{
int optval = val;
return kernel_setsockopt(sock, SOL_TCP, TCP_USER_TIMEOUT, (char *)&optval, sizeof(optval));
}
static inline int kc_tcp_sock_set_nodelay(struct socket *sock)
{
int optval = 1;
return kernel_setsockopt(sock, SOL_TCP, TCP_NODELAY, (char *)&optval, sizeof(optval));
}
#endif
#endif

View File

@@ -125,8 +125,8 @@ static inline bool scoutfs_key_is_ones(struct scoutfs_key *key)
* other alternatives across keys that first differ in any of the
* values. Say maybe 20% faster than memcmp.
*/
static inline int scoutfs_key_compare(struct scoutfs_key *a,
struct scoutfs_key *b)
static inline int scoutfs_key_compare(const struct scoutfs_key *a,
const struct scoutfs_key *b)
{
return scoutfs_cmp(a->sk_zone, b->sk_zone) ?:
scoutfs_cmp(le64_to_cpu(a->_sk_first), le64_to_cpu(b->_sk_first)) ?:
@@ -142,10 +142,10 @@ static inline int scoutfs_key_compare(struct scoutfs_key *a,
* 1: a_start > b_end
* else 0: ranges overlap
*/
static inline int scoutfs_key_compare_ranges(struct scoutfs_key *a_start,
struct scoutfs_key *a_end,
struct scoutfs_key *b_start,
struct scoutfs_key *b_end)
static inline int scoutfs_key_compare_ranges(const struct scoutfs_key *a_start,
const struct scoutfs_key *a_end,
const struct scoutfs_key *b_start,
const struct scoutfs_key *b_end)
{
return scoutfs_key_compare(a_end, b_start) < 0 ? -1 :
scoutfs_key_compare(a_start, b_end) > 0 ? 1 :

View File

@@ -36,6 +36,8 @@
#include "item.h"
#include "omap.h"
#include "util.h"
#include "totl.h"
#include "quota.h"
/*
* scoutfs uses a lock service to manage item cache consistency between
@@ -185,6 +187,9 @@ static int lock_invalidate(struct super_block *sb, struct scoutfs_lock *lock,
return ret;
}
if (lock->start.sk_zone == SCOUTFS_QUOTA_ZONE && !lock_mode_can_read(mode))
scoutfs_quota_invalidate(sb);
/* have to invalidate if we're not in the only usable case */
if (!(prev == SCOUTFS_LOCK_WRITE && mode == SCOUTFS_LOCK_READ)) {
retry:
@@ -1244,10 +1249,29 @@ int scoutfs_lock_xattr_totl(struct super_block *sb, enum scoutfs_lock_mode mode,
struct scoutfs_key start;
struct scoutfs_key end;
scoutfs_key_set_zeros(&start);
start.sk_zone = SCOUTFS_XATTR_TOTL_ZONE;
scoutfs_key_set_ones(&end);
end.sk_zone = SCOUTFS_XATTR_TOTL_ZONE;
scoutfs_totl_set_range(&start, &end);
return lock_key_range(sb, mode, flags, &start, &end, lock);
}
int scoutfs_lock_xattr_indx(struct super_block *sb, enum scoutfs_lock_mode mode, int flags,
struct scoutfs_lock **lock)
{
struct scoutfs_key start;
struct scoutfs_key end;
scoutfs_xattr_indx_get_range(&start, &end);
return lock_key_range(sb, mode, flags, &start, &end, lock);
}
int scoutfs_lock_quota(struct super_block *sb, enum scoutfs_lock_mode mode, int flags,
struct scoutfs_lock **lock)
{
struct scoutfs_key start;
struct scoutfs_key end;
scoutfs_quota_get_lock_range(&start, &end);
return lock_key_range(sb, mode, flags, &start, &end, lock);
}
@@ -1708,7 +1732,7 @@ int scoutfs_lock_setup(struct super_block *sb)
linfo->lock_range_tree = RB_ROOT;
KC_INIT_SHRINKER_FUNCS(&linfo->shrinker, lock_count_objects,
lock_scan_objects);
KC_REGISTER_SHRINKER(&linfo->shrinker);
KC_REGISTER_SHRINKER(&linfo->shrinker, "scoutfs-lock:" SCSBF, SCSB_ARGS(sb));
INIT_LIST_HEAD(&linfo->lru_list);
INIT_WORK(&linfo->inv_work, lock_invalidate_worker);
INIT_LIST_HEAD(&linfo->inv_list);

View File

@@ -86,6 +86,10 @@ int scoutfs_lock_orphan(struct super_block *sb, enum scoutfs_lock_mode mode, int
u64 ino, struct scoutfs_lock **lock);
int scoutfs_lock_xattr_totl(struct super_block *sb, enum scoutfs_lock_mode mode, int flags,
struct scoutfs_lock **lock);
int scoutfs_lock_xattr_indx(struct super_block *sb, enum scoutfs_lock_mode mode, int flags,
struct scoutfs_lock **lock);
int scoutfs_lock_quota(struct super_block *sb, enum scoutfs_lock_mode mode, int flags,
struct scoutfs_lock **lock);
void scoutfs_unlock(struct super_block *sb, struct scoutfs_lock *lock,
enum scoutfs_lock_mode mode);

View File

@@ -904,53 +904,44 @@ static void destroy_conn(struct scoutfs_net_connection *conn)
static int sock_opts_and_names(struct scoutfs_net_connection *conn,
struct socket *sock)
{
struct timeval tv;
int optval;
int ret;
/* we use a keepalive timeout instead of send timeout */
tv.tv_sec = 0;
tv.tv_usec = 0;
ret = kernel_setsockopt(sock, SOL_SOCKET, SO_SNDTIMEO,
(char *)&tv, sizeof(tv));
ret = kc_sock_set_sndtimeo(sock, 0);
if (ret)
goto out;
/* not checked when user_timeout != 0, but for clarity */
optval = UNRESPONSIVE_PROBES;
ret = kernel_setsockopt(sock, SOL_TCP, TCP_KEEPCNT,
(char *)&optval, sizeof(optval));
ret = kc_sock_setsockopt(sock, SOL_TCP, TCP_KEEPCNT,
&optval, sizeof(optval));
if (ret)
goto out;
BUILD_BUG_ON(UNRESPONSIVE_PROBES >= UNRESPONSIVE_TIMEOUT_SECS);
optval = UNRESPONSIVE_TIMEOUT_SECS - (UNRESPONSIVE_PROBES);
ret = kernel_setsockopt(sock, SOL_TCP, TCP_KEEPIDLE,
(char *)&optval, sizeof(optval));
ret = kc_tcp_sock_set_keepidle(sock, optval);
if (ret)
goto out;
optval = 1;
ret = kernel_setsockopt(sock, SOL_TCP, TCP_KEEPINTVL,
(char *)&optval, sizeof(optval));
ret = kc_tcp_sock_set_keepintvl(sock, optval);
if (ret)
goto out;
optval = UNRESPONSIVE_TIMEOUT_SECS * MSEC_PER_SEC;
ret = kernel_setsockopt(sock, SOL_TCP, TCP_USER_TIMEOUT,
(char *)&optval, sizeof(optval));
ret = kc_tcp_sock_set_user_timeout(sock, optval);
if (ret)
goto out;
optval = 1;
ret = kernel_setsockopt(sock, SOL_SOCKET, SO_KEEPALIVE,
(char *)&optval, sizeof(optval));
ret = kc_sock_setsockopt(sock, SOL_SOCKET, SO_KEEPALIVE,
&optval, sizeof(optval));
if (ret)
goto out;
optval = 1;
ret = kernel_setsockopt(sock, SOL_TCP, TCP_NODELAY,
(char *)&optval, sizeof(optval));
ret = kc_tcp_sock_set_nodelay(sock);
if (ret)
goto out;
@@ -1049,7 +1040,6 @@ static void scoutfs_net_connect_worker(struct work_struct *work)
DEFINE_CONN_FROM_WORK(conn, work, connect_work);
struct super_block *sb = conn->sb;
struct socket *sock;
struct timeval tv;
int ret;
trace_scoutfs_net_connect_work_enter(sb, 0, 0);
@@ -1060,11 +1050,8 @@ static void scoutfs_net_connect_worker(struct work_struct *work)
sock->sk->sk_allocation = GFP_NOFS;
/* caller specified connect timeout */
tv.tv_sec = conn->connect_timeout_ms / MSEC_PER_SEC;
tv.tv_usec = (conn->connect_timeout_ms % MSEC_PER_SEC) * USEC_PER_MSEC;
ret = kernel_setsockopt(sock, SOL_SOCKET, SO_SNDTIMEO,
(char *)&tv, sizeof(tv));
/* caller specified connect timeout, defaults to 1 sec */
ret = kc_sock_set_sndtimeo(sock, conn->connect_timeout_ms / MSEC_PER_SEC);
if (ret) {
sock_release(sock);
goto out;
@@ -1462,8 +1449,8 @@ int scoutfs_net_bind(struct super_block *sb,
sock->sk->sk_allocation = GFP_NOFS;
optval = 1;
ret = kernel_setsockopt(sock, SOL_SOCKET, SO_REUSEADDR,
(char *)&optval, sizeof(optval));
ret = kc_sock_setsockopt(sock, SOL_SOCKET, SO_REUSEADDR,
&optval, sizeof(optval));
if (ret)
goto out;

View File

@@ -33,6 +33,7 @@ enum {
Opt_acl,
Opt_data_prealloc_blocks,
Opt_data_prealloc_contig_only,
Opt_log_merge_wait_timeout_ms,
Opt_metadev_path,
Opt_noacl,
Opt_orphan_scan_delay_ms,
@@ -45,6 +46,7 @@ static const match_table_t tokens = {
{Opt_acl, "acl"},
{Opt_data_prealloc_blocks, "data_prealloc_blocks=%s"},
{Opt_data_prealloc_contig_only, "data_prealloc_contig_only=%s"},
{Opt_log_merge_wait_timeout_ms, "log_merge_wait_timeout_ms=%s"},
{Opt_metadev_path, "metadev_path=%s"},
{Opt_noacl, "noacl"},
{Opt_orphan_scan_delay_ms, "orphan_scan_delay_ms=%s"},
@@ -113,6 +115,10 @@ static void free_options(struct scoutfs_mount_options *opts)
kfree(opts->metadev_path);
}
#define MIN_LOG_MERGE_WAIT_TIMEOUT_MS 100UL
#define DEFAULT_LOG_MERGE_WAIT_TIMEOUT_MS 500
#define MAX_LOG_MERGE_WAIT_TIMEOUT_MS (60 * MSEC_PER_SEC)
#define MIN_ORPHAN_SCAN_DELAY_MS 100UL
#define DEFAULT_ORPHAN_SCAN_DELAY_MS (10 * MSEC_PER_SEC)
#define MAX_ORPHAN_SCAN_DELAY_MS (60 * MSEC_PER_SEC)
@@ -126,11 +132,27 @@ static void init_default_options(struct scoutfs_mount_options *opts)
opts->data_prealloc_blocks = SCOUTFS_DATA_PREALLOC_DEFAULT_BLOCKS;
opts->data_prealloc_contig_only = 1;
opts->log_merge_wait_timeout_ms = DEFAULT_LOG_MERGE_WAIT_TIMEOUT_MS;
opts->orphan_scan_delay_ms = -1;
opts->quorum_heartbeat_timeout_ms = SCOUTFS_QUORUM_DEF_HB_TIMEO_MS;
opts->quorum_slot_nr = -1;
}
static int verify_log_merge_wait_timeout_ms(struct super_block *sb, int ret, int val)
{
if (ret < 0) {
scoutfs_err(sb, "failed to parse log_merge_wait_timeout_ms value");
return -EINVAL;
}
if (val < MIN_LOG_MERGE_WAIT_TIMEOUT_MS || val > MAX_LOG_MERGE_WAIT_TIMEOUT_MS) {
scoutfs_err(sb, "invalid log_merge_wait_timeout_ms value %d, must be between %lu and %lu",
val, MIN_LOG_MERGE_WAIT_TIMEOUT_MS, MAX_LOG_MERGE_WAIT_TIMEOUT_MS);
return -EINVAL;
}
return 0;
}
static int verify_quorum_heartbeat_timeout_ms(struct super_block *sb, int ret, u64 val)
{
if (ret < 0) {
@@ -196,6 +218,14 @@ static int parse_options(struct super_block *sb, char *options, struct scoutfs_m
opts->data_prealloc_contig_only = nr;
break;
case Opt_log_merge_wait_timeout_ms:
ret = match_int(args, &nr);
ret = verify_log_merge_wait_timeout_ms(sb, ret, nr);
if (ret < 0)
return ret;
opts->log_merge_wait_timeout_ms = nr;
break;
case Opt_metadev_path:
ret = parse_bdev_path(sb, &args[0], &opts->metadev_path);
if (ret < 0)
@@ -422,6 +452,43 @@ static ssize_t data_prealloc_contig_only_store(struct kobject *kobj, struct kobj
}
SCOUTFS_ATTR_RW(data_prealloc_contig_only);
static ssize_t log_merge_wait_timeout_ms_show(struct kobject *kobj, struct kobj_attribute *attr,
char *buf)
{
struct super_block *sb = SCOUTFS_SYSFS_ATTRS_SB(kobj);
struct scoutfs_mount_options opts;
scoutfs_options_read(sb, &opts);
return snprintf(buf, PAGE_SIZE, "%u", opts.log_merge_wait_timeout_ms);
}
static ssize_t log_merge_wait_timeout_ms_store(struct kobject *kobj, struct kobj_attribute *attr,
const char *buf, size_t count)
{
struct super_block *sb = SCOUTFS_SYSFS_ATTRS_SB(kobj);
DECLARE_OPTIONS_INFO(sb, optinf);
char nullterm[30]; /* more than enough for octal -U64_MAX */
int val;
int len;
int ret;
len = min(count, sizeof(nullterm) - 1);
memcpy(nullterm, buf, len);
nullterm[len] = '\0';
ret = kstrtoint(nullterm, 0, &val);
ret = verify_log_merge_wait_timeout_ms(sb, ret, val);
if (ret == 0) {
write_seqlock(&optinf->seqlock);
optinf->opts.log_merge_wait_timeout_ms = val;
write_sequnlock(&optinf->seqlock);
ret = count;
}
return ret;
}
SCOUTFS_ATTR_RW(log_merge_wait_timeout_ms);
static ssize_t metadev_path_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
{
struct super_block *sb = SCOUTFS_SYSFS_ATTRS_SB(kobj);
@@ -525,6 +592,7 @@ SCOUTFS_ATTR_RO(quorum_slot_nr);
static struct attribute *options_attrs[] = {
SCOUTFS_ATTR_PTR(data_prealloc_blocks),
SCOUTFS_ATTR_PTR(data_prealloc_contig_only),
SCOUTFS_ATTR_PTR(log_merge_wait_timeout_ms),
SCOUTFS_ATTR_PTR(metadev_path),
SCOUTFS_ATTR_PTR(orphan_scan_delay_ms),
SCOUTFS_ATTR_PTR(quorum_heartbeat_timeout_ms),

View File

@@ -8,6 +8,7 @@
struct scoutfs_mount_options {
u64 data_prealloc_blocks;
bool data_prealloc_contig_only;
unsigned int log_merge_wait_timeout_ms;
char *metadev_path;
unsigned int orphan_scan_delay_ms;
int quorum_slot_nr;

View File

@@ -303,7 +303,6 @@ static int recv_msg(struct super_block *sb, struct quorum_host_msg *msg,
DECLARE_QUORUM_INFO(sb, qinf);
struct scoutfs_sb_info *sbi = SCOUTFS_SB(sb);
struct scoutfs_quorum_message qmes;
struct timeval tv;
ktime_t rel_to;
ktime_t now;
int ret;
@@ -328,14 +327,10 @@ static int recv_msg(struct super_block *sb, struct quorum_host_msg *msg,
else
rel_to = ns_to_ktime(0);
tv = ktime_to_timeval(rel_to);
if (tv.tv_sec == 0 && tv.tv_usec == 0) {
if (ktime_compare(rel_to, ns_to_ktime(NSEC_PER_USEC)) <= 0) {
mh.msg_flags |= MSG_DONTWAIT;
} else {
ret = kernel_setsockopt(qinf->sock, SOL_SOCKET, SO_RCVTIMEO,
(char *)&tv, sizeof(tv));
if (ret < 0)
return ret;
ret = kc_tcp_sock_set_rcvtimeo(qinf->sock, rel_to);
}
#ifdef KC_MSGHDR_STRUCT_IOV_ITER
@@ -486,7 +481,7 @@ static void set_quorum_block_event(struct super_block *sb, struct scoutfs_quorum
if (WARN_ON_ONCE(event < 0 || event >= SCOUTFS_QUORUM_EVENT_NR))
return;
getnstimeofday64(&ts);
ktime_get_ts64(&ts);
le64_add_cpu(&blk->write_nr, 1);
ev = &blk->events[event];
@@ -1325,8 +1320,8 @@ int scoutfs_quorum_setup(struct super_block *sb)
qinf = kzalloc(sizeof(struct quorum_info), GFP_KERNEL);
super = kmalloc(sizeof(struct scoutfs_super_block), GFP_KERNEL);
if (qinf)
qinf->hb_delay = __vmalloc(HB_DELAY_NR * sizeof(struct count_recent),
GFP_KERNEL | __GFP_ZERO, PAGE_KERNEL);
qinf->hb_delay = kc__vmalloc(HB_DELAY_NR * sizeof(struct count_recent),
GFP_KERNEL | __GFP_ZERO);
if (!qinf || !super || !qinf->hb_delay) {
if (qinf)
vfree(qinf->hb_delay);

1266
kmod/src/quota.c Normal file

File diff suppressed because it is too large Load Diff

48
kmod/src/quota.h Normal file
View File

@@ -0,0 +1,48 @@
#ifndef _SCOUTFS_QUOTA_H_
#define _SCOUTFS_QUOTA_H_
#include "ioctl.h"
/*
* Each rule's name can be in the ruleset's rbtree associated with the
* source attr that it selects. This lets checks only test rules that
* the inputs could match. The 'i' field indicates which name is in the
* tree so we can find the containing rule.
*
* This is mostly private to quota.c but we expose it for tracing.
*/
struct squota_rule {
u64 limit;
u8 prio;
u8 op;
u8 rule_flags;
struct squota_rule_name {
struct rb_node node;
u64 val;
u8 source;
u8 flags;
u8 i;
} names[3];
};
/* private to quota.c, only here for tracing */
struct squota_input {
u64 attrs[SQ_NS__NR_SELECT];
u8 op;
};
int scoutfs_quota_check_inode(struct super_block *sb, struct inode *dir);
int scoutfs_quota_check_data(struct super_block *sb, struct inode *inode);
int scoutfs_quota_get_rules(struct super_block *sb, u64 *iterator,
struct scoutfs_ioctl_quota_rule *irules, int nr);
int scoutfs_quota_mod_rule(struct super_block *sb, bool is_add,
struct scoutfs_ioctl_quota_rule *irule);
void scoutfs_quota_get_lock_range(struct scoutfs_key *start, struct scoutfs_key *end);
void scoutfs_quota_invalidate(struct super_block *sb);
int scoutfs_quota_setup(struct super_block *sb);
void scoutfs_quota_destroy(struct super_block *sb);
#endif

View File

@@ -76,10 +76,10 @@ static struct recov_pending *lookup_pending(struct recov_info *recinf, u64 rid,
* We keep the pending list sorted by rid so that we can iterate over
* them. The list should be small and shouldn't be used often.
*/
static int cmp_pending_rid(void *priv, struct list_head *A, struct list_head *B)
static int cmp_pending_rid(void *priv, KC_LIST_CMP_CONST struct list_head *A, KC_LIST_CMP_CONST struct list_head *B)
{
struct recov_pending *a = list_entry(A, struct recov_pending, head);
struct recov_pending *b = list_entry(B, struct recov_pending, head);
KC_LIST_CMP_CONST struct recov_pending *a = list_entry(A, KC_LIST_CMP_CONST struct recov_pending, head);
KC_LIST_CMP_CONST struct recov_pending *b = list_entry(B, KC_LIST_CMP_CONST struct recov_pending, head);
return scoutfs_cmp_u64s(a->rid, b->rid);
}

View File

@@ -24,7 +24,6 @@
#include <linux/tracepoint.h>
#include <linux/in.h>
#include <linux/unaligned/access_ok.h>
#include "key.h"
#include "format.h"
@@ -37,6 +36,10 @@
#include "net.h"
#include "data.h"
#include "ext.h"
#include "quota.h"
#include "trace/quota.h"
#include "trace/wkic.h"
struct lock_info;
@@ -439,6 +442,7 @@ DECLARE_EVENT_CLASS(scoutfs_trans_hold_release_class,
SCSB_TRACE_ASSIGN(sb);
__entry->journal_info = (unsigned long)journal_info;
__entry->holders = holders;
__entry->ret = ret;
),
TP_printk(SCSBF" journal_info 0x%0lx holders %d ret %d",
@@ -1746,21 +1750,41 @@ TRACE_EVENT(scoutfs_btree_merge,
sk_trace_args(end))
);
TRACE_EVENT(scoutfs_btree_merge_read_range,
TP_PROTO(struct super_block *sb, struct scoutfs_key *start, struct scoutfs_key *end,
int size),
TP_ARGS(sb, start, end, size),
TP_STRUCT__entry(
SCSB_TRACE_FIELDS
sk_trace_define(start)
sk_trace_define(end)
__field(int, size)
),
TP_fast_assign(
SCSB_TRACE_ASSIGN(sb);
sk_trace_assign(start, start);
sk_trace_assign(end, end);
__entry->size = size;
),
TP_printk(SCSBF" start "SK_FMT" end "SK_FMT" size %d",
SCSB_TRACE_ARGS, sk_trace_args(start), sk_trace_args(end), __entry->size)
);
TRACE_EVENT(scoutfs_btree_merge_items,
TP_PROTO(struct super_block *sb,
struct scoutfs_btree_root *m_root,
struct scoutfs_key *m_key, int m_val_len,
struct scoutfs_btree_root *f_root,
struct scoutfs_key *f_key, int f_val_len,
int is_del),
TP_ARGS(sb, m_root, m_key, m_val_len, f_root, f_key, f_val_len, is_del),
TP_ARGS(sb, m_key, m_val_len, f_root, f_key, f_val_len, is_del),
TP_STRUCT__entry(
SCSB_TRACE_FIELDS
__field(__u64, m_root_blkno)
__field(__u64, m_root_seq)
__field(__u8, m_root_height)
sk_trace_define(m_key)
__field(int, m_val_len)
__field(__u64, f_root_blkno)
@@ -1773,10 +1797,6 @@ TRACE_EVENT(scoutfs_btree_merge_items,
TP_fast_assign(
SCSB_TRACE_ASSIGN(sb);
__entry->m_root_blkno = m_root ?
le64_to_cpu(m_root->ref.blkno) : 0;
__entry->m_root_seq = m_root ? le64_to_cpu(m_root->ref.seq) : 0;
__entry->m_root_height = m_root ? m_root->height : 0;
sk_trace_assign(m_key, m_key);
__entry->m_val_len = m_val_len;
__entry->f_root_blkno = f_root ?
@@ -1788,11 +1808,9 @@ TRACE_EVENT(scoutfs_btree_merge_items,
__entry->is_del = !!is_del;
),
TP_printk(SCSBF" merge item root blkno %llu seq %llu height %u key "SK_FMT" val_len %d, fs item root blkno %llu seq %llu height %u key "SK_FMT" val_len %d, is_del %d",
SCSB_TRACE_ARGS, __entry->m_root_blkno, __entry->m_root_seq,
__entry->m_root_height, sk_trace_args(m_key),
__entry->m_val_len, __entry->f_root_blkno,
__entry->f_root_seq, __entry->f_root_height,
TP_printk(SCSBF" merge item key "SK_FMT" val_len %d, fs item root blkno %llu seq %llu height %u key "SK_FMT" val_len %d, is_del %d",
SCSB_TRACE_ARGS, sk_trace_args(m_key), __entry->m_val_len,
__entry->f_root_blkno, __entry->f_root_seq, __entry->f_root_height,
sk_trace_args(f_key), __entry->f_val_len, __entry->is_del)
);
@@ -2075,6 +2093,71 @@ TRACE_EVENT(scoutfs_trans_seq_last,
SCSB_TRACE_ARGS, __entry->s_rid, __entry->trans_seq)
);
TRACE_EVENT(scoutfs_server_finalize_items,
TP_PROTO(struct super_block *sb, u64 rid, u64 item_rid, u64 item_nr, u64 item_flags,
u64 item_get_trans_seq),
TP_ARGS(sb, rid, item_rid, item_nr, item_flags, item_get_trans_seq),
TP_STRUCT__entry(
SCSB_TRACE_FIELDS
__field(__u64, c_rid)
__field(__u64, item_rid)
__field(__u64, item_nr)
__field(__u64, item_flags)
__field(__u64, item_get_trans_seq)
),
TP_fast_assign(
SCSB_TRACE_ASSIGN(sb);
__entry->c_rid = rid;
__entry->item_rid = item_rid;
__entry->item_nr = item_nr;
__entry->item_flags = item_flags;
__entry->item_get_trans_seq = item_get_trans_seq;
),
TP_printk(SCSBF" rid %016llx item_rid %016llx item_nr %llu item_flags 0x%llx item_get_trans_seq %llu",
SCSB_TRACE_ARGS, __entry->c_rid, __entry->item_rid, __entry->item_nr,
__entry->item_flags, __entry->item_get_trans_seq)
);
TRACE_EVENT(scoutfs_server_finalize_decision,
TP_PROTO(struct super_block *sb, u64 rid, bool saw_finalized, bool others_active,
bool ours_visible, bool finalize_ours, unsigned int delay_ms,
u64 finalize_sent_seq),
TP_ARGS(sb, rid, saw_finalized, others_active, ours_visible, finalize_ours, delay_ms,
finalize_sent_seq),
TP_STRUCT__entry(
SCSB_TRACE_FIELDS
__field(__u64, c_rid)
__field(bool, saw_finalized)
__field(bool, others_active)
__field(bool, ours_visible)
__field(bool, finalize_ours)
__field(unsigned int, delay_ms)
__field(__u64, finalize_sent_seq)
),
TP_fast_assign(
SCSB_TRACE_ASSIGN(sb);
__entry->c_rid = rid;
__entry->saw_finalized = saw_finalized;
__entry->others_active = others_active;
__entry->ours_visible = ours_visible;
__entry->finalize_ours = finalize_ours;
__entry->delay_ms = delay_ms;
__entry->finalize_sent_seq = finalize_sent_seq;
),
TP_printk(SCSBF" rid %016llx saw_finalized %u others_active %u ours_visible %u finalize_ours %u delay_ms %u finalize_sent_seq %llu",
SCSB_TRACE_ARGS, __entry->c_rid, __entry->saw_finalized, __entry->others_active,
__entry->ours_visible, __entry->finalize_ours, __entry->delay_ms,
__entry->finalize_sent_seq)
);
TRACE_EVENT(scoutfs_get_log_merge_status,
TP_PROTO(struct super_block *sb, u64 rid, struct scoutfs_key *next_range_key,
u64 nr_requests, u64 nr_complete, u64 seq),
@@ -2315,6 +2398,44 @@ TRACE_EVENT(scoutfs_block_dirty_ref,
__entry->block_blkno, __entry->block_seq)
);
TRACE_EVENT(scoutfs_block_stale,
TP_PROTO(struct super_block *sb, struct scoutfs_block_ref *ref,
struct scoutfs_block_header *hdr, u32 magic, u32 crc),
TP_ARGS(sb, ref, hdr, magic, crc),
TP_STRUCT__entry(
SCSB_TRACE_FIELDS
__field(__u64, ref_blkno)
__field(__u64, ref_seq)
__field(__u32, hdr_crc)
__field(__u32, hdr_magic)
__field(__u64, hdr_fsid)
__field(__u64, hdr_seq)
__field(__u64, hdr_blkno)
__field(__u32, magic)
__field(__u32, crc)
),
TP_fast_assign(
SCSB_TRACE_ASSIGN(sb);
__entry->ref_blkno = le64_to_cpu(ref->blkno);
__entry->ref_seq = le64_to_cpu(ref->seq);
__entry->hdr_crc = le32_to_cpu(hdr->crc);
__entry->hdr_magic = le32_to_cpu(hdr->magic);
__entry->hdr_fsid = le64_to_cpu(hdr->fsid);
__entry->hdr_seq = le64_to_cpu(hdr->seq);
__entry->hdr_blkno = le64_to_cpu(hdr->blkno);
__entry->magic = magic;
__entry->crc = crc;
),
TP_printk(SCSBF" ref_blkno %llu ref_seq %016llx hdr_crc %08x hdr_magic %08x hdr_fsid %016llx hdr_seq %016llx hdr_blkno %llu magic %08x crc %08x",
SCSB_TRACE_ARGS, __entry->ref_blkno, __entry->ref_seq, __entry->hdr_crc,
__entry->hdr_magic, __entry->hdr_fsid, __entry->hdr_seq, __entry->hdr_blkno,
__entry->magic, __entry->crc)
);
DECLARE_EVENT_CLASS(scoutfs_block_class,
TP_PROTO(struct super_block *sb, void *bp, u64 blkno, int refcount, int io_count,
unsigned long bits, __u64 accessed),
@@ -2799,6 +2920,81 @@ TRACE_EVENT(scoutfs_omap_should_delete,
SCSB_TRACE_ARGS, __entry->ino, __entry->nlink, __entry->ret)
);
#define SSCF_FMT "[bo %llu bs %llu es %llu]"
#define SSCF_FIELDS(pref) \
__field(__u64, pref##_blkno) \
__field(__u64, pref##_blocks) \
__field(__u64, pref##_entries)
#define SSCF_ASSIGN(pref, sfl) \
__entry->pref##_blkno = le64_to_cpu((sfl)->ref.blkno); \
__entry->pref##_blocks = le64_to_cpu((sfl)->blocks); \
__entry->pref##_entries = le64_to_cpu((sfl)->entries);
#define SSCF_ENTRY_ARGS(pref) \
__entry->pref##_blkno, \
__entry->pref##_blocks, \
__entry->pref##_entries
DECLARE_EVENT_CLASS(scoutfs_srch_compact_class,
TP_PROTO(struct super_block *sb, struct scoutfs_srch_compact *sc),
TP_ARGS(sb, sc),
TP_STRUCT__entry(
SCSB_TRACE_FIELDS
__field(__u64, id)
__field(__u8, nr)
__field(__u8, flags)
SSCF_FIELDS(out)
__field(__u64, in0_blk)
__field(__u64, in0_pos)
SSCF_FIELDS(in0)
__field(__u64, in1_blk)
__field(__u64, in1_pos)
SSCF_FIELDS(in1)
__field(__u64, in2_blk)
__field(__u64, in2_pos)
SSCF_FIELDS(in2)
__field(__u64, in3_blk)
__field(__u64, in3_pos)
SSCF_FIELDS(in3)
),
TP_fast_assign(
SCSB_TRACE_ASSIGN(sb);
__entry->id = le64_to_cpu(sc->id);
__entry->nr = sc->nr;
__entry->flags = sc->flags;
SSCF_ASSIGN(out, &sc->out)
__entry->in0_blk = le64_to_cpu(sc->in[0].blk);
__entry->in0_pos = le64_to_cpu(sc->in[0].pos);
SSCF_ASSIGN(in0, &sc->in[0].sfl)
__entry->in1_blk = le64_to_cpu(sc->in[0].blk);
__entry->in1_pos = le64_to_cpu(sc->in[0].pos);
SSCF_ASSIGN(in1, &sc->in[1].sfl)
__entry->in2_blk = le64_to_cpu(sc->in[0].blk);
__entry->in2_pos = le64_to_cpu(sc->in[0].pos);
SSCF_ASSIGN(in2, &sc->in[2].sfl)
__entry->in3_blk = le64_to_cpu(sc->in[0].blk);
__entry->in3_pos = le64_to_cpu(sc->in[0].pos);
SSCF_ASSIGN(in3, &sc->in[3].sfl)
),
TP_printk(SCSBF" id %llu nr %u flags 0x%x out "SSCF_FMT" in0 b %llu p %llu "SSCF_FMT" in1 b %llu p %llu "SSCF_FMT" in2 b %llu p %llu "SSCF_FMT" in3 b %llu p %llu "SSCF_FMT,
SCSB_TRACE_ARGS, __entry->id, __entry->nr, __entry->flags, SSCF_ENTRY_ARGS(out),
__entry->in0_blk, __entry->in0_pos, SSCF_ENTRY_ARGS(in0),
__entry->in1_blk, __entry->in1_pos, SSCF_ENTRY_ARGS(in1),
__entry->in2_blk, __entry->in2_pos, SSCF_ENTRY_ARGS(in2),
__entry->in3_blk, __entry->in3_pos, SSCF_ENTRY_ARGS(in3))
);
DEFINE_EVENT(scoutfs_srch_compact_class, scoutfs_srch_compact_client_send,
TP_PROTO(struct super_block *sb, struct scoutfs_srch_compact *sc),
TP_ARGS(sb, sc)
);
DEFINE_EVENT(scoutfs_srch_compact_class, scoutfs_srch_compact_client_recv,
TP_PROTO(struct super_block *sb, struct scoutfs_srch_compact *sc),
TP_ARGS(sb, sc)
);
#endif /* _TRACE_SCOUTFS_H */
/* This part must be outside protection */

View File

@@ -148,6 +148,8 @@ struct server_info {
struct scoutfs_quorum_config qconf;
/* a running server maintains a private dirty super */
struct scoutfs_super_block dirty_super;
u64 finalize_sent_seq;
};
#define DECLARE_SERVER_INFO(sb, name) \
@@ -296,7 +298,7 @@ static void check_holder_budget(struct super_block *sb, struct server_info *serv
{
static bool exceeded_once = false;
struct commit_hold *hold;
struct timespec ts;
struct timespec64 ts;
u32 avail_used;
u32 freed_used;
u32 avail_now;
@@ -328,7 +330,7 @@ static void check_holder_budget(struct super_block *sb, struct server_info *serv
cusers->freed_before, freed_now);
list_for_each_entry(hold, &cusers->holding, entry) {
ts = ktime_to_timespec(hold->start);
ts = ktime_to_timespec64(hold->start);
scoutfs_err(sb, "exceeding hold start %llu.%09llu av %u fr %u",
(u64)ts.tv_sec, (u64)ts.tv_nsec, hold->avail, hold->freed);
hold->exceeded = true;
@@ -413,6 +415,27 @@ static void server_hold_commit(struct super_block *sb, struct commit_hold *hold)
wait_event(cusers->waitq, hold_commit(sb, server, cusers, hold));
}
/*
* Return the higher of the avail or freed used by the active commit
* since this holder joined the commit. This is *not* the amount used
* by the holder, we don't track per-holder alloc use.
*/
static u32 server_hold_alloc_used_since(struct super_block *sb, struct commit_hold *hold)
{
DECLARE_SERVER_INFO(sb, server);
u32 avail_used;
u32 freed_used;
u32 avail_now;
u32 freed_now;
scoutfs_alloc_meta_remaining(&server->alloc, &avail_now, &freed_now);
avail_used = hold->avail - avail_now;
freed_used = hold->freed - freed_now;
return max(avail_used, freed_used);
}
/*
* This is called while holding the commit and returns once the commit
* is successfully written. Many holders can all wait for all holders
@@ -422,7 +445,7 @@ static int server_apply_commit(struct super_block *sb, struct commit_hold *hold,
{
DECLARE_SERVER_INFO(sb, server);
struct commit_users *cusers = &server->cusers;
struct timespec ts;
struct timespec64 ts;
spin_lock(&cusers->lock);
@@ -431,7 +454,7 @@ static int server_apply_commit(struct super_block *sb, struct commit_hold *hold,
check_holder_budget(sb, server, cusers);
if (hold->exceeded) {
ts = ktime_to_timespec(hold->start);
ts = ktime_to_timespec64(hold->start);
scoutfs_err(sb, "exceeding hold start %llu.%09llu stack:",
(u64)ts.tv_sec, (u64)ts.tv_nsec);
dump_stack();
@@ -938,22 +961,24 @@ static int find_log_trees_item(struct super_block *sb,
}
/*
* Find the next log_trees item from the key. Fills the caller's log_trees and sets
* the key past the returned log_trees for iteration. Returns 0 when done, > 0 for each
* item, and -errno on fatal errors.
* Find the log_trees item with the greatest nr for each rid. Fills the
* caller's log_trees and sets the key before the returned log_trees for
* the next iteration. Returns 0 when done, > 0 for each item, and
* -errno on fatal errors.
*/
static int for_each_lt(struct super_block *sb, struct scoutfs_btree_root *root,
struct scoutfs_key *key, struct scoutfs_log_trees *lt)
static int for_each_rid_last_lt(struct super_block *sb, struct scoutfs_btree_root *root,
struct scoutfs_key *key, struct scoutfs_log_trees *lt)
{
SCOUTFS_BTREE_ITEM_REF(iref);
int ret;
ret = scoutfs_btree_next(sb, root, key, &iref);
ret = scoutfs_btree_prev(sb, root, key, &iref);
if (ret == 0) {
if (iref.val_len == sizeof(struct scoutfs_log_trees)) {
memcpy(lt, iref.val, iref.val_len);
*key = *iref.key;
scoutfs_key_inc(key);
key->sklt_nr = 0;
scoutfs_key_dec(key);
ret = 1;
} else {
ret = -EIO;
@@ -1048,21 +1073,13 @@ static int next_log_merge_item(struct super_block *sb,
* abandoned log btree finalized. If it takes too long each client has
* a change to make forward progress before being asked to commit again.
*
* We're waiting on heavy state that is protected by mutexes and
* transaction machinery. It's tricky to recreate that state for
* lightweight condition tests that don't change task state. Instead of
* trying to get that right, particularly as we unwind after success or
* after timeouts, waiters use an unsatisfying poll. Short enough to
* not add terrible latency, given how heavy and infrequent this already
* is, and long enough to not melt the cpu. This could be tuned if it
* becomes a problem.
*
* This can end up finalizing a new empty log btree if a new mount
* happens to arrive at just the right time. That's fine, merging will
* ignore and tear down the empty input.
*/
#define FINALIZE_POLL_MS (11)
#define FINALIZE_TIMEOUT_MS (MSEC_PER_SEC / 2)
#define FINALIZE_POLL_MIN_DELAY_MS 5U
#define FINALIZE_POLL_MAX_DELAY_MS 100U
#define FINALIZE_POLL_DELAY_GROWTH_PCT 150U
static int finalize_and_start_log_merge(struct super_block *sb, struct scoutfs_log_trees *lt,
u64 rid, struct commit_hold *hold)
{
@@ -1070,8 +1087,10 @@ static int finalize_and_start_log_merge(struct super_block *sb, struct scoutfs_l
struct scoutfs_super_block *super = DIRTY_SUPER_SB(sb);
struct scoutfs_log_merge_status stat;
struct scoutfs_log_merge_range rng;
struct scoutfs_mount_options opts;
struct scoutfs_log_trees each_lt;
struct scoutfs_log_trees fin;
unsigned int delay_ms;
unsigned long timeo;
bool saw_finalized;
bool others_active;
@@ -1079,10 +1098,14 @@ static int finalize_and_start_log_merge(struct super_block *sb, struct scoutfs_l
bool ours_visible;
struct scoutfs_key key;
char *err_str = NULL;
ktime_t start;
int ret;
int err;
timeo = jiffies + msecs_to_jiffies(FINALIZE_TIMEOUT_MS);
scoutfs_options_read(sb, &opts);
timeo = jiffies + msecs_to_jiffies(opts.log_merge_wait_timeout_ms);
delay_ms = FINALIZE_POLL_MIN_DELAY_MS;
start = ktime_get_raw();
for (;;) {
/* nothing to do if there's already a merge in flight */
@@ -1099,8 +1122,13 @@ static int finalize_and_start_log_merge(struct super_block *sb, struct scoutfs_l
saw_finalized = false;
others_active = false;
ours_visible = false;
scoutfs_key_init_log_trees(&key, 0, 0);
while ((ret = for_each_lt(sb, &super->logs_root, &key, &each_lt)) > 0) {
scoutfs_key_init_log_trees(&key, U64_MAX, U64_MAX);
while ((ret = for_each_rid_last_lt(sb, &super->logs_root, &key, &each_lt)) > 0) {
trace_scoutfs_server_finalize_items(sb, rid, le64_to_cpu(each_lt.rid),
le64_to_cpu(each_lt.nr),
le64_to_cpu(each_lt.flags),
le64_to_cpu(each_lt.get_trans_seq));
if ((le64_to_cpu(each_lt.flags) & SCOUTFS_LOG_TREES_FINALIZED))
saw_finalized = true;
@@ -1125,6 +1153,10 @@ static int finalize_and_start_log_merge(struct super_block *sb, struct scoutfs_l
finalize_ours = (lt->item_root.height > 2) ||
(le32_to_cpu(lt->meta_avail.flags) & SCOUTFS_ALLOC_FLAG_LOW);
trace_scoutfs_server_finalize_decision(sb, rid, saw_finalized, others_active,
ours_visible, finalize_ours, delay_ms,
server->finalize_sent_seq);
/* done if we're not finalizing and there's no finalized */
if (!finalize_ours && !saw_finalized) {
ret = 0;
@@ -1132,12 +1164,13 @@ static int finalize_and_start_log_merge(struct super_block *sb, struct scoutfs_l
}
/* send sync requests soon to give time to commit */
scoutfs_key_init_log_trees(&key, 0, 0);
scoutfs_key_init_log_trees(&key, U64_MAX, U64_MAX);
while (others_active &&
(ret = for_each_lt(sb, &super->logs_root, &key, &each_lt)) > 0) {
(ret = for_each_rid_last_lt(sb, &super->logs_root, &key, &each_lt)) > 0) {
if ((le64_to_cpu(each_lt.flags) & SCOUTFS_LOG_TREES_FINALIZED) ||
(le64_to_cpu(each_lt.rid) == rid))
(le64_to_cpu(each_lt.rid) == rid) ||
(le64_to_cpu(each_lt.get_trans_seq) <= server->finalize_sent_seq))
continue;
ret = scoutfs_net_submit_request_node(sb, server->conn,
@@ -1157,6 +1190,8 @@ static int finalize_and_start_log_merge(struct super_block *sb, struct scoutfs_l
break;
}
server->finalize_sent_seq = scoutfs_server_seq(sb);
/* Finalize ours if it's visible to others */
if (ours_visible) {
fin = *lt;
@@ -1194,13 +1229,16 @@ static int finalize_and_start_log_merge(struct super_block *sb, struct scoutfs_l
if (ret < 0)
err_str = "applying commit before waiting for finalized";
msleep(FINALIZE_POLL_MS);
msleep(delay_ms);
delay_ms = min(delay_ms * FINALIZE_POLL_DELAY_GROWTH_PCT / 100,
FINALIZE_POLL_MAX_DELAY_MS);
server_hold_commit(sb, hold);
mutex_lock(&server->logs_mutex);
/* done if we timed out */
if (time_after(jiffies, timeo)) {
scoutfs_inc_counter(sb, log_merge_wait_timeout);
ret = 0;
break;
}
@@ -1783,43 +1821,29 @@ out:
* Give the caller the last seq before outstanding client commits. All
* seqs up to and including this are stable, new client transactions can
* only have greater seqs.
*
* For each rid, only its greatest log trees nr can be an open commit.
* We look at the last log_trees item for each client rid and record its
* trans seq if it hasn't been committed.
*/
static int get_stable_trans_seq(struct super_block *sb, u64 *last_seq_ret)
{
struct scoutfs_super_block *super = DIRTY_SUPER_SB(sb);
DECLARE_SERVER_INFO(sb, server);
SCOUTFS_BTREE_ITEM_REF(iref);
struct scoutfs_log_trees *lt;
struct scoutfs_log_trees lt;
struct scoutfs_key key;
u64 last_seq = 0;
int ret;
last_seq = scoutfs_server_seq(sb) - 1;
scoutfs_key_init_log_trees(&key, 0, 0);
mutex_lock(&server->logs_mutex);
for (;; scoutfs_key_inc(&key)) {
ret = scoutfs_btree_next(sb, &super->logs_root, &key, &iref);
if (ret == 0) {
if (iref.val_len == sizeof(*lt)) {
lt = iref.val;
if ((le64_to_cpu(lt->get_trans_seq) >
le64_to_cpu(lt->commit_trans_seq)) &&
le64_to_cpu(lt->get_trans_seq) <= last_seq) {
last_seq = le64_to_cpu(lt->get_trans_seq) - 1;
}
key = *iref.key;
} else {
ret = -EIO;
}
scoutfs_btree_put_iref(&iref);
}
if (ret < 0) {
if (ret == -ENOENT) {
ret = 0;
break;
}
scoutfs_key_init_log_trees(&key, U64_MAX, U64_MAX);
while ((ret = for_each_rid_last_lt(sb, &super->logs_root, &key, &lt)) > 0) {
if ((le64_to_cpu(lt.get_trans_seq) > le64_to_cpu(lt.commit_trans_seq)) &&
le64_to_cpu(lt.get_trans_seq) <= last_seq) {
last_seq = le64_to_cpu(lt.get_trans_seq) - 1;
}
}
@@ -1966,9 +1990,7 @@ static int server_srch_get_compact(struct super_block *sb,
ret = scoutfs_srch_get_compact(sb, &server->alloc, &server->wri,
&super->srch_root, rid, sc);
mutex_unlock(&server->srch_mutex);
if (ret == 0 && sc->nr == 0)
ret = -ENOENT;
if (ret < 0)
if (ret < 0 || (ret == 0 && sc->nr == 0))
goto apply;
mutex_lock(&server->alloc_mutex);
@@ -2473,9 +2495,11 @@ static void server_log_merge_free_work(struct work_struct *work)
while (!server_is_stopping(server)) {
server_hold_commit(sb, &hold);
mutex_lock(&server->logs_mutex);
commit = true;
if (!commit) {
server_hold_commit(sb, &hold);
mutex_lock(&server->logs_mutex);
commit = true;
}
ret = next_log_merge_item(sb, &super->log_merge,
SCOUTFS_LOG_MERGE_FREEING_ZONE,
@@ -2522,12 +2546,14 @@ static void server_log_merge_free_work(struct work_struct *work)
/* freed blocks are in allocator, we *have* to update fr */
BUG_ON(ret < 0);
mutex_unlock(&server->logs_mutex);
ret = server_apply_commit(sb, &hold, ret);
commit = false;
if (ret < 0) {
err_str = "looping commit del/upd freeing item";
break;
if (server_hold_alloc_used_since(sb, &hold) >= COMMIT_HOLD_ALLOC_BUDGET / 2) {
mutex_unlock(&server->logs_mutex);
ret = server_apply_commit(sb, &hold, ret);
commit = false;
if (ret < 0) {
err_str = "looping commit del/upd freeing item";
break;
}
}
}
@@ -4300,6 +4326,7 @@ static void scoutfs_server_worker(struct work_struct *work)
scoutfs_info(sb, "server starting at "SIN_FMT, SIN_ARG(&sin));
scoutfs_block_writer_init(sb, &server->wri);
server->finalize_sent_seq = 0;
/* first make sure no other servers are still running */
ret = scoutfs_quorum_fence_leaders(sb, &server->qconf, server->term);

View File

@@ -18,6 +18,7 @@
#include <linux/pagemap.h>
#include <linux/vmalloc.h>
#include <linux/sort.h>
#include <asm/unaligned.h>
#include "super.h"
#include "format.h"
@@ -30,6 +31,9 @@
#include "client.h"
#include "counters.h"
#include "scoutfs_trace.h"
#include "triggers.h"
#include "sysfs.h"
#include "msg.h"
/*
* This srch subsystem gives us a way to find inodes that have a given
@@ -68,10 +72,14 @@ struct srch_info {
atomic_t shutdown;
struct workqueue_struct *workq;
struct delayed_work compact_dwork;
struct scoutfs_sysfs_attrs ssa;
atomic_t compact_delay_ms;
};
#define DECLARE_SRCH_INFO(sb, name) \
struct srch_info *name = SCOUTFS_SB(sb)->srch_info
#define DECLARE_SRCH_INFO_KOBJ(kobj, name) \
DECLARE_SRCH_INFO(SCOUTFS_SYSFS_ATTRS_SB(kobj), name)
#define SRE_FMT "%016llx.%llu.%llu"
#define SRE_ARG(sre) \
@@ -520,6 +528,95 @@ out:
return ret;
}
/*
* Padded entries are encoded in pairs after an existing entry. All of
* the pairs cancel each other out by all readers (the second encoding
* looks like deletion) so they aren't visible to the first/last bounds of
* the block or file.
*/
static int append_padded_entry(struct scoutfs_srch_file *sfl, u64 blk,
struct scoutfs_srch_block *srb, struct scoutfs_srch_entry *sre)
{
int ret;
ret = encode_entry(srb->entries + le32_to_cpu(srb->entry_bytes),
sre, &srb->tail);
if (ret > 0) {
srb->tail = *sre;
le32_add_cpu(&srb->entry_nr, 1);
le32_add_cpu(&srb->entry_bytes, ret);
le64_add_cpu(&sfl->entries, 1);
ret = 0;
}
return ret;
}
/*
* This is called by a testing trigger to create a very specific case of
* encoded entry offsets. We want the last entry in the block to start
* precisely at the _SAFE_BYTES offset.
*
* This is called when there is a single existing entry in the block.
* We have the entire block to work with. We encode pairs of matching
* entries. This hides them from readers (both searches and merging) as
* they're interpreted as creation and deletion and are deleted. We use
* the existing hash value of the first entry in the block but then set
* the inode to an impossibly large number so it doesn't interfere with
* anything.
*
* To hit the specific offset we very carefully manage the amount of
* bytes of change between fields in the entry. We know that if we
* change all the byte of the ino and id we end up with a 20 byte
* (2+8+8,2) encoding of the pair of entries. To have the last entry
* start at the _SAFE_POS offset we know that the final 20 byte pair
* encoding needs to end at 2 bytes (second entry encoding) after the
* _SAFE_POS offset.
*
* So as we encode pairs we watch the delta of our current offset from
* that desired final offset of 2 past _SAFE_POS. If we're a multiple
* of 20 away then we encode the full 20 byte pairs. If we're not, then
* we drop a byte to encode 19 bytes. That'll slowly change the offset
* to be a multiple of 20 again while encoding large entries.
*/
static void pad_entries_at_safe(struct scoutfs_srch_file *sfl, u64 blk,
struct scoutfs_srch_block *srb)
{
struct scoutfs_srch_entry sre;
u32 target;
s32 diff;
u64 hash;
u64 ino;
u64 id;
int ret;
hash = le64_to_cpu(srb->tail.hash);
ino = le64_to_cpu(srb->tail.ino) | (1ULL << 62);
id = le64_to_cpu(srb->tail.id);
target = SCOUTFS_SRCH_BLOCK_SAFE_BYTES + 2;
while ((diff = target - le32_to_cpu(srb->entry_bytes)) > 0) {
ino ^= 1ULL << (7 * 8);
if (diff % 20 == 0) {
id ^= 1ULL << (7 * 8);
} else {
id ^= 1ULL << (6 * 8);
}
sre.hash = cpu_to_le64(hash);
sre.ino = cpu_to_le64(ino);
sre.id = cpu_to_le64(id);
ret = append_padded_entry(sfl, blk, srb, &sre);
if (ret == 0)
ret = append_padded_entry(sfl, blk, srb, &sre);
BUG_ON(ret != 0);
diff = target - le32_to_cpu(srb->entry_bytes);
}
}
/*
* The caller is dropping an ino/id because the tracking rbtree is full.
* This loses information so we can't return any entries at or after the
@@ -987,6 +1084,9 @@ int scoutfs_srch_rotate_log(struct super_block *sb,
struct scoutfs_key key;
int ret;
if (sfl->ref.blkno && !force && scoutfs_trigger(sb, SRCH_FORCE_LOG_ROTATE))
force = true;
if (sfl->ref.blkno == 0 ||
(!force && le64_to_cpu(sfl->blocks) < SCOUTFS_SRCH_LOG_BLOCK_LIMIT))
return 0;
@@ -1462,7 +1562,7 @@ static int kway_merge(struct super_block *sb,
struct scoutfs_block_writer *wri,
struct scoutfs_srch_file *sfl,
kway_get_t kway_get, kway_advance_t kway_adv,
void **args, int nr)
void **args, int nr, bool logs_input)
{
DECLARE_SRCH_INFO(sb, srinf);
struct scoutfs_srch_block *srb = NULL;
@@ -1489,8 +1589,7 @@ static int kway_merge(struct super_block *sb,
nr_parents = max_t(unsigned long, 1, roundup_pow_of_two(nr) - 1);
/* root at [1] for easy sib/parent index calc, final pad for odd sib */
nr_nodes = 1 + nr_parents + nr + 1;
tnodes = __vmalloc(nr_nodes * sizeof(struct tourn_node),
GFP_NOFS, PAGE_KERNEL);
tnodes = kc__vmalloc(nr_nodes * sizeof(struct tourn_node), GFP_NOFS);
if (!tnodes)
return -ENOMEM;
@@ -1567,6 +1666,15 @@ static int kway_merge(struct super_block *sb,
blk++;
}
/* end sorted block on _SAFE offset for testing */
if (bl && le32_to_cpu(srb->entry_nr) == 1 && logs_input &&
scoutfs_trigger(sb, SRCH_COMPACT_LOGS_PAD_SAFE)) {
pad_entries_at_safe(sfl, blk, srb);
scoutfs_block_put(sb, bl);
bl = NULL;
blk++;
}
scoutfs_inc_counter(sb, srch_compact_entry);
} else {
@@ -1609,6 +1717,8 @@ static int kway_merge(struct super_block *sb,
empty++;
ret = 0;
} else if (ret < 0) {
if (ret == -ENOANO) /* just testing trigger */
ret = 0;
goto out;
}
@@ -1816,7 +1926,7 @@ static int compact_logs(struct super_block *sb,
}
ret = kway_merge(sb, alloc, wri, &sc->out, kway_get_page, kway_adv_page,
args, nr_pages);
args, nr_pages, true);
if (ret < 0)
goto out;
@@ -1874,12 +1984,18 @@ static int kway_get_reader(struct super_block *sb,
srb = rdr->bl->data;
if (rdr->pos > SCOUTFS_SRCH_BLOCK_SAFE_BYTES ||
rdr->skip >= SCOUTFS_SRCH_BLOCK_SAFE_BYTES ||
rdr->skip > SCOUTFS_SRCH_BLOCK_SAFE_BYTES ||
rdr->skip >= le32_to_cpu(srb->entry_bytes)) {
/* XXX inconsistency */
return -EIO;
}
if (rdr->decoded_bytes == 0 && rdr->pos == SCOUTFS_SRCH_BLOCK_SAFE_BYTES &&
scoutfs_trigger(sb, SRCH_MERGE_STOP_SAFE)) {
/* only used in testing */
return -ENOANO;
}
/* decode entry, possibly skipping start of the block */
while (rdr->decoded_bytes == 0 || rdr->pos < rdr->skip) {
ret = decode_entry(srb->entries + rdr->pos,
@@ -1969,7 +2085,7 @@ static int compact_sorted(struct super_block *sb,
}
ret = kway_merge(sb, alloc, wri, &sc->out, kway_get_reader,
kway_adv_reader, args, nr);
kway_adv_reader, args, nr, false);
sc->flags |= SCOUTFS_SRCH_COMPACT_FLAG_DONE;
for (i = 0; i < nr; i++) {
@@ -2098,8 +2214,15 @@ static int delete_files(struct super_block *sb, struct scoutfs_alloc *alloc,
return ret;
}
/* wait 10s between compact attempts on error, immediate after success */
#define SRCH_COMPACT_DELAY_MS (10 * MSEC_PER_SEC)
static void queue_compact_work(struct srch_info *srinf, bool immediate)
{
unsigned long delay;
if (!atomic_read(&srinf->shutdown)) {
delay = immediate ? 0 : msecs_to_jiffies(atomic_read(&srinf->compact_delay_ms));
queue_delayed_work(srinf->workq, &srinf->compact_dwork, delay);
}
}
/*
* Get a compaction operation from the server, sort the entries from the
@@ -2127,7 +2250,6 @@ static void scoutfs_srch_compact_worker(struct work_struct *work)
struct super_block *sb = srinf->sb;
struct scoutfs_block_writer wri;
struct scoutfs_alloc alloc;
unsigned long delay;
int ret;
int err;
@@ -2140,6 +2262,8 @@ static void scoutfs_srch_compact_worker(struct work_struct *work)
scoutfs_block_writer_init(sb, &wri);
ret = scoutfs_client_srch_get_compact(sb, sc);
if (ret >= 0)
trace_scoutfs_srch_compact_client_recv(sb, sc);
if (ret < 0 || sc->nr == 0)
goto out;
@@ -2168,6 +2292,7 @@ commit:
sc->meta_freed = alloc.freed;
sc->flags |= ret < 0 ? SCOUTFS_SRCH_COMPACT_FLAG_ERROR : 0;
trace_scoutfs_srch_compact_client_send(sb, sc);
err = scoutfs_client_srch_commit_compact(sb, sc);
if (err < 0 && ret == 0)
ret = err;
@@ -2178,14 +2303,56 @@ out:
scoutfs_inc_counter(sb, srch_compact_error);
scoutfs_block_writer_forget_all(sb, &wri);
if (!atomic_read(&srinf->shutdown)) {
delay = ret == 0 ? 0 : msecs_to_jiffies(SRCH_COMPACT_DELAY_MS);
queue_delayed_work(srinf->workq, &srinf->compact_dwork, delay);
}
queue_compact_work(srinf, sc->nr > 0 && ret == 0);
kfree(sc);
}
static ssize_t compact_delay_ms_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
{
DECLARE_SRCH_INFO_KOBJ(kobj, srinf);
return snprintf(buf, PAGE_SIZE, "%u", atomic_read(&srinf->compact_delay_ms));
}
#define MIN_COMPACT_DELAY_MS MSEC_PER_SEC
#define DEF_COMPACT_DELAY_MS (10 * MSEC_PER_SEC)
#define MAX_COMPACT_DELAY_MS (60 * MSEC_PER_SEC)
static ssize_t compact_delay_ms_store(struct kobject *kobj, struct kobj_attribute *attr,
const char *buf, size_t count)
{
struct super_block *sb = SCOUTFS_SYSFS_ATTRS_SB(kobj);
DECLARE_SRCH_INFO(sb, srinf);
char nullterm[30]; /* more than enough for octal -U64_MAX */
u64 val;
int len;
int ret;
len = min(count, sizeof(nullterm) - 1);
memcpy(nullterm, buf, len);
nullterm[len] = '\0';
ret = kstrtoll(nullterm, 0, &val);
if (ret < 0 || val < MIN_COMPACT_DELAY_MS || val > MAX_COMPACT_DELAY_MS) {
scoutfs_err(sb, "invalid compact_delay_ms value, must be between %lu and %lu",
MIN_COMPACT_DELAY_MS, MAX_COMPACT_DELAY_MS);
return -EINVAL;
}
atomic_set(&srinf->compact_delay_ms, val);
cancel_delayed_work(&srinf->compact_dwork);
queue_compact_work(srinf, false);
return count;
}
SCOUTFS_ATTR_RW(compact_delay_ms);
static struct attribute *srch_attrs[] = {
SCOUTFS_ATTR_PTR(compact_delay_ms),
NULL,
};
void scoutfs_srch_destroy(struct super_block *sb)
{
struct scoutfs_sb_info *sbi = SCOUTFS_SB(sb);
@@ -2202,6 +2369,8 @@ void scoutfs_srch_destroy(struct super_block *sb)
destroy_workqueue(srinf->workq);
}
scoutfs_sysfs_destroy_attrs(sb, &srinf->ssa);
kfree(srinf);
sbi->srch_info = NULL;
}
@@ -2219,8 +2388,15 @@ int scoutfs_srch_setup(struct super_block *sb)
srinf->sb = sb;
atomic_set(&srinf->shutdown, 0);
INIT_DELAYED_WORK(&srinf->compact_dwork, scoutfs_srch_compact_worker);
scoutfs_sysfs_init_attrs(sb, &srinf->ssa);
atomic_set(&srinf->compact_delay_ms, DEF_COMPACT_DELAY_MS);
sbi->srch_info = srinf;
ret = scoutfs_sysfs_create_attrs(sb, &srinf->ssa, srch_attrs, "srch");
if (ret < 0)
goto out;
srinf->workq = alloc_workqueue("scoutfs_srch_compact",
WQ_NON_REENTRANT | WQ_UNBOUND |
WQ_HIGHPRI, 0);
@@ -2229,8 +2405,7 @@ int scoutfs_srch_setup(struct super_block *sb)
goto out;
}
queue_delayed_work(srinf->workq, &srinf->compact_dwork,
msecs_to_jiffies(SRCH_COMPACT_DELAY_MS));
queue_compact_work(srinf, false);
ret = 0;
out:

View File

@@ -49,6 +49,8 @@
#include "volopt.h"
#include "fence.h"
#include "xattr.h"
#include "wkic.h"
#include "quota.h"
#include "scoutfs_trace.h"
static struct dentry *scoutfs_debugfs_root;
@@ -158,7 +160,11 @@ static void scoutfs_metadev_close(struct super_block *sb)
* from kill_sb->put_super.
*/
lockdep_off();
#ifdef KC_BLKDEV_PUT_HOLDER_ARG
blkdev_put(sbi->meta_bdev, sb);
#else
blkdev_put(sbi->meta_bdev, SCOUTFS_META_BDEV_MODE);
#endif
lockdep_on();
sbi->meta_bdev = NULL;
}
@@ -194,7 +200,9 @@ static void scoutfs_put_super(struct super_block *sb)
scoutfs_shutdown_trans(sb);
scoutfs_volopt_destroy(sb);
scoutfs_client_destroy(sb);
scoutfs_quota_destroy(sb);
scoutfs_inode_destroy(sb);
scoutfs_wkic_destroy(sb);
scoutfs_item_destroy(sb);
scoutfs_forest_destroy(sb);
scoutfs_data_destroy(sb);
@@ -519,7 +527,11 @@ static int scoutfs_fill_super(struct super_block *sb, void *data, int silent)
goto out;
}
#ifdef KC_BLKDEV_PUT_HOLDER_ARG
meta_bdev = blkdev_get_by_path(opts.metadev_path, SCOUTFS_META_BDEV_MODE, sb, NULL);
#else
meta_bdev = blkdev_get_by_path(opts.metadev_path, SCOUTFS_META_BDEV_MODE, sb);
#endif
if (IS_ERR(meta_bdev)) {
scoutfs_err(sb, "could not open metadev: error %ld",
PTR_ERR(meta_bdev));
@@ -544,7 +556,9 @@ static int scoutfs_fill_super(struct super_block *sb, void *data, int silent)
scoutfs_block_setup(sb) ?:
scoutfs_forest_setup(sb) ?:
scoutfs_item_setup(sb) ?:
scoutfs_wkic_setup(sb) ?:
scoutfs_inode_setup(sb) ?:
scoutfs_quota_setup(sb) ?:
scoutfs_data_setup(sb) ?:
scoutfs_setup_trans(sb) ?:
scoutfs_omap_setup(sb) ?:

View File

@@ -30,6 +30,8 @@ struct recov_info;
struct omap_info;
struct volopt_info;
struct fence_info;
struct wkic_info;
struct squota_info;
struct scoutfs_sb_info {
struct super_block *sb;
@@ -55,6 +57,8 @@ struct scoutfs_sb_info {
struct omap_info *omap_info;
struct volopt_info *volopt_info;
struct item_cache_info *item_cache_info;
struct wkic_info *wkic_info;
struct squota_info *squota_info;
struct fence_info *fence_info;
/* tracks tasks waiting for data extents */
@@ -97,7 +101,11 @@ static inline bool SCOUTFS_IS_META_BDEV(struct scoutfs_super_block *super_block)
return !!(le64_to_cpu(super_block->flags) & SCOUTFS_FLAG_IS_META_BDEV);
}
#ifdef KC_HAVE_BLK_MODE_T
#define SCOUTFS_META_BDEV_MODE (BLK_OPEN_READ | BLK_OPEN_WRITE | BLK_OPEN_EXCL)
#else
#define SCOUTFS_META_BDEV_MODE (FMODE_READ | FMODE_WRITE | FMODE_EXCL)
#endif
static inline bool scoutfs_forcing_unmount(struct super_block *sb)
{
@@ -156,4 +164,17 @@ int scoutfs_write_super(struct super_block *sb,
/* to keep this out of the ioctl.h public interface definition */
long scoutfs_ioctl(struct file *file, unsigned int cmd, unsigned long arg);
/*
* Returns 0 when supported, non-zero -errno when unsupported.
*/
static inline int scoutfs_fmt_vers_unsupported(struct super_block *sb, u64 vers)
{
struct scoutfs_sb_info *sbi = SCOUTFS_SB(sb);
if (sbi && (sbi->fmt_vers < vers))
return -EOPNOTSUPP;
else
return 0;
}
#endif

View File

@@ -13,6 +13,7 @@
#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/fs.h>
#include <linux/blkdev.h>
#include "super.h"
#include "sysfs.h"

90
kmod/src/totl.c Normal file
View File

@@ -0,0 +1,90 @@
/*
* Copyright (C) 2023 Versity Software, Inc. All rights reserved.
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public
* License v2 as published by the Free Software Foundation.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
* General Public License for more details.
*/
#include <linux/kernel.h>
#include <linux/string.h>
#include "format.h"
#include "forest.h"
#include "totl.h"
void scoutfs_totl_set_range(struct scoutfs_key *start, struct scoutfs_key *end)
{
scoutfs_key_set_zeros(start);
start->sk_zone = SCOUTFS_XATTR_TOTL_ZONE;
scoutfs_key_set_ones(end);
end->sk_zone = SCOUTFS_XATTR_TOTL_ZONE;
}
void scoutfs_totl_merge_init(struct scoutfs_totl_merging *merg)
{
memset(merg, 0, sizeof(struct scoutfs_totl_merging));
}
void scoutfs_totl_merge_contribute(struct scoutfs_totl_merging *merg,
u64 seq, u8 flags, void *val, int val_len, int fic)
{
struct scoutfs_xattr_totl_val *tval = val;
if (fic & FIC_FS_ROOT) {
merg->fs_seq = seq;
merg->fs_total = le64_to_cpu(tval->total);
merg->fs_count = le64_to_cpu(tval->count);
} else if (fic & FIC_FINALIZED) {
merg->fin_seq = seq;
merg->fin_total += le64_to_cpu(tval->total);
merg->fin_count += le64_to_cpu(tval->count);
} else {
merg->log_seq = seq;
merg->log_total += le64_to_cpu(tval->total);
merg->log_count += le64_to_cpu(tval->count);
}
}
/*
* .totl. item merging has to be careful because the log btree merging
* code can write partial results to the fs_root. This means that a
* reader can see both cases where new finalized logs should be applied
* to the old fs items and where old finalized logs have already been
* applied to the partially merged fs items. Currently active logged
* items are always applied on top of all cases.
*
* These cases are differentiated with a combination of sequence numbers
* in items, the count of contributing xattrs, and a flag
* differentiating finalized and active logged items. This lets us
* recognize all cases, including when finalized logs were merged and
* deleted the fs item.
*/
void scoutfs_totl_merge_resolve(struct scoutfs_totl_merging *merg, __u64 *total, __u64 *count)
{
*total = 0;
*count = 0;
/* start with the fs item if we have it */
if (merg->fs_seq != 0) {
*total = merg->fs_total;
*count = merg->fs_count;
}
/* apply finalized logs if they're newer or creating */
if (((merg->fs_seq != 0) && (merg->fin_seq > merg->fs_seq)) ||
((merg->fs_seq == 0) && (merg->fin_count > 0))) {
*total += merg->fin_total;
*count += merg->fin_count;
}
/* always apply active logs which must be newer than fs and finalized */
if (merg->log_seq > 0) {
*total += merg->log_total;
*count += merg->log_count;
}
}

24
kmod/src/totl.h Normal file
View File

@@ -0,0 +1,24 @@
#ifndef _SCOUTFS_TOTL_H_
#define _SCOUTFS_TOTL_H_
#include "key.h"
struct scoutfs_totl_merging {
u64 fs_seq;
u64 fs_total;
u64 fs_count;
u64 fin_seq;
u64 fin_total;
s64 fin_count;
u64 log_seq;
u64 log_total;
s64 log_count;
};
void scoutfs_totl_set_range(struct scoutfs_key *start, struct scoutfs_key *end);
void scoutfs_totl_merge_init(struct scoutfs_totl_merging *merg);
void scoutfs_totl_merge_contribute(struct scoutfs_totl_merging *merg,
u64 seq, u8 flags, void *val, int val_len, int fic);
void scoutfs_totl_merge_resolve(struct scoutfs_totl_merging *merg, __u64 *total, __u64 *count);
#endif

143
kmod/src/trace/quota.h Normal file
View File

@@ -0,0 +1,143 @@
/*
* Tracing squota_input
*/
#define SQI_FMT "[%u %llu %llu %llu]"
#define SQI_ARGS(i) \
(i)->op, (i)->attrs[0], (i)->attrs[1], (i)->attrs[2]
#define SQI_FIELDS(pref) \
__array(__u64, pref##_attrs, SQ_NS__NR_SELECT) \
__field(__u8, pref##_op)
#define SQI_ASSIGN(pref, i) \
__entry->pref##_attrs[0] = (i)->attrs[0]; \
__entry->pref##_attrs[1] = (i)->attrs[1]; \
__entry->pref##_attrs[2] = (i)->attrs[2]; \
__entry->pref##_op = (i)->op;
#define SQI_ENTRY_ARGS(pref) \
__entry->pref##_op, __entry->pref##_attrs[0], \
__entry->pref##_attrs[1], __entry->pref##_attrs[2]
/*
* Tracing squota_rule
*/
#define SQR_FMT "[%u %llu,%u,%x %llu,%u,%x %llu,%u,%x %u %llu]"
#define SQR_ARGS(r) \
(r)->prio, \
(r)->name_val[0], (r)->name_source[0], (r)->name_flags[0], \
(r)->name_val[1], (r)->name_source[1], (r)->name_flags[1], \
(r)->name_val[2], (r)->name_source[2], (r)->name_flags[2], \
(r)->op, (r)->limit \
#define SQR_FIELDS(pref) \
__array(__u64, pref##_name_val, 3) \
__field(__u64, pref##_limit) \
__array(__u8, pref##_name_source, 3) \
__array(__u8, pref##_name_flags, 3) \
__field(__u8, pref##_prio) \
__field(__u8, pref##_op)
#define SQR_ASSIGN(pref, r) \
__entry->pref##_name_val[0] = (r)->names[0].val; \
__entry->pref##_name_val[1] = (r)->names[1].val; \
__entry->pref##_name_val[2] = (r)->names[2].val; \
__entry->pref##_limit = (r)->limit; \
__entry->pref##_name_source[0] = (r)->names[0].source; \
__entry->pref##_name_source[1] = (r)->names[1].source; \
__entry->pref##_name_source[2] = (r)->names[2].source; \
__entry->pref##_name_flags[0] = (r)->names[0].flags; \
__entry->pref##_name_flags[1] = (r)->names[1].flags; \
__entry->pref##_name_flags[2] = (r)->names[2].flags; \
__entry->pref##_prio = (r)->prio; \
__entry->pref##_op = (r)->op;
#define SQR_ENTRY_ARGS(pref) \
__entry->pref##_prio, __entry->pref##_name_val[0], \
__entry->pref##_name_source[0], __entry->pref##_name_flags[0], \
__entry->pref##_name_val[1], __entry->pref##_name_source[1], \
__entry->pref##_name_flags[1], __entry->pref##_name_val[2], \
__entry->pref##_name_source[2], __entry->pref##_name_flags[2], \
__entry->pref##_op, __entry->pref##_limit
TRACE_EVENT(scoutfs_quota_check,
TP_PROTO(struct super_block *sb, long rs_ptr, struct squota_input *inp, int ret),
TP_ARGS(sb, rs_ptr, inp, ret),
TP_STRUCT__entry(
SCSB_TRACE_FIELDS
__field(long, rs_ptr)
SQI_FIELDS(i)
__field(int, ret)
),
TP_fast_assign(
SCSB_TRACE_ASSIGN(sb);
__entry->rs_ptr = rs_ptr;
SQI_ASSIGN(i, inp);
__entry->ret = ret;
),
TP_printk(SCSBF" rs_ptr %ld ret %d inp "SQI_FMT,
SCSB_TRACE_ARGS, __entry->rs_ptr, __entry->ret, SQI_ENTRY_ARGS(i))
);
DECLARE_EVENT_CLASS(scoutfs_quota_rule_op_class,
TP_PROTO(struct super_block *sb, struct squota_rule *rule, int ret),
TP_ARGS(sb, rule, ret),
TP_STRUCT__entry(
SCSB_TRACE_FIELDS
SQR_FIELDS(r)
__field(int, ret)
),
TP_fast_assign(
SCSB_TRACE_ASSIGN(sb);
SQR_ASSIGN(r, rule);
__entry->ret = ret;
),
TP_printk(SCSBF" "SQR_FMT" ret %d",
SCSB_TRACE_ARGS, SQR_ENTRY_ARGS(r), __entry->ret)
);
DEFINE_EVENT(scoutfs_quota_rule_op_class, scoutfs_quota_add_rule,
TP_PROTO(struct super_block *sb, struct squota_rule *rule, int ret),
TP_ARGS(sb, rule, ret)
);
DEFINE_EVENT(scoutfs_quota_rule_op_class, scoutfs_quota_del_rule,
TP_PROTO(struct super_block *sb, struct squota_rule *rule, int ret),
TP_ARGS(sb, rule, ret)
);
TRACE_EVENT(scoutfs_quota_totl_check,
TP_PROTO(struct super_block *sb, struct squota_input *inp, struct scoutfs_key *key,
u64 limit, int ret),
TP_ARGS(sb, inp, key, limit, ret),
TP_STRUCT__entry(
SCSB_TRACE_FIELDS
SQI_FIELDS(i)
sk_trace_define(k)
__field(__u64, limit)
__field(int, ret)
),
TP_fast_assign(
SCSB_TRACE_ASSIGN(sb);
SQI_ASSIGN(i, inp);
sk_trace_assign(k, key);
__entry->limit = limit;
__entry->ret = ret;
),
TP_printk(SCSBF" inp "SQI_FMT" key "SK_FMT" limit %llu ret %d",
SCSB_TRACE_ARGS, SQI_ENTRY_ARGS(i), sk_trace_args(k), __entry->limit,
__entry->ret)
);

112
kmod/src/trace/wkic.h Normal file
View File

@@ -0,0 +1,112 @@
DECLARE_EVENT_CLASS(scoutfs_wkic_wpage_class,
TP_PROTO(struct super_block *sb, void *ptr, int which, bool n0l, bool n1l,
struct scoutfs_key *start, struct scoutfs_key *end),
TP_ARGS(sb, ptr, which, n0l, n1l, start, end),
TP_STRUCT__entry(
SCSB_TRACE_FIELDS
__field(void *, ptr)
__field(int, which)
__field(bool, n0l)
__field(bool, n1l)
sk_trace_define(start)
sk_trace_define(end)
),
TP_fast_assign(
SCSB_TRACE_ASSIGN(sb);
__entry->ptr = ptr;
__entry->which = which;
__entry->n0l = n0l;
__entry->n1l = n1l;
sk_trace_assign(start, start);
sk_trace_assign(end, end);
__entry->which = which;
),
TP_printk(SCSBF" ptr %p wh %d nl %u,%u start "SK_FMT " end "SK_FMT, SCSB_TRACE_ARGS,
__entry->ptr, __entry->which, __entry->n0l, __entry->n1l,
sk_trace_args(start), sk_trace_args(end))
);
DEFINE_EVENT(scoutfs_wkic_wpage_class, scoutfs_wkic_wpage_alloced,
TP_PROTO(struct super_block *sb, void *ptr, int which, bool n0l, bool n1l,
struct scoutfs_key *start, struct scoutfs_key *end),
TP_ARGS(sb, ptr, which, n0l, n1l, start, end)
);
DEFINE_EVENT(scoutfs_wkic_wpage_class, scoutfs_wkic_wpage_freeing,
TP_PROTO(struct super_block *sb, void *ptr, int which, bool n0l, bool n1l,
struct scoutfs_key *start, struct scoutfs_key *end),
TP_ARGS(sb, ptr, which, n0l, n1l, start, end)
);
DEFINE_EVENT(scoutfs_wkic_wpage_class, scoutfs_wkic_wpage_found,
TP_PROTO(struct super_block *sb, void *ptr, int which, bool n0l, bool n1l,
struct scoutfs_key *start, struct scoutfs_key *end),
TP_ARGS(sb, ptr, which, n0l, n1l, start, end)
);
DEFINE_EVENT(scoutfs_wkic_wpage_class, scoutfs_wkic_wpage_trimmed,
TP_PROTO(struct super_block *sb, void *ptr, int which, bool n0l, bool n1l,
struct scoutfs_key *start, struct scoutfs_key *end),
TP_ARGS(sb, ptr, which, n0l, n1l, start, end)
);
DEFINE_EVENT(scoutfs_wkic_wpage_class, scoutfs_wkic_wpage_erased,
TP_PROTO(struct super_block *sb, void *ptr, int which, bool n0l, bool n1l,
struct scoutfs_key *start, struct scoutfs_key *end),
TP_ARGS(sb, ptr, which, n0l, n1l, start, end)
);
DEFINE_EVENT(scoutfs_wkic_wpage_class, scoutfs_wkic_wpage_inserting,
TP_PROTO(struct super_block *sb, void *ptr, int which, bool n0l, bool n1l,
struct scoutfs_key *start, struct scoutfs_key *end),
TP_ARGS(sb, ptr, which, n0l, n1l, start, end)
);
DEFINE_EVENT(scoutfs_wkic_wpage_class, scoutfs_wkic_wpage_inserted,
TP_PROTO(struct super_block *sb, void *ptr, int which, bool n0l, bool n1l,
struct scoutfs_key *start, struct scoutfs_key *end),
TP_ARGS(sb, ptr, which, n0l, n1l, start, end)
);
DEFINE_EVENT(scoutfs_wkic_wpage_class, scoutfs_wkic_wpage_shrinking,
TP_PROTO(struct super_block *sb, void *ptr, int which, bool n0l, bool n1l,
struct scoutfs_key *start, struct scoutfs_key *end),
TP_ARGS(sb, ptr, which, n0l, n1l, start, end)
);
DEFINE_EVENT(scoutfs_wkic_wpage_class, scoutfs_wkic_wpage_dropping,
TP_PROTO(struct super_block *sb, void *ptr, int which, bool n0l, bool n1l,
struct scoutfs_key *start, struct scoutfs_key *end),
TP_ARGS(sb, ptr, which, n0l, n1l, start, end)
);
DEFINE_EVENT(scoutfs_wkic_wpage_class, scoutfs_wkic_wpage_replaying,
TP_PROTO(struct super_block *sb, void *ptr, int which, bool n0l, bool n1l,
struct scoutfs_key *start, struct scoutfs_key *end),
TP_ARGS(sb, ptr, which, n0l, n1l, start, end)
);
DEFINE_EVENT(scoutfs_wkic_wpage_class, scoutfs_wkic_wpage_filled,
TP_PROTO(struct super_block *sb, void *ptr, int which, bool n0l, bool n1l,
struct scoutfs_key *start, struct scoutfs_key *end),
TP_ARGS(sb, ptr, which, n0l, n1l, start, end)
);
TRACE_EVENT(scoutfs_wkic_read_items,
TP_PROTO(struct super_block *sb, struct scoutfs_key *key, struct scoutfs_key *start,
struct scoutfs_key *end),
TP_ARGS(sb, key, start, end),
TP_STRUCT__entry(
SCSB_TRACE_FIELDS
sk_trace_define(key)
sk_trace_define(start)
sk_trace_define(end)
),
TP_fast_assign(
SCSB_TRACE_ASSIGN(sb);
sk_trace_assign(key, start);
sk_trace_assign(start, start);
sk_trace_assign(end, end);
),
TP_printk(SCSBF" key "SK_FMT" start "SK_FMT " end "SK_FMT, SCSB_TRACE_ARGS,
sk_trace_args(key), sk_trace_args(start), sk_trace_args(end))
);

View File

@@ -39,6 +39,9 @@ struct scoutfs_triggers {
static char *names[] = {
[SCOUTFS_TRIGGER_BLOCK_REMOVE_STALE] = "block_remove_stale",
[SCOUTFS_TRIGGER_SRCH_COMPACT_LOGS_PAD_SAFE] = "srch_compact_logs_pad_safe",
[SCOUTFS_TRIGGER_SRCH_FORCE_LOG_ROTATE] = "srch_force_log_rotate",
[SCOUTFS_TRIGGER_SRCH_MERGE_STOP_SAFE] = "srch_merge_stop_safe",
[SCOUTFS_TRIGGER_STATFS_LOCK_PURGE] = "statfs_lock_purge",
};
@@ -90,13 +93,9 @@ int scoutfs_setup_triggers(struct super_block *sb)
goto out;
}
for (i = 0; i < ARRAY_SIZE(triggers->atomics); i++) {
if (!debugfs_create_atomic_t(names[i], 0644, triggers->dir,
&triggers->atomics[i])) {
ret = -ENOMEM;
goto out;
}
}
for (i = 0; i < ARRAY_SIZE(triggers->atomics); i++)
debugfs_create_atomic_t(names[i], 0644, triggers->dir,
&triggers->atomics[i]);
ret = 0;
out:

View File

@@ -3,6 +3,9 @@
enum scoutfs_trigger {
SCOUTFS_TRIGGER_BLOCK_REMOVE_STALE,
SCOUTFS_TRIGGER_SRCH_COMPACT_LOGS_PAD_SAFE,
SCOUTFS_TRIGGER_SRCH_FORCE_LOG_ROTATE,
SCOUTFS_TRIGGER_SRCH_MERGE_STOP_SAFE,
SCOUTFS_TRIGGER_STATFS_LOCK_PURGE,
SCOUTFS_TRIGGER_NR,
};

View File

@@ -183,6 +183,13 @@ static void *scoutfs_tseq_seq_next(struct seq_file *m, void *v, loff_t *pos)
ent = tseq_rb_next(ent);
if (ent)
*pos = ent->pos;
else
/*
* once we hit the end, *pos is never used, but it has to
* be updated to avoid an error in bpf_seq_read()
*/
(*pos)++;
return ent;
}

1160
kmod/src/wkic.c Normal file

File diff suppressed because it is too large Load Diff

19
kmod/src/wkic.h Normal file
View File

@@ -0,0 +1,19 @@
#ifndef _SCOUTFS_WKIC_H_
#define _SCOUTFS_WKIC_H_
#include "format.h"
typedef int (*wkic_iter_cb_t)(struct scoutfs_key *key, void *val, unsigned int val_len,
void *cb_arg);
int scoutfs_wkic_iterate(struct super_block *sb, struct scoutfs_key *key, struct scoutfs_key *last,
struct scoutfs_key *range_start, struct scoutfs_key *range_end,
wkic_iter_cb_t cb, void *cb_arg);
int scoutfs_wkic_iterate_stable(struct super_block *sb, struct scoutfs_key *key,
struct scoutfs_key *last, struct scoutfs_key *range_start,
struct scoutfs_key *range_end, wkic_iter_cb_t cb, void *cb_arg);
int scoutfs_wkic_setup(struct super_block *sb);
void scoutfs_wkic_destroy(struct super_block *sb);
#endif

View File

@@ -81,7 +81,20 @@ static void init_xattr_key(struct scoutfs_key *key, u64 ino, u32 name_hash,
#define SCOUTFS_XATTR_PREFIX "scoutfs."
#define SCOUTFS_XATTR_PREFIX_LEN (sizeof(SCOUTFS_XATTR_PREFIX) - 1)
/*
* We could have hidden the logic that needs this in a user-prefix
* specific .set handler, but I wanted to make sure that we always
* applied that logic from any call chains to _xattr_set. The
* additional strcmp isn't so expensive given all the rest of the work
* we're doing in here.
*/
static inline bool is_user(const char *name)
{
return !strncmp(name, XATTR_USER_PREFIX, XATTR_USER_PREFIX_LEN);
}
#define HIDE_TAG "hide."
#define INDX_TAG "indx."
#define SRCH_TAG "srch."
#define TOTL_TAG "totl."
#define TAG_LEN (sizeof(HIDE_TAG) - 1)
@@ -103,6 +116,9 @@ int scoutfs_xattr_parse_tags(const char *name, unsigned int name_len,
if (!strncmp(name, HIDE_TAG, TAG_LEN)) {
if (++tgs->hide == 0)
return -EINVAL;
} else if (!strncmp(name, INDX_TAG, TAG_LEN)) {
if (++tgs->indx == 0)
return -EINVAL;
} else if (!strncmp(name, SRCH_TAG, TAG_LEN)) {
if (++tgs->srch == 0)
return -EINVAL;
@@ -540,47 +556,57 @@ static int parse_totl_u64(const char *s, int len, u64 *res)
}
/*
* non-destructive relatively quick parse of the last 3 dotted u64s that
* make up the name of the xattr total. -EINVAL is returned if there
* are anything but 3 valid u64 encodings between single dots at the end
* of the name.
* non-destructive relatively quick parse of final dotted u64s in an
* xattr name. If the required number of values are found then we
* return the number of bytes in the name that are not the final dotted
* u64s with their dots. -EINVAL is returned if we didn't find the
* required number of values.
*/
static int parse_totl_key(struct scoutfs_key *key, const char *name, int name_len)
static int parse_dotted_u64s(u64 *u64s, int nr, const char *name, int name_len)
{
u64 tot_name[3];
int end = name_len;
int nr = 0;
int len;
int ret;
int i;
int u;
/* parse name elements in reserve order from end of xattr name string */
for (i = name_len - 1; i >= 0 && nr < ARRAY_SIZE(tot_name); i--) {
for (u = nr - 1, i = name_len - 1; u >= 0 && i >= 0; i--) {
if (name[i] != '.')
continue;
len = end - (i + 1);
ret = parse_totl_u64(&name[i + 1], len, &tot_name[nr]);
ret = parse_totl_u64(&name[i + 1], len, &u64s[u]);
if (ret < 0)
goto out;
end = i;
nr++;
u--;
}
if (nr == ARRAY_SIZE(tot_name)) {
/* swap to account for parsing in reverse */
swap(tot_name[0], tot_name[2]);
scoutfs_xattr_init_totl_key(key, tot_name);
ret = 0;
} else {
if (u == -1)
ret = end;
else
ret = -EINVAL;
}
out:
return ret;
}
static int parse_totl_key(struct scoutfs_key *key, const char *name, int name_len)
{
u64 u64s[3];
int ret;
ret = parse_dotted_u64s(u64s, ARRAY_SIZE(u64s), name, name_len);
if (ret >= 0) {
scoutfs_xattr_init_totl_key(key, u64s);
ret = 0;
}
return ret;
}
static int apply_totl_delta(struct super_block *sb, struct scoutfs_key *key,
struct scoutfs_xattr_totl_val *tval, struct scoutfs_lock *lock)
{
@@ -607,6 +633,72 @@ int scoutfs_xattr_combine_totl(void *dst, int dst_len, void *src, int src_len)
return SCOUTFS_DELTA_COMBINED;
}
void scoutfs_xattr_indx_get_range(struct scoutfs_key *start, struct scoutfs_key *end)
{
scoutfs_key_set_zeros(start);
start->sk_zone = SCOUTFS_XATTR_INDX_ZONE;
scoutfs_key_set_ones(end);
end->sk_zone = SCOUTFS_XATTR_INDX_ZONE;
}
/*
* .indx. keys are a bit funny because we're iterating over index keys
* by major:minor:inode:xattr_id. That doesn't map nicely to the
* comparison precedence of the key fields. We have to mess around a
* little bit to get the major into the most significant key bits and
* the low bits of xattr id into the least significant key bits.
*/
void scoutfs_xattr_init_indx_key(struct scoutfs_key *key, u8 major, u64 minor, u64 ino, u64 xid)
{
scoutfs_key_set_zeros(key);
key->sk_zone = SCOUTFS_XATTR_INDX_ZONE;
key->_sk_first = cpu_to_le64(((u64)major << 56) | (minor >> 8));
key->_sk_second = cpu_to_le64((minor << 56) | (ino >> 8));
key->_sk_third = cpu_to_le64((ino << 56) | (xid >> 8));
key->_sk_fourth = xid & 0xff;
}
void scoutfs_xattr_get_indx_key(struct scoutfs_key *key, u8 *major, u64 *minor, u64 *ino, u64 *xid)
{
*major = le64_to_cpu(key->_sk_first) >> 56;
*minor = (le64_to_cpu(key->_sk_first) << 8) | (le64_to_cpu(key->_sk_second) >> 56);
*ino = (le64_to_cpu(key->_sk_second) << 8) | (le64_to_cpu(key->_sk_third) >> 56);
*xid = (le64_to_cpu(key->_sk_third) << 8) | key->_sk_fourth;
}
void scoutfs_xattr_set_indx_key_xid(struct scoutfs_key *key, u64 xid)
{
u8 major;
u64 minor;
u64 ino;
u64 dummy;
scoutfs_xattr_get_indx_key(key, &major, &minor, &ino, &dummy);
scoutfs_xattr_init_indx_key(key, major, minor, ino, xid);
}
/*
* This initial parsing of the name doesn't yet have access to an xattr
* id to put in the key. That's added later as the existing xattr is
* found or a new xattr's id is allocated.
*/
static int parse_indx_key(struct scoutfs_key *key, const char *name, int name_len, u64 ino)
{
u64 u64s[2];
int ret;
ret = parse_dotted_u64s(u64s, ARRAY_SIZE(u64s), name, name_len);
if (ret < 0)
return ret;
if (u64s[0] > U8_MAX)
return -EINVAL;
scoutfs_xattr_init_indx_key(key, u64s[0], u64s[1], ino, 0);
return 0;
}
/*
* The confusing swiss army knife of creating, modifying, and deleting
* xattrs.
@@ -627,7 +719,7 @@ int scoutfs_xattr_combine_totl(void *dst, int dst_len, void *src, int src_len)
int scoutfs_xattr_set_locked(struct inode *inode, const char *name, size_t name_len,
const void *value, size_t size, int flags,
const struct scoutfs_xattr_prefix_tags *tgs,
struct scoutfs_lock *lck, struct scoutfs_lock *totl_lock,
struct scoutfs_lock *lck, struct scoutfs_lock *tag_lock,
struct list_head *ind_locks)
{
struct scoutfs_inode_info *si = SCOUTFS_I(inode);
@@ -635,10 +727,11 @@ int scoutfs_xattr_set_locked(struct inode *inode, const char *name, size_t name_
const u64 ino = scoutfs_ino(inode);
struct scoutfs_xattr_totl_val tval = {0,};
struct scoutfs_xattr *xat = NULL;
struct scoutfs_key totl_key;
struct scoutfs_key tag_key;
struct scoutfs_key key;
bool undo_srch = false;
bool undo_totl = false;
bool undo_indx = false;
u8 found_parts;
unsigned int xat_bytes_totl;
unsigned int xat_bytes;
@@ -651,7 +744,8 @@ int scoutfs_xattr_set_locked(struct inode *inode, const char *name, size_t name_
trace_scoutfs_xattr_set(sb, name_len, value, size, flags);
if (WARN_ON_ONCE(tgs->totl && !totl_lock))
if (WARN_ON_ONCE(tgs->totl && tgs->indx) ||
WARN_ON_ONCE((tgs->totl | tgs->indx) && !tag_lock))
return -EINVAL;
/* mirror the syscall's errors for large names and values */
@@ -664,10 +758,22 @@ int scoutfs_xattr_set_locked(struct inode *inode, const char *name, size_t name_
(flags & ~(XATTR_CREATE | XATTR_REPLACE)))
return -EINVAL;
if ((tgs->hide | tgs->srch | tgs->totl) && !capable(CAP_SYS_ADMIN))
if ((tgs->hide | tgs->indx | tgs->srch | tgs->totl) && !capable(CAP_SYS_ADMIN))
return -EPERM;
if (tgs->totl && ((ret = parse_totl_key(&totl_key, name, name_len)) != 0))
if (tgs->totl && ((ret = parse_totl_key(&tag_key, name, name_len)) != 0))
return ret;
if (tgs->indx &&
(ret = scoutfs_fmt_vers_unsupported(sb, SCOUTFS_FORMAT_VERSION_FEAT_INDX_TAG)))
return ret;
if (tgs->indx && ((ret = parse_indx_key(&tag_key, name, name_len, ino)) != 0))
return ret;
/* retention blocks user. xattr modification, all else allowed */
ret = scoutfs_inode_check_retention(inode);
if (ret < 0 && is_user(name))
return ret;
/* allocate enough to always read an existing xattr's totl */
@@ -708,6 +814,12 @@ int scoutfs_xattr_set_locked(struct inode *inode, const char *name, size_t name_
/* found fields in key will also be used */
found_parts = ret >= 0 ? xattr_nr_parts(xat) : 0;
/* use existing xattr's id or allocate new when creating */
if (found_parts)
id = le64_to_cpu(key.skx_id);
else if (value)
id = si->next_xattr_id++;
if (found_parts && tgs->totl) {
/* parse old totl value before we clobber xat buf */
val_len = ret - offsetof(struct scoutfs_xattr, name[xat->name_len]);
@@ -718,12 +830,25 @@ int scoutfs_xattr_set_locked(struct inode *inode, const char *name, size_t name_
le64_add_cpu(&tval.total, -total);
}
/*
* indx xattrs don't have a value. After returning an error for
* non-zero val length or short circuiting modifying with the
* same 0 length, all we're left with is creating or deleting
* the xattr.
*/
if (tgs->indx) {
if (size != 0) {
ret = -EINVAL;
goto out;
}
if (found_parts && value) {
ret = 0;
goto out;
}
}
/* prepare the xattr header, name, and start of value in first item */
if (value) {
if (found_parts)
id = le64_to_cpu(key.skx_id);
else
id = si->next_xattr_id++;
xat->name_len = name_len;
xat->val_len = cpu_to_le16(size);
memset(xat->__pad, 0, sizeof(xat->__pad));
@@ -741,9 +866,18 @@ int scoutfs_xattr_set_locked(struct inode *inode, const char *name, size_t name_
le64_add_cpu(&tval.total, total);
}
if (tgs->indx) {
scoutfs_xattr_set_indx_key_xid(&tag_key, id);
if (value)
ret = scoutfs_item_create_force(sb, &tag_key, NULL, 0, tag_lock, NULL);
else
ret = scoutfs_item_delete_force(sb, &tag_key, tag_lock, NULL);
if (ret < 0)
goto out;
undo_indx = true;
}
if (tgs->srch && !(found_parts && value)) {
if (found_parts)
id = le64_to_cpu(key.skx_id);
hash = scoutfs_hash64(name, name_len);
ret = scoutfs_forest_srch_add(sb, hash, ino, id);
if (ret < 0)
@@ -752,7 +886,7 @@ int scoutfs_xattr_set_locked(struct inode *inode, const char *name, size_t name_
}
if (tgs->totl) {
ret = apply_totl_delta(sb, &totl_key, &tval, totl_lock);
ret = apply_totl_delta(sb, &tag_key, &tval, tag_lock);
if (ret < 0)
goto out;
undo_totl = true;
@@ -777,6 +911,13 @@ int scoutfs_xattr_set_locked(struct inode *inode, const char *name, size_t name_
ret = 0;
out:
if (ret < 0 && undo_indx) {
if (value)
err = scoutfs_item_delete_force(sb, &tag_key, tag_lock, NULL);
else
err = scoutfs_item_create_force(sb, &tag_key, NULL, 0, tag_lock, NULL);
BUG_ON(err); /* inconsistent */
}
if (ret < 0 && undo_srch) {
err = scoutfs_forest_srch_add(sb, hash, ino, id);
BUG_ON(err);
@@ -785,7 +926,7 @@ out:
/* _delta() on dirty items shouldn't fail */
tval.total = cpu_to_le64(-le64_to_cpu(tval.total));
tval.count = cpu_to_le64(-le64_to_cpu(tval.count));
err = apply_totl_delta(sb, &totl_key, &tval, totl_lock);
err = apply_totl_delta(sb, &tag_key, &tval, tag_lock);
BUG_ON(err);
}
@@ -801,7 +942,7 @@ static int scoutfs_xattr_set(struct dentry *dentry, const char *name, const void
struct inode *inode = dentry->d_inode;
struct super_block *sb = inode->i_sb;
struct scoutfs_xattr_prefix_tags tgs;
struct scoutfs_lock *totl_lock = NULL;
struct scoutfs_lock *tag_lock = NULL;
struct scoutfs_lock *lck = NULL;
size_t name_len = strlen(name);
LIST_HEAD(ind_locks);
@@ -816,8 +957,11 @@ static int scoutfs_xattr_set(struct dentry *dentry, const char *name, const void
if (ret)
goto unlock;
if (tgs.totl) {
ret = scoutfs_lock_xattr_totl(sb, SCOUTFS_LOCK_WRITE_ONLY, 0, &totl_lock);
if (tgs.totl || tgs.indx) {
if (tgs.totl)
ret = scoutfs_lock_xattr_totl(sb, SCOUTFS_LOCK_WRITE_ONLY, 0, &tag_lock);
else
ret = scoutfs_lock_xattr_indx(sb, SCOUTFS_LOCK_WRITE_ONLY, 0, &tag_lock);
if (ret)
goto unlock;
}
@@ -836,7 +980,7 @@ retry:
goto release;
ret = scoutfs_xattr_set_locked(dentry->d_inode, name, name_len, value, size, flags, &tgs,
lck, totl_lock, &ind_locks);
lck, tag_lock, &ind_locks);
if (ret == 0)
scoutfs_update_inode_item(inode, lck, &ind_locks);
@@ -845,7 +989,7 @@ release:
scoutfs_inode_index_unlock(sb, &ind_locks);
unlock:
scoutfs_unlock(sb, lck, SCOUTFS_LOCK_WRITE);
scoutfs_unlock(sb, totl_lock, SCOUTFS_LOCK_WRITE_ONLY);
scoutfs_unlock(sb, tag_lock, SCOUTFS_LOCK_WRITE_ONLY);
return ret;
}
@@ -882,7 +1026,9 @@ static int scoutfs_xattr_get_handler
static int scoutfs_xattr_set_handler
#ifdef KC_XATTR_STRUCT_XATTR_HANDLER
(const struct xattr_handler *handler, struct dentry *dentry,
(const struct xattr_handler *handler,
KC_VFS_NS_DEF
struct dentry *dentry,
struct inode *inode, const char *name, const void *value,
size_t size, int flags)
{
@@ -1055,14 +1201,15 @@ int scoutfs_xattr_drop(struct super_block *sb, u64 ino,
{
struct scoutfs_xattr_prefix_tags tgs;
struct scoutfs_xattr *xat = NULL;
struct scoutfs_lock *totl_lock = NULL;
struct scoutfs_lock *tag_lock = NULL;
struct scoutfs_xattr_totl_val tval;
struct scoutfs_key totl_key;
struct scoutfs_key tag_key;
struct scoutfs_key last;
struct scoutfs_key key;
bool release = false;
unsigned int bytes;
unsigned int val_len;
u8 locked_zone = 0;
void *value;
u64 total;
u64 hash;
@@ -1108,16 +1255,32 @@ int scoutfs_xattr_drop(struct super_block *sb, u64 ino,
goto out;
}
ret = parse_totl_key(&totl_key, xat->name, xat->name_len) ?:
ret = parse_totl_key(&tag_key, xat->name, xat->name_len) ?:
parse_totl_u64(value, val_len, &total);
if (ret < 0)
break;
}
if (tgs.totl && totl_lock == NULL) {
ret = scoutfs_lock_xattr_totl(sb, SCOUTFS_LOCK_WRITE_ONLY, 0, &totl_lock);
if (tgs.indx) {
ret = parse_indx_key(&tag_key, xat->name, xat->name_len, ino);
if (ret < 0)
goto out;
}
if ((tgs.totl || tgs.indx) && locked_zone != tag_key.sk_zone) {
if (tag_lock) {
scoutfs_unlock(sb, tag_lock, SCOUTFS_LOCK_WRITE_ONLY);
tag_lock = NULL;
}
if (tgs.totl)
ret = scoutfs_lock_xattr_totl(sb, SCOUTFS_LOCK_WRITE_ONLY, 0,
&tag_lock);
else
ret = scoutfs_lock_xattr_indx(sb, SCOUTFS_LOCK_WRITE_ONLY, 0,
&tag_lock);
if (ret < 0)
break;
locked_zone = tag_key.sk_zone;
}
ret = scoutfs_hold_trans(sb, false);
@@ -1140,7 +1303,13 @@ int scoutfs_xattr_drop(struct super_block *sb, u64 ino,
if (tgs.totl) {
tval.total = cpu_to_le64(-total);
tval.count = cpu_to_le64(-1LL);
ret = apply_totl_delta(sb, &totl_key, &tval, totl_lock);
ret = apply_totl_delta(sb, &tag_key, &tval, tag_lock);
if (ret < 0)
break;
}
if (tgs.indx) {
ret = scoutfs_item_delete_force(sb, &tag_key, tag_lock, NULL);
if (ret < 0)
break;
}
@@ -1153,7 +1322,7 @@ int scoutfs_xattr_drop(struct super_block *sb, u64 ino,
if (release)
scoutfs_release_trans(sb);
scoutfs_unlock(sb, totl_lock, SCOUTFS_LOCK_WRITE_ONLY);
scoutfs_unlock(sb, tag_lock, SCOUTFS_LOCK_WRITE_ONLY);
kfree(xat);
out:
return ret;

View File

@@ -3,6 +3,7 @@
struct scoutfs_xattr_prefix_tags {
unsigned long hide:1,
indx:1,
srch:1,
totl:1;
};
@@ -30,4 +31,9 @@ int scoutfs_xattr_parse_tags(const char *name, unsigned int name_len,
void scoutfs_xattr_init_totl_key(struct scoutfs_key *key, u64 *name);
int scoutfs_xattr_combine_totl(void *dst, int dst_len, void *src, int src_len);
void scoutfs_xattr_indx_get_range(struct scoutfs_key *start, struct scoutfs_key *end);
void scoutfs_xattr_init_indx_key(struct scoutfs_key *key, u8 major, u64 minor, u64 ino, u64 xid);
void scoutfs_xattr_get_indx_key(struct scoutfs_key *key, u8 *major, u64 *minor, u64 *ino, u64 *xid);
void scoutfs_xattr_set_indx_key_xid(struct scoutfs_key *key, u64 xid);
#endif

View File

@@ -0,0 +1,244 @@
package main
import (
"flag"
"fmt"
"log"
"os"
"path/filepath"
"sync"
"syscall"
"restore/pkg/restore"
)
type options struct {
metaPath string
sourceDir string
numWorkers int
}
// hardlinkTracker keeps track of inodes we've already processed
type hardlinkTracker struct {
sync.Mutex
seen map[uint64]bool
}
func newHardlinkTracker() *hardlinkTracker {
return &hardlinkTracker{
seen: make(map[uint64]bool),
}
}
func (h *hardlinkTracker) isNewInode(ino uint64, nlink bool) bool {
if !nlink {
return true
}
h.Lock()
defer h.Unlock()
if _, exists := h.seen[ino]; exists {
return false
}
h.seen[ino] = true
return true
}
// getFileInfo extracts file information from os.FileInfo
func getFileInfo(info os.FileInfo) restore.FileInfo {
stat := info.Sys().(*syscall.Stat_t)
// Use target inode number if specified, otherwise use actual inode number
ino := uint64(stat.Ino)
return restore.FileInfo{
Ino: ino,
Mode: uint32(stat.Mode),
Uid: uint32(stat.Uid),
Gid: uint32(stat.Gid),
Size: uint64(stat.Size),
Rdev: uint64(stat.Rdev),
AtimeSec: stat.Atim.Sec,
AtimeNsec: stat.Atim.Nsec,
MtimeSec: stat.Mtim.Sec,
MtimeNsec: stat.Mtim.Nsec,
CtimeSec: stat.Ctim.Sec,
CtimeNsec: stat.Ctim.Nsec,
IsDir: info.IsDir(),
IsRegular: stat.Mode&syscall.S_IFMT == syscall.S_IFREG,
}
}
// getXAttrs gets extended attributes for a file/directory
func getXAttrs(path string) ([]restore.XAttr, error) {
size, err := syscall.Listxattr(path, nil)
if err != nil || size == 0 {
return nil, err
}
buf := make([]byte, size)
size, err = syscall.Listxattr(path, buf)
if err != nil {
return nil, err
}
var xattrs []restore.XAttr
start := 0
for i := 0; i < size; i++ {
if buf[i] == 0 {
name := string(buf[start:i])
value, err := syscall.Getxattr(path, name, nil)
if err != nil {
continue
}
valueBuf := make([]byte, value)
_, err = syscall.Getxattr(path, name, valueBuf)
if err != nil {
continue
}
xattrs = append(xattrs, restore.XAttr{
Name: name,
Value: valueBuf,
})
start = i + 1
}
}
return xattrs, nil
}
func restorePath(writer *restore.WorkerWriter, hlTracker *hardlinkTracker, path string, parentIno uint64) error {
entries, err := os.ReadDir(path)
if err != nil {
return fmt.Errorf("failed to read directory: %v", err)
}
log.Printf("Restoring path: %s", path)
var subdirs int
var nameBytes int
for pos, entry := range entries {
if entry.Name() == "." || entry.Name() == ".." {
continue
}
info, err := entry.Info()
if err != nil {
return fmt.Errorf("failed to get entry info: %v", err)
}
stat, ok := info.Sys().(*syscall.Stat_t)
if !ok {
return fmt.Errorf("failed to get stat_t")
}
nameBytes += len(entry.Name())
fullPath := filepath.Join(path, entry.Name())
// Recurse into directories
if info.IsDir() {
subdirs++
if err := restorePath(writer, hlTracker, fullPath, uint64(stat.Ino)); err != nil {
return err
}
}
err = writer.CreateEntry(parentIno, uint64(pos), uint64(stat.Ino), uint32(info.Mode()), entry.Name())
if err != nil {
return fmt.Errorf("failed to create entry: %v", err)
}
// Handle inode
isHardlink := stat.Nlink > 1
if !info.IsDir() && hlTracker.isNewInode(uint64(stat.Ino), isHardlink) {
fileInfo := getFileInfo(info)
err = writer.CreateInode(fileInfo)
if err != nil {
return fmt.Errorf("failed to create inode: %v", err)
}
// Handle xattrs
xattrs, err := getXAttrs(fullPath)
if err == nil {
for pos, xattr := range xattrs {
err = writer.CreateXAttr(uint64(stat.Ino), uint64(pos), xattr)
if err != nil {
return fmt.Errorf("failed to create xattr: %v", err)
}
}
}
}
}
// Get directory info
dirInfo, err := os.Stat(path)
if err != nil {
return fmt.Errorf("failed to stat directory: %v", err)
}
// Create directory inode
dirFileInfo := getFileInfo(dirInfo)
dirFileInfo.NrSubdirs = uint64(subdirs)
dirFileInfo.NameBytes = uint64(nameBytes)
return writer.CreateInode(dirFileInfo)
}
func main() {
opts := options{}
flag.StringVar(&opts.metaPath, "m", "", "path to metadata device")
flag.StringVar(&opts.sourceDir, "s", "", "path to source directory")
flag.IntVar(&opts.numWorkers, "w", 4, "number of worker threads")
flag.Parse()
if opts.metaPath == "" || opts.sourceDir == "" {
flag.Usage()
os.Exit(1)
}
// Create master and worker writers
master, workers, err := restore.NewWriters(opts.metaPath, opts.numWorkers)
if err != nil {
fmt.Fprintf(os.Stderr, "Failed to create writers: %v\n", err)
os.Exit(1)
}
defer master.Destroy()
// Create hardlink tracker
hlTracker := newHardlinkTracker()
// Start workers
var wg sync.WaitGroup
for i, worker := range workers {
wg.Add(1)
go func(w *restore.WorkerWriter, workerNum int) {
defer wg.Done()
// Each worker processes a subset of the directory tree
if err := restorePath(w, hlTracker, opts.sourceDir, 1); err != nil {
fmt.Fprintf(os.Stderr, "Worker %d failed: %v\n", workerNum, err)
os.Exit(1)
}
// Create root inode for source directory
rootInfo, err := os.Stat(opts.sourceDir)
if err != nil {
fmt.Fprintf(os.Stderr, "Failed to stat source directory: %v\n", err)
os.Exit(1)
}
w.CreateInode(getFileInfo(rootInfo))
err = w.Destroy()
if err != nil {
fmt.Fprintf(os.Stderr, "Failed to destroy worker: %v\n", err)
os.Exit(1)
}
}(worker, i)
}
// Wait for all workers to complete
wg.Wait()
fmt.Println("Restore completed successfully")
}

3
restore/go.mod Normal file
View File

@@ -0,0 +1,3 @@
module restore
go 1.21.11

View File

@@ -0,0 +1,472 @@
package restore
/*
#cgo CFLAGS: -I${SRCDIR}/../../../utils/src -I${SRCDIR}/../../../kmod/src
#cgo LDFLAGS: -L${SRCDIR}/../../../utils/src -l:scoutfs_parallel_restore.a -lm
#include <stdlib.h>
#include <linux/types.h>
#include <stdbool.h>
#include <math.h>
#include "sparse.h"
#include "util.h"
#include "format.h"
#include "parallel_restore.h"
// If there are any type conflicts, you might need to add:
// #include "kernel_types.h"
*/
import "C"
import (
"errors"
"fmt"
"sync"
"syscall"
"unsafe"
)
const batchSize = 1000
const bufSize = 2 * 1024 * 1024
type WorkerWriter struct {
writer *C.struct_scoutfs_parallel_restore_writer
progressCh chan *ScoutfsParallelWriterProgress
fileCreated int64
devFd int
buf unsafe.Pointer
wg *sync.WaitGroup
}
type MasterWriter struct {
writer *C.struct_scoutfs_parallel_restore_writer
progressCh chan *ScoutfsParallelWriterProgress
workers []*WorkerWriter
wg sync.WaitGroup
slice *C.struct_scoutfs_parallel_restore_slice // Add slice field
progressWg sync.WaitGroup
devFd int
super *C.struct_scoutfs_super_block
}
type ScoutfsParallelWriterProgress struct {
Progress *C.struct_scoutfs_parallel_restore_progress
Slice *C.struct_scoutfs_parallel_restore_slice
}
func (m *MasterWriter) aggregateProgress() {
defer m.progressWg.Done()
for progress := range m.progressCh {
ret := C.scoutfs_parallel_restore_add_progress(m.writer, progress.Progress)
if ret != 0 {
// Handle error appropriately, e.g., log it
fmt.Printf("Failed to add progress, error code: %d\n", ret)
}
if progress.Slice != nil {
ret = C.scoutfs_parallel_restore_add_slice(m.writer, progress.Slice)
C.free(unsafe.Pointer(progress.Slice))
if ret != 0 {
// Handle error appropriately, e.g., log it
fmt.Printf("Failed to add slice, error code: %d\n", ret)
}
}
// Free the C-allocated progress structures
C.free(unsafe.Pointer(progress.Progress))
}
}
func (m *MasterWriter) Destroy() {
m.wg.Wait()
close(m.progressCh)
m.progressWg.Wait()
if m.slice != nil {
C.free(unsafe.Pointer(m.slice)) // Free slice on error
}
if m.super != nil {
C.free(unsafe.Pointer(m.super)) // Free superblock on error
}
if m.devFd != 0 {
syscall.Close(m.devFd)
}
// Destroy master writer
C.scoutfs_parallel_restore_destroy_writer(&m.writer)
}
func NewWriters(path string, numWriters int) (*MasterWriter, []*WorkerWriter, error) {
if numWriters <= 1 {
return nil, nil, errors.New("number of writers must be positive")
}
devFd, err := syscall.Open(path, syscall.O_DIRECT|syscall.O_RDWR|syscall.O_EXCL, 0)
if err != nil {
return nil, nil, fmt.Errorf("failed to open metadata device '%s': %v", path, err)
}
var masterWriter MasterWriter
masterWriter.progressCh = make(chan *ScoutfsParallelWriterProgress, numWriters*2)
masterWriter.workers = make([]*WorkerWriter, 0, numWriters-1)
masterWriter.devFd = devFd
var ret C.int
// Allocate aligned memory for superblock
var super unsafe.Pointer
ret = C.posix_memalign(&super, 4096, C.SCOUTFS_BLOCK_SM_SIZE)
if ret != 0 {
masterWriter.Destroy()
return nil, nil, fmt.Errorf("failed to allocate aligned memory for superblock: %d", ret)
}
masterWriter.super = (*C.struct_scoutfs_super_block)(super)
// Read the superblock from devFd
superOffset := C.SCOUTFS_SUPER_BLKNO << C.SCOUTFS_BLOCK_SM_SHIFT
count, err := syscall.Pread(devFd, (*[1 << 30]byte)(super)[:C.SCOUTFS_BLOCK_SM_SIZE], int64(superOffset))
if err != nil {
masterWriter.Destroy()
return nil, nil, fmt.Errorf("failed to read superblock: %v", err)
}
if count != int(C.SCOUTFS_BLOCK_SM_SIZE) {
masterWriter.Destroy()
return nil, nil, fmt.Errorf("failed to read superblock, bytes read: %d", count)
}
// Check if the superblock is valid.
if C.le64_to_cpu(masterWriter.super.flags)&C.SCOUTFS_FLAG_IS_META_BDEV == 0 {
masterWriter.Destroy()
return nil, nil, errors.New("superblock is not a metadata device")
}
// Create master writer
ret = C.scoutfs_parallel_restore_create_writer(&masterWriter.writer)
if ret != 0 {
masterWriter.Destroy()
return nil, nil, errors.New("failed to create master writer")
}
ret = C.scoutfs_parallel_restore_import_super(masterWriter.writer, masterWriter.super, C.int(devFd))
if ret != 0 {
masterWriter.Destroy()
return nil, nil, fmt.Errorf("failed to import superblock, error code: %d", ret)
}
// Initialize slices for each worker
masterWriter.slice = (*C.struct_scoutfs_parallel_restore_slice)(C.malloc(C.size_t(numWriters) *
C.size_t(unsafe.Sizeof(C.struct_scoutfs_parallel_restore_slice{}))))
if masterWriter.slice == nil {
masterWriter.Destroy()
return nil, nil, errors.New("failed to allocate slices")
}
ret = C.scoutfs_parallel_restore_init_slices(masterWriter.writer,
masterWriter.slice,
C.int(numWriters))
if ret != 0 {
masterWriter.Destroy()
return nil, nil, errors.New("failed to initialize slices")
}
ret = C.scoutfs_parallel_restore_add_slice(masterWriter.writer, masterWriter.slice)
if ret != 0 {
masterWriter.Destroy()
return nil, nil, errors.New("failed to add slice to master writer")
}
// Create worker writers
for i := 1; i < numWriters; i++ {
var bufPtr unsafe.Pointer
if ret := C.posix_memalign(&bufPtr, 4096, bufSize); ret != 0 {
masterWriter.Destroy()
return nil, nil, fmt.Errorf("failed to allocate aligned worker buffer: %d", ret)
}
worker := &WorkerWriter{
progressCh: masterWriter.progressCh,
buf: bufPtr,
wg: &masterWriter.wg,
}
ret = C.scoutfs_parallel_restore_create_writer(&worker.writer)
if ret != 0 {
masterWriter.Destroy()
return nil, nil, errors.New("failed to create worker writer")
}
masterWriter.wg.Add(1)
// Use each slice for the corresponding worker
slice := (*C.struct_scoutfs_parallel_restore_slice)(unsafe.Pointer(uintptr(unsafe.Pointer(masterWriter.slice)) +
uintptr(i)*unsafe.Sizeof(C.struct_scoutfs_parallel_restore_slice{})))
ret = C.scoutfs_parallel_restore_add_slice(worker.writer, slice)
if ret != 0 {
C.scoutfs_parallel_restore_destroy_writer(&worker.writer)
masterWriter.Destroy()
return nil, nil, errors.New("failed to add slice to worker writer")
}
masterWriter.workers = append(masterWriter.workers, worker)
}
masterWriter.progressWg.Add(1)
go masterWriter.aggregateProgress()
return &masterWriter, masterWriter.workers, nil
}
func (w *WorkerWriter) getProgress(withSlice bool) (*ScoutfsParallelWriterProgress, error) {
progress := (*C.struct_scoutfs_parallel_restore_progress)(
C.malloc(C.size_t(unsafe.Sizeof(C.struct_scoutfs_parallel_restore_progress{}))),
)
if progress == nil {
return nil, errors.New("failed to allocate memory for progress")
}
// Fetch the current progress from the C library
ret := C.scoutfs_parallel_restore_get_progress(w.writer, progress)
if ret != 0 {
C.free(unsafe.Pointer(progress))
return nil, fmt.Errorf("failed to get progress, error code: %d", ret)
}
var slice *C.struct_scoutfs_parallel_restore_slice
if withSlice {
slice = (*C.struct_scoutfs_parallel_restore_slice)(
C.malloc(C.size_t(unsafe.Sizeof(C.struct_scoutfs_parallel_restore_slice{}))),
)
if slice == nil {
C.free(unsafe.Pointer(progress))
return nil, errors.New("failed to allocate memory for slice")
}
// Optionally fetch the slice information
ret = C.scoutfs_parallel_restore_get_slice(w.writer, slice)
if ret != 0 {
C.free(unsafe.Pointer(progress))
C.free(unsafe.Pointer(slice))
return nil, fmt.Errorf("failed to get slice, error code: %d", ret)
}
}
return &ScoutfsParallelWriterProgress{
Progress: progress,
Slice: slice,
}, nil
}
// writeBuffer writes data from the buffer to the device file descriptor.
// It uses scoutfs_parallel_restore_write_buf to get data and pwrite to write it.
func (w *WorkerWriter) writeBuffer() (int64, error) {
var totalWritten int64
var count int64
var off int64
var ret C.int
// Allocate memory for off and count
offPtr := (*C.off_t)(unsafe.Pointer(&off))
countPtr := (*C.size_t)(unsafe.Pointer(&count))
for {
ret = C.scoutfs_parallel_restore_write_buf(w.writer, w.buf,
C.size_t(bufSize), offPtr, countPtr)
if ret != 0 {
return totalWritten, fmt.Errorf("failed to write buffer: error code %d", ret)
}
if count > 0 {
n, err := syscall.Pwrite(w.devFd, unsafe.Slice((*byte)(w.buf), count), off)
if err != nil {
return totalWritten, fmt.Errorf("pwrite failed: %v", err)
}
if n != int(count) {
return totalWritten, fmt.Errorf("pwrite wrote %d bytes; expected %d", n, count)
}
totalWritten += int64(n)
}
if count == 0 {
break
}
}
return totalWritten, nil
}
func (w *WorkerWriter) InsertEntry(entry *C.struct_scoutfs_parallel_restore_entry) error {
// Add the entry using the C library
ret := C.scoutfs_parallel_restore_add_entry(w.writer, entry)
if ret != 0 {
return fmt.Errorf("failed to add entry, error code: %d", ret)
}
// Increment the fileCreated counter
w.fileCreated++
if w.fileCreated >= batchSize {
_, err := w.writeBuffer()
if err != nil {
return fmt.Errorf("error writing buffers: %v", err)
}
// Allocate memory for progress and slice structures
progress, err := w.getProgress(false)
if err != nil {
return err
}
// Send the progress update to the shared progress channel
w.progressCh <- progress
// Reset the fileCreated counter
w.fileCreated = 0
}
return nil
}
func (w *WorkerWriter) InsertXattr(xattr *C.struct_scoutfs_parallel_restore_xattr) error {
ret := C.scoutfs_parallel_restore_add_xattr(w.writer, xattr)
if ret != 0 {
return fmt.Errorf("failed to add xattr, error code: %d", ret)
}
return nil
}
func (w *WorkerWriter) InsertInode(inode *C.struct_scoutfs_parallel_restore_inode) error {
ret := C.scoutfs_parallel_restore_add_inode(w.writer, inode)
if ret != 0 {
return fmt.Errorf("failed to add inode, error code: %d", ret)
}
return nil
}
// should only be called once
func (w *WorkerWriter) Destroy() error {
defer w.wg.Done()
// Send final progress if there are remaining entries
if w.fileCreated > 0 {
_, err := w.writeBuffer()
if err != nil {
return err
}
progress := &ScoutfsParallelWriterProgress{
Progress: (*C.struct_scoutfs_parallel_restore_progress)(C.malloc(C.size_t(unsafe.Sizeof(C.struct_scoutfs_parallel_restore_progress{})))),
Slice: (*C.struct_scoutfs_parallel_restore_slice)(C.malloc(C.size_t(unsafe.Sizeof(C.struct_scoutfs_parallel_restore_slice{})))),
}
w.progressCh <- progress
w.fileCreated = 0
}
if w.buf != nil {
C.free(w.buf)
w.buf = nil
}
C.scoutfs_parallel_restore_destroy_writer(&w.writer)
return nil
}
// Add these new types and functions to the existing restore.go file
type FileInfo struct {
Ino uint64
Mode uint32
Uid uint32
Gid uint32
Size uint64
Rdev uint64
AtimeSec int64
AtimeNsec int64
MtimeSec int64
MtimeNsec int64
CtimeSec int64
CtimeNsec int64
NrSubdirs uint64
NameBytes uint64
IsDir bool
IsRegular bool
}
type XAttr struct {
Name string
Value []byte
}
// CreateInode creates a C inode structure from FileInfo
func (w *WorkerWriter) CreateInode(info FileInfo) error {
inode := (*C.struct_scoutfs_parallel_restore_inode)(C.malloc(C.size_t(unsafe.Sizeof(C.struct_scoutfs_parallel_restore_inode{}))))
if inode == nil {
return fmt.Errorf("failed to allocate inode")
}
defer C.free(unsafe.Pointer(inode))
inode.ino = C.__u64(info.Ino)
inode.mode = C.__u32(info.Mode)
inode.uid = C.__u32(info.Uid)
inode.gid = C.__u32(info.Gid)
inode.size = C.__u64(info.Size)
inode.rdev = C.uint(info.Rdev)
inode.atime.tv_sec = C.__time_t(info.AtimeSec)
inode.atime.tv_nsec = C.long(info.AtimeNsec)
inode.mtime.tv_sec = C.__time_t(info.MtimeSec)
inode.mtime.tv_nsec = C.long(info.MtimeNsec)
inode.ctime.tv_sec = C.__time_t(info.CtimeSec)
inode.ctime.tv_nsec = C.long(info.CtimeNsec)
inode.crtime = inode.ctime
if info.IsRegular && info.Size > 0 {
inode.offline = C.bool(true)
}
if info.IsDir {
inode.nr_subdirs = C.__u64(info.NrSubdirs)
inode.total_entry_name_bytes = C.__u64(info.NameBytes)
}
return w.InsertInode(inode)
}
// CreateEntry creates a directory entry
func (w *WorkerWriter) CreateEntry(dirIno uint64, pos uint64, ino uint64, mode uint32, name string) error {
entryC := (*C.struct_scoutfs_parallel_restore_entry)(C.malloc(C.size_t(unsafe.Sizeof(C.struct_scoutfs_parallel_restore_entry{})) + C.size_t(len(name))))
if entryC == nil {
return fmt.Errorf("failed to allocate entry")
}
defer C.free(unsafe.Pointer(entryC))
entryC.dir_ino = C.__u64(dirIno)
entryC.pos = C.__u64(pos)
entryC.ino = C.__u64(ino)
entryC.mode = C.__u32(mode)
entryC.name_len = C.uint(len(name))
entryC.name = (*C.char)(C.malloc(C.size_t(len(name))))
if entryC.name == nil {
return fmt.Errorf("failed to allocate entry name")
}
defer C.free(unsafe.Pointer(entryC.name))
copy((*[1 << 30]byte)(unsafe.Pointer(entryC.name))[:len(name)], []byte(name))
return w.InsertEntry(entryC)
}
// CreateXAttr creates an extended attribute
func (w *WorkerWriter) CreateXAttr(ino uint64, pos uint64, xattr XAttr) error {
xattrC := (*C.struct_scoutfs_parallel_restore_xattr)(C.malloc(C.size_t(unsafe.Sizeof(C.struct_scoutfs_parallel_restore_xattr{})) + C.size_t(len(xattr.Name)) + C.size_t(len(xattr.Value))))
if xattrC == nil {
return fmt.Errorf("failed to allocate xattr")
}
defer C.free(unsafe.Pointer(xattrC))
xattrC.ino = C.__u64(ino)
xattrC.pos = C.__u64(pos)
xattrC.name_len = C.uint(len(xattr.Name))
xattrC.value_len = C.__u32(len(xattr.Value))
xattrC.name = (*C.char)(C.malloc(C.size_t(len(xattr.Name))))
if xattrC.name == nil {
return fmt.Errorf("failed to allocate xattr name")
}
defer C.free(unsafe.Pointer(xattrC.name))
copy((*[1 << 30]byte)(unsafe.Pointer(xattrC.name))[:len(xattr.Name)], []byte(xattr.Name))
xattrC.value = unsafe.Pointer(&xattr.Value[0])
return w.InsertXattr(xattrC)
}

View File

@@ -0,0 +1,10 @@
package restore
import "testing"
func TestNewWriters(t *testing.T) {
_, _, err := NewWriters("/tmp", 2)
if err != nil {
t.Fatalf("failed to create master writer: %v", err)
}
}

1
tests/.gitignore vendored
View File

@@ -9,3 +9,4 @@ src/find_xattrs
src/stage_tmpfile
src/create_xattr_loop
src/o_tmpfile_umask
src/o_tmpfile_linkat

1
tests/.xfstests-branch Normal file
View File

@@ -0,0 +1 @@
v2022.05.01-2-g787cd20

View File

@@ -12,7 +12,10 @@ BIN := src/createmany \
src/find_xattrs \
src/create_xattr_loop \
src/fragmented_data_extents \
src/o_tmpfile_umask
src/o_tmpfile_umask \
src/o_tmpfile_linkat \
src/parallel_restore \
src/restore_copy
DEPS := $(wildcard src/*.d)
@@ -22,8 +25,12 @@ ifneq ($(DEPS),)
-include $(DEPS)
endif
src/parallel_restore_cflags := ../utils/src/scoutfs_parallel_restore.a -lm
src/restore_copy_cflags := ../utils/src/scoutfs_parallel_restore.a -lm
$(BIN): %: %.c Makefile
gcc $(CFLAGS) -MD -MP -MF $*.d $< -o $@
gcc $(CFLAGS) -MD -MP -MF $*.d $< -o $@ $($(@)_cflags)
.PHONY: clean
clean:

View File

@@ -25,8 +25,9 @@ All options can be seen by running with -h.
This script is built to test multi-node systems on one host by using
different mounts of the same devices. The script creates a fake block
device in front of each fs block device for each mount that will be
tested. Currently it will create free loop devices and will mount on
/mnt/test.[0-9].
tested. It will create predictable device mapper devices and mounts
them on /mnt/test.N. These static device names and mount paths limit
the script to a single execution per host.
All tests will be run by default. Particular tests can be included or
excluded by providing test name regular expressions with the -I and -E
@@ -104,14 +105,15 @@ used during the test.
| Variable | Description | Origin | Example |
| ---------------- | ------------------- | --------------- | ----------------- |
| T\_MB[0-9] | per-mount meta bdev | created per run | /dev/loop0 |
| T\_DB[0-9] | per-mount data bdev | created per run | /dev/loop1 |
| T\_MB[0-9] | per-mount meta bdev | created per run | /dev/mapper/\_scoutfs\_test\_meta\_[0-9] |
| T\_DB[0-9] | per-mount data bdev | created per run | /dev/mapper/\_scoutfs\_test\_data\_[0-9] |
| T\_D[0-9] | per-mount test dir | made for test | /mnt/test.[0-9]/t |
| T\_META\_DEVICE | main FS meta bdev | -M | /dev/vda |
| T\_DATA\_DEVICE | main FS data bdev | -D | /dev/vdb |
| T\_EX\_META\_DEV | scratch meta bdev | -f | /dev/vdd |
| T\_EX\_DATA\_DEV | scratch meta bdev | -e | /dev/vdc |
| T\_M[0-9] | mount paths | mounted per run | /mnt/test.[0-9]/ |
| T\_MODULE | built kernel module | created per run | ../kmod/src/..ko |
| T\_NR\_MOUNTS | number of mounts | -n | 3 |
| T\_O[0-9] | mount options | created per run | -o server\_addr= |
| T\_QUORUM | quorum count | -q | 2 |

View File

@@ -7,8 +7,9 @@ t_status_msg()
export T_PASS_STATUS=100
export T_SKIP_STATUS=101
export T_FAIL_STATUS=102
export T_SKIP_PERMITTED_STATUS=103
export T_FIRST_STATUS="$T_PASS_STATUS"
export T_LAST_STATUS="$T_FAIL_STATUS"
export T_LAST_STATUS="$T_SKIP_PERMITTED_STATUS"
t_pass()
{
@@ -21,6 +22,17 @@ t_skip()
exit $T_SKIP_STATUS
}
#
# This exit code is *reserved* for tests that are up-front never going to work
# in certain cases. This should be expressly documented per-case and made
# abundantly clear before merging. The test itself should document its case.
#
t_skip_permitted()
{
t_status_msg "$@"
exit $T_SKIP_PERMITTED_STATUS
}
t_fail()
{
t_status_msg "$@"

View File

@@ -6,6 +6,61 @@ t_filter_fs()
-e 's@Device: [a-fA-F0-9]*h/[0-9]*d@Device: 0h/0d@g'
}
#
# We can hit a spurious kasan warning that was fixed upstream:
#
# e504e74cc3a2 x86/unwind/orc: Disable KASAN checking in the ORC unwinder, part 2
#
# KASAN can get mad when the unwinder doesn't find ORC metadata and
# wanders up without using frames and hits the KASAN stack red zones.
# We can ignore these messages.
#
# They're bracketed by:
# [ 2687.690127] ==================================================================
# [ 2687.691366] BUG: KASAN: stack-out-of-bounds in get_reg+0x1bc/0x230
# ...
# [ 2687.706220] ==================================================================
# [ 2687.707284] Disabling lock debugging due to kernel taint
#
# That final lock debugging message may not be included.
#
ignore_harmless_unwind_kasan_stack_oob()
{
awk '
BEGIN {
in_soob = 0
soob_nr = 0
}
( !in_soob && $0 ~ /==================================================================/ ) {
in_soob = 1
soob_nr = NR
saved = $0
}
( in_soob == 1 && NR == (soob_nr + 1) ) {
if (match($0, /KASAN: stack-out-of-bounds in get_reg/) != 0) {
in_soob = 2
} else {
in_soob = 0
print saved
}
saved=""
}
( in_soob == 2 && $0 ~ /==================================================================/ ) {
in_soob = 3
soob_nr = NR
}
( in_soob == 3 && NR > soob_nr && $0 !~ /Disabling lock debugging/ ) {
in_soob = 0
}
( !in_soob ) { print $0 }
END {
if (saved) {
print saved
}
}
'
}
#
# Filter out expected messages. Putting messages here implies that
# tests aren't relying on messages to discover failures.. they're
@@ -86,10 +141,25 @@ t_filter_dmesg()
re="$re|scoutfs .* critical transaction commit failure.*"
# change-devices causes loop device resizing
re="$re|loop: module loaded"
re="$re|loop[0-9].* detected capacity change from.*"
re="$re|dm-[0-9].* detected capacity change from.*"
# ignore systemd-journal rotating
re="$re|systemd-journald.*"
egrep -v "($re)"
# process accounting can be noisy
re="$re|Process accounting resumed.*"
# format vers back/compat tries bad mounts
re="$re|scoutfs .* error.*outside of supported version.*"
re="$re|scoutfs .* error.*could not get .*super.*"
# ignore "unsafe core pattern" when xfstests tries to disable cores"
re="$re|Unsafe core_pattern used with fs.suid_dumpable=2.*"
re="$re|Pipe handler or fully qualified core dump path required.*"
re="$re|Set kernel.core_pattern before fs.suid_dumpable.*"
egrep -v "($re)" | \
ignore_harmless_unwind_kasan_stack_oob
}

View File

@@ -29,13 +29,12 @@ t_mount_rid()
}
#
# Output the "f.$fsid.r.$rid" identifier string for the given mount
# number, 0 is used by default if none is specified.
# Output the "f.$fsid.r.$rid" identifier string for the given path
# in a mounted scoutfs volume.
#
t_ident()
t_ident_from_mnt()
{
local nr="${1:-0}"
local mnt="$(eval echo \$T_M$nr)"
local mnt="$1"
local fsid
local rid
@@ -45,6 +44,38 @@ t_ident()
echo "f.${fsid:0:6}.r.${rid:0:6}"
}
#
# Output the "f.$fsid.r.$rid" identifier string for the given mount
# number, 0 is used by default if none is specified.
#
t_ident()
{
local nr="${1:-0}"
local mnt="$(eval echo \$T_M$nr)"
t_ident_from_mnt "$mnt"
}
#
# Output the sysfs path for a path in a mounted fs.
#
t_sysfs_path_from_ident()
{
local ident="$1"
echo "/sys/fs/scoutfs/$ident"
}
#
# Output the sysfs path for a path in a mounted fs.
#
t_sysfs_path_from_mnt()
{
local mnt="$1"
t_sysfs_path_from_ident $(t_ident_from_mnt $mnt)
}
#
# Output the mount's sysfs path, defaulting to mount 0 if none is
# specified.
@@ -53,7 +84,7 @@ t_sysfs_path()
{
local nr="$1"
echo "/sys/fs/scoutfs/$(t_ident $nr)"
t_sysfs_path_from_ident $(t_ident $nr)
}
#
@@ -265,6 +296,15 @@ t_trigger_get() {
cat "$(t_trigger_path "$nr")/$which"
}
t_trigger_set() {
local which="$1"
local nr="$2"
local val="$3"
local path=$(t_trigger_path "$nr")
echo "$val" > "$path/$which"
}
t_trigger_show() {
local which="$1"
local string="$2"
@@ -276,9 +316,8 @@ t_trigger_show() {
t_trigger_arm_silent() {
local which="$1"
local nr="$2"
local path=$(t_trigger_path "$nr")
echo 1 > "$path/$which"
t_trigger_set "$which" "$nr" 1
}
t_trigger_arm() {

View File

@@ -0,0 +1,155 @@
== setup test directory
== getfacl
directory drwxr-xr-x 0 0 0 '.'
# file: .
# owner: root
# group: root
user::rwx
group::r-x
other::r-x
== basic non-acl access through permissions
directory drwxr-xr-x 0 44444 0 'dir-testuid'
touch: cannot touch 'dir-testuid/file-group-write': Permission denied
touch: cannot touch 'symlinkdir-testuid/symlink-file-group-write': Permission denied
regular empty file -rw-r--r-- 22222 44444 0 'dir-testuid/file-group-write'
regular empty file -rw-r--r-- 22222 44444 0 'symlinkdir-testuid/symlink-file-group-write'
== basic acl access
directory drwxr-xr-x 0 0 0 'dir-root'
touch: cannot touch 'dir-root/file-group-write': Permission denied
touch: cannot touch 'symlinkdir-root/file-group-write': Permission denied
# file: dir-root
# owner: root
# group: root
user::rwx
user:22222:rwx
group::r-x
mask::rwx
other::r-x
regular empty file -rw-r--r-- 22222 0 0 'dir-root/file-group-write'
regular empty file -rw-r--r-- 22222 0 0 'symlinkdir-root/file-group-write'
== directory exec
Success
Success
# file: dir-root
# owner: root
# group: root
user::rwx
user:22222:rw-
group::r-x
mask::rwx
other::r-x
Failed
Failed
# file: dir-root
# owner: root
# group: root
user::rwx
user:22222:rw-
group::r-x
group:44444:rwx
mask::rwx
other::r-x
Success
Success
== get/set attr
regular empty file -rw-r--r-- 0 0 0 'file-root'
setfattr: file-root: Permission denied
# file: file-root
# owner: root
# group: root
user::rw-
user:22222:rw-
group::r--
mask::rw-
other::r--
# file: file-root
user.test2="Success"
# file: file-root
# owner: root
# group: root
user::rw-
group::r--
mask::r--
other::r--
setfattr: file-root: Permission denied
# file: file-root
user.test2="Success"
# file: file-root
# owner: root
# group: root
user::rw-
group::r--
group:44444:rw-
mask::rw-
other::r--
# file: file-root
user.test2="Success"
user.test4="Success"
== inheritance / default acl
directory drwxr-xr-x 0 0 0 'dir-root2'
mkdir: cannot create directory 'dir-root2/dir': Permission denied
touch: cannot touch 'dir-root2/dir/file': No such file or directory
# file: dir-root2
# owner: root
# group: root
user::rwx
group::r-x
other::r-x
default:user::rwx
default:user:22222:rwx
default:group::r-x
default:mask::rwx
default:other::r-x
mkdir: cannot create directory 'dir-root2/dir': Permission denied
touch: cannot touch 'dir-root2/dir/file': No such file or directory
# file: dir-root2
# owner: root
# group: root
user::rwx
user:22222:rwx
group::r-x
mask::rwx
other::r-x
default:user::rwx
default:user:22222:rwx
default:group::r-x
default:mask::rwx
default:other::r-x
directory drwxrwxr-x 22222 0 4 'dir-root2/dir'
# file: dir-root2/dir
# owner: 22222
# group: root
user::rwx
user:22222:rwx
group::r-x
mask::rwx
other::r-x
default:user::rwx
default:user:22222:rwx
default:group::r-x
default:mask::rwx
default:other::r-x
regular empty file -rw-rw-r-- 22222 0 0 'dir-root2/dir/file'
# file: dir-root2/dir/file
# owner: 22222
# group: root
user::rw-
user:22222:rwx #effective:rw-
group::r-x #effective:r--
mask::rw-
other::r--
== cleanup

View File

@@ -56,3 +56,4 @@ mv: cannot move '/mnt/test/test/basic-posix-consistency/dir/c/clobber' to '/mnt/
== inode indexes match after removing and syncing
== concurrent creates make one file
one-file
== cleanup

View File

@@ -25,3 +25,4 @@ rc: 0
equal_prepared
large_prepared
resized larger test rc: 0
== cleanup

View File

@@ -1,3 +1,4 @@
== measure initial createmany
== measure initial createmany
== measure two concurrent createmany runs
== cleanup

View File

@@ -1,29 +1,29 @@
== initial writes smaller than prealloc grow to prealloc size
/mnt/test/test/data-prealloc/file-1: 7 extents found
/mnt/test/test/data-prealloc/file-2: 7 extents found
/mnt/test/test/data-prealloc/file-1: extents: 7
/mnt/test/test/data-prealloc/file-2: extents: 7
== larger files get full prealloc extents
/mnt/test/test/data-prealloc/file-1: 9 extents found
/mnt/test/test/data-prealloc/file-2: 9 extents found
/mnt/test/test/data-prealloc/file-1: extents: 9
/mnt/test/test/data-prealloc/file-2: extents: 9
== non-streaming writes with contig have per-block extents
/mnt/test/test/data-prealloc/file-1: 32 extents found
/mnt/test/test/data-prealloc/file-2: 32 extents found
/mnt/test/test/data-prealloc/file-1: extents: 32
/mnt/test/test/data-prealloc/file-2: extents: 32
== any writes to region prealloc get full extents
/mnt/test/test/data-prealloc/file-1: 4 extents found
/mnt/test/test/data-prealloc/file-2: 4 extents found
/mnt/test/test/data-prealloc/file-1: 4 extents found
/mnt/test/test/data-prealloc/file-2: 4 extents found
/mnt/test/test/data-prealloc/file-1: extents: 4
/mnt/test/test/data-prealloc/file-2: extents: 4
/mnt/test/test/data-prealloc/file-1: extents: 4
/mnt/test/test/data-prealloc/file-2: extents: 4
== streaming offline writes get full extents either way
/mnt/test/test/data-prealloc/file-1: 4 extents found
/mnt/test/test/data-prealloc/file-2: 4 extents found
/mnt/test/test/data-prealloc/file-1: 4 extents found
/mnt/test/test/data-prealloc/file-2: 4 extents found
/mnt/test/test/data-prealloc/file-1: extents: 4
/mnt/test/test/data-prealloc/file-2: extents: 4
/mnt/test/test/data-prealloc/file-1: extents: 4
/mnt/test/test/data-prealloc/file-2: extents: 4
== goofy preallocation amounts work
/mnt/test/test/data-prealloc/file-1: 5 extents found
/mnt/test/test/data-prealloc/file-2: 5 extents found
/mnt/test/test/data-prealloc/file-1: 5 extents found
/mnt/test/test/data-prealloc/file-2: 5 extents found
/mnt/test/test/data-prealloc/file-1: 3 extents found
/mnt/test/test/data-prealloc/file-2: 3 extents found
/mnt/test/test/data-prealloc/file-1: extents: 6
/mnt/test/test/data-prealloc/file-2: extents: 6
/mnt/test/test/data-prealloc/file-1: extents: 6
/mnt/test/test/data-prealloc/file-2: extents: 6
/mnt/test/test/data-prealloc/file-1: extents: 3
/mnt/test/test/data-prealloc/file-2: extents: 3
== block writes into region allocs hole
wrote blk 24
wrote blk 32

View File

@@ -0,0 +1,4 @@
== ensuring utils and module for old versions
== unmounting test fs and removing test module
== testing combinations of old and new format versions
== restoring test module and mount

View File

@@ -1,3 +1,4 @@
== setting longer hung task timeout
== creating fragmented extents
== unlink file with moved extents to free extents per block
== cleanup

View File

@@ -0,0 +1,28 @@
== simple mkfs/restore/mount
committed_seq 1120
total_meta_blocks 163840
total_data_blocks 15728640
1440 1440 57120
80 80 400
0: offset: 0 length: 1 flags: O.L
extents: 1
0: offset: 0 length: 1 flags: O.L
extents: 1
0: offset: 0 length: 1 flags: O.L
extents: 1
0: offset: 0 length: 1 flags: O.L
extents: 1
Type Size Total Used Free Use%
MetaData 64KB 163840 34722 129118 21
Data 4KB 15728640 64 15728576 0
7 13,L,- 15,L,- 17,L,- I 33 -
== just under ENOSPC
Type Size Total Used Free Use%
MetaData 64KB 163840 155666 8174 95
Data 4KB 15728640 64 15728576 0
== just over ENOSPC
== ENOSPC
== attempt to restore data device
== attempt format_v1 restore
== test if previously mounted
== cleanup

24
tests/golden/projects Normal file
View File

@@ -0,0 +1,24 @@
== default new files don't have project
0
== set new project on files and dirs
8675309
8675309
== non-root can see id
8675309
== can use IDs around long width limits
2147483647
2147483648
4294967295
9223372036854775807
9223372036854775808
18446744073709551615
== created files and dirs inherit project id
8675309
8675309
== inheritance continues
8675309
== clearing project id stops inheritance
0
0
== o_tmpfile creations inherit dir
8675309

41
tests/golden/quota Normal file
View File

@@ -0,0 +1,41 @@
== prepare dir with write perm for test ids
== test assumes starting with no rules, empty list
== add rule
7 13,L,- 15,L,- 17,L,- I 33 -
== list is empty again after delete
== can change limits without deleting
1 1,L,- 1,L,- 1,L,- I 100 -
1 1,L,- 1,L,- 1,L,- I 101 -
1 1,L,- 1,L,- 1,L,- I 99 -
== wipe and restore rules in bulk
7 15,L,- 0,L,- 0,L,- I 33 -
7 14,L,- 0,L,- 0,L,- I 33 -
7 13,L,- 0,L,- 0,L,- I 33 -
7 12,L,- 0,L,- 0,L,- I 33 -
7 11,L,- 0,L,- 0,L,- I 33 -
7 10,L,- 0,L,- 0,L,- I 33 -
7 15,L,- 0,L,- 0,L,- I 33 -
7 14,L,- 0,L,- 0,L,- I 33 -
7 13,L,- 0,L,- 0,L,- I 33 -
7 12,L,- 0,L,- 0,L,- I 33 -
7 11,L,- 0,L,- 0,L,- I 33 -
7 10,L,- 0,L,- 0,L,- I 33 -
== default rule prevents file creation
touch: cannot touch '/mnt/test/test/quota/dir/file': Disk quota exceeded
== decreasing totl allows file creation again
== attr selecting rules prevent creation
touch: cannot touch '/mnt/test/test/quota/dir/file': Disk quota exceeded
touch: cannot touch '/mnt/test/test/quota/dir/file': Disk quota exceeded
== multi attr selecting doesn't prevent partial
touch: cannot touch '/mnt/test/test/quota/dir/file': Disk quota exceeded
== op differentiates
== higher priority rule applies
touch: cannot touch '/mnt/test/test/quota/dir/file': Disk quota exceeded
== data rules with total and count prevent write and fallocate
dd: error writing '/mnt/test/test/quota/dir/file': Disk quota exceeded
fallocate: fallocate failed: Disk quota exceeded
dd: error writing '/mnt/test/test/quota/dir/file': Disk quota exceeded
fallocate: fallocate failed: Disk quota exceeded
== added rules work after bulk restore
touch: cannot touch '/mnt/test/test/quota/dir/file': Disk quota exceeded
== cleanup

View File

@@ -0,0 +1,28 @@
== setting retention on dir fails
attr_x ioctl failed on '/mnt/test/test/retention-basic': Invalid argument (22)
scoutfs: set-attr-x failed: Invalid argument (22)
== set retention
== get-attr-x shows retention
1
== unpriv can't clear retention
attr_x ioctl failed on '/mnt/test/test/retention-basic/file-1': Operation not permitted (1)
scoutfs: set-attr-x failed: Operation not permitted (1)
== can set hidden scoutfs xattr in retention
== setting user. xattr fails in retention
setfattr: /mnt/test/test/retention-basic/file-1: Operation not permitted
== file deletion fails in retention
rm: cannot remove '/mnt/test/test/retention-basic/file-1': Operation not permitted
== file rename fails in retention
mv: cannot move '/mnt/test/test/retention-basic/file-1' to '/mnt/test/test/retention-basic/file-2': Operation not permitted
== file write fails in retention
date: write error: Operation not permitted
== file truncate fails in retention
truncate: failed to truncate '/mnt/test/test/retention-basic/file-1' at 0 bytes: Operation not permitted
== setattr fails in retention
touch: setting times of '/mnt/test/test/retention-basic/file-1': Operation not permitted
== clear retention
== file write
== file rename
== setattr
== xattr deletion
== cleanup

View File

@@ -22,10 +22,8 @@ scoutfs: setattr failed: Invalid argument (22)
== large ctime is set
1972-02-19 00:06:25.999999999 +0000
== large offline extents are created
Filesystem type is: 554f4353
File size of /mnt/test/test/setattr_more/file is 40988672 (10007 blocks of 4096 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 10006: 0.. 10006: 10007: unknown,eof
/mnt/test/test/setattr_more/file: 1 extent found
0: offset: 0 0 length: 10007 flags: O.L
extents: 1
== correct offline extent length
976563
== omitting data_version should not fail

View File

@@ -1,5 +1,9 @@
== create/release/stage single block file
0: offset: 0 0 length: 1 flags: O.L
extents: 1
== create/release/stage larger file
0: offset: 0 0 length: 4096 flags: O.L
extents: 1
== multiple release,drop_cache,stage cycles
== release+stage shouldn't change stat, data seq or vers
== stage does change meta_seq
@@ -7,16 +11,22 @@
stage: must provide file version with --data-version
Try `stage --help' or `stage --usage' for more information.
== wrapped region fails
stage returned -1, not 4096: error Invalid argument (22)
stage returned -1, not 8192: error Invalid argument (22)
scoutfs: stage failed: Input/output error (5)
== non-block aligned offset fails
stage returned -1, not 4095: error Invalid argument (22)
scoutfs: stage failed: Input/output error (5)
0: offset: 0 0 length: 1 flags: O.L
extents: 1
== non-block aligned len within block fails
stage returned -1, not 1024: error Invalid argument (22)
scoutfs: stage failed: Input/output error (5)
0: offset: 0 0 length: 1 flags: O.L
extents: 1
== partial final block that writes to i_size does work
== zero length stage doesn't bring blocks online
0: offset: 0 0 length: 100 flags: O.L
extents: 1
== stage of non-regular file fails
ioctl failed: Inappropriate ioctl for device (25)
stage: must provide file version with --data-version

View File

@@ -0,0 +1,37 @@
== initialize per-mount values
== arm compaction triggers
trigger srch_compact_logs_pad_safe armed: 1
trigger srch_merge_stop_safe armed: 1
trigger srch_compact_logs_pad_safe armed: 1
trigger srch_merge_stop_safe armed: 1
trigger srch_compact_logs_pad_safe armed: 1
trigger srch_merge_stop_safe armed: 1
trigger srch_compact_logs_pad_safe armed: 1
trigger srch_merge_stop_safe armed: 1
trigger srch_compact_logs_pad_safe armed: 1
trigger srch_merge_stop_safe armed: 1
== compact more often
== create padded sorted inputs by forcing log rotation
trigger srch_force_log_rotate armed: 1
trigger srch_force_log_rotate armed: 1
trigger srch_force_log_rotate armed: 1
trigger srch_force_log_rotate armed: 1
trigger srch_compact_logs_pad_safe armed: 1
trigger srch_force_log_rotate armed: 1
trigger srch_force_log_rotate armed: 1
trigger srch_force_log_rotate armed: 1
trigger srch_force_log_rotate armed: 1
trigger srch_compact_logs_pad_safe armed: 1
trigger srch_force_log_rotate armed: 1
trigger srch_force_log_rotate armed: 1
trigger srch_force_log_rotate armed: 1
trigger srch_force_log_rotate armed: 1
trigger srch_compact_logs_pad_safe armed: 1
trigger srch_force_log_rotate armed: 1
trigger srch_force_log_rotate armed: 1
trigger srch_force_log_rotate armed: 1
trigger srch_force_log_rotate armed: 1
trigger srch_compact_logs_pad_safe armed: 1
== compaction of padded should stop at safe
== verify no compaction errors
== cleanup

View File

@@ -1,2 +1,3 @@
== create initial files
== race stage and release
== cleanup

View File

@@ -5,24 +5,44 @@ generic/004
generic/005
generic/006
generic/007
generic/008
generic/009
generic/011
generic/012
generic/013
generic/014
generic/015
generic/016
generic/018
generic/020
generic/021
generic/022
generic/023
generic/024
generic/025
generic/026
generic/028
generic/031
generic/032
generic/033
generic/034
generic/035
generic/037
generic/039
generic/040
generic/041
generic/050
generic/052
generic/053
generic/056
generic/057
generic/058
generic/059
generic/060
generic/061
generic/062
generic/063
generic/064
generic/065
generic/066
generic/067
@@ -31,42 +51,193 @@ generic/070
generic/071
generic/073
generic/076
generic/078
generic/079
generic/081
generic/082
generic/084
generic/086
generic/087
generic/088
generic/090
generic/091
generic/092
generic/094
generic/096
generic/097
generic/098
generic/099
generic/101
generic/104
generic/105
generic/106
generic/107
generic/110
generic/111
generic/113
generic/114
generic/115
generic/116
generic/117
generic/118
generic/119
generic/121
generic/122
generic/123
generic/124
generic/128
generic/129
generic/130
generic/131
generic/134
generic/135
generic/136
generic/138
generic/139
generic/140
generic/142
generic/143
generic/144
generic/145
generic/146
generic/147
generic/148
generic/149
generic/150
generic/151
generic/152
generic/153
generic/154
generic/155
generic/156
generic/157
generic/158
generic/159
generic/160
generic/161
generic/162
generic/163
generic/169
generic/171
generic/172
generic/173
generic/174
generic/177
generic/178
generic/179
generic/180
generic/181
generic/182
generic/183
generic/184
generic/185
generic/188
generic/189
generic/190
generic/191
generic/193
generic/194
generic/195
generic/196
generic/197
generic/198
generic/199
generic/200
generic/201
generic/202
generic/203
generic/205
generic/206
generic/207
generic/210
generic/211
generic/212
generic/214
generic/216
generic/217
generic/218
generic/219
generic/220
generic/221
generic/222
generic/223
generic/225
generic/227
generic/228
generic/229
generic/230
generic/235
generic/236
generic/237
generic/238
generic/240
generic/244
generic/245
generic/249
generic/250
generic/252
generic/253
generic/254
generic/255
generic/256
generic/257
generic/258
generic/259
generic/260
generic/261
generic/262
generic/263
generic/264
generic/265
generic/266
generic/267
generic/268
generic/271
generic/272
generic/276
generic/277
generic/278
generic/279
generic/281
generic/282
generic/283
generic/284
generic/286
generic/287
generic/288
generic/289
generic/290
generic/291
generic/292
generic/293
generic/294
generic/295
generic/296
generic/301
generic/302
generic/303
generic/304
generic/305
generic/306
generic/307
generic/308
generic/309
generic/312
generic/313
generic/314
generic/315
generic/316
generic/317
generic/319
generic/322
generic/324
generic/326
generic/327
generic/328
generic/329
generic/330
generic/331
generic/332
generic/335
generic/336
generic/337
@@ -74,10 +245,255 @@ generic/341
generic/342
generic/343
generic/348
generic/353
generic/355
generic/358
generic/359
generic/360
generic/361
generic/362
generic/363
generic/364
generic/365
generic/366
generic/367
generic/368
generic/369
generic/370
generic/371
generic/372
generic/373
generic/374
generic/375
generic/376
generic/377
generic/378
generic/379
generic/380
generic/381
generic/382
generic/383
generic/384
generic/385
generic/386
generic/389
generic/391
generic/392
generic/393
generic/394
generic/395
generic/396
generic/397
generic/398
generic/400
generic/401
generic/402
generic/403
generic/404
generic/406
generic/407
generic/408
generic/412
generic/413
generic/414
generic/417
generic/419
generic/420
generic/421
generic/422
generic/424
generic/425
generic/426
generic/427
generic/436
generic/439
generic/440
generic/443
generic/445
generic/446
generic/448
generic/449
generic/450
generic/451
generic/453
generic/454
generic/456
generic/458
generic/460
generic/462
generic/463
generic/465
generic/466
generic/468
generic/469
generic/470
generic/471
generic/474
generic/477
generic/478
generic/479
generic/480
generic/481
generic/483
generic/485
generic/486
generic/487
generic/488
generic/489
generic/490
generic/491
generic/492
generic/498
generic/499
generic/501
generic/502
generic/503
generic/504
generic/505
generic/506
generic/507
generic/508
generic/509
generic/510
generic/511
generic/512
generic/513
generic/514
generic/515
generic/516
generic/517
generic/518
generic/519
generic/520
generic/523
generic/524
generic/525
generic/526
generic/527
generic/528
generic/529
generic/530
generic/531
generic/533
generic/534
generic/535
generic/536
generic/537
generic/538
generic/539
generic/540
generic/541
generic/542
generic/543
generic/544
generic/545
generic/546
generic/547
generic/548
generic/549
generic/550
generic/552
generic/553
generic/555
generic/556
generic/557
generic/566
generic/567
generic/571
generic/572
generic/573
generic/574
generic/575
generic/576
generic/577
generic/578
generic/580
generic/581
generic/582
generic/583
generic/584
generic/586
generic/587
generic/588
generic/591
generic/592
generic/593
generic/594
generic/595
generic/596
generic/597
generic/598
generic/599
generic/600
generic/601
generic/602
generic/603
generic/604
generic/605
generic/606
generic/607
generic/608
generic/609
generic/610
generic/611
generic/612
generic/613
generic/618
generic/621
generic/623
generic/624
generic/625
generic/626
generic/628
generic/629
generic/630
generic/632
generic/634
generic/635
generic/637
generic/639
generic/640
generic/644
generic/645
generic/646
generic/647
generic/651
generic/652
generic/653
generic/654
generic/655
generic/657
generic/658
generic/659
generic/660
generic/661
generic/662
generic/663
generic/664
generic/665
generic/666
generic/667
generic/668
generic/669
generic/673
generic/674
generic/675
generic/676
generic/677
generic/678
generic/679
generic/680
generic/681
generic/682
generic/683
generic/684
generic/685
generic/686
generic/687
generic/688
generic/689
shared/002
shared/032
Not
run:
generic/008
@@ -251,8 +667,6 @@ generic/331
generic/332
generic/353
generic/355
generic/356
generic/357
generic/358
generic/359
generic/361
@@ -278,11 +692,174 @@ generic/383
generic/384
generic/385
generic/386
shared/001
generic/391
generic/392
generic/395
generic/396
generic/397
generic/398
generic/400
generic/402
generic/404
generic/406
generic/407
generic/408
generic/412
generic/413
generic/414
generic/417
generic/419
generic/420
generic/421
generic/422
generic/424
generic/425
generic/427
generic/439
generic/440
generic/446
generic/449
generic/450
generic/451
generic/453
generic/454
generic/456
generic/458
generic/462
generic/463
generic/465
generic/466
generic/468
generic/469
generic/470
generic/471
generic/474
generic/485
generic/487
generic/488
generic/491
generic/492
generic/499
generic/501
generic/503
generic/505
generic/506
generic/507
generic/508
generic/511
generic/513
generic/514
generic/515
generic/516
generic/517
generic/518
generic/519
generic/520
generic/528
generic/530
generic/536
generic/537
generic/538
generic/539
generic/540
generic/541
generic/542
generic/543
generic/544
generic/545
generic/546
generic/548
generic/549
generic/550
generic/552
generic/553
generic/555
generic/556
generic/566
generic/567
generic/572
generic/573
generic/574
generic/575
generic/576
generic/577
generic/578
generic/580
generic/581
generic/582
generic/583
generic/584
generic/586
generic/587
generic/588
generic/591
generic/592
generic/593
generic/594
generic/595
generic/596
generic/597
generic/598
generic/599
generic/600
generic/601
generic/602
generic/603
generic/605
generic/606
generic/607
generic/608
generic/609
generic/610
generic/612
generic/613
generic/621
generic/623
generic/624
generic/625
generic/626
generic/628
generic/629
generic/630
generic/635
generic/644
generic/645
generic/646
generic/647
generic/651
generic/652
generic/653
generic/654
generic/655
generic/657
generic/658
generic/659
generic/660
generic/661
generic/662
generic/663
generic/664
generic/665
generic/666
generic/667
generic/668
generic/669
generic/673
generic/674
generic/675
generic/677
generic/678
generic/679
generic/680
generic/681
generic/682
generic/683
generic/684
generic/685
generic/686
generic/687
generic/688
generic/689
shared/002
shared/003
shared/004
shared/032
shared/051
shared/289
Passed all 79 tests
Passed all 495 tests

View File

@@ -73,6 +73,7 @@ $(basename $0) options:
-t | Enabled trace events that match the given glob argument.
| Multiple options enable multiple globbed events.
-T <nr> | Multiply the original trace buffer size by nr during the run.
-V <nr> | Set mkfs device format version.
-X | xfstests git repo. Used by tests/xfstests.sh.
-x | xfstests git branch to checkout and track.
-y | xfstests ./check additional args
@@ -176,6 +177,11 @@ while true; do
T_TRACE_MULT="$2"
shift
;;
-V)
test -n "$2" || die "-V must have a format version argument"
T_MKFS_FORMAT_VERSION="-V $2"
shift
;;
-X)
test -n "$2" || die "-X requires xfstests git repo dir argument"
T_XFSTESTS_REPO="$2"
@@ -326,16 +332,10 @@ unmount_all() {
cmd wait $p
done
# delete all temp meta devices
for dev in $(losetup --associated "$T_META_DEVICE" | cut -d : -f 1); do
if [ -e "$dev" ]; then
cmd losetup -d "$dev"
fi
done
# delete all temp data devices
for dev in $(losetup --associated "$T_DATA_DEVICE" | cut -d : -f 1); do
if [ -e "$dev" ]; then
cmd losetup -d "$dev"
# delete all temp devices
for dev in /dev/mapper/_scoutfs_test_*; do
if [ -b "$dev" ]; then
cmd dmsetup remove $dev
fi
done
}
@@ -350,7 +350,7 @@ if [ -n "$T_MKFS" ]; then
done
msg "making new filesystem with $T_QUORUM quorum members"
cmd scoutfs mkfs -f $quo $T_DATA_ALLOC_ZONE_BLOCKS \
cmd scoutfs mkfs -f $quo $T_DATA_ALLOC_ZONE_BLOCKS $T_MKFS_FORMAT_VERSION \
"$T_META_DEVICE" "$T_DATA_DEVICE"
fi
@@ -358,7 +358,8 @@ if [ -n "$T_INSMOD" ]; then
msg "removing and reinserting scoutfs module"
test -e /sys/module/scoutfs && cmd rmmod scoutfs
cmd modprobe libcrc32c
cmd insmod "$T_KMOD/src/scoutfs.ko"
T_MODULE="$T_KMOD/src/scoutfs.ko"
cmd insmod "$T_MODULE"
fi
if [ -n "$T_TRACE_MULT" ]; then
@@ -434,6 +435,12 @@ $T_UTILS/fenced/scoutfs-fenced > "$T_FENCED_LOG" 2>&1 &
fenced_pid=$!
fenced_log "started fenced pid $fenced_pid in the background"
# setup dm tables
echo "0 $(blockdev --getsz $T_META_DEVICE) linear $T_META_DEVICE 0" > \
$T_RESULTS/dmtable.meta
echo "0 $(blockdev --getsz $T_DATA_DEVICE) linear $T_DATA_DEVICE 0" > \
$T_RESULTS/dmtable.data
#
# mount concurrently so that a quorum is present to elect the leader and
# start a server.
@@ -442,10 +449,13 @@ msg "mounting $T_NR_MOUNTS mounts on meta $T_META_DEVICE data $T_DATA_DEVICE"
pids=""
for i in $(seq 0 $((T_NR_MOUNTS - 1))); do
meta_dev=$(losetup --find --show $T_META_DEVICE)
test -b "$meta_dev" || die "failed to create temp device $meta_dev"
data_dev=$(losetup --find --show $T_DATA_DEVICE)
test -b "$data_dev" || die "failed to create temp device $data_dev"
name="_scoutfs_test_meta_$i"
cmd dmsetup create "$name" --table "$(cat $T_RESULTS/dmtable.meta)"
meta_dev="/dev/mapper/$name"
name="_scoutfs_test_data_$i"
cmd dmsetup create "$name" --table "$(cat $T_RESULTS/dmtable.data)"
data_dev="/dev/mapper/$name"
dir="/mnt/test.$i"
test -d "$dir" || cmd mkdir -p "$dir"
@@ -505,6 +515,7 @@ msg "running tests"
passed=0
skipped=0
failed=0
skipped_permitted=0
for t in $tests; do
# tests has basenames from sequence, get path and name
t="tests/$t"
@@ -547,6 +558,9 @@ for t in $tests; do
printf " %-30s $stats" "$test_name"
# mark in dmesg as to what test we are running
echo "run scoutfs test $test_name" > /dev/kmsg
# record dmesg before
dmesg | t_filter_dmesg > "$T_TMPDIR/dmesg.before"
@@ -608,6 +622,10 @@ for t in $tests; do
grep -s -v "^$test_name " "$last" > "$last.tmp"
echo "$test_name $stats" >> "$last.tmp"
mv -f "$last.tmp" "$last"
elif [ "$sts" == "$T_SKIP_PERMITTED_STATUS" ]; then
echo " [ skipped (permitted): $message ]"
echo "$test_name skipped (permitted) $message " >> "$T_RESULTS/skip.log"
((skipped_permitted++))
elif [ "$sts" == "$T_SKIP_STATUS" ]; then
echo " [ skipped: $message ]"
echo "$test_name $message" >> "$T_RESULTS/skip.log"
@@ -621,7 +639,7 @@ for t in $tests; do
fi
done
msg "all tests run: $passed passed, $skipped skipped, $failed failed"
msg "all tests run: $passed passed, $skipped skipped, $skipped_permitted skipped (permitted), $failed failed"
if [ -n "$T_TRACE_GLOB" -o -n "$T_TRACE_PRINTK" ]; then

View File

@@ -1,6 +1,7 @@
export-get-name-parent.sh
basic-block-counts.sh
basic-bad-mounts.sh
basic-posix-acl.sh
inode-items-updated.sh
simple-inode-index.sh
simple-staging.sh
@@ -12,11 +13,16 @@ data-prealloc.sh
setattr_more.sh
offline-extent-waiting.sh
move-blocks.sh
projects.sh
large-fragmented-free.sh
format-version-forward-back.sh
enospc.sh
srch-safe-merge-pos.sh
srch-basic-functionality.sh
simple-xattr-unit.sh
retention-basic.sh
totl-xattr-tag.sh
quota.sh
lock-refleak.sh
lock-shrink-consistency.sh
lock-pr-cw-conflict.sh
@@ -48,4 +54,5 @@ archive-light-cycle.sh
block-stale-reads.sh
inode-deletion.sh
renameat2-noreplace.sh
parallel_restore.sh
xfstests.sh

View File

@@ -1,6 +1,7 @@
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <stdarg.h>
#include <errno.h>
#include <string.h>
#include <sys/stat.h>
@@ -35,10 +36,10 @@ struct opts {
unsigned int dry_run:1,
ls_output:1,
quiet:1,
user_xattr:1,
same_srch_xattr:1,
group_srch_xattr:1,
unique_srch_xattr:1;
xattr_set:1,
xattr_file:1,
xattr_group:1;
char *xattr_name;
};
struct stats {
@@ -149,12 +150,31 @@ static void free_dir(struct dir *dir)
free(dir);
}
static size_t snprintf_off(void *buf, size_t sz, size_t off, char *fmt, ...)
{
va_list ap;
int ret;
if (off >= sz)
return sz;
va_start(ap, fmt);
ret = vsnprintf(buf + off, sz - off, fmt, ap);
va_end(ap);
if (ret <= 0)
return sz;
return off + ret;
}
static void create_dir(struct dir *dir, struct opts *opts,
struct stats *stats)
{
struct str_list *s;
char name[100];
char name[256]; /* max len and null term */
char val = 'v';
size_t off;
int rc;
int i;
@@ -175,29 +195,21 @@ static void create_dir(struct dir *dir, struct opts *opts,
rc = mknod(s->str, S_IFREG | 0644, 0);
error_exit(rc, "mknod %s failed"ERRF, s->str, ERRA);
rc = 0;
if (rc == 0 && opts->user_xattr) {
strcpy(name, "user.scoutfs_bcp");
rc = setxattr(s->str, name, &val, 1, 0);
}
if (rc == 0 && opts->same_srch_xattr) {
strcpy(name, "scoutfs.srch.scoutfs_bcp");
rc = setxattr(s->str, name, &val, 1, 0);
}
if (rc == 0 && opts->group_srch_xattr) {
snprintf(name, sizeof(name),
"scoutfs.srch.scoutfs_bcp.group.%lu",
stats->files / 10000);
rc = setxattr(s->str, name, &val, 1, 0);
}
if (rc == 0 && opts->unique_srch_xattr) {
snprintf(name, sizeof(name),
"scoutfs.srch.scoutfs_bcp.unique.%lu",
stats->files);
if (opts->xattr_set) {
off = snprintf_off(name, sizeof(name), 0, "%s", opts->xattr_name);
if (opts->xattr_file)
off = snprintf_off(name, sizeof(name), off,
"-f-%lu", stats->files);
if (opts->xattr_group)
off = snprintf_off(name, sizeof(name), off,
"-g-%lu", stats->files / 10000);
error_exit(off >= sizeof(name), "xattr name longer than 255 bytes");
rc = setxattr(s->str, name, &val, 1, 0);
error_exit(rc, "setxattr %s %s failed"ERRF, s->str, name, ERRA);
}
error_exit(rc, "setxattr %s %s failed"ERRF, s->str, name, ERRA);
stats->files++;
rate_banner(opts, stats);
@@ -365,11 +377,10 @@ static void usage(void)
" -d DIR | create all files in DIR top level directory\n"
" -n | dry run, only parse, don't create any files\n"
" -q | quiet, don't regularly print rates\n"
" -F | append \"-f-NR\" file nr to xattr name, requires -X\n"
" -G | append \"-g-NR\" file nr/10000 to xattr name, requires -X\n"
" -L | parse ls output; only reg, skip meta, paths at ./\n"
" -X | set the same user. xattr name in all files\n"
" -S | set the same .srch. xattr name in all files\n"
" -G | set a .srch. xattr name shared by groups of files\n"
" -U | set a unique .srch. xattr name in all files\n");
" -X NAM | set named xattr in all files\n");
}
int main(int argc, char **argv)
@@ -386,7 +397,7 @@ int main(int argc, char **argv)
memset(&opts, 0, sizeof(opts));
while ((c = getopt(argc, argv, "d:nqLXSGU")) != -1) {
while ((c = getopt(argc, argv, "d:nqFGLX:")) != -1) {
switch(c) {
case 'd':
top_dir = strdup(optarg);
@@ -397,20 +408,19 @@ int main(int argc, char **argv)
case 'q':
opts.quiet = 1;
break;
case 'F':
opts.xattr_file = 1;
break;
case 'G':
opts.xattr_group = 1;
break;
case 'L':
opts.ls_output = 1;
break;
case 'X':
opts.user_xattr = 1;
break;
case 'S':
opts.same_srch_xattr = 1;
break;
case 'G':
opts.group_srch_xattr = 1;
break;
case 'U':
opts.unique_srch_xattr = 1;
opts.xattr_set = 1;
opts.xattr_name = strdup(optarg);
error_exit(!opts.xattr_name, "error allocating xattr name");
break;
case '?':
printf("Unknown option '%c'\n", optopt);
@@ -419,6 +429,11 @@ int main(int argc, char **argv)
}
}
error_exit(opts.xattr_file && !opts.xattr_set,
"must specify xattr -X when appending file nr with -F");
error_exit(opts.xattr_group && !opts.xattr_set,
"must specify xattr -X when appending file nr with -G");
if (!opts.dry_run) {
error_exit(!top_dir,
"must specify top level directory with -d");

View File

@@ -0,0 +1,71 @@
/*
* Copyright (C) 2023 Versity Software, Inc. All rights reserved.
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public
* License v2 as published by the Free Software Foundation.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
* General Public License for more details.
*/
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <fcntl.h>
#include <errno.h>
#include <sys/stat.h>
#include <assert.h>
#include <limits.h>
static void linkat_tmpfile(char *dir, char *lpath)
{
char proc_self[PATH_MAX];
int ret;
int fd;
fd = open(dir, O_RDWR | O_TMPFILE, 0777);
if (fd < 0) {
perror("open(O_TMPFILE)");
exit(1);
}
snprintf(proc_self, sizeof(proc_self), "/proc/self/fd/%d", fd);
ret = linkat(AT_FDCWD, proc_self, AT_FDCWD, lpath, AT_SYMLINK_FOLLOW);
if (ret < 0) {
perror("linkat");
exit(1);
}
close(fd);
}
/*
* Use O_TMPFILE and linkat to create a new visible file, used to test
* the O_TMPFILE creation path by inspecting the created file.
*/
int main(int argc, char **argv)
{
char *lpath;
char *dir;
if (argc < 3) {
printf("%s <open_dir> <linkat_path>\n", argv[0]);
return 1;
}
dir = argv[1];
lpath = argv[2];
linkat_tmpfile(dir, lpath);
return 0;
}

View File

@@ -0,0 +1,838 @@
#define _GNU_SOURCE /* O_DIRECT */
#include <stdlib.h>
#include <stdio.h>
#include <stdbool.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/xattr.h>
#include <ctype.h>
#include <string.h>
#include <errno.h>
#include <limits.h>
#include <time.h>
#include <sys/prctl.h>
#include <signal.h>
#include <sys/socket.h>
#include "../../utils/src/sparse.h"
#include "../../utils/src/util.h"
#include "../../utils/src/list.h"
#include "../../utils/src/parse.h"
#include "../../kmod/src/format.h"
#include "../../utils/src/parallel_restore.h"
/*
* XXX:
* - add a nice description of what's going on
* - mention allocator contention
* - test child process dying handling
* - root dir entry name length is wrong
*/
#define ERRF " errno %d (%s)"
#define ERRA errno, strerror(errno)
#define error_exit(cond, fmt, args...) \
do { \
if (cond) { \
printf("error: "fmt"\n", ##args); \
exit(1); \
} \
} while (0)
#define dprintf(fmt, args...) \
do { \
if (0) \
printf(fmt, ##args); \
} while (0)
#define REG_MODE (S_IFREG | 0644)
#define DIR_MODE (S_IFDIR | 0755)
struct opts {
unsigned long long buf_size;
unsigned long long write_batch;
unsigned long long low_dirs;
unsigned long long high_dirs;
unsigned long long low_files;
unsigned long long high_files;
char *meta_path;
unsigned long long total_files;
bool read_only;
unsigned long long seed;
unsigned long long nr_writers;
};
static void usage(void)
{
printf("usage:\n"
" -b NR | threads write blocks in batches files (100000)\n"
" -d LOW:HIGH | range of subdirs per directory (5:10)\n"
" -f LOW:HIGH | range of files per directory (10:20)\n"
" -m PATH | path to metadata device\n"
" -n NR | total number of files to create (100)\n"
" -r | read-only, all work except writing, measure cpu cost\n"
" -s NR | randomization seed (random)\n"
" -w NR | number of writing processes to fork (online cpus)\n"
);
}
static size_t write_bufs(struct opts *opts, struct scoutfs_parallel_restore_writer *wri,
void *buf, size_t buf_size, int dev_fd)
{
size_t total = 0;
size_t count;
off_t off;
int ret;
do {
ret = scoutfs_parallel_restore_write_buf(wri, buf, buf_size, &off, &count);
error_exit(ret, "write buf %d", ret);
if (count > 0) {
if (!opts->read_only)
ret = pwrite(dev_fd, buf, count, off);
else
ret = count;
error_exit(ret != count, "pwrite count %zu ret %d", count, ret);
total += ret;
}
} while (count > 0);
return total;
}
struct gen_inode {
struct scoutfs_parallel_restore_inode inode;
struct scoutfs_parallel_restore_xattr **xattrs;
u64 nr_xattrs;
struct scoutfs_parallel_restore_entry **entries;
u64 nr_files;
u64 nr_entries;
};
static void free_gino(struct gen_inode *gino)
{
u64 i;
if (gino) {
if (gino->entries) {
for (i = 0; i < gino->nr_entries; i++)
free(gino->entries[i]);
free(gino->entries);
}
if (gino->xattrs) {
for (i = 0; i < gino->nr_xattrs; i++)
free(gino->xattrs[i]);
free(gino->xattrs);
}
free(gino);
}
}
static struct scoutfs_parallel_restore_xattr *
generate_xattr(struct opts *opts, u64 ino, u64 pos, char *name, int name_len, void *value,
int value_len)
{
struct scoutfs_parallel_restore_xattr *xattr;
xattr = malloc(sizeof(struct scoutfs_parallel_restore_xattr) + name_len + value_len);
error_exit(!xattr, "error allocating generated xattr");
*xattr = (struct scoutfs_parallel_restore_xattr) {
.ino = ino,
.pos = pos,
.name_len = name_len,
.value_len = value_len,
};
xattr->name = (void *)(xattr + 1);
xattr->value = (void *)(xattr->name + name_len);
memcpy(xattr->name, name, name_len);
if (value_len)
memcpy(xattr->value, value, value_len);
return xattr;
}
static struct gen_inode *generate_inode(struct opts *opts, u64 ino, mode_t mode)
{
struct gen_inode *gino;
struct timespec now;
clock_gettime(CLOCK_REALTIME, &now);
gino = calloc(1, sizeof(struct gen_inode));
error_exit(!gino, "failure allocating generated inode");
gino->inode = (struct scoutfs_parallel_restore_inode) {
.ino = ino,
.meta_seq = ino,
.data_seq = 0,
.mode = mode,
.atime = now,
.ctime = now,
.mtime = now,
.crtime = now,
};
/*
* hacky creation of a bunch of xattrs for now.
*/
if ((mode & S_IFMT) == S_IFREG) {
#define NV(n, v) { n, sizeof(n) - 1, v, sizeof(v) - 1, }
struct name_val {
char *name;
int len;
char *value;
int value_len;
} nv[] = {
NV("scoutfs.hide.totl.acct.8314611887310466424.2.0", "1"),
NV("scoutfs.hide.srch.sam_vol_E01001L6_4", ""),
NV("scoutfs.hide.sam_reqcopies", ""),
NV("scoutfs.hide.sam_copy_2", ""),
NV("scoutfs.hide.totl.acct.F01030L6.8314611887310466424.7.30", "1"),
NV("scoutfs.hide.sam_copy_1", ""),
NV("scoutfs.hide.srch.sam_vol_F01030L6_4", ""),
NV("scoutfs.hide.srch.sam_release_cand", ""),
NV("scoutfs.hide.sam_restime", ""),
NV("scoutfs.hide.sam_uuid", ""),
NV("scoutfs.hide.totl.acct.8314611887310466424.3.0", "1"),
NV("scoutfs.hide.srch.sam_vol_F01030L6", ""),
NV("scoutfs.hide.srch.sam_uuid_865939b7-24d6-472f-b85c-7ce7afeb813a", ""),
NV("scoutfs.hide.srch.sam_vol_E01001L6", ""),
NV("scoutfs.hide.totl.acct.E01001L6.8314611887310466424.7.1", "1"),
NV("scoutfs.hide.totl.acct.8314611887310466424.4.0", "1"),
NV("scoutfs.hide.totl.acct.8314611887310466424.11.0", "1"),
NV("scoutfs.hide.totl.acct.8314611887310466424.1.0", "1"),
};
unsigned int nr = array_size(nv);
int i;
gino->xattrs = calloc(nr, sizeof(struct scoutfs_parallel_restore_xattr *));
for (i = 0; i < nr; i++)
gino->xattrs[i] = generate_xattr(opts, ino, i, nv[i].name, nv[i].len,
nv[i].value, nv[i].value_len);
gino->nr_xattrs = nr;
gino->inode.nr_xattrs = nr;
gino->inode.size = 4096;
gino->inode.offline = true;
}
return gino;
}
static struct scoutfs_parallel_restore_entry *
generate_entry(struct opts *opts, char *prefix, u64 nr, u64 dir_ino, u64 pos, u64 ino, mode_t mode)
{
struct scoutfs_parallel_restore_entry *entry;
char buf[PATH_MAX];
int bytes;
bytes = snprintf(buf, sizeof(buf), "%s-%llu", prefix, nr);
entry = malloc(sizeof(struct scoutfs_parallel_restore_entry) + bytes);
error_exit(!entry, "error allocating generated entry");
*entry = (struct scoutfs_parallel_restore_entry) {
.dir_ino = dir_ino,
.pos = pos,
.ino = ino,
.mode = mode,
.name = (void *)(entry + 1),
.name_len = bytes,
};
memcpy(entry->name, buf, bytes);
return entry;
}
/*
* since the _parallel_restore_quota_rule mimics the squota_rule found in the
* kernel we can also mimic its rule_to_irule function
*/
#define TEST_RULE_STR "7 13,L,- 15,L,- 17,L,- I 33 -"
static struct scoutfs_parallel_restore_quota_rule *
generate_quota(struct opts *opts)
{
struct scoutfs_parallel_restore_quota_rule *prule;
int err;
prule = calloc(1, sizeof(struct scoutfs_parallel_restore_quota_rule));
error_exit(!prule, "Quota rule alloc failed");
err = sscanf(TEST_RULE_STR, " %hhu %llu,%c,%c %llu,%c,%c %llu,%c,%c %c %llu %c",
&prule->prio,
&prule->names[0].val, &prule->names[0].source, &prule->names[0].flags,
&prule->names[1].val, &prule->names[1].source, &prule->names[1].flags,
&prule->names[2].val, &prule->names[2].source, &prule->names[2].flags,
&prule->op, &prule->limit, &prule->rule_flags);
error_exit(err != 13, "invalid quota rule, missing fields. nr fields: %d rule str: %s\n", err, TEST_RULE_STR);
return prule;
}
static u64 random64(void)
{
return ((u64)lrand48() << 32) | lrand48();
}
static u64 random_range(u64 low, u64 high)
{
return low + (random64() % (high - low + 1));
}
static struct gen_inode *generate_dir(struct opts *opts, u64 dir_ino, u64 ino_start, u64 ino_len,
bool no_dirs)
{
struct scoutfs_parallel_restore_entry *entry;
struct gen_inode *gino;
u64 nr_entries;
u64 nr_files;
u64 nr_dirs;
u64 ino;
char *prefix;
mode_t mode;
u64 i;
nr_dirs = no_dirs ? 0 : random_range(opts->low_dirs, opts->high_dirs);
nr_files = random_range(opts->low_files, opts->high_files);
if (1 + nr_dirs + nr_files > ino_len) {
nr_dirs = no_dirs ? 0 : (ino_len - 1) / 2;
nr_files = (ino_len - 1) - nr_dirs;
}
nr_entries = nr_dirs + nr_files;
gino = generate_inode(opts, dir_ino, DIR_MODE);
error_exit(!gino, "error allocating generated inode");
gino->inode.nr_subdirs = nr_dirs;
gino->nr_files = nr_files;
if (nr_entries) {
gino->entries = calloc(nr_entries, sizeof(struct scoutfs_parallel_restore_entry *));
error_exit(!gino->entries, "error allocating generated inode entries");
gino->nr_entries = nr_entries;
}
mode = DIR_MODE;
prefix = "dir";
for (i = 0; i < nr_entries; i++) {
if (i == nr_dirs) {
mode = REG_MODE;
prefix = "file";
}
ino = ino_start + i;
entry = generate_entry(opts, prefix, ino, gino->inode.ino,
SCOUTFS_DIRENT_FIRST_POS + i, ino, mode);
gino->entries[i] = entry;
gino->inode.total_entry_name_bytes += entry->name_len;
}
return gino;
}
/*
* Restore a generated inode. If it's a directory then we also restore
* all its entries. The caller is going to descend into subdir entries and generate
* those dir inodes. We have to generate and restore all non-dir inodes referenced
* by this inode's entries.
*/
static void restore_inode(struct opts *opts, struct scoutfs_parallel_restore_writer *wri,
struct gen_inode *gino)
{
struct gen_inode *nondir;
int ret;
u64 i;
ret = scoutfs_parallel_restore_add_inode(wri, &gino->inode);
error_exit(ret, "thread add root inode %d", ret);
for (i = 0; i < gino->nr_entries; i++) {
ret = scoutfs_parallel_restore_add_entry(wri, gino->entries[i]);
error_exit(ret, "thread add entry %d", ret);
/* caller only needs subdir entries, generate and free others */
if ((gino->entries[i]->mode & S_IFMT) != S_IFDIR) {
nondir = generate_inode(opts, gino->entries[i]->ino,
gino->entries[i]->mode);
restore_inode(opts, wri, nondir);
free_gino(nondir);
free(gino->entries[i]);
if (i != gino->nr_entries - 1)
gino->entries[i] = gino->entries[gino->nr_entries - 1];
gino->nr_entries--;
gino->nr_files--;
i--;
}
}
for (i = 0; i < gino->nr_xattrs; i++) {
ret = scoutfs_parallel_restore_add_xattr(wri, gino->xattrs[i]);
error_exit(ret, "thread add xattr %d", ret);
}
}
struct writer_args {
struct list_head head;
int dev_fd;
int pair_fd;
struct scoutfs_parallel_restore_slice slice;
u64 writer_nr;
u64 dir_height;
u64 ino_start;
u64 ino_len;
};
struct write_result {
struct scoutfs_parallel_restore_progress prog;
struct scoutfs_parallel_restore_slice slice;
__le64 files_created;
__le64 bytes_written;
};
static void write_bufs_and_send(struct opts *opts, struct scoutfs_parallel_restore_writer *wri,
void *buf, size_t buf_size, int dev_fd,
struct write_result *res, bool get_slice, int pair_fd)
{
size_t total;
int ret;
total = write_bufs(opts, wri, buf, buf_size, dev_fd);
le64_add_cpu(&res->bytes_written, total);
ret = scoutfs_parallel_restore_get_progress(wri, &res->prog);
error_exit(ret, "get prog %d", ret);
if (get_slice) {
ret = scoutfs_parallel_restore_get_slice(wri, &res->slice);
error_exit(ret, "thread get slice %d", ret);
}
ret = write(pair_fd, res, sizeof(struct write_result));
error_exit(ret != sizeof(struct write_result), "result send error");
memset(res, 0, sizeof(struct write_result));
}
/*
* Calculate the number of bytes in toplevel "dir-%llu" entry names for the given
* number of writers.
*/
static u64 topdir_entry_bytes(u64 nr_writers)
{
u64 bytes = (3 + 1) * nr_writers;
u64 limit;
u64 done;
u64 wid;
u64 nr;
for (done = 0, wid = 1, limit = 10; done < nr_writers; done += nr, wid++, limit *= 10) {
nr = min(limit - done, nr_writers - done);
bytes += nr * wid;
}
return bytes;
}
struct dir_pos {
struct gen_inode *gino;
u64 pos;
};
static void writer_proc(struct opts *opts, struct writer_args *args)
{
struct scoutfs_parallel_restore_writer *wri = NULL;
struct scoutfs_parallel_restore_entry *entry;
struct dir_pos *dirs = NULL;
struct write_result res;
struct gen_inode *gino;
void *buf = NULL;
u64 level;
u64 ino;
int ret;
memset(&res, 0, sizeof(res));
dirs = calloc(args->dir_height, sizeof(struct dir_pos));
error_exit(errno, "error allocating parent dirs "ERRF, ERRA);
errno = posix_memalign((void **)&buf, 4096, opts->buf_size);
error_exit(errno, "error allocating block buf "ERRF, ERRA);
ret = scoutfs_parallel_restore_create_writer(&wri);
error_exit(ret, "create writer %d", ret);
ret = scoutfs_parallel_restore_add_slice(wri, &args->slice);
error_exit(ret, "add slice %d", ret);
/* writer 0 creates the root dir */
if (args->writer_nr == 0) {
gino = generate_inode(opts, SCOUTFS_ROOT_INO, DIR_MODE);
gino->inode.nr_subdirs = opts->nr_writers;
gino->inode.total_entry_name_bytes = topdir_entry_bytes(opts->nr_writers);
ret = scoutfs_parallel_restore_add_inode(wri, &gino->inode);
error_exit(ret, "thread add root inode %d", ret);
free_gino(gino);
}
/* create root entry for our top level dir */
ino = args->ino_start++;
args->ino_len--;
entry = generate_entry(opts, "top", args->writer_nr,
SCOUTFS_ROOT_INO, SCOUTFS_DIRENT_FIRST_POS + args->writer_nr,
ino, DIR_MODE);
ret = scoutfs_parallel_restore_add_entry(wri, entry);
error_exit(ret, "thread top entry %d", ret);
free(entry);
level = args->dir_height - 1;
while (args->ino_len > 0 && level < args->dir_height) {
gino = dirs[level].gino;
/* generate and restore if we follow entries */
if (!gino) {
gino = generate_dir(opts, ino, args->ino_start, args->ino_len, level == 0);
args->ino_start += gino->nr_entries;
args->ino_len -= gino->nr_entries;
le64_add_cpu(&res.files_created, gino->nr_files);
restore_inode(opts, wri, gino);
dirs[level].gino = gino;
}
if (dirs[level].pos == gino->nr_entries) {
/* ascend if we're done with this dir */
dirs[level].gino = NULL;
dirs[level].pos = 0;
free_gino(gino);
level++;
} else {
/* otherwise descend into subdir entry */
ino = gino->entries[dirs[level].pos]->ino;
dirs[level].pos++;
level--;
}
/* do a partial write at batch intervals when there's still more to do */
if (le64_to_cpu(res.files_created) >= opts->write_batch && args->ino_len > 0)
write_bufs_and_send(opts, wri, buf, opts->buf_size, args->dev_fd,
&res, false, args->pair_fd);
}
write_bufs_and_send(opts, wri, buf, opts->buf_size, args->dev_fd,
&res, true, args->pair_fd);
scoutfs_parallel_restore_destroy_writer(&wri);
free(dirs);
free(buf);
}
/*
* If any of our children exited with an error code, we hard exit.
* The child processes should themselves report out any errors
* encountered. Any remaining children will receive SIGHUP and
* terminate.
*/
static void sigchld_handler(int signo, siginfo_t *info, void *context)
{
if (info->si_status)
exit(EXIT_FAILURE);
}
static void fork_writer(struct opts *opts, struct writer_args *args)
{
pid_t parent = getpid();
pid_t pid;
int ret;
pid = fork();
error_exit(pid == -1, "fork error");
if (pid != 0)
return;
ret = prctl(PR_SET_PDEATHSIG, SIGHUP);
error_exit(ret < 0, "failed to set parent death sig");
printf("pid %u getpid() %u parent %u getppid() %u\n",
pid, getpid(), parent, getppid());
error_exit(getppid() != parent, "child parent already changed");
writer_proc(opts, args);
exit(0);
}
static int do_restore(struct opts *opts)
{
struct scoutfs_parallel_restore_writer *wri = NULL;
struct scoutfs_parallel_restore_slice *slices = NULL;
struct scoutfs_parallel_restore_quota_rule *rule = NULL;
struct scoutfs_super_block *super = NULL;
struct write_result res;
struct writer_args *args;
struct timespec begin;
struct timespec end;
LIST_HEAD(writers);
u64 next_ino;
u64 ino_per;
u64 avg_dirs;
u64 avg_files;
u64 dir_height;
u64 tot_files;
u64 tot_bytes;
int pair[2] = {-1, -1};
float secs;
void *buf = NULL;
int dev_fd = -1;
int ret;
int i;
ret = socketpair(PF_LOCAL, SOCK_STREAM, 0, pair);
error_exit(ret, "socketpair error "ERRF, ERRA);
dev_fd = open(opts->meta_path, O_DIRECT | (opts->read_only ? O_RDONLY : (O_RDWR|O_EXCL)));
error_exit(dev_fd < 0, "error opening '%s': "ERRF, opts->meta_path, ERRA);
errno = posix_memalign((void **)&super, 4096, SCOUTFS_BLOCK_SM_SIZE) ?:
posix_memalign((void **)&buf, 4096, opts->buf_size);
error_exit(errno, "error allocating block bufs "ERRF, ERRA);
ret = pread(dev_fd, super, SCOUTFS_BLOCK_SM_SIZE,
SCOUTFS_SUPER_BLKNO << SCOUTFS_BLOCK_SM_SHIFT);
error_exit(ret != SCOUTFS_BLOCK_SM_SIZE, "error reading super, ret %d", ret);
ret = scoutfs_parallel_restore_create_writer(&wri);
error_exit(ret, "create writer %d", ret);
ret = scoutfs_parallel_restore_import_super(wri, super, dev_fd);
error_exit(ret, "import super %d", ret);
rule = generate_quota(opts);
ret = scoutfs_parallel_restore_add_quota_rule(wri, rule);
free(rule);
error_exit(ret, "add quotas %d", ret);
slices = calloc(1 + opts->nr_writers, sizeof(struct scoutfs_parallel_restore_slice));
error_exit(!slices, "alloc slices");
scoutfs_parallel_restore_init_slices(wri, slices, 1 + opts->nr_writers);
ret = scoutfs_parallel_restore_add_slice(wri, &slices[0]);
error_exit(ret, "add slices[0] %d", ret);
next_ino = (SCOUTFS_ROOT_INO | SCOUTFS_LOCK_INODE_GROUP_MASK) + 1;
ino_per = opts->total_files / opts->nr_writers;
avg_dirs = (opts->low_dirs + opts->high_dirs) / 2;
avg_files = (opts->low_files + opts->high_files) / 2;
dir_height = 1;
tot_files = avg_files * opts->nr_writers;
while (tot_files < opts->total_files) {
dir_height++;
tot_files *= avg_dirs;
}
dprintf("height %llu tot %llu total %llu\n", dir_height, tot_files, opts->total_files);
clock_gettime(CLOCK_MONOTONIC_RAW, &begin);
/* start each writing process */
for (i = 0; i < opts->nr_writers; i++) {
args = calloc(1, sizeof(struct writer_args));
error_exit(!args, "alloc writer args");
args->dev_fd = dev_fd;
args->pair_fd = pair[1];
args->slice = slices[1 + i];
args->writer_nr = i;
args->dir_height = dir_height;
args->ino_start = next_ino;
args->ino_len = ino_per;
list_add_tail(&args->head, &writers);
next_ino += ino_per;
fork_writer(opts, args);
}
/* read results and watch for writers to finish */
tot_files = 0;
tot_bytes = 0;
i = 0;
while (i < opts->nr_writers) {
ret = read(pair[0], &res, sizeof(struct write_result));
error_exit(ret != sizeof(struct write_result), "result read error %d", ret);
ret = scoutfs_parallel_restore_add_progress(wri, &res.prog);
error_exit(ret, "add thr prog %d", ret);
if (res.slice.meta_len != 0) {
ret = scoutfs_parallel_restore_add_slice(wri, &res.slice);
error_exit(ret, "add thr slice %d", ret);
i++;
}
tot_files += le64_to_cpu(res.files_created);
tot_bytes += le64_to_cpu(res.bytes_written);
}
tot_bytes += write_bufs(opts, wri, buf, opts->buf_size, dev_fd);
ret = scoutfs_parallel_restore_export_super(wri, super);
error_exit(ret, "update super %d", ret);
if (!opts->read_only) {
ret = pwrite(dev_fd, super, SCOUTFS_BLOCK_SM_SIZE,
SCOUTFS_SUPER_BLKNO << SCOUTFS_BLOCK_SM_SHIFT);
error_exit(ret != SCOUTFS_BLOCK_SM_SIZE, "error writing super, ret %d", ret);
}
clock_gettime(CLOCK_MONOTONIC_RAW, &end);
scoutfs_parallel_restore_destroy_writer(&wri);
secs = ((float)end.tv_sec + ((float)end.tv_nsec/NSEC_PER_SEC)) -
((float)begin.tv_sec + ((float)begin.tv_nsec/NSEC_PER_SEC));
printf("created %llu files in %llu bytes and %f secs => %f bytes/file, %f files/sec\n",
tot_files, tot_bytes, secs,
(float)tot_bytes / tot_files, (float)tot_files / secs);
if (dev_fd >= 0)
close(dev_fd);
if (pair[0] >= 0)
close(pair[0]);
if (pair[1] >= 0)
close(pair[1]);
free(super);
free(slices);
free(buf);
return 0;
}
static int parse_low_high(char *str, u64 *low_ret, u64 *high_ret)
{
char *sep;
int ret = 0;
sep = index(str, ':');
if (sep) {
*sep = '\0';
ret = parse_u64(sep + 1, high_ret);
}
if (ret == 0)
ret = parse_u64(str, low_ret);
if (sep)
*sep = ':';
return ret;
}
int main(int argc, char **argv)
{
struct opts opts = {
.buf_size = (32 * 1024 * 1024),
.write_batch = 1000000,
.low_dirs = 5,
.high_dirs = 10,
.low_files = 10,
.high_files = 20,
.total_files = 100,
};
struct sigaction act = { 0 };
int ret;
int c;
opts.seed = random64();
opts.nr_writers = sysconf(_SC_NPROCESSORS_ONLN);
while ((c = getopt(argc, argv, "b:d:f:m:n:rs:w:")) != -1) {
switch(c) {
case 'b':
ret = parse_u64(optarg, &opts.write_batch);
error_exit(ret, "error parsing -b '%s'\n", optarg);
error_exit(opts.write_batch == 0, "-b can't be 0");
break;
case 'd':
ret = parse_low_high(optarg, &opts.low_dirs, &opts.high_dirs);
error_exit(ret, "error parsing -d '%s'\n", optarg);
break;
case 'f':
ret = parse_low_high(optarg, &opts.low_files, &opts.high_files);
error_exit(ret, "error parsing -f '%s'\n", optarg);
break;
case 'm':
opts.meta_path = strdup(optarg);
break;
case 'n':
ret = parse_u64(optarg, &opts.total_files);
error_exit(ret, "error parsing -n '%s'\n", optarg);
break;
case 'r':
opts.read_only = true;
break;
case 's':
ret = parse_u64(optarg, &opts.seed);
error_exit(ret, "error parsing -s '%s'\n", optarg);
break;
case 'w':
ret = parse_u64(optarg, &opts.nr_writers);
error_exit(ret, "error parsing -w '%s'\n", optarg);
break;
case '?':
printf("Unknown option '%c'\n", optopt);
usage();
exit(1);
}
}
error_exit(opts.low_dirs > opts.high_dirs, "LOW > HIGH in -d %llu:%llu",
opts.low_dirs, opts.high_dirs);
error_exit(opts.low_files > opts.high_files, "LOW > HIGH in -f %llu:%llu",
opts.low_files, opts.high_files);
error_exit(!opts.meta_path, "must specify metadata device path with -m");
printf("recreate with: -d %llu:%llu -f %llu:%llu -n %llu -s %llu -w %llu\n",
opts.low_dirs, opts.high_dirs, opts.low_files, opts.high_files,
opts.total_files, opts.seed, opts.nr_writers);
act.sa_flags = SA_SIGINFO | SA_RESTART;
act.sa_sigaction = &sigchld_handler;
if (sigaction(SIGCHLD, &act, NULL) == -1)
error_exit(ret, "error setting up signal handler\n");
ret = do_restore(&opts);
free(opts.meta_path);
return ret == 0 ? 0 : 1;
}

817
tests/src/restore_copy.c Normal file
View File

@@ -0,0 +1,817 @@
#define _GNU_SOURCE /* O_DIRECT */
#include <stdlib.h>
#include <stdio.h>
#include <stdbool.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/xattr.h>
#include <sys/stat.h>
#include <sys/ioctl.h>
#include <ctype.h>
#include <string.h>
#include <errno.h>
#include <limits.h>
#include <time.h>
#include <sys/prctl.h>
#include <sys/socket.h>
#include <sys/signal.h>
#include <sys/statfs.h>
#include <dirent.h>
#include "../../utils/src/sparse.h"
#include "../../utils/src/util.h"
#include "../../utils/src/list.h"
#include "../../utils/src/parse.h"
#include "../../kmod/src/format.h"
#include "../../kmod/src/ioctl.h"
#include "../../utils/src/parallel_restore.h"
/*
* XXX:
*/
#define ERRF " errno %d (%s)"
#define ERRA errno, strerror(errno)
#define error_exit(cond, fmt, args...) \
do { \
if (cond) { \
printf("error: "fmt"\n", ##args); \
exit(1); \
} \
} while (0)
#define REG_MODE (S_IFREG | 0644)
#define DIR_MODE (S_IFDIR | 0755)
#define LNK_MODE (S_IFLNK | 0777)
/*
* At about 1k files we seem to be writing about 1MB of data, so
* set buffer sizes adequately above that.
*/
#define BATCH_FILES 1024
#define BUF_SIZ 2 * 1024 * 1024
/*
* We can't make duplicate inodes for hardlinked files, so we
* will need to track these as we generate them. Not too costly
* to do, since it's just an integer, and sorting shouldn't matter
* until we get into the millions of entries, hopefully.
*/
static struct list_head hardlinks;
struct hardlink_head {
struct list_head head;
u64 ino;
};
struct opts {
char *meta_path;
char *source_dir;
};
static bool warn_scoutfs = false;
static void usage(void)
{
printf("usage:\n"
" -m PATH | path to metadata device\n"
" -s PATH | path to source directory\n"
);
}
static size_t write_bufs(struct scoutfs_parallel_restore_writer *wri,
void *buf, int dev_fd)
{
size_t total = 0;
size_t count;
off_t off;
int ret;
do {
ret = scoutfs_parallel_restore_write_buf(wri, buf, BUF_SIZ, &off, &count);
error_exit(ret, "write buf %d", ret);
if (count > 0) {
ret = pwrite(dev_fd, buf, count, off);
error_exit(ret != count, "pwrite count %zu ret %d", count, ret);
total += ret;
}
} while (count > 0);
return total;
}
struct write_result {
struct scoutfs_parallel_restore_progress prog;
struct scoutfs_parallel_restore_slice slice;
__le64 files_created;
__le64 dirs_created;
__le64 bytes_written;
bool complete;
};
static void write_bufs_and_send(struct scoutfs_parallel_restore_writer *wri,
void *buf, int dev_fd,
struct write_result *res, bool get_slice, int pair_fd)
{
size_t total;
int ret;
total = write_bufs(wri, buf, dev_fd);
le64_add_cpu(&res->bytes_written, total);
ret = scoutfs_parallel_restore_get_progress(wri, &res->prog);
error_exit(ret, "get prog %d", ret);
if (get_slice) {
ret = scoutfs_parallel_restore_get_slice(wri, &res->slice);
error_exit(ret, "thread get slice %d", ret);
}
ret = write(pair_fd, res, sizeof(struct write_result));
error_exit(ret != sizeof(struct write_result), "result send error");
memset(res, 0, sizeof(struct write_result));
}
/*
* Adding xattrs is supported for files and directories only.
*
* If the filesystem on which the path resides isn't scoutfs, we omit the
* scoutfs specific ioctl to fetch hidden xattrs.
*
* Untested if the hidden xattr ioctl works on directories or symlinks.
*/
static void add_xattrs(struct scoutfs_parallel_restore_writer *wri, char *path, u64 ino, bool is_scoutfs)
{
struct scoutfs_ioctl_listxattr_hidden lxh;
struct scoutfs_parallel_restore_xattr *xattr;
char *buf = NULL;
char *name = NULL;
int fd = -1;
int bytes;
int len;
int value_len;
int ret;
int pos = 0;
if (!is_scoutfs)
goto normal_xattrs;
fd = open(path, O_RDONLY);
error_exit(fd < 0, "open"ERRF, ERRA);
memset(&lxh, 0, sizeof(lxh));
lxh.id_pos = 0;
lxh.hash_pos = 0;
lxh.buf_bytes = 256 * 1024;
buf = malloc(lxh.buf_bytes);
error_exit(!buf, "alloc xattr_hidden buf");
lxh.buf_ptr = (unsigned long)buf;
/* hidden */
for (;;) {
ret = ioctl(fd, SCOUTFS_IOC_LISTXATTR_HIDDEN, &lxh);
if (ret == 0) /* done */
break;
error_exit(ret < 0, "listxattr_hidden"ERRF, ERRA);
bytes = ret;
error_exit(bytes > lxh.buf_bytes, "listxattr_hidden overflow");
error_exit(buf[bytes - 1] != '\0', "listxattr_hidden didn't term");
name = buf;
do {
len = strlen(name);
error_exit(len == 0, "listxattr_hidden empty name");
error_exit(len > SCOUTFS_XATTR_MAX_NAME_LEN, "listxattr_hidden long name");
/* get value len */
value_len = fgetxattr(fd, name, NULL, 0);
error_exit(value_len < 0, "malloc value hidden"ERRF, ERRA);
/* allocate everything at once */
xattr = malloc(sizeof(struct scoutfs_parallel_restore_xattr) + len + value_len);
error_exit(!xattr, "error allocating generated xattr");
*xattr = (struct scoutfs_parallel_restore_xattr) {
.ino = ino,
.pos = pos++,
.name_len = len,
.value_len = value_len,
};
xattr->name = (void *)(xattr + 1);
xattr->value = (void *)(xattr->name + len);
/* get value into xattr directly */
ret = fgetxattr(fd, name, (void *)(xattr->name + len), value_len);
error_exit(ret != value_len, "fgetxattr value"ERRF, ERRA);
memcpy(xattr->name, name, len);
ret = scoutfs_parallel_restore_add_xattr(wri, xattr);
error_exit(ret, "add hidden xattr %d", ret);
free(xattr);
name += len + 1;
bytes -= len + 1;
} while (bytes > 0);
}
free(buf);
close(fd);
normal_xattrs:
value_len = listxattr(path, NULL, 0);
error_exit(value_len < 0, "hidden listxattr "ERRF, ERRA);
if (value_len == 0)
return;
buf = calloc(1, value_len);
error_exit(!buf, "malloc value"ERRF, ERRA);
ret = listxattr(path, buf, value_len);
error_exit(ret < 0, "hidden listxattr %d", ret);
name = buf;
bytes = ret;
do {
len = strlen(name);
error_exit(len == 0, "listxattr_hidden empty name");
error_exit(len > SCOUTFS_XATTR_MAX_NAME_LEN, "listxattr_hidden long name");
value_len = getxattr(path, name, NULL, 0);
error_exit(value_len < 0, "value "ERRF, ERRA);
xattr = malloc(sizeof(struct scoutfs_parallel_restore_xattr) + len + value_len);
error_exit(!xattr, "error allocating generated xattr");
*xattr = (struct scoutfs_parallel_restore_xattr) {
.ino = ino,
.pos = pos++,
.name_len = len,
.value_len = value_len,
};
xattr->name = (void *)(xattr + 1);
xattr->value = (void *)(xattr->name + len);
ret = getxattr(path, name, (void *)(xattr->name + len), value_len);
error_exit(ret != value_len, "fgetxattr value"ERRF, ERRA);
memcpy(xattr->name, name, len);
ret = scoutfs_parallel_restore_add_xattr(wri, xattr);
error_exit(ret, "add xattr %d", ret);
free(xattr);
name += len + 1;
bytes -= len + 1;
} while (bytes > 0);
free(buf);
}
/*
* We can't store the same inode multiple times, so we need to make
* sure to account for hardlinks. Maintain a LL that stores the first
* hardlink inode we encounter, and every subsequent hardlink to this
* inode will omit inserting an inode, and just adds another entry
*/
static bool is_new_inode_item(bool nlink, u64 ino)
{
struct hardlink_head *hh_tmp;
struct hardlink_head *hh;
if (!nlink)
return true;
/* lineair search, pretty awful, should be a binary tree */
list_for_each_entry_safe(hh, hh_tmp, &hardlinks, head) {
if (hh->ino == ino)
return false;
}
/* insert item */
hh = malloc(sizeof(struct hardlink_head));
error_exit(!hh, "malloc");
hh->ino = ino;
list_add_tail(&hh->head, &hardlinks);
/*
* XXX
*
* We can be confident that if we don't traverse filesystems
* that once we've created N entries of an N-linked inode, that
* it can be removed from the LL. This would significantly
* improve the manageability of the list.
*
* All we'd need to do is add a counter and compare it to the nr_links
* field of the inode.
*/
return true;
}
/*
* create the inode data for a given path as best as possible
* duplicating the exact data from the source path
*/
static struct scoutfs_parallel_restore_inode *read_inode_data(char *path, u64 ino, bool *nlink, bool is_scoutfs)
{
struct scoutfs_parallel_restore_inode *inode = NULL;
struct scoutfs_ioctl_stat_more stm;
struct stat st;
int ret;
int fd;
inode = calloc(1, sizeof(struct scoutfs_parallel_restore_inode));
error_exit(!inode, "failure allocating inode");
ret = lstat(path, &st);
error_exit(ret, "failure stat inode");
/* use exact inode numbers from path, except for root ino */
if (ino != SCOUTFS_ROOT_INO)
inode->ino = st.st_ino;
else
inode->ino = SCOUTFS_ROOT_INO;
inode->mode = st.st_mode;
inode->uid = st.st_uid;
inode->gid = st.st_gid;
inode->atime = st.st_atim;
inode->ctime = st.st_ctim;
inode->mtime = st.st_mtim;
inode->size = st.st_size;
inode->rdev = st.st_rdev;
/* scoutfs specific */
inode->meta_seq = 0;
inode->data_seq = 0;
inode->crtime = st.st_ctim;
if (S_ISREG(inode->mode)) {
if (inode->size > 0)
inode->offline = true;
if (is_scoutfs) {
fd = open(path, O_RDONLY);
error_exit(!fd, "open failure"ERRF, ERRA);
ret = ioctl(fd, SCOUTFS_IOC_STAT_MORE, &stm);
error_exit(ret, "failure SCOUTFS_IOC_STAT_MORE inode");
inode->meta_seq = stm.meta_seq;
inode->data_seq = stm.data_seq;
inode->crtime = (struct timespec){.tv_sec = stm.crtime_sec, .tv_nsec = stm.crtime_nsec};
close(fd);
}
}
/* pass whether item is hardlinked or not */
*nlink = (st.st_nlink > 1);
return inode;
}
struct writer_args {
struct list_head head;
int dev_fd;
int pair_fd;
struct scoutfs_parallel_restore_slice slice;
};
static void restore_path(struct scoutfs_parallel_restore_writer *wri, struct writer_args *args, struct write_result *res, void *buf, char *path, u64 ino)
{
struct scoutfs_parallel_restore_inode *inode;
struct scoutfs_parallel_restore_entry *entry;
DIR *dirp = NULL;
char *subdir = NULL;
char link[PATH_MAX + 1];
struct dirent *ent;
struct statfs stf;
int ret = 0;
int subdir_count = 0, file_count = 0;
size_t ent_len = 0;
size_t pos = 0;
bool nlink = false;
char ind = '?';
u64 mode;
bool is_scoutfs = false;
/* get fs info once per path */
ret = statfs(path, &stf);
error_exit(ret != 0, "statfs"ERRF, ERRA);
is_scoutfs = (stf.f_type == 0x554f4353);
if (!is_scoutfs && !warn_scoutfs) {
warn_scoutfs = true;
fprintf(stderr, "Non-scoutfs source path detected: scoutfs specific features disabled\n");
}
/* traverse the entire tree */
dirp = opendir(path);
errno = 0;
while ((ent = readdir(dirp))) {
if (ent->d_type == DT_DIR) {
if ((strcmp(ent->d_name, ".") == 0) ||
(strcmp(ent->d_name, "..") == 0)) {
/* position still matters */
pos++;
continue;
}
/* recurse into subdir */
ret = asprintf(&subdir, "%s/%s", path, ent->d_name);
error_exit(ret == -1, "asprintf subdir"ERRF, ERRA);
restore_path(wri, args, res, buf, subdir, ent->d_ino);
subdir_count++;
ent_len += strlen(ent->d_name);
entry = malloc(sizeof(struct scoutfs_parallel_restore_entry) + strlen(ent->d_name));
error_exit(!entry, "error allocating generated entry");
*entry = (struct scoutfs_parallel_restore_entry) {
.dir_ino = ino,
.pos = pos++,
.ino = ent->d_ino,
.mode = DIR_MODE,
.name = (void *)(entry + 1),
.name_len = strlen(ent->d_name),
};
memcpy(entry->name, ent->d_name, strlen(ent->d_name));
ret = scoutfs_parallel_restore_add_entry(wri, entry);
error_exit(ret, "add entry %d", ret);
free(entry);
add_xattrs(wri, subdir, ent->d_ino, is_scoutfs);
free(subdir);
le64_add_cpu(&res->dirs_created, 1);
} else if (ent->d_type == DT_REG) {
file_count++;
ent_len += strlen(ent->d_name);
entry = malloc(sizeof(struct scoutfs_parallel_restore_entry) + strlen(ent->d_name));
error_exit(!entry, "error allocating generated entry");
*entry = (struct scoutfs_parallel_restore_entry) {
.dir_ino = ino,
.pos = pos++,
.ino = ent->d_ino,
.mode = REG_MODE,
.name = (void *)(entry + 1),
.name_len = strlen(ent->d_name),
};
memcpy(entry->name, ent->d_name, strlen(ent->d_name));
ret = scoutfs_parallel_restore_add_entry(wri, entry);
error_exit(ret, "add entry %d", ret);
free(entry);
ret = asprintf(&subdir, "%s/%s", path, ent->d_name);
error_exit(ret == -1, "asprintf subdir"ERRF, ERRA);
/* file inode */
inode = read_inode_data(subdir, ent->d_ino, &nlink, is_scoutfs);
fprintf(stdout, "f %s/%s\n", path, ent->d_name);
if (is_new_inode_item(nlink, ent->d_ino)) {
ret = scoutfs_parallel_restore_add_inode(wri, inode);
error_exit(ret, "add reg file inode %d", ret);
/* xattrs */
add_xattrs(wri, subdir, ent->d_ino, is_scoutfs);
}
free(inode);
free(subdir);
le64_add_cpu(&res->files_created, 1);
} else if (ent->d_type == DT_LNK) {
/* readlink */
ret = asprintf(&subdir, "%s/%s", path, ent->d_name);
error_exit(ret == -1, "asprintf subdir"ERRF, ERRA);
ent_len += strlen(ent->d_name);
ret = readlink(subdir, link, PATH_MAX);
error_exit(ret < 0, "readlink %d", ret);
/* must 0-terminate if we want to print it */
link[ret] = 0;
entry = malloc(sizeof(struct scoutfs_parallel_restore_entry) + strlen(ent->d_name));
error_exit(!entry, "error allocating generated entry");
*entry = (struct scoutfs_parallel_restore_entry) {
.dir_ino = ino,
.pos = pos++,
.ino = ent->d_ino,
.mode = LNK_MODE,
.name = (void *)(entry + 1),
.name_len = strlen(ent->d_name),
};
memcpy(entry->name, ent->d_name, strlen(ent->d_name));
ret = scoutfs_parallel_restore_add_entry(wri, entry);
error_exit(ret, "add symlink entry %d", ret);
/* link inode */
inode = read_inode_data(subdir, ent->d_ino, &nlink, is_scoutfs);
fprintf(stdout, "l %s/%s -> %s\n", path, ent->d_name, link);
inode->mode = LNK_MODE;
inode->target = link;
inode->target_len = strlen(link) + 1; /* scoutfs null terminates symlinks */
ret = scoutfs_parallel_restore_add_inode(wri, inode);
error_exit(ret, "add syml inode %d", ret);
free(inode);
free(subdir);
le64_add_cpu(&res->files_created, 1);
} else {
/* odd stuff */
switch(ent->d_type) {
case DT_CHR:
ind = 'c';
mode = S_IFCHR;
break;
case DT_BLK:
ind = 'b';
mode = S_IFBLK;
break;
case DT_FIFO:
ind = 'p';
mode = S_IFIFO;
break;
case DT_SOCK:
ind = 's';
mode = S_IFSOCK;
break;
default:
error_exit(true, "Unknown readdir entry type");
;;
}
file_count++;
ent_len += strlen(ent->d_name);
entry = malloc(sizeof(struct scoutfs_parallel_restore_entry) + strlen(ent->d_name));
error_exit(!entry, "error allocating generated entry");
*entry = (struct scoutfs_parallel_restore_entry) {
.dir_ino = ino,
.pos = pos++,
.ino = ent->d_ino,
.mode = mode,
.name = (void *)(entry + 1),
.name_len = strlen(ent->d_name),
};
memcpy(entry->name, ent->d_name, strlen(ent->d_name));
ret = scoutfs_parallel_restore_add_entry(wri, entry);
error_exit(ret, "add entry %d", ret);
free(entry);
ret = asprintf(&subdir, "%s/%s", path, ent->d_name);
error_exit(ret == -1, "asprintf subdir"ERRF, ERRA);
/* file inode */
inode = read_inode_data(subdir, ent->d_ino, &nlink, is_scoutfs);
fprintf(stdout, "%c %lld %s/%s\n", ind, inode->ino, path, ent->d_name);
if (is_new_inode_item(nlink, ent->d_ino)) {
ret = scoutfs_parallel_restore_add_inode(wri, inode);
error_exit(ret, "add reg file inode %d", ret);
}
free(inode);
free(subdir);
le64_add_cpu(&res->files_created, 1);
}
/* batch out changes, will be about 1M */
if (le64_to_cpu(res->files_created) > BATCH_FILES) {
write_bufs_and_send(wri, buf, args->dev_fd, res, false, args->pair_fd);
}
}
if (ent != NULL)
error_exit(errno, "readdir"ERRF, ERRA);
closedir(dirp);
/* create the dir itself */
inode = read_inode_data(path, ino, &nlink, is_scoutfs);
inode->nr_subdirs = subdir_count;
inode->total_entry_name_bytes = ent_len;
fprintf(stdout, "d %s\n", path);
ret = scoutfs_parallel_restore_add_inode(wri, inode);
error_exit(ret, "add dir inode %d", ret);
free(inode);
/* No need to send, we'll send final after last directory is complete */
}
static int do_restore(struct opts *opts)
{
struct scoutfs_parallel_restore_writer *pwri, *wri = NULL;
struct scoutfs_parallel_restore_slice *slices = NULL;
struct scoutfs_super_block *super = NULL;
struct writer_args *args;
struct write_result res;
int pair[2] = {-1, -1};
LIST_HEAD(writers);
void *buf = NULL;
void *bufp = NULL;
int dev_fd = -1;
pid_t pid;
int ret;
u64 tot_bytes;
u64 tot_dirs;
u64 tot_files;
ret = socketpair(PF_LOCAL, SOCK_STREAM, 0, pair);
error_exit(ret, "socketpair error "ERRF, ERRA);
dev_fd = open(opts->meta_path, O_DIRECT | (O_RDWR|O_EXCL));
error_exit(dev_fd < 0, "error opening '%s': "ERRF, opts->meta_path, ERRA);
errno = posix_memalign((void **)&super, 4096, SCOUTFS_BLOCK_SM_SIZE) ?:
posix_memalign((void **)&buf, 4096, BUF_SIZ);
error_exit(errno, "error allocating block bufs "ERRF, ERRA);
ret = pread(dev_fd, super, SCOUTFS_BLOCK_SM_SIZE,
SCOUTFS_SUPER_BLKNO << SCOUTFS_BLOCK_SM_SHIFT);
error_exit(ret != SCOUTFS_BLOCK_SM_SIZE, "error reading super, ret %d", ret);
error_exit((super->flags & SCOUTFS_FLAG_IS_META_BDEV) == 0, "super block is not meta dev");
ret = scoutfs_parallel_restore_create_writer(&wri);
error_exit(ret, "create writer %d", ret);
ret = scoutfs_parallel_restore_import_super(wri, super, dev_fd);
error_exit(ret, "import super %d", ret);
slices = calloc(2, sizeof(struct scoutfs_parallel_restore_slice));
error_exit(!slices, "alloc slices");
scoutfs_parallel_restore_init_slices(wri, slices, 2);
ret = scoutfs_parallel_restore_add_slice(wri, &slices[0]);
error_exit(ret, "add slices[0] %d", ret);
args = calloc(1, sizeof(struct writer_args));
error_exit(!args, "alloc writer args");
args->dev_fd = dev_fd;
args->slice = slices[1];
args->pair_fd = pair[1];
list_add_tail(&args->head, &writers);
/* fork writer process */
pid = fork();
error_exit(pid == -1, "fork error");
if (pid == 0) {
ret = prctl(PR_SET_PDEATHSIG, SIGHUP);
error_exit(ret < 0, "failed to set parent death sig");
errno = posix_memalign((void **)&bufp, 4096, BUF_SIZ);
error_exit(errno, "error allocating block bufp "ERRF, ERRA);
ret = scoutfs_parallel_restore_create_writer(&pwri);
error_exit(ret, "create pwriter %d", ret);
ret = scoutfs_parallel_restore_add_slice(pwri, &args->slice);
error_exit(ret, "add pslice %d", ret);
memset(&res, 0, sizeof(res));
restore_path(pwri, args, &res, bufp, opts->source_dir, SCOUTFS_ROOT_INO);
res.complete = true;
write_bufs_and_send(pwri, buf, args->dev_fd, &res, true, args->pair_fd);
scoutfs_parallel_restore_destroy_writer(&pwri);
free(bufp);
exit(0);
};
/* read results and wait for writer to finish */
tot_bytes = 0;
tot_dirs = 1;
tot_files = 0;
for (;;) {
ret = read(pair[0], &res, sizeof(struct write_result));
error_exit(ret != sizeof(struct write_result), "result read error %d", ret);
ret = scoutfs_parallel_restore_add_progress(wri, &res.prog);
error_exit(ret, "add thr prog %d", ret);
if (res.slice.meta_len != 0) {
ret = scoutfs_parallel_restore_add_slice(wri, &res.slice);
error_exit(ret, "add thr slice %d", ret);
if (res.complete)
break;
}
tot_bytes += le64_to_cpu(res.bytes_written);
tot_files += le64_to_cpu(res.files_created);
tot_dirs += le64_to_cpu(res.dirs_created);
}
tot_bytes += write_bufs(wri, buf, args->dev_fd);
fprintf(stdout, "Wrote %lld directories, %lld files, %lld bytes total\n",
tot_dirs, tot_files, tot_bytes);
/* write super to finalize */
ret = scoutfs_parallel_restore_export_super(wri, super);
error_exit(ret, "update super %d", ret);
ret = pwrite(dev_fd, super, SCOUTFS_BLOCK_SM_SIZE,
SCOUTFS_SUPER_BLKNO << SCOUTFS_BLOCK_SM_SHIFT);
error_exit(ret != SCOUTFS_BLOCK_SM_SIZE, "error writing super, ret %d", ret);
scoutfs_parallel_restore_destroy_writer(&wri);
if (dev_fd >= 0)
close(dev_fd);
if (pair[0] > 0)
close(pair[0]);
if (pair[1] > 0)
close(pair[1]);
free(super);
free(args);
free(slices);
free(buf);
return 0;
}
int main(int argc, char **argv)
{
struct opts opts = (struct opts){ 0 };
struct hardlink_head *hh_tmp;
struct hardlink_head *hh;
int ret;
int c;
INIT_LIST_HEAD(&hardlinks);
while ((c = getopt(argc, argv, "b:m:s:")) != -1) {
switch(c) {
case 'm':
opts.meta_path = strdup(optarg);
break;
case 's':
opts.source_dir = strdup(optarg);
break;
case '?':
printf("Unknown option '%c'\n", optopt);
usage();
exit(1);
}
}
error_exit(!opts.meta_path, "must specify metadata device path with -m");
error_exit(!opts.source_dir, "must specify source directory path with -s");
ret = do_restore(&opts);
free(opts.meta_path);
free(opts.source_dir);
list_for_each_entry_safe(hh, hh_tmp, &hardlinks, head) {
list_del_init(&hh->head);
free(hh);
}
return ret == 0 ? 0 : 1;
}

View File

@@ -0,0 +1,110 @@
#
# test basic POSIX acl functionality.
#
t_require_commands stat rm touch mkdir getfacl setfacl id sudo
t_require_mounts 2
# from quota.sh
TEST_UID=22222
TEST_GID=44444
# sys_setreuid() set fs[uid] to e[ug]id
SET_UID="--ruid=$TEST_UID --euid=$TEST_UID"
SET_GID="--rgid=$TEST_GID --egid=$TEST_GID --clear-groups"
# helper to avoid capturing dates from ls output
L() {
stat -c "%F %A %u %g %s %N" $@
}
echo "== setup test directory"
cd "$T_D0"
echo "== getfacl"
L .
getfacl .
echo "== basic non-acl access through permissions"
rm -rf dir-testuid
mkdir dir-testuid
ln -sf dir-testuid symlinkdir-testuid
chown root:44444 dir-testuid
L dir-testuid
setpriv $SET_UID $SET_GID touch dir-testuid/file-group-write
setpriv $SET_UID $SET_GID touch symlinkdir-testuid/symlink-file-group-write
chmod g+w dir-testuid
setpriv $SET_UID $SET_GID touch dir-testuid/file-group-write
setpriv $SET_UID $SET_GID touch symlinkdir-testuid/symlink-file-group-write
L dir-testuid/file-group-write
L symlinkdir-testuid/symlink-file-group-write
echo "== basic acl access"
rm -rf dir-root
mkdir dir-root
ln -sf dir-root symlinkdir-root
L dir-root
setpriv $SET_UID touch dir-root/file-group-write
setpriv $SET_UID touch symlinkdir-root/file-group-write
setfacl -m u:22222:rwx dir-root
getfacl dir-root
setpriv $SET_UID touch dir-root/file-group-write
setpriv $SET_UID touch symlinkdir-root/file-group-write
L dir-root/file-group-write
L symlinkdir-root/file-group-write
echo "== directory exec"
setpriv $SET_UID bash -c "cd dir-root 2>&- && echo Success"
setpriv $SET_UID bash -c "cd symlinkdir-root 2>&- && echo Success"
setfacl -m u:22222:rw dir-root
getfacl dir-root
setpriv $SET_UID bash -c "cd dir-root 2>&- || echo Failed"
setpriv $SET_UID bash -c "cd symlinkdir-root 2>&- || echo Failed"
setfacl -m g:44444:rwx dir-root
getfacl dir-root
setpriv $SET_GID bash -c "cd dir-root 2>&- && echo Success"
setpriv $SET_GID bash -c "cd symlinkdir-root 2>&- && echo Success"
echo "== get/set attr"
rm -rf file-root
touch file-root
L file-root
setpriv $SET_UID getfattr -d file-root
setpriv $SET_UID setfattr -n "user.test1" -v "Success" file-root
setpriv $SET_UID getfattr -d file-root
setfacl -m u:22222:rw file-root
getfacl file-root
setpriv $SET_UID setfattr -n "user.test2" -v "Success" file-root
setpriv $SET_UID getfattr -d file-root
setfacl -x u:22222 file-root
getfacl file-root
setpriv $SET_UID setfattr -n "user.test3" -v "Success" file-root
setpriv $SET_UID getfattr -d file-root
setfacl -m g:44444:rw file-root
getfacl file-root
setpriv $SET_GID setfattr -n "user.test4" -v "Success" file-root
setpriv $SET_GID getfattr -d file-root
echo "== inheritance / default acl"
rm -rf dir-root2
mkdir dir-root2
L dir-root2
setpriv $SET_UID mkdir dir-root2/dir
setpriv $SET_UID touch dir-root2/dir/file
setfacl -m d:u:22222:rwx dir-root2
getfacl dir-root2
setpriv $SET_UID mkdir dir-root2/dir
setpriv $SET_UID touch dir-root2/dir/file
setfacl -m u:22222:rwx dir-root2
getfacl dir-root2
setpriv $SET_UID mkdir dir-root2/dir
setpriv $SET_UID touch dir-root2/dir/file
L dir-root2/dir
getfacl dir-root2/dir
L dir-root2/dir/file
getfacl dir-root2/dir/file
echo "== cleanup"
t_pass

View File

@@ -3,13 +3,13 @@
# operations in one mount and verify the results in another.
#
t_require_commands getfattr setfattr dd filefrag diff touch stat scoutfs
t_require_commands getfattr setfattr dd diff touch stat scoutfs
t_require_mounts 2
GETFATTR="getfattr --absolute-names"
SETFATTR="setfattr"
DD="dd status=none"
FILEFRAG="filefrag -v -b4096"
FIEMAP="scoutfs get-fiemap"
echo "== root inode updates flow back and forth"
sleep 1
@@ -55,8 +55,8 @@ for i in $(seq 1 10); do
conv=notrunc oflag=append &
wait
done
$FILEFRAG "$T_D0/file" | t_filter_fs > "$T_TMP.0"
$FILEFRAG "$T_D1/file" | t_filter_fs > "$T_TMP.1"
$FIEMAP "$T_D0/file" > "$T_TMP.0"
$FIEMAP "$T_D1/file" > "$T_TMP.1"
diff -u "$T_TMP.0" "$T_TMP.1"
echo "== unlinked file isn't found"
@@ -210,4 +210,7 @@ done
wait
ls "$T_D0/concurrent"
echo "== cleanup"
rm -f "$T_TMP.0" "$T_TMP.1"
t_pass

View File

@@ -73,4 +73,7 @@ test "$large_tot" -gt "$equal_tot" ; echo "resized larger test rc: $?"
umount "$SCR"
losetup -d "$scr_loop"
echo "== cleanup"
rm -f "$T_TMP.small" "$T_TMP.equal" "$T_TMP.large"
t_pass

View File

@@ -28,7 +28,7 @@ while [ "$SECONDS" -lt "$END" ]; do
for i in $(t_fs_nrs); do
if [ "$i" -ge "$quorum_nr" ]; then
t_umount $i &
echo "umount $i pid $pid quo $quorum_nr" \
echo "umount $i rid $rid quo $quorum_nr" \
>> $T_TMP.log
mounted[$i]=0
fi
@@ -53,6 +53,9 @@ while [ "$SECONDS" -lt "$END" ]; do
for i in "${lock_arr[@]}"; do
if [[ ! " ${rid_arr[*]} " =~ " $i " ]]; then
echo -e "RID($i) exists" >> $T_TMP.log
echo -e "rid_arr:\n${rid_arr[@]}" >> $T_TMP.log
echo -e "lock_arr:\n${lock_arr[@]}" >> $T_TMP.log
t_fail "RID($i): exists when not mounted"
fi
done

View File

@@ -2,34 +2,40 @@
# Test clustered parallel createmany
#
t_require_commands mkdir createmany
t_require_commands mkdir createmany bc
t_require_mounts 2
COUNT=50000
# Prep dirs for test. Each mount needs to make their own parent dir for
# the createmany run, otherwise both dirs will end up in the same inode
# group, causing updates to bounce that lock around.
#
# Prep dirs for test. We have per-directory inode number allocators so
# by putting each createmany in a per-mount dir they get their own inode
# number region and cluster locks.
#
echo "== measure initial createmany"
mkdir -p $T_D0/dir/0
mkdir $T_D1/dir/1
echo "== measure initial createmany"
START=$SECONDS
START=$(date +%s.%N)
createmany -o "$T_D0/file_" $COUNT >> $T_TMP.full
SINGLE=$((SECONDS - START))
echo single $SINGLE >> $T_TMP.full
sync
END=$(date +%s.%N)
SINGLE=$(echo "$END - $START" | bc)
echo "== measure two concurrent createmany runs"
START=$SECONDS
createmany -o $T_D0/dir/0/file $COUNT > /dev/null &
START=$(date +%s.%N)
(cd $T_D0/dir/0; createmany -o ./file_ $COUNT > /dev/null) &
pids="$!"
createmany -o $T_D1/dir/1/file $COUNT > /dev/null &
(cd $T_D1/dir/1; createmany -o ./file_ $COUNT > /dev/null) &
pids="$pids $!"
for p in $pids; do
wait $p
done
BOTH=$((SECONDS - START))
sync
END=$(date +%s.%N)
BOTH=$(echo "$END - $START" | bc)
echo both $BOTH >> $T_TMP.full
# Multi node still adds significant overhead, even with our CW locks
@@ -40,8 +46,11 @@ echo both $BOTH >> $T_TMP.full
# exceed this factor should the CW locked items go back to fully
# synchronized operation.
FACTOR=200
if [ "$BOTH" -gt $(($SINGLE*$FACTOR)) ]; then
echo "both createmany took $BOTH sec, more than $FACTOR x single $SINGLE sec"
if [ $(echo "$BOTH > ( $SINGLE * $FACTOR )" | bc) == "1" ]; then
t_fail "both createmany took $BOTH sec, more than $FACTOR x single $SINGLE sec"
fi
echo "== cleanup"
find $T_D0/dir -delete
t_pass

View File

@@ -4,7 +4,16 @@
# merge adjacent consecutive allocations. (we don't have multiple
# allocation cursors)
#
t_require_commands scoutfs stat filefrag dd touch truncate
t_require_commands scoutfs stat dd touch truncate
get_fiemap()
{
scoutfs get-fiemap "$1" | awk '($1 != "extents:") {
unwritten = (substr($8, 2, 1) == "U") ? "unwritten" : "";
eof = (substr($8, 3, 1) == "L") ? "eof" : "";
print $3 ".. " $6 ": " unwritten eof;
};'
}
write_block()
{
@@ -76,26 +85,9 @@ print_extents_found()
{
local prefix="$1"
filefrag "$prefix"* 2>&1 | grep "extent.*found" | t_filter_fs
}
#
# print the logical start, len, and flags if they're there.
#
print_logical_extents()
{
local file="$1"
filefrag -v -b4096 "$file" 2>&1 | t_filter_fs | awk '
($1 ~ /[0-9]+:/) {
if ($NF !~ /[0-9]+:/) {
flags=$NF
} else {
flags=""
}
print $2, $6, flags
}
' | sed 's/last,eof/eof/'
for f in "$prefix"-*; do
echo "$f: $(scoutfs get-fiemap "$f" | tail -n 1)" | t_filter_fs
done
}
t_save_all_sysfs_mount_options data_prealloc_blocks
@@ -197,7 +189,7 @@ for sides in 0 1 2 3; do
done
echo before:
print_logical_extents "$prefix"
get_fiemap "$prefix"
# now write into the first, middle, and last empty block of each
t_set_sysfs_mount_option 0 data_prealloc_contig_only 0
@@ -223,7 +215,7 @@ for sides in 0 1 2 3; do
# mid (both has 6 blocks internally)
2) write_block $prefix $((left + 3)) ;;
esac
print_logical_extents "$prefix"
get_fiemap "$prefix"
((base+=8))
done
done

View File

@@ -0,0 +1,184 @@
#
# Test our basic ability to work with different format versions.
#
# The current code being tested has a range of supported format
# versions. For each of the older supported format versions we have a
# git hash of the commit before the next greater version was introduced.
# We build versions of the scoutfs utility and kernel module for the
# last commit in tree that had a lesser supported version as its max
# supported version. We use those binaries to test forward and back
# compat as new and old code works with a persistent volume with a given
# format version.
#
# not supported on el9!
if [ $(source /etc/os-release ; echo ${VERSION_ID:0:1}) -gt 8 ]; then
t_skip_permitted "Unsupported OS version"
fi
mount_has_format_version()
{
local mnt="$1"
local vers="$2"
local sysfs_fmt_vers="$(t_sysfs_path_from_mnt $SCR)/format_version"
test "$(cat $sysfs_fmt_vers)" == "$vers"
}
SCR="/mnt/scoutfs.scratch"
MIN=$(modinfo $T_MODULE | awk '($1 == "scoutfs_format_version_min:"){print $2}')
MAX=$(modinfo $T_MODULE | awk '($1 == "scoutfs_format_version_max:"){print $2}')
echo "min: $MIN max: $MAX" > "$T_TMP.log"
test "$MIN" -gt 0 -a "$MAX" -gt 0 -a "$MIN" -le "$MAX" || \
t_fail "parsed bad versions, min: $MIN max: $MAX"
test "$MIN" == "$MAX" && \
t_skip "only one supported format version: $MIN"
# prepare dir and wipe any weird old partial state
builds="$T_RESULTS/format_version_builds"
mkdir -p "$builds"
echo "== ensuring utils and module for old versions"
declare -A commits
commits[1]=c3c4b080
for vers in $(seq $MIN $((MAX - 1))); do
dir="$builds/$vers"
platform=$(uname -rp)
buildmark="$dir/buildmark"
commit="${commits[$vers]}"
test -n "$commit" || \
t_fail "no commit for vers $vers"
# have our files for this version
test "$(cat $buildmark 2>&1)" == "$platform" && \
continue
# build as one big sequence of commands that can return failure
(
set -o pipefail
rm -rf $dir &&
mkdir -p $dir/building &&
cd "$T_TESTS/.." &&
git archive --format=tar "$commit" | tar -C "$dir/building" -xf - &&
cd - &&
find $dir &&
make -C "$dir/building" &&
mv $dir/building/utils/src/scoutfs $dir &&
mv $dir/building/kmod/src/scoutfs.ko $dir &&
rm -rf $dir/building &&
echo "$platform" > $buildmark &&
find $dir &&
cat $buildmark
) >> "$T_TMP.log" 2>&1 || t_fail "version $vers build failed"
done
echo "== unmounting test fs and removing test module"
t_quiet t_umount_all
t_quiet rmmod scoutfs
echo "== testing combinations of old and new format versions"
mkdir -p "$SCR"
for vers in $(seq $MIN $((MAX - 1))); do
old_scoutfs="$builds/$vers/scoutfs"
old_module="$builds/$vers/scoutfs.ko"
echo "mkfs $vers" >> "$T_TMP.log"
t_quiet $old_scoutfs mkfs -f -Q 0,127.0.0.1,53000 "$T_EX_META_DEV" "$T_EX_DATA_DEV" \
|| t_fail "mkfs $vers failed"
echo "mount $vers with $vers" >> "$T_TMP.log"
t_quiet insmod $old_module
t_quiet mount -t scoutfs -o metadev_path=$T_EX_META_DEV,quorum_slot_nr=0 \
"$T_EX_DATA_DEV" "$SCR"
t_quiet mount_has_format_version "$SCR" "$vers"
echo "creating files in $vers" >> "$T_TMP.log"
t_quiet touch "$SCR/file-"{1,2,3}
stat "$SCR"/file-* > "$T_TMP.stat" || \
t_fail "stat in $vers failed"
echo "remounting $vers fs with $MAX" >> "$T_TMP.log"
t_quiet umount "$SCR"
rmmod scoutfs
insmod "$T_MODULE"
t_quiet mount -t scoutfs -o metadev_path=$T_EX_META_DEV,quorum_slot_nr=0 \
"$T_EX_DATA_DEV" "$SCR"
t_quiet mount_has_format_version "$SCR" "$vers"
echo "verifying stat in $vers with $MAX" >> "$T_TMP.log"
diff -u "$T_TMP.stat" <(stat "$SCR"/file-*)
echo "keep/update/del existing, create new in $vers" >> "$T_TMP.log"
t_quiet touch "$SCR/file-2"
t_quiet rm -f "$SCR/file-3"
t_quiet touch "$SCR/file-4"
stat "$SCR"/file-* > "$T_TMP.stat" || \
t_fail "stat in $vers failed"
echo "remounting $vers fs with $vers" >> "$T_TMP.log"
t_quiet umount "$SCR"
rmmod scoutfs
insmod "$old_module"
t_quiet mount -t scoutfs -o metadev_path=$T_EX_META_DEV,quorum_slot_nr=0 \
"$T_EX_DATA_DEV" "$SCR"
t_quiet mount_has_format_version "$SCR" "$vers"
echo "verifying stat in $vers with $vers" >> "$T_TMP.log"
diff -u "$T_TMP.stat" <(stat "$SCR"/file-*)
echo "changing format vers to $MAX" >> "$T_TMP.log"
t_quiet umount "$SCR"
rmmod scoutfs
t_quiet scoutfs change-format-version -F -V $MAX $T_EX_META_DEV "$T_EX_DATA_DEV"
echo "mount fs $MAX with old $vers should fail" >> "$T_TMP.log"
insmod "$old_module"
mount -t scoutfs -o metadev_path=$T_EX_META_DEV,quorum_slot_nr=0 \
"$T_EX_DATA_DEV" "$SCR" >> "$T_TMP.log" 2>&1
if [ "$?" == "0" ]; then
umount "$SCR"
t_fail "old code ver $vers able to mount new ver $MAX"
fi
echo "remounting $MAX fs with $MAX" >> "$T_TMP.log"
rmmod scoutfs
insmod "$T_MODULE"
t_quiet mount -t scoutfs -o metadev_path=$T_EX_META_DEV,quorum_slot_nr=0 \
"$T_EX_DATA_DEV" "$SCR"
t_quiet mount_has_format_version "$SCR" "$MAX"
echo "verifying stat in $MAX with $MAX" >> "$T_TMP.log"
diff -u "$T_TMP.stat" <(stat "$SCR"/file-*)
echo "keep/update/del existing, create new in $MAX" >> "$T_TMP.log"
t_quiet touch "$SCR/file-2"
t_quiet rm -f "$SCR/file-4"
t_quiet touch "$SCR/file-5"
stat "$SCR"/file-* > "$T_TMP.stat" || \
t_fail "stat in $MAX failed"
echo "remounting $MAX fs with $MAX again" >> "$T_TMP.log"
t_quiet umount "$SCR"
t_quiet mount -t scoutfs -o metadev_path=$T_EX_META_DEV,quorum_slot_nr=0 \
"$T_EX_DATA_DEV" "$SCR"
t_quiet mount_has_format_version "$SCR" "$MAX"
echo "verifying stat in $MAX with $MAX again" >> "$T_TMP.log"
diff -u "$T_TMP.stat" <(stat "$SCR"/file-*)
echo "done with old vers $vers" >> "$T_TMP.log"
t_quiet umount "$SCR"
rmmod scoutfs
done
echo "== restoring test module and mount"
insmod "$T_MODULE"
t_mount_all
t_pass

View File

@@ -10,6 +10,30 @@ EXTENTS_PER_BTREE_BLOCK=600
EXTENTS_PER_LIST_BLOCK=8192
FREED_EXTENTS=$((EXTENTS_PER_BTREE_BLOCK * EXTENTS_PER_LIST_BLOCK))
#
# This test specifically creates a pathologically sparse file that will
# be as expensive as possible to free. This is usually fine on
# dedicated or reasonable hardware, but trying to run this in
# virtualized debug kernels can take a very long time. This test is
# about making sure that the server doesn't fail, not that the platform
# can handle the scale of work that our btree formats happen to require
# while execution is bogged down with use-after-free memory reference
# tracking. So we give the test a lot more breathing room before
# deciding that its hung.
#
echo "== setting longer hung task timeout"
if [ -w /proc/sys/kernel/hung_task_timeout_secs ]; then
secs=$(cat /proc/sys/kernel/hung_task_timeout_secs)
test "$secs" -gt 0 || \
t_fail "confusing value '$secs' from /proc/sys/kernel/hung_task_timeout_secs"
restore_hung_task_timeout()
{
echo "$secs" > /proc/sys/kernel/hung_task_timeout_secs
}
trap restore_hung_task_timeout EXIT
echo "$((secs * 5))" > /proc/sys/kernel/hung_task_timeout_secs
fi
echo "== creating fragmented extents"
fragmented_data_extents $FREED_EXTENTS $EXTENTS_PER_BTREE_BLOCK "$T_D0/alloc" "$T_D0/move"

Some files were not shown because too many files have changed in this diff Show More