Commit Graph

1262 Commits

Author SHA1 Message Date
Zach Brown
f81e4cb98a Add whitespace to xfstests output message
The message indicating that xfstests output was now being shown was
mashed up against the previous passed stats and it was gross and I hated
it.

Signed-off-by: Zach Brown <zab@versity.com>
2021-01-26 14:46:07 -08:00
Zach Brown
1fc706bf3f Filter hrtimer slow messages from dmesg
When running in debug kernels in guests we can really bog down things
enough to trigger hrtimer warnings.  I don't think there's much we can
reasonably do about that.

Signed-off-by: Zach Brown <zab@versity.com>
2021-01-26 14:46:07 -08:00
Zach Brown
e9c3aa6501 More carefully cancel server farewell work
Farewell work is queued by farewell message processing.  Server shutdown
didn't properly wait for pending farewell work to finish before tearing
down.  As the server work destroyed the server's connection the farewell
work could still be running and try to send responses down the socket.

We make the server more carefully avoid queueing farewell work if it's
in the process of shutting down and wait for farewell work to finish
before destroying the server's resources.

This fixed all manner of crashes that were seen in testing when a bunch
of nodes unmounted, creating farewell work on the server as it itself
unmounted and destroyed the server.

Signed-off-by: Zach Brown <zab@versity.com>
2021-01-26 14:46:07 -08:00
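The ordering described in "More carefully cancel server farewell work" can be sketched as a toy model. This is only an illustration in Python, not the kernel code: the `Server` class, its fields, and the thread pool standing in for the work queue are all hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class Server:
    """Toy model of a server that queues farewell work (hypothetical names)."""

    def __init__(self):
        self.lock = threading.Lock()
        self.shutting_down = False
        self.connected = True      # stands in for the server's connection
        self.responses = []        # responses "sent down the socket"
        self.pool = ThreadPoolExecutor(max_workers=1)

    def _farewell_work(self, client):
        # Farewell work needs the connection to still exist.
        if self.connected:
            self.responses.append(client)

    def queue_farewell(self, client):
        """Queue farewell work unless shutdown has begun; True if queued."""
        with self.lock:
            if self.shutting_down:
                return False
            self.pool.submit(self._farewell_work, client)
            return True

    def shutdown(self):
        with self.lock:
            self.shutting_down = True   # no new farewell work after this point
        self.pool.shutdown(wait=True)   # wait for pending farewell work
        self.connected = False          # only now tear down resources
```

The point of the sketch is the order in `shutdown()`: stop new work, wait for queued work, then destroy what the work depends on.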
Zach Brown
d39268bbc1 Fix spurious EIO from scoutfs_srch_get_compact
scoutfs_srch_get_compact() is building up a compaction request which has
a list of srch files to read and sort and write into a new srch file.
It finds input files by searching for a sufficient number of similar
files: first any unsorted log files and then sorted log files that are
around the same size.

It finds the files by using btree next on the srch zone which has types
for unsorted srch log files, sorted srch files, but also pending and
busy compaction items.

It was being far too cute about iterating over different key types.  It
was trying to adapt to finding the next key and was making assumptions
about the order of key types.  It didn't notice that the pending and
busy key types followed log and sorted and would generate EIO when it
ran into them and found their value length didn't match what it was
expecting.

Rework the next item ref parsing so that it returns -ENOENT if it gets
an unexpected key type, then look for the next key type when checking
enoent.

Signed-off-by: Zach Brown <zab@versity.com>
2021-01-26 14:46:07 -08:00
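A rough sketch of the iteration pattern described in "Fix spurious EIO from scoutfs_srch_get_compact", written in Python for illustration. The key type constants, the `(type, id)` key layout, and the helper names are assumptions; `None` plays the role of -ENOENT.

```python
import bisect

# Hypothetical key types in sort order; the real srch zone types differ.
LOG, SORTED, PENDING, BUSY = 0, 1, 2, 3


def next_item(items, key, want_type):
    """Return the first item at or after `key`, or None if its type is unexpected.

    `items` is a sorted list of ((type, id), value) tuples.  Returning None
    means "no more items of this type" rather than an I/O error.
    """
    i = bisect.bisect_left(items, (key,))
    if i == len(items) or items[i][0][0] != want_type:
        return None
    return items[i]


def collect_inputs(items, types=(LOG, SORTED)):
    """Walk only the wanted key types, moving to the next type on None."""
    found = []
    for t in types:
        key = (t, 0)
        while (item := next_item(items, key, t)) is not None:
            found.append(item)
            key = (t, item[0][1] + 1)
    return found
```

With this shape, pending and busy items that sort after the log and sorted types are never parsed as if they were input files.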
Zach Brown
35ed1a2438 Add t_require_meta_size function
Add a function that tests can use to skip when the metadata device isn't
large enough.  I thought we needed to avoid enospc in a particular test,
but it turns out the test's failure was unrelated.  So this isn't used
for now but it seems nice to keep around.

Signed-off-by: Zach Brown <zab@versity.com>
2021-01-26 14:46:07 -08:00
Zach Brown
32e7978a6e Extend lock invalidate grace period
The grace period is intended to let lock holders squeeze in more bulk
work before another node pulls the lock out from under them.  The length
of the delay is a balance between getting more work done per lock hold
and adding latency to ping-ponging workloads.

The current grace period was too short.  To do work in the conflicting
case you often have to read the result that the other mount wrote as you
invalidated their lock.  The test was written in the LSM world where
we'd effectively read a single level 0 1MB segment.  In the btree world
we're checking bloom blocks and reading the other mount's btree.  It has
more dependent read latency.

So we turn up the grace period to let conflicting readers squeeze in
more work before pulling the lock out from under them.  This value was
chosen to make lock-conflicting-batch-commit pass in guests sharing nvme
metadata devices in debugging kernels.

Signed-off-by: Zach Brown <zab@versity.com>
2021-01-26 14:46:07 -08:00
Zach Brown
8123b8fc35 fix lock-conflicting-batch-commit conf output
The test had a silly typo in the label it put on the time it took mounts
to perform conflicting metadata changes.

Signed-off-by: Zach Brown <zab@versity.com>
2021-01-26 14:46:07 -08:00
Zach Brown
da5911c311 Use d_materialise_unique to splice dir dentries
When we're splicing in dentries in lookup we can be splicing the result
of changes on other nodes into a stale dcache.  The stale dcache might
contain dir entries and the dcache does not allow aliased directories.

Use d_materialise_unique() to splice in dir inodes so that we remove all
aliased dentries which must be stale.

We can still use d_splice_alias() for all other inode types.  Any
existing stale dentries will fail revalidation before they're used.

Signed-off-by: Zach Brown <zab@versity.com>
2021-01-26 14:46:07 -08:00
Zach Brown
098fc420be Add some item cache page tracing
Signed-off-by: Zach Brown <zab@versity.com>
2021-01-26 14:46:07 -08:00
Zach Brown
7a96537210 Leave mounts mounted if run-tests fails
We can lose interesting state if the mounts are unmounted as tests fail.
Only unmount if all the tests pass.

Signed-off-by: Zach Brown <zab@versity.com>
2021-01-26 14:46:07 -08:00
Zach Brown
0607dfdac8 Enable and collect trace_printk
Weirdly, run-tests was treating trace_printk not as an option to enable
trace_printk() traces but as an option to print trace events to the
console with printk?  That's not a thing.

Make -P really enable trace_printk tracing and collect it as it would
for enabled trace events.  It needs to be treated separately from the -t
options that enable trace events.

While we're at it treat the -P trace dumping option as a stand-alone
option that works without -t arguments.

Signed-off-by: Zach Brown <zab@versity.com>
2021-01-26 14:46:07 -08:00
Zach Brown
0354bb64c5 More carefully enable tracing in run-tests
run-tests.sh has a -t argument which takes a whitespace separated string
of globs of events to enable.  This was hard to use and made it very
easy to accidentally expand the globs at the wrong place in the script.

This makes each -t argument specify a single word glob which is stored
in an array so the glob isn't expanded until it's applied to the trace
event path.   We also add an error for -t globs that didn't match any
events and add a message with the count of -t arguments and enabled
events.

Signed-off-by: Zach Brown <zab@versity.com>
2021-01-26 14:46:07 -08:00
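The approach in "More carefully enable tracing in run-tests" (one glob per -t argument, kept unexpanded in an array until it is matched against trace events) might look roughly like this. The real change is in a shell script; this Python sketch uses `fnmatch` as a stand-in for shell globbing, and the function name and event names are made up for the example.

```python
import fnmatch


def enable_events(globs, available):
    """Match each stored glob against available event names.

    `globs` holds one pattern per -t argument.  Raises ValueError for a
    glob that matches nothing; returns the enabled events in match order.
    """
    enabled = []
    for g in globs:
        matched = fnmatch.filter(available, g)
        if not matched:
            raise ValueError(f"-t glob '{g}' matched no trace events")
        enabled.extend(m for m in matched if m not in enabled)
    return enabled
```

A caller could then report `len(globs)` and `len(enabled)` to mirror the count message the commit describes.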
Zach Brown
631801c45c Don't queue lock invalidation work during shutdown
The lock invalidation work function needs to be careful not to requeue
itself while we're shutting down or we can be left with invalidation
functions racing with shutdown.  Invalidation calls igrab so we can end
up with unmount warning that there are still inodes in use.

Signed-off-by: Zach Brown <zab@versity.com>
2021-01-26 14:46:07 -08:00
Zach Brown
47a1ac92f7 Update ino-path args in basic-posix-consistency
The ino-path calls in basic-posix-consistency weren't updated for the
recent change to scoutfs cli args.

Signed-off-by: Zach Brown <zab@versity.com>
2021-01-26 14:45:23 -08:00
Zach Brown
004f693af3 Add golden output for mount-unmount-race test
Signed-off-by: Zach Brown <zab@versity.com>
2021-01-25 14:19:35 -08:00
Andy Grover
f271a5d140 Merge pull request #12 from versity/zab/andys_fallocate_fix_minor_cleanup
Retry if transaction cannot alloc for fallocate or write
2021-01-25 12:52:14 -08:00
Andy Grover
355eac79d2 Retry if transaction cannot alloc for fallocate or write
Add a new distinguishable return value (ENOBUFS) from the allocator for
when the transaction cannot alloc space. This doesn't mean the filesystem is
full -- opening a new transaction may result in forward progress.

Alter fallocate and get_blocks code to check for this err val and retry
with a new transaction. Handling actual ENOSPC can still happen, of
course.

Add counter called "alloc_trans_retry" and increment it from both spots.

Signed-off-by: Andy Grover <agrover@versity.com>
[zab@versity.com: fixed up write_begin error paths]
2021-01-25 09:32:01 -08:00
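The retry logic from "Retry if transaction cannot alloc for fallocate or write" can be illustrated with a small Python model. The `Transaction` class, its block budget, and `alloc_with_retry` are hypothetical; only the ENOBUFS versus ENOSPC distinction and the `alloc_trans_retry` counter name come from the commit message.

```python
import errno


class Transaction:
    """Toy transaction with a fixed allocation budget (hypothetical)."""

    def __init__(self, budget):
        self.budget = budget

    def alloc(self, blocks, fs_free):
        if blocks > fs_free:
            # The filesystem really is full.
            raise OSError(errno.ENOSPC, "no space left")
        if blocks > self.budget:
            # Only this transaction is out of room.
            raise OSError(errno.ENOBUFS, "transaction cannot alloc")
        self.budget -= blocks


def alloc_with_retry(blocks, fs_free, open_trans, counters):
    """Retry in a new transaction on ENOBUFS; pass other errors through.

    Assumes a fresh transaction can eventually satisfy the request.
    """
    trans = open_trans()
    while True:
        try:
            trans.alloc(blocks, fs_free)
            return trans
        except OSError as e:
            if e.errno != errno.ENOBUFS:
                raise
            counters["alloc_trans_retry"] += 1
            trans = open_trans()
```

ENOSPC still propagates to the caller unchanged, which matches the note that actual ENOSPC can still happen.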
Zach Brown
d8b4e94854 Merge pull request #10 from agrover/rm-item-accounting
Remove item accounting
2021-01-21 09:57:53 -08:00
Andy Grover
bed33c7ffd Remove item accounting
Remove kmod/src/count.h
Remove scoutfs_trans_track_item()
Remove reserved/actual fields from scoutfs_reservation

Signed-off-by: Andy Grover <agrover@versity.com>
2021-01-20 17:01:08 -08:00
Andy Grover
b370730029 Merge pull request #11 from versity/zab/item_cache_memory_corruption
Fix item cache page memory corruption
2021-01-20 10:27:20 -08:00
Zach Brown
d64dd89ead Fix item cache page memory corruption
The item cache page life cycle is tricky.  There are no proper page
reference counts, everything is done by nesting the page rwlock inside
item_cache_info rwlock.  The intent is that you can only reference pages
while you hold the rwlocks appropriately.  The per-cpu page references
are outside that locking regime so they add a reference count.  Now
there are reference counts for the main cache index reference and for
each per-cpu reference.

The end result of all this is that you can only reference pages outside
of locks if you're protected by references.

Lock invalidation messed this up by trying to add its right split page
to the lru after it was unlocked.  Its page reference wasn't protected
at this point.  Shrinking could be freeing that page, and so it could be
putting a freed page's memory back on the lru.

Shrinking had a little bug that it was using list_move to move an
initialized lru_head list_head.  It turns out to be harmless (list_del
will just follow pointers to itself and set itself as next and prev all
over again), but boy does it catch one's eye.  Let's remove all
confusion and drop the reference while holding the cinf->rwlock instead
of trying to optimize freeing outside locks.

Finally, the big one: inserting a read item after compacting the page to
make room was inserting using stale parent pointers into the old
pre-compacted page, rather than the new page that was swapped in by
compaction.  This left references to a freed page in the page rbtree and
hilarity ensued.

Signed-off-by: Zach Brown <zab@versity.com>
2021-01-20 09:02:29 -08:00
Zach Brown
8d81196e01 Merge pull request #7 from agrover/versioning
Filesystem version instead of format hash check
2021-01-19 11:55:32 -08:00
Andy Grover
d731c1577e Filesystem version instead of format hash check
Instead of hashing headers, define an interop version. Do not mount
superblocks that have a different version, either higher or lower.

Since this is pretty much the same as the format hash except it's a
constant, minimal code changes are needed.

Initial dev version is 0, with the intent that version will be bumped to
1 immediately prior to tagging initial release version.

Update README. Fix comments.

Add interop version to notes and modinfo.

Signed-off-by: Andy Grover <agrover@versity.com>
2021-01-15 10:53:00 -08:00
Andy Grover
a421bb0884 Merge pull request #5 from versity/zab/move_blocks_ioctl
Zab/move blocks ioctl
2021-01-14 16:18:45 -08:00
Zach Brown
773eb129ed Add move-blocks test
Add a basic test of the move_blocks ioctl.

Signed-off-by: Zach Brown <zab@versity.com>
2021-01-14 13:42:22 -08:00
Zach Brown
eb3981c103 Add move-blocks scoutfs cli command
Add a move-blocks command that translates arguments and calls the
MOVE_BLOCKS ioctl.

Signed-off-by: Zach Brown <zab@versity.com>
2021-01-14 13:42:22 -08:00
Zach Brown
3139d3ea68 Add move_blocks ioctl
Add a relatively constrained ioctl that moves extents between regular
files.  This is intended to be used by tasks which combine many existing
files into a much larger file without reading and writing all the file
contents.

Signed-off-by: Zach Brown <zab@versity.com>
2021-01-14 13:42:22 -08:00
Zach Brown
4da3d47601 Move ALLOC_DETAIL ioctl definition
By convention we have the _IO* ioctl definition after the argument
structs and ALLOC_DETAIL got it a bit wrong so move it down.

Signed-off-by: Zach Brown <zab@versity.com>
2021-01-14 13:42:22 -08:00
Zach Brown
aa1b1fa34f Add util.h for kernel helpers
Add a little header for inline convenience functions.

Signed-off-by: Zach Brown <zab@versity.com>
2021-01-14 13:42:22 -08:00
Zach Brown
8fcc9095e6 Merge pull request #6 from agrover/super
Fix mkfs check for existing ScoutFS superblock
2021-01-14 08:57:53 -08:00
Andy Grover
299062a456 Fix mkfs check for existing ScoutFS superblock
We were checking for the wrong magic value.

We now need to use -f when running mkfs in run-tests for things to work.

Signed-off-by: Andy Grover <agrover@versity.com>
2021-01-13 16:32:41 -08:00
Andy Grover
7cac1e7136 Merge pull request #1 from agrover/use-argp
Rework scoutfs command-line parsing
2021-01-13 11:14:08 -08:00
Andy Grover
454dbebf59 Categorize not enough mounts as skip, not fail
Signed-off-by: Andy Grover <agrover@versity.com>
2021-01-12 16:29:42 -08:00
Andy Grover
2c5871c253 Change release ioctl to be denominated in bytes not blocks
This more closely matches stage ioctl and other conventions.

Also change release code to use offset/length nomenclature for consistency.

Signed-off-by: Andy Grover <agrover@versity.com>
2021-01-12 16:29:42 -08:00
Andy Grover
64a698aa93 Make changes to tests for new scoutfs cmdline syntax
Some different error messages require changes to golden/*

Signed-off-by: Andy Grover <agrover@versity.com>
2021-01-12 16:29:42 -08:00
Andy Grover
d48b447e75 Do not set -Wpadded except for checking kmod-shared headers
Remove now-unneeded manual padding in arg structs.

Signed-off-by: Andy Grover <agrover@versity.com>
2021-01-12 16:29:42 -08:00
Andy Grover
5241bba7f6 Update scoutfs.8 man page
Update for cli args and options changes. Reorder subcommands to match
scoutfs built-in help.

Consistent ScoutFS capitalization.

Tighten up some descriptions and verbiage for consistency and omit
descriptions of internals in a few spots.

Add SEE ALSO for blockdev(8) and wipefs(8).

Signed-off-by: Andy Grover <agrover@versity.com>
2021-01-12 16:29:42 -08:00
Andy Grover
e0a2175c2e Use argp info instead of duplicating for cmd_register()
Make it static and then use it both for argp_parse as well as
cmd_register_argp.

Split commands into five groups, to help understanding of their
usefulness.

Mention that each command has its own help text, and that we are being
fancy to keep the user from having to give fs path.

Signed-off-by: Andy Grover <agrover@versity.com>
2021-01-12 16:29:42 -08:00
Andy Grover
f2cd1003f6 Implement argp support for walk-inodes
This has some fancy parsing going on, and I decided to just leave it
in the main function instead of going to the effort to move it all
to the parsing function.

Signed-off-by: Andy Grover <agrover@versity.com>
2021-01-12 16:29:42 -08:00
Andy Grover
97c6cc559e Implement argp support for data-waiting and data-wait-err
These both have a lot of required options.

Signed-off-by: Andy Grover <agrover@versity.com>
2021-01-12 16:29:42 -08:00
Andy Grover
7c54c86c38 Implement argp support for setattr
Signed-off-by: Andy Grover <agrover@versity.com>
2021-01-12 16:29:42 -08:00
Andy Grover
e1ba508301 Implement argp support for counters
Signed-off-by: Andy Grover <agrover@versity.com>
2021-01-12 16:29:42 -08:00
Andy Grover
f35154eb19 counters: Ensure name_wid[0] is initialized to zero
I was seeing some segfaults and other weirdness without this.

Signed-off-by: Andy Grover <agrover@versity.com>
2021-01-12 16:29:42 -08:00
Andy Grover
7befc61482 Implement argp support for mkfs and add --force
Support max-meta-size and max-data-size using KMGTP units with rounding.

Detect other fs signatures using blkid library.

Detect ScoutFS super using magic value.

Move read_block() from print.c into util.c since blkid also needs it.

Signed-off-by: Andy Grover <agrover@versity.com>
2021-01-12 16:29:42 -08:00
Andy Grover
1383ca1a8d Merge pull request #3 from versity/zab/multithread_write_extra_commits
Consistently sample data alloc total_len
2021-01-12 11:51:15 -08:00
Andy Grover
6b5ddf2b3a Implement argp support for print
Print warning if printing a data dev, you probably wanted the meta dev.

Change read_block to return err value. Otherwise there are confusing
ENOMEM messages when pread() fails. e.g. try to print /dev/null.

Signed-off-by: Andy Grover <agrover@versity.com>
2021-01-12 10:47:47 -08:00
Andy Grover
d025122fdd Implement argp support for listxattr-hidden
Rename to list-hidden-xattrs.

Signed-off-by: Andy Grover <agrover@versity.com>
2021-01-12 10:47:47 -08:00
Andy Grover
706fe9a30e Implement argp support for search-xattrs
Get fs path via normal methods, and make xattr an argument not an option.

Signed-off-by: Andy Grover <agrover@versity.com>
2021-01-12 10:47:47 -08:00
Andy Grover
0f17ecb9e3 Implement argp support for stage/release
Make offset and length optional. Allow size units (KMGTP) to be used
  for offset/length.

release: Since off/len no longer given in 4k blocks, round offset and
  length to 4KiB, down and up respectively. Emit a message if rounding
  occurs.

Make version a required option.

stage: change ordering to src (the archive file) then the dest (the
  staged file).

Signed-off-by: Andy Grover <agrover@versity.com>
2021-01-12 10:47:47 -08:00
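The size-unit parsing and 4KiB rounding mentioned in "Implement argp support for stage/release" could be sketched as below. This is a Python illustration, not the scoutfs utility code. It assumes binary (power of 1024) units and that offset and length are each rounded independently; both function names are hypothetical.

```python
BLOCK = 4096
UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30, "T": 1 << 40, "P": 1 << 50}


def parse_size(text):
    """Parse '8K', '3M', or a bare byte count into bytes (binary units assumed)."""
    text = text.strip().upper()
    if text and text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)


def round_release_range(offset, length):
    """Round offset down and length up to 4KiB; report whether rounding occurred."""
    new_off = offset - offset % BLOCK
    new_len = -(-length // BLOCK) * BLOCK   # ceiling division
    return new_off, new_len, (new_off, new_len) != (offset, length)
```

The third return value gives the caller what it needs to emit a message when rounding occurs.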
Zach Brown
fc003a5038 Consistently sample data alloc total_len
With many concurrent writers we were seeing excessive commits forced
because it thought the data allocator was running low.  The transaction
was checking the raw total_len value in the data_avail alloc_root for
the number of free data blocks.  But this read wasn't locked, and
allocators could completely remove a large free extent and then
re-insert a slightly smaller free extent as they perform their
allocation.  The transaction could see a temporary very small total_len
and trigger a commit.

Data allocations are serialized by a heavy mutex so we don't want to
have the reader try and use that to see a consistent total_len.  Instead
we create a data allocator run-time struct that has a consistent
total_len that is updated after all the extent items are manipulated.
This also gives us a place to put the caller's cached extent so that it
can be included in the total_len; previously it wasn't included in the
free total that the transaction saw.

The file data allocator can then initialize and use this struct instead
of its raw use of the root and cached extent.  Then the transaction can
sample its consistent total_len that reflects the root and cached
extent.

A subtle detail is that fallocate can't use _free_data to return an
allocated extent on error to the avail pool.  It instead frees into the
data_free pool like normal frees.  It doesn't really matter that this
could prematurely drain the avail pool because it's in an error path.

Signed-off-by: Zach Brown <zab@versity.com>
2021-01-06 09:25:32 -08:00
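The idea in "Consistently sample data alloc total_len" is that the reader samples a total that is published only after all extent manipulation finishes, instead of reading a raw sum mid-update. A minimal Python model, with a hypothetical `DataAlloc` class and a list of extent lengths standing in for the extent items:

```python
import threading


class DataAlloc:
    """Toy run-time allocator state with a consistently published total_len."""

    def __init__(self, extents, cached=0):
        self.mutex = threading.Lock()   # serializes allocations
        self.extents = list(extents)    # free extent lengths
        self.cached = cached            # caller's cached extent length
        self.total_len = sum(self.extents) + cached

    def alloc(self, blocks):
        with self.mutex:
            big = max(self.extents, default=0)
            if blocks <= 0 or big < blocks:
                return False
            # Remove the large extent, then re-insert the smaller remainder.
            # The raw sum of extents is temporarily too small in between.
            self.extents.remove(big)
            if big > blocks:
                self.extents.append(big - blocks)
            # Publish the total only after all extents are manipulated.
            self.total_len = sum(self.extents) + self.cached
            return True

    def sample_total_len(self):
        """Read the published total without taking the allocation mutex."""
        return self.total_len
```

A transaction calling `sample_total_len()` never observes the window where the large extent has been removed but its remainder has not yet been re-inserted, and the cached extent is counted in the total.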