Commit Graph

334 Commits

Author SHA1 Message Date
Zach Brown
07210b5734 Reliably delete orphaned inodes
Orphaned items haven't been deleted for quite a while -- the call to the
orphan inode scanner has been commented out for ages.  The deletion of
the orphan item didn't take rid zone locking into account as we moved
deletion from being strictly local to being performed by whoever last
used the inode.

This reworks orphan item management and brings back orphan inode
scanning to correctly delete orphaned inodes.

We get rid of the rid zone that was always _WRITE locked by each mount.
That made it impossible for other mounts to get a _WRITE lock to delete
orphan items.  Instead we rename it to the orphan zone and have orphan
item callers get _WRITE_ONLY locks inside their inode locks.  Now all
nodes can create and delete orphan items as they have _WRITE locks on
the associated inodes.

Then we refresh the orphan inode scanning function.  It now runs
regularly in the background of all mounts.  It avoids creating cluster
lock contention by finding candidates with unlocked forest hint reads
and by testing inode caches locally and via the open map before properly
locking and trying to delete the inode's items.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-02 10:52:46 -07:00
Zach Brown
a1d46e1a92 Fix mkfs btree item offset calculation
mkfs was miscalculating the offset of the start of the free region in
the center of blocks as it populated blocks with items.  It was using
the length of the free region as its offset in the block.  To find
the offset of the end of the free region in the block it has to be
taken relative to the end of the item array.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
3488b4e6e0 Add scoutfs print support for log merge items
Add support for printing all the items in the log_merge tree that the
server uses to track log merging.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
c482204fcf Clean up btree root printing in superblock
Over time the printing of the btree roots embedded in the super block
has gotten a little out of hand.  Add a helper macro for the printf
format and args and re-order them to match their order in the
superblock.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
9711fef122 Update for core, trans, and item seq use
We now have a core seq number in the super that is advanced for multiple
users.    The client transaction seq comes from the core seq so we
remove the trans_seq from the super.  The item version is also converted
to use a seq that's derived from the core seq.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
38a4a56741 Stop writing to other quorum slot blocks
The core quorum work loop assumes that it has exclusive access to its
slot's quorum block.  It uniquely marks blocks it writes and verifies
the marks on read to discover if another mount has written to its slot
under the assumption that this must be a configuration error that put
two mounts in the same slot.

But the design of the leader bit in the block violates the invariant
that only a slot will write to its block.   As the server comes up and
fences previous leaders it writes to their block to clear their leader
bit.

The final hole in the design is that because we're fencing mounts, not
slots, each slot can have two mounts in play.  An active mount can be
using the slot and there can still be a persistent record of a previous
mount in the slot that crashed that needs to be fenced.

All this comes together to have the server fence an old mount in a slot
while a new mount is coming up.  The new mount sees the mark change and
freaks out and stops participating in quorum.

The fix is to rework the quorum blocks so that each slot only writes to
its own block.  Instead of the server writing to each fenced mount's
slot, it writes a fence event to its block once all previous mounts have
been fenced.  We add a bit of bookkeeping so that the server can
discover when all block leader fence operations have completed.  Each
event gets its own term so we can compare events to discover live
servers.

We get rid of the write marks and instead have an event that is written
as a quorum agent starts up and is then checked on every read to make
sure it still matches.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-31 13:10:45 -07:00
Zach Brown
76076011a2 Add scoutfs-fenced man page
Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:39 -07:00
Zach Brown
bdc0282fa7 Describe fencing in the scoutfs.5 man page
Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:39 -07:00
Zach Brown
1e460e5cb0 Add scoutfs-fenced and its run scripts to spec
Install the scoutfs-fenced daemon and its run scripts in the rpm spec
file.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:39 -07:00
Zach Brown
877e30d60f Add client address to mounted_client item
Add the peername of the client's connected socket to its mounted_client
item as it mounts.  If the client doesn't recover then fencing can use
the IP to find the host to fence.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:39 -07:00
Zach Brown
1f1f40f079 Add fence agent that processes fence requests
Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:28 -07:00
Zach Brown
d0b04e790c Add data-alloc-zone-blocks argument to mkfs
Add an argument to mkfs which sets the data_alloc_zone_blocks volume
option.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-21 15:31:02 -07:00
Zach Brown
54644a5074 Add data_alloc_zone_blocks volume option
Add the data_alloc_zone_blocks volume option.  This changes the
behaviour of the server to try and give mounts free data extents which
fall in exclusive fixed-size zones.

We add the field to the scoutfs_volume_options struct and add it to the
set_volopt server handler which enforces constrains on the size of the
zones.

We then add fields to the log_trees struct which records the size of the
zones and sets bits for the zones that contain free extents in the
data_avail allocator root.  The get_log_trees handler is changed to read
all the zone bitmaps from all the items, pass those bitmaps in to
_alloc_move to direct data allocations, and finally update the bitmaps
in the log_trees items to cover the newly allocated extents.  The
log_trees data_alloc_zone fields are cleared as the mount's logs are
reclaimed to indicate that the mount is no longer writing to the zone.

The policy mechanism of finding free extents based on the bitmaps is
ipmlemented down in _data_alloc_move().

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-21 15:31:02 -07:00
Zach Brown
9de3ae6dcb Index free extents by order of length
Allocators store free extents in two items, one sorted by their blkno
position and the other by their precise length.

The length index makes it easy to search for precise extent lengths, but
it makes it hard to search for a large extent within a given blkno
region.  Skipping in the blkno dimension has to be done for every
precise length value.

We don't need that level of precision.  If we index the extents by a
coarser order of the length then we have a fixed number of orders in
which we have to skip in the blkno dimension when searching within a
specific region.

This changes the length item to be stored at the log(8) order of the
length of the extents.  This groups extents into orders that are close
to the human-friendly base 10 orders of magnitude.

With this change the order field in the key no longer stores the precise
extent length.  To preserve the length of the extent we need to use
another field.  The only 64bit field remaining is the first which is a
higher comparision priority than the type.  So we use the highest
comparison priority zone field to differentiate the position and order
indexes and can now use all three 64bit fields in the key.

Finally, we have to be careful when constructing a key to use _next when
searching for a large extent.  Previously keys were relying on the magic
property that building a key from an extent length of 0 ended up at the
key value -0 = 0.  That only worked because we never stored zero length
extents.  We now store zero length orders so we can't use the negative
trick anymore.  We explicitly treat 0 length extents carefully when
building keys and we subtract the order from U64_MAX to store the orders
from largest to smallest.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-21 15:25:56 -07:00
Zach Brown
0aa6005c99 Add volume options super, server, and sysfs
Introduce global volume options.  They're stored in the superblock and
can be seen in sysfs files that use network commands to get and
set the options on the server.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-19 14:15:06 -07:00
Zach Brown
c6fd807638 Use recov to manage lock recovery
Now that we have the recov layer we can have the lock server use it to
track lock recovery.  The lock server no longer needs its own recovery
tracking structures and can instead call recov.  We add a call for the
server to call to kick lock processing once lock recovery finishes.  We
can get rid of the persistent lock_client items now that the server is
driving recovery from the mounted_client items.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-13 12:10:35 -07:00
Andy Grover
0deb232d3f Support O_TMPFILE and allow MOVE_BLOCKS into released extents
Support O_TMPFILE: Create an unlinked file and put it on the orphan list.
If it ever gains a link, take it off the orphan list.

Change MOVE_BLOCKS ioctl to allow moving blocks into offline extent ranges.
Ioctl callers must set a new flag to enable this operation mode.

RH-compat: tmpfile support it actually backported by RH into 3.10 kernel.
We need to use some of their kabi-maintaining wrappers to use it:
use a struct inode_operations_wrapper instead of base struct
inode_operations, set S_IOPS_WRAPPER flag in i_flags. This lets
RH's modified vfs_tmpfile() find our tmpfile fn pointer.

Add a test that tests both creating tmpfiles as well as moving their
contents into a destination file via MOVE_BLOCKS.

xfstests common/004 now runs because tmpfile is supported.

Signed-off-by: Andy Grover <agrover@versity.com>
2021-04-05 14:23:44 -07:00
Andy Grover
efe5d92458 Reserve space in superblock for IPv6 addresses
Define a family field, and add a union for IPv4 and v6 variants, although
v6 is not supported yet.

Family field is now used to determine presence of address in a quorum slot,
instead of checking if addr is zero.

Signed-off-by: Andy Grover <agrover@versity.com>
2021-03-12 14:10:42 -08:00
Zach Brown
f18fa0e97a Update scoutfs print for centralized block_ref
Update scoutfs print to use the new block_ref struct instead of the
handful of per-block type ref structs that we had accumulated.

Signed-off-by: Zach Brown <zab@versity.com>
2021-03-01 09:49:17 -08:00
Zach Brown
9878312b4d Update man pages for quorum slot changes
Update the man pages with descriptions of the new mkfs -Q quorum slot
configuration and quorum_slot_nr mount option.

Signed-off-by: Zach Brown <zab@versity.com>
2021-02-22 13:28:38 -08:00
Zach Brown
57f34e90e9 Use mounted_client item as sign of farewell
As clients unmount they send a farewell request that cleans up
persistent state associated with the mount.  The client needs to be sure
that it gets processed, and we must maintain a majority of quorum
members mounted to be able to elect a server to process farewell
requests.

We had a mechanism using the unmount_barrier fields in the greeting and
super_block to let the final unmounting quorum majority know that their
farewells have been processed and that they didn't need to keep trying
to reconnect.

But we missed that we also need this out of band farewell handling
signal for non-quorum member clients as well.  The server can send
farewells to a non-member client as well as the final majority and then
tear down all the connections before the non-quorum client can see its
farewell response.  It also needs to be able to know that its farewell
has been processed before the server let the final majority unmount.

We can remove the custom unmount_barrier method and instead have all
unmounting clients check for their mounted_client item in the server's
btree.  This item is removed as the last step of farewell processing so
if the client sees that it has been removed it knows that it doesn't
need to resend the farewell and can finish unmounting.

This fixes a bug where a non-quorum unmount could hang if it raced with
the final majority unmounting.  I was able to trigger this hang in our
tests with 5 mounts and 3 quorum members.

Signed-off-by: Zach Brown <zab@versity.com>
2021-02-22 13:28:38 -08:00
Zach Brown
79f6878355 Clean up block writing in mkfs
scoutfs mkfs had two block writing functions: write_block to fill out
some block header fields including crc calculation, and then
write_block_raw to pwrite the raw buffer to the bytes in the device.

These were used inconsistenly as blocks came and went over time.  Most
callers filled out all the header fields themselves and called the raw
writer.  write_block was only used for super writing, which made sense
because it clobbered the block's header with the super header so the
caller's set header magic and seq fields would be lost.

This cleans up the mess.  We only have one block writer and the caller
provides all the hdr fields.  Everything uses it instead of filling out
the fields themselves and calling the raw writer.

Signed-off-by: Zach Brown <zab@versity.com>
2021-02-22 13:28:38 -08:00
Zach Brown
87fcad5428 Update scoutfs mkfs and print for quorum slots
Signed-off-by: Zach Brown <zab@versity.com>
2021-02-22 13:28:38 -08:00
Zach Brown
406d157891 Add stringify macro to utils
Add macros for stringifying either the name of a macro or its value.  In
keeping with making our utils/ sort of look like kernel code, we use the
kernel stringify names.

Signed-off-by: Zach Brown <zab@versity.com>
2021-02-18 12:57:30 -08:00
Andy Grover
15fd2ccc02 utils: Do not assert if release is given unaligned offset or length
This is checked for by the kernel ioctl code, so giving unaligned values
will return an error, instead of aborting with an assert.

Signed-off-by: Andy Grover <agrover@versity.com>
2021-01-29 09:30:57 -08:00
Andy Grover
d731c1577e Filesystem version instead of format hash check
Instead of hashing headers, define an interop version. Do not mount
superblocks that have a different version, either higher or lower.

Since this is pretty much the same as the format hash except it's a
constant, minimal code changes are needed.

Initial dev version is 0, with the intent that version will be bumped to
1 immediately prior to tagging initial release version.

Update README. Fix comments.

Add interop version to notes and modinfo.

Signed-off-by: Andy Grover <agrover@versity.com>
2021-01-15 10:53:00 -08:00
Zach Brown
eb3981c103 Add move-blocks scoutfs cli command
Add a move-blocks command that translates arguments and calls the
MOVE_BLOCKS ioctl.

Signed-off-by: Zach Brown <zab@versity.com>
2021-01-14 13:42:22 -08:00
Andy Grover
299062a456 Fix mkfs check for existing ScoutFS superblock
We were checking for the wrong magic value.

We now need to use -f when running mkfs in run-tests for things to work.

Signed-off-by: Andy Grover <agrover@versity.com>
2021-01-13 16:32:41 -08:00
Andy Grover
2c5871c253 Change release ioctl to be denominated in bytes not blocks
This more closely matches stage ioctl and other conventions.

Also change release code to use offset/length nomenclature for consistency.

Signed-off-by: Andy Grover <agrover@versity.com>
2021-01-12 16:29:42 -08:00
Andy Grover
d48b447e75 Do not set -Wpadded except for checking kmod-shared headers
Remove now-unneeded manual padding in arg structs.

Signed-off-by: Andy Grover <agrover@versity.com>
2021-01-12 16:29:42 -08:00
Andy Grover
5241bba7f6 Update scoutfs.8 man page
Update for cli args and options changes. Reorder subcommands to match
scoutfs built-in help.

Consistent ScoutFS capitalization.

Tighten up some descriptions and verbiage for consistency and omit
descriptions of internals in a few spots.

Add SEE ALSO for blockdev(8) and wipefs(8).

Signed-off-by: Andy Grover <agrover@versity.com>
2021-01-12 16:29:42 -08:00
Andy Grover
e0a2175c2e Use argp info instead of duplicating for cmd_register()
Make it static and then use it both for argp_parse as well as
cmd_register_argp.

Split commands into five groups, to help understanding of their
usefulness.

Mention that each command has its own help text, and that we are being
fancy to keep the user from having to give fs path.

Signed-off-by: Andy Grover <agrover@versity.com>
2021-01-12 16:29:42 -08:00
Andy Grover
f2cd1003f6 Implement argp support for walk-inodes
This has some fancy parsing going on, and I decided to just leave it
in the main function instead of going to the effort to move it all
to the parsing function.

Signed-off-by: Andy Grover <agrover@versity.com>
2021-01-12 16:29:42 -08:00
Andy Grover
97c6cc559e Implement argp support for data-waiting and data-wait-err
These both have a lot of required options.

Signed-off-by: Andy Grover <agrover@versity.com>
2021-01-12 16:29:42 -08:00
Andy Grover
7c54c86c38 Implement argp support for setattr
Signed-off-by: Andy Grover <agrover@versity.com>
2021-01-12 16:29:42 -08:00
Andy Grover
e1ba508301 Implement argp support for counters
Signed-off-by: Andy Grover <agrover@versity.com>
2021-01-12 16:29:42 -08:00
Andy Grover
f35154eb19 counters: Ensure name_wid[0] is initialized to zero
I was seeing some segfaults and other weirdness without this.

Signed-off-by: Andy Grover <agrover@versity.com>
2021-01-12 16:29:42 -08:00
Andy Grover
7befc61482 Implement argp support for mkfs and add --force
Support max-meta-size and max-data-size using KMGTP units with rounding.

Detect other fs signatures using blkid library.

Detect ScoutFS super using magic value.

Move read_block() from print.c into util.c since blkid also needs it.

Signed-off-by: Andy Grover <agrover@versity.com>
2021-01-12 16:29:42 -08:00
Andy Grover
6b5ddf2b3a Implement argp support for print
Print warning if printing a data dev, you probably wanted the meta dev.

Change read_block to return err value. Otherwise there are confusing
ENOMEM messages when pread() fails. e.g. try to print /dev/null.

Signed-off-by: Andy Grover <agrover@versity.com>
2021-01-12 10:47:47 -08:00
Andy Grover
d025122fdd Implement argp support for listxaddr-hidden
Rename to list-hidden-xaddrs.

Signed-off-by: Andy Grover <agrover@versity.com>
2021-01-12 10:47:47 -08:00
Andy Grover
706fe9a30e Implement argp support for search-xattrs
Get fs path via normal methods, and make xattr an argument not an option.

Signed-off-by: Andy Grover <agrover@versity.com>
2021-01-12 10:47:47 -08:00
Andy Grover
0f17ecb9e3 Implement argp support for stage/release
Make offset and length optional. Allow size units (KMGTP) to be used
  for offset/length.

release: Since off/len no longer given in 4k blocks, round offset and
  length to to 4KiB, down and up respectively. Emit a message if rounding
  occurs.

Make version a required option.

stage: change ordering to src (the archive file) then the dest (the
  staged file).

Signed-off-by: Andy Grover <agrover@versity.com>
2021-01-12 10:47:47 -08:00
Andy Grover
10df01eb7a Implement argp support for ino-path
Signed-off-by: Andy Grover <agrover@versity.com>
2021-01-04 11:49:31 -08:00
Andy Grover
68b8e4098d Implement argp support for stat and statfs
Signed-off-by: Andy Grover <agrover@versity.com>
2021-01-04 11:49:31 -08:00
Andy Grover
5701184324 Implement argp support for df
Convert arg parsing to use argp. Use new get_path() helper fn.

Add -h human-readable option.

Signed-off-by: Andy Grover <agrover@versity.com>
2021-01-04 11:49:31 -08:00
Andy Grover
a3035582d3 Add strdup_or_error()
Add a helper function to handle the impossible event that strdup fails.

Signed-off-by: Andy Grover <agrover@versity.com>
2021-01-04 11:49:31 -08:00
Andy Grover
9e47a32257 Add get_path()
Implement a fallback mechanism for opening paths to a filesystem. If
explicitly given, use that. If env var is set, use that. Otherwise, use
current working directory.

Use wordexp to expand ~, $HOME, etc.

Signed-off-by: Andy Grover <agrover@versity.com>
2021-01-04 11:49:31 -08:00
Zach Brown
e386b900ee Remove README.md from utils
This was just boilerplate for the utils repo.

Signed-off-by: Zach Brown <zab@versity.com>
2020-12-07 10:39:20 -08:00
Zach Brown
86cf3ec4ab Remove format.h and ioctl.h from utils
Now that we're in one repo utils can get its format and ioctl headers
from the authoriative kmod files.   When we're building a dist tarball
we copy the files over so that the build from the dist tarball can use
them.

Signed-off-by: Zach Brown <zab@versity.com>
2020-12-07 10:39:20 -08:00
Andy Grover
9a647a98f1 scoutfs-utils: Header changes to match kmod PR 41
Signed-off-by: Andy Grover <agrover@versity.com>
2020-11-30 13:35:39 -08:00