Add the peername of the client's connected socket to its mounted_client
item as it mounts. If the client doesn't recover then fencing can use
the IP to find the host to fence.
Signed-off-by: Zach Brown <zab@versity.com>
Add the data_alloc_zone_blocks volume option. This changes the
behaviour of the server to try and give mounts free data extents which
fall in exclusive fixed-size zones.
We add the field to the scoutfs_volume_options struct and add it to the
set_volopt server handler which enforces constraints on the size of the
zones.
We then add fields to the log_trees struct which record the size of the
zones and set bits for the zones that contain free extents in the
data_avail allocator root. The get_log_trees handler is changed to read
all the zone bitmaps from all the items, pass those bitmaps in to
_alloc_move to direct data allocations, and finally update the bitmaps
in the log_trees items to cover the newly allocated extents. The
log_trees data_alloc_zone fields are cleared as the mount's logs are
reclaimed to indicate that the mount is no longer writing to the zone.
The policy mechanism of finding free extents based on the bitmaps is
implemented down in _data_alloc_move().
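
As a rough illustration of the zone bitmaps described above, the following
sketch marks every zone touched by a free extent. All names here
(ex_zone_bitmap, EX_MAX_ZONES, ex_mark_extent_zones) are hypothetical and
the fixed bitmap size is an assumption; this is not the code in the tree.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical cap on the zone count so the bitmap is a fixed array. */
#define EX_MAX_ZONES	1024
#define EX_ZONE_WORDS	(EX_MAX_ZONES / 64)

struct ex_zone_bitmap {
	uint64_t bits[EX_ZONE_WORDS];
};

static inline void ex_set_zone(struct ex_zone_bitmap *zb, uint64_t zone)
{
	assert(zone < EX_MAX_ZONES);
	zb->bits[zone / 64] |= 1ULL << (zone % 64);
}

static inline int ex_test_zone(const struct ex_zone_bitmap *zb, uint64_t zone)
{
	assert(zone < EX_MAX_ZONES);
	return (zb->bits[zone / 64] >> (zone % 64)) & 1;
}

/* Set a bit for each zone touched by the extent [start, start + len). */
static inline void ex_mark_extent_zones(struct ex_zone_bitmap *zb,
					uint64_t zone_blocks,
					uint64_t start, uint64_t len)
{
	uint64_t first, last, z;

	assert(zone_blocks > 0);
	if (len == 0)
		return;

	first = start / zone_blocks;
	last = (start + len - 1) / zone_blocks;
	for (z = first; z <= last; z++)
		ex_set_zone(zb, z);
}
```

With zone_blocks of 100, an extent of 100 blocks starting at block 150
would set the bits for zones 1 and 2 only.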
Signed-off-by: Zach Brown <zab@versity.com>
Allocators store free extents in two items, one sorted by their blkno
position and the other by their precise length.
The length index makes it easy to search for precise extent lengths, but
it makes it hard to search for a large extent within a given blkno
region. Skipping in the blkno dimension has to be done for every
precise length value.
We don't need that level of precision. If we index the extents by a
coarser order of the length then we have a fixed number of orders in
which we have to skip in the blkno dimension when searching within a
specific region.
This changes the length item to be stored at the log base 8 order of the
length of the extents. This groups extents into orders that are close
to the human-friendly base 10 orders of magnitude.
With this change the order field in the key no longer stores the precise
extent length. To preserve the length of the extent we need to use
another field. The only 64bit field remaining is the first which is a
higher comparison priority than the type. So we use the highest
comparison priority zone field to differentiate the position and order
indexes and can now use all three 64bit fields in the key.
Finally, we have to be careful when constructing a key to use _next when
searching for a large extent. Previously keys were relying on the magic
property that building a key from an extent length of 0 ended up at the
key value -0 = 0. That only worked because we never stored zero length
extents. We now store zero length orders so we can't use the negative
trick anymore. We explicitly treat 0 length extents carefully when
building keys and we subtract the order from U64_MAX to store the orders
from largest to smallest.
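
A minimal sketch of the order calculation and the inverted key field,
assuming the order is floor(log8(len)) and that zero length maps to order
0. The helper names are hypothetical and the exact rounding used by the
allocator may differ.

```c
#include <assert.h>
#include <stdint.h>

/*
 * floor(log8(len)) for len > 0.  A zero length is not special-cased
 * through arithmetic tricks; it simply falls into order 0.
 */
static inline uint64_t ex_extent_order(uint64_t len)
{
	uint64_t order = 0;

	while (len >= 8) {
		len >>= 3;
		order++;
	}
	return order;
}

/*
 * Store the order inverted so that iterating keys in ascending order
 * visits the largest orders first.
 */
static inline uint64_t ex_order_key_field(uint64_t order)
{
	/* a 64bit length can't have an order greater than 21 */
	assert(order <= 21);
	return UINT64_MAX - order;
}
```

Lengths 1 through 7 share order 0, 8 through 63 share order 1, and so on,
which gives the fixed, small number of orders to skip through when
searching within a blkno region.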
Signed-off-by: Zach Brown <zab@versity.com>
Introduce global volume options. They're stored in the superblock and
can be seen in sysfs files that use network commands to get and
set the options on the server.
Signed-off-by: Zach Brown <zab@versity.com>
Now that we have the recov layer we can have the lock server use it to
track lock recovery. The lock server no longer needs its own recovery
tracking structures and can instead call recov. We add a function for
the server to call to kick lock processing once lock recovery finishes. We
can get rid of the persistent lock_client items now that the server is
driving recovery from the mounted_client items.
Signed-off-by: Zach Brown <zab@versity.com>
Support O_TMPFILE: Create an unlinked file and put it on the orphan list.
If it ever gains a link, take it off the orphan list.
Change MOVE_BLOCKS ioctl to allow moving blocks into offline extent ranges.
Ioctl callers must set a new flag to enable this operation mode.
RH-compat: tmpfile support is actually backported by RH into the 3.10 kernel.
We need to use some of their kabi-maintaining wrappers to use it:
use a struct inode_operations_wrapper instead of base struct
inode_operations, set S_IOPS_WRAPPER flag in i_flags. This lets
RH's modified vfs_tmpfile() find our tmpfile fn pointer.
Add a test that covers both creating tmpfiles and moving their
contents into a destination file via MOVE_BLOCKS.
xfstests common/004 now runs because tmpfile is supported.
Signed-off-by: Andy Grover <agrover@versity.com>
Define a family field, and add a union for IPv4 and v6 variants, although
v6 is not supported yet.
Family field is now used to determine presence of address in a quorum slot,
instead of checking if addr is zero.
Signed-off-by: Andy Grover <agrover@versity.com>
Update scoutfs print to use the new block_ref struct instead of the
handful of per-block type ref structs that we had accumulated.
Signed-off-by: Zach Brown <zab@versity.com>
Update the man pages with descriptions of the new mkfs -Q quorum slot
configuration and quorum_slot_nr mount option.
Signed-off-by: Zach Brown <zab@versity.com>
As clients unmount they send a farewell request that cleans up
persistent state associated with the mount. The client needs to be sure
that it gets processed, and we must maintain a majority of quorum
members mounted to be able to elect a server to process farewell
requests.
We had a mechanism using the unmount_barrier fields in the greeting and
super_block to let the final unmounting quorum majority know that their
farewells have been processed and that they didn't need to keep trying
to reconnect.
But we missed that we also need this out of band farewell handling
signal for non-quorum member clients as well. The server can send
farewells to a non-member client as well as the final majority and then
tear down all the connections before the non-quorum client can see its
farewell response. It also needs to be able to know that its farewell
has been processed before the server lets the final majority unmount.
We can remove the custom unmount_barrier method and instead have all
unmounting clients check for their mounted_client item in the server's
btree. This item is removed as the last step of farewell processing so
if the client sees that it has been removed it knows that it doesn't
need to resend the farewell and can finish unmounting.
This fixes a bug where a non-quorum unmount could hang if it raced with
the final majority unmounting. I was able to trigger this hang in our
tests with 5 mounts and 3 quorum members.
Signed-off-by: Zach Brown <zab@versity.com>
scoutfs mkfs had two block writing functions: write_block to fill out
some block header fields including crc calculation, and then
write_block_raw to pwrite the raw buffer to the bytes in the device.
These were used inconsistently as blocks came and went over time. Most
callers filled out all the header fields themselves and called the raw
writer. write_block was only used for super writing, which made sense
because it clobbered the block's header with the super header so the
caller's set header magic and seq fields would be lost.
This cleans up the mess. We only have one block writer and the caller
provides all the hdr fields. Everything uses it instead of filling out
the fields themselves and calling the raw writer.
Signed-off-by: Zach Brown <zab@versity.com>
Add macros for stringifying either the name of a macro or its value. In
keeping with making our utils/ sort of look like kernel code, we use the
kernel stringify names.
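
For reference, the two-level pattern looks roughly like this. The kernel's
include/linux/stringify.h uses the names __stringify_1 and __stringify;
the non-variadic form and the EX_BLOCK_SIZE macro below are
simplifications for illustration.

```c
#include <assert.h>
#include <string.h>

/* One level: stringify the argument as written, without expanding it. */
#define __stringify_1(x)	#x
/* Two levels: the argument is macro-expanded before being stringified. */
#define __stringify(x)		__stringify_1(x)

#define EX_BLOCK_SIZE 4096	/* hypothetical example macro */

/* Check that each macro produces the name or the value as intended. */
static inline void ex_stringify_selftest(void)
{
	assert(strcmp(__stringify_1(EX_BLOCK_SIZE), "EX_BLOCK_SIZE") == 0);
	assert(strcmp(__stringify(EX_BLOCK_SIZE), "4096") == 0);
}
```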
Signed-off-by: Zach Brown <zab@versity.com>
This is checked for by the kernel ioctl code, so giving unaligned values
will return an error, instead of aborting with an assert.
Signed-off-by: Andy Grover <agrover@versity.com>
Instead of hashing headers, define an interop version. Do not mount
superblocks that have a different version, either higher or lower.
Since this is pretty much the same as the format hash except it's a
constant, minimal code changes are needed.
Initial dev version is 0, with the intent that version will be bumped to
1 immediately prior to tagging initial release version.
Update README. Fix comments.
Add interop version to notes and modinfo.
Signed-off-by: Andy Grover <agrover@versity.com>
We were checking for the wrong magic value.
We now need to use -f when running mkfs in run-tests for things to work.
Signed-off-by: Andy Grover <agrover@versity.com>
This more closely matches stage ioctl and other conventions.
Also change release code to use offset/length nomenclature for consistency.
Signed-off-by: Andy Grover <agrover@versity.com>
Update for cli args and options changes. Reorder subcommands to match
scoutfs built-in help.
Consistent ScoutFS capitalization.
Tighten up some descriptions and verbiage for consistency and omit
descriptions of internals in a few spots.
Add SEE ALSO for blockdev(8) and wipefs(8).
Signed-off-by: Andy Grover <agrover@versity.com>
Make it static and then use it both for argp_parse as well as
cmd_register_argp.
Split commands into five groups, to help understanding of their
usefulness.
Mention that each command has its own help text, and that we are being
fancy to keep the user from having to give fs path.
Signed-off-by: Andy Grover <agrover@versity.com>
This has some fancy parsing going on, and I decided to just leave it
in the main function instead of going to the effort to move it all
to the parsing function.
Signed-off-by: Andy Grover <agrover@versity.com>
Support max-meta-size and max-data-size using KMGTP units with rounding.
Detect other fs signatures using blkid library.
Detect ScoutFS super using magic value.
Move read_block() from print.c into util.c since blkid also needs it.
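
A sketch of parsing a size with a KMGTP suffix. The function name is
hypothetical, and the use of binary (1024-based) multipliers and
case-insensitive suffixes are assumptions; the rounding the options apply
afterwards is not shown.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse "123", "4K", "16G", etc.  Returns 0 or a negative errno. */
static inline int ex_parse_size(const char *str, uint64_t *ret)
{
	unsigned long long val;
	unsigned int shift = 0;
	char *end;

	assert(str != NULL && ret != NULL);

	/* reject empty strings, signs, and leading whitespace */
	if (!isdigit((unsigned char)str[0]))
		return -EINVAL;

	errno = 0;
	val = strtoull(str, &end, 10);
	if (errno)
		return -errno;

	if (*end != '\0') {
		switch (toupper((unsigned char)*end)) {
		case 'K': shift = 10; break;
		case 'M': shift = 20; break;
		case 'G': shift = 30; break;
		case 'T': shift = 40; break;
		case 'P': shift = 50; break;
		default: return -EINVAL;
		}
		/* only a single trailing unit character is accepted */
		if (end[1] != '\0')
			return -EINVAL;
	}

	/* refuse values that would overflow once the unit is applied */
	if (shift && val > (UINT64_MAX >> shift))
		return -ERANGE;

	*ret = (uint64_t)val << shift;
	return 0;
}
```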
Signed-off-by: Andy Grover <agrover@versity.com>
Print a warning when printing a data dev; you probably wanted the meta dev.
Change read_block to return err value. Otherwise there are confusing
ENOMEM messages when pread() fails. e.g. try to print /dev/null.
Signed-off-by: Andy Grover <agrover@versity.com>
Make offset and length optional. Allow size units (KMGTP) to be used
for offset/length.
release: Since off/len no longer given in 4k blocks, round offset and
length to 4KiB, down and up respectively. Emit a message if rounding
occurs.
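
The release rounding can be sketched as follows. This follows the wording
above literally (offset down, length up, each on its own); the helper
names are hypothetical, and whether the tool also extends the length to
keep covering the original end of the range is not something this sketch
settles.

```c
#include <assert.h>
#include <stdint.h>

#define EX_BLOCK_SIZE 4096ULL	/* 4KiB, must be a power of two */

static inline uint64_t ex_round_down(uint64_t v)
{
	return v & ~(EX_BLOCK_SIZE - 1);
}

static inline uint64_t ex_round_up(uint64_t v)
{
	/* the caller must not pass values that would wrap */
	assert(v <= UINT64_MAX - (EX_BLOCK_SIZE - 1));
	return (v + EX_BLOCK_SIZE - 1) & ~(EX_BLOCK_SIZE - 1);
}

/*
 * Round the offset down and the length up to 4KiB.  Returns 1 if either
 * value changed so that the caller can emit a message, 0 otherwise.
 */
static inline int ex_round_release_range(uint64_t *offset, uint64_t *length)
{
	uint64_t off = ex_round_down(*offset);
	uint64_t len = ex_round_up(*length);
	int changed = (off != *offset) || (len != *length);

	*offset = off;
	*length = len;
	return changed;
}
```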
Make version a required option.
stage: change ordering to src (the archive file) then the dest (the
staged file).
Signed-off-by: Andy Grover <agrover@versity.com>
Implement a fallback mechanism for opening paths to a filesystem. If
explicitly given, use that. If env var is set, use that. Otherwise, use
current working directory.
Use wordexp to expand ~, $HOME, etc.
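
The fallback order can be sketched like this. The environment variable
name EXAMPLE_FS_PATH and the helper names are placeholders for
illustration only, and the wordexp expansion step is left out.

```c
#include <assert.h>
#include <stdlib.h>

#define EX_PATH_ENV_NAME "EXAMPLE_FS_PATH"	/* placeholder name */

enum ex_path_src {
	EX_SRC_EXPLICIT,	/* path given on the command line */
	EX_SRC_ENV,		/* path taken from the environment */
	EX_SRC_CWD,		/* fell back to the working directory */
};

/* Pick the path to open and report which source provided it. */
static inline const char *ex_choose_fs_path(const char *explicit_path,
					    enum ex_path_src *src)
{
	const char *env;

	assert(src != NULL);

	if (explicit_path && explicit_path[0]) {
		*src = EX_SRC_EXPLICIT;
		return explicit_path;
	}

	env = getenv(EX_PATH_ENV_NAME);
	if (env && env[0]) {
		*src = EX_SRC_ENV;
		return env;
	}

	*src = EX_SRC_CWD;
	return ".";
}
```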
Signed-off-by: Andy Grover <agrover@versity.com>
Now that we're in one repo utils can get its format and ioctl headers
from the authoritative kmod files. When we're building a dist tarball
we copy the files over so that the build from the dist tarball can use
them.
Signed-off-by: Zach Brown <zab@versity.com>
Not initializing wid[] can cause incorrect output.
Also, we only need 6 columns if we reference the array from 0.
Signed-off-by: Andy Grover <agrover@versity.com>
mkfs: Take two block devices as arguments. Write everything to metadata
dev, and the superblock to the data dev. UUIDs match. Differentiate by
checking a bit in a new "flags" field in the superblock.
Refactor device_size() a little. Convert spaces to tabs.
Move code to pretty-print sizes to dev.c so we can use it in error
messages there, as well as in mkfs.c.
print: Include flags in output.
Add -D and -M options for setting max dev sizes
Allow sizes to be specified using units like "K", "G" etc.
Note: -D option replaces -S option, and uses above units rather than
the number of 4k data blocks.
Update man pages for cmdline changes.
Signed-off-by: Andy Grover <agrover@versity.com>
It was too tricky to pick out the difference between metadata and data
usage in the previous format. This makes it much more clear which
values are for either metadata or data.
Signed-off-by: Zach Brown <zab@versity.com>