Commit Graph

336 Commits

Author SHA1 Message Date
Zach Brown
4a8240748e Add project ID support
Add support for project IDs.  They're managed through the _attr_x
interfaces and are inherited from the parent directory during creation.

Signed-off-by: Zach Brown <zab@versity.com>
2024-06-28 15:09:05 -07:00
Zach Brown
fb5331a1d9 Add inode retention bit
Add a bit to the private scoutfs inode flags which indicates that the
inode is in retention mode.  The bit is visible through the _attr_x
interface.  It can only be set on regular files and when set it prevents
modification to all but non-user xattrs.  It can be cleared by root.

Signed-off-by: Zach Brown <zab@versity.com>
2024-06-28 15:09:05 -07:00
Zach Brown
de304628ea Add attr_x commands and documentation to utils
Signed-off-by: Zach Brown <zab@versity.com>
2024-06-28 14:53:49 -07:00
Bryant G. Duffy-Ly
9ba4271c26 Add new max format version of 2
We're about to add new format structures so increment the max version to
2.  Future commits will add the features before we release version 2 in
the wild.

Signed-off-by: Zach Brown <zab@zabbo.net>
2024-06-28 14:53:49 -07:00
Bryant G. Duffy-Ly
90cfaf17d1 Initial support for different inode sizes
We're about to increase the inode size and increment the format version.
Inode reading and writing has to handle different valid inode sizes as
allowed by the format version.   This is the initial skeletal work that
later patches which really increase the inode size will further refine
to add the specific known sizes and format versions.

Signed-off-by: Bryant G. Duffy-Ly <bduffyly@versity.com>
[zab@versity.com: reworded description, reworked to use _within]
Signed-off-by: Zach Brown <zab@versity.com>
2024-06-28 14:53:49 -07:00
Zach Brown
d6642da44d Prevent downgrade of format version
Don't let change-format-version decrease the format version.  It doesn't
have the machinery to go back and migrate newer structures to older
structures that would be compatible with code expecting the older
version.

Signed-off-by: Bryant G. Duffy-Ly <bduffyly@versity.com>
[zab@versity.com: split from initial patch with other changes]
Signed-off-by: Zach Brown <zab@versity.com>
2024-06-25 15:11:20 -07:00
Zach Brown
6db69b7a4f Set root inode crtime in mkfs
When we added the crtime creation timestamp to the inode we forgot to
update mkfs to set the crtime of the root inode.

Signed-off-by: Zach Brown <zab@versity.com>
2024-06-25 15:11:18 -07:00
Zach Brown
90a4c82363 Make log merge wait timeout tunable
Add a mount option for the amount of time that log merge creation can
wait before giving up.  We add some counters so we can see how often
the timeout is being hit and what the average successfull wait time is.

Signed-off-by: Zach Brown <zab@versity.com>
2024-01-25 11:25:56 -08:00
Ben McClelland
d2c2fece2a Add rpm spec file support for el8 builds
The rpmbuild support files no longer define the previously used kernel
module macros. This carves out the differences between el7 and el8 with
conditionals based on the distro we are building for.

Signed-off-by: Ben McClelland <ben.mcclelland@versity.com>
2023-10-09 15:35:40 -04:00
Zach Brown
2279e9657f Add get_referring_entries scoutfs command
Add a cli command for the get_referring_entries ioctl.

Signed-off-by: Zach Brown <zab@versity.com>
2023-06-14 14:12:10 -07:00
Zach Brown
912906f050 Make quorum heartbeat timeout tunable
Add mount and sysfs options for changing the quorum heartbeat timeout.
This allows setting a longer delay in taking over for failed hosts that
has a greater chance of surviving temporary non-fatal delays.

We also double the existing default timeout to 10s which is still
reasonably responsive.

Signed-off-by: Zach Brown <zab@versity.com>
2023-05-17 14:44:27 -07:00
Zach Brown
e7bd1b45dc Add prepare-empty-data-device scoutfs command
Add a command for writing a super block to a new data device after
reading the metadata device to ensure that there's no existing
data on the old data device.

Signed-off-by: Zach Brown <zab@versity.com>
2023-04-17 12:47:50 -07:00
Zach Brown
18903ce500 Alphabetize command listing in scoutfs man page
List the scoutfs utility commands in the man page in alphabetical order.

Signed-off-by: Zach Brown <zab@versity.com>
2023-04-17 12:47:50 -07:00
Zach Brown
b76e22ffcf Refactor user util functions for device size
Split the existing device_size() into get_device_size() and
limit_device_size().  An upcoming command wants to get the device size
without applying limiting policy.

Signed-off-by: Zach Brown <zab@versity.com>
2023-04-17 12:47:50 -07:00
Zach Brown
3363b4fb79 Flush device caches in buffered util cmds
Add calls to our new device cache flushing helper in commands that use
buffered reads.

Signed-off-by: Zach Brown <zab@versity.com>
2023-01-18 10:52:02 -08:00
Zach Brown
ddb5cce2a5 Add quick utils flush_device helper
Add a quick helper that just calls cache flushing ioctls on different
kinds of files.

Signed-off-by: Zach Brown <zab@versity.com>
2023-01-18 10:27:47 -08:00
Zach Brown
ef2daf8857 Make data preallocation tunable
Make mount options for the size of preallocation and whether or not it
should be restricted to extending writes.  Disabling the default
restriction to streaming writes lets it preallocate in aligned regions
of the preallocation size when they contain no extents.

Signed-off-by: Zach Brown <zab@versity.com>
2022-10-14 14:03:35 -07:00
Zach Brown
29538a9f45 Add POSIX ACL support
Add support for the POSIX ACLs as described in acl(5).  Support is
enabled by default and can be explicitly enabled or disabled with the
acl or noacl mount options, respectively.

Signed-off-by: Zach Brown <zab@versity.com>
2022-09-28 10:36:10 -07:00
Zach Brown
49df98f5a8 Add skip-likely-huge print option
Add an option to skip printing structures that are likely to be so huge
that the print output becomes completely unwieldly on large systems.

Signed-off-by: Zach Brown <zab@versity.com>
2022-07-06 15:07:57 -07:00
Zach Brown
26ae9c6e04 Verify local unmount testing fence script
The fence script we use for our single node multi-mount tests only knows
how to fence by using forced unmount to destroy a mount.  As of now, the
tests only generate failing nodes that need to be fenced by using forced
unmount as well.  This results in the awkward situation where the
testing fence script doesn't have anything to do because the mount is
already gone.

When the test fence script has nothing to do we might not notice if it
isn't run.  This adds explicit verification to the fencing tests that
the script was really run.  It adds per-invocation logging to the fence
script and the test makes sure that it was run.

While we're at it, we take the opportunity to tidy up some of the
scripting around this.  We use a sysfs file with the data device
major:minor numbers so that the fencing script can find and unmount
mounts without having to ask them for their rid.  They may not be
operational.

Signed-off-by: Zach Brown <zab@versity.com>
2022-03-28 14:52:08 -07:00
Zach Brown
a67ea30bb7 Add orphan_scan_delay_ms mount option
Add a mount option to set the delay betwen scanning of the orphan list.
The sysfs file for the option is writable so this option can be set at
run time.

Signed-off-by: Zach Brown <zab@versity.com>
2022-03-10 11:43:11 -08:00
Zach Brown
ae08a797ae Clean quorum and format change command docs
The man pages and inline help blurbs for the recently added format
version and quorum config commands incorrectly described the device
arguments which are needed.

Signed-off-by: Zach Brown <zab@versity.com>
2022-02-08 11:23:27 -08:00
Zach Brown
e067961714 Add get-allocated-inos scoutfs command
Add the get-allocated-inos scoutfs command which wraps the
GET_ALLOCATED_INOS ioctl.   It'll be used by tests to find items
associated with an inode instead of trying to open the inode by a
constructed handle after it was unlinked.

Signed-off-by: Zach Brown <zab@versity.com>
2022-01-24 09:40:08 -08:00
Zach Brown
813ce24d79 Move local-force-unmount test script into tests/
The local-force-unmount fenced fencing script only works when all the
mounts are on the local host and it uses force unmount.   It is only
used in our specific local testing scripts.  Packaging it as an example
lead people to believe that it could be used to cobble together a
multi-host testing network, however temporary.

Move it from being in utils and packged to being private to our tests so
that it doesn't present an attractive nuisance.

Signed-off-by: Zach Brown <zab@versity.com>
2022-01-19 11:33:34 -08:00
Zach Brown
89ca903c41 Print log trees get/commit seqs
Back when we added the get/commit transaction sequence numbers to the
log_trees we forgot to add them to the scoutfs print output.

Signed-off-by: Zach Brown <zab@versity.com>
2022-01-19 09:21:02 -08:00
Zach Brown
8bc1ee8346 Add change-quorum-config command
Add a command to change the quorum config which starts by only supports
updating the super block whlie the file system is oflfine.

Signed-off-by: Zach Brown <zab@versity.com>
2021-11-24 15:41:04 -08:00
Zach Brown
285b68879a Set quorum config ver to 1 in mkfs and print
We're adding a command to change the quorum config which updates its
version number.  Let's make the version a little more visible and start
it at the more humane 1.

Signed-off-by: Zach Brown <zab@versity.com>
2021-11-24 15:41:04 -08:00
Zach Brown
1ac3efe701 Add meta_super_in_use utils helper
Move the code that checks that the super is in use from
change-format-version into its own function in util.c.   We'll use it in
an upcoming command to change the quorum config.

Signed-off-by: Zach Brown <zab@versity.com>
2021-11-24 15:40:25 -08:00
Zach Brown
ce76682db7 Make mkfs quorum helpers available
Move functions for printing and validating the quorum config from mkfs.c
to quorum.c so that they can be used in an upcoming command to change
the quorum config.

Signed-off-by: Zach Brown <zab@versity.com>
2021-11-24 13:44:51 -08:00
Zach Brown
686f8515bc Fix --quorum-count typo in mkfs error message
The change from --quorum-count to --quorum-slot forgot to update a
mention of the option in an error message in mkfs when it wasn't
provided.

Signed-off-by: Zach Brown <zab@versity.com>
2021-11-24 13:44:51 -08:00
Bryant G. Duffy-Ly
95f2a87864 Fix scoutfs print <data_dev> hang
If a user tries to print a data device exit early if
it is data device.

Signed-off-by: Bryant G. Duffy-Ly <bduffyly@versity.com>
2021-11-08 16:16:13 -06:00
Zach Brown
932a842ae3 Remove valid_bytes from stat _more ioctls
The idea here was that we'd expand the size of the struct and
valid_bytes would tell the kernel which fields were present in
userspace's struct.  That doesn't combine well with the ioctl convention
of having the size of the type baked into the ioctl number.   We'll
remove this to make the world less surprising.  If we expand the
interface we'd add additional ioctls and types.

Signed-off-by: Zach Brown <zab@versity.com>
2021-10-28 12:30:47 -07:00
Zach Brown
80ee2c6d57 Harden client transaction processing
There are a few bad corner cases in the state machine that governs how
client transactions are opened, modified, and committed.

The worst problem is on the server side.   All server request handlers
need to cope with resent requests without causing bad side effects.
Both get_log_trees and commit_log_trees would try to fully processes
resent requests.  _get_log_trees() looks safe because it works with the
log_trees that was stored previously.  _commit_log_trees() is not safe
because it can rotate out the srch log file referenced by the sent
log_trees every time it's processed.  This could create extra srch
entries which would delete the first instance of entries.  Worse still,
by injecting the same block structure into the system multiple times it
ends up causing multiple frees of the blocks that make up the srch file.

The client side problems are slightly different, but related.   There
aren't strong constraints which guarantee that we'll only send a commit
request after a get request succeeds.   In crazy circumstances the
commit request in the write worker could come before the first get in
mount succeeds.   Far worse is that we can send multiple commit requests
for one transaction if it changes as we get errors during multiple
queued write attempts, particularly if we get errors from get_log_trees
after having successfully committed.

This hardens all these paths to ensure a strict sequence of
get_log_trees, transaction modification, and commit_log_trees.

On the server we add *_trans_seq fields to the log_trees struct so that
both get_ and commit_ can see that they've already prepared a commit to
send or have already committed the incoming commit, respectively.   We
can use the get_trans_seq field as the trans_seq of the open transaction
and get rid of the entire seperate mechanism we used to have for
tracking open trans seqs in the clients.  We can get the same info by
walking the log_trees and looking at their *_trans_seq fields.

In the client we have the write worker immediately return success if
mount hasn't opened the first transaction.   Then we don't have the
worker return to allow further modification until it has gotten success
from get_log_trees.

Signed-off-by: Zach Brown <zab@versity.com>
2021-10-28 12:30:47 -07:00
Zach Brown
95ed36f9d3 Maintain inode count in super and log trees
Add a count of used inodes to the super block and a change in the inode
count to the log_trees struct.   Client transactions track the change in
inode count as they create and delete inodes.   The log_trees delta is
added to the count in the super as finalized log_trees are deleted.

Signed-off-by: Zach Brown <zab@versity.com>
2021-10-28 12:30:47 -07:00
Zach Brown
366f615c9f Add support for our format version
We had previously started on a relatively simple notion of an
interoperability version which wasn't quite right.  This fleshes out
support for a more functional format version.   The super blocks have a
single version that defines behaviour of the running system.   The code
supports a range of versions and we add some initial interfaces for
updating the version while the system is offline.   All of this together
should let us safely change the underlying format over time.

Signed-off-by: Zach Brown <zab@versity.com>
2021-10-28 12:30:46 -07:00
Zach Brown
ac2587017e Add write_nr to quorum blocks
Add a write_nr field to the quorum block header which is incremented
with every write.  Each event also gets a write_nr field that is set to
the incremented value from the header.   This gives us a history of the
order of event updates that isn't sensitive to misconfigured time.

Signed-off-by: Zach Brown <zab@versity.com>
2021-10-28 12:30:46 -07:00
Zach Brown
1cdcf41ac7 Move more block read/write functions to util
We're adding another command that does block IO so move some block
reading and writing functions out of mkfs.   We also grow a few function
variants and call the write_sync variant from mkfs instead of having it
manually sync.

Signed-off-by: Zach Brown <zab@versity.com>
2021-10-28 12:30:46 -07:00
Zach Brown
024426df28 Add a file for userspace quorum config helpers
Signed-off-by: Zach Brown <zab@versity.com>
2021-10-28 12:30:46 -07:00
Zach Brown
ea2b01434e Add support for i_version
This adds i_version to our inode and maintains it as we allocate, load,
modify, and store inodes.  We set the flag in the superblock so
in-kernel users can use i_version to see changes in our inodes.

Signed-off-by: Zach Brown <zab@versity.com>
2021-09-13 14:41:07 -07:00
Zach Brown
b9a0f1709f Add xattr .totl. tag
Add the .totl. xattr tag.  When the tag is set the end of the name
specifies a total name with 3 encoded u64s separated by dots.  The value
of the xattr is a u64 that is added to the named total.   An ioctl is
added to read the totals.

Signed-off-by: Zach Brown <zab@versity.com>
2021-09-13 14:41:07 -07:00
Zach Brown
a59fd5865d Add seq and flags to btree items
The fs log btrees have values that start with a header that stores the
item's seq and flags.  There's a lot of sketchy code that manipulates
the value header as items are passed around.

This adds the seq and flags as core item fields in the btree.   They're
only set by the interfaces that are used to store fs items: _insert_list
and _merge.  The rest of the btree items that use the main interface
don't work with the fields.

This was done to help delta items discover when logged items have been
merged before the finalized lob btrees are deleted and the code ends up
being quite a bit cleaner.

Signed-off-by: Zach Brown <zab@versity.com>
2021-09-09 14:44:55 -07:00
Zach Brown
46edf82b6b Add inode crtime creation time
Add an inode creation time field.  It's created for all new inodes.
It's visible to stat_more.  setattr_more can set it during
restore.

Signed-off-by: Zach Brown <zab@versity.com>
2021-09-03 11:14:41 -07:00
Zach Brown
36fcc4665d Align first free ino to lock group
Currently the first inode number that can be allocated directly follows
the root inode.  This means the first batch of allocated inodes are in
the same lock group as the root inode.

The root inode is a bit special.  It is always hot as absolute path
lookups and inode-to-path resolution always read directory entries from
the root.

Let's try aligning the first free inode number to the next inode lock
group boundary.  This will stop work in those inodes from necessarily
conflicting with work in the root inode.

Signed-off-by: Zach Brown <zab@versity.com>
2021-08-25 10:14:38 -07:00
Zach Brown
bb571377dc Don't merge newer items past older
We have a problem where items can appear to go backwards in time because
of the way we chose which log btrees to finalize and merge.

Because we don't have versions in items in the fs_root, and even might
not have items at all if they were deleted, we always assume items in
log btrees are newer than items in the fs root.

This creates the requirement that we can't merge a log btree if it has
items that are also present in older versions in other log btrees which
are not being merged.  The unmerged old item in the log btree would take
precedent over the newer merged item in the fs root.

We weren't enforcing this requirement at all.  We used the max_item_seq
to ensure that all items were older than the current stable seq but that
says nothing about the relationship between older items in the finalized
and active log btrees.  Nothing at all stops an active btree from having
an old version of a newer item that is present in another mount's
finalized log btree.

To reliably fix this we create a strict item seq discontinuity between
all the finalized merge inputs and all the active log btrees.  Once any
log btree is naturally finalized the server forced all the clients to
group up and finalize all their open log btrees.   A merge operation can
then safely operate on all the finalized trees before any new trees are
given to clients who would start using increasing items seqs.

Signed-off-by: Zach Brown <zab@versity.com>
2021-08-25 10:14:38 -07:00
Zach Brown
6d0694f1b0 Add resize_devices ioctl and scoutfs command
Add a scoutfs command that uses an ioctl to send a request to the server
to safely use a device that has grown.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-30 13:26:32 -07:00
Zach Brown
fd686cab86 Fix total_data_blocks calculation in mkfs
mkfs was incorrectly initializing total_data_blocks.  The field is meant
to record the number of blocks from the start of the device that the
filesystem could access.  mkfs was subtracting the initial reserved area
of the device, meaning the number of blocks that the filesystem might
access.

This could allow accesses past devices if mount checks the device size
against the smaller total_data_blocks.

And we're about to use total_data_blocks as the start of a new extent to
add when growing the volume.  It needs to be fixed so that this new
grown free extent doesn't overlap with the end of the existing free
extents.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-30 13:26:32 -07:00
Zach Brown
4c1181c055 Remove first_ and last_ super blkno fields
There are fields in the super block that specify the range of blocks
that would be used for metadata or data.  They are from the time when a
single block device was carved up into regions for metadata and data.

They don't make sense now that we have separate metadata and data block
devices.  The starting blkno is static and we go to the end of the
device.

This removes the fields now that they serve no purpose.   The only use
of them to check that freed extents fell within the correct bounds can
still be performed by using the static starting number or roughly using
the size of the devices.  It's not perfect, but this is already only
a check to see that the blknos aren't utter nonsense.

We're removing the fields now to avoid having to update them while
worrying about users when resizing devices.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-30 13:22:42 -07:00
Zach Brown
21c5724dd5 Update fenced service file StartLimitBurst
The first draft was written against an older schema, StartLimitBurst is
in [Service] now.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-30 11:34:52 -07:00
Zach Brown
3974d98f6b Don't use "/dev/*" redirections near systemd
It sets up stdout and stderr as sockets, not pipes, so these links don't
work.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-30 11:34:52 -07:00
Ben McClelland
3a9db45194 Add fenced systemd and example configs
This should be good enough to get single node mounts up and running with
fenced with minimal effort.  The example config will need to be copied
to /etc/scoutfs/scoutfs-fenced.conf for it to be functional, so this
still requires specific opt-in and wont accidentally run for multi-node
systems.

Signed-off-by: Ben McClelland <ben.mcclelland@versity.com>
2021-07-09 08:22:39 -07:00