Commit Graph

1286 Commits

Author SHA1 Message Date
Zach Brown
7421bd1861 Filter all test device digits to 0
We mask device numbers in command output to 0:0 so that we can have
consistent golden test output.  The device number matching regex
responsible for this missed a few digits.

It didn't show up until we both tested enough mounts to get larger
device minor numbers and fixed multi-mount consistency so that the
affected tests didn't fail for other reasons.

Signed-off-by: Zach Brown <zab@versity.com>
2021-02-22 13:28:38 -08:00
Zach Brown
1db6f8194d Update xfstests to use quorum slot options
Signed-off-by: Zach Brown <zab@versity.com>
2021-02-22 13:28:38 -08:00
Zach Brown
2de7692336 Unmount mount point, not device
Our test unmount function unmounted the device instead of the mount
point.  It was written this way back in an old version of the harness
which didn't track mount points.

Now that we have mount points, we can just unmount that.  This stops the
umount command from having to search through all the current mounts
looking for the mountpoint for the device it was asked to unmount.

Signed-off-by: Zach Brown <zab@versity.com>
2021-02-22 13:28:38 -08:00
Zach Brown
8c1d96898a Log wait failure in mount-unmount-race test
I got a test failure where waiting returned an error, but it wasn't
clear what the error was or where it might have come from.  Add more
logging so that we learn more about what might have gone wrong.

Signed-off-by: Zach Brown <zab@versity.com>
2021-02-22 13:28:38 -08:00
Zach Brown
090646aaeb Update repo README.md for quorum slots
Update the example configuration in the README to specify the quorum
slots in mkfs arguments and mount options.

Signed-off-by: Zach Brown <zab@versity.com>
2021-02-22 13:28:38 -08:00
Zach Brown
d53350f9f1 Consistently lock server mounted_clients btree
The mounted_clients btree stores items to track mounted clients.  It's
modified by multiple greeting workers and the farewell work.

The greeting work was serialized by the farewell_mutex, but the
modifications in the farewell thread weren't protected.  This could
result in modifications between the threads being lost if the dirty
block reference updates raced in just the right way.  I saw this in
testing with deletions in farewell being lost and then that lingering
item preventing unmount because the server thought it had to wait for a
remaining quorum member to unmount.

We fix this by adding a mutex specifically to protect the
mounted_clients btree in the server.

Signed-off-by: Zach Brown <zab@versity.com>
2021-02-22 13:28:38 -08:00
Zach Brown
57f34e90e9 Use mounted_client item as sign of farewell
As clients unmount they send a farewell request that cleans up
persistent state associated with the mount.  The client needs to be sure
that it gets processed, and we must maintain a majority of quorum
members mounted to be able to elect a server to process farewell
requests.

We had a mechanism using the unmount_barrier fields in the greeting and
super_block to let the final unmounting quorum majority know that their
farewells have been processed and that they didn't need to keep trying
to reconnect.

But we missed that we also need this out of band farewell handling
signal for non-quorum member clients as well.  The server can send
farewells to a non-member client as well as the final majority and then
tear down all the connections before the non-quorum client can see its
farewell response.  It also needs to be able to know that its farewell
has been processed before the server let the final majority unmount.

We can remove the custom unmount_barrier method and instead have all
unmounting clients check for their mounted_client item in the server's
btree.  This item is removed as the last step of farewell processing so
if the client sees that it has been removed it knows that it doesn't
need to resend the farewell and can finish unmounting.

This fixes a bug where a non-quorum unmount could hang if it raced with
the final majority unmounting.  I was able to trigger this hang in our
tests with 5 mounts and 3 quorum members.

Signed-off-by: Zach Brown <zab@versity.com>
2021-02-22 13:28:38 -08:00
Zach Brown
79f6878355 Clean up block writing in mkfs
scoutfs mkfs had two block writing functions: write_block to fill out
some block header fields including crc calculation, and then
write_block_raw to pwrite the raw buffer to the bytes in the device.

These were used inconsistenly as blocks came and went over time.  Most
callers filled out all the header fields themselves and called the raw
writer.  write_block was only used for super writing, which made sense
because it clobbered the block's header with the super header so the
caller's set header magic and seq fields would be lost.

This cleans up the mess.  We only have one block writer and the caller
provides all the hdr fields.  Everything uses it instead of filling out
the fields themselves and calling the raw writer.

Signed-off-by: Zach Brown <zab@versity.com>
2021-02-22 13:28:38 -08:00
Zach Brown
740e13e53a Return error from _quorum_setup
Well that's a silly mistake.

Signed-off-by: Zach Brown <zab@versity.com>
2021-02-22 13:28:38 -08:00
Zach Brown
dbb716f1bb Update tests for quorum slots
Update the tests to deal with the mkfs and mount changes for the
specifically configured quorum slots.

Signed-off-by: Zach Brown <zab@versity.com>
2021-02-22 13:28:38 -08:00
Zach Brown
87fcad5428 Update scoutfs mkfs and print for quorum slots
Signed-off-by: Zach Brown <zab@versity.com>
2021-02-22 13:28:38 -08:00
Zach Brown
406d157891 Add stringify macro to utils
Add macros for stringifying either the name of a macro or its value.  In
keeping with making our utils/ sort of look like kernel code, we use the
kernel stringify names.

Signed-off-by: Zach Brown <zab@versity.com>
2021-02-18 12:57:30 -08:00
Zach Brown
8e34c5d66a Use quorum slots and background election work
Previously quorum configuration specified the number of votes needed to
elected the leader.  This was an excessive amount of freedom in the
configuration of the cluster which created all sorts of problems which
had to be designed around.

Most acutely, though, it required a probabilistic mechanism for mounts
to persistently record that they're starting a server so that future
servers could find and possibly fence them.  They would write to a lot
of quorum blocks and trust that it was unlikely that future servers
would overwrite all of their written blocks.  Overwriting was always
possible, which would be bad enough, but it also required so much IO
that we had to use long election timeouts to avoid spurious fencing.
These longer timeouts had already gone wrong on some storage
configurations, leading to hung mounts.

To fix this and other problems we see coming, like live membership
changes, we now specifically configure the number and identity of mounts
which will be participating in quorum voting.  With specific identities,
mounts now have a corresponding specific block they can write to and
which future servers can read from to see if they're still running.

We change the quorum config in the super block from a single
quorum_count to an array of quorum slots which specify the address of
the mount that is assigned to that slot.  The mount argument to specify
a quorum voter changes from "server_addr=$addr" to "quorum_slot_nr=$nr"
which specifies the mount's slot.  The slot's address is used for udp
election messages and tcp server connections.

Now that we specifically have configured unique IP addresses for all the
quorum members, we can use UDP messages to send and receive the vote
mesages in the raft protocol to elect a leader.  The quorum code doesn't
have to read and write disk block votes and is a more reasonable core
loop that either waits for received network messages or timeouts to
advance the raft election state machine.

The quorum blocks are now used for slots to store their persistent raft
term and to set their leader state.  We have event fields in the block
to record the timestamp of the most recent interesting events that
happened to the slot.

Now that raft doesn't use IO, we can leave the quorum election work
running in the background.  The raft work in the quorum members is
always running so we can use a much more typical raft implementation
with heartbeats.  Critically, this decouples the client and election
life cycles.  Quorum is always running and is responsible for starting
and stopping the server.  The client repeatedly tries to connect to a
server, it has nothing to do with deciding to participate in quorum.

Finally, we add a quorum/status sysfs file which shows the state of the
quorum raft protocol in a member mount and has the last messages that
were sent to or received from the other members.

Signed-off-by: Zach Brown <zab@versity.com>
2021-02-18 12:57:30 -08:00
Zach Brown
1c7bbd6260 More accurately describe unmounting quorum members
As a client unmounts it sends a farewell request to the server.  We have
to carefully manage unmounting the final quorum members so that there is
always a remaining quorum to elect a leader to start a server to process
all their farewell requests.

The mechanism for doing this described these clients as "voters".
That's not really right, in our terminology voters and candidates are
temporary roles taken on by members during a specific election term in
the raft protocol.  It's more accurate to describe the final set of
clients as quorum members.  They can be voters or candidates depending
on how the raft protocol timeouts workout in any given election.

So we rename the greeting flag, mounted client flag, and the code and
comments on either side of the client and server to be a little clearer.

This only changes symbols and comments, there should be no functional
change.

Signed-off-by: Zach Brown <zab@versity.com>
2021-02-11 15:47:39 -08:00
Zach Brown
3ad18b0f3b Update super blkno field tests for meta device
As we read the super we check the first and last meta and data blkno
fields.  The tests weren't updated as we moved from one device to two
metadata and data devices.

Add a helper that tests the range for the device and test both meta and
data ranges fully, instead of only testing the endpoints of each and
assuming they're related because they're living on one device.

Signed-off-by: Zach Brown <zab@versity.com>
2021-02-11 15:47:29 -08:00
Andy Grover
79cd7a499b Merge pull request #17 from versity/zab/disable_mount_unmount_test
Disable mount-unmount-race test
2021-02-01 10:09:26 -08:00
Zach Brown
6ad18769cb Disable mount-unmount-race test
The mount-unmount-race test is occasionally hanging, disable it while we
debug it and have test coverage for unrelated work.

Signed-off-by: Zach Brown <zab@versity.com>
2021-02-01 10:07:47 -08:00
Zach Brown
49d82fcaaf Merge pull request #14 from agrover/fix-jira-202
utils: Do not assert if release is given unaligned offset or length
2021-02-01 09:46:01 -08:00
Zach Brown
e4e12c1968 Merge pull request #15 from agrover/radix-block
Remove unused radix_block struct
2021-02-01 09:24:59 -08:00
Andy Grover
15fd2ccc02 utils: Do not assert if release is given unaligned offset or length
This is checked for by the kernel ioctl code, so giving unaligned values
will return an error, instead of aborting with an assert.

Signed-off-by: Andy Grover <agrover@versity.com>
2021-01-29 09:30:57 -08:00
Andy Grover
eea95357d3 Remove unused radix_block struct
Signed-off-by: Andy Grover <agrover@versity.com>
2021-01-26 16:07:05 -08:00
Andy Grover
9842c5d13e Merge pull request #13 from versity/zab/multi_mount_test_fixes
Zab/multi mount test fixes
2021-01-26 15:56:33 -08:00
Zach Brown
ade539217e Handle advance_seq being replayed in new server
As a core principle, all server message processing needs to be safe to
replay as servers shut down and requests are resent to new servers.

The advance_seq handler got this wrong.  It would only try to remove a
trans_seq item for the seq sent by the client before inserting a new
item for the next seq.  This change could be committed before the reply
was lost as the server shuts down.  The next server would process the
resent request but wouldn't find the old item for the seq that the
client sent, and would ignore the new item that the previous server
inserted.  It would then insert another greater seq for the same client.

This would leave behind a stale old trans_seq that would be returned as
the last_seq which would forever limit the results that could be
returned from the seq index walks.

This fix is to always remove all previous seq items for the client
before inserting a new one.  This creates O(clients) server work, but
it's minimal.

This manifest as occasional simple-inode-index test failures (say 1 in
5?) which would trigger if the unmounts during previous tests would
happen to have advance_seq resent across server shutdowns.  With this
change the test now reliably passes.

Signed-off-by: Zach Brown <zab@versity.com>
2021-01-26 14:46:07 -08:00
Zach Brown
5a90234c94 Use terminated test name when saving passed stats
We've grown some test names that are prefixes of others
(createmany-parallel, createmany-parallel-mounts).  When we're searching
for lines with the test name we have to search for the exact test name,
by terminating the name with a space, instead of searching for a line
that starts with the test name.

This fixes strange output and saved passed stats for the names that
share a prefix.

Signed-off-by: Zach Brown <zab@versity.com>
2021-01-26 14:46:07 -08:00
Zach Brown
f81e4cb98a Add whitespace to xfstests output message
The message indicating that xfstests output was now being shown was
mashed up against the previous passed stats and it was gross and I hated
it.

Signed-off-by: Zach Brown <zab@versity.com>
2021-01-26 14:46:07 -08:00
Zach Brown
1fc706bf3f Filter hrtimer slow messages from dmesg
When running in debug kernels in guests we can really bog down things
enough to trigger hrtimer warnings.  I don't think there's much we can
reasonably do about that.

Signed-off-by: Zach Brown <zab@versity.com>
2021-01-26 14:46:07 -08:00
Zach Brown
e9c3aa6501 More carefully cancel server farewell work
Farewell work is queued by farewell message processing.  Server shutdown
didn't properly wait for pending farewell work to finish before tearing
down.  As the server work destroyed the server's connection the farewell
work could stlil be running and try to send responses down the socket.

We make the server more carefully avoid queueuing farewell work if it's
in the process of shutting down and wait for farewell work to finish
before destroying the server's resources.

This fixed all manner of crashes that were seen in testing when a bunch
of nodes unmounted, creating farewell work on the server as it itself
unmounted and destroyed the server.

Signed-off-by: Zach Brown <zab@versity.com>
2021-01-26 14:46:07 -08:00
Zach Brown
d39268bbc1 Fix spurious EIO from scoutfs_srch_get_compact
scoutfs_srch_get_compact() is building up a compaction request which has
a list of srch files to read and sort and write into a new srch file.
It finds input files by searching for a sufficient number of similar
files: first any unsorted log files and then sorted log files that are
around the same size.

It finds the files by using btree next on the srch zone which has types
for unsorted srch log files, sorted srch files, but also pending and
busy compaction items.

It was being far too cute about iterating over different key types.  It
was trying to adapt to finding the next key and was making assumptions
about the order of key types.  It didn't notice that the pending and
busy key types followed log and sorted and would generate EIO when it
ran into them and found their value length didn't match what it was
expecting.

Rework the next item ref parsing so that it returns -ENOENT if it gets
an unexpected key type, then look for the next key type when checking
enoent.

Signed-off-by: Zach Brown <zab@versity.com>
2021-01-26 14:46:07 -08:00
Zach Brown
35ed1a2438 Add t_require_meta_size function
Add a function that tests can use to skip when the metadata device isn't
large enough.  I thought we needed to avoid enospc in a particular test,
but it turns out the test's failure was unrelated.  So this isn't used
for now but it seems nice to keep around.

Signed-off-by: Zach Brown <zab@versity.com>
2021-01-26 14:46:07 -08:00
Zach Brown
32e7978a6e Extend lock invalidate grace period
The grace period is intended to let lock holders squeeze in more bulk
work before another node pulls the lock out from under them.  The length
of the delay is a balance between getting more work done per lock hold
and adding latency to ping-ponging workloads.

The current grace period was too short.  To do work in the conflicting
case you often have to read the result that the other mount wrote as you
invalidated their lock.  The test was written in the LSM world where
we'd effectively read a single level 0 1MB segment.  In the btree world
we're checking bloom blocks and reading the other mount's btree.  It has
more dependent read latency.

So we turn up the grace period to let conflicting readers squeeze in
more work before pulling the lock out from under them.  This value was
chosen to make lock-conflicting-batch-commit pass in guests sharing nvme
metadata devices in debugging kernels.

Signed-off-by: Zach Brown <zab@versity.com>
2021-01-26 14:46:07 -08:00
Zach Brown
8123b8fc35 fix lock-conflicting-batch-commit conf output
The test had a silly typo in the label it put on the time it took mounts
to perform conflicting metadata changes.

Signed-off-by: Zach Brown <zab@versity.com>
2021-01-26 14:46:07 -08:00
Zach Brown
da5911c311 Use d_materialise_unique to splice dir dentries
When we're splicing in dentries in lookup we can be splicing the result
of changes on other nodes into a stale dcache.  The stale dcache might
contain dir entries and the dcache does not allow aliased directories.

Use d_materialise_unique() to splice in dir inodes so that we remove all
aliased dentries which must be stale.

We can still use d_splice_alias() for all other inode types.  Any
existing stale dentries will fail revalidation before they're used.

Signed-off-by: Zach Brown <zab@versity.com>
2021-01-26 14:46:07 -08:00
Zach Brown
098fc420be Add some item cache page tracing
Signed-off-by: Zach Brown <zab@versity.com>
2021-01-26 14:46:07 -08:00
Zach Brown
7a96537210 Leave mounts mounted if run-tests fails
We can lose interesting state if the mounts are unmounted as tests fail,
only unmount if all the tests pass.

Signed-off-by: Zach Brown <zab@versity.com>
2021-01-26 14:46:07 -08:00
Zach Brown
0607dfdac8 Enable and collect trace_printk
Weirdly, run-tests was treating trace_printk not as an option to enable
trace_printk() traces but as an option to print trace events to the
console with printk?  That's not a thing.

Make -P really enable trace_printk tracing and collect it as it would
enabled trace events.  It needs to be treated seperately from the -t
options that enable trace events.

While we're at it treat the -P trace dumping option as a stand-alone
option that works without -t arguments.

Signed-off-by: Zach Brown <zab@versity.com>
2021-01-26 14:46:07 -08:00
Zach Brown
0354bb64c5 More carefully enable tracing in run-tests
run-tests.sh has a -t argument which takes a whitespace seperated string
of globs of events to enable.  This was hard to use and made it very
easy to accidentally expand the globs at the wrong place in the script.

This makes each -t argument specify a single word glob which is stored
in an array so the glob isn't expanded until it's applied to the trace
event path.   We also add an error for -t globs that didn't match any
events and add a message with the count of -t arguments and enabled
events.

Signed-off-by: Zach Brown <zab@versity.com>
2021-01-26 14:46:07 -08:00
Zach Brown
631801c45c Don't queue lock invalidation work during shutdown
The lock invalidation work function needs to be careful not to requeue
itself while we're shutting down or we can be left with invalidation
functions racing with shutdown.  Invalidation calls igrab so we can end
up with unmount warning that there are still inodes in use.

Signed-off-by: Zach Brown <zab@versity.com>
2021-01-26 14:46:07 -08:00
Zach Brown
47a1ac92f7 Update ino-path args in basic-posix-consistency
The ino-path calls in basic-posix-consistency weren't updated for the
recent change to scoutfs cli args.

Signed-off-by: Zach Brown <zab@versity.com>
2021-01-26 14:45:23 -08:00
Zach Brown
004f693af3 Add golden output for mount-unmount-race test
Signed-off-by: Zach Brown <zab@versity.com>
2021-01-25 14:19:35 -08:00
Andy Grover
f271a5d140 Merge pull request #12 from versity/zab/andys_fallocate_fix_minor_cleanup
Retry if transaction cannot alloc for fallocate or write
2021-01-25 12:52:14 -08:00
Andy Grover
355eac79d2 Retry if transaction cannot alloc for fallocate or write
Add a new distinguishable return value (ENOBUFS) from allocator for if
the transaction cannot alloc space. This doesn't mean the filesystem is
full -- opening a new transaction may result in forward progress.

Alter fallocate and get_blocks code to check for this err val and retry
with a new transaction. Handling actual ENOSPC can still happen, of
course.

Add counter called "alloc_trans_retry" and increment it from both spots.

Signed-off-by: Andy Grover <agrover@versity.com>
[zab@versity.com: fixed up write_begin error paths]
2021-01-25 09:32:01 -08:00
Zach Brown
d8b4e94854 Merge pull request #10 from agrover/rm-item-accounting
Remove item accounting
2021-01-21 09:57:53 -08:00
Andy Grover
bed33c7ffd Remove item accounting
Remove kmod/src/count.h
Remove scoutfs_trans_track_item()
Remove reserved/actual fields from scoutfs_reservation

Signed-off-by: Andy Grover <agrover@versity.com>
2021-01-20 17:01:08 -08:00
Andy Grover
b370730029 Merge pull request #11 from versity/zab/item_cache_memory_corruption
Fix item cache page memory corruption
2021-01-20 10:27:20 -08:00
Zach Brown
d64dd89ead Fix item cache page memory corruption
The item cache page life cycle is tricky.  There are no proper page
reference counts, everthing is done by nesting the page rwlock inside
item_cache_info rwlock.  The intent is that you can only reference pages
while you hold the rwlocks appropriately.  The per-cpu page references
are outside that locking regime so they add a reference count.  Now
there are reference counts for the main cache index reference and for
each per-cpu reference.

The end result of all this is that you can only reference pages outside
of locks if you're protected by references.

Lock invalidation messed this up by trying to add its right split page
to the lru after it was unlocked.  Its page reference wasn't protected
at this point.  Shrinking could be freeing that page, and so it could be
putting a freed page's memory back on the lru.

Shrinking had a little bug that it was using list_move to move an
initialized lru_head list_head.  It turns out to be harmless (list_del
will just follow pointers to itself and set itself as next and prev all
over again), but boy does it catch one's eye.  Let's remove all
confusion and drop the reference while holding the cinf->rwlock instead
of trying to optimize freeing outside locks.

Finally, the big one: inserting a read item after compacting the page to
make room was inserting into stale parent pointers into the old
pre-compacted page, rather than the new page that was swapped in by
compaction.  This left references to a freed page in the page rbtree and
hilarity ensued.

Signed-off-by: Zach Brown <zab@versity.com>
2021-01-20 09:02:29 -08:00
Zach Brown
8d81196e01 Merge pull request #7 from agrover/versioning
Filesystem version instead of format hash check
2021-01-19 11:55:32 -08:00
Andy Grover
d731c1577e Filesystem version instead of format hash check
Instead of hashing headers, define an interop version. Do not mount
superblocks that have a different version, either higher or lower.

Since this is pretty much the same as the format hash except it's a
constant, minimal code changes are needed.

Initial dev version is 0, with the intent that version will be bumped to
1 immediately prior to tagging initial release version.

Update README. Fix comments.

Add interop version to notes and modinfo.

Signed-off-by: Andy Grover <agrover@versity.com>
2021-01-15 10:53:00 -08:00
Andy Grover
a421bb0884 Merge pull request #5 from versity/zab/move_blocks_ioctl
Zab/move blocks ioctl
2021-01-14 16:18:45 -08:00
Zach Brown
773eb129ed Add move-blocks test
Add a basic test of the move_blocks ioctl.

Signed-off-by: Zach Brown <zab@versity.com>
2021-01-14 13:42:22 -08:00
Zach Brown
eb3981c103 Add move-blocks scoutfs cli command
Add a move-blocks command that translates arguments and calls the
MOVE_BLOCKS ioctl.

Signed-off-by: Zach Brown <zab@versity.com>
2021-01-14 13:42:22 -08:00