scoutfs

mirror of https://github.com/versity/scoutfs.git synced 2026-07-20 15:02:21 +00:00

Author	SHA1	Message	Date
Zach Brown	8e34c5d66a	Use quorum slots and background election work Previously quorum configuration specified the number of votes needed to elect the leader. This was an excessive amount of freedom in the configuration of the cluster which created all sorts of problems which had to be designed around. Most acutely, though, it required a probabilistic mechanism for mounts to persistently record that they're starting a server so that future servers could find and possibly fence them. They would write to a lot of quorum blocks and trust that it was unlikely that future servers would overwrite all of their written blocks. Overwriting was always possible, which would be bad enough, but it also required so much IO that we had to use long election timeouts to avoid spurious fencing. These longer timeouts had already gone wrong on some storage configurations, leading to hung mounts. To fix this and other problems we see coming, like live membership changes, we now specifically configure the number and identity of mounts which will be participating in quorum voting. With specific identities, mounts now have a corresponding specific block they can write to and which future servers can read from to see if they're still running. We change the quorum config in the super block from a single quorum_count to an array of quorum slots which specify the address of the mount that is assigned to that slot. The mount argument to specify a quorum voter changes from "server_addr=$addr" to "quorum_slot_nr=$nr" which specifies the mount's slot. The slot's address is used for udp election messages and tcp server connections. Now that we specifically have configured unique IP addresses for all the quorum members, we can use UDP messages to send and receive the vote messages in the raft protocol to elect a leader.
The quorum code doesn't have to read and write disk block votes and is a more reasonable core loop that either waits for received network messages or timeouts to advance the raft election state machine. The quorum blocks are now used for slots to store their persistent raft term and to set their leader state. We have event fields in the block to record the timestamp of the most recent interesting events that happened to the slot. Now that raft doesn't use IO, we can leave the quorum election work running in the background. The raft work in the quorum members is always running so we can use a much more typical raft implementation with heartbeats. Critically, this decouples the client and election life cycles. Quorum is always running and is responsible for starting and stopping the server. The client repeatedly tries to connect to a server, it has nothing to do with deciding to participate in quorum. Finally, we add a quorum/status sysfs file which shows the state of the quorum raft protocol in a member mount and has the last messages that were sent to or received from the other members. Signed-off-by: Zach Brown <zab@versity.com>	2021-02-18 12:57:30 -08:00
Zach Brown	1c7bbd6260	More accurately describe unmounting quorum members As a client unmounts it sends a farewell request to the server. We have to carefully manage unmounting the final quorum members so that there is always a remaining quorum to elect a leader to start a server to process all their farewell requests. The mechanism for doing this described these clients as "voters". That's not really right, in our terminology voters and candidates are temporary roles taken on by members during a specific election term in the raft protocol. It's more accurate to describe the final set of clients as quorum members. They can be voters or candidates depending on how the raft protocol timeouts work out in any given election. So we rename the greeting flag, mounted client flag, and the code and comments on either side of the client and server to be a little clearer. This only changes symbols and comments, there should be no functional change. Signed-off-by: Zach Brown <zab@versity.com>	2021-02-11 15:47:39 -08:00
Zach Brown	3ad18b0f3b	Update super blkno field tests for meta device As we read the super we check the first and last meta and data blkno fields. The tests weren't updated as we moved from one device to two metadata and data devices. Add a helper that tests the range for the device and test both meta and data ranges fully, instead of only testing the endpoints of each and assuming they're related because they're living on one device. Signed-off-by: Zach Brown <zab@versity.com>	2021-02-11 15:47:29 -08:00
Andy Grover	79cd7a499b	Merge pull request #17 from versity/zab/disable_mount_unmount_test Disable mount-unmount-race test	2021-02-01 10:09:26 -08:00
Zach Brown	6ad18769cb	Disable mount-unmount-race test The mount-unmount-race test is occasionally hanging, disable it while we debug it and have test coverage for unrelated work. Signed-off-by: Zach Brown <zab@versity.com>	2021-02-01 10:07:47 -08:00
Zach Brown	49d82fcaaf	Merge pull request #14 from agrover/fix-jira-202 utils: Do not assert if release is given unaligned offset or length	2021-02-01 09:46:01 -08:00
Zach Brown	e4e12c1968	Merge pull request #15 from agrover/radix-block Remove unused radix_block struct	2021-02-01 09:24:59 -08:00
Andy Grover	15fd2ccc02	utils: Do not assert if release is given unaligned offset or length This is checked for by the kernel ioctl code, so giving unaligned values will return an error, instead of aborting with an assert. Signed-off-by: Andy Grover <agrover@versity.com>	2021-01-29 09:30:57 -08:00
Andy Grover	eea95357d3	Remove unused radix_block struct Signed-off-by: Andy Grover <agrover@versity.com>	2021-01-26 16:07:05 -08:00
Andy Grover	9842c5d13e	Merge pull request #13 from versity/zab/multi_mount_test_fixes Zab/multi mount test fixes	2021-01-26 15:56:33 -08:00
Zach Brown	ade539217e	Handle advance_seq being replayed in new server As a core principle, all server message processing needs to be safe to replay as servers shut down and requests are resent to new servers. The advance_seq handler got this wrong. It would only try to remove a trans_seq item for the seq sent by the client before inserting a new item for the next seq. This change could be committed before the reply was lost as the server shuts down. The next server would process the resent request but wouldn't find the old item for the seq that the client sent, and would ignore the new item that the previous server inserted. It would then insert another greater seq for the same client. This would leave behind a stale old trans_seq that would be returned as the last_seq which would forever limit the results that could be returned from the seq index walks. The fix is to always remove all previous seq items for the client before inserting a new one. This creates O(clients) server work, but it's minimal. This manifested as occasional simple-inode-index test failures (say 1 in 5?) which would trigger if the unmounts during previous tests would happen to have advance_seq resent across server shutdowns. With this change the test now reliably passes. Signed-off-by: Zach Brown <zab@versity.com>	2021-01-26 14:46:07 -08:00
Zach Brown	5a90234c94	Use terminated test name when saving passed stats We've grown some test names that are prefixes of others (createmany-parallel, createmany-parallel-mounts). When we're searching for lines with the test name we have to search for the exact test name, by terminating the name with a space, instead of searching for a line that starts with the test name. This fixes strange output and saved passed stats for the names that share a prefix. Signed-off-by: Zach Brown <zab@versity.com>	2021-01-26 14:46:07 -08:00
Zach Brown	f81e4cb98a	Add whitespace to xfstests output message The message indicating that xfstests output was now being shown was mashed up against the previous passed stats and it was gross and I hated it. Signed-off-by: Zach Brown <zab@versity.com>	2021-01-26 14:46:07 -08:00
Zach Brown	1fc706bf3f	Filter hrtimer slow messages from dmesg When running in debug kernels in guests we can really bog down things enough to trigger hrtimer warnings. I don't think there's much we can reasonably do about that. Signed-off-by: Zach Brown <zab@versity.com>	2021-01-26 14:46:07 -08:00
Zach Brown	e9c3aa6501	More carefully cancel server farewell work Farewell work is queued by farewell message processing. Server shutdown didn't properly wait for pending farewell work to finish before tearing down. As the server work destroyed the server's connection the farewell work could still be running and try to send responses down the socket. We make the server more carefully avoid queueing farewell work if it's in the process of shutting down and wait for farewell work to finish before destroying the server's resources. This fixed all manner of crashes that were seen in testing when a bunch of nodes unmounted, creating farewell work on the server as it itself unmounted and destroyed the server. Signed-off-by: Zach Brown <zab@versity.com>	2021-01-26 14:46:07 -08:00
Zach Brown	d39268bbc1	Fix spurious EIO from scoutfs_srch_get_compact scoutfs_srch_get_compact() is building up a compaction request which has a list of srch files to read and sort and write into a new srch file. It finds input files by searching for a sufficient number of similar files: first any unsorted log files and then sorted log files that are around the same size. It finds the files by using btree next on the srch zone which has types for unsorted srch log files, sorted srch files, but also pending and busy compaction items. It was being far too cute about iterating over different key types. It was trying to adapt to finding the next key and was making assumptions about the order of key types. It didn't notice that the pending and busy key types followed log and sorted and would generate EIO when it ran into them and found their value length didn't match what it was expecting. Rework the next item ref parsing so that it returns -ENOENT if it gets an unexpected key type, then look for the next key type when checking enoent. Signed-off-by: Zach Brown <zab@versity.com>	2021-01-26 14:46:07 -08:00
Zach Brown	35ed1a2438	Add t_require_meta_size function Add a function that tests can use to skip when the metadata device isn't large enough. I thought we needed to avoid enospc in a particular test, but it turns out the test's failure was unrelated. So this isn't used for now but it seems nice to keep around. Signed-off-by: Zach Brown <zab@versity.com>	2021-01-26 14:46:07 -08:00
Zach Brown	32e7978a6e	Extend lock invalidate grace period The grace period is intended to let lock holders squeeze in more bulk work before another node pulls the lock out from under them. The length of the delay is a balance between getting more work done per lock hold and adding latency to ping-ponging workloads. The current grace period was too short. To do work in the conflicting case you often have to read the result that the other mount wrote as you invalidated their lock. The test was written in the LSM world where we'd effectively read a single level 0 1MB segment. In the btree world we're checking bloom blocks and reading the other mount's btree. It has more dependent read latency. So we turn up the grace period to let conflicting readers squeeze in more work before pulling the lock out from under them. This value was chosen to make lock-conflicting-batch-commit pass in guests sharing nvme metadata devices in debugging kernels. Signed-off-by: Zach Brown <zab@versity.com>	2021-01-26 14:46:07 -08:00
Zach Brown	8123b8fc35	fix lock-conflicting-batch-commit conf output The test had a silly typo in the label it put on the time it took mounts to perform conflicting metadata changes. Signed-off-by: Zach Brown <zab@versity.com>	2021-01-26 14:46:07 -08:00
Zach Brown	da5911c311	Use d_materialise_unique to splice dir dentries When we're splicing in dentries in lookup we can be splicing the result of changes on other nodes into a stale dcache. The stale dcache might contain dir entries and the dcache does not allow aliased directories. Use d_materialise_unique() to splice in dir inodes so that we remove all aliased dentries which must be stale. We can still use d_splice_alias() for all other inode types. Any existing stale dentries will fail revalidation before they're used. Signed-off-by: Zach Brown <zab@versity.com>	2021-01-26 14:46:07 -08:00
Zach Brown	098fc420be	Add some item cache page tracing Signed-off-by: Zach Brown <zab@versity.com>	2021-01-26 14:46:07 -08:00
Zach Brown	7a96537210	Leave mounts mounted if run-tests fails We can lose interesting state if the mounts are unmounted as tests fail, only unmount if all the tests pass. Signed-off-by: Zach Brown <zab@versity.com>	2021-01-26 14:46:07 -08:00
Zach Brown	0607dfdac8	Enable and collect trace_printk Weirdly, run-tests was treating trace_printk not as an option to enable trace_printk() traces but as an option to print trace events to the console with printk? That's not a thing. Make -P really enable trace_printk tracing and collect it as it would enabled trace events. It needs to be treated separately from the -t options that enable trace events. While we're at it treat the -P trace dumping option as a stand-alone option that works without -t arguments. Signed-off-by: Zach Brown <zab@versity.com>	2021-01-26 14:46:07 -08:00
Zach Brown	0354bb64c5	More carefully enable tracing in run-tests run-tests.sh has a -t argument which takes a whitespace separated string of globs of events to enable. This was hard to use and made it very easy to accidentally expand the globs at the wrong place in the script. This makes each -t argument specify a single word glob which is stored in an array so the glob isn't expanded until it's applied to the trace event path. We also add an error for -t globs that didn't match any events and add a message with the count of -t arguments and enabled events. Signed-off-by: Zach Brown <zab@versity.com>	2021-01-26 14:46:07 -08:00
Zach Brown	631801c45c	Don't queue lock invalidation work during shutdown The lock invalidation work function needs to be careful not to requeue itself while we're shutting down or we can be left with invalidation functions racing with shutdown. Invalidation calls igrab so we can end up with unmount warning that there are still inodes in use. Signed-off-by: Zach Brown <zab@versity.com>	2021-01-26 14:46:07 -08:00
Zach Brown	47a1ac92f7	Update ino-path args in basic-posix-consistency The ino-path calls in basic-posix-consistency weren't updated for the recent change to scoutfs cli args. Signed-off-by: Zach Brown <zab@versity.com>	2021-01-26 14:45:23 -08:00
Zach Brown	004f693af3	Add golden output for mount-unmount-race test Signed-off-by: Zach Brown <zab@versity.com>	2021-01-25 14:19:35 -08:00
Andy Grover	f271a5d140	Merge pull request #12 from versity/zab/andys_fallocate_fix_minor_cleanup Retry if transaction cannot alloc for fallocate or write	2021-01-25 12:52:14 -08:00
Andy Grover	355eac79d2	Retry if transaction cannot alloc for fallocate or write Add a new distinguishable return value (ENOBUFS) from allocator for if the transaction cannot alloc space. This doesn't mean the filesystem is full -- opening a new transaction may result in forward progress. Alter fallocate and get_blocks code to check for this err val and retry with a new transaction. Handling actual ENOSPC can still happen, of course. Add counter called "alloc_trans_retry" and increment it from both spots. Signed-off-by: Andy Grover <agrover@versity.com> [zab@versity.com: fixed up write_begin error paths]	2021-01-25 09:32:01 -08:00
Zach Brown	d8b4e94854	Merge pull request #10 from agrover/rm-item-accounting Remove item accounting	2021-01-21 09:57:53 -08:00
Andy Grover	bed33c7ffd	Remove item accounting Remove kmod/src/count.h Remove scoutfs_trans_track_item() Remove reserved/actual fields from scoutfs_reservation Signed-off-by: Andy Grover <agrover@versity.com>	2021-01-20 17:01:08 -08:00
Andy Grover	b370730029	Merge pull request #11 from versity/zab/item_cache_memory_corruption Fix item cache page memory corruption	2021-01-20 10:27:20 -08:00
Zach Brown	d64dd89ead	Fix item cache page memory corruption The item cache page life cycle is tricky. There are no proper page reference counts, everything is done by nesting the page rwlock inside item_cache_info rwlock. The intent is that you can only reference pages while you hold the rwlocks appropriately. The per-cpu page references are outside that locking regime so they add a reference count. Now there are reference counts for the main cache index reference and for each per-cpu reference. The end result of all this is that you can only reference pages outside of locks if you're protected by references. Lock invalidation messed this up by trying to add its right split page to the lru after it was unlocked. Its page reference wasn't protected at this point. Shrinking could be freeing that page, and so it could be putting a freed page's memory back on the lru. Shrinking had a little bug that it was using list_move to move an initialized lru_head list_head. It turns out to be harmless (list_del will just follow pointers to itself and set itself as next and prev all over again), but boy does it catch one's eye. Let's remove all confusion and drop the reference while holding the cinf->rwlock instead of trying to optimize freeing outside locks. Finally, the big one: inserting a read item after compacting the page to make room was inserting into stale parent pointers into the old pre-compacted page, rather than the new page that was swapped in by compaction. This left references to a freed page in the page rbtree and hilarity ensued. Signed-off-by: Zach Brown <zab@versity.com>	2021-01-20 09:02:29 -08:00
Zach Brown	8d81196e01	Merge pull request #7 from agrover/versioning Filesystem version instead of format hash check	2021-01-19 11:55:32 -08:00
Andy Grover	d731c1577e	Filesystem version instead of format hash check Instead of hashing headers, define an interop version. Do not mount superblocks that have a different version, either higher or lower. Since this is pretty much the same as the format hash except it's a constant, minimal code changes are needed. Initial dev version is 0, with the intent that version will be bumped to 1 immediately prior to tagging initial release version. Update README. Fix comments. Add interop version to notes and modinfo. Signed-off-by: Andy Grover <agrover@versity.com>	2021-01-15 10:53:00 -08:00
Andy Grover	a421bb0884	Merge pull request #5 from versity/zab/move_blocks_ioctl Zab/move blocks ioctl	2021-01-14 16:18:45 -08:00
Zach Brown	773eb129ed	Add move-blocks test Add a basic test of the move_blocks ioctl. Signed-off-by: Zach Brown <zab@versity.com>	2021-01-14 13:42:22 -08:00
Zach Brown	eb3981c103	Add move-blocks scoutfs cli command Add a move-blocks command that translates arguments and calls the MOVE_BLOCKS ioctl. Signed-off-by: Zach Brown <zab@versity.com>	2021-01-14 13:42:22 -08:00
Zach Brown	3139d3ea68	Add move_blocks ioctl Add a relatively constrained ioctl that moves extents between regular files. This is intended to be used by tasks which combine many existing files into a much larger file without reading and writing all the file contents. Signed-off-by: Zach Brown <zab@versity.com>	2021-01-14 13:42:22 -08:00
Zach Brown	4da3d47601	Move ALLOC_DETAIL ioctl definition By convention we have the _IO* ioctl definition after the argument structs and ALLOC_DETAIL got it a bit wrong so move it down. Signed-off-by: Zach Brown <zab@versity.com>	2021-01-14 13:42:22 -08:00
Zach Brown	aa1b1fa34f	Add util.h for kernel helpers Add a little header for inline convenience functions. Signed-off-by: Zach Brown <zab@versity.com>	2021-01-14 13:42:22 -08:00
Zach Brown	8fcc9095e6	Merge pull request #6 from agrover/super Fix mkfs check for existing ScoutFS superblock	2021-01-14 08:57:53 -08:00
Andy Grover	299062a456	Fix mkfs check for existing ScoutFS superblock We were checking for the wrong magic value. We now need to use -f when running mkfs in run-tests for things to work. Signed-off-by: Andy Grover <agrover@versity.com>	2021-01-13 16:32:41 -08:00
Andy Grover	7cac1e7136	Merge pull request #1 from agrover/use-argp Rework scoutfs command-line parsing	2021-01-13 11:14:08 -08:00
Andy Grover	454dbebf59	Categorize not enough mounts as skip, not fail Signed-off-by: Andy Grover <agrover@versity.com>	2021-01-12 16:29:42 -08:00
Andy Grover	2c5871c253	Change release ioctl to be denominated in bytes not blocks This more closely matches stage ioctl and other conventions. Also change release code to use offset/length nomenclature for consistency. Signed-off-by: Andy Grover <agrover@versity.com>	2021-01-12 16:29:42 -08:00
Andy Grover	64a698aa93	Make changes to tests for new scoutfs cmdline syntax Some different error messages require changes to golden/* Signed-off-by: Andy Grover <agrover@versity.com>	2021-01-12 16:29:42 -08:00
Andy Grover	d48b447e75	Do not set -Wpadded except for checking kmod-shared headers Remove now-unneeded manual padding in arg structs. Signed-off-by: Andy Grover <agrover@versity.com>	2021-01-12 16:29:42 -08:00
Andy Grover	5241bba7f6	Update scoutfs.8 man page Update for cli args and options changes. Reorder subcommands to match scoutfs built-in help. Consistent ScoutFS capitalization. Tighten up some descriptions and verbiage for consistency and omit descriptions of internals in a few spots. Add SEE ALSO for blockdev(8) and wipefs(8). Signed-off-by: Andy Grover <agrover@versity.com>	2021-01-12 16:29:42 -08:00
Andy Grover	e0a2175c2e	Use argp info instead of duplicating for cmd_register() Make it static and then use it both for argp_parse as well as cmd_register_argp. Split commands into five groups, to help understanding of their usefulness. Mention that each command has its own help text, and that we are being fancy to keep the user from having to give fs path. Signed-off-by: Andy Grover <agrover@versity.com>	2021-01-12 16:29:42 -08:00
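
Note on ade539217e (Handle advance_seq being replayed in new server): the replay hazard described in that message can be illustrated with a small model. This is only a sketch in Python, not the scoutfs kernel code. The names SeqState, advance_seq_remove_one, advance_seq_remove_all and replay_scenario, the use of a set to hold trans_seq items, and the simple counter for the next seq are all assumptions made for illustration.

```python
class SeqState:
    """Hypothetical stand-in for persistent state that outlives any one server."""

    def __init__(self):
        self.next_seq = 1
        self.trans_seqs = set()  # (seq, client_id) items

    def insert_next(self, client_id):
        """Allocate the next seq and record a trans_seq item for the client."""
        seq = self.next_seq
        self.next_seq += 1
        self.trans_seqs.add((seq, client_id))
        return seq


def advance_seq_remove_one(state, client_id, sent_seq):
    """Old behaviour as described: remove only the item for the seq the client sent."""
    state.trans_seqs.discard((sent_seq, client_id))
    return state.insert_next(client_id)


def advance_seq_remove_all(state, client_id, sent_seq):
    """Fixed behaviour as described: remove every item the client has, then insert.

    sent_seq is unused here; it is kept so both handlers share a signature.
    """
    state.trans_seqs = {(s, c) for (s, c) in state.trans_seqs if c != client_id}
    return state.insert_next(client_id)


def replay_scenario(handler):
    """Commit one advance_seq, lose the reply, then replay the same request."""
    state = SeqState()
    first = state.insert_next("client-a")  # the client's current seq
    handler(state, "client-a", first)      # committed, but the reply is lost
    handler(state, "client-a", first)      # resent request handled by the next server
    return sorted(state.trans_seqs)


if __name__ == "__main__":
    print("remove one:", replay_scenario(advance_seq_remove_one))
    print("remove all:", replay_scenario(advance_seq_remove_all))
```

In this model the first handler leaves two items for the same client after the replay, and the lower one is the stale trans_seq that would hold back last_seq. The second handler leaves exactly one item, at the cost of scanning the client's existing items on every request.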


1274 Commits