scoutfs

mirror of https://github.com/versity/scoutfs.git synced 2026-07-25 17:42:50 +00:00

Author	SHA1	Message	Date
Zach Brown and GitHub	fb2ff753ad	Merge pull request #83 from versity/zab/heartbeat_during_fencing Send quorum heartbeats while fencing	2022-04-01 09:12:41 -07:00
Zach Brown	bb3db7e272	Send quorum heartbeats while fencing Quorum members will try to elect a new leader when they don't receive heartbeats from the currently elected leader. This timeout is short to encourage restoring service promptly. Heartbeats are sent from the quorum worker thread and are delayed while it synchronously starts up the server, which includes fencing previous servers. If fence requests take too long then heartbeats will be delayed long enough for remaining quorum members to elect a new leader while the recently elected server is still busy fencing. To fix this we decouple server startup from the quorum main thread. Server starting and stopping becomes asynchronous so the quorum thread is able to send heartbeats while the server work is off starting up and fencing. The server used to call into quorum to clear a flag as it exited. We remove that mechanism and have the server maintain a running status that quorum can query. We add some state to the quorum work to track the asynchronous state of the server. This lets the quorum protocol change roles immediately as needed while remembering that there is a server running that needs to be acted on. The server used to also call into quorum to update quorum blocks. This is a read-modify-write operation that has to be serialized. Now that we have both the server starting up and the quorum work running they both can't perform these read-modify-write cycles. Instead we have the quorum work own all the block updates and it queries the server status to determine when it should update the quorum block to indicate that the server has fenced or shut down. Signed-off-by: Zach Brown <zab@versity.com>	2022-03-31 10:29:43 -07:00
Zach Brown and GitHub	c94b072925	Merge pull request #81 from versity/zab/fenced_test Zab/fenced test	2022-03-29 09:05:09 -07:00
Zach Brown	26ae9c6e04	Verify local unmount testing fence script The fence script we use for our single node multi-mount tests only knows how to fence by using forced unmount to destroy a mount. As of now, the tests only generate failing nodes that need to be fenced by using forced unmount as well. This results in the awkward situation where the testing fence script doesn't have anything to do because the mount is already gone. When the test fence script has nothing to do we might not notice if it isn't run. This adds explicit verification to the fencing tests that the script was really run. It adds per-invocation logging to the fence script and the test makes sure that it was run. While we're at it, we take the opportunity to tidy up some of the scripting around this. We use a sysfs file with the data device major:minor numbers so that the fencing script can find and unmount mounts without having to ask them for their rid. They may not be operational. Signed-off-by: Zach Brown <zab@versity.com>	2022-03-28 14:52:08 -07:00
Zach Brown	c8d7221ec5	Show data device numbers in sysfs file It can be handy to associate mounts with their sysfs directory by their data device number. Signed-off-by: Zach Brown <zab@versity.com>	2022-03-25 14:43:25 -07:00
Zach Brown and GitHub	7fd03dc311	Merge pull request #80 from versity/zab/avoid_xattr_vmalloc Don't use vmalloc in get/set xattr	2022-03-22 12:00:51 -07:00
Zach Brown	4e8a088cc5	Don't use vmalloc in get/set xattr Extended attribute values can be larger than a reasonable maximum size for our btree items so we store xattrs in many items. The first pass at this code used vmalloc to make it relatively easy to work with a contiguous buffer that was cut up into multiple items. The problem, of course, is that vmalloc() is expensive. Well, the problem is that I always forget just how expensive it can be and use it when I shouldn't. We had loads on high cpu count machines that were catastrophically cpu bound on all the contentious work that vmalloc does to maintain a coherent global address space. This removes the use of vmalloc and only allocates a small buffer for the first compound item. The later items directly reference regions of value buffer rather than copying it to and from the large intermediate vmalloced buffer. Signed-off-by: Zach Brown <zab@versity.com>	2022-03-21 21:44:11 -07:00
Zach Brown and GitHub	9c751c1197	Merge pull request #78 from versity/zab/quorum_leader_visibility Zab/quorum leader visibility	2022-03-16 09:16:57 -07:00
Zach Brown	875583b7ef	Add t_fs_is_leader test helper The t_server_nr and t_first_client_nr helpers iterated over all the fs numbers examining their quorum/is_leader files, but clients don't have a quorum/ directory. This was causing spurious outputs in tests that were looking for servers but didn't find it in the first quorum fs number and made it down into the clients. Give them a helper that returns 0 for being a leader if the quorum/ dir doesn't exist. Signed-off-by: Zach Brown <zab@versity.com>	2022-03-15 16:09:55 -07:00
Zach Brown	38e5aa77c4	Update quorum status files more frequently We were seeing rare test failures where it looked like is_leader wasn't set for any of the mounts. The test that couldn't find a set is_leader file had just performed some mounts so we know that a server was up and processing requests. The quorum task wasn't updating the status that's shown in sysfs and debugfs until after the server started up. This opened the race where the server was able to serve mount requests and have the test run to find no is_leader file set before the quorum task was able to update the stats and make its election visible. This updates the quorum task to make its status visible more often, typically before it does something that will take a while. The is_leader will now be visible before the server is started so the test will always see the file after server starts up and lets mounts finish. Signed-off-by: Zach Brown <zab@versity.com>	2022-03-15 15:07:57 -07:00
Zach Brown and GitHub	57a1d75e52	Merge pull request #77 from versity/zab/v1_2_release Zab/v1 2 release	2022-03-14 18:10:16 -07:00
Zach Brown	51d19d797f	Start v1.3-rc release notes Create the 1.3 section in the release notes for commits to fill. Signed-off-by: Zach Brown <zab@versity.com>	2022-03-14 17:15:24 -07:00
Zach Brown	029a684c25	v1.2 Release Cut the release notes for the 1.2 release. Signed-off-by: Zach Brown <zab@versity.com> v1.2	2022-03-14 17:15:05 -07:00
Zach Brown and GitHub	f2679d9598	Merge pull request #76 from versity/zab/inode_deletion_fixes Zab/inode deletion fixes	2022-03-11 16:23:21 -08:00
Zach Brown	bddca171ee	Call iput outside cluster locked transactions The final iput of an inode can delete items in cluster locked transactions. It was never safe to call iput within locked transactions but we never saw the problem. Recent work on inode deletion raised the issue again. This makes sure that we always perform iput outside of locked transactions. The only interesting change is making scoutfs_new_inode() return the allocated inode on error so that the caller can put the inode after releasing the transaction. Signed-off-by: Zach Brown <zab@versity.com>	2022-03-11 15:29:20 -08:00
Zach Brown	18171b8543	Put allocator block references on forced unmount During forced unmount commits abort due to errors and the open transaction is left in a dirty state that is cleaned up by scoutfs_shutdown_trans(). It cleans all the dirty blocks in the commit write context with scoutfs_block_writer_forget_all(), but it forgot to call scoutfs_alloc_prepare_commit() to put the block references held by the allocator. This was generating leaked block warnings during testing that used forced unmount. It wouldn't affect regular operations. Signed-off-by: Zach Brown <zab@versity.com>	2022-03-11 15:29:20 -08:00
Zach Brown	d846eec5e8	Harden final inode deletion We were seeing a number of problems coming from races that allowed tasks in a mount to try and concurrently delete an inode's items. We could see error messages indicating that deletion failed with -ENOENT, we could see users of inodes behave erratically as inodes were deleted from under them, and we could see eventual server errors trying to merge overlapping data extents which were "freed" (added to transaction lists) multiple times. This commit addresses the problems in one relatively large patch. While we could mechanically split up the fixes, they're all interdependent and splitting them up (bisecting through them) could cause failures that would be devilishly hard to diagnose. First we stop allowing multiple cached vfs inodes. This was initially done to avoid deadlocks between lock invalidation and final inode deletion. We add a specific lookup that's used by invalidation which ignores any inodes which are in I_NEW or I_FREEING. Now that iget can wait on inode flags we call iget5_locked before acquiring the cluster lock. This ensures that we can only have one cached vfs inode for a given inode number in evict_inode trying to delete. Now that we can only have one cached inode, we can rework the omap tracking to use _set and _clear instead of _inc and _put. This isn't strictly necessary but is a simplification and lets us issue warnings if we see that we ever try to set an inode number's bit on behalf of multiple cached inodes. We also add a _test helper. Orphan scanning would try to perform deletion by instantiating a cached inode and then putting it, triggering eviction and final deletion. This was an attempt to simplify concurrency but ended up causing more problems. It no longer tries to interact with inode cache at all and attempts to safely delete inode items directly. It uses the omap test to determine that it should skip an already cached inode. 
We had attempted to forbid opening inodes by handle if they had an nlink of 0. Since we allowed multiple cached inodes for an inode number this was to prevent adding cached inodes that were being deleted. It was only performing the check on newly allocated inodes, though, so it could get a reference to the cached inode that the scanner had inserted for deleting. We're choosing to keep restricting opening by handle to only linked inodes so we also check existing inodes after they're refreshed. We're left with a task evicting an inode and the orphan scanner racing to delete an inode's items. We move the work of determining if it's safe to delete out of scoutfs_omap_should_delete() and into try_delete_inode_items() which is called directly from eviction and scanning. This is mostly code motion but we do make three critical changes. We get rid of the goofy concurrent deletion detection in delete_inode_items() and instead use a bit in the lock data to serialize multiple attempts to delete an inode's items. We no longer assume that the inode must still be around because we were called from evict and specifically check that the inode item is still present for deleting. Finally, we use the omap test to discover that we shouldn't delete an inode that is locally cached (and would not be included in the omap response). We do all this under the inode write lock to serialize between mounts. Signed-off-by: Zach Brown <zab@versity.com>	2022-03-11 15:28:58 -08:00
Zach Brown	e2c90339c5	Add orphan-inodes to race final deletion We're seeing some trouble with very specific race conditions. This updates the orphan-inodes test to try and force final inode deletion during eviction, the orphan scan worker, and opening inodes by handle to all race and hit an inode number at the same time. Signed-off-by: Zach Brown <zab@versity.com>	2022-03-11 14:30:17 -08:00
Zach Brown	4a0b14a4f2	Wait for stdin open in orphan-inodes test The orphan inode test often uses a trick where it runs sleep in the background with a file as stdin as a means of holding files open. This can very rarely fail if the background sleep happens to be first scheduled after the unlink of the file it's reading as stdin. A small delay gives it a chance to run and open the file before it's unlinked. It's still possible to lose the race, of course, but so far this has been good enough. Signed-off-by: Zach Brown <zab@versity.com>	2022-03-10 11:43:11 -08:00
Zach Brown	90518a0fbd	Add handle_fsetxattr test utility Add a quick little binary that spins opening an inode by a handle and calling fsetxattr. Signed-off-by: Zach Brown <zab@versity.com>	2022-03-10 11:43:11 -08:00
Zach Brown	cd23cc61ca	Add mount option test bash functions Add some test functions which work with mount options. Signed-off-by: Zach Brown <zab@versity.com>	2022-03-10 11:43:11 -08:00
Zach Brown	a67ea30bb7	Add orphan_scan_delay_ms mount option Add a mount option to set the delay between scanning of the orphan list. The sysfs file for the option is writable so this option can be set at run time. Signed-off-by: Zach Brown <zab@versity.com>	2022-03-10 11:43:11 -08:00
Zach Brown	f3b7c683f0	Fix quorum_server_nr sysfs file name typo The quorum_slot_nr mount option was being mistakenly shown in a sysfs file named quorum_server_nr. Signed-off-by: Zach Brown <zab@versity.com>	2022-03-09 11:12:36 -08:00
Zach Brown	8decc54467	Clean up mount option handling The mount options code is some of the oldest in the tree and is weirdly split between options.c and super.c. This cleans up the options code, moves it all to options.c, and reworks it to be more in line with the modern subsystem convention of storing state in an allocated info struct. Rather than putting the parsed options in the super for everyone to directly reference we put them in the private options info struct and add a locked read function. This will let us add sysfs files to change mount options while safely serializing with readers. All the users of mount options that used to directly reference the parsed struct now call the read function to get a copy. They're all small local changes except for quorum which saves a static copy of the quorum slot number because it references it in so many places and relies on it not changing. Finally, we remove the empty debugfs "options" directory. Signed-off-by: Zach Brown <zab@versity.com>	2022-03-09 11:12:36 -08:00
Zach Brown	5adcf7677f	Export omap group calc helper The inode caller of omap was manually calculating the group and bits, which isn't fantastic. Export the little helper to calculate it so the inode caller doesn't have to. Signed-off-by: Zach Brown <zab@versity.com>	2022-03-09 11:12:36 -08:00
Zach Brown	07f03d499f	Remove duplicate orphan work delay calculation You can almost feel the editing mistake that brought the delay calculation into the conditional and forgot to remove the initial calculation at declaration. Signed-off-by: Zach Brown <zab@versity.com>	2022-03-09 11:12:23 -08:00
Zach Brown and GitHub	c5068efef0	Merge pull request #75 from versity/zab/bad_mount_option Zab/bad mount option	2022-02-28 09:07:15 -08:00
Zach Brown	66678dc63b	Fail mounts with unknown options Weirdly, the mount option parser silently returned when it found mount options that weren't recognized. Signed-off-by: Zach Brown <zab@versity.com>	2022-02-21 10:44:56 -08:00
Zach Brown	b2834d3c28	Add basic bad mount testing Add some tests which exercise the kinds of reasonable mistakes that people will make in the field. Signed-off-by: Zach Brown <zab@versity.com>	2022-02-21 10:44:38 -08:00
Zach Brown and GitHub	cff50bec6b	Merge pull request #74 from versity/zab/fallocate_read_inversion Zab/fallocate read inversion	2022-02-21 09:58:49 -08:00
Zach Brown	4d6350b3b0	Fix lock ordering in fallocate We were seeing ABBA deadlocks on the dio_count wait and extent_sem between fallocate and reads. It turns out that fallocate got lock ordering wrong. This brings fallocate in line with the rest of the adherents to the lock hierarchy. Most importantly, the extent_sem is used after the dio_count. While we're at it we bring the i_mutex down to just before the cluster lock for consistency. Signed-off-by: Zach Brown <zab@versity.com>	2022-02-17 14:48:13 -08:00
Zach Brown	48966b42bb	Add simple fallocate test Signed-off-by: Zach Brown <zab@versity.com>	2022-02-17 11:20:08 -08:00
Zach Brown and GitHub	97cb8ad50d	Merge pull request #72 from versity/zab/quick_man_fix Clean quorum and format change command docs	2022-02-09 09:22:50 -08:00
Zach Brown	ae08a797ae	Clean quorum and format change command docs The man pages and inline help blurbs for the recently added format version and quorum config commands incorrectly described the device arguments which are needed. Signed-off-by: Zach Brown <zab@versity.com>	2022-02-08 11:23:27 -08:00
Zach Brown and GitHub	2634fadfcb	Merge pull request #71 from versity/zab/v1_1_release Zab/v1 1 release v1.1	2022-02-04 11:35:39 -08:00
Zach Brown	0c1f19556d	Prepare v1.2-rc release Add the v1.2-rc section to the release notes so that we can add entries with commits as needed. Signed-off-by: Zach Brown <zab@versity.com>	2022-02-04 11:32:53 -08:00
Zach Brown	19caae3da8	v1.1 Release Finish off the release notes for the 1.1 release. Signed-off-by: Zach Brown <zab@versity.com>	2022-02-04 11:32:37 -08:00
Zach Brown and GitHub	2989afbf46	Merge pull request #70 from versity/zab/silence_duplicate_log_merge_complete_error Silence resent log merge commit error	2022-02-02 14:35:01 -08:00
Zach Brown	730a84af92	Silence resent log merge commit error The server's log merge complete request handler was considering the absence of the client's original request as a failure. Unfortunately, this case is possible if a previous server successfully completed the client's request but the response was lost because it stopped for whatever reason. The failure was being logged as a hard error to the console which was causing tests to occasionally fail during server failover that hit just as the log merge completion was being processed. The error was being sent to the client as a response, we just need to silence the message for these expected but rare errors. We also fix the related case where the server printed the even more harsh WARN_ON if there was a next original request but it wasn't the one we expected to find from our requesting client. Signed-off-by: Zach Brown <zab@versity.com>	2022-02-02 11:26:36 -08:00
Zach Brown and GitHub	5b77133c3b	Merge pull request #68 from versity/zab/collection_of_fixes Zab/collection of fixes	2022-01-24 11:22:41 -08:00
Zach Brown	329ac0347d	Remove unused scoutfs_net_cancel_request() The net _cancel_request call hasn't been used or tested in approximately a bazillion years. Best to get rid of it and have to add and test it if we think we need it again. Signed-off-by: Zach Brown <zab@versity.com>	2022-01-24 09:40:08 -08:00
Zach Brown	15d7eec1f9	Disallow opening unlinked files by handle Our open by handle functions didn't care that the inode wasn't referenced and let tasks open unlinked inodes by number. This interacted badly with the inode deletion mechanisms which required that inodes couldn't be cached on other nodes after the transaction which removed their final reference. If a task did accidentally open a file by inode while it was being deleted it could see the inode items in an inconsistent state and return very confusing errors that look like corruption. The fix is to give the handle iget callers a flag to tell iget to only get the inode if it has a positive nlink. If iget sees that the inode has been unlinked it returns enoent. Signed-off-by: Zach Brown <zab@versity.com>	2022-01-24 09:40:08 -08:00
Zach Brown	cff17a4cae	Remove unused flags scoutfs_inode_refresh arg The flags argument to scoutfs_inode_refresh wasn't being used. Signed-off-by: Zach Brown <zab@versity.com>	2022-01-24 09:40:08 -08:00
Zach Brown	9fa2c6af89	Use get-allocated-inos in orphan-inodes test The orphan inodes test needs to test if inode items exist as it manipulates inodes. It used to open the inode by a handle but we're fixing that to not allow opening unlinked files. The get-allocated-inos ioctl tests for the presence of items owned by the inode regardless of any other vfs state so we can use it to verify what scoutfs is doing as we work with the vfs inodes. Signed-off-by: Zach Brown <zab@versity.com>	2022-01-24 09:40:08 -08:00
Zach Brown	e067961714	Add get-allocated-inos scoutfs command Add the get-allocated-inos scoutfs command which wraps the GET_ALLOCATED_INOS ioctl. It'll be used by tests to find items associated with an inode instead of trying to open the inode by a constructed handle after it was unlinked. Signed-off-by: Zach Brown <zab@versity.com>	2022-01-24 09:40:08 -08:00
Zach Brown	7a96e03148	Add get_allocated_inos ioctl Add an ioctl that can give some indication of inodes that have inode items. We're exposing this for tests that verify the handling of open unlinked inodes. Signed-off-by: Zach Brown <zab@versity.com>	2022-01-24 09:40:08 -08:00
Zach Brown	e9b3cc873a	Export scoutfs_inode_init_key We're adding an ioctl that wants to build inode item keys so let's export the private inode key initializer. Signed-off-by: Zach Brown <zab@versity.com>	2022-01-24 09:40:08 -08:00
Zach Brown	5f2259c48f	Revert "Fix client/server race btwn lock recov and farewell" This reverts commit `61ad844891`. This fix was trying to ensure that lock recovery response handling can't run after farewell calls reclaim_rid() by jumping through a bunch of hoops to tear down locking state as the first farewell request arrived. It introduced very slippery use after free during shutdown. It appears that it was from drain_workqueue() previously being able to stop chaining work. That's no longer possible when you're trying to drain two workqueues that can queue work in each other. We found a much clearer way to solve the problem so we can toss this. Signed-off-by: Zach Brown <zab@versity.com>	2022-01-24 09:40:08 -08:00
Zach Brown	e14912974d	Wait for lock recovery before sending farewell We recently found that the server can send a farewell response and try to tear down a client's lock state while it was still in lock recovery with the client. The lock recovery response could add a lock for the client after farewell's reclaim_rid() had thought the client was gone forever and tore down its locks. This left a lock in the lock server that wasn't associated with any clients and so could never be invalidated. Attempts to acquire conflicting locks with it would hang forever, which we saw as hangs in testing with lots of unmounting. We tried to fix it by serializing incoming request handling and forcefully clobbering the client's lock state as we first got the farewell request. That went very badly. This takes another approach of trying to explicitly wait for lock recovery to finish before sending farewell responses. It's more in line with the overall pattern of having the client be up and functional until farewell tears it down. With this in place we can revert the other attempted fix that was causing so many problems. Signed-off-by: Zach Brown <zab@versity.com>	2022-01-24 09:39:51 -08:00
Zach Brown	813ce24d79	Move local-force-unmount test script into tests/ The local-force-unmount fenced fencing script only works when all the mounts are on the local host and it uses force unmount. It is only used in our specific local testing scripts. Packaging it as an example led people to believe that it could be used to cobble together a multi-host testing network, however temporary. Move it from being in utils and packaged to being private to our tests so that it doesn't present an attractive nuisance. Signed-off-by: Zach Brown <zab@versity.com>	2022-01-19 11:33:34 -08:00
