Add an option to skip printing structures that are likely to be so huge
that the print output becomes completely unwieldy on large systems.
Signed-off-by: Zach Brown <zab@versity.com>
Like a lot of places in the server, get_log_trees() doesn't have the
tools it needs to safely unwind partial changes in the face of an error.
In the worst case, it can have moved extents from the mount's log_trees
item into the server's main data allocator. The dirty data allocator
reference is in the super block so it can be written later. The dirty
log_trees reference is on the stack, though, so it will be thrown away on
error. This ends up duplicating extents in the persistent structures
because they're written in the new dirty allocator but still remain in
the unwritten source log_trees allocator.
This change makes it harder for that to happen. It dirties the
log_trees item up front and always tries to update it so that the dirty
blocks are consistent if they're later written out. If we do get an
error updating the item we trigger an assertion. It's not great, but it
matches similar circumstances in other parts of the server.
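The rough shape of the ordering, as a sketch with hypothetical helper
names rather than the real server functions:

  /*
   * Sketch only: dirty the item before moving extents so the later
   * update can't fail for lack of dirty blocks, and treat a failed
   * update as fatal rather than discarding the on-stack copy.
   */
  static int example_get_log_trees(struct example_server *server,
                                   struct example_log_trees *lt)
  {
          int ret;
          int err;

          ret = example_dirty_log_trees_item(server, lt);
          if (ret < 0)
                  return ret;

          ret = example_move_extents_to_main_alloc(server, lt);

          /* always update so the dirty blocks stay consistent if written */
          err = example_update_log_trees_item(server, lt);
          BUG_ON(err);

          return ret;
  }
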
Signed-off-by: Zach Brown <zab@versity.com>
We were setting sk_allocation on the quorum UDP sockets to prevent
entering reclaim while using sockets but we missed setting it on the
regular messaging TCP sockets. This could create deadlocks where the
sending task could enter scoutfs reclaim and wait for server messages
while holding the socket lock, preventing the receive thread from
receiving those messages because it was blocked on the socket lock.
The fix is to prevent socket allocations from entering the fs for
reclaim.
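The kernel pattern in question looks roughly like the sketch below; the
socket setup is illustrative, not the scoutfs messaging code:

  #include <linux/gfp.h>
  #include <linux/net.h>
  #include <net/sock.h>

  static int example_create_tcp_sock(struct net *net, struct socket **res)
  {
          struct socket *sock;
          int ret;

          ret = sock_create_kern(net, AF_INET, SOCK_STREAM, IPPROTO_TCP,
                                 &sock);
          if (ret)
                  return ret;

          /* socket allocations must not recurse into fs reclaim */
          sock->sk->sk_allocation = GFP_NOFS;

          *res = sock;
          return 0;
  }
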
Signed-off-by: Zach Brown <zab@versity.com>
Client log_trees allocator btrees can build up quite a number of
extents. In the right circumstances moving fragmented extents can
require dirtying a large number of paths to leaf blocks in the core
allocator btrees. It might not be possible to dirty all the blocks
necessary to move all the extents in one commit.
This reworks the extent motion so that it can be performed in multiple
commits if the meta allocator for the commit runs out while it is moving
extents. It's a minimal fix with as little disruption to the ordering
of commits and locking as possible. It simply bubbles up an error when
the allocators run out and retries functions that can already be retried
in other circumstances.
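The retry shape, sketched with hypothetical helpers (the real change
threads the error through functions that already support retrying):

  static int example_move_all_extents(struct example_server *server,
                                      struct example_alloc *src,
                                      struct example_alloc *dst)
  {
          bool again = false;
          int ret;

          do {
                  ret = example_hold_commit(server);
                  if (ret < 0)
                          break;

                  /* returns -ENOSPC when this commit's meta allocator runs out */
                  ret = example_move_some_extents(server, src, dst);
                  again = (ret == -ENOSPC);
                  if (again)
                          ret = 0;

                  /* apply the commit, keeping the first error we saw */
                  ret = example_apply_commit(server) ?: ret;
          } while (again && ret == 0);

          return ret;
  }
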
Signed-off-by: Zach Brown <zab@versity.com>
We're seeing allocator motion during get_log_trees dirty quite a lot of
blocks, which makes sense. Let's continue to up the budget. If we
still need significantly larger budgets we'll want to look into capping
the dirty block use of the allocator extent movers which will mean
changing callers to support partial progress.
Signed-off-by: Zach Brown <zab@versity.com>
When a new server starts up it rebuilds its view of all the granted
locks with lock recovery messages. Clients give the server their
granted lock modes which the server then uses to process all the resent
lock requests from clients.
The lock invalidation work in the client is responsible for
transitioning an old granted mode to a new invalidated mode in response
to an unsolicited message from the server. It has to process any client state
that'd be incompatible with the new mode (write dirty data, drop
caches). While it is doing this work, as an implementation short cut,
it sets the granted lock mode to the new mode so that users that are
compatible with the new invalidated mode can use the lock while it's
being invalidated. Picture readers reading data while a write lock is
being invalidated and the invalidation work is writing out dirty data.
A problem arises when a lock recover request is processed during lock
invalidation. The client lock recover request handler sends a response
with the current granted mode. The server takes this to mean that the
invalidation is done but the client invalidation worker might still be
writing data, dropping caches, etc. The server will allow the state
machine to advance, which can send grants for pending client requests
as though the invalidation had completed.
All of this can lead to a grant response handler in the client tripping
the assertion that a grant from the server can't arrive while there are
still cached items that were incompatible with the old mode.
Invalidation might still be invalidating caches. Hitting this bug is
very rare and requires a
new server starting up while a client has both a request outstanding and
an invalidation being processed when the lock recover request arrives.
The fix is to record the old mode during invalidation and send that in
lock recover responses. This can lead the lock server to resend
invalidation requests to the client. The client already safely handles
duplicate invalidation requests from other failover cases.
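A sketch of the idea with invented fields rather than the client's real
lock structure:

  struct example_lock {
          spinlock_t lock;
          int granted_mode;       /* mode users can currently hold */
          int old_mode;           /* mode we're invalidating from */
          bool invalidating;
  };

  /* reply to a lock recover request with the mode the server must honor */
  static int example_recover_mode(struct example_lock *lck)
  {
          int mode;

          spin_lock(&lck->lock);
          mode = lck->invalidating ? lck->old_mode : lck->granted_mode;
          spin_unlock(&lck->lock);

          return mode;
  }
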
Signed-off-by: Zach Brown <zab@versity.com>
The change to only allocate a buffer for the first xattr item with
kmalloc, instead of the entire logical xattr payload with vmalloc,
introduced a regression when getting large xattrs.
getxattr used to copy the entire payload into the large vmalloced
buffer so it could unlock just after get_next_xattr. The change to only
getting the first item buffer added a call to copy from the rest of the
items, but those copies weren't covered by the locks. This would often work
because the lock pointer still pointed to a valid lock. But if the lock
was invalidated then the mode would no longer be compatible and
_item_lookup would return EINVAL.
The fix is to extend xattr_rwsem and cluster lock coverage to the rest
of the function body, which includes the value item copies. This also
makes getxattr's lock coverage consistent with setxattr and listxattr
which might reduce the risk of similar mistakes in the future.
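The shape of the fix, sketched with hypothetical helpers rather than
the real xattr code:

  static int example_getxattr(struct example_inode *ei, const char *name,
                              void *value, size_t size)
  {
          struct example_lock *lock;
          int ret;

          down_read(&ei->xattr_rwsem);
          ret = example_lock_inode_read(ei, &lock);
          if (ret < 0)
                  goto out_rwsem;

          ret = example_copy_first_xattr_item(ei, lock, name, value, size);
          if (ret > 0) {
                  /* the remaining value item copies are now inside the locks */
                  ret = example_copy_remaining_items(ei, lock, value, size);
          }

          example_unlock_inode(ei, lock);
  out_rwsem:
          up_read(&ei->xattr_rwsem);
          return ret;
  }
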
Signed-off-by: Zach Brown <zab@versity.com>
After we've merged a log btree back into the main fs tree we kick off
work to free all its blocks. This would fully fill the transaction's
free blocks list before stopping to apply the commit.
Consuming the entire free list makes it hard to have concurrent holders
of a commit who also want to free things. This changes the log btree
block freeing to limit itself to a fraction of the budget that each
holder gets. That coarse limit avoids having to precisely account for
the allocations and frees made while modifying the freeing item, while
still freeing many blocks per commit.
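A coarse sketch of the limiting; the names and the particular fraction
are invented:

  #define EXAMPLE_FREE_FRACTION 4    /* use at most 1/4 of a holder's budget */

  static int example_free_merged_btree(struct example_server *server,
                                       struct example_btree_root *root,
                                       unsigned int holder_budget)
  {
          unsigned int limit = holder_budget / EXAMPLE_FREE_FRACTION;
          unsigned int freed = 0;
          int ret = 0;

          /* free one block at a time; 0 means the btree is fully freed */
          while (freed < limit) {
                  ret = example_free_next_block(server, root);
                  if (ret <= 0)
                          break;
                  freed++;
          }

          /* a positive return means more blocks remain for a later commit */
          return ret;
  }
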
Signed-off-by: Zach Brown <zab@versity.com>
Server commits use an allocator that has a limited number of available
metadata blocks and entries in a list for freed blocks. The allocator
is refilled between commits. Holders must not fully consume the
allocator during a commit, and that tended to work out because server commit
holders commit before sending responses. We'd tend to commit frequently
enough that we'd get a chance to refill the allocators before they were
consumed.
But there was no mechanism to ensure that this would be the case.
Enough concurrent server holders were able to fully consume the
allocators before committing. This caused scoutfs_meta_alloc and _free
to return errors, leading the server to fail in the worst cases.
This changes the server commit tracking to use more robust structures
which limit the number of concurrent holders so that the allocators
aren't exhausted. The commit_users struct stops holders from making
progress once the allocators don't have room for more holders. It also
lets us stop future holders from making progress once the commit work
has been queued. The previous cute use of a rwsem didn't allow for
either of these protections.
We don't have precise tracking of each holder's allocation consumption
so we don't try and reserve blocks for each holder. Instead we have a
maximum consumption per holder and make sure that the holders can't
exhaust the allocators even if they all use their full limit.
All of this requires the holding code paths to be well behaved and not
use more than the per-hold limit. We add some debugging code to print
the stacks of holders that were active when the total holder limit was
exceeded. This is the motivation for having state in the holders. We
can record some data at the time their hold started that'll make it a
little easier to track down which of the holders exceeded their limit.
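A simplified sketch of the holder accounting with invented names; the
real commit_users struct also records per-holder state for the
debugging output described above:

  struct example_commit_users {
          spinlock_t lock;
          wait_queue_head_t waitq;
          unsigned int nr_holders;
          unsigned int max_holders;    /* chosen so max * per-hold use fits */
          bool commit_queued;
  };

  static bool example_can_hold(struct example_commit_users *cu)
  {
          bool ok;

          spin_lock(&cu->lock);
          ok = !cu->commit_queued && cu->nr_holders < cu->max_holders;
          if (ok)
                  cu->nr_holders++;
          spin_unlock(&cu->lock);

          return ok;
  }

  static void example_hold(struct example_commit_users *cu)
  {
          /* block once the allocators have no room for more holders */
          wait_event(cu->waitq, example_can_hold(cu));
  }

  static void example_release(struct example_commit_users *cu)
  {
          spin_lock(&cu->lock);
          cu->nr_holders--;
          spin_unlock(&cu->lock);
          wake_up(&cu->waitq);
  }
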
Signed-off-by: Zach Brown <zab@versity.com>
Add a helper function that gives the caller the number of blocks remaining in
the first list block that's used for meta allocation and freeing.
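A guess at the helper's shape, with made-up structure and constant
names:

  /* blocks this commit can still allocate from the first list block */
  static u32 example_first_list_block_avail(struct example_list_block *lblk)
  {
          return le32_to_cpu(lblk->nr);
  }

  /* room left to record freed blocks in the first list block */
  static u32 example_first_list_block_free_room(struct example_list_block *lblk)
  {
          return EXAMPLE_LIST_BLOCK_ENTRIES - le32_to_cpu(lblk->nr);
  }
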
Signed-off-by: Zach Brown <zab@versity.com>
There was a brief time where we exported the ability to hold and apply
commits outside of the main server code. That wasn't a great idea, and
the few users have since been reworked to not require directly
manipulating server transactions, so we can reduce risk and make these
functions private again.
Signed-off-by: Zach Brown <zab@versity.com>
Quorum members will try to elect a new leader when they don't receive
heartbeats from the currently elected leader. This timeout is short to
encourage restoring service promptly.
Heartbeats are sent from the quorum worker thread and are delayed while
it synchronously starts up the server, which includes fencing previous
servers. If fence requests take too long then heartbeats will be
delayed long enough for remaining quorum members to elect a new leader
while the recently elected server is still busy fencing.
To fix this we decouple server startup from the quorum main thread.
Server starting and stopping becomes asynchronous so the quorum thread
is able to send heartbeats while the server work is off starting up and
fencing.
The server used to call into quorum to clear a flag as it exited. We
remove that mechanism and have the server maintain a running status that
quorum can query.
We add some state to the quorum work to track the asynchronous state of
the server. This lets the quorum protocol change roles immediately as
needed while remembering that there is a server running that needs to be
acted on.
The server used to also call into quorum to update quorum blocks. This
is a read-modify-write operation that has to be serialized. Now that we
can have the server starting up while the quorum work keeps running,
they can't both perform these read-modify-write cycles. Instead we have the
quorum work own all the block updates and it queries the server status
to determine when it should update the quorum block to indicate that the
server has fenced or shut down.
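A sketch of the status handoff with invented names; the real tracking
is richer, since the quorum work also has to remember a server that
still needs to be acted on:

  enum example_server_status {
          EXAMPLE_SERVER_DOWN,
          EXAMPLE_SERVER_STARTING,
          EXAMPLE_SERVER_RUNNING,
          EXAMPLE_SERVER_STOPPING,
  };

  struct example_server_info {
          atomic_t status;
  };

  /* the server work updates its status as it starts and stops */
  static void example_server_set_status(struct example_server_info *srv,
                                        enum example_server_status st)
  {
          atomic_set(&srv->status, st);
  }

  /* the quorum work polls the status instead of being called by the server */
  static bool example_server_is_running(struct example_server_info *srv)
  {
          return atomic_read(&srv->status) == EXAMPLE_SERVER_RUNNING;
  }
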
Signed-off-by: Zach Brown <zab@versity.com>
The fence script we use for our single node multi-mount tests only knows
how to fence by using forced unmount to destroy a mount. As of now, the
tests only generate failing nodes that need to be fenced by using forced
unmount as well. This results in the awkward situation where the
testing fence script doesn't have anything to do because the mount is
already gone.
When the test fence script has nothing to do we might not notice if it
isn't run. This adds explicit verification to the fencing tests that
the script was really run. It adds per-invocation logging to the fence
script, which the test uses to make sure the script ran.
While we're at it, we take the opportunity to tidy up some of the
scripting around this. We use a sysfs file with the data device
major:minor numbers so that the fencing script can find and unmount
mounts without having to ask them for their rid. They may not be
operational.
Signed-off-by: Zach Brown <zab@versity.com>
Extended attribute values can be larger than a reasonable maximum size
for our btree items so we store xattrs in many items. The first pass at
this code used vmalloc to make it relatively easy to work with a
contiguous buffer that was cut up into multiple items.
The problem, of course, is that vmalloc() is expensive. Well, the
problem is that I always forget just how expensive it can be and use it
when I shouldn't. We had loads on high cpu count machines that were
catastrophically cpu bound on all the contentious work that vmalloc does
to maintain a coherent global address space.
This removes the use of vmalloc and only allocates a small buffer for
the first compound item. The later items directly reference regions of
the value buffer rather than copying it to and from a large intermediate
vmalloced buffer.
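A sketch of the new buffering with hypothetical names and layout:

  /*
   * Sketch only: 'size' is the full value length taken from the xattr
   * header in the first item.  Only the first item was kmalloc'd; the
   * rest copy straight into the caller's buffer.
   */
  static int example_read_xattr_value(struct example_inode *ei,
                                      struct example_xattr_first *first,
                                      void *value, size_t size)
  {
          size_t off = min_t(size_t, first->value_in_first, size);
          int part = 1;
          int ret;

          /* the start of the value rides along in the first item's buffer */
          memcpy(value, first->value_start, off);

          while (off < size) {
                  /* later items are copied directly, no intermediate buffer */
                  ret = example_copy_xattr_item(ei, part++, value + off,
                                                size - off);
                  if (ret <= 0)
                          return ret ?: -EIO;
                  off += ret;
          }

          return 0;
  }
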
Signed-off-by: Zach Brown <zab@versity.com>
The t_server_nr and t_first_client_nr helpers iterated over all the fs
numbers examining their quorum/is_leader files, but clients don't have a
quorum/ directory. This was causing spurious output in tests that were
looking for a server, didn't find it in the first quorum fs number, and
made it down into the clients.
Give them a helper that returns 0 for being a leader if the quorum/ dir
doesn't exist.
Signed-off-by: Zach Brown <zab@versity.com>
We were seeing rare test failures where it looked like is_leader wasn't
set for any of the mounts. The test that couldn't find a set is_leader
file had just performed some mounts so we know that a server was up and
processing requests.
The quorum task wasn't updating the status that's shown in sysfs and
debugfs until after the server started up. This opened the race where
the server was able to serve mount requests and have the test run to
find no is_leader file set before the quorum task was able to update the
status and make its election visible.
This updates the quorum task to make its status visible more often,
typically before it does something that will take a while. The
is_leader file will now be visible before the server is started so the
test will always see it after the server starts up and lets mounts
finish.
Signed-off-by: Zach Brown <zab@versity.com>
The final iput of an inode can delete items in cluster locked
transactions. It was never safe to call iput within locked
transactions but we never saw the problem. Recent work on inode
deletion raised the issue again.
This makes sure that we always perform iput outside of locked
transactions. The only interesting change is making scoutfs_new_inode()
return the allocated inode on error so that the caller can put the inode
after releasing the transaction.
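A sketch of the resulting calling pattern, not the real create path:

  static int example_create(struct example_sb_info *sbi, struct inode *dir)
  {
          struct inode *inode = NULL;
          int ret;

          ret = example_hold_trans(sbi);
          if (ret < 0)
                  return ret;

          /* may hand back an allocated inode even when it returns an error */
          ret = example_new_inode(dir, &inode);

          example_release_trans(sbi);

          /* iput() only after the transaction is released */
          if (ret < 0 && inode)
                  iput(inode);

          return ret;
  }
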
Signed-off-by: Zach Brown <zab@versity.com>
During forced unmount, commits abort due to errors and the open
transaction is left in a dirty state that is cleaned up by
scoutfs_shutdown_trans(). It cleans all the dirty blocks in the commit
write context with scoutfs_block_writer_forget_all(), but it forgot to
call scoutfs_alloc_prepare_commit() to put the block references held by
the allocator.
This was generating leaked block warnings during testing that used
forced unmount. It wouldn't affect regular operations.
Signed-off-by: Zach Brown <zab@versity.com>
We were seeing a number of problems coming from races that allowed tasks
in a mount to try and concurrently delete an inode's items. We could
see error messages indicating that deletion failed with -ENOENT, we
could see users of inodes behave erratically as inodes were deleted from
under them, and we could see eventual server errors trying to merge
overlapping data extents which were "freed" (added to transaction lists)
multiple times.
This commit addresses the problems in one relatively large patch. While
we could mechanically split up the fixes, they're all interdependent and
splitting them up (bisecting through them) could cause failures that
would be devilishly hard to diagnose.
First we stop allowing multiple cached vfs inodes. Allowing them was
initially done to avoid deadlocks between lock invalidation and final inode
deletion. We add a specific lookup that's used by invalidation which
ignores any inodes which are in I_NEW or I_FREEING. Now that iget can
wait on inode flags we call iget5_locked before acquiring the cluster
lock. This ensures that we can only have one cached vfs inode for a
given inode number in evict_inode trying to delete.
Now that we can only have one cached inode, we can rework the omap
tracking to use _set and _clear instead of _inc and _put. This isn't
strictly necessary but is a simplification and lets us issue warnings if
we ever try to set an inode number's bit on behalf of
multiple cached inodes. We also add a _test helper.
Orphan scanning would try to perform deletion by instantiating a cached
inode and then putting it, triggering eviction and final deletion. This
was an attempt to simplify concurrency but ended up causing more
problems. It no longer tries to interact with the inode cache at all and
attempts to safely delete inode items directly. It uses the omap test
to determine that it should skip an already cached inode.
We had attempted to forbid opening inodes by handle if they had an nlink
of 0. Since we allowed multiple cached inodes for an inode number this
was to prevent adding cached inodes that were being deleted. It was
only performing the check on newly allocated inodes, though, so it could
get a reference to the cached inode that the scanner had inserted for
deleting. We're choosing to keep restricting opening by handle to only
linked inodes so we also check existing inodes after they're refreshed.
We're left with a task evicting an inode and the orphan scanner racing
to delete an inode's items. We move the work of determining if it's safe
to delete out of scoutfs_omap_should_delete() and into
try_delete_inode_items() which is called directly from eviction and
scanning. This is mostly code motion but we do make three critical
changes. We get rid of the goofy concurrent deletion detection in
delete_inode_items() and instead use a bit in the lock data to serialize
multiple attempts to delete an inode's items. We no longer assume that
the inode must still be around because we were called from evict and
specifically check that the inode item is still present before deleting.
Finally, we use the omap test to discover that we shouldn't delete an
inode that is locally cached (and would not be included in the omap
response). We do all this under the inode write lock to serialize
between mounts.
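A condensed sketch of the deletion gating with invented names:

  #define EXAMPLE_LOCK_DATA_DELETING 0    /* bit in the per-lock data, invented */

  static int example_try_delete_inode_items(struct example_sb_info *sbi,
                                            u64 ino,
                                            struct example_lock *lock)
  {
          int ret = 0;

          /* a bit in the lock data serializes local deletion attempts */
          if (test_and_set_bit(EXAMPLE_LOCK_DATA_DELETING, &lock->data_bits))
                  return 0;

          /* the item may already be gone if another mount deleted it */
          if (!example_inode_item_exists(sbi, ino, lock))
                  goto out;

          /* never delete an inode this mount still has cached */
          if (example_omap_test(sbi, ino))
                  goto out;

          ret = example_delete_inode_items(sbi, ino, lock);
  out:
          clear_bit(EXAMPLE_LOCK_DATA_DELETING, &lock->data_bits);
          return ret;
  }
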
Signed-off-by: Zach Brown <zab@versity.com>
We're seeing some trouble with very specific race conditions. This
updates the orphan-inodes test to try and force final inode deletion
during eviction, the orphan scan worker, and opening inodes by handle to
all race and hit an inode number at the same time.
Signed-off-by: Zach Brown <zab@versity.com>
The orphan inode test often uses a trick where it runs sleep in the
background with a file as stdin as a means of holding files open. This
can very rarely fail if the background sleep happens to be first
scheduled after the unlink of the file it's reading as stdin. A small
delay gives it a chance to run and open the file before it's unlinked.
It's still possible to lose the race, of course, but so far this has
been good enough.
Signed-off-by: Zach Brown <zab@versity.com>
Add a mount option to set the delay between scans of the orphan list.
The sysfs file for the option is writable so this option can be set at
run time.
Signed-off-by: Zach Brown <zab@versity.com>
The mount options code is some of the oldest in the tree and is weirdly
split between options.c and super.c. This cleans up the options code,
moves it all to options.c, and reworks it to be more in line with the
modern subsystem convention of storing state in an allocated info
struct.
Rather than putting the parsed options in the super for everyone to
directly reference we put them in the private options info struct and
add a locked read function. This will let us add sysfs files to change
mount options while safely serializing with readers.
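A sketch of the access pattern with hypothetical types and placeholder
fields; the real code may use a different locking primitive:

  struct example_mount_options {
          unsigned int orphan_scan_delay_ms;    /* placeholder fields */
          int quorum_slot_nr;
  };

  struct example_options_info {
          seqlock_t seqlock;
          struct example_mount_options opts;
  };

  /* readers get a consistent copy of the current options */
  static void example_options_read(struct example_options_info *optinf,
                                   struct example_mount_options *opts)
  {
          unsigned int seq;

          do {
                  seq = read_seqbegin(&optinf->seqlock);
                  *opts = optinf->opts;
          } while (read_seqretry(&optinf->seqlock, seq));
  }

  /* sysfs writers update options without racing with readers */
  static void example_options_write(struct example_options_info *optinf,
                                    const struct example_mount_options *opts)
  {
          write_seqlock(&optinf->seqlock);
          optinf->opts = *opts;
          write_sequnlock(&optinf->seqlock);
  }
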
All the users of mount options that used to directly reference the
parsed struct now call the read function to get a copy. They're all
small local changes except for quorum which saves a static copy of the
quorum slot number because it references it in so many places and relies
on it not changing.
Finally, we remove the empty debugfs "options" directory.
Signed-off-by: Zach Brown <zab@versity.com>
The inode caller of omap was manually calculating the group and bits,
which isn't fantastic. Export the little helper that calculates them so
the inode caller doesn't have to.
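A sketch of what the helper computes, using made-up constants:

  #define EXAMPLE_OMAP_GROUP_BITS 1024    /* bits tracked per group, invented */

  static void example_omap_group_bit(u64 ino, u64 *group, u32 *bit)
  {
          /* an inode number maps to a group and a bit within that group */
          *group = div_u64_rem(ino, EXAMPLE_OMAP_GROUP_BITS, bit);
  }
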
Signed-off-by: Zach Brown <zab@versity.com>