try_delete_inode_items() is responsible for making sure that it's safe
to delete an inode's persistent items. One of the things it has to
check is that there isn't another deletion attempt on the inode in this
mount. It sets a bit in lock data while it's working and backs off if
the bit is already set.
Unfortunately it was always clearing this bit as it exited, regardless
of whether it set it or not. This would let the next attempt perform
the deletion again before the working task had finished. This was often
not a problem because background orphan scanning is the only source of
regular concurrent deletion attempts.
But it's a big problem if a deletion attempt takes a very long time. It
gives enough time for an orphan scan attempt to clear the bit, then try
again and clobber whoever is performing the very slow deletion.
I hit this in a test that built files with an absurd number of
fragmented extents. The second concurrent orphan attempt was able to
proceed with deletion and performed a bunch of duplicate data extent
frees and caused corruption.
The fix is to only clear the bit if we set it. Now all concurrent
attempts will back off until the first task is done.
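The pattern can be sketched in userspace; the flag word and names below are illustrative stand-ins for the real inode lock data, not the scoutfs implementation:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

static unsigned long lock_data;		/* stands in for inode lock data */
#define INO_DELETING 0x1UL

static bool try_set_deleting(void)
{
	if (lock_data & INO_DELETING)
		return false;
	lock_data |= INO_DELETING;
	return true;
}

static int try_delete_inode_items(void)
{
	if (!try_set_deleting())
		return -EAGAIN;	/* another deletion in flight, back off */

	/* ... delete the inode's persistent items ... */

	/* The fix: only the task that set the bit clears it.  Clearing
	 * unconditionally on exit let the next attempt start deleting
	 * while the first task was still working. */
	lock_data &= ~INO_DELETING;
	return 0;
}
```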
Signed-off-by: Zach Brown <zab@versity.com>
Add a test which gives the server a transaction with a free list block
that contains blknos that each dirty an individual btree block in the
global data free extent btree.
Signed-off-by: Zach Brown <zab@versity.com>
Recently scoutfs_alloc_move() was changed to try and limit the amount of
metadata blocks it could allocate or free. The intent was to stop
concurrent holders of a transaction from fully consuming the available
allocator for the transaction.
The limiting logic was a bit off. It stopped when the allocator had the
caller's limit remaining, not when it had consumed the caller's limit.
This is overly permissive and could still allow concurrent callers to
consume the allocator. It was also triggering warning messages when a
call consumed more than its allowed budget while holding a transaction.
Unfortunately, we don't have per-caller tracking of allocator resource
consumption. The best we can do is sample the allocators as we start
and return if they drop by the caller's limit. This is overly
conservative in that it attributes any consumption by concurrent
callers to all callers.
This isn't perfect but it makes the failure case less likely and the
impact shouldn't be significant. We don't often have a lot of
concurrency and the limits are larger than callers will typically
consume.
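The difference between the buggy check and the sampling fix can be shown with a small userspace model; the struct and field names are illustrative, not the real scoutfs types:

```c
#include <assert.h>
#include <stdbool.h>

struct alloc {
	long avail;	/* free metadata blocks in the allocator */
};

struct mover {
	long start_avail;	/* sampled when the call starts */
	long budget;		/* caller's consumption limit */
};

static void mover_start(struct mover *m, struct alloc *a, long budget)
{
	m->start_avail = a->avail;
	m->budget = budget;
}

/* The buggy check stopped when a->avail dropped *to* the budget, i.e.
 * when that many blocks remained.  The fix stops once consumption
 * since the sample reaches the budget. */
static bool mover_over_budget(struct mover *m, struct alloc *a)
{
	return (m->start_avail - a->avail) >= m->budget;
}
```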
Signed-off-by: Zach Brown <zab@versity.com>
Add scoutfs_alloc_meta_low_since() to test if the metadata avail or
freed resources have been used by a given amount since a previous
snapshot.
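A minimal sketch of what such a helper could look like; the signature, struct, and counters here are assumptions for illustration:

```c
#include <assert.h>
#include <stdbool.h>

struct meta_sample {
	long avail;	/* available metadata blocks */
	long freed;	/* free slots left in the freed block list */
};

/* True if either resource has dropped by at least @amount since
 * @since was sampled. */
static bool meta_low_since(const struct meta_sample *since,
			   const struct meta_sample *now, long amount)
{
	return (since->avail - now->avail) >= amount ||
	       (since->freed - now->freed) >= amount;
}
```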
Signed-off-by: Zach Brown <zab@versity.com>
As _get_log_trees() in the server prepares the log_trees item for the
client's commit, it moves all the freed data extents from the log_trees
item into core data extent allocator btree items. If the freed blocks
are very fragmented then it can exceed a commit's metadata allocation
budget trying to dirty blocks in the free data extent btree.
The fix is to move the freed data extents in multiple commits. First we
move a limited number in the main commit that does all the rest of the
work preparing the commit. Then we try to move the remaining freed
extents in multiple additional commits.
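The multi-commit approach reduces to a simple drain loop; the batch limit and counters below are an illustrative userspace model, not the server code:

```c
#include <assert.h>

#define BATCH_LIMIT 16	/* illustrative per-commit extent budget */

static long extents;	/* freed data extents left in the log_trees item */
static long commits;	/* commits performed */

/* Move a bounded batch of freed extents into the core allocator
 * items, commit, and repeat until the source is drained. */
static void move_freed_extents(void)
{
	while (extents > 0) {
		long batch = extents < BATCH_LIMIT ? extents : BATCH_LIMIT;

		extents -= batch;	/* dirty core allocator btree items */
		commits++;		/* apply this commit before more */
	}
}
```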
Signed-off-by: Zach Brown <zab@versity.com>
Callers who send to specific client connections can get -ENOTCONN if
their client has gone away. We forgot to free the send tracking struct
in that case.
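The fixed error path looks roughly like this userspace sketch; the struct and leak counter are stand-ins for the real send tracking allocation:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

struct send_track { int id; };

static long live_tracks;	/* outstanding allocations, for the sketch */

static int send_to_client(bool connected, struct send_track *st)
{
	if (!connected) {
		/* The fix: free the tracking struct instead of
		 * leaking it when the client has gone away. */
		free(st);
		live_tracks--;
		return -ENOTCONN;
	}
	/* ... queue st on the connection's send list ... */
	return 0;
}
```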
Signed-off-by: Zach Brown <zab@versity.com>
The omap code keeps track of rids that are connected to the server. It
only freed the tracked rids as the server told it that rids were being
removed. But that removal only happened as clients were evicted. If
the server shut down it'd leave the old rid entries around. They'd be
leaked as the mount was unmounted and could linger and create duplicate
entries if the server started back up and the same clients reconnected.
The fix is to free the tracking rids as the server shuts down. They'll
be rebuilt as clients reconnect if the server restarts.
Signed-off-by: Zach Brown <zab@versity.com>
If we return an error from .fill_super without having set sb->s_root
then the vfs won't call our put_super. Our fill_super is careful to
call put_super so that it can tear down partial state, but we weren't
doing this with a few very early errors in fill_super. This tripped
leak detection when we weren't freeing the sbi when returning errors
from bad option parsing.
Signed-off-by: Zach Brown <zab@versity.com>
Clients don't use the net conn info and specified that it has 0 size.
The net layer would try and allocate a zero size region which returns
the magic ZERO_SIZE_PTR, which it would then later try and free. While
that works, it's a little goofy. We can avoid the allocation when the
size is 0. The pointer will remain null which kfree also accepts.
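The shape of the fix, sketched with calloc/free standing in for the kernel allocators:

```c
#include <assert.h>
#include <stdlib.h>

/* Skip the allocation when the connection info size is 0 and rely on
 * free(NULL) being a no-op, as kfree(NULL) is in the kernel. */
static void *alloc_info(size_t size)
{
	if (size == 0)
		return NULL;	/* no ZERO_SIZE_PTR to carry around */
	return calloc(1, size);
}
```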
Signed-off-by: Zach Brown <zab@versity.com>
Add an option to skip printing structures that are likely to be so huge
that the print output becomes completely unwieldy on large systems.
Signed-off-by: Zach Brown <zab@versity.com>
Like a lot of places in the server, get_log_trees() doesn't have the
tools it needs to safely unwind partial changes in the face of an error.
In the worst case, it can have moved extents from the mount's log_trees
item into the server's main data allocator. The dirty data allocator
reference is in the super block so it can be written later. The dirty
log_trees reference is on stack, though, so it will be thrown away on
error. This ends up duplicating extents in the persistent structures
because they're written in the new dirty allocator but still remain in
the unwritten source log_trees allocator.
This change makes it harder for that to happen. It dirties the
log_trees item and always tries to update it so that the dirty blocks are
consistent if they're later written out. If we do get an error updating
the item we throw an assertion. It's not great, but it matches other
similar circumstances in other parts of the server.
Signed-off-by: Zach Brown <zab@versity.com>
We were setting sk_allocation on the quorum UDP sockets to prevent
entering reclaim while using sockets but we missed setting it on the
regular messaging TCP sockets. This could create deadlocks where the
sending socket could enter scoutfs reclaim and wait for server messages
while holding the socket lock, preventing the receive thread from
receiving messages while it blocked on the socket lock.
The fix is to prevent entering the FS to reclaim during socket
allocations.
Signed-off-by: Zach Brown <zab@versity.com>
Client log_trees allocator btrees can build up quite a number of
extents. In the right circumstances moving fragmented extents can
require dirtying a large number of paths to leaf blocks in the core allocator
btrees. It might not be possible to dirty all the blocks necessary to
move all the extents in one commit.
This reworks the extent motion so that it can be performed in multiple
commits if the meta allocator for the commit runs out while it is moving
extents. It's a minimal fix with as little disruption to the ordering
of commits and locking as possible. It simply bubbles up an error when
the allocators run out and retries functions that can already be retried
in other circumstances.
Signed-off-by: Zach Brown <zab@versity.com>
We're seeing allocator motion during get_log_trees dirty quite a lot of
blocks, which makes sense. Let's continue to up the budget. If we
still need significantly larger budgets we'll want to look into capping
the dirty block use of the allocator extent movers which will mean
changing callers to support partial progress.
Signed-off-by: Zach Brown <zab@versity.com>
When a new server starts up it rebuilds its view of all the granted
locks with lock recovery messages. Clients give the server their
granted lock modes which the server then uses to process all the resent
lock requests from clients.
The lock invalidation work in the client is responsible for
transitioning an old granted mode to a new invalidated mode from an
unsolicited message from the server. It has to process any client state
that'd be incompatible with the new mode (write dirty data, drop
caches). While it is doing this work, as an implementation short cut,
it sets the granted lock mode to the new mode so that users that are
compatible with the new invalidated mode can use the lock while it's
being invalidated. Picture readers reading data while a write lock is
invalidating and writing dirty data.
A problem arises when a lock recover request is processed during lock
invalidation. The client lock recover request handler sends a response
with the current granted mode. The server takes this to mean that the
invalidation is done but the client invalidation worker might still be
writing data, dropping caches, etc. The server will allow the state
machine to advance which can send grants to pending client requests
which believed that the invalidation was done.
All of this can lead to a grant response handler in the client tripping
the assertion that there can not be cached items that were incompatible
with the old mode in a grant from the server. Invalidation might still
be invalidating caches. Hitting this bug is very rare and requires a
new server starting up while a client has both a request outstanding and
an invalidation being processed when the lock recover request arrives.
The fix is to record the old mode during invalidation and send that in
lock recover responses. This can lead the lock server to resend
invalidation requests to the client. The client already safely handles
duplicate invalidation requests from other failover cases.
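The fix can be modeled with a few lines of userspace C; the mode values and field names are illustrative, not the real scoutfs lock structures:

```c
#include <assert.h>

enum { MODE_NULL, MODE_READ, MODE_WRITE };

struct held_lock {
	int mode;		/* granted mode, set to the new mode
				 * early so compatible users proceed */
	int inval_old_mode;	/* recorded old mode while invalidating,
				 * MODE_NULL when idle */
};

static void start_invalidate(struct held_lock *lk, int new_mode)
{
	lk->inval_old_mode = lk->mode;
	lk->mode = new_mode;
}

static void finish_invalidate(struct held_lock *lk)
{
	lk->inval_old_mode = MODE_NULL;
}

/* Recovery responses report the old mode while invalidation is in
 * flight, so a new server resends the invalidation request rather
 * than assuming it completed. */
static int recover_mode(struct held_lock *lk)
{
	return lk->inval_old_mode != MODE_NULL ? lk->inval_old_mode :
						 lk->mode;
}
```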
Signed-off-by: Zach Brown <zab@versity.com>
The change to only allocate a buffer for the first xattr item with
kmalloc instead of the entire logical xattr payload with vmalloc
included a regression for getting large xattrs.
getxattr used to copy the entire payload into the large vmalloc so it
could unlock just after get_next_xattr. The change to only getting the
first item buffer added a call to copy from the rest of the items but
those copies weren't covered by the locks. This would often work
because the lock pointer still pointed to a valid lock. But if the lock
was invalidated then the mode would no longer be compatible and
_item_lookup would return EINVAL.
The fix is to extend xattr_rwsem and cluster lock coverage to the rest
of the function body, which includes the value item copies. This also
makes getxattr's lock coverage consistent with setxattr and listxattr
which might reduce the risk of similar mistakes in the future.
Signed-off-by: Zach Brown <zab@versity.com>
After we've merged a log btree back into the main fs tree we kick off
work to free all its blocks. This would fully fill the transaction's
free blocks list before stopping to apply the commit.
Consuming the entire free list makes it hard to have concurrent holders
of a commit who also want to free things. This changes the log btree
block freeing to limit itself to a fraction of the budget that each
holder gets. That coarse limit avoids us having to precisely account
for the allocations and frees while modifying the freeing item while
still freeing many blocks per commit.
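The coarse limit amounts to capping each holder at a fraction of the shared budget; the specific numbers below are illustrative:

```c
#include <assert.h>

#define COMMIT_FREE_BUDGET	1024	/* free list capacity per commit */
#define HOLDER_FRACTION		4	/* each holder gets 1/4 */

static long free_limit(void)
{
	return COMMIT_FREE_BUDGET / HOLDER_FRACTION;
}

/* Free up to the per-holder limit in this commit and return how many
 * blocks are left for a later commit to pick up. */
static long free_some_blocks(long remaining)
{
	long n = remaining < free_limit() ? remaining : free_limit();

	return remaining - n;
}
```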
Signed-off-by: Zach Brown <zab@versity.com>
Server commits use an allocator that has a limited number of available
metadata blocks and entries in a list for freed blocks. The allocator
is refilled between commits. Holders can't fully consume the allocator
during the commit and that tended to work out because server commit
holders commit before sending responses. We'd tend to commit frequently
enough that we'd get a chance to refill the allocators before they were
consumed.
But there was no mechanism to ensure that this would be the case.
Enough concurrent server holders were able to fully consume the
allocators before committing. This caused scoutfs_meta_alloc and _free
to return errors, leading the server to fail in the worst cases.
This changes the server commit tracking to use more robust structures
which limit the number of concurrent holders so that the allocators
aren't exhausted. The commit_users struct stops holders from making
progress once the allocators don't have room for more holders. It also
lets us stop future holders from making progress once the commit work
has been queued. The previous cute use of a rwsem didn't allow for
either of these protections.
We don't have precise tracking of each holder's allocation consumption
so we don't try and reserve blocks for each holder. Instead we have a
maximum consumption per holder and make sure that all the holders can't
consume the allocators if they all use their full limit.
All of this requires the holding code paths to be well behaved and not
use more than the per-hold limit. We add some debugging code to print
the stacks of holders that were active when the total holder limit was
exceeded. This is the motivation for having state in the holders. We
can record some data at the time their hold started that'll make it a
little easier to track down which of the holders exceeded their limit.
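The admission logic can be sketched in userspace; the struct layout and names here are assumptions standing in for the real commit_users implementation:

```c
#include <assert.h>
#include <stdbool.h>

struct commit_users {
	long avail;		/* allocator blocks left for this commit */
	long per_hold_limit;	/* worst-case consumption per holder */
	long holders;		/* currently admitted holders */
	bool queued;		/* commit work queued; no new holders */
};

/* Admit a holder only while the allocators could absorb every
 * admitted holder's worst case, and refuse once the commit work has
 * been queued. */
static bool commit_hold(struct commit_users *cu)
{
	long worst = (cu->holders + 1) * cu->per_hold_limit;

	if (cu->queued || worst > cu->avail)
		return false;	/* caller waits for the next commit */
	cu->holders++;
	return true;
}

static void commit_release(struct commit_users *cu)
{
	cu->holders--;
}
```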
Signed-off-by: Zach Brown <zab@versity.com>
Add helper function to give the caller the number of blocks remaining in
the first list block that's used for meta allocation and freeing.
Signed-off-by: Zach Brown <zab@versity.com>
There was a brief time where we exported the ability to hold and apply
commits outside of the main server code. That wasn't a great idea, and
the few users have since been reworked to not require directly
manipulating server transactions, so we can reduce risk and make these
functions private again.
Signed-off-by: Zach Brown <zab@versity.com>
Quorum members will try to elect a new leader when they don't receive
heartbeats from the currently elected leader. This timeout is short to
encourage restoring service promptly.
Heartbeats are sent from the quorum worker thread and are delayed while
it synchronously starts up the server, which includes fencing previous
servers. If fence requests take too long then heartbeats will be
delayed long enough for remaining quorum members to elect a new leader
while the recently elected server is still busy fencing.
To fix this we decouple server startup from the quorum main thread.
Server starting and stopping becomes asynchronous so the quorum thread
is able to send heartbeats while the server work is off starting up and
fencing.
The server used to call into quorum to clear a flag as it exited. We
remove that mechanism and have the server maintain a running status that
quorum can query.
We add some state to the quorum work to track the asynchronous state of
the server. This lets the quorum protocol change roles immediately as
needed while remembering that there is a server running that needs to be
acted on.
The server used to also call into quorum to update quorum blocks. This
is a read-modify-write operation that has to be serialized. Now that we
have both the server starting up and the quorum work running they both
can't perform these read-modify-write cycles. Instead we have the
quorum work own all the block updates and it queries the server status
to determine when it should update the quorum block to indicate that the
server has fenced or shut down.
Signed-off-by: Zach Brown <zab@versity.com>
The fence script we use for our single node multi-mount tests only knows
how to fence by using forced unmount to destroy a mount. As of now, the
tests only generate failing nodes that need to be fenced by using forced
unmount as well. This results in the awkward situation where the
testing fence script doesn't have anything to do because the mount is
already gone.
When the test fence script has nothing to do we might not notice if it
isn't run. This adds explicit verification to the fencing tests that
the script was really run. It adds per-invocation logging to the fence
script and the test makes sure that it was run.
While we're at it, we take the opportunity to tidy up some of the
scripting around this. We use a sysfs file with the data device
major:minor numbers so that the fencing script can find and unmount
mounts without having to ask them for their rid. They may not be
operational.
Signed-off-by: Zach Brown <zab@versity.com>
Extended attribute values can be larger than a reasonable maximum size
for our btree items so we store xattrs in many items. The first pass at
this code used vmalloc to make it relatively easy to work with a
contiguous buffer that was cut up into multiple items.
The problem, of course, is that vmalloc() is expensive. Well, the
problem is that I always forget just how expensive it can be and use it
when I shouldn't. We had loads on high cpu count machines that were
catastrophically cpu bound on all the contentious work that vmalloc does
to maintain a coherent global address space.
This removes the use of vmalloc and only allocates a small buffer for
the first compound item. The later items directly reference regions of
value buffer rather than copying it to and from the large intermediate
vmalloced buffer.
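The copy pattern reduces to walking the items and copying each chunk straight to its destination; the item size and helper below are illustrative, not the real on-disk layout:

```c
#include <assert.h>
#include <string.h>

#define ITEM_SIZE 8	/* illustrative per-item payload size */

/* The logical value is split across fixed-size items and copied chunk
 * by chunk, with no large contiguous staging buffer in between. */
static void copy_value(const char *items[], int nr, size_t len, char *dst)
{
	size_t off = 0;

	for (int i = 0; i < nr && off < len; i++) {
		size_t n = (len - off) < ITEM_SIZE ? (len - off) : ITEM_SIZE;

		memcpy(dst + off, items[i], n);
		off += n;
	}
}
```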
Signed-off-by: Zach Brown <zab@versity.com>
The t_server_nr and t_first_client_nr helpers iterated over all the fs
numbers examining their quorum/is_leader files, but clients don't have a
quorum/ directory. This was causing spurious output in tests that were
looking for servers but didn't find one in the first quorum fs numbers
and made it down into the clients.
Give them a helper that returns 0 for being a leader if the quorum/ dir
doesn't exist.
Signed-off-by: Zach Brown <zab@versity.com>
We were seeing rare test failures where it looked like is_leader wasn't
set for any of the mounts. The test that couldn't find a set is_leader
file had just performed some mounts so we know that a server was up and
processing requests.
The quorum task wasn't updating the status that's shown in sysfs and
debugfs until after the server started up. This opened the race where
the server was able to serve mount requests and have the test run to
find no is_leader file set before the quorum task was able to update the
status and make its election visible.
This updates the quorum task to make its status visible more often,
typically before it does something that will take a while. The
is_leader file will now be visible before the server is started, so the
test will always see it after the server starts up and lets mounts finish.
Signed-off-by: Zach Brown <zab@versity.com>