scoutfs

mirror of https://github.com/versity/scoutfs.git synced 2026-04-20 13:30:29 +00:00

Author	SHA1	Message	Date
Chris Kirby	9dfd02078d	Suppress another forced shutdown error message The "server error emptying freed" error was causing a fence-and-reclaim test failure. In this case, the error was -ENOLINK, which we should ignore for messaging purposes. Signed-off-by: Chris Kirby <ckirby@versity.com>	2026-04-13 16:46:24 -05:00
Auke Kok	8a730464ab	Add orphan-log-trees test and reclaim_skip_finalize trigger Add a reclaim_skip_finalize trigger that prevents reclaim from setting FINALIZED on log_trees entries. The test arms this trigger, force-unmounts a client to create an orphan, and verifies the log merge succeeds without timeout and the orphan reclaim message appears in dmesg. Signed-off-by: Auke Kok <auke.kok@versity.com>	2026-03-25 10:39:40 -07:00
Auke Kok	daea8d5bc1	Reclaim orphaned log_trees entries from unmounted clients An unfinalized log_trees entry whose rid is not in mounted_clients is an orphan left behind by incomplete reclaim. Previously this permanently blocked log merges because the finalize loop treated it as an active client that would never commit. Call reclaim_open_log_tree for orphaned rids before starting a log merge. Once reclaimed, the existing merge and freeing paths include them normally. Also skip orphans in get_stable_trans_seq so their open transaction doesn't artificially lower the stable sequence. Signed-off-by: Auke Kok <auke.kok@versity.com>	2026-03-25 06:47:22 -07:00
Chris Kirby	ef0f6f8ac2	Fix race in inode-deletion test Due to an iput race, the "unlink wait for open on other mount" subtest can fail. If the unlink happens inline, then the test passes. But if the orphan scanner has to complete the unlink work, it's possible that there won't be enough log merge work for the scanner to do the cleanup before we look at the seq index. Add SCOUTFS_TRIGGER_LOG_MERGE_FORCE_FINALIZE_OURS, to allow forcing a log merge. Add new counters, log_merges_start and log_merge_complete, so that tests can see that a merge has happened. Then we have to wait for the orphan scanner to do its work. Add a new counter, orphan_scan_empty, that increments each time the scanner walks the entire inode space without finding any orphans. Once the test sees that counter increment, it should be safe to check the seq index and see that the unlinked inode is gone. Signed-off-by: Chris Kirby <ckirby@versity.com>	2026-01-07 08:29:38 -06:00
Zach Brown	de70ca2372	Increase server commit block budget for alloc move A few callers of alloc_move_empty in the server were providing a budget that was too small. Recent changes to extent_mod_blocks increased the max budget that is necessary to move extents between btrees. The existing WAG of 100 was too small for trees of height 2 and 3. This caused looping in production. We can increase the move budget to half the overall commit budget, which leaves room for a height of around 7 each. This is much greater than we see in practice because the size of the per-mount btrees is effectively limited by both watermarks and thresholds to commit and drain. Signed-off-by: Zach Brown <zab@versity.com>	2025-12-17 14:22:04 -06:00
Chris Kirby	ee630b164f	Handle ENOENT when getting log merge status item Tests that cause client retries can fail with this error from server_commit_log_merge(): error -2 committing log merge: getting merge status item This can happen if the server has already committed and resolved the log merge that is being retried. We can safely ignore ENOENT here just like we do a few lines later. Signed-off-by: Chris Kirby <ckirby@versity.com>	2025-12-01 08:58:24 -06:00
Zach Brown	4d66c38c71	Remove redundant WARN in commit_log_trees The server's commit_log_trees has an error message that includes the source of the error, but it's not used for all errors. The WARN_ON is redundant with the message and is removed because it isn't filtered out when we see errors from forced unmount. Signed-off-by: Zach Brown <zab@versity.com>	2025-11-14 10:04:30 -08:00
Zach Brown	92ac132873	Silence merge splice error when forcing Silence another error warning and assertion that's assuming that the result of the errors is going to be persistent. When we're forcing an unmount we've severed storage and networking. Signed-off-by: Zach Brown <zab@versity.com>	2025-11-13 12:43:31 -08:00
Zach Brown	1ab798e7eb	Silence inconsistent srch on forced unmount Assembling a srch compaction operation creates an item and populates it with allocator state. It doesn't cleanly unwind the allocation and undo the compaction item change if allocation filling fails and issues a warning. This warning isn't needed if the error shows that we're in forced unmount. The inconsistent state won't be applied, it will be dropped on the floor as the mount is torn down. Signed-off-by: Zach Brown <zab@versity.com>	2025-11-13 12:43:31 -08:00
Zach Brown	e182914e51	Fix double free of metadata blocks in log merging The log merging process is meant to provide parallelism across workers in mounts. The idea is that the server hands out a bunch of concurrent non-intersecting work that's based on the structure of the stable input fs_root btree. The nature of the parallel work (cow of the blocks that intersect a key range) means that the ranges of concurrently issued work can't overlap or the work will all cow the same input blocks, freeing that input stable block multiple times. We're seeing this in testing. Correctness was intended by having an advancing key that sweeps sorted ranges. Duplicate ranges would never be hit as the key advanced past each it visited. This was broken by the mapping of the fs item keys to log merge tree keys by clobbering the sk_zone key value. It effectively interleaves the ranges of each zone in the fs root (meta indexes, orphans, fs items). With just the right log merge conditions that involve logged items in the right places and partially completed work to insert remaining ranges behind the key, ranges can be stored at mapped keys that end up with ranges out of order. The server iterates over these and ends up issuing overlapping work, which results in duplicated frees of the input blocks. The fix, without changing the format of the stored log tree items, is to perform a full sweep of all the range items and determine the next item by looking at the full precision stored keys. This ensures that the processed ranges always advance and never overlap. Signed-off-by: Zach Brown <zab@versity.com>	2025-11-13 12:43:31 -08:00
Zach Brown	102899290e	Allow harmless srch compact commit errors The server's srch commit error warnings were a bit severe. The compaction operations are a function of persistent state. If they fail then the inputs still exist and the next attempt will retry whatever failed. Not all errors are a problem, only those that result in partial commits that leave inconsistent state. In particular, we have to support the case where a client retransmits a compaction request to a new server after a first server performed the commit but couldn't respond. Throwing warnings when the new server gets ENOENT looking for the busy compaction item isn't helpful. This came in tests as background compaction was in flight as tests unmounted and mounted servers repeatedly to test lock recovery. Signed-off-by: Zach Brown <zab@versity.com>	2025-10-29 10:12:52 -07:00
Chris Kirby	d277d7e955	Fix race condition in orphan-inodes test Make sure that the orphan scanners can see deletions after forced unmounts by waiting for reclaim_open_log_tree() to run on each mount; and waiting for finalize_and_start_log_merge() to run and not find any finalized trees. Do this by adding two new counters: reclaimed_open_logs and log_merge_no_finalized and fixing the orphan-inodes test to check those before waiting for the orphan scanners to complete. Signed-off-by: Chris Kirby <ckirby@versity.com>	2025-10-22 10:59:03 -07:00
Chris Kirby	c72bf915ae	Use ENOLINK as a special error code during forced unmount Tests such as quorum-heartbeat-timeout were failing with EIO messages in dmesg output due to expected errors during forced unmount. Use ENOLINK instead, and filter all errors from dmesg with this errno (67). Signed-off-by: Chris Kirby <ckirby@versity.com>	2025-10-22 10:58:44 -07:00
Chris Kirby	35bcad91a6	Close window where we can lose search items In finalize_and_start_log_merge(), we overwrite the server mount's log tree with its finalized form and then later write out its next open log tree. This leaves a window where the mount's srch_file is nulled out, causing us to lose any search items in that log tree. This shows up as intermittent failures in the srch-basic-functionality test. Eliminate this timing window by doing what unmount/reclaim does when it finalizes, by moving the resources from the item that we finalize into server trees/items as it finalizes. Then there is no window where those resources exist only in memory until we create another transaction. Signed-off-by: Chris Kirby <ckirby@versity.com>	2025-10-06 12:27:25 -05:00
Chris Kirby	bb3e1f3665	Fix commit budget calculation with multiple holders The try_drain_data_freed() path was generating errors about overrunning its commit budget: scoutfs f.2b8928.r.02689f error: 1 holders exceeded alloc budget av: bef 8185 now 8036, fr: bef 8185 now 7602 The budget overrun check was using the current number of commit holders (in this case one) instead of the maximum number of concurrent holders (in this case two). So even well behaved paths like try_drain_data_freed() can appear to exceed their commit budget if other holders dirty some blocks and apply their commits before the try_drain_data_freed() thread does its final budget reconciliation. Signed-off-by: Chris Kirby <ckirby@versity.com>	2025-10-06 12:27:25 -05:00
Chris Kirby	3f786596e0	Don't overrun the block budget in server_log_merge_free_work(). This fixes a potential fence post failure like the following: error: 1 holders exceeded alloc budget av: bef 7407 now 7392, fr: bef 8185 now 7672 The code is only accounting for the freed btree blocks, not the dirtying of other items. So it's possible to be at exactly (COMMIT_HOLD_ALLOC_BUDGET / 2), dirty some log btree blocks, loop again, then consume another (COMMIT_HOLD_ALLOC_BUDGET / 2) and blow past the total budget. In this example, we went over by 13 blocks. By only consuming up to 1/8 of the budget on each loop, and committing when we have consumed 3/4 of the budget, we can avoid the fence post condition. Signed-off-by: Chris Kirby <ckirby@versity.com>	2025-10-06 12:27:25 -05:00
Auke Kok	2f48a606e8	Fix -Wmaybe-uninitialized since rhel9.5 Looks like the compiler isn't smart enough to understand the pass by pointer value, and we can initialize it here easily. make[1]: Entering directory '/usr/src/kernels/5.14.0-503.26.1.el9_5.x86_64' CC [M] /home/auke/scoutfs/kmod/src/server.o /home/auke/scoutfs/kmod/src/server.c: In function ‘fence_pending_recov_worker’: /home/auke/scoutfs/kmod/src/server.c:4170:23: error: ‘addr.v4.addr’ may be used uninitialized in this function [-Werror=maybe-uninitialized] 4170 | ret = scoutfs_fence_start(sb, rid, le32_to_be32(addr.v4.addr), | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 4171 | SCOUTFS_FENCE_CLIENT_RECOVERY); | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ cc1: all warnings being treated as errors There's still the obvious issue here that we'd intended to support ipv6 but just disregard that here. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-05-08 15:20:50 -07:00
Zach Brown	e457694f19	Don't send dirty data_freed blocks to client At the end of get_log_trees we can try and drain the data_freed extent tree, which can take multiple commits. If a commit fails then the blocks are still dirty in memory. We can't send references to those blocks to the client. We have to return an error and not send the log_trees, like the main get_log_trees does. The client will retry and eventually get a log_trees that references blocks that were successfully committed. Signed-off-by: Zach Brown <zab@versity.com>	2025-04-29 11:46:38 -07:00
Auke Kok	8a45c2baff	Deprecate struct timeval. We switch to using 64bit usec structs and recommended replacement functions from Documentation/core-api/timekeeping.rst. Signed-off-by: Auke Kok <auke.kok@versity.com>	2024-10-03 12:41:05 -07:00
Zach Brown	e9ad61b444	Delete multiple log trees items per server commit server_log_merge_free_work() is responsible for freeing all the input log trees for a log merge operation that has finished. It looks for the next item to free, frees the log btree it references, and then deletes the item. It was doing this with a full server commit for each item which can take an agonizingly long time. This changes it to perform multiple deletions in a commit as long as there's plenty of alloc space. The moment the commit gets low it applies the commit and opens a new one. This sped up the deletion of a few hundred thousand log tree items from taking hours to seconds. Signed-off-by: Zach Brown <zab@versity.com>	2024-01-25 11:30:17 -08:00
Zach Brown	b5630f540d	Add tracing of the log merge finalizing decision Signed-off-by: Zach Brown <zab@versity.com>	2024-01-25 11:30:17 -08:00
Zach Brown	90a4c82363	Make log merge wait timeout tunable Add a mount option for the amount of time that log merge creation can wait before giving up. We add some counters so we can see how often the timeout is being hit and what the average successful wait time is. Signed-off-by: Zach Brown <zab@versity.com>	2024-01-25 11:25:56 -08:00
Zach Brown	f654fa0fda	Send syncs once when starting to merge The server sends sync requests to clients when it sees that they have open log trees that need to be committed for log merging to proceed. These are currently sent in the context of each client's get_log_trees request, resulting in sync requests queued for one client from all clients. Depending on message delivery and commit latencies, this can create a sync storm. The server's sends are reliable and the open commits are marked with the seq when they opened. It's easy for us to record having sent syncs to all open commits so that future attempts can be avoided. Later open commits will have higher seqs and will get a new round of syncs sent. Signed-off-by: Zach Brown <zab@versity.com>	2024-01-25 11:25:20 -08:00
Zach Brown	50168a2d2a	Check each client's last log item for stable seq The server was checking all client log_trees items to search for the lowest commit seq that was still open. This can be expensive when there are a lot of finalized log_trees items that won't have open seqs. Only the last log_trees item for each client rid can be open, and the items are sorted by rid and nr, so we can easily only check the last item for each client rid. Signed-off-by: Zach Brown <zab@versity.com>	2024-01-25 11:24:50 -08:00
Zach Brown	3c0616524a	Only search last log_trees per rid for finalizing During get_log_trees the server checks log_trees items to see if it should start a log merge operation. It did this by iterating over all log_trees items and there can be quite a lot of them. It doesn't need to see all of the items. It only needs to see the most recent log_trees item for each mount. That's enough to make the decisions that start the log merging process. Signed-off-by: Zach Brown <zab@versity.com>	2024-01-25 11:23:59 -08:00
Zach Brown	d5c699c3b4	Don't respond with ENOENT for no srch compaction The srch compaction request building function and the srch compaction worker both have logic to recognize a valid response with no input files indicating that there's no work to do. The server unfortunately translated nr == 0 into ENOENT and sent that error response to the client. This caused the client to increment error counters in the common case when there's no compaction work to perform. We'd like the error counter to reflect actual errors, we're about to check it in a test, so let's fix this up so the server sends a successful response with nr == 0 to indicate that there's no work to do. Signed-off-by: Zach Brown <zab@versity.com>	2023-11-07 10:30:38 -08:00
Zach Brown	006f429f72	Use seqlock instead of seqcount in server The server had a few lower level seqcounts that it used to protect state. One user got it wrong by forgetting to disable pre-emption around writers. Debug kernels warned as write_seqcount_begin() was called without preemption disabled. We fix that user and make it easier to get right in the future by having one higher level seqlock and using that consistently for seq read begin/retry and write lock/unlock patterns. Signed-off-by: Zach Brown <zab@versity.com>	2023-10-19 15:43:15 -07:00
Auke Kok	7006a84d96	flush_work_sync is equivalent to flush_work. v3.15-rc1-6-g1a56f2aa4752 removes flush_work_sync entirely, but ever since v3.6-rc1-25-g606a5020b9bd which made all workqueues non-reentrant, it has been equivalent to flush_work. This is safe because in all cases only one server->work can be in flight at a time. Signed-off-by: Auke Kok <auke.kok@versity.com>	2023-10-09 15:35:40 -04:00
Zach Brown	4784ccdfd5	Start server commits when holds wait for alloc Server code that wants to dirty blocks by holding a commit won't be allowed to until the current allocators for the server transaction have enough space for the holder. As an active holder applies the commit the allocators are refilled and the waiting holders will proceed. But the current allocators can have no resources as the server starts up. There will never be active holders to apply the commit and refill the allocators. In this case all the holders will block indefinitely. The fix is to trigger a server commit when a holder doesn't have room. It used to be that commits were only triggered when apply callers were waiting. We transfer some of that logic into a new 'committing' field so that we can have commits in flight without apply callers waiting. We add it to the server commit tracing. While we're at it we clean up the logic that tests if a hold can proceed. It used to be confusingly split across two functions that both could sample the current allocator space remaining. This could lead to weird cases where the first holder could use the second alloc remaining call, not the one whose values were tested to see if the holder could fit. Now each hold check only samples the allocators once. And finally we fix a subtle case where the budget exceeded message can spuriously trigger in the case where dirtying the freed list created a new empty block after the holder recorded the amount of space in the freed block. Signed-off-by: Zach Brown <zab@versity.com>	2023-10-03 13:32:09 -07:00
Zach Brown	8a64b46a2f	Process log merge splicing in many commits Log merge completions were spliced in one server commit. It's possible to get enough completion work pending that it all can't be completed in one server commit. Operations fail with ENOSPC and because these changes can't be unwound cleanly the server asserts. This allows the completion splicing to break the work up into multiple commits. Processing completions in multiple commits means that request creation can observe the merge status in states that weren't possible before. Splicing is careful to maintain an elevated nr_complete count while the client can't get requests because the tree is rebalancing. Signed-off-by: Zach Brown <zab@versity.com>	2023-07-14 13:28:29 -07:00
Zach Brown	2e2ccb6f61	Allow replaying srch file rotation When a client no longer needs to append to a srch file, for whatever reason, we move the reference from the log_trees item into a specific srch file btree item in the server's srch file tracking btree. Zeroing the log_trees item and inserting the server's btree item are done in a server commit and should be written atomically. But commit_log_trees had an error handling case that could leave the newly inserted item dirty in memory without zeroing the srch file reference in the existing log_trees item. Future attempts to rotate the file reference, perhaps by retrying the commit or by reclaiming the client's rid, would get EEXIST and fail. This fixes the error handling path to ensure that we'll keep the dirty srch file btree and log_trees item in sync. The desynced items can still exist in the world so we'll tolerate getting EEXIST on insertion. After enough time has passed, or if repair zeroed the duplicate reference, we could remove this special case from insertion. Signed-off-by: Zach Brown <zab@versity.com>	2023-01-17 14:33:27 -08:00
Zach Brown	40aa47c888	Have the server keep a private dirty super block As the server does its work its transactions modify a dirty super block in memory. This used the global super block in scoutfs_sb_info which was visible to everything, including the client. Move the dirty super block over to the private server info so that only the server can see it. This is mostly boring storage motion but we do change that the quorum code hands the server a static copy of the quorum config to use as it starts up before it reads the most recent super block. Signed-off-by: Zach Brown <zab@versity.com>	2023-01-06 11:15:45 -08:00
Zach Brown	7720222588	Have statfs use unlocked stable roots The server's statfs request handler was intending to lock dirty structures as they were walked to get sums used for statfs fields. Other callers walk stable structures, though, so the summation calls had grown iteration over other structures that the server didn't know it had to lock. This meant that the server was walking unlocked dirty structures as they were being modified. The races are very tight, but it can result in request handling errors that shut down connections and IO errors from trying to read inconsistent refs as they were modified by the locked writer. We've built up infrastructure so the server can now walk stable structures just like the other callers. It will no longer wander into dirty blocks so it doesn't need to lock them and it will retry if its walk of stale data crosses a broken reference. Signed-off-by: Zach Brown <zab@versity.com>	2022-12-12 14:59:22 -08:00
Zach Brown	fe4734d019	Save a full stable super in the server The server has a mechanism for tracking the last stable roots used by network rpcs. We expand it a bit to include the entire super so that we can add users in the server which want the last full stable super. We can still use the stable super to give out the stable roots. Signed-off-by: Zach Brown <zab@versity.com>	2022-12-12 14:59:22 -08:00
Zach Brown	b1a43bb312	Make quorum config use more precise The quorum code was using the copy of the super block in the sb info for its config. With that going away we make different users more carefully reference the config. The quorum agent has a copy that it reads on setup, the client rarely reads a copy when trying to connect, and the server uses its super. This is about data access isolation and should have no functional effect other than to cause more super reads. Signed-off-by: Zach Brown <zab@versity.com>	2022-12-12 14:59:22 -08:00
Zach Brown	929703213f	Add fsid sbi field A few paths throughout the code get the fsid for the current mount by using the copy of the super block that we store in the scoutfs_sb_info for the mount. We'd like to remove the super block from the sbi and it's cleaner to have a specific constant field for the fsid of the mount which will not change. Signed-off-by: Zach Brown <zab@versity.com>	2022-12-12 14:59:22 -08:00
Zach Brown	233fbb39f3	Limit alloc_move per-call allocator consumption Recently scoutfs_alloc_move() was changed to try and limit the amount of metadata blocks it could allocate or free. The intent was to stop concurrent holders of a transaction from fully consuming the available allocator for the transaction. The limiting logic was a bit off. It stopped when the allocator had the caller's limit remaining, not when it had consumed the caller's limit. This is overly permissive and could still allow concurrent callers to consume the allocator. It was also triggering warning messages when a call consumed more than its allowed budget while holding a transaction. Unfortunately, we don't have per-caller tracking of allocator resource consumption. The best we can do is sample the allocators as we start and return if they drop by the caller's limit. This is overly conservative in that it accounts any consumption during concurrent callers to all callers. This isn't perfect but it makes the failure case less likely and the impact shouldn't be significant. We don't often have a lot of concurrency and the limits are larger than callers will typically consume. Signed-off-by: Zach Brown <zab@versity.com>	2022-07-29 11:25:01 -07:00
Zach Brown	e8c64b4217	Move freed data extents in multiple server commits As _get_log_trees() in the server prepares the log_trees item for the client's commit, it moves all the freed data extents from the log_trees item into core data extent allocator btree items. If the freed blocks are very fragmented then it can exceed a commit's metadata allocation budget trying to dirty blocks in the free data extent btree. The fix is to move the freed data extents in multiple commits. First we move a limited number in the main commit that does all the rest of the work preparing the commit. Then we try to move the remaining freed extents in multiple additional commits. Signed-off-by: Zach Brown <zab@versity.com>	2022-07-28 11:42:33 -07:00
Zach Brown	31e474c5fa	Protect get_log_trees corruption with assertion Like a lot of places in the server, get_log_trees() doesn't have the tools it needs to safely unwind partial changes in the face of an error. In the worst case, it can have moved extents from the mount's log_trees item into the server's main data allocator. The dirty data allocator reference is in the super block so it can be written later. The dirty log_trees reference is on stack, though, so it will be thrown away on error. This ends up duplicating extents in the persistent structures because they're written in the new dirty allocator but still remain in the unwritten source log_trees allocator. This change makes it harder for that to happen. It dirties the log_trees item and always tries to update so that the dirty blocks are consistent if they're later written out. If we do get an error updating the item we throw an assertion. It's not great, but it matches other similar circumstances in other parts of the server. Signed-off-by: Zach Brown <zab@versity.com>	2022-06-17 14:22:59 -07:00
Zach Brown	0d4bf83da3	Reclaim log_trees alloc roots in multiple commits Client log_trees allocator btrees can build up quite a number of extents. In the right circumstances fragmented extents can have to dirty a large number of paths to leaf blocks in the core allocator btrees. It might not be possible to dirty all the blocks necessary to move all the extents in one commit. This reworks the extent motion so that it can be performed in multiple commits if the meta allocator for the commit runs out while it is moving extents. It's a minimal fix with as little disruption to the ordering of commits and locking as possible. It simply bubbles up an error when the allocators run out and retries functions that can already be retried in other circumstances. Signed-off-by: Zach Brown <zab@versity.com>	2022-06-08 11:53:53 -07:00
Zach Brown	48f1305a8a	Increase server commit dirty block budget We're seeing allocator motion during get_log_trees dirty quite a lot of blocks, which makes sense. Let's continue to up the budget. If we still need significantly larger budgets we'll want to look into capping the dirty block use of the allocator extent movers which will mean changing callers to support partial progress. Signed-off-by: Zach Brown <zab@versity.com>	2022-05-05 12:11:14 -07:00
Zach Brown	d8231016f8	Free fewer log btree blocks per server commit After we've merged a log btree back into the main fs tree we kick off work to free all its blocks. This would fully fill the transaction's free blocks list before stopping to apply the commit. Consuming the entire free list makes it hard to have concurrent holders of a commit who also want to free things. This changes the log btree block freeing to limit itself to a fraction of the budget that each holder gets. That coarse limit avoids us having to precisely account for the allocations and frees while modifying the freeing item while still freeing many blocks per commit. Signed-off-by: Zach Brown <zab@versity.com>	2022-04-01 15:28:20 -07:00
Zach Brown	3c2b329675	Limit alloc consumption in server commits Server commits use an allocator that has a limited number of available metadata blocks and entries in a list for freed blocks. The allocator is refilled between commits. Holders can't fully consume the allocator during the commit and that tended to work out because server commit holders commit before sending responses. We'd tend to commit frequently enough that we'd get a chance to refill the allocators before they were consumed. But there was no mechanism to ensure that this would be the case. Enough concurrent server holders were able to fully consume the allocators before committing. This causes scoutfs_meta_alloc and _free to return errors, leading the server to fail in the worst cases. This changes the server commit tracking to use more robust structures which limit the number of concurrent holders so that the allocators aren't exhausted. The commit_users struct stops holders from making progress once the allocators don't have room for more holders. It also lets us stop future holders from making progress once the commit work has been queued. The previous cute use of a rwsem didn't allow for either of these protections. We don't have precise tracking of each holder's allocation consumption so we don't try and reserve blocks for each holder. Instead we have a maximum consumption per holder and make sure that all the holders can't consume the allocators if they all use their full limit. All of this requires the holding code paths to be well behaved and not use more than the per-hold limit. We add some debugging code to print the stacks of holders that were active when the total holder limit was exceeded. This is the motivation for having state in the holders. We can record some data at the time their hold started that'll make it a little easier to track down which of the holders exceeded their limit. Signed-off-by: Zach Brown <zab@versity.com>	2022-04-01 15:28:17 -07:00
Zach Brown	44f38a31ec	Make server commit access private again There was a brief time where we exported the ability to hold and apply commits outside of the main server code. That wasn't a great idea, and the few users have since been reworked to not require directly manipulating server transactions, so we can reduce risk and make these functions private again. Signed-off-by: Zach Brown <zab@versity.com>	2022-04-01 15:21:43 -07:00
Zach Brown	bb3db7e272	Send quorum heartbeats while fencing Quorum members will try to elect a new leader when they don't receive heartbeats from the currently elected leader. This timeout is short to encourage restoring service promptly. Heartbeats are sent from the quorum worker thread and are delayed while it synchronously starts up the server, which includes fencing previous servers. If fence requests take too long then heartbeats will be delayed long enough for remaining quorum members to elect a new leader while the recently elected server is still busy fencing. To fix this we decouple server startup from the quorum main thread. Server starting and stopping becomes asynchronous so the quorum thread is able to send heartbeats while the server work is off starting up and fencing. The server used to call into quorum to clear a flag as it exited. We remove that mechanism and have the server maintain a running status that quorum can query. We add some state to the quorum work to track the asynchronous state of the server. This lets the quorum protocol change roles immediately as needed while remembering that there is a server running that needs to be acted on. The server used to also call into quorum to update quorum blocks. This is a read-modify-write operation that has to be serialized. Now that we have both the server starting up and the quorum work running they both can't perform these read-modify-write cycles. Instead we have the quorum work own all the block updates and it queries the server status to determine when it should update the quorum block to indicate that the server has fenced or shut down. Signed-off-by: Zach Brown <zab@versity.com>	2022-03-31 10:29:43 -07:00
Zach Brown	8decc54467	Clean up mount option handling The mount options code is some of the oldest in the tree and is weirdly split between options.c and super.c. This cleans up the options code, moves it all to options.c, and reworks it to be more in line with the modern subsystem convention of storing state in an allocated info struct. Rather than putting the parsed options in the super for everyone to directly reference we put them in the private options info struct and add a locked read function. This will let us add sysfs files to change mount options while safely serializing with readers. All the users of mount options that used to directly reference the parsed struct now call the read function to get a copy. They're all small local changes except for quorum which saves a static copy of the quorum slot number because it references it in so many places and relies on it not changing. Finally, we remove the empty debugfs "options" directory. Signed-off-by: Zach Brown <zab@versity.com>	2022-03-09 11:12:36 -08:00
Zach Brown	730a84af92	Silence resent log merge commit error The server's log merge complete request handler was considering the absence of the client's original request as a failure. Unfortunately, this case is possible if a previous server successfully completed the client's request but the response was lost because it stopped for whatever reason. The failure was being logged as a hard error to the console which was causing tests to occasionally fail during server failover that hit just as the log merge completion was being processed. The error was being sent to the client as a response, we just need to silence the message for these expected but rare errors. We also fix the related case where the server printed the even more harsh WARN_ON if there was a next original request but it wasn't the one we expected to find from our requesting client. Signed-off-by: Zach Brown <zab@versity.com>	2022-02-02 11:26:36 -08:00
Zach Brown	5f2259c48f	Revert "Fix client/server race btwn lock recov and farewell" This reverts commit `61ad844891`. This fix was trying to ensure that lock recovery response handling can't run after farewell calls reclaim_rid() by jumping through a bunch of hoops to tear down locking state as the first farewell request arrived. It introduced very slippery use after free during shutdown. It appears that it was from drain_workqueue() previously being able to stop chaining work. That's no longer possible when you're trying to drain two workqueues that can queue work in each other. We found a much clearer way to solve the problem so we can toss this. Signed-off-by: Zach Brown <zab@versity.com>	2022-01-24 09:40:08 -08:00
Zach Brown	e14912974d	Wait for lock recovery before sending farewell We recently found that the server can send a farewell response and try to tear down a client's lock state while it was still in lock recovery with the client. The lock recovery response could add a lock for the client after farewell's reclaim_rid() had thought the client was gone forever and tore down its locks. This left a lock in the lock server that wasn't associated with any clients and so could never be invalidated. Attempts to acquire conflicting locks with it would hang forever, which we saw as hangs in testing with lots of unmounting. We tried to fix it by serializing incoming request handling and forcefully clobbering the client's lock state as we first got the farewell request. That went very badly. This takes another approach of trying to explicitly wait for lock recovery to finish before sending farewell responses. It's more in line with the overall pattern of having the client be up and functional until farewell tears it down. With this in place we can revert the other attempted fix that was causing so many problems. Signed-off-by: Zach Brown <zab@versity.com>	2022-01-24 09:39:51 -08:00
Zach Brown	e3c7e21c40	Use write memory barrier in set_shutting_down The server's little set_shutting_down() helper accidentally used a read barrier instead of a write barrier. Signed-off-by: Zach Brown <zab@versity.com>	2022-01-19 09:17:38 -08:00

173 Commits