Wait for lock recovery before sending farewell

We recently found that the server could send a farewell response and try to tear down a client's lock state while it was still in lock recovery with the client. The lock recovery response could add a lock for the client after farewell's reclaim_rid() had decided the client was gone forever and torn down its locks. This left a lock in the lock server that wasn't associated with any client and so could never be invalidated. Attempts to acquire locks that conflicted with it would hang forever, which we saw as hangs in testing with lots of unmounting.

We tried to fix it by serializing incoming request handling and forcefully clobbering the client's lock state when we first got the farewell request. That went very badly.

This takes another approach: explicitly waiting for lock recovery to finish before sending farewell responses. It's more in line with the overall pattern of having the client be up and functional until farewell tears it down.

With this in place we can revert the other attempted fix that was causing so many problems.

Signed-off-by: Zach Brown <zab@versity.com>
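The ordering this change enforces can be modeled outside the kernel. What follows is a minimal user-space sketch, not the scoutfs code: struct client, farewell_pass() and recov_finish() are hypothetical names made up for illustration, and a plain array stands in for the reqs and send lists. It only shows that a farewell response is held back while lock recovery is pending, and that finishing recovery re-runs the farewell pass.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-client state for illustration; not a scoutfs structure. */
struct client {
	uint64_t rid;
	bool farewell_requested;	/* client asked to leave */
	bool lock_recov_pending;	/* server still recovering its locks */
	bool farewell_sent;		/* locks torn down, response sent */
};

/*
 * One pass of a farewell worker.  Requests whose client is still in
 * lock recovery stay queued; everything else is torn down and
 * answered.  Returns the number of responses sent in this pass.
 */
static int farewell_pass(struct client *cl, int nr)
{
	int sent = 0;

	for (int i = 0; i < nr; i++) {
		if (!cl[i].farewell_requested || cl[i].farewell_sent)
			continue;
		/* finish lock recovery before destroying locks */
		if (cl[i].lock_recov_pending)
			continue;
		/* lock state for cl[i].rid would be reclaimed here */
		cl[i].farewell_sent = true;
		sent++;
	}
	return sent;
}

/*
 * Lock recovery finished for rid: clear the pending flag and re-run
 * the farewell pass so a delayed response can now go out.
 */
static int recov_finish(struct client *cl, int nr, uint64_t rid)
{
	for (int i = 0; i < nr; i++) {
		if (cl[i].rid == rid)
			cl[i].lock_recov_pending = false;
	}
	return farewell_pass(cl, nr);
}
```

In this model, a client that requests farewell while its lock recovery is pending gets no response from farewell_pass() until recov_finish() is called for its rid, so no lock can be added for a client whose locks were already torn down.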
2026-01-07 04:26:29 +00:00 · 2022-01-21 09:46:35 -08:00
parent 813ce24d79
commit e14912974d
1 changed files with 18 additions and 0 deletions
--- a/kmod/src/server.c
+++ b/kmod/src/server.c
@@ -3462,6 +3462,18 @@ static void farewell_worker(struct work_struct *work)
 		}
 	}

+	/*
+	 * Responses that are ready to send can be further delayed by
+	 * moving them back to the reqs list.
+	 */
+	list_for_each_entry_safe(fw, tmp, &send, entry) {
+		/* finish lock recovery before destroying locks, fenced if too long */
+		if (scoutfs_recov_is_pending(sb, fw->rid, SCOUTFS_RECOV_LOCKS)) {
+			list_move_tail(&fw->entry, &reqs);
+			quo_reqs++;
+		}
+	}
+
 	/* clean up resources for mounts before sending responses */
 	list_for_each_entry_safe(fw, tmp, &send, entry) {
 		ret = reclaim_rid(sb, fw->rid);
@@ -3656,8 +3668,14 @@ static void finished_recovery(struct super_block *sb)

 void scoutfs_server_recov_finish(struct super_block *sb, u64 rid, int which)
 {
+	DECLARE_SERVER_INFO(sb, server);
+
 	if (scoutfs_recov_finish(sb, rid, which) > 0)
 		finished_recovery(sb);
+
+	/* rid's farewell response might be sent after it finishes lock recov */
+	if (which & SCOUTFS_RECOV_LOCKS)
+		queue_farewell_work(server);
 }

 /*