Wait for lock recovery before sending farewell

We recently found that the server can send a farewell response and try
to tear down a client's lock state while it is still in lock recovery
with the client. The lock recovery response could then add a lock
for the client after farewell's reclaim_rid() had decided the client was
gone forever and torn down its locks.

This left a lock in the lock server that wasn't associated with any
client and so could never be invalidated. Attempts to acquire locks
that conflict with it would hang forever, which we saw as hangs in
testing with lots of unmounting.

We tried to fix it by serializing incoming request handling and
forcefully clobbering the client's lock state as soon as we received
the farewell request. That went very badly.

This takes another approach: explicitly waiting for lock recovery to
finish before sending farewell responses. It's more in line with the
overall pattern of having the client be up and functional until
farewell tears it down.
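
The deferral can be pictured with a small user-space model. This is only
a sketch under assumptions, not the scoutfs code: struct farewell,
farewell_pass() and lock_recov_finished() are hypothetical names, and a
flag array stands in for the server's reqs and send lists.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one client's queued farewell request. */
struct farewell {
	unsigned long long rid;		/* client id */
	bool lock_recov_pending;	/* client is still in lock recovery */
	bool response_sent;		/* farewell response already sent */
};

/*
 * One pass of a simplified farewell worker.  A response is sent only
 * once the client's lock recovery has finished; otherwise the request
 * stays queued.  Returns the number of requests still deferred.
 */
size_t farewell_pass(struct farewell *reqs, size_t nr)
{
	size_t deferred = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		if (reqs[i].response_sent)
			continue;
		if (reqs[i].lock_recov_pending) {
			deferred++;
			continue;
		}
		/* lock state could be reclaimed here, then respond */
		reqs[i].response_sent = true;
	}

	return deferred;
}

/*
 * Lock recovery finishing for a rid clears its pending flag and runs
 * the worker again so that a deferred response can now go out.
 */
size_t lock_recov_finished(struct farewell *reqs, size_t nr,
			   unsigned long long rid)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		if (reqs[i].rid == rid)
			reqs[i].lock_recov_pending = false;
	}

	return farewell_pass(reqs, nr);
}
```

In this model a client still in lock recovery never has its response
sent early. The response goes out on the pass that runs after recovery
finishes.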

With this in place we can revert the other attempted fix that was
causing so many problems.

Signed-off-by: Zach Brown <zab@versity.com>
Zach Brown
2022-01-21 09:46:35 -08:00
parent 813ce24d79
commit e14912974d

@@ -3462,6 +3462,18 @@ static void farewell_worker(struct work_struct *work)
}
}
	/*
	 * Responses that are ready to send can be further delayed by
	 * moving them back to the reqs list.
	 */
	list_for_each_entry_safe(fw, tmp, &send, entry) {
		/* finish lock recovery before destroying locks, fenced if too long */
		if (scoutfs_recov_is_pending(sb, fw->rid, SCOUTFS_RECOV_LOCKS)) {
			list_move_tail(&fw->entry, &reqs);
			quo_reqs++;
		}
	}

	/* clean up resources for mounts before sending responses */
	list_for_each_entry_safe(fw, tmp, &send, entry) {
		ret = reclaim_rid(sb, fw->rid);
@@ -3656,8 +3668,14 @@ static void finished_recovery(struct super_block *sb)
void scoutfs_server_recov_finish(struct super_block *sb, u64 rid, int which)
{
	DECLARE_SERVER_INFO(sb, server);

	if (scoutfs_recov_finish(sb, rid, which) > 0)
		finished_recovery(sb);

	/* rid's farewell response might be sent after it finishes lock recov */
	if (which & SCOUTFS_RECOV_LOCKS)
		queue_farewell_work(server);
}
/*