mirror of
https://github.com/SCST-project/scst.git
synced 2026-06-09 23:22:33 +00:00
scst_user: Fix infinite cleanup loop caused by stale SGV pool reference
The device cleanup loop in dev_user_process_cleanup() spins at ~2 million
iterations per second and never exits, ultimately triggering a kernel soft
lockup. The previous workaround panicked the system after 10,000
iterations.
Root cause (confirmed by instrumentation):
A ucmd gets permanently stuck in ucmd_hash with:
    state        = UCMD_STATE_ON_FREE_SKIPPED (7)
    cmd          = NULL
    ref          = 1
    sent_to_user = 0
The stuck ref=1 is the reference taken by dev_user_alloc_pages() via
ucmd_get() for the first scatter-gather page. It is released only by
dev_user_free_sg_entries() → ucmd_put(), which fires when the SGV pool
*evicts* a cached object. The sequence that prevents this eviction:
1. dev_user_unjam_dev() finds an EXECING command (sent_to_user=1,
ref=2: alloc + alloc_pages), bumps ref to 3 via ucmd_get_check(),
then calls dev_user_unjam_cmd().
2. dev_user_unjam_cmd() releases cmd_list_lock and calls
scst_cmd_done(SCST_CONTEXT_THREAD), which synchronously runs the
full SCST completion pipeline:
     dev_user_on_free_cmd()
       ucmd->cmd = NULL
       ucmd->state = UCMD_STATE_ON_FREE_SKIPPED   (type == IGNORE)
       dev_user_process_reply_on_free()
         dev_user_free_sgv()
           sgv_pool_free(ucmd->sgv)
             /* SGV cached on pool LRU; dev_user_free_sg_entries()
              * not called; alloc_pages ucmd_get() not balanced */
           ucmd->sgv = NULL
         ucmd_put()                               ← ref: 3→2
3. Back in dev_user_unjam_dev(): ucmd_put() ← ref: 2→1.
ref != 0, so dev_user_free_ucmd() / cmd_remove_hash() are NOT called.
ucmd remains in ucmd_hash.
4. dev_user_unjam_cmd() also reset sent_to_user to 0, so on every subsequent pass
through dev_user_unjam_dev() the ucmd is counted (res++) but skipped
(!sent_to_user → continue). dev_user_get_next_cmd() returns -EAGAIN
(ucmd is not in ready_cmd_list). With cleanup_done=1 the while(1)
loop has no exit condition.
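
The reference arithmetic in steps 1-4 can be traced with a small user-space model. This is only a sketch: `toy_ucmd`, `toy_ucmd_get()`, `toy_ucmd_put()`, `toy_unjam_one()` and `toy_is_stuck()` are hypothetical names invented for illustration, not SCST code, and the model assumes a single EXECING command with no locking.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a ucmd; only the fields named in the commit message. */
struct toy_ucmd {
	int ref;            /* reference count */
	bool sent_to_user;  /* true while the command is EXECING in user space */
	bool sgv_cached;    /* true once the SGV object sits on the pool LRU */
	bool in_hash;       /* true while the ucmd is still in ucmd_hash */
};

static void toy_ucmd_get(struct toy_ucmd *u)
{
	u->ref++;
}

/* Dropping the last reference models dev_user_free_ucmd() -> cmd_remove_hash(). */
static void toy_ucmd_put(struct toy_ucmd *u)
{
	assert(u->ref > 0);
	if (--u->ref == 0)
		u->in_hash = false;
}

/* Steps 1-3 for one EXECING command; returns the reference count left behind. */
static int toy_unjam_one(struct toy_ucmd *u)
{
	toy_ucmd_get(u);          /* step 1: ucmd_get_check(), ref 2 -> 3 */
	u->sent_to_user = false;  /* unjam clears the flag (see step 4) */
	u->sgv_cached = true;     /* step 2: sgv_pool_free() caches, no put */
	toy_ucmd_put(u);          /* step 2: reply-on-free put, ref 3 -> 2 */
	toy_ucmd_put(u);          /* step 3: unjam_dev put, ref 2 -> 1 */
	return u->ref;
}

/* Step 4: a later pass counts the ucmd but has nothing it can do with it. */
static bool toy_is_stuck(const struct toy_ucmd *u)
{
	return u->in_hash && !u->sent_to_user && u->ref > 0;
}
```

Starting from ref = 2 and sent_to_user = 1, one call to `toy_unjam_one()` leaves the model with one outstanding reference, still in the hash, and no longer eligible for unjamming, which is the stuck state listed under the root cause.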
The sgv_pool_flush() calls at the TOP of dev_user_unjam_dev() run
BEFORE any commands are unjammed. SGV objects cached during unjamming
are therefore never flushed; dev_user_free_sg_entries() never fires.
Fix:
Add sgv_pool_flush() for both pools at the BOTTOM of
dev_user_unjam_dev(), after the spinlock is released. This evicts
all SGV objects cached during unjamming, triggering:
    dev_user_free_sg_entries() → ucmd_put() → dev_user_free_ucmd()
    → cmd_remove_hash()
removing the stuck ucmd from the hash. On the next cleanup-loop iteration
dev_user_unjam_dev() returns res=0 and dev_user_process_cleanup() breaks.
sgv_pool_flush() is fully synchronous (calls sgv_dtor_and_free() inline);
by the time it returns the callbacks have already fired and the ucmd has
already been removed from the hash. No schedule() or sleep is needed.
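
Why a flush placed after the unjamming balances the reference, while one placed before it does not, can be sketched with a toy cache whose eviction callback runs inline. All names here (`toy_pool`, `toy_owner`, `toy_pool_free()`, `toy_pool_flush()`, `toy_on_evict()`) are assumptions made up for this illustration and do not reflect the real SGV pool implementation; the model covers a single pass and a fixed-size cache.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_POOL_SLOTS 8

/* Owner of a cached buffer; stands in for the stuck ucmd. */
struct toy_owner {
	int ref;
	bool in_hash;
};

/* Toy pool: "freed" buffers are parked here until a flush evicts them. */
struct toy_pool {
	struct toy_owner *cached[TOY_POOL_SLOTS];
	size_t count;
};

/* Eviction callback: models dev_user_free_sg_entries() -> ucmd_put(). */
static void toy_on_evict(struct toy_owner *o)
{
	if (--o->ref == 0)
		o->in_hash = false;  /* models cmd_remove_hash() */
}

/* Models sgv_pool_free(): caches the buffer, runs no callback. */
static bool toy_pool_free(struct toy_pool *p, struct toy_owner *o)
{
	if (p->count == TOY_POOL_SLOTS)
		return false;        /* toy cache is full */
	p->cached[p->count++] = o;
	return true;
}

/* Models sgv_pool_flush(): every callback has run by the time it returns. */
static void toy_pool_flush(struct toy_pool *p)
{
	while (p->count > 0)
		toy_on_evict(p->cached[--p->count]);
}
```

In this model, flushing an empty pool before `toy_pool_free()` changes nothing, whereas flushing after it drops the owner's last reference and removes it from the hash before `toy_pool_flush()` returns, so no sleep or retry is needed.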
@@ -2732,6 +2732,17 @@ repeat:
 
 	spin_unlock_irq(&dev->udev_cmd_threads.cmd_list_lock);
 
+	/*
+	 * Flush again after unjamming. Unjamming calls sgv_pool_free(), which
+	 * caches the SGV object on the pool LRU instead of freeing it directly.
+	 * The pre-unjam flush above misses these objects. Without this second
+	 * flush, dev_user_free_sg_entries() never fires, the alloc_pages
+	 * ucmd_get() ref is never balanced, and the ucmd stays in ucmd_hash
+	 * indefinitely — causing dev_user_process_cleanup() to loop forever.
+	 */
+	sgv_pool_flush(dev->pool);
+	sgv_pool_flush(dev->pool_clust);
+
 	TRACE_EXIT_RES(res);
 	return res;
 }