scst_user: Fix infinite cleanup loop caused by stale SGV pool reference

mirror of https://github.com/SCST-project/scst.git synced 2026-06-09 23:22:33 +00:00

The device cleanup loop in dev_user_process_cleanup() spins at ~2 million
iterations per second and never exits, ultimately triggering a kernel soft
lockup. The previous workaround panicked the system after 10,000
iterations.

Root cause (confirmed by instrumentation):

  A ucmd gets permanently stuck in ucmd_hash with:
    state  = UCMD_STATE_ON_FREE_SKIPPED (7)
    cmd    = NULL
    ref    = 1
    sent_to_user = 0

  The stuck ref=1 is the reference taken by dev_user_alloc_pages() via
  ucmd_get() for the first scatter-gather page. It is released only by
  dev_user_free_sg_entries() → ucmd_put(), which fires when the SGV pool
  *evicts* a cached object. The sequence that prevents this eviction:

  1. dev_user_unjam_dev() finds an EXECING command (sent_to_user=1,
     ref=2: alloc + alloc_pages), bumps ref to 3 via ucmd_get_check(),
     then calls dev_user_unjam_cmd().

  2. dev_user_unjam_cmd() releases cmd_list_lock and calls
     scst_cmd_done(SCST_CONTEXT_THREAD), which synchronously runs the
     full SCST completion pipeline:

       dev_user_on_free_cmd()
         ucmd->cmd = NULL
         ucmd->state = UCMD_STATE_ON_FREE_SKIPPED  (type == IGNORE)
         dev_user_process_reply_on_free()
           dev_user_free_sgv()
             sgv_pool_free(ucmd->sgv)
               /* SGV cached on pool LRU; dev_user_free_sg_entries()
                * not called; alloc_pages ucmd_get() not balanced */
             ucmd->sgv = NULL
           ucmd_put()  ← ref: 3→2

  3. Back in dev_user_unjam_dev(): ucmd_put() ← ref: 2→1.
     ref != 0, so dev_user_free_ucmd() / cmd_remove_hash() are NOT called.
     ucmd remains in ucmd_hash.

  4. dev_user_unjam_cmd() also reset sent_to_user=0, so on every
     subsequent pass through dev_user_unjam_dev() the ucmd is counted
     (res++) but skipped (!sent_to_user → continue).
     dev_user_get_next_cmd() returns -EAGAIN because the ucmd is not in
     ready_cmd_list. With cleanup_done=1, the while(1) loop in
     dev_user_process_cleanup() has no exit condition.
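
The reference counting in steps 1-4 can be traced with a small user-space model. This is only a sketch under stated assumptions: `struct model_ucmd` and every `model_*` function are hypothetical stand-ins invented for illustration, not the kernel structures or functions in scst_user.c, and locking is ignored entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for a ucmd; not the kernel struct. */
struct model_ucmd {
	int ref;            /* reference count */
	bool sent_to_user;  /* command was handed to the user-space handler */
	bool in_hash;       /* still present in the (modelled) ucmd_hash */
	bool sgv_cached;    /* its SGV object is parked on the pool LRU */
};

/* Drop one reference; only the last put removes the ucmd from the hash. */
void model_ucmd_put(struct model_ucmd *u)
{
	assert(u->ref > 0);
	if (--u->ref == 0)
		u->in_hash = false;
}

/* Models the cached-free case: the object is kept, so no put happens. */
void model_sgv_pool_free(struct model_ucmd *u)
{
	u->sgv_cached = true;
}

/* Steps 1-3: unjam one EXECING command; returns the remaining ref count. */
int model_unjam_one(struct model_ucmd *u)
{
	u->ref++;                 /* step 1: extra reference, 2 -> 3 */
	u->sent_to_user = false;  /* reset during unjamming (see step 4) */
	model_sgv_pool_free(u);   /* step 2: SGV cached, page ref still held */
	model_ucmd_put(u);        /* step 2: on-free path, 3 -> 2 */
	model_ucmd_put(u);        /* step 3: back in the caller, 2 -> 1 */
	return u->ref;
}

/* Step 4: a later pass counts the ucmd but cannot make progress on it. */
int model_unjam_pass(const struct model_ucmd *u)
{
	int res = 0;

	if (!u->in_hash)
		return res;
	res++;                    /* counted ... */
	if (!u->sent_to_user)
		return res;           /* ... but skipped */
	return res;               /* a real pass would unjam the command here */
}
```

Starting from ref=2 and sent_to_user=1, `model_unjam_one()` leaves the model at ref=1 with the ucmd still hashed, and every later `model_unjam_pass()` reports one remaining ucmd without changing any state.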

  The sgv_pool_flush() calls at the TOP of dev_user_unjam_dev() run
  BEFORE any commands are unjammed. SGV objects cached during unjamming
  are therefore never flushed; dev_user_free_sg_entries() never fires.
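
The ordering problem within a single pass can be illustrated with a toy cache. This is a minimal sketch, assuming a pool that parks freed objects and drops the owner's reference only on eviction; `struct model_pool`, `model_pool_free()` and `model_pool_flush()` are hypothetical names, and the real SGV pool is considerably more involved.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_CACHE_SLOTS 8

/* Hypothetical toy pool: each cached entry points at its owner's ref count. */
struct model_pool {
	int *cached[MODEL_CACHE_SLOTS];
	size_t count;
};

/* Models a cached free: park the object, run no callback, drop no ref. */
void model_pool_free(struct model_pool *p, int *owner_ref)
{
	assert(p->count < MODEL_CACHE_SLOTS);
	p->cached[p->count++] = owner_ref;
}

/* Models a flush: evict every cached object and drop one owner ref each. */
size_t model_pool_flush(struct model_pool *p)
{
	size_t evicted = p->count;

	while (p->count > 0) {
		int *owner_ref = p->cached[--p->count];

		(*owner_ref)--;  /* eviction callback: the balancing put */
	}
	return evicted;
}
```

In this model, a flush issued before `model_pool_free()` finds an empty cache and evicts nothing, so the owner's reference is untouched; only a flush issued after the free drops it.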

Fix:

  Add sgv_pool_flush() for both pools at the BOTTOM of
  dev_user_unjam_dev(), after the spinlock is released. This evicts
  all SGV objects cached during unjamming, triggering:
    dev_user_free_sg_entries() → ucmd_put() → dev_user_free_ucmd()
      → cmd_remove_hash()
  removing the stuck ucmd from the hash. On the next cleanup-loop iteration
  dev_user_unjam_dev() returns res=0 and dev_user_process_cleanup() breaks.

  sgv_pool_flush() is fully synchronous (calls sgv_dtor_and_free() inline);
  by the time it returns the callbacks have already fired and the ucmd has
  already been removed from the hash. No schedule() or sleep is needed.
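
The synchronous callback chain described above can be sketched in user space as follows. All names (`struct model_ucmd`, `struct model_sgv_obj`, the `model_*` functions) are hypothetical and exist only for this illustration; the sketch assumes the eviction callback runs inline, as the commit message states for sgv_pool_flush(), and it ignores locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a ucmd tracked in a hash. */
struct model_ucmd {
	int ref;
	bool in_hash;
};

/* Hypothetical cached SGV object carrying an eviction callback. */
struct model_sgv_obj {
	void (*on_evict)(void *priv);
	void *priv;
	bool cached;
};

/* The last put frees the ucmd, which in this model just unhashes it. */
void model_ucmd_put(struct model_ucmd *u)
{
	assert(u->ref > 0);
	if (--u->ref == 0)
		u->in_hash = false;
}

/* Stand-in for the free-sg-entries callback: balances the page ref. */
void model_free_sg_entries(void *priv)
{
	model_ucmd_put(priv);
}

/* Synchronous flush: every callback has run by the time this returns. */
void model_pool_flush(struct model_sgv_obj *objs, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (objs[i].cached) {
			objs[i].cached = false;
			objs[i].on_evict(objs[i].priv);
		}
	}
}

/* What a later unjam pass would count: ucmds still in the hash. */
int model_count_hashed(const struct model_ucmd *ucmds, size_t n)
{
	int res = 0;

	for (size_t i = 0; i < n; i++)
		if (ucmds[i].in_hash)
			res++;
	return res;
}
```

With one ucmd at ref=1 and its SGV object cached, the count is 1 before `model_pool_flush()` and 0 immediately after it returns, with no sleep or rescheduling in between.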

Author: tashen
Date: 2026-06-05 15:11:27 +08:00
Committed by: Gleb Chesnokov
Parent: 3111277776
Commit: 83745c0a2d

1 changed file with 11 additions and 0 deletions

									
scst/src/dev_handlers/scst_user.c (+11)
@@ -2732,6 +2732,17 @@ repeat:
	spin_unlock_irq(&dev->udev_cmd_threads.cmd_list_lock);

	/*
	 * Flush again after unjamming. Unjamming calls sgv_pool_free(), which
	 * caches the SGV object on the pool LRU instead of freeing it directly.
	 * The pre-unjam flush above misses these objects. Without this second
	 * flush, dev_user_free_sg_entries() never fires, the alloc_pages
	 * ucmd_get() ref is never balanced, and the ucmd stays in ucmd_hash
	 * indefinitely — causing dev_user_process_cleanup() to loop forever.
	 */
	sgv_pool_flush(dev->pool);
	sgv_pool_flush(dev->pool_clust);

	TRACE_EXIT_RES(res);
	return res;
}