The symptom of the crash is that one finds the system deadlocked
spinning on scsi_qla_host_t.hardware_lock in qla2x00_enable_tgt_mode
with a stack something like this:
crash> bt
PID: 6155 TASK: ffff88006e4bc3c0 CPU: 1 COMMAND: "scst_uid"
#0 [ffff88007b915b28] machine_kexec at ffffffff8103163b
#1 [ffff88007b915b88] crash_kexec at ffffffff810b8e52
#2 [ffff88007b915c58] panic at ffffffff814ed0ab
#3 [ffff88007b915cd8] spin_bug at ffffffff8127cd46
#4 [ffff88007b915d18] _raw_spin_lock at ffffffff8127d015
#5 [ffff88007b915d68] _spin_lock_irqsave at ffffffff814f02e4
#6 [ffff88007b915d88] qla2x00_enable_tgt_mode at ffffffffa047b672 [qla2xxx]
#7 [ffff88007b915db8] q2t_host_action at ffffffffa06db6a6 [qla2x00tgt]
#8 [ffff88007b915df8] q2t_enable_tgt at ffffffffa06db6ea [qla2x00tgt]
#9 [ffff88007b915e18] scst_process_tgt_enable_store at ffffffffa04f102e [scst]
#10 [ffff88007b915e48] scst_tgt_enable_store_work_fn at ffffffffa04f1176 [scst]
#11 [ffff88007b915e58] scst_process_sysfs_works at ffffffffa04e8bbe [scst]
#12 [ffff88007b915e78] sysfs_work_thread_fn at ffffffffa04e8db5 [scst]
#13 [ffff88007b915ed8] kthread at ffffffff8108f976
#14 [ffff88007b915f48] kernel_thread at ffffffff8100c20a
I was pulling my hair out on this one, because with the spinlock
debugging (enhanced to capture the PID along with the task pointer), I
figured out that the task (and process) that originally locked the lock
was gone! It got really confusing when I added more spinlock debug code
to the kernel to detect locks held in the task switching and
task/process termination paths -- and didn't catch anything terminating
with locks held!
I finally tracked the problem down to two things:
1. When qla24xx_create_vhost creates a new virtual scsi_qla_host_t it
does it by copying the physical (aka parent) scsi_qla_host_t. Under the
right unlucky conditions, this can happen with the hardware_lock held
(the spinlock is embedded in the structure).
2. The code should only be locking the hardware_lock of the physical
scsi_qla_host_t, because the lock is associated with the hardware.
Unfortunately, quite a few places are not using to_qla_parent to make
sure they lock the correct lock. One of those places is
qla2x00_enable_tgt_mode. Along with the deadlock, this has the
potential to leave the hardware and driver structures in unpredictable
states, because the lock isn't always providing serialization.
The fix entails two things:
1. Zeroing the lock after copying the scsi_qla_host_t structure: This
won't stop the deadlock, but will enable the spinlock debug code to
easily catch anything that misbehaves and locks the wrong lock. I also
initialized the other locks because they could have the same problem. I
also initialized the list heads, because they could end up holding
dangling references. I did not initialize all pointers, because there
are quite a few that point to read only data and are OK (and I didn't
have time to research all of them).
2. Using to_qla_parent everywhere when locking and the scsi_qla_host_t
structure might be virtual. This is a lot of changes, but they are the
same thing over and over again.
I did not make an effort to look for scalar or pointer fields that are
being picked from the wrong structure. That's getting to be as much
pain as merging up to the latest QLogic driver (which would have gotten
rid of this problem).
From "Robinson, Herbie" <Herbie.Robinson@stratus.com>
git-svn-id: http://svn.code.sf.net/p/scst/svn/trunk@4420 d57e44dd-8a1f-0410-8b47-8ef2f437770f
The attached patch fixes compilation warnings for the qla2x00t driver
when compiled by RHEL 6.1, kernel version 2.6.32-131.0.15.el6. I
believe this is the second kernel version for RHEL 6.1.
From "Robinson, Herbie" <Herbie.Robinson@stratus.com>
git-svn-id: http://svn.code.sf.net/p/scst/svn/trunk@4419 d57e44dd-8a1f-0410-8b47-8ef2f437770f
The fix for the locking bug I found in q2t_pre_xmit_response has already
been added, but I had some comments in my copy that couldn't hurt.
From "Robinson, Herbie" <Herbie.Robinson@stratus.com>
git-svn-id: http://svn.code.sf.net/p/scst/svn/trunk@4418 d57e44dd-8a1f-0410-8b47-8ef2f437770f
_syswrite() callers check whether writing into an SCST sysfs attribute
succeeded by checking whether the value returned by this function is
defined(). Return undef (failed) instead of -1 (success) if syswrite()
returned error code EBUSY (found this through source reading). Return
$length instead of length($cmd) if the initial write triggered EAGAIN.
Remove a superfluous $bytes = undef statement. Eliminate the variable
$wait.
git-svn-id: http://svn.code.sf.net/p/scst/svn/trunk@4400 d57e44dd-8a1f-0410-8b47-8ef2f437770f
SCST allows SCSI pass-through to SCSI devices registered with the SCSI
initiator core. SCST I/O activity has to be suspended before a LUN
that corresponds to such a SCSI device can be removed safely. Do not
suspend I/O when a SCSI device is being removed that has not been
exported as an SCST LUN such that removal of such devices does not
slow down I/O.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
git-svn-id: http://svn.code.sf.net/p/scst/svn/trunk@4399 d57e44dd-8a1f-0410-8b47-8ef2f437770f
Hash signs have to be specified in scst.conf when using the per_portal_acl=1
feature of iSCSI-SCST in combination with LUN masking. Make scstadmin not
consider hash signs preceded by a backslash as the start of a comment. Escape
backslashes and hash signs when writing out scst.conf.
git-svn-id: http://svn.code.sf.net/p/scst/svn/trunk@4391 d57e44dd-8a1f-0410-8b47-8ef2f437770f
The attached patch prevents that a kernel oops is triggered if something
fails when adding a session to sysfs.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
git-svn-id: http://svn.code.sf.net/p/scst/svn/trunk@4388 d57e44dd-8a1f-0410-8b47-8ef2f437770f
Although I do not know of any initiator that asks for the extended
parameter data format when submitting a REPORT TARGET PORT GROUPS
command, I think it's a good idea to support that data format (i.e.
three most significant bits of the second CDB byte are 001 instead
of 000). The attached patch should implement that. Additionally,
cmd->op_name is set to something more descriptive than
"MAINTENANCE IN" in that patch.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
git-svn-id: http://svn.code.sf.net/p/scst/svn/trunk@4385 d57e44dd-8a1f-0410-8b47-8ef2f437770f
The value of the variable allocation_length is identical to the value of the
variable buf_len. Hence eliminate the former. Also, for MAINTENANCE(IN)
cdb_len is always equal to 12 so the test "cdb_len > 1" can be eliminated too.
git-svn-id: http://svn.code.sf.net/p/scst/svn/trunk@4382 d57e44dd-8a1f-0410-8b47-8ef2f437770f