The storage was not draining I/Os and the work load was not spread out
across different CPUs evenly. This led to firmware resource counters
getting overrun on the busy CPU. This overrun prevented error recovery from
happening in a timely manner.
By switching the counter to atomic, it allows the count to be little more
accurate to prevent the overrun.
Cc: stable@vger.kernel.org
Fixes: da7c21b72aa8 ("scsi: qla2xxx: Fix command flush during TMF")
Signed-off-by: Quinn Tran <qutran@marvell.com>
Signed-off-by: Nilesh Javali <njavali@marvell.com>
Link: https://lore.kernel.org/r/20230821130045.34850-4-njavali@marvell.com
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
[ commit e370b64c7db9 upstream ]
Fix race condition between Interrupt thread and Chip reset thread in trying
to flush the same mailbox. With the race condition, the "ha->mbx_intr_comp"
will get an extra complete() call. The extra complete call create erroneous
mailbox timeout condition when the next mailbox is sent where the mailbox
call does not wait for interrupt to arrive. Instead, it advances without
waiting.
Add lock protection around the check for mailbox completion.
Cc: stable@vger.kernel.org
Fixes: b2000805a975 ("scsi: qla2xxx: Flush mailbox commands on chip reset")
Signed-off-by: Quinn Tran <quinn.tran@marvell.com>
Signed-off-by: Nilesh Javali <njavali@marvell.com>
Link: https://lore.kernel.org/r/20230821130045.34850-3-njavali@marvell.com
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
[ commit 6d0b65569c0a upstream ]
System crashes when a 32-byte CDB was sent to a non T10 PI disk:
[ 177.143279] ? qla2xxx_dif_start_scsi_mq+0xcd8/0xce0 [qla2xxx]
[ 177.149165] ? internal_add_timer+0x42/0x70
[ 177.153372] qla2xxx_mqueuecommand+0x207/0x2b0 [qla2xxx]
[ 177.158730] scsi_queue_rq+0x2b7/0xc00
[ 177.162501] blk_mq_dispatch_rq_list+0x3ea/0x7e0
Current code attempted to use CRC IOCB to send the command but failed.
Instead, type 6 IOCB should be used to send the I/O.
Clone existing type 6 IOCB code with addition of MQ support to allow
32-byte CDBs to go through.
Signed-off-by: Quinn Tran <qutran@marvell.com>
Cc: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Nilesh Javali <njavali@marvell.com>
Link: https://lore.kernel.org/r/20230817063132.21900-3-njavali@marvell.com
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
[ commit ae25f65a351c upstream ]
Connection does not resume after a host reset / chip reset. The cause of
the blockage is due to the FCF_ASYNC_ACTIVE left on. The gnl command was
interrupted by the chip reset. On exiting the command, this flag should be
turn off to allow relogin to reoccur. Clear this flag to prevent blockage.
Cc: stable@vger.kernel.org
Fixes: 17e64648aa47 ("scsi: qla2xxx: Correct fcport flags handling")
Signed-off-by: Quinn Tran <qutran@marvell.com>
Signed-off-by: Nilesh Javali <njavali@marvell.com>
Link: https://lore.kernel.org/r/20230714070104.40052-7-njavali@marvell.com
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
[ commit 39d22740712c upstream ]
Link up failure occurred where driver failed to see certain events from FW
indicating link up (AEN 8011) and fabric login completion (AEN 8014).
Without these 2 events, driver would not proceed forward to scan the
fabric. The cause of this is due to delay in the receive of interrupt for
Mailbox 60 that causes qla to set the fw_started flag late. The late
setting of this flag causes other interrupts to be dropped. These dropped
interrupts happen to be the link up (AEN 8011) and fabric login completion
(AEN 8014).
Set fw_started flag early to prevent interrupts being dropped.
Cc: stable@vger.kernel.org
Signed-off-by: Quinn Tran <qutran@marvell.com>
Signed-off-by: Nilesh Javali <njavali@marvell.com>
Link: https://lore.kernel.org/r/20230714070104.40052-6-njavali@marvell.com
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
[ commit 5b51f35d127e upstream ]
For each TMF request, driver iterates through each qpair and flushes
commands associated to the TMF. At the end of the qpair flush, a Marker is
used to complete the flush transaction. This process was repeated for each
qpair. The multiple flush and marker for this TMF request seems to cause
confusion for FW.
Instead, 1 flush is sent to FW. Driver would wait for FW to go through all
the I/Os on each qpair to be read then return. Driver then closes out the
transaction with a Marker.
Cc: stable@vger.kernel.org
Fixes: d90171dd0da5 ("scsi: qla2xxx: Multi-que support for TMF")
Signed-off-by: Quinn Tran <qutran@marvell.com>
Signed-off-by: Nilesh Javali <njavali@marvell.com>
Link: https://lore.kernel.org/r/20230714070104.40052-5-njavali@marvell.com
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
[ commit da7c21b72aa8 upstream ]
During NVMe queue creation, a new qpair is created. FW resource limit needs
to be re-adjusted to take into account the new qpair. Otherwise, NVMe
command can not go through. This issue was discovered while
testing/forcing FW execution to fail at load time.
Add call to readjust IOCB and exchange limit.
In addition, get FW state command and require FW to be running. Otherwise,
error is generated.
Cc: stable@vger.kernel.org
Signed-off-by: Quinn Tran <qutran@marvell.com>
Signed-off-by: Nilesh Javali <njavali@marvell.com>
Link: https://lore.kernel.org/r/20230714070104.40052-3-njavali@marvell.com
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
[ commit efa74a62aaa2 upstream ]
System crash when using debug kernel due to link list corruption. The cause
of the link list corruption is due to session deletion was allowed to queue
up twice. Here's the internal trace that show the same port was allowed to
double queue for deletion on different cpu.
20808683956 015 qla2xxx [0000:13:00.1]-e801:4: Scheduling sess ffff93ebf9306800 for deletion 50:06:0e:80:12:48:ff:50 fc4_type 1
20808683957 027 qla2xxx [0000:13:00.1]-e801:4: Scheduling sess ffff93ebf9306800 for deletion 50:06:0e:80:12:48:ff:50 fc4_type 1
Move the clearing/setting of deleted flag lock.
Cc: stable@vger.kernel.org
Fixes: 726b85487067 ("qla2xxx: Add framework for async fabric discovery")
Signed-off-by: Quinn Tran <qutran@marvell.com>
Signed-off-by: Nilesh Javali <njavali@marvell.com>
Link: https://lore.kernel.org/r/20230714070104.40052-2-njavali@marvell.com
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
[ commit 6dfe4344c168 upstream ]
See also commits:
- 7d1afe8a ("qla2x00t-32gbit: Drop redundant
pci_enable_pcie_error_reporting()").
- 596800ce ("qla2x00t-32gbit: Backport to older kernel versions").
This patch refactors scst_put_acg() to directly call kref_put() with a
release callback that handles removal of scst_acg from the workqueue.
This replaces the previous approach of queuing a work item for each
kref_put() call, improving efficiency by reducing the number of queued
work items.
According to kernel documentation, this flag should be set if the
workqueue will be involved in the kernel's memory reclamation flow.
Since it is not, there is no need for the driver's WQ to have this
flag set so remove it.
create_workqueue() was replaced with alloc_workqueue() with max_active set
to 0. However, the original create_workqueue() implicitly set max_active
to 1.
This change has led to unexpected bugs because previously, work items
could only be executed one by one. With the change, they can now be
executed simultaneously.
This patch fixes the issue by restoring max_active to 1.
Fixes: f4686e9102 ("scst: Remove deprecated create_workqueue()")
Fixes: https://github.com/SCST-project/scst/issues/179
Smatch and Clang both complain that LOGIN_TEMPLATE_SIZE is more than
sizeof(ha->plogi_els_payld.fl_csp).
Smatch warning:
drivers/scsi/qla2xxx/qla_iocb.c:3075 qla24xx_els_dcmd2_iocb()
warn: '&ha->plogi_els_payld.fl_csp' sometimes too small '16' size = 112
Clang warning:
include/linux/fortify-string.h:592:4: error: call to
'__read_overflow2_field' declared with 'warning' attribute: detected
read beyond size of field (2nd parameter); maybe use struct_group()?
[-Werror,-Wattribute-warning]
__read_overflow2_field(q_size_field, size);
When I was reading this code I assumed the "- 4" meant that we were
skipping the last 4 bytes but actually it turned out that we are
skipping the first four bytes.
I have re-written it remove the magic numbers, be more clear and
silence the static checker warnings.
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Link: https://lore.kernel.org/r/4aa0485e-766f-4b02-8d5d-c6781ea8f511@moroto.mountain
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
[ commit 134f66959cd0b upstream ]
Unbreak the build for the previous commit against kernel versions
before v5.16.
See also commit 3080ea5553cc ("stddef: Introduce DECLARE_FLEX_ARRAY()
helper") # v5.16.
One-element arrays as fake flex arrays are deprecated and we are moving
towards adopting C99 flexible-array members, instead. So, replace
one-element array declaration in struct ct_sns_gpnft_rsp, which is
ultimately being used inside a union:
drivers/scsi/qla2xxx/qla_def.h:
3240 struct ct_sns_gpnft_pkt {
3241 union {
3242 struct ct_sns_req req;
3243 struct ct_sns_gpnft_rsp rsp;
3244 } p;
3245 };
Refactor the rest of the code, accordingly.
This issue was found with the help of Coccinelle.
Link: https://github.com/KSPP/linux/issues/245
Link: https://github.com/KSPP/linux/issues/193
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Link: https://lore.kernel.org/r/ZH+/rZ1R1cBjIxjS@work
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
[ commit 512a365368c7 upstream ]
System crash due to use after free.
Current code allows terminate_rport_io to exit before making
sure all IOs has returned. For FCP-2 device, IO's can hang
on in HW because driver has not tear down the session in FW at
first sign of cable pull. When dev_loss_tmo timer pops,
terminate_rport_io is called and upper layer is about to
free various resources. Terminate_rport_io trigger qla to do
the final cleanup, but the cleanup might not be fast enough where it
leave qla still holding on to the same resource.
Wait for IO's to return to upper layer before resources are freed.
Cc: stable@vger.kernel.org
Signed-off-by: Quinn Tran <qutran@marvell.com>
Signed-off-by: Nilesh Javali <njavali@marvell.com>
Link: https://lore.kernel.org/r/20230428075339.32551-7-njavali@marvell.com
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
[ commit fc0cba0c7be8 upstream ]
System crash, where driver is accessing scsi layer's
memory (scsi_cmnd->device->host) to search for a well known internal
pointer (vha). The scsi_cmnd was released back to upper layer which
could be freed, but the driver is still accessing it.
7 [ffffa8e8d2c3f8d0] page_fault at ffffffff86c010fe
[exception RIP: __qla2x00_eh_wait_for_pending_commands+240]
RIP: ffffffffc0642350 RSP: ffffa8e8d2c3f988 RFLAGS: 00010286
RAX: 0000000000000165 RBX: 0000000000000002 RCX: 00000000000036d8
RDX: 0000000000000000 RSI: ffff9c5c56535188 RDI: 0000000000000286
RBP: ffff9c5bf7aa4a58 R8: ffff9c589aecdb70 R9: 00000000000003d1
R10: 0000000000000001 R11: 0000000000380000 R12: ffff9c5c5392bc78
R13: ffff9c57044ff5c0 R14: ffff9c56b5a3aa00 R15: 00000000000006db
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
8 [ffffa8e8d2c3f9c8] qla2x00_eh_wait_for_pending_commands at ffffffffc0646dd5 [qla2xxx]
9 [ffffa8e8d2c3fa00] __qla2x00_async_tm_cmd at ffffffffc0658094 [qla2xxx]
Remove access of freed memory. Currently the driver was checking to see if
scsi_done was called by seeing if the sp->type has changed. Instead,
check to see if the command has left the oustanding_cmds[] array as
sign of scsi_done was called.
Cc: stable@vger.kernel.org
Signed-off-by: Quinn Tran <qutran@marvell.com>
Signed-off-by: Nilesh Javali <njavali@marvell.com>
Link: https://lore.kernel.org/r/20230428075339.32551-6-njavali@marvell.com
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
[ commit b843adde8d49 upstream ]
strlcpy() reads the entire source buffer first. This read may exceed the
destination size limit. This is both inefficient and can lead to linear
read overflows if a source string is not NULL-terminated [1].
[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy