SCSI RDMA Protocol (SRP) Target driver for Linux
================================================

The SRP target driver has been designed to work on top of the Linux
InfiniBand kernel drivers -- either the InfiniBand drivers included
with a Linux distribution or the OFED InfiniBand drivers. For more
information about using the SRP target driver in combination with
OFED, see also README.ofed.

The SRP target driver has been implemented as an SCST driver. This
makes it possible to support many I/O modes on real and virtual
devices. A few examples of supported device handlers are:

1. scst_disk. This device handler implements transparent pass-through
   of SCSI commands and allows SRP to access and export real SCSI
   devices, e.g. disks, hardware RAID volumes and tape libraries, as
   SRP LUNs.

2. scst_vdisk, either in fileio or in blockio mode. This device handler
   makes it possible to export software RAID volumes, LVM volumes, IDE
   disks, and normal files as SRP LUNs.

3. nullio. The nullio device handler makes it possible to measure the
   performance of the SRP target implementation without performing any
   actual I/O.


Installation
------------

Proceed as follows to compile and install the SRP target driver:

1. To minimize QUEUE_FULL conditions, apply the
   scst_increase_max_tgt_cmds patch as follows:

    cd ${SCST_DIR}
    patch -p0 < srpt/patches/scst_increase_max_tgt_cmds.patch

   This patch increases SCST's per-device queue size from 48 to 64.
   This helps to avoid QUEUE_FULL conditions because the size of the
   transmit queue in the Linux SRP initiator is also 64.

   Note: the SCSI layer of kernel 2.6.33 will have dynamic queue depth
   adjustment. When using SRP initiator systems with kernel 2.6.33 or
   later, this patch is less important.

2. Now compile and install SRPT:

    cd ${SCST_DIR}
    make -s scst_clean scst scst_install
    make -s srpt_clean srpt srpt_install
    make -s scstadm scstadm_install

3. Edit the installed file /etc/init.d/scst and add ib_srpt to the
   SCST_MODULES variable.

4. Configure SCST such that it will be started during system boot:

    chkconfig scst on

The ib_srpt kernel module supports the following parameters:

* srp_max_message_size (unsigned integer)
  Maximum size of an SRP control message in bytes. Examples of SRP
  control messages are: login request, logout request, data transfer
  request, ... The larger this parameter, the more scatter/gather list
  elements can be sent at once. Use the following formula to compute
  an appropriate value for this parameter:
  68 + 16 * (max_sg_elem_count). The default value of this parameter
  is 2116, which corresponds to an sg list with 128 elements.

* srp_max_rdma_size (unsigned integer)
  Maximum number of bytes that may be transferred at once via RDMA.
  Defaults to 65536 bytes, which is sufficient to use the full
  bandwidth of low-latency HCAs such as Mellanox's ConnectX series.
  Increasing this value may decrease latency for applications that
  transfer large amounts of data at once via direct I/O.

* thread (0 or 1)
  Whether incoming SRP requests will be processed in the IB interrupt
  that was triggered by the request (thread=0) or in the context of a
  separate thread (thread=1). The choice thread=0 results in the best
  performance, while thread=1 makes debugging easier. If a kernel oops
  is triggered inside an interrupt handler, the system will be halted.
  As a result, the call trace associated with the kernel oops will not
  be written to the kernel log in /var/log/messages. When using
  thread=1, however, the SRPT code runs in thread context. Any kernel
  oops generated in thread context will cause the offending thread to
  be killed. Other threads will keep running and call traces will be
  written to the on-disk kernel log.

* trace_flag (unsigned integer, only available in debug builds)
  The individual bits of the trace_flag parameter define which
  categories of trace messages are sent to the kernel log and which
  are not.
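
As a worked example of the srp_max_message_size formula given above,
the following shell sketch computes the parameter value for a chosen
sg list length. The helper name srp_max_message_size_for and the
element count 255 are illustrative assumptions and not part of the
driver; whether the driver accepts a particular value is not covered
here.

```shell
#!/bin/sh
# Hypothetical helper: apply the formula from this README,
#   68 + 16 * max_sg_elem_count
# to the number of sg elements passed as the first argument.
srp_max_message_size_for() {
    echo $((68 + 16 * $1))
}

srp_max_message_size_for 128   # 68 + 2048 = 2116, the default value
srp_max_message_size_for 255   # 68 + 4080 = 4148 (illustrative count)
```

The result can then be passed to the module at load time, for
instance "modprobe ib_srpt srp_max_message_size=2116".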

Configuring the SRP Target System
---------------------------------

First of all, create the file /etc/scst.conf. The example below shows
how to create this file using the scstadmin tool:

    /etc/init.d/scst stop
    /etc/init.d/scst start
    scstadmin -ClearConfig /etc/scst.conf
    scstadmin -adddev disk01 -path /dev/ram0 -handler vdisk -options NV_CACHE
    scstadmin -adddev disk02 -path /dev/ram1 -handler vdisk -options NV_CACHE
    scstadmin -assigndev disk01 -group Default -lun 0
    scstadmin -assigndev disk02 -group Default -lun 1
    scstadmin -assigndev 4:0:0:0 -group Default -lun 2
    scstadmin -WriteConfig /etc/scst.conf
    cat /etc/scst.conf

Now load the new configuration:

    /etc/init.d/scst reload


Configuring the SRP Initiator System
------------------------------------

First of all, load the SRP kernel module as follows:

    modprobe ib_srp

Next, discover the new SRP target by running the ibsrpdm command:

    ibsrpdm -c

Now let the initiator system log in to the target system. In the
command below, ${SRP_HCA_NAME} stands for the name of a directory
under /sys/class/infiniband_srp:

    ibsrpdm -c | while read target_info; do echo "${target_info}" > /sys/class/infiniband_srp/${SRP_HCA_NAME}/add_target; done

Finally, run lsscsi to display the details of the newly discovered
SCSI disks:

    lsscsi

SRP targets can be recognized in the output of lsscsi by looking for
the disk names assigned on the SCST target ("disk01" in the example
below):

    [8:0:0:0]    disk    SCST_FIO disk01    102    /dev/sdb

Notes:
* You can edit /etc/infiniband/openib.conf to load the SRP driver and
  the SRP HA daemon automatically, i.e. set SRP_LOAD=yes and
  SRPHA_ENABLE=yes.
* To set up and use the high availability feature you need the
  dm-multipath driver and the multipath tool.
* Please refer to the OFED-1.x user manual for more detailed
  instructions on how to enable and use the HA feature. See e.g.
  http://www.mellanox.com/related-docs/prod_software/Mellanox_OFED_Linux_user_manual_1_40_1.pdf.
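
The login step of the initiator configuration above can also be
wrapped in a small function that checks whether the add_target file is
writable before using it. This is only a sketch: srp_login is a
hypothetical helper name, and the error messages are illustrative.

```shell
#!/bin/sh
# Hypothetical helper: read target descriptions from standard input,
# one per line as printed by "ibsrpdm -c", and write each of them to
# the add_target file passed as the first argument.
srp_login() {
    add_target="$1"
    if [ ! -w "${add_target}" ]; then
        echo "cannot write to ${add_target}; is ib_srp loaded?" >&2
        return 1
    fi
    while read -r target_info; do
        # A failed write is reported, and the loop continues with
        # the next target.
        echo "${target_info}" > "${add_target}" ||
            echo "login failed for: ${target_info}" >&2
    done
}

# Usage, with SRP_HCA_NAME set as described in this README:
#   ibsrpdm -c | srp_login "/sys/class/infiniband_srp/${SRP_HCA_NAME}/add_target"
```

Passing the add_target path as an argument keeps the HCA selection
explicit, which matters on systems with more than one HCA or port.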

Performance Notes - Initiator Side
----------------------------------

* For latency-sensitive applications, using the noop scheduler on the
  initiator side can give significantly better results than other
  schedulers.
* The following parameters have a small but measurable impact on SRP
  performance:
  * /sys/class/block/${dev}/queue/rq_affinity
  * /proc/irq/${ib_int_no}/smp_affinity


Performance Notes - Target Side
-------------------------------

* In some cases, for instance when working with SSD devices whose
  internal data transfer threads consume 100% of a single CPU, it may
  be necessary to assign dedicated CPUs to those threads using the
  Linux CPU affinity facilities in order to maximize IOPS. No IRQ
  processing should be done on those CPUs; check this using
  /proc/interrupts. See the taskset command and
  Documentation/IRQ-affinity.txt in your kernel's source tree for how
  to assign CPU affinity to tasks and IRQs.

  The reason is that incoming commands can be processed in soft IRQ
  (SIRQ) context on the same CPUs on which the SSD devices' threads
  perform data transfers. As a result, those threads do not get all
  the CPU time they need and perform worse.

  As an alternative to CPU affinity assignment, you can try to enable
  the SRP target driver's internal thread. This allows the Linux CPU
  scheduler to better distribute load among the available CPUs. To
  enable the SRP target driver's internal thread, load the ib_srpt
  module with the parameter "thread=1".


Send questions about this driver to scst-devel@lists.sourceforge.net,
CC: Vu Pham and Bart Van Assche.