SCSI RDMA Protocol (SRP) Target driver for Linux
=================================================

The SRP target driver has been designed to work on top of the Linux
InfiniBand kernel drivers: either the InfiniBand drivers included with a
Linux distribution or the OFED InfiniBand drivers. For more information
about using the SRP target driver in combination with OFED, see also
README.ofed.

The SRP target driver has been implemented as an SCST driver. This makes
it possible to support many I/O modes on real and virtual devices. A few
examples of supported device handlers are:

1. scst_disk. This device handler implements transparent pass-through
   of SCSI commands and allows SRP to access and to export real SCSI
   devices, i.e. disks, hardware RAID volumes, tape libraries as
   SRP LUNs.

2. scst_vdisk, either in fileio or in blockio mode. This device handler
   allows exporting software RAID volumes, LVM volumes, IDE disks, and
   normal files as SRP LUNs.

3. nullio. The nullio device handler allows measuring the performance
   of the SRP target implementation without performing any actual I/O.

Installation
------------

Proceed as follows to compile and install the SRP target driver:

1. The SRP initiator (ib_srp) included with Linux kernel 2.6.36 and
   before frequently makes ib_srpt send BUSY responses, which hurts
   performance. This can be avoided by making SCST's SCSI command queue
   size identical to that of the initiator by applying the
   scst_increase_max_tgt_cmds patch:

   cd ${SCST_DIR}
   patch -p0 < srpt/patches/scst_increase_max_tgt_cmds.patch

   This patch increases SCST's per-device queue size from 48 to 64. This
   helps to avoid BUSY conditions because the size of the transmit queue
   in the Linux SRP initiator is also 64.

   Note: avoiding BUSY conditions is also possible by limiting the
   number of outstanding requests on the initiator. This is possible
   either by setting nr_requests low enough or by enabling the dynamic
   queue depth adjustment feature.
   Dynamic queue depth adjustment is available from kernel version
   2.6.33 on. See also scst/README for more information.

2. Now compile and install SRPT:

   cd ${SCST_DIR}
   make -s scst_clean scst scst_install
   make -s srpt_clean srpt srpt_install
   make -s scstadm scstadm_install

3. Edit the installed file /etc/init.d/scst and add ib_srpt to the
   SCST_MODULES variable.

4. Configure SCST such that it will be started during system boot:

   chkconfig scst on

The ib_srpt kernel module supports the following parameters:

* srp_max_message_size (number)
  Maximum size of an SRP control message in bytes. Examples of SRP
  control messages are: login request, logout request, data transfer
  request, ... The larger this parameter, the more scatter/gather list
  elements can be sent at once. Use the following formula to compute an
  appropriate value for this parameter: 68 + 16 * (sg_tablesize). The
  default value of this parameter is 2116, which corresponds to an sg
  table size of 128.
* srp_max_rdma_size (number)
  Maximum number of bytes that may be transferred at once via RDMA.
  Defaults to 65536 bytes, which is sufficient to use the full bandwidth
  of low-latency HCAs. Increasing this value may decrease latency for
  applications transferring large amounts of data at once.
* srpt_autodetect_cred_req (y or n, default n)
  Whether or not to autodetect initiator support for SRP_CRED_REQ
  (initiators with Linux kernel 2.6.37 or later only). The use of
  SRP_CRED_REQ allows ib_srpt to process workloads with large I/O depths
  more efficiently. Note: enabling this mode causes the Windows SRP
  initiator to stop working.
* srpt_srq_size (number, default 4095)
  ib_srpt uses a shared receive queue (SRQ) for processing incoming SRP
  requests. This number may have to be increased when a large number of
  initiator systems is accessing a single SRP target system.
* thread (0, 1 or 2, default 1)
  Defines the context on which SRP requests are processed:
  * thread=0: do as much processing in IRQ context as possible.
    Results in lower latency than the other two modes but may trigger
    soft lockup complaints when multiple initiators are simultaneously
    processing workloads with large I/O depths. Scalability of this mode
    is limited: it exploits only a fraction of the power available on
    multiprocessor systems.
  * thread=1: dedicates one kernel thread per initiator. Scales well on
    multiprocessor systems. This is the recommended mode when multiple
    initiator systems are accessing the same target system
    simultaneously.
  * thread=2: makes one CPU process all IB completions and defer further
    processing to kernel thread context. Scales better than mode
    thread=0 but not as well as mode thread=1. May trigger soft lockup
    complaints when multiple initiators are simultaneously processing
    workloads with large I/O depths.
* trace_flag (unsigned integer, only available in debug builds)
  The individual bits of the trace_flag parameter define which
  categories of trace messages are sent to the kernel log and which are
  not.

Configuring the SRP Target System
---------------------------------

First of all, create the file /etc/scst.conf.
Below you can find an example of how you can create this file using the
scstadmin tool:

   /etc/init.d/scst stop
   /etc/init.d/scst start
   scstadmin -ClearConfig /etc/scst.conf
   scstadmin -adddev disk01 -path /dev/ram0 -handler vdisk -options NV_CACHE
   scstadmin -adddev disk02 -path /dev/ram1 -handler vdisk -options NV_CACHE
   scstadmin -assigndev disk01 -group Default -lun 0
   scstadmin -assigndev disk02 -group Default -lun 1
   scstadmin -assigndev 4:0:0:0 -group Default -lun 2
   scstadmin -WriteConfig /etc/scst.conf
   cat /etc/scst.conf

Now load the new configuration:

   /etc/init.d/scst reload

Configuring the SRP Initiator System
------------------------------------

First of all, load the SRP kernel module as follows:

   modprobe ib_srp

Next, discover the new SRP target by running the ibsrpdm command:

   ibsrpdm -c

Now let the initiator system log in to the target system:

   ibsrpdm -c | while read target_info; do
       echo "${target_info}" > /sys/class/infiniband_srp/${SRP_HCA_NAME}/add_target
   done

Finally run lsscsi to display the details of the newly discovered SCSI
disks:

   lsscsi

SRP targets can be recognized in the output of lsscsi by looking for the
disk names assigned on the SCST target ("disk01" in the example below):

   [8:0:0:0]    disk    SCST_FIO disk01    102    /dev/sdb

High availability
-----------------

If there are redundant paths in the IB network between initiator and
target, automatic path failover can be set up on the initiator as
follows:
* Edit /etc/infiniband/openib.conf to load the SRP driver and SRP HA
  daemon automatically: set SRP_LOAD=yes and SRPHA_ENABLE=yes.
* To set up and use the high availability feature you need the
  dm-multipath driver and multipath tool.
* Please refer to the OFED-1.x user manual for more detailed
  instructions on how to enable and how to use the HA feature. See e.g.
  http://www.mellanox.com/related-docs/prod_software/Mellanox_OFED%20_Linux_user_manual_1_5_1_2.pdf.
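
The openib.conf change from the first item above can also be scripted.
What follows is a minimal sketch only, not part of OFED or SCST:
set_conf_key and enable_srp_ha are hypothetical helper names, and the
sketch assumes that the configuration file consists of plain KEY=value
lines. Only the two settings named above (SRP_LOAD and SRPHA_ENABLE) are
touched.

```shell
#!/bin/sh
# Hypothetical helpers, not part of OFED or SCST.
# Assumption: the configuration file holds plain KEY=value lines.

# set_conf_key FILE KEY VALUE
# Rewrites an existing "KEY=..." line in FILE, or appends "KEY=VALUE"
# if no such line exists. Commented-out lines are left untouched.
set_conf_key() {
    _file=$1
    _key=$2
    _value=$3
    if grep -q "^${_key}=" "${_file}"; then
        # Edit through a temporary file, then copy the result back so
        # that ownership and permissions of FILE are preserved.
        sed "s|^${_key}=.*|${_key}=${_value}|" "${_file}" > "${_file}.tmp" &&
            cat "${_file}.tmp" > "${_file}" &&
            rm -f "${_file}.tmp"
    else
        printf '%s=%s\n' "${_key}" "${_value}" >> "${_file}"
    fi
}

# enable_srp_ha FILE
# Applies the two settings mentioned above to FILE.
enable_srp_ha() {
    set_conf_key "$1" SRP_LOAD yes &&
        set_conf_key "$1" SRPHA_ENABLE yes
}
```

After sourcing these definitions as root on the initiator, the file
named above would be updated with:

   enable_srp_ha /etc/infiniband/openib.conf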

A setup with automatic failover between redundant targets is possible by
installing and configuring DRBD on both targets. If the initiator system
supports mirroring (e.g. Linux), you can use the following approach:
* Configure DRBD in Active/Active mode.
* Configure the initiator(s) for mirroring between the redundant
  targets.

If the initiator system does not support mirroring (e.g. VMware ESX),
you can use the following approach:
* Configure DRBD in Active/Passive mode and enable STONITH mode in the
  Heartbeat software.

For more information, see also:
* http://www.drbd.org/
* http://www.linux-ha.org/wiki/Main_Page

Performance Notes - Initiator Side
----------------------------------

* For latency-sensitive applications, using the noop scheduler at the
  initiator side can give significantly better results than other
  schedulers.

* The following parameters have a small but measurable impact on SRP
  performance:
  * /sys/class/block/${dev}/queue/rq_affinity
  * /proc/irq/${ib_int_no}/smp_affinity

Send questions about this driver to scst-devel@lists.sourceforge.net,
CC: Vu Pham and Bart Van Assche.