SCSI RDMA Protocol (SRP) Target driver for Linux
=================================================

The SRP target driver has been designed to work on top of the Linux RDMA
kernel drivers -- either the RDMA drivers included with a Linux distribution
or the OFED RDMA drivers. For more information about using the SRP target
driver in combination with OFED, see also README.ofed.

The SRP target driver has been implemented as an SCST driver. This
makes it possible to support a lot of I/O modes on real and virtual
devices. A few examples of supported device handlers are:

1. scst_disk. This device handler implements transparent pass-through
   of SCSI commands and allows SRP to access and to export real
   SCSI devices, i.e. disks, hardware RAID volumes, and tape libraries,
   as SRP LUNs.

2. scst_vdisk, either in fileio or in blockio mode. This device handler
   makes it possible to export software RAID volumes, LVM volumes, IDE
   disks, and normal files as SRP LUNs.

3. nullio. The nullio device handler makes it possible to measure the
   performance of the SRP target implementation without performing any
   actual I/O.

Installation
------------

Building and installing the SRP target driver is possible as follows:

    cd ${SCST_DIR}
    if type -p rpm >/dev/null; then
        make -s rpm
        sudo rpm -U rpmbuilddir/RPMS/*/*rpm scstadmin/rpmbuilddir/RPMS/*/*rpm
    else
        make -s scst_clean srpt_clean scst srpt scstadmin
        sudo make -s scst_install srpt_install scstadm_install
    fi

The ib_srpt kernel module supports the following parameters:

* rdma_cm_port (number)
  A 16-bit number that specifies the port number to be registered via the
  RDMA/CM. Must be specified to make communication over RoCE or iWARP
  possible. If this parameter is zero (the default value) the SRP target
  driver does not register with the RDMA/CM.
* srp_max_req_size (number)
  Maximum size of an SRP control message in bytes. Examples of SRP control
  messages are: login request, logout request, data transfer request, ...
  The larger this parameter, the more scatter/gather list elements can be
  sent at once. Use the following formula to compute an appropriate value
  for this parameter: 68 + 16 * (sg_tablesize). The default value of
  this parameter is 4148, which corresponds to an sg table size of 255.
* srp_max_rsp_size (number)
  Maximum size of an SRP response message in bytes. Sense data is sent back
  via these messages towards the initiator. The default size is 256 bytes.
  With this value there remains (256-36) = 220 bytes for sense data.
* srp_max_rdma_size (number)
  Maximum number of bytes that may be transferred at once via RDMA. Defaults
  to 65536 bytes, which is sufficient to use the full bandwidth of low-latency
  HCAs. Increasing this value may decrease latency for applications
  transferring large amounts of data at once.
* srpt_srq_size (number, default 4095)
  ib_srpt uses a shared receive queue (SRQ) for processing incoming SRP
  requests. This number may have to be increased when a large number of
  initiator systems is accessing a single SRP target system.
* srpt_sq_size (number, default 256)
  Per-channel InfiniBand send queue size. Depending on the queue depth,
  changing this parameter to a smaller value may cause RDMA requests to be
  retried and hence may slow down data transfer severely.
* trace_flag (unsigned integer, only available in debug builds)
  The individual bits of the trace_flag parameter define which categories of
  trace messages should be sent to the kernel log and which ones not.
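
The two size formulas given above for srp_max_req_size and srp_max_rsp_size
can be sketched as small shell helpers. The function names are hypothetical
and not part of SCST; only the constants 68, 16 and 36 come from the
parameter descriptions.

```shell
#!/bin/sh
# Hypothetical helpers; only the formulas come from the parameter list.

# Request size in bytes needed for the given sg table size:
# 68 + 16 * sg_tablesize.
srp_max_req_size() {
    echo $((68 + 16 * $1))
}

# Bytes left for sense data in a response of the given size, after
# subtracting the 36 bytes mentioned in the srp_max_rsp_size description.
srp_sense_space() {
    echo $(($1 - 36))
}
```

The result could then be placed in a modprobe configuration line, e.g.
echo "options ib_srpt srp_max_req_size=$(srp_max_req_size 128)".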


Configuring the SRP Target System
---------------------------------

When using RoCE or iWARP, the first step is to enable support for these
protocols in the target driver by setting the rdma_cm_port kernel module
parameter to a non-zero value. An example:

    echo options ib_srpt rdma_cm_port=5000 > /etc/modprobe.d/ib_srpt.conf

Next, create the file /etc/scst.conf. You can create this file with
the scstadmin tool as follows:

    /etc/init.d/scst stop
    /etc/init.d/scst start

Now configure SCST using scstadmin - see also the scstadmin documentation for
further information. Once finished, save the configuration to /etc/scst.conf:

    scstadmin -write_config /etc/scst.conf

One can verify the contents of scst.conf e.g. as follows:

    cat /etc/scst.conf

Now verify that loading the configuration from file works correctly:

    /etc/init.d/scst reload

Note: when using InfiniBand, loading the ib_ipoib kernel module and assigning
an IP address to each IPoIB interface is only needed when using the RDMA/CM.
When using the IB/CM, however, it is allowed but not necessary to load the
ib_ipoib kernel module.

Configuring the SRP Initiator System
------------------------------------

First of all, load the SRP kernel module as follows:

    modprobe ib_srp

Next, when using InfiniBand, discover the new SRP target by running the
srp_daemon command:

    for d in /dev/infiniband/umad*; do srp_daemon -oacd$d; done

If you want to let the initiator system log in to all SRP targets available
in the same InfiniBand subnet, that is possible as follows (-e = execute):

    for d in /dev/infiniband/umad*; do srp_daemon -oecd$d; done

If you want to let the initiator log in to a specific target you can do that
e.g. as follows:

    echo "id_ext=0002c903000f1366,ioc_guid=0002c903000f1366,dgid=fe800000000000000002c903000f1367,pkey=ffff,service_id=0002c903000f1366" > /sys/class/infiniband_srp/${SRP_HCA_NAME}/add_target

The meaning of the parameters in the above command is as follows:
* id_ext: must match ioc_guid.
* ioc_guid: see also the documentation of the ib_srpt ioc_guid parameter.
* dgid: target HCA port GID to connect to.
* pkey: IB partition key (P_Key) of the target to connect to.
* service_id: must match ioc_guid.
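
Because id_ext and service_id must both match ioc_guid, the login string can
be generated from just the IOC GUID and the target port GID. The following is
a minimal sketch; the function name srp_login_string is hypothetical, and the
default of ffff for pkey is an assumption taken from the example above.

```shell
#!/bin/sh
# Hypothetical helper that builds an InfiniBand SRP login string.
# Arguments: <ioc_guid> <dgid> [pkey]
# All values are hex strings without colons and without a 0x prefix.
srp_login_string() {
    ioc_guid=$1
    dgid=$2
    pkey=${3:-ffff}   # assumption: use ffff when no pkey is given
    # id_ext and service_id are set to ioc_guid, as required above.
    printf 'id_ext=%s,ioc_guid=%s,dgid=%s,pkey=%s,service_id=%s\n' \
        "$ioc_guid" "$ioc_guid" "$dgid" "$pkey" "$ioc_guid"
}
```

The output can be redirected into the add_target file, e.g.
srp_login_string 0002c903000f1366 fe800000000000000002c903000f1367 > /sys/class/infiniband_srp/${SRP_HCA_NAME}/add_target.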

When using RoCE or iWARP, log in to the target system to determine the id_ext
and ioc_guid parameters and use these to log in. An example:

    [ target system ]
    # sed 's/tid_ext=/id_ext=/;s/,\(pkey\|dgid\|service_id\)=[^,]*//g' $(find /sys/kernel/scst_tgt/targets/ib_srpt -name login_info) | uniq
    id_ext=0002c90300a34270,ioc_guid=0002c90300a34270

    [ initiator system ]
    echo dest=192.168.5.1:5000,id_ext=0002c90300a34270,ioc_guid=0002c90300a34270 \
      >/sys/class/infiniband_srp/srp-mlx4_0-1/add_target
    echo dest=192.168.6.1:5000,id_ext=0002c90300a34270,ioc_guid=0002c90300a34270 \
      >/sys/class/infiniband_srp/srp-mlx4_0-2/add_target

Initiator port GIDs can be queried e.g. via sysfs:

    $ for f in /sys/devices/*/*/*/infiniband/*/ports/*/gids/0; do echo $f; \
        cat $f | sed 's/://g'; done
    /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/infiniband/mlx4_0/ports/1/gids/0
    fe800000000000000002c9030005f34b
    /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/infiniband/mlx4_0/ports/2/gids/0
    fe800000000000000002c9030005f34c
    /sys/devices/pci0000:00/0000:00:1c.0/0000:05:00.0/infiniband/mlx4_1/ports/1/gids/0
    fe800000000000000002c9030003cca7
    /sys/devices/pci0000:00/0000:00:1c.0/0000:05:00.0/infiniband/mlx4_1/ports/2/gids/0
    fe800000000000000002c9030003cca8

Finally run lsscsi to display the details of the newly discovered SCSI disks:

    lsscsi

SRP targets can be recognized in the output of lsscsi by looking for
the disk names assigned on the SCST target ("disk01" in the example below):

    [8:0:0:0]    disk    SCST_FIO disk01            102  /dev/sdb

Target names
------------

The name assigned by the ib_srpt target driver to an SCST target is either
ib_srpt_target_<n>, the node GUID of an HCA in hexadecimal form with a colon
after every fourth digit, or the port GID with a colon after every fourth
digit. The HCA node GUID and the port GIDs can be obtained via the
ibv_devinfo command. An example:

    # ibv_devinfo -v | grep -E '[^a-z]port:|guid|GID'
        node_guid:          0002:c903:0005:f34e
        sys_image_guid:     0002:c903:0005:f351
            port:   1
                GID[0]:     fe80:0000:0000:0000:0002:c903:0005:f34f
            port:   2
                GID[0]:     fe80:0000:0000:0000:0002:c903:0005:f350

Once the ib_srpt driver has been loaded the available SCST targets can be
queried as follows:

    # (cd /sys/kernel/scst_tgt/targets/ib_srpt && ls -d [0-9a-f]*)
    fe80:0000:0000:0000:0002:c903:0005:f34f
    fe80:0000:0000:0000:0002:c903:0005:f350
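
SRP login strings use GUIDs and GIDs without colons, while target names
carry a colon after every fourth digit. Converting between the two forms can
be sketched as follows; both function names are hypothetical helpers and not
part of SCST.

```shell
#!/bin/sh
# Hypothetical helpers for converting between the two GID/GUID notations.

# Insert a colon after every fourth hex digit, then drop the trailing colon.
gid_with_colons() {
    printf '%s\n' "$1" | sed 's/..../&:/g; s/:$//'
}

# Remove all colons, yielding the form used in SRP login strings.
gid_without_colons() {
    printf '%s\n' "$1" | tr -d ':'
}
```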


Session names
-------------

The ib_srpt target driver uses the source port GID as session name.

An example:

    [ INITIATOR ]

    $ for f in /sys/devices/*/*/*/infiniband/*/ports/*/gids/0; do echo $f; \
        cat $f; done
    /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/infiniband/mlx4_0/ports/1/gids/0
    fe80:0000:0000:0000:0002:c903:0005:f34b
    /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/infiniband/mlx4_0/ports/2/gids/0
    fe80:0000:0000:0000:0002:c903:0005:f34c
    /sys/devices/pci0000:00/0000:00:1c.0/0000:05:00.0/infiniband/mlx4_1/ports/1/gids/0
    fe80:0000:0000:0000:0002:c903:0003:cca7
    /sys/devices/pci0000:00/0000:00:1c.0/0000:05:00.0/infiniband/mlx4_1/ports/2/gids/0
    fe80:0000:0000:0000:0002:c903:0003:cca8

    [ TARGET, after login ]

    $ (cd /sys/kernel/scst_tgt/targets/ib_srpt/[0-9a-f]* && ls -d sessions/*)
    sessions/fe80:0000:0000:0000:0002:c903:0003:cca7
    sessions/fe80:0000:0000:0000:0002:c903:0003:cca8
    sessions/fe80:0000:0000:0000:0002:c903:0005:f34b
    sessions/fe80:0000:0000:0000:0002:c903:0005:f34c

LUN masking
-----------

In a straightforward configuration every LUN is visible to every initiator.
It is possible however to make a different set of LUNs visible to each
initiator by using the LUN masking feature of SCST. SRP initiators are
identified by their session name (see above). An example of an scst.conf
file using LUN masking for ib_srpt:

    TARGET_DRIVER ib_srpt {
        TARGET fe80:0000:0000:0000:0002:c903:0005:f34b {
            enabled 1
            rel_tgt_id 1

            # LUNs visible by all initiators not listed below
            LUN 0 disk01

            GROUP grp1 {
                # LUNs visible by initiator system 1
                LUN 0 disk02

                INITIATOR fe80:0000:0000:0000:0002:c903:0005:f34b
            }

            GROUP grp2 {
                # LUNs visible by initiator system 2
                LUN 0 disk03

                INITIATOR fe80:0000:0000:0000:0002:c903:0005:f34c
            }
        }
    }

Adding and Removing LUNs Dynamically
------------------------------------

It is possible to add and/or remove LUNs on the target without restarting
target or initiator. This can be done either via scstadmin or directly via the
sysfs interface. Although the SCST core will notify the initiator about LUN
changes, Linux initiators will ignore these notifications. In order to bring a
Linux initiator back in sync after a LUN change, the initiator has to be told
to rescan SCSI devices. Rescanning SCSI devices is e.g. possible via the
rescan-scsi-bus.sh script that can be found here:
http://www.garloff.de/kurt/linux/#rescan-scsi. An example:

    $ rescan-scsi-bus.sh --hosts=${srp_host_id} --channels=0 --ids=0 --luns=0-31
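
When the rescan script is not available, the Linux SCSI midlayer also offers
a scan attribute per SCSI host in sysfs that accepts a channel, target id and
LUN, with "-" as wildcard. The sketch below wraps that interface; the function
name scsi_host_rescan is hypothetical, and taking the scan file as an argument
is only a choice made here to keep the helper easy to try out.

```shell
#!/bin/sh
# Hypothetical helper that triggers a rescan of one SCSI host.
# Arguments: <scan file> [channel] [target id] [lun]
# Omitted arguments default to "-", which the kernel treats as a wildcard.
scsi_host_rescan() {
    scan_file=$1
    printf '%s %s %s\n' "${2:--}" "${3:--}" "${4:--}" > "$scan_file"
}
```

Rescanning all LUNs behind channel 0, target id 0 of the SRP host would then
be: scsi_host_rescan /sys/class/scsi_host/host${srp_host_id}/scan 0 0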


InfiniBand Partitions
---------------------

Just like a VLAN makes it possible to segment traffic on an Ethernet network,
partitions make it possible to segment traffic on an InfiniBand network. Each
InfiniBand partition is identified by a partition key, which is a 16-bit
number. During fabric initialization the subnet manager assigns one or more
partition keys to each InfiniBand port. For opensm, partitions are defined in
/etc/opensm/partitions.conf. ib_srpt uses the partition with index 0. Which
partition key corresponds to index 0 can be found out by querying sysfs:

    $ head /sys/class/infiniband/*/ports/*/pkeys/0
    ==> /sys/class/infiniband/mlx4_0/ports/1/pkeys/0 <==
    0xffff

    ==> /sys/class/infiniband/mlx4_0/ports/2/pkeys/0 <==
    0xffff

High availability
-----------------

If there are redundant paths in the IB network between initiator and target,
automatic path failover can be set up on the initiator as follows:
* Edit /etc/infiniband/openib.conf to load the SRP driver and SRP HA daemon
  automatically: set SRP_LOAD=yes and SRPHA_ENABLE=yes.
* To set up and use the high availability feature you need the dm-multipath
  driver and multipath tool.
* Please refer to the OFED-1.x user manual for more detailed instructions
  on how to enable and how to use the HA feature. See e.g.
  http://www.mellanox.com/related-docs/prod_software/Mellanox_OFED%20_Linux_user_manual_1_5_1_2.pdf.

A setup with automatic failover between redundant targets is possible by
installing and configuring DRBD on both targets. If the initiator system
supports mirroring (e.g. Linux), you can use the following approach:
* Configure DRBD in Active/Active mode.
* Configure the initiator(s) for mirroring between the redundant targets.
If the initiator system does not support mirroring (e.g. VMware ESX), you
can use the following approach:
* Configure DRBD in Active/Passive mode and enable STONITH mode in the
  Heartbeat software.

For more information, see also:
* http://www.drbd.org/
* http://www.linux-ha.org/wiki/Main_Page

Performance Notes - Target Side
-------------------------------

* Building the SCST core and the ib_srpt target driver in release mode
  improves performance compared to debug mode.

* When using high-latency storage devices (hard disks), the default value
  chosen by SCST for DEVICE.threads_num should be fine. When using
  low-latency storage devices though (SSDs), DEVICE.threads_num should be set
  to 1 or 2 in /etc/scst.conf in order to reach optimal performance for small
  block sizes (e.g. 4 KB).

* When multiple InfiniBand HCAs are present in a target system the Linux
  kernel by default will assign the associated interrupt handlers to CPU 0.
  Even irqbalance will often assign the interrupt handlers of multiple HCAs
  to the same CPU. That is unfortunate because it leads to unfair handling of
  SRP sessions. The solution is to assign InfiniBand HCA interrupts manually
  to different CPUs. That is possible by looking up the InfiniBand
  interrupt numbers in /proc/interrupts and by writing proper bitmasks into
  /proc/irq/<n>/smp_affinity.
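
The bitmask written into smp_affinity is a hexadecimal number in which bit
<c> selects CPU <c>. A minimal sketch for computing the mask of a single CPU
follows; the function name cpu_affinity_mask is hypothetical, and the helper
assumes CPU numbers below 32 so that the mask fits in one 32-bit group.

```shell
#!/bin/sh
# Hypothetical helper: hexadecimal smp_affinity mask that selects one CPU.
# Assumption: the CPU number is below 32.
cpu_affinity_mask() {
    printf '%x\n' $((1 << $1))
}
```

Pinning an interrupt to CPU 2 would then look like
echo $(cpu_affinity_mask 2) > /proc/irq/<n>/smp_affinity, with <n> taken from
/proc/interrupts.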


Performance Notes - Initiator Side
----------------------------------

* Using multiple RDMA connections between initiator and target results in a
  significant performance improvement. To benefit from this feature, use
  kernel 3.19 or later at the initiator side and enable scsi-mq either by
  setting SCSI_MQ_DEFAULT=y in the kernel config or via the following command:

    echo Y > /sys/module/scsi_mod/parameters/use_blk_mq

  If the HCA model in your initiator system supports multiple MSI-X interrupts
  the next step is either to stop the irqbalance service or to write a policy
  script that stops irqbalance from modifying the IB interrupt CPU
  affinity.

  For more information about scsi-mq see also Michael Larabel, SCSI
  Multi-Queue Performance Appears Great For Linux 3.17, Phoronix, June 18,
  2014 (http://www.phoronix.com/scan.php?page=news_item&px=MTcyMjk).

* Choose a proper value for the ib_srp kernel module parameter
  cmd_sg_entries. The default value 12 works well for buffered reads while
  the throughput for write-dominated workloads improves by changing this value
  to 255. One way to set this kernel module parameter is as follows:

    echo options ib_srp cmd_sg_entries=255 >/etc/modprobe.d/ib_srp.conf

* For multithreaded workloads using small block sizes, changing rq_affinity
  to 2 improves IOPS significantly (Linux kernel 3.1 and later; see also
  commit 5757a6d76cdf6dda2a492c09b985c015e86779b1).

* For latency sensitive applications, using the noop scheduler at the initiator
  side can give significantly better results than with other schedulers.

* The SRP initiator limits by default the queue depth to 64 commands. If your
  workload benefits from a larger queue depth, enlarge the queue depth by
  setting the max_cmd_per_lun and queue_size parameters in the SRP login
  string.

* The following parameters have a small but measurable impact on SRP
  performance:
  * /sys/class/block/${dev}/queue/rotational
  * /sys/class/block/${dev}/queue/rq_affinity
  * /proc/irq/${ib_int_no}/smp_affinity
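
Two of the tuning steps from the list above, enlarging the queue depth via
the login string and writing block queue attributes, can be sketched as
follows. Both function names are hypothetical; only the parameter names
max_cmd_per_lun and queue_size and the sysfs paths come from the text.

```shell
#!/bin/sh
# Hypothetical helpers for the initiator-side tuning steps listed above.

# Append queue depth options to an existing SRP login string.
# Arguments: <login string> <max_cmd_per_lun> <queue_size>
srp_login_with_queue_depth() {
    printf '%s,max_cmd_per_lun=%s,queue_size=%s\n' "$1" "$2" "$3"
}

# Write a value to one attribute below a block device queue directory.
# Arguments: <queue directory> <attribute> <value>
# Returns non-zero if the attribute does not exist or is not writable.
set_queue_attr() {
    queue_dir=$1
    attr=$2
    value=$3
    if [ -w "$queue_dir/$attr" ]; then
        printf '%s\n' "$value" > "$queue_dir/$attr"
    else
        echo "cannot write $queue_dir/$attr" >&2
        return 1
    fi
}
```

For example, set_queue_attr /sys/class/block/${dev}/queue rq_affinity 2
applies the rq_affinity setting recommended above. The scheduler attribute in
the same directory can be set the same way; which scheduler names are accepted
depends on the kernel version.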


Performance Notes - Both Sides
------------------------------

* Disabling CONFIG_SCHED_DEBUG and CONFIG_SCHEDSTATS in the kernel config
  improves performance.

* Disable CONFIG_IRQSOFF_TRACER such that CONFIG_TRACE_IRQFLAGS is disabled.

* Consider which memory allocator to use. With recent kernels using the SLUB
  memory allocator instead of SLAB may help. On multi-socket systems the SLAB
  memory allocator may result in better performance. Please note that SLAB is
  tunable while SLUB is not. See also http://lkml.org/lkml/2010/7/9/264 and
  http://www.ibm.com/developerworks/linux/library/l-linux-slab-allocator/.

Frequently Asked Questions
--------------------------

Q: Every now and then "SRP abort called" and "SRP reset_device called"
   messages are logged at the initiator side. Around the same time I see the
   following message in the target log: "ib_srpt: ***ERROR***: Command ...: IB
   completion for idx ... has not been received in time (SRPT command state
   ...)". What do these messages mean and how can I fix this?

A: This means that a timeout occurred while an HCA was waiting for an
   acknowledgment message. Check the IB network for bad IB cables, bad HCAs
   and/or bad switch ports. Also make sure that the HCA firmware is up to
   date.

Q: Loading the kernel module ib_srpt triggers a kernel panic with a call trace
   like the one below. What is the cause of this and how can this be solved?

   Call Trace:
   [<ffffffffa02f2a50>] srpt_alloc_ioctx+0x60/0xb0 [ib_srpt]
   [<ffffffffa02f2f0a>] srpt_alloc_ioctx_ring+0xea/0x1e0 [ib_srpt]
   [<ffffffffa02f32e9>] srpt_add_one+0x2e9/0x670 [ib_srpt]
   [<ffffffffa015a480>] ib_register_client+0x80/0xa0 [ib_core]
   [<ffffffffa02421eb>] srpt_init_module+0x1eb/0x235 [ib_srpt]
   [<ffffffff81000344>] do_one_initcall+0x34/0x1a0
   [<ffffffff8107a63c>] sys_init_module+0xdc/0x260
   [<ffffffff81002e3b>] system_call_fastpath+0x16/0x1b

A: This means that you are using a system on which OFED has been installed but
   on which ib_srpt has been compiled against the in-tree kernel headers
   instead of the OFED kernel headers. You can fix this by rebuilding ib_srpt
   against the OFED kernel headers. The ib_srpt makefile should detect the
   OFED kernel headers automatically - at least if ib_srpt is built after OFED
   has been installed.

Feedback
--------

Send questions about this driver to scst-devel@lists.sourceforge.net.