SCSI RDMA Protocol (SRP) Target driver for Linux
=================================================
The SRP target driver has been designed to work on top of the Linux RDMA
kernel drivers -- either the RDMA drivers included with a Linux distribution
or the OFED RDMA drivers. For more information about using the SRP target
driver in combination with OFED, see also README.ofed.
The SRP target driver has been implemented as an SCST driver. This
makes it possible to support a lot of I/O modes on real and virtual
devices. A few examples of supported device handlers are:
1. scst_disk. This device handler implements transparent pass-through
of SCSI commands and allows SRP to access and export real
SCSI devices, e.g. disks, hardware RAID volumes and tape libraries,
as SRP LUNs.
2. scst_vdisk, either in fileio or in blockio mode. This device handler
makes it possible to export software RAID volumes, LVM volumes, IDE
disks and normal files as SRP LUNs.
3. nullio. The nullio device handler makes it possible to measure the
performance of the SRP target implementation without performing any
actual I/O.
Installation
------------
Building and installing the SRP target driver is possible as follows:
cd ${SCST_DIR}
if type -p rpm >/dev/null; then
    make -s rpm
    sudo rpm -U rpmbuilddir/RPMS/*/*rpm scstadmin/rpmbuilddir/RPMS/*/*rpm
else
    make -s scst_clean srpt_clean scst srpt scstadmin
    sudo make -s scst_install srpt_install scstadm_install
fi
The ib_srpt kernel module supports the following parameters:
* max_sge_delta (unsigned): Number to subtract from max_sge. Some but not
all HCAs support using up to max_sge S/G-list elements in RDMA
communication. The default value of this parameter is 3 and works with all
HCAs. If you know that the HCAs used by the ib_srpt driver support
S/G-lists that are longer than max_sge - 3, you can decrease this
parameter. Note: setting this parameter too low will cause every SRP login
to fail and will cause a message similar to the following to be logged on
the target system: "ib_srpt: RDMA t ... for idx ... failed with status 12".
* rdma_cm_port (number)
A 16-bit number that specifies the port number to be registered via the
RDMA/CM. Must be specified to make communication over RoCE or iWARP
possible. If this parameter is zero (the default value) the SRP target
driver does not register with the RDMA/CM.
* srp_max_req_size (number)
Maximum size of an SRP control message in bytes. Examples of SRP control
messages are: login request, logout request, data transfer request, ...
The larger this parameter, the more scatter/gather list elements can be
sent at once. Use the following formula to compute an appropriate value
for this parameter: 68 + 16 * (sg_tablesize). The default value of
this parameter is 4148, which corresponds to an sg table size of 255.
* srp_max_rsp_size (number)
Maximum size of an SRP response message in bytes. Sense data is sent back
via these messages towards the initiator. The default size is 256 bytes.
With this value there remain 256 - 36 = 220 bytes for sense data.
* srp_max_rdma_size (number)
Maximum number of bytes that may be transferred at once via RDMA. Defaults
to 65536 bytes, which is sufficient to use the full bandwidth of low-latency
HCAs. Increasing this value may decrease latency for applications
transferring large amounts of data at once.
* srpt_srq_size (number, default 4095)
ib_srpt uses a shared receive queue (SRQ) for processing incoming SRP
requests. This number may have to be increased when a large number of
initiator systems is accessing a single SRP target system.
* srpt_sq_size (number, default 256)
Per-channel InfiniBand send queue size. Depending on the queue depth,
changing this parameter to a smaller value may cause RDMA requests to be
retried and hence may slow down data transfer severely.
* trace_flag (unsigned integer, only available in debug builds)
The individual bits of the trace_flag parameter define which categories of
trace messages should be sent to the kernel log and which ones not.
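The size arithmetic given above for srp_max_req_size and srp_max_rsp_size can be checked with a few lines of shell. This is only a sketch: the function names srp_req_size and srp_sense_bytes are made up for this illustration and are not part of ib_srpt, and the constants 68, 16 and 36 are simply the ones quoted in the parameter descriptions.

```shell
# Hypothetical helpers (not part of ib_srpt) that reproduce the arithmetic
# from the parameter descriptions above.

# SRP request size needed for a given sg_tablesize: 68 + 16 * sg_tablesize
srp_req_size() {
    echo $((68 + 16 * $1))
}

# Bytes left for sense data in a response of the given size: size - 36
srp_sense_bytes() {
    echo $(($1 - 36))
}

srp_req_size 255      # sg table size that corresponds to the default
srp_sense_bytes 256   # default srp_max_rsp_size
```

With the default values this prints 4148 and 220, which matches the numbers quoted in the descriptions of srp_max_req_size and srp_max_rsp_size.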
Configuring the SRP Target System
---------------------------------
When using RoCE or iWARP the first step is to enable support for these
protocols in the target driver by setting the rdma_cm_port kernel module
parameter to a non-zero value. An example:
echo options ib_srpt rdma_cm_port=5000 > /etc/modprobe.d/ib_srpt.conf
Next, create the file /etc/scst.conf. You can create this file with
the scstadmin tool as follows:
/etc/init.d/scst stop
/etc/init.d/scst start
Now configure SCST using scstadmin - see also the scstadmin documentation for
further information. Once finished, save the configuration to /etc/scst.conf:
scstadmin -write_config /etc/scst.conf (sysfs version)
or
scstadmin -WriteConfig /etc/scst.conf (procfs version)
One can verify the contents of scst.conf e.g. as follows:
cat /etc/scst.conf
Now verify that loading the configuration from file works correctly:
/etc/init.d/scst reload
Note: when using InfiniBand, loading the ib_ipoib kernel module and assigning
an IP address to each IPoIB interface is only needed when using the RDMA/CM.
When using the IB/CM, loading the ib_ipoib kernel module is allowed but not
necessary.
Configuring the SRP Initiator System
------------------------------------
First of all, load the SRP kernel module as follows:
modprobe ib_srp
Next, when using InfiniBand, discover the new SRP target by running the
srp_daemon command:
for d in /dev/infiniband/umad*; do srp_daemon -oacd$d; done
If you want to let the initiator system log in to all SRP targets available
in the same InfiniBand subnet, that is possible as follows (-e = execute):
for d in /dev/infiniband/umad*; do srp_daemon -oecd$d; done
If you want to let the initiator log in to a specific target you can do that
e.g. as follows:
echo "id_ext=0002c903000f1366,ioc_guid=0002c903000f1366,dgid=fe800000000000000002c903000f1367,pkey=ffff,service_id=0002c903000f1366" > /sys/class/infiniband_srp/${SRP_HCA_NAME}/add_target
The meaning of the parameters in the above command is as follows:
* id_ext: must match ioc_guid.
* ioc_guid: see also the documentation of the ib_srpt ioc_guid parameter.
* dgid: target HCA port GID to connect to.
* pkey: IB partition key (P_Key) of the target to connect to.
* service_id: must match ioc_guid.
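Because id_ext and service_id must both match ioc_guid, a login string can be assembled from three values only. The following is a minimal sketch; the helper name srp_login_string and the choice of ffff as the default pkey are assumptions made for this example, not something provided by ib_srp.

```shell
# Hypothetical helper: build an IB/CM login string for add_target.
# $1 = ioc_guid (also used for id_ext and service_id)
# $2 = dgid of the target HCA port, without colons
# $3 = pkey; defaults to ffff in this sketch
srp_login_string() {
    printf 'id_ext=%s,ioc_guid=%s,dgid=%s,pkey=%s,service_id=%s\n' \
        "$1" "$1" "$2" "${3:-ffff}" "$1"
}

# Print the login string for the example target used above.
srp_login_string 0002c903000f1366 fe800000000000000002c903000f1367 ffff
```

The printed string is the same one as in the echo command above and can be redirected to /sys/class/infiniband_srp/${SRP_HCA_NAME}/add_target in the same way.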
When using RoCE or iWARP, log in to the target system to determine the id_ext
and ioc_guid parameters and use these to log in. An example:
[ target system ]
# sed 's/tid_ext=/id_ext=/;s/,\(pkey\|dgid\|service_id\)=[^,]*//g' $(find /sys/kernel/scst_tgt/targets/ib_srpt -name login_info) | uniq
id_ext=0002c90300a34270,ioc_guid=0002c90300a34270
[ initiator system ]
echo dest=192.168.5.1:5000,id_ext=0002c90300a34270,ioc_guid=0002c90300a34270 \
>/sys/class/infiniband_srp/srp-mlx4_0-1/add_target
echo dest=192.168.6.1:5000,id_ext=0002c90300a34270,ioc_guid=0002c90300a34270 \
>/sys/class/infiniband_srp/srp-mlx4_0-2/add_target
Initiator port GIDs can be queried e.g. via sysfs:
$ for f in /sys/devices/*/*/*/infiniband/*/ports/*/gids/0; do echo $f; \
cat $f | sed 's/://g'; done
/sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/infiniband/mlx4_0/ports/1/gids/0
fe800000000000000002c9030005f34b
/sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/infiniband/mlx4_0/ports/2/gids/0
fe800000000000000002c9030005f34c
/sys/devices/pci0000:00/0000:00:1c.0/0000:05:00.0/infiniband/mlx4_1/ports/1/gids/0
fe800000000000000002c9030003cca7
/sys/devices/pci0000:00/0000:00:1c.0/0000:05:00.0/infiniband/mlx4_1/ports/2/gids/0
fe800000000000000002c9030003cca8
Finally run lsscsi to display the details of the newly discovered SCSI disks:
lsscsi
SRP targets can be recognized in the output of lsscsi by looking for
the disk names assigned on the SCST target ("disk01" in the example below):
[8:0:0:0] disk SCST_FIO disk01 102 /dev/sdb
Target names
------------
The name assigned by the ib_srpt target driver to an SCST target is either
ib_srpt_target_<n>, the node GUID of an HCA in hexadecimal form with a colon
after every fourth digit, or the port GID with a colon after every fourth
digit. The HCA node GUID and the port GIDs can be obtained via the
ibv_devinfo command. An example:
# ibv_devinfo -v | grep -E '[^a-z]port:|guid|GID'
node_guid: 0002:c903:0005:f34e
sys_image_guid: 0002:c903:0005:f351
port: 1
GID[0]: fe80:0000:0000:0000:0002:c903:0005:f34f
port: 2
GID[0]: fe80:0000:0000:0000:0002:c903:0005:f350
Once the ib_srpt driver has been loaded the available SCST targets can be
queried as follows:
# (cd /sys/kernel/scst_tgt/targets/ib_srpt && ls -d [0-9a-f]*)
fe80:0000:0000:0000:0002:c903:0005:f34f
fe80:0000:0000:0000:0002:c903:0005:f350
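Target names use the GID notation with a colon after every fourth digit, while the dgid field of a login string uses the same digits without colons. Converting between the two notations takes one sed call in each direction. The helper names below are hypothetical and only serve this illustration.

```shell
# Hypothetical helpers for converting between the two GID notations.

# fe80...f34f -> fe80:0000:...:f34f (colon after every fourth digit)
gid_add_colons() {
    echo "$1" | sed 's/..../&:/g; s/:$//'
}

# fe80:0000:...:f34f -> fe80...f34f
gid_strip_colons() {
    echo "$1" | sed 's/://g'
}

gid_add_colons fe800000000000000002c9030005f34f
gid_strip_colons fe80:0000:0000:0000:0002:c903:0005:f34f
```

The first call prints the target name form of the port 1 GID from the ibv_devinfo example, the second call prints the form used in a login string.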
Session names
-------------
The ib_srpt target driver uses the source port GID as session name.
An example:
[ INITIATOR ]
$ for f in /sys/devices/*/*/*/infiniband/*/ports/*/gids/0; do echo $f; cat $f; done
/sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/infiniband/mlx4_0/ports/1/gids/0
fe80:0000:0000:0000:0002:c903:0005:f34b
/sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/infiniband/mlx4_0/ports/2/gids/0
fe80:0000:0000:0000:0002:c903:0005:f34c
/sys/devices/pci0000:00/0000:00:1c.0/0000:05:00.0/infiniband/mlx4_1/ports/1/gids/0
fe80:0000:0000:0000:0002:c903:0003:cca7
/sys/devices/pci0000:00/0000:00:1c.0/0000:05:00.0/infiniband/mlx4_1/ports/2/gids/0
fe80:0000:0000:0000:0002:c903:0003:cca8
[ TARGET, after login ]
$ (cd /sys/kernel/scst_tgt/targets/ib_srpt/[0-9a-f]* && ls -d sessions/*)
sessions/fe80:0000:0000:0000:0002:c903:0003:cca7
sessions/fe80:0000:0000:0000:0002:c903:0003:cca8
sessions/fe80:0000:0000:0000:0002:c903:0005:f34b
sessions/fe80:0000:0000:0000:0002:c903:0005:f34c
LUN masking
-----------
In a straightforward configuration every LUN is visible to every initiator.
It is possible however to make a different set of LUNs visible to each
initiator by using the LUN masking feature of SCST. SRP initiators are
identified by their session name (see above). An example of an scst.conf
file using LUN masking for ib_srpt:
TARGET_DRIVER ib_srpt {
    TARGET fe80:0000:0000:0000:0002:c903:0005:f34b {
        enabled 1
        rel_tgt_id 1
        # LUNs visible by all initiators not listed below
        LUN 0 disk01
        GROUP grp1 {
            # LUNs visible by initiator system 1
            LUN 0 disk02
            INITIATOR fe80:0000:0000:0000:0002:c903:0005:f34b
        }
        GROUP grp2 {
            # LUNs visible by initiator system 2
            LUN 0 disk03
            INITIATOR fe80:0000:0000:0000:0002:c903:0005:f34c
        }
    }
}
Adding and Removing LUNs Dynamically
------------------------------------
It is possible to add and/or remove LUNs on the target without restarting
target or initiator. This can be done either via scstadmin or directly via the
sysfs interface. Although the SCST core will notify the initiator about LUN
changes, Linux initiators will ignore these notifications. To bring a
Linux initiator back in sync after a LUN change, the initiator has to be told
to rescan SCSI devices. Rescanning SCSI devices is possible e.g. via the
rescan-scsi-bus.sh script that can be found here:
http://www.garloff.de/kurt/linux/#rescan-scsi. An example:
$ rescan-scsi-bus --hosts=${srp_host_id} --channels=0 --ids=0 --luns=0-31
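If the script is not at hand, a rescan can also be requested through the scan attribute that the Linux SCSI mid-layer exposes in sysfs for each SCSI host. The following is a minimal sketch; the function name and its optional sysfs root argument (handy for trying the function against a scratch directory) are assumptions made for this example.

```shell
# Hypothetical helper: ask the SCSI mid-layer to rescan one SCSI host.
# $1 = SCSI host number (e.g. the value of ${srp_host_id})
# $2 = sysfs root, defaults to /sys
# "- - -" is a wildcard for channel, target id and LUN.
scsi_rescan_host() {
    echo "- - -" > "${2:-/sys}/class/scsi_host/host$1/scan"
}

# Usage (as root), with 8 as an example host number:
#   scsi_rescan_host 8
```

Note that such a scan only adds newly visible LUNs. It does not remove SCSI devices for LUNs that have disappeared on the target.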
InfiniBand Partitions
---------------------
Just like a VLAN makes it possible to segment traffic on an Ethernet network,
partitions make it possible to segment traffic on an InfiniBand network. Each
InfiniBand partition is identified by a partition key, which is a 16-bit
number. During fabric initialization the subnet manager assigns one or more
partition keys to each InfiniBand port. For opensm, partitions are defined in
/etc/opensm/partitions.conf. ib_srpt uses the partition with index 0. Which
partition key corresponds to index 0 can be found out by querying sysfs:
$ head /sys/class/infiniband/*/ports/*/pkeys/0
==> /sys/class/infiniband/mlx4_0/ports/1/pkeys/0 <==
0xffff
==> /sys/class/infiniband/mlx4_0/ports/2/pkeys/0 <==
0xffff
High availability
-----------------
If there are redundant paths in the IB network between initiator and target,
automatic path failover can be set up on the initiator as follows:
* Edit /etc/infiniband/openib.conf to load the SRP driver and SRP HA daemon
automatically: set SRP_LOAD=yes and SRPHA_ENABLE=yes.
* To set up and use the high availability feature you need the dm-multipath
driver and multipath tool.
* Please refer to the OFED-1.x user manual for more detailed instructions
on how to enable and how to use the HA feature. See e.g.
http://www.mellanox.com/related-docs/prod_software/Mellanox_OFED%20_Linux_user_manual_1_5_1_2.pdf.
A setup with automatic failover between redundant targets is possible by
installing and configuring DRBD on both targets. If the initiator system
supports mirroring (e.g. Linux), you can use the following approach:
* Configure DRBD in Active/Active mode.
* Configure the initiator(s) for mirroring between the redundant targets.
If the initiator system does not support mirroring (e.g. VMware ESX), you
can use the following approach:
* Configure DRBD in Active/Passive mode and enable STONITH mode in the
Heartbeat software.
For more information, see also:
* http://www.drbd.org/
* http://www.linux-ha.org/wiki/Main_Page
Performance Notes - Target Side
-------------------------------
* Building the SCST core and the ib_srpt target driver in release mode
improves performance compared to debug mode.
* When using high-latency storage devices (hard disks), the default value
chosen by SCST for DEVICE.threads_num should be fine. When using
low-latency storage devices though (SSDs), DEVICE.threads_num should be set
to 1 or 2 in /etc/scst.conf in order to reach optimal performance for small
block sizes (e.g. 4 KB).
* When multiple InfiniBand HCAs are present in a target system the Linux
kernel by default will assign the associated interrupt handlers to CPU 0.
Even irqbalance will often assign the interrupt handlers of multiple HCAs
to the same CPU. That is unfortunate because it leads to unfair handling of
SRP sessions. The solution is to assign InfiniBand HCA interrupts manually
to different CPUs. That is possible by looking up the InfiniBand
interrupt numbers in /proc/interrupts and by writing proper bitmasks into
/proc/irq/<n>/smp_affinity.
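The bitmask written into smp_affinity is a hexadecimal number in which bit n selects CPU n. A minimal sketch of computing such a mask for a single CPU follows; the function name is hypothetical and the sketch only covers CPU numbers 0 to 31, since larger masks are written as comma-separated 32-bit groups.

```shell
# Hypothetical helper: hexadecimal smp_affinity bitmask that selects a
# single CPU. Only meant for CPU numbers 0..31; larger systems use
# comma-separated 32-bit groups, which this sketch does not handle.
cpu_affinity_mask() {
    printf '%x\n' $((1 << $1))
}

cpu_affinity_mask 0   # prints 1
cpu_affinity_mask 3   # prints 8
cpu_affinity_mask 5   # prints 20

# Usage (as root), where 58 is an example interrupt number taken from
# /proc/interrupts:
#   cpu_affinity_mask 3 > /proc/irq/58/smp_affinity
```

Assigning each HCA interrupt a mask with a different single bit set spreads the interrupt handlers over different CPUs.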
Performance Notes - Initiator Side
----------------------------------
* Using multiple RDMA connections between initiator and target results in a
significant performance improvement. To benefit from this feature, use
kernel 3.19 or later at the initiator side and enable scsi-mq either by
setting SCSI_MQ_DEFAULT=y in the kernel config or via the following command:
echo Y > /sys/module/scsi_mod/parameters/use_blk_mq
If the HCA model in your initiator system supports multiple MSI-X interrupts
the next step is either to stop the irqbalance service or to write a policy
script that stops irqbalance from modifying the IB interrupt CPU
affinity. Once this has been done spread the IB interrupts uniformly over
CPU cores via e.g. scripts/spread-mlx4-ib-interrupts.
For more information about scsi-mq see also Michael Larabel, SCSI
Multi-Queue Performance Appears Great For Linux 3.17, Phoronix, June 18,
2014 (http://www.phoronix.com/scan.php?page=news_item&px=MTcyMjk).
* Choose a proper value for the ib_srp kernel module parameter
cmd_sg_entries. The default value 12 works well for buffered reads while
the throughput for write-dominated workloads improves by changing this value
to 255. One way to set this kernel module parameter is as follows:
echo options ib_srp cmd_sg_entries=255 >/etc/modprobe.d/ib_srp.conf
* For multithreaded workloads using small block sizes, setting rq_affinity
to 2 improves IOPS significantly (Linux kernel 3.1 and later; see also
commit 5757a6d76cdf6dda2a492c09b985c015e86779b1).
* For latency sensitive applications, using the noop scheduler at the initiator
side can give significantly better results than with other schedulers.
* The SRP initiator limits by default the queue depth to 64 commands. If your
workload benefits from a larger queue depth, enlarge the queue depth by
setting the max_cmd_per_lun and queue_size parameters in the SRP login
string.
* The following parameters have a small but measurable impact on SRP
performance:
* /sys/class/block/${dev}/queue/rotational
* /sys/class/block/${dev}/queue/rq_affinity
* /proc/irq/${ib_int_no}/smp_affinity
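The rq_affinity and I/O scheduler suggestions from this list are applied per block device through sysfs. The sketch below prints the corresponding commands instead of executing them, so that they can be reviewed first. The function name is hypothetical, and on kernels where the device uses blk-mq the scheduler file may not offer noop at all, so check its contents before writing to it.

```shell
# Hypothetical helper: print (not execute) the sysfs writes that apply the
# rq_affinity and I/O scheduler suggestions above to one block device.
# $1 = block device name, e.g. sdb
srp_initiator_tuning() {
    echo "echo 2 > /sys/class/block/$1/queue/rq_affinity"
    echo "echo noop > /sys/class/block/$1/queue/scheduler"
}

srp_initiator_tuning sdb
```

After reviewing the output, it can be piped into sh as root to apply the settings.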
Performance Notes - Both Sides
------------------------------
* Disabling CONFIG_SCHED_DEBUG and CONFIG_SCHEDSTATS in the kernel config
improves performance.
* Disable CONFIG_IRQSOFF_TRACER so that CONFIG_TRACE_IRQFLAGS is disabled too.
* Consider which memory allocator to use. With recent kernels using the SLUB
memory allocator instead of SLAB may help. On multi-socket systems the SLAB
memory allocator may result in better performance. Please note that SLAB is
tunable while SLUB is not. See also http://lkml.org/lkml/2010/7/9/264 and
http://www.ibm.com/developerworks/linux/library/l-linux-slab-allocator/.
Frequently Asked Questions
--------------------------
Q: Every now and then "SRP abort called" and "SRP reset_device called"
messages are logged at the initiator side. Around the same time I see the
following message in the target log: "ib_srpt: ***ERROR***: Command ...: IB
completion for idx ... has not been received in time (SRPT command state
...)". What do these messages mean and how can I fix this?
A: This means that a timeout occurred while an HCA was waiting for an
acknowledgment message. Check the IB network for bad IB cables, bad HCAs
and/or bad switch ports. Also make sure that the HCA firmware is up to
date.
Q: Loading the kernel module ib_srpt triggers a kernel panic with a call trace
like the one below. What is the cause of this and how can this be solved?
Call Trace:
[<ffffffffa02f2a50>] srpt_alloc_ioctx+0x60/0xb0 [ib_srpt]
[<ffffffffa02f2f0a>] srpt_alloc_ioctx_ring+0xea/0x1e0 [ib_srpt]
[<ffffffffa02f32e9>] srpt_add_one+0x2e9/0x670 [ib_srpt]
[<ffffffffa015a480>] ib_register_client+0x80/0xa0 [ib_core]
[<ffffffffa02421eb>] srpt_init_module+0x1eb/0x235 [ib_srpt]
[<ffffffff81000344>] do_one_initcall+0x34/0x1a0
[<ffffffff8107a63c>] sys_init_module+0xdc/0x260
[<ffffffff81002e3b>] system_call_fastpath+0x16/0x1b
A: This means that you are using a system on which OFED has been installed but
that ib_srpt has been compiled against the in-tree kernel headers instead
of the OFED kernel headers. You can fix this by rebuilding ib_srpt against
the OFED kernel headers. The ib_srpt makefile should detect the OFED kernel
headers automatically - at least if ib_srpt is built after OFED has been
installed.
Feedback
--------
Send questions about this driver to scst-devel@lists.sourceforge.net.