Mirror of https://github.com/SCST-project/scst.git, synced 2026-05-14 01:01:27 +00:00.
pr_state is a common device attribute for save/restore of Persistent Reservation state. pr_dump_dir is a dev_disk handler attribute that triggers an automatic kernel-side PR state dump at unregistration time.
Generic SCSI target mid-level for Linux (SCST)
==============================================

Version 3.11.0-pre, 29 December 2025
----------------------------

SCST is designed to provide a unified, consistent interface between SCSI
target drivers and the Linux kernel and to simplify target driver
development as much as possible. A detailed description of SCST's
features and internals can be found on its Internet page
http://scst.sourceforge.net.

SCST supports the following I/O modes:

 * Pass-through mode with a one-to-many relationship, i.e. when multiple
   initiators can connect to the exported pass-through devices, for
   the following SCSI device types: disks (type 0), tapes (type 1),
   processors (type 3), CDROMs (type 5), MO disks (type 7), medium
   changers (type 8) and RAID controllers (type 0xC).

 * FILEIO mode, which allows using files on file systems or block
   devices as virtual remotely available SCSI disks or CDROMs with
   the benefits of the Linux page cache.

 * BLOCKIO mode, which performs direct block I/O with a block device,
   bypassing the page cache for all operations. This mode works ideally
   with high-end storage HBAs and for applications that either do not
   need caching between application and disk or need large block
   throughput.

 * User space mode using the scst_user device handler, which allows
   implementing high performance virtual SCSI devices in user space.
   Compared with fully in-kernel dev handlers this mode has very low
   overhead (a few percent).

 * "Performance" device handlers, which provide, in pseudo pass-through
   mode, a way to measure performance directly without the overhead of
   actual data transfer from/to the underlying SCSI device.

In addition, SCST supports advanced per-initiator access and device
visibility management, so different initiators can see different sets
of devices with different access permissions. See below for details.

A full list of SCST features and a comparison with other Linux targets
can be found at http://scst.sourceforge.net/comparison.html.


Installation
------------

Only vanilla kernels from kernel.org and RHEL/CentOS 5.2 kernels are
supported, but SCST should work on other (vendors') kernels if you
manage to compile it on them. The main problem with vendors' kernels is
that they often contain patches that will appear only in the next
version of the vanilla kernel, so it is quite hard to track such
changes. Thus, if during compilation for some vendor kernel your
compiler complains about redefinition of some symbol, you should either
switch to a vanilla kernel, or add or change as necessary the
"#if LINUX_VERSION_CODE" statement corresponding to that symbol.

Kernel versions 2.6.26 and higher are supported.

First, make sure that the link "/lib/modules/`your_kernel_version`/build"
points to the source code for your currently running kernel.

Then consider applying the necessary kernel patches. SCST has the
following patches for the kernel in the "kernel" subdirectory. All of
them are optional, so if you don't need the corresponding
functionality, you need not apply them.

1. readahead-2.6.X.patch. This patch fixes a problem in the Linux
   readahead subsystem and greatly improves performance for software
   RAIDs. See the thread
   http://sourceforge.net/mailarchive/forum.php?thread_name=a0272b440906030714g67eabc5k8f847fb1e538cc62%40mail.gmail.com&forum_name=scst-devel
   for more details. It is included in the mainstream kernels 2.6.33
   and 2.6.32.11.

2. readahead-context-2.6.X.patch. This is a backport of the context
   readahead patch http://lkml.org/lkml/2009/4/12/9 from 2.6.31, big
   thanks to Wu Fengguang. This is a performance improvement patch. It
   is included in the mainstream kernel 2.6.31.

Then, to compile SCST, type 'make scst'. It will build SCST itself and
its device handlers. To install them, type 'make scst_install'. The
driver modules will be installed in
'/lib/modules/`your_kernel_version`/extra'. In addition, scst.h,
scst_debug.h as well as Module.symvers or Modules.symvers will be copied
to '/usr/local/include/scst'. The first file contains all of SCST's
public data definitions, which are used by target drivers. The other
ones support debug message logging and the build process.

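The build steps above can be sketched as a short shell snippet. The
helper name check_kernel_build_link and its optional directory argument
are illustrative assumptions, not part of SCST; the make targets are the
ones named above.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if the "build" link for the running
# kernel resolves to an existing directory. $1 is an optional modules
# root (default /lib/modules), which makes the check easy to try out.
check_kernel_build_link() {
    local root="${1:-/lib/modules}" kver
    kver="$(uname -r)"
    [ -d "$root/$kver/build/" ]
}

# Typical sequence, run from the top of the SCST source tree:
#   check_kernel_build_link || echo "kernel build link is missing" >&2
#   make scst && make scst_install
```

Checking the link first gives a clearer error than a failed module build.
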
Then you can load any module by typing 'modprobe module_name'. The names
are:

 - scst - SCST itself
 - scst_disk - device handler for disks (type 0)
 - scst_tape - device handler for tapes (type 1)
 - scst_processor - device handler for processors (type 3)
 - scst_cdrom - device handler for CDROMs (type 5)
 - scst_modisk - device handler for MO disks (type 7)
 - scst_changer - device handler for medium changers (type 8)
 - scst_raid - device handler for storage array controllers (e.g. RAID) (type 0xC)
 - scst_vdisk - device handler for virtual disks (file, device or ISO CD image)
 - scst_user - user space device handler

Then, to see your devices remotely, you need to add a corresponding LUN
for them (see below how). By default, no local devices are seen
remotely. There must be a LUN 0 in each set of LUNs (security group),
i.e. LU numbering must not start from, e.g., 1. Otherwise you will see
no devices on remote initiators and the SCST core will write the
following message into the kernel log: "tgt_dev for LUN 0 not found,
command to unexisting LU?"

It is highly recommended to use the scstadmin utility for configuring
devices and security groups.

The flow of SCST initialization should be as follows:

1. Load the SCST modules with the necessary module parameters, if needed.

2. Configure targets, devices, LUNs, etc. using either scstadmin
   (recommended), or the sysfs interface directly as described below.

If you experience problems while loading or running the modules, check
your kernel logs (or run the dmesg command for the most recent
messages).

IMPORTANT: Without the appropriate device handler loaded, the
=========  corresponding devices will be invisible to remote
           initiators, which could lead to holes in the LUN addressing,
           so automatic device scanning by the remote SCSI mid-level
           might not notice the devices. In that case you will have to
           add them manually via
           'echo "- - -" >/sys/class/scsi_host/hostX/scan',
           where X is the host number.

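On an initiator with several SCSI hosts, the manual rescan above can be
applied to all of them in one go. This is a minimal sketch: the function
name and its optional directory argument are assumptions made for
illustration, while the scan file and the "- - -" wildcard are the ones
quoted above.

```shell
#!/bin/bash
# Hypothetical helper: ask every SCSI host to rescan all channels,
# targets and LUNs. $1 is an optional sysfs class directory (default
# /sys/class/scsi_host); writing to the real one requires root.
rescan_scsi_hosts() {
    local dir="${1:-/sys/class/scsi_host}" scan
    for scan in "$dir"/host*/scan; do
        [ -e "$scan" ] || continue      # glob matched nothing
        echo "- - -" > "$scan" || return 1
    done
    return 0
}

# Usage on the initiator, as root:
#   rescan_scsi_hosts
```
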
IMPORTANT: Running a target and an initiator on the same host is
=========  supported, except in the following 2 cases: swap over a
           target exported device and a writable mmap over a file from
           a target exported device. The latter means you can't mount a
           file system over a target exported device. In other words,
           you can freely use any sg, sd, st, etc. devices imported
           from a target on the same host, but you can't mount file
           systems or put swap on them. This is a limitation of the
           Linux memory/cache manager, because in this case a memory
           allocation deadlock is possible, like: the system needs some
           memory -> it decides to clear some cache -> the cache needs
           to be written to a target exported device -> the initiator
           sends a request to the target located on the same system ->
           the target needs memory -> the system needs even more memory
           -> deadlock.

IMPORTANT: In the current version simultaneous access to local SCSI
=========  devices via standard high-level SCSI drivers (sd, st, sg,
           etc.) and SCST's target drivers is unsupported. This is
           especially important for commands executed via sg and st
           that change the state of devices and their parameters,
           because that could lead to data corruption. If any such
           command is executed, at least the related device handler(s)
           must be restarted. For block devices, READ/WRITE commands
           using the direct disk handler are generally safe.

To uninstall, type 'make scst_uninstall'.


Creating a kernel patch or patched kernel
-----------------------------------------

You can use the generate-kernel-patch or generate-patched-kernel scripts
in the scripts/ subdirectory to convert the SCST source tree as it
exists in the Subversion repository into a Linux kernel patch, or to
generate a kernel source tree with the SCST patches applied,
respectively. This subdirectory exists only in the SVN tree.

An example of how to use generate-kernel-patch can be found in "How To
install SCST on Ubuntu 15.04 with in-tree kernel patches",
https://gist.github.com/chrwei/42f8bbb687290b04b598, thanks to Chris Weiss.


Migration from the obsolete proc interface
------------------------------------------

The sysfs-enabled scstadmin supports the old procfs config file format,
so you can use it to migrate your proc-based configuration to the sysfs
interface with the following steps:

1. Load the SCST modules.

2. Run "scstadmin -config old_config_file".

3. Run "scstadmin -write_config new_config_file".

4. Check new_config_file and make sure everything has been written
   properly.

5. Start using "scstadmin -config new_config_file" to configure SCST.


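Steps 2 and 3 above can be wrapped in a small function. This is only a
sketch: the function name and the SCSTADMIN override variable are
assumptions added so the commands can be dry-run; the two scstadmin
invocations are exactly the ones listed in the steps.

```shell
#!/bin/bash
# Sketch of migration steps 2 and 3. SCSTADMIN can be overridden, e.g.
# with "echo scstadmin" for a dry run; both file names are placeholders.
migrate_scst_config() {
    local old_cfg="$1" new_cfg="$2"
    ${SCSTADMIN:-scstadmin} -config "$old_cfg" &&
        ${SCSTADMIN:-scstadmin} -write_config "$new_cfg"
}

# Usage, after loading the SCST modules:
#   migrate_scst_config old_config_file new_config_file
```

The new file still needs the manual review from step 4 before it is put
into use.
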
Usage in failover mode
----------------------

In MPIO configurations it is recommended to use the TEST UNIT READY
("tur") command to check whether an SCST target is alive.


Device handlers
---------------

Device specific drivers (device handlers) are plugins for SCST that
help SCST analyze incoming requests and determine parameters specific
to various types of devices. If an appropriate device handler for a
SCSI device type isn't loaded, SCST doesn't know how to handle devices
of this type, so they will be invisible to remote initiators (more
precisely, the "LUN not supported" sense code will be returned).

In addition to device handlers for real devices, there are VDISK, user
space and "performance" device handlers.

The VDISK device handler works over files on file systems and turns
them into virtual remotely available SCSI disks or CDROMs. In addition,
it can work directly over a block device, e.g. a local IDE or SCSI disk
or even a disk partition, where there is no file system overhead. Using
block devices, compared to sending SCSI commands directly to the SCSI
mid-level via scsi_do_req()/scsi_execute_async(), has the advantage
that data is transferred via the system cache, so it is possible to
fully benefit from caching and read-ahead performed by Linux's VM
subsystem. The only disadvantage is that in FILEIO mode there is
superfluous data copying between the cache and SCST's buffers. This
issue is going to be addressed in one of the future releases. Virtual
CDROMs are useful for remote installation. See below for details on how
to set up and use the VDISK device handler.

The SCST user space device handler provides an interface between SCST
and user space, which allows creating pure user space devices. The
simplest example of where one would want it is writing a VTL: with
scst_user it can be written purely in user space. Another is processing
of the passed data that is too sophisticated for kernel space, like
encrypting it or making snapshots.

"Performance" device handlers for disks, MO disks and tapes skip
(pretend to execute) all READ and WRITE operations in their exec()
method and thus provide a way to measure link performance directly,
without the overhead of actual data transfer from/to the underlying
SCSI device.

NOTE: Since "perf" device handlers don't touch the commands' data
====  buffer on READ operations, it is returned to remote initiators as
      it was allocated, without even being zeroed. Thus, "perf" device
      handlers impose some security risk, so use them with caution.


Compilation options
-------------------

The following compilation options can be commented in or out in the
Makefile and scst.h:

 - CONFIG_SCST_DEBUG - if defined, turns on some debugging code,
   including some logging. Makes the driver considerably bigger and
   slower and produces a large amount of log data.

 - CONFIG_SCST_TRACING - if defined, turns on the ability to log
   events. Makes the driver considerably bigger and leads to some
   performance loss.

 - CONFIG_SCST_EXTRACHECKS - if defined, adds extra validity checks in
   various places.

 - CONFIG_SCST_USE_EXPECTED_VALUES - if not defined (default), the
   initiator-supplied expected data transfer length and direction are
   used only for verification, to return an error or warn if one of
   them is invalid, and the values decoded locally from the SCSI
   command are used instead. This is necessary for security reasons,
   because otherwise a faulty initiator can crash the target by
   supplying an invalid value in one of those parameters. This is
   especially important in pass-through mode. If
   CONFIG_SCST_USE_EXPECTED_VALUES is defined, the initiator-supplied
   expected data transfer length and direction override the locally
   decoded values. This might be necessary if the internal SCST command
   translation table doesn't contain a SCSI command that is used in
   your environment. You can tell that this is the case if you enable
   the "minor" trace level, see messages like "Unknown opcode XX for
   YY. Should you update scst_scsi_op_table?" in your kernel log, and
   your initiator returns an error. Please also report those messages
   to the SCST mailing list scst-devel@lists.sourceforge.net. Note that
   not all SCSI transports support supplying expected values. Try
   enabling this option if you have a pass-through device that does not
   work with SCST, for instance a SATA CDROM.

 - CONFIG_SCST_DEBUG_TM - if defined, turns on task management function
   debugging: on LUN 6 some of the commands will be delayed for about
   60 sec., making the remote initiator send TM functions, e.g. ABORT
   TASK and TARGET RESET. Also define the CONFIG_SCST_TM_DBG_GO_OFFLINE
   symbol in the Makefile if you want the device to eventually become
   completely unresponsive; otherwise it keeps cycling through the
   ABORT and RESET code. Needs CONFIG_SCST_DEBUG turned on.

 - CONFIG_SCST_DEBUG_SYSFS_EAGAIN - if defined, makes three out of four
   reads of sysfs attributes fail with -EAGAIN and also makes every
   sysfs write fail with -EAGAIN. This is useful for testing -EAGAIN
   handling in user space tools such as scstadmin. See also the
   documentation of the last_sysfs_mgmt_res sysfs attribute for more
   information.

 - CONFIG_SCST_STRICT_SERIALIZING - if defined, makes SCST send all
   commands to the underlying SCSI device synchronously, one after
   another. This makes task management more reliable, at the cost of
   some performance. This is mostly relevant for stateful SCSI devices
   like tapes, where the result of a command's execution depends on
   device settings defined by previous commands. Disk and RAID devices
   are stateless in most cases. The current SCSI core in Linux doesn't
   allow aborting all commands reliably if they are sent asynchronously
   to a stateful device. Turned off by default; turn it on if you use
   stateful device(s) and need as much error recovery reliability as
   possible. As a side effect of CONFIG_SCST_STRICT_SERIALIZING, on
   kernels below 2.6.30 no kernel patching is necessary for
   pass-through device handlers (scst_disk, etc.).

 - CONFIG_SCST_TEST_IO_IN_SIRQ - if defined, allows SCST to submit
   selected SCSI commands (TUR and READ/WRITE) from soft-IRQ context
   (tasklets). Enabling it decreases the number of context switches and
   slightly improves performance. The goal of this option is to make it
   possible to measure the overhead of the context switches. If, after
   enabling this option, you don't see a significant decrease in the
   number of context switches in vmstat output on the target under
   load, then your target driver doesn't submit commands to SCST in IRQ
   context. For instance, iSCSI-SCST doesn't do that, but qla2x00t with
   CONFIG_QLA_TGT_DEBUG_WORK_IN_THREAD disabled does. This option is
   designed to be used with the vdisk NULLIO backend.

   WARNING! Enabling this option with a backend other than vdisk
   NULLIO is unsafe and can lead to a kernel crash!

 - CONFIG_SCST_STRICT_SECURITY - if defined, makes SCST zero allocated
   data buffers. Undefining it (default) considerably improves
   performance and eases CPU load, but could create a security hole
   (information leakage), so enable it if you have strict security
   requirements.

 - CONFIG_SCST_ABORT_CONSIDER_FINISHED_TASKS_AS_NOT_EXISTING - if
   defined, then when the TASK MANAGEMENT function ABORT TASK tries to
   abort a command that has already finished, the remote initiator that
   sent the ABORT TASK request will receive a TASK NOT EXIST (or ABORT
   FAILED) response for it. This is the more logical response, since
   the command has finished and so the attempt to abort it failed, but
   some initiators, particularly the VMware iSCSI initiator, treat a
   TASK NOT EXIST response as a sign that the target has gone crazy and
   try to RESET it, and then sometimes go crazy themselves. So this
   option is disabled by default.

 - CONFIG_SCST_DIF_INJECT_CORRUPTED_TAGS - if defined, allows injection
   of corrupted DIF tags according to the Oracle specification. This
   functionality works only if dif_mode doesn't contain dev_store and
   dif_type is 1.

 - CONFIG_SCST_NO_TOTAL_MEM_CHECKS - disables checks of allocated
   memory, see scst_max_cmd_mem below. This avoids 2 global variables
   on the fast path and hence gives better multi-queue performance.

HIGHMEM kernel configurations are fully supported, but not recommended
for performance reasons, except for scst_user, where they are not
supported, because this module deals with user supplied memory in a
zero-copy manner. If you would otherwise need HIGHMEM enabled, consider
changing the VMSPLIT option or using a 64-bit system configuration
instead.

To change the VMSPLIT option (CONFIG_VMSPLIT to be precise), set the
following variables in "make menuconfig":

 - General setup->Configure standard kernel features (for small systems): ON

 - General setup->Prompt for development and/or incomplete code/drivers: ON

 - Processor type and features->High Memory Support: OFF

 - Processor type and features->Memory split: according to the amount
   of memory you have. If it is less than 800MB, you need not touch
   this option at all.


Module parameters
-----------------

Module scst supports the following parameters:

 - scst_threads - sets the number of SCST's threads. By default it is
   the CPU count.

 - scst_max_cmd_mem - sets the maximum amount of memory in MB allowed
   to be consumed by SCST commands for data buffers at any given time.
   By default it is approximately TotalMem/4.

 - scst_max_dev_cmd_mem - sets the maximum amount of memory in MB
   allowed to be consumed by all SCSI commands of a device at any given
   time. By default it is approximately 2/5 of scst_max_cmd_mem.

 - auto_cm_assignment - enables automatic registration of devices in
   the copy manager. If a device is not registered in the copy manager,
   it cannot be the source or target of EXTENDED COPY commands. Enabled
   by default. Disable it if you want to control the copy manager
   registration manually, or if you need to change a device, e.g. a DM
   cache device, that has an SCST LUN on top of it, to avoid the extra
   reference the copy manager holds on this device. In the latter case
   you can also remove this reference by manually deleting the
   corresponding copy manager LUN via the sysfs interface
   (/sys/kernel/scst_tgt/targets/copy_manager/copy_manager_tgt/luns/mgmt).


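The default memory limits described above amount to simple arithmetic,
sketched below. The function name is an assumption for illustration, and
the real module only promises "approximately" these values, so treat the
output as an estimate. The parameter values in the modprobe line are
arbitrary examples.

```shell
#!/bin/bash
# Approximate the documented defaults from total memory in MB.
# Prints "scst_max_cmd_mem scst_max_dev_cmd_mem" (both in MB), using
# integer arithmetic: TotalMem/4 and 2/5 of that.
scst_default_mem_limits() {
    local total_mb="$1" max_cmd_mem
    max_cmd_mem=$(( total_mb / 4 ))
    echo "$max_cmd_mem $(( max_cmd_mem * 2 / 5 ))"
}

# Example: estimate from /proc/meminfo (MemTotal is reported in kB),
# then load scst with explicit values instead of the defaults:
#   total_mb=$(( $(awk '/^MemTotal:/ {print $2}' /proc/meminfo) / 1024 ))
#   scst_default_mem_limits "$total_mb"
#   modprobe scst scst_threads=4 scst_max_cmd_mem=1024
```
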
SCST sysfs interface
--------------------

Starting from 2.0.0 SCST has a sysfs interface. It supports only
kernels 2.6.26 and higher, because in 2.6.26 the kernel's internal
sysfs interface had a major change, which made it heavily incompatible
with pre-2.6.26 versions.

The SCST sysfs interface is designed to be self-descriptive and
self-contained. This means that a high level management tool for it can
be written once and automatically support any future sysfs interface
changes (attribute additions or removals, new target drivers and dev
handlers, etc.) without any modifications. Scstadmin is an example of
such a management tool.

To achieve that, a management tool should not be built around
particular drivers and their attributes, but around the common rules
those drivers and attributes follow. You can find those rules in the
SysfsRules file. For instance, each SCST sysfs file (attribute) can
contain the mark "[key]" in its last line. It is added automatically to
let scstadmin and other management tools see which attributes they
should save in the config file. If you are manipulating attributes
manually, you can ignore this mark.

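As an illustration of a tool built on such a rule rather than on
specific attribute names, the sketch below lists the attributes in a
directory that carry the "[key]" mark. The function name is
hypothetical, and the sketch relies only on the rule quoted above (the
mark is the last line of the file).

```shell
#!/bin/bash
# Hypothetical helper: print the regular files directly under directory
# $1 whose last line is the "[key]" mark, i.e. the attributes a
# management tool would save. Unreadable files are skipped.
scst_list_key_attrs() {
    local dir="$1" f
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        if [ "$(tail -n 1 "$f" 2>/dev/null)" = "[key]" ]; then
            echo "$f"
        fi
    done
}

# Usage:
#   scst_list_key_attrs /sys/kernel/scst_tgt
```
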
The root of the SCST sysfs interface is /sys/kernel/scst_tgt. It has
the following entries:

 - devices - this is a root subdirectory for all SCST devices

 - handlers - this is a root subdirectory for all SCST dev handlers

 - max_tasklet_cmd - specifies the maximum number of commands from all
   connected initiators that can be queued in the SCST core
   simultaneously on a single CPU while still allowing commands on this
   CPU to be processed in soft-IRQ context in tasklets. If the number
   of commands exceeds this value, all of them will be processed only
   in SCST threads. This is to prevent possible starvation of processes
   under heavy load on the CPUs serving soft IRQs and, in some cases,
   to improve performance by spreading the load more evenly over the
   available CPUs.

 - measure_latency - whether or not to enable latency measurements.
   Enabling latency measurements has a small impact on performance but
   makes detailed information available about how much time is needed
   to process SCSI commands. The structure of the paths to files with
   latency information is as follows:

   /sys/kernel/scst_tgt/targets/${target_driver_name}/${target_port_name}/sessions/${initiator_name}/latency/${io_type}${io_size}

   ${io_type} is n, r, w or b. 'n' means that no data buffer was
   associated with the command, 'r' stands for read, 'w' for write
   and 'b' for bidirectional. ${io_size} is a power of two between 512
   and 524288. Each file contains statistics for I/O requests with a
   size up to ${io_size} that exceed the next smaller I/O size. The
   files for ${io_size} 524288 are an exception because these also
   include data for all larger requests.

   Here is an example of the data produced by this infrastructure
   (edited for clarity):

      $ echo 1 >/sys/kernel/scst_tgt/measure_latency
      $ sleep 10 # Wait until an initiator has submitted multiple I/O requests
      $ (cd /sys/kernel/scst_tgt/targets &&
         find -name latency | xargs grep -raH .)
      state               count    min     max    avg  stddev
      PARSE                 219    1.3    26.6    2.2     2.5 us
      PREPARE_SPACE         219    0.9    10.3    1.1     0.6 us
      RDY_TO_XFER           219    0.7     1.7    0.7     0.2 us
      TGT_PRE_EXEC          219    0.7    11.0    0.8     0.9 us
      EXEC_CHECK_SN         219    0.7     1.7    0.8     0.2 us
      PRE_DEV_DONE          219   11.3  3445.7   39.6   276.4 us
      DEV_DONE              219    0.7    11.0    0.9     0.7 us
      PRE_XMIT_RESP1        219    1.2    58.4    1.6     3.8 us
      CSW2                  219    0.7     1.6    0.8     0.1 us
      PRE_XMIT_RESP2        219    0.7     1.5    0.7     0.1 us
      XMIT_RESP             219    0.7     1.5    0.7     0.1 us
      INIT_WAIT             219    1.0    57.3    2.1     4.4 us
      INIT                  219    0.9    27.4    1.6     2.4 us
      CSW1                  219   15.0  3856.1   74.2   264.8 us
      EXEC_CHECK_BLOCKING   219    1.3    10.8    1.7     0.9 us
      LOCAL_EXEC            219    0.7     1.8    0.7     0.1 us
      REAL_EXEC             219    0.6     1.5    0.7     0.1 us
      EXEC_WAIT             219   40.6  1021.7   54.4    68.7 us
      XMIT_WAIT             219    6.4  1682.0   50.6   228.1 us
      total                 219      -       -  236.9  2012.1 us

   PRE_DEV_DONE refers to internal checks done after execution of a
   command has finished. CSW1 is the context switch that happens after
   the transport driver has received a command and before processing of
   the command starts. EXEC_WAIT is the time spent in the device
   handler .exec() method.

 - sgv - this is a root subdirectory for all SCST SGV caches

 - targets - this is a root subdirectory for all SCST targets

 - setup_id - allows reading and writing the SCST setup ID. This ID can
   be used when the same SCST configuration should be installed on
   several targets, but the devices exported from those targets should
   have different IDs and SNs. For instance, the VDISK dev handler uses
   this ID to generate the T10 vendor specific identifier and SN of the
   devices.

 - poll_us - if polling is desired, sets how many microseconds each
   SCST thread keeps polling its queue after it has become empty, in
   the hope that a new command will arrive. In some cases polling can
   significantly increase IOPS, especially if CPU low power states are
   not disabled, because at high IOPS polling can be cheaper than
   spending significant time on entering and then exiting CPU low power
   states plus the corresponding context switches. Disabled, i.e. set
   to 0, by default.

 - suspend - globally suspends or releases all SCSI activities on all
   devices. Useful for mass management, like adding or deleting LUNs.
   Writing a value v to it has the following effect:

     * v > 0 - suspends activities, but waits no more than v seconds

     * v = 0 - suspends activities, waits indefinitely

     * v < 0 - releases activities.

   Reading from this attribute returns the number of previous suspend
   requests.

 - threads - allows reading and setting the number of global SCST I/O
   threads. Those threads are used with asynchronous dev handlers, for
   instance, vdisk BLOCKIO or NULLIO.

 - trace_cmds - shows current SCST commands up to the size of the sysfs
   buffer (4KB)

 - trace_mcmds - shows current SCST management commands up to the size
   of the sysfs buffer (4KB)

 - trace_level - allows enabling and disabling various tracing
   facilities. See the content of this file for help on how to use it.
   See also the section "Dealing with massive logs" for more info on
   how to capture logs correctly when you have enabled trace levels
   that produce a lot of log data.

 - version - read-only attribute that shows the version of SCST and the
   enabled optional features.

 - force_global_sgv_pool - if not set, buffers for SCSI commands are
   allocated from the per-CPU SGV pool. Otherwise, the global SGV pool
   is used.

 - last_sysfs_mgmt_res - read-only attribute returning the completion
   status of the last management command. In the sysfs implementation
   there are some problems between internal sysfs locking and internal
   SCST locking. To avoid them, in some cases sysfs calls can return an
   error with errno EAGAIN. This doesn't mean the operation failed. It
   only means that the operation has been queued and has not yet
   completed. To wait for it to complete, a management tool should poll
   this file. If the operation hasn't completed yet, reading this file
   will also return EAGAIN. Once the operation has completed, it will
   return the result of the operation (0 for success or -errno for
   error). The following two shell functions show how to do this:

# Read the SCST sysfs attribute $1. See also scst/README for more information.
scst_sysfs_read() {
  local EAGAIN val

  EAGAIN="Resource temporarily unavailable"
  while true; do
    if val="$(LC_ALL=C cat "$1" 2>&1)"; then
      # Success: print the value without the trailing "[key]" mark.
      echo -n "${val%\[key\]}"
      return 0
    elif [ "${val/*: }" != "$EAGAIN" ]; then
      # Any error other than EAGAIN is final.
      return 1
    fi
    # EAGAIN: the operation is still pending, so retry.
    sleep 1
  done
}

# Write $1 into the SCST sysfs attribute $2. See also scst/README for more
# information.
scst_sysfs_write() {
  local EAGAIN status

  EAGAIN="Resource temporarily unavailable"
  if status="$(LC_ALL=C; (echo -n "$1" > "$2") 2>&1)"; then
    return 0
  elif [ "${status/*: }" != "$EAGAIN" ]; then
    return 1
  fi
  # EAGAIN: the write was queued, so wait until it has completed.
  scst_sysfs_read /sys/kernel/scst_tgt/last_sysfs_mgmt_res >/dev/null
}
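
The naming of the latency files described under measure_latency above
can be sketched as a small function that maps a request size to its
${io_size} bucket. The function name is hypothetical and the logic is
derived only from the description in this file, not from the SCST
sources.

```shell
#!/bin/bash
# Map an I/O size in bytes to the ${io_size} part of the latency file
# name: the smallest power of two >= the size, clamped to 512..524288.
scst_latency_bucket() {
    local size="$1" bucket=512
    while [ "$bucket" -lt "$size" ] && [ "$bucket" -lt 524288 ]; do
        bucket=$(( bucket * 2 ))
    done
    echo "$bucket"
}

# Example: the file name for 4 KiB reads within a session's latency
# directory would be built as
#   echo "r$(scst_latency_bucket 4096)"
```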

The "Devices" subdirectory contains a subdirectory for each SCST device.

The content of each device's subdirectory is dev handler specific. See
the documentation for your dev handlers for more info about it, as well
as the SysfsRules file for more info about the rules common to all dev
handlers. SCST devices can have the following common entries:

 - block - allows temporarily blocking and unblocking this device. See
   below.

 - exported - subdirectory containing links to all LUNs where this
   device was exported.

 - handler - if a dev handler has been determined for this device, this
   link points to it. The handler may be unset for pass-through
   devices.

 - threads_num - shows and allows setting the number of threads in this
   device's thread pool. If 0, no threads will be created and the
   global SCST thread pool will be used. If <0, creation of the thread
   pool is prohibited.

 - threads_pool_type - shows and allows setting the thread pool type.
   Possible values: "per_initiator" and "shared". When the value is
   "per_initiator" (default), each session from each initiator will use
   a separate dedicated pool of threads. When the value is "shared",
   all sessions from all initiators will share the same per-device pool
   of threads. Valid only if the threads_num attribute is >0.

 - dump_prs - allows dumping persistent reservation information to the
   kernel log.

 - pr_state - allows saving and restoring the complete Persistent
   Reservation state (registrants, active reservation, generation
   counter). On read, serialises the state to a text format. On write,
   restores the state into the device; this should be done before the
   device starts serving I/O.

 - type - SCSI type of this device

 - max_tgt_dev_commands - maximum number of SCSI commands any session
   to this device can have in flight.

 - numa_node_id - NUMA node id this device physically belongs to. SCST
   NUMA handling assumes that the NUMA memory allocation policy in use
   on the system is to always allocate from the current node.

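Saving and restoring Persistent Reservation state through pr_state could
look like the sketch below. The function names are hypothetical, and the
sketch assumes that the text read from pr_state can be written back
unchanged, which the description above suggests but which is not
verified here. It also does not handle the EAGAIN case covered by
scst_sysfs_read and scst_sysfs_write.

```shell
#!/bin/bash
# Hypothetical helpers around the pr_state attribute. $1 is the device's
# sysfs directory (a subdirectory of /sys/kernel/scst_tgt/devices), $2 a
# file of your choice that holds the saved state.
scst_save_pr_state() {
    cat "$1/pr_state" > "$2"
}

# Restore should happen before the device starts serving I/O.
scst_restore_pr_state() {
    cat "$2" > "$1/pr_state"
}

# Usage (device name and file path are placeholders):
#   scst_save_pr_state /sys/kernel/scst_tgt/devices/disk1 /var/tmp/disk1.pr
#   scst_restore_pr_state /sys/kernel/scst_tgt/devices/disk1 /var/tmp/disk1.pr
```
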
Attribute "block" allows to temporary block and unblock this device.
|
|
"Blocking" means that no new commands for this device will go into the
|
|
execution stage, but instead will be suspended just before it. The
|
|
blocked state is not reached until queue of the corresponding device is
|
|
completely drained. You can also call this state "frozen". It is useful
|
|
in many cases, like consistent snapshots and graceful shutdown.
|
|
|
|
On write "block" entry allows the following 3 types of parameters:
|
|
|
|
- 1 - block device synchronously, i.e. don't return until this device
|
|
becomes blocked, i.e. until queue of it is not completely drained. Can
|
|
be called as many times as needed.
|
|
|
|
- 11 params - block device asynchronously, i.e. return immediately.
|
|
Notification about completing is delivered using SCST_EVENT_EXT_BLOCKING_DONE
|
|
event. "Params" delivered to it as is in "data" payload. Can be
|
|
called as many times as needed. Alternatively, status of blocking could be
|
|
polled by reading this attribute until the second number reaches 0
|
|
(see below).
|
|
|
|
- 0 - unblock this device.
|
|
|
|
Reading from "block" entry returns two numbers separated by space:
|
|
|
|
1. How many times this device was blocked, i.e. how many times writing
|
|
"0" to it is needed to unblock this device.
|
|
|
|
2. Boolean (0 or 1) indicating whether blocking, if any, is done (0) or still pending (1).
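The asynchronous variant with polling can be sketched as follows. This
is a minimal illustration, assuming the "block" attribute sits under
/sys/kernel/scst_tgt/devices/<device_name> and that reading it yields
the two space-separated numbers described above. The function names and
SCST_ROOT are hypothetical.

```shell
SCST_ROOT="${SCST_ROOT:-/sys/kernel/scst_tgt}"

# Print "unblocked" if the block count of device $1 is 0, "pending" if a
# blocking request has not completed yet, and "blocked" otherwise.
scst_block_state() {
    _count=
    _pending=
    read -r _count _pending <"$SCST_ROOT/devices/$1/block" \
        || [ -n "$_count" ] || return 1
    if [ "$_count" -eq 0 ]; then
        echo unblocked
    elif [ "$_pending" -eq 1 ]; then
        echo pending
    else
        echo blocked
    fi
}

# Request asynchronous blocking of device $1, passing the free-form
# string $2 as "params", then poll once per second until it is done.
scst_block_async_wait() {
    echo "11 $2" >"$SCST_ROOT/devices/$1/block" || return 1
    while [ "$(scst_block_state "$1")" = pending ]; do
        sleep 1
    done
}
```

Every successful blocking request has to be balanced later by writing 0
to the same attribute.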
|
|
|
|
See below for more information about other entries of this subdirectory
|
|
of the standard SCST dev handlers.
|
|
|
|
"Handlers" subdirectory contains subdirectories for each SCST dev
|
|
handler.
|
|
|
|
Content of each handler's subdirectory is dev handler specific. See
|
|
documentation for your dev handlers for more info about it as well as
|
|
SysfsRules file for more info about common to all dev handlers rules.
|
|
SCST dev handlers can have the following common entries:
|
|
|
|
- mgmt - this entry allows to create virtual devices and their
|
|
attributes (for virtual devices dev handlers) or assign/unassign real
|
|
SCSI devices to/from this dev handler (for pass-through dev
|
|
handlers).
|
|
|
|
- trace_level - allows to enable and disable various tracing
|
|
facilities. See content of this file for help how to use it. See also
|
|
section "Dealing with massive logs" for more info how to make correct
|
|
logs when you enabled trace levels producing a lot of logs data.
|
|
|
|
- type - SCSI type of devices served by this dev handler.
|
|
|
|
See below for more information about other entries of this subdirectory
|
|
of the standard SCST dev handlers.
|
|
|
|
"Sgv" subdirectory contains statistic information of SCST SGV caches. It
|
|
has the following entries:
|
|
|
|
- None, one or more subdirectories for each existing SGV cache.
|
|
|
|
- global_stats - file containing global SGV caches statistics.
|
|
|
|
Each SGV cache's subdirectory has the following item:
|
|
|
|
- stats - file containing statistics for this SGV cache.
|
|
|
|
"Targets" subdirectory contains subdirectories for each SCST target.
|
|
|
|
Content of each target's subdirectory is target specific. See
|
|
documentation for your target for more info about it as well as
|
|
SysfsRules file for more info about common to all targets rules.
|
|
Every target should have at least the following entries:
|
|
|
|
- ini_groups - subdirectory, which contains and allows to define
|
|
initiator-oriented access control information, see below.
|
|
|
|
- luns - subdirectory, which contains list of available LUNs in the
|
|
target-oriented access control and allows to define it, see below.
|
|
|
|
- sessions - subdirectory containing the sessions connected to this target.
|
|
|
|
- comment - this attribute can be used to store any human readable info
|
|
to help identify target. For instance, to help identify the target's
|
|
mapping to the corresponding hardware port. It is not used by SCST in
any way.
|
|
|
|
- enabled - using this attribute you can enable or disable this target.
|
|
It allows to finish configuring it before it starts accepting new
|
|
connections. 0 by default.
|
|
|
|
- addr_method - the LUN addressing method in use. Possible values:
|
|
"Peripheral", "Flat" or "LUN". Most initiators work well with
|
|
Peripheral addressing method (default), but some (HP-UX, for instance)
|
|
may require the Flat method or the LUN method (e.g. IBM systems). This
|
|
attribute is also available in the initiators security groups, so you
|
|
can assign the addressing method on per-initiator basis. See also the
|
|
"Logical unit addressing (LUN)" section in SAM-5 for more information.
|
|
|
|
- black_hole - if set, all LUNs in the corresponding initiator group,
|
|
default target group in this case, start "swallowing" requests from
|
|
initiators. Possible values are:
|
|
|
|
* 0 - disable black hole mode
|
|
|
|
* 1 - immediately abort all coming SCSI commands, i.e. all SCSI commands
|
|
are dropped and TM requests return that they completed. It is
|
|
supposed to simulate lost front end responses.
|
|
|
|
* 2 - immediately abort all coming SCSI commands and drop all coming TM
|
|
commands. It is supposed to simulate logical target hang, when the
|
|
target stops responding, but on the HW/TCP connection level still
|
|
appears to be online.
|
|
|
|
* 3 - immediately abort all coming data transfer SCSI commands, i.e.
|
|
only data transfer SCSI commands are dropped, while commands like
|
|
INQUIRY and TEST UNIT READY pass well. It is supposed to simulate
|
|
flaky front end connectivity, when responses for small commands
|
|
pass well, but big data transfers fail.
|
|
|
|
* 4 - immediately abort all coming data transfer SCSI commands and
|
|
drop all coming TM commands. It is supposed to simulate really
|
|
flaky front end connectivity, when TM requests or responses are
|
|
also lost.
|
|
|
|
Modes 3 and 4 are the most evil ones, because they are not too well
|
|
handled by many initiator OS'es, including Linux, so they may never
|
|
recover from it.
|
|
|
|
Note that dropping TM commands, i.e. not sending responses to them, is
not implemented for all target drivers. To find out whether it is
implemented for your particular target driver, check the traces or
the target driver's source code.
|
|
|
|
- dif_capabilities - if this target supports T10-PI, returns which
|
|
exact DIF capabilities this target supports.
|
|
|
|
- dif_checks_failed - if this target supports T10-PI, returns
|
|
statistics how many DIF errors have been detected on the
|
|
corresponding processing stages on this target. It returns 3 rows of
|
|
numbers with 3 numbers in each row: for target driver stage, for SCST
|
|
stage and for dev handler stage. Numbers in each row: how many errors
|
|
detected checking application, reference and guard tags
|
|
correspondingly. Writing to this attribute resets the numbers.
|
|
|
|
- cpu_mask - defines CPU affinity mask for threads serving this target.
|
|
For threads serving LUNs it is used only for devices with
|
|
threads_pool_type "per_initiator".
|
|
|
|
- io_grouping_type - defines how I/O from sessions to this target are
|
|
grouped together. This I/O grouping is very important for
|
|
performance. By setting this attribute to the right value, you can
|
|
considerably increase performance of your setup. This grouping is
|
|
performed only if you use CFQ I/O scheduler on the target and for
|
|
devices with threads_num >= 0 and, if threads_num > 0, with
|
|
threads_pool_type "per_initiator". Possible values:
|
|
"this_group_only", "never", "auto", or I/O group number >0. When the
|
|
value is "this_group_only" all I/O from all sessions in this target
|
|
will be grouped together. When the value is "never", I/O from
|
|
different sessions will not be grouped together, i.e. all sessions in
|
|
this target will have separate dedicated I/O groups. When the value
|
|
is "auto" (default), all I/O from initiators with the same name
|
|
(iSCSI initiator name, for instance) in all targets will be grouped
|
|
together with a separate dedicated I/O group for each initiator name.
|
|
For iSCSI this mode works well, but other transports usually use
|
|
different initiator names for different sessions, so using such
|
|
transports in MPIO configurations you should either use value
|
|
"this_group_only", or an explicit I/O group number. This attribute is
|
|
also available in the initiators security groups, so you can assign
|
|
the I/O grouping on per-initiator basis. See below for more info how
|
|
to use this attribute.
|
|
|
|
- rel_tgt_id - allows to read or write SCSI Relative Target Port
|
|
Identifier attribute. This identifier is used to identify SCSI Target
|
|
Ports by some SCSI commands, mainly by Persistent Reservations
|
|
commands. This identifier must be unique among all SCST targets, but
|
|
for convenience SCST allows disabled targets to have a non-unique
rel_tgt_id. In this case SCST will not allow this target to be enabled
until rel_tgt_id becomes unique. By default, SCST initializes this
attribute to a unique value.
|
|
|
|
- forward_src - if set this target port is a forwarding source. This means
|
|
that commands like COMPARE AND WRITE, EXTENDED COPY and RECEIVE COPY
|
|
RESULTS are submitted to the SCSI device instead of being handled inside
|
|
the SCST core. PERSISTENT RESERVE IN and OUT commands are processed by the
|
|
SCST core, whether or not this mode is enabled. The name 'forward_src'
|
|
refers to the use case where SCSI passthrough is used to send SCSI commands
|
|
to another H.A. node.
|
|
|
|
- forward_dst - if set this target port is a forwarding destination. This means
|
|
that it does not check any local SCSI events (reservations, etc.). Those
|
|
events are supposed to be checked at the forwarding source side.
|
|
|
|
- forwarding - obsolete synonym for forward_dst.
|
|
|
|
- *count*, e.g. read_io_count_kb, - statistics about executed
|
|
commands and transferred data. Those attributes have speaking names
|
|
built from parts:
|
|
|
|
1. Data transfer direction
|
|
|
|
2. Alignment type: not specified or unaligned (on 4K boundaries)
|
|
|
|
3. Type: IO (commands) count or amount of transferred data
|
|
|
|
4. For transferred data: measurement units
|
|
|
|
For instance, read_unaligned_cmd_count is the number of read commands
that were not aligned on 4K boundaries.
|
|
|
|
- aen_disabled - if set this target port is not to send AEN (Asynchronous
|
|
Event Notification), but rather generate a Unit Attention - even if the
|
|
underlying transport does support AEN.
|
|
|
|
This could prove useful in different situations including when the target
|
|
is also a forward_dst.
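As an illustration of the black_hole attribute described above, the
following sketch switches a target (or security group) into one of the
fault-injection modes for a limited time and then back to normal. It
assumes the attribute accepts the plain mode number on write. The
function name is hypothetical.

```shell
# Enable black hole mode $2 (1..4) in the target or security group
# directory $1 for $3 seconds, then disable it again by writing 0.
scst_black_hole_for() {
    case "$2" in
        1|2|3|4) ;;
        *) echo "mode must be 1..4" >&2; return 1 ;;
    esac
    echo "$2" >"$1/black_hole" || return 1
    sleep "$3"
    echo 0 >"$1/black_hole"
}
```

For instance, "scst_black_hole_for
/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.vlnb:tgt1 3 30" would
drop data transfer commands for 30 seconds. Keep in mind the warning
above that initiators may not recover from modes 3 and 4.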
|
|
|
|
A target driver may also have the following entries:
|
|
|
|
- "hw_target" - if the target driver supports both hardware and virtual
|
|
targets (for instance, an FC adapter supporting NPIV, which has
|
|
hardware targets for its physical ports as well as virtual NPIV
|
|
targets), this read only attribute for all hardware targets will
|
|
exist and contain value 1.
|
|
|
|
Subdirectory "sessions" contains one subdirectory for each connected
|
|
session with name equal to name of the connected initiator with the
|
|
following entries:
|
|
|
|
- initiator_name - contains initiator name
|
|
|
|
- force_close - optional write-only attribute, which allows to force
|
|
close this session.
|
|
|
|
- active_commands - contains number of active, i.e. not yet or being
|
|
executed, SCSI commands in this session.
|
|
|
|
- commands - contains overall number of SCSI commands in this session.
|
|
|
|
- dif_checks_failed - if target of this session supports T10-PI, returns
|
|
statistics how many DIF errors have been detected on the
|
|
corresponding processing stages on all DIF-enabled LUNs in this
|
|
session. It returns 3 rows of numbers with 3 numbers in each row: for
|
|
target driver stage, for SCST stage and for dev handler stage.
|
|
Numbers in each row: how many errors detected checking application,
|
|
reference and guard tags correspondingly. Writing to this attribute
|
|
resets the numbers. Similar statistics returned in attribute with the
|
|
same name for each LUN in this session in this LUN's subdirectory, if
|
|
its device configured with dif_type > 0.
|
|
|
|
- read_cmd_count - number of READ SCSI commands received since beginning
|
|
or last reset (writing 0 in this attribute)
|
|
|
|
- read_io_count_kb - amount of data in KB read by the initiator since
|
|
beginning or last reset (writing 0 in this attribute)
|
|
|
|
- write_cmd_count - number of WRITE SCSI commands received since
|
|
beginning or last reset (writing 0 in this attribute)
|
|
|
|
- write_io_count_kb - amount of data in KB written by the initiator
|
|
since beginning or last reset (writing 0 in this attribute)
|
|
|
|
- bidi_cmd_count - number of BIDI SCSI commands received since
|
|
beginning or last reset (writing 0 in this attribute)
|
|
|
|
- bidi_io_count_kb - amount of data in KB transferred by the
|
|
initiator since beginning or last reset (writing 0 in this attribute)
|
|
|
|
- none_cmd_count - number of SCSI commands that transfer no data
|
|
(e.g. INQUIRY or TEST UNIT READY) received since beginning or last
|
|
reset (writing 0 in this attribute)
|
|
|
|
- unknown_cmd_count - number of unknown SCSI commands received since
|
|
beginning or last reset (writing 0 in this attribute)
|
|
|
|
- *count*, e.g. read_io_count_kb, - statistics about executed
|
|
commands and transferred data. See above for more details.
|
|
|
|
- luns - a link pointing to the LUN set (security group) this session
is attached to.
|
|
|
|
- One or more "lunX" subdirectories, where 'X' is a number, for each LUN
|
|
this session has (see below).
|
|
|
|
- other target driver specific attributes and subdirectories.
|
|
|
|
See below description of the VDISK's sysfs interface for samples.
|
|
|
|
|
|
Each sessions/<sess>/lun<X> subdirectory contains the following entries:
|
|
|
|
- active_commands - contains number of active, i.e. not yet or being
|
|
executed, SCSI commands for lun<X> in session <sess>.
|
|
|
|
- thread_pid - contains a single line with all the process identifiers
|
|
(PIDs) of the kernel threads that process SCSI commands intended for
|
|
lun<X> in session <sess>.
|
|
|
|
- thread_index - thread index assigned by scst_add_threads().
|
|
Can be used to look up which export thread is serving which target
|
|
since this index also appears in the export thread name. This
|
|
information then could be used to set CPU affinity for those threads
|
|
to improve performance. Has a value in the range 0..n-1 for
|
|
threads_pool_type per_initiator or -1 when using a shared thread pool
|
|
per LUN or the global thread pool.
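Setting CPU affinity from the thread_pid attribute could look like the
sketch below. It assumes that the PIDs in thread_pid are separated by
whitespace and that the taskset utility from util-linux is installed.
The function name and the TASKSET override are hypothetical.

```shell
# Pin every kernel thread listed in the thread_pid file $1 to the CPU
# list $2 (taskset syntax, e.g. "2-3"). Set TASKSET="echo taskset" for a
# dry run that only prints the commands instead of executing them.
scst_pin_lun_threads() {
    [ -r "$1" ] || return 1
    for pid in $(cat "$1"); do
        ${TASKSET:-taskset} -pc "$2" "$pid" || return 1
    done
}
```

The first argument is the full path of a
sessions/<sess>/lun<X>/thread_pid file.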
|
|
|
|
|
|
Access and devices visibility management (LUN masking)
|
|
------------------------------------------------------
|
|
|
|
Access and devices visibility management allows for an initiator or
|
|
group of initiators to see different devices with different LUNs
|
|
with necessary access permissions.
|
|
|
|
SCST supports two modes of access control:
|
|
|
|
1. Target-oriented. In this mode you define for each target a default
|
|
set of LUNs, which are accessible to all initiators, connected to that
|
|
target. This is a regular access control mode, which people usually mean
|
|
thinking about access control in general. For instance, in IET this is
|
|
the only supported mode.
|
|
|
|
2. Initiator-oriented. In this mode you define which LUNs are accessible
|
|
for each initiator. In this mode you should create for each set of one
|
|
or more initiators, which should access the same set of devices with
|
|
the same LUNs, a separate security group, then add to it devices and
|
|
names of allowed initiator(s).
|
|
|
|
Both modes can be used simultaneously. In this case the
|
|
initiator-oriented mode has higher priority than the target-oriented one,
|
|
i.e. initiators are at first searched in all defined security groups for
|
|
this target and, if none matches, the default target's set of LUNs is
|
|
used. This set of LUNs might be empty, then the initiator will not see
|
|
any LUNs from the target.
|
|
|
|
You can at any time find out which set of LUNs each session is assigned
|
|
to by looking where link
|
|
/sys/kernel/scst_tgt/targets/target_driver/target_name/sessions/initiator_name/luns
|
|
points to.
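Resolving that link can be wrapped in a small helper. This is a sketch
only; the function name and SCST_ROOT are hypothetical, and it relies on
"readlink -f" as provided by GNU coreutils.

```shell
SCST_ROOT="${SCST_ROOT:-/sys/kernel/scst_tgt}"

# Print the directory of the LUN set (the target's default "luns" or a
# security group's "luns") that the session of initiator $3 on target $2
# of target driver $1 is attached to.
scst_session_lun_set() {
    readlink -f "$SCST_ROOT/targets/$1/$2/sessions/$3/luns"
}
```

A result ending in ini_groups/<group>/luns means the session matched a
security group. A result ending in <target_name>/luns means the default
target's set of LUNs is used.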
|
|
|
|
To configure the target-oriented access control SCST provides the
|
|
following interface. Each target's sysfs subdirectory
|
|
(/sys/kernel/scst_tgt/targets/target_driver/target_name) has "luns"
|
|
subdirectory. This subdirectory contains the list of already defined
|
|
target-oriented access control LUNs for this target as well as file
|
|
"mgmt". This file has the following commands, which you can send to it,
|
|
for instance, using "echo" shell command. You can always get a small
|
|
help about supported commands by looking inside this file. "Parameters"
|
|
are one or more param_name=value pairs separated by ';'.
|
|
|
|
- "add H:C:I:L lun [parameters]" - adds a pass-through device with
|
|
host:channel:id:lun with LUN "lun". Optionally, the device could be
|
|
marked as read only by using parameter "read_only". The recommended
|
|
way to find out H:C:I:L numbers is to use the lsscsi utility.
|
|
|
|
- "replace H:C:I:L lun [parameters]" - replaces the existing device at
LUN "lun" with the pass-through device host:channel:id:lun and
generates an INQUIRY DATA HAS CHANGED Unit Attention. If the old
device doesn't exist, this command acts as the "add" command.
Optionally, the device could be marked as read only by using
parameter "read_only". The recommended way to find out H:C:I:L
numbers is to use the lsscsi utility.
|
|
|
|
- "add VNAME lun [parameters]" - adds a virtual device with name VNAME
|
|
with LUN "lun". Optionally, the device could be marked as read only
|
|
by using parameter "read_only".
|
|
|
|
- "replace VNAME lun [parameters]" - replaces the existing device at
LUN "lun" with the virtual device named VNAME and generates an
INQUIRY DATA HAS CHANGED Unit Attention. If the old device doesn't
exist, this command acts as the "add" command. Optionally, the device
could be marked as read only by using parameter "read_only".
|
|
|
|
- "del lun" - deletes LUN lun
|
|
|
|
- "clear" - clears the list of devices
|
|
|
|
To configure the initiator-oriented access control SCST provides the
|
|
following interface. Each target's sysfs subdirectory
|
|
(/sys/kernel/scst_tgt/targets/target_driver/target_name) has "ini_groups"
|
|
subdirectory. This subdirectory contains the list of already defined
|
|
security groups for this target as well as file "mgmt". This file has
|
|
the following commands, which you can send to it, for instance, using
|
|
"echo" shell command. You can always get a small help about supported
|
|
commands by looking inside this file.
|
|
|
|
- "create GROUP_NAME" - creates a new security group.
|
|
|
|
- "del GROUP_NAME" - deletes an existing security group.
|
|
|
|
Each security group's subdirectory contains 2 subdirectories, initiators
and luns, as well as the following attributes: addr_method, cpu_mask,
io_grouping_type and black_hole. See the description of them above.
|
|
|
|
Each "initiators" subdirectory contains the list of initiators added to
this group as well as file "mgmt". This file has the following
|
|
commands, which you can send to it, for instance, using "echo" shell
|
|
command. You can always get a small help about supported commands by
|
|
looking inside this file.
|
|
|
|
- "add INITIATOR_NAME" - adds initiator with name INITIATOR_NAME to the
|
|
group.
|
|
|
|
- "del INITIATOR_NAME" - deletes initiator with name INITIATOR_NAME
|
|
from the group.
|
|
|
|
- "move INITIATOR_NAME DEST_GROUP_NAME" moves initiator with name
|
|
INITIATOR_NAME from the current group to group with name
|
|
DEST_GROUP_NAME.
|
|
|
|
- "clear" - deletes all initiators from this group.
|
|
|
|
For "add" and "del" commands INITIATOR_NAME can be a simple DOS-type
pattern containing '*' and '?' symbols. '*' matches any sequence of
symbols, '?' matches exactly one symbol. For instance, "blah.xxx"
matches "bl?h.*". Additionally, you can use the negation sign '!' to
invert the result of the pattern. For instance, "ah.xxx" matches
"!bl?h.*".
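To try a pattern before adding it to a group, the matching rules can be
approximated in the shell. This sketch is not SCST's own matcher: it
uses "case" globbing, which gives '*' and '?' the same meaning but also
treats '[' specially. The function name is hypothetical.

```shell
# Return 0 if initiator name $1 matches pattern $2, non-zero otherwise.
# '*' matches any run of characters, '?' exactly one character, and a
# leading '!' inverts the result.
ini_pattern_match() {
    case "$2" in
        '!'*)
            # Strip the leading '!' and invert the outcome.
            if ini_pattern_match "$1" "${2#!}"; then
                return 1
            else
                return 0
            fi
            ;;
    esac
    # $2 is deliberately unquoted so that it is used as a glob pattern.
    case "$1" in
        $2) return 0 ;;
        *) return 1 ;;
    esac
}
```

For instance, ini_pattern_match "blah.xxx" "bl?h.*" succeeds, and so
does ini_pattern_match "ah.xxx" '!bl?h.*'.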
|
|
|
|
Each "luns" subdirectory contains the list of already defined LUNs for
|
|
this group as well as file "mgmt". Content of this file as well as list
|
|
of available in it commands is fully identical to the "luns"
|
|
subdirectory of the target-oriented access control.
|
|
|
|
Examples:
|
|
|
|
- echo "create INI" >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.vlnb:tgt1/ini_groups/mgmt -
|
|
creates security group INI for target iqn.2006-10.net.vlnb:tgt1.
|
|
|
|
- echo "add 2:0:1:0 11" >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.vlnb:tgt1/ini_groups/INI/luns/mgmt -
|
|
adds a pass-through device sitting on host 2, channel 0, ID 1, LUN 0
|
|
to group with name INI as LUN 11.
|
|
|
|
- echo "add disk1 0" >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.vlnb:tgt1/ini_groups/INI/luns/mgmt -
|
|
adds a virtual disk with name disk1 to group with name INI as LUN 0.
|
|
|
|
- echo "add 21:*:e0:?b:83:*" >/sys/kernel/scst_tgt/targets/21:00:00:a0:8c:54:52:12/ini_groups/INI/initiators/mgmt -
|
|
adds a pattern to group with name INI to Fibre Channel target with
|
|
WWN 21:00:00:a0:8c:54:52:12, which matches WWNs of Fibre Channel
|
|
initiator ports.
|
|
|
|
Consider you need to have an iSCSI target with name
|
|
"iqn.2007-05.com.example:storage.disk1.sys1.xyz", which should export
|
|
virtual device "dev1" with LUN 0 and virtual device "dev2" with LUN 1,
|
|
but initiator with name
|
|
"iqn.2007-05.com.example:storage.disk1.spec_ini.xyz" should see only
|
|
virtual device "dev2" read only with LUN 0. To achieve that you should
|
|
run the following commands:
|
|
|
|
# echo "add_target iqn.2007-05.com.example:storage.disk1.sys1.xyz" >/sys/kernel/scst_tgt/targets/iscsi/mgmt
|
|
# echo "add dev1 0" >/sys/kernel/scst_tgt/targets/iscsi/iqn.2007-05.com.example:storage.disk1.sys1.xyz/luns/mgmt
|
|
# echo "add dev2 1" >/sys/kernel/scst_tgt/targets/iscsi/iqn.2007-05.com.example:storage.disk1.sys1.xyz/luns/mgmt
|
|
# echo "create SPEC_INI" >/sys/kernel/scst_tgt/targets/iscsi/iqn.2007-05.com.example:storage.disk1.sys1.xyz/ini_groups/mgmt
|
|
# echo "add dev2 0 read_only=1" \
|
|
>/sys/kernel/scst_tgt/targets/iscsi/iqn.2007-05.com.example:storage.disk1.sys1.xyz/ini_groups/SPEC_INI/luns/mgmt
|
|
# echo "add iqn.2007-05.com.example:storage.disk1.spec_ini.xyz" \
|
|
>/sys/kernel/scst_tgt/targets/iscsi/iqn.2007-05.com.example:storage.disk1.sys1.xyz/ini_groups/SPEC_INI/initiators/mgmt
|
|
|
|
For Fibre Channel or SAS in the above example you should use target's
|
|
and initiator ports WWNs instead of iSCSI names.
|
|
|
|
It is highly recommended to use the scstadmin utility instead of the
low-level interface described in this section.
|
|
|
|
IMPORTANT
|
|
=========
|
|
|
|
There must be LUN 0 in each set of LUNs, i.e. LU numbering must not
|
|
start from, e.g., 1. Otherwise you will see no devices on remote
|
|
initiators and SCST core will write into the kernel log message: "tgt_dev
|
|
for LUN 0 not found, command to unexisting LU?"
|
|
|
|
IMPORTANT
|
|
=========
|
|
|
|
All the access control must be fully configured BEFORE the corresponding
|
|
target is enabled. When you enable a target, it will immediately start
|
|
accepting new connections, hence creating new sessions, and those new
|
|
sessions will be assigned to security groups according to the
|
|
*currently* configured access control settings. For instance, they
will be assigned to the default target's set of LUNs instead of the
"HOST004" group you may need, because "HOST004" doesn't exist yet. So,
you must configure all the security groups before new connections from
the initiators are created, i.e. before the target is enabled.
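One way to guard against enabling a target too early is to check for the
expected security groups first. The helper below is a sketch with a
hypothetical name. It only relies on the ini_groups subdirectory and the
"enabled" attribute described above.

```shell
# Enable the target whose sysfs directory is $1, but only if every
# security group named in the remaining arguments already exists.
scst_enable_target_checked() {
    tgt="$1"
    shift
    for grp in "$@"; do
        if [ ! -d "$tgt/ini_groups/$grp" ]; then
            echo "missing security group: $grp" >&2
            return 1
        fi
    done
    echo 1 >"$tgt/enabled"
}
```

For instance, "scst_enable_target_checked
/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.vlnb:tgt1 HOST004"
refuses to enable the target while group HOST004 is still missing.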
|
|
|
|
|
|
VDISK device handler
|
|
--------------------
|
|
|
|
Starting from 2.0.0 VDISK device handler uses sysfs interface.
|
|
|
|
VDISK has 4 built-in dev handlers: vdisk_fileio, vdisk_blockio,
|
|
vdisk_nullio and vcdrom. Roots of their sysfs interface are
|
|
/sys/kernel/scst_tgt/handlers/handler_name, e.g. for vdisk_fileio:
|
|
/sys/kernel/scst_tgt/handlers/vdisk_fileio. Each root has the following
|
|
entries:
|
|
|
|
- None, one or more links to devices with name equal to names
|
|
of the corresponding devices.
|
|
|
|
- trace_level - allows to enable and disable various tracing
|
|
facilities. See content of this file for help how to use it. See also
|
|
section "Dealing with massive logs" for more info how to make correct
|
|
logs when you enabled trace levels producing a lot of logs data.
|
|
|
|
- mgmt - main management entry, which allows to add/delete VDISK
|
|
devices with the corresponding type.
|
|
|
|
The "mgmt" file has the following commands, which you can send to it,
|
|
for instance, using "echo" shell command. You can always get a small
|
|
help about supported commands by looking inside this file. "Parameters"
|
|
are one or more param_name=value pairs separated by ';'.
|
|
|
|
- echo "add_device device_name [parameters]" - adds a virtual device
|
|
with name device_name and specified parameters (see below)
|
|
|
|
- echo "del_device device_name" - deletes a virtual device with name
|
|
device_name.
|
|
|
|
Handler vdisk_fileio provides FILEIO mode to create virtual devices.
|
|
This mode uses files as the backend and accesses them using regular
read()/write() file calls. This allows using the full power of the Linux
page cache. The following parameters are possible for vdisk_fileio:
|
|
|
|
- filename - specifies path and file name of the backend file. The path
|
|
must be absolute.
|
|
|
|
- blocksize - specifies block size used by this virtual device. The
|
|
block size must be power of 2 and >= 512 bytes. Default is 512.
|
|
|
|
- opt_trans_len - specifies the optimal transfer length data in the block
|
|
limits VPD page. Value is in bytes, and must be a multiple of the block
|
|
size. Default is 524288. Setting this parameter to a multiple of the
|
|
block size that is less than 4194304 (4 MB) may improve performance.
|
|
Setting this parameter to a value greater than 4194304 hurts performance
|
|
because the SGV cache only supports buffers up to 4 MB.
|
|
|
|
- write_through - disables write back caching. Note, this option
|
|
makes sense only if you also *manually* disable write-back cache in
|
|
*all* your backstorage devices and make sure it's actually disabled,
|
|
since many devices are known to lie about this mode to get better
|
|
benchmark results. Default is 0.
|
|
|
|
- read_only - read only. Default is 0.
|
|
|
|
- async - submit I/O asynchronously to the device handler. This mode
|
|
allows concurrent processing of SCSI commands even when using only
|
|
a single SCST command thread. This mode is only supported for kernel
|
|
version 4.1 and later. RHEL 8 is the first RHEL version that supports
|
|
in-kernel asynchronous file I/O.
|
|
|
|
- o_direct - disables both read and write caching if asynchronous
|
|
I/O is used. This mode bypasses the page cache and hence improves
|
|
performance.
|
|
|
|
- nv_cache - enables "non-volatile cache" mode. In this mode it is
|
|
assumed that the target has a GOOD UPS with ability to cleanly
|
|
shutdown target in case of power failure and it is software/hardware
|
|
bugs free, i.e. all data from the target's cache are guaranteed
|
|
sooner or later to go to the media. Hence all data synchronization
|
|
with media operations, like SYNCHRONIZE_CACHE, are ignored in order
|
|
to bring more performance. Also in this mode target reports to
|
|
initiators that the corresponding device has write-through cache to
|
|
disable all write-back cache workarounds used by initiators. Use with
|
|
extreme caution, since in this mode after a crash of the target
|
|
journaled file systems don't guarantee the consistency after journal
|
|
recovery, therefore manual fsck MUST be run. Note that since usually
|
|
the journal barrier protection (see "IMPORTANT" note below) is turned
|
|
off, enabling NV_CACHE could change nothing from data protection
|
|
point of view, since no data synchronization with media operations
|
|
will go from the initiator. This option overrides "write_through"
|
|
option. Disabled by default.
|
|
|
|
- thin_provisioned - enables thin provisioning facility, when remote
|
|
initiators can unmap blocks of storage, if they don't need them
|
|
anymore. Backend storage also must support this facility.
|
|
|
|
- tst - allows to specify TST control mode page field. It specifies
|
|
the type of task set in the device. Possible values are: 0 - the
|
|
device maintains one task set for all I_T nexuses and 1 - the device
|
|
maintains separate task sets for each I_T nexus. Default - 1.
|
|
|
|
- removable - with this flag set the device is reported to remote
|
|
initiators as removable.
|
|
|
|
- rotational - if set, this device is reported as rotational. Otherwise,
|
|
it is reported as non-rotational (SSD, etc.)
|
|
|
|
- zero_copy - obsolete. For zero-copy I/O, set the async flag and
|
|
possibly also the o_direct flag and use Linux kernel v4.10 or later.
|
|
|
|
- dif_mode - specifies which T10-PI, or DIF, mode this device will use.
|
|
See SCSI standards for more info about T10-PI. Available DIF modes
|
|
(can be combined using '|'):
|
|
|
|
* tgt - DIF tags are checked on the target hardware, if supported
|
|
|
|
* scst - DIF tags are checked inside SCST core
|
|
|
|
* dev_check - DIF tags are checked inside backend device. No DIF
|
|
tags storing is required, but optionally possible.
|
|
|
|
* dev_store - DIF tags are stored inside backend device on the WRITE
|
|
path and read from it on the READ path. No DIF tags checking is
|
|
required, but optionally possible.
|
|
|
|
For instance, if only tgt DIF mode is specified, then the target driver
serving this device will check DIF tags in hardware and STRIP them
from SCSI commands on the WRITE path, and generate and INSERT DIF
tags into SCSI commands on the READ path, so neither SCST core nor
the dev handler will see them.
|
|
|
|
Similarly, if only scst DIF mode specified, then target driver will
|
|
PASS DIF tags into SCST core, which then check/STRIP/generate/INSERT
|
|
them, so dev handler will not see them.
|
|
|
|
If only dev_check DIF mode specified, then both target driver and
|
|
SCST core will PASS DIF tags into the dev handler, which is then
|
|
responsible to check them in the backend hardware. If only dev_store
|
|
specified, then DIF tags will only be stored by the dev handler in
|
|
the backend hardware without checking at any level.
|
|
|
|
If the full "tgt|scst|dev_check|dev_store" DIF mode is specified, then
the target driver, SCST core and dev handler will all check DIF tags,
and the dev handler will then store them in the backend hardware.
|
|
|
|
- dif_type - specifies which DIF SCSI type this device will use.
|
|
|
|
- dif_static_app_tag - specifies fixed (static) DIF application tag for
|
|
this device.
|
|
|
|
- dif_filename - specifies full path to filename, where DIF tags will
|
|
be stored.
|
|
|
|
- lb_per_pb_exp - allows READ CAPACITY 16 to return LOGICAL BLOCKS
|
|
PER PHYSICAL BLOCK EXPONENT. Possible values are: 0 - the exponent
|
|
is not returned (zero is returned instead) and 1 - the device value is
|
|
returned. Default - 1. There are some initiators (like MS SQL) that
|
|
do not like large physical block sizes, even if they are true.
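The numeric constraints on blocksize and opt_trans_len given above can
be checked before a device is created. The functions below are a sketch
with hypothetical names. They only encode the rules stated in this
section, and a value of exactly 4194304 is treated as acceptable.

```shell
# Return 0 if $1 is a valid block size: a power of 2 and >= 512 bytes.
vdisk_blocksize_ok() {
    [ "$1" -ge 512 ] && [ $(( $1 & ($1 - 1) )) -eq 0 ]
}

# Classify opt_trans_len $2 (in bytes) for block size $1:
#   invalid - not a positive multiple of the block size
#   slow    - valid, but above the 4 MB SGV cache buffer limit
#   ok      - valid and within the limit
vdisk_check_opt_trans_len() {
    if [ "$2" -le 0 ] || [ $(( $2 % $1 )) -ne 0 ]; then
        echo invalid
    elif [ "$2" -gt 4194304 ]; then
        echo slow
    else
        echo ok
    fi
}
```

For instance, the default opt_trans_len of 524288 is a multiple of the
default block size of 512 and stays below the 4 MB limit.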
|
|
|
|
Handler vdisk_blockio provides BLOCKIO mode to create virtual devices.
|
|
This mode performs direct block I/O with a block device, bypassing the
|
|
page cache for all operations. This mode works ideally with high-end
|
|
storage HBAs and for applications that either do not need caching
|
|
between application and disk or need the large block throughput. See
|
|
below for more info.
|
|
|
|
The following parameters are possible for vdisk_blockio: filename,
|
|
blocksize, nv_cache, read_only, removable, rotational, thin_provisioned,
|
|
tst, dif_mode, dif_type, dif_static_app_tag, dif_filename. See
|
|
vdisk_fileio above for description of those parameters.
|
|
|
|
vdisk_blockio devices have the following two additional attributes:
|
|
|
|
- active - if this flag is set (the default), the backing block device
|
|
will be opened when the SCST device is added/opened. If an SCST device
|
|
is opened with active=0 then the backing block device will not be
|
|
opened, allowing for an active/passive SCST configuration. In addition,
|
|
this attribute is writable via sysfs allowing the user to open/close the
|
|
backing block device on the fly, or via a script.
|
|
|
|
- bind_alua_state - if this flag is set (the default), when the device is
|
|
associated with an ALUA device group, and a target group ALUA state
|
|
changes to the active/nonoptimized state, the active attribute will be
|
|
set to 1 which attempts to open the backing block device. If the target
|
|
group ALUA state changes to a value other than active/nonoptimized, the
|
|
backing device will be closed (active=0). If bind_alua_state=0 for a
|
|
device the ALUA state changes have NO effect on the active attribute,
|
|
it is left up to the user to use a script, or manually set the active
|
|
attribute to open/close the backing block device.
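
The "active" attribute can be driven from a script. The following is only a
minimal sketch, not part of SCST: the helper names and the SCST_ROOT variable
are hypothetical, and it assumes the attribute lives in
devices/<device name>/active below the SCST sysfs root and accepts 0 or 1.
Only the first line is read back, in case extra lines follow the value.

```sh
# Root of the SCST sysfs tree; overridable so the helpers can be tried
# against a scratch directory (SCST_ROOT is a name made up for this sketch).
SCST_ROOT="${SCST_ROOT:-/sys/kernel/scst_tgt}"

# Print the current value of the "active" attribute of a BLOCKIO device.
blockio_get_active() {
    head -n 1 "$SCST_ROOT/devices/$1/active"
}

# Open (1) or close (0) the backing block device of a BLOCKIO device.
blockio_set_active() {
    case "$2" in
        0|1) ;;
        *) echo "active must be 0 or 1" >&2; return 1 ;;
    esac
    echo "$2" >"$SCST_ROOT/devices/$1/active"
}
```

With bind_alua_state=0 a cluster script could call, for instance,
"blockio_set_active disk1 0" before demoting a node.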
|
|
|
|
Handler vdisk_nullio provides NULLIO mode to create virtual devices. In
|
|
this mode no real I/O is done, but success is returned to initiators.
|
|
It is intended to be used for performance measurements in the same way as the
"*_perf" handlers. The following parameters are possible for vdisk_nullio:
|
|
blocksize, read_only, removable, tst. See vdisk_fileio above for
|
|
description of those parameters.
|
|
|
|
vdisk_nullio devices have the following two additional attributes:
|
|
|
|
- dummy - if this flag is set, LUNs corresponding to this device will
|
|
not appear at the initiator side. This is because SCST will set the
|
|
PERIPHERAL QUALIFIER field to 1 (not connected) and the
|
|
PERIPHERAL DEVICE TYPE to 0x1f (no device) in the INQUIRY response.
|
|
See also SPC-4 for more information. It is designed to be used as a
|
|
"dummy" placeholder on LUN 0, if LUN 0 is not desired.
|
|
|
|
- read_zero - if this flag is set, reading from a vdisk_nullio device
|
|
returns a buffer filled with byte 0x00. If this flag is cleared
|
|
(which is the default behavior), the buffer returned to the
|
|
initiator is not cleared. Although this results in slightly faster
|
|
operation this is a security hole since any data that is present in
|
|
kernel memory can be returned to the initiator.
|
|
|
|
Handler vcdrom allows emulation of a virtual CDROM device using an ISO
|
|
file as backend. It has only a single parameter: tst.
|
|
|
|
For example:
|
|
|
|
echo "add_device disk1 filename=/disk1; blocksize=4096; nv_cache=1" >/sys/kernel/scst_tgt/handlers/vdisk_fileio/mgmt
|
|
|
|
will create a FILEIO virtual device disk1 with backend file /disk1
|
|
with block size 4K and NV_CACHE enabled.
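
When such commands are generated by a script, the "; " separated parameter
list is easy to get wrong. The following sketch assembles the command string
from a device name and key=value pairs. The function name is hypothetical and
the output format simply mirrors the example above.

```sh
# Build an "add_device" command line from a device name followed by any
# number of key=value parameters, joined with "; " as in the example above.
scst_add_device_cmd() {
    name=$1
    shift
    params=""
    for kv in "$@"; do
        if [ -z "$params" ]; then
            params=$kv
        else
            params="$params; $kv"
        fi
    done
    # Emit the parameter list only if at least one parameter was given.
    printf 'add_device %s%s\n' "$name" "${params:+ $params}"
}
```

The result can then be redirected into the handler's mgmt file, e.g.
"scst_add_device_cmd disk1 filename=/disk1 blocksize=4096 nv_cache=1
>/sys/kernel/scst_tgt/handlers/vdisk_fileio/mgmt".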
|
|
|
|
Each vdisk_fileio's device has the following attributes in
|
|
/sys/kernel/scst_tgt/devices/device_name:
|
|
|
|
- filename - contains path and file name of the backend file.
|
|
|
|
- blocksize - contains block size used by this virtual device.
|
|
|
|
- opt_trans_len - contains the optimal transfer length used by this virtual
|
|
device.
|
|
|
|
- write_through - contains status of write back caching of this virtual
|
|
device.
|
|
|
|
- sync - writing into this attribute causes the page cache contents to
|
|
be flushed to disk.
|
|
|
|
- read_only - contains read only status of this virtual device.
|
|
|
|
- o_direct - contains O_DIRECT status of this virtual device.
|
|
|
|
- inq_vend_specific - Vendor specific data that will be reported via
|
|
either bytes 36..55 or bytes 96..256 of the INQUIRY response, depending
|
|
on whether this field is <= 20 or > 20 bytes long.
|
|
|
|
- nv_cache - contains NV_CACHE status of this virtual device.
|
|
|
|
- prod_id - PRODUCT IDENTIFICATION as reported via the INQUIRY response.
|
|
The default value for this field is the SCST device name.
|
|
|
|
- prod_rev_lvl - PRODUCT REVISION LEVEL as reported via the INQUIRY
|
|
response. The default value for this field is " 300".
|
|
|
|
- scsi_device_name - optional SCSI target device name to which this
|
|
SCST device belongs (in SCSI terminology all SCST devices are called
|
|
Logical Units). See SPC for more info.
|
|
|
|
- tst - contains TST field of SCSI Control mode page. See SPC-4 for
|
|
more details about this field.
|
|
|
|
- thin_provisioned - contains thin provisioning status of this virtual
|
|
device.
|
|
|
|
- gen_tp_soft_threshold_reached_UA - for thin provisioned devices
|
|
writing of anything into this write-only attribute will generate THIN
|
|
PROVISIONING SOFT THRESHOLD REACHED Unit Attention for all initiators
connected to this device.
|
|
|
|
- removable - contains removable status of this virtual device.
|
|
|
|
- rotational - contains rotational status of this virtual device.
|
|
|
|
- size_mb - contains size of this virtual device in MB.
|
|
|
|
- pr_file_name - Full path of the file or block device in which to store
|
|
persistent reservation information. The default value for this attribute is
|
|
/var/lib/scst/pr/${device_name}. Writing a new value into this sysfs
|
|
attribute is only allowed if the device is not exported. Modifying this
|
|
sysfs attribute causes the persistent reservation state to be reloaded.
|
|
|
|
- t10_dev_id - contains, and allows setting, the T10 vendor specific
|
|
identifier for Device Identification VPD page (0x83) of INQUIRY data.
|
|
By default the VDISK handler always generates t10_dev_id for every newly
|
|
created device at creation time based on the device name and
|
|
scst_vdisk_ID scst_vdisk.ko module parameter for procfs (see below)
|
|
or the SCST setup_id when using the sysfs interface (see above).
|
|
Note: some initiators, e.g. VMware's ESXi or MS Hyper-V, only look
|
|
at the first eight characters of t10_dev_id. You have to make sure
|
|
that these first eight characters are unique or VMware will consider
|
|
these devices as identical.
|
|
|
|
- eui64_id - allows setting the EUI-64 based device identifier in the
|
|
SCSI device identification VPD page (83h). This identifier must be 8,
|
|
12 or 16 bytes long and must be specified in hexadecimal format (EUI =
|
|
Extended Unique Identifier). A leading "0x" is allowed but is not
|
|
required. Writing a newline into this attribute discards the EUI-64
|
|
identifier. If neither eui64_id nor naa_id have been set the first
|
|
eight bytes of the t10_dev_id are used as the EUI-64 ID. If naa_id has
|
|
been set but eui64_id has not been set no EUI-64 identifier is
|
|
reported in the SCSI device identification VPD page. If eui64_id has
|
|
been set the value of this attribute is reported as the EUI-64 ID. The
|
|
first three bytes of an EUI-64 ID are a so-called organizationally
|
|
unique identifier (OUI). The remaining bytes may be chosen by the
|
|
organization that owns the OUI. For more information about OUIs, see
|
|
also http://standards.ieee.org/develop/regauth/oui/public.html.
|
|
|
|
- naa_id - allows setting the NAA ID in the SCSI INQUIRY response (NAA =
|
|
Network Address Authority). This identifier must be 8 or 16 bytes long
|
|
and must be specified in hex format. A leading "0x" is allowed but is
|
|
not required. Writing a newline into this attribute discards the NAA
|
|
ID. If this ID is set it is reported in the SCSI VPD device
|
|
identification page (83h). More information about NAA identifiers can
|
|
be found in the following documents:
|
|
* ANSI T11 committee, Fibre Channel Framing and Signaling Interface - 4
|
|
(FC-FS-4) rev 0.50, May 2014 (http://www.t11.org/).
|
|
* IETF, RFC 3980 - T11 Network Address Authority (NAA) Naming Format for
|
|
iSCSI Node Names, February 2005 (https://tools.ietf.org/html/rfc3980).
|
|
|
|
- t10_vend_id - Contents of the T10 VENDOR IDENTIFICATION field of the
|
|
INQUIRY response. The default value for this field is "SCST_BIO" for
|
|
vdisk_blockio devices and "SCST_FIO" for vdisk_fileio devices.
|
|
|
|
- usn - contains the virtual device's serial number of INQUIRY data. It
|
|
is created at the device creation time based on the device name and
|
|
scst_vdisk_ID scst_vdisk.ko module parameter for procfs (see below)
|
|
or the SCST setup_id when using the sysfs interface (see above).
|
|
|
|
- type - contains SCSI type of this virtual device.
|
|
|
|
- resync_size - write-only attribute, which makes vdisk_fileio
|
|
rescan the size of the backend file. It is useful if you changed it, for
|
|
instance, if you resized it.
|
|
|
|
- vend_specific_id - Vendor specific ID as reported via the Device
|
|
Identification VPD page (83h). The default value for this attribute
|
|
is the value of the t10_dev_id attribute.
|
|
|
|
For example:
|
|
|
|
/sys/kernel/scst_tgt/devices/disk1
|
|
|-- block
|
|
|-- blocksize
|
|
|-- opt_trans_len
|
|
|-- exported
|
|
| |-- export0 -> ../../../targets/iscsi/iqn.2006-10.net.vlnb:tgt/luns/0
|
|
| |-- export1 -> ../../../targets/iscsi/iqn.2006-10.net.vlnb:tgt/ini_groups/INI/luns/0
|
|
| |-- export2 -> ../../../targets/iscsi/iqn.2006-10.net.vlnb:tgt1/luns/0
|
|
| |-- export3 -> ../../../targets/iscsi/iqn.2006-10.net.vlnb:tgt1/ini_groups/INI1/luns/0
|
|
| |-- export4 -> ../../../targets/iscsi/iqn.2006-10.net.vlnb:tgt1/ini_groups/INI2/luns/0
|
|
|-- filename
|
|
|-- handler -> ../../handlers/vdisk_fileio
|
|
|-- nv_cache
|
|
|-- o_direct
|
|
|-- read_only
|
|
|-- removable
|
|
|-- resync_size
|
|
|-- rotational
|
|
|-- size_mb
|
|
|-- t10_dev_id
|
|
|-- thin_provisioned
|
|
|-- threads_num
|
|
|-- threads_pool_type
|
|
|-- tst
|
|
|-- type
|
|
|-- usn
|
|
`-- write_through
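
To get a quick overview of such a directory, the attribute files can be
printed as name=value pairs. This is an illustrative sketch only: the
function name and the SCST_ROOT variable are hypothetical. It skips
directories and links to directories (exported, handler) and silently ignores
attributes that cannot be read, such as write-only ones.

```sh
# Root of the SCST sysfs tree (SCST_ROOT is a name made up for this sketch).
SCST_ROOT="${SCST_ROOT:-/sys/kernel/scst_tgt}"

# Print "name=value" for every readable attribute file of one device.
scst_dump_device_attrs() {
    dir="$SCST_ROOT/devices/$1"
    for f in "$dir"/*; do
        # Only plain attribute files; "exported" and "handler" are skipped.
        [ -f "$f" ] || continue
        # Keep the first line only; skip attributes that fail to read.
        value=$(head -n 1 "$f" 2>/dev/null) || continue
        printf '%s=%s\n' "${f##*/}" "$value"
    done
}
```

For the device above, "scst_dump_device_attrs disk1" would list blocksize,
filename, nv_cache and so on, one per line.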
|
|
|
|
Each vdisk_blockio's device has the following attributes in
|
|
/sys/kernel/scst_tgt/devices/device_name: blocksize, filename, nv_cache,
|
|
read_only, removable, resync_size, rotational, size_mb, t10_dev_id,
|
|
thin_provisioned, gen_tp_soft_threshold_reached_UA, threads_num,
|
|
threads_pool_type, tst, type, usn. See above description of those
|
|
parameters.
|
|
|
|
Each vdisk_nullio's device has the following attributes in
|
|
/sys/kernel/scst_tgt/devices/device_name: blocksize, read_only,
|
|
removable, size_mb, t10_dev_id, threads_num, threads_pool_type, type,
|
|
tst, usn, dummy. See above description of those parameters.
|
|
|
|
Each vcdrom's device has the following attributes in
|
|
/sys/kernel/scst_tgt/devices/device_name: filename, size_mb,
|
|
t10_dev_id, threads_num, threads_pool_type, type, usn, tst. See above
|
|
description of those parameters. Exception is filename attribute. For
|
|
vcdrom it is writable. Writing to it allows to virtually insert or
|
|
change virtual CD media in the virtual CDROM device. For example:
|
|
|
|
- echo "/image.iso" >/sys/kernel/scst_tgt/devices/cdrom/filename - will
|
|
insert file /image.iso as virtual media to the virtual CDROM cdrom.
|
|
|
|
- echo "" >/sys/kernel/scst_tgt/devices/cdrom/filename - will remove
|
|
"media" from the virtual CDROM cdrom.
|
|
|
|
Additionally VDISK handler has module parameter "num_threads", which
|
|
specifies the number of I/O threads for each FILEIO VDISK or VCDROM device.
|
|
If you have a workload, which tends to produce rather random accesses
|
|
(e.g. DB-like), you should increase this count to a bigger value, like
|
|
32. If you have a rather sequential workload, you should decrease it to
|
|
a lower value, like number of CPUs on the target or even 1. Due to some
|
|
limitations of the Linux I/O subsystem, increasing the number of I/O threads too
|
|
much leads to a sequential performance drop, especially with the deadline
|
|
scheduler, so decreasing it can improve sequential performance. The
|
|
default provides a good compromise between random and sequential
|
|
accesses.
|
|
|
|
You shouldn't be afraid to have too many VDISK I/O threads if you have
|
|
many VDISK devices. Kernel threads consume a very small amount of
|
|
resources (several KBs) and only necessary threads will be used by SCST,
|
|
so the threads will not thrash your system.
|
|
|
|
CAUTION: If you partitioned/formatted your device with block size X, *NEVER*
|
|
======== ever try to export and then mount it (even accidentally) with another
|
|
block size. Otherwise you can *instantly* damage it pretty
|
|
badly as well as all your data on it. Messages on initiator
|
|
like "attempt to access beyond end of device" are a sign of
|
|
such damage.
|
|
|
|
Moreover, if you want to compare how well different block sizes
|
|
work for you, you **MUST** EVERY TIME AFTER CHANGING BLOCK SIZE
|
|
**COMPLETELY** **WIPE OFF** ALL THE DATA FROM THE DEVICE. In
|
|
other words, THE **WHOLE** DEVICE **MUST** HAVE ONLY **ZEROS**
|
|
AS THE DATA AFTER YOU SWITCH TO NEW BLOCK SIZE. Switching block
|
|
sizes isn't like switching between FILEIO and BLOCKIO, after
|
|
changing block size all data previously written with another block
size MUST BE ERASED. Otherwise you will have a full set of
|
|
very weird behaviors, because blocks addressing will be
|
|
changed, but initiators in most cases will not have a
|
|
possibility to detect that old addresses written on the device
|
|
in, e.g., partition table, don't refer anymore to what they are
|
|
intended to refer to.
|
|
|
|
IMPORTANT: Some disk and partition table management utilities don't support
|
|
========= block sizes >512 bytes, therefore make sure that your favorite one
|
|
supports it. Currently cfdisk is the only utility known to require
|
|
512 bytes blocks, other utilities like fdisk on Linux or
|
|
the standard disk manager on Windows have been shown to work well with
|
|
non-512 bytes blocks. Note, if you export a disk file or
|
|
device with a block size different from the one with which
|
|
it was already partitioned, you could get various weird
|
|
things like utilities hang up or other unexpected behavior.
|
|
Hence, to be sure, zero the exported file or device before
|
|
the first access to it from the remote initiator with another
|
|
block size. On a Windows initiator make sure you "Set Signature"
|
|
in the disk manager on the drive imported from the target
|
|
before doing any other partitioning on it. After you
|
|
successfully mounted a file system over non-512 bytes block
|
|
size device, the block size stops mattering: any program will
|
|
work with files on such file system.
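
Whether a backing file really contains only zeros can be checked before it is
exported with a new block size. The following is a minimal sketch, assuming
GNU cmp (for its -n option); the function name is hypothetical. It takes the
size from wc, which is fine for a file; for a block device the size would
better be obtained with "blockdev --getsize64".

```sh
# Succeed only if every byte of the given file is zero.
is_all_zeros() {
    # Size of the file in bytes; a read failure is reported as status 2.
    size=$(wc -c <"$1") || return 2
    # Compare exactly that many bytes against the endless zero stream.
    cmp -s -n "$size" "$1" /dev/zero
}
```

For example, "is_all_zeros /disk1 || echo '/disk1 still contains data'".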
|
|
|
|
|
|
Dealing with massive logs
|
|
-------------------------
|
|
|
|
If you want to use the "trace_level" file to enable logging levels, which
|
|
produce a lot of events, like "debug", to not lose logged events you
|
|
should also:
|
|
|
|
* Increase the CONFIG_LOG_BUF_SHIFT variable in the .config of your kernel
to a much bigger value, then recompile the kernel. For example, value 25 will
|
|
provide good protection from logging overflow even under high volume
|
|
of logging events. To use it you will need to modify the maximum
|
|
allowed value for CONFIG_LOG_BUF_SHIFT in the corresponding Kconfig
|
|
file to 25 as well.
|
|
|
|
* Change /etc/syslog.conf, or the config file of your favorite logging
program, to store kernel logs asynchronously. For example,
you can add the line "kern.info -/var/log/kernel" to rsyslog.conf and
add "kern.none" to the line for /var/log/messages, so the resulting line
would look like:
|
|
|
|
"*.info;kern.none;mail.none;authpriv.none;cron.none /var/log/messages"
|
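
For sizing, recall that the kernel log buffer holds 2^CONFIG_LOG_BUF_SHIFT
bytes. A small sketch (the function name is made up for illustration) that
turns a shift value into a byte count:

```sh
# Size in bytes of the kernel log buffer for a given CONFIG_LOG_BUF_SHIFT.
log_buf_bytes() {
    echo $((1 << $1))
}
```

"log_buf_bytes 25" prints 33554432, i.e. the value 25 suggested above
corresponds to a 32 MB log buffer.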
|
|
|
|
|
Persistent Reservations
|
|
-----------------------
|
|
|
|
SCST implements Persistent Reservations with a full set of capabilities,
|
|
including "Persistence Through Power Loss".
|
|
|
|
The "Persistence Through Power Loss" data are saved in /var/lib/scst/pr,
in files named after the corresponding devices. This directory also
contains backup versions of those files
|
|
with suffix ".1". Those backup files are used in case of power or other
|
|
failure to prevent Persistent Reservation information from corruption
|
|
during update. It is safe to assume that each of those files can be up
|
|
to 1 KB in size.
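
To see which of these files exist for a device, something like the following
can be used. It is a sketch only: the function name and the PR_DIR variable
are hypothetical, and it assumes the default directory is in use.

```sh
# Directory holding the PR state files (PR_DIR is a name made up here).
PR_DIR="${PR_DIR:-/var/lib/scst/pr}"

# Report the PR state file of a device and its ".1" backup, with sizes.
pr_state_files() {
    for f in "$PR_DIR/$1" "$PR_DIR/$1.1"; do
        if [ -f "$f" ]; then
            # The arithmetic expansion strips any padding around wc's output.
            printf '%s %s bytes\n' "$f" "$(( $(wc -c <"$f") ))"
        else
            printf '%s missing\n' "$f"
        fi
    done
}
```

For example, "pr_state_files disk1" prints one line for disk1 and one for
disk1.1.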
|
|
|
|
Persistent Reservations are available on all transports implementing the
|
|
get_initiator_port_transport_id() callback. Transports not implementing
|
|
this callback will act in one of 2 possible scenarios ("all or
|
|
nothing"):
|
|
|
|
1. If a device has such a transport connected and doesn't have persistent
|
|
reservations, it will refuse Persistent Reservations commands as if it
|
|
doesn't support them.
|
|
|
|
2. If a device has persistent reservations, all initiators newly
|
|
connecting via such transports will not see this device. After all
|
|
persistent reservations from this device are released, upon reconnect
|
|
the initiators will see it.
|
|
|
|
|
|
ALUA Support
|
|
------------
|
|
|
|
SCST supports both implicit and explicit asymmetric logical unit access
|
|
(ALUA). ALUA is a feature defined by the ANSI T10 SCSI committee. It
|
|
allows a target to tell the initiator which path to use in a multipath
|
|
setup plus, in the explicit case, control state of each path via SET
|
|
TARGET PORT GROUPS SCSI command. The redundant paths between initiator
|
|
and target can be used either for redundancy or for load sharing
|
|
purposes. The target can either be a single target system running SCST
|
|
with multiple communication interfaces or two target systems each
|
|
running SCST and configured in a high availability setup.
|
|
|
|
In the SPC-4 standard the following concepts are defined related to ALUA:
|
|
* Relative target port ID. A number between 1 and 65535 that uniquely
|
|
identifies a target port. These numbers must be unique over the target as
|
|
a whole, even if that target consists of multiple systems each running SCST.
|
|
* Target port group asymmetric access state. One of active/optimized,
|
|
active/non-optimized, standby, unavailable, logical block dependent or
|
|
offline. The access state of a port defines which (if any) SCSI commands
|
|
will be processed by the target port.
|
|
* Target port preference indicator. This indicator is additional information
|
|
next to the asymmetric access state that is provided by the target to an
|
|
initiator and that may impact the decision taken by the initiator about
|
|
which path will be chosen.
|
|
|
|
More detailed information about ALUA can be found in section 5.11.2 of the
|
|
ANSI T10 standard called SPC-4.
|
|
|
|
ALUA support in SCST
|
|
....................
|
|
|
|
SCST allows defining ALUA settings for each unique combination of SCST
|
|
device and SCST target. An initiator however queries ALUA settings by
|
|
sending an appropriate SCSI command to a specific LUN of an SCST target.
|
|
Each such LUN maps uniquely to an SCST device. For hardware SCST target
|
|
drivers, e.g. ib_srpt, there is a one-to-one correspondence between SCST
|
|
target and SCSI target port. With other SCST targets, e.g. iSCSI-SCST,
|
|
by default the only relationship between SCST targets and SCSI target
|
|
ports is that all SCST targets defined on a system are visible via all
|
|
SCSI target ports. See also the iSCSI-SCST documentation about the
|
|
allowed_portal attribute for information about how to associate iSCSI
|
|
targets with a single physical interface.
|
|
|
|
Notes:
|
|
- In a H.A. setup it is the responsibility of the user to synchronize ALUA
|
|
information between the individual systems running SCST. There are no
|
|
provisions in SCST to exchange ALUA information automatically between
|
|
individual systems.
|
|
- In order to support H.A. setups it is possible to let one SCST system
|
|
report information about target ports present in other SCST systems.
|
|
- With SCST, and certainly in a H.A. setup, it is possible to configure ALUA
|
|
such that an initiator receives information that is not standard compliant,
|
|
e.g. setting all target ports in the offline state. It is the responsibility
|
|
of the user to make sure that the information queried by an initiator is
|
|
consistent independent of the LUN and the target port used by the initiator
|
|
to query this information.
|
|
- Before building a H.A. setup consisting of two or more SCST systems one
|
|
should evaluate whether it's acceptable that persistent reservation commands,
|
|
SCSI task management commands and MODE SELECT commands will only be processed
|
|
by a single node instead of being processed by all nodes.
|
|
|
|
Configuring ALUA in SCST
|
|
........................
|
|
|
|
SCST allows configuring the following settings related to ALUA
|
|
for each unique combination of SCST target and virtual SCST device
|
|
(vdisk_fileio, vdisk_blockio, vcdrom, ...):
|
|
* The target port group asymmetric access state. SCST supports all ALUA port
|
|
states except logical block dependent.
|
|
* The preference indicator for a target port group.
|
|
* The relative target port ID associated with the SCST target.
|
|
|
|
It is possible to configure the following ALUA-related information via the
|
|
sysfs interface of SCST:
|
|
* Device groups, where each device group has a name and contains zero or more
|
|
SCST devices. If a device group contains only a single SCST device, the name
|
|
of the group may be identical to the device name. See also
|
|
/sys/kernel/scst_tgt/device_groups/mgmt.
|
|
* Which devices are inside a device group. See also
|
|
/sys/kernel/scst_tgt/device_groups/<device group name>/devices/mgmt.
|
|
* Target groups, where each target group has a name and contains zero or more
|
|
SCST target names. See also
|
|
/sys/kernel/scst_tgt/device_groups/<device group name>/target_groups/mgmt.
|
|
* Target port group identifier. This is a number in the range 0..65535 and is
|
|
called the TARGET PORT GROUP in SPC-4. See also
|
|
/sys/kernel/scst_tgt/device_groups/<device group name>/target_groups/<target
|
|
group name>/group_id.
|
|
* Target port group preference indicator. This is a boolean value called the
|
|
PREF bit in SPC-4. See also /sys/kernel/scst_tgt/device_groups/<device group
|
|
name>/target_groups/<target group name>/preferred.
|
|
* Target port group state name. One of active, nonoptimized, standby,
|
|
unavailable, offline or transitioning. See also
|
|
/sys/kernel/scst_tgt/device_groups/<device group name>/target_groups/<target
|
|
group name>/state.
|
|
* Target group contents - zero or more target names. The target names either
|
|
exist on the local system or on a remote system in a H.A. setup. For target
|
|
names that refer to SCST targets on another system only the relative target
|
|
port identifier matters, not the assigned name. See also
|
|
/sys/kernel/scst_tgt/device_groups/<device group name>/target_groups/<target
|
|
group name>/mgmt.
|
|
* Relative target identifier. See also
|
|
/sys/kernel/scst_tgt/device_groups/<device group name>/target_groups/<target
|
|
group name>/<target name>/rel_tgt_id.
|
|
|
|
The steps involved in configuring ALUA are:
|
|
* Identify the SCST devices that will always share the same ALUA settings and
|
|
state. Assign a name to each such group of SCST devices. If a device group
|
|
only contains a single device, the group name may be identical to the device
|
|
name.
|
|
* Configure that device group in SCST via sysfs.
|
|
* Identify the SCSI target ports that will always share the same ALUA settings
|
|
and state. Assign a name, a group ID and preference indicator to each such
|
|
SCSI target port group.
|
|
* Configure the target port group information in SCST via sysfs.
|
|
* Identify all SCST targets that can be accessed via a target port group.
|
|
* Assign all these SCST target names to the target group via sysfs.
|
|
* Assign a relative target port identifier to each target.
|
|
|
|
As an example, in a H.A. setup with two systems each having one InfiniBand
|
|
HCA controlled by the ib_srpt driver and where each system exports two LUNs
|
|
the following configuration can be used in scst.conf on both systems:
|
|
|
|
DEVICE_GROUP dgroup1 {
|
|
DEVICE disk01
|
|
|
|
TARGET_GROUP tgroup1 {
|
|
group_id 256
|
|
preferred 1
|
|
state active
|
|
TARGET fe80:0000:0000:0000:0002:c903:00fa:b7e1 {
|
|
rel_tgt_id 1
|
|
}
|
|
}
|
|
TARGET_GROUP tgroup2 {
|
|
group_id 257
|
|
state standby
|
|
TARGET fe80:0000:0000:0000:0002:c903:00fa:b7f2 {
|
|
rel_tgt_id 2
|
|
}
|
|
}
|
|
}
|
|
|
|
DEVICE_GROUP dgroup2 {
|
|
DEVICE disk02
|
|
|
|
TARGET_GROUP tgroup1 {
|
|
group_id 258
|
|
state standby
|
|
TARGET fe80:0000:0000:0000:0002:c903:00fa:b7e1 {
|
|
rel_tgt_id 1
|
|
}
|
|
}
|
|
TARGET_GROUP tgroup2 {
|
|
group_id 259
|
|
preferred 1
|
|
state active
|
|
TARGET fe80:0000:0000:0000:0002:c903:00fa:b7f2 {
|
|
rel_tgt_id 2
|
|
}
|
|
}
|
|
}
|
|
|
|
Note, if you are using the "active" BLOCKIO device attribute to prevent opening
|
|
of the backend block device on the passive node, it is not recommended
|
|
to set both active ("active", "nonoptimized") and passive ("standby",
|
|
etc.) ALUA states for the same device if "bind_alua_state=1" is used, as
|
|
shown above to keep internal "active" state of the BLOCKIO device consistent.
|
|
|
|
If using the "active" BLOCKIO device attribute and multiple target groups
|
|
exist per device on a SCST instance then "bind_alua_state=0" should be used
|
|
and it is left up to the user to modify the "active" attribute value.
|
|
|
|
Explicit ALUA
|
|
.............
|
|
|
|
To enable explicit ALUA you need, in addition to the above settings, to set
the expl_alua device attribute to 1 (by default it is 0). You also need to
run stpgd and supply it with the path to a script or program that will
perform the actual path state switching on a SET TARGET PORT GROUPS command,
for instance, by calling drbdadm. For more information see the stpgd README
as well as the sample script scst_on_stpg.
|
|
|
|
DRBD and other replication/failover SW compatibility
|
|
....................................................
|
|
|
|
DRBD, like other replication/failover SW, does not allow its device to be
opened on the secondary, nor does it allow a primary to secondary
transition while this device is open.
|
|
|
|
The SCST BLOCKIO handler has the necessary support for such behavior:
|
|
|
|
1. If you need to prevent an SCST BLOCKIO device from opening its block
|
|
device, you need to create it with parameter "active=0". In case of DRBD
|
|
it is done automatically, so you don't have to use the "active"
|
|
attribute.
|
|
|
|
2. By default, if you write new ALUA state in the "state" attribute and
|
|
"bind_alua_state=1" for the device, SCST BLOCKIO handler before transition
|
|
closes open handles on all affected SCST devices and after transition
|
|
reopens them, if the new state is active or nonoptimized. Alternatively,
|
|
set "bind_alua_state=0" for SCST BLOCKIO devices and ALUA state changes
|
|
will not open/close the backing block device; the user will need to handle
|
|
this manually or via a cluster RA in an HA setup.
|
|
|
|
Thus, the recommended implicit ALUA state change procedure for primary
|
|
to secondary transition is:
|
|
|
|
1. Block all involved SCST devices using "block" sysfs attribute (see
|
|
above). Wait until the blocking has finished.
|
|
|
|
2. Change the ALUA state to "transitioning". At this moment all open
|
|
file handles will be closed.
|
|
|
|
3. Perform the DRBD or other replication/failover SW state transition.
|
|
|
|
4. Change the ALUA state to your desired secondary state.
|
|
|
|
5. Unblock the devices blocked in step 1.
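
The five steps can be sketched as a shell function. This is an illustration
under several assumptions, not a tested failover script: the function name
and the SCST_ROOT variable are hypothetical, writing 1 or 0 to the "block"
attribute is assumed to block or unblock a device and to return only once the
blocking has finished, and the replication command is passed in by the
caller. Error recovery is deliberately left out: on any failure the function
returns at once.

```sh
# Root of the SCST sysfs tree (SCST_ROOT is a name made up for this sketch).
SCST_ROOT="${SCST_ROOT:-/sys/kernel/scst_tgt}"

# alua_demote STATE_FILE NEW_STATE REPL_CMD DEVICE...
#   STATE_FILE  path of the target group "state" attribute
#   NEW_STATE   desired secondary state, e.g. standby
#   REPL_CMD    command performing the replication state transition
alua_demote() {
    state_file=$1 new_state=$2 repl_cmd=$3
    shift 3
    # Step 1: block all involved devices (assumed to be synchronous).
    for dev in "$@"; do
        echo 1 >"$SCST_ROOT/devices/$dev/block" || return 1
    done
    # Step 2: enter the transitioning state; open handles get closed.
    echo transitioning >"$state_file" || return 1
    # Step 3: let the replication/failover SW change its own state.
    sh -c "$repl_cmd" || return 1
    # Step 4: switch to the desired secondary state.
    echo "$new_state" >"$state_file" || return 1
    # Step 5: unblock the devices blocked in step 1.
    for dev in "$@"; do
        echo 0 >"$SCST_ROOT/devices/$dev/block" || return 1
    done
}
```

With DRBD the third argument could be something like "drbdadm secondary r0",
where r0 is a made-up resource name.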
|
|
|
|
Optionally, if your initiators support Transitioning ALUA state, for
|
|
more responsive behavior the blocked devices can be unblocked
|
|
immediately after step (2). However, not all initiators behave
correctly if they receive ASYMMETRIC STATE TRANSITION sense.
|
|
|
|
For the secondary to primary transition the procedure is similar.
|
|
|
|
In case of explicit ALUA, SCST automatically performs the necessary
|
|
device blocking around sending the SCST_EVENT_STPG_USER_INVOKE event.
|
|
|
|
Checking the Target Configuration
|
|
.................................
|
|
|
|
One way to verify the ALUA configuration from a Linux initiator is via
|
|
the commands provided in the sg3_utils package. The first step is to
|
|
verify whether ALUA has been configured on the target for a certain LUN.
|
|
This is possible by checking whether the TPGS=1 text appears in the
|
|
sg_inq output, where /dev/sdb is a device node created by the ib_srp
|
|
initiator:
|
|
|
|
# sg_inq /dev/sdb
|
|
standard INQUIRY:
|
|
PQual=0 Device_type=0 RMB=0 version=0x05 [SPC-3]
|
|
[AERC=0] [TrmTsk=0] NormACA=0 HiSUP=1 Resp_data_format=2
|
|
SCCS=0 ACC=0 TPGS=1 3PC=0 Protect=0 BQue=0
|
|
EncServ=0 MultiP=0 [MChngr=0] [ACKREQQ=0] Addr16=1
|
|
[RelAdr=0] WBus16=0 Sync=0 Linked=0 [TranDis=0] CmdQue=1
|
|
[SPI: Clocking=0x0 QAS=0 IUS=0]
|
|
length=66 (0x42) Peripheral device type: disk
|
|
Vendor identification: SCST_FIO
|
|
Product identification: disk01
|
|
Product revision level: 300
|
|
Unit serial number: 27cddc71
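
This check is easy to script. The sketch below (the function name is
hypothetical) reads sg_inq output on standard input and extracts the TPGS
field. In SPC a TPGS value of 0 means no ALUA support, 1 implicit only, 2
explicit only and 3 both.

```sh
# Read "sg_inq" output on stdin, print the TPGS value and succeed only if
# it is non-zero. Status 2 means that no TPGS field was found at all.
tpgs_from_sg_inq() {
    tpgs=$(sed -n 's/.*TPGS=\([0-3]\).*/\1/p' | head -n 1)
    [ -n "$tpgs" ] || return 2
    echo "$tpgs"
    [ "$tpgs" -ne 0 ]
}
```

For example, "sg_inq /dev/sdb | tpgs_from_sg_inq >/dev/null || echo 'no ALUA
reported'".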
|
|
|
|
The next step is to verify the target group configuration. That is possible
|
|
by verifying whether the output of the sg_rtpg command matches the values
|
|
configured on the target:
|
|
|
|
# sg_rtpg /dev/sdb
|
|
Report target port groups:
|
|
target port group id : 0x100 , Pref=1
|
|
target port group asymmetric access state : 0x00
|
|
T_SUP : 0, O_SUP : 0, LBD_SUP : 0, U_SUP : 1, S_SUP : 1, AN_SUP : 1, AO_SUP : 1
|
|
status code : 0x02
|
|
vendor unique status : 0x00
|
|
target port count : 01
|
|
Relative target port ids:
|
|
0x01
|
|
target port group id : 0x101 , Pref=0
|
|
target port group asymmetric access state : 0x02
|
|
T_SUP : 0, O_SUP : 0, LBD_SUP : 0, U_SUP : 1, S_SUP : 1, AN_SUP : 1, AO_SUP : 1
|
|
status code : 0x02
|
|
vendor unique status : 0x00
|
|
target port count : 01
|
|
Relative target port ids:
|
|
0x02
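
For a quick comparison with the configured values, the output can be reduced
to one line per target port group. The following awk-based sketch (the
function name is hypothetical) assumes the output layout shown above. Group
IDs are printed by sg_rtpg in hexadecimal, so 0x100 and 0x101 correspond to
group_id 256 and 257. The access state codes defined by SPC are 0x00 for
active/optimized, 0x01 for active/non-optimized, 0x02 for standby, 0x03 for
unavailable, 0x0e for offline and 0x0f for transitioning.

```sh
# Read "sg_rtpg" output on stdin and print "<group id> <pref> <state>"
# for each reported target port group.
rtpg_summary() {
    awk '
        /target port group id/ {
            id = $6                      # e.g. 0x100
            pref = $NF                   # e.g. Pref=1
            sub(/^Pref=/, "", pref)
        }
        /asymmetric access state/ { print id, pref, $NF }
    '
}
```

For example, "sg_rtpg /dev/sdb | rtpg_summary".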
|
|
|
|
The relative target port ID and the target port group ID for a certain path
|
|
can be queried e.g. as follows:
|
|
|
|
# sg_vpd -p di /dev/sdb
|
|
Device Identification VPD page:
|
|
Addressed logical unit:
|
|
designator type: T10 vendor identification, code set: ASCII
|
|
vendor id: SCST_FIO
|
|
vendor specific: 27cddc71-disk01
|
|
designator type: EUI-64 based, code set: Binary
|
|
0x3237636464633731
|
|
Target port:
|
|
designator type: Relative target port, code set: Binary
|
|
Relative target port: 0x1
|
|
designator type: Target port group, code set: Binary
|
|
Target port group: 0x100
|
|
|
|
|
|
Initiator Support
|
|
.................
|
|
|
|
On Linux systems ALUA support is provided by the scsi_dh_alua kernel
|
|
driver in combination with the user space multipathd daemon. You will
|
|
have to modify at least the following in /etc/multipath.conf to enable
|
|
ALUA:
|
|
|
|
* hardware_handler "1 alua"
|
|
* prio alua
|
|
* path_grouping_policy group_by_prio
|
|
* path_checker tur
|
|
|
|
Notes:
|
|
- Newer versions of multipathd support a parameter called
|
|
"detect_prio". It can be more convenient to enable this parameter instead of
|
|
setting the parameter "prio" to "alua" for only those LUNs that support ALUA.
|
|
- Older versions of multipathd (e.g. RHEL 5 and SLES 10 SP1) need
|
|
'prio_callout "/sbin/mpath_prio_alua /dev/%n"' instead of 'prio alua'.
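
Put together, the four settings could end up in a "device" stanza like the
one printed by the following sketch. The function name is hypothetical, and
the vendor and product strings passed in are assumptions that have to match
what your devices report in the INQUIRY response.

```sh
# Print a multipath.conf "devices" section with the ALUA settings listed
# above. Arguments: vendor string, product string (both regular expressions).
multipath_alua_stanza() {
    cat <<EOF
devices {
    device {
        vendor "$1"
        product "$2"
        hardware_handler "1 alua"
        prio alua
        path_grouping_policy group_by_prio
        path_checker tur
    }
}
EOF
}
```

For example, "multipath_alua_stanza SCST_FIO '.*'" prints a stanza for all
vdisk_fileio devices that use the default vendor ID.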
|
|
|
|
# multipath -ll
|
|
23237636464633731 dm-3 SCST_FIO,disk01
|
|
size=1.0G features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 alua' wp=rw
|
|
|-+- policy='service-time 0' prio=1 status=active
|
|
| `- 10:0:0:0 sdd 8:48 active ready running
|
|
`-+- policy='service-time 0' prio=130 status=enabled
|
|
`- 11:0:0:0 sde 8:64 active ready running
|
|
23133326137346538 dm-4 SCST_FIO,disk02
|
|
size=1.0G features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 alua' wp=rw
|
|
|-+- policy='service-time 0' prio=130 status=active
|
|
| `- 10:0:0:2 sdn 8:208 active ready running
|
|
`-+- policy='service-time 0' prio=1 status=enabled
|
|
`- 11:0:0:2 sdp 8:240 active ready running
|
|
|
|
The following information can be derived from the above output:
|
|
* That the hardware handler (hwhandler) has been set to "1 alua".
|
|
* That multipathd created two priority groups - one with priority 1 and one
|
|
with priority 130.
|
|
* That the SRP path with SCSI host number 10 will be used for communication
|
|
with LUN "disk01" and that the SRP path with SCSI host number 11 will be used
|
|
for communication with LUN "disk02".
|
|
|
|
More information about how to configure the device mapper and the scsi_dh_alua
|
|
driver can be found in the manual of your Linux distribution ("man
|
|
multipath.conf", "man multipath" and "man multipathd").
|
|
|
|
Windows initiator systems support ALUA from Windows Server 2008 on. For more
|
|
information about ALUA support in Windows Server, see also:
|
|
* Microsoft, Windows Server 2008 R2 Multipath I/O Overview, MSDN
|
|
(http://technet.microsoft.com/en-us/library/cc725907.aspx).
|
|
* Microsoft, Multipathing Support in Windows Server 2008, July 2008, MSDN
|
|
(http://blogs.msdn.com/b/san/archive/2008/07/27/multipathing-support-in-windows-server-2008.aspx).
|
|
* Microsoft, ALUA MPIO Logo Test, MSDN
|
|
(http://msdn.microsoft.com/en-us/library/gg607458%28v=vs.85%29.aspx).
|
|
|
|
Active/Non-Optimized via internal redirection
|
|
.............................................
|
|
|
|
The Active-Standby configuration is simple to understand and to set up.
|
|
However, it might cause serious interoperability issues because not all
|
|
initiators handle the ALUA 'standby' state correctly. For instance,
|
|
some versions of VMware are reported to have such issues. Same for Windows.
|
|
|
|
It is better to use the 'nonoptimized' state on the passive node instead
|
|
of 'standby', with internal command redirection to the active node. This
|
|
is what the vast majority of storage vendors are doing. This is actually
|
|
the reason why the 'standby' and 'unavailable' states have all those
|
|
initiator interoperability issues. The latter combination has received
|
|
too little testing because it is only marginally used.
|
|
|
|
SCST has the necessary support for such redirection, it just needs to be
|
|
configured correctly. It's a little bit of effort, especially to
|
|
understand how it's going to function, but then it would work MUCH more
|
|
reliably for the full range of initiators. Even poor initiators, which have no
|
|
idea about ALUA (boot from SAN, e.g.) would work now. The following
|
|
diagram illustrates this approach:
|
|
|
|
................................................................
.                               .                              .
.          Initiator A          .          Initiator B         .
.               |               .               |              .
................................................................
.               |               .               |              .
.         target port C         .         target port D        .
.               |               .               |              .
.              SCST             .              SCST            .
.          Instance E - target  .  target - Instance F         .
.            /    \     port G  .  port H     /    \           .
.           /      \           \./           /      \          .
.          /        \          /.\          /        \         .
. vdisk_blockio  dev_disk     / . \  dev_disk   vdisk_blockio  .
.    handler     handler     /  .  \ handler        handler    .
.       |           |       /   .   \   |              |       .
.  block device    SCSI    /    .     SCSI       block device  .
.       I       initiator       .   initiator          J       .
.       |         node K        .    node L            |       .
.       |______________________ .______________________|       .
................................................................
|
|
The link between block devices I and J stands for synchronous replication.
|
|
|
|
|
|
Such a setup can be configured as follows:
|
|
|
|
1. Build SCST.
|
|
|
|
2. Set up an internal redirect target on the active node, which is going to
|
|
accept redirected commands from the passive node. It must be visible
|
|
only to the passive node.
|
|
|
|
3. Set "forward_dst" attribute for this target to 1. This is necessary to
|
|
correctly handle PRs.
|
|
|
|
4. Export through this target the SAME backend SCST device as being
|
|
served to initiator(s) (consider for simplicity that there is only one
|
|
served device).
|
|
|
|
5. Connect to this SCST device through this internal target from the
|
|
passive node, for instance, using iSCSI. Now you have a local SCSI
|
|
device on the passive side pointing to the active node.
|
|
|
|
6. Export this local device to the initiator(s) using SCST
|
|
*pass-through* handler (scst_disk). Pass-through is needed to redirect
|
|
non-block commands as well: ATS, XCOPY, etc.
|
|
|
|
7. Set the ALUA state of this target to "nonoptimized". Set the forward_src
|
|
attribute to one.
|
|
|
|
That's it on the normal path. Now the initiator(s) would see 2 paths:
|
|
OPTIMIZED going to the active node and NON-OPTIMIZED going to the
|
|
passive node, then redirected to the active node.
|
|
|
|
On failover (i.e. switching active and passive states):
|
|
|
|
1. Set up a similar redirect target on the new active node.
|
|
|
|
2. Set up connectivity to that new redirect target from the new passive
|
|
node
|
|
|
|
3. Start ALUA change (see above) on both nodes
|
|
|
|
4. !! Exchange in the sysfs security group(s) for the initiator(s) *LUN*
|
|
from old SCST device to the new one (blockio -> pass-through on the new
|
|
passive and pass-through -> blockio on the new active) using "replace_no_ua"
|
|
SCST command. You need to do it directly in the sysfs interface,
|
|
scstadmin can't do it.
|
|
|
|
5. Set ALUA states to "active" on the new active node and "nonoptimized"
|
|
on the new passive node.
|
|
|
|
6. Finish ALUA states change.
|
|
|
|
An example using the direct sysfs interface could look like this:
|
|
|
|
Active-Optimized node:
|
|
|
|
modprobe scst
|
|
modprobe scst_disk
|
|
modprobe scst_vdisk
|
|
|
|
# Main device, DRBD primary here
|
|
echo "add_device aa filename=/dev/drbd1" >/sys/kernel/scst_tgt/handlers/vdisk_blockio/mgmt
|
|
|
|
# Redirect device, not used here. Coming from connecting via iSCSI to the
|
|
# corresponding redirect target on the other side.
|
|
DEVICE=10:0:0:0
|
|
echo add_device $DEVICE >/sys/kernel/scst_tgt/handlers/dev_disk/mgmt
|
|
|
|
service iscsi-scst start
|
|
|
|
# This is a regular, user-visible target
|
|
echo "add_target iqn.2006-10.net.v:tgt" >/sys/kernel/scst_tgt/targets/iscsi/mgmt
|
|
echo 1 >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.v:tgt/rel_tgt_id
|
|
echo "add aa 0" >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.v:tgt/luns/mgmt
|
|
|
|
# This is redirect target, 192.168.9.x is the redirect network
|
|
echo "add_target iqn.2006-10.net.v:tgtR" >/sys/kernel/scst_tgt/targets/iscsi/mgmt
|
|
echo 2 >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.v:tgtR/rel_tgt_id
|
|
echo "add_target_attribute iqn.2006-10.net.v:tgtR allowed_portal 192.168.9.1" >/sys/kernel/scst_tgt/targets/iscsi/mgmt
|
|
echo "1" >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.v:tgtR/forwarding
|
|
echo "add aa 0" >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.v:tgtR/luns/mgmt
|
|
|
|
echo 1 >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.v:tgt/enabled
|
|
echo 1 >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.v:tgtR/enabled
|
|
|
|
echo 1 >/sys/kernel/scst_tgt/targets/iscsi/enabled
|
|
|
|
# ALUA config
|
|
|
|
echo create aa >/sys/kernel/scst_tgt/device_groups/mgmt
|
|
echo add aa >/sys/kernel/scst_tgt/device_groups/aa/devices/mgmt
|
|
|
|
echo add tgt_a >/sys/kernel/scst_tgt/device_groups/aa/target_groups/mgmt
|
|
echo add iqn.2006-10.net.v:tgt >/sys/kernel/scst_tgt/device_groups/aa/target_groups/tgt_a/mgmt
|
|
echo 1 >/sys/kernel/scst_tgt/device_groups/aa/target_groups/tgt_a/group_id
|
|
echo active >/sys/kernel/scst_tgt/device_groups/aa/target_groups/tgt_a/state
|
|
|
|
echo add tgt_n >/sys/kernel/scst_tgt/device_groups/aa/target_groups/mgmt
|
|
echo add iqn.2006-10.net.v:tgt1 >/sys/kernel/scst_tgt/device_groups/aa/target_groups/tgt_n/mgmt
|
|
echo 2 >/sys/kernel/scst_tgt/device_groups/aa/target_groups/tgt_n/iqn.2006-10.net.v:tgt1/rel_tgt_id
|
|
echo 2 >/sys/kernel/scst_tgt/device_groups/aa/target_groups/tgt_n/group_id
|
|
echo nonoptimized >/sys/kernel/scst_tgt/device_groups/aa/target_groups/tgt_n/state
|
|
|
|
Active-Non-Optimized node:
|
|
|
|
modprobe scst
|
|
modprobe scst_disk
|
|
modprobe scst_vdisk
|
|
|
|
# Main device, DRBD secondary, not used here
|
|
echo "add_device aa filename=/dev/drbd1" >/sys/kernel/scst_tgt/handlers/vdisk_blockio/mgmt
|
|
|
|
# Redirect device. Coming from connecting via iSCSI to the
|
|
# corresponding redirect target on the other side.
|
|
DEVICE=10:0:0:0
|
|
echo add_device $DEVICE >/sys/kernel/scst_tgt/handlers/dev_disk/mgmt
|
|
|
|
service iscsi-scst start
|
|
|
|
echo "add_target iqn.2006-10.net.v:tgt1" >/sys/kernel/scst_tgt/targets/iscsi/mgmt
|
|
echo 2 >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.v:tgt1/rel_tgt_id
|
|
echo "add $DEVICE 0" >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.v:tgt1/luns/mgmt
|
|
|
|
# Redirect target, 192.168.9.x is the redirect network
|
|
echo "add_target iqn.2006-10.net.v:tgtR" >/sys/kernel/scst_tgt/targets/iscsi/mgmt
|
|
echo 2 >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.v:tgtR/rel_tgt_id
|
|
echo "add_target_attribute iqn.2006-10.net.v:tgtR allowed_portal 192.168.9.2" >/sys/kernel/scst_tgt/targets/iscsi/mgmt
|
|
echo "1" >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.v:tgtR/forwarding
|
|
echo "add aa 0" >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.v:tgtR/luns/mgmt
|
|
|
|
echo 1 >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.v:tgt1/enabled
|
|
|
|
echo 1 >/sys/kernel/scst_tgt/targets/iscsi/enabled
|
|
echo 1 >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.v:tgtR/enabled
|
|
|
|
# ALUA config
|
|
|
|
echo create $DEVICE >/sys/kernel/scst_tgt/device_groups/mgmt
|
|
echo add $DEVICE >/sys/kernel/scst_tgt/device_groups/$DEVICE/devices/mgmt
|
|
|
|
echo add tgt_a >/sys/kernel/scst_tgt/device_groups/$DEVICE/target_groups/mgmt
|
|
echo add iqn.2006-10.net.v:tgt >/sys/kernel/scst_tgt/device_groups/$DEVICE/target_groups/tgt_a/mgmt
|
|
echo 1 >/sys/kernel/scst_tgt/device_groups/$DEVICE/target_groups/tgt_a/iqn.2006-10.net.v:tgt/rel_tgt_id
|
|
echo 1 >/sys/kernel/scst_tgt/device_groups/$DEVICE/target_groups/tgt_a/group_id
|
|
echo active >/sys/kernel/scst_tgt/device_groups/$DEVICE/target_groups/tgt_a/state
|
|
|
|
echo add tgt_n >/sys/kernel/scst_tgt/device_groups/$DEVICE/target_groups/mgmt
|
|
echo add iqn.2006-10.net.v:tgt1 >/sys/kernel/scst_tgt/device_groups/$DEVICE/target_groups/tgt_n/mgmt
|
|
echo 2 >/sys/kernel/scst_tgt/device_groups/$DEVICE/target_groups/tgt_n/group_id
|
|
echo nonoptimized >/sys/kernel/scst_tgt/device_groups/$DEVICE/target_groups/tgt_n/state
|
|
|
|
ALUA state switch after DRBD primary-secondary transition:
|
|
|
|
Ex-Optimized:
|
|
|
|
echo "replace_no_ua $DEVICE 0" >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.v:tgt1/luns/mgmt
|
|
echo nonoptimized >/sys/kernel/scst_tgt/device_groups/$DEVICE/target_groups/tgt_a/state
|
|
echo active >/sys/kernel/scst_tgt/device_groups/$DEVICE/target_groups/tgt_n/state
|
|
|
|
Ex-Non-Optimized:
|
|
|
|
echo "replace_no_ua aa 0" >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.v:tgt1/luns/mgmt
|
|
echo nonoptimized >/sys/kernel/scst_tgt/device_groups/$DEVICE/target_groups/tgt_a/state
|
|
echo active >/sys/kernel/scst_tgt/device_groups/$DEVICE/target_groups/tgt_n/state
|
|
|
|
If you have any questions, please read the above text at least 3 times
|
|
before asking. It might be tricky to understand :-)
|
|
|
|
|
|
VAAI
|
|
----
|
|
|
|
SCST supports all 3 VAAI SCSI commands: WRITE SAME, COMPARE AND WRITE
|
|
(ATS) and EXTENDED COPY. Additionally, it supports Thin Provisioning
capabilities that are not directly related to VAAI, in particular the
UNMAP SCSI command, WRITE SAME with the UNMAP bit, as well as the thin
provisioning related sysfs attributes of devices (see above).
|
|
|
|
In some cases dev handlers should perform some manual actions to fully
|
|
benefit from the SCST VAAI implementation. Those actions are described in the
|
|
implementation notes below. For vdisk and fileio_tgt handlers they have
|
|
already been implemented.
|
|
|
|
IMPORTANT: To use EXTENDED COPY command between LUNs (datastores) they all
|
|
========= MUST have the same PRODUCT IDENTIFICATION INQUIRY field. By
|
|
default, to simplify remote devices identification, SCST uses
|
|
vdisk names as PRODUCT IDENTIFICATION, so SCST devices look
|
|
different to the initiators. However, for some reason,
|
|
VMware does not use EXTENDED COPY between LUNs with different
|
|
PRODUCT IDENTIFICATION. Thus, to be able to use full VAAI in
|
|
your VMware setups you must manually set PRODUCT
|
|
IDENTIFICATION for all your VMware LUNs to the same value,
|
|
for instance, "SCST", via using "prod_id" attribute. It could
|
|
be done either by adding "prod_id" attribute to scstadmin
|
|
scst.conf, or by directly writing to SCST sysfs attribute.
|
|
For example:
|
|
|
|
HANDLER vdisk_blockio {
|
|
DEVICE blockio1 {
|
|
filename /dev/sda5
|
|
prod_id SCST
|
|
}
}
|
|
|
|
or
|
|
|
|
echo SCST >/sys/kernel/scst_tgt/devices/blockio1/prod_id
|
|
correspondingly.
|
|
|
|
Note, this prod_id modification must be done on all
|
|
datastores BEFORE VMware connects to them.
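
When many LUNs are exported, the sysfs write shown above can be
scripted. The following is a minimal sketch and not part of SCST: the
function name set_prod_id and the SCST_ROOT variable are made up for
illustration, and only the devices/<name>/prod_id path comes from the
text above.

```shell
# Hypothetical helper: set the same PRODUCT IDENTIFICATION on several
# devices. SCST_ROOT defaults to the real sysfs tree; override it for a
# dry run against a scratch directory.
SCST_ROOT="${SCST_ROOT:-/sys/kernel/scst_tgt}"

set_prod_id() {
	prod_id="$1"
	shift
	for dev in "$@"; do
		attr="$SCST_ROOT/devices/$dev/prod_id"
		if [ ! -e "$attr" ]; then
			echo "no prod_id attribute for $dev" >&2
			return 1
		fi
		echo "$prod_id" >"$attr" || return 1
	done
}

# Usage (before VMware connects): set_prod_id SCST blockio1 blockio2
```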
|
|
|
|
|
|
Implementation notes
|
|
....................
|
|
|
|
WRITE SAME
|
|
~~~~~~~~~~
|
|
|
|
WRITE SAME command supports 2 modes:
|
|
|
|
1. Manual writing mode. In this mode WRITE SAME generates a set of
|
|
internal WRITE(16) SCSI commands to perform requested writing.
|
|
|
|
2. Remap mode. In this mode a dev handler, if it supports that, can remap
the blocks being written to a single block and then tell SCST to manually
write those parts of the requested area which for some reason cannot be
remapped.
|
|
|
|
In both cases dev handlers should call the scst_write_same() function
from their WRITE SAME command handler. As its second argument this
function takes an array of descriptors specifying where to write the
requested block of data. The last element in this array must have len 0.
If this argument is NULL, the whole area will be manually written by
SCST. NULL should be passed by dev handlers that do not support
remapping blocks.
|
|
|
|
User space dev handlers should use SCST_EXEC_REPLY_DO_WRITE_SAME
|
|
reply_type of SCST_USER_EXEC subcommand. See scst_user doc for more
|
|
info.
|
|
|
|
|
|
COMPARE AND WRITE
|
|
~~~~~~~~~~~~~~~~~
|
|
|
|
COMPARE AND WRITE is implemented by SCST as a set of read, compare and
write actions done atomically with respect to the affected blocks as
well as regular RESERVE SCSI commands. In particular, COMPARE AND WRITE
doesn't need any queue flushing, and an unlimited number of COMPARE AND
WRITE commands on different blocks can be executed simultaneously.
|
|
|
|
The read and write actions are implemented by generating internal
|
|
READ(16) and WRITE(16) SCSI commands.
|
|
|
|
COMPARE AND WRITE command is completely transparent to dev handlers
|
|
(they only see the corresponding READ(16) and WRITE(16) commands), so
|
|
it doesn't require any manual actions from them.
|
|
|
|
|
|
EXTENDED COPY
|
|
~~~~~~~~~~~~~
|
|
|
|
SCST implements EXTENDED COPY via an internal Copy Manager target. This
|
|
target has the following specific attribute in its sysfs:
|
|
|
|
- allow_not_connected_copy - if not set (default), an initiator can
|
|
perform copy only between devices it has direct access to via any
|
|
target/session. If set, any initiator can copy between any devices in
|
|
the system.
|
|
|
|
The Copy Manager has access only to those devices, for which it has LUNs
|
|
in /sys/kernel/scst_tgt/targets/copy_manager/copy_manager_tgt/luns/.
|
|
Devices from the scst_vdisk dev handler are added to it automatically upon
|
|
registration, but for other devices you need to manually add LUNs there
|
|
the same way as for any target driver. You can also delete any device at
|
|
any time from the Copy Manager visibility by deleting the corresponding
|
|
LUN from the sysfs. It might be useful during ALUA state switching.
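
For example, a pass-through device could be made visible to the Copy
Manager and later hidden again as sketched below. The luns/ path is the
one given above, and "add <device> <lun>" matches the luns/mgmt usage
elsewhere in this document. The "del <lun>" form, the helper names and
the SCST_ROOT variable are assumptions; check them against the help text
returned by reading the mgmt file.

```shell
SCST_ROOT="${SCST_ROOT:-/sys/kernel/scst_tgt}"
CM_LUNS="targets/copy_manager/copy_manager_tgt/luns"

# Make device $1 visible to the Copy Manager as LUN $2.
cm_add_lun() {
	echo "add $1 $2" >"$SCST_ROOT/$CM_LUNS/mgmt"
}

# Hide LUN $1 from the Copy Manager, e.g. during an ALUA state switch.
cm_del_lun() {
	echo "del $1" >"$SCST_ROOT/$CM_LUNS/mgmt"
}

# Usage: cm_add_lun 10:0:0:0 5; cm_del_lun 5
```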
|
|
|
|
Internally SCST implements EXTENDED COPY as generation of sets of
|
|
internal READ(16) and WRITE(16) SCSI commands. Dev handlers don't need
|
|
any manual actions to use it.
|
|
|
|
SCST also gives dev handlers the possibility to remap blocks instead of
copying them, if they support this feature. It allows them to perform
the EXTENDED COPY command much faster by just updating the metadata of
their backend storage, which is supposed to be nearly instantaneous.
|
|
|
|
To use this feature, a dev handler should set up the ext_copy_remap()
|
|
callback in its struct scst_dev_type. This callback is called by SCST
|
|
during EXTENDED COPY command processing to let the dev handler try to
|
|
remap the affected blocks first.
|
|
|
|
Upon finish, the dev handler should call scst_ext_copy_remap_done(). In
|
|
case of error, the dev handler should set the corresponding sense to cmd
|
|
and then also call scst_ext_copy_remap_done(cmd, NULL, 0).
|
|
|
|
If the dev handler is not able to remap some part of the segment, it should
|
|
kmalloc(), then fill all leftover subsegments and supply them to
|
|
scst_ext_copy_remap_done(). SCST then will copy the subsegments using
|
|
internal copy machine, then kfree() the supplied array. If the dev
|
|
handler is not able to remap the whole segment, it can simply directly
|
|
supply the original segment to scst_ext_copy_remap_done().
|
|
|
|
It is highly recommended that in normal circumstances dev handlers call
|
|
scst_ext_copy_remap_done() from another thread context than one where
|
|
ext_copy_remap() callback was originally called, because otherwise there
|
|
could be recursion in the segments processing. Hopefully, this thread
|
|
context switch is natural for such a potentially long operation as
|
|
EXTENDED COPY.
|
|
|
|
|
|
VMware and Ceph RBD space reclaim
|
|
---------------------------------
|
|
|
|
VMware with VMFS5 filesystem ignores UNMAP alignment, so if you use 4MB
|
|
Ceph RBD objects and VMFS5, only some discards will reclaim RBD space
|
|
due to 1MB discard not often hitting the tail of objects.
|
|
|
|
Thus, to have efficient ESXi space reclamation with RBD and VMFS5, you are
|
|
recommended to use 1 MB object size in Ceph.
|
|
|
|
See https://sourceforge.net/p/scst/mailman/message/35287598 thread for
|
|
details.
|
|
|
|
|
|
Caching
|
|
-------
|
|
|
|
By default for performance reasons VDISK FILEIO devices use write back
|
|
caching policy.
|
|
|
|
Generally, write back caching is safe to use and its danger is
|
|
greatly overestimated, because most modern (especially, Enterprise
|
|
level) applications are well prepared to work with write back cached
|
|
storage. Particularly, such are all transactions-based applications.
|
|
Those applications flush cache to completely avoid ANY data loss on a
|
|
crash or power failure. For instance, journaled file systems flush cache
|
|
on each meta data update, so they survive power/hardware/software
|
|
failures pretty well.
|
|
|
|
Since locally on initiators write back caching is always on, if an
|
|
application cares about its data consistency, it does flush the cache
|
|
when necessary or on any write, if it opens files with O_SYNC. If it doesn't
|
|
care, it doesn't flush the cache. As long as the cache flushes are
propagated to the storage, write back caching on it doesn't make any
difference. If an application doesn't flush the cache, it's doomed to
lose data in case of a crash or power failure, no matter where this
cache is located, locally or on the storage.
|
|
|
|
To illustrate that consider, for example, a user who wants to copy /src
|
|
directory to /dst directory reliably, i.e. after the copy finished no
|
|
power failure or software/hardware crash could lead to a loss of the
|
|
data in /dst. There are 2 ways to achieve this. Let's suppose for
|
|
simplicity cp opens files for writing with O_SYNC flag, hence bypassing
|
|
the local cache.
|
|
|
|
1. Slow. Make the device behind /dst work in write through caching
|
|
mode and then run "cp -a /src /dst".
|
|
|
|
2. Fast. Let the device behind /dst work in write back caching mode
|
|
and then run "cp -a /src /dst; sync". The reliability of the result is
|
|
the same, but it's much faster than (1). Nobody would care if a crash
|
|
happens during the copy, because after recovery simply leftovers from
|
|
the not completed attempt would be deleted and the operation would be
|
|
restarted from the very beginning.
|
|
|
|
So, you can see in (2) there is no danger of ANY data loss from the
|
|
write back caching. Moreover, since in practice cp doesn't open files
|
|
for writing with O_SYNC flag, to get the copy done reliably, sync
|
|
command must be called after cp anyway, so enabling write back caching
|
|
wouldn't make any difference for reliability.
|
|
|
|
Also you can consider it from another side. Modern HDDs have at least
|
|
16MB of cache working in write back mode by default, so for a 10-drive
|
|
RAID it is 160MB of a write back cache. How many people are happy with
|
|
it and how many disabled write back cache of their HDDs? Almost all and
|
|
almost nobody correspondingly? Moreover, many HDDs lie about state of
|
|
their cache and report write through while working in write back mode.
|
|
They are also successfully used.
|
|
|
|
Note, the Linux I/O subsystem guarantees to propagate cache flushes to the
|
|
storage only using data protection barriers, which are usually turned off by
|
|
default (see http://lwn.net/Articles/283161). Without barriers enabled
|
|
Linux doesn't provide a guarantee that after sync()/fsync() all written
|
|
data really hit permanent storage. They can be stored in the cache of
|
|
your backstorage devices and, hence, lost on a power failure event.
|
|
Thus, even with write-through cache mode, you still either need to
|
|
enable barriers on your backend file system on the target (for direct
|
|
/dev/sdX devices this is, indeed, impossible), or need a good UPS to
|
|
protect yourself from not committed data loss. Some info about barriers
|
|
from the XFS point of view could be found at
|
|
http://xfs.org/index.php/XFS_FAQ#Write_barrier_support. On Linux
|
|
initiators for Ext3 and ReiserFS file systems the barrier protection
|
|
could be turned on using "barrier=1" and "barrier=flush" mount options
|
|
correspondingly. You can check if the barriers are turned on or off by looking
|
|
in /proc/mounts. Windows and, AFAIK, other UNIX'es don't need any
|
|
special explicit options and do necessary barrier actions on write-back
|
|
caching devices by default.
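
The /proc/mounts check mentioned above can be done with a one-line
filter. This is only a sketch: the function name is made up, and it
simply prints mount entries whose option field mentions "barrier", so
file systems that do not list the option at all (for example because
their default is in effect) will not show up.

```shell
# Print device, mount point and options of all mounts that mention a
# barrier-related option. $1 defaults to /proc/mounts.
show_barrier_mounts() {
	awk '$4 ~ /barrier/ { print $1, $2, $4 }' "${1:-/proc/mounts}"
}

# Usage: show_barrier_mounts
```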
|
|
|
|
To limit this data loss with write back caching you can use files in
|
|
/proc/sys/vm to limit the amount of unflushed data in the system cache.
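
The files usually of interest here are dirty_background_ratio,
dirty_ratio and dirty_expire_centisecs (see the kernel's sysctl/vm
documentation for their exact meaning). A small sketch that prints their
current values; the helper name and the optional directory argument are
illustrative only.

```shell
# Print the current write back limits. $1 defaults to /proc/sys/vm.
show_dirty_limits() {
	dir="${1:-/proc/sys/vm}"
	for f in dirty_background_ratio dirty_ratio dirty_expire_centisecs; do
		[ -r "$dir/$f" ] && printf '%s = %s\n' "$f" "$(cat "$dir/$f")"
	done
	return 0
}

# Lowering a limit (as root), e.g.: echo 5 >/proc/sys/vm/dirty_ratio
```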
|
|
|
|
If you for some reason have to use VDISK FILEIO devices in write through
|
|
caching mode, don't forget to disable internal caching on their backend
|
|
devices or make sure they have additional battery or supercapacitors
|
|
power supply on board. Otherwise, on a power failure you would still
lose all the not yet saved data in the devices' internal caches.
|
|
|
|
Note, on some real-life workloads write through caching might perform
|
|
better than write back caching with the barrier protection turned on.
|
|
|
|
|
|
Errors caching
|
|
..............
|
|
|
|
When using a virtual device in FILEIO mode, the Linux page cache comes
|
|
into the picture. The negative side of it is that it's sometimes also
|
|
caching errored pages. That is, if the underlying file experiences IO
|
|
errors, those errors might be cached by the Linux page cache. As a
|
|
result, even when the underlying file recovers and stops failing IOs,
|
|
the initiator may still hit IO errors returned by the Linux page cache,
|
|
until the cache re-reads the errored pages (usually it happens pretty
|
|
soon, but not immediately). To make sure that cached pages are dropped,
|
|
one of the following can be done:
|
|
|
|
- Detach the SCSI virtual device (del_device) and re-attach it
|
|
(add_device). This should evict all the cached pages, unless somebody
|
|
else holds the same "filename" opened.
|
|
|
|
- Issue a BLKFLSBUF ioctl to the same "filename" you provided for "add_device".
|
|
|
|
For the second option, some rudimentary C code is required:
|
|
|
|
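#include <errno.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>	/* BLKFLSBUF */

/* Drop the cached pages of "filename"; returns 0 or a positive errno. */
int flush_buffers(const char *filename)
{
	int err = 0;
	int fd = open(filename, O_RDWR);

	if (fd < 0)
		return errno;
	if (ioctl(fd, BLKFLSBUF) < 0)
		err = errno;
	close(fd);
	return err;
}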
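
If "filename" is a block device, the same ioctl can be issued without
writing C code: the blockdev utility from util-linux calls BLKFLSBUF
when given --flushbufs. The wrapper below is only a sketch (the function
name is hypothetical); it refuses anything that is not a block device.

```shell
# Flush the buffer cache of block device $1 via BLKFLSBUF (needs root).
flush_blockdev_buffers() {
	if [ ! -b "$1" ]; then
		echo "$1 is not a block device" >&2
		return 1
	fi
	blockdev --flushbufs "$1"
}

# Usage: flush_blockdev_buffers /dev/drbd1
```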
|
|
|
|
|
|
BLOCKIO VDISK mode
|
|
------------------
|
|
|
|
This module works best for these types of scenarios:
|
|
|
|
1) Data that are not aligned to 4K sector boundaries and <4K block sizes
|
|
are used, which is normally found in virtualization environments where
|
|
operating systems start partitions on odd sectors (Windows and its
|
|
sector 63).
|
|
|
|
2) Large block data transfers normally found in database loads/dumps and
|
|
streaming media.
|
|
|
|
3) Advanced relational database systems that perform their own caching
|
|
which prefer or demand direct IO access and, because of the nature of
|
|
their data access, can actually see worse performance with
|
|
indiscriminate caching.
|
|
|
|
4) Multiple layers of targets where the secondary and above layers need
|
|
to have a consistent view of the primary targets in order to preserve
|
|
data integrity which a page cache backed IO type might not provide
|
|
reliably.
|
|
|
|
Also it has an advantage over FILEIO that it doesn't copy data between
|
|
the system cache and the commands data buffers, so it saves a
|
|
considerable amount of CPU power and memory bandwidth.
|
|
|
|
IMPORTANT: Since data in BLOCKIO and FILEIO modes are not consistent between
|
|
========= each other, if you try to use a device in both those modes
|
|
simultaneously, you will almost instantly corrupt your data
|
|
on that device.
|
|
|
|
IMPORTANT: Some kernels starting from 2.6.32 have a problem, which
|
|
========= prevents BLOCKIO from working correctly with RAID5/DM. See
|
|
http://lkml.org/lkml/2010/7/28/315. That problem was fixed in
|
|
2.6.32.19, 2.6.34.4, 2.6.35.2 and 2.6.36-rc1. It is strongly
|
|
recommended to not use affected kernels with BLOCKIO.
|
|
|
|
IMPORTANT: In SCST 1.x BLOCKIO worked by default in NV_CACHE mode, when
|
|
========= each device was reported to remote initiators as having write through
|
|
caching. But if your backend block device has internal write
|
|
back caching it might create a possibility of losing the data
held in that internal cache in case of a power
|
|
failure. Starting from SCST 2.0 BLOCKIO works by default in
|
|
non-NV_CACHE mode, when each device is reported to remote
|
|
initiators as having write back caching, and synchronizes the
|
|
internal device's cache on each SYNCHRONIZE_CACHE command
|
|
from the initiators. It might lead to some *PERFORMANCE LOSS*,
|
|
so if you are sure of your power supply and want to
|
|
restore the 1.x behavior, you should recreate your BLOCKIO
|
|
devices in NV_CACHE mode.
|
|
|
|
|
|
Pass-through mode
|
|
-----------------
|
|
|
|
In the pass-through mode (i.e. using the pass-through device handlers
|
|
scst_disk, scst_tape, etc) SCSI commands, coming from remote initiators,
|
|
are passed to local SCSI devices on target as is, without any
|
|
modifications.
|
|
|
|
SCST supports 1 to many pass-through, when several initiators can safely
|
|
connect to a single pass-through device (a tape, for instance). For such
|
|
cases SCST emulates all the necessary functionality.
|
|
|
|
In the sysfs interface all real SCSI devices are listed in
|
|
/sys/kernel/scst_tgt/devices in form host:channel:id:lun numbers, for
|
|
instance 1:0:0:0. The recommended way to match those numbers to your
|
|
devices is to use the lsscsi utility.
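
For a quick look from the shell, the host:channel:id:lun entries can be
filtered out of the devices directory and then compared with the output
of lsscsi. A sketch, with the function name and the SCST_ROOT override
being illustrative only:

```shell
SCST_ROOT="${SCST_ROOT:-/sys/kernel/scst_tgt}"

# Print the entries of devices/ whose names look like host:channel:id:lun.
list_real_scsi_devices() {
	ls "$SCST_ROOT/devices" 2>/dev/null |
		grep -E '^[0-9]+:[0-9]+:[0-9]+:[0-9]+$'
}

# Usage: list_real_scsi_devices; lsscsi
```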
|
|
|
|
Each pass-through dev handler has in its root subdirectory
|
|
/sys/kernel/scst_tgt/handlers/handler_name, e.g.
|
|
/sys/kernel/scst_tgt/handlers/dev_disk, "mgmt" file. It allows the
|
|
following commands. They can be sent to it using, e.g., echo command.
|
|
|
|
- "add_device" - this command assigns SCSI device with
|
|
host:channel:id:lun numbers to this dev handler.
|
|
|
|
echo "add_device 1:0:0:0" >/sys/kernel/scst_tgt/handlers/dev_disk/mgmt
|
|
|
|
will assign SCSI device 1:0:0:0 to this dev handler.
|
|
|
|
- "del_device" - this command unassigns SCSI device with
|
|
host:channel:id:lun numbers from this dev handler.
|
|
|
|
As usual, on read the "mgmt" file returns a short help text about the available
|
|
commands.
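
For example, the device assigned above could be released again, and the
built-in help consulted, as follows. The "del_device" syntax is assumed
to mirror "add_device"; the helper names and the SCST_ROOT variable are
hypothetical.

```shell
SCST_ROOT="${SCST_ROOT:-/sys/kernel/scst_tgt}"

# Send command "$2 $3" (e.g. "del_device 1:0:0:0") to dev handler $1.
handler_mgmt() {
	echo "$2 $3" >"$SCST_ROOT/handlers/$1/mgmt"
}

# Show the help text of dev handler $1.
handler_help() {
	cat "$SCST_ROOT/handlers/$1/mgmt"
}

# Usage: handler_mgmt dev_disk del_device 1:0:0:0; handler_help dev_disk
```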
|
|
|
|
The dev_disk handler also exposes the following attribute in its root
|
|
subdirectory:
|
|
|
|
- pr_dump_dir - when set to a directory path, causes each dev_disk
|
|
device to write its PR state to <dir>/<serial> at unregistration
|
|
time, after all in-flight commands have completed. The default value
|
|
is an empty string (dump disabled). The filename is the device's
|
|
SCSI Unit Serial Number. The file format is the same as pr_state.
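
A possible way to enable the dump, assuming the attribute lives directly
in the dev_disk handler directory as described above. The directory
/var/lib/scst/pr_dump and the helper name are arbitrary examples, not
SCST defaults.

```shell
SCST_ROOT="${SCST_ROOT:-/sys/kernel/scst_tgt}"

# Create directory $1 if needed and point pr_dump_dir at it.
enable_pr_dump() {
	mkdir -p "$1" || return 1
	echo "$1" >"$SCST_ROOT/handlers/dev_disk/pr_dump_dir"
}

# Usage: enable_pr_dump /var/lib/scst/pr_dump
```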
|
|
|
|
You need to manually assign each of your real SCSI devices to the
|
|
corresponding pass-through dev handler using the "add_device" command,
|
|
otherwise the real SCSI devices will not be visible remotely. The
|
|
assignment isn't done automatically, because it could lead to the
|
|
pass-through dev handlers load and initialization problems if any of the
|
|
local real SCSI devices are malfunctioning.
|
|
|
|
As any other hardware, the local SCSI hardware can not handle commands
|
|
with amount of data and/or segments count in scatter-gather array bigger
|
|
than certain values. Therefore, when using the pass-through mode you should note
|
|
that values for maximum number of segments and maximum amount of
|
|
transferred data (max_sectors) for each SCSI command on devices on
|
|
initiators cannot be bigger than the corresponding values of the
|
|
corresponding SCSI devices on the target. Otherwise you will see
|
|
symptoms like small transfers work well, but large ones stall and
|
|
messages like: "Unable to complete command due to SG IO count
|
|
limitation" are printed in the kernel logs.
|
|
|
|
You can't control from the user space limit of the scatter-gather
|
|
segments, but for block devices usually it is sufficient if you set on
|
|
the initiators /sys/block/DEVICE_NAME/queue/max_sectors_kb to the same
|
|
or a lower value than in /sys/block/DEVICE_NAME/queue/max_hw_sectors_kb for
|
|
the corresponding devices on the target.
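
This can be sketched as follows. The helper is hypothetical: it takes
the target's max_hw_sectors_kb value as a plain number (it has to be
read on the target and carried over to the initiator by whatever means
you prefer) and lowers the initiator-side max_sectors_kb only if it is
currently larger. The SYS_BLOCK override exists only to allow a dry run.

```shell
SYS_BLOCK="${SYS_BLOCK:-/sys/block}"

# $1: initiator block device name (e.g. sdd)
# $2: max_hw_sectors_kb of the corresponding device on the target
clamp_max_sectors_kb() {
	attr="$SYS_BLOCK/$1/queue/max_sectors_kb"
	cur=$(cat "$attr") || return 1
	if [ "$cur" -gt "$2" ]; then
		echo "$2" >"$attr"
	fi
}

# Usage on the initiator (as root): clamp_max_sectors_kb sdd 512
```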
|
|
|
|
For non-block devices SCSI commands are usually generated directly by
|
|
applications, so, if you experience large transfers stalls, you should
|
|
check documentation for your application how to limit the transfer
|
|
sizes.
|
|
|
|
Another way to solve this issue is to build SG entries with more than 1
|
|
page each. See the following patch as an example:
|
|
http://scst.sourceforge.net/sgv_big_order_alloc.diff
|
|
|
|
|
|
User space mode using scst_user dev handler
|
|
-------------------------------------------
|
|
|
|
The user space program fileio_tgt uses the interface of the scst_user dev handler
|
|
and lets you see how it works in various modes. Fileio_tgt provides
|
|
mostly the same functionality as scst_vdisk handler with the most
|
|
noticeable difference that it supports O_DIRECT mode. O_DIRECT mode is
|
|
basically the same as BLOCKIO, but also supports files, so for some
|
|
loads it could be significantly faster than the regular FILEIO access.
|
|
All the words about BLOCKIO from above apply to O_DIRECT as well. See
|
|
fileio_tgt's README file for more details.
|
|
|
|
|
|
Performance
|
|
-----------
|
|
|
|
SCST from the very beginning has been designed and implemented to
|
|
provide the best possible performance. Since there is no "one size fits
all" best-performance configuration for different setups and loads, SCST
|
|
provides extensive set of settings to allow to tune it for the best
|
|
performance in each particular case. You don't necessarily have to use
|
|
those settings. If you don't, SCST will do a very good job of autotuning for
|
|
you, so the resulting performance will, on average, be better
|
|
(sometimes, much better) than with other SCSI targets. But in some cases
|
|
you can by manual tuning improve it even more.
|
|
|
|
Before doing any performance measurements note that performance results
|
|
are very much dependent on your type of load, so it is crucial that
|
|
you choose access mode (FILEIO, BLOCKIO, O_DIRECT, pass-through), which
|
|
suits your needs the best.
|
|
|
|
In order to get the maximum performance you should:
|
|
|
|
1. For SCST:
|
|
|
|
- Disable in Makefile and scst.h CONFIG_SCST_STRICT_SERIALIZING,
|
|
CONFIG_SCST_EXTRACHECKS, CONFIG_SCST_TRACING, CONFIG_SCST_DEBUG*,
|
|
CONFIG_SCST_STRICT_SECURITY.
|
|
|
|
2. For target drivers:
|
|
|
|
- Disable in Makefiles CONFIG_SCST_EXTRACHECKS, CONFIG_SCST_TRACING,
|
|
CONFIG_SCST_DEBUG*
|
|
|
|
3. For device handlers, including VDISK:
|
|
|
|
- Disable in Makefile CONFIG_SCST_TRACING and CONFIG_SCST_DEBUG.
|
|
|
|
Note, by disabling CONFIG_SCST_TRACING and CONFIG_SCST_DEBUG you are
|
|
disabling many useful SCST diagnostic messages, which can significantly
|
|
help in many troubleshooting cases. So, you may consider keeping
|
|
CONFIG_SCST_TRACING, its performance impact is very limited.
|
|
|
|
IMPORTANT: The development version of SCST in the SVN is optimized for
=========  development and bug hunting, not for performance. This means
           it is MUCH slower (multiple times). To reconfigure SCST for
           release, run the "make 2release" command in the root of
           your source code (e.g. trunk/). It will set the above options
           as needed. The only option it doesn't set is
           CONFIG_SCST_TEST_IO_IN_SIRQ, so, if needed, you should change
           it manually. There is also a so-called "performance" build
           mode, which you can activate with the "make 2perf" command.
           Its only difference from the release build mode is that the
           CONFIG_SCST_TRACING option is disabled. Because of that, you
           won't be able to see many important SCST run-time logging
           messages. This mode is intended for evaluating the impact of
           CONFIG_SCST_TRACING on performance and is not recommended
           for production.
|
|
|
|
IMPORTANT: You can't use debug SCST drivers with non-debug SCST core.
|
|
========= So, after disabling both CONFIG_SCST_TRACING and CONFIG_SCST_DEBUG
|
|
for SCST core you have to disable them for all SCST drivers
|
|
you are using as well.
|
|
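As a quick sanity check before measuring, the following sketch lists
the debugging options named in items 1-3 that are still enabled in a
set of Makefiles. It assumes the options are switched on by uncommented
lines carrying "-DCONFIG_SCST_..." compiler flags. The function name is
hypothetical, and "#define" lines in scst.h are not parsed.

```shell
#!/bin/sh
# Print the CONFIG_SCST_* debugging options found on uncommented lines
# of the given Makefiles (assumed form: "... -DCONFIG_SCST_TRACING").
list_scst_debug_opts() {
    grep -h -v '^[[:space:]]*#' "$@" 2>/dev/null |
        grep -o -E 'CONFIG_SCST_(STRICT_SERIALIZING|EXTRACHECKS|TRACING|DEBUG[A-Z_]*|STRICT_SECURITY)' |
        sort -u
}

# Usage (the paths are examples only):
#   list_scst_debug_opts scst/src/Makefile scst/src/dev_handlers/Makefile
```

An empty result suggests that none of the listed options is enabled in
the checked files. It is no substitute for running "make 2release".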
|
|
4. Make sure you have io_grouping_type option set correctly, especially
|
|
in the following cases:
|
|
|
|
- Several initiators share your target's backstorage. It can be a
shared LU using some cluster FS, like VMFS, or it can be different LUs
located on the same backstorage (RAID array), for instance, 3
initiators, each using its own dedicated FILEIO device file from the
same RAID-6 array on the target.

In this case, for the best performance you should set the
io_grouping_type option to "never" in all the targets and security
groups of those LUNs.
|
|
|
|
- Your initiator is connected to your target in MPIO mode. In this case,
for the best performance you should:

* Either connect all the sessions from the initiator to a single
target or security group and set the io_grouping_type option to
"this_group_only" in that target or security group,

* Or, if it isn't possible to connect all the sessions from the
initiator to a single target or security group, assign the same
numeric io_grouping_type value to each target/security group this
initiator is connected to. The exact value itself doesn't matter;
what matters is that all the targets/security groups use the same
value.
|
|
|
|
Don't forget that io_grouping_type makes sense only if you use the CFQ
I/O scheduler on the target, and only for devices with threads_num >= 0
and, if threads_num > 0, with threads_pool_type "per_initiator".
|
|
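A minimal sketch of applying one io_grouping_type value to a target and
all of its security groups follows. It assumes that each of them
exposes an io_grouping_type attribute somewhere below the target's
sysfs directory; the example path and the function name are
assumptions, not something this section specifies.

```shell
#!/bin/sh
# Write VALUE into every io_grouping_type attribute found below ROOT.
# Accepted values follow the text: auto, this_group_only, never or a number.
set_io_grouping_type() {
    root=$1 value=$2
    case $value in
        auto|this_group_only|never) ;;
        ''|*[!0-9]*) echo "unsupported value: $value" >&2; return 1 ;;
    esac
    find "$root" -name io_grouping_type -type f | while IFS= read -r attr; do
        echo "$value" > "$attr" && echo "$attr = $value"
    done
}

# Usage (assumed sysfs location, replace DRIVER and TARGET):
#   set_io_grouping_type /sys/kernel/scst_tgt/targets/DRIVER/TARGET never
```

For the MPIO case with several targets, the same numeric value can be
passed once per target directory.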
|
|
You can check whether io_grouping_type is set correctly in your setup,
as well as whether the "auto" io_grouping_type value works for you, with
tests like the following:
|
|
|
|
- For the non-MPIO case, run a single-threaded sequential read, e.g.
using buffered dd, from one initiator, then run the same
single-threaded sequential read from a second initiator in parallel. If
io_grouping_type is set correctly, the aggregate throughput measured
on the target should decrease only slightly and all initiators
should get a nearly equal share of it. If io_grouping_type is not set
correctly, the aggregate throughput and/or the throughput on any
initiator will decrease significantly, by a factor of 2 or even more.
For instance, suppose you get 80MB/s for a single-threaded sequential
read from the target on any initiator. When both initiators then read
in parallel, you should see on the target an aggregate throughput of
something like 70-75MB/s with a correct io_grouping_type, and something
like 35-40MB/s, or 8-10MB/s on any initiator, with an incorrect one.

- For the MPIO case it's much easier. With an incorrect io_grouping_type
you simply won't see a performance increase from adding the second
session (assuming your hardware is capable of transferring data through
both sessions in parallel), or you may even see a performance decrease.
|
|
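The single-threaded sequential read used in the non-MPIO test can be
sketched as a small dd wrapper. The function name and the default size
are hypothetical. Start it by hand on each initiator and compare the
reported rates.

```shell
#!/bin/sh
# Read MB megabytes sequentially from DEV with buffered dd and print
# dd's final summary line, which contains the measured transfer rate.
seq_read() {
    dev=$1 mb=${2:-1024}
    dd if="$dev" of=/dev/null bs=1M count="$mb" 2>&1 | tail -n 1
}

# Usage on an initiator (sdX is the device imported from the target):
#   echo 3 > /proc/sys/vm/drop_caches   # avoid reading from the local cache
#   seq_read /dev/sdX 4096
```

Run it first on one initiator alone, then on two initiators at the same
time, and compare the results with the expectations described above.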
|
|
5. If you are going to use your target in a VM environment, for
instance as shared storage with VMware, make sure all your VMs are
connected to the target via *separate* sessions. For instance, for iSCSI
it means that each VM has its own connection to the target, rather than
all VMs sharing a single connection. You can check it using the SCST
sysfs interface. For other transports you should use the available
facilities, like NPIV for Fibre Channel, to make separate sessions for
each VM. If you miss this, you can lose a great deal of performance on
parallel access to your target from different VMs. This doesn't apply
if your VMs are using the same shared storage, like with VMFS, for
instance. In that case all your VM hosts will be connected to the target
via separate sessions, which is enough.
|
|
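One way to do the sysfs check mentioned in item 5 is to count the
session directories of every target. The sketch below assumes the
layout ROOT/DRIVER/TARGET/sessions/SESSION with one directory per
session; the default root path and the function name are assumptions.

```shell
#!/bin/sh
# Print the number of sessions for every target found below ROOT.
count_sessions() {
    root=${1:-/sys/kernel/scst_tgt/targets}
    for sess_dir in "$root"/*/*/sessions; do
        [ -d "$sess_dir" ] || continue
        # Each session is expected to be one subdirectory of "sessions".
        n=$(( $(find "$sess_dir" -mindepth 1 -maxdepth 1 -type d | wc -l) ))
        echo "${sess_dir%/sessions}: $n session(s)"
    done
}

# Usage: count_sessions
```

With N VMs connected separately, the target should report at least N
sessions.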
|
|
6. For other target and initiator software parts:
|
|
|
|
- Make sure you have applied all available SCST patches to your kernel.
If no such patches exist for your kernel version, it is strongly
recommended to upgrade your kernel to a version for which they exist.
|
|
|
|
- Don't enable debug/hacking features in the kernel, i.e. use them as
|
|
they are by default.
|
|
|
|
- The default kernel read-ahead and queuing settings are optimized
for locally attached disks, so they are not optimal for disks
attached remotely (the SCSI target case), which sometimes could lead to
unexpectedly low throughput. You should increase the read-ahead size to
at least 512KB or even more on all initiators and the target.
|
|
|
|
You should also limit the maximum number of sectors per SCSI command
on all initiators. This tuning is also recommended on targets with
large read-ahead values. To do it on Linux, run:
|
|
|
|
echo 64 > /sys/block/sdX/queue/max_sectors_kb
|
|
|
|
where X is the letter of the device imported from the target, like
'b', i.e. sdb.
|
|
|
|
To increase read-ahead size on Linux, run:
|
|
|
|
blockdev --setra N /dev/sdX
|
|
|
|
where N is a read-ahead number in 512-byte sectors and X is a device
|
|
letter like above.
|
|
|
|
Note: you need to set the read-ahead value for device sdX again after
you have changed the maximum number of sectors per SCSI command for that
device.

Note2: you need to restart SCST after you have changed the read-ahead
settings on the target. This is a limitation of the Linux read-ahead
implementation: it reads the RA values for each file only when the file
is opened and does not update them when the global RA parameters change.
Hence the need for vdisk to reopen all its files/devices.
|
|
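The two settings and their ordering can be combined in one small
helper. This is a sketch only: the function names are hypothetical, and
the 512KB figure is simply the minimum read-ahead suggested above
converted to 512-byte sectors.

```shell
#!/bin/sh
# Convert a read-ahead size in KB to the 512-byte sectors that
# "blockdev --setra" expects (1 KB = 2 sectors).
ra_kb_to_sectors() {
    echo $(( $1 * 2 ))
}

# Set max_sectors_kb first and the read-ahead second, because the
# read-ahead has to be set again after max_sectors_kb is changed.
tune_imported_disk() {
    disk=$1 max_kb=$2 ra_kb=$3
    echo "$max_kb" > "/sys/block/$disk/queue/max_sectors_kb" || return 1
    blockdev --setra "$(ra_kb_to_sectors "$ra_kb")" "/dev/$disk"
}

# Usage (as root): tune_imported_disk sdb 64 512
```

Here 512KB of read-ahead corresponds to "blockdev --setra 1024".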
|
|
- You may need to increase the number of requests that the OS on the
initiator sends to the target device. To do it on Linux initiators, run:
|
|
|
|
echo 64 > /sys/block/sdX/queue/nr_requests
|
|
|
|
where X is a device letter like above.
|
|
|
|
You may also experiment with other parameters in the /sys/block/sdX
directory; they also affect performance. If you find the best values,
please share them with us.
|
|
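When experimenting, it helps to record the queue parameters before and
after each change so that runs can be compared. A minimal sketch with a
hypothetical function name, assuming the parameters are plain readable
files in one directory such as /sys/block/sdX/queue:

```shell
#!/bin/sh
# Print every readable regular file in DIR as "name=value".
dump_queue_params() {
    dir=$1
    for f in "$dir"/*; do
        [ -f "$f" ] && [ -r "$f" ] || continue
        printf '%s=%s\n' "${f##*/}" "$(cat "$f" 2>/dev/null)"
    done
}

# Usage:
#   dump_queue_params /sys/block/sdb/queue > before.txt
#   ... change settings, rerun the benchmark ...
#   dump_queue_params /sys/block/sdb/queue | diff before.txt -
```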
|
|
- On the target use the CFQ IO scheduler. In most cases it has a
performance advantage over other IO schedulers, sometimes a huge one
(2+ times aggregate throughput increase).
|
|
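To see which IO scheduler a backend device currently uses, the active
entry can be extracted from the scheduler attribute, which marks it
with square brackets. This is a sketch with a hypothetical function
name; which schedulers are offered depends on the kernel, and CFQ is
not available on all of them.

```shell
#!/bin/sh
# Print the active scheduler, i.e. the name shown in [brackets] in a
# file like /sys/block/sdX/queue/scheduler.
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p' "$1"
}

# Usage:
#   active_scheduler /sys/block/sdb/queue/scheduler
#   echo cfq > /sys/block/sdb/queue/scheduler   # select CFQ if offered
```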
|
|
- It is recommended to turn the kernel preemption off, i.e. set
|
|
the kernel preemption model to "No Forced Preemption (Server)".
|
|
|
|
- XFS appears to be the best filesystem on the target to store device
files, because it allows considerably better linear write throughput
than ext3.
|
|
|
|
7. For hardware on the target:

- Make sure that your target hardware (e.g. target FC or network card)
and the underlying IO hardware (e.g. an IO card, like SATA, SCSI or
RAID, to which your disks are connected) don't share the same PCI bus.
You can check it using the lspci utility. They have to work in
parallel, so it is better if they don't compete for the bus. The
problem is not only the bandwidth, which they have to share, but also
the interaction between the cards during that competition. This is very
important, because in some cases, if the target and backend storage
controllers share the same PCI bus, performance can be 5-10 times
lower than expected. Moreover, some motherboards (by Supermicro,
particularly) have serious stability issues if there are several
high-speed devices on the same bus working in parallel. If you have no
choice but to share the PCI bus, set the PCI latency in the BIOS as low
as possible.
|
|
|
|
8. If you use the VDISK IO module in FILEIO mode, the NV_CACHE option
will provide you the best performance. But when using it, make sure you
use a good UPS with the ability to shut down the target on power failure.

You can find baseline performance numbers in these measurements:
http://lkml.org/lkml/2009/3/30/283.
|
|
|
|
IMPORTANT: If you use some versions of Windows (at least W2K) on the
=========  initiator, you can't get good write performance for VDISK
           FILEIO devices with the default 512-byte block size. You
           could get about 10% of the expected performance. This is
           because of the partition alignment, which is (simplifying)
           incompatible with how the Linux page cache works, so for
           each write the corresponding block must be read first. Use
           a 4096-byte block size for VDISK devices and you will have
           the expected write performance. Actually, any OS on
           initiators, not only Windows, will benefit from a block size
           of max(PAGE_SIZE, BLOCK_SIZE_ON_UNDERLYING_FS), where
           PAGE_SIZE is the page size and BLOCK_SIZE_ON_UNDERLYING_FS
           is the block size of the underlying FS on which the device
           file is located, or 0 if a device node is used. Both values
           are taken from the target. See also the important notes
           about setting block sizes >512 bytes for VDISK FILEIO
           devices above.
|
|
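The max(PAGE_SIZE, BLOCK_SIZE_ON_UNDERLYING_FS) rule can be evaluated
on the target as sketched below. The function name is hypothetical, and
the "stat -f -c %S" call in the usage line assumes GNU coreutils.

```shell
#!/bin/sh
# Return max(PAGE_SIZE, BLOCK_SIZE_ON_UNDERLYING_FS). Pass 0 as the
# second argument if a device node is used instead of a device file.
recommended_block_size() {
    page=$1 fs_block=$2
    if [ "$fs_block" -gt "$page" ]; then
        echo "$fs_block"
    else
        echo "$page"
    fi
}

# Usage on the target (the file path is an example only):
#   recommended_block_size "$(getconf PAGESIZE)" \
#       "$(stat -f -c %S /path/to/device_file)"
```

With a 4096-byte page and a device node, the result is 4096, which
matches the block size recommended above.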
|
|
|
|
9. In some cases, for instance when working with SSD devices whose
internal data transfer threads consume 100% of a single CPU, maximizing
IOPS may require assigning dedicated CPUs to those threads. Consider
using the cpu_mask attribute for devices with threads_pool_type
"per_initiator" or the Linux CPU affinity facilities for other
threads_pool_types. No IRQ processing should be done on those CPUs.
Check that using /proc/interrupts. See the taskset command and
Documentation/IRQ-affinity.txt in your kernel's source tree for how to
assign CPU affinity to tasks and IRQs.

The reason is that processing of incoming commands in SIRQ context
might be done on the same CPUs as the SSD devices' threads doing data
transfers. As a result, those threads won't receive all the processing
power of those CPUs and will perform worse.
|
|
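CPU affinity settings are commonly expressed as hexadecimal bit masks,
for example by taskset and by /proc/irq/N/smp_affinity. The sketch
below builds such a mask from a list of CPU numbers. The function name
is hypothetical, and whether cpu_mask accepts exactly this format is an
assumption to verify on your system.

```shell
#!/bin/sh
# Build a hexadecimal affinity mask with one bit set per listed CPU.
cpus_to_mask() {
    mask=0
    for cpu in "$@"; do
        mask=$(( mask | (1 << cpu) ))
    done
    printf '%x\n' "$mask"
}

# Usage (PID and N are placeholders):
#   taskset -p "$(cpus_to_mask 2 3)" PID              # pin a thread to CPUs 2-3
#   echo "$(cpus_to_mask 0 1)" > /proc/irq/N/smp_affinity   # keep IRQ N on CPUs 0-1
```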
|
|
10. If your storage is capable of operating at the level of hundreds of
thousands of IOPS, you can use the poll_us sysfs attribute to set how
many microseconds each SCST thread polls its queue after it becomes
empty, in the hope that a new command will arrive. In some cases polling
can significantly increase IOPS, especially if CPU low power states are
not disabled, because at high IOPS polling can be cheaper than spending
significant time on entering and then exiting CPU low power states, plus
the corresponding context switches. Polling is disabled by default. The
recommended value to start from is 5-10 us. Then you can increase or
decrease it to see whether your IOPS increase or decrease.
|
|
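Trying several poll_us values can be automated with a small loop. This
is a sketch under assumptions: the function name is hypothetical, the
attribute path in the usage line is assumed, and the benchmark command
is whatever IOPS test you already use.

```shell
#!/bin/sh
# For each value, write it to the poll_us attribute ATTR, then run the
# benchmark command BENCH and print its output next to the value.
sweep_poll_us() {
    attr=$1 bench=$2
    shift 2
    for us in "$@"; do
        echo "$us" > "$attr" || return 1
        printf 'poll_us=%s: ' "$us"
        sh -c "$bench"
    done
}

# Usage (assumed attribute path, hypothetical benchmark script):
#   sweep_poll_us /sys/kernel/scst_tgt/poll_us ./measure_iops.sh 0 5 10 20
```

A value of 0 is included in the usage line on the assumption that it
corresponds to the default, with polling disabled.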
|
|
|
|
Commands suspending takes too long
|
|
----------------------------------
|
|
|
|
SCST suspends commands during some management activities, like
adding/deleting LUNs or devices. This is done to have lockless LUN
translation on the hot command processing path, which brings a
significant performance advantage. You will see a message like "Waiting
for X active commands to complete" when this wait starts.

The downside is that no new commands start executing until the older
ones, which had started before the suspending began, have finished. This
wait cannot be longer than the worst command latency any of your
initiators is seeing at that particular time.

So, if this wait takes too long, in the majority of cases it means that
you are overloading your storage. A proper storage should have a worst
case latency below a few hundred milliseconds. In that case the SCST
suspending will finish in a few hundred milliseconds at worst.
|
|
|
|
Another case when suspending can take too long is a hung user space
device (i.e. scst_user device) not responding to any command. In this
case you should kill the corresponding user space program to finish
suspending.
|
|
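To see how often suspending had to wait, the kernel log can be searched
for the message quoted above. A minimal sketch with a hypothetical
function name, assuming the message text has the form given in this
section, with a number in place of X:

```shell
#!/bin/sh
# Count log lines on standard input that report a suspend wait.
count_suspend_waits() {
    grep -c 'Waiting for [0-9][0-9]* active commands to complete'
}

# Usage: dmesg | count_suspend_waits
```

Timestamps on the matching lines and on the lines that follow them can
then be compared to estimate how long each wait lasted.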
|
|
|
|
Work if target's backstorage or link is too slow
|
|
------------------------------------------------
|
|
|
|
Under high I/O load, when your target's backstorage gets overloaded, or
when working over a slow link between initiator and target, when the
link can't serve all the queued commands on time, you can experience I/O
stalls or see abort or reset messages in the kernel log.

First, consider the case of a too slow target backstorage. On some
seek-intensive workloads even fast disks or RAIDs, which are able to
serve a continuous data stream at 500+ MB/s, can be as slow as 0.3 MB/s.
Another possible cause can be MD/LVM/RAID on your target, as in
http://lkml.org/lkml/2008/2/27/96 (check the whole thread as well).

Thus, in such situations the processing of one or more commands simply
takes too long, hence the initiator decides that they are stuck on the
target and tries to recover. In particular, it is known that the default
number of simultaneously queued commands (48) is sometimes too high if
you do intensive writes from VMware to a target disk that uses LVM in
snapshot mode. In this case a value like 16 or even 8-10, depending on
your backstorage speed, could be more appropriate.
|
|
|
|
There are 6 possible actions you can take to work around or fix such
issues:

1. Ignore incoming task management (TM) commands. It's fine if there are
not too many of them, so average performance isn't hurt and the
corresponding device isn't put offline, i.e. if the backstorage
isn't way too slow.
|
|
|
|
2. Decrease /sys/block/sdX/device/queue_depth on the initiator, if it
runs Linux (see below how), and/or the SCST_MAX_TGT_DEV_COMMANDS constant
in the scst_priv.h file until you stop seeing incoming TM commands.
The ISCSI-SCST driver also has its own iSCSI-specific parameter for
that, see its README file.

To decrease the device queue depth on Linux initiators, run the command:
|
|
|
|
# echo Y >/sys/block/sdX/device/queue_depth
|
|
|
|
where Y is the new number of simultaneously queued commands and X is
your imported device letter, like 'a' for the sda device. There are no
special limitations on the Y value: it can be any value from 1 to the
possible maximum (usually 32), so start by dividing the current value
by 2, i.e. set 16 if /sys/block/sdX/device/queue_depth contains 32.
|
|
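The "divide the current value by 2" step can be scripted as follows.
This is a sketch only: the function name is hypothetical, and it never
goes below 1, the minimum stated above.

```shell
#!/bin/sh
# Halve the queue depth stored in ATTR, keeping it at 1 or more, and
# print the old and new values.
halve_queue_depth() {
    attr=$1
    cur=$(cat "$attr") || return 1
    new=$(( cur / 2 ))
    [ "$new" -ge 1 ] || new=1
    echo "$new" > "$attr" && echo "$cur -> $new"
}

# Usage (as root): halve_queue_depth /sys/block/sdb/device/queue_depth
```

Repeat it until the incoming TM commands stop, then keep the last value.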
|
|
3. Increase the corresponding timeout on the initiator. On Linux it is
located in
/sys/devices/platform/host*/session*/target*:0:0/*:0:0:1/timeout. It can
be set automatically by a udev rule. For instance, the following
rule will increase it to 300 seconds:
|
|
|
|
SUBSYSTEM=="scsi", KERNEL=="[0-9]*:[0-9]*", ACTION=="add", ATTR{type}=="0|7|14", ATTR{timeout}="300"
|
|
|
|
By default, this timeout is 30 or 60 seconds, depending on your distribution.
|
|
|
|
4. Try to avoid such seek intensive workloads.
|
|
|
|
5. Increase speed of the target's backstorage.
|
|
|
|
6. Implement QoS in SCST, so that the queue depth on the target is
dynamically adjusted and hence the worst-case latencies seen by
initiators are controlled.
|
|
|
|
Next, consider the case of a too slow link between initiator and target,
when the initiator tries to simultaneously push N commands to the target
over it. In this case the time to serve those commands, i.e. to send or
receive data for them over the link, can exceed the timeout for any
single command. Hence one or more commands at the tail of the queue
cannot be served within the timeout, so the initiator will decide that
they are stuck on the target and will try to recover.

To work around or fix the issue in this case you can use ways 1, 2, 3
above or (7): increase the speed of the link between target and
initiator.
|
|
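The slow link case comes down to simple arithmetic: the time to drain
the queue is roughly the number of queued commands times the data per
command, divided by the link throughput. The sketch below uses a
hypothetical function name and illustrative numbers, not measurements.

```shell
#!/bin/sh
# Estimate, rounded up to whole seconds, how long a link needs to
# transfer the data of CMDS queued commands of KB_PER_CMD kilobytes
# each at LINK_KB_PER_S kilobytes per second.
queue_drain_seconds() {
    cmds=$1 kb_per_cmd=$2 link_kb_per_s=$3
    echo $(( (cmds * kb_per_cmd + link_kb_per_s - 1) / link_kb_per_s ))
}

# Usage: queue_drain_seconds 32 512 128
```

For example, 32 queued commands of 512KB each over a 128KB/s link need
about 128 seconds, well above a 30 or 60 second command timeout, so the
queue depth has to be lowered or the timeout raised.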
|
|
Note that logged messages about QUEUE_FULL status are quite different
by nature. This is normal operation, just SCSI flow control in action.
Simply don't enable the "mgmt_minor" logging level or, alternatively, if
you are confident in the worst case performance of your back-end storage
or initiator-target link, you can increase SCST_MAX_TGT_DEV_COMMANDS in
scst_priv.h to 64. Usually initiators don't try to push more commands to
the target.
|
|
|
|
IMPORTANT
|
|
=========
|
|
|
|
There must be LUN 0 in each security group, i.e. LUN numbering must not
start from, e.g., 1. Otherwise you will see no devices on remote
initiators and the SCST core will write this message into the kernel
log: "tgt_dev for LUN 0 not found, command to unexisting LU?"
|
|
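The LUN 0 requirement can be checked with a short script. The sketch
below assumes that every target and every security group has a "luns"
directory in sysfs with one entry per LUN number; the default root path
and the function name are assumptions.

```shell
#!/bin/sh
# Print every directory below ROOT whose "luns" subdirectory has no
# entry named "0".
find_groups_without_lun0() {
    root=${1:-/sys/kernel/scst_tgt/targets}
    find "$root" -type d -name luns | sort | while IFS= read -r luns; do
        [ -e "$luns/0" ] || echo "${luns%/luns}"
    done
}

# Usage: find_groups_without_lun0
```

A target whose initiators are all mapped to explicit security groups
may be reported for its own unused "luns" directory. Treat such a line
as a hint to double-check, not as an error.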
|
|
IMPORTANT
|
|
=========
|
|
|
|
All the access control must be fully configured BEFORE loading the
corresponding target driver! When you load a target driver or enable
target mode in it, as for the qla2x00t driver, it will immediately start
accepting new connections, hence creating new sessions, and those new
sessions will be assigned to security groups according to the
*currently* configured access control settings. For instance, to the
"Default" group instead of "HOST004" as you may need, because "HOST004"
doesn't exist yet. So, one must configure all the security groups before
new connections from the initiators are created, i.e. before the target
drivers are loaded.
|
|
|
|
Access controls can be altered after the target driver has been loaded,
as long as the target session doesn't yet exist. And even if the
session already exists, changes are still possible, but won't be
reflected on the initiator side.

So, the safest choice is to configure all the access control before
loading any target driver and then only add new devices to new groups
for new initiators or add new devices to old groups, without altering
the existing LUNs in them.
|
|
|
|
|
|
Credits
|
|
-------
|
|
|
|
Thanks to:
|
|
|
|
* Mark Buechler <mark.buechler@gmail.com> for a lot of useful
|
|
suggestions, bug reports and help in debugging.
|
|
|
|
* Ming Zhang <mingz@ele.uri.edu> for fixes and comments.
|
|
|
|
* Nathaniel Clark <nate@misrule.us> for fixes and comments.
|
|
|
|
* Calvin Morrow <calvin.morrow@comcast.net> for testing and useful
|
|
suggestions.
|
|
|
|
* Hu Gang <hugang@soulinfo.com> for the original version of the
|
|
LSI target driver.
|
|
|
|
* Erik Habbinga <erikhabbinga@inphase-tech.com> for fixes and support
|
|
of the LSI target driver.
|
|
|
|
* Ross S. W. Walker <rswwalker@hotmail.com> for BLOCKIO inspiration
|
|
and Vu Pham <huongvp@yahoo.com> who implemented it for VDISK dev handler.
|
|
|
|
* Alessandro Premoli <a.premoli@andxor.it> for fixes
|
|
|
|
* Terry Greeniaus <tgreeniaus@yottayotta.com> for fixes.
|
|
|
|
* Krzysztof Blaszkowski <kb@sysmikro.com.pl> for many fixes and bug reports.
|
|
|
|
* Jianxi Chen <pacers@users.sourceforge.net> for fixing problem with
|
|
devices >2TB in size
|
|
|
|
* Bart Van Assche <bvanassche@acm.org> for a lot of help
|
|
|
|
* University of New Hampshire Interoperability Labs (UNH IOL, http://www.iol.unh.edu)
|
|
for UNH-iSCSI project (http://www.iol.unh.edu/consortiums/iscsi/index.html)
|
|
on which interface between SCST core and target drivers was based.
|
|
|
|
* Daniel Debonzi <debonzi@linux.vnet.ibm.com> for a big part of the
|
|
initial SCST sysfs tree implementation
|
|
|
|
|
|
Vladislav Bolkhovitin <vst@vlnb.net>, http://scst.sourceforge.net
|