Generic SCSI target mid-level for Linux (SCST)
==============================================

Version 3.0.0, XX XXXXX 2011
----------------------------

SCST is designed to provide a unified, consistent interface between SCSI
target drivers and the Linux kernel and to simplify target driver
development as much as possible. A detailed description of SCST's
features and internals can be found on its web page,
http://scst.sourceforge.net.

SCST supports the following I/O modes:

 * Pass-through mode with a one-to-many relationship, i.e. multiple
   initiators can connect to the exported pass-through devices, for
   the following SCSI device types: disks (type 0), tapes (type 1),
   processors (type 3), CDROMs (type 5), MO disks (type 7), medium
   changers (type 8) and RAID controllers (type 0xC).

 * FILEIO mode, which allows files on file systems or block devices to
   be used as virtual remotely available SCSI disks or CDROMs with the
   benefits of the Linux page cache.

 * BLOCKIO mode, which performs direct block I/O with a block device,
   bypassing the page cache for all operations. This mode works best
   with high-end storage HBAs and for applications that either do not
   need caching between the application and the disk or need high
   large-block throughput.

 * User space mode using the scst_user device handler, which allows
   high performance virtual SCSI devices to be implemented in user
   space. Compared with fully in-kernel dev handlers, this mode has
   very low overhead (a few percent).

 * "Performance" device handlers, which provide, in a pseudo
   pass-through mode, a way to measure performance directly without
   the overhead of actual data transfers from/to the underlying SCSI
   device.

In addition, SCST supports advanced per-initiator access and device
visibility management, so different initiators can see different sets
of devices with different access permissions. See below for details.

The full list of SCST features and a comparison with other Linux
targets can be found at http://scst.sourceforge.net/comparison.html.


Installation
------------

Only vanilla kernels from kernel.org and RHEL/CentOS 5.2 kernels are
supported, but SCST should work on other (vendors') kernels if you
manage to compile it on them. The main problem with vendors' kernels
is that they often contain patches that will appear only in the next
version of the vanilla kernel, so it is quite hard to track such
changes. Thus, if during compilation for some vendor kernel your
compiler complains about redefinition of some symbol, you should either
switch to a vanilla kernel, or add or change as necessary the
"#if LINUX_VERSION_CODE" statement corresponding to that symbol.

The default sysfs interface supports only kernels 2.6.26 and higher,
because in 2.6.26 the kernel's internal sysfs interface had a major
change, which made it heavily incompatible with pre-2.6.26 versions.
With the obsolete procfs interface, however, kernels 2.6.16+ are
supported.

First, make sure that the link "/lib/modules/`your_kernel_version`/build"
points to the source code for your currently running kernel.
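
The check above can be sketched as a small shell helper. The function
name scst_kernel_build_dir and the MODULES_ROOT override are
illustrative assumptions and are not part of SCST; the helper only
resolves the "build" link so that a wrong or dangling target is easy to
spot.

```shell
# Hypothetical helper (not part of SCST): print the resolved kernel build
# directory for a kernel release, defaulting to the running kernel.
# MODULES_ROOT is an illustrative override, useful only for testing.
scst_kernel_build_dir() {
    release="${1:-$(uname -r)}"
    link="${MODULES_ROOT:-/lib/modules}/$release/build"
    if [ -d "$link" ]; then
        # Resolve symlinks so a wrong target is easy to spot
        (cd "$link" && pwd -P)
    else
        echo "missing or dangling link: $link" >&2
        return 1
    fi
}

# Usage:
#   scst_kernel_build_dir            (running kernel)
#   scst_kernel_build_dir 2.6.32     (a specific release)
```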

Then you should consider applying the necessary kernel patches. SCST has
the following patches for the kernel in the "kernel" subdirectory. All
of them are optional, so if you don't need the corresponding
functionality, you don't have to apply them.

1. scst_exec_req_fifo-2.6.X.patch. This patch is necessary for
pass-through dev handlers, because in the mainstream kernels
scsi_do_req()/scsi_execute_async() work in LIFO order instead of the
expected and required FIFO order. So SCST needs new functions
scsi_do_req_fifo() or scsi_execute_async_fifo() to be added to the
kernel. This patch does that. You don't have to patch the kernel if you
don't need pass-through support. Alternatively, you can define the
CONFIG_SCST_STRICT_SERIALIZING compile option during compilation
(see description below). Unfortunately, the CONFIG_SCST_STRICT_SERIALIZING
trick doesn't work on kernels starting from 2.6.30, because those
kernels no longer have the required functionality
(scsi_execute_async()). So, to have pass-through working on them, you
have to apply scst_exec_req_fifo-2.6.X.patch.

2. readahead-2.6.X.patch. This patch fixes a problem in the Linux
readahead subsystem and greatly improves performance for software
RAIDs. See
http://sourceforge.net/mailarchive/forum.php?thread_name=a0272b440906030714g67eabc5k8f847fb1e538cc62%40mail.gmail.com&forum_name=scst-devel
thread for more details. It is included in the mainstream kernels 2.6.33
and 2.6.32.11.

3. readahead-context-2.6.X.patch. This is a backport of the context
readahead patch from 2.6.31, http://lkml.org/lkml/2009/4/12/9, big
thanks to Wu Fengguang. This is a performance improvement patch. It is
included in the mainstream kernel 2.6.31.

Then, to compile SCST, type 'make scst'. It will build SCST itself and its
device handlers. To install them, type 'make scst_install'. The driver
modules will be installed in '/lib/modules/`your_kernel_version`/extra'.
In addition, scst.h, scst_debug.h as well as Module.symvers or
Modules.symvers will be copied to '/usr/local/include/scst'. The first
file contains all of SCST's public data definitions, which are used by
target drivers. The other ones support debug message logging and the
build process.

Then you can load any module by typing 'modprobe module_name'. The names
are:

 - scst - SCST itself
 - scst_disk - device handler for disks (type 0)
 - scst_tape - device handler for tapes (type 1)
 - scst_processor - device handler for processors (type 3)
 - scst_cdrom - device handler for CDROMs (type 5)
 - scst_modisk - device handler for MO disks (type 7)
 - scst_changer - device handler for medium changers (type 8)
 - scst_raid - device handler for storage array controllers (e.g. RAID) (type 0xC)
 - scst_vdisk - device handler for virtual disks (file, device or ISO CD image).
 - scst_user - user space device handler

Then, to see your devices remotely, you need to add a corresponding LUN
for them (see below how). By default, no local devices are seen
remotely. There must be a LUN 0 in each set of LUNs (security group),
i.e. LU numbering must not start from, e.g., 1. Otherwise you will see
no devices on remote initiators and the SCST core will write the
following message into the kernel log: "tgt_dev for LUN 0 not found,
command to unexisting LU?"

It is highly recommended to use the scstadmin utility for configuring
devices and security groups.

The flow of SCST initialization should be as follows:

1. Load the SCST modules with the necessary module parameters, if needed.

2. Configure targets, devices, LUNs, etc. using either scstadmin
(recommended), or the sysfs interface directly as described below.

If you experience problems while loading or running the modules, check
your kernel logs (or run the dmesg command for the most recent messages).

IMPORTANT: Without the appropriate device handler loaded, the corresponding
=========  devices will be invisible to remote initiators, which could lead
           to holes in the LUN addressing, so automatic device scanning by
           the remote SCSI mid-level might not notice the devices. In that
           case you will have to add them manually via
           'echo "- - -" >/sys/class/scsi_host/hostX/scan',
           where X is the host number.
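
When the host number is not known in advance, the manual rescan can be
applied to every SCSI host on the initiator. The following is only a
sketch: the function name scsi_rescan_all_hosts and its optional
argument (an alternative sysfs root, present only to make the helper
testable) are assumptions, not part of SCST.

```shell
# Hypothetical helper for the initiator side: ask every SCSI host to
# rescan all channels, targets and LUNs ("- - -" is a wildcard triple).
# The optional argument overrides the sysfs root and exists for testing.
scsi_rescan_all_hosts() {
    root="${1:-/sys/class/scsi_host}"
    for scan in "$root"/host*/scan; do
        [ -e "$scan" ] || continue      # glob did not match anything
        echo "- - -" > "$scan"
    done
    return 0
}

# Usage (as root, on the initiator):
#   scsi_rescan_all_hosts
```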

IMPORTANT: Running the target and the initiator on the same host is
=========  supported, except for the following 2 cases: swap over a target
           exported device and a writable mmap over a file from a target
           exported device. The latter means you can't mount a file
           system over a target exported device. In other words, you can
           freely use any sg, sd, st, etc. devices imported from the target
           on the same host, but you can't mount file systems or put
           swap on them. This is a limitation of the Linux memory/cache
           manager, because in this case a memory allocation deadlock is
           possible, like: the system needs some memory -> it decides to
           clear some cache -> the cache needs to be written to a
           target exported device -> the initiator sends a request to the
           target located on the same system -> the target needs memory
           -> the system needs even more memory -> deadlock.

IMPORTANT: In the current version, simultaneous access to local SCSI devices
=========  via standard high-level SCSI drivers (sd, st, sg, etc.) and
           SCST's target drivers is unsupported. This is especially
           important for commands executed via sg and st that change
           the state of devices and their parameters, because that could
           lead to data corruption. If any such command is executed, at
           least the related device handler(s) must be restarted. For block
           devices, READ/WRITE commands using the direct disk handler are
           generally safe.

To uninstall, type 'make scst_uninstall'.

Migration from the obsolete proc interface
------------------------------------------

The sysfs-enabled scstadmin supports the old procfs config file format,
so you can use it to migrate your proc-based configuration to the sysfs
interface with the following steps:

1. Load SCST modules

2. Run "scstadmin -config old_config_file"

3. Run "scstadmin -write_config new_config_file"

4. Check new_config_file and make sure it has everything written
properly.

5. Start using "scstadmin -config new_config_file" to configure SCST.
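
Steps 1-3 above can be sketched as a shell function. The run() wrapper
and the DRY_RUN variable are illustrative assumptions that let the
commands be previewed before they are executed; only the SCST core
module is loaded here, so add the device handlers and target drivers
your configuration needs. Step 4, reviewing the new file, stays manual.

```shell
# Dry-run helper: when DRY_RUN is non-empty, print commands instead of
# running them. Both DRY_RUN and run() are illustrative, not part of SCST.
run() { ${DRY_RUN:+echo} "$@"; }

# Arguments: old (procfs-format) config file, new config file to write.
scst_migrate_config() {
    old_config="$1"
    new_config="$2"
    run modprobe scst || return 1                           # step 1
    run scstadmin -config "$old_config" || return 1         # step 2
    run scstadmin -write_config "$new_config" || return 1   # step 3
}

# Usage (file names are examples only):
#   DRY_RUN=1; scst_migrate_config /etc/scst.conf.old /etc/scst.conf
```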


Usage in failover mode
----------------------

It is recommended to use the TEST UNIT READY ("tur") command to check
whether an SCST target is alive in MPIO configurations.


Device handlers
---------------

Device specific drivers (device handlers) are plugins for SCST, which
help SCST to analyze incoming requests and determine parameters
specific to various types of devices. If an appropriate device handler
for a SCSI device type isn't loaded, SCST doesn't know how to handle
devices of this type, so they will be invisible to remote initiators
(more precisely, the "LUN not supported" sense code will be returned).

In addition to device handlers for real devices, there are VDISK, user
space and "performance" device handlers.

The VDISK device handler works over files on file systems and turns
them into virtual remotely available SCSI disks or CDROMs. In addition,
it can work directly over a block device, e.g. a local IDE or SCSI disk
or even a disk partition, where there is no file system overhead. Using
block devices, compared to sending SCSI commands directly to the SCSI
mid-level via scsi_do_req()/scsi_execute_async(), has the advantage that
data are transferred via the system cache, so it is possible to fully
benefit from caching and read-ahead performed by Linux's VM subsystem.
The only disadvantage here is that in the FILEIO mode there is
superfluous data copying between the cache and SCST's buffers. This
issue is going to be addressed in one of the future releases. Virtual
CDROMs are useful for remote installation. See below for details on how
to set up and use the VDISK device handler.

The SCST user space device handler provides an interface between SCST
and user space, which allows pure user space devices to be created. The
simplest example is writing a VTL: with scst_user it can be written
purely in user space. Another use case is processing of the passed data
that is too sophisticated for kernel space, like encrypting them or
making snapshots.

"Performance" device handlers for disks, MO disks and tapes in their
exec() method skip (pretend to execute) all READ and WRITE operations
and thus provide a way to measure link performance directly without the
overhead of actual data transfers from/to the underlying SCSI device.

NOTE: Since "perf" device handlers don't touch the commands' data
====  buffer on READ operations, it is returned to remote initiators as
      it was allocated, without even being zeroed. Thus, "perf" device
      handlers pose some security risk, so use them with caution.


Compilation options
-------------------

The following compilation options can be commented in/out in the
Makefile:

 - CONFIG_SCST_DEBUG - if defined, turns on some debugging code,
   including some logging. Makes the driver considerably bigger and slower,
   and produces a large amount of log data.

 - CONFIG_SCST_TRACING - if defined, turns on the ability to log events.
   Makes the driver considerably bigger and leads to some performance loss.

 - CONFIG_SCST_EXTRACHECKS - if defined, adds extra validity checks in
   various places.

 - CONFIG_SCST_USE_EXPECTED_VALUES - if not defined (default), the
   initiator supplied expected data transfer length and direction will
   be used only for verification purposes, to return an error or warn if
   one of them is invalid. Instead, the values locally decoded from the
   SCSI command will be used. This is necessary for security reasons,
   because otherwise a faulty initiator can crash the target by
   supplying an invalid value in one of those parameters. This is
   especially important in case of pass-through mode. If
   CONFIG_SCST_USE_EXPECTED_VALUES is defined, the initiator supplied
   expected data transfer length and direction will override the locally
   decoded values. This might be necessary if the internal SCST command
   translation table doesn't contain a SCSI command that is used in
   your environment. You can tell that this is the case if you enable
   the "minor" trace level and have messages like "Unknown opcode XX for
   YY. Should you update scst_scsi_op_table?" in your kernel log and
   your initiator returns an error. Please also report those messages to
   the SCST mailing list scst-devel@lists.sourceforge.net. Note that not
   all SCSI transports support supplying expected values. You should try
   enabling this option if you have a pass-through device that doesn't
   work with SCST, for instance, a SATA CDROM.

 - CONFIG_SCST_DEBUG_TM - if defined, turns on task management functions
   debugging, where on LUN 6 some of the commands will be delayed for
   about 60 sec., making the remote initiator send TM functions, e.g.
   ABORT TASK and TARGET RESET. Also define the
   CONFIG_SCST_TM_DBG_GO_OFFLINE symbol in the Makefile if you want
   the device to eventually become completely unresponsive; otherwise it
   will keep cycling through the ABORT and RESET code paths. Needs
   CONFIG_SCST_DEBUG turned on.

 - CONFIG_SCST_STRICT_SERIALIZING - if defined, makes SCST send all commands to
   the underlying SCSI device synchronously, one after another. This makes
   task management more reliable, at the cost of some performance. This
   is mostly relevant for stateful SCSI devices like tapes, where the
   result of a command's execution depends on the device's settings defined
   by previous commands. Disk and RAID devices are stateless in most
   cases. The current SCSI core in Linux doesn't allow all commands to
   be aborted reliably if they are sent asynchronously to a stateful
   device. Turned off by default; turn it on if you use stateful
   device(s) and need as much error recovery reliability as possible. As
   a side effect of CONFIG_SCST_STRICT_SERIALIZING, on kernels below
   2.6.30 no kernel patching is necessary for pass-through device
   handlers (scst_disk, etc.).

 - CONFIG_SCST_TEST_IO_IN_SIRQ - if defined, allows SCST to submit selected
   SCSI commands (TUR and READ/WRITE) from soft-IRQ context (tasklets).
   Enabling it will decrease the number of context switches and slightly
   improve performance. The goal of this option is to be able to measure
   the overhead of the context switches. If, after enabling this option,
   you don't see a significant decrease in the number of context
   switches in the vmstat output on the target under load, then your
   target driver doesn't submit commands to SCST in IRQ context. For
   instance, iSCSI-SCST doesn't do that, but qla2x00t with
   CONFIG_QLA_TGT_DEBUG_WORK_IN_THREAD disabled does. This option is
   designed to be used with the vdisk NULLIO backend.

   WARNING! Enabling this option with a backend other than vdisk
   NULLIO is unsafe and can lead to a kernel crash!

 - CONFIG_SCST_STRICT_SECURITY - if defined, makes SCST zero allocated data
   buffers. Leaving it undefined (default) considerably improves
   performance and eases CPU load, but could create a security hole
   (information leakage), so enable it if you have strict security
   requirements.

 - CONFIG_SCST_ABORT_CONSIDER_FINISHED_TASKS_AS_NOT_EXISTING - if defined,
   when the TASK MANAGEMENT function ABORT TASK tries to abort a
   command that has already finished, the remote initiator that sent the
   ABORT TASK request will receive a TASK NOT EXIST (or ABORT FAILED)
   response for it. This is the more logical response, since the
   attempt to abort an already finished command has failed, but
   some initiators, particularly the VMware iSCSI initiator, interpret a
   TASK NOT EXIST response as a sign that the target is misbehaving and
   try to RESET it, and then sometimes start misbehaving themselves. So
   this option is disabled by default.

 - CONFIG_SCST_MEASURE_LATENCY - if defined, provides global and per-LUN
   average command processing latency statistics in "latency" files. You
   can clear already measured results by writing 0 to each file. Note
   that you need a non-preemptible kernel to get correct results.

HIGHMEM kernel configurations are fully supported, but not recommended
for performance reasons, except for scst_user, where they are not
supported, because this module deals with user supplied memory in a
zero-copy manner. If you need HIGHMEM enabled, consider changing the
VMSPLIT option or using a 64-bit system configuration instead.

To change the VMSPLIT option (CONFIG_VMSPLIT to be precise), set the
following variables in "make menuconfig":

 - General setup->Configure standard kernel features (for small systems): ON

 - General setup->Prompt for development and/or incomplete code/drivers: ON

 - Processor type and features->High Memory Support: OFF

 - Processor type and features->Memory split: according to the amount of
   memory you have. If it is less than 800MB, you don't need to touch
   this option at all.


Module parameters
-----------------

Module scst supports the following parameters:

 - scst_threads - sets the number of SCST threads. By default it equals
   the number of CPUs.

 - scst_max_cmd_mem - sets the maximum amount of memory in MB allowed to
   be consumed by the SCST commands for data buffers at any given time.
   By default it is approximately TotalMem/4.

 - scst_max_dev_cmd_mem - sets the maximum amount of memory in MB
   allowed to be consumed by all SCSI commands of a device at any given
   time. By default it is approximately 2/5 of scst_max_cmd_mem.
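
To get a feeling for the default limits before overriding them, the
approximations above can be written out as arithmetic. This is only a
sketch under the stated approximations (TotalMem/4 and 2/5 of that); the
function name is hypothetical and the values the module actually picks
may differ.

```shell
# Rough estimate of the default memory limits described above, in MB.
# Argument: total system memory in MB. Integer arithmetic, rounds down.
scst_default_mem_limits() {
    total_mb="$1"
    max_cmd_mem=$((total_mb / 4))
    max_dev_cmd_mem=$((max_cmd_mem * 2 / 5))
    echo "scst_max_cmd_mem=$max_cmd_mem scst_max_dev_cmd_mem=$max_dev_cmd_mem"
}

# Usage: MemTotal in /proc/meminfo is reported in kB
#   scst_default_mem_limits $(( $(awk '/^MemTotal:/ {print $2}' /proc/meminfo) / 1024 ))
# Overriding the defaults at load time (example values only):
#   modprobe scst scst_threads=4 scst_max_cmd_mem=512
```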


SCST sysfs interface
--------------------

Starting from 2.0.0 SCST has a sysfs interface. It supports only kernels
2.6.26 and higher, because in 2.6.26 the kernel's internal sysfs
interface had a major change, which made it heavily incompatible with
pre-2.6.26 versions. If you need a pre-2.6.26 kernel, you need to use
the obsolete procfs interface (see below).

The SCST sysfs interface is designed to be self-descriptive and
self-contained. This means that a high level management tool for it can
be written once and automatically support any future sysfs interface
changes (attribute additions or removals, new target drivers and dev
handlers, etc.) without any modifications. Scstadmin is an example of
such a management tool.

To achieve that, a management tool should not be implemented around
drivers and their attributes, but around the common rules those drivers
and attributes follow. You can find those rules in the SysfsRules file.
For instance, each SCST sysfs file (attribute) can contain the mark
"[key]" in its last line. It is added automatically to allow scstadmin
and other management tools to see which attributes they should save in
the config file. If you are manipulating attributes manually, you can
ignore this mark.

The root of the SCST sysfs interface is /sys/kernel/scst_tgt. It has the
following entries:

 - devices - this is a root subdirectory for all SCST devices

 - handlers - this is a root subdirectory for all SCST dev handlers

 - max_tasklet_cmd - specifies the maximum number of commands from all
   connected initiators that can be queued in the SCST core
   simultaneously on a single CPU while still allowing commands on this
   CPU to be processed in soft-IRQ context in tasklets. If the number of
   commands exceeds this value, then all of them will be processed only
   in SCST threads. This is to prevent possible starvation of processes
   under heavy load on the CPUs serving soft IRQs and in some cases to
   improve performance by spreading the load more evenly over the
   available CPUs.

 - sgv - this is a root subdirectory for all SCST SGV caches

 - targets - this is a root subdirectory for all SCST targets

 - setup_id - allows reading and writing the SCST setup ID. This ID can
   be used when the same SCST configuration should be installed on
   several targets, but the devices exported from those targets should
   have different IDs and SNs. For instance, the VDISK dev handler uses
   this ID to generate the T10 vendor specific identifier and SN of the
   devices.

 - threads - allows reading and setting the number of global SCST I/O
   threads. Those threads are used with async. dev handlers, for
   instance, vdisk BLOCKIO or NULLIO.

 - trace_level - allows enabling and disabling various tracing
   facilities. See the content of this file for help on how to use it.
   See also the section "Dealing with massive logs" for more info on how
   to produce correct logs when you have enabled trace levels producing
   a lot of log data.

 - version - read-only attribute, which shows the version of SCST and
   the enabled optional features.

 - last_sysfs_mgmt_res - read-only attribute returning the completion
   status of the last management command. In the sysfs implementation
   there are some problems between internal sysfs and internal SCST
   locking. To avoid them, in some cases sysfs calls can return an error
   with errno EAGAIN. This doesn't mean the operation failed. It only
   means that the operation was queued and has not yet completed. To
   wait for it to complete, a management tool should poll this file. If
   the operation hasn't completed yet, reading the file will also return
   EAGAIN. Once it has completed, the file will return the result of the
   operation (0 for success or -errno for error).
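
The polling of last_sysfs_mgmt_res described above can be sketched in
shell as follows. The function name, the retry count and the one second
interval are assumptions. Plain shell cannot tell EAGAIN apart from
other read errors, so this sketch treats any failed read as "not
completed yet" and gives up after the given number of attempts.

```shell
# Hypothetical helper: wait for the last queued management command and
# print its result. Succeeds only if the result is 0.
# Arguments: SCST sysfs root (optional), number of attempts (optional).
scst_wait_mgmt_res() {
    root="${1:-/sys/kernel/scst_tgt}"
    tries="${2:-30}"
    while [ "$tries" -gt 0 ]; do
        # While the operation is pending, the read itself fails
        if res=$(cat "$root/last_sysfs_mgmt_res" 2>/dev/null); then
            echo "$res"
            [ "$res" = "0" ] && return 0
            return 1
        fi
        tries=$((tries - 1))
        sleep 1
    done
    echo "timed out waiting for last_sysfs_mgmt_res" >&2
    return 1
}

# Usage: call it after a write to a mgmt file has failed
#   echo "some command" > .../mgmt || scst_wait_mgmt_res
```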

The "devices" subdirectory contains a subdirectory for each SCST device.

The content of each device's subdirectory is dev handler specific. See
the documentation for your dev handlers for more info about it, as well
as the SysfsRules file for more info about the rules common to all dev
handlers. SCST dev handlers can have the following common entries:

 - exported - subdirectory containing links to all LUNs where this
   device was exported.

 - handler - if a dev handler is determined for this device, this link
   points to it. The handler might not be set for pass-through devices.

 - threads_num - shows and allows setting the number of threads in this
   device's thread pool. If 0, no threads will be created and the global
   SCST thread pool will be used. If <0, creation of the thread pool is
   prohibited.

 - threads_pool_type - shows and allows setting the thread pool type.
   Possible values: "per_initiator" and "shared". When the value is
   "per_initiator" (default), each session from each initiator will use
   a separate dedicated pool of threads. When the value is "shared", all
   sessions from all initiators will share the same per-device pool of
   threads. Valid only if the threads_num attribute is >0.

 - dump_prs - allows dumping persistent reservation information to the
   kernel log.

 - type - SCSI type of this device

See below for more information about the other entries of this
subdirectory for the standard SCST dev handlers.

The "handlers" subdirectory contains a subdirectory for each SCST dev
handler.

The content of each handler's subdirectory is dev handler specific. See
the documentation for your dev handlers for more info about it, as well
as the SysfsRules file for more info about the rules common to all dev
handlers. SCST dev handlers can have the following common entries:

 - mgmt - this entry allows creating virtual devices and their
   attributes (for virtual device dev handlers) or assigning/unassigning
   real SCSI devices to/from this dev handler (for pass-through dev
   handlers).

 - trace_level - allows enabling and disabling various tracing
   facilities. See the content of this file for help on how to use it.
   See also the section "Dealing with massive logs" for more info on how
   to produce correct logs when you have enabled trace levels producing
   a lot of log data.

 - type - SCSI type of devices served by this dev handler.

See below for more information about the other entries of this
subdirectory for the standard SCST dev handlers.

The "sgv" subdirectory contains statistics for the SCST SGV caches. It
has the following entries:

 - Zero or more subdirectories, one for each existing SGV cache.

 - global_stats - file containing global SGV cache statistics.

Each SGV cache's subdirectory has the following item:

 - stats - file containing statistics for this SGV cache.

The "targets" subdirectory contains a subdirectory for each SCST target.

The content of each target's subdirectory is target specific. See the
documentation for your target for more info about it, as well as the
SysfsRules file for more info about the rules common to all targets.
Every target should have at least the following entries:

 - ini_groups - subdirectory, which contains and allows defining
   initiator-oriented access control information, see below.

 - luns - subdirectory, which contains the list of LUNs available in the
   target-oriented access control and allows defining it, see below.

 - sessions - subdirectory containing the sessions connected to this
   target.

 - comment - this attribute can be used to store any human readable info
   to help identify the target, for instance, the target's mapping to
   the corresponding hardware port. It isn't used by SCST in any way.

 - enabled - using this attribute you can enable or disable this target.
   It allows you to finish configuring the target before it starts
   accepting new connections. 0 by default.

 - addr_method - the LUN addressing method used. Possible values:
   "Peripheral" and "Flat". Most initiators work well with the
   Peripheral addressing method (default), but some (HP-UX, for
   instance) may require the Flat method. This attribute is also
   available in the initiator security groups, so you can assign the
   addressing method on a per-initiator basis.

 - cpu_mask - defines the CPU affinity mask for threads serving this
   target. For threads serving LUNs it is used only for devices with
   threads_pool_type "per_initiator".

 - io_grouping_type - defines how I/O from sessions to this target is
   grouped together. This I/O grouping is very important for
   performance. By setting this attribute to the right value, you can
   considerably increase the performance of your setup. This grouping is
   performed only if you use the CFQ I/O scheduler on the target and
   only for devices with threads_num >= 0 and, if threads_num > 0, with
   threads_pool_type "per_initiator". Possible values:
   "this_group_only", "never", "auto", or an I/O group number >0. When
   the value is "this_group_only", all I/O from all sessions in this
   target will be grouped together. When the value is "never", I/O from
   different sessions will not be grouped together, i.e. all sessions in
   this target will have separate dedicated I/O groups. When the value
   is "auto" (default), all I/O from initiators with the same name
   (iSCSI initiator name, for instance) in all targets will be grouped
   together, with a separate dedicated I/O group for each initiator
   name. For iSCSI this mode works well, but other transports usually
   use different initiator names for different sessions, so when using
   such transports in MPIO configurations you should use either the
   value "this_group_only" or an explicit I/O group number. This
   attribute is also available in the initiator security groups, so you
   can assign the I/O grouping on a per-initiator basis. See below for
   more info on how to use this attribute.

 - rel_tgt_id - allows reading or writing the SCSI Relative Target Port
   Identifier attribute. This identifier is used to identify SCSI Target
   Ports by some SCSI commands, mainly by Persistent Reservations
   commands. This identifier must be unique among all SCST targets, but
   for convenience SCST allows disabled targets to have a non-unique
   rel_tgt_id. In this case SCST will not allow this target to be
   enabled until rel_tgt_id becomes unique. By default, SCST initializes
   this attribute to a unique value.

A target driver may also have the following entries:

 - "hw_target" - if the target driver supports both hardware and virtual
    targets (for instance, an FC adapter supporting NPIV, which has
    hardware targets for its physical ports as well as virtual NPIV
    targets), this read-only attribute will exist for all hardware
    targets and contain the value 1.

Subdirectory "sessions" contains one subdirectory for each connected
session, with a name equal to the name of the connected initiator.

Each session subdirectory contains the following entries:

 - initiator_name - contains initiator name

 - force_close - optional write-only attribute, which allows forcibly
   closing this session.

 - active_commands - contains the number of active SCSI commands in this
   session, i.e. commands that are not yet executed or are being
   executed.

 - commands - contains the overall number of SCSI commands in this
   session.

 - latency - if CONFIG_SCST_MEASURE_LATENCY is enabled, contains latency
   statistics for this session.

 - luns - a link pointing to the corresponding set of LUNs (security
   group) this session was attached to.

 - One or more "lunX" subdirectories, where 'X' is a number, for each LUN
   this session has (see below).

 - other target driver specific attributes and subdirectories.

See the description of the VDISK sysfs interface below for samples.


Access and devices visibility management (LUN masking)
------------------------------------------------------

Access and device visibility management allows an initiator or a group
of initiators to see different devices with different LUNs and the
necessary access permissions.

SCST supports two modes of access control:

1. Target-oriented. In this mode you define for each target a default
set of LUNs, which are accessible to all initiators connected to that
target. This is the regular access control mode, which is what people
usually have in mind when thinking about access control in general. For
instance, in IET this is the only supported mode.

2. Initiator-oriented. In this mode you define which LUNs are accessible
to each initiator. For each set of one or more initiators that should
have access to the same set of devices with the same LUNs, you should
create a separate security group, then add the devices and the names of
the allowed initiator(s) to it.

Both modes can be used simultaneously. In this case the
initiator-oriented mode has higher priority than the target-oriented
one, i.e. initiators are first searched for in all security groups
defined for this target and, if none matches, the target's default set
of LUNs is used. This set of LUNs might be empty, in which case the
initiator will not see any LUNs from the target.

You can find out at any time which set of LUNs each session is assigned
to by looking at where the link
/sys/kernel/scst_tgt/targets/target_driver/target_name/sessions/initiator_name/luns
points to.
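
Resolving that link can be sketched as follows. The function name and
the SCST_ROOT override are illustrative assumptions; readlink -f is
used to print the final target of the link. A result inside the
target's own "luns" subdirectory would indicate the default
(target-oriented) set, while a result under "ini_groups" would
presumably indicate a security group.

```shell
# Hypothetical helper: print the LUN set a session is attached to by
# resolving its "luns" link. SCST_ROOT is an illustrative override.
# Arguments: target driver, target name, initiator (session) name.
scst_session_lun_set() {
    readlink -f "${SCST_ROOT:-/sys/kernel/scst_tgt}/targets/$1/$2/sessions/$3/luns"
}

# Usage (all names are examples only):
#   scst_session_lun_set iscsi iqn.2011-01.com.example:tgt1 iqn.2011-01.com.example:ini1
```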

To configure the target-oriented access control SCST provides the
following interface. Each target's sysfs subdirectory
(/sys/kernel/scst_tgt/targets/target_driver/target_name) has a "luns"
subdirectory. This subdirectory contains the list of already defined
target-oriented access control LUNs for this target as well as the file
"mgmt". This file accepts the following commands, which you can send to
it, for instance, using the "echo" shell command. You can always get a
short help about the supported commands by reading this file.
"Parameters" are one or more param_name=value pairs separated by ';'.

 - "add H:C:I:L lun [parameters]" - adds a pass-through device with
   host:channel:id:lun as LUN "lun". Optionally, the device can be
   marked as read only by using parameter "read_only". The recommended
   way to find out H:C:I:L numbers is to use the lsscsi utility.

 - "replace H:C:I:L lun [parameters]" - replaces the device currently
   exported as LUN "lun" with the pass-through device with
   host:channel:id:lun and generates an INQUIRY DATA HAS CHANGED Unit
   Attention. If the old device doesn't exist, this command acts as the
   "add" command. Optionally, the device can be marked as read only by
   using parameter "read_only". The recommended way to find out H:C:I:L
   numbers is to use the lsscsi utility.

 - "add VNAME lun [parameters]" - adds a virtual device with name VNAME
   as LUN "lun". Optionally, the device can be marked as read only by
   using parameter "read_only".

 - "replace VNAME lun [parameters]" - replaces the device currently
   exported as LUN "lun" with the virtual device with name VNAME and
   generates an INQUIRY DATA HAS CHANGED Unit Attention. If the old
   device doesn't exist, this command acts as the "add" command.
   Optionally, the device can be marked as read only by using parameter
   "read_only".

 - "del lun" - deletes LUN lun

 - "clear" - clears the list of devices

To configure the initiator-oriented access control SCST provides the
following interface. Each target's sysfs subdirectory
(/sys/kernel/scst_tgt/targets/target_driver/target_name) has "ini_groups"
subdirectory. This subdirectory contains the list of already defined
security groups for this target as well as file "mgmt". This file has
the following commands, which you can send to it, for instance, using
"echo" shell command. You can always get a small help about supported
commands by looking inside this file.

 - "create GROUP_NAME" - creates a new security group.

 - "del GROUP_NAME" - deletes a security group.

Each security group's subdirectory contains 2 subdirectories,
"initiators" and "luns", as well as the following attributes:
addr_method, cpu_mask and io_grouping_type. See the description of them
above.

Each "initiators" subdirectory contains the list of initiators added to
this group as well as file "mgmt". This file has the following
commands, which you can send to it, for instance, using "echo" shell
command. You can always get a small help about supported commands by
looking inside this file.

 - "add INITIATOR_NAME" - adds initiator with name INITIATOR_NAME to the
   group.

 - "del INITIATOR_NAME" - deletes initiator with name INITIATOR_NAME
   from the group.

 - "move INITIATOR_NAME DEST_GROUP_NAME" - moves initiator with name
   INITIATOR_NAME from the current group to group with name
   DEST_GROUP_NAME.

 - "clear" - deletes all initiators from this group.

For the "add" and "del" commands INITIATOR_NAME can be a simple DOS-type
pattern containing '*' and '?' symbols. '*' matches any sequence of
symbols, '?' matches any single symbol. For instance, "blah.xxx" will
match "bl?h.*". Additionally, you can use the negation sign '!' to
invert the result of the pattern. For instance, "ah.xxx" will match
"!bl?h.*".
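
To try out a pattern before adding it to a group, the matching rules
above can be approximated in the shell. This is only a sketch: it relies
on the shell's own glob matching, which additionally interprets
characters such as '[' and '\', so it may differ from the matching done
by SCST for names containing them. The function name ini_name_matches is
an assumption made for this illustration:

```shell
# Return 0 if NAME matches PATTERN, where PATTERN may contain '*' and '?'
# and may start with '!' to invert the result.
# $1 = pattern, $2 = initiator name
ini_name_matches() {
    pattern=$1
    name=$2
    case $pattern in
        '!'*)
            # Drop the leading '!' and invert the outcome of the match.
            case $name in
                ${pattern#?}) return 1 ;;
                *) return 0 ;;
            esac ;;
        *)
            case $name in
                $pattern) return 0 ;;
                *) return 1 ;;
            esac ;;
    esac
}
```

For instance, ini_name_matches 'bl?h.*' 'blah.xxx' succeeds, and so does
ini_name_matches '!bl?h.*' 'ah.xxx'.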

Each "luns" subdirectory contains the list of already defined LUNs for
this group as well as file "mgmt". The content of this file and the list
of commands available in it are identical to those of the "luns"
subdirectory of the target-oriented access control.

Examples:

 - echo "create INI" >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.vlnb:tgt1/ini_groups/mgmt -
   creates security group INI for target iqn.2006-10.net.vlnb:tgt1.

 - echo "add 2:0:1:0 11" >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.vlnb:tgt1/ini_groups/INI/luns/mgmt -
   adds a pass-through device sitting on host 2, channel 0, ID 1, LUN 0
   to group with name INI as LUN 11.

 - echo "add disk1 0" >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.vlnb:tgt1/ini_groups/INI/luns/mgmt -
   adds a virtual disk with name disk1 to group with name INI as LUN 0.

 - echo "add 21:*:e0:?b:83:*" >/sys/kernel/scst_tgt/targets/21:00:00:a0:8c:54:52:12/ini_groups/INI/initiators/mgmt -
   adds a pattern to group with name INI to Fibre Channel target with
   WWN 21:00:00:a0:8c:54:52:12, which matches WWNs of Fibre Channel
   initiator ports.

Suppose you need an iSCSI target with name
"iqn.2007-05.com.example:storage.disk1.sys1.xyz", which should export
virtual device "dev1" with LUN 0 and virtual device "dev2" with LUN 1,
but the initiator with name
"iqn.2007-05.com.example:storage.disk1.spec_ini.xyz" should see only
virtual device "dev2" read only with LUN 0. To achieve that you should
run the following commands:

# echo "add_target iqn.2007-05.com.example:storage.disk1.sys1.xyz" >/sys/kernel/scst_tgt/targets/iscsi/mgmt
# echo "add dev1 0" >/sys/kernel/scst_tgt/targets/iscsi/iqn.2007-05.com.example:storage.disk1.sys1.xyz/luns/mgmt
# echo "add dev2 1" >/sys/kernel/scst_tgt/targets/iscsi/iqn.2007-05.com.example:storage.disk1.sys1.xyz/luns/mgmt
# echo "create SPEC_INI" >/sys/kernel/scst_tgt/targets/iscsi/iqn.2007-05.com.example:storage.disk1.sys1.xyz/ini_groups/mgmt
# echo "add dev2 0 read_only=1" \
	>/sys/kernel/scst_tgt/targets/iscsi/iqn.2007-05.com.example:storage.disk1.sys1.xyz/ini_groups/SPEC_INI/luns/mgmt
# echo "add iqn.2007-05.com.example:storage.disk1.spec_ini.xyz" \
	>/sys/kernel/scst_tgt/targets/iscsi/iqn.2007-05.com.example:storage.disk1.sys1.xyz/ini_groups/SPEC_INI/initiators/mgmt

For Fibre Channel or SAS in the above example you should use the
target's and initiator ports' WWNs instead of iSCSI names.

It is highly recommended to use the scstadmin utility instead of the low
level interface described in this section.

IMPORTANT
=========

There must be LUN 0 in each set of LUNs, i.e. LU numbering must not
start from, e.g., 1. Otherwise you will see no devices on remote
initiators and the SCST core will write the following message into the
kernel log: "tgt_dev for LUN 0 not found, command to unexisting LU?"
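
A quick way to catch this mistake before enabling a target is to check
that the "luns" directory of each LUN set has an entry named 0. The
sketch below assumes that every defined LUN appears in that directory
under its number, as the export links shown later in this document
suggest. The function name has_lun0 is made up for this illustration:

```shell
# Return 0 if the given "luns" directory contains an entry for LUN 0.
# $1 = path to a "luns" directory of a target or of a security group
has_lun0() {
    [ -e "$1/0" ]
}
```

For instance:
has_lun0 /sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.vlnb:tgt1/luns || echo "LUN 0 is missing"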

IMPORTANT
=========

All the access control must be fully configured BEFORE the corresponding
target is enabled. When you enable a target, it will immediately start
accepting new connections, hence creating new sessions, and those new
sessions will be assigned to security groups according to the
*currently* configured access control settings. For instance, a session
may be assigned to the target's default set of LUNs instead of the
"HOST004" group you need, because "HOST004" doesn't exist yet. So, you
must configure all the security groups before new connections from the
initiators are created, i.e. before the target is enabled.


VDISK device handler
--------------------

Starting from 2.0.0 VDISK device handler uses sysfs interface. The
procfs interface is obsolete and will be removed in one of the next
versions.

VDISK has 4 built-in dev handlers: vdisk_fileio, vdisk_blockio,
vdisk_nullio and vcdrom. Roots of their sysfs interface are
/sys/kernel/scst_tgt/handlers/handler_name, e.g. for vdisk_fileio:
/sys/kernel/scst_tgt/handlers/vdisk_fileio. Each root has the following
entries:

 - None, one or more links to devices with name equal to names
   of the corresponding devices.

 - trace_level - allows to enable and disable various tracing
   facilities. See content of this file for help how to use it. See also
   section "Dealing with massive logs" for more info how to make correct
   logs when you enabled trace levels producing a lot of logs data.

 - mgmt - main management entry, which allows to add/delete VDISK
   devices with the corresponding type.

The "mgmt" file has the following commands, which you can send to it,
for instance, using "echo" shell command. You can always get a small
help about supported commands by looking inside this file. "Parameters"
are one or more param_name=value pairs separated by ';'.

 - echo "add_device device_name [parameters]" - adds a virtual device
   with name device_name and specified parameters (see below)

 - echo "del_device device_name" - deletes a virtual device with name
   device_name.

Handler vdisk_fileio provides FILEIO mode to create virtual devices.
This mode uses files as backend and accesses them using regular
read()/write() file calls. This allows using the full power of the Linux
page cache. The following parameters are possible for vdisk_fileio:

 - filename - specifies path and file name of the backend file. The path
   must be absolute.

 - blocksize - specifies block size used by this virtual device. The
   block size must be power of 2 and >= 512 bytes. Default is 512.

 - write_through - disables write back caching. Note, this option
   makes sense only if you also *manually* disable the write-back cache
   in *all* your backstorage devices and make sure it's actually
   disabled, since many devices are known to lie about this mode to get
   better benchmark results. Default is 0.

 - read_only - read only. Default is 0.

 - o_direct - disables both read and write caching. This mode isn't
   currently fully implemented, you should use user space fileio_tgt
   program in O_DIRECT mode instead (see below).

 - nv_cache - enables "non-volatile cache" mode. In this mode it is
   assumed that the target has a GOOD UPS with ability to cleanly
   shutdown target in case of power failure and it is software/hardware
   bugs free, i.e. all data from the target's cache are guaranteed
   sooner or later to go to the media. Hence all data synchronization
   with media operations, like SYNCHRONIZE_CACHE, are ignored in order
   to bring more performance. Also in this mode target reports to
   initiators that the corresponding device has write-through cache to
   disable all write-back cache workarounds used by initiators. Use with
   extreme caution, since in this mode after a crash of the target
   journaled file systems don't guarantee consistency after journal
   recovery, therefore manual fsck MUST be run. Note that, since the
   journal barrier protection (see "IMPORTANT" note below) is usually
   turned off, enabling NV_CACHE could change nothing from the data
   protection point of view, since no data synchronization with media
   operations will come from the initiator. This option overrides the
   "write_through" option. Disabled by default.

 - thin_provisioned - enables thin provisioning facility, when remote
   initiators can unmap blocks of storage, if they don't need them
   anymore. Backend storage also must support this facility.

 - removable - with this flag set the device is reported to remote
   initiators as removable.

 - rotational - if set, this device is reported as rotational.
   Otherwise, it is reported as non-rotational (SSD, etc.)
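
The constraint on the blocksize parameter above (a power of 2 and >= 512
bytes) is easy to check before creating a device. A minimal sketch, with
the function name valid_blocksize being an assumption of this
illustration:

```shell
# Return 0 if $1 is a valid block size: a power of 2 and >= 512.
valid_blocksize() {
    case $1 in
        ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
    esac
    # A power of 2 has exactly one bit set, so n & (n - 1) is 0.
    [ "$1" -ge 512 ] && [ $(( $1 & ($1 - 1) )) -eq 0 ]
}
```

For instance, valid_blocksize 4096 succeeds, while valid_blocksize 1000
and valid_blocksize 256 fail.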

Handler vdisk_blockio provides BLOCKIO mode to create virtual devices.
This mode performs direct block I/O with a block device, bypassing the
page cache for all operations. This mode works ideally with high-end
storage HBAs and for applications that either do not need caching
between application and disk or need the large block throughput. See
below for more info.

The following parameters are possible for vdisk_blockio: filename,
blocksize, nv_cache, read_only, removable, rotational, thin_provisioned.
See vdisk_fileio above for description of those parameters.

Handler vdisk_nullio provides NULLIO mode to create virtual devices. In
this mode no real I/O is done, but success is returned to initiators. It
is intended to be used for performance measurements in the same way as
"*_perf" handlers. The following parameters are possible for
vdisk_nullio: blocksize, read_only, removable. See vdisk_fileio above
for description of those parameters.

Handler vcdrom allows emulation of a virtual CDROM device using an ISO
file as backend. It doesn't have any parameters.

For example:

echo "add_device disk1 filename=/disk1; blocksize=4096; nv_cache=1" >/sys/kernel/scst_tgt/handlers/vdisk_fileio/mgmt

will create a FILEIO virtual device disk1 with backend file /disk1
with block size 4K and NV_CACHE enabled.
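
When such commands are generated from a script, it is easy to get the
';' separators wrong. The following sketch builds the command string
from separate param_name=value arguments. The function name
vdisk_add_cmd is an assumption of this illustration, and the "; "
separator simply mirrors the example above:

```shell
# Print an "add_device" command for a VDISK handler's "mgmt" file.
# $1 = device name, remaining arguments = param_name=value pairs
vdisk_add_cmd() {
    cmd="add_device $1"
    shift
    sep=" "
    for param in "$@"; do
        cmd="$cmd$sep$param"
        sep="; "
    done
    printf '%s\n' "$cmd"
}
```

For instance:
vdisk_add_cmd disk1 filename=/disk1 blocksize=4096 nv_cache=1 >/sys/kernel/scst_tgt/handlers/vdisk_fileio/mgmt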

Each vdisk_fileio's device has the following attributes in
/sys/kernel/scst_tgt/devices/device_name:

 - filename - contains path and file name of the backend file.

 - blocksize - contains block size used by this virtual device.

 - write_through - contains status of write back caching of this virtual
   device.

 - read_only - contains read only status of this virtual device.

 - o_direct - contains O_DIRECT status of this virtual device.

 - nv_cache - contains NV_CACHE status of this virtual device.

 - thin_provisioned - contains thin provisioning status of this virtual
   device.

 - removable - contains removable status of this virtual device.

 - rotational - contains rotational status of this virtual device.

 - size_mb - contains size of this virtual device in MB.

 - t10_dev_id - contains and allows to set T10 vendor specific
   identifier for Device Identification VPD page (0x83) of INQUIRY data.
   By default VDISK handler always generates t10_dev_id for every newly
   created device at creation time based on the device name and
   scst_vdisk_ID scst_vdisk.ko module parameter for procfs (see below)
   or the SCST setup_id when using the sysfs interface (see above).
   Note: some initiators, e.g. VMware's ESXi or MS Hyper-V, only look
   at the first eight characters of t10_dev_id. You have to make sure
   that these first eight characters are unique or such initiators will
   consider these devices as identical.

 - usn - contains the virtual device's serial number of INQUIRY data. It
   is created at the device creation time based on the device name and
   scst_vdisk_ID scst_vdisk.ko module parameter for procfs (see below)
   or the SCST setup_id when using the sysfs interface (see above).

 - type - contains SCSI type of this virtual device.

 - resync_size - write only attribute, which makes vdisk_fileio rescan
   the size of the backend file. It is useful if you changed it, for
   instance, if you resized it.

For example:

/sys/kernel/scst_tgt/devices/disk1
|-- blocksize
|-- exported
|   |-- export0 -> ../../../targets/iscsi/iqn.2006-10.net.vlnb:tgt/luns/0
|   |-- export1 -> ../../../targets/iscsi/iqn.2006-10.net.vlnb:tgt/ini_groups/INI/luns/0
|   |-- export2 -> ../../../targets/iscsi/iqn.2006-10.net.vlnb:tgt1/luns/0
|   |-- export3 -> ../../../targets/iscsi/iqn.2006-10.net.vlnb:tgt1/ini_groups/INI1/luns/0
|   |-- export4 -> ../../../targets/iscsi/iqn.2006-10.net.vlnb:tgt1/ini_groups/INI2/luns/0
|-- filename
|-- handler -> ../../handlers/vdisk_fileio
|-- nv_cache
|-- o_direct
|-- read_only
|-- removable
|-- resync_size
|-- rotational
|-- size_mb
|-- t10_dev_id
|-- thin_provisioned
|-- threads_num
|-- threads_pool_type
|-- type
|-- usn
`-- write_through

Each vdisk_blockio's device has the following attributes in
/sys/kernel/scst_tgt/devices/device_name: blocksize, filename, nv_cache,
read_only, removable, resync_size, rotational, size_mb, t10_dev_id,
thin_provisioned, threads_num, threads_pool_type, type, usn. See above
description of those parameters.

Each vdisk_nullio's device has the following attributes in
/sys/kernel/scst_tgt/devices/device_name: blocksize, read_only,
removable, size_mb, t10_dev_id, threads_num, threads_pool_type, type,
usn. See above description of those parameters.

Each vcdrom's device has the following attributes in
/sys/kernel/scst_tgt/devices/device_name: filename, size_mb,
t10_dev_id, threads_num, threads_pool_type, type, usn. See above
description of those parameters. Exception is filename attribute. For
vcdrom it is writable. Writing to it allows to virtually insert or
change virtual CD media in the virtual CDROM device. For example:

 - echo "/image.iso" >/sys/kernel/scst_tgt/devices/cdrom/filename - will
   insert file /image.iso as virtual media to the virtual CDROM cdrom.

 - echo "" >/sys/kernel/scst_tgt/devices/cdrom/filename - will remove
   "media" from the virtual CDROM cdrom.

Additionally VDISK handler has module parameter "num_threads", which
specifies count of I/O threads for each FILEIO VDISK's or VCDROM device.
If you have a workload, which tends to produce rather random accesses
(e.g. DB-like), you should increase this count to a bigger value, like
32. If you have a rather sequential workload, you should decrease it to
a lower value, like number of CPUs on the target or even 1. Due to some
limitations of Linux I/O subsystem, increasing number of I/O threads too
much leads to sequential performance drop, especially with deadline
scheduler, so decreasing it can improve sequential performance. The
default provides a good compromise between random and sequential
accesses.

You shouldn't be afraid to have too many VDISK I/O threads if you have
many VDISK devices. Kernel threads consume very little amount of
resources (several KBs) and only necessary threads will be used by SCST,
so the threads will not trash your system.

CAUTION: If you partitioned/formatted your device with block size X, *NEVER*
======== ever try to export and then mount it (even accidentally) with another
         block size. Otherwise you can *instantly* damage it pretty
	 badly as well as all your data on it. Messages on the initiator
	 like "attempt to access beyond end of device" are a sign of
	 such damage.

	 Moreover, if you want to compare how well different block sizes
	 work for you, you **MUST** EVERY TIME AFTER CHANGING BLOCK SIZE
	 **COMPLETELY** **WIPE OFF** ALL THE DATA FROM THE DEVICE. In
	 other words, THE **WHOLE** DEVICE **MUST** HAVE ONLY **ZEROS**
	 AS THE DATA AFTER YOU SWITCH TO NEW BLOCK SIZE. Switching block
	 sizes isn't like switching between FILEIO and BLOCKIO, after
	 changing block size all previously written with another block
	 size data MUST BE ERASED. Otherwise you will have a full set of
	 very weird behaviors, because blocks addressing will be
	 changed, but initiators in most cases will not have a
	 possibility to detect that old addresses written on the device
	 in, e.g., partition table, don't refer anymore to what they are
	 intended to refer.

IMPORTANT: Some disk and partition table management utilities don't support
=========  block sizes >512 bytes, therefore make sure that your favorite one
           supports it. Currently only cfdisk is known to be limited to
	   512 byte blocks; other utilities like fdisk on Linux or the
	   standard disk manager on Windows are proven to work well with
	   non-512 byte blocks. Note, if you export a disk file or
	   device with some block size, different from one, with which
	   it was already partitioned, you could get various weird
	   things like utilities hang up or other unexpected behavior.
	   Hence, to be sure, zero the exported file or device before
	   the first access to it from the remote initiator with another
	   block size. On a Windows initiator make sure you "Set Signature"
	   in the disk manager on the drive imported from the target
	   before doing any other partitioning on it. After you have
	   successfully mounted a file system over a non-512 byte block
	   size device, the block size stops mattering: any program will
	   work with files on such a file system.


Dealing with massive logs
-------------------------

If you want to use the "trace_level" file to enable logging levels that
produce a lot of events, like "debug", then to not lose logged events
you should also:

  * Increase in .config of your kernel CONFIG_LOG_BUF_SHIFT variable
    to much bigger value, then recompile it. For example, value 25 will
    provide good protection from logging overflow even under high volume
    of logging events. To use it you will need to modify the maximum
    allowed value for CONFIG_LOG_BUF_SHIFT in the corresponding Kconfig
    file to 25 as well.

  * Change your /etc/syslog.conf or other config file of your favorite
    logging program to store kernel logs in async manner. For example,
    you can add in rsyslog.conf line "kern.info -/var/log/kernel" and
    add "kern.none" in line for /var/log/messages, so the resulting line
    would look like:

    "*.info;kern.none;mail.none;authpriv.none;cron.none /var/log/messages"

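To get a feeling for what a given CONFIG_LOG_BUF_SHIFT value means, note
that the kernel log buffer size is 2^CONFIG_LOG_BUF_SHIFT bytes. A
minimal sketch of that arithmetic (the function name log_buf_bytes is an
assumption of this illustration):

```shell
# Print the kernel log buffer size in bytes for a CONFIG_LOG_BUF_SHIFT value.
log_buf_bytes() {
    echo $(( 1 << $1 ))
}
```

For instance, log_buf_bytes 25 prints 33554432, i.e. a 32 MB buffer.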

Persistent Reservations
-----------------------

SCST implements Persistent Reservations with full set of capabilities,
including "Persistence Through Power Loss".

The "Persistence Through Power Loss" data are saved in /var/lib/scst/pr
with files with names the same as the names of the corresponding
devices. Also this directory contains backup versions of those files
with suffix ".1". Those backup files are used in case of power or other
failure to prevent Persistent Reservation information from corruption
during update.

The "Persistence Through Power Loss" feature is not available in the
procfs build, because the SCST proc interface doesn't allow to keep
persistent Relative Target IDs of each target between reboots/reloads
(they are load and initialization order dependent).

Persistent Reservations are available on all transports implementing
the get_initiator_port_transport_id() callback. Transports not
implementing this callback will act in one of 2 possible scenarios ("all
or nothing"):

1. If a device has such transport connected and doesn't have persistent
reservations, it will refuse Persistent Reservations commands as if it
doesn't support them.

2. If a device has persistent reservations, all initiators newly
connecting via such transports will not see this device. After all
persistent reservations from this device are released, upon reconnect
the initiators will see it.


Implicit ALUA Support
---------------------

SCST supports implicit asymmetric logical unit access (ALUA). Implicit ALUA is
a feature defined by the ANSI T10 SCSI committee that allows a target to tell
the initiator which path to use in a multipath setup. The redundant paths
between initiator and target can be used either for redundancy or for load
sharing purposes. The target can either be a single target system running SCST
with multiple communication interfaces or two target systems each running SCST
and configured in a high availability setup.

In the SPC-4 standard the following concepts are defined related to ALUA:
* Relative target port ID. A number between 1 and 65535 that uniquely
  identifies a target port. These numbers must be unique over the target as
  a whole, even if that target consists of multiple systems each running SCST.
* Target port group asymmetric access state. One of active/optimized,
  active/non-optimized, standby, unavailable, logical block dependent or
  offline. The access state of a port defines which (if any) SCSI commands
  will be processed by the target port.
* Target port preference indicator. This indicator is additional information
  next to the asymmetric access state that is provided by the target to an
  initiator and that may impact the decision taken by the initiator about
  which path will be chosen.

More detailed information about ALUA can be found in section 5.11.2 of the
ANSI T10 standard called SPC-4.

ALUA support in SCST
....................

SCST allows to define implicit ALUA settings for each unique combination of
SCST device and SCST target. An initiator however queries ALUA settings by
sending an appropriate SCSI command to a specific LUN of an SCST target. Each
such LUN maps uniquely to an SCST device. For hardware SCST target drivers,
e.g. ib_srpt, there is a one-to-one correspondence between SCST target and
SCSI target port. With other SCST targets, e.g. iSCSI-SCST, by default the
only relationship between SCST targets and SCSI target ports is that all SCST
targets defined on a system are visible via all SCSI target ports. See also
the iSCSI-SCST documentation about the allowed_portal attribute for
information about how to associate iSCSI targets with a single physical
interface.

Notes:
- In a H.A. setup it is the responsibility of the user to synchronize ALUA
  information between the individual systems running SCST. There are no
  provisions in SCST to exchange ALUA information automatically between
  individual systems.
- In order to support H.A. setups it is possible to let one SCST system
  report information about target ports present in other SCST systems.
- With SCST, and certainly in a H.A. setup, it is possible to configure ALUA
  such that an initiator receives information that is not standard compliant,
  e.g. setting all target ports in the offline state. It is the responsibility
  of the user to make sure that the information queried by an initiator is
  consistent independent of the LUN and the target port used by the initiator
  to query this information.

Configuring ALUA in SCST
........................

SCST allows to configure the following settings related to implicit ALUA
for each unique combination of SCST target and virtual SCST device
(vdisk_fileio, vdisk_blockio, vcdrom, ...):
* The target port group asymmetric access state. SCST supports all ALUA port
  states except logical block dependent.
* The preference indicator for a target port group.
* The relative target port ID associated with the SCST target.

It is possible to configure the following ALUA-related information via the
sysfs interface of SCST:
* Device groups, where each device group has a name and contains zero or more
  SCST devices. If a device group contains only a single SCST device, the name
  of the group may be identical to the device name. See also
  /sys/kernel/scst_tgt/device_groups/mgmt.
* Which devices are inside a device group. See also
  /sys/kernel/scst_tgt/device_groups/<device group name>/devices/mgmt.
* Target groups, where each target group has a name and contains zero or more
  SCST target names. See also
  /sys/kernel/scst_tgt/device_groups/<device group name>/target_groups/mgmt.
* Target port group identifier. This is a number in the range 0..65535 and is
  called the TARGET PORT GROUP in SPC-4. See also
  /sys/kernel/scst_tgt/device_groups/<device group name>/target_groups/<target
  group name>/group_id.
* Target port group preference indicator. This is a boolean value called the
  PREF bit in SPC-4. See also /sys/kernel/scst_tgt/device_groups/<device group
  name>/target_groups/<target group name>/preferred.
* Target port group state name. One of active, nonoptimized, standby,
  unavailable, offline or transitioning. See also
  /sys/kernel/scst_tgt/device_groups/<device group name>/target_groups/<target
  group name>/state.
* Target group contents - zero or more target names. The target names either
  exist on the local system or on a remote system in a H.A. setup. For target
  names that refer to SCST targets on another system only the relative target
  port identifier matters, not the assigned name. See also
  /sys/kernel/scst_tgt/device_groups/<device group name>/target_groups/<target
  group name>/mgmt.
* Relative target identifier. See also
  /sys/kernel/scst_tgt/device_groups/<device group name>/target_groups/<target
  group name>/<target name>/rel_tgt_id.

The steps involved in configuring ALUA are:
* Identify the SCST devices that will always share the same ALUA settings and
  state. Assign a name to each such group of SCST devices. If a device group
  only contains a single device, the group name may be identical to the device
  name.
* Configure that device group in SCST via sysfs.
* Identify the SCSI target ports that will always share the same ALUA settings
  and state. Assign a name, a group ID and preference indicator to each such
  SCSI target port group.
* Configure the target port group information in SCST via sysfs.
* Identify all SCST targets that can be accessed via a target port group.
* Assign all these SCST target names to the target group via sysfs.
* Assign a relative target port identifier to each target.

As an example, a H.A. setup with two systems, each having one InfiniBand
HCA controlled by the ib_srpt driver and each exporting two LUNs, could be
configured as follows:

own_tgt_id=1
other_tgt_id=2
cd /sys/kernel/scst_tgt/device_groups
echo del dgroup1         >mgmt
echo del dgroup2         >mgmt
echo create dgroup1      >mgmt
echo add disk01          >dgroup1/devices/mgmt
echo create tgroup1      >dgroup1/target_groups/mgmt
echo ${own_tgt_id}       >dgroup1/target_groups/tgroup1/group_id
echo add ib_srpt_0       >dgroup1/target_groups/tgroup1/mgmt
echo ${own_tgt_id}       >dgroup1/target_groups/tgroup1/ib_srpt_0/rel_tgt_id
if [ ${own_tgt_id} = 1 ]; then
 echo 1                  >dgroup1/target_groups/tgroup1/preferred
fi
echo create tgroup2      >dgroup1/target_groups/mgmt
echo ${other_tgt_id}     >dgroup1/target_groups/tgroup2/group_id
echo add ib_srpt_0-other >dgroup1/target_groups/tgroup2/mgmt
echo ${other_tgt_id}     >dgroup1/target_groups/tgroup2/ib_srpt_0-other/rel_tgt_id
if [ ${other_tgt_id} = 1 ]; then
 echo 1                  >dgroup1/target_groups/tgroup1/preferred
fi
echo create dgroup2      >mgmt
echo add disk02          >dgroup2/devices/mgmt
echo create tgroup1      >dgroup2/target_groups/mgmt
echo ${own_tgt_id}       >dgroup2/target_groups/tgroup1/group_id
echo add ib_srpt_0       >dgroup2/target_groups/tgroup1/mgmt
echo ${own_tgt_id}       >dgroup2/target_groups/tgroup1/ib_srpt_0/rel_tgt_id
if [ ${own_tgt_id} = 2 ]; then
 echo 1                  >dgroup2/target_groups/tgroup1/preferred
fi
echo create tgroup2      >dgroup2/target_groups/mgmt
echo ${other_tgt_id}     >dgroup2/target_groups/tgroup2/group_id
echo add ib_srpt_0-other >dgroup2/target_groups/tgroup2/mgmt
echo ${other_tgt_id}     >dgroup2/target_groups/tgroup2/ib_srpt_0-other/rel_tgt_id
if [ ${other_tgt_id} = 2 ]; then
 echo 1                  >dgroup2/target_groups/tgroup1/preferred
fi

The second system in the same H.A. setup can be configured with the same
commands but with the values of ${own_tgt_id} and ${other_tgt_id}
swapped.

The result of the above commands is:

$ find -type f | grep -v '/mgmt$' | cut -c3- | sort | \
  while read f; do echo $f =  $(head -n 1 $f); done
dgroup1/target_groups/tgroup1/group_id = 1
dgroup1/target_groups/tgroup1/ib_srpt_0/rel_tgt_id = 1
dgroup1/target_groups/tgroup1/preferred = 1
dgroup1/target_groups/tgroup1/state = active
dgroup1/target_groups/tgroup2/group_id = 2
dgroup1/target_groups/tgroup2/ib_srpt_0-other/rel_tgt_id = 2
dgroup1/target_groups/tgroup2/preferred = 0
dgroup1/target_groups/tgroup2/state = active
dgroup2/target_groups/tgroup1/group_id = 1
dgroup2/target_groups/tgroup1/ib_srpt_0/rel_tgt_id = 1
dgroup2/target_groups/tgroup1/preferred = 1
dgroup2/target_groups/tgroup1/state = active
dgroup2/target_groups/tgroup2/group_id = 2
dgroup2/target_groups/tgroup2/ib_srpt_0-other/rel_tgt_id = 2
dgroup2/target_groups/tgroup2/preferred = 0
dgroup2/target_groups/tgroup2/state = active

Checking the Target Configuration
.................................

One way to verify the implicit ALUA configuration from a Linux initiator is
via the commands provided in the sg3_utils package. The first step is to
verify whether for a certain LUN implicit ALUA has been configured on the
target. This is possible by checking whether the TPGS=1 text appears in the
sg_inq output, where /dev/sdb is a device node created by the ib_srp initiator:

# sg_inq /dev/sdb
standard INQUIRY:
  PQual=0  Device_type=0  RMB=0  version=0x05  [SPC-3]
  [AERC=0]  [TrmTsk=0]  NormACA=0  HiSUP=1  Resp_data_format=2
  SCCS=0  ACC=0  TPGS=1  3PC=0  Protect=0  BQue=0
  EncServ=0  MultiP=0  [MChngr=0]  [ACKREQQ=0]  Addr16=1
  [RelAdr=0]  WBus16=0  Sync=0  Linked=0  [TranDis=0]  CmdQue=1
  [SPI: Clocking=0x0  QAS=0  IUS=0]
    length=66 (0x42)   Peripheral device type: disk
 Vendor identification: SCST_FIO
 Product identification: disk01
 Product revision level:  300
 Unit serial number: 27cddc71
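
The TPGS check can also be scripted. In the standard INQUIRY data TPGS is
a two bit field: 0 means no ALUA support, 1 implicit only, 2 explicit
only and 3 both. The sketch below parses sg_inq output in the format
shown above. The function names tpgs_value and supports_implicit_alua are
assumptions of this illustration, and other sg3_utils versions may format
their output differently:

```shell
# Read sg_inq output on stdin and print the value of the TPGS field.
tpgs_value() {
    sed -n 's/.*TPGS=\([0-9]*\).*/\1/p' | head -n 1
}

# Read sg_inq output on stdin and return 0 if the implicit ALUA bit
# (bit 0 of TPGS) is set.
supports_implicit_alua() {
    tpgs=$(tpgs_value)
    [ -n "$tpgs" ] && [ $(( tpgs & 1 )) -eq 1 ]
}
```

For instance: sg_inq /dev/sdb | supports_implicit_alua && echo "implicit ALUA"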

The next step is to verify the target group configuration. That is possible
by verifying whether the output of the sg_rtpg command matches the values
configured on the target:

# sg_rtpg /dev/sdb
Report target port groups:
  target port group id : 0x1 , Pref=1
    target port group asymmetric access state : 0x00
    T_SUP : 0, O_SUP : 0, LBD_SUP : 0, U_SUP : 1, S_SUP : 1, AN_SUP : 1, AO_SUP : 1
    status code : 0x02
    vendor unique status : 0x00
    target port count : 01
    Relative target port ids:
      0x01
  target port group id : 0x2 , Pref=0
    target port group asymmetric access state : 0x00
    T_SUP : 0, O_SUP : 0, LBD_SUP : 0, U_SUP : 1, S_SUP : 1, AN_SUP : 1, AO_SUP : 1
    status code : 0x02
    vendor unique status : 0x00
    target port count : 01
    Relative target port ids:
      0x02
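
The TPGS check described above can also be scripted. The following is
only a minimal sketch and not part of SCST or sg3_utils; the function
name tpgs_from_inq is made up for illustration. It reads sg_inq output
on standard input and prints the value of the TPGS field, or nothing if
the field is not present:

```shell
# Hypothetical helper: extract the TPGS field from "sg_inq" output
# supplied on stdin. Prints nothing if the field is absent.
tpgs_from_inq() {
    sed -n 's/.*TPGS=\([0-9][0-9]*\).*/\1/p' | head -n 1
}
```

For instance, "sg_inq /dev/sdb | tpgs_from_inq" should print 1 for the
sg_inq output shown above.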

Initiator Support
.................

On Linux systems implicit ALUA support is provided by the scsi_dh_alua driver
of the device mapper. You will have to modify at least the following in
/etc/multipath.conf:
* path_checker scsi_dh_alua
* prio_callout "/sbin/mpath_prio_alua /dev/%n"

If your distribution does not provide a /sbin/mpath_prio_alua script, you can
use the following implementation:
$ cat /sbin/mpath_prio_alua
#!/bin/bash
# Given a SCSI device node, query the target port group asymmetric access
# state and report it in numeric form.
tpg_id="$(sg_vpd --page=di "$1" | sed -n 's/.*Target port group: //p')"
aas="$(sg_rtpg "$1" \
| grep -A1 "target port group id : $tpg_id" \
| tail -n 1 \
| sed 's/.*target port group asymmetric access state : //')"
echo $((aas))

More information about how to configure the device mapper and the scsi_dh_alua
driver can be found in the manual of your Linux distribution.

Windows initiator systems support ALUA from Windows Server 2008 on. For more
information, see also:
* Microsoft, Multipathing Support in Windows Server 2008, MSDN
(http://blogs.msdn.com/b/san/archive/2008/07/27/multipathing-support-in-windows-server-2008.aspx).
* Microsoft, ALUA MPIO Logo Test, MSDN
(http://msdn.microsoft.com/en-us/library/gg607458%28v=vs.85%29.aspx).


Caching
-------

By default, for performance reasons, VDISK FILEIO devices use the write
back caching policy.

Generally, write back caching is safe to use and its danger is
greatly overestimated, because most modern (especially Enterprise
level) applications are well prepared to work with write back cached
storage. In particular, this is true for all transaction-based
applications. Those applications flush the cache to completely avoid
ANY data loss on a crash or power failure. For instance, journaled file
systems flush the cache on each metadata update, so they survive
power/hardware/software failures pretty well.

Since write back caching is always on locally on initiators, an
application that cares about its data consistency flushes the cache
when necessary, or on every write if it opens files with O_SYNC. If it
doesn't care, it doesn't flush the cache. As soon as the cache flushes
are propagated to the storage, write back caching on it doesn't make any
difference. If an application doesn't flush the cache, it is doomed to
lose data in case of a crash or power failure, no matter where the
cache is located, locally or on the storage.

To illustrate that consider, for example, a user who wants to copy /src
directory to /dst directory reliably, i.e. after the copy finished no
power failure or software/hardware crash could lead to a loss of the
data in /dst. There are 2 ways to achieve this. Let's suppose for
simplicity cp opens files for writing with O_SYNC flag, hence bypassing
the local cache.

1. Slow. Make the device behind /dst work in write through caching
mode and then run "cp -a /src /dst".

2. Fast. Let the device behind /dst work in write back caching mode
and then run "cp -a /src /dst; sync". The reliability of the result is
the same, but it's much faster than (1). Nobody would care if a crash
happens during the copy, because after recovery the leftovers from
the incomplete attempt would simply be deleted and the operation would
be restarted from the very beginning.

So, you can see in (2) there is no danger of ANY data loss from the
write back caching. Moreover, since in practice cp doesn't open files
for writing with the O_SYNC flag, to get the copy done reliably, the
sync command must be called after cp anyway, so enabling write back
caching wouldn't make any difference for reliability.

You can also consider it from another side. Modern HDDs have at least
16MB of cache working in write back mode by default, so for a 10-drive
RAID it is 160MB of write back cache. How many people are happy with
it and how many have disabled the write back cache of their HDDs?
Almost all and almost nobody, respectively. Moreover, many HDDs lie
about the state of their cache and report write through while working
in write back mode. They are also successfully used.

Note that the Linux I/O subsystem guarantees to propagate cache flushes
to the storage only when data protection barriers are used, and those
are usually turned off by default (see
http://lwn.net/Articles/283161). Without barriers enabled Linux doesn't
provide a guarantee that after sync()/fsync() all written data have
really hit permanent storage. They can be stored in the cache of your
backstorage devices and, hence, lost on a power failure event. Thus,
even with write-through cache mode, you still either need to enable
barriers on your backend file system on the target (for direct /dev/sdX
devices this is, indeed, impossible), or need a good UPS to protect
yourself from loss of uncommitted data. Some info about barriers from
the XFS point of view can be found at
http://oss.sgi.com/projects/xfs/faq.html#wcache. On Linux initiators,
for Ext3 and ReiserFS file systems the barrier protection can be turned
on using the "barrier=1" and "barrier=flush" mount options,
respectively. You can check whether the barriers are turned on or off
by looking in /proc/mounts. Windows and, AFAIK, other UNIX'es don't
need any special explicit options and do necessary barrier actions on
write-back caching devices by default.
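
The /proc/mounts check mentioned above can be sketched as follows. This
is an illustration only; the function name show_barrier_opts is made
up, and which options actually appear in /proc/mounts depends on the
kernel version and the file system, so a missing entry does not
necessarily mean that barriers are off:

```shell
# Hypothetical helper: print "mount point: option" for every entry of a
# mounts table (default /proc/mounts) that carries an explicit
# barrier/nobarrier mount option. Other entries are not printed.
show_barrier_opts() {
    local mounts="${1:-/proc/mounts}"
    awk '{
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++)
            if (opts[i] ~ /^(barrier|nobarrier)/)
                print $2 ": " opts[i]
    }' "$mounts"
}
```

Run without arguments, it inspects /proc/mounts of the local system.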

To limit this data loss with write back caching you can use the files
in /proc/sys/vm to limit the amount of unflushed data in the system
cache.
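
As a sketch of what such a limit could look like, the function below
writes to the dirty_background_bytes and dirty_bytes files found in
/proc/sys/vm on kernels that provide them. The function name, the
VM_DIR override and the byte values used in the example are
assumptions for illustration, not SCST recommendations:

```shell
# Sketch: cap the amount of dirty (not yet written back) page cache
# data. Needs root when used on the real /proc/sys/vm. VM_DIR may be
# pointed at another directory for a dry run.
limit_dirty_cache() {
    local bg_bytes="$1" max_bytes="$2"
    local vm_dir="${VM_DIR:-/proc/sys/vm}"
    # Background writeback starts once this much data is dirty.
    echo "$bg_bytes" > "$vm_dir/dirty_background_bytes" || return 1
    # Writers are throttled once this much data is dirty.
    echo "$max_bytes" > "$vm_dir/dirty_bytes" || return 1
}
```

For example, "limit_dirty_cache 16777216 67108864" would start
background writeback at 16MB and throttle writers at 64MB of dirty
data.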

If you for some reason have to use VDISK FILEIO devices in write through
caching mode, don't forget to disable internal caching on their backend
devices or make sure they have an additional on-board battery or
supercapacitor power supply. Otherwise, on a power failure you would
still lose all the not yet saved data in the devices' internal caches.

Note, on some real-life workloads write through caching might perform
better than write back caching with the barrier protection turned on.


Errors caching
..............

When using a virtual device in FILEIO mode, the Linux page cache comes
into the picture. The negative side of it is that it sometimes also
caches errored pages. That is, if the underlying file experiences IO
errors, those errors might be cached by the Linux page cache. As a
result, even when the underlying file recovers and stops failing IOs,
the initiator may still hit IO errors returned by the Linux page cache,
until the cache re-reads the errored pages (usually it happens pretty
soon, but not immediately). To make sure that cached pages are dropped,
one of the following can be done:

- Detach the SCSI virtual device (del_device) and re-attach it
  (add_device). This should evict all the cached pages, unless somebody
  else holds the same "filename" opened.

- Issue a BLKFLSBUF ioctl to the same "filename" you provided for "add_device".

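For the second option, some rudimentary C code is required:

/* Needs <errno.h>, <fcntl.h>, <unistd.h>, <sys/ioctl.h> and
 * <linux/fs.h> (the latter defines BLKFLSBUF). */
fd = open(filename, O_RDWR);
if (fd < 0) {
    err = errno;
    ...		/* handle the open() failure */
} else {
    /* Ask the kernel to flush the buffers cached for "filename". */
    err = ioctl(fd, BLKFLSBUF);
    if (err < 0) {
        err = errno;
        ...	/* handle the ioctl() failure */
    }
    close(fd);
}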

Patch to implement a sysfs entry for the FILEIO handler to accomplish
the above is welcome.


BLOCKIO VDISK mode
------------------

This module works best for these types of scenarios:

1) Data that are not aligned to 4K sector boundaries and <4K block sizes
are used, which is normally found in virtualization environments where
operating systems start partitions on odd sectors (Windows and its
sector 63).

2) Large block data transfers normally found in database loads/dumps and
streaming media.

3) Advanced relational database systems that perform their own caching
which prefer or demand direct IO access and, because of the nature of
their data access, can actually see worse performance with
non-discriminate caching.

4) Multiple layers of targets where the secondary and above layers need
to have a consistent view of the primary targets in order to preserve
data integrity, which a page cache backed IO type might not provide
reliably.

Also it has an advantage over FILEIO that it doesn't copy data between
the system cache and the commands data buffers, so it saves a
considerable amount of CPU power and memory bandwidth.

IMPORTANT: Since data in BLOCKIO and FILEIO modes are not consistent with
=========  each other, if you try to use a device in both those modes
	   simultaneously, you will almost instantly corrupt your data
	   on that device.

IMPORTANT: Some kernels starting from 2.6.32 have a problem which
=========  prevents BLOCKIO from working correctly with RAID5/DM. See
	   http://lkml.org/lkml/2010/7/28/315. That problem was fixed in
	   2.6.32.19, 2.6.34.4, 2.6.35.2 and 2.6.36-rc1. It is strongly
	   recommended to not use affected kernels with BLOCKIO.

IMPORTANT: In SCST 1.x BLOCKIO worked by default in NV_CACHE mode, in which
=========  each device is reported to remote initiators as having write
	   through caching. But if your backend block device has internal
	   write back caching, this creates a possibility of losing the
	   data held in that internal cache in case of a power
	   failure. Starting from SCST 2.0 BLOCKIO works by default in
	   non-NV_CACHE mode, in which each device is reported to remote
	   initiators as having write back caching, and synchronizes the
	   internal device's cache on each SYNCHRONIZE_CACHE command
	   from the initiators. It might lead to some *PERFORMANCE LOSS*,
	   so if you are sure of your power supply and want to
	   restore the 1.x behavior, you should recreate your BLOCKIO
	   devices in NV_CACHE mode.


Pass-through mode
-----------------

In the pass-through mode (i.e. using the pass-through device handlers
scst_disk, scst_tape, etc) SCSI commands, coming from remote initiators,
are passed to local SCSI devices on target as is, without any
modifications.

SCST supports 1 to many pass-through, when several initiators can safely
connect to a single pass-through device (a tape, for instance). For such
cases SCST emulates all the necessary functionality.

In the sysfs interface all real SCSI devices are listed in
/sys/kernel/scst_tgt/devices in the form of host:channel:id:lun numbers,
for instance 1:0:0:0. The recommended way to match those numbers to your
devices is to use the lsscsi utility.

Each pass-through dev handler has a "mgmt" file in its root subdirectory
/sys/kernel/scst_tgt/handlers/handler_name, e.g.
/sys/kernel/scst_tgt/handlers/dev_disk. It accepts the following
commands. They can be sent to it using, e.g., the echo command.

 - "add_device" - this command assigns SCSI device with
host:channel:id:lun numbers to this dev handler.

echo "add_device 1:0:0:0" >/sys/kernel/scst_tgt/handlers/dev_disk/mgmt

will assign SCSI device 1:0:0:0 to this dev handler.

 - "del_device" - this command unassigns SCSI device with
host:channel:id:lun numbers from this dev handler.

As usual, on read the "mgmt" file returns a short help about the
available commands.
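
The commands above can be wrapped in a small shell function. This is a
minimal sketch; the function name scst_pt_mgmt and the SCST_SYSFS
override are made up for illustration, only the "mgmt" file path and
the command syntax come from the description above:

```shell
# Hypothetical wrapper: send a command ("add_device" or "del_device")
# with host:channel:id:lun numbers to the "mgmt" file of a pass-through
# dev handler. SCST_SYSFS may be pointed elsewhere for a dry run.
scst_pt_mgmt() {
    local handler="$1" cmd="$2" hcil="$3"
    local mgmt="${SCST_SYSFS:-/sys/kernel/scst_tgt}/handlers/$handler/mgmt"
    echo "$cmd $hcil" > "$mgmt"
}
```

For example, "scst_pt_mgmt dev_disk del_device 1:0:0:0" would unassign
SCSI device 1:0:0:0 from the dev_disk handler.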

You need to manually assign each of your real SCSI devices to the
corresponding pass-through dev handler using the "add_device" command,
otherwise the real SCSI devices will not be visible remotely. The
assignment isn't done automatically, because it could lead to load and
initialization problems in the pass-through dev handlers if any of the
local real SCSI devices are malfunctioning.

Like any other hardware, the local SCSI hardware can not handle commands
with an amount of data and/or a segment count in the scatter-gather
array bigger than some values. Therefore, when using the pass-through
mode you should note that the values for the maximum number of segments
and the maximum amount of transferred data (max_sectors) for each SCSI
command on devices on initiators can not be bigger than the
corresponding values of the corresponding SCSI devices on the target.
Otherwise you will see symptoms like small transfers working well but
large ones stalling, and messages like "Unable to complete command due
to SG IO count limitation" being printed in the kernel logs.

You can't control the limit of scatter-gather segments from user space,
but for block devices it is usually sufficient to set
/sys/block/DEVICE_NAME/queue/max_sectors_kb on the initiators to the
same or a lower value than /sys/block/DEVICE_NAME/queue/max_hw_sectors_kb
for the corresponding devices on the target.
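
A minimal sketch of this adjustment for a Linux initiator is shown
below. The function name clamp_max_sectors_kb and the SYS_BLOCK
override are made up for illustration; the target side
max_hw_sectors_kb value has to be read on the target and passed in by
hand:

```shell
# Sketch: lower max_sectors_kb of an imported device so that it does
# not exceed the max_hw_sectors_kb value of the corresponding device on
# the target. A value that is already low enough is left untouched.
# SYS_BLOCK may be pointed elsewhere for a dry run.
clamp_max_sectors_kb() {
    local dev="$1" target_hw_kb="$2"
    local f="${SYS_BLOCK:-/sys/block}/$dev/queue/max_sectors_kb"
    local cur
    cur="$(cat "$f")" || return 1
    if [ "$cur" -gt "$target_hw_kb" ]; then
        echo "$target_hw_kb" > "$f"
    fi
}
```

For example, "clamp_max_sectors_kb sdb 128" limits sdb to 128KB per
command if its current limit is higher.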

For non-block devices SCSI commands are usually generated directly by
applications, so, if you experience stalls of large transfers, you
should check the documentation of your application for how to limit the
transfer sizes.

Another way to solve this issue is to build SG entries with more than 1
page each. See the following patch as an example:
http://scst.sourceforge.net/sgv_big_order_alloc.diff


User space mode using scst_user dev handler
-------------------------------------------

The user space program fileio_tgt uses the interface of the scst_user
dev handler and allows you to see how it works in various modes.
Fileio_tgt provides mostly the same functionality as the scst_vdisk
handler, with the most noticeable difference being that it supports
O_DIRECT mode. O_DIRECT mode is basically the same as BLOCKIO, but also
supports files, so for some loads it could be significantly faster than
the regular FILEIO access. Everything said about BLOCKIO above applies
to O_DIRECT as well. See fileio_tgt's README file for more details.


Performance
-----------

SCST from the very beginning has been designed and implemented to
provide the best possible performance. Since there is no "one size fits
all" best performance configuration for different setups and loads, SCST
provides an extensive set of settings that allow tuning it for the best
performance in each particular case. You don't necessarily have to use
those settings. If you don't, SCST will do a very good job of autotuning
for you, so the resulting performance will, on average, be better
(sometimes, much better) than with other SCSI targets. But in some cases
you can improve it even more by manual tuning.

If you want to get maximum performance from your target, RHEL/CentOS 5.x
kernels are not recommended on either the target or the initiators (if
you are using Linux initiators), because those kernels are based on the
very outdated 2.6.18 kernel and hence miss more than 3 years of
important improvements in the kernel's storage area. You should use at
least the long maintained vanilla 2.6.27.x kernel, although 2.6.29+
would be even better.

Before doing any performance measurements note that performance results
depend very much on your type of load, so it is crucial that you choose
the access mode (FILEIO, BLOCKIO, O_DIRECT, pass-through) which suits
your needs best.

In order to get the maximum performance you should:

1. For SCST:

 - Disable in Makefile CONFIG_SCST_STRICT_SERIALIZING, CONFIG_SCST_EXTRACHECKS,
   CONFIG_SCST_TRACING, CONFIG_SCST_DEBUG*, CONFIG_SCST_STRICT_SECURITY,
   CONFIG_SCST_MEASURE_LATENCY

2. For target drivers:

 - Disable in Makefiles CONFIG_SCST_EXTRACHECKS, CONFIG_SCST_TRACING,
   CONFIG_SCST_DEBUG*

3. For device handlers, including VDISK:

 - Disable in Makefile CONFIG_SCST_TRACING and CONFIG_SCST_DEBUG.

IMPORTANT: The development version of SCST in the SVN is optimized for
=========  development and bug hunting, not for performance. To reconfigure
	   it for performance you should run "make 2perf" in the
	   root of your source code (e.g. trunk/). It will set the above
	   options as needed. The only option it doesn't set is
	   CONFIG_SCST_TEST_IO_IN_SIRQ, so, if needed, you should change
	   it manually.

IMPORTANT: You can't use debug SCST drivers with non-debug SCST core.
=========  So, after disabling both CONFIG_SCST_TRACING and CONFIG_SCST_DEBUG
	   for SCST core you have to disable them for all SCST drivers
	   you are using as well.

4. Make sure you have io_grouping_type option set correctly, especially
in the following cases:

 - Several initiators share your target's backstorage. It can be a
   shared LU using some cluster FS, like VMFS, as well as can be
   different LUs located on the same backstorage (RAID array). For
   instance, if you have 3 initiators and each of them using its own
   dedicated FILEIO device file from the same RAID-6 array on the
   target.

   In this case for the best performance you should have the
   io_grouping_type option set to the value "never" in all the LUNs'
   targets and security groups.

 - Your initiator is connected to your target in MPIO mode. In this case
   for the best performance you should:

    * Either connect all the sessions from the initiator to a single
      target or security group and have the io_grouping_type option set
      to the value "this_group_only" in the target or security group,

    * Or, if it isn't possible to connect all the sessions from the
      initiator to a single target or security group, assign the same
      numeric io_grouping_type value to each target/security group this
      initiator is connected to. The exact value itself doesn't matter,
      it is only important that all the targets/security groups use the
      same value.

Don't forget, io_grouping_type makes sense only if you use the CFQ I/O
scheduler on the target and for devices with threads_num >= 0 and, if
threads_num > 0, with threads_pool_type "per_initiator".

You can check whether io_grouping_type is set correctly in your setup,
as well as whether the "auto" io_grouping_type value works for you, by
tests like the following:

 - For the non-MPIO case you can run single thread sequential reading,
   e.g. using buffered dd, from one initiator, then run the same single
   thread sequential reading from the second initiator in parallel. If
   io_grouping_type is set correctly, the aggregate throughput measured
   on the target should decrease only slightly and all initiators
   should have a nearly equal share of it. If io_grouping_type is not
   set correctly, the aggregate throughput and/or the throughput on any
   initiator will decrease significantly, by a factor of 2 or even more.
   For instance, suppose you have 80MB/s single thread sequential
   reading from the target on any initiator. When both initiators then
   read in parallel, you should see on the target an aggregate
   throughput of something like 70-75MB/s with correct
   io_grouping_type, and something like 35-40MB/s, or 8-10MB/s on any
   initiator, with an incorrect one.

 - For the MPIO case it's much easier. With incorrect io_grouping_type
   you simply won't see a performance increase from adding the second
   session (assuming your hardware is capable of transferring data
   through both sessions in parallel), or can even see a performance
   decrease.
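
The single thread sequential read used in the tests above can be run,
for instance, with a helper like the following. It is only a sketch;
the function name seq_read_test and the default of 1024MB are
assumptions chosen for illustration:

```shell
# Sketch: single thread sequential buffered read of the first
# "count_mb" megabytes of a device or file. dd prints the measured
# throughput on stderr when it finishes.
seq_read_test() {
    local src="$1" count_mb="${2:-1024}"
    dd if="$src" of=/dev/null bs=1M count="$count_mb"
}
```

Run e.g. "seq_read_test /dev/sdb 4096" on one initiator first, then on
two initiators at the same time, and compare the reported throughput.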

5. If you are going to use your target in a VM environment, for
instance as shared storage with VMware, make sure all your VMs are
connected to the target via *separate* sessions. For instance, for iSCSI
it means that each VM has its own connection to the target, instead of
all VMs being connected using a single connection. You can check it
using the SCST sysfs interface. For other transports you should use
available facilities, like NPIV for Fibre Channel, to make separate
sessions for each VM. If you miss it, you can lose a lot of performance
of parallel access to your target from different VMs. This doesn't
apply if your VMs are using the same shared storage, like with VMFS,
for instance. In this case all your VM hosts will be connected to the
target via separate sessions, which is enough.

6. For other target and initiator software parts:

 - Make sure you applied to your kernel all available SCST patches.
   If those patches don't exist for your kernel version, it is strongly
   recommended to upgrade your kernel to a version for which they
   exist.

 - Don't enable debug/hacking features in the kernel, i.e. use them as
   they are by default.

 - The default kernel read-ahead and queuing settings are optimized
   for locally attached disks, therefore they are not optimal for disks
   attached remotely (SCSI target case), which sometimes could lead to
   unexpectedly low throughput. You should increase read-ahead size to at
   least 512KB or even more on all initiators and the target.

   You should also limit on all initiators the maximum amount of sectors
   per SCSI command. This tuning is also recommended on targets with
   large read-ahead values. To do it on Linux, run:

   echo 64 > /sys/block/sdX/queue/max_sectors_kb

   where X is the letter of the device imported from the target,
   like 'b', i.e. sdb.

   To increase read-ahead size on Linux, run:

   blockdev --setra N /dev/sdX

   where N is a read-ahead number in 512-byte sectors and X is a device
   letter like above.

   Note: you need to set the read-ahead setting for device sdX again
   after you have changed the maximum amount of sectors per SCSI command
   for that device.

   Note2: you need to restart SCST after you have changed read-ahead
   settings on the target. It is a limitation of the Linux read-ahead
   implementation. It reads RA values for each file only when the file
   is opened and does not update them when the global RA parameters
   change. Hence the need for vdisk to reopen all its files/devices.

 - You may need to increase the amount of requests that the OS on the
   initiator sends to the target device. To do it on Linux initiators,
   run

   echo 64 > /sys/block/sdX/queue/nr_requests

   where X is a device letter like above.

   You may also experiment with other parameters in the /sys/block/sdX
   directory, they also affect performance. If you find the best values,
   please share them with us.

 - On the target use CFQ IO scheduler. In most cases it has performance
   advantage over other IO schedulers, sometimes huge (2+ times
   aggregate throughput increase).

 - It is recommended to turn the kernel preemption off, i.e. set
   the kernel preemption model to "No Forced Preemption (Server)".

 - XFS appears to be the best filesystem on the target to store device
   files, because it allows considerably better linear write throughput
   than ext3.
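
The max_sectors_kb and read-ahead steps from the list above can be
combined into one helper that applies them in the required order
(max_sectors_kb first, read-ahead second). This is only a sketch; the
function names are made up, and the values in the example are the ones
mentioned above, not measured optima:

```shell
# Convert a read-ahead size in KB to the number of 512-byte sectors
# expected by "blockdev --setra".
ra_kb_to_sectors() {
    echo $(( $1 * 1024 / 512 ))
}

# Sketch: set max_sectors_kb and then the read-ahead size of device
# sdX (given as e.g. "sdb"). Needs root.
tune_imported_dev() {
    local dev="$1" max_kb="$2" ra_kb="$3"
    echo "$max_kb" > "/sys/block/$dev/queue/max_sectors_kb" || return 1
    blockdev --setra "$(ra_kb_to_sectors "$ra_kb")" "/dev/$dev"
}
```

For example, "tune_imported_dev sdb 64 512" limits commands to 64KB and
sets a 512KB read-ahead, i.e. 1024 sectors, on /dev/sdb.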

7. For hardware on target.

 - Make sure that your target hardware (e.g. target FC or network card)
   and underlying IO hardware (e.g. IO card, like SATA, SCSI or RAID to
   which your disks are connected) don't share the same PCI bus. You can
   check it using the lspci utility. They have to work in parallel, so it
   will be better if they don't compete for the bus. The problem is not
   only in the bandwidth, which they have to share, but also in the
   interaction between cards during that competition. This is very
   important, because in some cases, if target and backend storage
   controllers share the same PCI bus, it could lead to 5-10 times
   lower performance than expected. Moreover, some motherboards (by
   Supermicro, particularly) have serious stability issues if there are
   several high speed devices on the same bus working in parallel. If
   you have no choice but PCI bus sharing, set the PCI latency in the
   BIOS as low as possible.

8. If you use the VDISK IO module in FILEIO mode, the NV_CACHE option
will provide you the best performance. But when using it, make sure you
use a good UPS with the ability to shut down the target on a power
failure.

Baseline performance numbers can be found in these measurements:
http://lkml.org/lkml/2009/3/30/283.

IMPORTANT: If you use on initiator some versions of Windows (at least W2K)
=========  you can't get good write performance for VDISK FILEIO devices with
           default 512 bytes block sizes. You could get about 10% of the
	   expected one. This is because of the partition alignment, which
	   is (simplifying) incompatible with how Linux page cache
	   works, so for each write the corresponding block must be read
	   first. Use 4096 bytes block sizes for VDISK devices and you
	   will have the expected write performance. Actually, any OS on
	   initiators, not only Windows, will benefit from block size
	   max(PAGE_SIZE, BLOCK_SIZE_ON_UNDERLYING_FS), where PAGE_SIZE
	   is the page size, BLOCK_SIZE_ON_UNDERLYING_FS is the block size
	   of the underlying FS on which the device file is located, or 0
	   if a device node is used. Both values are from the target.
	   See also important notes about setting block sizes >512 bytes
	   for VDISK FILEIO devices above.
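
The block size rule from the note above can be evaluated on the target
with a few lines of shell. This is a sketch under the assumption that
GNU stat is available; the function names are made up for
illustration:

```shell
# Larger of two sizes, as in
# max(PAGE_SIZE, BLOCK_SIZE_ON_UNDERLYING_FS).
max_block_size() {
    if [ "$1" -ge "$2" ]; then echo "$1"; else echo "$2"; fi
}

# Sketch: suggested block size for a VDISK device backed by a regular
# file. For a device node call max_block_size with 0 as the second
# argument instead. "stat -f -c %S" is GNU stat syntax.
suggest_block_size() {
    local backing_file="$1"
    max_block_size "$(getconf PAGESIZE)" "$(stat -f -c %S "$backing_file")"
}
```

With a 4096 byte page size and a backing file on a file system with
1024 byte blocks, the result is 4096.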


9. In some cases, for instance when working with SSD devices, which
consume 100% of a single CPU for data transfers in their internal
threads, it may be necessary to assign dedicated CPUs to those threads
to maximize IOPS. Consider using the cpu_mask attribute for devices with
threads_pool_type "per_initiator" or Linux CPU affinity facilities for
other threads_pool_types. No IRQ processing should be done on those
CPUs. Check that using /proc/interrupts. See the taskset command and
Documentation/IRQ-affinity.txt in your kernel's source tree for how to
assign CPU affinity to tasks and IRQs.

The reason for that is that processing of incoming commands in SIRQ
context might be done on the same CPUs where the SSD devices' threads
are doing data transfers. As a result, those threads won't receive all
the processing power of those CPUs and will perform worse.
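
Both taskset and the /proc/irq/N/smp_affinity files accept a
hexadecimal CPU bitmask. A small helper to build such a mask from a
list of CPU numbers could look as follows; the function name
cpus_to_mask is made up for illustration, and which CPUs to dedicate
depends entirely on your system:

```shell
# Build a hexadecimal CPU bitmask from a list of CPU numbers,
# e.g. CPUs 2 and 3 give "c".
cpus_to_mask() {
    local mask=0 cpu
    for cpu in "$@"; do
        mask=$(( mask | (1 << cpu) ))
    done
    printf '%x\n' "$mask"
}
```

For example, "taskset -p $(cpus_to_mask 2 3) PID" binds the thread PID
to CPUs 2 and 3, while "echo $(cpus_to_mask 0 1) >
/proc/irq/N/smp_affinity" keeps IRQ N on CPUs 0 and 1.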


Work if target's backstorage or link is too slow
------------------------------------------------

Under high I/O load, when your target's backstorage gets overloaded, or
working over a slow link between initiator and target, when the link
can't serve all the queued commands on time, you can experience I/O
stalls or see in the kernel log abort or reset messages.

First, consider the case of a too slow target's backstorage. On some
seek intensive workloads even fast disks or RAIDs, which are able to
serve a continuous data stream at 500+ MB/s, can be as slow as 0.3 MB/s.
Another possible cause for that can be MD/LVM/RAID on your target as in
http://lkml.org/lkml/2008/2/27/96 (check the whole thread as well).

Thus, in such situations simply processing one or more commands takes
too long, hence the initiator decides that they are stuck on the target
and tries to recover. In particular, it is known that the default amount
of simultaneously queued commands (48) is sometimes too high if you do
intensive writes from VMware on a target disk which uses LVM in the
snapshot mode. In this case a value like 16 or even 8-10, depending on
your backstorage speed, could be more appropriate.

Unfortunately, SCST currently lacks dynamic I/O flow control, in which
the queue depth on the target is dynamically decreased/increased based
on how slow/fast the backstorage is compared to the target link. So,
there are 6 possible actions which you can take to work around or fix
this issue in this case:

1. Ignore incoming task management (TM) commands. It's fine if there are
not too many of them, so average performance isn't hurt and the
corresponding device isn't getting put offline, i.e. if the backstorage
isn't too slow.

2. Decrease /sys/block/sdX/device/queue_depth on the initiator, if
it's Linux (see below how), and/or the SCST_MAX_TGT_DEV_COMMANDS
constant in the scst_priv.h file until you stop seeing incoming TM
commands. The iSCSI-SCST driver also has its own iSCSI specific
parameter for that, see its README file.

To decrease device queue depth on Linux initiators you can run command:

# echo Y >/sys/block/sdX/device/queue_depth

where Y is the new number of simultaneously queued commands and X is
your imported device letter, like 'a' for the sda device. There are no
special limitations for the Y value, it can be any value from 1 to the
possible maximum (usually, 32), so start by dividing the current value
by 2, i.e. set 16 if /sys/block/sdX/device/queue_depth contains 32.

3. Increase the corresponding timeout on the initiator. For Linux it is
located in
/sys/devices/platform/host*/session*/target*:0:0/*:0:0:1/timeout. It can
be done automatically by a udev rule. For instance, the following
rule will increase it to 300 seconds:

SUBSYSTEM=="scsi", KERNEL=="[0-9]*:[0-9]*", ACTION=="add", ATTR{type}=="0|7|14", ATTR{timeout}="300"

By default, this timeout is 30 or 60 seconds, depending on your distribution.

4. Try to avoid such seek intensive workloads.

5. Increase speed of the target's backstorage.

6. Implement dynamic I/O flow control in SCST. This would be the
ultimate solution. See the "Dynamic I/O flow control" section on the
http://scst.sourceforge.net/contributing.html page for a possible
implementation idea.
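
Action 2 above, as far as the Linux initiator side is concerned, can
be sketched as a helper that halves the current queue depth each time
it is called. The function name halve_queue_depth and the SYS_BLOCK
override are made up for illustration:

```shell
# Sketch: halve the queue depth of an imported device (e.g. "sdb") on
# a Linux initiator, never going below 1. SYS_BLOCK may be pointed
# elsewhere for a dry run.
halve_queue_depth() {
    local f="${SYS_BLOCK:-/sys/block}/$1/device/queue_depth"
    local cur new
    cur="$(cat "$f")" || return 1
    new=$(( cur / 2 ))
    [ "$new" -ge 1 ] || new=1
    echo "$new" > "$f"
}
```

Call it repeatedly, e.g. "halve_queue_depth sdb", until the incoming TM
commands disappear from the target's kernel log.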

Next, consider the case of a too slow link between initiator and target,
when the initiator tries to simultaneously push N commands to the target
over it. In this case the time to serve those commands, i.e. to send or
receive data for them over the link, can be longer than the timeout for
any single command, hence one or more commands in the tail of the queue
can not be served within the timeout, so the initiator will decide that
they are stuck on the target and will try to recover.

To work around or fix this issue in this case you can use ways 1, 2, 3,
6 above or (7): increase the speed of the link between target and
initiator. But for some initiator implementations there might be cases
with WRITE commands when the target has no way to detect the issue, so
dynamic I/O flow control will not be able to help. In those cases you
may also need to either decrease the queue depth (way 2) or increase
the corresponding timeout (way 3) on the initiator(s).

Note that logged messages about QUEUE_FULL status are quite different
by nature. This is normal operation, just SCSI flow control in action.
Simply don't enable the "mgmt_minor" logging level, or, alternatively,
if you are confident in the worst case performance of your back-end
storage or initiator-target link, you can increase
SCST_MAX_TGT_DEV_COMMANDS in scst_priv.h to 64. Usually initiators
don't try to push more commands to the target.


Obsolete /proc interface
------------------------

For communication with user space programs SCST also provides a
proc-based interface in the /proc/scsi_tgt directory. This interface is
available in the procfs build only. Starting from version 2.0.0 it is
obsolete and will be removed in one of the next versions. To switch to
the procfs build you need to run the "make enable_proc" command before
building anything else.

It contains the following entries.

  - "help" file, which provides online help for SCST commands

  - "scsi_tgt" file, which on read provides information about devices
    served by SCST and their dev handlers. On write it supports the
    following command:

      * "assign H:C:I:L HANDLER_NAME" assigns dev handler "HANDLER_NAME"
        to the device with host:channel:id:lun. The recommended way to
        find out H:C:I:L numbers is to use the lsscsi utility.

  - "sessions" file, which lists currently connected initiators (open sessions)

  - "sgv" file, which provides some statistics about the block sizes of
    commands coming from remote initiators and how effective sgv_pool is
    in serving those allocations from the cache, i.e. without memory
    allocation requests to the kernel. "Size" is the command's data
    size rounded up to a power of 2, "Hit" is the number of allocations
    served from the cache, "Total" is the total number of allocations.

  - "threads" file, which allows to read and set number of SCST's threads

  - "version" file, which shows version of SCST

  - "trace_level" file, which allows to read and set trace (logging) level
    for SCST. Also this file allows to dump persistent reservations
    information about some device in the log file. See
    /proc/scsi_tgt/help file for list of commands and trace levels. See
    also section "Dealing with massive logs" for more info how to make
    correct logs when you enabled trace levels producing a lot of logs
    data.

Each dev handler has its own subdirectory. Most dev handlers have only
two files in this subdirectory: "trace_level" and "type". The first one
is similar to the main SCST "trace_level" file, the latter one shows the
SCSI type number of this handler as well as some text description.

For example, "echo "assign 1:0:1:0 dev_disk" >/proc/scsi_tgt/scsi_tgt"
will assign device handler "dev_disk" to real device sitting on host 1,
channel 0, ID 1, LUN 0.
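
As a minimal sketch of the lsscsi recommendation above, the following
shell helpers extract the H:C:I:L address from one line of lsscsi output
and build the text of the corresponding "assign" command. Both function
names are hypothetical and not part of SCST. The sketch assumes the usual
lsscsi line format, where each line starts with the address in square
brackets.

```shell
# Hypothetical helpers, not part of SCST.

# Print the H:C:I:L address found at the start of one lsscsi output line,
# e.g. "[1:0:1:0]    disk    ...". Prints nothing if the line does not
# start with a bracketed numeric address.
scst_hcil_from_lsscsi() {
    printf '%s\n' "$1" |
        sed -n 's/^\[\([0-9][0-9]*:[0-9][0-9]*:[0-9][0-9]*:[0-9][0-9]*\)].*/\1/p'
}

# Print the "assign" command text for an address ($1) and a dev handler
# name ($2). Nothing is written to /proc here.
scst_assign_cmd() {
    printf 'assign %s %s\n' "$1" "$2"
}
```

The printed command can then be redirected to /proc/scsi_tgt/scsi_tgt,
for instance "scst_assign_cmd 1:0:1:0 dev_disk >/proc/scsi_tgt/scsi_tgt".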


Access and devices visibility management (LUN masking) - /proc interface
------------------------------------------------------------------------

Access and devices visibility management allows an initiator or a group
of initiators to see different devices with different LUNs and with the
necessary access permissions.

SCST supports two modes of access control:

1. Target-oriented. In this mode you define for each target the devices
and their LUNs, which are accessible to all initiators connected to that
target. This is the regular access control mode, the one people usually
have in mind when they think about access control in general. For
instance, in IET this is the only supported mode. In this mode you
should create a security group with name "Default_TARGET_NAME", where
"TARGET_NAME" is the name of the target, like
"Default_iqn.2007-05.com.example:storage.disk1.sys1.xyz" for target
"iqn.2007-05.com.example:storage.disk1.sys1.xyz". Then you should add to
it all LUNs available from that target.

2. Initiator-oriented. In this mode you define which devices and their
LUNs are accessible for each initiator. In this mode you should create a
separate security group for each set of one or more initiators that
should access the same set of devices with the same LUNs, then add to it
the available devices and the names of the allowed initiator(s).

Both modes can be used simultaneously. In this case the
initiator-oriented mode has higher priority than the target-oriented
one.

When a target driver registers itself in SCST core, it tells SCST core
its name. Then, when there is a new connection from a remote initiator,
the target driver registers this connection in SCST core and tells it
the name of the remote initiator. Then SCST core finds the corresponding
devices for it using the following algorithm:

1. It searches through all defined groups trying to find a group
containing the initiator name. If it succeeds, the found group is used.

2. Otherwise, it searches through all groups trying to find a group with
name "Default_TARGET_NAME". If it succeeds, the found group is used.

3. Otherwise, the group with name "Default" is used. This group is
always defined, but empty by default.
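
The three steps above can be modelled with a short shell sketch. This is
an illustration only, not SCST code: the function name and the group list
format (one "GROUP_NAME [INITIATOR_NAME ...]" entry per line) are
assumptions made for this example, and initiator names are compared
literally, so name patterns are not handled.

```shell
# Hypothetical model of the group lookup; this is not SCST code.
#   $1 - initiator name
#   $2 - target name
#   $3 - group list, one "GROUP_NAME [INITIATOR_NAME ...]" entry per line
scst_pick_group() {
    printf '%s\n' "$3" | awk -v ini="$1" -v tgt="Default_$2" '
        # Step 1: first group whose names list contains the initiator.
        by_name == "" {
            for (i = 2; i <= NF; i++)
                if ($i == ini)
                    by_name = $1
        }
        # Step 2 candidate: the group called Default_TARGET_NAME.
        $1 == tgt { by_target = $1 }
        END {
            if (by_name != "")
                print by_name
            else if (by_target != "")
                print by_target
            else
                print "Default"    # Step 3: this group always exists.
        }'
}
```

With a group list containing "Default", "Default_tgt1" and
"spec_ini ini.special", initiator "ini.special" gets group "spec_ini",
any other initiator connecting to target "tgt1" gets "Default_tgt1", and
an initiator connecting to any other target gets "Default".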

You can find out the names of both the target and the initiator from the
kernel log. There SCST also reports to which group each session is
assigned.

In /proc/scsi_tgt each group is represented as a "groups/GROUP_NAME/"
subdirectory. In it there are files "devices" and "names". File
"devices" lists devices and their LUNs in the group, file "names" lists
names of initiators, which are allowed to access devices in this group.

To configure access and devices visibility management SCST supports the
following commands, written to files under /proc/scsi_tgt:

  - "add_group GROUP_NAME" to /proc/scsi_tgt/scsi_tgt adds group "GROUP_NAME"

  - "del_group GROUP_NAME" to /proc/scsi_tgt/scsi_tgt deletes group "GROUP_NAME"

  - "rename_group OLD_NAME NEW_NAME" to /proc/scsi_tgt/scsi_tgt renames
    group "OLD_NAME" to "NEW_NAME".

  - "add H:C:I:L lun [READ_ONLY]" to /proc/scsi_tgt/groups/GROUP_NAME/devices adds
    the device with host:channel:id:lun to group "GROUP_NAME" with LUN "lun".
    Optionally, the device can be marked as read only. The recommended way to
    find out the H:C:I:L numbers is to use the lsscsi utility.

  - "replace H:C:I:L lun [READ_ONLY]" to /proc/scsi_tgt/groups/GROUP_NAME/devices
    replaces the existing device with LUN "lun" in group "GROUP_NAME" by
    the device with host:channel:id:lun and generates an INQUIRY DATA HAS
    CHANGED Unit Attention. If the old device doesn't exist, this
    command acts as the "add" command. Optionally, the device can be
    marked as read only. The recommended way to find out the H:C:I:L
    numbers is to use the lsscsi utility.

  - "del H:C:I:L" to /proc/scsi_tgt/groups/GROUP_NAME/devices deletes the device
    with host:channel:id:lun from group "GROUP_NAME". The recommended way to
    find out the H:C:I:L numbers is to use the lsscsi utility.

  - "add V_NAME lun [READ_ONLY]" to /proc/scsi_tgt/groups/GROUP_NAME/devices adds
    the device with virtual name "V_NAME" to group "GROUP_NAME" with LUN "lun".
    Optionally, the device can be marked as read only.

  - "replace V_NAME lun [READ_ONLY]" to /proc/scsi_tgt/groups/GROUP_NAME/devices
    replaces the existing device with LUN "lun" in group "GROUP_NAME" by
    the device with virtual name "V_NAME" and generates an INQUIRY DATA
    HAS CHANGED Unit Attention. If the old device doesn't exist, this
    command acts as the "add" command. Optionally, the device can
    be marked as read only.

  - "del V_NAME" to /proc/scsi_tgt/groups/GROUP_NAME/devices deletes device with
    virtual name "V_NAME" from group "GROUP_NAME"

  - "clear" to /proc/scsi_tgt/groups/GROUP_NAME/devices clears the list of devices
    for group "GROUP_NAME"

  - "add NAME" to /proc/scsi_tgt/groups/GROUP_NAME/names adds name "NAME" to group
    "GROUP_NAME". For NAME you can use simple DOS-type patterns
    containing '*' and '?' symbols. '*' matches any sequence of symbols,
    '?' matches exactly one symbol. For instance, pattern "bl?h.*"
    matches name "blah.xxx". Additionally, you can use the negation
    sign '!' to invert the result of the pattern. For instance, pattern
    "!bl?h.*" matches name "ah.xxx".

  - "del NAME" to /proc/scsi_tgt/groups/GROUP_NAME/names deletes name "NAME" from group
    "GROUP_NAME"

  - "move NAME NEW_GROUP_NAME" to /proc/scsi_tgt/groups/OLD_GROUP_NAME/names
    moves name "NAME" from group "OLD_GROUP_NAME" to group "NEW_GROUP_NAME".

  - "clear" to /proc/scsi_tgt/groups/GROUP_NAME/names clears the list of names
    for group "GROUP_NAME"

Examples:

 - "echo "add 1:0:1:0 0" >/proc/scsi_tgt/groups/Default/devices" will
 add real device sitting on host 1, channel 0, ID 1, LUN 0 to "Default"
 group with LUN 0.

 - "echo "add disk1 1" >/proc/scsi_tgt/groups/Default/devices" will
 add virtual VDISK device with name "disk1" to "Default" group
 with LUN 1.

 - "echo "add 21:*:e0:?b:83:*" >/proc/scsi_tgt/groups/LAB1/names" will
 add a pattern, which matches WWNs of Fibre Channel ports from LAB1.
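
To check an initiator name pattern before writing it to a "names" file,
the matching rules for '*', '?' and a leading '!' can be approximated in
the shell. The function below is a hypothetical helper, not part of SCST.
It relies on shell pattern matching, which is an assumption of this
sketch and is only similar to the matching done by SCST: unlike the
simple DOS-type patterns, the shell also treats "[...]" as special.

```shell
# Hypothetical helper, not part of SCST: succeeds (exit status 0) if
# initiator name $2 matches pattern $1. Shell pattern matching stands in
# for the DOS-type matching; the shell also treats "[...]" specially.
scst_name_matches() {
    pattern=$1
    name=$2
    case $pattern in
    '!'*)
        # A leading "!" inverts the result of the rest of the pattern.
        case $name in
        ${pattern#?}) return 1 ;;
        *)            return 0 ;;
        esac
        ;;
    esac
    case $name in
    $pattern) return 0 ;;
    *)        return 1 ;;
    esac
}
```

For instance, "scst_name_matches 'bl?h.*' blah.xxx" and
"scst_name_matches '!bl?h.*' ah.xxx" both succeed, while
"scst_name_matches '!bl?h.*' blah.xxx" fails.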

Suppose you need to have an iSCSI target with name
"iqn.2007-05.com.example:storage.disk1.sys1.xyz" (you defined it in
iscsi-scst.conf), which should export virtual device "dev1" with LUN 0
and virtual device "dev2" with LUN 1, but the initiator with name
"iqn.2007-05.com.example:storage.disk1.spec_ini.xyz" should see only
virtual device "dev2" with LUN 0. To achieve that you should run the
following commands:

# echo "add_group Default_iqn.2007-05.com.example:storage.disk1.sys1.xyz" >/proc/scsi_tgt/scsi_tgt
# echo "add dev1 0" >/proc/scsi_tgt/groups/Default_iqn.2007-05.com.example:storage.disk1.sys1.xyz/devices
# echo "add dev2 1" >/proc/scsi_tgt/groups/Default_iqn.2007-05.com.example:storage.disk1.sys1.xyz/devices

# echo "add_group spec_ini" >/proc/scsi_tgt/scsi_tgt
# echo "add iqn.2007-05.com.example:storage.disk1.spec_ini.xyz" >/proc/scsi_tgt/groups/spec_ini/names
# echo "add dev2 0" >/proc/scsi_tgt/groups/spec_ini/devices

It is highly recommended to use the scstadmin utility instead of the
low-level interface described in this section.

IMPORTANT
=========

There must be LUN 0 in each security group, i.e. LUN numbering must not
start from, e.g., 1. Otherwise you will see no devices on remote
initiators and SCST core will write into the kernel log the message
"tgt_dev for LUN 0 not found, command to unexisting LU?"
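
The LUN 0 rule can be checked before any command is written to /proc.
The following sketch is a hypothetical pre-flight check, not part of
SCST. It assumes that the planned "add" and "replace" commands for one
group are kept as plain text, one per line, in the
"add DEVICE lun [READ_ONLY]" form described above.

```shell
# Hypothetical pre-flight check, not part of SCST. Reads the planned
# "add"/"replace" commands for the "devices" file of one group on stdin
# and succeeds (exit status 0) only if at least one of them uses LUN 0.
scst_plan_has_lun0() {
    awk '
        # The LUN is the third field in both the H:C:I:L and V_NAME forms.
        ($1 == "add" || $1 == "replace") && $3 ~ /^0+$/ { found = 1 }
        END { exit !found }'
}
```

For instance, a plan consisting of "add dev1 0" and "add dev2 1" passes
the check, while a plan consisting of "add dev1 1" and "add dev2 2"
fails it.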

IMPORTANT
=========

All the access control must be fully configured BEFORE loading the
corresponding target driver! When you load a target driver or enable
target mode in it, as for the qla2x00t driver, it will immediately start
accepting new connections, hence creating new sessions, and those new
sessions will be assigned to security groups according to the
*currently* configured access control settings. For instance, a session
will be assigned to the "Default" group instead of "HOST004", as you may
need, because "HOST004" doesn't exist yet. So, one must configure all
the security groups before new connections from the initiators are
created, i.e. before the target drivers are loaded.

Access controls can be altered after the target driver is loaded as long
as the target session doesn't yet exist. Even if the session already
exists, changes are still possible, but they won't be reflected on the
initiator side.

So, the safest choice is to configure all the access control before
loading any target driver and afterwards only add new devices to new
groups for new initiators or add new devices to old groups, without
altering existing LUNs in them.


VDISK /proc interface
---------------------

Starting from version 2.0.0 this interface is obsolete and will be
removed in one of the next versions. To switch to it you should run
"make enable_proc".

After loading, the VDISK device handler creates subdirectories "vdisk"
and "vcdrom" in /proc/scsi_tgt/. They have the following layout:

  - "trace_level" and "type" files as described above

  - "help" file, which provides online help for VDISK commands

  - "vdisk"/"vcdrom" files, which on read provide information about
    currently open device files. On write they support the following
    commands:

    * "open NAME [PATH] [BLOCK_SIZE] [FLAGS]" - opens file "PATH" as
      device "NAME" with block size "BLOCK_SIZE" bytes with flags
      "FLAGS". "PATH" could be empty only for VDISK CDROM. "BLOCK_SIZE"
      and "FLAGS" are valid only for disk VDISK. The block size must be
      a power of 2 and >= 512 bytes. Default is 512. Possible flags:

      - WRITE_THROUGH - write back caching disabled. Note, this option
        makes sense only if you also *manually* disable write-back cache
        in *all* your backstorage devices and make sure it's actually
        disabled, since many devices are known to lie about this mode to
        get better benchmark results.

      - READ_ONLY - read only

      - O_DIRECT - both read and write caching disabled. This mode
        isn't currently fully implemented, you should use user space
        fileio_tgt program in O_DIRECT mode instead (see below).

      - NULLIO - in this mode no real IO will be done, but success will be
        returned. Intended to be used for performance measurements in the
        same way as "*_perf" handlers.

      - NV_CACHE - enables "non-volatile cache" mode. In this mode it is
        assumed that the target has a GOOD UPS with the ability to cleanly
        shut down the target in case of power failure and that it is
        free of software/hardware bugs, i.e. all data from the target's
        cache are guaranteed sooner or later to go to the media. Hence
        all operations synchronizing data with the media, like
        SYNCHRONIZE_CACHE, are ignored in order to bring more
        performance. Also in this mode the target reports to initiators
        that the corresponding device has write-through cache to disable
        all write-back cache workarounds used by initiators. Use with
        extreme caution, since in this mode after a crash of the target
        journaled file systems don't guarantee consistency after
        journal recovery, therefore manual fsck MUST be run. Note that,
        since the journal barrier protection (see "IMPORTANT" note
        below) is usually turned off, enabling NV_CACHE could change
        nothing from the data protection point of view, since no
        operations synchronizing data with the media will come from the
        initiator. This option overrides WRITE_THROUGH.

      - BLOCKIO - enables block mode, which will perform direct block
        IO with a block device, bypassing page-cache for all operations.
	This mode works ideally with high-end storage HBAs and for
	applications that either do not need caching between application
	and disk or need the large block throughput. See also below.

      - REMOVABLE - with this flag set the device is reported to remote
        initiators as removable.

    * "close NAME" - closes device "NAME".

    * "resync_size NAME" - refreshes size of device "NAME". Intended to be
      used after device resize.

    * "change NAME [PATH]" - changes a virtual CD in the VDISK CDROM.

    * "set_t10_dev_id NAME T10_DEVICE_ID" - sets T10 vendor specific
      identifier on Device Identification VPD page (0x83) of device
      "NAME" in INQUIRY data. By default VDISK handler always generates
      T10_DEVICE_ID for every newly created device at creation time.
      This command allows overwriting the T10_DEVICE_ID value generated
      by VDISK.

By default, if neither BLOCKIO, nor NULLIO option is supplied, FILEIO
mode is used.

For example, "echo "open disk1 /vdisks/disk1" >/proc/scsi_tgt/vdisk/vdisk"
will open file /vdisks/disk1 as virtual FILEIO disk with name "disk1".
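
The block size rule for the "open" command (a power of 2 and at least
512 bytes) is easy to get wrong when the command is typed by hand. The
following sketch builds the command text and rejects invalid block
sizes. The function name is hypothetical and not part of SCST, and the
sketch assumes that multiple flags are separated by spaces, which the
description above does not state explicitly.

```shell
# Hypothetical helper, not part of SCST: prints an "open" command for a
# disk VDISK. Usage: scst_vdisk_open_cmd NAME PATH BLOCK_SIZE [FLAG ...]
scst_vdisk_open_cmd() {
    if [ $# -lt 3 ]; then
        echo "usage: scst_vdisk_open_cmd NAME PATH BLOCK_SIZE [FLAG ...]" >&2
        return 1
    fi
    name=$1
    path=$2
    bs=$3
    shift 3

    # Accept only plain decimal numbers without leading zeros.
    case $bs in
    ''|0*|*[!0-9]*)
        echo "block size must be a decimal number" >&2
        return 1
        ;;
    esac

    # Halving a power of 2 repeatedly reaches 1 without ever hitting an
    # odd value on the way; any other number stops at an odd value > 1.
    n=$bs
    while [ "$n" -gt 1 ] && [ $((n % 2)) -eq 0 ]; do
        n=$((n / 2))
    done
    if [ "$bs" -lt 512 ] || [ "$n" -ne 1 ]; then
        echo "block size must be a power of 2 and >= 512" >&2
        return 1
    fi

    # Assumption: flags are appended separated by single spaces.
    cmd="open $name $path $bs"
    for flag in "$@"; do
        cmd="$cmd $flag"
    done
    printf '%s\n' "$cmd"
}
```

For instance, "scst_vdisk_open_cmd disk1 /vdisks/disk1 4096 BLOCKIO"
prints "open disk1 /vdisks/disk1 4096 BLOCKIO", which can then be
redirected to /proc/scsi_tgt/vdisk/vdisk. A block size of 1000 or 256 is
rejected with a non-zero exit status.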


Credits
-------

Thanks to:

 * Mark Buechler <mark.buechler@gmail.com> for a lot of useful
   suggestions, bug reports and help in debugging.

 * Ming Zhang <mingz@ele.uri.edu> for fixes and comments.

 * Nathaniel Clark <nate@misrule.us> for fixes and comments.

 * Calvin Morrow <calvin.morrow@comcast.net> for testing and useful
   suggestions.

 * Hu Gang <hugang@soulinfo.com> for the original version of the
   LSI target driver.

 * Erik Habbinga <erikhabbinga@inphase-tech.com> for fixes and support
   of the LSI target driver.

 * Ross S. W. Walker <rswwalker@hotmail.com> for BLOCKIO inspiration
   and Vu Pham <huongvp@yahoo.com> who implemented it for VDISK dev handler.

 * Alessandro Premoli <a.premoli@andxor.it> for fixes

 * Terry Greeniaus <tgreeniaus@yottayotta.com> for fixes.

 * Krzysztof Blaszkowski <kb@sysmikro.com.pl> for many fixes and bug reports.

 * Jianxi Chen <pacers@users.sourceforge.net> for fixing problem with
   devices >2TB in size

 * Bart Van Assche <bvanassche@acm.org> for a lot of help

 * University of New Hampshire Interoperability Labs (UNH IOL, http://www.iol.unh.edu)
   for UNH-iSCSI project (http://www.iol.unh.edu/consortiums/iscsi/index.html)
   on which interface between SCST core and target drivers was based.

 * Daniel Debonzi <debonzi@linux.vnet.ibm.com> for a big part of the
   initial SCST sysfs tree implementation


Vladislav Bolkhovitin <vst@vlnb.net>, http://scst.sourceforge.net