Generic SCSI target mid-level for Linux (SCST)
==============================================

Version 1.0.2, XX XXXXX 2009
----------------------------

SCST is designed to provide a unified, consistent interface between SCSI
target drivers and the Linux kernel and to simplify target driver
development as much as possible. A detailed description of SCST's
features and internals can be found in the "Generic SCSI Target Middle
Level for Linux" document on SCST's Internet page
http://scst.sourceforge.net.

SCST supports the following I/O modes:

 * Pass-through mode with a one-to-many relationship, i.e. when multiple
   initiators can connect to the exported pass-through devices, for
   the following SCSI device types: disks (type 0), tapes (type 1),
   processors (type 3), CDROMs (type 5), MO disks (type 7), medium
   changers (type 8) and RAID controllers (type 0xC)

 * FILEIO mode, which allows files on file systems or block devices to
   be used as virtual remotely available SCSI disks or CDROMs with the
   benefits of the Linux page cache

 * BLOCKIO mode, which performs direct block IO with a block device,
   bypassing the page cache for all operations. This mode works ideally
   with high-end storage HBAs and for applications that either do not
   need caching between application and disk or need large block
   throughput

 * User space mode using the scst_user device handler, which allows
   virtual SCSI devices to be implemented in user space in the SCST
   environment

 * "Performance" device handlers, which provide, in pseudo pass-through
   mode, a way to measure performance directly without the overhead of
   actually transferring data to/from the underlying SCSI device

In addition, SCST supports advanced per-initiator access and device
visibility management, so different initiators can see different sets
of devices with different access permissions. See below for details.


Installation
------------

Only vanilla kernels from kernel.org and RHEL/CentOS 5.2 kernels are
supported, but SCST should work on other (vendors') kernels if you
manage to compile it successfully on them. The main problem with
vendors' kernels is that they often contain patches which will appear
only in the next version of the vanilla kernel, so it is quite hard to
track such changes. Thus, if during compilation for some vendor kernel
your compiler complains about redefinition of some symbol, you should
either switch to a vanilla kernel, or add or change as necessary the
"#if LINUX_VERSION_CODE" statement corresponding to that symbol.

First, make sure that the link "/lib/modules/`your_kernel_version`/build"
points to the source code for your currently running kernel.
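
The check above can be scripted. The following is only a sketch: the
function name check_build_link and its second argument (an alternative
modules root, useful for trying it out without touching /lib/modules)
are made up for illustration and are not part of SCST.

```shell
# Hypothetical helper: print where the kernel "build" link points.
# $1 - kernel release (defaults to the running kernel, `uname -r`)
# $2 - modules root (defaults to /lib/modules)
check_build_link() {
    release="${1:-$(uname -r)}"
    modroot="${2:-/lib/modules}"
    link="$modroot/$release/build"
    # -e follows symlinks, so a dangling link is reported as missing too
    if [ ! -e "$link" ]; then
        echo "missing: $link" >&2
        return 1
    fi
    # readlink -f resolves the symlink chain to the final directory
    readlink -f "$link"
}
```

Called without arguments, it prints the directory that the link for the
running kernel resolves to, which you can then compare with the location
of your kernel source tree.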

Then you should consider applying the necessary kernel patches. SCST has
the following patches for the kernel in the "kernel" subdirectory. All
of them are optional, so if you don't need the corresponding
functionality, you can skip them.

1. scst_exec_req_fifo-2.6.X.patch. This patch is necessary for
pass-through dev handlers, because in the mainstream kernels
scsi_do_req()/scsi_execute_async() work in LIFO order, instead of the
expected and required FIFO. So SCST needs new functions
scsi_do_req_fifo() or scsi_execute_async_fifo() to be added to the
kernel. This patch does that. You don't need to patch the kernel if you
don't need pass-through support. Alternatively, you can define the
CONFIG_SCST_STRICT_SERIALIZING compile option during the compilation
(see description below). This patch is optional for kernels starting
from 2.6.30. On those kernels pass-through will work well without it.
(Actually, the implementation of scsi_async_exec(), which you can find
in scst_lib.c for kernels >=2.6.30, can work on the earlier kernels as
well, so you're welcome to backport it.)

2. io_context-2.6.X.patch. This patch exports some IO context management
functions from the kernel. For performance reasons SCST queues commands
using a pool of IO threads. It is considerably better for performance
(>30% increase on sequential reads) if threads in a pool have the same
IO context. This patch allows that. If you don't apply this patch, you
will lose this performance benefit.

3. readahead-2.6.X.patch. This patch fixes a problem in the Linux
readahead subsystem and greatly improves performance for software RAIDs.
See the
http://sourceforge.net/mailarchive/forum.php?thread_name=a0272b440906030714g67eabc5k8f847fb1e538cc62%40mail.gmail.com&forum_name=scst-devel
thread for more details.

4. readahead-context-2.6.X.patch. This is a backport of the 2.6.31
version of the context readahead patch http://lkml.org/lkml/2009/4/12/9,
big thanks to Wu Fengguang. This is a performance improvement patch. It
is included in the mainstream kernel 2.6.31.

Then, to compile SCST, type 'make scst'. It will build SCST itself and
its device handlers. To install them, type 'make scst_install'. The
driver modules will be installed in
'/lib/modules/`your_kernel_version`/extra'. In addition, scst.h,
scst_debug.h as well as Module.symvers or Modules.symvers will be copied
to '/usr/local/include/scst'. The first file contains all of SCST's
public data definitions, which are used by target drivers. The other
ones support debug message logging and the build process.

Then you can load any module by typing 'modprobe module_name'. The names
are:

 - scst - SCST itself
 - scst_disk - device handler for disks (type 0)
 - scst_tape - device handler for tapes (type 1)
 - scst_processor - device handler for processors (type 3)
 - scst_cdrom - device handler for CDROMs (type 5)
 - scst_modisk - device handler for MO disks (type 7)
 - scst_changer - device handler for medium changers (type 8)
 - scst_raid - device handler for storage array controllers (e.g. RAID) (type 0xC)
 - scst_vdisk - device handler for virtual disks (file, device or ISO CD image)
 - scst_user - user space device handler

Then, to see your devices remotely, you need to add them to at least the
"Default" security group (see below how). By default, no local devices
are seen remotely. There must be a LUN 0 in each security group, i.e. LU
numbering must not start from, e.g., 1. Otherwise you will see no
devices on remote initiators and the SCST core will write the following
message into the kernel log: "tgt_dev for LUN 0 not found, command to
unexisting LU?"

It is highly recommended to use the scstadmin utility for configuring
devices and security groups.

If you experience problems while loading or running the modules, check
your kernel logs (or run the dmesg command for the few most recent
messages).

IMPORTANT: Without loading the appropriate device handler, the corresponding
=========  devices will be invisible to remote initiators, which could lead
           to holes in the LUN addressing, so automatic device scanning by
           the remote SCSI mid-level might not notice the devices. In that
           case you will have to add them manually via
           'echo "- - -" >/sys/class/scsi_host/hostX/scan',
           where X is the host number.
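
If you do not know which host number to use, the manual scan can be
repeated for every SCSI host on the initiator. This is only a sketch:
the function name rescan_scsi_hosts and its optional directory argument
are made up for illustration, and it must be run as root.

```shell
# Hypothetical helper: trigger a wildcard rescan on every SCSI host.
# $1 - directory holding the hostX entries
#      (defaults to /sys/class/scsi_host)
rescan_scsi_hosts() {
    hostdir="${1:-/sys/class/scsi_host}"
    for scan in "$hostdir"/host*/scan; do
        # Skip the unexpanded glob when there are no hosts at all
        [ -e "$scan" ] || continue
        # "- - -" means: all channels, all targets, all LUNs
        echo "- - -" >"$scan"
    done
}
```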

IMPORTANT: Running target and initiator on the same host is
=========  supported, except in the following 2 cases: swap over a target
           exported device and using a writable mmap over a file from a
           target exported device. The latter means you can't mount a
           file system over a target exported device. In other words,
           you can freely use any sg, sd, st, etc. devices imported from
           the target on the same host, but you can't mount file systems
           or put swap on them. This is a limitation of the Linux
           memory/cache manager, because in this case an OOM deadlock is
           possible: system needs some memory -> it decides to clear
           some cache -> cache needs to write to the target exported
           device -> initiator sends request to the target -> target
           needs memory -> system needs even more memory -> deadlock.

IMPORTANT: In the current version simultaneous access to local SCSI devices
=========  via standard high-level SCSI drivers (sd, st, sg, etc.) and
           SCST's target drivers is unsupported. This is especially
           important for commands executed via sg and st that change
           the state of devices and their parameters, because that could
           lead to data corruption. If any such command is executed, at
           least the related device handler(s) must be restarted. For
           block devices, READ/WRITE commands using the direct disk
           handler look to be safe.

To uninstall, type 'make scst_uninstall'.


Usage in failover mode
----------------------

It is recommended to use the TEST UNIT READY ("tur") command to check if
the SCST target is alive.


Device handlers
---------------

Device specific drivers (device handlers) are plugins for SCST, which
help SCST to analyze incoming requests and determine parameters
specific to various types of devices. If an appropriate device handler
for a SCSI device type isn't loaded, SCST doesn't know how to handle
devices of this type, so they will be invisible to remote initiators
(more precisely, a "LUN not supported" sense code will be returned).

In addition to device handlers for real devices, there are VDISK, user
space and "performance" device handlers.

The VDISK device handler works over files on file systems and makes
virtual remotely available SCSI disks or CDROMs from them. In addition,
it can work directly over a block device, e.g. a local IDE or SCSI disk
or even a disk partition, where there is no file system overhead. Using
block devices, compared to sending SCSI commands directly to the SCSI
mid-level via scsi_do_req()/scsi_execute_async(), has the advantage that
data are transferred via the system cache, so it is possible to fully
benefit from caching and read-ahead performed by Linux's VM subsystem.
The only disadvantage is that in the FILEIO mode there is superfluous
data copying between the cache and SCST's buffers. This issue is going
to be addressed in the next release. Virtual CDROMs are useful for
remote installation. See below for details on how to set up and use the
VDISK device handler.

The SCST user space device handler provides an interface between SCST
and user space, which allows pure user space devices to be created. The
simplest example where one would want it is writing a VTL. With
scst_user it can be written purely in user space. Another case is when
the passed data need processing that is too sophisticated for kernel
space, like encrypting them or making snapshots.

"Performance" device handlers for disks, MO disks and tapes in their
exec() method skip (pretend to execute) all READ and WRITE operations
and thus provide a way to measure link performance directly without the
overhead of actually transferring data to/from the underlying SCSI
device.

NOTE: Since "perf" device handlers on READ operations don't touch the
====  commands' data buffer, it is returned to remote initiators as it
      was allocated, without even being zeroed. Thus, "perf" device
      handlers impose some security risk, so use them with caution.


Compilation options
-------------------

The following compilation options can be commented in/out in the
Makefile:

 - CONFIG_SCST_DEBUG - if defined, turns on some debugging code,
   including some logging. Makes the driver considerably bigger and slower
   and produces a large amount of log data.

 - CONFIG_SCST_TRACING - if defined, turns on the ability to log events. Makes
   the driver considerably bigger and leads to some performance loss.

 - CONFIG_SCST_EXTRACHECKS - if defined, adds extra validity checks in
   various places.

 - CONFIG_SCST_USE_EXPECTED_VALUES - if not defined (default), the
   initiator supplied expected data transfer length and direction will
   be used only for verification purposes, to return an error or warn if
   one of them is invalid. Instead, the values locally decoded from the
   SCSI command will be used. This is necessary for security reasons,
   because otherwise a faulty initiator can crash the target by
   supplying an invalid value in one of those parameters. This is
   especially important in case of pass-through mode. If
   CONFIG_SCST_USE_EXPECTED_VALUES is defined, the initiator supplied
   expected data transfer length and direction will override the locally
   decoded values. This might be necessary if the internal SCST command
   translation table doesn't contain a SCSI command which is used in
   your environment. You can tell this is the case if you have messages like
   "Unknown opcode XX for YY. Should you update scst_scsi_op_table?" in
   your kernel log and your initiator returns an error. Also report
   those messages to the SCST mailing list
   scst-devel@lists.sourceforge.net. Note that not all SCSI transports
   support supplying expected values.

 - CONFIG_SCST_DEBUG_TM - if defined, turns on task management functions
   debugging, when on LUN 0 in the default access control group some of the
   commands will be delayed for about 60 sec., thus making the remote
   initiator send TM functions, e.g. ABORT TASK and TARGET RESET. Also
   define the CONFIG_SCST_TM_DBG_GO_OFFLINE symbol in the Makefile if you
   want the device to eventually become completely unresponsive;
   otherwise it will keep circling around the ABORTs and RESETs code.
   Needs CONFIG_SCST_DEBUG turned on.

 - CONFIG_SCST_STRICT_SERIALIZING - if defined, makes SCST send all commands to
   the underlying SCSI device synchronously, one after another. This
   makes task management more reliable, at the cost of some performance
   penalty. This is mostly relevant for stateful SCSI devices like
   tapes, where the result of a command's execution depends on the
   device's settings defined by previous commands. Disk and RAID devices
   are stateless in most cases. The current SCSI core in Linux doesn't
   allow all commands to be aborted reliably if they are sent
   asynchronously to a stateful device. Turned off by default; turn it
   on if you use stateful device(s) and need as much error recovery
   reliability as possible. As a side effect of
   CONFIG_SCST_STRICT_SERIALIZING, no kernel patching is necessary for
   pass-through device handlers (scst_disk, etc.).

 - CONFIG_SCST_ALLOW_PASSTHROUGH_IO_SUBMIT_IN_SIRQ - if defined, it will be
   allowed to submit pass-through commands to real SCSI devices via the SCSI
   middle layer using the scsi_execute_async() function from soft IRQ
   context (tasklets). This used to be the default, but currently it
   seems the SCSI middle layer has started expecting only thread context
   on the IO submit path, so it is now disabled by default. Enabling it
   will decrease the number of context switches and improve performance.
   It is more or less safe: in the worst case, if in your configuration
   the SCSI middle layer really doesn't expect SIRQ context in the
   scsi_execute_async() function, you will get a warning message in the
   kernel log.

 - CONFIG_SCST_STRICT_SECURITY - if defined, makes SCST zero allocated data
   buffers. Undefining it (default) considerably improves performance
   and eases CPU load, but could create a security hole (information
   leakage), so enable it, if you have strict security requirements.

 - CONFIG_SCST_ABORT_CONSIDER_FINISHED_TASKS_AS_NOT_EXISTING - if defined,
   when the TASK MANAGEMENT function ABORT TASK tries to abort a
   command which has already finished, the remote initiator which sent
   the ABORT TASK request will receive a TASK NOT EXIST (or ABORT
   FAILED) response for the ABORT TASK request. This is the more logical
   response, since the command has finished and so the attempt to abort
   it failed, but some initiators, particularly the VMware iSCSI
   initiator, treat a TASK NOT EXIST response as if the target got crazy
   and try to RESET it. Then they sometimes get crazy themselves. So,
   this option is disabled by default.

 - CONFIG_SCST_MEASURE_LATENCY - if defined, provides the average command
   processing latency in the /proc/scsi_tgt/latency file. You can clear
   already measured results by writing 0 to this file. Note, you need a
   non-preemptible kernel to have correct results.

HIGHMEM kernel configurations are fully supported, but not recommended
for performance reasons, except for scst_user, where they are not
supported, because this module deals with user supplied memory in a
zero-copy manner. If you need to use it, consider changing the VMSPLIT
option or using a 64-bit system configuration instead.

To change the VMSPLIT option (CONFIG_VMSPLIT to be precise), set the
following variables in "make menuconfig":

 - General setup->Configure standard kernel features (for small systems): ON

 - General setup->Prompt for development and/or incomplete code/drivers: ON

 - Processor type and features->High Memory Support: OFF

 - Processor type and features->Memory split: according to the amount of
   memory you have. If it is less than 800MB, you don't need to touch
   this option at all.


Module parameters
-----------------

Module scst supports the following parameters:

 - scst_threads - sets the number of SCST's threads. By default it is
   the CPU count.

 - scst_max_cmd_mem - sets the maximum amount of memory in MB allowed to
   be consumed by the SCST commands for data buffers at any given time.
   By default it is approximately TotalMem/4.


SCST "/proc" commands
---------------------

For communication with user space programs SCST provides a proc-based
interface in the "/proc/scsi_tgt" directory. It contains the following
entries:

  - "help" file, which provides online help for SCST commands

  - "scsi_tgt" file, which on read provides information about the devices
    served by SCST and their dev handlers. On write it supports the
    following command:

      * "assign H:C:I:L HANDLER_NAME" assigns dev handler "HANDLER_NAME"
        to the device with host:channel:id:lun. The recommended way to
        find out H:C:I:L numbers is to use the lsscsi utility.

  - "sessions" file, which lists currently connected initiators (open sessions)

  - "sgv" file, which provides some statistics about the block sizes of
    commands coming from remote initiators and how effective sgv_pool is
    at serving those allocations from the cache, i.e. without memory
    allocation requests to the kernel. "Size" is the command's data
    size rounded up to a power of 2, "Hit" is the number of allocations
    served from the cache, "Total" is the total number of allocations.

  - "threads" file, which allows reading and setting the number of SCST's
    threads

  - "version" file, which shows the version of SCST

  - "trace_level" file, which allows reading and setting the trace
    (logging) level for SCST. See the "help" file for the list of trace
    levels. If you want to enable logging options which produce a lot of
    events, like "debug", then to avoid losing logged events you should
    also:

     * Increase the CONFIG_LOG_BUF_SHIFT variable in the .config of your
       kernel to a much bigger value, then recompile it. For example, I
       use 25, but to use it I needed to modify the maximum allowed
       value for CONFIG_LOG_BUF_SHIFT in the corresponding Kconfig.

     * Change your /etc/syslog.conf or other config file of your favorite
       logging program to store kernel logs in an async manner. For example,
       I added the line "kern.info -/var/log/kernel" to my rsyslog.conf
       and added "kern.none" to the line for /var/log/messages, so I had:
       "*.info;kern.none;mail.none;authpriv.none;cron.none /var/log/messages"

Each dev handler has its own subdirectory. Most dev handlers have only
two files in this subdirectory: "trace_level" and "type". The first one
is similar to the main SCST "trace_level" file, the latter one shows the
SCSI type number of this handler as well as some text description.

For example, "echo "assign 1:0:1:0 dev_disk" >/proc/scsi_tgt/scsi_tgt"
will assign device handler "dev_disk" to real device sitting on host 1,
channel 0, ID 1, LUN 0.
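
Since a mistyped H:C:I:L is easy to miss, the assign command can be
wrapped in a small helper that checks the address format first. This is
only a sketch: the function name scst_assign and the SCST_PROC variable
(an override for the /proc/scsi_tgt location) are made up for
illustration and are not part of SCST.

```shell
# Hypothetical helper: assign a dev handler to a real device.
# $1 - device address as H:C:I:L, $2 - dev handler name
scst_assign() {
    hcil="$1"
    handler="$2"
    # Require exactly four colon-separated decimal numbers
    if ! printf '%s\n' "$hcil" | grep -Eq '^[0-9]+:[0-9]+:[0-9]+:[0-9]+$'; then
        echo "expected H:C:I:L, got '$hcil'" >&2
        return 1
    fi
    echo "assign $hcil $handler" >"${SCST_PROC:-/proc/scsi_tgt}/scsi_tgt"
}
```

With it, the example above becomes "scst_assign 1:0:1:0 dev_disk".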


Access and devices visibility management (LUN masking)
------------------------------------------------------

Access and device visibility management allows an initiator or a group
of initiators to see different devices with different LUNs and with the
necessary access permissions.

SCST supports two modes of access control:

1. Target-oriented. In this mode you define for each target the devices
and their LUNs which are accessible to all initiators connected to that
target. This is the regular access control mode, which people usually
mean when thinking about access control in general. For instance, in IET
this is the only supported mode. In this mode you should create a
security group with name "Default_TARGET_NAME", where "TARGET_NAME" is
the name of the target, like
"Default_iqn.2007-05.com.example:storage.disk1.sys1.xyz" for target
"iqn.2007-05.com.example:storage.disk1.sys1.xyz". Then you should add to
it all LUNs available from that target.

2. Initiator-oriented. In this mode you define which devices and their
LUNs are accessible to each initiator. In this mode you should create a
separate security group for each set of one or more initiators which
should access the same set of devices with the same LUNs, then add to it
the available devices and the names of the allowed initiator(s).

Both modes can be used simultaneously. In this case the
initiator-oriented mode has higher priority than the target-oriented
one.

When a target driver registers itself in SCST core, it tells SCST core
its name. Then, when there is a new connection from a remote initiator,
the target driver registers this connection in SCST core and tells it
the name of the remote initiator. Then SCST core finds the corresponding
devices for it using the following algorithm:

1. It searches through all defined groups trying to find a group
containing the initiator name. If it succeeds, the found group is used.

2. Otherwise, it searches through all groups trying to find a group with
name "Default_TARGET_NAME". If it succeeds, the found group is used.

3. Otherwise, the group with name "Default" is used. This group is
always defined, but empty by default.
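
The three steps above can be sketched in shell to predict which group a
session would land in. This is not SCST's implementation, only an
illustration under several assumptions: the function name
scst_find_group and the SCST_PROC variable are made up, each "names"
file is assumed to hold one name or pattern per line, groups are visited
in alphabetical order rather than in SCST's internal order, and shell
glob matching is used as an approximation of SCST's pattern matching.

```shell
# Hypothetical helper: print the group a new session would be assigned to.
# $1 - initiator name, $2 - target name
scst_find_group() {
    initiator="$1"
    target="$2"
    root="${SCST_PROC:-/proc/scsi_tgt}/groups"
    # Step 1: a group whose "names" list matches the initiator name
    for names in "$root"/*/names; do
        [ -f "$names" ] || continue
        while IFS= read -r pattern; do
            [ -n "$pattern" ] || continue
            # Unquoted, so "case" treats $pattern as a glob ('*' and '?')
            case "$initiator" in
                $pattern)
                    basename "$(dirname "$names")"
                    return 0
                    ;;
            esac
        done <"$names"
    done
    # Step 2: the per-target default group, if it has been created
    if [ -d "$root/Default_$target" ]; then
        echo "Default_$target"
        return 0
    fi
    # Step 3: the "Default" group, which always exists
    echo "Default"
}
```

The authoritative answer is always the kernel log, where SCST reports
the group each session was actually assigned to.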

You can find the names of both the target and the initiator in the
kernel log. There SCST reports to which group each session is assigned.

In /proc/scsi_tgt each group is represented as a "groups/GROUP_NAME/"
subdirectory. In it there are files "devices" and "names". File
"devices" lists devices and their LUNs in the group, file "names" lists
names of initiators which are allowed to access devices in this group.

To configure access and device visibility management SCST supports the
following commands, which are written to files under /proc/scsi_tgt:

  - "add_group GROUP" to /proc/scsi_tgt/scsi_tgt adds group "GROUP"

  - "del_group GROUP" to /proc/scsi_tgt/scsi_tgt deletes group "GROUP"

  - "add H:C:I:L lun [READ_ONLY]" to /proc/scsi_tgt/groups/GROUP/devices adds
    the device with host:channel:id:lun with LUN "lun" to group "GROUP".
    Optionally, the device can be marked as read only. The recommended
    way to find out H:C:I:L numbers is to use the lsscsi utility.

  - "del H:C:I:L" to /proc/scsi_tgt/groups/GROUP/devices deletes the device
    with host:channel:id:lun from group "GROUP". The recommended way to
    find out H:C:I:L numbers is to use the lsscsi utility.

  - "add V_NAME lun [READ_ONLY]" to /proc/scsi_tgt/groups/GROUP/devices adds
    the device with virtual name "V_NAME" with LUN "lun" to group "GROUP".
    Optionally, the device can be marked as read only.

  - "del V_NAME" to /proc/scsi_tgt/groups/GROUP/devices deletes the device
    with virtual name "V_NAME" from group "GROUP"

  - "clear" to /proc/scsi_tgt/groups/GROUP/devices clears the list of devices
    for group "GROUP"

  - "add NAME" to /proc/scsi_tgt/groups/GROUP/names adds name "NAME" to group
    "GROUP". For NAME you can use simple DOS-type patterns containing
    the '*' and '?' symbols. '*' matches any sequence of symbols, '?'
    matches exactly one symbol. For instance, the pattern "bl?h.*" will
    match the name "blah.xxx".

  - "del NAME" to /proc/scsi_tgt/groups/GROUP/names deletes name "NAME" from group
    "GROUP"

  - "clear" to /proc/scsi_tgt/groups/GROUP/names clears the list of names
    for group "GROUP"

Examples:

 - "echo "add 1:0:1:0 0" >/proc/scsi_tgt/groups/Default/devices" will
   add the real device sitting on host 1, channel 0, ID 1, LUN 0 to the
   "Default" group with LUN 0.

 - "echo "add disk1 1" >/proc/scsi_tgt/groups/Default/devices" will
   add the virtual VDISK device with name "disk1" to the "Default" group
   with LUN 1.

 - "echo "add 21:*:e0:?b:83:*" >/proc/scsi_tgt/groups/LAB1/names" will
   add a pattern which matches WWNs of Fibre Channel ports from LAB1.

Suppose you need an iSCSI target with name
"iqn.2007-05.com.example:storage.disk1.sys1.xyz" (you defined it in
iscsi-scst.conf), which should export virtual device "dev1" with LUN 0
and virtual device "dev2" with LUN 1, but the initiator with name
"iqn.2007-05.com.example:storage.disk1.spec_ini.xyz" should see only
virtual device "dev2" with LUN 0. To achieve that, run the following
commands:

# echo "add_group Default_iqn.2007-05.com.example:storage.disk1.sys1.xyz" >/proc/scsi_tgt/scsi_tgt
# echo "add dev1 0" >/proc/scsi_tgt/groups/Default_iqn.2007-05.com.example:storage.disk1.sys1.xyz/devices
# echo "add dev2 1" >/proc/scsi_tgt/groups/Default_iqn.2007-05.com.example:storage.disk1.sys1.xyz/devices

# echo "add_group spec_ini" >/proc/scsi_tgt/scsi_tgt
# echo "add iqn.2007-05.com.example:storage.disk1.spec_ini.xyz" >/proc/scsi_tgt/groups/spec_ini/names
# echo "add dev2 0" >/proc/scsi_tgt/groups/spec_ini/devices

It is highly recommended to use the scstadmin utility instead of the
low-level interface described in this section.

IMPORTANT
=========

There must be a LUN 0 in each security group, i.e. LU numbering must not
start from, e.g., 1. Otherwise you will see no devices on remote
initiators and the SCST core will write the following message into the
kernel log: "tgt_dev for LUN 0 not found, command to unexisting LU?"

IMPORTANT
=========

All the access control must be fully configured BEFORE the corresponding
target driver is loaded! When you load a target driver or enable target
mode in it, as for the qla2x00t driver, it will immediately start
accepting new connections, hence creating new sessions, and those new
sessions will be assigned to security groups according to the
*currently* configured access control settings. For instance, to the
"Default" group, instead of "HOST004" as you may need, because "HOST004"
doesn't exist yet. So, one must configure all the security groups before
new connections from the initiators are created, i.e. before target
drivers are loaded.

Access controls can be altered after the target driver is loaded as long
as the target session doesn't yet exist. And even if the session already
exists, changes are still possible, but won't be reflected on the
initiator side.

So, the safest choice is to configure all the access control before any
target driver is loaded and then only add new devices to new groups for
new initiators or add new devices to old groups, without altering
existing LUNs in them.


VDISK device handler
--------------------

After loading, the VDISK device handler creates the subdirectories
"vdisk" and "vcdrom" in "/proc/scsi_tgt/". They have a similar layout:

  - "trace_level" and "type" files as described for other dev handlers

  - "help" file, which provides online help for VDISK commands

  - "vdisk"/"vcdrom" files, which on read provide information about
    currently open device files. On write they support the following
    commands:

    * "open NAME [PATH] [BLOCK_SIZE] [FLAGS]" - opens file "PATH" as
      device "NAME" with block size "BLOCK_SIZE" bytes with flags
      "FLAGS". "PATH" can be empty only for VDISK CDROM. "BLOCK_SIZE"
      and "FLAGS" are valid only for disk VDISK. The block size must be
      a power of 2 and >= 512 bytes. Default is 512. Possible flags:

      - WRITE_THROUGH - write back caching disabled. Note, this option
        makes sense only if you also *manually* disable the write-back
        cache in *all* your backstorage devices and make sure it's
        actually disabled, since many devices are known to lie about
        this mode to get better benchmark results.

      - READ_ONLY - read only

      - O_DIRECT - both read and write caching disabled. This mode
        isn't currently fully implemented, you should use the user space
        fileio_tgt program in O_DIRECT mode instead (see below).

      - NULLIO - in this mode no real IO will be done, but success will be
        returned. Intended to be used for performance measurements in the
        same way as "*_perf" handlers.

      - NV_CACHE - enables "non-volatile cache" mode. In this mode it is
        assumed that the target has a GOOD UPS with the ability to
        cleanly shut down the target in case of power failure and that
        it is free of software/hardware bugs, i.e. all data from the
        target's cache are guaranteed sooner or later to go to the
        media. Hence all operations that synchronize data with the
        media, like SYNCHRONIZE_CACHE, are ignored in order to gain
        more performance. Also in this mode the target reports to
        initiators that the corresponding device has a write-through
        cache, to disable all write-back cache workarounds used by
        initiators. Use with extreme caution, since in this mode after
        a crash of the target journaled file systems don't guarantee
        consistency after journal recovery, therefore a manual fsck
        MUST be run. Note that since the journal barrier protection
        (see the "IMPORTANT" note below) is usually turned off, enabling
        NV_CACHE could change nothing from the data protection point of
        view, since no operations that synchronize data with the media
        will come from the initiator. This option overrides
        WRITE_THROUGH.

      - BLOCKIO - enables block mode, which will perform direct block
        IO with a block device, bypassing the page cache for all
        operations. This mode works ideally with high-end storage HBAs
        and for applications that either do not need caching between
        application and disk or need large block throughput. See also
        below.

      - REMOVABLE - with this flag set the device is reported to remote
        initiators as removable.

    * "close NAME" - closes device "NAME".

    * "resync_size NAME" - refreshes size of device "NAME". Intended to be
      used after device resize.

    * "change NAME [PATH]" - changes a virtual CD in the VDISK CDROM.

By default, if neither the BLOCKIO nor the NULLIO option is supplied,
FILEIO mode is used.

For example, "echo "open disk1 /vdisks/disk1" >/proc/scsi_tgt/vdisk/vdisk"
will open file /vdisks/disk1 as virtual FILEIO disk with name "disk1".
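
Because an invalid block size is easy to pass by mistake, the "open"
command can be wrapped in a helper that checks the power-of-2 and
>= 512 rule first. This is only a sketch: the function names
valid_block_size and vdisk_open and the SCST_PROC variable are made up
for illustration and are not part of SCST, and the flags argument is
passed through to the command unchanged.

```shell
# Succeed only if $1 is a decimal power of 2 that is >= 512
valid_block_size() {
    bs="$1"
    case "$bs" in
        # Reject empty values, leading zeros and non-digits
        ""|0*|*[!0-9]*) return 1 ;;
    esac
    [ "$bs" -ge 512 ] || return 1
    # Halve while even; a power of 2 ends at exactly 1
    while [ $((bs % 2)) -eq 0 ]; do
        bs=$((bs / 2))
    done
    [ "$bs" -eq 1 ]
}

# Hypothetical helper: open a virtual disk.
# $1 - name, $2 - path, $3 - block size (default 512), $4 - flags
vdisk_open() {
    name="$1"
    path="$2"
    bsize="${3:-512}"
    flags="${4:-}"
    if ! valid_block_size "$bsize"; then
        echo "block size must be a power of 2 and >= 512: $bsize" >&2
        return 1
    fi
    echo "open $name $path $bsize${flags:+ $flags}" \
        >"${SCST_PROC:-/proc/scsi_tgt}/vdisk/vdisk"
}
```

For instance, "vdisk_open disk1 /vdisks/disk1" writes the same command
as the example above, with the default block size of 512 spelled out.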

CAUTION: If you partitioned/formatted your device with block size X, *NEVER*
======== ever try to export and then mount it (even accidentally) with another
         block size. Otherwise you can *instantly* damage it pretty
         badly as well as all your data on it. Messages on the initiator
         like "attempt to access beyond end of device" are a sign of
         such damage.

         Moreover, if you want to compare how well different block sizes
         work for you, you **MUST** EVERY TIME AFTER CHANGING BLOCK SIZE
         **COMPLETELY** **WIPE OFF** ALL THE DATA FROM THE DEVICE. In
         other words, THE **WHOLE** DEVICE **MUST** HAVE ONLY **ZEROS**
         AS THE DATA AFTER YOU SWITCH TO A NEW BLOCK SIZE. Switching
         block sizes isn't like switching between FILEIO and BLOCKIO:
         after changing the block size all data previously written with
         another block size MUST BE ERASED. Otherwise you will see a
         full set of very weird behaviors, because block addressing
         will have changed, but initiators in most cases will have no
         way to detect that old addresses written on the device in,
         e.g., the partition table no longer refer to what they were
         intended to refer to.

IMPORTANT: By default for performance reasons VDISK FILEIO devices use write
=========  back caching policy. This is generally safe from the consistence of

           journaled file systems, laying over them, point of view, but
	   your unsaved cached data will be lost in case of
	   power/hardware/software failure, so you must supply your
	   target server with some kind of UPS or disable write back
	   caching using WRITE_THROUGH flag.
	   Note that file system journaling over devices with write
	   back caching enabled works reliably *ONLY* if the order of
	   journal writes is guaranteed or some kind of data protection
	   barriers are used (i.e. after writing journal data some
	   kind of synchronization with media operations is used).
	   Otherwise, because of possible reordering in the cache, even
	   after a successful journal rollback you very much risk
	   losing your data on the FS. Currently, the Linux IO subsystem
	   guarantees the order of write operations only when data
	   protection barriers are used. Some info about it from the
	   XFS point of view can be found at
	   http://oss.sgi.com/projects/xfs/faq.html#wcache. On Linux
	   initiators, barrier protection for the EXT3 and ReiserFS
	   file systems can be turned on using the "barrier=1" and
	   "barrier=flush" mount options respectively. Note that it is
	   usually turned off by default (see
	   http://lwn.net/Articles/283161). You can check whether it is
	   turned on or off by looking in /proc/mounts. Windows and,
	   AFAIK, other UNIX'es don't need any special explicit options
	   and do necessary barrier actions on write-back caching
	   devices by default. Also note that on some real-life
	   workloads write through caching might perform better than
	   write back caching with the barrier protection turned on.
	   You should also understand that without barriers enabled
	   (i.e. by default) Linux doesn't guarantee that after
	   sync()/fsync() all written data have really hit permanent
	   storage. They can be stored only in the cache of your
	   backstorage device and be lost on a power failure. Thus, even
	   with write-through cache mode, you still either need to
	   enable barriers on your backend file system on the target
	   (for devices used directly this is, indeed, impossible), or
	   need a good UPS to protect yourself from data loss (note,
	   data loss, not file system corruption).
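
Checking /proc/mounts for the barrier option, as suggested above, can be
sketched like this. The function name barrier_option is hypothetical,
and the second argument exists only so the parser can be tried on a
saved copy of the mounts table.

```shell
# Hypothetical helper: print the barrier-related mount option of a
# mount point, or "not listed" if the mounts table shows none.
barrier_option() {
    mount_point=$1
    mounts_file=${2:-/proc/mounts}
    awk -v mp="$mount_point" '
        $2 == mp {
            # Field 4 holds the comma-separated mount options.
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] ~ /^(no)?barrier/) { print opts[i]; found = 1 }
        }
        END { if (!found) print "not listed" }
    ' "$mounts_file"
}
```

For example, "barrier_option /mnt/data" prints the option for the file
system mounted on /mnt/data (a placeholder path). "not listed" only
means that the kernel default applies; what that default is depends on
the file system and kernel version.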

IMPORTANT: Some disk and partition table management utilities don't support
=========  block sizes >512 bytes, therefore make sure that your favorite one
           supports them. Currently only cfdisk is known to work only with
	   512-byte blocks; other utilities like fdisk on Linux or the
	   standard disk manager on Windows are proven to work well with
	   non-512-byte blocks. Note that if you export a disk file or
	   device with a block size different from the one with which
	   it was already partitioned, you could get various weird
	   things like utilities hanging or other unexpected behavior.
	   Hence, to be sure, zero the exported file or device before
	   the first access to it from the remote initiator with another
	   block size. On a Windows initiator make sure you "Set
	   Signature" in the disk manager on the drive imported from the
	   target before doing any other partitioning on it. After you
	   have successfully mounted a file system over a device with a
	   non-512-byte block size, the block size stops mattering: any
	   program will work with files on such a file system.


BLOCKIO VDISK mode
------------------

This module works best for these types of scenarios:

1) Data that are not aligned to 4K sector boundaries and <4K block sizes
are used, which is normally found in virtualization environments where
operating systems start partitions on odd sectors (Windows and its
sector 63).

2) Large block data transfers normally found in database loads/dumps and
streaming media.

3) Advanced relational database systems that perform their own caching
and prefer or demand direct IO access and, because of the nature of
their data access, can actually see worse performance with
indiscriminate caching.

4) Multiple layers of targets where the secondary and above layers need
to have a consistent view of the primary targets in order to preserve
data integrity, which a page cache backed IO type might not provide
reliably.

Also it has an advantage over FILEIO that it doesn't copy data between
the system cache and the commands data buffers, so it saves a
considerable amount of CPU power and memory bandwidth.

IMPORTANT: Since data in BLOCKIO and FILEIO modes are not consistent between
=========  them, if you try to use a device in both those modes simultaneously,
           you will almost instantly corrupt your data on that device.


Pass-through mode
-----------------

In the pass-through mode (i.e. using the pass-through device handlers
scst_disk, scst_tape, etc) SCSI commands coming from remote initiators
are passed to the local SCSI hardware on the target as is, without any
modifications. Like any other hardware, the local SCSI hardware cannot
handle commands whose amount of data and/or segment count in the
scatter-gather array exceeds certain limits. Therefore, when using the
pass-through mode, note that the maximum number of segments and the
maximum amount of transferred data for each SCSI command on devices on
initiators must not be bigger than the corresponding values of the
corresponding SCSI devices on the target. Otherwise you will see
symptoms like small transfers working well but large ones stalling, and
messages like "Unable to complete command due to SG IO count
limitation" printed in the kernel logs.

You can't control the limit on scatter-gather segments from user space,
but for block devices it is usually sufficient to set
/sys/block/DEVICE_NAME/queue/max_sectors_kb on the initiators to the
same or a lower value than /sys/block/DEVICE_NAME/queue/max_hw_sectors_kb
for the corresponding devices on the target.
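
As a sketch of that adjustment, the helper below lowers the initiator's
max_sectors_kb only when it exceeds the value read on the target. The
function name cap_max_sectors_kb is hypothetical, and the target's
max_hw_sectors_kb has to be read on the target and passed in by hand.

```shell
# Hypothetical helper for the initiator: make sure max_sectors_kb in
# the given queue directory does not exceed the target's limit.
cap_max_sectors_kb() {
    queue_dir=$1       # e.g. /sys/block/DEVICE_NAME/queue on the initiator
    target_hw_kb=$2    # max_hw_sectors_kb as read on the target
    current_kb=$(cat "$queue_dir/max_sectors_kb") || return 1
    if [ "$current_kb" -gt "$target_hw_kb" ]; then
        echo "$target_hw_kb" > "$queue_dir/max_sectors_kb"
    fi
}
```

For example, if the target reports 128 in max_hw_sectors_kb, running
"cap_max_sectors_kb /sys/block/sdb/queue 128" as root on the initiator
leaves smaller values alone and lowers bigger ones to 128 (sdb is a
placeholder device name).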

For non-block devices SCSI commands are usually generated directly by
applications, so, if you experience large transfer stalls, you should
check the documentation for your application on how to limit the
transfer sizes.

Another way to solve this issue is to build SG entries with more than 1
page each. See the following patch as an example:
http://scst.sf.net/sgv_big_order_alloc.diff


User space mode using scst_user dev handler
-------------------------------------------

The user space program fileio_tgt uses the interface of the scst_user
dev handler and lets you see how it works in various modes. Fileio_tgt
provides mostly the same functionality as the scst_vdisk handler, with
the most noticeable difference being that it supports O_DIRECT mode.
O_DIRECT mode is basically the same as BLOCKIO, but also supports files,
so for some loads it could be significantly faster than the regular
FILEIO access. Everything said about BLOCKIO above applies to O_DIRECT
as well. See fileio_tgt's README file for more details.


Performance
-----------

Before doing any performance measurements note that:

I. Performance results depend very much on your type of load, so it is
crucial that you choose the access mode (FILEIO, BLOCKIO, O_DIRECT,
pass-through) which suits your needs best.

II. In order to get the maximum performance you should:

1. For SCST:

 - Disable in Makefile CONFIG_SCST_STRICT_SERIALIZING, CONFIG_SCST_EXTRACHECKS,
   CONFIG_SCST_TRACING, CONFIG_SCST_DEBUG*, CONFIG_SCST_STRICT_SECURITY

 - For pass-through devices enable
   CONFIG_SCST_ALLOW_PASSTHROUGH_IO_SUBMIT_IN_SIRQ.

2. For target drivers:

 - Disable in Makefiles CONFIG_SCST_EXTRACHECKS, CONFIG_SCST_TRACING,
   CONFIG_SCST_DEBUG*

3. For device handlers, including VDISK:

 - Disable in Makefile CONFIG_SCST_TRACING and CONFIG_SCST_DEBUG.


IMPORTANT: Some of the above compilation options are enabled by default in the
=========  SCST SVN, i.e. the development version of SCST is currently optimized
           for development and bug hunting rather than for performance.

You can set the above options, except
CONFIG_SCST_ALLOW_PASSTHROUGH_IO_SUBMIT_IN_SIRQ, to the needed values
using the debug2perf root Makefile target.

4. For other target and initiator software parts:

 - Make sure you have applied all available SCST patches to your kernel,
   especially io_context-2.6.X.patch. If this patch doesn't exist for
   your kernel version, it is strongly recommended to upgrade your
   kernel to a version for which this patch exists.

 - Don't enable debug/hacking features in the kernel, i.e. use them as
   they are by default.

 - The default kernel read-ahead and queuing settings are optimized
   for locally attached disks, therefore they are not optimal if the
   disks are attached remotely (the SCSI target case), which sometimes
   could lead to unexpectedly low throughput. You should increase the
   read-ahead size to at least 512KB or even more on all initiators
   and the target.

   You should also limit the maximum amount of sectors per SCSI
   command on all initiators. This tuning is also recommended on
   targets with large read-ahead values. To do it on Linux, run:

   echo 64 > /sys/block/sdX/queue/max_sectors_kb

   where X is the letter of your device imported from the target,
   like 'b', i.e. sdb.

   To increase read-ahead size on Linux, run:

   blockdev --setra N /dev/sdX

   where N is a read-ahead number in 512-byte sectors and X is a device
   letter like above.

   Note: you need to set the read-ahead setting for device sdX again
   after you change the maximum amount of sectors per SCSI command for
   that device.

   Note2: you need to restart SCST after you change read-ahead settings
   on the target.

 - You may need to increase the number of requests that the OS on the
   initiator sends to the target device. To do it on Linux initiators, run:

   echo 64 > /sys/block/sdX/queue/nr_requests

   where X is a device letter like above.

   You may also experiment with other parameters in /sys/block/sdX
   directory, they also affect performance. If you find the best values,
   please share them with us.

 - On the target use the CFQ IO scheduler. In most cases it has a
   performance advantage over other IO schedulers, sometimes a huge one
   (2+ times aggregate throughput increase).

 - It is recommended to turn the kernel preemption off, i.e. set
   the kernel preemption model to "No Forced Preemption (Server)".

 - XFS looks like the best filesystem on the target to store device
   files, because it allows considerably better linear write throughput
   than ext3.
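
The initiator-side settings from the list above can be combined into one
sketch that applies them in the order required by the notes (sectors
limit first, read-ahead after it). The function names and the example
values are illustrative assumptions, not recommendations from SCST.

```shell
# Convert a read-ahead size in KB into the number of 512-byte sectors
# that "blockdev --setra" expects.
ra_kb_to_sectors() {
    echo $(( $1 * 1024 / 512 ))
}

# Hypothetical helper: tune one imported disk on a Linux initiator.
# Must be run as root.
tune_imported_disk() {
    dev=$1       # device name without /dev/, e.g. sdb
    max_kb=$2    # maximum KB per SCSI command, e.g. 64
    ra_kb=$3     # read-ahead size in KB, e.g. 512
    nr_req=$4    # number of requests, e.g. 64
    queue=/sys/block/$dev/queue
    echo "$max_kb" > "$queue/max_sectors_kb" || return 1
    # Read-ahead has to be set again after max_sectors_kb was changed.
    blockdev --setra "$(ra_kb_to_sectors "$ra_kb")" "/dev/$dev" || return 1
    echo "$nr_req" > "$queue/nr_requests"
}
```

For example, "tune_imported_disk sdb 64 512 64" limits commands to 64KB,
sets a 512KB read-ahead (1024 sectors) and allows 64 requests for sdb.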

5. For hardware on target:

 - Make sure that your target hardware (e.g. target FC or network card)
   and underlying IO hardware (e.g. IO card, like SATA, SCSI or RAID to
   which your disks are connected) don't share the same PCI bus. You can
   check it using the lspci utility. They have to work in parallel, so
   it will be better if they don't compete for the bus. The problem is
   not only in the bandwidth, which they have to share, but also in the
   interaction between cards during that competition. This is very
   important, because in some cases, if target and backend storage
   controllers share the same PCI bus, performance can be 5-10 times
   lower than expected. Moreover, some motherboards (by Supermicro,
   particularly) have serious stability issues if there are several
   high speed devices on the same bus working in parallel. If you have
   no choice but PCI bus sharing, set the PCI latency in the BIOS as
   low as possible.

6. If you use VDISK IO module in FILEIO mode, NV_CACHE option will
provide you the best performance. But when using it, make sure you use a
good UPS with the ability to shut down the target on power failure.

You can find baseline performance numbers in these measurements:
http://lkml.org/lkml/2009/3/30/283.

IMPORTANT: If you use some versions of Windows (at least W2K) on the initiator,
=========  you can't get good write performance for VDISK FILEIO devices with
           the default 512-byte block size. You could get about 10% of the
	   expected performance. This is because of the partition
	   alignment, which is (simplifying) incompatible with how the
	   Linux page cache works, so for each write the corresponding
	   block must be read first. Use a 4096-byte block size for
	   VDISK devices and you will have the expected write
	   performance. Actually, any OS on initiators, not only
	   Windows, will benefit from block size
	   max(PAGE_SIZE, BLOCK_SIZE_ON_UNDERLYING_FS), where PAGE_SIZE
	   is the page size and BLOCK_SIZE_ON_UNDERLYING_FS is the block
	   size of the underlying FS on which the device file is
	   located, or 0 if a device node is used. Both values are from
	   the target. See also the important notes about setting block
	   sizes >512 bytes for VDISK FILEIO devices above.
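
The max(PAGE_SIZE, BLOCK_SIZE_ON_UNDERLYING_FS) rule above can be
written down as a small helper. The function name is hypothetical, and
the result is only the value the formula gives; it is an assumption to
verify that your VDISK version accepts that block size.

```shell
# Hypothetical helper: apply max(PAGE_SIZE, BLOCK_SIZE_ON_UNDERLYING_FS).
# Both arguments are values taken on the target, in bytes.
recommended_block_size() {
    page_size=$1        # e.g. the output of "getconf PAGESIZE"
    fs_block_size=$2    # block size of the underlying FS, 0 for a device node
    if [ "$fs_block_size" -gt "$page_size" ]; then
        echo "$fs_block_size"
    else
        echo "$page_size"
    fi
}
```

For a device node on a target with 4096-byte pages,
recommended_block_size "$(getconf PAGESIZE)" 0 prints 4096.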


In some cases, for instance when working with SSD devices that consume
100% of a single CPU for data transfers in their internal threads,
maximizing IOPS may require assigning dedicated CPUs to those threads
using the Linux CPU affinity facilities. No IRQ processing should be
done on those CPUs. Check that using /proc/interrupts. See the taskset
command and Documentation/IRQ-affinity.txt in your kernel's source tree
for how to assign CPU affinity to tasks and IRQs.

The reason for that is that processing of incoming commands in SIRQ
context might be done on the same CPUs as the SSD devices' threads doing
data transfers. As a result, those threads won't receive all the
processing power of those CPUs and will perform worse.
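
The /proc/interrupts check mentioned above can be sketched as follows.
The function name irqs_seen_on_cpu is hypothetical. The counters are
cumulative since boot, so a nonzero count only shows that the IRQ was
handled on that CPU at some point, not that it still is.

```shell
# Hypothetical helper: list the IRQs whose counter for the given CPU
# (0-based) is nonzero in an interrupts table.
irqs_seen_on_cpu() {
    cpu=$1
    table=${2:-/proc/interrupts}
    # Column 1 is the IRQ label, so CPU n is column n + 2.
    awk -v col="$(( cpu + 2 ))" '
        NR > 1 && $col ~ /^[0-9]+$/ && $col + 0 > 0 {
            sub(/:$/, "", $1)
            print $1
        }
    ' "$table"
}
```

After choosing a CPU that handles no IRQs, a thread can be pinned to it
with, for example, "taskset -pc 3 PID", where 3 is the CPU number and
PID is the thread's ID.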


Work if target's backstorage or link is too slow
------------------------------------------------

Under high I/O load, when your target's backstorage gets overloaded, or
when working over a slow link between initiator and target that can't
serve all the queued commands on time, you can experience I/O stalls or
see abort or reset messages in the kernel log.

First, consider the case of a too slow target backstorage. On some seek
intensive workloads even fast disks or RAIDs, which are able to serve a
continuous data stream at 500+ MB/s, can be as slow as 0.3 MB/s.
Another possible cause can be MD/LVM/RAID on your target, as in
http://lkml.org/lkml/2008/2/27/96 (check the whole thread as well).

Thus, in such situations processing of one or more commands simply takes
too long, hence the initiator decides that they are stuck on the target
and tries to recover. In particular, it is known that the default number
of simultaneously queued commands (48) is sometimes too high if you do
intensive writes from VMware on a target disk which uses LVM in the
snapshot mode. In this case a value like 16 or even 8-10, depending on
your backstorage speed, could be more appropriate.

Unfortunately, SCST currently lacks dynamic I/O flow control, in which
the queue depth on the target is dynamically decreased/increased based
on how slow/fast the backstorage is compared to the target link. So,
there are 6 possible actions which you can take to work around or fix
this issue in this case:

1. Ignore incoming task management (TM) commands. It's fine if there are
not too many of them, so average performance isn't hurt and the
corresponding device isn't getting put offline, i.e. if the backstorage
isn't too slow.

2. Decrease /sys/block/sdX/device/queue_depth on the initiator if it's
Linux (see below how) and/or the SCST_MAX_TGT_DEV_COMMANDS constant in
the scst_priv.h file until you stop seeing incoming TM commands. The
ISCSI-SCST driver also has its own iSCSI specific parameter for that,
see its README file.

To decrease device queue depth on Linux initiators you can run command:

# echo Y >/sys/block/sdX/device/queue_depth

where Y is the new number of simultaneously queued commands and X is
your imported device letter, like 'a' for the sda device. There are no
special limitations on the Y value: it can be any value from 1 to the
possible maximum (usually 32), so start by dividing the current value by
2, i.e. set 16 if /sys/block/sdX/device/queue_depth contains 32.

3. Increase the corresponding timeout on the initiator. For Linux it is
located in
/sys/devices/platform/host*/session*/target*:0:0/*:0:0:1/timeout. It can
be done automatically by a udev rule. For instance, the following
rule will increase it to 300 seconds:

SUBSYSTEM=="scsi", KERNEL=="[0-9]*:[0-9]*", ACTION=="add", ATTR{type}=="0|7|14", ATTR{timeout}="300"

By default, this timeout is 30 or 60 seconds, depending on your distribution.

4. Try to avoid such seek intensive workloads.

5. Increase speed of the target's backstorage.

6. Implement dynamic I/O flow control in SCST. This would be the
ultimate solution. See the "Dynamic I/O flow control" section on the
http://scst.sourceforge.net/contributing.html page for a possible
implementation idea.
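
Action 2 from the list above, halving the queue depth on a Linux
initiator until the TM commands stop, can be sketched as follows. The
function name halve_queue_depth is hypothetical.

```shell
# Hypothetical helper: halve the value stored in a queue_depth file,
# never going below 1, and print the new value.
halve_queue_depth() {
    depth_file=$1    # e.g. /sys/block/sdX/device/queue_depth
    current=$(cat "$depth_file") || return 1
    new=$(( current / 2 ))
    [ "$new" -ge 1 ] || new=1
    echo "$new" > "$depth_file"
    echo "$new"
}
```

Run it as root, for example
"halve_queue_depth /sys/block/sda/device/queue_depth", then watch the
kernel log under load and repeat only if abort or reset messages keep
appearing.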

Next, consider the case of a too slow link between initiator and target,
when the initiator tries to simultaneously push N commands to the target
over it. In this case the time to serve those commands, i.e. to send or
receive data for them over the link, can exceed the timeout for any
single command, hence one or more commands at the tail of the queue
cannot be served within the timeout, so the initiator will decide that
they are stuck on the target and will try to recover.

To work around or fix this issue in this case you can use ways 1, 2, 3, 6
above or (7): increase the speed of the link between target and
initiator. But for some initiator implementations there might be cases
with WRITE commands when the target has no way to detect the issue, so
dynamic I/O flow control will not be able to help. In those cases you
might also need either to decrease the queue depth (way 2) or to
increase the corresponding timeout (way 3) on the initiator(s).
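
For a quick experiment with way 3 without writing a udev rule, the
timeout attribute can be raised by hand. This is a sketch under
assumptions: the function name raise_scsi_timeout is hypothetical, and
the /sys/block/sdX/device/timeout path in the usage note should be
verified on your initiator.

```shell
# Hypothetical helper: raise the value in a SCSI timeout file to
# WANTED seconds, leaving it alone if it is already that high.
raise_scsi_timeout() {
    timeout_file=$1    # timeout attribute of the imported SCSI device
    wanted=$2          # new timeout in seconds
    current=$(cat "$timeout_file") || return 1
    if [ "$current" -lt "$wanted" ]; then
        echo "$wanted" > "$timeout_file"
    fi
}
```

For example, "raise_scsi_timeout /sys/block/sdb/device/timeout 300" run
as root. Unlike a udev rule, a value set this way is not reapplied when
the device is added again.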

Note that logged messages about QUEUE_FULL status are quite different
by nature. This is normal operation, just SCSI flow control in action.
Simply don't enable the "mgmt_minor" logging level or, alternatively, if
you are confident in the worst case performance of your back-end storage
or initiator-target link, you can increase SCST_MAX_TGT_DEV_COMMANDS in
scst_priv.h to 64. Usually initiators don't try to push more commands to
the target.


Credits
-------

Thanks to:

 * Mark Buechler <mark.buechler@gmail.com> for a lot of useful
   suggestions, bug reports and help in debugging.

 * Ming Zhang <mingz@ele.uri.edu> for fixes and comments.

 * Nathaniel Clark <nate@misrule.us> for fixes and comments.

 * Calvin Morrow <calvin.morrow@comcast.net> for testing and useful
   suggestions.

 * Hu Gang <hugang@soulinfo.com> for the original version of the
   LSI target driver.

 * Erik Habbinga <erikhabbinga@inphase-tech.com> for fixes and support
   of the LSI target driver.

 * Ross S. W. Walker <rswwalker@hotmail.com> for the original block IO
   code and Vu Pham <huongvp@yahoo.com> who updated it for the VDISK dev
   handler.

 * Michael G. Byrnes <michael.byrnes@hp.com> for fixes.

 * Alessandro Premoli <a.premoli@andxor.it> for fixes.

 * Nathan Bullock <nbullock@yottayotta.com> for fixes.

 * Terry Greeniaus <tgreeniaus@yottayotta.com> for fixes.

 * Krzysztof Blaszkowski <kb@sysmikro.com.pl> for many fixes and bug reports.

 * Jianxi Chen <pacers@users.sourceforge.net> for fixing problem with
   devices >2TB in size.

 * Bart Van Assche <bart.vanassche@gmail.com> for a lot of help.

 * University of New Hampshire Interoperability Labs (UNH IOL, http://www.iol.unh.edu)
   for UNH-iSCSI project (http://www.iol.unh.edu/consortiums/iscsi/index.html)
   on which interface between SCST core and target drivers was based.

Vladislav Bolkhovitin <vst@vlnb.net>, http://scst.sourceforge.net