Generic SCSI target mid-level for Linux (SCST)
==============================================

Version 0.9.6, XX XXX 200X
--------------------------

SCST is designed to provide a unified, consistent interface between SCSI
target drivers and the Linux kernel, and to simplify target driver
development as much as possible. A detailed description of SCST's features
and internals can be found in the "Generic SCSI Target Middle Level for
Linux" document on SCST's Internet page, http://scst.sourceforge.net.

SCST supports the following I/O modes:

 * Pass-through mode with a one-to-many relationship, i.e. multiple
   initiators can connect to the exported pass-through devices, for
   the following SCSI device types: disks (type 0), tapes (type 1),
   processors (type 3), CDROMs (type 5), MO disks (type 7), medium
   changers (type 8) and RAID controllers (type 0xC)

 * FILEIO mode, which allows files on file systems or block devices to
   be used as virtual, remotely available SCSI disks or CDROMs with the
   benefits of the Linux page cache

 * BLOCKIO mode, which performs direct block IO with a block device,
   bypassing the page cache for all operations. This mode works ideally
   with high-end storage HBAs and for applications that either do not
   need caching between application and disk or need large block
   throughput

 * User space mode using the scst_user device handler, which allows
   virtual SCSI devices to be implemented in user space within the SCST
   environment

 * "Performance" device handlers, which provide, in a pseudo
   pass-through mode, a way to measure performance directly, without the
   overhead of actually transferring data from/to the underlying SCSI
   device

In addition, SCST supports advanced per-initiator access and device
visibility management, so different initiators can see different sets
of devices with different access permissions. See below for details.

This is a quite stable (but still beta) version.

Tested mostly on the "vanilla" 2.6.21.1 kernel from kernel.org.

Installation
------------

First, make sure that the link "/lib/modules/`your_kernel_version`/build"
points to the source code for your currently running kernel.

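The check of the build link can be sketched as a small shell function.
This is only an illustration: check_build_link is a hypothetical helper,
not part of SCST, and the "has a top-level Makefile" test is an assumed
heuristic for "looks like a kernel tree". The optional second argument
exists only so the check can be rehearsed against a scratch directory.

```shell
#!/bin/sh
# Hypothetical helper, not part of SCST: verify that the "build" link for a
# kernel version resolves to a directory that looks like a kernel source or
# build tree (assumed heuristic: it contains a top-level Makefile).
# $1 - kernel version (defaults to the running kernel)
# $2 - modules root (defaults to /lib/modules)
check_build_link() {
    kver="${1:-$(uname -r)}"
    root="${2:-/lib/modules}"
    link="$root/$kver/build"

    if [ ! -d "$link" ]; then
        echo "missing: $link does not resolve to a directory" >&2
        return 1
    fi
    if [ ! -f "$link/Makefile" ]; then
        echo "suspicious: $link has no top-level Makefile" >&2
        return 1
    fi
    echo "ok: $link"
}
```

Calling check_build_link without arguments checks the running kernel. It
does not prove that the tree matches the running kernel's configuration,
so treat a positive result as a sanity check only.
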
Then, since in the mainstream kernels scsi_do_req()/scsi_execute_async()
work in LIFO order instead of the expected and required FIFO, SCST needs
the new functions scsi_do_req_fifo()/scsi_execute_async_fifo() to be added
to the kernel. The patch scst_exec_req_fifo.patch from the "kernel"
directory does that. If it doesn't apply to your kernel, apply it manually;
it only adds one of those functions and nothing more. You don't need to
patch the kernel if you don't need pass-through support or if
STRICT_SERIALIZING is defined during the compilation (see the description
below).

To compile SCST, type 'make scst'. It will build SCST itself and its
device handlers. To install them, type 'make scst_install'. The driver
modules will be installed in '/lib/modules/`your_kernel_version`/extra'.
In addition, scsi_tgt.h, scst_debug.h as well as Module.symvers or
Modules.symvers will be copied to '/usr/local/include/scst'. The first
file contains all of SCST's public data definitions, which are used by
target drivers. The other ones support debug message logging and the
build process.

Then you can load any module by typing 'modprobe module_name'. The names
are:

 - scst - SCST itself
 - scst_disk - device handler for disks (type 0)
 - scst_tape - device handler for tapes (type 1)
 - scst_processor - device handler for processors (type 3)
 - scst_cdrom - device handler for CDROMs (type 5)
 - scst_modisk - device handler for MO disks (type 7)
 - scst_changer - device handler for medium changers (type 8)
 - scst_raid - device handler for storage array controllers (e.g. RAID) (type 0xC)
 - scst_vdisk - device handler for virtual disks (file, device or ISO CD image)
 - scst_user - user space device handler

Then, to see your devices remotely, you need to add them to at least the
"Default" security group (see below for how). By default, no local devices
are seen remotely. There must be a LUN 0 in each security group, i.e. LU
numbering must not start from, e.g., 1.

IMPORTANT: Without the appropriate device handler loaded, the corresponding
=========  devices will be invisible to remote initiators, which could
           lead to holes in the LUN addressing, so automatic device
           scanning by the remote SCSI mid-level might not notice the
           devices. In that case you will have to add them manually via
           'echo "scsi add-single-device A 0 0 B" >/proc/scsi/scsi',
           where A is the host number and B is the LUN.

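When several LUNs have to be added by hand on an initiator, the command
above can be wrapped as follows. This is a minimal sketch: scsi_add_lun is
a hypothetical helper name, channel 0 and ID 0 are taken from the command
above, and the optional third argument exists only for dry runs against
an ordinary file.

```shell
#!/bin/sh
# Run on the initiator. Hypothetical helper, not part of SCST: ask the
# initiator's SCSI mid-level to probe one LUN (channel 0, ID 0) on a host.
# $1 - host number (A)
# $2 - LUN (B)
# $3 - control file (defaults to /proc/scsi/scsi; override for dry runs)
scsi_add_lun() {
    host="$1"
    lun="$2"
    ctl="${3:-/proc/scsi/scsi}"

    # Both arguments must be non-empty decimal numbers.
    for v in "$host" "$lun"; do
        case "$v" in
            ''|*[!0-9]*)
                echo "usage: scsi_add_lun HOST LUN [CTL_FILE]" >&2
                return 2
                ;;
        esac
    done

    echo "scsi add-single-device $host 0 0 $lun" >"$ctl"
}
```

For example, 'for lun in 1 2 3; do scsi_add_lun 2 "$lun"; done' would probe
LUNs 1 to 3 on host 2; the host and LUN numbers here are made up for
illustration.
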
IMPORTANT: Experience shows that people who work with target drivers
=========  maintained outside the SCST tree, like the target driver for
           Infiniband, or who download and use the released versions of
           SCST and target drivers, too often (actually, almost always)
           forget to rebuild their target drivers after upgrading the
           SCST core. Those drivers then crash immediately after loading,
           in a hard-to-trace manner. So, after you have reinstalled the
           SCST core, don't forget to rebuild and reinstall all your
           target drivers, custom dev handlers and necessary user space
           applications.

IMPORTANT: In the current version, simultaneous access to local SCSI
=========  devices via standard high-level SCSI drivers (sd, st, sg,
           etc.) and SCST's target drivers is unsupported. This is
           especially important for commands executed via sg and st that
           change the state of devices and their parameters, because that
           could lead to data corruption. If any such command is
           executed, at least the related device handler(s) must be
           restarted. For block devices, READ/WRITE commands using the
           direct disk handler appear to be safe.

To uninstall, type 'make scst_uninstall'.

If you install the QLA2x00 target driver's source code in this directory,
then you can build, install or uninstall it by typing 'make qla', 'make
qla_install' or 'make qla_uninstall' respectively. For more details
about the QLA2x00 target drivers see their README files.

Device handlers
---------------

Device specific drivers (device handlers) are plugins for SCST, which
help SCST to analyze incoming requests and determine parameters
specific to various types of devices. If an appropriate device handler
for a SCSI device type isn't loaded, SCST doesn't know how to handle
devices of this type, so they will be invisible to remote initiators
(more precisely, the "LUN not supported" sense code will be returned).

In addition to device handlers for real devices, there are VDISK, user
space and "performance" device handlers.

The VDISK device handler works over files on file systems and turns
them into virtual, remotely available SCSI disks or CDROMs. In addition,
it can work directly over a block device, e.g. a local IDE or SCSI disk
or even a disk partition, where there is no file system overhead. Using
block devices, compared to sending SCSI commands directly to the SCSI
mid-level via scsi_do_req()/scsi_execute_async(), has the advantage that
data are transferred via the system cache, so it is possible to fully
benefit from caching and read-ahead performed by Linux's VM subsystem.
The only disadvantage is that in the FILEIO mode there is superfluous
data copying between the cache and SCST's buffers. This issue is going to
be addressed in the next release. Virtual CDROMs are useful for remote
installation. See below for details on how to set up and use the VDISK
device handler.

The SCST user space device handler provides an interface between SCST and
user space, which allows pure user space devices to be created. The
simplest example of where one would want this is writing a VTL: with
scst_user it can be written purely in user space. Another is processing
of the passed data that is too sophisticated for kernel space, like
encrypting them or making snapshots.

"Performance" device handlers for disks, MO disks and tapes skip (pretend
to execute) all READ and WRITE operations in their exec() method, and thus
provide a way to measure link performance directly, without the overhead
of actually transferring data from/to the underlying SCSI device.

NOTE: Since "perf" device handlers don't touch the commands' data buffer
====  on READ operations, it is returned to remote initiators as it was
      allocated, without even being zeroed. Thus, "perf" device handlers
      pose some security risk, so use them with caution.

Compilation options
-------------------

The following compilation options can be commented in/out in the
Makefile:

 - DEBUG - turns on some debugging code, including some logging. Makes
   the driver considerably bigger and slower, and produces a large amount
   of log data.

 - TRACING - turns on the ability to log events. Makes the driver
   considerably bigger and leads to some performance loss.

 - EXTRACHECKS - adds extra validity checks in various places.

 - DEBUG_TM - turns on task management function debugging: on LUN 0 in
   the default access control group some of the commands will be delayed
   for about 60 sec., making the remote initiator send TM functions,
   e.g. ABORT TASK and TARGET RESET. Also set the TM_DBG_GO_OFFLINE symbol
   in the Makefile to 1 if you want the device to eventually become
   completely unresponsive, or to 0 otherwise to keep cycling through
   the ABORT and RESET code. Needs DEBUG turned on.

 - STRICT_SERIALIZING - makes SCST send all commands to the underlying
   SCSI device synchronously, one after another. This makes task
   management more reliable, at the cost of some performance. This is
   mostly relevant for stateful SCSI devices like tapes, where the result
   of a command's execution depends on the device's settings set by
   previous commands. Disk and RAID devices are stateless in most cases.
   The current SCSI core in Linux doesn't allow all commands to be aborted
   reliably if they were sent asynchronously to a stateful device. Turned
   off by default; turn it on if you use stateful device(s) and need as
   much error recovery reliability as possible. As a side effect, no
   kernel patching is necessary.

 - SCST_HIGHMEM - if defined on HIGHMEM systems with 2.6 kernels, it
   allows SCST to use HIGHMEM. This is a very experimental feature, which
   is currently broken and unsupported, since it is unclear whether it
   brings anything valuable beyond some performance hit. Note that
   SCST_HIGHMEM isn't required for HIGHMEM systems, and SCST will work
   fine on them with SCST_HIGHMEM off.

 - SCST_STRICT_SECURITY - if defined, makes SCST zero allocated data
   buffers. Undefining it (the default) considerably improves performance
   and eases CPU load, but could create a security hole (information
   leakage), so enable it if you have strict security requirements.

HIGHMEM kernel configurations are fully supported, but not recommended
for performance reasons, except for scst_user, where they are not
supported, because this module deals with user supplied memory in a
zero-copy manner. Consider changing the VMSPLIT option or using a 64-bit
system configuration instead.

To change the VMSPLIT option (CONFIG_VMSPLIT to be precise), set the
following variables in "make menuconfig":

 - General setup->Configure standard kernel features (for small systems): ON

 - Processor type and features->High Memory Support: OFF

 - Processor type and features->Memory split: according to the amount of
   memory you have. If it is less than 800MB, you don't need to touch this
   option at all.

Module parameters
-----------------

Module scst supports the following parameters:

 - scst_threads - sets the number of SCST's threads. By default it is
   the CPU count.

 - scst_max_cmd_mem - sets the maximum amount of memory in Mb allowed to
   be consumed by the SCST commands for data buffers at any given time.
   By default it is approximately TotalMem/4.

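If you want to set both parameters explicitly at load time, the modprobe
command line can be derived from the machine's memory size. The following
is a sketch under stated assumptions: scst_modprobe_cmd is a hypothetical
helper, "Mb" is taken to mean megabytes, and the MemTotal line of
/proc/meminfo is used as the total memory figure. The helper only prints
the command; it does not load anything.

```shell
#!/bin/sh
# Hypothetical helper, not part of SCST: print a modprobe command line for
# the scst module with explicit scst_threads and scst_max_cmd_mem values.
# $1 - number of SCST threads
# $2 - divisor applied to total memory (4 means "one quarter"), default 4
# $3 - meminfo file (defaults to /proc/meminfo; override for dry runs)
scst_modprobe_cmd() {
    threads="$1"
    divisor="${2:-4}"
    meminfo="${3:-/proc/meminfo}"

    # MemTotal is reported in kB; convert to MB, then take the fraction.
    total_kb=$(awk '/^MemTotal:/ { print $2 }' "$meminfo")
    max_mb=$(( total_kb / 1024 / divisor ))

    echo "modprobe scst scst_threads=$threads scst_max_cmd_mem=$max_mb"
}
```

Review the printed line before running it as root. With a divisor of 4 the
result mirrors the documented default only approximately.
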
SCST "/proc" commands
---------------------

For communication with user space programs SCST provides a proc-based
interface in the "/proc/scsi_tgt" directory. It contains the following
entries:

 - "help" file, which provides online help for SCST commands

 - "scsi_tgt" file, which on read provides information about the devices
   served by SCST and their dev handlers. On write it supports the
   following command:

   * "assign H:C:I:L HANDLER_NAME" assigns dev handler "HANDLER_NAME"
     to the device with host:channel:id:lun

 - "sessions" file, which lists currently connected initiators (open sessions)

 - "sgv" file, which provides some statistics about the block sizes of
   commands coming from remote initiators and how effective sgv_pool is
   in serving those allocations from the cache, i.e. without memory
   allocation requests to the kernel. "Size" is the command's data size
   rounded up to a power of 2, "Hit" is the number of allocations served
   from the cache, "Total" is the total number of allocations.

 - "threads" file, which allows the number of SCST's threads to be read
   and set

 - "version" file, which shows the version of SCST

 - "trace_level" file, which allows the trace (logging) level for SCST to
   be read and set. See the "help" file for the list of trace levels.

Each dev handler has its own subdirectory. Most dev handlers have only two
files in this subdirectory: "trace_level" and "type". The first one is
similar to the main SCST "trace_level" file, the latter one shows the SCSI
type number of this handler as well as some text description.

For example, "echo "assign 1:0:1:0 dev_disk" >/proc/scsi_tgt/scsi_tgt"
will assign device handler "dev_disk" to the real device sitting on host 1,
channel 0, ID 1, LUN 0.

Access and devices visibility management (LUN masking)
------------------------------------------------------

Access and devices visibility management allows an initiator or a
group of initiators to have a different, limited set of LUs/LUNs (a
security group), each with appropriate access permissions. An initiator
is represented as an SCST session. A session is bound to a security group
at its registration time by the character "name" parameter of the
registration function, which is provided by the target driver based on
its internal authentication. For example, for FC "name" could be the WWN
or just the loop ID. For iSCSI this could be the iSCSI login credentials
or the iSCSI initiator name. Each security group has a set of names
assigned to it by the system administrator. A session is bound to the
security group that contains the provided name. If no such group is found,
the session is bound to either the "Default_target_name" or the "Default"
group, depending on whether "Default_target_name" exists or not. In
"Default_target_name", "target_name" stands for the name of the target.

In /proc/scsi_tgt each group is represented as a "groups/GROUP_NAME/"
subdirectory. In it there are files "devices" and "names". File
"devices" lists all devices and their LUNs in the group, file "names"
lists all names that should be bound to this group.

To configure access and devices visibility management SCST provides the
following commands, which are written to files under /proc/scsi_tgt:

 - "add_group GROUP" to /proc/scsi_tgt/scsi_tgt adds group "GROUP"

 - "del_group GROUP" to /proc/scsi_tgt/scsi_tgt deletes group "GROUP"

 - "add H:C:I:L lun [READ_ONLY]" to /proc/scsi_tgt/groups/GROUP/devices adds
   the device with host:channel:id:lun as LUN "lun" in group "GROUP".
   Optionally, the device can be marked as read only.

 - "del H:C:I:L" to /proc/scsi_tgt/groups/GROUP/devices deletes the device
   with host:channel:id:lun from group "GROUP"

 - "add V_NAME lun [READ_ONLY]" to /proc/scsi_tgt/groups/GROUP/devices adds
   the device with virtual name "V_NAME" as LUN "lun" in group "GROUP".
   Optionally, the device can be marked as read only.

 - "del V_NAME" to /proc/scsi_tgt/groups/GROUP/devices deletes the device
   with virtual name "V_NAME" from group "GROUP"

 - "clear" to /proc/scsi_tgt/groups/GROUP/devices clears the list of devices
   for group "GROUP"

 - "add NAME" to /proc/scsi_tgt/groups/GROUP/names adds name "NAME" to group
   "GROUP"

 - "del NAME" to /proc/scsi_tgt/groups/GROUP/names deletes name "NAME" from
   group "GROUP"

 - "clear" to /proc/scsi_tgt/groups/GROUP/names clears the list of names
   for group "GROUP"

Examples:

 - "echo "add 1:0:1:0 0" >/proc/scsi_tgt/groups/Default/devices" will
   add the real device sitting on host 1, channel 0, ID 1, LUN 0 to the
   "Default" group as LUN 0.

 - "echo "add disk1 1" >/proc/scsi_tgt/groups/Default/devices" will
   add the virtual VDISK device with name "disk1" to the "Default" group
   as LUN 1.

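The group commands above can be combined into one step that creates a
group, binds an initiator name to it and exports a device into it. This is
a minimal sketch: scst_export_to_group and the SCST_PROC variable are
hypothetical names introduced here, and no error recovery (e.g. removing a
half-configured group) is attempted.

```shell
#!/bin/sh
# Root of the SCST proc interface. SCST_PROC is a hypothetical variable that
# exists only so the sketch can be rehearsed against a scratch directory.
SCST_PROC="${SCST_PROC:-/proc/scsi_tgt}"

# Hypothetical helper, not part of SCST: create a security group, bind an
# initiator name to it, and export one device into it as the given LUN.
# $1 - group name
# $2 - initiator name as reported by the target driver
# $3 - device: H:C:I:L for a real device or a virtual device name
# $4 - LUN inside the group (every group needs a LUN 0)
# $5 - optional READ_ONLY
scst_export_to_group() {
    group="$1"; name="$2"; dev="$3"; lun="$4"; ro="$5"

    echo "add_group $group" >"$SCST_PROC/scsi_tgt" || return 1
    echo "add $name" >"$SCST_PROC/groups/$group/names" || return 1
    # ${ro:+ $ro} appends " READ_ONLY" only when the fifth argument is set.
    echo "add $dev $lun${ro:+ $ro}" >"$SCST_PROC/groups/$group/devices"
}
```

For example, 'scst_export_to_group backup host1 disk1 0 READ_ONLY' would
export the virtual device "disk1" read-only as LUN 0 to sessions named
"host1"; the group and initiator names here are made up for illustration.
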
VDISK device handler
--------------------

After loading, the VDISK device handler creates the subdirectories "vdisk"
and "vcdrom" in "/proc/scsi_tgt/". They have a similar layout:

 - "trace_level" and "type" files as described for other dev handlers

 - "help" file, which provides online help for VDISK commands

 - "vdisk"/"vcdrom" files, which on read provide information about the
   currently open device files. On write they support the following
   commands:

   * "open NAME [PATH] [BLOCK_SIZE] [FLAGS]" - opens file "PATH" as
     device "NAME" with block size "BLOCK_SIZE" bytes with flags
     "FLAGS". "PATH" could be empty only for VDISK CDROM. "BLOCK_SIZE"
     and "FLAGS" are valid only for disk VDISK. The block size must be
     a power of 2 and >= 512 bytes. The default is 512. Possible flags:

     - WRITE_THROUGH - write back caching disabled

     - READ_ONLY - read only

     - O_DIRECT - both read and write caching disabled. This mode
       isn't currently fully implemented; you should use the user space
       fileio_tgt program in O_DIRECT mode instead (see below).

     - NULLIO - in this mode no real IO will be done, but success will
       be returned. Intended to be used for performance measurements in
       the same way as the "*_perf" handlers.

     - NV_CACHE - enables "non-volatile cache" mode. In this mode it is
       assumed that the target has a GOOD UPS and is free of
       software/hardware bugs, i.e. all data from the target's cache are
       guaranteed sooner or later to reach the media. Hence all
       operations that synchronize data with the media, like
       SYNCHRONIZE_CACHE, are ignored (which, BTW, violates the SCSI
       standard) in order to gain a bit more performance. Use with
       extreme caution, since in this mode journaled file systems
       don't guarantee consistency after journal recovery following a
       crash of the target, therefore a manual fsck MUST be run. The
       main intent of this mode is to determine the performance impact
       caused by cache synchronization. Note that since the journal
       barrier protection (see "IMPORTANT" below) is usually turned
       off, enabling NV_CACHE could change nothing, since no operations
       that synchronize data with the media will come from the
       initiator.

     - BLOCKIO - enables block mode, which will perform direct block
       IO with a block device, bypassing the page cache for all
       operations. This mode works ideally with high-end storage HBAs
       and for applications that either do not need caching between
       application and disk or need large block throughput. See also
       below.

   * "close NAME" - closes device "NAME".

   * "change NAME [PATH]" - changes a virtual CD in the VDISK CDROM.

For example, "echo "open disk1 /vdisks/disk1" >/proc/scsi_tgt/vdisk/vdisk"
will open file /vdisks/disk1 as a virtual VDISK disk with the name "disk1".

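The "open" and "close" commands can be wrapped so that the block size rule
(a power of 2 and >= 512 bytes) is checked before anything is written.
This is a sketch under assumptions: vdisk_open, vdisk_close and SCST_PROC
are hypothetical names, and because this README does not say how several
flags are separated, the helper accepts at most one flag.

```shell
#!/bin/sh
# Root of the SCST proc interface. SCST_PROC is a hypothetical variable that
# exists only so the sketch can be rehearsed against a scratch directory.
SCST_PROC="${SCST_PROC:-/proc/scsi_tgt}"

# Hypothetical helper, not part of SCST: open a file or block device as a
# virtual disk.
# $1 - device name
# $2 - path to the backing file or block device
# $3 - block size in bytes (default 512)
# $4 - optional single flag, e.g. WRITE_THROUGH or BLOCKIO
vdisk_open() {
    name="$1"; path="$2"; bs="${3:-512}"; flag="$4"

    # The block size must be a power of 2 and at least 512 bytes.
    # For a power of 2, bs & (bs - 1) is zero.
    if [ "$bs" -lt 512 ] || [ $(( bs & (bs - 1) )) -ne 0 ]; then
        echo "invalid block size: $bs" >&2
        return 2
    fi

    echo "open $name $path $bs${flag:+ $flag}" >"$SCST_PROC/vdisk/vdisk"
}

# Hypothetical helper: close a previously opened virtual disk.
vdisk_close() {
    echo "close $1" >"$SCST_PROC/vdisk/vdisk"
}
```

For example, 'vdisk_open disk1 /vdisks/disk1 4096 WRITE_THROUGH' writes
"open disk1 /vdisks/disk1 4096 WRITE_THROUGH" to the vdisk control file.
Remember that an opened device still has to be added to a security group
before initiators can see it.
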
IMPORTANT: By default, for performance reasons, VDISK FILEIO devices use
=========  a write back caching policy. This is generally safe from the
           point of view of the consistency of journaled file systems
           lying over them, but your unsaved cached data will be lost in
           case of a power/hardware/software failure, so you must supply
           your target server with some kind of UPS or disable write back
           caching using the WRITE_THROUGH flag. You should also note
           that file system journaling over devices with write back
           caching enabled works reliably *ONLY* if the order of journal
           writes is guaranteed or it uses some kind of data protection
           barriers (i.e. after writing journal data some kind of
           synchronization with the media is performed). Otherwise,
           because of possible reordering in the cache, even after a
           successful journal rollback you very much risk losing your
           data on the FS. Currently, the Linux IO subsystem guarantees
           the order of write operations only when data protection
           barriers are used. Some info about this from the XFS point of
           view can be found at
           http://oss.sgi.com/projects/xfs/faq.html#wcache. On Linux
           initiators, for the EXT3 and ReiserFS file systems the barrier
           protection can be turned on using the "barrier=1" and
           "barrier=flush" mount options respectively. Note that it is
           usually turned off by default, the status of barrier usage
           isn't reported anywhere in the system logs, and there is no
           way (at least no known one) to find it out on a mounted file
           system. Also note that on some real-life workloads write
           through caching might perform better than write back caching
           with the barrier protection turned on.

IMPORTANT: Many disk and partition table management utilities don't
=========  support block sizes >512 bytes, therefore make sure that your
           favorite one supports it. Also, if you export a disk file or
           device with a block size different from the one with which it
           was already divided into partitions, you could get various
           weird things like utilities hanging or other unexpected
           behavior. Hence, to be sure, zero the exported file or device
           before the first access to it from the remote initiator with
           another block size.

BLOCKIO VDISK mode
------------------

This mode works best for these types of scenarios:

1) Data that are not aligned to 4K sector boundaries and <4K block sizes
are used, which is normally found in virtualization environments where
operating systems start partitions on odd sectors (Windows and its
sector 63).

2) Large block data transfers normally found in database loads/dumps and
streaming media.

3) Advanced relational database systems that perform their own caching
and prefer or demand direct IO access and, because of the nature of
their data access, can actually see worse performance with
indiscriminate caching.

4) Multiple layers of targets where the secondary/tertiary layers need to
have a consistent view of the primary targets in order to preserve data
integrity, which a page cache backed IO type might not provide reliably.

It also has an advantage over FILEIO in that it doesn't copy data between
the system cache and the commands' data buffers, so it saves a
considerable amount of CPU power and memory bandwidth.

IMPORTANT: Since data in BLOCKIO and FILEIO modes are not consistent
=========  between them, if you try to use a device in both those modes
           simultaneously, you will almost instantly corrupt your data
           on that device.

Pass-through mode
-----------------

In the pass-through mode (i.e. using the pass-through device handlers
scst_disk, scst_tape, etc.) SCSI commands coming from remote initiators
are passed to the local SCSI hardware on the target as is, without any
modifications. Like any other hardware, the local SCSI hardware cannot
handle commands whose amount of data and/or segment count in the
scatter-gather array exceeds certain values. Therefore, when using the
pass-through mode you should note that the values for the maximum number
of segments and the maximum amount of transferred data for each SCSI
command on devices on initiators cannot be bigger than the corresponding
values of the corresponding SCSI devices on the target. Otherwise you
will see symptoms like small transfers working well but large ones
stalling, and messages like "Unable to complete command due to SG IO count
limitation" printed in the kernel logs.

You can't control the limit of the scatter-gather segments from user
space, but for block devices it is usually sufficient to set
/sys/block/DEVICE_NAME/queue/max_sectors_kb on the initiators to the same
or a lower value than /sys/block/DEVICE_NAME/queue/max_hw_sectors_kb for
the corresponding devices on the target.

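Since the two values live on different machines, one way to apply the rule
above is to read max_hw_sectors_kb on the target and pass the number to a
small helper on the initiator. The following is only a sketch:
cap_max_sectors_kb is a hypothetical helper, and the optional third
argument exists so it can be tried on a scratch directory first.

```shell
#!/bin/sh
# Run on the initiator. Hypothetical helper, not part of SCST: lower
# max_sectors_kb of an imported device if it exceeds the target's limit.
# $1 - device name on the initiator, e.g. sdb
# $2 - max_hw_sectors_kb as read on the target for the corresponding device
# $3 - sysfs block root (defaults to /sys/block; override for dry runs)
cap_max_sectors_kb() {
    dev="$1"; target_hw_kb="$2"; root="${3:-/sys/block}"
    knob="$root/$dev/queue/max_sectors_kb"

    current=$(cat "$knob") || return 1
    # Only ever lower the value; a smaller existing setting is kept.
    if [ "$current" -gt "$target_hw_kb" ]; then
        echo "$target_hw_kb" >"$knob"
    fi
}
```

On the target, 'cat /sys/block/DEVICE_NAME/queue/max_hw_sectors_kb' gives
the number to pass as the second argument.
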
For non-block devices SCSI commands are usually generated directly by
applications, so if you experience stalls of large transfers, you should
check the documentation of your application for how to limit the transfer
sizes.

User space mode using scst_user dev handler
-------------------------------------------

The user space program fileio_tgt uses the interface of the scst_user dev
handler and shows how it works in various modes. Fileio_tgt provides
mostly the same functionality as the scst_vdisk handler, with the only
exceptions being that it implements O_DIRECT mode and doesn't support
BLOCKIO. O_DIRECT mode is basically the same as BLOCKIO, but also
supports files, so for some loads it could be significantly faster than
regular FILEIO access. Everything said about BLOCKIO above applies to
O_DIRECT as well. When running fileio_tgt, if you don't understand some
of its options, use the defaults for them; those values are the fastest.

Performance
-----------

Before doing any performance measurements note that:

I. Performance results depend very much on your type of load, so it is
crucial that you choose the access mode (FILEIO, BLOCKIO, O_DIRECT,
pass-through) which suits your needs best.

II. In order to get the maximum performance you should:

1. For SCST:

 - Disable in Makefile STRICT_SERIALIZING, EXTRACHECKS, TRACING, DEBUG*,
   SCST_STRICT_SECURITY, SCST_HIGHMEM

2. For target drivers:

 - Disable in Makefiles EXTRACHECKS, TRACING, DEBUG*

3. For device handlers, including VDISK:

 - Disable in Makefile TRACING, DEBUG

 - If your initiator(s) use dedicated virtual SCSI devices exported from
   the target and have at least as much memory as the target, it is
   recommended to use the O_DIRECT option (currently available only
   with the fileio_tgt user space program) or BLOCKIO. With them you
   could get up to a 100% increase in throughput.

IMPORTANT: Some of the compilation options are enabled by default, i.e.
=========  SCST is currently optimized for development and bug hunting
           rather than for performance.

4. For kernel:

 - Don't enable debug/hacking features, i.e. use them as they are by
   default.

 - The default kernel read-ahead and queuing settings are optimized
   for locally attached disks, therefore they are not optimal if the
   disks are attached remotely (the SCSI target case), which sometimes
   could lead to unexpectedly low throughput. You should increase the
   read-ahead size to at least 512KB or even more on all initiators and
   the target.

   You should also limit the maximum number of sectors per SCSI command
   on all initiators. To do it on Linux initiators, run:

   echo "64" > /sys/block/sdX/queue/max_sectors_kb

   where X is the letter of the device imported from the target, like
   'b', i.e. sdb.

   To increase the read-ahead size on Linux, run:

   blockdev --setra N /dev/sdX

   where N is the read-ahead size in 512-byte sectors and X is a device
   letter like above.

   Note: you need to set the read-ahead setting for device sdX again
   after you have changed the maximum number of sectors per SCSI command
   for that device.

 - You may need to increase the number of requests that the OS on the
   initiator sends to the target device. To do it on Linux initiators,
   run

   echo "512" > /sys/block/sdX/queue/nr_requests

   where X is a device letter like above.

   You may also experiment with other parameters in the /sys/block/sdX
   directory; they also affect performance. If you find the best values,
   please share them with us.

 - Use the deadline IO scheduler on the target, with read_expire and
   write_expire increased on all exported devices to 5000 and 20000
   respectively.

 - It is recommended to turn kernel preemption off, i.e. set
   the kernel preemption model to "No Forced Preemption (Server)".

5. For hardware:

 - Make sure that your target hardware (e.g. target FC card) and the
   underlying IO hardware (e.g. an IO card, like SATA, SCSI or RAID, to
   which your disks are connected) sit on different PCI buses. They have
   to work in parallel, so it is better if they don't compete for the
   bus. The problem is not only the bandwidth they have to share, but
   also the interaction between the cards during that competition. In
   some cases it could lead to 5-10 times lower performance than
   expected.

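The kernel-side settings from item 4 can be collected into two small
functions, one for initiators and one for the target. This is a sketch
under assumptions: the function names and the SYS_BLOCK variable are
hypothetical, and the queue/iosched/ location of read_expire and
write_expire is the usual sysfs place for deadline scheduler tunables,
which this README itself does not name. Note the order on the initiator:
max_sectors_kb is written first, because the read-ahead has to be set
again after it changes.

```shell
#!/bin/sh
# Root of the block device sysfs tree. SYS_BLOCK is a hypothetical variable
# that exists only so the sketch can be rehearsed on a scratch directory.
SYS_BLOCK="${SYS_BLOCK:-/sys/block}"

# Convert a read-ahead size in KB to the number of 512-byte sectors that
# "blockdev --setra" expects.
ra_kb_to_sectors() {
    echo $(( $1 * 1024 / 512 ))
}

# Initiator side (hypothetical helper).
# $1 - device, e.g. sdb     $2 - max_sectors_kb, e.g. 64
# $3 - read-ahead in KB     $4 - nr_requests, e.g. 512
tune_initiator() {
    dev="$1"
    # Limit the per-command size first ...
    echo "$2" >"$SYS_BLOCK/$dev/queue/max_sectors_kb" || return 1
    # ... then set read-ahead, since it must be set again after that change.
    blockdev --setra "$(ra_kb_to_sectors "$3")" "/dev/$dev" || return 1
    echo "$4" >"$SYS_BLOCK/$dev/queue/nr_requests"
}

# Target side (hypothetical helper): select the deadline scheduler for an
# exported device and raise its expiry times to the values suggested above.
# $1 - device, e.g. sdc
tune_target_scheduler() {
    dev="$1"
    echo deadline >"$SYS_BLOCK/$dev/queue/scheduler" || return 1
    echo 5000 >"$SYS_BLOCK/$dev/queue/iosched/read_expire" || return 1
    echo 20000 >"$SYS_BLOCK/$dev/queue/iosched/write_expire"
}
```

For example, 'tune_initiator sdb 64 512 512' applies the values mentioned
above with a 512KB read-ahead. These are starting points for experiments,
not verified optimal settings.
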
IMPORTANT: If you use some versions of Windows (at least W2K) on the
=========  initiator, you can't get good write performance for VDISK
           FILEIO devices with the default 512-byte block size. You could
           get about 10% of the expected performance. This is because of
           partition alignment, which is (simplifying) incompatible with
           how the Linux page cache works, so for each write the
           corresponding block must be read first. Use a 4096-byte block
           size for VDISK devices and you will get the expected write
           performance. Actually, any OS on initiators, not only
           Windows, will benefit from the block size
           max(PAGE_SIZE, BLOCK_SIZE_ON_UNDERLYING_FS), where PAGE_SIZE
           is the page size and BLOCK_SIZE_ON_UNDERLYING_FS is the block
           size of the underlying FS on which the device file is located,
           or 0 if a device node is used. Both values are from the
           target. See also the important notes about setting block
           sizes >512 bytes for VDISK FILEIO devices above.

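The max(PAGE_SIZE, BLOCK_SIZE_ON_UNDERLYING_FS) rule can be evaluated on
the target with a few lines of shell. This is a minimal sketch:
recommended_block_size is a hypothetical helper, and the way the two
inputs are obtained in the usage note (getconf and GNU stat) is an
assumption about the tools available on the target.

```shell
#!/bin/sh
# Hypothetical helper, not part of SCST: print
# max(PAGE_SIZE, BLOCK_SIZE_ON_UNDERLYING_FS).
# $1 - page size in bytes
# $2 - block size of the underlying FS in bytes, or 0 for a device node
recommended_block_size() {
    page="$1"; fsblk="$2"
    if [ "$fsblk" -gt "$page" ]; then
        echo "$fsblk"
    else
        echo "$page"
    fi
}
```

On the target the inputs could be obtained with, for example,
'recommended_block_size "$(getconf PAGESIZE)" "$(stat -f -c %S /vdisks/disk1)"'
for a file-backed device, or with 0 as the second argument for a device
node. The result can then be used as BLOCK_SIZE in the VDISK "open"
command.
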
Credits
-------

Thanks to:

 * Mark Buechler <mark.buechler@gmail.com> for a lot of useful
   suggestions, bug reports and help in debugging.

 * Ming Zhang <mingz@ele.uri.edu> for fixes and comments.

 * Nathaniel Clark <nate@misrule.us> for fixes and comments.

 * Calvin Morrow <calvin.morrow@comcast.net> for testing and useful
   suggestions.

 * Hu Gang <hugang@soulinfo.com> for the original version of the
   LSI target driver.

 * Erik Habbinga <erikhabbinga@inphase-tech.com> for fixes and support
   of the LSI target driver.

 * Ross S. W. Walker <rswwalker@hotmail.com> for the original block IO
   code and Vu Pham <huongvp@yahoo.com> who updated it for the VDISK dev
   handler.

 * Michael G. Byrnes <michael.byrnes@hp.com> for fixes.

 * Alessandro Premoli <a.premoli@andxor.it> for fixes.

 * Nathan Bullock <nbullock@yottayotta.com> for fixes.

 * Terry Greeniaus <tgreeniaus@yottayotta.com> for fixes.

Vladislav Bolkhovitin <vst@vlnb.net>, http://scst.sourceforge.net