diff --git a/scst/README_in-tree b/scst/README_in-tree deleted file mode 100644 index 837df1428..000000000 --- a/scst/README_in-tree +++ /dev/null @@ -1,2703 +0,0 @@ -Generic SCSI target mid-level for Linux (SCST) -============================================== - -SCST is designed to provide unified, consistent interface between SCSI -target drivers and Linux kernel and simplify target drivers development -as much as possible. Detail description of SCST's features and internals -could be found on its Internet page http://scst.sourceforge.net. - -SCST supports the following I/O modes: - - * Pass-through mode with one to many relationship, i.e. when multiple - initiators can connect to the exported pass-through devices, for - the following SCSI devices types: disks (type 0), tapes (type 1), - processors (type 3), CDROMs (type 5), MO disks (type 7), medium - changers (type 8) and RAID controllers (type 0xC). - - * FILEIO mode, which allows to use files on file systems or block - devices as virtual remotely available SCSI disks or CDROMs with - benefits of the Linux page cache. - - * BLOCKIO mode, which performs direct block IO with a block device, - bypassing page-cache for all operations. This mode works ideally with - high-end storage HBAs and for applications that either do not need - caching between application and disk or need the large block - throughput. - - * User space mode using scst_user device handler, which allows to - implement in the user space high performance virtual SCSI - devices. Comparing with fully in-kernel dev handlers this mode has - very low overhead (few %%). - - * "Performance" device handlers, which provide in pseudo pass-through - mode a way for direct performance measurements without overhead of - actual data transferring from/to underlying SCSI device. 
- -In addition, SCST supports advanced per-initiator access and device -visibility management, so different initiators can see different sets -of devices with different access permissions. See below for details. - -A full list of SCST features and a comparison with other Linux targets -can be found at http://scst.sourceforge.net/comparison.html. - - -Installation ------------- - -To see your devices remotely, you need to add a corresponding LUN for -them (see below how). By default, no local devices are seen remotely. -There must be LUN 0 in each LUNs set (security group), i.e. LU -numbering must not start from, e.g., 1. Otherwise you will see no -devices on remote initiators and SCST core will write into the kernel -log message: "tgt_dev for LUN 0 not found, command to unexisting LU?" - -It is highly recommended to use scstadmin utility for configuring -devices and security groups. - -The flow of SCST initialization should be as follows: - -1. Load the SCST modules with necessary module parameters, if needed. - -2. Configure targets, devices, LUNs, etc. using either scstadmin -(recommended), or the sysfs interface directly as described below. - -If you experience problems during modules load or running, check your -kernel logs (or run dmesg command for the few most recent messages). - -IMPORTANT: Without loading appropriate device handler, corresponding devices -========= will be invisible for remote initiators, which could lead to holes - in the LUN addressing, so automatic device scanning by remote SCSI - mid-level could not notice the devices. Therefore you will have - to add them manually via - 'echo "- - -" >/sys/class/scsi_host/hostX/scan', - where X - is the host number. - -IMPORTANT: Working of target and initiator on the same host is -========= supported, except the following 2 cases: swap over target exported - device and using a writable mmap over a file from target - exported device. 
The latter means you can't mount a file - system over target exported device. In other words, you can - freely use any sg, sd, st, etc. devices imported from target - on the same host, but you can't mount file systems or put - swap on them. This is a limitation of Linux memory/cache - manager, because in this case a memory allocation deadlock is - possible like: system needs some memory -> it decides to - clear some cache -> the cache is needed to be written on a - target exported device -> initiator sends request to the - target located on the same system -> the target needs memory - -> the system needs even more memory -> deadlock. - -IMPORTANT: In the current version simultaneous access to local SCSI devices -========= via standard high-level SCSI drivers (sd, st, sg, etc.) and - SCST's target drivers is unsupported. Especially it is - important for execution via sg and st commands that change - the state of devices and their parameters, because that could - lead to data corruption. If any such command is done, at - least related device handler(s) must be restarted. For block - devices READ/WRITE commands using direct disk handler are - generally safe. - - -Usage in failover mode ----------------------- - -It is recommended to use TEST UNIT READY ("tur") command to check if -SCST target is alive in MPIO configurations. - - -Device handlers ---------------- - -Device specific drivers (device handlers) are plugins for SCST, which -help SCST to analyze incoming requests and determine parameters, -specific to various types of devices. If an appropriate device handler -for a SCSI device type isn't loaded, SCST doesn't know how to handle -devices of this type, so they will be invisible for remote initiators -(more precisely, "LUN not supported" sense code will be returned). - -In addition to device handlers for real devices, there are VDISK, user -space and "performance" device handlers. 
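Since devices stay invisible to remote initiators until a matching device handler is loaded, it can help to check which handlers are currently registered. The following is only a minimal sketch: it assumes the "handlers" root subdirectory under /sys/kernel/scst_tgt described later in this document, and scst_list_handlers is a hypothetical helper name, not part of SCST.

```shell
# Print the name of every dev handler subdirectory found under
# $1/handlers. $1 defaults to the SCST sysfs root.
# NOTE: scst_list_handlers is an illustrative name, not an SCST tool.
scst_list_handlers() {
    local root="${1:-/sys/kernel/scst_tgt}" h

    for h in "$root"/handlers/*/; do
        # Skip the unexpanded pattern when no handler is loaded.
        [ -d "$h" ] || continue
        h="${h%/}"
        echo "${h##*/}"
    done
}
```

If the expected handler is missing from the output, the devices of that SCSI type will not be exported until the handler module is loaded.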
- -VDISK device handler works over files on file systems and turns them -into virtual remotely available SCSI disks or CDROMs. In addition, it -can work directly over a block device, e.g. local IDE or SCSI disk -or even a disk partition, where there is no file system overhead. -Compared to sending SCSI commands directly to SCSI mid-level via -scsi_do_req()/scsi_execute_async(), using block devices has the -advantage that data is transferred via the system cache, so it is -possible to fully benefit from caching and read ahead performed by -Linux's VM subsystem. The only disadvantage is that in the FILEIO mode -there is superfluous data copying between the cache and SCST's buffers. -This issue is going to be addressed in one of the future releases. -Virtual CDROMs are useful for remote installation. See below for -details how to set up and use VDISK device handler. - -"Performance" device handlers for disks, MO disks and tapes in their -exec() method skip (pretend to execute) all READ and WRITE operations -and thus provide a way for direct link performance measurements without -overhead of actual data transferring from/to underlying SCSI device. - -NOTE: Since "perf" device handlers on READ operations don't touch the -==== commands' data buffer, it is returned to remote initiators as it - was allocated, without even being zeroed. Thus, "perf" device - handlers impose some security risk, so use them with caution. - - -Compilation options -------------------- - -There are the following compilation options, which can be changed using -your favorite kernel configuration Makefile target, e.g. "make xconfig": - - - CONFIG_SCST_DEBUG - if defined, turns on some debugging code, - including some logging. Makes the driver considerably bigger and slower, - producing a large amount of log data. - - - CONFIG_SCST_TRACING - if defined, turns on ability to log events. Makes the - driver considerably bigger and leads to some performance loss. 
- - - CONFIG_SCST_EXTRACHECKS - if defined, adds extra validity checks in - the various places. - - - CONFIG_SCST_USE_EXPECTED_VALUES - if not defined (default), initiator - supplied expected data transfer length and direction will be used - only for verification purposes to return error or warn in case if one - of them is invalid. Instead, locally decoded from SCSI command values - will be used. This is necessary for security reasons, because - otherwise a faulty initiator can crash target by supplying invalid - value in one of those parameters. This is especially important in - case of pass-through mode. If CONFIG_SCST_USE_EXPECTED_VALUES is - defined, initiator supplied expected data transfer length and - direction will override the locally decoded values. This might be - necessary if internal SCST commands translation table doesn't contain - SCSI command, which is used in your environment. You can know that if - you enable "minor" trace level and have messages like "Unknown - opcode XX for YY. Should you update scst_scsi_op_table?" in your - kernel log and your initiator returns an error. Also report those - messages in the SCST mailing list scst-devel@lists.sourceforge.net. - Note, that not all SCSI transports support supplying expected values. - You should try to enable this option if you have a not working with - SCST pass-through device, for instance, an SATA CDROM. - - - CONFIG_SCST_DEBUG_TM - if defined, turns on task management functions - debugging, when on LUN 6 some of the commands will be delayed for - about 60 sec., so making the remote initiator send TM functions, eg - ABORT TASK and TARGET RESET. Also define - CONFIG_SCST_TM_DBG_GO_OFFLINE symbol in the Makefile if you want that - the device eventually become completely unresponsive, or otherwise to - circle around ABORTs and RESETs code. Needs CONFIG_SCST_DEBUG turned - on. 
- - - CONFIG_SCST_STRICT_SERIALIZING - if defined, makes SCST send all commands to - underlying SCSI device synchronously, one after one. This makes task - management more reliable, with cost of some performance penalty. This - is mostly actual for stateful SCSI devices like tapes, where the - result of command's execution depends from device's settings defined - by previous commands. Disk and RAID devices are stateless in the most - cases. The current SCSI core in Linux doesn't allow to abort all - commands reliably if they sent asynchronously to a stateful device. - Turned off by default, turn it on if you use stateful device(s) and - need as much error recovery reliability as possible. As a side effect - of CONFIG_SCST_STRICT_SERIALIZING, on kernels below 2.6.30 no kernel - patching is necessary for pass-through device handlers (scst_disk, - etc.). - - - CONFIG_SCST_TEST_IO_IN_SIRQ - if defined, allows SCST to submit selected - SCSI commands (TUR and READ/WRITE) from soft-IRQ context (tasklets). - Enabling it will decrease amount of context switches and slightly - improve performance. The goal of this option is to be able to measure - overhead of the context switches. If after enabling this option you - don't see under load in vmstat output on the target significant - decrease of amount of context switches, then your target driver - doesn't submit commands to SCST in IRQ context. For instance, - iSCSI-SCST doesn't do that, but qla2x00t with - CONFIG_QLA_TGT_DEBUG_WORK_IN_THREAD disabled - does. This option is - designed to be used with vdisk NULLIO backend. - - WARNING! Using this option enabled with other backend than vdisk - NULLIO is unsafe and can lead you to a kernel crash! - - - CONFIG_SCST_STRICT_SECURITY - if defined, makes SCST zero allocated data - buffers. Undefining it (default) considerably improves performance - and eases CPU load, but could create a security hole (information - leakage), so enable it, if you have strict security requirements. 
- - - CONFIG_SCST_ABORT_CONSIDER_FINISHED_TASKS_AS_NOT_EXISTING - if defined, - in case when TASK MANAGEMENT function ABORT TASK is trying to abort a - command, which has already finished, remote initiator, which sent the - ABORT TASK request, will receive TASK NOT EXIST (or ABORT FAILED) - response for the ABORT TASK request. This is more logical response, - since, because the command finished, attempt to abort it failed, but - some initiators, particularly VMware iSCSI initiator, consider TASK - NOT EXIST response as if the target got crazy and try to RESET it. - Then sometimes get crazy itself. So, this option is disabled by - default. - - - CONFIG_SCST_DIF_INJECT_CORRUPTED_TAGS - if defined, allows injection - of corrupted DIF tags according to the Oracle specification. This - functionality is working only if dif_mode doesn't contain dev_store - and dif_type is 1. - - - CONFIG_SCST_FORWARD_MODE_PASS_THROUGH - if defined, the pass-through - subsystem starts working in the forwarding mode, where reservation - commands processed locally and not passed to the backend SCSI device, - while COMPARE AND WRITE, EXTENDED COPY and RECEIVE COPY RESULTS - commands, which normally processed locally by the SCST core, not - processed locally, but passed to the backend device. Intended to be - used to implement NON-OPTIMIZED ALUA state together with "forwarding" - target attribute on the remote node. See below for more details. - Disabled by default for safety. - - - CONFIG_SCST_NO_TOTAL_MEM_CHECKS - disables checks of allocated - memory, see scst_max_cmd_mem below. Allows to avoid 2 global - variables on the fast path, hence get better multi-queue performance. - -HIGHMEM kernel configurations are fully supported, but not recommended -for performance reasons. - - -Module parameters ------------------ - -Module scst supports the following parameters: - - - scst_threads - allows to set count of SCST's threads. By default it - is CPU count. 
- - - scst_max_cmd_mem - sets maximum amount of memory in MB allowed to be - consumed by the SCST commands for data buffers at any given time. By - default it is approximately TotalMem/4. - - - auto_cm_assignment - enables the copy managers auto registration. - If a device is not registered in the copy manager, it can not be - source or target of EXTENDED COPY commands. Enabled by default. - Disable, if you want to manually control the copy manager - registration or need to change a device, e.g. a DM cache device, with - SCST LUN on top of it to avoid extra reference the copy manager holds - on this device. In the later case you can also remove this reference - by manually deleting the corresponding copy manager LUN via sysfs interface - (/sys/kernel/scst_tgt/targets/copy_manager/copy_manager_tgt/luns/mgmt). - - -SCST sysfs interface --------------------- - -SCST sysfs interface designed to be self descriptive and self -containing. This means that a high level management tool for it can be -written once and automatically support any future sysfs interface -changes (attributes additions or removals, new target drivers and dev -handlers, etc.) without any modifications. Scstadmin is an example of -such management tool. - -To implement that an management tool should not be implemented around -drivers and their attributes, but around common rules those drivers and -attributes follow. You can find those rules in SysfsRules file. For -instance, each SCST sysfs file (attribute) can contain in the last line -mark "[key]". It is automatically added to allow scstadmin and other -management tools to see which attributes it should save in the config -file. If you are doing manual attributes manipulations, you can ignore -this mark. - -Root of SCST sysfs interface is /sys/kernel/scst_tgt. 
It has the -following entries: - - - devices - this is a root subdirectory for all SCST devices - - - handlers - this is a root subdirectory for all SCST dev handlers - - - max_tasklet_cmd - specifies how many commands at max can be queued in - the SCST core simultaneously on a single CPU from all connected - initiators to allow processing commands on this CPU in soft-IRQ - context in tasklets. If the count of the commands exceeds this value, - then all of them will be processed only in SCST threads. This is to - prevent possible starvation, under heavy load, of processes on the - CPUs serving soft IRQs and in some cases to improve performance by - more evenly spreading load over available CPUs. - - - sgv - this is a root subdirectory for all SCST SGV caches - - - targets - this is a root subdirectory for all SCST targets - - - setup_id - allows to read and write SCST setup ID. This ID can be - used in cases, when the same SCST configuration should be installed - on several targets, but devices exported from those targets should - have different IDs and SNs. For instance, VDISK dev handler uses this - ID to generate T10 vendor specific identifier and SN of the devices. - - - poll_us - if polling is desired, sets how many microseconds each SCST - thread polls its queue after it became empty in a hope that a new - command can come. In some cases, polling can significantly increase - IOPS, especially if CPU low power states are not disabled, because on - high IOPS polling could be cheaper compared to spending significant - time on entering, then exiting CPU low power states + corresponding - context switches. Disabled, i.e. set to 0, by default. - - - suspend - globally suspends or releases all SCSI activities on all - devices. Useful for mass management, like adding or deleting LUNs. - Writing a value v to it: - - * v > 0 - suspends activities, but waits no more than v seconds - - * v = 0 - suspends activities, waits indefinitely - - * v < 0 - releases activities. 
- - Reading from this attribute returns the number of previous suspend - requests. - - - threads - allows to read and set the number of global SCST I/O threads. - Those threads are used with async. dev handlers, for instance, vdisk - BLOCKIO or NULLIO. - - - trace_cmds - shows current SCST commands up to size of the sysfs - buffer (4KB) - - - trace_mcmds - shows current SCST management commands up to size of - the sysfs buffer (4KB) - - - trace_level - allows to enable and disable various tracing - facilities. See content of this file for help how to use it. See also - section "Dealing with massive logs" for more info how to make correct - logs when you enabled trace levels producing a lot of logs data. - - - version - read-only attribute, which allows to see version of - SCST and enabled optional features. - - - force_global_sgv_pool - if not set, buffers for SCSI commands are - allocated from per-CPU SGV pool. Otherwise, global SGV pool is used. - - - last_sysfs_mgmt_res - read-only attribute returning completion status - of the last management command. In the sysfs implementation there are - some problems between internal sysfs and internal SCST locking. To - avoid them in some cases sysfs calls can return error with errno - EAGAIN. This doesn't mean the operation failed. It only means that - the operation is queued and not yet completed. To wait for it to - complete, a management tool should poll this file. If the operation - hasn't yet completed, it will also return EAGAIN. But after it's - completed, it will return the result of this operation (0 for success - or -errno for error). The following two shell functions show how to do - this: - -# Read the SCST sysfs attribute $1. See also scst/README for more information. 
-scst_sysfs_read() { - local EAGAIN val - - EAGAIN="Resource temporarily unavailable" - while true; do - if val="$(LC_ALL=C cat "$1" 2>&1)"; then - echo -n "${val%\[key\]}" - return 0 - elif [ "${val/*: }" != "$EAGAIN" ]; then - return 1 - fi - sleep 1 - done -} - -# Write $1 into the SCST sysfs attribute $2. See also scst/README for more -# information. -scst_sysfs_write() { - local EAGAIN status - - EAGAIN="Resource temporarily unavailable" - if status="$(LC_ALL=C; (echo -n "$1" > "$2") 2>&1)"; then - return 0 - elif [ "${status/*: }" != "$EAGAIN" ]; then - return 1 - fi - scst_sysfs_read /sys/kernel/scst_tgt/last_sysfs_mgmt_res >/dev/null -} - -"Devices" subdirectory contains subdirectories for each SCST devices. - -Content of each device's subdirectory is dev handler specific. See -documentation for your dev handlers for more info about it as well as -SysfsRules file for more info about common to all dev handlers rules. -SCST dev handlers can have the following common entries: - - - block - allows to temporary block and unblock this device. See below. - - - exported - subdirectory containing links to all LUNs where this - device was exported. - - - handler - if dev handler determined for this device, this link points - to it. The handler can be not set for pass-through devices. - - - threads_num - shows and allows to set number of threads in this device's - threads pool. If 0 - no threads will be created, and global SCST - threads pool will be used. If <0 - creation of the threads pool is - prohibited. - - - threads_pool_type - shows and allows to sets threads pool type. - Possible values: "per_initiator" and "shared". When the value is - "per_initiator" (default), each session from each initiator will use - separate dedicated pool of threads. When the value is "shared", all - sessions from all initiators will share the same per-device pool of - threads. Valid only if threads_num attribute >0. 
- - - dump_prs - allows to dump persistent reservations information in the - kernel log. - - - type - SCSI type of this device - - - max_tgt_dev_commands - maximum number of SCSI commands any session to - this device can have in flight. - - - numa_node_id - NUMA node id this device physically belongs to. SCST - NUMA handling assumes that being used in the system NUMA memory - allocation policy is to always allocate from the current node. - -Attribute "block" allows to temporary block and unblock this device. -"Blocking" means that no new commands for this device will go into the -execution stage, but instead will be suspended just before it. The -blocked state is not reached until queue of the corresponding device is -completely drained. You can also call this state "frozen". It is useful -in many cases, like consistent snapshots and graceful shutdown. - -On write "block" entry allows the following 3 types of parameters: - - - 1 - block device synchronously, i.e. don't return until this device - becomes blocked, i.e. until queue of it is not completely drained. Can - be called as many times as needed. - - - 11 params - block device asynchronously, i.e. return immediately. - Notification about completing is delivered using SCST_EVENT_EXT_BLOCKING_DONE - event. "Params" delivered to it as is in "data" payload. Can be - called as many times as needed. Alternatively, status of blocking could be - polled by reading this attributes until the second number reaches 0 - (see below). - - - 0 - unblock this device. - -Reading from "block" entry returns two numbers separated by space: - -1. How many times this device was blocked, i.e. how many times writing -"0" to it is needed to unblock this device. - -2. Boolean (0 or 1) if blocking, if any, is done (0) or still pending (1). - -See below for more information about other entries of this subdirectory -of the standard SCST dev handlers. - -"Handlers" subdirectory contains subdirectories for each SCST dev -handler. 
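As a rough illustration of the "block" attribute described above, the sketch below wraps an arbitrary command (for instance, taking a snapshot) between a synchronous block and the matching unblock, and checks the pending flag after an asynchronous request. Both function names are hypothetical helpers, not part of SCST. The device directory argument is assumed to be a subdirectory of /sys/kernel/scst_tgt/devices, and the EAGAIN handling shown earlier for sysfs writes is omitted for brevity.

```shell
# Run a command while the SCST device whose sysfs directory is $1 is
# blocked. NOTE: scst_run_blocked is an illustrative name only.
scst_run_blocked() {
    local dev_dir="$1" rc

    shift
    # Writing 1 blocks synchronously: the write returns only after the
    # device queue has been drained.
    echo 1 > "$dev_dir/block" || return 1
    "$@"
    rc=$?
    # Writing 0 undoes one block request.
    echo 0 > "$dev_dir/block"
    return "$rc"
}

# Return 0 if the second number read from "block" (the pending flag)
# is 0, i.e. an asynchronous block request has completed.
scst_block_done() {
    local count pending

    read -r count pending < "$1/block" || [ -n "$count" ] || return 1
    [ "$pending" = "0" ]
}
```

Because every successful block request needs its own write of 0, the wrapper unblocks exactly once, even when the wrapped command fails.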
- -Content of each handler's subdirectory is dev handler specific. See -documentation for your dev handlers for more info about it as well as -SysfsRules file for more info about common to all dev handlers rules. -SCST dev handlers can have the following common entries: - - - mgmt - this entry allows to create virtual devices and their - attributes (for virtual devices dev handlers) or assign/unassign real - SCSI devices to/from this dev handler (for pass-through dev - handlers). - - - trace_level - allows to enable and disable various tracing - facilities. See content of this file for help how to use it. See also - section "Dealing with massive logs" for more info how to make correct - logs when you enabled trace levels producing a lot of logs data. - - - type - SCSI type of devices served by this dev handler. - -See below for more information about other entries of this subdirectory -of the standard SCST dev handlers. - -"Sgv" subdirectory contains statistic information of SCST SGV caches. It -has the following entries: - - - None, one or more subdirectories for each existing SGV cache. - - - global_stats - file containing global SGV caches statistics. - -Each SGV cache's subdirectory has the following item: - - - stats - file containing statistics for this SGV caches. - -"Targets" subdirectory contains subdirectories for each SCST target. - -Content of each target's subdirectory is target specific. See -documentation for your target for more info about it as well as -SysfsRules file for more info about common to all targets rules. -Every target should have at least the following entries: - - - ini_groups - subdirectory, which contains and allows to define - initiator-oriented access control information, see below. - - - luns - subdirectory, which contains list of available LUNs in the - target-oriented access control and allows to define it, see below. - - - sessions - subdirectory containing connected to this target sessions. 
- - - comment - this attribute can be used to store any human readable info - to help identify target. For instance, to help identify the target's - mapping to the corresponding hardware port. It isn't anyhow used by - SCST. - - - enabled - using this attribute you can enable or disable this target. - It allows to finish configuring it before it starts accepting new - connections. 0 by default. - - - addr_method - used LUNs addressing method. Possible values: - "Peripheral", "Flat" or "LUN". Most initiators work well with - Peripheral addressing method (default), but some (HP-UX, for instance) - may require the Flat method or the LUN method (e.g. IBM systems). This - attribute is also available in the initiators security groups, so you - can assign the addressing method on per-initiator basis. See also the - "Logical unit addressing (LUN)" section in SAM-5 for more information. - - - black_hole - if set, all LUNs in the corresponding initiator group, - default target group in this case, start "swallowing" requests from - initiators. Possible values are: - - * 0 - disable black hole mode - - * 1 - immediately abort all coming SCSI commands, i.e. all SCSI commands - are dropped and TM requests return that they completed. It is - supposed to simulate lost front end responses. - - * 2 - immediately abort all coming SCSI commands and drop all coming TM - commands. It is supposed to simulate logical target hang, when the - target stops responding, but on the HW/TCP connection level still - appears to be online. - - * 3 - immediately abort all coming data transfer SCSI commands, i.e. - only data transfer SCSI commands are dropped, while commands like - INQUIRY and TEST UNIT READY pass well. It is supposed to simulate - flaky front end connectivity, when responses for small commands - pass well, but big data transfers fail. - - * 4 - immediately abort all coming data transfer SCSI commands and - drop all coming TM commands. 
It is supposed to simulate really - flaky front end connectivity, when TM requests or responses are - also lost. - - Modes 3 and 4 are the most evil ones, because they are not too well - handled by many initiator OS'es, including Linux, so they may never - recover from it. - - Note, dropping TM commands, i.e. not sending response on them, - implemented not for all target drivers. If it's implemented for your - particular target driver or not, you can find out by checking traces - or the target driver's source code. - - - dif_capabilities - if this target supports T10-PI, returns which - exact DIF capabilities this target supports. - - - dif_checks_failed - if this target supports T10-PI, returns - statistics how many DIF errors have been detected on the - corresponding processing stages on this target. It returns 3 rows of - numbers with 3 numbers in each row: for target driver stage, for SCST - stage and for dev handler stage. Numbers in each row: how many errors - detected checking application, reference and guard tags - correspondingly. Writing to this attribute resets the numbers. - - - cpu_mask - defines CPU affinity mask for threads serving this target. - For threads serving LUNs it is used only for devices with - threads_pool_type "per_initiator". - - - io_grouping_type - defines how I/O from sessions to this target are - grouped together. This I/O grouping is very important for - performance. By setting this attribute in a right value, you can - considerably increase performance of your setup. This grouping is - performed only if you use CFQ I/O scheduler on the target and for - devices with threads_num >= 0 and, if threads_num > 0, with - threads_pool_type "per_initiator". Possible values: - "this_group_only", "never", "auto", or I/O group number >0. When the - value is "this_group_only" all I/O from all sessions in this target - will be grouped together. When the value is "never", I/O from - different sessions will not be grouped together, i.e. 
all sessions in - this target will have separate dedicated I/O groups. When the value - is "auto" (default), all I/O from initiators with the same name - (iSCSI initiator name, for instance) in all targets will be grouped - together with a separate dedicated I/O group for each initiator name. - For iSCSI this mode works well, but other transports usually use - different initiator names for different sessions, so using such - transports in MPIO configurations you should either use value - "this_group_only", or an explicit I/O group number. This attribute is - also available in the initiators security groups, so you can assign - the I/O grouping on per-initiator basis. See below for more info how - to use this attribute. - - - rel_tgt_id - allows to read or write SCSI Relative Target Port - Identifier attribute. This identifier is used to identify SCSI Target - Ports by some SCSI commands, mainly by Persistent Reservations - commands. This identifier must be unique among all SCST targets, but - for convenience SCST allows disabled targets to have not unique - rel_tgt_id. In this case SCST will not allow to enable this target - until rel_tgt_id becomes unique. This attribute initialized unique by - SCST by default. - - - forwarding - if set this target is forwarding target, i.e. does not check - any local SCSI events (reservations, etc.). Those event supposed to - be checked on the another, requester's side. - - - *count*, e.g. read_io_count_kb, - statistics about executed - commands and transferred data. Those attributes have speaking names - built from parts: - - 1. Data transfer direction - - 2. Alignment type: not specified or unaligned (on 4K boundaries) - - 3. Type: IO (commands) count or amount of transferred data - - 4. For transferred data: measurement units - - For instance, read_unaligned_cmd_count means number of 4K unaligned IOs. 
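The naming scheme of the statistics attributes can be decoded mechanically. The following sketch splits an attribute name into the four parts listed above. It only knows the *_cmd_count and *_io_count_kb suffixes that appear in this document, and scst_describe_counter is a hypothetical helper name.

```shell
# Describe a statistics attribute name such as read_unaligned_cmd_count.
# NOTE: scst_describe_counter is an illustrative name, not an SCST tool.
scst_describe_counter() {
    local name="$1" dir align="unspecified" kind units="none"

    # 1. Data transfer direction: everything before the first underscore.
    dir="${name%%_*}"
    # 2. Alignment type: only marked when the name says "unaligned".
    case "$name" in
        *_unaligned_*) align="unaligned" ;;
    esac
    # 3. and 4. Counter type and, for transferred data, its units.
    case "$name" in
        *_cmd_count)   kind="commands" ;;
        *_io_count_kb) kind="data"; units="KB" ;;
        *) return 1 ;;
    esac
    echo "direction=$dir alignment=$align type=$kind units=$units"
}
```

For example, "scst_describe_counter read_io_count_kb" reports a read direction, unspecified alignment and an amount of data measured in KB.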
- -A target driver may have also the following entries: - - - "hw_target" - if the target driver supports both hardware and virtual - targets (for instance, an FC adapter supporting NPIV, which has - hardware targets for its physical ports as well as virtual NPIV - targets), this read only attribute for all hardware targets will - exist and contain value 1. - -Subdirectory "sessions" contains one subdirectory for each connected -session with name equal to name of the connected initiator with the -following entries: - - - initiator_name - contains initiator name - - - force_close - optional write-only attribute, which allows to force - close this session. - - - active_commands - contains number of active, i.e. not yet or being - executed, SCSI commands in this session. - - - commands - contains overall number of SCSI commands in this session. - - - dif_checks_failed - if target of this session supports T10-PI, returns - statistics how many DIF errors have been detected on the - corresponding processing stages on all DIF-enabled LUNs in this - session. It returns 3 rows of numbers with 3 numbers in each row: for - target driver stage, for SCST stage and for dev handler stage. - Numbers in each row: how many errors detected checking application, - reference and guard tags correspondingly. Writing to this attribute - resets the numbers. Similar statistics returned in attribute with the - same name for each LUN in this session in this LUN's subdirectory, if - its device configured with dif_type > 0. 
-
- - read_cmd_count - number of READ SCSI commands received since beginning
-   or last reset (writing 0 to this attribute)
-
- - read_io_count_kb - amount of data in KB read by the initiator since
-   beginning or last reset (writing 0 to this attribute)
-
- - write_cmd_count - number of WRITE SCSI commands received since
-   beginning or last reset (writing 0 to this attribute)
-
- - write_io_count_kb - amount of data in KB written by the initiator
-   since beginning or last reset (writing 0 to this attribute)
-
- - bidi_cmd_count - number of BIDI SCSI commands received since
-   beginning or last reset (writing 0 to this attribute)
-
- - bidi_io_count_kb - amount of data in KB transferred by the
-   initiator since beginning or last reset (writing 0 to this attribute)
-
- - none_cmd_count - number of SCSI commands that transfer no data
-   (e.g. INQUIRY or TEST UNIT READY) received since beginning or last
-   reset (writing 0 to this attribute)
-
- - unknown_cmd_count - number of unknown SCSI commands received since
-   beginning or last reset (writing 0 to this attribute)
-
- - *count*, e.g. read_io_count_kb - statistics about executed
-   commands and transferred data. See above for more details.
-
- - luns - a link pointing to the LUN set (security group) that this
-   session is attached to.
-
- - One or more "lunX" subdirectories, where 'X' is a number, for each LUN
-   this session has (see below).
-
- - other target driver specific attributes and subdirectories.
-
-See the description of the VDISK's sysfs interface below for samples.
-
-
-Each sessions/initiator_name/lunX subdirectory contains the following
-entries:
-
- - active_commands - contains the number of active, i.e. not yet executed
-   or currently being executed, SCSI commands for LUN X in session
-   initiator_name.
-
- - thread_pid - contains a single line with all the process identifiers
-   (PIDs) of the kernel threads that process SCSI commands intended for
-   LUN X in session initiator_name.
-
- - thread_index - thread index assigned by scst_add_threads().
- Can be used to look up which export thread is serving which target - since this index also appears in the export thread name. This - information then could be used to set CPU affinity for those threads - to improve performance. Has a value in the range 0..n-1 for - threads_pool_type per_initiator or -1 when using a shared thread pool - per LUN or the global thread pool. - - -Access and devices visibility management (LUN masking) ------------------------------------------------------- - -Access and devices visibility management allows for an initiator or -group of initiators to see different devices with different LUNs -with necessary access permissions. - -SCST supports two modes of access control: - -1. Target-oriented. In this mode you define for each target a default -set of LUNs, which are accessible to all initiators, connected to that -target. This is a regular access control mode, which people usually mean -thinking about access control in general. For instance, in IET this is -the only supported mode. - -2. Initiator-oriented. In this mode you define which LUNs are accessible -for each initiator. In this mode you should create for each set of one -or more initiators, which should access to the same set of devices with -the same LUNs, a separate security group, then add to it devices and -names of allowed initiator(s). - -Both modes can be used simultaneously. In this case the -initiator-oriented mode has higher priority, than the target-oriented, -i.e. initiators are at first searched in all defined security groups for -this target and, if none matches, the default target's set of LUNs is -used. This set of LUNs might be empty, then the initiator will not see -any LUNs from the target. - -You can at any time find out which set of LUNs each session is assigned -to by looking where link -/sys/kernel/scst_tgt/targets/target_driver/target_name/sessions/initiator_name/luns -points to. 
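The lookup described above can be scripted. The following is a minimal sketch, not part of SCST or scstadmin: the function name and the SCST_ROOT variable are assumptions made here for illustration, and only the path layout comes from the text above.

```shell
#!/bin/sh
# Print the LUN set (target default or security group) a session is attached to.
# SCST_ROOT is an assumption of this sketch: it defaults to the real sysfs root
# and can be pointed at a scratch directory for dry runs.
SCST_ROOT="${SCST_ROOT:-/sys/kernel/scst_tgt}"

scst_session_luns() {
    # $1 = target driver, $2 = target name, $3 = initiator (session) name
    link="$SCST_ROOT/targets/$1/$2/sessions/$3/luns"
    [ -L "$link" ] || { echo "no such session link: $link" >&2; return 1; }
    # Resolve the symbolic link to the directory it points to.
    readlink -f "$link"
}
```

If the printed path ends in ".../target_name/luns", the session uses the target's default set of LUNs; if it ends in ".../ini_groups/GROUP/luns", the session was matched by security group GROUP.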
- -To configure the target-oriented access control SCST provides the -following interface. Each target's sysfs subdirectory -(/sys/kernel/scst_tgt/targets/target_driver/target_name) has "luns" -subdirectory. This subdirectory contains the list of already defined -target-oriented access control LUNs for this target as well as file -"mgmt". This file has the following commands, which you can send to it, -for instance, using "echo" shell command. You can always get a small -help about supported commands by looking inside this file. "Parameters" -are one or more param_name=value pairs separated by ';'. - - - "add H:C:I:L lun [parameters]" - adds a pass-through device with - host:channel:id:lun with LUN "lun". Optionally, the device could be - marked as read only by using parameter "read_only". The recommended - way to find out H:C:I:L numbers is use of lsscsi utility. - - - "replace H:C:I:L lun [parameters]" - replaces by pass-through device - with host:channel:id:lun existing with LUN "lun" device with - generation of INQUIRY DATA HAS CHANGED Unit Attention. If the old - device doesn't exist, this command acts as the "add" command. - Optionally, the device could be marked as read only by using - parameter "read_only". The recommended way to find out H:C:I:L - numbers is use of lsscsi utility. - - - "add VNAME lun [parameters]" - adds a virtual device with name VNAME - with LUN "lun". Optionally, the device could be marked as read only - by using parameter "read_only". - - - "replace VNAME lun [parameters]" - replaces by virtual device - with name VNAME existing with LUN "lun" device with generation of - INQUIRY DATA HAS CHANGED Unit Attention. If the old device doesn't - exist, this command acts as the "add" command. Optionally, the device - could be marked as read only by using parameter "read_only". 
-
- - "del lun" - deletes LUN lun
-
- - "clear" - clears the list of devices
-
-To configure the initiator-oriented access control SCST provides the
-following interface. Each target's sysfs subdirectory
-(/sys/kernel/scst_tgt/targets/target_driver/target_name) has an "ini_groups"
-subdirectory. This subdirectory contains the list of already defined
-security groups for this target as well as the file "mgmt". This file
-accepts the following commands, which you can send to it, for instance,
-using the "echo" shell command. You can always get a short help text about
-the supported commands by reading this file.
-
- - "create GROUP_NAME" - creates a new security group.
-
- - "del GROUP_NAME" - deletes an existing security group.
-
-Each security group's subdirectory contains 2 subdirectories, initiators
-and luns, as well as the following attributes: addr_method, cpu_mask,
-io_grouping_type and black_hole. See the description of them above.
-
-Each "initiators" subdirectory contains the list of initiators added to
-this group as well as the file "mgmt". This file accepts the following
-commands, which you can send to it, for instance, using the "echo" shell
-command. You can always get a short help text about the supported
-commands by reading this file.
-
- - "add INITIATOR_NAME" - adds initiator with name INITIATOR_NAME to the
-   group.
-
- - "del INITIATOR_NAME" - deletes initiator with name INITIATOR_NAME
-   from the group.
-
- - "move INITIATOR_NAME DEST_GROUP_NAME" - moves initiator with name
-   INITIATOR_NAME from the current group to the group with name
-   DEST_GROUP_NAME.
-
- - "clear" - deletes all initiators from this group.
-
-For the "add" and "del" commands INITIATOR_NAME can be a simple DOS-type
-pattern containing '*' and '?' symbols. '*' matches any sequence of
-symbols, '?' matches exactly one symbol. For instance, the name
-"blah.xxx" matches the pattern "bl?h.*". Additionally, you can use the
-negative sign '!' to invert the result of the pattern. For instance, the
-name "ah.xxx" matches the pattern "!bl?h.*".
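To check a pattern before adding it to a group, the matching rules above can be approximated in the shell. This is only a sketch under stated assumptions: shell globbing stands in for SCST's own matcher, the function name is made up for illustration, and details such as case sensitivity or the handling of '[' characters may differ from what SCST does.

```shell
#!/bin/sh
# Return 0 if initiator name $2 matches DOS-type pattern $1, treating a
# leading '!' as negation. Shell globbing approximates SCST's matcher
# (an assumption of this sketch, not SCST's actual code).
scst_pattern_matches() {
    pattern=$1 name=$2 negate=0
    case $pattern in
        '!'*) negate=1; pattern=${pattern#?} ;;   # strip the leading '!'
    esac
    # An unquoted variable in a case pattern is interpreted as a glob.
    case $name in
        $pattern) matched=1 ;;
        *)        matched=0 ;;
    esac
    # Succeed when exactly one of "matched" and "negate" is set.
    [ "$matched" -ne "$negate" ]
}
```

For example, "scst_pattern_matches 'bl?h.*' blah.xxx" and "scst_pattern_matches '!bl?h.*' ah.xxx" both succeed, mirroring the two examples in the text.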
- -Each "luns" subdirectory contains the list of already defined LUNs for -this group as well as file "mgmt". Content of this file as well as list -of available in it commands is fully identical to the "luns" -subdirectory of the target-oriented access control. - -Examples: - - - echo "create INI" >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.vlnb:tgt1/ini_groups/mgmt - - creates security group INI for target iqn.2006-10.net.vlnb:tgt1. - - - echo "add 2:0:1:0 11" >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.vlnb:tgt1/ini_groups/INI/luns/mgmt - - adds a pass-through device sitting on host 2, channel 0, ID 1, LUN 0 - to group with name INI as LUN 11. - - - echo "add disk1 0" >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.vlnb:tgt1/ini_groups/INI/luns/mgmt - - adds a virtual disk with name disk1 to group with name INI as LUN 0. - - - echo "add 21:*:e0:?b:83:*" >/sys/kernel/scst_tgt/targets/21:00:00:a0:8c:54:52:12/ini_groups/INI/initiators/mgmt - - adds a pattern to group with name INI to Fibre Channel target with - WWN 21:00:00:a0:8c:54:52:12, which matches WWNs of Fibre Channel - initiator ports. - -Consider you need to have an iSCSI target with name -"iqn.2007-05.com.example:storage.disk1.sys1.xyz", which should export -virtual device "dev1" with LUN 0 and virtual device "dev2" with LUN 1, -but initiator with name -"iqn.2007-05.com.example:storage.disk1.spec_ini.xyz" should see only -virtual device "dev2" read only with LUN 0. 
To achieve that you should
-run the following commands:
-
-# echo "add_target iqn.2007-05.com.example:storage.disk1.sys1.xyz" >/sys/kernel/scst_tgt/targets/iscsi/mgmt
-# echo "add dev1 0" >/sys/kernel/scst_tgt/targets/iscsi/iqn.2007-05.com.example:storage.disk1.sys1.xyz/luns/mgmt
-# echo "add dev2 1" >/sys/kernel/scst_tgt/targets/iscsi/iqn.2007-05.com.example:storage.disk1.sys1.xyz/luns/mgmt
-# echo "create SPEC_INI" >/sys/kernel/scst_tgt/targets/iscsi/iqn.2007-05.com.example:storage.disk1.sys1.xyz/ini_groups/mgmt
-# echo "add dev2 0 read_only=1" \
-	>/sys/kernel/scst_tgt/targets/iscsi/iqn.2007-05.com.example:storage.disk1.sys1.xyz/ini_groups/SPEC_INI/luns/mgmt
-# echo "add iqn.2007-05.com.example:storage.disk1.spec_ini.xyz" \
-	>/sys/kernel/scst_tgt/targets/iscsi/iqn.2007-05.com.example:storage.disk1.sys1.xyz/ini_groups/SPEC_INI/initiators/mgmt
-
-For Fibre Channel or SAS in the above example you should use the target's
-and initiator ports' WWNs instead of iSCSI names.
-
-It is highly recommended to use the scstadmin utility instead of the low
-level interface described in this section.
-
-IMPORTANT
-=========
-
-There must be LUN 0 in each set of LUNs, i.e. LU numbering must not
-start from, e.g., 1. Otherwise you will see no devices on remote
-initiators and SCST core will write into the kernel log the message:
-"tgt_dev for LUN 0 not found, command to unexisting LU?"
-
-IMPORTANT
-=========
-
-All the access control must be fully configured BEFORE the corresponding
-target is enabled. When you enable a target, it will immediately start
-accepting new connections, hence creating new sessions, and those new
-sessions will be assigned to security groups according to the
-*currently* configured access control settings. For instance, a session
-may be assigned to the target's default set of LUNs instead of the
-"HOST004" group you intended, because "HOST004" doesn't exist yet. So,
-you must configure all the security groups before new connections from
-the initiators are created, i.e. before the target is enabled.
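The ordering requirement above (security groups first, enabling last) can be made explicit in a small helper. The sketch below only prints the sysfs writes in a safe order so they can be reviewed or piped to "sh"; it writes nothing itself. The helper names, the SCST_ROOT variable and the use of the target's "enabled" attribute as the final step are assumptions of this sketch.

```shell
#!/bin/sh
# Emit, in a safe order, the sysfs writes that set up one security group
# for a target and only then enable the target. Nothing is written here.
# SCST_ROOT and the "enabled" attribute name are assumptions of this sketch.
SCST_ROOT="${SCST_ROOT:-/sys/kernel/scst_tgt}"

emit() {  # $1 = mgmt command, $2 = path relative to the target directory
    printf 'echo "%s" >%s/%s\n' "$1" "$TGT_DIR" "$2"
}

scst_plan_target() {
    # $1 = driver, $2 = target, $3 = group, $4 = initiator, $5 = device
    TGT_DIR="$SCST_ROOT/targets/$1/$2"
    emit "create $3" "ini_groups/mgmt"
    emit "add $5 0"  "ini_groups/$3/luns/mgmt"        # LUN 0 must exist
    emit "add $4"    "ini_groups/$3/initiators/mgmt"
    # Enabling comes last, after every group has been configured.
    printf 'echo 1 >%s/enabled\n' "$TGT_DIR"
}
```

Calling "scst_plan_target iscsi TARGET HOST004 INITIATOR disk1" prints four commands, with the write to "enabled" as the last one. Names containing shell metacharacters would need quoting before the output is executed.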
- - -VDISK device handler --------------------- - -VDISK has 4 built-in dev handlers: vdisk_fileio, vdisk_blockio, -vdisk_nullio and vcdrom. Roots of their sysfs interface are -/sys/kernel/scst_tgt/handlers/handler_name, e.g. for vdisk_fileio: -/sys/kernel/scst_tgt/handlers/vdisk_fileio. Each root has the following -entries: - - - None, one or more links to devices with name equal to names - of the corresponding devices. - - - trace_level - allows to enable and disable various tracing - facilities. See content of this file for help how to use it. See also - section "Dealing with massive logs" for more info how to make correct - logs when you enabled trace levels producing a lot of logs data. - - - mgmt - main management entry, which allows to add/delete VDISK - devices with the corresponding type. - -The "mgmt" file has the following commands, which you can send to it, -for instance, using "echo" shell command. You can always get a small -help about supported commands by looking inside this file. "Parameters" -are one or more param_name=value pairs separated by ';'. - - - echo "add_device device_name [parameters]" - adds a virtual device - with name device_name and specified parameters (see below) - - - echo "del_device device_name" - deletes a virtual device with name - device_name. - -Handler vdisk_fileio provides FILEIO mode to create virtual devices. -This mode uses as backend files and accesses to them using regular -read()/write() file calls. This allows to use full power of Linux page -cache. The following parameters possible for vdisk_fileio: - - - filename - specifies path and file name of the backend file. The path - must be absolute. - - - blocksize - specifies block size used by this virtual device. The - block size must be power of 2 and >= 512 bytes. Default is 512. - - - write_through - disables write back caching. 
Note, this option
-   makes sense only if you also *manually* disable the write-back cache in
-   *all* your backing storage devices and make sure it's actually disabled,
-   since many devices are known to lie about this mode to get better
-   benchmark results. Default is 0.
-
- - read_only - read only. Default is 0.
-
- - o_direct - disables both read and write caching. This mode isn't
-   currently fully implemented; you should use the user space fileio_tgt
-   program in O_DIRECT mode instead (see below).
-
- - nv_cache - enables "non-volatile cache" mode. In this mode it is
-   assumed that the target has a GOOD UPS with the ability to cleanly
-   shut down the target in case of power failure and that it is free of
-   software/hardware bugs, i.e. all data from the target's cache are
-   guaranteed sooner or later to go to the media. Hence all operations
-   that synchronize data with the media, like SYNCHRONIZE_CACHE, are
-   ignored in order to bring more performance. Also in this mode the
-   target reports to initiators that the corresponding device has a
-   write-through cache to disable all write-back cache workarounds used
-   by initiators. Use with extreme caution, since in this mode after a
-   crash of the target journaled file systems don't guarantee consistency
-   after journal recovery, therefore a manual fsck MUST be run. Note that,
-   since the journal barrier protection (see "IMPORTANT" note below) is
-   usually turned off, enabling NV_CACHE could change nothing from the
-   data protection point of view, since no data synchronization with
-   media operations will come from the initiator. This option overrides
-   the "write_through" option. Disabled by default.
-
- - thin_provisioned - enables the thin provisioning facility, which lets
-   remote initiators unmap blocks of storage if they don't need them
-   anymore. The backend storage must also support this facility.
-
- - tst - allows specifying the TST control mode page field. It specifies
-   the type of task set in the device.
Possible values are: 0 - the
-   device maintains one task set for all I_T nexuses and 1 - the device
-   maintains separate task sets for each I_T nexus. Default - 1.
-
- - removable - with this flag set the device is reported to remote
-   initiators as removable.
-
- - rotational - if set, this device is reported as rotational. Otherwise,
-   it is reported as non-rotational (SSD, etc.)
-
- - zero_copy - if set, then this device uses zero copy access to the
-   page cache. At the moment, only read side zero copy is implemented.
-
- - dif_mode - specifies which T10-PI, or DIF, mode this device will use.
-   See the SCSI standards for more info about T10-PI. Available DIF modes
-   (can be combined using '|'):
-
-   * tgt - DIF tags are checked on the target hardware, if supported
-
-   * scst - DIF tags are checked inside SCST core
-
-   * dev_check - DIF tags are checked inside the backend device. Storing
-     DIF tags is not required, but optionally possible.
-
-   * dev_store - DIF tags are stored inside the backend device on the
-     WRITE path and read from it on the READ path. Checking DIF tags is
-     not required, but optionally possible.
-
-   For instance, if only the tgt DIF mode is specified, then the target
-   driver serving this device will check DIF tags inside the hardware,
-   then STRIP them from SCSI commands on the WRITE path, and generate,
-   then INSERT DIF tags into SCSI commands on the READ path, so neither
-   SCST core nor the dev handler will see them.
-
-   Similarly, if only the scst DIF mode is specified, then the target
-   driver will PASS DIF tags into SCST core, which then
-   checks/STRIPs/generates/INSERTs them, so the dev handler will not see
-   them.
-
-   If only the dev_check DIF mode is specified, then both the target
-   driver and SCST core will PASS DIF tags into the dev handler, which is
-   then responsible for checking them in the backend hardware. If only
-   dev_store is specified, then DIF tags will only be stored by the dev
-   handler in the backend hardware without checking at any level.
- - If all "tgt|scst|dev_check|dev_store" DIF mode specified, then all - target driver, SCST core and dev handler will check DIF tags, then - dev handler will store them in the backend hardware. - - - dif_type - specifies which DIF SCSI type this device will use. - - - dif_static_app_tag - specifies fixed (static) DIF application tag for - this device. - - - dif_filename - specifies full path to filename, where DIF tags will - be stored. - -Handler vdisk_blockio provides BLOCKIO mode to create virtual devices. -This mode performs direct block I/O with a block device, bypassing the -page cache for all operations. This mode works ideally with high-end -storage HBAs and for applications that either do not need caching -between application and disk or need the large block throughput. See -below for more info. - -The following parameters possible for vdisk_blockio: filename, -blocksize, nv_cache, read_only, removable, rotational, thin_provisioned, -tst, dif_mode, dif_type, dif_static_app_tag, dif_filename. See -vdisk_fileio above for description of those parameters. - -vdisk_blockio devices have the following two additional attributes: - -- active - if this flag is set (the default), the backing block device - will be opened when the SCST device is added/opened. If a SCST device - is opened with active=0 then the backing block device will not be - opened, allowing for an active/passive SCST configuration. In addition, - this attribute is writable via sysfs allowing the user to open/close the - backing block device on the fly, or via a script. - -- bind_alua_state - if this flag is set (the default), when the device is - associated with an ALUA device group, and a target group ALUA state - changes to the active/nonoptimized state, the active attribute will be - set to 1 which attempts to open the backing block device. If the target - group ALUA state changes to a value other than active/nonoptimized, the - backing device will be closed (active=0). 
If bind_alua_state=0 for a
-  device the ALUA state changes have NO effect on the active attribute;
-  it is left up to the user to use a script, or to manually set the active
-  attribute, to open/close the backing block device.
-
-Handler vdisk_nullio provides NULLIO mode to create virtual devices. In
-this mode no real I/O is done, but success is returned to initiators.
-It is intended to be used for performance measurements in the same way as
-the "*_perf" handlers. The following parameters are possible for
-vdisk_nullio: blocksize, read_only, removable, tst. See vdisk_fileio above
-for a description of those parameters.
-
-vdisk_nullio devices have the following two additional attributes:
-
- - dummy - if this flag is set, LUNs corresponding to this device will
-   not appear at the initiator side. This is because SCST will set the
-   PERIPHERAL QUALIFIER field to 1 (not connected) and the
-   PERIPHERAL DEVICE TYPE to 0x1f (no device) in the INQUIRY response.
-   See also SPC-4 for more information. It is designed to be used as a
-   "dummy" placeholder on LUN 0, if LUN 0 is not desired.
-
- - read_zero - if this flag is set, reading from a vdisk_nullio device
-   returns a buffer filled with byte 0x00. If this flag is cleared
-   (which is the default behavior), the buffer returned to the
-   initiator is not cleared. Although this results in slightly faster
-   operation, it is a security hole since any data that is present in
-   kernel memory can be returned to the initiator.
-
-Handler vcdrom allows emulation of a virtual CDROM device using an ISO
-file as backend. It has only a single parameter: tst.
-
-For example:
-
-echo "add_device disk1 filename=/disk1; blocksize=4096; nv_cache=1" >/sys/kernel/scst_tgt/handlers/vdisk_fileio/mgmt
-
-will create a FILEIO virtual device disk1 with backend file /disk1,
-a block size of 4K and NV_CACHE enabled.
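The add_device command string can be assembled and sanity-checked before it is written to the handler's "mgmt" file. The sketch below applies the two rules stated for vdisk_fileio above (absolute backend path, block size a power of 2 and >= 512) and joins parameters with ';'. The function names are hypothetical, and SCST performs its own validation regardless.

```shell
#!/bin/sh
# Check a vdisk block size: decimal digits only, a power of two, >= 512.
valid_blocksize() {
    case $1 in ''|*[!0-9]*) return 1 ;; esac
    # n & (n - 1) is zero only for powers of two.
    [ "$1" -ge 512 ] && [ $(( $1 & ($1 - 1) )) -eq 0 ]
}

# Print the mgmt command for a FILEIO device; parameters are joined with ';'.
fileio_add_cmd() {  # $1 = device name, $2 = absolute backend path, $3 = block size
    case $2 in
        /*) ;;
        *)  echo "filename must be an absolute path" >&2; return 1 ;;
    esac
    valid_blocksize "$3" || { echo "bad blocksize: $3" >&2; return 1; }
    printf 'add_device %s filename=%s; blocksize=%s\n' "$1" "$2" "$3"
}
```

A possible use is: fileio_add_cmd disk1 /disk1 4096 >/sys/kernel/scst_tgt/handlers/vdisk_fileio/mgmt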
- -Each vdisk_fileio's device has the following attributes in -/sys/kernel/scst_tgt/devices/device_name: - - - filename - contains path and file name of the backend file. - - - blocksize - contains block size used by this virtual device. - - - write_through - contains status of write back caching of this virtual - device. - - - sync - writing into this attribute causes the page cache contents to - be flushed to disk. - - - read_only - contains read only status of this virtual device. - - - o_direct - contains O_DIRECT status of this virtual device. - - - inq_vend_specific - Vendor specific data that will be reported via - either bytes 36..55 or bytes 96..256 of the INQUIRY response, depending - on whether this field is <= 20 or > 20 bytes long. - - - nv_cache - contains NV_CACHE status of this virtual device. - - - prod_id - PRODUCT IDENTIFICATION as reported via the INQUIRY response. - The default value for this field is the SCST device name. - - - prod_rev_lvl - PRODUCT REVISION LEVEL as reported via the INQUIRY - response. The default value for this field is " 300". - - - scsi_device_name - optional SCSI target device name to which this - SCST device belongs to (in SCSI terminology all SCST devices called - Logical Units). See SPC for more info. - - - tst - contains TST field of SCSI Control mode page. See SPC-4 for - more details about this field. - - - thin_provisioned - contains thin provisioning status of this virtual - device. - - - gen_tp_soft_threshold_reached_UA - for thin provisioned devices - writing of anything into this write-only attribute will generate THIN - PROVISIONING SOFT THRESHOLD REACHED Unit Attention to all connected - to this device initiators. - - - removable - contains removable status of this virtual device. - - - rotational - contains rotational status of this virtual device. - - - size_mb - contains size of this virtual device in MB. 
- - - pr_file_name - Full path of the file or block device in which to store - persistent reservation information. The default value for this attribute is - /var/lib/scst/pr/${device_name}. Writing a new value into this sysfs - attribute is only allowed if the device is not exported. Modifying this - sysfs attribute causes the persistent reservation state to be reloaded. - - - t10_dev_id - contains and allows to set T10 vendor specific - identifier for Device Identification VPD page (0x83) of INQUIRY data. - By default VDISK handler always generates t10_dev_id for every new - created device at creation time based on the device name and - scst_vdisk_ID scst_vdisk.ko module parameter (see below). - Note: some initiators, e.g. VMware's ESXi or MS Hyper-V, only looks - at the first eight characters of t10_dev_id. You have to make sure - that these first eight characters are unique or VMware will consider - these devices as identical. - - - eui64_id - allows to set the EUI-64 based device identifier in the - SCSI device identification VPD page (83h). This identifier must be 8, - 12 or 16 bytes long and must be specified in hexadecimal format (EUI = - Extended Unique Identifier). A leading "0x" is allowed but is not - required. Writing a newline into this attribute discards the EUI-64 - identifier. If neither eui64_id nor naa_id have been set the first - eight bytes of the t10_dev_id are used as the EUI-64 ID. If naa_id has - been set but eui64_id has not been set no EUI-64 identifier is - reported in the SCSI device identification VPD page. If eui64_id has - been set the value of this attribute is reported as the EUI-64 ID. The - first three bytes of an EUI-64 ID are a so-called organizationally - unique identifier (OUI). The remaining bytes may be chosen by the - organization that owns the OUI. For more information about OUIs, see - also http://standards.ieee.org/develop/regauth/oui/public.html. 
-
- - naa_id - allows setting the NAA ID in the SCSI INQUIRY response (NAA =
-   Network Address Authority). This identifier must be 8 or 16 bytes long
-   and must be specified in hex format. A leading "0x" is allowed but is
-   not required. Writing a newline into this attribute discards the NAA
-   ID. If this ID is set it is reported in the SCSI VPD device
-   identification page (83h). More information about NAA identifiers can
-   be found in the following documents:
-   * ANSI T11 committee, Fibre Channel Framing and Signaling Interface - 4
-     (FC-FS-4) rev 0.50, May 2014 (http://www.t11.org/).
-   * IETF, RFC 3980 - T11 Network Address Authority (NAA) Naming Format for
-     iSCSI Node Names, February 2005 (https://tools.ietf.org/html/rfc3980).
-
- - t10_vend_id - Contents of the T10 VENDOR IDENTIFICATION field of the
-   INQUIRY response. The default value for this field is "SCST_BIO" for
-   vdisk_blockio devices and "SCST_FIO" for vdisk_fileio devices.
-
- - usn - contains the virtual device's serial number as reported in
-   INQUIRY data. It is created at device creation time based on the
-   device name and the scst_vdisk_ID scst_vdisk.ko module parameter (see
-   below).
-
- - type - contains the SCSI type of this virtual device.
-
- - resync_size - write-only attribute, which makes vdisk_fileio rescan
-   the size of the backend file. It is useful if you changed the size,
-   for instance, if you resized the file.
-
- - vend_specific_id - Vendor specific ID as reported via the Device
-   Identification VPD page (83h). The default value for this attribute
-   is the value of the t10_dev_id attribute.
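The length and format rules given above for eui64_id (8, 12 or 16 bytes, hexadecimal, optional leading "0x") and naa_id (8 or 16 bytes) can be checked before a value is written to sysfs. This is a hedged sketch: the function names are made up for illustration, and whether SCST itself applies exactly these checks is not stated in this document.

```shell
#!/bin/sh
# Succeed if $1 is a hex string (optional "0x" prefix) whose length in
# bytes equals one of the remaining arguments.
hex_id_ok() {
    id=${1#0x}; shift
    case $id in ''|*[!0-9a-fA-F]*) return 1 ;; esac
    # Two hex digits encode one byte, so the digit count must be even.
    [ $(( ${#id} % 2 )) -eq 0 ] || return 1
    bytes=$(( ${#id} / 2 ))
    for allowed in "$@"; do
        [ "$bytes" -eq "$allowed" ] && return 0
    done
    return 1
}

eui64_id_ok() { hex_id_ok "$1" 8 12 16; }
naa_id_ok()   { hex_id_ok "$1" 8 16; }
```

For instance, a script could run "eui64_id_ok "$id"" and write the value to /sys/kernel/scst_tgt/devices/device_name/eui64_id only on success.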
- -For example: - -/sys/kernel/scst_tgt/devices/disk1 -|-- block -|-- blocksize -|-- exported -| |-- export0 -> ../../../targets/iscsi/iqn.2006-10.net.vlnb:tgt/luns/0 -| |-- export1 -> ../../../targets/iscsi/iqn.2006-10.net.vlnb:tgt/ini_groups/INI/luns/0 -| |-- export2 -> ../../../targets/iscsi/iqn.2006-10.net.vlnb:tgt1/luns/0 -| |-- export3 -> ../../../targets/iscsi/iqn.2006-10.net.vlnb:tgt1/ini_groups/INI1/luns/0 -| |-- export4 -> ../../../targets/iscsi/iqn.2006-10.net.vlnb:tgt1/ini_groups/INI2/luns/0 -|-- filename -|-- handler -> ../../handlers/vdisk_fileio -|-- nv_cache -|-- o_direct -|-- read_only -|-- removable -|-- resync_size -|-- rotational -|-- size_mb -|-- t10_dev_id -|-- thin_provisioned -|-- threads_num -|-- threads_pool_type -|-- tst -|-- type -|-- usn -`-- write_through - -Each vdisk_blockio's device has the following attributes in -/sys/kernel/scst_tgt/devices/device_name: blocksize, filename, nv_cache, -read_only, removable, resync_size, rotational, size_mb, t10_dev_id, -thin_provisioned, gen_tp_soft_threshold_reached_UA, threads_num, -threads_pool_type, tst, type, usn. See above description of those -parameters. - -Each vdisk_nullio's device has the following attributes in -/sys/kernel/scst_tgt/devices/device_name: blocksize, read_only, -removable, size_mb, t10_dev_id, threads_num, threads_pool_type, type, -tst, usn, dummy. See above description of those parameters. - -Each vcdrom's device has the following attributes in -/sys/kernel/scst_tgt/devices/device_name: filename, size_mb, -t10_dev_id, threads_num, threads_pool_type, type, usn, tst. See above -description of those parameters. Exception is filename attribute. For -vcdrom it is writable. Writing to it allows to virtually insert or -change virtual CD media in the virtual CDROM device. For example: - - - echo "/image.iso" >/sys/kernel/scst_tgt/devices/cdrom/filename - will - insert file /image.iso as virtual media to the virtual CDROM cdrom. 
-
- - echo "" >/sys/kernel/scst_tgt/devices/cdrom/filename - will remove
-   "media" from the virtual CDROM cdrom.
-
-Additionally the VDISK handler has the module parameter "num_threads",
-which specifies the count of I/O threads for each FILEIO VDISK or VCDROM
-device. If you have a workload which tends to produce rather random
-accesses (e.g. DB-like), you should increase this count to a bigger value,
-like 32. If you have a rather sequential workload, you should decrease it
-to a lower value, like the number of CPUs on the target or even 1. Due to
-some limitations of the Linux I/O subsystem, increasing the number of I/O
-threads too much leads to a sequential performance drop, especially with
-the deadline scheduler, so decreasing it can improve sequential
-performance. The default provides a good compromise between random and
-sequential accesses.
-
-You shouldn't be afraid to have too many VDISK I/O threads if you have
-many VDISK devices. Kernel threads consume a very small amount of
-resources (several KBs) and only the necessary threads will be used by
-SCST, so the threads will not thrash your system.
-
-CAUTION: If you partitioned/formatted your device with block size X, *NEVER*
-======== ever try to export and then mount it (even accidentally) with another
-         block size. Otherwise you can *instantly* damage it pretty
-         badly as well as all your data on it. Messages on the initiator
-         like "attempt to access beyond end of device" are a sign of
-         such damage.
-
-         Moreover, if you want to compare how well different block sizes
-         work for you, you **MUST** EVERY TIME AFTER CHANGING BLOCK SIZE
-         **COMPLETELY** **WIPE OFF** ALL THE DATA FROM THE DEVICE. In
-         other words, THE **WHOLE** DEVICE **MUST** HAVE ONLY **ZEROS**
-         AS THE DATA AFTER YOU SWITCH TO A NEW BLOCK SIZE. Switching block
-         sizes isn't like switching between FILEIO and BLOCKIO: after
-         changing the block size all data previously written with another
-         block size MUST BE ERASED.
Otherwise you will have a full set of
-         very weird behaviors, because block addressing will be
-         changed, but initiators in most cases will not have a
-         possibility to detect that old addresses written on the device
-         in, e.g., the partition table, don't refer anymore to what they
-         are intended to refer to.
-
-IMPORTANT: Some disk and partition table management utilities don't support
-=========  block sizes >512 bytes, therefore make sure that your favorite one
-           supports them. Currently cfdisk is the only utility known to
-           work only with 512-byte blocks; other utilities like fdisk on
-           Linux or the standard disk manager on Windows are proven to
-           work well with non-512-byte blocks. Note, if you export a disk
-           file or device with a block size different from the one with
-           which it was already partitioned, you could get various weird
-           things like utilities hanging or other unexpected behavior.
-           Hence, to be sure, zero the exported file or device before
-           the first access to it from the remote initiator with another
-           block size. On a Windows initiator make sure you "Set Signature"
-           in the disk manager on the drive imported from the target
-           before doing any other partitioning on it. After you have
-           successfully mounted a file system over a device with a
-           non-512-byte block size, the block size stops mattering: any
-           program will work with files on such a file system.
-
-
-Dealing with massive logs
--------------------------
-
-If you want to enable, using the "trace_level" file, logging levels which
-produce a lot of events, like "debug", then to not lose logged events you
-should also:
-
- * Increase the CONFIG_LOG_BUF_SHIFT variable in the .config of your
-   kernel to a much bigger value, then recompile it. For example, value 25
-   will provide good protection from logging overflow even under a high
-   volume of logging events. To use it you will need to modify the maximum
-   allowed value for CONFIG_LOG_BUF_SHIFT in the corresponding Kconfig
-   file to 25 as well.
-
- * Change your /etc/syslog.conf or the config file of your favorite
-   logging program to store kernel logs in an async manner. For example,
-   you can add the line "kern.info -/var/log/kernel" to rsyslog.conf and
-   add "kern.none" to the line for /var/log/messages, so the resulting
-   line would look like:
-
-   "*.info;kern.none;mail.none;authpriv.none;cron.none /var/log/messages"
-
-
-Persistent Reservations
------------------------
-
-SCST implements Persistent Reservations with the full set of capabilities,
-including "Persistence Through Power Loss".
-
-The "Persistence Through Power Loss" data are saved in /var/lib/scst/pr
-in files named after the corresponding devices. This directory also
-contains backup versions of those files with the suffix ".1". Those backup
-files are used in case of power or other failure to prevent Persistent
-Reservation information from being corrupted during an update. It is safe
-to assume that each of those files can be up to 1KB in size.
-
-Persistent Reservations are available on all transports implementing the
-get_initiator_port_transport_id() callback. Transports not implementing
-this callback will act in one of 2 possible scenarios ("all or
-nothing"):
-
-1. If a device has such a transport connected and doesn't have persistent
-reservations, it will refuse Persistent Reservations commands as if it
-doesn't support them.
-
-2. If a device has persistent reservations, all initiators newly
-connecting via such transports will not see this device. After all
-persistent reservations from this device are released, the initiators
-will see it upon reconnect.
-
-
-ALUA Support
-------------
-
-SCST supports both implicit and explicit asymmetric logical unit access
-(ALUA). ALUA is a feature defined by the ANSI T10 SCSI committee. It
-allows a target to tell the initiator which path to use in a multipath
-setup and, in the explicit case, lets the initiator control the state of
-each path via the SET TARGET PORT GROUPS SCSI command.
The redundant paths between initiator -and target can be used either for redundancy or for load sharing -purposes. The target can either be a single target system running SCST -with multiple communication interfaces or two target systems each -running SCST and configured in a high availability setup. - -In the SPC-4 standard the following concepts are defined related to ALUA: -* Relative target port ID. A number between 1 and 65535 that uniquely - identifies a target port. These numbers must be unique over the target as - a whole, even if that target consists of multiple systems each running SCST. -* Target port group asymmetric access state. One of active/optimized, - active/non-optimized, standby, unavailable, logical block dependent or - offline. The access state of a port defines which (if any) SCSI commands - will be processed by the target port. -* Target port preference indicator. This indicator is additional information - next to the asymmetric access state that is provided by the target to an - initiator and that may impact the decision taken by the initiator about - which path that will be chosen. - -More detailed information about ALUA can be found in section 5.11.2 of the -ANSI T10 standard called SPC-4. - -ALUA support in SCST -.................... - -SCST allows to define ALUA settings for each unique combination of SCST -device and SCST target. An initiator however queries ALUA settings by -sending an appropriate SCSI command to a specific LUN of an SCST target. -Each such LUN maps uniquely to an SCST device. For hardware SCST target -drivers, e.g. ib_srpt, there is a one-to-one correspondence between SCST -target and SCSI target port. With other SCST targets, e.g. iSCSI-SCST, -by default the only relationship between SCST targets and SCSI target -ports is that all SCST targets defined on a system are visible via all -SCSI target ports. 
See also the iSCSI-SCST documentation about the -allowed_portal attribute for information about how to associate iSCSI -targets with a single physical interface. - -Notes: -- In a H.A. setup it is the responsibility of the user to synchronize ALUA - information between the individual systems running SCST. There are no - provisions in SCST to exchange ALUA information automatically between - individual systems. -- In order to support H.A. setups it is possible to let one SCST system - report information about target ports present in other SCST systems. -- With SCST, and certainly in a H.A. setup, it is possible to configure ALUA - such that an initiator receives information that is not standard compliant, - e.g. setting all target ports in the offline state. It is the responsibility - of the user to make sure that the information queried by an initiator is - consistent independent of the LUN and the target port used by the initiator - to query this information. -- Before building a H.A. setup consisting of two or more SCST systems one - should evaluate whether it's acceptable that persistent reservation commands, - SCSI task management commands and MODE SELECT commands will only be processed - by a single node instead of being processed by all nodes. - -Configuring ALUA in SCST -........................ - -SCST allows to configure the following settings related to ALUA -for each unique combination of SCST target and virtual SCST device -(vdisk_fileio, vdisk_blockio, vcdrom, ...): -* The target port group asymmetric access state. SCST supports all ALUA port - states except logical block dependent. -* The preference indicator for a target port group. -* The relative target port ID associated with the SCST target. - -It is possible to configure the following ALUA-related information via the -sysfs interface of SCST: -* Device groups, where each device group has a name and contains zero or more - SCST devices. 
If a device group contains only a single SCST device, the name - of the group may be identical to the device name. See also - /sys/kernel/scst_tgt/device_groups/mgmt. -* Which devices are inside a device group. See also - /sys/kernel/scst_tgt/device_groups//devices/mgmt. -* Target groups, where each target group has a name and contains zero or more - SCST target names. See also - /sys/kernel/scst_tgt/device_groups//target_groups/mgmt. -* Target port group identifier. This is a number in the range 0..65535 and is - called the TARGET PORT GROUP in SPC-4. See also - /sys/kernel/scst_tgt/device_groups//target_groups//group_id. -* Target port group preference indicator. This is a boolean value called the - PREF bit in SPC-4. See also /sys/kernel/scst_tgt/device_groups//target_groups//preferred. -* Target port group state name. One of active, nonoptimized, standby, - unavailable, offline or transitioning. See also - /sys/kernel/scst_tgt/device_groups//target_groups//state. -* Target group contents - zero or more target names. The target names either - exist on the local system or on a remote system in a H.A. setup. For target - names that refer to SCST targets on another system only the relative target - port identifier matters, not the assigned name. See also - /sys/kernel/scst_tgt/device_groups//target_groups//mgmt. -* Relative target identifier. See also - /sys/kernel/scst_tgt/device_groups//target_groups///rel_tgt_id. - -The steps involved in configuring ALUA are: -* Identify the SCST devices that will always share the same ALUA settings and - state. Assign a name to each such group of SCST devices. If a device group - only contains a single device, the group name may be identical to the device - name. -* Configure that device group in SCST via sysfs. -* Identify the SCSI target ports that will always share the same ALUA settings - and state. Assign a name, a group ID and preference indicator to each such - SCSI target port group. 
-* Configure the target port group information in SCST via sysfs. -* Identify all SCST targets that can be accessed via a target port group. -* Assign all these SCST target names to the target group via sysfs. -* Assign a relative target port identifier to each target. - -As an example, in a H.A. setup with two systems each having one InfiniBand -HCA controlled by the ib_srpt driver and where each system exports two LUNs -the following configuration can be used in scst.conf on both systems: - -DEVICE_GROUP dgroup1 { - DEVICE disk01 - - TARGET_GROUP tgroup1 { - group_id 256 - preferred 1 - state active - TARGET fe80:0000:0000:0000:0002:c903:00fa:b7e1 { - rel_tgt_id 1 - } - } - TARGET_GROUP tgroup2 { - group_id 257 - state standby - TARGET fe80:0000:0000:0000:0002:c903:00fa:b7f2 { - rel_tgt_id 2 - } - } -} - -DEVICE_GROUP dgroup2 { - DEVICE disk02 - - TARGET_GROUP tgroup1 { - group_id 258 - state standby - TARGET fe80:0000:0000:0000:0002:c903:00fa:b7e1 { - rel_tgt_id 1 - } - } - TARGET_GROUP tgroup2 { - group_id 259 - preferred 1 - state active - TARGET fe80:0000:0000:0000:0002:c903:00fa:b7f2 { - rel_tgt_id 2 - } - } -} - -Note, if you are using "active" BLOCKIO device attribute to prevent open -of the backend block device on the passive node, it is not recommended -to set both active ("active", "nonoptimized") and passive ("standby", -etc.) ALUA states for the same device if "bind_alua_state=1" is used, as -shown above to keep internal "active" state of the BLOCKIO device consistent. - -If using the "active" BLOCKIO device attribute and multiple target groups -exist per device on a SCST instance then "bind_alua_state=0" should be used -and it is left up to the user to modify the "active" attribute value. - -Explicit ALUA -............. - -To enable explicit ALUA you need in addition to the above settings set -expl_alua device attribute to 1 (by default it is 0). 
Also you need to -run stpgd and supply it with the path to a script or program that will -perform the actual path state switching on a SET TARGET PORT GROUPS command, -for instance, by calling drbdadm. For more information see the stpgd README -as well as the sample script scst_on_stpg. - -DRBD and other replication/failover SW compatibility -.................................................... - -DRBD, like other replication/failover SW, does not allow its -device to be opened on the secondary and does not allow a primary to -secondary transition while this device is open. - -SCST BLOCKIO handler has the necessary support for such behavior: - -1. If you need to prevent an SCST BLOCKIO device from opening its block -device, you need to create it with parameter "active=0". In case of DRBD -this is done automatically, so you don't have to use the "active" -attribute. - -2. By default, if you write a new ALUA state to the "state" attribute and -"bind_alua_state=1" is set for the device, the SCST BLOCKIO handler closes -open handles on all affected SCST devices before the transition and reopens -them after the transition if the new state is active or nonoptimized. Alternatively, -set "bind_alua_state=0" for SCST BLOCKIO devices and ALUA state changes -will not open/close the backing block device; the user will need to handle -this manually or via a cluster RA in an HA setup. - -Thus, the recommended implicit ALUA state change procedure for primary -to secondary transition is: - -1. Block all involved SCST devices using the "block" sysfs attribute (see -above). Wait until the blocking has finished. - -2. Change the ALUA state to "transitioning". At this moment all open -file handles will be closed. - -3. Perform the DRBD or other replication/failover SW state transition. - -4. Change the ALUA state to your desired secondary state. - -5. Unblock the devices blocked in step 1.
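The five steps above could be scripted roughly as follows. This is only a
sketch under stated assumptions: the device, device group, target group and
DRBD resource names passed in are hypothetical, writing 1 to the "block"
attribute is assumed to return only once blocking has finished, and writing 0
is assumed to unblock. SCST_ROOT and DRBD_CMD are helper variables introduced
here so that the sequence can be tried against a scratch directory first.

```shell
#!/bin/sh
# Sketch of the primary -> secondary transition (steps 1-5 above).
# SCST_ROOT and DRBD_CMD are illustrative overrides, not SCST settings.
SCST_ROOT=${SCST_ROOT:-/sys/kernel/scst_tgt}
DRBD_CMD=${DRBD_CMD:-drbdadm}

# Usage: alua_to_secondary <device> <device_group> <target_group> <drbd_res> <state>
alua_to_secondary() {
    dev=$1 dgroup=$2 tgroup=$3 res=$4 new_state=$5
    state="$SCST_ROOT/device_groups/$dgroup/target_groups/$tgroup/state"

    # 1. Block the device (assumed to return when blocking has finished).
    echo 1 >"$SCST_ROOT/devices/$dev/block" || return 1
    # 2. Enter the transitioning state; open file handles get closed here.
    echo transitioning >"$state" || return 1
    # 3. Let the replication software change its role.
    $DRBD_CMD secondary "$res" || return 1
    # 4. Set the desired secondary ALUA state.
    echo "$new_state" >"$state" || return 1
    # 5. Unblock the device blocked in step 1.
    echo 0 >"$SCST_ROOT/devices/$dev/block"
}
```

For example, "alua_to_secondary aa aa tgt_a r0 standby". Note that this sketch
returns early on an error and leaves the device blocked, so a real script
would need to decide how to recover in that case.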
- -Optionally, if your initiators support the Transitioning ALUA state, for -more responsive behavior the blocked devices can be unblocked -immediately after step (2). However, not all initiators behave -correctly if they receive the ASYMMETRIC STATE TRANSITION sense. - -The procedure for the secondary to primary transition is similar. - -In case of explicit ALUA, SCST automatically performs the necessary -device blocking around sending the SCST_EVENT_STPG_USER_INVOKE event. - -Checking the Target Configuration -................................. - -One way to verify the ALUA configuration from a Linux initiator is via -the commands provided in the sg3_utils package. The first step is to -verify whether ALUA has been configured on the target for a certain LUN. -This is possible by checking whether the TPGS=1 text appears in the -sg_inq output, where /dev/sdb is a device node created by the ib_srp -initiator: - -# sg_inq /dev/sdb -standard INQUIRY: - PQual=0 Device_type=0 RMB=0 version=0x05 [SPC-3] - [AERC=0] [TrmTsk=0] NormACA=0 HiSUP=1 Resp_data_format=2 - SCCS=0 ACC=0 TPGS=1 3PC=0 Protect=0 BQue=0 - EncServ=0 MultiP=0 [MChngr=0] [ACKREQQ=0] Addr16=1 - [RelAdr=0] WBus16=0 Sync=0 Linked=0 [TranDis=0] CmdQue=1 - [SPI: Clocking=0x0 QAS=0 IUS=0] - length=66 (0x42) Peripheral device type: disk - Vendor identification: SCST_FIO - Product identification: disk01 - Product revision level: 300 - Unit serial number: 27cddc71 - -The next step is to verify the target group configuration.
That is possible -by verifying whether the output of the sg_rtpg command matches the values -configured on the target: - -# sg_rtpg /dev/sdb -Report target port groups: - target port group id : 0x100 , Pref=1 - target port group asymmetric access state : 0x00 - T_SUP : 0, O_SUP : 0, LBD_SUP : 0, U_SUP : 1, S_SUP : 1, AN_SUP : 1, AO_SUP : 1 - status code : 0x02 - vendor unique status : 0x00 - target port count : 01 - Relative target port ids: - 0x01 - target port group id : 0x101 , Pref=0 - target port group asymmetric access state : 0x02 - T_SUP : 0, O_SUP : 0, LBD_SUP : 0, U_SUP : 1, S_SUP : 1, AN_SUP : 1, AO_SUP : 1 - status code : 0x02 - vendor unique status : 0x00 - target port count : 01 - Relative target port ids: - 0x02 - -The relative target port ID and the target port group ID for a certain path -can be queried e.g. as follows: - -# sg_vpd -p di /dev/sdb -Device Identification VPD page: - Addressed logical unit: - designator type: T10 vendor identification, code set: ASCII - vendor id: SCST_FIO - vendor specific: 27cddc71-disk01 - designator type: EUI-64 based, code set: Binary - 0x3237636464633731 - Target port: - designator type: Relative target port, code set: Binary - Relative target port: 0x1 - designator type: Target port group, code set: Binary - Target port group: 0x100 - - -Initiator Support -................. - -On Linux systems ALUA support is provided by the scsi_dh_alua kernel -driver in combination with the user space multipathd daemon. You will -have to modify at least the following in /etc/multipath.conf to enable -ALUA: - -* hardware_handler "1 alua" -* prio alua -* path_grouping_policy group_by_prio -* path_checker tur - -Notes: -- Newer versions of multipathd support a parameter called - "detect_prio". It can be more convenient to enable this parameter instead of - setting the parameter "prio" to "alua" for only those LUNs that support ALUA. -- Older versions of multipathd (e.g. 
RHEL 5 and SLES 10 SP1) need - 'prio_callout "/sbin/mpath_prio_alua /dev/%n"' instead of 'prio alua'. - -# multipath -ll -23237636464633731 dm-3 SCST_FIO,disk01 -size=1.0G features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 alua' wp=rw -|-+- policy='service-time 0' prio=1 status=active -| `- 10:0:0:0 sdd 8:48 active ready running -`-+- policy='service-time 0' prio=130 status=enabled - `- 11:0:0:0 sde 8:64 active ready running -23133326137346538 dm-4 SCST_FIO,disk02 -size=1.0G features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 alua' wp=rw -|-+- policy='service-time 0' prio=130 status=active -| `- 10:0:0:2 sdn 8:208 active ready running -`-+- policy='service-time 0' prio=1 status=enabled - `- 11:0:0:2 sdp 8:240 active ready running - -The following information can be derived from the above output: -* That the hardware handler (hw_handler) has been set to "1 alua". -* That multipathd created two priority groups - one with priority 1 and one - with priority 130. -* That the SRP path with SCSI host number 10 will be used for communication - with LUN "disk01" and that the SRP path with SCSI host number 11 will be used - for communication with LUN "disk02". - -More information about how to configure the device mapper and the scsi_dh_alua -driver can be found in the manual of your Linux distribution ("man -multipath.conf", "man multipath" and "man multipathd"). - -Windows initiator systems support ALUA from Windows Server 2008 on. For more -information about ALUA support in Windows Server, see also: -* Microsoft, Windows Server 2008 R2 Multipath I/O Overview, MSDN - (http://technet.microsoft.com/en-us/library/cc725907.aspx). -* Microsoft, Multipathing Support in Windows Server 2008, July 2008, MSDN - (http://blogs.msdn.com/b/san/archive/2008/07/27/multipathing-support-in-windows-server-2008.aspx). -* Microsoft, ALUA MPIO Logo Test, MSDN - (http://msdn.microsoft.com/en-us/library/gg607458%28v=vs.85%29.aspx). 
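Returning to Linux initiators: the multipath.conf settings listed earlier in
this section might be combined into a device stanza like the sketch below. The
vendor and product match values are assumptions taken from the sample output
above (SCST_FIO devices) and have to be adapted to what your devices actually
report. Older multipathd versions would need the prio_callout form mentioned
in the notes instead of "prio alua".

```
devices {
    device {
        vendor               "SCST_FIO"
        product              ".*"
        hardware_handler     "1 alua"
        prio                 alua
        path_grouping_policy group_by_prio
        path_checker         tur
    }
}
```

After changing the file, the result can be checked with "multipath -ll" as
shown above.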
- -Active/Non-Optimized via internal redirection -............................................. - -The Active-Standby configuration is simple to understand and set up; -however, it might have serious interoperability issues, because not all -initiators handle the Standby state correctly. For instance, some versions -of VMware are reported to have such issues. The same goes for Windows. - -Hence, it is better to use the Non-Optimized state on the passive node -instead of Standby, with internal command redirection to the active -node. This is what the vast majority of storage vendors are doing, which -is, actually, the reason why the Standby and Unavailable states have all -those initiator issues: they have had too little testing, because -they are only marginally used. - -SCST has the necessary support for such redirection, it just needs to be -configured correctly. It takes a little bit of effort, especially to -understand how it's going to function, but then it will work MUCH more -reliably for the full range of initiators. Even poor initiators that have no -idea about ALUA (boot from SAN, e.g.) will work. - -1. Build SCST with CONFIG_SCST_FORWARD_MODE_PASS_THROUGH enabled in scst.h. - -2. Set up on the active node an internal redirect target, which is going to -accept redirected commands from the passive node. It must be visible -only to the passive node. - -3. Set the "forwarding" attribute for this target to 1. This is necessary to -correctly handle PRs. - -4. Export through this target the SAME backend SCST device as is being -served to the initiator(s) (consider for simplicity that there is only one -served device). - -5. Connect to this SCST device through this internal target from the -passive node, for instance, using iSCSI. Now you have a local SCSI -device on the passive side pointing to the active node. - -6. Export this local device to the initiator(s) using the SCST -*pass-through* handler (scst_disk). Pass-through is needed to redirect -non-block commands as well: ATS, XCOPY, etc. - -7.
Set ALUA state to this target as "nonoptimized". - -That's it on the normal path. Now the initiator(s) would see 2 paths: -OPTIMIZED going to the active node and NON-OPTIMIZED going to the -passive node, then redirected to the active node. - -On failover (i.e. switching active and passive states): - -1. Setup similar redirect target on the new active node. - -2. Setup connectivity to that new redirect target from the new passive -node - -3. Start ALUA change (see above) on both nodes - -4. !! Exchange in the sysfs security group(s) for the initiator(s) *LUN* -from old SCST device to the new one (blockio -> pass-through on the new -passive and pass-through -> blockio on the new active) using "replace_no_ua" -SCST command. You need to do it directly in the sysfs interface, -scstadmin can't do it. - -5. Set ALUA states to "active" on the new active node and "nonoptimized" -on the new passive node. - -6. Finish ALUA states change. - -Example using direct sysfs interface could look like: - -Active-Optimized node: - -modprobe scst -modprobe scst_disk -modprobe scst_vdisk - -# Main device, DRBD primary here -echo "add_device aa filename=/dev/drbd1" >/sys/kernel/scst_tgt/handlers/vdisk_blockio/mgmt - -# Redirect device, not used here. Coming from connecting via iSCSI to the -# corresponding redirect target on the other side. 
-DEVICE=10:0:0:0 -echo add_device $DEVICE >/sys/kernel/scst_tgt/handlers/dev_disk/mgmt - -service iscsi-scst start - -# This is a regular, user-visible target -echo "add_target iqn.2006-10.net.v:tgt " >/sys/kernel/scst_tgt/targets/iscsi/mgmt -echo 1 >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.v:tgt/rel_tgt_id -echo "add aa 0" >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.v:tgt/luns/mgmt - -# This is redirect target, 192.168.9.x is the redirect network -echo "add_target iqn.2006-10.net.v:tgtR" >/sys/kernel/scst_tgt/targets/iscsi/mgmt -echo 2 >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.v:tgtR/rel_tgt_id -echo "add_target_attribute iqn.2006-10.net.v:tgtR allowed_portal 192.168.9.1" >/sys/kernel/scst_tgt/targets/iscsi/mgmt -echo "1" >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.v:tgtR/forwarding -echo "add aa 0" >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.v:tgtR/luns/mgmt - -echo 1 >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.v:tgt/enabled -echo 1 >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.v:tgtR/enabled - -echo 1 >/sys/kernel/scst_tgt/targets/iscsi/enabled - -# ALUA config - -echo create aa >/sys/kernel/scst_tgt/device_groups/mgmt -echo add aa >/sys/kernel/scst_tgt/device_groups/aa/devices/mgmt - -echo add tgt_a >/sys/kernel/scst_tgt/device_groups/aa/target_groups/mgmt -echo add iqn.2006-10.net.v:tgt >/sys/kernel/scst_tgt/device_groups/aa/target_groups/tgt_a/mgmt -echo 1 >/sys/kernel/scst_tgt/device_groups/aa/target_groups/tgt_a/group_id -echo active >/sys/kernel/scst_tgt/device_groups/aa/target_groups/tgt_a/state - -echo add tgt_n >/sys/kernel/scst_tgt/device_groups/aa/target_groups/mgmt -echo add iqn.2006-10.net.v:tgt1 >/sys/kernel/scst_tgt/device_groups/aa/target_groups/tgt_n/mgmt -echo 2 >/sys/kernel/scst_tgt/device_groups/aa/target_groups/tgt_n/iqn.2006-10.net.v:tgt1/rel_tgt_id -echo 2 >/sys/kernel/scst_tgt/device_groups/aa/target_groups/tgt_n/group_id -echo nonoptimized 
>/sys/kernel/scst_tgt/device_groups/aa/target_groups/tgt_n/state - -Active-Non-Optimized node: - -modprobe scst -modprobe scst_disk -modprobe scst_vdisk - -# Main device, DRBD secondary, not used here -echo "add_device aa filename=/dev/drbd1" >/sys/kernel/scst_tgt/handlers/vdisk_blockio/mgmt - -# Redirect device. Coming from connecting via iSCSI to the -# corresponding redirect target on the other side. -DEVICE=10:0:0:0 -echo add_device $DEVICE >/sys/kernel/scst_tgt/handlers/dev_disk/mgmt - -service iscsi-scst start - -echo "add_target iqn.2006-10.net.v:tgt1" >/sys/kernel/scst_tgt/targets/iscsi/mgmt -echo 2 >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.v:tgt1/rel_tgt_id -echo "add $DEVICE 0" >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.v:tgt1/luns/mgmt - -# Redirect target, 192.168.9.x is the redirect network -echo "add_target iqn.2006-10.net.v:tgtR" >/sys/kernel/scst_tgt/targets/iscsi/mgmt -echo 2 >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.v:tgtR/rel_tgt_id -echo "add_target_attribute iqn.2006-10.net.v:tgtR allowed_portal 192.168.9.2" >/sys/kernel/scst_tgt/targets/iscsi/mgmt -echo "1" >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.v:tgtR/forwarding -echo "add aa 0" >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.v:tgtR/luns/mgmt - -echo 1 >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.v:tgt1/enabled - -echo 1 >/sys/kernel/scst_tgt/targets/iscsi/enabled -echo 1 >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.v:tgtR/enabled - -# ALUA config - -echo create $DEVICE >/sys/kernel/scst_tgt/device_groups/mgmt -echo add $DEVICE >/sys/kernel/scst_tgt/device_groups/$DEVICE/devices/mgmt - -echo add tgt_a >/sys/kernel/scst_tgt/device_groups/$DEVICE/target_groups/mgmt -echo add iqn.2006-10.net.v:tgt >/sys/kernel/scst_tgt/device_groups/$DEVICE/target_groups/tgt_a/mgmt -echo 1 >/sys/kernel/scst_tgt/device_groups/$DEVICE/target_groups/tgt_a/iqn.2006-10.net.v:tgt/rel_tgt_id -echo 1 
>/sys/kernel/scst_tgt/device_groups/$DEVICE/target_groups/tgt_a/group_id -echo active >/sys/kernel/scst_tgt/device_groups/$DEVICE/target_groups/tgt_a/state - -echo add tgt_n >/sys/kernel/scst_tgt/device_groups/$DEVICE/target_groups/mgmt -echo add iqn.2006-10.net.v:tgt1 >/sys/kernel/scst_tgt/device_groups/$DEVICE/target_groups/tgt_n/mgmt -echo 1 >/sys/kernel/scst_tgt/device_groups/$DEVICE/target_groups/tgt_n/group_id -echo nonoptimized >/sys/kernel/scst_tgt/device_groups/$DEVICE/target_groups/tgt_n/state - -ALUA state switch after DRBD primary-secondary transition: - -Ex-Optimized: - -echo "replace_no_ua $DEVICE 0" >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.v:tgt1/luns/mgmt -echo nonoptimized >/sys/kernel/scst_tgt/device_groups/$DEVICE/target_groups/tgt_a/state -echo active >/sys/kernel/scst_tgt/device_groups/$DEVICE/target_groups/tgt_n/state - -Ex-Non-Optimized: - -echo "replace_no_ua aa 0" >/sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.v:tgt1/luns/mgmt -echo nonoptimized >/sys/kernel/scst_tgt/device_groups/$DEVICE/target_groups/tgt_a/state -echo active >/sys/kernel/scst_tgt/device_groups/$DEVICE/target_groups/tgt_n/state - -If you have any questions, please read this above text at least 3 times -before asking. It might be tricky to understand :-) - - -VAAI ----- - -SCST supports all 3 VAAI SCSI commands: WRITE SAME, COMPARE AND WRITE -(ATS) and EXTENDED COPY. Additionally, it supports not directly related -to VAAI Thin Provisioning capabilities, particularly, UNMAP SCSI -commands, WRITE SAME with UNMAP bit as well as thin provisioning related -devices' sysfs attributes (see above). - -In some cases dev handlers should perform some manual actions to fully -benefit from SCST VAAI implementation. Those actions described in the -implementation notes below. For vdisk and fileio_tgt handlers they have -already been implemented. 
- -IMPORTANT: To use EXTENDED COPY command between LUNs (datastores) they all -========= MUST have the same PRODUCT IDENTIFICATION INQUIRY field. By - default, to simplify remote devices identification, SCST uses - vdisk names as PRODUCT IDENTIFICATION, so SCST devices look - differently from the initiators. However, for some reasons, - VMware does not use EXTENDED COPY between LUNs with different - PRODUCT IDENTIFICATION. Thus, to be able to use full VAAI in - your VMware setups you must manually set PRODUCT - IDENTIFICATION for all your VMware LUNs to the same value, - for instance, "SCST", via using "prod_id" attribute. It could - be done either by adding "prod_id" attribute to scstadmin - scst.conf, or by directly writing to SCST sysfs attribute. - For example: - - HANDLER vdisk_blockio { - DEVICE blockio1 { - filename /dev/sda5 - prod_id SCST - } - - or - - echo SCST >/sys/kernel/scst_tgt/devices/blockio1/prod_id - correspondingly. - - Note, this prod_id modification must be done on all - datastores BEFORE VMware connects to them. - - -Implementation notes -.................... - -WRITE SAME -~~~~~~~~~~ - -WRITE SAME command supports 2 modes: - -1. Manual writing mode. In this mode WRITE SAME generates a set of -internal WRITE(16) SCSI commands to perform requested writing. - -2. Remap mode. In this mode a dev handler, if supported, can remap being -written blocks to a single block and then tell SCST to manually write -parts of the requested area, which for some reason can not be remapped. - -In both cases dev handlers should call from WRITE SAME command handler -scst_write_same() function. This function as the second argument gets -array of descriptors where to write the requested block of data. Last -element in this array must have len 0. If this argument is NULL, then -the whole area will be manually written by SCST. This value should be -used by dev handlers not supporting remapping blocks. 
- -User space dev handlers should use the SCST_EXEC_REPLY_DO_WRITE_SAME -reply_type of the SCST_USER_EXEC subcommand. See the scst_user doc for more -info. - - -COMPARE AND WRITE -~~~~~~~~~~~~~~~~~ - -COMPARE AND WRITE is implemented by SCST as a set of read, compare and write -actions done in an atomic manner against the affected blocks as well as regular -RESERVE SCSI commands. In particular, COMPARE AND WRITE doesn't need any -queue flushing, and an unlimited number of COMPARE AND WRITE commands on -different blocks can be executed simultaneously. - -The read and write actions are implemented as generation of internal -READ(16) and WRITE(16) SCSI commands. - -The COMPARE AND WRITE command is completely transparent to dev handlers -(they only see the corresponding READ(16) and WRITE(16) commands), so -it doesn't require any manual actions from them. - - -EXTENDED COPY -~~~~~~~~~~~~~ - -SCST implements EXTENDED COPY via an internal Copy Manager target. This -target has the following specific attribute in its sysfs: - - - allow_not_connected_copy - if not set (default), an initiator can -perform copy only between devices it has direct access to via any -target/session. If set, any initiator can copy between any devices in -the system. - -The Copy Manager has access only to those devices for which it has LUNs -in /sys/kernel/scst_tgt/targets/copy_manager/copy_manager_tgt/luns/. -Devices from the scst_vdisk dev handler are added to it automatically upon -registration, but for other devices you need to manually add LUNs there -the same way as for any target driver. You can also remove any device -from the Copy Manager's visibility at any time by deleting the corresponding -LUN from the sysfs. This might be useful during ALUA state switching. - -Internally SCST implements EXTENDED COPY as generation of sets of -internal READ(16) and WRITE(16) SCSI commands. Dev handlers don't need -any manual actions to use it.
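As a small illustration of the manual Copy Manager LUN management described
above, the two helpers below add and remove a LUN. This is a sketch: it
assumes the Copy Manager luns/mgmt file accepts the same "add <device> <lun>"
and "del <lun>" commands as the luns/mgmt file of any other target, and the
CM_LUNS variable is a helper introduced here, not an SCST setting.

```shell
#!/bin/sh
# Sketch: manage Copy Manager LUNs for devices that are not added
# automatically (i.e. devices not coming from scst_vdisk).
CM_LUNS=${CM_LUNS:-/sys/kernel/scst_tgt/targets/copy_manager/copy_manager_tgt/luns}

# Make a device visible to the Copy Manager: cm_add_lun <device> <lun>
cm_add_lun() {
    echo "add $1 $2" >"$CM_LUNS/mgmt"
}

# Hide a device from the Copy Manager again: cm_del_lun <lun>
cm_del_lun() {
    echo "del $1" >"$CM_LUNS/mgmt"
}
```

For example, "cm_add_lun 10:0:0:0 10" would expose a pass-through device as
LUN 10 (both values are hypothetical), and "cm_del_lun 10" would remove it,
e.g. for the duration of an ALUA state switch.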
- -SCST also gives dev handlers the possibility to remap blocks instead -of copying them, if they support this feature. It allows them to perform -the EXTENDED COPY command much faster by just a metadata update of their -backend storage, which is supposed to be nearly instantaneous. - -To use this feature, a dev handler should set up the ext_copy_remap() -callback in its struct scst_dev_type. This callback is called by SCST -during EXTENDED COPY command processing to let the dev handler first try to -remap the affected blocks. - -Upon finish, the dev handler should call scst_ext_copy_remap_done(). In -case of error, the dev handler should set the corresponding sense to cmd -and then also call scst_ext_copy_remap_done(cmd, NULL, 0). - -If the dev handler is not able to remap some part of the segment, it should -kmalloc() an array, fill it with all leftover subsegments and supply it to -scst_ext_copy_remap_done(). SCST will then copy the subsegments using its -internal copy machine, then kfree() the supplied array. If the dev -handler is not able to remap the segment at all, it can simply directly -supply the original segment to scst_ext_copy_remap_done(). - -It is highly recommended that in normal circumstances dev handlers call -scst_ext_copy_remap_done() from a thread context other than the one where -the ext_copy_remap() callback was originally called, because otherwise there -could be recursion in the segments processing. Fortunately, this thread -context switch is natural for such a potentially long operation as -EXTENDED COPY. - - -VMware and Ceph RBD space reclaim ---------------------------------- - -VMware with the VMFS5 filesystem ignores UNMAP alignment, so if you use 4MB -Ceph RBD objects and VMFS5, only some discards will reclaim RBD space -because a 1MB discard does not often hit the tail of an object. - -Thus, to have efficient ESXi space reclamation with RBD and VMFS5, you are -recommended to use a 1 MB object size in Ceph. - -See the https://sourceforge.net/p/scst/mailman/message/35287598 thread for -details.
- - -Caching -------- - -By default, for performance reasons, VDISK FILEIO devices use the write back -caching policy. - -Generally, write back caching is safe to use and its danger is -greatly overestimated, because most modern (especially Enterprise -level) applications are well prepared to work with write back cached -storage. In particular, all transaction-based applications are. -Those applications flush the cache to completely avoid ANY data loss on a -crash or power failure. For instance, journaled file systems flush the cache -on each metadata update, so they survive power/hardware/software -failures pretty well. - -Since locally on initiators write back caching is always on, if an -application cares about its data consistency, it does flush the cache -when necessary, or on any write if it opens files with O_SYNC. If it doesn't -care, it doesn't flush the cache. As soon as the cache flushes are -propagated to the storage, write back caching on it doesn't make any -difference. If an application doesn't flush the cache, it's doomed to lose -data in case of a crash or power failure, no matter where this cache is -located, locally or on the storage. - -To illustrate that, consider, for example, a user who wants to copy the /src -directory to the /dst directory reliably, i.e. after the copy has finished no -power failure or software/hardware crash could lead to a loss of the -data in /dst. There are 2 ways to achieve this. Let's suppose for -simplicity that cp opens files for writing with the O_SYNC flag, hence bypassing -the local cache. - -1. Slow. Make the device behind /dst work in write through caching -mode and then run "cp -a /src /dst". - -2. Fast. Let the device behind /dst work in write back caching mode -and then run "cp -a /src /dst; sync". The reliability of the result is -the same, but it's much faster than (1).
Nobody would care if a crash -happens during the copy, because after recovery the leftovers from -the incomplete attempt would simply be deleted and the operation would be -restarted from the very beginning. - -So, you can see that in (2) there is no danger of ANY data loss from -write back caching. Moreover, since in practice cp doesn't open files -for writing with the O_SYNC flag, to get the copy done reliably the sync -command must be called after cp anyway, so enabling write back caching -wouldn't make any difference for reliability. - -You can also consider it from another side. Modern HDDs have at least -16MB of cache working in write back mode by default, so for a 10-drive -RAID it is 160MB of write back cache. How many people are happy with -it and how many have disabled the write back cache of their HDDs? Almost all and -almost nobody, correspondingly? Moreover, many HDDs lie about the state of -their cache and report write through while working in write back mode. -They are also successfully used. - -Note, the Linux I/O subsystem guarantees to propagate cache flushes to the -storage only when using data protection barriers, which are usually turned off by -default (see http://lwn.net/Articles/283161). Without barriers enabled -Linux doesn't provide a guarantee that after sync()/fsync() all written -data have really hit permanent storage. They can be stored in the cache of -your backstorage devices and, hence, lost on a power failure event. -Thus, even with write-through cache mode, you still either need to -enable barriers on your backend file system on the target (for direct -/dev/sdX devices this is, indeed, impossible), or need a good UPS to -protect yourself from loss of not committed data. Some info about barriers -from the XFS point of view can be found at -http://xfs.org/index.php/XFS_FAQ#Write_barrier_support. On Linux -initiators for Ext3 and ReiserFS file systems the barrier protection -can be turned on using the "barrier=1" and "barrier=flush" mount options -correspondingly.
You can check whether the barriers are turned on or off by looking -in /proc/mounts. Windows and, AFAIK, other UNIX'es don't need any -special explicit options and do the necessary barrier actions on write back -caching devices by default. - -To limit this data loss with write back caching you can use the files in -/proc/sys/vm to limit the amount of unflushed data in the system cache. - -If you for some reason have to use VDISK FILEIO devices in write through -caching mode, don't forget to disable internal caching on their backend -devices or make sure they have an additional battery or supercapacitor -power supply on board. Otherwise, on a power failure you would still -lose all the not yet saved data in the devices' internal cache. - -Note, on some real-life workloads write through caching might perform -better than write back caching with the barrier protection turned on. - - -Errors caching -.............. - -When using a virtual device in FILEIO mode, the Linux page cache comes -into the picture. The negative side of it is that it sometimes also -caches errored pages. That is, if the underlying file experiences IO -errors, those errors might be cached by the Linux page cache. As a -result, even when the underlying file recovers and stops failing IOs, -the initiator may still hit IO errors returned by the Linux page cache, -until the cache re-reads the errored pages (usually it happens pretty -soon, but not immediately). To make sure that cached pages are dropped, -one of the following can be done: - -- Detach the SCSI virtual device (del_device) and re-attach it - (add_device). This should evict all the cached pages, unless somebody - else holds the same "filename" open. - -- Issue a BLKFLSBUF ioctl to the same "filename" you provided for "add_device". - -For the second option, some rudimentary C code is required: - -fd = open(filename, O_RDWR); -if (fd < 0) { - err = errno; - ... -} else { - err = ioctl(fd, BLKFLSBUF); - if (err < 0) { - err = errno; - ...
- } - close(fd); -} - - -BLOCKIO VDISK mode ------------------- - -This module works best for these types of scenarios: - -1) Data that are not aligned to 4K sector boundaries and <4K block sizes -are used, which is normally found in virtualization environments where -operating systems start partitions on odd sectors (Windows and its -sector 63). - -2) Large block data transfers normally found in database loads/dumps and -streaming media. - -3) Advanced relational database systems that perform their own caching, -which prefer or demand direct IO access and, because of the nature of -their data access, can actually see worse performance with -indiscriminate caching. - -4) Multiple layers of targets where the secondary and above layers need -to have a consistent view of the primary targets in order to preserve -data integrity, which a page cache backed IO type might not provide -reliably. - -Also it has an advantage over FILEIO that it doesn't copy data between -the system cache and the commands' data buffers, so it saves a -considerable amount of CPU power and memory bandwidth. - -IMPORTANT: Since data in BLOCKIO and FILEIO modes are not consistent with -========= each other, if you try to use a device in both those modes - simultaneously, you will almost instantly corrupt your data - on that device. - -IMPORTANT: In SCST 1.x BLOCKIO worked by default in NV_CACHE mode, in which -========= each device was reported to remote initiators as having write through - caching. But if your backend block device has internal write - back caching, this might create a possibility of losing - the data held in the internal cache in case of a power - failure. Starting from SCST 2.0 BLOCKIO works by default in - non-NV_CACHE mode, in which each device is reported to remote - initiators as having write back caching, and synchronizes the - internal device's cache on each SYNCHRONIZE_CACHE command - from the initiators.
It might lead to some *PERFORMANCE LOSS*, - so if you are sure of your power supply and want to - restore the 1.x behavior, you should recreate your BLOCKIO - devices in NV_CACHE mode. - - -Pass-through mode ------------------ - -In the pass-through mode (i.e. using the pass-through device handlers -scst_disk, scst_tape, etc.) SCSI commands coming from remote initiators -are passed to local SCSI devices on the target as is, without any -modifications. - -SCST supports 1 to many pass-through, when several initiators can safely -connect to a single pass-through device (a tape, for instance). For such -cases SCST emulates all the necessary functionality. - -In the sysfs interface all real SCSI devices are listed in -/sys/kernel/scst_tgt/devices in the form of host:channel:id:lun numbers, for -instance 1:0:0:0. The recommended way to match those numbers to your -devices is to use the lsscsi utility. - -Each pass-through dev handler has in its root subdirectory -/sys/kernel/scst_tgt/handlers/handler_name, e.g. -/sys/kernel/scst_tgt/handlers/dev_disk, a "mgmt" file. It allows the -following commands. They can be sent to it using, e.g., the echo command. - - - "add_device" - this command assigns the SCSI device with the given -host:channel:id:lun numbers to this dev handler. - -echo "add_device 1:0:0:0" >/sys/kernel/scst_tgt/handlers/dev_disk/mgmt - -will assign SCSI device 1:0:0:0 to this dev handler. - - - "del_device" - this command unassigns the SCSI device with the given -host:channel:id:lun numbers from this dev handler. - -As usual, on read the "mgmt" file returns a small help text about the available -commands. - -You need to manually assign each of your real SCSI devices to the -corresponding pass-through dev handler using the "add_device" command, -otherwise the real SCSI devices will not be visible remotely. The -assignment isn't done automatically, because it could lead to -load and initialization problems in the pass-through dev handlers if any of the -local real SCSI devices are malfunctioning.
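The manual assignment described above can be scripted. The following is a minimal sketch, not part of SCST: the function name assign_passthrough and the SCST_SYSFS variable are hypothetical names introduced only for illustration, while the "mgmt" file location and the "add_device" command are the ones described above. It refuses to write anything if the handler's "mgmt" file is missing, which would usually mean the dev handler is not loaded.

```shell
#!/bin/sh
# Sketch only: assign_passthrough and SCST_SYSFS are illustrative names,
# not part of SCST. SCST_SYSFS defaults to the sysfs root described above.
SCST_SYSFS="${SCST_SYSFS:-/sys/kernel/scst_tgt}"

# Usage: assign_passthrough HANDLER HOST:CHANNEL:ID:LUN
assign_passthrough() {
    handler="$1"
    dev="$2"
    mgmt="$SCST_SYSFS/handlers/$handler/mgmt"

    # Without the handler's mgmt file nothing can be assigned.
    if [ ! -e "$mgmt" ]; then
        echo "no mgmt file for handler $handler" >&2
        return 1
    fi

    # Same command as the echo example above, with the device passed in.
    echo "add_device $dev" >"$mgmt"
}
```

With this helper, "assign_passthrough dev_disk 1:0:0:0" does the same as the echo command shown above.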
- -Like any other hardware, the local SCSI hardware can not handle commands -with an amount of data and/or a segments count in the scatter-gather array bigger -than some values. Therefore, when using the pass-through mode you should note -that the values for the maximum number of segments and the maximum amount of -transferred data (max_sectors) for each SCSI command on devices on -initiators can not be bigger than the corresponding values of the -corresponding SCSI devices on the target. Otherwise you will see -symptoms like small transfers working well, but large ones stalling, and -messages like: "Unable to complete command due to SG IO count -limitation" printed in the kernel logs. - -You can't control the limit of the scatter-gather -segments from the user space, but for block devices it is usually sufficient if you set on -the initiators /sys/block/DEVICE_NAME/queue/max_sectors_kb to the same -or a lower value than in /sys/block/DEVICE_NAME/queue/max_hw_sectors_kb for -the corresponding devices on the target. - -For non-block devices SCSI commands are usually generated directly by -applications, so, if you experience large transfer stalls, you should -check the documentation for your application on how to limit the transfer -sizes. - -Another way to solve this issue is to build SG entries with more than 1 -page each. See the following patch as an example: -http://scst.sourceforge.net/sgv_big_order_alloc.diff - - -Performance ------------ - -SCST from the very beginning has been designed and implemented to -provide the best possible performance. Since there is no "one size fits all" -best performance configuration for different setups and loads, SCST -provides an extensive set of settings to allow tuning it for the best -performance in each particular case. You don't necessarily have to use -those settings. If you don't, SCST will do a very good job of autotuning for -you, so the resulting performance will, on average, be better -(sometimes, much better) than with other SCSI targets.
But in some cases -you can improve it even more by manual tuning. - -Before doing any performance measurements note that performance results -are very much dependent on your type of load, so it is crucial that -you choose the access mode (FILEIO, BLOCKIO, O_DIRECT, pass-through) which -suits your needs best. - -In order to get the maximum performance you should: - -1. For SCST: - - - Disable CONFIG_SCST_STRICT_SERIALIZING, CONFIG_SCST_EXTRACHECKS, - CONFIG_SCST_TRACING, CONFIG_SCST_DEBUG*, CONFIG_SCST_STRICT_SECURITY. - -2. For target drivers: - - - Disable in Makefiles CONFIG_SCST_EXTRACHECKS, CONFIG_SCST_TRACING, - CONFIG_SCST_DEBUG* - -3. For device handlers, including VDISK: - - - Disable in Makefile CONFIG_SCST_TRACING and CONFIG_SCST_DEBUG. - -Note, by disabling CONFIG_SCST_TRACING and CONFIG_SCST_DEBUG you are -disabling many useful SCST diagnostic messages, which can significantly -help in many troubleshooting cases. So you may consider keeping -CONFIG_SCST_TRACING enabled, its performance impact is very limited. - -4. Make sure you have the io_grouping_type option set correctly, especially -in the following cases: - - - Several initiators share your target's backstorage. It can be a - shared LU using some cluster FS, like VMFS, or it can be - different LUs located on the same backstorage (RAID array). For - instance, you have 3 initiators and each of them is using its own - dedicated FILEIO device file from the same RAID-6 array on the - target. - - In this case for the best performance you should have the - io_grouping_type option set to the value "never" in all the LUNs' targets - and security groups. - - - Your initiator connected to your target in MPIO mode.
In this case for - the best performance you should: - - * Either connect all the sessions from the initiator to a single - target or security group and have io_grouping_type option set in - value "this_group_only" in the target or security group, - - * Or, if it isn't possible to connect all the sessions from the - initiator to a single target or security group, assign the same - numeric io_grouping_type value for each target/security group this - initiator connected to. The exact value itself doesn't matter, - important only that all the targets/security groups use the same - value. - -Don't forget, io_grouping_type makes sense only if you use CFQ I/O -scheduler on the target and for devices with threads_num >= 0 and, if -threads_num > 0, with threads_pool_type "per_initiator". - -You can check if in your setup io_grouping_type set correctly as well as -if the "auto" io_grouping_type value works for you by tests like the -following: - - - For not MPIO case you can run single thread sequential reading, e.g. - using buffered dd, from one initiator, then run the same single - thread sequential reading from the second initiator in parallel. If - io_grouping_type is set correctly the aggregate throughput measured - on the target should only slightly decrease as well as all initiators - should have nearly equal share of it. If io_grouping_type is not set - correctly, the aggregate throughput and/or throughput on any - initiator will decrease significantly, in 2 times or even more. For - instance, you have 80MB/s single thread sequential reading from the - target on any initiator. When then both initiators are reading in - parallel you should see on the target aggregate throughput something - like 70-75MB/s with correct io_grouping_type and something like - 35-40MB/s or 8-10MB/s on any initiator with incorrect. - - - For the MPIO case it's quite easier. 
With incorrect io_grouping_type - you simply won't see a performance increase from adding the second - session (assuming your hardware is capable of transferring data through - both sessions in parallel), or can even see a performance decrease. - -5. If you are going to use your target in a VM environment, for -instance as shared storage with VMware, make sure all your VMs are -connected to the target via *separate* sessions. For instance, for iSCSI -it means that each VM has its own connection to the target, not that all VMs are -connected using a single connection. You can check it using the SCST sysfs -interface. For other transports you should use the available facilities, -like NPIV for Fibre Channel, to make separate sessions for each VM. If -you miss it, you can lose a lot of performance on parallel access to -your target from different VMs. This doesn't apply to the case when your -VMs are using the same shared storage, like with VMFS, for instance. In -this case all your VM hosts will be connected to the target via separate -sessions, which is enough. - -6. For other target and initiator software parts: - - - Make sure you applied all available SCST patches to your kernel. - If for your kernel version such a patch doesn't exist, it is strongly - recommended to upgrade your kernel to a version for which the patch - exists. - - - Don't enable debug/hacking features in the kernel, i.e. use them as - they are by default. - - - The default kernel read-ahead and queuing settings are optimized - for locally attached disks, therefore they are not optimal if the disks are - attached remotely (SCSI target case), which sometimes could lead to - unexpectedly low throughput. You should increase the read-ahead size to at - least 512KB or even more on all initiators and the target. - - You should also limit on all initiators the maximum amount of sectors per - SCSI command. This tuning is also recommended on targets with large - read-ahead values.
To do it on Linux, run: - - echo "64" > /sys/block/sdX/queue/max_sectors_kb - - where instead of X you specify the letter of the device imported from the target, - like 'b', i.e. sdb. - - To increase read-ahead size on Linux, run: - - blockdev --setra N /dev/sdX - - where N is a read-ahead number in 512-byte sectors and X is a device - letter like above. - - Note: you need to set the read-ahead setting for device sdX again after - you changed the maximum amount of sectors per SCSI command for that - device. - - Note2: you need to restart SCST after you changed read-ahead settings - on the target. It is a limitation of the Linux read-ahead - implementation. It reads RA values for each file only when the file - is opened and does not update them when the global RA parameters change. - Hence the need for vdisk to reopen all its files/devices. - - - You may need to increase the amount of requests that the OS on the initiator - sends to the target device. To do it on Linux initiators, run - - echo "64" > /sys/block/sdX/queue/nr_requests - - where X is a device letter like above. - - You may also experiment with other parameters in the /sys/block/sdX - directory, they also affect performance. If you find the best values, - please share them with us. - - - On the target use the CFQ IO scheduler. In most cases it has a performance - advantage over other IO schedulers, sometimes huge (2+ times - aggregate throughput increase). - - - It is recommended to turn the kernel preemption off, i.e. set - the kernel preemption model to "No Forced Preemption (Server)". - - - It looks like XFS is the best filesystem on the target to store device - files, because it allows considerably better linear write throughput - than ext3. - -7. For hardware on target. - - - Make sure that your target hardware (e.g. target FC or network card) - and underlying IO hardware (e.g. IO card, like SATA, SCSI or RAID to - which your disks are connected) don't share the same PCI bus. You can - check it using the lspci utility.
They have to work in parallel, so it - will be better if they don't compete for the bus. The problem is not - only in the bandwidth, which they have to share, but also in the - interaction between cards during that competition. This is very - important, because in some cases if target and backend storage - controllers share the same PCI bus, it could lead to 5-10 times - less performance than expected. Moreover, some motherboards (by - Supermicro, particularly) have serious stability issues if there are - several high speed devices on the same bus working in parallel. If - you have no choice but PCI bus sharing, set the PCI latency in the BIOS - as low as possible. - -8. If you use the VDISK IO module in FILEIO mode, the NV_CACHE option will -provide you the best performance. But when using it, make sure you use a good -UPS with the ability to shut down the target on a power failure. - -Baseline performance numbers can be found in those measurements: -http://lkml.org/lkml/2009/3/30/283. - -IMPORTANT: If you use on initiator some versions of Windows (at least W2K) -========= you can't get good write performance for VDISK FILEIO devices with - default 512 bytes block sizes. You could get about 10% of the - expected one. This is because of the partition alignment, which - is (simplifying) incompatible with how the Linux page cache - works, so for each write the corresponding block must be read - first. Use 4096 bytes block sizes for VDISK devices and you - will have the expected write performance. Actually, any OS on - initiators, not only Windows, will benefit from block size - max(PAGE_SIZE, BLOCK_SIZE_ON_UNDERLYING_FS), where PAGE_SIZE - is the page size, BLOCK_SIZE_ON_UNDERLYING_FS is the block size - on the underlying FS, on which the device file is located, or 0, - if a device node is used. Both values are from the target. - See also important notes about setting block sizes >512 bytes - for VDISK FILEIO devices above. - - -9.
In some cases, for instance working with SSD devices, which consume -100% of a single CPU load for data transfers in their internal threads, -to maximize IOPS it can be needed to assign for those threads dedicated -CPUs. Consider using cpu_mask attribute for devices with -threads_pool_type "per_initiator" or Linux CPU affinity facilities for -other threads_pool_types. No IRQ processing should be done on those -CPUs. Check that using /proc/interrupts. See taskset command and -Documentation/IRQ-affinity.txt in your kernel's source tree for how to -assign IRQ affinity to tasks and IRQs. - -The reason for that is that processing of coming commands in SIRQ -context might be done on the same CPUs as SSD devices' threads doing data -transfers. As the result, those threads won't receive all the processing -power of those CPUs and perform worse. - -10. If your storage is capable of operation on hundreds of thousands -IOPS level, you can use poll_us sysfs attribute to set how many us each -SCST thread is polling its queue after it became empty in a hope that a -new command can come. In some cases, polling can significantly increase -IOPS, especially if low power states on CPU not disabled, because on -high IOPS polling could be cheaper comparing to spending significant -time on entering, then exiting CPU low power states + corresponding -context switches. Polling is disabled by default. The recommended value -to start from is 5-10 us. Then you can increase or decrease it to see if -your IOPS are increasing or decreasing. - - -Commands suspending takes too long ----------------------------------- - -SCST is suspending commands during some management activities like -adding/deleting LUNs or devices. It is done to have lockless LUNs -translation on the hot commands processing path. This brings significant -performance advantage. You will see a message like "Waiting for X active -commands to complete" when this wait started. 
- -But the downside of it is that no new commands start executing until the older -ones, which had started before the suspending began, have finished. This -wait can not be any longer than the worst command latency any of your -initiators is seeing at this particular time. - -So, if this wait takes too long, in the majority of cases it means that you -are overloading your storage. A proper storage should have a worst case -latency below a few hundred milliseconds. In this case the SCST -suspending will finish in a few hundred milliseconds at worst. - -Another case when it can take too long to suspend is a hung user space -device (i.e. scst_user device) not responding to any command. In this -case you should kill the corresponding user space program to finish -suspending. - - -Work if target's backstorage or link is too slow ------------------------------------------------- - -Under high I/O load, when your target's backstorage gets overloaded, or -working over a slow link between initiator and target, when the link -can't serve all the queued commands on time, you can experience I/O -stalls or see abort or reset messages in the kernel log. - -At first, consider the case of a too slow target backstorage. On some -seek intensive workloads even fast disks or RAIDs, which are able to serve -a continuous data stream at 500+ MB/s speed, can be as slow as 0.3 MB/s. -Another possible cause for that can be MD/LVM/RAID on your target as in -http://lkml.org/lkml/2008/2/27/96 (check the whole thread as well). - -Thus, in such situations simply processing one or more commands takes -too long, hence the initiator decides that they are stuck on the target -and tries to recover. Particularly, it is known that the default amount -of simultaneously queued commands (48) is sometimes too high if you do -intensive writes from VMware on a target disk which uses LVM in the -snapshot mode. In this case a value like 16 or even 8-10, depending on your -backstorage speed, could be more appropriate.
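Lowering the queue depth on a Linux initiator (action 2 in the list below) can be tried step by step. The following is a minimal sketch, not part of SCST: the function name halve_queue_depth is hypothetical, and it assumes it is given the path of the queue_depth file of the imported device, i.e. /sys/block/sdX/device/queue_depth. It halves the current value and never goes below 1.

```shell
#!/bin/sh
# Sketch only: halve_queue_depth is an illustrative helper, not part of SCST.
# Usage: halve_queue_depth /sys/block/sdX/device/queue_depth
halve_queue_depth() {
    file="$1"

    # Read the current number of simultaneously queued commands.
    cur=$(cat "$file") || return 1

    # Halve it, but keep at least 1 queued command.
    new=$((cur / 2))
    if [ "$new" -lt 1 ]; then
        new=1
    fi

    echo "$new" >"$file"
    echo "queue depth: $cur -> $new"
}
```

Repeating the call until incoming TM commands stop gives a rough starting point. Whether the resulting value is acceptable for throughput still has to be measured on the actual workload.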
- -There are 6 possible actions which you can take to work around or fix such -issues: - -1. Ignore incoming task management (TM) commands. It's fine if there are -not too many of them, so average performance isn't hurt and the -corresponding device isn't getting put offline, i.e. if the backstorage -isn't way too slow. - -2. Decrease /sys/block/sdX/device/queue_depth on the initiator in case -it's Linux (see below how) and/or the SCST_MAX_TGT_DEV_COMMANDS constant -in the scst_priv.h file until you stop seeing incoming TM commands. -The ISCSI-SCST driver also has its own iSCSI specific parameter for that, -see its README file. - -To decrease the device queue depth on Linux initiators you can run the command: - -# echo Y >/sys/block/sdX/device/queue_depth - -where Y is the new number of simultaneously queued commands, X - your -imported device letter, like 'a' for the sda device. There are no special -limitations for the Y value, it can be any value from 1 to the possible maximum -(usually, 32), so start by dividing the current value by 2, i.e. set -16, if /sys/block/sdX/device/queue_depth contains 32. - -3. Increase the corresponding timeout on the initiator. For Linux it is -located in -/sys/devices/platform/host*/session*/target*:0:0/*:0:0:1/timeout. It can -be done automatically by a udev rule. For instance, the following -rule will increase it to 300 seconds: - -SUBSYSTEM=="scsi", KERNEL=="[0-9]*:[0-9]*", ACTION=="add", ATTR{type}=="0|7|14", ATTR{timeout}="300" - -By default, this timeout is 30 or 60 seconds, depending on your distribution. - -4. Try to avoid such seek intensive workloads. - -5. Increase the speed of the target's backstorage. - -6. Implement QoS in SCST, so the queue depth on the target is -dynamically adjusted, hence the worst case latencies seen by initiators are -controlled. - -Next, consider the case of a too slow link between initiator and target, -when the initiator tries to simultaneously push N commands to the target -over it. In this case time to serve those commands, i.e.
send or receive -data for them over the link, can be more, than timeout for any single -command, hence one or more commands in the tail of the queue can not be -served on time less than the timeout, so the initiator will decide that -they are stuck on the target and will try to recover. - -To workaround/fix this issue in this case you can use ways 1, 2, 3 above -or (7): increase speed of the link between target and initiator. - -Note, that logged messages about QUEUE_FULL status are quite different -by nature. This is a normal work, just SCSI flow control in action. -Simply don't enable "mgmt_minor" logging level, or, alternatively, if -you are confident in the worst case performance of your back-end storage -or initiator-target link, you can increase SCST_MAX_TGT_DEV_COMMANDS in -scst_priv.h to 64. Usually initiators don't try to push more commands on -the target. - - -Credits -------- - -Thanks to: - - * Mark Buechler for a lot of useful - suggestions, bug reports and help in debugging. - - * Ming Zhang for fixes and comments. - - * Nathaniel Clark for fixes and comments. - - * Calvin Morrow for testing and useful - suggestions. - - * Hu Gang for the original version of the - LSI target driver. - - * Erik Habbinga for fixes and support - of the LSI target driver. - - * Ross S. W. Walker for BLOCKIO inspiration - and Vu Pham who implemented it for VDISK dev handler. - - * Alessandro Premoli for fixes - - * Nathan Bullock for fixes. - - * Terry Greeniaus for fixes. - - * Krzysztof Blaszkowski for many fixes and bug reports. - - * Jianxi Chen for fixing problem with - devices >2TB in size - - * Bart Van Assche for a lot of help - - * Daniel Debonzi for a big part of the - initial SCST sysfs tree implementation - - -Vladislav Bolkhovitin , http://scst.sourceforge.net