mirror of
https://github.com/SCST-project/scst.git
synced 2026-05-14 09:11:27 +00:00
git-svn-id: http://svn.code.sf.net/p/scst/svn/trunk@783 d57e44dd-8a1f-0410-8b47-8ef2f437770f
434 lines
20 KiB
HTML
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<meta name="author" content="Daniel Fernandes"/>
<meta name="Robots" content="index,follow" />
<link rel="stylesheet" href="images/Orange.css" type="text/css" />
<title>SCST Contributing</title>
</head>

<body>
<div id="wrap">
<div id="header">
<div class="logoimg"></div><h1 id="logo"><span class="orange"></span></h1>
<h2 id="slogan">SCSI Target Middle Level for Linux</h2>
</div>
<div id="menu">
<ul>
<li id="sponsorship"><a href="sponsorship.html">Sponsorship</a></li>
<li><a href="index.html">Home</a></li>
<li><a href="http://www.sourceforge.net/projects/scst">Main</a></li>
<li><a href="targets.html">Drivers</a></li>
<li><a href="downloads.html">Downloads</a></li>
<li id="current"><a href="contributing.html">Contributing</a></li>
<li><a href="comparison.html">Comparison</a></li>
</ul>
</div>
<div id="content-wrap">
<div id="main">
<h1>Contributing to SCST</h1>

<p>If you would like to contribute to SCST development, you can do so in many ways:</p>

<ul>
<li><span>By reporting bugs or other problems.</span></li>
<li><span>By writing or updating documentation to keep it complete and up to date.
For instance, the <a href="scst_pg.html">SCST internals description</a> document is
quite outdated in some areas. In particular, many functions have been renamed since
it was written. It would be good to bring it up to date.</span></li>
<li><span>By sending patches that fix bugs or implement new functionality.
See below for a list of possible SCST improvements with some
implementation ideas.</span></li>
<li><span>By sending donations. They will be spent on making SCST even better as well as on providing
better support and troubleshooting for you.</span></li>
</ul>

<h1>Possible SCST extensions and improvements</h1>

<A NAME="ZC_READ"></A><h3>Zero-copy FILEIO for READ-direction commands</h3>

<p>At the moment, SCST in FILEIO mode uses the standard Linux read() and write() syscall paths,
which copy data from the page cache to the supplied buffer and back. Zero-copy FILEIO
would use page cache data directly. This would be a major performance improvement,
especially for fast hardware like InfiniBand, because it would eliminate the data copy
latency and considerably ease the CPU and memory bandwidth load. This proposal is limited to
READs only, because zero-copy is a lot harder to implement for WRITEs,
so it is worth doing zero-copy for READs and WRITEs separately.</p>

<p>The main idea is to add one more flag, O_ZEROCOPY, to the "flags" parameter of filp_open()
(like O_RDONLY, O_DIRECT, etc.), which would be available
only if the caller is in kernel space. In this case fd->f_op->readv(),
do_sync_readv_writev(), etc. would receive, as the pointer to the data
buffer, not a real data buffer but a pointer to an empty SG vector. Then:</p>

<ul>
<li><span>Generic buffer allocation in SCST would not be used. Instead, vdisk_parse()
would allocate the SG vector, but wouldn't fill it with actual pages.</span></li>

<li><span>In generic_file_aio_read(), if the O_ZEROCOPY flag was set,
function do_generic_file_read() would be called with the last parameter set
to a pointer to a new function file_zero_copy_read_actor() instead of file_read_actor().</span></li>

<li><span>Function file_zero_copy_read_actor() would be basically the same as
file_read_actor(), but, instead of copying data using the __copy_to_user*() functions,
it would add the supplied page to the appropriate place in the SG vector received in
desc->arg.buf and take a reference on that page, i.e. get_page().</span></li>

<li><span>In vdisk_devtype.on_free_cmd(), which doesn't exist yet, all pages
from the SG vector would be dereferenced, i.e. put_page(). Then the SG vector itself
would be freed.</span></li>
</ul>
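<p>The reference counting in the steps above can be illustrated with a small user space model.
This is only a sketch, not kernel code: struct zc_page, struct zc_sg_vec, zc_read_actor() and
zc_on_free_cmd() are hypothetical names that stand in for page cache pages, the SG vector,
file_zero_copy_read_actor() and on_free_cmd() respectively.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a page cache page; only the refcount matters here. */
struct zc_page {
	int refcount;
};

#define ZC_MAX_SG 16

/* Hypothetical SG vector: allocated empty, filled in by the read actor. */
struct zc_sg_vec {
	struct zc_page *pages[ZC_MAX_SG];
	size_t count;
};

/*
 * Models file_zero_copy_read_actor(): instead of copying the page's data,
 * store the page in the SG vector and take a reference on it.
 * Returns 0 on success, -1 if the vector is full.
 */
static int zc_read_actor(struct zc_sg_vec *sg, struct zc_page *page)
{
	assert(sg != NULL && page != NULL);
	if (sg->count >= ZC_MAX_SG)
		return -1;
	page->refcount++;	/* the page must stay alive while the command is in flight */
	sg->pages[sg->count++] = page;
	return 0;
}

/* Models on_free_cmd(): drop every reference taken by the actor. */
static void zc_on_free_cmd(struct zc_sg_vec *sg)
{
	size_t i;

	assert(sg != NULL);
	for (i = 0; i < sg->count; i++) {
		sg->pages[i]->refcount--;
		sg->pages[i] = NULL;
	}
	sg->count = 0;
}
```

<p>In this model every page added by the actor gains exactly one reference, and freeing the command
returns each page to its previous reference count.</p>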
<p>That's all. For WRITEs the current code path would remain unchanged.</p>

<A NAME="ZC_WRITE"></A><h3>Zero-copy FILEIO for WRITE-direction commands</h3>

<p>The implementation should be similar to zero-copy FILEIO for READ commands and should
be done after it. All incoming data pages should be inserted into the page cache, then dereferenced in
vdisk_devtype.on_free_cmd(). The main problem is the insertion of data pages into the
page cache, namely the locking issues related to it. They should be carefully
investigated.</p>

<A NAME="PR"></A><h3>Persistent reservations</h3>

<p>Support for PERSISTENT RESERVE IN and PERSISTENT RESERVE OUT is required to
work in many cluster environments, e.g. Windows 2003 Cluster.</p>

<p>For the implementation you should use scst_reserve_local() and
scst_release_local() as a base. You should store all reservation keys
in files in /var/scst, one file per device
(this would eliminate the need for additional locking), e.g.
/var/scst/boot_disk for device "boot_disk", and load them into memory when
the device is registered.</p>

<p>In the first version this can be done for virtual
devices only, with PERSISTENT RESERVE IN and OUT commands for
pass-through devices rejected with "COMMAND NOT SUPPORTED" sense data.</p>
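<p>A minimal user space sketch of the one-file-per-device storage could look as follows.
The on-disk format (one 64-bit key per line, in hex) and the names pr_path(), pr_save_keys(),
pr_load_keys() and PR_MAX_KEYS are assumptions made for illustration only. A real implementation
would run in the kernel and would also have to store the reservation holder and type.</p>

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define PR_MAX_KEYS 32	/* hypothetical upper bound on registered keys */

/* Builds "<dir>/<dev_name>" in buf. Returns 0 on success, -1 if it doesn't fit. */
static int pr_path(char *buf, size_t len, const char *dir, const char *dev_name)
{
	int n = snprintf(buf, len, "%s/%s", dir, dev_name);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Writes the keys, one per line in hex, to the device's file. Returns 0 or -1. */
static int pr_save_keys(const char *dir, const char *dev_name,
			const uint64_t *keys, size_t n)
{
	char path[256];
	FILE *f;
	size_t i;

	assert(dir != NULL && dev_name != NULL && (keys != NULL || n == 0));
	if (pr_path(path, sizeof(path), dir, dev_name) != 0)
		return -1;
	f = fopen(path, "w");
	if (f == NULL)
		return -1;
	for (i = 0; i < n; i++)
		fprintf(f, "%016" PRIx64 "\n", keys[i]);
	return fclose(f) == 0 ? 0 : -1;
}

/* Loads up to max keys from the device's file. Returns the count or -1. */
static int pr_load_keys(const char *dir, const char *dev_name,
			uint64_t *keys, size_t max)
{
	char path[256];
	FILE *f;
	size_t n = 0;

	assert(dir != NULL && dev_name != NULL && keys != NULL);
	if (pr_path(path, sizeof(path), dir, dev_name) != 0)
		return -1;
	f = fopen(path, "r");
	if (f == NULL)
		return -1;
	while (n < max && fscanf(f, "%" SCNx64, &keys[n]) == 1)
		n++;
	fclose(f);
	return (int)n;
}
```

<p>With dir set to /var/scst and dev_name set to "boot_disk", the keys would end up in
/var/scst/boot_disk, as proposed above.</p>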
<A NAME="AUTO_SESS"></A><h3>Automatic sessions reassignment</h3>

<p>At the moment, if the security name for an initiator is reassigned (moved) to another security
group, the existing sessions from that initiator are not automatically reassigned to
the new security group, i.e. they remain in the old one. The only ways to reassign them
are to restart either the sessions or the corresponding target driver. In many
cases neither is an option.</p>

<p>To implement this, on any group change you should:</p>
<ul>
<li><span>Globally suspend all activities by scst_suspend_activity().</span></li>

<li><span>Go over all existing sessions. For each one, find the corresponding ACG
(see scst_init_session() as an example) and check if it's the same as the existing
one. If it's the same, then go to the next session. Otherwise, reassign
the session to the new ACG. For that you should go over all devices in the group/session
pair (tgt_dev's), delete the tgt_dev's that don't exist in the new ACG,
add the new ones and keep the existing ones.</span></li>

<li><span>Resume the activities.</span></li>
</ul>
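<p>The per-session work in the second step is essentially a set difference between the devices
of the old and the new ACG. The following user space sketch shows that comparison only. An ACG is
reduced to a plain list of device ids, and struct acg_model, struct reassign_plan and
plan_reassignment() are hypothetical names, not SCST structures or functions.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of an ACG: just a list of device ids. */
struct acg_model {
	const int *dev_ids;
	size_t count;
};

/* How many tgt_dev's a session would keep, gain and lose. */
struct reassign_plan {
	size_t kept;
	size_t added;
	size_t deleted;
};

static bool acg_has_dev(const struct acg_model *acg, int dev_id)
{
	size_t i;

	for (i = 0; i < acg->count; i++)
		if (acg->dev_ids[i] == dev_id)
			return true;
	return false;
}

static struct reassign_plan plan_reassignment(const struct acg_model *old_acg,
					       const struct acg_model *new_acg)
{
	struct reassign_plan plan = { 0, 0, 0 };
	size_t i;

	assert(old_acg != NULL && new_acg != NULL);
	if (old_acg == new_acg) {
		/* Same ACG: go to the next session, nothing changes. */
		plan.kept = old_acg->count;
		return plan;
	}
	/* Devices of the old ACG are either kept or deleted. */
	for (i = 0; i < old_acg->count; i++) {
		if (acg_has_dev(new_acg, old_acg->dev_ids[i]))
			plan.kept++;
		else
			plan.deleted++;
	}
	/* Devices only present in the new ACG are added. */
	for (i = 0; i < new_acg->count; i++)
		if (!acg_has_dev(old_acg, new_acg->dev_ids[i]))
			plan.added++;
	return plan;
}
```

<p>In SCST itself this comparison would have to run between scst_suspend_activity() and the
resume call, so that no command sees a half-updated session.</p>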
<A NAME="DYN_FLOW"></A><h3>Dynamic I/O flow control</h3>

<p>At the moment, if an initiator or several initiators simultaneously send too many commands
to the target, especially in seek-intensive workloads, the target can get
overloaded and become unable to finish commands on time. In such cases you can see
messages on the initiator(s) about aborting commands or resetting the target. See the section
"What if target's backstorage is too slow" in the SCST core README for more details.
To fix this problem it is necessary to implement dynamic I/O flow control in
SCST core.</p>

<p>The flow control is, in general, quite simple. Each SCST command has a timeout value,
which is set by the corresponding dev handler. SCST core should keep the device's queue depth
at a level such that the worst command execution time, i.e. the time between scst_rx_cmd()
and scst_finish_cmd(), would be between something like timeout/10 and timeout/5.
So, the command execution time should be checked and:</p>

<ul>
<li><span>If it's &gt; timeout/5, then the new queue depth should be set to max(1,
cur_depth/2).</span></li>

<li><span>If it's &lt; timeout/10, then the new queue depth should be set to min(MAX_DEPTH,
cur_depth+1). This shouldn't be done too often, once every few minutes should be
sufficient.</span></li>
</ul>
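<p>As a sketch, the simplified rule above fits in a single function. FC_SIMPLE_MAX_DEPTH
and fc_simple_adjust() are hypothetical names, and the value 64 is only a placeholder
for MAX_DEPTH.</p>

```c
#include <assert.h>

#define FC_SIMPLE_MAX_DEPTH 64u	/* assumption: stands in for MAX_DEPTH */

/*
 * Returns the new queue depth, given the worst command execution time
 * observed and the command timeout (both in the same time unit).
 */
static unsigned int fc_simple_adjust(unsigned int cur_depth,
				     unsigned long worst_exec_time,
				     unsigned long timeout)
{
	assert(cur_depth >= 1 && timeout > 0);

	if (worst_exec_time > timeout / 5) {
		/* Overloaded: halve the depth, but never go below 1. */
		return cur_depth / 2 > 1 ? cur_depth / 2 : 1;
	}
	if (worst_exec_time < timeout / 10) {
		/* Underloaded: grow by one, capped at the maximum. */
		return cur_depth + 1 < FC_SIMPLE_MAX_DEPTH ?
			cur_depth + 1 : FC_SIMPLE_MAX_DEPTH;
	}
	return cur_depth;	/* within the target band: leave as is */
}
```

<p>The decrease is multiplicative and the increase is additive, so the depth drops quickly under
overload and recovers slowly afterwards.</p>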
<p>The above is, of course, an oversimplification to let you see the idea.
An implementation that considers real-life cases should be as follows:</p>

<p>1. There are several parameters:</p>

<ul>
<li><span>P - load watch period. During this period all the statistics are
gathered and processed.</span></li>

<li><span>MN - underload ratio divisor, which sets the underload portion of
the timeout. If the longest execution time among all commands completed
during period P is below timeout/MN, the corresponding device is considered
underloaded.</span></li>

<li><span>MX - overload ratio divisor, which sets the overload portion of
the timeout. If the longest execution time among all commands completed
during period P is above timeout/MX, the corresponding device is considered
overloaded.</span></li>

<li><span>I - step by which the device's queue size will be increased if the device
is considered underloaded.</span></li>

<li><span>D - divisor by which the device's queue size will be decreased if the device
is considered overloaded.</span></li>

<li><span>QI - quick fall interval. See the description of the Q parameter.</span></li>

<li><span>Q - quick fall ratio divisor. If the execution time of a
completed command is above timeout/Q and the time since the previous quick
fall is greater than QI, the corresponding device is considered heavily
overloaded. The quick fall is needed to handle cases where the load on the device
suddenly increases to a level it can't handle properly.</span></li>

<li><span>QD - divisor by which the device's queue size will be decreased if
the device is considered heavily overloaded.</span></li>
</ul>

<p>The default values should be something like: P=15 sec., MN=20, MX=10, Q=3,
I=1, D=2, QI=5 sec., QD=10.</p>

<p>2. There are the following new variables in struct scst_device:</p>

<ul>
<li><span>queue_depth - current queue depth.</span></li>

<li><span>max_exec_ratio - maximum, over the commands completed in the current period, of
(execution time)*100/timeout.</span></li>

<li><span>queue_was_full - flag marking that the queue was full at least once
during period P.</span></li>

<li><span>quick_fall_time - time of the last quick fall.</span></li>

<li><span>flow_lock - protects flow control related variables, where needed.</span></li>

<li><span>...</span></li>
</ul>

<p>3. The command processing path should be as follows:</p>

<ul>
<li><span>In scst_rx_cmd() the start time of the command is recorded (already done).</span></li>

<li><span>In __scst_init_cmd(), if dev->dev_cmd_count == dev->queue_depth,
dev->queue_was_full is set to true.</span></li>

<li><span>In scst_finish_cmd() dev->max_exec_ratio is set to max(dev->max_exec_ratio,
(cmd's exec_time)*100/cmd->timeout).</span></li>

<li><span>If in scst_finish_cmd() the cmd's exec time is above cmd->timeout/Q and
the time since the latest quick fall is above QI, then:

<ul>
<li><span>dev->queue_depth is set to max(1, dev->queue_depth/QD).</span></li>

<li><span>The flow control period is reset, i.e. started again, including setting
dev->max_exec_ratio to 0 and dev->quick_fall_time to jiffies.</span></li>
</ul>
</span></li>
</ul>

<p>4. There should be a work item which, once every P seconds, checks
dev->max_exec_ratio, then:</p>

<ul>
<li><span>If the device is neither underloaded nor overloaded, i.e. max_exec_ratio is
between the limits defined by MN and MX, do nothing.</span></li>

<li><span>If the device was underloaded:

<ul>
<li><span>if dev->queue_was_full is false, then do nothing.</span></li>

<li><span>if dev->queue_was_full is true, then set dev->queue_depth to
min(SCST_MAX_DEV_COMMANDS, dev->queue_depth + I).</span></li>
</ul>
</span></li>

<li><span>If the device was overloaded, then set dev->queue_depth to max(1,
dev->queue_depth/D).</span></li>
</ul>

<p>Then the flow control period is reset, i.e. started again, including
setting dev->max_exec_ratio to 0 and dev->quick_fall_time to jiffies.</p>
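<p>Steps 1 to 4 can be put together in a small user space model. It is only a sketch under several
assumptions: time is a plain counter in seconds instead of jiffies, locking (flow_lock) is left out,
the value 256 is a placeholder for SCST_MAX_DEV_COMMANDS, and resetting queue_was_full at the start
of each period is not stated above but assumed here. All fc_* names are hypothetical.</p>

```c
#include <assert.h>
#include <stdbool.h>

/* Default parameters from the text; times are in seconds in this model. */
#define FC_MN 20u
#define FC_MX 10u
#define FC_Q  3ul
#define FC_I  1u
#define FC_D  2u
#define FC_QI 5ul
#define FC_QD 10u
#define FC_MAX_DEV_COMMANDS 256u /* assumption: stands in for SCST_MAX_DEV_COMMANDS */

struct fc_dev {
	unsigned int queue_depth;
	unsigned int max_exec_ratio;	/* worst exec_time * 100 / timeout this period */
	bool queue_was_full;
	unsigned long quick_fall_time;
};

static unsigned int fc_max_u(unsigned int a, unsigned int b) { return a > b ? a : b; }
static unsigned int fc_min_u(unsigned int a, unsigned int b) { return a < b ? a : b; }

/* Starts a new flow control period. */
static void fc_reset_period(struct fc_dev *dev, unsigned long now)
{
	dev->max_exec_ratio = 0;
	dev->queue_was_full = false;	/* assumption: not spelled out in the text */
	dev->quick_fall_time = now;
}

/* Step 3, __scst_init_cmd(): note when the queue is full. */
static void fc_cmd_init(struct fc_dev *dev, unsigned int dev_cmd_count)
{
	if (dev_cmd_count == dev->queue_depth)
		dev->queue_was_full = true;
}

/* Step 3, scst_finish_cmd(): track the worst ratio and do a quick fall if needed. */
static void fc_cmd_finish(struct fc_dev *dev, unsigned long exec_time,
			  unsigned long timeout, unsigned long now)
{
	assert(timeout > 0);
	dev->max_exec_ratio = fc_max_u(dev->max_exec_ratio,
				       (unsigned int)(exec_time * 100 / timeout));
	if (exec_time > timeout / FC_Q && now - dev->quick_fall_time > FC_QI) {
		dev->queue_depth = fc_max_u(1, dev->queue_depth / FC_QD);
		fc_reset_period(dev, now);
	}
}

/* Step 4, periodic work run once every P seconds. */
static void fc_period_check(struct fc_dev *dev, unsigned long now)
{
	if (dev->max_exec_ratio > 100 / FC_MX)
		dev->queue_depth = fc_max_u(1, dev->queue_depth / FC_D);
	else if (dev->max_exec_ratio < 100 / FC_MN && dev->queue_was_full)
		dev->queue_depth = fc_min_u(FC_MAX_DEV_COMMANDS,
					    dev->queue_depth + FC_I);
	fc_reset_period(dev, now);
}
```

<p>Since max_exec_ratio is stored as a percentage of the timeout, "below timeout/MN" becomes
"below 100/MN" and "above timeout/MX" becomes "above 100/MX" in the periodic check.</p>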
<p>That's all. After that, only support for initiators, like iSCSI ones,
which don't react to QUEUE FULL by decreasing the number of queued
commands remains to be added. Such initiators expect the target to control the size of
the queue, e.g. through MAX_SN for iSCSI.</p>

<p>For this, at stage 2 of the dynamic flow control development
the following should be done:</p>

<ul>
<li><span>A new callback on_queue_depth_adjustment() should be added to struct
scst_tgt_template.</span></li>

<li><span>If the target driver defined it, on_queue_depth_adjustment() should be called
each time after dev->queue_depth has changed. In this callback the target
driver should change its internal queue_depth so that, e.g. for an iSCSI target,
max_sn in the replies is set correctly.</span></li>
</ul>
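<p>The callback could be wired up roughly as in the following user space sketch. The callback
signature is not defined above, so the one used here (a private pointer plus the new depth) is an
assumption, as are all the *_model names. The MaxCmdSN formula follows the iSCSI rule that the
command window spans ExpCmdSN to MaxCmdSN inclusive.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-target state of an iSCSI-like target driver. */
struct iscsi_tgt_model {
	uint32_t exp_cmd_sn;	/* next CmdSN expected from the initiator */
	uint32_t queue_depth;	/* driver's internal copy of the queue depth */
};

/* Hypothetical cut-down target template with the proposed callback. */
struct tgt_template_model {
	void (*on_queue_depth_adjustment)(void *tgt_priv, uint32_t new_depth);
};

/* Driver side: remember the new depth for use in later replies. */
static void iscsi_on_queue_depth_adjustment(void *tgt_priv, uint32_t new_depth)
{
	struct iscsi_tgt_model *tgt = tgt_priv;

	tgt->queue_depth = new_depth;
}

/* MaxCmdSN to advertise in a reply: the window holds queue_depth commands. */
static uint32_t iscsi_max_cmd_sn(const struct iscsi_tgt_model *tgt)
{
	return tgt->exp_cmd_sn + tgt->queue_depth - 1;
}

/* Core side: call the callback only if the target driver defined it. */
static void core_queue_depth_changed(const struct tgt_template_model *tgtt,
				     void *tgt_priv, uint32_t new_depth)
{
	assert(tgtt != NULL && new_depth >= 1);
	if (tgtt->on_queue_depth_adjustment != NULL)
		tgtt->on_queue_depth_adjustment(tgt_priv, new_depth);
}
```

<p>Drivers whose initiators handle QUEUE FULL properly would simply leave the callback unset.</p>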
<p>Then, at the last stage of the development, logic to avoid scheduling the
flow control work on idle devices should be added.</p>

<A NAME="O_DIRECT"></A><h3>Support for O_DIRECT in scst_vdisk handler</h3>

<p>At the moment, the scst_vdisk handler doesn't support the O_DIRECT option, and the possibility to set it
has been disabled. This limitation is caused by the Linux kernel's expectation that memory supplied to the
read() and write() functions with the O_DIRECT flag is mapped into some user space application.</p>

<p>It is relatively easy to remove that limitation. Function dio_refill_pages()
should be modified to check, before calling get_user_pages(), whether current->mm is not NULL.
If it is NULL, then, instead of calling get_user_pages(), dio->pages should be filled
with pages taken directly from dio->curr_user_address. Each such page should be referenced
by page_cache_get(). That's all.</p>

<A NAME="VDISK_REFACTOR"></A><h3>Refactoring of command execution path in scst_vdisk handler</h3>

<p>At the moment, the command execution function vdisk_do_job() in the scst_vdisk handler is
overcomplicated and not very efficient. It would be good to replace all those
ugly "switch" statements by choosing the handler for each SCSI command via an indirect
function call through an array of function pointers.</p>

<p>I.e., there should be an array vdisk_exec_fns with 256 entries of function pointers:</p>

<p>int (*cmd_exec_fn)(struct scst_cmd *cmd)</p>

<p>Then vdisk_do_job() should look like</p>

<listing><p>static int vdisk_do_job(struct scst_cmd *cmd)
{
	return vdisk_exec_fns[cmd->cdb[0]](cmd);
}</p></listing>
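<p>A self-contained user space sketch of such a dispatch table is shown below. It assumes that the
handlers return int, and struct cmd_model, the handler names and the fallback for unset opcodes are
hypothetical. Only the opcode values 0x28 (READ(10)) and 0x2A (WRITE(10)) come from the SCSI
standards.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum handler_id { H_NONE, H_READ10, H_WRITE10, H_INVALID };

/* Hypothetical stand-in for struct scst_cmd. */
struct cmd_model {
	uint8_t cdb[16];
	enum handler_id handled_by;
};

typedef int (*cmd_exec_fn)(struct cmd_model *cmd);

static int exec_read10(struct cmd_model *cmd)
{
	cmd->handled_by = H_READ10;
	return 0;
}

static int exec_write10(struct cmd_model *cmd)
{
	cmd->handled_by = H_WRITE10;
	return 0;
}

/* Fallback for opcodes that have no handler. */
static int exec_invalid_opcode(struct cmd_model *cmd)
{
	cmd->handled_by = H_INVALID;
	return -1;
}

/* One slot per possible value of cdb[0]; unset slots stay NULL. */
static const cmd_exec_fn exec_fns_model[256] = {
	[0x28] = exec_read10,	/* READ(10) */
	[0x2A] = exec_write10,	/* WRITE(10) */
};

static int do_job_model(struct cmd_model *cmd)
{
	cmd_exec_fn fn;

	assert(cmd != NULL);
	fn = exec_fns_model[cmd->cdb[0]];
	return fn != NULL ? fn(cmd) : exec_invalid_opcode(cmd);
}
```

<p>Filling every unused slot with the fallback handler at initialization time would remove the NULL
check from the hot path, which is closer to the one-line vdisk_do_job() shown above.</p>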
<A NAME="SG_LIMIT"></A><h3>Solve SG IO count limitation issue in pass-through mode</h3>

<p>In the pass-through mode (i.e. using the pass-through device handlers
scst_disk, scst_tape, etc.), SCSI commands coming from remote initiators
are passed to the local SCSI hardware on the target as is, without any
modifications. Like any other hardware, the local SCSI hardware cannot
handle commands whose amount of data and/or number of segments in the
scatter-gather array exceeds certain limits. If you have this issue, you will see
symptoms such as small transfers working well but large ones stalling, and
messages like "Unable to complete command due to SG IO count
limitation" printed in the kernel logs.</p>

<p>In <a href="sgv_big_order_alloc.diff">sgv_big_order_alloc.diff</a> you
can find a possible way to solve this issue.</p>

<A NAME="MEM_REG"></A><h3>Memory registration</h3>

<p>In some cases a target driver might need to register the memory used for data buffers in the
hardware. At the moment, none of the SCST target drivers, including the InfiniBand SRP target driver,
needs that feature. But if such a need arises in the future, it can easily be
added by extending the SCST SGV cache. The SCST SGV cache is a memory management
subsystem in SCST. It doesn't immediately return each data buffer that is no longer used
to the system, but keeps it for a while so that it can be reused by a
subsequent command, which reduces command processing latency and, hence, improves performance.</p>

<p>To support memory buffer registration, it can be extended in the following way:</p>

<p>1. Struct scst_tgt_template would be extended to have 2 new callbacks:</p>

<ul>

<li><span>int register_buffer(struct scst_cmd *cmd)</span></li>

<li><span>int unregister_buffer(unsigned long mem_priv, void *scst_priv)</span></li>

</ul>

<p>2. SCST core would be extended to have 4 new functions:</p>

<ul>

<li><span>int scst_mem_registered(struct scst_cmd *cmd)</span></li>

<li><span>int scst_mem_deregistered(void *scst_priv)</span></li>

<li><span>int scst_set_mem_priv(struct scst_cmd *cmd, unsigned long mem_priv)</span></li>

<li><span>unsigned long scst_get_mem_priv(struct scst_cmd *cmd)</span></li>

</ul>

<p>3. The workflow would be the following:</p>

<ol>
<li><span>If the target driver defined the register_buffer() and unregister_buffer() callbacks,
SCST core would allocate a dedicated SGV cache for each instance of struct scst_tgt,
i.e. target.</span></li>

<li><span>When there is an SGV cache miss during allocation of a memory buffer for a command,
SCST would check if the register_buffer() callback was defined in the target driver's template
and, if yes, would call it.</span></li>

<li><span>In the register_buffer() callback the target driver would do the necessary actions to
start registration of the command's memory buffer.</span></li>

<li><span>Upon return of the register_buffer() callback, SCST core would suspend processing of the
corresponding command and would switch to processing the next commands.</span></li>

<li><span>After the memory registration has finished, the target driver would call scst_set_mem_priv()
to associate the memory buffer with some internal data.</span></li>

<li><span>Then the target driver would call scst_mem_registered() and SCST would resume processing
the command. Functions scst_set_mem_priv() and scst_mem_registered() can be called from inside register_buffer().
In this case SCST core would continue processing the command immediately without suspending.</span></li>

<li><span>After the command has finished, the corresponding memory buffer would remain in the
SGV cache in the registered state and would be reused by subsequent commands. For each of them
the target driver can at any time retrieve the data associated with the registered buffer
by using scst_get_mem_priv().</span></li>

<li><span>When the SGV cache decides that it is time to free the memory buffer, it would
call the target driver's unregister_buffer() callback.</span></li>

<li><span>In this callback the target driver would do the necessary actions to start deregistration of the
command's memory buffer.</span></li>

<li><span>Upon return of the unregister_buffer() callback, the SGV cache would suspend freeing of the corresponding buffer
and would switch to its other work.</span></li>

<li><span>After the memory deregistration has finished, the target driver would call scst_mem_deregistered()
and pass to it the scst_priv pointer received in unregister_buffer(). Then the memory buffer
would be freed by the SGV cache. Function scst_mem_deregistered() can be called from inside unregister_buffer().
In this case the SGV cache would free the buffer immediately without suspending.
</span></li>
</ol>
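<p>The registration half of this workflow (steps 2 to 6) can be modelled in user space as a small
state machine. This is a sketch only: the model works on a buffer instead of struct scst_cmd, the
return value of register_buffer() is ignored because its meaning is not defined above, and all
*_model names and the two example callbacks are hypothetical.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum buf_state { BUF_UNREGISTERED, BUF_REGISTERING, BUF_REGISTERED };

/* Hypothetical stand-in for a data buffer managed by the SGV cache. */
struct buf_model {
	enum buf_state state;
	unsigned long mem_priv;	/* driver data attached to the buffer */
};

/* Hypothetical cut-down target template. */
struct tgtt_model {
	int (*register_buffer)(struct buf_model *buf);
};

/* Models scst_set_mem_priv() and scst_get_mem_priv(). */
static void model_set_mem_priv(struct buf_model *buf, unsigned long mem_priv)
{
	buf->mem_priv = mem_priv;
}

static unsigned long model_get_mem_priv(const struct buf_model *buf)
{
	return buf->mem_priv;
}

/* Models scst_mem_registered(): registration is complete. */
static void model_mem_registered(struct buf_model *buf)
{
	assert(buf->state == BUF_REGISTERING);
	buf->state = BUF_REGISTERED;
}

/*
 * SGV cache miss path. Returns true if the command can continue right away,
 * false if it has to be suspended until model_mem_registered() is called.
 */
static bool model_on_cache_miss(const struct tgtt_model *tgtt,
				struct buf_model *buf)
{
	if (tgtt->register_buffer == NULL)
		return true;	/* driver doesn't need registration */
	buf->state = BUF_REGISTERING;
	(void)tgtt->register_buffer(buf);	/* error handling omitted */
	return buf->state == BUF_REGISTERED;
}

/* Example driver that completes registration inside the callback. */
static int sync_register_buffer(struct buf_model *buf)
{
	model_set_mem_priv(buf, 0x1234);
	model_mem_registered(buf);
	return 0;
}

/* Example driver that only starts registration and finishes it later. */
static int async_register_buffer(struct buf_model *buf)
{
	(void)buf;
	return 0;
}
```

<p>The deregistration half (steps 8 to 11) would mirror this with unregister_buffer() and
scst_mem_deregistered().</p>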
<A NAME="NON_SCSI_TGT"></A><h3>SCST usage with non-SCSI transports</h3>

<p>SCST might also be used with non-SCSI speaking transports, like NBD or AoE. Such cooperation
would allow them to use SCST-emulated backends.</p>

<p>For user space targets this is trivial: they should simply use SCST-emulated devices locally
via the scst_local module.</p>

<p>For in-kernel non-SCSI target drivers it's a bit more complicated. They should implement a small layer
which would translate their internal READ/WRITE requests to the corresponding SCSI commands and, on the
way back, SCSI status and sense codes to their internal status codes.</p>
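<p>A sketch of such a translation layer for the simplest case is given below. It builds READ(10) and
WRITE(10) CDBs and maps a few SCSI status codes to errno-style results. The function names and the
choice of errno values are assumptions, and sense data decoding is left out. The opcodes, the CDB
layout and the status values are those defined by the SCSI standards.</p>

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

enum blk_op { BLK_READ, BLK_WRITE };

/*
 * Builds a 10-byte READ(10) or WRITE(10) CDB: opcode in byte 0,
 * big-endian LBA in bytes 2..5, big-endian block count in bytes 7..8.
 */
static void build_rw10_cdb(uint8_t cdb[10], enum blk_op op,
			   uint32_t lba, uint16_t nblocks)
{
	assert(cdb != NULL);
	memset(cdb, 0, 10);
	cdb[0] = (op == BLK_READ) ? 0x28 : 0x2A;
	cdb[2] = (uint8_t)(lba >> 24);
	cdb[3] = (uint8_t)(lba >> 16);
	cdb[4] = (uint8_t)(lba >> 8);
	cdb[5] = (uint8_t)lba;
	cdb[7] = (uint8_t)(nblocks >> 8);
	cdb[8] = (uint8_t)nblocks;
}

/* Maps a SCSI status byte to 0 or a negative errno-style code. */
static int scsi_status_to_result(uint8_t status)
{
	switch (status) {
	case 0x00:	/* GOOD */
		return 0;
	case 0x08:	/* BUSY */
	case 0x28:	/* TASK SET FULL */
		return -EBUSY;
	case 0x18:	/* RESERVATION CONFLICT */
		return -EACCES;
	default:	/* CHECK CONDITION and everything else */
		return -EIO;
	}
}
```

<p>Requests that don't fit into a 32-bit LBA or a 16-bit block count would need READ(16) and
WRITE(16) instead, which this sketch doesn't cover.</p>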
</div>
</div>
</div>
<!-- wrap ends here -->

<!-- footer starts here -->
<div id="footer">
<p>
© Copyright 2008 <b><font color="#EC981F">Vladislav Bolkhovitin &amp; others.</font></b>
Design by: <b><font color="#EC981F">Daniel Fernandes</font></b>

</p>
</div>
<!-- footer ends here -->
</body>
</html>