mirror of
https://github.com/SCST-project/scst.git
synced 2026-05-14 09:11:27 +00:00
git-svn-id: http://svn.code.sf.net/p/scst/svn/trunk@2037 d57e44dd-8a1f-0410-8b47-8ef2f437770f
462 lines
22 KiB
HTML
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
|
|
<html>
|
|
<head>
|
|
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
|
|
<meta name="author" content="Daniel Fernandes">
|
|
<meta name="Robots" content="index,follow">
|
|
<link rel="stylesheet" href="images/Orange.css" type="text/css">
|
|
<title>SCST Contributing</title>
|
|
</head>
|
|
|
|
<body>
|
|
<div id="wrap">
|
|
<div id="header">
|
|
<div class="logoimg"></div><h1 id="logo"><span class="orange"></span></h1>
|
|
<h2 id=slogan>Generic SCSI Target Subsystem for Linux</h2>
|
|
</div>
|
|
<div id="menu">
|
|
<ul>
|
|
<li><a href="index.html">Home</a></li>
|
|
<li><a href="http://www.sourceforge.net/projects/scst">Main</a></li>
|
|
<li><a href="http://sourceforge.net/news/?group_id=110471">News</a></li>
|
|
<li><a href="targets.html">Drivers</a></li>
|
|
<li><a href="downloads.html">Downloads</a></li>
|
|
<li id="current"><a href="contributing.html">Contributing</a></li>
|
|
<li><a href="comparison.html">Comparison</a></li>
|
|
<li><a href="users.html">Users</a></li>
|
|
<li><a href="solutions.html">Solutions</a></li>
|
|
</ul>
|
|
</div>
|
|
<div id="content-wrap">
|
|
<div id="main">
|
|
<h1>Contributing to SCST</h1>
|
|
|
|
<p>If you would like to contribute to SCST development, you can do so in several ways:</p>
|
|
|
|
<ul>
|
|
<li><span>By sending donations. They will be spent on further work to make SCST better, as well as on providing
better support and troubleshooting for you. Donations can be one-time or recurring,
from companies or individuals.</span></li>
|
|
<li><span>By sending patches that fix bugs or implement new functionality.
A list of possible SCST improvements, with some
implementation ideas, is given below.</span></li>
|
|
<li><span>By writing or updating documentation to keep it complete and up to date.
For instance, the <a href="scst_pg.html">SCST internals description</a> document is
quite outdated in some areas. In particular, many functions have been renamed since
it was written. It would be good to bring it up to date.</span></li>
|
|
<li><span>By reporting bugs or other problems.</span></li>
|
|
</ul>
|
|
|
|
<h1>Possible SCST extensions and improvements</h1>
|
|
|
|
<A NAME="ZC_READ"></A><h3>Zero-copy FILEIO for READ-direction commands</h3>
|
|
|
|
<p>At the moment, SCST in FILEIO mode uses the standard Linux read() and write() syscall paths,
which copy data from the page cache to the supplied buffer and back. Zero-copy FILEIO
would use page cache data directly. This would be a major performance improvement,
especially for fast hardware like InfiniBand, because it would eliminate the data copy
latency and considerably reduce the CPU and memory bandwidth load. This proposal is limited to
READs only, because zero-copy is a lot harder to implement for WRITEs,
so it is worth doing zero-copy for READs and WRITEs separately.</p>
|
|
|
|
<p>The main idea is to add one more flag, O_ZEROCOPY, to the "flags" parameter of filp_open()
(alongside O_RDONLY, O_DIRECT, etc.). It would be available
only to callers in kernel space. In this case fd->f_op->readv(),
do_sync_readv_writev(), etc. would receive, instead of a pointer to a real data
buffer, a pointer to an empty SG vector. Then:</p>
|
|
|
|
<ul>
|
|
<li><span>Generic buffer allocation in SCST would not be used. Instead, vdisk_parse()
would allocate the SG vector, but wouldn't fill it with actual pages.</span></li>
|
|
|
|
<li><span>In generic_file_aio_read(), if the O_ZEROCOPY flag is set,
do_generic_file_read() would be called with its last parameter set
to a pointer to a new function, file_zero_copy_read_actor(), instead of file_read_actor().</span></li>
|
|
|
|
<li><span>Function file_zero_copy_read_actor() would be basically the same as
file_read_actor(), but, instead of copying data using the __copy_to_user*() functions,
it would add the supplied page to the appropriate place in the SG vector received in
desc->arg.buf and take a reference on that page, i.e. call get_page().</span></li>
|
|
|
|
<li><span>In vdisk_devtype.on_free_cmd(), which doesn't exist yet, the references to all pages
in the SG vector would be dropped, i.e. put_page() would be called for each of them. Then the SG vector itself
would be freed.</span></li>
|
|
</ul>
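<p>The reference counting in the steps above can be sketched in user space as follows. This is only a toy model: struct page, struct sg_vec, zero_copy_read_actor() and zero_copy_free_cmd() are hypothetical stand-ins invented for this illustration, not kernel or SCST definitions, and the fixed size of 8 entries is an assumption.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a page cache page; only the reference count is modeled. */
struct page {
    int refcount;
};

/* Toy SG vector: allocated empty, then filled with shared pages. */
struct sg_vec {
    struct page *pages[8];
    size_t count;
};

/* Model of the proposed read actor: share the page instead of copying it. */
int zero_copy_read_actor(struct sg_vec *sg, struct page *pg)
{
    assert(sg != NULL && pg != NULL);

    if (sg->count >= sizeof(sg->pages) / sizeof(sg->pages[0]))
        return -1;              /* SG vector is full */

    pg->refcount++;             /* the kernel would take a page reference here */
    sg->pages[sg->count++] = pg;
    return 0;
}

/* Model of the proposed on_free_cmd(): drop every reference taken above. */
void zero_copy_free_cmd(struct sg_vec *sg)
{
    assert(sg != NULL);

    for (size_t i = 0; i < sg->count; i++)
        sg->pages[i]->refcount--;   /* the kernel would release the reference */
    sg->count = 0;
}
```

<p>The point of the model is that every page placed into the SG vector holds exactly one extra reference until the command is freed, so the page cache cannot reclaim it while a target driver is still sending it.</p>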
|
|
|
|
<p>That's all. For WRITEs the current code path would remain unchanged.</p>
|
|
|
|
<A NAME="ZC_WRITE"></A><h3>Zero-copy FILEIO for WRITE-direction commands</h3>
|
|
|
|
<p>The implementation should be similar to zero-copy FILEIO for READ commands and should
be done after it. All incoming data pages should be inserted into the page cache, then released in
vdisk_devtype.on_free_cmd(). The main problem is the insertion of data pages into the
page cache, namely the locking issues related to it. They should be carefully
investigated.</p>
|
|
|
|
<A NAME="DYN_FLOW"></A><h3>Dynamic I/O flow control</h3>
|
|
|
|
<p>At the moment, if one or several initiators simultaneously send
too many commands to the target, especially in seek-intensive workloads, the target can get
overloaded and be unable to finish commands on time. In such cases you can see
messages on the initiator(s) about aborting commands or resetting the target. See the section
"What if target's backstorage is too slow" in the SCST core README for more details.
To fix this problem it is necessary to implement dynamic I/O flow control in
the SCST core.</p>
|
|
|
|
<p>The flow control is, in general, quite simple. Each SCST command has a timeout value,
which is set by the corresponding dev handler. The SCST core should keep the device's queue depth
at a level where the worst command execution time, i.e. the time between scst_rx_cmd()
and scst_finish_cmd(), is between roughly timeout/10 and timeout/5.
So, the command execution time should be checked and:</p>
|
|
|
|
<ul>
|
|
<li><span>If it's &gt; timeout/5, then the new queue depth should be set to max(1,
cur_depth/2).</span></li>

<li><span>If it's &lt; timeout/10, then the new queue depth should be set to min(MAX_DEPTH,
cur_depth+1). This shouldn't be done too often; once every few minutes should be
sufficient.</span></li>
|
|
</ul>
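<p>A minimal user-space sketch of this simplified rule is shown below. The function name adjust_queue_depth() and the value chosen for MAX_DEPTH are assumptions made for the example; the rate limiting of increases ("once every few minutes") is left out.</p>

```c
#include <assert.h>

/* Hypothetical upper bound; the text only names MAX_DEPTH without a value. */
#define MAX_DEPTH 64

/*
 * Return the new queue depth for a device, given the worst command
 * execution time observed and the command timeout (same time unit).
 */
int adjust_queue_depth(int cur_depth, long worst_exec_time, long timeout)
{
    assert(cur_depth >= 1 && timeout > 0);

    if (worst_exec_time > timeout / 5) {
        /* Overloaded: halve the depth, but never go below 1. */
        int d = cur_depth / 2;
        return d > 1 ? d : 1;
    }
    if (worst_exec_time < timeout / 10) {
        /* Underloaded: grow by one, capped at MAX_DEPTH. */
        int d = cur_depth + 1;
        return d < MAX_DEPTH ? d : MAX_DEPTH;
    }
    /* Inside the target band: leave the depth unchanged. */
    return cur_depth;
}
```

<p>The decrease is multiplicative and the increase is additive, so the depth backs off quickly under overload and recovers slowly.</p>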
|
|
|
|
<p>The above is, of course, an oversimplification to let you see the idea.
An implementation that handles real-life cases should work as follows:</p>
|
|
|
|
<p>1. There are several parameters:</p>
|
|
|
|
<ul>
|
|
<li><span>P - load watch period. During this period all statistics are
gathered and processed.</span></li>

<li><span>MN - underload ratio divisor, which sets the underload portion of
timeout. If the longest execution time among all commands completed
during period P is below timeout/MN, the corresponding device is considered
underloaded.</span></li>

<li><span>MX - overload ratio divisor, which sets the overload portion of
timeout. If the longest execution time among all commands completed
during period P is above timeout/MX, the corresponding device is considered
overloaded.</span></li>

<li><span>I - step by which the device's queue size will be increased if the device
is considered underloaded.</span></li>

<li><span>D - divisor by which the device's queue size will be decreased if the device
is considered overloaded.</span></li>
|
|
|
|
<li><span>QI - quick fall interval. See description of Q parameter.</span></li>
|
|
|
|
<li><span>Q - quick fall ratio divisor. If the longest execution time of a
completed command is above timeout/Q and the time since the previous quick
fall is greater than QI, the corresponding device is considered heavily
overloaded. The quick fall is needed to handle cases when the load on a device
suddenly increases to a level that it can't handle properly.</span></li>
|
|
|
|
<li><span>QD - divisor by which the device's queue size will be decreased if
the device is considered heavily overloaded.</span></li>
|
|
</ul>
|
|
|
|
<p>The default values should be something like: P=15 sec., MN=20, MX=10, Q=3,
|
|
I=1, D=2, QI=5 sec., QD=10.</p>
|
|
|
|
<p>2. There are the following new variables in struct scst_device:</p>
|
|
|
|
<ul>
|
|
<li><span>queue_depth - current queue depth.</span></li>
|
|
|
|
<li><span>max_exec_ratio - maximum (execution time)*100/timeout among the commands completed during period P.</span></li>
|
|
|
|
<li><span>queue_was_full - flag, marking that the queue was at least once full
|
|
during period P.</span></li>
|
|
|
|
<li><span>quick_fall_time - time of the last quick fall.</span></li>
|
|
|
|
<li><span>flow_lock - protects flow control related variables, where needed.</span></li>
|
|
|
|
<li><span>...</span></li>
|
|
</ul>
|
|
|
|
<p>3. The commands processing path should be as the following:</p>
|
|
|
|
<ul>
|
|
<li><span>In scst_rx_cmd() the start time of the command is recorded (already done).</span></li>
|
|
|
|
<li><span>In __scst_init_cmd(), if dev->dev_cmd_count == dev->queue_depth,
dev->queue_was_full is set to true.</span></li>

<li><span>In scst_finish_cmd(), dev->max_exec_ratio is set to max(dev->max_exec_ratio,
(cmd's exec_time)*100/cmd->timeout).</span></li>
|
|
|
|
<li><font color="#666666">If in scst_finish_cmd() cmd's exec time is above cmd->timeout/Q and
|
|
time from the latest quick fall is above QI, then:</font>
|
|
|
|
<ul>
|
|
<li><span>dev->queue_depth is set to max(1, dev->queue_depth/QD).</span></li>

<li><span>The flow control period is reset, i.e. started again, including setting
dev->max_exec_ratio to 0 and dev->quick_fall_time to jiffies.</span></li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
|
|
<p>4. There should be a work item that checks
dev->max_exec_ratio once every P seconds, then:</p>
|
|
|
|
<ul>
|
|
<li><span>If the device is neither underloaded nor overloaded, i.e. max_exec_ratio
is between the thresholds defined by MN and MX, do nothing.</span></li>
|
|
|
|
<li><font color="#666666">If device was underloaded:</font>
|
|
|
|
<ul>
|
|
<li><span>if dev->queue_was_full is false, then do nothing.</span></li>
|
|
|
|
<li><span>if dev->queue_was_full is true, then set dev->queue_depth to
|
|
min(SCST_MAX_DEV_COMMANDS, dev->queue_depth + I).</span></li>
|
|
</ul>
|
|
</li>
|
|
|
|
<li><span>If device was overloaded, then set dev->queue_depth to max(1,
|
|
dev->queue_depth/D).</span></li>
|
|
</ul>
|
|
|
|
<p>Then the flow control period is reset, i.e. started again, including
|
|
setting dev->max_exec_ratio to 0 and dev->quick_fall_time to jiffies.</p>
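<p>Steps 1 to 4 can be modeled in user space roughly as follows. This is a sketch under stated assumptions, not SCST code: struct fc_dev and the fc_* names are hypothetical, the value of FC_MAX_DEPTH merely stands in for SCST_MAX_DEV_COMMANDS, time is passed in explicitly (in seconds) instead of using jiffies, locking is omitted, and clearing queue_was_full at each period reset is an assumption that the text does not spell out.</p>

```c
#include <assert.h>
#include <stdbool.h>

/* Tunables from step 1, using the suggested defaults (times in seconds). */
enum { FC_MN = 20, FC_MX = 10, FC_Q = 3, FC_I = 1, FC_D = 2, FC_QI = 5, FC_QD = 10 };

/* Stand-in for SCST_MAX_DEV_COMMANDS; the value is an assumption. */
#define FC_MAX_DEPTH 256

/* Model of the flow control fields proposed for struct scst_device. */
struct fc_dev {
    int queue_depth;
    int max_exec_ratio;   /* worst exec_time * 100 / timeout in this period */
    bool queue_was_full;
    long quick_fall_time; /* time of the last quick fall or period reset */
};

static void fc_reset_period(struct fc_dev *dev, long now)
{
    dev->max_exec_ratio = 0;
    dev->queue_was_full = false;  /* assumption: cleared for the new period */
    dev->quick_fall_time = now;
}

/* Step 3: bookkeeping done when a command finishes. */
void fc_cmd_finished(struct fc_dev *dev, long exec_time, long timeout, long now)
{
    assert(timeout > 0);
    int ratio = (int)(exec_time * 100 / timeout);

    if (ratio > dev->max_exec_ratio)
        dev->max_exec_ratio = ratio;

    /* Quick fall: heavily overloaded and the last fall is older than QI. */
    if (exec_time > timeout / FC_Q && now - dev->quick_fall_time > FC_QI) {
        int d = dev->queue_depth / FC_QD;
        dev->queue_depth = d > 1 ? d : 1;
        fc_reset_period(dev, now);
    }
}

/* Step 4: periodic work, intended to run once every P seconds. */
void fc_periodic(struct fc_dev *dev, long now)
{
    if (dev->max_exec_ratio < 100 / FC_MN) {
        /* Underloaded: grow only if the queue limit was actually hit. */
        if (dev->queue_was_full) {
            int d = dev->queue_depth + FC_I;
            dev->queue_depth = d < FC_MAX_DEPTH ? d : FC_MAX_DEPTH;
        }
    } else if (dev->max_exec_ratio > 100 / FC_MX) {
        /* Overloaded: divide the depth by D, never below 1. */
        int d = dev->queue_depth / FC_D;
        dev->queue_depth = d > 1 ? d : 1;
    }
    fc_reset_period(dev, now);
}
```

<p>Because max_exec_ratio is stored as a percentage of the timeout, the MN and MX thresholds become 100/MN and 100/MX, i.e. 5 and 10 with the defaults above.</p>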
|
|
|
|
<p>That's all. After that, only support for initiators that don't reduce the number of queued
commands in response to QUEUE FULL, such as iSCSI initiators, needs to be added.
Such initiators expect the target to control the size of
the queue, e.g. through MAX_SN for iSCSI.</p>
|
|
|
|
<p>For that, at stage 2 of the dynamic flow control development,
the following should be done:</p>
|
|
|
|
<ul>
|
|
<li><span>A new callback on_queue_depth_adjustment() should be added to struct
scst_tgt_template.</span></li>

<li><span>If the target driver defines it, on_queue_depth_adjustment() should be called
each time dev->queue_depth changes. In this callback the target
driver should change its internal queue depth, e.g., for an iSCSI target, set
max_sn in the replies correctly.</span></li>
|
|
</ul>
|
|
|
|
<p>Then, at the last stage of the development, logic to not schedule the
flow control work on idle devices should be added.</p>
|
|
|
|
<A NAME="O_DIRECT"></A><h3>Support for O_DIRECT in scst_vdisk handler</h3>
|
|
|
|
<p>At the moment, the scst_vdisk handler doesn't support the O_DIRECT option, and the possibility to set it
has been disabled. This limitation is caused by the Linux kernel's expectation that memory supplied to
the read() and write() functions with the O_DIRECT flag is mapped into some user space application.</p>
|
|
|
|
<p>It is relatively easy to remove that limitation. Function dio_refill_pages()
should be modified to check whether current->mm is not NULL before calling get_user_pages().
If it is NULL, then, instead of calling get_user_pages(), dio->pages should be filled
with pages taken directly from dio->curr_user_address. Each such page should be referenced
with page_cache_get(). That's all.</p>
|
|
|
|
<A NAME="VDISK_REFACTOR"></A><h3>Refactoring of command execution path in scst_vdisk handler</h3>
|
|
|
|
<p>At the moment, the command execution function vdisk_do_job() in the scst_vdisk handler is
overcomplicated and not very efficient. It would be good to replace all those
ugly "switch" statements with a lookup of the handler for each SCSI command in
an array of function pointers, followed by an indirect call.</p>
|
|
|
|
<p>I.e., there should be an array vdisk_exec_fns with 256 entries of function pointers:</p>
|
|
|
|
<p>int (*cmd_exec_fn) (struct scst_cmd *cmd)</p>
|
|
|
|
<p>Then vdisk_do_job() should look like</p>
|
|
|
|
<p><code>static int vdisk_do_job(struct scst_cmd *cmd)<br>
|
|
{<br>
|
|
<span class="tab">return vdisk_exec_fns[cmd->cdb[0]](cmd);</span><br>
|
|
}
|
|
</code></p>
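<p>A self-contained sketch of such a dispatch table follows. The reduced struct scst_cmd, the three handlers and their return values are hypothetical and exist only to make the example compile and run outside the kernel; the fallback for unpopulated opcodes is one possible design, not something the proposal prescribes.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for struct scst_cmd; only the CDB is modeled here. */
struct scst_cmd {
    unsigned char cdb[16];
};

typedef int (*vdisk_exec_fn)(struct scst_cmd *cmd);

/* Hypothetical handlers; the return value just identifies which one ran. */
static int vdisk_exec_read10(struct scst_cmd *cmd)  { (void)cmd; return 1; }
static int vdisk_exec_write10(struct scst_cmd *cmd) { (void)cmd; return 2; }
static int vdisk_exec_invalid(struct scst_cmd *cmd) { (void)cmd; return -1; }

/* One slot per possible opcode byte; slots not listed stay NULL. */
static const vdisk_exec_fn vdisk_exec_fns[256] = {
    [0x28] = vdisk_exec_read10,   /* READ(10) */
    [0x2A] = vdisk_exec_write10,  /* WRITE(10) */
};

int vdisk_do_job(struct scst_cmd *cmd)
{
    assert(cmd != NULL);

    vdisk_exec_fn fn = vdisk_exec_fns[cmd->cdb[0]];

    /* Empty slots fall back to a common "invalid opcode" handler. */
    return fn ? fn(cmd) : vdisk_exec_invalid(cmd);
}
```

<p>Since cdb[0] is a single byte, a 256-entry table can be indexed without a bounds check. Filling every unused slot with the invalid-opcode handler at initialization time would remove the NULL test as well.</p>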
|
|
|
|
<A NAME="SG_LIMIT"></A><h3>Solve SG IO count limitation issue in pass-through mode</h3>
|
|
|
|
<p>In pass-through mode (i.e. when using pass-through device handlers like
scst_tape, etc.), SCSI commands coming from remote initiators
are passed to the local SCSI hardware on the target as is, without any
modifications. Like any other hardware, the local SCSI hardware cannot
handle commands whose amount of data and/or number of segments in the
scatter-gather array exceeds certain limits. SCST can
split some commands into subcommands and, hence, work around this problem, but it isn't
always possible. For instance, for tapes splitting write commands may mean
corrupting the tape data.</p>
|
|
|
|
<p>If you have this issue, you will see
symptoms such as small transfers working well while large transfers stall, and
messages like "Unable to complete command due to SG IO count
limitation" printed in the kernel logs.</p>
|
|
|
|
<p>The only complete way to fix this problem is to allocate data buffers whose number
of entries fits within the SG IO count limitation. In <a href="sgv_big_order_alloc.diff">sgv_big_order_alloc.diff</a>
you can find a possible way to solve this issue.</p>
|
|
|
|
<p>There are also 2 more patches you can look at:</p>
|
|
|
|
<ul>
|
|
|
|
<li><span><a href="sgv_big_order_alloc-r2.diff">sgv_big_order_alloc-r2.diff</a> - this patch
has all the required features, but causes memory corruption.</span></li>
|
|
|
|
<li><span><a href="sgv_big_order_alloc-sfw4.diff">sgv_big_order_alloc-sfw4.diff</a> - this patch,
|
|
created by Frank Zago, works for him, but doesn't have all the required features to be merged
|
|
in SCST.</span></li>
|
|
|
|
</ul>
|
|
|
|
<A NAME="MEM_REG"></A><h3>Memory registration</h3>
|
|
|
|
<p>In some cases a target driver might need to register the memory used for data buffers with the
hardware. At the moment, none of the SCST target drivers, including the InfiniBand SRP target driver,
need that feature. But if such a feature is needed in the future, it can easily be
added by extending the SCST SGV cache. The SGV cache is a memory management
subsystem in SCST. It doesn't return each data buffer to the system as soon as
it is no longer used, but keeps it for a while so that it can be reused by
subsequent commands, which reduces command processing latency and, hence, improves performance.</p>
|
|
|
|
<p>To support memory buffer registration, it can be extended in the following way:</p>
|
|
|
|
<p>1. Struct scst_tgt_template would be extended to have 2 new callbacks:</p>
|
|
|
|
<ul>
|
|
|
|
<li><span>int register_buffer(struct scst_cmd *cmd)</span></li>
|
|
|
|
<li><span>int unregister_buffer(unsigned long mem_priv, void *scst_priv)</span></li>
|
|
|
|
</ul>
|
|
|
|
<p>2. SCST core would be extended to have 4 new functions:</p>
|
|
|
|
<ul>
|
|
|
|
<li><span>int scst_mem_registered(struct scst_cmd *cmd)</span></li>
|
|
|
|
<li><span>int scst_mem_deregistered(void *scst_priv)</span></li>
|
|
|
|
<li><span>int scst_set_mem_priv(struct scst_cmd *cmd, unsigned long mem_priv)</span></li>
|
|
|
|
<li><span>unsigned long scst_get_mem_priv(struct scst_cmd *cmd)</span></li>
|
|
|
|
</ul>
|
|
|
|
<p>3. The workflow would be the following:</p>
|
|
|
|
<ol>
|
|
<li><span>If target driver defined register_buffer() and unregister_buffer() callbacks,
|
|
SCST core would allocate a dedicated SGV cache for each instance of struct scst_tgt,
|
|
i.e. target.</span></li>
|
|
|
|
<li><span>When allocating a memory buffer for a command results in an SGV cache miss,
SCST would check if the register_buffer() callback is defined in the target driver's template
and, if yes, would call it.</span></li>

<li><span>In the register_buffer() callback the target driver would do the necessary actions to
start registration of the command's memory buffer.</span></li>

<li><span>When the register_buffer() callback returns, the SCST core would suspend processing of the
corresponding command and would switch to processing the next commands.</span></li>
|
|
|
|
<li><span>After the memory registration finished, the target driver would call scst_set_mem_priv()
|
|
to associate the memory buffer with some internal data.</span></li>
|
|
|
|
<li><span>Then the target driver would call scst_mem_registered() and SCST would resume processing
|
|
the command. Functions scst_set_mem_priv() and scst_mem_registered() can be called from inside register_buffer().
|
|
In this case SCST core would continue processing the command immediately without suspending.</span></li>
|
|
|
|
<li><span>After the command finished, the corresponding memory buffer would remain in the
SGV cache in the registered state and would be reused by subsequent commands. For each of them
the target driver can at any time retrieve the data associated with the registered buffer
by using scst_get_mem_priv().</span></li>

<li><span>When the SGV cache decides that it is time to free the memory buffer, it would
call the target driver's unregister_buffer() callback.</span></li>

<li><span>In this callback the target driver would do the necessary actions to start deregistration of the
command's memory buffer.</span></li>

<li><span>When the unregister_buffer() callback returns, the SGV cache would suspend freeing the corresponding buffer
and would switch to other work it has.</span></li>
|
|
|
|
<li><span>After the memory deregistration finished, the target driver would call scst_mem_deregistered()
|
|
and pass to it scst_priv pointer, received in unregister_buffer(). Then the memory buffer
|
|
would be freed by the SGV cache. Function scst_mem_deregistered() can be called from inside unregister_buffer().
|
|
In this case SGV cache would free the buffer immediately without suspending.
|
|
</span></li>
|
|
</ol>
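<p>The workflow can be illustrated with a small user-space model. It covers only the synchronous case, where the driver calls scst_mem_registered() and scst_mem_deregistered() from inside its callbacks. The function names follow the proposal above, but struct sgv_buf, the reduced struct scst_cmd, sgv_alloc_for_cmd(), sgv_free_buf() and the demo_* driver are hypothetical and exist only for this sketch.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for an SGV cache buffer that carries driver-private data. */
struct sgv_buf {
    int registered;
    unsigned long mem_priv;
};

/* Reduced command: only the data buffer is modeled. */
struct scst_cmd {
    struct sgv_buf *buf;
};

/* The two proposed template callbacks. */
struct scst_tgt_template {
    int (*register_buffer)(struct scst_cmd *cmd);
    int (*unregister_buffer)(unsigned long mem_priv, void *scst_priv);
};

/* The four proposed core functions, modeled on a single buffer. */
int scst_set_mem_priv(struct scst_cmd *cmd, unsigned long mem_priv)
{
    cmd->buf->mem_priv = mem_priv;
    return 0;
}

unsigned long scst_get_mem_priv(struct scst_cmd *cmd)
{
    return cmd->buf->mem_priv;
}

int scst_mem_registered(struct scst_cmd *cmd)
{
    cmd->buf->registered = 1;
    return 0;
}

int scst_mem_deregistered(void *scst_priv)
{
    ((struct sgv_buf *)scst_priv)->registered = 0;
    return 0;
}

/* Core side: attach a buffer; register it only if it is not registered yet. */
int sgv_alloc_for_cmd(const struct scst_tgt_template *t,
                      struct scst_cmd *cmd, struct sgv_buf *buf)
{
    assert(t != NULL && cmd != NULL && buf != NULL);
    cmd->buf = buf;
    if (!buf->registered && t->register_buffer)
        return t->register_buffer(cmd);
    return 0;
}

/* Core side: let the driver deregister the buffer before it is freed. */
int sgv_free_buf(const struct scst_tgt_template *t, struct sgv_buf *buf)
{
    assert(t != NULL && buf != NULL);
    if (buf->registered && t->unregister_buffer)
        return t->unregister_buffer(buf->mem_priv, buf);
    return 0;
}

/* Hypothetical driver that completes both operations synchronously. */
int demo_register(struct scst_cmd *cmd)
{
    scst_set_mem_priv(cmd, 0xabcUL);   /* e.g. a hardware memory key */
    return scst_mem_registered(cmd);
}

int demo_unregister(unsigned long mem_priv, void *scst_priv)
{
    (void)mem_priv;
    return scst_mem_deregistered(scst_priv);
}
```

<p>In the model, a buffer that is already registered is handed to the next command without calling register_buffer() again, which is where the benefit of keeping registered buffers in the SGV cache comes from.</p>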
|
|
|
|
<A NAME="NON_SCSI_TGT"></A><h3>SCST usage with non-SCSI transports</h3>
|
|
|
|
<p>SCST might also be used with non-SCSI speaking transports, like NBD or AoE. This
would allow them to use an SCST-emulated backend.</p>
|
|
|
|
<p>For user space targets this is trivial: they simply should use SCST-emulated devices locally
|
|
via scst_local module.</p>
|
|
|
|
<p>For in-kernel non-SCSI target drivers it's a bit more complicated. They should implement a small layer
that translates their internal READ/WRITE requests to the corresponding SCSI commands and, on the
way back, SCSI status and sense codes to their internal status codes.</p>
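<p>As an illustration of what such a translation layer might contain, the sketch below builds READ(10)/WRITE(10) CDBs from a block address and length, and maps a SCSI status byte to a transport-level result. The function names and the result codes 0, -1 and -2 are assumptions made for the example; sense data decoding is left out.</p>

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SCSI_READ_10   0x28
#define SCSI_WRITE_10  0x2A

/*
 * Fill a 10-byte CDB for READ(10) or WRITE(10).
 * lba and nblocks are in logical blocks; both fields are big-endian.
 */
void build_rw10_cdb(uint8_t cdb[10], int is_write, uint32_t lba, uint16_t nblocks)
{
    assert(cdb != NULL);

    memset(cdb, 0, 10);
    cdb[0] = is_write ? SCSI_WRITE_10 : SCSI_READ_10;
    cdb[2] = (uint8_t)(lba >> 24);      /* logical block address, MSB first */
    cdb[3] = (uint8_t)(lba >> 16);
    cdb[4] = (uint8_t)(lba >> 8);
    cdb[5] = (uint8_t)lba;
    cdb[7] = (uint8_t)(nblocks >> 8);   /* transfer length, MSB first */
    cdb[8] = (uint8_t)nblocks;
}

/*
 * Map a SCSI status byte to a transport-level result.
 * 0 = success, -1 = retry later, -2 = I/O error (hypothetical codes).
 */
int scsi_status_to_result(uint8_t status)
{
    switch (status) {
    case 0x00: return 0;    /* GOOD */
    case 0x08:              /* BUSY */
    case 0x28: return -1;   /* TASK SET FULL */
    default:   return -2;   /* CHECK CONDITION and everything else */
    }
}
```

<p>READ(10) and WRITE(10) address at most 2^32 blocks and transfer at most 65535 blocks per command, so a real layer would also need the 16-byte variants for larger devices.</p>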
|
|
|
|
<A NAME="iSER_target"></A><h3>iSER target</h3>
|
|
|
|
<p>The <a href="http://en.wikipedia.org/wiki/ISCSI_Extensions_for_RDMA">iSER</a> (iSCSI Extensions for RDMA)
protocol accelerates iSCSI by allowing direct data transfers using RDMA services (iWARP or
InfiniBand), bypassing the regular heavyweight and CPU-consuming TCP/IP data transfer path.</p>
|
|
|
|
<p>It would be good to add support for iSER in iSCSI-SCST.</p>
|
|
|
|
<A NAME="GET_CONFIGURATION"></A><h3>GET CONFIGURATION command</h3>
|
|
|
|
<p>The SCSI command GET CONFIGURATION is mandatory for SCSI multimedia devices, like CD/DVD-ROMs or
recorders (see the MMC standard). Currently SCST lacks support for it, which leads to problems
with some programs that depend on the result of the GET CONFIGURATION command.</p>
|
|
|
|
<p>It would be good to add support for it in the SCST core.</p>
|
|
|
|
<A NAME="Per_Device_Suspend"></A><h3>Per-device suspending</h3>
|
|
|
|
<p>Currently, before doing any management operation the SCST core performs so-called "activities suspending", i.e.
it suspends newly arriving SCSI commands and waits until the ones currently being executed have finished. This
simplifies internal locking and reference counting a lot, but has the drawback that it is global, i.e. it affects
all devices and SCSI commands, even ones that don't participate in the management operation. In the majority
of regular cases it works pretty well, but sometimes it can be a problem.
For instance, if a SCSI command needs a long time to execute (hours for some tape operations),
the management command and all other SCSI commands will wait until it has finished. Even worse, if a user space
dev handler hangs and stops processing commands, no SCST management command will be able to complete; each will fail
with a timeout until the user space dev handler gets killed.</p>
|
|
|
|
<p>The global suspending should be changed to more fine-grained per-device suspending,
used only in cases where it's really needed, like device unregistration. This is a very tricky task, because
all the internal SCST locking would have to be reimplemented.</p>
|
|
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<!-- wrap ends here -->
|
|
<!-- footer starts here -->
|
|
<div id="footer">
|
|
<p>© Copyright 2004-2010 <b><font color="#EC981F">Vladislav Bolkhovitin & others.</font></b>
|
|
Design by: <b><font color="#EC981F">Daniel Fernandes</font></b> </p>
|
|
</div>
|
|
<!-- footer ends here -->
|
|
<!-- Piwik -->
|
|
<script type="text/javascript">
|
|
var pkBaseURL = (("https:" == document.location.protocol) ? "https://apps.sourceforge.net/piwik/scst/" : "http://apps.sourceforge.net/piwik/scst/");
|
|
document.write(unescape("%3Cscript src='" + pkBaseURL + "piwik.js' type='text/javascript'%3E%3C/script%3E"));
|
|
</script><script type="text/javascript">
|
|
piwik_action_name = '';
|
|
piwik_idsite = 1;
|
|
piwik_url = pkBaseURL + "piwik.php";
|
|
piwik_log(piwik_action_name, piwik_idsite, piwik_url);
|
|
</script>
|
|
<object><noscript><p><img src="http://apps.sourceforge.net/piwik/scst/piwik.php?idsite=1" alt="piwik"></p></noscript></object>
|
|
<!-- End Piwik Tag -->
|
|
</body>
|
|
</html>
|