scst/www/contributing.html

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<meta name="author" content="Daniel Fernandes"/>
<meta name="Robots" content="index,follow" />
<link rel="stylesheet" href="images/Orange.css" type="text/css" />
<title>SCST Contributing</title>
</head>

<body>
<div id="wrap">
	<div id="header">
		<div class="logoimg"></div><h1 id="logo"><span class="orange"></span></h1>
		<h2 id="slogan">SCSI Target Middle Level for Linux</h2>
	</div>
	<div id="menu">
		<ul>
			<li id="sponsorship"><a href="sponsorship.html">Sponsorship</a></li>
			<li><a href="index.html">Home</a></li>
			<li><a href="http://www.sourceforge.net/projects/scst">Main</a></li>
			<li><a href="targets.html">Drivers</a></li>
			<li><a href="downloads.html">Downloads</a></li>
			<li id="current"><a href="contributing.html">Contributing</a></li>
			<li><a href="comparison.html">Comparison</a></li>
		</ul>
	</div>
	<div id="content-wrap">
	  		<div id="main">
				<h1>Contributing to SCST</h1>

				<p>If you would like to contribute to SCST development, you can do so in many ways:</p>

				<ul>
				<li><span>By reporting bugs or other problems.</span></li>
				<li><span>By writing or updating documentation to keep it complete and up to date.
					For instance, the <a href="scst_pg.html">SCST internals description</a> document is
					quite outdated in some areas. In particular, many functions have been renamed since
					it was written. It would be good to bring it up to date.</span></li>
				<li><span>By sending patches that fix bugs or implement new functionality.
					A list of possible SCST improvements, with some implementation ideas,
					is given below.</span></li>
				<li><span>By sending donations. They will be spent on making SCST even better as well as on providing
				better support and troubleshooting for you.</span></li>
				</ul>

				<h1>Possible SCST extensions and improvements</h1>

				<A NAME="ZC_READ"></A><h3>Zero-copy FILEIO for READ-direction commands</h3>

				<p>At the moment, SCST in FILEIO mode uses the standard Linux read() and write() syscall paths,
				which copy data from the page cache to the supplied buffer and back. Zero-copy FILEIO
				would use page cache data directly. This would be a major performance improvement,
				especially for fast hardware such as InfiniBand, because it would eliminate the data copy
				latency and considerably reduce CPU and memory bandwidth load. This proposal is limited to
				READs only, because zero-copy is a lot harder to implement for WRITEs,
				so it is worth doing zero-copy for READs and WRITEs separately.</p>

				<p>The main idea is to add one more flag, O_ZEROCOPY, to the filp_open() "flags" parameter
				(alongside O_RDONLY, O_DIRECT, etc.). The flag would be available
				only if the caller is in kernel space. In this case fd->f_op->readv(),
				do_sync_readv_writev(), etc. would receive, instead of a pointer to a real data
				buffer, a pointer to an empty SG vector. Then:</p>

				<ul>
				<li><span>Generic buffer allocation in SCST would not be used. Instead, vdisk_parse()
				would allocate the SG vector, but wouldn't fill it with actual pages.</span></li>

				<li><span>In generic_file_aio_read(), if the O_ZEROCOPY flag was set, the
				function do_generic_file_read() would be called with the last parameter set
				to a pointer to a new function file_zero_copy_read_actor() instead of file_read_actor().</span></li>

				<li><span>Function file_zero_copy_read_actor() would be basically the same as
				file_read_actor(), but, instead of copying data using the __copy_to_user*() functions,
				it would add the supplied page to the appropriate place in the SG vector received in
				desc->arg.buf and take a reference on that page, i.e. get_page().</span></li>

				<li><span>In vdisk_devtype.on_free_cmd(), which doesn't exist yet, the references on all pages
				from the SG vector would be dropped, i.e. put_page(). Then the SG vector itself
				would be freed.</span></li>
				</ul>

				<p>That's all. For WRITEs the current code path would remain unchanged.</p>

				<A NAME="ZC_WRITE"></A><h3>Zero-copy FILEIO for WRITE-direction commands</h3>

				<p>The implementation should be similar to zero-copy FILEIO for READ commands and should
				be done after it. All incoming data should be inserted into the page cache, and the page
				references should then be dropped in vdisk_devtype.on_free_cmd(). The main problem is the
				insertion of data pages into the page cache, namely the locking issues related to it.
				They should be carefully investigated.</p>

				<A NAME="PR"></A><h3>Persistent reservations</h3>

				<p>Support for PERSISTENT RESERVE IN and PERSISTENT RESERVE OUT is required to
				work in many cluster environments, e.g. Windows 2003 Cluster.</p>

				<p>For the implementation you should use scst_reserve_local() and
				scst_release_local() as a base. You should store all reservation keys
				in files in /var/scst, one file per device
				(this would eliminate the need for additional locking), like
				/var/scst/boot_disk for device "boot_disk", and load them into memory when
				the device is registered.</p>

				<p>The first version can support virtual
				devices only and reject PERSISTENT RESERVE IN and OUT commands for
				pass-through devices with "COMMAND NOT SUPPORTED" sense data.</p>

				<A NAME="AUTO_SESS"></A><h3>Automatic sessions reassignment</h3>

				<p>At the moment, if the security name for an initiator is reassigned (moved) to another security
				group, the existing sessions from that initiator are not automatically reassigned to
				the new security group, i.e. they remain in the old one. The only ways to reassign them
				are to restart the sessions or to restart the corresponding target driver. In many
				cases neither is an option.</p>

				<p>To implement this, on any group change you should:</p>
				<ul>
				<li><span>Globally suspend all activities by calling scst_suspend_activity().</span></li>

				<li><span>Go over all existing sessions. For each one, find the corresponding ACG
				      (see scst_init_session() as an example) and check if it's the same as the existing
				      one. If it's the same, then go to the next session. Otherwise, reassign
				      the session to the new ACG. For that you should go over all devices in the group/session
				      pair (tgt_dev's), delete the tgt_dev's that don't exist in the new ACG,
				      add the new ones and keep the existing ones.</span></li>

				<li><span>Resume the activities.</span></li>
				</ul>
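
				<p>The tgt_dev comparison in the second step can be sketched in plain user space C as follows.
				This is only an illustration under simplifying assumptions: devices are represented by integer
				IDs, and the names contains(), struct reassign_plan and plan_reassign() are invented for this
				example and do not exist in SCST. Real code would walk the ACG and session lists while
				activities are suspended.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Return true if device ID "dev" is present in the array "set". */
static bool contains(const int *set, size_t n, int dev)
{
	for (size_t i = 0; i < n; i++)
		if (set[i] == dev)
			return true;
	return false;
}

/* Hypothetical summary of the tgt_dev operations needed for one session. */
struct reassign_plan {
	size_t to_delete; /* devices present in the old ACG only */
	size_t to_add;    /* devices present in the new ACG only */
	size_t to_keep;   /* devices present in both ACGs */
};

/* Compare the device sets of the old and the new ACG of a session. */
static struct reassign_plan plan_reassign(const int *old_devs, size_t n_old,
					  const int *new_devs, size_t n_new)
{
	struct reassign_plan plan = { 0, 0, 0 };

	assert(old_devs != NULL || n_old == 0);
	assert(new_devs != NULL || n_new == 0);

	for (size_t i = 0; i < n_old; i++) {
		if (contains(new_devs, n_new, old_devs[i]))
			plan.to_keep++;
		else
			plan.to_delete++;
	}
	for (size_t i = 0; i < n_new; i++) {
		if (!contains(old_devs, n_old, new_devs[i]))
			plan.to_add++;
	}
	return plan;
}
```

				<p>A real implementation would delete and create the tgt_dev's instead of counting them; the
				counting only makes the three cases (delete, add, keep) explicit.</p>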

				<A NAME="DYN_FLOW"></A><h3>Dynamic I/O flow control</h3>

				<p>At the moment, if one or several initiators simultaneously send too many commands to
				the target, especially in seek-intensive workloads, the target can get
				overloaded and be unable to finish commands in time. In such cases you can see messages on
				the initiator(s) about aborting commands or resetting the target. See the section
				"What if target's backstorage is too slow" in the SCST core README for more details.
				To fix this problem it is necessary to implement dynamic I/O flow control in the
				SCST core.</p>

				<p>The flow control is, in general, quite simple. Each SCST command has a timeout value,
				which is set by the corresponding dev handler. SCST core should keep the device's queue depth
				at a level such that the worst command execution time, i.e. the time between scst_rx_cmd()
				and scst_finish_cmd(), stays between something like timeout/10 and timeout/5.
				So, command execution time should be checked and:</p>

				<ul>
				<li><span>If it's &gt; timeout/5, then the new queue depth should be set to max(1,
				  cur_depth/2).</span></li>

				<li><span>If it's &lt; timeout/10, then the new queue depth should be set to min(MAX_DEPTH,
				    cur_depth+1). This shouldn't be done too often; once every few minutes should be
				    sufficient.</span></li>
				</ul>
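
				<p>As a rough illustration, the simplified rule above could be written in plain C as follows.
				The function name next_queue_depth() and the MAX_DEPTH value of 32 are assumptions made for
				this example only; they are not taken from SCST. Rate limiting of the increase step is left out.</p>

```c
#include <assert.h>

/* Hypothetical upper bound for the queue depth; the value is an assumption. */
#define MAX_DEPTH 32

/*
 * Return the new queue depth for the worst command execution time seen
 * and the command timeout (both in the same time unit).
 */
static int next_queue_depth(int cur_depth, long exec_time, long timeout)
{
	assert(cur_depth >= 1 && timeout > 0);

	if (exec_time > timeout / 5) {
		/* Overloaded: halve the depth, but never go below 1. */
		int d = cur_depth / 2;
		return d > 1 ? d : 1;
	}
	if (exec_time < timeout / 10) {
		/* Underloaded: grow by one, but never exceed MAX_DEPTH. */
		int d = cur_depth + 1;
		return d < MAX_DEPTH ? d : MAX_DEPTH;
	}
	/* Within the target window: keep the current depth. */
	return cur_depth;
}
```

				<p>Halving on overload and adding one on underload makes the depth fall quickly and recover slowly,
				which is the behaviour the rule above asks for.</p>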

				<p>The above is, of course, an oversimplification to let you see the idea.
				An implementation that takes real-life cases into account should be as follows:</p>

				<p>1. There are several parameters:</p>

				<ul>
				<li><span>P - load watch period. During this period all the statistics are
				gathered and processed.</span></li>

				<li><span>MN - underload ratio divisor, which sets the underload portion of the
				timeout. If the longest execution time among all commands completed
				during period P is below timeout/MN, the corresponding device is considered
				underloaded.</span></li>

				<li><span>MX - overload ratio divisor, which sets the overload portion of the
				timeout. If the longest execution time among all commands completed
				during period P is above timeout/MX, the corresponding device is considered
				overloaded.</span></li>

				<li><span>I - step by which the device's queue size will be increased if the device is
				considered underloaded.</span></li>

				<li><span>D - divisor by which the device's queue size will be decreased if the device is
				considered overloaded.</span></li>

				<li><span>QI - quick fall interval. See the description of the Q parameter.</span></li>

				<li><span>Q - quick fall ratio divisor. If the longest execution time of a
				completed command is above timeout/Q and the time since the previous quick
				fall is greater than QI, the corresponding device is considered heavily
				overloaded. The quick fall is needed to handle cases when the load on a device
				suddenly increases to a level that it can't handle properly.</span></li>

				<li><span>QD - divisor by which the device's queue size will be decreased if the
				device is considered heavily overloaded.</span></li>
				</ul>

				<p>The default values should be something like: P=15 sec., MN=20, MX=10, Q=3,
				I=1, D=2, QI=5 sec., QD=10.</p>

				<p>2. There are the following new variables in struct scst_device:</p>

				<ul>
				<li><span>queue_depth - current queue depth.</span></li>

				<li><span>max_exec_ratio - maximum of (execution time)*100/timeout among the commands completed during period P.</span></li>

				<li><span>queue_was_full - flag, marking that the queue was at least once full
				during period P.</span></li>

				<li><span>quick_fall_time - time of the last quick fall.</span></li>

				<li><span>flow_lock - protects flow control related variables, where needed.</span></li>

				<li><span>...</span></li>
				</ul>

				<p>3. The command processing path should be as follows:</p>

				<ul>
				<li><span>In scst_rx_cmd() the start time of the command is recorded (already done).</span></li>

				<li><span>In __scst_init_cmd(), if dev->dev_cmd_count == dev->queue_depth,
				dev->queue_was_full is set to true.</span></li>

				<li><span>In scst_finish_cmd() dev->max_exec_ratio is set to max(dev->max_exec_ratio,
				(cmd's exec_time)*100/cmd->timeout).</span></li>

				<li><span>If in scst_finish_cmd() the cmd's execution time is above cmd->timeout/Q and
				the time since the latest quick fall is above QI, then:

					<ul>
					<li><span>dev->queue_depth is set to max(1, dev->queue_depth/QD).</span></li>

					<li><span>The flow control period is reset, i.e. started again, including setting
					dev->max_exec_ratio to 0 and dev->quick_fall_time to jiffies.</span></li>
					</ul>
					</span></li>
				</ul>

				<p>4. There should be a work item that checks
				dev->max_exec_ratio once every P seconds and then:</p>

				<ul>
				<li><span>If the device is neither underloaded nor overloaded, i.e. max_exec_ratio is
				between the limits defined by MN and MX, does nothing.</span></li>

				<li><span>If the device was underloaded:

					<ul>
					<li><span>if dev->queue_was_full is false, does nothing.</span></li>

					<li><span>if dev->queue_was_full is true, sets dev->queue_depth to
					min(SCST_MAX_DEV_COMMANDS, dev->queue_depth + I).</span></li>
					</ul>
					</span></li>

				<li><span>If the device was overloaded, sets dev->queue_depth to max(1,
				dev->queue_depth/D).</span></li>
				</ul>

				<p>Then the flow control period is reset, i.e. started again, including
				setting dev->max_exec_ratio to 0 and dev->quick_fall_time to jiffies.</p>
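
				<p>Steps 3 and 4 could be modelled in user space C roughly as shown below. This is a sketch under
				several assumptions, not SCST code: the struct and function names are invented, time is passed in
				by the caller as abstract ticks instead of jiffies, the value of SCST_MAX_DEV_COMMANDS is assumed,
				flow_lock is omitted, and queue_was_full is cleared on every period reset (the text above does not
				say when it is cleared). Since max_exec_ratio is kept as a percentage, the underload and overload
				limits are taken as 100/MN and 100/MX. The period P only determines how often the periodic check
				is scheduled, so it does not appear in the code.</p>

```c
#include <assert.h>
#include <stdbool.h>

#define SCST_MAX_DEV_COMMANDS 256 /* value assumed for this sketch */

struct flow_params {
	int mn, mx, q;  /* underload, overload and quick fall ratio divisors */
	int i, d, qd;   /* increase step, decrease divisor, quick fall divisor */
	long qi;        /* quick fall interval, in ticks */
};

struct flow_dev {
	int queue_depth;
	int dev_cmd_count;
	long max_exec_ratio;  /* worst exec_time*100/timeout in this period */
	bool queue_was_full;
	long quick_fall_time; /* tick of the last quick fall or period reset */
};

static int max_int(int a, int b) { return a > b ? a : b; }
static int min_int(int a, int b) { return a < b ? a : b; }

/* Start a new watch period. Clearing queue_was_full here is an assumption. */
static void flow_reset_period(struct flow_dev *dev, long now)
{
	dev->max_exec_ratio = 0;
	dev->queue_was_full = false;
	dev->quick_fall_time = now;
}

/* Step 3, __scst_init_cmd(): remember that the queue filled up. */
static void flow_on_init_cmd(struct flow_dev *dev)
{
	if (dev->dev_cmd_count == dev->queue_depth)
		dev->queue_was_full = true;
}

/* Step 3, scst_finish_cmd(): track the worst ratio, quick fall if needed. */
static void flow_on_finish_cmd(struct flow_dev *dev,
			       const struct flow_params *p,
			       long exec_time, long timeout, long now)
{
	long ratio;

	assert(timeout > 0);
	ratio = exec_time * 100 / timeout;
	if (ratio > dev->max_exec_ratio)
		dev->max_exec_ratio = ratio;

	if (exec_time > timeout / p->q &&
	    now - dev->quick_fall_time > p->qi) {
		dev->queue_depth = max_int(1, dev->queue_depth / p->qd);
		flow_reset_period(dev, now);
	}
}

/* Step 4: the work run once every P seconds. */
static void flow_periodic_check(struct flow_dev *dev,
				const struct flow_params *p, long now)
{
	if (dev->max_exec_ratio > 100 / p->mx)
		dev->queue_depth = max_int(1, dev->queue_depth / p->d);
	else if (dev->max_exec_ratio < 100 / p->mn && dev->queue_was_full)
		dev->queue_depth = min_int(SCST_MAX_DEV_COMMANDS,
					   dev->queue_depth + p->i);
	flow_reset_period(dev, now);
}
```

				<p>In the kernel, the periodic check would typically be driven by a delayed work item and the
				shared fields would be protected by flow_lock where needed.</p>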

				<p>That's all. After that, only support for initiators, such as iSCSI ones,
				that don't react to QUEUE FULL by decreasing the number of queued
				commands needs to be added. Such initiators expect the target to control the size of
				the queue, e.g. through MAX_SN for iSCSI.</p>

				<p>For that, at stage 2 of the dynamic flow control development,
				the following should be done:</p>

				<ul>
				<li><span>A new callback on_queue_depth_adjustment() should be added to struct
				scst_tgt_template.</span></li>

				<li><span>If the target driver defines it, on_queue_depth_adjustment() should be called
				each time dev->queue_depth changes. In this callback the target
				driver should change its internal queue depth, e.g. an iSCSI target should set
				max_sn in its replies accordingly.</span></li>
				</ul>

				<p>Then, at the last stage of the development, logic to not schedule the
				flow control work on idle devices should be added.</p>

				<A NAME="O_DIRECT"></A><h3>Support for O_DIRECT in scst_vdisk handler</h3>

				<p>At the moment, the scst_vdisk handler doesn't support the O_DIRECT option, and the possibility to set it
				has been disabled. This limitation is caused by the Linux kernel's expectation that memory supplied to the
				read() and write() functions with the O_DIRECT flag is mapped into some user space application.</p>

				<p>It is relatively easy to remove that limitation. Function dio_refill_pages()
				should be modified to check, before calling get_user_pages(), whether current->mm is not NULL.
				If it is NULL, then, instead of calling get_user_pages(), dio->pages should be filled
				with pages taken directly from dio->curr_user_address. Each such page should be referenced
				by page_cache_get(). That's all.</p>

				<A NAME="VDISK_REFACTOR"></A><h3>Refactoring of command execution path in scst_vdisk handler</h3>

				<p>At the moment, the command execution function vdisk_do_job() in the scst_vdisk handler is
				overcomplicated and not very efficient. It would be good to replace all those
				ugly "switch" statements by selecting the handler for each SCSI command through an indirect
				function call via an array of function pointers.</p>

				<p>I.e., there should be an array vdisk_exec_fns of 256 function pointers, one per possible value of the first CDB byte:</p>

				 <p>int (*cmd_exec_fn)(struct scst_cmd *cmd)</p>

				 <p>Then vdisk_do_job() should look like this:</p>

				 <listing><p>static int vdisk_do_job(struct scst_cmd *cmd)
{
	return vdisk_exec_fns[cmd->cdb[0]](cmd);
}</p></listing>
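
				<p>A self-contained user space model of such a table might look like the sketch below. It assumes
				that the handlers return int, so that vdisk_do_job() can pass their result on. The struct scst_cmd
				here is a minimal stand-in for the real one, and the handler names and return codes are
				hypothetical. Unlike the one-line version above, this sketch falls back to a common handler for
				opcodes without a table entry.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal stand-in for the real struct scst_cmd; only the CDB is modelled. */
struct scst_cmd {
	uint8_t cdb[16];
};

typedef int (*cmd_exec_fn)(struct scst_cmd *cmd);

/* Hypothetical per-opcode handlers; the return codes are illustrative. */
static int vdisk_exec_read10(struct scst_cmd *cmd)
{
	(void)cmd;
	return 1;
}

static int vdisk_exec_write10(struct scst_cmd *cmd)
{
	(void)cmd;
	return 2;
}

static int vdisk_exec_unsupported(struct scst_cmd *cmd)
{
	(void)cmd;
	return -1;
}

/* One slot per possible value of the opcode byte; unset slots stay NULL. */
static const cmd_exec_fn vdisk_exec_fns[256] = {
	[0x28] = vdisk_exec_read10,  /* READ(10) */
	[0x2A] = vdisk_exec_write10, /* WRITE(10) */
};

static int vdisk_do_job(struct scst_cmd *cmd)
{
	cmd_exec_fn fn;

	assert(cmd != NULL);
	fn = vdisk_exec_fns[cmd->cdb[0]];

	/* Fall back to a common handler for opcodes without an entry. */
	return fn ? fn(cmd) : vdisk_exec_unsupported(cmd);
}
```

				<p>Filling every unused slot with the fallback handler at initialization time would remove the
				NULL check from the hot path, at the cost of a slightly longer table setup.</p>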

				<A NAME="SG_LIMIT"></A><h3>Solve SG IO count limitation issue in pass-through mode</h3>

				<p>In the pass-through mode (i.e. using the pass-through device handlers
				scst_disk, scst_tape, etc.) SCSI commands coming from remote initiators
				are passed to the local SCSI hardware on the target as is, without any
				modifications. Like any other hardware, the local SCSI hardware cannot
				handle commands whose amount of data and/or number of segments in the
				scatter-gather array exceeds certain limits. If you have this issue, you will see
				symptoms such as small transfers working well while large ones stall, and
				messages like "Unable to complete command due to SG IO count
				limitation" printed in the kernel logs.</p>

				<p>In <a href="sgv_big_order_alloc.diff">sgv_big_order_alloc.diff</a> you
				can find a possible way to solve this issue.</p>

				<A NAME="MEM_REG"></A><h3>Memory registration</h3>

				<p>In some cases a target driver might need to register memory used for data buffers in the
				hardware. At the moment, none of the SCST target drivers, including the InfiniBand SRP target driver,
				needs that feature. But if such a feature is needed in the future, it can be easily
				added by extending the SCST SGV cache. The SCST SGV cache is a memory management
				subsystem in SCST. It doesn't return each data buffer to the system as soon as
				the buffer is no longer used, but keeps it for a while so that it can be reused by
				subsequent commands, which reduces command processing latency and, hence, improves performance.</p>

				<p>To support memory buffer registration, it can be extended in the following way:</p>

				<p>1. Struct scst_tgt_template would be extended to have 2 new callbacks:</p>

				<ul>

					<li><span>int register_buffer(struct scst_cmd *cmd)</span></li>

					<li><span>int unregister_buffer(unsigned long mem_priv, void *scst_priv)</span></li>

				</ul>

				<p>2. SCST core would be extended to have 4 new functions:</p>

				<ul>

					<li><span>int scst_mem_registered(struct scst_cmd *cmd)</span></li>

					<li><span>int scst_mem_deregistered(void *scst_priv)</span></li>

					<li><span>int scst_set_mem_priv(struct scst_cmd *cmd, unsigned long mem_priv)</span></li>

					<li><span>unsigned long scst_get_mem_priv(struct scst_cmd *cmd)</span></li>

				</ul>

				<p>3. The workflow would be the following:</p>

				<ol>
					<li><span>If the target driver defines the register_buffer() and unregister_buffer() callbacks,
					SCST core would allocate a dedicated SGV cache for each instance of struct scst_tgt,
					i.e. for each target.</span></li>

					<li><span>When allocating a memory buffer for a command results in an SGV cache miss,
					SCST would check whether the register_buffer() callback is defined in the target driver's template
					and, if so, would call it.</span></li>

					<li><span>In the register_buffer() callback the target driver would do whatever is necessary to
					start registration of the command's memory buffer.</span></li>

					<li><span>When the register_buffer() callback returns, SCST core would suspend processing of the
					corresponding command and would switch to processing the next commands.</span></li>

					<li><span>After the memory registration has finished, the target driver would call scst_set_mem_priv()
					to associate the memory buffer with some internal data.</span></li>

					<li><span>Then the target driver would call scst_mem_registered() and SCST would resume processing
					the command. Functions scst_set_mem_priv() and scst_mem_registered() can be called from inside register_buffer().
					In this case SCST core would continue processing the command immediately, without suspending it.</span></li>

					<li><span>After the command has finished, the corresponding memory buffer would remain in the
					SGV cache in the registered state and would be reused by subsequent commands. For each of them
					the target driver can at any time retrieve the data associated with the registered buffer
					by using scst_get_mem_priv().</span></li>

					<li><span>When the SGV cache decides that it is time to free the memory buffer, it would
					call the target driver's unregister_buffer() callback.</span></li>

					<li><span>In this callback the target driver would do whatever is necessary to start deregistration of the
					command's memory buffer.</span></li>

					<li><span>When the unregister_buffer() callback returns, the SGV cache would suspend freeing the corresponding buffer
					and would move on to its other work.</span></li>

					<li><span>After the memory deregistration has finished, the target driver would call scst_mem_deregistered()
					and pass to it the scst_priv pointer received in unregister_buffer(). Then the memory buffer
					would be freed by the SGV cache. Function scst_mem_deregistered() can be called from inside unregister_buffer().
					In this case the SGV cache would free the buffer immediately, without suspending.
					</span></li>
				</ol>

				<A NAME="NON_SCSI_TGT"></A><h3>SCST usage with non-SCSI transports</h3>

				<p>SCST might also be used with transports that don't speak SCSI, like NBD or AoE. This
				would allow them to use SCST-emulated backends.</p>

				<p>For user space targets this is trivial: they should simply use SCST-emulated devices locally
				via the scst_local module.</p>

				<p>For in-kernel non-SCSI target drivers it's a bit more complicated. They should implement a small layer
				that translates their internal READ/WRITE requests to the corresponding SCSI commands and, on the
				way back, SCSI status and sense codes to their internal status codes.</p>
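
				<p>The core of such a translation layer could look like the following user space sketch. It builds
				READ(10) and WRITE(10) CDBs, which carry a 32-bit big-endian LBA in bytes 2-5 and a 16-bit
				big-endian transfer length in bytes 7-8. The struct blk_req type and both function names are
				hypothetical and stand for whatever request representation the transport uses. The status mapping
				is deliberately crude: a real layer would also look at the sense data.</p>

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define READ_10        0x28
#define WRITE_10       0x2A
#define SAM_STAT_GOOD  0x00

/* Hypothetical internal request of a non-SCSI transport such as NBD. */
struct blk_req {
	int is_write;
	uint32_t lba;     /* first logical block */
	uint16_t nblocks; /* number of blocks to transfer */
};

/* Fill a 10-byte CDB: opcode, big-endian LBA, big-endian transfer length. */
static void blk_req_to_cdb(const struct blk_req *req, uint8_t cdb[10])
{
	assert(req != NULL && cdb != NULL);

	memset(cdb, 0, 10);
	cdb[0] = req->is_write ? WRITE_10 : READ_10;
	cdb[2] = (uint8_t)(req->lba >> 24);
	cdb[3] = (uint8_t)(req->lba >> 16);
	cdb[4] = (uint8_t)(req->lba >> 8);
	cdb[5] = (uint8_t)req->lba;
	cdb[7] = (uint8_t)(req->nblocks >> 8);
	cdb[8] = (uint8_t)req->nblocks;
}

/* Map a SCSI status byte to a hypothetical internal code: 0 or -1. */
static int scsi_status_to_internal(uint8_t status)
{
	return status == SAM_STAT_GOOD ? 0 : -1;
}
```

				<p>Requests that address blocks beyond 32 bits or transfer more than 65535 blocks would need
				READ(16)/WRITE(16) or would have to be split; this sketch does not handle them.</p>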

			</div>
	</div>
</div>
<!-- wrap ends here -->

<!-- footer starts here -->
		<div id="footer">
			<p>
			&copy; Copyright 2008 <b><font color="#EC981F">Vladislav Bolkhovitin &amp; others.</font></b>&nbsp;&nbsp;
			Design by: <b><font color="#EC981F">Daniel Fernandes</font></b>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;

			</p>
		</div>
<!-- footer ends here -->
</body>
</html>