Use SEEK_HOLE for hole detection

Based on patch by Pavel Raiskup.

Use SEEK_HOLE/SEEK_DATA feature of lseek on systems that support
it.  This can make archiving of sparse files much faster.
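
For readers unfamiliar with the interface, here is a minimal sketch of how data extents can be enumerated with lseek(2), assuming a platform that defines SEEK_DATA and SEEK_HOLE. It is not the code from this patch: `struct extent`, `list_data_extents` and `make_sparse_temp` are hypothetical names used only for illustration, and when the macros are unavailable the sketch simply reports the whole file as one data extent.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* glibc exposes SEEK_DATA/SEEK_HOLE under this macro */
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical type for this sketch; tar itself uses struct sp_array. */
struct extent
{
  off_t offset;   /* where the data starts */
  off_t length;   /* how many bytes of data follow */
};

/* Store up to MAX data extents of FD in OUT.  Return the number stored,
   or -1 on error.  Note that this moves the file position of FD. */
static int
list_data_extents (int fd, struct extent *out, int max)
{
  struct stat st;
  int n = 0;

  assert (out != NULL && max > 0);
  if (fstat (fd, &st) != 0)
    return -1;
#ifdef SEEK_DATA
  off_t pos = 0;
  while (n < max && pos < st.st_size)
    {
      off_t data = lseek (fd, pos, SEEK_DATA);
      if (data == (off_t) -1)
        {
          if (errno == ENXIO)
            break;              /* only a hole remains up to end of file */
          return -1;
        }
      /* End of file counts as an implicit hole, so this finds the end
         of the current data extent. */
      off_t hole = lseek (fd, data, SEEK_HOLE);
      if (hole == (off_t) -1)
        return -1;
      out[n].offset = data;
      out[n].length = hole - data;
      n++;
      pos = hole;
    }
#else
  /* Assumption for this sketch: without the macros, treat the file as
     one contiguous data extent. */
  if (st.st_size > 0)
    {
      out[0].offset = 0;
      out[0].length = st.st_size;
      n = 1;
    }
#endif
  return n;
}

/* Hypothetical helper: create an unlinked temporary file that consists
   of HOLE_SIZE bytes skipped by lseek followed by 4 bytes of data. */
static int
make_sparse_temp (off_t hole_size)
{
  char name[] = "/tmp/extent-demo-XXXXXX";
  int fd = mkstemp (name);
  if (fd < 0)
    return -1;
  unlink (name);
  if (lseek (fd, hole_size, SEEK_SET) == (off_t) -1
      || write (fd, "data", 4) != 4)
    {
      close (fd);
      return -1;
    }
  return fd;
}
```

On a file system that tracks holes, calling `list_data_extents` on the descriptor returned by `make_sparse_temp` would typically report a single short extent near the end of the file; a file system without hole support reports the whole file as data, which is the situation the patch detects before falling back to another method.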

Implement the --hole-detection option to allow users to select
hole-detection method.

* src/common.h (hole_detection_method): New enum.
(hole_detection): New global.
* src/sparse.c (sparse_scan_file_wholesparse): New function as a
method for detecting sparse files without any data.
(sparse_scan_file_raw): Rename from sparse_scan_file; with edits.
(sparse_scan_file_seek): New function.
(sparse_scan_file): Reimplement function.
* src/tar.c: New option --hole-detection.

* tests/checkseekhole.c: New file.
* tests/.gitignore: Mention two test binaries.
* tests/Makefile.am: Add new tests.
* tests/testsuite.at (AT_SEEKHOLE_PREREQ): New macro.
Include sparse06.at.
* tests/sparse06.at: New test case.
* tests/sparse02.at: Force raw hole-detection method.
* tests/sparsemv.at: Likewise.
* tests/sparsemvp.at: Likewise.

* doc/tar.1: Document --hole-detection option.
* doc/tar.texi: Document hole-detection algorithms and
command-line options.
* NEWS: Document hole-detection.
Sergey Poznyakoff
2015-12-05 23:36:22 +02:00
parent 589ba77faf
commit b684326e69
14 changed files with 425 additions and 78 deletions
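
The "raw" method named in the commit message reads the file block by block and records each run of non-zero blocks. The following is a rough sketch of that idea on an in-memory buffer, not the function from src/sparse.c: `struct chunk`, `is_zero_block` and `map_data_chunks` are hypothetical names, and the 512-byte block size is an assumption made for the example. As in the patch, a final zero-length entry marks a file that ends with a hole.

```c
#include <stddef.h>

enum { BLOCK = 512 };   /* assumed block size for this sketch */

/* Hypothetical map entry: a run of data at OFFSET, NUMBYTES long. */
struct chunk
{
  size_t offset;
  size_t numbytes;
};

/* Return 1 if the N bytes at P are all zero, 0 otherwise. */
static int
is_zero_block (const unsigned char *p, size_t n)
{
  for (size_t i = 0; i < n; i++)
    if (p[i] != 0)
      return 0;
  return 1;
}

/* Scan BUF (LEN bytes) one block at a time and store up to MAX map
   entries in OUT.  Return the number of entries stored. */
static size_t
map_data_chunks (const unsigned char *buf, size_t len,
                 struct chunk *out, size_t max)
{
  struct chunk cur = { 0, 0 };
  size_t n = 0;

  for (size_t off = 0; off < len; off += BLOCK)
    {
      size_t count = len - off < BLOCK ? len - off : BLOCK;
      if (is_zero_block (buf + off, count))
        {
          /* A zero block closes the data run in progress, if any. */
          if (cur.numbytes != 0)
            {
              if (n < max)
                out[n++] = cur;
              cur.numbytes = 0;
            }
        }
      else
        {
          /* A non-zero block starts or extends a data run. */
          if (cur.numbytes == 0)
            cur.offset = off;
          cur.numbytes += count;
        }
    }

  /* Emit the last run.  If the buffer ends with zeros, emit a
     zero-length entry at LEN to record the trailing hole. */
  if (cur.numbytes == 0)
    cur.offset = len;
  if (n < max)
    out[n++] = cur;
  return n;
}
```

Because every byte has to be examined, the cost of this approach grows with the apparent file size rather than with the amount of stored data, which is why the seek-based method can be much faster on large, mostly empty files.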

NEWS

@@ -1,4 +1,4 @@
GNU tar NEWS - User visible changes. 2015-11-02
GNU tar NEWS - User visible changes. 2015-12-06
Please send GNU tar bug reports to <bug-tar@gnu.org>
@@ -48,6 +48,25 @@ read from null-delimited file lists is treated as a file name.
This restores the documented behavior, which was broken in version
1.27.
* Sparse file detection
Tar now uses SEEK_DATA/SEEK_HOLE on systems that support it. This
allows for considerable speed-up in sparse-file detection.
New option --hole-detection is provided, that allows the user to
select the algorithm used for hole detection. Available arguments
are:
--hole-detection=seek
Use lseek(2) SEEK_DATA and SEEK_HOLE "whence" parameters.
--hole-detection=raw
Scan entire file before storing it to determine where holes
are located.
The default is to use "seek" whenever possible, and fall back to
"raw" otherwise.
version 1.28, 2014-07-28

doc/tar.1

@@ -13,7 +13,7 @@
.\"
.\" You should have received a copy of the GNU General Public License
.\" along with this program. If not, see <http://www.gnu.org/licenses/>.
.TH TAR 1 "November 2, 2015" "TAR" "GNU TAR Manual"
.TH TAR 1 "December 5, 2015" "TAR" "GNU TAR Manual"
.SH NAME
tar \- an archiving utility
.SH SYNOPSIS
@@ -259,6 +259,12 @@ When listing or extracting, the actual contents of \fIFILE\fR is not
inspected, it is needed only due to syntactical requirements. It is
therefore common practice to use \fB/dev/null\fR in its place.
.TP
\fB\-\-hole\-detection\fR=\fIMETHOD\fR
Use \fIMETHOD\fR to detect holes in sparse files. This option implies
\fB\-\-sparse\fR. Valid values for \fIMETHOD\fR are \fBseek\fR and
\fBraw\fR. Default is \fBseek\fR with fallback to \fBraw\fR when not
applicable.
.TP
\fB\-G\fR, \fB\-\-incremental\fR
Handle old GNU-format incremental backups.
.TP
@@ -821,7 +827,8 @@ environment variable. If it is not set, \fBexisting\fR is assumed.
.RE
.TP
\fB\-C\fR, \fB\-\-directory\fR=\fIDIR\fR
Change to directory DIR.
Change to \fIDIR\fR before performing any operations. This option is
order-sensitive, i.e. it affects all options that follow.
.TP
\fB\-\-exclude\fR=\fIPATTERN\fR
Exclude files matching \fIPATTERN\fR, a

doc/tar.texi

@@ -2782,6 +2782,13 @@ they refer to, instead of creating usual hard link members.
@command{tar} will print out a short message summarizing the operations and
options to @command{tar} and exit. @xref{help}.
@opsummary{hole-detection}
@item --hole-detection=@var{method}
Use @var{method} to detect holes in sparse files. This option implies
@option{--sparse}. Valid methods are @samp{seek} and @samp{raw}.
Default is @samp{seek} with fallback to @samp{raw} when not
applicable. @xref{sparse}.
@opsummary{ignore-case}
@item --ignore-case
Ignore case when matching member or file names with
@@ -9536,13 +9543,15 @@ could create an archive longer than the original. To have @command{tar}
attempt to recognize the holes in a file, use @option{--sparse}
(@option{-S}). When you use this option, then, for any file using
less disk space than would be expected from its length, @command{tar}
searches the file for consecutive stretches of zeros. It then records
in the archive for the file where the consecutive stretches of zeros
are, and only archives the ``real contents'' of the file. On
extraction (using @option{--sparse} is not needed on extraction) any
such files have holes created wherever the continuous stretches of zeros
were found. Thus, if you use @option{--sparse}, @command{tar} archives
won't take more space than the original.
searches the file for holes. It then records in the archive for the file where
the holes (consecutive stretches of zeros) are, and only archives the
``real contents'' of the file. On extraction (using @option{--sparse} is not
needed on extraction) any such files have also holes created wherever the holes
were found. Thus, if you use @option{--sparse}, @command{tar} archives won't
take more space than the original.
@GNUTAR{} uses two methods for detecting holes in sparse files. These
methods are described later in this subsection.
@table @option
@opindex sparse
@@ -9568,37 +9577,12 @@ will never take more space on the media than the files take on disk
(otherwise, archiving a disk filled with sparse files might take
hundreds of tapes). @xref{Incremental Dumps}.
However, be aware that @option{--sparse} option presents a serious
drawback. Namely, in order to determine if the file is sparse
@command{tar} has to read it before trying to archive it, so in total
the file is read @strong{twice}. So, always bear in mind that the
time needed to process all files with this option is roughly twice
the time needed to archive them without it.
@FIXME{A technical note:
Programs like @command{dump} do not have to read the entire file; by
examining the file system directly, they can determine in advance
exactly where the holes are and thus avoid reading through them. The
only data it need read are the actual allocated data blocks.
@GNUTAR{} uses a more portable and straightforward
archiving approach, it would be fairly difficult that it does
otherwise. Elizabeth Zwicky writes to @file{comp.unix.internals}, on
1990-12-10:
@quotation
What I did say is that you cannot tell the difference between a hole and an
equivalent number of nulls without reading raw blocks. @code{st_blocks} at
best tells you how many holes there are; it doesn't tell you @emph{where}.
Just as programs may, conceivably, care what @code{st_blocks} is (care
to name one that does?), they may also care where the holes are (I have
no examples of this one either, but it's equally imaginable).
I conclude from this that good archivers are not portable. One can
arguably conclude that if you want a portable program, you can in good
conscience restore files with as many holes as possible, since you can't
get it right.
@end quotation
}
However, be aware that @option{--sparse} option may present a serious
drawback. Namely, in order to determine the positions of holes in a file
@command{tar} may have to read it before trying to archive it, so in total
the file may be read @strong{twice}. This may happen when your OS or your FS
does not support @dfn{SEEK_HOLE/SEEK_DATA} feature in @dfn{lseek} (See
@option{--hole-detection}, below).
@cindex sparse formats, defined
When using @samp{POSIX} archive format, @GNUTAR{} is able to store
@@ -9612,7 +9596,6 @@ use an earlier format, you can select it using
@table @option
@opindex sparse-version
@item --sparse-version=@var{version}
Select the format to store sparse files in. Valid @var{version} values
are: @samp{0.0}, @samp{0.1} and @samp{1.0}. @xref{Sparse Formats},
for a detailed description of each format.
@@ -9620,6 +9603,39 @@ for a detailed description of each format.
Using @option{--sparse-format} option implies @option{--sparse}.
@table @option
@opindex hole-detection
@cindex hole detection
@item --hole-detection=@var{method}
Enforce concrete hole detection method. Before the real contents of sparse
file are stored, @command{tar} needs to gather knowledge about file
sparseness. This is because it needs to have the file's map of holes
stored into tar header before it starts archiving the file contents.
Currently, two methods of hole detection are implemented:
@itemize @bullet
@item @option{--hole-detection=seek}
Seeking the file for data and holes. It uses enhancement of the @code{lseek}
system call (@code{SEEK_HOLE} and @code{SEEK_DATA}) which is able to
reuse file system knowledge about sparse file contents - so the
detection is usually very fast. To use this feature, your file system
and operating system must support it. At the time of this writing
(2015) this feature, in spite of not being accepted by POSIX, is
fairly widely supported by different operating systems.
@item @option{--hole-detection=raw}
Reading byte-by-byte the whole sparse file before the archiving. This
method detects holes like consecutive stretches of zeroes. Comparing to
the previous method, it is usually much slower, although more
portable.
@end itemize
@end table
When no @option{--hole-detection} option is given, @command{tar} uses
the @samp{seek}, if supported by the operating system.
Using @option{--hole-detection} option implies @option{--sparse}.
@node Attributes
@section Handling File Attributes
@cindex attributes, files

src/common.h

@@ -280,6 +280,15 @@ GLOBAL bool sparse_option;
GLOBAL unsigned tar_sparse_major;
GLOBAL unsigned tar_sparse_minor;
enum hole_detection_method
{
HOLE_DETECTION_DEFAULT,
HOLE_DETECTION_RAW,
HOLE_DETECTION_SEEK
};
GLOBAL enum hole_detection_method hole_detection;
GLOBAL bool starting_file_option;
/* Specified maximum byte length of each tape volume (multiple of 1024). */

src/sparse.c

@@ -1,6 +1,6 @@
/* Functions for dealing with sparse files
Copyright 2003-2007, 2010, 2013-2014 Free Software Foundation, Inc.
Copyright 2003-2007, 2010, 2013-2015 Free Software Foundation, Inc.
This program is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the
@@ -208,9 +208,9 @@ sparse_add_map (struct tar_stat_info *st, struct sp_array const *sp)
st->sparse_map_avail = avail + 1;
}
/* Scan the sparse file and create its map */
/* Scan the sparse file byte-by-byte and create its map. */
static bool
sparse_scan_file (struct tar_sparse_file *file)
sparse_scan_file_raw (struct tar_sparse_file *file)
{
struct tar_stat_info *st = file->stat_info;
int fd = file->fd;
@@ -221,41 +221,38 @@ sparse_scan_file (struct tar_sparse_file *file)
st->archive_file_size = 0;
if (ST_NBLOCKS (st->stat) == 0)
offset = st->stat.st_size;
else
if (!tar_sparse_scan (file, scan_begin, NULL))
return false;
while ((count = blocking_read (fd, buffer, sizeof buffer)) != 0
&& count != SAFE_READ_ERROR)
{
if (!tar_sparse_scan (file, scan_begin, NULL))
return false;
/* Analyze the block. */
if (zero_block_p (buffer, count))
{
if (sp.numbytes)
{
sparse_add_map (st, &sp);
sp.numbytes = 0;
if (!tar_sparse_scan (file, scan_block, NULL))
return false;
}
}
else
{
if (sp.numbytes == 0)
sp.offset = offset;
sp.numbytes += count;
st->archive_file_size += count;
if (!tar_sparse_scan (file, scan_block, buffer))
return false;
}
while ((count = blocking_read (fd, buffer, sizeof buffer)) != 0
&& count != SAFE_READ_ERROR)
{
/* Analyze the block. */
if (zero_block_p (buffer, count))
{
if (sp.numbytes)
{
sparse_add_map (st, &sp);
sp.numbytes = 0;
if (!tar_sparse_scan (file, scan_block, NULL))
return false;
}
}
else
{
if (sp.numbytes == 0)
sp.offset = offset;
sp.numbytes += count;
st->archive_file_size += count;
if (!tar_sparse_scan (file, scan_block, buffer))
return false;
}
offset += count;
}
offset += count;
}
/* save one more sparse segment of length 0 to indicate that
the file ends with a hole */
if (sp.numbytes == 0)
sp.offset = offset;
@@ -264,6 +261,114 @@ sparse_scan_file (struct tar_sparse_file *file)
return tar_sparse_scan (file, scan_end, NULL);
}
static bool
sparse_scan_file_wholesparse (struct tar_sparse_file *file)
{
struct tar_stat_info *st = file->stat_info;
struct sp_array sp = {0, 0};
/* Note that this function is called only for truly sparse files of size >= 1
block size (checked via ST_IS_SPARSE before). See the thread
http://www.mail-archive.com/bug-tar@gnu.org/msg04209.html for more info */
if (ST_NBLOCKS (st->stat) == 0)
{
st->archive_file_size = 0;
sp.offset = st->stat.st_size;
sparse_add_map (st, &sp);
return true;
}
return false;
}
#ifdef SEEK_HOLE
/* Try to engage SEEK_HOLE/SEEK_DATA feature. */
static bool
sparse_scan_file_seek (struct tar_sparse_file *file)
{
struct tar_stat_info *st = file->stat_info;
int fd = file->fd;
struct sp_array sp = {0, 0};
off_t offset = 0;
off_t data_offset;
off_t hole_offset;
st->archive_file_size = 0;
for (;;)
{
/* locate first chunk of data */
data_offset = lseek (fd, offset, SEEK_DATA);
if (data_offset == (off_t)-1)
/* ENXIO == EOF; error otherwise */
{
if (errno == ENXIO)
{
/* file ends with hole, add one more empty chunk of data */
sp.numbytes = 0;
sp.offset = st->stat.st_size;
sparse_add_map (st, &sp);
return true;
}
return false;
}
hole_offset = lseek (fd, data_offset, SEEK_HOLE);
/* according to specs, if FS does not fully support
SEEK_DATA/SEEK_HOLE it may just implement kind of "wrapper" around
classic lseek() call. We must detect it here and try to use other
hole-detection methods. */
if (offset == 0 /* first loop */
&& data_offset == 0
&& hole_offset == st->stat.st_size)
{
lseek (fd, 0, SEEK_SET);
return false;
}
sp.offset = data_offset;
sp.numbytes = hole_offset - data_offset;
sparse_add_map (st, &sp);
st->archive_file_size += sp.numbytes;
offset = hole_offset;
}
return true;
}
#endif
static bool
sparse_scan_file (struct tar_sparse_file *file)
{
/* always check for completely sparse files */
if (sparse_scan_file_wholesparse (file))
return true;
switch (hole_detection)
{
case HOLE_DETECTION_DEFAULT:
case HOLE_DETECTION_SEEK:
#ifdef SEEK_HOLE
if (sparse_scan_file_seek (file))
return true;
#else
if (hole_detection == HOLE_DETECTION_SEEK)
WARN((0, 0,
_("\"seek\" hole detection is not supported, using \"raw\".")));
/* fall back to "raw" for this and all other files */
hole_detection = HOLE_DETECTION_RAW;
#endif
case HOLE_DETECTION_RAW:
if (sparse_scan_file_raw (file))
return true;
}
return false;
}
static struct tar_sparse_optab const oldgnu_optab;
static struct tar_sparse_optab const star_optab;
static struct tar_sparse_optab const pax_optab;

src/tar.c

@@ -362,6 +362,7 @@ enum
SHOW_TRANSFORMED_NAMES_OPTION,
SKIP_OLD_FILES_OPTION,
SORT_OPTION,
HOLE_DETECTION_OPTION,
SPARSE_VERSION_OPTION,
STRIP_COMPONENTS_OPTION,
SUFFIX_OPTION,
@@ -451,6 +452,8 @@ static struct argp_option options[] = {
{"sparse", 'S', 0, 0,
N_("handle sparse files efficiently"), GRID+1 },
{"hole-detection", HOLE_DETECTION_OPTION, N_("TYPE"), 0,
N_("technique to detect holes"), GRID+1 },
{"sparse-version", SPARSE_VERSION_OPTION, N_("MAJOR[.MINOR]"), 0,
N_("set version of the sparse format to use (implies --sparse)"), GRID+1},
{"incremental", 'G', 0, 0,
@@ -1464,6 +1467,19 @@ static int sort_mode_flag[] = {
};
ARGMATCH_VERIFY (sort_mode_arg, sort_mode_flag);
static char const *const hole_detection_args[] =
{
"raw", "seek", NULL
};
static int const hole_detection_types[] =
{
HOLE_DETECTION_RAW, HOLE_DETECTION_SEEK
};
ARGMATCH_VERIFY (hole_detection_args, hole_detection_types);
static void
set_old_files_option (int code, struct option_locus *loc)
@@ -1753,6 +1769,12 @@ parse_opt (int key, char *arg, struct argp_state *state)
set_old_files_option (SKIP_OLD_FILES, args->loc);
break;
case HOLE_DETECTION_OPTION:
hole_detection = XARGMATCH ("--hole-detection", arg,
hole_detection_args, hole_detection_types);
sparse_option = true;
break;
case SPARSE_VERSION_OPTION:
sparse_option = true;
{
@@ -2523,6 +2545,7 @@ decode_options (int argc, char **argv)
blocking_factor = DEFAULT_BLOCKING;
record_size = DEFAULT_BLOCKING * BLOCKSIZE;
excluded = new_exclude ();
hole_detection = HOLE_DETECTION_DEFAULT;
newer_mtime_option.tv_sec = TYPE_MINIMUM (time_t);
newer_mtime_option.tv_nsec = -1;

tests/.gitignore

@@ -9,3 +9,5 @@ argcv.h
genfile.c
genfile
download
ttyemu
checkseekhole

tests/Makefile.am

@@ -207,6 +207,7 @@ TESTSUITE_AT = \
sparse03.at\
sparse04.at\
sparse05.at\
sparse06.at\
sparsemv.at\
sparsemvp.at\
spmvp00.at\
@@ -275,13 +276,14 @@ installcheck-local: $(check_PROGRAMS)
## genfile ##
## ------------ ##
check_PROGRAMS = genfile
check_PROGRAMS = genfile checkseekhole
if TAR_COND_GRANTPT
check_PROGRAMS += ttyemu
endif
genfile_SOURCES = genfile.c argcv.c argcv.h
checkseekhole_SOURCES = checkseekhole.c
ttyemu_SOURCES = ttyemu.c

tests/checkseekhole.c

@@ -0,0 +1,92 @@
/* Test suite for GNU tar - SEEK_HOLE detector.
Copyright 2015 Free Software Foundation, Inc.
This program is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the
Free Software Foundation; either version 3, or (at your option) any later
version.
This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
Public License for more details.
You should have received a copy of the GNU General Public License along
with this program. If not, see <http://www.gnu.org/licenses/>.
Description: detect whether it is possible to work with SEEK_HOLE on
particular operating system and file system. */
#include "config.h"
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#include <stdlib.h>
#include <fcntl.h>
enum {
EX_OK = 0, /* SEEK_HOLE support */
EX_FAIL, /* test failed - no SEEK_HOLE support */
EX_BAD, /* test is not relevant */
};
int
check_seek_hole (int fd)
{
#ifdef SEEK_HOLE
struct stat stat;
off_t offset;
/* hole of 100MB */
if (lseek (fd, 100*1024*1024, SEEK_END) < 0)
return EX_BAD;
/* piece of data */
if (write (fd, "data\n", 5) != 5)
return EX_BAD;
/* another hole */
if (lseek (fd, 100*1024*1024, SEEK_END) < 0)
return EX_BAD;
/* piece of data */
if (write (fd, "data\n", 5) != 5)
return EX_BAD;
if (fstat (fd, &stat))
return EX_BAD;
offset = lseek (fd, 0, SEEK_DATA);
if (offset == (off_t)-1)
return EX_FAIL;
offset = lseek (fd, offset, SEEK_HOLE);
if (offset == (off_t)-1 || offset == stat.st_size)
return EX_FAIL;
return EX_OK;
#else
return EX_BAD;
#endif
}
int
main ()
{
#ifdef SEEK_HOLE
int rc;
char template[] = "testseekhole-XXXXXX";
int fd = mkstemp (template);
if (fd == -1)
return EX_BAD;
rc = check_seek_hole (fd);
close (fd);
unlink (template);
return rc;
#else
return EX_FAIL;
#endif
}

tests/sparse02.at

@@ -27,7 +27,7 @@ AT_KEYWORDS([sparse sparse02])
AT_TAR_CHECK([
genfile --sparse --file sparsefile --block-size 512 0 ABCD 1M EFGH 2000K IJKL || AT_SKIP_TEST
tar -c -f archive --sparse sparsefile || exit 1
tar --hole-detection=raw -c -f archive --sparse sparsefile || exit 1
echo separator
tar xfO archive | cat - > sparsecopy || exit 1

tests/sparse06.at

@@ -0,0 +1,56 @@
# Process this file with autom4te to create testsuite. -*- Autotest -*-
#
# Test suite for GNU tar.
# Copyright 2014 Free Software Foundation, Inc.
# This file is part of GNU tar.
# GNU tar is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 3 of the License, or
# (at your option) any later version.
# GNU tar is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
AT_SETUP([storing sparse file using seek method])
AT_KEYWORDS([sparse sparse06])
m4_define([check_pattern],[
rm -rf out archive.tar smallsparse && mkdir out
genfile --sparse --file smallsparse $1
tar -cSf archive.tar smallsparse
tar -xf archive.tar -C out
cmp smallsparse out/smallsparse
])
AT_TAR_CHECK([
AT_SEEKHOLE_PREREQ
AT_TIMEOUT_PREREQ
TAR_OPTIONS="$TAR_OPTIONS --hole-detection=seek"
genfile --sparse --file bigsparse 0 ABC 8G DEF
timeout 2 tar -cSf a bigsparse
test $? -eq 0 || exit 1
check_pattern([0 ABC])
check_pattern([0 ABC 10M])
check_pattern([0 ABC 10M DEF])
check_pattern([10M])
check_pattern([10M ABC])
check_pattern([10M ABC 20M])
check_pattern([10M DEF 20M GHI 30M JKL 40M])
],
[0],,
[genfile: created file is not sparse
],,,[posix])
AT_CLEANUP

tests/sparsemv.at

@@ -30,6 +30,7 @@ AT_KEYWORDS([sparse multiv sparsemv])
AT_TAR_CHECK([
exec <&-
TAR_OPTIONS="$TAR_OPTIONS --hole-detection=raw"
genfile --sparse --file sparsefile 0 ABCDEFGHIJK 1M ABCDEFGHI || AT_SKIP_TEST
echo "Pass 1: Split between data blocks"
echo "Create archive"

tests/sparsemvp.at

@@ -26,6 +26,7 @@ dnl TAR_MVP_TEST version map1 map2
m4_define([TAR_MVP_TEST],[
AT_TAR_CHECK([
exec <&-
TAR_OPTIONS="$TAR_OPTIONS --hole-detection=raw"
genfile --sparse --file sparsefile $2 || AT_SKIP_TEST
echo "Pass 1: Split between data blocks"
echo "Create archive"

tests/testsuite.at

@@ -112,6 +112,19 @@ rm -f $[]$
test $result -eq 0 || AT_SKIP_TEST
])
dnl AT_SEEKHOLE_PREREQ
m4_define([AT_SEEKHOLE_PREREQ],[
checkseekhole || AT_SKIP_TEST
])
m4_define([AT_TIMEOUT_PREREQ],[
timeout 100 true
if test $? -ne 0; then
echo >&2 "the 'timeout' utility not found"
AT_SKIP_TEST
fi
])
m4_define([AT_TAR_MKHIER],[
install-sh -d $1 >/dev/null dnl
m4_if([$2],,,&& genfile --file [$1]/[$2]) || AT_SKIP_TEST])
@@ -358,6 +371,7 @@ m4_include([sparse02.at])
m4_include([sparse03.at])
m4_include([sparse04.at])
m4_include([sparse05.at])
m4_include([sparse06.at])
m4_include([sparsemv.at])
m4_include([spmvp00.at])
m4_include([spmvp01.at])