Commit Graph

54 Commits

Author SHA1 Message Date
Raphael S. Carvalho
20a3d5773b sstables: add create_data()
Intended to create both index and data file based on current generation
of the sstables. This function is similar to open_data(), which only
opens both files, relying on their existence.
This function is a small step towards the write support of both data
and index files.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-04-20 11:52:02 +03:00
Raphael S. Carvalho
fdf50ef643 sstables: add initial support to compression
Starting with LZ4, the default compressor.
Stub functions were added to other compression algorithms, which should
eventually be replaced with an actual implementation.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
2015-04-19 10:07:29 +03:00
Nadav Har'El
486e6271a1 sstables: data file row reading and streaming
The previous implementation could read either one sstable row or several,
but only when all the data was read in advance into a contiguous memory
buffer.

This patch changes the row read implementation into a state machine,
which can work on either a pre-read buffer, or data streamed via the
input_stream::consume() function:

The sstable::data_consume_rows_at_once() method reads the given byte range
into memory and then processes it, while the sstable::data_consume_rows()
method reads the data piecemeal, without trying to fit all of it into
memory. The first function is (or will be...) optimized for reading one
row, and the second function for iterating over all rows - although both
can be used to read any number of rows.

The state-machine implementation is unfortunately a bit ugly (and much
longer than the code it replaces), and could probably be improved in the
future. But the focus was parsing performance: when we use large buffers
(the default is 8192 bytes), most of the time we don't need to read
byte-by-byte, and efficiently read entire integers at once, or even larger
chunks. For strings (like column names and values), we even avoid copying
them if they don't cross a buffer boundary.

To test the rare boundary-crossing case despite having a small sstable,
the code includes in "#if 0" a hack to split one buffer into many tiny
buffers (1 byte, or any other number) and process them one by one.
The tests still pass with this hack turned on.
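
For illustration only (this is not the code from the patch), the idea can be sketched as a tiny state machine that decodes big-endian 32-bit integers from buffers of arbitrary size. The class name u32_reader and its members are hypothetical. When a whole integer lies inside the current buffer it is read at once; otherwise the bytes are accumulated across calls.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical illustration: decodes a stream of big-endian uint32_t values
// that may be split across buffers at arbitrary points.
class u32_reader {
    uint32_t _partial = 0;   // bytes accumulated so far for the current value
    unsigned _have = 0;      // number of bytes of the current value seen (0..3)
public:
    std::vector<uint32_t> values;

    // Consume one buffer; may be called repeatedly with consecutive chunks.
    void consume(const unsigned char* p, size_t n) {
        assert(p != nullptr || n == 0);
        size_t i = 0;
        while (i < n) {
            if (_have == 0 && n - i >= 4) {
                // Fast path: the whole integer is inside this buffer.
                values.push_back(uint32_t(p[i]) << 24 | uint32_t(p[i + 1]) << 16 |
                                 uint32_t(p[i + 2]) << 8 | uint32_t(p[i + 3]));
                i += 4;
            } else {
                // Slow path: the integer crosses a buffer boundary.
                _partial = (_partial << 8) | p[i++];
                if (++_have == 4) {
                    values.push_back(_partial);
                    _partial = 0;
                    _have = 0;
                }
            }
        }
    }

    // True if the input so far ends in the middle of a value.
    bool mid_value() const { return _have != 0; }
};
```

Feeding the same bytes one at a time exercises only the slow path, which is the effect of the buffer-splitting hack described above.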

This implementation of sstable reading also adds a feature not present
in the previous version: reading range tombstones. An sstable with an
INSERT of a collection always has a range tombstone (to delete all old
items from the collection), so we need this feature to read collections.
A test for this is included in this patch.

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
2015-04-13 17:40:46 +03:00
Raphael S. Carvalho
2ecc93523f sstables: add support to write the component TOC
The on-disk format is a list of component names, each followed by a
newline character. The text is ASCII-encoded.
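
A minimal sketch of that format, assuming the component names are already available as strings; make_toc is a hypothetical helper, not the function added by this patch.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: serializes component names, one per line.
std::string make_toc(const std::vector<std::string>& components) {
    std::string out;
    for (const auto& name : components) {
        // A name containing a newline would corrupt the line-based format.
        assert(name.find('\n') == std::string::npos);
        out += name;
        out += '\n';
    }
    return out;
}
```

The resulting string can then be written to the TOC file as-is.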

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-04-11 11:34:38 +03:00
Raphael S. Carvalho
c6e31346d8 sstables: add support to write the component Summary
The definition of summary_la in types.hh provides a good explanation
of the on-disk format of the Summary file.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-04-11 11:24:53 +03:00
Glauber Costa
a505ac487f sstables: use bytes instead of sstring
We should have done that from the start

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-04-11 11:22:30 +03:00
Avi Kivity
30b40bf7b1 db: make bytes even more distinct from sstring
bytes and sstring are distinct types, since their internal buffers are of
different length, but bytes_view is an alias of sstring_view, which makes
it possible for objects of different types to leak across the abstraction
boundary.

Fix this by making bytes a basic_sstring<int8_t, ...> instead of using char.
int8_t is a 'signed char', which is a distinct type from char, so now
bytes_view is a distinct type from sstring_view.

uint8_t would have been an even better choice, but that diverges from Origin
and would have required an audit.
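
The effect can be sketched without the real basic_sstring. Here basic_view is a hypothetical stand-in for a view type parameterized on its character type; the static_asserts show that the char and int8_t instantiations are unrelated types.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical stand-in for a view type parameterized on its character type.
template <typename CharT>
struct basic_view {
    const CharT* data;
    std::size_t size;
};

using text_view  = basic_view<char>;          // plays the role of sstring_view
using bytes_view = basic_view<std::int8_t>;   // plays the role of bytes_view

// char and signed char are distinct types, so the two views are distinct too,
// and a bytes_view cannot be passed where a text_view is expected.
static_assert(!std::is_same<char, signed char>::value, "distinct character types");
static_assert(!std::is_same<text_view, bytes_view>::value, "distinct view types");
```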
2015-04-07 10:56:19 +03:00
Nadav Har'El
de58d08e59 sstable: fix compressed data file stream bug
We need to update _pos after we read, or we keep reading the same
chunk over and over :-( Also, don't read anything if we're already past
the end of file.
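
The two fixes can be sketched with a simplified reader over an in-memory string. This is not the actual compressed stream code: chunk_reader is hypothetical and performs no decompression.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical chunked reader over an in-memory "file".
// The referenced string must outlive the reader.
class chunk_reader {
    const std::string& _file;
    size_t _chunk_size;
    size_t _pos = 0;
public:
    chunk_reader(const std::string& file, size_t chunk_size)
        : _file(file), _chunk_size(chunk_size) {
        assert(chunk_size > 0);
    }

    // Returns the next chunk, or an empty string at end of file.
    std::string get() {
        if (_pos >= _file.size()) {
            return {};   // already past the end of file: read nothing
        }
        size_t len = std::min(_chunk_size, _file.size() - _pos);
        std::string chunk = _file.substr(_pos, len);
        _pos += len;     // without this, every call would return the same chunk
        return chunk;
    }
};
```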

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
2015-04-07 10:47:29 +03:00
Nadav Har'El
a53ef6028c sstables: use do_with idiom
[v2: rebase, move "&", and tone down recommendation of template lambda]

Before commit f49c065cd1, we had the buggy
but simple-looking code

    return data_stream_at(pos).read_exactly(len);

This was a bug because we need to ensure that the temporary object returned
by data_stream_at(pos) continues to live until the read_exactly() future
concludes. We fixed this bug with the following unsightly, hard-to-read code:

    auto stream = std::make_unique<input_stream<char>>(data_stream_at(pos));
    auto fut = stream->read_exactly(len);
    return fut.then([stream = std::move(stream)]
         (temporary_buffer<char> buf) { return buf; });

Instead, we can use the new do_with() idiom, which was exactly designed
to make a temporary object live until a future concludes. So we can write
the much shorter, and easier to understand, code:

    return do_with(data_stream_at(pos), [len] (auto &stream) {
        return stream.read_exactly(len);
    });

Note the C++14 template lambda (the "auto" in the argument of the lambda):
This lambda gets whatever we feed it (in this case, the stream returned
by data_stream_at). The "&" after the "auto" is important: without it,
the compiler tries to pass the object by value, which is impossible
(because it is not copyable).

Of course it would have also been possible to specify the stream's type
explicitly instead of using template lambda:

    return do_with(data_stream_at(pos), [len] (input_stream<char> &stream) {
        return stream.read_exactly(len);
    });

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
2015-04-05 17:15:20 +03:00
Raphael S. Carvalho
08e5d3ca8b sstables: add support to write the component statistics
This code adds the ability to write statistics to disk.

On-disk format:

uint32_t Size;
struct {
    uint32_t metadata_type;
    uint32_t offset; /* offset into this file */
} metadata_metadata[Size];

* each metadata_metadata entry corresponds to a metadata
stored in the file.
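
A rough sketch of producing this layout follows. Two things are assumptions made for illustration and are not stated above: integers are written big-endian, and the metadata bodies follow the table back to back. The names write_statistics and put_u32 are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

using buffer = std::vector<uint8_t>;

// Appends v in big-endian byte order (an assumption; check the real format).
inline void put_u32(buffer& out, uint32_t v) {
    for (int shift = 24; shift >= 0; shift -= 8) {
        out.push_back(uint8_t(v >> shift));
    }
}

// Each input pair is (metadata_type, already serialized body of that metadata).
buffer write_statistics(const std::vector<std::pair<uint32_t, buffer>>& parts) {
    assert(parts.size() <= (UINT32_MAX - 4) / 8);
    buffer out;
    put_u32(out, uint32_t(parts.size()));               // Size
    uint32_t offset = uint32_t(4 + 8 * parts.size());   // first body starts after the table
    for (const auto& p : parts) {
        put_u32(out, p.first);                          // metadata_type
        put_u32(out, offset);                           // offset into this file
        offset += uint32_t(p.second.size());
    }
    for (const auto& p : parts) {
        out.insert(out.end(), p.second.begin(), p.second.end());
    }
    return out;
}
```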

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-04-04 12:51:50 +03:00
Nadav Har'El
43b93058a5 sstables: add row consuming function
Add a function sstables::data_consume_row() which reads an entire row
(or several consecutive rows) at a given byte range in the data file,
and feeds them into a "row_consumer" implementation which the user provides.

The row_consumer's consume_row_start() method is called at the
beginning of the (or each) row with its key and deletion information,
then the consume_cell() method is called for each of the row's cells,
and after all cells of the row, consume_row_end() is called.
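
The calling sequence can be sketched as follows. The signatures are guesses made for illustration (the real row_consumer takes different argument types), and counting_consumer is a hypothetical user-provided implementation.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical signatures; the real row_consumer interface may differ.
struct row_consumer {
    virtual ~row_consumer() = default;
    virtual void consume_row_start(const std::string& key, bool deleted) = 0;
    virtual void consume_cell(const std::string& name, const std::string& value) = 0;
    virtual void consume_row_end() = 0;
};

// Example consumer: records each row key and counts its cells.
struct counting_consumer : row_consumer {
    std::vector<std::pair<std::string, int>> rows;
    bool in_row = false;

    void consume_row_start(const std::string& key, bool) override {
        assert(!in_row);   // rows must not nest
        rows.emplace_back(key, 0);
        in_row = true;
    }
    void consume_cell(const std::string&, const std::string&) override {
        assert(in_row);    // cells only arrive between row start and row end
        ++rows.back().second;
    }
    void consume_row_end() override {
        assert(in_row);
        in_row = false;
    }
};
```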

The current implementation only supports regular cells, and not other
special cases like range tombstones and counters (see
https://github.com/cloudius-systems/urchin/wiki/SSTables%20Data%20File)
as I did not yet have sstables to test those on; the current
implementation will abort upon seeing these unsupported features.

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
2015-04-03 12:37:44 +03:00
Glauber Costa
939f8c4290 tests: remove "for_testing" suffix
The solution we now have in tree for testing is obviously superior to this one.
Let's switch to that.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-04-03 00:39:15 +03:00
Glauber Costa
8cbd4e8358 sstable: test internal members
We are currently testing internal members of the sstable by specifying a bunch of
friend classes in the sstable structure. We have established that this is not the
ideal solution, but it is working.

My proposal here is to change that slightly: have a placeholder class defined in
sstables.hh, that will then re-export publicly every method it wants to use. (Thanks
Avi for suggesting that)

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-04-02 23:48:35 +03:00
Raphael S. Carvalho
f5f4b20d1b sstables: improve check_truncate_and_assign()
+    if (from >= std::numeric_limits<T>::max()) {
Avi explains an issue with the snippet above from the function:
This misses the case where either type is signed. At best you'd
get a compiler warning about comparing types with different
signedness, at worst a negative value can be truncated.
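
One possible way to cover every signed/unsigned combination is sketched below. This is not necessarily what the patch does; fits_in is a hypothetical helper, and the exception type is an arbitrary choice.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <type_traits>

// True if `from` is representable in T, for any mix of signed and unsigned
// integer types.
template <typename T, typename U>
bool fits_in(U from) {
    static_assert(std::is_integral<T>::value && std::is_integral<U>::value,
                  "integer types only");
    if (std::is_signed<U>::value && from < U(0)) {
        // A negative value never fits in an unsigned T; otherwise compare as signed.
        return std::is_signed<T>::value &&
               static_cast<std::intmax_t>(from) >=
                   static_cast<std::intmax_t>(std::numeric_limits<T>::min());
    }
    // Non-negative from here on, so an unsigned comparison is safe.
    return static_cast<std::uintmax_t>(from) <=
           static_cast<std::uintmax_t>(std::numeric_limits<T>::max());
}

template <typename T, typename U>
void check_truncate_and_assign(T& to, U from) {
    if (!fits_in<T>(from)) {
        throw std::overflow_error("value does not fit in destination type");
    }
    to = static_cast<T>(from);
    assert(static_cast<U>(to) == from);   // the value must survive the round trip
}
```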

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-04-02 09:53:46 +03:00
Raphael S. Carvalho
44735a3c88 sstables: add support to write the component filter
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-03-31 17:11:00 +03:00
Nadav Har'El
f49c065cd1 sstables: fix use-after-free of temporary object
The debug-mode sanitizer discovered a bug in sstable::data_read.
I had optimistically written this code:

     return data_stream_at(pos).read_exactly(len);

This is wrong - data_stream_at returns a temporary input_stream object,
which gets destructed immediately and doesn't live throughout the life
of read_exactly. Obviously, this object does need to live on (among other
things, it holds the buffer which read_exactly reads into).

The solution is an ugly variant of the same thing, but which allocates
memory to hold a copy of the input stream object. Because there is no
single reader (in theory we can have a hundred different reads ongoing
in parallel from the same sstable), we really have no choice but to allocate
this read context somewhere. A better solution would not use an input
stream at all, but this is a different issue, already in a FIXME.

This patch fixes the sstable test failure that Jenkins reports.

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
2015-03-30 18:25:38 +02:00
Nadav Har'El
f80ac5a629 sstables: rework compression metadata to fix test.
Previously we had both a "compression" structure (read from the Compression
Info file on disk) and a "compression_metadata" class with additional
information, which std::move()ed parts of the compression structure.
This caused problems for the simplistic sstable-writing test (which does
the non-interesting thing of writing a previously-read sstable).

I'm ashamed to say, fixing this was very hard, because all this code is
built like a house of cards - try to change one thing, and everything
falls apart. After many failed attempts in trying to improve this code, what
I ended up doing is simply *extending* the "compression" structure - the
extended part isn't read or written, but it is in the structure.

We also no longer move a shared pointer to the compression structure,
but rather just an ordinary pointer; the assumption is that the user
will already make sure that the sstable structure will live for the
duration of any processing on it - and the compression structure is just
one part of this sstable structure.

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
2015-03-29 16:14:53 +03:00
Nadav Har'El
d305e1f95c sstables: add FIXME
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
2015-03-25 15:45:00 +02:00
Raphael S. Carvalho
92b75413cb sstables: add function to set generation
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-03-25 13:23:24 +02:00
Raphael S. Carvalho
daaa1a6dcb sstables: extend it to support write of components
For the time being, compression info is the only component written
by store(). The changes introduced by this patch are generic, so as
to make it easier to write other components as well.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-03-25 13:22:42 +02:00
Nadav Har'El
891763d7fd sstables: implement low-level data file random-access reading
This patch implements sstable::data_read(pos, len) to do random-access
read of a specific byte range from the data file. Later we'll determine
the byte range needed to read a specific row, using the summary and index
files.

This function works for either a compressed or uncompressed data file.
To support the compressed data file, we need to determine when opening
the sstable also the data file's size, and then make a compression_metadata
object using the data we read from the compression file and the data
file's size.

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
2015-03-25 12:35:32 +02:00
Nadav Har'El
4d57c8fd28 sstables: fix LZ4 decompression
It turns out that Cassandra's LZ4Compressor doesn't use the LZ4
compressor directly - instead it prepends the uncompressed length,
in 4-byte little-endian (!) encoding, to the compressed chunk.
We don't need this extra information - we already know the expected
uncompressed chunk length, so we need to just skip it.
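
A small sketch of the chunk layout described above: it decodes the 4-byte little-endian prefix and returns a view of the bytes after it. The names chunk_view, read_le32 and lz4_payload are hypothetical, and no LZ4 library call is made here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct chunk_view {
    const unsigned char* data;
    size_t size;
};

// Decodes the 4-byte little-endian uncompressed length at the start of a chunk.
inline uint32_t read_le32(const unsigned char* p) {
    return uint32_t(p[0]) | uint32_t(p[1]) << 8 |
           uint32_t(p[2]) << 16 | uint32_t(p[3]) << 24;
}

// Returns the raw LZ4 payload of a chunk, skipping the length prefix.
inline chunk_view lz4_payload(const unsigned char* chunk, size_t size) {
    assert(size >= 4);   // a chunk always starts with the 4-byte length
    return chunk_view{chunk + 4, size - 4};
}
```

The returned view is what would be handed to the LZ4 decompressor, together with the chunk length we already know.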

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
2015-03-25 12:35:32 +02:00
Nadav Har'El
c6eb2a87ea Move compress.{cc,hh} to sstables/
Move compress.{cc,hh} from db/ to sstables/. This makes more sense, as
this code is only used for sstables (un)compression.

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
2015-03-24 16:54:58 +02:00
Glauber Costa
e695c1632c sstable: return a format or version given a string
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-03-24 15:52:23 +02:00
Glauber Costa
a916b66984 sstable: generalize reverse transversal code.
While reading the TOC, we are given a string and then traverse the components
map in search of the key corresponding to that value. This behavior will also
be necessary when we are parsing the filename, for version and format.
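
A generic reverse lookup of this kind might look like the sketch below. The function name and the choice of exception are assumptions, not taken from the patch.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// Finds the key whose mapped value equals `value` by scanning the whole map.
// A reverse lookup is only well defined if the values are unique.
template <typename Map>
typename Map::key_type reverse_lookup(const Map& m,
                                      const typename Map::mapped_type& value) {
    const typename Map::key_type* found = nullptr;
    for (const auto& kv : m) {
        if (kv.second == value) {
            assert(found == nullptr && "values must be unique");
            found = &kv.first;
        }
    }
    if (!found) {
        throw std::out_of_range("no key maps to the requested value");
    }
    return *found;
}
```

The same helper would work for any map whose values are unique, such as the version and format maps mentioned above.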

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-03-24 15:52:22 +02:00
Glauber Costa
041933b179 sstables: epoch -> generation
Name matches Origin.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-03-24 15:52:20 +02:00
Nadav Har'El
4db17a454c sstables: add compressed reading to sstable.cc
This patch adds compressed_file_input_stream, which is a
"random_access_reader" subclass just like the existing file_input_stream.

Changed all the parsers to take a reference to random_access_reader (the
base class) instead of file_input_stream.

The code is now ready (sort of) for compressed-file reading, but it doesn't
actually do that: sstable::read_simple() still always uses file_input_stream
for now.

Note: despite all the layers of classes holding our input streams, we
actually pay surprisingly little cost for virtual function calls or pointer
dereferencing: random_access_reader::read_exactly is *not* a virtual function -
it always calls _in.read_exactly(). This is input_stream::read_exactly(),
which again is not a virtual function (the rarely called get() is virtual,
but input_stream::read_exactly() usually just reads from an already
available buffer).

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
2015-03-24 15:08:14 +02:00
Glauber Costa
37b9b4a08f sstable summary: provide method to query an index in the summary
Although the entries are in an array, and live on disk, the disk array
abstraction turns out to be a bad abstraction to read it in. This is because,
contrary to other types, the key sizes are not to be found on-disk. It is a lot
more convenient to treat it as a normal array to be constructed as a separate
step.

We will construct this array at load time, and provide a method that, given an
index, returns the corresponding key/position.  After a binary search - to be
implemented - we'll be able to fetch the real data.
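
The planned binary search could look roughly like this. It is only a sketch: summary_entry and find_entry are hypothetical, and ordering keys by plain string comparison is an assumption made for illustration (the real ordering may differ).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

struct summary_entry {
    std::string key;
    uint64_t position;   // position in the index file
};

// Returns the index of the last entry whose key is <= target,
// or -1 if target sorts before every entry. Entries must be sorted by key.
long find_entry(const std::vector<summary_entry>& entries, const std::string& target) {
    assert(std::is_sorted(entries.begin(), entries.end(),
        [] (const summary_entry& a, const summary_entry& b) { return a.key < b.key; }));
    auto it = std::upper_bound(entries.begin(), entries.end(), target,
        [] (const std::string& t, const summary_entry& e) { return t < e.key; });
    return long(it - entries.begin()) - 1;
}
```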

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-03-24 11:31:20 +02:00
Glauber Costa
bf5919de44 sstable: do not resize positions array in Summary
I should have written "reserve" instead of "resize". In any case, neither
is necessary, since we move-assign the final array into this one a couple
of lines later anyway.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-03-24 11:31:20 +02:00
Glauber Costa
e7011c1ce9 sstable: add two more summary fields
After the keys array, the Summary file includes the first and last keys in this file's
range. Add this to the format.

Note that there is still more information after that. But that seems to be related to
the writing method (it says mmap in my files), and not relevant for us.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-03-24 11:31:19 +02:00
Glauber Costa
074f69806a extract on-disk integer parsing
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-03-24 11:31:19 +02:00
Glauber Costa
9284d1c8d1 sstables: fix bug with debug runs
While improving the summary code, the debug build code started to fail, for
apparently no reason. The code being introduced wasn't really doing anything
other than making the execution potentially slightly longer.

Still, the debug run was complaining about a use-after-free, with the
following stack trace in the free side:

freed by thread T1 here:
    #0 0x7f3b1145f64f in operator delete(void*) (/lib64/libasan.so.1+0x5864f)
    #1 0xc615f3 in posix_file_impl::~posix_file_impl() core/file.hh:75
    #2 0x45ae47 in std::default_delete<file_impl>::operator()(file_impl*) const /usr/include/c++/4.9.2/bits/unique_ptr.h:76
    #3 0x442039 in std::unique_ptr<file_impl, std::default_delete<file_impl> >::~unique_ptr() /usr/include/c++/4.9.2/bits/unique_ptr.h:236
    #4 0x438da0 in file::~file() core/file.hh:109
    #5 0x4fd70a in apply core/apply.hh:34
    #6 0x4fd821 in apply<sstables::sstable::read_toc()::<lambda(file)>, file> core/apply.hh:42
    #7 0x4fd91f in apply<sstables::sstable::read_toc()::<lambda(file)>, file> core/future.hh:685

After staring at the code for a while, my main diagnosis was that while in most
of the sstable reading functions we move the file object inside a stream
reader, we don't do that in the TOC. The file is just there and can be freed
after the future that contains it returns, which wasn't happening so far, but can
happen depending on the timing involved.

The following patch moves the file inside the lambda of the following future,
making sure it is not destroyed while the file is still being read. It fixes
the problem with the debug build in my tree.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-03-24 11:31:18 +02:00
Glauber Costa
bcf7e42933 column mask
Each column has a byte in the file that determines how to process whatever
data comes next. In the actual file, we can see one of those values, or a
combination of them.

Because it is an enum, no new parser is needed.
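
As a sketch of such a mask, the enum below uses the flag values of Cassandra's column serializer as I recall them; treat both the values and the names as assumptions to be verified against the format documentation.

```cpp
#include <cassert>
#include <cstdint>

// Flag values assumed from Cassandra's on-disk format; verify before relying on them.
enum class column_mask : uint8_t {
    none            = 0x00,
    deletion        = 0x01,
    expiration      = 0x02,
    counter         = 0x04,
    counter_update  = 0x08,
    range_tombstone = 0x10,
};

inline column_mask operator|(column_mask a, column_mask b) {
    return column_mask(uint8_t(a) | uint8_t(b));
}

// True if `flag` is set in `value`.
inline bool has(column_mask value, column_mask flag) {
    assert(flag != column_mask::none);   // testing for "none" is meaningless
    return (uint8_t(value) & uint8_t(flag)) != 0;
}

// The mask is the single byte that precedes the column data.
inline column_mask parse_mask(uint8_t raw) {
    return column_mask(raw);
}
```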

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-03-15 10:47:34 +02:00
Glauber Costa
4e73bf8b11 sstables: deletion_time structure
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-03-15 10:47:33 +02:00
Glauber Costa
c0ad2a8e0e sstables: parse the index file
We usually don't read the whole file into memory, so the probing interface will
also allow for the specification of boundaries that we should use for
reading.

The sstable needs to be informed - usually by the schema - of how many columns
the partition key is composed of: one for simple keys, more than one for
composites.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-03-10 15:13:14 -03:00
Glauber Costa
a3febe2ae0 sstable: open data and index files.
Because we're expected to go to those files many times, it doesn't make sense
to keep opening them. Upon sstable load, we will open those files and then move
the reference to the sstable main structure. From this point on, we can just seek
to the position we want.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-03-10 15:13:14 -03:00
Glauber Costa
cea4825642 sstables: change exception message
So that it represents all kinds of mismatches, not only buf < expected.
It will be useful in the index parsing code.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-03-10 15:13:14 -03:00
Glauber Costa
d19abb59fb file_input_stream: accept a shared pointer in the constructor
Right now we will always move the file to create a shared pointer from it.
However, there are situations in which we don't really want to move it, because
it will be still used elsewhere.

One example is the index and data readers, where we will store a file object to
avoid opening the file all the time. In such a situation, we can pass a shared
pointer that is already constructed to the file_input_stream.

The alternative to that would have been to store the file_input_stream itself.
That would, however, require us to export that in some header. It's best to
keep it private. Since we will already deal with a shared pointer anyway, it is
best to provide this option.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-03-10 15:03:51 -03:00
Glauber Costa
87e6b7ab56 sstables: signal eof in file_input_stream
We signal that condition in the underlying input_stream. Export that
to the file stream.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-03-10 15:03:51 -03:00
Glauber Costa
7d43e26c58 sstables: use net::packed in potentially unaligned accesses
Fixes debug build test code.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-03-10 15:54:13 +02:00
Avi Kivity
9288b360f6 Merge branch 'master' of github.com:cloudius-systems/seastar into db
Includes adaptation by Nadav for the removal of file_input_stream:

sstables.cc used file_input_stream, which we replaced by the new
make_file_input_stream. We also couldn't make sstables.cc read either
a file_input_stream or the planned compressed_file_input_stream.

So in this patch we implement an API similar to the old "file_input_stream"
based on the new make_file_input_stream. file_input_stream now has a parent
class "random_access_reader", preparing for a future patch to support both
file_input_stream and compressed_file_input_stream in the same code - by making
all the parsers take a random_access_reader reference instead of file_input_stream.

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
2015-03-10 15:50:04 +02:00
Avi Kivity
5902243dc5 Merge branch 'master' of github.com:cloudius-systems/seastar into db
Global adjustment due to the removal of future<>::rescue().
2015-03-05 11:00:11 +02:00
Avi Kivity
f039904d75 Merge branch 'master' into db
Updated usages of std::hash<some_enum_type> to accommodate 25168fc73d.
2015-03-01 15:24:47 +02:00
Avi Kivity
81ca06fa36 sstable: fix TOC size check
Make sure we detect a too-large TOC correctly.
2015-02-28 23:50:36 +02:00
Avi Kivity
7441ce5b51 sstable: fix buffer overflow in TOC
boost::split() expects either a NUL terminated string or a proper container.
We give it neither.

Fix by wrapping the buffer in a string_view, which tells split() what size
the string is.
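
The same principle, shown without boost for illustration: a splitter that is told the buffer size explicitly never reads past it, so the buffer need not be NUL-terminated. split_lines is a hypothetical helper and is not the fix applied in this commit, which keeps boost::split() and wraps the buffer in a string_view.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Splits buf[0..size) on '\n' without ever reading past `size`.
std::vector<std::string> split_lines(const char* buf, size_t size) {
    assert(buf != nullptr || size == 0);
    std::vector<std::string> lines;
    size_t start = 0;
    for (size_t i = 0; i < size; ++i) {
        if (buf[i] == '\n') {
            lines.emplace_back(buf + start, i - start);
            start = i + 1;
        }
    }
    if (start < size) {
        // Last line without a trailing newline.
        lines.emplace_back(buf + start, size - start);
    }
    return lines;
}
```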
2015-02-28 23:49:07 +02:00
Glauber Costa
fb3682cb4f sstable statistics file
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
2015-02-26 12:14:19 -05:00
Glauber Costa
0d98caf885 summary file
TODO: read in the actual index. This is schema-dependent.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
2015-02-26 12:14:19 -05:00
Glauber Costa
1b75a5bccb bloom filter
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
2015-02-26 12:14:19 -05:00
Glauber Costa
d810f03bb7 compression file
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
2015-02-26 12:14:19 -05:00
Glauber Costa
3e4ab6848b read toc
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
2015-02-26 12:14:19 -05:00