continuous_data_consumer::fast_forward_to() returns a future which was
later ignored by data_consume_context::fast_forward_to().
With the current implementation, the future in question is always ready
and that's why the problem didn't manifest itself in the form of crashes
or invalid results.
Message-Id: <20170120105746.7300-1-pdziepak@scylladb.com>
If the sstable Summary is not present, Scylla does not refuse to boot but
instead creates the summary information on the fly. There is a bug in this
code, though. The Summary file is a map between keys and offsets into the Index
file, but the code created a map between keys and Data file offsets
instead. Fix it by keeping the offset of an index entry in the index_entry
structure and using it during Summary file creation.
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20161116165421.GA22296@scylladb.com>
Single partition and partition range reads are expected to behave
considerably differently, so it is worth having them use separate file
stream histories. This also makes reads use a different history for each
sstable, which is also a good thing.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
This patch allows sstable readers to be fast forwarded without having to
recreate the reader (and drop all buffers in the
process). It is built on top of index_reader and the ability of
data_consume_context to be fast forwarded.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
This patch adds support for more efficiently reading small parts of a large
partition, without reading the entire partition as we had to do so far.
This is done using the "promoted index".
The "promoted index" is stored in the sstable index file, and provides
for each large sstable row ("partition" in CQL nomenclature) a sample of
the column names at (for example) 64KB intervals. This means that when we
read a slice of columns (e.g., cql rows), or page through a large partition,
we do not have to read the entire partition from disk.
This patch only implements the read side of promoted index - a later patch
will add the write-side support (i.e., writing the promoted index to the
index file while saving the sstable). Nevertheless this patch can already
be tested by reading existing sstables from Cassandra which include a
promoted index - such as the one included in the test in the previous patch.
The use of the promoted index currently has two limitations:
1. It is only used when reading a single partition with sstable::read_row(),
not when scanning through many partitions with sstable::read_range_rows()
or sstable::read_rows().
2. It is only used when filtering a single clustering-key range, rather
than a list of disjoint ranges. A single range is the common case.
These two issues will be improved later. In the meantime, in those
unsupported cases we simply continue to read entire partitions, so we're no
worse off than before.
Also note that this patch only helps when sstable::read_row() is used with
a clustering-key prefix (i.e., a slice). Our higher-level request handling
code may decide to read an entire partition into the cache, and not use
a clustering-key prefix at all when reading. We will need to independently
improve the high-level code to use read_row()'s slicing capabilities
when paging through large partitions, for example.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
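The promoted-index lookup described above can be sketched as a binary search
over the sampled entries. This is a minimal, hypothetical illustration - the
`pi_entry` / `pi_lookup` names and the flat in-memory layout are assumptions
for the sketch, not Scylla's real structures:

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical, simplified promoted-index entry: the first column name
// covered by a ~64KB block of the partition, and that block's offset.
struct pi_entry {
    std::string first_name;  // first column name in the block
    uint64_t offset;         // offset of the block
};

// Find the offset of the block that may contain `name`: the last
// sampled entry whose first_name is <= name.
uint64_t pi_lookup(const std::vector<pi_entry>& index, const std::string& name) {
    auto it = std::upper_bound(index.begin(), index.end(), name,
        [](const std::string& n, const pi_entry& e) { return n < e.first_name; });
    if (it == index.begin()) {
        return 0;                    // name precedes all sampled names
    }
    return std::prev(it)->offset;    // start reading from this block
}
```

With 64KB sampling, a slice read only needs to start at the block returned by
the lookup instead of at the beginning of the partition.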
Currently, the main sstable data parsing entry point data_consume_rows()
takes a contiguous range of bytes to read from disk and parse. This range
is supposed to be an entire partition or a contiguous group of partitions,
and is self-contained (i.e., it can be parsed without extra information about
the identity of these partitions).
For the promoted index feature (which we will add in a following patch)
we will want the range to span only a part of a partition, and will need
the caller to provide some information not available to the parser (such
as the partition's key). In the future, we will also want to support a
vector of byte ranges, instead of just one.
So in preparation for this, this patch simply replaces the start/end pair
by a new class disk_read_range, which can be easily extended in later
patches. No new functionality is introduced in this patch.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
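The shape of the replacement class can be sketched as follows. This is an
assumed minimal form - the real disk_read_range may carry additional members;
only the start/end pair is taken from the description above:

```cpp
#include <cstdint>

// Minimal sketch of the disk_read_range idea: a small class wrapping the
// start/end byte offsets of a read, which can later grow extra fields
// (e.g. the partition key) without touching every call site that used to
// take a bare start/end pair.
class disk_read_range {
    uint64_t _start;
    uint64_t _end;
public:
    disk_read_range(uint64_t start, uint64_t end) : _start(start), _end(end) {}
    uint64_t start() const { return _start; }
    uint64_t end() const { return _end; }
    uint64_t size() const { return _end - _start; }
    bool empty() const { return _start == _end; }
};
```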
In a later patch adding "promoted index" read support, we would like to
parse only part of an sstable row. In that case, the parser should start
not at the usual ROW_START state, but rather at the ATOM_START state.
But there's a problem: The sstable parser consumer currently assumes that
the parser stops after the start of the row, before reading any atoms.
So in the partial row case too, we must stop parsing before reading the
first atom.
For this, this patch adds the new "STOP_THEN_ATOM_START" parser state.
When starting in this state, the parser stops immediately (with
row_consumer::proceed::no), and when restarted it will be in the
ATOM_START state.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
data_consume_rows_context needs to have close() called and the returned
future waited for before it can be destroyed. data_consume_context::impl
does that in the background upon its destruction.
However, it is possible that the sstable is removed before
data_consume_rows_context::close() completes in which case EBADF may
happen. The solution is to make data_consume_context::impl keep a
reference to the sstable and extend its life time until closing of
data_consume_rows_context (which is performed in the background)
completes.
A side effect of this change is that data_consume_context no longer
requires its user to ensure that the sstable exists as long as it is in
use, since it owns its own reference to it.
Fixes #1537.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1470222225-19948-1-git-send-email-pdziepak@scylladb.com>
Move the to_bytes_view(temporary_buffer<char>) function from a source file
to a header file where it can be used in more places.
This saves one use of reinterpret_cast (which we are now re-evaluating),
and moreover, we want to use this function in the promoted index code as
well (to return a bytes_view from the promoted index which was saved as a
temporary_buffer).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1468761437-27046-1-git-send-email-nyh@scylladb.com>
If read-ahead is going to be enabled, it is important to close an
input_stream<> properly (and wait for completion) before destroying it.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
When we do a streaming read that knows the expected *end* position of the
read, we can use a large read-ahead buffer, and at the same time, stop
reading at exactly the intended end (or small rounding of it to the DMA
block size) and not waste resources blindly reading a large amount of data
after the end just to fill the read-ahead buffer.
The sstable reading code, both for reading the data file and the index file,
created a file input stream without specifying its end, thereby losing
this optimization - so when a large buffer was used, we would get a large
over-read. This patch fixes this, so the sstable data file and index file are
read using a file input stream which is aware of its end.
Fixes #964.
Note that this patch does not change the behavior when reading a
*compressed* data file. For compressed read, we did not have the problem
of over-read in the first place, because chunks are read one by one.
But we do have other sources of inefficiencies there (stemming, again,
from the fact that the compressed chunks are read one by one), and I
opened a separate issue #992 for that.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1457219304-12680-1-git-send-email-nyh@scylladb.com>
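The read-sizing logic described above can be sketched like this. The function
and constant names are illustrative assumptions, not the actual stream
implementation; the point is clamping each read-ahead request to the known end,
rounded up only to the DMA block size:

```cpp
#include <algorithm>
#include <cstdint>

// Assumed DMA block size for the sketch.
constexpr uint64_t dma_block = 4096;

// Round a byte offset up to the next DMA block boundary.
uint64_t round_up_dma(uint64_t x) {
    return (x + dma_block - 1) / dma_block * dma_block;
}

// How many bytes to actually request at position `pos`: the full
// read-ahead buffer, unless the known end is closer, in which case we
// read just up to the (DMA-rounded) end and avoid the over-read.
uint64_t next_read_size(uint64_t pos, uint64_t end, uint64_t buffer_size) {
    return std::min(buffer_size, round_up_dma(end) - pos);
}
```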
The entire SSTable read path can now take an io_priority. The public
functions take a default parameter, which is Seastar's default priority.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
When the row consumer falls through from ATOM_NAME_BYTES to ATOM_MASK,
we assume that the mask can be consumed, but it may happen that
data.size() equals zero, so the mask cannot be consumed.
The solution is to add read_8 so that the code only falls through
if the mask can be consumed right away.
Fixes #197.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
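The safe-fallthrough pattern described above can be sketched with made-up
types. Here read_8() reports whether a byte was available, so the state
machine stays in ATOM_MASK and resumes later when the buffer runs dry - the
states and the parser shape are illustrative, not the real consumer:

```cpp
#include <cstdint>
#include <string_view>

enum class state { ATOM_NAME_BYTES, ATOM_MASK, DONE };

struct parser {
    state st = state::ATOM_NAME_BYTES;
    uint8_t mask = 0;

    // Returns true only if one byte could be consumed right away.
    bool read_8(std::string_view& data, uint8_t& out) {
        if (data.empty()) {
            return false;            // must wait for more data
        }
        out = static_cast<uint8_t>(data[0]);
        data.remove_prefix(1);
        return true;
    }

    void process(std::string_view data) {
        switch (st) {
        case state::ATOM_NAME_BYTES:
            // ... name bytes would be consumed here ...
            st = state::ATOM_MASK;
            [[fallthrough]];         // safe: ATOM_MASK checks availability
        case state::ATOM_MASK:
            if (!read_8(data, mask)) {
                return;              // no data yet: stay in ATOM_MASK
            }
            st = state::DONE;
            return;
        default:
            return;
        }
    }
};
```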
Since this class is a base of the child, we don't need to pass a reference: we can
just cast our 'this' pointer.
By doing that, the move constructor can come back.
Welcome back, move constructor.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
In order to reuse the NSM in other scenarios, we need to push as much code
as possible into a common class.
This patch does that, making the continuous_data_consumer class now the main
placeholder for the NSM class. The actual readers will have to inherit from it.
However, despite using inheritance, I am not using virtual functions at all;
instead, we let the continuous_data_consumer receive an instance of the derived
class, and then it can safely call its methods without paying the cost of
virtual functions.
In another attempt, I had kept the main process() function in the derived class,
which then had the responsibility of coding the loop.
With the use of the new pattern, we can keep the loop logic in the base class,
which is a lot cleaner. There is a performance penalty associated with it, but
it is fairly small: 0.5 % in the sequential_read perf_sstable test. I think we
can live with it.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
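The inheritance-without-virtuals pattern described above is essentially CRTP:
the base is templated on the derived class and statically casts `this` to call
the derived hooks directly. A minimal sketch, with illustrative names rather
than the real continuous_data_consumer interface:

```cpp
#include <cstddef>
#include <string_view>

// The shared loop lives in the base; Derived supplies process_buffer()
// as a plain non-virtual member, so calls are resolved at compile time.
template <typename Derived>
class continuous_data_consumer {
public:
    size_t consume(std::string_view data) {
        size_t total = 0;
        while (!data.empty()) {
            // Static downcast: no vtable, no indirect call.
            size_t n = static_cast<Derived*>(this)->process_buffer(data);
            data.remove_prefix(n);
            total += n;
        }
        return total;
    }
};

// An example derived consumer that just counts the bytes it is fed.
class byte_counter : public continuous_data_consumer<byte_counter> {
public:
    size_t processed = 0;
    size_t process_buffer(std::string_view data) {
        processed += data.size();
        return data.size();      // consume everything offered
    }
};
```

This keeps the loop logic in one place while letting each reader define only
its per-buffer processing, which is the tradeoff the message above describes.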
In my previous attempt, I have separated the state processor for the main loop,
leaving that to be filled by a derived class.
That felt a lot more natural, because then we don't have to replicate the loop
logic in the derived classes.
But well, oh well, life is hard. Especially on fast paths. Doing that makes us
insert an extra call in this loop, and that is noticeable: we would be 1.5 %
slower, and that is not even counting the cost of making the state processing a
virtual function later on.
I could just argue that this is acceptable due to the decoupling gains, but why
would I argue that, if I can just rewrite it in a way that no performance is
lost?
And then I did. The disadvantage of this is that the derived class now has to
re-code the loop, but no performance is lost. An advantage, on the other hand,
is that the derived class can now call into process_buffer directly, without
using virtual functions in this path.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
It is now a static member that gets the instance members as parameters. There
is no reason for that, and this will complicate the decoupling, since the
prestate reader won't know about state.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Continuing the work of decoupling the prestate and state parts of the NSM
so we can reuse it, move the proceed class to a different holding class.
Proceeding or not has nothing to do with "rows".
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Because we previously had no way to know whether or not a read completed,
we would always go back to the main loop, and would only optimize sequential
reads for some kinds of data.
However, as one could see in the previous patch, the new read_X functions
notify completion, allowing us to just fall through to the next case if that is
the only possibility. In most cases, it isn't. With this, we can apply this
optimization throughout all cases where we don't branch states, and with a very
elegant resulting code.
The performance actually increases by 0.75 %. It is not much, but it is more
than the error margin (which sits at 0.20 %), and because the code is not made
unreadable by it, this is a clear win to me.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
In an attempt to gain some cycles, we are testing whether we can read many
values at once, and if so, using consume_be directly for those.
What we can do in this situation is read the first value, and let execution
fall through to the next case if the read succeeds.
The code actually looks a lot more elegant this way.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
In some places, we cannot use our read_* functions, because we don't know
whether or not the read succeeded, and that is important when passing the state
along.
The fix for this is trivial, since we can just return it from the reader.
Note for reviewers: the comment in one of the functions says we should use
"read_bytes(data, _u32, _key ...". But in the actual code, the buffer there is
_val, not _key.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
It shouldn't be _u16, but rather whatever we passed as len. It currently works
because all callers pass _u16 as len. But this will soon change.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Soon enough, all the state machine will be separated from the prestate handling.
To make it easier, we will decouple them as much as we can.
Not automatically switching states in the read functions is part of this.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
For sstable reads, bigger buffers are not always better, because it can be
the case that we want to read just a piece of the data.
However, as it so happens, we already have two variants of read: when we
want to read a single key, we use read_row(), which tries to bring all the
data in, so it uses a smaller buffer.
For read_rows(), which naturally spans multiple buffers, we have start and
end points: with those, we can make a good estimate of the expected buffer
size, and let it grow until it reaches the 128k limit we have for
writes.
Before:
209578.62 +- 135.73 partitions / sec (30 runs, 1 concurrent ops)
After:
291703.98 +- 218.95 partitions / sec (30 runs, 1 concurrent ops)
Gain:
39.19 %
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
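The sizing policy above can be sketched in a few lines. The exact heuristic is
an assumption for illustration; only the "estimate from start/end, capped at
128k" idea comes from the message:

```cpp
#include <algorithm>
#include <cstdint>

// The 128k cap mentioned above (the limit used for writes).
constexpr uint64_t max_buffer = 128 * 1024;

// Estimate the read buffer size for a range with known start/end points:
// no bigger than the data we expect to read, and never above the cap.
uint64_t choose_buffer_size(uint64_t start, uint64_t end) {
    return std::min(end - start, max_buffer);
}
```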
data_consume_rows(0, 0) was returning all partitions instead of no
partitions, because -1 was passed as the count in that case, which was
then cast to uint64_t.
Special-casing it that way is problematic for code which calculates
the bounds, and when the key is not found we simply end up with 0 as the
upper bound. Instead of convoluting the range lookup code to special-case
0, let's simplify the interface so that (0, 0) returns no
rows, same as (1, 1). There is a new overload of data_consume_rows()
without bounds, which returns all data.
This patch changes the sstable read APIs from a "push" style
to a "pull" style.
The sstable read code has two APIs:
1. An API for sequentially consuming low-level sstable items - sstable
row's beginning and end, cells, tombstones, etc.
2. An API for sequentially consuming entire sstable rows in our "mutation"
format.
Before this patch, both APIs were in "push style": The user supplies
callback functions, and the sstable read code "pushes" to these functions
the desired items (low-level sstable parts, or whole mutations).
However, a push API is very inconvenient for users, like the query
processing code, or the compaction code, which both iterate over mutations.
Such code wants to control its own progression through the iteration -
the user prefers to "pull" the next mutation when it wants it; Moreover,
the user wants to *stop* pulling more mutations if it wants, without
worrying about various continuations that are still scheduled in the
background (the latter concern was especially problematic in the "push"
design).
The modified APIs are:
1. The functions for iterating over mutations, sstable::read_rows() et al.,
now return a "mutation_reader" object which can be used for iterating
over the mutation: mutation_reader::read() asks for the next mutation,
and returns a future to it (or an unassigned value on EOF).
You can see an example on how it is used in sstable_mutation_test.cc.
2. The functions for consuming low-level sstable items (row begin, cell,
etc.) are still partially push-style - the items are still fed into
the consumer object - but consumption now *stops* (instead of deferring
and continuing later, as in the old code) when the consumer asks it to.
The caller can resume the consumption later when it wishes (in
this sense, this is a "pull" API, because the user asks for more
input when it wants it).
This patch does *not* remove input_stream's feature of a consumer
function returning a non-ready future. However, this feature is no longer
used anywhere in our code - the new sstable reader code stops the
consumption where the old sstable reader code paused it temporarily with
a non-ready future.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
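The "pull" shape of point 1 can be sketched synchronously. This is a toy
stand-in: the real mutation_reader::read() returns a future, and `mutation`
here is just a placeholder string, but the calling convention is the same -
the caller asks for the next item when it wants it, and can simply stop
calling read() to stop:

```cpp
#include <optional>
#include <string>
#include <vector>

using mutation = std::string;  // stand-in for the real mutation type

class mutation_reader {
    std::vector<mutation> _muts;
    size_t _next = 0;
public:
    explicit mutation_reader(std::vector<mutation> muts)
        : _muts(std::move(muts)) {}

    // Pull the next mutation; an empty optional signals EOF (the real
    // API returns a future to an optionally-assigned value instead).
    std::optional<mutation> read() {
        if (_next == _muts.size()) {
            return std::nullopt;   // EOF
        }
        return _muts[_next++];
    }
};
```

Compare this with the old push style, where the reader drove callbacks and the
consumer had no easy way to stop mid-iteration without orphaned continuations.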
After commit 3ae81e68a0, we already support
in input_stream::consume() the possibility of the consumer blocking by
returning a future. But the code for sstable consumption had now way to
use this capability. This patch adds a future<> return code for
consume_row_end(), allowing the consumer to pause after reading each
sstable row (but not, currently, after each cell in the row).
We also need to use this capability in read_range_rows(), which wrongly
ignored the future<> returned by the "walker" function - now this future<>
is returned to the sstable reader, and causes it to pause reading until
the future is fulfilled.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
When the temporary buffer has enough data for a uint64 to be
consumed, we readily consume it.
The problem is that we were wrongly storing the uint64 into
a uint32 variable.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
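The bug class fixed above is easy to demonstrate in isolation: assigning a
consumed uint64 into a uint32 variable silently drops the high 32 bits. A
minimal illustration (not the actual consumer code):

```cpp
#include <cstdint>

// Storing into the wrong-width variable truncates silently.
uint32_t truncated(uint64_t v) { return static_cast<uint32_t>(v); }

// The fix: keep the full 64-bit width end to end.
uint64_t preserved(uint64_t v) { return v; }
```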
This patch adds support to reading deleted cells (a.k.a. cell tombstones)
from the SSTable.
The way deleted cells are encoded in the sstable is explained in the
"Cell tombstone" section of
https://github.com/cloudius-systems/urchin/wiki/SSTables-interpretation-in-Urchin
This more-or-less completes the low-level SSTable row reading code - the
only remaining untreated case is counters, which we agreed to leave for
later. If counters are found in the SSTable, we'll throw an exception.
This patch adds a new callback, consume_deleted_cell, taking the name of
the cell and its deletion_time (as usual, deletion_time includes both a
64-bit timestamp, for ordering events, and a 32-bit "local_deletion_time"
used to schedule gc of old tombstones).
This patch also adds a test SSTable with deleted cell, created by the
following Cassandra commands:
CREATE TABLE deleted (
name text,
age int,
PRIMARY KEY (name)
);
INSERT INTO deleted (name, age) VALUES ('nadav', 40);
<flush table - the second table is what we're after>
DELETE age FROM deleted WHERE name = 'nadav';
We test our ability to read this sstable, and see the deleted cell
and its expected deletion time.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
This patch adds support to reading sstable cells with expiration time.
It adds two more parameters to the row_consumer::consume_cell() - "ttl"
and "expiration". The "ttl" is the original TTL set on the cell in seconds,
the "expiration" is the absolute time (in seconds since the Unix epoch) when
this cell is set to expire. I don't know why both values are needed...
When a cell has no expiration time set (most cells will be like that), the
callback will be called with expiration==0 (and ttl==0).
This patch also adds a test SSTable with cells with set TTL, created by
the following Cassandra commands:
CREATE TABLE ttl (
name text,
age int,
PRIMARY KEY (name)
);
INSERT INTO ttl (name, age) VALUES ('nadav', 40) USING TTL 3600;
And tests our ability to read the resulting sstable, and get the expected
expiration time.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
The previous implementation could read either one sstable row or several,
but only when all the data was read in advance into a contiguous memory
buffer.
This patch changes the row read implementation into a state machine,
which can work on either a pre-read buffer, or data streamed via the
input_stream::consume() function:
The sstable::data_consume_rows_at_once() method reads the given byte range
into memory and then processes it, while the sstable::data_consume_rows()
method reads the data piecemeal, without trying to fit all of it into
memory. The first function is (or will be...) optimized for reading one
row, and the second for iterating over all rows - although both
can be used to read any number of rows.
The state-machine implementation is unfortunately a bit ugly (and much
longer than the code it replaces), and could probably be improved in the
future. But the focus was parsing performance: when we use large buffers
(the default is 8192 bytes), most of the time we don't need to read
byte-by-byte, and can efficiently read entire integers at once, or even larger
chunks. For strings (like column names and values), we even avoid copying
them if they don't cross a buffer boundary.
To test the rare boundary-crossing case despite having a small sstable,
the code includes in "#if 0" a hack to split one buffer into many tiny
buffers (1 byte, or any other number) and process them one by one.
The tests still pass with this hack turned on.
This implementation of sstable reading also adds a feature not present
in the previous version: reading range tombstones. An sstable with an
INSERT of a collection always has a range tombstone (to delete all old
items from the collection), so we need this feature to read collections.
A test for this is included in this patch.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
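The fast-path/slow-path split described above (read whole integers at once
when a buffer holds them, fall back to byte-by-byte across buffer boundaries)
can be illustrated with a toy state machine. The class and its interface are
made up for the sketch; only the technique is taken from the message:

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Parse one big-endian uint32 from a stream fed in arbitrary chunks.
class be32_parser {
    uint32_t _value = 0;
    int _have = 0;               // bytes accumulated so far
public:
    bool done() const { return _have == 4; }
    uint32_t value() const { return _value; }

    // Consume from `data`; returns the number of bytes used.
    size_t consume(std::string_view data) {
        if (_have == 0 && data.size() >= 4) {
            // Fast path: the whole integer is in the current buffer.
            for (int i = 0; i < 4; i++) {
                _value = (_value << 8) | static_cast<uint8_t>(data[i]);
            }
            _have = 4;
            return 4;
        }
        // Slow path: accumulate byte by byte across buffer boundaries.
        size_t used = 0;
        while (_have < 4 && used < data.size()) {
            _value = (_value << 8) | static_cast<uint8_t>(data[used++]);
            _have++;
        }
        return used;
    }
};
```

Feeding the parser 1-byte buffers, as the "#if 0" hack in the patch does for
its own state machine, exercises the boundary-crossing path.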