The entire SSTable read path can now take an io_priority. The public functions
take it as a parameter that defaults to Seastar's default priority.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Continuing the work of decoupling the prestate and state parts of the NSM
so we can reuse it, move the proceed class to a different holding class.
Proceeding or not has nothing to do with "rows".
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
This patch changes the sstable read APIs from "push" style
to "pull" style.
The sstable read code has two APIs:
1. An API for sequentially consuming low-level sstable items - sstable
row's beginning and end, cells, tombstones, etc.
2. An API for sequentially consuming entire sstable rows in our "mutation"
format.
Before this patch, both APIs were in "push style": The user supplies
callback functions, and the sstable read code "pushes" to these functions
the desired items (low-level sstable parts, or whole mutations).
However, a push API is very inconvenient for users, like the query
processing code, or the compaction code, which both iterate over mutations.
Such code wants to control its own progression through the iteration -
the user prefers to "pull" the next mutation when it wants it. Moreover,
the user wants to *stop* pulling more mutations if it wants, without
worrying about various continuations that are still scheduled in the
background (the latter concern was especially problematic in the "push"
design).
The modified APIs are:
1. The functions for iterating over mutations, sstable::read_rows() et al.,
now return a "mutation_reader" object which can be used for iterating
over the mutation: mutation_reader::read() asks for the next mutation,
and returns a future to it (or an unassigned value on EOF).
You can see an example on how it is used in sstable_mutation_test.cc.
2. The functions for consuming low-level sstable items (row begin, cell,
etc.) are still partially push-style - the items are still fed into
the consume object - but consumption now *stops* (instead of deferring
and continuing later, as in the old code) when the consumer asks to.
The caller can resume the consumption later when it wishes to (in
this sense, this is a "pull" API, because the user asks for more
input when it wants to).
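As a rough illustration of the pull style described in item 1, here is a
minimal synchronous sketch. It is not the real interface: the actual
mutation_reader::read() returns a future, which this sketch replaces with a
plain std::optional (disengaged on EOF), and toy_mutation, toy_reader and
take_keys are hypothetical names invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a mutation; the real type is far richer.
struct toy_mutation {
    std::string key;
};

// Hypothetical synchronous analogue of a pull-style reader.
class toy_reader {
    std::vector<toy_mutation> _items;
    std::size_t _pos = 0;
public:
    explicit toy_reader(std::vector<toy_mutation> items)
        : _items(std::move(items)) {}

    // Returns the next mutation, or a disengaged optional at end of input.
    std::optional<toy_mutation> read() {
        if (_pos == _items.size()) {
            return std::nullopt;
        }
        return _items[_pos++];
    }
};

// The caller drives the iteration and may stop early, leaving the
// remaining mutations unread and nothing running in the background.
inline std::vector<std::string> take_keys(toy_reader& r, std::size_t limit) {
    std::vector<std::string> keys;
    while (keys.size() < limit) {
        auto m = r.read();
        if (!m) {
            break;
        }
        keys.push_back(m->key);
    }
    return keys;
}
```

The point of the shape is that the loop lives in the caller: stopping is just
leaving the loop, with no callback left registered.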
This patch does *not* remove input_stream's feature of a consumer
function returning a non-ready future. However, this feature is no longer
used anywhere in our code - the new sstable reader code stops the
consumption where the old sstable reader code paused it temporarily with
a non-ready future.
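The stop-and-resume behaviour of the low-level API can be sketched as
follows. This is only a simplified model under stated assumptions: the
proceed enum, toy_item_stream and its consume() method are hypothetical
names, the items are plain ints rather than sstable parts, and nothing here
uses input_stream or futures.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical answer a consumer gives after each item.
enum class proceed { no, yes };

// Hypothetical source of items that remembers where consumption stopped.
class toy_item_stream {
    std::vector<int> _items;
    std::size_t _pos = 0;
public:
    explicit toy_item_stream(std::vector<int> items)
        : _items(std::move(items)) {}

    // Feeds items to the consumer until it answers proceed::no or the
    // input ends. Returns true if unread input remains, so the caller
    // can call consume() again later to resume from the same position.
    template <typename Consumer>
    bool consume(Consumer&& c) {
        while (_pos < _items.size()) {
            if (c(_items[_pos++]) == proceed::no) {
                break;
            }
        }
        return _pos < _items.size();
    }
};
```

Items are still pushed into the consumer, but each call returns as soon as
the consumer says stop, so asking for more input is the caller's decision.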
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
After commit 3ae81e68a0, we already support
in input_stream::consume() the possibility of the consumer blocking by
returning a future. But the code for sstable consumption had no way to
use this capability. This patch adds a future<> return value to
consume_row_end(), allowing the consumer to pause after reading each
sstable row (but not, currently, after each cell in the row).
We also need to use this capability in read_range_rows(), which wrongly
ignored the future<> returned by the "walker" function - now this future<>
is returned to the sstable reader, and causes it to pause reading until
the future is fulfilled.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
This patch adds support for reading deleted cells (a.k.a. cell tombstones)
from the SSTable.
The way deleted cells are encoded in the sstable is explained in the
"Cell tombstone" section of
https://github.com/cloudius-systems/urchin/wiki/SSTables-interpretation-in-Urchin
This more-or-less completes the low-level SSTable row reading code - the
only remaining untreated case is counters, which we agreed to leave to
later. If counters are found in the SSTable, we'll throw an exception.
This patch adds a new callback, consume_deleted_cell, taking the name of
the cell and its deletion_time (as usual, deletion_time includes both a
64-bit timestamp, for ordering events, and a 32-bit "local_deletion_time"
used to schedule gc of old tombstones).
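To make the two fields of deletion_time concrete, here is a small sketch of
how they might be used. Only the field widths and their purposes come from
the description above; the struct name, the helper functions, the rule that
a tombstone wins a timestamp tie, and the grace-period parameter are all
assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical mirror of the two values described above.
struct toy_deletion_time {
    int32_t local_deletion_time;  // used to schedule gc (seconds assumed)
    int64_t timestamp;            // orders the deletion against writes
};

// Assumption: a deletion hides any write that is not strictly newer.
inline bool shadows(const toy_deletion_time& t, int64_t write_timestamp) {
    return write_timestamp <= t.timestamp;
}

// Assumption: the tombstone may be dropped once a grace period
// (hypothetical parameter, in seconds) has passed since deletion.
inline bool gc_eligible(const toy_deletion_time& t, int64_t now_sec,
                        int64_t grace_sec) {
    return static_cast<int64_t>(t.local_deletion_time) + grace_sec <= now_sec;
}
```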
This patch also adds a test SSTable with a deleted cell, created by the
following Cassandra commands:
CREATE TABLE deleted (
name text,
age int,
PRIMARY KEY (name)
);
INSERT INTO deleted (name, age) VALUES ('nadav', 40);
<flush table - the second table is what we're after>
DELETE age FROM deleted WHERE name = 'nadav';
We test our ability to read this sstable, and see the deleted cell
and its expected deletion time.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
This patch adds support for reading sstable cells with expiration time.
It adds two more parameters to the row_consumer::consume_cell() - "ttl"
and "expiration". The "ttl" is the original TTL set on the cell in seconds,
the "expiration" is the absolute time (in seconds since the Unix epoch) when
this cell is set to expire. I don't know why both values are needed...
When a cell has no expiration time set (most cells will be like that), the
callback will be called with expiration==0 (and ttl==0).
This patch also adds a test SSTable with cells that have a TTL set, created by
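A consumer might interpret the two values along these lines. This is a
hedged sketch, not the actual callback signature: toy_cell_expiry and the
helpers are hypothetical, the 64-bit field types are a guess, and both the
"expired when now >= expiration" boundary and the relation
expiration = write time + ttl are assumptions.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical holder for the two extra consume_cell() parameters.
struct toy_cell_expiry {
    int64_t ttl;         // original TTL in seconds, 0 if none
    int64_t expiration;  // absolute expiry in seconds since the epoch, 0 if none
};

// expiration == 0 marks a cell that never expires.
inline bool has_expiry(const toy_cell_expiry& e) {
    return e.expiration != 0;
}

// Assumption: a cell counts as expired once "now" reaches its expiration.
inline bool is_expired(const toy_cell_expiry& e, int64_t now_sec) {
    return has_expiry(e) && now_sec >= e.expiration;
}

// Assumption: expiration was computed as write time + ttl, so the two
// values together let a reader recover when the cell was written.
inline int64_t implied_write_time(const toy_cell_expiry& e) {
    return e.expiration - e.ttl;
}
```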
the following Cassandra commands:
CREATE TABLE ttl (
name text,
age int,
PRIMARY KEY (name)
);
INSERT INTO ttl (name, age) VALUES ('nadav', 40) USING TTL 3600;
The patch then tests our ability to read the resulting sstable, and get the expected
expiration time.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
The previous implementation could read either one sstable row or several,
but only when all the data was read in advance into a contiguous memory
buffer.
This patch changes the row read implementation into a state machine,
which can work on either a pre-read buffer, or data streamed via the
input_stream::consume() function:
The sstable::data_consume_rows_at_once() method reads the given byte range
into memory and then processes it, while the sstable::data_consume_rows()
method reads the data piecemeal, not trying to fit all of it into
memory. The first function is (or will be...) optimized for reading one
row, and the second function for iterating over all rows - although both
can be used to read any number of rows.
The state-machine implementation is unfortunately a bit ugly (and much
longer than the code it replaces), and could probably be improved in the
future. But the focus was parsing performance: when we use large buffers
(the default is 8192 bytes), most of the time we don't need to read
byte-by-byte, and can efficiently read entire integers at once, or even larger
chunks. For strings (like column names and values), we even avoid copying
them if they don't cross a buffer boundary.
To test the rare boundary-crossing case despite having a small sstable,
the code includes in "#if 0" a hack to split one buffer into many tiny
buffers (1 byte, or any other number) and process them one by one.
The tests still pass with this hack turned on.
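The idea of parsing an integer with a fast path for whole buffers and a
byte-by-byte fallback for the boundary-crossing case can be sketched like
this. It is a standalone illustration, not the patch's state machine:
u32_parser is a hypothetical name, and the sketch assumes a 32-bit
big-endian integer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical resumable parser for one 32-bit big-endian integer.
class u32_parser {
    uint32_t _value = 0;
    unsigned _have = 0;  // bytes accumulated so far

    static uint32_t byte_at(std::string_view buf, std::size_t i) {
        return static_cast<uint32_t>(static_cast<unsigned char>(buf[i]));
    }
public:
    // Consumes bytes from buf and returns how many were used. May be
    // called repeatedly with successive buffers until done() is true.
    std::size_t feed(std::string_view buf) {
        if (_have == 0 && buf.size() >= 4) {
            // Fast path: the whole integer sits in this buffer.
            _value = (byte_at(buf, 0) << 24) | (byte_at(buf, 1) << 16) |
                     (byte_at(buf, 2) << 8) | byte_at(buf, 3);
            _have = 4;
            return 4;
        }
        // Slow path: the integer straddles a buffer boundary.
        std::size_t used = 0;
        while (_have < 4 && used < buf.size()) {
            _value = (_value << 8) | byte_at(buf, used);
            ++used;
            ++_have;
        }
        return used;
    }

    bool done() const { return _have == 4; }
    uint32_t value() const { return _value; }
};
```

Feeding the same bytes one at a time exercises only the slow path, which is
the effect the buffer-splitting hack is after.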
This implementation of sstable reading also adds a feature not present
in the previous version: reading range tombstones. An sstable with an
INSERT of a collection always has a range tombstone (to delete all old
items from the collection), so we need this feature to read collections.
A test for this is included in this patch.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Add a function sstables::data_consume_row() which reads an entire row
(or several consecutive rows) at a given byte range in the data file,
and feeds them into a "row_consumer" implementation which the user provides.
The row_consumer's consume_row_start() method is called at the
beginning of the (or each) row with its key and deletion information,
then the consume_cell() method is called for each of the row's cells,
and after all cells of the row, consume_row_end() is called.
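The calling sequence can be illustrated with a simplified consumer. The
three method names come from the description above, but the signatures
below are assumptions (the real callbacks also receive deletion
information and richer types), and toy_row_consumer, recording_consumer,
toy_row and feed_rows are hypothetical names used only for this sketch.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified version of the consumer interface.
struct toy_row_consumer {
    virtual ~toy_row_consumer() = default;
    virtual void consume_row_start(const std::string& key) = 0;
    virtual void consume_cell(const std::string& name,
                              const std::string& value) = 0;
    virtual void consume_row_end() = 0;
};

// Example consumer that records the order in which it was called.
struct recording_consumer : toy_row_consumer {
    std::vector<std::string> events;
    void consume_row_start(const std::string& key) override {
        events.push_back("row_start " + key);
    }
    void consume_cell(const std::string& name,
                      const std::string& value) override {
        events.push_back("cell " + name + "=" + value);
    }
    void consume_row_end() override {
        events.push_back("row_end");
    }
};

// Hypothetical in-memory row, standing in for bytes in the data file.
struct toy_row {
    std::string key;
    std::vector<std::pair<std::string, std::string>> cells;
};

// Drives the consumer: start, then every cell, then end, for each row.
inline void feed_rows(const std::vector<toy_row>& rows, toy_row_consumer& c) {
    for (const auto& row : rows) {
        c.consume_row_start(row.key);
        for (const auto& cell : row.cells) {
            c.consume_cell(cell.first, cell.second);
        }
        c.consume_row_end();
    }
}
```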
The current implementation only supports regular cells, and not other
special cases like range tombstones and counters (see
https://github.com/cloudius-systems/urchin/wiki/SSTables%20Data%20File)
as I did not yet have sstables to test those on; the current
implementation will abort upon seeing these unsupported features.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>