"The goal of this patch series is to support reading and writing of a
"promoted index" - the Cassandra 2.* SSTable feature which allows reading
only part of a very long partition, without needing to read the entire
partition. To make a long story short, a "promoted index" is
a sample of each partition's column names, written to the SSTable Index
file with that partition's entry. See a longer explanation of the index
file format, and the promoted index, here:
https://github.com/scylladb/scylla/wiki/SSTables-Index-File
There are two main features in this series - first enabling reading of
parts of partitions (using the promoted index stored in an sstable),
and then enabling the writing of promoted indexes to new sstables. These two
features are broken up into smaller stand-alone pieces to facilitate the
review.
Three features are still missing from this series and are planned to be
developed later:
1. When we fail to parse a partition's promoted index, we silently fall back
to reading the entire partition. We should log (with rate limiting) and
count these errors, to help in debugging sstable problems.
2. The current code only uses the promoted index when looking for a single
contiguous clustering-key range. If the ck range is non-contiguous, we
fall back to reading the entire partition. We should use the promoted
index in that case too.
3. The current code only uses the promoted index when reading a single
partition, via sstable::read_row(). When scanning through all or a
range of partitions (read_rows() or read_range_rows()), we do not yet
use the promoted index; we read contiguously from the data file (we do not
even read from the index file, so unsurprisingly we can't use it)."
(cherry picked from commit 700feda0db)
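As a rough illustration of what a promoted-index lookup buys us, here is a minimal sketch. The type and function names (pi_block, find_block) are stand-ins for illustration, not Scylla's actual code: each entry samples a block of a long partition, recording the range of clustering names it covers plus the block's offset and width in the data file, so a reader can binary-search for the block of interest and skip the rest of the partition.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical promoted-index block entry: the range of clustering names
// the block covers, and where the block lives inside the partition.
struct pi_block {
    std::string first_name; // first clustering name in the block
    std::string last_name;  // last clustering name in the block
    uint64_t offset;        // block's offset inside the partition
    uint64_t width;         // block's length in bytes
};

// Returns the index of the first block that may contain `name`,
// or blocks.size() if `name` sorts past the end of the partition.
size_t find_block(const std::vector<pi_block>& blocks, const std::string& name) {
    size_t lo = 0, hi = blocks.size();
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (blocks[mid].last_name < name) {
            lo = mid + 1;   // block ends before `name`; look right
        } else {
            hi = mid;       // block may still contain `name`
        }
    }
    return lo;
}
```

With this, reading a single clustering range touches only the blocks whose sampled name ranges intersect it, instead of the whole partition.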
Move the to_bytes_view(temporary_buffer<char>) function from a source file
to a header file where it can be used in more places.
This saves one use of reinterpret_cast (which we are now re-evaluating),
and moreover, we want to use this function also in the promoted index
code (to return a bytes_view from the promoted index which was saved as a
temporary_buffer).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1468761437-27046-1-git-send-email-nyh@scylladb.com>
There are times when we read the Summary file twice. That actually happens
every time during normal boot (though not during refresh): first during
get_sstable_key_range, and then again during load().
Every summary will have at least one entry, so we can easily test for whether
or not this is properly initialized.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
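The guard can be sketched like this (illustrative types; the disk_reads counter stands in for the actual file I/O): since a valid Summary always has at least one entry, an empty entries vector means "not loaded yet", and the second caller becomes a no-op.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Stand-in summary: a valid one always has at least one entry, so
// emptiness doubles as an "initialized?" flag.
struct summary {
    std::vector<uint64_t> entries;
    bool loaded() const { return !entries.empty(); }
};

static int disk_reads = 0; // counts the simulated file reads

void read_summary(summary& s) {
    if (s.loaded()) {
        return;          // already parsed; skip the second disk read
    }
    ++disk_reads;        // stands in for the actual Summary file I/O
    s.entries.push_back(0);
}
```

Both get_sstable_key_range and load() can then call read_summary() safely; only the first call touches the disk.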
Until recently, we believed that range tombstones we read from sstables will
always be for entire rows (or more generalized clustering-key prefixes),
not for arbitrary ranges. But as we found out, because Cassandra insists
that range tombstones do not overlap, it may take two overlapping row
tombstones and convert them into three range tombstones which look like
general ranges (see the patch for a more detailed example).
Not only do we need to accept such "split" range tombstones, we also need
to convert them back to our internal representation which, in the above
example, involves two overlapping tombstones. This is what this patch does.
This patch also contains a test for this case: we created in Cassandra
an sstable with two overlapping deletions, and verify that when we read
it into Scylla, we get these two overlapping deletions - despite the
sstable file actually containing three non-overlapping tombstones.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <b7c07466074bf0db6457323af8622bb5210bb86a.1459399004.git.glauber@scylladb.com>
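The split described above can be shown concretely with simplified types (rt is an illustrative stand-in, not Scylla's range_tombstone): given two overlapping tombstones [a,c)@t1 and [b,d)@t2, Cassandra writes three disjoint ranges, with the overlap region carrying the newer timestamp.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Toy range tombstone: a half-open clustering range plus a timestamp.
struct rt {
    std::string start, end;
    long ts;
};

// How two overlapping tombstones end up on disk as three disjoint ones.
// Assumes x.start <= y.start <= x.end <= y.end (i.e. they overlap).
std::vector<rt> split_overlapping(const rt& x, const rt& y) {
    return {
        { x.start, y.start, x.ts },                 // only x covers this
        { y.start, x.end, std::max(x.ts, y.ts) },   // overlap: newest wins
        { x.end, y.end, y.ts },                     // only y covers this
    };
}
```

Reading such an sstable back, the middle piece's bounds are no longer whole-row prefixes, which is exactly why the reader must accept general ranges and re-merge them into the two original overlapping tombstones.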
Similar to Origin's off-heap memory accounting, memory_footprint is the
size of the queues multiplied by the structure size.
memory_footprint is used by the API to report the memory that is taken
by the summary.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
The one thing still showing pretty high in the read_indexes flamegraph
is allocations.
We can, however, do better. Since most of the index is the keys anyway - and
we need all of them - the amount of memory we use by copying the buffers over
is about the same as the space we would use by just keeping the buffers around.
So we can change index_entry to just keep the shared_buffers, and since we
always access it through views anyway, that is perfectly fine. The index_entry
destructor will then release() the temporary_buffer, instead of doing this
after the buffer copy.
This gives us a nice additional 4%.
perf_sstable_g --smp 1 --iterations 30 --parallelism 1 --mode index_read
Before:
839484.65 +- 585.52 partitions / sec (30 runs, 1 concurrent ops)
After:
873323.18 +- 442.52 partitions / sec (30 runs, 1 concurrent ops)
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
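The shape of the change can be sketched with stand-in types (std::string plays the role of the retained shared buffer; the real code keeps the temporary_buffer itself): index_entry owns the buffer it was parsed from and hands out views, so no extra allocation or copy happens per entry.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <string_view>
#include <utility>

// Stand-in index_entry: keeps the parsed buffer alive and exposes views.
class index_entry {
    std::string _key;      // stands in for the retained shared buffer
    uint64_t _position;
public:
    index_entry(std::string key, uint64_t position)
        : _key(std::move(key)), _position(position) {}

    // All access goes through views, so keeping the buffer is enough.
    std::string_view get_key() const { return _key; }
    uint64_t position() const { return _position; }
};
```

The buffer is released when the index_entry is destroyed, rather than immediately after a copy.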
Now that we are using the NSM, and not the general parser, for the index, there
is no reason to keep using disk_string<>s in it. Since it is standing in the way
of further optimizations, let's get rid of it and use bytes directly.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Large vectors require contiguous storage, which may not be available (or may
be expensive to obtain). Switch to deque<> instead, which allocates
discontiguous storage.
Allocation problems were observed with the summary and with the bloom
filter bitmaps.
What we implement is ka, not la. Since the summary is the one element that
actually changed in the 2.2 implementation, it is particularly important that
we get this one right. I had previously missed this.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Don't let the current name fool you: Having this listed as "la" here
was just lack of discipline on my part. I meant by it "the format from
which we are importing" - which was named la for Origin. I wasn't
really thinking at the time that it would be dangerous to stop between
versions.
This should read ka, not la.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Since parsing involves a unique_ptr<metadata> holding a pointer to a
subclass of metadata, metadata must define a virtual destructor; otherwise
deletion can cause memory leaks, or, with C++14 sized deallocation, it
can cause the wrong memory pool to be used when freeing the object.
Seen on EC2.
Define a virtual destructor to tell the compiler how to destroy
and free the object.
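A minimal illustration of the bug and the fix (the names metadata and compaction_metadata mirror the commit's description, but the bodies are hypothetical): deleting a subclass through unique_ptr<metadata> runs the base destructor, and only a virtual destructor makes that safe.

```cpp
#include <cassert>
#include <memory>

static bool subclass_destroyed = false;

struct metadata {
    // The fix: without this, deleting through unique_ptr<metadata> is
    // undefined behavior - the subclass destructor never runs, and with
    // C++14 sized deallocation the wrong size reaches operator delete.
    virtual ~metadata() = default;
};

struct compaction_metadata : metadata {
    ~compaction_metadata() override { subclass_destroyed = true; }
};

void parse_and_drop() {
    std::unique_ptr<metadata> p = std::make_unique<compaction_metadata>();
    // p goes out of scope here: ~compaction_metadata() runs correctly
    // because the base destructor is virtual.
}
```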
The entry contains not only the key, but other data such as the
position. Why would casting to bytes_view give a view of just the
key and not the whole entry? Better to be explicit.
The two separate functions can now be merged. As a result, the code
that generates statistics data is now much easier to understand.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
compaction metadata is composed of ancestors and cardinality.
ancestors data is generated by the compaction process, so it will be
empty for the time being.
cardinality data is generated by hashing the keys, offering the
values to hyperloglog and retrieving a buffer with the data to be
stored.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
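The flow can be sketched as follows, with an exact hash set standing in for the hyperloglog estimator (the real code hashes each key, offers the hash to HLL, and retrieves a serialized buffer for storage; all names here are illustrative):

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_set>
#include <vector>

// Illustrative compaction metadata: ancestors + cardinality estimate.
struct compaction_metadata {
    std::vector<uint32_t> ancestors;        // empty until compaction exists
    std::unordered_set<size_t> estimator;   // toy stand-in for hyperloglog

    // Offer each partition key's hash to the estimator.
    void offer_key(const std::string& key) {
        estimator.insert(std::hash<std::string>{}(key));
    }

    // Distinct-key estimate; HLL would answer this approximately
    // from a small fixed-size buffer instead of an exact set.
    size_t cardinality() const { return estimator.size(); }
};
```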
We always return a future, but with the threaded writer, we can get rid of
that. So while reads will still return a future, the writer will be able to
return void.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Statistics file is composed of three types of metadata:
- Validation
- Stats
- Compaction
This patch adds support for generating the first two types.
Compaction is the hardest one to generate because it depends on
external modules. Anyway, I plan to convert whatever is needed
for us to support Compaction metadata as soon as possible.
Regarding the Stats metadata, we're filling the fields sstable_level
and repaired_at with default values. sstable_level is related to
compaction, and repaired_at is related to SSTable repair.
Besides the fact that we don't support compaction or SSTable repair yet,
those values come from upper layers in Cassandra.
Given the facts mentioned above, the Statistics file is being generated
with only the Validation and Stats metadata. Its on-disk format is
flexible enough that a missing metadata entry won't damage it,
so it's technically possible to proceed without Compaction metadata
for the time being.
For reference:
../io/sstable/MetadataCollector.java
../io/sstable/ColumnStats.java
../io/sstable/format/big/BigTableWriter.java
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
In addition, this patch also fixes serialization and deserialization of
estimated histogram. Problem was found by reading the respective methods
in origin implementation.
The first element of the offsets array is used for both the first and
second elements of the buckets array. So given a buckets array of size N,
the offsets array will be of size N - 1. Our code wasn't handling this.
The new representation of estimated histogram provides us with methods
needed for writing the component Statistics.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
This step was important to extend streaming_histogram with methods
needed for writing the SSTable component Statistics.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
The CRC component is composed of the chunk size, plus a vector with one
checksum for each chunk (of at most chunk-size bytes) of the data file.
The implementation computes the checksum every time the output stream
of the data file gets written. A write to the output stream may cross
a chunk boundary, so that must be handled properly.
Note that CRC component will only be created if compression isn't
being used.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
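The boundary handling can be sketched like this (a toy additive checksum stands in for the real CRC, and the type is illustrative): each write is split at chunk boundaries, and the running checksum is emitted whenever a chunk fills up.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy chunked checksummer fed by every write to the data file's stream.
struct chunk_checksummer {
    size_t chunk_size;
    size_t filled = 0;               // bytes already in the current chunk
    uint32_t current = 0;            // running checksum of the current chunk
    std::vector<uint32_t> checksums; // one finished checksum per full chunk

    void write(const uint8_t* p, size_t n) {
        while (n > 0) {
            // Take only up to the chunk boundary; a single write may
            // cross into the next chunk, so we loop.
            size_t take = std::min(n, chunk_size - filled);
            for (size_t i = 0; i < take; ++i) {
                current += p[i];     // toy checksum; the real code uses a CRC
            }
            p += take;
            n -= take;
            filled += take;
            if (filled == chunk_size) {   // chunk complete: emit and reset
                checksums.push_back(current);
                current = 0;
                filled = 0;
            }
        }
    }
};
```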
This initial version supports:
Regular columns
Clustering key
Compound Partition key
Compound Clustering key
Static Row
What's not supported:
Counters
Range tombstones
Collections
Compression
anything else that wasn't mentioned in the support list.
The generation of the data file consists of iterating through
a set of mutation_partition from a column_family, then writing
the SSTable rows according to the format.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
disk_string provides an easy way of serializing a string into the form
{ size, string[size] }. sstables::key, atomic_cell, and other types
provide a bytes_view of their data, which is why this change is
needed. Otherwise, I would have to convert bytes_view into bytes,
which requires a copy.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
There is no need to expose binary search. It can be an internal function
that is accessible for test only.
Also, in the end, the implementation of the summary version was such a simple
one, that there is no need to have a specific method for that. We can just pass
the summary entries as a parameter.
Some header file massaging is needed to keep everything compiling.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
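The internal search can be sketched as follows (illustrative types and name; summaries only need the index of the last entry at or before the lookup key, taking the entries as a parameter rather than as a method on summary):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative summary entry: a key (e.g. a token) and a file position.
struct summary_entry {
    uint64_t key;
    uint64_t position;
};

// Index of the last entry whose key is <= `key`, or -1 if `key`
// sorts before every entry.
int summary_binary_search(const std::vector<summary_entry>& entries,
                          uint64_t key) {
    int lo = 0, hi = int(entries.size()) - 1, result = -1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (entries[mid].key <= key) {
            result = mid;     // candidate; keep looking to the right
            lo = mid + 1;
        } else {
            hi = mid - 1;
        }
    }
    return result;
}
```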
Search code is trivially taken from Origin, but adapted so that the comparison
is done explicitly.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
We have our own representation of a partition_key, clustering_key, etc. They
may differ slightly from a legacy sstable key because we don't necessarily
serialize composites in our internal representation the same way as Origin
does. This patch encodes the Origin composite serialization, so we can create
keys that are compatible with Origin's understanding of what a partition key
should look like.
This should be used when serializing or deserializing to/from an sstable.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
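For illustration, a sketch of the composite encoding as we read it from Origin's CompositeType (hedged: the exact flag semantics are simplified here, and the function name is hypothetical): each component is written as a 16-bit big-endian length, the component bytes, and a trailing end-of-component byte.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Encode a compound key the way Origin expects:
// for each component: <u16 big-endian length><bytes><end-of-component byte>.
std::vector<uint8_t> serialize_composite(
        const std::vector<std::string>& components) {
    std::vector<uint8_t> out;
    for (const auto& c : components) {
        uint16_t len = c.size();
        out.push_back(uint8_t(len >> 8));
        out.push_back(uint8_t(len & 0xff));
        out.insert(out.end(), c.begin(), c.end());
        out.push_back(0);   // end-of-component marker
    }
    return out;
}
```

Our internal representation does not serialize composites this way, hence the explicit conversion when reading from or writing to an sstable.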
The definition of summary_la at types.hh provides a good explanation
on the on-disk format of the Summary file.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
This code adds the ability to write statistics to disk.
On-disk format:
uint32_t Size;
struct {
uint32_t metadata_type;
uint32_t offset; /* offset into this file */
} metadata_metadata[Size];
* each metadata_metadata entry corresponds to a metadata
stored in the file.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
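The header layout described above can be sketched with a stand-in serializer (the integer width and big-endian byte order are assumptions of this sketch, and the function names are hypothetical): first the entry count, then one { metadata_type, offset } pair per metadata blob stored in the file.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Append a 32-bit big-endian integer to the output buffer.
void put_be32(std::vector<uint8_t>& out, uint32_t v) {
    out.push_back(uint8_t(v >> 24));
    out.push_back(uint8_t(v >> 16));
    out.push_back(uint8_t(v >> 8));
    out.push_back(uint8_t(v));
}

struct toc_entry {
    uint32_t metadata_type;
    uint32_t offset;      // offset of this metadata inside the file
};

// Write the metadata_metadata table of contents: Size, then the entries.
std::vector<uint8_t> write_toc(const std::vector<toc_entry>& entries) {
    std::vector<uint8_t> out;
    put_be32(out, entries.size());
    for (const auto& e : entries) {
        put_be32(out, e.metadata_type);
        put_be32(out, e.offset);
    }
    return out;
}
```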
Previously we had both a "compression" structure (read from the Compression
Info file on disk) and a "compression_metadata" class with additional
information, which std::move()ed parts of the compression structure.
This caused problems for the simplistic sstable-writing test (which does
the non-interesting thing of writing a previously-read sstable).
I'm ashamed to say, fixing this was very hard, because all this code is
built like a house of cards - try to change one thing, and everything
falls apart. After many failed attempts in trying to improve this code, what
I ended up doing is simply *extending* the "compression" structure - the
extended part isn't read or written, but it is in the structure.
We also no longer move a shared pointer to the compression structure,
but rather just an ordinary pointer; the assumption is that the user
will already make sure that the sstable structure lives for the
duration of any processing on it - and the compression structure is just
one part of that sstable structure.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
For the time being, compression info is the only component being
written by store(). The changes introduced by this patch are generic,
so as to make it easier to write other components as well.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>