Commit Graph

74 Commits

Author SHA1 Message Date
Tomasz Grabiec
5b36976bf0 sstables: Store parsed promoted index in index_entry 2017-03-28 18:34:55 +02:00
Tomasz Grabiec
5af815bf20 sstables: Define deletion_time earlier 2017-03-28 18:34:55 +02:00
Tomasz Grabiec
27d86dfe18 sstables: Enable skipping to cells at data_consume_context level 2017-03-28 18:10:39 +02:00
Tomasz Grabiec
0635d74e17 sstables: Make index_entry copyable
Needed to make the index_list copyable, which is going to be needed to
implement legacy get_index_entries() which returns by value, after
index sharing is implemented.
2017-03-28 18:10:39 +02:00
Tomasz Grabiec
d5e704ca1e sstables: Make key_view constructor from bytes_view explicit 2017-03-28 18:10:39 +02:00
Raphael S. Carvalho
e28537b56f sstables: fix calculation of memory footprint for summary
size of keys weren't taken into account, so value reported
via collectd is much smaller than actual footprint.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <3ca24612e4e84d1cbdea4f2d79e431a4f4479291.1482255327.git.raphaelsc@scylladb.com>
2016-12-20 18:28:47 +00:00
Avi Kivity
3c3a18f222 sstables: move sharding metadata from Statistics component to a new Scylla component
The Cassandra derived sstable tools (and likely Cassandra itself) object to
a new sub-component in the Statistics component; create a new Scylla
component instead to host this data.
2016-12-07 15:20:13 +02:00
Avi Kivity
bdd11648ac sstables: add intra-node sharding metadata
Add a metadata component that describes token ranges that are spanned by
this sstable.  With the current sharding algorithm, where each shard owns
a single token range, the first/last partition key is sufficient to
describing sharding information, but for multi-range algorithms, this
is not sufficient.
2016-11-22 21:44:25 +02:00
Avi Kivity
316ef1d70a sstables: automate writing statistics components
Add a virtual funnction to metadata_base so we can loop over statistics
components when writing them.
2016-11-22 21:05:06 +02:00
Avi Kivity
7c5e6525ef sstables: switch statistics components to generic serialized_size() implementation 2016-11-22 20:20:38 +02:00
Glauber Costa
4310635bae move estimated histogram to utils
Nothing sstable-specific in it, really.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-08-31 15:13:23 -04:00
Glauber Costa
ffc2131c51 decouple estimated_histogram from sstables
There is nothing really that fundamentally ties the estimated histogram to
sstables. This patch gets rid of the few incidental ties. They are:

 - the namespace name, which is now moved to utils. Users inside sstables/
   now need to add a namespace prefix, while the ones outside have to change
   it to the right one
 - sstables::merge, which has a very non-descriptive name to begin with, is
   changed to a more descriptive name that can live inside utils/
 - the disk_types.hh include has to be removed - but it had no reason to be
   here in the first place.

Todo, is to actually move the file outside sstables/. That is done in a separate
step for clarity.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-08-31 15:13:23 -04:00
Nadav Har'El
1d38a69e49 sstables: expose promoted index in index entry
Our index_entry type, holding one partition's entry that we read from the
index file, already contained the "_promoted_index" which we read from
disk - as an unparsed byte buffer. But there wasn't any API to access
this buffer after it was read. This patch adds a trivial getter, to get
a read-only view of this buffer.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2016-08-07 17:47:06 +03:00
Nadav Har'El
c647d917e0 sstables: move to_bytes_view to header file
Move the to_bytes_view(temporary_buffer<char>) function from source file
to header file where is can be used in more places.

This saves one use of reinterpret_cast (which we are no re-evaluating),
and moreover, we want to use this function also in the promoted index
code (to return a bytes_view from the promoted index which was saved as a
temporary_buffer).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1468761437-27046-1-git-send-email-nyh@scylladb.com>
2016-07-17 16:29:26 +03:00
Raphael S. Carvalho
1ecd9bdefc sstables: fix type of max_local_deletion_time
max_local_deletion_time was incorrectly using an unsigned type
instead of a signed one.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-07-06 01:13:13 -03:00
Paweł Dziepak
575daea897 sstables: make deletion_time to tombstone cast safer
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-06-20 21:29:50 +01:00
Glauber Costa
6ae601a025 do not re-read the summary
There are times in which we read the Summary file twice. That actually happens
every time during normal boot (it doesn't during refresh). First during
get_sstable_key_range and then again during load().

Every summary will have at least one entry, so we can easily test for whether
or not this is properly initialized.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-04-08 17:14:29 -04:00
Pekka Enberg
38a54df863 Fix pre-ScyllaDB copyright statements
People keep tripping over the old copyrights and copy-pasting them to
new files. Search and replace "Cloudius Systems" with "ScyllaDB".

Message-Id: <1460013664-25966-1-git-send-email-penberg@scylladb.com>
2016-04-08 08:12:47 +03:00
Nadav Har'El
99ecda3c96 sstables: overhaul range tombstone reading
Until recently, we believed that range tombstones we read from sstables will
always be for entire rows (or more generalized clustering-key prefixes),
not for arbitrary ranges. But as we found out, because Cassandra insists
that range tombstones do not overlap, it may take two overlapping row
tombstones and convert them into three range tombstones which look like
general ranges (see the patch for a more detailed example).

Not only do we need to accept such "split" range tombstones, we also need
to convert them back to our internal representation which, in the above
example, involves two overlapping tombstones. This is what this patch does.

This patch also contains a test for this case: We created in Cassandra
an sstable with two overlapping deletions, and verify that when we read
it to Scylla, we get these two overlapping deletions - despite the
sstable file actually having contained three non-overlapping tombstones.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <b7c07466074bf0db6457323af8622bb5210bb86a.1459399004.git.glauber@scylladb.com>
2016-03-31 12:49:50 +03:00
Amnon Heiman
bae286a5b4 Add memory_footprint method to summary_ka
Similiar to origin, off heap memory, memory_footprint is the size of
queus multiply by the structure size.

memory_footprint is used by the API to report the memory that is taken
by the summary.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2015-12-07 14:52:18 +02:00
Avi Kivity
d5cf0fb2b1 Add license notices 2015-09-20 10:43:39 +03:00
Paweł Dziepak
a4a71932e1 sstables: make {max, min}_timestamp signed
Signed-off-by: Paweł Dziepak <pdziepak@cloudius-systems.com>
2015-09-07 21:20:32 +02:00
Calle Wilund
d4ae43862d SStables: Use db::commitlog::replay_position (not own type) 2015-08-31 14:29:45 +02:00
Glauber Costa
aab1ae9dc1 index_entry: don't generate a temporary bytes element
The one thing that is still showing pretty high at the read_indexes flamegraph,
is allocations.

We can, however, do better. Since most of the index is the keys anyway - and we need
all of them, the amount of memory we use by copying the buffers over is about the same
as the space we would use by just keeping the buffers around.

So we can change index_entry to just keep the shared_buffers, and since we always access
it through views anyway, that is perfectly fine. The index_entry destructor will then
release() the temporary_buffer, instead of doing this after the buffer copy.

This gives us a nice additional 4 %.

perf_sstable_g  --smp 1 --iterations 30 --parallelism 1 --mode index_read

Before:
839484.65 +- 585.52 partitions / sec (30 runs, 1 concurrent ops)

After:
873323.18 +- 442.52 partitions / sec (30 runs, 1 concurrent ops)

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-29 14:09:53 -05:00
Glauber Costa
a9ab31dd9c index_entry: move its fields to private visibility
And provide accessors. This will give us the freedom to change their internal
storage.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-29 14:05:36 -05:00
Glauber Costa
1fbd14354f index_entry: provide a constructor
This is a preparation to have their internal fields as private.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-29 14:05:36 -05:00
Glauber Costa
13d59c9618 index_entry: do away with the disk_string<> fields
Now that we are using the NSM, and not the general parser for the index, there
is no reason to keep using disk_string<>s in it. Since it is staying in the way
of further optimizations, let's get rid of it and use bytes directly.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-29 14:05:36 -05:00
Avi Kivity
5f62f7a288 Revert "Merge "Commit log replay" from Calle"
Due to test breakage.

This reverts commit 43a4491043, reversing
changes made to 5dcf1ab71a.
2015-08-27 12:39:08 +03:00
Calle Wilund
0ae7707106 SStables: Use db::commitlog::replay_position (not own type) 2015-08-25 09:14:39 +02:00
Avi Kivity
c51292e792 sstables: switch from vector<> to deque<>
Large vectors require contiguous storage, which may not be available (or may
be expensive to obtain).  Switch to deque<> instead, which allocates
discontiguous storage.

Allocation problems were observed with the summary and with the bloom
filter bitmaps.
2015-08-23 12:22:49 +03:00
Glauber Costa
799a6b5962 sstables: change summary_la to summary_ka
What we implement is ka, not la. Since the summary is the one element that
actually changed in the 2.2 implementation, it is particularly important that
we get this one right. I have previously missed this.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-11 17:47:48 +03:00
Glauber Costa
7bbf8c2a6f sstable types: correctly state version of metadata field
Don't let the current name fool you: Having this listed as "la" here
was just lack of discipline on my part. I meant by it "the format from
which we are importing" - which was named la for Origin. I wasn't
really thinking at the time that it would be dangerous to stop between
versions.

This should read ka, not la.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-07 08:31:55 -05:00
Avi Kivity
6a9d0495f8 sstables: fix memory corruption in metadata parsing
Since parsing involves a unique_ptr<metadata> holding a pointer to a
subclass of metadata, it must define a virtual destructor, or it can
cause memory leaks when deleted, or, with C++14 sized deallocators, it
can cause the wrong memory pool to be used for deleting the object.

Seen on EC2.

Define a virtual destructor to tell the compiler how to destroy
and free the object.
2015-07-22 17:46:37 +03:00
Tomasz Grabiec
e9a050da78 sstables: Obtain the key from entries using get_key() rather than casting to bytes_view
The entry contains not only the key, but other stuff like
position. Why would casting to bytes_view give the view on just the
key and not the whole entry. Better to be explicit.
2015-07-22 10:27:48 +02:00
Raphael S. Carvalho
79532b6603 sstables: merge prepare_statistics and add_statistics_metadata
The two separate functions can now be merged. As a result, the code
that generates statistics data is now much easier to understand.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-06-24 18:58:17 +03:00
Raphael S. Carvalho
f831d1bce9 sstables: add support to generate compaction metadata
compaction metadata is composed of ancestors and cardinality.

ancestors data is generated via compaction process, so it will be
empty by the time being.

cardinality data is generated by hashing the keys, offering the
values to hyperloglog and retrieving a buffer with the data to be
stored.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
2015-06-23 12:13:26 -03:00
Glauber Costa
2dbd2b408a sstables: change describe_type's return type to auto
We always return a future, but with the threaded writer, we can get rid of
that. So while reads will still return a future, the writer will be able to
return void.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-06-08 15:25:35 +03:00
Raphael S. Carvalho
a9619866bb sstables: add initial support to generation of Statistics file
Statistics file is composed of three types of metadata:
- Validation
- Stats
- Compaction

This patch is adding support to generate the first two types.
Compaction is the hardest one to generate because it depends on
external modules. Anyway, I plan to convert whatever is needed
for us to support Compaction metadata as soon as possible.

Related to Stats metadata, we're filling the fields sstable_level
and repaired_at with default values. sstable_level is related to
compaction, and repaired_at is related to SStable repair.
In addition that we don't support compaction nor SStable repair yet,
those values come from upper layers in Cassandra.

Given the facts mentioned above, Statistics file is being generated
with only Validation and Stats metadata. Its on-disk format is
flexible enough so that a missing metadata won't damage it.
So it's technically possible to proceed without Compaction metadata
by the time being.

For reference:
../io/sstable/MetadataCollector.java
../io/sstable/ColumnStats.java
../io/sstable/format/big/BigTableWriter.java

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-06-02 10:32:12 +03:00
Raphael S. Carvalho
1310e79b98 sstables: convert EstimatedHistogram to C++ and start using it
In addition, this patch also fixes serialization and deserialization of
estimated histogram. Problem was found by reading the respective methods
in origin implementation.

The first element of the array offset is used for both the first and
second element of the array bucket. So given an array bucket of size N,
array offset will be of size N - 1. Our code wasn't handling this.

The new representation of estimated histogram provides us with methods
needed for writing the component Statistics.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-06-02 10:32:12 +03:00
Raphael S. Carvalho
05f2bfbe77 sstables: convert StreamingHistogram to C++ and start using it
This step was important to extend streaming_histogram with methods
needed for writing the SSTable component Statistics.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-06-02 10:32:12 +03:00
Raphael S. Carvalho
53a26a5966 sstables: move disk_* types to a header
That's needed to avoid circular dependencies of header files.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-06-02 10:32:12 +03:00
Raphael S. Carvalho
bdd3fe61c5 sstables: add initial support to generation of CRC component
CRC component is composed of chunk size, and a vector of checksums
for each chunk (at most chunk size bytes) composing the data file.
The implementation is about computing the checksum every time the
output stream of data file gets written. A write to output stream
may cross the chunk boundary, so that must be handled properly.
Note that CRC component will only be created if compression isn't
being used.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-06-01 12:25:01 -03:00
Avi Kivity
f2b82fd455 sstables/types.hh: add missing includes 2015-05-26 16:53:34 +03:00
Raphael S. Carvalho
4611fd373d sstables: add missing copyright
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-05-26 09:51:15 +03:00
Raphael S. Carvalho
57060b5dfe sstables: add initial support to generation of summary file
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-05-20 15:17:21 -03:00
Glauber Costa
9d3ef62789 sstables: add convenience constructors for filter type
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-05-19 11:22:41 -04:00
Raphael S. Carvalho
7dc9ba1714 sstables: write summary position field as little endian
The field entries must be written in memory order.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-05-17 11:55:35 +03:00
Glauber Costa
34c6cca845 sstable types: convert a deletion time to a tombstone
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-05-13 17:14:03 -04:00
Glauber Costa
0a4a5914a8 sstables: add helper method for deletion_time
That should make it easier for code to test if a cell or range is live whenever
needed.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
2015-05-13 17:14:03 -04:00
Glauber Costa
2fba948ad8 sstables: move timestamps to signed integer
This is to follow Origin

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-05-13 17:14:02 -04:00