Commit Graph

253 Commits

Author SHA1 Message Date
Raphael S. Carvalho
f7a1a5618b sstables: add missing #include guard
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-07-16 07:25:56 +03:00
Tomasz Grabiec
c206699c9d Merge tag 'avi/logger-config/v1' from seastar-dev.git
Logger configuration from Avi.
2015-07-15 11:27:09 +02:00
Tomasz Grabiec
4e5cb05aa4 sstables: Fix heap-buffer-overflow in read_range_rows()
When passing tokens corresponding to 129th key in the sstable to
read_range_rows(), it failed with heap-buffer-overflow pointing to:

      return make_ready_future<uint64_t>(index_list[min_index_idx].position);

The scenario is as follows. We pass the lower bound token, which
corresponds to the first partition of some (not first) summary
page. That token compares less than every entry in that page (even
the key we took it from, because we want all partitions with
that token), so min_idx points to the previous summary page
(correct). This code then tries to locate the position in the previous
page:

  auto m = adjust_binary_search_index(this->binary_search(index_list, minimum_key(), min_token));
  auto min_index_idx = m >= 0 ? m : 0;

binary_search() will return ((-index_list.size()) - 1), because the
token is greater than anything in that page. So "m" and
"min_index_idx" will be (index_list.size() - 1) after adjusting.

Then the code tried this:

        auto candidate = key_view(bytes_view(index_list[min_index_idx]));
        auto tcandidate = dht::global_partitioner().get_token(candidate);
        if (tcandidate < min_token) {
            min_index_idx++;
        }

The last key also compared less than the token, so min_index_idx was
bumped up to index_list.size(). The code then used this out-of-range
index on index_list, which caused the buffer overflow.

We clearly need to return the first position of the next page in this
case, and this change does it indirectly by calling
data_end_position(), which also handles edge cases like if there is no
next summary page.

I reimplemented the logic top-down, and found that the last special
casing for tcandidate was not needed, so I removed it.
2015-07-14 19:58:17 +02:00
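The failure mode described in the commit above can be shown in miniature. What follows is only a sketch under stated assumptions, not the actual sstables code: `index_entry`, `binary_search_tokens`, and `lower_bound_position` are hypothetical names, tokens are plain integers assumed unique within a page, and `next_page_position` stands in for whatever `data_end_position()` would return. The point is that an insertion point equal to the page size must fall back to the next page instead of being used as an index.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one index entry: a token and its data-file position.
struct index_entry {
    int64_t token;
    uint64_t position;
};

// Java-style binary search: returns the index of a match, or
// -(insertion_point) - 1 when the token is not present.
int binary_search_tokens(const std::vector<index_entry>& page, int64_t token) {
    int low = 0;
    int high = static_cast<int>(page.size()) - 1;
    while (low <= high) {
        int mid = low + (high - low) / 2;
        if (page[mid].token < token) {
            low = mid + 1;
        } else if (page[mid].token > token) {
            high = mid - 1;
        } else {
            return mid;
        }
    }
    return -low - 1;
}

// Position of the first entry whose token is >= min_token. If every entry
// in the page is smaller, return the start of the next page (assumption:
// the caller already knows that position) rather than indexing past the end.
uint64_t lower_bound_position(const std::vector<index_entry>& page,
                              int64_t min_token,
                              uint64_t next_page_position) {
    int m = binary_search_tokens(page, min_token);
    // Turn a "not found" result back into its insertion point.
    std::size_t idx = m >= 0 ? static_cast<std::size_t>(m)
                             : static_cast<std::size_t>(-m - 1);
    if (idx >= page.size()) {
        return next_page_position;
    }
    assert(page[idx].token >= min_token);
    return page[idx].position;
}
```

With a three-entry page, a token larger than all three yields an insertion point of 3, which is exactly the index that overflowed in the scenario above; the bounds check turns it into the next page's position.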
Tomasz Grabiec
2a491b2076 sstables: Fix bug in read_range_rows()
The method was using the same summary page for both min and max
tokens, whereas they can be different if they're distant enough from
each other.
2015-07-14 19:58:17 +02:00
Avi Kivity
99a15de9e5 logger: de-thread_local-ize logger
The logger class constructor registers itself with the logger registry,
in order to enable dynamically setting log levels.  However, since
thread_local variables may be (and are) initialized at the time of first
use, when the program starts up no loggers are registered.

Fix by making loggers global, not thread_local.  This requires that the
registry use locking to prevent registration happening on different threads
from corrupting the registry.

Note that technically global variables can also be initialized at the
point of first use, and there is no portable way for classes to self-register.
However this is the best we can do.
2015-07-14 17:18:11 +03:00
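The self-registration pattern from the logger commit above might look roughly like the following. This is a minimal sketch, assuming a hypothetical `logger_registry` and a simplified `logger`; none of these signatures are taken from the real code. It shows the two ingredients the message names: loggers are globals rather than thread_local objects, and the registry takes a lock because registration can happen on any thread.

```cpp
#include <atomic>
#include <cassert>
#include <map>
#include <mutex>
#include <string>
#include <utility>

enum class log_level { error, warn, info, debug };

class logger;

// Hypothetical registry mapping logger names to logger objects, so that
// levels can be changed at run time. All access is serialized by a mutex.
class logger_registry {
    std::mutex _mutex;
    std::map<std::string, logger*> _loggers;
public:
    void register_logger(const std::string& name, logger* l) {
        assert(l != nullptr);
        std::lock_guard<std::mutex> guard(_mutex);
        _loggers[name] = l;
    }
    void unregister_logger(const std::string& name) {
        std::lock_guard<std::mutex> guard(_mutex);
        _loggers.erase(name);
    }
    bool set_level(const std::string& name, log_level level);
};

// Function-local static: constructed on first use, so it exists before
// any global logger tries to register itself.
inline logger_registry& registry() {
    static logger_registry instance;
    return instance;
}

class logger {
    std::string _name;
    std::atomic<log_level> _level{log_level::info};
public:
    explicit logger(std::string name) : _name(std::move(name)) {
        registry().register_logger(_name, this);
    }
    ~logger() { registry().unregister_logger(_name); }
    logger(const logger&) = delete;
    logger& operator=(const logger&) = delete;
    log_level level() const { return _level.load(); }
    void set_level(log_level l) { _level.store(l); }
};

inline bool logger_registry::set_level(const std::string& name, log_level level) {
    std::lock_guard<std::mutex> guard(_mutex);
    auto it = _loggers.find(name);
    if (it == _loggers.end()) {
        return false;
    }
    it->second->set_level(level);
    return true;
}

// A global (not thread_local) logger: registered during static
// initialization instead of on first use by each thread.
logger sstable_log("sstable");
```

Because `sstable_log` is a namespace-scope object, it is already present in the registry when `main()` starts, so a call such as `registry().set_level("sstable", log_level::debug)` can find it without any thread having logged first.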
Raphael S. Carvalho
d3a83aa549 sstables: finish streaming_histogram::update
This method was incomplete, and thus would fail if map size were
greater than max_bin_size, bringing the application down.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Reviewed-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-07-12 11:06:03 +03:00
Avi Kivity
dd29ac9593 Merge "cqlsh" from Glauber
System table work to make cqlsh connect.
2015-07-07 19:33:23 +03:00
Glauber Costa
45905ec94d dht: change partitioner name to sstring
It is a better fit for things that are names, not blobs. We have a user that expects
a bytes parameter, but that is for no other reason than the fact that the field used
to be of bytes type.

Let's fix that, and future users will be able to use sstrings.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-07-07 11:38:22 -04:00
Tomasz Grabiec
66dfeb33d7 db: Filter out sstable partitions not belonging to current shard
2015-07-07 16:56:25 +02:00
Pekka Enberg
86d913954a db/legacy_schema_tables: Store CF "is_dense" to system tables
Persist column family's "is_dense" value to system tables. Please note
that we throw an exception if "is_dense" is null upon read. That needs
to be fixed later by inferring the value from other information like
Origin does.

Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-07-07 12:36:50 +02:00
Paweł Dziepak
16963b214b sstables: do not write expired cells
Signed-off-by: Paweł Dziepak <pdziepak@cloudius-systems.com>
2015-07-02 17:26:14 +02:00
Paweł Dziepak
8fa0049cae sstables: keep current timestamp
Current timestamp will be used to filter out expired cells.

Signed-off-by: Paweł Dziepak <pdziepak@cloudius-systems.com>
2015-07-02 17:26:06 +02:00
Nadav Har'El
53aa239865 sstables: remove some dead code
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
2015-07-02 12:13:22 +03:00
Raphael S. Carvalho
c3372c36a2 sstables: keep track of compacted sstable's ancestors
In C*, every compacted sstable keeps track of its ancestors in the
statistics file. Supposedly, that info is used to discard sstable
files from ancestors which for some odd reason weren't deleted.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-06-30 18:09:01 +03:00
Raphael S. Carvalho
8324ff990c sstables: add method add_generation
metadata_collector was made member of class sstable, such that the
compaction procedure will be able to use the method add_generation
from a sstable object.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-06-30 18:09:01 +03:00
Nadav Har'El
c892228018 compaction: remove compacted sstables
After compaction, remove the source sstables. This cannot be done
immediately, as ongoing reads might be using them, so we mark the sstable
as "to be deleted", and when all references to this sstable are lost and
the object is destroyed, we see this flag and delete the on-disk files.

This patch doesn't change the low-level compact_sstables() (which doesn't
mark its input sstables for deletion), but rather the higher-level example
"strategy" column_family::compact_all_sstables(). I thought we might want
this so that future strategies can mark the input sstables for deletion
only after doing other steps, once they are sure they don't want to abort
the compaction and return to the old files. If we
decide this isn't needed, we can easily move the mark_for_deletion() call
to compact_sstables().

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
2015-06-30 15:00:39 +03:00
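The deferred deletion described in the commit above can be sketched with shared ownership. The names below (`sstable_handle`, the `remove_files` callback) are hypothetical and chosen only for illustration; the real sstable class is not shown in this log. The idea is that marking only sets a flag, and the on-disk removal runs in the destructor, which fires once the last reader has dropped its reference.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>
#include <utility>

// Hypothetical handle to an on-disk sstable. File removal is injected as a
// callback so the sketch does not touch the real filesystem.
class sstable_handle {
    std::string _path;
    bool _marked_for_deletion = false;
    std::function<void(const std::string&)> _remove_files;
public:
    sstable_handle(std::string path,
                   std::function<void(const std::string&)> remove_files)
        : _path(std::move(path)), _remove_files(std::move(remove_files)) {
        assert(_remove_files);
    }
    sstable_handle(const sstable_handle&) = delete;
    sstable_handle& operator=(const sstable_handle&) = delete;

    // Only records the intent; readers holding a reference are unaffected.
    void mark_for_deletion() { _marked_for_deletion = true; }

    // Runs when the last shared_ptr goes away, i.e. when no read can still
    // be using the files.
    ~sstable_handle() {
        if (_marked_for_deletion) {
            _remove_files(_path);
        }
    }
};
```

A compaction step would hold the inputs as `std::shared_ptr<sstable_handle>`, call `mark_for_deletion()` once the output is safely written, and drop its own reference; any read still in flight keeps the files alive until it finishes.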
Raphael S. Carvalho
6dcf136702 sstables: enable trim_to_size option of compressed_file_output_stream
Following Nadav's discovery of the problem with large writes to output stream,
it turns out that compressed_file_output_stream also needs the option
trim_to_size enabled. Otherwise, a write to compressed_file_output_stream
larger than _size would result in a buffer larger than chunk size being
flushed, which is definitely wrong.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
2015-06-28 19:57:07 +03:00
Nadav Har'El
9c7f1744b3 sstables: add missing virtual destructor
A base class with virtual functions should also have a virtual destructor,
so if someone deletes it by the base class pointer, the concrete class's
destructor will be called.

I thought this missing virtual destructor was to blame for a bug I was
hunting. It wasn't, but it's still worth adding the missing definition.

The seemingly redundant defaulted move constructor is also necessary,
because once you declare the destructor explicitly, the compiler no longer
implicitly defines the move constructor for you.

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
2015-06-28 13:28:12 +03:00
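The rule from the commit above can be illustrated with a small sketch. The class names `file_writer_base` and `counting_writer` are hypothetical and are not the classes the patch touched. The base declares a virtual destructor so that deleting through a base pointer runs the derived destructor, and it re-enables moves explicitly because a user-declared destructor suppresses the implicit move operations.

```cpp
#include <cassert>
#include <memory>
#include <type_traits>

// Hypothetical polymorphic base class.
struct file_writer_base {
    file_writer_base() = default;
    // Virtual, so `delete base_ptr` reaches the most-derived destructor.
    virtual ~file_writer_base() = default;
    // Declaring the destructor suppresses the implicit move operations,
    // so they are requested explicitly.
    file_writer_base(file_writer_base&&) = default;
    file_writer_base& operator=(file_writer_base&&) = default;
    virtual int flush() = 0;
};

// Derived class whose destructor has an observable side effect.
struct counting_writer : file_writer_base {
    bool* destroyed;
    explicit counting_writer(bool* flag) : destroyed(flag) {
        assert(flag != nullptr);
    }
    ~counting_writer() override { *destroyed = true; }
    int flush() override { return 0; }
};

static_assert(std::has_virtual_destructor<file_writer_base>::value,
              "base classes deleted through a pointer need a virtual destructor");
```

Holding a `counting_writer` in a `std::unique_ptr<file_writer_base>` and letting it go out of scope sets the flag; without the `virtual` keyword on the base destructor that deletion would be undefined behavior.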
Nadav Har'El
c7b7cf3ca4 sstables write: don't hide exception
Our sstable code currently has a bug (not solved by this patch) in writing
large summary files, where several aio write operations are done and one of
them fails with an EINVAL.

Unfortunately and inexplicably, sstable::write_simple simply *hides* this
exception (catches it and ignores it), so the writer never learns that it
failed, and we only get an exception later when sstable::write_components()
tries to load() the sstable it just created.

So in this patch, I remove the hiding of the exception, and now when writing
an sstable with 1,000,000 partitions, I see this in the output:

	failed to write sstable: Invalid argument

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
2015-06-28 12:04:06 +03:00
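The difference between hiding and reporting a write failure, as discussed in the commit above, can be sketched with plain C++ exceptions. This is an assumption-laden illustration: the real code is future-based, and `write_hiding_errors` and `write_reporting_errors` are hypothetical helpers, not functions from the patch.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>

// Anti-pattern: any failure is swallowed, so the caller believes the
// write succeeded and only finds out later, far from the cause.
void write_hiding_errors(const std::function<void()>& do_write) {
    assert(do_write);
    try {
        do_write();
    } catch (...) {
        // deliberately ignored (this is the behavior being criticized)
    }
}

// Preferred: the failure propagates to the caller with some context.
void write_reporting_errors(const std::function<void()>& do_write) {
    assert(do_write);
    try {
        do_write();
    } catch (const std::exception& e) {
        throw std::runtime_error(std::string("failed to write sstable: ") + e.what());
    }
}
```

Passing a callable that throws `std::invalid_argument("Invalid argument")` to the first helper returns normally, while the second rethrows with a message of the same shape as the one quoted in the commit.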
Raphael S. Carvalho
c5eff7d263 sstables: use move semantics when updating stats
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-06-28 10:10:36 +03:00
Raphael S. Carvalho
261f1f75d0 sstables: update stats when writing tombstones
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-06-28 10:10:36 +03:00
Avi Kivity
90412a2b26 Merge "Compression parameter fixes" from Paweł
2015-06-25 18:48:30 +03:00
Paweł Dziepak
89ede1fc00 schema: rename [set, get]_compressor to [set, get]_compressor_params
Compressor type is only a part of the information kept in compressor
parameters and things like schema.get_compressor().get_compressor()
do not look very good.

Signed-off-by: Paweł Dziepak <pdziepak@cloudius-systems.com>
2015-06-25 17:05:53 +02:00
Raphael S. Carvalho
92054f8391 sstables: fix typo in compaction code
s/estimated_parititions/estimated_partitions

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-06-25 17:45:06 +03:00
Avi Kivity
71e04b412e Merge "Compression parameters" from Paweł
"This patch series introduces compression_parameters class which is used
to handle compression options specified at table creation. Such information
is now properly propagated to the database internals."
2015-06-25 15:48:15 +03:00
Paweł Dziepak
6845cc0ee5 cql3: propagate compression info to schema and sstables code
Signed-off-by: Paweł Dziepak <pdziepak@cloudius-systems.com>
2015-06-25 13:33:00 +02:00
Raphael S. Carvalho
b54d35dcbb sstables: fix use-after-free at the end of compaction
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
2015-06-25 08:39:10 +03:00
Raphael S. Carvalho
79532b6603 sstables: merge prepare_statistics and add_statistics_metadata
The two separate functions can now be merged. As a result, the code
that generates statistics data is now much easier to understand.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-06-24 18:58:17 +03:00
Raphael S. Carvalho
74febd97f2 sstables: avoid assert when generating cardinality metadata
Assert should be completely avoided. Instead, we should trigger an
exception, allowing the db to proceed with a "smooth" shutdown.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-06-24 18:58:17 +03:00
Nadav Har'El
27c238d6b7 sstable: fix load of new sstable
Apparently, after writing a new sstable, with write_components(), it
is necessary to load() it. I'm not sure why, but we get a crash on
an aio to a closed file descriptor if we don't.

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
2015-06-24 16:44:25 +03:00
Nadav Har'El
0b297b9f6c sstable compaction: simplify compact_sstables() function
Instead of requiring the user to subclass a "sstable_creator" class to
specify how to create a new sstable (or in the future, several of them),
switch to an std::function.

In practice, it is much easier to specify a lambda than a class, especially
since C++11 made it easy to capture variables into lambdas - but not into
local classes.

The "commit()" function is also unnecessary. The intention there was to
provide a function to "commit" the new sstables (i.e., rename them).
But the caller doesn't need to supply this function - it can just wait
for the future of the end of compaction, and run its own committing code
right then.

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
2015-06-24 16:44:11 +03:00
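The interface change in the commit above, a callable in place of a subclass, can be sketched as follows. Everything here is a toy under stated assumptions: `toy_sstable` and `compact_toy_sstables` are hypothetical, the merge is a plain sort-and-deduplicate of integer keys, and the real function is asynchronous. What the sketch keeps is the shape of the API: the caller supplies a `std::function` that creates the output sstable, and a lambda can capture whatever state it needs.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <memory>
#include <vector>

// Hypothetical in-memory stand-in for an sstable.
struct toy_sstable {
    long generation;
    std::vector<int> keys;
};

// The creator is any callable returning a fresh output sstable; no
// dedicated "sstable_creator" subclass is needed.
using toy_sstable_creator = std::function<std::shared_ptr<toy_sstable>()>;

std::shared_ptr<toy_sstable> compact_toy_sstables(
        const std::vector<toy_sstable>& inputs,
        const toy_sstable_creator& creator) {
    assert(creator);
    auto out = creator();
    for (const auto& in : inputs) {
        out->keys.insert(out->keys.end(), in.keys.begin(), in.keys.end());
    }
    // Merge step of the toy: sorted, de-duplicated keys.
    std::sort(out->keys.begin(), out->keys.end());
    out->keys.erase(std::unique(out->keys.begin(), out->keys.end()),
                    out->keys.end());
    return out;
}
```

A caller can pass a lambda that captures a generation counter by reference and hands out the next generation on each call, then do its own "commit" work after `compact_toy_sstables` returns.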
Raphael S. Carvalho
0f95066be4 sstables: fix a bug which invalidated generated sstables
Column stats min_timestamp, max_timestamp and max_local_deletion_time
were being updated incorrectly.

max_local_deletion_time should be std::numeric_limits<int>::max() by
default, and then keep track of max local deletion time, if any.

This bug prevented an sstable generated by us from being compacted by
C* because max_local_deletion_time was storing
std::numeric_limits<int>::min(), and thus the sstable would be
considered fully expired.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-06-23 12:13:26 -03:00
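One way to get the behavior described in the commit above (report a default until a value is seen, then report the maximum seen) is a small tracker like the one below. `max_tracker` is a hypothetical name for this sketch and is not claimed to match the class used in the patch.

```cpp
#include <cassert>
#include <limits>

// Hypothetical tracker: returns the default until the first update, and
// the largest value offered afterwards.
template <typename T>
class max_tracker {
    T _default;
    T _value;
    bool _is_set = false;
public:
    explicit max_tracker(T default_value)
        : _default(default_value), _value(default_value) {}

    void update(T v) {
        if (!_is_set || v > _value) {
            _value = v;
            _is_set = true;
        }
    }

    T get() const { return _is_set ? _value : _default; }
};
```

Constructed as `max_tracker<int>(std::numeric_limits<int>::max())`, an sstable with no deletions reports the maximum integer, so it is not mistaken for fully expired; after `update(100)` and `update(50)` it reports 100.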
Raphael S. Carvalho
f831d1bce9 sstables: add support to generate compaction metadata
compaction metadata is composed of ancestors and cardinality.

ancestors data is generated via the compaction process, so it will be
empty for the time being.

cardinality data is generated by hashing the keys, offering the
values to hyperloglog and retrieving a buffer with the data to be
stored.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
2015-06-23 12:13:26 -03:00
Raphael S. Carvalho
954688b327 hyperloglog: modify it according to our needs
The first change was to add the function get_bytes, which will create
a temporary buffer with the format expected by compaction metadata's
cardinality. For creating the format, I had to import
write_unsigned_var_int from stream-lib.
write_unsigned_var_int uses fewer bytes to encode smaller
integer values, but slightly more bytes for larger values.

The last change was to add the function offer_hashed, which receives
a 64-bit hashed value instead. The hash algorithm used by C* is murmur
hash - hash2_64.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-06-23 12:13:25 -03:00
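The trade-off mentioned in the commit above (fewer bytes for small values, slightly more for large ones) is the usual property of a base-128 variable-length integer. The sketch below is a generic version of that encoding, written under the assumption that it matches the scheme the message refers to; the function names are hypothetical and this is not the imported code.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Encode 7 bits per byte, least-significant group first. The high bit of
// each byte says whether another byte follows.
std::vector<uint8_t> encode_unsigned_var_int(uint32_t value) {
    std::vector<uint8_t> out;
    while (value >= 0x80) {
        out.push_back(static_cast<uint8_t>((value & 0x7F) | 0x80));
        value >>= 7;
    }
    out.push_back(static_cast<uint8_t>(value));
    return out;
}

// Inverse of the encoder above, for well-formed input of at most 5 bytes.
uint32_t decode_unsigned_var_int(const std::vector<uint8_t>& in) {
    uint32_t value = 0;
    int shift = 0;
    for (uint8_t b : in) {
        assert(shift <= 28);  // a 32-bit value never needs a sixth byte
        value |= static_cast<uint32_t>(b & 0x7F) << shift;
        if ((b & 0x80) == 0) {
            break;
        }
        shift += 7;
    }
    return value;
}
```

Values below 128 take one byte instead of four, while a value that uses all 32 bits takes five bytes, one more than a fixed-width encoding.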
Raphael S. Carvalho
2c95fdd7b6 sstables: import hyperloglog implementation
Imported from: https://github.com/hideo55/cpp-HyperLogLog
HEAD: 3ff431b5af84aa28a6390430bbc60c4678bec763

hyperloglog is a cardinality estimator that will be used by the sstable
compaction metadata stored in the Statistics file.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-06-23 11:59:45 -03:00
Raphael S. Carvalho
0d345dfd32 sstables: extend metadata collector to add ancestors
Ancestors is a member of compaction metadata.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-06-23 11:18:50 -03:00
Nadav Har'El
f26dae3bf9 sstable: basic compaction function
This patch adds the basic compaction function sstables::compact_sstables,
which takes a list of input sstables, and creates several (currently one)
merged sstable. This implementation is pretty simple once we have all
the infrastructure in place (combining reader, writer, and a pipe between
them to reduce context switches).

This is already a working compaction, but not quite complete: We'll need
to add compaction strategies (which sstables to compact, and when), a
better cardinality estimator, sstable management and renaming, and a lot
of other details, and we'll probably still need to change the API.
But we can already write a test for compacting existing sstables (see
the next patch), and I wanted to get this patch out of the way, so we can
start working on applying compaction in a real use case.

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
2015-06-23 09:48:58 +03:00
Nadav Har'El
6063d4502f sstable: method for estimating number of partitions in sstable
The sstable has a lot of data, but surprisingly, an accurate count of the
number of partitions isn't available. We can get a good estimate by looking
at the number of summary entries.

Based on Origin's IndexSummary.getEstimatedKeyCount().

We need this estimate for compaction if we can't get (yet) a better
estimate from the cardinality estimator algorithm.

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
2015-06-23 09:48:57 +03:00
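The estimate described in the commit above follows from how a summary is built: it keeps one sample for every fixed number of index entries, so multiplying back gives an approximate partition count. The sketch below assumes exactly that and nothing more; `estimate_partitions` and its `sampling_interval` parameter are hypothetical, and the exact formula in Origin's IndexSummary.getEstimatedKeyCount() may differ.

```cpp
#include <cassert>
#include <cstdint>

// Rough partition count: one summary entry is assumed to stand for
// `sampling_interval` consecutive partitions.
uint64_t estimate_partitions(uint64_t summary_entries, uint32_t sampling_interval) {
    assert(sampling_interval > 0);
    return summary_entries * sampling_interval;
}
```

For example, 8 summary entries with an assumed interval of 128 give an estimate of 1024 partitions; under this assumption the estimate can overshoot the true count by up to one interval, because the last summary entry may cover fewer partitions than the others.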
Glauber Costa
6336c02cab sstables: fix bug with old sstable
Some versions of Origin will write 0 instead of -1 as the start of range marker
for a range tombstone. I've just come across one such table, which ended up
breaking our code. Let's be more flexible in what we accept. We don't really have
a choice.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-06-22 11:13:42 -04:00
Raphael S. Carvalho
118e4fc8be sstable: make do_write_components more pleasant to read
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-06-22 16:27:47 +03:00
Nadav Har'El
00e9f9e06a sstables: don't use size_t for on-disk sizes
As Avi suggested, we should use size_t only for memory sizes, not disk
sizes, as some hypothetical 32-bit machine could have 32-bit size_t
but still support 64-bit file sizes.

So this patch changes a number of places we used size_t in sstables/
to use uint64_t instead. It doesn't change *all* uses of size_t: Where
the size_t refers to a size of an object in memory (or an object that
should fit into memory - like the summary file), I left size_t.

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
2015-06-22 08:02:50 +03:00
Nadav Har'El
646c31e596 sstables: take generic mutation_reader instead of memtable
The current sstable write interface only knows how to write a memtable.
For compaction, we also want it to be able to write the compaction's
output, which we can represent as a mutation_reader. So this patch
changes the sstable::write_components() method to accept a mutation_reader,
and whatever else is needed (a schema and the number of partitions in
the reader - or an estimate thereof).

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
2015-06-21 23:32:17 +03:00
Avi Kivity
7faebf4524 Merge "Bloom filter tracker" from Glauber
"This is the code responsible for tracking bloom filter activity. Origin uses
atomics, we will just keep local counters and map-reduce (yay!)"
2015-06-21 08:15:54 +03:00
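The "local counters and map-reduce" approach from the merge above can be sketched without atomics. The names `filter_stats` and `reduce_filter_stats` are hypothetical, and a plain vector stands in for the per-shard instances; the real code would gather the values across shards asynchronously.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical per-shard counters. Each shard touches only its own
// instance, so plain integers are enough.
struct filter_stats {
    uint64_t hits = 0;
    uint64_t misses = 0;
    void add_hit() { ++hits; }
    void add_miss() { ++misses; }
};

// Reduce step: sum the per-shard counters into one total on demand.
filter_stats reduce_filter_stats(const std::vector<filter_stats>& per_shard) {
    return std::accumulate(per_shard.begin(), per_shard.end(), filter_stats{},
        [](filter_stats acc, const filter_stats& s) {
            acc.hits += s.hits;
            acc.misses += s.misses;
            return acc;
        });
}
```

The cost of combining is paid only when statistics are requested, while the hot path of recording a hit or a miss stays a non-atomic increment.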
Raphael S. Carvalho
113d3b1001 sstables: update compression ratio stats
If compression is used, we should provide both the uncompressed and
compressed lengths to the metadata collector, so that the ratio can
be computed. Stats metadata stores the compression ratio.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-06-21 08:14:07 +03:00
Glauber Costa
23cae98cd6 sstables: count filter hits and misses
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-06-18 11:17:36 -04:00
Glauber Costa
272b97f01c sstables: initialize filter_tracker when filter file is read
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-06-18 11:17:29 -04:00
Glauber Costa
1c5d2141db add sstables filter tracker
This class will be used to generate filter hit / miss statistics to be consumed
by upper layers

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-06-18 11:17:22 -04:00
Glauber Costa
4238ecffd3 sstables: make sure sstable directory exists
In theory, when we create a new column family, we should also make sure
that the underlying directory exists. However, this would be quite challenging:
there are a lot of entry points for add_column_family, none of them is futurized,
and futurizing them could prove challenging up the call chain.

Because we can guarantee that the keyspace directory will exist - now that we
have unified that - it is actually a lot simpler to just make sure that the
directory exists when writing the sstable.

If the keyspace directory didn't exist, we would have to recurse through the
path.  As previously said, this patch will assume this away.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-06-18 09:22:58 -04:00
Raphael S. Carvalho
657c817d14 sstables: add support to deflate compression
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-06-17 11:55:02 -03:00
Raphael S. Carvalho
7244d392cd sstables: add support to snappy compression
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-06-17 11:55:00 -03:00