When passing a token corresponding to the 129th key in the sstable to
read_range_rows(), it failed with a heap-buffer-overflow pointing to:
    return make_ready_future<uint64_t>(index_list[min_index_idx].position);
The scenario is as follows. We pass the lower-bound token, which
corresponds to the first partition of some (not the first) summary
page. That token will compare less than any entry in that page (even
less than the key we took it from, because we want all partitions with
that token), so min_idx will point to the previous summary page
(correct). Then this code tries to locate the position in the previous
page:
    auto m = adjust_binary_search_index(this->binary_search(index_list, minimum_key(), min_token));
    auto min_index_idx = m >= 0 ? m : 0;
binary_search() will return ((-index_list.size()) - 1), because the
token is greater than anything in that page. So "m" and
"min_index_idx" will be (index_list.size() - 1) after adjusting.
Then the code tried this:
    auto candidate = key_view(bytes_view(index_list[min_index_idx]));
    auto tcandidate = dht::global_partitioner().get_token(candidate);
    if (tcandidate < min_token) {
        min_index_idx++;
    }
The last key also compares less than the token, so min_index_idx is
bumped up to index_list.size(). The code then used this too-large
index on index_list, which caused the heap-buffer-overflow.
We clearly need to return the first position of the next page in this
case, and this change does so indirectly by calling
data_end_position(), which also handles edge cases such as there being
no next summary page.
I reimplemented the logic top-down, and found that the final special
casing for tcandidate was not needed, so I removed it.
The logger class constructor registers itself with the logger registry
in order to enable setting log levels dynamically. However, since
thread_local variables may be (and in practice are) initialized at the
time of first use, no loggers are registered when the program starts up.
Fix by making loggers global, not thread_local. This requires that the
registry use locking to prevent registration happening on different threads
from corrupting the registry.
Note that technically global variables can also be initialized at the
point of first use, and there is no portable way for classes to self-register.
However this is the best we can do.
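A minimal sketch of the pattern (illustrative names only, not the actual seastar/scylla API): loggers are plain globals, so they register before main() runs, and the registry takes a mutex because registration may now happen from any thread.

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical registry; the real one also supports looking up loggers
// by name to change their levels at runtime.
class logger_registry {
    std::mutex _mutex; // registration can race across threads
    std::vector<std::string> _names;
public:
    void register_logger(const std::string& name) {
        std::lock_guard<std::mutex> g(_mutex);
        _names.push_back(name);
    }
    size_t count() {
        std::lock_guard<std::mutex> g(_mutex);
        return _names.size();
    }
};

// Function-local static: constructed on first use, so it is ready even
// when logger globals are initialized before main().
logger_registry& registry() {
    static logger_registry r;
    return r;
}

class logger {
public:
    explicit logger(const std::string& name) {
        registry().register_logger(name); // self-registration
    }
};

// Global (not thread_local) loggers: registered at program startup.
logger db_log("database");
logger io_log("io");
```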
This method was incomplete, and would thus fail when the map size was
greater than max_bin_size, bringing the application down.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Reviewed-by: Pekka Enberg <penberg@cloudius-systems.com>
It is a better fit for things that are names, not blobs. We have a user that expects
a bytes parameter, but for no other reason than that the field used
to be of type bytes.
Let's fix that, so future users will be able to use sstrings.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Persist column family's "is_dense" value to system tables. Please note
that we throw an exception if "is_dense" is null upon read. That needs
to be fixed later by inferring the value from other information like
Origin does.
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
In C*, every compacted sstable keeps track of its ancestors in the
statistics file. Supposedly, that info is used to discard sstable
files from ancestors which for some odd reason weren't deleted.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
metadata_collector was made a member of class sstable, so that the
compaction procedure can use the add_generation method of an
sstable object.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
After compaction, remove the source sstables. This cannot be done
immediately, as ongoing reads might be using them, so we mark the sstable
as "to be deleted", and when all references to this sstable are gone and
the object is destroyed, we see this flag and delete the on-disk files.
This patch doesn't change the low-level compact_sstables() (which doesn't
mark its input sstables for deletion), but rather the higher-level example
"strategy", column_family::compact_all_sstables(). This leaves room for
future strategies that mark the input sstables for deletion only after
performing other steps, once they are sure they don't want to abort the
compaction and return to the old files. If we decide this isn't needed,
we can easily move the mark_for_deletion() call into compact_sstables().
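The lifetime scheme can be sketched like this (hedged toy model, not the real sstables::sstable class; the counter stands in for unlinking the component files):

```cpp
#include <cassert>
#include <memory>

// Toy sstable: deletion of on-disk files is deferred to the destructor,
// which only runs once the last reference is dropped.
class sstable {
    bool _marked_for_deletion = false;
    int& _files_deleted; // stand-in for removing the component files
public:
    explicit sstable(int& files_deleted) : _files_deleted(files_deleted) {}
    void mark_for_deletion() { _marked_for_deletion = true; }
    ~sstable() {
        if (_marked_for_deletion) {
            ++_files_deleted; // real code would unlink the files here
        }
    }
};

// Compaction marks the source sstable, but an ongoing read keeps it
// alive; the files disappear only when the last reference is gone.
int simulate_compaction_then_read_finishes() {
    int files_deleted = 0;
    auto sst = std::make_shared<sstable>(files_deleted);
    auto reader_ref = sst;      // an ongoing read holds a reference
    sst->mark_for_deletion();   // compaction finished; mark the source
    sst.reset();                // compaction drops its reference
    assert(files_deleted == 0); // still alive: the reader holds it
    reader_ref.reset();         // read completes; last reference gone
    return files_deleted;
}
```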
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Following Nadav's discovery of the problem with large writes to the output
stream, it turns out that compressed_file_output_stream also needs the
trim_to_size option enabled. Otherwise, a write to compressed_file_output_stream
larger than _size would result in a buffer larger than the chunk size being
flushed, which is definitely wrong.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
A base class with virtual functions should also have a virtual destructor,
so that if someone deletes an object through a base-class pointer, the
concrete class's destructor will be called.
I thought this missing virtual destructor was to blame for a bug I was
hunting; it turned out not to be, but the missing definition is still
worth adding.
The seemingly silly "default" definition of the move constructor is also
necessary, because when you define the destructor explicitly, the compiler
no longer implicitly defines any constructors for you.
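An illustrative sketch of both rules (hypothetical class names, not the classes from this patch): the virtual destructor makes delete-through-base run the concrete destructor, and the defaulted move constructor restores what declaring the destructor suppressed.

```cpp
#include <cassert>

struct base {
    virtual ~base() = default; // deletes via base* now run the concrete dtor
    base() = default;
    base(base&&) = default;    // declaring the destructor suppresses the
                               // implicit move constructor, so default it back
};

struct concrete : base {
    int& _destroyed;           // counts destructor runs, for demonstration
    explicit concrete(int& destroyed) : _destroyed(destroyed) {}
    ~concrete() override { ++_destroyed; }
};

int destroy_via_base_pointer() {
    int destroyed = 0;
    base* p = new concrete(destroyed);
    delete p;                  // with a virtual dtor, ~concrete() runs;
                               // without it, this would be undefined behavior
    return destroyed;
}
```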
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Our sstable code currently has a bug (not solved by this patch) in writing
large summary files, where several aio write operations are done and one of
them fails with an EINVAL.
Unfortunately and inexplicably, sstable::write_simple simply *hides* this
exception (catches it and ignores it), so the writer never knows the write
failed, and we only get an exception later when sstable::write_components()
tries to load() the sstable it just created.
So in this patch, I remove the hiding of the exception, and now when writing
an sstable with 1,000,000 partitions, I see this in the output:
failed to write sstable: Invalid argument
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
The compressor type is only a part of the information kept in the
compression parameters, and expressions like
schema.get_compressor().get_compressor() do not read very well.
Signed-off-by: Paweł Dziepak <pdziepak@cloudius-systems.com>
"This patch series introduces compression_parameters class which is used
to handle compression options specified at table creation. Such information
is now properly propagated to the database internals."
The two separate functions can now be merged. As a result, the code
that generates statistics data is now much easier to understand.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Asserts should be avoided entirely. Instead, we should throw an
exception, allowing the db to proceed with a "smooth" shutdown.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Apparently, after writing a new sstable, with write_components(), it
is necessary to load() it. I'm not sure why, but we get a crash on
an aio to a closed file descriptor if we don't.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Instead of requiring the user to subclass an "sstable_creator" class to
specify how to create a new sstable (or, in the future, several of them),
switch to an std::function.
In practice, it is much easier to specify a lambda than a class, especially
since C++11 made it easy to capture variables into lambdas - but not into
local classes.
The "commit()" function is also unnecessary. The intention there was to
provide a function to "commit" the new sstables (i.e., rename them).
But the caller doesn't need to supply this function - it can just wait
for the future marking the end of compaction, and run its own commit code
right then.
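The interface change can be sketched as follows (illustrative names only, not the actual compaction API); note how the lambda captures local state, which a local class could not do as conveniently:

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical descriptor for a newly created output sstable.
struct sstable_desc { std::string dir; long generation; };

// Instead of an abstract sstable_creator class, take a callable:
using sstable_creator_fn = std::function<sstable_desc(long generation)>;

// The compaction loop asks the callable for each new output sstable.
std::vector<sstable_desc> compact_into(size_t outputs, sstable_creator_fn create) {
    std::vector<sstable_desc> result;
    for (size_t i = 0; i < outputs; ++i) {
        result.push_back(create(static_cast<long>(i) + 1));
    }
    return result;
}

// Callers pass a lambda capturing whatever context they need (here, the
// output directory) - no subclass boilerplate required.
std::vector<sstable_desc> make_outputs(const std::string& dir, size_t n) {
    return compact_into(n, [&dir](long gen) { return sstable_desc{dir, gen}; });
}
```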
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Column stats min_timestamp, max_timestamp and max_local_deletion_time
were being updated incorrectly.
max_local_deletion_time should be std::numeric_limits<int>::max() by
default, and then keep track of the maximum local deletion time, if any.
This bug prevented an sstable generated by us from being compacted by
C* because max_local_deletion_time was storing
std::numeric_limits<int>::min(), and thus the sstable would be
considered fully expired.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Compaction metadata is composed of ancestors and cardinality.
Ancestors data is generated by the compaction process, so it is left
empty for the time being.
Cardinality data is generated by hashing the keys, offering the
values to hyperloglog, and retrieving a buffer with the data to be
stored.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
The first change was to add the function get_bytes, which creates
a temporary buffer with the format expected by compaction metadata's
cardinality. To create that format, I had to import
write_unsigned_var_int from stream-lib.
write_unsigned_var_int uses fewer bytes to encode smaller integer
values, at the cost of slightly more bytes for larger values.
The last change was to add the function offer_hashed, which receives
a 64-bit hashed value instead. The hash algorithm used by C* is
murmur hash (hash2_64).
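The encoding can be sketched like this (a hedged re-implementation modeled on stream-lib's writeUnsignedVarInt, not the imported code itself): each output byte carries 7 payload bits, least-significant group first, with the high bit set on every byte except the last.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Unsigned variable-length integer encoding: small values take one byte,
// larger values take one extra byte per additional 7 bits.
std::vector<uint8_t> write_unsigned_var_int(uint64_t value) {
    std::vector<uint8_t> out;
    while (value & ~UINT64_C(0x7f)) {
        // More bits remain: emit low 7 bits with the continuation bit set.
        out.push_back(static_cast<uint8_t>((value & 0x7f) | 0x80));
        value >>= 7;
    }
    out.push_back(static_cast<uint8_t>(value)); // final byte, high bit clear
    return out;
}
```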
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
This patch adds the basic compaction function sstables::compact_sstables,
which takes a list of input sstables, and creates several (currently one)
merged sstable. This implementation is pretty simple once we have all
the infrastructure in place (combining reader, writer, and a pipe between
them to reduce context switches).
This is already a working compaction, but it is not quite complete: we'll
need to add compaction strategies (which sstables to compact, and when),
a better cardinality estimator, sstable management and renaming, and a lot
of other details, and we'll probably still need to change the API.
But we can already write a test for compacting existing sstables (see
the next patch), and I wanted to get this patch out of the way, so we can
start working on applying compaction in a real use case.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
The sstable has a lot of data, but surprisingly, an accurate count of the
number of partitions isn't available. We can get a good estimate by looking
at the number of summary entries.
Based on Origin's IndexSummary.getEstimatedKeyCount().
We need this estimate for compaction if we can't get (yet) a better
estimate from the cardinality estimator algorithm.
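The estimate amounts to the following (a hedged sketch after Origin's IndexSummary.getEstimatedKeyCount(); the function name and parameters are illustrative):

```cpp
#include <cassert>
#include <cstdint>

// Each summary entry stands for roughly one index interval's worth of
// partitions, so the entry count times the interval approximates the
// total partition count.
uint64_t estimated_partition_count(uint64_t summary_entry_count,
                                   uint64_t min_index_interval) {
    return summary_entry_count * min_index_interval;
}
```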
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Some versions of Origin will write 0 instead of -1 as the start-of-range
marker for a range tombstone. I've just come across one such table, which
ended up breaking our code. Let's be more flexible in what we accept. We
don't really have a choice.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
As Avi suggested, we should use size_t only for memory sizes, not disk
sizes, as some hypothetical 32-bit machine could have 32-bit size_t
but still support 64-bit file sizes.
So this patch changes a number of places where we used size_t in sstables/
to use uint64_t instead. It doesn't change *all* uses of size_t: where
size_t refers to the size of an object in memory (or of an object that
should fit into memory, like the summary file), I left size_t.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
The current sstable write interface only knows how to write a memtable.
For compaction, we also want it to be able to write the compaction's
output, which we can represent as a mutation_reader. So this patch
changes the sstable::write_components() method to accept a mutation_reader,
and whatever else is needed (a schema and the number of partitions in
the reader - or an estimate thereof).
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
If compression is used, we should provide both the uncompressed and
compressed lengths to the metadata collector, so that the compression
ratio can be computed. Stats metadata stores the compression ratio.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
This class will be used to generate filter hit/miss statistics to be
consumed by upper layers.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
In theory, when we create a new column family, we should also make sure
that the underlying directory exists. However, this would be quite challenging:
there are a lot of entry points to add_column_family, none of them are
futurized, and futurizing them could prove challenging up the call chain.
Because we can guarantee that the keyspace directory will exist - now that we
have unified that - it is actually a lot simpler to just make sure that the
directory exists when writing the sstable.
If the keyspace directory didn't exist, we would have to recurse through the
path. As previously said, this patch assumes that away.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>