The sstables write path has been partially de-futurized, but now creates a
ton of threads, yet does not exploit them since everything is serialized.
Remove those extra threads and futures and use a single thread to write
everything. If needed, we'll employ write-behind in output_stream to
increase parallelism.
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
range::is_wrap_around() and range::contains() rely on a total ordering
on values to work properly. The current ring_position_comparator imposed
only a weak ordering (a token position compared equal to all key
positions with that token).
range::before() and range::after() can't work with a weak ordering: if
the bound is exclusive, we don't know whether a user-provided token
position is inside or outside.
Also, is_wrap_around() can't properly detect wrap around in all
cases. Consider this case:
(1) ]A; B]
(2) [A; B]
For A = (tok1) and B = (tok1, key1), (1) is a wrap around and (2) is
not. Without total ordering between A and B, range::is_wrap_around() can't
tell that.
I think the simplest solution is to define a total ordering on
ring_position by placing a token position either before or after all
keys with that token.
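The idea can be sketched as follows, assuming a deliberately simplified
model of ring_position (the field names and types are illustrative, not
the actual scylla code): a key-less position carries a "bound" telling
whether it sorts before or after all keys sharing its token.

```cpp
#include <optional>
#include <string>

// Hypothetical simplified ring position: a token, an optional partition
// key, and -- for key-less positions -- whether the position sorts
// before (-1) or after (+1) every key sharing that token.
struct ring_position {
    int token;                       // stand-in for the real token type
    std::optional<std::string> key;  // partition key, if any
    int bound = 0;                   // -1: before all keys, +1: after all keys
};

// Total order: compare tokens first; for equal tokens, a key-less
// position is placed before or after all keys according to its bound.
int tri_compare(const ring_position& a, const ring_position& b) {
    if (a.token != b.token) {
        return a.token < b.token ? -1 : 1;
    }
    int wa = a.key ? 0 : a.bound;    // keyed positions sit in the middle
    int wb = b.key ? 0 : b.bound;
    if (wa != wb) {
        return wa < wb ? -1 : 1;
    }
    if (a.key && b.key) {
        return a.key->compare(*b.key);
    }
    return 0;                        // both key-less with the same bound
}
```

With A = a key-less lower bound and B = (same token, some key), A now
compares strictly less than B, so is_wrap_around() can tell ]A; B] and
[A; B] apart.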
Support for compaction strategy options was recently added.
Previously, we were using default values in compaction strategy for
options, but now we can use the options defined in the schema.
Currently, we only support size-tiered strategy, so let's start
with it.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
All sharded services "should" define a stop method, and calling it is
also good practice. For this one specifically, though, we will not call
stop. We lack a good way to add a deleter to a shared_ptr class, and
that would be the only reliable way to tie into its lifetime.
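For illustration only: std::shared_ptr (unlike the seastar shared_ptr
this message refers to) does accept a custom deleter, which is exactly
the kind of lifetime hook that would let stop() run when the last
reference goes away. The service type here is hypothetical.

```cpp
#include <memory>

// Hypothetical sharded service with a stop() method.
struct service {
    bool stopped = false;
    void stop() { stopped = true; }
};

// Tie stop() into the pointer's lifetime via a custom deleter: it runs
// exactly once, when the final reference is dropped.
bool stop_called_on_last_release() {
    bool stopped = false;
    {
        std::shared_ptr<service> p(new service, [&stopped](service* s) {
            s->stop();               // lifetime hook: stop before freeing
            stopped = s->stopped;
            delete s;
        });
        auto q = p;                  // extra reference; deleter not run yet
    }                                // last copy dies here; deleter runs
    return stopped;
}
```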
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
We still want to wrap it instead of writing the column name directly, so
that we are able to update the statistics.
It is better to have a separate function for this, because write_column_name
doesn't have enough information to decide when to do what. Augmenting it so
that it did would require passing the schema, or an extra parameter, which
would then spread to all callers.
Keep in mind that testing for an empty clustering key is not enough, since
composite types will serialize the empty clustering key in this case.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Get values from cf->schema instead of using hardcoded threshold
values. In addition, move DEFAULT_MIN_COMPACTION_THRESHOLD and
DEFAULT_MAX_COMPACTION_THRESHOLD to schema.hh so as not to have
knowledge duplicated.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Since parsing involves a unique_ptr<metadata> holding a pointer to a
subclass of metadata, metadata must define a virtual destructor;
otherwise deleting the object can leak memory, or, with C++14 sized
deallocation, free it through the wrong memory pool.
Seen on EC2.
Define a virtual destructor to tell the compiler how to destroy
and free the object.
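A minimal reproduction of the pattern (the names are illustrative, not
the actual sstables code): the subclass's state is only destroyed
through the base pointer because the base destructor is virtual.

```cpp
#include <memory>
#include <vector>

static int live_objects = 0;         // counts constructed subclass objects

struct metadata {
    virtual ~metadata() = default;   // the fix: a virtual destructor
};

struct validation_metadata : metadata {
    std::vector<char> partitioner;   // subclass state that must be freed
    validation_metadata() { ++live_objects; }
    ~validation_metadata() override { --live_objects; }
};

// Parsing returns the subclass through a base-class unique_ptr, which is
// exactly the situation that requires the virtual destructor above.
std::unique_ptr<metadata> parse() {
    return std::make_unique<validation_metadata>();
}

// Returns 0 when destruction through unique_ptr<metadata> correctly ran
// the subclass destructor (i.e. nothing leaked).
int leaked_after_parse() {
    { auto m = parse(); }            // destroyed via the base pointer
    return live_objects;
}
```

Without `virtual` on ~metadata(), deleting through the base pointer is
undefined behavior and the subclass destructor never runs.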
It is now necessary to close() a file before destroying it, otherwise a big
ugly warning message is printed by the reactor. Our sstable read path was
especially careless about closing the countless files it opens, and the
sstable test generated as many as 400 (!) of these warning messages, despite
running correctly. This patch adds the missing close() calls.
After this patch, the sstable test still shows 3 warning messages.
Those are unavoidable: They happen while broken sstables are being
tested, and an exception is thrown in the middle of the sstable processing,
causing us to destroy a file object without calling close() on it first.
This, in my opinion, proves that requiring close() in the read path is not
a good thing: it is un-RAII-like and not exception-safe. But it is benign
apart from the warning message, so whatever. 3 scary warning messages from
the test are better than 400...
If these 3 remaining messages really bother us, I guess we can fix it by
catching the exceptions in the sstable code, closing the file and rethrowing
the exception, but it will be quite ugly...
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
There is some missing information in the last log printout, because
it's currently hard to generate such information.
Anyway, this patch is a good start towards providing the same log
messages as origin.
Addresses issue #12
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
data_consume_rows(0, 0) was returning all partitions instead of no
partitions, because -1 was passed as the count in that case and then
cast to uint64_t.
Special-casing it that way is problematic for code which calculates the
bounds: when the key is not found, we simply end up with 0 as the upper
bound. Instead of convoluting the range lookup code to special-case 0,
let's simplify the interface so that (0, 0) returns no rows, same as
(1, 1). There is a new overload of data_consume_rows() without bounds,
which returns all data.
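The bug class is easy to reproduce in isolation (this is a sketch, not
the actual reader code): a special-cased count of -1 meaning "all rows",
once cast to uint64_t, becomes 2^64 - 1, so asking for the empty range
(0, 0) requested everything.

```cpp
#include <cstdint>

// Old behavior, reduced to its essence: (0, 0) was mapped to a count of
// -1, and the unsigned cast turned that into a huge "read everything".
uint64_t rows_requested(int64_t begin, int64_t end) {
    int64_t count = (begin == 0 && end == 0) ? -1 : end - begin; // old special case
    return static_cast<uint64_t>(count);                         // -1 -> 2^64 - 1
}
```

The fix avoids the sentinel entirely: (0, 0) simply means zero rows, and
"all rows" gets its own overload with no bounds to misinterpret.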
Recently, "file" started to use a shared_ptr internally, making it
copyable and reference counted, so there is no reason to use
lw_shared_ptr<file>. This patch cleans up a few remaining places where
lw_shared_ptr<file> was used.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
I will need those abstractions later to handle
inclusiveness/exclusiveness of both starting and ending bounds.
They're also familiar abstractions, so the code is hopefully easier to
comprehend now.
The entry contains not only the key but other stuff like the
position. Why would casting to bytes_view give a view of just the key
and not the whole entry? Better to be explicit.
That's helpful for the purpose of testing, and leveled compaction may
also end up using size-tiered compaction strategy for selecting
candidates.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
If the old average equals the new average, then we would remove the
new average's entry. That's totally wrong.
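An illustrative pitfall of this shape, not the actual histogram code: if
entries are keyed by their average and the merge "removes the old entry"
by key, an old average equal to the new one erases the entry we just
inserted as well.

```cpp
#include <cstddef>
#include <map>

// Insert the merged entry, then remove the old one by key -- the buggy
// order of operations described above.
std::size_t entries_left_after_merge(long old_avg, long new_avg) {
    std::multimap<long, unsigned long> buckets;
    buckets.insert({new_avg, 1});    // insert the merged (new) entry
    buckets.erase(old_avg);          // erases EVERY entry with this key
    return buckets.size();           // 0 when old_avg == new_avg: the bug
}
```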
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Size-tiered strategy basically consists of creating buckets with
sstables of nearly the same size.
Afterwards, it will find the most interesting bucket, whose size must be
between the min threshold and the max threshold. The bucket with the
smallest average size is the most interesting one.
Bucket hotness is also considered when finding the most interesting
bucket, but we don't support this yet.
We are also missing the code that discards an sstable based on its
coldness, i.e. one that is hardly ever read.
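The two steps above can be sketched like this (a simplification: the
parameter names and defaults are assumptions, and real size-tiered
bucketing compares against a bucket_low bound as well):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Step 1: sort sstable sizes and start a new bucket whenever a size
// strays too far above the current bucket's running average.
std::vector<std::vector<uint64_t>>
make_buckets(std::vector<uint64_t> sizes, double bucket_high = 1.5) {
    std::sort(sizes.begin(), sizes.end());
    std::vector<std::vector<uint64_t>> buckets;
    double avg = 0;
    for (uint64_t s : sizes) {
        if (buckets.empty() || s > avg * bucket_high) {
            buckets.push_back({s});              // start a new bucket
        } else {
            buckets.back().push_back(s);         // close enough: join it
        }
        auto& b = buckets.back();
        avg = std::accumulate(b.begin(), b.end(), 0.0) / b.size();
    }
    return buckets;
}

// Step 2: the most interesting bucket is the one with the smallest
// average size among buckets holding at least min_threshold sstables;
// at most max_threshold sstables are compacted at once.
std::vector<uint64_t>
most_interesting(const std::vector<std::vector<uint64_t>>& buckets,
                 std::size_t min_threshold = 4,
                 std::size_t max_threshold = 32) {
    for (const auto& b : buckets) {  // buckets are ordered by size, so the
        if (b.size() >= min_threshold) {         // first eligible one has
            return {b.begin(),                   // the smallest average
                    b.begin() + std::min(b.size(), max_threshold)};
        }
    }
    return {};                       // nothing worth compacting
}
```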
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Returns the sum of the sizes of all sstable components.
It will be used by the size-tiered strategy.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
As the name implies, this patch introduces the concept of automatic
compaction for sstables.
A compaction task is triggered whenever a new sstable is written.
Concurrent compaction on the same column family isn't supported, so
compaction may be postponed if there is an ongoing compaction.
In addition, seastar::gate is used both to prevent a new compaction
from starting and to wait for an ongoing compaction to finish when the
system is asked to shut down.
This patch also introduces an abstract class for compaction strategy,
which is really useful for supporting multiple strategies.
Currently, null and major compaction strategies are supported.
As the name implies, null compaction strategy does nothing.
Major compaction strategy is about compacting all sstables into one.
This strategy may end up being helpful when adding support for major
compaction via nodetool.
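The gate usage can be sketched synchronously (the real seastar::gate is
future-based; this minimal stand-in only illustrates the pattern): each
compaction holds the gate while running, and shutdown closes the gate,
which both rejects new compactions and reports when ongoing ones drain.

```cpp
// Minimal synchronous sketch of the seastar::gate pattern.
class gate {
    long _count = 0;                 // compactions currently running
    bool _closed = false;            // shutdown requested?
public:
    bool try_enter() {               // a compaction is about to start
        if (_closed) {
            return false;            // shutting down: refuse new work
        }
        ++_count;
        return true;
    }
    void leave() { --_count; }       // a compaction finished
    void close() { _closed = true; } // stop admitting new compactions
    bool drained() const {           // safe to complete shutdown?
        return _closed && _count == 0;
    }
};
```

Shutdown closes the gate first, then waits until drained() holds, so no
compaction is ever cut off mid-flight.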
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
When passing tokens corresponding to the 129th key in the sstable to
read_range_rows(), it failed with a heap-buffer-overflow pointing to:
return make_ready_future<uint64_t>(index_list[min_index_idx].position);
The scenario is as follows. We pass the lower bound token, which
corresponds to the first partition of some (not the first) summary
page. That token will compare less than any entry in that page (even
less than the key we took it from, since we want all partitions with
that token), so min_idx will point to the previous summary page
(correct). Then this code tries to locate the position in the previous
page:
auto m = adjust_binary_search_index(this->binary_search(index_list, minimum_key(), min_token));
auto min_index_idx = m >= 0 ? m : 0;
binary_search() will return ((-index_list.size()) - 1), because the
token is greater than anything in that page. So "m" and
"min_index_idx" will be (index_list.size() - 1) after adjusting.
Then the code tried this:
auto candidate = key_view(bytes_view(index_list[min_index_idx]));
auto tcandidate = dht::global_partitioner().get_token(candidate);
if (tcandidate < min_token) {
min_index_idx++;
}
The last key also compared less than the token, so min_index_idx was
bumped up to index_list.size(). The code then used this out-of-range
index on index_list, which caused the buffer overflow.
We clearly need to return the first position of the next page in this
case, and this change does it indirectly by calling
data_end_position(), which also handles edge cases like if there is no
next summary page.
I reimplemented the logic top-down, and found that the last special
casing for tcandidate was not needed, so I removed it.
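The binary_search() return convention involved here is the classic
"negative insertion point" encoding (the same contract as Java's
Collections.binarySearch); a self-contained sketch, with the adjustment
step written as I understand it from the message:

```cpp
#include <vector>

// On a hit, return the index; on a miss, return (-insertion_point - 1),
// so the caller can recover where the key would be inserted.
int binary_search(const std::vector<int>& v, int key) {
    int lo = 0, hi = static_cast<int>(v.size()) - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (v[mid] < key) {
            lo = mid + 1;
        } else if (v[mid] > key) {
            hi = mid - 1;
        } else {
            return mid;              // exact hit
        }
    }
    return -lo - 1;                  // miss: lo is the insertion point
}

// Map a miss to the index of the entry *before* the insertion point.
// For a key greater than every entry, binary_search returns
// (-size - 1), and this yields (size - 1) -- the last entry -- which is
// how the old code ended up one step from walking off the page.
int adjust_binary_search_index(int m) {
    return m >= 0 ? m : -m - 2;
}
```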
The logger class constructor registers itself with the logger registry
in order to enable dynamically setting log levels. However, since
thread_local variables may be (and are) initialized at the point of
first use, no loggers are registered when the program starts up.
Fix by making loggers global, not thread_local. This requires that the
registry use locking to prevent registration happening on different threads
from corrupting the registry.
Note that technically global variables can also be initialized at the
point of first use, and there is no portable way for classes to
self-register. However, this is the best we can do.
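A sketch of the resulting shape (names are illustrative, not the actual
logging code): a single global, mutex-protected registry shared by all
threads, instead of per-thread thread_local registries that may not
exist yet when log levels are set.

```cpp
#include <cstddef>
#include <mutex>
#include <string>
#include <unordered_map>

// Global registry; the mutex makes registration from any thread safe.
class logger_registry {
    std::mutex _mutex;
    std::unordered_map<std::string, int> _levels; // logger name -> level
public:
    void register_logger(const std::string& name) {
        std::lock_guard<std::mutex> lock(_mutex);
        _levels.emplace(name, 0);    // default level for a new logger
    }
    std::size_t count() {
        std::lock_guard<std::mutex> lock(_mutex);
        return _levels.size();
    }
};

// Global (NOT thread_local), so every logger lands in the same registry
// and dynamic level changes can find it.
logger_registry& registry() {
    static logger_registry r;        // constructed once, on first use
    return r;
}

// Each logger self-registers on construction, as the message describes.
struct logger {
    explicit logger(const std::string& name) {
        registry().register_logger(name);
    }
};
```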
This method was incomplete, and thus would fail when the map size was
greater than max_bin_size, bringing the application down.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Reviewed-by: Pekka Enberg <penberg@cloudius-systems.com>