297 Commits

Author SHA1 Message Date
Avi Kivity
3a5e3c8829 sstables: de-futurize write path
The sstables write path has been partially de-futurized, but now creates a
ton of threads, and yet does not exploit this as everything is serialized.

Remove those extra threads and futures and use a single thread to write
everything.  If needed, we'll employ write-behind in output_stream to
increase parallelism.

Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
2015-08-03 20:33:59 +03:00
Avi Kivity
ad443e4771 sstable: add accessor for first/last partition keys 2015-08-03 20:17:41 +03:00
Avi Kivity
6ca6f0c3a4 sstables: add conversion function from sstable key to partition key 2015-08-03 20:17:40 +03:00
Raphael S. Carvalho
477a3586d7 compaction: add missing information to compaction log
duration and throughput weren't being calculated.

closes #54.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-08-02 19:15:57 +03:00
Avi Kivity
98ec451d6a Extract range<> into its own header
It's not just for queries any more.
2015-08-02 16:07:42 +03:00
Paweł Dziepak
430f74a8bb sstables: read expired or expiring row marker
Signed-off-by: Paweł Dziepak <pdziepak@cloudius-systems.com>
2015-07-30 14:10:06 +02:00
Paweł Dziepak
f5e3764570 sstables: properly write expiring row marker
Signed-off-by: Paweł Dziepak <pdziepak@cloudius-systems.com>
2015-07-30 14:10:06 +02:00
Raphael S. Carvalho
c9fdc7dc5d compaction: get rid of invalid FIXME comment
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-07-28 19:22:26 +03:00
Avi Kivity
2e745bebad Merge "use compaction strategy options" from Raphael 2015-07-27 17:06:43 +03:00
Tomasz Grabiec
e5feff5d71 dht: ring_position: Switch to total ordering
range::is_wrap_around() and range::contains() rely on total ordering
on values to work properly. Current ring_position_comparator was only
imposing a weak ordering (token positions equal to all key positions
with that token).

range::before() and range::after() can't work for weak ordering. If
the bound is exclusive, we don't know if user-provided token position
is inside or outside.

Also, is_wrap_around() can't properly detect wrap around in all
cases. Consider this case:

 (1) ]A; B]
 (2) [A; B]

For A = (tok1) and B = (tok1, key1), (1) is a wrap around and (2) is
not. Without total ordering between A and B, range::is_wrap_around() can't
tell that.

I think the simplest solution is to define a total ordering on
ring_position by making token positions positioned either before or
after all keys with that token.
2015-07-24 16:08:41 +02:00
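
The total ordering described in the commit above can be sketched as follows. This is a minimal illustration, not the real dht::ring_position: the names ring_pos, token_bound and tri_compare are hypothetical, tokens are plain integers and keys are strings. A token-only position carries a marker that places it either before or after every key with the same token, so it never compares equal to a keyed position.

```cpp
#include <optional>
#include <string>

// Hypothetical, simplified stand-in for a ring position (not the real
// dht::ring_position): an integer token plus an optional key.
enum class token_bound { before_all_keys = -1, after_all_keys = 1 };

struct ring_pos {
    long token;
    std::optional<std::string> key;                    // empty for a token-only position
    token_bound bound = token_bound::before_all_keys;  // consulted only when key is empty
};

// Three-way comparison that defines a total order: negative, zero or positive.
inline int tri_compare(const ring_pos& a, const ring_pos& b) {
    if (a.token != b.token) {
        return a.token < b.token ? -1 : 1;
    }
    if (a.key && b.key) {
        int c = a.key->compare(*b.key);
        return (c > 0) - (c < 0);
    }
    if (!a.key && !b.key) {
        return static_cast<int>(a.bound) - static_cast<int>(b.bound);
    }
    // Exactly one side is token-only, and its marker decides the order.
    return a.key ? -static_cast<int>(b.bound) : static_cast<int>(a.bound);
}
```

With A = (tok1) and B = (tok1, key1), A now sorts strictly before or strictly after B, which is the property a wrap-around check on a range needs.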
Raphael S. Carvalho
70770c261b sstables: remove double percentage symbol from compaction log message
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-07-24 10:21:38 +02:00
Raphael S. Carvalho
634d00511b compaction: use compaction options in strategy
Support for compaction strategy options was recently added.
Previously, we were using default values in compaction strategy for
options, but now we can use the options defined in the schema.
Currently, we only support size-tiered strategy, so let's start
with it.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-07-23 15:26:47 -03:00
Glauber Costa
4cd143de87 filter_tracker: define and call a stop method
All sharded services "should" define a stop method. Calling them is also
a good practice. For this one specifically, though, we will not call stop.
We miss a good way to add a Deleter to a shared_ptr class, and that would
be the only reliable way to tie into its lifetime.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-07-23 11:11:57 -04:00
Glauber Costa
96f7c77a04 sstables: write dense tables
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-07-22 23:10:22 -04:00
Glauber Costa
2757cc595a sstable partition: read dense tables
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-07-22 23:10:22 -04:00
Glauber Costa
87c77acbac sstables: correctly write column names for non compound types
This can happen for COMPACT STORAGE.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-07-22 23:10:21 -04:00
Glauber Costa
3383c619ad partition: handle reads of non-composite types
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-07-22 23:10:21 -04:00
Glauber Costa
e9094db7ef sstable partition: remove dead code
This is no longer used

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-07-22 23:10:21 -04:00
Glauber Costa
5b7c749310 sstables: simplified version of write_column_name for non-clustered columns
We still want to wrap it instead of writing the column name directly, so we are
able to update the statistics.

It is better to have a separate function for this, because write_column_name
doesn't have enough information to decide when to do what. Augmenting it so we
could have would require passing the schema, or an extra parameter, which would
then spread to all callers.

Keep in mind that testing for an empty clustering key is not enough, since
composite types will serialize the empty clustering key in this case.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-07-22 23:10:21 -04:00
Raphael S. Carvalho
e57fe36249 compaction: get compaction threshold from schema instead
Get values from cf->schema instead of using hardcoded threshold
values. In addition, move DEFAULT_MIN_COMPACTION_THRESHOLD and
DEFAULT_MAX_COMPACTION_THRESHOLD to schema.hh so as not to have
knowledge duplicated.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-07-22 18:03:23 +03:00
Avi Kivity
6a9d0495f8 sstables: fix memory corruption in metadata parsing
Since parsing involves a unique_ptr<metadata> holding a pointer to a
subclass of metadata, it must define a virtual destructor, or it can
cause memory leaks when deleted, or, with C++14 sized deallocators, it
can cause the wrong memory pool to be used for deleting the object.

Seen on EC2.

Define a virtual destructor to tell the compiler how to destroy
and free the object.
2015-07-22 17:46:37 +03:00
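
The rule behind this fix can be shown with a small sketch. The types below are hypothetical stand-ins, not the actual sstables metadata classes. Deleting a derived object through a base-class pointer is only well defined when the base class has a virtual destructor; with one, the derived destructor runs and the correct object size is known at deallocation time.

```cpp
#include <memory>
#include <vector>

// Hypothetical base class; the virtual destructor is the point of the example.
struct metadata {
    virtual ~metadata() = default;  // lets delete-through-base run the derived destructor
};

// Hypothetical subclass that is larger than the base and owns resources.
struct stats_metadata : metadata {
    explicit stats_metadata(int* destroyed) : _destroyed(destroyed) {}
    ~stats_metadata() override { ++*_destroyed; }
    std::vector<char> payload = std::vector<char>(4096);
private:
    int* _destroyed;
};

// Creates a derived object, destroys it through a base-class unique_ptr and
// returns how many derived destructors ran.
inline int destroy_through_base() {
    int destroyed = 0;
    {
        std::unique_ptr<metadata> p = std::make_unique<stats_metadata>(&destroyed);
    }
    return destroyed;
}
```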
Avi Kivity
69a94732df Merge "logging compaction activity" from Raphael 2015-07-22 16:00:57 +03:00
Nadav Har'El
630ccf5a09 sstables: remember to close() files
It is now necessary to close() a file before destroying it, otherwise a big
ugly warning message is printed by the reactor. Our sstable read path was
especially careless about closing the countless files it opens, and the
sstable test generated as many as 400 (!) of these warning messages, despite
running correctly. This patch adds the missing close() calls.

After this patch, the sstable test still shows 3 warning messages.
Those are unavoidable: They happen while broken sstables are being
tested, and an exception is thrown in the middle of the sstable processing,
causing us to destroy a file object without calling close() on it first.
This, in my opinion, proves that requiring close() in the read path is not
a good thing, it is un-RAII-like and not exception-safe. But it is benign
except the warning message, so whatever. 3 scary warning messages from the
test are better than 400...

If these 3 remaining messages really bother us, I guess we can fix it by
catching the exceptions in the sstable code, closing the file and rethrowing
the exception, but it will be quite ugly...

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
2015-07-22 15:30:59 +03:00
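
The catch, close and rethrow approach mentioned at the end of the commit above might look roughly like this. It is only a synchronous sketch: tracked_file, process_and_close and parse_broken are hypothetical names, and the real file close() in the reactor is asynchronous, so the actual code would have to chain the close rather than call it inline.

```cpp
#include <stdexcept>
#include <utility>

// Hypothetical stand-in for a file handle that must be closed explicitly.
struct tracked_file {
    bool closed = false;
    void close() { closed = true; }
};

// Runs fn(f) and closes f on both the normal and the exceptional path.
template <typename Func>
void process_and_close(tracked_file& f, Func&& fn) {
    try {
        std::forward<Func>(fn)(f);
    } catch (...) {
        f.close();
        throw;  // propagate the original exception to the caller
    }
    f.close();
}

// Hypothetical reader that simulates hitting a broken sstable.
inline void parse_broken(tracked_file&) {
    throw std::runtime_error("malformed sstable");
}
```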
Avi Kivity
8870bf1bf8 Merge "Handling of non-full partition range queries" from Tomasz 2015-07-22 15:18:02 +03:00
Raphael S. Carvalho
63b41cc068 sstables: log compaction activity
There is some missing information in the last log printout, because
it's currently hard to generate such information.
Anyway, this patch is a good start towards providing the same log
messages as origin.

Addresses issue #12

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-07-22 09:15:18 -03:00
Raphael S. Carvalho
713953ee5e sstables: add function to return file name of data component
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-07-22 08:18:10 -03:00
Tomasz Grabiec
152582a869 sstables: Add read_range_rows() variant which takes a partition_range 2015-07-22 13:13:38 +02:00
Tomasz Grabiec
5fe7c1093f sstables: Make mutation_reader::impl unmovable and uncopyable 2015-07-22 13:13:38 +02:00
Tomasz Grabiec
7aea858108 sstables: Make data_consume_rows(0, 0) return no rows
data_consume_rows(0, 0) was returning all partitions instead of no
partitions, because -1 was passed as the count in that case, which was
then cast to uint64_t.

Special-casing it that way is problematic for code which calculates
the bounds, and when the key is not found we simply end up with 0 as
upper bound. Instead of convoluting the range lookup code to special
case for 0, let's simplify the interface so that (0, 0) returns no
rows, same as (1, 1). There is a new overload of data_consume_rows()
without bounds, which returns all data.
2015-07-22 13:10:01 +02:00
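
The failure mode in this commit comes from a signed sentinel meeting an unsigned parameter. The sketch below uses hypothetical helper names (as_count, range_length) and is not the actual data_consume_rows() code; it only shows why -1 becomes "everything" and why a plain half-open range needs no special case.

```cpp
#include <cstdint>
#include <limits>

// Converting a negative signed value to uint64_t wraps modulo 2^64, so a
// sentinel of -1 turns into the largest representable count.
inline uint64_t as_count(int64_t n) {
    return static_cast<uint64_t>(n);
}

// Hypothetical half-open interface: [start, end) has length end - start,
// so (0, 0) and (1, 1) are both empty without any special case.
inline uint64_t range_length(uint64_t start, uint64_t end) {
    return end > start ? end - start : 0;
}
```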
Nadav Har'El
4edf7fe206 clean up uses of lw_shared_ptr<file>
Recently, "file" started to use a shared_ptr internally, and is already
copy-able and reference counted, and there is no reason to use
lw_shared_ptr<file>. This patch cleans up a few remaining places where
lw_shared_ptr<file> was used.

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
2015-07-22 11:51:40 +03:00
Tomasz Grabiec
f68b771927 sstables: Use lower_bound() and upper_bound() to search the partition index
I will need those abstractions later to handle
inclusiveness/exclusiveness of both starting and ending bounds.

They're also familiar abstractions, so the code is hopefully easier to
comprehend now.
2015-07-22 10:27:48 +02:00
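
As an illustration of why these two abstractions map well onto inclusive and exclusive bounds, here is a sketch over a sorted vector of integers. The function index_range is hypothetical and is not the partition index code; it only shows how each bound kind selects std::lower_bound or std::upper_bound.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Returns the half-open index range [first, last) of the entries of a sorted
// vector that lie between start and end; each bound may be inclusive or not.
inline std::pair<std::size_t, std::size_t>
index_range(const std::vector<int>& sorted, int start, bool start_inclusive,
            int end, bool end_inclusive) {
    auto first = start_inclusive
        ? std::lower_bound(sorted.begin(), sorted.end(), start)   // first entry >= start
        : std::upper_bound(sorted.begin(), sorted.end(), start);  // first entry >  start
    auto last = end_inclusive
        ? std::upper_bound(sorted.begin(), sorted.end(), end)     // one past last entry <= end
        : std::lower_bound(sorted.begin(), sorted.end(), end);    // one past last entry <  end
    if (last < first) {
        last = first;  // bounds select nothing
    }
    return {static_cast<std::size_t>(first - sorted.begin()),
            static_cast<std::size_t>(last - sorted.begin())};
}
```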
Tomasz Grabiec
ff0308104c sstables: Add data_end_position() on summary page level 2015-07-22 10:27:48 +02:00
Tomasz Grabiec
e9a050da78 sstables: Obtain the key from entries using get_key() rather than casting to bytes_view
The entry contains not only the key, but other stuff like
position. Why would casting to bytes_view give the view on just the
key and not the whole entry? Better to be explicit.
2015-07-22 10:27:48 +02:00
Tomasz Grabiec
0b5f908a0b sstables: Make key_view comparable with partition_key_view 2015-07-22 10:27:48 +02:00
Tomasz Grabiec
73ccd51cc5 sstables: Add key_view::tri_compare() 2015-07-22 10:27:48 +02:00
Asias He
fa2aee57ac utils: Move util/serialization.hh to utils/serialization.hh
Now we will not have the ugly utils and util directories, only utils.
2015-07-21 16:12:54 +08:00
Raphael S. Carvalho
8faa202e98 sstables: add function to return candidates using size-tiered strategy
That's helpful for the purpose of testing, and leveled compaction may
also end up using size-tiered compaction strategy for selecting
candidates.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-07-20 12:27:33 -03:00
Raphael S. Carvalho
25f24c0748 sstables: fix size-tiered strategy
If the old average is equal to the new average, then we would remove
the new average entry. That's totally wrong.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-07-20 12:26:56 -03:00
Avi Kivity
51e3f0a6df Merge "Size tiered compaction strategy" from Raphael 2015-07-20 17:29:13 +03:00
Avi Kivity
fee1f68b61 Add changes missing from previous commit 2015-07-20 17:28:45 +03:00
Avi Kivity
4a95f1589c Merge seastar upstream
Adjust make_file_*_stream() callers for updated seastar API.
2015-07-20 17:02:46 +03:00
Raphael S. Carvalho
a99c92f1b6 sstable compaction: add initial support to size-tiered strategy
The size-tiered strategy basically consists of creating buckets with sstables
of nearly the same size.
Afterwards, it will find the most interesting bucket, whose size must be
between min threshold and max threshold. The bucket with the smallest average
size is the most interesting one.

Bucket hotness is also considered when finding the most interesting bucket,
but we don't support this yet.
We are also missing some code that discards an sstable based on its coldness,
i.e. when it is hardly read.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-07-20 10:08:14 -03:00
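
The bucketing and selection steps described in the commit above can be sketched as follows. This is not the code from the patch: the names make_buckets and most_interesting_bucket are hypothetical, sstables are reduced to their sizes, and the 0.5 and 1.5 factors that define "nearly the same size" are assumed defaults. Hotness is ignored, as in the commit.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

using bucket = std::vector<uint64_t>;  // sstable sizes in bytes

inline double average(const bucket& b) {
    return std::accumulate(b.begin(), b.end(), 0.0) / b.size();
}

// Puts each size into the first bucket whose current average it is close to
// (within [low * avg, high * avg]); otherwise it starts a new bucket.
inline std::vector<bucket> make_buckets(std::vector<uint64_t> sizes,
                                        double low = 0.5, double high = 1.5) {
    std::sort(sizes.begin(), sizes.end());
    std::vector<bucket> buckets;
    for (auto size : sizes) {
        bool placed = false;
        for (auto& b : buckets) {
            double avg = average(b);
            if (size >= avg * low && size <= avg * high) {
                b.push_back(size);
                placed = true;
                break;
            }
        }
        if (!placed) {
            buckets.push_back(bucket{size});
        }
    }
    return buckets;
}

// Among buckets whose element count lies in [min_threshold, max_threshold],
// returns the one with the smallest average size, or an empty bucket.
inline bucket most_interesting_bucket(const std::vector<bucket>& buckets,
                                      std::size_t min_threshold,
                                      std::size_t max_threshold) {
    const bucket* best = nullptr;
    for (const auto& b : buckets) {
        if (b.size() < min_threshold || b.size() > max_threshold) {
            continue;
        }
        if (!best || average(b) < average(*best)) {
            best = &b;
        }
    }
    return best ? *best : bucket{};
}
```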
Raphael S. Carvalho
d627ede812 sstables: add bytes_on_disk
Returns the sum of the size of all sstable components.
It will be used by size-tiered strategy.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-07-20 10:07:59 -03:00
Raphael S. Carvalho
719898d0e5 introduce automatic compaction
As the name implies, this patch introduces the concept of automatic
compaction for sstables.

Compaction task is triggered whenever a new sstable is written.
Concurrent compaction on the same column family isn't supported, so
compaction may be postponed if there is an ongoing compaction.
In addition, seastar::gate is used both to prevent a new compaction
from starting and to wait for an ongoing compaction to finish, when
the system is asked for a shutdown.

This patch also introduces an abstract class for compaction strategy,
which is really useful for supporting multiple strategies.
Currently, null and major compaction strategies are supported.
As the name implies, null compaction strategy does nothing.
Major compaction strategy is about compacting all sstables into one.
This strategy may end up being helpful when adding support to major
compaction via nodetool.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-07-16 12:00:12 +03:00
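
The abstract strategy class with its null and major implementations could be sketched along these lines. The interface below is an assumption made for illustration (sstable_info, select and the class names are hypothetical), not the one introduced by the patch, and the gate-based shutdown handling is left out.

```cpp
#include <string>
#include <vector>

// Hypothetical minimal description of an sstable.
struct sstable_info {
    std::string name;
};

// Abstract strategy: decides which sstables to compact together.
class compaction_strategy {
public:
    virtual ~compaction_strategy() = default;
    // An empty result means there is nothing to compact.
    virtual std::vector<sstable_info>
    select(const std::vector<sstable_info>& candidates) const = 0;
};

// Null strategy: never selects anything.
class null_compaction_strategy final : public compaction_strategy {
public:
    std::vector<sstable_info>
    select(const std::vector<sstable_info>&) const override {
        return {};
    }
};

// Major strategy: selects every sstable so they are compacted into one.
class major_compaction_strategy final : public compaction_strategy {
public:
    std::vector<sstable_info>
    select(const std::vector<sstable_info>& candidates) const override {
        return candidates;
    }
};
```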
Raphael S. Carvalho
f7a1a5618b sstables: add missing #include guard
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-07-16 07:25:56 +03:00
Tomasz Grabiec
c206699c9d Merge tag 'avi/logger-config/v1' from seastar-dev.git
Logger configuration from Avi.
2015-07-15 11:27:09 +02:00
Tomasz Grabiec
4e5cb05aa4 sstables: Fix heap-buffer-overflow in read_range_rows()
When passing tokens corresponding to 129th key in the sstable to
read_range_rows(), it failed with heap-buffer-overflow pointing to:

      return make_ready_future<uint64_t>(index_list[min_index_idx].position);

The scenario is as follows. We pass the lower bound token, which
corresponds to the first partition of some (not first) summary
page. That token will compare less than any entry in that page (even
less with the key we took it from, cause we want all partitions with
that token), so min_idx will point to the previous summary page
(correct). Then this code tries to locate the position in the previous
page:

  auto m = adjust_binary_search_index(this->binary_search(index_list, minimum_key(), min_token));
  auto min_index_idx = m >= 0 ? m : 0;

binary_search() will return ((-index_list.size()) - 1), because the
token is greater than anything in that page. So "m" and
"min_index_idx" will be (index_list.size() - 1) after adjusting.

Then the code tried this:

        auto candidate = key_view(bytes_view(index_list[min_index_idx]));
        auto tcandidate = dht::global_partitioner().get_token(candidate);
        if (tcandidate < min_token) {
            min_index_idx++;
        }

The last key compared less than the token also, so min_index_idx is
bumped up to index_list.size(). It then tried to use this too large
index on index_list, which caused buffer overflow.

We clearly need to return the first position of the next page in this
case, and this change does it indirectly by calling
data_end_position(), which also handles edge cases like if there is no
next summary page.

I reimplemented the logic top-down, and found that the last special
casing for tcandidate was not needed, so I removed it.
2015-07-14 19:58:17 +02:00
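
The index arithmetic in the commit above can be reproduced with a small sketch. It assumes the binary search follows the convention of returning -(insertion point) - 1 for a missing element, which matches the values quoted in the message; the helpers binary_search_index, adjust and start_position are hypothetical and operate on plain integers. The last function shows the guard that was missing: when no entry in the page qualifies, fall back to an end position instead of indexing past the end.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the index of x if present, otherwise -(insertion point) - 1.
inline int binary_search_index(const std::vector<int>& v, int x) {
    auto it = std::lower_bound(v.begin(), v.end(), x);
    int idx = static_cast<int>(it - v.begin());
    return (it != v.end() && *it == x) ? idx : -idx - 1;
}

// Maps a "not found" result to the index of the last element smaller than x.
inline int adjust(int idx) {
    return idx < 0 ? -idx - 2 : idx;
}

// Position of the first entry >= x, or end_position if the page has none.
inline uint64_t start_position(const std::vector<int>& keys,
                               const std::vector<uint64_t>& positions,
                               int x, uint64_t end_position) {
    auto it = std::lower_bound(keys.begin(), keys.end(), x);
    if (it == keys.end()) {
        return end_position;  // never index past the end of the page
    }
    return positions[static_cast<std::size_t>(it - keys.begin())];
}
```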
Tomasz Grabiec
2a491b2076 sstables: Fix bug in read_range_rows()
The method was using the same summary page for both min and max
tokens, whereas they can be different if they're distant enough from
each other.
2015-07-14 19:58:17 +02:00
Avi Kivity
99a15de9e5 logger: de-thread_local-ize logger
The logger class constructor registers itself with the logger registry,
in order to enable dynamically setting log levels.  However, since
thread_local variables may be (and are) initialized at the time of first
use, when the program starts up no loggers are registered.

Fix by making loggers global, not thread_local.  This requires that the
registry use locking to prevent registration happening on different threads
from corrupting the registry.

Note that technically global variables can also be initialized at the
point of first use, and there is no portable way for classes to self-register.
However this is the best we can do.
2015-07-14 17:18:11 +03:00
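
A self-registering logger with a locked registry might be sketched as below. The class layout and the names logger_registry, registry and example_log are assumptions for illustration and do not reproduce the actual logger code. The registry is a function-local static so that it exists before the first logger registers, and a mutex protects it against registration from several threads.

```cpp
#include <map>
#include <mutex>
#include <string>
#include <utility>

class logger;

// Registry guarded by a mutex, so loggers constructed on different threads
// cannot corrupt the map.
class logger_registry {
    std::mutex _mutex;
    std::map<std::string, logger*> _loggers;
public:
    void register_logger(const std::string& name, logger* l) {
        std::lock_guard<std::mutex> guard(_mutex);
        _loggers[name] = l;
    }
    void unregister_logger(const std::string& name) {
        std::lock_guard<std::mutex> guard(_mutex);
        _loggers.erase(name);
    }
    bool contains(const std::string& name) {
        std::lock_guard<std::mutex> guard(_mutex);
        return _loggers.count(name) != 0;
    }
};

// Constructed on first use, so it is ready before any logger registers.
inline logger_registry& registry() {
    static logger_registry instance;
    return instance;
}

class logger {
    std::string _name;
public:
    explicit logger(std::string name) : _name(std::move(name)) {
        registry().register_logger(_name, this);
    }
    ~logger() { registry().unregister_logger(_name); }
    logger(const logger&) = delete;
    logger& operator=(const logger&) = delete;
};

// A namespace-scope logger (not thread_local): one instance for all threads.
static logger example_log("example");
```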
Raphael S. Carvalho
d3a83aa549 sstables: finish streaming_histogram::update
This method was incomplete, and thus would fail if map size were
greater than max_bin_size, bringing the application down.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Reviewed-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-07-12 11:06:03 +03:00
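
The case this commit completes, a histogram that has grown past its bin limit, is commonly handled by merging the two closest bins. The sketch below shows that general technique under stated assumptions: the class streaming_histogram_sketch is hypothetical, bin points are doubles, and the merge uses a count-weighted average. It is not the code from the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <limits>
#include <map>

// Hypothetical simplified histogram: bin point -> number of samples.
class streaming_histogram_sketch {
    std::map<double, uint64_t> _bins;
    std::size_t _max_bin_size;
public:
    explicit streaming_histogram_sketch(std::size_t max_bin_size)
        : _max_bin_size(max_bin_size) {
        assert(max_bin_size >= 1);
    }

    void update(double point, uint64_t count = 1) {
        _bins[point] += count;
        if (_bins.size() <= _max_bin_size) {
            return;
        }
        // Too many bins: find the adjacent pair with the smallest gap.
        auto best = _bins.begin();
        double smallest = std::numeric_limits<double>::infinity();
        for (auto it = _bins.begin(), nx = std::next(it); nx != _bins.end(); ++it, ++nx) {
            double gap = nx->first - it->first;
            if (gap < smallest) {
                smallest = gap;
                best = it;
            }
        }
        // Replace the pair with a single bin at their weighted average.
        auto next = std::next(best);
        uint64_t merged_count = best->second + next->second;
        double merged_point =
            (best->first * best->second + next->first * next->second) / merged_count;
        _bins.erase(best);
        _bins.erase(next);
        _bins[merged_point] += merged_count;
    }

    const std::map<double, uint64_t>& bins() const { return _bins; }
};
```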