Commit Graph

271 Commits

Author SHA1 Message Date
Avi Kivity
2e745bebad Merge "use compaction strategy options" from Raphael 2015-07-27 17:06:43 +03:00
Raphael S. Carvalho
15bbb71b7b db: handle compaction exception outside keep_doing()
Otherwise, we would needlessly handle it twice.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-07-24 19:12:34 -03:00
Raphael S. Carvalho
5f89f80ae5 Revert "db: dont rethrow exceptions for termination of compaction fiber"
Actually we should rethrow exceptions because they are needed for
keep_doing() to finish. Otherwise, the future _compaction_done
will never be resolved.

This reverts commit 89698b0d1c.
2015-07-24 19:07:47 -03:00
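The revert depends on how a keep_doing() loop ends: it keeps running its body until the body fails, and only that failure completes the loop. The sketch below is a synchronous model in standard C++, not Seastar's future-based keep_doing(). The names keep_doing_model and compaction_stopped are hypothetical. It shows that swallowing the termination exception inside the body leaves the loop running, which corresponds to _compaction_done never being resolved.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <stdexcept>

// Hypothetical stand-in for broken_semaphore / gate_closed_exception.
struct compaction_stopped : std::runtime_error {
    compaction_stopped() : std::runtime_error("compaction stopped") {}
};

// Synchronous model of keep_doing(): call `body` until it throws and
// return the exception that ended the loop. `max_iterations` only keeps
// the model from spinning forever when the body never throws.
std::exception_ptr keep_doing_model(const std::function<void()>& body,
                                    int max_iterations) {
    for (int i = 0; i < max_iterations; ++i) {
        try {
            body();
        } catch (...) {
            // The loop completes only through a failure of its body.
            return std::current_exception();
        }
    }
    // Never terminated: in the real code, _compaction_done stays unresolved.
    return nullptr;
}
```

The iteration cap exists only so the model terminates; the real loop has no such bound.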
Raphael S. Carvalho
634d00511b compaction: use compaction options in strategy
Support for compaction strategy options was recently added.
Previously, the compaction strategy used default values for its
options, but now we can use the options defined in the schema.
Currently, we only support the size-tiered strategy, so let's start
with it.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-07-23 15:26:47 -03:00
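Here is a minimal sketch of reading size-tiered options from the schema. It assumes the options arrive as a string-to-string map. The option names and defaults (min_sstable_size, bucket_low, bucket_high) are borrowed from Cassandra's SizeTieredCompactionStrategy and are assumptions here, as are the names size_tiered_options and parse_stcs_options.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical option holder; defaults follow Cassandra's size-tiered strategy.
struct size_tiered_options {
    long min_sstable_size = 50L * 1024 * 1024;  // bytes
    double bucket_low = 0.5;
    double bucket_high = 1.5;
};

// Read options from a schema-style string map, keeping the default
// value for any key that is absent.
size_tiered_options parse_stcs_options(const std::map<std::string, std::string>& opts) {
    size_tiered_options o;
    auto it = opts.find("min_sstable_size");
    if (it != opts.end()) {
        o.min_sstable_size = std::stol(it->second);
    }
    it = opts.find("bucket_low");
    if (it != opts.end()) {
        o.bucket_low = std::stod(it->second);
    }
    it = opts.find("bucket_high");
    if (it != opts.end()) {
        o.bucket_high = std::stod(it->second);
    }
    return o;
}
```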
Glauber Costa
d1496944d9 sstables: handle compaction strategy
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-07-23 00:02:11 -04:00
Avi Kivity
8870bf1bf8 Merge "Handling of non-full partition range queries" from Tomasz 2015-07-22 15:18:02 +03:00
Tomasz Grabiec
f9da612581 memtable: Implement range queries 2015-07-22 13:14:33 +02:00
Tomasz Grabiec
152582a869 sstables: Add read_range_rows() variant which takes a partition_range 2015-07-22 13:13:38 +02:00
Pekka Enberg
791031fbc7 database: Extract update_schema_version_and_announce() function
It's needed in storage proxy.

Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-07-22 11:57:00 +03:00
Tomasz Grabiec
0b0ea04958 range: Remove start_value() and end_value()
It's easy to miss that they may be undefined. start() and end(), which
return optional<bound> const&, make it clear.
2015-07-22 10:27:47 +02:00
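The sketch below illustrates the point of this change with a hypothetical, simplified range<T>, not the project's actual class. Because start() and end() return const std::optional<bound>&, a caller has to handle an unbounded side before reading its value.

```cpp
#include <cassert>
#include <optional>
#include <utility>

// Hypothetical simplified range: a missing bound means "unbounded".
template <typename T>
class range {
public:
    struct bound {
        T value;
        bool inclusive;
    };

    range(std::optional<bound> start, std::optional<bound> end)
        : _start(std::move(start)), _end(std::move(end)) {}

    // The optional return type forces callers to check for an
    // undefined bound instead of reading a value that may not exist.
    const std::optional<bound>& start() const { return _start; }
    const std::optional<bound>& end() const { return _end; }

    bool contains(const T& v) const {
        if (_start) {
            const bound& b = *_start;
            if (b.inclusive ? v < b.value : !(b.value < v)) {
                return false;
            }
        }
        if (_end) {
            const bound& b = *_end;
            if (b.inclusive ? b.value < v : !(v < b.value)) {
                return false;
            }
        }
        return true;
    }

private:
    std::optional<bound> _start;
    std::optional<bound> _end;
};
```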
Tomasz Grabiec
4a18693a23 db: Remove dead code 2015-07-22 10:27:47 +02:00
Raphael S. Carvalho
89698b0d1c db: dont rethrow exceptions for termination of compaction fiber
broken_semaphore and seastar::gate_closed_exception exceptions are
used for regular termination of compaction fiber, which otherwise
would live forever. We shouldn't re-throw these exceptions, but
instead only print a log message.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-07-22 11:23:58 +03:00
Avi Kivity
8ba5d19db5 db: avoid ubsan false-positive in query_state move constructor
The value is moved before initialization due to a do_with().  It's harmless,
but better to silence the warning.
2015-07-21 12:19:54 +03:00
Raphael S. Carvalho
6ae3ffa319 database: add get_sstables to column_family
Returns all sstables added to a given column_family.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-07-20 10:08:09 -03:00
Raphael S. Carvalho
ebbc7aa43e database: add compact_sstables to column_family
compact_all_sstables selects all available sstables
for compaction and runs the compaction code on them.
This compaction code was moved to a more generic function called
compact_sstables, which will compact a list of given sstables.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-07-20 10:08:02 -03:00
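The split between a selection policy and a generic compaction step can be sketched as follows. This is a hypothetical in-memory model. An sstable is a std::map from key to a timestamped cell, and the merge keeps the newest cell per key. Real compaction streams sorted partitions from disk and reconciles full mutations.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical in-memory stand-in for an sstable: key -> (timestamp, value).
struct cell {
    long timestamp;
    std::string value;
};
using sstable = std::map<std::string, cell>;

// Generic step: merge the given sstables into one, keeping the newest
// cell for each key (a simplified reconciliation rule).
sstable compact_sstables(const std::vector<sstable>& inputs) {
    sstable out;
    for (const auto& sst : inputs) {
        for (const auto& kv : sst) {
            auto it = out.find(kv.first);
            if (it == out.end() || it->second.timestamp < kv.second.timestamp) {
                out[kv.first] = kv.second;
            }
        }
    }
    return out;
}

// Policy step: select every sstable, then delegate to the generic merge.
sstable compact_all_sstables(const std::vector<sstable>& all) {
    return compact_sstables(all);
}
```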
Avi Kivity
6ade74b7c3 db: recover sstable generation counter on startup
Don't attempt to overwrite an existing sstable.
2015-07-20 12:00:34 +02:00
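One way to recover the counter is to scan the existing files, take the highest generation, and start from the next one. This sketch assumes sstable file names follow the Cassandra 2.x pattern <keyspace>-<table>-ka-<generation>-<component>.db. The helpers generation_of and recover_next_generation are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical parser: the generation is the fourth '-'-separated field,
// e.g. 7 in "ks-cf-ka-7-Data.db". Returns -1 for unrecognized names.
std::int64_t generation_of(const std::string& filename) {
    std::size_t pos = 0;
    for (int field = 0; field < 3; ++field) {
        pos = filename.find('-', pos);
        if (pos == std::string::npos) {
            return -1;
        }
        ++pos;
    }
    std::size_t end = filename.find('-', pos);
    if (end == std::string::npos) {
        return -1;
    }
    try {
        return std::stoll(filename.substr(pos, end - pos));
    } catch (const std::exception&) {
        return -1;  // not an sstable component name
    }
}

// Next generation to use: one past the highest generation on disk,
// so a new sstable never overwrites an existing one.
std::int64_t recover_next_generation(const std::vector<std::string>& files) {
    std::int64_t max_gen = 0;
    for (const auto& f : files) {
        max_gen = std::max(max_gen, generation_of(f));
    }
    return max_gen + 1;
}
```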
Glauber Costa
4250b7dd64 database: do not use commitlog constructor if there is no commitlog
Tomek pointed out that we shouldn't be passing a reference to commitlog every
time we use the add_column_family interface, because that will at times pass a
reference to a null object.

Check for that, and pass no_commitlog if there is none.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-07-16 20:04:29 +03:00
Pekka Enberg
81cddec777 database: Add versioning support
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-07-16 14:53:30 +03:00
Raphael S. Carvalho
719898d0e5 introduce automatic compaction
As the name implies, this patch introduces the concept of automatic
compaction for sstables.

A compaction task is triggered whenever a new sstable is written.
Concurrent compaction on the same column family isn't supported, so
compaction may be postponed if there is an ongoing compaction.
In addition, seastar::gate is used both to prevent a new compaction
from starting and to wait for an ongoing compaction to finish, when
the system is asked for a shutdown.

This patch also introduces an abstract class for compaction strategy,
which is really useful for supporting multiple strategies.
Currently, null and major compaction strategies are supported.
As the name implies, null compaction strategy does nothing.
Major compaction strategy is about compacting all sstables into one.
This strategy may end up being helpful when adding support for major
compaction via nodetool.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-07-16 12:00:12 +03:00
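Below is a minimal model of the strategy hierarchy described above. It represents sstables only by their sizes. The names compaction_strategy::select and on_sstable_written are hypothetical. The sketch leaves out the gate, the postponement of concurrent compactions, and the asynchronous execution of the real code.

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <numeric>
#include <vector>

using sstable_list = std::vector<long>;  // each sstable modeled by its size

class compaction_strategy {
public:
    virtual ~compaction_strategy() = default;
    // Returns the sstables to compact together; empty means "nothing to do".
    virtual sstable_list select(const sstable_list& all) const = 0;
};

// Null strategy: never compacts.
class null_compaction_strategy : public compaction_strategy {
public:
    sstable_list select(const sstable_list&) const override { return {}; }
};

// Major strategy: compacts all sstables into one.
class major_compaction_strategy : public compaction_strategy {
public:
    sstable_list select(const sstable_list& all) const override {
        if (all.size() < 2) {
            return {};
        }
        return all;
    }
};

class column_family {
public:
    explicit column_family(std::unique_ptr<compaction_strategy> s)
        : _strategy(std::move(s)) {}

    // Compaction is triggered whenever a new sstable is written.
    void on_sstable_written(long size) {
        _sstables.push_back(size);
        maybe_compact();
    }

    const sstable_list& sstables() const { return _sstables; }

private:
    void maybe_compact() {
        sstable_list selected = _strategy->select(_sstables);
        if (selected.empty()) {
            return;
        }
        // Replace the selected inputs with a single output sstable.
        long output = std::accumulate(selected.begin(), selected.end(), 0L);
        for (long s : selected) {
            auto it = std::find(_sstables.begin(), _sstables.end(), s);
            if (it != _sstables.end()) {
                _sstables.erase(it);
            }
        }
        _sstables.push_back(output);
    }

    std::unique_ptr<compaction_strategy> _strategy;
    sstable_list _sstables;
};
```

Keeping the policy behind a virtual interface lets the column family trigger compaction the same way regardless of which strategy the schema selects.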
Glauber Costa
9c464aff9b database: clean up various APIs
In many of our column_family APIs, we need to pass a pointer to the database.
The only reason we do that is so we can properly handle the commit log entries
after we seal the current memtables into sstables.

Now that we store a pointer to the commit log in the CF itself at the time it
is created, we no longer have to do it. As a result, the APIs are a lot
cleaner, with no gratuitous parameters.

My motivation for this was the flush method, but as a result, apply() also gets
cleaner.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-07-15 10:24:20 -04:00
Glauber Costa
ad46daa6aa column_family: add the commitlog as a parameter
When we create a column family, we can pass the commitlog (or the lack
thereof) as an extra parameter. Because the commitlog is optional to begin
with (it won't exist if we don't call init_commitlog), this parameter can be
empty, meaning no commit log.

The creation of a column family should always be done through
add_column_family. If that is the case, we have the database's commitlog
right there and can get the pointer through the db. Only tests do not create
the column family this way, and for them, it is fine.

We want to do that, because some column family operations will use the commit log.
Right now, they are forcing us to add parameters to APIs that would be much cleaner
without them. So while separation is good, this level of coupling is a net win as it
allows us to clean up some visible APIs.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-07-15 10:24:20 -04:00
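The effect of storing an optional commitlog pointer in the column family can be sketched as a hypothetical, simplified model. commitlog::discard_up_to, the position counter, and the string-based memtable are illustrative names, not the project's API. A null pointer plays the role of no_commitlog, so tests can create a column family without one, and sealing a memtable no longer needs a database parameter.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical commitlog: only records how far it may discard.
struct commitlog {
    std::size_t discarded_up_to = 0;
    void discard_up_to(std::size_t pos) { discarded_up_to = pos; }
};

class column_family {
public:
    // nullptr plays the role of no_commitlog (e.g. in tests).
    explicit column_family(commitlog* cl = nullptr) : _commitlog(cl) {}

    void apply(const std::string& mutation) {
        _memtable.push_back(mutation);
        ++_position;
    }

    // No database parameter needed: the CF knows its own commitlog.
    void seal_active_memtable() {
        _sstables.push_back(std::move(_memtable));
        _memtable.clear();  // make the moved-from state explicitly empty
        if (_commitlog) {
            // Entries up to here are now persisted in an sstable.
            _commitlog->discard_up_to(_position);
        }
    }

    std::size_t sstable_count() const { return _sstables.size(); }

private:
    commitlog* _commitlog;
    std::vector<std::string> _memtable;
    std::vector<std::vector<std::string>> _sstables;
    std::size_t _position = 0;
};
```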
Avi Kivity
99a15de9e5 logger: de-thread_local-ize logger
The logger class constructor registers itself with the logger registry,
in order to enable dynamically setting log levels.  However, since
thread_local variables may be (and are) initialized at the time of first
use, when the program starts up no loggers are registered.

Fix by making loggers global, not thread_local.  This requires that the
registry use locking to prevent registration happening on different threads
from corrupting the registry.

Note that technically global variables can also be initialized at the
point of first use, and there is no portable way for classes to self-register.
However, this is the best we can do.
2015-07-14 17:18:11 +03:00
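Here is a minimal sketch of a self-registering global logger whose registry is protected by a mutex, as the commit describes. The names logger_registry, registry(), and log_level are hypothetical. The registry stores a pointer to each logger's level so that the level can be changed at runtime.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <string>
#include <utility>

enum class log_level { error, warn, info, debug };

// Hypothetical registry: the mutex keeps registrations from different
// threads from corrupting the map.
class logger_registry {
public:
    void register_logger(const std::string& name, log_level* level) {
        std::lock_guard<std::mutex> lock(_mutex);
        _loggers[name] = level;
    }
    bool set_level(const std::string& name, log_level l) {
        std::lock_guard<std::mutex> lock(_mutex);
        auto it = _loggers.find(name);
        if (it == _loggers.end()) {
            return false;
        }
        *it->second = l;
        return true;
    }

private:
    std::mutex _mutex;
    std::map<std::string, log_level*> _loggers;
};

logger_registry& registry() {
    static logger_registry r;  // constructed on first use
    return r;
}

class logger {
public:
    explicit logger(std::string name) : _name(std::move(name)) {
        // Self-registration from the constructor.
        registry().register_logger(_name, &_level);
    }
    logger(const logger&) = delete;
    logger& operator=(const logger&) = delete;
    log_level level() const { return _level; }

private:
    std::string _name;
    log_level _level = log_level::info;
};

// Global, not thread_local: in practice it is constructed (and thus
// registered) before main() runs.
logger db_logger("database");
```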
Tomasz Grabiec
9bea6aa0a3 db: Introduce mutation query interface
Mutation query differs from data query in that it returns the information
needed to reconcile a data slice with the slices returned by other data
sources.

There is a generic mutation_query() algorithm introduced, which can
work with any mutation_source.

database::query_mutations() is a shard-local interface for mutation
queries.

The reconcilable_result is introduced as a medium for mutation query
results. It piggybacks on frozen_mutation as a medium for
reconcilable data.
2015-07-12 12:51:38 +02:00
Tomasz Grabiec
9724b84bb3 db: Fix query of partitions with no live clustered rows
When a partition has no live regular rows, but has some live data in the
static row, then it should appear in the results, even though we
didn't select any static column.

To reproduce:

  create table cf (k blob, c blob, v blob, s1 blob static, primary key (k, c));
  update cf set s1 = 0x01 where k = 0x01;
  update cf set s1 = 0x02 where k = 0x02;
  select k from cf;

The "select" statement should return 2 rows, but was returning 0.

The following query worked fine, because static columns were included:

  select * from cf;

The data query should contain only live data, so we shouldn't write a
partition entry if it's supposed to be absent from the results. We
can't tell that, though, until we've processed all the data. To solve
this problem, query result writer is using an optimistic approach,
where the partition header will be retracted from the buffer
(cheaply), if it turns out there's no live data in it.
2015-07-09 19:55:00 +02:00
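The optimistic approach can be sketched with a string buffer. The partition header is written eagerly and its offset is remembered. If the partition ends without live data, the buffer is truncated back to that offset. The result_writer class and its textual encoding are hypothetical; the real writer serializes a binary format.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical result writer with cheap header retraction.
class result_writer {
public:
    void start_partition(const std::string& key) {
        _partition_start = _buf.size();  // remember where the header begins
        _buf += "P:" + key + ";";
        _has_live_data = false;
    }
    void add_live_row(const std::string& row) {
        _buf += "R:" + row + ";";
        _has_live_data = true;
    }
    // Live static cells make the partition visible even when no static
    // column was selected.
    void mark_live_static_row() { _has_live_data = true; }
    void end_partition() {
        if (!_has_live_data) {
            _buf.resize(_partition_start);  // retract the header
        }
    }
    const std::string& buffer() const { return _buf; }

private:
    std::string _buf;
    std::size_t _partition_start = 0;
    bool _has_live_data = false;
};
```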
Tomasz Grabiec
09ed972068 mutation_partition: Remove redundant slice parameter from query()
The slice used by partition_writer must match the one used by query()
anyway.
2015-07-09 19:47:32 +02:00
Tomasz Grabiec
8a18d2b699 Extract memtable implementation to memtable.cc 2015-07-09 19:46:29 +02:00
Avi Kivity
5d9222d935 Merge "Filter sstable data not belonging to current shard" from Tomasz
"We don't want multiple shards to respond with the same data. Higher level code
assumes that shard data is non-overlapping. It's cheaper to drop duplicates as
soon as possible. Memtable reader for example will never have overlapping
data, so cache hitting queries will never need to pay for this. Compaction
process may also rely on this."
2015-07-07 18:12:35 +03:00
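Here is a sketch of filtering partitions by owning shard. It assumes each partition is identified by an integer token. The modulo mapping in shard_of below is a stand-in chosen for illustration; the real dht::shard_of() derives the shard from the token in its own way.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical shard mapping used only for illustration.
unsigned shard_of(std::int64_t token, unsigned shard_count) {
    return static_cast<unsigned>(static_cast<std::uint64_t>(token) % shard_count);
}

// Drop partitions owned by other shards as early as possible, so each
// shard returns a non-overlapping subset of the sstable's data.
std::vector<std::int64_t> filter_local(const std::vector<std::int64_t>& tokens,
                                       unsigned this_shard, unsigned shard_count) {
    std::vector<std::int64_t> out;
    for (auto t : tokens) {
        if (shard_of(t, shard_count) == this_shard) {
            out.push_back(t);
        }
    }
    return out;
}
```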
Tomasz Grabiec
66dfeb33d7 db: Filter out sstable partitions not belonging to current shard 2015-07-07 16:56:25 +02:00
Tomasz Grabiec
d035c499b8 db: Move database::shard_of() to dht::shard_of() 2015-07-07 16:56:25 +02:00
Pekka Enberg
a358990855 db/legacy_schema_tables: Pass storage_proxy by reference
We always operate on the local storage proxy so pass it by reference.
This simplifies DEFINITIONS_UPDATE message handler where all we have is
a "this" pointer to the local storage proxy.

Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-07-07 16:27:58 +03:00
Glauber Costa
5044e5191d gate: use with_gate idiom
Aside from guaranteeing that we will always leave the gate correctly, it will also
allow us to change the implementation of the enter / leave pair without
disrupting existing code.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-07-05 19:18:34 +03:00
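This is a synchronous sketch of the with_gate idiom, assuming a simple counting gate. Seastar's real gate and with_gate() work with futures. The version below only shows the guarantee the commit relies on: leave() runs on every exit path, including exceptions, so callers never pair enter() and leave() by hand.

```cpp
#include <cassert>
#include <stdexcept>
#include <utility>

// Hypothetical synchronous gate: counts in-flight operations and
// refuses new ones once closed.
class gate {
public:
    void enter() {
        if (_closed) {
            throw std::runtime_error("gate closed");
        }
        ++_count;
    }
    void leave() { --_count; }
    void close() { _closed = true; }
    int count() const { return _count; }

private:
    int _count = 0;
    bool _closed = false;
};

// Enter, run `f`, and leave on every exit path via a scope guard.
template <typename Func>
auto with_gate(gate& g, Func&& f) -> decltype(f()) {
    g.enter();
    struct leaver {
        gate& g;
        ~leaver() { g.leave(); }
    } guard{g};
    return std::forward<Func>(f)();
}
```

Because the enter/leave pairing lives in one function, its implementation can change without touching callers.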
Calle Wilund
abe3851376 Move memtable in_flight_seal.leave() to finally (symmetry) 2015-07-05 16:04:40 +03:00
Paweł Dziepak
183b6fc6d9 db: do not return already expired cells in queries
Signed-off-by: Paweł Dziepak <pdziepak@cloudius-systems.com>
2015-07-02 17:25:41 +02:00
Tomasz Grabiec
3fc951e807 mutation_partition: Use default value for row_limit in query() 2015-07-02 13:25:46 +02:00
Nadav Har'El
c892228018 compaction: remove compacted sstables
After compaction, remove the source sstables. This cannot be done
immediately, as ongoing reads might be using them, so we mark the sstable
as "to be deleted", and when the last reference to this sstable is dropped
and the object is destroyed, we see this flag and delete the on-disk files.

This patch doesn't change the low-level compact_sstables() (which doesn't
mark its input sstables for deletion), but rather the higher-level example
"strategy" column_family::compact_all_sstables(). I did it this way because
future strategies might want to mark the input sstables for deletion only
after performing other steps, such as making sure they don't need to abort
the compaction and return to the old files. If we
decide this isn't needed, we can easily move the mark_for_deletion() call
to compact_sstables().

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
2015-06-30 15:00:39 +03:00
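The deferred deletion can be sketched with reference counting. In this hypothetical model, std::shared_ptr stands in for the shared pointer that readers and compaction hold, and a global std::set stands in for the files on disk. The files of a marked sstable are removed only when the last reference to it goes away.

```cpp
#include <cassert>
#include <memory>
#include <set>
#include <string>
#include <utility>

// Hypothetical stand-in for the files on disk.
std::set<std::string> disk_files;

class sstable {
public:
    explicit sstable(std::string name) : _name(std::move(name)) {
        disk_files.insert(_name);
    }
    sstable(const sstable&) = delete;
    sstable& operator=(const sstable&) = delete;

    // Compaction only marks the sstable; reads may still be using it.
    void mark_for_deletion() { _marked = true; }

    // Runs when the last reference is dropped.
    ~sstable() {
        if (_marked) {
            disk_files.erase(_name);  // delete the on-disk files
        }
    }

private:
    std::string _name;
    bool _marked = false;
};
```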
Avi Kivity
e85197a806 db: fix std::terminate() called during failed find_schema()
One of the find_schema variants calls a find_uuid() that throws out_of_range,
without converting it to no_such_column_family first.  This results in
std::terminate() being called due to exception specifications.

Fix by converting the exception.
2015-06-29 13:32:28 +03:00
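Here is a sketch of the conversion, with a std::map standing in for the real lookups. The names uuids, find_uuid, and this no_such_column_family are hypothetical stand-ins. If out_of_range escapes a function whose exception specification does not permit it, std::terminate() ends up being called. The sketch shows only the conversion that prevents this.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical exception mirroring no_such_column_family.
struct no_such_column_family : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Hypothetical stand-in for the name -> schema lookup tables.
std::map<std::string, int> uuids{{"ks.cf", 1}};

// May throw std::out_of_range (from map::at).
int find_uuid(const std::string& name) { return uuids.at(name); }

// Convert the low-level exception into the one callers expect.
int find_schema(const std::string& name) {
    try {
        return find_uuid(name);
    } catch (const std::out_of_range&) {
        throw no_such_column_family("no such column family: " + name);
    }
}
```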
Glauber Costa
e19f7e93a4 database: column families loading
Same as for keyspaces, and we will reuse most of the code. CFs that are
found on disk upon bootstrap are now recreated in memory.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-06-25 09:50:55 -04:00
Glauber Costa
b3a4eb83d3 database: keyspaces loading
This patch bootstraps the keyspaces found in system sstables and makes
our in-memory structures reflect them. It tries to reuse as much code
as we can from db::legacy_system_tables, but keeping everything local
and without applying any mutations to the database - since the latter
would only unnecessarily complicate the write path.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-06-25 09:50:55 -04:00
Glauber Costa
c0ad0dc1d3 database: pass storage proxy to init_data_directory
In order for us to call some functions from db::legacy_schema_tables, we need
a working storage proxy. We will use those functions in order to leverage the
work done in keyspace / table creation from mutations.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-06-25 09:50:55 -04:00
Nadav Har'El
3f5114e415 compaction: compact_all_sstables demo function
This is an example of how to use the low-level compact_sstables() function
to compact all the sstables of one column family into one. It is not a
full-fledged "compaction strategy" but the real ones can be based on this
example.

Among the things that this code doesn't do yet is to delete the old
sstables. In the future, this should happen automatically in the sstable
destructor when all the references to the sstable get deleted.

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
2015-06-25 11:05:35 +03:00
Glauber Costa
4d07c952cc remove global create_keyspace
There is only one user in tree, and as Tomek pointed out, it is buggy. The
reason is that ksm is a shard-local structure, and it is currently used
indiscriminately by all shards which can easily lead to problems.

We could fix it with some tricks, but it is way better and safer to make sure
the callers are doing it right instead. There is only one caller, so let's fix
that.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-06-24 12:44:05 -04:00
Glauber Costa
f0f04892d3 database: no longer replicate create_keyspace code
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-06-24 12:44:05 -04:00
Avi Kivity
3d22623a6b Merge "Flush schema changes to disk" from Glauber
"This is the current patchset to flush and persist schema changes to disk.
It is not perfect, in the sense that older changes still in flight won't be
waited for. But as we discussed - at this moment we'll just note that, and
leave the fix for later"
2015-06-24 17:08:33 +03:00
Glauber Costa
fd6ec4a7ca database: fix sstables exception
This exception handling code is clearly bogus. That's old code, and it is not
the proper way to propagate it.

Fix it to use then_wrapped. Also include the filename in the message, so we have
a better clue about what happened.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-06-24 06:16:54 +03:00
Glauber Costa
a6ef9815e9 database: futurize seal_active_memtables
By doing this, it is possible to synchronously wait for the seal to complete by
waiting on this future. This is useful in situations where we want to
synchronously flush data to disk.

Existing callers are not patched, so their current behavior (asynchronously
initiating a write) is preserved.

TODO: A better interface would guarantee that all writes before this one are
also complete

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-06-23 15:23:37 -04:00
Tomasz Grabiec
d4e0e5957b db: Integrate cache with the read path 2015-06-23 13:49:25 +02:00
Tomasz Grabiec
b9288d9fa7 db: Make column_family managed by lw_shared_ptr<>
It will be share-owned by readers.
2015-06-23 13:49:24 +02:00
Tomasz Grabiec
83e7a21dfb mutation: Add apply() helper which works on mutation_opt 2015-06-23 13:49:23 +02:00
Vlad Zolotarov
3520d4de10 locator: introduce a global distributed<snitch_ptr> i_endpoint_snitch::snitch_instance()
Snitch class semantics are defined to be per-node. To make it so, we
introduce here a static member in the i_endpoint_snitch class that
holds the pointer to the relevant snitch class instance.

Since the snitch contents are not always pure const, it has to be per
shard, therefore we'll make it a "distributed". All the I/O is going
to take place on a single shard, and if there are changes, they are going
to be propagated to the rest of the shards.

The application is responsible for initializing this distributed<snitch_ptr>
before it's used for the first time.

This patch effectively reverts most of the "locator: futurize
snitch creation" a2594015f9 patch - the part that modifies the
code that was creating the snitch instance. Since the snitch is
created explicitly by the application, and all the rest of the code
simply assumes that the above global is initialized, we won't need
all those changes any more, and the code will return to being as nice and simple
as it was before the patch above.

So, to summarize, this patch does the following:
   - Reverts the changes introduced by a2594015f9 related to the fact that
     every time a replication strategy was created, a snitch had to be created
     and stored in the strategy object. More specifically, methods like
     keyspace::create_replication_strategy() no longer return a future<>,
     and this allows the code that calls them to be simplified significantly.
   - Introduce the global distributed<snitch_ptr> object:
      - It belongs to the i_endpoint_snitch class.
      - Added a corresponding interface to access both the global and the
        shard-local instances.
      - locator::abstract_replication_strategy::create_replication_strategy() does
        not accept snitch_ptr&& - it'll get and pass the corresponding shard-local
        instance of the snitch to the replication strategy's constructor by itself.
      - Adjusted the existing snitch infrastructure to the new semantics:
         - Modified the create_snitch() to create and start all per-shard snitch
           instances and update the global variable.
         - Introduced a static i_endpoint_snitch::stop_snitch() function that properly
           stops the global distributed snitch.
         - Added the code to the gossiping_property_file_snitch that distributes the
           changed data to all per-shard snitch objects.
         - Made all existing snitch classes properly maintain their state in order
           to be able to shut down cleanly.
         - Patched both urchin and cql_query_test to initialize a snitch instance before
           all other services.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>

New in v6:
   - Rebased to the current master.
   - Extended the commit message a little (the summary).

New in v5:
   - database::create_keyspace(): added a missing _keyspaces.emplace()

New in v4:
   - Kept database::create_keyspace() returning future<> at Glauber's request
     and added a comment to this method noting that it needs to be changed when
     Glauber adds his bits that require this interface.
2015-06-22 23:18:31 +03:00
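Below is a minimal model of the per-shard snitch layout, treating shards as plain indices in a single process. The distributed_snitch class and its methods are hypothetical and do not reproduce Seastar's distributed<> API. They only show one instance per shard, with configuration changes made on shard 0 and copied to the others.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical per-shard snitch state.
struct snitch {
    std::string datacenter;
    std::string rack;
};

class distributed_snitch {
public:
    explicit distributed_snitch(unsigned shards) : _instances(shards) {
        for (auto& s : _instances) {
            s = std::make_unique<snitch>();
        }
    }

    // Shard-local instance.
    snitch& local(unsigned shard) { return *_instances.at(shard); }

    // Shard 0 does the I/O (e.g. reloading a property file), then
    // copies its state to every other shard.
    void propagate_from_shard0() {
        const snitch& src = *_instances.at(0);
        for (std::size_t i = 1; i < _instances.size(); ++i) {
            *_instances[i] = src;
        }
    }

private:
    std::vector<std::unique_ptr<snitch>> _instances;
};
```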
Shlomi Livne
0ce374a853 Add support for setting keyspace replication_strategy
To support initialization of system tables keyspace replication_strategy
without the need of having snitch creation.

Signed-off-by: Shlomi Livne <shlomi@cloudius-systems.com>
2015-06-22 13:19:55 +03:00