Commit Graph

184 Commits

Author SHA1 Message Date
Glauber Costa
f4a167670a database: seal active memtables when we close the database
Failing to do so can lead to data not being written to disk when
we terminate.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-06-21 09:39:31 +03:00
Glauber Costa
1f13d3e38f database: gate seal_active_memtable
We need to do that in order to close the database cleanly, flushing all pending
data before we do.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-06-21 09:39:29 +03:00
Avi Kivity
f221301d5e Merge "preparation work - system table handling" from Glauber 2015-06-18 17:49:29 +03:00
Tomasz Grabiec
51cae834e3 db: Put all sstables behind single reader
This change abstracts reading from on-disk data sources behind a single
reader which is then composed with memtable readers. This change also
abstracts all data sources behind a single reader obtained via
column_family::make_reader(). That reader is then used by algorithms
like column_family::for_all_partitions() or
column_family::query(). Having those abstractions will make it easier
to add row cache, because it will be encapsulated in a single place.
2015-06-18 16:33:33 +02:00
Tomasz Grabiec
7f1ff0401e db: Move mutation_reader definition to separate header 2015-06-18 15:47:40 +02:00
Glauber Costa
057c38b61c only populate system keyspace
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-06-18 09:22:20 -04:00
Pekka Enberg
8345874dda database: Add database::has_schema() helper
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-06-17 15:45:45 +03:00
Gleb Natapov
2d409250f2 remove ad-hoc token_metadata creation 2015-06-15 12:51:09 +03:00
Avi Kivity
446731cf88 Merge "column family API"
Column family API, from Amnon.
2015-06-15 10:50:23 +03:00
Vlad Zolotarov
e045d8465c db: use snitch name from the configuration file
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2015-06-14 15:31:58 +03:00
Gleb Natapov
b7155ad862 pass partitions_ranges separately from read_command
partitions_ranges will be split up for different destinations, so
provide it separately from read_command to avoid copying the latter
for each destination.
2015-06-11 15:18:07 +03:00
Avi Kivity
ce6cd4b67e Merge "Store keyspace strategy options to database"
From Pekka:

"This series fixes up schema management code to store keyspace strategy
options to database. The map is stored as JSON just like in Origin."
2015-06-11 14:21:53 +03:00
Pekka Enberg
d088cb8181 Fix keyspace strategy options to preserve key-value ordering
Fix keyspace strategy options to preserve key-value ordering by
switching to std::map. We need this to be able to store the map in
database as JSON because unordered maps can cause the schema merging
code to attempt a keyspace update, which we don't support, even though
the values did not change.

Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-06-11 13:02:42 +03:00
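The ordering argument above can be illustrated with a minimal sketch. It assumes a hypothetical helper, `strategy_options_json`, that is not part of the codebase, and it skips JSON escaping. Because std::map iterates in key order, two maps with equal contents always serialize to the same string, regardless of insertion order.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical helper (not the project's serializer): writes string options
// as a flat JSON object. Escaping is omitted, so keys and values must not
// contain double quotes.
std::string strategy_options_json(const std::map<std::string, std::string>& options) {
    std::string out = "{";
    bool first = true;
    for (const auto& kv : options) {
        assert(kv.first.find('"') == std::string::npos);
        assert(kv.second.find('"') == std::string::npos);
        if (!first) {
            out += ",";
        }
        first = false;
        out += "\"" + kv.first + "\":\"" + kv.second + "\"";
    }
    out += "}";
    return out;
}
```

With an unordered map the iteration order is unspecified, so two equal maps could produce different strings and look like a changed keyspace to code that compares the serialized form.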
Pekka Enberg
b7a23ddadd database: Memtable flush batching
Currently, we flush out memtables very aggressively, which results in
lots of small sstable writes. The proper fix here is to do accounting on
the memtable size but before that happens, bump up the threshold to
another magic number which gives better batching:

  $ ./build/release/seastar --smp 1 --data-file-directories data --commitlog-directory commitlog/

  $ tools/bin/cassandra-stress write -mode cql3 native prepared -rate threads=32

Before:

  Results:
  op rate                   : 37280
  partition rate            : 37280
  row rate                  : 37280
  latency mean              : 0.8
  latency median            : 0.6
  latency 95th percentile   : 1.1
  latency 99th percentile   : 7.6
  latency 99.9th percentile : 11.9
  latency max               : 50.5
  Total operation time      : 00:00:30
  END

After:

  Results:
  op rate                   : 46721
  partition rate            : 46721
  row rate                  : 46721
  latency mean              : 0.7
  latency median            : 0.5
  latency 95th percentile   : 0.9
  latency 99th percentile   : 1.3
  latency 99.9th percentile : 5.8
  latency max               : 96.3
  Total operation time      : 00:00:39
  END

Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-06-11 10:24:35 +03:00
Amnon Heiman
b9e3a03483 Expose the column family info in the database
The API needs the column family information in the database object.
This adds a function to the database to expose the column family
information.
2015-06-11 09:50:52 +03:00
Calle Wilund
8b9a63a3c6 Database/commitlog: guard against replay position reordering
Commit log guarantees that once an RP is assigned to a data frame/caller, it
will not block before returning the result via future. However, this is not
enough, since we could
a.) Have blocked earlier, in which case the return value processing will be
async anyway
b.) Even if no blocking takes place, future chaining mechanism could decide
it has to reorder execution.

Assuming though that the case where this happens is rare, and cases where it
actually affects the rule of replay position ordering are even rarer, we can
guard against it by simply keeping track of the highest RP _discarded_ (sent
to sstable flush), and if we attempt to apply a mutation with a higher RP,
simply re-do the operation (i.e. write same entry to commit log again).

Signed-off-by: Calle Wilund <calle@cloudius-systems.com>
2015-06-10 11:56:45 +03:00
Vlad Zolotarov
a2594015f9 locator: futurize snitch creation
- Forbid explicit snitch creation with constructor.
   - Allow the creation of snitches only with locator::make_snitch() template
     function.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>

New in v4:
   - Make sure the snitch is stopped before it's destroyed when _snitch_is_ready
     is returned in an exceptional state.

New in v2:
   - Change snitch_ptr to be std::unique_ptr<i_endpoint_snitch>
   - abstract_replication_strategy::create_replication_strategy(): explicitly
     specify (template) types of create_object() parameters.
   - Re-arrange the loop in merge_keyspaces() so that lambdas that depend on
     "this" complete before there is a chance that "this" gets destroyed.
   - create_keyspace(): Don't add a new keyspace if a keyspace with this name
     already exists.
   - i_endpoint_snitch: added a stop() virtual method
      - Added a stop() pure virtual method.
      - Added an enum class snitch_state and a _state member initialized to snitch_state::initializing,
        added an assert() in a destructor requiring _state to become snitch_state::stopped,
        which should be set when stop() is complete.
   - rack_inferring_snitch: added a stop() method.
   - simple_snitch: added a stop() method.
   - Added stop() methods to abstract_replication_strategy and keyspace.
   - Updated database::stop() to wait for all keyspaces in _keyspaces to stop.
2015-06-09 15:33:38 +03:00
Vlad Zolotarov
c1f0d285bb database: make the create_keyspace() function declaration match the definition.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2015-06-09 15:18:46 +03:00
Pekka Enberg
87e525b6b5 database: Add update and drop column family stubs
They're needed by table merging in db/legacy_schema_tables.cc.

Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-06-08 14:42:36 +03:00
Tomasz Grabiec
b2549a7b14 Merge branch 'calle/secondary_index' from seastar-dev.git 2015-06-03 13:22:01 +02:00
Calle Wilund
293dbf66e3 Forward and use replay_position when applying mutation
* Forward commitlog replay_position to column_family.memtable, updating
  highest RP if needed
* When flushing memtable, signal back to commitlog that RP has been dealt with
  to potentially remove finished segment(s)

Note: since memtable flushing right now is _not_ explicitly ordered,
this does not actually work, since we need to guarantee ordering with
regards to RP. I.e. if we flush N blocks, we must guarantee that:
a.) We report "flushed RP" in RP order
b.) For a given RP1, all RP* lower than RP1 must also have been flushed.
(The latter means that it is fine to, say, flush X tables at the same time,
as long as we report a single RP that is the highest, and no lower RPs exist
in non-flushed tables.)

I am however letting someone else deal with ensuring MT->sstable flush order.

Signed-off-by: Calle Wilund <calle@cloudius-systems.com>
2015-06-03 12:38:13 +03:00
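Rules (a) and (b) above can be sketched as a small function. This is only an illustration under stated assumptions: replay positions are modelled as plain integers, and `memtable_rp_range` and `reportable_flushed_rp` are hypothetical names, not part of the codebase. The function returns the highest RP that may be reported as flushed, or nothing if no RP qualifies.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical summary of one memtable: the RP range it covers and whether
// it has already been written out as an sstable.
struct memtable_rp_range {
    std::uint64_t lowest_rp;
    std::uint64_t highest_rp;
    bool flushed;
};

// Highest RP that can be reported: it must come from a flushed memtable and
// lie below every RP still held by a non-flushed memtable (rule b).
std::optional<std::uint64_t> reportable_flushed_rp(const std::vector<memtable_rp_range>& tables) {
    std::optional<std::uint64_t> lowest_unflushed;
    for (const auto& t : tables) {
        assert(t.lowest_rp <= t.highest_rp);
        if (!t.flushed && (!lowest_unflushed || t.lowest_rp < *lowest_unflushed)) {
            lowest_unflushed = t.lowest_rp;
        }
    }
    std::optional<std::uint64_t> result;
    for (const auto& t : tables) {
        if (!t.flushed) {
            continue;
        }
        if (lowest_unflushed && t.highest_rp >= *lowest_unflushed) {
            continue;  // conservative: a lower RP may still be unflushed
        }
        if (!result || t.highest_rp > *result) {
            result = t.highest_rp;
        }
    }
    return result;
}
```

Reporting the returned values in the order they are computed, and only when they increase, would also satisfy rule (a).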
Calle Wilund
724a33c11d Database: add "existing_index_names" 2015-06-03 10:13:53 +02:00
Paweł Dziepak
8e66bfc9d4 db: add getter for database::_keyspaces
Signed-off-by: Paweł Dziepak <pdziepak@cloudius-systems.com>
2015-06-02 14:11:34 +02:00
Paweł Dziepak
d50859907f db: update keyspace_metadata when column family is added
Signed-off-by: Paweł Dziepak <pdziepak@cloudius-systems.com>
2015-06-02 14:11:34 +02:00
Pekka Enberg
4dc488afb2 database: Store metadata in 'struct keyspace'
Store a lw_shared_ptr<keyspace_metadata> in struct keyspace so callers
in migration manager, for example, can look it up.

Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-05-25 09:12:29 +02:00
Avi Kivity
ff42d58881 db: use CoW to modify the memtable list in column_family
Allow memtables to be removed from a column_family while a running query
continues to use them.
2015-05-20 16:00:00 +03:00
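A minimal sketch of the copy-on-write idea, using only the standard library. `column_family_stub` and its members are hypothetical stand-ins, strings stand in for memtables, and the real code may use different pointer types. It assumes a single thread modifies the list, as on one shard.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Strings stand in for memtables in this sketch.
using memtable_list = std::vector<std::shared_ptr<const std::string>>;

class column_family_stub {
    std::shared_ptr<const memtable_list> _memtables = std::make_shared<memtable_list>();
public:
    // A running query keeps this snapshot; later modifications never touch it.
    std::shared_ptr<const memtable_list> snapshot() const {
        return _memtables;
    }

    // Every modification copies the list, edits the copy, then swaps it in.
    void add(std::string name) {
        auto copy = std::make_shared<memtable_list>(*_memtables);
        copy->push_back(std::make_shared<std::string>(std::move(name)));
        _memtables = std::move(copy);
    }

    void remove_oldest() {
        assert(!_memtables->empty());
        auto copy = std::make_shared<memtable_list>(_memtables->begin() + 1, _memtables->end());
        _memtables = std::move(copy);
    }
};
```

A reader that took a snapshot before remove_oldest() still sees the removed entry, and the entry is destroyed only when the last snapshot referring to it goes away.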
Avi Kivity
1342553fed db: remove column_family::testonly_all_memtables()
Unused and gets in the way.
2015-05-20 15:28:53 +03:00
Avi Kivity
f8f6e979ef db: use CoW to modify the sstable table in column_family
Allow sstables to be removed from a column_family while a running query
continues to use them.
2015-05-20 15:17:35 +03:00
Tomasz Grabiec
137b3beb2f Merge tag 'avi/readpath-prep/v1' from seastar-dev.git
From Avi:

"This patchset prepares for adding sstables to the read path.  Because sstables
involve I/O, their APIs return futures, which means that APIs that may call
those sstable APIs also need to return futures.

This patchset uses the two-space indent + do_with + reference aliases trick
to make patches more readable.  Cleanup patches will follow once it is merged."
2015-05-19 20:39:36 +02:00
Pekka Enberg
56d6fdacfe database: Simplify replication strategy initialization
Initialize replication strategy when keyspace is being created now that
we have access to keyspace_metadata.

Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-05-19 15:27:47 +03:00
Pekka Enberg
cd35617855 database: Use keyspace_metadata for creation functions
Use the keyspace_metadata type for keyspace creation functions. This is
needed to be able to have a mapping from keyspace name to keyspace
metadata for various call-sites.

Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-05-19 15:27:47 +03:00
Avi Kivity
db04bba208 db: futurize the single partition query path
Prepare for disk reads.
2015-05-19 15:13:09 +03:00
Avi Kivity
738be63b28 db: define column_family move constructor in .cc
Allows using it from files that do not include sstable.hh.
2015-05-19 15:13:09 +03:00
Pekka Enberg
8380df84b4 database: Rename ks_meta_data to keyspace_metadata
Follow the naming convention set by user_types_metadata and rename
ks_meta_data to keyspace_metadata.

Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-05-19 11:24:06 +03:00
Pekka Enberg
7a84b53d61 database: Use lw_shared_ptr for user types metadata
Use lw_shared_ptr for user types metadata member in ks_meta_data.

Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-05-19 11:17:55 +03:00
Pekka Enberg
a225439fdb database: Inline ks_meta_data implementation
The implementation part of ks_meta_data is just a few lines of code.
Inline that to the database.hh header file.

Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-05-19 11:07:14 +03:00
Pekka Enberg
032af4d53b database: Move ks_meta_data definition to database.hh
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-05-19 11:03:28 +03:00
Avi Kivity
07d7f410f3 Merge branch 'memtable' into db
Conflicts:
	database.hh
	memtable changes moved to memtable.hh
2015-05-18 15:50:24 +03:00
Avi Kivity
875148dae6 db: create keyspace/column_family directory structure
This is slightly awkward, since the directory structure is not sharded.
This requires some processing to occur outside the shard, while the rest
is sharded.
2015-05-18 15:34:41 +03:00
Avi Kivity
20775b9d5c db: store a column_family's memtables in a list instead of a vector
A vector can cause memtables to be move()d around, which breaks any
code that captures a memtable's this pointer.

Fix by using a linked list.
2015-05-18 15:34:25 +03:00
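The difference can be checked with a small sketch. `memtable_stub` and `first_element_stays_put` are hypothetical names used only for illustration. The function records the address of the first element, appends more elements, and reports whether the address is unchanged.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>
#include <vector>

// Hypothetical stand-in for a memtable.
struct memtable_stub {
    int id;
};

// Returns true if the first element has the same address after `extra`
// further elements have been appended to the container.
template <typename Container>
bool first_element_stays_put(std::size_t extra) {
    assert(extra > 0);
    Container c;
    c.push_back(memtable_stub{0});
    // Keep the address as an integer so a stale pointer is never used.
    const auto before = reinterpret_cast<std::uintptr_t>(&c.front());
    for (std::size_t i = 1; i <= extra; ++i) {
        c.push_back(memtable_stub{static_cast<int>(i)});
    }
    const auto after = reinterpret_cast<std::uintptr_t>(&c.front());
    return before == after;
}
```

For `std::list<memtable_stub>` the result is always true, because list insertion never moves existing elements. For `std::vector<memtable_stub>` it is usually false once the vector has reallocated, which is the hazard for code that captured a memtable's this pointer.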
Avi Kivity
394e0d3a8c db: make database::add_keyspace() return void
Returning a reference to the keyspace is dangerous in that the keyspace can
be moved away, when we start futurizing the add_keyspace() process.  Make
it return void and look up the keyspace at the point of use.
2015-05-18 15:34:25 +03:00
Avi Kivity
d8fed7e211 db: add simple memtable sealing policy
Needs to be replaced with something better, but we lack the infrastructure so
far (region memory allocator).
2015-05-18 15:34:25 +03:00
Avi Kivity
0eb842dc5b db: write memtable after sealing it
Still missing handling after write completes.
2015-05-18 15:00:33 +03:00
Avi Kivity
ca49d73f97 db: allow configuring a column family to be memory-only
Useful for tests.
2015-05-18 15:00:33 +03:00
Avi Kivity
dda5cbfd0d db: make column_family and keyspace configurable
Currently used for the data directory.
2015-05-18 15:00:31 +03:00
Avi Kivity
7842113cb6 db: prune some unused column_family methods
Made redundant by switching tests to using memtable directly.
2015-05-18 14:59:02 +03:00
Glauber Costa
2174285c31 db: move memtable definition to its own file
Following what happened to others: we can now include memtable.hh
without including database.hh

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-05-17 12:38:32 +03:00
Avi Kivity
40c2d91cd8 db: add memtable::find_or_create_row_slow()
Useful for tests that do not need a column_family.
2015-05-17 10:31:22 +03:00
Tomasz Grabiec
f7abbda156 db: Apply frozen_mutation directly
We don't convert it back to mutation before applying.

mutation_partition has now apply() which works on
mutation_partition_view.
2015-05-08 09:19:02 +02:00
Tomasz Grabiec
4ab66de0ae db: Introduce frozen_mutation
The immediate motivation for introducing frozen_mutation is the inability
to deserialize the current "mutation" object, which needs a schema reference
at the time it's constructed. It needs schema to initialize its
internal maps with proper key comparators, which depend on schema.

frozen_mutation is an immutable, compact form of a mutation. It
doesn't use complex in-memory structures; data is stored in a linear
buffer. With frozen_mutation, the schema needs to be supplied only at
the time the mutation partition is visited. Therefore it can be trivially
deserialized without a schema.
2015-05-08 09:19:01 +02:00
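The design can be sketched in miniature. Everything here is a hypothetical illustration, not the actual frozen_mutation layout: `frozen_cells` stores (column id, value) pairs in one byte buffer, can be rebuilt from raw bytes without any schema, and needs a schema (here a map from column id to name, `schema_stub`) only when it is visited.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

using schema_stub = std::map<std::uint32_t, std::string>;  // column id -> name

class frozen_cells {
    std::vector<std::uint8_t> _bytes;  // repeated [id:u32][len:u32][value]

    static void put_u32(std::vector<std::uint8_t>& out, std::uint32_t v) {
        for (int i = 0; i < 4; ++i) {
            out.push_back(static_cast<std::uint8_t>(v >> (8 * i)));
        }
    }
    static std::uint32_t get_u32(const std::vector<std::uint8_t>& in, std::size_t pos) {
        assert(pos + 4 <= in.size());
        std::uint32_t v = 0;
        for (int i = 0; i < 4; ++i) {
            v |= static_cast<std::uint32_t>(in[pos + i]) << (8 * i);
        }
        return v;
    }
public:
    // "Deserialization" is just taking ownership of the bytes: no schema needed.
    explicit frozen_cells(std::vector<std::uint8_t> bytes) : _bytes(std::move(bytes)) {}

    static frozen_cells freeze(const std::vector<std::pair<std::uint32_t, std::string>>& cells) {
        std::vector<std::uint8_t> out;
        for (const auto& cell : cells) {
            put_u32(out, cell.first);
            put_u32(out, static_cast<std::uint32_t>(cell.second.size()));
            out.insert(out.end(), cell.second.begin(), cell.second.end());
        }
        return frozen_cells(std::move(out));
    }

    const std::vector<std::uint8_t>& bytes() const { return _bytes; }

    // The schema is consulted only here, when the cells are interpreted.
    template <typename Visitor>
    void visit(const schema_stub& schema, Visitor&& visitor) const {
        std::size_t pos = 0;
        while (pos + 8 <= _bytes.size()) {
            const std::uint32_t id = get_u32(_bytes, pos);
            const std::uint32_t len = get_u32(_bytes, pos + 4);
            pos += 8;
            if (len > _bytes.size() - pos) {
                break;  // truncated buffer: stop rather than read past the end
            }
            std::string value(_bytes.begin() + pos, _bytes.begin() + pos + len);
            pos += len;
            auto it = schema.find(id);
            if (it != schema.end()) {
                visitor(it->second, value);
            }
        }
    }
};
```

A receiver could construct `frozen_cells` from bytes read off the wire and defer the schema lookup until the data is applied.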