Commit Graph

154 Commits

Author SHA1 Message Date
Avi Kivity
0eb842dc5b db: write memtable after sealing it
Still missing handling after write completes.
2015-05-18 15:00:33 +03:00
Avi Kivity
ca49d73f97 db: allow configuring a column family to be memory-only
Useful for tests.
2015-05-18 15:00:33 +03:00
Avi Kivity
dda5cbfd0d db: make column_family and keyspace configurable
Currently used for the data directory.
2015-05-18 15:00:31 +03:00
Avi Kivity
7842113cb6 db: prune some unused column_family methods
Made redundant by switching tests to using memtable directly.
2015-05-18 14:59:02 +03:00
Avi Kivity
40c2d91cd8 db: add memtable::find_or_create_row_slow()
Useful for tests that do not need a column_family.
2015-05-17 10:31:22 +03:00
Tomasz Grabiec
f656ae8ed4 db: Encapsulate deletable_row fields 2015-05-13 08:56:54 +02:00
Tomasz Grabiec
dbc40dfb09 db: Encapsulate the "row" class
Reduces coupling. Users should not rely on the fact that it's an
std::map<>.  It also allows us to extend row's interface with
domain-specific methods, which are a lot easier to discover than free
functions.
2015-05-13 08:56:54 +02:00
Tomasz Grabiec
56bea440a7 mutation_partition: Pass schema by const& where applicable
If a method doesn't want to share schema ownership, it doesn't have to
take it by shared pointer. The benefit is that it's slightly cheaper
and those methods may now be called from places which don't own
schema.
2015-05-13 08:56:54 +02:00
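The trade-off described in 56bea440a7 can be illustrated with a minimal sketch. The `schema`, `schema_ptr`, `count_columns` and `table_reader` names below are simplified, hypothetical stand-ins and not the project's actual declarations.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, heavily simplified schema.
struct schema {
    std::vector<std::string> columns;
};
using schema_ptr = std::shared_ptr<const schema>;

// Only reads the schema during the call: takes const&, so no
// reference count is touched and the caller need not own the schema.
inline std::size_t count_columns(const schema& s) {
    return s.columns.size();
}

// Keeps the schema alive beyond the call: takes shared ownership.
class table_reader {
    schema_ptr _schema;
public:
    explicit table_reader(schema_ptr s) : _schema(std::move(s)) {}
    std::size_t column_count() const { return count_columns(*_schema); }
};
```

With this split, `count_columns()` can be called on a stack-allocated `schema`, while `table_reader` still shares ownership because it outlives the call that created it.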
Avi Kivity
7630fe332d db: pass correct mutation size to commitlog
Use serialized_size() instead of representation().size(), to account for the
size header.
2015-05-11 19:19:23 +03:00
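The size difference fixed in 7630fe332d can be sketched as below. The `serialized_blob` type, the 4-byte big-endian header and the `serialize()` helper are assumptions made for illustration; the commit message only says that a size header exists.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a serialized mutation.
struct serialized_blob {
    std::vector<uint8_t> bytes;

    // Payload only.
    const std::vector<uint8_t>& representation() const { return bytes; }

    // Payload plus an assumed 4-byte length header.
    std::size_t serialized_size() const { return sizeof(uint32_t) + bytes.size(); }
};

// Writes the header followed by the payload, so the output length
// equals serialized_size(), not representation().size().
inline std::vector<uint8_t> serialize(const serialized_blob& b) {
    std::vector<uint8_t> out;
    out.reserve(b.serialized_size());
    const uint32_t n = static_cast<uint32_t>(b.bytes.size());
    for (int shift = 24; shift >= 0; shift -= 8) {
        out.push_back(static_cast<uint8_t>(n >> shift));
    }
    out.insert(out.end(), b.bytes.begin(), b.bytes.end());
    return out;
}
```

A caller that reserves log space from `representation().size()` would come up short by the header width, which is the kind of mismatch the commit addresses.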
Tomasz Grabiec
eaceb61801 db: Add atomic_cell::deletion_time()
Deleted cells store deletion time not expiry time. This change makes
expiry() valid only for live cells with TTL and adds deletion_time(),
which is intended to be used with deleted cells.
2015-05-10 12:03:26 +03:00
Tomasz Grabiec
f7abbda156 db: Apply frozen_mutation directly
We don't convert it back to mutation before applying.

mutation_partition now has apply() which works on
mutation_partition_view.
2015-05-08 09:19:02 +02:00
Tomasz Grabiec
bdcd11efe9 db: Use operator<< for partition printing 2015-05-08 09:19:02 +02:00
Tomasz Grabiec
4ab66de0ae db: Introduce frozen_mutation
The immediate motivation for introducing frozen_mutation is the inability
to deserialize the current "mutation" object, which needs a schema reference
at the time it's constructed. It needs schema to initialize its
internal maps with proper key comparators, which depend on schema.

frozen_mutation is an immutable, compact form of a mutation. It
doesn't use complex in-memory structures; data is stored in a linear
buffer. With frozen_mutation, the schema needs to be supplied only at
the time the mutation partition is visited. Therefore it can be trivially
deserialized without schema.
2015-05-08 09:19:01 +02:00
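The idea behind 4ab66de0ae (a linear buffer that needs a schema only when visited) can be sketched roughly as follows. The `frozen_cells` class, its one-byte id and length encoding, and the toy `schema` are all hypothetical; the real frozen_mutation format is not described in the commit message.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a schema: maps column ids to names.
struct schema {
    std::vector<std::string> column_names;
};

// Immutable cells stored as [column id:1][length:1][value bytes]...
class frozen_cells {
    std::vector<uint8_t> _buf;
    explicit frozen_cells(std::vector<uint8_t> buf) : _buf(std::move(buf)) {}
public:
    // Builds the linear buffer; no schema is needed here.
    // Sketch limitation: values must be shorter than 256 bytes.
    static frozen_cells freeze(const std::vector<std::pair<uint8_t, std::string>>& cells) {
        std::vector<uint8_t> buf;
        for (const auto& c : cells) {
            buf.push_back(c.first);
            buf.push_back(static_cast<uint8_t>(c.second.size()));
            buf.insert(buf.end(), c.second.begin(), c.second.end());
        }
        return frozen_cells(std::move(buf));
    }

    // "Deserialization" is just adopting the bytes; again no schema.
    static frozen_cells from_bytes(std::vector<uint8_t> buf) {
        return frozen_cells(std::move(buf));
    }

    const std::vector<uint8_t>& bytes() const { return _buf; }

    // The schema is supplied only when the contents are visited.
    void visit(const schema& s,
               const std::function<void(const std::string&, const std::string&)>& fn) const {
        std::size_t pos = 0;
        while (_buf.size() - pos >= 2) {
            const uint8_t id = _buf[pos];
            const std::size_t len = _buf[pos + 1];
            pos += 2;
            if (_buf.size() - pos < len || id >= s.column_names.size()) {
                break;  // truncated buffer or unknown column: stop
            }
            fn(s.column_names[id], std::string(_buf.begin() + pos, _buf.begin() + pos + len));
            pos += len;
        }
    }
};
```

Because `from_bytes()` takes no schema, a receiver can hold the buffer and defer schema lookup until it actually needs to interpret the cells.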
Tomasz Grabiec
f43836eb68 db: Handle expired cells in compare_atomic_cell_for_merge()
While at it, clarify some comments.
2015-05-06 18:31:21 +02:00
Tomasz Grabiec
5ba1486ae7 db: Rename "ttl" to "expiry" when it's used as time point
To avoid confusion with "ttl" the duration.
2015-05-06 17:27:22 +02:00
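The distinction drawn in 5ba1486ae7 between "ttl" (a duration) and "expiry" (a time point) can be sketched with `std::chrono`. The `live_cell` struct, the choice of `system_clock`, and the rule that a cell counts as expired exactly at its expiry point are assumptions for illustration.

```cpp
#include <chrono>
#include <optional>

using clock_type = std::chrono::system_clock;  // assumed clock

// Hypothetical live cell carrying both notions side by side.
struct live_cell {
    std::optional<clock_type::duration> ttl;       // how long the cell lives
    std::optional<clock_type::time_point> expiry;  // when it stops being live
};

// expiry is derived from the write time plus the ttl.
inline live_cell make_live_cell(clock_type::time_point now,
                                std::optional<clock_type::duration> ttl) {
    live_cell c;
    c.ttl = ttl;
    if (ttl) {
        c.expiry = now + *ttl;
    }
    return c;
}

// Assumption: a cell is expired once "now" reaches its expiry point.
inline bool is_expired(const live_cell& c, clock_type::time_point now) {
    return c.expiry.has_value() && *c.expiry <= now;
}
```

Keeping the two as different types means the compiler rejects code that passes a duration where a time point is expected, which is the confusion the rename aims to avoid.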
Tomasz Grabiec
36ad6c9aa8 Merge tag 'avi/memtables/v3' from seastar-dev.git
Multiple memtable support from Avi.
2015-05-06 15:02:42 +02:00
Avi Kivity
ef5c661d11 db: add variant of column_family::for_all_partitions() for unit tests
Since it's for tests, we can pass a slower std::function<>.
2015-05-06 15:43:06 +03:00
Avi Kivity
1d6ac071c0 db: add API to seal current active memtable 2015-05-06 15:39:31 +03:00
Avi Kivity
22969aeb18 db: support for multiple memtables
Each column family now contains multiple memtables, with one designated as
"active" receiving all writes, while the others only serve reads.
2015-05-06 15:39:29 +03:00
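A rough sketch of the arrangement described in 22969aeb18, with one active memtable taking writes and sealed ones serving reads only. The string-keyed `memtable`, the `seal_active_memtable()` name and the "newest memtable wins" read rule are simplifying assumptions; the real code stores mutation partitions and merges data sources.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <optional>
#include <string>
#include <vector>

// Hypothetical simplified memtable: key -> value.
class memtable {
    std::map<std::string, std::string> _rows;
public:
    void apply(const std::string& key, const std::string& value) { _rows[key] = value; }
    std::optional<std::string> find(const std::string& key) const {
        auto it = _rows.find(key);
        if (it == _rows.end()) {
            return std::nullopt;
        }
        return it->second;
    }
};

class column_family {
    // The last element is the active memtable; earlier ones are sealed.
    std::vector<std::unique_ptr<memtable>> _memtables;
public:
    column_family() { _memtables.push_back(std::make_unique<memtable>()); }

    // All writes go to the active memtable.
    void apply(const std::string& key, const std::string& value) {
        _memtables.back()->apply(key, value);
    }

    // Seals the current active memtable and starts a fresh one.
    void seal_active_memtable() { _memtables.push_back(std::make_unique<memtable>()); }

    // Reads consult every memtable, newest first (simplification).
    std::optional<std::string> find(const std::string& key) const {
        for (auto it = _memtables.rbegin(); it != _memtables.rend(); ++it) {
            if (auto v = (*it)->find(key)) {
                return v;
            }
        }
        return std::nullopt;
    }

    std::size_t memtable_count() const { return _memtables.size(); }
};
```

After `seal_active_memtable()`, earlier writes remain readable while new writes land only in the fresh memtable.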
Avi Kivity
5e81b92dc0 db: split column_family::partitions into a new memtable class
In preparation for multiple memtables, move column_family::partitions into
its own class, and forward relevant calls from column_family.

A testonly_all_memtables() function was added to support sstable_test.
2015-05-06 15:35:14 +03:00
Avi Kivity
cc291d7e3b db: improve sharding
Currently we use the first byte of the token for determining the local
shard.  This is suboptimal for two reasons:

 1. the first bytes of the token were already used to select the node,
    so they are not randomly distributed
 2. using a single byte is not sufficient for large core counts, as the
    modulo operation will not return evenly distributed results

Fix by using the final two bytes of the token.
2015-05-06 13:19:44 +02:00
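The fix in cc291d7e3b (using the final two bytes of the token) might look roughly like the sketch below. The `shard_of` name, the byte-vector token representation and the big-endian combination of the two bytes are assumptions; the commit message does not specify them.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: picks the local shard from the last two bytes
// of the token, which were not used for node selection.
inline unsigned shard_of(const std::vector<uint8_t>& token, unsigned shard_count) {
    assert(shard_count > 0);
    if (token.empty()) {
        return 0;
    }
    const std::size_t n = token.size();
    // Combine the final two bytes (assumed big-endian); a one-byte
    // token contributes only its single byte.
    const unsigned hi = n >= 2 ? token[n - 2] : 0u;
    const unsigned lo = token[n - 1];
    const unsigned value = (hi << 8) | lo;
    return value % shard_count;
}
```

Two bytes give 65536 distinct values before the modulo, so the result is spread more evenly over large core counts than with the 256 values of a single byte.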
Avi Kivity
e811690588 db: return smart pointers for column_family read-side lookups
A lookup can cause several data sources to be merged, in which case we will
have to return a temporary (containing data from all the data sources).

For simplicity, we start by always returning a temporary.
2015-05-05 20:21:04 +03:00
Avi Kivity
8028fb441a db: make column_family a class, not a struct
Don't expose privates in public.
2015-05-05 20:21:03 +03:00
Avi Kivity
3a0de14aa8 db: more const correctness for column_family and component types
Ensure that read-side accessors are const.  This is important in preparation
for multiple memtables (and later, sstables) since a read-side
mutation_partition may be a temporary object coming from multiple memtables
(and sstables) while a write-side mutation_partition is guaranteed to belong
to a single memtable (and thus, not be temporary).

Since writers will want non-const mutation_partitions to write to, they won't
be able to use the read-side accessors by accident.
2015-05-05 19:37:21 +03:00
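The read-side versus write-side split described in 3a0de14aa8 can be sketched like this. The `partition` struct and both accessor names are hypothetical simplifications, not the project's actual interface.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for mutation_partition.
struct partition {
    std::string data;
};

class column_family {
    std::map<std::string, partition> _partitions;
public:
    // Read side: const member returning a pointer to const
    // (nullptr if the partition does not exist).
    const partition* find_partition(const std::string& key) const {
        auto it = _partitions.find(key);
        return it == _partitions.end() ? nullptr : &it->second;
    }

    // Write side: non-const member, creates the partition on demand
    // and hands out a mutable reference.
    partition& find_or_create_partition(const std::string& key) {
        return _partitions[key];
    }
};
```

A writer holding the result of `find_partition()` cannot modify it, so attempting to write through the read-side accessor fails at compile time rather than silently mutating what may later become a temporary.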
Tomasz Grabiec
aec740f895 db: Make decorated_key have ordering compatible with Origin 2015-04-30 12:02:39 +02:00
Calle Wilund
2f4e7a00f6 Use db/config object in main, database etc
* Uses config object to augment/impl options parsing
* Database now holds config obj
* Commitlog can now be inited with global config obj.
2015-04-29 18:01:17 +02:00
Tomasz Grabiec
2693dd2c7b db: Extract bytes related stuff from database.cc to bytes.cc
Some tests (eg murmur_hash_test) need only byte manipulation
functions. By specifying dependencies precisely we can drastically
reduce recompilation times, which speeds up development cycle.

I managed to reduce recompilation time for murmur_hash_test from 5
minutes to 4 seconds by breaking dependency on whole urchin object
set.
2015-04-29 15:50:16 +03:00
Avi Kivity
6290dee438 db: const correctness for abstract_type and friends
Types are immutable.
2015-04-29 15:40:38 +03:00
Avi Kivity
3162873d7f Merge branch 'calle/commitlog' of github.com:cloudius-systems/seastar-dev into db
Use commit log in database, from Calle:

"Initial" usage of the commitlog in database mutation path.
A commitlog is created in "work" dirs when initing the db
from a datadir. However, since we have neither disk data storage,
nor replay capability yet (and no real db config), the settings
are basically to just write in-memory serialization, write them to
disk and then discard them. So in fact, pointless. But at least using
the log...
2015-04-29 11:28:05 +03:00
Calle Wilund
aeb83f2874 Add commitlog to db + use it in storage_proxy/handler
* A commitlog is created in "work" dirs when initing the db
  from a datadir. However, since we have neither disk data storage,
  nor replay capability yet (and no real db config), the settings 
  are basically to just write in-memory serialization, write them to 
  disk and then discard them. So in fact, pointless. But at least using
  the log...
* Moved the actual "apply" of mutation into database. If a commitlog
  is active, add an entry to it before applying mutation.
2015-04-29 10:10:21 +02:00
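The ordering stated in aeb83f2874 (add a commitlog entry before applying the mutation, when a commitlog is active) can be sketched synchronously. Both classes below are hypothetical stand-ins that keep everything in memory; the real path is asynchronous and writes to disk.

```cpp
#include <string>
#include <vector>

// Hypothetical in-memory stand-in for the commitlog.
class commitlog {
    std::vector<std::string> _entries;
public:
    void add(const std::string& serialized_mutation) { _entries.push_back(serialized_mutation); }
    const std::vector<std::string>& entries() const { return _entries; }
};

class database {
    commitlog* _commitlog;              // nullptr means no commitlog is active
    std::vector<std::string> _applied;  // stand-in for in-memory state
public:
    explicit database(commitlog* cl = nullptr) : _commitlog(cl) {}

    void apply(const std::string& serialized_mutation) {
        if (_commitlog) {
            _commitlog->add(serialized_mutation);  // log first
        }
        _applied.push_back(serialized_mutation);   // then apply
    }

    const std::vector<std::string>& applied() const { return _applied; }
};
```

Placing the logging step inside `database::apply()` means every caller gets the same log-then-apply order without having to know whether a commitlog is configured.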
Pekka Enberg
33ceac5643 database: add database::delete_keyspace() stub
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-04-28 15:49:33 +03:00
Pekka Enberg
cf1d6197d6 database: add database::update_keyspace() stub
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-04-27 11:39:57 +03:00
Tomasz Grabiec
5a7e3d3278 db: Order partitions by decorated_key
Partitions should be ordered using Origin's ordering, which is first
by token, then by Origin's representation of the key. That is the
natural ordering of decorated_key.

This also changes mutation class to hold decorated_key, to avoid
decoration overhead at different layers.
2015-04-24 18:01:01 +02:00
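The ordering described in 5a7e3d3278 (first by token, then by key) can be sketched as a comparison operator. The `decorated_key` struct below is a hypothetical simplification: the token is modeled as a 64-bit integer and the key comparison is plain lexicographic byte order, which stands in for Origin's key representation.

```cpp
#include <cstdint>
#include <string>
#include <tuple>

// Hypothetical simplified decorated_key: a key paired with its token,
// so the token is computed once and carried along.
struct decorated_key {
    int64_t token;
    std::string key;
};

// Orders by token first; ties are broken by the key bytes.
inline bool operator<(const decorated_key& a, const decorated_key& b) {
    return std::tie(a.token, a.key) < std::tie(b.token, b.key);
}
```

With `operator<` defined this way, a `std::map<decorated_key, ...>` iterates partitions in token order, and holding the token inside the key avoids recomputing it at each layer.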
Tomasz Grabiec
1c3275c950 mutation: Encapsulate fields 2015-04-24 18:01:01 +02:00
Tomasz Grabiec
4641bc6f95 database: Move implementation to source file 2015-04-24 18:01:01 +02:00
Tomasz Grabiec
731a63e371 schema: Embed raw_schema inside schema
Public fields got encapsulated.
2015-04-24 18:01:01 +02:00
Tomasz Grabiec
c963821e1d db: Extract schema-specific code to schema.cc 2015-04-23 20:54:12 +02:00
Avi Kivity
da8782b9e5 Merge branch 'tgrabiec/code-moves' of github.com:cloudius-systems/seastar-dev into db
Cleanups in preparation for memtables, from Tomasz.
2015-04-23 18:44:40 +03:00
Tomasz Grabiec
0d4821009c db: Move mutation and mutation_partition to separate headers and compilation units 2015-04-22 18:42:33 +02:00
Tomasz Grabiec
a5c201a685 db: Move column_family::get_partition_slice() to mutation_partition::query()
There's nothing column_family-specific there.
2015-04-22 17:40:02 +02:00
Tomasz Grabiec
de5bea90fe db: Add const qualifiers to mutation_partition methods 2015-04-22 17:37:40 +02:00
Tomasz Grabiec
631dad8a29 schema: Add const qualifiers to lookup methods 2015-04-22 17:36:27 +02:00
Gleb Natapov
57ac231cd2 convert some snitch related classes 2015-04-21 18:24:35 +03:00
Tomasz Grabiec
ef05c5b919 db: Lookup column family by UUID
It's a bit faster.
2015-04-20 12:12:55 +02:00
Tomasz Grabiec
5693f73b7a db: Implement generate_legacy_id() properly 2015-04-17 14:22:29 +02:00
Tomasz Grabiec
00f99cefd4 db: split query.hh to reduce header dependencies 2015-04-15 20:44:59 +02:00
Tomasz Grabiec
878a740b9d db: Write query results in serialized form
This gives about 30% increase in tps in:

  build/release/tests/perf/perf_simple_query -c1 --query-single-key

This patch switches query result format from a structured one to a
serialized one. The problems with structured format are:

  - high level of indirection (vector of vectors of vectors of blobs), which
    is not CPU cache friendly

  - high allocation rate due to fine-grained object structure

On replica side, the query results are probably going to be serialized
in the transport layer anyway, so this change only subtracts
work. There is no processing of the query results on replica other
than concatenation in case of range queries. If query results are
collected in serialized form from different cores, we can concatenate
them without copying by simply appending the fragments into the
packet. This optimization is not implemented yet.

On coordinator side, the query results would have to be parsed from
the transport layer buffers anyway, so this also doesn't add work, but
again saves allocations and copying. The CQL server doesn't need
complex data structures to process the results, it just goes over it
linearly consuming it. This patch provides views, iterators and
visitors for consuming query results in serialized form. Currently the
iterators assume that the buffer is contiguous but we could easily
relax this in future so that we can avoid linearization of data
received from seastar sockets.

The coordinator side could be optimized even further for CQL queries
which do not need processing (eg. select * from cf where ...). We
could make the replica send the query results in the format which is
expected by the CQL binary protocol client. So in the typical case the
coordinator would just pass the data using zero-copy to the client,
prepending a header.

We do need structure for prefetched rows (needed by list
manipulations), and this change adds query result post-processing
which converts serialized query result into a structured one, tailored
particularly for prefetched rows needs.

This change also introduces partition_slice options. In some queries
(maybe even in typical ones), we don't need to send partition or
clustering keys back to the client, because they are already specified
in the query request, and not queried for. The query results now hold
keys as optional elements. Also, meta-data like cell timestamp and
ttl is now also optional. It is only needed if the query has
writetime() or ttl() functions in it, which it typically won't have.
2015-04-15 20:44:50 +02:00
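The concatenation property mentioned in 878a740b9d can be illustrated with a small sketch of a serialized result. The 4-byte big-endian length prefix, the `bytes` alias and the `write_cell()` / `read_cells()` helpers are assumptions; the actual result format and its views, iterators and visitors are not shown in the commit message. For brevity the reader copies cells out, whereas a view would refer into the buffer.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

using bytes = std::vector<uint8_t>;  // hypothetical alias

// Appends one cell as [length: 4 bytes big-endian][payload].
inline void write_cell(bytes& out, const std::string& value) {
    const uint32_t n = static_cast<uint32_t>(value.size());
    for (int shift = 24; shift >= 0; shift -= 8) {
        out.push_back(static_cast<uint8_t>(n >> shift));
    }
    out.insert(out.end(), value.begin(), value.end());
}

// Walks the buffer linearly and returns the cells in order;
// stops at the first truncated cell.
inline std::vector<std::string> read_cells(const bytes& in) {
    std::vector<std::string> cells;
    std::size_t pos = 0;
    while (in.size() - pos >= 4) {
        uint32_t n = 0;
        for (int i = 0; i < 4; ++i) {
            n = (n << 8) | in[pos + i];
        }
        pos += 4;
        if (in.size() - pos < n) {
            break;
        }
        cells.emplace_back(in.begin() + pos, in.begin() + pos + n);
        pos += n;
    }
    return cells;
}
```

Because each cell is self-delimiting, two buffers produced independently (for example on different cores) can be appended to each other and still be read as one result.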
Tomasz Grabiec
ecc5d23456 db: Avoid copying of column_definition
Spotted in the perf profile.
2015-04-15 20:33:48 +02:00
Tomasz Grabiec
7ebc7830b7 db: Optimize column family lookup in query path 2015-04-15 20:33:48 +02:00
Tomasz Grabiec
06f198b10c schema: Add id field
It uniquely identifies column_family globally. Will be used for
column_family lookups.
2015-04-15 20:33:48 +02:00