This gives about a 30% increase in tps in:
build/release/tests/perf/perf_simple_query -c1 --query-single-key
This patch switches the query result format from a structured one to a
serialized one. The problems with the structured format are:
- high level of indirection (vector of vectors of vectors of blobs), which
is not CPU cache friendly
- high allocation rate due to fine-grained object structure
On the replica side, the query results are probably going to be
serialized in the transport layer anyway, so this change only
subtracts work. There is no processing of the query results on the
replica other than concatenation in the case of range queries. If
query results are collected in serialized form from different cores,
we can concatenate them without copying by simply appending the
fragments to the packet. This optimization is not implemented yet.
On the coordinator side, the query results would have to be parsed from
the transport layer buffers anyway, so this doesn't add work either, and
again saves allocations and copying. The CQL server doesn't need
complex data structures to process the results; it just goes over them
linearly, consuming as it goes. This patch provides views, iterators and
visitors for consuming query results in serialized form. Currently the
iterators assume that the buffer is contiguous, but we could easily
relax this in the future so that we can avoid linearization of data
received from seastar sockets.
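As an illustration, here is a minimal sketch of linear consumption
(hypothetical names and a simplified length-prefixed framing; the
actual format in the tree may differ):

    #include <cstdint>
    #include <cstring>
    #include <string_view>

    class result_view {
        std::string_view _buf; // contiguous serialized rows
    public:
        explicit result_view(std::string_view buf) : _buf(buf) {}
        // Walk the buffer once, calling visitor(row_bytes) for each
        // row. No per-row allocations; everything is a view into _buf.
        template <typename Visitor>
        void consume(Visitor&& visitor) const {
            const char* p = _buf.data();
            const char* end = p + _buf.size();
            while (p < end) {
                uint32_t len;
                std::memcpy(&len, p, sizeof(len)); // length prefix
                p += sizeof(len);
                visitor(std::string_view(p, len));
                p += len;
            }
        }
    };

Relaxing the contiguity assumption would mean iterating over
fragmented buffers instead of a single std::string_view.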
The coordinator side could be optimized even further for CQL queries
which do not need processing (e.g. select * from cf where ...): we
could make the replica send the query results in the format which is
expected by the CQL binary protocol client. In the typical case the
coordinator would then just pass the data to the client using
zero-copy, prepending a header.
We do need structure for prefetched rows (needed by list
manipulations), so this change adds query result post-processing
which converts a serialized query result into a structured one,
tailored specifically to the needs of prefetched rows.
This change also introduces partition_slice options. In some queries
(maybe even in typical ones), we don't need to send partition or
clustering keys back to the client, because they are already specified
in the query request and not queried for. Query results now hold
keys as optional elements. Metadata like cell timestamp and ttl is
now also optional; it is only needed if the query contains
writetime() or ttl() functions, which it typically won't.
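To illustrate, a sketch of what such option flags could look like
(names and encoding are illustrative, not the actual definitions):

    #include <cstdint>

    enum class partition_slice_option : uint32_t {
        send_partition_key  = 1u << 0, // echo the partition key in results
        send_clustering_key = 1u << 1, // echo clustering keys in results
        send_timestamp      = 1u << 2, // only needed for writetime()
        send_ttl            = 1u << 3, // only needed for ttl()
    };

    inline partition_slice_option operator|(partition_slice_option a,
                                            partition_slice_option b) {
        return static_cast<partition_slice_option>(
            static_cast<uint32_t>(a) | static_cast<uint32_t>(b));
    }

A typical single-key select of regular columns would set none of
these, so the result stream carries only the cell values.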
To be more precise: do not take schema_ptr by value.
Fixes crashes when running with smp > 1, where mutations applied across
shards (i.e. in foreign memory) would cause schema_ptrs to get out of
sync (using another shard's pointer).
This patch converts (for a very small value of 'converts') some
replication-related classes. Only static topology is supported (it is
created in keyspace::create_replication_strategy()). During mutation
no replication is done, since the messaging service is not ready yet;
only endpoints are calculated.
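A rough sketch of the shape of this (hypothetical types and
signatures; the real classes differ): the strategy is built once from
the static topology, and for now a write only computes its target
endpoints:

    #include <cstdint>
    #include <vector>

    using endpoint = uint32_t; // placeholder for an inet address

    struct replication_strategy {
        virtual ~replication_strategy() = default;
        // Which endpoints own a given token. No messages are sent;
        // the messaging service is not ready yet.
        virtual std::vector<endpoint> calculate_endpoints(int64_t token) const = 0;
    };

    struct simple_strategy final : replication_strategy {
        std::vector<endpoint> _ring; // static topology, fixed at creation
        size_t _rf = 1;
        std::vector<endpoint> calculate_endpoints(int64_t token) const override {
            // Assumes a non-empty ring; take rf successors of the token.
            std::vector<endpoint> out;
            size_t start = static_cast<uint64_t>(token) % _ring.size();
            for (size_t i = 0; i < _rf && i < _ring.size(); ++i) {
                out.push_back(_ring[(start + i) % _ring.size()]);
            }
            return out;
        }
    };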
For serializing to the commit log, and potentially for internal wire
messaging.
Note: intentionally incompatible with the stock C* wire/serial format.
Note: intentionally separate from the CQL-centric serialization
for a few reasons:
1.) Need "bulk serializers" for internal objects (mutation etc.)
which might not fit well into the "types.hh" serializer schemes.
2.) No need for polymorphism/virtual type parameters, since we know
exactly what we serialize and to where. (A sketch follows.)
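A sketch of what a bulk serializer looks like under these constraints
(stand-in types and a simplified format): a single free function that
writes the whole object in one pass, with no virtual dispatch:

    #include <cstdint>
    #include <cstring>
    #include <vector>

    struct mutation_stub { // stand-in for the real mutation type
        std::vector<char> key;
        std::vector<char> cells;
    };

    inline void write_u32(std::vector<char>& out, uint32_t v) {
        const char* p = reinterpret_cast<const char*>(&v);
        out.insert(out.end(), p, p + sizeof(v));
    }

    // The format is private to the node (commit log, internal
    // messaging); it need not match any public wire format.
    inline void serialize(std::vector<char>& out, const mutation_stub& m) {
        write_u32(out, static_cast<uint32_t>(m.key.size()));
        out.insert(out.end(), m.key.begin(), m.key.end());
        write_u32(out, static_cast<uint32_t>(m.cells.size()));
        out.insert(out.end(), m.cells.begin(), m.cells.end());
    }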
* database now holds all keyspace + column family objects
* column families are mapped by uuid, either generated or explicit
* lookup by name tuples or uuid
* finder functions now return refs + throw on missing obj (see the sketch below)
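A minimal sketch of the lookup shape (approximate names; not the
exact code in the tree):

    #include <map>
    #include <stdexcept>
    #include <string>
    #include <utility>

    struct column_family {};
    using table_id = std::string; // placeholder for the real uuid type

    class database {
        std::map<table_id, column_family> _column_families;
        std::map<std::pair<std::string, std::string>, table_id> _cf_uuids;
    public:
        // Lookup by uuid; returns a reference, throws if missing.
        column_family& find_column_family(const table_id& id) {
            auto i = _column_families.find(id);
            if (i == _column_families.end()) {
                throw std::out_of_range("no such column family");
            }
            return i->second;
        }
        // Lookup by (keyspace, column family) name pair.
        column_family& find_column_family(const std::string& ks, const std::string& cf) {
            auto i = _cf_uuids.find(std::make_pair(ks, cf));
            if (i == _cf_uuids.end()) {
                throw std::out_of_range("no such column family");
            }
            return find_column_family(i->second);
        }
    };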
In CQL a row is considered present if its row marker is live or it
has any live cells. The 'insert' statement creates a row
marker. Internally, Origin handles that by inserting a special cell
whose name shares the prefix with the other cells in that row.
One consequence of this design is that when we query a column
slice from sstables we will have to read the whole CQL row, even if
not all columns are queried. We won't have to include the data, but we
will need the liveness information in order to combine it with other
mutations, so that we can finally determine whether the row is live or not.
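The rule itself is easy to state; a minimal sketch with simplified
stand-in types:

    #include <algorithm>
    #include <vector>

    struct cell { bool live; };

    struct cql_row {
        bool row_marker_live;     // created by 'insert'
        std::vector<cell> cells;  // regular column cells
    };

    // A CQL row is present if its marker is live or any cell is live.
    inline bool row_is_live(const cql_row& r) {
        return r.row_marker_live
            || std::any_of(r.cells.begin(), r.cells.end(),
                           [] (const cell& c) { return c.live; });
    }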
The system keyspace is used for things like keyspace and table metadata.
Initialize it in the database constructor so that it is always available.
This is needed for the CQL create keyspace test case, for example.
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
This adds a set_cell() variant which accepts a column name and a
boost::any value and does column definition lookup and decomposition
under the hood. Simplifies code that manipulates system tables directly
in db/legacy_schema_tables.cc.
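Roughly like this (names and signatures are approximate, not the
exact ones in the tree; get_column_definition and decompose stand in
for whatever lookup and serialization helpers the schema and type
objects actually provide):

    void mutation::set_cell(const bytes& column_name, const boost::any& value) {
        auto& def = schema().get_column_definition(column_name); // lookup by name
        set_cell(def, def.type->decompose(value)); // type-driven serialization
    }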
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
Now that the code for sstable metadata is ready, we can read it when we are
loading the keyspaces.
At this moment, only the system tables are processed. This is because we
require the schema to be already determined in order to properly read the
sstables. The system schema is known at compile time; the others will have
to be derived once we are able to read them from the system tables themselves.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
std::map<> does not support lookups using a different comparator than
the one used to order the keys. For range prefix queries and for row
prefix tombstone queries we will need to perform lookups using
different comparators.
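For example, boost::intrusive containers do allow passing a comparator
at the call site, which is one way to get such lookups; a sketch with
a simplified key representation:

    #include <boost/intrusive/set.hpp>
    #include <string>

    struct row_entry : boost::intrusive::set_base_hook<> {
        std::string key; // serialized key
    };

    struct full_cmp {
        bool operator()(const row_entry& a, const row_entry& b) const {
            return a.key < b.key;
        }
    };

    // A different comparator, used only at lookup time, matching on a
    // key prefix rather than the whole key.
    struct prefix_cmp {
        bool operator()(const std::string& p, const row_entry& e) const {
            return p < e.key.substr(0, p.size());
        }
        bool operator()(const row_entry& e, const std::string& p) const {
            return e.key.substr(0, p.size()) < p;
        }
    };

    using row_set = boost::intrusive::set<row_entry,
                                          boost::intrusive::compare<full_cmp>>;

    // std::map has no equivalent of this two-comparator lookup:
    row_entry* find_by_prefix(row_set& rows, const std::string& prefix) {
        auto i = rows.find(prefix, prefix_cmp{});
        return i == rows.end() ? nullptr : &*i;
    }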
Holding keys and their prefixes as "bytes" is error-prone. It's easy
to mix them up (or use the wrong types). This change adds wrappers for
keys, with accessors which are meant to make misuse as difficult as
possible.
Prefix and full keys are now distinguished. Places which assumed that
the representation is the same (it currently is) were changed not to
do so. This will allow us to introduce more compact storage for non-prefix
keys.
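A sketch of the wrapper idea (illustrative shape only; the real
classes have more accessors):

    #include <string>
    #include <utility>

    using bytes = std::string; // stand-in for the real bytes type

    class partition_key {
        bytes _b;
        explicit partition_key(bytes b) : _b(std::move(b)) {}
    public:
        static partition_key from_bytes(bytes b) { return partition_key(std::move(b)); }
        const bytes& representation() const { return _b; }
    };

    class partition_key_prefix {
        bytes _b;
        explicit partition_key_prefix(bytes b) : _b(std::move(b)) {}
    public:
        static partition_key_prefix from_bytes(bytes b) { return partition_key_prefix(std::move(b)); }
        const bytes& representation() const { return _b; }
    };

    // A function taking a full key can no longer be called with a
    // prefix by accident; the types are distinct even though the
    // representation is (currently) the same.
    void lookup(const partition_key& key);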
This also changes the populate() interface a bit: it now works on
existing objects, so that the system keyspace definition is not
overridden. For non-system keyspaces, the keyspace definition will come
from the data in the system tables.
Setting two separate elements of a collection would cause the second
to override the first. This also applies, with much smaller effect, to
normal cells (for example, updating the same counter twice, or issuing
two updates to the same cell but with different timestamps, via
thrift).
Fix by merging the two values rather than replacing the old one.
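The merge rule, sketched with simplified stand-in types (and ignoring
tie-breaking on equal timestamps): atomic cells keep the value with
the higher timestamp, and collections merge element-wise so that
updates to different elements both survive:

    #include <cstdint>
    #include <map>

    struct cell_stub {
        int64_t timestamp;
        int value;
    };

    inline cell_stub merge(const cell_stub& oldc, const cell_stub& newc) {
        // Last-write-wins per cell, decided by timestamp.
        return newc.timestamp >= oldc.timestamp ? newc : oldc;
    }

    using collection_stub = std::map<int, cell_stub>; // element key -> cell

    inline void merge_into(collection_stub& dst, const collection_stub& src) {
        for (const auto& [k, c] : src) {
            auto [i, inserted] = dst.try_emplace(k, c);
            if (!inserted) {
                i->second = merge(i->second, c); // per element, not whole value
            }
        }
    }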
We use bytes for many different things, and it is easy to get confused as
to what format the data is actually in.
Fix that for atomic_cell by providing wrappers. atomic_cell::one corresponds
to a bytes object holding exactly one atomic cell, and atomic_cell::view is
a bytes_view into an atomic cell. The static functions of atomic_cell itself
are privatized to prevent the unwashed masses from using them on the wrong
objects.
Since a row entry can hold either an atomic cell or a collection,
depending on the schema, also introduce a variant type,
atomic_cell_or_collection, and allow the user to pick the type explicitly.
Internally both are stored as a bytes object.
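In rough outline (approximate shape, not the exact code):

    #include <cstdint>
    #include <string>
    #include <string_view>
    #include <utility>

    using bytes = std::string;       // stand-ins for the real types
    using bytes_view = std::string_view;

    class atomic_cell {
    public:
        class one {            // owns exactly one serialized atomic cell
            bytes _data;
        public:
            explicit one(bytes b) : _data(std::move(b)) {}
        };
        class view {           // non-owning view of one atomic cell
            bytes_view _data;
        public:
            explicit view(bytes_view v) : _data(v) {}
        };
    private:
        // Raw accessors are private so they cannot be applied to
        // arbitrary bytes that may not hold an atomic cell.
        static int64_t timestamp(bytes_view);
    };

    // Either an atomic cell or a serialized collection, depending on
    // the schema; the caller picks the interpretation explicitly.
    class atomic_cell_or_collection {
        bytes _data;
    public:
        atomic_cell::view as_atomic_cell() const { return atomic_cell::view(_data); }
        bytes_view as_collection_view() const { return _data; }
    };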
Storing cells as boost::any objects makes us use the expensive
boost::any_cast to access the data. This change replaces boost::any
with a bytes object which holds the value in serialized form (the same
as will be used for the on-wire format).
If the cell type is atomic, you use the field accessors defined in the
atomic_cell class, e.g. like this:
if (column.type.is_atomic()) {
    if (atomic_cell::is_live(c)) {
        auto timestamp = atomic_cell::timestamp(c);
        ...
    }
}
Eventually we could switch to a more efficient semi-serialized form
with native byte order, but I don't want to introduce it just yet, for
simplicity.
Add database::shard_of() to compute the shard hosting the partition
(with a simplistic algorithm, but perhaps not too bad).
Convert non-metadata invoke_on_all() and local calls on the database
to use shard_of().
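For instance (details hypothetical; the real function would work on
the partition token and the Seastar shard count):

    #include <cstddef>
    #include <functional>
    #include <string>

    // Simplistic placement, as noted above: hash the partition key
    // and reduce modulo the number of shards.
    size_t shard_of(const std::string& partition_key, size_t shard_count) {
        return std::hash<std::string>{}(partition_key) % shard_count;
    }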
s/database/distributed<database>/ everywhere.
Use simple distribution rules: writes are broadcast, reads are local.
This causes tremendous data duplication, but will change soon.
With replication, we want the contents of the mutation to be available
to multiple replicas.
(In this context, we will replicate the mutation to all shards in the same
node, as a temporary step in sharding a node; but the issue also occurs
when replicating to other nodes).
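A Seastar-style sketch of the broadcast write under these temporary
rules (approximate names; a copyable mutation and a database::apply()
returning a future are assumed). The functor, and with it the
mutation, is copied to every shard, which is exactly the duplication
mentioned above and the reason a shareable representation is wanted:

    #include <seastar/core/distributed.hh>
    #include <seastar/core/future.hh>
    using namespace seastar;

    future<> apply_on_all_shards(distributed<database>& db, mutation m) {
        return db.invoke_on_all([m = std::move(m)] (database& local_db) {
            // Every shard currently stores its own copy of the data.
            return local_db.apply(m);
        });
    }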