Having a trace_state_ptr in the storage_proxy level is needed to trace code bits in this level.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
In names of functions and variables:
s/flush_/write_/
s/store_/write_/
In a i_tracing_backend_helper:
s/flush()/kick()/
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Since the timestamp is not serialized, it must always be the last
parameter of query::read_command. This patch reorders it with the
partition_limit parameters and updates callers that specified a
timestamp argument.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1468312334-10623-1-git-send-email-duarte@scylladb.com>
Instrument a coordinator of a SELECT query to send tracing session
info to the corresponding replica Nodes.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Metadata usually doesn't change after it is created; make that visible in
the code, allowing further optimizations to be applied later.
Message-Id: <1464334638-7971-3-git-send-email-avi@scylladb.com>
The gocql driver assumes that there's a result metadata section in the
PREPARED message. Technically, Scylla is not at fault here as the CQL
specification explicitly states in Section 4.2.5.4. ("Prepared") that the
section may be empty:
- <result_metadata> is defined exactly as <metadata> but correspond to the
metadata for the resultSet that execute this query will yield. Note that
<result_metadata> may be empty (have the No_metadata flag and 0 columns, See
section 4.2.5.2) and will be for any query that is not a Select. There is
in fact never a guarantee that this will non-empty so client should protect
themselves accordingly. The presence of this information is an
However, Cassandra always populates the section so lets do that as well.
Fixes#912.
Message-Id: <1456317082-31688-1-git-send-email-penberg@scylladb.com>
The is_reversed function uses a variable length array, which isn't
spec-abiding C++. Additionally, the Clang compiler doesn't allow them
with non-POD types, so this function wouldn't compile.
After reading through the function it seems that the array wasn't
necessary as the check could be calculated inline rather than
separately. This version should be more performant (since it no longer
requires the VLA lookup performance hit) while taking up less memory in
all but the smallest of edge-cases (when the clustering_key_size *
sizeof(optional<bool>) < sizeof(size_type) - sizeof(uint32_t) +
sizeof(bool).
This patch uses relation_order_unsupported it assure that the exception
order is consistent with the preivous version. The throw would
otherwise be moved into the initial for-loop.
There are two derrivations in behavior:
The first is the initial assert. It however should not change the apparent
behavior besides causing orderings() to be looked up 2x in debug
situations.
The second is the conversion of is_reversed_ from an optional to a bool.
The result is that the final return value is now well-defined to be
false in the release-condition where orderings().size() == 0, rather
than be the ill-defined *is_reversed_ that was there previously.
Signed-off-by: Erich Keane <erich.keane@verizon.net>
Message-Id: <1454546285-16076-4-git-send-email-erich.keane@verizon.net>
Fixes#549.
Being clinically absent-minded, aggregate query support (i.e. count(...))
was left out of the "paging" change set.
This adds repeated paged querying to do aggregate queries (similar to
origin). Uses "batched" paging.
We use boost::any to convert to and from database values (stored in
serlialized form) and native C++ values. boost::any captures information
about the data type (how to copy/move/delete etc.) and stores it inside
the boost::any instance. We later retrieve the real value using
boost::any_cast.
However, data_value (which has a boost::any member) already has type
information as a data_type instance. By teaching data_type intances about
the corresponding native type, we can elimiante the use of boost::any.
While boost::any is evil and eliminating it improves efficiency somewhat,
the real goal is growing native type support in data_type. We will use that
later to store native types in the cache, enabling O(log n) access to
collections, O(1) access to tuples, and more efficient large blob support.
Because of the reverse flag in partition slice rows inside bounds will
be returned in reversed order, however, we still have to make sure
that the bounds are in the expected order.
Signed-off-by: Paweł Dziepak <pdziepak@cloudius-systems.com>
In case of queries with IN on partition key, ORDER BY and LIMIT it is
not known until after post-query sort which rows should be included in
the result set. To make sure that the output is correct each partition
specified in IN clause is queried for LIMIT rows and the excess data is
trimmed after the results are sorted.
Signed-off-by: Paweł Dziepak <pdziepak@cloudius-systems.com>
When partition_slice option reversed is set the query will already
return rows in desired (i.e. reversed) order. That's not true, however,
for statements using IN restriction on partition keys. In such case
post-query ordering is still needed.
Signed-off-by: Paweł Dziepak <pdziepak@cloudius-systems.com>
When partition has no live regular rows, but has some data live in the
static row, then it should appear in the results, even though we
didn't select any static column.
To reproduce:
create table cf (k blob, c blob, v blob, s1 blob static, primary key (k, c));
update cf set s1 = 0x01 where k = 0x01;
update cf set s1 = 0x02 where k = 0x02;
select k from cf;
The "select" statement should return 2 rows, but was returning 0.
The following query worked fine, because static columns were included:
select * from cf;
The data query should contain only live data, so we shouldn't write a
partition entry if it's supposed to be absent from the results. We
can'r tell that though until we've processed all the data. To solve
this problem, query result writer is using an optimistic approach,
where the partition header will be retracted from the buffer
(cheaply), if it turns out there's no live data in it.
partitions_ranges will be manipulated upon to be split for different
destination, so provide it separately from read_command to not copy the
later for each destination.
To prepare a user-defined type, we need to look up its name in the keyspace.
While we get the keyspace name as an argument to prepare(), it is useless
without the database instance.
Fix the problem by passing a database reference along with the keyspace.
This precolates through the class structure, so most cql3 raw types end up
receiving this treatment.
Origin gets along without it by using a singleton. We can't do this due
to sharding (we could use a thread-local instance, but that's ugly too).
Hopefully the transition to a visitor will clean this up.
This gives about 30% increase in tps in:
build/release/tests/perf/perf_simple_query -c1 --query-single-key
This patch switches query result format from a structured one to a
serialized one. The problems with structured format are:
- high level of indirection (vector of vectors of vectors of blobs), which
is not CPU cache friendly
- high allocation rate due to fine-grained object structure
On replica side, the query results are probably going to be serialized
in the transport layer anyway, so this change only subtracts
work. There is no processing of the query results on replica other
than concatenation in case of range queries. If query results are
collected in serialized form from different cores, we can concatenate
them without copying by simply appending the fragments into the
packet. This optimization is not implemented yet.
On coordinator side, the query results would have to be parsed from
the transport layer buffers anyway, so this also doesn't add work, but
again saves allocations and copying. The CQL server doesn't need
complex data structures to process the results, it just goes over it
linearly consuming it. This patch provides views, iterators and
visitors for consuming query results in serialized form. Currently the
iterators assume that the buffer is contiguous but we could easily
relax this in future so that we can avoid linearization of data
received from seastar sockets.
The coordinator side could be optimized even further for CQL queries
which do not need processing (eg. select * from cf where ...) we
could make the replica send the query results in the format which is
expected by the CQL binary protocol client. So in the typical case the
coordinator would just pass the data using zero-copy to the client,
prepending a header.
We do need structure for prefetched rows (needed by list
manipulations), and this change adds query result post-processing
which converts serialized query result into a structured one, tailored
particularly for prefetched rows needs.
This change also introduces partition_slice options. In some queries
(maybe even in typical ones), we don't need to send partition or
clustering keys back to the client, because they are already specified
in the query request, and not queried for. The query results hold now
keys as optional elements. Also, meta-data like cell timestamp and
ttl is now also optional. It is only needed if the query has
writetime() or ttl() functions in it, which it typically won't have.