"This patchset contains fixes for the changes introduced in "Query result
size limiting". It also improves handling of short data reads.
I order to minimise chances of digest mismatch during data queries replicas
that were asked just to return a digest also keep track of the size of the
data (in the IDL representation) so that they would stop at the same point
nodes doing full data queries would. Moreover, data queries are not
affected by per-shard memory limit and the coordinator sends individual
result size limits to replicas in order not to depend on hardcoded values.
It is still possible to get digest mismatches if the IDL changes (e.g. a
new field is added), but, hopefully, that won't be a serious problem."
* 'pdziepak/short-read-fixes/v4' of github.com:cloudius-systems/seastar-dev:
query: introduce result_memory_accounter::foreign_state
storage_proxy: fix short reads in parallel range queries
storage_proxy: pass maximum result size to replicas
mutation_partition: use result limiter for digest reads
query: make result_memory_limiter constants available for linker
result_memory_limiter: add accounter for digest reads
idl: allow writers to use any output stream
result_memory_limiter: split new_read() to new_{data, mutation}_read()
idl: is_short_read() was added in 1.6
mutation_partition: honour allowed_short_read for static rows
storage_proxy: fix _is_short_read computation
storage_proxy: disallow short reads if got no live rows
storage_proxy: don't stop after result with no live rows
(cherry picked from commit 868b4d110c)
"This patchset ensures the partition limit is enforced at
the storage_proxy level. Uppers layers like the pager may
already be depending on this behavior."
* 'enforce-row-limit/v3' of https://github.com/duarten/scylla:
query_pagers: Don't trim returned rows
select_statement: Don't always trim result set
query_result_merger: Limit rows
mutation_query: to_data_query_result enforces row limit
(cherry picked from commit 3421ebe8be)
"This patchset ensures the partition limit is enforced at
the storage_proxy level. To achieve this, we add the partition
count to query::result, and allow the result_merger to trim
excess partitions."
* 'enforce-partition-limit/v3' of https://github.com/duarten/scylla:
storage_proxy: Decrease limits when retrying command
storage_proxy: Don't fetch superfluous partitions
query::result: Add partition count
column_family: Use counters in query::result::builder
query_result_builder: Use the underlying counters
mutation_partition: Count partitions in query_compacted
mutation_partition: Remove tabs in query_compacted
query::result::builder: Add partition count
query_result_merger: Limit partitions
(cherry picked from commit 6bb875bdb7)
This patch introduces an infrastrucutre for limiting result size.
There is a shard-local limit which makes sure that all results combined
do not use more than 10% of the shard memory.
There is also an invidual limit which restricts a result to 4 MB.
In order
In order to avoid sending tiny results there is minimum guaranteed size
(4 kB), which the query needs to reserve before it starts producing the
result.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
When paging is used the cluster is allowed to return less rows than the
client asked for. However, if such possibility is used we need a way of
telling that to the coordinator and the paging implementation so that
they can differentiate between short reads caused by the replica running
out of data to sent and short reads caused by any other means.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
That will be important for sstable code that will rule out a sstable
if it doesn't cover a given clustering key range.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
query::full_slice is a partiton slice which has full clustering row
ranges for all partition keys and no per-partition row limit.
Options and columns are not set.
It is used as a helper object in cases when a reference to
partition_slice is needed but the user code needs just all data there is
(an example of such case would be sstable compaction).
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
This patch as a per-partition row limit. It ensures both local
queries and the reconciliation logic abide by this limit.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
The patch calculates row count during result building and while merging.
If one of results that are being merged does not have row count the
merged result will not have one either.
Amnon reports that current code fails to compile on gcc 4.9:
distcc[9700] ERROR: compile /home/amnon/.ccache/tmp/query.tmp.localhost.localdomain.9673.ii on localhost failed
In file included from query.cc:30:0:
query-result-reader.hh: In instantiation of ‘void query::result_view::consume(const query::partition_slice&, ResultVisitor&&) [with ResultVisitor = query::result::calculate_row_count(const query::partition_slice&)::<anonymous struct>&]’:
query.cc:196:32: required from here
query-result-reader.hh:184:21: error: cannot pass objects of non-trivially-copyable type ‘class clustering_key_prefix’ through ‘...’
visitor.accept_new_row(*row.key(), static_row, view);
^
query-result-reader.hh:184:21: error: cannot pass objects of non-trivially-copyable type ‘class query::result_row_view’ through ‘...’
query-result-reader.hh:184:21: error: cannot pass objects of non-trivially-copyable type ‘class query::result_row_view’ through ‘...’
query-result-reader.hh:186:21: error: cannot pass objects of non-trivially-copyable type ‘class query::result_row_view’ through ‘...’
visitor.accept_new_row(static_row, view);
^
query-result-reader.hh:186:21: error: cannot pass objects of non-trivially-copyable type ‘class query::result_row_view’ through ‘...’
Work around the problem by not using '...'.
Message-Id: <1460964042-2867-1-git-send-email-tgrabiec@scylladb.com>
query::result transformation to printable form is very heavy operation
that allocates memory and thus can fail. Add a class to query::result that
can be used with logger to push to string conversion when output is
performed.
The query result footprint for cassandra-stress mutation as reported
by tests/memory-footprint increased by 18% from 285 B to 337 B.
perf_simple_query shows slight regression in throughput (-8%):
build/release/tests/perf/perf_simple_query -c4 -m1G --partitions 100000
Before: ~433k tps
After: ~400k tps
Serializer requires class to be defined, so it has to be in .h file. It
also does not support nested types yet, so move it outside of containing
class.
Refs #752
Paged aggregate queries will re-use the partition_slice object,
thus when setting a specific ck range for "last pk", we will hit
an exception case.
Allow removing entries (actually only the one), and overwriting
(using schema equality for keys), so we maintain the interface
while allowing the pager code to re-set the ck range for previous
page pass.
[tgrabiec: commit log cleanup, fixed issue ref]
Message-Id: <1452616259-23751-1-git-send-email-calle@scylladb.com>
A cut-and-paste accident in query::to_partition_range caused the wrong
end's inclusiveness to be tested.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
That's what we're trying to standardize on.
This patch also fixes an issue with current query::result::serialize()
not being const-qualified, because it modifies the
buffer. messaging_service did a const cast to work this around, which
is not safe.
Allows for having more than one clustering row range set, depending on
PK queried (although right now limited to one - which happens to be exactly
the number of mutiplexing paging needs... What a coincidence...)
Encapsulates the row_ranges member in a query function, and if needed holds
ranges outside the default one in an extra object.
Query result::builder::add_partition now fetches the correct row range for
the partition, and this is the range used in subsequent iteration.
It should be moved to i_partitioner.hh, but to do that range<> has to
be first moved out of query-request.hh to break cyclic dependency.
I didn't want to cause conflicts with in-flight patches to range<>.
Mutation query differs from data query in that returns information
needed to reconcile data slice with that retruned by other data
sources.
There is a generic mutation_query() algorithm introduced, which can
work with any mutation_source.
database::query_mutations() is a shard-local interface for mutation
queries.
The reconcilable_result is introduced as a medium for mutation query
results. It piggy backs on frozen_mutation as a medium for
reconcilable data.
This change abstracts reading from on-disk data sources behind a single
reader which is then composed with memtable readers. This change also
abstracts all data sources behind a single reader obtained via
column_family::make_reader(). That reader is then used by algorithms
like column_family::for_all_partitions() or
column_family::query(). Having those abstractions will make it easier
to add row cache, because it will be encapsulated in a single place.
Current model was not really correct because Origin doesn't support
querying of partition ranges by their value. We can query slices
according to dht::decorated_key ordering, which orders partitions
first by token then by key value.
ring_position encapsulates range constraint. Key value is optional, in
which case only token is constrained.
partitions_ranges will be manipulated upon to be split for different
destination, so provide it separately from read_command to not copy the
later for each destination.
This gives about 30% increase in tps in:
build/release/tests/perf/perf_simple_query -c1 --query-single-key
This patch switches query result format from a structured one to a
serialized one. The problems with structured format are:
- high level of indirection (vector of vectors of vectors of blobs), which
is not CPU cache friendly
- high allocation rate due to fine-grained object structure
On replica side, the query results are probably going to be serialized
in the transport layer anyway, so this change only subtracts
work. There is no processing of the query results on replica other
than concatenation in case of range queries. If query results are
collected in serialized form from different cores, we can concatenate
them without copying by simply appending the fragments into the
packet. This optimization is not implemented yet.
On coordinator side, the query results would have to be parsed from
the transport layer buffers anyway, so this also doesn't add work, but
again saves allocations and copying. The CQL server doesn't need
complex data structures to process the results, it just goes over it
linearly consuming it. This patch provides views, iterators and
visitors for consuming query results in serialized form. Currently the
iterators assume that the buffer is contiguous but we could easily
relax this in future so that we can avoid linearization of data
received from seastar sockets.
The coordinator side could be optimized even further for CQL queries
which do not need processing (eg. select * from cf where ...) we
could make the replica send the query results in the format which is
expected by the CQL binary protocol client. So in the typical case the
coordinator would just pass the data using zero-copy to the client,
prepending a header.
We do need structure for prefetched rows (needed by list
manipulations), and this change adds query result post-processing
which converts serialized query result into a structured one, tailored
particularly for prefetched rows needs.
This change also introduces partition_slice options. In some queries
(maybe even in typical ones), we don't need to send partition or
clustering keys back to the client, because they are already specified
in the query request, and not queried for. The query results hold now
keys as optional elements. Also, meta-data like cell timestamp and
ttl is now also optional. It is only needed if the query has
writetime() or ttl() functions in it, which it typically won't have.