"nonwrapping_range<ring_position> and nonwrapping_range<token> are used
in many places. Let's make an alias for them to make it less verbose.
Also there is a query::partition_range in query-request.hh which is the alias of
nonwrapping_range<ring_position>. query::partition_range is used in
places not related to query at all. Let's unify the usage project wide."
* tag 'asias/repair_dht_token_range/v2' of github.com:cloudius-systems/seastar-dev:
Convert to use dht::partition_range_vector and dht::token_range_vector
dht: Introduce dht::partition_range_vector and dht::token_range_vector
Get rid of query::partition_range
Convert to use dht::partition_range
Convert to use dht::token_range
dht: Rename token_range to token_range_endpoints
dht: Introduce dht::token_range an dht::partition_range
std::vector<dht::partition_range> and std::vector<dht::token_range> are
used in a lot of places, introduce dht::partition_range_vector and
dht::token_range_vector as the alias.
nonwrapping_range<ring_position> and nonwrapping_range<token> are used
in many places. Let's make an alias for them to make it less verbose.
Also there is a query::partition_range in query-request.hh which is the alias of
nonwrapping_range<ring_position>. query::partition_range is used in
places not related to query at all. Let's unify the usage project wide.
"In 7c873f0d (repair: Reduce unnecessary streaming traffic), we optimize
in cases when 1) all the remote nodes has the same checksum and 2) local node
has zero checksum.
In this series, we make the optimization more generec and cover more cases."
* tag 'asias/repair/node_reducer/v3' of github.com:cloudius-systems/seastar-dev:
repair: Reduce unnecessary streaming traffic even more
repair: Add hash specialization for partition_checksum
"This patchset ensures the partition limit is enforced at
the storage_proxy level. Uppers layers like the pager may
already be depending on this behavior."
* 'enforce-row-limit/v3' of https://github.com/duarten/scylla:
query_pagers: Don't trim returned rows
select_statement: Don't always trim result set
query_result_merger: Limit rows
mutation_query: to_data_query_result enforces row limit
"This patchset ensures the partition limit is enforced at
the storage_proxy level. To achieve this, we add the partition
count to query::result, and allow the result_merger to trim
excess partitions."
* 'enforce-partition-limit/v3' of https://github.com/duarten/scylla:
storage_proxy: Decrease limits when retrying command
storage_proxy: Don't fetch superfluous partitions
query::result: Add partition count
column_family: Use counters in query::result::builder
query_result_builder: Use the underlying counters
mutation_partition: Count partitions in query_compacted
mutation_partition: Remove tabs in query_compacted
query::result::builder: Add partition count
query_result_merger: Limit partitions
A case could be made that we should have counters for them no matter
what, since it can help us reason about the distribution of memory among
the groups. But with the hierarchy being broken in 1.5 it becomes even
more important. Now by looking solely at dirty, we will have no idea
about how much memory we are using in those groups.
After this patch, the dirty_memory_manager will register its metrics
for the 3 groups that we have, and the legacy names will be used to show
totals.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <0d04ca4c7e8472097f16a5dc950b77c73766049e.1481831644.git.glauber@scylladb.com>
"Original, naive db::make_streaming_reader() implementation created a set
of memtable and sstable readers for every partition range. This caused
bad interaction with the code limiting sstable readers concurrency and
was suboptimal.
This series introduces multi range mutation reader that takes mutation
source and a sorted, disjoint vector of ranges. It creates only a single
set of memtable and sstable readers and fast forwards it to the next
range once the current one is completed."
* 'pdziepak/multi-range-reader/v1' of github.com:cloudius-systems/seastar-dev:
db: use multi range reader for streaming readers
dht: describe split_range[s]_to_shards() guarantees
repair: remove outdated fixme
test/mutation_reader_test: add multi_range_reader test
tests/mutation_reader: extract key creation code
mutation_reader: add multi_range_reader
A naive approach was to create a set of readers for each range and pass
them all to combining reader. This however performed badly if the number
of ranges was high.
The solution is to use multi range reader which uses only a single set
of readers and fast forwards from range to range when necessary. This
adds another requirement that the ranges passed to
make_streaming_reader() are sorted and disjoint.
We are going to require these functions to return sorted and disjoint
ranges. They already do so (provided that the input ranges are sorted
and disjoint), but if the guarantee is not explicitly stated it may
disappear some day.
So far, the only way to combine outputs of multiple readers was to use
combining reader. It is very general and, in particular, supports case
when the readers emit mutations from overlapping ranges.
However, we have cases (e.g. streaming) when we need to read from
several disjoint ranges. Combining reader is a suboptimal solution as it
requires to creating a reader for each range and ignores the fact that
they do not overlap.
This patch introduces multi_range_mutation_reader which takes a
mutation_source and a sorted set of disjoint ranges. Internally, it uses
mutation_reader::fast_forward_to() to move to the next range once the
current one is completed.
Since storage_proxy::query() now respects the read_command limits, we
can remove the trimming logic from query_pagers.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Trimming the result set is only needed when the query contains an "IN"
relation, an ORDER BY clause, and defines a limit, which is the case
where we query different ranges concurrently. We don't use the
result_merger to trim since we first need to reorder the rows.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch makes the row limit enforced by the storage_proxy layer.
It adds a row limit to the query_result_merger, useful when merging
results for concurrent queries.
More importantly, it provides guarantees that upper layers may be
relying on implicitly (e.g., the paging code).
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch changes mutation_query::to_data_query_result() so that it
enforces the row limit alongside the partition limit and the
per-partition limit.
In the following patch, we'll enforce the row limit in an upper layer,
but this lets us optimize the case where only when replica replies.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch changes a read_command's limits when retrying it, so that
we don't ask for more rows than necessary.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch ensures we keep track of how many partitions we've queried
so we don't ask for more than the number we need.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds a partition count to query::result, filled by the
query::result::builder. The partition count is present whenever the
result carries data, being absent only for the case where the result
contains only a digest.
We also ensure that counts are present for an empty query::result.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch changes column_family::query() to use the counters in the
builder to determine how many partitions and rows to ask for and also
to implement the stop condition. This saves a continuation to do the
bookkeeping, and allows us to remove data_query_result.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch changes the query_result_builder to use the counters
provided by the query::result::builder. It also ensures they are kept
current.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch changes mutation_partition::query_compacted() to count the
number of partitions written to the underlying writer.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds a partition count to the query::result::builder. It is
intended to be incremented by users, and later used to build a
query::result.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds a partition limit to the query_result_merger, useful
when merging results for concurrent queries. This change also makes
the partition limit enforced by the storage_proxy layer, no changes
being needed by the upper layers, namely the Thrift interface.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
We saw the message twice for the same feature check. This is a bit
confusing.
INFO 2016-12-15 11:26:23,993 [shard 0] gossip - Checking if need_features {RANGE_TOMBSTONES} in features {}
INFO 2016-12-15 11:26:23,993 [shard 0] gossip - Checking if need_features {RANGE_TOMBSTONES} in features {}
INFO 2016-12-15 11:26:23,993 [shard 0] gossip - Checking if need_features {LARGE_PARTITIONS} in features {}
INFO 2016-12-15 11:26:23,993 [shard 0] gossip - Checking if need_features {LARGE_PARTITIONS} in features {}
This is because
ss._range_tombstones_feature = gms::feature(RANGE_TOMBSTONES_FEATURE);
ss._large_partitions_feature = gms::feature(LARGE_PARTITIONS_FEATURE);
The first message is printed when gms::feature(RANGE_TOMBSTONES_FEATURE)
is constructed. The second message is printed when the
ss._range_tombstones_feature is copy-constructed.
Add the helper function to enable the a feature and log the feature is
enabled.
When a feature is enabled, we see
INFO 2016-12-15 11:29:32,443 [shard 0] gossip - Feature LARGE_PARTITIONS is enabled
INFO 2016-12-15 11:29:32,443 [shard 0] gossip - Feature RANGE_TOMBSTONES is enabled
in the log.
"This series makes Scylla limit size of query results it produces in case they
grow unreasonably large. This is possible because CQL paging queries do not
guarantee that the returned page is going to have page_size rows and pages
smaller than tha *do not* indicate end of stream. Non-paged queries and Thrift
requests do not have such flexibility and they also get all the requested data
(though their memory usage is still accounted for and may limit paged queries).
There is a maximum result size (1 MB) and all results builders will stop after
reaching it. Moreover, there is a per-shard limitation on the amount of memory
used by all results combined (10%). To avoid tiny results a query has to
reserve (wait if necessary) 4 kB before starting executing, after that it can
consume more memory without any additional waiting provided it is below
individual and shard-local limits.
Enabling the cluster to return less rows than requested also means some changes
for the coordinator. Firstly, if it receives such short result from a replica
retrying it with a larger limit obviously makes no sense whatsoever. Instead,
in such cases the coordinator removes the clustering rows it has incomplate
information about and sends short result back to the client. Moreover, even
if no replica returned short response reconciliation may have made it so. In
this case, the coordinator do not necessairly need to retry the query as well.
Unfortunately, with the current implementation short responses ruin data
queries since they will cause a digest mismatch.
Three new metrics were added:
* database_bytes_total_result_memory -- total memory used by query results
* database_total_operations_short_data_queries -- data queries that were
limited by size, particulary bad as it basically forces coordinator to
retry them as mutation queries
* database_total_operations_short_mutation_queries -- mutation queries limited
by size"
* 'pdziepak/short-paged-reads/v4' of github.com:cloudius-systems/seastar-dev:
storage_proxy: clean up after primary_key introduction
cql3: allow short reads with paged queries
storage_proxy: handle intentional short reads
storage_proxy: make sure coordinator has complete data
storage_proxy: honour partition limit
storage_proxy: use cmd limits to determine that replica reached end
db: add metrics for short reads and memory used for results
data_query: limit result size
mutation_query: limit result size
db: create result_memory_accounters when starting query
query_builder: add partition_slice getter
reconcilable_result: keep result_memory_tracker object
mutation_compactor: honour stop_iteration from consumers
db: add result_memory_limiter
query: add result size limiter
reconcilable_result: properly propagate short_read flag
query_pagers: handle short reads properly
query: allow short reads
serializer_impl: add serializer for bool_class<Tag>
primary_key was introduced as a replacement for
std::pair<dht::decorated_key, std::optional<clustering_key>>. In order
to simplify patch introducing its fields were named 'first' and
'second'. This patch changes the names to something less useless,
removes old row_address alias and removes is_missing_rows() in favour of
primary_key::less_compare_clustering comparator.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
If the result is going to be too large the replica may decide to make it
shorter and coordinator should handle this properly (i.e. do not retry).
Moreover, coordinator could avoid some retries by setting the short_read
flag itself.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
got_incomplete_information() ensures that the coordinator has received
all required data from all replicas.
(see 77dbe3c12f "storage_proxy: fix
reconciliation with limits" for the examples when that may not be the
case).
However, this function is called only if reconciled result has at least
as much rows as the user asked for. This was correct when we had only
total row limit: if the result was shorter than that either all replicas
sent all data they have or the coordinator will retry anyway. However,
since then we got partition limit and per partition row limit and a
request may be limited by one of these while being still below the total
row limit.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
At the moment the coordinator does not care much for the partition
limit. In particular it doesn't check whether after reconciliation the
result still contains enough partitions.
This patch makes it honour the partition limit and increase it in the
retried queries if necessary.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Coordinator may retry a query with larger limits. However, code
determining whether replica has no more data always used the original
limits. This may cause a livelock.
For example, consider cluster having the following partitions (deletions
cover live cells):
node1:
pk=0, v=0
pk=1, v=1
node2
delete pk=0
delete pk=1
pk=2, v=2
pk=3, v=3
Now, if there is a query SELECT * FROM cf LIMIT 2 the first node is
going to send partitions 0 and 1 while second node is going to send 2
and 3 + tombstones for 0 and 1. The coordinator will decide that it
needs to retry the request with larger row limit since node1 may have
some information about partitions 2 and 3 that are newer than what node2
has sent.
However, when the second response arrives node1 will still sent only two
rows since it has no more data. Because the coordinator uses original
row limit it will not notice that this node reached the end and we are
going to get another retry without making any progress.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
This pach ensures than when we start executing a query a minimum result
size is reserved from result_memory_limiter.
Moreover, range queries need a way of merging memory usage information
from different shards.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>