Commit Graph

378 Commits

Author SHA1 Message Date
Gleb Natapov
b4c368a6bc storage_proxy: update correct statistics on range reads
Fixes #2167

Message-Id: <20170405094119.GM8197@scylladb.com>
2017-04-09 18:16:06 +03:00
Avi Kivity
27c42359bc Merge seastar upstream
* seastar 6b21197...2ebe842 (6):
  > Merge "Various improvements to execution stages" from Paweł
  > app-template: allow apps to specify a name for help message
  > bool_class: avoid initializing object of incomplete type
  > app-template: make sure we can still get help with required options
  > prometheus: Http handler that returns prometheus 0.4 protobuf or text format
  > Update DPDK to 17.02

Includes patch from Pawel to adjust to updated execution_stage interface.
2017-03-26 10:50:21 +03:00
Amnon Heiman
295a981c61 storage_proxy: metrics should have unique name
Metrics should have their unique name. This patch changes
throttled_writes of the queu lenght to current_throttled_writes.

Without it, metrics will be reported twice under the same name, which
may cause errors in the prometheus server.

This could be related to scylladb/seastar#250

Fixes #2163.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20170314081456.6392-1-amnon@scylladb.com>
2017-03-14 11:19:39 +02:00
Paweł Dziepak
cfde2ad5b4 storage_proxy: make mutate() an execution stage 2017-03-09 09:27:43 +00:00
Paweł Dziepak
00b42c477f storage_proxy: count counter updates for which the node was a leader 2017-03-02 09:05:12 +00:00
Paweł Dziepak
cf193f4b41 storage_proxy: use counter-specific timeout for writes 2017-03-02 09:05:12 +00:00
Paweł Dziepak
d177160f90 storage_proxy: transform counter timeouts to mutation_write_timeout_exception 2017-03-02 09:05:12 +00:00
Paweł Dziepak
774241648d db: add more tracing events for counter writes 2017-03-02 09:05:10 +00:00
Paweł Dziepak
277501f42f db: propagate tracing state for counter writes 2017-03-02 09:05:10 +00:00
Paweł Dziepak
25173f8095 db: propagate timeout for counter writes 2017-03-02 09:05:10 +00:00
Paweł Dziepak
426345e1d4 storage_proxy: avoid excessive mutation freezes 2017-03-01 16:33:36 +00:00
Paweł Dziepak
f10eb952d0 coordinator: do not apply counter write twice on leader 2017-03-01 16:33:36 +00:00
Calle Wilund
0a4edca756 counters/cql: allow wormholing actual counter values (with shards) via cql
Adds yet another magic function "SCYLLA_COUNTER_SHARD_LIST", indicating that
argument value, which must be a list of tuples <int, UUID, long, long>,
should be inserted as an actual counter value, not update.

This of course to allow counters to be read from sstable loader.

Note that we also need to allow timestamps for counter mutations,
as well as convince the counter code itself to treat the data as
already baked. So ugly wormhole galore.

v2:
* Changed flag names
* More explicit wormholing, bypassing normal counter path, to
  avoid read-before-write etc
* throw exceptions on unhandled shard types in marshalling
v3:
* Added counter id ordering check
* Added batch statement check for mixing normal and raw counter updates
Message-Id: <1487683665-23426-2-git-send-email-calle@scylladb.com>
2017-02-22 09:19:46 +00:00
Gleb Natapov
bb72425b61 storage_proxy: fix send_to_endpoint() to use correct create_write_response_handler() overload
There are several problems with storage_proxy::send_to_endpoint right
now. It uses create_write_response_handler() overload that is specific
to read repair which is suboptimal and creates incorrect logs, it does
not process errors and it does not hold storage_proxy object until write
is complete. The patch fixes all of the problems.

Message-Id: <20170208101949.GA19474@scylladb.com>
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
2017-02-12 10:46:13 +02:00
Avi Kivity
9530bac2d6 Merge "Adding metrics using histogram and labels" from Amnon
"This series uses the newly added histogram and label support to add metrics to
the storage_proxy and to the column_family.

This would add latency and histogram and the missing metrics from column family."

* 'amnon/histogram_metrics' of github.com:cloudius-systems/seastar-dev:
  database: add metrics registration for the coloumn family
  storage_proxy: add read and write latency histogram
  estimated_histogram: returns a metrics histogram
2017-02-09 11:40:57 +02:00
Amnon Heiman
2cf13c26e2 storage_proxy: add read and write latency histogram
Register the read and write latency histogram on the metrics layer.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2017-02-06 17:54:47 +02:00
Nadav Har'El
f2fd81ece0 materialized views: function to send a mutation to endpoint
Add a function for sending one mutation to one remote replica owning
this mutation. This is needed for materialized views, where each
base replica sends each view mutation to one particular view replica.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2017-02-06 13:36:45 +01:00
Gleb Natapov
3c372525ed storage_proxy: use storage_proxy clock instead of explicit lowres_clock
Merge commit 45b6070832 used butchered version of storage_proxy
patch to adjust to rpc timer change instead the one I've sent. This
patch fixes the differences.

Message-Id: <20170206095237.GA7691@scylladb.com>
2017-02-06 12:51:36 +02:00
Paweł Dziepak
1e8814f5ce storage_proxy: support counter updates 2017-02-02 10:35:14 +00:00
Paweł Dziepak
c14c6b753b storage_proxy: add get_live_endpoints() 2017-02-02 10:35:14 +00:00
Amnon Heiman
45b6070832 Merge seastar upstream
* seastar 397685c...c1dbd89 (13):
  > lowres_clock: drop cache-line alignment for _timer
  > net/packet: add missing include
  > Merge "Adding histogram and description support" from Amnon
  > reactor: Fix the error: cannot bind 'std::unique_ptr' lvalue to 'std::unique_ptr&&'
  > Set the option '--server' of tests/tcp_sctp_client to be required
  > core/memory: Remove superfluous assignment
  > core/memory: Remove dead code
  > core/reactor: Use logger instead of cerr
  > fix inverted logic in overprovision parameter
  > rpc: fix timeout checking condition
  > rpc: use lowres_clock instead of high resolution one
  > semaphore: make semaphore's clock configurable
  > rpc: detect timedout outgoing packets earlier

Includes treewide change to accomodate rpc changing its timeout clock
to lowres_clock.

Includes fixup from Amnon:

collectd api should use the metrics getters

As part of a preperation of the change in the metrics layer, this change
the way the collectd api uses the metrics value to use the getters
instead of calling the member directly.

This will be important when the internal implementation will changed
from union to variant.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1485457657-17634-1-git-send-email-amnon@scylladb.com>
2017-02-01 14:39:08 +02:00
Gleb Natapov
6e4817137e storage_proxy: report foreground reads instead of reads
The reason is the same as why foreground writes are reported instead of
total writes (049ae37d08): It is much easier to see what is going on
this way.

Also fixes a typo in a counter's description.

Fixes #1217

Message-Id: <20170129093412.GS11469@scylladb.com>
2017-01-29 12:40:56 +02:00
Gleb Natapov
64660397fc storage_proxy: move operation type information from counter's name to a label
Makes it much more flexible to view the data in various ways in Graphana.

Message-Id: <20170126102746.GL11469@scylladb.com>
2017-01-26 12:38:29 +02:00
Gleb Natapov
ccee01f352 storage_proxy: put datacenter name into a label instead of counter's name
Having datacenter name as a label makes it possible to create Prometheus board for the counters.

Message-Id: <20170124132051.GX11469@scylladb.com>
2017-01-24 15:27:34 +02:00
Amnon Heiman
e19fa02a17 remove scollectd from headers
As the metrics migration progressed, some include to scollectd.hh left
behind.

Because of the nature of the scollecd implementation those include
brings alot of code with them to the header files and eventually to many
source file.

This patch remove those include and add a missing include to
storage_proxy.cc.

The reason the compiler didn't complain is an indication to the
problematic nature of those include in the first place.

Before this patch, change in metrics.hh would cause 169 files to
compile, after this change 17.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1484667536-2185-1-git-send-email-amnon@scylladb.com>
2017-01-17 17:39:47 +02:00
Tomasz Grabiec
3c3a4358ae storage_proxy: Fix capturing of on-stack variable by reference
partition_range_count was accepted by do_with callback by value and
then captured by reference by async code, thus invoking use after
destroy.

Message-Id: <1484317846-14485-1-git-send-email-tgrabiec@scylladb.com>
2017-01-16 11:49:11 +02:00
Tomasz Grabiec
66547e7d7c storage_proxy: Add missing initialization of _short_read_allowed
Dropped by a1cafed370 ("storage_proxy:
handle range scans of sparsely populated tables").

Fixes the failure in update_cluster_layout_tests.TestUpdateClusterLayout test.

Message-Id: <1484317450-13525-1-git-send-email-tgrabiec@scylladb.com>
2017-01-13 16:47:54 +02:00
Tomasz Grabiec
1e8151b4f2 storage_proxy: Fix use-after-free on one_or_two_partition_ranges
query_mutations_locally() takes one_or_two_partition_ranges by
reference and requires, indirectly, that it is kept alive until
operation resolves. However, we were passing expiring value to it, the
result of unwrap().

Fixes dtest failure in consistent_bootstrap_test.py:TestBootstrapConsistency.consistent_reads_after_bootstrap_test

Another potential problem was that we were dereferencing "s" in the same
expression which move-constructs an argument out of it.

Message-Id: <1484222759-4967-1-git-send-email-tgrabiec@scylladb.com>
2017-01-12 15:10:51 +02:00
Gleb Natapov
76aed548e3 storage_proxy: add replica side counters for data read
Message-Id: <20170112085907.GN11469@scylladb.com>
2017-01-12 11:41:04 +02:00
Avi Kivity
8f36dca6f1 storage_proxy: prevent short read due to buffer size limit from being swallowed during range scan
mutation_result_merger::get() assumes that the merged result may be a
short read if at least one of the partial results is a short read (in
other words, if none of the partial results is a short read, then the
merged result is also not a short read). However this is not true;
because we update the memory accounter incrementally, we may stop
scanning early. All the partial results are full; but we did not scan
the entire range.

Fix by changing the short_read variable initialization from `no`
(which assumes we'll encounter a short read indication when processing
one of the batches) to `this->short_read()`, which also takes into
account the memory accounter.

Fixes #2001.
Message-Id: <20170108111315.17877-1-avi@scylladb.com>
2017-01-09 09:21:43 +00:00
Avi Kivity
eb520e7352 storage_proxy: fix result ordering for parallel partition range scans
During a range scan, we try to avoid sorting according to partition range
when we can do so.  This is when we scan fewer than smp::count shards --
each shard's range is strictly ordered with respect to the others.

However, we use the wrong key for the sort -- we use the shard number.  But
if we started at shard s > 0 and wrapped around to shard 0, then shard 0's
range will be after the range belonging to shard s, but will sort before it.

Fix by storing the iteration order as the sort key.  We use that when we
know that shards do not overlap (shards < smp::count) and the index within
the source partition range vector when they do.

Fixes #1998.
Message-Id: <20170105114253.17492-1-avi@scylladb.com>
2017-01-05 12:51:37 +01:00
Gleb Natapov
4ca58959ad storage_proxy: do not deref unengaged stdx:optional
Fixes intentional short reads.

Message-Id: <20161227142133.GE1829@scylladb.com>
2016-12-27 16:30:03 +02:00
Paweł Dziepak
e6d27ac529 query: introduce result_memory_accounter::foreign_state
Range queries used to be performed sequentially and the shard performing
part of the read was reading state of the merger's memory accounter
directly. Now, they may be performed in parallel so it is safer to just
pass relevant data by value to the intersted shards so that they are not
reading something that another shard is modyfing at the same time.

Since query is done in parallel there is a chance of overread. However,
the parallelism is high only in sparsely populated tables and that's
when the overread is less serious problem.
2016-12-22 17:16:24 +01:00
Paweł Dziepak
49d675223e storage_proxy: fix short reads in parallel range queries
Since a1cafed370 "storage_proxy: handle
range scans of sparsely populated tables" nonsingular range queries may
be performed in parallel on multiple shards. The consequence of this
that result may be added to the merger out of order. This requires more
complex logic for handling short reads.

As soon as mutation_result_merger gets a short read it starts to discard
all subsequently received results that are known to contain partitions
with larger keys.
Then when the final result is being prepared the merger may need to
combine and sorts results which ordering is not known. If at least one
of these results is a short one all partitions with larger keys are
removed.

Due to request being performed in parallel it is possible that even
though there was a short read the merger has got enough live data to
satisfy specified limits. If this has happened the short read flag is
not set on the final result.
2016-12-22 17:16:24 +01:00
Paweł Dziepak
1a52569f7d storage_proxy: pass maximum result size to replicas
We may want to change the default individual result size limit in the
future. If it is provided by the coordinator and not hardcoded in the
replicas this can be done without causing data query digest mismatches
or wasteful mutation query results.
2016-12-22 17:16:23 +01:00
Paweł Dziepak
aa083d3d85 result_memory_limiter: split new_read() to new_{data, mutation}_read()
For data queries it is very important that all replicas get limited in
the same place (this includes replicas returning only digest). That's
why they shouldn't be affected by per-shard result memory limit.
Moreover, we should make sure that individual memory limits are the
same, making the coordinator provide it for replicas which allow to
safely change it in the future.

Mutation queries are not as sensitive but it is still beneficial to make
sure that all replicas use the same individual limit.
2016-12-22 13:35:04 +01:00
Paweł Dziepak
a7a454c388 storage_proxy: fix _is_short_read computation 2016-12-22 13:35:04 +01:00
Paweł Dziepak
8c1e4a707c storage_proxy: disallow short reads if got no live rows
If after reconciliation the coordinator ends up with no live rows and
short reads are allowed a retry may not make any progress if replicas
end their reads in the same place. The solution is to disallow short
reads on retries which are caused by final result having no live rows.
2016-12-22 13:35:04 +01:00
Paweł Dziepak
6db262446f storage_proxy: don't stop after result with no live rows
mutation_result_merger merges results from different shards and stops as
soon as a shard returned a short read or memory usage on the merging
shard is too high. However, it should never stop unless at least one
live rows is in the merged result.
2016-12-22 13:35:04 +01:00
Avi Kivity
a1cafed370 storage_proxy: handle range scans of sparsely populated tables
When murmur3_partitioner_ignore_msb_bits = 12 (which we'd like to be the
default), a scan range can be split into a large number of subranges, each
going to a separate shard.  With the current implementation, subranges were
queried sequentially, resulting in very long latency when the table was empty
or nearly empty.

Switch to an exponential retry mechanism, where the number of subranges
queried doubles each time, dropping the latency from O(number of subranges)
to O(log(number of subranges)).

If, during an iteration of a retry, we read at most one range
from each shard, then partial results are merged by concatentation.  This
optimizes for the dense(r) case, where few partial results are required.

If, during an iteration of a retry, we need more than one range per
shard, then we collapse all of a shard's ranges into just one range,
and merge partial results by sorting decorated keys.  This reduces
the number of sstable read creations we need to make, and optimizes for
the sparse table case, where we need many partial results, most of which
are empty.

We don't merge subranges that come from different partition ranges,
because those need to be sorted in request order, not decorated key order.

[tgrabiec: trivial conflicts]

Message-Id: <20161220170532.25173-1-avi@scylladb.com>
2016-12-20 18:32:29 +01:00
Asias He
937f28d2f1 Convert to use dht::partition_range_vector and dht::token_range_vector 2016-12-19 14:08:50 +08:00
Asias He
e5485f3ea6 Get rid of query::partition_range
Use dht::partition_range instead
2016-12-19 08:09:25 +08:00
Duarte Nunes
fee0b7fa48 query_result_merger: Limit rows
This patch makes the row limit enforced by the storage_proxy layer.
It adds a row limit to the query_result_merger, useful when merging
results for concurrent queries.

More importantly, it provides guarantees that upper layers may be
relying on implicitly (e.g., the paging code).

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-15 11:00:36 +00:00
Duarte Nunes
efc986d548 mutation_query: to_data_query_result enforces row limit
This patch changes mutation_query::to_data_query_result() so that it
enforces the row limit alongside the partition limit and the
per-partition limit.

In the following patch, we'll enforce the row limit in an upper layer,
but this lets us optimize the case where only when replica replies.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-15 10:56:40 +00:00
Duarte Nunes
c2072c7dc9 storage_proxy: Decrease limits when retrying command
This patch changes a read_command's limits when retrying it, so that
we don't ask for more rows than necessary.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-15 10:41:06 +00:00
Duarte Nunes
9572c19dc6 storage_proxy: Don't fetch superfluous partitions
This patch ensures we keep track of how many partitions we've queried
so we don't ask for more than the number we need.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-15 10:27:46 +00:00
Duarte Nunes
108011a839 query_result_merger: Limit partitions
This patch adds a partition limit to the query_result_merger, useful
when merging results for concurrent queries. This change also makes
the partition limit enforced by the storage_proxy layer, no changes
being needed by the upper layers, namely the Thrift interface.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-15 10:27:41 +00:00
Paweł Dziepak
4c69d7e2fe storage_proxy: clean up after primary_key introduction
primary_key was introduced as a replacement for
std::pair<dht::decorated_key, std::optional<clustering_key>>. In order
to simplify patch introducing its fields were named 'first' and
'second'. This patch changes the names to something less useless,
removes old row_address alias and removes is_missing_rows() in favour of
primary_key::less_compare_clustering comparator.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:28:37 +00:00
Paweł Dziepak
3c173d87b5 storage_proxy: handle intentional short reads
If the result is going to be too large the replica may decide to make it
shorter and coordinator should handle this properly (i.e. do not retry).
Moreover, coordinator could avoid some retries by setting the short_read
flag itself.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:28:37 +00:00
Paweł Dziepak
dd67de7218 storage_proxy: make sure coordinator has complete data
got_incomplete_information() ensures that the coordinator has received
all required data from all replicas.
(see 77dbe3c12f "storage_proxy: fix
reconciliation with limits" for the examples when that may not be the
case).

However, this function is called only if reconciled result has at least
as much rows as the user asked for. This was correct when we had only
total row limit: if the result was shorter than that either all replicas
sent all data they have or the coordinator will retry anyway. However,
since then we got partition limit and per partition row limit and a
request may be limited by one of these while being still below the total
row limit.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:28:36 +00:00