* seastar 6b21197...2ebe842 (6):
> Merge "Various improvements to execution stages" from Paweł
> app-template: allow apps to specify a name for help message
> bool_class: avoid initializing object of incomplete type
> app-template: make sure we can still get help with required options
> prometheus: Http handler that returns prometheus 0.4 protobuf or text format
> Update DPDK to 17.02
Includes patch from Pawel to adjust to updated execution_stage interface.
Metrics should have their unique name. This patch changes
throttled_writes of the queu lenght to current_throttled_writes.
Without it, metrics will be reported twice under the same name, which
may cause errors in the prometheus server.
This could be related to scylladb/seastar#250
Fixes#2163.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20170314081456.6392-1-amnon@scylladb.com>
Adds yet another magic function "SCYLLA_COUNTER_SHARD_LIST", indicating that
argument value, which must be a list of tuples <int, UUID, long, long>,
should be inserted as an actual counter value, not update.
This of course to allow counters to be read from sstable loader.
Note that we also need to allow timestamps for counter mutations,
as well as convince the counter code itself to treat the data as
already baked. So ugly wormhole galore.
v2:
* Changed flag names
* More explicit wormholing, bypassing normal counter path, to
avoid read-before-write etc
* throw exceptions on unhandled shard types in marshalling
v3:
* Added counter id ordering check
* Added batch statement check for mixing normal and raw counter updates
Message-Id: <1487683665-23426-2-git-send-email-calle@scylladb.com>
There are several problems with storage_proxy::send_to_endpoint right
now. It uses create_write_response_handler() overload that is specific
to read repair which is suboptimal and creates incorrect logs, it does
not process errors and it does not hold storage_proxy object until write
is complete. The patch fixes all of the problems.
Message-Id: <20170208101949.GA19474@scylladb.com>
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
"This series uses the newly added histogram and label support to add metrics to
the storage_proxy and to the column_family.
This would add latency and histogram and the missing metrics from column family."
* 'amnon/histogram_metrics' of github.com:cloudius-systems/seastar-dev:
database: add metrics registration for the coloumn family
storage_proxy: add read and write latency histogram
estimated_histogram: returns a metrics histogram
Add a function for sending one mutation to one remote replica owning
this mutation. This is needed for materialized views, where each
base replica sends each view mutation to one particular view replica.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Merge commit 45b6070832 used butchered version of storage_proxy
patch to adjust to rpc timer change instead the one I've sent. This
patch fixes the differences.
Message-Id: <20170206095237.GA7691@scylladb.com>
* seastar 397685c...c1dbd89 (13):
> lowres_clock: drop cache-line alignment for _timer
> net/packet: add missing include
> Merge "Adding histogram and description support" from Amnon
> reactor: Fix the error: cannot bind 'std::unique_ptr' lvalue to 'std::unique_ptr&&'
> Set the option '--server' of tests/tcp_sctp_client to be required
> core/memory: Remove superfluous assignment
> core/memory: Remove dead code
> core/reactor: Use logger instead of cerr
> fix inverted logic in overprovision parameter
> rpc: fix timeout checking condition
> rpc: use lowres_clock instead of high resolution one
> semaphore: make semaphore's clock configurable
> rpc: detect timedout outgoing packets earlier
Includes treewide change to accomodate rpc changing its timeout clock
to lowres_clock.
Includes fixup from Amnon:
collectd api should use the metrics getters
As part of a preperation of the change in the metrics layer, this change
the way the collectd api uses the metrics value to use the getters
instead of calling the member directly.
This will be important when the internal implementation will changed
from union to variant.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1485457657-17634-1-git-send-email-amnon@scylladb.com>
The reason is the same as why foreground writes are reported instead of
total writes (049ae37d08): It is much easier to see what is going on
this way.
Also fixes a typo in a counter's description.
Fixes#1217
Message-Id: <20170129093412.GS11469@scylladb.com>
As the metrics migration progressed, some include to scollectd.hh left
behind.
Because of the nature of the scollecd implementation those include
brings alot of code with them to the header files and eventually to many
source file.
This patch remove those include and add a missing include to
storage_proxy.cc.
The reason the compiler didn't complain is an indication to the
problematic nature of those include in the first place.
Before this patch, change in metrics.hh would cause 169 files to
compile, after this change 17.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1484667536-2185-1-git-send-email-amnon@scylladb.com>
query_mutations_locally() takes one_or_two_partition_ranges by
reference and requires, indirectly, that it is kept alive until
operation resolves. However, we were passing expiring value to it, the
result of unwrap().
Fixes dtest failure in consistent_bootstrap_test.py:TestBootstrapConsistency.consistent_reads_after_bootstrap_test
Another potential problem was that we were dereferencing "s" in the same
expression which move-constructs an argument out of it.
Message-Id: <1484222759-4967-1-git-send-email-tgrabiec@scylladb.com>
mutation_result_merger::get() assumes that the merged result may be a
short read if at least one of the partial results is a short read (in
other words, if none of the partial results is a short read, then the
merged result is also not a short read). However this is not true;
because we update the memory accounter incrementally, we may stop
scanning early. All the partial results are full; but we did not scan
the entire range.
Fix by changing the short_read variable initialization from `no`
(which assumes we'll encounter a short read indication when processing
one of the batches) to `this->short_read()`, which also takes into
account the memory accounter.
Fixes#2001.
Message-Id: <20170108111315.17877-1-avi@scylladb.com>
During a range scan, we try to avoid sorting according to partition range
when we can do so. This is when we scan fewer than smp::count shards --
each shard's range is strictly ordered with respect to the others.
However, we use the wrong key for the sort -- we use the shard number. But
if we started at shard s > 0 and wrapped around to shard 0, then shard 0's
range will be after the range belonging to shard s, but will sort before it.
Fix by storing the iteration order as the sort key. We use that when we
know that shards do not overlap (shards < smp::count) and the index within
the source partition range vector when they do.
Fixes#1998.
Message-Id: <20170105114253.17492-1-avi@scylladb.com>
Range queries used to be performed sequentially and the shard performing
part of the read was reading state of the merger's memory accounter
directly. Now, they may be performed in parallel so it is safer to just
pass relevant data by value to the intersted shards so that they are not
reading something that another shard is modyfing at the same time.
Since query is done in parallel there is a chance of overread. However,
the parallelism is high only in sparsely populated tables and that's
when the overread is less serious problem.
Since a1cafed370 "storage_proxy: handle
range scans of sparsely populated tables" nonsingular range queries may
be performed in parallel on multiple shards. The consequence of this
that result may be added to the merger out of order. This requires more
complex logic for handling short reads.
As soon as mutation_result_merger gets a short read it starts to discard
all subsequently received results that are known to contain partitions
with larger keys.
Then when the final result is being prepared the merger may need to
combine and sorts results which ordering is not known. If at least one
of these results is a short one all partitions with larger keys are
removed.
Due to request being performed in parallel it is possible that even
though there was a short read the merger has got enough live data to
satisfy specified limits. If this has happened the short read flag is
not set on the final result.
We may want to change the default individual result size limit in the
future. If it is provided by the coordinator and not hardcoded in the
replicas this can be done without causing data query digest mismatches
or wasteful mutation query results.
For data queries it is very important that all replicas get limited in
the same place (this includes replicas returning only digest). That's
why they shouldn't be affected by per-shard result memory limit.
Moreover, we should make sure that individual memory limits are the
same, making the coordinator provide it for replicas which allow to
safely change it in the future.
Mutation queries are not as sensitive but it is still beneficial to make
sure that all replicas use the same individual limit.
If after reconciliation the coordinator ends up with no live rows and
short reads are allowed a retry may not make any progress if replicas
end their reads in the same place. The solution is to disallow short
reads on retries which are caused by final result having no live rows.
mutation_result_merger merges results from different shards and stops as
soon as a shard returned a short read or memory usage on the merging
shard is too high. However, it should never stop unless at least one
live rows is in the merged result.
When murmur3_partitioner_ignore_msb_bits = 12 (which we'd like to be the
default), a scan range can be split into a large number of subranges, each
going to a separate shard. With the current implementation, subranges were
queried sequentially, resulting in very long latency when the table was empty
or nearly empty.
Switch to an exponential retry mechanism, where the number of subranges
queried doubles each time, dropping the latency from O(number of subranges)
to O(log(number of subranges)).
If, during an iteration of a retry, we read at most one range
from each shard, then partial results are merged by concatentation. This
optimizes for the dense(r) case, where few partial results are required.
If, during an iteration of a retry, we need more than one range per
shard, then we collapse all of a shard's ranges into just one range,
and merge partial results by sorting decorated keys. This reduces
the number of sstable read creations we need to make, and optimizes for
the sparse table case, where we need many partial results, most of which
are empty.
We don't merge subranges that come from different partition ranges,
because those need to be sorted in request order, not decorated key order.
[tgrabiec: trivial conflicts]
Message-Id: <20161220170532.25173-1-avi@scylladb.com>
This patch makes the row limit enforced by the storage_proxy layer.
It adds a row limit to the query_result_merger, useful when merging
results for concurrent queries.
More importantly, it provides guarantees that upper layers may be
relying on implicitly (e.g., the paging code).
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch changes mutation_query::to_data_query_result() so that it
enforces the row limit alongside the partition limit and the
per-partition limit.
In the following patch, we'll enforce the row limit in an upper layer,
but this lets us optimize the case where only when replica replies.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch changes a read_command's limits when retrying it, so that
we don't ask for more rows than necessary.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch ensures we keep track of how many partitions we've queried
so we don't ask for more than the number we need.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds a partition limit to the query_result_merger, useful
when merging results for concurrent queries. This change also makes
the partition limit enforced by the storage_proxy layer, no changes
being needed by the upper layers, namely the Thrift interface.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
primary_key was introduced as a replacement for
std::pair<dht::decorated_key, std::optional<clustering_key>>. In order
to simplify patch introducing its fields were named 'first' and
'second'. This patch changes the names to something less useless,
removes old row_address alias and removes is_missing_rows() in favour of
primary_key::less_compare_clustering comparator.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
If the result is going to be too large the replica may decide to make it
shorter and coordinator should handle this properly (i.e. do not retry).
Moreover, coordinator could avoid some retries by setting the short_read
flag itself.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
got_incomplete_information() ensures that the coordinator has received
all required data from all replicas.
(see 77dbe3c12f "storage_proxy: fix
reconciliation with limits" for the examples when that may not be the
case).
However, this function is called only if reconciled result has at least
as much rows as the user asked for. This was correct when we had only
total row limit: if the result was shorter than that either all replicas
sent all data they have or the coordinator will retry anyway. However,
since then we got partition limit and per partition row limit and a
request may be limited by one of these while being still below the total
row limit.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>