storage_proxy has an optimization where it tries to query multiple token
ranges concurrently to satisfy very large requests (an optimization which is
likely meaningless when paging is enabled, as it always should be). However,
the rows-per-range code severely underestimates the number of rows per range,
resulting in a large number of "read-ahead" internal queries being performed,
the results of most of which are discarded.
Fix by disabling this code. We should likely remove it completely, but let's
start with a band-aid that can be backported.
Fixes#1863.
Message-Id: <20161120165741.2488-1-avi@scylladb.com>
(cherry picked from commit 6bdb8ba31d)
Query pager needs to handle results that contain partitions with
possibly multiple clustering rows quite differently than results with
just one row per partition (for example a page may end in a middle of
partition). However, the logic dealing with partitions with clustering
rows doesn't work correctly for SELECT DISTINCT queries, which are
much more similar to the ones for schemas without clustering key.
The solution is to set _has_clustering_keys to false in case of SELECT
DISTINCT queries regardless of the schema which will make pager
correctly expect each partition to return at most one rows.
Fixes#1822.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1478612486-13421-1-git-send-email-pdziepak@scylladb.com>
(cherry picked from commit 055d78ee4c)
If we have a range query involving a wrapping range (i.e., from thrift),
and mutations from both halves of the result are involved, then
we will return the results in the wrong order (and potentially the wrong
partitions) since we order by token, so the results from the second half
of the wrapping range end up before the first.
Fix by splitting the two queries, and merging the second half with lower
priority compared to the first half.
Note: this will be fixed in a better way once we have the sharding iterator,
as then we can query sequentially.
Fixes#1761.
Message-Id: <1476262693-30162-1-git-send-email-avi@scylladb.com>
This patch implements the get_splits() function in storage_service,
used to split a particular token range in slices of approximately the
specified size, using the sample keys and estimates of the CF's
sstables.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Remove inclusions from header files (primary offender is fb_utilities.hh)
and introduce new messaging_service_fwd.hh to reduce rebuilds when the
messaging service changes.
Message-Id: <1475584615-22836-1-git-send-email-avi@scylladb.com>
Currently, the code responsible for calculating ranges for the next
request could produce a wrap-around partition range. For example, if the
original range was (unimportant, A] and the last partition key A then
the output range would be (A, A].
This patch adds checks to make sure that in such cases the range is
removed.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1475497244-2790-1-git-send-email-pdziepak@scylladb.com>
This object, similarly to a global_schema_ptr, allows to dynamically
create the trace_state_ptr objects on different shards in a context
of the original tracing session.
This object would create a secondary tracing session object from the
original trace_state_ptr object when a trace_state_ptr object is needed
on a "remote" shard, similarly to what we do when we need it on a remote
Node.
Fixes#1678Fixes#1647
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1474387767-21910-1-git-send-email-vladz@cloudius-systems.com>
Paging code assumes that clustering row range [a, a] contains only one
row which may not be true. Another problem is that it tries to use
range<> interface for dealing with clustering key ranges which doesn't
work because of the lack of correct comparator.
Refs #1446.
Fixes#1684.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1475236805-16223-1-git-send-email-pdziepak@scylladb.com>
This patch makes the optional trace_state_ptr arguments introduced in
previous patches mandatory where possible. Functions which are called
internally don't have a trace context, so for those we keep the
argument's default value for convenience.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch changes the storage_proxy so it passed along a
trace_state_ptr to the layers below, when querying locally or
receiving a remote query request.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
There is nothing really that fundamentally ties the estimated histogram to
sstables. This patch gets rid of the few incidental ties. They are:
- the namespace name, which is now moved to utils. Users inside sstables/
now need to add a namespace prefix, while the ones outside have to change
it to the right one
- sstables::merge, which has a very non-descriptive name to begin with, is
changed to a more descriptive name that can live inside utils/
- the disk_types.hh include has to be removed - but it had no reason to be
here in the first place.
Todo, is to actually move the file outside sstables/. That is done in a separate
step for clarity.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
"This series introduces a "slow query logging" feature that
allows logging the queries that take more than a specified
threshold time to complete.
Once such a query detected, it will be logged in a system_traces.node_slow_log table.
In addition all trace for that query that have been collected on a Coordinator
are going to be written as well.
If the handling time on a replica in the context of a query takes more than (the same) threshold
they are going to be written too.
The raw in a node_slow_log contains a session_id of a corresponding tracing session,
thereby allowing the user to query the system_traces tables for the corresponding trace
records.
The schema of the node_slow_log table is as follows:
CREATE TABLE system_traces.node_slow_log (
node_ip inet,
shard int,
session_id uuid,
date timestamp,
start_time timeuuid,
command text,
duration int,
parameters map<text, text>,
source_ip inet,
table_names set<text>,
username text,
PRIMARY KEY (start_time, node_ip, shard))
WITH default_time_to_live = 86400
where
- node_ip: IP of the coordinator Node.
- shard: shard ID on a Coordinator where the query was handled.
- session_id: ID of a corresponding tracing session.
- date: a time when the query has began.
- start_time: a time-based UUID for this query (needed for a primary key mostly).
- command: a query string.
- duration: a time it took to handle this query (in microseconds).
- parameters: a map of query parameters (like in system_traces.sessions).
- source_ip: IP of a Client that sent this query.
- table_names: a set of "<keyspace>.<table name>" strings representing column
families used in this query.
- username: a user name used for this query.
The good thing is that most of the data we needed is already
collected by the regular tracing framework. The only missing ones
are a username and tables' names. So, this series makes the framework collect them too.
The whole feature is integrated in the Tracing framework. The main
changes to the framework that were made are as follows:
- Store the constant capabilities of the tracing session in an enum_set, e.g.:
- primary/secondary.
- write on close.
- Introduce two new capabilities to a tracing session of a specific query:
- full tracing: collect all traces for this query (as it is before this series).
- log slow query: log this query if its duration is above the threshold.
These two capabilities may be defined independently.
- Add the logic that handles the "log slow query"-only case:
- Build the parameters<sstring, sstring> map only if the "duration" is above
the given threshold.
- The same about writing the trace entries.
- In a not-only "log slow query" case:
- Write the node_slow_log entry.
- Extend the trace_info struct to pass slow query threshold and TTL to the replica
Node.
In addition to above this series add the capability to configure the slow query logging
threshold and a TTL for the node_slow_log records.
The heaviest patch in the series is the last one. The series contains a few cosmetic (renaming)
patches that are meant to align the naming of the existing methods with the ones the last one
is going to add."
Now that mutation handler knows how much time is left for mutation
write to be handled it can use this knowledge to set correct timeout
for forwarded mutations.
Message-Id: <20160828080637.GE9243@scylladb.com>
- Instead of keeping separate booleans introduce a trace_state_props_set enum_set and
pass it around instead of separate booleans.
- Change the trace_info to hold this value in addition to write_on_close. Initialize
a corresponding bit in an enum_set based on a write_on_close value in a trace_info
constructor for a backward compatibility.
- Separate a trace_state constructor into two:
- For a primary session object.
- For a secondary session object.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
This patch makes the storage_proxy return an empty result when the
query doesn't define any clustering ranges (default or specific).
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This reverts commit 141ea49e05.
There was a confusion around the meaning of "partition limit".
Parts of our code interpreted it just as "maximum number of partitions".
This is also how Cassandra behaves.
However, the other parts of the code, including data query, interpreted
it as "maximum number of live partitions" or otherwise skipped dead
partitions resulting in #1447.
A decision has been made to stick to the "maximum number of live
partitions" interpretation everywhere. The consequences are, among
others, that the patch reverted by this one is no longer correct.
While, the actual series fixing the interpretations of partition limit
and getting rid of the confusion is yet to come, the purpose of this
revert is to make backporting easier (as the patch being reverted
hasn't made it to branch-1.3 yet).
Having a trace_state_ptr in the storage_proxy level is needed to trace code bits in this level.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Store the trace state in the abstract_write_response_handler.
Instrument send_mutation RPC to receive an additional
rpc::optional parameter that will contain optional<trace_info>
value.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Adding this method allows to use tracing helper functions
and remove the no longer needed accessors in the query_state.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
From now on trace_state::trace() is able to receive the sprint-ready
format string with the arguments that will be applied only during
the flush event.
This patch also optimizes the way the source address is evaluated -
do it only once instead of twice if tracing is requested.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Sometimes we want to be able to set "params" map after we
started a tracing session, e.g. when the parameters values,
like a consistency level parsed from the "options" part of a binary frame,
are available only after some heavy part of a flow we would like
to trace.
This patch includes the following changes:
- No longer pass a map to the begin().
- Limit the parameters to the known set.
- Define a method to set each such parameter and save its
value till the final sstring->sstring map is created.
- Construct the final sstring->sstring map in the destructor of the trace_state
object in order to defer all the formatting to be after the traced flow.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
In names of functions and variables:
s/flush_/write_/
s/store_/write_/
In a i_tracing_backend_helper:
s/flush()/kick()/
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
This function is similar to has_column_family_access, but skips
validating if the specified keyspace and column family names map to a
valid schema, as it already takes one as an argument.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
If a node fails to talk to any seed node, shadow round will fail. We
should exit shadow round state before we continue.
This issue is spotted by
consistency_test.TestConsistency.data_query_digest_test dtest.
Message-Id: <ba0613532a69bac369ca316ab61d907b320c8e68.1467963674.git.asias@scylladb.com>
If mutations are fragmented during streaming a special care must be
taken so that isolation guarantees are not broken.
Mutations received with flag "fragmented" set are applied to a memtable
that is used only by that particular streaming task and the sstables
created by flushing such memtables are not made visible until the task
is complte. Also, in case the streaming fails all data is dropped.
This means that fragmented mutations cannot benefit from coalescing of
writes from multiple streaming plans, hence separate way of handling
them so that there is no loss of performance for small partitions.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
plan_id is needed to keep track of the origin of mutations so that if
they are fragmented all fragments are made visible at the same time,
when that particular streaming plan_id completes.
Basically, each streaming plan that sends big (fragmented) mutations is
going to have its own memtables and a list of sstables which will get
flushed and made visible when that plan completes (or dropped if it
fails).
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Checking features for seed node is a bit more complicated than non-seed
node, because non-seed node can always talk to at least one seed node,
seed node may not.
In this patch, we distingush new cluster and existing cluster by
checking if the system table is empty. We relax the feature check for
new cluster because the feature check is mostly useful when upgrading an
existing cluster to prevent old node to join new cluster.
When talking to a seed node failed during the check, we fallback to the
check using features stored in the system table. This makes restarting a
seed node when no other seed node is up possible (no other seed node at
all, or other seed node is not up yet).
I tested the following scenarios.
1) start a completely new seed node in a new cluster
* system table is empty, skip the check.
2) start a cluster, restart one seed node, at least one other seed node
is up
* system table is not empty, check with shadow round, shadow round will
* succeed
3) start a cluster, restart one seed node, no other seed node is up
* system table is not empty, check with shadow round, shadow round will
* fail, fallback to system table check.
4) start a cluster, shutdown all the nodes, start one seed node with new
ip address, seed list in yaml is updated with new ip address
* system table is not empty, check with shadow round, shadow round will
* fail, fallback to system table check
In a leveled column family, there can be many thousands of sstables, since
each sstable is limited to a relatively small size (160M by default).
With the current approach of reading from all sstables in parallel, cpu
quickly becomes a bottleneck as we need to check the bloom filter for each
of these sstables.
This patch addresses the problem by introducing a
compaction-strategy-specific data structure for holding sstables. This
data structure has a method to obtain the sstables used for a read.
For leveled compaction strategy, this data structure is an interval map,
which can be efficiently used to select the right sstables.
When a seed node boots up with more than one node in the seed list, it
will fail to talk to the other seed node which is not up yet.
This fails the feature check, so the seed node will not boot.
Skip the feature check for seed node for now, util we have a proper solution.
Fixes recent dtest failure due to fail to boot the seed node.
Message-Id: <e1d4110f96817e45f81dc0bc948dd14600fc5333.1467251799.git.asias@scylladb.com>
We currently log as follow:
May 9 00:09:13 node3.nl scylla[2546]: [shard 0] storage_service - This
node was decommissioned and will not rejoin the ring unless
cassandra.override_decommission=true has been set,or all existing data
is removed and the node is bootstrapped again
Howerver, user should use
override_decommission:true
instead of
cassandra.override_decommission:true
in scylla.yaml where the cassandra prefix is stripped.
Fixes#1240
Message-Id: <b0c9424c6922431ad049ab49391771e07ca6fbde.1467079190.git.asias@scylladb.com>