"This patch series ensures we don't count dead partitions (i.e.,
partitions with no live rows) towards the partition_limit. We also
enforce the partition limit at the storage_proxy level, so that
limits with smp > 1 works correctly."
(cherry picked from commit 5f11a727c9)
Having a trace_state_ptr in the storage_proxy level is needed to trace code bits in this level.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Store the trace state in the abstract_write_response_handler.
Instrument send_mutation RPC to receive an additional
rpc::optional parameter that will contain optional<trace_info>
value.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Adding this method allows to use tracing helper functions
and remove the no longer needed accessors in the query_state.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
From now on trace_state::trace() is able to receive the sprint-ready
format string with the arguments that will be applied only during
the flush event.
This patch also optimizes the way the source address is evaluated -
do it only once instead of twice if tracing is requested.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Sometimes we want to be able to set "params" map after we
started a tracing session, e.g. when the parameters values,
like a consistency level parsed from the "options" part of a binary frame,
are available only after some heavy part of a flow we would like
to trace.
This patch includes the following changes:
- No longer pass a map to the begin().
- Limit the parameters to the known set.
- Define a method to set each such parameter and save its
value till the final sstring->sstring map is created.
- Construct the final sstring->sstring map in the destructor of the trace_state
object in order to defer all the formatting to be after the traced flow.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
In names of functions and variables:
s/flush_/write_/
s/store_/write_/
In a i_tracing_backend_helper:
s/flush()/kick()/
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
This function is similar to has_column_family_access, but skips
validating if the specified keyspace and column family names map to a
valid schema, as it already takes one as an argument.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
If a node fails to talk to any seed node, shadow round will fail. We
should exit shadow round state before we continue.
This issue is spotted by
consistency_test.TestConsistency.data_query_digest_test dtest.
Message-Id: <ba0613532a69bac369ca316ab61d907b320c8e68.1467963674.git.asias@scylladb.com>
If mutations are fragmented during streaming a special care must be
taken so that isolation guarantees are not broken.
Mutations received with flag "fragmented" set are applied to a memtable
that is used only by that particular streaming task and the sstables
created by flushing such memtables are not made visible until the task
is complte. Also, in case the streaming fails all data is dropped.
This means that fragmented mutations cannot benefit from coalescing of
writes from multiple streaming plans, hence separate way of handling
them so that there is no loss of performance for small partitions.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
plan_id is needed to keep track of the origin of mutations so that if
they are fragmented all fragments are made visible at the same time,
when that particular streaming plan_id completes.
Basically, each streaming plan that sends big (fragmented) mutations is
going to have its own memtables and a list of sstables which will get
flushed and made visible when that plan completes (or dropped if it
fails).
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Checking features for seed node is a bit more complicated than non-seed
node, because non-seed node can always talk to at least one seed node,
seed node may not.
In this patch, we distingush new cluster and existing cluster by
checking if the system table is empty. We relax the feature check for
new cluster because the feature check is mostly useful when upgrading an
existing cluster to prevent old node to join new cluster.
When talking to a seed node failed during the check, we fallback to the
check using features stored in the system table. This makes restarting a
seed node when no other seed node is up possible (no other seed node at
all, or other seed node is not up yet).
I tested the following scenarios.
1) start a completely new seed node in a new cluster
* system table is empty, skip the check.
2) start a cluster, restart one seed node, at least one other seed node
is up
* system table is not empty, check with shadow round, shadow round will
* succeed
3) start a cluster, restart one seed node, no other seed node is up
* system table is not empty, check with shadow round, shadow round will
* fail, fallback to system table check.
4) start a cluster, shutdown all the nodes, start one seed node with new
ip address, seed list in yaml is updated with new ip address
* system table is not empty, check with shadow round, shadow round will
* fail, fallback to system table check
In a leveled column family, there can be many thousands of sstables, since
each sstable is limited to a relatively small size (160M by default).
With the current approach of reading from all sstables in parallel, cpu
quickly becomes a bottleneck as we need to check the bloom filter for each
of these sstables.
This patch addresses the problem by introducing a
compaction-strategy-specific data structure for holding sstables. This
data structure has a method to obtain the sstables used for a read.
For leveled compaction strategy, this data structure is an interval map,
which can be efficiently used to select the right sstables.
When a seed node boots up with more than one node in the seed list, it
will fail to talk to the other seed node which is not up yet.
This fails the feature check, so the seed node will not boot.
Skip the feature check for seed node for now, util we have a proper solution.
Fixes recent dtest failure due to fail to boot the seed node.
Message-Id: <e1d4110f96817e45f81dc0bc948dd14600fc5333.1467251799.git.asias@scylladb.com>
We currently log as follow:
May 9 00:09:13 node3.nl scylla[2546]: [shard 0] storage_service - This
node was decommissioned and will not rejoin the ring unless
cassandra.override_decommission=true has been set,or all existing data
is removed and the node is bootstrapped again
Howerver, user should use
override_decommission:true
instead of
cassandra.override_decommission:true
in scylla.yaml where the cassandra prefix is stripped.
Fixes#1240
Message-Id: <b0c9424c6922431ad049ab49391771e07ca6fbde.1467079190.git.asias@scylladb.com>
sstable_list is now a map<generation, sstable>; change it to a set
in preparation for replacing it with sstable_set. The change simplifies
a lot of code; the only casualty is the code that computes the highest
generation number.
This patch as a per-partition row limit. It ensures both local
queries and the reconciliation logic abide by this limit.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch changes the way we fetch each replica's last row to
determine if we got incomplete information from any of them. Instead
of fetching the last rows up front, we fetch them on demand only if we
actually trigger the code that needs them. We now get the last row from
the versions vector of vectors.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch extracts to a function the code that actually determines
the last row of a partition based on the direction of the query.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
We want to prevent older version of scylla which has fewer features to
join a cluster with newer version of scylla which has more features,
because when scylla sees a feature is enabled on all other nodes, it
will start to use the feature and assume existing nodes and future nodes
will always have this feature.
In order to support downgrade during rolling upgrade, we need to support
mixed old and new nodes case.
1) All old nodes
O O O O O <- N OK
O O O O O <- O OK
2) All new nodes
N N N N N <- N OK
N N N N N <- O FAIL
3) Mixed old and new nodes
O N O N O <- N OK
O N O N O <- O OK
(O == old node, N == new node, <- == joining the cluster)
With this patch, I tested:
1.1) Add new node to new node cluster
gossip - Feature check passed. Local node 127.0.0.4 features =
{RANGE_TOMBSTONES}, Remote common_features = {RANGE_TOMBSTONES}
1.2) Add old node to old node cluster
gossip - Feature check passed. Local node 127.0.0.4 features = {},
Remote common_features = {}
2.1) Add new node to new node cluster
gossip - Feature check passed. Local node 127.0.0.4 features =
{RANGE_TOMBSTONES}, Remote common_features = {RANGE_TOMBSTONES}
2.2) Add old node to new node cluster
seastar - Exiting on unhandled exception: std::runtime_error (Feature
check failed. This node can not join the cluster because it does not
understand the feature. Local node 127.0.0.4 features = {}, Remote
common_features = {RANGE_TOMBSTONES})
3.1) Add new node to mixed cluster
gossip - Feature check passed. Local node 127.0.0.4 features =
{RANGE_TOMBSTONES}, Remote common_features = {}
3.2) Add old node to mixed cluster
gossip - Feature check passed. Local node 127.0.0.4 features = {},
Remote common_features = {}
Fixes#1253
User may specify time after which speculative retry should happen
instead of relying on cf statics. Use provided value in speculative
executor.
Message-Id: <20160616104422.GH5961@scylladb.com>
Currently, we only stop the CQL transport server. Extract a
stop_transport() function from drain_on_shutdown() and call it from
do_isolate_on_error() to also shut down the inter-node RPC transport,
Thrift, and other communications services.
Fixes#1353
tracing::tracing local instance is dereferenced from a
cql_server::connection::process_request(), therefore tracing::tracing
service may be stop()ed only after a CQL server service is down.
On the other hand it may not be stopped before RPC service is down
because a remote side may request a tracing for a specific command too.
This patch splits the tracing::tracing stop() into two phases:
1) Flush all pending tracing records and stop the backend.
2) Stop the service.
The first phase is called after CQL server is down and before RPC is down.
The second phase is called after RPC is down.
Fixes#1339
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1465840496-19990-1-git-send-email-vladz@cloudius-systems.com>
When read repair writes diffs back to replicas it is enough to wait
for requested CL to guaranty read monotonicity. This patch makes read
repair write reuse regular mutate functionality which already tracks
CL status. This is done by changing write response handler to not hold
mutation directly, but instead hold a container that, depending on
whether
this is read repair write or regular one, can provide different mutation
per destination.
Message-Id: <20160613124727.GL1096@scylladb.com>
Call for a tracing::tracing::create_session() doesn't promise a session creation.
Check that the session is actually created before trying to use it.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
This patch enables the RANGE_TOMBSTONES supported feature, meaning
that the node is capable of accepting row entry tombstones as range
tombstones.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch changes the type of the mutation partition's row_tombstones
to be a range_tombstone_list, so that they are now represented as a
set of disjoint ranges. All of its usages are updated accordingly.
Fixes#1155
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds the range tombstones feature, which is not enabled
yet, to the storage_service, so that consumers can query for it.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This is a demo instrumentation:
- Check if a tracing info is present in the read_command.
- If yes - create a tracing session with the given tracing
session ID.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Instrument a coordinator of a SELECT query to send tracing session
info to the corresponding replica Nodes.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>