Commit Graph

856 Commits

Author SHA1 Message Date
Paweł Dziepak
e95f4eaee4 Merge "partition_limit: Don't count dead partitions" from Duarte
"This patch series ensures we don't count dead partitions (i.e.,
partitions with no live rows) towards the partition_limit. We also
enforce the partition limit at the storage_proxy level, so that
limits with smp > 1 works correctly."

(cherry picked from commit 5f11a727c9)
2016-08-03 12:44:32 +03:00
Duarte Nunes
8243d3d1e0 storage_service: Fix get_range_to_address_map_in_local_dc
This patch fixes a couple of bugs in
get_range_to_address_map_in_local_dc.

Fixes #1517

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1469782666-21320-1-git-send-email-duarte@scylladb.com>
(cherry picked from commit 7d1b7e8da3)
2016-07-29 11:24:12 +02:00
Vlad Zolotarov
1d7ed190f8 SELECT tracing instrumentation: improve inter-nodes communication stages messages
Add/fix "sending to"/"received from" messages.

With this patch the single key select trace with a data on an external node
looks as follows:

Tracing session: 65dbfcc0-4f51-11e6-8dd2-000000000001

 activity                                                                                                                        | timestamp                  | source    | source_elapsed
---------------------------------------------------------------------------------------------------------------------------------+----------------------------+-----------+----------------
                                                                                                              Execute CQL3 query | 2016-07-21 17:42:50.124000 | 127.0.0.2 |              0
                                                                                                   Parsing a statement [shard 1] | 2016-07-21 17:42:50.124127 | 127.0.0.2 |             --
                                                                                                Processing a statement [shard 1] | 2016-07-21 17:42:50.124190 | 127.0.0.2 |             64
 Creating read executor for token 2309717968349690594 with all: {127.0.0.1} targets: {127.0.0.1} repair decision: NONE [shard 1] | 2016-07-21 17:42:50.124229 | 127.0.0.2 |            103
                                                                            read_data: sending a message to /127.0.0.1 [shard 1] | 2016-07-21 17:42:50.124234 | 127.0.0.2 |            108
                                                                           read_data: message received from /127.0.0.2 [shard 1] | 2016-07-21 17:42:50.124358 | 127.0.0.1 |             14
                                                          read_data handling is done, sending a response to /127.0.0.2 [shard 1] | 2016-07-21 17:42:50.124434 | 127.0.0.1 |             89
                                                                               read_data: got response from /127.0.0.1 [shard 1] | 2016-07-21 17:42:50.124662 | 127.0.0.2 |            536
                                                                                  Done processing - preparing a result [shard 1] | 2016-07-21 17:42:50.124695 | 127.0.0.2 |            569
                                                                                                                Request complete | 2016-07-21 17:42:50.124580 | 127.0.0.2 |            580

Fixes #1481

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1469112271-22818-1-git-send-email-vladz@cloudius-systems.com>
(cherry picked from commit 57b58cad8e)
2016-07-25 13:50:39 +03:00
Vlad Zolotarov
7c590295ef SELECT instrumentation: add a nice trace point
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-07-19 18:21:59 +03:00
Vlad Zolotarov
b36b69c1d6 service::storage_proxy: remove a default value for a tracing::trace_state_ptr parameter
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-07-19 18:21:59 +03:00
Vlad Zolotarov
baa6496816 service::storage_proxy: READ instrumentation: store trace state object in abstract_read_executor
Having a trace_state_ptr in the storage_proxy level is needed to trace code bits in this level.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-07-19 18:21:59 +03:00
Vlad Zolotarov
962bddf8fe transport: CQL tracing: instrument a BATCH command
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-07-19 18:21:58 +03:00
Vlad Zolotarov
be88074f47 service::query_state: get rid of begin_tracing()
Use tracing::begin() directly.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-07-19 18:21:58 +03:00
Vlad Zolotarov
982d301178 service::client_state: add a const version of get_trace_state()
tracing::begin() requires a non-const version, tracing::trace()
requires a const version.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-07-19 18:21:58 +03:00
Vlad Zolotarov
da56aa4256 service::client_state: rename: trace_state_ptr() -> get_trace_state()
Rename the method for consistency with other classes methods returning
the same value.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-07-19 18:21:58 +03:00
Vlad Zolotarov
4c16df9e4c service: instrument MUTATE flow with tracing
Store the trace state in the abstract_write_response_handler.
Instrument send_mutation RPC to receive an additional
rpc::optional parameter that will contain optional<trace_info>
value.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-07-19 18:21:58 +03:00
Vlad Zolotarov
952dc8a3d4 query_state: add get_trace_state() method
Adding this method allows to use tracing helper functions
and remove the no longer needed accessors in the query_state.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-07-19 18:21:58 +03:00
Vlad Zolotarov
0552ffcd17 service/storage_proxy: tracing: adjust the existing SELECT instrumentation with the new trace() interface
From now on trace_state::trace() is able to receive the sprint-ready
format string with the arguments that will be applied only during
the flush event.

This patch also optimizes the way the source address is evaluated -
do it only once instead of twice if tracing is requested.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-07-19 18:21:58 +03:00
Vlad Zolotarov
0689843e79 tracing::trace_state: add method to set the session's "params" map entries
Sometimes we want to be able to set "params" map after we
started a tracing session, e.g. when the parameters values,
like a consistency level parsed from the "options" part of a binary frame,
are available only after some heavy part of a flow we would like
to trace.

This patch includes the following changes:

   - No longer pass a map to the begin().
   - Limit the parameters to the known set.
   - Define a method to set each such parameter and save its
     value till the final sstring->sstring map is created.
   - Construct the final sstring->sstring map in the destructor of the trace_state
     object in order to defer all the formatting to be after the traced flow.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-07-19 18:21:58 +03:00
Vlad Zolotarov
a5022a09a4 tracing: use 'write' instead of 'flush' and 'store' for consistency with seastar's API
In names of functions and variables:
s/flush_/write_/
s/store_/write_/

In a i_tracing_backend_helper:
s/flush()/kick()/

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-07-19 18:21:57 +03:00
Duarte Nunes
3c389ba871 client_state: Add has_schema_access function
This function is similar to has_column_family_access, but skips
validating if the specified keyspace and column family names map to a
valid schema, as it already takes one as an argument.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-07-17 17:38:23 +00:00
Tomasz Grabiec
c97871d95c migration_manager: Uncomment logging for keysapce drop
Message-Id: <1468413673-6899-1-git-send-email-tgrabiec@scylladb.com>
2016-07-13 13:42:23 +01:00
Paweł Dziepak
85c092c56c storage_service: add LARGE_PARTITIONS_FEATURE
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-07-13 09:51:23 +01:00
Asias He
e0949a8f4f storage_service: Exit shadow round state if it fails
If a node fails to talk to any seed node, shadow round will fail. We
should exit shadow round state before we continue.

This issue is spotted by
consistency_test.TestConsistency.data_query_digest_test dtest.
Message-Id: <ba0613532a69bac369ca316ab61d907b320c8e68.1467963674.git.asias@scylladb.com>
2016-07-08 10:05:07 +01:00
Paweł Dziepak
32a5de7a1f db: handle receiving fragmented mutations
If mutations are fragmented during streaming a special care must be
taken so that isolation guarantees are not broken.

Mutations received with flag "fragmented" set are applied to a memtable
that is used only by that particular streaming task and the sstables
created by flushing such memtables are not made visible until the task
is complte. Also, in case the streaming fails all data is dropped.

This means that fragmented mutations cannot benefit from coalescing of
writes from multiple streaming plans, hence separate way of handling
them so that there is no loss of performance for small partitions.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-07-07 12:18:35 +01:00
Paweł Dziepak
4031c0ed8f streaming: pass plan_id to column family for apply and flush
plan_id is needed to keep track of the origin of mutations so that if
they are fragmented all fragments are made visible at the same time,
when that particular streaming plan_id completes.

Basically, each streaming plan that sends big (fragmented) mutations is
going to have its own memtables and a list of sstables which will get
flushed and made visible when that plan completes (or dropped if it
fails).

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-07-07 12:18:35 +01:00
Asias He
5236e7a379 storage_service: Implement feature check for seed node
Checking features for seed node is a bit more complicated than non-seed
node, because non-seed node can always talk to at least one seed node,
seed node may not.

In this patch, we distingush new cluster and existing cluster by
checking if the system table is empty. We relax the feature check for
new cluster because the feature check is mostly useful when upgrading an
existing cluster to prevent old node to join new cluster.

When talking to a seed node failed during the check, we fallback to the
check using features stored in the system table. This makes restarting a
seed node when no other seed node is up possible (no other seed node at
all, or other seed node is not up yet).

I tested the following scenarios.

1) start a completely new seed node in a new cluster
* system table is empty, skip the check.

2) start a cluster, restart one seed node, at least one other seed node
is up
* system table is not empty, check with shadow round, shadow round will
* succeed

3) start a cluster, restart one seed node, no other seed node is up
* system table is not empty, check with shadow round, shadow round will
* fail, fallback to system table check.

4) start a cluster, shutdown all the nodes, start one seed node with new
ip address, seed list in yaml is updated with new ip address
* system table is not empty, check with shadow round, shadow round will
* fail, fallback to system table check
2016-07-05 10:09:54 +08:00
Avi Kivity
e22517bafc Merge "Optimize reads from leveled sstables"
In a leveled column family, there can be many thousands of sstables, since
each sstable is limited to a relatively small size (160M by default).
With the current approach of reading from all sstables in parallel, cpu
quickly becomes a bottleneck as we need to check the bloom filter for each
of these sstables.

This patch addresses the problem by introducing a
compaction-strategy-specific data structure for holding sstables.  This
data structure has a method to obtain the sstables used for a read.

For leveled compaction strategy, this data structure is an interval map,
which can be efficiently used to select the right sstables.
2016-07-04 16:00:35 +03:00
Asias He
610a0f7ef0 storage_service: Skip feature check for seed node for now
When a seed node boots up with more than one node in the seed list, it
will fail to talk to the other seed node which is not up yet.
This fails the feature check, so the seed node will not boot.

Skip the feature check for seed node for now, util we have a proper solution.

Fixes recent dtest failure due to fail to boot the seed node.

Message-Id: <e1d4110f96817e45f81dc0bc948dd14600fc5333.1467251799.git.asias@scylladb.com>
2016-07-04 15:09:57 +03:00
Asias He
f6a2672be0 storage_service: Modify log to match config option of scylla
We currently log as follow:

May  9 00:09:13 node3.nl scylla[2546]:  [shard 0] storage_service - This
node was decommissioned and will not rejoin the ring unless
cassandra.override_decommission=true has been set,or all existing data
is removed and the node is bootstrapped again

Howerver, user should use

   override_decommission:true

instead of

   cassandra.override_decommission:true

in scylla.yaml where the cassandra prefix is stripped.

Fixes #1240
Message-Id: <b0c9424c6922431ad049ab49391771e07ca6fbde.1467079190.git.asias@scylladb.com>
2016-07-04 10:47:49 +02:00
Avi Kivity
2a46410f4a Change sstable_list from a map to a set
sstable_list is now a map<generation, sstable>; change it to a set
in preparation for replacing it with sstable_set.  The change simplifies
a lot of code; the only casualty is the code that computes the highest
generation number.
2016-07-03 10:26:57 +03:00
Duarte Nunes
386c0dd4b2 storage_proxy: Correctly calculate new limit
This patch fixes a bug where we would always return query::max_rows
when calculating the new limit for a retry read command.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1467289746-18177-1-git-send-email-duarte@scylladb.com>
2016-06-30 14:49:56 +02:00
Gleb Natapov
8bf82cc31c put additional info into cql timeout exception
Fixes #1397

Message-Id: <20160628101829.GR14658@scylladb.com>
2016-06-30 12:03:48 +02:00
Paweł Dziepak
0c441378f2 client_state: support thrift clients
Signed-off-by: Paweł Dziepak <pdziepak@cloudius-systems.com>
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-06-27 15:24:27 +02:00
Paweł Dziepak
002d2bc353 thrift: pass query_processor to the thrift handler
Signed-off-by: Paweł Dziepak <pdziepak@cloudius-systems.com>
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-06-27 15:24:27 +02:00
Pekka Enberg
bcba45f546 Merge "Prevent old node to join new cluster" from Asias
Fixes #1253
2016-06-23 10:25:38 +03:00
Duarte Nunes
82dbf5bff3 storage_proxy: Trace when retrying a query
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-06-22 09:48:15 +02:00
Duarte Nunes
01b18063ea query: Add per-partition row limit
This patch as a per-partition row limit. It ensures both local
queries and the reconciliation logic abide by this limit.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-06-22 09:46:51 +02:00
Duarte Nunes
20d9813a89 storage_proxy: Fetch last replica row just in time
This patch changes the way we fetch each replica's last row to
determine if we got incomplete information from any of them. Instead
of fetching the last rows up front, we fetch them on demand only if we
actually trigger the code that needs them. We now get the last row from
the versions vector of vectors.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-06-22 00:15:38 +02:00
Duarte Nunes
4ce9fc24cb storage_proxy: Extract finding last row
This patch extracts to a function the code that actually determines
the last row of a partition based on the direction of the query.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-06-22 00:15:38 +02:00
Paweł Dziepak
579de26e95 storage_proxy: drop make_local_reader()
This code was used only by its unit test.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-06-20 21:29:49 +01:00
Asias He
4f3ce42163 storage_service: Prevent old version node to join a new version cluster
We want to prevent older version of scylla which has fewer features to
join a cluster with newer version of scylla which has more features,
because when scylla sees a feature is enabled on all other nodes, it
will start to use the feature and assume existing nodes and future nodes
will always have this feature.

In order to support downgrade during rolling upgrade, we need to support
mixed old and new nodes case.

1) All old nodes
O O O O O <- N   OK
O O O O O <- O   OK

2) All new nodes
N N N N N <- N   OK
N N N N N <- O   FAIL

3) Mixed old and new nodes
O N O N O <- N   OK
O N O N O <- O   OK

(O == old node, N == new node, <- == joining the cluster)

With this patch, I tested:

1.1) Add new node to new node cluster
gossip - Feature check passed. Local node 127.0.0.4 features =
{RANGE_TOMBSTONES}, Remote common_features = {RANGE_TOMBSTONES}

1.2) Add old node to old node cluster
gossip - Feature check passed. Local node 127.0.0.4 features = {},
Remote common_features = {}

2.1) Add new node to new node cluster
gossip - Feature check passed. Local node 127.0.0.4 features =
{RANGE_TOMBSTONES}, Remote common_features = {RANGE_TOMBSTONES}

2.2) Add old node to new node cluster
seastar - Exiting on unhandled exception: std::runtime_error (Feature
check failed. This node can not join the cluster because it does not
understand the feature. Local node 127.0.0.4 features = {}, Remote
common_features = {RANGE_TOMBSTONES})

3.1) Add new node to mixed cluster
gossip - Feature check passed. Local node 127.0.0.4 features =
{RANGE_TOMBSTONES}, Remote common_features = {}

3.2) Add old node to mixed cluster
gossip - Feature check passed. Local node 127.0.0.4 features = {},
Remote common_features = {}

Fixes #1253
2016-06-17 10:49:45 +08:00
Gleb Natapov
4659800ab9 storage_proxy: implement custom speculative retry strategy
User may specify time after which speculative retry should happen
instead of relying on cf statics. Use provided value in speculative
executor.

Message-Id: <20160616104422.GH5961@scylladb.com>
2016-06-16 13:45:56 +03:00
Pekka Enberg
d72c608868 service/storage_service: Make do_isolate_on_error() more robust
Currently, we only stop the CQL transport server. Extract a
stop_transport() function from drain_on_shutdown() and call it from
do_isolate_on_error() to also shut down the inter-node RPC transport,
Thrift, and other communications services.

Fixes #1353
2016-06-16 13:34:09 +03:00
Gleb Natapov
7f54333c45 storage_proxy: fix complication on older boost
boost before 1.56.0 had broken boost:size() implementation. Do not use
it.

Message-Id: <20160615123134.GD5961@scylladb.com>
2016-06-15 15:34:57 +03:00
Pekka Enberg
155ad2eeb5 storage_service: Fix start_rpc_server() to use logger
Message-Id: <1465882880-7392-1-git-send-email-penberg@scylladb.com>
2016-06-14 09:52:04 +02:00
Vlad Zolotarov
d3960f0bbb tracing: rearrange shut down
tracing::tracing local instance is dereferenced from a
cql_server::connection::process_request(), therefore tracing::tracing
service may be stop()ed only after a CQL server service is down.
On the other hand it may not be stopped before RPC service is down
because a remote side may request a tracing for a specific command too.

This patch splits the tracing::tracing stop() into two phases:
   1) Flush all pending tracing records and stop the backend.
   2) Stop the service.

The first phase is called after CQL server is down and before RPC is down.
The second phase is called after RPC is down.

Fixes #1339

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1465840496-19990-1-git-send-email-vladz@cloudius-systems.com>
2016-06-14 07:58:04 +03:00
Gleb Natapov
e089166cfa storage_proxy: wait only for expected CL when writing back data during read repair
When read repair writes diffs back to replicas it is enough to wait
for requested CL to guaranty read monotonicity. This patch makes read
repair write reuse regular mutate functionality which already tracks
CL status. This is done by changing write response handler to not hold
mutation directly, but instead hold a container that, depending on
whether
this is read repair write or regular one, can provide different mutation
per destination.

Message-Id: <20160613124727.GL1096@scylladb.com>
2016-06-13 19:01:51 +03:00
Vlad Zolotarov
89375d4c2a service::storage_proxy: tracing: instrument read_digest and read_mutation_data
Instrument read_digest and read_mutation_data handlers similarly
to a read_data handler instrumentation.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1465304055-4263-1-git-send-email-vladz@cloudius-systems.com>
2016-06-09 14:32:42 +02:00
Vlad Zolotarov
35402b965f service/client_state: don't try to dereference a tracing state if it's not initialized
Call for a tracing::tracing::create_session() doesn't promise a session creation.
Check that the session is actually created before trying to use it.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-06-06 13:00:31 +03:00
Duarte Nunes
c970d682d1 storage_service: Announce range tombstones feature
This patch enables the RANGE_TOMBSTONES supported feature, meaning
that the node is capable of accepting row entry tombstones as range
tombstones.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-06-02 16:21:59 +02:00
Duarte Nunes
91aac30f12 mutations: Row tombstones are now a set of ranges
This patch changes the type of the mutation partition's row_tombstones
to be a range_tombstone_list, so that they are now represented as a
set of disjoint ranges. All of its usages are updated accordingly.

Fixes #1155

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-06-02 16:21:59 +02:00
Duarte Nunes
e46537b7d3 storage_service: Include range tombstones feature
This patch adds the range tombstones feature, which is not enabled
yet, to the storage_service, so that consumers can query for it.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-06-02 16:21:58 +02:00
Vlad Zolotarov
69bd8efc40 storage_proxy: instrument a read_data handler to accept a tracing info
This is a demo instrumentation:
   - Check if a tracing info is present in the read_command.
   - If yes - create a tracing session with the given tracing
     session ID.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-06-01 20:17:25 +03:00
Vlad Zolotarov
4c17a422e0 cql3: instrument a SELECT query to send tracing info
Instrument a coordinator of a SELECT query to send tracing session
info to the corresponding replica Nodes.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-06-01 20:17:25 +03:00