Commit Graph

1412 Commits

Author SHA1 Message Date
Tomasz Grabiec
1d9b88dceb service/storage_service: Introduce cluster_schema_features() 2019-04-28 15:50:12 +02:00
Tomasz Grabiec
6e2c190b5f schema_tables: Propagate storage_service& to merge_schema()
We will need to calculate cluster schema features at the time we
calculate the schema digest.
2019-04-28 12:33:10 +02:00
Piotr Sarna
037b517c85 service: initialize system distributed keyspace after schema agreement
In order to avoid schema disagreements during upgrades (which may lead
to deadlocks), system distributed keyspace initialization is moved
right before starting the bootstrapping process, after the schema
agreement checks already succeeded.

Fixes #3976
Message-Id: <932e642659df1d00a2953df988f939a81275774a.1556204185.git.sarna@scylladb.com>
2019-04-25 18:44:08 +02:00
Gleb Natapov
c6b3b9ff13 cache_hitrate_calculator: wait for ongoing calculation to complete during stop
Currently stop returns ready future immediately. This is not a problem
since calculation loop holds a shared pointer to the local service, so
it will not be destroyed until calculation completes and global database
object db, that also used by the calculation, is never destroyed. But the
later is just a workaround for a shutdown sequence that cannot handle
it and will be changed one day. Make cache hitrate calculation service
ready for it.

Message-Id: <20190422113538.GR21208@scylladb.com>
2019-04-22 14:44:42 +03:00
Gleb Natapov
306f5b99b5 cache_hitrate_calculator: fix use after free in non_system_filter lambda
non_system_filter lambda is defined static which means it is initialized
only once, so the 'this' that is will capture will belong to a shard
where the function runs first. During service destruction the function
may run on different shard and access already other's shard service that
may be already freed.

Fixed #4425

Message-Id: <20190421152139.GN21208@scylladb.com>
2019-04-21 18:22:31 +03:00
Piotr Jastrzebski
2c599122e1 Update supported features on format change
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2019-04-12 10:38:31 +02:00
Piotr Jastrzebski
9c7e3dd470 Add _unbounded_range_tombstones_feature
This requires introduction of storage_service::get_known_features
and using it with check_knows_remote_features.
Otherwise a node joining the existing cluster won't be able to
join because it does not support unbounded range tombstones yet.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2019-04-12 10:37:12 +02:00
Piotr Jastrzebski
96ad8f7df9 Use _sstables_format to determine current format
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2019-04-12 10:37:12 +02:00
Piotr Jastrzebski
7339e9de30 Add service::read_sstables_format
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2019-04-12 10:37:12 +02:00
Piotr Jastrzebski
9934740c39 Register feature listeners in storage_service
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2019-04-12 10:36:58 +02:00
Piotr Jastrzebski
081542cf00 storage_service: add _sstables_format field
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2019-04-12 09:33:40 +02:00
Rafael Ávila de Espíndola
6191fd7701 Avoid duplicated read_keyspace_mutation calls
There were many calls to read_keyspace_mutation. One in each function
that prepares a mutation for some other schema change.

With this patch they are all moved to a single location.

Tests: unit (dev, debug)

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190328024440.26201-1-espindola@scylladb.com>
2019-04-07 09:26:56 +03:00
Vlad Zolotarov
0dc0a6025d query_pager::fetch_page: cosmetics: fix code alignment
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <20190401214030.5570-2-vladz@scylladb.com>
2019-04-02 11:53:10 +03:00
Avi Kivity
f259a4c3b4 Merge "Remove usage of static gossiper object in init.cc and storage_service" from Asias
"
This series removes the usage of the static gossiper object in init.cc
and storage_service.

Follow up series will remove more in other components. This is the
effort to clean up the component dependencies and have better shutdown
procedure.

Tests: tests/gossip_test, tests/cql_query_test, tests/sstable_mutation_test,  dtests.
"

* tag 'asias/storage_service_gossiper_dep_v5' of github.com:cloudius-systems/seastar-dev:
  storage_service: Do not use the global gms::get_local_gossiper()
  storage_service: Pass gossiper object to storage_service
  gms: Remove i_failure_detector.hh
  gossip: Get rid of the gms::get_local_failure_detector static object
  dht: Do not use failure_detector::is_alive in failure_detector_source_filter
  tests: Fix stop snitch in gossip_test.cc
  gossiper: Do not use value_factory from storage_service object
  gossiper: Use cfg options from _cfg instead of get_local_storage_service
  gossiper: Pass db::config object to gossiper class
  init: Pass gossiper object to init_ms_fd_gossiper
2019-03-26 08:54:46 +02:00
Duarte Nunes
93a1c27b31 service/storage_proxy: Don't consider view hints for MV backpressure
When a view replica becomes unavailable, updates to it are stored as
hints at the paired based replica. This on-disk queue of pending view
updates grows as long as there are view updated and the view replica
remains unavailable. Currently, we take that relative queue size into
account when calculating the delay for new base writes, in the context
of the backpressure algorithm for materialized views.

However, the way we're calculating that on-disk backlog is wrong,
since we calculate it per-device and then feed it to all the hints
managers for that device. This means that normal hints will show up as
backlog for the view hints manager, which in turn introduces delays.
This can make the view backpressure mechanism kick-in even if the
cluster uses no materialized views.

There's yet another way in which considering the view hints backlog is
wrong: a view replica that is unavailable for some period of time can
cause the backlog to grow to a point where all base writes are applied
the maximum delay of 1 second. This turns a single-node failure into
cluster unavailability.

The fix to both issues is to simply not take this on-disk backlog into
account for the backpressure algorithm.

Fixes #4351
Fixes #4352

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190321170418.25953-1-duarte@scylladb.com>
2019-03-24 20:29:56 +02:00
Asias He
7447c92d63 storage_service: Do not use the global gms::get_local_gossiper()
Use the gossiper object stored in _gossiper member from storage_service.
2019-03-22 09:11:26 +08:00
Asias He
b91452ed4c storage_service: Pass gossiper object to storage_service
Pass the gossiper object to storage_service class in order to avoid the
usage of the static object returned from get_local_gossiper().
2019-03-22 09:11:26 +08:00
Asias He
af579a055b gossip: Get rid of the gms::get_local_failure_detector static object
Store the failure_detector object inside gossiper object.

- No more the global object sharded<failure_detector>

- No need to initialize sharded<failure_detector> manually which
simplifies the code in tests/cql_test_env.cc and init.cc.
2019-03-22 09:08:51 +08:00
Asias He
2b6a4050c2 dht: Do not use failure_detector::is_alive in failure_detector_source_filter
Switch failure_detector_source_filter to use get_local_gossiper::is_alive
directly since we are going to remove the static
gms::get_local_failure_detector object soon.
Pass the nodes that are down to the filter direclty, to avoid the
range_streamer to depends on gossiper at all.
2019-03-22 08:26:47 +08:00
Asias He
c0f744b407 storage_service: Wait for gossip to settle only if do_bind is set
In commit 71bf757b2c, we call
wait_for_gossip_to_settle() which takes some time to complete in
storage_service::prepare_to_join().

In tests/cql_query_test calls init_server with do_bind == false which in
turn calls storage_service::prepare_to_join(). Since in the test, there
is only one node, there is no point to wait for gossip to settle.

To make the cql_query_test fast again, do not call
wait_for_gossip_to_settle if do_bind is false.

Before this patch, cql_query_test takes forever to complete.
After it takes 10s.

Tests: tests/cql_query_test
Message-Id: <3ae509e0a011ae30eef3f383c6a107e194e0e243.1553147332.git.asias@scylladb.com>
2019-03-21 12:46:00 -03:00
Glauber Costa
34b640993f storage proxy: add tracepoints about delays
When we are tracing requests, we would like to know everything that
happened to a query that can contribute to it having increased
latencies.

We insert some of those latencies explicitly due to throttling, but we
do not log that into tracing.

In the case of storage proxy, we do have a log message at trace level
but that is rarely used: trace messages are too heavy of a hammer, there
is no way to specify specific queries, etc.

The correct place for that is CQL tracing. This patch moves that message
to CQL tracing. We also add a matching tracepoint assuring us that no
delay happened if that's the case.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190320163350.15075-1-glauber@scylladb.com>
2019-03-21 12:45:52 -03:00
Avi Kivity
eddb98e8c6 Merge "sstables: mc: Write and read static compact tables the same way as Cassandra" from Tomasz
"
Static compact tables are tables with compact storage and no
clustering columns.

Before this patch, Scylla was writing rows of static compact tables as
clustered rows instead of as static rows. That's because in our in-memory
model such tables have regular rows and no static row. In Cassandra's
schema (since 3.x), those tables have columns which are marked as
static and there are no regular columns.

This worked fine as long as Scylla was writing and reading those
sstables. But when importing sstables from Cassandra, our reader was
skipping the static row, since it's not present in our schema, and
returning no rows as a result. Also, Cassandra, and Scylla tools,
would have problems reading those sstables.

Fix this by writing rows for such tables the same way as Cassandra
does. In order to support rolling downgrade, we do that only when all
nodes are upgraded.

Fixes #4139.

Tests:

  - unit (dev)
"

* tag 'static-compact-mc-fix-v3.1' of github.com:tgrabiec/scylla:
  tests: sstables: Test reading of static compact sstable generated by Cassandra
  tests: sstables: Add test for writing and reading of static compact tables
  sstables: mc: Write static compact tables the same way as Cassandra
  sstable: mc: writer: Set _static_row_written inside write_static_row()
  sstables: Add sstable::features()
  sstables: mc: writer: Prepare write_static_row() for working with any column_kind
  storage_service: Introduce the CORRECT_STATIC_COMPACT feature flag
  sstables: mc: writer: Build indexed_columns together with serialization_header
  sstables: mc: writer: De-optimize make_serialization_header()
  sstable: mc: writer: Move attaching of mc-specific components out of generic code
2019-03-21 12:45:51 -03:00
Nadav Har'El
7c874057f5 materialized_views: propagate "view virtual columns" between nodes
db::schema_tables::ALL and db::schema_tables::all_tables() are both supposed
to list the same schema tables - the former is the list of their names, and
the latter is the list of their schemas. This code duplication makes it easy
to forget to update one of them, and indeed recently the new
"view_virtual_columns" was added to all_tables() but not to ALL.

What this patch does is to make ALL a function instead of constant vector.
The newly named all_table_names() function uses all_tables() so the list
of schema tables only appears once.

So that nobody worries about the performance impact, all_table_names()
caches the list in a per-thread vector that is only prepared once per thread.

Because after this patch all_table_names() has the "view_virtual_columns"
that was previously missing, this patch also fixes #4339, which was about
virtual columns in materialized views not being propagated to other nodes.

Unfortunately, to test the fix for #4339 we need a test with multiple
nodes, so we cannot test it here in a unit test, and will instead use
the dtest framework, in a separate patch.

Fixes #4339

Branches: 3.0
Tests: all unit tests (release and debug mode), new dtest for #4339. The unit test mutation_reader_test failed in debug mode but not in release mode, but this probably has nothing to do with this patch (?).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Message-Id: <20190320063437.32731-1-nyh@scylladb.com>
2019-03-20 09:14:59 -03:00
Nadav Har'El
ccf731a820 Materialized views: add metric for current flow-control delay
The materialized views flow control mechanism works by adding a certain
delay to each client request, designed to slow down the client to the
rate at we can complete the background view work. Until now we could observe
this mechanism only indirectly, in whether or not it succeeded to keep the
view backlog bounded; But we had no way to directly observe the delay that
we decided to add. In fact, we had a bug where this delay was constantly
zero, and we didn't even notice :-)

So in this patch we add a new metric,
scylla_storage_proxy_coordinator_last_mv_flow_control_delay

The metric is a floating point number, in units of seconds.

This metric is somewhat peculiar that it always contains the *last* delay
used for some request - unlike other metrics it doesn't measure the "current"
value of something. Moreover, it can jump wildly because there is no
guarantee that each request's delay will be identical (in particular,
different requests may involve different base replicas which have different
view backlogs, so decide on different delays). In the future we may want
to supplement this metric with some sort of delay histogram. But even
this simple metric is already useful to debug certain scenarios and
understand if the materialized-views flow control is working or not.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190227133630.26328-1-nyh@scylladb.com>
2019-03-20 09:14:59 -03:00
Asias He
71bf757b2c gossiper: Enable features only after gossip is settled
n1, n2, n3 in the cluster,

shutdown n1, n2, n3

start n1, n2

start n3, we saw features are enabled using the system table while n1 and n2 are already up and running in the cluster.

INFO  2019-02-27 09:24:41,023 [shard 0] gossip - Feature check passed. Local node 127.0.0.3 features = {CORRECT_COUNTER_ORDER, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, COUNTERS, DIGEST_MULTIPARTITION_READ, INDEXES, LARGE_PARTITIONS, LA_SSTABLE_FORMAT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT, RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_TABLES_V3, STREAM_WITH_RPC_STREAM, TRUNCATION_TABLE, WRITE_FAILURE_REPLY, XXHASH}, Remote common_features = {CORRECT_COUNTER_ORDER, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, COUNTERS, DIGEST_MULTIPARTITION_READ, INDEXES, LARGE_PARTITIONS, LA_SSTABLE_FORMAT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT, RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_TABLES_V3, STREAM_WITH_RPC_STREAM, TRUNCATION_TABLE, WRITE_FAILURE_REPLY, XXHASH}
INFO  2019-02-27 09:24:41,025 [shard 0] storage_service - Starting up server gossip
INFO  2019-02-27 09:24:41,063 [shard 0] gossip - Node 127.0.0.1 does not contain SUPPORTED_FEATURES in gossip, using features saved in system table, features={CORRECT_COUNTER_ORDER, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, COUNTERS, DIGEST_MULTIPARTITION_READ, INDEXES, LARGE_PARTITIONS, LA_SSTABLE_FORMAT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT, RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_TABLES_V3, STREAM_WITH_RPC_STREAM, TRUNCATION_TABLE, WRITE_FAILURE_REPLY, XXHASH}
INFO  2019-02-27 09:24:41,063 [shard 0] gossip - Node 127.0.0.2 does not contain SUPPORTED_FEATURES in gossip, using features saved in system table, features={CORRECT_COUNTER_ORDER, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, COUNTERS, DIGEST_MULTIPARTITION_READ, INDEXES, LARGE_PARTITIONS, LA_SSTABLE_FORMAT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT, RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_TABLES_V3, STREAM_WITH_RPC_STREAM, TRUNCATION_TABLE, WRITE_FAILURE_REPLY, XXHASH}

The problem is we enable the features too early in the start up process.
We should enable features after gossip is settled.

Fixes #4289
Message-Id: <04f2edb25457806bd9e8450dfdcccc9f466ae832.1551406991.git.asias@scylladb.com>
2019-03-18 18:25:29 +01:00
Tomasz Grabiec
fefef7b9eb storage_service: Introduce the CORRECT_STATIC_COMPACT feature flag
When enabled on all nodes, sstable writers will start to produce
correct MC-format sstables for compact storage tables by writing rows
into the static row (like C*) rather than into the regular row.

We only do that when all nodes are upgraded to support rolling
downgrade. After all nodes are upgraded, regular rolling downgrade will
not be possible.

Refs #4139
2019-03-18 11:18:33 +01:00
Asias He
1d59f26c11 gossiper: Fix empty remote common_features in check_knows_remote_features
Three nodes in the cluster node1, node2, node3

Shutdown the whole cluster

Start node1

Start node2, node2 sees empty remote common_features.

   gossip - Feature check passed.  Local node 127.0.0.2 features =
   {CORRECT_COUNTER_ORDER, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, COUNTERS,
   DIGEST_MULTIPARTITION_READ, INDEXES, LARGE_PARTITIONS,
   LA_SSTABLE_FORMAT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT,
   RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_TABLES_V3,
   STREAM_WITH_RPC_STREAM, WRITE_FAILURE_REPLY, XXHASH},
   Remote common_features = {}

The problem is node3 hasn't started yet, node1 sees node3 has empty
features. In get_supported_features(), an empty common features will be
returned if an empty features of a node is seen. To fix, we should
fallback to use the features saved in system table.

Start node3, node3 sees empty remote common_features.

   gossip - Feature check passed. Local node 127.0.0.3 features =
   {CORRECT_COUNTER_ORDER, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, COUNTERS,
   DIGEST_MULTIPARTITION_READ, INDEXES, LARGE_PARTITIONS,
   LA_SSTABLE_FORMAT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT,
   RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_TABLES_V3,
   STREAM_WITH_RPC_STREAM, WRITE_FAILURE_REPLY, XXHASH},
   Remote common_features = {}

The problem is node3 hasn't inserted its own features into gossip
endpoint_state_map. get_supported_features() returns the common features
of all nodes in endpoint_state_map. To fix, we should fallback to use
the features stored in the system table for such node in this case.

Fixes #4225
2019-03-18 10:56:10 +01:00
Piotr Sarna
2e05d86cf3 service: reduce number of spawned threads when notifying
Commit 9c544df217 introduced running up/down/join/leave notifications
in threaded context, but spawned a thread for every notification,
while it could be done once for all notifiees.

Reported-by: Avi Kivity <avi@scylladb.com>
Message-Id: <34815d5aa11902c4a052cff38f4c45c45ff919d8.1552897848.git.sarna@scylladb.com>
2019-03-18 10:45:47 +02:00
Piotr Sarna
9c544df217 service: run notifying code in threaded context
In order to allow yielding when handling endpoint lifecycle changes,
notifiers now run in threaded context.
Implementations which used this assumption before are supplemented
with assertions that they indeed run in seastar::async mode.

Fixes #4317
Message-Id: <45bbaf2d25dac314e4f322a91350705fad8b81ed.1552567666.git.sarna@scylladb.com>
2019-03-14 12:56:53 +00:00
Piotr Sarna
aea4b7ea78 service: remove unused stop_hints_manager
Stopping hints manager now occurs when draining storage proxy
and it shouldn't be executed independently, so it's removed
from external API.
2019-03-07 13:44:06 +01:00
Piotr Sarna
cc806909d7 storage_proxy: add drain_on_shutdown implementation
When storage proxy is shutting down, all interruptible writes
can be timed out in order not to wait for them. Instead, the mechanism
will fall back to storing hints and/or not progressing with view
building.
2019-03-07 13:44:05 +01:00
Piotr Sarna
c61d0ee8aa main: register storage proxy as lifecycle subscriber
In order to be able to act when node joins/leaves, storage proxy
is registered as an endpoint lifecycle subscriber.

Fixes #3826
Fixes #4028
2019-03-07 12:10:40 +01:00
Piotr Sarna
92df1d5a6b storage_proxy: add endpoint_lifecycle_subscriber interface
Storage proxy is able to react to membership changes
in order to cancel long-standing operations for an endpoint.
2019-03-07 12:10:40 +01:00
Piotr Sarna
f9ff97511f storage_proxy: register view update handlers for view write type
View update handlers have a specialized class, so all writes
of type write_type::VIEW are now registered as such.
2019-03-07 12:10:40 +01:00
Piotr Sarna
75ec5fa876 storage_proxy: add intrusive list of view write handlers
In order to be able to iterate over view update write response handlers,
an intrusive list of them is added to storage proxy. This way
iteration can be easily yielded without invalidating operators and all
logic is moved to slow path.
2019-03-07 12:10:40 +01:00
Piotr Sarna
c2048a0758 storage_proxy: add view_update_write_response_handler
View update write response handler inherits from a regular write
response handler, but it's also possible to link it intrusively
in order to be able to induce timeouts on them later.
2019-03-07 12:10:40 +01:00
Jesse Haber-Kucharsky
a139afc30c auth: Reject logins from disallowed roles
When the `LOGIN` option for a role is set to `false`, Scylla should not
permit the role to log in.

Fixes #4284

Tests: unit (debug)
2019-02-28 15:02:53 -05:00
Avi Kivity
88322086cb Merge "Add fuzzer-type unit test for range scans" from Botond
"
This series adds a fuzzer-type unit test for range scans, which
generates a semi-random dataset and executes semi-random range scans
against it, validating the result.
This test aims to cover a wide range of corner cases with the help of
randomness. Data and queries against it are generated in such a way that
various corner cases and their combinations are likely to be covered.

The infrastructure under range-scans have gone under massive changes in
the last year, growing in complexity and scope. The correctness of range
scans is critical for the correct functioning of any Scylla cluster, and
while the current unit tests served well in detecting any major problems
(mostly while developing), they are too simplistic and can only be
relied on to check the correctness of the basic functionality. This test
aims to extend coverage drastically, testing cases that the author of
the range-scan code or that of the existing unit tests didn't even think
exists, by relying on some randomness.

Fixes: #3954 (deprecates really)
"

* 'more-extensive-range-scan-unit-tests/v2' of https://github.com/denesb/scylla:
  tests/multishard_mutation_query_test: add fuzzy test
  tests/multishard_mutation_query_test: refactor read_all_partitions_with_paged_scan()
  tests/test_table: add advanced `create_test_table()` overload
  tests/test_table: make `create_test_table()` customizable
  query: add trim_clustering_row_ranges_to()
  tests/test_table: add keyspace and table name params
  tests/test_table: s/create_test_cf/create_test_table/
  tests: move create_test_cf() to tests/test_table.{hh,cc}
  tests/multishard_mutation_query_test: drop many partition test
  tests/multishard_mutation_query_test: drop range tombstone test
2019-02-27 17:26:53 +02:00
Nadav Har'El
da54d0fc7d Materialized views: fix accidental zeroing of flow-control delay
The materialized-views flow control carefully calculates an amount of
microseconds to delay a client to slow it down to the desired rate -
but then a typo (std::min instead of std::max) causes this delay to
be zeroed, which in effect completely nullifies the flow control
algorithm.

Before this fix, experiments suggested that view flow control was
not having any effect and view backlog not bounded at all. After this
fix, we can see the flow control having its desired effect, and the
view backlog converging.

Fixes #4143.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190226161452.498-1-nyh@scylladb.com>
2019-02-26 18:22:18 +02:00
Tomasz Grabiec
b06aac4fdb Merge "Fix temporary spurious schema version mismatch when nodes are restarted" from Asias
Fixes: #4148
Fixes: #4258

Tests: resharding_test.py:reshardingtest_nodes4_with_sizetieredcompactionstrategy.resharding_by_smp_increase_test

* seastar-dev.git asias/fix_schema_mismatch_when_nodes_restarts/v1:
  database: Add update_schema_version and announce_schema_version
  storage_service: Add application_state::SCHEMA when gossip starts
2019-02-26 12:55:52 +01:00
Avi Kivity
5f94bc902a transport: add option to disable shard-aware drivers
The shard-aware drivers can cause a huge amount of connections to be created
when there are tens of thousands of clients. While normally the shard-aware
drivers are beneficial, in those cases they can consume too much memory.

Provide an option to disable shard awareness from the server (it is likely to
be easier to do this on the server than to reprovision those thousands of
clients).

Tests: manual test with wireshark.
Message-Id: <20190223173331.24424-1-avi@scylladb.com>
2019-02-26 12:44:11 +01:00
Asias He
459836079c storage_service: Add application_state::SCHEMA when gossip starts
In resharding_test.py:ReshardingTest_nodes4_with_SizeTieredCompactionStrategy.resharding_by_smp_increase_test, we saw:

   4 nodes in the tests

   n1, n2, n3, n4 are started

   n1 is stopped

   n1 is changed to use different shard config

   n1 is restarted ( 2019-01-27 04:56:00,377 )

The backtrace happened on n2 right fater n1 restarts:

   0 INFO 2019-01-27 04:56:05,175 [shard 0] gossip - Feature STREAM_WITH_RPC_STREAM is enabled
   1 INFO 2019-01-27 04:56:05,175 [shard 0] gossip - Feature WRITE_FAILURE_REPLY is enabled
   2 INFO 2019-01-27 04:56:05,175 [shard 0] gossip - Feature XXHASH is enabled
   3 WARN 2019-01-27 04:56:05,177 [shard 0] gossip - Fail to send EchoMessage to 127.0.58.1: seastar::rpc::closed_error (connection is closed)
   4 INFO 2019-01-27 04:56:05,205 [shard 0] gossip - InetAddress 127.0.58.1 is now UP, status =
   5 Segmentation fault on shard 0.
   6 Backtrace:
   7 0x00000000041c0782
   8 0x00000000040d9a8c
   9 0x00000000040d9d35
   10 0x00000000040d9d83
   11 /lib64/libpthread.so.0+0x00000000000121af
   12 0x0000000001a8ac0e
   13 0x00000000040ba39e
   14 0x00000000040ba561
   15 0x000000000418c247
   16 0x0000000004265437
   17 0x000000000054766e
   18 /lib64/libc.so.6+0x0000000000020f29
   19 0x00000000005b17d9

The theory is: migration_manager::maybe_schedule_schema_pull is scheduled, at this time
n1 has SCHEMA application_state, when n1 restarts, n2 gets new application
state from n1 which does not have SCHEMA yet, when migration_manager::maybe_schedule
wakes up from the 60 sleep, n1 has non-empty endpoint_state but empty
application_state for SCHEMA. We dereference the nullptr
application_state and abort.

In commit da80f27f44, we fixed the problem by
checking the pointer before dereference.

To prevent this to happen in the first place, we'd better to add
application_state::SCHEMA when gossip starts. This way, peer nodes
always see the application_state::SCHEMA when a node restarts.

Tests: resharding_test.py:ReshardingTest_nodes4_with_SizeTieredCompactionStrategy.resharding_by_smp_increase_test

Fixes #4148
Fixes #4258
2019-02-26 19:30:22 +08:00
Piotr Sarna
c743617236 cql3: unify max value for row limit and per-partition limit
Limits are stored as uint32_t everywhere, but in some places
int32_t was used, which created inconsistencies when comparing
the value to std::numeric_limits<Type>::max().
In order to solve inconsistencies, the types are unified to uint32_t,
and instead of explicitly calling numeric limit max,
an already existing constant value query::max_rows is utilized.

Fixes #4253

Message-Id: <4234712ff61a0391821acaba63455a34844e489b.1550683120.git.sarna@scylladb.com>
2019-02-21 13:56:02 +02:00
Duarte Nunes
6e83457b1b Merge 'Add PER PARTITION LIMIT' from Piotr
"
This series introduces PER PARTITION LIMIT to CQL.
Protocol and storage is already capable of applying per-partition limits,
so for nonpaged queries the changes are superficial - a variable is parsed
and passed down.
For paged queries and filtering the situation is a little bit more complicated
due to corner cases: results for one partition can be split over 2 or more pages,
filtering may drop rows, etc. To solve these, another variable is added to paging
state - the number of rows already returned from last served partition.
Note that "last" partition may be stretched over any number of pages, not just the
last one, which is a case especially when considering filtering.
As a result, per-partition-limiting queries are not eligible for page generator
optimization, because they may need to have their results locally filtered
for extraneous rows (e.g. when the next page asks for  per-partition limit 5,
but we already received 4 rows from the last partition, so need just 1 more
from last partition key, but 5 from all next ones).

Tests: unit (dev)

Fixes #2202
"

* 'add_per_partition_limit_3' of https://github.com/psarna/scylla:
  tests: remove superficial ignore_order from filtering tests
  tests: add filtering with per partition key limit test
  tests: publish extract_paging_state and count_rows_fetched
  tests: fix order of parameters in with_rows_ignore_order
  cql3,grammar: add PER PARTITION LIMIT
  idl,service: add persistent last partition row count
  cql3: prevent page generator usage for per-partition limit
  cql3: add checking for previous partition count to filtering
  pager: add adjusting per-partition row limit
  cql3: obey per partition limit for filtering
  cql3: clean up unneeded limit variables
  cql3: obey per partition limit for select statement
  cql3: add get_per_partition_limit
  cql3: add per_partition_limit to CQL statement
2019-02-18 14:47:11 +00:00
Piotr Sarna
acf7bedad4 idl,service: add persistent last partition row count
In order to process paged queries with per-partition limits properly,
paging state needs to keep additional information: what was the row
count of last partition returned in previous run.
That's necessary because the end of previous page and the beginning
of current one might consist of rows with the same partition key
and we need to be able to trim the results to the number indicated
by per-partition limit.
2019-02-18 11:06:44 +01:00
Piotr Sarna
1dadae212a cql3: add checking for previous partition count to filtering
Filtering now needs to take into account per partition limits as well,
and for that it's essential to be able to compare partition keys
and decide which rows should be dropped - if previous page(s) contained
rows with the same partition key, these need to be taken into
consideration too.
2019-02-18 11:06:43 +01:00
Piotr Sarna
82a3883575 pager: add adjusting per-partition row limit
For filtering pagers, per partition limit should be set
to page size every time a query is executed, because some rows
may potentially get dropped from results.
2019-02-18 10:55:52 +01:00
Piotr Sarna
b965c3778f cql3: obey per partition limit for filtering
Filtering queries now take into account the limit of rows
per single partition provided by the user.
2019-02-18 10:29:34 +01:00
Gleb Natapov
b01a659014 storage_proxy: remove old Cassandra code
Part of the code is already implemented (counters and hinted-handoff).
Part of the code will probably never be (triggers). And the rest is
the code that estimates number of rows per range to determine query
parallelism, but we implemented exponential growth algorithms instead.

Message-Id: <20190214112226.GE19055@scylladb.com>
2019-02-18 10:34:55 +02:00
Avi Kivity
a1567b0997 Merge "replace get_restricted_ranges() function with generator interface" from Gleb
"
get_restricted_ranges() is inefficient since it calculates all
vnodes that cover a requested key ranges in advance, but callers often
use only the first one.  Replace the function with generator interface
that generates requested number of vnodes on demand.
"

* 'gleb/query_ranges_to_vnodes_generator' of github.com:scylladb/seastar-dev:
  storage_proxy: limit amount of precaclulated ranges by query_ranges_to_vnodes_generator
  storage_proxy: remove old get_restricted_ranges() interface
  cql3/statements/select_statement: convert index query interface to new query_ranges_to_vnodes_generator interface
  tests: convert storage_proxy test to new query_ranges_to_vnodes_generator interface
  storage_proxy: convert range query path to new query_ranges_to_vnodes_generator interface
  storage_proxy: introduce new query_ranges_to_vnode_generator interface
2019-02-18 10:33:54 +02:00