Commit Graph

1083 Commits

Calle Wilund
8c257c40b4 storage_service: Only replicate token metadata if modified in on_change
Fixes #2869

Message-Id: <20171101105629.22104-1-calle@scylladb.com>
2017-11-01 14:56:55 +02:00
Duarte Nunes
044b8deae4 Merge 'Solves problems related to gossip which can be observed in a large cluster' from Tomasz
"The main problem fixed is slow processing of application state changes.
This may lead to a bootstrapping node not having an up-to-date view of the
ring, and serving incorrect data.

Fixes #2855."

* tag 'tgrabiec/gossip-performance-v3' of github.com:scylladb/seastar-dev:
  gms/gossiper: Remove periodic replication of endpoint state map
  gossiper: Check for features in the change listener
  gms/gossiper: Replicate changes incrementally to other shards
  gms/gossiper: Document validity of endpoint_state properties
  storage_service: Update token_metadata after changing endpoint_state
  gms/gossiper: Process endpoints in parallel
  gms/gossiper: Serialize state changes and notifications for given node
  utils/loading_shared_values: Allow Loader to return non-future result
  gms/gossiper: Encapsulate lookup of endpoint_state
  storage_service: Batch token metadata and endpoint state replication
  utils/serialized_action: Introduce trigger_later()
  gossiper: Add and improve logging
  gms/gossiper: Don't fire change listeners when there is no change
  gms/gossiper: Allow parallel apply_state_locally()
  gms/gossiper: Avoid copies in endpoint_state::add_application_state()
  gms/failure_detector: Ignore short update intervals
2017-10-18 10:13:25 +01:00
Tomasz Grabiec
2d5fb9d109 gms/gossiper: Replicate changes incrementally to other shards
storage_service depends on endpoint states being replicated to all
shards before token metadata is replicated. Currently this is taken
care of by storage_service::replicate_to_all_cores(), invoked from
storage_service's change listener. It copies the whole endpoint state map,
which is expensive in large clusters. It is more efficient to replicate
only the incremental changes, and only once, rather than once for each
application state.
2017-10-18 08:49:53 +02:00
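The incremental scheme described above can be sketched as follows. This is an illustrative Python model, not Scylla's C++ implementation; `shard_maps` and `changes` are hypothetical stand-ins for the per-shard endpoint_state maps and the batch of accumulated diffs:

```python
# Sketch: instead of copying the whole endpoint_state map to every shard on
# each change, replicate only the entries that changed since the last
# replication. A `None` state marks a removed endpoint.
def replicate_incremental(shard_maps, changes):
    """Apply a batch of (endpoint -> state) diffs to every shard's replica."""
    for shard_map in shard_maps:
        for endpoint, state in changes.items():
            if state is None:
                shard_map.pop(endpoint, None)   # endpoint was removed
            else:
                shard_map[endpoint] = state     # endpoint added or updated
```

The cost per change is proportional to the size of the diff, not to the size of the whole map.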
Tomasz Grabiec
cf113ed295 storage_service: Update token_metadata after changing endpoint_state
There is a requirement that whatever is present in token_metadata
must also be present in endpoint_state. Because of that, we should update
endpoint_state first (set_gossip_tokens).

Apache Cassandra switched to this order as well in commit
b39d984f7bd682c7638415d65dcc4ac9bcb74e5f.
2017-10-18 08:49:53 +02:00
Tomasz Grabiec
6263b0ebb6 storage_service: Batch token metadata and endpoint state replication
Replication needs to be serialized. We can batch replication requests
which are waiting to start. Use serialized_action, which does this.
2017-10-18 08:49:52 +02:00
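The batching behavior of serialized_action can be modeled roughly like this; an illustrative Python sketch (not Scylla's C++ API): triggers that arrive while the action is running are coalesced into a single follow-up run, so any number of overlapping requests cost at most one extra execution.

```python
# Sketch of serialized_action-style batching: runs are serialized, and all
# triggers that arrive during a run are satisfied by one follow-up run.
class SerializedAction:
    def __init__(self, action):
        self._action = action    # the work to serialize (e.g. replication)
        self._running = False
        self._pending = False

    def trigger(self):
        if self._running:
            self._pending = True   # batch with the next run
            return
        self._running = True
        try:
            while True:
                self._pending = False
                self._action()
                if not self._pending:
                    break          # no triggers arrived during the run
        finally:
            self._running = False
```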
Pekka Enberg
ae92055b52 Merge "Bring histogram closer to what Prometheus expects" from Glauber
"Histograms are a native prometheus type, and there are many functions
available that operate on them. There is extensive documentation about
them at https://prometheus.io/docs/practices/histograms/

One example is the function histogram_quantile(), that can extract
useful quantiles from the histograms. Currently, those functions don't
work well.

The reasons are twofold:
1) We are only exporting 16 metrics, starting from 1usec. That means
that the highest latency we can differentiate is 4ms. After that,
everything falls into the same bin.

2) The format that Prometheus expects is that each bin contains
the total number of points seen *up until that bin*, while we
currently export the number of points that fall between bins.
In other words, Prometheus expects a cumulative histogram.

About point two: granted, it is a bit hidden in their website, but it is
there. The following phrase about a caveat makes it clear:

"Note that we divide the sum of both buckets. The reason is that the
histogram buckets are cumulative. The le="0.3" bucket is also contained
in the le="1.2" bucket; dividing it by 2 corrects for that."

It is also unnecessary to accumulate points that fall beyond the last bin:
the _count component of the histogram already accounts for that."

Acked-by: Amnon Heiman <amnon@scylladb.com>
Acked-by: Gleb Natapov <gleb@scylladb.com>

* 'prometheus-histograms' of github.com:glommer/scylla:
  storage_proxy: change reporting of estimated histograms
  estimated_histogram: bring histogram closer to what prometheus expects.
2017-10-17 20:23:10 +03:00
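The per-bin-to-cumulative conversion this series describes can be illustrated with a small sketch (Python, bucket layout assumed): each `le` bucket ends up holding the count of all samples at or below its bound, and the last value equals the total sample count.

```python
# Turn per-bin counts (what was exported before) into Prometheus-style
# cumulative bucket counts (what histogram_quantile() expects).
def to_cumulative(bucket_counts):
    out, total = [], 0
    for count in bucket_counts:
        total += count
        out.append(total)   # each bucket includes all preceding buckets
    return out
```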
Duarte Nunes
e9358c1c83 service/storage_service: Avoid copies in prepare_replacement_info()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
674f5d8eaf service/storage_service: Cleanup get_application_state_value()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
0ccb9211d7 service/storage_service: Cleanup handle_state_removing()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
bdee795876 service/storage_service: Cleanup get_rpc_address()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
03e6fc95ba service/migration_manager: Avoid copies in is_ready_for_bootstrap()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
72ca6b34ef service/migration_manager: Cleanup has_compatible_schema_tables_version()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
976324bbb8 service/migration_manager: Fix usages of get_application_state()
We were taking a reference to a temporary value in different places.
Fix them by using get_application_state_ptr(), which also avoids a copy.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
25b0654312 service/load_broadcaster: Avoid copy in on_join()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
2210d10552 gms/gossiper: Cleanup is_alive()
Make it use get_endpoint_state_for_endpoint_ptr(), check whether the
gossiper is enabled, mark it as const, and have some callers use it instead
of open-coding the logic.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
ceebbe14cc gossiper: Avoid endpoint_state copies
gossiper::get_endpoint_state_for_endpoint() returns a copy of
endpoint_state, which we've seen can be very expensive.

This patch adds a similar function which returns a pointer instead,
and changes the call sites where using the pointer-returning variant
is deemed safe (the pointer neither escapes the function, nor crosses
any defer point).

Fixes #764

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-10 13:48:02 +01:00
Duarte Nunes
198b1b76b5 storage_service: Remove duplicate endpoint state check
We already performed the check, so we don't need to do it again.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-10 13:25:34 +01:00
Glauber Costa
189ef02596 storage_proxy: change reporting of estimated histograms
We currently collapse the histograms into 16 points, exponentially
increasing in value, starting from 1.

While reducing the number of points is a worthy goal, the current
configuration caps us at 4ms. Our latencies tend to be higher than this.

Starting from 1 is also a bit of an exaggeration: our latencies are
rarely in that range. This patch changes reporting so that we
report 20 points starting from 32.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-10-04 20:01:15 -04:00
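As a rough model of the new bucket layout; the doubling growth factor here is an assumption made for illustration, since the commit only states "20 points starting from 32":

```python
# Hypothetical sketch of exponentially increasing bucket bounds: 20 points
# starting from 32 (microseconds). The factor of 2 is assumed, not taken
# from the patch.
def bucket_bounds(first=32, count=20, factor=2):
    bounds, value = [], first
    for _ in range(count):
        bounds.append(value)
        value *= factor
    return bounds
```

With these assumptions the top bucket reaches well into the seconds range, instead of capping out at a few milliseconds.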
Tomasz Grabiec
b704710954 migration_manager: Make sure schema pulls eventually happen when schema_tables_v3 is enabled
We don't pull schema during a rolling upgrade, that is, until the
schema_tables_v3 feature is enabled on all nodes.

Because features are enabled from the gossiper timer, there is a race
between feature enablement and the processing of endpoint states that may
trigger a schema pull. It can happen that we first try to pull, but
only later enable the feature. In that case the schema pull will not
happen until the next schema change.

The fix is to ensure that pulls abandoned because the feature was not yet
enabled are retried once it is enabled.

Fixes sporadic failure in dtest:

  repair_additional_test.py:RepairAdditionalTest.repair_schema_test
Message-Id: <1506428715-8182-2-git-send-email-tgrabiec@scylladb.com>
2017-09-27 12:00:07 +01:00
Asias He
4b1034b9cd storage_service: Remove the stream_hints
Our hinted handoff implementation will not use the
db::system_keyspace::HINTS system table to store hints.
No need to stream them.

Acked-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <3b9190e250b54321ceb87767f4722c7458d41797.1506391500.git.asias@scylladb.com>
2017-09-26 19:05:21 +03:00
Tomasz Grabiec
8e46d15f91 storage_service: Register features before joining
Since commit 8378fe190, we disable schema sync in a mixed cluster.
The detection is done using gossiper features. We need to make sure
the features are registered, and thus can be enabled, before the
bootstrap of a non-seed node happens. Otherwise the bootstrap will
hang waiting on a schema sync that will never happen.
Message-Id: <1505893837-27876-2-git-send-email-tgrabiec@scylladb.com>
2017-09-25 09:13:02 +01:00
Tomasz Grabiec
b92dcb0284 storage_service: Extract register_features()
Message-Id: <1505893837-27876-1-git-send-email-tgrabiec@scylladb.com>
2017-09-25 09:12:46 +01:00
Asias He
ebc3bada12 storage_service: Check gossip feature update in replicate_tm_and_ep_map
This is another place where we can update endpoint_state_map, in addition
to gossiper::run().

Call gossiper::maybe_enable_features() so that we won't miss gossip
feature updates.
2017-09-20 16:58:33 +08:00
Asias He
173cba67ba storage_service: Remove rpc client on all shards in on_dead
We should close connections to nodes that are down on all shards, not
just on the shard which runs the on_dead gossip callback.

Found by Gleb.
Message-Id: <527a14105a07218066e9f1da943693d9de6993e5.1505894260.git.asias@scylladb.com>
2017-09-20 10:23:31 +02:00
Avi Kivity
55e0b63e65 storage_proxy: scan more nodes exponentially to achieve target result set size
The current sequential scan can take a long time on a small or empty table
with a large (nr_nodes * nr_vnodes) count, and can time out. Switching to
exponential scan reduces the time.

Fixes #1230.
Message-Id: <20170912173803.8277-1-avi@scylladb.com>
2017-09-18 15:15:15 +02:00
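The exponential strategy can be sketched like so; an illustrative Python model where `ranges`, `fetch`, and `target_rows` are hypothetical stand-ins for the vnode ranges, the per-batch query, and the target result-set size:

```python
# Sketch of exponential scanning: instead of querying vnode ranges one at a
# time, double the number of ranges queried per round until the target
# result-set size is reached. An empty table is covered in O(log n) rounds
# instead of O(n).
def exponential_scan(ranges, fetch, target_rows):
    """fetch(batch) -> list of rows for that batch of ranges."""
    results, i, batch_size = [], 0, 1
    while i < len(ranges) and len(results) < target_rows:
        results.extend(fetch(ranges[i:i + batch_size]))
        i += batch_size
        batch_size *= 2   # scan exponentially more ranges each round
    return results
```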
Tomasz Grabiec
5a92c18e63 migration_manager: Disable pulls during rolling upgrade from 1.7
If there is a schema pull during a rolling upgrade between two 2.0 nodes,
then the schema merge will delete the persisted schema version. When the node
loads that table again, e.g. on restart, it will generate a version
different from the one the 1.7 nodes use. This will
cause reads and writes to fail.

To avoid this, disable pulls until all nodes are upgraded.

Fixes #2802.
2017-09-14 20:26:31 +02:00
Tomasz Grabiec
713d75fd51 storage_service: Introduce SCHEMA_TABLES_V3 feature 2017-09-14 20:26:31 +02:00
Gleb Natapov
31e803a36c storage_proxy: wire up percentile speculative read properly
Collect coordinator-side read statistics per CF and use them in the
percentile speculative read executor. Getting a percentile from an
estimated_histogram object is rather expensive, so cache it and recalculate
only once per second (or when the requested percentile changes).

Fixes #2757

Message-Id: <20170911131752.27369-3-gleb@scylladb.com>
2017-09-14 10:31:26 +03:00
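The once-per-second caching can be sketched as follows (illustrative Python; the interface is hypothetical, not Scylla's): the expensive percentile computation is skipped while the cached value is fresh and the requested percentile is unchanged.

```python
import time

# Sketch: cache the computed percentile, recomputing at most once per `ttl`
# seconds, or immediately when a different percentile is requested.
class CachedPercentile:
    def __init__(self, compute, ttl=1.0, clock=time.monotonic):
        self._compute = compute   # the expensive histogram percentile calc
        self._ttl = ttl
        self._clock = clock
        self._cached = None       # (percentile, value, timestamp)

    def get(self, percentile):
        now = self._clock()
        if self._cached is not None:
            p, value, ts = self._cached
            if p == percentile and now - ts < self._ttl:
                return value      # fresh enough: skip the recomputation
        value = self._compute(percentile)
        self._cached = (percentile, value, now)
        return value
```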
Avi Kivity
aebab377d9 storage_service: add missing include 2017-09-11 20:09:45 +03:00
Avi Kivity
9b540eccb0 database: remove dependency on compaction.hh and compaction_manager.hh 2017-09-11 20:09:45 +03:00
Gleb Natapov
d0d8bdf615 storage_proxy: remove unused parameter from get_restricted_ranges() function
Message-Id: <20170911084653.GH24167@scylladb.com>
2017-09-11 11:58:44 +02:00
Gleb Natapov
f66e9377d4 storage_proxy: do not keep reference to a keyspace during write
A keyspace can be deleted while write is ongoing, so the object cannot
be used after defer point. The keyspace reference is only used to check
how many replies a write operation should wait for and this can be
precalculated during write handler creation.

Fixes #2777

Message-Id: <20170911084436.GG24167@scylladb.com>
2017-09-11 11:57:00 +02:00
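A minimal sketch of the fix, assuming a hypothetical handler interface: the number of required replies is derived from the keyspace once, at handler creation, so the keyspace object need not be touched after any yield point.

```python
# Sketch: precompute block_for while the keyspace is known to be alive,
# instead of holding a reference across defer/yield points. Names here
# (WriteHandler, consistency_block_for) are illustrative, not Scylla's.
class WriteHandler:
    def __init__(self, keyspace, consistency_block_for):
        # computed up front; no keyspace reference is retained
        self.block_for = consistency_block_for(keyspace)
        self.replies = 0

    def on_reply(self):
        self.replies += 1
        return self.replies >= self.block_for   # True once the write completes
```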
Asias He
bb9dbc5ade storage_service: Do not use c_str() in the logger
Use logger.info("{}", msg) instead.

Message-Id: <d2f15007a54554b58e29fd05331c06ae030d582f.1504832296.git.asias@scylladb.com>
2017-09-10 18:10:24 +03:00
Paweł Dziepak
ecd2bf128b storage_service: introduce CORRECT_COUNTER_ORDER feature
Scylla 1.7.4 used an incorrect ordering of counter shards. To fix
this problem, a new feature is introduced that will be used to determine
when nodes with that bug fixed can start sending counter shards in the
correct order.
2017-09-05 10:32:48 +01:00
Paweł Dziepak
9d82a1ebfd abstract_read_executor: make make_requests() exception safe
Message-Id: <20170821162934.25386-5-pdziepak@scylladb.com>
2017-08-22 12:09:42 +02:00
Avi Kivity
e428805ba5 Merge "Optimize query result partition and row counts" from Duarte
"Now that range queries go through the normal digest path, we rely on
query::result::calculate_counts() to count the number of partitions
and rows returned.

This series optimizes it, in case it is needed, and also changes the
result message to include the partition and row counts, avoiding the
calculation altogether."

* 'calculate-counts/v3' of github.com:duarten/scylla:
  query-result: Send row and partition count over the wire
  query::result: Optimize calculate_counts()
2017-08-17 13:41:21 +03:00
Duarte Nunes
ec75eac37d ring_position_exponential_vector_sharder: Take ranges by rvalue
Avoids some copies.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170814093310.29200-1-duarte@scylladb.com>
2017-08-14 12:55:43 +03:00
Duarte Nunes
d7bab684ea query::result: Optimize calculate_counts()
Now that range queries go through the normal digest path, we rely on
query::result::calculate_counts() to count the number of partitions
and rows returned. This patch makes it a bit faster.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-08-14 10:28:29 +02:00
Duarte Nunes
bcf21aacc2 storage_proxy: Directly call query_nonsingular_mutations_locally
Instead of duplicating the branch.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170811001559.25788-1-duarte@scylladb.com>
2017-08-11 09:06:01 +03:00
Duarte Nunes
a3ee99554b service/storage_proxy: Remove out of date comment
Now that we don't go directly to reconciliation for range queries, the
result isn't required to have the row and partition counts calculated
(we no longer transform a reconciled_result to a query::result).

Furthermore, this line was causing a lot of dtests to fail on account
of them not expecting an error line in the logs.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170810225351.12610-1-duarte@scylladb.com>
2017-08-11 09:04:23 +03:00
Asias He
49360992d9 storage_service: Use the new range_streamer interface for removenode
So that the removenode operation streams a few small ranges at a time and
restreams any failed ranges.
2017-08-07 16:31:48 +08:00
Asias He
6b8dc85f12 storage_service: Use the new range_streamer interface for decommission
So that the decommission operation streams a few small ranges at a time and
restreams any failed ranges.
2017-08-07 16:31:48 +08:00
Asias He
24584b8509 storage_service: Use the new range_streamer interface for rebuild
So that the rebuild operation streams a few small ranges at a time and
restreams any failed ranges.
2017-08-07 16:31:47 +08:00
Gleb Natapov
d2a2a6d471 storage_proxy: make range_slice_read_executor go through digest matching state
Currently, scanning reads go to the reconciliation stage directly, which
requires asking for mutation data from all peers. This patch makes them
try matching digests first, like a single-partition read.

The change requires internode protocol changes, since currently it is not
possible to ask for multi-partition data/digests over RPC. It means that
the capability has to be guarded by a new gossip feature flag, which the
patch also adds.
2017-08-03 11:37:03 +03:00
Gleb Natapov
3b7d8c8767 storage_proxy: add capability to read data/digest for non singular ranges
Currently only mutation_data reads support non-singular ranges. This
patch extends data/digest reads to support them too.
2017-08-03 10:35:09 +03:00
Gleb Natapov
c619ef258b storage_proxy: remove redundant parameter from never_speculating_read_executor constructor
never_speculating_read_executor always waits for all targets, so the
block_for parameter is always equal to targets.size(). No need to
pass it explicitly.
2017-08-03 10:08:44 +03:00
Tomasz Grabiec
e09220dbff migration_manager: Log schema pulls 2017-07-27 20:08:25 +02:00
Tomasz Grabiec
350d98d4e1 migration_manager: Prevent pull requests from accumulating
If schema merging completes at a lower rate than pull requests arrive,
then merge processes will accumulate and needlessly request and hold schema
mutations.

In rare cases, when there are constant schema changes, they may even
overflow memory. This was seen in dtest:

  concurrent_schema_changes_test.py:TestConcurrentSchemaChanges.create_lots_of_schema_churn_test

Allowing only one active and one queued pull request per remote
endpoint is enough.
2017-07-27 20:08:25 +02:00
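The per-endpoint limit can be modeled like this (illustrative Python; names are hypothetical): at most one pull is active and one is queued per remote endpoint, and further requests are dropped, since the queued pull will fetch the latest schema anyway.

```python
# Sketch: allow one active and one queued schema pull per remote endpoint;
# anything beyond that is coalesced into the already-queued pull.
class PullLimiter:
    def __init__(self):
        self._active = set()
        self._queued = set()

    def request(self, endpoint):
        """Return True if this request should actually start or queue a pull."""
        if endpoint not in self._active:
            self._active.add(endpoint)
            return True          # start immediately
        if endpoint not in self._queued:
            self._queued.add(endpoint)
            return True          # queue exactly one follow-up pull
        return False             # one active and one queued already: drop

    def done(self, endpoint):
        self._active.discard(endpoint)
        if endpoint in self._queued:
            self._queued.discard(endpoint)
            self._active.add(endpoint)   # the queued pull becomes active
```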
Vlad Zolotarov
e98adb13d5 service::storage_service: initialize auth and tracing after we joined the ring
Initialize the system_auth and system_traces keyspaces and their tables after
the node joins the token ring, because system_auth initialization
issues SELECT and possibly INSERT CQL statements.

This patch effectively reverts commit d3b8b67 and restores the
initialization order to what it was before that patch.

Fixes #2273

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1500417217-16677-1-git-send-email-vladz@scylladb.com>
2017-07-27 10:54:36 +02:00
Vlad Zolotarov
9086c643a6 service::storage_proxy: add a trace points pair in the SELECT replica flow
Add two trace points: at the beginning and at the end of the replica flow on the
replica shard.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1499961542-16263-1-git-send-email-vladz@scylladb.com>
2017-07-20 16:44:25 +02:00