"The main problem fixed is slow processing of application state changes.
This may lead to a bootstrapping node not having up to date view on the
ring, and serve incorrect data.
Fixes #2855."
* tag 'tgrabiec/gossip-performance-v3' of github.com:scylladb/seastar-dev:
gms/gossiper: Remove periodic replication of endpoint state map
gossiper: Check for features in the change listener
gms/gossiper: Replicate changes incrementally to other shards
gms/gossiper: Document validity of endpoint_state properties
storage_service: Update token_metadata after changing endpoint_state
gms/gossiper: Process endpoints in parallel
gms/gossiper: Serialize state changes and notifications for given node
utils/loading_shared_values: Allow Loader to return non-future result
gms/gossiper: Encapsulate lookup of endpoint_state
storage_service: Batch token metadata and endpoint state replication
utils/serialized_action: Introduce trigger_later()
gossiper: Add and improve logging
gms/gossiper: Don't fire change listeners when there is no change
gms/gossiper: Allow parallel apply_state_locally()
gms/gossiper: Avoid copies in endpoint_state::add_application_state()
gms/failure_detector: Ignore short update intervals
storage_service depends on endpoint states to be replicated to all
shards before token metadata is replicated. Currently this is taken
care of by storage_service::replicate_to_all_cores(), invoked from
storage_service's change listener. It copies whole endpoint state map,
which is expensive in large clusters. It's more efficient to replicate
only incremental changes, and only once, rather than for each
application state.
There is a requirement that whatever is present in token_metadata,
should also be present in endpoint_state. Because of that, we should update
endpoint_state first (set_gossip_tokens).
Apache Cassandra switched to this order as well in commit
b39d984f7bd682c7638415d65dcc4ac9bcb74e5f.
"Histograms are a native prometheus type, and there are many functions
available that operate on them. There is extensive documentation about
them at https://prometheus.io/docs/practices/histograms/
One example is the function histogram_quantile(), that can extract
useful quantiles from the histograms. Currently, those functions don't
work well.
The reasons are twofold:
1) We are only exporting 16 metrics, starting from 1usec. That means
that the highest latency we can differentiate is 4ms. After that,
everything falls into the same bin.
2) The format that prometheus expects is that each bin will contain
the total number of points seen *up until that bin*, while we
currently export the total number of points that falls between bins.
IOW, it is a cummulative histogram.
About point two, granted it is a bit hidden in their website, but it is
there. The following phrase about a caveat make it clear:
"Note that we divide the sum of both buckets. The reason is that the
histogram buckets are cumulative. The le="0.3" bucket is also contained
in the le="1.2" bucket; dividing it by 2 corrects for that."
It is also not needed to accumulate things that fall over the last bin:
the _count component of the histogram will already account for that."
Acked-by: Amnon Heiman <amnon@scylladb.com>
Acked-by: Gleb Natapov <gleb@scylladb.com>
* 'prometheus-histograms' of github.com:glommer/scylla:
storage_proxy: change reporting of estimated histograms
estimated_histogram: bring histogram closer to what prometheus expects.
We were taking a reference to a temporary value in different places.
Fix them by using get_application_state_ptr(), which also avoids a copy.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Make it use get_endpoint_state_for_endpoint_ptr(), check if gossiper is
enabled, mark it as const, and have some callers use it instead of open
coding the logic.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
gossiper::get_endpoint_state_for_endpoint() returns a copy of
endpoint_state, which we've seen can be very expensive.
This patch adds a similar function which returns a pointer instead,
and changes the call sites where using the pointer-returning variant
is deemed safe (the pointer neither escapes the function, nor crosses
any defer point).
Fixes#764
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
We are currently collapsing the histograms in 16 points, exponentially
increasing in value, starting from 1.
While reducing the number of points is a worthy goal, the current
configuration caps us at 4ms. Our latencies tend to be higher than this.
Starting from 1 is also a bit of an exhaggeration: rarely are our
latencies in that range. This patch changes reporting so that we
report 20 points starting from 32.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We don't pull schema during rolling upgrade, that is until
schema_tables_v3 feature is enabled on all nodes.
Because features are enabled from gossiper timer, there is a race
between feature enablement and processing of endpoint states which may
trigger schema pull. It can happen that we first try to pull, but
only later enable the feature. In that case the schema pull will not
happen until the next schema change.
The fix is to ensure that pulls abandoned due to feature not being enabled
will be retried when it is enabled.
Fixes sporadic failure in dtest:
repair_additional_test.py:RepairAdditionalTest.repair_schema_test
Message-Id: <1506428715-8182-2-git-send-email-tgrabiec@scylladb.com>
Since commit 8378fe190, we disable schema sync in a mixed cluster.
The detection is done using gossiper features. We need to make sure
the features are registerred, and thus can be enabled, before the
bootstrapping of a non-seed node happens. Otherwise the bootstrap will
hang waiting on schema sync which will not happen.
Message-Id: <1505893837-27876-2-git-send-email-tgrabiec@scylladb.com>
This is another place we can update endpoint_state_map in addition to
gossiper::run().
Call the gossiper:maybe_enable_features() so that we won't miss gossip
feature update.
The current sequential scan can take a long time on a small or empty table
with a large (nr_nodes * nr_vnodes) count, and can time out. Switching to
exponential scan reduces the time.
Fixes#1230.
Message-Id: <20170912173803.8277-1-avi@scylladb.com>
If there is a schema pull during rolling upgrade among a two 2.0 nodes,
then schema merge will delete the persisted schema version. When the node
loads that table again, e.g. on restart, it will generate a version
which is different than the one which 1.7 nodes use. This will
cause reads and writes to fail.
To avoid this, disable pulls until all nodes are upgraded.
Fixes#2802.
Collect coordinator side read statistic per CF and use them in percentile
speculative read executor. Getting percentile from estimated_histogram
object is rather expensive, so cache it and recalculate only once per
second (or if requested percentile changes).
Fixes#2757
Message-Id: <20170911131752.27369-3-gleb@scylladb.com>
A keyspace can be deleted while write is ongoing, so the object cannot
be used after defer point. The keyspace reference is only used to check
how many replies a write operation should wait for and this can be
precalculated during write handler creation.
Fixes#2777
Message-Id: <20170911084436.GG24167@scylladb.com>
Scylla 1.7.4 used incorrect ordering of counter shards. In order to fix
this problem a new feature is introduced that will be used to determine
when nodes with that bug fixed can start sending counter shard in the
correct order.
"Now that range queries go through the normal digest path, we rely on
query::result::calculate_counts() to count the amount of partitions
and rows returned.
This series optimizes it, in case it is needed, and also changes the
result message to include the partition and row counts, avoiding the
calculation altogether."
* 'calculate-counts/v3' of github.com:duarten/scylla:
query-result: Send row and partition count over the wire
query::result: Optimize calculate_counts()
Now that range queries go through the normal digest path, we rely on
query::result::calculate_counts() to count the amount of partitions
and rows returned. This patch makes it a bit faster.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Now that we don't go directly to reconciliation for range queries, the
result isn't required to have the row and partition counts calculated
(we no longer transform a reconciled_result to a query::result).
Furthermore, this line was causing a lot of dtests to fail on account
of them not expecting an error line in the logs.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170810225351.12610-1-duarte@scylladb.com>
Currently scanning reads go to reconciliation stage directly which
requires asking for mutation data from all peers. This patch makes
it to try matching digests first like a single partition read.
The change requires internode protocol changes since currently it is not
possible to ask for multi partition data/digest over RPC. It means that
the capability has to be guarded by new gossip feature flag which the
patch also adds.
never_speculating_read_executor always waits for all targets so
block_for parameter is always equal to targets.size(). No need to
to pass it explicitly.
If schema merging completes at lower rate than incoming pull requests,
then merge processes will accumulate and needlessly request and hold schema mutations.
In rare cases, when there are constant schema changes, they may even
overflow memory. This was seen in dtest:
concurrent_schema_changes_test.py:TestConcurrentSchemaChanges.create_lots_of_schema_churn_test
Allowing only one active and one queued pull request per remote
endpoint is enough.
Initialize the system_auth and system_traces keyspaces and their tables after
the Node joins the token ring because as a part of system_auth initialization
there are going to be issues SELECT and possible INSERT CQL statements.
This patch effectively reverts the d3b8b67 patch and brings the initialization order
to how it was before that patch.
Fixes#2273
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1500417217-16677-1-git-send-email-vladz@scylladb.com>