After calling filter_for_query() the extra_replica to speculate to may
be left default-initialized, which is the :0 IPv6 address. Further down,
this address is used as-is to check whether it belongs to the local DC,
which is wrong, as :0 is not the address of any existing endpoint.
The recent move of dc/rack data onto topology exposed this place: it now
emits an internal error because :0 is not present in the topology's
collection of endpoints. Prior to this move the DC filter would count :0
as belonging to the "default_dc" datacenter, which may or may not match
the DC of the local node.
The fix is to explicitly distinguish a set extra_replica from an unset one.
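A minimal sketch of the idea, not the actual ScyllaDB code: representing the
"no extra replica chosen" state explicitly (here with std::optional and made-up
types) means the DC check only ever sees a real endpoint, never a
default-constructed :0 address.

    #include <optional>
    #include <string>
    #include <vector>

    struct endpoint {
        std::string addr;
        std::string dc;
    };

    // Hypothetical helper: return an extra replica only if one was actually chosen.
    std::optional<endpoint> pick_extra_replica(const std::vector<endpoint>& candidates) {
        if (candidates.empty()) {
            return std::nullopt;            // nothing to speculate to
        }
        return candidates.front();
    }

    bool speculate_to_local_dc(const std::vector<endpoint>& candidates, const std::string& local_dc) {
        auto extra = pick_extra_replica(candidates);
        if (!extra) {
            return false;                   // unset replica: skip the DC check entirely
        }
        return extra->dc == local_dc;       // only real endpoints reach the topology lookup
    }
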
fixes: #11825
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #11833
Following up on 69aea59d97, which added fencing support
for simple reads and writes, this series does the same for the
complex ops:
- partition scan
- counter mutation
- paxos
With this done, the coordinator knows about all in-flight requests and
can delay topology changes until they are retired.
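A minimal sketch of the fencing idea with simplified, hypothetical types: the
coordinator grabs a reference to the current replication map when a request
starts and drops it when the request retires, so a topology change can wait
until no in-flight request still uses the old map.

    #include <memory>

    struct effective_replication_map {};                 // stands in for the real class
    using erm_ptr = std::shared_ptr<const effective_replication_map>;

    struct in_flight_request {
        erm_ptr erm;                                     // held for the request's whole lifetime
        explicit in_flight_request(erm_ptr e) : erm(std::move(e)) {}
        // ... all routing decisions use *erm, never a freshly fetched map ...
    };

    // Topology-change side: the old map can be retired once every request that
    // captured it has completed and released its reference.
    bool can_retire(const erm_ptr& old_map) {
        return old_map.use_count() == 1;                 // only the owner's reference remains
    }
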
Closes #11296
* github.com:scylladb/scylladb:
storage_proxy: hold effective_replication_map for the duration of a paxos transaction
storage_proxy: move paxos_response_handler class to .cc file
storage_proxy: deinline paxos_response_handler constructor/destructor
storage_proxy: use consistent effective_replication_map for counter coordinator
storage_proxy: improve consistency in query_partition_key_range{,_concurrent}
storage_proxy: query_partition_key_range_concurrent: reduce smart pointer use
storage_proxy: query_partition_key_range_concurrent: improve token_metadata consistency
storage_proxy: query_singular: use fewer smart pointers
storage_proxy: query_singular: simplify lambda captures
locator: effective_replication_map: provide non-smart-pointer accessor to token_metadata
storage_proxy: use consistent token_metadata with rest of singular read
Luckily, all topology calculations are done in get_paxos_participants(),
so all we have to do is hold the effective_replication_map for the
duration of the transaction and pass it to get_paxos_participants().
This ensures that the coordinator knows about all in-flight requests
and can fence them from topology changes.
Hold the effective_replication_map while talking to the counter leader,
to allow for fencing in the future. The code is somewhat awkward because
the API allows for multiple keyspaces to be in use.
The error code generation, already broken as it doesn't use the correct
table, continues to be broken in that it doesn't use the correct
effective_replication_map, for the same reason.
query_partition_key_range captures a token_metadata_ptr and uses
it consistently in sequential calls to query_partition_key_range_concurrent
(via tail recursion), but each invocation of
query_partition_key_range_concurrent captures its own
effective_replication_map_ptr. Since these are captured at different times,
they can be inconsistent after the first iteration.
Fix by capturing it once in the caller and propagating it everywhere.
Derive the token_metadata from the effective_replication_map rather than
getting it independently. Not a real bug since these were in the same
continuation, but safer this way.
Capture token_metadata by reference since we're protecting it with
the mighty effective_replication_map_ptr. This saves a few instructions
to manage smart pointers.
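A sketch of the pattern with simplified, hypothetical types: because the query
keeps the effective_replication_map alive for its whole duration, the
token_metadata it owns can be used through a plain reference instead of
copying another smart pointer around.

    #include <memory>

    struct token_metadata {};                            // simplified stand-ins
    struct effective_replication_map {
        token_metadata tm;
        const token_metadata& get_token_metadata() const { return tm; }   // non-smart-pointer accessor
    };
    using erm_ptr = std::shared_ptr<const effective_replication_map>;

    void route_query(const erm_ptr& erm) {
        // The reference stays valid for as long as `erm` is held by the query.
        const token_metadata& tm = erm->get_token_metadata();
        (void)tm;                                        // ... routing decisions use tm ...
    }
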
The lambdas in query_singular do not outlive the enclosing coroutine,
so they can capture everything by reference. This simplifies life
for a future update of the lambdas, since there is one less thing to
worry about.
query_singular() uses get_token_metadata_ptr() and later, in
get_read_executor(), captures the effective_replication_map(). This
isn't a bug, since the two are captured in the same continuation and
are therefore consistent, but a way to ensure it stays so is to capture
the effective_replication_map earlier and derive the token_metadata from
it.
do_with() is a sure indicator for coroutinization, since it adds
an allocation (like the coroutine does with its frame). Therefore
translating a function with do_with is at least a break-even, and
usually a win since other continuations no longer allocate.
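A hedged before/after sketch (assuming Seastar; the summing task is only an
illustration of the allocation argument above):

    #include <seastar/core/coroutine.hh>
    #include <seastar/core/do_with.hh>
    #include <seastar/core/future.hh>
    #include <seastar/core/loop.hh>
    #include <seastar/coroutine/maybe_yield.hh>
    #include <vector>

    // Continuation style: do_with() heap-allocates storage for its values, and the
    // chained continuations may allocate as well.
    seastar::future<int> sum_continuation(std::vector<int> v) {
        return seastar::do_with(std::move(v), int(0), [] (std::vector<int>& v, int& sum) {
            return seastar::do_for_each(v, [&sum] (int x) { sum += x; })
                .then([&sum] { return sum; });
        });
    }

    // Coroutine style: the frame allocation replaces do_with(), locals live in the
    // frame, and no further continuations need to allocate.
    seastar::future<int> sum_coroutine(std::vector<int> v) {
        int sum = 0;
        for (int x : v) {
            sum += x;
            co_await seastar::coroutine::maybe_yield();  // keep the reactor responsive
        }
        co_return sum;
    }
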
This series converts most of storage_proxy's functions that use
do_with() to coroutines. Two remain, since they are not simple
to convert (their do_with() is kept running in the background and
its future is discarded).
Individual patches favor minimal changes over final readability,
and there is a final patch that restores indentation.
The patches leave some moves from coroutine reference parameters
to the coroutine frame; this will be cleaned up in a follow-up. I wanted
this series not to touch headers, to reduce rebuild times.
Closes #11683
* github.com:scylladb/scylladb:
storage_proxy: reindent after coroutinization
storage_proxy: convert handle_read_digest() to a coroutine
storage_proxy: convert handle_read_mutation_data() to a coroutine
storage_proxy: convert handle_read_data() to a coroutine
storage_proxy: convert handle_write() to a coroutine
storage_proxy: convert handle_counter_mutation() to a coroutine
storage_proxy: convert query_nonsingular_mutations_locally() to a coroutine
The decision to reject a read operation can be made either by the replicas
or by the coordinator. In the second case, the
scylla_storage_proxy_coordinator_read_rate_limited
metric was not incremented, but it should have been. This commit fixes the issue.
Fixes: #11651
Closes #11694
The do_with() makes it at least a break-even, but there are some allocating
continuations that make it a win.
A variable named cmd had two different definitions (a value and a
lw_shared_ptr) that lived in different scopes. I renamed one to cmd1
to disambiguate. We should probably move that to the caller, but that
is not done here.
The do_with() makes it at least a break-even, but there are some allocating
continuations that make it a win.
A variable named cmd had two different definitions (a value and a
lw_shared_ptr) that lived in different scopes. I renamed one to cmd1
to disambiguate. We should probably move that to the caller, but that
is not done here.
The do_with() makes it at least a break-even, but there are some allocating
continuations that make it a win.
A variable named cmd had two different definitions (a value and a
lw_shared_ptr) that lived in different scopes. I renamed one to cmd1
to disambiguate. We should probably move that to the caller, but that
is not done here.
A do_with() makes this at least a break-even.
Some internal lambdas were not converted since they commonly
do not allocate or block.
A finally() continuation is converted to seastar::defer().
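A hedged sketch of that conversion (hypothetical function name, assuming
Seastar's <seastar/util/defer.hh>):

    #include <seastar/core/coroutine.hh>
    #include <seastar/core/future.hh>
    #include <seastar/util/defer.hh>

    // Continuation style would have been:
    //     return do_work().finally([] { release_resources(); });
    seastar::future<> handle_write_sketch() {
        auto release = seastar::defer([] () noexcept { /* release_resources(); */ });
        co_await seastar::make_ready_future<>();    // stands in for the real work
        // `release` runs when the frame unwinds, on success or exception, which is
        // what finally() provided in the continuation chain.
    }
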
The do_with() means the coroutine conversion is free, and converting
parallel_for_each to coroutine::parallel_for_each saves a possible
allocation (though it would usually not have been allocated).
An inner continuation is not converted since it usually doesn't
block, and therefore doesn't allocate.
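An illustrative sketch (assuming Seastar): awaiting
seastar::coroutine::parallel_for_each from the coroutine can avoid the extra
future that the non-coroutine parallel_for_each might otherwise allocate. The
per-element work here is only a placeholder.

    #include <seastar/core/coroutine.hh>
    #include <seastar/core/future.hh>
    #include <seastar/core/sleep.hh>
    #include <seastar/coroutine/parallel_for_each.hh>
    #include <chrono>
    #include <vector>

    seastar::future<> process_range(std::vector<int> keys) {
        co_await seastar::coroutine::parallel_for_each(keys, [] (int key) -> seastar::future<> {
            (void)key;                                       // stands in for a real per-key operation
            co_await seastar::sleep(std::chrono::milliseconds(1));
        });
    }
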
Extend the cql3 truncate statement to accept attributes,
similar to modification statements.
To achieve that we define cql3::statements::raw::truncate_statement
derived from raw::cf_statement, and implement its pure virtual
prepare() method to make a prepared truncate_statement.
The latter, statements::truncate_statement, is no longer derived
from raw::cf_statement, and just stores a schema_ptr to get to the
keyspace and column_family names.
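A structural sketch of that split with hypothetical, heavily simplified types
(the real classes live in cql3::statements and carry much more state):

    #include <chrono>
    #include <memory>

    struct schema;                                       // stands in for the real schema class
    using schema_ptr = std::shared_ptr<const schema>;

    // Prepared form: no longer derived from raw::cf_statement; the keyspace and
    // column_family names are reachable through the stored schema_ptr.
    struct prepared_truncate_statement {
        schema_ptr table_schema;
        std::chrono::milliseconds timeout;               // e.g. from USING TIMEOUT
    };

    namespace raw {
    // Raw (parsed) form, analogous to the raw modification statements: it carries
    // the attributes and turns them into a prepared statement in prepare().
    struct truncate_statement {
        std::chrono::milliseconds timeout{0};
        std::unique_ptr<prepared_truncate_statement> prepare(schema_ptr s) const {
            return std::make_unique<prepared_truncate_statement>(
                prepared_truncate_statement{std::move(s), timeout});
        }
    };
    } // namespace raw
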
The `test_truncate_using_timeout` cql-pytest was added to test
the new USING TIMEOUT feature.
Fixes #11408
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
With EverywhereStrategy, we know that all tokens will be on the same node
and the data is typically sparse, as with LocalStrategy.
Result of testing the feature:
Cluster: 2 DCs, 2 nodes in each DC, 256 tokens per node, 14 shards per node
Before: 154 scanning operations
After: 14 scanning operations (~10x improvement)
On bigger clusters, it will probably be even more efficient.
Closes #11403
The method is about to be moved from snitch to topology; this patch
prepares the rest of the code to call it via the latter. The
topology's method just calls the snitch, but that is going to change in
the next patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Proxy is the only place that calls this method. Also, the method name
suggests it's not something "generic", but rather internal logic of the
proxy's query processing.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Simplify ahead of refactoring for consistent effective_replication_map.
This is probably a pessimization of the error case, but the error case
will be terrible in any case unless we resultify it.
Move the error handling function to where it's used, so the code
is more straightforward.
Due to some std::move()s later, we must still capture the schema early.
Replication is a mix of several inputs: tokens and token->node mappings (topology),
the replication strategy, replication strategy parameters. These are all captured
in effective_replication_map.
However, if we use effective_replication_maps captured at different times in a single
query, then different uses may see different replication inputs.
This series protects against that by capturing an effective_replication_map just
once per query and then using it throughout. Furthermore, the captured effective_replication_map
is held until the query completes, so topology code can know when a topology is no
longer in use (although this isn't exploited in this series).
Only the simple read and write paths are covered. Counters and paxos are left for
later.
I don't think the series fixes any bugs - as far as I could tell everything was happening
in the same continuation. But this series ensures it.
Closes #11259
* github.com:scylladb/scylladb:
storage_proxy: use consistent topology
storage_proxy: use consistent replication map on read path
storage_proxy: use consistent replication map on write path
storage_proxy: convert get_live{,_sorted}_endpoints() to accept an effective_replication_map
consistency_level: accept effective_replication_map as parameter, rather than keyspace
consistency_level: be more const when using replication_strategy
Derive the topology from the captured and stable effective_replication_map
instead of getting a fresh topology from storage_proxy, since the
fresh topology may be inconsistent with the running query.
digest_read_resolver did not capture an effective_replication_map, so
that is added.
Capture a replication map just once in
abstract_read_executor::_effective_replication_map_ptr. Although it isn't
used yet, it serves to keep a reference count on topology (for fencing),
and some accesses to topology within reads still remain, which can be
converted to use the member in a later patch.
Capture a replication map just once in
abstract_write_handler::_effective_replication_map_ptr and use it
in all write handlers. A few accesses to get the topology still remain;
they will be fixed up in a later patch.
A keyspace is a mutable object that can change from time to time. An
effective_replication_map captures the state of a keyspace at a point in
time and can therefore be consistent (with care from the caller).
Change consistency_level's functions to accept an effective_replication_map.
This allows the caller to ensure that separate calls use the same
information and are consistent with each other.
Current callers are likely correct since they are called from one
continuation, but it's better to be sure.
This patch reduces the number of metrics ScyllaDB generates.
Motivation: the combination of per-shard and per-scheduling-group metrics
generates a lot of metrics. When combined with histograms, which require
many metrics, the problem becomes even bigger.
The two tools we are going to use:
1. Replace per-shard histograms with summaries
2. Do not report unused metrics.
The storage_proxy stats hold information for the API and the metrics
layer. We replaced timed_rate_moving_average_and_histogram and
time_estimated_histogram with the unified
timed_rate_moving_average_summary_and_histogram, which gives us the option
to report per-shard summaries instead of histograms.
All the counters, histograms, and summaries were marked as
skip_when_empty.
The API was modified to use
timed_rate_moving_average_summary_and_histogram.
Closes #11173
We expect each replica to stop at exactly the same position when the
digests match. Soon, however, if replicas have a lot of tombstones, some
may stop earlier than the others. As long as all digests match, this is
fine, but we need to make sure we continue from the smallest such
position on the next page.
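A sketch of that choice with simplified, hypothetical types: once the per-replica
stop positions are collected alongside the digests, the next page resumes from the
least advanced one.

    #include <algorithm>
    #include <tuple>
    #include <vector>

    // Simplified stand-in for a clustering position with a total order.
    struct position {
        int partition;
        int row;
    };
    inline bool operator<(const position& a, const position& b) {
        return std::tie(a.partition, a.row) < std::tie(b.partition, b.row);
    }

    // Resume the next page from the least advanced replica (requires a non-empty input).
    position next_page_start(const std::vector<position>& replica_stop_positions) {
        return *std::min_element(replica_stop_positions.begin(), replica_stop_positions.end());
    }
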
We want to transmit the last position as determined by the replica on
both result and digest reads. Result reads already do that via the
query::result, but digest reads don't yet as they don't return the full
query::result structure, just the digest field from it. Add the last
position to the digest read's return value and collect these in the
digest resolver, along with the returned digests.
To be used by coordinator-side code to determine the correct tombstone
limit to pass in the read command (the tombstone limit field is added in
the next commit). When this limit is non-zero, the replica will start
cutting pages after the tombstone limit is surpassed.
This getter works similarly to `get_max_result_size()`: if the cluster
feature for empty replica pages is set, it will return the value
configured via db::config::query_tombstone_limit. System queries always
use a limit of 0 (unlimited tombstones).
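A hedged sketch of the getter described above, with hypothetical names standing
in for the real config/feature plumbing:

    #include <cstdint>

    struct query_context {
        bool empty_replica_pages_feature_enabled;        // cluster feature
        std::uint64_t configured_tombstone_limit;        // db::config::query_tombstone_limit
        bool is_system_query;
    };

    std::uint64_t get_tombstone_limit(const query_context& ctx) {
        if (ctx.is_system_query || !ctx.empty_replica_pages_feature_enabled) {
            return 0;                                    // 0 means "unlimited tombstones"
        }
        return ctx.configured_tombstone_limit;
    }
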
Define table_schema_version as a distinct tagged_uuid class,
so it can be differentiated from other uuid-based types,
in particular table_id.
Add reversed(table_schema_version) for convenience
and uniformity, since the same logic is currently open-coded
in several places.
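A minimal sketch of the tagged-UUID idea with hypothetical names: the same
underlying UUID representation, but table_schema_version and table_id become
distinct types, so mixing them up is a compile-time error. The reversed()
body here is only illustrative, not the real logic.

    #include <cstdint>
    #include <utility>

    struct uuid { std::uint64_t hi = 0, lo = 0; };       // simplified stand-in

    // The tag makes otherwise-identical UUID wrappers distinct types.
    template <typename Tag>
    struct tagged_uuid {
        uuid id;
    };

    using table_schema_version = tagged_uuid<struct table_schema_version_tag>;
    using table_id             = tagged_uuid<struct table_id_tag>;

    // Convenience in the spirit of reversed(table_schema_version).
    inline table_schema_version reversed(table_schema_version v) {
        std::swap(v.id.hi, v.id.lo);                     // illustrative transformation only
        return v;
    }
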
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add include statements to satisfy dependencies.
Delete now-unneeded include directives from the upper-level
source files.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Pass an optional truncated_at time_point to
truncate_table_on_all_shards instead of the over-complicated
timestamp_func that returned the same time_point on all shards
anyhow and was only used for coordination across shards.
Since we now synchronize the internal execution phase in
truncate_table_on_all_shards, there is no longer a need
for this timestamp_func.
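A hedged sketch of the simplification, with hypothetical signatures and a
std::chrono stand-in for the real clock type: the caller resolves the
truncation time once and passes the single value to every shard.

    #include <chrono>
    #include <functional>
    #include <optional>

    using db_clock_time_point = std::chrono::system_clock::time_point;   // stand-in for db_clock

    // Before: every shard invoked the same function just to agree on one value.
    void truncate_table_on_all_shards_old(std::function<db_clock_time_point()> timestamp_func);

    // After: the agreed value is passed directly; std::nullopt means
    // "use the current time".
    void truncate_table_on_all_shards_new(std::optional<db_clock_time_point> truncated_at);
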
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
"
There are several helpers in this .cc file that need to get the datacenter
of endpoints. For this they use the global snitch, because there's no other
place to get that data from.
The whole dc/rack info is now moving to topology, so this set of patches
makes consistency_level.cc get it from the topology. This is done in two ways.
First, the helpers that have a keyspace at hand can get the topology via
the keyspace's effective_replication_map.
Two difficult cases are db::is_local() and db::count_local_endpoints(),
because both have just an inet_address at hand. Those are turned into
methods of topology itself; all their callers already deal with
token metadata and can get the topology from it.
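A minimal sketch of that direction with simplified, hypothetical types: the DC
lookup lives on the topology object that callers already reach via token
metadata, instead of on a global snitch.

    #include <cstddef>
    #include <string>
    #include <unordered_map>

    struct inet_address { std::string addr; };
    inline bool operator==(const inet_address& a, const inet_address& b) { return a.addr == b.addr; }
    struct addr_hash {
        std::size_t operator()(const inet_address& a) const { return std::hash<std::string>{}(a.addr); }
    };

    struct topology {
        std::string local_dc;
        std::unordered_map<inet_address, std::string, addr_hash> dc_of;

        // replaces db::is_local(endpoint)
        bool is_local(const inet_address& ep) const {
            auto it = dc_of.find(ep);
            return it != dc_of.end() && it->second == local_dc;
        }

        // replaces db::count_local_endpoints(endpoints)
        template <typename Range>
        std::size_t count_local_endpoints(const Range& endpoints) const {
            std::size_t n = 0;
            for (const auto& ep : endpoints) {
                n += is_local(ep);
            }
            return n;
        }
    };
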
"
* 'br-consistency-level-over-topology' of https://github.com/xemul/scylla:
consistency_level: Remove is_local() and count_local_endpoints()
storage_proxy: Use topology::local_endpoints_count()
storage_proxy: Use proxy's topology for DC checks
storage_proxy: Keep shared_ptr<proxy> on digest_read_resolver
storage_proxy: Use topology local_dc_filter in its methods
storage_proxy: Mark some digest_read_resolver methods private
forwarding_service: Use topology local_dc_filter
storage_service: Use topology local_dc_filter
consistency_level: Use topology local_dc_filter
consitency-level: Call count_local_endpoints from topology
consistency_level: Get datacenter from topology
replication_strategy: Remove hold snitch reference
effective_replication_map: Get datacenter from topology
topology: Add local-dc detection shugar
A continuation of the previous patches -- now all the code that needs
this helper has a proxy pointer at hand.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>