"
While working on another patch I was getting odd compiler errors
saying that a call to ::make_shared was ambiguous. The reason was that
seastar has both:
template <typename T, typename... A>
shared_ptr<T> make_shared(A&&... a);
template <typename T>
shared_ptr<T> make_shared(T&& a);
The second variant doesn't exist in std::make_shared.
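For illustration, a minimal sketch with a hypothetical type foo; the
extra variant deduces T, which std::make_shared never does:

    struct foo {};                                 // hypothetical payload type
    foo f;
    auto p1 = seastar::make_shared(std::move(f));  // OK: the T&& overload deduces T = foo
    // auto p2 = std::make_shared(std::move(f));   // error: std::make_shared cannot deduce T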
This series drops scylla's dependency on the second variant, so that a
future change can make seastar::make_shared a bit more like
std::make_shared.
"
* 'espindola/make_shared' of https://github.com/espindola/scylla:
Everywhere: Explicitly instantiate make_lw_shared
Everywhere: Add a make_shared_schema helper
Everywhere: Explicitly instantiate make_shared
cql3: Add a create_multi_column_relation helper
main: Return a shared_ptr from defer_verbose_shutdown
"
0c6bbc8 refactored `get_rpc_client_idx()` so that statement verbs are
sent on different connections depending on the current scheduling
group. The new connections use per-connection isolation. For backward
compatibility, the pre-existing connections fall back to the
per-handler isolation used previously; the old statement connection,
called the default statement connection, also uses per-handler
isolation. `get_rpc_client_idx()` was changed to select the default
statement connection when the current scheduling group is the statement
group, and a non-default connection otherwise.
This inadvertently broke `scheduling_group_for_verb()`, which also used
this method to obtain the scheduling group used to isolate a verb at
handler registration time. It needs the default client idx for each
verb, but if verb registration runs under the system group it got the
non-default one instead. As a result, per-handler isolation was not set
up for the default statement connection, and default statement verb
handlers ran in whatever scheduling group the RPC processing loop runs
in, which is the system scheduling group.
This caused all sorts of problems, even beyond user queries running in
the system group. Also, as of 0c6bbc8, queries on the replicas are
classified based on the scheduling group they run in, so user reads
also ended up using the system concurrency semaphore.
In particular this caused severe problems with range scans, which in
some cases ended up using different semaphores per page, resulting in a
crash. This could happen because when a page was read locally the code
ran in the statement scheduling group, but when the request arrived
from a remote coordinator via RPC, it was read in the system scheduling
group. This caused a mismatch between the semaphore the saved reader
was created with and the one the new page was read with. The result was
that in some cases, when looking up a paused reader from the wrong
semaphore, a reader belonging to another read was returned, creating a
disconnect between the lifecycle of the readers and that of the slice
and range they were referencing.
This series fixes the underlying problem of the scheduling group
influencing verb handler registration, and adds additional defenses in
case this semaphore mismatch ever happens again. Inactive read handles
are now unique across all semaphores, so a handle can no longer
successfully look up a reader when used with the wrong semaphore. The
range scan algorithm now also makes sure there is no mismatch between
the semaphore used for the current page and that of the saved reader
from the previous page.
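As a toy model of the uniqueness defense (illustrative names, not the
actual patch):

    #include <cstdint>
    #include <stdexcept>

    // a handle remembers which semaphore issued it, so redeeming it at any
    // other semaphore fails loudly instead of aliasing another read's reader
    class reader_concurrency_semaphore {
    public:
        struct inactive_read_handle {
            const reader_concurrency_semaphore* sem = nullptr;
            uint64_t reader_id = 0;
        };
        inactive_read_handle register_inactive_read(uint64_t reader_id) {
            return {this, reader_id};
        }
        uint64_t unregister_inactive_read(inactive_read_handle h) {
            if (h.sem != this) {
                throw std::runtime_error("handle used with the wrong semaphore");
            }
            return h.reader_id;
        }
    };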
I manually checked that each individual defense added prevents the
crash on its own.
Fixes: #6613
Fixes: #6907
Fixes: #6908
Tests: unit(dev), manual(run the crash reproducer, observe no crash)
"
* 'query-classification-regressions/v1' of https://github.com/denesb/scylla:
multishard_mutation_query: use cached semaphore
messaging: make verb handler registering independent of current scheduling group
multishard_mutation_query: validate the semaphore of the looked-up reader
reader_concurrency_semaphore: make inactive read handles unique across semaphores
reader_concurrency_semaphore: add name() accessor
reader_concurrency_semaphore: allow passing name to no-limit constructor
Each verb has both register and unregister helpers, but the unregister
helpers for some verbs are missing, so here they are.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
seastar::make_lw_shared has an overload taking a T&&. There is no
such overload in std::make_shared:
https://en.cppreference.com/w/cpp/memory/shared_ptr/make_shared
This means that we have to move from
make_lw_shared(T(...))
to
make_lw_shared<T>(...)
if we don't want to depend on the idiosyncrasies of
seastar::make_lw_shared.
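A minimal before/after sketch, with a hypothetical type foo:

    // before: relies on seastar's extra T&& overload (construct, then move)
    auto p1 = make_lw_shared(foo(1, 2));
    // after: std-style call that forwards the arguments to foo's constructor
    auto p2 = make_lw_shared<foo>(1, 2);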
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
The schema_tables.hh -> migration_manager.hh couple seems to act as a
"single header for everything", creating big bloat for many seemingly
unrelated .hh files.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In one of the longevity tests, we observed a 1.3s reactor stall which
came from repair_meta::get_full_row_hashes_source_op. It traced back to
a call to std::unordered_set::insert() which triggered a big memory
allocation and reclaim.
I measured std::unordered_set, absl::flat_hash_set, absl::node_hash_set
and absl::btree_set. The absl::btree_set was the only one that
seastar's oversized-allocation checker did not warn about in my tests,
where around 300K repair hashes were inserted into the container.
- unordered_set:
hash_sets=295634, time=333029199 ns
- flat_hash_set:
hash_sets=295634, time=312484711 ns
- node_hash_set:
hash_sets=295634, time=346195835 ns
- btree_set:
hash_sets=295634, time=341379801 ns
The btree_set is a bit slower than unordered_set but it does not have
huge memory allocations. I did not measure any real difference in the
total time to finish repair of the same dataset between unordered_set
and btree_set.
To fix, switch to the absl::btree_set container.
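A minimal sketch of the swap; repair_hash here is a stand-in for the
real hash type:

    #include <absl/container/btree_set.h>
    #include <cstdint>

    struct repair_hash {            // stand-in for the real repair hash type
        uint64_t hash = 0;
        bool operator<(const repair_hash& o) const { return hash < o.hash; }
    };

    // B-tree nodes hold a small fixed number of elements, so inserting ~300K
    // hashes allocates many small nodes instead of the single huge bucket
    // array that std::unordered_set must rehash into
    absl::btree_set<repair_hash> row_hashes;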
Fixes #6190
Before Scylla 3.0, we used to send streaming mutations using
individual RPC requests and flush them together using dedicated
streaming memtables. This mechanism is no longer in use and all
versions that use it have long reached end-of-life.
Remove this code.
The messaging service constructor's body does two main things in this
order:
1. it registers the CLIENT_ID verb with rpc.
2. it initializes the scheduling mechanism in charge of locating the
right scheduling group for each verb.
The registration function uses the scheduling mechanism to determine
the scheduling group for the verb.
This commit simply reverses the order of execution.
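A minimal sketch of the new ordering; the member-function names here
are assumptions, not the actual code:

    struct messaging_service {
        messaging_service() {
            init_scheduling_config();   // 2) now runs first: verb -> group mapping ready
            register_client_id_verb();  // 1) registration can safely consult it
        }
        void init_scheduling_config();
        void register_client_id_verb();
    };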
Fixes #6628
Tenants get their own connections for statement verbs and are further
isolated from each other by different scheduling groups. A tenant is
identified by a scheduling group and a name. When selecting the client
index for a statement verb, we look up the tenant whose scheduling group
matches the current one. The scheduling group is carried across the RPC
call by sending the tenant's name, which identifies the tenant on the
remote end, where a reverse lookup (name -> scheduling group) happens.
Instead of a single scheduling group to be used for all statement verbs,
messaging_service::scheduling_config now contains a list of tenants. The
first among these is the default tenant, the one we use when the current
scheduling group doesn't match that of any configured tenant.
To make this mapping easier, we reshuffle the client index assignment,
such that the statement and statement-ack verbs have indexes 2 and 3
respectively, instead of 0 and 3.
The tenant configuration is set at messaging service construction time
and cannot be changed afterwards. Adding such a capability should be
easy, but it is not needed for query classification, the current user
of the tenant concept.
Currently two tenants are configured: $user (default tenant) and
$system.
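A toy model of the client-index lookup (illustrative types and names,
not the actual code):

    #include <string>
    #include <vector>

    struct scheduling_group { unsigned id = 0; };  // stand-in for seastar's type
    inline bool operator==(scheduling_group a, scheduling_group b) {
        return a.id == b.id;
    }

    struct tenant {
        scheduling_group group;
        std::string name;   // identifies the tenant on the remote end
    };

    // first entry is the default tenant ($user); used when the current
    // scheduling group matches no configured tenant
    unsigned tenant_idx(const std::vector<tenant>& tenants, scheduling_group current) {
        for (unsigned i = 0; i < tenants.size(); ++i) {
            if (tenants[i].group == current) {
                return i;
            }
        }
        return 0;
    }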
Per-user SLA means connection classifications are determined
dynamically, as SLAs are added or removed. This means the
classification information cannot be static.
Fix by making it a non-static vector (instead of a static array),
allowing it to be extended. The scheduling-group member pointer is
replaced by a plain scheduling group, since a member pointer won't work
anymore - we won't have a member to refer to.
On the client side, we supply an isolation cookie based on the connection index.
On the server side, we convert an isolation cookie back to a scheduling_group.
This has two advantages:
- rpc processes the entire connection using the scheduling group, so that code
is also isolated and accounted for
- we can later add per-user connections; the previous approach of looking at the
verb to decide the scheduling_group doesn't help because we don't have a set of
verbs per user
With this, the main group sees <0.1% usage under simple read and write loads.
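A sketch of the two halves on top of seastar's rpc isolation support
(API names from memory; scheduling_group_for_cookie is a hypothetical
helper):

    #include <seastar/core/scheduling.hh>
    #include <seastar/rpc/rpc.hh>

    // hypothetical helper: reverse map from cookie to scheduling group
    seastar::scheduling_group scheduling_group_for_cookie(const seastar::sstring&);

    void configure(seastar::rpc::client_options& opts,
                   seastar::rpc::resource_limits& limits) {
        // client side: stamp the connection with the name of its class,
        // derived from the connection index
        opts.isolation_cookie = "statement";
        // server side: turn the received cookie back into a scheduling
        // group, so rpc runs the whole connection under it
        limits.isolate_connection = [] (seastar::sstring cookie) {
            seastar::rpc::isolation_config cfg;
            cfg.sched_group = scheduling_group_for_cookie(cookie);
            return cfg;
        };
    }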
Move it from a function-local static to a class static variable. We will want
to extend it in two ways:
- add more information per connection index (like the rpc isolation cookie)
- support adding more connections for per-user SLA
As a first step, make it an array of structures and make it accessible to all
of messaging_service.
Commit 75cf255c67 (repair: Ignore keyspace
that is removed in sync_data_using_repair) is not enough to fix the
issue, because when the repair master checks whether the table is
dropped, the table might not be dropped yet on the repair master.
To fix, the repair master should check whether the follower failed the
repair because the table was dropped, by checking the error returned
from the follower.
With this patch, we would see
WARN 2020-04-14 11:19:00,417 [shard 0] repair - repair id 1 on shard 0
completed successfully, keyspace=ks, ignoring dropped tables={cf}
when the table is dropped during bootstrap.
Tests: update_cluster_layout_tests.py:TestUpdateClusterLayout.simple_add_new_node_while_schema_changes_test
Fixes: #5942
Changes messaging service rpc to use reloadable tls
certificates iff tls is enabled.
Note that this means the service cannot start
listening at construction time if TLS is active,
and users need to call start_listen_ex to initialize
and actually start the service.
Since the "normal" messaging service is actually started
from gms, this route too is made a continuation.
Since 956b092012 (Merge "Repair based node
operation" from Asias), repair is used by other node operations like
bootstrap, decommission and so on.
Send the reason for the repair, so that we can handle the materialized
view update correctly according to the reason for the operation. We
want to trigger the view update only if the repair is run as a regular
repair operation. Otherwise, the view table would be handled twice: 1)
when the view table is synced using repair, and 2) when the base table
is synced using repair and a view table update is triggered.
Fixes #5930
Fixes #5998
This removes the need to include reactor.hh, a source of compile
time bloat.
In some places, the call is qualified with seastar:: in order
to resolve ambiguities with a local name.
Includes are adjusted to make everything compile. We end up
having 14 translation units including reactor.hh, primarily for
deprecated things like reactor::at_exit().
Ref #1
The learning stage of the PAXOS protocol leaves behind an entry in the
system.paxos table with the last learned value (which can be large). In
case not all participants learned it successfully, the next round on
the same key may complete the learning using this info. But if all
nodes learned the value, the entry no longer serves any useful purpose.
The patch adds another round, "prune", which is executed in the
background (limited to 1000 simultaneous instances) and removes the
entry in case all nodes replied successfully to the "learn" round. It
uses the ballot's timestamp to do the deletion, so as not to interfere
with the next round. Since the deletion happens very close to the
previous writes it will likely happen in the memtable and never reach
an sstable, which reduces memtable flush and compaction overhead.
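A toy model of the timestamp rule (not the actual storage code):

    #include <algorithm>
    #include <cstdint>

    // the prune tombstone carries the learned ballot's timestamp, so a
    // later round (newer ballot => newer timestamp) is never shadowed
    struct paxos_row {
        int64_t write_ts = 0;     // timestamp of the last learned value
        int64_t delete_ts = -1;   // timestamp of the prune tombstone, if any
        bool visible() const { return write_ts > delete_ts; }
    };

    void prune(paxos_row& row, int64_t ballot_ts) {
        row.delete_ts = std::max(row.delete_ts, ballot_ts);
    }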
Fixes #5779
Message-Id: <20200330154853.GA31074@scylladb.com>
Since LWT requests now run on the owning shard, there is no longer a
need to make a cross-shard call at the paxos_state level. RPC calls may
still arrive at the wrong shard, so we need to make the cross-shard
call there.
Introduce a new verb dedicated for receiving and sending hints:
HINT_MUTATION. It is handled on the streaming connection, which is
separate from the one used for handling mutations sent by the
coordinator during a write.
The intent of using a separate connection is to increase fairness while
handling hints and user requests - this way we avoid a situation in
which one type of request saturates the connection, negatively
impacting the other.
Current LWT implementation uses at least three network round trips:
- first, execute PAXOS prepare phase
- second, query the current value of the updated key
- third, propose the change to participating replicas
(there's also the learn phase, but we don't wait for it to complete).
The idea behind the optimization implemented by this patch is simple:
piggyback the current value of the updated key on the prepare response
to eliminate one round trip.
To generate less network traffic, only the replica closest to the
coordinator sends data, while the other participating replicas send
digests, which are used to check data consistency.
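A sketch of the shape of the extended response (types and field names
are illustrative, not the actual API):

    #include <cstddef>
    #include <cstdint>
    #include <optional>
    #include <vector>

    using query_result = std::vector<std::byte>;  // stand-in for the real result type
    using digest = uint64_t;                      // stand-in for the real digest type

    // the prepare reply now carries the read that previously required
    // its own round trip
    struct prepare_response {
        // ...promise/ballot fields as before...
        std::optional<query_result> data;   // filled by the replica closest
                                            // to the coordinator
        std::optional<digest> data_digest;  // filled by the other replicas,
                                            // used to verify 'data'
    };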
Note, this patch changes the API of some RPC calls used by PAXOS, but
this should be okay as long as the feature is in an early development
stage and marked experimental.
To assess the impact of this optimization on LWT performance, I ran a
simple benchmark that starts a number of concurrent clients each of
which updates its own key (uncontended case) stored in a cluster of
three AWS i3.2xlarge nodes located in the same region (us-west-1) and
measures the aggregate bandwidth and latency. The test uses the
shard-aware gocql driver. Here are the results:
          latency 99% (ms)    bandwidth (rq/s)    timeouts (rq/s)
clients   before   after      before   after      before   after
1              2       2         626      637          0       0
5              4       3        2616     2843          0       0
10             3       3        4493     4767          0       0
50             7       7       10567    10833          0       0
100           15      15       12265    12934          0       0
200           48      30       13593    14317          0       0
400          185      60       14796    15549          0       0
600          290      94       14416    15669          0       0
800          568     118       14077    15820          2       0
1000         710     118       13088    15830          9       0
2000        1388     232       13342    15658         85       0
3000        1110     363       13282    15422        233       0
4000        1735     454       13387    15385        329       0
That is, this optimization improves max LWT bandwidth by about 15%
and allows running 3-4x more clients while maintaining the same level
of system responsiveness.
invoke_on() guarantees that captured objects won't be destroyed until
the future returned by the invoked function resolves, so there is no
need to move key, token, and proposal when calling the
paxos_state::*_impl helpers.
The Paxos protocol has three stages: prepare, accept, and learn. This
patch adds an RPC verb for each of these stages. To be
terminology-compatible with Cassandra, the patch calls these stages
prepare, propose, and commit.
The verbs are:
* DEFINITIONS_UPDATE (push)
* MIGRATION_REQUEST (pull)
Support was added in a backward-compatible way. The push verb sends
both the old frozen mutation parameter and the new optional canonical
mutation parameter. It is expected that new nodes will use the latter,
while old nodes will fall back to the former. The pull verb has a new
optional `options` parameter, which for now contains a single flag:
`remote_supports_canonical_mutation_retval`. This flag, if set, means
that the remote node supports the new canonical mutation return value,
so the old frozen-mutations return value can be left empty.
This patch silences those future-discard warnings where it is clear
that discarding the future was actually the intent of the original
author, *and* they took the necessary precautions (handling errors).
The patch also adds some trivial error handling (logging the error) in
a few places that were lacking it but otherwise look OK. No functional
changes.
Avoid including the lengthy stream_session.hh in messaging_service.
More importantly, fix the build, because currently messaging_service.cc
and messaging_service.hh do not include stream_mutation_fragments_cmd.
I am not sure why it builds on my machine. Spotted this when
backporting "streaming: Send error code from the sender to receiver" to
the 3.0 branch.
Refs: #4789
In case of an error on the sender side, the sender does not propagate
the error to the receiver; it just closes the stream. As a result, the
receiver gets nullopt from the source in get_next_mutation_fragment and
passes a mutation_fragment_opt with no value to the generating_reader.
In turn, the generating_reader generates end of stream. However, the
last element the generating_reader generated can be any type of
mutation_fragment. This makes the sstable that consumes the
generating_reader violate the mutation_fragment stream rule.
To fix, we need to propagate the error. However, RPC streaming does not
support propagating errors in the framework; the user has to send an
error code explicitly.
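A sketch of the idea; the enum name matches the
stream_mutation_fragments_cmd mentioned elsewhere in this log, but the
enumerators here are assumptions:

    #include <cstdint>

    enum class stream_mutation_fragments_cmd : uint8_t {
        mutation_fragment_data, // a fragment payload follows
        end_of_stream,          // clean termination
        error,                  // sender failed: the receiver aborts the
                                // stream instead of treating the closed
                                // stream as a normal end-of-stream
    };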
Fixes: #4789
get_rpc_client assumes the messaging_service is not stopped, so we
should check is_stopping() before calling get_rpc_client.
We already do such a check in existing code, e.g., send_message and
friends. Do the same check in the newly introduced
make_sink_and_source_for_stream_mutation_fragments() and friends for
row-level repair.
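A minimal sketch of the guard pattern, mirroring send_message and
friends (toy stand-in type, not the actual entry point):

    #include <seastar/core/future.hh>
    #include <seastar/rpc/rpc_types.hh>

    struct messaging_service_like {        // toy stand-in for messaging_service
        bool stopping = false;
        bool is_stopping() const { return stopping; }
    };

    seastar::future<> make_sink_and_source(messaging_service_like& ms) {
        if (ms.is_stopping()) {
            // fail early instead of touching a client that may be shutting down
            return seastar::make_exception_future<>(seastar::rpc::closed_error());
        }
        // ... proceed to get_rpc_client() and build the sink/source ...
        return seastar::make_ready_future<>();
    }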
Fixes: #4767
Because inet_address was initially hardcoded to
ipv4, its wire format is not very forward compatible.
Since we potentially need to communicate with older version nodes, we
manually define the new serial format for inet_address to be:
ipv4: 4 bytes address
ipv6: 4 bytes marker 0xffffffff (invalid address)
      16 bytes data -> address
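A toy encoder for the format above (the real code lives in the
serialization layer):

    #include <cstdint>
    #include <vector>

    void write_inet_address(std::vector<uint8_t>& out, bool is_ipv4,
                            const uint8_t* bytes /* 4 or 16 bytes */) {
        if (is_ipv4) {
            out.insert(out.end(), bytes, bytes + 4);   // plain 4-byte address
        } else {
            out.insert(out.end(), 4, 0xff);            // 0xffffffff marker:
                                                       // never a valid ipv4 here
            out.insert(out.end(), bytes, bytes + 16);  // 16-byte ipv6 address
        }
    }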
- REPAIR_GET_ROW_DIFF_WITH_RPC_STREAM
  Get repair rows from follower nodes
- REPAIR_PUT_ROW_DIFF_WITH_RPC_STREAM
  Put repair rows to follower nodes
- REPAIR_GET_FULL_ROW_HASHES_WITH_RPC_STREAM
  Get full hashes from follower nodes
Before this patch, the repair master and followers each use their own
schema version, as of the point repair starts. The schemas can be
different due to schema changes. Repair uses the schema to serialize
mutation_fragments and to deserialize the mutation_fragments received
from peer nodes. Using different schema versions to serialize and
deserialize causes undefined behaviour.
To fix, we use the schema the repair master decides on for all the
repair nodes involved.
On top of this patch, we could take another step to make sure all nodes
have the latest schema. But let's do that in a separate patch.
Fixes: #4549
Backports: 3.1
Currently, the REPAIR_GET_COMBINED_ROW_HASH RPC verb returns only the
repair_hash object. In the future, we will use a set reconciliation
algorithm to decode the full row hashes in the working row buf. It is
useful to return the number of rows inside the working row buf, in
addition to the combined row hashes, to make sure the decode is
successful.
It is also better to use a wrapper class for the verb response, so we
can extend the return values more easily later with IDL.
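A sketch of the wrapped response (field names are illustrative, not the
actual IDL):

    #include <cstdint>

    struct repair_hash {                    // stand-in for the existing hash type
        uint64_t hash = 0;
    };

    // returning a wrapper instead of a bare repair_hash means new fields
    // can be appended later without changing the verb's signature
    struct get_combined_row_hash_response {
        repair_hash working_row_buf_combined_hash;
        uint64_t working_row_buf_num_rows = 0;  // lets the caller check the decode
    };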
Fixes #4526
Message-Id: <93be47920b523f07179ee17e418760015a142990.1559771344.git.asias@scylladb.com>
Seastar now supports two RPC compression algorithms: the original LZ4
one and LZ4_FRAGMENTED. The latter uses the lz4 streaming interface,
which allows it to process large messages without fully linearising
them. Since RPC requests used by Scylla often contain user-provided
data that can potentially be very large, LZ4_FRAGMENTED is a better
choice for the default compression algorithm.
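A sketch of how the preference can be expressed with seastar's
compressor factories (the registration site in messaging_service is
assumed):

    #include <seastar/rpc/lz4_compressor.hh>
    #include <seastar/rpc/lz4_fragmented_compressor.hh>
    #include <seastar/rpc/multi_algo_compressor_factory.hh>

    // listed in order of preference; negotiation picks the first algorithm
    // both sides support, so new nodes use LZ4_FRAGMENTED while old nodes
    // still get plain LZ4
    static seastar::rpc::lz4_fragmented_compressor::factory fragmented_factory;
    static seastar::rpc::lz4_compressor::factory lz4_factory;
    static seastar::rpc::multi_algo_compressor_factory compressor_factory {
        &fragmented_factory,
        &lz4_factory,
    };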
Message-Id: <20190417144318.27701-1-pdziepak@scylladb.com>