Commit Graph

324 Commits

Avi Kivity
789233228b messaging: don't inherit from seastar::rpc::protocol
messaging_service's rpc_protocol_server_wrapper inherits from
seastar::rpc::protocol::server. This is unfortunate, as protocol.hh
wasn't designed for inheritance, and the class is not marked final.

Avoid this inheritance by hiding the class as a member. This causes
a lot of boilerplate code, which is unfortunate, but this random
inheritance is bad practice and should be avoided.
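The replacement pattern can be sketched in isolation; `library_server` and `server_wrapper` below are illustrative stand-ins, not the actual seastar or scylla types:

```cpp
#include <cassert>

// Stand-in for a library class that was not designed to be a base class
// (its header not meant for inheritance, the class not marked final) --
// analogous to seastar::rpc::protocol<...>::server above.
struct library_server {
    int port;
    int start() { return port; }   // pretend this binds and listens
};

// Instead of `struct wrapper : library_server { ... }`, hold the library
// class as a member and forward the calls we need. More boilerplate, but
// no dependence on inheriting from a class that wasn't designed for it.
class server_wrapper {
    library_server _server;
public:
    explicit server_wrapper(int port) : _server{port} {}
    int start() { return _server.start(); }   // forwarding boilerplate
    int port() const { return _server.port; }
};
```

The forwarding members are exactly the "boilerplate code" the commit message accepts as the price of avoiding the inheritance.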

Closes #8084
2021-02-16 16:04:44 +02:00
Pavel Solodovnikov
d8dfdfba1e raft: pass group_id as an argument to raft rpc messages
This will be used later to filter the requests which belong
to the schema raft group and route them to shard 0.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-02-11 16:25:33 +03:00
Pavel Solodovnikov
1a979dbba2 raft: add Raft RPC verbs to messaging_service and wire up the RPC calls
All RPC module APIs except for `send_snapshot` should resolve as
soon as the message is sent, so these messages are passed via
`send_message_oneway_timeout`.

The `send_snapshot` message is sent via `send_message_timeout` and
returns a `future<>`, which resolves when the snapshot transfer
finishes or fails with an exception.

All necessary functions to wire the new Raft RPC verbs are also
provided (such as `register` and `unregister` handlers).

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-01-30 01:11:17 +03:00
Asias He
829b4c1438 repair: Make removenode safe by default
Currently removenode works like below:

- The coordinator node advertises the node to be removed in
  REMOVING_TOKEN status in gossip

- Existing nodes learn the node in REMOVING_TOKEN status

- Existing nodes sync data for the ranges they own

- Existing nodes send notification to the coordinator

- The coordinator node waits for notification and announce the node in
  REMOVED_TOKEN

Current problems:

- Existing nodes do not tell the coordinator whether the data sync
  succeeded or failed.

- The coordinator cannot abort the removenode operation in case of error.

- A failed removenode operation leaves the node being removed in
  REMOVING_TOKEN status forever.

- The removenode operation runs in best-effort mode, which may cause
  data consistency issues.

  If a node that will own part of the range after the removenode
  operation is down during the operation, the operation will still
  succeed without requiring that node to perform data syncing. This can
  cause data consistency issues.

  For example, Five nodes in the cluster, RF = 3, for a range, n1, n2,
  n3 is the old replicas, n2 is being removed, after the removenode
  operation, the new replicas are n1, n5, n3. If n3 is down during the
  removenode operation, only n1 will be used to sync data with the new
  owner n5. This will break QUORUM read consistency if n1 happens to
  miss some writes.

Improvements in this patch:

- This patch makes the removenode safe by default.

We require all nodes in the cluster to participate in the removenode operation and
sync data if needed. We fail the removenode operation if any of them is down or
fails.

If the user wants the removenode operation to succeed even when some of the
nodes are unavailable, the user has to explicitly pass a list of nodes that
can be skipped for the operation.

$ nodetool removenode --ignore-dead-nodes <list_of_dead_nodes_to_ignore> <host_id>

Example restful api:

$ curl -X POST "http://127.0.0.1:10000/storage_service/remove_node/?host_id=7bd303e9-4c7b-4915-84f6-343d0dbd9a49&ignore_nodes=127.0.0.3,127.0.0.5"

- The coordinator can abort data sync on existing nodes

For example, if one of the nodes fails to sync data, it makes no sense for
the other nodes to continue syncing, because the whole operation will fail
anyway.

- The coordinator can decide which nodes to ignore and pass the decision
  to other nodes

Previously, there was no way for the coordinator to tell existing nodes
to run in strict mode or best-effort mode. Users had to modify the
config file or run a RESTful API command on all the nodes to select
strict or best-effort mode. With this patch, that cluster-wide
configuration is eliminated.

Fixes #7359

Closes #7626
2020-12-10 10:14:39 +02:00
Benny Halevy
e28d80ec0c messaging: msg_addr: mark methods noexcept
Based on gms::inet_address.

With that, gossiper::get_msg_addr can be marked noexcept (and const while at it).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-01 16:46:18 +02:00
Tomasz Grabiec
14fdd2f501 Merge "Gossip echo message improvement" from Asias
This series improves gossip echo message handling in a loaded cluster.

Refs: #7197

* git://github.com/asias/scylla.git gossip_echo_improve_7197:
  gossiper: Handle echo message on any shard
  gossiper: Increase echo message timeout
  gossiper: Remove unused _last_processed_message_at
2020-09-24 15:13:55 +02:00
Asias He
c7cb638e95 gossiper: Increase echo message timeout
The gossip echo message is used to confirm a node is up. In a heavily
loaded, slow cluster, a node might take a long time to receive a heart
beat update; the node then uses the echo message to confirm the peer
node is really up.

If the echo message times out too early, the peer node will not be
marked as up. This is bad because a live node is marked as down, and it
could happen on multiple nodes in the cluster, causing a cluster-wide
unavailability issue. To prevent multiple nodes from being marked as
down, it is better to be conservative and less restrictive with the
echo message timeout.

Note that the echo message is not used to detect that a node is down.
Increasing the echo timeout does not affect marking a node down in a
timely manner.

Refs: #7197
2020-09-24 09:50:09 +08:00
Pavel Emelyanov
2fde6bbfe7 messaging_service: Report still registered services as errors
On stop -- unregister the CLIENT_ID verb, which is registered
in the constructor, then check for any remaining ones.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-09-17 09:52:57 +03:00
Pavel Emelyanov
623f61e63e messaging_service: Unglobal messaging service instance
Remove the global messaging_service and keep it on the main stack,
but also store a pointer to it in the debug namespace for debugging.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 20:50:53 +03:00
Pavel Emelyanov
4ea63b2211 gossiper: Share the messaging service with snitch
And make snitch use gossiper's messaging, not global

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 20:50:52 +03:00
Pavel Emelyanov
878c50b9ad main: Keep reference on global messaging service
This is the preparation for moving the message service to main -- keep
a reference and eventually pass one to subsystems depending on messaging.
Once they are ready, the reference will be turned into an instance.

For now only push the reference into the messaging service init/exit
itself, other subsystems will be patched next.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 13:08:12 +03:00
Pavel Emelyanov
bdfb77492f init: The messaging_service::stop is back (not really)
Introduce back the .stop() method that will be used to really stop
the service. For now do not do sharded::stop, as its users are not
yet stopping, so this prevents use-after-free on messaging service.

For now the .stop() is empty, but it will be in charge of checking that
all the other users have unregistered their handlers from rpc.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 13:08:12 +03:00
Pavel Emelyanov
c28aeaee2e messaging_service: Move initialization to messaging/
Now init_messaging_service() only deals with the messaging service
and related internal stuff, so it can sit in its own module.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 13:08:12 +03:00
Pavel Emelyanov
5b169e8d16 messaging_service: Construct using config
This is the continuation of the previous patch -- change the primary
constructor to work with config. This, in turn, will decouple the
messaging service from database::config.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 13:08:12 +03:00
Pavel Emelyanov
304a414e39 messaging_service: Introduce and use config
This service constructor uses and copies many simple values; it is much
simpler to group them in a config. It also helps the next patches
simplify the messaging service initialization and keep the defaults
(for testing) in one place.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 13:08:12 +03:00
Pavel Emelyanov
1c8ea817cd messaging_service: Rename stop() to shutdown()
With today's stop() the messaging service is not really stopped, as other
services still (may) use it and have handlers registered in it. Inside
.stop() only the rpc servers are brought down, so a better name for
this method is shutdown().

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 13:08:12 +03:00
Pavel Emelyanov
e6fb2b58fc messaging_service: Cleanup visibility of stopping methods
Just a cleanup. These internal stoppers must be private; there are also
too many public specifiers in the class declaration around them.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 13:08:12 +03:00
Avi Kivity
3b1ff90a1a Merge "Get rid of seed concept in gossip" from Asias
"
gossip: Get rid of seed concept

The concept of seed and the different behaviour between seed nodes and
non-seed nodes generate a lot of confusion, complication and errors for
users: for example, how to add a seed node into a cluster, how to
promote a non-seed node to a seed node, how to choose seed nodes in a
multi-DC setup, how to edit config files for seeds, and why a seed node
does not bootstrap.

If we remove the concept of seed, things get much easier for users.
After this series, the seed config option is only used once, when a new
node joins a cluster.

Major changes:

Seed nodes are only used as the initial contact point nodes.

Seed nodes now perform bootstrap. The only exception is the first node
in the cluster.

The unsafe auto_bootstrap option is now ignored.

Gossip shadow round now talks to all nodes instead of just seed nodes.

Refs: #6845
Tests: update_cluster_layout_tests.py + manual test
"

* 'gossip_no_seed_v2' of github.com:asias/scylla:
  gossip: Get rid of seed concept
  gossip: Introduce GOSSIP_GET_ENDPOINT_STATES verb
  gossip: Add do_apply_state_locally helper
  gossip: Do not talk to seed node explicitly
  gossip: Talk to live endpoints in a shuffled fashion
2020-08-17 09:50:51 +03:00
Avi Kivity
257c17a87a Merge "Don't depend on seastar::make_(lw_)?shared idiosyncrasies" from Rafael
"
While working on another patch I was getting odd compiler errors
saying that a call to ::make_shared was ambiguous. The reason was that
seastar has both:

template <typename T, typename... A>
shared_ptr<T> make_shared(A&&... a);

template <typename T>
shared_ptr<T> make_shared(T&& a);

The second variant doesn't exist in std::make_shared.

This series drops the dependency in scylla, so that a future change
can make seastar::make_shared a bit more like std::make_shared.
"

* 'espindola/make_shared' of https://github.com/espindola/scylla:
  Everywhere: Explicitly instantiate make_lw_shared
  Everywhere: Add a make_shared_schema helper
  Everywhere: Explicitly instantiate make_shared
  cql3: Add a create_multi_column_relation helper
  main: Return a shared_ptr from defer_verbose_shutdown
2020-08-02 19:51:24 +03:00
Avi Kivity
3f84d41880 Merge "messaging: make verb handler registering independent of current scheduling group" from Botond
"
0c6bbc8 refactored `get_rpc_client_idx()` to select different clients
for statement verbs depending on the current scheduling group.
The goal was to allow statement verbs to be sent on different
connections depending on the current scheduling group. The new
connections use per-connection isolation. For backward compatibility the
already existing connections fall back to the per-handler isolation used
previously. The old statement connection, called the default statement
connection, also used this. `get_rpc_client_idx()` was changed to select
the default statement connection when the current scheduling group is
the statement group, and a non-default connection otherwise.

This inadvertently broke `scheduling_group_for_verb()`, which also used
this method to get the scheduling group used to isolate a verb at
handler registration time. This method needs the default client idx for
each verb, but if verb registering runs under the system group it
instead got the non-default one. As a result, the per-handler isolation
was not set up for the default statement connection, and default
statement verb handlers ran in whatever scheduling group the process
loop of the rpc is running in, which is the system scheduling group.

This caused all sorts of problems, even beyond user queries running in
the system group. Also as of 0c6bbc8 queries on the replicas are
classified based on the scheduling group they are running on, so user
reads also ended up using the system concurrency semaphore.

In particular this caused severe problems with ranges scans, which in
some cases ended up using different semaphores per page resulting in a
crash. This could happen because when the page was read locally the code
would run in the statement scheduling group, but when the request
arrived from a remote coordinator via rpc, it was read in a system
scheduling group. This caused a mismatch between the semaphore the saved
reader was created with and the one the new page was read with. The
result was that in some cases when looking up a paused reader from the
wrong semaphore, a reader belonging to another read was returned,
creating a disconnect between the lifecycle of readers and that of
the slice and range they were referencing.

This series fixes the underlying problem of the scheduling group
influencing the verb handler registration, as well as adding some
additional defenses if this semaphore mismatch ever happens in the
future. Inactive read handles are now unique across all semaphores,
meaning that it is not possible anymore that a handle succeeds in
looking up a reader when used with the wrong semaphore. The range scan
algorithm now also makes sure there is no semaphore mismatch between the
one used for the current page and that of the saved reader from the
previous page.

I manually checked that each individual defense added is already
preventing the crash from happening.

Fixes: #6613
Fixes: #6907
Fixes: #6908

Tests: unit(dev), manual(run the crash reproducer, observe no crash)
"

* 'query-classification-regressions/v1' of https://github.com/denesb/scylla:
  multishard_mutation_query: use cached semaphore
  messaging: make verb handler registering independent of current scheduling group
  multishard_mutation_query: validate the semaphore of the looked-up reader
  reader_concurrency_semaphore: make inactive read handles unique across semaphores
  reader_concurrency_semaphore: add name() accessor
  reader_concurrency_semaphore: allow passing name to no-limit constructor
2020-07-27 13:56:52 +03:00
Botond Dénes
0df4c2fd3b messaging: make verb handler registering independent of current scheduling group
0c6bbc8 refactored `get_rpc_client_idx()` to select different clients
for statement verbs depending on the current scheduling group.
The goal was to allow statement verbs to be sent on different
connections depending on the current scheduling group. The new
connections use per-connection isolation. For backward compatibility the
already existing connections fall back to the per-handler isolation used
previously. The old statement connection, called the default statement
connection, also used this. `get_rpc_client_idx()` was changed to select
the default statement connection when the current scheduling group is
the statement group, and a non-default connection otherwise.

This inadvertently broke `scheduling_group_for_verb()`, which also used
this method to get the scheduling group used to isolate a verb at
handler registration time. This method needs the default client idx for
each verb, but if verb registering runs under the system group it
instead got the non-default one. As a result, the per-handler isolation
was not set up for the default statement connection, and default
statement verb handlers ran in whatever scheduling group the process
loop of the rpc is running in, which is the system scheduling group.

This caused all sorts of problems, even beyond user queries running in
the system group. Also as of 0c6bbc8 queries on the replicas are
classified based on the scheduling group they are running on, so user
reads also ended up using the system concurrency semaphore.
2020-07-27 10:11:21 +03:00
Asias He
cd7d64f588 gossip: Introduce GOSSIP_GET_ENDPOINT_STATES verb
The new verb replaces the current gossip shadow round implementation.
The current shadow round implementation reuses the gossip syn and ack
async messages, which has plenty of drawbacks. It is hard to tell
whether the syn message to a specific peer node has been responded to.
Delayed responses from the shadow round can be applied to the normal
gossip states even after the shadow round is done. The syn and ack
message handlers are full of special cases due to the shadow round. All
gossip application states, including ones that are not relevant, are
sent back. The gossip application states are applied and the gossip
listeners are called as if in normal gossip operation, even though
calling the gossip listeners in the shadow round is completely
unnecessary.

This patch introduces a new verb that requests exactly the gossip
application states the shadow round needs, using a synchronous verb, and
applies those states without calling the gossip listeners. This makes
the shadow round easier to reason about, more robust, and more
efficient.

Refs: #6845
Tests: update_cluster_layout_tests.py
2020-07-27 09:15:11 +08:00
Pavel Emelyanov
7a7b1b3108 messaging: Add missing handlers unregistration helpers
Each verb has both a register and an unregister helper, but the
unregistration helpers for some verbs are missing, so this patch adds them.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-07-22 16:31:57 +03:00
Rafael Ávila de Espíndola
e15c8ee667 Everywhere: Explicitly instantiate make_lw_shared
seastar::make_lw_shared has a constructor taking a T&&. There is no
such constructor in std::make_shared:

https://en.cppreference.com/w/cpp/memory/shared_ptr/make_shared

This means that we have to move from

    make_lw_shared(T(...))

to

    make_lw_shared<T>(...)

if we don't want to depend on the idiosyncrasies of
seastar::make_lw_shared.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-07-21 10:33:49 -07:00
Pavel Emelyanov
8618a02815 migration_manager: Remove db/schema_tables.hh inclusion into header
The schema_tables.hh -> migration_manager.hh couple seems to work as a
"single header for everything", creating big bloat for many seemingly
unrelated .hh's.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-07-17 17:54:43 +03:00
Asias He
67f6da6466 repair: Switch to btree_set for repair_hash.
In one of the longevity tests, we observed a 1.3s reactor stall that came
from repair_meta::get_full_row_hashes_source_op. It traced back to a call
to std::unordered_set::insert(), which triggered a big memory allocation
and reclaim.

I measured std::unordered_set, absl::flat_hash_set, absl::node_hash_set
and absl::btree_set. absl::btree_set was the only one that seastar's
oversized allocation checker did not warn about in my tests, where
around 300K repair hashes were inserted into the container.

- unordered_set:
hash_sets=295634, time=333029199 ns

- flat_hash_set:
hash_sets=295634, time=312484711 ns

- node_hash_set:
hash_sets=295634, time=346195835 ns

- btree_set:
hash_sets=295634, time=341379801 ns

The btree_set is a bit slower than unordered_set, but it does not make
huge memory allocations. I did not measure the real difference in total
time to finish a repair of the same dataset with unordered_set versus
btree_set.

To fix, switch to the absl btree_set container.
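A minimal sketch of the container property being relied on. The actual patch uses absl::btree_set (which allocates small fixed-size nodes); `std::set`, which allocates per element, stands in here so the example needs only the standard library, and `repair_hash` is an illustrative stand-in. Either way, no single insert ever forces one huge contiguous reallocation the way a growing hash table's rehash can:

```cpp
#include <cstddef>
#include <cstdint>
#include <set>

// Illustrative stand-in for the repair hash value type.
struct repair_hash {
    uint64_t hash = 0;
    bool operator<(const repair_hash& o) const { return hash < o.hash; }
};

// Insert n distinct hashes into a node-based ordered set. Each insert
// costs one small node allocation; there is no rehash step that would
// allocate a buffer proportional to the whole container.
inline std::size_t insert_hashes(std::size_t n) {
    std::set<repair_hash> hashes;
    for (uint64_t i = 0; i < n; ++i) {
        // Odd multiplier is invertible mod 2^64, so all keys are distinct.
        hashes.insert(repair_hash{i * 0x9e3779b97f4a7c15ULL});
    }
    return hashes.size();
}
```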

Fixes #6190
2020-07-09 11:35:18 +03:00
Rafael Ávila de Espíndola
af44684418 messaging_service: Don't return variadic futures from make_sink_and_source_for_* 2020-06-29 16:50:45 -07:00
Avi Kivity
e5be3352cf database, streaming, messaging: drop streaming memtables
Before Scylla 3.0, we used to send streaming mutations using
individual RPC requests and flush them together using dedicated
streaming memtables. This mechanism is no longer in use and all
versions that use it have long reached end-of-life.

Remove this code.
2020-06-25 15:25:54 +02:00
Eliran Sinvani
14520e843a messaging service: fix execution order in messaging_service constructor
The messaging service constructor's body does two main things, in this
order:
1. It registers the CLIENT_ID verb with rpc.
2. It initializes the scheduling mechanism in charge of locating the
right scheduling group for each verb.

The registration function uses the scheduling mechanism to determine
the scheduling group for the verb.
This commit simply reverses the order of execution.
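A minimal reconstruction of the ordering bug and its fix, with hypothetical names standing in for the messaging_service internals:

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical sketch: registering a verb consults the scheduling
// lookup, so the lookup must be populated before any registration runs.
class service {
    std::map<std::string, int> _sched_for_verb;  // verb -> scheduling group
    std::map<std::string, int> _registered;      // verb -> group it got

    void init_scheduling() {
        _sched_for_verb["CLIENT_ID"] = 1;
    }
    void register_verb(const std::string& verb) {
        auto it = _sched_for_verb.find(verb);
        // Before the fix this ran first, found nothing, and registered
        // the verb with the wrong (fallback) group, here modeled as -1.
        _registered[verb] = (it == _sched_for_verb.end()) ? -1 : it->second;
    }
public:
    service() {
        init_scheduling();          // the fix: initialize first...
        register_verb("CLIENT_ID"); // ...then register
    }
    int group_of(const std::string& v) const { return _registered.at(v); }
};
```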

Fixes #6628
2020-06-11 12:14:10 +03:00
Pavel Emelyanov
67d5fad65f storage_service: Remove some inclusions of its header
GC pass over .cc files. Some really do not need it; some need it only for
features/gossiper.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-01 09:08:40 +03:00
Botond Dénes
16d8cdadc9 messaging_service: introduce the tenant concept
Tenants get their own connections for statement verbs and are further
isolated from each other by different scheduling groups. A tenant is
identified by a scheduling group and a name. When selecting the client
index for a statement verb, we look up the tenant whose scheduling group
matches the current one. This scheduling group is persisted across the
RPC call, using the name to identify the tenant on the remote end, where
a reverse lookup (name -> scheduling group) happens.
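The tenant selection described above can be sketched as follows, with hypothetical types standing in for scheduling groups and the real messaging_service code:

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// A tenant is a (scheduling group, name) pair. `int` stands in for
// seastar::scheduling_group here; names and signatures are illustrative.
struct tenant {
    int scheduling_group;
    std::string name;
};

// Client side: pick the tenant whose scheduling group matches the
// current one; fall back to the first (default) tenant otherwise.
inline const tenant& select_tenant(const std::vector<tenant>& tenants,
                                   int current_group) {
    for (const auto& t : tenants) {
        if (t.scheduling_group == current_group) return t;
    }
    return tenants.front();  // default tenant
}

// Server side: reverse lookup, name -> scheduling group, for the name
// carried across the RPC call.
inline std::optional<int> group_for_name(const std::vector<tenant>& tenants,
                                         const std::string& name) {
    for (const auto& t : tenants) {
        if (t.name == name) return t.scheduling_group;
    }
    return std::nullopt;
}
```

With the two configured tenants from the commit, `$user` (default) and `$system`, a request running in an unknown scheduling group falls back to `$user`.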

Instead of a single scheduling group to be used for all statement verbs,
messaging_service::scheduling_config now contains a list of tenants. The
first among these is the default tenant, the one we use when the current
scheduling group doesn't match that of any configured tenant.
To make this mapping easier, we reshuffle the client index assignment,
such that statement and statement-ack verbs have the idx 2 and 3
respectively, instead of 0 and 3.

The tenant configuration is set at messaging service construction
time and cannot be changed afterwards. Adding such a capability should be
easy
but is not needed for query classification, the current user of the
tenant concept.

Currently two tenants are configured: $user (default tenant) and
$system.
2020-05-28 11:34:32 +03:00
Avi Kivity
db8974fef3 messaging_service: de-static-ify _scheduling_info_for_connection_index
Per-user SLA means we have connection classifications determined dynamically,
as SLAs are added or removed. This means the classification information cannot
be static.

Fix by making it a non-static vector (instead of a static array), allowing it
to be extended. The scheduling group member pointer is replaced by a scheduling
group as a member pointer won't work anymore - we won't have a member to refer
to.
2020-05-28 10:40:08 +03:00
Avi Kivity
10dd08c9b0 messaging_service: supply and interpret rpc isolation_cookies
On the client side, we supply an isolation cookie based on the connection
index. On the server side, we convert the isolation cookie back to a
scheduling_group.

This has two advantages:
 - rpc processes the entire connection using the scheduling group, so that code
   is also isolated and accounted for
 - we can later add per-user connections; the previous approach of looking at the
   verb to decide the scheduling_group doesn't help because we don't have a set of
   verbs per user

With this, the main group sees <0.1% usage under simple read and write loads.
2020-05-28 10:40:08 +03:00
Avi Kivity
dbce57fa3c messaging_service: extract connection_index -> scheduling_group translation
Move it from a function-local static to a class static variable. We will want
to extend it in two ways:
 - add more information per connection index (like the rpc isolation cookie)
 - support adding more connections for per-user SLA

As a first step, make it an array of structures and make it accessible to all
of messaging_service.
2020-05-28 10:40:08 +03:00
Asias He
c02fea5f04 repair: Ignore table removed in sync_data_using_repair
Commit 75cf255c67 (repair: Ignore keyspace
that is removed in sync_data_using_repair) is not enough to fix the
issue, because when the repair master checks whether the table is
dropped, the table might not have been dropped yet on the repair master.

To fix, the repair master should check whether the follower failed the
repair because the table was dropped, by checking the error returned
from the follower.
With this patch, we would see

WARN  2020-04-14 11:19:00,417 [shard 0] repair - repair id 1 on shard 0
completed successfully, keyspace=ks, ignoring dropped tables={cf}

when the table is dropped during bootstrap.

Tests: update_cluster_layout_tests.py:TestUpdateClusterLayout.simple_add_new_node_while_schema_changes_test

Fixes: #5942
2020-05-24 13:39:59 +03:00
Calle Wilund
08d069f78d messaging_service: Use reloadable TLS certificates
Changes the messaging service rpc to use reloadable tls
certificates iff tls is enabled.

Note that this means that the service cannot start
listening at construction time if TLS is active,
and users need to call start_listen_ex to initialize
and actually start the service.

Since the "normal" messaging service is actually started
from gms, this route too is made a continuation.
2020-05-04 11:32:21 +00:00
Botond Dénes
7dabf75682 service: messaging_service: resolve rpc set_logger deprecation warning
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200407091413.310764-1-bdenes@scylladb.com>
2020-04-22 10:05:35 +03:00
Asias He
13a9c5eaf7 repair: Send reason for node operations
Since 956b092012 (Merge "Repair based node
operation" from Asias), repair is used by other node operations like
bootstrap, decommission and so on.

Send the reason for the repair, so that we can handle the materialized
view update correctly according to the reason for the operation. We want
to trigger the view update only if the repair is used by a repair
operation. Otherwise, the view table would be handled twice: 1) when the
view table is synced using repair, and 2) when the base table is synced
using repair and a view table update is triggered.

Fixes #5930
Fixes #5998
2020-04-13 13:47:26 +03:00
Avi Kivity
88ade3110f treewide: replace calls to engine().some_api() with some_api()
This removes the need to include reactor.hh, a source of compile
time bloat.

In some places, the call is qualified with seastar:: in order
to resolve ambiguities with a local name.

Includes are adjusted to make everything compile. We end up
having 14 translation units including reactor.hh, primarily for
deprecated things like reactor::at_exit().

Ref #1
2020-04-05 12:46:04 +03:00
Gleb Natapov
8a408ac5a8 lwt: remove entries from system.paxos table after successful learn stage
The learning stage of the PAXOS protocol leaves behind an entry in the
system.paxos table with the last learned value (which can be large). In
case not all participants learned it successfully, the next round on the
same key may complete the learning using this info. But if all nodes
learned the value, the entry no longer serves any useful purpose.

The patch adds another round, "prune", which is executed in the
background (limited to 1000 simultaneous instances) and removes the
entry when all nodes replied successfully to the "learn" round. It uses
the ballot's timestamp to do the deletion, so as not to interfere with
the next round. Since the deletion happens very close to the previous
writes, it will likely happen in the memtable and never reach an
sstable, which reduces memtable flush and compaction overhead.

Fixes #5779

Message-Id: <20200330154853.GA31074@scylladb.com>
2020-03-30 21:02:14 +03:00
Rafael Ávila de Espíndola
c5795e8199 everywhere: Replace engine().cpu_id() with this_shard_id()
This is a bit simpler and might allow removing a few includes of
reactor.hh.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200326194656.74041-1-espindola@scylladb.com>
2020-03-27 11:40:03 +03:00
Gleb Natapov
5753ab7195 lwt: drop invoke_on in paxos_state prepare and accept
Since lwt requests now run on an owning shard, there is no longer a
need for a cross-shard call at the paxos_state level. RPC calls may
still arrive at the wrong shard, so we need to make the cross-shard call
there.
2020-01-13 10:26:02 +02:00
Benny Halevy
9ec98324ed messaging_service: unregister_handler: return rpc unregister_handler future
Now that seastar returns it.

Fixes https://github.com/scylladb/scylla/issues/5228

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20191212143214.99328-1-bhalevy@scylladb.com>
2019-12-12 16:38:36 +02:00
Benny Halevy
105c8ef5a9 messaging_service: wait on unregister_handler
Prepare for returning future<> from seastar rpc
unregister_handler.

Refs https://github.com/scylladb/scylla/issues/5228

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20191208153924.1953-1-bhalevy@scylladb.com>
2019-12-11 14:17:41 +02:00
Piotr Dulikowski
adfa7d7b8d messaging_service: don't move unsigned values in handlers
Performing std::move on integral types is pointless. This commit gets
rid of moves of values of `unsigned` type in rpc handlers.
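A minimal illustration of why the moves were dropped; `forward_shard_*` are hypothetical names, not functions from the codebase:

```cpp
#include <cstdint>
#include <utility>

// For an integral type, std::move is just a cast to an rvalue
// reference; the subsequent "move" is an ordinary copy of the value,
// so writing it adds noise without changing what the compiler emits.
inline uint32_t forward_shard_copy(uint32_t shard) {
    return shard;
}
inline uint32_t forward_shard_moved(uint32_t shard) {
    return std::move(shard);  // behaves identically to the line above
}
```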
2019-12-05 00:58:31 +01:00
Piotr Dulikowski
2e802ca650 hh: add HINT_MUTATION verb
Introduce a new verb dedicated for receiving and sending hints:
HINT_MUTATION. It is handled on the streaming connection, which is
separate from the one used for handling mutations sent by coordinator
during a write.

The intent of using a separate connection is to increase fariness while
handling hints and user requests - this way, a situation can be avoided
in which one type of requests saturate the connection, negatively
impacting the other one.
2019-12-05 00:51:49 +01:00
Vladimir Davydov
bf5f864d80 paxos: piggyback result query on prepare response
Current LWT implementation uses at least three network round trips:
 - first, execute PAXOS prepare phase
 - second, query the current value of the updated key
 - third, propose the change to participating replicas

(there's also learn phase, but we don't wait for it to complete).

The idea behind the optimization implemented by this patch is simple:
piggyback the current value of the updated key on the prepare response
to eliminate one round trip.

To generate less network traffic, only the closest to the coordinator
replica sends data while other participating replicas send digests which
are used to check data consistency.

Note, this patch changes the API of some RPC calls used by PAXOS, but
this should be okay as long as the feature is in the early development
stage and marked experimental.

To assess the impact of this optimization on LWT performance, I ran a
simple benchmark that starts a number of concurrent clients each of
which updates its own key (uncontended case) stored in a cluster of
three AWS i3.2xlarge nodes located in the same region (us-west-1) and
measures the aggregate bandwidth and latency. The test uses shard-aware
gocql driver. Here are the results:

                latency 99% (ms)    bandwidth (rq/s)    timeouts (rq/s)
    clients     before  after       before  after       before  after
          1          2      2          626    637            0      0
          5          4      3         2616   2843            0      0
         10          3      3         4493   4767            0      0
         50          7      7        10567  10833            0      0
        100         15     15        12265  12934            0      0
        200         48     30        13593  14317            0      0
        400        185     60        14796  15549            0      0
        600        290     94        14416  15669            0      0
        800        568    118        14077  15820            2      0
       1000        710    118        13088  15830            9      0
       2000       1388    232        13342  15658           85      0
       3000       1110    363        13282  15422          233      0
       4000       1735    454        13387  15385          329      0

That is, this optimization improves max LWT bandwidth by about 15%
and allows running 3-4x more clients while maintaining the same level
of system responsiveness.
2019-11-24 11:35:29 +02:00
Vladimir Davydov
3d1d4b018f paxos: remove unnecessary move constructor invocations
invoke_on() guarantees that captured objects won't be destroyed until
the future returned by the invoked function is resolved, so there's no
need to move the key, token, and proposal when calling the
paxos_state::*_impl helpers.
2019-11-24 11:35:29 +02:00
Gleb Natapov
8d6201a23b lwt: Add RPC verbs needed for paxos implementation
The Paxos protocol has three stages: prepare, accept, learn. This patch
adds an rpc verb for each of those stages. To be term-compatible with
Cassandra, the patch calls those stages prepare, propose, and commit.
2019-10-27 23:21:51 +03:00
Avi Kivity
ba64ec78cf messaging_service: use rpc::tuple instead of variadic futures for rpc
Since variadic future<> is deprecated, switch to rpc::tuple for multiple
return values in rpc calls. This is more or less mechanical translation.
2019-09-26 12:09:31 +02:00