Commit Graph

421 Commits

Author SHA1 Message Date
Kamil Braun
cbdcc944b5 service/raft: specialized verb for failure detector pinger
We used GOSSIP_ECHO verb to perform failure detection. Now we use
a special verb DIRECT_FD_PING introduced for this purpose.

There are multiple reasons to do so.

One minor reason: we want to use the same connection as other Raft
verbs: if we can't deliver Raft append_entries or vote messages
somewhere, that endpoint should be marked dead; if we can, the
endpoint should be marked alive. So putting pings on the same
connection as the other Raft verbs is important when dealing with
weird situations where some connections are available but others are
not. Observe that in `do_get_rpc_client_idx`, we put the new verb in
the right place.

Another minor reason: we remove the awkward gossiper `echo_pinger`
abstraction which required storing and updating gossiper generation
numbers. This also removes one dependency from Raft service code to
gossiper.

Major reason 1: the gossip echo handler has a weird mechanism where a
replacing node returns errors during the replace operation to some of
the nodes. In Raft however, we want to mark servers as alive when they
are alive, including a server running on a node that's replacing
another node.

Major reason 2, related to the previous one: when server B is
replacing server A with the same IP, the failure detector will try to
ping both servers. Both servers are mapped to the same IP by the
address map, so pings to both servers will reach server B. We want
server B to respond to the pings destined for server B, but not to
pings destined for server A, so the sender can mark B alive but keep A
marked dead.

To do this, we include the destination's Raft ID in our RPCs. The
destination compares the received ID with its own. If it's different,
it returns a `wrong_destination` response, and the failure detector
knows that the ping did not reach the destination (it reached someone
else).

Yet another reason: removes "Not ready to respond gossip echo
message" log spam during replace.
2022-12-01 20:54:18 +01:00
Konstantin Osipov
73e5298273 raft: (address map) actively maintain ip <-> raft server id map
1) make address map API flexible

Before this patch:
- having a mapping without an actual IP address was an
  internal error
- not having a mapping for an IP address was an internal
  error
- re-mapping to a new IP address wasn't allowed

After this patch:

- the address map may contain a mapping
  without an actual IP address, and the caller must be prepared for it:
  find() will return a nullopt. This happens when we first add an entry
  to Raft configuration and only later learn its IP address, e.g.  via
  gossip.

- it is allowed to re-map an existing entry to a new address;
2) subscribe to gossip notifications

Learning IP addresses from gossip allows us to adjust
the address map whenever a node IP address changes.
Gossiper is also the only valid source of re-mapping, other sources
(RPC) should not re-map, since otherwise a packet from a removed
server can remap the id to a wrong address and impact liveness of a Raft
cluster.

3) prompt address map state with app state

Initialize the raft address map with initial
gossip application state, specifically IPs of members
of the cluster. With this, we no longer need to store
these IPs in Raft configuration (and update them when they change).

The obvious drawback of this approach is that a node
may join Raft config before it propagates its IP address
to the cluster via gossip - so the boot process has to
wait until it happens.

Gossip also doesn't tell us which IPs are members of Raft configuration,
so we subscribe to Group0 configuration changes to mark the
members of Raft config "non-expiring" in the address translation
map.

Thanks to the changes above, Raft configuration no longer
stores IP addresses.

We still keep the 'server_info' column in the raft_config system table,
in case we change our mind or decide to store something else in there.
2022-11-29 19:55:43 +03:00
Pavel Emelyanov
a396c27efc Merge 'message: messaging_service: fix topology_ignored for pending endpoints in get_rpc_client' from Kamil Braun
`get_rpc_client` calculates a `topology_ignored` field when creating a
client which says whether the client's endpoint had topology information
when this client was created. This is later used to check if that client
needs to be dropped and replaced with a new client which uses the
correct topology information.

The `topology_ignored` field was incorrectly calculated as `true` for
pending endpoints even though we had topology information for them. This
would lead to unnecessary drops of RPC clients later. Fix this.

Remove the default parameter for `with_pending` from
`topology::has_endpoint` to avoid similar bugs in the future.

Apparently this fixes #11780. The verbs used by decommission operation
use RPC client index 1 (see `do_get_rpc_client_idx` in
message/messaging_service.cc). From local testing with additional
logging I found that by the time this client is created (i.e. the first
verb in this group is used), we already know the topology. The node is
pending at that point - hence the bug would cause us to assume we don't
know the topology, leading us to dropping the RPC client later, possibly
in the middle of a decommission operation.

Fixes: #11780

Closes #11942

* github.com:scylladb/scylladb:
  message: messaging_service: check for known topology before calling is_same_dc/rack
  test: reenable test_topology::test_decommission_node_add_column
  test/pylib: util: configurable period in wait_for
  message: messaging_service: fix topology_ignored for pending endpoints in get_rpc_client
  message: messaging_service: topology independent connection settings for GOSSIP verbs
2022-11-17 20:14:32 +03:00
Kamil Braun
a83789160d message: messaging_service: check for known topology before calling is_same_dc/rack
`is_same_dc` and `is_same_rack` assume that the peer's topology is
known. If it's unknown, `on_internal_error` will be called inside
topology.

When these functions are used in `get_rpc_client`, they are already
protected by an earlier check for knowing the peer's topology
(the `has_topology()` lambda).

Another use is in `do_start_listen()`, where we create a filter for RPC
module to check if it should accept incoming connections. If cross-dc or
cross-rack encryption is enabled, we will reject connections attempts to
the regular (non-ssl) port from other dcs/rack using `is_same_dc/rack`.
However, it might happen that something (other Scylla node or otherwise)
tries to contact us on the regular port and we don't know that thing's
topology, which would result in `on_internal_error`. But this is not a
fatal error; we simply want to reject that connection. So protect these
calls as well.

Finally, there's `get_preferred_ip` with an unprotected `is_same_dc`
call which, for a given peer, may return a different IP from preferred IP
cache if the endpoint resides in the same DC. If there is not entry in
the preferred IP cache, we return the original (external) IP of the
peer. We can do the same if we don't know the peer's topology. It's
interesting that we didn't see this particular place blowing up. Perhaps
the preferred IP cache is always populated after we know the topology.
2022-11-16 14:01:50 +01:00
Kamil Braun
1bd2471c19 message: messaging_service: fix topology_ignored for pending endpoints in get_rpc_client
`get_rpc_client` calculates a `topology_ignored` field when creating a
client which says whether the client's endpoint had topology information
when topology was created. This is later used to check if that client
needs to be dropped and replaced with a new client which uses the
correct topology information.

The `topology_ignored` field was incorrectly calculated as `true` for
pending endpoints even though we had topology information for them. This
would lead to unnecessary drops of RPC clients later. Fix this.

Remove the default parameter for `with_pending` from
`topology::has_endpoint` to avoid similar bugs in the future.

Apparently this fixes #11780. The verbs used by decommission operation
use RPC client index 1 (see `do_get_rpc_client_idx` in
message/messaging_service.cc). From local testing with additional
logging I found that by the time this client is created (i.e. the first
verb in this group is used), we already know the topology. The node is
pending at that point - hence the bug would cause us to assume we don't
know the topology, leading us to dropping the RPC client later, possibly
in the middle of a decommission operation.

Fixes: #11780
2022-11-16 14:01:50 +01:00
Kamil Braun
840be34b5f message: messaging_service: topology independent connection settings for GOSSIP verbs
The gossip verbs are used to learn about topology of other nodes.
If inter-dc/rack encryption is enabled, the knowledge of topology is
necessary to decide whether it's safe to send unencrypted messages to
nodes (i.e., whether the destination lies in the same dc/rack).

The logic in `messaging_service::get_rpc_client`, which decided whether
a connection must be encrypted, was this (given that encryption is
enabled): if the topology of the peer is known, and the peer is in the
same dc/rack, don't encrypt. Otherwise encrypt.

However, it may happen that node A knows node B's topology, but B
doesn't know A's topology. A deduces that B is in the same DC and rack
and tries sending B an unencrypted message. As the code currently
stands, this would cause B to call `on_internal_error`. This is what I
encountered when attempting to fix #11780.

To guarantee that it's always possible to deliver gossiper verbs (even
if one or both sides don't know each other's topology), and to simplify
reasoning about the system in general, choose connection settings that
are independent of the topology - for the connection used by gossiper
verbs (other connections are still topology-dependent and use complex
logic to handle the situation of unknown-and-later-known topology).

This connection only contains 'rare' and 'cheap' verbs, so it's not a
performance problem to always encrypt it (given that encryption is
configured). And this is what already was happening in the past; it was
at some point removed during topology knowledge management refactors. We
just bring this logic back.

Fixes #11992.

Inspired by xemul/scylla@45d48f3d02.
2022-11-16 13:58:07 +01:00
Pavel Emelyanov
7b193ab0a5 messaging_service: Deny putting INADD_ANY as preferred ip
Even though previous patch makes scylla not gossip this as internal_ip,
an extra sanity check may still be useful. E.g. older versions of scylla
may still do it, or this address can be loaded from system_keyspace.

refs: #11502

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-27 14:25:43 +03:00
Pavel Emelyanov
aa7a759ac9 messaging_service: Toss preferred ip cache management
Make it call cache_preferred_ip() even when the cache is loaded from
system_keyspace and move the connection reset there. This is mainly to
prepare for the next patch, but also makes the code a bit shorter

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-27 14:25:43 +03:00
Konstantin Osipov
224dd9ce1e raft: rename group0_upgrade.hh to group0_fwd.hh
The plan is to add other group-0-related forward declarations
to this file, not just the ones for upgrade.
2022-10-10 15:58:48 +03:00
Botond Dénes
13ace7a05e Merge "Fix RPC sockets configuration wrt topology" from Pavel Emelyanov
"
Messaging service checks dc/rack of the target node when creating a
socket. However, this information is not available for all verbs, in
particular gossiper uses RPC to get topology from other nodes.

This generates a chicken-and-egg problem -- to create a socket messaging
service needs topology information, but in order to get one gossiper
needs to create a socket.

Other than gossiper, raft starts sending its APPEND_ENTRY messages early
enough so that topology info is not avaiable either.

The situation is extra-complicated with the fact that sockets are not
created for individual verbs. Instead, verbs are groupped into several
"indices" and socket is created for it. Thus, the "gossiping" index that
includes non-gossiper verbs will create topology-less socket for all
verbs in it. Worse -- raft sends messages w/o solicited topology, the
corresponding socket is created with the assumption that the peer lives
in default dc and rack which doesn't matchthe local nodes' dc/rack and
the whole index group gets the "randomly" configured socket.

Also, the tcp-nodelay tries to implement similar check, but uses wrong
index of 1, so it's also fixed here.
"

* 'br-messaging-topology-ignoring-clients' of https://github.com/xemul/scylla:
  messaging_service: Fix gossiper verb group
  messaging_service: Mind the absence of topology data when creating sockets
  messaging_service: Templatize and rename remove_rpc_client_one
2022-09-16 13:27:56 +03:00
Pavel Emelyanov
82162be1f1 messaging_service: Remove init/uninit helpers
These two are just getting in the way when touching inter-components
dependencies around messaging service. Without it m.-s. start/stop
just looks like any other service out there

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #11535
2022-09-15 11:54:46 +03:00
Pavel Emelyanov
2c74062962 messaging_service: Fix gossiper verb group
When configuring tcp-nodelay unconditionally, messaging service thinks
gossiper uses group index 1, though it had changed some time ago and now
those verbs belong to group 0.

fixes: #11465

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-14 20:40:47 +03:00
Pavel Emelyanov
7bdad47de2 messaging_service: Mind the absence of topology data when creating sockets
When a socket is created to serve a verb there may be no topology
information regarding the target node. In this case current code
configures socket as if the peer node lived in "default" dc and rack of
the same name. If topology information appears later, the client is not
re-connected, even though it could providing more relevant configuration
(e.g. -- w/o encryption)

This patch checks if the topology info is needed (sometimes it's not)
and if missing it configures the socket in the most restrictive manner,
but notes that the socket ignored the topology on creation. When
topology info appears -- and this happens when a node joins the cluster
-- the messaging service is kicked to drop all sockets that ignored the
topology, so thay they reconnect later.

The mentioned "kick" comes from storage service on-join notification.
More correct fix would be if topology had on-change notification and
messaging service subscribed on it, but there are two cons:

- currently dc/rack do not change on the fly (though they can, e.g. if
  gossiping property file snitch is updated without restart) and
  topology update effectively comes from a single place

- updating topology on token-metadata is not like topology.update()
  call. Instead, a clone of token metadata is created, then update
  happens on the clone, then the clone is committed into t.m. Though
  it's possible to find out commit-time which nodes changed their
  topology, but since it only happens on join this complexity likely
  doesn't worth the effort (yet)

fixes: #11514
fixes: #11492
fixes: #11483

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-14 20:30:51 +03:00
Pavel Emelyanov
5ffc9d66ec messaging_service: Templatize and rename remove_rpc_client_one
It actually finds and removes a client and in its new form it also
applies filtering function it, so some better name is called for

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-14 20:30:07 +03:00
Pavel Emelyanov
5663b3eda5 messaging_service: Relax connection drop on re-caching
When messaging_service::get_rpc_client() picks up cached socket and
notices error on it, it drops the connection and creates a new one. The
method used to drop the connection is the one that re-lookups the verb
index again, which is excessive. Tune this up while at it

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-12 12:05:02 +03:00
Pavel Emelyanov
24d68e1995 messaging_service: Simplify remove_rpc_client_one()
Make it void as after previous patch no code is interesed in this value

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-09 12:50:41 +03:00
Pavel Emelyanov
ca92ed65e5 messaging_service: Notify connection drop when connection is removed
There are two methods to close an RPC socket in m.s. -- one that's
called on error path of messaging_service::send_... and the other one
that's called upon gossiper down/leave/cql-off notifications. The former
one notifies listeners about connection drop, the latter one doesn't.

The only listener is the storage-proxy which, in turn, kicks database to
release per-table cache hitrate data. Said that, when a node goes down
(or when an operator shuts down its transport) the hit-rate stats
regarding this node are leaked.

This patch moves notification so that any socket drop calls notification
and thus releases the hitrates.

fixes: #11497

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-09 12:47:38 +03:00
Pavel Emelyanov
f0580aedaf messaging: Get DC/RACK from topology
Now everything is prepared for that

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-02 11:34:57 +03:00
Pavel Emelyanov
e147681d85 messaging, topology: Keep shared_token_metadata* on messaging
Messaging will need to call topology methods to compare DC/RACK of peers
with local node. Topology now resides on token metadata, so messaging
needs to get the dependency reference.

However, messaging only needs the topology when it's up and running, so
instead of producing a life-time reference, add a pointer, that's set up
on .start_listen(), before any client pops up, and is cleared on
.shutdown() after all connections are dropped.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-01 11:32:34 +03:00
Pavel Emelyanov
551c51b5bf messaging: Add is_same_{dc|rack} helpers
For convenience of future patching

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-01 11:32:34 +03:00
Pavel Emelyanov
c08c370c2c snitch, messaging: Dont relookup dc/rack on internal IP
When getting dc/rack snitch may perform two lookups -- first time it
does it using the provided IP, if nothing is found snitch assumes that
the IP is internal one, gets the corresponding public one and searches
again.

The thing is that the only code that may come to snitch with internal
IP is the messaging service. It does so in two places: when it tries
to connect to the given endpoing and when it accepts a connection.

In the former case messaging performs public->internal IP conversion
itself and goes to snitch with the internal IP value. This place can get
simpler by just feeding the public IP to snich, and converting it to the
internal only to initiate the connection.

In the latter case the accepted IP can be either, but messaging service
has the public<->private map onboard and can do the conversion itself.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-01 11:32:34 +03:00
Tomasz Grabiec
1d0264e1a9 Merge 'Implement Raft upgrade procedure' from Kamil Braun
Start with a cluster with Raft disabled, end up with a cluster that performs
schema operations using group 0.

Design doc:
https://docs.google.com/document/d/1PvZ4NzK3S0ohMhyVNZZ-kCxjkK5URmz1VP65rrkTOCQ/
(TODO: replace this with .md file - we can do it as a follow-up)

The procedure, on a high level, works as follows:
- join group 0
- wait until every peer joined group 0 (peers are taken from `system.peers`
  table)
- enter `synchronize` upgrade state, in which group 0 operations are disabled
- wait until all members of group 0 entered `synchronize` state or some member
  entered the final state
- synchronize schema by comparing versions and pulling if necessary
- enter the final state (`use_new_procedures`), in which group 0 is used for
  schema operations.

With the procedure comes a recovery mode in case the upgrade procedure gets
stuck (and it may if we lose a node during recovery - the procedure, to
correctly establish a single group 0 cluster, requires contacting every node).

This recovery mode can also be used to recover clusters with group 0 already
established if they permanently lose a majority of nodes - killing two birds with
one stone. Details in the last commit message.

Read the design doc, then read the commits in topological order
for best reviewing experience.

---

I did some manual tests: upgrading a cluster, using the cluster to add nodes,
remove nodes (both with `decommission` and `removenode`), replacing nodes.
Performing recovery.

As a follow-up, we'll need to implement tests using the new framework (after
it's ready). It will be easy to test upgrades and recovery even with a single
Scylla version - we start with a cluster with the RAFT flag disabled, then
rolling-restart while enabling the flag (and recovery is done through simple
CQL statements).

Closes #10835

* github.com:scylladb/scylladb:
  service/raft: raft_group0: implement upgrade procedure
  service/raft: raft_group0: extract `tracker` from `persistent_discovery::run`
  service/raft: raft_group0: introduce local loggers for group 0 and upgrade
  service/raft: raft_group0: introduce GET_GROUP0_UPGRADE_STATE verb
  service/raft: raft_group0_client: prepare for upgrade procedure
  service/raft: introduce `group0_upgrade_state`
  db: system_keyspace: introduce `load_peers`
  idl-compiler: introduce cancellable verbs
  message: messaging_service: cancellable version of `send_schema_check`
2022-08-25 11:32:06 +03:00
Benny Halevy
314e45d957 streaming: define plan_id as a strong tagged_uuid type
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-22 19:45:30 +03:00
Kamil Braun
ac5f4248a9 service/raft: raft_group0: introduce GET_GROUP0_UPGRADE_STATE verb
During the upgrade procedure nodes will want to obtain the upgrade state
of other nodes to proceed. This is what the new verb is for.
2022-08-19 19:15:19 +02:00
Kamil Braun
7e56251aea service/raft: introduce group0_upgrade_state
Define an enum class, `group0_upgrade_state`, describing the state of
the upgrade procedure (implemented in later commits).

Provide IDL definitions for (de)serialization.

The node will have its current upgrade state stored on disk in
`system.scylla_local` under the `group0_upgrade_state` key. If the key
is not present we assume `use_pre_raft_procedures` (meaning we haven't
started upgrading yet or we're at the beginning of upgrade).

Introduce `system_keyspace` accessor methods for storing and retrieving
the on-disk state.
2022-08-19 19:15:19 +02:00
Kamil Braun
9e5a81da4a message: messaging_service: cancellable version of send_schema_check
This RPC will be used during the Raft upgrade procedure during schema
synchronization step.

Make a version which can be cancelled when the upgrade procedure gets
aborted.
2022-08-19 19:15:18 +02:00
Benny Halevy
2b017ce285 schema, everywhere: define and use table_schema_version as a strong type
Define table_schema_version as a distinct tagged_uuid class,
So it can be differentiated from other uuid-class types,
in particular table_id.

Added reversed(table_schema_version) for convenience
and uniformity since the same logic is currently open coded
in several places.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:09:45 +03:00
Benny Halevy
257d74bb34 schema, everywhere: define and use table_id as a strong type
Define table_id as a distinct utils::tagged_uuid modeled after raft
tagged_id, so it can be differentiated from other uuid-class types,
in particular from table_schema_version.

Fixes #11207

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:09:41 +03:00
Benny Halevy
1fda686f96 idl: make idl headers self-sufficient
Add include statements to satisfy dependencies.

Delete, now unneeded, include directives from the upper level
source files.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:02:27 +03:00
Piotr Dulikowski
da571ed93b message: get rid of throws in send_message{,_timeout,_abortable}
Now, those function don't rethrow existing or throw new exceptions.
2022-07-04 19:27:06 +02:00
Pavel Emelyanov
85033ea6ae Merge 'A bunch of refactors related to Raft group 0' from Kamil Braun
The commits here were extracted from PR https://github.com/scylladb/scylla/pull/10835 which implements upgrade procedure for Raft group 0.

They are mostly refactors which don't affect the behavior of the system, except one: the commit 4d439a16b3 causes all schema changes to be bounced to shard 0. Previously, they would only be bounced when the local Raft feature was enabled. I do that because:
1. eventually, we want this to be the default behavior
2. in the upgrade PR I remove the `is_raft_enabled()` function - the function was basically created with the mindset "Raft is either enabled or not" - which was right when we didn't support upgrade, but will be incorrect when we introduce intermediate states (when we upgrade from non-raft-based to raft-based operations); the upgrade PR introduces another mechanism to dispatch based on the upgrade state, but for the case of bouncing to shard 0, dispatching is simply not necessary.

Closes #10864

* github.com:scylladb/scylla:
  service/raft: raft_group_registry: add assertions when fetching servers for groups
  service/raft: raft_group_registry: remove `_raft_support_listener`
  service/raft: raft_group0: log adding/removing servers to/from group 0 RPC map
  service/raft: raft_group0: move group 0 RPC handlers from `storage_service`
  service/raft: messaging: extract raft_addr/inet_addr conversion functions
  service: storage_service: initialize `raft_group0` in `main` and pass a reference to `join_cluster`
  treewide: remove unnecessary `migration_manager::is_raft_enabled()` calls
  test/boost: memtable_test: perform schema operations on shard 0
  test/boost: cdc_test: remove test_cdc_across_shards
  message: rename `send_message_abortable` to `send_message_cancellable`
  message: change parameter order in `send_message_oneway_timeout`
2022-06-29 16:51:54 +03:00
Avi Kivity
3131cbea62 Merge 'query: allow replica to provide arbitrary continue position' from Botond Dénes
Currently, we use the last row in the query result set as the position where the query is continued from on the next page. Since only live rows make it into query result set, this mandates the query to be stopped on a live row on the replica, lest any dead rows or tombstones processed after the live rows, would have to be re-processed on the next page (and the saved reader would have to be thrown away due to position mismatch). This requirement of having to stop on a live row is problematic with datasets which have lots of dead rows or tombstones, especially if these form a prefix. In the extreme case, a query can time out before it can process a single live row and the data-set becomes effectively unreadable until compaction gets rid of the tombstones.
This series prepares the way for the solution: it allows the replica to determine what position the query should continue from on the next page. This position can be that of a dead row, if the query stopped on a dead row. For now, the replica supplies the same position that would have been obtained with looking at the last row in the result set, this series merely introduces the infrastructure for transferring a position together with the query result, and it prepares the paging logic to make use of this position. If the coordinator is not prepared for the new field, it will simply fall-back to the old way of looking at the last row in the result set. As I said for now this is still the same as the content of the new field so there is no problem in mixed clusters.

Refs: https://github.com/scylladb/scylla/issues/3672
Refs: https://github.com/scylladb/scylla/issues/7689
Refs: https://github.com/scylladb/scylla/issues/7933

Tests: manual upgrade test.
I wrote a data set with:
```
./scylla-bench -mode=write -workload=sequential -replication-factor=3 -nodes 127.0.0.1,127.0.0.2,127.0.0.3 -clustering-row-count=10000 -clustering-row-size=8096 -partition-count=1000
```
This creates large, 80MB partitions, which should fill many pages if read in full. Then I started a read workload:
```
./scylla-bench -mode=read -workload=uniform -replication-factor=3 -nodes 127.0.0.1,127.0.0.2,127.0.0.3 -clustering-row-count=10000 -duration=10m -rows-per-request=9000 -page-size=100
```
I confirmed that paging is happening as expected, then upgraded the nodes one-by-one to this PR (while the read-load was ongoing). I observed no read errors or any other errors in the logs.

Closes #10829

* github.com:scylladb/scylla:
  query: have replica provide the last position
  idl/query: add last_position to query_result
  mutlishard_mutation_query: propagate compaction state to result builder
  multishard_mutation_query: defer creating result builder until needed
  querier: use full_position instead of ad-hoc struct
  querier: rely on compactor for position tracking
  mutation_compactor: add current_full_position() convenience accessor
  mutation_compactor: s/_last_clustering_pos/_last_pos/
  mutation_compactor: add state accessor to compact_mutation
  introduce full_position
  idl: move position_in_partition into own header
  service/paging: use position_in_partition instead of clustering_key for last row
  alternator/serialization: extract value object parsing logic
  service/pagers/query_pagers.cc: fix indentation
  position_in_partition: add to_string(partition_region) and parse_partition_region()
  mutation_fragment.hh: move operator<<(partition_region) to position_in_partition.hh
2022-06-27 12:23:21 +03:00
Kamil Braun
8e907cbf57 service/raft: raft_group0: move group 0 RPC handlers from storage_service
And generate the boilerplate from IDL declarations.
Simplifies the code, and the code now resides where it belongs.
2022-06-23 16:14:41 +02:00
Kamil Braun
c030d03893 message: rename send_message_abortable to send_message_cancellable
It's not possible to abort an RPC call entirely, since the remote part
continues running (if the message got out). Calling the provided abort
source does the following:
1. if the message is still in the outgoing queue, drop it,
2. resolve waiter callbacks exceptionally.

Using the word "cancellable" is more appropriate.

Also write a small comment at `send_message_cancellable`.
2022-06-23 16:14:41 +02:00
Kamil Braun
07fe3e4a99 message: change parameter order in send_message_oneway_timeout
Make it consistent with the other 'send message' functions.
Simplify code generation logic in idl-compiler.

Interestingly this function is not used anywhere so I didn't have to fix
any call sites.
2022-06-23 16:14:41 +02:00
Botond Dénes
009d2fe2f7 idl/query: add last_position to query_result
To be used to allow the replica to specify the last position in the
stream, where the query was left off. Currently this is always
the same as the implicit position -- the last row in the result-set --
but this requires only stopping the read on a live row, which is a
requirement we want to lift: we want to be able to stop on a tombstone.
As tombstones are not included in the query result, we have to allow the
replica to overwrite the last seen position explicitly.
This patch introduces the new field in the query-result IDL but it is
not written to yet, nor is it read, that is left for the next patches.
2022-06-23 13:36:24 +03:00
Botond Dénes
119be5d5db idl: move position_in_partition into own header
So it can be used without pulling in all of partition_checksum.idl.hh.
2022-06-23 13:36:24 +03:00
Piotr Dulikowski
02469e0b15 storage_proxy: add per partition rate limit info to write RPC
Adds db::per_partition_rate_limit::info parameter to the write RPC. The
rate limit info controls the behavior of the rate limiter on the
replica.
2022-06-22 20:16:48 +02:00
Piotr Dulikowski
51546b0609 storage_proxy: pass rate_limit_exception through write RPC
This commit modifies the storage_proxy logic so that the coordinator
knows whether a write operation failed due to rate limit being exceeded,
and returns `exceptions::rate_limit_exception` when that happens.
2022-06-22 20:16:48 +02:00
Avi Kivity
ee2420ff43 messaging: add boilerplate to rpc_protocol_impl.hh
License, copyright, #pragma once.

The copyright is set to 2021 since that was when the file was created.

Closes #10778
2022-06-13 07:29:32 +02:00
Avi Kivity
afc06f0017 messaging: forward-declare types in messaging_service.hh
messaging_service.hh is a switchboard - it includes many things,
and many things include it. Therefore, changes in the things it
includes affect many translation units.

Reduce the dependencies by forward-declaring as much as possible.
This isn't pretty, but it reduces compile time and recompilations.

Other headers adjusted as needed so everything (including
`ninja dev-headers`) still compile.

Closes #10755
2022-06-09 15:52:12 +03:00
Michael Livshin
029508b77c flat_mutation_reader ist tot
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-05-31 23:42:34 +03:00
Avi Kivity
c83393e819 messaging: do isolate default tenants
In 10dd08c9 ("messaging_service: supply and interpret rpc isolation_cookies",
4.2), we added a mechanism to perform rpc calls in remote scheduling groups
based on the connection identity (rather than the verb), so that
connection processing itself can run in the correct group (not just
verb processing), and so that one verb can run in different groups according
to need.

In 16d8cdadc ("messaging_service: introduce the tenant concept", 4.2), we
changed the way isolation cookies are sent:

 scheduling_group
 messaging_service::scheduling_group_for_verb(messaging_verb verb) const {
     return _scheduling_info_for_connection_index[get_rpc_client_idx(verb)].sched_group;
@@ -665,11 +694,14 @@ shared_ptr<messaging_service::rpc_protocol_client_wrapper> messaging_service::ge
     if (must_compress) {
         opts.compressor_factory = &compressor_factory;
     }
     opts.tcp_nodelay = must_tcp_nodelay;
     opts.reuseaddr = true;
-    opts.isolation_cookie = _scheduling_info_for_connection_index[idx].isolation_cookie;
+    // We send cookies only for non-default statement tenant clients.
+    if (idx > 3) {
+        opts.isolation_cookie = _scheduling_info_for_connection_index[idx].isolation_cookie;
+    }

This effectively disables the mechanism for the default tenant. As a
result some verbs will be executed in whatever group the messaging
service listener was started in. This used to be the main group,
but in 554ab03 ("main: Run init_server and join_cluster inside
maintenance scheduling group", 4.5), this was change to the maintenance
group. As a result normal read/writes now compete with maintenance
operations, raising their latency significantly.

Fix by sending the isolation cookie for all connections. With this,
a 2-node cassandra-stress load has 99th percentile increase by just
3ms during repair, compared to 10ms+ before.

Fixes #9505.

Closes #10673
2022-05-27 16:36:57 +02:00
Benny Halevy
1308b45c58 messaging_service: do_make_sink_source: handle failed source future
I've stumbled upon this with version a2901a376d in debug mode
when testing repair_additional_test.py::TestRepairAdditional::test_repair_kill_3:

WARN  2022-05-17 07:26:12,581 [shard 0] seastar - Exceptional future ignored: seastar::rpc::closed_error (connection is closed), backtrace: 0x137c33d0 0x1ad14d0d 0x1ad149cd 0x1ad16fc3 0x1ad17e52 0x19d8a809 0x19d8ab6a 0x139165a9 0x17be0d21 0x17bdcfb0 0x17bf3611 0x17bf39f0 0x17bf3c62 0x17bf3958 0x17bf57d8 0x17bf5468 0x19efe44e 0x19f04ac6 0x19f09732 0x19f072a1 0x19cca281 0x19cc7de5 0x13859cbf 0x13d309d6 0x13d3090b 0x13d30775 0x1391364d 0x13858521 /lib64/libc.so.6+0x27b74 0x137774ad

Decode:
```
seastar::report_failed_future(seastar::future_state_base::any&&) at //./seastar/src/core/future.cc:218
seastar::future_state_base::any::check_failure() at //./seastar/include/seastar/core/future.hh:573
seastar::future_state<seastar::rpc::source<repair_row_on_wire_with_cmd> >::clear() at ././seastar/include/seastar/core/future.hh:615
~future_state at ././seastar/include/seastar/core/future.hh:620
 (inlined by) ~future at ././seastar/include/seastar/core/future.hh:1343
~ at ./message/messaging_service.cc:841
```

Looks like if sink.close() fails after source.failed() then source gets abandoned.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-19 11:47:38 +03:00
Kamil Braun
9551256e81 messaging_service: abortable version of send_gossip_echo
Use the new `send_message_abortable` function to implement an abortable
version of `send_gossip_echo`.

These echo messages will be used for direct failure detection.
2022-05-09 13:14:41 +02:00
Kamil Braun
f2548fc3fa message: abortable version of send_message
I want to be able to timeout `send_message`, but not through the
existing `send_message_timeout` API which forces me to use a particular
clock/duration/timepoint type. Introduce a more general
`send_message_abortable` API which gets an `abort_source&`, subscribes
to it, and uses the `rpc::cancellable` interface to cancel the RPC on
abort.

The function is 90% copy-pasta from `send_message{_timeout}`, only the
abort part is new.
2022-05-09 13:14:41 +02:00
Pavel Solodovnikov
95c8d65949 treewide: fix compilation issues with fmtlib 8.1.0+
Due to fd62fba985
scoped enums are not automatically converted to integers anymore,
this is the intended behavior, according to the fmtlib devs.

A bit nicer solution would be to use `std::to_underlying`
instead of a direct `static_cast`, but it's not available until
C++23 and some compilers are still missing the support for it.

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-03-16 12:31:50 +03:00
Mikołaj Sielużycki
1d84a254c0 flat_mutation_reader: Split readers by file and remove unnecessary includes.
The flat_mutation_reader files were conflated and contained multiple
readers, which were not strictly necessary. Splitting optimizes both
iterative compilation times, as touching rarely used readers doesn't
recompile large chunks of codebase. Total compilation times are also
improved, as the size of flat_mutation_reader.hh and
flat_mutation_reader_v2.hh have been reduced and those files are
included by many file in the codebase.

With changes

real	29m14.051s
user	168m39.071s
sys	5m13.443s

Without changes

real	30m36.203s
user	175m43.354s
sys	5m26.376s

Closes #10194
2022-03-14 13:20:25 +02:00
Michał Sala
fff454761a messaging_service: add verb for count(*) request forwarding
Except for the verb addition, this commit also defines forward_request
and forward_result structures, used as an argument and result of the new
rpc. forward_request is used to forward information about select
statement that does count(*) (or other aggregating functions such as
max, min, avg in the future). Due to the inability to serialize
cql3::statements::select_statement, I chose to include
query::read_command, dht::partition_range_vector and some configuration
options in forward_request. They can be serialized and are sufficient
enough to allow creation of service::pager::query_pagers::pager.
2022-02-01 21:14:41 +01:00
Kamil Braun
cc0c54ea15 service: migration_manager: allow using MIGRATION_REQUEST verb to fetch group 0 history table
The MIGRATION_REQUEST verb is currently used to pull the contents of
schema tables (in the form of mutations) when nodes synchronize schemas.
We will (ab)use the verb to fetch additional data, such as the contents
of the group 0 history table, for purposes of group 0 snapshot transfer.

We extend `schema_pull_options` with a flag specifying that the puller
requests the additional data associated with group 0 snapshots. This
flag is `false` by default, so existing schema pulls will do what they
did before. If the flag is `true`, the migration request handler will
include the contents of group 0 history table.

Note that if a request is set with the flag set to `true`, that means
the entire cluster must have enabled the Raft feature, which also means
that the handler knows of the flag.
2022-01-24 15:20:37 +01:00