Commit Graph

316 Commits

Author SHA1 Message Date
Michał Jadwiszczak
749399e4b8 message/messaging_service: guard adding maintenance tenant under cluster feature
Set `enabled` flag for `$maintenance` tenant to false and
enable it when `MAINTENANCE_TENANT` feature is enabled.

(cherry-picked from b4b91ca364)
2024-09-18 19:10:24 +02:00
Michał Jadwiszczak
bdd97b2950 message/messaging_service: add feature_service dependency
(cherry-picked from 71a03ef6b0)
2024-09-18 19:09:46 +02:00
Michał Jadwiszczak
1a056f0cab message/messaging_service: add enabled flag to statement tenants
Adding a new tenant needs to be done under cluster feature protection.
However it wasn't the case for adding `$maintenance` statement tenant
and to fix it we need to support an upgrade from node which doesn't
know about maintenance tenant at all and from one which uses it without
any cluster feature protection.

This commit adds `enabled` flag to statement tenants.
This way, when the tenant is disabled, it cannot be used to create
a connection, but it can be used to accept an incoming connection.

(cherry-picked from d44844241d)
2024-09-18 19:09:06 +02:00
Aleksandra Martyniuk
6029936665 tasks: implement task_manager::virtual_task::impl::get_children
Return a vector of task_identity of all children of a virtual task
in a cluster.
2024-07-23 13:35:01 +02:00
Avi Kivity
3fc4e23a36 forward_service: rename to mapreduce_service
forward_service is nondescriptive and misnamed, as it does more than
forward requests. It's a classic map/reduce algorithm (and in fact one
of its parameters is "reducer"), so name it accordingly.

The name "forward" leaked into the wire protocol for the messaging
service RPC isolation cookie, so it's kept there. It's also maintained
in the name of the logger (for "nodetool setlogginglevel") for
compatibility with tests.

Closes scylladb/scylladb#19444
2024-07-03 19:29:47 +03:00
Gleb Natapov
09556bff0e gossiper: move gossip verbs to the idl 2024-06-17 12:47:17 +03:00
Gleb Natapov
6e6aefc9ab raft topology: drop RAFT_PULL_TOPOLOGY_SNAPSHOT RPC
We have new, more generic, RPC to pull group0 mutations now: RAFT_PULL_SNAPSHOT.
Use it instead of more specific RAFT_PULL_TOPOLOGY_SNAPSHOT one.
2024-03-27 19:18:45 +02:00
Marcin Maliszkiewicz
5a6d4dbc37 storage_service: add support for auth-v2 raft snapshots
This patch adds new RPC for pulling snapshot of auth tables.
2024-03-01 16:25:14 +01:00
Avi Kivity
605bf6e221 range.hh: retire
range.hh was deprecated in bd794629f9 (2020) since its names
conflict with the C++ library concept of an iterator range. The name
::range also mapped to the dangerous wrapping_interval rather than
nonwrapping_interval.

Complete the deprecation by removing range.hh and replacing all the
aliases by the names they point to from the interval library. Note
this now exposes uses of wrapping intervals as they are now explicit.

The unit tests are renamed and range.hh is deleted.

Closes scylladb/scylladb#17428
2024-02-21 00:24:25 +02:00
Asias He
a0e46a6b47 repair: Fix rpc::source and rpc::optional parameter order in rpc message
In a mixed cluster (5.4.1-20231231.3d22f42cf9c3 and
5.5.0~dev-20240119.b1ba904c4977), in the rolling upgrade test, we saw
repair never finishing.

The following was observed:

rpc - client 127.0.0.2:65273 msg_id 5524:  caught exception while
processing a message: std::out_of_range (deserialization buffer
underflow)

It turns out the repair rpc message was not compatible between the two
versions. Even with a rpc stream verb, the new rpc parameters must come
after the rpc::source<> parameter. The rpc::source<> parameter is not
special in the sense that it must be the last parameter.

For example, it should be:

void register_repair_get_row_diff_with_rpc_stream(
std::function<future<rpc::sink<repair_row_on_wire_with_cmd>> (
const rpc::client_info& cinfo, uint32_t repair_meta_id,
rpc::source<repair_hash_with_cmd> source, rpc::optional<shard_id> dst_cpu_id_opt)>&& func);

not:

void register_repair_get_row_diff_with_rpc_stream(
std::function<future<rpc::sink<repair_row_on_wire_with_cmd>> (
const rpc::client_info& cinfo, uint32_t repair_meta_id,
rpc::optional<shard_id> dst_cpu_id_opt, rpc::source<repair_hash_with_cmd> source)>&& func);

Fixes #16941

Closes scylladb/scylladb#17156
2024-02-12 09:50:30 +02:00
Piotr Dulikowski
7601f40bf8 storage_service: introduce join_node_query verb
When a node joins an existing cluster, it will ask a node that already
belongs to the cluster about which topology operations to use when
joining.
2024-02-07 10:02:00 +01:00
Asias He
e7e1f4b01a streaming: Fix rpc::source and rpc::optional parameter order
The new rpc::optional parameter must come after any existing parameters,
including the rpc::source parameters, otherwise it will break
compatibility.

The regression was introduced in:

```
commit fd3c089ccc
Author: Tomasz Grabiec <tgrabiec@scylladb.com>
Date:   Thu Oct 26 00:35:19 2023 +0200

    service: range_streamer: Propagate topology_guard to receivers
```

We need to backport this patch ASAP before we release anything that
contains commit fd3c089ccc.

Refs: #16941
Fixes: #17175

Closes scylladb/scylladb#17176
2024-02-06 13:15:28 +01:00
Avi Kivity
c8397f0287 Merge 'Implement tablet splitting' from Raphael "Raph" Carvalho
The motivation for tablet resizing is that we want to keep the average tablet size reasonable, such that load rebalancing can remain efficient. Too large tablet makes migration inefficient, therefore slowing down the balancer.

If the avg size grows beyond the upper bound (split threshold), then balancer decides to split. Split spans all tablets of a table, due to power-of-two constraint.

Likewise, if the avg size decreases below the lower bound (merge threshold), then merge takes place in order to grow the avg size. Merge is not implemented yet, although this series lays foundation for it to be impĺemented later on.

A resize decision can be revoked if the avg size changes and the decision is no longer needed. For example, let's say table is being split and avg size drops below the target size (which is 50% of split threshold and 100% of merge one). That means after split, the avg size would drop below the merge threshold, causing a merge after split, which is wasteful, so it's better to just cancel the split.

Tablet metadata gains 2 new fields for managing this:
resize_type: resize decision type, can be either of "merge", "split", or "none".
resize_seq_number: a sequence number that works as the global identifier of the decision (monotonically increasing, increased by 1 on every new decision emitted by the coordinator).

A new RPC was implemented to pull stats from each table replica, such that load balancer can calculate the avg tablet size and know the "split status", for a given table. Avg size is aggregated carefully while taking RF of each DC into account (which might differ).
When a table is done splitting its storage, it loads (mirror) the resize_seq_number from tablet metadata into its local state (in another words, my split status is ready). If a table is split ready, coordinator will see that table's seq number is the same as the one in tablet metadata. Helps to distinguish stale decisions from the latest one (in case decisions are revoked and re-emited later on). Also, it's aggregated carefully, by taking the minimum among all replicas, so coordinator will only update topology when all replicas are ready.

When load balancer emits split decision, replicas will listen to need to split with a "split monitor" that is awakened once a table has replication metadata updated and detects the need for split (i.e. resize_type field is "split").
The split monitor will start splitting of compaction groups (using mechanism introduced here: 081f30d149) for the table. And once splitting work is completed, the table updates its local state as having completed split.

When coordinator pulls the split status of all replicas for a table via RPC, the balancer can see whether that table is ready for "finalizing" the decision, which is about updating tablet metadata to split each tablet into two. Once table replicas have their replication metadata updated with the new tablet count, they can update appropriately their set of compaction groups (that were previously split in the preparation step).

Fixes #16536.

Closes scylladb/scylladb#16580

* github.com:scylladb/scylladb:
  test/topology_experimental_raft: Add tablet split test
  replica: Bypass reshape on boot with tablets temporarily
  replica: Fix table::compaction_group_for_sstable() for tablet streaming
  test/topology_experimental_raft: Disable load balancer in test fencing
  replica: Remap compaction groups when tablet split is finalized
  service: Split tablet map when split request is finalized
  replica: Update table split status if completed split compaction work
  storage_service: Implement split monitor
  topology_cordinator: Generate updates for resize decisions made by balancer
  load_balancer: Introduce metrics for resize decisions
  db: Make target tablet size a live-updateable config option
  load_balancer: Implement resize decisions
  service: Wire table_resize_plan into migration_plan
  service: Introduce table_resize_plan
  tablet_mutation_builder: Add set_resize_decision()
  topology_coordinator: Wire load stats into load balancer
  storage_service: Allow tablet split and migration to happen concurrently
  topology_coordinator: Periodically retrieve table_load_stats
  locator: Introduce topology::get_datacenter_nodes()
  storage_service: Implement table_load_stats RPC
  replica: Expose table_load_stats in table
  replica: Introduce storage_group::live_disk_space_used()
  locator: Introduce table_load_stats
  tablets: Add resize decision metadata to tablet metadata
  locator: Introduce resize_decision
2024-01-31 13:59:56 +02:00
Raphael S. Carvalho
9519a0c9e4 storage_service: Implement table_load_stats RPC
This implements the RPC for collecting table stats.

Since both leaving and pending replica can be accounted during
tablet migration, the RPC handler will look at tablet transition
info and account only either leaving or replica based on the
tablet migration stage. Replicas that are not leaving or
pending, of course, don't contribute to the anomaly in the
reported size.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-01-25 18:36:08 -03:00
Mikołaj Grzebieluch
d8de209dcf message_service: add sanity check that rpc connections are not created in the maintenance mode
In maintenance mode, a node shouldn't be able to communicate with other nodes.

To make sure this does not happen, the sanity check is added.
2024-01-25 15:27:53 +01:00
Kefu Chai
a1dcddd300 utils: do not include unused headers
these unused includes were identified by clangd. see
https://clangd.llvm.org/guides/include-cleaner#unused-include-warning
for more details on the "Unused include" warning.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16833
2024-01-18 12:50:06 +02:00
Asias He
fd774862be repair: Make row_level repair work with tablet
Since a given tablet belongs to a single shard on both repair master and repair
followers, row level repair code needs to be changed to work on a single
shard for a given tablet. In order to tell the repair followers which
shard to work on, a dst_cpu_id value is passed over rpc from the repair
master.
2024-01-18 08:49:06 +08:00
Avi Kivity
9c0f05efa1 Merge 'Track tablet streaming under global sessions to prevent side-effects of failed streaming' from Tomasz Grabiec
Tablet streaming involves asynchronous RPCs to other replicas which transfer writes. We want side-effects from streaming only within the migration stage in which the streaming was started. This is currently not guaranteed on failure. When streaming master fails (e.g. due to RPC failing), it can be that some streaming work is still alive somewhere (e.g. RPC on wire) and will have side-effects at some point later.

This PR implements tracking of all operations involved in streaming which may have side-effects, which allows the topology change coordinator to fence them and wait for them to complete if they were already admitted.

The tracking and fencing is implemented by using global "sessions", created for streaming of a single tablet. Session is globally identified by UUID. The identifier is assigned by the topology change coordinator, and stored in system.tablets. Sessions are created and closed based on group0 state (tablet metadata) by the barrier command sent to each replica, which we already do on transitions between stages. Also, each barrier waits for sessions which have been closed to be drained.

The barrier is blocked only if there is some session with work which was left behind by unsuccessful streaming. In which case it should not be blocked for long, because streaming process checks often if the guard was left behind and stops if it was.

This mechanism of tracking is fault-tolerant: session id is stored in group0, so coordinator can make progress on failover. The barriers guarantee that session exists on all replicas, and that it will be closed on all replicas.

Closes scylladb/scylladb#15847

* github.com:scylladb/scylladb:
  test: tablets: Add test for failed streaming being fenced away
  error_injection: Introduce poll_for_message()
  error_injection: Make is_enabled() public
  api: Add API to kill connection to a particular host
  range_streamer: Do not block topology change barriers around streaming
  range_streamer, tablets: Do not keep token metadata around streaming
  tablets: Fail gracefully when migrating tablet has no pending replica
  storage_service, api: Add API to disable tablet balancing
  storage_service, api: Add API to migrate a tablet
  storage_service, raft topology: Run streaming under session topology guard
  storage_service, tablets: Use session to guard tablet streaming
  tablets: Add per-tablet session id field to tablet metadata
  service: range_streamer: Propagate topology_guard to receivers
  streaming: Always close the rpc::sink
  storage_service: Introduce concept of a topology_guard
  storage_service: Introduce session concept
  tablets: Fix topology_metadata_guard holding on to the old erm
  docs: Document the topology_guard mechanism
2023-12-07 16:29:02 +02:00
Asias He
6beadab9e6 messaging_service: Introduce STREAM_BLOB and TABLET_STREAM_FILES verb
They will be used to implement file stream for tablet in the future. Reserve
the verb ID.
2023-12-07 14:54:12 +08:00
Tomasz Grabiec
fd3c089ccc service: range_streamer: Propagate topology_guard to receivers 2023-12-06 18:36:16 +01:00
Benny Halevy
984a576405 messaging_service: accept broadcast_addr in config rather than via fb_utilities
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-12-05 09:46:25 +02:00
Benny Halevy
586f35bb55 messaging_service: move listen_address and port getters inline
And make them const noexcept.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-12-05 09:44:41 +02:00
Pavel Emelyanov
492b842929 messaging_service: Define metrics domain for client connections
Recent seastar update included RPC metrics (scylladb/seastar#1753). The
reported metrics groups together sockets based on their "metrics_domain"
configuration option. This patch makes use of this domain to make scylla
metrics sane.

The domain as this patch defines it includes two strings:

First, the datacenter the server lives in. This is because grouping
metrics for connections to different datacenters makes little sense for
several reasons. For example -- packet delays _will_ differ for local-DC
vs cross-DC traffic and mixing those latencies together is pointless.
Another example -- the amount of traffic may also differ for local- vs
cross-DC connections e.g. because of different usage of enryption and/or
compression.

Second, each verb-idx gets its own domain. That's to be able to analyze
e.g. query-related traffic from gossiper one. For that the existing
isolation cookie is taken as is.

Note, that the metrics is _not_ per-server node. So e.g. two gossiper
connections to two different nodes (in one DC) will belong to the same
domain and thus their stats will be summed when reported.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#15785
2023-11-13 11:13:20 +01:00
Asias He
2b2302d373 streaming: Ignore dropped table on both sides
It is possible the sender and receiver of streaming nodes have different
views on if a table is dropped or not.

For example:
- n1, n2 and n3 in the cluster

- n4 started to join the cluster and stream data from n1, n2, n3

- a table was dropped

- n4 failed to write data from n2 to sstable because a table was dropped

- n4 ended the streaming

- n2 checked if the table was present and would ignore the error if the table was dropped

- however n2 found the table was still present and was not dropped

- n2 marked the streaming as failed

This will fail the streaming when a table is dropped. We want streaming to
ignore such dropped tables.

In this patch, a status code is sent back to the sender to notify the
table is dropped so the sender could ignore the dropped table.

Fixes #15370

Closes scylladb/scylladb#15912
2023-11-03 13:38:48 +02:00
Piotr Dulikowski
7cbe5e3af8 rpc: add new join handshake verbs
The `join_node_request` and `join_node_response` RPCs are added:

- `join_node_request` is sent from the joining node to any node in the
  cluster. It contains some initial parameters that will be verified by
  the receiving node, or the topology coordinator - notably, it contains
  a list of cluster features supported by the joining node.
- `join_node_response` is sent from the topology coordinator to the
  joining node to tell it about the the outcome of the verification.
2023-09-26 15:56:52 +02:00
Tomasz Grabiec
d5539e080d tablets: Implement cleanup step
This change adds a stub for tablet cleanup on the replica side and wires
it into the tablet migration process.

The handling on replica side is incomplete because it doesn't remove
the actual data yet. It only flushes the memtables, so that all data
is in sstables and none requires a memtable flush.

This patch is necessary to make decommission work. Otherwise, a
memtable flush would happen when the decommissioned node is put in the
drained state (as in nodetool drain) and it would fail on missing host
id mapping (node is no longer in topology), which is examined by the
tablet sharder when producing sstable sharding metadata. Leading to
abort due to failed memtable flush.
2023-09-14 12:45:10 +02:00
Benny Halevy
357d57c82d raft: group0_state_machine: transfer_snapshot: make abortable
Use an abort_source in group0_state_machine
to abort an ongoing transfer_snapshot operation
on group0_state_machine::abort()

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-03 16:32:08 +03:00
Botond Dénes
fdaf908967 repair/row_level: opt in to compacting the stream
Using a centrally generated compaction-time, generated on the repair
master and propagated to all repair followers. For repair it is
imperative that all participants use the exact same compaction time,
otherwise there can be artificial differences between participants,
generating unnecessary repair activity.
If a repair follower doesn't get a compaction-time from the repair
master, it uses a locally generated one. This is no worse than the
previous state of each node being on some undefined state of compaction.
2023-07-27 04:57:50 -04:00
Avi Kivity
615544a09a Merge 'Init messaging service preferred IP cache via config' from Pavel Emelyanov
This is to make m.s. initialization more solid and simplify sys.ks.::setup()

Closes #14832

* github.com:scylladb/scylladb:
  system_keyspace: Remove unused snitch arg from setup()
  messaging_service: Setup preferred IPs from config
2023-07-26 22:12:28 +03:00
Pavel Emelyanov
0fba57a3e8 messaging_service: Setup preferred IPs from config
Population of messageing service preferred IPs cache happens inside
system keyspace setup() call and it needs m.s. per ce and additionally
snitch. Moving preferred ip cache to initial configuration keeps m.s.
start more self-contained and keeps system_keyspace::setup() simpler.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-07-26 16:03:23 +03:00
Tomasz Grabiec
6d545b2f9e storage_service: Implement stream_tablet RPC
Performs streaming of data for a single tablet between two tablet
replicas. The node which gets the RPC is the receiving replica.
2023-07-25 21:08:51 +02:00
Calle Wilund
e1a52af69e messaging_service: Do TLS init early
Fixes #14299

failure_detector can try sending messages to TLS endpoints before start_listen
has been called (why?). Need TLS initialized before this. So do on service creation.

Closes #14493
2023-07-11 18:19:01 +03:00
Kamil Braun
8cf47d76a4 messaging_service: implement host banning
Calling `ban_host` causes the following:
- all connections from that host are dropped,
- any further attempts to connect will be rejected (the connection will
  be immediately dropped) when receiving the `CLIENT_ID` verb.
2023-06-20 13:03:46 +02:00
Kamil Braun
95c726a8df messaging_service: exchange host IDs and map them to connections
When a node first establishes a connection to another node, it always
sending a `CLIENT_ID` one-way RPC first. The message contains some
metadata such as `broadcast_address`.

Include the `host_id` of the sender in that RPC. On the receiving side,
store a mapping from that `host_id` to the connection that was just
opened.

This mapping will be used later when we ban nodes that we remove from
the cluster.
2023-06-20 13:03:46 +02:00
Kamil Braun
87f65d01b8 messaging_service: store the node's host ID 2023-06-20 13:03:46 +02:00
Kamil Braun
a78cc17bd4 messaging_service: don't use parameter defaults in constructor 2023-06-20 13:03:46 +02:00
Pavel Emelyanov
7e8b9aecab messaging_service: Shutdown rpc server on shutdown
The RPC server now has a lighter .shutdown() method that just does what
m.s. shutdown() needs, so call it. On stop call regular stop to finalize
the stopping process

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-06-01 21:24:13 +03:00
Pavel Emelyanov
5861d15912 Merge 'Small gossiper and migration_manager cleanups' from Gleb
Some assorted cleanups here: consolidation of schema agreement waiting
into a single place and removing unused code from the gossiper.

CI: https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/1458/

Reviewed-by: Konstantin Osipov <kostja@scylladb.com>

* gleb/gossiper-cleanups of github.com:scylladb/scylla-dev:
  storage_service: avoid unneeded copies in on_change
  storage_service: remove check that is always true
  storage_service: rename handle_state_removing to handle_state_removed
  storage_service: avoid string copy
  storage_service: delete code that handled REMOVING_TOKENS state
  gossiper: remove code related to advertising REMOVING_TOKEN state
  migration_manager: add wait_for_schema_agreement() function
2023-05-27 10:49:54 +03:00
Gleb Natapov
05aa07835d storage_service: delete code that handled REMOVING_TOKENS state
The state is never advertised so the code is never used.
2023-05-25 14:48:09 +03:00
Pavel Emelyanov
222f21d180 messaging_service: Remove unused headers from m.s..hh
The tracing.hh is quite large to care
Another one is "while at it"

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #14024
2023-05-25 08:38:49 +03:00
Botond Dénes
8663b27f25 message: generalize per-tenant connection types
We have a set amount of connection types for each tenant. The amount of
these connection types can change. Although currently these are
hardcoded in a single place, soon (in the next patch) there will be yet
another place where these will be used. To avoid duplicating these
names, making future changes error prone, centralize them in a const
array, generalizing the concept of a tenant connection type.
2023-05-10 04:28:57 -04:00
Gleb Natapov
d69a887366 storage_service: raft topology: introduce snapshot transfer code for the topology table 2023-03-23 16:29:56 +02:00
Gleb Natapov
6a4d773b7e raft topology: add RAFT_TOPOLOGY_CMD verb that will be used by topology coordinator to communicated with nodes
Empty for now. Will be used later by the topology coordinator to
communicate with other nodes to instruct them to start streaming,
or start to fence read/writes.
2023-03-23 16:29:56 +02:00
Avi Kivity
69a385fd9d Introduce schema/ module
Schema related files are moved there. This excludes schema files that
also interact with mutations, because the mutation module depends on
the schema. Those files will have to go into a separate module.

Closes #12858
2023-02-15 11:01:50 +02:00
Kamil Braun
cbdcc944b5 service/raft: specialized verb for failure detector pinger
We used GOSSIP_ECHO verb to perform failure detection. Now we use
a special verb DIRECT_FD_PING introduced for this purpose.

There are multiple reasons to do so.

One minor reason: we want to use the same connection as other Raft
verbs: if we can't deliver Raft append_entries or vote messages
somewhere, that endpoint should be marked dead; if we can, the
endpoint should be marked alive. So putting pings on the same
connection as the other Raft verbs is important when dealing with
weird situations where some connections are available but others are
not. Observe that in `do_get_rpc_client_idx`, we put the new verb in
the right place.

Another minor reason: we remove the awkward gossiper `echo_pinger`
abstraction which required storing and updating gossiper generation
numbers. This also removes one dependency from Raft service code to
gossiper.

Major reason 1: the gossip echo handler has a weird mechanism where a
replacing node returns errors during the replace operation to some of
the nodes. In Raft however, we want to mark servers as alive when they
are alive, including a server running on a node that's replacing
another node.

Major reason 2, related to the previous one: when server B is
replacing server A with the same IP, the failure detector will try to
ping both servers. Both servers are mapped to the same IP by the
address map, so pings to both servers will reach server B. We want
server B to respond to the pings destined for server B, but not to
pings destined for server A, so the sender can mark B alive but keep A
marked dead.

To do this, we include the destination's Raft ID in our RPCs. The
destination compares the received ID with its own. If it's different,
it returns a `wrong_destination` response, and the failure detector
knows that the ping did not reach the destination (it reached someone
else).

Yet another reason: removes "Not ready to respond gossip echo
message" log spam during replace.
2022-12-01 20:54:18 +01:00
Kamil Braun
a83789160d message: messaging_service: check for known topology before calling is_same_dc/rack
`is_same_dc` and `is_same_rack` assume that the peer's topology is
known. If it's unknown, `on_internal_error` will be called inside
topology.

When these functions are used in `get_rpc_client`, they are already
protected by an earlier check for knowing the peer's topology
(the `has_topology()` lambda).

Another use is in `do_start_listen()`, where we create a filter for RPC
module to check if it should accept incoming connections. If cross-dc or
cross-rack encryption is enabled, we will reject connections attempts to
the regular (non-ssl) port from other dcs/rack using `is_same_dc/rack`.
However, it might happen that something (other Scylla node or otherwise)
tries to contact us on the regular port and we don't know that thing's
topology, which would result in `on_internal_error`. But this is not a
fatal error; we simply want to reject that connection. So protect these
calls as well.

Finally, there's `get_preferred_ip` with an unprotected `is_same_dc`
call which, for a given peer, may return a different IP from preferred IP
cache if the endpoint resides in the same DC. If there is not entry in
the preferred IP cache, we return the original (external) IP of the
peer. We can do the same if we don't know the peer's topology. It's
interesting that we didn't see this particular place blowing up. Perhaps
the preferred IP cache is always populated after we know the topology.
2022-11-16 14:01:50 +01:00
Botond Dénes
13ace7a05e Merge "Fix RPC sockets configuration wrt topology" from Pavel Emelyanov
"
Messaging service checks dc/rack of the target node when creating a
socket. However, this information is not available for all verbs, in
particular gossiper uses RPC to get topology from other nodes.

This generates a chicken-and-egg problem -- to create a socket messaging
service needs topology information, but in order to get one gossiper
needs to create a socket.

Other than gossiper, raft starts sending its APPEND_ENTRY messages early
enough so that topology info is not avaiable either.

The situation is extra-complicated with the fact that sockets are not
created for individual verbs. Instead, verbs are groupped into several
"indices" and socket is created for it. Thus, the "gossiping" index that
includes non-gossiper verbs will create topology-less socket for all
verbs in it. Worse -- raft sends messages w/o solicited topology, the
corresponding socket is created with the assumption that the peer lives
in default dc and rack which doesn't matchthe local nodes' dc/rack and
the whole index group gets the "randomly" configured socket.

Also, the tcp-nodelay tries to implement similar check, but uses wrong
index of 1, so it's also fixed here.
"

* 'br-messaging-topology-ignoring-clients' of https://github.com/xemul/scylla:
  messaging_service: Fix gossiper verb group
  messaging_service: Mind the absence of topology data when creating sockets
  messaging_service: Templatize and rename remove_rpc_client_one
2022-09-16 13:27:56 +03:00
Pavel Emelyanov
82162be1f1 messaging_service: Remove init/uninit helpers
These two are just getting in the way when touching inter-components
dependencies around messaging service. Without it m.-s. start/stop
just looks like any other service out there

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #11535
2022-09-15 11:54:46 +03:00
Pavel Emelyanov
7bdad47de2 messaging_service: Mind the absence of topology data when creating sockets
When a socket is created to serve a verb there may be no topology
information regarding the target node. In this case current code
configures socket as if the peer node lived in "default" dc and rack of
the same name. If topology information appears later, the client is not
re-connected, even though it could providing more relevant configuration
(e.g. -- w/o encryption)

This patch checks if the topology info is needed (sometimes it's not)
and if missing it configures the socket in the most restrictive manner,
but notes that the socket ignored the topology on creation. When
topology info appears -- and this happens when a node joins the cluster
-- the messaging service is kicked to drop all sockets that ignored the
topology, so thay they reconnect later.

The mentioned "kick" comes from storage service on-join notification.
More correct fix would be if topology had on-change notification and
messaging service subscribed on it, but there are two cons:

- currently dc/rack do not change on the fly (though they can, e.g. if
  gossiping property file snitch is updated without restart) and
  topology update effectively comes from a single place

- updating topology on token-metadata is not like topology.update()
  call. Instead, a clone of token metadata is created, then update
  happens on the clone, then the clone is committed into t.m. Though
  it's possible to find out commit-time which nodes changed their
  topology, but since it only happens on join this complexity likely
  doesn't worth the effort (yet)

fixes: #11514
fixes: #11492
fixes: #11483

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-14 20:30:51 +03:00
Pavel Emelyanov
5ffc9d66ec messaging_service: Templatize and rename remove_rpc_client_one
It actually finds and removes a client and in its new form it also
applies filtering function it, so some better name is called for

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-14 20:30:07 +03:00