scylladb

Author	SHA1	Message	Date
Michał Jadwiszczak	749399e4b8	message/messaging_service: guard adding maintenance tenant under cluster feature Set `enabled` flag for `$maintenance` tenant to false and enable it when `MAINTENANCE_TENANT` feature is enabled. (cherry-picked from `b4b91ca364`)	2024-09-18 19:10:24 +02:00
Michał Jadwiszczak	bdd97b2950	message/messaging_service: add feature_service dependency (cherry-picked from `71a03ef6b0`)	2024-09-18 19:09:46 +02:00
Michał Jadwiszczak	1a056f0cab	message/messaging_service: add `enabled` flag to statement tenants Adding a new tenant needs to be done under cluster feature protection. However it wasn't the case for adding `$maintenance` statement tenant and to fix it we need to support an upgrade from node which doesn't know about maintenance tenant at all and from one which uses it without any cluster feature protection. This commit adds `enabled` flag to statement tenants. This way, when the tenant is disabled, it cannot be used to create a connection, but it can be used to accept an incoming connection. (cherry-picked from `d44844241d`)	2024-09-18 19:09:06 +02:00
Aleksandra Martyniuk	6029936665	tasks: implement task_manager::virtual_task::impl::get_children Return a vector of task_identity of all children of a virtual task in a cluster.	2024-07-23 13:35:01 +02:00
Avi Kivity	3fc4e23a36	forward_service: rename to mapreduce_service forward_service is nondescriptive and misnamed, as it does more than forward requests. It's a classic map/reduce algorithm (and in fact one of its parameters is "reducer"), so name it accordingly. The name "forward" leaked into the wire protocol for the messaging service RPC isolation cookie, so it's kept there. It's also maintained in the name of the logger (for "nodetool setlogginglevel") for compatibility with tests. Closes scylladb/scylladb#19444	2024-07-03 19:29:47 +03:00
Gleb Natapov	09556bff0e	gossiper: move gossip verbs to the idl	2024-06-17 12:47:17 +03:00
Gleb Natapov	6e6aefc9ab	raft topology: drop RAFT_PULL_TOPOLOGY_SNAPSHOT RPC We have new, more generic, RPC to pull group0 mutations now: RAFT_PULL_SNAPSHOT. Use it instead of more specific RAFT_PULL_TOPOLOGY_SNAPSHOT one.	2024-03-27 19:18:45 +02:00
Marcin Maliszkiewicz	5a6d4dbc37	storage_service: add support for auth-v2 raft snapshots This patch adds new RPC for pulling snapshot of auth tables.	2024-03-01 16:25:14 +01:00
Avi Kivity	605bf6e221	range.hh: retire range.hh was deprecated in `bd794629f9` (2020) since its names conflict with the C++ library concept of an iterator range. The name ::range also mapped to the dangerous wrapping_interval rather than nonwrapping_interval. Complete the deprecation by removing range.hh and replacing all the aliases by the names they point to from the interval library. Note this now exposes uses of wrapping intervals as they are now explicit. The unit tests are renamed and range.hh is deleted. Closes scylladb/scylladb#17428	2024-02-21 00:24:25 +02:00
Asias He	a0e46a6b47	repair: Fix rpc::source and rpc::optional parameter order in rpc message In a mixed cluster (5.4.1-20231231.3d22f42cf9c3 and 5.5.0~dev-20240119.b1ba904c4977), in the rolling upgrade test, we saw repair never finishing. The following was observed: rpc - client 127.0.0.2:65273 msg_id 5524: caught exception while processing a message: std::out_of_range (deserialization buffer underflow) It turns out the repair rpc message was not compatible between the two versions. Even with a rpc stream verb, the new rpc parameters must come after the rpc::source<> parameter. The rpc::source<> parameter is not special in the sense that it must be the last parameter. For example, it should be: void register_repair_get_row_diff_with_rpc_stream( std::function<future<rpc::sink<repair_row_on_wire_with_cmd>> ( const rpc::client_info& cinfo, uint32_t repair_meta_id, rpc::source<repair_hash_with_cmd> source, rpc::optional<shard_id> dst_cpu_id_opt)>&& func); not: void register_repair_get_row_diff_with_rpc_stream( std::function<future<rpc::sink<repair_row_on_wire_with_cmd>> ( const rpc::client_info& cinfo, uint32_t repair_meta_id, rpc::optional<shard_id> dst_cpu_id_opt, rpc::source<repair_hash_with_cmd> source)>&& func); Fixes #16941 Closes scylladb/scylladb#17156	2024-02-12 09:50:30 +02:00
Piotr Dulikowski	7601f40bf8	storage_service: introduce join_node_query verb When a node joins an existing cluster, it will ask a node that already belongs to the cluster about which topology operations to use when joining.	2024-02-07 10:02:00 +01:00
Asias He	e7e1f4b01a	streaming: Fix rpc::source and rpc::optional parameter order The new rpc::optional parameter must come after any existing parameters, including the rpc::source parameters, otherwise it will break compatibility. The regression was introduced in: ``` commit `fd3c089ccc` Author: Tomasz Grabiec <tgrabiec@scylladb.com> Date: Thu Oct 26 00:35:19 2023 +0200 service: range_streamer: Propagate topology_guard to receivers ``` We need to backport this patch ASAP before we release anything that contains commit `fd3c089ccc`. Refs: #16941 Fixes: #17175 Closes scylladb/scylladb#17176	2024-02-06 13:15:28 +01:00
Avi Kivity	c8397f0287	Merge 'Implement tablet splitting' from Raphael "Raph" Carvalho The motivation for tablet resizing is that we want to keep the average tablet size reasonable, such that load rebalancing can remain efficient. Too large tablet makes migration inefficient, therefore slowing down the balancer. If the avg size grows beyond the upper bound (split threshold), then balancer decides to split. Split spans all tablets of a table, due to power-of-two constraint. Likewise, if the avg size decreases below the lower bound (merge threshold), then merge takes place in order to grow the avg size. Merge is not implemented yet, although this series lays foundation for it to be impĺemented later on. A resize decision can be revoked if the avg size changes and the decision is no longer needed. For example, let's say table is being split and avg size drops below the target size (which is 50% of split threshold and 100% of merge one). That means after split, the avg size would drop below the merge threshold, causing a merge after split, which is wasteful, so it's better to just cancel the split. Tablet metadata gains 2 new fields for managing this: resize_type: resize decision type, can be either of "merge", "split", or "none". resize_seq_number: a sequence number that works as the global identifier of the decision (monotonically increasing, increased by 1 on every new decision emitted by the coordinator). A new RPC was implemented to pull stats from each table replica, such that load balancer can calculate the avg tablet size and know the "split status", for a given table. Avg size is aggregated carefully while taking RF of each DC into account (which might differ). When a table is done splitting its storage, it loads (mirror) the resize_seq_number from tablet metadata into its local state (in another words, my split status is ready). If a table is split ready, coordinator will see that table's seq number is the same as the one in tablet metadata. Helps to distinguish stale decisions from the latest one (in case decisions are revoked and re-emited later on). Also, it's aggregated carefully, by taking the minimum among all replicas, so coordinator will only update topology when all replicas are ready. When load balancer emits split decision, replicas will listen to need to split with a "split monitor" that is awakened once a table has replication metadata updated and detects the need for split (i.e. resize_type field is "split"). The split monitor will start splitting of compaction groups (using mechanism introduced here: `081f30d149`) for the table. And once splitting work is completed, the table updates its local state as having completed split. When coordinator pulls the split status of all replicas for a table via RPC, the balancer can see whether that table is ready for "finalizing" the decision, which is about updating tablet metadata to split each tablet into two. Once table replicas have their replication metadata updated with the new tablet count, they can update appropriately their set of compaction groups (that were previously split in the preparation step). Fixes #16536. Closes scylladb/scylladb#16580 * github.com:scylladb/scylladb: test/topology_experimental_raft: Add tablet split test replica: Bypass reshape on boot with tablets temporarily replica: Fix table::compaction_group_for_sstable() for tablet streaming test/topology_experimental_raft: Disable load balancer in test fencing replica: Remap compaction groups when tablet split is finalized service: Split tablet map when split request is finalized replica: Update table split status if completed split compaction work storage_service: Implement split monitor topology_cordinator: Generate updates for resize decisions made by balancer load_balancer: Introduce metrics for resize decisions db: Make target tablet size a live-updateable config option load_balancer: Implement resize decisions service: Wire table_resize_plan into migration_plan service: Introduce table_resize_plan tablet_mutation_builder: Add set_resize_decision() topology_coordinator: Wire load stats into load balancer storage_service: Allow tablet split and migration to happen concurrently topology_coordinator: Periodically retrieve table_load_stats locator: Introduce topology::get_datacenter_nodes() storage_service: Implement table_load_stats RPC replica: Expose table_load_stats in table replica: Introduce storage_group::live_disk_space_used() locator: Introduce table_load_stats tablets: Add resize decision metadata to tablet metadata locator: Introduce resize_decision	2024-01-31 13:59:56 +02:00
Raphael S. Carvalho	9519a0c9e4	storage_service: Implement table_load_stats RPC This implements the RPC for collecting table stats. Since both leaving and pending replica can be accounted during tablet migration, the RPC handler will look at tablet transition info and account only either leaving or replica based on the tablet migration stage. Replicas that are not leaving or pending, of course, don't contribute to the anomaly in the reported size. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-01-25 18:36:08 -03:00
Mikołaj Grzebieluch	d8de209dcf	message_service: add sanity check that rpc connections are not created in the maintenance mode In maintenance mode, a node shouldn't be able to communicate with other nodes. To make sure this does not happen, the sanity check is added.	2024-01-25 15:27:53 +01:00
Kefu Chai	a1dcddd300	utils: do not include unused headers these unused includes were identified by clangd. see https://clangd.llvm.org/guides/include-cleaner#unused-include-warning for more details on the "Unused include" warning. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#16833	2024-01-18 12:50:06 +02:00
Asias He	fd774862be	repair: Make row_level repair work with tablet Since a given tablet belongs to a single shard on both repair master and repair followers, row level repair code needs to be changed to work on a single shard for a given tablet. In order to tell the repair followers which shard to work on, a dst_cpu_id value is passed over rpc from the repair master.	2024-01-18 08:49:06 +08:00
Avi Kivity	9c0f05efa1	Merge 'Track tablet streaming under global sessions to prevent side-effects of failed streaming' from Tomasz Grabiec Tablet streaming involves asynchronous RPCs to other replicas which transfer writes. We want side-effects from streaming only within the migration stage in which the streaming was started. This is currently not guaranteed on failure. When streaming master fails (e.g. due to RPC failing), it can be that some streaming work is still alive somewhere (e.g. RPC on wire) and will have side-effects at some point later. This PR implements tracking of all operations involved in streaming which may have side-effects, which allows the topology change coordinator to fence them and wait for them to complete if they were already admitted. The tracking and fencing is implemented by using global "sessions", created for streaming of a single tablet. Session is globally identified by UUID. The identifier is assigned by the topology change coordinator, and stored in system.tablets. Sessions are created and closed based on group0 state (tablet metadata) by the barrier command sent to each replica, which we already do on transitions between stages. Also, each barrier waits for sessions which have been closed to be drained. The barrier is blocked only if there is some session with work which was left behind by unsuccessful streaming. In which case it should not be blocked for long, because streaming process checks often if the guard was left behind and stops if it was. This mechanism of tracking is fault-tolerant: session id is stored in group0, so coordinator can make progress on failover. The barriers guarantee that session exists on all replicas, and that it will be closed on all replicas. Closes scylladb/scylladb#15847 * github.com:scylladb/scylladb: test: tablets: Add test for failed streaming being fenced away error_injection: Introduce poll_for_message() error_injection: Make is_enabled() public api: Add API to kill connection to a particular host range_streamer: Do not block topology change barriers around streaming range_streamer, tablets: Do not keep token metadata around streaming tablets: Fail gracefully when migrating tablet has no pending replica storage_service, api: Add API to disable tablet balancing storage_service, api: Add API to migrate a tablet storage_service, raft topology: Run streaming under session topology guard storage_service, tablets: Use session to guard tablet streaming tablets: Add per-tablet session id field to tablet metadata service: range_streamer: Propagate topology_guard to receivers streaming: Always close the rpc::sink storage_service: Introduce concept of a topology_guard storage_service: Introduce session concept tablets: Fix topology_metadata_guard holding on to the old erm docs: Document the topology_guard mechanism	2023-12-07 16:29:02 +02:00
Asias He	6beadab9e6	messaging_service: Introduce STREAM_BLOB and TABLET_STREAM_FILES verb They will be used to implement file stream for tablet in the future. Reserve the verb ID.	2023-12-07 14:54:12 +08:00
Tomasz Grabiec	fd3c089ccc	service: range_streamer: Propagate topology_guard to receivers	2023-12-06 18:36:16 +01:00
Benny Halevy	984a576405	messaging_service: accept broadcast_addr in config rather than via fb_utilities Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-12-05 09:46:25 +02:00
Benny Halevy	586f35bb55	messaging_service: move listen_address and port getters inline And make them const noexcept. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-12-05 09:44:41 +02:00
Pavel Emelyanov	492b842929	messaging_service: Define metrics domain for client connections Recent seastar update included RPC metrics (scylladb/seastar#1753). The reported metrics groups together sockets based on their "metrics_domain" configuration option. This patch makes use of this domain to make scylla metrics sane. The domain as this patch defines it includes two strings: First, the datacenter the server lives in. This is because grouping metrics for connections to different datacenters makes little sense for several reasons. For example -- packet delays _will_ differ for local-DC vs cross-DC traffic and mixing those latencies together is pointless. Another example -- the amount of traffic may also differ for local- vs cross-DC connections e.g. because of different usage of enryption and/or compression. Second, each verb-idx gets its own domain. That's to be able to analyze e.g. query-related traffic from gossiper one. For that the existing isolation cookie is taken as is. Note, that the metrics is _not_ per-server node. So e.g. two gossiper connections to two different nodes (in one DC) will belong to the same domain and thus their stats will be summed when reported. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#15785	2023-11-13 11:13:20 +01:00
Asias He	2b2302d373	streaming: Ignore dropped table on both sides It is possible the sender and receiver of streaming nodes have different views on if a table is dropped or not. For example: - n1, n2 and n3 in the cluster - n4 started to join the cluster and stream data from n1, n2, n3 - a table was dropped - n4 failed to write data from n2 to sstable because a table was dropped - n4 ended the streaming - n2 checked if the table was present and would ignore the error if the table was dropped - however n2 found the table was still present and was not dropped - n2 marked the streaming as failed This will fail the streaming when a table is dropped. We want streaming to ignore such dropped tables. In this patch, a status code is sent back to the sender to notify the table is dropped so the sender could ignore the dropped table. Fixes #15370 Closes scylladb/scylladb#15912	2023-11-03 13:38:48 +02:00
Piotr Dulikowski	7cbe5e3af8	rpc: add new join handshake verbs The `join_node_request` and `join_node_response` RPCs are added: - `join_node_request` is sent from the joining node to any node in the cluster. It contains some initial parameters that will be verified by the receiving node, or the topology coordinator - notably, it contains a list of cluster features supported by the joining node. - `join_node_response` is sent from the topology coordinator to the joining node to tell it about the the outcome of the verification.	2023-09-26 15:56:52 +02:00
Tomasz Grabiec	d5539e080d	tablets: Implement cleanup step This change adds a stub for tablet cleanup on the replica side and wires it into the tablet migration process. The handling on replica side is incomplete because it doesn't remove the actual data yet. It only flushes the memtables, so that all data is in sstables and none requires a memtable flush. This patch is necessary to make decommission work. Otherwise, a memtable flush would happen when the decommissioned node is put in the drained state (as in nodetool drain) and it would fail on missing host id mapping (node is no longer in topology), which is examined by the tablet sharder when producing sstable sharding metadata. Leading to abort due to failed memtable flush.	2023-09-14 12:45:10 +02:00
Benny Halevy	357d57c82d	raft: group0_state_machine: transfer_snapshot: make abortable Use an abort_source in group0_state_machine to abort an ongoing transfer_snapshot operation on group0_state_machine::abort() Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-08-03 16:32:08 +03:00
Botond Dénes	fdaf908967	repair/row_level: opt in to compacting the stream Using a centrally generated compaction-time, generated on the repair master and propagated to all repair followers. For repair it is imperative that all participants use the exact same compaction time, otherwise there can be artificial differences between participants, generating unnecessary repair activity. If a repair follower doesn't get a compaction-time from the repair master, it uses a locally generated one. This is no worse than the previous state of each node being on some undefined state of compaction.	2023-07-27 04:57:50 -04:00
Avi Kivity	615544a09a	Merge 'Init messaging service preferred IP cache via config' from Pavel Emelyanov This is to make m.s. initialization more solid and simplify sys.ks.::setup() Closes #14832 * github.com:scylladb/scylladb: system_keyspace: Remove unused snitch arg from setup() messaging_service: Setup preferred IPs from config	2023-07-26 22:12:28 +03:00
Pavel Emelyanov	0fba57a3e8	messaging_service: Setup preferred IPs from config Population of messageing service preferred IPs cache happens inside system keyspace setup() call and it needs m.s. per ce and additionally snitch. Moving preferred ip cache to initial configuration keeps m.s. start more self-contained and keeps system_keyspace::setup() simpler. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-07-26 16:03:23 +03:00
Tomasz Grabiec	6d545b2f9e	storage_service: Implement stream_tablet RPC Performs streaming of data for a single tablet between two tablet replicas. The node which gets the RPC is the receiving replica.	2023-07-25 21:08:51 +02:00
Calle Wilund	e1a52af69e	messaging_service: Do TLS init early Fixes #14299 failure_detector can try sending messages to TLS endpoints before start_listen has been called (why?). Need TLS initialized before this. So do on service creation. Closes #14493	2023-07-11 18:19:01 +03:00
Kamil Braun	8cf47d76a4	messaging_service: implement host banning Calling `ban_host` causes the following: - all connections from that host are dropped, - any further attempts to connect will be rejected (the connection will be immediately dropped) when receiving the `CLIENT_ID` verb.	2023-06-20 13:03:46 +02:00
Kamil Braun	95c726a8df	messaging_service: exchange host IDs and map them to connections When a node first establishes a connection to another node, it always sending a `CLIENT_ID` one-way RPC first. The message contains some metadata such as `broadcast_address`. Include the `host_id` of the sender in that RPC. On the receiving side, store a mapping from that `host_id` to the connection that was just opened. This mapping will be used later when we ban nodes that we remove from the cluster.	2023-06-20 13:03:46 +02:00
Kamil Braun	87f65d01b8	messaging_service: store the node's host ID	2023-06-20 13:03:46 +02:00
Kamil Braun	a78cc17bd4	messaging_service: don't use parameter defaults in constructor	2023-06-20 13:03:46 +02:00
Pavel Emelyanov	7e8b9aecab	messaging_service: Shutdown rpc server on shutdown The RPC server now has a lighter .shutdown() method that just does what m.s. shutdown() needs, so call it. On stop call regular stop to finalize the stopping process Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-06-01 21:24:13 +03:00
Pavel Emelyanov	5861d15912	Merge 'Small gossiper and migration_manager cleanups' from Gleb Some assorted cleanups here: consolidation of schema agreement waiting into a single place and removing unused code from the gossiper. CI: https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/1458/ Reviewed-by: Konstantin Osipov <kostja@scylladb.com> * gleb/gossiper-cleanups of github.com:scylladb/scylla-dev: storage_service: avoid unneeded copies in on_change storage_service: remove check that is always true storage_service: rename handle_state_removing to handle_state_removed storage_service: avoid string copy storage_service: delete code that handled REMOVING_TOKENS state gossiper: remove code related to advertising REMOVING_TOKEN state migration_manager: add wait_for_schema_agreement() function	2023-05-27 10:49:54 +03:00
Gleb Natapov	05aa07835d	storage_service: delete code that handled REMOVING_TOKENS state The state is never advertised so the code is never used.	2023-05-25 14:48:09 +03:00
Pavel Emelyanov	222f21d180	messaging_service: Remove unused headers from m.s..hh The tracing.hh is quite large to care Another one is "while at it" Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #14024	2023-05-25 08:38:49 +03:00
Botond Dénes	8663b27f25	message: generalize per-tenant connection types We have a set amount of connection types for each tenant. The amount of these connection types can change. Although currently these are hardcoded in a single place, soon (in the next patch) there will be yet another place where these will be used. To avoid duplicating these names, making future changes error prone, centralize them in a const array, generalizing the concept of a tenant connection type.	2023-05-10 04:28:57 -04:00
Gleb Natapov	d69a887366	storage_service: raft topology: introduce snapshot transfer code for the topology table	2023-03-23 16:29:56 +02:00
Gleb Natapov	6a4d773b7e	raft topology: add RAFT_TOPOLOGY_CMD verb that will be used by topology coordinator to communicated with nodes Empty for now. Will be used later by the topology coordinator to communicate with other nodes to instruct them to start streaming, or start to fence read/writes.	2023-03-23 16:29:56 +02:00
Avi Kivity	69a385fd9d	Introduce schema/ module Schema related files are moved there. This excludes schema files that also interact with mutations, because the mutation module depends on the schema. Those files will have to go into a separate module. Closes #12858	2023-02-15 11:01:50 +02:00
Kamil Braun	cbdcc944b5	service/raft: specialized verb for failure detector pinger We used GOSSIP_ECHO verb to perform failure detection. Now we use a special verb DIRECT_FD_PING introduced for this purpose. There are multiple reasons to do so. One minor reason: we want to use the same connection as other Raft verbs: if we can't deliver Raft append_entries or vote messages somewhere, that endpoint should be marked dead; if we can, the endpoint should be marked alive. So putting pings on the same connection as the other Raft verbs is important when dealing with weird situations where some connections are available but others are not. Observe that in `do_get_rpc_client_idx`, we put the new verb in the right place. Another minor reason: we remove the awkward gossiper `echo_pinger` abstraction which required storing and updating gossiper generation numbers. This also removes one dependency from Raft service code to gossiper. Major reason 1: the gossip echo handler has a weird mechanism where a replacing node returns errors during the replace operation to some of the nodes. In Raft however, we want to mark servers as alive when they are alive, including a server running on a node that's replacing another node. Major reason 2, related to the previous one: when server B is replacing server A with the same IP, the failure detector will try to ping both servers. Both servers are mapped to the same IP by the address map, so pings to both servers will reach server B. We want server B to respond to the pings destined for server B, but not to pings destined for server A, so the sender can mark B alive but keep A marked dead. To do this, we include the destination's Raft ID in our RPCs. The destination compares the received ID with its own. If it's different, it returns a `wrong_destination` response, and the failure detector knows that the ping did not reach the destination (it reached someone else). Yet another reason: removes "Not ready to respond gossip echo message" log spam during replace.	2022-12-01 20:54:18 +01:00
Kamil Braun	a83789160d	message: messaging_service: check for known topology before calling is_same_dc/rack `is_same_dc` and `is_same_rack` assume that the peer's topology is known. If it's unknown, `on_internal_error` will be called inside topology. When these functions are used in `get_rpc_client`, they are already protected by an earlier check for knowing the peer's topology (the `has_topology()` lambda). Another use is in `do_start_listen()`, where we create a filter for RPC module to check if it should accept incoming connections. If cross-dc or cross-rack encryption is enabled, we will reject connections attempts to the regular (non-ssl) port from other dcs/rack using `is_same_dc/rack`. However, it might happen that something (other Scylla node or otherwise) tries to contact us on the regular port and we don't know that thing's topology, which would result in `on_internal_error`. But this is not a fatal error; we simply want to reject that connection. So protect these calls as well. Finally, there's `get_preferred_ip` with an unprotected `is_same_dc` call which, for a given peer, may return a different IP from preferred IP cache if the endpoint resides in the same DC. If there is not entry in the preferred IP cache, we return the original (external) IP of the peer. We can do the same if we don't know the peer's topology. It's interesting that we didn't see this particular place blowing up. Perhaps the preferred IP cache is always populated after we know the topology.	2022-11-16 14:01:50 +01:00
Botond Dénes	13ace7a05e	Merge "Fix RPC sockets configuration wrt topology" from Pavel Emelyanov " Messaging service checks dc/rack of the target node when creating a socket. However, this information is not available for all verbs, in particular gossiper uses RPC to get topology from other nodes. This generates a chicken-and-egg problem -- to create a socket messaging service needs topology information, but in order to get one gossiper needs to create a socket. Other than gossiper, raft starts sending its APPEND_ENTRY messages early enough so that topology info is not avaiable either. The situation is extra-complicated with the fact that sockets are not created for individual verbs. Instead, verbs are groupped into several "indices" and socket is created for it. Thus, the "gossiping" index that includes non-gossiper verbs will create topology-less socket for all verbs in it. Worse -- raft sends messages w/o solicited topology, the corresponding socket is created with the assumption that the peer lives in default dc and rack which doesn't matchthe local nodes' dc/rack and the whole index group gets the "randomly" configured socket. Also, the tcp-nodelay tries to implement similar check, but uses wrong index of 1, so it's also fixed here. " * 'br-messaging-topology-ignoring-clients' of https://github.com/xemul/scylla: messaging_service: Fix gossiper verb group messaging_service: Mind the absence of topology data when creating sockets messaging_service: Templatize and rename remove_rpc_client_one	2022-09-16 13:27:56 +03:00
Pavel Emelyanov	82162be1f1	messaging_service: Remove init/uninit helpers These two are just getting in the way when touching inter-components dependencies around messaging service. Without it m.-s. start/stop just looks like any other service out there Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #11535	2022-09-15 11:54:46 +03:00
Pavel Emelyanov	7bdad47de2	messaging_service: Mind the absence of topology data when creating sockets When a socket is created to serve a verb there may be no topology information regarding the target node. In this case current code configures socket as if the peer node lived in "default" dc and rack of the same name. If topology information appears later, the client is not re-connected, even though it could providing more relevant configuration (e.g. -- w/o encryption) This patch checks if the topology info is needed (sometimes it's not) and if missing it configures the socket in the most restrictive manner, but notes that the socket ignored the topology on creation. When topology info appears -- and this happens when a node joins the cluster -- the messaging service is kicked to drop all sockets that ignored the topology, so thay they reconnect later. The mentioned "kick" comes from storage service on-join notification. More correct fix would be if topology had on-change notification and messaging service subscribed on it, but there are two cons: - currently dc/rack do not change on the fly (though they can, e.g. if gossiping property file snitch is updated without restart) and topology update effectively comes from a single place - updating topology on token-metadata is not like topology.update() call. Instead, a clone of token metadata is created, then update happens on the clone, then the clone is committed into t.m. Though it's possible to find out commit-time which nodes changed their topology, but since it only happens on join this complexity likely doesn't worth the effort (yet) fixes: #11514 fixes: #11492 fixes: #11483 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-09-14 20:30:51 +03:00
Pavel Emelyanov	5ffc9d66ec	messaging_service: Templatize and rename remove_rpc_client_one It actually finds and removes a client and in its new form it also applies filtering function it, so some better name is called for Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-09-14 20:30:07 +03:00

1 2 3 4 5 ...

316 Commits