`handle_state_normal` may drop connections to the handled node. This
causes spurious failures if there's an ongoing concurrent operation.
This problem was already solved twice in the past in different contexts:
first in 53636167ca, then in
79ee38181c.
Time to fix it for the third time. Now we do this right after enabling
gossiping, so hopefully it's the last time.
This time it's causing snapshot transfer failures in group 0. Although
the transfer is retried and eventually succeeds, the failed transfer is
wasted work and causes an annoying ERROR message in the log which
dtests, SCT, and I don't like.
The fix is done by moving the `wait_for_normal_state_handled_on_boot()`
call before `setup_group0()`. But for the wait to work correctly we must
first ensure that gossiper sees an alive node, so we precede it with
`wait_for_live_node_to_show_up()` (before this commit, the call site of
`wait_for_normal_state_handled_on_boot` was already after this wait).
There is another problem: the bootstrap procedure is racing with gossiper
marking nodes as UP, and waiting for other nodes to be NORMAL doesn't guarantee
that they are also UP. If gossiper is quick enough, everything will be fine.
If not, problems may arise such as streaming or repair failing due to nodes
still being marked as DOWN, or the CDC generation write failing.
In general, we need all NORMAL nodes to be up for bootstrap to proceed.
One exception is replace where we ignore the replaced node. The
`sync_nodes` set constructed for `wait_for_normal_state_handled_on_boot`
takes this into account, so we also use it to wait for nodes to be UP.
As explained in commit messages and comments, we only do these
waits outside raft-based-topology mode.
This should improve CI stability.
Fixes: #12972
Refs: #14042
Closes #14354
* github.com:scylladb/scylladb:
messaging_service: print which connections are dropped due to missing topology info
storage_service: wait for nodes to be UP on bootstrap
storage_service: wait for NORMAL state handler before `setup_group0()`
storage_service: extract `gossiper::wait_for_live_nodes_to_show_up()`
This connection dropping caused us to spend a lot of time debugging.
Those debugging sessions would be shorter if Scylla logs indicated that
connections are being dropped and why.
Connection drops for a given node are a one-time event - we only do it
if we establish a connection to a node without topology info, which
should only happen before we handle the node's NORMAL status for the
first time. So it's a rare thing and we can log it on INFO level without
worrying about log spam.
Calling `ban_host` causes the following:
- all connections from that host are dropped,
- any further attempts to connect will be rejected (the connection will
be immediately dropped) when receiving the `CLIENT_ID` verb.
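The two effects listed above can be modelled with a small sketch. This is not the messaging service code: `server`, `on_client_id`, the string-based `host_id`, and the integer connection IDs are hypothetical stand-ins chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>

// Hypothetical stand-ins; the real types are not shown in this message.
using host_id = std::string;
using connection_id = int;

class server {
    std::multimap<host_id, connection_id> _conns; // host -> open connections
    std::set<host_id> _banned;
public:
    // Called when a `CLIENT_ID` message arrives on a new connection.
    // Returns false if the connection is rejected (dropped immediately).
    bool on_client_id(const host_id& from, connection_id conn) {
        if (_banned.count(from)) {
            return false;
        }
        _conns.emplace(from, conn);
        return true;
    }

    // Drops every connection from `h` and rejects future ones.
    // Returns the number of connections dropped.
    std::size_t ban_host(const host_id& h) {
        _banned.insert(h);
        return _conns.erase(h);
    }
};
```

In this model a banned host can still open a socket, but it is rejected as soon as it identifies itself, which matches the "when receiving the `CLIENT_ID` verb" wording.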
When a node first establishes a connection to another node, it always
sends a `CLIENT_ID` one-way RPC first. The message contains some
metadata such as `broadcast_address`.
Include the `host_id` of the sender in that RPC. On the receiving side,
store a mapping from that `host_id` to the connection that was just
opened.
This mapping will be used later when we ban nodes that we remove from
the cluster.
The RPC server now has a lighter .shutdown() method that just does what
m.s. shutdown() needs, so call it. On stop, call the regular stop to
finalize the stopping process.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Make it do_with_servers() and make it accept a method to call and a
message to print. This makes it possible to reuse this helper in the
next patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Some assorted cleanups here: consolidation of schema agreement waiting
into a single place and removing unused code from the gossiper.
CI: https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/1458/
Reviewed-by: Konstantin Osipov <kostja@scylladb.com>
* gleb/gossiper-cleanups of github.com:scylladb/scylla-dev:
storage_service: avoid unneeded copies in on_change
storage_service: remove check that is always true
storage_service: rename handle_state_removing to handle_state_removed
storage_service: avoid string copy
storage_service: delete code that handled REMOVING_TOKENS state
gossiper: remove code related to advertising REMOVING_TOKEN state
migration_manager: add wait_for_schema_agreement() function
On connection setup, the isolation cookie of the connection is matched
to the appropriate scheduling group. This is achieved by iterating over
the known statement tenant connection types as well as the system
connections and choosing the one with a matching name.
If a match is not found, it is assumed that the cluster is upgraded and
the remote node has a scheduling group the local one doesn't have. To
avoid demoting a scheduling group of unknown importance, in this case the
default scheduling group is chosen.
This is problematic when upgrading an OSS cluster to an enterprise
version, as the scheduling groups of the enterprise service-levels will
match none of the statement tenants and will hence fall-back to the
default scheduling group. As a consequence, while the cluster is mixed,
user workload on old (OSS) nodes, will be executed under the system
scheduling group and concurrency semaphore. Not only does this mean that
user workloads are directly competing for resources with system ones,
but the two workloads are now sharing the semaphore too, reducing the
available throughput. This usually manifests in queries timing out on
the old (OSS) nodes in the cluster.
This patch proposes to fix this, by recognizing that the unknown
scheduling group is in fact a tenant this node doesn't know yet, and
matching it with the default statement tenant.
With this, order should be restored, with service-level connections
being recognized as user connections and being executed in the statement
scheduling group and the statement (user) concurrency semaphore.
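The matching rule after this patch can be sketched as follows. This is a simplified model, not the actual code: scheduling groups are plain strings, and `scheduling_config`, `group_for_cookie`, and the cookie names in the usage note are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical, simplified model: a scheduling group is just its name.
struct scheduling_config {
    std::map<std::string, std::string> system_connections; // cookie -> group
    std::map<std::string, std::string> statement_tenants;  // cookie -> group
    std::string default_statement_group;
};

// Picks the scheduling group for an incoming connection's isolation cookie.
inline std::string group_for_cookie(const scheduling_config& cfg,
                                    const std::string& cookie) {
    if (auto it = cfg.system_connections.find(cookie);
            it != cfg.system_connections.end()) {
        return it->second;
    }
    if (auto it = cfg.statement_tenants.find(cookie);
            it != cfg.statement_tenants.end()) {
        return it->second;
    }
    // Unknown cookie: assume it is a tenant this node does not know yet
    // and serve it as the default statement tenant, not as a system
    // connection.
    return cfg.default_statement_group;
}
```

With a configuration that knows only a `streaming` system connection and a `statement` tenant, an unknown cookie such as `sl:gold` resolves to the statement group in this sketch.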
We have a set amount of connection types for each tenant. The amount of
these connection types can change. Although currently these are
hardcoded in a single place, soon (in the next patch) there will be yet
another place where these will be used. To avoid duplicating these
names, making future changes error prone, centralize them in a const
array, generalizing the concept of a tenant connection type.
Make sure that the int64_t generation we get over rpc
fits in the int32_t generation_type we keep locally.
Restrict this assertion to non-release builds.
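A minimal sketch of such a range check, assuming a 32-bit local alias. The helper names `fits_generation_type` and `to_generation_type` are hypothetical, and plain `assert()` stands in for whatever assertion macro the patch uses.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical local alias; the message only says the local type is 32-bit.
using generation_type = std::int32_t;

// True when a 64-bit value received over RPC fits the 32-bit local type.
constexpr bool fits_generation_type(std::int64_t v) noexcept {
    return v >= std::numeric_limits<generation_type>::min()
        && v <= std::numeric_limits<generation_type>::max();
}

// Narrowing conversion guarded by an assertion. assert() is compiled out
// when NDEBUG is defined, so the check only runs in non-release builds.
inline generation_type to_generation_type(std::int64_t v) noexcept {
    assert(fits_generation_type(v));
    return static_cast<generation_type>(v);
}
```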
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The legacy failure_detector is now unused and can be removed.
TODO: integrate direct_failure_detector with failure_detector api.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The goal of this change is to reduce the dependency on
`operator<<(ostream&, const gms::inet_address&)`.
This is not an exhaustive search-and-replace change: some call sites
also depend on ostream printers that have not been converted yet, so we
cannot fix them all. This change only updates some callers of
`operator<<(ostream&, const gms::inet_address&)`.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Empty for now. Will be used later by the topology coordinator to
communicate with other nodes to instruct them to start streaming,
or start to fence read/writes.
Schema related files are moved there. This excludes schema files that
also interact with mutations, because the mutation module depends on
the schema. Those files will have to go into a separate module.
Closes #12858
Move mutation-related files to a new mutation/ directory. The names
are kept in the global namespace to reduce churn; the names are
unambiguous in any case.
mutation_reader remains in the readers/ module.
mutation_partition_v2.cc was missing from CMakeLists.txt; it's added in this
patch.
This is a step forward towards librarization or modularization of the
source base.
Closes #12788
We already check whether the remote node's topology is missing before
creating a connection, but the local node's topology can be missing too
once we use Raft to manage it. Raft needs to be able to create
connections before topology is known.
Message-Id: <20221228144944.3299711-7-gleb@scylladb.com>
Now, with a44ca06906, is_normal_token_owner that replaced is_member
does not rely anymore on the pending status
of endpoints in topology.
With that we can get rid of this state and just keep all endpoints we know about in the topology.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #12294
* github.com:scylladb/scylladb:
topology: get rid of pending state
topology: debug log update and remove endpoint
Now, with a44ca06906,
is_normal_token_owner that replaced is_member
does not rely anymore on the pending status
of endpoints in topology.
With that we can get rid of this state and just keep
all endpoints we know about in the topology.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We used GOSSIP_ECHO verb to perform failure detection. Now we use
a special verb DIRECT_FD_PING introduced for this purpose.
There are multiple reasons to do so.
One minor reason: we want to use the same connection as other Raft
verbs: if we can't deliver Raft append_entries or vote messages
somewhere, that endpoint should be marked dead; if we can, the
endpoint should be marked alive. So putting pings on the same
connection as the other Raft verbs is important when dealing with
weird situations where some connections are available but others are
not. Observe that in `do_get_rpc_client_idx`, we put the new verb in
the right place.
Another minor reason: we remove the awkward gossiper `echo_pinger`
abstraction which required storing and updating gossiper generation
numbers. This also removes one dependency from Raft service code to
gossiper.
Major reason 1: the gossip echo handler has a weird mechanism where a
replacing node returns errors during the replace operation to some of
the nodes. In Raft however, we want to mark servers as alive when they
are alive, including a server running on a node that's replacing
another node.
Major reason 2, related to the previous one: when server B is
replacing server A with the same IP, the failure detector will try to
ping both servers. Both servers are mapped to the same IP by the
address map, so pings to both servers will reach server B. We want
server B to respond to the pings destined for server B, but not to
pings destined for server A, so the sender can mark B alive but keep A
marked dead.
To do this, we include the destination's Raft ID in our RPCs. The
destination compares the received ID with its own. If it's different,
it returns a `wrong_destination` response, and the failure detector
knows that the ping did not reach the destination (it reached someone
else).
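The ID check can be sketched as below. This is an illustration under assumptions: `raft_id` is a plain integer instead of a UUID, and `handle_ping` and `counts_as_alive` are hypothetical names; only `wrong_destination` comes from the text above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a Raft server ID (real IDs are UUIDs).
using raft_id = std::uint64_t;

enum class ping_result { ok, wrong_destination };

// Receiver side: answer positively only to pings addressed to our own ID.
inline ping_result handle_ping(raft_id my_id, raft_id dst_id) {
    return dst_id == my_id ? ping_result::ok
                           : ping_result::wrong_destination;
}

// Sender side: only an `ok` reply is evidence that the destination is alive.
inline bool counts_as_alive(ping_result r) {
    return r == ping_result::ok;
}
```

For the replace scenario, suppose server A has ID 1 and server B has ID 2, and both resolve to the same IP. Both pings reach B: the ping destined for 2 yields `ok`, the ping destined for 1 yields `wrong_destination`, so A stays marked dead.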
Yet another reason: removes "Not ready to respond gossip echo
message" log spam during replace.
1) make address map API flexible
Before this patch:
- having a mapping without an actual IP address was an
internal error
- not having a mapping for an IP address was an internal
error
- re-mapping to a new IP address wasn't allowed
After this patch:
- the address map may contain a mapping
without an actual IP address, and the caller must be prepared for it:
find() will return a nullopt. This happens when we first add an entry
to Raft configuration and only later learn its IP address, e.g. via
gossip.
- it is allowed to re-map an existing entry to a new address;
2) subscribe to gossip notifications
Learning IP addresses from gossip allows us to adjust
the address map whenever a node IP address changes.
Gossiper is also the only valid source of re-mapping, other sources
(RPC) should not re-map, since otherwise a packet from a removed
server can remap the id to a wrong address and impact liveness of a Raft
cluster.
3) prompt address map state with app state
Initialize the raft address map with initial
gossip application state, specifically IPs of members
of the cluster. With this, we no longer need to store
these IPs in Raft configuration (and update them when they change).
The obvious drawback of this approach is that a node
may join Raft config before it propagates its IP address
to the cluster via gossip - so the boot process has to
wait until it happens.
Gossip also doesn't tell us which IPs are members of Raft configuration,
so we subscribe to Group0 configuration changes to mark the
members of Raft config "non-expiring" in the address translation
map.
Thanks to the changes above, Raft configuration no longer
stores IP addresses.
We still keep the 'server_info' column in the raft_config system table,
in case we change our mind or decide to store something else in there.
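The relaxed address map contract from item 1) can be sketched as follows. This is a toy model, not the real class: `address_map`, `add_unmapped`, and `set` are hypothetical names, IDs are integers, and addresses are strings. Expiry and the "non-expiring" marking are left out.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

using raft_id = std::uint64_t;   // hypothetical stand-in for a server ID
using ip_address = std::string;  // hypothetical stand-in for an IP address

class address_map {
    std::unordered_map<raft_id, std::optional<ip_address>> _map;
public:
    // Registers an ID whose address is not known yet, e.g. a server that
    // was just added to the Raft configuration. Keeps an existing mapping.
    void add_unmapped(raft_id id) {
        _map.try_emplace(id, std::nullopt);
    }

    // Sets or replaces the address; re-mapping an existing entry is allowed.
    void set(raft_id id, ip_address addr) {
        _map[id] = std::move(addr);
    }

    // Returns nullopt when the ID is unknown or has no address yet;
    // callers must be prepared for that.
    std::optional<ip_address> find(raft_id id) const {
        auto it = _map.find(id);
        if (it == _map.end()) {
            return std::nullopt;
        }
        return it->second;
    }
};
```

In the design described above, only gossip notifications would be allowed to call something like `set()` on an existing entry; the sketch does not enforce that restriction.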
`get_rpc_client` calculates a `topology_ignored` field when creating a
client which says whether the client's endpoint had topology information
when this client was created. This is later used to check if that client
needs to be dropped and replaced with a new client which uses the
correct topology information.
The `topology_ignored` field was incorrectly calculated as `true` for
pending endpoints even though we had topology information for them. This
would lead to unnecessary drops of RPC clients later. Fix this.
Remove the default parameter for `with_pending` from
`topology::has_endpoint` to avoid similar bugs in the future.
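A sketch of why dropping the default parameter helps. The `topology` class below is a toy model with hypothetical members, and `topology_ignored` is written as a free function for illustration only.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>

class topology {
    std::set<std::string> _normal;
    std::set<std::string> _pending;
public:
    enum class pending { no, yes };

    void add_normal(std::string ep) { _normal.insert(std::move(ep)); }
    void add_pending(std::string ep) { _pending.insert(std::move(ep)); }

    // No default for `with_pending`: every caller has to state whether
    // pending endpoints count, so the choice cannot be forgotten.
    bool has_endpoint(const std::string& ep, pending with_pending) const {
        return _normal.count(ep) > 0
            || (with_pending == pending::yes && _pending.count(ep) > 0);
    }
};

// A client "ignored topology" only if nothing is known about the peer,
// not even as a pending endpoint.
inline bool topology_ignored(const topology& t, const std::string& ep) {
    return !t.has_endpoint(ep, topology::pending::yes);
}
```

A call such as `t.has_endpoint(ep)` no longer compiles in this model, which is the kind of mistake the change is meant to rule out.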
Apparently this fixes #11780. The verbs used by decommission operation
use RPC client index 1 (see `do_get_rpc_client_idx` in
message/messaging_service.cc). From local testing with additional
logging I found that by the time this client is created (i.e. the first
verb in this group is used), we already know the topology. The node is
pending at that point - hence the bug would cause us to assume we don't
know the topology, leading us to dropping the RPC client later, possibly
in the middle of a decommission operation.
Fixes: #11780
Closes #11942
* github.com:scylladb/scylladb:
message: messaging_service: check for known topology before calling is_same_dc/rack
test: reenable test_topology::test_decommission_node_add_column
test/pylib: util: configurable period in wait_for
message: messaging_service: fix topology_ignored for pending endpoints in get_rpc_client
message: messaging_service: topology independent connection settings for GOSSIP verbs
`is_same_dc` and `is_same_rack` assume that the peer's topology is
known. If it's unknown, `on_internal_error` will be called inside
topology.
When these functions are used in `get_rpc_client`, they are already
protected by an earlier check for knowing the peer's topology
(the `has_topology()` lambda).
Another use is in `do_start_listen()`, where we create a filter for RPC
module to check if it should accept incoming connections. If cross-dc or
cross-rack encryption is enabled, we will reject connection attempts to
the regular (non-ssl) port from other dcs/racks using `is_same_dc/rack`.
However, it might happen that something (other Scylla node or otherwise)
tries to contact us on the regular port and we don't know that thing's
topology, which would result in `on_internal_error`. But this is not a
fatal error; we simply want to reject that connection. So protect these
calls as well.
Finally, there's `get_preferred_ip` with an unprotected `is_same_dc`
call which, for a given peer, may return a different IP from preferred IP
cache if the endpoint resides in the same DC. If there is no entry in
the preferred IP cache, we return the original (external) IP of the
peer. We can do the same if we don't know the peer's topology. It's
interesting that we didn't see this particular place blowing up. Perhaps
the preferred IP cache is always populated after we know the topology.
`get_rpc_client` calculates a `topology_ignored` field when creating a
client which says whether the client's endpoint had topology information
when this client was created. This is later used to check if that client
needs to be dropped and replaced with a new client which uses the
correct topology information.
The `topology_ignored` field was incorrectly calculated as `true` for
pending endpoints even though we had topology information for them. This
would lead to unnecessary drops of RPC clients later. Fix this.
Remove the default parameter for `with_pending` from
`topology::has_endpoint` to avoid similar bugs in the future.
Apparently this fixes #11780. The verbs used by decommission operation
use RPC client index 1 (see `do_get_rpc_client_idx` in
message/messaging_service.cc). From local testing with additional
logging I found that by the time this client is created (i.e. the first
verb in this group is used), we already know the topology. The node is
pending at that point - hence the bug would cause us to assume we don't
know the topology, leading us to dropping the RPC client later, possibly
in the middle of a decommission operation.
Fixes: #11780
The gossip verbs are used to learn about topology of other nodes.
If inter-dc/rack encryption is enabled, the knowledge of topology is
necessary to decide whether it's safe to send unencrypted messages to
nodes (i.e., whether the destination lies in the same dc/rack).
The logic in `messaging_service::get_rpc_client`, which decided whether
a connection must be encrypted, was this (given that encryption is
enabled): if the topology of the peer is known, and the peer is in the
same dc/rack, don't encrypt. Otherwise encrypt.
However, it may happen that node A knows node B's topology, but B
doesn't know A's topology. A deduces that B is in the same DC and rack
and tries sending B an unencrypted message. As the code currently
stands, this would cause B to call `on_internal_error`. This is what I
encountered when attempting to fix #11780.
To guarantee that it's always possible to deliver gossiper verbs (even
if one or both sides don't know each other's topology), and to simplify
reasoning about the system in general, choose connection settings that
are independent of the topology - for the connection used by gossiper
verbs (other connections are still topology-dependent and use complex
logic to handle the situation of unknown-and-later-known topology).
This connection only contains 'rare' and 'cheap' verbs, so it's not a
performance problem to always encrypt it (given that encryption is
configured). And this is what already was happening in the past; it was
at some point removed during topology knowledge management refactors. We
just bring this logic back.
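The resulting decision can be sketched as a pure function. The `encrypt_what` enum, the `peer_info` struct, and `must_encrypt` are hypothetical; they only model the rule described above, not the real `get_rpc_client` code.

```cpp
#include <cassert>

// Hypothetical model of the inter-node encryption setting.
enum class encrypt_what { none, all, dc, rack };

struct peer_info {
    bool topology_known;
    bool same_dc;
    bool same_rack;
};

// Decides whether a connection to `p` must be encrypted.
// `gossip_connection` marks the connection that carries gossiper verbs.
inline bool must_encrypt(encrypt_what cfg, bool gossip_connection,
                         const peer_info& p) {
    switch (cfg) {
    case encrypt_what::none:
        return false;
    case encrypt_what::all:
        return true;
    case encrypt_what::dc:
    case encrypt_what::rack:
        // Gossip verbs: topology-independent, always encrypt.
        // Unknown topology: the safe choice is to encrypt as well.
        if (gossip_connection || !p.topology_known) {
            return true;
        }
        if (cfg == encrypt_what::dc) {
            return !p.same_dc;
        }
        return !(p.same_dc && p.same_rack);
    }
    return true; // unreachable for valid enum values
}
```

Because the gossip branch never looks at `peer_info`, both sides reach the same answer even when only one of them knows the other's topology.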
Fixes #11992.
Inspired by xemul/scylla@45d48f3d02.
Even though previous patch makes scylla not gossip this as internal_ip,
an extra sanity check may still be useful. E.g. older versions of scylla
may still do it, or this address can be loaded from system_keyspace.
refs: #11502
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Make it call cache_preferred_ip() even when the cache is loaded from
system_keyspace and move the connection reset there. This is mainly to
prepare for the next patch, but also makes the code a bit shorter
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
"
Messaging service checks dc/rack of the target node when creating a
socket. However, this information is not available for all verbs, in
particular gossiper uses RPC to get topology from other nodes.
This generates a chicken-and-egg problem -- to create a socket messaging
service needs topology information, but in order to get one gossiper
needs to create a socket.
Other than gossiper, raft starts sending its APPEND_ENTRY messages early
enough so that topology info is not available either.
The situation is extra-complicated with the fact that sockets are not
created for individual verbs. Instead, verbs are grouped into several
"indices" and a socket is created for each index. Thus, the "gossiping"
index that includes non-gossiper verbs will create a topology-less socket
for all verbs in it. Worse -- raft sends messages w/o solicited topology,
the corresponding socket is created with the assumption that the peer
lives in the default dc and rack, which doesn't match the local node's
dc/rack, and the whole index group gets the "randomly" configured socket.
Also, the tcp-nodelay tries to implement similar check, but uses wrong
index of 1, so it's also fixed here.
"
* 'br-messaging-topology-ignoring-clients' of https://github.com/xemul/scylla:
messaging_service: Fix gossiper verb group
messaging_service: Mind the absence of topology data when creating sockets
messaging_service: Templatize and rename remove_rpc_client_one
These two are just getting in the way when touching inter-components
dependencies around messaging service. Without them, m.-s. start/stop
just looks like that of any other service out there.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #11535
When configuring tcp-nodelay unconditionally, messaging service thinks
gossiper uses group index 1, though it had changed some time ago and now
those verbs belong to group 0.
fixes: #11465
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When a socket is created to serve a verb there may be no topology
information regarding the target node. In this case current code
configures socket as if the peer node lived in "default" dc and rack of
the same name. If topology information appears later, the client is not
re-connected, even though it could then use a more relevant
configuration (e.g. -- w/o encryption).
This patch checks if the topology info is needed (sometimes it's not)
and if missing it configures the socket in the most restrictive manner,
but notes that the socket ignored the topology on creation. When
topology info appears -- and this happens when a node joins the cluster
-- the messaging service is kicked to drop all sockets that ignored the
topology, so that they reconnect later.
The mentioned "kick" comes from storage service on-join notification.
More correct fix would be if topology had on-change notification and
messaging service subscribed on it, but there are two cons:
- currently dc/rack do not change on the fly (though they can, e.g. if
gossiping property file snitch is updated without restart) and
topology update effectively comes from a single place
- updating topology on token-metadata is not like topology.update()
call. Instead, a clone of token metadata is created, then update
happens on the clone, then the clone is committed into t.m. Though
it's possible to find out at commit time which nodes changed their
topology, since it only happens on join this complexity is likely
not worth the effort (yet)
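The "ignored topology" bookkeeping can be sketched like this. It is a toy cache under assumptions: cross-DC encryption is taken to be configured, peers are keyed by a string, and `client_cache`, `get`, and `drop_topology_ignoring_clients` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

struct rpc_client {
    bool encrypted;
    bool topology_ignored; // created without knowing the peer's dc/rack
};

class client_cache {
    std::map<std::string, rpc_client> _clients;
public:
    // Returns the cached client or creates one. With unknown topology the
    // most restrictive settings are used and the fact is remembered.
    const rpc_client& get(const std::string& peer, bool topology_known,
                          bool same_dc) {
        auto it = _clients.find(peer);
        if (it == _clients.end()) {
            rpc_client c{!topology_known || !same_dc, !topology_known};
            it = _clients.emplace(peer, c).first;
        }
        return it->second;
    }

    // The "kick": drop every client that ignored topology on creation so
    // that the next get() reconnects with proper settings.
    std::size_t drop_topology_ignoring_clients() {
        std::size_t dropped = 0;
        for (auto it = _clients.begin(); it != _clients.end();) {
            if (it->second.topology_ignored) {
                it = _clients.erase(it);
                ++dropped;
            } else {
                ++it;
            }
        }
        return dropped;
    }
};
```

Clients created with known topology are left alone by the kick, so only the connections opened before the join pay the reconnect cost.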
fixes: #11514
fixes: #11492
fixes: #11483
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It actually finds and removes a client and in its new form it also
applies a filtering function to it, so a better name is called for.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When messaging_service::get_rpc_client() picks up a cached socket and
notices an error on it, it drops the connection and creates a new one.
The method used to drop the connection is the one that looks up the verb
index again, which is excessive. Tune this up while at it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are two methods to close an RPC socket in m.s. -- one that's
called on error path of messaging_service::send_... and the other one
that's called upon gossiper down/leave/cql-off notifications. The former
one notifies listeners about connection drop, the latter one doesn't.
The only listener is the storage-proxy which, in turn, kicks database to
release per-table cache hitrate data. As a result, when a node goes down
(or when an operator shuts down its transport) the hit-rate stats
regarding this node are leaked.
This patch moves notification so that any socket drop calls notification
and thus releases the hitrates.
fixes: #11497
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Messaging will need to call topology methods to compare DC/RACK of peers
with local node. Topology now resides on token metadata, so messaging
needs to get the dependency reference.
However, messaging only needs the topology when it's up and running, so
instead of producing a life-time reference, add a pointer, that's set up
on .start_listen(), before any client pops up, and is cleared on
.shutdown() after all connections are dropped.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When getting dc/rack snitch may perform two lookups -- first time it
does it using the provided IP, if nothing is found snitch assumes that
the IP is internal one, gets the corresponding public one and searches
again.
The thing is that the only code that may come to snitch with internal
IP is the messaging service. It does so in two places: when it tries
to connect to the given endpoint and when it accepts a connection.
In the former case messaging performs public->internal IP conversion
itself and goes to snitch with the internal IP value. This place can get
simpler by just feeding the public IP to snitch, and converting it to the
internal only to initiate the connection.
In the latter case the accepted IP can be either, but messaging service
has the public<->private map onboard and can do the conversion itself.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Start with a cluster with Raft disabled, end up with a cluster that performs
schema operations using group 0.
Design doc:
https://docs.google.com/document/d/1PvZ4NzK3S0ohMhyVNZZ-kCxjkK5URmz1VP65rrkTOCQ/
(TODO: replace this with .md file - we can do it as a follow-up)
The procedure, on a high level, works as follows:
- join group 0
- wait until every peer joined group 0 (peers are taken from `system.peers`
table)
- enter `synchronize` upgrade state, in which group 0 operations are disabled
- wait until all members of group 0 entered `synchronize` state or some member
entered the final state
- synchronize schema by comparing versions and pulling if necessary
- enter the final state (`use_new_procedures`), in which group 0 is used for
schema operations.
With the procedure comes a recovery mode in case the upgrade procedure gets
stuck (and it may if we lose a node during recovery - the procedure, to
correctly establish a single group 0 cluster, requires contacting every node).
This recovery mode can also be used to recover clusters with group 0 already
established if they permanently lose a majority of nodes - killing two birds with
one stone. Details in the last commit message.
Read the design doc, then read the commits in topological order
for best reviewing experience.
---
I did some manual tests: upgrading a cluster, using the cluster to add nodes,
remove nodes (both with `decommission` and `removenode`), replacing nodes.
Performing recovery.
As a follow-up, we'll need to implement tests using the new framework (after
it's ready). It will be easy to test upgrades and recovery even with a single
Scylla version - we start with a cluster with the RAFT flag disabled, then
rolling-restart while enabling the flag (and recovery is done through simple
CQL statements).
Closes #10835
* github.com:scylladb/scylladb:
service/raft: raft_group0: implement upgrade procedure
service/raft: raft_group0: extract `tracker` from `persistent_discovery::run`
service/raft: raft_group0: introduce local loggers for group 0 and upgrade
service/raft: raft_group0: introduce GET_GROUP0_UPGRADE_STATE verb
service/raft: raft_group0_client: prepare for upgrade procedure
service/raft: introduce `group0_upgrade_state`
db: system_keyspace: introduce `load_peers`
idl-compiler: introduce cancellable verbs
message: messaging_service: cancellable version of `send_schema_check`