Commit Graph

653 Commits

Author SHA1 Message Date
Patryk Jędrzejczak
e5e8b970ed join_token_ring, gossip topology: recalculate sync nodes in wait_alive
Before this patch, if we booted a node just after removing
a different node, the booting node may still see the removed node
as NORMAL and wait for it to be UP, which would time out and fail
the bootstrap.

This issue caused scylladb/scylladb#17526.

Fix it by recalculating the nodes to wait for in every step of the
`wait_alive` loop.
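
A plain-C++ sketch of the shape of this fix (hypothetical helpers `get_nodes_to_wait_for` and `all_alive`; the real code is a Seastar coroutine inside storage_service):

```cpp
#include <chrono>
#include <set>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical helpers: the current set of NORMAL token owners to wait for,
// and whether all of them are currently marked UP.
std::set<std::string> get_nodes_to_wait_for();
bool all_alive(const std::set<std::string>& nodes);

void wait_alive(std::chrono::seconds timeout) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    // The set is recalculated on every iteration, so a node that was removed
    // from the cluster while we wait drops out and no longer blocks bootstrap.
    while (!all_alive(get_nodes_to_wait_for())) {
        if (std::chrono::steady_clock::now() > deadline) {
            throw std::runtime_error("timed out waiting for nodes to be UP");
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
}
```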

(cherry picked from commit 017134fd38)
2024-06-21 12:05:42 +00:00
Benny Halevy
796ca367d1 gossiper: rename topo_sm member to _topo_sm
Follow the Scylla convention for class member naming.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#18528
2024-05-12 11:02:35 +03:00
Gleb Natapov
3b40d450e5 gossiper: try to locate an endpoint by the host id when applying state if search by IP fails
Even if there is no endpoint for the given IP, the state can still belong to an existing
endpoint that was restarted with a different IP, so let's try to locate the endpoint by
host id as well. Do it in raft topology mode only, so as not to impact gossiper mode.
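
A hedged sketch of the lookup order described above (plain C++ with stand-in types and hypothetical helpers `find_by_ip` / `find_by_host_id`; the real state-application code differs):

```cpp
#include <string>

// Stand-in types for illustration only.
struct endpoint_state;
using inet_address = std::string;
using host_id = std::string;

// Hypothetical helpers representing the two lookup paths.
endpoint_state* find_by_ip(const inet_address& ip);
endpoint_state* find_by_host_id(const host_id& id);

endpoint_state* find_endpoint_for_update(const inet_address& ip, const host_id& id,
                                         bool raft_topology_mode) {
    if (auto* eps = find_by_ip(ip)) {
        return eps;                       // usual case: the IP is known
    }
    if (raft_topology_mode) {
        // The node may have restarted with a different IP; try the host id,
        // but only in raft topology mode so gossiper mode is unaffected.
        return find_by_host_id(id);
    }
    return nullptr;
}
```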

Also make the test more robust in detecting a wrong number of entries in
the peers table. Today it may miss a wrong entry there
because the map squashes two entries for the same host id into one.

Fixes: scylladb/scylladb#18419
Fixes: scylladb/scylladb#18457
2024-05-09 13:14:54 +02:00
Gleb Natapov
06e6ed09ed gossiper: disable status check for endpoints in raft mode
The gossiper automatically removes endpoints that do not have tokens in
normal state and either do not send gossiper updates or have been dead
for a long time. We do not need this in topology coordinator mode, since
in this mode the coordinator is responsible for managing the set of
nodes in the cluster. In addition, the patch disables quarantined
endpoint maintenance in the gossiper in raft mode and uses the left-node
list from the topology coordinator to ignore updates for nodes that are
no longer part of the topology.
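
A rough sketch of the guard this implies (flag name and structure are assumptions, not the actual gossiper code):

```cpp
// Illustrative only: in raft/topology-coordinator mode the periodic status
// check skips endpoint eviction, since membership is owned by the coordinator.
void do_status_check(bool raft_topology_mode) {
    if (raft_topology_mode) {
        return;   // the coordinator manages the node set; nothing to evict here
    }
    // ... legacy path: evict endpoints without normal tokens that are silent
    // or have been dead for a long time, and maintain quarantined endpoints ...
}
```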
2024-04-21 16:36:07 +03:00
Kefu Chai
372a4d1b79 treewide: do not define FMT_DEPRECATED_OSTREAM
since we do not rely on FMT_DEPRECATED_OSTREAM to define the
fmt::formatter for us anymore, let's stop defining `FMT_DEPRECATED_OSTREAM`.

in this change,

* utils: drop the range formatters in to_string.hh and to_string.cc, as
  we don't use them anymore; the tests for them in
  test/boost/string_format_test.cc are removed accordingly.
* utils: use fmt to print chunk_vector and small_vector, as
  we are not able to print their elements using operator<< anymore
  after switching to {fmt} formatters.
* test/boost: specialize fmt::detail::is_std_string_like<bytes>.
  due to a bug in {fmt} v9, {fmt} fails to format a range whose
  element type is `basic_sstring<uint8_t>`, as it considers it
  a string-like type even though its character type is not plain
  char. this issue does not exist in {fmt} v10, so, in this change,
  we add a workaround that explicitly specializes the type trait to
  ensure that {fmt} formats this type using its `fmt::formatter`
  specialization instead of trying to format it as a string. also,
  {fmt}'s generic ranges formatter calls the pair formatter's
  `set_brackets()` and `set_separator()` methods when printing the
  range, but an operator<< based formatter does not provide these
  methods, so we have to include this change in the change switching
  to {fmt}, otherwise the change specializing
  `fmt::detail::is_std_string_like<bytes>` won't compile.
* test/boost: in tests, we use `BOOST_REQUIRE_EQUAL()` and its friends
  for comparing values. but without the operator<< based formatters,
  Boost.Test would not be able to print them. after removing
  the homebrew formatters, we need to use the generic
  `boost_test_print_type()` helper to do this job. so we are
  including `test_utils.hh` in tests so that we can print
  the formattable types.
* treewide: add "#include "utils/to_string.hh" where
  `fmt::formatter<optional<>>` is used.
* configure.py: do not define FMT_DEPRECATED_OSTREAM
* cmake: do not define FMT_DEPRECATED_OSTREAM
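
As an illustration of the general pattern this series moves to (a minimal, self-contained sketch with a hypothetical `my_type`; not code from this change):

```cpp
#include <fmt/format.h>
#include <string_view>

struct my_type { int value; };

// Instead of relying on FMT_DEPRECATED_OSTREAM to synthesize a formatter from
// operator<<, the type gets an explicit fmt::formatter specialization.
template <>
struct fmt::formatter<my_type> : fmt::formatter<std::string_view> {
    auto format(const my_type& t, fmt::format_context& ctx) const {
        return fmt::format_to(ctx.out(), "my_type({})", t.value);
    }
};

int main() {
    fmt::print("{}\n", my_type{42});   // prints: my_type(42)
}
```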

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-04-19 22:57:36 +08:00
Kefu Chai
a439ebcfce treewide: include fmt/ranges.h and/or fmt/std.h
before this change, we relied on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.

in this change, we include `fmt/ranges.h` and/or `fmt/std.h`
for formatting the container types, like vector, map,
optional and variant, using {fmt} instead of the homebrew
formatter based on operator<<.
together with the changes adding fmt::formatter specializations
and the changes using the ostream formatter explicitly, this
allows us to drop the `FMT_DEPRECATED_OSTREAM` macro.
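
A small stand-alone illustration of why the include matters (not from the tree): with `fmt/ranges.h` pulled in, containers format directly via {fmt} without any operator<<.

```cpp
#include <fmt/ranges.h>
#include <map>
#include <string>
#include <vector>

int main() {
    std::vector<int> v{1, 2, 3};
    std::map<int, std::string> m{{1, "one"}, {2, "two"}};
    fmt::print("{}\n", v);   // [1, 2, 3]
    fmt::print("{}\n", m);   // {1: "one", 2: "two"}
}
```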

Refs scylladb#13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-04-19 22:56:16 +08:00
Kamil Braun
eb9ba914a3 Merge 'Set dc and rack in gossiper when loaded from system.peers and load the ignored nodes state for replace' from Benny Halevy
The problem this series solves is correctly ignoring the state of DOWN
nodes when replacing a node.

When a node is replaced and there are other nodes that are down, the
replacing node is told to ignore those DOWN nodes using the
`ignore_dead_nodes_for_replace` option.

Since the replacing node is bootstrapping, it starts with an empty
system.peers table, so it has no notion of any node state, and it
learns about all other nodes via the gossip shadow round done in
`storage_service::prepare_replacement_info`.

Normally, since the DOWN nodes to ignore already joined the ring, the
remaining nodes will have their endpoint state already in gossip, but if
the whole cluster was restarted while those DOWN nodes did not start,
the remaining nodes will only have a partial endpoint state for them,
which is loaded from system.peers.

Currently, the partial endpoint state contains only `HOST_ID` and
`TOKENS`, and in particular it lacks `STATUS`, `DC`, and `RACK`.

The first part of this series loads also `DC` and `RACK` from
system.peers to make them available to the replacing node as they are
crucial for building a correct replication map with network topology
replication strategy.

But still, without a `STATUS`, those nodes are not yet considered normal
token owners, and they do not go through handle_state_normal, which
adds them to the topology and token_metadata.

The second part of this series uses the endpoint state retrieved in the
gossip shadow round to explicitly add the ignored nodes' state to
topology (including dc and rack) and token_metadata (tokens) in
`prepare_replacement_info`.  If there are more DOWN nodes that are not
explicitly ignored, replace will fail (as it should).

Fixes scylladb/scylladb#15787

Closes scylladb/scylladb#15788

* github.com:scylladb/scylladb:
  storage_service: join_token_ring: load ignored nodes state if replacing
  storage_service: replacement_info: return ignore_nodes state
  locator: host_id_or_endpoint: keep value as variant
  gms: endpoint_state: add getters for host_id, dc_rack, and tokens
  storage_service: topology_state_load: set local STATUS state using add_saved_endpoint
  gossiper: add_saved_endpoint: set dc and rack
  gossiper: add_saved_endpoint: fixup indentation
  gossiper: add_saved_endpoint: make host_id mandatory
  gossiper: add load_endpoint_state
  gossiper: start_gossiping: log local state
2024-04-16 10:27:36 +02:00
Benny Halevy
239069eae5 storage_service: topology_state_load: set local STATUS state using add_saved_endpoint
When loading this node's endpoint state, if it has
tokens in token_metadata, its status can already be set
to normal.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-04-14 15:07:00 +03:00
Benny Halevy
6aaa1b0f48 gossiper: add_saved_endpoint: set dc and rack
When loading endpoint_state from system.peers,
pass the loaded node's dc/rack info from
storage_service::join_token_ring to gossiper::add_saved_endpoint.

Load the endpoint's DC/RACK information into the endpoint_state,
if available, so it can propagate to bootstrapping nodes
via gossip, even if those nodes are DOWN after a full cluster restart.

Note that this change makes the host_id presence
mandatory, following https://github.com/scylladb/scylladb/pull/16376.
The reason to do so is that the other states (tokens, dc, and rack)
are useless without the host_id.
This change is backward compatible since the HOST_ID application state
has been written to system.peers since Scylla's inception,
and it would be missing only due to a potential exception
in older versions that failed to write it.
In that case, manual intervention is needed and
the correct HOST_ID needs to be manually updated in system.peers.

Refs #15787

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-04-14 15:07:00 +03:00
Benny Halevy
468462aa73 gossiper: add_saved_endpoint: fixup indentation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-04-14 15:07:00 +03:00
Benny Halevy
b9e2aa4065 gossiper: add_saved_endpoint: make host_id mandatory
Require all callers to provide a valid host_id parameter.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-04-14 15:07:00 +03:00
Benny Halevy
1061455442 gossiper: add load_endpoint_state
Pack the topology-related data loaded from system.peers
in `gms::load_endpoint_state`, to be used in a following
patch for `add_saved_endpoint`.
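
A sketch of the shape of such an aggregate (field and type names here are illustrative stand-ins, not the actual `gms` definitions):

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_set>

// Stand-ins for the real ScyllaDB types; only the grouping matters here.
using host_id = std::string;        // host UUID as text
using token = int64_t;              // token value loaded from system.peers

struct dc_rack {
    std::string dc;
    std::string rack;
};

// Everything system.peers knows about a peer, packed so it can be handed to
// add_saved_endpoint in one piece.
struct load_endpoint_state {
    host_id id;
    std::unordered_set<token> tokens;
    std::optional<dc_rack> opt_dc_rack;   // DC/RACK, if recorded in system.peers
};
```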

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-04-14 15:06:56 +03:00
Benny Halevy
6b2d94045a gossiper: start_gossiping: log local state
Logging this only at trace level hides important information
about the initial node state in gossip.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-04-14 15:06:30 +03:00
Kamil Braun
72955093eb test: reproducer for missing gossiper updates
Regression test for scylladb/scylladb#17493.
2024-04-04 18:47:01 +02:00
Kamil Braun
a0b331b310 gossiper: lock local endpoint when updating heart_beat
In testing, we've observed multiple cases where nodes would fail to
observe updated application states of other nodes in gossiper.

For example:
- in scylladb/scylladb#16902, a node would finish bootstrapping and enter
NORMAL state, propagating this information through gossiper. However,
other nodes would never observe that the node entered NORMAL state,
still thinking that it is in joining state. This would lead to further
bad consequences down the line.
- in scylladb/scylladb#15393, a node got stuck in bootstrap, waiting for
schema versions to converge. Convergence would never be achieved and the
test eventually timed out. The node was observing outdated schema state
of some existing node in gossip.

I created a test that would bootstrap 3 nodes, then wait until they all
observe each other as NORMAL, with timeout. Unfortunately, thousands of
runs of this test on different machines failed to reproduce the problem.

After banging my head against the wall failing to reproduce, I decided
to sprinkle randomized sleeps across multiple places in gossiper code
and finally: the test started catching the problem in about 1 in 1000
runs.

With additional logging and additional head-banging, I determined
the root cause.

The following scenario can happen, 2 nodes are sufficient, let's call
them A and B:
- Node B calls `add_local_application_state` to update its gossiper
  state, for example, to propagate its new NORMAL status.
- `add_local_application_state` takes a copy of the endpoint_state, and
  updates the copy:
```
            auto local_state = *ep_state_before;
            for (auto& p : states) {
                auto& state = p.first;
                auto& value = p.second;
                value = versioned_value::clone_with_higher_version(value);
                local_state.add_application_state(state, value);
            }
```
  `clone_with_higher_version` bumps `version` inside
  gms/version_generator.cc.
- `add_local_application_state` calls `gossiper.replicate(...)`
- `replicate` works in 2 phases to achieve exception safety: in first
  phase it copies the updated `local_state` to all shards into a
  separate map. In second phase the values from separate map are used to
  overwrite the endpoint_state map used for gossiping.

  Due to the cross-shard calls of the first phase, there is a yield before
  the second phase. *During this yield* the following happens:
- `gossiper::run()` loop on B executes and bumps node B's `heart_beat`.
  This uses the monotonic version_generator, so it uses a higher version
  than the ones we used for the states added above. Let's call this new version
  X. Note that X is larger than the versions used by application_states
  added above.
- now node B handles a SYN or ACK message from node A, creating
  an ACK or ACK2 message in response. This message contains:
    - old application states (NOT including the update described above,
      because `replicate` is still sleeping before phase 2),
    - but bumped heart_beat == X from `gossiper::run()` loop,
  and sends the message.
- node A receives the message and remembers that the max
  version across all states (including heart_beat) of node B is X.
  This means that it will no longer request or apply states from node B
  with versions smaller than X.
- `gossiper.replicate(...)` on B wakes up and overwrites the
  endpoint_state with the copies it saved in phase 1. In particular it
  reverts heart_beat back to a smaller value, but the larger problem is that it
  saves updated application_states that use versions smaller than X.
- now when node B sends the updated application_states in ACK or ACK2
  message to node A, node A will ignore them, because their versions are
  smaller than X. Or node B will never send them, because whenever node
  A requests states from node B, it only requests states with versions >
  X. Either way, node A will fail to observe new states of node B.

If I understand correctly, this is a regression introduced in
38c2347a3c, which introduced a yield in
`replicate`. Before that, the updated state would be saved atomically on
shard 0; there could be no `heart_beat` bump in between making a copy of
the local state, updating it, and then saving it.

With the description above, it's easy to make a consistent
reproducer for the problem -- introduce a longer sleep in
`add_local_application_state` before the second phase of replicate, to
increase the chance that the gossiper loop will execute and bump the
heart_beat version during the yield. A further commit adds a test based on that.

The fix is to bump the heart_beat under the local endpoint lock, which is
also taken by `replicate`.
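
A condensed sketch of the fix's shape (plain C++ with a `std::mutex` standing in for the gossiper's per-endpoint lock; not the actual code):

```cpp
#include <algorithm>
#include <mutex>

// The same per-endpoint lock serializes the heart_beat bump and the two-phase
// state replication, so a bump can no longer slip in between "copy the local
// state" and "publish the copy".
struct local_endpoint {
    std::mutex lock;                 // stands in for the gossiper's endpoint lock
    int heart_beat_version = 0;
    int published_version = 0;
};

// Called from the gossiper's periodic loop.
void bump_heart_beat(local_endpoint& ep, int next_version) {
    std::lock_guard<std::mutex> g(ep.lock);
    ep.heart_beat_version = next_version;
}

// Called from replicate(): publish the versions taken while holding the lock.
void replicate_states(local_endpoint& ep, int states_version) {
    std::lock_guard<std::mutex> g(ep.lock);
    ep.published_version = std::max(ep.published_version, states_version);
}
```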

Fixes: scylladb/scylladb#15393
Fixes: scylladb/scylladb#15602
Fixes: scylladb/scylladb#16668
Fixes: scylladb/scylladb#16902
Fixes: scylladb/scylladb#17493
Fixes: scylladb/scylladb#18118
Ref: scylladb/scylla-enterprise#3720
2024-04-04 18:46:56 +02:00
Piotr Dulikowski
2d9e78b09a gossiper: failure detector: don't handle directly removed live endpoints
Commit 0665d9c346 changed the gossiper
failure detector in the following way: when live endpoints change
and per-node failure detectors finish their loops, the main failure
detector calls gossiper::convict for those nodes which were alive when
the current iteration of the main FD started but now are not. This was
changed in order to make sure that nodes are marked as down, because
some other code in gossiper could concurrently remove nodes from
the live node lists without marking them properly.

This was committed around 3 years ago and the situation changed:

- After 75d1dd3a76
  the `endpoint_state::_is_alive` field was removed and liveness
  of a node is solely determined by its presence
  in the `gossiper::_live_endpoints` field.
- Currently, all gossiper code which modifies `_live_endpoints`
  takes care to trigger the relevant callbacks. The only function which
  modifies the field but does not trigger notifications
  is `gossiper::evict_from_membership`, but it is either called
  after `gossiper::remove_endpoint` which triggers callbacks
  by itself, or when a node is already dead and there is no need
  to trigger callbacks.

So, it looks like the reasons it was introduced for are not relevant
anymore. What's more important though is that it is involved in a bug
described in scylladb/scylladb#17515. In short, the following sequence
of events may happen:

1. Failure detector for some remote node X decides that it was dead
   long enough and `convict`s it, causing live endpoints to be updated.
2. The gossiper main loop sends an echo to X, gets a successful response,
  and *decides* to mark it as alive.
3. At the same time, the failure detectors for all nodes other than X finish
  and the main failure detector continues; it notices that node X is
  not alive (because it was convicted in point 1) and *decides*
  to convict it.
4. Actions planned in 2 and 3 run one after another, i.e. the node is first
  marked as alive and then immediately as dead.

This causes `on_alive` callbacks to run first and then `on_dead`. The
second one is problematic as it closes RPC connections to node X - in
particular, if X is in the process of replacing another node with the
same IP then it may cause the replace operation to fail.

In order to simplify the code and fix the bug - remove the piece
of logic in question.

Fixes: scylladb/scylladb#17515

Closes scylladb/scylladb#17754
2024-03-14 13:29:17 +01:00
Avi Kivity
dd76e1c834 Merge 'Simplify error_injection::inject_with_handler()' from Pavel Emelyanov
The method in question can have a shorter name that matches all the other injections in this class, and can be made non-template.

Closes scylladb/scylladb#17734

* github.com:scylladb/scylladb:
  error_injection: De-template inject() with handler
  error_injection: Overload inject() instead of inject_with_handler()
2024-03-14 13:37:54 +02:00
Pavel Emelyanov
488404e080 gms: Remove unused i_failure_detection_event_listener
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#17765
2024-03-13 09:33:56 +02:00
Pavel Emelyanov
1f44a374b8 error_injection: Overload inject() instead of inject_with_handler()
The inject_with_handler() method accepts a coroutine that can be called
with an injection_handler. With such a function as an argument, there's no
need for a distinctive inject_with_handler() name; the method can be an
overload of the existing inject()-s.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-03-11 19:30:19 +03:00
Benny Halevy
9804ce79d8 gossiper: do_status_check: fixup indentation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-03-10 20:17:00 +02:00
Benny Halevy
1375c4e6a3 gossiper: do_status_check: allow evicting dead nodes from membership with no host_id
Be more permissive about a missing host_id
application state for dead and expired nodes in release mode:
do not throw runtime_error in this case, but
rather consider them non-normal token owners.
Instead, call on_internal_error_noexcept, which will
log the internal error and a backtrace, and will abort
if abort-on-internal-error is set.

This was seen when replacing dead nodes,
without https://github.com/scylladb/scylladb/pull/15788

Fixes #16936

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-03-10 20:17:00 +02:00
Benny Halevy
f32efcb7a6 gossiper: print the host_id when endpoint state goes UP/DOWN
The host_id is now used in token_metadata
and in raft topology changes so print it
when the gossiper marks the node as UP/DOWN.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-03-10 20:17:00 +02:00
Benny Halevy
fbf85ee199 gossiper: get_host_id: differentiate between no endpoint_state and no application_state
Currently, we throw the same runtime_error:
`Host {} does not have HOST_ID application_state`
in both cases: when there is no endpoint_state
and when the endpoint_state has no HOST_ID
application state.

The latter case is unexpected, especially
after 8ba0decda5
(and also from the add_saved_endpoint path
after https://github.com/scylladb/scylladb/pull/15788
is merged), so throw a different error in each case
so we can tell them apart in the logs.
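
A sketch of the resulting distinction (error texts, types, and helper names are illustrative stand-ins, not the gossiper's actual declarations):

```cpp
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for the gossiper's types and lookups.
struct endpoint_state { std::optional<std::string> host_id; };
const endpoint_state* get_endpoint_state_ptr(const std::string& ep);

std::string get_host_id(const std::string& ep) {
    const auto* state = get_endpoint_state_ptr(ep);
    if (!state) {
        // Case 1: we know nothing about this endpoint at all.
        throw std::runtime_error("Host " + ep + " has no endpoint_state");
    }
    if (!state->host_id) {
        // Case 2: the endpoint_state exists but lacks the HOST_ID state.
        throw std::runtime_error("Host " + ep + " has no HOST_ID application_state");
    }
    return *state->host_id;
}
```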

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-03-10 20:16:49 +02:00
Benny Halevy
234774295e gossiper: do_status_check: continue loop after evicting FatClient
We're seeing cases like #16936:
```
INFO  2024-01-23 02:14:19,915 [shard 0:strm] gossip - failure_detector_loop: Mark node 127.0.23.4 as DOWN
INFO  2024-01-23 02:14:19,915 [shard 0:strm] gossip - InetAddress 127.0.23.4 is now DOWN, status = BOOT
INFO  2024-01-23 02:14:27,913 [shard 0: gms] gossip - FatClient 127.0.23.4 has been silent for 30000ms, removing from gossip
INFO  2024-01-23 02:14:27,915 [shard 0: gms] gossip - Removed endpoint 127.0.23.4
WARN  2024-01-23 02:14:27,916 [shard 0: gms] gossip - === Gossip round FAIL: std::runtime_error (Host 127.0.23.4 does not have HOST_ID application_state)
```

Since the FatClient timeout handling already evicts the endpoint
from membership, there is no need to check further whether the
node is dead and expired, so just co_return.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-03-10 15:19:51 +02:00
Benny Halevy
f86a5072d6 gossiper: add_local_application_state: drop internal error
After 1d07a596bf, which
dropped before_change notifications, there is no sense
in getting the local endpoint_state_ptr twice (before
and after the notifications) and calling on_internal_error
if the state isn't found after the notifications.

Just throw the runtime_error if the endpoint state is not
found; otherwise, use it.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-02-11 13:33:26 +02:00
Kefu Chai
005d231f96 db: add formatter for gms::application_state
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.

in this change, we define formatters for `gms::application_state`,
but its operator<< is preserved, as it is still used by the generic
homebrew formatter for `std::unordered_map<>`.

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#17096
2024-02-01 10:02:25 +02:00
Kamil Braun
cf646022cb gossiper: report error when waiting too long for endpoint lock
In a longevity test reported in scylladb/scylladb#16668 we observed that
NORMAL state is not being properly handled for a node that replaced
another node. Either handle_state_normal is not being called, or it is
called but gets stuck in the middle. Which of the two is the case could
not be determined from the logs, and attempts at creating a local
reproducer failed.

One hypothesis is that `gossiper` is stuck on `lock_endpoint`. We dealt
with gossiper deadlocks in the past (e.g. scylladb/scylladb#7127).

Modify the code so it reports an error if `lock_endpoint` waits for the
lock for more than a minute. When the issue reproduces again in
longevity, we will see if `lock_endpoint` got stuck.
2024-01-11 17:29:25 +01:00
Kamil Braun
6e39c2ffde gossiper: store source_location instead of string in endpoint_permit
The original code extracted only the function_name from the
source_location for logging. We'll use more information from the
source_location in later commits.
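
A small sketch of the idea (using std::source_location as a default argument; the actual permit type and lock signature differ):

```cpp
#include <source_location>
#include <string>

// Keep the whole source_location rather than just the function name, so later
// logging can also report the file and line of the caller that took the permit.
struct endpoint_permit {
    std::source_location caller;
};

endpoint_permit lock_endpoint(const std::string& /*endpoint*/,
        std::source_location loc = std::source_location::current()) {
    // ... acquire the per-endpoint lock here ...
    return endpoint_permit{loc};
}
```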
2024-01-10 17:02:52 +01:00
Kamil Braun
f942bf4a1f Merge 'Do not update endpoint state via gossiper::add_saved_endpoint once it was updated via gossip' from Benny Halevy
Currently, `add_saved_endpoint` is called from two paths: one is when
loading states from system.peers in the join path (join_cluster,
join_token_ring), when `_raft_topology_change_enabled` is false, and the
other is from `storage_service::topology_state_load` when raft topology
changes are enabled.

In the latter path, from `topology_state_load`, `add_saved_endpoint` is
called only if the endpoint_state does not exist yet.  However, this is
checked without acquiring the endpoint_lock and so it races with the
gossiper, and once `add_saved_endpoint` acquires the lock, the endpoint
state may already be populated.

Since `add_saved_endpoint` applies local information about the endpoint
state (e.g. tokens, dc, rack), it uses the local heart_beat_version,
with generation=0, to update the endpoint states, and that is
incompatible with changes applied via gossip, which carry the
endpoint's generation and version, determining the state's update order.

This change makes sure that the endpoint state is never updated in
`add_saved_endpoint` if it has a non-zero generation.  An internal error
exception is thrown if a non-zero generation is found, and in the only
call site that might reach that state, in
`storage_service::topology_state_load`, the caller acquires the
endpoint_lock to check for the existence of the endpoint_state,
calling `add_saved_endpoint` under the lock only if the endpoint_state
does not exist.

Fixes #16429

Closes scylladb/scylladb#16432

* github.com:scylladb/scylladb:
  gossiper: add_saved_endpoint: keep heart_beat_state if ep_state is found
  storage_service: topology_state_load: lock endpoint for add_saved_endpoint
  raft_group_registry: move on_alive error injection to gossiper
2024-01-04 14:47:10 +01:00
Benny Halevy
9e8998109f gossiper: get_*_members_synchronized: acquire endpoint update semaphore
To ensure that the value they return is synchronized on all shards.

This got broken recently by 147f30caff.

Refs https://github.com/scylladb/scylladb/pull/16597#discussion_r1440445432

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#16629
2024-01-03 17:41:46 +01:00
Benny Halevy
147f30caff gossiper: mutate_live_and_unreachable_endpoints: make exception safe
Change the mutate_live_and_unreachable_endpoints procedure
so that the called `func` would mutate a cloned
`live_and_unreachable_endpoints` object in place.

Those are replicated to temporary copies on all shards
using `foreign<unique_ptr<>>` so that they would be
automatically freed on exception.

Only after all copies are made are they applied
on all gossiper shards in a noexcept loop,
and finally, an `on_success` function is called
to apply further side effects if everything else
was replicated successfully.

The latter is still susceptible to exceptions,
but we can live with those as long as `_live_endpoints`
and `_unreachable_endpoints` are synchronized on all shards.

With that, the read-only methods:
`get_live_members_synchronized` and
`get_unreachable_members_synchronized`
become trivial and they just return the required data
from shard 0.

Fixes #15089

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#16597
2024-01-03 14:46:10 +02:00
Benny Halevy
ad8a9104d8 endpoint_state subscriptions: batch on_change notification
Rather than calling on_change for each particular
application_state, pass an endpoint_state::map_type
with all changed states, to be processed as a batch.

In particular, this allows storage_service::on_change
to call update_peer_info once for all changed states.
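
A sketch of the interface shift (type and method names approximate the ones mentioned above, not the exact declarations):

```cpp
#include <map>
#include <string>

// Illustrative stand-ins.
using application_state = int;
using versioned_value = std::string;
using inet_address = std::string;
using application_state_map = std::map<application_state, versioned_value>;

struct i_endpoint_state_change_subscriber {
    // Before: one callback per changed state, e.g.
    // virtual void on_change(inet_address, application_state, const versioned_value&) = 0;

    // After: the whole batch of changed states in a single callback, so a
    // subscriber like storage_service can update peer info just once.
    virtual void on_change(inet_address ep, const application_state_map& states) = 0;
    virtual ~i_endpoint_state_change_subscriber() = default;
};
```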

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-12-31 18:37:34 +02:00
Benny Halevy
1d07a596bf everywhere: drop before_change subscription
None of the subscribers does anything in before_change.
This is done before changing `on_change` in the following patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-12-31 18:37:34 +02:00
Benny Halevy
5abf556399 gms: endpoint_state: define application_state_map
Have a central definition for the map held
in the endpoint_state (before changing it to
std::unordered_map).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-12-31 18:37:34 +02:00
Benny Halevy
3cba079b26 gossiper: add_saved_endpoint: keep heart_beat_state if ep_state is found
Currently, when loading peers' endpoint state from system.peers,
add_saved_endpoint is called.
The first instance of the endpoint state is created with the default
heart_beat_state, with both generation and version set to zero.
However, if add_saved_endpoint finds an existing instance of the
endpoint state, it reuses it, but it updates its heart_beat_state
with the local heart_beat_state() rather than keeping the existing
heart_beat_state, as it should.

This is a problem since it may confuse later updates over gossip
via do_apply_state_locally, which compares the remote
generation vs. the local generation; both must stem from
the same source, namely the endpoint itself.

Fixes #16429

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-12-31 16:48:57 +02:00
Benny Halevy
3099c5b8ab storage_service: topology_state_load: lock endpoint for add_saved_endpoint
`topology_state_load` currently calls `add_saved_endpoint`
only if it finds no endpoint_state_ptr for the endpoint.
However, this is done before locking the endpoint
and the endpoint state could be inserted concurrently.

To prevent that, a permit_id parameter was added to
`add_saved_endpoint` allowing the caller to call it
while the endpoint is locked.  With that, `topology_state_load`
locks the endpoint and checks the existence of the endpoint state
under the lock, before calling `add_saved_endpoint`.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-12-31 16:48:57 +02:00
Benny Halevy
db434e8cb5 raft_group_registry: move on_alive error injection to gossiper
Move the `raft_group_registry::on_alive` error injection point
to `gossiper::real_mark_alive` so it can delay marking the endpoint as
alive, and calling the `on_alive` callback, but without holding
the endpoint_lock.

Note that the entry for this endpoint in `_pending_mark_alive_endpoints`
still blocks marking it as alive until real_mark_alive completes.

Fixes #16506

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-12-31 15:28:54 +02:00
Petr Gusev
7b55ccbd8e token_metadata: drop the template
Replace token_metadata2 -> token_metadata,
and make token_metadata non-template again.

No behavior changes, just compilation fixes.
2023-12-12 23:19:54 +04:00
Petr Gusev
799f747c8f shared_token_metadata: switch to the new token_metadata 2023-12-12 23:19:54 +04:00
Petr Gusev
c7314aa8e2 gossiper: use new token_metadata 2023-12-12 23:19:53 +04:00
Botond Dénes
d2a88cd8de Merge 'Typos: fix typos in code' from Yaniv Kaul
Fixes some more typos found by a codespell run on the code. This commit covers the more user-visible errors.

Refs: https://github.com/scylladb/scylladb/issues/16255

Closes scylladb/scylladb#16289

* github.com:scylladb/scylladb:
  Update unified/build_unified.sh
  Update main.cc
  Update dist/common/scripts/scylla-housekeeping
  Typos: fix typos in code
2023-12-06 07:36:41 +02:00
Yaniv Kaul
ae2ab6000a Typos: fix typos in code
Fixes some more typos found by a codespell run on the code.
This commit covers the more user-visible errors.

Refs: https://github.com/scylladb/scylladb/issues/16255
2023-12-05 15:18:11 +02:00
Benny Halevy
25754f843b gossiper: add get_this_endpoint_state_ptr
Returns this node's endpoint_state_ptr.
With this entry point, the caller doesn't need to
call get_broadcast_address.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-12-05 08:42:49 +02:00
Yaniv Kaul
c658bdb150 Typos: fix typos in comments
Fixes some typos found by a codespell run on the code.
In this commit, I was hoping to fix only comments, not user-visible alerts, output, etc.
Follow-up commits will take care of them.

Refs: https://github.com/scylladb/scylladb/issues/16255
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2023-12-02 22:37:22 +02:00
Kefu Chai
ef76c4566b gossiper: do not use {:d} fmt specifier when formatting generation_number
generation_number's type is `generation_type`, which in turn is a
`utils::tagged_integer<struct generation_type_tag, int32_t>`,
which is formatted by fmtlib using an ostream_formatter backed by
operator<<. but `ostream_formatter` does not provide specifier
support, so {:d} does not apply to this type; when compiling with fmtlib
v10, it rejects the format specifier (the error is attached at the end
of the commit message).

so in this change, we just drop the format specifier. fmtlib prints
`int32_t` as a decimal integer anyway, so even if {:d} applied, it would
not change the behavior.

```
/home/kefu/dev/scylladb/gms/gossiper.cc:1798:35: error: call to consteval function 'fmt::basic_format_string<char, utils::tagged_tagged_integer<utils::final, gms::generation_type_tag, int> &, utils::tagged_tagged_integer<utils::final, gms::generation_type_tag, int> &>::basic_format_string<char[48], 0>' is not a constant expression
 1798 |                 auto err = format("Remote generation {:d} != local generation {:d}", remote_gen, local_gen);
      |                                   ^
/usr/include/fmt/core.h:2322:31: note: non-constexpr function 'throw_format_error' cannot be used in a constant expression
 2322 |       if (!in(arg_type, set)) throw_format_error("invalid format specifier");
      |                               ^
/usr/include/fmt/core.h:2395:14: note: in call to 'parse_presentation_type.operator()(1, 510)'
 2395 |       return parse_presentation_type(pres::dec, integral_set);
      |              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/include/fmt/core.h:2706:9: note: in call to 'parse_format_specs<char>(&"Remote generation {:d} != local generation {:d}"[20], &"Remote generation {:d} != local generation {:d}"[47], formatter<mapped_type, char_type>().formatter::specs_, checker(s).context_, 13)'
 2706 |         detail::parse_format_specs(ctx.begin(), ctx.end(), specs_, ctx, type);
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/include/fmt/core.h:2561:10: note: in call to 'formatter<mapped_type, char_type>().parse<fmt::detail::compile_parse_context<char>>(checker(s).context_)'
 2561 |   return formatter<mapped_type, char_type>().parse(ctx);
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/include/fmt/core.h:2647:39: note: in call to 'parse_format_specs<utils::tagged_tagged_integer<utils::final, gms::generation_type_tag, int>, fmt::detail::compile_parse_context<char>>(checker(s).context_)'
 2647 |     return id >= 0 && id < num_args ? parse_funcs_[id](context_) : begin;
      |                                       ^~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/include/fmt/core.h:2485:15: note: in call to 'handler.on_format_specs(0, &"Remote generation {:d} != local generation {:d}"[20], &"Remote generation {:d} != local generation {:d}"[47])'
 2485 |       begin = handler.on_format_specs(adapter.arg_id, begin + 1, end);
      |               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/include/fmt/core.h:2541:13: note: in call to 'parse_replacement_field<char, fmt::detail::format_string_checker<char, utils::tagged_tagged_integer<utils::final, gms::generation_type_tag, int>, utils::tagged_tagged_integer<utils::final, gms::generation_type_tag, int>> &>(&"Remote generation {:d} != local generation {:d}"[19], &"Remote generation {:d} != local generation {:d}"[47], checker(s))'
 2541 |     begin = parse_replacement_field(p, end, handler);
      |             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/include/fmt/core.h:2769:7: note: in call to 'parse_format_string<true, char, fmt::detail::format_string_checker<char, utils::tagged_tagged_integer<utils::final, gms::generation_type_tag, int>, utils::tagged_tagged_integer<utils::final, gms::generation_type_tag, int>>>({&"Remote generation {:d} != local generation {:d}"[0], 47}, checker(s))'
 2769 |       detail::parse_format_string<true>(str_, checker(s));
      |       ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/kefu/dev/scylladb/gms/gossiper.cc:1798:35: note: in call to 'basic_format_string<char[48], 0>("Remote generation {:d} != local generation {:d}")'
 1798 |                 auto err = format("Remote generation {:d} != local generation {:d}", remote_gen, local_gen);
      |                                   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```
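
The before/after of the format string is a one-liner; a stand-alone illustration (not the actual gossiper call site, where the arguments are generation_type values):

```cpp
#include <cstdint>
#include <fmt/core.h>

int main() {
    int32_t remote_gen = 3, local_gen = 2;
    // Before: "Remote generation {:d} != local generation {:d}" -- rejected when
    // the argument's formatter is an ostream_formatter that parses no specifiers.
    // After: plain "{}" -- an int32_t is printed as a decimal integer anyway.
    fmt::print("Remote generation {} != local generation {}\n", remote_gen, local_gen);
}
```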

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16126
2023-11-23 11:02:44 +02:00
Tomasz Grabiec
dc6a0b2c35 gossiper: Elevate logging level for node restart events
They cause connection drops, which are a significantly disruptive
event. We should log them so that we know they are the cause of
the problems that may follow, like requests timing out. A connection
drop will cause coordinator-side requests to time out in the absence
of speculation.

Refs #14746

Closes scylladb/scylladb#16018
2023-11-14 11:21:13 +02:00
Kamil Braun
15b441550b gossiper: do_shadow_round: increment nodes_down in case of timeout
Previously we would only increment `nodes_down` when getting
`rpc::closed_error`. Distinguishing between that and timeout is
unreliable. Consider:
1. if a node is dead but we can reach the IP, we'd get `closed_error`
2. if we cannot reach the IP (there's a network partition), the RPC
   would hang so we'd get `timeout_error`
3. if the node is both dead and the IP is unreachable, we'd get
   `timeout_error`

And there are probably other more complex scenarios as well. In general,
it is impossible to distinguish a dead node from a partitioned node in
asynchronous networks, and whether we end up with `closed_error` or
`timeout_error` is an implementation detail of the underlying protocol
that we use.

The fact that `nodes_down` was not incremented for timeouts would
prevent a node from starting if it cannot reach isolated IPs (whether or
not there were dead or alive nodes behind those IPs). This was observed
in a Jepsen test: https://github.com/scylladb/scylladb/issues/15675.

Note that `nodes_down` is only used to skip shadow round outside
bootstrap/replace, i.e. during restarts, where the shadow round was
"best effort" anyway (not mandatory). During bootstrap/replace it is now
mandatory.

Also fix grammar in the error message.
2023-11-06 10:28:08 +01:00
Kamil Braun
897cb6510e gossiper: do_shadow_round: fix nodes_down calculation
During shadow round we would calculate the number of nodes from which we
got `rpc::closed_error` using `nodes_counter`, and if the counter
reached the size of all contact points passed to shadow round, we would
skip the shadow round (and after the previous commit, we do it only in
the case of restart, not during bootstrap/replace which is unsafe).

However, shadow round might have multiple loops, and `nodes_down` was
initialized to `0` before the loop, then reused. So the same node might
be counted multiple times in `nodes_down`, and we might incorrectly
enter the skipping branch. Or we might go over `nodes.size()` and never
finish the loop.

Fix this by initializing `nodes_down = 0` inside the loop.
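
A trimmed-down sketch of the counting fix (hypothetical helpers `try_get_endpoint_states` and `shadow_round_complete`; the actual loop also handles timeouts and other errors):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helpers, standing in for the real shadow-round plumbing.
bool try_get_endpoint_states(const std::string& node);  // false on closed/timeout
bool shadow_round_complete();                            // learned what we needed?

// Simplified retry loop: the per-pass failure counter must be reset on every
// pass, otherwise earlier failures are counted again and the counter can both
// trigger the "skip" branch incorrectly and overshoot nodes.size().
bool should_skip_shadow_round(const std::vector<std::string>& nodes) {
    while (!shadow_round_complete()) {
        size_t nodes_down = 0;            // the fix: initialized inside the loop
        for (const auto& node : nodes) {
            if (!try_get_endpoint_states(node)) {
                ++nodes_down;
            }
        }
        if (nodes_down == nodes.size()) {
            return true;                  // every contact point failed this pass
        }
    }
    return false;
}
```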
2023-11-06 10:28:07 +01:00
Kamil Braun
b03fa87551 storage_service: make shadow round mandatory during bootstrap/replace
It is unsafe to bootstrap or perform replace without performing the
shadow round, which is used to obtain features from the existing cluster
and verify that we support all enabled features.

Before this patch, I could easily produce the following scenario:
1. bootstrap first node in the cluster
2. shut it down
3. start bootstrapping second node, pointing to the first as seed
4. the second node skips the shadow round because it gets
   `rpc::closed_error` when trying to connect to the first node.
5. the node then passes the feature check (!) and proceeds to the next
   step, where it waits for nodes to show up in gossiper
6. we now restart the first node, and the second node finishes bootstrap

The shadow round must be mandatory during bootstrap/replace, which is
what this patch does.

On restart it can remain optional as it was until now. In fact it should
be completely unnecessary during restart, but since we did it until now
(as best-effort), we can keep doing it.
2023-11-06 10:28:07 +01:00
Kamil Braun
108aae09c5 gossiper: do_shadow_round: remove fall_back_to_syn_msg
If during the shadow round we learned that a contact node does not
understand the GET_ENDPOINT_STATES verb, we'd fall back to the old
shadow round method (using gossiper SYN messages).

The verb was added a long time ago and it ended up in Scylla 4.3 and
2021.1. So in newer versions we can make it mandatory, as we don't
support skipping major versions during upgrades. Even if someone
attempted to, they would just get an error and could retry bootstrap
after finishing the upgrade.
2023-11-06 10:28:07 +01:00