Commit Graph

2123 Commits

Author SHA1 Message Date
Kefu Chai
24d14b601b treewide: s/boost::adaptors::map_values/std::views::values/
now that we are allowed to use C++23. we now have the luxury of using
`std::views::values`.

in this change, we:

- replace `boost::adaptors::map_values` with `std::views::values`
- update affected code to work with `std::views::values`
- the places where we use `boost::join()` are not changed, because
  we cannot use `std::views::concat` yet. this helper is only
  available in C++26.

to reduce the dependency to boost for better maintainability, and
leverage standard library features for better long-term support.

this change is part of our ongoing effort to modernize our codebase
and reduce external dependencies where possible.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21265
2024-10-27 21:32:45 +02:00
Benny Halevy
04d741bcbb storage_service: on_change: update_peer_info only if peer info changed
Return an optional peer_info from get_peer_info_for_update
when the `app_state_map` arg does not change peer_info,
so that we can skip calling update_peer_info, if it didn't
change.

Fixes scylladb/scylladb#20991
Refs scylladb/scylladb#16376

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#21152
2024-10-22 10:26:08 +02:00
Kefu Chai
6ead5a4696 treewide: move log.hh into utils/log.hh
the log.hh under the root of the tree was created keep the backward
compatibility when seastar was extracted into a separate library.
so log.hh should belong to `utils` directory, as it is based solely
on seastar, and can be used all subsystems.

in this change, we move log.hh into utils/log.hh to that it is more
modularized. and this also improves the readability, when one see
`#include "utils/log.hh"`, it is obvious that this source file
needs the logging system, instead of its own log facility -- please
note, we do have two other `log.hh` in the tree.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-10-22 06:54:46 +03:00
Kefu Chai
5cd619a60c treewide: s/boost::adaptors::map_keys/std::views::keys/
now that we are allowed to use C++23. we now have the luxury of using
`std::views::keys`.

in this change, we:

- replace `boost::adaptors::map_keys` with `std::views::keys`
- update affected code to work with `std::views::keys`

to reduce the dependency to boost for better maintainability, and
leverage standard library features for better long-term support.

this change is part of our ongoing effort to modernize our codebase
and reduce external dependencies where possible.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21198
2024-10-21 12:47:52 +03:00
Avi Kivity
c3be2489ce treewide: drop includes of <boost/range/adaptors.hpp>
This includes way too much, including <boost/regex.hpp>, which is huge.
Drop includes of adaptors.hpp and replace by what is needed.

Closes scylladb/scylladb#21187
2024-10-20 17:17:11 +03:00
Emil Maskovsky
a840949ea0 treewide: code cleanup and refactoring
Fix the clang-tidy warnings, code cleanup and improvements.

Applied the clang format to the updated places.
2024-10-08 20:53:54 +02:00
Kamil Braun
1b9337bf99 Merge 'Wait for all users of group0 server to complete before destroying it' from Gleb Natapov
Group0 server is often used in asynchronous context, but we do not wait
for them to complete before destroying the server. We already have
shutdown gate for it, so lets use it in those asynch functions.

Also make sure to signal group0 abort source if initialization fails.

Fixes scylladb/scylladb#20701

Backport to 6.2 since it contains af83c5e53e and it made the race easier to hit, so tests became flaky.

Closes scylladb/scylladb#20891

* github.com:scylladb/scylladb:
  group: hold group0 shutdown gate during async operations
  group0: Stop group0 if node initialization fails
2024-10-08 13:46:54 +02:00
Gleb Natapov
e642f0a86d group: hold group0 shutdown gate during async operations
Wait for all outstanding async work that uses group0 to complete before
destroying group0 server.

Fixes scylladb/scylladb#20701
2024-10-06 17:20:52 +03:00
Gleb Natapov
ba22493a69 group0: Stop group0 if node initialization fails
Commit af83c5e53e moved aborting of group0 into the storage service
drain function. But it is not called if node fails during initialization
(if it failed to join cluster for instance). So lets abort on both
paths (but only once).
2024-10-06 17:20:52 +03:00
Botond Dénes
07094c3e44 Merge 'replica: Fix tombstone GC during tablet split preparation' from Raphael "Raph" Carvalho
During split prepare phase, there will be more than 1 compaction group with
overlapping token range for a given replica.

Assume tablet 1 has sstable A containing deleted data, and sstable B containing
a tombstone that shadows data in A.

Then split starts:
1) sstable B is split first, and moved from main (unsplit) group to a
split-ready group
2) now compaction runs in split-ready group before sstable A is split

tombstone GC logic today only looks at underlying group, so compaction is step
2 will discard the deleted data in A, since it belongs to another group (the
unsplit one), and so the tombstone can be purged incorrectly.

To fix it, compaction will now work with all uncompacting sstables that belong
to the same replica, since tombstone GC requires all sstables that possibly
contain shadowed data to be available for correct decision to be made.

Fixes https://github.com/scylladb/scylladb/issues/20044.

Branches 6.0, 6.1 and 6.2 are vulnerable, so backport is needed.

Closes scylladb/scylladb#20939

* github.com:scylladb/scylladb:
  replica: Fix tombstone GC during tablet split preparation
  service: Improve error handling for split
2024-10-04 10:29:42 +03:00
Raphael S. Carvalho
bcd358595f service: Improve error handling for split
Retry wasn't really happening since the loop was broken and sleep
part was skipped on error. Also, we were treating abort of split
during shutdown as if it were an actual error and that confused
longevity tests that parse for logs with error level. The fix is
about demoting the level of logs when we know the exception comes
from shutdown.

Fixes #20890.
2024-10-02 11:23:44 -03:00
Sergey Zolotukhin
6398b7548c config: Add a warning about use of IP address for join topology and replace
operations.

When the '--ignore-dead-nodes-for-replace' config option contains
IP addresses, a warning will be logged, notifying the user that
using IP addresses with this option is deprecated and will no
longer be supported in the next release.

Fixes scylladb/scylladb#19218
2024-10-02 11:56:59 +02:00
Sergey Zolotukhin
3b9033423d utils: Optimizations for utils::split_comma_separated_list and usage of host_id_or_endpoint lists
- utils::split_comma_separated_list now accepts a reference to sstring instead
  of a copy to avoid extra memory allocations. Additionally, the results of
  trimming are moved to the resulting vector instead of being copied.
- service/storage_service removenode, raft_removenode, find_raft_nodes_from_hoeps,
  parse_node_list and api/storage_service::set_storage_service were changed to use
  std::vector<host_id_or_endpoint> instead of std::list<host_id_or_endpoint> as
  std::vector is a more cache-friendly structure,  resulting in better performance.
2024-10-02 11:56:59 +02:00
Kamil Braun
9224e48d6b Merge 'Populate raft address map from gossiper on raft configuration change' from Gleb Natapov
For each new node added to the raft config populate its ID to IP mapping
in raft address map from the gossiper. The mapping may have expired if a
node is added to the raft configuration long after it first appears in
the gossiper.

Fixes scylladb/scylladb#20600

Backport to all supported versions since the bug may cause bootstrapping failure.

Closes scylladb/scylladb#20601

* github.com:scylladb/scylladb:
  test: extend existing test to check that a joining node can map addresses of all pre-existing nodes during join
  group0: make sure that address map has an entry for each new node in the raft configuration
2024-09-26 12:41:25 +02:00
Gleb Natapov
9e4cd32096 test: extend existing test to check that a joining node can map addresses of all pre-existing nodes during join 2024-09-25 17:10:09 +03:00
Kamil Braun
7d8f1d251a Merge 'Mark node as being replaced earlier' from Gleb Natapov
Before 17f4a151ce the node was marked as
been replaced in join_group0 state, before it actually joins the group0,
so by the time it actually joins and starts transferring snapshot/log no
traffic is sent to it. The commit changed this to mark the node as
being replaced after the snapshot/log is already transferred so we can
get the traffic to the node while it sill did not caught up with a
leader and this may causes problems since the state is not complete.
Mark the node as being replaced earlier, but still add the new node to
the topology later as the commit above intended.

Fixes: scylladb/scylladb#20629

Need to be backported since this is a regression

Closes scylladb/scylladb#20743

* github.com:scylladb/scylladb:
  test: amend test_replace_reuse_ip test to check that there is no stale writes after snapshot transfer starts
  topology coordinator:: mark node as being replaced earlier
  topology coordinator: do metadata barrier before calling finish_accepting_node() during replace
2024-09-25 15:46:12 +02:00
Avi Kivity
d16ea0afd6 Merge 'cql3: Extend DESC SCHEMA by auth and service levels' from Dawid Mędrek
Auth has been managed via Raft since Scylla 6.0. Restoring data
following the usual procedure (1) is error-prone and so a safer
method must have been designed and implemented. That's what
happens in this PR.

We want to extend `DESC SCHEMA` by auth and service levels
to provide a safe way to backup and restore those two components.
To realize that, we change the meaning of `DESC SCHEMA WITH INTERNALS`
and add a new "tier": `DESC SCHEMA WITH INTERNALS AND PASSWORDS`.

* `DESC SCHEMA` -- no change, i.e. the statement describes the current
  schema items such as keyspaces, tables, views, UDTs, etc.
* `DESC SCHEMA WITH INTERNALS` -- does the same as the previous tier
  and also describes auth and service levels. No information about
  passwords is returned.
* `DESC SCHEMA WITH INTERNALS AND PASSWORDS` -- does the same
  as the previous tier and also includes information about the salted
  hashes corresponding to the passwords of roles.

To restore existing roles, we extend the `CREATE ROLE` statement
by allowing to use the option `WITH SALTED HASH = '[...]'`.

---

Implementation strategy:

* Add missing things/adjust existing ones that will be used later.
* Implement creating a role with salted hash.
* Add tests for creating a role with salted hash.
* Prepare for implementing describe functionality of auth and service levels.
* Implement describe functionality for elements of auth and service levels.
* Extend the grammar.
* Add tests for describe auth and service levels.
* Add/update documentation.

---

(1): https://opensource.docs.scylladb.com/stable/operating-scylla/procedures/backup-restore/restore.html
In case the link stops working, restoring a schema was realised
by managing raw files on disk.

Fixes scylladb/scylladb#18750
Fixes scylladb/scylladb#18751
Fixes scylladb/scylladb#20711

Closes scylladb/scylladb#20168

* github.com:scylladb/scylladb:
  docs: Update user documentation for backup and restore
  docs/dev: Add documentation for DESC SCHEMA
  test: Add tests for describing auth and service levels
  cql3/functions/user_function: Remove newline character before and after UDF body
  cql3: Implement DESCRIBE SCHEMA WITH INTERNALS AND PASSWORDS
  auth: Implement describing auth
  auth/authenticator: Add member functions for querying password hash
  service/qos/service_level_controller: Describe service levels
  data_dictionary: Remove keyspace_element.hh
  treewide: Start using new overloads of describe
  treewide: Fix indentation in describe functions
  treewide: Return create statement optionally in describe functions
  treewide: Add new describe overloads to implementations of data_dictionary::keyspace_element
  treewide: Start using schema::ks_name() instead of schema::keyspace_name()
  cql3: Refactor `description`
  cql3: Move description to dedicated files
  test: Add tests for `CREATE ROLE WITH SALTED HASH`
  cql3/statements: Restrict CREATE ROLE WITH SALTED HASH
  auth: Allow for creating roles with SALTED HASH
  types: Introduce a function `cql3_type_name_without_frozen()`
  cql3/util: Accept std::string_view rather than const sstring&
2024-09-24 21:44:32 +03:00
Abhinav
36d68ec955 raft topology: add error for removal of non-normal nodes
In the current scenario, We check if a node being removed is normal
on the node initiating the removenode request. However, we don't have a
similar check on the topology coordinator. The node being removed could be
normal when we initiate the request, but it doesn't have to be normal when
the topology coordinator starts handling the request.
For example, the topology coordinator could have removed this node while handling
another removenode request that was added to the request queue earlier.

This commit intends to fix this issue by adding more checks in the enqueuing phase
and return errors for duplicate requests for node removal.

This PR fixes a bug. Hence we need to backport it.

Fixes: scylladb/scylladb#20271

Closes scylladb/scylladb#20500
2024-09-24 16:11:19 +02:00
Dawid Mędrek
39cf106151 treewide: Start using schema::ks_name() instead of schema::keyspace_name()
We're going to remove the interface `data_dictionary::keyspace_element`.
As `schema::keyspace_name()` is an implementation of one of the methods
specified by that interface, we replace its uses by `schema::ks_name()`.
`schema::keyspace_name()` was an alias for it, so no semantic change
has occured.
2024-09-20 14:24:53 +02:00
Gleb Natapov
c0939d86f9 topology coordinator:: mark node as being replaced earlier
Before 17f4a151ce the node was marked as
been replaced in join_group0 state, before it actually joins the group0,
so by the time it actually joins and starts transferring snapshot/log no
traffic is sent to it. The commit changed this to mark the node as
being replaced after the snapshot/log is already transferred so we can
get the traffic to the node while it sill did not caught up with a
leader and this may causes problems since the state is not complete.
Mark the node as being replaced earlier, but still add the new node to
the topology later as the commit above intended.
2024-09-19 15:23:48 +03:00
Benny Halevy
574a08ed96 storage_service: rebuild: warn about tablets-enabled keyspaces
Until we automatically support rebuild for tablets-enabled
keyspaces, warn the user about them.

The reason this is not an error, is that after
increasing RF in a new datacenter, the current procedure
is to run `nodetool rebuild` on all nodes in that dc
to rebuild the new vnode replicas.
This is not required for tablets, since the additional
replicas are rebuilt automatically as part of ALTER KS.

However, `nodetool rebuild` is also run after local
data loss (e.g. due to corruption and removal of sstables).
In this case, rebuild is not supported for tablets-enabled
keyspaces, as tablet replicas that had lost data may have
already been migrated to other nodes, and rebuilding the
requested node will not know about it.
It is advised to repair all nodes in the datacenter instead.

Refs scylladb/scylladb#17575

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#20375
2024-09-19 14:25:46 +03:00
Pavel Emelyanov
36863d4ad0 sstables_manager: Remove table_dir from make_sstable()
It used to be passed to sstable constructor, but now it doesn't need
this argument.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-13 16:49:50 +03:00
Kefu Chai
3e84d43f93 treewide: use seastar::format() or fmt::format() explicitly
before this change, we rely on `using namespace seastar` to use
`seastar::format()` without qualifying the `format()` with its
namespace. this works fine until we changed the parameter type
of format string `seastar::format()` from `const char*` to
`fmt::format_string<...>`. this change practically invited
`seastar::format()` to the club of `std::format()` and `fmt::format()`,
where all members accept a templated parameter as its `fmt`
parameter. and `seastar::format()` is not the best candidate anymore.
despite that argument-dependent lookup (ADT for short) favors the
function which is in the same namespace as its parameter, but
`using namespace` makes `seastar::format()` more competitive,
so both `std::format()` and `seastar::format()` are considered
as the condidates.

that is what is happening scylladb in quite a few caller sites of
`format()`, hence ADT is not able to tell which function the winner
in the name lookup:

```
/__w/scylladb/scylladb/mutation/mutation_fragment_stream_validator.cc:265:12: error: call to 'format' is ambiguous
  265 |     return format("{} ({}.{} {})", _name_view, s.ks_name(), s.cf_name(), s.id());
      |            ^~~~~~
/usr/bin/../lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/format:4290:5: note: candidate function [with _Args = <const std::basic_string_view<char> &, const seastar::basic_sstring<char, unsigned int, 15> &, const seastar::basic_sstring<char, unsigned int, 15> &, const utils::tagged_uuid<table_id_tag> &>]
 4290 |     format(format_string<_Args...> __fmt, _Args&&... __args)
      |     ^
/__w/scylladb/scylladb/seastar/include/seastar/core/print.hh:143:1: note: candidate function [with A = <const std::basic_string_view<char> &, const seastar::basic_sstring<char, unsigned int, 15> &, const seastar::basic_sstring<char, unsigned int, 15> &, const utils::tagged_uuid<table_id_tag> &>]
  143 | format(fmt::format_string<A...> fmt, A&&... a) {
      | ^
```

in this change, we

change all `format()` to either `fmt::format()` or `seastar::format()`
with following rules:
- if the caller expects an `sstring` or `std::string_view`, change to
  `seastar::format()`
- if the caller expects an `std::string`, change to `fmt::format()`.
  because, `sstring::operator std::basic_string` would incur a deep
  copy.

we will need another change to enable scylladb to compile with the
latest seastar. namely, to pass the format string as a templated
parameter down to helper functions which format their parameters.
to miminize the scope of this change, let's include that change when
bumping up the seastar submodule. as that change will depend on
the seastar change.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-09-11 23:21:40 +03:00
Piotr Dulikowski
d98708013c Merge 'view: move view_build_status to group0' from Michael Litvak
Migrate the `system_distributed.view_build_status` table to `system.view_build_status_v2`. The writes to the v2 table are done via raft group0 operations.

The new parameter `view_builder_version` stored in `scylla_local` indicates whether nodes should use the old or the new table.

New clusters use v2. Otherwise, the migration to v2 is initiated by the topology coordinator when the feature is enabled. It reads all the rows from the old table and writes them to the new table, and sets `view_builder_version` to v2. When the change is applied, all view_builder services are updated to write and read from the v2 table.

The old table `system_distributed.view_build_status` is set to read virtually from the new table in order to maintain compatibility.

When removing a node from the cluster, we remove its rows from the table atomically (fixes https://github.com/scylladb/scylladb/issues/11836). Also, during the migration, we remove all invalid rows.

Fixes scylladb/scylladb#15329

dtest https://github.com/scylladb/scylla-dtest/pull/4827

Closes scylladb/scylladb#19745

* github.com:scylladb/scylladb:
  view: test view_build_status table with node replace
  test/pylib: use view_build_status_v2 table in wait_for_view
  view_builder: common write view_build_status function
  view_builder: improve migration to v2 with intermediate phase
  view: delete node rows from view_build_status on node removal
  view: sanitize view_build_status during migration
  view: make old view_build_status table a virtual table
  replica: move streaming_reader_lifecycle_policy to header file
  view_builder: test view_build_status_v2
  storage_service: add view_build_status to raft snapshot
  view_builder: migration to v2
  db:system_keyspace: add view_builder_version to scylla_local
  view_builder: read view status from v2 table
  view_builder: introduce writing status mutations via raft
  view_builder: pass group0_client and qp to view_builder
  view_builder: extract sys_dist status operations to functions
  db:system_keyspace: add view_build_status_v2 table
2024-09-11 13:02:58 +02:00
Gleb Natapov
af83c5e53e group0: stop group0 before draining storage service during shutdown
Currently storage service is drained while group0 is still active. The
draining stops commitlogs, so after this point no more writes are
possible, but if group0 is still active it may try to apply commands
which will try to do writes and they will fail causing group0 state
machine errors. This is benign since we are shutting down anyway, but
better to fix shutdown order to keep logs clean.

Fixes scylladb/scylladb#19665
2024-09-10 13:15:56 +02:00
Evgeniy Naydanov
769424723b test: error injections for Raft-based topology
Add following error injections:
 - stop_after_init_of_system_ks
 - stop_after_init_of_schema_commitlog
 - stop_after_starting_gossiper
 - stop_after_starting_raft_address_map
 - stop_after_starting_migration_manager
 - stop_after_starting_commitlog
 - stop_after_starting_repair
 - stop_after_starting_cdc_generation_service
 - stop_after_starting_group0_service
 - stop_after_starting_auth_service
 - stop_during_gossip_shadow_round
 - stop_after_saving_tokens
 - stop_after_starting_gossiping
 - stop_after_sending_join_node_request
 - stop_after_setting_mode_to_normal_raft_topology
 - stop_before_becoming_raft_voter
 - topology_coordinator_pause_after_updating_cdc_generation
 - stop_before_streaming
 - stop_after_streaming
 - stop_after_bootstrapping_initial_raft_configuration
2024-09-05 22:11:31 +00:00
Michael Litvak
c1f3517a75 view_builder: improve migration to v2 with intermediate phase
Add an intermediate phase to the view builder migration to v2 where we
write to both the old and new table in order to not lose writes during
the migration.
We add an additional view builder version v1_5 between v1 and v2 where
we write to both tables. We perform a barrier before moving to v2 to
ensure all the operations to the old table are completed.
2024-09-05 15:42:35 +03:00
Michael Litvak
fcf66ad541 storage_service: add view_build_status to raft snapshot
Include the table system.view_build_status_v2 in the raft snapshot, and
also the view_builder version parameter.
2024-09-05 15:42:30 +03:00
Michael Litvak
8d25a4d678 view_builder: migration to v2
Migrate view_builder to v2, to store the view build status of all nodes
in the group0 based table view_build_status_v2.

Introduce a feature view_build_status_on_group0 so we know when all
nodes are ready to migrate and use the new table.

A new cluster is initialized to use v2. Otherwise, The topology coordinator
initiates the migration when the feature is enabled, if it was not done
already.

The migration reads all the rows in the v1 table and writes it via
group0 to the v2 table, together with a mutation that updates the
view_builder parameter in scylla_local to v2. When this mutation is
applied, it updates the view_builder service to start using the v2
table.
2024-09-05 15:41:04 +03:00
Kamil Braun
79983723c8 storage_service: pass _abort_source to hold_read_apply_mutex
There's no point waiting for this lock if `storage_service` is being
aborted. In theory the lock, if held, should be eventually released by
whatever is holding it during shutdown -- but if there is some cyclic
reference between the services, and e.g. whatever holds the lock is
stuck because of ongoing shutdown and would only be unstuck by
`storage_service` getting stopped (which it can't because it's waiting
on the lock), that would cause a shutdown deadlock. Better to be safe
than sorry.
2024-09-03 15:52:05 +02:00
Kamil Braun
a4d1065628 api: move reload_raft_topology_state implementation inside storage_service
In later commit we'll want to access more `storage_service` internals
in the API's implementation (namely, `_abort_source`)

Also moving the implementation there allows making
`service::topology_transition()` private again (it was made public in
992f1327d3 only for this API
implementation)
2024-09-03 15:52:03 +02:00
Kamil Braun
292ef0d1f9 Merge 'Fix node replace with inter-dc encryption enabled.' from Gleb Natapov
Currently if a coordinator and a node being replaced are in the same DC
while inter-dc encryption is enabled (connections between nodes in the
same DC should not be encrypted) the replace operation will fail. It
fails because a coordinator uses non encrypted connection to push raft
data to the new node, but the new node will not accept such connection
until it knows which DC the coordinator belongs to and for that the raft
data needs to be transferred.

The series adds the test for this scenario and the fix for the
chicken&egg problem above.

The series (or at least the fix itself) needs to be backported because
this is a serious regression.

Fixes: scylladb/scylladb#19025

Closes scylladb/scylladb#20290

* github.com:scylladb/scylladb:
  topology coordinator: fix indentation after the last patch
  topology coordinator: do not add replacing node without a ring to topology
  test: add test for replace in clusters with encryption enabled
  test.py: add server encryption support to cluster manager
  .gitignore: fix pattern for resources to match only one specific directory
2024-08-30 11:29:05 +02:00
Raphael S. Carvalho
26facd807e storage_service: avoid processing same table unnecessarily in split monitor
If there's a token metadata for a given table, and it is in split mode,
it will be registered such that split monitor can look at it, for
example, to start split work, or do nothing if table completed it.

during topology change, e.g. drain, split is stalled since it cannot
take over the state machine.
It was noticed that the log is being spammed with a message saying the
table completed split work, since every tablet metadata update, means
waking up the monitor on behalf of a table. So it makes sense to
demote the logging level to debug. That persists until drain completes
and split can finally complete.

Another thing that was noticed is that during drain, a table can be
submitted for processing faster than the monitor can handle, so the
candidate queue may end up with multiple duplicated entries for same
table, which means unnecessary work. That is fixed by using a
sequenced set, which keeps the current FIFO behavior.

Fixes #20339.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#20029
2024-08-29 19:38:43 +03:00
Gleb Natapov
32a59ba98f topology coordinator: fix indentation after the last patch 2024-08-29 17:14:09 +03:00
Gleb Natapov
17f4a151ce topology coordinator: do not add replacing node without a ring to topology
When only inter dc encryption is enabled a non encrypted connection
between two nodes is allowed only if both nodes are in the same dc.
If a nodes that initiates the connection knows that dst is in the same
dc and hence use non encrypted connection, but the dst not yet knows the
topology of the src such connection will not be allowed since dst cannot
guaranty that dst is in the same dc.

Currently, when topology coordinator is used, a replacing node will
appear in the coordinator's topology immediately after it is added to the
group0. The coordinator will try to send raft message to the new node
and (assuming only inter dc encryption is enabled and replacing node and
the coordinator are in the same dc) it will try to open regular, non encrypted,
connection to it. But the replacing node will not have the coordinator
in it's topology yet (it needs to sync the raft state for that). so it
will reject such connection.

To solve the problem the patch does not add a replacing node that was
just added to group0 to the topology. It will be added later, when
tokens will be assigned to it. At this point a replacing node will
already make sure that its topology state is up-to-date (since it will
execute a raft barrier in join_node_response_params handler) and it knows
coordinator's topology. This aligns replace behaviour with bootstrap
since bootstrap also does not add a node without a ring to the topology.

The patch effectively reverts b8ee8911ca

Fixes: scylladb/scylladb#19025
2024-08-29 17:14:09 +03:00
Patryk Jędrzejczak
02bb70da19 treewide: support zero-token nodes in the recovery mode
Before we implement the manual recovery tool, we must support
zero-token nodes in the recovery mode. This means that two topology
operations involving zero-token nodes must work in the gossip-based
topology:
- removing a dead zero-token node,
- restarting a live zero-token node.
We make changes necessary to make them work in this patch.
2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak
574c252391 feature_service: introduce the ZERO_TOKEN_NODES feature
Zero-token nodes must be supported by all nodes in the cluster.
Otherwise, the non-supporting nodes would crash on some assertion
that assumes only token-owing normal nodes make sense.

Hence, we introduce the ZERO_TOKEN_NODES cluster feature. Zero-token
nodes refuse to boot if it is not supported.

I tested this patch manually. First, I booted a node built in the
previous patch. Then, I tried to add a zero-token node built in this
patch. It refused to boot as expected.
2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak
c25eefe217 storage_service: rename join_token_ring to join_topology
After introducing zero-token nodes that call join_token_ring but do
not join the ring, the join_token_ring name does not make much sense.
2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak
9937cf3a24 storage_service: raft_topology_cmd_handler: improve warnings 2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak
22d907e721 treewide: introduce support for zero-token nodes in Raft topology
We revive the `join_ring` option. We support it only in the
Raft-based topology, as we plan to remove the gossip-based topology
when we fix the last blocker - the implementation of the manual
recovery tool. In the Raft-based topology, a node can be assigned
tokens only once when it joins the cluster. Hence, we disallow
joining the ring later, which is possible in Cassandra.

The main idea behind the solution is simple. We make the unsupported
special case of zero tokens a supported normal case. Nodes with zero
tokens assigned are called "zero-token nodes" from now on.

From the topology point of view, zero-token nodes are the same as
token-owning nodes. They can be in the same states, etc. From the
data point of view, they are different. They are not members of
the token ring, so they are not present in
`token_metadata::_normal_token_owners`. Hence, they are ignored in
all non-local replication strategies. The tablet load balancer also
ignores them.

Topology operations involving zero-token nodes are simplified:
- `add` and `replace` finish in the `join_group0` state, so creating
a new CDC generation and streaming are skipped,
- `removenode` and `decommission` skip streaming,
- `rebuild` does not even contact the topology coordinator as there
is nothing to rebuild,

Also, if the topology operation involves a token-owning node,
zero-token nodes are ignored in streaming.

Zero-token nodes can be used as coordinator-only nodes, just like in
Cassandra. They can handle requests just like token-owning nodes.

The main motivation behind zero-token nodes is that they can prevent
the Raft majority loss efficiently. Zero-token nodes are group 0
voters, but they can run on much weaker and cheaper machines because
they do not replicate data and handle client requests by default
(drivers ignore them). For example, if there are two DCs, one with 4
nodes and one with 5 nodes, if we add a DC with 2 zero-token nodes,
every DC will contain less than half of the nodes, so we won't lose
the majority when any DC dies.

Another way of preventing the Raft majority loss is changing the
voter set, which is tracked by scylladb/scylladb#18793. That approach
can be used together with zero-token nodes. In the example above, if
we choose equal numbers of voters in both DCs, then a DC with one
zero-token node will be sufficient. However, in the typical setup of
2 DCs with the same number of nodes it is enough to add a DC with
only one zero-token node without changing the voter set.

Zero-token nodes could also be used as load balancers in the
Alternator.
2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak
ed55261650 treewide: distinguish all nodes from all token owners
In one of the following patches, we introduce support for zero-token
nodes. From that point, getting all nodes and getting all token
owners isn't equivalent. In this patch, we ensure that we consider
only token owners when we want to consider only token owners (for
example, in the replication logic), and we consider all nodes when
we want to consider all nodes (for example, in the topology logic).

The main purpose of this patch is to make the PR introducing
zero-token nodes easier to review. The patch that introduces
zero-token nodes is already complicated. We don't want trivial
changes from this patch to make noise there.

This patch introduces changes needed for zero-token nodes only in the
Raft-based topology and in the recovery mode. Zero-token nodes are
unsupported in the gossip-based topology outside recovery.

Some functions added to `token_metadata` and `topology` are
inefficient because they compute a new data structure in every call.
They are never called in the hot path, so it's not a serious problem.
Nevertheless, we should improve it somehow. Note that it's not
obvious how to do it because we don't want to make `token_metadata`
store topology-related data. Similarly, we don't want to make
`topology` store token-related data. We can think of an improvement
in a follow-up.

We don't remove unused `topology::get_datacenter_rack_nodes` and
`topology::get_datacenter_nodes`. These function can be useful in the
future. Also, `topology::_dc_nodes` is used internally in `topology`.
2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak
2d9575d6a9 gossip topology: make a replacing node remove the replaced node from topology
In the following patch, we change the gossiper to work the same for
zero-token nodes and token-owning nodes. We replace occurrences of
`is_normal_token_owner` with topology-based conditions. We want to
rely on the invariant that token-owning nodes own tokens if and only
if they are in the normal or leaving state. However, this invariant
is broken by a replacing node because it does not remove the
replaced node from topology. Hence, after joining, the replacing node
has topology with a node that is not a token owner anymore but is
in a leaving state (`being_replaced`). We fix it to prevent the
following patch from introducing a regression.
2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak
c7016dedb3 locator: topology: add_or_update_endpoint: use none as the default node state
In one of the following patches, we change the gossiper to work the
same for zero-token nodes and token-owning nodes. We replace
occurrences of `is_normal_token_owner` with topology-based
conditions. We want to rely on the invariant that token-owning nodes
own tokens if and only if they are in the normal or leaving state.
However, this invariant can be broken in the gossip-based topology
when a new node joins the cluster. When a boostrapping node starts
gossiping, other nodes add it to their topology in
`storage_service::on_alive`. Surprisingly, the state of the new node
is set to `normal`, as it's the default value used by
`add_or_update_endpoint`. Later, the state will be set to
`bootstrapping` or `replacing`, and finally it will be set again to
`normal` when the join operation finishes. We fix this strange
behavior by setting the node state to `none` in
`storage_service::on_alive` for nodes not present in the topology.
Note that we must add such nodes to the topology. Other code needs
their Host ID, IP, and location.

We change the default node state from `normal` to `none` in
`add_or_update_endpoint` to prevent bugs like the one in
`storage_service::on_alive`. Also, we ensure that nodes in the `none`
state are ignored in the getters of `locator::topology`.
2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak
366605224c token_metadata: rename get_all_endpoints and get_all_ips
In one of the following patches, we introduce support for zero-token
nodes. A zero-token node that has successfully joined the cluster is
in the normal state but is not a normal token owner. Hence, the names
of `get_all_endpoints` and `get_all_ips` become misleading. They
should specify that the functions return only IDs/IPs of token owners.
2024-08-29 10:37:07 +02:00
Benny Halevy
18c45f7502 raft_rebuild: propagate source_dc force option to rebuild_option
Currently, the `force` property of the `source_dc` rebuild option
is lost and `raft_topology_cmd_handler` has no way to know
if it was given or not.

This in turn can cause rebuild to fail, even when `--force`
is set by the user, where it would succeed with gossip
topology changes, based on the source_dc --force semantics.

Fixes scylladb/scylladb#20242

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#20249
2024-08-27 17:05:48 +02:00
Benny Halevy
686a8f2939 abstract_replication_strategy: make get_ranges async
To prevent stalls due to large number of tokens.
For example, large cluster with say 70 nodes can have
more than 16K tokens.

Fixes #19757

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-08-25 10:57:34 +03:00
Tomasz Grabiec
ff52527c54 Merge 'repair: do_rebuild_replace_with_repair: use source_dc only when safe' from Benny Halevy
It is unsafe to restrict the sync nodes for repair to the source data center if it has too low replication factor in network_topology_replication_strategy, or if other nodes in that DC are ignored.

Also, this change restricts the usage of source_dc to `network_topology` and `everywhere_topology`
strategies, as with simple replication strategy
there is no guarantee that there would be any
more replicas in that data center.

Fixes #16826

Reproducer submitted as https://github.com/scylladb/scylla-dtest/pull/3865
It fails without this fix and passes with it.

* Requires backport to live versions.  Issue hit in the filed with 2022.2.14

Closes scylladb/scylladb#16827

* github.com:scylladb/scylladb:
  repair: do_rebuild_replace_with_repair: use source_dc only when safe
  repair: replace_with_repair: pass the replace_node downstream
  repair: replace_with_repair: pass ignore_nodes as a set of host_id:s
  repair: replace_rebuild_with_repair: pass ks_erms from caller
  nodetool: rebuild: add force option
  Add and use utils::optional_param to pass source_dc
2024-08-20 16:13:23 +02:00
Benny Halevy
8665eef98c repair: replace_with_repair: pass the replace_node downstream
To be used by the next path to count how many nodes
are lost in each datacenter.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-08-19 17:23:33 +03:00
Benny Halevy
9729dd21c3 repair: replace_with_repair: pass ignore_nodes as a set of host_id:s
The callers already pass ignore_nodes as host_id:s
and we translate them into inet_address only for repair
so delay the translation as much as posible,

Refs scylladb/scylladb#6403

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-08-19 17:22:01 +03:00
Benny Halevy
b5d0ab092c repair: replace_rebuild_with_repair: pass ks_erms from caller
The keyspaces replication maps must be in sync with the
token_metadata_ptr passed already to the functions,
so instead of getting it in the callee, let the caller
get the ks_erms along with retrieving the tmptr.

Note that it's already done on the rebuild path
for streaming based rebuild.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-08-19 17:20:27 +03:00