We saw that in large clusters the direct failure detector may accumulate large task queues. This series addresses that issue and also moves the code into the correct scheduling group.
Fixes https://github.com/scylladb/scylladb/issues/27142
Backport to all versions to which 60f1053087 was backported, since it should improve performance in large clusters.
Closes scylladb/scylladb#27387
* github.com:scylladb/scylladb:
direct_failure_detector: run direct failure detector in the gossiper scheduling group
raft: drop invoke_on from the pinger verb handler
direct_failure_detector: pass timeout to direct_fd_ping verb
Currently the raft direct pinger verb jumps to shard 0 to check whether
group0 is alive before replying. The verb runs relatively often, so this
is not very efficient. The patch instead distributes the group0 liveness
information (as it changes) to all shards, so the handler itself does not
need to jump to shard 0.
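A minimal sketch of this pattern in Seastar (names are hypothetical; the real patch works on Scylla's internal structures): the rarely-changing liveness flag is broadcast to every shard with `smp::invoke_on_all`, so the frequently-called verb handler reads a shard-local copy.
```
#include <seastar/core/future.hh>
#include <seastar/core/smp.hh>

// Hypothetical shard-local copy of the group0 liveness flag.
static thread_local bool group0_alive = false;

// Called rarely, whenever liveness changes: push the new value to all shards.
seastar::future<> set_group0_alive(bool alive) {
    return seastar::smp::invoke_on_all([alive] {
        group0_alive = alive;
    });
}

// Called often, by the pinger verb handler: a purely shard-local read,
// no invoke_on(0) round trip needed.
bool is_group0_alive() {
    return group0_alive;
}
```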
Currently direct_fd_ping runs without a timeout, but the verb is not
waited on forever: the wait is canceled after a timeout, which simply
is not passed to the rpc. This can create a situation where the rpc
callback runs on the destination while it is no longer being waited on.
Change the code to pass the timeout to the rpc as well, and return early
from the rpc handler if the timeout has already been reached by the time
the callback is called. This is backwards compatible since the timeout is
passed as an optional.
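A rough sketch of the handler side, assuming Seastar's `rpc::optional` mechanism for backward-compatible trailing parameters (the handler name and body are illustrative):
```
#include <seastar/core/future.hh>
#include <seastar/core/lowres_clock.hh>
#include <seastar/rpc/rpc_types.hh>

using clock_type = seastar::lowres_clock;

// Hypothetical handler shape: senders that omit the timeout still work,
// because the trailing parameter is rpc::optional.
seastar::future<> handle_direct_fd_ping(seastar::rpc::optional<clock_type::time_point> timeout) {
    if (timeout && clock_type::now() >= *timeout) {
        // The caller has already stopped waiting; skip the useless work.
        return seastar::make_ready_future<>();
    }
    // ... perform the liveness check and reply ...
    return seastar::make_ready_future<>();
}
```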
Also, switch to std::source_location. The upcoming Seastar update will
deprecate its compatibility layer.
The patch is
```
for f in $(git grep -l 'seastar::compat::source_location'); do
    sed -e 's/seastar::compat::source_location/std::source_location/g' -i $f;
done
```
plus the removal of a few header includes.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#27309
When applying group0_command we now inspect
whether any auth internal tables were modified,
and reload affected role entries in the cache.
Since one auth DML may change multiple tables,
when iterating over mutations we deduplicate
affected roles across those tables.
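A minimal sketch of that deduplication step (the types and table names here are illustrative, not Scylla's actual structures): collecting role names into a set makes each role appear once, regardless of how many tables mention it.
```
#include <string>
#include <unordered_set>
#include <vector>

struct mutation_info {       // hypothetical stand-in for a group0 mutation
    std::string table;       // which auth table the mutation touches
    std::string role;        // role name extracted from the mutation key
};

// One auth DML may touch several auth tables; a set deduplicates the
// affected roles across all of them, so each role is reloaded once.
std::unordered_set<std::string> affected_roles(const std::vector<mutation_info>& muts) {
    std::unordered_set<std::string> roles;
    for (const auto& m : muts) {
        if (m.table == "roles" || m.table == "role_members" || m.table == "role_attributes") {
            roles.insert(m.role);
        }
    }
    return roles;
}
```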
Receiving a snapshot is a rare event, so as a simplification
we reload the whole cache instead of trying to merge states,
especially since the expected size is small (below 100 records).
Reloading is a non-disruptive operation: old entries are removed
only after all entries are loaded. If an entry is updated, its shared
pointer is atomically replaced in the cache map.
It's not enough to consider only the current group 0 members. In the
Raft-based recovery procedure, there can be nodes that haven't joined
the current group 0 yet, but they have belonged to a different group 0
and thus have a non-empty group 0 state ID.
We fix this issue in this commit by considering topology members
instead.
We don't consider ignored nodes as an optimization. When some nodes are
dead, the group 0 state ID handler won't have to wait until all these
nodes leave the cluster. It will only have to wait until all these nodes
are ignored, which happens at the beginning of the first
removenode/replace. As a result, tombstones of group 0 tables will be
purged much sooner.
We don't rename the `group0_members` variable to keep the change
minimal. There seems to be no precise and succinct name for the used set
of nodes anyway.
We use `std::ranges::join_view` in one place (see the sketch after this list) because:
- `std::ranges::concat` will become available in C++26,
- `boost::range::join` is not a good option, as there is an ongoing
effort to minimize external dependencies in Scylla.
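For illustration only, this is the kind of standard-library-only concatenation `join_view` makes possible, not the actual Scylla code:
```
#include <array>
#include <iostream>
#include <ranges>
#include <span>
#include <vector>

int main() {
    // Two node sets iterated as one sequence, without materializing a
    // merged container and without boost::range::join.
    std::vector<int> group0_members{1, 2, 3};
    std::vector<int> joining_nodes{4, 5};

    auto both = std::array{std::span<const int>(group0_members),
                           std::span<const int>(joining_nodes)}
              | std::views::join;

    for (int id : both) {
        std::cout << id << ' ';   // prints: 1 2 3 4 5
    }
}
```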
Schema pulls should always be disabled when group 0 is used. However,
`migration_manager::disable_schema_pulls()` is never called during
a restart with `recovery_leader` set in the Raft-based recovery
procedure, which causes schema pulls to be re-enabled on all live nodes
(excluding the nodes replacing the dead nodes). Moreover, schema pulls
remain enabled on each node until the node is restarted, which could
be a very long time.
The old gossip-based recovery procedure doesn't have this problem
because we disable schema pulls after completing the upgrade-to-group0
procedure, which is a part of the old recovery procedure.
Fixes #26569
In the Raft-based recovery procedure, we create a new group 0 and add
live nodes to it one by one. This means that for some time there are
nodes which belong to the topology, but not to the new group 0. The
voter handler running on the recovery leader incorrectly considers these
nodes while choosing voters.
The consequences:
- misleading logs, for example, "making servers {<ID of a non-member>}
voters", where the non-member won't become a voter anyway,
- increased chance of majority loss during the recovery procedure, for
example, all 3 nodes that first joined the new group 0 are in the same
dc and rack, but only one of them becomes a voter because the voter
handler tries to make non-members in other dcs/racks voters.
Fixes #26321
Closes scylladb/scylladb#26327
This PR refactors the can_vote function in the Raft algorithm for improved clarity and maintainability by providing safer strong boolean types to the algorithm.
Fixes: #21937
Backport: No backport required
Closes scylladb/scylladb#25787
As requested in #22104, moved the files and fixed other includes and the build system.
Moved files:
- combine.hh
- collection_mutation.hh
- collection_mutation.cc
- converting_mutation_partition_applier.hh
- converting_mutation_partition_applier.cc
- counters.hh
- counters.cc
- timestamp.hh
Fixes: #22104
This is a cleanup, no need to backport
Closes scylladb/scylladb#25085
Initial implementation to support CDC in tablets-enabled keyspaces.
The design is described in https://docs.google.com/document/d/1qO5f2q5QoN5z1-rYOQFu6tqVLD3Ha6pphXKEqbtSNiU/edit?usp=sharing
The design is followed closely for the most part, except for "Deciding when to change streams": instead, streams are changed synchronously with tablet split / merge.
Instead of the stream-switching algorithm with double writes, we use a scheme similar to the previous method for vnodes: we add the new streams with a timestamp that is sufficiently far in the future.
In this PR we:
* add new group0-based internal system tables for tablet stream metadata and loading it into in-memory CDC metadata
* add virtual tables for CDC consumers
* the write coordinator chooses a stream by looking up the appropriate stream in the CDC metadata
* enable creating tables with CDC enabled in tablets-enabled keyspaces; tablets are allocated for the CDC table, and a stream is created for each tablet.
* on tablet resize (split / merge), the topology coordinator creates a new stream set with a new stream for each new tablet.
* the CDC tablets are co-located with the base tablets
Fixes https://github.com/scylladb/scylladb/issues/22576
backport not needed - new feature
update dtests: https://github.com/scylladb/scylla-dtest/pull/5897
update java cdc library: https://github.com/scylladb/scylla-cdc-java/pull/102
update rust cdc library: https://github.com/scylladb/scylla-cdc-rust/pull/136
Closes scylladb/scylladb#23795
* github.com:scylladb/scylladb:
docs/dev: update CDC dev docs for tablets
doc: update CDC docs for tablets
test: cluster_events: enable add_cdc and drop_cdc
test/cql: enable cql cdc tests to run with tablets
test: test_cdc_with_alter: adjust for cdc with tablets
test/cqlpy: adjust cdc tests for tablets
test/cluster/test_cdc_with_tablets: introduce cdc with tablets tests
cdc: enable cdc with tablets
topology coordinator: change streams on tablet split/merge
cdc: virtual tables for cdc with tablets
cdc: generate_stream_diff helper function
cdc: choose stream in tablets enabled keyspaces
cdc: rename get_stream to get_vnode_stream
cdc: load tablet streams metadata from tables
cdc: helper functions for reading metadata from tables
cdc: colocate cdc table with base
cdc: remove streams when dropping CDC table
cdc: create streams when allocating tablets
migration_listener: add on_before_allocate_tablet_map notification
cdc: notify when creating or dropping cdc table
cdc: move cdc table creation to pre_create
cdc: add internal tables for cdc with tablets
cdc: add cdc_with_tablets feature flag
cdc: add is_log_schema helper
Read the CDC stream metadata from the internal system tables, and store
it in the CDC metadata data structures.
The metadata is stored in the tables as diffs, which is more
storage-efficient, but in memory we store it as full stream sets for each
timestamp. This is more useful because we need to be able to find a
stream given a timestamp and a token.
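A simplified sketch of the lookup this layout serves (all types are illustrative): given a write timestamp, pick the newest stream set not newer than it, then find the stream owning the token.
```
#include <algorithm>
#include <cstdint>
#include <map>
#include <optional>
#include <vector>

using timestamp = int64_t;
using token = int64_t;

struct stream { token end; int64_t id; };   // hypothetical simplification

// Full stream set per generation timestamp; each set sorted by range end.
using stream_sets = std::map<timestamp, std::vector<stream>>;

// Find the stream owning token `t` in the newest generation at or before `ts`.
std::optional<stream> find_stream(const stream_sets& sets, timestamp ts, token t) {
    auto it = sets.upper_bound(ts);      // first generation strictly after ts
    if (it == sets.begin()) {
        return std::nullopt;             // no generation active yet
    }
    --it;                                // newest generation with key <= ts
    const auto& streams = it->second;
    auto s = std::lower_bound(streams.begin(), streams.end(), t,
                              [](const stream& st, token tok) { return st.end < tok; });
    return s != streams.end() ? std::optional(*s) : std::nullopt;
}
```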
The Raft topology workflow was changed by the limited voters feature:
nodes no longer request votership themselves. As a result, the
"stop_before_becoming_raft_voter" error injection is now obsolete and
has been removed.
Fixes: scylladb/scylladb#23418
If the view building coordinator is running, adjust view_building_tasks
in case of tablet operations.
The mutations are generated in the same batch as tablet mutations.
At the start of a tablet migration/resize/RF change, started
view building tasks are aborted (by setting the ABORTED state) if needed.
Then new, adjusted tasks are created in the group0 batch which ends the
tablet operation, and the aborted tasks are removed from the table.
In case the tablet operation fails or is revoked, the aborted view
building tasks are rolled back by creating new copies of them, and the
aborted ones are deleted from the table.
View building tasks are not aborted/changed during tablet repair,
because in this case, even if a vb task is started, a staging sstable will
be generated.
The state may also be reloaded on `topology_change` or `mixed_change`,
because the topology coordinator may change view building tasks during
tablet operations.
https://github.com/scylladb/scylladb/issues/24962 introduced memtable overlap checks to cache tombstone GC. This was observed to be very strict and to greatly reduce the effectiveness of tombstone GC in the cache, especially for MV workloads, which regularly recycle old timestamps into new writes, so the memtable often has a smaller min live timestamp than the timestamps of the tombstones in the cache.
When creating a new memtable, save a snapshot of the tombstone gc state. This snapshot is used later to exclude this memtable from overlap checks for tombstones whose token, according to the snapshot, was repaired after the tombstone's expiry time, meaning: all writes in this memtable were produced at a point in time when the tombstone had already expired. This has the following implications:
* The partition the tombstone is part of was already repaired at the time the memtable was created.
* All writes in the memtable were produced *after* this tombstone's expiry time, so these writes cannot possibly be relevant for this tombstone.
Based on this, such memtables are excluded from the overlap checks. With adequately frequent memtable flushes -- so that the tombstone gc state snapshot is refreshed -- most memtables should be excluded from overlap checks, greatly helping the cache's tombstone GC efficiency.
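A rough sketch of the exclusion predicate as I read the description above (the names and the exact comparison are illustrative, not the actual implementation):
```
#include <chrono>

using time_point = std::chrono::system_clock::time_point;

// Hypothetical snapshot entry: the repair time recorded for the
// tombstone's token at the moment the memtable was created.
struct tombstone_gc_state_snapshot {
    time_point repair_time_for_token;
};

// The memtable can be skipped in the overlap check if, when it was
// created, the tombstone's token had already been repaired after the
// tombstone expired: every write the memtable can contain postdates
// the tombstone's expiry.
bool memtable_excluded(const tombstone_gc_state_snapshot& snap, time_point tombstone_expiry) {
    return snap.repair_time_for_token > tombstone_expiry;
}
```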
Fixes: https://github.com/scylladb/scylladb/issues/24962
Fixes a regression introduced by https://github.com/scylladb/scylladb/pull/23255, which was backported to all releases; this fix needs to be backported to all releases as well.
Closes scylladb/scylladb#25033
* github.com:scylladb/scylladb:
docs/dev/tombstone.md: document the memtable overlap check elision optimization
test/boost/row_cache_test: add test for memtable overlap check elision
db/cache_mutation_reader: obtain gc-before and min-live-ts lazily
mutation/mutation_compactor: use max_purgeable::can_purge and max_purgeable::purge_result
db/cache_mutation_reader: use max_purgeable::can_purge()
replica/table: get_max_purgeable_fn_for_cache_underlying_reader(): use max_purgable::combine()
replica/database: memtable_list::get_max_purgeable(): set expiry-treshold
compaction/compaction_garbage_collector: max_purgeable: add expiry_treshold
replica/table: propagate gc_state to memtable_list
replica/memtable_list: add tombstone_gc_state* member
replica/memtable: add tombstone_gc_state_snapshot
tombstone_gc: introduce tombstone_gc_state_snapshot
tombstone_gc: extract shared state into shared_tombstone_gc_state
tombstone_gc: per_table_history_maps::_group0_gc_time: make it a value
tombstone_gc: fold get_group0_gc_time() into its caller
tombstone_gc: fold get_or_create_group0_gc_time() into update_group0_refresh_time()
tombstone_gc: fold get_or_create_repair_history_for_table() into update_repair_time()
tombstone_gc: refactor get_or_greate_repair_history_for_table()
replica/memtable_list: s/min_live_timestamp()/get_max_purgeable()/
db/read_context: return max_purgeable from get_max_purgeable()
compaction/compaction_garbage_collector: add formatter for max_purgeable
mutation: move definition of gc symbols to compaction.cc
compaction/compaction_garbage_collector: refactor max_purgeable into a class
test/boost/row_cache_test: refactor test_populating_reader_tombstone_gc_with_data_in_memtable
test: rewrite test_compacting_reader_tombstone_gc_with_data_in_memtable in C++
test/boost/row_cache_test: refactor cache tombstone GC with memtable overlap tests
Instead of storing the state partially in tombstone_gc and partially in
an external map, move all external parts into the new
shared_tombstone_gc_state. This new class is responsible for
keeping and updating the repair history; tombstone_gc_state just keeps
const pointers to the shared state as before and is only responsible for
querying the tombstone gc before times.
This separation makes the code easier to follow and also enables further
patching of tombstone_gc_state.
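Schematically, the split looks roughly like this (a sketch of the shape, not the actual class definitions):
```
#include <chrono>
#include <map>

// Owns the mutable, shared parts: repair history, group0 gc time, etc.
// One instance, updated as repair/group0 state changes.
struct shared_tombstone_gc_state {
    std::map<int, std::chrono::system_clock::time_point> repair_history; // per table
    std::chrono::system_clock::time_point group0_gc_time{};
};

// Query-only view: keeps a const pointer to the shared state and only
// answers "what is the gc-before time?" questions.
class tombstone_gc_state {
    const shared_tombstone_gc_state* _shared;
public:
    explicit tombstone_gc_state(const shared_tombstone_gc_state* s) : _shared(s) {}
    std::chrono::system_clock::time_point group0_gc_before() const {
        return _shared->group0_gc_time;
    }
};
```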
Implement odd-number voter enforcement in the group0 voter calculator to
ensure proper Raft consensus behavior. Raft consensus requires a majority
of voters to make decisions, and an odd number of voters is preferred
because an even number adds no additional reliability but introduces
the risk of scenarios where no group can make progress. If an even number
of voters is divided into two groups of equal size during a network
partition, neither group has a majority and both are unable to commit
new entries. With an odd number of voters, such equal-partition
scenarios are impossible (unless the network is partitioned into at least
three groups).
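The enforcement itself reduces to clamping the voter target to an odd value; a minimal sketch (the function name is illustrative):
```
#include <cstddef>

// Clamp the voter count to an odd value. A cluster of 2k voters needs
// k+1 votes for a majority and so tolerates k-1 failures, exactly the
// same as a cluster of 2k-1 voters, while allowing a k/k split in
// which neither side has a majority.
std::size_t odd_voter_target(std::size_t candidates, std::size_t max_voters) {
    std::size_t n = candidates < max_voters ? candidates : max_voters;
    if (n > 1 && n % 2 == 0) {
        --n;
    }
    return n;
}
```
For example, `odd_voter_target(4, 5)` yields 3: a 2-2 voter split can stall, while a 3-voter set cannot be partitioned into two equal halves.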
Fixes: scylladb/scylladb#23266
The following steps are performed in sequence as part of the
Raft-based recovery procedure:
- set `recovery_leader` to the host ID of the recovery leader in
`scylla.yaml` on all live nodes,
- send the `SIGHUP` signal to all Scylla processes to reload the config,
- perform a rolling restart (with the recovery leader being restarted
first).
These steps are not intuitive and more complicated than they could be.
In this PR, we simplify these steps. From now on, we will be able to
simply set `recovery_leader` on each node just before restarting it.
Apart from making necessary changes in the code, we also update all
tests of the Raft-based recovery procedure and the user-facing
documentation.
Fixes scylladb/scylladb#25015
The Raft-based procedure was added in 2025.2. This PR makes the
procedure simpler and less error-prone, so it should be backported
to 2025.2 and 2025.3.
Closes scylladb/scylladb#25032
* github.com:scylladb/scylladb:
docs: document the option to set recovery_leader later
test: delay setting recovery_leader in the recovery procedure tests
gossip: add recovery_leader to gossip_digest_syn
db: system_keyspace: peers_table_read_fixup: remove rows with null host_id
db/config, gms/gossiper: change recovery_leader to UUID
db/config, utils: allow using UUID as a config option
Currently the service levels cache is unnecessarily updated on every
call of `topology_state_load()`, but it is enough to reload it only
when a snapshot is loaded.
(The cache is already updated when there is a change to one of the
`service_levels_v2`, `role_members`, or `role_attributes` tables.)
Fixes scylladb/scylladb#25114
Fixes scylladb/scylladb#23065
Closes scylladb/scylladb#25116
Previously, `raft_group0::abort()` was called in `storage_service::do_drain` (introduced in #24418) to stop the group0 Raft server before destroying local storage. This was necessary because `raft::server` depends on storage (via `raft_sys_table_storage` and `group0_state_machine`).
However, this caused issues: services like `sstable_dict_autotrainer` and `auth::service`, which use `group0_client` but are not stopped by `storage_service`, could trigger use-after-free if `raft_group0` was destroyed too early. This can happen both during normal shutdown and when 'nodetool drain' is used.
This PR reworks the shutdown logic:
* Introduces `abort_and_drain()`, which aborts the server and waits for background tasks to finish, but keeps the server object alive. Clients will see `raft::stopped_error` if they try to access group0 after this method is called.
* Final destruction now happens in `abort_and_destroy()`, called later from `main.cc`, ensuring safe cleanup.
The `raft_server_for_group::aborted` is changed to a `shared_future`, as it is now awaited in both abort methods.
Node startup can fail before reaching `storage_service`, in which case `drain_on_shutdown()` and `abort_and_drain()` are never called. To ensure proper cleanup, `raft_group0` deinitialization logic must be included in both `abort_and_drain()` and `abort_and_destroy()`.
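The `shared_future` detail matters because a plain `future` can be consumed only once; a sketch of the pattern (the struct is a hypothetical reduction of `raft_server_for_group`):
```
#include <seastar/core/future.hh>
#include <seastar/core/shared_future.hh>

// Hypothetical reduction: `aborted` resolves once the server's
// background work has stopped.
struct server_for_group {
    seastar::shared_future<> aborted;
};

// Both abort_and_drain() and abort_and_destroy() can await the same
// completion: each get_future() call hands out an independent future.
seastar::future<> wait_for_abort(server_for_group& g) {
    return g.aborted.get_future();
}
```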
Refs #25115
Fixes #24625
Backport: the changes are complicated and not safe to backport, we'll backport a revert of the original patch (#24418) in a separate PR.
Closes scylladb/scylladb#25151
* https://github.com/scylladb/scylladb:
raft_group0: split shutdown into abort_and_drain and destroy
Revert "main.cc: fix group0 shutdown order"
Previously, raft_group0::abort() was called in
storage_service::do_drain (introduced in #24418) to
stop the group0 Raft server before destroying local storage.
This was necessary because raft::server depends on storage
(via raft_sys_table_storage and group0_state_machine).
However, this caused issues: services like
sstable_dict_autotrainer and auth::service, which use
group0_client but are not stopped by storage_service,
could trigger use-after-free if raft_group0 was destroyed
too early. This can happen both during normal shutdown
and when 'nodetool drain' is used.
This commit reworks the shutdown logic:
* Introduces abort_and_drain(), which aborts the server
and waits for background tasks to finish, but keeps the
server object alive. Clients will see raft::stopped_error if
they try to access group0 after abort_and_drain().
* Final destruction happens in a separate method destroy(),
called later from main.cc.
The raft_server_for_group::aborted is changed to a
shared_future: abort_server now returns a future so that
we can wait for it in abort_and_drain(). If abort_server
was already called (which can happen in the
on_background_error callback), it returns the future from
that previous call.
Node startup can fail before reaching storage_service,
in which case ss.drain_on_shutdown() and abort_and_drain()
are never called. To ensure proper cleanup,
abort_and_drain() is called from main.cc before destroy().
Clients of raft_group_registry are expected to call
destroy_server() for the servers they own. Currently,
the only such client is raft_group0, which satisfies
this requirement. As a result,
raft_group_registry::stop_servers() is no longer needed.
Instead, raft_group_registry::stop() now verifies that all
servers have been properly destroyed.
If any remain, it calls on_internal_error().
The call to drain_on_shutdown() in cql_test_env.cc
appears redundant. The only source of raft::server
instances in raft_group_registry is group0_service, and
if group0_service.start() succeeds, both abort_and_drain()
and destroy() are guaranteed to be called during shutdown.
Refs: #22099 (issue)
Refs: #25079 (pr)
Remove the unused include for partition_slice_builder.
This makes it clear that group0_state_machine.cc does not
depend on partition_slice_builder.
Closes scylladb/scylladb#25125
As requested in #22102, #22103, and #22105, moved the files and fixed other includes and the build system.
Moved files:
- clustering_bounds_comparator.hh
- keys.cc
- keys.hh
- clustering_interval_set.hh
- clustering_key_filter.hh
- clustering_ranges_walker.hh
- compound_compat.hh
- compound.hh
- full_position.hh
Fixes: #22102
Fixes: #22103
Fixes: #22105
Closes scylladb/scylladb#25082
We change the type of the `recovery_leader` config parameter and
`gossip_config::recovery_leader` from sstring to UUID. `recovery_leader`
is supposed to store host ID, so UUID is a natural choice.
After changing the type to UUID, if the user provides an incorrect UUID,
parsing `recovery_leader` will fail early, but the start-up will
continue. Outside the recovery procedure, `recovery_leader` will then be
ignored. In the recovery procedure, the start-up will fail on:
```
throw std::runtime_error(
"Cannot start - Raft-based topology has been enabled but persistent group 0 ID is not present. "
"If you are trying to run the Raft-based recovery procedure, you must set recovery_leader.");
```
Previously, nodes would become voters immediately after joining, ensuring voter status was established before bootstrap completion. With the limited voters feature, voter assignment became deferred, creating a timing gap where nodes could finish bootstrapping without becoming voters.
This timing issue could lead to quorum loss scenarios, particularly observed in tests but theoretically possible in production environments.
This commit reorders voter assignment to occur before the `update_topology_state()` call, ensuring nodes achieve voter status before bootstrap operations are marked complete. This prevents the problematic timing gap while maintaining compatibility with limited voters functionality.
If voter assignment succeeds but topology state update fails, the operation will raise an exception and be retried by the topology coordinator, maintaining system consistency.
This commit also fixes an issue where `update_nodes` ignored leaving voters, potentially exceeding the voter limit and leaving voters unaccounted for.
Fixes: scylladb/scylladb#24420
No backport: Fix of a theoretical bug + CI stability improvement (we can backport eventually later if we see hits in branches)
Closes scylladb/scylladb#24843
* https://github.com/scylladb/scylladb:
raft: fix voter assignment of transitioning nodes
raft: improve comments in group0 voter handler
Previously, nodes would become voters immediately after joining, ensuring
voter status was established before bootstrap completion. With the limited
voters feature, voter assignment became deferred, creating a timing gap
where nodes could finish bootstrapping without becoming voters.
This timing issue could lead to quorum loss scenarios, particularly
observed in tests but theoretically possible in production environments.
This commit reorders voter assignment to occur before the
`update_topology_state()` call, ensuring nodes achieve voter status
before bootstrap operations are marked complete. This prevents the
problematic timing gap while maintaining compatibility with limited
voters functionality.
If voter assignment succeeds but topology state update fails, the
operation will raise an exception and be retried by the topology
coordinator, maintaining system consistency.
This commit also fixes an issue where `update_nodes` ignored leaving
voters, potentially exceeding the voter limit and leaving voters
unaccounted for.
Fixes: scylladb/scylladb#24420
The function validate_change in raft_group0_client is currently used to
validate tablet metadata changes, and therefore it applies only to
commands of type topology_change.
But the type mixed_change also allows topology change mutations, and it
is in fact used for tablet metadata changes, for example in
keyspace_rf_change.
Therefore, extend validate_change to also validate changes of type
mixed_change, so we can catch issues there as well.
Currently, the bootstrapping node goes through the synchronize phase after
initialization of group0, then enters the post_raft phase and becomes fully
ready for group0 operations. The topology coordinator is agnostic of this
and issues the stream ranges command as soon as the node successfully
completes `join_group0`. For a node booting into an already upgraded
cluster, the time the node spends in the synchronize phase is negligible,
but this race condition still causes trouble in a small percentage of
cases: the stream ranges operation fails, and the node fails to bootstrap.
This commit updates the error-throwing logic to account for this edge case
and lets the node wait (with a timeout) for the synchronize phase to finish
instead of throwing an error.
A regression test is also added to confirm the code change. The test delays
a newly joining node in the synchronize phase and releases it only after
execution reaches the synchronize case in the `start_operation` function.
This demonstrates that, with the updated code, `start_operation` waits for
the node to finish the synchronize phase instead of throwing an error.
This PR fixes a bug, so it needs to be backported.
Fixes: scylladb/scylladb#23536
Closes scylladb/scylladb#23829
The applier fiber needs local storage, so before shutting down local storage we need to make sure that group0 is stopped.
We also improve the logs for the case when `gate_closed_exception` is thrown while a mutation is being written.
Fixes [scylladb/scylladb#24401](https://github.com/scylladb/scylladb/issues/24401)
Backport: no backport -- not safe and the problem is minor.
Closes scylladb/scylladb#24418
* github.com:scylladb/scylladb:
storage_service: test_group0_apply_while_node_is_being_shutdown
main.cc: fix group0 shutdown order
storage_proxy: log gate_closed_exception
group0 persistence relies on local storage, so before
shutting down local storage we need to make sure that
group0 is stopped.
Fixes scylladb/scylladb#24401
Refactor the voter handler logic to only pass around node IDs
(`raft::server_id`), instead of pairs of IDs and node descriptor
references. Node descriptors can always be efficiently retrieved from
the original nodes map, which remains valid throughout the calculation.
This change reduces unnecessary reference passing and simplifies
the code. All node detail lookups are now performed via the central
nodes map as needed.
Fixes: scylladb/scylladb#24035
Refactor the voter handler to use explicit priority comparator classes
for datacenter and rack selection. This makes the prioritization logic
more transparent and robust, and reduces the risk of subtle bugs that
could arise from relying on implicit comparison operators.
Remove comments from the group0 voter handler that simply restate
the code or do not provide meaningful clarification. This improves
code readability and maintainability by reducing noise and focusing
on essential documentation.
The get_blob() method linearizes data by copying it into a
single buffer, which can trigger "oversized allocation" warnings.
This commit avoids that extra copy by creating an input stream
directly over the original fragmented managed bytes returned by
untyped_result_set_row::get_view().
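To illustrate the idea in plain C++ (hypothetical types; Scylla's managed bytes and input streams differ): reading through a stream that walks the fragments in place avoids the single large allocation that linearization would need.
```
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <span>
#include <vector>

// Hypothetical stand-in for a fragmented buffer: the payload lives in
// several non-contiguous chunks.
using fragment = std::span<const std::byte>;

// A tiny input stream over the fragments. Reads touch each chunk in
// place, so no contiguous copy of the whole value is ever made.
class fragmented_istream {
    const std::vector<fragment>& _frags;
    std::size_t _idx = 0, _off = 0;
public:
    explicit fragmented_istream(const std::vector<fragment>& f) : _frags(f) {}

    // Copy up to out.size() bytes into `out`; returns bytes read.
    std::size_t read(std::span<std::byte> out) {
        std::size_t done = 0;
        while (done < out.size() && _idx < _frags.size()) {
            auto src = _frags[_idx].subspan(_off);
            auto n = std::min(src.size(), out.size() - done);
            std::memcpy(out.data() + done, src.data(), n);
            done += n;
            _off += n;
            if (_off == _frags[_idx].size()) { ++_idx; _off = 0; }
        }
        return done;
    }
};
```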
Fixes scylladb/scylladb#23903
The limited voters feature did not account for the existing topology
coordinator (Raft leader) when selecting voters to be removed.
As a result, the limited voters calculator could inadvertently remove
the votership of the current topology coordinator, triggering
an unnecessary Raft leader re-election.
This change ensures that the existing topology coordinator's votership
status is preserved unless absolutely necessary. When choosing between
otherwise equivalent voters, the node other than the topology coordinator
is prioritized for removal. This helps maintain stability in the cluster
by avoiding unnecessary leader re-elections.
Additionally, only the alive leader node is considered relevant for this
logic. A dead existing leader (topology coordinator) is excluded from
consideration, as it is already in the process of losing leadership.
Fixes: scylladb/scylladb#23588
Fixes: scylladb/scylladb#23786
Fix an issue in the voter calculator where existing voters were not
retained across data centers and racks in certain scenarios. This
occurred when voters were distributed across more data centers and racks
than the maximum allowed number of voters.
Previously, the prioritization logic for data centers and racks did not
consider the number of existing assigned voters. It only prioritized
nodes within a single data center or rack, which could result in
unnecessary reassignment of voters.
Improved the prioritization logic to account for the number of existing
voters in each data center and rack.
This change ensures a more stable voter distribution and reduces
unnecessary voter reassignments.
Fixes: scylladb/scylladb#23950
Refactor the limited voters calculator to use a priority queue for
sorting nodes by their priorities. This change simplifies the voter
selection logic and makes it more extensible for future enhancements,
such as supporting more complex priority calculations.
The priority value is determined based on the node's existing status,
including whether it is alive, a voter, or any further criteria.
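A condensed sketch of the priority-queue selection (node and priority types are illustrative):
```
#include <cstddef>
#include <cstdint>
#include <queue>
#include <vector>

struct node_candidate {      // hypothetical simplification
    uint64_t id;
    int priority;            // higher = better voter candidate
};

struct by_priority {
    bool operator()(const node_candidate& a, const node_candidate& b) const {
        return a.priority < b.priority;   // max-heap on priority
    }
};

// Pick up to `limit` voters in priority order; e.g. alive existing
// voters get higher priority values, so they are drawn first.
std::vector<node_candidate> pick_voters(std::vector<node_candidate> nodes, std::size_t limit) {
    std::priority_queue<node_candidate, std::vector<node_candidate>, by_priority>
        q(by_priority{}, std::move(nodes));
    std::vector<node_candidate> voters;
    while (!q.empty() && voters.size() < limit) {
        voters.push_back(q.top());
        q.pop();
    }
    return voters;
}
```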
The output parameter cannot be `null`. Previously, a pointer was used to
make it explicit that the parameter is an output parameter being
modified. However, this is unnecessary, as references are more
appropriate for parameters that cannot be `null`.
Switching to a reference improves code readability and ensures the
parameter's non-null constraint is enforced at the type level.
Refactor the group0 voter handler by introducing a helper lambda to
handle the common logic for adding a node. This eliminates unnecessary
code duplication.
This refactor does not introduce any functional changes but prepares
the codebase for easier future modifications.
Refactor the code to use a consistent pattern for creating the
datacenter info list and the rack info list.
Both now use a map of vectors, which improves efficiency by reducing
temporary conversions to maps/sets during node list processing.
Also ensure the node descriptor is passed by reference instead of by
copy, leveraging the guaranteed lifetime of the descriptors.