"
The series moves node ops, repair and streaming verbs to IDL. Also
contains IDL related cleanups.
In addition to CI, the series was tested manually by bootstrapping a node with the
series into a cluster of old nodes with repair and streaming both in
gossiper and raft mode. This exercises repair, streaming and node_ops
paths.
"
* 'gleb/move-more-rpcs-to-idl-v3' of github.com:scylladb/scylla-dev:
repair: repair_flush_hints_batchlog_request::target_nodes is not used any more, so mark it as such
streaming: move streaming verbs to IDL
messaging_service: move repair verbs to IDL
node_ops: move node_ops_cmd to IDL
idl: rename partition_checksum.dist.hh to repair.dist.hh
idl: move node_ops related stuff from the repair related IDL
This change introduces a new truncate_with_tablets RPC with a parameter
of type service::frozen_topology_guard. This is materialized on replica
nodes into a topology_guard which guarantees that truncate is performed
under a global session, which, in turn, makes sure that we don't execute
truncate as a result of stale RPCs.
Also, this RPC does not have a timeout. Timeout will be handled on the
coordinator side, and the truncate operation will not be allowed to time
out.
RPCs from old nodes will still use old format so translation will be
used in this case. The change is backwards compatible thanks to RPC
extensibility.
This adds a new tablet migration kind: repair. It allows tablet repair
scheduler to use this migration kind to schedule repair jobs.
The current repair scheduler implementation does the following:
- A tablet is picked to be repaired when the time since last repair is
bigger than a threshold (auto repair mode) or it is requested by user
(manual repair mode)
- The tablet repair can be scheduled along with tablet migration and
rebuild. It runs in the tablet_migration track.
- Repair jobs are scheduled so that, at any point in time, there are no
more than the configured number of jobs per shard, which is similar to
Scylla Manager's control.
In this patch, both the manual repair and the auto repair are not
enabled yet.
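The selection rules listed above can be sketched roughly as follows. This is a minimal Python model, not the scheduler's actual code: the `Tablet` fields, the function name and the simplification that each tablet occupies a single shard are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Tablet:
    # Hypothetical, simplified view of a tablet: one shard per tablet.
    tablet_id: int
    shard: int
    last_repair_time: float
    repair_requested: bool = False  # manual repair mode


def pick_tablets_to_repair(tablets: List[Tablet], now: float,
                           threshold_s: float,
                           running_per_shard: Dict[int, int],
                           max_jobs_per_shard: int) -> List[int]:
    """Pick due tablets without exceeding the per-shard job limit."""
    load = dict(running_per_shard)  # copy, so the caller's map is untouched
    picked = []
    # Consider the least recently repaired tablets first.
    for t in sorted(tablets, key=lambda t: t.last_repair_time):
        due = t.repair_requested or (now - t.last_repair_time > threshold_s)
        if not due:
            continue
        if load.get(t.shard, 0) >= max_jobs_per_shard:
            continue  # shard is already at its configured limit
        load[t.shard] = load.get(t.shard, 0) + 1
        picked.append(t.tablet_id)
    return picked
```

In this model a tablet that is due but whose shard is full is simply skipped and would be reconsidered on a later scheduling pass.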
before this change, the header files generated with `idl-compiler.py`
are not regenerated if `idl-compiler.py` is updated. but they should,
as the change to the script could in turn change the generated header
files. because we have a typo in the `DEPENDS` argument,
`${idle_compiler}` is expanded to an empty string.
in this change, the typo is corrected, and the dependency from the
generated headers to the script is correctly reflected in the building
rules.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#21475
The hints and batchlog flush requests are issued to all nodes for each
repair request when tombstone_gc repair mode is used.
The number of such flush requests is high when all nodes in the cluster
run repair. It has been observed that a repair request can take a long
time, up to 15s, to finish such a flush request.
To reduce the overhead of the flush, each node caches the flush and only
executes the real flush when the cache time has passed. It is safe to do
so because the real flush_time is returned. Repair uses the smallest
flush_time returned from peers as the repair time.
The nice thing about the cache on the receiver side is that all senders
can hit the cache. It is better than cache on the sender side.
A slightly smaller flush_time compared to the real flush time will be
used, with the benefit of significantly fewer hints and batchlog
flushes. The trade-off looks reasonable.
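The receiver-side caching described above can be modeled roughly like this. It is a minimal Python sketch under stated assumptions: the class name, the injectable clock and the `repair_time` helper are hypothetical and do not mirror the actual C++ code.

```python
import time
from typing import Callable, List, Optional


class CachedFlush:
    """Illustrative receiver-side cache for hints/batchlog flush requests."""

    def __init__(self, do_flush: Callable[[], None], cache_time_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._do_flush = do_flush          # performs the real flush
        self._cache_time_s = cache_time_s  # 0 disables caching
        self._clock = clock
        self._last_flush_time: Optional[float] = None

    def flush(self) -> float:
        """Return the time of the real flush that covers this request."""
        now = self._clock()
        cached = self._last_flush_time
        if cached is not None and now - cached < self._cache_time_s:
            # Cache hit: report the older, real flush time, never `now`.
            return cached
        # The timestamp is taken before flushing, so it is conservative.
        self._do_flush()
        self._last_flush_time = now
        return now


def repair_time(flush_times: List[float]) -> float:
    """Repair uses the smallest flush_time returned from peers."""
    return min(flush_times)
```

Because a cache hit returns the time of the earlier real flush, the value reported to the sender is never newer than the data that was actually flushed.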
Tests: 2 nodes, with 1s batchlog delay:
Before:
Repair nr_repairs=20 cache_time_in_ms=0 total_repair_duration=40.04245328903198
After:
Repair nr_repairs=20 cache_time_in_ms=5000 total_repair_duration=1.252073049545288
Fixes #20259
forward_service is nondescriptive and misnamed, as it does more than
forward requests. It's a classic map/reduce algorithm (and in fact one
of its parameters is "reducer"), so name it accordingly.
The name "forward" leaked into the wire protocol for the messaging
service RPC isolation cookie, so it's kept there. It's also maintained
in the name of the logger (for "nodetool setlogginglevel") for
compatibility with tests.
Closes scylladb/scylladb#19444
The series adds a step during node's boot process, just before completing
the initialization, in which the node sends a notification to all other
normal nodes in the cluster that it is UP now. Other nodes wait for this
node to be UP and in normal state before replying. This ensures that,
in a healthy cluster, when a node starts serving queries the entire
cluster knows its up-to-date state. The notification is a best effort
though. If some nodes are down or do not reply in time the boot process
continues. It is somewhat similar to shutdown notification in this regard.
* 'gleb/notify-up-v2' of github.com:scylladb/scylla-dev:
gossiper: wait for a bootstrapping node to be seen as normal on all nodes before completing initialization
Wait for booting node to be marked UP before complete booting.
gossiper: move gossip verbs to the idl
Currently a node does not wait to be marked UP by other nodes before
completing boot, which creates a usability issue: during a rolling restart
it is not enough to wait for the local CQL port to be opened before
restarting the next node. It is also necessary to check that all other
nodes already see this node as alive; otherwise, if the next node is
restarted, some nodes may see two nodes as dead instead of one.
This patch improves the situation by making sure that the boot process does
not complete until all other nodes see the booting one as alive.
This is still a best effort thing: if some nodes are unreachable or
gossiper propagation takes too much time the boot process continues
anyway.
Fixes scylladb/scylladb#19206
Currently, a view update backlog may reach an invalid state, when
its max is 0 and its relative_size() is NaN as a result. This can
be achieved either by constructing the backlog with a 0 max or by
modifying the max of an existing backlog. In particular, this
happens when creating the backlog using the default constructor.
In this patch the default constructor is deleted and a check that
the max is different from 0 is added to the remaining
constructor. If the check fails, we construct an empty
backlog instead, to handle the possibility of getting an invalid
backlog sent from a node with a version that's missing this check.
Additionally, we make the backlog's members private, exposing them
only through const getters.
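The invariant can be illustrated with a small Python model. The class name and the representation of the "empty" backlog (zero current size with a very large max) are assumptions made for this sketch, not a copy of the C++ type.

```python
import sys


class UpdateBacklog:
    """Illustrative model of a backlog whose max can never be 0."""

    _NO_LIMIT = sys.maxsize  # assumed stand-in for "empty backlog" max

    def __init__(self, current: int, max_size: int) -> None:
        # There is deliberately no default constructor equivalent.
        if max_size == 0:
            # Invalid input, e.g. received from a node without this check:
            # fall back to an empty backlog instead of producing NaN later.
            current, max_size = 0, self._NO_LIMIT
        self._current = current
        self._max = max_size

    # Read-only accessors: the members cannot be modified after construction.
    @property
    def current(self) -> int:
        return self._current

    @property
    def max(self) -> int:
        return self._max

    def relative_size(self) -> float:
        return self._current / self._max  # _max is never 0 here
```

With the members read-only and the constructor rejecting a zero max, `relative_size()` has no path left to a division by zero.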
Change the format of sync points to use host ID instead of IPs, to be
consistent with the use of host IDs in hinted handoff module.
Introduce sync point v3 format which is the same as v2 except it stores
host IDs instead of IPs.
The encoding of sync points now always uses the new v3 format with host
IDs.
The decoding supports both formats, with host IDs and with IPs, so a sync
point now contains a variant of either type, and in the case of the new
format the translation from IP to host ID is avoided.
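A rough Python model of the decode-side behavior is shown below. The function name, the use of plain strings for addresses and the lookup table are hypothetical; the real encoding is not reproduced here.

```python
from typing import Dict, List


def sync_point_host_ids(version: int, addresses: List[str],
                        ip_to_host_id: Dict[str, str]) -> List[str]:
    """Resolve the endpoints of a decoded sync point to host IDs."""
    if version == 3:
        # New format: host IDs are stored directly, no translation needed.
        return list(addresses)
    if version == 2:
        # Old format: stored IPs must be translated to host IDs.
        return [ip_to_host_id[ip] for ip in addresses]
    raise ValueError(f"unsupported sync point version: {version}")
```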
because of https://bugzilla.redhat.com/show_bug.cgi?id=2278689,
the rebuilt abseil package provided by fedora has different settings
than the ones if the tree is built with the sanitizer enabled. this
inconsistency leads to a crash.
to address this problem, we have to reinstate the abseil submodule, so
we can build it with the same compiler options with which we build the
tree.
in this change
* Revert "build: drop abseil submodule, replace with distribution abseil"
* update CMake building system with abseil header include settings
* bump up the abseil submodule to the latest LTS branch of abseil:
lts_2024_01_16
* update scylla-gdb.py to adapt to the new structure of
flat_hash_map
This reverts commit 8635d24424.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#18511
* 'gleb/raft_snapshot_rpc-v3' of github.com:scylladb/scylla-dev:
raft topology: drop RAFT_PULL_TOPOLOGY_SNAPSHOT RPC
Use correct limit for raft commands throughout the code.
Our interval template started life as `range`, and supported
wrapping to follow Cassandra's convention of wrapping around the
maximum token.
We later recognized that an interval type should usually be non-wrapping
and split it into wrapping_range and nonwrapping_range, with `range`
aliasing wrapping_range to preserve compatibility.
Even later, we realized the name was already taken by C++ ranges and
so renamed it to `interval`. Given that intervals are usually non-wrapping,
the default `interval` type is non-wrapping.
We can now simplify it further, recognizing that everyone assumes
that an interval is non-wrapping and so doesn't need the
nonwrapping_interval_designation. We just rename nonwrapping_interval
to `interval` and remove the type alias.
range.hh was deprecated in bd794629f9 (2020) since its names
conflict with the C++ library concept of an iterator range. The name
::range also mapped to the dangerous wrapping_interval rather than
nonwrapping_interval.
Complete the deprecation by removing range.hh and replacing all the
aliases by the names they point to from the interval library. Note
this now exposes uses of wrapping intervals as they are now explicit.
The unit tests are renamed and range.hh is deleted.
Closes scylladb/scylladb#17428
This implements the RPC for collecting table stats.
Since both the leaving and the pending replica can be accounted during
tablet migration, the RPC handler will look at the tablet transition
info and account for only one of them, either the leaving or the
pending replica, based on the tablet migration stage. Replicas that
are neither leaving nor pending, of course, don't contribute to the
anomaly in the reported size.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Now that we have explicit status for each request we may use it to
replace shutdown notification rpc. During a decommission, in
left_token_ring state, we set done to true after metadata barrier
that waits for all requests to the decommissioning node to complete
and notify the decommissioning node with a regular barrier. At this
point the node will see that the request is complete and exit.
When a new node joins the cluster we need to be sure that its IP
is known to all other nodes. In this patch we do this by waiting
for the IP to appear in raft_address_map.
A new raft_topology_cmd::command::wait_for_ip command is added.
It's run on all nodes of the cluster before we put the topology
into transition state. This applies both to new and replacing nodes.
It's important to run wait_for_ip before moving to
topology::transition_state::join_group0 since in this state
node IPs are already used to populate pending nodes in erm.
State changes are processed as a batch and
there is no reason to maintain them as an ordered map.
Instead, use a std::unordered_map that is more efficient.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add a new variant of the reply to the direct_fd_ping which specifies
whether the local group0 is alive or not, and start actively using it.
There is no need to introduce a cluster feature. Due to how our
serialization framework works, nodes which do not recognize the new
variant will treat it as the existing std::monostate. The std::monostate
means "the node and group0 is alive"; nodes before the changes in this
commit would send a std::monostate anyway, so this is completely
transparent for the old nodes.
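The compatibility argument can be modeled in a few lines of Python. The tags, class names and payload layout are hypothetical; the sketch only illustrates the stated behavior that an unrecognized variant is treated as std::monostate.

```python
from dataclasses import dataclass
from typing import Dict, Set, Union


@dataclass(frozen=True)
class Monostate:
    """Stands in for std::monostate: the node and group0 are alive."""


@dataclass(frozen=True)
class Group0Status:
    """Hypothetical new reply variant carrying the local group0 state."""
    group0_alive: bool


Reply = Union[Monostate, Group0Status]


def decode_reply(tag: str, payload: Dict[str, bool],
                 known_tags: Set[str]) -> Reply:
    """Decode a ping reply; unknown variants degrade to Monostate."""
    if tag == "group0_status" and tag in known_tags:
        return Group0Status(group0_alive=payload["group0_alive"])
    return Monostate()


def is_alive(reply: Reply) -> bool:
    if isinstance(reply, Group0Status):
        return reply.group0_alive
    return True  # Monostate means alive
```

An old node, whose `known_tags` lacks the new variant, sees the same value it always received, which is why no cluster feature is needed.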
to have feature parity with `configure.py`. we won't need this
once we migrate to C++20 modules. but before that day comes, we
need to stick with C++ headers.
we generate a rule for each .hh file to create a corresponding
.cc and then compile it, in order to verify the self-containedness of
that header. so the number of rules is quite large. to avoid the
unnecessary overhead, the check-header target is enabled only if
`Scylla_CHECK_HEADERS` option is enabled.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#15913
This PR contains several refactorings related to truncation records handling in the `system_keyspace`, `commitlog_replayer` and `table` classes:
* drop map_reduce from `commitlog_replayer`, it's sufficient to load truncation records from the null shard;
* add a check that `table::_truncated_at` is properly initialized before it's accessed;
* move its initialization after `init_non_system_keyspaces`
Closes scylladb/scylladb#15583
* github.com:scylladb/scylladb:
system_keyspace: drop truncation_record
system_keyspace: remove get_truncated_at method
table: get_truncation_time: check _truncated_at is initialized
database: add_column_family: initialize truncation_time for new tables
database: add_column_family: rename readonly parameter to is_new
system_keyspace: move load_truncation_times into distributed_loader::populate_keyspace
commitlog_replayer: refactor commitlog_replayer::impl::init
system_keyspace: drop redundant typedef
system_keyspace: drop redundant save_truncation_record overload
table: rename cache_truncation_record -> set_truncation_time
system_keyspace: get_truncated_position -> get_truncated_positions
The `join_node_request` and `join_node_response` RPCs are added:
- `join_node_request` is sent from the joining node to any node in the
cluster. It contains some initial parameters that will be verified by
the receiving node, or the topology coordinator - notably, it contains
a list of cluster features supported by the joining node.
- `join_node_response` is sent from the topology coordinator to the
joining node to tell it about the outcome of the verification.
In unlucky but possible circumstances where a node is being replaced
very quickly, RPC requests using raft-related verbs from storage_service
might be sent to it - even before the node starts its group 0 server.
In the latter case, this triggers on_internal_error.
This commit adds protection to the existing verbs in storage_service:
they check whether the group 0 is running and whether the received
host_id matches the actual recipient's host_id.
None of the verbs that are modified are in any existing release, so the
added parameter does not have to be wrapped in rpc::optional.
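The two checks can be sketched as a small guard function. This is a Python model only: the exception types, the function name and the order of the checks are assumptions, not the actual handler code.

```python
class Group0NotRunningError(Exception):
    """Hypothetical error: the verb arrived before group 0 was started."""


class WrongDestinationError(Exception):
    """Hypothetical error: the verb was addressed to a different host ID."""


def check_raft_verb_destination(group0_running: bool,
                                local_host_id: str,
                                dst_host_id: str) -> None:
    """Reject a raft-related verb that this node must not handle."""
    if not group0_running:
        # Fail the request cleanly rather than hit an internal error.
        raise Group0NotRunningError("group 0 server is not running yet")
    if dst_host_id != local_host_id:
        # The sender meant the node that previously used this address.
        raise WrongDestinationError(
            f"request for {dst_host_id} received by {local_host_id}")
```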
Move node_ops related classes to node_ops/ so that they
are consistently grouped and can be accessed from
many modules.
Closes #15351
* github.com:scylladb/scylladb:
node_ops: extract classes related to node operations
node_ops: repair: move node_ops_id to node_ops directory
This change adds a stub for tablet cleanup on the replica side and wires
it into the tablet migration process.
The handling on replica side is incomplete because it doesn't remove
the actual data yet. It only flushes the memtables, so that all data
is in sstables and none requires a memtable flush.
This patch is necessary to make decommission work. Otherwise, a
memtable flush would happen when the decommissioned node is put in the
drained state (as in nodetool drain) and it would fail on missing host
id mapping (node is no longer in topology), which is examined by the
tablet sharder when producing sstable sharding metadata, leading to an
abort due to the failed memtable flush.
This PR collects followups described in #14972:
- The `system.topology` table is now flushed every time feature-related
columns are modified. This is done because of the feature check that
happens before the schema commitlog is replayed.
- The implementation now guarantees that, if all nodes support some
feature as described by the `supported_features` column, then support
for that feature will not be revoked by any node. Previously, in an
edge case where a node is the last one to add support for some feature
`X` in `supported_features` column, crashes before applying/persisting
it and then restarts without supporting `X`, it would be allowed to boot
anyway and would revoke support for the `X` in `system.topology`.
The existing behavior, although counterintuitive, was safe - the
topology coordinator is responsible for explicitly marking features as
enabled, and in order to enable a feature it needs to perform a special
kind of a global barrier (`barrier_after_feature_update`) which only
succeeds after the node has updated its features column - so there is no
risk of enabling an unsupported feature. In order to make the behavior
less confusing, the node now will perform a second check when it tries
to update its `supported_features` column in `system.topology`.
- The `barrier_after_feature_update` is removed and the regular global
`barrier` topology command is used instead. The `barrier` handler now
performs a feature check if the node did not have a chance to verify and
update its cluster features for the second time.
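The "support is never revoked" guarantee from the list above can be modeled as a set check. This Python sketch is an assumption-laden illustration: the function names, the exception type and the in-memory representation of the `supported_features` column are all hypothetical.

```python
from typing import Dict, Set


class UnsupportedFeatureError(Exception):
    """Hypothetical error raised when a node would revoke a feature."""


def features_node_must_support(enabled: Set[str],
                               supported_by_node: Dict[str, Set[str]]) -> Set[str]:
    """Enabled features plus those already supported by every node."""
    sets = [set(s) for s in supported_by_node.values()]
    common = set.intersection(*sets) if sets else set()
    return set(enabled) | common


def check_can_update_features(local_supported: Set[str],
                              enabled: Set[str],
                              supported_by_node: Dict[str, Set[str]]) -> None:
    """Refuse to proceed if the local node lacks a required feature."""
    missing = features_node_must_support(enabled, supported_by_node) - local_supported
    if missing:
        raise UnsupportedFeatureError(sorted(missing))
```

In this model, `supported_by_node` includes the restarting node's previously persisted entry, so a feature it once advertised together with all other nodes stays required.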
JOIN_NODE rpc will be sent separately as it is a big item on its own.
Fixes: #14972
Closes #15168
* github.com:scylladb/scylladb:
test: topology{_experimental_raft}: don't stop gracefully in feature tests
storage_service: remove _topology_updated_with_local_metadata
topology_coordinator: remove barrier_after_feature_update
topology_coordinator: perform feature check during barrier
storage_service: repeat the feature check after read barrier
feature_service: introduce unsupported_feature_exception
feature_service: move startup feature check to a separate function
topology_coordinator: account for features to enable in should_preempt_balancing
group0_state_machine: flush system.topology when updating features columns