Currently send_gossip_echo has a 22 seconds timeout
during which _abort_source is ignored.
Mark the verb as cancellable so it can be canceled
on shutdown / abort.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
"
The series makes endpoint state map in the gossiper addressable by host
id instead of ips. The transition has implication outside of the
gossiper as well. Gossiper based topology operations are affected by
this change since they assume that the mapping is ip based.
On wire protocol is not affected by the change as maps that are sent by
the gossiper protocol remain ip based. If old node sends two different
entries for the same host id the one with newer generation is applied.
If new node has two ids that are mapped to the same ip the newer one is
added to the outgoing map.
Interoperability was verified manually by running mixed cluster.
The series concludes the conversion of the system to be host id based.
"
* 'gleb/gossipper-endpoint-map-to-host-id-v2' of github.com:scylladb/scylla-dev:
gossiper: make examine_gossiper private
gossiper: rename get_nodes_with_host_id to get_node_ip
treewide: drop id parameter from gossiper::for_each_endpoint_state
treewide: move gossiper to index nodes by host id
gossiper: drop ip from replicate function parameters
gossiper: drop ip from apply_new_states parameters
gossiper: drop address from handle_major_state_change parameter list
gossiper: pass rpc::client_info to gossiper_shutdown verb handler
gossiper: add try_get_host_id function
gossiper: add ip to endpoint_state
serialization: fix std::map de-serializer to not invoke value's default constructor
gossiper: drop template from wait_alive_helper function
gossiper: move get_supported_features and its users to host id
storage_service: make candidates_for_removal host id based
gossiper: use peers table to detect address change
storage_service: use std::views::keys instead of std::views::transform that returns a key
gossiper: move _pending_mark_alive_endpoints to host id
gossiper: do not allow to assassinate endpoint in raft topology mode
gossiper: fix indentation after previous patch
gossiper: do not allow to assassinate non existing endpoint
Add two verbs needed to implement dictionary training for SSTable
compression.
SAMPLE_SSTABLES returns a list of randomly-selected chunks of Data files
with a given cardinality and using a given chunk size,
for the given table.
ESTIMATE_SSTABLE_VOLUME returns the total uncompressed size of all Data
files the given table.
"
This is series starts conversion of the gossiper to use host ids to
index nodes. It does not touch the main map yet, but converts a lot of
internal code to host id. There are also some unrelated cleanups that
were done while working on the series. On of which is dropping code
related to old shadow round. We replaced shadow round with explicit
GOSSIP_GET_ENDPOINT_STATES verb in cd7d64f588
which is in scylla-4.3.0, so there should be no compatibility problem.
We already dropped a lot of old shadow round code in previous patches
anyway.
I tested manually that old and new node can co-exist in the same
cluster,
"
* 'gleb/gossiper-host-id-v2' of github.com:scylladb/scylla-dev: (33 commits)
gossiper: drop unneeded code
gossiper: move _expire_time_endpoint_map to host_id
gossiper: move _just_removed_endpoints to host id
gossiper: drop unused get_msg_addr function
messaging_service: change connection dropping notification to pass host id only
messaging_service: pass host id to remove_rpc_client in down notification
treewide: pass host id to endpoint_lifecycle_subscriber
treewide: drop endpoint life cycle subscribers that do nothing
load_meter: move to host id
treewide: use host id directly in endpoint state change subscribers
treewide: pass host id to endpoint state change subscribers
gossiper: drop deprecated unsafe_assassinate_endpoint operation
storage_service: drop unused code in handle_state_removed
treewide: drop endpoint state change subscribers that do nothing
gossiper: drop ip address from handle_echo_msg and simplify code since host_id is now mandatory
gossiper: start using host ids to send messages earlier
messaging_service: add temporary address map entry on incoming connection
topology_coordinator: notify about IP change from sync_raft_topology_nodes as well
treewide: move everyone to use host id based gossiper::is_alive and drop ip based one
storage_proxy: drop unused template
...
Send digest ack and ack2 by host ids as well now since the id->ip
mapping is available after receiving digest syn. It allows to convert
more code to host id here.
The patch adds new marker for a verb - [[ip]] that means that for this
verb ip version of the verbs needs to be generated. Most of the verbs
do not need it.
Use host_id in a children list of a task in task manager to indicate
a node on which the child was created.
Move TASKS_CHILDREN_REQUEST to IDL. Send it by host_id.
Fixes: https://github.com/scylladb/scylladb/issues/22284.
Ip to host_id transition; backport isn't needed.
Closesscylladb/scylladb#22487
* github.com:scylladb/scylladb:
tasks: drop task_manager::config::broadcast_address as it's unused
tasks: replace ip with host_id in task_identity
api: task_manager: pass gossiper to api::set_task_manager
tasks: keep host_id in task_manager
tasks: move tasks_get_children to IDL
Before these changes, shutting down a node could be prolonged because of
mapreduce_service. `mapreduce_service::stop()` uninitializes messaging
service, which includes waiting for all ongoing RPC handlers. We already
had a mechanism for cancelling local mapreduce tasks, but we were missing
one for cancelling external queries.
In this commit, we modify the signature of the request so it supports
cancelling via an abort source. We also provide a reproducer test
for the problem.
Fixesscylladb/scylladb#22337Closesscylladb/scylladb#22651
File based stream is a new feature that optimizes tablet movement
significantly. It streams the entire SSTable files without deserializing
SSTable files into mutation fragments and re-serializing them back into
SSTables on receiving nodes. As a result, less data is streamed over the
network, and less CPU is consumed, especially for data models that
contain small cells.
The following patches are imported from the scylla enterprise:
*) Merge 'Introduce file stream for tablet' from Asias He
This patch uses Seastar RPC stream interface to stream sstable files on
network for tablet migration.
It streams sstables instead of mutation fragments. The file based
stream has multiple advantages over the mutation streaming.
- No serialization or deserialization for mutation fragments
- No need to read and process each mutation fragments
- On wire data is more compact and smaller
In the test below, a significant speed up is observed.
Two nodes, 1 shard per node, 1 initial_tablets:
- Start node 1
- Insert 10M rows of data with c-s
- Bootstrap node 2
Node 1 will migration data to node2 with the file stream.
Test results:
1) File stream: bytes on wire = 1132006250 bytes, bw = 836MB/s
[shard 0:stre] stream_blob - stream_sstables[eadaa8e0-a4f2-4cc6-bf10-39ad1ce106b0]
Finished sending sstable_nr=2 files_nr=18 files={} range=(-1,9223372036854775807] bytes_sent=1132006250 stream_bw=836MB/s
[shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 1.08004s seconds
2) Mutation stream: bytes on wire = 3030004736 bytes, bw = 125410.87 KiB/s = 128MB/s
[shard 0:stre] stream_session - [Stream #406dc8b0-56b5-11ee-bc2d-000bf4871058]
Streaming plan for Tablet migration-ks1-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=2958989 KiB, 125410.87 KiB/s
[shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 23.5992s seconds
Test Summary:
File stream v.s. Mutation stream improvements
- Stream bandwidth = 836 / 128 (MB/s) = 6.53X
- Stream time = 23.60 / 1.08 (Seconds) = 21.85X
- Stream bytes on wire = 3030004736 / 1132006250 (Bytes)= 2.67X
Closes scylladb/scylla-enterprise#3438
* github.com:scylladb/scylla-enterprise:
tests: Add file_stream_test
streaming: Implement file stream for tablet
*) streaming: Use new take_storage_snapshot interface
The new take_storage_snapshot returns a file object instead of a file
name. This allows the file stream sender to read from the file even if
the file is deleted by compaction.
Closes scylladb/scylla-enterprise#3728
*) streaming: Protect unsupported file types for file stream
Currently, we assume the file streamed over the stream_blob rpc verb is
a sstable file. This patch rejects the unsupported file types on the
receiver side. This allows us to stream more file types later using the
current file stream infrastructure without worrying about old nodes
processing the new file types in the wrong way.
- The file_ops::noop is renamed to file_ops::stream_sstables to be
explicit about the file types
- A missing test_file_stream_error_injection is added to the idl
Fixes: #3846
Tests: test_unsupported_file_ops
Closesscylladb/scylla-enterprise#3847
*) idl: Add service::session_id id to idl
It will be used in the next patch.
Refs #3907
*) streaming: Protect file stream with topology_guard
Similar to "storage_service, tablets: Use session to guard tablet
streaming", this patch protects file stream with topology_guard.
Fixes#3907
*) streaming: Take service topology_guard under the try block
Taking the service::topology_guard could throw. Currently, it throws
outside the try block, so the rpc sink will not be closed, causing the
following assertion:
```
scylla: seastar/include/seastar/rpc/rpc_impl.hh:815: virtual
seastar::rpc::sink_impl<netw::serializer,
streaming::stream_blob_cmd_data>::~sink_impl() [Serializer =
netw::serializer, Out = <streaming::stream_blob_cmd_data>]: Assertion
`this->_con->get()->sink_closed()' failed.
```
To fix, move more code including the topology_guard taking code to the
try block.
Fixes https://github.com/scylladb/scylla-enterprise/issues/4106Closesscylladb/scylla-enterprise#4110
*) Merge 'Preserve original SSTable state with file based tablet migration' from Raphael "Raph" Carvalho
We're not preserving the SSTable state across file based migration, so
staging SSTables for example are being placed into main directory, and
consequently, we're mixing staging and non-staging data, losing the
ability to continue from where the old replica left off.
It's expected that the view update backlog is transferred from old
into new replica, as migration doesn't wait for leaving replica to
complete view update work (which can take long). Elasticity is preferred.
So this fix guarantees that the state of the SSTable will be preserved
by propagating it in form of subdirectory (each subdirectory is
statically mapped with a particular state).
The staging sstables aren't being registered into view update generator
yet, as that's supposed to be fixed in OSS (more details can be found
at https://github.com/scylladb/scylladb/issues/19149).
Fixes#4265.
Closesscylladb/scylla-enterprise#4267
* github.com:scylladb/scylla-enterprise:
tablet: Preserve original SSTable state with file based tablet migration
sstables: Add get method for sstable state
*) sstable: (Re-)add shareabled_components getter
*) Merge 'File streaming sstables: Use sstable source/sink to transfer snapshots' from Calle Wilund
Fixes#4246
Alternative approach/better separation of concern, transport vs. sstable layer. Builds on #4472, but fancier.
Ensures we transfer and pre-process scylla metadata for streamed
file blobs first, then properly apply receiving nodes local config
by using a source and sink layer exported from sstables, which
handles things like ordering, metadata filtering (on source) as well
as handling metadata and proper IO paths when writing data on
receiver node (sink).
This implementation maintains the statelessness of the current
design, and the delegated sink side will re-read and re-write the
metadata for each component processed. This is a little wasteful,
but the meta is small, and it is less error prone than trying to do
caching cross-shards etc. The transport is isolated from the
knowledge.
This is an alternative/complement to #4436 and #4472, fixing the
underlying issue. Note that while the layers/API:s here allows easy
fixing of other fundamental problems in the feature (such as
destination location etc), these are not included in the PR, to keep
it as close to the current behaviour as possible.
Closesscylladb/scylla-enterprise#4646
* github.com:scylladb/scylla-enterprise:
raft_tests: Copy/add a topology test with encryption
file streaming: Use sstable source/sink to transfer snapshots
sstables: Add source and sink objects + producers for transfering a snapshot
sstable::types: Add remove accessor for extension info in metadata
*) The change for error injection in merge commit 966ea5955dd8760:
File streaming now has "stream_mutation_fragments" error injection points
so test_table_dropped_during_streaming works with file streaming.
*) doc: document file-based streaming
This commit adds a description of the file-based streaming feature to the documentation.
It will be displayed in the docs using the scylladb_include_flag directive after
https://github.com/scylladb/scylladb/pull/20182 is merged, backported to branch-6.0,
and, in turn, branch-2024.2.
Refs https://github.com/scylladb/scylla-enterprise/issues/4585
Refs https://github.com/scylladb/scylla-enterprise/issues/4254Closesscylladb/scylla-enterprise#4587
*) doc: move File-based streaming to the Tablets source file-based-streaming
This commit moves the description of file-based streaming from a common include file
to the regular doc source file where tablets are described.
Closesscylladb/scylla-enterprise#4652
*) streaming: sstable_stream_sink_impl: abort: prevent null pointer dereference
Closesscylladb/scylladb#22467
The repair_time in system.tablets will be updated when repair runs
successfully. We can now use it to update the repair time for tombstone
gc, i.e, when the system.tablets.repair_time is propagated, call
gc_state.update_repair_time() on the node that is the owner of the
tablet.
Since b3b3e880d3 ("repair: Reduce hints and batchlog flush"), the
repair time that could be used for tombstone gc might be smaller than
when the repair is started, so the actual repair time for tombstone gc
is returned by the repair rpc call from the repair master node.
Fixes#17507
New feature. No backport is needed.
Closesscylladb/scylladb#21896
* github.com:scylladb/scylladb:
repair: Stop using rpc to update repair time for repairs scheduled by scheduler
repair: Wire repair_time in system.tablets for tombstone gc
test: Disable flush_cache_time for two tablet repair tests
test: Introduce guarantee_repair_time_next_second helper
repair: Return repair time for repair_service::repair_tablet
service: Add tablet_operation.hh
File based stream is a new feature that optimizes tablet movement
significantly. It streams the entire SSTable files without deserializing
SSTable files into mutation fragments and re-serializing them back into
SSTables on receiving nodes. As a result, less data is streamed over the
network, and less CPU is consumed, especially for data models that
contain small cells.
The following patches are imported from the scylla enterprise:
*) Merge 'Introduce file stream for tablet' from Asias He
This patch uses Seastar RPC stream interface to stream sstable files on
network for tablet migration.
It streams sstables instead of mutation fragments. The file based
stream has multiple advantages over the mutation streaming.
- No serialization or deserialization for mutation fragments
- No need to read and process each mutation fragments
- On wire data is more compact and smaller
In the test below, a significant speed up is observed.
Two nodes, 1 shard per node, 1 initial_tablets:
- Start node 1
- Insert 10M rows of data with c-s
- Bootstrap node 2
Node 1 will migration data to node2 with the file stream.
Test results:
1) File stream: bytes on wire = 1132006250 bytes, bw = 836MB/s
[shard 0:stre] stream_blob - stream_sstables[eadaa8e0-a4f2-4cc6-bf10-39ad1ce106b0]
Finished sending sstable_nr=2 files_nr=18 files={} range=(-1,9223372036854775807] bytes_sent=1132006250 stream_bw=836MB/s
[shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 1.08004s seconds
2) Mutation stream: bytes on wire = 3030004736 bytes, bw = 125410.87 KiB/s = 128MB/s
[shard 0:stre] stream_session - [Stream #406dc8b0-56b5-11ee-bc2d-000bf4871058]
Streaming plan for Tablet migration-ks1-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=2958989 KiB, 125410.87 KiB/s
[shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 23.5992s seconds
Test Summary:
File stream v.s. Mutation stream improvements
- Stream bandwidth = 836 / 128 (MB/s) = 6.53X
- Stream time = 23.60 / 1.08 (Seconds) = 21.85X
- Stream bytes on wire = 3030004736 / 1132006250 (Bytes)= 2.67X
Closes scylladb/scylla-enterprise#3438
* github.com:scylladb/scylla-enterprise:
tests: Add file_stream_test
streaming: Implement file stream for tablet
*) streaming: Use new take_storage_snapshot interface
The new take_storage_snapshot returns a file object instead of a file
name. This allows the file stream sender to read from the file even if
the file is deleted by compaction.
Closes scylladb/scylla-enterprise#3728
*) streaming: Protect unsupported file types for file stream
Currently, we assume the file streamed over the stream_blob rpc verb is
a sstable file. This patch rejects the unsupported file types on the
receiver side. This allows us to stream more file types later using the
current file stream infrastructure without worrying about old nodes
processing the new file types in the wrong way.
- The file_ops::noop is renamed to file_ops::stream_sstables to be
explicit about the file types
- A missing test_file_stream_error_injection is added to the idl
Fixes: #3846
Tests: test_unsupported_file_ops
Closesscylladb/scylla-enterprise#3847
*) idl: Add service::session_id id to idl
It will be used in the next patch.
Refs #3907
*) streaming: Protect file stream with topology_guard
Similar to "storage_service, tablets: Use session to guard tablet
streaming", this patch protects file stream with topology_guard.
Fixes#3907
*) streaming: Take service topology_guard under the try block
Taking the service::topology_guard could throw. Currently, it throws
outside the try block, so the rpc sink will not be closed, causing the
following assertion:
```
scylla: seastar/include/seastar/rpc/rpc_impl.hh:815: virtual
seastar::rpc::sink_impl<netw::serializer,
streaming::stream_blob_cmd_data>::~sink_impl() [Serializer =
netw::serializer, Out = <streaming::stream_blob_cmd_data>]: Assertion
`this->_con->get()->sink_closed()' failed.
```
To fix, move more code including the topology_guard taking code to the
try block.
Fixes https://github.com/scylladb/scylla-enterprise/issues/4106Closesscylladb/scylla-enterprise#4110
*) Merge 'Preserve original SSTable state with file based tablet migration' from Raphael "Raph" Carvalho
We're not preserving the SSTable state across file based migration, so
staging SSTables for example are being placed into main directory, and
consequently, we're mixing staging and non-staging data, losing the
ability to continue from where the old replica left off.
It's expected that the view update backlog is transferred from old
into new replica, as migration doesn't wait for leaving replica to
complete view update work (which can take long). Elasticity is preferred.
So this fix guarantees that the state of the SSTable will be preserved
by propagating it in form of subdirectory (each subdirectory is
statically mapped with a particular state).
The staging sstables aren't being registered into view update generator
yet, as that's supposed to be fixed in OSS (more details can be found
at https://github.com/scylladb/scylladb/issues/19149).
Fixes#4265.
Closesscylladb/scylla-enterprise#4267
* github.com:scylladb/scylla-enterprise:
tablet: Preserve original SSTable state with file based tablet migration
sstables: Add get method for sstable state
*) sstable: (Re-)add shareabled_components getter
*) Merge 'File streaming sstables: Use sstable source/sink to transfer snapshots' from Calle Wilund
Fixes#4246
Alternative approach/better separation of concern, transport vs. sstable layer. Builds on #4472, but fancier.
Ensures we transfer and pre-process scylla metadata for streamed
file blobs first, then properly apply receiving nodes local config
by using a source and sink layer exported from sstables, which
handles things like ordering, metadata filtering (on source) as well
as handling metadata and proper IO paths when writing data on
receiver node (sink).
This implementation maintains the statelessness of the current
design, and the delegated sink side will re-read and re-write the
metadata for each component processed. This is a little wasteful,
but the meta is small, and it is less error prone than trying to do
caching cross-shards etc. The transport is isolated from the
knowledge.
This is an alternative/complement to #4436 and #4472, fixing the
underlying issue. Note that while the layers/API:s here allows easy
fixing of other fundamental problems in the feature (such as
destination location etc), these are not included in the PR, to keep
it as close to the current behaviour as possible.
Closesscylladb/scylla-enterprise#4646
* github.com:scylladb/scylla-enterprise:
raft_tests: Copy/add a topology test with encryption
file streaming: Use sstable source/sink to transfer snapshots
sstables: Add source and sink objects + producers for transfering a snapshot
sstable::types: Add remove accessor for extension info in metadata
*) The change for error injection in merge commit 966ea5955dd8760:
File streaming now has "stream_mutation_fragments" error injection points
so test_table_dropped_during_streaming works with file streaming.
*) doc: document file-based streaming
This commit adds a description of the file-based streaming feature to the documentation.
It will be displayed in the docs using the scylladb_include_flag directive after
https://github.com/scylladb/scylladb/pull/20182 is merged, backported to branch-6.0,
and, in turn, branch-2024.2.
Refs https://github.com/scylladb/scylla-enterprise/issues/4585
Refs https://github.com/scylladb/scylla-enterprise/issues/4254Closesscylladb/scylla-enterprise#4587
*) doc: move File-based streaming to the Tablets source file-based-streaming
This commit moves the description of file-based streaming from a common include file
to the regular doc source file where tablets are described.
Closesscylladb/scylla-enterprise#4652
*) streaming: sstable_stream_sink_impl: abort: prevent null pointer dereference
Closesscylladb/scylladb#22034
The repair_time in system.tablets will be updated when repair runs
successfully. We can now use it to update the repair time for tombstone
gc, i.e, when the system.tablets.repair_time is propagated, call
gc_state.update_repair_time() on the node that is the owner of the
tablet.
Since b3b3e880d3 ("repair: Reduce hints and batchlog flush"), the
repair time that could be used for tombstone gc might be smaller than
when the repair is started, so the actual repair time for tombstone gc
is returned by the repair rpc call from the repair master node.
Fixes#17507
In 0b0e661a85, we brought abseil back as a submodule, and we
added absl::headers as an interface library for importing
abseil headers' include directory. And:
```console
$ patchelf --print-rpath build/RelWithDebInfo/scylla
/home/kefu/dev/scylla/idl/absl::headers
```
In this change, we remove `absl::headers` from
`target_link_directories()` as it's an interface library that only
provides header files, not linkable libraries. This fixes the incorrect
inclusion of absl::headers in the rpath of the scylla executable.
Additionally, remove abseil library dependencies from the idl target
since none of the idl source files directly include abseil headers.
After this change,
```console
$ patchelf --print-rpath build/RelWithDebInfo/scylla
```
the output of `pathelf` is now empty.
Fixes#22265
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22266
"
The series moves node ops, repair and streaming verbs to IDL. Also
contains IDL related cleanups.
In addition to the CI tested manually by bootstrapping a node with the
series into a cluster of old nodes with repair and streaming both in
gossiper and raft mode. This exercises repair, streaming and node_ops
paths.
"
* 'gleb/move-more-rpcs-to-idl-v3' of github.com:scylladb/scylla-dev:
repair: repair_flush_hints_batchlog_request::target_nodes is not used any more, so mark it as such
streaming: move streaming verbs to IDL
messaging_service: move repair verbs to IDL
node_ops: move node_ops_cmd to IDL
idl: rename partition_checksum.dist.hh to repair.dist.hh
idl: move node_ops related stuff from the repair related IDL
This change introduces a new truncate_with_tablets RPC with a parameter
of type service::frozen_topology_guard. This is materialized on replica
nodes into a topology_guard which guarantees that truncate is performed
under a global session, which, in turn, makes sure that we don't execute
truncate as a result of stale RPCs.
Also, this RPC does not have a timeout. Timeout will be handled on the
coordinator side, and the truncate operation will not be allowed to time
out.
RPCs from old nodes will still use old format so translation will be
used in this case. The change is backwards compatible thanks to RPC
extensibility.
This adds a new tablet migration kind: repair. It allows tablet repair
scheduler to use this migration kind to schedule repair jobs.
The current repair scheduler implementation does the following:
- A tablet is picked to be repaired when the time since last repair is
bigger than a threshold (auto repair mode) or it is requested by user
(manual repair mode)
- The tablet repair can be scheduled along with tablet migration and
rebuild. It runs in the tablet_migration track.
- Repair jobs are scheduled in a smart way so that at any point in time,
there are no more than configured jobs per shard, which is similar to
scylla manager's control.
In this patch, both the manual repair and the auto repair are not
enabled yet.
before this change, the header files generated with `idl-compiler.py`
are not regenerated if `idl-compiler.py` is updated. but they should,
as the change to the script could in turn change the generated header
files. because we have a typo in the `DEPENDS` argument,
`${idle_compiler}` is expanded to an empty string.
in this change, the typo is corrected, and the dependency from the
generated headers to the script is correctly reflected in the building
rules.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21475
The hints and batchlog flush requests are issued to all nodes for each
repair request when tombstone_gc repair mode is used.
The amount of such flush requests is high when all nodes in the cluster
run repair. It is observed it takes a long time, up to 15s, for a repair
request to finish such a flush request.
To reduce overhead of the flush, each node caches the flush and only
executes the real flush when the cahce time has passed. It is safe to do
so because the real flush_time is returned. Repair uses the smallest
flush_time returned from peers as the repair time.
The nice thing about the cache on the receiver side is that all senders
can hit the cache. It is better than cache on the sender side.
A slightly smaller flush_time compared to the real flush time will be
used with the benefits of significantly dropped hints and batchlog
flush. The trade-off looks reasonable.
Tests: 2 nodes, with 1s batchlog delay:
Before:
Repair nr_repairs=20 cache_time_in_ms=0 total_repair_duration=40.04245328903198
After:
Repair nr_repairs=20 cache_time_in_ms=5000 total_repair_duration=1.252073049545288
Fixes#20259
forward_service is nondescriptive and misnamed, as it does more than
forward requests. It's a classic map/reduce algorithm (and in fact one
of its parameters is "reducer"), so name it accordingly.
The name "forward" leaked into the wire protocol for the messaging
service RPC isolation cookie, so it's kept there. It's also maintained
in the name of the logger (for "nodetool setlogginglevel") for
compatibility with tests.
Closesscylladb/scylladb#19444
The series adds a step during node's boot process, just before completing
the initialization, in which the node sends a notification to all other
normal nodes in the cluster that it is UP now. Other nodes wait for this
node to be UP and in normal state before replying. This ensures that,
in a healthy cluster, when a node start serving queries the entire
cluster knows its up-to-date state. The notification is a best effort
though. If some nodes are down or do not reply in time the boot process
continues. It is somewhat similar to shutdown notification in this regard.
* 'gleb/notify-up-v2' of github.com:scylladb/scylla-dev:
gossiper: wait for a bootstrapping node to be seen as normal on all nodes before completing initialization
Wait for booting node to be marked UP before complete booting.
gossiper: move gossip verbs to the idl
Currently a node does not wait to be marked UP by other nodes before
complete booting which creates a usability issue: during a rolling restart
it is not enough to wait for local CQL port to be opened before
restarting next node, but it is also needed to check that all other
nodes already see this node as alive otherwise if next node is restarted
some nodes may see two node as dead instead of one.
This patch improves the situation by making sure that boot process does
not complete before all other nodes do not see the booting one as alive.
This is still a best effort thing: if some nodes are unreachable or
gossiper propagation takes too much time the boot process continues
anyway.
Fixesscylladb/scylladb#19206
Currently, a view update backlog may reach an invalid state, when
its max is 0 and its relative_size() is NaN as a result. This can
be achieved either by constructing the backlog with a 0 max or by
modifying the max of an existing backlog. In particular, this
happens when creating the backlog using the default constructor.
In this patch the the default constructor is deleted and a check
is added to make sure that the max is different than 0 is added
to its constructor - if the check fails, we construct an empty
backlog instead, to handle the possibility of getting an invalid
backlog sent from a node with a version that's missing this check.
Additionally, we make the backlogs members private, exposing them
only through const getters.
Change the format of sync points to use host ID instead of IPs, to be
consistent with the use of host IDs in hinted handoff module.
Introduce sync point v3 format which is the same as v2 except it stores
host IDs instead of IPs.
The encoding of sync points now always uses the new v3 format with host
IDs.
The decoding supports both formats with host IDs and IPs, so a sync point
contains now a variant of either types, and in the case of the new
format the translation from IP to host ID is avoided.
because of https://bugzilla.redhat.com/show_bug.cgi?id=2278689,
the rebuilt abseil package provided by fedora has different settings
than the ones if the tree is built with the sanitizer enabled. this
inconsistency leads to a crash.
to address this problem, we have to reinstate the abseil submodule, so
we can built it with the same compiler options with which we build the
tree.
in this change
* Revert "build: drop abseil submodule, replace with distribution abseil"
* update CMake building system with abseil header include settings
* bump up the abseil submodule to the latest LTS branch of abseil:
lts_2024_01_16
* update scylla-gdb.py to adapt to the new structure of
flat_hash_map
This reverts commit 8635d24424.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#18511
* 'gleb/raft_snapshot_rpc-v3' of github.com:scylladb/scylla-dev:
raft topology: drop RAFT_PULL_TOPOLOGY_SNAPSHOT RPC
Use correct limit for raft commands throughout the code.
Our interval template started life as `range`, and was supported
wrapping to follow Cassandra's convention of wrapping around the
maximum token.
We later recognized that an interval type should usually be non-wrapping
and split it into wrapping_range and nonwrapping_range, with `range`
aliasing wrapping_range to preserve compatibility.
Even later, we realized the name was already taken by C++ ranges and
so renamed it to `interval`. Given that intervals are usually non-wrapping,
the default `interval` type is non-wrapping.
We can now simplify it further, recognizing that everyone assumes
that an interval is non-wrapping and so doesn't need the
nonwrapping_interval_designation. We just rename nonwrapping_interval
to `interval` and remove the type alias.
range.hh was deprecated in bd794629f9 (2020) since its names
conflict with the C++ library concept of an iterator range. The name
::range also mapped to the dangerous wrapping_interval rather than
nonwrapping_interval.
Complete the deprecation by removing range.hh and replacing all the
aliases by the names they point to from the interval library. Note
this now exposes uses of wrapping intervals as they are now explicit.
The unit tests are renamed and range.hh is deleted.
Closesscylladb/scylladb#17428
This implements the RPC for collecting table stats.
Since both leaving and pending replica can be accounted during
tablet migration, the RPC handler will look at tablet transition
info and account only either leaving or replica based on the
tablet migration stage. Replicas that are not leaving or
pending, of course, don't contribute to the anomaly in the
reported size.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>