scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-27 20:05:10 +00:00

Author	SHA1	Message	Date
Avi Kivity	cff90755d8	Merge 'tablets: Make load balancing capacity-aware' from Tomasz Grabiec Before this patch, the load balancer was equalizing tablet count per shard, so it achieved balance assuming that: 1) tablets have the same size 2) shards have the same capacity That can cause imbalance of utilization if shards have different capacity, which can happen in heterogeneous clusters with different instance types. One of the causes for capacity difference is that larger instances run with fewer shards due to vCPUs being dedicated to IRQ handling. This makes those shards have more disk capacity, and more CPU power. After this patch, the load balancer equalizes shard's storage utilization, so it no longer assumes that shards have the same capacity. It still assumes that each tablet has equal size. So it's a middle step towards full size-aware balancing. One consequence is that to be able to balance, the load balancer need to know about every node's capacity, which is collected with the same RPC which collects load_stats for average tablet size. This is not a significant set back because migrations cannot proceed anyway if nodes are down due to barriers. We could make intra-node migration scheduling work without capacity information, but it's pointless due to above, so not implemented. Also, per-shard goal for tablet count is still the same for all nodes in the cluster, so nodes with less capacity will be below limit and nodes with more capacity will be slightly above limit. This shouldn't be a significant problem in practice, we could compensate for this by increasing the limit. Refs #23042 Closes scylladb/scylladb#23079 * github.com:scylladb/scylladb: tablets: Make load balancing capacity-aware topology_coordinator: Fix confusing log message topology_coordinator: Refresh load stats after adding a new node topology_coordinator: Allow capacity stats to be refreshed with some nodes down topology_coordinator: Refactor load status refreshing so that it can be triggered from multiple places test: boost: tablets_test: Always provide capacity in load_stats test: perf_load_balancing: Set node capacity test: perf_load_balancing: Convert to topology_builder config, disk_space_monitor: Allow overriding capacity via config storage_service, tablets: Collect per-node capacity in load_stats (cherry picked from commit `b1d9f80d85`)	2025-03-25 23:16:35 +01:00
Aleksandra Martyniuk	6bbf20a440	repair: pass session_id to repair_meta Pass session_id of tablet repair down the stack from the repair request to repair_meta. The session_id will be utiziled in the following patches. (cherry picked from commit `928f92c780`)	2025-03-19 10:02:24 +01:00
Asias He	4018dc7f0d	Introduce file stream for tablet File based stream is a new feature that optimizes tablet movement significantly. It streams the entire SSTable files without deserializing SSTable files into mutation fragments and re-serializing them back into SSTables on receiving nodes. As a result, less data is streamed over the network, and less CPU is consumed, especially for data models that contain small cells. The following patches are imported from the scylla enterprise: ) Merge 'Introduce file stream for tablet' from Asias He This patch uses Seastar RPC stream interface to stream sstable files on network for tablet migration. It streams sstables instead of mutation fragments. The file based stream has multiple advantages over the mutation streaming. - No serialization or deserialization for mutation fragments - No need to read and process each mutation fragments - On wire data is more compact and smaller In the test below, a significant speed up is observed. Two nodes, 1 shard per node, 1 initial_tablets: - Start node 1 - Insert 10M rows of data with c-s - Bootstrap node 2 Node 1 will migration data to node2 with the file stream. Test results: 1) File stream: bytes on wire = 1132006250 bytes, bw = 836MB/s [shard 0:stre] stream_blob - stream_sstables[eadaa8e0-a4f2-4cc6-bf10-39ad1ce106b0] Finished sending sstable_nr=2 files_nr=18 files={} range=(-1,9223372036854775807] bytes_sent=1132006250 stream_bw=836MB/s [shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 1.08004s seconds 2) Mutation stream: bytes on wire = 3030004736 bytes, bw = 125410.87 KiB/s = 128MB/s [shard 0:stre] stream_session - [Stream #406dc8b0-56b5-11ee-bc2d-000bf4871058] Streaming plan for Tablet migration-ks1-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=2958989 KiB, 125410.87 KiB/s [shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 23.5992s seconds Test Summary: File stream v.s. Mutation stream improvements - Stream bandwidth = 836 / 128 (MB/s) = 6.53X - Stream time = 23.60 / 1.08 (Seconds) = 21.85X - Stream bytes on wire = 3030004736 / 1132006250 (Bytes)= 2.67X Closes scylladb/scylla-enterprise#3438 github.com:scylladb/scylla-enterprise: tests: Add file_stream_test streaming: Implement file stream for tablet ) streaming: Use new take_storage_snapshot interface The new take_storage_snapshot returns a file object instead of a file name. This allows the file stream sender to read from the file even if the file is deleted by compaction. Closes scylladb/scylla-enterprise#3728 ) streaming: Protect unsupported file types for file stream Currently, we assume the file streamed over the stream_blob rpc verb is a sstable file. This patch rejects the unsupported file types on the receiver side. This allows us to stream more file types later using the current file stream infrastructure without worrying about old nodes processing the new file types in the wrong way. - The file_ops::noop is renamed to file_ops::stream_sstables to be explicit about the file types - A missing test_file_stream_error_injection is added to the idl Fixes: #3846 Tests: test_unsupported_file_ops Closes scylladb/scylla-enterprise#3847 ) idl: Add service::session_id id to idl It will be used in the next patch. Refs #3907 ) streaming: Protect file stream with topology_guard Similar to "storage_service, tablets: Use session to guard tablet streaming", this patch protects file stream with topology_guard. Fixes #3907 ) streaming: Take service topology_guard under the try block Taking the service::topology_guard could throw. Currently, it throws outside the try block, so the rpc sink will not be closed, causing the following assertion: ``` scylla: seastar/include/seastar/rpc/rpc_impl.hh:815: virtual seastar::rpc::sink_impl<netw::serializer, streaming::stream_blob_cmd_data>::~sink_impl() [Serializer = netw::serializer, Out = <streaming::stream_blob_cmd_data>]: Assertion `this->_con->get()->sink_closed()' failed. ``` To fix, move more code including the topology_guard taking code to the try block. Fixes https://github.com/scylladb/scylla-enterprise/issues/4106 Closes scylladb/scylla-enterprise#4110 ) Merge 'Preserve original SSTable state with file based tablet migration' from Raphael "Raph" Carvalho We're not preserving the SSTable state across file based migration, so staging SSTables for example are being placed into main directory, and consequently, we're mixing staging and non-staging data, losing the ability to continue from where the old replica left off. It's expected that the view update backlog is transferred from old into new replica, as migration doesn't wait for leaving replica to complete view update work (which can take long). Elasticity is preferred. So this fix guarantees that the state of the SSTable will be preserved by propagating it in form of subdirectory (each subdirectory is statically mapped with a particular state). The staging sstables aren't being registered into view update generator yet, as that's supposed to be fixed in OSS (more details can be found at https://github.com/scylladb/scylladb/issues/19149). Fixes #4265. Closes scylladb/scylla-enterprise#4267 * github.com:scylladb/scylla-enterprise: tablet: Preserve original SSTable state with file based tablet migration sstables: Add get method for sstable state ) sstable: (Re-)add shareabled_components getter ) Merge 'File streaming sstables: Use sstable source/sink to transfer snapshots' from Calle Wilund Fixes #4246 Alternative approach/better separation of concern, transport vs. sstable layer. Builds on #4472, but fancier. Ensures we transfer and pre-process scylla metadata for streamed file blobs first, then properly apply receiving nodes local config by using a source and sink layer exported from sstables, which handles things like ordering, metadata filtering (on source) as well as handling metadata and proper IO paths when writing data on receiver node (sink). This implementation maintains the statelessness of the current design, and the delegated sink side will re-read and re-write the metadata for each component processed. This is a little wasteful, but the meta is small, and it is less error prone than trying to do caching cross-shards etc. The transport is isolated from the knowledge. This is an alternative/complement to #4436 and #4472, fixing the underlying issue. Note that while the layers/API:s here allows easy fixing of other fundamental problems in the feature (such as destination location etc), these are not included in the PR, to keep it as close to the current behaviour as possible. Closes scylladb/scylla-enterprise#4646 * github.com:scylladb/scylla-enterprise: raft_tests: Copy/add a topology test with encryption file streaming: Use sstable source/sink to transfer snapshots sstables: Add source and sink objects + producers for transfering a snapshot sstable::types: Add remove accessor for extension info in metadata ) The change for error injection in merge commit 966ea5955dd8760: File streaming now has "stream_mutation_fragments" error injection points so test_table_dropped_during_streaming works with file streaming. ) doc: document file-based streaming This commit adds a description of the file-based streaming feature to the documentation. It will be displayed in the docs using the scylladb_include_flag directive after https://github.com/scylladb/scylladb/pull/20182 is merged, backported to branch-6.0, and, in turn, branch-2024.2. Refs https://github.com/scylladb/scylla-enterprise/issues/4585 Refs https://github.com/scylladb/scylla-enterprise/issues/4254 Closes scylladb/scylla-enterprise#4587 ) doc: move File-based streaming to the Tablets source file-based-streaming This commit moves the description of file-based streaming from a common include file to the regular doc source file where tablets are described. Closes scylladb/scylla-enterprise#4652 ) streaming: sstable_stream_sink_impl: abort: prevent null pointer dereference Closes scylladb/scylladb#22467	2025-01-26 12:51:59 +02:00
Avi Kivity	59d3a66d18	Revert "Introduce file stream for tablet" This reverts commit `8208688178`. It was contributed from enterprise, but is too different from the original for me to merge back.	2025-01-22 09:42:20 +02:00
Tomasz Grabiec	c7f78edc78	Merge 'repair: Wire repair_time in system.tablets for tombstone gc' from Asias He The repair_time in system.tablets will be updated when repair runs successfully. We can now use it to update the repair time for tombstone gc, i.e, when the system.tablets.repair_time is propagated, call gc_state.update_repair_time() on the node that is the owner of the tablet. Since `b3b3e880d3` ("repair: Reduce hints and batchlog flush"), the repair time that could be used for tombstone gc might be smaller than when the repair is started, so the actual repair time for tombstone gc is returned by the repair rpc call from the repair master node. Fixes #17507 New feature. No backport is needed. Closes scylladb/scylladb#21896 * github.com:scylladb/scylladb: repair: Stop using rpc to update repair time for repairs scheduled by scheduler repair: Wire repair_time in system.tablets for tombstone gc test: Disable flush_cache_time for two tablet repair tests test: Introduce guarantee_repair_time_next_second helper repair: Return repair time for repair_service::repair_tablet service: Add tablet_operation.hh	2025-01-20 18:08:49 +01:00
Asias He	8208688178	Introduce file stream for tablet File based stream is a new feature that optimizes tablet movement significantly. It streams the entire SSTable files without deserializing SSTable files into mutation fragments and re-serializing them back into SSTables on receiving nodes. As a result, less data is streamed over the network, and less CPU is consumed, especially for data models that contain small cells. The following patches are imported from the scylla enterprise: ) Merge 'Introduce file stream for tablet' from Asias He This patch uses Seastar RPC stream interface to stream sstable files on network for tablet migration. It streams sstables instead of mutation fragments. The file based stream has multiple advantages over the mutation streaming. - No serialization or deserialization for mutation fragments - No need to read and process each mutation fragments - On wire data is more compact and smaller In the test below, a significant speed up is observed. Two nodes, 1 shard per node, 1 initial_tablets: - Start node 1 - Insert 10M rows of data with c-s - Bootstrap node 2 Node 1 will migration data to node2 with the file stream. Test results: 1) File stream: bytes on wire = 1132006250 bytes, bw = 836MB/s [shard 0:stre] stream_blob - stream_sstables[eadaa8e0-a4f2-4cc6-bf10-39ad1ce106b0] Finished sending sstable_nr=2 files_nr=18 files={} range=(-1,9223372036854775807] bytes_sent=1132006250 stream_bw=836MB/s [shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 1.08004s seconds 2) Mutation stream: bytes on wire = 3030004736 bytes, bw = 125410.87 KiB/s = 128MB/s [shard 0:stre] stream_session - [Stream #406dc8b0-56b5-11ee-bc2d-000bf4871058] Streaming plan for Tablet migration-ks1-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=2958989 KiB, 125410.87 KiB/s [shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 23.5992s seconds Test Summary: File stream v.s. Mutation stream improvements - Stream bandwidth = 836 / 128 (MB/s) = 6.53X - Stream time = 23.60 / 1.08 (Seconds) = 21.85X - Stream bytes on wire = 3030004736 / 1132006250 (Bytes)= 2.67X Closes scylladb/scylla-enterprise#3438 github.com:scylladb/scylla-enterprise: tests: Add file_stream_test streaming: Implement file stream for tablet ) streaming: Use new take_storage_snapshot interface The new take_storage_snapshot returns a file object instead of a file name. This allows the file stream sender to read from the file even if the file is deleted by compaction. Closes scylladb/scylla-enterprise#3728 ) streaming: Protect unsupported file types for file stream Currently, we assume the file streamed over the stream_blob rpc verb is a sstable file. This patch rejects the unsupported file types on the receiver side. This allows us to stream more file types later using the current file stream infrastructure without worrying about old nodes processing the new file types in the wrong way. - The file_ops::noop is renamed to file_ops::stream_sstables to be explicit about the file types - A missing test_file_stream_error_injection is added to the idl Fixes: #3846 Tests: test_unsupported_file_ops Closes scylladb/scylla-enterprise#3847 ) idl: Add service::session_id id to idl It will be used in the next patch. Refs #3907 ) streaming: Protect file stream with topology_guard Similar to "storage_service, tablets: Use session to guard tablet streaming", this patch protects file stream with topology_guard. Fixes #3907 ) streaming: Take service topology_guard under the try block Taking the service::topology_guard could throw. Currently, it throws outside the try block, so the rpc sink will not be closed, causing the following assertion: ``` scylla: seastar/include/seastar/rpc/rpc_impl.hh:815: virtual seastar::rpc::sink_impl<netw::serializer, streaming::stream_blob_cmd_data>::~sink_impl() [Serializer = netw::serializer, Out = <streaming::stream_blob_cmd_data>]: Assertion `this->_con->get()->sink_closed()' failed. ``` To fix, move more code including the topology_guard taking code to the try block. Fixes https://github.com/scylladb/scylla-enterprise/issues/4106 Closes scylladb/scylla-enterprise#4110 ) Merge 'Preserve original SSTable state with file based tablet migration' from Raphael "Raph" Carvalho We're not preserving the SSTable state across file based migration, so staging SSTables for example are being placed into main directory, and consequently, we're mixing staging and non-staging data, losing the ability to continue from where the old replica left off. It's expected that the view update backlog is transferred from old into new replica, as migration doesn't wait for leaving replica to complete view update work (which can take long). Elasticity is preferred. So this fix guarantees that the state of the SSTable will be preserved by propagating it in form of subdirectory (each subdirectory is statically mapped with a particular state). The staging sstables aren't being registered into view update generator yet, as that's supposed to be fixed in OSS (more details can be found at https://github.com/scylladb/scylladb/issues/19149). Fixes #4265. Closes scylladb/scylla-enterprise#4267 * github.com:scylladb/scylla-enterprise: tablet: Preserve original SSTable state with file based tablet migration sstables: Add get method for sstable state ) sstable: (Re-)add shareabled_components getter ) Merge 'File streaming sstables: Use sstable source/sink to transfer snapshots' from Calle Wilund Fixes #4246 Alternative approach/better separation of concern, transport vs. sstable layer. Builds on #4472, but fancier. Ensures we transfer and pre-process scylla metadata for streamed file blobs first, then properly apply receiving nodes local config by using a source and sink layer exported from sstables, which handles things like ordering, metadata filtering (on source) as well as handling metadata and proper IO paths when writing data on receiver node (sink). This implementation maintains the statelessness of the current design, and the delegated sink side will re-read and re-write the metadata for each component processed. This is a little wasteful, but the meta is small, and it is less error prone than trying to do caching cross-shards etc. The transport is isolated from the knowledge. This is an alternative/complement to #4436 and #4472, fixing the underlying issue. Note that while the layers/API:s here allows easy fixing of other fundamental problems in the feature (such as destination location etc), these are not included in the PR, to keep it as close to the current behaviour as possible. Closes scylladb/scylla-enterprise#4646 * github.com:scylladb/scylla-enterprise: raft_tests: Copy/add a topology test with encryption file streaming: Use sstable source/sink to transfer snapshots sstables: Add source and sink objects + producers for transfering a snapshot sstable::types: Add remove accessor for extension info in metadata ) The change for error injection in merge commit 966ea5955dd8760: File streaming now has "stream_mutation_fragments" error injection points so test_table_dropped_during_streaming works with file streaming. ) doc: document file-based streaming This commit adds a description of the file-based streaming feature to the documentation. It will be displayed in the docs using the scylladb_include_flag directive after https://github.com/scylladb/scylladb/pull/20182 is merged, backported to branch-6.0, and, in turn, branch-2024.2. Refs https://github.com/scylladb/scylla-enterprise/issues/4585 Refs https://github.com/scylladb/scylla-enterprise/issues/4254 Closes scylladb/scylla-enterprise#4587 ) doc: move File-based streaming to the Tablets source file-based-streaming This commit moves the description of file-based streaming from a common include file to the regular doc source file where tablets are described. Closes scylladb/scylla-enterprise#4652 ) streaming: sstable_stream_sink_impl: abort: prevent null pointer dereference Closes scylladb/scylladb#22034	2025-01-20 16:43:21 +02:00
Asias He	53e6025aa6	repair: Wire repair_time in system.tablets for tombstone gc The repair_time in system.tablets will be updated when repair runs successfully. We can now use it to update the repair time for tombstone gc, i.e, when the system.tablets.repair_time is propagated, call gc_state.update_repair_time() on the node that is the owner of the tablet. Since `b3b3e880d3` ("repair: Reduce hints and batchlog flush"), the repair time that could be used for tombstone gc might be smaller than when the repair is started, so the actual repair time for tombstone gc is returned by the repair rpc call from the repair master node. Fixes #17507	2025-01-17 16:12:05 +08:00
Kefu Chai	fbca0a08f7	build: cmake: do not add absl::headers as a link directory In `0b0e661a85`, we brought abseil back as a submodule, and we added absl::headers as an interface library for importing abseil headers' include directory. And: ```console $ patchelf --print-rpath build/RelWithDebInfo/scylla /home/kefu/dev/scylla/idl/absl::headers ``` In this change, we remove `absl::headers` from `target_link_directories()` as it's an interface library that only provides header files, not linkable libraries. This fixes the incorrect inclusion of absl::headers in the rpath of the scylla executable. Additionally, remove abseil library dependencies from the idl target since none of the idl source files directly include abseil headers. After this change, ```console $ patchelf --print-rpath build/RelWithDebInfo/scylla ``` the output of `pathelf` is now empty. Fixes #22265 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22266	2025-01-13 09:05:43 +03:00
Avi Kivity	f3eade2f62	treewide: relicense to ScyllaDB-Source-Available-1.0 Drop the AGPL license in favor of a source-available license. See the blog post [1] for details. [1] https://www.scylladb.com/2024/12/18/why-were-moving-to-a-source-available-license/	2024-12-18 17:45:13 +02:00
Avi Kivity	ecd78c88bf	Merge "move more verbs to idl" from Gleb " The series moves node ops, repair and streaming verbs to IDL. Also contains IDL related cleanups. In addition to the CI tested manually by bootstrapping a node with the series into a cluster of old nodes with repair and streaming both in gossiper and raft mode. This exercises repair, streaming and node_ops paths. " * 'gleb/move-more-rpcs-to-idl-v3' of github.com:scylladb/scylla-dev: repair: repair_flush_hints_batchlog_request::target_nodes is not used any more, so mark it as such streaming: move streaming verbs to IDL messaging_service: move repair verbs to IDL node_ops: move node_ops_cmd to IDL idl: rename partition_checksum.dist.hh to repair.dist.hh idl: move node_ops related stuff from the repair related IDL	2024-12-12 17:19:43 +02:00
Gleb Natapov	c095f63ea5	repair: repair_flush_hints_batchlog_request::target_nodes is not used any more, so mark it as such After `b3b3e880d3` target_nodes is not used by the receiver, so we can skip setting it on sender as well.	2024-12-11 18:26:57 +02:00
Gleb Natapov	92c2558a83	streaming: move streaming verbs to IDL	2024-12-11 18:26:50 +02:00
Gleb Natapov	bfee93c747	messaging_service: move repair verbs to IDL	2024-12-09 14:50:52 +02:00
Gleb Natapov	5f6007f6ec	node_ops: move node_ops_cmd to IDL	2024-12-09 14:50:52 +02:00
Gleb Natapov	39c75d3add	idl: rename partition_checksum.dist.hh to repair.dist.hh The file has many more things than partition_checksum. All of them are repair related now.	2024-12-09 14:49:59 +02:00
Ferenc Szili	36d35d2297	RPC: add truncate_with_tablets RPC with frozen_topology_guard This change introduces a new truncate_with_tablets RPC with a parameter of type service::frozen_topology_guard. This is materialized on replica nodes into a topology_guard which guarantees that truncate is performed under a global session, which, in turn, makes sure that we don't execute truncate as a result of stale RPCs. Also, this RPC does not have a timeout. Timeout will be handled on the coordinator side, and the truncate operation will not be allowed to time out.	2024-12-04 11:30:07 +01:00
Gleb Natapov	b47faed54f	idl: move node_ops related stuff from the repair related IDL Create separate IDL file for node_ops stuff.	2024-12-04 10:36:40 +02:00
Gleb Natapov	a1fdc8c847	storage_proxy: change mutation rpcs to send forward and reply addresses as host ids RPCs from old nodes will still use old format so translation will be used in this case. The change is backwards compatible thanks to RPC extensibility.	2024-12-02 10:31:12 +02:00
Gleb Natapov	a1de06d90f	migration_manager: move migration manager verbs to the IDL	2024-11-24 11:02:03 +02:00
Asias He	b71a563030	repair: Add core tablet repair scheduler support This adds a new tablet migration kind: repair. It allows tablet repair scheduler to use this migration kind to schedule repair jobs. The current repair scheduler implementation does the following: - A tablet is picked to be repaired when the time since last repair is bigger than a threshold (auto repair mode) or it is requested by user (manual repair mode) - The tablet repair can be scheduled along with tablet migration and rebuild. It runs in the tablet_migration track. - Repair jobs are scheduled in a smart way so that at any point in time, there are no more than configured jobs per shard, which is similar to scylla manager's control. In this patch, both the manual repair and the auto repair are not enabled yet.	2024-11-20 09:42:41 +08:00
Kefu Chai	1f940d56b2	build: cmake: s/idle_compiler/idl_compiler/ before this change, the header files generated with `idl-compiler.py` are not regenerated if `idl-compiler.py` is updated. but they should, as the change to the script could in turn change the generated header files. because we have a typo in the `DEPENDS` argument, `${idle_compiler}` is expanded to an empty string. in this change, the typo is corrected, and the dependency from the generated headers to the script is correctly reflected in the building rules. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#21475	2024-11-09 20:06:23 +02:00
Asias He	b3b3e880d3	repair: Reduce hints and batchlog flush The hints and batchlog flush requests are issued to all nodes for each repair request when tombstone_gc repair mode is used. The amount of such flush requests is high when all nodes in the cluster run repair. It is observed it takes a long time, up to 15s, for a repair request to finish such a flush request. To reduce overhead of the flush, each node caches the flush and only executes the real flush when the cahce time has passed. It is safe to do so because the real flush_time is returned. Repair uses the smallest flush_time returned from peers as the repair time. The nice thing about the cache on the receiver side is that all senders can hit the cache. It is better than cache on the sender side. A slightly smaller flush_time compared to the real flush time will be used with the benefits of significantly dropped hints and batchlog flush. The trade-off looks reasonable. Tests: 2 nodes, with 1s batchlog delay: Before: Repair nr_repairs=20 cache_time_in_ms=0 total_repair_duration=40.04245328903198 After: Repair nr_repairs=20 cache_time_in_ms=5000 total_repair_duration=1.252073049545288 Fixes #20259	2024-10-30 11:07:57 +08:00
Avi Kivity	3fc4e23a36	forward_service: rename to mapreduce_service forward_service is nondescriptive and misnamed, as it does more than forward requests. It's a classic map/reduce algorithm (and in fact one of its parameters is "reducer"), so name it accordingly. The name "forward" leaked into the wire protocol for the messaging service RPC isolation cookie, so it's kept there. It's also maintained in the name of the logger (for "nodetool setlogginglevel") for compatibility with tests. Closes scylladb/scylladb#19444	2024-07-03 19:29:47 +03:00
Kamil Braun	13fc2bd854	Merge `notify other nodes on boot` from Gleb The series adds a step during node's boot process, just before completing the initialization, in which the node sends a notification to all other normal nodes in the cluster that it is UP now. Other nodes wait for this node to be UP and in normal state before replying. This ensures that, in a healthy cluster, when a node start serving queries the entire cluster knows its up-to-date state. The notification is a best effort though. If some nodes are down or do not reply in time the boot process continues. It is somewhat similar to shutdown notification in this regard. * 'gleb/notify-up-v2' of github.com:scylladb/scylla-dev: gossiper: wait for a bootstrapping node to be seen as normal on all nodes before completing initialization Wait for booting node to be marked UP before complete booting. gossiper: move gossip verbs to the idl	2024-06-25 17:58:17 +02:00
Gleb Natapov	28c0a27467	Wait for booting node to be marked UP before complete booting. Currently a node does not wait to be marked UP by other nodes before complete booting which creates a usability issue: during a rolling restart it is not enough to wait for local CQL port to be opened before restarting next node, but it is also needed to check that all other nodes already see this node as alive otherwise if next node is restarted some nodes may see two node as dead instead of one. This patch improves the situation by making sure that boot process does not complete before all other nodes do not see the booting one as alive. This is still a best effort thing: if some nodes are unreachable or gossiper propagation takes too much time the boot process continues anyway. Fixes scylladb/scylladb#19206	2024-06-20 14:55:40 +03:00
Wojciech Mitros	cde14a5788	mv: make the view update backlog unmofidiable Currently, a view update backlog may reach an invalid state, when its max is 0 and its relative_size() is NaN as a result. This can be achieved either by constructing the backlog with a 0 max or by modifying the max of an existing backlog. In particular, this happens when creating the backlog using the default constructor. In this patch the the default constructor is deleted and a check is added to make sure that the max is different than 0 is added to its constructor - if the check fails, we construct an empty backlog instead, to handle the possibility of getting an invalid backlog sent from a node with a version that's missing this check. Additionally, we make the backlogs members private, exposing them only through const getters.	2024-06-19 19:44:57 +02:00
Gleb Natapov	09556bff0e	gossiper: move gossip verbs to the idl	2024-06-17 12:47:17 +03:00
Michael Litvak	afc9a1a8a6	db/hints: migrate sync point to host ID Change the format of sync points to use host ID instead of IPs, to be consistent with the use of host IDs in hinted handoff module. Introduce sync point v3 format which is the same as v2 except it stores host IDs instead of IPs. The encoding of sync points now always uses the new v3 format with host IDs. The decoding supports both formats with host IDs and IPs, so a sync point contains now a variant of either types, and in the case of the new format the translation from IP to host ID is avoided.	2024-06-11 11:07:00 +02:00
Michael Litvak	b824e73418	db/hints: rename sync point structures with _v1 suffix to _v1_v2 rename sync point types and variables to have v1/v2 suffix according to their use.	2024-06-11 11:05:59 +02:00
Piotr Smaron	06008970fb	New raft cmd for both schema & topo changes Allows executing combined topology & schema mutations under a single RAFT command	2024-05-27 12:48:44 +02:00
Kefu Chai	0b0e661a85	build: bring abseil submodule back because of https://bugzilla.redhat.com/show_bug.cgi?id=2278689, the rebuilt abseil package provided by fedora has different settings than the ones if the tree is built with the sanitizer enabled. this inconsistency leads to a crash. to address this problem, we have to reinstate the abseil submodule, so we can built it with the same compiler options with which we build the tree. in this change * Revert "build: drop abseil submodule, replace with distribution abseil" * update CMake building system with abseil header include settings * bump up the abseil submodule to the latest LTS branch of abseil: lts_2024_01_16 * update scylla-gdb.py to adapt to the new structure of flat_hash_map This reverts commit `8635d24424`. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#18511	2024-05-05 23:31:09 +03:00
Gleb Natapov	040c6ca0c1	topology coordinator: drop unused structure	2024-04-21 16:36:07 +03:00
Kamil Braun	33751f8f4e	Merge 'raft topology: drop RAFT_PULL_TOPOLOGY_SNAPSHOT RPC' from Gleb * 'gleb/raft_snapshot_rpc-v3' of github.com:scylladb/scylla-dev: raft topology: drop RAFT_PULL_TOPOLOGY_SNAPSHOT RPC Use correct limit for raft commands throughout the code.	2024-03-28 14:25:58 +01:00
Gleb Natapov	6e6aefc9ab	raft topology: drop RAFT_PULL_TOPOLOGY_SNAPSHOT RPC We have new, more generic, RPC to pull group0 mutations now: RAFT_PULL_SNAPSHOT. Use it instead of more specific RAFT_PULL_TOPOLOGY_SNAPSHOT one.	2024-03-27 19:18:45 +02:00
Gleb Natapov	6ab78e13c6	topology cooordinator: propagate initial_token option to the coordinator The patch propagates initial_token option to the topology coordinator where it is added to join request parameter.	2024-03-26 18:43:16 +02:00
Marcin Maliszkiewicz	5a6d4dbc37	storage_service: add support for auth-v2 raft snapshots This patch adds new RPC for pulling snapshot of auth tables.	2024-03-01 16:25:14 +01:00
Avi Kivity	51df8b9173	interval: rename nonwrapping_interval to interval Our interval template started life as `range`, and was supported wrapping to follow Cassandra's convention of wrapping around the maximum token. We later recognized that an interval type should usually be non-wrapping and split it into wrapping_range and nonwrapping_range, with `range` aliasing wrapping_range to preserve compatibility. Even later, we realized the name was already taken by C++ ranges and so renamed it to `interval`. Given that intervals are usually non-wrapping, the default `interval` type is non-wrapping. We can now simplify it further, recognizing that everyone assumes that an interval is non-wrapping and so doesn't need the nonwrapping_interval_designation. We just rename nonwrapping_interval to `interval` and remove the type alias.	2024-02-21 19:43:17 +02:00
Avi Kivity	605bf6e221	range.hh: retire range.hh was deprecated in `bd794629f9` (2020) since its names conflict with the C++ library concept of an iterator range. The name ::range also mapped to the dangerous wrapping_interval rather than nonwrapping_interval. Complete the deprecation by removing range.hh and replacing all the aliases by the names they point to from the interval library. Note this now exposes uses of wrapping intervals as they are now explicit. The unit tests are renamed and range.hh is deleted. Closes scylladb/scylladb#17428	2024-02-21 00:24:25 +02:00
Piotr Dulikowski	7601f40bf8	storage_service: introduce join_node_query verb When a node joins an existing cluster, it will ask a node that already belongs to the cluster about which topology operations to use when joining.	2024-02-07 10:02:00 +01:00
Raphael S. Carvalho	9519a0c9e4	storage_service: Implement table_load_stats RPC This implements the RPC for collecting table stats. Since both leaving and pending replica can be accounted during tablet migration, the RPC handler will look at tablet transition info and account only either leaving or replica based on the tablet migration stage. Replicas that are not leaving or pending, of course, don't contribute to the anomaly in the reported size. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-01-25 18:36:08 -03:00
Gleb Natapov	d576ed31dc	storage_service: topology request: drop explicit shutdown rpc Now that we have explicit status for each request we may use it to replace shutdown notification rpc. During a decommission, in left_token_ring state, we set done to true after metadata barrier that waits for all request to the decommissioning node to complete and notify the decommissioning node with a regular barrier. At this point the node will see that the request is complete and exit.	2024-01-16 17:02:54 +02:00
Gleb Natapov	1c18476385	storage_service: topology coordinator: update topology_requests table with requests progress Make topology coordinator update request's status in topology_requests table as it changes.	2024-01-16 15:35:18 +02:00
Gleb Natapov	1ce1c5001d	topology coordinator: add topology_requests table to group0 snapshot Since the table is updated through raft's group0 state machine its content needs to be part of the snapshot.	2024-01-16 13:57:27 +02:00
Petr Gusev	db1f0d5889	storage_service: wait_for_ip for new nodes When a new node joins the cluster we need to be sure that it's IP is known to all other nodes. In this patch we do this by waiting for the IP to appear in raft_address_map. A new raft_topology_cmd::command::wait_for_ip command is added. It's run on all nodes of the cluster before we put the topology into transition state. This applies both to new and replacing nodes. It's important to run wait_for_ip before moving to topology::transition_state::join_group0 since in this state node IPs are already used to populate pending nodes in erm.	2024-01-12 15:37:46 +04:00
Petr Gusev	802da1e7a5	storage_service.idl.hh: fix raft_topology_cmd.command declaration Make IDL correspond to the declaration of raft_topology_cmd::command in topology_state_machine.hh.	2024-01-12 12:23:22 +04:00
Benny Halevy	cdd5605d81	gms: endpoint_state: change application_state_map to std::unordered_map State changes are processed as a batch and there is no reason to maintain them as an ordered map. Instead, use a std::unordered_map that is more efficient. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-12-31 18:37:34 +02:00
Piotr Dulikowski	a8ee4d543a	raft: transfer information about group0 liveness in direct_fd_ping Add a new variant of the reply to the direct_fd_ping which specifies whether the local group0 is alive or not, and start actively using it. There is no need to introduce a cluster feature. Due to how our serialization framework works, nodes which do not recognize the new variant will treat it as the existing std::monostate. The std::monostate means "the node and group0 is alive"; nodes before the changes in this commit would send a std::monostate anyway, so this is completely transparent for the old nodes.	2023-11-23 00:34:22 +01:00
Kefu Chai	efd65aebb2	build: cmake: add check-header target to have feature parity with `configure.py`. we won't need this once we migrate to C++20 modules. but before that day comes, we need to stick with C++ headers. we generate a rule for each .hh files to create a corresponding .cc and then compile it, in order to verify the self-containness of that header. so the number of rule is quite large, to avoid the unnecessary overhead. the check-header target is enabled only if `Scylla_CHECK_HEADERS` option is enabled. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#15913	2023-11-13 10:27:06 +02:00
Benny Halevy	a1acf6854b	everywhere: reduce dependencies on i_partitioner.hh Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-11-05 20:47:44 +02:00
Kefu Chai	b105be220b	build: cmake: add join_node.idl.hh to CMake we add a new verb in `7cbe5e3af8`, so let's update the CMake-based building system accordingly. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#15658	2023-10-19 10:19:16 +03:00

1 2 3 4 5 ...

313 Commits