scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-25 11:00:35 +00:00

Author	SHA1	Message	Date
Botond Dénes	6ca0464af5	mutation_fragment: add schema and permit We want to start tracking the memory consumption of mutation fragments. For this we need schema and permit during construction, and on each modification, so the memory consumption can be recalculated and passed to the permit. In this patch we just add the new parameters and go through the insane churn of updating all call sites. They will be used in the next patch.	2020-09-28 11:27:23 +03:00
Botond Dénes	3fab83b3a1	flat_mutation_reader: impl: add reader_permit parameter Not used yet, this patch does all the churn of propagating a permit to each impl. In the next patch we will use it to track the memory consumption of `_buffer`.	2020-09-28 10:53:48 +03:00
Pavel Emelyanov	9a15ebfe6a	repair: Move CHECKSUM_RANGE verb into repair/ The verb is sent by repair code, so it should be registered in the same place, not in main. Also -- the verb should be unregistered on stop. The global messaging service instance is made similarly to the row-level one, as there's no ready to use repair service. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-09-17 09:52:48 +03:00
Pavel Emelyanov	d5769346d7	repair: Toss messaging init/uninit calls The goal is to make it possible to reg/unreg not only row-level verbs. While at it -- equip the init call with sharded<database>& argument, it will be needed by the next patch. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-09-17 09:52:48 +03:00
Avi Kivity	253a7640e3	Merge 'Clean up old cluster features' from Piotr Sarna " This series follows the suggestion from https://github.com/scylladb/scylla/pull/7203#issuecomment-689499773 discussion and deprecates a number of cluster features. The deprecation does not remove any features from the strings sent via gossip to other nodes, but it removes all checks for these features from code, assuming that the checks are always true. This assumption is quite safe for features introduced over 2 years ago, because the official upgrade path only allows upgrading from a previous official release, and these feature bits were introduced many release cycles ago. All deprecated features were picked from a `git blame` output which indicated that they come from 2018: ```git `e46537b7d3` 2016-05-31 11:44:17 +0200 RANGE_TOMBSTONES_FEATURE = "RANGE_TOMBSTONES"; `85c092c56c` 2016-07-11 10:59:40 +0100 LARGE_PARTITIONS_FEATURE = "LARGE_PARTITIONS"; `02bc0d2ab3` 2016-12-09 22:09:30 +0100 MATERIALIZED_VIEWS_FEATURE = "MATERIALIZED_VIEWS"; `67ca6959bd` 2017-01-30 19:50:13 +0000 COUNTERS_FEATURE = "COUNTERS"; `815c91a1b8` 2017-04-12 10:14:38 +0300 INDEXES_FEATURE = "INDEXES"; `d2a2a6d471` 2017-08-03 10:53:22 +0300 DIGEST_MULTIPARTITION_READ_FEATURE = "DIGEST_MULTIPARTITION_READ"; `ecd2bf128b` 2017-09-01 09:55:02 +0100 CORRECT_COUNTER_ORDER_FEATURE = "CORRECT_COUNTER_ORDER"; `713d75fd51` 2017-09-14 19:15:41 +0200 SCHEMA_TABLES_V3 = "SCHEMA_TABLES_V3"; `2f513514cc` 2017-11-29 11:57:09 +0000 CORRECT_NON_COMPOUND_RANGE_TOMBSTONES = "CORRECT_NON_COMPOUND_RANGE_TOMBSTONES"; `0be3bd383b` 2017-12-04 13:55:36 +0200 WRITE_FAILURE_REPLY_FEATURE = "WRITE_FAILURE_REPLY"; `0bab3e59c2` 2017-11-30 00:16:34 +0000 XXHASH_FEATURE = "XXHASH"; `fbc97626c4` 2018-01-14 21:28:58 -0500 ROLES_FEATURE = "ROLES"; `802be72ca6` 2018-03-18 06:25:52 +0100 LA_SSTABLE_FEATURE = "LA_SSTABLE_FORMAT"; `71e22fe981` 2018-05-25 10:37:54 +0800 STREAM_WITH_RPC_STREAM = "STREAM_WITH_RPC_STREAM"; ``` Tests: unit(dev) 
manual(verifying with cqlsh that the feature strings are indeed still set) " Closes #7234. * psarna-clean_up_features: gms: add comments for deprecated features gms: remove unused feature bits streaming: drop checks for RPC stream support roles: drop checks for roles schema support service: drop checks for xxhash support service: drop checks for write failure reply support sstables: drop checks for non-compound range tombstones support service: drop checks for v3 schema support repair: drop checks for large partitions support service: drop checks for digest multipartition read support sstables: drop checks for correct counter order support cql3: drop checks for materialized views support cql3: drop checks for counters support cql3: drop checks for indexing support	2020-09-16 10:53:25 +03:00
Piotr Sarna	7c8728dd73	Merge 'Add progress metrics for replace decommission removenode' from Asias. This series follows "repair: Add progress metrics for node ops #6842" and adds the metrics for the remaining node operations, i.e., replace, decommission and removenode. Fixes #1244, #6733 * asias-repair_progress_metrics_replace_decomm_removenode: repair: Add progress metrics for removenode ops repair: Add progress metrics for decommission ops repair: Add progress metrics for replace ops	2020-09-15 12:19:11 +02:00
Benny Halevy	0dc45529c8	abstract_replication_strategy: get_ranges_in_thread: copy _token_metadata if func may yield Change `94995acedb` added yielding to abstract_replication_strategy::do_get_ranges. And `07e253542d` used get_ranges_in_thread in compaction_manager. However, there is nothing to prevent token_metadata, and in particular its `_sorted_tokens` from changing while iterating over them in do_get_ranges if the latter yields. Therefore copy the replication strategy `_token_metadata` in `get_ranges_in_thread(inet_address ep)`. If the caller provides `token_metadata` to get_ranges_in_thread, then the caller must make sure that we can safely yield while accessing token_metadata (like in `do_rebuild_replace_with_repair`). Fixes #7044 Test: unit(dev) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20200915074555.431088-1-bhalevy@scylladb.com>	2020-09-15 11:33:55 +03:00
Piotr Sarna	9e6098a422	repair: drop checks for large partitions support Large partitions are supported for over 2 years and upgrades are only allowed from versions which already have the support, so the checks are hereby dropped.	2020-09-14 12:07:20 +02:00
Pavel Emelyanov	a89c7198c2	range_tombstone_list: Introduce and use pop_as<>() The method extracts an element from the list, constructs a desired object from it and frees. This is common usage of range_tombstone_list. Having a helper helps encapsulating the exact collection inside the class. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-09-07 23:17:41 +03:00
Pavel Emelyanov	f19ade31ee	repair: Mark some partition_hasher methods noexcept The next patch will change the way range tombstones are fed into hasher. To make sure the codeflow doesn't become exception-unsafe, mark the relevant methods as non-throwing. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-09-07 23:17:41 +03:00
Asias He	8b4530a643	repair: Add progress metrics for removenode ops The following metric is added: scylla_node_maintenance_operations_removenode_finished_percentage{shard="0",type="gauge"} 0.650000 It is the number of finished percentage for removenode operation so far. Fixes #1244, #6733	2020-08-31 14:43:39 +08:00
Asias He	25e03233f1	repair: Add progress metrics for decommission ops The following metric is added: scylla_node_maintenance_operations_decommission_finished_percentage{shard="0",type="gauge"} 0.650000 It is the number of finished percentage for decommission operation so far. Fixes #1244, #6733	2020-08-31 14:43:39 +08:00
Asias He	80cb157669	repair: Add progress metrics for replace ops The following metric is added: scylla_node_maintenance_operations_replace_finished_percentage{shard="0",type="gauge"} 0.650000 It is the number of finished percentage for replace operation so far. Fixes #1244, #6733	2020-08-31 14:03:05 +08:00
Avi Kivity	7416b3c34b	Merge 'scylla-gdb.py: Add scylla repairs command' from Asias " This series adds scylla repairs command to help debug repair. Fixes #7103 " * asias-repair_help_debug_scylla_repairs_cmd: scylla-gdb.py: Add scylla repairs command repair: Add repair_state to track repair states scylla-gdb.py: Print the pointers of elements in boost_intrusive_list_printer scylla-gdb.py: Add printer for gms::inet_address scylla-gdb.py: Fix a typo in boost_intrusive_list repair: Fix the incorrect comments for _all_nodes repair: Add row_level_repair object pointer in repair_meta repair: Add counter for reads issued and finished for repair_reader	2020-08-26 13:57:31 +03:00
Avi Kivity	6ff12b7f79	repair: apply_rows_on_follower(): remove copy of repair_rows list We copy a list, which was reported to generate a 15ms stall. This is easily fixed by moving it instead, which is safe since this is the last use of the variable. Fixes #7115.	2020-08-26 11:52:39 +03:00
Asias He	ab57cea783	repair: Add repair_state to track repair states Use repair_state to track the major state of repair from the beginning to the end of repair. With this patch, we can easily know at which state both the repair master and followers are. It is very helpful when debugging a repair hang issue. Refs #7103	2020-08-26 11:19:25 +08:00
Asias He	9ee86bb5a0	repair: Fix the incorrect comments for _all_nodes The _all_nodes field contains both the peer nodes and the node itself. Refs #7103	2020-08-26 10:12:07 +08:00
Asias He	656ff93d49	repair: Add row_level_repair object pointer in repair_meta It is helpful to track back the row_level_repair object for repair master when debugging. Refs #7103	2020-08-26 10:12:07 +08:00
Asias He	283c3dae0a	repair: Add counter for reads issued and finished for repair_reader It is helpful for checking whether the reader blocks forever when debugging a repair hang. Refs #7103	2020-08-26 10:12:07 +08:00
Asias He	e86881be99	repair: Print repair reason in repair stats log It is useful to distinguish if the repair is a regular repair or used for node operations. In addition, log the keyspace and tables that are repaired. Fixes #7086	2020-08-25 11:05:47 +03:00
Pavel Emelyanov	06f4828b93	db: Factor out get_local_ranges helper Storage service and repair code have identical helpers to get local ranges for keyspace. Move this helper's code onto database, later it will be reused by one more place. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-08-21 14:58:40 +03:00
Pavel Emelyanov	24eaf827c0	migration_manager: Add messaging service as argument to get_schema_definition There are 4 places that call this helper: - storage proxy. Callers are rpc verb handlers and already have the proxy at hand from which they can get the messaging service instance - repair. There's local-global messaging instance at hand, and the caller is in verb handler too - streaming. The caller is verb handler, which is unregistered on stop, so the messaging service instance can be captured - migration manager itself. The caller already uses "this", so the messaging service instance can be obtained from it The better approach would be to make get_schema_definition be the method of migration_manager, but the manager is stopped for real on shutdown, thus referencing it from the callers might not be safe and needs revisiting. At the same time the messaging service is always alive, so using its reference is safe. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-08-19 20:50:53 +03:00
Pavel Emelyanov	704880d564	repair: Stop using global messaging_service references Now all the users of messaging service have the needed reference. Again, the messaging service is not really stopped at the end, so its usage is safe regardless of whether repair stuff itself leaks on stop or not. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-08-19 20:50:53 +03:00
Pavel Emelyanov	d7e90dbfa9	repair: Keep sharded messaging service reference on repair_meta The reference comes from repair_info and storage_service calls, both had been already patched for that. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-08-19 20:50:53 +03:00
Pavel Emelyanov	285648620b	repair: Keep sharded messaging service reference on repair_info This reference comes from the API that already has it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-08-19 20:50:53 +03:00
Pavel Emelyanov	74494bac87	repair: Keep reference on messaging in row-level code The row-level repair keeps its statics for needed services, same as the streaming does. Treat the messaging service the same way to stop using the global one in the next patches. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-08-19 20:50:53 +03:00
Pavel Emelyanov	45c31eadb3	repair: Push the sharded<messaging_service> reference down to sync_data_using_repair This function needs the messaging service inside, but the closest place where it can get one from is the storage_service API handlers. Temporarily move the call for global messaging service into storage service, its turn for this cleanup will come later. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-08-19 20:50:52 +03:00
Pavel Emelyanov	6b0f4d5c8d	repair: Use existing sharded db reference The db.invoke_on_all's lambda tries to get the sharded db reference via the global storage service. This can be done in a much nicer way. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-08-19 20:50:52 +03:00
Pavel Emelyanov	3d2e3203f7	repair: Mark repair.cc local functions as static Just a cleanup to facilitate code reading. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-08-19 20:50:52 +03:00
Piotr Jastrzebski	c001374636	codebase wide: replace count with contains C++20 introduced `contains` member functions for maps and sets for checking whether an element is present in the collection. Previously `count` function was often used in various ways. `contains` does not only express the intent of the code better but also does it in a more unified way. This commit replaces all the occurrences of the `count` with the `contains`. Tests: unit(dev) Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <b4ef3b4bc24f49abe04a2aba0ddd946009c9fcb2.1597314640.git.piotr@scylladb.com>	2020-08-15 20:26:02 +03:00
Avi Kivity	736863c385	Merge "repair: Add progress metrics for node ops" from Asias " This series adds progress metrics for the node operations. Metrics for bootstrap and rebuild progress are added as a starter. I will add more for the remaining operations after getting feedback. With this the Scylla Monitor and Scylla Manager can know the progress of the bootstrap and other node operations. E.g., scylla_node_ops_bootstrap_nr_ranges_finished{shard="0",type="derive"} 50 scylla_node_ops_bootstrap_nr_ranges_total{shard="0",type="derive"} 1040 Fixes #1244, #6733 " * 'repair_progress_metrics_v3' of github.com:asias/scylla: repair: Add progress metrics for repair ops repair: Add progress metrics for rebuild ops repair: Add progress metrics for bootstrap ops	2020-08-12 11:42:14 +03:00
Avi Kivity	8853eddaf6	Merge 'repair: Track repair_meta created on both repair follower and master' from Asias " It is pretty hard to find the repair_meta object when debugging a core. This patch makes it easier by putting repair_meta object created by both repair follower and master into a map. Fixes #7009 " * asias-repair_make_debug_eaiser_track_all_repair_metas: repair: Add repair_meta_tracker to track repair_meta for followers and masters repair: Move thread local object _repair_metas out of the function	2020-08-12 11:01:32 +03:00
Asias He	e9a520a22b	repair: Add repair_meta_tracker to track repair_meta for followers and masters It is pretty hard to find the repair_meta object when debugging a core. This patch makes it easier by putting repair_meta object created by both repair follower and master into a boost intrusive list. Fixes #7009	2020-08-12 15:44:22 +08:00
Asias He	58f4c730b0	repair: Move thread local object _repair_metas out of the function It is a lot of pain to access _repair_metas when debugging. Refs #7009	2020-08-12 11:23:18 +08:00
Avi Kivity	4547949420	Merge "Fix repair stalls in get_sync_boundary and apply_rows_on_master_in_thread" from Asias " This patch set fixes stalls in repair that are caused by std::list merge and clear operations during test_latency_read_with_nemesis test. Fixes #6940 Fixes #6975 Fixes #6976 " * 'fix_repair_list_stall_merge_clear_v2' of github.com:asias/scylla: repair: Fix stall in apply_rows_on_master_in_thread and apply_rows_on_follower repair: Use clear_gently in get_sync_boundary to avoid stall utils: Add clear_gently repair: Use merge_to_gently to merge two lists utils: Add merge_to_gently	2020-08-11 14:52:23 +03:00
Asias He	c65ad02fcd	repair: Fix stall in apply_rows_on_master_in_thread and apply_rows_on_follower The row_diff list in apply_rows_on_master_in_thread and apply_rows_on_follower can be large. Modify do_apply_rows to remove the row from the list when the row is consumed to avoid stall when the list is destroyed. Fixes #6975	2020-08-11 19:37:47 +08:00
Asias He	9f4b3a5fa6	repair: Use clear_gently in get_sync_boundary to avoid stall The _row_buf and _working_row_buf list can be large. Use clear_gently helper to avoid stalls. Fixes #6940	2020-08-11 19:37:47 +08:00
Piotr Jastrzebski	80e3923b3c	codebase wide: replace find(...) != end() with contains C++20 introduced `contains` member functions for maps and sets for checking whether an element is present in the collection. Previously the code pattern looked like: <collection>.find(<element>) != <collection>.end() In C++20 the same can be expressed with: <collection>.contains(<element>) This is not only more concise but also expresses the intent of the code more clearly. This commit replaces all the occurrences of the old pattern with the new approach. Tests: unit(dev) Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <f001bbc356224f0c38f06ee2a90fb60a6e8e1980.1597132302.git.piotr@scylladb.com>	2020-08-11 13:28:50 +03:00
Asias He	97d47bffa5	repair: Add progress metrics for repair ops The following metric is added: scylla_node_maintenance_operations_repair_finished_percentage{shard="0",type="gauge"} 0.650000 It is the number of finished percentage for all ongoing repair operations. When all ongoing repair operations finish, the percentage stays at 100%. Fixes #1244, #6733	2020-08-11 18:15:10 +08:00
Asias He	53fee789f0	repair: Use merge_to_gently to merge two lists During a performance test, test_latency_read_with_nemesis during manager repair, it experienced a stall of 73 ms: ``` (inlined by) std::back_insert_iterator<std::__cxx11::list<repair_row, std::allocator<repair_row> > >::operator=(repair_row const&) at /usr/include/c++/9/bits/stl_iterator.h:515 (inlined by) std::back_insert_iterator<std::__cxx11::list<repair_row, std::allocator<repair_row> > > std::__copy_move<false, false, std::bidirectional_iterator_tag>::__copy_m<std::_List_iterator<repair_row>, std::back_insert_iterator<std::__cxx11::list<repair_row, std::allocator<repair_row> > > >(std::_List_iterator<repair_row>, std::_List_iterator<repair_row>, std::back_insert_iterator<std::__cxx11::list<repair_row, std::allocator<repair_row> > >) at /usr/include/c++/9/bits/stl_algobase.h:312 (inlined by) std::back_insert_iterator<std::__cxx11::list<repair_row, std::allocator<repair_row> > > std::__copy_move_a<false, std::_List_iterator<repair_row>, std::back_insert_iterator<std::__cxx11::list<repair_row, std::allocator<repair_row> > > >(std::_List_iterator<repair_row>, std::_List_iterator<repair_row>, std::back_insert_iterator<std::__cxx11::list<repair_row, std::allocator<repair_row> > >) at /usr/include/c++/9/bits/stl_algobase.h:404 (inlined by) std::back_insert_iterator<std::__cxx11::list<repair_row, std::allocator<repair_row> > > std::__copy_move_a2<false, std::_List_iterator<repair_row>, std::back_insert_iterator<std::__cxx11::list<repair_row, std::allocator<repair_row> > > >(std::_List_iterator<repair_row>, std::_List_iterator<repair_row>, std::back_insert_iterator<std::__cxx11::list<repair_row, std::allocator<repair_row> > >) at /usr/include/c++/9/bits/stl_algobase.h:440 (inlined by) std::back_insert_iterator<std::__cxx11::list<repair_row, std::allocator<repair_row> > > std::copy<std::_List_iterator<repair_row>, std::back_insert_iterator<std::__cxx11::list<repair_row, std::allocator<repair_row> > > 
>(std::_List_iterator<repair_row>, std::_List_iterator<repair_row>, std::back_insert_iterator<std::__cxx11::list<repair_row, std::allocator<repair_row> > >) at /usr/include/c++/9/bits/stl_algobase.h:474 (inlined by) std::back_insert_iterator<std::__cxx11::list<repair_row, std::allocator<repair_row> > > std::__merge<std::_List_iterator<repair_row>, std::_List_iterator<repair_row>, std::back_insert_iterator<std::__cxx11::list<repair_row, std::allocator<repair_row> > >, __gnu_cxx::__ops::_Iter_comp_iter<repair_meta::apply_rows_on_master_in_thread(std::__cxx11::list<partition_key_and_mutation_fragments, std::allocator<partition_key_and_mutation_fragments> >, gms::inet_address, seastar::bool_class<update_working_row_buf_tag>, seastar::bool_class<update_peer_row_hash_sets_tag>, unsigned int)::{lambda(repair_row const&, repair_row const&)#1}> >(std::_List_iterator<repair_row>, std::back_insert_iterator<std::__cxx11::list<repair_row, std::allocator<repair_row> > >, std::_List_iterator<repair_row>, std::_List_iterator<repair_row>, __gnu_cxx::__ops::_Iter_comp_iter<repair_meta::apply_rows_on_master_in_thread(std::__cxx11::list<partition_key_and_mutation_fragments, std::allocator<partition_key_and_mutation_fragments> >, gms::inet_address, seastar::bool_class<update_working_row_buf_tag>, seastar::bool_class<update_peer_row_hash_sets_tag>, unsigned int)::{lambda(repair_row const&, repair_row const&)#1}>, __gnu_cxx::__ops::_Iter_comp_iter<repair_meta::apply_rows_on_master_in_thread(std::__cxx11::list<partition_key_and_mutation_fragments, std::allocator<partition_key_and_mutation_fragments> >, gms::inet_address, seastar::bool_class<update_working_row_buf_tag>, seastar::bool_class<update_peer_row_hash_sets_tag>, unsigned int)::{lambda(repair_row const&, repair_row const&)#1}>) at /usr/include/c++/9/bits/stl_algo.h:4923 (inlined by) std::back_insert_iterator<std::__cxx11::list<repair_row, std::allocator<repair_row> > > std::merge<std::_List_iterator<repair_row>, 
std::_List_iterator<repair_row>, std::back_insert_iterator<std::__cxx11::list<repair_row, std::allocator<repair_row> > >, repair_meta::apply_rows_on_master_in_thread(std::__cxx11::list<partition_key_and_mutation_fragments, std::allocator<partition_key_and_mutation_fragments> >, gms::inet_address, seastar::bool_class<update_working_row_buf_tag>, seastar::bool_class<update_peer_row_hash_sets_tag>, unsigned int)::{lambda(repair_row const&, repair_row const&)#1}>(std::_List_iterator<repair_row>, std::back_insert_iterator<std::__cxx11::list<repair_row, std::allocator<repair_row> > >, std::_List_iterator<repair_row>, std::_List_iterator<repair_row>, repair_meta::apply_rows_on_master_in_thread(std::__cxx11::list<partition_key_and_mutation_fragments, std::allocator<partition_key_and_mutation_fragments> >, gms::inet_address, seastar::bool_class<update_working_row_buf_tag>, seastar::bool_class<update_peer_row_hash_sets_tag>, unsigned int)::{lambda(repair_row const&, repair_row const&)#1}, repair_meta::apply_rows_on_master_in_thread(std::__cxx11::list<partition_key_and_mutation_fragments, std::allocator<partition_key_and_mutation_fragments> >, gms::inet_address, seastar::bool_class<update_working_row_buf_tag>, seastar::bool_class<update_peer_row_hash_sets_tag>, unsigned int)::{lambda(repair_row const&, repair_row const&)#1}) at /usr/include/c++/9/bits/stl_algo.h:5018 (inlined by) repair_meta::apply_rows_on_master_in_thread(std::__cxx11::list<partition_key_and_mutation_fragments, std::allocator<partition_key_and_mutation_fragments> >, gms::inet_address, seastar::bool_class<update_working_row_buf_tag>, seastar::bool_class<update_peer_row_hash_sets_tag>, unsigned int) at ./repair/row_level.cc:1242 repair_meta::get_row_diff_source_op(seastar::bool_class<update_peer_row_hash_sets_tag>, gms::inet_address, unsigned int, seastar::rpc::sink<repair_hash_with_cmd>&, seastar::rpc::source<repair_row_on_wire_with_cmd>&) at ./repair/row_level.cc:1608 
repair_meta::get_row_diff_with_rpc_stream(std::unordered_set<repair_hash, std::hash<repair_hash>, std::equal_to<repair_hash>, std::allocator<repair_hash> >, seastar::bool_class<needs_all_rows_tag>, seastar::bool_class<update_peer_row_hash_sets_tag>, gms::inet_address, unsigned int) at ./repair/row_level.cc:1674 row_level_repair::get_missing_rows_from_follower_nodes(repair_meta&) at ./repair/row_level.cc:2413 ``` The problem was that when std::merge() ran out of one range, it copied the second range. To fix, use the new merge_to_gently helper. Fixes #6976	2020-08-11 10:37:34 +08:00
Asias He	e3c2d08f4f	repair: Add progress metrics for rebuild ops The following metric is added: scylla_node_maintenance_operations_rebuild_finished_percentage{shard="0",type="gauge"} 0.650000 It is the number of finished percentage for rebuild operation so far. Fixes #1244, #6733	2020-08-10 15:45:37 +08:00
Asias He	b23f65d1d9	repair: Add progress metrics for bootstrap ops The following metric is added: scylla_node_maintenance_operations_bootstrap_finished_percentage{shard="0",type="gauge"} 0.850000 It is the number of finished percentage for bootstrap operation so far. Fixes #1244, #6733	2020-08-10 15:45:37 +08:00
Asias He	e6f640441a	repair: Fix race between create_writer and wait_for_writer_done We saw scylla hit use after free in repair with the following procedure during tests: - n1 and n2 in the cluster - n2 ran decommission - n2 sent data to n1 using repair - n2 was killed forcibly - n1 tried to remove repair_meta for n1 - n1 hit use after free on repair_meta object This was what happened on n1: 1) data was received -> do_apply_rows was called -> yield before create_writer() was called 2) repair_meta::stop() was called -> wait_for_writer_done() / do_wait_for_writer_done was called with _writer_done[node_idx] not engaged 3) step 1 resumed, create_writer() was called and _repair_writer object was referenced 4) repair_meta::stop() finished, repair_meta object and its member _repair_writer was destroyed 5) The fiber created by create_writer() at step 3 hit use after free on _repair_writer object To fix, we should call wait_for_writer_done() after any pending operations were done which were protected by repair_meta::_gate. This prevents wait_for_writer_done() from finishing while the writer is still being created. Fixes: #6853 Fixes: #6868 Backports: 4.0, 4.1, 4.2	2020-07-28 11:53:40 +03:00
Botond Dénes	fe127a2155	sstables: clamp estimated_partitions to [1, +inf) in writers In some cases estimated number of partitions can be 0, which is albeit a legit estimation result, breaks many low-level sstable writer code, so some of these have assertions to ensure estimated partitions is > 0. To avoid hitting this assert all users of the sstable writers do the clamping, to ensure estimated partitions is at least 1. However leaving this to the callers is error prone as #6913 has shown it. As this clamping is standard practice, it is better to do it in the writers themselves, avoiding this problem altogether. This is exactly what this patch does. It also adds two unit tests, one that reproduces the crash in #6913, and another one that ensures all sstable writers are fine with estimated partitions being 0 now. Call sites previously doing the clamping are changed to not do it, it is unnecessary now as the writer does it itself. Fixes #6913 Tests: unit(dev) Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20200724120227.267184-1-bdenes@scylladb.com>	2020-07-27 09:19:37 +02:00
Pavel Emelyanov	5060063cd6	messaging: Add missing per-service unregistering methods 5 services register handlers in messaging, but not all of them have clear unregistration methods. Summary: migration_manager: everything is in place, no changes gossiper: ditto proxy: some verbs unregistration is missing repair: no unregistration at all streaming: ditto This patch adds the needed unregistration methods. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-07-22 16:34:00 +03:00
Asias He	28f8798464	repair: Do not use libfmt format specifiers if not needed We recently saw a weird log message: WARN 2020-07-19 10:22:46,678 [shard 0] repair - repair id [id=4, uuid=0b1092a1-061f-4691-b0ac-547b281ef09d] failed: std::runtime_error ({shard 0: fmt::v6::format_error (invalid type specifier), shard 1: fmt::v6::format_error (invalid type specifier)}) It turned out we have: throw std::runtime_error(format("repair id {:d} on shard {:d} failed to repair {:d} sub ranges", id, shard, nr_failed_ranges)); in the code, but we changed the id from integer to repair_uniq_id class. We do not really need to specify the format specifiers for numbers. Fixes #6874	2020-07-20 12:52:36 +03:00
Pavel Emelyanov	92f58f62f2	headers:: Remove flat_mutation_reader.hh from several other headers All they can live with forward declaration of the f._m._r. plus a seastar header in commitlog code. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-07-17 17:54:47 +03:00
Pavel Emelyanov	8618a02815	migration_manager: Remove db/schema_tables.hh inclusion into header The schema_tables.hh -> migration_manager.hh couple seems to work as one of "single header for everything" creating big bloat for many seemingly unrelated .hh's. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-07-17 17:54:43 +03:00
Asias He	4d7faac350	repair: Add uuid to a repair job Currently, repair uses an integer to identify a repair job. The repair id starts from 1 since node restart. As a result, different repair jobs will have same id across restart. To make the id more unique across restart, we can use an uuid in addition to the integer id. We can not drop the use of the integer id completely since the http api and nodetool use it. Fixes #6786	2020-07-16 11:03:19 +03:00
Asias He	38d964352d	repair: Relax node selection in bootstrap when nodes are less than RF Consider a cluster with two nodes: - n1 (dc1) - n2 (dc2) A third node is bootstrapped: - n3 (dc2) The n3 fails to bootstrap as follows: [shard 0] init - Startup failed: std::runtime_error (bootstrap_with_repair: keyspace=system_distributed, range=(9183073555191895134, 9196226903124807343], no existing node in local dc) The system_distributed keyspace is using SimpleStrategy with RF 3. For the keyspace that does not use NetworkTopologyStrategy, we should not require the source node to be in the same DC. Fixes: #6744 Backports: 4.0 4.1, 4.2	2020-07-14 11:54:34 +02:00
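
Commit 53fee789f0 in the log above traces a 73 ms stall to std::merge() copying the rest of one std::list range in a single uninterrupted run, and fixes it with a merge_to_gently helper whose implementation is not shown in this listing. The sketch below is only a guess at the general idea, not Scylla's code: it merges two sorted lists by relinking nodes with std::list::splice, so no element is copied, and calls a yield hook after every step. The name `merge_by_splicing` and the `maybe_yield` callback are hypothetical stand-ins; in Scylla the yielding would go through Seastar rather than a plain callback.

```cpp
#include <functional>
#include <list>

// Hypothetical sketch, NOT Scylla's merge_to_gently: merge sorted `src` into
// sorted `dst` by relinking list nodes, calling `maybe_yield` after each step
// so a long merge never runs as one uninterrupted loop.
template <typename T, typename Less>
void merge_by_splicing(std::list<T>& dst, std::list<T>& src, Less less,
                       const std::function<void()>& maybe_yield) {
    auto pos = dst.begin();
    while (!src.empty()) {
        // Advance to the first element of dst that sorts after src.front().
        // Equal elements from dst stay in front, as with std::merge.
        while (pos != dst.end() && !less(src.front(), *pos)) {
            ++pos;
            maybe_yield();
        }
        // Move one node from src to dst; splice relinks it without copying.
        dst.splice(pos, src, src.begin());
        maybe_yield();
    }
}
```

Calling it with two sorted lists leaves `src` empty and `dst` holding the merged sequence. The yield hook can be a no-op in a test and a real preemption check in a cooperative scheduler.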


400 Commits