scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-28 20:27:03 +00:00

Author	SHA1	Message	Date
Asias He	e6f640441a	repair: Fix race between create_writer and wait_for_writer_done We saw scylla hit use after free in repair with the following procedure during tests: - n1 and n2 in the cluster - n2 ran decommission - n2 sent data to n1 using repair - n2 was forcibly killed - n1 tried to remove repair_meta for n1 - n1 hit use after free on repair_meta object This was what happened on n1: 1) data was received -> do_apply_rows was called -> yield before create_writer() was called 2) repair_meta::stop() was called -> wait_for_writer_done() / do_wait_for_writer_done was called with _writer_done[node_idx] not engaged 3) step 1 resumed, create_writer() was called and _repair_writer object was referenced 4) repair_meta::stop() finished, repair_meta object and its member _repair_writer was destroyed 5) The fiber created by create_writer() at step 3 hit use after free on _repair_writer object To fix, we should call wait_for_writer_done() after any pending operations were done which were protected by repair_meta::_gate. This prevents wait_for_writer_done() from finishing while the writer is still in the process of being created. Fixes: #6853 Fixes: #6868 Backports: 4.0, 4.1, 4.2	2020-07-28 11:53:40 +03:00
Botond Dénes	fe127a2155	sstables: clamp estimated_partitions to [1, +inf) in writers In some cases the estimated number of partitions can be 0, which, albeit a legitimate estimation result, breaks much of the low-level sstable writer code, so some of these writers have assertions to ensure estimated partitions is > 0. To avoid hitting this assert all users of the sstable writers do the clamping, to ensure estimated partitions is at least 1. However leaving this to the callers is error prone, as #6913 has shown. As this clamping is standard practice, it is better to do it in the writers themselves, avoiding this problem altogether. This is exactly what this patch does. It also adds two unit tests, one that reproduces the crash in #6913, and another one that ensures all sstable writers are fine with estimated partitions being 0 now. Call sites previously doing the clamping are changed to not do it, as it is unnecessary now that the writer does it itself. Fixes #6913 Tests: unit(dev) Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20200724120227.267184-1-bdenes@scylladb.com>	2020-07-27 09:19:37 +02:00
Pavel Emelyanov	5060063cd6	messaging: Add missing per-service unregistering methods 5 services register handlers in messaging, but not all of them have clear unregistration methods. Summary: migration_manager: everything is in place, no changes gossiper: ditto proxy: some verbs unregistration is missing repair: no unregistration at all streaming: ditto This patch adds the needed unregistration methods. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-07-22 16:34:00 +03:00
Asias He	28f8798464	repair: Do not use libfmt format specifiers if not needed We recently saw a weird log message: WARN 2020-07-19 10:22:46,678 [shard 0] repair - repair id [id=4, uuid=0b1092a1-061f-4691-b0ac-547b281ef09d] failed: std::runtime_error ({shard 0: fmt::v6::format_error (invalid type specifier), shard 1: fmt::v6::format_error (invalid type specifier)}) It turned out we have: throw std::runtime_error(format("repair id {:d} on shard {:d} failed to repair {:d} sub ranges", id, shard, nr_failed_ranges)); in the code, but we changed the id from integer to repair_uniq_id class. We do not really need to specify the format specifiers for numbers. Fixes #6874	2020-07-20 12:52:36 +03:00
Pavel Emelyanov	92f58f62f2	headers: Remove flat_mutation_reader.hh from several other headers They can all live with a forward declaration of the f._m._r. plus a seastar header in commitlog code. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-07-17 17:54:47 +03:00
Pavel Emelyanov	8618a02815	migration_manager: Remove db/schema_tables.hh inclusion into header The schema_tables.hh -> migration_manager.hh couple seems to work as one of the "single header for everything" headers, creating big bloat for many seemingly unrelated .hh's. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-07-17 17:54:43 +03:00
Asias He	4d7faac350	repair: Add uuid to a repair job Currently, repair uses an integer to identify a repair job. The repair id starts from 1 since node restart. As a result, different repair jobs will have same id across restart. To make the id more unique across restart, we can use an uuid in addition to the integer id. We can not drop the use of the integer id completely since the http api and nodetool use it. Fixes #6786	2020-07-16 11:03:19 +03:00
Asias He	38d964352d	repair: Relax node selection in bootstrap when nodes are less than RF Consider a cluster with two nodes: - n1 (dc1) - n2 (dc2) A third node is bootstrapped: - n3 (dc2) The n3 fails to bootstrap as follows: [shard 0] init - Startup failed: std::runtime_error (bootstrap_with_repair: keyspace=system_distributed, range=(9183073555191895134, 9196226903124807343], no existing node in local dc) The system_distributed keyspace is using SimpleStrategy with RF 3. For the keyspace that does not use NetworkTopologyStrategy, we should not require the source node to be in the same DC. Fixes: #6744 Backports: 4.0 4.1, 4.2	2020-07-14 11:54:34 +02:00
Asias He	271fac56a3	repair: Add synchronous API to query repair status This new api blocks until the repair job has finished, failed, or timed out. E.g., - Without timeout curl -X GET http://127.0.0.1:10000/storage_service/repair_status/?id=123 - With timeout curl -X GET http://127.0.0.1:10000/storage_service/repair_status/?id=123&timeout=5 The timeout is in seconds. The current asynchronous api returns immediately even if the repair is in progress. E.g., curl -X GET http://127.0.0.1:10000/storage_service/repair_async/ks?id=123 Users can use the new synchronous API to avoid repeatedly sending the query to poll whether the repair job is finished. Fixes #6445	2020-07-14 11:20:15 +03:00
Asias He	a00ab8688f	repair: Relax size check of get_row_diff and set_diff In case of a row hash conflict, a hash in set_diff will get more than one row from get_row_diff. For example, Node1 (Repair master): row1 -> hash1 row2 -> hash2 row3 -> hash3 row3' -> hash3 Node2 (Repair follower): row1 -> hash1 row2 -> hash2 We will have set_diff = {hash3} between node1 and node2, while get_row_diff({hash3}) will return two rows: row3 and row3'. And the error below was observed: repair - Got error in row level repair: std::runtime_error (row_diff.size() != set_diff.size()) In this case, node1 should send both row3 and row3' to the peer node instead of failing the whole repair. This is safe because node2 has neither row3 nor row3'; otherwise node1 would not send rows with hash3 to node2 in the first place. Refs: #6252	2020-07-14 10:39:30 +03:00
Asias He	67f6da6466	repair: Switch to btree_set for repair_hash. In one of the longevity tests, we observed 1.3s reactor stall which came from repair_meta::get_full_row_hashes_source_op. It traced back to a call to std::unordered_set::insert() which triggered big memory allocation and reclaim. I measured std::unordered_set, absl::flat_hash_set, absl::node_hash_set and absl::btree_set. The absl::btree_set was the only one that seastar oversized allocation checker did not warn in my tests where around 300K repair hashes were inserted into the container. - unordered_set: hash_sets=295634, time=333029199 ns - flat_hash_set: hash_sets=295634, time=312484711 ns - node_hash_set: hash_sets=295634, time=346195835 ns - btree_set: hash_sets=295634, time=341379801 ns The btree_set is a bit slower than unordered_set but it does not have huge memory allocation. I do not measure real difference of total time to finish repair of the same dataset with unordered_set and btree_set. To fix, switch to absl btree_set container. Fixes #6190	2020-07-09 11:35:18 +03:00
Asias He	0929a5e82b	repair: Fix inaccurate exception message in check_failed_ranges The failure can have causes other than a checksum failure. Fixes #6785	2020-07-07 18:27:16 +03:00
Asias He	6e6e554944	repair: Use warn level for logs with recoverable failures The failures behind those logs are recoverable, not fatal. We should make them warn level instead of info level. Fixes #5612	2020-07-07 18:27:16 +03:00
Asias He	7f3eb8b4e8	repair: Handle dropped table in repair_range In commit `12d929a5ae` (repair: Add table_id to row_level_repair), a call to find_column_family() was added in repair_range. If the table is dropped, it will fail the repair_range which in turn fails the bootstrap operation. Tests: update_cluster_layout_tests.py:TestUpdateClusterLayout.simple_add_new_node_while_schema_changes_test Fixes: #5942	2020-07-01 12:13:14 +03:00
Rafael Ávila de Espíndola	3964b1a551	row level repair: Don't return a variadic future from get_sink_source	2020-06-29 16:51:41 -07:00
Rafael Ávila de Espíndola	eeee63a9a3	row level repair: Don't return a variadic future from read_rows_from_disk	2020-06-29 16:51:10 -07:00
Rafael Ávila de Espíndola	af44684418	messaging_service: Don't return variadic futures from make_sink_and_source_for_*	2020-06-29 16:50:45 -07:00
Botond Dénes	fbbc86e18c	repair: row_level: destroy reader on EOS or error To avoid having to make it an optional with all the additional checks, we just replace it with an empty reader instead. This also achieves the desired effect of releasing the read permit and all the associated resources early.	2020-06-23 21:08:21 +03:00
Botond Dénes	080f00b99a	repair: row_level: use evictable_reader for local reads Row level repair, when using a local reader, is prone to deadlocking on the streaming reader concurrency semaphore. This has been observed to happen with at least two participating nodes, running more concurrent repairs than the maximum number of reads allowed by the concurrency semaphore. In this situation, it is possible that two repair instances, competing for the last available permits on both nodes, each get a permit on one of the nodes and get queued on the other one. As neither will let go of the permit it already acquired, nor give up waiting on the permit it failed to acquire, a deadlock happens. To prevent this, we make the local repair reader evictable. For this we reuse the newly exposed evictable reader. The repair reader is paused after the repair buffer is filled, which is currently 32MB, so the cost of a possible reader recreation is amortized over a 32MB read. The repair reader is said to be local when it can use the shard-local partitioner. This is the case if the participating nodes are homogeneous (their shard configuration is identical), that is, the repair instance has to read from just one shard. A non-local reader uses the multishard reader, which already makes its shard readers evictable and hence is not prone to the deadlock described here.	2020-06-23 21:08:21 +03:00
Avi Kivity	de38091827	priority_manager: merge streaming_read and streaming_write classes into one class Streaming is handled by just one group for CPU scheduling, so separating it into read and write classes for I/O is artificial, and inflates the resources we allow for streaming if both reads and writes happen at the same time. Merge both classes into one class ("streaming") and adjust callers. The merged class has 200 shares, so it reduces streaming bandwidth if both directions are active at the same time (which is rare; I think it only happens in view building).	2020-06-22 15:09:04 +03:00
Rafael Ávila de Espíndola	f6e407ecd2	everywhere: Prepare for seastar api v4 (when_all_succeed return value) The seastar api v4 changes the return type of when_all_succeed. This patch adds discard_result when that is best solution to handle the change. This doesn't do the actual update to v4 since there are still a few issues left to fix in seastar. A patch doing just the update will follow. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com> Message-Id: <20200617233150.918110-1-espindola@scylladb.com>	2020-06-18 15:13:56 +03:00
Asias He	61e4387811	repair: Relax node selection in decommission for non network topology strategy In decommission operation, current code requires a node in local dc to sync data with. This requirement is too strong for a non network topology strategy. For example, consider: n1 dc1 n2 dc1 n3 dc2 n2 runs decommission operation. For a keyspace with simple strategy and RF = 2, it is possible n3 is the new owner but n3 is not in the same dc as n2. To fix, perform the dc check only for the network topology strategy. Fixes #6564	2020-06-15 11:26:02 +03:00
Avi Kivity	08313106ce	Merge 'Repair use table id instead of table name' from Asias " Use table_id instead of table_name in row level repair to find a table. It guarantees we repair the same table even if a table is dropped and a new table is created with the same name. Refs: #5942 " * asias-repair_use_table_id_instead_of_table_name: repair: Do not pass table names to repair_info repair: Add table_id to row_level_repair repair: Use table id to find a table in get_sharder_for_tables repair: Add table_ids to repair_info repair: Make func in tracker::run run inside a thread	2020-06-14 14:58:46 +03:00
Avi Kivity	9afd599d7c	Merge 'range_streamer: Handle table of RF 1 in get_range_fetch_map' from Asias " After "Make replacing node take writes" series, with repair based node operations disabled, we saw the replace operation fail like: ``` [shard 0] init - Startup failed: std::runtime_error (unable to find sufficient sources for streaming range (9203926935651910749, +inf) in keyspace system_auth) ``` The reason is the system_auth keyspace has default RF of 1. It is impossible to find a source node to stream from for the ranges owned by the replaced node. In the past, the replace operation with keyspace of RF 1 passed, because the replacing node called token_metadata.update_normal_tokens(tokens, ip_of_replacing_node) before streaming. We saw: ``` [shard 0] range_streamer - Bootstrap : keyspace system_auth range (-9021954492552185543, -9016289150131785593] exists on {127.0.0.6} ``` Node 127.0.0.6 is the replacing node 127.0.0.5. The source node check in range_streamer::get_range_fetch_map will pass if the source is the node itself. However, it will not stream from the node itself. As a result, the system_auth keyspace will not get any data. After the "Make replacing node take writes" series, the replacing node calls token_metadata.update_normal_tokens(tokens, ip_of_replacing_node) after the streaming finishes. We saw: ``` [shard 0] range_streamer - Bootstrap : keyspace system_auth range (-9049647518073030406, -9048297455405660225] exists on {127.0.0.5} ``` Since 127.0.0.5 was dead, the source node check failed, and so did the bootstrap operation. To fix, we ignore the table of RF 1 when it is unable to find a source node to stream. Fixes #6351 " * asias-fix_bootstrap_with_rf_one_in_range_streamer: range_streamer: Handle table of RF 1 in get_range_fetch_map streaming: Use separate streaming reason for replace operation	2020-06-10 16:03:13 +03:00
Asias He	6c89cedf0a	repair: Do not pass table names to repair_info Get the table names from the table ids instead which prevents the user of repair_info class provides inconsistent table names and table ids. Refs: #5942	2020-06-01 17:44:05 +08:00
Asias He	12d929a5ae	repair: Add table_id to row_level_repair Now that repair_info has tables id for the tables we want to repair. Use table_id instead of table_name in row level repair to find a table. It guarantees we repair the same table even if a table is dropped and a new table is created with the same name. Refs: #5942	2020-06-01 17:34:25 +08:00
Asias He	7ea8bf648d	repair: Use table id to find a table in get_sharder_for_tables We are moving to use the table id instead of table name to get a table in repair. It guarantees the same table is repaired. Refs: #5942	2020-06-01 17:34:25 +08:00
Asias He	378e31b409	repair: Add table_ids to repair_info A helper get_table_ids is added to convert the table names to table ids. We convert it once and use the same table ids for the whole repair operations. This guarantees we repair the same table during the same repair request. Refs: #5942	2020-06-01 17:34:25 +08:00
Asias He	ad878a56eb	repair: Make func in tracker::run run inside a thread It simplifies the code in func and makes it easier to write loops that do not stall. Refs: #5942	2020-06-01 17:34:16 +08:00
Asias He	c02fea5f04	repair: Ignore table removed in sync_data_using_repair Commit `75cf255c67` (repair: Ignore keyspace that is removed in sync_data_using_repair) is not enough to fix the issue because when the repair master checks if the table is dropped, the table might not be dropped yet on the repair master. To fix, the repair master should check if the follower failed the repair because the table is dropped by checking the error returned from follower. With this patch, we would see WARN 2020-04-14 11:19:00,417 [shard 0] repair - repair id 1 on shard 0 completed successfully, keyspace=ks, ignoring dropped tables={cf} when the table is dropped during bootstrap. Tests: update_cluster_layout_tests.py:TestUpdateClusterLayout.simple_add_new_node_while_schema_changes_test Fixes: #5942	2020-05-24 13:39:59 +03:00
Asias He	fa9ee234a0	streaming: Use separate streaming reason for replace operation Currently, replace and bootstrap share the same streaming reason, stream_reason::bootstrap, because they share most of the code in boot_strapper. In order to distinguish the two, we need to introduce a new stream reason, stream_reason::replace. It is safe to do so in a mixed cluster because current code only checks if the stream_reason is stream_reason::repair. Refs: #6351	2020-05-22 09:30:52 +08:00
Botond Dénes	c29ccdea7e	repair: switch from queue+generating_reader to queue_reader The queue_reader was invented exactly to replace this construct and is more efficient than it. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20200520155618.369873-1-bdenes@scylladb.com>	2020-05-20 19:33:28 +03:00
Rafael Ávila de Espíndola	311fbe2f0a	repair: Make sure sinks are always closed In a recent next failure I got the following backtrace #3 0x00007efd71251a66 in __GI___assert_fail (assertion=assertion@entry=0x2d0c00 "this->_con->get()->sink_closed()", file=file@entry=0x32c9d0 "./seastar/include/seastar/rpc/rpc_impl.hh", line=line@entry=795, function=function@entry=0x270360 "seastar::rpc::sink_impl<Serializer, Out>::~sink_impl() [with Serializer = netw::serializer; Out = {repair_row_on_wire_with_cmd}]") at assert.c:101 #4 0x0000000001f5d2c3 in seastar::rpc::sink_impl<netw::serializer, repair_row_on_wire_with_cmd>::~sink_impl (this=<optimized out>, __in_chrg=<optimized out>) at ./seastar/include/seastar/core/future.hh:312 #5 0x0000000001f5d2f4 in seastar::shared_ptr_count_for<seastar::rpc::sink_impl<netw::serializer, repair_row_on_wire_with_cmd> >::~shared_ptr_count_for (this=0x60100075b680, __in_chrg=<optimized out>) at ./seastar/include/seastar/core/shared_ptr.hh:463 #6 seastar::shared_ptr_count_for<seastar::rpc::sink_impl<netw::serializer, repair_row_on_wire_with_cmd> >::~shared_ptr_count_for (this=0x60100075b680, __in_chrg=<optimized out>) at ./seastar/include/seastar/core/shared_ptr.hh:463 #7 0x000000000240f2e6 in seastar::shared_ptr<seastar::rpc::sink<repair_row_on_wire_with_cmd>::impl>::~shared_ptr (this=0x601003118590, __in_chrg=<optimized out>) at ./seastar/include/seastar/core/future.hh:427 #8 seastar::rpc::sink<repair_row_on_wire_with_cmd>::~sink (this=0x601003118590, __in_chrg=<optimized out>) at ./seastar/include/seastar/rpc/rpc_types.hh:270 #9 <lambda(auto:134&)>::<lambda(const seastar::rpc::client_info&, uint64_t, seastar::rpc::source<repair_hash_with_cmd>)>::<lambda(std::__exception_ptr::exception_ptr)>::~<lambda> (this=0x601003118570, __in_chrg=<optimized out>) at repair/row_level.cc:2059 This patch changes a few functions to use finally to make sure the sink is always closed. 
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com> Message-Id: <20200515202803.60020-1-espindola@scylladb.com>	2020-05-18 08:13:42 +03:00
Asias He	b2c4d9fdbc	repair: Fix race between write_end_of_stream and apply_rows Consider: n1, n2, n1 is the repair master, n2 is the repair follower. === Case 1 === 1) n1 sends missing rows {r1, r2} to n2 2) n2 runs apply_rows_on_follower to apply rows, e.g., {r1, r2}, r1 is written to sstable, r2 is not written yet, r1 belongs to partition 1, r2 belongs to partition 2. It yields after row r1 is written. data: partition_start, r1 3) n1 sends repair_row_level_stop to n2 because error has happened on n1 4) n2 calls wait_for_writer_done() which in turn calls write_end_of_stream() data: partition_start, r1, partition_end 5) Step 2 resumes to apply the rows. data: partition_start, r1, partition_end, partition_end, partition_start, r2 === Case 2 === 1) n1 sends missing rows {r1, r2} to n2 2) n2 runs apply_rows_on_follower to apply rows, e.g., {r1, r2}, r1 is written to sstable, r2 is not written yet, r1 belongs to partition 1, r2 belongs to partition 2. It yields after partition_start for r2 is written but before _partition_opened is set to true. data: partition_start, r1, partition_end, partition_start 3) n1 sends repair_row_level_stop to n2 because error has happened on n1 4) n2 calls wait_for_writer_done() which in turn calls write_end_of_stream(). Since _partition_opened[node_idx] is false, partition_end is skipped, end_of_stream is written. data: partition_start, r1, partition_end, partition_start, end_of_stream This causes unbalanced partition_start and partition_end in the stream written to sstables. To fix, serialize the write_end_of_stream and apply_rows with a semaphore. Fixes: #6394 Fixes: #6296 Fixes: #6414	2020-05-14 18:15:01 +03:00
Asias He	b744dba75a	repair: Abort the queue in write_end_of_stream in case of error In write_end_of_stream, it does: 1) Write write_partition_end 2) Write empty mutation_fragment_opt If 1) fails, 2) will be skipped, the consumer of the queue will wait for the empty mutation_fragment_opt forever. Found this issue when injecting random exceptions between 1) and 2). Refs #6272 Refs #6248	2020-05-12 10:50:52 +02:00
Tomasz Grabiec	d78fbf7c16	Merge "storage_service: Make replacing node take writes" from Asias Background: Replace operation is used to replace a dead node in the cluster. Currently during replace operation, the replacing node does not take any writes. As a result, new writes to a range after the sync for that range is done, e.g., after streaming for that range is finished, will not be synced to the replacing node. Hinted hand off or repair after the replacing operation will help. But it is better if we can make the writes to the replacing node to avoid any post replacing operation actions. After this series and repair based node operation series, the replace operation will guarantee the replacing node has all the latest copy of data including the new writes during the replace operation. In short, no more repairs before or after the replacing operation. Just replacing the node is enough. Implementation: Filter the node being replaced out of the natural endpoints in storage_proxy, so that: The node being replaced will not be selected as the target for normal write or normal read. Do not depend on the gossip liveness to avoid selecting replacing node for normal write or normal read when the replacing node has the same ip address as the node being replaced. No more special handling for hibernate state in gossip which makes it is simpler and more robust. Replacing node will be marked as UP. Put the replacing node in the pending list, so that: Replacing node will take writes but write to replacing will not be counted as CL. Replacing node will not take normal read. Example: For example, with RF = 3, n1, n2, n3 in the cluster, n3 is dead and being replaced by node n4. When n4 starts: writes to nodes {n1, n2, n3} are changed to normal_replica_writes = {n1, n2} and pending_replica_writes= {n4}. reads to nodes {n1, n2, n3} are changed to normal_replica_reads = {n1, n2} only. This way, the replacing node n4 now takes writes but does not take reads. 
Tests: Measure the number of writes during pending period that is the replacing starts and finishes the replace operation. Start 5 nodes, n1 to n5. Stop n5 Start write in the background Start n6 to replace n5 Get scylla_database_total_writes metrics when the replacing node announces HIBERNATE (replacing) and NORMAL status. Before: 2020-02-06 08:35:35.921837 Get metrics when other knows replacing node = HIBERNATE 2020-02-06 08:35:35.939493 scylla_database_total_writes: node1={'scylla_database_total_writes': 15483} 2020-02-06 08:35:35.950614 scylla_database_total_writes: node2={'scylla_database_total_writes': 15857} 2020-02-06 08:35:35.961820 scylla_database_total_writes: node3={'scylla_database_total_writes': 16195} 2020-02-06 08:35:35.978427 scylla_database_total_writes: node4={'scylla_database_total_writes': 15764} 2020-02-06 08:35:35.992580 scylla_database_total_writes: node6={'scylla_database_total_writes': 331} 2020-02-06 08:36:49.794790 Get metrics when other knows replacing node = NORMAL 2020-02-06 08:36:49.809189 scylla_database_total_writes: node1={'scylla_database_total_writes': 267088} 2020-02-06 08:36:49.823302 scylla_database_total_writes: node2={'scylla_database_total_writes': 272352} 2020-02-06 08:36:49.837228 scylla_database_total_writes: node3={'scylla_database_total_writes': 274004} 2020-02-06 08:36:49.851104 scylla_database_total_writes: node4={'scylla_database_total_writes': 262972} 2020-02-06 08:36:49.862504 scylla_database_total_writes: node6={'scylla_database_total_writes': 513} Writes = 513 - 331 After: 2020-02-06 08:28:56.548047 Get metrics when other knows replacing node = HIBERNATE 2020-02-06 08:28:56.560813 scylla_database_total_writes: node1={'scylla_database_total_writes': 290886} 2020-02-06 08:28:56.573925 scylla_database_total_writes: node2={'scylla_database_total_writes': 310304} 2020-02-06 08:28:56.586305 scylla_database_total_writes: node3={'scylla_database_total_writes': 304049} 2020-02-06 08:28:56.601464 
scylla_database_total_writes: node4={'scylla_database_total_writes': 303770} 2020-02-06 08:28:56.615066 scylla_database_total_writes: node6={'scylla_database_total_writes': 604} 2020-02-06 08:29:10.537016 Get metrics when other knows replacing node = NORMAL 2020-02-06 08:29:10.553257 scylla_database_total_writes: node1={'scylla_database_total_writes': 336126} 2020-02-06 08:29:10.567181 scylla_database_total_writes: node2={'scylla_database_total_writes': 358549} 2020-02-06 08:29:10.581939 scylla_database_total_writes: node3={'scylla_database_total_writes': 351416} 2020-02-06 08:29:10.595567 scylla_database_total_writes: node4={'scylla_database_total_writes': 350580} 2020-02-06 08:29:10.610548 scylla_database_total_writes: node6={'scylla_database_total_writes': 45460} Writes = 45460 - 604 As we can see the replacing node did not take write before and take write after the patch. Check log of writer handler in storage_proxy storage_proxy - creating write handler for token: -2642068240672386521, keyspace_name=ks, original_natrual={127.0.0.1, 127.0.0.5, 127.0.0.2}, natural={127.0.0.1, 127.0.0.2}, pending={127.0.0.6} The node being replaced, n5=127.0.0.5, is filtered out and the replacing node, n6=127.0.0.6 is in the pending list. * asias/replace_take_writes: storage_service: Make replacing node take writes repair: Use token_metadata with the replacing node in do_rebuild_replace_with_repair abstract_replication_strategy: Add get_ranges which takes token_metadata abstract_replication_strategy: Add get_natural_endpoints_without_node_being_replaced abstract_replication_strategy: Add allow_remove_node_being_replaced_from_natural_endpoints token_metadata: Calculate pending ranges for replacing node storage_service: Unify handling of replaced node removal from gossip storage_service: Update tokens and replace address for replace operation	2020-04-30 19:28:35 +02:00
Avi Kivity	8925e00e96	Merge 'Fix hang in multishard_writer' from Asias " This series fixes a hang in multishard_writer when an error happens. It contains - multishard_writer: Abort the queue attached to consumers when producer fails - repair: Fix hang when the writer is dead Fixes #6241 Refs: #6248 " * asias-stream_fix_multishard_writer_hang: repair: Fix hang when the writer is dead mutation_writer_test: Add test_multishard_writer_producer_aborts multishard_writer: Abort the queue attached to consumers when producer fails	2020-04-30 12:27:55 +03:00
Asias He	e3fbc8fba1	repair: Use token_metadata with the replacing node in do_rebuild_replace_with_repair We will change the update of tokens in token_metadata in the next patch so that the tokens of the replacing node are updated to token_metadata only after the replace operation is done. In order to get the correct ranges for the replacing node in do_rebuild_replace_with_repair, we need to use a copy of token_metadata contains the tokens of the replacing node. Refs: #5482	2020-04-30 10:22:30 +08:00
Asias He	35c5ef78b9	repair: Fix hang when the writer is dead Consider: When repair master gets data from repair follower: 1) apply_rows_on_master_in_thread is called 2) a repair writer is created with _repair_writer.create_writer 3) the repair writer fails 4) data is written to the queue _mq[node_idx]->push_eventually attached with the writer Since the writer is dead, no one is going to fetch data from the _mq queue. The apply_rows_on_master_in_thread will block forever. To fix, when the writer has failed, we should abort the _mq queue. Refs: #6248	2020-04-28 12:14:32 +08:00
Asias He	13a9c5eaf7	repair: Send reason for node operations Since `956b092012` (Merge "Repair based node operation" from Asias), repair is used by other node operations like bootstrap, decommission and so on. Send the reason for the repair, so that we can handle the materialized view update correctly according to the reason of the operation. We want to trigger the view update only if the repair is used by repair operation. Otherwise, the view table will be handled twice, 1) when the view table is synced using repair 2) when the base table is synced using repair and view table update is triggered. Fixes #5930 Fixes #5998	2020-04-13 13:47:26 +03:00
Piotr Jastrzebski	e72696a8e6	sharding_info: rename the class to sharder Also rename all variables that were named si or sinfo to sharder. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2020-03-30 18:42:33 +02:00
Piotr Jastrzebski	a3262a2cb2	repair: depend only on sharding logic not on partitioner repair does not use partitioner and only uses sharding logic. This means it does not have to depend on i_partitioner and can instead operate on sharding_info. This has an important consequence of allowing the repair of multiple tables having different partitioners at the same time. All tables repaired together still have to use the same sharding logic. To achieve this the change: 1. Removes partitioner field from repair_info 2. repair_info has access to sharding_info through schema objects of repaired tables 3. partitioner name is removed from shard_config 4. local and remote partitioners are removed from repair_meta. Remote sharding_info is used instead. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2020-03-30 09:37:48 +02:00
Piotr Jastrzebski	94ff653b99	selective_token_range_sharder: replace i_partitioner with sharding_info The class does not depend on partitioning logic but only uses sharding logic. This means it is possible and desirable to limit its dependency to only sharding_info. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2020-03-30 09:36:22 +02:00
Rafael Ávila de Espíndola	c5795e8199	everywhere: Replace engine().cpu_id() with this_shard_id() This is a bit simpler and might allow removing a few includes of reactor.hh. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com> Message-Id: <20200326194656.74041-1-espindola@scylladb.com>	2020-03-27 11:40:03 +03:00
Avi Kivity	0d885dbb00	Merge "Make all headers standalone" from Botond " Make sure all headers compile on their own, without requiring any additional includes externally. Even though this requirement is not documented in our coding guides it is still quasi enforced and we semi-regularly get and merge patches adding missing includes to headers. This patch-set fixes all headers and adds a `{mode}-headers` target that can be used to verify each header. This target should be built by promotion to ensure no new non-conforming code sneaks in. Individual headers can be verified using the `build/dev/path/to/header.hh.o` target, that is generated for every header. The majority of the headers was just missing `seastarx.hh`. I think we should just include this via a compiler flag to remove the noise from our code (in a followup). " * 'compiling-headers/v2' of https://github.com/denesb/scylla: configure.py: add {mode}-headers phony target treewide: add missing headers and/or forward declarations test/boost/sstable_test.hh: move generic stuff to test/lib/sstable_utils.hh sstables: size_tiered_backlog_tracker: move methods out-of-line sstables: date_tiered_compaction_strategy.hh: move methods out-of-line	2020-03-23 13:09:09 +02:00
Botond Dénes	e0284bb9ee	treewide: add missing headers and/or forward declarations	2020-03-23 09:29:45 +02:00
Asias He	be1a196988	repair: Handle keyspace with zero table The following error was seen in materialized_views_test.py:TestMaterializedViews.decommission_node_during_mv_insert_4_nodes_test INFO [shard 0] repair - repair id 3 to sync data for keyspace=ks, status=started repair/repair.cc:662:36: runtime error: member call on null pointer of type 'const struct schema' Aborting on shard 0. The problem is that in the test a keyspace was created without creating any table. Since db19a76b1f (selective_token_range_sharder: stop calling global_partitioner()), in get_partitioner_for_tables, we access nullptr when no table is present. schema_ptr last_s; for (auto t: tables) { // set last_s } last_s->get_partitioner() To fix: 1) Skip the repair in sync_data_using_repair if there is no table in the keyspace 2) Throw if no schema_ptr is found in get_partitioner_for_tables. Be defensive. After: INFO [shard 0] repair - decommission_with_repair: started with keyspace=ks, leaving_node=127.0.0.2, nr_ranges=744 INFO [shard 0] repair - repair id 3 to sync data for keyspace=ks, status=started WARN [shard 0] repair - repair id 3 to sync data for keyspace=ks, no table in this keyspace INFO [shard 0] repair - repair id 3 completed successfully INFO [shard 0] repair - repair id 3 to sync data for keyspace=ks, status=succeeded Tests: materialized_views_test.py:TestMaterializedViews.decommission_node_during_mv_insert_4_nodes_test Fixes: #6022	2020-03-22 13:46:36 +02:00
Avi Kivity	d310e7c7ea	Merge 'repair: Ignore keyspace that is removed in sync_data_using_repair' from Asias repair: Ignore keyspace that is removed in sync_data_using_repair When a keyspace is removed during node operations, we should not fail the whole operation. Ignore the keyspace that is removed. Fixes #5942 * asias-repair_fix_5942: repair: Stop the nodes that have run repair_row_level_start repair: Ignore keyspace that is removed in sync_data_using_repair	2020-03-22 13:19:51 +02:00
Piotr Jastrzebski	924ed7bb1c	make_multishard_combining_reader: stop taking partitioner The function already takes schema so there's no need for it to take partitioner. It can be obtained using schema::get_partitioner Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2020-03-15 10:25:20 +01:00
Asias He	6a7c3f0af0	repair: Stop the nodes that have run repair_row_level_start It is ok to run repair_row_level_stop unconditionally. The node that hasn't received the repair_row_level_start will simply return an error that the repair_meta_id is not found. To avoid the unnecessary repair_row_level_stop verb, we can stop only the nodes that have run repair_row_level_start. This also makes the error message less confusing. For example: Before: INFO 2020-03-09 15:55:43,369 [shard 0] repair - repair id 1 on shard 0 failed: std::runtime_error (get_repair_meta: repair_meta_id 8 for node 127.0.0.4 does not exist) INFO 2020-03-09 15:55:43,369 [shard 0] repair - repair id 1 failed: std::runtime_error ({shard 0: std::runtime_error (get_repair_meta: repair_meta_id 8 for node 127.0.0.4 does not exist)}) WARN 2020-03-09 15:55:43,369 [shard 0] repair - repair id 1 to sync data for keyspace=ks, status=failed, keyspace does not exist any more, ignoring it: std::runtime_error ({shard 0: std::runtime_error (get_repair_meta: repair_meta_id 8 for node 127.0.0.4 does not exist)}) After: INFO 2020-03-09 16:09:09,217 [shard 0] repair - repair id 1 on shard 0 failed: std::runtime_error (Failed to repair for keyspace=ks, cf=cf, range=(9041860168177642466, 9044815446631222376]) INFO 2020-03-09 16:09:09,217 [shard 0] repair - repair id 1 failed: std::runtime_error ({shard 0: std::runtime_error (Failed to repair for keyspace=ks, cf=cf, range=(9041860168177642466, 9044815446631222376])}) WARN 2020-03-09 16:09:09,217 [shard 0] repair - repair id 1 to sync data for keyspace=ks, status=failed, keyspace does not exist any more, ignoring it: std::runtime_error ({shard 0: std::runtime_error (Failed to repair for keyspace=ks, cf=cf, range=(9041860168177642466, 9044815446631222376])}) Refs #5942	2020-03-09 18:24:02 +08:00

358 Commits