Commit Graph

358 Commits

Author SHA1 Message Date
Asias He
e6f640441a repair: Fix race between create_writer and wait_for_writer_done
We saw scylla hit user after free in repair with the following procedure during tests:

- n1 and n2 in the cluster

- n2 ran decommission

- n2 sent data to n1 using repair

- n2 was killed forcely

- n1 tried to remove repair_meta for n1

- n1 hit use after free on repair_meta object

This was what happened on n1:

1) data was received -> do_apply_rows was called -> yield before create_writer() was called

2) repair_meta::stop() was called -> wait_for_writer_done() / do_wait_for_writer_done was called
   with _writer_done[node_idx] not engaged

3) step 1 resumed, create_writer() was called and _repair_writer object was referenced

4) repair_meta::stop() finished, repair_meta object and its member _repair_writer was destroyed

5) The fiber created by create_writer() at step 3 hit use after free on _repair_writer object

To fix, we should call wait_for_writer_done() after any pending
operations were done which were protected by repair_meta::_gate. This
prevents wait for writer done finishes before the writer is in the
process of being created.

Fixes: #6853
Fixes: #6868
Backports: 4.0, 4.1, 4.2
2020-07-28 11:53:40 +03:00
Botond Dénes
fe127a2155 sstables: clamp estimated_partitions to [1, +inf) in writers
In some cases estimated number of partitions can be 0, which is albeit a
legit estimation result, breaks many low-level sstable writer code, so
some of these have assertions to ensure estimated partitions is > 0.
To avoid hitting this assert all users of the sstable writers do the
clamping, to ensure estimated partitions is at least 1. However leaving
this to the callers is error prone as #6913 has shown it. As this
clamping is standard practice, it is better to do it in the writers
themselves, avoiding this problem altogether. This is exactly what this
patch does. It also adds two unit tests, one that reproduces the crash
in #6913, and another one that ensures all sstable writers are fine with
estimated partitions being 0 now. Call sites previously doing the
clamping are changed to not do it, it is unnecessary now as the writer
does it itself.

Fixes #6913

Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200724120227.267184-1-bdenes@scylladb.com>
2020-07-27 09:19:37 +02:00
Pavel Emelyanov
5060063cd6 messaging: Add missing per-service unregistering methods
5 services register handlers in messaging, but not all of them
have clear unregistration methods.

Summary:

migration_manager: everything is in place, no changes
gossiper: ditto
proxy: some verbs unregistration is missing
repair: no unregistration at all
streaming: ditto

This patch adds the needed unregistration methods.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-07-22 16:34:00 +03:00
Asias He
28f8798464 repair: Do not use libfmt format specifiers if not needed
We recently saw a weird log message:

   WARN  2020-07-19 10:22:46,678 [shard 0] repair - repair id [id=4,
   uuid=0b1092a1-061f-4691-b0ac-547b281ef09d] failed: std::runtime_error
   ({shard 0: fmt::v6::format_error (invalid type specifier), shard 1:
   fmt::v6::format_error (invalid type specifier)})

It turned out we have:

   throw std::runtime_error(format("repair id {:d} on shard {:d} failed to
   repair {:d} sub ranges", id, shard, nr_failed_ranges));

in the code, but we changed the id from integer to repair_uniq_id class.

We do not really need to specify the format specifiers for numbers.

Fixes #6874
2020-07-20 12:52:36 +03:00
Pavel Emelyanov
92f58f62f2 headers:: Remove flat_mutation_reader.hh from several other headers
All they can live with forward declaration of the f._m._r. plus a
seastar header in commitlog code.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-07-17 17:54:47 +03:00
Pavel Emelyanov
8618a02815 migration_manager: Remove db/schema_tables.hh inclustion into header
The schema_tables.hh -> migration_manager.hh couple seems to work as one
of "single header for everyhing" creating big blot for many seemingly
unrelated .hh's.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-07-17 17:54:43 +03:00
Asias He
4d7faac350 repair: Add uuid to a repair job
Currently, repair uses an integer to identify a repair job. The repair
id starts from 1 since node restart. As a result, different repair jobs
will have same id across restart.

To make the id more unique across restart, we can use an uuid in
addition to the integer id. We can not drop the use of the integer id
completely since the http api and nodetool use it.

Fixes #6786
2020-07-16 11:03:19 +03:00
Asias He
38d964352d repair: Relax node selection in bootstrap when nodes are less than RF
Consider a cluster with two nodes:

 - n1 (dc1)
 - n2 (dc2)

A third node is bootstrapped:

 - n3 (dc2)

The n3 fails to bootstrap as follows:

 [shard 0] init - Startup failed: std::runtime_error
 (bootstrap_with_repair: keyspace=system_distributed,
 range=(9183073555191895134, 9196226903124807343], no existing node in
 local dc)

The system_distributed keyspace is using SimpleStrategy with RF 3. For
the keyspace that does not use NetworkTopologyStrategy, we should not
require the source node to be in the same DC.

Fixes: #6744
Backports: 4.0 4.1, 4.2
2020-07-14 11:54:34 +02:00
Asias He
271fac56a3 repair: Add synchronous API to query repair status
This new api blocks until the repair job is either finished or failed or timeout.

E.g.,

- Without timeout
curl -X GET http://127.0.0.1:10000/storage_service/repair_status/?id=123

- With timeout
curl -X GET http://127.0.0.1:10000/storage_service/repair_status/?id=123&timeout=5

The timeout is in second.

The current asynchronous api returns immediately even if the repair is in progress.

E.g., curl -X GET http://127.0.0.1:10000/storage_service/repair_async/ks?id=123

User can use the new synchronous API to avoid keep sending the query to
poll if the repair job is finished.

Fixes #6445
2020-07-14 11:20:15 +03:00
Asias He
a00ab8688f repair: Relax size check of get_row_diff and set_diff
In case a row hash conflict, a hash in set_diff will get more than one
row from get_row_diff.

For example,

Node1 (Repair master):
row1  -> hash1
row2  -> hash2
row3  -> hash3
row3' -> hash3

Node2 (Repair follower):
row1  -> hash1
row2  -> hash2

We will have set_diff = {hash3} between node1 and node2, while
get_row_diff({hash3}) will return two rows: row3 and row3'. And the
error below was observed:

   repair - Got error in row level repair: std::runtime_error
   (row_diff.size() != set_diff.size())

In this case, node1 should send both row3 and row3' to peer node
instead of fail the whole repair. Because node2 does not have row3 or
row3', otherwise node1 won't send row with hash3 to node1 in the first
place.

Refs: #6252
2020-07-14 10:39:30 +03:00
Asias He
67f6da6466 repair: Switch to btree_set for repair_hash.
In one of the longevity tests, we observed 1.3s reactor stall which came from
repair_meta::get_full_row_hashes_source_op. It traced back to a call to
std::unordered_set::insert() which triggered big memory allocation and
reclaim.

I measured std::unordered_set, absl::flat_hash_set, absl::node_hash_set
and absl::btree_set. The absl::btree_set was the only one that seastar
oversized allocation checker did not warn in my tests where around 300K
repair hashes were inserted into the container.

- unordered_set:
hash_sets=295634, time=333029199 ns

- flat_hash_set:
hash_sets=295634, time=312484711 ns

- node_hash_set:
hash_sets=295634, time=346195835 ns

- btree_set:
hash_sets=295634, time=341379801 ns

The btree_set is a bit slower than unordered_set but it does not have
huge memory allocation. I do not measure real difference of total time
to finish repair of the same dataset with unordered_set and btree_set.

To fix, switch to absl btree_set container.

Fixes #6190
2020-07-09 11:35:18 +03:00
Asias He
0929a5e82b repair: Fix inaccurate exception message in check_failed_ranges
The reason for the failure can be other reasons than failure of
checksum.

Fixes #6785
2020-07-07 18:27:16 +03:00
Asias He
6e6e554944 repair: Use warn level for logs with recoverable failures
Those logs are not fatal and recoverable. We should make them warn level
instead of info level.

Fixes #5612
2020-07-07 18:27:16 +03:00
Asias He
7f3eb8b4e8 repair: Handle dropped table in repair_range
In commit 12d929a5ae (repair: Add table_id
to row_level_repair), a call to find_column_family() was added in
repair_range. In case of the table is dropped, it will fail the
repair_range which in turn fails the bootstrap operation.

Tests: update_cluster_layout_tests.py:TestUpdateClusterLayout.simple_add_new_node_while_schema_changes_test
Fixes: #5942
2020-07-01 12:13:14 +03:00
Rafael Ávila de Espíndola
3964b1a551 row level repair: Don't return a variadic future from get_sink_source 2020-06-29 16:51:41 -07:00
Rafael Ávila de Espíndola
eeee63a9a3 row level repair: Don't return a variadic future from read_rows_from_disk 2020-06-29 16:51:10 -07:00
Rafael Ávila de Espíndola
af44684418 messaging_service: Don't return variadic futures from make_sink_and_source_for_* 2020-06-29 16:50:45 -07:00
Botond Dénes
fbbc86e18c repair: row_level: destroy reader on EOS or error
To avoid having to make it an optional with all the additional checks,
we just replace it with an empty reader instead, this also also achieves
the desired effect of releasing the read permit and all the associated
resources early.
2020-06-23 21:08:21 +03:00
Botond Dénes
080f00b99a repair: row_level: use evictable_reader for local reads
Row level repair, when using a local reader, is prone to deadlocking on
the streaming reader concurrency semaphore. This has been observed to
happen with at least two participating nodes, running more concurrent
repairs than the maximum allowed amount of reads by the concurrency
semaphore. In this situation, it is possible that two repair instances,
competing for the last available permits on both nodes, get a permit on
one of the nodes and get queued on the other one respectively. As
neither will let go of the permit it already acquired, nor give up
waiting on the failed-to-acquired permit, a deadlock happens.

To prevent this, we make the local repair reader evictable. For this we
reuse the newly exposed evictable reader.
The repair reader is paused after the repair buffer is filled, which is
currently 32MB, so the cost of a possible reader recreation is amortized
over 32MB read.

The repair reader is said to be local, when it can use the shard-local
partitioner. This is the case if the participating nodes are homogenous
(their shard configuration is identical), that is the repair instance
has to read just from one shard. A non-local reader uses the multishard
reader, which already makes its shard readers evictable and hence is not
prone to the deadlock described here.
2020-06-23 21:08:21 +03:00
Avi Kivity
de38091827 priority_manager: merge streaming_read and streaming_write classes into one class
Streaming is handled by just once group for CPU scheduling, so
separating it into read and write classes for I/O is artificial, and
inflates the resources we allow for streaming if both reads and writes
happen at the same time.

Merge both classes into one class ("streaming") and adjust callers. The
merged class has 200 shares, so it reduces streaming bandwidth if both
directions are active at the same time (which is rare; I think it only
happens in view building).
2020-06-22 15:09:04 +03:00
Rafael Ávila de Espíndola
f6e407ecd2 everywhere: Prepare for seastar api v4 (when_all_succeed return value)
The seastar api v4 changes the return type of when_all_succeed. This
patch adds discard_result when that is best solution to handle the
change.

This doesn't do the actual update to v4 since there are still a few
issues left to fix in seastar. A patch doing just the update will
follow.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200617233150.918110-1-espindola@scylladb.com>
2020-06-18 15:13:56 +03:00
Asias He
61e4387811 repair: Relax node selection in decommission for non network topology strategy
In decommission operation, current code requires a node in local dc to
sync data with. This requirement is too strong for a non network topology
strategy. For example, consider:

   n1 dc1
   n2 dc1
   n3 dc2

n2 runs decommission operation. For a keyspace with simple strategy and
RF = 2, it is possible n3 is the new owner but n3 is not in the same dc
as n2.

To fix, perform the dc check only for the network topology strategy.

Fixes #6564
2020-06-15 11:26:02 +03:00
Avi Kivity
08313106ce Merge 'Repair use table id instead of table name' from Asias
"
Use table_id instead of table_name in row level repair to find a table. It
guarantees we repair the same table even if a table is dropped and a new
table is created with the same name.

Refs: #5942
"

* asias-repair_use_table_id_instead_of_table_name:
  repair: Do not pass table names to repair_info
  repair: Add table_id to row_level_repair
  repair: Use table id to find a table in get_sharder_for_tables
  repair: Add table_ids to repair_info
  repair: Make func in tracker::run run inside a thread
2020-06-14 14:58:46 +03:00
Avi Kivity
9afd599d7c Merge 'range_streamer: Handle table of RF 1 in get_range_fetch_map' from Asias
"
After "Make replacing node take writes" series, with repair based node
operations disabled, we saw the replace operation fail like:

```
[shard 0] init - Startup failed: std::runtime_error (unable to find
sufficient sources for streaming range (9203926935651910749, +inf) in
keyspace system_auth)
```
The reason is the system_auth keyspace has default RF of 1. It is
impossible to find a source node to stream from for the ranges owned by
the replaced node.

In the past, the replace operation with keyspace of RF 1 passes, because
the replacing node calls token_metadata.update_normal_tokens(tokens,
ip_of_replacing_node) before streaming. We saw:

```
[shard 0] range_streamer - Bootstrap : keyspace system_auth range
(-9021954492552185543, -9016289150131785593] exists on {127.0.0.6}
```

Node 127.0.0.6 is the replacing node 127.0.0.5. The source node check in
range_streamer::get_range_fetch_map will pass if the source is the node
itself. However, it will not stream from the node itself. As a result,
the system_auth keyspace will not get any data.

After the "Make replacing node take writes" series, the replacing node
calls token_metadata.update_normal_tokens(tokens, ip_of_replacing_node)
after the streaming finishes. We saw:

```
[shard 0] range_streamer - Bootstrap : keyspace system_auth range
(-9049647518073030406, -9048297455405660225] exists on {127.0.0.5}
```

Since 127.0.0.5 was dead, the source node check failed, so the bootstrap
operation.

Ta fix, we ignore the table of RF 1 when it is unable to find a source
node to stream.

Fixes #6351
"

* asias-fix_bootstrap_with_rf_one_in_range_streamer:
  range_streamer: Handle table of RF 1 in get_range_fetch_map
  streaming: Use separate streaming reason for replace operation
2020-06-10 16:03:13 +03:00
Asias He
6c89cedf0a repair: Do not pass table names to repair_info
Get the table names from the table ids instead which prevents the user
of repair_info class provides inconsistent table names and table ids.

Refs: #5942
2020-06-01 17:44:05 +08:00
Asias He
12d929a5ae repair: Add table_id to row_level_repair
Now that repair_info has tables id for the tables we want to repair. Use
table_id instead of table_name in row level repair to find a table. It
guarantees we repair the same table even if a table is dropped and a new
table is created with the same name.

Refs: #5942
2020-06-01 17:34:25 +08:00
Asias He
7ea8bf648d repair: Use table id to find a table in get_sharder_for_tables
We are moving to use the table id instead of table name to get a table
in repair. It guarantees the same table is repaired.

Refs: #5942
2020-06-01 17:34:25 +08:00
Asias He
378e31b409 repair: Add table_ids to repair_info
A helper get_table_ids is added to convert the table names to table ids.
We convert it once and use the same table ids for the whole repair
operations. This guarantees we repair the same table during the same
repair request.

Refs: #5942
2020-06-01 17:34:25 +08:00
Asias He
ad878a56eb repair: Make func in tracker::run run inside a thread
It simplify the code in func and makes it easier to write loop that does
not stall.

Refs: #5942
2020-06-01 17:34:16 +08:00
Asias He
c02fea5f04 repair: Ignore table removed in sync_data_using_repair
Commit 75cf255c67 (repair: Ignore keyspace
that is removed in sync_data_using_repair) is not enough to fix the
issue because when the repair master checks if the table is dropped, the
table might not be dropped yet on the repair master.

To fix, the repair master should check if the follower failed the repair
because the table is dropped by checking the error returned from
follower.

With this patch, we would see

WARN  2020-04-14 11:19:00,417 [shard 0] repair - repair id 1 on shard 0
completed successfully, keyspace=ks, ignoring dropped tables={cf}

when the table is dropped during bootstrap.

Tests: update_cluster_layout_tests.py:TestUpdateClusterLayout.simple_add_new_node_while_schema_changes_test

Fixes: #5942
2020-05-24 13:39:59 +03:00
Asias He
fa9ee234a0 streaming: Use separate streaming reason for replace operation
Currently, replace and bootstrap share the same streaming reason,
stream_reason::bootstrap, because they share most of the code
in boot_strapper.

In order to distinguish the two, we need to introduce a new stream
reason, stream_reason::replace. It is safe to do so in a mixed cluster
because current code only check if the stream_reason is
stream_reason::repair.

Refs: #6351
2020-05-22 09:30:52 +08:00
Botond Dénes
c29ccdea7e repair: switch from queue+generating_reader to queue_reader
The queue_reader was inveted exactly to replace this construct and is
more efficient than it.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200520155618.369873-1-bdenes@scylladb.com>
2020-05-20 19:33:28 +03:00
Rafael Ávila de Espíndola
311fbe2f0a repair: Make sure sinks are always closed
In a recent next failure I got the following backtrace

#3  0x00007efd71251a66 in __GI___assert_fail (assertion=assertion@entry=0x2d0c00 "this->_con->get()->sink_closed()", file=file@entry=0x32c9d0 "./seastar/include/seastar/rpc/rpc_impl.hh", line=line@entry=795,
    function=function@entry=0x270360 "seastar::rpc::sink_impl<Serializer, Out>::~sink_impl() [with Serializer = netw::serializer; Out = {repair_row_on_wire_with_cmd}]") at assert.c:101
#4  0x0000000001f5d2c3 in seastar::rpc::sink_impl<netw::serializer, repair_row_on_wire_with_cmd>::~sink_impl (this=<optimized out>, __in_chrg=<optimized out>) at ./seastar/include/seastar/core/future.hh:312
#5  0x0000000001f5d2f4 in seastar::shared_ptr_count_for<seastar::rpc::sink_impl<netw::serializer, repair_row_on_wire_with_cmd> >::~shared_ptr_count_for (this=0x60100075b680, __in_chrg=<optimized out>)
    at ./seastar/include/seastar/core/shared_ptr.hh:463
#6  seastar::shared_ptr_count_for<seastar::rpc::sink_impl<netw::serializer, repair_row_on_wire_with_cmd> >::~shared_ptr_count_for (this=0x60100075b680, __in_chrg=<optimized out>) at ./seastar/include/seastar/core/shared_ptr.hh:463
#7  0x000000000240f2e6 in seastar::shared_ptr<seastar::rpc::sink<repair_row_on_wire_with_cmd>::impl>::~shared_ptr (this=0x601003118590, __in_chrg=<optimized out>) at ./seastar/include/seastar/core/future.hh:427
#8  seastar::rpc::sink<repair_row_on_wire_with_cmd>::~sink (this=0x601003118590, __in_chrg=<optimized out>) at ./seastar/include/seastar/rpc/rpc_types.hh:270
#9  <lambda(auto:134&)>::<lambda(const seastar::rpc::client_info&, uint64_t, seastar::rpc::source<repair_hash_with_cmd>)>::<lambda(std::__exception_ptr::exception_ptr)>::~<lambda> (this=0x601003118570, __in_chrg=<optimized out>)
    at repair/row_level.cc:2059

This patch changes a few functions to use finally to make sure the sink
is always closed.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200515202803.60020-1-espindola@scylladb.com>
2020-05-18 08:13:42 +03:00
Asias He
b2c4d9fdbc repair: Fix race between write_end_of_stream and apply_rows
Consider: n1, n2, n1 is the repair master, n2 is the repair follower.

=== Case 1 ===
1) n1 sends missing rows {r1, r2} to n2
2) n2 runs apply_rows_on_follower to apply rows, e.g., {r1, r2}, r1
   is written to sstable, r2 is not written yet, r1 belongs to
   partition 1, r2 belongs to partition 2. It yields after row r1 is
   written.
   data: partition_start, r1
3) n1 sends repair_row_level_stop to n2 because error has happened on n1
4) n2 calls wait_for_writer_done() which in turn calls write_end_of_stream()
   data: partition_start, r1, partition_end
5) Step 2 resumes to apply the rows.
   data: partition_start, r1, partition_end, partition_end, partition_start, r2

=== Case 2 ===
1) n1 sends missing rows {r1, r2} to n2
2) n2 runs apply_rows_on_follower to apply rows, e.g., {r1, r2}, r1
   is written to sstable, r2 is not written yet, r1 belongs to partition
   1, r2 belongs to partition 2. It yields after partition_start for r2
   is written but before _partition_opened is set to true.
   data: partition_start, r1, partition_end, partition_start
3) n1 sends repair_row_level_stop to n2 because error has happened on n1
4) n2 calls wait_for_writer_done() which in turn calls write_end_of_stream().
   Since _partition_opened[node_idx] is false, partition_end is skipped,
   end_of_stream is written.
   data: partition_start, r1, partition_end, partition_start, end_of_stream

This causes unbalanced partition_start and partition_end in the stream
written to sstables.

To fix, serialize the write_end_of_stream and apply_rows with a semaphore.

Fixes: #6394
Fixes: #6296
Fixes: #6414
2020-05-14 18:15:01 +03:00
Asias He
b744dba75a repair: Abort the queue in write_end_of_stream in case of error
In write_end_of_stream, it does:

1) Write write_partition_end
2) Write empty mutation_fragment_opt

If 1) fails, 2) will be skipped, the consumer of the queue will wait for
the empty mutation_fragment_opt forever.

Found this issue when injecting random exceptions between 1) and 2).

Refs #6272
Refs #6248
2020-05-12 10:50:52 +02:00
Tomasz Grabiec
d78fbf7c16 Merge "storage_service: Make replacing node take writes" from Asias
Background:

Replace operation is used to replace a dead node in the cluster.
Currently during replace operation, the replacing node does not take any
writes. As a result, new writes to a range after the sync for that range
is done, e.g., after streaming for that range is finished, will not be
synced to the replacing node. Hinted hand off or repair after the
replacing operation will help. But it is better if we can make the
writes to the replacing node to avoid any post replacing operation
actions.

After this series and repair based node operation series, the replace
operation will guarantee the replacing node has all the latest copy of
data including the new writes during the replace operation. In short, no
more repairs before or after the replacing operation. Just replacing the
node is enough.

Implementation:

Filter the node being replaced out of the natural endpoints in
storage_proxy, so that:
The node being replaced will not be selected as the target for
normal write or normal read.

Do not depend on the gossip liveness to avoid selecting replacing node
for normal write or normal read when the replacing node has the same
ip address as the node being replaced. No more special handling for
hibernate state in gossip which makes it is simpler and more robust.
Replacing node will be marked as UP.

Put the replacing node in the pending list, so that:
Replacing node will take writes but write to replacing will not be
counted as CL.

Replacing node will not take normal read.

Example:

For example, with RF = 3, n1, n2, n3 in the cluster, n3 is dead and
being replaced by node n4. When n4 starts:

writes to nodes {n1, n2, n3} are changed to
normal_replica_writes = {n1, n2} and pending_replica_writes= {n4}.

reads to nodes {n1, n2, n3} are changed to
normal_replica_reads = {n1, n2} only.

This way, the replacing node n4 now takes writes but does not take reads.

Tests:

Measure the number of writes during pending period that is the
replacing starts and finishes the replace operation.
Start 5 nodes, n1 to n5.
Stop n5
Start write in the background
Start n6 to replace n5
Get scylla_database_total_writes metrics when the replacing node announces HIBERNATE (replacing) and NORMAL status.
Before:
2020-02-06 08:35:35.921837 Get metrics when other knows replacing node = HIBERNATE
2020-02-06 08:35:35.939493 scylla_database_total_writes: node1={'scylla_database_total_writes': 15483}
2020-02-06 08:35:35.950614 scylla_database_total_writes: node2={'scylla_database_total_writes': 15857}
2020-02-06 08:35:35.961820 scylla_database_total_writes: node3={'scylla_database_total_writes': 16195}
2020-02-06 08:35:35.978427 scylla_database_total_writes: node4={'scylla_database_total_writes': 15764}
2020-02-06 08:35:35.992580 scylla_database_total_writes: node6={'scylla_database_total_writes': 331}
2020-02-06 08:36:49.794790 Get metrics when other knows replacing node = NORMAL
2020-02-06 08:36:49.809189 scylla_database_total_writes: node1={'scylla_database_total_writes': 267088}
2020-02-06 08:36:49.823302 scylla_database_total_writes: node2={'scylla_database_total_writes': 272352}
2020-02-06 08:36:49.837228 scylla_database_total_writes: node3={'scylla_database_total_writes': 274004}
2020-02-06 08:36:49.851104 scylla_database_total_writes: node4={'scylla_database_total_writes': 262972}
2020-02-06 08:36:49.862504 scylla_database_total_writes: node6={'scylla_database_total_writes': 513}

Writes = 513 - 331

After:
2020-02-06 08:28:56.548047 Get metrics when other knows replacing node = HIBERNATE
2020-02-06 08:28:56.560813 scylla_database_total_writes: node1={'scylla_database_total_writes': 290886}
2020-02-06 08:28:56.573925 scylla_database_total_writes: node2={'scylla_database_total_writes': 310304}
2020-02-06 08:28:56.586305 scylla_database_total_writes: node3={'scylla_database_total_writes': 304049}
2020-02-06 08:28:56.601464 scylla_database_total_writes: node4={'scylla_database_total_writes': 303770}
2020-02-06 08:28:56.615066 scylla_database_total_writes: node6={'scylla_database_total_writes': 604}
2020-02-06 08:29:10.537016 Get metrics when other knows replacing node = NORMAL
2020-02-06 08:29:10.553257 scylla_database_total_writes: node1={'scylla_database_total_writes': 336126}
2020-02-06 08:29:10.567181 scylla_database_total_writes: node2={'scylla_database_total_writes': 358549}
2020-02-06 08:29:10.581939 scylla_database_total_writes: node3={'scylla_database_total_writes': 351416}
2020-02-06 08:29:10.595567 scylla_database_total_writes: node4={'scylla_database_total_writes': 350580}
2020-02-06 08:29:10.610548 scylla_database_total_writes: node6={'scylla_database_total_writes': 45460}

Writes = 45460 - 604

As we can see the replacing node did not take write before and take write after the patch.

Check log of writer handler in storage_proxy
storage_proxy - creating write handler for token: -2642068240672386521,
keyspace_name=ks, original_natrual={127.0.0.1, 127.0.0.5, 127.0.0.2},
natural={127.0.0.1, 127.0.0.2}, pending={127.0.0.6}

The node being replaced, n5=127.0.0.5, is filtered out and the replacing
node, n6=127.0.0.6 is in the pending list.

* asias/replace_take_writes:
  storage_service: Make replacing node take writes
  repair: Use token_metadata with the replacing node in do_rebuild_replace_with_repair
  abstract_replication_strategy: Add get_ranges which takes token_metadata
  abstract_replication_strategy: Add get_natural_endpoints_without_node_being_replaced
  abstract_replication_strategy: Add allow_remove_node_being_replaced_from_natural_endpoints
  token_metadata: Calculate pending ranges for replacing node
  storage_service: Unify handling of replaced node removal from gossip
  storage_service: Update tokens and replace address for replace operation
2020-04-30 19:28:35 +02:00
Avi Kivity
8925e00e96 Merge 'Fix hang in multishard_writer' from Asias
"
This series fix hang in multishard_writer when error happens. It contains
- multishard_writer: Abort the queue attached to consumers when producer fails
- repair: Fix hang when the writer is dead

Fixes #6241
Refs: #6248
"

* asias-stream_fix_multishard_writer_hang:
  repair: Fix hang when the writer is dead
  mutation_writer_test: Add test_multishard_writer_producer_aborts
  multishard_writer: Abort the queue attached to consumers when producer fails
2020-04-30 12:27:55 +03:00
Asias He
e3fbc8fba1 repair: Use token_metadata with the replacing node in do_rebuild_replace_with_repair
We will change the update of tokens in token_metadata in the next patch
so that the tokens of the replacing node are updated to token_metadata
only after the replace operation is done. In order to get the correct
ranges for the replacing node in do_rebuild_replace_with_repair, we need
to use a copy of token_metadata contains the tokens of the replacing
node.

Refs: #5482
2020-04-30 10:22:30 +08:00
Asias He
35c5ef78b9 repair: Fix hang when the writer is dead
Consdier:

When repair master gets data from repair follower:

1) apply_rows_on_master_in_thread is called
2) a repair writer is created with _repair_writer.create_writer
3) the repair writer fails
4) data is written to the queue _mq[node_idx]->push_eventually attached
   with the writer

Since the writer is dead. No one is going to fetch data from the _mq
queue. The apply_rows_on_master_in_thread will block forever.

To fix, when the writer is failed, we should abort the _mq queue.

Refs: #6248
2020-04-28 12:14:32 +08:00
Asias He
13a9c5eaf7 repair: Send reason for node operations
Since 956b092012 (Merge "Repair based node
operation" from Asias), repair is used by other node operations like
bootstrap, decommission and so on.

Send the reason for the repair, so that we can handle the materialized
view update correctly according to the reason of the operation. We want
to trigger the view update only if the repair is used by repair
operation. Otherwise, the view table will be handled twice, 1) when the
view table is synced using repair 2) when the base table is synced using
repair and view table update is triggered.

Fixes #5930
Fixes #5998
2020-04-13 13:47:26 +03:00
Piotr Jastrzebski
e72696a8e6 sharding_info: rename the class to sharder
Also rename all variables that were named si or sinfo
to sharder.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-03-30 18:42:33 +02:00
Piotr Jastrzebski
a3262a2cb2 repair: depend only on sharding logic not on partitioner
repair does not use partitioner and only uses sharding logic.
This means it does not have to depend on i_partitioner and can
instead operate on sharding_info.

This has an important consequence of allowing the repair of
multiple tables having different partitioners at the same time.

All tables repaired together still have to use the same
sharding logic.

To achieve this the change:
1. Removes partitioner field from repair_info
2. repair_info has access to sharding_info through schema
   objects of repaired tables
3. partitioner name is removed from shard_config
4. local and remote partitioners are removed from repair_meta.
   Remote sharding_info is used instead.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-03-30 09:37:48 +02:00
Piotr Jastrzebski
94ff653b99 selective_token_range_sharder: replace i_partitioner with sharding_info
The class does not depend on partitioning logic but only uses
sharding logic. This means it is possible and desirable to limit its
dependency to only sharding_info.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-03-30 09:36:22 +02:00
Rafael Ávila de Espíndola
c5795e8199 everywhere: Replace engine().cpu_id() with this_shard_id()
This is a bit simpler and might allow removing a few includes of
reactor.hh.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200326194656.74041-1-espindola@scylladb.com>
2020-03-27 11:40:03 +03:00
Avi Kivity
0d885dbb00 Merge "Make all headers standalone" from Botond
"
Make sure all headers compile on their own, without requiring any
additional includes externally.

Even though this requirement is not documented in our coding guides it
is still quasi enforced and we semi-regularly get and merge patches
adding missing includes to headers.

This patch-set fixes all headers and adds a `{mode}-headers` target that
can be used to verify each header. This target should be built by
promotion to ensure no new non-conforming code sneaks in.
Individual headers can be verified using the
`build/dev/path/to/header.hh.o` target, that is generated for every
header.

The majority of the headers was just missing `seastarx.hh`. I think we
should just include this via a compiler flag to remove the noise from
our code (in a followup).
"

* 'compiling-headers/v2' of https://github.com/denesb/scylla:
  configure.py: add {mode}-headers phony target
  treewide: add missing headers and/or forward declarations
  test/boost/sstable_test.hh: move generic stuff to test/lib/sstable_utils.hh
  sstables: size_tiered_backlog_tracker: move methods out-of-line
  sstables: date_tiered_compaction_strategy.hh: move methods out-of-line
2020-03-23 13:09:09 +02:00
Botond Dénes
e0284bb9ee treewide: add missing headers and/or forward declarations 2020-03-23 09:29:45 +02:00
Asias He
be1a196988 repair: Handle keyspace with zero table
The following error was seen in
materialized_views_test.py:TestMaterializedViews.decommission_node_during_mv_insert_4_nodes_test

INFO [shard 0] repair - repair id 3 to sync data for
keyspace=ks, status=started repair/repair.cc:662:36: runtime error: member call
on null pointer of type 'const struct schema'
Aborting on shard 0.

The problem is in the test a keyspace was created without creating any
table. Since db19a76b1f(selective_token_range_sharder: stop calling
global_partitioner()), in get_partitioner_for_tables, we access nullptr
when no table is present.

	schema_ptr last_s;
	for (auto t: tables) {
	    // set last_s
	}
	last_s->get_partitione()

To fix:

1) Skip the repair in sync_data_using_repair if there is no table in the keyspace
2) Throw if no schema_ptr is found in get_partitioner_for_tables. Be defensive.

After:

INFO [shard 0] repair - decommission_with_repair: started with keyspace=ks, leaving_node=127.0.0.2, nr_ranges=744
INFO [shard 0] repair - repair id 3 to sync data for keyspace=ks, status=started
WARN [shard 0] repair - repair id 3 to sync data for keyspace=ks, no table in this keyspace
INFO [shard 0] repair - repair id 3 completed successfully
INFO [shard 0] repair - repair id 3 to sync data for keyspace=ks, status=succeeded

Tests: materialized_views_test.py:TestMaterializedViews.decommission_node_during_mv_insert_4_nodes_test
Fixes: #6022
2020-03-22 13:46:36 +02:00
Avi Kivity
d310e7c7ea Merge 'repair: Ignore keyspace that is removed in sync_data_using_repair' from Asias
repair: Ignore keyspace that is removed in sync_data_using_repair

When a keyspace is removed during node operations, we should not fail
the whole operation. Ignore the keyspace that is removed.

Fixes #5942

* asias-repair_fix_5942:
  repair: Stop the nodes that have run repair_row_level_start
  repair: Ignore keyspace that is removed in sync_data_using_repair
2020-03-22 13:19:51 +02:00
Piotr Jastrzebski
924ed7bb1c make_multishard_combining_reader: stop taking partitioner
The function already takes schema so there's no need
for it to take partitioner. It can be obtained using
schema::get_partitioner

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-03-15 10:25:20 +01:00
Asias He
6a7c3f0af0 repair: Stop the nodes that have run repair_row_level_start
It is ok to run repair_row_level_stop unconditionally. The node that
hasn't received the repair_row_level_start will simply return an error
that the repair_meta_id is not found. To avoid the unnecessary
repair_row_level_stop verb, we can stop the nodes have run
repair_row_level_start. This also makes the error message less
confusing.

For example:

Before:

INFO 2020-03-09 15:55:43,369 [shard 0] repair - repair id 1 on shard 0
     failed: std::runtime_error (get_repair_meta: repair_meta_id 8 for
     node 127.0.0.4 does not exist)
INFO 2020-03-09 15:55:43,369 [shard 0] repair - repair id 1
     failed: std::runtime_error ({shard 0: std::runtime_error
     (get_repair_meta: repair_meta_id 8 for node 127.0.0.4 does not
     exist)})
WARN 2020-03-09 15:55:43,369 [shard 0] repair - repair id 1 to
     sync data for keyspace=ks, status=failed, keyspace does not exist
     any more, ignoring it: std::runtime_error ({shard 0:
     std::runtime_error (get_repair_meta: repair_meta_id 8 for node
     127.0.0.4 does not exist)})

After:

INFO 2020-03-09 16:09:09,217 [shard 0] repair - repair id 1 on shard 0 failed:
     std::runtime_error (Failed to repair for keyspace=ks, cf=cf,
     range=(9041860168177642466, 9044815446631222376])
INFO 2020-03-09 16:09:09,217 [shard 0] repair - repair id 1 failed:
     std::runtime_error ({shard 0: std::runtime_error (Failed to repair
     for keyspace=ks, cf=cf, range=(9041860168177642466,
     9044815446631222376])})
WARN 2020-03-09 16:09:09,217 [shard 0] repair - repair id 1 to sync data
     for keyspace=ks, status=failed, keyspace does not exist any more,
     ignoring it: std::runtime_error ({shard 0: std::runtime_error
     (Failed to repair for keyspace=ks, cf=cf,
     range=(9041860168177642466, 9044815446631222376])})

Refs #5942
2020-03-09 18:24:02 +08:00