migration_manager: Fix nullptr dereference in maybe_schedule_schema_pull

Commit 976324bbb8 changed to use
get_application_state_ptr to get a pointer of the application_state. It
may return nullptr that is dereferenced unconditionally.

In resharding_test.py:ReshardingTest_nodes4_with_SizeTieredCompactionStrategy.resharding_by_smp_increase_test, we saw:

   4 nodes in the tests

   n1, n2, n3, n4 are started

   n1 is stopped

   n1 is changed to use different shard config

   n1 is restarted ( 2019-01-27 04:56:00,377 )

The backtrace happened on n2 right fater n1 restarts:

   0 INFO 2019-01-27 04:56:05,175 [shard 0] gossip - Feature STREAM_WITH_RPC_STREAM is enabled
   1 INFO 2019-01-27 04:56:05,175 [shard 0] gossip - Feature WRITE_FAILURE_REPLY is enabled
   2 INFO 2019-01-27 04:56:05,175 [shard 0] gossip - Feature XXHASH is enabled
   3 WARN 2019-01-27 04:56:05,177 [shard 0] gossip - Fail to send EchoMessage to 127.0.58.1: seastar::rpc::closed_error (connection is closed)
   4 INFO 2019-01-27 04:56:05,205 [shard 0] gossip - InetAddress 127.0.58.1 is now UP, status =
   5 Segmentation fault on shard 0.
   6 Backtrace:
   7 0x00000000041c0782
   8 0x00000000040d9a8c
   9 0x00000000040d9d35
   10 0x00000000040d9d83
   11 /lib64/libpthread.so.0+0x00000000000121af
   12 0x0000000001a8ac0e
   13 0x00000000040ba39e
   14 0x00000000040ba561
   15 0x000000000418c247
   16 0x0000000004265437
   17 0x000000000054766e
   18 /lib64/libc.so.6+0x0000000000020f29
   19 0x00000000005b17d9

We do not know when this backtrace happened, but according to log from n3 an n4:

   INFO 2019-01-27 04:56:22,154 [shard 0] gossip - InetAddress 127.0.58.2 is now DOWN, status = NORMAL
   INFO 2019-01-27 04:56:21,594 [shard 0] gossip - InetAddress 127.0.58.2 is now DOWN, status = NORMAL

We can be sure the backtrace on n2 happened before 04:56:21 - 19 seconds (the
delay the gossip notice a peer is down), so the abort time is around 04:56:0X.
The migration_manager::maybe_schedule_schema_pull that triggers the backtrace
must be scheduled before n1 is restarted, because it dereference
application_state pointer after it sleeps 60 seconds, so the time
maybe_schedule_schema_pull is called is around 04:55:0X which is before n1 is
restarted.

So my theory is: migration_manager::maybe_schedule_schema_pull is scheduled, at this time
n1 has SCHEMA application_state, when n1 restarts, n2 gets new application
state from n1 which does not have SCHEMA yet, when migration_manager::maybe_schedule
wakes up from the 60 sleep, n1 has non-empty endpoint_state but empty
application_state for SCHEMA. We dereference the nullptr
application_state and abort.

Fixes: #4148
Tests: resharding_test.py:ReshardingTest_nodes4_with_SizeTieredCompactionStrategy.resharding_by_smp_increase_test
Message-Id: <9ef33277483ae193a49c5f441486ee6e045d766b.1548896554.git.asias@scylladb.com>
(cherry picked from commit 28d6d117d2)
This commit is contained in:
Asias He
2019-01-31 09:02:51 +08:00
committed by Pekka Enberg
parent 5174b1cd13
commit da80f27f44

View File

@@ -204,6 +204,10 @@ future<> migration_manager::maybe_schedule_schema_pull(const utils::UUID& their_
return make_ready_future<>();
}
const auto* value = ep_state->get_application_state_ptr(gms::application_state::SCHEMA);
if (!value) {
mlogger.debug("application_state::SCHEMA does not exist for {}, not submitting migration task", endpoint);
return make_ready_future<>();
}
utils::UUID current_version{value->value};
auto& db = proxy.get_db().local();
if (db.get_version() == current_version) {