When starting a cluster on AWS, the seed node might become ready after
the non-seed nodes are ready to contact it. Wait longer for the seed node
to make the boot-up process more robust.
Right now, gossip returns hard-coded cluster and partitioner names:
    sstring get_cluster_name() {
        // FIXME: DatabaseDescriptor.getClusterName()
        return "my_cluster_name";
    }
    sstring get_partitioner_name() {
        // FIXME: DatabaseDescriptor.getPartitionerName()
        return "my_partitioner_name";
    }
Fix it by setting the correct names from the configuration options.
With this,
cqlsh 127.0.0.$i -e "SELECT * from system.local;"
returns the correct cluster_name.
Fixes #291
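A minimal sketch of the idea, not the actual patch: the names are stored
on the object and set once from configuration instead of being returned as
literals. The class name gossiper_names_sketch and the setters are
hypothetical, and std::string stands in for sstring.

```cpp
#include <string>
#include <utility>

// Hypothetical stand-in for the part of the gossiper that reports names.
// The values are injected at startup rather than hard-coded.
class gossiper_names_sketch {
    std::string _cluster_name;
    std::string _partitioner_name;
public:
    // Called once at startup with the values read from configuration.
    void set_cluster_name(std::string name) { _cluster_name = std::move(name); }
    void set_partitioner_name(std::string name) { _partitioner_name = std::move(name); }

    // Getters now return whatever was configured.
    const std::string& get_cluster_name() const { return _cluster_name; }
    const std::string& get_partitioner_name() const { return _partitioner_name; }
};
```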
failure_detector::{interpret, force_conviction} call into the convict
callback, which might start an async operation. Protect it with a ref count.
Fixes #269
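The ref-count protection can be pictured roughly as below. This is a
sketch under assumptions, not the code from the patch: op_tracker and its
guard are hypothetical names, and the real code runs on seastar futures
rather than plain scopes.

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical reference tracker: counts operations in flight and refuses
// new ones once shutdown has started.
class op_tracker {
    std::size_t _in_flight = 0;
    bool _closed = false;
public:
    // RAII guard: holds one reference for the lifetime of an operation.
    class guard {
        op_tracker& _t;
    public:
        explicit guard(op_tracker& t) : _t(t) {
            if (_t._closed) {
                throw std::runtime_error("tracker closed");
            }
            ++_t._in_flight;
        }
        ~guard() { --_t._in_flight; }
        guard(const guard&) = delete;
        guard& operator=(const guard&) = delete;
    };

    // Stop accepting new operations; existing ones may still be running.
    void close() { _closed = true; }
    // The protected object may only go away once this returns true.
    bool can_destroy() const { return _closed && _in_flight == 0; }
    std::size_t in_flight() const { return _in_flight; }
};
```

A callback such as convict would take a guard before starting its async
work and release it when the work completes, so the owner is not destroyed
underneath it.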
We have this call chain:
gossiper::run -> do_status_check -> interpret -> convict -> mark_dead
Since gossiper::run is executed inside a seastar thread, we can be sure
that all the functions above run inside a seastar thread.
There are three places where async operations can be scheduled:
- gossiper timer handler
- API called by user
- messaging service handler
Use the reference tracking infrastructure to protect them.
Fixes #268
We are printing out error messages when a remote connection is closed:
ERROR [shard 0] gossip - Fail to send GossipDigestACK2 to 127.0.0.2:0: rpc::closed_error (connection is closed)
ERROR [shard 0] gossip - Fail to handle GOSSIP_DIGEST_ACK: rpc::closed_error (connection is closed)
WARN [shard 0] unimplemented
This is causing issues with DTEST, as it validates after finishing a run
that there are no ERRORs in the log.
The rule is:
If we can handle the error correctly -> log warn
If we cannot handle the error correctly -> log error
Fixes #144
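The rule can be written down as a tiny helper. This is only an
illustration: log_level, level_for and format_line are hypothetical names
and do not correspond to the real logger API.

```cpp
#include <string>

enum class log_level { warn, error };

// Hypothetical helper encoding the rule:
// error handled correctly -> warn, otherwise -> error.
log_level level_for(bool handled_correctly) {
    return handled_correctly ? log_level::warn : log_level::error;
}

// Formats a line in a simplified form of the log output quoted above.
std::string format_line(log_level lvl, const std::string& msg) {
    const std::string prefix = (lvl == log_level::warn) ? "WARN" : "ERROR";
    return prefix + " gossip - " + msg;
}
```

A closed remote connection is something gossip recovers from on the next
round, so under this rule it would be reported as WARN and no longer trip
the DTEST log check.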
So that do_before_change_notifications and do_on_change_notifications
are under seastar::async.
Now, before_change callbacks are inside seastar::async context.
It is easier to futurize apply_new_states and handle_major_state_change.
Now, on_change, on_join and on_restart callbacks are inside
seastar::async context.
It is not correct to use _scheduled_gossip_task.armed() to tell whether
gossip is enabled, since the timer sets _armed = false before calling
the timer callback.
It was working correctly because we did not actually check the is_enabled()
flag inside the timer callback but inside send_gossip_digest_syn()'s
continuation, and by that time the timer is armed again.
Use a standalone flag instead.
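The problem can be reproduced with a toy model. toy_timer below only
mimics the behaviour described above (the armed state is cleared before
the callback runs); it is not the seastar timer, and gossip_toggle_sketch
is a hypothetical stand-in for the gossiper.

```cpp
#include <functional>
#include <utility>

// Toy timer: clears its armed state before invoking the callback.
class toy_timer {
    bool _armed = false;
    std::function<void()> _cb;
public:
    void set_callback(std::function<void()> cb) { _cb = std::move(cb); }
    void arm() { _armed = true; }
    bool armed() const { return _armed; }
    void fire() {
        _armed = false;      // cleared first ...
        if (_cb) {
            _cb();           // ... then the callback is invoked
        }
    }
};

class gossip_toggle_sketch {
    toy_timer _scheduled_gossip_task;
    bool _enabled = false;   // standalone flag, independent of the timer
public:
    // What the callback observed on its last run.
    bool seen_armed_in_callback = true;
    bool seen_enabled_in_callback = false;

    gossip_toggle_sketch() {
        _scheduled_gossip_task.set_callback([this] {
            seen_armed_in_callback = _scheduled_gossip_task.armed();
            seen_enabled_in_callback = is_enabled();
            if (is_enabled()) {
                _scheduled_gossip_task.arm();   // schedule the next round
            }
        });
    }
    // The callback captures this, so the object must not be copied.
    gossip_toggle_sketch(const gossip_toggle_sketch&) = delete;
    gossip_toggle_sketch& operator=(const gossip_toggle_sketch&) = delete;

    void start() { _enabled = true; _scheduled_gossip_task.arm(); }
    void stop() { _enabled = false; }
    bool is_enabled() const { return _enabled; }
    bool timer_armed() const { return _scheduled_gossip_task.armed(); }
    void tick() { _scheduled_gossip_task.fire(); }
};
```

Inside the callback armed() reads false even though gossip is running,
while the standalone flag gives the right answer.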
We sleep for storage_service_ring_delay and then abort if we fail to
talk to a seed node. We should retry sending the GossipDigestSyn message
instead of sending it only once.
With this, we can start the seed node and a normal node in a script like
the one below, without any sleep in between:
./scylla --listen-address 127.0.0.1
./scylla --listen-address 127.0.0.2
This is useful for testing.
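The retry behaviour could look roughly like this. It is a blocking sketch
in standard C++ rather than the seastar-based code; contact_seed_with_retry,
send_syn and retry_interval are hypothetical names, and ring_delay plays
the role of storage_service_ring_delay.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper: keeps calling send_syn until it reports success or
// the overall ring_delay budget is used up. Returns true if the seed
// answered within the budget.
bool contact_seed_with_retry(const std::function<bool()>& send_syn,
                             std::chrono::milliseconds ring_delay,
                             std::chrono::milliseconds retry_interval) {
    using clock = std::chrono::steady_clock;
    const auto deadline = clock::now() + ring_delay;
    for (;;) {
        if (send_syn()) {
            return true;    // seed replied, stop retrying
        }
        if (clock::now() + retry_interval > deadline) {
            return false;   // budget exhausted, caller aborts as before
        }
        std::this_thread::sleep_for(retry_interval);
    }
}
```

With a single attempt, a non-seed node started a moment before the seed
would give up; with retries it succeeds as soon as the seed comes up within
the delay.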