Broken in f570e41d18.
Not replicating this may cause the coordinator to treat a node which is
down as alive, or vice versa.
Fixes a regression in dtest:
consistency_test.py:TestAvailability.test_simple_strategy
which expected an "unavailable" exception but was getting a
timeout.
Message-Id: <1510666967-1288-1-git-send-email-tgrabiec@scylladb.com>
storage_service depends on endpoint states to be replicated to all
shards before token metadata is replicated. Currently this is taken
care of by storage_service::replicate_to_all_cores(), invoked from
storage_service's change listener. It copies the whole endpoint state map,
which is expensive in large clusters. It's more efficient to replicate
only incremental changes, and only once, rather than for each
application state.
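As a rough illustration of the difference (plain C++ with stand-in types
and hypothetical names, not the actual sharded seastar code):

    #include <map>
    #include <string>
    #include <vector>

    using inet_address = std::string;                  // stand-in for the real type
    struct endpoint_state { std::string status; int version = 0; };
    using state_map = std::map<inet_address, endpoint_state>;

    // One copy of the state per shard (modelled here as plain containers).
    std::vector<state_map> per_shard_state(4);

    // Old approach: copy the entire map to every shard on any change.
    void replicate_full(const state_map& master) {
        for (auto& shard_copy : per_shard_state) {
            shard_copy = master;                       // whole-map copy, expensive in large clusters
        }
    }

    // New approach: push only the entry that actually changed, once.
    void replicate_delta(const inet_address& ep, const endpoint_state& es) {
        for (auto& shard_copy : per_shard_state) {
            shard_copy[ep] = es;                       // incremental update
        }
    }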
Makes state application faster due to increased parallelism.
Refs #2855.
Bootstrap of the 11th node, ignoring apply_state_locally() calls which complete instantly:
Before:
DEBUG 2017-10-06 15:24:04,213 [shard 0] gossip - apply_state_locally() took 1230 ms
DEBUG 2017-10-06 15:24:04,223 [shard 0] gossip - apply_state_locally() took 1421 ms
DEBUG 2017-10-06 15:24:04,225 [shard 0] gossip - apply_state_locally() took 607 ms
DEBUG 2017-10-06 15:24:04,288 [shard 0] gossip - apply_state_locally() took 488 ms
DEBUG 2017-10-06 15:24:04,408 [shard 0] gossip - apply_state_locally() took 1425 ms
After:
DEBUG 2017-10-06 16:24:13,130 [shard 0] gossip - apply_state_locally() took 814 ms
It's possible that a change listener for a later state will run before
the change listener for the previous state completes, in which case the
node's state can be corrupted. For example, the previous change listener
may overwrite system.peers with an old value.
This patch fixes the problem by serializing state changes and
listeners for each node.
The implementation uses loading_shared_values so that the lock remains
alive as long as there is anyone holding it. Using endpoint_state_map
for that doesn't seem appropriate, because entries can be removed from
it while listeners are still running. There is code in the gossiper
which anticipates that an entry may be gone across deferring points in
some places.
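A minimal sketch of that idea, using standard C++ primitives and
hypothetical names in place of loading_shared_values and the
future-based gossiper code: the per-endpoint lock entry is kept alive
by shared ownership, independently of the endpoint_state_map entry.

    #include <map>
    #include <memory>
    #include <mutex>
    #include <string>

    using inet_address = std::string;                  // stand-in for the real type

    class per_endpoint_locks {
        std::mutex _registry_lock;                     // protects the registry itself
        std::map<inet_address, std::weak_ptr<std::mutex>> _locks;
    public:
        // The returned handle keeps the per-endpoint lock alive for as long as
        // anyone holds it, even if the endpoint is removed from the state map.
        std::shared_ptr<std::mutex> get(const inet_address& ep) {
            std::lock_guard<std::mutex> g(_registry_lock);
            if (auto existing = _locks[ep].lock()) {
                return existing;
            }
            auto fresh = std::make_shared<std::mutex>();
            _locks[ep] = fresh;
            return fresh;
        }
    };

    void apply_state_for(per_endpoint_locks& locks, const inet_address& ep) {
        auto l = locks.get(ep);
        std::lock_guard<std::mutex> g(*l);             // serializes application and listeners per node
        // ... apply the new state and run change listeners for `ep` ...
    }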
apply_new_states() always fires change listeners for received values,
even if we already processed the state earlier. Some change listeners
are heavy-weight, e.g. storage_service::handle_state_normal(). We
should avoid calling them more than necessary.
Make sure that we always run the change listeners by putting them in a
defer() block. Otherwise, if an exception is thrown in the middle of
state application, the change listeners would not be run. Later we would
not detect the change for states which were already applied, and would
not run the change listeners.
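A simplified sketch (plain C++; a small scope guard stands in for
defer(), and the helpers are stand-ins): collect the states that were
actually applied and notify listeners from the guard, so notification
happens even if application throws part-way.

    #include <functional>
    #include <string>
    #include <utility>
    #include <vector>

    enum class application_state { STATUS, TOKENS, LOAD };

    // Stand-ins for the real gossiper machinery.
    void apply_one_state(application_state, const std::string&) { /* may throw in real code */ }
    void do_on_change_notifications(const std::vector<application_state>&) { /* fire listeners */ }

    struct scope_guard {                               // minimal stand-in for defer()
        std::function<void()> f;
        ~scope_guard() { f(); }
    };

    void apply_new_states(const std::vector<std::pair<application_state, std::string>>& incoming) {
        std::vector<application_state> changed;
        // Listeners run on scope exit, whether we return normally or unwind on an exception.
        scope_guard notify{[&] { do_on_change_notifications(changed); }};
        for (const auto& [state, value] : incoming) {
            apply_one_state(state, value);             // if this throws, `notify` still runs
            changed.push_back(state);                  // only notify for states we actually applied
        }
    }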
Fixes #2867
It has been serialized since e428d06f40. This causes a regression in
the performance of application state propagation due to reduced
parallelism.
Processing states for each node has high latency due to memtable
flushes triggered by update_tokens() and commitlog syncs done by
system.peers updates, if commitlog sync mode is set to "batch". We
have high internal concurrency for these, so increasing parallelism
significantly reduces the time to process all states.
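Illustrative sketch only (plain C++ threads stand in for the
future-based concurrency; names and bounds are made up): apply each
node's state concurrently and wait for all of them, instead of
processing nodes one after another.

    #include <future>
    #include <map>
    #include <string>
    #include <vector>

    using inet_address = std::string;
    struct endpoint_state { int heartbeat_version = 0; };

    // Stand-in for the slow per-node work (memtable flush, commitlog sync).
    void apply_one_endpoint(const inet_address&, const endpoint_state&) {}

    void apply_state_locally(const std::map<inet_address, endpoint_state>& incoming) {
        std::vector<std::future<void>> pending;
        for (const auto& [ep, es] : incoming) {
            // Launch each endpoint's application in parallel; the real code bounds concurrency.
            pending.push_back(std::async(std::launch::async, apply_one_endpoint, ep, es));
        }
        for (auto& f : pending) {
            f.get();                                   // wait for all, propagating any exception
        }
    }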
Fixes #2855.
The failure detector decides that a node is down if it hasn't received a change of
its heartbeat for longer than ~11 times the average of past intervals between
updates.
If there are multiple incoming ACKs containing information about the
same node, we may detect and report a change for each of them. This
will cause the failure_detector to compute an average report period on
the order of milliseconds. After the update storm is over, it will
declare the node as failed very soon, because the report period will now
be a large multiple of the average.
Fix by not counting short updates into the calculation of average
arrival time.
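Simplified model of the fix (hypothetical names and threshold, not the
actual failure_detector code): intervals shorter than a minimum are not
added to the window used to compute the average.

    #include <chrono>
    #include <cstddef>
    #include <deque>
    #include <numeric>

    using namespace std::chrono;

    class arrival_window {
        std::deque<milliseconds> _intervals;
        steady_clock::time_point _last{};
        static constexpr auto MIN_INTERVAL = milliseconds(500);   // assumed threshold
        static constexpr std::size_t MAX_SAMPLES = 1000;
    public:
        void report(steady_clock::time_point now) {
            if (_last != steady_clock::time_point{}) {
                auto gap = duration_cast<milliseconds>(now - _last);
                if (gap >= MIN_INTERVAL) {                         // ignore update storms
                    if (_intervals.size() >= MAX_SAMPLES) {
                        _intervals.pop_front();
                    }
                    _intervals.push_back(gap);
                }
            }
            _last = now;
        }
        milliseconds mean() const {
            if (_intervals.empty()) {
                return milliseconds(0);
            }
            auto total = std::accumulate(_intervals.begin(), _intervals.end(), milliseconds(0));
            return total / _intervals.size();
        }
    };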
Fixes #2861.
This patch introduces the get_application_state_ptr() function, which
allows access to a versioned_value of a particular endpoint.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Now that we have get_endpoint_state_for_endpoint_ptr(), which does not
return a copy and allows mutating the actual state, we can use it
instead of repeating the lookup code.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Make it use get_endpoint_state_for_endpoint_ptr(), check if gossiper is
enabled, mark it as const, and have some callers use it instead of open
coding the logic.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Have convict() use get_endpoint_state_for_endpoint_ptr(), simplify
logging, and also protect expensive operations by checking the log
level.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
gossiper::get_endpoint_state_for_endpoint() returns a copy of
endpoint_state, which we've seen can be very expensive.
This patch adds a similar function which returns a pointer instead,
and changes the call sites where using the pointer-returning variant
is deemed safe (the pointer neither escapes the function, nor crosses
any defer point).
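Sketch of the two lookup styles, with stand-in types (the real functions
live on the gossiper and return its types):

    #include <map>
    #include <optional>
    #include <string>

    using inet_address = std::string;
    struct endpoint_state { std::map<int, std::string> application_states; };

    std::map<inet_address, endpoint_state> endpoint_state_map;

    // Copy-returning variant: safe to hold across defer points, but expensive.
    std::optional<endpoint_state> get_endpoint_state_for_endpoint(const inet_address& ep) {
        auto it = endpoint_state_map.find(ep);
        if (it == endpoint_state_map.end()) {
            return std::nullopt;
        }
        return it->second;                             // deep copy of all application states
    }

    // Pointer-returning variant: cheap, but the pointer must not escape the function
    // or be used across a deferring point, since the map entry may move or be erased.
    endpoint_state* get_endpoint_state_for_endpoint_ptr(const inet_address& ep) {
        auto it = endpoint_state_map.find(ep);
        return it == endpoint_state_map.end() ? nullptr : &it->second;
    }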
Fixes #764
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
gossiper::apply_state_locally() calls handle_major_state_change() for
each endpoint, in a seastar thread, which calls mark_alive() for new
nodes, which calls ms().send_gossip_echo(id).get(). So it synchronously
waits for each node to respond before it moves on to the next entry. As
a result it may take a while before the whole state is processed.
Apache (tm) Cassandra (tm) sends echoes in the background.
In a large cluster, we see that at the time the joining node starts
streaming, it hasn't managed to apply all the endpoint_state for peer
nodes, so the joining node does not know some of the nodes yet. As a
result, the joining node ignores streaming from some of the existing
nodes.
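The shape of the change, sketched with plain C++ threads and stand-in
functions (the real code would keep the echo on a background future
chain rather than blocking on .get()):

    #include <string>
    #include <thread>

    using inet_address = std::string;

    bool send_gossip_echo(const inet_address&) { return true; }   // stand-in for the echo RPC
    void real_mark_alive(const inet_address&) {}                  // runs once the echo is answered

    void mark_alive(const inet_address& node) {
        // Detach the wait: apply_state_locally() moves on to the next endpoint immediately.
        std::thread([node] {
            if (send_gossip_echo(node)) {
                real_mark_alive(node);
            }
        }).detach();
    }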
Fixes #2787
Fixes #2797
Message-Id: <3760da2bef1a83f1b6a27702a67ca4170e74b92c.1505719669.git.asias@scylladb.com>
"This series tries to improve the bootstrap of a node in a large cluster by
improving how gossip applies the gossip node state. In #2404, the joining node
failed to bootstrap, because it did not see the seed node when
storage_service::bootstrap ran. After this series, we apply the whole gossip
state contained in the gossip ack/ack2 message before applying the next one,
and we apply the state of the seed node earlier than non-seed nodes so we can
have the seed node's state faster. We also add some randomness to the order of
applying gossip node state to prevent some of the nodes' state from always being
applied earlier than the others.
This series improves apply_state_locally for large clusters:
- Tune the order of applying endpoint_state
- Serialize apply_state_locally
- Avoid copying of the gossip state map
Fixes#2404"
* tag 'asias/gossip_issue_2404_v2' of github.com:scylladb/seastar-dev:
gossip: Avoid copying with apply_state_locally
gossip: Serialize apply_state_locally
gossip: Tune the order of applying endpoint_state in apply_state_locally
gossip: Introduce is_seed helper
gossip: Pass const endpoint_state& in notify_failure_detector
gossip: Pass reference in notify_failure_detector
Move the std::map<inet_address, endpoint_state> out of the gossip
ack/ack2 message directly, and move it around in apply_state_locally(),
to avoid copying the map.
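A minimal sketch of the move-only path, with stand-in types:

    #include <map>
    #include <string>
    #include <utility>

    using inet_address = std::string;
    struct endpoint_state { int heartbeat_version = 0; };

    void apply_state_locally(std::map<inet_address, endpoint_state> map) {
        // `map` was moved in by the caller; consume it without further copies.
        for (auto& [ep, es] : map) {
            (void)ep; (void)es;                        // ... apply each entry ...
        }
    }

    void handle_ack(std::map<inet_address, endpoint_state> ep_state_map) {
        apply_state_locally(std::move(ep_state_map));  // move, don't copy
    }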
apply_state_locally() will be called when a gossip ack/ack2
message is received. It will use the std::map<inet_address,
endpoint_state>& map to update the endpoint state.
However, we can receive multiple such gossip ack/ack2 messages from
multiple peer nodes in parallel. Currently, we process them in parallel.
It is better to apply all the states from one node and then move on to
applying all the states from another node, rather than interleaving
them, because it is more important to have the state of the whole
cluster than to have slightly newer state from another peer (if it is
newer), especially when the node boots up and runs its first round of
gossip exchange.
After this patch, we apply the whole gossip state contained in the
gossip ack/ack2 message before applying the next one.
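Simplified sketch (a std::mutex stands in for whatever synchronization
the real future-based code uses): the whole map from one ack/ack2
message is applied before the next one starts.

    #include <map>
    #include <mutex>
    #include <string>

    using inet_address = std::string;
    struct endpoint_state { int heartbeat_version = 0; };

    std::mutex apply_lock;                             // one message applied at a time

    void apply_one_entry(const inet_address&, const endpoint_state&) { /* ... */ }

    void apply_state_locally(const std::map<inet_address, endpoint_state>& incoming) {
        std::lock_guard<std::mutex> g(apply_lock);
        for (const auto& [ep, es] : incoming) {
            apply_one_entry(ep, es);                   // finish this sender's view first
        }
    }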
We currently always apply the endpoint_state in the order of the
endpoint IP address. This is not good because some endpoints' state is
always applied earlier than the others'. In a large cluster, the number
of endpoints can be large and it takes time to apply all of them. To
make it fairer, we apply the endpoint_state in random order.
Apply the seed node's state earlier because, during bootstrap, we check
if we have seen the seed node in storage_service::bootstrap. In #2404,
the bootstrap failed because the joining node hadn't applied the seed
node's state when storage_service::bootstrap ran.
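Rough sketch of the ordering (stand-in types, hypothetical helper name):
seed nodes first, random order within each group.

    #include <algorithm>
    #include <random>
    #include <set>
    #include <string>
    #include <vector>

    using inet_address = std::string;

    std::vector<inet_address> get_apply_order(const std::vector<inet_address>& endpoints,
                                              const std::set<inet_address>& seeds) {
        std::vector<inet_address> seed_nodes;
        std::vector<inet_address> other_nodes;
        for (const auto& ep : endpoints) {
            (seeds.count(ep) ? seed_nodes : other_nodes).push_back(ep);
        }
        // Randomize within each group so no endpoint is consistently applied last.
        std::mt19937 rng{std::random_device{}()};
        std::shuffle(seed_nodes.begin(), seed_nodes.end(), rng);
        std::shuffle(other_nodes.begin(), other_nodes.end(), rng);
        // Seeds first: storage_service::bootstrap checks that a seed has been seen.
        seed_nodes.insert(seed_nodes.end(), other_nodes.begin(), other_nodes.end());
        return seed_nodes;
    }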
This reverts commit b56ba02335.
After commit 8fa35d6ddf (messaging_service: Get rid of timeout and retry
logic for streaming verb), the streaming verb in rpc does not check if
a node is in gossip membership since all the retry logic is removed.
Remove the extra wait before removing the joining node from gossip
membership.
Message-Id: <a416a735bb8aad533bbee190e3324e6b16799415.1504063598.git.asias@scylladb.com>