scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-01 04:26:48 +00:00

Author	SHA1	Message	Date
Pavel Solodovnikov	777985b64d	gms: gossiper: maybe_enable_features() should enable features in seastar::async context Since `gms::feature::enable()` requires `seastar::async` context to be present. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-11-28 14:18:11 +02:00
Pavel Solodovnikov	5b5fbb4b33	gms: feature_service: expose registered features map This will be used for re-enabling previously enabled cluster features, which will be introduced in later patches. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-11-28 14:18:11 +02:00
Pavel Solodovnikov	a2f5ad432f	gms: feature_service: persist enabled features Save each feature enabled through the feature_service instance in the `system.scylla_local` under the 'enabled_features' key. The features would be persisted only if the underlying query context used by `db::system_keyspace` is initialized. Since `system.scylla_local` table is essentially a string->string map, use an ad-hoc method for serializing enabled features set: the same as used in gossiper for translating supported features set via gossip. The entry should be saved before we enable the feature so that crash-after-enable is safe. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-11-28 14:18:11 +02:00
Pavel Solodovnikov	e891f874df	gms: move `to_feature_set()` function from gossiper to feature_service This utility will also be used for de-serialization of persisted enabled features, which will be introduced in a later patch. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-11-28 14:18:11 +02:00
Benny Halevy	55967a8597	batchlog_manager: endpoint_filter: move to gossiper There's nothing in this function that actually requires the batchlog manager instance. It uses a random number engine that's moved along with it to class gossiper. This resolves a circular dependency between the batchlog_manager and storage_proxy. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-11-23 08:27:30 +02:00
Pavel Emelyanov	9fccf7f3af	gossiper: Guard background processing with gate On shutdown the gossiper may still have some messages being processed in the background. This brings two problems. First, the gossiper itself is about to disappear soon and messages might step on the freed instance (however, this one is not real now, gossiper is not freed for real, just ::stop() is called). Second, message processing may notify other subsystems which, in turn, do not expect this after the gossiper is shut down. The common solution to this is to run background code through a gate that gets closed at some point, the ::shutdown() in gossiper case. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-08 10:25:03 +03:00
Pavel Emelyanov	42f44adb98	gossiper: Helper for background messaging processing Some messages are processed by gossiper on shard0 in the no-wait manner. Add a generic helper for that to facilitate next patching. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-08 10:24:44 +03:00
Asias He	1657e7be14	gossiper: Send generation number with shutdown message Consider: - n1, n2 in the cluster - n2 shutdown - n2 sends gossip shutdown message to n1 - n1 delays processing of the handler of shutdown message - n2 restarts - n1 learns new gossip state of n2 - n1 resumes to handle the shutdown message - n1 will mark n2 as shutdown status incorrectly until n2 restarts again To prevent this, we can send the gossip generation number along with the shutdown message. If the generation number does not match the local generation number for the remote node, the shutdown message will be ignored. Since we use the rpc::optional to send the generation number, it works with mixed cluster. Fixes #8597 Closes #9381	2021-09-27 11:08:43 +03:00
Avi Kivity	6702711d9c	Merge "Gossiper start-stop sanitation (+ bonus track)" from Pavel E " The main challenge here is to move messaging_service.start_listen() call from out of gossiper into main. Other changes are pretty minor compared to that and include - patch gossiper API towards a standard start-shutdown-stop form - gossiping "sharder info" in initial state - configure cluster name and seeds via gossip_config tests: unit(dev) dtest.bootstrap_test.start_stop_test_node(dev) manual(dev): start+stop, nodetool enable-/disablegossip refs: #2737 refs: #2795 refs: #5489 " * 'br-gossiper-dont-start-messaging-listen-2' of https://github.com/xemul/scylla: code: Expell gossiper.hh from other headers storage_service: Gossip "sharder" in initial states gossiper: Relax set_seeds() gossiper, main: Turn init_gossiper into get_seeds_from_config storage_service: Eliminate the do-bind argument from everywhere gossiper: Drop ms-registered manipulations messaging, main, gossiper: Move listening start into main gossiper: Do handlers reg/unreg from start/stop gossiper: Split (un)init_messaging_handler() gossiper: Relocate stop_gossiping() into .stop() gossiper: Introduce .shutdown() and use where appropriate gossiper: Set cluster_name via gossip_config gossiper, main: Straighten start/stop tests/cql_test_env: Open-code tst_init_ms_fd_gossiper tests/cql_test_env: De-global most of gossiper gossiper: Merge start_gossiping() overloads into one gossiper: Use is_... helpers gossiper: Fix do_shadow_round comment gossiper: Dispose dead code	2021-09-23 12:18:38 +03:00
Pavel Emelyanov	968e117315	gossiper: Relax set_seeds() It's much shorter and simpler to pass the seeds, obtained from the config, into gossiper via gossip_config rather than with the help of a special call. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-22 13:13:06 +03:00
Pavel Emelyanov	7680274e02	storage_service: Eliminate the do-bind argument from everywhere The same as in previous patch -- the gossiper doesn't need to know if it should call messaging.start_listen() or not, neither should do the storage_service. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-22 13:13:06 +03:00
Pavel Emelyanov	0607a2b84f	gossiper: Drop ms-registered manipulations Now it's no-op and can be removed. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-22 13:13:06 +03:00
Pavel Emelyanov	ca316f32f0	messaging, main, gossiper: Move listening start into main Before preparing the cluster join process the messaging should be put into listening state. Right now it's done "on-demand" by the call to the do_shadow_round(), also there's a safety call in the start_gossiping(). Tests, however, should not start listening, so the do_bind boolean exists and is passed all the way around. Make the main() code explicitly call the messaging.start_listen() and leave tests without it. This change makes messaging start listening a bit earlier, but in between these old and new places there's nothing that needs messaging to stay deaf. As the do_bind becomes useless, the wait_for_gossip_to_settle() is also moved into main. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-22 13:13:06 +03:00
Pavel Emelyanov	f644eb1cf7	gossiper: Do handlers reg/unreg from start/stop On start handlers can be registered any time before the messaging starts to listen. On stop handlers can remain registered for any length of time, since the messaging service stops early in drain_on_shutdown(). One tricky place is API start_/stop_gossiping(). The latter calls gossiper::stop() thus unregistering the handlers. So to make the start_gossiping() work it must call gossiper::start() in advance. Overall the gossiper start/stop becomes this: gossiper.start() `- registers handlers gossiper.start_gossiping() `- // starts gossiping gossiper.shutdown() `- // stops gossiping gossiper.stop() `- calls shutdown() // re-entrable `- unregisters handlers Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-22 13:13:06 +03:00
Pavel Emelyanov	9aba3e6f9f	gossiper: Split (un)init_messaging_handler() As a preparation for the next patch. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-22 13:13:06 +03:00
Pavel Emelyanov	dfe54207cb	gossiper: Relocate stop_gossiping() into .stop() The helper in question is called in two places: 1. In main() as a fuse against early exception before creating the drain_on_shutdown() defer 2. In the stop_gossiping() API call Both can be replaced with the stop_gossiping() call from the .stop() method, here's why: 1. In main the gossiper::stop() call is already deferred right after the gossiper is started. So this change moves it above. It may happen that an exception pops up before the old fuse was deferred, but that's OK -- the stop_gossiping() is safe against early- and re- entrances 2. The stop_gossiping() change is effectively a rename -- it calls the stop_gossiping() as it did before, but with the help of the .stop() method Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-22 13:13:06 +03:00
Pavel Emelyanov	e24c5034b5	gossiper: Introduce .shutdown() and use where appropriate The start/stop sequence we're moving towards assumes a shutdown (or drain) method that will be called early on stop to notify the service that the system is going down so it could prepare. For gossiper it already means calling stop_gossiping() on the shard-0 instance. So by and large this patch renames a few stop_gossiping() calls into .shutdown() ones. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-22 13:13:06 +03:00
Pavel Emelyanov	25210334b6	gossiper: Set cluster_name via gossip_config It's taken purely from the db::config and thus can be set up early. Right now the empty name is converted into "Test Cluster" one, but remains empty in the config and is later used by the system_keyspace code. This logic remains intact. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-22 13:13:06 +03:00
Pavel Emelyanov	084abb824e	gossiper, main: Straighten start/stop Turn the gossiper start/stop sequence into the canonical form gossiper.start(std::ref(dependencies)...).get(); auto stop_gossiper = defer({ gossiper.invoke_on_all(&gossiper::stop).get(); }); gossiper.invoke_on_all(&gossiper::start).get(); The deferred call should be gossiper.stop(); but for now keep the instances memory alive. This trick is safe at this point, because .start() and .stop() methods are both empty (still). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-22 13:13:05 +03:00
Pavel Emelyanov	89adb0df90	gossiper: Merge start_gossiping() overloads into one There are two of them and one is only called from the API with the do_bind always set to "yes". This fact makes it possible to remove it by adding relevant defaults for the other. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-21 11:19:16 +03:00
Pavel Emelyanov	e71bd23b3d	gossiper: Use is_... helpers There are several state booleans on the service and some helpers to manipulate/check those. Make the code consistent by always using these helpers. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-21 11:19:16 +03:00
Pavel Emelyanov	efb0ddff21	gossiper: Fix do_shadow_round comment Shadow round is used during each boot, not only during node replacement Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-21 11:19:16 +03:00
Pavel Emelyanov	f7ab1aa876	gossiper: Dispose dead code The debug_show() is unused, as well as the advertise_myself(). The _features_condvar used to be listened on before `f32f08c9`, now it's signal-only. Feature friendship with gossiper is not required. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-21 11:19:16 +03:00
Nadav Har'El	4ffd8c1f2b	alternator: stub TTL operations This patch adds stubs for the UpdateTimeToLive and DescribeTimeToLive operations to Alternator. These operations can enable, disable, or inquire about, the chosen expiration-time attribute. Currently, the information about the chosen attribute is only saved, with no actual expiration of any items taking place. Some of the tests for the TTL feature start to pass, so their xfail tag is removed. Because this new feature is incomplete, it is not enabled unless the "alternator-ttl" experimental feature is enabled. Moreover, for these operations to be allowed, the entire cluster needs to support this experimental feature, because all nodes need to participate in the data expiration - if some old nodes don't support Alternator TTL, some of the data they hold won't get expired... So we don't allow enabling TTL until all the nodes in the cluster support this feature. The implementation is in a new source file, alternator/ttl.cc. This source file will continue to grow as we implement the expiration feature. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-09-19 21:05:21 +03:00
Avi Kivity	aa68927873	gossiper: remove get_local_gossiper() from some inline helpers Some state accessors called get_local_gossiper(); this is removed and replaced with a parameter. Some callers (redis, alternators) now have the gossiper passed as a parameter during initialization so they can use the adjusted API.	2021-09-07 17:03:37 +03:00
Avi Kivity	9ce1af9fcb	gossiper: remove get_gossiper() from stop_gossiping() Have the callers pass it instead, and they all have a reference already except for cql_test_env (which will be fixed later). The checks for initialization it does are likely unnecessary, but we'll only be able to prove it when get_gossiper() is completely removed.	2021-09-07 16:20:04 +03:00
Avi Kivity	fcd5376585	gossiper: remove uses of get_local_gossiper for its rpc server Initialization happens in the gossiper itself, so we can capture 'this'. If we need to move to shard 0, use sharded::invoke_on() to get the local instance.	2021-09-07 16:06:11 +03:00
Avi Kivity	61f02ece39	gossiper: remove calls to global get_gossiper from within the gossiper itself gossiper is a peering_sharded_service, so it has access to sharded<gossiper>. Remove the global call.	2021-09-07 15:15:09 +03:00
Piotr Sarna	da67c594c8	gms: add UDA feature UDA stands for user-defined aggregates and the feature implies that the whole cluster supports them.	2021-08-13 11:14:12 +02:00
Piotr Jastrzebski	1bdcef6890	features: assume MC_SSTABLE and UNBOUNDED_RANGE_TOMBSTONES are always enabled These features have been around for over 2 years and every reasonable deployment should have them enabled. The only case when those features could be not enabled is when the user has used enable_sstables_mc_format config flag to disable MC sstable format. This case has been eliminated by removing enable_sstables_mc_format config flag. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-06-25 10:12:00 +02:00
Asias He	2ad8fb756e	gossip: Promote gossip quarantine over log to info level 1) Start node n1, n2, n3 2) Bootstrap n4 and kill n4 in the middle of bootstrap 3) Wipe data on n4 and start n4 again After step 2, n1, n2 and n3 will remove n4 from gossip after fat_client_timeout and put n4 in quarantine for quarantine_delay(). If n4 bootstraps again in step 3 before the quarantine finishes, n1, n2 and n3 will ignore gossip updates from n4, and n4 will not learn gossip updates from the cluster. After PR #8896, the bootstrap will be rejected. This patch promotes the gossip quarantine over log to info level, so that dtest can wait for the log to bootstrap the node again. Refs #8889 Refs #8890 Closes #8905	2021-06-24 12:51:32 +03:00
Pavel Emelyanov	d606321575	feature: Remove unused friendship with gossiper Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-06-18 20:19:35 +03:00
Asias He	7a32cab524	gossip: Fix use-after-free in real_mark_alive and mark_dead In commit `11a8912093` (gossiper: get_gossip_status: return string_view and make noexcept) get_gossip_status returns a pointer to an endpoint_state in endpoint_state_map. After commit `425e3b1182` (gossip: Introduce direct failure detector), gossiper::mark_dead and gossiper::real_mark_alive can yield in the middle of the function. It is possible that endpoint_state can be removed, causing use-after-free to access it. To fix, make a copy before we yield. Fixes #8859 Closes #8862	2021-06-16 21:16:26 +02:00
Asias He	c2cfdcd345	gossiper: Set minimum value for quarantine_delay When a new node bootstraps to join the cluster, it will be set in bootstrap gossip status. If the node is gone in the middle, the node will be removed by gossip after the new node fails to update gossip after fat_client_timeout, which reverts the new node as pending node. However, if the new node is slow to update gossip and finishes bootstrapping after existing nodes have removed it after fat_client_timeout, then in the handle_state_normal handler the existing nodes will fail to find the host id for the new node, throw, and in turn terminate the scylla process. To mitigate the problem, we set fat_client_timeout, which is half of quarantine_delay, to a minimum value if users set a small ring_delay value. Refs #8702 Refs #8859 Closes #8860	2021-06-16 09:34:49 +02:00
Asias He	0665d9c346	gossip: Handle nodes removed from live endpoints directly When a node is removed from the _live_endpoints list directly, e.g., a node being decommissioned, it is possible the node might not be marked as down in the gossiper::failure_detector_loop_for_node loop before the loop exits. When the gossiper::failure_detector_loop loop starts again, the node will not be considered because it is not present in the _live_endpoints list any more. As a result, the node will not be marked as down through the gossiper::failure_detector_loop_for_node loop. To fix, we mark the nodes that are removed from the _live_endpoints list as down in the gossiper::failure_detector_loop loop. Fixes #8712 Closes #8770	2021-06-09 15:02:25 +02:00
Avi Kivity	a55b434a2b	treewide: extend copyright statements to present day	2021-06-06 19:18:49 +03:00
Asias He	9b902fad79	gossiper: Update timestamp for nodes in ack and ack2 msg handler In commit `425e3b1182` (gossip: Introduce direct failure detector), the call to notify_failure_detector inside ack and ack2 msg handler was removed since there is no need to update the old failure detector anymore. However, the timestamp for endpoint_state is also updated inside notify_failure_detector. With the new failure detector we still need the timestamp for endpoint_state. Otherwise, nodes might be removed from gossip wrongly. For example, as we saw in issue #8702: INFO 2021-05-24 22:45:24,713 [shard 0] gossip - FatClient 127.0.60.2 has been silent for 5000ms, removing from gossip To fix, update the timestamp as we did before in ack and ack2 msg handler. Fixes #8702 Closes #8777	2021-06-06 09:21:23 +03:00
Kamil Braun	2ac9239f6a	gms: introduce CDC_GENERATIONS_V2 feature When a node joins this feature (which it does immediately when upgrading to a version that has this commit), it says: "I understand the new generation storage format and the new identifier format". Thus, when the feature becomes enabled - after all nodes have joined it - it means that it's safe to create new generations using these new storage/ID formats.	2021-05-25 16:07:23 +02:00
Kamil Braun	4658adbe18	tree-wide: introduce cdc::generation_id_v2 This is a new type of CDC generation identifiers. Compared to old IDs, additionally to the timestamp it contains an UUID. These new identifiers will allow a safer and more efficient algorithm of introducing new generations into a cluster (introduced in a later commit). For now, nodes keep using the old identifier format when creating new generations and whenever they learn about a new CDC generation from gossip they assume that it also is stored in the v1 format. But they do know how to (de)serialize the second format and how to persist new identifiers in local tables.	2021-05-24 17:50:21 +02:00
Avi Kivity	50f3bbc359	Merge "treewide: various header cleanups" from Pavel S " The patch set is an assorted collection of header cleanups, e.g: * Reduce number of boost includes in header files * Switch to forward declarations in some places A quick measurement was performed to see if these changes provide any improvement in build times (ccache cleaned and existing build products wiped out). The results are posted below (`/usr/bin/time -v ninja dev-build`) for 24 cores/48 threads CPU setup (AMD Threadripper 2970WX). Before: Command being timed: "ninja dev-build" User time (seconds): 28262.47 System time (seconds): 824.85 Percent of CPU this job got: 3979% Elapsed (wall clock) time (h:mm:ss or m:ss): 12:10.97 Average shared text size (kbytes): 0 Average unshared data size (kbytes): 0 Average stack size (kbytes): 0 Average total size (kbytes): 0 Maximum resident set size (kbytes): 2129888 Average resident set size (kbytes): 0 Major (requiring I/O) page faults: 1402838 Minor (reclaiming a frame) page faults: 124265412 Voluntary context switches: 1879279 Involuntary context switches: 1159999 Swaps: 0 File system inputs: 0 File system outputs: 11806272 Socket messages sent: 0 Socket messages received: 0 Signals delivered: 0 Page size (bytes): 4096 Exit status: 0 After: Command being timed: "ninja dev-build" User time (seconds): 26270.81 System time (seconds): 767.01 Percent of CPU this job got: 3905% Elapsed (wall clock) time (h:mm:ss or m:ss): 11:32.36 Average shared text size (kbytes): 0 Average unshared data size (kbytes): 0 Average stack size (kbytes): 0 Average total size (kbytes): 0 Maximum resident set size (kbytes): 2117608 Average resident set size (kbytes): 0 Major (requiring I/O) page faults: 1400189 Minor (reclaiming a frame) page faults: 117570335 Voluntary context switches: 1870631 Involuntary context switches: 1154535 Swaps: 0 File system inputs: 0 File system outputs: 11777280 Socket messages sent: 0 Socket messages received: 0 Signals delivered: 0 Page 
size (bytes): 4096 Exit status: 0 The observed improvement is about 5% of total wall clock time for `dev-build` target. Also, all commits make sure that headers stay self-sufficient, which would help to further improve the situation in the future. " * 'feature/header_cleanups_v1' of https://github.com/ManManson/scylla: transport: remove extraneous `qos/service_level_controller` includes from headers treewide: remove evidently unneded storage_proxy includes from some places service_level_controller: remove extraneous `service/storage_service.hh` include sstables/writer: remove extraneous `service/storage_service.hh` include treewide: remove extraneous database.hh includes from headers treewide: reduce boost headers usage in scylla header files cql3: remove extraneous includes from some headers cql3: various forward declaration cleanups utils: add missing <limits> header in `extremum_tracking.hh`	2021-05-24 14:24:20 +03:00
Asias He	425e3b1182	gossip: Introduce direct failure detector Currently, gossip uses the updates of the gossip heartbeat from gossip messages to decide if a node is up or down. This means if a node is actually down but the gossip messages are delayed in the network, the marking of node down can be delayed. For example, a node sends 20 gossip messages in 20 seconds before it is dead. Each message is delayed 15 seconds by the network for some reason. A node receives those delayed messages one after another. Those delayed messages will prevent this node from being marked as down, because a heartbeat update is received just before the threshold to mark a node down is triggered, which is around 20 seconds by default. As a result, this node will not be marked as down in 20 * 15 seconds = 300 seconds, much longer than the ~20 seconds node down detection time in normal cases. In this patch, a new failure detector is implemented. - Direct detection The existing failure detector can get gossip heartbeat updates indirectly. For example: Node A can talk to Node B Node B can talk to Node C Node A can not talk to Node C, due to network issues Node A will not mark Node C to be down because Node A can get the heartbeat of Node C from Node B indirectly. This indirect detection is not very useful because when Node A decides if it should send requests to Node C, the requests from Node A to C will fail while Node A thinks it can communicate with Node C. This patch changes the failure detection to be direct. It uses the existing gossip echo message to detect directly. Gossip echo messages will be sent to peer nodes periodically. A peer node will be marked as down if a timeout threshold has been met. Since the failure detection is peer to peer, it avoids the delayed message issue mentioned above. - Parallel detection The old failure detector uses shard zero only. This new failure detector utilizes all the shards to perform the failure detection, each shard handling a subset of live nodes. 
For example, if the cluster has 32 nodes and each node has 16 shards, each shard will handle only 2 nodes. In a 16-node cluster where each node has 16 shards, each shard will handle only one peer node. A gossip message will be sent to peer nodes every 2 seconds. The extra echo message traffic produced compared to the old failure detector is negligible. - Deterministic detection Users can configure the failure_detector_timeout_in_ms to set the threshold to mark a node down. It is the maximum time between two successful echo messages before gossip marks a node down. It is easier to understand than the old phi_convict_threshold. - Compatible This patch only uses the existing gossip echo message. Nodes with or without this patch can work together. Fixes #8488 Closes #8036	2021-05-24 10:47:06 +03:00
Pavel Solodovnikov	fff7ef1fc2	treewide: reduce boost headers usage in scylla header files `dev-headers` target is also ensured to build successfully. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-05-20 01:33:18 +03:00
Kamil Braun	03ad111beb	tree-wide: comments on deprecated functions to access global variables Closes #8665	2021-05-18 11:31:10 +03:00
Eliran Sinvani	5eb84f110e	gossiper: remove excess error logging from gossiper We remove a log of severity error that is later thrown as an exception, caught a few lines below and then printed out as a warning. Fixes #8616 Closes #8617	2021-05-12 15:02:35 +02:00
Asias He	9ea57dff21	gossip: Relax failure detector update We currently only update the failure detector for a node when a higher version of application state is received. Since gossip syn messages do not contain application state, this means we do not update the failure detector upon receiving gossip syn messages, even if a message from a peer node is received which implies the peer node is alive. This patch relaxes the failure detector update rule to update the failure detector for the sender of gossip messages directly. Refs #8296 Closes #8476	2021-04-14 13:16:00 +02:00
Avi Kivity	fcc17d43a6	treewide: correct mislicensed source files alternator/expressions.g had both AGPL and proprietary licensing. The proprietary one is removed. gms/inet_address_serializer.hh had only a proprietary license; it is replaced by the AGPL. Fixes #8465. Closes #8466	2021-04-12 17:42:59 +03:00
Kamil Braun	99fd2244a3	tree-wide: introduce cdc::generation_id type This is a follow-up to the previous commit. Each CDC generation has a timestamp which denotes a logical point in time when this generation starts operating. That same timestamp is used to identify the CDC generation. We use this identification scheme to exchange CDC generations around the cluster. However, the fact that a generation's timestamp is used as an ID for this generation is an implementation detail of the currently used method of managing CDC generations. Places in the code that deal with the timestamp, e.g. functions which take it as an argument (such as handle_cdc_generation) are often interested in the ID aspect, not the "when does the generation start operating" aspect. They don't care that the ID is a `db_clock::time_point`. They may sometimes want to retrieve the time point given the ID (such as do_handle_cdc_generation when it calls `cdc::metadata::insert`), but they don't care about the fact that the time point actually IS the ID. In the future we may actually change the specific type of the ID if we modify the generation management algorithms. This commit is an intermediate step that will ease the transition in the future. It introduces a new type, `cdc::generation_id`. Inside it contains the timestamp, so: 1. if a piece of code doesn't care about the timestamp, it just passes the ID around 2. if it does care, it can simply access it using the `get_ts` function. The fact that `get_ts` simply accesses the ID's only field is an implementation detail. Using the occasion, we change the `do_handle_cdc_generation_intercept...` function to be a standard function, not a coroutine. It turns out that - depending on the shape of the passed-in argument - the function would sometimes miscompile (the compiled code would not copy the argument to the coroutine frame).	2021-04-07 13:47:13 +02:00
Kamil Braun	e486e0f759	tree-wide: rename "cdc streams timestamp" to "cdc generation id" Each CDC generation always has a timestamp, but the fact that the timestamp identifies the generation is an implementation detail. We abstract away from this detail by using a more generic naming scheme: a generation "identifier" (whatever that is - a timestamp or something else). It's possible that a CDC generation will be identified by more than a timestamp in the (near) future. The actual string gossiped by nodes in their application state is left as "CDC_STREAMS_TIMESTAMP" for backward compatibility. Some stale comments have been updated.	2021-04-06 13:15:31 +02:00
Avi Kivity	40b60e8f09	Merge 'repair: Switch to use NODE_OPS_CMD for replace operation' from Asias He In commit `c82250e0cf` (gossip: Allow deferring advertise of local node to be up), the replacing node is changed to postpone the responding of gossip echo message to avoid other nodes sending read requests to the replacing node. It works as following: 1) replacing node does not respond echo message to avoid other nodes to mark replacing node as alive 2) replacing node advertises hibernate state so other nodes knows replacing node is replacing 3) replacing node responds echo message so other nodes can mark replacing node as alive This is problematic because after step 2, the existing nodes in the cluster will start to send writes to the replacing node, but at this time it is possible that existing nodes haven't marked the replacing node as alive, thus failing the write request unnecessarily. For instance, we saw the following errors in issue #8013 (Cassandra stress fails to achieve consistency when only one of the nodes is down) ``` scylla: [shard 1] consistency - Live nodes 2 do not satisfy ConsistencyLevel (2 required, 1 pending, live_endpoints={127.0.0.2, 127.0.0.1}, pending_endpoints={127.0.0.3}) [shard 0] gossip - Fail to send EchoMessage to 127.0.0.3: std::runtime_error (Not ready to respond gossip echo message) c-s: java.io.IOException: Operation x10 on key(s) [4c4f4d37324c35304c30]: Error executing: (UnavailableException): Not enough replicas available for query at consistency QUORUM (2 required but only 1 alive ``` To solve this problem, we can do the replacing operation in multiple stages. 
One solution is to introduce a new gossip status state as proposed here: gossip: Introduce STATUS_PREPARE_REPLACE #7416 1) replacing node does not respond echo message 2) replacing node advertises prepare_replace state (Remove replacing node from natural endpoint, but do not put in pending list yet) 3) replacing node responds echo message 4) replacing node advertises hibernate state (Put replacing node in pending list) Since we now have the node ops verb introduced in `829b4c1438` (repair: Make removenode safe by default), we can do the operation in multiple stages without introducing a new gossip status state. This patch uses the NODE_OPS_CMD infrastructure to implement replace operation. Improvements: 1) It solves the race between marking replacing node alive and sending writes to replacing node 2) The cluster reverts to a state before the replace operation automatically in case of error. As a result, it solves the issue where, when the replacing node fails in the middle of the operation, the replacing node stays in HIBERNATE status forever. 3) The gossip status of the node to be replaced is not changed until the replace operation is successful. HIBERNATE gossip status is not used anymore. 4) Users can now pass a list of dead nodes to ignore explicitly. Fixes #8013 Closes #8330 * github.com:scylladb/scylla: repair: Switch to use NODE_OPS_CMD for replace operation gossip: Add advertise_to_nodes gossip: Add helper to wait for a node to be up gossip: Add is_normal_ring_member helper	2021-04-04 12:54:09 +03:00
Asias He	bdb95233e8	gossip: Add advertise_to_nodes gossiper::advertise_to_nodes() is added to allow responding to gossip echo messages only for specified nodes and with the current gossip generation number for those nodes. This helps to avoid a restarted node being marked as alive during a pending replace operation. After this patch, when a node sends an echo message, the gossip generation number is sent in the echo message. Since the generation number changes after a restart, the receiver of the echo message can compare the generation number to tell if the node has restarted. Refs #8013	2021-04-01 09:38:54 +08:00
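
Note on the generation-number checks described in bdb95233e8 and 1657e7be14: the idea in both messages is that a receiver compares the generation carried by an incoming message with the generation it last learned for that peer, and treats a mismatch as a sign that the peer has restarted (or that the message is stale). The following is only a minimal standalone sketch of that comparison, not ScyllaDB code. The names peer_state, verdict and check_generation are hypothetical, the map stands in for the gossiper's endpoint state, and the handling of an unknown peer is an assumption. The empty optional models the mixed-cluster case that 1657e7be14 covers with rpc::optional.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the per-peer gossip state; the real
// gms::endpoint_state in ScyllaDB carries much more than this.
struct peer_state {
    int64_t generation;  // generation last learned for this peer via gossip
};

enum class verdict { accept, stale, unknown_peer };

// Decide whether a message tagged with `msg_generation` refers to the
// incarnation of `peer` that the receiver currently knows about.
// An empty optional models an older sender that does not transmit the field.
inline verdict check_generation(
        const std::unordered_map<std::string, peer_state>& peers,
        const std::string& peer,
        std::optional<int64_t> msg_generation) {
    auto it = peers.find(peer);
    if (it == peers.end()) {
        return verdict::unknown_peer;  // assumption: the caller decides what to do
    }
    if (!msg_generation) {
        return verdict::accept;        // mixed cluster: nothing to compare against
    }
    return *msg_generation == it->second.generation ? verdict::accept
                                                    : verdict::stale;
}
```

In the scenario from 1657e7be14, n1 would hold the new generation of n2 after n2 restarts, so a delayed shutdown message tagged with the old generation yields verdict::stale and can be ignored.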

681 Commits