scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-13 03:12:13 +00:00

Author	SHA1	Message	Date
Asias He	9ea57dff21	gossip: Relax failure detector update We currently only update the failure detector for a node when a higher version of application state is received. Since gossip syn messages do not contain application state, so this means we do not update the failure detector upon receiving gossip syn messages, even if a message from peer node is received which implies the peer node is alive. This patch relaxes the failure detector update rule to update the failure detector for the sender of gossip messages directly. Refs #8296 Closes #8476	2021-04-14 13:16:00 +02:00
Avi Kivity	fcc17d43a6	treewide: correct mislicensed source files alternator/expressions.g had both AGPL and proprietary licensing. The proprietary one is removed. gms/inet_address_serializer.hh had only a proprietary license; it is replaced by the AGPL. Fixes #8465. Closes #8466	2021-04-12 17:42:59 +03:00
Kamil Braun	99fd2244a3	tree-wide: introduce cdc::generation_id type This is a follow-up to the previous commit. Each CDC generation has a timestamp which denotes a logical point in time when this generation starts operating. That same timestamp is used to identify the CDC generation. We use this identification scheme to exchange CDC generations around the cluster. However, the fact that a generation's timestamp is used as an ID for this generation is an implementation detail of the currently used method of managing CDC generations. Places in the code that deal with the timestamp, e.g. functions which take it as an argument (such as handle_cdc_generation) are often interested in the ID aspect, not the "when does the generation start operating" aspect. They don't care that the ID is a `db_clock::time_point`. They may sometimes want to retrieve the time point given the ID (such as do_handle_cdc_generation when it calls `cdc::metadata::insert`), but they don't care about the fact that the time point actually IS the ID. In the future we may actually change the specific type of the ID if we modify the generation management algorithms. This commit is an intermediate step that will ease the transition in the future. It introduces a new type, `cdc::generation_id`. Inside it contains the timestamp, so: 1. if a piece of code doesn't care about the timestamp, it just passes the ID around 2. if it does care, it can simply access it using the `get_ts` function. The fact that `get_ts` simply accesses the ID's only field is an implementation detail. Using the occasion, we change the `do_handle_cdc_generation_intercept...` function to be a standard function, not a coroutine. It turns out that - depending on the shape of the passed-in argument - the function would sometimes miscompile (the compiled code would not copy the argument to the coroutine frame).	2021-04-07 13:47:13 +02:00
Kamil Braun	e486e0f759	tree-wide: rename "cdc streams timestamp" to "cdc generation id" Each CDC generation always has a timestamp, but the fact that the timestamp identifies the generation is an implementation detail. We abstract away from this detail by using a more generic naming scheme: a generation "identifier" (whatever that is - a timestamp or something else). It's possible that a CDC generation will be identified by more than a timestamp in the (near) future. The actual string gossiped by nodes in their application state is left as "CDC_STREAMS_TIMESTAMP" for backward compatibility. Some stale comments have been updated.	2021-04-06 13:15:31 +02:00
Avi Kivity	40b60e8f09	Merge 'repair: Switch to use NODE_OPS_CMD for replace operation' from Asias He In commit `c82250e0cf` (gossip: Allow deferring advertise of local node to be up), the replacing node is changed to postpone the responding of gossip echo message to avoid other nodes sending read requests to the replacing node. It works as following: 1) replacing node does not respond echo message to avoid other nodes to mark replacing node as alive 2) replacing node advertises hibernate state so other nodes knows replacing node is replacing 3) replacing node responds echo message so other nodes can mark replacing node as alive This is problematic because after step 2, the existing nodes in the cluster will start to send writes to the replacing node, but at this time it is possible that existing nodes haven't marked the replacing node as alive, thus failing the write request unnecessarily. For instance, we saw the following errors in issue #8013 (Cassandra stress fails to achieve consistency when only one of the nodes is down) ``` scylla: [shard 1] consistency - Live nodes 2 do not satisfy ConsistencyLevel (2 required, 1 pending, live_endpoints={127.0.0.2, 127.0.0.1}, pending_endpoints={127.0.0.3}) [shard 0] gossip - Fail to send EchoMessage to 127.0.0.3: std::runtime_error (Not ready to respond gossip echo message) c-s: java.io.IOException: Operation x10 on key(s) [4c4f4d37324c35304c30]: Error executing: (UnavailableException): Not enough replicas available for query at consistency QUORUM (2 required but only 1 alive ``` To solve this problem, we can do the replacing operation in multiple stages. One solution is to introduce a new gossip status state as proposed here: gossip: Introduce STATUS_PREPARE_REPLACE #7416 1) replacing node does not respond echo message 2) replacing node advertises prepare_replace state (Remove replacing node from natural endpoint, but do not put in pending list yet) 3) replacing node responds echo message 4) replacing node advertises hibernate state (Put replacing node in pending list) Since we now have the node ops verb introduced in `829b4c1438` (repair: Make removenode safe by default), we can do the multiple stage without introducing a new gossip status state. This patch uses the NODE_OPS_CMD infrastructure to implement replace operation. Improvements: 1) It solves the race between marking replacing node alive and sending writes to replacing node 2) The cluster reverts to a state before the replace operation automatically in case of error. As a result, it solves when the replacing node fails in the middle of the operation, the repacing node will be in HIBERNATE status forever issue. 3) The gossip status of the node to be replaced is not changed until the replace operation is successful. HIBERNATE gossip status is not used anymore. 4) Users can now pass a list of dead nodes to ignore explicitly. Fixes #8013 Closes #8330 * github.com:scylladb/scylla: repair: Switch to use NODE_OPS_CMD for replace operation gossip: Add advertise_to_nodes gossip: Add helper to wait for a node to be up gossip: Add is_normal_ring_member helper	2021-04-04 12:54:09 +03:00
Asias He	bdb95233e8	gossip: Add advertise_to_nodes gossiper::advertise_to_nodes() is added to allow respond to gossip echo message with specified nodes and the current gossip generation number for the nodes. This is helpful to avoid the restarted node to be marked as alive during a pending replace operation. After this patch, when a node sends a echo message, the gossip generation number is sent in the echo message. Since the generation number changes after a restart, the receiver of the echo message can compare the generation number to tell if the node has restarted. Refs #8013	2021-04-01 09:38:54 +08:00
Asias He	f690f3ee8e	gossip: Add helper to wait for a node to be up This patch adds gossiper::wait_alive helper to wait for nodes to be up on all shards. Refs #8013	2021-04-01 09:38:54 +08:00
Asias He	4f5676630e	gossip: Add is_normal_ring_member helper Check if a node is in NORMAL or SHUTDOWN status which means the node is part of the token ring from the gossip point of view and operates in normal status or was in normal status but is shutdown. Refs #8013	2021-04-01 09:38:54 +08:00
Piotr Jastrzebski	57c7964d6c	config: ignore enable_sstables_mc_format flag Don't allow users to disable MC sstables format any more. We would like to retire some old cluster features that has been around for years. Namely MC_SSTABLE and UNBOUNDED_RANGE_TOMBSTONES. To do this we first have to make sure that all existing clusters have them enabled. It is impossible to know that unless we stop supporting enable_sstables_mc_format flag. Test: unit(dev) Refs #8352 Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Closes #8360	2021-03-31 12:23:59 +03:00
Piotr Sarna	e42dee6afb	gms: make CORRECT_STATIC_COMPACT_IN_MC ft unconditionally true The feature is assumed to be true due to being over 2 years old. It's still advertised in gossip, but it's assumed to always be present.	2021-03-30 09:37:13 +02:00
Piotr Sarna	08c4350968	gms: make TRUNCATION_TABLE feature unconditionally true Turns out the feature was not used presently. Historically, the commit which removed the support is `30a700c5b0` .	2021-03-30 09:36:45 +02:00
Piotr Sarna	c070178c7e	gms: make ROW_LEVEL_REPAIR feature unconditionally true The feature is assumed to be true due to being over 2 years old. It's still advertised in gossip, but it's assumed to always be present.	2021-03-30 09:36:11 +02:00
Asias He	dc40184faa	gossip: Handle timeout error in gossiper::do_shadow_round Currently, the rpc timeout error for the GOSSIP_GET_ENDPOINT_STATES verb is not handled in gossiper::do_shadow_round. If the GOSSIP_GET_ENDPOINT_STATES rpc call to any of the remote nodes goes timeout, gossiper::do_shadow_round will throw an exception and fail the whole boot up process. It is fine that some of the remote nodes timeout in shadow round. It is not a must to talk to all nodes. This patch fixes an issue we saw recently in our sct tests: ``` INFO \| scylla[1579]: [shard 0] init - Shutting down gossiping INFO \| scylla[1579]: [shard 0] gossip - gossip is already stopped INFO \| scylla[1579]: [shard 0] init - Shutting down gossiping was successful ... ERR \| scylla[1579]: [shard 0] init - Startup failed: seastar::rpc::timeout_error (rpc call timed out) ``` Fixes #8187 Closes #8213	2021-03-08 13:03:41 +01:00
Botond Dénes	5c84aa52db	gms: add RANGE_SCAN_DATA_VARIANT cluster feature To control the transition to the data variant of range scans. As there is a difference in how the data and mutation variants calculate pages sizes, the transition to the former has to happen in a controlled manner, when all nodes in the cluster support it, to avoid artificial differences in page content and subsequently triggering false-positive read repair.	2021-03-02 07:53:53 +02:00
Eliran Sinvani	63b794d104	schema: recalculate digest when computed_columns feature is enabled The schema digest is affected by the computed_columns feature, this means that we have to recalculate our schema digest when this feature is enabled.	2021-02-11 13:48:58 +02:00
Piotr Sarna	8f98c0585f	failure_detector: add a missing const qualifier The mean() method is effectively const, so it should be marked as such. Message-Id: <14dd39e8419136909fcf10508c34de3752faa7fe.1612953601.git.sarna@scylladb.com>	2021-02-10 13:04:37 +02:00
Piotr Sarna	faca59efa6	failure_detector: add getting last update time point It can be useful to use the information how long ago an endpoint responded to heartbeat.	2021-02-08 16:45:58 +01:00
Piotr Sarna	d23584c8f7	failure_detector: return arrival samples by const reference There's no point in always returning the whole map by value - callers can decide to copy the map of their own if need be.	2021-02-08 11:50:32 +01:00
Piotr Sarna	445e6e44f4	failure_detector: remove unimplemented is_alive method The method was never implemented, so it makes no sense to keep it in the header.	2021-02-08 11:49:50 +01:00
Asias He	c82250e0cf	gossip: Allow deferring advertise of local node to be up Currently the replacing node sets the status as STATUS_UNKNOWN when it starts gossip service for the first time before it sets the status to HIBERNATE to start the replacing operation. This introduces the following race: 1) Replacing node using the same IP address of the node to be replaced starts gossip service without setting the gossip STATUS (will be seen as STATUS_UNKNOWN by other nodes) 2) Replacing node waits for gossip to settle and learns status and tokens of existing nodes 3) Replacing node announces the HIBERNATE STATUS. After Step 1 and before Step 3, existing nodes will mark the replacing node as UP, but haven't marked the replacing node as doing replacing yet. As a result, the replacing node will not be excluded from the read replicas and will be considered a target node to serve CQL reads. To fix, we make the replacing node avoid responding echo message when it is not ready. Fixes #7312 Closes #7714	2021-01-26 19:02:11 +01:00
Juliusz Stasiewicz	b150906d39	gossip: Added SNITCH_NAME to `application_state` Snitch name needs to be exchanged within cluster once, on shadow round, so joining nodes cannot use wrong snitch. The snitch names are compared on bootstrap and on normal node start. If the cluster already used mixed snitches, the upgrade to this version will fail. In this case customer needs to add a node with correct snitch for every node with the wrong snitch, then put down the nodes with the wrong snitch and only then do the upgrade. Fixes #6832 Closes #7739	2020-12-09 15:45:25 +02:00
Asias He	0a3a2a82e1	api: Add force_remove_endpoint for gossip It is used to force remove a node from gossip membership if something goes wrong. Note: run the force_remove_endpoint api at the same time on _all_ the nodes in the cluster in order to prevent the removed nodes come back. Becasue nodes without running the force_remove_endpoint api cmd can gossip around the removed node information to other nodes in 2 * ring_delay (2 * 30 seconds by default) time. For instance, in a 3 nodes cluster, node 3 is decommissioned, to remove node 3 from gossip membership prior the auto removal (3 days by default), run the api cmd on both node 1 and node 2 at the same time. $ curl -X POST --header "Accept: application/json" "http://127.0.0.1:10000/gossiper/force_remove_endpoint/127.0.0.3" $ curl -X POST --header "Accept: application/json" "http://127.0.0.2:10000/gossiper/force_remove_endpoint/127.0.0.3" Then run 'nodetool gossipinfo' on all the nodes to check the removed nodes are not present. Fixes #2134 Closes #5436	2020-11-29 13:58:46 +02:00
Piotr Jastrzebski	d2897d8f8b	alternator: guard streams with an experimental flag Add new alternator-streams experimental flag for alternator streams control. CDC becomes GA and won't be guarded by an experimental flag any more. Alternator Streams stay experimental so now they need to be controlled by their own experimental flag. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2020-11-12 12:36:16 +01:00
Piotr Jastrzebski	e9072542c1	Mark CDC as GA Enable CDC by default. Rename CDC experimental feature to UNUSED_CDC to keep accepting cdc flag. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2020-11-12 12:36:13 +01:00
Benny Halevy	a0436ea324	gossiper: convert to shared_token_metadata get() the latest token_metadata& from the shared_token_metadata before each use. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2020-11-11 14:20:23 +02:00
Benny Halevy	29ed59f8c4	main: start a shared_token_metadata And use it to get a token_metadata& compatible with current usage, until the services are converted to use token_metadata_ptr. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2020-11-11 14:20:23 +02:00
Nadav Har'El	7ff72b0ba5	Merge 'secondary_index: fix returned rows token ordering' from Piotr Grabowski Fixes returned rows ordering to proper signed token ordering. Before this change, rows were sorted by token, but using unsigned comparison, meaning that negative tokens appeared after positive tokens. Rename `token_column_computation` to `legacy_token_column_computation` and add some comments describing this computation. Added (new) `token_column_computation` which returns token as `long_type`, which is sorted using signed comparison - the correct ordering of tokens. Add new `correct_idx_token_in_secondary_index` feature, which flags that the whole cluster is able to use new `token_column_computation`. Switch token computation in secondary indexes to (new) `token_column_computation`, which fixes the ordering. This column computation type is only set if cluster supports `correct_idx_token_in_secondary_index` feature to make sure that all nodes will be able to compute new `token_column_computation`. Also old indexes will need to be rebuilt to take advantage of this fix, as new token column computation type is only set for new indexes. Fix tests according to new token ordering and add one new test to validate this aspect explicitly. Fixes #7443 Tested manually a scenario when someone created an index on old version of Scylla and then migrated to new Scylla. Old index continued to work properly (but returning in wrong order). Upon dropping and re-creating the index, it still returned the same data, but now in correct order. Closes #7534 * github.com:scylladb/scylla: tests: add token ordering test of indexed selects tests: fix tests according to new token ordering secondary_index: use new token_column_computation feature: add correct_idx_token_in_secondary_index column_computation: add token_column_computation token_column_computation: rename as legacy	2020-11-05 18:44:49 +01:00
Piotr Grabowski	6624d933c9	feature: add correct_idx_token_in_secondary_index Add new correct_idx_token_in_secondary_index feature, which will be used to determine if all nodes in the cluster support new token_column_computation. This column computation will replace legacy_token_column_computation in secondary indexes, which was incorrect as this column computation produced values that when compared with unsigned comparison (CQL type bytes comparison) resulted in different ordering than token signed comparison. See issue: https://github.com/scylladb/scylla/issues/7443	2020-11-04 12:02:42 +01:00
Benny Halevy	e4614d4836	gossiper: mark trivial methods noexcept Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2020-11-01 16:46:47 +02:00
Benny Halevy	1ba4c84ae2	gossiper: get_cluster_name, get_partitioner_name: make noexcept These methods can return a const sstring& rather than allocating a sstring. And with that they can be marked noexcept. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2020-11-01 16:46:29 +02:00
Benny Halevy	11a8912093	gossiper: get_gossip_status: return string_view and make noexcept Change get_gossip_status to return string_view, and with that it can be noexcept now that it doesn't allocate memory via sstring. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2020-11-01 16:46:18 +02:00
Benny Halevy	126e486fde	gms/endpoint_state: mark methods using get_status noexcept Now that get_status returns string_view, just compare it with a const char* rather than making a sstring out of it, and consequently, can be marked noexcept. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2020-11-01 16:46:18 +02:00
Benny Halevy	6b9191b6c2	gms/endpoint_state: get_status: return string_view and make noexcept get_status doesn't need to allocate a sstring, it can just return a std::string_view to the status string, if found. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2020-11-01 16:46:18 +02:00
Benny Halevy	232c665bab	gms/endpoint_state: mark get_application_state_ptr and is_cql_ready noexcept Although std::map::find is not guaranteed to be noexcept it depends on the comperator used and in this case comparing application_state is noexcept. Therefore, we can safely mark get_application_state_ptr noexcept. is_cql_ready depends on get_application_state_ptr and otherwise handles an exceptions boost::lexical_cast so it can be marked noexcept as well. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2020-11-01 16:46:18 +02:00
Benny Halevy	5d8e2c038b	gms/endpoint_state: mark trivial methods noexcept Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2020-11-01 16:46:18 +02:00
Benny Halevy	d4c364507e	gms/heart_beat_state: mark methods noexcept Now that get_next_version() is noexcept, update_heart_beat can be noexcept too. All others are trivially noexcept. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2020-11-01 16:46:18 +02:00
Benny Halevy	68a2920201	gms/versioned_value: mark trivial methods noexcept Also, versioned_value::compare_to() can be marked const. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2020-11-01 16:46:18 +02:00
Benny Halevy	c295f521b9	gms/version_generator: mark get_next_version noexcept It is trivially so. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2020-11-01 16:46:18 +02:00
Benny Halevy	e28d80ec0c	messaging: msg_addr: mark methods noexcept Based on gms::inet_address. With that, gossiper::get_msg_addr can be marked noexcept (and const while at it). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2020-11-01 16:46:18 +02:00
Benny Halevy	232fc19525	gms/inet_address: mark methods noexcept Based on the corresponding net::inet_address calls. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2020-11-01 16:46:18 +02:00
Asias He	d47033837a	gossiper: Use dedicated gossip scheduling group Gossip currently runs inside the default (main) scheduling group. It is fine to run inside default scheduling group. From time to time, we see many tasks in main scheduling group and we suspect gossip. It is best we can move gossip to a dedicated scheduling group, so that we can catch bugs that leak tasks to main group more easily. After this patch, we can check: scylla_scheduler_time_spent_on_task_quota_violations_ms{group="gossip",shard="0"} Fixes: #7154 Tests: unit(dev)	2020-10-29 12:53:37 +02:00
Tomasz Grabiec	14fdd2f501	Merge "Gossip echo message improvement" from Asias This series improves gossip echo message handling in a loaded cluster. Refs: #7197 * git://github.com/asias/scylla.git gossip_echo_improve_7197: gossiper: Handle echo message on any shard gossiper: Increase echo message timeout gossiper: Remove unused _last_processed_message_at	2020-09-24 15:13:55 +02:00
Asias He	88b7587755	gossiper: Handle echo message on any shard Echo message does not need to access gossip internal states, we can run it on all shards and avoid forwarding to shard zero. This makes gossip marking node up more robust when shard zero is loaded. There is an argument that we should make echo message return only when all shards have responded so that all shards are live and responding. However, in a heavily loaded cluster, one shard might be overloaded on multiple nodes in the cluster at the same time. If we require echo response on all shards, we have a chance local node will mark all peer nodes as down. As a result, the whole cluster is down. This is much worse than not excluding a node with a slow shard from a cluster. Refs: #7197	2020-09-24 10:10:54 +08:00
Asias He	173d115a64	gossiper: Remove unused _last_processed_message_at It is not used any more. We can get rid of it. Refs: #7197	2020-09-24 09:48:54 +08:00
Pavel Emelyanov	a75b048616	gossiper: Unregister verbs if shadow round aborts start The gossiper verbs are registered in two places -- start_gossiping and do_shadow_round(). And unregistered in one -- stop_gossiping iff the start took place. Respectively, there's a chance that after a shadow round scylla exits without starting gossiping thus leaving verbs armed. Fix by unregistering verbs on stop if they are still registered. fixes: #7262 tests: manual(start, abort start after shadow round), unit(dev) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20200921140357.24495-1-xemul@scylladb.com>	2020-09-22 10:18:01 +02:00
Piotr Sarna	dd085b146a	gms: add comments for deprecated features Features which are propagated to other nodes via gossip, but assumed they are supported in the code, are now marked with comments.	2020-09-14 12:59:19 +02:00
Piotr Sarna	defe6f49df	gms: remove unused feature bits Checks for features introduced over 2 years ago were removed in previous commits, so all that is left is removing the feature bits itself. Note that the feature strings are still sent to other nodes just to be double sure, but the new code assumes that all these features are implicitly enabled.	2020-09-14 12:35:28 +02:00
Piotr Sarna	21a77612b3	gms: add a cluster feature for fixed hashing The new hashing routine which properly takes null cells into account is now enabled if the whole cluster is aware of it.	2020-09-10 13:16:44 +02:00
Pavel Emelyanov	812eed27fe	code: Force formatting of pointer in .debug and .trace ... and tests. Printin a pointer in logs is considered to be a bad practice, so the proposal is to keep this explicit (with fmt::ptr) and allow it for .debug and .trace cases. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-08-26 20:44:11 +03:00
Asias He	6cadf4e4fa	gossip: Apply state for local node in shadow round We saw errors in killed_wiped_node_cannot_join_test dtest: Aug 2020 10:30:43 [node4] Missing: ['A node with address 127.0.76.4 already exists, cancelling join']: The test does: n1, n2, n3, n4 wipe data on n4 start n4 again with the same ip address Without this patch, n4 will bootstrap into the cluster new tokens. We should prevent n4 to bootstrap because there is an existing node in the cluster. In shadow round, the local node should apply the application state of the node with the same ip address. This is useful to detect a node trying to bootstrap with the same IP address of an existing node. Tests: bootstrap_test.py Fixes: #7073	2020-08-25 12:53:59 +03:00

1 2 3 4 5 ...

637 Commits