scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-28 10:41:12 +00:00

Author	SHA1	Message	Date
Kamil Braun	3ee56e1936	Merge 'raft topology: enable writes to previous CDC generations' from Patryk Jędrzejczak When we create a CDC generation and ring-delay is non-zero, the timestamp of the new generation is in the future. Hence, we can have multiple generations that can be written to. However, if we add a new node to the cluster with the Raft-based topology, it receives only the last committed generation. So, this node will be rejecting writes considered correct by the other nodes until the last committed generation starts operating. In scylladb/scylladb#17134, we have allowed sending writes to the previous CDC generations. So, the situation became even more complicated. This PR adjusts the Raft-based topology to ensure all required generations are loaded into memory and their data isn't cleared too early. To load all required generations into memory, we replace `current_cdc_generation_{uuid, timestamp}` with the set containing IDs of all committed generations - `committed_cdc_generations`. To ensure this set doesn't grow endlessly, we remove an entry from this set together with the data in CDC_GENERATIONS_V3. Currently, we may clear a CDC generation's data from CDC_GENERATIONS_V3 if it is not the last committed generation and it is at least 24 hours old (according to the topology coordinator's clock). However, after allowing writes to the previous CDC generations, this condition became incorrect. We might clear data of a generation that could still be written to. The new solution introduced in this PR is to clear data of the generations that finished operating more than 24 hours ago. Apart from the changes mentioned above, this PR hardens `test_cdc_generation_clearing.py`. 
Fixes scylladb/scylladb#16916 Fixes scylladb/scylladb#17184 Fixes scylladb/scylladb#17288 Closes scylladb/scylladb#17374 * github.com:scylladb/scylladb: test: harden test_cdc_generation_clearing test: test clean-up of committed_cdc_generations raft topology: clean committed_cdc_generations raft topology: clean only obsolete CDC generations' data storage_service: topology_state_load: load all committed CDC generations system_keyspace: load_topology_state: fix indentation raft topology: store committed CDC generations' IDs in the topology	2024-02-22 11:41:25 +01:00
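The new clearing rule described in this merge can be sketched as follows. This is a minimal illustration, not the ScyllaDB implementation: the function name, the representation of generations as a sorted list of start timestamps, and the reading of "finished operating" as "the next generation has started operating" are all assumptions.

```python
from datetime import datetime, timedelta

# 24 hours, as stated in the merge message.
RETENTION = timedelta(hours=24)

def obsolete_generations(start_times, now, retention=RETENTION):
    """Return the start timestamps of generations whose data may be cleared.

    start_times: generation start timestamps, sorted ascending (hypothetical
    stand-in for the committed generations kept in the topology).
    """
    obsolete = []
    # Pair every generation with its successor; the last one has no successor
    # and is therefore never cleared.
    for current, successor in zip(start_times, start_times[1:]):
        # Assumption: a generation stops operating when its successor starts.
        if successor < now - retention:
            obsolete.append(current)
    return obsolete
```

With generations starting 72 hours ago, 30 hours ago, 1 hour ago, and 1 hour from now, only the first one is returned. The second is more than 24 hours old, so the previous rule would have cleared it, but it stopped operating only an hour ago and can still be written to.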
Anna Stuchlik	37237407f6	doc: remove info about outdated versions This PR removes information about outdated versions, including disclaimers and information when a given feature was added. Now that the documentation is versioned, information about outdated versions is unnecessary (and makes the docs harder to read). Fixes https://github.com/scylladb/scylladb/issues/12110 Closes scylladb/scylladb#17430	2024-02-20 19:32:13 +02:00
Avi Kivity	93af3dd69b	Merge 'Maintenance socket: set filesystem permissions to 660' from Mikołaj Grzebieluch Set filesystem permissions for the maintenance socket to 660 (previously it was 755) to allow a scyllaadm's group to connect. Split the logic of creating sockets into two separate functions, one for each case: when it is a regular cql controller or used by maintenance_socket. Fixes https://github.com/scylladb/scylladb/issues/16487. Closes scylladb/scylladb#17113 * github.com:scylladb/scylladb: maintenance_socket: add option to set owning group transport/controller: get rid of magic number for socket path's maximal length transport/controller: set unix_domain_socket_permissions for maintenance_socket transport/controller: pass unix_domain_socket_permissions to generic_server::listen transport/controller: split configuring sockets into separate functions	2024-02-20 15:09:54 +02:00
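The permission change in this merge can be illustrated with a short Python sketch. It is not the code from `transport/controller`; the function name and parameters are hypothetical, and it only shows the general technique of binding a Unix domain socket and restricting it to owner and group (mode 660), optionally changing the owning group.

```python
import os
import shutil
import socket
import stat

def listen_on_unix_socket(path, mode=0o660, group=None):
    """Bind a Unix domain socket at `path` and restrict its permissions.

    `group` is optional and mirrors the idea of an owning-group option;
    changing it normally requires suitable privileges.
    """
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.bind(path)
    # rw for owner and group, nothing for others.
    os.chmod(path, mode)
    if group is not None:
        shutil.chown(path, group=group)
    sock.listen()
    return sock
```

Note that the socket file exists with default permissions for a short time between `bind` and `chmod`; a production implementation would need to account for that window.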
Patryk Jędrzejczak	e145e758eb	raft topology: store committed CDC generations' IDs in the topology When we create a CDC generation and ring-delay is non-zero, the timestamp of the new generation is in the future. Hence, we can have multiple generations that can be written to. However, if we add a new node to the cluster with the Raft-based topology, it receives only the last committed generation. So, this node will be rejecting writes considered correct by the other nodes until the last committed generation starts operating. In scylladb/scylladb#17134, we have allowed sending writes to the previous CDC generations. So, the situation became even more complicated. We need to adjust the Raft-based topology to ensure all required generations are loaded into memory and their data isn't cleared too early. This patch is the first step of the adjustment. We replace `current_cdc_generation_{uuid, timestamp}` with the set containing IDs of all committed generations - `committed_cdc_generations`. This set is sorted by timestamps, just like `unpublished_cdc_generations`. This patch is mostly refactoring. The last generation in `committed_cdc_generations` is the equivalent of the previous `current_cdc_generation_{uuid, timestamp}`. The other generations are irrelevant for now. They will be used in the following patches. After introducing `committed_cdc_generations`, a newly committed generation is also unpublished (it was current and unpublished before the patch). We introduce `add_new_committed_cdc_generation`, which updates both sets of generations so that we don't have to call `add_committed_cdc_generation` and `add_unpublished_cdc_generation` together. It's easy to forget that both of them are necessary. Before this patch, there was no call to `add_unpublished_cdc_generation` in `topology_coordinator::build_coordinator_state`. It was a bug reported in scylladb/scylladb#17288. This patch fixes it. 
This patch also removes "the current generation" notion from the Raft-based topology. For the Raft-based topology, the current generation was the last committed generation. However, for the `cdc::metadata`, it was the generation operating now. These two generations could be different, which was confusing. For the `cdc::metadata`, the current generation is relevant as it is handled differently, but for the Raft-based topology, it isn't. Therefore, we change only the Raft-based topology. The generation called "current" is called "the last committed" from now.	2024-02-20 12:35:16 +01:00
Anna Stuchlik	69ead0142d	doc: remove outdated/invalid entries from FAQ This commit removes outdated or invalid FAQ entries specified in https://github.com/scylladb/scylladb/issues/16631 In addition, the questions about Cassandra compatibility are removed as they are already answered on the forum: https://forum.scylladb.com/t/which-cassandra-version-is-scylladb-it-compatible-with/84 Also, the incorrect entry about the cache has been removed and the correct answer is added to the forum. Fixes https://github.com/scylladb/scylladb/issues/17003 The question about troubleshooting performance issues has also been removed, as it's already covered on the Forum. Also, it removes the Apache copyright entry, which should not be added to the FAQ page. Closes scylladb/scylladb#17200	2024-02-20 08:43:58 +02:00
Anna Stuchlik	4f8f183736	doc: remove SSTable2json from the docs This commit removes the SSTable2json documentation, as well as the links to the removed page. In addition, it adds a redirection for that page to prevent 404. Fixes https://github.com/scylladb/scylladb/issues/17204 Closes scylladb/scylladb#17340	2024-02-20 08:43:27 +02:00
Mikołaj Grzebieluch	182cfebe40	maintenance_socket: add option to set owning group Option `maintenance-socket-group` sets the owning group of the maintenance socket. If not set, the group will be the same as the user running the scylla node.	2024-02-19 10:21:00 +01:00
Anna Stuchlik	ef1468d5ec	doc: remove Enterprise OS support from Open Source With this commit: - The information about ScyllaDB Enterprise OS support is removed from the Open Source documentation. - The information about ScyllaDB Open Source OS support is moved to the os-support-info file in the _common folder. - The os-support-info file is included in the os-support page using the scylladb_include_flag directive. This update employs the solution we added with https://github.com/scylladb/scylladb/pull/16753. It allows content to be added to a page dynamically, depending on the opensource/enterprise flag. Refs https://github.com/scylladb/scylladb/issues/15484 Closes scylladb/scylladb#17310	2024-02-18 22:09:06 +02:00
Avi Kivity	9bb4482ad0	Merge 'cdc: metadata: allow sending writes to the previous generations' from Patryk Jędrzejczak Before this PR, writes to the previous CDC generations would always be rejected. After this PR, they will be accepted if the write's timestamp is greater than `now - generation_leeway`. This change was proposed around 3 years ago. The motivation was to improve user experience. If a client generates timestamps by itself and its clock is desynchronized with the clock of the node the client is connected to, there could be a period during generation switching when writes fail. We didn't consider this problem critical because the client could simply retry a failed write with a higher timestamp. Eventually, it would succeed. This approach is safe because these failed writes cannot have any side effects. However, it can be inconvenient. Writing to previous generations was proposed to improve it. The idea was rejected 3 years ago. Recently, it turned out that there is a case when the client cannot retry a write with the increased timestamp. It happens when a table uses CDC and LWT, which makes timestamps permanent. Once Paxos commits an entry with a given timestamp, Scylla will keep trying to apply that entry until it succeeds, with the same timestamp. Applying the entry involves writing to the CDC log table. If it fails, we get stuck. It's a major bug with an unknown perfect solution. Allowing writes to previous generations for `generation_leeway` is a probabilistic fix that should solve the problem in practice. Apart from this change, this PR adds tests for it and updates the documentation. This PR is sufficient to enable writes to the previous generations only in the gossiper-based topology. The Raft-based topology needs some adjustments in loading and cleaning CDC generations. These changes won't interfere with the changes introduced in this PR, so they are left for a follow-up. 
Fixes scylladb/scylladb#7251 Fixes scylladb/scylladb#15260 Closes scylladb/scylladb#17134 * github.com:scylladb/scylladb: docs: using-scylla: cdc: remove info about failing writes to old generations docs: dev: cdc: document writing to previous CDC generations test: add test_writes_to_previous_cdc_generations cdc: generation: allow increasing generation_leeway through error injection cdc: metadata: allow sending writes to the previous generations	2024-02-18 19:21:53 +02:00
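The acceptance rule for writes to previous generations stated in this merge (the write's timestamp must be greater than `now - generation_leeway`) can be sketched as below. The function name and the 5-second default are assumptions made for illustration; the log does not state the actual leeway value.

```python
from datetime import datetime, timedelta

# Assumed value, for illustration only.
GENERATION_LEEWAY = timedelta(seconds=5)

def accepts_write_to_previous_generation(write_ts, now, leeway=GENERATION_LEEWAY):
    """Return True if a write targeting a previous CDC generation is accepted.

    The write is accepted only if its timestamp is strictly greater than
    `now - leeway`, where `now` is the node's clock.
    """
    return write_ts > now - leeway
```

For example, with the assumed 5-second leeway a write stamped 1 second in the past is accepted, while one stamped 10 seconds in the past is rejected.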
Anna Stuchlik	e132ffdb60	doc: add missing redirections This commit adds the missing redirections to the pages whose source files were previously stored in the install-scylla folder and were moved to another location. Closes scylladb/scylladb#17367	2024-02-16 14:09:26 +02:00
Anna Stuchlik	710d182654	doc: update Handling Node Failures to add topology This commit updates the Handling Node Failures page to specify that the quorum requirement refers to both schema and topology updates. Closes scylladb/scylladb#17321	2024-02-14 17:15:13 +01:00
Tzach Livyatan	902733cd7e	Docs: rename doc page from REST to Admin REST API Closes scylladb/scylladb#17334	2024-02-14 13:49:54 +02:00
Anna Stuchlik	02cd84adbf	doc: remove OSS-vs-Ent Matrix from OSS docs This commit removes the Open Source vs. Enterprise matrix from the Open Source documentation. In addition, a redirection is added to prevent 404 in the OSS docs, and the link to the removed page is replaced with a link to the same page in the Enterprise docs. This commit must be reverted in enterprise.git, because we want to keep the Matrix in the Enterprise docs. Fixes https://github.com/scylladb/scylladb/issues/17289 Closes scylladb/scylladb#17295	2024-02-13 17:17:22 +02:00
Yaniv Kaul	d2ef100b60	Typos: more/less then -> more/less than Fix repeated typos in comments: more then -> more than, less then -> less than Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com> Closes scylladb/scylladb#17303	2024-02-13 17:16:15 +02:00
David Garcia	f45d9d33f1	docs: remove liveness asterisks Instead of adding an asterisk next to "liveness" linking to the glossary, we will temporarily replace them with a hyperlink pending the implementation of tooltip functionality. Closes scylladb/scylladb#17244	2024-02-12 20:37:52 +02:00
Patryk Jędrzejczak	38e1ddb8bc	docs: using-scylla: cdc: remove info about failing writes to old generations In one of the previous patches, we have allowed writing to the previous CDC generations for `generation_leeway`. This change has made the information about failing writes to the previous generation and the "rejecting writes to an old generation" example obsolete so we remove them. After the change, a write can only fail if its timestamp is distant from the node's timestamp. We add the information about it.	2024-02-12 10:14:00 +01:00
Patryk Jędrzejczak	9b923f8b81	docs: dev: cdc: document writing to previous CDC generations We update the dev documentation after allowing writes to the previous CDC generations in one of the previous patches.	2024-02-12 10:14:00 +01:00
Kamil Braun	e9e24f47ec	Merge 'raft topology: implement upgrade and recovery procedure' from Piotr Dulikowski This PR implements a procedure that upgrades existing clusters to use raft-based topology operations. The procedure does not start automatically; it must be triggered manually by the administrator after making sure that no topology operations are currently running. Upgrade is triggered by sending a `POST /storage_service/raft_topology/upgrade` request. This starts the topology coordinator, which drives the rest of the process: it builds the `system.topology` state based on information observed in gossip and tells all nodes to switch to raft mode. Then, the topology coordinator runs normally. Upgrade progress is tracked in a new static column `upgrade_state` in `system.topology`. The procedure also serves as an extension to the current recovery procedure on raft. The current recovery procedure requires restarting nodes in a special mode which disables raft, performing `nodetool removenode` on the dead nodes, cleaning up some state on the nodes, and restarting them so that they automatically rebuild the group 0. Raft topology fits into the existing procedure by falling back to legacy topology operations after disabling raft. After rebuilding the group 0, upgrade needs to be triggered again. Because upgrade is manual and it might not be convenient for administrators to run it right after upgrading the cluster, we allow the cluster to operate in legacy topology operations mode until upgrade, which includes allowing new nodes to join. In order to allow it, nodes now ask the cluster about the mode they should use to join before proceeding by using a new `JOIN_NODE_QUERY` RPC. The procedure is explained in more detail in `topology-over-raft.md`. 
Fixes: https://github.com/scylladb/scylladb/issues/15008 Closes scylladb/scylladb#17077 * github.com:scylladb/scylladb: test/topology_custom: upgrade/recovery tests for topology on raft cdc/generation_service: in legacy mode, fall back to raft tables system_keyspace: add read_cdc_generation_opt cdc/generation_service: turn off gossip notifications in raft topo mode cql_test_env: move raft_topology_change_enabled var earlier group0_state_machine: pull snapshot after raft topology feature enabled storage_service: disable persistent feature enabler on upgrade storage_service: replicate raft features to system.peers storage_service: gossip tokens and cdc generation in raft topology mode API: add api for triggering and monitoring topology-on-raft upgrade storage_service: infer which topology operations to use on startup storage_service: set the topology kind value based on group 0 state raft_group0: expose link to the upgrade doc in the header feature_service: fall back to checking legacy features on startup storage_service: add fiber for tracking the topology upgrade progress gms: feature_service: add SUPPORTS_CONSISTENT_TOPOLOGY_CHANGES topology_coordinator: implement core upgrade logic topology_coordinator: extract top-level error handling logic storage_service: initialize discovery leader's state earlier topology_coordinator: allow for custom sharding info in prepare_and_broadcast_cdc_generation_data topology_coordinator: allow for custom sharding info in prepare_new_cdc_generation_data topology_coordinator: remove outdated fixme in prepare_new_cdc_generation_data topology_state_machine: introduce upgrade_state storage_service: disallow topology ops when upgrade is in progress raft_group0_client: add in_recovery method storage_service: introduce join_node_query verb raft_group0: make discover_group0 public raft_group0: filter current node's IP in discover_group0 raft_group0: remove my_id arg from discover_group0 storage_service: make _raft_topology_change_enabled 
more advanced docs: document raft topology upgrade and recovery	2024-02-09 11:54:53 +01:00
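As a rough sketch, the upgrade request named in this merge could be prepared from Python as shown below. Only the path `POST /storage_service/raft_topology/upgrade` comes from the message; the function name, host, and port are assumptions (adjust them to wherever the node's REST API listens), and the request is only constructed here, not sent.

```python
import urllib.request

# Path taken from the merge message.
UPGRADE_PATH = "/storage_service/raft_topology/upgrade"

def build_upgrade_request(host="localhost", port=10000):
    """Build (but do not send) the POST request that triggers the upgrade.

    `host` and `port` are assumed defaults for illustration.
    """
    url = f"http://{host}:{port}{UPGRADE_PATH}"
    return urllib.request.Request(url, data=b"", method="POST")
```

Sending it would be a matter of passing the result to `urllib.request.urlopen`, after confirming that no topology operations are running, as the message requires.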
Kefu Chai	64c829da70	docs: reformat the state machine diagram using mermaid for better readability Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#16620	2024-02-08 19:43:53 +02:00
Piotr Dulikowski	1104f8b00f	docs: document raft topology upgrade and recovery	2024-02-07 09:54:54 +01:00
David Garcia	f14edf3543	docs: correct image sorting order for reference docs This commit displays images in reference docs in the correct order. Prior to this fix, the images were listed as 4.0.0, 4.0.1, and 4.0.2, but they should be sorted in reverse order (4.0.2, 4.0.1, 4.0.0). The changes made in this PR resolve the issue introduced in https://github.com/scylladb/scylladb/pull/16942 when common functions for Azure and GCP were extracted into a separate file without reversing the list as defined in the original extension: https://github.com/scylladb/scylladb/pull/16942/files#diff-b8f6253ea8fdcca681deb556ca61cd1f3feb3b7aeb7e856b145ef9b685aad460L185 Closes scylladb/scylladb#17185	2024-02-06 16:24:22 +02:00
David Garcia	ad1c9ae452	docs: fix logging in images extensions Adds a missing logging import in the file scylladb_common_images extension, which prevents the enterprise build from building. Additionally, it standardizes logging handling across the extensions and removes "ami" references in Azure and GCP extensions. Closes scylladb/scylladb#17137	2024-02-06 13:00:37 +02:00
Botond Dénes	115ee4e1f5	Merge 'doc: remove the OSS and Enterprise Features pages' from Anna Stuchlik This PR removes the following pages: - ScyllaDB Open Source Features - ScyllaDB Enterprise Features They were outdated, incomplete, and misleading. They were also redundant, as the per-release updates are added as Release Notes. With this update, the features listed on the removed pages are added under the common page: ScyllaDB Features. In addition, a reference to the Enterprise-only Features section is added. Note: No redirections are added because no file paths or URLs are changed with this PR. Fixes https://github.com/scylladb/scylladb/issues/13485 Refs https://github.com/scylladb/scylladb/issues/16496 (nobackport) Closes scylladb/scylladb#17150 * github.com:scylladb/scylladb: Update docs/using-scylla/features.rst doc: remove the OSS and Enterprise Features pages	2024-02-06 08:17:18 +02:00
Botond Dénes	edb983d165	Merge 'doc: add the 5.4-to-2024.1 upgrade guide' from Anna Stuchlik This PR: - Adds the upgrade guide from ScyllaDB Open Source 5.4 to ScyllaDB Enterprise 2024.1. Note: The need to include the "Restore system tables" step in rollback has been confirmed; see https://github.com/scylladb/scylladb/issues/11907#issuecomment-1842657959. - Removes the 5.1-to-2022.2 upgrade guide (unsupported versions). Fixes https://github.com/scylladb/scylladb/issues/16445 Closes scylladb/scylladb#16887 * github.com:scylladb/scylladb: doc: fix the OSS version number doc: metric updates between 2024.1. and 5.4 doc: remove the 5.1-to-2022.2 upgrade guide doc: add the 5.4-to-2024.1 upgrade guide	2024-02-06 08:16:05 +02:00
Anna Stuchlik	d6723134ab	doc: fix the OSS version number Replace "5.2" with "5.4", as this is the 5.4-to-2024.1 upgrade guide.	2024-02-05 21:10:50 +01:00
Kamil Braun	968d1e3e78	Merge 'raft topology: make rollback_to_normal a transition state' from Patryk Jędrzejczak After changing `left_token_ring` from a node state to a transition state in scylladb/scylladb#17009, we do the same for `rollback_to_normal`. `rollback_to_normal` was created as a node state because `left_token_ring` was a node state. This change will allow us to distinguish a failed removenode from a failed decommission in the `rollback_to_normal` handler. Currently, we use the same logic for both of them, so it's not required. However, this might change, as it has happened with the decommission and the failed bootstrap/replace in the `left_token_ring` state (scylladb/scylladb#16797). We are making this change now because it would be much harder after branching. Fixes scylladb/scylladb#17032 Closes scylladb/scylladb#17136 * github.com:scylladb/scylladb: docs: dev: topology-over-raft: align indentation docs: dev: topology-over-raft: document the rollback_to_normal state topology_coordinator: improve logs in rollback_to_normal handler raft topology: make rollback_to_normal a transition state	2024-02-05 16:30:20 +01:00
Anna Stuchlik	6d6c400b77	doc: metric updates between 2024.1. and 5.4 This commit adds the information about metrics updates between these two versions. Fixes https://github.com/scylladb/scylladb/issues/16446	2024-02-05 16:24:16 +01:00
Anna Stuchlik	1e9c7ab6d1	Update docs/using-scylla/features.rst Co-authored-by: Tzach Livyatan <tzach.livyatan@gmail.com>	2024-02-05 14:44:31 +01:00
Anna Stuchlik	f7afa6773f	doc: remove the OSS and Enterprise Features pages This commit removes the following pages: - ScyllaDB Open Source Features - ScyllaDB Enterprise Features They were outdated, incomplete, and misleading. They were also redundant, as the per-release updates are added as Release Notes. With this update, the features listed on the removed pages are added under the common page: ScyllaDB Features. Note: No redirections are added, because no file paths or URLs are changed with this commit. Fixes https://github.com/scylladb/scylladb/issues/13485 Refs https://github.com/scylladb/scylladb/issues/16496	2024-02-04 20:55:40 +01:00
Patryk Jędrzejczak	2687204c7f	docs: dev: topology-over-raft: align indentation	2024-02-02 16:55:28 +01:00
Patryk Jędrzejczak	fdd3c3a280	docs: dev: topology-over-raft: document the rollback_to_normal state In one of the previous patches, we changed the `rollback_to_normal` state from a node state to a transition state. We document it in this patch. The node state wasn't documented, so there is nothing to remove.	2024-02-02 16:55:28 +01:00
Kefu Chai	792fa4441e	docs: s/ontop/on top/ this misspelling is identified by codespell. ontop cannot be found on merriam-webster, but "on top" can, see https://www.merriam-webster.com/dictionary/on%20top, so let's replace ontop with "on top". Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#17127	2024-02-02 15:20:40 +01:00
Avi Kivity	c8397f0287	Merge 'Implement tablet splitting' from Raphael "Raph" Carvalho The motivation for tablet resizing is that we want to keep the average tablet size reasonable, such that load rebalancing can remain efficient. A tablet that is too large makes migration inefficient, therefore slowing down the balancer. If the avg size grows beyond the upper bound (split threshold), then balancer decides to split. Split spans all tablets of a table, due to power-of-two constraint. Likewise, if the avg size decreases below the lower bound (merge threshold), then merge takes place in order to grow the avg size. Merge is not implemented yet, although this series lays the foundation for it to be implemented later on. A resize decision can be revoked if the avg size changes and the decision is no longer needed. For example, let's say table is being split and avg size drops below the target size (which is 50% of split threshold and 100% of merge one). That means after split, the avg size would drop below the merge threshold, causing a merge after split, which is wasteful, so it's better to just cancel the split. Tablet metadata gains 2 new fields for managing this: resize_type: resize decision type, can be either of "merge", "split", or "none". resize_seq_number: a sequence number that works as the global identifier of the decision (monotonically increasing, increased by 1 on every new decision emitted by the coordinator). A new RPC was implemented to pull stats from each table replica, such that load balancer can calculate the avg tablet size and know the "split status", for a given table. Avg size is aggregated carefully while taking RF of each DC into account (which might differ). When a table is done splitting its storage, it loads (mirrors) the resize_seq_number from tablet metadata into its local state (in other words, its split status is ready). If a table is split ready, coordinator will see that table's seq number is the same as the one in tablet metadata. 
This helps to distinguish stale decisions from the latest one (in case decisions are revoked and re-emitted later on). Also, it's aggregated carefully, by taking the minimum among all replicas, so coordinator will only update topology when all replicas are ready. When load balancer emits a split decision, replicas learn of the need to split through a "split monitor", which is awakened once a table has its replication metadata updated and detects the need for a split (i.e. the resize_type field is "split"). The split monitor will start splitting of compaction groups (using mechanism introduced here: `081f30d149`) for the table. And once splitting work is completed, the table updates its local state as having completed split. When coordinator pulls the split status of all replicas for a table via RPC, the balancer can see whether that table is ready for "finalizing" the decision, which is about updating tablet metadata to split each tablet into two. Once table replicas have their replication metadata updated with the new tablet count, they can update appropriately their set of compaction groups (that were previously split in the preparation step). Fixes #16536. 
Closes scylladb/scylladb#16580 * github.com:scylladb/scylladb: test/topology_experimental_raft: Add tablet split test replica: Bypass reshape on boot with tablets temporarily replica: Fix table::compaction_group_for_sstable() for tablet streaming test/topology_experimental_raft: Disable load balancer in test fencing replica: Remap compaction groups when tablet split is finalized service: Split tablet map when split request is finalized replica: Update table split status if completed split compaction work storage_service: Implement split monitor topology_cordinator: Generate updates for resize decisions made by balancer load_balancer: Introduce metrics for resize decisions db: Make target tablet size a live-updateable config option load_balancer: Implement resize decisions service: Wire table_resize_plan into migration_plan service: Introduce table_resize_plan tablet_mutation_builder: Add set_resize_decision() topology_coordinator: Wire load stats into load balancer storage_service: Allow tablet split and migration to happen concurrently topology_coordinator: Periodically retrieve table_load_stats locator: Introduce topology::get_datacenter_nodes() storage_service: Implement table_load_stats RPC replica: Expose table_load_stats in table replica: Introduce storage_group::live_disk_space_used() locator: Introduce table_load_stats tablets: Add resize decision metadata to tablet metadata locator: Introduce resize_decision	2024-01-31 13:59:56 +02:00
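The split-readiness check described in this merge (each replica mirrors `resize_seq_number` once it has finished splitting, and the coordinator aggregates by taking the minimum across replicas) can be sketched as follows. The function name and the list-based input are hypothetical; the real aggregation happens over stats pulled via RPC.

```python
def split_ready(replica_seq_numbers, metadata_seq_number):
    """Return True when every replica has completed the current split decision.

    replica_seq_numbers: the sequence number each replica last mirrored into
    its local state (hypothetical input shape).
    metadata_seq_number: resize_seq_number stored in tablet metadata.
    """
    if not replica_seq_numbers:
        # Assumption: with no reports, the table is not considered ready.
        return False
    # Taking the minimum means one lagging replica keeps the table not ready,
    # and a stale number from a revoked decision never matches the latest one.
    return min(replica_seq_numbers) == metadata_seq_number
```

For instance, replicas reporting `[7, 7, 6]` against a metadata value of `7` are not ready yet, while `[7, 7, 7]` allows the decision to be finalized.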
Kamil Braun	0912d2a2c6	Merge 'raft topology: make left_token_ring a transition state' from Patryk Jędrzejczak When a node is in the `left_token_ring` state, we don't know how it has ended up in this state. We cannot distinguish a node that has finished decommissioning from a node that has failed bootstrap. The main problem it causes is that we incorrectly send the `barrier_and_drain` command to a node that has failed bootstrapping or replacing. We must do it for a node that has finished decommissioning because it could still coordinate requests. However, since we cannot distinguish nodes in the `left_token_ring` state, we must send the command to all of them. This issue appeared in scylladb/scylladb#16797 and this PR is a follow-up that fixes it. The solution is changing `left_token_ring` from a node state to a transition state. Fixes scylladb/scylladb#16944 Closes scylladb/scylladb#17009 * github.com:scylladb/scylladb: docs: dev: topology-over-raft: document the left_token_ring state topology_coordinator: adjust reason string in left_token_ring handler raft topology: make left_token_ring a transition state topology_coordinator: rollback_current_topology_op: remove unused exclude_nodes	2024-01-29 15:29:01 +01:00
Beni Peled	8009170d3a	docs: update the installation instructions with the new gpg 2024 key Closes scylladb/scylladb#17019	2024-01-29 14:37:25 +02:00
Patryk Jędrzejczak	7c10cae6c4	docs: dev: topology-over-raft: document the left_token_ring state In one of the previous patches, we changed the `left_token_ring` state from a node state to a transition state. We document it in this patch. The node state wasn't documented, so there is nothing to remove.	2024-01-29 10:39:07 +01:00
Tzach Livyatan	06a9a925a5	Update link to sizing / pricing calc Closes scylladb/scylladb#17015	2024-01-29 11:07:20 +02:00
Anna Stuchlik	dfa88ccc28	doc: document nodetool resetlocalschema This adds the documentation for the nodetool resetlocalschema command. The syntax description is based on the description for Cassandra and the ScyllaDB help for nodetool. Fixes https://github.com/scylladb/scylladb/issues/16286 Closes scylladb/scylladb#16790	2024-01-28 21:09:02 +01:00
Kefu Chai	9ee6c00c84	docs: fix misspellings these misspellings are identified by codespell. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#17005	2024-01-26 13:14:21 +02:00
Kamil Braun	4f736894e1	Merge 'Add maintenance mode' from Mikołaj Grzebieluch In this mode, the node is not reachable from the outside, i.e. * it refuses all incoming RPC connections, * it does not join the cluster, thus * all group0 operations are disabled (e.g. schema changes), * all cluster-wide operations are disabled for this node (e.g. repair), * other nodes see this node as dead, * cannot read or write data from/to other nodes, * it does not open Alternator and Redis transport ports and the TCP CQL port. The only way to make CQL queries is to use the maintenance socket. The node serves only local data. To start the node in maintenance mode, use the `--maintenance-mode true` flag or set `maintenance_mode: true` in the configuration file. REST API works as usual, but some routes are disabled: * authorization_cache * failure_detector * hinted_hand_off_manager This PR also updates the maintenance socket documentation: * add cqlsh usage to the documentation * update the documentation to use `WhiteListRoundRobinPolicy` Fixes #5489. 
Closes scylladb/scylladb#15346 * github.com:scylladb/scylladb: test.py: add test for maintenance mode test.py: generalize usage of cluster_con test.py: when connecting to node in maintenance mode use maintenance socket docs: add maintenance mode documentation main: add maintenance mode main: move some REST routes initialization before joining group0 message_service: add sanity check that rpc connections are not created in the maintenance mode raft_group0_client: disable group0 operations in the maintenance mode service/storage_service: add start_maintenance_mode() method storage_service: add MAINTENANCE option to mode enum service/maintenance_mode: add maintenance_mode_enabled bool class service/maintenance_mode: move maintenance_socket_enabled definition to seperate file db/config: add maintenance mode flag docs: add cqlsh usage to maintenance socket documentation docs: update maintenance socket documentation to use WhiteListRoundRobinPolicy	2024-01-26 11:02:34 +01:00
Raphael S. Carvalho	7ed5b44d52	load_balancer: Implement resize decisions This implements the ability in load balancer to emit split or merge requests, cancel ongoing ones if they're no longer needed, and also finalize those that are ready for the topology changes. That's all based on average tablet size, collected by coordinator from all nodes, and split and merge thresholds. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-01-25 18:36:08 -03:00
Raphael S. Carvalho	0d5ba1ee4b	tablets: Add resize decision metadata to tablet metadata The new metadata describes the ongoing resize operation (can be either of merge, split or none) that spans tablets of a given table. That's managed by group0, so down nodes will be able to see the decision when they come back up and see the changes to the metadata. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-01-25 18:36:06 -03:00
Mikołaj Grzebieluch	9c07a189e8	docs: add maintenance mode documentation	2024-01-25 15:27:53 +01:00
Mikołaj Grzebieluch	81ef9fc91e	docs: add cqlsh usage to maintenance socket documentation After https://github.com/scylladb/scylla-cqlsh/pull/67, the user can use cqlsh to connect to the node by maintenance socket.	2024-01-25 15:27:53 +01:00
Mikołaj Grzebieluch	2c34d9fcd8	docs: update maintenance socket documentation to use WhiteListRoundRobinPolicy After https://github.com/scylladb/python-driver/pull/287, the user can use WhiteListRoundRobinPolicy to connect to the node by maintenance socket.	2024-01-25 14:52:24 +01:00
Avi Kivity	69d597075a	Merge 'tablets: Add support for removenode and replace handling' from Tomasz Grabiec New tablet replicas are allocated and rebuilt synchronously with node operations. They are safely rebuilt from all existing replicas. The list of ignored nodes passed to node operations is respected. Tablet scheduler is responsible for scheduling tablet rebuilding transition which changes the replicas set. The infrastructure for handling decommission in tablet scheduler is reused for this. Scheduling is done incrementally, respecting per-shard load limits. Rebuilding transitions are recognized by load calculation to affect all tablet replicas. New kind of tablet transition is introduced called "rebuild" which adds new tablet replica and rebuilds it from existing replicas. Other than that, the transition goes through the same stages as regular migration to ensure safe synchronization with request coordinators. In this PR we simply stream from all tablet replicas. Later we should switch to calling repair to avoid sending excessive amounts of data. Fixes https://github.com/scylladb/scylladb/issues/16690. Closes scylladb/scylladb#16894 * github.com:scylladb/scylladb: tests: tablets: Add tests for removenode and replace tablets: Add support for removenode and replace handling topology_coordinator: tablets: Do not fail in a tight loop topology_coordinator: tablets: Avoid warnings about ignored failured future storage_service, topology: Track excluded state in locator::topology raft topology: Introduce param-less topology::get_excluded_nodes() raft topology: Move get_excluded_nodes() to topology tablets: load_balancer: Generalize load tracking tablets: Introduce get_migration_streaming_info() which works on migration request tablets: Move migration_to_transition_info() to tablets.hh tablets: Extract get_new_replicas() which works on migraiton request tablets: Move tablet_migration_info to tablets.hh tablets: Store transition kind per tablet	2024-01-25 14:49:43 +02:00
Botond Dénes	7bb3ed7f23	docs/operating-scylla: scylla-sstable.rst: fix checksum list Add empty line before list of different checksums in validate-checksums's description. Otherwise the list is not rendered. Closes scylladb/scylladb#16401	2024-01-24 16:34:13 +01:00
Avi Kivity	4a57b67634	docs: add a rough diagram of module interaction It is incomplete and maybe inaccurate, but it is a start. Closes scylladb/scylladb#16903	2024-01-23 18:08:48 +02:00
David Garcia	77822fc51d	chore: add azure and gcp images extensions Closes scylladb/scylladb#16942	2024-01-23 16:06:40 +02:00
Anna Stuchlik	9076a944c5	doc: improve the ScyllaDB for Developers page This commit improves the developer-oriented section of the core documentation: - Added links to the developer sections in the new Get Started guide (Develop with ScyllaDB and Tutorials and Example Projects) for ease of access. - Replaced the outdated Learn to Use ScyllaDB page with a link to the up-to-date page in the Get Started guide. This involves removing the learn.rst file and adding an appropriate redirection. - Removed the Apache Copyrights, as this page does not need it. - Removed the Features panel box as there was only one feature listed, which looked weird. Also, we are in the process of removing the Features section. Closes scylladb/scylladb#16800	2024-01-23 10:06:31 +02:00

1173 Commits