Commit Graph

695 Commits

Author SHA1 Message Date
Kamil Braun
50ebce8acc Merge 'Purge old ip on change' from Petr Gusev
When a node changes its IP address, we need to remove its old IP from `system.peers` and from the gossiper.

We do this in `sync_raft_topology_nodes`, when the new IP is saved into `system.peers`, to avoid losing the mapping if the node crashes between deleting the old IP and saving the new one. We also handle possible duplicates in this case by dropping them on the read path when the node is restarted.

The PR also fixes a problem with old IPs getting resurrected when a node changes its IP address.
The following scenario is possible: a node `A` changes its IP from `ip1` to `ip2` with a restart, while other nodes are not yet aware of `ip2`, so they keep gossiping `ip1`. After the restart, `A` receives `ip1` in a gossip message and calls `handle_major_state_change`, since it considers it a new node. The `on_join` event is then fired on the gossiper notification handlers; `raft_ip_address_updater` receives this event and reverts the IP of node `A` back to `ip1`.

To fix this we ensure that the new gossiper generation number is used when a node registers its IP address in `raft_address_map` at startup.
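A minimal sketch of the generation-ordering idea, with hypothetical types rather than the real `raft_address_map` API: an update is ignored if it carries an older gossiper generation than the one already registered, so a stale `ip1` echoed back after restart cannot overwrite the freshly registered `ip2`.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical stand-ins for a host_id -> IP mapping with a gossiper generation.
struct entry {
    std::string ip;
    int64_t generation; // gossiper generation that produced this mapping
};

class address_map {
    std::unordered_map<std::string, entry> _map; // keyed by host_id
public:
    // Accept the update only if it is not older than what we already know.
    void set(const std::string& host_id, std::string ip, int64_t generation) {
        auto it = _map.find(host_id);
        if (it != _map.end() && it->second.generation > generation) {
            return; // stale gossip (e.g. old ip1 echoed back after restart) is ignored
        }
        _map[host_id] = entry{std::move(ip), generation};
    }
};
```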

The `test_change_ip` is adjusted to ensure that the old IPs are properly removed in all cases, even if the node crashes.

Fixes #16886
Fixes #16691
Fixes #17199

Closes scylladb/scylladb#17162

* github.com:scylladb/scylladb:
  test_change_ip: improve the test
  raft_ip_address_updater: remove stale IPs from gossiper
  raft_address_map: add my ip with the new generation
  system_keyspace::update_peer_info: check ep and host_id are not empty
  system_keyspace::update_peer_info: make host_id an explicit parameter
  system_keyspace::update_peer_info: remove any_set flag optimisation
  system_keyspace: remove duplicate ips for host_id
  system_keyspace: peers table: use coroutines
  storage_service::raft_ip_address_updater: log gossiper event name
  raft topology: ip change: purge old IP
  on_endpoint_change: coroutinize the lambda around sync_raft_topology_nodes
2024-02-15 17:40:29 +01:00
Petr Gusev
2bf75c1a4e system_keyspace::update_peer_info: check ep and host_id are not empty 2024-02-15 13:21:04 +04:00
Petr Gusev
86410d71d1 system_keyspace::update_peer_info: make host_id an explicit parameter
The host_id field should always be set, so it's more
appropriate to pass it as a separate parameter.

The function storage_service::get_peer_info_for_update
is updated. It shouldn't look for the host_id app
state in the passed map; instead, the callers should
get the host_id on their own.
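For illustration, a minimal sketch of the signature change (hypothetical simplified types, not the actual Scylla declarations):

```cpp
#include <optional>
#include <string>
#include <unordered_set>

// Hypothetical simplified types standing in for Scylla's own.
struct inet_address { std::string addr; };
struct host_id { std::string uuid; };
struct peer_info {
    std::optional<std::string> data_center;
    std::optional<std::string> rack;
    std::optional<std::unordered_set<std::string>> tokens;
    // ...other optional system.peers columns; host_id is no longer among them
};

// Before: host_id was looked up in the passed app-state map and could be missing.
// After: it is a mandatory, explicit argument that every caller must supply.
void update_peer_info(const inet_address& ep, const host_id& hid, const peer_info& info) {
    // The real implementation writes the row for (ep, hid) into system.peers.
    (void)ep; (void)hid; (void)info;
}
```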
2024-02-15 13:21:04 +04:00
Petr Gusev
e0072f7cb3 system_keyspace::update_peer_info: remove any_set flag optimisation
This optimization never worked -- there were four usages of
the update_peer_info function and in all of them some of
the peer_info fields were set or should be set:
* sync_raft_topology_nodes/process_normal_node: e.g. tokens is set
* sync_raft_topology_nodes/process_transition_node: host_id is set
* handle_state_normal: tokens is set
* storage_service::on_change: get_peer_info_for_update could potentially
return a peer_info with all fields set to empty, but this shouldn't
be possible; host_id should always be set.

Moreover, there is a bug here: we extract host_id from the
states_ parameter, which represents the gossiper application
states that have changed. This parameter contains host_id
only if a node changes its IP address; in all other cases host_id
is unset. This means we could end up with a record with an empty
host_id if it wasn't previously set by some other means.

We are going to fix this bug in the next commit.
2024-02-15 13:21:04 +04:00
Petr Gusev
4a14988735 system_keyspace: remove duplicate ips for host_id
When a node changes its IP we call sync_raft_topology_nodes
from raft_ip_address_updater::on_endpoint_change with
the old IP value in the prev_ip parameter.
It's possible that the node crashes right after
we insert the new IP for the host_id, but before we
remove the old one. In this commit we fix the
possible inconsistency by removing the system.peers
record with the older timestamp. This is what the new
peers_table_read_fixup function is responsible for.

We call this function in all system_keyspace methods
that read the system.peers table. The function
loads the table into memory, decides whether some rows
are stale by comparing their timestamps, and
removes them.

The new function also removes the records with no
host_id, so we no longer need the get_host_id function.

We'll add a test for the problem this commit fixes
in the next commit.
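A hedged, self-contained sketch of the read-fixup idea described above (hypothetical types and function name, not the actual peers_table_read_fixup code): group rows by host_id, keep the row with the newest write timestamp, and treat rows without a host_id or with older timestamps as stale.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

struct peer_row {
    std::string ip;
    std::optional<std::string> host_id; // may be missing in damaged rows
    int64_t write_timestamp;            // timestamp of the row's cells
};

// Returns the rows to delete; the surviving rows are the newest per host_id.
std::vector<peer_row> find_stale_rows(const std::vector<peer_row>& rows) {
    std::unordered_map<std::string, peer_row> newest; // host_id -> freshest row seen
    std::vector<peer_row> stale;
    for (const auto& r : rows) {
        if (!r.host_id) {                 // rows with no host_id are always removed
            stale.push_back(r);
            continue;
        }
        auto [it, inserted] = newest.try_emplace(*r.host_id, r);
        if (inserted) {
            continue;
        }
        if (r.write_timestamp > it->second.write_timestamp) {
            stale.push_back(it->second);  // previously-kept row is older
            it->second = r;
        } else {
            stale.push_back(r);           // this row is the older duplicate
        }
    }
    return stale;
}
```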
2024-02-15 13:21:04 +04:00
Petr Gusev
fa8718085a system_keyspace: peers table: use coroutines
This is a refactoring commit with no observable
changes in behaviour.

We switch the functions to coroutines; it will
be easier to work with them this way in the
next commit. Also, we add more `const`s
along the way.
2024-02-15 13:21:04 +04:00
Gleb Natapov
9b52dc4560 topology coordinator: make ignored_nodes list global and permanent
Currently the ignored_nodes list is part of a request (removenode or
replace) and exists only while the request is handled. This patch
changes it to be global and to exist outside of any request. Nodes stay
in the list until they are eventually removed and moved to the "left" state.
If a node is specified in the ignore-dead-nodes option for any command,
it will be ignored for all other operations that support ignored_nodes
(like tablet migration).
2024-02-13 16:15:35 +02:00
Piotr Dulikowski
fb02453686 system_keyspace: add read_cdc_generation_opt
The `system_keyspace::read_cdc_generation` loads a cdc generation from
the system tables. One of its preconditions is that the generation
exists - this precondition is quite easy to satisfy in raft mode, and
the function was designed to be used solely in that mode.

In legacy mode, however, when we revert from raft mode through
recovery, it might be necessary to use generations created in raft mode
for some time. In order to make the function useful as a fallback in
case the lookup of a generation in legacy mode fails, introduce a relaxed
variant of `read_cdc_generation` which returns std::nullopt if the
generation does not exist.
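A hedged sketch of the relaxed variant's shape (plain functions over a stand-in store instead of the real seastar futures and system-table reads): the `_opt` flavour treats absence as a normal outcome, and the strict flavour can be layered on top of it.

```cpp
#include <map>
#include <optional>
#include <stdexcept>
#include <vector>

struct cdc_generation { std::vector<int64_t> streams; };
using generation_store = std::map<int64_t, cdc_generation>; // stand-in for the system table

// Relaxed variant: a missing generation is a normal outcome, not an error.
std::optional<cdc_generation> read_cdc_generation_opt(const generation_store& s, int64_t id) {
    auto it = s.find(id);
    if (it == s.end()) {
        return std::nullopt;
    }
    return it->second;
}

// Strict variant keeps its precondition and can be expressed via the relaxed one.
cdc_generation read_cdc_generation(const generation_store& s, int64_t id) {
    if (auto gen = read_cdc_generation_opt(s, id)) {
        return *gen;
    }
    throw std::runtime_error("CDC generation not found");
}
```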
2024-02-08 19:12:28 +01:00
Piotr Dulikowski
3513a07d8a feature_service: fall back to checking legacy features on startup
When checking features on startup (i.e. whether support for any feature
was revoked in an unsafe way), it might happen that upgrade to raft
topology didn't finish yet. In that case, instead of loading an empty
set of features - which supposedly represents the set of features that
were enabled until last boot - we should fall back to loading the set
from the legacy `enabled_features` key in `system.scylla_local`.
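A minimal sketch of the fallback, with both feature sets passed in as plain parameters (the real code reads them from the system tables):

```cpp
#include <set>
#include <string>

using feature_set = std::set<std::string>;

// If the raft-topology source yields nothing (the upgrade has not finished),
// trust the legacy "enabled_features" key from system.scylla_local instead.
feature_set previously_enabled_features(feature_set from_raft_topology,
                                        feature_set from_legacy_key) {
    if (!from_raft_topology.empty()) {
        return from_raft_topology;
    }
    return from_legacy_key;
}
```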
2024-02-08 19:12:28 +01:00
Piotr Dulikowski
32a2e24a0f topology_state_machine: introduce upgrade_state
`upgrade_state` is a static column which will be used to track the
progress of building the topology state machine.
2024-02-08 18:05:02 +01:00
Patryk Jędrzejczak
25b90f5554 raft topology: make rollback_to_normal a transition state
After changing `left_token_ring` from a node state to a transition
state in scylladb/scylladb#17009, we do the same for
`rollback_to_normal`. `rollback_to_normal` was created as a node
state because `left_token_ring` was a node state.

This change will allow us to distinguish a failed removenode from
a failed decommission in the `rollback_to_normal` handler.
Currently, we use the same logic for both of them, so the distinction is not
required. However, this might change, as it already has for
decommission and the failed bootstrap/replace in the
`left_token_ring` state (scylladb/scylladb#16797). We are making
this change now because it would be much harder after branching.

The change also simplifies the code in
`topology_coordinator::rollback_current_topology_op`.

Moving the `rollback_to_normal` handler from
`handle_node_transition` to `handle_topology_transition` created a
large diff. There is only one change - adding
`auto node = get_node_to_work_on(std::move(guard));`.
2024-02-02 16:55:20 +01:00
Patryk Jędrzejczak
b0eef50b2e raft topology: make left_token_ring a transition state
A node can be in the `left_token_ring` state after:
- a finished decommission,
- a failed bootstrap,
- a failed replace.

When a node is in the `left_token_ring` state, we don't know how
it has ended up in this state. We cannot distinguish a node that
has finished decommissioning from a node that has failed bootstrap.

The main problem it causes is that we incorrectly send the
`barrier_and_drain` command to a node that has failed
bootstrapping or replacing. We must do it for a node that has
finished decommissioning because it could still coordinate
requests. However, since we cannot distinguish nodes in the
`left_token_ring` state, we must send the command to all of them.
This issue appeared in scylladb/scylladb#16797 and this patch is
a follow-up that fixes it.

The solution is changing `left_token_ring` from a node state
to a transition state.

Regarding implementation, most of the changes are simple
refactoring. The less obvious are:
- Before this patch, in `system_keyspace::left_topology_state`, we
had to keep the ignored nodes' IDs for replace to ensure that the
replacing node will have access to them after moving to the
`left_token_ring` state, which happens when replace fails. We
don't need this workaround anymore. When we enter the new
`left_token_ring` transition state, the new node will still be in
the `decommissioning` state, so it won't lose its request param.
- Before this patch, a decommissioning node lost its tokens
while moving to the `left_token_ring` state. After the patch, it
loses tokens while still being in the `decommissioning` state. We
ensure that all `decommissioning` handlers correctly handle a node
that lost its tokens.

Moving the `left_token_ring` handler from `handle_node_transition`
to `handle_topology_transition` created a large diff. There are
only three changes:
- adding `auto node = get_node_to_work_on(std::move(guard));`,
- adding `builder.del_transition_state()`,
- changing error logged when `global_token_metadata_barrier` fails.
2024-01-29 10:39:07 +01:00
Avi Kivity
03313d359e Merge ' db: commitlog_replayer: ignore mutations affected by (tablet) cleanups ' from Michał Chojnowski
To avoid data resurrection, mutations deleted by cleanup operations should be skipped during commitlog replay.

This series implements the above for tablet cleanups, by using a new system table which holds records of cleanup operations.

Fixes #16752

Closes scylladb/scylladb#16888

* github.com:scylladb/scylladb:
  test: test_tablets: add a test for cleanup after migration
  test: pylib: add ScyllaCluster.wipe_sstables
  test: boost: add commitlog_cleanup_test
  db: commitlog_replayer: ignore mutations affected by (tablet) cleanups
  replica: table: garbage-collect irrelevant system.commitlog_cleanups records
  db: commitlog: add min_position()
  replica: table: populate system.commitlog_cleanups on tablet cleanup
  db: system_keyspace: add system.commitlog_cleanups
  replica: table: refresh compound sstable set after tablet cleanup
2024-01-25 20:51:03 +02:00
Patryk Jędrzejczak
378cbd0b70 raft topology: ensure at most one transitioning node
We add a sanity check to ensure there is at most one transitioning node at
a time. If there are more, something must have gone wrong.

In the future, we might implement concurrent topology operations.
Then, we will remove this sanity check.

We also extend the comment describing `transition_nodes` so that
it better explains why we use a map and how it should be handled.
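A hedged sketch of what such a sanity check could look like (hypothetical simplified types, not the actual topology_state_machine code):

```cpp
#include <map>
#include <stdexcept>
#include <string>

// transition_nodes maps a node's host id to its transition-state info;
// these are hypothetical simplified types.
struct node_transition_info { std::string state; };
using transition_map = std::map<std::string, node_transition_info>;

void check_at_most_one_transitioning_node(const transition_map& transition_nodes) {
    if (transition_nodes.size() > 1) {
        // Concurrent topology operations are not supported yet, so this
        // indicates corrupted or inconsistent topology state.
        throw std::runtime_error("topology state machine: more than one transitioning node");
    }
}
```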
2024-01-25 13:42:46 +01:00
Michał Chojnowski
7c5a8894be db: system_keyspace: add system.commitlog_cleanups
Add a system table which will hold records of cleanup operations,
for the purpose of filtering commitlog replays to avoid data
resurrections.
2024-01-24 10:37:38 +01:00
Gleb Natapov
84197ff735 storage_service: topology coordinator: check topology operation completion using status in topology_requests table
Instead of trying to guess whether a request completed by looking at the
topology state (which can sometimes be error-prone), look at the
request status in the new topology_requests table. If the request failed,
report the reason for the failure from the table.
2024-01-16 17:02:54 +02:00
Gleb Natapov
584551f849 topology coordinator: add request_id to the topology state machine
Provide a unique ID for each topology request and store it in the topology
state machine. It will be used to index the new topology requests table in
order to retrieve the request status.
2024-01-16 13:57:27 +02:00
Gleb Natapov
ecb8778950 system keyspace: introduce local table to store topology requests status
The table has the following schema and will be managed by raft:

CREATE TABLE system.topology_requests (
    id timeuuid PRIMARY KEY,

    initiating_host uuid,
    start_time timestamp,

    done boolean,
    error text,
    end_time timestamp
);

If a request completes with an error, the "error" field will be non-empty when "done" is set to true.
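For illustration, a small hypothetical helper showing how a row of this table could be interpreted (not actual Scylla code):

```cpp
#include <optional>
#include <string>

// A hypothetical in-memory view of one system.topology_requests row.
struct topology_request_row {
    bool done = false;
    std::optional<std::string> error; // non-empty only when done == true and the request failed
};

// Returns std::nullopt while the request is still running,
// an empty string on success, or the failure reason on error.
std::optional<std::string> request_outcome(const topology_request_row& row) {
    if (!row.done) {
        return std::nullopt;
    }
    return row.error.value_or("");
}
```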
2024-01-16 13:57:16 +02:00
Gleb Natapov
a4ac64a652 system_keyspace: raft topology: load ignore nodes parameter together with removenode topology request
The next patch will need the ignore nodes list while processing a removenode
request. Load it.
2024-01-14 14:44:07 +02:00
Gleb Natapov
cc54796e23 raft topology: add cleanup state to the topology state machine
The patch adds a cleanup state to the persistent and in-memory state and
handles loading it. The state can be "clean", which means no cleanup is
needed; "needed", which means the node is dirty and needs to run cleanup
at some point; or "running", which means that cleanup is currently being run
by the node, and the state will be reset to "clean" when it completes.
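A minimal sketch of the described state machine, using hypothetical names:

```cpp
#include <cassert>

// Hypothetical enum mirroring the described persistent cleanup states.
enum class cleanup_status {
    clean,    // no cleanup needed
    needed,   // node is dirty and must run cleanup at some point
    running,  // cleanup is currently running; resets to clean on completion
};

cleanup_status on_cleanup_started(cleanup_status s) {
    assert(s == cleanup_status::needed);
    return cleanup_status::running;
}

cleanup_status on_cleanup_finished(cleanup_status s) {
    assert(s == cleanup_status::running);
    return cleanup_status::clean;
}
```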
2024-01-14 13:30:54 +02:00
Kefu Chai
be364d30fd db: do not include unused headers
These unused includes were identified by clangd. See
https://clangd.llvm.org/guides/include-cleaner#unused-include-warning
for more details on the "Unused include" warning.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16664
2024-01-09 11:44:19 +02:00
Benny Halevy
c520fc23f0 system_keyspace: update_peer_info: drop single-column overloads
They are no longer used.
Instead, all callers now pass peer_info.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-12-31 18:37:34 +02:00
Benny Halevy
7670f60b83 system_keyspace: load_tokens/peers/host_ids: enforce presence of host_id
Skip rows that have no host_id to make
sure the node state we load always has a valid host_id.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-12-31 18:37:34 +02:00
Benny Halevy
74159bb5ae system_keyspace: drop update_tokens(endpoint, tokens) overload
It is now unused, after the previous patch changed callers
to update all peer info in one update_peer_info call.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-12-31 18:37:34 +02:00
Benny Halevy
b2735d47f7 system_keyspace: update_peer_info: use struct peer_info for all optional values
Define struct peer_info holding optional values
for all system.peers columns, allowing the caller to
update any column.

Pass the values as std::vector<std::optional<data_value>>
to query_processor::execute_internal.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-12-31 18:37:30 +02:00
Benny Halevy
328ce23c78 types: add data_value_list
data_value_list is a wrapper around std::initializer_list<data_value>.
Use it for passing values to `cql3::query_processor::execute_internal`
and friends.

A following patch will add a std::variant for data_value_or_unset
and extend data_value_list to support unset values.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-12-31 18:17:27 +02:00
Benny Halevy
85b3232086 system_keyspace: get rid of update_cached_values
It's a no-op.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-12-31 10:10:51 +02:00
Benny Halevy
f64ecc2edf storage_service: do not update peer info for this node
system_keyspace had a hack to skip update_peer_info
for the local node, and then to remove an entry for
the local node in system.peers if `update_tokens(endpoint, ...)`
was called for this node.

This change unhacks system_keyspace by treating an
update of system.peers with the local address as
an internal error, and by fixing the call sites that did that.

Fixes #16425

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-12-31 10:10:51 +02:00
Kamil Braun
3b108f2e31 Merge 'db: config: make consistent_cluster_management mandatory' from Patryk Jędrzejczak
We make `consistent_cluster_management` mandatory in 5.5. The option's
value is now unused and always assumed to be true.

Additionally, we make `override_decommission` deprecated, as this option
has been supported only with `consistent_cluster_management=false`.

Making `consistent_cluster_management` mandatory also simplifies
the code. Branches that execute only with
`consistent_cluster_management` disabled are removed.

We also update documentation by removing information irrelevant in 5.5.

Fixes scylladb/scylladb#15854

Note about upgrades: this PR does not introduce any more limitations
to the upgrade procedure than there are already. As in
scylladb/scylladb#16254, we can upgrade from the first version of Scylla
that supports the schema commitlog feature, i.e. from 5.1 (or
corresponding Enterprise release) or later. Assuming this PR ends up in
5.5, the documented upgrade support is from 5.4. For corresponding
Enterprise release, it's from 2023.x (based on 5.2), so all requirements
are met.

Closes scylladb/scylladb#16334

* github.com:scylladb/scylladb:
  docs: update after making consistent_cluster_management mandatory
  system_keyspace, main, cql_test_env: fix indendations
  db: config: make consistent_cluster_management mandatory
  test: boost: schema_change_test: replace disable_raft_schema_config
  db: config: make override_decommission deprecated
  db: config: make force_schema_commit_log deprecated
2023-12-18 09:44:52 +01:00
Kefu Chai
81d5c4e661 db/system_keyspace: explicitly instantiate used template
future<std::optional<utils::UUID>>
system_keyspace::get_scylla_local_param_as<utils::UUID>(const sstring&)
is used by db/schema_tables.cc, so let's instantiate this template
explicitly.
Otherwise we'd have the following link failure:

```
: && /home/kefu/.local/bin/clang++ -ffunction-sections -fdata-sections -O3 -g -gz -Xlinker --build-id=sha1 -fuse-ld=lld -dynamic-linker=/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////lib64/ld-linux-x86-64.so.2 -Xlinker --gc-sections CMakeFiles/scylla_version.dir/Release/release.cc.o CMakeFiles/scylla.dir/Release/main.cc.o -o Release/scylla  Release/libscylla-main.a  api/Release/libapi.a  alternator/Release/libalternator.a  db/Release/libdb.a  cdc/Release/libcdc.a  compaction/Release/libcompaction.a  cql3/Release/libcql3.a  data_dictionary/Release/libdata_dictionary.a  gms/Release/libgms.a  index/Release/libindex.a  lang/Release/liblang.a  message/Release/libmessage.a  mutation/Release/libmutation.a  mutation_writer/Release/libmutation_writer.a  raft/Release/libraft.a  readers/Release/libreaders.a  redis/Release/libredis.a  repair/Release/librepair.a  replica/Release/libreplica.a  schema/Release/libschema.a  service/Release/libservice.a  sstables/Release/libsstables.a  streaming/Release/libstreaming.a  test/perf/Release/libtest-perf.a  thrift/Release/libthrift.a  tools/Release/libtools.a  transport/Release/libtransport.a  types/Release/libtypes.a  utils/Release/libutils.a  seastar/Release/libseastar.a  /usr/lib64/libboost_program_options.so.1.81.0  test/lib/Release/libtest-lib.a  Release/libscylla-main.a  -Xlinker --push-state -Xlinker --whole-archive  auth/Release/libscylla_auth.a  -Xlinker --pop-state  /usr/lib64/libcrypt.so  cdc/Release/libcdc.a  compaction/Release/libcompaction.a  mutation_writer/Release/libmutation_writer.a  -Xlinker --push-state -Xlinker --whole-archive  dht/Release/libscylla_dht.a  -Xlinker --pop-state  index/Release/libindex.a  -Xlinker --push-state -Xlinker --whole-archive  locator/Release/libscylla_locator.a  -Xlinker --pop-state  message/Release/libmessage.a  gms/Release/libgms.a  sstables/Release/libsstables.a  readers/Release/libreaders.a  schema/Release/libschema.a  -Xlinker --push-state -Xlinker --whole-archive  tracing/Release/libscylla_tracing.a  -Xlinker --pop-state  service/Release/libservice.a  node_ops/Release/libnode_ops.a  service/Release/libservice.a  node_ops/Release/libnode_ops.a  raft/Release/libraft.a  repair/Release/librepair.a  streaming/Release/libstreaming.a  replica/Release/libreplica.a  /usr/lib64/libabsl_raw_hash_set.so.2308.0.0  /usr/lib64/libabsl_hash.so.2308.0.0  /usr/lib64/libabsl_city.so.2308.0.0  /usr/lib64/libabsl_bad_variant_access.so.2308.0.0  /usr/lib64/libabsl_low_level_hash.so.2308.0.0  /usr/lib64/libabsl_bad_optional_access.so.2308.0.0  /usr/lib64/libabsl_hashtablez_sampler.so.2308.0.0  /usr/lib64/libabsl_exponential_biased.so.2308.0.0  /usr/lib64/libabsl_synchronization.so.2308.0.0  /usr/lib64/libabsl_graphcycles_internal.so.2308.0.0  /usr/lib64/libabsl_kernel_timeout_internal.so.2308.0.0  /usr/lib64/libabsl_stacktrace.so.2308.0.0  /usr/lib64/libabsl_symbolize.so.2308.0.0  /usr/lib64/libabsl_malloc_internal.so.2308.0.0  /usr/lib64/libabsl_debugging_internal.so.2308.0.0  /usr/lib64/libabsl_demangle_internal.so.2308.0.0  /usr/lib64/libabsl_time.so.2308.0.0  /usr/lib64/libabsl_strings.so.2308.0.0  
/usr/lib64/libabsl_int128.so.2308.0.0  /usr/lib64/libabsl_strings_internal.so.2308.0.0  /usr/lib64/libabsl_string_view.so.2308.0.0  /usr/lib64/libabsl_throw_delegate.so.2308.0.0  /usr/lib64/libabsl_base.so.2308.0.0  /usr/lib64/libabsl_spinlock_wait.so.2308.0.0  /usr/lib64/libabsl_civil_time.so.2308.0.0  /usr/lib64/libabsl_time_zone.so.2308.0.0  /usr/lib64/libabsl_raw_logging_internal.so.2308.0.0  /usr/lib64/libabsl_log_severity.so.2308.0.0  -lsystemd  /usr/lib64/libz.so  /usr/lib64/libdeflate.so  types/Release/libtypes.a  utils/Release/libutils.a  /usr/lib64/libcryptopp.so  /usr/lib64/libboost_regex.so.1.81.0  /usr/lib64/libicui18n.so  /usr/lib64/libicuuc.so  /usr/lib64/libboost_unit_test_framework.so.1.81.0  seastar/Release/libseastar_perf_testing.a  /usr/lib64/libjsoncpp.so.1.9.5  interface/Release/libinterface.a  /usr/lib64/libthrift.so  db/Release/libdb.a  data_dictionary/Release/libdata_dictionary.a  cql3/Release/libcql3.a  transport/Release/libtransport.a  cql3/Release/libcql3.a  transport/Release/libtransport.a  lang/Release/liblang.a  /usr/lib64/liblua-5.4.so  -lm  rust/Release/libwasmtime_bindings.a  rust/librust_combined.a  /usr/lib64/libsnappy.so.1.1.10  mutation/Release/libmutation.a  seastar/Release/libseastar.a  /usr/lib64/libboost_program_options.so  /usr/lib64/libboost_thread.so  /usr/lib64/libboost_chrono.so  /usr/lib64/libboost_atomic.so  /usr/lib64/libcares.so  /usr/lib64/libcryptopp.so  /usr/lib64/libfmt.so.10.0.0  /usr/lib64/liblz4.so  -ldl  /usr/lib64/libgnutls.so  -latomic  /usr/lib64/libsctp.so  /usr/lib64/libyaml-cpp.so  /usr/lib64/libhwloc.so  //usr/lib64/liburing.so  /usr/lib64/libnuma.so  /usr/lib64/libxxhash.so && :
ld.lld: error: undefined symbol: seastar::future<std::optional<utils::UUID>> db::system_keyspace::get_scylla_local_param_as<utils::UUID>(seastar::basic_sstring<char, unsigned int, 15u, true> const&)
>>> referenced by schema_tables.cc:981 (./build/./db/schema_tables.cc:981)
>>>               schema_tables.cc.o:(db::schema_tables::merge_schema(seastar::sharded<db::system_keyspace>&, seastar::sharded<service::storage_proxy>&, gms::feature_service&, std::vector<mutation, std::allocator<mutation>>, bool)::$_1::operator()()) in archive db/Release/libdb.a
>>> referenced by schema_tables.cc:981 (./build/./db/schema_tables.cc:981)
>>>               schema_tables.cc.o:(db::schema_tables::recalculate_schema_version(seastar::sharded<db::system_keyspace>&, seastar::sharded<service::storage_proxy>&, gms::feature_service&)::$_0::operator()() const) in archive db/Release/libdb.a
>>> referenced by schema_tables.cc:981 (./build/./db/schema_tables.cc:981)
>>>               schema_tables.cc.o:(db::schema_tables::merge_schema(seastar::sharded<db::system_keyspace>&, seastar::sharded<service::storage_proxy>&, gms::feature_service&, std::vector<mutation, std::allocator<mutation>>, bool)::$_1::operator()() (.resume)) in archive db/Release/libdb.a
clang++: error: linker command failed with exit code 1 (use -v to see invocation)
```

It seems that, without the explicit instantiation, clang-18
just inlines the body of the instantiated template function at the
caller site.
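The fix is a standard C++ explicit instantiation definition placed in the .cc file that holds the template's body. A self-contained illustration of the mechanism (a hypothetical `parse_param` example, not the Scylla code; in the actual fix the instantiated entity is the `get_scylla_local_param_as<utils::UUID>` signature quoted above):

```cpp
// parse.h -- the template is declared here; its body lives only in parse.cc.
template <typename T>
T parse_param(const char* s);

// parse.cc -- definition plus an explicit instantiation definition. The last
// line forces the compiler to emit the symbol for T = unsigned long long even
// if it would otherwise inline or drop every local use.
#include <cstdlib>
template <typename T>
T parse_param(const char* s) { return static_cast<T>(std::strtoull(s, nullptr, 10)); }

template unsigned long long parse_param<unsigned long long>(const char*);

// main.cc -- links against the explicitly instantiated symbol:
// int main() { return static_cast<int>(parse_param<unsigned long long>("42")); }
```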

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16434
2023-12-17 15:12:05 +02:00
Kamil Braun
6a4106edf3 migration_manager: don't attach empty system.scylla_local mutation in migration request handler
In effb9fb3cb, the migration request handler
(called when a node requests a schema pull) was extended with a
`system.scylla_local` mutation:
```
        cm.emplace_back(co_await self._sys_ks.local().get_group0_schema_version());
```

This mutation is empty if the GROUP0_SCHEMA_VERSIONING feature is
disabled.

Nevertheless, it turned out to cause problems during upgrades.
The following scenario shows the problem:

We upgrade from 5.2 to an enterprise version with the aforementioned patch.
In 5.2, `system.scylla_local` does not use schema commitlog.
After the first node upgrades to the enterprise version, it immediately
creates a new enterprise-only table on boot
(`system_replicated_keys.encrypted_keys`) -- the specific table is not
important, only the fact that a schema change is performed.
This happens before the restarting node notices other nodes being UP, so
the schema change is not immediately pushed to the other nodes.
Instead, soon after boot, the other non-upgraded nodes pull the schema
from the upgraded node.
The upgraded node attaches a `system.scylla_local` mutation to the
vector of returned mutations.
The non-upgraded nodes try to apply this vector of mutations. Because
some of these mutations are for tables that already use schema
commitlog, while the `system.scylla_local` table does not use schema
commitlog, this triggers the following error (even though the mutation
is empty):
```
    Cannot apply atomically across commitlog domains: system.scylla_local, system_schema.keyspaces
```

Fortunately, the fix is simple -- instead of attaching an empty
mutation, do not attach a mutation at all if the migration request
handler notices that group0_schema_version is not present.
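A hedged sketch of the shape of the fix, with stand-in types and a hypothetical helper (the real code reads the version from `system.scylla_local` and builds a proper mutation):

```cpp
#include <optional>
#include <string>
#include <vector>

// Stand-in for a schema mutation.
struct mutation { std::string table; std::string payload; };

// Returns a mutation only when group0_schema_version is present, i.e. only
// after the GROUP0_SCHEMA_VERSIONING feature is enabled cluster-wide.
std::optional<mutation> make_group0_schema_version_mutation(
        const std::optional<std::string>& stored_version) {
    if (!stored_version) {
        return std::nullopt;  // feature not enabled yet: attach nothing at all
    }
    return mutation{"system.scylla_local", *stored_version};
}

void build_schema_pull_response(std::vector<mutation>& cm,
                                const std::optional<std::string>& stored_version) {
    // Before the fix, an (empty) mutation was appended unconditionally, which a
    // non-upgraded node could not apply atomically with schema-commitlog tables.
    if (auto m = make_group0_schema_version_mutation(stored_version)) {
        cm.push_back(std::move(*m));
    }
}
```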

Note that group0_schema_version is only present if the
GROUP0_SCHEMA_VERSIONING feature is enabled, which happens only after
the whole upgrade finishes.

Refs: scylladb/scylladb#16414

Not using "Fixes" because the issue will only be fixed once this PR is
merged to `master` and the commit is cherry-picked onto next-enterprise.

Closes scylladb/scylladb#16416
2023-12-14 22:58:13 +01:00
Patryk Jędrzejczak
dced4bb924 system_keyspace, main, cql_test_env: fix indendations
Broken in the previous patch.
2023-12-14 16:54:04 +01:00
Patryk Jędrzejczak
5ebfbf42bc db: config: make consistent_cluster_management mandatory
Code that executed only when consistent_cluster_management=false is
removed. In particular, after this patch:
- raft_group0 and raft_group_registry are always enabled,
- raft_group0::status_for_monitoring::disabled becomes unused,
- topology tests can only run with consistent_cluster_management.
2023-12-14 16:54:04 +01:00
Tomasz Grabiec
effb9fb3cb Merge 'Don't calculate hashes for schema versions in Raft mode' from Kamil Braun
When performing a schema change through group 0, extend the schema mutations with a version that's persisted and then used by the nodes in the cluster in place of the old schema digest, whose computation becomes horribly slow as we perform more and more schema changes (#7620).

If the change is a table create or alter, also extend the mutations with a version for this table to be used for `schema::version()`s instead of having each node calculate a hash which is susceptible to bugs (#13957).

When performing a schema change in Raft RECOVERY mode, we also extend the schema mutations in a way that forces nodes to revert to the old way of calculating schema versions when necessary.

We can only introduce these extensions if the whole cluster understands them, so we protect this code with a new cluster/schema feature, `GROUP0_SCHEMA_VERSIONING`.

Fixes: #7620
Fixes: #13957

---

This is a reincarnation of PR scylladb/scylladb#15331. The previous PR was reverted due to a bug it unmasked; the bug has now been fixed (scylladb/scylladb#16139). Some refactors from the previous PR were already merged separately, so this one is a bit smaller.

I have checked with @Lorak-mmk's reproducer (https://github.com/Lorak-mmk/udt_schema_change_reproducer -- many thanks for it!) that the originally exposed bug is no longer reproducing on this PR, and that it can still be reproduced if I revert the aforementioned fix on top of this PR.

Closes scylladb/scylladb#16242

* github.com:scylladb/scylladb:
  docs: describe group 0 schema versioning in raft docs
  test: add test for group 0 schema versioning
  feature_service: enable `GROUP0_SCHEMA_VERSIONING` in Raft mode
  schema_tables: don't delete `version` cell from `scylla_tables` mutations from group 0
  migration_manager: add `committed_by_group0` flag to `system.scylla_tables` mutations
  schema_tables: use schema version from group 0 if present
  migration_manager: store `group0_schema_version` in `scylla_local` during schema changes
  system_keyspace: make `get/set_scylla_local_param` public
  feature_service: add `GROUP0_SCHEMA_VERSIONING` feature
2023-12-11 12:17:57 +01:00
Kamil Braun
3db8ac80cb migration_manager: store group0_schema_version in scylla_local during schema changes
We extend schema mutations with an additional mutation to the
`system.scylla_local` table which:
- in Raft mode, stores a UUID under the `group0_schema_version` key.
- outside Raft mode, stores a tombstone under that key.

As we will see in later commits, nodes will use this after applying
schema mutations. If the key is absent or has a tombstone, they'll
calculate the global schema digest on their own -- using the old way. If
the key is present, they'll take the schema version from there.
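A minimal sketch of that apply-side decision, using stand-in types and hypothetical names:

```cpp
#include <optional>
#include <string>

using schema_version = std::string;  // stand-in for utils::UUID

// Use the version stored under group0_schema_version if present; otherwise
// fall back to hashing the schema tables the old way.
schema_version version_after_applying(const std::optional<schema_version>& stored,
                                      const schema_version& digest_computed_the_old_way) {
    if (stored) {
        return *stored;  // Raft mode: equals the group 0 state ID of the command
    }
    // Key absent or tombstoned (e.g. a change made in RECOVERY mode).
    return digest_computed_the_old_way;
}
```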

The Raft-mode schema version is equal to the group 0 state ID of this
schema command.

The tombstone is necessary for the case of performing a schema change in
RECOVERY mode. It will force a revert to the old digest-based way.

Note that extending schema mutations with a `system.scylla_local`
mutation is possible thanks to earlier commits which moved
`system.scylla_local` to schema commitlog, so all mutations in the
schema mutations vector still go to the same commitlog domain.

Also, since we introduce a replicated tombstone to
`system.scylla_local`, we need to set GC grace to nonzero. We set it to
`schema_gc_grace`, which makes sense given the use case.
2023-12-08 17:45:41 +01:00
Avi Kivity
9c0f05efa1 Merge 'Track tablet streaming under global sessions to prevent side-effects of failed streaming' from Tomasz Grabiec
Tablet streaming involves asynchronous RPCs to other replicas which transfer writes. We want side-effects from streaming to occur only within the migration stage in which the streaming was started. This is currently not guaranteed on failure. When the streaming master fails (e.g. due to an RPC failing), some streaming work may still be alive somewhere (e.g. an RPC on the wire) and will have side-effects at some later point.

This PR implements tracking of all streaming operations that may have side-effects, which allows the topology change coordinator to fence them and wait for them to complete if they were already admitted.

The tracking and fencing are implemented using global "sessions", created for the streaming of a single tablet. A session is globally identified by a UUID. The identifier is assigned by the topology change coordinator and stored in system.tablets. Sessions are created and closed based on group0 state (tablet metadata) by the barrier command sent to each replica, which we already do on transitions between stages. Also, each barrier waits for sessions which have been closed to be drained.

The barrier blocks only if there is some session with work left behind by unsuccessful streaming, in which case it should not block for long, because the streaming process frequently checks whether the guard was left behind and stops if it was.

This tracking mechanism is fault-tolerant: the session id is stored in group0, so the coordinator can make progress on failover. The barriers guarantee that the session exists on all replicas and that it will be closed on all replicas.
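A rough, hypothetical sketch of the session-tracking idea (not the actual session or topology_guard classes): streaming work may only run inside an open session, and the barrier closes sessions no longer referenced by tablet metadata and reports whether their work has drained.

```cpp
#include <map>
#include <set>
#include <string>

using session_id = std::string; // UUIDs in the real implementation

// Hypothetical per-replica session registry, driven purely by group0 state.
class session_registry {
    std::map<session_id, int> _in_flight; // session -> admitted operations still running
public:
    bool try_enter(const session_id& id) {        // streaming work joins its session
        auto it = _in_flight.find(id);
        if (it == _in_flight.end()) {
            return false;                         // session closed: work is fenced out
        }
        ++it->second;
        return true;
    }
    void leave(const session_id& id) { --_in_flight.at(id); }

    // Barrier: keep only sessions still referenced by tablet metadata, and report
    // whether all closed sessions have drained (no in-flight work left behind).
    bool apply_barrier(const std::set<session_id>& referenced) {
        bool drained = true;
        for (auto it = _in_flight.begin(); it != _in_flight.end();) {
            if (referenced.count(it->first)) { ++it; continue; }
            if (it->second > 0) { drained = false; ++it; }  // must wait for this one
            else { it = _in_flight.erase(it); }
        }
        for (const auto& id : referenced) { _in_flight.try_emplace(id, 0); }
        return drained;
    }
};
```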

Closes scylladb/scylladb#15847

* github.com:scylladb/scylladb:
  test: tablets: Add test for failed streaming being fenced away
  error_injection: Introduce poll_for_message()
  error_injection: Make is_enabled() public
  api: Add API to kill connection to a particular host
  range_streamer: Do not block topology change barriers around streaming
  range_streamer, tablets: Do not keep token metadata around streaming
  tablets: Fail gracefully when migrating tablet has no pending replica
  storage_service, api: Add API to disable tablet balancing
  storage_service, api: Add API to migrate a tablet
  storage_service, raft topology: Run streaming under session topology guard
  storage_service, tablets: Use session to guard tablet streaming
  tablets: Add per-tablet session id field to tablet metadata
  service: range_streamer: Propagate topology_guard to receivers
  streaming: Always close the rpc::sink
  storage_service: Introduce concept of a topology_guard
  storage_service: Introduce session concept
  tablets: Fix topology_metadata_guard holding on to the old erm
  docs: Document the topology_guard mechanism
2023-12-07 16:29:02 +02:00
Tomasz Grabiec
d1c1b59236 storage_service, api: Add API to disable tablet balancing
Load balancing needs to be disabled before making a series of manual
migrations so that we don't fight with the load balancer.

It will also be used in tests to ensure tablets stick to expected locations.
2023-12-06 18:36:17 +01:00
Tomasz Grabiec
31c995332c storage_service, raft topology: Run streaming under session topology guard
Prevents stale streaming operations from running beyond the topology
operation they were started in. After the session field is cleared or
changed to something else, the old topology_guard used by streaming is
interrupted and fenced, and the next barrier will join with any
remaining work.
2023-12-06 18:36:17 +01:00
Benny Halevy
64145388c9 db/system_keyspace: use topology via db rather than fb_utilities
So as not to rely on fb_utilities.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-12-05 08:42:49 +02:00
Benny Halevy
4bb4d673c3 db/system_keyspace: save_local_info: get broadcast addresses from caller
So as not to rely on fb_utilities.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-12-05 08:42:49 +02:00
Patryk Jędrzejczak
c8ee7d4499 db: make schema commitlog feature mandatory
Using consistent cluster management without schema commitlog
results in a bad-configuration error being thrown during bootstrap. Soon, we
will make consistent cluster management mandatory. This forces us
to also make schema commitlog mandatory, which we do in this patch.

A booting node decides to use schema commitlog if at least one of
the two statements below is true (see the sketch after the list):
- the node has `force_schema_commitlog=true` config,
- the node knows that the cluster supports the `SCHEMA_COMMITLOG`
  cluster feature.
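A one-line sketch of that decision (hypothetical function; the real config and feature plumbing is more involved):

```cpp
// A booting node uses schema commitlog if either condition holds.
bool use_schema_commitlog(bool force_schema_commitlog_config,
                          bool cluster_supports_schema_commitlog_feature) {
    return force_schema_commitlog_config || cluster_supports_schema_commitlog_feature;
}
```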

The `SCHEMA_COMMITLOG` cluster feature has been added in version
5.1. This patch is supposed to be a part of version 6.0. We don't
support a direct upgrade from 5.1 to 6.0 because it skips two
versions - 5.2 and 5.4. So, in a supported upgrade we can assume
that the version which we upgrade from has schema commitlog. This
means that we don't need to check the `SCHEMA_COMMITLOG` feature
during an upgrade.

The reasoning above also applies to Scylla Enterprise. Version
2024.2 will be based on 6.0. Probably, we will only support
an upgrade to 2024.2 from 2024.1, which is based on 5.4. But even
if we support an upgrade from 2023.x, this patch won't break
anything because 2023.1 is based on 5.2, which has schema
commitlog. Upgrades from 2022.x definitely won't be supported.

When we populate a new cluster, we can use the
`force_schema_commitlog=true` config to use schema commitlog
unconditionally. Then, the cluster feature check is irrelevant.
This check could fail because we initiate schema commitlog before
we learn about the features. The `force_schema_commitlog=true`
config is especially useful when we want to use consistent cluster
management. Failing feature checks would lead to crashes during
initial bootstraps. Moreover, there is no point in creating a new
cluster with `consistent_cluster_management=true` and
`force_schema_commitlog=false`. It would just cause some initial
bootstraps to fail, and after successful restarts, the result would
be the same as if we used `force_schema_commitlog=true` from the
start.

In conclusion, we can unconditionally use schema commitlog without
any checks in 6.0 because we can always safely upgrade a cluster
and start a new cluster.

Apart from making schema commitlog mandatory, this patch adds two
changes that are its consequences:
- making the unneeded `force_schema_commitlog` config unused,
- deprecating the `SCHEMA_COMMITLOG` feature, which is always
  assumed to be true.

Closes scylladb/scylladb#16254
2023-12-04 21:02:16 +02:00
Yaniv Kaul
c658bdb150 Typos: fix typos in comments
Fixes some typos found by a codespell run on the code.
In this commit, I was hoping to fix only comments, not user-visible alerts, output, etc.
Follow-up commits will take care of them.

Refs: https://github.com/scylladb/scylladb/issues/16255
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2023-12-02 22:37:22 +02:00
Gleb Natapov
95dd0e453d storage_service: topology coordinator: add rollback_to_normal node state
When a topology coordinator rolls back from unsuccessful topology operation it
advances the fence (which is now in the raft state) after moving to normal
state. We do not want this to fail (only majority of nodes is needed for
it to not to), but currently it may fail in case the coordinator moves
to another node after changing the rollback node's state to normal, but
before updating the fence. To solve that the rollback operation needs to
go through a new rollback_to_normal state that will do the fencing
before moving to normal. This patch introduces that state, but does not use
it yet.
2023-11-23 15:27:28 +02:00
Gleb Natapov
6edbf4b663 storage_service: topology coordinator: put fence version into the raft state
Currently, when the coordinator decides to move the fence, it issues an
RPC to each node and each node locally advances its fence version. This is
fine if there are no failures, or if failures are handled by retrying the
fencing, but if we want to allow topology changes to progress even in
the presence of barrier failures, it is easier to store the fence version
in the raft state. Nodes that missed the fence RPC can easily catch up
to the latest fence version by simply executing a raft barrier.
2023-11-19 15:28:08 +02:00
Kamil Braun
f094e23d84 system_keyspace: use system memory for system.raft table
`system.raft` was using the "user memory pool", i.e. the
`dirty_memory_manager` for this table was set to
`database::_dirty_memory_manager` (instead of
`database::_system_dirty_memory_manager`).

This meant that if a write workload caused memory pressure on the user
memory pool, internal `system.raft` writes would have to wait for
memtables of user tables to get flushed before the write would proceed.

This was observed in SCT longevity tests which ran a heavy workload on
the cluster and, concurrently, schema changes (which underneath use the
`system.raft` table). Raft would often get stuck waiting many seconds
for user memtables to get flushed. More details in issue #15622.
Experiments showed that moving Raft to system memory fixed this
particular issue, bringing the waits to reasonable levels.

Currently `system.raft` stores only one group, group 0, which is
internally used for cluster metadata operations (schema and topology
changes) -- so it makes sense to keep using system memory for it.

In the future we'd like to have other groups, for strongly consistent
tables. These groups should use the user memory pool. It means we won't
be able to use `system.raft` for them -- we'll just have to use a
separate table.

Fixes: scylladb/scylladb#15622

Closes scylladb/scylladb#15972
2023-11-08 11:21:14 +02:00
Botond Dénes
76ab66ca1f Merge 'Support state change for S3-backed sstables' from Pavel Emelyanov
An sstable can currently move between the normal, staging and quarantine states at runtime. For S3-backed sstables, a state change means maintaining the state itself in the ownership table and updating it accordingly.

There's also the upload facility, which is implemented as a state change too, but this PR doesn't support that part.

fixes: #13017

Closes scylladb/scylladb#15829

* github.com:scylladb/scylladb:
  test: Make test_sstables_excluding_staging_correctness run over s3 too
  sstables,s3: Support state change (without generation change)
  system_keyspace: Add state field to system.sstables
  sstable_directory: Tune up sstables entries processing comment
  system_keyspace: Tune up status change trace message
  sstables: Add state string to state enum class convert
2023-11-07 10:45:41 +02:00
Kamil Braun
0846d324d7 Merge 'rollback topology operation on streaming failure' from Gleb
This patch series adds error handling for streaming failures during
topology operations, instead of an infinite retry. If streaming fails, the
operation is rolled back: bootstrapping/replacing nodes move to the left state
and decommissioned/removed nodes move back to the normal state.

* 'gleb/streaming-failure-rollback-v4' of github.com:scylladb/scylla-dev:
  raft: make sure that all operation forwarded to a leader are completed before destroying raft server
  storage_service: raft topology: remove code duplication from global_tablet_token_metadata_barrier
  tests: add tests for streaming failure in bootstrap/replace/remove/decomission
  test/pylib: do not stop node if decommission failed with an expected error
  storage_service: raft topology: fix typo in "decommission" everywhere
  storage_service: raft topology: add streaming error injection
  storage_service: raft topology: do not increase topology version during CDC repair
  storage_service: raft topology: rollback topology operation on streaming failure.
  storage_service: raft topology: load request parameters in left_token_ring state as well
  storage_service: raft topology: do not report term_changed_error during global_token_metadata_barrier as an error
  storage_service: raft topology: change global_token_metadata_barrier error handling to try/catch
  storage_service: raft topology: make global_token_metadata_barrier node independent
  storage_service: raft topology: split get_excluded_nodes from exec_global_command
  storage_service: raft topology: drop unused include_local and do_retake parameters from exec_global_command which are always true
  storage_service: raft topology: simplify streaming RPC failure handling
2023-11-02 10:15:45 +01:00
Gleb Natapov
0a8c3e5c78 storage_service: raft topology: load request parameters in left_token_ring state as well
The next patch will need to access request parameters in the left_token_ring
state for failure recovery purposes.
2023-10-25 12:56:27 +03:00
Piotr Dulikowski
63aa9332aa raft topology: assign tokens after join node response rpc
Currently, when the topology coordinator accepts a node, it moves it to
the bootstrap state and assigns tokens to it (either new ones during
bootstrap, or the replaced node's tokens). Only then does it contact the
joining node to tell it about the decision and let it perform a read
barrier.

However, this means that the tokens are inserted too early. After
inserting the tokens, the cluster is free to route write requests to the node,
but it might not have learned about all of the schema yet.

Fix the issue by inserting the tokens later, after completing the join
node response RPC, which forces the receiving node to perform a read
barrier.
2023-10-25 11:50:17 +02:00
Piotr Dulikowski
2d161676c7 raft topology: loosen assumptions about transition nodes having tokens
In later commits, tokens for a joining/replacing node will not be
inserted when the node enters `bootstrapping`/`replacing` state but at
some later step of the procedure. Loosen some of the assumptions in
`storage_service::topology_state_load` and
`system_keyspace::load_topology_state` appropriately.
2023-10-25 11:50:17 +02:00