scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-04 22:13:19 +00:00

Author	SHA1	Message	Date
Tomasz Grabiec	f2ed9fcd7e	schema_mutations, migration_manager: Ignore empty partitions in per-table digest Schema digest is calculated by querying for mutations of all schema tables, then compacting them so that all tombstones in them are dropped. However, even if the mutation becomes empty after compaction, we still feed its partition key. If the same mutations were compacted prior to the query, because the tombstones expire, we won't get any mutation at all and won't feed the partition key. So schema digest will change once an empty partition of some schema table is compacted away. Tombstones expire 7 days after schema change which introduces them. If one of the nodes is restarted after that, it will compute a different table schema digest on boot. This may cause performance problems. When sending a request from coordinator to replica, the replica needs schema_ptr of exact schema version requested by the coordinator. If it doesn't know that version, it will request it from the coordinator and perform a full schema merge. This adds latency to every such request. Schema versions which are not referenced are currently kept in cache for only 1 second, so if request flow has low-enough rate, this situation results in perpetual schema pulls. After `ae8d2a550d`, it is more likely to run into this situation, because table creation generates tombstones for all schema tables relevant to the table, even the ones which will be otherwise empty for the new table (e.g. computed_columns). This change introduces a cluster feature which when enabled will change digest calculation to be insensitive to expiry by ignoring empty partitions in digest calculation. When the feature is enabled, schema_ptrs are reloaded so that the window of discrepancy during transition is short and no rolling restart is required. A similar problem was fixed for per-node digest calculation in 18f484cc753d17d1e3658bcb5c73ed8f319d32e8. 
Per-table digest calculation was not fixed at that time because we didn't persist enabled features and they were not enabled early-enough on boot for us to depend on them in digest calculation. Now they are enabled before non-system tables are loaded so digest calculation can rely on cluster features. Fixes #4485.	2023-07-03 23:06:55 +02:00
Tomasz Grabiec	0c86abab4d	migration_manager, schema_tables: Implement migration_manager::reload_schema() Will recreate schema_ptr's from schema tables like during table alter. Will be needed when digest calculation changes in reaction to cluster feature at run time.	2023-07-03 20:32:59 +02:00
Kamil Braun	ff386e7a44	service: raft: force initial snapshot transfer in new cluster When we upgrade a cluster to use Raft, or perform manual Raft recovery procedure (which also creates a fresh group 0 cluster, using the same algorithm as during upgrade), we start with a non-empty group 0 state machine; in particular, the schema tables are non-empty. In this case we need to ensure that nodes which join group 0 receive the group 0 state. Right now this is not the case. In previous releases, where group 0 consisted only of schema, and schema pulls were also done outside Raft, those nodes received schema through this outside mechanism. In `91f609d065` we disabled schema pulls outside Raft; we're also extending group 0 with other things, like topology-specific state. To solve this, we force snapshot transfers by setting the initial snapshot index on the first group 0 server to `1` instead of `0`. During replication, Raft will see that the joining servers are behind, triggering snapshot transfer and forcing them to pull group 0 state. It's unnecessary to do this for a cluster that bootstraps with Raft enabled right away, but it also doesn't hurt, so we keep the logic simple and don't introduce branches based on that. Extend Raft upgrade tests with a node bootstrap step at the end to prevent regressions (without this patch, the step would hang - node would never join, waiting for schema). Fixes: #14066 Closes #14336	2023-06-29 22:46:42 +02:00
Tomasz Grabiec	a9282103ba	Merge 'Call storage_service notifications only after keyspace schema changes are applied on all shards' from Benny Halevy This series aims at hardening schema merges and preventing inconsistencies across shards by updating the database shards before calling the notification callback. As seen in #13137, we don't want to call the notifications on all shards in parallel while the database shards are in flux. In addition, any error while updating the keyspace will cause an abort, so as not to leave the database shards in an inconsistent state. Other changes optimize this path by: - updating shard 0 first, to seed the effective_replication_map. - executing `storage_service::keyspace_changed` only once, on shard 0 to prevent quadratic update of the token_metadata and e_r_m on every keyspace change. Fixes #13137 Closes #14158 * github.com:scylladb/scylladb: migration_manager: propagate listener notification exceptions storage_service: keyspace_changed: execute only on shard 0 database: modify_keyspace_on_all_shards: execute func first on shard 0 database: modify_keyspace_on_all_shards: call notifiers only after applying func on all shards database: add modify_keyspace_on_all_shards schema_tables: merge_keyspaces: extract_scylla_specific_keyspace_info for update_keyspace database: create_keyspace_on_all_shards database: update_keyspace_on_all_shards database: drop_keyspace_on_all_shards	2023-06-29 12:17:53 +02:00
Tomasz Grabiec	50e8ec77c6	Merge 'Wait for other nodes to be UP and NORMAL on bootstrap right after enabling gossiping' from Kamil Braun `handle_state_normal` may drop connections to the handled node. This causes spurious failures if there's an ongoing concurrent operation. This problem was already solved twice in the past in different contexts: first in `53636167ca`, then in `79ee38181c`. Time to fix it for the third time. Now we do this right after enabling gossiping, so hopefully it's the last time. This time it's causing snapshot transfer failures in group 0. Although the transfer is retried and eventually succeeds, the failed transfer is wasted work and causes an annoying ERROR message in the log which dtests, SCT, and I don't like. The fix is done by moving the `wait_for_normal_state_handled_on_boot()` call before `setup_group0()`. But for the wait to work correctly we must first ensure that gossiper sees an alive node, so we precede it with `wait_for_live_node_to_show_up()` (before this commit, the call site of `wait_for_normal_state_handled_on_boot` was already after this wait). There is another problem: the bootstrap procedure is racing with gossiper marking nodes as UP, and waiting for other nodes to be NORMAL doesn't guarantee that they are also UP. If gossiper is quick enough, everything will be fine. If not, problems may arise such as streaming or repair failing due to nodes still being marked as DOWN, or the CDC generation write failing. In general, we need all NORMAL nodes to be up for bootstrap to proceed. One exception is replace where we ignore the replaced node. The `sync_nodes` set constructed for `wait_for_normal_state_handled_on_boot` takes this into account, so we also use it to wait for nodes to be UP. As explained in commit messages and comments, we only do these waits outside raft-based-topology mode. This should improve CI stability. 
Fixes: #12972 Refs: #14042 Closes #14354 * github.com:scylladb/scylladb: messaging_service: print which connections are dropped due to missing topology info storage_service: wait for nodes to be UP on bootstrap storage_service: wait for NORMAL state handler before `setup_group0()` storage_service: extract `gossiper::wait_for_live_nodes_to_show_up()`	2023-06-28 20:40:03 +02:00
Kamil Braun	b912eeade5	Merge 'merge raft commands to group0 before applying them whenever possible' from Gleb Since most group0 commands are just mutations, it is easy and more efficient to combine them before passing them to the subsystem they are destined for. The logic that handles those mutations in a subsystem will run once for each batch of commands instead of for each individual command. This is especially useful when a node catches up to a leader and gets a lot of commands together. The patch here does exactly that. It combines commands into a single command if possible, but it preserves the order between commands, so each time it encounters a command to a different subsystem it flushes the already combined batch and starts a new one. This extra safety assumes that there are dependencies between subsystems managed by group0, so the order matters. This may not be the case now, but we prefer to be on the safe side. Broadcast table commands are not mutations, so they are never combined. * 'raft-merge-cmds' of https://github.com/gleb-cloudius/scylla: test: add test for group0 raft command merging service: raft: respect max mutation size limit when persisting raft entries group0_state_machine: merge commands before applying them whenever possible	2023-06-28 17:21:07 +02:00
Kamil Braun	51cec2be86	storage_service: wait for nodes to be UP on bootstrap The bootstrap procedure is racing with gossiper marking nodes as UP. If gossiper is quick enough, everything will be fine. If not, problems may arise such as streaming or repair failing due to nodes still being marked as DOWN, or the CDC generation write failing. In general, we need all NORMAL nodes to be up for bootstrap to proceed. One exception is replace where we ignore the replaced node. The `sync_nodes` set constructed for `wait_for_normal_state_handled_on_boot` takes this into account, so we use it. Refs: #14042 This doesn't completely fix #14042 yet because it's specific to gossiper-based topology mode only. For Raft-based topology, the node joining procedure will be coordinated by the topology coordinator right from the start and it will be the coordinator who issues the 'wait for node to see other live nodes'.	2023-06-28 16:20:29 +02:00
Kamil Braun	5ec5c7704c	storage_service: wait for NORMAL state handler before `setup_group0()` `handle_state_normal` may drop connections to the handled node. This causes spurious failures if there's an ongoing concurrent operation. This problem was already solved twice in the past in different contexts: first in `53636167ca`, then in `79ee38181c`. Time to fix it for the third time. Now we do this right after enabling gossiping, so hopefully it's the last time. This time it's causing snapshot transfer failures in group 0. Although the transfer is retried and eventually succeeds, the failed transfer is wasted work and causes an annoying ERROR message in the log which dtests, SCT, and I don't like. The fix is done by moving the `wait_for_normal_state_handled_on_boot()` call before `setup_group0()`. But for the wait to work correctly we must first ensure that gossiper sees an alive node, so we precede it with `wait_for_live_node_to_show_up()` (before this commit, the call site of `wait_for_normal_state_handled_on_boot` was already after this wait). We do it only in non-raft-topology mode, because with Raft-based topology, node state changes are propagated to the cluster through explicit global barriers and we plan to remove node statuses from gossiper altogether. Fixes: #12972	2023-06-28 16:19:24 +02:00
Kamil Braun	64c302e777	storage_service: extract `gossiper::wait_for_live_nodes_to_show_up()` This piece of `storage_service::wait_for_ring_to_settle()` will be performed earlier in the boot procedure in follow-up commits. Make it more generic, to be able to wait for `n` nodes to show up. Here we wait for `2` nodes - ourselves and at least one other.	2023-06-28 12:36:06 +02:00
Gleb Natapov	945f476363	test: add test for group0 raft command merging Add a test that submits 3 large commands, each one a little bit larger than 1/3 of the maximum mutation size. Check that in the end 2 commands were executed (the first 2 were merged and the third was executed separately).	2023-06-27 14:59:55 +03:00
Gleb Natapov	8307b09c64	service: raft: respect max mutation size limit when persisting raft entries The code that persists raft entries builds one batch statement to store all of them, but the batch statement's execute() merges all of the statements into one mutation and passes it to the database. The mutation can be larger than the max mutation size limit, in which case the write will fail. Fix it by splitting the write into multiple batch statements if needed.	2023-06-27 14:59:55 +03:00
Gleb Natapov	311cfa1be8	group0_state_machine: merge commands before applying them whenever possible Since most group0 commands are just mutations, it is easy and more efficient to combine them before passing them to the subsystem they are destined for. The logic that handles those mutations in a subsystem will run once for each batch of commands instead of for each individual command. This is especially useful when a node catches up to a leader and gets a lot of commands together. The patch here does exactly that. It combines commands into a single command if possible, but it preserves the order between commands, so each time it encounters a command to a different subsystem it flushes the already combined batch and starts a new one. This extra safety assumes that there are dependencies between subsystems managed by group0, so the order matters. This may not be the case now, but we prefer to be on the safe side. Broadcast table commands are not mutations, so they are never combined. Fixes: #12581	2023-06-27 14:40:46 +03:00
Benny Halevy	825d617a53	migration_manager: propagate listener notification exceptions `1e29b07e40` claimed to make event notification exception safe, but swallowing the exceptions isn't safe at all, as this might leave the node in an inconsistent state if e.g. storage_service::keyspace_changed fails on any of the shards. Propagating the exception here will cause abort, but it is better than leaving the node up, but in an inconsistent state. We keep notifying other listeners even if any of them failed, based on `1e29b07e40`: ``` If one of the listeners throws an exception, we must ensure that other listeners are still notified. ``` The decision about swallowing exceptions can't be made in such a generic layer. Specific notification listeners that may ignore exceptions, like in transport/event_notifier, may decide to swallow their local exceptions on their own (as done in this patch). Refs #3389 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-06-26 21:08:09 +03:00
Benny Halevy	a690f0e81f	storage_service: keyspace_changed: execute only on shard 0 Previously all shards called `update_topology_change_info` which in turn calls `mutate_token_metadata`, ending up in quadratic complexity. Now that the notifications are called after all database shards are updated, we can apply the changes on token metadata / effective replication map only on shard 0 and count on replicate_to_all_cores to propagate those changes to all other shards. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-06-26 21:08:09 +03:00
Petr Gusev	1e851262f2	storage_proxy: handler responses, use pointers to default constructed values instead of nulls The current Seastar RPC infrastructure lacks support for null values in tuples in handler responses. In this commit we add the make_default_rpc_tuple function, which solves the problem by returning pointers to default-constructed values for smart pointer types rather than nulls. The problem was introduced in this commit `2d791a5ed4`. The function `encode_replica_exception_for_rpc` used `default_tuple_maker` callback to create tuples containing exceptions. Callers returned pointers to default-constructed values in this callback, e.g. `foreign_ptr(make_lw_shared<reconcilable_result>())`. The commit changed this to just `SourceTuple{}`, which means nullptr for pointer types. Fixes: #14282 Closes #14352	2023-06-26 11:10:38 +03:00
Kamil Braun	e6942d31d3	Merge 'query processor code cleanup' from Gleb The series contains mostly cleanups for query processor and no functional change. The last patch is a small cleanup for the storage_proxy. * 'qp-cleanup' of https://github.com/gleb-cloudius/scylla: storage_proxy: remove unused variable client_state: co-routinise has_column_family_access function query_processor: get rid of internal_state and create individual query_satate for each request cql3: move validation::validate_column_family from client_state::has_column_family_access client_state: drop unneeded argument from has.*access functions cql3: move check for dropping cdc tables from auth to the drop statement code itself query_processor: co-routinise execute_prepared_without_checking_exception_message function query_processor: co-routinize execute_direct_without_checking_exception_message function cql3: remove empty statement::validate functions cql3: remove empty function validate_cluster_support cql3/statements: fix indentation and spurious white spaces query_processor: move statement::validate call into execute_with_params function query_processor: co-routinise execute_with_params function query_processor: execute statement::validate before each execution of internal query instead of only during prepare query_processor: get rid of shared internal_query_state query_processor: co-routinize execute_paged_internal function query_processor: co_routinize execute_batch_without_checking_exception_message function query_processor: co-routinize process_authorized_statement function	2023-06-23 10:32:57 +02:00
Gleb Natapov	94fcba5662	storage_proxy: remove unused variable	2023-06-22 15:26:20 +03:00
Gleb Natapov	caee26ab4f	client_state: co-routinise has_column_family_access function	2023-06-22 15:26:20 +03:00
Gleb Natapov	4bad482e4b	cql3: move validation::validate_column_family from client_state::has_column_family_access Checking keyspace/table presence should not be part of authorization code and it is not done consistently today. For instance keyspace presence is not checked in "alter keyspace" during authorization, but during statement execution. Make it consistent.	2023-06-22 13:57:36 +03:00
Gleb Natapov	31bddb65c7	client_state: drop unneeded argument from has.*access functions After previous patch we can drop db argument to most of has.*access functions in the client_state.	2023-06-22 13:57:36 +03:00
Gleb Natapov	06bcce53b5	cql3: move check for dropping cdc tables from auth to the drop statement code itself Checking if a table is CDC log and cannot be dropped should not be done as part of authentication (this has nothing to do with auth), but in the drop statement itself. Throwing unauthorized_exception is wrong as well, but unfortunately it is enshrined with a test. Not sure if it is a good idea to change it now.	2023-06-22 13:57:36 +03:00
Botond Dénes	e1c2de4fb8	Merge 'forward_service: fix forgetting case-sensitivity in aggregates ' from Jan Ciołek There was a bug that caused aggregates to fail when used on case-sensitive columns. For example: ```cql SELECT SUM("SomeColumn") FROM ks.table; ``` would fail, with a message saying that there is no column "somecolumn". This is because the case-sensitivity got lost on the way. For non case-sensitive column names we convert them to lowercase, but for case sensitive names we have to preserve the name as originally written. The problem was in `forward_service` - we took a column name and created a non case-sensitive `column_identifier` out of it. This converted the name to lowercase, and later such column couldn't be found. To fix it, let's make the `column_identifier` case-sensitive. It will preserve the name, without converting it to lowercase. Fixes: https://github.com/scylladb/scylladb/issues/14307 Closes #14340 * github.com:scylladb/scylladb: service/forward_service.cc: make case-sensitivity explicit cql-pytest/test_aggregate: test case-sensitive column name in aggregate forward_service: fix forgetting case-sensitivity in aggregates	2023-06-22 08:25:33 +03:00
Avi Kivity	8576502c48	Merge 'raft topology: ban left nodes from the cluster' from Kamil Braun Use the new Seastar functionality for storing references to connections to implement banning hosts that have left the cluster (either decommissioned or using removenode) in raft-topology mode. Any attempts at communication from those nodes will be rejected. This works not only for nodes that restart, but also for nodes that were running behind a network partition when we removed them. Even when the partition resolves, the existing nodes will effectively firewall off that node. Some changes to the decommission algorithm had to be introduced for it to work with node banning. As a side effect a pre-existing problem with decommission was fixed. Read the "introduce `left_token_ring` state" and "prepare decommission path for node banning" commits for details. Closes #13850 * github.com:scylladb/scylladb: test: pylib: increase checking period for `get_alive_endpoints` test: add node banning test test: pylib: manager_client: `get_cql()` helper test: pylib: ScyllaCluster: server pause/unpause API raft topology: ban left nodes raft topology: skip `left_token_ring` state during `removenode` raft topology: prepare decommission path for node banning raft topology: introduce `left_token_ring` state raft topology: `raft_topology_cmd` implicit constructor messaging_service: implement host banning messaging_service: exchange host IDs and map them to connections messaging_service: store the node's host ID messaging_service: don't use parameter defaults in constructor main: move messaging_service init after system_keyspace init	2023-06-21 20:16:45 +03:00
Jan Ciołek	16c21d7252	service/forward_service.cc: make case-sensitivity explicit Make it explicit that the boolean argument determines case-sensitivity. It emphasizes its importance. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2023-06-21 16:02:41 +02:00
Jan Ciolek	7fca350075	forward_service: fix forgetting case-sensitivity in aggregates There was a bug that caused aggregates to fail when used on case-sensitive columns. For example: ``` SELECT SUM("SomeColumn") FROM ks.table; ``` would fail, with a message saying that there is no column "somecolumn". This is because the case-sensitivity got lost on the way. For non case-sensitive column names we convert them to lowercase, but for case sensitive names we have to preserve the name as originally written. The problem was in `forward_service` - we took a column name and created a non case-sensitive `column_identifier` out of it. This converted the name to lowercase, and later such column couldn't be found. To fix it, let's make the `column_identifier` case-sensitive. It will preserve the name, without converting it to lowercase. Fixes: https://github.com/scylladb/scylladb/issues/14307 Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2023-06-21 14:37:42 +02:00
Avi Kivity	e233f471b8	Merge 'Respect tablet shard assignment' from Tomasz Grabiec This PR changes the system to respect shard assignment to tablets in tablet metadata (system.tablets): 1. The tablet allocator is changed to distribute tablets evenly across shards taking into account currently allocated tablets in the system. Each tablet has equal weight. vnode load is ignored. 2. CDC subsystem was not adjusted (not supported yet) 3. sstable sharding metadata reflects tablet boundaries 5. resharding is NOT supported yet (the node will abort on boot if there is a need to reshard tablet-based tables) 6. The system is NOT prepared to handle tablet migration / topology changes in a safe way. 7. Sstable cleanup is not wired properly yet After this PR, dht::shard_of() and schema::get_sharder() are deprecated. One should use table::shard_of() and effective_replication_map::get_sharder() instead. To make the life easier, support was added to obtain table pointer from the schema pointer: ``` schema_ptr s; s->table().shard_of(...) 
``` Closes #13939 * github.com:scylladb/scylladb: locator: network_topology_startegy: Allocate shards to tablets locator: Store node shard count in topology service: topology: Extract topology updating to a lambda test: Move test_tablets under topology_experimental sstables: Add trace-level logging related to shard calculation schema: Catch incorrect uses of schema::get_sharder() dht: Rename dht::shard_of() to dht::static_shard_of() treewide: Replace dht::shard_of() uses with table::shard_of() / erm::shard_of() storage_proxy: Avoid multishard reader for tablets storage_proxy: Obtain shard from erm in the read path db, storage_proxy: Drop mutation/frozen_mutation ::shard_of() forward_service: Use table sharder alternator: Use table sharder db: multishard: Obtain sharder from erm sstable_directory: Improve trace-level logging db: table: Introduce shard_of() helper db: Use table sharder in compaction sstables: Compute sstable shards using sharder from erm when loading sstables: Generate sharding metadata using sharder from erm when writing test: partitioner: Test split_range_to_single_shard() on tablet-like sharder dht: Make split_range_to_single_shard() prepared for tablet sharder sstables: Move compute_shards_for_this_sstable() to load() dht: Take sharder externally in splitting functions locator: Make sharder accessible through effective_replication_map dht: sharder: Document guarantees about mapping stability tablets: Implement tablet sharder tablets: Include pending replica in get_shard() dht: sharder: Introduce next_shard() db: token_ring_table: Filter out tablet-based keyspaces db: schema: Attach table pointer to schema schema_registry: Fix SIGSEGV in learn() when concurrent with get_or_load() schema_registry: Make learn(schema_ptr) attach entry to the target schema test: lib: cql_test_env: Expose feature_service test: Extract throttle object to separate header	2023-06-21 10:20:41 +03:00
Calle Wilund	f18e967939	storage_proxy: Make split_stats resilient to being called from different scheduling group Fixes #11017 When doing writes, storage proxy creates types deriving from abstract_write_response_handler. These are created in the various scheduling groups executing the write inducing code. They pick up a group-local reference to the various metrics used by SP. Normally all code using (and esp. modifying) these metrics are executed in the same scheduling group. However, if gossip sees a node go down, it will notify listeners, which eventually calls get_ep_stat and register_metrics. This code (before this patch) uses _active_ scheduling group to eventually add metrics, using a local dict as guard against double regs. If, as described above, we're called in a different sched group than the original one however, this can cause double registrations. Fixed here by keeping a reference to creating scheduling group and using this, not active one, when/if creating new metrics. Closes #14294	2023-06-21 10:08:27 +03:00
Tomasz Grabiec	e110167a2a	locator: Store node shard count in topology Will be needed by tablet allocator.	2023-06-21 00:58:25 +02:00
Tomasz Grabiec	dd968e16bf	service: topology: Extract topology updating to a lambda Reduces code duplication.	2023-06-21 00:58:24 +02:00
Tomasz Grabiec	21198e8470	treewide: Replace dht::shard_of() uses with table::shard_of() / erm::shard_of() dht::shard_of() does not use the correct sharder for tablet-based tables. Code which is supposed to work with all kinds of tables should use erm::get_sharder().	2023-06-21 00:58:24 +02:00
Tomasz Grabiec	fb0bdcec0c	storage_proxy: Avoid multishard reader for tablets Currently, the coordinator splits the partition range at vnode (or tablet) boundaries and then tries to merge adjacent ranges which target the same replica. This is an optimization which makes less sense with tablets, which are supposed to be of substantial size. If we don't merge the ranges, then with tablets we can avoid using the multishard reader on the replica side, since each tablet lives on a single shard. The main reason to avoid a multishard reader is avoiding its complexity, and avoiding adapting it to work with tablet sharding. Currently, the multishard reader implementation makes several assumptions about shard assignment which do not hold with tablets. It assumes that shards are assigned in a round-robin fashion.	2023-06-21 00:58:24 +02:00
Tomasz Grabiec	10e05eec66	storage_proxy: Obtain shard from erm in the read path dht::shard_of() does not use the correct sharder for tablet-based tables. Code which is supposed to work with all kinds of tables should use erm::get_sharder().	2023-06-21 00:58:24 +02:00
Tomasz Grabiec	e48ec6fed3	db, storage_proxy: Drop mutation/frozen_mutation ::shard_of() dht::shard_of() does not use the correct sharder for tablet-based tables. Code which is supposed to work with all kinds of tables should use erm::get_sharder().	2023-06-21 00:58:24 +02:00
Tomasz Grabiec	d4497a058e	forward_service: Use table sharder schema::get_sharder() does not return the correct sharder for tablet-based tables. Code which is supposed to work with all kinds of tables should use erm::get_sharder().	2023-06-21 00:58:24 +02:00
Kamil Braun	643e69af89	Merge 'Cluster features on raft: add storage for supported and enabled features' from Piotr Dulikowski This PR implements the storage part of the cluster features on raft functionality, as described in the "Cluster features on raft v2" doc. These changes will be useful for later PRs that will implement the remaining parts of the feature. Two new columns are added to `system.topology`: - `supported_features set<text>` is a new clustering column which holds the features that given node advertises as supported. It will be first initialized when the node joins the cluster, and then updated every time the node reboots and its supported features set changes. - `enabled_features set<text>` is a new static column which holds the features that are considered enabled by the cluster. Unlike in the current gossip-based implementation the features will not be enabled implicitly when all nodes support a feature, but rather via an explicit action of the topology coordinator. These columns are reflected in the `topology_state_machine` structure and are populated when the topology state is loaded. Appropriate methods are added to the `topology_mutation_builder` and `topology_node_mutation_builder` in order to allow setting/modifying those columns. During startup, nodes update their corresponding `supported_features` column to reflect their current feature set. For now it is done unconditionally, but in the future appropriate checks will be added which will prevent nodes from joining / starting their server for group 0 if they can't guarantee that they support all enabled features. 
Closes #14232 * github.com:scylladb/scylladb: storage_service: update supported cluster features in group0 on start storage_service: add methods for features to topology mutation builder storage_service: use explicit ::set overload instead of a template storage_service: reimplement mutation builder setters storage_service: introduce topology_mutation_builder_base topology_state_machine: include information about features system_keyspace: introduce deserialize_set_column db/system_keyspace: add storage for cluster features managed in group 0	2023-06-20 18:32:00 +02:00
Piotr Dulikowski	3e955945de	storage_service: update supported cluster features in group0 on start Now, when a node starts, it will update its `supported_features` row in `system.topology` via `update_topology_with_local_metadata`. At this point, the functionality behind cluster features on raft is mostly incomplete and the state of the `supported_features` column does not influence anything so it's safe to update this column unconditionally. In the future, the node will only join / start group0 server if it is sure that it supports all enabled features and it can safely update the `supported_features` parameter.	2023-06-20 16:41:08 +02:00
Piotr Dulikowski	707e929831	storage_service: add methods for features to topology mutation builder The newly added `supported_features` and `enabled_features` columns can now be modified via topology mutation builders: - `supported_features` can now be overwritten via a new overload of `topology_node_mutation_builder::set`. - `enabled_features` can now be extended (i.e. more elements can be added to it) via `topology_mutation_builder::add_enabled_features`. As the set of enabled features only grows, this should be sufficient.	2023-06-20 16:41:08 +02:00
Piotr Dulikowski	2a4462a01f	storage_service: use explicit ::set overload instead of a template The `topology_node_mutation_builder::set` function has an overload which accepts any type which can be converted to string via `::format`. Its presence can lead to easy mistakes which can only be detected at runtime rather than at compile time. A concrete example: I wrote a function that accepts an std::set<S> where S is convertible to sstring; it turns out that std::string_view is not std::convertible_to sstring and overload resolution fell back to the catch-all overload. This commit gets rid of the catch-all overload and replaces it with explicit ones. Fortunately, it was used for only two enums, so it wasn't much work.	2023-06-20 16:41:08 +02:00
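The pitfall described in the commit above can be reproduced in isolation. The following is a minimal sketch, not the ScyllaDB code: `loose_builder`, `strict_builder` and `can_set` are hypothetical names, and `std::string` stands in for `sstring` (like the case in the commit message, `std::string_view` is not implicitly convertible to it).

```cpp
#include <cassert>
#include <set>
#include <string>
#include <string_view>
#include <type_traits>
#include <utility>

// Hypothetical builder that still carries a catch-all overload.
struct loose_builder {
    std::string chosen;

    // Intended overload: sets whose elements convert implicitly to std::string.
    template <typename S,
              std::enable_if_t<std::is_convertible_v<S, std::string>, int> = 0>
    loose_builder& set(const std::set<S>&) {
        chosen = "set overload";
        return *this;
    }

    // Catch-all: silently accepts anything the overload above rejects.
    template <typename T>
    loose_builder& set(const T&) {
        chosen = "catch-all";
        return *this;
    }
};

// Hypothetical builder with explicit overloads only.
struct strict_builder {
    template <typename S,
              std::enable_if_t<std::is_convertible_v<S, std::string>, int> = 0>
    strict_builder& set(const std::set<S>&) { return *this; }
};

// Detects at compile time whether `b.set(arg)` is a valid expression.
template <typename B, typename Arg, typename = void>
struct can_set : std::false_type {};

template <typename B, typename Arg>
struct can_set<B, Arg,
               std::void_t<decltype(std::declval<B&>().set(std::declval<Arg>()))>>
    : std::true_type {};

// Without the catch-all, the mistake is rejected by the compiler.
static_assert(can_set<strict_builder, std::set<std::string>>::value);
static_assert(!can_set<strict_builder, std::set<std::string_view>>::value);
```

With `loose_builder`, passing a `std::set<std::string_view>` compiles and lands in the catch-all, so the mistake only shows up at runtime. With `strict_builder` the same call fails to compile.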
Piotr Dulikowski	a8aaeabfac	storage_service: reimplement mutation builder setters As promised in the previous commit which introduced topology_mutation_builder_base, this commit adjusts existing setters of topology mutation builder and topology node mutation builder to use helper methods defined in the base class. Note that the `::set` method for the unordered set of tokens now does not delete the column in case an empty value is set, instead it just writes an empty set. This semantic is arguably more clear given that we have an explicit `::del` method and it shouldn't affect the existing implementation - we never intentionally insert an empty set of tokens.	2023-06-20 16:41:08 +02:00
Piotr Dulikowski	ee12192125	storage_service: introduce topology_mutation_builder_base Introduces `topology_mutation_builder_base` which will be a base class for both topology mutation builder and topology node mutation builder. Its purpose is to abstract away some detail about setting/deleting/etc. column in the mutation, the actual topology (node) mutation builder will only have to care about converting types and/or allowing only particular columns to be set. The class is using CRTP: derived classes provide access to the row being modified, schema and the timestamp. For the sake of commit diff readability, this commit only introduces this class and changes the builders to derive from it but no setter implementations are modified - this will be done in the next commit.	2023-06-20 16:41:08 +02:00
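The CRTP arrangement described above can be sketched as follows. This is an illustration under stated assumptions, not the actual `topology_mutation_builder_base`: the types `cell` and `row`, the class names, and the column name are hypothetical, and the schema accessor mentioned in the commit message is left out for brevity.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical stand-ins for a mutation cell and row.
struct cell {
    std::string value;
    std::int64_t ts;
};
using row = std::map<std::string, cell>;

// CRTP base: knows how to set and delete columns, but asks the derived
// class for the row being modified and the write timestamp.
template <typename Derived>
class builder_base {
    Derived& self() { return static_cast<Derived&>(*this); }

protected:
    Derived& apply_set(const std::string& column, const std::string& value) {
        self().target_row()[column] = cell{value, self().timestamp()};
        return self();
    }

    Derived& apply_del(const std::string& column) {
        self().target_row().erase(column);
        return self();
    }
};

// Derived builder: only decides which columns may be set and how values
// are converted; the mechanics live in the base class.
class node_builder : public builder_base<node_builder> {
    friend class builder_base<node_builder>;

    row _row;
    std::int64_t _ts;

    row& target_row() { return _row; }
    std::int64_t timestamp() const { return _ts; }

public:
    explicit node_builder(std::int64_t ts) : _ts(ts) {}

    node_builder& set_datacenter(const std::string& dc) {
        return apply_set("datacenter", dc);
    }
    node_builder& del_datacenter() { return apply_del("datacenter"); }

    const row& result() const { return _row; }
};
```

Because the base resolves `target_row()` and `timestamp()` through `static_cast<Derived&>`, no virtual dispatch is involved, and each builder can expose only the setters that make sense for it.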
Piotr Dulikowski	bc84d59665	topology_state_machine: include information about features Now, the newly added `supported_features` and `enabled_features` columns are reflected in the `topology_state_machine` structure.	2023-06-20 16:41:05 +02:00
Kamil Braun	63229e48e8	raft topology: ban left nodes	2023-06-20 13:03:46 +02:00
Kamil Braun	737c1b4ae6	raft topology: skip `left_token_ring` state during `removenode` The "tell the node to shut down" RPC would fail every time in the removenode path (since the node is dead), which is kind of awkward. Besides, for removenode we don't really need the `left_token_ring` state, we don't need to coordinate with the node - writes destined for it are failing anyway (since it's dead) and we can ban the node immediately. Remove the node from group 0 while in `write_both_read_new` transition state (even when we implement abort, in this state it's too late to abort, we're committed to removing the node - so it's fine to remove it from group 0 at this point).	2023-06-20 13:03:46 +02:00
Kamil Braun	977680773b	raft topology: prepare decommission path for node banning Currently the decommissioned node waits until it observes that it was moved to the `left` state, then proceeds to leave group 0 and shut down. Unfortunately, this strategy won't work once we introduce banning nodes that are in `left` state - there is no guarantee that the decommissioning node will observe that it entered `left` state. The replication of Raft commands races with the ban propagating through the cluster. We also can't make the node leave as soon as it observes the `left_token_ring` state, which would defeat the purpose of `left_token_ring` - allowing all nodes to observe that the node has left the token ring before it shuts down. We could introduce yet another state between `left_token_ring` and `left`, which the node waits for before shutting down; the coordinator would request a barrier from the node before moving to `left` state. The alternative - which we chose here - is to have the coordinator explicitly tell the node to shut down while we're in `left_token_ring` through a direct RPC. We introduce `raft_topology_cmd::command::shutdown` and send it to the node while in `left_token_ring` state, after we requested a cluster barrier. We don't require the RPC to succeed; we need to allow it to fail to preserve availability. This is because an earlier incarnation of the coordinator may have requested the node to shut down already, so the new coordinator will fail the RPC as the node is already dead. This also improves availability in general - if the node dies while we're in `left_token_ring`, we can proceed. We don't lose safety from that, since we'll ban the node (later commit). We only lose a bit of user experience if there's a failure at this decommission step - the decommissioning node may hang, never receiving the RPC (it will be necessary to shut it down manually). 
Another complication arising from banning the node is that it won't be able to leave group 0 on its own; by the time it tries that, it may have already been banned by the cluster (the coordinator moves the node to `left` state after telling it to shut down). So we get rid of the `leave_group0` step from `raft_decommission()` (which simplifies the function too), putting a `remove_from_raft_config` inside the coordinator code instead - after we told the node to shut down. (Removing the node from configuration is also another reason why we need to allow the above RPC to fail; the node won't be able to handle the request once it's outside the configuration, because it handles all coordinator requests by starting a read barrier.) Finally, a complication arises when the coordinator is the decommissioning node. The node would shut down in the middle of handling the `left_token_ring` state, leading to harmless but awkward errors even though there were no node/network failures (the original coordinator would fail the `left_token_ring` state logic; a new coordinator would take over and do it again, this time succeeding). We fix that by checking if we're the decommissioning node at the beginning of `left_token_ring` state handler, and if so, stepping down from leadership by becoming a nonvoter first.	2023-06-20 13:03:46 +02:00
Kamil Braun	b8ddfd9ef9	raft topology: introduce `left_token_ring` state We want the decommissioning node to wait before shutting down until every node learns that it left the token ring. Otherwise some nodes may still try coordinating writes to that node after it has already shut down, leading to unnecessary failures on the data path (e.g. for CL=ALL writes). Before this change, a node would shut down immediately after observing that it was in `left` state; some other nodes may still see it in `decommissioning` state and the topology transition state as `write_both_read_new`, so they'd try to write to that node. After this change, the node first enters the `left_token_ring` state before entering `left`, while the topology transition state is removed (so we've finished the token ring change - the node no longer has tokens in the ring, but it's still part of the topology). There we perform a read barrier, allowing all nodes to observe that the decommissioning node has indeed left the token ring. Only after that barrier succeeds do we allow the node to shut down.	2023-06-20 13:03:46 +02:00
Kamil Braun	c94c07804d	raft topology: `raft_topology_cmd` implicit constructor Saves some redundant typing when passing `raft_topology_cmd` parameters, so we can change this: ``` raft_topology_cmd{raft_topology_cmd::command::fence_old_reads} ``` into this: ``` raft_topology_cmd::command::fence_old_reads ```	2023-06-20 13:03:46 +02:00
Kamil Braun	8b152361f4	Merge 'raft topology: fixes after #13884 ' from Gusev Petr This PR fixes some problems found after the PR was merged: * missed `node_to_work_on` assignment in `handle_topology_transition`; * change error reporting in `update_fence_version` from `on_internal_error` to regular exceptions, since those exceptions can happen during normal operation. * `update_fence_version` has been moved after `group0_service.setup_group0_if_exist` in `main.cc`, otherwise we use uninitialized `token_metadata::version` and get an error. Fixes: #14303 Closes #14292 * github.com:scylladb/scylladb: main.cc: move update_fence_version after group0_service.setup_group0_if_exist shared_token_metadata: update_fence_version: on_internal_error -> throw storage_service: handle_topology_transition: fix missed node assignment	2023-06-20 13:02:17 +02:00
Kamil Braun	732feca115	storage_proxy: query_partition_key_range_concurrent: don't access empty range `query_partition_key_range_concurrent` implements an optimization when querying a token range that intersects multiple vnodes. Instead of sending a query for each vnode separately, it sometimes sends a single query to cover multiple vnodes - if the intersection of replica sets for those vnodes is large enough to satisfy the CL and good enough in terms of the heat metric. To check the latter condition, the code would take the smallest heat metric of the intersected replica set and compare it to the smallest heat metrics of replica sets calculated separately for each vnode. Unfortunately, there was an edge case that the code didn't handle: the intersected replica set might be empty and the code would access an empty range. This was caught by an assertion added in `8db1d75c6c` by the dtest `test_query_dc_with_rf_0_does_not_crash_db`. The fix is simple: check if the intersected set is empty - if so, don't calculate the heat metrics because we can decide early that the optimization doesn't apply. Also change the `assert` to `on_internal_error`. Fixes #14284 Closes #14300	2023-06-20 07:56:40 +03:00
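The shape of the fix above (check for emptiness before taking a minimum) can be sketched as follows. This is only an illustration: `min_heat` and `should_merge` are hypothetical names, heat values are modelled as plain `double`s, and the acceptance rule in `should_merge` is an assumption made for the example rather than the rule `storage_proxy` uses.

```cpp
#include <algorithm>
#include <cassert>
#include <optional>
#include <vector>

// Smallest heat value in a replica set, or nullopt when the set is empty.
// Dereferencing std::min_element on an empty range is undefined behaviour,
// so the empty case must be handled before the call.
std::optional<double> min_heat(const std::vector<double>& heats) {
    if (heats.empty()) {
        return std::nullopt;
    }
    return *std::min_element(heats.begin(), heats.end());
}

// Decides whether one merged query may replace the per-vnode queries.
// ASSUMPTION: the merged set is accepted when its smallest heat is not
// lower than the smallest heat of any per-vnode replica set.
bool should_merge(const std::vector<double>& merged,
                  const std::vector<std::vector<double>>& per_vnode) {
    const auto merged_min = min_heat(merged);
    if (!merged_min) {
        return false;  // empty intersection: the optimization does not apply
    }
    for (const auto& replicas : per_vnode) {
        const auto vnode_min = min_heat(replicas);
        if (vnode_min && *merged_min < *vnode_min) {
            return false;
        }
    }
    return true;
}
```

Returning early on an empty intersection means no heat metric is ever computed for it, which matches the commit's reasoning that the decision can be made before any comparison.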
Kamil Braun	aa2ccb3ac4	Merge 'raft topology: `wait_for_peers_to_enter_synchronize_state` doesn't need to resolve all IPs' from Mikołaj Grzebieluch Another node can stop after it joined the group0 but before it advertised itself in gossip. `get_inet_addrs` will try to resolve all IPs and `wait_for_peers_to_enter_synchronize_state` will loop indefinitely. But `wait_for_peers_to_enter_synchronize_state` can return early if one of the nodes confirms that the upgrade procedure has finished. For that, it doesn't need the IPs of all group 0 members - only the IP of some nodes which can do the confirmation. This PR restructures the code so that IPs of nodes are resolved inside the `max_concurrent_for_each` that `wait_for_peers_to_enter_synchronize_state` performs. Then, even if some IPs won't be resolved, but one of the nodes confirms a successful upgrade, we can continue. Fixes #13543 Closes #14046 * github.com:scylladb/scylladb: raft topology: test: check if aborted node replacing blocks bootstrap raft topology: `wait_for_peers_to_enter_synchronize_state` doesn't need to resolve all IPs	2023-06-19 12:31:27 +02:00
Petr Gusev	1770feebda	storage_service: handle_topology_transition: fix missed node assignment This defect remained after the refactoring of exec_global_command in #13884.	2023-06-19 11:26:57 +04:00

3539 Commits