scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-02 22:25:48 +00:00

Author	SHA1	Message	Date
Piotr Dulikowski	25fec0acce	gms/feature_service: introduce SECONDARY_INDEXES_ON_STATIC_COLUMNS cluster feature The new feature will prevent secondary indexes on static columns from being created unless the whole cluster is ready to support them.	2022-12-06 11:21:16 +01:00
Tomasz Grabiec	1a6bf2e9ca	Merge 'service/raft: specialized verb for failure detector pinger' from Kamil Braun We used GOSSIP_ECHO verb to perform failure detection. Now we use a special verb DIRECT_FD_PING introduced for this purpose. There are multiple reasons to do so. One minor reason: we want to use the same connection as other Raft verbs: if we can't deliver Raft append_entries or vote messages somewhere, that endpoint should be marked dead; if we can, the endpoint should be marked alive. So putting pings on the same connection as the other Raft verbs is important when dealing with weird situations where some connections are available but others are not. Observe that in `do_get_rpc_client_idx`, we put the new verb in the right place. Another minor reason: we remove the awkward gossiper `echo_pinger` abstraction which required storing and updating gossiper generation numbers. This also removes one dependency from Raft service code to gossiper. Major reason 1: the gossip echo handler has a weird mechanism where a replacing node returns errors during the replace operation to some of the nodes. In Raft however, we want to mark servers as alive when they are alive, including a server running on a node that's replacing another node. Major reason 2, related to the previous one: when server B is replacing server A with the same IP, the failure detector will try to ping both servers. Both servers are mapped to the same IP by the address map, so pings to both servers will reach server B. We want server B to respond to the pings destined for server B, but not to pings destined for server A, so the sender can mark B alive but keep A marked dead. To do this, we include the destination's Raft ID in our RPCs. The destination compares the received ID with its own. If it's different, it returns a `wrong_destination` response, and the failure detector knows that the ping did not reach the destination (it reached someone else). 
Yet another reason: removes "Not ready to respond gossip echo message" log spam during replace. Closes #12107 * github.com:scylladb/scylladb: service/raft: specialized verb for failure detector pinger db: system_keyspace: de-staticize `{get,set}_raft_server_id` service/raft: make this node's Raft ID available early in group registry	2022-12-02 13:54:02 +01:00
Kamil Braun	cbdcc944b5	service/raft: specialized verb for failure detector pinger We used GOSSIP_ECHO verb to perform failure detection. Now we use a special verb DIRECT_FD_PING introduced for this purpose. There are multiple reasons to do so. One minor reason: we want to use the same connection as other Raft verbs: if we can't deliver Raft append_entries or vote messages somewhere, that endpoint should be marked dead; if we can, the endpoint should be marked alive. So putting pings on the same connection as the other Raft verbs is important when dealing with weird situations where some connections are available but others are not. Observe that in `do_get_rpc_client_idx`, we put the new verb in the right place. Another minor reason: we remove the awkward gossiper `echo_pinger` abstraction which required storing and updating gossiper generation numbers. This also removes one dependency from Raft service code to gossiper. Major reason 1: the gossip echo handler has a weird mechanism where a replacing node returns errors during the replace operation to some of the nodes. In Raft however, we want to mark servers as alive when they are alive, including a server running on a node that's replacing another node. Major reason 2, related to the previous one: when server B is replacing server A with the same IP, the failure detector will try to ping both servers. Both servers are mapped to the same IP by the address map, so pings to both servers will reach server B. We want server B to respond to the pings destined for server B, but not to pings destined for server A, so the sender can mark B alive but keep A marked dead. To do this, we include the destination's Raft ID in our RPCs. The destination compares the received ID with its own. If it's different, it returns a `wrong_destination` response, and the failure detector knows that the ping did not reach the destination (it reached someone else). 
Yet another reason: removes "Not ready to respond gossip echo message" log spam during replace.	2022-12-01 20:54:18 +01:00
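The destination check described in this commit message can be sketched as follows. This is a minimal illustration, not the actual ScyllaDB code: `server_id`, `ping_result`, `handle_ping` and `replace_scenario_example` are hypothetical names, and the real Raft ID is a UUID-based type rather than an integer.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a Raft server ID (an assumption for this sketch).
struct server_id {
    std::uint64_t value;
    friend bool operator==(server_id a, server_id b) { return a.value == b.value; }
};

// Possible outcomes of a ping, as decided by the node that received it.
enum class ping_result { ok, wrong_destination };

// The receiver compares the ID carried by the ping with its own ID.
// A mismatch means the ping reached a different server behind the same IP.
inline ping_result handle_ping(server_id self, server_id destination) {
    return destination == self ? ping_result::ok : ping_result::wrong_destination;
}

// Server B replaces server A on the same IP, so both pings land on B.
inline void replace_scenario_example() {
    const server_id a{1};
    const server_id b{2};
    assert(handle_ping(b, b) == ping_result::ok);                 // B is marked alive
    assert(handle_ping(b, a) == ping_result::wrong_destination);  // A stays dead
}
```

Only an `ok` response would let the sender mark the destination alive; `wrong_destination` tells it that some other server answered.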
Avi Kivity	a4b77a5691	Merge 'Cleanup sstables::test_env's manager usage' from Pavel Emelyanov Mainly this PR removes global db::config and feature service that are used by sstables::test_env as dependencies for embedded sstables_manager. Other than that -- drop unused methods, remove nested test_env-s and relax few cases that use two temp dirs at a time for no gain. Closes #12155 * github.com:scylladb/scylladb: test, utils: Use only one tempdir sstable_compaction_test: Dont create nested envs mutation_reader_test: Remove unused create_sstable() helper tests, lib: Move globals onto sstables::test_env tests: Use sstables::test_env.db_config() to access config features: Mark feature_config_from_db_config const sstable_3_x_test: Use env method to create sst sstable_3_x_test: Indentation fix after previous patch sstable_3_x_test: Use sstable::test_env test: Add config to sstable::test_env creation config: Add constexpr value for default murmur ignore bits	2022-12-01 17:47:25 +02:00
Pavel Emelyanov	b4e31ad359	features: Mark feature_config_from_db_config const It's in fact such. Other than that, next patch will call it with const config at hand and fail to compile without this fix Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-01 13:39:27 +03:00
Konstantin Osipov	73e5298273	raft: (address map) actively maintain ip <-> raft server id map 1) make address map API flexible Before this patch: - having a mapping without an actual IP address was an internal error - not having a mapping for an IP address was an internal error - re-mapping to a new IP address wasn't allowed After this patch: - the address map may contain a mapping without an actual IP address, and the caller must be prepared for it: find() will return a nullopt. This happens when we first add an entry to Raft configuration and only later learn its IP address, e.g. via gossip. - it is allowed to re-map an existing entry to a new address; 2) subscribe to gossip notifications Learning IP addresses from gossip allows us to adjust the address map whenever a node IP address changes. Gossiper is also the only valid source of re-mapping, other sources (RPC) should not re-map, since otherwise a packet from a removed server can remap the id to a wrong address and impact liveness of a Raft cluster. 3) prompt address map state with app state Initialize the raft address map with initial gossip application state, specifically IPs of members of the cluster. With this, we no longer need to store these IPs in Raft configuration (and update them when they change). The obvious drawback of this approach is that a node may join Raft config before it propagates its IP address to the cluster via gossip - so the boot process has to wait until it happens. Gossip also doesn't tell us which IPs are members of Raft configuration, so we subscribe to Group0 configuration changes to mark the members of Raft config "non-expiring" in the address translation map. Thanks to the changes above, Raft configuration no longer stores IP addresses. We still keep the 'server_info' column in the raft_config system table, in case we change our mind or decide to store something else in there.	2022-11-29 19:55:43 +03:00
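The relaxed address map contract from point 1 above can be illustrated with a small sketch. It assumes a simplified map keyed by an integer ID with string addresses; the class and method names (`address_map`, `add_entry`, `set_address`, `set_nonexpiring`) are hypothetical and are not claimed to match the real API, apart from `find()` returning an empty optional when the address is unknown.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical, simplified address map: Raft server ID -> optional IP address.
class address_map {
    struct entry {
        std::optional<std::string> ip;  // empty until learned, e.g. via gossip
        bool expiring = true;           // Raft config members become non-expiring
    };
    std::unordered_map<int, entry> _entries;
public:
    // Register an ID whose address is not known yet.
    void add_entry(int id) { _entries.try_emplace(id); }
    // Map an ID to an address, or re-map it to a new one.
    void set_address(int id, std::string ip) { _entries[id].ip = std::move(ip); }
    // Mark an ID as a member of the Raft configuration.
    void set_nonexpiring(int id) { _entries[id].expiring = false; }
    // Returns std::nullopt for unknown IDs and for IDs without an address.
    std::optional<std::string> find(int id) const {
        auto it = _entries.find(id);
        if (it == _entries.end()) {
            return std::nullopt;
        }
        return it->second.ip;
    }
};
```

A caller would first see `find(id)` return `std::nullopt`, then the learned address once it has been set, and later a different address if the ID is re-mapped.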
Nadav Har'El	2dedb5ea75	alternator: make Alternator TTL feature no longer "experimental" Until now, the Alternator TTL feature was considered "experimental", and had to be manually enabled on all nodes of the cluster to be usable. This patch removes this requirement and in essence GAs this feature. Even after this patch, Alternator TTL is still a "cluster feature", i.e., for this feature to be usable every node in the cluster needs to support it. If any of the nodes is old and does not yet support this feature, the UpdateTimeToLive request will not be accepted, so although the expiration-scanning threads may exist on the newer nodes, they will not do anything because none of the tables can be marked as having expiration enabled. This patch does not contain documentation fixes - the documentation still suggests that the Alternator TTL feature is experimental. The documentation patch will come separately. Fixes #12037 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #12049	2022-11-24 17:21:39 +02:00
Kamil Braun	d7649a86c4	Merge 'Build up to support of dynamic IP address changes in Raft' from Konstantin Osipov We plan to stop storing IP addresses in Raft configuration, and instead use the information disseminated through gossip to locate Raft peers. Implement patches that are building up to that: * improve Raft API of configuration change notifications * disseminate raft host id in Gossip * avoid using Raft addresses from Raft configuration, and instead consistently use the translation layer between raft server id <-> IP address Closes #11953 * github.com:scylladb/scylladb: raft: persist the initial raft address map raft: (upgrade) do not use IP addresses from Raft config raft: (and gossip) begin gossiping raft server ids raft: change the API of conf change notifications	2022-11-18 11:38:19 +01:00
Asias He	4571fcf9e7	token_metadata: Rename is_member to is_normal_token_owner The name is_normal_token_owner is more clear than is_member. The is_normal_token_owner reflects what it really checks.	2022-11-18 09:29:20 +08:00
Konstantin Osipov	051dceeaff	raft: (and gossip) begin gossiping raft server ids We plan to use gossip data to educate Raft RPC about IP addresses of raft peers. Add raft server ids to application state, so that when we get a notification about a gossip peer we can identify which raft server id this notification is for, specifically, we can find what IP address stands for this server id, and, whenever the IP address changes, we can update Raft address map with the new address. By the same token, at boot time, we now have to start Gossip before Raft, since Raft won't be able to send any messages without gossip data about IP addresses.	2022-11-17 12:07:31 +03:00
Asias He	16bd9ec8b1	gossip: Improve get_live_token_owners and get_unreachable_token_owners The get_live_token_owners returns the nodes that are part of the ring and are alive. The get_unreachable_token_owners returns the nodes that are part of the ring and are not alive. The token_metadata::get_all_endpoints returns nodes that are part of the ring. The patch changes both functions to use the more authoritative source to get the nodes that are part of the ring and call is_alive to check if the node is up or down, so that the correctness does not depend on any derived information. This patch fixes a truncate issue in storage_proxy::truncate_blocking where it calls get_live_token_owners and get_unreachable_token_owners to decide the nodes to talk with for truncate operation. The truncate failed because incorrect nodes were returned. Fixes #10296 Fixes #11928 Closes #11952	2022-11-15 14:21:48 +01:00
Kamil Braun	2c20f2ab9d	gms: gossiper: move `direct_fd_pinger` out to a separate service In later commit `direct_fd_pinger` will operate in terms of `raft::server_id`s. Decouple it from `gossiper` since we don't want to entangle `gossiper` with Raft-specific stuff.	2022-11-04 09:38:08 +01:00
Kamil Braun	e9a4263e14	gms: gossiper: direct_fd_pinger: extract generation number caching to a separate class `gms::gossiper::direct_fd_pinger` serves multiple purposes: one of them is to maintain a mapping between `gms::inet_address`es and `direct_failure_detector::pinger::endpoint_id`s, another is to cache the last known gossiper's generation number to use it for sending gossip echo messages. The latter is the only gossiper-specific thing in this class. We want to move `direct_fd_pinger` outside `gossiper`. To do that, split the gossiper-specific thing -- the generation number management -- to a smaller class, `echo_pinger`. `echo_pinger` is a top-level class (not a nested one like `direct_fd_pinger` was) so we can forward-declare it and pass references to it without including gms/gossiper.hh header.	2022-11-04 09:38:08 +01:00
Pavel Emelyanov	e245780d56	gossiper: Request topology states in shadow round When doing shadow round for replacement the bootstrapping node needs to know the dc/rack info about the node it replaces to configure it on topology. This topology info is later used by e.g. repair service. fixes: #11829 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #11838	2022-10-25 13:21:20 +03:00
Pavel Emelyanov	898579027d	gossiper: Pass current snitch name into checker Gossiper makes sure local snitch name is the same as the one of other nodes in the ring. It now gets global snitch to get the name, this patch passes the name as an argument, because the caller (storage_service) has snitch instance local reference Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-20 12:33:38 +03:00
Botond Dénes	2d581e9e8f	Merge "Maintain dc/rack by topology" from Pavel Emelyanov " There's an ongoing effort to move the endpoint -> {dc/rack} mappings from snitch onto topology object and this set finalizes it. After it the snitch service stops depending on gossiper and system keyspace and is ready for de-globalization. As a nice side-effect the system keyspace no longer needs to maintain the dc/rack info cache and its starting code gets relaxed. refs: #2737 refs: #2795 " * 'br-snitch-dont-mess-with-topology-data-2' of https://github.com/xemul/scylla: (23 commits) system_keyspace: Dont maintain dc/rack cache system_keyspace: Indentation fix after previous patch system_keyspace: Coroutinuze build_dc_rack_info() topology: Move all post-configuration to topology::config snitch: Start early gossiper: Do not export system keyspace snitch: Remove gossiper reference snitch: Mark get_datacenter/_rack methods const snitch: Drop some dead dependency knots snitch, code: Make get_datacenter() report local dc only snitch, code: Make get_rack() report local rack only storage_service: Populate pending endpoint in on_alive() code: Populate pending locations topology: Put local dc/rack on topology early topology: Add pending locations collection topology: Make get_location() errors more verbose token_metadata: Add config, spread everywhere token_metadata: Hide token_metadata_impl copy constructor gosspier: Remove messaging service getter snitch: Get local address to gossip via config ...	2022-10-19 06:50:21 +03:00
Asias He	6134fe4d1f	storage_service: Prevent removed node to rejoin in handle_state_normal - Start n1, n2, n3 (127.0.0.3) - Stop n3 - Change ip address of n3 to 127.0.0.33 and restart n3 - Decommission n3 - Start new node n4 The node n4 will learn from the gossip entry for 127.0.0.3 that node 127.0.0.3 is in shutdown status which means 127.0.0.3 is still part of the ring. This patch prevents this by checking the status for the host id on all the entries. If any of the entries shows the node with the host id is in LEFT status, reject to put the node in NORMAL status. Fixes #11355 Closes #11361	2022-10-13 15:11:32 +02:00
Pavel Emelyanov	16188a261e	gossiper: Do not export system keyspace No users of it left. Despite the gossiper->system_keyspace dependency is not needed either, keep it alive because gossiper still updates system keyspace with feature masks, so chances are it will be reactivated some time later. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-11 05:17:08 +03:00
Pavel Emelyanov	072ef88ed1	gosspier: Remove messaging service getter No code needs to borrow messaging from gossiper, which is nice Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-11 05:17:08 +03:00
Benny Halevy	9ad41c700e	gms/feature_service: add large_collection_detection cluster feature And a corresponding db::schema_feature::SCYLLA_LARGE_COLLECTIONS We want to enable the schema change supporting collection_elements only when all nodes are upgraded so that we can roll back if the rolling upgrade process is aborted. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-04 08:42:07 +03:00
Kamil Braun	67ee6500e3	service/raft: raft_group_registry: pass `direct_fd_pinger` by reference It was passed to `raft_group_registry::direct_fd_proxy` by value. That is a bug, we want to pass a reference to the instance that is living inside `gossiper`. Fortunately this bug didn't cause problems, because the pinger is only used for one function, `get_address`, which looks up an address in a map and if it doesn't find it, accesses the map that lives inside `gossiper` on shard 0 (and then caches it in the local copy). Explicitly delete the copy constructor of `direct_fd_pinger` so this doesn't happen again. Closes #11661	2022-10-03 16:40:35 +02:00
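The fix described here, passing by reference and deleting the copy constructor so an accidental copy no longer compiles, can be shown in miniature. This is a sketch under stated assumptions: `pinger` and `proxy` are hypothetical stand-ins for `direct_fd_pinger` and `direct_fd_proxy`, and the hit counter exists only to make the shared state visible.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-in for a service object that must never be copied.
class pinger {
    int _hits = 0;
public:
    pinger() = default;
    pinger(const pinger&) = delete;             // copying is now a compile error
    pinger& operator=(const pinger&) = delete;
    void record_hit() { ++_hits; }
    int hits() const { return _hits; }
};

// Hypothetical stand-in for a user of the service: it stores a reference,
// so every proxy talks to the same instance.
class proxy {
    pinger& _pinger;
public:
    explicit proxy(pinger& p) : _pinger(p) {}
    void ping() { _pinger.record_hit(); }
};

static_assert(!std::is_copy_constructible<pinger>::value,
              "pinger must not be copyable");
```

With this shape, two `proxy` objects built from the same `pinger` update one shared counter, whereas a by-value member would silently work on a private copy.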
Pavel Emelyanov	7ae73c665b	gossiper: Remove some dead code Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #11599	2022-09-22 06:58:29 +03:00
Kamil Braun	2fe3e67a47	gms: feature_service: don't distinguish between 'known' and 'supported' features `feature_service` provided two sets of features: `known_feature_set` and `supported_feature_set`. The purpose of both and the distinction between them was unclear and undocumented. The 'supported' features were gossiped by every node. Once a feature is supported by every node in the cluster, it becomes 'enabled'. This means that whatever piece of functionality is covered by the feature, it can be used by the cluster from now on. The 'known' set was used to perform feature checks on node start; if the node saw that a feature is enabled in the cluster, but the node does not 'know' the feature, it would refuse to start. However, if the feature was 'known', but wasn't 'supported', the node would not complain. This means that we could in theory allow the following scenario: 1. all nodes support feature X. 2. X becomes enabled in the cluster. 3. the user changes the configuration of some node so feature X will become unsupported but still known. 4. The node restarts without error. So now we have a feature X which is enabled in the cluster, but not every node supports it. That does not make sense. It is not clear whether it was accidental or purposeful that we used the 'known' set instead of the 'supported' set to perform the feature check. What I think is clear, is that having two sets makes the entire thing unnecessarily complicated and hard to think about. Fortunately, at the base to which this patch is applied, the sets are always the same. So we can easily get rid of one of them. I decided that the name which should stay is 'supported', I think it's more specific than 'known' and it matches the name of the corresponding gossiper application state. Closes #11512	2022-09-12 13:09:12 +03:00
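The rule stated in this message (a feature becomes enabled once every node supports it, and a node must refuse to start if it does not support an enabled feature) can be sketched as a set intersection plus a subset check. The names `feature_set`, `enabled_features` and `can_start` are hypothetical, and the sketch deliberately ignores nodes whose feature sets are unknown.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>
#include <vector>

using feature_set = std::set<std::string>;

// A feature is enabled only if it appears in every node's supported set.
inline feature_set enabled_features(const std::vector<feature_set>& supported_by_node) {
    if (supported_by_node.empty()) {
        return {};
    }
    feature_set enabled = supported_by_node.front();
    for (const auto& node_features : supported_by_node) {
        feature_set next;
        for (const auto& f : enabled) {
            if (node_features.count(f)) {
                next.insert(f);
            }
        }
        enabled = std::move(next);
    }
    return enabled;
}

// Startup check against a single 'supported' set: every feature already
// enabled in the cluster must be supported by the starting node.
inline bool can_start(const feature_set& local_supported, const feature_set& cluster_enabled) {
    for (const auto& f : cluster_enabled) {
        if (!local_supported.count(f)) {
            return false;
        }
    }
    return true;
}
```

In the scenario from the message, a node that drops X from its supported set after X was enabled would fail `can_start`, which is the behaviour a single 'supported' set gives.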
Kamil Braun	be1ef9d2a7	gms: feature_service: remove the USES_RAFT feature It was not and won't be used for anything. Note that the feature was always disabled or masked so no node ever announced it, thus it's safe to get rid of. Closes #11505	2022-09-09 18:05:46 +02:00
Avi Kivity	35fbba3a5b	Revert "gms: gossiper: include nodes with empty feature sets when calculating enabled features" This reverts commit `08842444b4`. It causes a failure in test_shutdown_all_and_replace_node. Fixes #11316.	2022-08-18 15:01:50 +03:00
Kamil Braun	08842444b4	gms: gossiper: include nodes with empty feature sets when calculating enabled features Right now, if there's a node for which we don't know the features supported by this node (they are neither persisted locally, nor gossiped by that node), we would skip this node in calculating the set of enabled features and potentially enable a feature which shouldn't be enabled - because that node may not know it. We should only enable a feature when we know that all nodes have upgraded and know the feature. This bug caused us problems when we tried to move RAFT out of experimental. There are dtests such as `partitioner_tests.py` in which nodes would enable features prematurely, which caused the Raft upgrade procedure to break (the procedure starts only when all nodes upgrade and announce that they know the SUPPORTS_RAFT cluster feature). Closes #11225	2022-08-16 19:07:41 +03:00
Nadav Har'El	8b00c91c13	cql, index: make collection indexing a cluster feature Prevent a user from creating a secondary index on a collection column if the cluster has any nodes which don't support this feature. Such nodes will not be able to correctly handle requests related to this index, so better not allow creating one. Attempting to create an index on a collection before the entire cluster supports this feature will result in the error: Indexing of collection columns not supported by some older nodes in this cluster. Please upgrade them. Tested by manually disabling this feature in feature_service.cc and seeing this error message during collection indexing test. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-08-14 10:29:52 +03:00
Benny Halevy	d295d8e280	everywhere: define locator::host_id as a strong tagged_uuid type So it can be distinguished from other uuid-based identifiers in the system. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #11276	2022-08-12 06:01:44 +03:00
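The idea of a strong, tagged identifier type can be sketched as follows. This is an illustration only: `tagged_id` is a hypothetical template over a 64-bit integer, not the real `tagged_uuid`, and the tag names are made up for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Hypothetical tagged wrapper: the Tag parameter carries no data and exists
// only to make otherwise identical ID types distinct to the compiler.
template <typename Tag>
struct tagged_id {
    std::uint64_t value;
    friend bool operator==(tagged_id a, tagged_id b) { return a.value == b.value; }
};

using host_id = tagged_id<struct host_id_tag>;
using table_id = tagged_id<struct table_id_tag>;

// The two ID types cannot be mixed up or converted into one another.
static_assert(!std::is_same<host_id, table_id>::value,
              "tags produce distinct types");
static_assert(!std::is_convertible<host_id, table_id>::value,
              "no implicit conversion between differently tagged IDs");
```

Passing a `table_id` where a `host_id` is expected then fails at compile time instead of surfacing as a runtime bug.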
Avi Kivity	e9cbc9ee85	Merge 'Add support for empty replica pages' from Botond Dénes Many tombstones in a partition is a problem that has been plaguing queries since the inception of Scylla (and even before that as they are a pain in Apache Cassandra too). Tombstones don't count towards the query's page limit, neither the size nor the row number one. Hence, large spans of tombstones (be that row- or range-tombstones) are problematic: the query can time out while processing this span of tombstones, as it waits for more live rows to fill the page. In the extreme case a partition becomes entirely unreadable, all read attempts timing out, until compaction manages to purge the tombstones. The solution proposed in this PR is to pass down a tombstone limit to replicas: when this limit is reached, the replica cuts the page and marks it as short one, even if the page is empty currently. To make this work, we use the last-position infrastructure added recently by `3131cbea62`, so that replicas can provide the position of the last processed item to continue the next page from. Without this no forward progress could be made in the case of an empty page: the query would continue from the same position on the next page, having to process the same span of tombstones. The limit can be configured with the newly added `query_tombstone_limit` configuration item, defaulted to 10000. The coordinator will pass this to the newly added `tombstone_limit` field of `read_command`, if the `replica_empty_pages` cluster feature is set. Upgrade sanity test was conducted as following: * Created cluster of 3 nodes with RF=3 with master version * Wrote small dataset of 1000 rows. * Deleted prefix of 980 rows. * Started read workload: `scylla-bench -mode=read -workload=uniform -replication-factor=3 -nodes 127.0.0.1,127.0.0.2,127.0.0.3 -clustering-row-count=10000 -duration=10m -rows-per-request=9000 -page-size=100` * Also did some manual queries via `cqlsh` with smaller page size and tracing on. 
* Stopped and upgraded each node one-by-one. New nodes were started by `--query-tombstone-page-limit=10`. * Confirmed there are no errors or read-repairs. Perf regression test: ``` build/release/test/perf/perf_simple_query_g -c1 -m2G --concurrency=1000 --task-quota-ms 10 --duration=60 ``` Before: ``` median 133665.96 tps ( 62.0 allocs/op, 12.0 tasks/op, 43007 insns/op, 0 errors) median absolute deviation: 973.40 maximum: 135511.63 minimum: 104978.74 ``` After: ``` median 129984.90 tps ( 62.0 allocs/op, 12.0 tasks/op, 43181 insns/op, 0 errors) median absolute deviation: 2979.13 maximum: 134538.13 minimum: 114688.07 ``` Diff: +~200 instruction/op. Fixes: https://github.com/scylladb/scylla/issues/7689 Fixes: https://github.com/scylladb/scylla/issues/3914 Fixes: https://github.com/scylladb/scylla/issues/7933 Refs: https://github.com/scylladb/scylla/issues/3672 Closes #11053 * github.com:scylladb/scylladb: test/cql-pytest: add test for query tombstone page limit query-result-writer: stop when tombstone-limit is reached service/pager: prepare for empty pages service/storage_proxy: set smallest continue pos as query's continue pos service/storage_proxy: propagate last position on digest reads query: result_merger::get() don't reset last-pos on short-reads and last pages query: add tombstone-limit to read-command service/storage_proxy: add get_tombstone_limit() query: add tombstone_limit type db/config: add config item for query tombstone limit gms: add cluster feature for empty replica pages tree: don't use query::read_command's IDL constructor	2022-08-10 13:38:06 +03:00
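The paging behaviour described in this pull request (cut the page when the tombstone limit is reached, even if the page is empty, and report the last processed position so the next page can make progress) can be sketched as below. Everything here is a simplification for illustration: `page`, `read_page`, and the modelling of a partition as a vector in which an empty optional stands for a tombstone are assumptions, not the replica's real data structures.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical page result: live rows, the position to continue from,
// and whether the page was cut short by the tombstone limit.
struct page {
    std::vector<int> live_rows;
    std::size_t next_pos;
    bool is_short;
};

// items[i] holds a value for a live row and std::nullopt for a tombstone.
inline page read_page(const std::vector<std::optional<int>>& items, std::size_t start,
                      std::size_t row_limit, std::size_t tombstone_limit) {
    page p{{}, start, false};
    std::size_t tombstones = 0;
    std::size_t pos = start;
    for (; pos < items.size(); ++pos) {
        if (p.live_rows.size() == row_limit) {
            break;  // page is full, stop before consuming this item
        }
        if (items[pos]) {
            p.live_rows.push_back(*items[pos]);
        } else if (++tombstones == tombstone_limit) {
            ++pos;              // this tombstone has been processed
            p.is_short = true;  // cut the page, possibly with no rows at all
            break;
        }
    }
    p.next_pos = pos;
    return p;
}
```

Without `next_pos`, an empty short page would force the next request to rescan the same span of tombstones, which is the forward-progress problem the message mentions.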
Asias He	12ab2c3d8d	storage_service: Prevent removed node to restart and join the cluster 1) Start node1,2,3 2) Stop node3 3) Run nodetool removenode $host_id_of_node3 4) Restart node3 Step 4 is wrong and not allowed. If it happens it will bring back node3 to the cluster. This patch adds a check during node restart to detect such operation error and reject the restart. With this patch, we would see the following in step 4. ``` init - Startup failed: std::runtime_error (The node 127.0.0.3 with host_id fa7e500a-8617-4de4-8efd-a0e177218ee8 is removed from the cluster. Can not restart the removed node to join the cluster again!) ``` Refs #11217 Closes #11244	2022-08-09 12:46:21 +03:00
Botond Dénes	1bc14b5e3b	gms: add cluster feature for empty replica pages So we can start using them only when the entire cluster supports it.	2022-08-09 10:00:40 +03:00
Benny Halevy	2b017ce285	schema, everywhere: define and use table_schema_version as a strong type Define table_schema_version as a distinct tagged_uuid class, So it can be differentiated from other uuid-class types, in particular table_id. Added reversed(table_schema_version) for convenience and uniformity since the same logic is currently open coded in several places. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-08 08:09:45 +03:00
Warren Krewenki	4178ccd27f	gossiper: Correct typo in log message Closes #11212	2022-08-05 18:21:36 +03:00
Kamil Braun	a1aa9cf3f7	gms: gossiper: mark some member functions const	2022-08-04 12:19:43 +02:00
Kamil Braun	566e5f2a4f	gms: gossiper: move `endpoint_filter` to `storage_proxy` module The function only uses one public function of `gossiper` (`is_alive`) and is used only in one place in `storage_proxy`. Make it a static function private to the `storage_proxy` module. The function used a `default_random_engine` field in `gossiper` for generating random numbers. Turn this field into a static `thread_local` variable inside the function - no other `gossiper` members used the field.	2022-08-04 12:16:09 +02:00
Jadw1	2c46222e31	db,gms: Add SCYLLA_AGGREGATES schema features This schema feature will be used to guard system_schema.scylla_aggregates schema table.	2022-07-18 14:18:48 +02:00
Jadw1	346fb08680	gms: add UDA_NATIVE_PARALLELIZED_AGGREGATION feature Feature that indicates whether the cluster supports the optional UDA parameter (reduction function) and parallelization of UDA and native aggregates.	2022-07-18 14:18:48 +02:00
Nadav Har'El	cc69177dcc	config: fix printing of experimental feature list Recently we noticed a regression where with certain versions of the fmt library, SELECT value FROM system.config WHERE name = 'experimental_features' returns numeric strings, like "5", instead of feature names like "raft". It turns out that the fmt library keeps changing its overload resolution order when there are several ways to print something. For enum_option<T> we happen to have two conflicting ways to print it: 1. We have an explicit operator<<. 2. We have an implicit converter to the type held by T. We were hoping that the operator<< always wins. But in fmt 8.1, there is special logic that if the type is convertible to an int, this is used before operator<<()! For experimental_features_t, the type held in it was an old-style enum, so it is indeed convertible to int. The solution I used in this patch is to replace the old-style enum in experimental_features_t by the newer and more recommended "enum class", which does not have an implicit conversion to int. I could have fixed it in other ways, but it wouldn't have been much prettier. For example, dropping the implicit converter would require us to change a bunch of switch() statements over enum_option (and not just experimental_features_t, but other types of enum_option). Going forward, all uses of enum_option should use "enum class", not "enum". tri_mode_restriction_t was already using an enum class, and now so does experimental_features_t. I changed the examples in the comments to also use "enum class" instead of enum. This patch also adds to the existing experimental_features test a check that the feature names are words that are not numbers. Fixes #11003. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #11004	2022-07-11 09:17:30 +02:00
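The language rule this fix relies on is easy to check in isolation: an old-style (unscoped) enum converts implicitly to int, while an "enum class" does not. The sketch below uses hypothetical enum names and a hypothetical `to_name` helper; it does not reproduce `enum_option` or the fmt overload resolution itself.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Old-style enum: implicitly convertible to int, so a formatter that
// prefers integer conversions may print "1" instead of a name.
enum old_feature { OLD_UNUSED, OLD_RAFT };

// Scoped enum: no implicit conversion, so only an explicit printer applies.
enum class new_feature { UNUSED, RAFT };

static_assert(std::is_convertible<old_feature, int>::value,
              "unscoped enums convert to int implicitly");
static_assert(!std::is_convertible<new_feature, int>::value,
              "scoped enums do not convert to int implicitly");

// Hypothetical explicit name mapping, standing in for operator<<.
inline std::string to_name(new_feature f) {
    switch (f) {
    case new_feature::UNUSED:
        return "unused";
    case new_feature::RAFT:
        return "raft";
    }
    return "unknown";
}
```

Because the scoped enum offers no integer path, a printing layer has nothing to fall back on except the explicit name mapping.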
Tomasz Grabiec	62df9f446c	Introduce SCHEMA_COMMITLOG cluster feature	2022-07-06 22:08:56 +02:00
Asias He	a33c370f9a	gossip: Speed up wait for gossip settle In a large cluster, a node would receive frequent and periodic gossip application state updates like CACHE_HITRATES or VIEW_BACKLOG from peer nodes. Those states are not critical. They should not be counted for the _msg_processing counter which is used to decide if gossip is settled. This patch fixes the long settle on every restart issue reported by users. Refs #10337 Closes #10892	2022-07-06 11:26:32 +03:00
Avi Kivity	dab56b82fa	Merge 'Per-partition rate limiting' from Piotr Dulikowski Due to its sharded and token-based architecture, Scylla works best when the user workload is more or less uniformly balanced across all nodes and shards. However, a common case when this assumption is broken is the "hot partition" - suddenly, a single partition starts getting a lot more reads and writes in comparison to other partitions. Because the shards owning the partition have only a fraction of the total cluster capacity, this quickly causes latency problems for other partitions within the same shard and vnode. This PR introduces per-partition rate limiting feature. Now, users can choose to apply per-partition limits to their tables of choice using a schema extension: ``` ALTER TABLE ks.tbl WITH per_partition_rate_limit = { 'max_writes_per_second': 100, 'max_reads_per_second': 200 }; ``` Reads and writes which are detected to go over that quota are rejected to the client using a new RATE_LIMIT_ERROR CQL error code - existing error codes didn't really fit well with the rate limit error, so a new error code is added. This code is implemented as a part of a CQL protocol extension and returned to clients only if they requested the extension - if not, the existing CONFIG_ERROR will be used instead. Limits are tracked and enforced on the replica side. If a write fails with some replicas reporting rate limit being reached, the rate limit error is propagated to the client. Additionally, the following optimization is implemented: if the coordinator shard/node is also a replica, we account the operation into the rate limit early and return an error in case of exceeding the rate limit before sending any messages to other replicas at all. The PR covers regular, non-batch writes and single-partition reads. LWT and counters are not covered here. 
Results of `perf_simple_query --smp=1 --operations-per-shard=1000000`: - Write mode: ``` `8f690fdd47` (PR base): 129644.11 tps ( 56.2 allocs/op, 13.2 tasks/op, 49785 insns/op) This PR: 125564.01 tps ( 56.2 allocs/op, 13.2 tasks/op, 49825 insns/op) ``` - Read mode: ``` `8f690fdd47` (PR base): 150026.63 tps ( 63.1 allocs/op, 12.1 tasks/op, 42806 insns/op) This PR: 151043.00 tps ( 63.1 allocs/op, 12.1 tasks/op, 43075 insns/op) ``` Manual upgrade test: - Start 3 nodes, 4 shards each, Scylla version `8f690fdd47` - Create a keyspace with scylla-bench, RF=3 - Start reading and writing with scylla-bench with CL=QUORUM - Manually upgrade nodes one by one to the version from this PR - Upgrade succeeded, apart from a small number of operations which failed when each node was being put down all reads/writes succeeded - Successfully altered the scylla-bench table to have a read and write limit and those limits were enforced as expected Fixes: #4703 Closes #9810 * github.com:scylladb/scylla: storage_proxy: metrics for per-partition rate limiting of reads storage_proxy: metrics for per-partition rate limiting of writes database: add stats for per partition rate limiting tests: add per_partition_rate_limit_test config: add add_per_partition_rate_limit_extension function for testing cf_prop_defs: guard per-partition rate limit with a feature query-request: add allow_limit flag storage_proxy: add allow rate limit flag to get_read_executor storage_proxy: resultize return type of get_read_executor storage_proxy: add per partition rate limit info to read RPC storage_proxy: add per partition rate limit info to query_result_local(_digest) storage_proxy: add allow rate limit flag to mutate/mutate_result storage_proxy: add allow rate limit flag to mutate_internal storage_proxy: add allow rate limit flag to mutate_begin storage_proxy: choose the right per partition rate limit info in write handler storage_proxy: resultize return types of write handler creation path storage_proxy: add per 
partition rate limit to mutation_holders storage_proxy: add per partition rate limit info to write RPC storage_proxy: add per partition rate limit info to mutate_locally database: apply per-partition rate limiting for reads/writes database: move and rename: classify_query -> classify_request schema: add per_partition_rate_limit schema extension db: add rate_limiter storage_proxy: propagate rate_limit_exception through read RPC gms: add TYPED_ERRORS_IN_READ_RPC cluster feature storage_proxy: pass rate_limit_exception through write RPC replica: add rate_limit_exception and a simple serialization framework docs: design doc for per-partition rate limiting transport: add rate_limit_error	2022-06-24 01:32:13 +03:00
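As an illustration of the replica-side accounting described in the merge message above, here is a minimal sketch. It assumes a fixed one-second window per partition, which the commit message does not specify; the actual algorithm in `db::rate_limiter` may differ. All type and function names (`fixed_window_limiter`, `account`, `error_for_client`) are hypothetical and are not taken from the Scylla source.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical names throughout; this is NOT Scylla's db::rate_limiter.
struct partition_limits {
    std::uint32_t max_writes_per_second;
    std::uint32_t max_reads_per_second;
};

enum class op_kind { read, write };
enum class verdict { accepted, rate_limited };
enum class cql_error { rate_limit_error, config_error };

class fixed_window_limiter {
    struct counters {
        std::int64_t window = -1;  // one-second window the counts belong to
        std::uint32_t reads = 0;
        std::uint32_t writes = 0;
    };
    partition_limits _limits;
    std::unordered_map<std::string, counters> _per_partition;
public:
    explicit fixed_window_limiter(partition_limits l) : _limits(l) {}

    // The current time is passed in so the example stays deterministic.
    verdict account(const std::string& partition_key, op_kind kind, std::int64_t now_seconds) {
        counters& c = _per_partition[partition_key];
        if (c.window != now_seconds) {
            // A new window started: forget the counts of the previous one.
            c.window = now_seconds;
            c.reads = 0;
            c.writes = 0;
        }
        std::uint32_t& count = (kind == op_kind::read) ? c.reads : c.writes;
        const std::uint32_t limit = (kind == op_kind::read)
            ? _limits.max_reads_per_second
            : _limits.max_writes_per_second;
        if (count >= limit) {
            return verdict::rate_limited;
        }
        ++count;
        return verdict::accepted;
    }
};

// Per the merge message: the new error code is returned only to clients
// that requested the protocol extension; others get CONFIG_ERROR.
inline cql_error error_for_client(bool client_requested_extension) {
    return client_requested_extension ? cql_error::rate_limit_error
                                      : cql_error::config_error;
}

inline void example_usage() {
    fixed_window_limiter lim(partition_limits{100, 200});
    assert(lim.account("hot", op_kind::write, 0) == verdict::accepted);
    assert(error_for_client(false) == cql_error::config_error);
}
```

The limiter keeps one counter pair per partition key, so a hot partition exhausts only its own quota and other partitions are unaffected.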
Piotr Dulikowski	000f417d23	gms: add TYPED_ERRORS_IN_READ_RPC cluster feature We would like to extend the read RPC to return an optional, second value which indicates an exception - seastar type-erases exceptions on the RPC handler boundary and we need to differentiate rate_limit_exception from others. However, it may happen that a replica with an up-to-date version of Scylla tries to return an exception in this way to a coordinator with an old version and the coordinator will drop the error, thinking that the request succeeded. In order to protect from that, we introduce the `TYPED_ERRORS_IN_READ_RPC` feature. Only after it is enabled will replicas start returning exceptions in the new way; until then, all exceptions will be reported using seastar's type-erasure mechanism.	2022-06-22 20:16:48 +02:00
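The gating described in this commit can be sketched as follows. This is a simplified model, not the real RPC code: `read_reply`, `typed_error` and `reply_to_read` are hypothetical names, the feature state is reduced to a plain `bool`, and an ordinary C++ `throw` stands in for seastar's type-erased exception path.

```cpp
#include <cassert>
#include <exception>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins; the real read RPC types are not shown in the log.
struct rate_limit_exception : std::runtime_error {
    rate_limit_exception() : std::runtime_error("rate limit exceeded") {}
};

enum class typed_error { rate_limit };

struct read_reply {
    std::string result;
    std::optional<typed_error> error;  // the optional second value
};

// typed_errors_enabled models the cluster feature being enabled, i.e. every
// node understands the second return value.
inline read_reply reply_to_read(bool typed_errors_enabled, bool over_limit) {
    if (!over_limit) {
        return read_reply{"rows", std::nullopt};
    }
    if (typed_errors_enabled) {
        // New way: the coordinator can tell this error apart from others.
        return read_reply{"", typed_error::rate_limit};
    }
    // Old way: an old coordinator would ignore the second value, so the
    // error must travel as a (type-erased) exception instead.
    throw rate_limit_exception();
}

// Helper for the example: does the pre-feature path report via exception?
inline bool old_path_throws() {
    try {
        reply_to_read(false, true);
    } catch (const std::exception&) {
        return true;
    }
    return false;
}

inline void example_usage() {
    assert(reply_to_read(true, true).error == typed_error::rate_limit);
    assert(old_path_throws());
}
```

The point of the feature check is that the replica never uses the new encoding until no coordinator can mistake it for success.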
Pavel Emelyanov	820be06ac1	hints: Remove snitch dependency After the previous patch, the hints manager class is left with an unused dependency on snitch. While removing it, it turns out that several unrelated places get needed headers indirectly via the host_filter.hh -> snitch_base.hh inclusion. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-06-22 11:47:26 +03:00
Geoffrey Beausire	ee9841b138	Ensure gossip is enabled on all shards before starting the failure_detector_loop Previously, a race condition was possible where the failure_detector_loop was started before gossiper._enabled was set to true on every shard. This change ensures that _enabled is set to true before moving forward. Closes #10548	2022-06-17 14:10:45 +03:00
Avi Kivity	4b53af0bd5	treewide: replace parallel_for_each with coroutine::parallel_for_each in coroutines coroutine::parallel_for_each avoids an allocation and is therefore preferred. The lifetime of the function object is less ambiguous, and so it is safer. Replace all eligible occurrences (i.e. caller is a coroutine). One case (storage_service::node_ops_cmd_heartbeat_updater()) needed a little extra attention since there was a handle_exception() continuation attached. It is converted to a try/catch. Closes #10699	2022-05-31 09:06:24 +03:00
Kamil Braun	4c3678e2a0	gms: gossiper: fix `direct_fd_pinger::_generation_number` initialization It's an `int64_t` that needs to be explicitly initialized, otherwise the value is undefined. This is probably the cause of #10639, although I'm not sure - I couldn't reproduce it (the bug is dependent on how the binary is compiled, so that's probably it). We'll see if it reproduces with this fix, and if it will, close the issue. Closes #10681	2022-05-29 13:08:09 +03:00
Gleb Natapov	083b47cecb	gossiper: replace ad-hoc guard with defer() msg_proc_guard is a guard that makes sure _msg_processing is always decreased. We can use regular defer() to achieve the same. Message-Id: <YoZTQPbTMWAdCObs@scylladb.com>	2022-05-24 19:20:25 +03:00
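The guard pattern this commit refers to can be sketched in standard C++ as below. `deferred_action` and `make_deferred` are hypothetical stand-ins written for this example (Seastar provides its own `defer()`), and `process_message` is an invented function that only mimics a counter like `_msg_processing`.

```cpp
#include <cassert>
#include <stdexcept>
#include <utility>

// Hypothetical minimal scope guard: runs the stored function on scope exit.
template <typename Func>
class deferred_action {
    Func _func;
    bool _armed = true;
public:
    explicit deferred_action(Func f) : _func(std::move(f)) {}
    deferred_action(deferred_action&& o) noexcept
        : _func(std::move(o._func)), _armed(o._armed) {
        o._armed = false;  // the moved-from guard must not fire
    }
    deferred_action(const deferred_action&) = delete;
    deferred_action& operator=(const deferred_action&) = delete;
    ~deferred_action() {
        if (_armed) {
            _func();  // assumed not to throw
        }
    }
};

template <typename Func>
deferred_action<Func> make_deferred(Func f) {
    return deferred_action<Func>(std::move(f));
}

// Invented example: the counter is decreased on every exit path,
// including when the handler throws.
inline int process_message(int& msg_processing, bool fail) {
    ++msg_processing;
    auto guard = make_deferred([&msg_processing] { --msg_processing; });
    if (fail) {
        throw std::runtime_error("handler failed");
    }
    return msg_processing;  // value while the message is in flight
}

inline void example_usage() {
    int in_flight = 0;
    process_message(in_flight, false);
    assert(in_flight == 0);
}
```

A generic guard of this kind replaces a one-off guard class whose only job is to undo a single increment.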
Avi Kivity	528ab5a502	treewide: change metric calls from make_derive to make_counter make_derive was recently deprecated in favor of make_counter, so make the change throughout the codebase. Closes #10564	2022-05-14 12:53:55 +02:00
Avi Kivity	5937b1fa23	treewide: remove empty comments in top-of-files After `fcb8d040` ("treewide: use Software Package Data Exchange (SPDX) license identifiers"), many dual-licensed files were left with empty comments on top. Remove them to avoid visual noise. Closes #10562	2022-05-13 07:11:58 +02:00
Tomasz Grabiec	f703e8ded5	Merge 'New failure detector for Raft' from Kamil Braun We introduce a new service that performs failure detection by periodically pinging endpoints. The set of pinged endpoints can be dynamically extended and shrunk. To learn about liveness of endpoints, a user of the service registers a listener and chooses a threshold - a duration of time which has to pass since the last successful ping in order to mark an endpoint as dead. When an endpoint responds it's immediately marked as alive. Endpoints are identified using abstract integer identifiers. The method of performing a ping is a dependency of the service provided by the user through the `pinger` interface. The implementation of `pinger` is responsible for translating the abstract endpoint IDs to 'real' addresses. For example, a production implementation may map endpoint IDs to IP addresses and use TCP/IP to perform the ping, while a test/simulation implementation may use a simulated network that also operates on abstract identifiers. Similarly, the method of measuring time is a dependency provided by the user using the `clock` interface. The service operates on abstract time intervals and timepoints. So, for example, in a production implementation time can be measured using a stopwatch, while in test/simulation we can use a logical clock. The service distributes work across different shards. When an endpoint is added to the set of detected endpoints, the service will choose a shard with the smallest number of workers and create a worker that is responsible for periodically pinging this endpoint on that shard and sending notifications to listeners. We modify the randomized nemesis test to use the new service. The service is sharded, but for simplicity of implementation in the test we implement rpcs and sleeps by routing the requests to shard 0, where logical timers and network live. rpcs are using the existing simulated network and clock using the existing logical timers. 
We also integrate the service with production code. There, `pinger` is implemented using the existing GOSSIP_ECHO verb. The gossip echo message requires the node's gossip generation number. We handle this by embedding the pinger implementation inside `gossiper`, and making `gossiper` update the generation number (cached inside the pinger class) periodically. Production `clock` is a simple implementation which uses `std::chrono::steady_clock` and `seastar::sleep_until` underneath. Translating `steady_clock` durations to `direct_fd::clock` durations happens by taking the number of ticks. We connect the group 0 raft server rpc implementation to the new service, so that when servers are added or removed from the group 0 configuration, corresponding endpoints are added to or removed from the direct failure detector service. Thus the set of detected endpoints will be equal to the group 0 configuration. On each shard, we register a listener for the service. The listener maintains a set of live addresses; on mark_alive it adds a server to the set and on mark_dead it removes it. This set is then used to implement the `raft::failure_detector` interface, consisting of the `is_alive()` function, which simply checks set membership. --- v6: - remove `_alive_start_index`. Instead, keep a map of `bool`s to track liveness of each endpoint. See the code for details (`listeners_liveness` struct and its usage in `ping_fiber()`, `notify_fiber()`, `add/remove_worker`, `add/remove_listener`). 
The diff is easy to read: `f617aeca62..d4b225437c` v5: - renamed `rpc` to `pinger` - replaced `bool` with `enum class endpoint_update` (with values `added` and `removed`) in `_endpoint_updates` - replaced `unsigned` with `shard_id` - fixed definition of `threshold(size_t n)` (it didn't use `n`, but `_alive_start`; fortunately all uses passed `_alive_start` as `n` so the bug wouldn't affect the behavior) - improve `_num_workers` assertions - signal `_alive_start_changed` only when `_alive_start` indeed changed - renamed `{_marked}_alive_start` to `{_marked}_alive_start_index` v4: - rearrange ping_fiber(). Remove the loop at the end of the big `while` which was timing out listeners (after the sleep). Instead: - rely on the loop before the sleep for timing out listeners - before calling ping(), check if there is a timed out listener, if so abandon the ping, immediately proceed to the timing-out-listeners loop, and then immediately proceed to the next iteration of the big `while` (without sleeping) - inline send_mark_dead() and send_mark_alive(); each was used in exactly one place after the rearrangement - when marking alive, instead of repeatedly doing `--_alive_start` and signalling the condition variable, just do `_alive_start = 0` and signal the condition variable once - fix the condition for stopping `endpoint_worker::notify_fiber()`: before, it was `_as.abort_requested()`, now it is `_as.abort_requested() && _alive_start == _fd._listeners.size()`. Indeed, we want to wait for the stopping code (`destroy_worker()`) to set `_alive_start = _fd._listeners.size()` before `notify_fiber()` finishes so `notify_fiber()` can send the final `mark_dead` notifications for this endpoint. 
There was a race before where `notify_fiber()` could finish before it sent those notifications (because it finished as soon as it noticed `_as.abort_requested()`) - fix some waits in the unit test; they depended on particular ordering of tasks by the Scylla reactor, the test could sometimes hang in debug mode which randomizes task order - fix `rpc::ping()` in randomized_nemesis_test so it doesn't give an exceptional discarded future in some cases v3: - fix a race in failure_detector::stop(): we must first wait for _destroy_subscriptions fiber to finish on all shards, only then we can set _impl to nullptr on any shard - invoke_abortable_on was moved from randomized_nemesis_test to raft/helpers - add a unit test (second patch) v2: - rename `direct_fd` namespace to `direct_failure_detector` - move gms/direct_failure_detector.{cc,hh} to direct_failure_detector/failure_detector.{cc,hh} - cleaned license comments - removed _mark_queue for sending notifications from ping_fiber() to notify_fiber(). Instead: - _listeners is now a boost::container::flat_multimap (previously it was std::multimap) - _alive_start is no longer an iterator to _listeners, but an index (size_t) - _mark_queue was replaced with a second index to _listeners, _marked_alive_start, together with a condition variable, _alive_start_changed - ping_fiber() signals _alive_start_changed when it changes _alive_start - notify_fiber() waits on _alive_start_changed. 
When it wakes up, it compares _marked_alive_start to _alive_start, sends notifications to listeners appropriately, and updates _marked_alive_start - replacing _mark_queue with index + condition variable allowed some better exception specifications: send_mark_alive and send_mark_dead are now noexcept, ping_fiber() is specified to not return exceptional futures other than sleep_aborted which can only happen when we destroy the worker (previously, ping_fiber() could silently stop due to exception happening when we insert to _mark_queue - it could probably only be bad_alloc, but still) - _shard_workers is now unordered_map<endpoint_id, endpoint_worker> instead of unordered_map<endpoint_id, unique_ptr<endpoint_worker>> (after learning how to construct map values in place - using either `emplace`+`forward_as_tuple` or `try_emplace`) - `failure_detector::impl::add_endpoint` now gives strong exception guarantee: if an exception is thrown, no state changes - same for `failure_detector::impl::remove_endpoint` - `failure_detector::impl::create_worker` now uses `on_internal_error` when it detects that there is a worker for this endpoint already - thanks to the strong exception guarantees of `add_endpoint` and `remove_endpoint` this should never happen - comment at _num_workers definition why we maintain this statistic (to pick a shard with smallest number of workers) - remove unnecessary `if (_as.abort_requested())` in `ping_fiber()` - in ping_fiber(), after a ping, we send notifications to listeners which we know will time-out before the next ping starts. Before, we would sleep until the threshold is actually passed by the clock. Now we send it immediately - we know ahead of time that the listener will time-out and we can notify it immediately. 
- due to above, comment at `register_listener` was adjusted, with the following note added: "Note: the `mark_dead` notification may be sent earlier if we know ahead of time that `threshold` will be crossed before the next `ping()` can start." - `register_listener` now takes a `listener&`, not `listener` - at `register_listener` comment why we allow different thresholds (second to last paragraph) - at `register_listener` mention that listeners can be registered on any shard (last paragraph) - add protected destructors to rpc, clock, listener, and mention that these objects are not owned/destroyed by `failure_detector`. - replaced _endpoint_queue (seastar::queue<pair<endpoint_id, bool>>) with unordered_map<endpoint_id, bool> + condition variable. When user calls add/remove_endpoint, an entry is inserted to this map, or existing entry is updated, and the condition variable is signaled. update_endpoint_fiber() waits on the condition variable, performs the add/remove operation, and removes entries from this map. Compared to the previous solution: - the new solution has at most one entry for a given endpoint, so the number of entries is bounded by the number of different endpoints (so in the main Scylla use case, by the number of different nodes that ever exist); the previous solution could in theory have a backlog of unprocessed events, with updates for a given endpoint appearing multiple times in the queue at once - when the add/remove operation fails in update_endpoint_fiber(), we don't remove the entry from the map so the operation can be retried later. Previously we would always remove the entry from the queue so it doesn't grow too big in presence of failures. - when the add/remove operation fails in update_endpoint_fiber(), we sleep for 10 * ping_period before retrying. 
Note that this codepath should not be reached in practice, it can basically only happen on bad_alloc - commented that `clock::sleep_until` should signalize aborts using `sleep_aborted` - `clock::now()` is `noexcept` - `add/remove_endpoint` can be called after `stop()`, they just won't do anything in that case. Reason: next item - in randomized_nemesis_test, stop failure detector before raft server (it was the other way before), so it stops using server's RPC before server is aborted. Before, the log was spammed with errors from failure detector because failure detector was getting gate_closed_exceptions from the RPC when the server was stopped. A side effect is that the raft server may continue adding/removing endpoints when the failure detector is stopped, which is fine due to above item - randomized_nemesis_test: direct_fd_clock::sleep_until translates abort_requested_exception to sleep_aborted (so sleep_until satisfies the interface specification) - message/rpc_protocol_impl: send_message_abortable: if abort_source::subscribe returns null, immediately throw abort_requested_exception (before we would send the message out and not react to an abort if it happened before we were called) - rebase Closes #10437 * github.com:scylladb/scylla: service: raft: remove `raft_gossip_failure_detector` service: raft: raft_group_registry: use direct failure detector notifications for raft server liveness service: raft: add/remove direct failure detector endpoints on group 0 configuration changes main: start direct failure detector service messaging_service: abortable version of `send_gossip_echo` message: abortable version of `send_message` test: raft: randomized_nemesis_test: remove old failure_detector test: raft: randomized_nemesis_test: use `direct_failure_detector::failure_detector` test: raft: randomized_nemesis_test: ping all shards on each tick test: unit test for new failure detector service direct_failure_detector: introduce new failure detector service	2022-05-11 
14:46:27 +02:00
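The listener/threshold semantics described in the 'New failure detector for Raft' merge above can be sketched in a single-threaded form. This is a simplified model under stated assumptions: it has no sharding, no fibers, and time is a logical tick passed in by the caller instead of a `clock` object. The `pinger` and `listener` interface names follow the merge message, but their signatures here are assumed; `simple_failure_detector`, `tick`, `live_set` and `scripted_pinger` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins, not Scylla's direct_failure_detector API.
using endpoint_id = unsigned;
using timepoint = std::int64_t;  // logical clock ticks

struct pinger {
    virtual bool ping(endpoint_id ep) = 0;  // true if the endpoint responded
protected:
    ~pinger() = default;  // not owned or destroyed by the detector
};

struct listener {
    virtual void mark_alive(endpoint_id ep) = 0;
    virtual void mark_dead(endpoint_id ep) = 0;
protected:
    ~listener() = default;
};

class simple_failure_detector {
    struct registration {
        listener* l;
        timepoint threshold;
        std::set<endpoint_id> alive;  // endpoints this listener sees as alive
    };
    pinger& _pinger;
    std::vector<registration> _listeners;
    std::unordered_map<endpoint_id, timepoint> _last_success;
    std::set<endpoint_id> _endpoints;
public:
    explicit simple_failure_detector(pinger& p) : _pinger(p) {}

    void register_listener(listener& l, timepoint threshold) {
        _listeners.push_back(registration{&l, threshold, {}});
    }
    void add_endpoint(endpoint_id ep) { _endpoints.insert(ep); }

    // One ping round at logical time `now`.
    void tick(timepoint now) {
        for (endpoint_id ep : _endpoints) {
            if (_pinger.ping(ep)) {
                // A response marks the endpoint alive immediately.
                _last_success[ep] = now;
                for (auto& r : _listeners) {
                    if (r.alive.insert(ep).second) {
                        r.l->mark_alive(ep);
                    }
                }
                continue;
            }
            // No response: notify listeners whose threshold has passed.
            auto it = _last_success.find(ep);
            for (auto& r : _listeners) {
                bool expired = it == _last_success.end()
                    || now - it->second >= r.threshold;
                if (expired && r.alive.erase(ep) > 0) {
                    r.l->mark_dead(ep);
                }
            }
        }
    }
};

// Listener keeping a set of live endpoints; is_alive() checks membership.
struct live_set final : listener {
    std::set<endpoint_id> live;
    void mark_alive(endpoint_id ep) override { live.insert(ep); }
    void mark_dead(endpoint_id ep) override { live.erase(ep); }
    bool is_alive(endpoint_id ep) const { return live.count(ep) > 0; }
};

// Test pinger: only endpoints in `reachable` respond.
struct scripted_pinger final : pinger {
    std::set<endpoint_id> reachable;
    bool ping(endpoint_id ep) override { return reachable.count(ep) > 0; }
};

inline void example_usage() {
    scripted_pinger p;
    p.reachable.insert(7);
    simple_failure_detector fd(p);
    live_set ls;
    fd.register_listener(ls, 3);
    fd.add_endpoint(7);
    fd.tick(0);
    assert(ls.is_alive(7));
}
```

Because each listener carries its own threshold, two listeners can disagree for a while about whether the same endpoint is alive, which matches the description of per-listener thresholds in the merge message.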


809 Commits