scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-01 20:46:56 +00:00

Author	SHA1	Message	Date
Kamil Braun	5779230d28	group0_state_machine: use correct comparison for timeuuids in `merger` In `d2a4079bbe`, `merger` was modified so that when we merge a command, `last_group0_state_id` is taken to be the maximum of the merged command's state_id and the current `last_group0_state_id`. This is necessary for achieving the same behavior as if the commands were applied individually instead of being merged -- where we take the maximum state ID from `group0_history` table which was applied until now (because the table is sorted using the state IDs and we take the greatest row). However, a subtle bug was introduced -- the `std::max` function uses the `utils::UUID` standard comparison operator which is unfortunately not the same as timeuuid comparison that Scylla performs when sorting the `group0_history` table. So in rare cases it could return the smaller of the two timeuuids w.r.t. the correct timeuuid ordering. This would then lead to commands being applied which should have been turned to no-ops due to the `prev_state_id` check -- and then, for example, permanent schema desync or worse. Fix it by using the correct comparison method. Fixes: #14600	2023-07-11 11:48:02 +02:00
Kamil Braun	5ce802676f	utils/UUID: introduce `timeuuid_tri_compare` for `const UUID&` The existing `timeuuid_tri_compare` operates on UUIDs serialized in byte buffers. Introduce a version which operates directly on the `utils::UUID` type. To reuse existing comparison code, we serialize to a buffer before comparing. But we avoid allocations by using `std::array`. Since the serialized size needs to be known at compile time for `std::array`, mark `UUID::serialized_size()` as `constexpr`.	2023-07-11 11:48:02 +02:00
Kamil Braun	668beedadc	utils/UUID: introduce `timeuuid_tri_compare` for `const int8_t` `timeuuid_tri_compare` takes `bytes_view` parameters and converts them to `const int8_t` before comparing. Extract the part that operates on `const int8_t*` to separate function which we will reuse in a later commit.	2023-07-11 11:48:02 +02:00
Tomasz Grabiec	65a5942ec0	Merge 'Fix bootstrap "wait for UP/NORMAL nodes" to handle ignored nodes, recently replaced nodes, and recently changed IPs' from Kamil Braun Before this PR, the `wait_for_normal_state_handled_on_boot` would wait for a static set of nodes (`sync_nodes`), calculated using the `get_nodes_to_sync_with` function and `parse_node_list`; the latter was used to obtain a list of "nodes to ignore" (for replace operation) and translate them, using `token_metadata`, from IP addresses to Host IDs and vice versa. `sync_nodes` was also used in `_gossiper.wait_alive` call which we do after `wait_for_normal_state_handled_on_boot`. Recently we started doing these calculations and this wait very early in the boot procedure - immediately after we start gossiping (`50e8ec77c6`). Unfortunately, as always with gossiper, there are complications. In #14468 and #14487 two problems were detected: - Gossiper may contain obsolete entries for nodes which were recently replaced or changed their IPs. These entries are still using status `NORMAL` or `shutdown` (which is treated like `NORMAL`, e.g. `handle_state_normal` is also called for it). The `_gossiper.wait_alive` call would wait for those entries too and eventually time out. - Furthermore, by the time we call `parse_node_list`, `token_metadata` may not be populated yet, which is required to do the IP<->Host ID translations -- and populating `token_metadata` happens inside `handle_state_normal`, so we have a chicken-and-egg problem here. It turns out that we don't need to calculate `sync_nodes` (and hence `ignore_nodes`) in order to wait for NORMAL state handlers. We can wait for handlers to finish for any `NORMAL`/`shutdown` entries appearing in gossiper, even those that correspond to dead/ignored nodes and obsolete IPs. `handle_state_normal` is called, and eventually finishes, for all of them. 
`wait_for_normal_state_handled_on_boot` no longer receives a set of nodes as a parameter and is modified appropriately: it now calculates the necessary set of nodes on each retry (the set may shrink while we're waiting, e.g. because an entry corresponding to a node that was replaced is garbage-collected from gossiper state). Thanks to this, we can now put the `sync_nodes` calculation (which is still necessary for `_gossiper.wait_alive`), and hence the `parse_node_list` call, after we wait for NORMAL state handlers, solving the chicken-and-egg problem. This addresses the immediate failure described in #14487, but the test would still fail. That's because `_gossiper.wait_alive` may still receive a set of nodes that is too large -- we may still include obsolete IPs or entries corresponding to replaced nodes in the `sync_nodes` set. We need a better way to calculate `sync_nodes` which detects and ignores obsolete IPs and nodes that are already gone but just weren't garbage-collected from gossiper state yet. In fact such a method was already introduced in the past: `ca61d88764` but it wasn't used everywhere. There, we use `token_metadata` in which collisions between Host IDs and tokens are resolved, so it contains only entries that correspond to the "real" current set of NORMAL nodes. We use this method to calculate the set of nodes passed to `_gossiper.wait_alive`. We also introduce regression tests with necessary extensions to the test framework. 
Fixes #14468 Fixes #14487 Closes #14507 * github.com:scylladb/scylladb: test: rename `test_topology_ip.py` to `test_replace.py` test: test bootstrap after IP change test: scylla_cluster: return the new IP from `change_ip` API test: node replace with `ignore_dead_nodes` test test: scylla_cluster: accept `ignore_dead_nodes` in `ReplaceConfig` storage_service: remove `get_nodes_to_sync_with` storage_service: use `token_metadata` to calculate nodes waited for to be UP storage_service: don't calculate `ignore_nodes` before waiting for normal handlers	2023-07-10 00:28:20 +02:00
Kefu Chai	1eb76d93b7	streaming: cast the progress to a float before formatting it before this change, we format a `long` using `{:f}`. fmtlib would throw an exception when actually formatting it. so, let's make the percentage a float before formatting it. Fixes #14587 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #14588	2023-07-10 00:00:40 +03:00
Kefu Chai	894039d444	build: drop the warning on -O0 might fail tests Michał Chojnowski noted that this is not true. -O0 almost doubles the run time of `./test.py --mode=debug`. but it does not fail any of the tests. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #14456	2023-07-09 23:23:12 +03:00
Avi Kivity	850d759fd9	Merge 'repair: optimise repair reader with different shard count' from Gusev Petr Consider a cluster with no data, e.g. in tests. When a new node is bootstrapped with repair we iterate over all (shard, table, range), read data from all the peer nodes for the range, look for any discrepancies and heal them. Even for small num_tokens (16 in the tests) the number of affected ranges (those we need to consider) amounts to total number of tokens in the cluster, which is 32 for the second node and 48 for the third. Multiplying this by the number of shards and the number of tables in each keyspace gives thousands of ranges. For each of them we need to follow some row level repair protocol, which includes several RPC exchanges between the peer nodes and creating some data structures on them. These exchanges are processed sequentially for each shard; there are `parallel_for_each` calls in the code, but they are throttled by the chosen memory constraints and in fact execute sequentially. When the bootstrapping node (master) reaches a peer node and asks for data in the specific range and master shard, two options exist. If sharder parameters (primarily, `--smp`) are the same on the master and on the peer, we can just read one local shard, this is fast. If, on the other hand, `--smp` is different, we need to do a multishard query. The given range from the master can contain data from different peer shards, so we split this range into a number of subranges such that each of them contains data only from the given master shard (`dht::selective_token_range_sharder`). The number of these subranges can be quite big (300 in the tests). For each of these subranges we do `fast_forward_to` on the `multishard_reader`, and this incurs a lot of overhead, mainly because of `smp::submit_to`. In this series we optimize this case. Instead of splitting the master range and reading only what's needed, we read all the data in the range and then apply the filter by the master shard. 
We do this if the estimated number of partitions is small (<=100). This is the logs of starting a second node with `--smp 4`, first node was `--smp 3`: ``` with this patch 20:58:49.644 INFO> [debug/topology_custom.test_topology_smp.1] starting server at host 127.222.46.3 in scylla-2... 20:59:22.713 INFO> [debug/topology_custom.test_topology_smp.1] started server at host 127.222.46.3 in scylla-2, pid 1132859 without this patch 21:04:06.424 INFO> [debug/topology_custom.test_topology_smp.1] starting server at host 127.181.31.3 in scylla-2... 21:06:01.287 INFO> [debug/topology_custom.test_topology_smp.1] started server at host 127.181.31.3 in scylla-2, pid 1134140 ``` Fixes: #14093 Closes #14178 * github.com:scylladb/scylladb: repair_test: add test_reader_with_different_strategies repair: extract repair_reader declaration into reader.hh repair_meta: get_estimated_partitions fix repair_meta: use multishard_filter reader if the number of partitions is small repair_meta: delay _repair_reader creation database.hh: make_multishard_streaming_reader with range parameter database.cc: extract streaming_reader_lifecycle_policy	2023-07-09 23:21:06 +03:00
Aleksandra Martyniuk	61dc98b276	api: prevent non-owner cpu access to shared_ptr In get_sstables_for_key in api/column_family.cc a set of lw_shared_ptrs to sstables is passed to the reducer of map_reduce0. The reducer then accesses these shared pointers. As the reducer is invoked on the same shard map_reduce0 is called on, we have an illegal access to a shared pointer on a non-owner cpu. A set of shared pointers to sstables is transformed in the map function, which is guaranteed to be invoked on a shard associated with the service. Fixes: #14515. Closes #14532	2023-07-09 23:09:59 +03:00
Kefu Chai	7a334c53af	cql3: expression: correct format string fmtlib uses `{}` as the placeholder for the formatted argument, not `{}}`. so let's correct it. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #14586	2023-07-09 22:26:29 +03:00
Kefu Chai	56c3462cba	alternator: correct format string when formatting the error message for `api_error::validation`, we always include the caller in the error message, but in this case, forgot to pass the `caller` to `seastar::format()`. if fmtlib actually formats them, it would throw. so let's pass `caller` to `seastar::format()`. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #14589	2023-07-09 22:25:13 +03:00
Michał Chojnowski	c41f0ebd2a	test: mutation_test: unflake test_external_memory_usage The test has about 1/2500000 chance to fail due to a conflict of random values. And it recently did, just to spite us. Fight back. Fixes #14563 Closes #14576	2023-07-08 15:20:25 +03:00
Kefu Chai	27d6ff36df	compound_compat: do not format an sstring with {:d} before this change, we format a sstring with "{:d}", fmtlib would throw `fmt::format_error` at runtime when formatting it. this is not expected. so, in this change, we just print the int8_t using `seastar::format()` in a single pass. and with the format specifier of `#02x` instead of adding the "0x" prefix manually. Fixes #14577 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #14578	2023-07-08 15:13:11 +03:00
Kefu Chai	26dcfea84a	estimated_histogram: do not use dynamic format_string fmtlib allows us to specify the field width dynamically, so specify the field width in the same statement formatting the argument improves the readability. and use the constexpr fmt string allows us to switch to compile-time formatter supported by fmtlib v8. this change also use `fmt::print()` to format the argument right to the output ostream, instead of creating a temporary sstring, and copy it to the output ostream. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #14579	2023-07-08 15:10:41 +03:00
Anna Stuchlik	88e62ec573	doc: improve User Data info in Launch on AWS Fixes https://github.com/scylladb/scylladb/issues/14565 This commit improves the description of ScyllaDB configuration via User Data on AWS. - The info about experimental features and developer mode is removed. - The description of User Data is fixed. - The example in User Data is updated. - The broken link is fixed. Closes #14569	2023-07-07 16:34:06 +02:00
Kamil Braun	de7f668441	Merge 'raft topology: send cdc generation data in parts' from Mikołaj Grzebieluch The CDC generation data can be large and not fit in a single command. This pr splits it into multiple mutations by smartly picking a `mutation_size_threshold` and sending each mutation as a separate group 0 command. Commands are sent sequentially to avoid concurrency problems. Topology snapshots contain only mutation of current CDC generation data but don't contain any previous or future generations. If a new generation of data is being broadcasted but hasn't been entirely applied yet, the applied part won't be sent in a snapshot. New or delayed nodes can never get the applied part in this scenario. Send the entire cdc_generations_v3 table in the snapshot to resolve this problem. A mechanism to remove old CDC generations will be introduced as a follow-up. Closes #13962 * github.com:scylladb/scylladb: test: raft topology: test `prepare_and_broadcast_cdc_generation_data` service: raft topology: print warning in case of `raft::commit_status_unknown` exception in topology coordinator loop raft topology: introduce `prepare_and_broadcast_cdc_generation_data` raft: add release_guard raft: group0_state_machine::merger take state_id as the maximal value from all merged commands raft topology: include entire cdc_generations_v3 table in cdc_generation_mutations snapshot raft topology: make `mutation_size_threshold` depends on `max_command_size` raft: reduce max batch size of raft commands and raft entries raft: add description argument to add_entry_unguarded raft: introduce `write_mutations` command raft: refactor `topology_change` applying	2023-07-07 16:31:29 +02:00
Kamil Braun	f9cfd7e4f5	Merge 'raft: do not ping self in direct failure detector' from Konstantin Osipov Avoid pinging self in direct failure detector, this adds confusing noise and adds constant overhead. Fixes #14388 Closes #14558 * github.com:scylladb/scylladb: direct_fd: do not ping self raft: initialize raft_group_registry with host id early raft: code cleanup	2023-07-07 14:26:17 +02:00
Mikołaj Grzebieluch	4e3c97d8d4	test: raft topology: test `prepare_and_broadcast_cdc_generation_data` This test limits `commitlog_segment_size_in_mb` to 2, thus `max_command_size` is limited to less than 1 MB. It adds an injection which copies mutations generated by `get_cdc_generation_mutations` n times, where n is picked that the memory size of all mutations exceeds `max_command_size`. This test passes if cdc generation data is committed by raft in multiple commands. If all the data is committed in a single command, the leader node will loop trying to send raft command and getting the error: ``` storage_service - raft topology: topology change coordinator fiber got error raft::command_is_too_big_error (Command size {} is greater than the configured limit {}) ```	2023-07-07 13:56:35 +02:00
Mikołaj Grzebieluch	8d6c95f9e3	service: raft topology: print warning in case of `raft::commit_status_unknown` exception in topology coordinator loop When the topology_coordinator fiber gets `raft::commit_status_unknown`, it prints an error. This exception is not an error in this case, and it can be thrown when the leader has changed. It can happen in `add_entry_unguarded` while sending a part of the CDC generation data in the `write_mutations` command. Catch this exception in `topology_coordinator::run` and print a warning.	2023-07-07 13:56:35 +02:00
Mikołaj Grzebieluch	ade15ad74a	raft topology: introduce `prepare_and_broadcast_cdc_generation_data` Broadcasts all mutations returned from `prepare_new_cdc_generation_data` except the last one. Each mutation is sent in separate raft command. It takes `group0_guard`, and if the number of mutations is greater than one, the guard is dropped, and a new one is created and returned, otherwise the old one will be returned. Commands are sent in parallel and unguarded (the guard used for sending the last mutation will guarantee that the term hasn't been changed). Returns the generation's UUID, guard and last mutation, which will be sent with additional topology data by the caller. If we send the last mutation in the `write_mutation` command, we would use a total of `n + 1` commands instead of `n-1 + 1` (where `n` is the number of mutations), so it's better to send it in `topology_change` (we need to send it after all `write_mutations`) with some small metadata. With the default commitlog segment size, `mutation_size_threshold` will be 4 MB. In large clusters e.g. 100 nodes, 64 shards per node, 256 vnodes cdc generation data can reach the size of 30 MB, thus there will be no more than 8 commands. In a multi-DC cluster with 100ms latencies between DCs, this operation should take about 200ms since we send the commands concurrently, but even if the commands were replicated sequentially by Raft, it should take no more than 1.6s, which is incomparably smaller than bootstrapping operation (bootstrapping is quick if there is no data in the cluster, but usually if one has 100 nodes they have tons of data, so indeed streaming/repair will take much longer (hours/days)). Fixes FIXME in pr #13683.	2023-07-07 13:56:35 +02:00
Mikołaj Grzebieluch	04c38c6185	raft: add release_guard This function takes guard and calls its destructor. It's used to not call raw destructor.	2023-07-07 13:49:25 +02:00
Mikołaj Grzebieluch	d2a4079bbe	raft: group0_state_machine::merger take state_id as the maximal value from all merged commands If `group0_state_machine` applies all commands individually (without batching), the resulting current `state_id` -- which will be compared with the `prev_state_id` of the next command if it is a guarded command -- equals the maximum of the `next_state_id` of all commands applied up to this point. That's because the current `state_id` is obtained from the history table by taking the row with the largest clustering key. When `group0_state_machine::apply` is called with a batch of commands, the current `state_id` is loaded from `system.group0_history` to `merger::last_group0_state_id` only once. When a command is merged, its `next_state_id` overwrites `last_group0_state_id`, regardless of their order. Let's consider the following situation: The leader sends two unguarded `write_mutations` commands concurrently, with timeuuids T1 and T2, where T1 < T2. Leader waits to apply them and sends guarded `topology_change` with `prev_state_id` equal T2. Suppose that the command with timeuuid T2 is committed first, and these commands are small enough that all of `write_mutations` could be merged into one command. Some followers can get all of these three commands before its `fsm` polls them. In this situation, `group0_state_machine::apply` is called with all three of them and `merger` will merge both `write_mutations` into one command. After that, `merger::last_group0_state_id` will be equal to T1 (this command was committed as the second one). When it processes the `topology_change` command, it will compare its `prev_state_id` and `merger::last_group0_state_id`, resulting in making this command a no-op (which wouldn't happen if the commands were applied individually). Such a scenario results in inconsistent results: one replica applies `topology_change`, but another makes it a no-op.	2023-07-07 13:49:25 +02:00
Mikołaj Grzebieluch	b2d22d665e	raft topology: include entire cdc_generations_v3 table in cdc_generation_mutations snapshot Topology snapshots contain only mutation of current CDC generation data but don't contain any previous or future generations. If a new generation of data is being broadcasted but hasn't been entirely applied yet, the applied part won't be sent in a snapshot. In this scenario, new or delayed nodes can never get the applied part. Send entire cdc_generations_v3 table in the snapshot to resolve this problem. As a follow-up, a mechanism to remove old CDC generations will be introduced.	2023-07-07 13:11:52 +02:00
Mikołaj Grzebieluch	dc6017b71b	raft topology: make `mutation_size_threshold` depends on `max_command_size` `get_cdc_generation_mutations` splits data to mutations of maximal size `mutation_size_threshold`. Before this commit it was hardcoded to 2 MB. Calculate `mutation_size_threshold` to leave space for cdc generation data and not exceed `max_command_size`.	2023-07-07 13:11:52 +02:00
Mikołaj Grzebieluch	6dad582796	raft: reduce max batch size of raft commands and raft entries For now, `raft_sys_table_storage::_max_mutation_size` equals `max_mutation_size` (half of the commitlog segment size), so with some additional information, it can exceed this threshold resulting in throwing an exception when writing mutation to the commitlog. A batch of raft commands has the size at most `group0_state_machine::merger::max_command_size` (half of the commitlog segment size). It doesn't have additional metadata, but it may have a size of exactly `max_mutation_size`. It shouldn't cause any trouble, but it is preferred to be careful. Make `raft_sys_table_storage::_max_mutation_size` and `group0_state_machine::merger::max_command_size` more strict to leave space for metadata. Fixed typo "1204" => "1024".	2023-07-07 13:11:52 +02:00
Mikołaj Grzebieluch	760d415781	raft: add description argument to add_entry_unguarded Provide useful description for `write_mutations` and `broadcast_tables_query` that is stored in `system.group0_history`. Reduces scope of issue #13370.	2023-07-07 13:11:44 +02:00
Anna Stuchlik	799ae97b52	doc: add the Rust CDC Connector to the docs Fixes https://github.com/scylladb/scylladb/issues/13877 This commit adds the information about Rust CDC Connector to the documentation. All relevant pages are updated: the ScyllaDB Rust Driver page, and other places in the docs where Java and Go CDC connectors are mentioned. In addition, the drivers table is updated to indicate Rust driver support for CDC. Closes #14530	2023-07-07 11:13:25 +02:00
Nadav Har'El	edfb89ef65	sstables: stop warning when auto-snapshot leaves non-empty directory When a table is dropped, we delete its sstables, and finally try to delete the table's top-level directory with the rmdir system call. When the auto-snapshot feature is enabled (this is still Scylla's default), the snapshot will remain in that directory so it won't be empty and cannot be removed. Today, this results in a long, ugly and scary warning in the log: ``` WARN 2023-07-06 20:48:04,995 [shard 0] sstable - Could not remove table directory "/tmp/scylla-test-198265/data/alternator_alternator_Test_1688665684546/alternator_Test_1688665684546-4238f2201c2511eeb15859c589d9be4d/snapshots": std::filesystem::__cxx11::filesystem_error (error system:39, filesystem error: remove failed: Directory not empty [/tmp/scylla-test-198265/data/alternator_alternator_Test_1688665684546/alternator_Test_1688665684546-4238f2201c2511eeb15859c589d9be4d/snapshots]). Ignored. ``` It is bad to log as a warning something which is completely normal - it happens every time a table is dropped with the perfectly valid (and even default) auto-snapshot mode. We should only log a warning if the deletion failed because of some unexpected reason. And in fact, this is exactly what the code tried to do - it does not log a warning if the rmdir failed with EEXIST. It even had a comment saying why it was doing this. But the problem is that in Linux, deleting a non-empty directory does not return EEXIST, it returns ENOTEMPTY... Posix actually allows both. So we need to check both, and this is the only change in this patch. To confirm that this patch works, edit test/cql-pytest/run.py and change auto-snapshot from 0 to 1, run test/alternator/run (for example) and see many "Directory not empty" warnings as above. With this patch, none of these warnings appear. Fixes #13538 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #14557	2023-07-07 11:08:10 +02:00
Benny Halevy	cd44ad9338	docs: compaction: correct min_sstable_size default value DEFAULT_MIN_SSTABLE_SIZE is defined as `50L * 1024L * 1024L` which is 50 MB, not 50 bytes. Fixes #14413 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #14414	2023-07-07 11:08:10 +02:00
Marcin Maliszkiewicz	c5de25be4c	locator: use deferred_close in azure and gcp snitches Close needs to be called even if function throws in the middle. Closes #14458	2023-07-07 11:08:10 +02:00
Avi Kivity	1f9a999c26	cql3: statement_restrictions: clean up dead code We have plenty of code marked with #if 0. Once it was an indication of missing functionality, but the code has evolved so much it's useless as an indication and only a distraction. Delete it. Closes #14511	2023-07-07 11:08:10 +02:00
Gleb Natapov	4f23eec44f	Rename experimental raft feature to consistent-topology-changes Make the name more descriptive Fixes #14145 Message-Id: <ZKQ2wR3qiVqJpZOW@scylladb.com>	2023-07-07 11:08:10 +02:00
Kamil Braun	3c139265b3	Merge 'doc: remove the dead link to unirestore' from Anna Stuchlik Fixes https://github.com/scylladb/scylladb/issues/14459 This PR removes the (dead) link to the unirestore tool in a private repository. In addition, it adds minor language improvements. Closes #14519 * github.com:scylladb/scylladb: doc: minor language improvements on the Migration Tools page doc: remove the link to the private repository	2023-07-07 11:08:10 +02:00
Nadav Har'El	d6aba8232b	alternator: configurable override for DescribeEndpoints The AWS C++ SDK has a bug (https://github.com/aws/aws-sdk-cpp/issues/2554) where even if a user specifies a specific endpoint URL, the SDK uses DescribeEndpoints to try to "refresh" the endpoint. The problem is that DescribeEndpoints can't return a scheme (http or https) and the SDK arbitrarily picks https - making it unable to communicate with Alternator over http. As an example, the new "dynamodb shell" (written in C++) cannot communicate with Alternator running over http. This patch adds a configuration option, "alternator_describe_endpoints", which can be used to override what DescribeEndpoints does: 1. Empty string (the default) leaves the current behavior - DescribeEndpoints echoes the request's "Host" header. 2. The string "disabled" disables the DescribeEndpoints (it will return an UnknownOperationException). This is how DynamoDB Local behaves, and the AWS C++ SDK and the Dynamodb Shell work well in this mode. 3. Any other string is a fixed string to be returned by DescribeEndpoints. It can be useful in setups that should return a known address. Note that this patch does not, by default, change the current behavior of DescribeEndpoints. But it allows us, in the future, to override its behavior if a user experiences problems in the field - without code changes. Fixes #14410. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #14432	2023-07-07 11:08:10 +02:00
Konstantin Osipov	ff41ea86b6	direct_fd: do not ping self No need to ping self in direct failure detector. This is confusing during debugging and adds extra overhead. Fixes #14388	2023-07-06 21:05:39 +03:00
Konstantin Osipov	50140980ac	raft: initialize raft_group_registry with host id early Earlier, when local query processor wasn't available at the beginning of system start, we couldn't query our own host id when initializing the raft group registry. The local host id is needed by the registry since it is responsible to route RPC messages to specific raft groups, and needs to reject messages destined to a different host. Now that the host id is known early at boot, remove the optional and pass host id in the constructor. Resolves an earlier fixme.	2023-07-06 20:54:05 +03:00
Konstantin Osipov	d79d05aa46	raft: code cleanup Rename raft_rpc::_server_id to raft_rpc::_my_id as is already the name used in raft_group0: - for consistency - to reflect which server id it is.	2023-07-06 19:46:24 +03:00
Kamil Braun	0d437a7d63	Merge 'utils: error injection: add inject_with_handler for interactions with injected code' from Mikołaj Grzebieluch Currently, it is hard for injected code to wait for some events, for example, requests on some REST endpoint. This PR adds the `inject_with_handler` method that executes injected function and passes `injection_handler` as its argument. The `injection_handler` class is used to wait for events inside the injected code. The `error_injection` class can notify the injection's handler or handlers associated with the injection on all shards about the received message. Closes #14357. Closes #14460 * github.com:scylladb/scylladb: tests: introduce InjectionHandler class for communicating with injected code api/error_injection: add message_injection endpoint tests: utils: error injections: add test for inject_with_handler utils: error injection: add inject_with_handler for interactions with injected code utils: error injection: create structure for error injections data	2023-07-06 18:16:51 +02:00
Mikołaj Grzebieluch	907c0e8900	tests: introduce InjectionHandler class for communicating with injected code Add a client for sending empty messages to the injected code from tests.	2023-07-06 12:34:53 +02:00
Mikołaj Grzebieluch	8b1f5ba293	api/error_injection: add message_injection endpoint Add an endpoint for sending empty messages to the injected code.	2023-07-06 12:34:53 +02:00
Mikołaj Grzebieluch	7e5c42af0a	tests: utils: error injections: add test for inject_with_handler Add a test checking the correctness of the `inject_with_handler` method in presence of concurrency.	2023-07-06 12:34:53 +02:00
Mikołaj Grzebieluch	086b3369f4	utils: error injection: add inject_with_handler for interactions with injected code Currently, it is hard for injected code to wait for some events, for example, requests on some REST endpoint. This commit adds the `inject_with_handler` method that executes injected function and passes `injection_handler` as its argument. The `injection_handler` class is used to wait for events inside the injected code. The `error_injection` class can notify the injection's handler or handlers associated with the injection on all shards about the received message. There is a counter of received messages in `received_messages_counter`; it is shared between the injection_data, which is created once when enabling an injection on a given shard, and all `injection_handler`s, that are created separately for each firing of this injection. The `counter` is incremented when receiving a message from the REST endpoint and the condition variable is signaled. Each `injection_handler` (separate for each firing) stores its own private counter, `_read_messages_counter`; that private counter is incremented whenever we wait for a message, and compared to the received counter. We sleep on the condition variable if not enough messages were received.	2023-07-06 12:32:07 +02:00
Kamil Braun	431a8f8591	test: rename `test_topology_ip.py` to `test_replace.py` No idea why it was named like that before.	2023-07-06 10:24:46 +02:00
Kamil Braun	452d9a3c77	test: test bootstrap after IP change Regression test for #14468.	2023-07-06 10:24:46 +02:00
Kamil Braun	2032d7dbe4	test: scylla_cluster: return the new IP from `change_ip` API Also simplify the API by getting rid of `ActionReturn` and returning errors through exceptions (which are correctly forwarded to the client for some time already).	2023-07-06 10:24:46 +02:00
Kamil Braun	00f51ea753	test: node replace with `ignore_dead_nodes` test Regression test for #14487 on steroids. It performs 3 consecutive node replace operations, starting with 3 dead nodes. In order to have a Raft majority, we have to boot a 7-node cluster, so we enable this test only in one mode; the choice was between `dev` and `release`, I picked `dev` because it compiles faster and I develop on it.	2023-07-06 10:24:46 +02:00
Kamil Braun	9b136ee574	test: scylla_cluster: accept `ignore_dead_nodes` in `ReplaceConfig`	2023-07-06 10:24:46 +02:00
Kamil Braun	9b8e5550b1	storage_service: remove `get_nodes_to_sync_with` It's no longer used.	2023-07-06 10:24:46 +02:00
Kamil Braun	96278a09d4	storage_service: use `token_metadata` to calculate nodes waited for to be UP At bootstrap, after we start gossiping, we calculate a set of nodes (`sync_nodes`) which we need to "synchronize" with, waiting for them to be UP before proceeding; these nodes are required for streaming/repair and CDC generation data write, and generally are supposed to constitute the current set of cluster members. In #14468 and #14487 we observed that this set may contain entries corresponding to nodes that were just replaced or changed their IPs (but the old-IP entry is still there). We pass them to `_gossiper.wait_alive` and the call eventually times out. We need a better way to calculate `sync_nodes` which ignores obsolete IPs and nodes that are already gone but just weren't garbage-collected from gossiper state yet. In fact such a method was already introduced in the past: `ca61d88764` but it wasn't used everywhere. There, we use `token_metadata` in which collisions between Host IDs and tokens are resolved, so it contains only entries that correspond to the "real" current set of NORMAL nodes. We use this method to calculate the set of nodes passed to `_gossiper.wait_alive`. Fixes #14468 Fixes #14487	2023-07-06 10:24:46 +02:00
Kamil Braun	bbcf8305bb	storage_service: don't calculate `ignore_nodes` before waiting for normal handlers Before this commit the `wait_for_normal_state_handled_on_boot` would wait for a static set of nodes (`sync_nodes`), calculated using the `get_nodes_to_sync_with` function and `parse_node_list`; the latter was used to obtain a list of "nodes to ignore" (for replace operation) and translate them, using `token_metadata`, from IP addresses to Host IDs and vice versa. `sync_nodes` was also used in `_gossiper.wait_alive` call which we do after `wait_for_normal_state_handled_on_boot`. Recently we started doing these calculations and this wait very early in the boot procedure - immediately after we start gossiping (`50e8ec77c6`). Unfortunately, as always with gossiper, there are complications. In #14468 and #14487 two problems were detected: - Gossiper may contain obsolete entries for nodes which were recently replaced or changed their IPs. These entries are still using status `NORMAL` or `shutdown` (which is treated like `NORMAL`, e.g. `handle_state_normal` is also called for it). The `_gossiper.wait_alive` call would wait for those entries too and eventually time out. - Furthermore, by the time we call `parse_node_list`, `token_metadata` may not be populated yet, which is required to do the IP<->Host ID translations -- and populating `token_metadata` happens inside `handle_state_normal`, so we have a chicken-and-egg problem here. The `parse_node_list` problem is solved in this commit. It turns out that we don't need to calculate `sync_nodes` (and hence `ignore_nodes`) in order to wait for NORMAL state handlers. We can wait for handlers to finish for any `NORMAL`/`shutdown` entries appearing in gossiper, even those that correspond to dead/ignored nodes and obsolete IPs. `handle_state_normal` is called, and eventually finishes, for all of them. 
`wait_for_normal_state_handled_on_boot` no longer receives a set of nodes as parameter and is modified appropriately: it now calculates the necessary set of nodes on each retry (the set may shrink while we're waiting, e.g. because an entry corresponding to a node that was replaced is garbage-collected from gossiper state). Thanks to this, we can now put the `sync_nodes` calculation (which is still necessary for `_gossiper.wait_alive`), and hence the `parse_node_list` call, after we wait for NORMAL state handlers, solving the chicken-and-egg problem. This addresses the immediate failure described in #14487, but the test will still fail. That's because `_gossiper.wait_alive` may still receive too large a set of nodes -- we may still include obsolete IPs or entries corresponding to replaced nodes in the `sync_nodes` set. We fix this in the following commit which will solve both issues.	2023-07-06 10:24:44 +02:00
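The idea of recomputing the waited-for set on every retry, rather than fixing it up front, might look like the sketch below. It is a simplified Python illustration only: the function name, the `get_normal_nodes` and `is_handled` callbacks, and the polling interval are hypothetical and do not correspond to the real `storage_service` or gossiper interfaces.

```python
import time


def wait_for_normal_state_handled(get_normal_nodes, is_handled,
                                  timeout, poll_interval=0.01):
    """Wait until every node currently in NORMAL/shutdown state is handled.

    get_normal_nodes: hypothetical callback returning the current set of
                      NORMAL/shutdown entries seen in gossiper state.
    is_handled:       hypothetical callback telling whether the NORMAL state
                      handler has finished for a given node.
    """
    deadline = time.monotonic() + timeout
    while True:
        # Recompute the set on each retry: entries for replaced nodes may
        # have been garbage-collected since the previous iteration.
        pending = {n for n in get_normal_nodes() if not is_handled(n)}
        if not pending:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

With a static set, an obsolete entry that is never handled would keep the loop waiting until the timeout; with a recomputed set, the wait ends as soon as that entry disappears from the source of truth.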
Tomasz Grabiec	c25201c1a3	Merge 'view: fix range tombstone handling on flushes in view_updating_consumer' from Michał Chojnowski View update routines accept `mutation` objects. But what comes out of staging sstable readers is a stream of mutation_fragment_v2 objects. To build view updates after a repair/streaming, we have to convert the fragment stream into `mutation`s. This is done by piping the stream to mutation_rebuilder_v2. To keep memory usage limited, the stream for a single partition might have to be split into multiple partial `mutation` objects. view_update_consumer does that, but in an improper way -- when the split/flush happens inside an active range tombstone, the range tombstone isn't closed properly. This is illegal, and triggers an internal error. This patch fixes the problem by closing the active range tombstone (and reopening it at the same position in the next `mutation` object). The tombstone is closed just after the last seen clustered position. This is not necessary for correctness -- for example we could delay all processing of the range tombstone until we see its end bound -- but it seems like the most natural semantic. Fixes https://github.com/scylladb/scylladb/issues/14503 Closes #14502 * github.com:scylladb/scylladb: test: view_build_test: add range tombstones to test_view_update_generator_buffering test: view_build_test: add test_view_udate_generator_buffering_with_random_mutations view_updating_consumer: make buffer limit a variable view: fix range tombstone handling on flushes in view_updating_consumer	2023-07-05 21:21:43 +02:00
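The close-and-reopen behavior described in the merge above can be illustrated with a toy model. This is a Python sketch under heavy simplifying assumptions, not the `view_updating_consumer` code: fragments are modeled as `(kind, position, value)` tuples, positions are plain integers (so "just after the last seen position" is approximated by reusing that position), and the flush limit counts rows rather than bytes. The function name `split_partition` is hypothetical.

```python
def split_partition(fragments, max_rows_per_chunk):
    """Split one partition's fragment stream into self-contained chunks.

    Each fragment is (kind, position, value), where kind is "row" or
    "rt_change". For "rt_change", value is the tombstone timestamp that
    becomes active, or None when the active range tombstone ends.
    """
    chunks, current = [], []
    active = None      # timestamp of the currently open range tombstone
    last_pos = None    # last position seen in the stream
    rows = 0
    for kind, pos, value in fragments:
        if kind == "row" and rows == max_rows_per_chunk:
            # Flush: close the active tombstone so the chunk is well-formed.
            if active is not None:
                current.append(("rt_change", last_pos, None))
            chunks.append(current)
            current, rows = [], 0
            # Reopen the same tombstone at the same position in the next chunk.
            if active is not None:
                current.append(("rt_change", last_pos, active))
        current.append((kind, pos, value))
        last_pos = pos
        if kind == "rt_change":
            active = value
        else:
            rows += 1
    if current:
        chunks.append(current)
    return chunks
```

In this model every chunk ends with no tombstone left open, so each chunk could be turned into a standalone partial mutation, while the union of the chunks still covers the same range as the original tombstone.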


37776 Commits