scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-24 10:30:38 +00:00

Author	SHA1	Message	Date
Kamil Braun	0a7854ea4d	Merge 'test: test_topology_ops: fix flakiness and reenable bg writes' from Patryk Jędrzejczak We decrease the server's request timeouts in topology tests so that they are lower than the driver's timeout. Before, the driver could time out its request before the server handled it successfully. This problem caused scylladb/scylladb#15924. Since scylladb/scylladb#15924 is the last issue mentioned in scylladb/scylladb#15962, this PR also reenables background writes in `test_topology_ops` with tablets disabled. The test doesn't pass with tablets and background writes because of scylladb/scylladb#17025. We will reenable background writes with tablets after fixing that issue. Fixes scylladb/scylladb#15924 Fixes scylladb/scylladb#15962 Closes scylladb/scylladb#17585 * github.com:scylladb/scylladb: test: test_topology_ops: reenable background writes without tablets test: test_topology_ops: run with and without tablets test: topology: decrease the server's request timeouts	2024-03-04 20:57:24 +01:00
Patryk Jędrzejczak	f1d9248df9	test: wait for CDC generations publishing before checking CDC-topology consistency Tests that verify upgrading to the raft-based topology (`test_topology_upgrade`, `test_topology_recovery_basic`, `test_topology_recovery_majority_loss`) have flaky `check_system_topology_and_cdc_generations_v3_consistency` calls. `assert topo_results[0] == topo_res` can fail because of different `unpublished_cdc_generations` on different nodes. The upgrade procedure creates a new CDC generation, which is later published by the CDC generation publisher. However, this can happen after the upgrade procedure finishes. In tests, if publishing happens just before querying `system.topology` in `check_system_topology_and_cdc_generations_v3_consistency`, we can observe different `unpublished_cdc_generations` on different nodes. It is an expected and temporary inconsistency. For the same reasons, `check_system_topology_and_cdc_generations_v3_consistency` can fail after adding a new node. To make the tests not flaky, we wait until the CDC generation publisher finishes its job. Then, all nodes should always have equal (and empty) `unpublished_cdc_generations`. Fixes scylladb/scylladb#17587 Fixes scylladb/scylladb#17600 Fixes scylladb/scylladb#17621 Closes scylladb/scylladb#17622	2024-03-04 19:28:51 +02:00
Kamil Braun	ec1f574b3a	test/pylib: util: silence exception from `refresh_nodes` Driver's `refresh_nodes` function may throw an exception if we call it in the middle of driver reconnecting. Silence it. Fixes scylladb/scylladb#17616 Closes scylladb/scylladb#17620	2024-03-04 17:50:16 +02:00
Patryk Jędrzejczak	e7d4e080e9	test: test_topology_ops: reenable background writes without tablets After fixing scylladb/scylladb#15924 in one of the previous patches, we reenable background writes in `test_topology_ops`. We also start background writes a bit later after adding all nodes. Without this change and with tablets, the test fails with: ``` > await cql.run_async(f"CREATE TABLE tbl (pk int PRIMARY KEY, v int)") E cassandra.protocol.ConfigurationException: <Error from server: code=2300 [Query invalid because of configuration issue] message="Datacenter datacenter1 doesn't have enough nodes for replication_factor=3"> ``` The change above makes the test a bit weaker, but we don't have to worry about it. If adding nodes is bugged, other tests should detect it. Unfortunately, the test still doesn't pass with tablets and background writes because of scylladb/scylladb#17025, so we keep background writes disabled with tablets and leave FIXME. Fixes scylladb/scylladb#15962	2024-02-29 18:37:41 +01:00
Patryk Jędrzejczak	90317c5ceb	test: test_topology_ops: run with and without tablets `test_topology_ops` is a valuable test that has uncovered many bugs. It's worth running it with and without tablets.	2024-02-29 18:37:41 +01:00
Patryk Jędrzejczak	9dfb26428b	test: topology: decrease the server's request timeouts We decrease the server's request timeouts in topology tests so that they are lower than the driver's timeout. Before, the driver could time out its request before the server handled it successfully. This problem caused scylladb/scylladb#15924. A high server's request timeout can slow down the topology tests (see the new comment in `make_scylla_conf`). We make the timeout dependent on the testing mode to not slow down tests for no reason. We don't touch the driver's request timeout. Decreasing it in some modes would require too much effort for almost no improvement. Fixes scylladb/scylladb#15924	2024-02-29 18:37:38 +01:00
Petr Gusev	6afa80a443	sync_raft_topology_nodes: do no emit REMOVED_NODE on IP change Calling notify_left for the old IP on topology change in raft mode was a regression. In gossiper mode it didn't occur. In gossiper mode the function handle_state_normal was responsible for spotting IP addresses that weren't managing any parts of the data, and it would then initiate their removal by calling remove_endpoint. This removal process did not include calling notify_left. Actually, notify_left was only supposed to be called (via excise) by 'real' removal procedures - removenode and decommission. The redundant notify_left caused trouble in the Scylla Python driver. The driver could receive REMOVED_NODE and NEW_NODE notifications at the same time and their handling routines could race with each other. In this commit we fix the problem by not calling notify_left if the remove_ip lambda was called from the IP change code path. Also, we add a test which verifies that the driver log doesn't mention the REMOVED_NODE notification. Fixes scylladb/scylladb#17444 Closes scylladb/scylladb#17561	2024-02-29 10:18:20 +01:00
Avi Kivity	616eec2214	Merge ' test/topology_custom: test_read_repair.py: reduce run-time ' from Botond Dénes This test needed a lot of data to ensure multiple pages when doing the read repair. This change adjusts two key configuration items, allowing for a drastic reduction of the data size and consequently a large reduction in run-time. * Changes query-tombstone-page-limit 1000 -> 10. Before `f068d1a6fa`, reducing this to too small a value would start killing internal queries. Now, after said commit, this is no longer a concern, as this limit no longer affects unpaged queries. * Sets (the new) query-page-size-in-bytes 1MB (default) -> 1KB. The latter configuration is a new one, added by the first patches of this series. It allows configuring the page-size in bytes, after which pages are cut. Previously this was a hard-coded constant: 1MB. This forced any tests which wanted to check paging, with pages cut on size, to work with large datasets. This was especially pronounced in the tests fixed in this PR, because this test works with tombstones which are tiny and a lot of them were needed to trigger paging based on the size. With these two changes, we can reduce the data size: * total_rows: 20000 -> 100 * max_live_rows: 32 -> 8 The runtime of the test consequently drops from 62 seconds to 13.5 seconds (dev mode, on my build machine). Fixes: https://github.com/scylladb/scylladb/issues/15425 Fixes: https://github.com/scylladb/scylladb/issues/16899 Closes scylladb/scylladb#17529 * github.com:scylladb/scylladb: test/topology_custom: test_read_repair.py: reduce run-time replica/database: get_query_max_result_size(): use query_page_size_in_bytes replica/database: use include page-size in max-result-size query-request: max_result_size: add without_page_limit() db/config: introduce query_page_size_in_bytes	2024-02-27 18:54:38 +02:00
Nadav Har'El	fc861742d7	cql: avoid undefined behavior in totimestamp() of extreme dates This patch fixes a UBSAN-reported integer overflow during one of our existing tests, test_native_functions.py::test_mintimeuuid_extreme_from_totimestamp when attempting to convert an extreme "date" value, millions of years in the past, into a "timestamp" value. When UBSAN crashing is enabled, this test crashes before this patch, and succeeds after this patch. The "date" CQL type is 32-bit count of days from the epoch, which can span 2^31 days (5 million years) before or after the epoch. Meanwhile, the "timestamp" type measures the number of milliseconds from the same epoch, in 64 bits. Luckily (or intentionally), every "date", however extreme, can be converted into a "timestamp": This is because 2^31 days is 1.85e17 milliseconds, well below timestamp's limit of 2^63 milliseconds (9.2e18). But it turns out that our conversion function, date_to_time_point(), used some boost::gregorian library code, which carried out these calculations in microsecond resolution. The extra conversion to microseconds wasn't just wasteful, it also caused an integer overflow in the extreme case: 2^31 days is 1.85e20 microseconds, which does NOT fit in a 64-bit integer. UBSAN notices this overflow, and complains (plus, the conversion is incorrect). The fix is to do the trivial conversion on our own (a day is, by convention, exactly 86400 seconds - no fancy library is needed), without the grace of Boost. The result is simpler, faster, correct for the Pliocene-age dates, and fixes the UBSAN crash in the test. Fixes #17516 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#17527	2024-02-27 17:04:18 +02:00
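The overflow arithmetic described in the commit above can be checked with a short sketch. This is not the ScyllaDB code (which is C++); the names `MS_PER_DAY`, `US_PER_DAY`, `fits_int64` and `date_to_timestamp_ms` are hypothetical, and the date is modelled as a signed day offset from the epoch, as the commit message describes it.

```python
# Illustrative sketch only: Python integers never overflow, so the 64-bit
# limits are checked explicitly instead of relying on wrap-around.
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1

MS_PER_DAY = 86_400 * 1_000      # a day is, by convention, exactly 86400 seconds
US_PER_DAY = MS_PER_DAY * 1_000  # the resolution the old library-based path used


def fits_int64(value: int) -> bool:
    """Return True if value is representable as a signed 64-bit integer."""
    return INT64_MIN <= value <= INT64_MAX


def date_to_timestamp_ms(days_from_epoch: int) -> int:
    """Convert a signed 32-bit day offset into milliseconds since the epoch."""
    if not -(2**31) <= days_from_epoch <= 2**31 - 1:
        raise ValueError("day offset does not fit in 32 bits")
    return days_from_epoch * MS_PER_DAY


extreme_days = 2**31
print(fits_int64(extreme_days * MS_PER_DAY))  # millisecond resolution fits
print(fits_int64(extreme_days * US_PER_DAY))  # microsecond resolution does not
```

Doing the multiplication directly in milliseconds keeps every intermediate value inside the 64-bit range, which is the point the commit makes about dropping the microsecond detour.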
Botond Dénes	5dc145a93f	test/topology_custom: test_read_repair.py: reduce run-time This test needed a lot of data to ensure multiple pages when doing the read repair. This change adjusts two key configuration items, allowing for a drastic reduction of the data size and consequently a large reduction in run-time. * Changes query-tombstone-page-limit 1000 -> 10. Before `f068d1a6fa`, reducing this to too small a value would start killing internal queries. Now, after said commit, this is no longer a concern, as this limit no longer affects unpaged queries. * Sets (the new) query-page-size-in-bytes 1MB (default) -> 1KB. With these two changes, we can reduce the data size: * total_rows: 20000 -> 100 * max_live_rows: 32 -> 8 The runtime of the test consequently drops from 62 seconds to 13.5 seconds (dev mode, on my build machine).	2024-02-27 02:27:55 -05:00
Botond Dénes	8213e66815	replica/database: use include page-size in max-result-size This patch changes get_unlimited_query_max_result_size(): * Also set the page-size field, not just the soft/hard limits * Renames it to get_query_max_result_size() * Update callers, specifically storage_proxy::get_max_result_size(), which now has a much simpler common return path and has to drop the page size on one rare return path. This is a purely mechanical change, no behaviour is changed.	2024-02-27 02:27:55 -05:00
Raphael S. Carvalho	f07c233ad5	Fix potential data resurrection when another compaction type does cleanup work Since commit `f1bbf70`, many compaction types can do cleanup work, but turns out we forgot to invalidate cache on their completion. So if a node regains ownership of token that had partition deleted in its previous owner (and tombstone is already gone), data can be resurrected. Tablet is not affected, as it explicitly invalidates cache during migration cleanup stage. Scylla 5.4 is affected. Fixes #17501. Fixes #17452. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#17502	2024-02-25 13:08:04 +02:00
Botond Dénes	89efa89dd7	Merge 'test: add fmt::formatters' from Kefu Chai before this change, we rely on the default-generated fmt::formatter created from operator<<, but fmt v10 dropped the default-generated formatter. in this change, we define formatters for some types used in testing. Refs #13245 Closes scylladb/scylladb#17485 * github.com:scylladb/scylladb: test/unit: add fmt::formatter for tree_test_key_base test: add printer for type for BOOST_REQUIRE_EQUAL test: add fmt::formatters test/perf: add fmt::formatters for scheduling_latency_measurer and perf_result	2024-02-23 09:32:39 +02:00
Botond Dénes	1f363a876e	Merge 'utils: add fmt::formatter for occupancy_stats, managed_bytes and friends ' from Kefu Chai before this change, we rely on the default-generated fmt::formatter created from operator<<, but fmt v10 dropped the default-generated formatter. in this change, we define formatters for * managed_bytes * managed_bytes_view * managed_bytes_opt * occupancy_stats and drop their operator<<:s Refs https://github.com/scylladb/scylladb/issues/13245 Closes scylladb/scylladb#17462 * github.com:scylladb/scylladb: utils/managed_bytes: add fmt::formatters for managed_bytes and friends utils/logalloc: add fmt::formatter for occupancy_stats	2024-02-23 09:31:22 +02:00
Botond Dénes	d314ad2725	Merge 'sstables: close index_reader in has_partition_key' from Aleksandra Martyniuk If index_reader isn't closed before it is destroyed, then ongoing sstables reads won't be awaited and assertion will be triggered. Close index_reader in has_partition_key before destroying it. Fixes: #17232. Closes scylladb/scylladb#17355 * github.com:scylladb/scylladb: test: add test to check if reader is closed sstables: close index_reader in has_partition_key	2024-02-23 09:27:55 +02:00
Kefu Chai	010fb5f323	tools/scylla-nodetool: make keyspace argument optional for "ring" the "keyspace" argument of the "ring" command is optional. but before this change, we considered it a mandatory option. it was wrong. so, in this change, we make it optional, and print out the warning message if the keyspace is not specified. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#17472	2024-02-23 09:25:29 +02:00
Botond Dénes	a08d9ba2a4	Merge 'tools/scylla-nodetool: fixes to address test failures with dtest' from Kefu Chai * tighten the param check for toppartitions * add an extra empty line inbetween reports Closes scylladb/scylladb#17486 * github.com:scylladb/scylladb: tools/scylla-nodetool: add an extra empty line inbetween reports tools/scylla-nodetool: tighten the param check for toppartitions	2024-02-23 09:05:30 +02:00
Botond Dénes	959d33ba39	Merge 'repair: streaming: handle no_such_column_family from remote node' from Aleksandra Martyniuk RPC calls lose information about the type of returned exception. Thus, if a table is dropped on receiver node, but it still exists on a sender node and sender node streams the table's data, then the whole operation fails. To prevent that, add a method which synchronizes schema and then checks, if the exception was caused by table drop. If so, the exception is swallowed. Use the method in streaming and repair to continue them when the table is dropped in the meantime. Fixes: #17028. Fixes: #15370. Fixes: #15598. Closes scylladb/scylladb#17231 * github.com:scylladb/scylladb: repair: handle no_such_column_family from remote node gracefully test: test drop table on receiver side during streaming streaming: fix indentation streaming: handle no_such_column_family from remote node gracefully repair: add methods to skip dropped table	2024-02-23 08:25:45 +02:00
Kefu Chai	3574c22d73	test/nodetool/utils: print out unmatched output on test failure It would be more helpful if the matcher printed the unmatched output on test failure. So, in this change, both stdout and stderr are printed if they fail to match the expected error. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#17489	2024-02-23 08:20:30 +02:00
Kefu Chai	381c389b56	tools/scylla-nodetool: tighten the param check for toppartitions the test cases of `test_any_of_required_parameters_is_missing` considers that we should either pass all positional argument or pass none of them, otherwise nodetool should fail. but `scylla nodetool` supported partial positional argument. to be more consistent with the expected behavior, in this change, we enforce the sanity check so that we only accept either all positional args or none of them. the corresponding test is added. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2024-02-23 12:57:51 +08:00
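The "all positional arguments or none" rule from the commit above can be sketched as follows. This is a minimal illustration, not the `scylla nodetool` implementation; the function name `check_all_or_none` and the argument names used in the usage note are hypothetical.

```python
from typing import Optional


def check_all_or_none(positional: dict[str, Optional[str]]) -> bool:
    """Return True if every positional argument was given, False if none was.

    Raise ValueError when only some of them were given, which is the case
    the tightened check is meant to reject.
    """
    given = [name for name, value in positional.items() if value is not None]
    if not given:
        return False
    if len(given) == len(positional):
        return True
    missing = [name for name in positional if name not in given]
    raise ValueError(f"missing positional arguments: {', '.join(missing)}")
```

For example, `check_all_or_none({"keyspace": "ks", "table": None})` raises `ValueError`, while passing both values or neither is accepted.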
Kefu Chai	3d9054991b	utils/logalloc: add fmt::formatter for occupancy_stats before this change, we rely on the default-generated fmt::formatter created from operator<<, but fmt v10 dropped the default-generated formatter. in this change, we define formatters for `occupancy_stats`, and drop its operator<<. Refs #13245 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2024-02-23 11:32:41 +08:00
Avi Kivity	bf107dae84	test/unit: add fmt::formatter for tree_test_key_base before this change, we rely on the default-generated fmt::formatter created from operator<<, but fmt v10 dropped the default-generated formatter. in this change, we define formatters for the classes derived from `tree_test_key_base` (this change was extracted from a larger change at #15599) Refs #13245	2024-02-23 10:52:12 +08:00
Kefu Chai	a70318e722	test: add printer for type for BOOST_REQUIRE_EQUAL After dropping the operator<< for vector, we would not be able to use BOOST_REQUIRE_EQUAL to compare vector<>. To prepare for this, define the printer for Boost.Test. Refs #13245 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2024-02-23 10:52:12 +08:00
Kefu Chai	63396f780d	test: add fmt::formatters the operator<< for `cql3::expr::test_utils::mutation_column_value` is preserved, as it used by test/lib/expr_test_utils.cc, which prints std::map<sstring, cql3::expr::test_utils::mutation_column_value> using the homebrew generic formatter for std::map<>. and the formatter uses operator<< for printing the elements in map. Refs #13245 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2024-02-23 10:52:12 +08:00
Kefu Chai	2ccd9e695d	test/perf: add fmt::formatters for scheduling_latency_measurer and perf_result before this change, we rely on the default-generated fmt::formatter created from operator<<, but fmt v10 dropped the default-generated formatter. in this change, we define formatters for * scheduling_latency_measurer * perf_result and drop their operator<<:s Refs #13245 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2024-02-23 10:17:50 +08:00
Avi Kivity	67f8dc5a7c	Merge 'mutation: add fmt::formatter for clustering_row, row_tombstone and friends' from Kefu Chai before this change, we rely on the default-generated fmt::formatter created from operator<<, but fmt v10 dropped the default-generated formatter. in this change, we define formatters for * row_tombstone * row_marker * deletable_row::printer * row::printer * clustering_row::printer * static_row::printer * partition_start * partition_end * mutation_fragment::printer and drop their operator<<:s Refs #13245 Closes scylladb/scylladb#17461 * github.com:scylladb/scylladb: mutation: add fmt::formatter for clustering_row and friends mutation: add fmt::formatter for row_tombstone and friends	2024-02-22 16:16:26 +02:00
Aleksandra Martyniuk	4530be9e5b	test: add test to check if reader is closed Add test to check if reader is closed in sstable::has_partition_key.	2024-02-22 14:53:14 +01:00
Nadav Har'El	b0233c0833	Merge 'interval: rename nonwrapping_interval to interval' from Avi Kivity Our interval template started life as `range`, and supported wrapping to follow Cassandra's convention of wrapping around the maximum token. We later recognized that an interval type should usually be non-wrapping and split it into wrapping_range and nonwrapping_range, with `range` aliasing wrapping_range to preserve compatibility. Even later, we realized the name was already taken by C++ ranges and so renamed it to `interval`. Given that intervals are usually non-wrapping, the default `interval` type is non-wrapping. We can now simplify it further, recognizing that everyone assumes that an interval is non-wrapping and so doesn't need the nonwrapping_interval_designation. We just rename nonwrapping_interval to `interval` and remove the type alias. Closes scylladb/scylladb#17455 * github.com:scylladb/scylladb: interval: rename nonwrapping_interval to interval interval: rename interval_test to wrapping_interval_test	2024-02-22 14:03:43 +02:00
Kamil Braun	3ee56e1936	Merge 'raft topology: enable writes to previous CDC generations' from Patryk Jędrzejczak When we create a CDC generation and ring-delay is non-zero, the timestamp of the new generation is in the future. Hence, we can have multiple generations that can be written to. However, if we add a new node to the cluster with the Raft-based topology, it receives only the last committed generation. So, this node will be rejecting writes considered correct by the other nodes until the last committed generation starts operating. In scylladb/scylladb#17134, we have allowed sending writes to the previous CDC generations. So, the situation became even more complicated. This PR adjusts the Raft-based topology to ensure all required generations are loaded into memory and their data isn't cleared too early. To load all required generations into memory, we replace `current_cdc_generation_{uuid, timestamp}` with the set containing IDs of all committed generations - `committed_cdc_generations`. To ensure this set doesn't grow endlessly, we remove an entry from this set together with the data in CDC_GENERATIONS_V3. Currently, we may clear a CDC generation's data from CDC_GENERATIONS_V3 if it is not the last committed generation and it is at least 24 hours old (according to the topology coordinator's clock). However, after allowing writes to the previous CDC generations, this condition became incorrect. We might clear data of a generation that could still be written to. The new solution introduced in this PR is to clear data of the generations that finished operating more than 24 hours ago. Apart from the changes mentioned above, this PR hardens `test_cdc_generation_clearing.py`. 
Fixes scylladb/scylladb#16916 Fixes scylladb/scylladb#17184 Fixes scylladb/scylladb#17288 Closes scylladb/scylladb#17374 * github.com:scylladb/scylladb: test: harden test_cdc_generation_clearing test: test clean-up of committed_cdc_generations raft topology: clean committed_cdc_generations raft topology: clean only obsolete CDC generations' data storage_service: topology_state_load: load all committed CDC generations system_keyspace: load_topology_state: fix indentation raft topology: store committed CDC generations' IDs in the topology	2024-02-22 11:41:25 +01:00
Kefu Chai	37c6073fd5	mutation: add fmt::formatter for clustering_row and friends before this change, we rely on the default-generated fmt::formatter created from operator<<, but fmt v10 dropped the default-generated formatter. in this change, we define formatters for * clustering_row::printer * static_row::printer * partition_start * partition_end * mutation_fragment::printer and drop their operator<<:s Refs #13245 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2024-02-22 17:53:34 +08:00
Avi Kivity	51df8b9173	interval: rename nonwrapping_interval to interval Our interval template started life as `range`, and supported wrapping to follow Cassandra's convention of wrapping around the maximum token. We later recognized that an interval type should usually be non-wrapping and split it into wrapping_range and nonwrapping_range, with `range` aliasing wrapping_range to preserve compatibility. Even later, we realized the name was already taken by C++ ranges and so renamed it to `interval`. Given that intervals are usually non-wrapping, the default `interval` type is non-wrapping. We can now simplify it further, recognizing that everyone assumes that an interval is non-wrapping and so doesn't need the nonwrapping_interval_designation. We just rename nonwrapping_interval to `interval` and remove the type alias.	2024-02-21 19:43:17 +02:00
Avi Kivity	e338f0e009	interval: rename interval_test to wrapping_interval_test As preparation for reclaiming the name `interval` for nonwrapping_interval, rename interval_test to wrapping_interval_test.	2024-02-21 19:38:53 +02:00
Botond Dénes	ca585903b7	test/cql-pytest: remove skip_with_tablets fixture All tests that used it are fixed, and we should not add any new tests failing with tablets from now on, so remove.	2024-02-21 02:08:49 -05:00
Botond Dénes	8df82d4781	test/cql-pytest: test_select_from_mutation_fragments.py parameterize tests To run with both vnodes and tablets. For this functionality, both replication methods should be covered with tests, because it uses different ways to produce partition lists, depending on the replication method. Also add scylla_only to those tests that were missing this fixture before. All tests in this suite are scylla-only and with the parameterization, this is even more apparent.	2024-02-21 02:08:49 -05:00
Botond Dénes	b09b949159	test/cql-pytest: test_select_from_mutation_fragments.py: remove skip_with_tablets The underlying functionality was fixed, the tests should now pass with tablets.	2024-02-21 02:08:49 -05:00
Botond Dénes	7bdd0c2cae	locator: introduce tablet_range_spliter Given a list of partition-ranges, yields the intersection of this range-list, with that of that tablet-ranges, for tablets located on the given host. This will be used in multishard_mutation_query.cc, to obtain the ranges to read from the local node: given the read ranges, obtain the ranges belonging to tablets who have replicas on the local node.	2024-02-21 02:08:48 -05:00
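The intersection described in the commit above can be illustrated with a small generator. This is only a sketch of the idea, not the `locator` code: it assumes half-open integer ranges `(start, end)` that are sorted and non-overlapping within each list, and the name `split_ranges` is hypothetical.

```python
from typing import Iterator

Range = tuple[int, int]  # half-open [start, end), an assumption of this sketch


def split_ranges(read_ranges: list[Range],
                 tablet_ranges: list[Range]) -> Iterator[Range]:
    """Yield the parts of read_ranges that fall inside tablet_ranges.

    tablet_ranges stands for the ranges of the tablets that have a replica
    on the host of interest. Both lists must be sorted and non-overlapping.
    """
    i = j = 0
    while i < len(read_ranges) and j < len(tablet_ranges):
        start = max(read_ranges[i][0], tablet_ranges[j][0])
        end = min(read_ranges[i][1], tablet_ranges[j][1])
        if start < end:
            yield (start, end)
        # Advance whichever range ends first; the other may still overlap
        # with the next range from the opposite list.
        if read_ranges[i][1] < tablet_ranges[j][1]:
            i += 1
        else:
            j += 1


print(list(split_ranges([(0, 10), (20, 30)], [(5, 25)])))
```

The two-pointer walk visits each range once, so the cost is linear in the combined length of the two lists.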
Botond Dénes	239484f259	interval: add before() overload which takes another interval The current point variant cannot take inclusiveness into account, when said point comes from another interval bound. This method had no tests at all, so add tests covering both overloads.	2024-02-21 02:08:48 -05:00
Avi Kivity	605bf6e221	range.hh: retire range.hh was deprecated in `bd794629f9` (2020) since its names conflict with the C++ library concept of an iterator range. The name ::range also mapped to the dangerous wrapping_interval rather than nonwrapping_interval. Complete the deprecation by removing range.hh and replacing all the aliases by the names they point to from the interval library. Note this now exposes uses of wrapping intervals as they are now explicit. The unit tests are renamed and range.hh is deleted. Closes scylladb/scylladb#17428	2024-02-21 00:24:25 +02:00
Tomasz Grabiec	e63d8ae272	Merge 'Handle tablet migration failure while streaming' from Pavel Emelyanov It can happen that a node is lost during tablet migration involving that node. Migration will be stuck, blocking topology state machine. To recover from this, the current procedure is for the admin to execute nodetool removenode or replacing the node. This marks the node as "ignored" and tablet state machine can pick this up and abort the migration. This PR implements the handling for streaming stage only and adds a test for it. Checking other stages needs more work with failure injection to inject failures into specific barrier. To handle streaming failure two new stages are introduced -- cleanup_target and revert_migration. The former is to clean the pending replica that could receive some data by the time streaming stopped working, the latter is like end_migration, but doesn't commit the new_replicas into replicas field. refs: #16527 Closes scylladb/scylladb#17360 * github.com:scylladb/scylladb: test/topology: Add checking error paths for failed migration topology.tablets_migration: Handle failed streaming topology.tablets_migration: Add cleanup_target transition stage topology.tablets_migration: Add revert_migration transition stage storage_service: Rewrap cleanup stage checking in cleanup_tablet() test/topology: Move helpers to get tablet replicas to pylib	2024-02-20 18:50:55 +01:00
Botond Dénes	73a3a3faf3	Merge 'tools/scylla-nodetool: implement tablestats' from Kefu Chai Refs #15588 Closes scylladb/scylladb#17387 * github.com:scylladb/scylladb: tools/scylla-nodetool: implement tablestats utils/rjson: add templated streaming_writer::Write()	2024-02-20 14:46:07 +02:00
Patryk Jędrzejczak	419354bc9f	test: harden test_cdc_generation_clearing In one of the previous patches, we fixed scylladb/scylladb#16916 as a side effect. We removed `system_keyspace::get_cdc_generations_cleanup_candidate`, which contained the bug causing the issue. Even though we didn't have to fix this issue directly, it showed us that `test_cdc_generation_clearing` was too weak. If something went wrong during/after the only clearing, the test still could pass because the clearing was the last action in the test. In scylladb/scylladb#16916, the CDC generation publisher was stuck after the clearing because of a recurring error. The test wouldn't detect it. Therefore, we harden the test by expecting two clearings instead of one. If something goes wrong during the first clearing, there is a high chance that the second clearing will fail. The new test version wouldn't pass with the old bug in the code.	2024-02-20 12:35:18 +01:00
Patryk Jędrzejczak	2b724735d1	test: test clean-up of committed_cdc_generations We extend `test_cdc_generation_clearing`. Now, it also tests the clean-up of `TOPOLOGY.committed_cdc_generations` added in the previous patch. In the implementation, we harden the already existing `check_system_topology_and_cdc_generations_v3_consistency`. After the previous patch, data of every generation present in `committed_cdc_generations` should be present in CDC_GENERATIONS_V3. In other words, `committed_cdc_generations` should always be a subset of a set containing generations in CDC_GENERATIONS_V3. Before the previous patch, this wasn't true after the clearing, so the new version of `test_cdc_generation_clearing` wouldn't pass back then.	2024-02-20 12:35:18 +01:00
Patryk Jędrzejczak	b8aa74f539	raft topology: clean only obsolete CDC generations' data Currently, we may clear a CDC generation's data from CDC_GENERATIONS_V3 if it is not the last committed generation and it is at least 24 hours old (according to the topology coordinator's clock). However, after allowing writes to the previous CDC generations, this condition became incorrect. We might clear data of a generation that could still be written to. The new solution is to clear data of the generations that finished operating more than 24 hours ago. The rationale behind it is in the new comment in `topology_coordinator:clean_obsolete_cdc_generations`. The previous solution used the clean-up candidate. After introducing `committed_cdc_generations`, it became unneeded. The last obsolete generation can be computed in `topology_coordinator:clean_obsolete_cdc_generations`. Therefore, we remove all the code that handles the clean-up candidate. After changing how we clear CDC generations' data, `test_current_cdc_generation_is_not_removed` became obsolete. The tested feature is not present in the code anymore. `test_dependency_on_timestamps` became the only test case covering the CDC generation's data clearing. We adjust it after the changes.	2024-02-20 12:35:18 +01:00
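The new clearing rule from the commit above can be sketched as follows. This is a hedged illustration, not the `topology_coordinator` code: it assumes that a generation finishes operating when the next committed generation's timestamp is reached, and the name `obsolete_generations` and the `(id, timestamp)` tuple layout are hypothetical.

```python
from datetime import datetime, timedelta


def obsolete_generations(committed: list[tuple[str, datetime]],
                         now: datetime,
                         retention: timedelta = timedelta(hours=24)) -> list[str]:
    """Return IDs of generations that finished operating more than
    `retention` ago.

    `committed` must be sorted by timestamp. Under this sketch's assumption,
    a generation stops operating when its successor starts, so the last
    committed generation is never obsolete.
    """
    cutoff = now - retention
    obsolete = []
    for (gen_id, _start), (_next_id, next_start) in zip(committed, committed[1:]):
        if next_start < cutoff:
            obsolete.append(gen_id)
    return obsolete
```

Compared with the old rule (not the last committed generation and at least 24 hours old), this keeps a generation that is old but was still being written to until recently.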
Patryk Jędrzejczak	e145e758eb	raft topology: store committed CDC generations' IDs in the topology When we create a CDC generation and ring-delay is non-zero, the timestamp of the new generation is in the future. Hence, we can have multiple generations that can be written to. However, if we add a new node to the cluster with the Raft-based topology, it receives only the last committed generation. So, this node will be rejecting writes considered correct by the other nodes until the last committed generation starts operating. In scylladb/scylladb#17134, we have allowed sending writes to the previous CDC generations. So, the situation became even more complicated. We need to adjust the Raft-based topology to ensure all required generations are loaded into memory and their data isn't cleared too early. This patch is the first step of the adjustment. We replace `current_cdc_generation_{uuid, timestamp}` with the set containing IDs of all committed generations - `committed_cdc_generations`. This set is sorted by timestamps, just like `unpublished_cdc_generations`. This patch is mostly refactoring. The last generation in `committed_cdc_generations` is the equivalent of the previous `current_cdc_generation_{uuid, timestamp}`. The other generations are irrelevant for now. They will be used in the following patches. After introducing `committed_cdc_generations`, a newly committed generation is also unpublished (it was current and unpublished before the patch). We introduce `add_new_committed_cdc_generation`, which updates both sets of generations so that we don't have to call `add_committed_cdc_generation` and `add_unpublished_cdc_generation` together. It's easy to forget that both of them are necessary. Before this patch, there was no call to `add_unpublished_cdc_generation` in `topology_coordinator::build_coordinator_state`. It was a bug reported in scylladb/scylladb#17288. This patch fixes it. 
This patch also removes "the current generation" notion from the Raft-based topology. For the Raft-based topology, the current generation was the last committed generation. However, for the `cdc::metadata`, it was the generation operating now. These two generations could be different, which was confusing. For the `cdc::metadata`, the current generation is relevant as it is handled differently, but for the Raft-based topology, it isn't. Therefore, we change only the Raft-based topology. The generation called "current" is called "the last committed" from now.	2024-02-20 12:35:16 +01:00
Kefu Chai	c627d9134e	tools/scylla-nodetool: implement tablestats Refs #15588 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2024-02-20 18:12:35 +08:00
Botond Dénes	050c6dcad7	api: storage_service/keyspaces: add replication filter To allow to filter the returned keyspaces based by the replication they use: tablets or vnodes. The filter can be disabled by omitting the parameter or passing "all". The default is "all". Fixes: #16509 Closes scylladb/scylladb#17319	2024-02-20 09:04:41 +01:00
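The filter semantics described in the commit above ("tablets", "vnodes", or the default "all") can be sketched as follows. This does not reproduce the REST endpoint; the function name `filter_keyspaces`, the parameter name `replication`, and the `name -> uses_tablets` mapping are assumptions made for illustration.

```python
def filter_keyspaces(keyspaces: dict[str, bool],
                     replication: str = "all") -> list[str]:
    """Return keyspace names matching the replication filter.

    `keyspaces` maps a keyspace name to True if it uses tablets and to
    False if it uses vnodes. Omitting the filter behaves like "all".
    """
    if replication == "all":
        return sorted(keyspaces)
    if replication not in ("tablets", "vnodes"):
        raise ValueError(f"unknown replication filter: {replication}")
    want_tablets = replication == "tablets"
    return sorted(name for name, uses_tablets in keyspaces.items()
                  if uses_tablets == want_tablets)
```

For example, with `{"ks_a": True, "ks_b": False}`, the filter "tablets" selects only `ks_a`, "vnodes" selects only `ks_b`, and "all" (or no filter) returns both.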
Botond Dénes	2a494b6c47	Merge 'test/nodetool: parameterize test_ring' from Kefu Chai so we exercise the cases where state and status are not "normal" and "up". turns out the MBean is able to cache some objects. so the requests retrieving datacenter and rack are now marked `ANY`. * filter out the requests whose `multiple` is `ANY` * include the unconsumed requests in the raised `AssertionError`. this should help with debugging. Fixes #17401 Closes scylladb/scylladb#17417 * github.com:scylladb/scylladb: test/nodetool: parameterize test_ring test/nodetool: fail a test only with leftover expected requests	2024-02-20 08:48:11 +02:00
Kefu Chai	64f9d90f7b	tools/scylla-nodetool: implement toppartitions Refs #15588 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#17357	2024-02-20 08:16:43 +02:00
Pavel Emelyanov	1440eddc58	test/topology: Add checking error paths for failed migration For now only fail streaming stage and check that migration doesn't get stuck and doesn't make tablet appear on dead node. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-02-20 08:59:06 +03:00
Pavel Emelyanov	c06cbc391f	test/topology: Move helpers to get tablet replicas to pylib These are very useful and will be used across different test files soon Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-02-20 08:53:36 +03:00


6392 Commits