scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-26 19:35:12 +00:00

Author	SHA1	Message	Date
Botond Dénes	b18d9e5d0d	Merge '[Backport 6.0] make enable_compacting_data_for_streaming_and_repair truly live-update' from ScyllaDB This config item is propagated to the table object via table::config. Although the field in `table::config`, used to propagate the value, was `utils::updateable_value<T>`, it was assigned a constant and so the live-update chain was broken. This series fixes this and adds a test which fails before the patch and passes after. The test needed new test infrastructure, around the failure injection api, namely the ability to exfiltrate the value of internal variable. This infrastructure is also added in this series. Fixes: https://github.com/scylladb/scylladb/issues/18674 - [x] This patch has to be backported because it fixes broken functionality (cherry picked from commit `dbccb61636`) (cherry picked from commit `4590026b38`) (cherry picked from commit `feea609e37`) (cherry picked from commit `0c61b1822c`) (cherry picked from commit `8ef4fbdb87`) Refs #18705 Closes scylladb/scylladb#19240 * github.com:scylladb/scylladb: test/topology_custom: add test for enable_compacting_data_for_streaming_and_repair live-update test/pylib: rest_client: add get_injection() api/error_injection: add getter for error_injection utils/error_injection: add set_parameter() replica/database: fix live-update enable_compacting_data_for_streaming_and_repair	2024-06-13 12:45:23 +03:00
Botond Dénes	d4563e2b28	test/pylib: rest_client: add get_injection() The /v2/error_injection/{injection} endpoint now has a GET method too, expose this. (cherry picked from commit `0c61b1822c`)	2024-06-11 17:32:37 +00:00
Raphael S. Carvalho	d4c3a43b34	replica: Refresh mutation source when allocating tablet replicas Consider the following: 1) table A has N tablets and views 2) migration starts for a tablet of A from node 1 to 2. 3) migration is at write_both_read_old stage 4) coordinator will push writes to both nodes (pending and leaving) 5) A has view, so writes to it will also result in reads (table::push_view_replica_updates()) 6) tablet's update_effective_replication_map() is not refreshing tablet sstable set (for new tablet migrating in) 7) so read on step 5 is not being able to find sstable set for tablet migrating in Causes the following error: "tablets - SSTable set wasn't found for tablet 21 of table mview.users" which means loss of write on pending replica. The fix will refresh the table's sstable set (tablet_sstable_set) and cache's snapshot. It's not a problem to refresh the cache snapshot as long as the logical state of the data hasn't changed, which is true when allocating new tablet replicas. That's also done in the context of compactions for example. Fixes #19052. Fixes #19033. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> (cherry picked from commit `7b41630299`) Closes scylladb/scylladb#19229	2024-06-11 18:12:43 +03:00
Tomasz Grabiec	7956a2991e	api, storage_service: Introduce API to wait for topology to quiesce	2024-05-16 00:28:47 +02:00
Pavel Emelyanov	8bad828208	api: Add method to delete replica from tablet Copied from the add_replica counterpart TODO: Generalize common parts of move_tablet and add_\|del_tablet_replica Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-04-15 16:31:07 +03:00
Pavel Emelyanov	79ad760e95	api: Add method to add replica to a tablet The new API submits rebuild transition with new replicas set to be old (current) replicas plus the provided one. It looks and acts like the move_tablet API call with several changes: - lacks the "source" replica argument - submits "rebuild" transition kind - cross racks checks are not performed The 'force' argument is inherited from move_tablet, but is unused now and is left for future. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-04-04 09:22:16 +03:00
Raphael S. Carvalho	6bdb456fad	sstables_loader: Fix loader when write selector is previous during tablet migration The loader is writing to pending replica even when write selector is set to previous. If migration is reverted, then the writes won't be rolled back as it assumes pending replicas weren't written to yet. That can cause data resurrection if tablet is later migrated back into the same replica. NOTE: write selector is handled correctly when set to next, because get_natural_endpoints() will return the next replica set, and none of the replicas will be considered leaving. And of course, selector set to both is also handled correctly. Fixes #17892. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#17902	2024-03-24 01:20:50 +01:00
Tomasz Grabiec	a233a699cc	test: py: Add test for view replica pairing after replace	2024-03-15 13:20:08 +01:00
Asias He	ebc0ab94e5	repair: Add ranges option support for tablet repair The management tool, e.g., scylla manager, needs the ranges option to select which ranges to repair on a node to schedule repair jobs. This patch adds ranges option support. E.g., curl -X POST "http://127.0.0.1:10000/storage_service/repair_async/ks1?ranges=-4611686018427387905:-1,4611686018427387903:9223372036854775807" Fixes: #17416 Tests: test_tablet_repair_ranges_selection Closes scylladb/scylladb#17436	2024-03-11 20:03:12 +02:00
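The `ranges` option in the commit above is a comma-separated list of `start:end` token pairs passed as a query parameter. As a rough illustration only, the following sketch builds the same URL as the `curl` example in the commit message. The function name `build_repair_url` and its parameters are hypothetical and are not part of the repository.

```python
from urllib.parse import quote


def build_repair_url(host, port, keyspace, ranges):
    """Build a repair_async URL with a `ranges` query parameter.

    `ranges` is a list of (start_token, end_token) integer pairs.
    Hypothetical helper for illustration; not part of test/pylib.
    """
    # Each range is rendered as "start:end"; ranges are joined with commas.
    ranges_param = ",".join(f"{start}:{end}" for start, end in ranges)
    # Keep ':' and ',' unescaped so the URL matches the curl example.
    return (
        f"http://{host}:{port}/storage_service/repair_async/"
        f"{quote(keyspace, safe='')}?ranges={quote(ranges_param, safe=':,')}"
    )


url = build_repair_url(
    "127.0.0.1",
    10000,
    "ks1",
    [(-4611686018427387905, -1), (4611686018427387903, 9223372036854775807)],
)
```

The resulting string could be passed to any HTTP client as the target of a POST request, as in the `curl` invocation quoted in the commit message.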
Patryk Wrobel	a39a5b671e	pylib/rest_client.py: add ownership API to ScyllaRESTAPIClient This change adds a member function that can be used to access 'storage_service/ownership' API. It will be used by tests that need to access this API. Refs: scylladb#17342 Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>	2024-03-11 09:50:20 +01:00
Botond Dénes	41424231f1	Merge 'compaction: reshape sstables within compaction groups' from Lakshmi Narayanan Sreethar For tables using tablet based replication strategies, the sstables should be reshaped only within the compaction groups they belong to. The shard_reshaping_compaction_task_impl now groups the sstables based on their compaction groups before reshaping them. Fixes https://github.com/scylladb/scylladb/issues/16966 Closes scylladb/scylladb#17395 * github.com:scylladb/scylladb: test/topology_custom: add testcase to verify reshape with tablets test/pylib/rest_client: add get_sstable_info, enable/disable_autocompaction replica/distributed_loader: enable reshape for sstables compaction: reshape sstables within compaction groups replica/table : add method to get compaction group id for an sstable compaction: reshape: update total reshaped size only on success compaction: simplify exception handling in shard_reshaping_compaction_task_impl::run	2024-03-06 10:33:56 +02:00
Raphael S. Carvalho	f07c233ad5	Fix potential data resurrection when another compaction type does cleanup work Since commit `f1bbf70`, many compaction types can do cleanup work, but turns out we forgot to invalidate cache on their completion. So if a node regains ownership of token that had partition deleted in its previous owner (and tombstone is already gone), data can be resurrected. Tablet is not affected, as it explicitly invalidates cache during migration cleanup stage. Scylla 5.4 is affected. Fixes #17501. Fixes #17452. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#17502	2024-02-25 13:08:04 +02:00
Lakshmi Narayanan Sreethar	ed2d8529f3	test/pylib/rest_client: add get_sstable_info, enable/disable_autocompaction Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2024-02-23 18:43:39 +05:30
Piotr Dulikowski	4d4976feb0	test/topology_custom: upgrade/recovery tests for topology on raft Adds three tests for the new upgrade procedure: - test_topology_upgrade - upgrades a cluster operating in legacy mode to use raft topology operations, - test_topology_recovery_basic - performs recovery on a three-node cluster, no node removal is done, - test_topology_majority_loss - simulates a majority loss scenario, i.e. removed two nodes out of three, performs recovery to rebuild the raft topology state and re-add two nodes back.	2024-02-08 19:12:28 +01:00
Asias He	57a4e5594d	test: Check repair status in ScyllaRESTAPIClient Raise an exception in case the repair is not successful.	2024-01-23 11:10:08 +08:00
Asias He	bfe5894a9f	test: Add test_tablet_repair A basic repair test that verifies tablet repair works.	2024-01-18 08:49:06 +08:00
Kefu Chai	317af97e41	test/pylib: shutdown unix RESTful client when stopping the ManagerClient, it would be better to close all open connectors, otherwise aiohttp complains like: ``` 13:57:53.763 ERROR> Unclosed connector connections: ['[(<aiohttp.client_proto.ResponseHandler object at 0x7f939d2ca5f0>, 96672.211256817)]'] connector: <aiohttp.connector.UnixConnector object at 0x7f939d2da890> ``` this warning message is printed to the console, and it is distracting when testing manually. so, in this change, let's close the client connecting to unix domain socket. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#16675	2024-01-10 11:07:14 +02:00
Eliran Sinvani	e49b3ffc89	test.py: Dump coverage profile before killing a node Up until now the only way to get a coverage profile was to shut down the ScyllaDB nodes gracefully (using SIGTERM). This meant that the coverage profile was lost for every node that was killed abruptly (SIGKILL). This in turn would have required us to shut down all nodes gracefully, which is not something we set out to do. Here we use the REST API for dumping the coverage profile, which has minimal impact on the test runs. If the dumping fails (because the node doesn't support the API or because of a real error in dumping), we ignore it, as it is not part of the system we would like to test. Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>	2023-12-27 07:17:26 +02:00
Tomasz Grabiec	733eb21601	api: Add API to kill connection to a particular host For testing failure scenarios.	2023-12-06 18:36:17 +01:00
Tomasz Grabiec	d1c1b59236	storage_service, api: Add API to disable tablet balancing Load balancing needs to be disabled before making a series of manual migrations so that we don't fight with the load balancer. Also will be used in tests to ensure tablets stick to expected locations.	2023-12-06 18:36:17 +01:00
Tomasz Grabiec	1f57d1ea28	storage_service, api: Add API to migrate a tablet Will be used in tests, or for hot fixes in production.	2023-12-06 18:36:17 +01:00
Kamil Braun	69a6910a90	test/pylib: rest_client: make `data` optional in `put_json`	2023-10-06 10:55:45 +02:00
Kamil Braun	33463df7d2	test/pylib: fix some type errors	2023-10-06 10:55:45 +02:00
Avi Kivity	854188a486	Merge 'database, storage_proxy: Reconcile pages with dead rows and partitions incrementally' from Botond Dénes Currently, mutation query on replica side will not respond with a result which doesn't have at least one live row. This causes problems if there are a lot of dead rows or partitions before we reach a live row, which stem from the fact that the resulting reconcilable_result will be large: 1. Large allocations. Serialization of reconcilable_result causes large allocations for storing result rows in std::deque 2. Reactor stalls. Serialization of reconcilable_result on the replica side and on the coordinator side causes reactor stalls. This impacts not only the query at hand. For 1M dead rows, freezing takes 130ms, unfreezing takes 500ms. Coordinator does multiple freezes and unfreezes. The reactor stall on the coordinator side is >5s 3. Too large repair mutations. If reconciliation works on large pages, repair may fail due to too large mutation size. 1M dead rows is already too much: Refs https://github.com/scylladb/scylladb/issues/9111. This patch fixes all of the above by making mutation reads respect the memory accounter's limit for the page size, even for dead rows. This patch also addresses the problem of client-side timeouts during paging. Reconciling queries processing long strings of tombstones will now properly page tombstones, like regular queries do. My testing shows that this solution even increases efficiency. I tested with a cluster of 2 nodes, and a table of RF=2. The data layout was as follows (1 partition): * Node1: 1 live row, 1M dead rows * Node2: 1M dead rows, 1 live row This was designed to trigger reconciliation right from the very start of the query. 
Before: ``` Running query (node2, CL=ONE, cold cache) Query done, duration: 140.0633503ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)] Running query (node2, CL=ONE, hot cache) Query done, duration: 66.7195275ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)] Running query (all-nodes, CL=ALL, reconcile, cold-cache) Query done, duration: 873.5400742ms, pages: 2, result: [Row(pk=0, ck=0, v=0), Row(pk=0, ck=3000000, v=0)] ``` After: ``` Running query (node2, CL=ONE, cold cache) Query done, duration: 136.9035122ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)] Running query (node2, CL=ONE, hot cache) Query done, duration: 69.5286021ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)] Running query (all-nodes, CL=ALL, reconcile, cold-cache) Query done, duration: 162.6239498ms, pages: 100, result: [Row(pk=0, ck=0, v=0), Row(pk=0, ck=3000000, v=0)] ``` Non-reconciling queries have almost identical duration (changes of a few ms can be observed between runs). Note how in the after case, the reconciling read also produces 100 pages, vs. just 2 pages in the before case, leading to a much lower duration (less than 1/4 of the before). 
Refs https://github.com/scylladb/scylladb/issues/7929 Refs https://github.com/scylladb/scylladb/issues/3672 Refs https://github.com/scylladb/scylladb/issues/7933 Fixes https://github.com/scylladb/scylladb/issues/9111 Closes scylladb/scylladb#15414 * github.com:scylladb/scylladb: test/topology_custom: add test_read_repair.py replica/mutation_dump: detect end-of-page in range-scans tools/scylla-sstable: write: abort parser thread if writing fails test/pylib: add REST methods to get node exe and workdir paths test/pylib/rest_client: add load_new_sstables, keyspace_{flush,compaction} service/storage_proxy: add trace points for the actual read executor type service/storage_proxy: add trace points for read-repair storage_proxy: Add more trace-level logging to read-repair database: Fix accounting of small partitions in mutation query database, storage_proxy: Reconcile pages with no live rows incrementally	2023-10-05 22:39:34 +03:00
Pavel Emelyanov	4fdf12b1c7	test/pylib: Add flush_keyspace() method to rest client Which does POST /storage_service/keyspace_flush/{ks} Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-09-28 11:19:04 +03:00
Botond Dénes	8bd5f67039	test/pylib/rest_client: add load_new_sstables, keyspace_{flush,compaction} To support the equivalent (roughly) of the following nodetool commands: * nodetool refresh * nodetool flush * nodetool compact	2023-09-22 02:53:15 -04:00
Botond Dénes	7e7101c180	Revert "Merge 'database, storage_proxy: Reconcile pages with dead rows and partitions incrementally' from Botond Dénes" This reverts commit `628e6ffd33`, reversing changes made to `45ec76cfbf`. The test included with this PR is flaky and often breaks CI. Revert while a fix is found. Fixes: #15371	2023-09-13 10:45:37 +03:00
Botond Dénes	dc269cb6bd	test/pylib/rest_client: add load_new_sstables, keyspace_{flush,compaction} To support the equivalent (roughly) of the following nodetool commands: * nodetool refresh * nodetool flush * nodetool compact	2023-09-11 07:01:20 -04:00
Aleksandra Martyniuk	ede8182dd4	test: fix types and variable names in wait_for_host_down Fix types and variable names in ManagerClient::wait_for_host_down and related methods.	2023-09-05 15:01:59 +02:00
Kamil Braun	cdc3cd2b79	Merge 'raft: add fencing tests' from Petr Gusev In this PR a simple test for fencing is added. It exercises the data plane, meaning if it somehow happens that the node has a stale topology version, then requests from this node will get an error 'stale topology'. The test just decrements the node version manually through CQL, so it's quite artificial. To test a more real-world scenario we need to allow the topology change fiber to sometimes skip unavailable nodes. Now the algorithm fails and retries indefinitely in this case. The PR also adds some logs, and removes one seemingly redundant topology version increment, see the commit messages for details. Closes #14901 * github.com:scylladb/scylladb: test_fencing: add test_fence_hints test.py: output the skipped tests test.py: add skip_mode decorator and fixture test.py: add mode fixture hints: add debug log for dropped hints hints: send_one_hint: extend the scope of file_send_gate holder pylib: add ScyllaMetrics hints manager: add send_errors counter token_metadata: add debug logs fencing: add simple data plane test random_tables.py: add counter column type raft topology: don't increment version when transitioning to node_state::normal	2023-08-22 16:28:21 +02:00
Petr Gusev	0b7a90dff6	pylib: add ScyllaMetrics This patch adds facilities to work with Scylla metrics from test.py tests. The new metrics property was added to ManagerClient; its query method sends a request to the Scylla metrics endpoint and returns an object to conveniently access the result. ScyllaMetrics is copy-pasted from test_shedding.py. It's difficult to reuse code between 'new' and 'old' styles of tests, we can't just import pylib in 'old' tests because of some problems with python search directories. A past commit of mine that attempted to solve this problem was rejected on review.	2023-08-22 14:31:04 +04:00
Gleb Natapov	517f6bfa8a	test: add rebuild test Add simple rebuild test that makes sure that rebuild operation does not fail.	2023-08-10 16:46:13 +03:00
Alejo Sanchez	ff564583a4	test/pylib: fix return type hint Fix type hint of return when using @asynccontextmanager. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2023-07-18 13:33:46 +02:00
Mikołaj Grzebieluch	382d797d81	tests: add a `parameters` argument to code that enables injections	2023-07-13 10:10:52 +02:00
Mikołaj Grzebieluch	907c0e8900	tests: introduce InjectionHandler class for communicating with injected code Add a client for sending empty messages to the injected code from tests.	2023-07-06 12:34:53 +02:00
Alejo Sanchez	62a945ccd5	test/pylib: get gossiper alive endpoints Helper to get list of gossiper alive endpoints from REST API. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2023-04-13 21:23:03 +02:00
Botond Dénes	e55f475db1	Merge 'test/pylib: use larger timeout for decommission/removenode' from Kamil Braun Recently we enabled RBNO by default in all topology operations. This made the operations a bit slower (repair-based topology ops are a bit slower than classic streaming - they do more work), and in debug mode with large number of concurrent tests running, they might timeout. The timeout for bootstrap was already increased before, do the same for decommission/removenode. The previously used timeout was 300 seconds (this is the default used by aiohttp library when it makes HTTP requests), now use the TOPOLOGY_TIMEOUT constant from ScyllaServer which is 1000 seconds. Closes #12765 * github.com:scylladb/scylladb: test/pylib: use larger timeout for decommission/removenode test/pylib: scylla_cluster: rename START_TIMEOUT to TOPOLOGY_TIMEOUT	2023-02-13 16:30:24 +02:00
Kamil Braun	54f85c641d	test/pylib: use larger timeout for decommission/removenode Recently we enabled RBNO by default in all topology operations. This made the operations a bit slower (repair-based topology ops are a bit slower than classic streaming - they do more work), and in debug mode with large number of concurrent tests running, they might timeout. The timeout for bootstrap was already increased before, do the same for decommission/removenode. The previously used timeout was 300 seconds (this is the default used by aiohttp library when it makes HTTP requests), now use the TOPOLOGY_TIMEOUT constant from ScyllaServer which is 1000 seconds.	2023-02-10 15:56:31 +01:00
Kamil Braun	ca4db9bb72	Merge 'test/raft: test snapshot threshold' from Alecco Force snapshot with schema changes while server down. Then verify schema when bringing back up the server. Closes #12726 * github.com:scylladb/scylladb: pytest/topology: check snapshot transfer raft conf error injection for snapshot test/pylib: one-shot error injection helper	2023-02-10 15:24:46 +01:00
Asias He	fc60484422	test: Increase START_TIMEOUT It is observed that CI machine is slow to run the test. Increase the timeout of adding servers.	2023-02-03 21:15:08 +08:00
Alejo Sanchez	9ceb6aba81	test/pylib: one-shot error injection helper Existing helper with async context manager only worked for non one-shot error injections. Fix it and add another helper for one-shot without a context manager. Fix tests using the previous helper. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2023-02-02 16:37:21 +01:00
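The distinction drawn in the commit above, between a context-managed injection that must be disabled on exit and a one-shot injection that needs no cleanup, can be sketched as follows. This is a minimal illustration under stated assumptions: `FakeInjectionApi`, `inject_error`, and `inject_error_one_shot` are hypothetical names used here with an in-memory stand-in, and they do not reproduce the actual helpers in `test/pylib`.

```python
import asyncio
from contextlib import asynccontextmanager


class FakeInjectionApi:
    """In-memory stand-in for a node's error-injection API (hypothetical)."""

    def __init__(self):
        self.calls = []  # records every call, in order

    async def enable(self, name, one_shot):
        self.calls.append(("enable", name, one_shot))

    async def disable(self, name):
        self.calls.append(("disable", name))


@asynccontextmanager
async def inject_error(api, name):
    """Enable a persistent injection and disable it when the block exits."""
    await api.enable(name, one_shot=False)
    try:
        yield
    finally:
        await api.disable(name)


async def inject_error_one_shot(api, name):
    """Enable a one-shot injection; no explicit disable is issued here."""
    await api.enable(name, one_shot=True)


async def demo():
    api = FakeInjectionApi()
    async with inject_error(api, "some_injection"):
        pass  # test body would run here with the injection active
    await inject_error_one_shot(api, "some_injection")
    return api.calls


calls = asyncio.run(demo())
```

In this sketch the context manager always issues a matching disable call, even if the body raises, while the one-shot helper only issues the enable call.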
Kamil Braun	2f84e820fd	test/pylib: scylla_cluster: return error details from test framework endpoints If an endpoint handler throws an exception, the details of the exception are not returned to the client. Normally this is desirable so that information is not leaked, but in this test framework we do want to return the details to the client so it can log a useful error message. Do it by wrapping every handler into a catch clause that returns the exception message. Also modify a bit how HTTPErrors are rendered so it's easier to discern the actual body of the error from other details (such as the params used to make the request etc.) Before: ``` E test.pylib.rest_client.HTTPError: HTTP error 500: 500 Internal Server Error E E Server got itself in trouble, params None, json None, uri http+unix://api/cluster/before-test/test_stuff ``` After: ``` E test.pylib.rest_client.HTTPError: HTTP error 500, uri: http+unix://api/cluster/before-test/test_stuff, params: None, json: None, body: E Failed to start server at host 127.155.129.1. E Check the log files: E /home/kbraun/dev/scylladb/testlog/test.py.dev.log E /home/kbraun/dev/scylladb/testlog/dev/scylla-1.log ``` Closes #12563	2023-01-19 17:47:13 +02:00
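To make the "After" rendering in the commit above concrete, here is a small sketch of an exception class whose string form puts the status code, URI, params, and JSON on the first line and the error body on the following lines. The class name `SketchHTTPError` and its constructor signature are assumptions for illustration; the actual `HTTPError` in `test.pylib.rest_client` may differ.

```python
class SketchHTTPError(Exception):
    """Illustrative error type; not the real test.pylib.rest_client.HTTPError."""

    def __init__(self, uri, code, params, json, message):
        super().__init__(message)
        self.uri = uri
        self.code = code
        self.params = params
        self.json = json
        self.message = message

    def __str__(self):
        # Request details first, then the server-provided body on its own lines.
        return (
            f"HTTP error {self.code}, uri: {self.uri}, "
            f"params: {self.params}, json: {self.json}, body:\n{self.message}"
        )


err = SketchHTTPError(
    uri="http+unix://api/cluster/before-test/test_stuff",
    code=500,
    params=None,
    json=None,
    message="Failed to start server at host 127.155.129.1.",
)
rendered = str(err)
```

Separating the body from the request details is what makes the actual failure reason easy to spot in a test log.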
Kamil Braun	d134c458e5	test/pylib: increase timeout when waiting for cluster before test Increase the timeout from default 5 minutes to 10 minutes. Sent as a workaround for #12546 to unblock next promotions. Closes #12547	2023-01-17 21:03:09 +02:00
Alejo Sanchez	1bfe234133	test/pylib: API get/set logger level of Scylla server Provide helpers to get and set logger level for Scylla servers. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com> Closes #12394	2022-12-25 13:58:43 +02:00
Kamil Braun	82eb9af80d	test/pylib: rest_client: allow returning JSON data from `put_json` We'll use `put_json` for requests which want to pass JSON data into the call and also return JSON.	2022-11-21 10:57:03 +01:00
Kamil Braun	9b2449d3ea	test: reenable test_topology::test_decommission_node_add_column Also improve the test to increase the probability of reproducing #11780 by injecting sleeps in appropriate places. Without the fix for #11780 from the earlier commit, the test reproduces the issue in roughly half of all runs in dev build on my laptop.	2022-11-16 14:01:50 +01:00
Alejo Sanchez	700054abee	test.py: use internal id to manage servers Instead of using assigned IP addresses, use an internal server id. Define types to distinguish local server id, host ID (UUID), and IP address. This is needed to test servers changing IP address and for node replace (host UUID). Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-11-10 09:14:37 +01:00
Alejo Sanchez	a5316b0c6b	test.py: requests without aiohttp ClientSession Simplify REST helper by doing requests without a session. Reusing an aiohttp.ClientSession causes knock-on effects on `rest_api/test_task_manager` due to handling exceptions outside of an async with block. Requests for cluster management and Scylla REST API don't need session, anyway. Raise HTTPError with status code, text reason, params, and json. In ScyllaCluster.install_and_start() instead of adding one more custom exception, just catch all exceptions as they will be re-raised later. While there avoid code duplication and improve sanity, type checking, and lint score. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-11-10 09:14:37 +01:00
Petr Gusev	44f48bea0f	raft: test_remove_node_with_concurrent_ddl The test runs remove_node command with background ddl workload. It was written in an attempt to reproduce scylladb#11228 but seems to have value on its own. The if_exists parameter has been added to the add_table and drop_table functions, since the driver could retry the request sent to a removed node, but that request might have already been completed. Function wait_for_host_known waits until the information about the node reaches the destination node. Since we add new nodes at each iteration in main, this can take some time. A number of abort-related options were added to SCYLLA_CMDLINE_OPTIONS, as this simplifies nailing down problems. Closes #11734	2022-11-04 17:16:35 +01:00
Kamil Braun	4974a31510	test/topology_raft_disabled: more Raft upgrade tests The tests are checking the upgrade procedure and recovery from failure in scenarios like when a node fails causing the procedure to get stuck or when we lose a majority in a fully upgraded cluster. Added some new functionalities to `ScyllaRESTAPIClient` like injecting errors and obtaining gossip generation numbers.	2022-10-10 14:32:10 +02:00

55 Commits