scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-04 22:13:19 +00:00

Author	SHA1	Message	Date
Pavel Emelyanov	b56d6fbb84	Merge 'sstables: Fix quadratic space complexity in partitioned_sstable_set' from Raphael Raph Carvalho An interval map is very susceptible to quadratic space behavior when it's flooded with many entries overlapping all (or most of) the intervals, since each such entry will be present in every interval it overlaps with. A trigger we observed was a memtable flush storm, which creates many small "L0" sstables that span roughly the entire token range. Since we cannot rely on insertion order, the solution is to store sstables with such wide ranges in a vector (unleveled). There should be no consequence for single-key reads, since the upper layer applies additional filtering based on the token of the key being queried. For range scans, there can be an increase in memory usage, but not a significant one, because these sstables span a wide range and would have been selected in the combined reader if the scanned range overlaps with them. In any case, this is a protection against a storm of memtable flushes and shouldn't be the common scenario. It works with both tablets and vnodes, by adjusting the token range spanned by the compaction group accordingly. Fixes #23634. We can backport this into 2024.2 and 2025.1, but we should let this cook in master for a month or so. Closes scylladb/scylladb#23806 * github.com:scylladb/scylladb: test: Verify partitioned set store split and unsplit correctly sstables: Fix quadratic space complexity in partitioned_sstable_set compaction: Wire table_state into make_sstable_set() compaction: Introduce token_range() to table_state dht: Add overlap_ratio() for token range	2025-05-05 11:28:38 +03:00
Piotr Dulikowski	05c797795f	Merge 'Simplify test/sstable_assertions class API' from Pavel Emelyanov It had recently been patched to reuse the sstables::test class functionality (scylladb/scylladb#23697); now it can be put on a stricter diet. Closes scylladb/scylladb#23815 * github.com:scylladb/scylladb: test: Remove sstable_assertions::get_stats_metadata() test: Add sstable_assertions::operator->()	2025-05-05 09:33:45 +02:00
Nadav Har'El	834107ae97	test/cqlpy,alternator: fix reporting of Scylla crash during test The cqlpy and alternator test frameworks use a single Scylla node, started once, for all tests to run on. In the distant past, we had a problem where if one test caused Scylla to crash, the result was a confusing report of hundreds of failed tests: all tests after the crash "failed", and it wasn't easy to find which test really caused the crash. Our old solution to this problem was to have an autouse fixture (called cql_test_connection or dynamodb_test_connection) which tested the connection at the end of each test, and if it detected that Scylla had crashed, it used pytest.exit() to report the error and have pytest exit and therefore stop running any further tests (which would have led to all of them failing). This approach had two problems: 1. The pytest.exit() caused the entire cqlpy suite to report a failure, but not the individual test. The individual test might have failed as well, but that isn't guaranteed, and in any case this test's output is missing the informative message that Scylla crashed during the test. This was fine when for each cqlpy failure we had two separate error logs in Jenkins (the specific failed function, and the failed file), but since we recently got rid of the duplication by removing the second one, we no longer see the "Scylla crashed" messages. 2. Exiting pytest will be the wrong thing to do if the same pytest run could run tests from different test suites. We don't do this today, but we plan to support this approach soon. This patch fixes both problems by replacing the pytest.exit() call with setting a "scylla_crashed" flag and calling pytest.fail(). The pytest.fail() causes the current test (the one which caused Scylla to crash) to be reported as an "ERROR", and the "Scylla crashed" message will correctly appear in this test's log. The flag will cause all other tests in the same test suite to be skip()ed.
But other tests in other directories, depending on different fixtures, might continue to run normally. Fixes #23287 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#23307	2025-05-05 10:15:56 +03:00
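The crash-handling policy above (report the test during which Scylla died as an error, then skip the rest of that suite only) can be modeled without pytest. The sketch below is a minimal, hypothetical model: `CrashGuard`, `run_suite` and the outcome strings are illustrative names, not the actual conftest code; in the real fixture the "error" outcome corresponds to `pytest.fail()` and "skip" to `pytest.skip()`.

```python
class CrashGuard:
    """Pytest-free model of the per-suite crash policy (hypothetical names)."""

    def __init__(self) -> None:
        self.scylla_crashed = False

    def before_test(self) -> str:
        # Once a crash has been recorded, every later test in this suite is skipped.
        return "skip" if self.scylla_crashed else "run"

    def after_test(self, scylla_alive: bool) -> str:
        # Only the test during which the crash happened is reported as an error.
        if not scylla_alive:
            self.scylla_crashed = True
            return "error"
        return "ok"


def run_suite(guard: CrashGuard, alive_after: list[bool]) -> list[str]:
    """Simulate one suite; alive_after[i] says whether Scylla survived test i."""
    results = []
    for alive in alive_after:
        if guard.before_test() == "skip":
            results.append("skipped")
            continue
        results.append(guard.after_test(alive))
    return results
```

Keeping one guard per suite, instead of exiting the whole pytest process, is what lets tests in other directories keep running.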
Nadav Har'El	3ce7e250cc	alternator: fix schema "concurrent modification" errors In ScyllaDB, schema modification operations use "optimistic locking": a schema operation reads the current schema, decides what it wants to do and prepares changes to the schema, and then attempts to commit those changes, but only if the schema hasn't changed since the first read. If the schema has already been changed by some other node, we need to try again, in a loop. In Alternator, there are six operations that perform schema modification: CreateTable, DeleteTable, UpdateTable, TagResource, UntagResource and UpdateTimeToLive. All of them were missing this loop. We knew about this, and even had a FIXME in all these places. So when facing contention from concurrent schema modifications on different nodes, one of these operations may fail with an error like: Internal server error: service::group0_concurrent_modification (Failed to apply group 0 change due to concurrent modification). This problem had a very minor effect, if any, on real users, because the DynamoDB SDK automatically retries operations that fail with retryable errors (like this "Internal server error"), and the schema operation will most likely succeed upon retry. However, as shown in issue #13152, these failures were annoying in our CI, where tests (which disable request retries) failed on these errors. This patch fixes all six operations (the last three all use one common function, db::modify_tags(), so they are fixed by one change) by adding the missing loop. The patch also includes reproducing tests for all these operations: the new tests all fail before this patch, and pass with it. These new tests are much more reliable reproducers than the dtests we had, which reproduced the problem only sometimes (very rarely). Moreover, the new tests reproduce the bug separately for each of the six operations, so if we had forgotten to fix one of the six operations, one of the tests would have continued to fail.
Of course I checked this during development. The new tests are in the test/cluster framework, not test/alternator, because this problem can only be reproduced in a multi-node cluster: a single node serializes its own schema modifications; the collisions only happen when more than one node attempts schema modifications at the same time. Fixes #13152 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#23827	2025-05-05 09:59:08 +03:00
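A minimal sketch of the read-prepare-commit retry loop described above, assuming a toy in-memory `SchemaStore` whose `commit()` rejects a change when the version it was based on is stale. `SchemaStore`, `RacyStore`, `ConcurrentModification` and `create_table` are hypothetical stand-ins, not Alternator or group 0 APIs; `ConcurrentModification` plays the role of `service::group0_concurrent_modification`.

```python
class ConcurrentModification(Exception):
    """Stand-in for service::group0_concurrent_modification (hypothetical)."""


class SchemaStore:
    """Toy schema store with a version counter used for optimistic locking."""

    def __init__(self) -> None:
        self.version = 0
        self.tables: set[str] = set()

    def read(self) -> tuple[int, frozenset[str]]:
        return self.version, frozenset(self.tables)

    def commit(self, expected_version: int, new_tables: frozenset[str]) -> None:
        # Reject the change if anyone else committed since our read.
        if self.version != expected_version:
            raise ConcurrentModification()
        self.tables = set(new_tables)
        self.version += 1


class RacyStore(SchemaStore):
    """Simulates another node committing once, right before our first commit."""

    def __init__(self) -> None:
        super().__init__()
        self.raced = False

    def commit(self, expected_version: int, new_tables: frozenset[str]) -> None:
        if not self.raced:
            self.raced = True
            super().commit(self.version, frozenset(self.tables | {"other"}))
        super().commit(expected_version, new_tables)


def create_table(store: SchemaStore, name: str, max_retries: int = 10) -> int:
    """Read, prepare and commit in a loop; returns the number of attempts used."""
    for attempt in range(1, max_retries + 1):
        version, tables = store.read()
        if name in tables:
            raise ValueError(f"table {name} already exists")
        try:
            store.commit(version, tables | {name})
            return attempt
        except ConcurrentModification:
            continue  # schema changed under us: re-read and try again
    raise ConcurrentModification()
```

The validation (here, the existence check) sits inside the loop on purpose: after a retry, the schema the operation validates against may have changed.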
Aleksandra Martyniuk	1f4edd8683	test_tablet_tasks: use injection to revoke resize Currently, test_tablet_resize_revoked tries to trigger split revoke by deleting some rows. This method isn't deterministic, so the test is flaky. Use error injection to trigger resize revoke. Fixes: #22570. Closes scylladb/scylladb#23966	2025-04-30 07:04:57 +03:00
Michał Chojnowski	9e2343ecb0	test_sstable_compression_dictionaries_autotrain: raise the timeout There were CI runs in which the training happened as planned, but it was too slow to fit within the timeout. Raise the timeout to pacify the CI. Fixes scylladb/scylladb#23964 Closes scylladb/scylladb#23965	2025-04-29 22:09:14 +03:00
Raphael S. Carvalho	d5bee4c814	test: Verify partitioned set store split and unsplit correctly Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-04-29 15:47:33 -03:00
Raphael S. Carvalho	c77f710a0c	sstables: Fix quadratic space complexity in partitioned_sstable_set An interval map is very susceptible to quadratic space behavior when it's flooded with many entries overlapping all (or most of) the intervals, since each such entry will be present in every interval it overlaps with. A trigger we observed was a memtable flush storm, which creates many small "L0" sstables that span roughly the entire token range. Since we cannot rely on insertion order, the solution is to store sstables with such wide ranges in a vector (unleveled). There should be no consequence for single-key reads, since the upper layer applies additional filtering based on the token of the key being queried. For range scans, there can be an increase in memory usage, but not a significant one, because these sstables span a wide range and would have been selected in the combined reader if the scanned range overlaps with them. In any case, this is a protection against a storm of memtable flushes and shouldn't be the common scenario. It works with both tablets and vnodes, by adjusting the token range spanned by the compaction group accordingly. Fixes #23634. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-04-29 15:47:33 -03:00
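The following Python sketch illustrates why an interval map blows up quadratically under a flush storm and how diverting wide sstables into a flat unleveled list avoids it. It is a simplified model under stated assumptions: tokens are plain integers, interval boundaries are fixed rather than derived from the inserted sstables, `overlap_ratio()` is assumed to mean the fraction of the compaction group's range covered by an sstable, and `WIDE_THRESHOLD = 0.5` is a made-up cutoff, not the value ScyllaDB uses. `PartitionedSet` is a hypothetical name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenRange:
    first: int  # inclusive
    last: int   # inclusive

    def length(self) -> int:
        return self.last - self.first + 1


def overlap_ratio(r: TokenRange, owner: TokenRange) -> float:
    """Fraction of `owner` covered by `r` (assumed definition)."""
    lo, hi = max(r.first, owner.first), min(r.last, owner.last)
    return 0.0 if lo > hi else (hi - lo + 1) / owner.length()


WIDE_THRESHOLD = 0.5  # hypothetical cutoff, not ScyllaDB's actual value


class PartitionedSet:
    def __init__(self, group_range: TokenRange, starts: list[int], divert_wide: bool = True):
        self.group_range = group_range
        self.starts = starts  # sorted start tokens of fixed intervals
        self.intervals: dict[int, list[TokenRange]] = {s: [] for s in starts}
        self.unleveled: list[TokenRange] = []
        self.divert_wide = divert_wide

    def _interval_end(self, i: int) -> int:
        if i + 1 < len(self.starts):
            return self.starts[i + 1] - 1
        return self.group_range.last

    def insert(self, sst: TokenRange) -> None:
        if self.divert_wide and overlap_ratio(sst, self.group_range) >= WIDE_THRESHOLD:
            self.unleveled.append(sst)  # one entry, however wide the sstable is
            return
        for i, start in enumerate(self.starts):
            # An interval map stores the sstable once per overlapped interval.
            if sst.first <= self._interval_end(i) and sst.last >= start:
                self.intervals[start].append(sst)

    def entry_count(self) -> int:
        return sum(len(v) for v in self.intervals.values()) + len(self.unleveled)
```

With 100 intervals over tokens 0..999 and 100 sstables spanning the whole range, the naive set (`divert_wide=False`) stores 100 × 100 = 10,000 entries while the diverting set stores 100. Narrow sstables are still indexed per interval, so they keep their selectivity.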
Raphael S. Carvalho	21d1e78457	compaction: Wire table_state into make_sstable_set() This will be useful for feeding the token range owned by the compaction group into the sstable set. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-04-29 15:47:33 -03:00
Raphael S. Carvalho	59dad2121f	compaction: Introduce token_range() to table_state This provides a way for the compaction layer to know the compaction group's token range. It will be important for the sstable set implementation to know the token range of the underlying group. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-04-29 15:47:33 -03:00
Raphael S. Carvalho	494ed6b887	dht: Add overlap_ratio() for token range Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-04-29 15:47:33 -03:00
Patryk Jędrzejczak	0cdcf82cd0	Merge 'topology coordinator: do not proceed further on invalid bootstrap tokens' from Piotr Dulikowski In case dht::boot_strapper::get_boostrap_tokens fails to parse the tokens, the topology coordinator handles the exception and schedules a rollback. However, the current code tries to continue with the topology coordinator logic even if an exception occurs, leaving boostrap_tokens empty. This does not make sense and can actually cause issues, specifically in prepare_and_broadcast_cdc_generation_data, which implicitly expects that the bootstrap_tokens of the first node in the cluster will not be empty. Fix this by adding the missing break. Fixes: scylladb/scylladb#23897 From code inspection alone, it looks like 2025.1 and 6.2 have this problem, so marking for backport to both of them. Closes scylladb/scylladb#23914 * https://github.com/scylladb/scylladb: test: cluster: add test_bad_initial_token topology coordinator: do not proceed further on invalid bootstrap tokens cdc: add sanity check for generating an empty generation	2025-04-28 12:45:33 +02:00
Botond Dénes	d582c436e5	Merge 'tasks: check whether a node is alive before rpc' from Aleksandra Martyniuk Check whether a node is alive before making an rpc that gathers children infos from the whole cluster in virtual_task::impl::get_children. Fixes: https://github.com/scylladb/scylladb/issues/22514. Needs backport to 2025.1 and 6.2 as they contain the bug. Closes scylladb/scylladb#23787 * github.com:scylladb/scylladb: test: add test for getting tasks children tasks: check whether a node is alive before rpc	2025-04-28 09:32:45 +03:00
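A minimal sketch of the idea, assuming the liveness check and the RPC can be passed in as callables. `gather_children`, `is_alive` and `rpc_get_children` are hypothetical names, not the `virtual_task::impl::get_children` API.

```python
from typing import Callable, Iterable


def gather_children(
    nodes: Iterable[str],
    is_alive: Callable[[str], bool],
    rpc_get_children: Callable[[str], list[str]],
) -> list[str]:
    """Collect child task ids from every live node, skipping dead ones."""
    children: list[str] = []
    for node in nodes:
        if not is_alive(node):
            continue  # an RPC to a dead node would fail or hang
        children.extend(rpc_get_children(node))
    return children
```

The check narrows the window but cannot close it: a node may still die between the liveness check and the RPC, so callers presumably still need to tolerate RPC failures.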
Nadav Har'El	262530f27c	Merge 'mv: make base_info in view schemas immutable' from Wojciech Mitros Currently, the base_info may or may not be set in view schemas. Even when it's set, it may be modified. This necessitates extra checks when handling view schemas, as well as potentially causing errors when we forget to set it at some point. Instead, we want to make the base info an immutable member of view schemas (inside view_info). To achieve this, in this series we remove all base_info members that can change due to a base schema update, and we calculate the remaining values during view update generation, using the most up-to-date base schema version. To calculate the values that depend on the base schema version, we need to iterate over the view primary key and find the corresponding columns, which adds extra overhead for each batch of view updates. However, this overhead should be relatively small, as when creating a view update, we need to prepare each of its columns anyway. And if we need to read the old value of the base row, the relative overhead is even lower. After this change, the base info in view schemas stays the same for all base schema updates, so we'll no longer get issues with base_info being incompatible with a base schema version. Additionally, it's a step towards making the schema objects immutable, which we sometimes incorrectly assumed in the past (they're still not completely immutable yet, as some other fields in view_info other than base_info are initialized lazily and may depend on the base schema version).
Fixes https://github.com/scylladb/scylladb/issues/9059 Fixes https://github.com/scylladb/scylladb/issues/21292 Fixes https://github.com/scylladb/scylladb/issues/22194 Fixes https://github.com/scylladb/scylladb/issues/22410 Closes scylladb/scylladb#23337 * github.com:scylladb/scylladb: test: remove flakiness from test_schema_is_recovered_after_dying mv: add a test for dropping an index while it's building base_info: remove the lw_shared_ptr variant view_info: don't re-set base_info after construction base_info: remove base_info snapshot semantics base_info: remove base schema from the base_info schema_registry: store base info instead of base schema for view entries base_info: make members non-const view_info: move the base info to a separate header view_info: move computation of view pk columns not in base pk to view_updates view_info: move base-dependent variables into base_info view_info: set base info on construction	2025-04-27 19:12:12 +03:00
Piotr Szymaniak	e588c8667f	alternator: Limit attribute name lengths Attribute names are now checked against DynamoDB-compatible length limits. When a limit is exceeded, Alternator emits an exception identical or similar to the DDB one. It might be worth noting that DDB emits more than one kind of exception string for some exceptions. The tests' catch clauses handle all the observed kinds of messages from DynamoDB. The validation differentiates between key and non-key attributes and applies the limit accordingly. AWS DDB raises exceptions with somewhat different contents when the get request contains ProjectionExpression, so this case needed separate treatment to emit the corresponding exception string. The length-validating function was declared and defined in expressions.hh/.cc respectively, because that's where the relevant parsing happens. ** Tests The following tests were validated when handling this issue: test_limit_attribute_length_nonkey_good, test_limit_attribute_length_nonkey_bad, test_limit_attribute_length_key_good, test_limit_attribute_length_key_bad, test_limit_attribute_length_gsi_lsi_good, test_limit_attribute_length_gsi_lsi_bad, test_limit_attribute_length_gsi_lsi_projection_bad. Some of the tests were expanded to be more granular. Namely, there is a new test function `test_limit_attribute_length_key_bad_incoherent_names`, which groups tests with too-long attribute names in incorrect (incoherent) user requests. Similarly, there is a new test function `test_limit_attribute_length_gsi_lsi_bad_incoherent_names`. All the tests now cover each combination of the key/keys being too long. Both new functions contain tests that verify that ScyllaDB throws length-related exceptions (instead of coherency-related ones), similar to what DynamoDB does. The new test test_limit_gsiu_key_len_bad covers the case of a too-long attribute name inside GlobalSecondaryIndexUpdates.
The new test test_limit_gsiu_key_len_bad_incoherent_names covers the case of incorrect (incoherent) user requests containing too-long attribute names and GlobalSecondaryIndexUpdates. test_limit_attribute_length_key_bad was found to contain an illegal KeySchema structure. Some of the tests had their match clauses corrected. All the tests are stripped of the xfail flag except test_limit_attribute_length_key_bad, whose xfail was changed because it still fails due to Projection in GSI and LSI not being implemented in Alternator. The xfail now points to #5036. Fixes scylladb/scylladb#9169 Closes scylladb/scylladb#23097	2025-04-27 18:39:20 +03:00
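The key/non-key distinction described above can be sketched as follows. The limits are assumptions based on DynamoDB's documented constraints (key attribute names up to 255 bytes, other attribute names assumed here to be capped at 65535 bytes), length is assumed to be measured in UTF-8 bytes, and the exception messages are invented; this is not Alternator's `expressions.cc` code. The hypothetical `validate_key_schema` shows the ordering the new tests check for: a length error is reported before a coherency error.

```python
MAX_KEY_ATTR_NAME_LEN = 255       # assumed limit for key attribute names
MAX_NONKEY_ATTR_NAME_LEN = 65535  # assumed limit for other attribute names


class ValidationException(Exception):
    pass


def validate_attr_name_length(name: str, is_key: bool) -> None:
    """Raise ValidationException if `name` is empty or exceeds the assumed limit."""
    length = len(name.encode("utf-8"))  # byte length is an assumption
    limit = MAX_KEY_ATTR_NAME_LEN if is_key else MAX_NONKEY_ATTR_NAME_LEN
    if length == 0:
        raise ValidationException("attribute name must not be empty")
    if length > limit:
        kind = "key" if is_key else "non-key"
        raise ValidationException(
            f"{kind} attribute name is {length} bytes, exceeding the limit of {limit}"
        )


def validate_key_schema(key_names: list[str], defined_names: set[str]) -> None:
    # Lengths are checked first, so an over-long name is reported as a length
    # error even when the request is also incoherent (name not defined).
    for name in key_names:
        validate_attr_name_length(name, is_key=True)
    for name in key_names:
        if name not in defined_names:
            raise ValidationException(f"key attribute {name!r} is not defined")
```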
Piotr Dulikowski	82e1678fbe	test: mv: skip test_mv_tablets_empty_ip in debug mode This test shuts down a node and then replaces it with another one while continuously writing to the cluster. The test has been observed to take a lot of time in debug mode and time out on the replace operation. Replace takes very long because rebuilding tablets on the new node is very slow, and the slowest part is the memtable flush, which happens at the beginning of streaming. The slowness seems to be specific to debug mode. Turn off the test in debug mode to deflake the CI. As a follow-up, the test is planned to be reworked into a quicker error-injection test so that the code path it covers will again be exercised in debug unit tests (scylladb/scylladb#23898) Fixes: scylladb/scylladb#20316 Closes scylladb/scylladb#23900	2025-04-27 18:06:08 +03:00
Piotr Dulikowski	670a69007e	test: cluster: add test_bad_initial_token Adds a test which checks that rollback works properly when a bad value of the initial_token option is provided.	2025-04-25 12:25:15 +02:00
Aleksandra Martyniuk	76cd707b18	test: test_tablets: wait for cql Wait for cql after rolling restart in test_two_tablets_concurrent_repair_and_migration_repair_writer_level to prevent failing queries. Fixes: #23620. Closes scylladb/scylladb#23796	2025-04-24 21:25:29 +03:00
Patryk Jędrzejczak	2a8bb47cfb	test: test_zero_token_nodes_topology_ops: use host IDs for ignored nodes Providing the IP of an ignored node during removenode made the test flaky. It could happen that the address map contained mappings of two nodes with the same IP: 1. the node being ignored, 2. the node that, as expected, failed to replace earlier in the test. So, `address_map::find_by_addr()` called in `find_raft_nodes_from_hoeps` could return the host ID of the second node instead of the first node and cause removenode to fail. We fix the flakiness in this patch by providing the host ID of the ignored node instead of its IP. We would have had to do this sooner or later anyway, because providing an IP is deprecated. The bug in `find_raft_nodes_from_hoeps` is tracked by scylladb/scylladb#23846. The test became flaky because of `f0af3f261e`. That patch is not present in 2025.1, so the test isn't flaky outside master, and hence there is no reason to backport this patch. Fixes scylladb/scylladb#23499 Closes scylladb/scylladb#23863	2025-04-24 20:17:19 +03:00
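The ambiguity described above can be reproduced in a toy model: when two host IDs map to the same IP, a lookup by address can return only one of them, and which one depends on iteration order. `AddressMap` below is a hypothetical stand-in, not ScyllaDB's `address_map`; looking up by host ID, as the fixed test does, avoids the ambiguity entirely.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AddressMap:
    """Toy host_id -> IP map (hypothetical, not ScyllaDB's address_map)."""
    entries: dict[str, str]  # host_id -> ip

    def find_by_addr(self, ip: str) -> Optional[str]:
        # Returns the first host with this IP; ambiguous if several share it.
        for host_id, addr in self.entries.items():
            if addr == ip:
                return host_id
        return None

    def hosts_with_addr(self, ip: str) -> list[str]:
        return [h for h, a in self.entries.items() if a == ip]
```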
Pavel Emelyanov	68a178eba9	Merge 'replica: skip flush of dropped table' from Aleksandra Martyniuk Currently, flush throws no_such_column_family if a table is dropped. Skip the flush of dropped table instead. Fixes: #16095. Needs backport to 2025.1 and 6.2 as they contain the bug Closes scylladb/scylladb#23876 * github.com:scylladb/scylladb: test: test table drop during flush replica: skip flush of dropped table	2025-04-24 20:02:59 +03:00
Wojciech Mitros	ee5883770a	test: remove flakiness from test_schema_is_recovered_after_dying Due to the changes in creating schemas with base info, test_schema_is_recovered_after_dying seems to be flaky when checking that the schema is actually lost after 'grace_period'. We don't actually guarantee that the schema will be lost at that exact moment, so there's no reason to test this. To remove the flakiness, we remove the check and the related sleep, which should also slightly improve the speed of this test.	2025-04-24 01:09:35 +02:00
Wojciech Mitros	bf7bba9634	mv: add a test for dropping an index while it's building Dropping an index is a schema change of its base table and a schema drop of the index's materialized view. This combination of schema changes used to cause issues during view building, because when a view schema was dropped, it wasn't getting updated with the new version of the base schema, and while the view building was in progress, we would update the base schema for the base table mutation reader and try generating updates with a view schema that wasn't compatible with the base schema, failing on an `on_internal_error`. In this patch we add a test for this scenario. We create an index, halt its view building process using an injection, and drop it. If no errors are thrown, the test succeeds. The test was failing before https://github.com/scylladb/scylladb/pull/23337 and is passing afterwards.	2025-04-24 01:09:32 +02:00
Wojciech Mitros	d77f11d436	base_info: remove the lw_shared_ptr variant The base_dependent_view_info no longer needs to be shared or modified in the view_info, so we no longer need to keep it as a shared pointer.	2025-04-24 01:08:40 +02:00
Wojciech Mitros	d7bd86591e	view_info: don't re-set base_info after construction In the previous commits we made sure that the base info is not dependent on the base schema version, and the info dependent on the base schema version is calculated when it's needed. In this patch we remove the unnecessary re-setting of the base_info. The set_base_info method isn't removed completely, because it also has a secondary function - zeroing the view_info fields other than base_info. Because of this, in this patch we rename it accordingly and limit its use to the updates caused by a base schema change.	2025-04-24 01:08:40 +02:00
Wojciech Mitros	05fce91945	schema_registry: store base info instead of base schema for view entries In the following patch we plan to remove the base schema from the base_info to make the base_info immutable. To do that, we first prepare the schema registry for the change; we need to be able to create view schemas from frozen schemas there, and frozen schemas have no information about the base table. Without this change, after base schemas are removed from the base info, we would no longer be able to load a view schema into the schema registry without looking up the base schema in the database. This change also required some updates to schema building: * we add a method for unfreezing a view schema with base info instead of a base schema * we make it possible to use schema_builder with a base info instead of a base schema * we add a method for creating a view schema from mutations with a base info instead of a base schema * we add a view_info constructor with a base info instead of a base schema * we update the naming in schema_registry to reflect the usage of base info instead of base schema	2025-04-24 01:08:39 +02:00
Wojciech Mitros	900687c818	view_info: set base info on construction Currently, the base_info may or may not be set in view schemas. Even when it's set, it may be modified. This necessitates extra checks when handling view schemas, as well as potentially causing errors when we forget to set it at some point. Instead, we want to make the base info an immutable member of view schemas (inside view_info). The first step towards that is making sure that all newly created schemas have the base info set. We achieve that by requiring a base schema when constructing a view schema. Unfortunately, this adds complexity each time we're making a view schema - we need to get the base schema as well. In most cases, the base schema is already available. The most problematic scenario is when we create a schema from mutations: - when parsing system tables we can get the schema from the database, as regular tables are parsed before views - when loading a view schema using the schema loader tool, we need to load the base additionally to the view schema, effectively doubling the work - when pulling the schema from another node - in this case we can only get the current version of the base schema from the local database Additionally, we need to consider the base schema version - when we generate view updates the version of the base schema used for reads should match the version of the base schema in view's base info. This is achieved by selecting the correct (old or new) schema in `db::schema_tables::merge_tables_and_views` and using the stored base schema in the schema_registry.	2025-04-24 01:08:39 +02:00
Benny Halevy	f279625f59	test_tablets_cql: test_alter_dropped_tablets_keyspace: extend expected error The query may fail also on a no_such_keyspace exception, which generates the following cql error: ``` Error from server: code=2200 [Invalid query] message="Can\'t find a keyspace test_1745198244144_qoohq" ``` Extend the pytest.raises match expression to include this error as well. Fixes #23812 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#23875	2025-04-23 18:54:22 +03:00
Aleksandra Martyniuk	c1618c7de5	test: test table drop during flush	2025-04-23 14:29:28 +02:00
Nadav Har'El	64a5eee6b9	test/cqlpy: insert test names into Scylla logs Both test.py and test/cqlpy/run run many test functions against the same Scylla process. In the resulting log file, it is hard to understand which log messages are related to which test. In this patch, we log a message (using the "/system/log" REST API) every time a test is started or ends. The messages look like this: INFO 2025-04-22 15:10:44,625 [shard 1:strm] api - /system/log: test/cqlpy: Starting test_lwt.py::test_lwt_missing_row_with_static ... INFO 2025-04-22 15:10:44,631 [shard 0:strm] api - /system/log: test/cqlpy: Ended test_lwt.py::test_lwt_missing_row_with_static We already had a similar feature in test/alternator, added three years ago in commit `b0371b6bf8`. The implementation is similar but not identical due to different available utility functions, and in any case it's very simple. While at it, this patch also fixes the has_rest_api() to timeout after one second. Without this, if the REST API is blocked in a way that a connection attempt just hangs, the tests can hang. With the new timeout, the test will hang for a second, realize the REST API is not available, and remember this decision (the next tests will not wait one second again). We had the same bug in Alternator, and fixed it in `758f8f01d7`. This one second "pause" will only happen if the REST API port is blocked - in the more typical case the REST API port is just not listening but not blocked, and the failure will be noticed immediately and won't wait a whole second. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#23857	2025-04-23 12:04:14 +03:00
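A minimal sketch of the `has_rest_api()` behavior described above: probe the REST port with a one-second connect timeout and remember the answer, so at most the first test pays the delay. `tcp_probe` and `make_has_rest_api` are hypothetical helpers, not the test/cqlpy code; `socket.create_connection` is the standard-library call.

```python
import socket
from typing import Callable, Optional


def tcp_probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection can be opened within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out (socket.timeout is an OSError)
        return False


def make_has_rest_api(probe: Callable[[], bool]) -> Callable[[], bool]:
    """Wrap `probe` so it runs at most once; later calls reuse the answer."""
    cached: Optional[bool] = None

    def has_rest_api() -> bool:
        nonlocal cached
        if cached is None:
            cached = probe()
        return cached

    return has_rest_api
```

Usage might look like `has_rest_api = make_has_rest_api(lambda: tcp_probe("127.0.0.1", 10000))`, where 10000 is ScyllaDB's default REST API port (an assumption about the local setup). A refused connection returns immediately; only a blocked port waits for the full timeout, and only once.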
Piotr Dulikowski	3d73c79a72	test: mv: skip test_view_building_scheduling_group in debug The test populates a table with 50k rows, creates a view on that table and then compares the time spent in streaming vs. gossip scheduling groups. It only takes 10s in dev mode on my machine, but is much slower in debug mode in CI: building the view doesn't finish within 2 minutes. The bigger the view to build, the more accurate the measurement; moreover, the test scenario isn't interesting enough to be worth running in debug mode, as this should be covered by other tests. Therefore, just skip this test in debug mode. Fixes: scylladb/scylladb#23862 Closes scylladb/scylladb#23866	2025-04-23 11:29:35 +03:00
Pavel Emelyanov	a6ba535c3c	Merge 'test.py: refactoring before boost pytest integration' from Andrei Chekun This PR contains changes that do not add new functionality, and have small refactoring of the existing code. The most significant change though is switching the SQLite writer from a singleton to a thread locking mechanism that will be needed later on. This PR is an extraction of several commits from https://github.com/scylladb/scylladb/pull/22894 as reviewer [request](https://github.com/scylladb/scylladb/pull/22894?notification_referrer_id=NT_kwDOACiLR7MxNDg0ODk2MDU1MjoyNjU3MDk1&notifications_query=reason%3Aparticipating#pullrequestreview-2778582278). Closes scylladb/scylladb#23867 * github.com:scylladb/scylladb: test.py: move the readme file for LDAP tests to the correct location test.py: eliminate deprecation warning for xml.etree.ElementTree.Element test.py: align the behavior of max-failures parameter with pytest maxfail test.py: fix typo in toxiproxy name parameter test.py: add locking to the sqlite writer for resource gather test.py: add sqlite datetime adapter for resource gather test.py: change the parameter for get_modes_to_run()	2025-04-23 11:10:56 +03:00
Andrei Chekun	57b66e6b2e	test.py: move the readme file for LDAP tests to the correct location The README file was created in an incorrect location; it is now moved to the directory with the source files, where it was intended to be.	2025-04-22 19:03:28 +02:00
Andrei Chekun	cf4747c151	test.py: eliminate deprecation warning for xml.etree.ElementTree.Element Testing the truth value of an Element emits a DeprecationWarning. This check is now done in a way that does not rely on the Element's truth value.	2025-04-22 19:03:21 +02:00
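For illustration, the usual way to avoid the deprecated truth-value test is an explicit `is not None` comparison (or `len(elem)` when the number of children is what matters). `has_child` is a hypothetical helper, not the test.py code.

```python
import xml.etree.ElementTree as ET


def has_child(parent: ET.Element, tag: str) -> bool:
    # Explicit None check: "was the element found?", independent of its children.
    return parent.find(tag) is not None


doc = ET.fromstring("<testsuite><testcase name='a'/></testsuite>")
case = doc.find("testcase")
```

An element without children is falsy, so an `if case:` test would treat the found-but-empty `<testcase/>` above as missing; that surprising behavior is why Python deprecated the implicit test.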
Andrei Chekun	5c3501e4bf	test.py: fix typo in toxiproxy name parameter Fix typo in toxiproxy name parameter. No functional changes, just a cosmetic fix.	2025-04-22 19:02:12 +02:00
Andrei Chekun	2c37a793d1	test.py: add locking to the sqlite writer for resource gather SQLite locks the DB during writes, so it's not possible to write from several threads at the same time. To be able to gather metrics in several threads, we need a locking mechanism for threads during writes, so that a thread will not try to write metrics while another thread is performing writes.	2025-04-22 19:01:30 +02:00
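A minimal sketch of the locking idea, assuming a single shared `sqlite3` connection: `check_same_thread=False` lets other threads use it, and a `threading.Lock` serializes every access. `LockedSqliteWriter` and the `metrics` table are hypothetical; the actual test.py writer may be structured differently.

```python
import sqlite3
import threading


class LockedSqliteWriter:
    """Serialize writes from many threads through one connection and a lock."""

    def __init__(self, path: str = ":memory:") -> None:
        # check_same_thread=False allows other threads to use this connection;
        # the lock is what keeps their statements from interleaving.
        self._conn = sqlite3.connect(path, check_same_thread=False)
        self._lock = threading.Lock()
        self._conn.execute("CREATE TABLE IF NOT EXISTS metrics (name TEXT, value REAL)")
        self._conn.commit()

    def write(self, name: str, value: float) -> None:
        with self._lock:
            self._conn.execute("INSERT INTO metrics VALUES (?, ?)", (name, value))
            self._conn.commit()

    def count(self) -> int:
        with self._lock:
            return self._conn.execute("SELECT COUNT(*) FROM metrics").fetchone()[0]
```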
Andrei Chekun	800710dc2c	test.py: add sqlite datetime adapter for resource gather Add an sqlite datetime adapter for resource gather, since the default adapters are deprecated as of Python 3.12.	2025-04-22 18:59:49 +02:00
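A sketch of explicit adapter/converter registration replacing the deprecated defaults, assuming ISO 8601 text is an acceptable storage format. `sqlite3.register_adapter`, `sqlite3.register_converter` and `PARSE_DECLTYPES` are standard-library APIs; the function names are illustrative, and the exact format test.py uses is not stated in the commit.

```python
import datetime
import sqlite3


def adapt_datetime(value: datetime.datetime) -> str:
    # Python -> SQLite: store datetimes as ISO 8601 text.
    return value.isoformat()


def convert_timestamp(raw: bytes) -> datetime.datetime:
    # SQLite -> Python: called for columns declared TIMESTAMP when the
    # connection uses detect_types=sqlite3.PARSE_DECLTYPES.
    return datetime.datetime.fromisoformat(raw.decode())


sqlite3.register_adapter(datetime.datetime, adapt_datetime)
sqlite3.register_converter("timestamp", convert_timestamp)
```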
Andrei Chekun	bf2a9e267e	test.py: change the parameter for get_modes_to_run() Change the parameter of get_modes_to_run() from session to config to narrow the scope, and prepare it for later use in methods that do not have access to the session but do have access to the config object.	2025-04-22 18:58:33 +02:00
Pavel Emelyanov	65efd2b2f6	Merge 'Refactor and enhance s3_tests' from Ernest Zaslavsky This PR introduces a cleanup mechanism in s3_tests to remove uploaded objects after the test completes, ensuring a clean testing environment. Additionally, the recently added test has been refactored and split into smaller, more maintainable parts, improving readability and extending its coverage to include the "proxied" case. As these changes primarily improve code aesthetics and maintainability, backporting is not necessary. Refs: https://github.com/scylladb/scylladb/issues/23830 Closes scylladb/scylladb#23828 * github.com:scylladb/scylladb: s3_tests: Improve and extend copy object test coverage s3_tests: Implement post-test cleanup for uploaded objects	2025-04-22 16:40:37 +03:00
Nadav Har'El	8d1a413357	test/scylla_gdb: better error message when running on dev build mode The test/scylla_gdb suite needs Scylla to have been built with debug symbols - which is NOT the case for the dev build. So the script test/scylla_gdb/run attempts to recognize when a developer runs it on an executable with the debug symbols missing - and prints a clear error. Unfortunately, as we noticed in #10863, and again in #23832, because wasmtime is compiled with debug symbols and linked with Scylla, build/dev/scylla "pretends" to have debug symbols, foiling the check in test/scylla_gdb/run. Reviewers rejected two solutions to this problem (pull requests #10865 and #10923), so in pull request #10937 I added a cosmetic solution just for test/scylla_gdb: in test/scylla_gdb/conftest.py we check that there are really debug symbols that interest us, and if not, exit immediately instead of failing each test separately. For some reason, the sys.exit() we used is no longer effective - it no longer exits pytest, so in this patch we use pytest.exit() instead. Fixes #23832 (sort of, we leave build/dev/scylla with the fake claim that it has debug symbols, but test/scylla_gdb will handle this situation more gracefully). Closes scylladb/scylladb#23834	2025-04-22 15:02:06 +03:00
Michael Litvak	5c1d24f983	test: test_mv_topology_change: increase timeout for remove_node The test `test_mv_write_to_dead_node` currently uses a timeout of 60 seconds for remove_node, after it was increased from 30 seconds to fix scylladb/scylladb#22953. Apparently it is still too low, and it was observed to fail in debug mode. Normally remove_node uses a default timeout of TOPOLOGY_TIMEOUT = 1000 seconds, but the test requires a timeout which is shorter than 5 minutes, because it is a regression test for an issue where MV updates hold topology changes for more than 5 minutes, and we want to verify in the test that the topology change completes in less than 5 minutes. To resolve the issue, we set the test to skip in debug mode, because the remove node operation is unpredictably slow, and we increase the timeout to 180 seconds which is hopefully enough time for remove_node in non-debug modes, and still sufficient to satisfy the test requirements. Fixes scylladb/scylladb#22530 Closes scylladb/scylladb#23833	2025-04-22 10:51:19 +02:00
Ernest Zaslavsky	edaa3f4bdd	s3_tests: Improve and extend copy object test coverage Refactored the copy object test to enhance readability and maintainability. The test was simplified and split into smaller, more focused parts. Additionally, a "proxied" variant of the test was introduced to expand coverage.	2025-04-21 20:54:14 +03:00
Ernest Zaslavsky	252a0a14af	s3_tests: Implement post-test cleanup for uploaded objects Ensure cleanup after tests by deleting objects uploaded to MinIO. This improves resource management and maintains a clean test environment.	2025-04-21 20:54:14 +03:00
Avi Kivity	2dcd2b21ae	Merge 'tablets: Equalize per-table balance when allocating tablets for a new table' from Tomasz Grabiec Fixes the following scenario: 1. Scale out adds new nodes to each rack 2. Table is created - all tablets are allocated to new nodes because they have low load 3. Rebalancing moves tablets from old nodes to new nodes - table balance for the new table is not fixed We're wrong to try to equalize global load when allocating tablets, and we should equalize per-table load instead, and let background load balancing fix it in a fair way. It will add to the allocated storage imbalance, but: 1. The table is initially empty, so doesn't impact actual storage imbalance. 2. It's more important to avoid overloading CPU on the nodes - imbalance hurts this aspect immediately. 3. If the table was created before imbalance was formed, we would end up in the same situation as in the problematic scenario after the patch. 4. It's the job of the load balancing to keep up with storage growing, and if it's not, scale out should kick in. Before we have CPU-aware tablet allocation, and thus can prove we have CPU capacity on the small nodes, we should respect per-table balance as this is the way in which we achieve full CPU utilization. Fixes #23631 Backport to 2025.1 because load imbalance is a serious problem in production. Closes scylladb/scylladb#23708 * github.com:scylladb/scylladb: tablets: Equalize per-table balance when allocating tablets for a new table load_sketch: Tolerate missing tablet_map when selecting for a given table tests: tablets: Simplify tests by moving common code to topology_builder	2025-04-21 17:06:30 +03:00
Pavel Emelyanov	eb5b52f598	Merge 'main: make DC and rack immutable after bootstrap' from Piotr Dulikowski Changing DC or rack on a node which was already bootstrapped is, in the case of vnodes, very unsafe (almost guaranteed to cause data loss or unavailability), and is outright not supported if the cluster has tablet-backed keyspaces. Moreover, the possibility of doing that makes it impossible to uphold some of the invariants promised by the RF-rack-valid flag, which is eventually going to become unconditionally enabled. Get rid of the above problems by removing the possibility of changing the DC / rack of a node. A node will now fail to start if its snitch reports a different DC or rack than the one that was reported during the first boot. Fixes: scylladb/scylladb#23278 Fixes: scylladb/scylladb#22869 Marking for backport to 2025.1, as this is a necessary part of the RF-rack-valid saga Closes scylladb/scylladb#23800 * github.com:scylladb/scylladb: doc: changing topology when changing snitches is no longer supported test: cluster: introduce test_no_dc_rack_change storage_service: don't update DC/rack in update_topology_with_local_metadata main: make dc and rack immutable after bootstrap test: cluster: remove test_snitch_change	2025-04-21 15:52:55 +03:00
Avi Kivity	0ba3ce1741	test: gdb: avoid using `file(1)` to determine if debug information is present The scylla_gdb tests verify, as a sanity check, that the executable was built with debug information. They do so via file(1). In Fedora 42, file(1) crashes on ELF files that have interpreter pathnames larger than 128 characters[1]. This was later fixed[2], but the fix is not in any release. Work around the problem by using objdump instead of file. [1] https://bugzilla.redhat.com/show_bug.cgi?id=2354970 [2] `b3384a1fbf` Closes scylladb/scylladb#23823	2025-04-21 13:29:27 +03:00
Andrei Chekun	441cee8d9c	test.py: fix gathering logs in case of fail Currently, log file names contain the run_id twice: cluster.object_store_test_backup.10.test_abort_restore_with_rpc_error.dev.10_cluster.log However, the first run_id can sometimes be incorrect: cluster.object_store_test_backup.1.test_abort_restore_with_rpc_error.dev.10_cluster.log Remove the first run_id from the name to avoid this issue, since it is redundant anyway. Stop creating an empty file for the scylla manager log, since it is redundant and was based on an incorrect assumption about the root cause of the failure. Add an extension to the stacktrace file so that Jenkins opens it in a new browser tab instead of downloading it. Fixes: https://github.com/scylladb/scylladb/issues/23731 Closes scylladb/scylladb#23797	2025-04-21 13:12:35 +03:00
Pavel Emelyanov	09caad6147	test: Remove sstable_assertions::get_stats_metadata() It mirrors the sstable method of the same name, which is public. With the -> operator, it's just as convenient to call it directly. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-04-18 18:53:41 +03:00
Pavel Emelyanov	294e56207d	test: Add sstable_assertions::operator->() ... and replace get_sstable() with it. It's more natural (despite it having only one user) to consider the class to be yet another "pointer" to an sstable. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-04-18 18:52:39 +03:00
Sergey Zolotukhin	2314feeae2	test: Ignore DEBUG,TRACE,INFO level messages when checking for failed mutations. Update the regular expression in `check_node_log_for_failed_mutations` to avoid false test failures when DEBUG-level logging is enabled. Fixes scylladb/scylladb#23688 Closes scylladb/scylladb#23658	2025-04-18 16:17:41 +03:00
Calle Wilund	4a44651fce	encryption_at_rest_test: Make fake_proxy read/write loop noexcept Fixes #23774 Test code falls into the same when_all issue as the http client did. Avoid passing exceptions through it; instead, catch and report them in the worker lambda. Closes scylladb/scylladb#23778	2025-04-18 16:17:41 +03:00

8717 Commits