scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-26 19:35:12 +00:00

Author	SHA1	Message	Date
Avi Kivity	55c7bc746e	Revert "vector_search_validator: move high availability tests from vector-store.git" This reverts commit `caa0cbe328`. It is either extremely slow or broken. I was never able to get it to run on an r8gd.8xlarge (on the NVMe disk). Even when it passes, it is very slow. Test script: ``` git submodule update --recursive \|\| exit 125 rm -rf build d() { ./tools/toolchain/dbuild -it -- "$@"; } d ./configure.py --mode release \|\| exit 125 d ninja release-build \|\| exit 125 d ./test.py --mode release ``` Ref #27858 Ref #27859 Ref #27860	2025-12-25 12:30:22 +00:00
Alex	f769e52877	test: boost: Fix flaky test_large_file_upload_s3 by creating individual files for testing During CI runs, multiple instances of the same test may execute concurrently. Although the test uses a temporary directory, the downloaded bucket artifacts were written using an identical filename across all instances. This caused concurrent writers to operate on the same file, leading to file corruption. In some cases, this manifested as test failures and intermittent std::bad_alloc exceptions. Change Description This change ensures that each test instance uses a unique filename for downloaded bucket files. By isolating file writes per test execution, concurrent runs no longer interfere with each other. Fixes: #27824 backport not required Closes scylladb/scylladb#27843	2025-12-25 09:40:13 +02:00
Botond Dénes	c66275e05c	cql3/statements/batch_statement: make size error message more verbose Mention the type of batch: Logged or Unlogged. The size (warn/fail on too large size) error has different significance depending on the type. Refs: #27605 Closes scylladb/scylladb#27664	2025-12-24 15:27:01 +02:00
Botond Dénes	ccc03d0026	test/pylib/runner.py: pytest_configure(): coerce repeat to int Coerce the return value of config.getoption("--repeat") to int to avoid: Traceback (most recent call last): File "/usr/bin/pytest", line 8, in <module> sys.exit(console_main()) ~~~~~~~~~~~~^^ File "/usr/lib/python3.14/site-packages/_pytest/config/__init__.py", line 201, in console_main code = main() File "/usr/lib/python3.14/site-packages/_pytest/config/__init__.py", line 175, in main ret: ExitCode \| int = config.hook.pytest_cmdline_main(config=config) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^ File "/usr/lib/python3.14/site-packages/pluggy/_hooks.py", line 512, in __call__ return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult) ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/lib/python3.14/site-packages/pluggy/_manager.py", line 120, in _hookexec return self._inner_hookexec(hook_name, methods, kwargs, firstresult) ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/lib/python3.14/site-packages/pluggy/_callers.py", line 167, in _multicall raise exception File "/usr/lib/python3.14/site-packages/pluggy/_callers.py", line 121, in _multicall res = hook_impl.function(args) File "/usr/lib/python3.14/site-packages/_pytest/helpconfig.py", line 154, in pytest_cmdline_main config._do_configure() ~~~~~~~~~~~~~~~~~~~~^^ File "/usr/lib/python3.14/site-packages/_pytest/config/__init__.py", line 1118, in _do_configure self.hook.pytest_configure.call_historic(kwargs=dict(config=self)) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/lib/python3.14/site-packages/pluggy/_hooks.py", line 534, in call_historic res = self._hookexec(self.name, self._hookimpls.copy(), kwargs, False) File "/usr/lib/python3.14/site-packages/pluggy/_manager.py", line 120, in _hookexec return self._inner_hookexec(hook_name, methods, kwargs, firstresult) ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 
File "/usr/lib/python3.14/site-packages/pluggy/_callers.py", line 167, in _multicall raise exception File "/usr/lib/python3.14/site-packages/pluggy/_callers.py", line 121, in _multicall res = hook_impl.function(args) File "/home/bdenes/ScyllaDB/scylladb/scylladb/test/pylib/runner.py", line 206, in pytest_configure config.run_ids = tuple(range(1, config.getoption("--repeat") + 1)) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~ TypeError: can only concatenate str (not "int") to str Closes scylladb/scylladb#27649	2025-12-24 15:13:02 +02:00
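The failure mode in the traceback above can be reproduced without pytest. The following is a minimal sketch, assuming the option value arrives as a string; `make_run_ids` is a hypothetical helper name, not the function in test/pylib/runner.py.

```python
def make_run_ids(repeat):
    """Return run ids 1..repeat, accepting an int or a numeric string."""
    # Coercing first avoids `str + int`, which raises TypeError.
    return tuple(range(1, int(repeat) + 1))


# Without the coercion, a string option value fails exactly like the traceback.
try:
    tuple(range(1, "3" + 1))
except TypeError as exc:
    uncoerced_error = str(exc)

print(make_run_ids("3"))
print(uncoerced_error)
```

Calling `int()` on a value that is already an int is a no-op, so the coercion is safe whether or not the option parser converts the value itself.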
Botond Dénes	b036a461b7	tools/scylla-sstable: dump-schema: include UDT description in dump If the table uses UDTs, include the description of these (CREATE TYPE statement) in the schema dump. Without these the schema is not useful. Closes scylladb/scylladb#27559	2025-12-24 14:46:52 +02:00
Nadav Har'El	4ae45eb367	test/alternator: remove unused imports Remove many unused "import" statements or parts of import statement. All of them were detected by Copilot, but I verified each one manually and prepared this patch. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#27676	2025-12-24 13:44:28 +02:00
Nadav Har'El	da00401b7d	test/alternator: rename test with duplicate name The file test/alternator/test_transact.py accidentally had two tests with the same name, test_transact_get_items_projection_expression. This means the first of the two tests was ignored and never run. This patch renames the second of the two to a more appropriate (and unique...) name. I verified that after this change the number of tests in this file grows by one, and that still all tests pass on DynamoDB and fail (as expected by xfail) on Alternator. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#27702	2025-12-24 13:43:43 +02:00
Botond Dénes	95d4c73eb1	Merge 'Make object storage config truly updateable' from Pavel Emelyanov The db::config::object_storage_endpoints parameter is live-updateable, but when the update really happens, the new endpoints may fail to propagate to non-zero shards because of the way db::config sharding is implemented. Refs: #7316 Fixes: #26509 Backport to 2025.3 and 2025.4, AFAIK there are set ups with object storage configs for native backup Closes scylladb/scylladb#27689 * github.com:scylladb/scylladb: sstables/storage_manager: Fix configured endpoints observer test/object_store: Add test to validate how endpoint config update works	2025-12-24 13:42:44 +02:00
Botond Dénes	12dcf79c60	Merge 'build: support (and prefer) sccache as the compiler cache' from Avi Kivity Currently, we support ccache as the compiler cache. Since it is transparent, nothing much is needed to support it. This series adds support to sccache[1] and prefers it over ccache when it is installed. sccache brings the following benefits over ccache: 1. Integrated distributed build support similar to distcc, but with automatic toolchain packaging and a scheduler 2. Rust support 3. C++20 modules (upcoming[2]) It is the C++20 modules support that motivates the series. C++20 modules have the potential to reduce build times, but without a compiler cache and distributed build support, they come with too large a penalty. This removes the penalty. The series detects that sccache is installed, selects it if so (and if not overridden by a new option), enables it for C++ and Rust, and disables ccache transparent caching if sccache is selected. Note: this series doesn't add sccache to the frozen toolchain or add dbuild support. That is left for later. [1] https://github.com/mozilla/sccache [2] https://github.com/mozilla/sccache/pull/2516 Toolchain improvement, won't be backported. Closes scylladb/scylladb#27834 * github.com:scylladb/scylladb: build: apply sccache to rust builds too build: prevent double caching by compiler cache build: allow selecting compiler cache, including sccache	2025-12-24 13:40:02 +02:00
Nadav Har'El	74a57d2872	test/cqlpy: remove unused imports Remove many unused "import" statements or parts of import statement. All of them were detected by Copilot, but I verified each one manually and prepared this patch. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#27675	2025-12-24 13:31:41 +02:00
Avi Kivity	d6edad4117	test: pylib: resource_gather: don't take ownership of /sys/fs/cgroup under podman Under podman, we already own /sys/fs/cgroup. Run the chown command only under docker where the container does not map the host user to the container root user. The chown process is sometimes observed to fail with EPERM (see issue). But it's not needed, so avoid it. Fixes #27837. Closes scylladb/scylladb#27842	2025-12-24 10:56:24 +02:00
Marcin Maliszkiewicz	3c1e1f867d	raft: auth: add semaphore to auth_cache::load_all Auth cache loading at startup is racing between auth service and raft code and it doesn't support concurrency causing it to crash. We can't easily remove any of the places as during raft recovery snapshot is not loaded and it relies on loading cache via auth service. Therefore we add semaphore. Fixes https://github.com/scylladb/scylladb/issues/27540 Closes scylladb/scylladb#27573	2025-12-24 10:56:24 +02:00
Nadav Har'El	f3a4af199f	test/cqlpy/test_materialized_view.py: Fix for Commented-out code This patch was suggested and prepared by copilot, I am writing the commit message because the original one was worthless. In commit `cf138da`, for an unexplained reason, a loop waiting until the expected value appears in a materialized view was replaced by a call for wait_for_view_built(). The old loop code was left behind in a comment, and this commented-out code is now bothering our AI. So let's delete the commented-out code. Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com> Closes scylladb/scylladb#27646	2025-12-24 10:56:23 +02:00
Botond Dénes	1bb897c7ca	Merge 'Unify configuration of object storage endpoints' from Pavel Emelyanov To configure S3 storage, one needs to do ``` object_storage_endpoints: - name: s3.us-east-1.amazonaws.com port: 443 https: true aws_region: us-east-1 ``` and for GCS it's ``` object_storage_endpoints: - name: https://storage.googleapis.com:433 type: gs credentials_file: <gcp account credentials json file> ``` This PR updates the S3 part to look like ``` object_storage_endpoints: - name: https://s3.us-east-1.amazonaws.com:443 aws_region: us-east-1 ``` fixes: #26570 Not-yet released feature, no need to backport. Old configs are not accepted any longer. If it's needed, then this decision needs to be revised. Closes scylladb/scylladb#27360 * github.com:scylladb/scylladb: object_storage: Temporarily handle pure endpoint addresses as endpoints code: Remove dangling mentions of s3::endpoint_config docs: Update docs according to new endpoints config option format object_storage: Create s3 client with "extended" endpoint name test: Add named constants for test_get_object_store_endpoints endpoint names s3/storage: Tune config updating sstable: Shuffle args for s3_client_wrapper	2025-12-24 06:59:02 +02:00
Botond Dénes	954f2cbd2f	Merge 'config, transport: add listeners for native protocol fronted by proxy protocol v2' from Avi Kivity For deployments fronted by a reverse proxy (haproxy or privatelink), we want to use proxy protocol v2 so that client information in system.clients is correct and so that the shard-aware selection protocol, which depends on the source port, works correctly. Add proxy-protocol enabled variants of each of the existing native transport listeners. Tests are added to verify this works. I also manually tested with haproxy. New feature, no backport. Closes scylladb/scylladb#27522 * github.com:scylladb/scylladb: test: add proxy protocol tests config, transport: support proxy protocol v2 enhanced connections	2025-12-24 06:58:00 +02:00
Nadav Har'El	e75c75f8cd	test/cqlpy: fix two tests that couldn't fail because of typo As noticed by copilot, two tests in test_guardrail_compact_storage.py could never fail, because they used `pytest.fail` instead of the correct `pytest.fail()` to fail. Unfortunately, Python has a footgun where if it sees a bare function name without parenthesis, instead of complaining it evaluates the function object and then ignores it, and absolutely nothing happens. So let's add the missing `()`. The test still passes, but now it at least has a chance of failing if we have a regression. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#27658	2025-12-24 06:49:54 +02:00
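The footgun described above can be shown without pytest. In this minimal sketch, `fail` is a hypothetical stand-in for `pytest.fail`, and the two check functions are illustrative only.

```python
def fail(message="failed"):
    """Stand-in for pytest.fail: raises to mark the test as failed."""
    raise AssertionError(message)


def check_without_call():
    # Bare name: the function object is evaluated and discarded, nothing is raised.
    fail
    return "passed"


def check_with_call():
    # With parentheses the function actually runs and raises.
    fail("regression detected")
    return "passed"


print(check_without_call())
try:
    check_with_call()
except AssertionError as exc:
    print("failed:", exc)
```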
Yaron Kaikov	d671ca9f53	fix: remove return from finally block in s3_proxy.py during any jenkins job that trigger `test.py` we get: ``` /jenkins/workspace/releng-testing/byo/byo_build_tests_dtest/scylla/test/pylib/s3_proxy.py:152: SyntaxWarning: 'return' in a 'finally' block ``` The 'return' statement in the finally block was causing a SyntaxWarning. Moving the return outside the finally block ensures proper exception handling while maintaining the intended behavior. Closes scylladb/scylladb#27823	2025-12-24 06:48:03 +02:00
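One reason Python warns about this pattern is that a `return` inside `finally` silently discards any exception that is in flight. The sketch below illustrates the difference with two hypothetical functions; it is not the code from s3_proxy.py.

```python
def return_in_finally():
    try:
        raise ValueError("lost")
    finally:
        # The return replaces the pending ValueError; the caller never sees it.
        return "swallowed"


def return_after_finally():
    result = None
    try:
        raise ValueError("kept")
    finally:
        # Cleanup still runs, but the exception keeps propagating afterwards.
        result = "cleanup done"
    return result


print(return_in_finally())
try:
    return_after_finally()
except ValueError as exc:
    print("propagated:", exc)
```

On interpreters that implement the warning, defining `return_in_finally` itself triggers the `SyntaxWarning` quoted in the commit message.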
Avi Kivity	fc81983d42	test: sstable_validation_test: actually test `ms` version sstable_validation_test tests the `scylla sstable validate` command by passing it intentionally corrupted sstables. It uses an sstable cache to avoid re-creating the same sstables. However, the cache does not consider the sstable version, so if called twice with the same inputs for different versions, it will return an sstable with the original version for both calls. As a result, `ms` sstables were not tested. Fix this bug by adding the sstable version (and the schema for good measure) to the cache key. An additional bug, hidden by the first, was that we corrupted the sstable by overwriting its Index.db component. But `ms` sstables don't have an Index.db component, they have a Partitions.db component. Adjust the corrupting code to take that into account. With these two fixes, test_scylla_sstable_validate_mismatching_partition_large fails on `ms` sstables. Disable it for that version. Since it was previously practically untested, we're not losing any coverage. Fixing this test unblocks further work on making pytest take charge of running the tests. pytest exposed this problem, likely by running it on different runners (and thus reducing the effectiveness of the cache). Fixes #27822. Closes scylladb/scylladb#27825	2025-12-24 06:47:31 +02:00
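The cache-key bug can be sketched in a few lines. Everything here is hypothetical (the `make_cache` helper, the string standing in for an sstable, and the `me` version used as the first request); it only illustrates why the version has to be part of the key.

```python
def make_cache(key_fn):
    """Build a memoizing sstable factory whose cache key is computed by key_fn."""
    cache = {}

    def get(inputs, version):
        key = key_fn(inputs, version)
        if key not in cache:
            # Stand-in for the expensive step of generating an sstable.
            cache[key] = f"sstable({inputs}, {version})"
        return cache[key]

    return get


# Buggy key: the version is ignored, so the first version requested wins.
buggy = make_cache(lambda inputs, version: inputs)
# Fixed key: the version is part of the key.
fixed = make_cache(lambda inputs, version: (inputs, version))

buggy("corrupt-index", "me")
fixed("corrupt-index", "me")
print(buggy("corrupt-index", "ms"))
print(fixed("corrupt-index", "ms"))
```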
Nadav Har'El	54f3e69fdc	Fix for Statement has no effect This problem and its fix was suggested by copilot, I'm just writing the cover letter. test/nodetool/test_status.py has the silly statement tokens == "?" which has no effect. Looking around the code suggested to me (and also to Copilot, nice) that the correct intent was assert tokens == "?" and not, say, tokens = "?". Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com> Closes scylladb/scylladb#27659	2025-12-24 06:43:26 +02:00
Michał Hudobski	ce3320a3ff	auth: add system table permissions to VECTOR_SEARCH_INDEXING Due to the recent changes in the vector store service, the service needs to read two of the system tables to function correctly. This was not accounted for when the new permission was added. This patch fixes that by allowing these tables (group0_history and versions) to be read with the VECTOR_SEARCH_INDEXING permission. We also add a test that validates this behavior. Fixes: SCYLLADB-73 Closes scylladb/scylladb#27546	2025-12-23 15:53:07 +02:00
Pawel Pery	caa0cbe328	vector_search_validator: move high availability tests from vector-store.git Initially, tests for high availability were implemented in the vector-store.git repository. High availability is currently implemented in the scylladb.git repository, so that repository is the better place to store them. This commit copies these tests into scylladb.git. The commit copies validator-vector-store/src/high_availability.rs (test logic) and validator-tests/src/common.rs (utilities for tests) into the local crate validator-scylla. common.rs is copied so that reviewers can see the common test code; this code will most likely change frequently, so it would be hard to maintain one common version between two repositories. The commit also updates the README for vector_search_validator; it briefly describes the validator modules. The commit also updates the reference to the latest vector-store.git master. As a next step, in vector-store.git high_availability.rs will be removed and common.rs moved from validator-tests into validator-vector-store. References: VECTOR-394 Closes scylladb/scylladb#27499	2025-12-23 15:53:07 +02:00
Aleksandra Martyniuk	bbe64e0e2a	test: rename duplicate tests There are two tests with the name test_repair_options_hosts_tablets in test/nodetool/test_cluster_repair.py and two named test_repair_keyspace in test/nodetool/test_repair.py. Due to that, one of each pair is ignored. Rename the tests so that they are unique. Fixes: https://github.com/scylladb/scylladb/issues/27701. Closes scylladb/scylladb#27720	2025-12-23 15:53:06 +02:00
Calle Wilund	d5f72cd5fc	test::pylib::encryption_provider: Push up setting system_key_directory to all providers Fixes #27694 Unless set by config, the location will default to /etc/scylla, which is not a good place to write things for tests. Push the config properly and the directory (but _not_ creation) to all provider basetype. Closes scylladb/scylladb#27696	2025-12-23 15:53:06 +02:00
Dawid Mędrek	afde5f668a	test: Implement describing Boost tests in JSON format The Boost.Test framework offers a way to describe tests written in it by running them with the option `--list_content`. It can be parametrized by either HRF (Human Readable Format) or DOT (the Graphviz graph format) [1]. Thanks to that, we can learn the test tree structure and collect additional information about the tests (e.g. labels [2]). We currently employ that feature of the framework to collect and run Boost tests in Scylla. Unfortunately, both formats have their shortcomings: * HRF: the format is simple to parse, but it doesn't contain all relevant information, e.g. labels. * DOT: the format is designed for creating graphical visualizations, and it's relatively difficult to parse. To amend those problems, we implement a custom extension of the feature. It produces output in the JSON format and contains more than the most basic information about the tests; at the same time, it's easy to browse and parse. To obtain that output, the user needs to call a Boost.Test executable with the option `--list_json_content`. For example: ``` $ ./path/to/test/exec -- --list_json_content ``` Note that the argument should be prepended with a `--` to indicate that it targets user code, not Boost.Test itself. --- The structure of the new format looks like this (top-level downwards): - File name - Test suite(s) & free test cases - Test cases wrapped in test suites Note that it's different from the output the default Boost.Test formats produce: they organize information within test suites, which can potentially span multiple files [3]. The JSON format makes test files the primary object of interest and test suites from different files are always considered distinct. 
Example of the output (after applying some formatting): ``` $ ./build/dev/test/boost/canonical_mutation_test -- --list_json_content [{"file":"test/boost/canonical_mutation_test.cc", "content": { "suites": [], "tests": [ {"name": "test_conversion_back_and_forth", "labels": ""}, {"name": "test_reading_with_different_schemas", "labels": ""} ] }}] ``` --- The implementation may be seen as a bit ugly, and it's effectively a hack. It's based on registering a global fixture [4] and linking that code to every Boost.Test executable. Unfortunately, there doesn't seem to be any better way. That would require more extensive changes in the test files (e.g. enforcing going through the same entry point in all of them). This implementation is a compromise between simplicity and effectiveness. The changes are kept minimal, while the developers writing new tests shouldn't need to remember to do anything special. Everything should work out of the box (at least as long as there's no non-trivial linking involved). Fixes scylladb/scylladb#25415 --- References: [1] https://www.boost.org/doc/libs/1_89_0/libs/test/doc/html/boost_test/utf_reference/rt_param_reference/list_content.html [2] https://www.boost.org/doc/libs/1_89_0/libs/test/doc/html/boost_test/tests_organization/tests_grouping.html [3] https://www.boost.org/doc/libs/1_89_0/libs/test/doc/html/boost_test/tests_organization/test_tree/test_suite.html [4] https://www.boost.org/doc/libs/1_89_0/libs/test/doc/html/boost_test/tests_organization/fixtures/global.html Closes scylladb/scylladb#27527	2025-12-23 15:53:06 +02:00
Pavel Emelyanov	132aa753da	sstables/storage_manager: Fix configured endpoints observer On start the manager creates observer for object_storage_endpoints config parameter. The goal is to refresh the maintained set of endpoint parameters and client upon config change. The observer is created on shard 0 only, and when kicked it calls manager.invoke-on-all to update manager on all shards. However, there's a race here. The thing is that db::config values are implicitly "sharded" under the hood with the help of plain array. When any code tries to read a value from db::config::something, the reading code secretly gets the value from this inner array indexed by the current shard id. Next, when the config is updated, it first assigns new values to [0] element of the hidden array, then calls broadcast_to_all_shards() helper that copies the values from zeroth slot to all the others. But the manager's observer is triggered when the new value is assigned on zero index, and if the invoke-on-all lambda (mentioned above) happens to be faster than broadcast_to_all_shards, the non-zero shards will read old values from db::config's inner array. The fix is to instantiate observer on all shards and update only local shard, whenever this update is triggered. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-12-23 12:43:11 +03:00
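The race described above can be modelled as a deterministic toy in Python. This is only a sketch under stated assumptions: `ShardedConfigValue`, `run_buggy` and `run_fixed` are hypothetical names, the model assumes observers fire when their own shard's slot is assigned, and it always picks the unlucky ordering in which the shard-0 observer runs before the broadcast. It is not the db::config or storage_manager implementation.

```python
class ShardedConfigValue:
    """Toy model of a config value kept in one slot per shard."""

    def __init__(self, shard_count, value):
        self.slots = [value] * shard_count
        self.observers = [[] for _ in range(shard_count)]

    def observe(self, shard, callback):
        self.observers[shard].append(callback)

    def _assign(self, shard, value):
        self.slots[shard] = value
        # Assumption: observers fire as soon as their shard's slot changes.
        for callback in self.observers[shard]:
            callback(shard)

    def update(self, value):
        self._assign(0, value)                    # slot 0 first ...
        for shard in range(1, len(self.slots)):   # ... then the broadcast
            self._assign(shard, value)


def run_buggy(shard_count=3):
    config = ShardedConfigValue(shard_count, "old")
    seen = [None] * shard_count

    def refresh_all_shards(_shard):
        # Models invoke-on-all winning the race against the broadcast.
        for shard in range(shard_count):
            seen[shard] = config.slots[shard]

    config.observe(0, refresh_all_shards)
    config.update("new")
    return seen


def run_fixed(shard_count=3):
    config = ShardedConfigValue(shard_count, "old")
    seen = [None] * shard_count

    def refresh_local_shard(shard):
        # Each shard refreshes only itself, after its own slot was updated.
        seen[shard] = config.slots[shard]

    for shard in range(shard_count):
        config.observe(shard, refresh_local_shard)
    config.update("new")
    return seen


print(run_buggy())
print(run_fixed())
```

In the buggy variant the non-zero shards keep the old value; with one observer per shard every shard reads its own, already updated slot.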
Pavel Emelyanov	f902eb1632	test/object_store: Add test to validate how endpoint config update works There's a test for backup with non-existing endpoint/bucket/snapshot. It checks that API call to backup sstables properly fails in that case. This patch adds similar test for "unconfigured endpoint", but it adds the endpoint configuration on-the-fly and expects that backup will proceed after config update. Currently the test fails, as config update only affect the config itself, the storage_manager, that's in charge of maintaining endpoint clients, is not really updated. Next patch will fix it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-12-23 12:41:38 +03:00
Botond Dénes	58b5d43538	Merge 'test: multi LWT and counters test during tablets resize and migration' from Yauheni Khatsianevich This PR extends BaseLWTTester with optional counter-table configuration and verification, enabling randomized LWT tests over tablets with counters. And introduces new LWT with counters test during tablets resize and migration - Workload: N workers perform CAS updates - Update counter table each time CAS was successful - Enable balancing and increase min_tablet_count to force split, and lower min_tablet_count to merge. - Run tablets migrations loop - Stop workload and verify data consistency Refs: https://github.com/scylladb/qa-tasks/issues/1918 Refs: https://github.com/scylladb/qa-tasks/issues/1988 Refs https://github.com/scylladb/scylladb/issues/18068 Closes scylladb/scylladb#27170 * github.com:scylladb/scylladb: test: new LWT with counters test during tablets migration/resize - Workload: N workers perform CAS updates - Update counter table each time CAS was successful - Enable balancing and increase min_tablet_count to force split, and lower min_tablet_count to merge. - Run tablets migrations loop - Stop workload and verify data consistency test/lwt: add counter-table support to BaseLWTTester	2025-12-23 07:29:35 +02:00
Botond Dénes	bfdd4f7776	Merge 'Synchronize incremental repair and tablet split' from Raphael Raph Carvalho Split prepare can run concurrently with repair. Consider this: 1) split prepare starts 2) incremental repair starts 3) split prepare finishes 4) incremental repair produces unsplit sstable 5) split is not happening on sstable produced by repair 5.1) that sstable is not marked as repaired yet 5.2) might belong to repairing set (has compaction disabled) 6) split executes 7) repairing or repaired set has unsplit sstable If split was acked to coordinator (meaning prepare phase finished), repair must make sure that all sstables produced by it are split. It's not happening today with incremental repair because it disables split on sstables belonging to repairing group. And there's a window where sstables produced by repair belong to that group. To solve the problem, we want the invariant where all sealed sstables will be split. To achieve this, streaming consumers are patched to produce unsealed sstable, and the new variant add_new_sstable_and_update_cache() will take care of splitting the sstable while it's unsealed. If no split is needed, the new sstable will be sealed and attached. This solution was also needed to interact nicely with out of space prevention too. If disk usage is critical, split must not happen on restart, and the invariant aforementioned allows for it, since any unsplit sstable left unsealed will be discarded on restart. The streaming consumer will fail if disk usage is critical too. The reason interposer consumer doesn't fully solve the problem is because incremental repair can start before split, and the sstable being produced when split decision was emitted must be split before attached. So we need a solution which covers both scenarios. Fixes #26041. Fixes #27414. 
Should be backported to 2025.4 that contains incremental repair Closes scylladb/scylladb#26528 * github.com:scylladb/scylladb: test: Add reproducer for split vs intra-node migration race test: Verify split failure on behalf of repair during critical disk utilization test: boost: Add failure_when_adding_new_sstable_test test: Add reproducer for split vs incremental repair race condition compaction: Fail split of new sstable if manager is disabled replica: Don't split in do_add_sstable_and_update_cache() streaming: Leave sstables unsealed until attached to the table replica: Wire add_new_sstables_and_update_cache() into intra-node streaming replica: Wire add_new_sstable_and_update_cache() into file streaming consumer replica: Wire add_new_sstable_and_update_cache() into streaming consumer replica: Document old add_sstable_and_update_cache() variants replica: Introduce add_new_sstables_and_update_cache() replica: Introduce add_new_sstable_and_update_cache() replica: Account for sstables being added before ACKing split replica: Remove repair read lock from maybe_split_new_sstable() compaction: Preserve state of input sstable in maybe_split_new_sstable() Rename maybe_split_sstable() to maybe_split_new_sstable() sstables: Allow storage::snapshot() to leave destination sstable unsealed sstables: Add option to leave sstable unsealed in the stream sink test: Verify unsealed sstable can be compacted sstables: Allow unsealed sstable to be loaded sstables: Restore sstable_writer_config::leave_unsealed	2025-12-23 07:28:56 +02:00
Botond Dénes	bf9640457e	Merge 'test: add crash detection during tests' from Cezar Moise After tests end, an extra check is performed, looking into node logs for crashes, aborts and similar issues. The test directory is also scanned for coredumps. If any of the above are found, the test will fail with an error. The following checks are made: - Any log line matching `Assertion.failed` or containing `AddressSanitizer` is marked as a critical error - Lines matching `Aborting on shard` will only be marked as a critical error if the patterns in `manager.ignore_cores_log_patterns` are not found in that log - If any critical error is found, the log is also scanned for backtraces - Any backtraces found are decoded and saved - If the test is marked with `@pytest.mark.check_nodes_for_errors`, the logs are checked for any `ERROR` lines - Any pattern in `manager.ignore_log_patterns` and `manager.ignore_cores_log_patterns` will cause above check to ignore that line - The `expected_error` value that many methods, like `manager.decommission_node`, have will be automatically appended to `manager.ignore_log_patterns` refs: https://github.com/scylladb/qa-tasks/issues/1804 --- [Examples](https://jenkins.scylladb.com/job/scylla-staging/job/cezar/job/byo_build_tests_dtest/46/testReport/): Following examples are run on a separate branch where changes have been made to enable these failures. 
`test_unfinished_writes_during_shutdown` - Errors are found in logs and are not ignored ``` failed on teardown with "Failed: Server 2096: found 1 error(s) (log: scylla-2096.log) ERROR 2025-12-15 14:20:06,563 [shard 0: gms] raft_topology - raft_topology_cmd barrier_and_drain failed with: std::runtime_error (raft topology: command::barrier_and_drain, the version has changed, version 11, current_version 12, the topology change coordinator had probably migrated to another node) Server 2101: found 4 error(s) (log: scylla-2101.log) ERROR 2025-12-15 14:20:04,674 [shard 0:strm] repair - repair[c434c0c0-68da-472c-ba3e-ed80960ce0d5]: Repair 1 out of 4 ranges, keyspace=system_distributed, table=view_build_status, range=(minimum token,maximum token), peers=[27c027a6-603d-49d0-8766-1b085d8c7d29, b549cb36-fae8-490b-a19e-86d42e7aa07a, f7049967-81ff-4296-9be7-9d6a4d33a29e], live_peers=[b549cb36-fae8-490b-a19e-86d42e7aa07a, f7049967-81ff-4296-9be7-9d6a4d33a29e], status=failed: mandatory neighbor=27c027a6-603d-49d0-8766-1b085d8c7d29 is not alive ERROR 2025-12-15 14:20:04,674 [shard 1:strm] repair - repair[c434c0c0-68da-472c-ba3e-ed80960ce0d5]: Repair 1 out of 4 ranges, keyspace=system_distributed, table=view_build_status, range=(minimum token,maximum token), peers=[27c027a6-603d-49d0-8766-1b085d8c7d29, b549cb36-fae8-490b-a19e-86d42e7aa07a, f7049967-81ff-4296-9be7-9d6a4d33a29e], live_peers=[b549cb36-fae8-490b-a19e-86d42e7aa07a, f7049967-81ff-4296-9be7-9d6a4d33a29e], status=failed: mandatory neighbor=27c027a6-603d-49d0-8766-1b085d8c7d29 is not alive ERROR 2025-12-15 14:20:04,675 [shard 0: gms] raft_topology - raft_topology_cmd stream_ranges failed with: std::runtime_error (["shard 0: std::runtime_error (repair[c434c0c0-68da-472c-ba3e-ed80960ce0d5]: 1 out of 4 ranges failed, keyspace=system_distributed, tables=[\"view_build_status\", \"cdc_generation_timestamps\", \"service_levels\", \"cdc_streams_descriptions_v2\"], repair_reason=bootstrap, 
nodes_down_during_repair={27c027a6-603d-49d0-8766-1b085d8c7d29}, aborted_by_user=false, failed_because=std::runtime_error (Repair mandatory neighbor=27c027a6-603d-49d0-8766-1b085d8c7d29 is not alive, keyspace=system_distributed, mandatory_neighbors=[27c027a6-603d-49d0-8766-1b085d8c7d29, b549cb36-fae8-490b-a19e-86d42e7aa07a, f7049967-81ff-4296-9be7-9d6a4d33a29e]))", "shard 1: std::runtime_error (repair[c434c0c0-68da-472c-ba3e-ed80960ce0d5]: 1 out of 4 ranges failed, keyspace=system_distributed, tables=[\"view_build_status\", \"cdc_generation_timestamps\", \"service_levels\", \"cdc_streams_descriptions_v2\"], repair_reason=bootstrap, nodes_down_during_repair={27c027a6-603d-49d0-8766-1b085d8c7d29}, aborted_by_user=false, failed_because=std::runtime_error (Repair mandatory neighbor=27c027a6-603d-49d0-8766-1b085d8c7d29 is not alive, keyspace=system_distributed, mandatory_neighbors=[27c027a6-603d-49d0-8766-1b085d8c7d29, b549cb36-fae8-490b-a19e-86d42e7aa07a, f7049967-81ff-4296-9be7-9d6a4d33a29e]))"]) ERROR 2025-12-15 14:20:06,812 [shard 0:main] init - Startup failed: std::runtime_error (Bootstrap failed. See earlier errors (Rolled back: Failed stream ranges: std::runtime_error (failed status returned from 9dd942aa-acec-4105-9719-9bda403e8e94))) Server 2094: found 1 error(s) (log: scylla-2094.log) ERROR 2025-12-15 14:20:04,675 [shard 0: gms] raft_topology - send_raft_topology_cmd(stream_ranges) failed with exception (node state is bootstrapping): std::runtime_error (failed status returned from 9dd942aa-acec-4105-9719-9bda403e8e94)" ``` `test_kill_coordinator_during_op` - aborts caused by injection - `ignore_cores_log_patterns` is not set - while there are errors in logs and `ignore_log_patterns` is not set, they are ignored automatically due to the `expected_error` parameter, such as in `await manager.decommission_node(server_id=other_nodes[-1].server_id, expected_error="Decommission failed. 
See earlier errors")` ``` failed on teardown with "Failed: Server 1105: found 1 critical error(s), 1 backtrace(s) (log: scylla-1105.log) Aborting on shard 0, in scheduling group gossip. 1 backtrace(s) saved in scylla-1105-backtraces.txt Server 1106: found 1 critical error(s), 1 backtrace(s) (log: scylla-1106.log) Aborting on shard 0, in scheduling group gossip. 1 backtrace(s) saved in scylla-1106-backtraces.txt Server 1113: found 1 critical error(s), 1 backtrace(s) (log: scylla-1113.log) Aborting on shard 0, in scheduling group gossip. 1 backtrace(s) saved in scylla-1113-backtraces.txt Server 1148: found 1 critical error(s), 1 backtrace(s) (log: scylla-1148.log) Aborting on shard 0, in scheduling group gossip. 1 backtrace(s) saved in scylla-1148-backtraces.txt" ``` Decoded backtrace can be found in [failed_test_logs](https://jenkins.scylladb.com/job/scylla-staging/job/cezar/job/byo_build_tests_dtest/46/artifact/testlog/x86_64/dev/failed_test/test_kill_coordinator_during_op.dev.1) Closes scylladb/scylladb#26177 github.com:scylladb/scylladb: test: add logging to crash_coordinator_before_stream injection test: add crash detection during tests test.py: add pid to ServerInfo	2025-12-23 07:27:58 +02:00
Pavel Emelyanov	cd2568ad00	test: Merge and parametrize test_backup_to_non_existent_something tests There are three tests in the cluster/object_store suite that check how backup fails when one of its parameters does not exist. All three largely duplicate each other, so it makes sense to merge them into one larger parametrized test. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#27695	2025-12-23 07:02:18 +02:00
Avi Kivity	7586c5ccbd	Merge 'system.clients: add `client_options` map column' from Vladislav Zolotarov This pull request introduces a new caching mechanism for client options in the Alternator and transport layers, refactors how client metadata is stored and accessed, and extends the `system.clients` virtual table to surface richer client information. The changes improve efficiency by deduplicating commonly used strings (like driver names/versions and client options), and ensure that client data is handled in a way that's safe for cross-shard access. Additionally, the test suite and virtual table schema are updated to reflect the new client options data. Caching and client metadata refactoring: * The largest and most repeatable items in the connection state before this PR were a `driver_name` and a `driver_version` which were stored as an `sstring` object which means that the corresponding memory consumption was 16 bytes per each such value at least (the smallest size of the `seastar`'s `sstring` object) per-connection. In reality the driver name is usually longer than 15 characters, e.g. "ScyllaDB Python Driver" is 23 characters and this is not the longest driver name there is. In such cases the actual memory usage of a corresponding `sstring` object jumps to 8 + 4 + 1 + (string length, 23 in our example) + 1. So, for "ScyllaDB Python Driver" it would be 37 bytes (in reality it would be a bit more due to natural alignment of other allocations since the `contents` size is not well aligned (13 bytes), but let's ignore this for now). * These bytes add up quickly as there are more connections and, sometimes we are talking about millions of connections per-shard. * Using a smart pointer (`lw_shared_ptr`) referencing a corresponding cached value will effectively reduce the per-connection memory usage to be 8 bytes (a size of a pointer on 64-bit CPU platform) for each such value. While storing a corresponding `sstring` value only once. 
* This will would reduce the "variable" (per-connection) memory usage by at least 50%. And in case of "ScyllaDB Python Driver" driver version - by 78%! * And all this for a price of a single `loading_shared_values` object per-shard (implements a hash table) and a minor overhead for each value stored in it. * Introduced a new cache type (`client_options_cache_type`) for deduplicating and sharing client option strings, and refactored `client_data`, `client_state`, and related classes to use `foreign_ptr<std::unique_ptr<client_data>>` and cached entry types for fields like driver name, driver version, and client options. (`client_data.hh`, `service/client_state.hh`, `alternator/server.hh`, `alternator/controller.hh`, `transport/controller.hh`, `transport/protocol_server.hh`) [[1]](diffhunk://#diff-664a3b19e905481bdf8eb3843fc4d34691067bb97ab11cfd6e652e74aac51d9fR33-R36) [[2]](diffhunk://#diff-664a3b19e905481bdf8eb3843fc4d34691067bb97ab11cfd6e652e74aac51d9fL40-R56) [[3]](diffhunk://#diff-daadce1a2de3667511e59558f3a8f077b5ee30a14bcc6a99d588db90d0fcd2bdL105-R107) [[4]](diffhunk://#diff-daadce1a2de3667511e59558f3a8f077b5ee30a14bcc6a99d588db90d0fcd2bdL154-R182) [[5]](diffhunk://#diff-5fce246edf5abffb2351bd02e2eb1e9850880f7a00607ccaa90c3eee7ef57c6bL91-R92) [[6]](diffhunk://#diff-5fce246edf5abffb2351bd02e2eb1e9850880f7a00607ccaa90c3eee7ef57c6bL110-R111) [[7]](diffhunk://#diff-31730ba8e7374f784a88dc27c1512291cf73b7f24e08768f7466a3c8cfcc7a1aL96-R96) [[8]](diffhunk://#diff-19a97c0247cc08155ee49b277e43859ca32d6ef8cbff0ed7368ec5fa19e0a11eL172-R172) [[9]](diffhunk://#diff-eea7e2db5d799a25e717a72ac8ce5842bd4adb72b694d38d8f47166d9cd926faL356-R356) [[10]](diffhunk://#diff-d0b4ec3a144bbc5dc993866cf0b940850a457ff6156064f7e2b4b10ad0a95fefL80-R80) [[11]](diffhunk://#diff-4293b94c444d9bd5ecd17ce7eda8c00685d35ecf6e07f844efc91a91bbe85be1L46-R48) * Updated the methods for setting and getting driver name, driver version, and client options in `client_state` to be asynchronous and use the new 
cache. (`service/client_state.hh`, `service/client_state.cc`) [[1]](diffhunk://#diff-daadce1a2de3667511e59558f3a8f077b5ee30a14bcc6a99d588db90d0fcd2bdL154-R182) [[2]](diffhunk://#diff-99634aae22e2573f38b4e2f050ed2ac4f8173ff27f0ae8b3609d1f0cc1aeb775R347-R362) Virtual table and API enhancements: * Extended the `system.clients` virtual table schema and implementation to include a new `client_options` column (a map of option key/value pairs), and updated the table population logic to use the new cached types and foreign pointers. (`db/virtual_tables.cc`) [[1]](diffhunk://#diff-05f7bff3edb39fb8759c90b445e860189f2f30e04717ed58bae42716082af3d1R752) [[2]](diffhunk://#diff-05f7bff3edb39fb8759c90b445e860189f2f30e04717ed58bae42716082af3d1L769-R770) [[3]](diffhunk://#diff-05f7bff3edb39fb8759c90b445e860189f2f30e04717ed58bae42716082af3d1L809-R816) [[4]](diffhunk://#diff-05f7bff3edb39fb8759c90b445e860189f2f30e04717ed58bae42716082af3d1L828-R879) API and interface changes: * Changed the signatures of `get_client_data` methods throughout the codebase to return vectors of `foreign_ptr<std::unique_ptr<client_data>>` instead of plain `client_data` objects, to ensure safe cross-shard access. 
(`alternator/controller.hh`, `alternator/controller.cc`, `alternator/server.hh`, `alternator/server.cc`, `transport/controller.hh`, `transport/protocol_server.hh`) [[1]](diffhunk://#diff-31730ba8e7374f784a88dc27c1512291cf73b7f24e08768f7466a3c8cfcc7a1aL96-R96) [[2]](diffhunk://#diff-19a97c0247cc08155ee49b277e43859ca32d6ef8cbff0ed7368ec5fa19e0a11eL172-R172) [[3]](diffhunk://#diff-5fce246edf5abffb2351bd02e2eb1e9850880f7a00607ccaa90c3eee7ef57c6bL110-R111) [[4]](diffhunk://#diff-a7e2cda866c03a75afcf3b087de1c1dcd2e7aa996214db67f9a11ed6451e596dL988-R995) [[5]](diffhunk://#diff-eea7e2db5d799a25e717a72ac8ce5842bd4adb72b694d38d8f47166d9cd926faL356-R356) [[6]](diffhunk://#diff-d0b4ec3a144bbc5dc993866cf0b940850a457ff6156064f7e2b4b10ad0a95fefL80-R80) [[7]](diffhunk://#diff-4293b94c444d9bd5ecd17ce7eda8c00685d35ecf6e07f844efc91a91bbe85be1L46-R48) Testing and validation: * Updated the Python test for the `system.clients` table to verify the new `client_options` column and its contents, ensuring that driver name and version are present in the options map. (`test/cqlpy/test_virtual_tables.py`) [[1]](diffhunk://#diff-6dd8bd4a6a82cd642252a29dc70726f89a46ceefb991c3e63fc67e283f323f03R79) [[2]](diffhunk://#diff-6dd8bd4a6a82cd642252a29dc70726f89a46ceefb991c3e63fc67e283f323f03R88-R90) Closes scylladb/scylladb#25746 * github.com:scylladb/scylladb: transport/server: declare a new "CLIENT_OPTIONS" option as supported service/client_state and alternator/server: use cached values for driver_name and driver_version fields system.clients: add a client_options column controller: update get_client_data to use foreign_ptr for client_data	2025-12-22 20:02:40 +02:00
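The per-connection memory arithmetic in the commit message above can be checked with a small sketch. It only reproduces the figures quoted in the message (16 bytes for the smallest `sstring`, `8 + 4 + 1 + length + 1` bytes for a longer one, 8 bytes for a pointer, and the quoted driver name length of 23); the function names are hypothetical and the real `sstring` layout is not modeled.

```python
# Figures are taken from the commit message; this is not a model of seastar's sstring.
INLINE_SSTRING_BYTES = 16  # smallest sstring object, per the message
POINTER_BYTES = 8          # one lw_shared_ptr on a 64-bit platform


def external_sstring_bytes(length: int) -> int:
    """Size the message quotes for a string too long to be stored inline."""
    return 8 + 4 + 1 + length + 1


def saving_percent(before: int, after: int) -> float:
    """Relative per-connection saving when `before` bytes shrink to `after`."""
    return 100.0 * (before - after) / before


driver_name_length = 23  # length quoted in the message for "ScyllaDB Python Driver"
before = external_sstring_bytes(driver_name_length)

print(before)                                                       # 37
print(round(saving_percent(INLINE_SSTRING_BYTES, POINTER_BYTES)))   # 50
print(round(saving_percent(before, POINTER_BYTES)))                 # 78
```

The 50% figure is the lower bound (a 16-byte inline string replaced by an 8-byte pointer), and the 78% figure is the 37-byte case from the message.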
Emil Maskovsky	d60b908a8e	test/raft: improve reporting in the randomized_nemesis_test digest functions The Boost ASSERTs in the digest functions of the randomized_nemesis_test were not working well inside the state machine digest functions, leading to unhelpful boost::execution_exception errors that terminated the apply fiber, and didn't provide any helpful information. They are replaced by explicit checks with on_fatal_internal_error calls that provide more context about the failure. Also added validation of the digest value after appending or removing an element, which makes it possible to determine which operation produced the wrong value. This effectively reverts the changes done in https://github.com/scylladb/scylladb/pull/19282, but adds improved error reporting. Refs: scylladb/scylladb#27307 Refs: scylladb/scylladb#17030 Closes scylladb/scylladb#27791	2025-12-22 20:02:40 +02:00
Andrei Chekun	6ffdada0ea	test.py: modify JUnit report for easier rerun on CI This allows adding a custom XML attribute to the JUnit report. In this case, the attribute holds the path to the test function, which can be passed to the pytest command to rerun it. For parametrized tests, the path points to the function without the parameter. Closes scylladb/scylladb#27707	2025-12-22 20:02:40 +02:00
Botond Dénes	af5e73def9	Merge 'test/cqlpy: remove unused variables' from Nadav Har'El These patches fix a bunch of variables defined in test/cqlpy tests, but not used. Besides wasting a few bytes on disk, these unused variables can add confusion for readers who see them and might think they have some use which they are missing. All these unused variables were found by Copilot's "code quality" scanner, but I considered each of them, and fixed them manually. Closes scylladb/scylladb#27667 * github.com:scylladb/scylladb: test/cqlpy: remove unused variables test/cqlpy: use unique partition in test	2025-12-22 20:02:39 +02:00
Avi Kivity	8e462d06be	build: apply sccache to rust builds too sccache works for rust as well as for C++; use it for rust builds as well.	2025-12-22 15:36:15 +02:00
Yaniv Michael Kaul	c1da552fa4	test/pylib/scylla_cluster.py:get_scylla_2025_1_executable() - retry curl download of 2025.1 For some reason, the download might fail. Retry 10 times, and fail with an error code instead of 404 or whatnot. Benign, I hope - no need to backport. Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com> Closes scylladb/scylladb#27746	2025-12-22 14:45:06 +02:00
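The retry behavior described in the commit above can be sketched generically. This is not the code from `get_scylla_2025_1_executable()`, which invokes curl; `retry` and its parameters are hypothetical names, and the choice of `OSError` as the transient failure type is an assumption made for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(action: Callable[[], T], attempts: int = 10, delay_seconds: float = 0.0) -> T:
    """Call `action` until it succeeds, giving up after `attempts` tries."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except OSError as exc:  # assumption: transient failures surface as OSError
            last_error = exc
            if attempt < attempts:
                time.sleep(delay_seconds)
    # Fail loudly with one clear error instead of passing a bad result along.
    raise RuntimeError(f"download failed after {attempts} attempts") from last_error
```

Raising a single explicit error after the last attempt mirrors the intent stated in the message: fail with an error code rather than continue with an unusable download.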
Karol Nowacki	addac8b3f7	vector_search: test: Fix flaky DNS resolution test The `vector_store_client_test_dns_resolving_repeated` test had race conditions causing it to be flaky. Two main issues were identified: 1. Race between initial refresh and manual trigger: The test assumes a specific resolution sequence, but timing variations between the initial DNS refresh (on client creation) and the first manual trigger (in the test loop) can cause unexpected delayed scheduling. 2. Extra triggers from resolve_hostname fiber: During the client refresh phase, the background DNS fiber clears the client list. If resolve_hostname executes in the window after clearing but before the update completes, pending triggers are processed, incrementing the resolution count unexpectedly. At count 6, the mock resolver returns a valid address (count % 3 == 0), causing the test to fail. The fix relaxes test assertions to verify retry behavior and client clearing on DNS address loss, rather than enforcing exact resolution counts. Fixes: #27074 Closes scylladb/scylladb#27685	2025-12-21 20:02:16 +02:00
Vlad Zolotarov	85adf6bdb1	system.clients: add a client_options column This new column is going to contain all OPTIONS sent in the STARTUP frame of the corresponding CQL session. The new column has a `frozen<map<text, text>>` type, and we are also optimizing the amount of required memory for storing corresponding keys and values by caching them on each shard level. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2025-12-20 12:26:15 -05:00
Botond Dénes	df2ac0f257	Merge 'test: dtest: schema_management_test.py: migrate from dtest' from Dario Mirovic This PR migrates schema management tests from dtest to this repository. One reason is that there is an ongoing effort to migrate tests from dtest to here. Test `TestLargePartitionAlterSchema.test_large_partition_with_drop_column` failed with timeout error once. The main suspect so far are infra related problems, like infra congestion. The [logs from the test execution](https://jenkins.scylladb.com/job/scylla-master/job/dtest-release/1062/testReport/junit/schema_management_test/TestLargePartitionAlterSchema/Run_Dtest_Parallel_Cloud_Machines___Dtest___full_split001___test_large_partition_with_drop_column/), linked in the issue [test_large_partition_with_drop_column failed on TimeoutError #26932](https://github.com/scylladb/scylladb/issues/26932) show the following: - `populate` works as intended - it starts, then during populate/insert drop column happened, then an exception is raised and intentionally ignored in the test, so no `Finish populate DB` for 50 x 1490 records - expected - drop column works as intended - interrupts `populate` and proceeds to flush - flush probably works as intended - logs are consistent with what we expect and what I got in local test runs - `read` is the only thing that visibly got stuck, all the way until timeout happened, 5 minutes after the start Migrating the test to this repo will also give us test start and end times on CI machines, in the sql report database. It has start and end timestamp for each test executed. We will be able to see how long does it usually take when the test is successful. It can not be seen from the logs, because logs are not kept for successful tests. Another thing this PR does is adding a log message at the end of `database::flush_all_tables`. This will let us know if a thread got stuck inside or finished successfully. This addresses the probably part of the flush analysis step described above. 
If the issue reoccurs, we will have more information. The test `test_large_partition_with_add_column` has not been executing for ~5 years. It was never migrated to pytest. The name was left as `large_partition_with_add_column_test`, and was skipped. Now it is enabled and updated. Both `test_large_partition_with_add_column` and `test_large_partition_with_drop_column` are improved. Small performance improvements: - Regex compilation extracted from the stress function to the module level, to avoid recompilation. - Do not materialize list in `stress_object` for loop. Use a generator expression. The tests in `TestLargePartitionAlterSchema` are `test_large_partition_with_add_column` and `test_large_partition_with_drop_column`. These tests need to replicate the following conditions that led to a bug before a fix from around 5 years ago. The scenario in which the problem could have happened has to involve: - a large partition with many rows, large enough for preemption (every 0.5ms) to happen during the scan of the partition. - appending writes to the partition (not overwrites) - scans of the partition - schema alter of that table. The issue is exposed only by adding or dropping a column, such that the added/dropped column lands in the middle (in alphabetical order) of the old column set. 
The way the test is set up is: - fixed number of writes per populate call - fixed number of reads This has the following implications: - if the machine executing the test is fast, all the writes are done before the 10 seconds sleep - there are too many reads - most of them get executed after the test logic is done This patch solves these issues in the following way: - populate lazily generates write data, and stops when instructed by `stop_populating` event - read, which is done sequentially, stops when instructed by `stop_reading` event - number of max operations is increased significantly, but the operations are stopped 1 second after node flush; this makes sure there are enough operations during the test, but also that the test does not take unnecessary time Test execution time has been reduced severalfold. On dev machine the time the tests take is reduced from 110 seconds to 34 seconds. scylla-dtest PR that removes migrated tests: [schema_management_test.py: remove tests already ported to scylladb repo #6427](https://github.com/scylladb/scylla-dtest/pull/6427) Fixes #26932 This is a migration of existing tests to this repository. No need for backport. Closes scylladb/scylladb#27106 * github.com:scylladb/scylladb: test: dtest: schema_management_test.py: speed up `TestLargePartitionAlterSchema` tests test: dtest: schema_management_test.py: fix large partition add column test test: dtest: schema_management_test.py: add `TestSchemaManagement.prepare` test: dtest: schema_management_test.py: test enhancements test: dtest: schema_management_test.py: make the tests work test: dtest: migrate setup and tools from dtest test: dtest: copy unmodified schema_management_test.py replica: database: flush_all_tables log on completion	2025-12-19 12:30:00 +02:00
Botond Dénes	093e97a539	Merge 'test: increase num of requests in driver_service_level tests' from Andrzej Jackowski `_verify_tasks_processed_metrics()` is used to check that the correct service level is used to process requests. It takes two service levels as arguments and executes numerous requests. After that, the number of tasks processed by one of the service levels is expected to rise by at least the number of executed requests. In contrast, the second service level is expected to process fewer tasks than the number of requests. Unfortunately, background noise may cause some tasks to be executed on the service level that is not supposed to process requests. This patch increases the number of executed requests to eliminate the chance of noise causing test failures. Additionally, this commit extends logging to make future investigation easier. Fixes: https://github.com/scylladb/scylladb/issues/27715 No backport, fix for test on master. Closes scylladb/scylladb#27735 * github.com:scylladb/scylladb: test: remove unused `get_processed_tasks_for_group` test: increase num of requests in driver_service_level tests	2025-12-19 10:54:14 +02:00
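The check performed by `_verify_tasks_processed_metrics()`, as described in the commit above, can be sketched as a pure function. The function name, argument names, and dictionary shape below are hypothetical; only the two conditions come from the commit message.

```python
def service_level_matches(before, after, expected_sl, other_sl, num_requests):
    """Return True when `expected_sl` processed the requests and `other_sl` did not.

    `before` and `after` map a service level name to its processed-task counter,
    sampled before and after `num_requests` requests were executed.
    """
    expected_delta = after[expected_sl] - before[expected_sl]
    other_delta = after[other_sl] - before[other_sl]
    # The expected level must account for at least every request, while the
    # other level may only show background noise, which must stay below that.
    return expected_delta >= num_requests and other_delta < num_requests


before = {"sl_driver": 100, "sl_default": 40}
after = {"sl_driver": 1150, "sl_default": 55}
print(service_level_matches(before, after, "sl_driver", "sl_default", 1000))  # True
```

With a larger `num_requests`, a fixed amount of background noise on the other service level is less likely to reach the threshold, which is why the patch increases the request count.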
Emil Maskovsky	fa6e5d0754	test/random_failures: fix handling of banned notification After `39cec4a` node join may fail with either "init - Startup failed" notification or occasionally because it was banned, depending on timing. The change updates the test to handle both cases. Fixes: scylladb/scylladb#27697 No backport: This failure is only present in master. Closes scylladb/scylladb#27768	2025-12-19 09:55:31 +02:00
Emil Maskovsky	08518b2c12	test/raft: fix `test_joining_old_node_fails` flakiness When a node without the required feature attempts to join a Raft-based cluster with the feature enabled, there is a race between the join rejection response ("Feature check failed") and the ban notification ("received notification of being banned"). Depending on timing, either message may appear in the joining node's log. This starts to happen after `39cec4a` (which introduced informing the nodes about being banned). Updated the test to accept both error messages as valid, making the test robust against this race condition, which is more likely in debug mode or under slow execution. Fixes: scylladb/scylladb#27603 No backport: This failure is only present in master. Closes scylladb/scylladb#27760	2025-12-19 09:44:09 +02:00
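Accepting either outcome of the race described above amounts to matching the joining node's log against both messages. The sketch below uses the two message strings quoted in the commit; the helper name and the use of a compiled regular expression are assumptions, not the test's actual code.

```python
import re

# Either message may appear in the joining node's log, depending on timing.
EXPECTED_JOIN_FAILURES = re.compile(
    r"Feature check failed|received notification of being banned"
)


def join_failed_as_expected(log_text: str) -> bool:
    """Return True if the log contains one of the accepted failure messages."""
    return EXPECTED_JOIN_FAILURES.search(log_text) is not None
```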
Emil Maskovsky	2a75b1374e	test/raft: fix race condition in failure_detector_test The test had a sporadic failure due to a broken promise exception. The issue was in `test_pinger::ping()` which captured the promise by move into the subscription lambda, causing the promise to be destroyed when the lambda was destroyed during coroutine unwinding. Simplify `test_pinger::ping()` by replacing manual abort_source/promise logic with `seastar::sleep_abortable()`. This removes the risk of promise lifetime/race issues and makes the code simpler and more robust. Fixes: scylladb/scylladb#27136 Backport to active branches: This fixes a CI test issue, so it is beneficial to backport the fix. As this is a test-only fix, it is a low risk change. Closes scylladb/scylladb#27737	2025-12-19 09:42:19 +02:00
Łukasz Paszkowski	2cb9bb8f3a	test_user_writes_rejection: Disable speculative retries This test starts a 3-node cluster and creates a large blob file so that one node reaches critical disk utilization, triggering write rejections on that node. The test then writes data with CL=QUORUM and validates that the data: - did not reach the critically utilized node - did reach the remaining two nodes By default, tables use speculative retries to determine when coordinators may query additional replicas. Since the validation uses CL=ONE, it is possible that an additional request is sent to satisfy the consistency level. As a result: - the first check may fail if the additional request is sent to a node that already contains data, making it appear as if data reached the critically utilized node - the second check may fail if the additional request is sent to the critically utilized node, making it appear as if data did not reach the healthy node The patch fixes the flakiness by disabling the speculative retries. Fixes https://github.com/scylladb/scylladb/issues/27212 Closes scylladb/scylladb#27488	2025-12-19 09:39:09 +02:00
Dario Mirovic	f1d63d014c	test: dtest: schema_management_test.py: speed up `TestLargePartitionAlterSchema` tests The tests in `TestLargePartitionAlterSchema` are `test_large_partition_with_add_column` and `test_large_partition_with_drop_column`. These tests need to replicate the following conditions that led to a bug before a fix from around 5 years ago. The scenario in which the problem could have happened has to involve: - a large partition with many rows, large enough for preemption (every 0.5ms) to happen during the scan of the partition. - appending writes to the partition (not overwrites) - scans of the partition - schema alter of that table. The issue is exposed only by adding or dropping a column, such that the added/dropped column lands in the middle (in alphabetical order) of the old column set. The way the test is set up is: - fixed number of writes per populate call - fixed number of reads This has the following implications: - if the machine executing the test is fast, all the writes are done before the 10 seconds sleep - there are too many reads - most of them get executed after the test logic is done This patch solves these issues in the following way: - populate lazily generates write data, and stops when instructed by `stop_populating` event - read, which is done sequentially, stops when instructed by `stop_reading` event - number of max operations is increased significantly, but the operations are stopped 1 second after node flush; this makes sure there are enough operations during the test, but also that the test does not take unnecessary time Test execution time has been reduced severalfold. On dev machine the time the tests take is reduced from 110 seconds to 34 seconds. 
The patch also introduces a few small improvements: - `cs_run` renamed to `run_stress` for clarity - Stopped checking if cluster is `ScyllaCluster`, since it is the only one we use - `case_map` removed from `test_alter_table_in_parallel_to_read_and_write`, used `mixed` param directly - Added explanation comment on why we do `data[i].append(None)` - Replaced `alter_table` inner function with its body, for simplicity - Removed unnecessary `ck_rows` variable in `populate` - Removed unnecessary `isinstance(self.cluster, ScyllaCluster)` - Adjusted `ThreadPoolExecutor` size in several places where 5 workers are not needed - Replaced functional programming style expressions for `new_versions` and `columns_list` with comprehension/generator statement python style code, improving readability Refs #26932 fix	2025-12-18 17:07:27 +01:00
Cezar Moise	0ef8ca4c57	test: add logging to crash_coordinator_before_stream injection In order to have the test ignore crashes caused by the injection, the injection needs to log its occurrence.	2025-12-18 16:28:13 +02:00
Cezar Moise	95d0782f89	test: add crash detection during tests After tests end, an extra check is performed, looking into node logs. By default, it only searches for critical errors and scans for coredumps. If the test has the fixture `check_nodes_for_errors`, it will search for all errors. Both checks can be ignored by setting `ignore_cores_log_patterns` and `ignore_log_patterns`. If any of the above are found, the test will fail with an error.	2025-12-18 16:28:13 +02:00
Dario Mirovic	f831ca5ab5	test: dtest: schema_management_test.py: fix large partition add column test `large_partition_with_add_column_test` and `large_partition_with_drop_column_test` were added on August 17th, 2020 in scylladb/scylla-dtest#1569. Only `large_partition_with_drop_column_test` was migrated to pytest, and renamed to `test_large_partition_with_drop_column` on March 31st, 2021 in scylladb/scylla-dtest#2051. Since then, `large_partition_with_add_column_test` has not been running. This patch fixes it - the test is updated and renamed and the testing environment now properly picks it up. Refs #26932	2025-12-18 12:54:43 +01:00
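The event-driven populate loop described in the commit above can be sketched as follows. Only the `stop_populating` event name comes from the commit message; `generate_rows`, `populate`, `write_row`, and the row format are hypothetical stand-ins for the test's real helpers.

```python
import threading


def generate_rows(stop_populating: threading.Event, max_rows: int):
    """Lazily yield (clustering_key, value) pairs until told to stop."""
    for ck in range(max_rows):
        if stop_populating.is_set():
            return
        yield ck, f"value-{ck}"


def populate(write_row, stop_populating: threading.Event, max_rows: int = 1_000_000) -> int:
    """Write rows one at a time and return how many were written."""
    written = 0
    for ck, value in generate_rows(stop_populating, max_rows):
        write_row(ck, value)
        written += 1
    return written
```

Because rows are produced on demand, a large `max_rows` costs nothing up front, and setting the event (for example one second after the node flush) ends the loop regardless of how fast the machine is.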
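The log check described in the commit above can be illustrated with a simplified filter. This is a sketch under assumptions, not the fixture's implementation: the function name is hypothetical, error lines are assumed to start with `ERROR` (as in the log excerpts quoted elsewhere in this log), and `ignore_log_patterns` is treated as a list of regular expressions.

```python
import re
from typing import Iterable, List


def find_unexpected_errors(
    log_lines: Iterable[str], ignore_log_patterns: Iterable[str] = ()
) -> List[str]:
    """Return ERROR lines that are not matched by any ignore pattern."""
    ignored = [re.compile(pattern) for pattern in ignore_log_patterns]
    return [
        line
        for line in log_lines
        if line.startswith("ERROR") and not any(p.search(line) for p in ignored)
    ]


log = [
    "INFO  2025-12-15 14:20:04,000 [shard 0:main] init - starting",
    "ERROR 2025-12-15 14:20:04,674 [shard 0:strm] repair - status=failed",
    "ERROR 2025-12-15 14:20:06,812 [shard 0:main] init - Startup failed",
]
print(len(find_unexpected_errors(log)))                       # 2
print(len(find_unexpected_errors(log, ["Startup failed"])))   # 1
```

A test would fail when the returned list is non-empty; the ignore patterns let a test declare which errors it expects to provoke.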
Dario Mirovic	1fe0509a9b	test: dtest: schema_management_test.py: add `TestSchemaManagement.prepare` Extract repeated cluster initialization code in `TestSchemaManagement` into a separate `prepare` method. It holds all the common code for cluster preparation, with just the necessary parameters. Refs #26932	2025-12-18 12:54:43 +01:00
Dario Mirovic	e7d76fd8f3	test: dtest: schema_management_test.py: test enhancements Extract regex compilation from the stress functions to the module level, to avoid unnecessary regex compilation repetition. Add descriptions to the stress functions. Do not materialize list in `stress_object` for loop. Use a generator expression. Make `_set_stress_val` an object method. Refs #26932	2025-12-18 12:54:43 +01:00


10385 Commits