scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-25 19:10:42 +00:00

Author	SHA1	Message	Date
Pavel Emelyanov	89d8ae5cb6	Merge 'http: prepare http clients retry machinery refactoring' from Ernest Zaslavsky Today the S3 client has a well established and well tested (hopefully) http request retry strategy; in the rest of the clients it looks like we are trying to achieve the same by writing the same code over and over again, and of course missing corner cases that have already been addressed in the S3 client. This PR aims to extract the code that could assist other clients to detect the retryability of an error originating from the http client, reuse the built-in seastar http client retryability, and minimize the boilerplate of http client exception handling. No backport needed since it is only refactoring of the existing code Closes scylladb/scylladb#28250 * github.com:scylladb/scylladb: exceptions: add helper to build a chain of error handlers http: extract error classification code aws_error: extract `retryable` from aws_error	2026-02-18 10:06:37 +03:00
Pavel Emelyanov	2f10fd93be	Merge 's3_client: Fix s3 part size and number of parts calculation' from Ernest Zaslavsky - Correct `calc_part_size` function since it could return more than 10k parts - Add tests - Add more checks in `calc_part_size` to comply with S3 limits Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-640 Must be ported back to 2025.3/4 and 2026.1 since we may encounter this bug in production clusters Closes scylladb/scylladb#28592 * github.com:scylladb/scylladb: s3_client: add more constrains to the calc_part_size s3_client: add tests for calc_part_size s3_client: correct multipart part-size logic to respect 10k limit	2026-02-18 10:04:53 +03:00
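The part-size constraint described in the entry above can be sketched as follows. This is a minimal illustration, not the actual `calc_part_size` from the repository: the function name `sketch_calc_part_size` and the 50 MiB preferred part size are assumptions made for the example, while the 10,000-part, 5 MiB and 5 GiB bounds are the published S3 multipart limits.

```python
MiB = 1024 * 1024
MIN_PART_SIZE = 5 * MiB           # S3 minimum for every part except the last
MAX_PART_SIZE = 5 * 1024 * MiB    # S3 maximum part size (5 GiB)
MAX_PARTS = 10_000                # S3 maximum number of parts per upload


def sketch_calc_part_size(total_size: int,
                          preferred_part_size: int = 50 * MiB) -> tuple[int, int]:
    """Return (part_size, part_count) for an upload of total_size bytes.

    Hypothetical helper: preferred_part_size is an assumed default.
    """
    if total_size <= 0:
        raise ValueError("total_size must be positive")
    # Smallest part size that still fits the object into MAX_PARTS parts.
    min_needed = -(-total_size // MAX_PARTS)  # ceiling division
    part_size = max(preferred_part_size, MIN_PART_SIZE, min_needed)
    if part_size > MAX_PART_SIZE:
        raise ValueError("object too large for a single multipart upload")
    part_count = -(-total_size // part_size)
    return part_size, part_count
```

Because `part_size` is never smaller than `ceil(total_size / MAX_PARTS)`, the resulting part count cannot exceed 10,000 in this sketch.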
Szymon Malewski	668d6fe019	vector: Improve similarity functions performance Improves performance of deserialization of vector data for calculating similarity functions. Instead of deserializing vector data into a std::vector<data_value>, we deserialize directly into a std::vector<float> and then pass it to similarity functions as a std::span<const float>. This avoids overhead of data_value allocations and conversions. Example QPS of `SELECT id, similarity_cosine({vector<float, 1536>}, {vector<float, 1536>}) ...`: client concurrency 1: before: ~135 QPS, after: ~1005 QPS client concurrency 20: before: ~280 QPS, after: ~2097 QPS Measured using https://github.com/zilliztech/VectorDBBench (modified to call above query without ANN search) Fixes https://scylladb.atlassian.net/browse/SCYLLADB-471 Closes scylladb/scylladb#28615	2026-02-18 00:33:34 +02:00
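The idea of decoding straight into a flat float array can be illustrated in Python. This is only a sketch under assumptions: it assumes the serialized vector is `dim` contiguous big-endian 4-byte floats, and it computes the plain cosine of the angle, which may differ from what the CQL `similarity_cosine` function returns. The function names are hypothetical.

```python
import math
import struct


def decode_float_vector(raw: bytes, dim: int) -> tuple[float, ...]:
    """Decode dim big-endian float32 values in one call (assumed wire layout)."""
    if len(raw) != 4 * dim:
        raise ValueError("unexpected buffer length for vector<float, dim>")
    return struct.unpack(f">{dim}f", raw)


def cosine(a: tuple[float, ...], b: tuple[float, ...]) -> float:
    """Plain cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine is undefined for zero vectors")
    return dot / (norm_a * norm_b)
```

The point mirrored from the commit is that no per-element wrapper objects are created between the raw bytes and the arithmetic.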
Calle Wilund	ab4e4a8ac7	commitlog: Always abort replenish queue on loop exit Fixes #28678 If replenish loop exits the sleep condition, with an empty queue, when "_shutdown" is already set, a waiter might get stuck, unsignalled waiting for segments, even though we are exiting. Simply move queue abort to always be done on loop exit. Closes scylladb/scylladb#28679	2026-02-17 23:46:47 +02:00
Dani Tweig	5dc06647e9	.github: add workflow to auto-close issues from ScyllaDB associates Added .github/workflows/close_issue_for_scylla_employee.yml workflow file to automatically close issues opened by ScyllaDB associates We want to allow external users to open issues in the scylladb repo, but for ScyllaDB associates, we would like them to open issues in Jira instead. If a ScyllaDB associate opens an issue in the scylladb.git repo by mistake, the issue will be closed automatically with an appropriate comment explaining that the issue should be opened in Jira. This is a new github action, and does not require any code backport. Fixes: PM-64 Closes scylladb/scylladb#28212	2026-02-17 17:18:32 +02:00
Dani Tweig	bb8a2c3a26	.github/workflow/:Add milestone sync to Jira based on GitHub Action What changed Added new workflow file .github/workflows/call_jira_sync_pr_milestone.yml Why (Requirements Summary) Adds a GitHub Action that will be triggered when a milestone is set or removed from a PR When milestone is added (milestoned event), calls main_jira_sync_pr_milestone_set.yml from github-automation.git, which will add the version to the 'Fix Versions' field in the relevant linked Jira issue When milestone is removed (demilestoned event), calls main_jira_sync_pr_milestone_removed.yml from github-automation.git, which will remove the version from the 'Fix Versions' field in the relevant linked Jira issue Testing was performed in staging.git and the STAG Jira project. Fixes:PM-177 Closes scylladb/scylladb#28575	2026-02-17 16:41:03 +02:00
Botond Dénes	2e087882fa	Merge 'GCS object storage. Fix incompatibilty issues with "real" GCS' from Calle Wilund Fixes #28398 Fixes #28399 When used as path elements in google storage paths, the object names need to be URL encoded. Due to a.) tests not really using prefixes including non-url valid chars (i.e. / etc) and b.) the mock server used for most testing not enforcing this particular aspect, this was missed. Modified unit tests to use prefixing for all names, so when running real GS, any errors like this will show. "Real" GCS also behaves a bit different when listing with pager, compared to mock; The former will not give a pager token for last page, only penultimate. Adds handling for this. Needs backport to the releases that have (though might not really use) the feature, as it is technically possible to use google storage for backup and whatnot there, and it should work as expected. Closes scylladb/scylladb#28400 * github.com:scylladb/scylladb: utils/gcp/object_storage: URL-encode object names in URL:s utils::gcp::object_storage: Fix list object pager end condition detection	2026-02-17 16:40:02 +02:00
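Both fixes in the entry above can be sketched briefly. This is an illustration, not the code from `utils/gcp/object_storage`: `object_url`, `list_all` and the `fetch_page` callback are hypothetical names, while the `/storage/v1/b/<bucket>/o/<object>` path shape and the `nextPageToken` field come from the GCS JSON API.

```python
from urllib.parse import quote


def object_url(bucket: str, name: str) -> str:
    """Build an object URL; safe='' makes quote() encode '/' as %2F too."""
    return ("https://storage.googleapis.com/storage/v1/b/"
            f"{quote(bucket, safe='')}/o/{quote(name, safe='')}")


def list_all(fetch_page):
    """Collect items from every page.

    fetch_page(token) is a hypothetical callback returning a dict with
    'items' and, on every page except the last, 'nextPageToken'.
    """
    items, token = [], None
    while True:
        page = fetch_page(token)
        items.extend(page.get("items", []))
        token = page.get("nextPageToken")
        if not token:  # the last page carries no token
            return items
```

For example, `object_url("b", "backup/ks/file 1.db")` ends in `/o/backup%2Fks%2Ffile%201.db`.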
Andrei Chekun	1b5789cd63	test.py: refactor manager fixture The current manager flow has a flaw. It triggers pytest.fail when it finds errors on teardown, regardless of whether the test has already failed. This creates an additional record in the JUnit report with the same name, and Jenkins is not able to show the logs correctly. To avoid this, this PR changes the logic slightly. Now the manager checks whether the test failed, to avoid two failures for the same test in the report. If the test passed, the manager checks the cluster status and fails if something is wrong with it. There is no need to check the cluster status if the test failed. If the test passed and the cluster status is OK, but there are unexpected errors in the logs, the test fails as well. This check gathers all information about the errors and potential stacktraces, and only fails the test if it has not failed yet, to avoid a double entry in the report. Closes scylladb/scylladb#28633	2026-02-17 14:35:18 +01:00
Dawid Mędrek	5b5222d72f	Merge 'test: make test_different_group0_ids work with the Raft-based topology' from Patryk Jędrzejczak The test was marked with xfail in #28383, as it needed to be updated to work with the Raft-based topology. We are doing that in this patch. With the Raft-based topology, there is no reason to check that nodes with different group0 IDs cannot merge their topology/token_metadata. That is clearly impossible, as doing any topology change requires being in the same group0. So, the original regression test doesn't make sense. We can still test that nodes with different group0 IDs cannot gossip with each other, so we keep the test. It's very fast anyway. No backport, test update. Closes scylladb/scylladb#28571 * github.com:scylladb/scylladb: test: run test_different_group0_ids in all modes test: make test_different_group0_ids work with the Raft-based topology	2026-02-17 13:56:41 +01:00
Dawid Mędrek	1b80f6982b	Merge 'test: make the load balancer simulator tablet size aware' from Ferenc Szili Currently, the load balancing simulator computes node, shard and tablet load based on tablet count. This patch changes the load balancing simulator to be tablet size aware. It generates random tablet sizes with a normal distribution, and a mean value of `default_target_tablet_size`, and reports the computed load for nodes and tables based on tablet size sum, instead of tablet count. This is the last patch in the size based load balancing series. It is the last PR in the Size Based Load Balancing series: - First part for tablet size collection via load_stats: scylladb/scylladb#26035 - Second part reconcile load_stats: scylladb/scylladb#26152 - The third part for load_sketch changes: scylladb/scylladb#26153 - The fourth part which performs tablet load balancing based on tablet size: scylladb/scylladb#26254 - The fifth part changes the load balancing simulator: scylladb/scylladb#26438 This is a new feature and backport is not needed. Closes scylladb/scylladb#26438 * github.com:scylladb/scylladb: test, simulator: compute load based on tablet size instead of count test, simulator: generate tablet sizes and update load_stats test, simulator: postpone creation of load_stats_ptr	2026-02-17 13:29:37 +01:00
Avi Kivity	ffde2414e8	cql3: grammar: remove special case for vector similarity functions in selectors In `b03d520aff` ("cql3: introduce similarity functions syntax") we added vector similarity functions to the grammar. The grammar had to be modified because we wanted to support literals as vector similarity function arguments, and the general function syntax in selectors did not allow that. In `cc03f5c89d` ("cql3: support literals and bind variables in selectors") we extended the selector function call grammar to allow literals as function arguments. Here, we remove the special case for vector similarity functions as the general case in function calls covers all the possibilities the special case does. As a side effect, the vector similarity function names are no longer reserved. Note: the grammar change fixes an inconsistency with how the vector similarity functions were evaluated: typically, when a USE statement is in effect, an unqualified function is first matched against functions in the keyspace, and only if there is no match is the system keyspace checked. But with the previous implementation vector similarity functions ignored the USE keyspace and always matched only the system keyspace. This small inconsistency doesn't matter in practice because user defined functions are still experimental, and no one would name a UDF to conflict with a system function, but it is still good to fix it. Closes scylladb/scylladb#28481	2026-02-17 12:40:21 +01:00
Ernest Zaslavsky	30699ed84b	api: report restore params report restore params once the API's call for restore is invoked Closes scylladb/scylladb#28431	2026-02-17 14:27:21 +03:00
Andrei Chekun	767789304e	test.py: improve C++ fail summary in pytest Currently, if the test fail, pytest will output only some basic information about the fail. With this change, it will output the last 300 lines of the boost/seastar test output. Also add capturing the output of the failed tests to JUnit report, so it will be present in the report on Jenkins. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-449 Closes scylladb/scylladb#28535	2026-02-17 14:25:28 +03:00
Pavel Emelyanov	6d4af84846	Merge 'test: increase open file limit for sstable tests' from Avi Kivity In `ebda2fd4db` ("test: cql_test_env: increase file descriptor limit"), we raised the open file limit for cql_test_env. Here, we raise it for sstables::test_env as well, to fix a couple of twcs resharding tests failing outside dbuild. These tests open 256 sstables, and with 2 files/sstable + resharding work it is understandable that they overflow the 1024 limit. No backport: this is a quality of life improvement for developers running outside dbuild, but they can use dbuild for branches. Closes scylladb/scylladb#28646 * github.com:scylladb/scylladb: test: sstables::test_env: adjust file open limit test: extract cql_test_env's adjust_rlimit() for reuse	2026-02-17 14:19:43 +03:00
Avi Kivity	41925083dc	test: minio: tune sync setting Disable O_DSYNC in minio to avoid unnecessary slowdown in S3 tests. Closes scylladb/scylladb#28579	2026-02-17 14:19:27 +03:00
Avi Kivity	f03491b589	Update seastar submodule * seastar f55dc7eb...d2953d2a (13): > io_tester: Revive IO bandwidth configuration > Merge 'io_tester: add vectorized I/O support' from Travis Downs doc: add vectorized I/O options to io-tester.md io_tester: add vectorized I/O support > Merge 'Remove global scheduling group ID bitmap' from Pavel Emelyanov reactor: Drop sched group IDs bitmap reactor: Allocate scheduling group on shard-0 first reactor: Detach init_scheduling_group_specific_data() reactor: Coroutinize create_scheduling_group() > set_iterator: increase compatibility with C++ ranges > test: fix race condition in test_connection_statistics > Add Claude Code project instructions > reactor: Unfriend pollable_fd via pollable_fd_state::make() > Merge 'rpc_tester: introduce rpc_streaming job based on streaming API' from Jakub Czyszczoń apps: rpc_tester: Add STREAM_UNIDIRECTIONAL job We introduce an unidirectional streaming to the rpc_streaming job. apps: rpc_tester: Add STREAM_BIDIRECTIONAL job This commit extends the rpc_tester with rpc_streaming job that uses rpc::sink<> and rpc::source<> to stream data between the client and the server. > treewide: remove remnants of SEASTAR_MODULE > test: Tune abort-accept test to use more readable async() > build: support sccache as a compiler cache (#3205) > posix-stack: Reuse parent class _reuseport from child > Merge 'reactor_backend: Fix another busy spin bug in the epoll backend' from Stephan Dollberg tests: Add unit test for epoll busy spin bug reactor_backend: Fix another busy spin bug in epoll Closes scylladb/scylladb#28513	2026-02-17 13:13:22 +02:00
Jakub Smolar	189b056605	scylla_gdb: use run_ctx to handle Scylla exe and remove pexpect Previous implementation of Scylla lifecycle brought flakiness to the test. This change leaves lifecycle management up to PythonTest.run_ctx, which implements more stability logic for setup/teardown. Replace pexpect-driven GDB interaction with GDB batch mode: - Avoids DeprecationWarning: "This process is multi-threaded, use of forkpty() may lead to deadlocks in the child.", which ultimately caused CI deadlocks. - Removes timeout-driven flakiness on slow systems - no interactive waits/timeouts. - Produces cleaner, more direct assertions around command execution and output. - Trade-off: batch mode adds ~10s per command per test, but with --dist=worksteal this is ~10% overall runtime increase across the suite. Closes scylladb/scylladb#28484	2026-02-17 11:36:20 +01:00
Łukasz Paszkowski	f45465b9f6	test_out_of_space_prevention.py: Lower the critical disk utilization threshold After PR https://github.com/scylladb/scylladb/pull/28396 reduced the test volumes to 20MiB to speed up test_out_of_space_prevention.py, keeping the original 0.8 critical disk utilization threshold can make the tests flaky: transient disk usage (e.g. commitlog segment churn) can push the node into ENOSPC during the run. These tests do not write much data, so reduce the critical disk utilization threshold to 0.5. With 20MiB volumes this leaves ~10MiB of headroom for temporary growth during the test. Fixes: https://github.com/scylladb/scylladb/issues/28463 Closes scylladb/scylladb#28593	2026-02-16 15:10:18 +02:00
Andrei Chekun	e26cf0b2d6	test/cluster: fix two flaky tests test_maintenance_socket is flaky with the new way of running. It looks like the driver tries to reconnect with an old maintenance socket from the previous driver and fails. This PR adds a whitelist for connections, which stabilizes the test. test_no_removed_node_event_on_ip_change was flaky on CI, while the issue never reproduced locally. The assumption is that under load we have a race condition and check the logs before the message has arrived. A small retry loop is added to avoid this situation. Closes scylladb/scylladb#28635	2026-02-16 14:50:54 +02:00
Patryk Jędrzejczak	0693091aff	test: test_restart_leaving_replica_during_cleanup: reconnect driver after restart The test can currently fail like this: ``` > await cql.run_async(f"ALTER TABLE {ks}.test WITH tablets = {{'min_tablet_count': 1}}") E cassandra.cluster.NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 127.158.27.9:9042 datacenter1>: <Error from server: code=0000 [Server error] message="Failed to apply group 0 change due to concurrent modification">}) ``` The following happens: - node A is restarted and becomes the group0 leader, - the driver sends the ALTER TABLE request to node B, - the request hits group 0 concurrent modification error 10 times and fails because node A performs tablet migrations at the same time. What is unexpected is that even though the driver session uses the default retry policy, the driver doesn't retry the request on node A. The request is guaranteed to succeed on node A because it's the only node adding group0 entries. The driver doesn't retry the request on node A because of a missing `wait_for_cql_and_get_hosts` call. We add it in this commit. We also reconnect the driver just in case to prevent hitting scylladb/python-driver#295. Moreover, we can revert the workaround from `4c9efc08d8`, as the fix from this commit also prevents DROP KEYSPACE failures. The commit has been tested in byo with `_concurrent_ddl_retries{0}` to verify that node A really can't hit group 0 concurrent modification error and always receives the ALTER TABLE request from the driver. All 300 runs in each build mode passed. Fixes #25938 Closes scylladb/scylladb#28632	2026-02-16 12:56:18 +01:00
Marcin Maliszkiewicz	6a4aef28ae	Merge 'test: explicitly set compression algorithm in test_autoretrain_dict' from Andrzej Jackowski When `test_autoretrain_dict` was originally written, the default `sstable_compression_user_table_options` was `LZ4Compressor`. The test assumed (correctly) that initially the compression doesn't use a trained dictionary, and later in the test scenario, it changed the algorithm to one with a dictionary. However, the default `sstable_compression_user_table_options` is now `LZ4WithDictsCompressor`, so the old assumption is no longer correct. As a result, the assertion that data is initially not compressed well may or may not fail depending on dictionary training timing. To fix this, this commit explicitly sets `ZstdCompressor` as the initial `sstable_compression_user_table_options`, ensuring that the assumption that initial compression is without a dictionary is always met. Note: `ZstdCompressor` differs from the former default `LZ4Compressor`. However, it's a better choice — the test aims to show the benefit of using a dictionary, not the benefit of Zstd over LZ4 (and the test uses ZstdWithDictsCompressor as the algorithm with the dictionary). Fixes: https://github.com/scylladb/scylladb/issues/28204 Backport: 2025.4, as test already failed there (and also backport to 2026.1 to make everything consistent). Closes scylladb/scylladb#28625 * github.com:scylladb/scylladb: test: explicitly set compression algorithm in test_autoretrain_dict test: remove unneeded semicolons from python test	2026-02-16 11:38:24 +01:00
Ernest Zaslavsky	034c6fbd87	s3_client: limit multipart upload concurrency Prevent launching hundreds or thousands of fibers during multipart uploads by capping concurrent part submissions to 16. Closes scylladb/scylladb#28554	2026-02-16 13:32:58 +03:00
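The capping technique from the entry above can be illustrated with a fixed pool of workers. The actual client is written in C++ on Seastar; this Python `asyncio` sketch only mirrors the idea, and `upload_parts` and `upload_one` are hypothetical names. Only the limit of 16 comes from the commit message.

```python
import asyncio

MAX_CONCURRENT_PARTS = 16  # cap taken from the commit message


async def upload_parts(parts, upload_one, limit=MAX_CONCURRENT_PARTS):
    """Upload all parts using at most `limit` concurrent workers.

    upload_one(part_number, data) is a hypothetical coroutine supplied
    by the caller; results are returned in part-number order.
    """
    results = {}
    pending = iter(enumerate(parts, start=1))

    async def worker():
        # Workers share one iterator, so each part is taken exactly once
        # and no more than `limit` uploads are in flight at any time.
        for number, data in pending:
            results[number] = await upload_one(number, data)

    await asyncio.gather(*(worker() for _ in range(limit)))
    return [results[n] for n in sorted(results)]
```

A worker pool is used here instead of one task per part guarded by a semaphore, so the number of live tasks stays bounded as well, not just the number of active uploads.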
Botond Dénes	9f57d6285b	Merge 'test: improve error reporting and retries in get_scylla_2025_1_executable' from Marcin Maliszkiewicz Harden get_scylla_2025_1_executable() by improving error reporting when subprocesses fail, increasing curl's retry count for more resilient downloads, and enabling --retry-all-errors to retry on all failures. Fixes https://github.com/scylladb/scylladb/issues/27745 Backport: no, it's not a bug fix Closes scylladb/scylladb#28628 * github.com:scylladb/scylladb: test: pylib: retry on all errors in get_scylla_2025_1_executable curl's call test: pylib: increase curl's number of retries when downloading scylla test: pylib: improve error reporting in get_scylla_2025_1_executable	2026-02-16 10:09:17 +02:00
Andrei Chekun	8c5c1096c2	test: ensure that the table used in cqlpy/test_tools has at least 3 pk One of the tests checks that the number of PKs is more than 2, but the method that creates the table can return a table with fewer keys. This leads to flakiness; to avoid it, this PR ensures that the table will have at least 3 PKs Closes scylladb/scylladb#28636	2026-02-16 09:50:58 +02:00
Anna Mikhlin	33cf97d688	.github/workflows: ignore quoted comments for trigger CI prevent CI from being triggered when trigger-ci command appears inside quoted (>) comment text Fixes: https://scylladb.atlassian.net/browse/RELENG-271 Closes scylladb/scylladb#28604	2026-02-16 09:33:16 +02:00
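The filtering rule described above can be sketched as a small predicate. The real change lives in a GitHub workflow; this Python version is only an illustration, and the function name and the exact matching rule (substring match on non-quoted lines) are assumptions.

```python
def has_trigger_command(comment_body: str, command: str = "trigger-ci") -> bool:
    """Return True if `command` appears outside Markdown-quoted ('>') lines."""
    for line in comment_body.splitlines():
        if line.lstrip().startswith(">"):
            continue  # quoted text, e.g. a reply citing an earlier comment
        if command in line:
            return True
    return False
```

With this rule, a reply that merely quotes an earlier `trigger-ci` request does not start a new CI run.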
Andrei Chekun	e144d5b0bb	test.py: fix JUnit double test case records Move the hook for overwriting the XML reporter to be the first, to avoid double records. Closes scylladb/scylladb#28627	2026-02-15 19:02:24 +02:00
Avi Kivity	a365e2deaa	test: sstables::test_env: adjust file open limit The twcs compaction tests open more than 1024 files (not so good), and will fail in a user session with the default soft limit (1024). Attempt to raise the limit so the tests pass. On a modern systemd installation the hard limit is >500,000, so this will work. There's no problem in dbuild since it raises the file limit globally.	2026-02-15 14:27:37 +02:00
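Raising the soft open-file limit toward the hard limit, as the entry above describes, might look like this in Python. The real helper is the C++ `adjust_rlimit()`; the function name and the `wanted` parameter below are hypothetical.

```python
import resource


def raise_nofile_soft_limit(wanted: int) -> int:
    """Try to raise the soft RLIMIT_NOFILE to `wanted`, bounded by the hard limit.

    Returns the soft limit in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    unlimited = resource.RLIM_INFINITY
    if soft == unlimited:
        return soft  # nothing to raise
    target = wanted if hard == unlimited else min(wanted, hard)
    if target > soft:
        # Unprivileged processes may raise the soft limit up to the hard limit.
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
        soft = target
    return soft
```

The soft limit is never lowered in this sketch, so calling it with a value below the current limit is a no-op.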
Avi Kivity	bab3afab88	test: extract cql_test_env's adjust_rlimit() for reuse The sstable-oriented sstable::test_env would also like to use it, so extract it into a neutral place.	2026-02-15 14:26:46 +02:00
Jenkins Promoter	69249671a7	Update pgo profiles - aarch64	2026-02-15 05:22:17 +02:00
Jenkins Promoter	27aaafb8aa	Update pgo profiles - x86_64	2026-02-15 04:26:36 +02:00
Piotr Dulikowski	9c1e310b0d	Merge 'vector_search: Fix flaky vector_store_client_https_rewrite_ca_cert' from Karol Nowacki Most likely, the root cause of the flaky test was that the TLS handshake hung for an extended period (60s). This caused the test case to fail because the ANN request duration exceeded the test case timeout. The PR introduces two changes: * Mitigation of the hanging TLS handshake: This issue likely occurred because the test performed certificate rewrites simultaneously with ANN requests that utilize those certificates. * Production code fix: This addresses a bug where the TLS handshake itself was not covered by the connection timeout. Since tls::connect does not perform the handshake immediately, the handshake only occurs during the first write operation, potentially bypassing connect timeout. Fixes: #28012 Backport to 2026.01 and 2025.04 is needed, as these branches are also affected and may experience CI flakiness due to this test. Closes scylladb/scylladb#28617 * github.com:scylladb/scylladb: vector_search: Fix missing timeout on TLS handshake vector_search: test: Fix flaky cert rewrite test	2026-02-13 19:03:50 +01:00
Patryk Jędrzejczak	aebc108b1b	test: run test_different_group0_ids in all modes CI currently fails in release and debug modes if the PR only changes a test run only in dev mode. There is no reason to wait for the CI fix, as there is no reason to run this test only in dev mode in the first place. The test is very fast.	2026-02-13 13:30:29 +01:00
Patryk Jędrzejczak	59746ea035	test: make test_different_group0_ids work with the Raft-based topology The test was marked with xfail in #28383, as it needed to be updated to work with the Raft-based topology. We are doing that in this patch. With the Raft-based topology, there is no reason to check that nodes with different group0 IDs cannot merge their topology/token_metadata. That is clearly impossible, as doing any topology change requires being in the same group0. So, the original regression test doesn't make sense. We can still test that nodes with different group0 IDs cannot gossip with each other, so we keep the test. It's very fast anyway.	2026-02-13 13:30:28 +01:00
Marcin Maliszkiewicz	1b0a68d1de	test: pylib: retry on all errors in get_scylla_2025_1_executable curl's call It's difficult to say if our download backend would always return transient error correctly so that the curl could retry. Instead it's more robust to always retry on error.	2026-02-12 16:18:52 +01:00
Marcin Maliszkiewicz	8ca834d4a4	test: pylib: increase curl's number of retries when downloading scylla By default curl does exponential backoff, and we want to keep that but there is time cap of 10 minutes, so with 40 retries we'd wait long time, instead we set the cap to 60 seconds. Total waiting time (excluding receiving request time): before - 17m after - 35m	2026-02-12 16:18:52 +01:00
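The waiting times quoted above can be reproduced with a short calculation. It relies on curl's documented behavior of starting at one second and doubling the delay on each retry. The 40 retries and the 60-second and 10-minute caps come from the commit message; the earlier retry count of 10 is an assumption, chosen because it matches the quoted 17 minutes.

```python
def total_backoff_seconds(retries: int, cap_seconds: int, first_delay: int = 1) -> int:
    """Sum of retry delays that double each time, each bounded by cap_seconds."""
    total, delay = 0, first_delay
    for _ in range(retries):
        total += min(delay, cap_seconds)
        delay *= 2
    return total


before = total_backoff_seconds(retries=10, cap_seconds=600)  # assumed 10 retries
after = total_backoff_seconds(retries=40, cap_seconds=60)
print(before // 60, after // 60)  # whole minutes
```

Under these assumptions the totals are 1023 s and 2103 s, about 17 and 35 minutes.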
Marcin Maliszkiewicz	70366168aa	test: pylib: improve error reporting in get_scylla_2025_1_executable Curl or other tools this function calls will now log error in the place they fail instead of doing plain assert.	2026-02-12 16:18:52 +01:00
Andrzej Jackowski	9ffa62a986	test: explicitly set compression algorithm in test_autoretrain_dict When `test_autoretrain_dict` was originally written, the default `sstable_compression_user_table_options` was `LZ4Compressor`. The test assumed (correctly) that initially the compression doesn't use a trained dictionary, and later in the test scenario, it changed the algorithm to one with a dictionary. However, the default `sstable_compression_user_table_options` is now `LZ4WithDictsCompressor`, so the old assumption is no longer correct. As a result, the assertion that data is initially not compressed well may or may not fail depending on dictionary training timing. To fix this, this commit explicitly sets `ZstdCompressor` as the initial `sstable_compression_user_table_options`, ensuring that the assumption that initial compression is without a dictionary is always met. Note: `ZstdCompressor` differs from the former default `LZ4Compressor`. However, it's a better choice — the test aims to show the benefit of using a dictionary, not the benefit of Zstd over LZ4 (and the test uses ZstdWithDictsCompressor as the algorithm with the dictionary). Fixes: scylladb/scylladb#28204	2026-02-12 14:58:39 +01:00
Andrzej Jackowski	e63cfc38b3	test: remove unneeded semicolons from python test	2026-02-12 14:49:17 +01:00
Ferenc Szili	d7cfaf3f84	test, simulator: compute load based on tablet size instead of count This patch changes the load balancing simulator so that it computes table load based on tablet sizes instead of tablet count. best_shard_overcommit measured minimal allowed overcommit in cases where the number of tablets can not be evenly distributed across all the available shards. This is still the case, but instead of computing it as an integer div_ceil() of the average shard load, it is now computed by allocating the tablet sizes using the largest-tablet-first method. From these, we can get the lowest overcommit for the given set of nodes, shards and tablet sizes.	2026-02-12 12:54:55 +01:00
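The largest-tablet-first allocation mentioned above can be sketched with a min-heap of shard loads. This is a simplified illustration rather than the simulator's code: it treats all shards as one flat pool and ignores nodes, and defining overcommit as the maximum shard load divided by the average shard load is an assumption.

```python
import heapq


def best_shard_overcommit(tablet_sizes: list[int], shard_count: int) -> float:
    """Greedy estimate: place the largest tablets first on the least-loaded shard."""
    total = sum(tablet_sizes)
    if shard_count <= 0 or total <= 0:
        raise ValueError("need at least one shard and a positive total size")
    loads = [0] * shard_count  # a list of zeros is already a valid min-heap
    for size in sorted(tablet_sizes, reverse=True):
        least_loaded = heapq.heappop(loads)
        heapq.heappush(loads, least_loaded + size)
    average = total / shard_count
    return max(loads) / average
```

For sizes `[4, 3, 3, 2, 2, 2]` on two shards the greedy placement yields loads of 8 and 8, so the overcommit is 1.0.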
Ferenc Szili	216443c050	test, simulator: generate tablet sizes and update load_stats This change adds a random tablet size generator. The tablet sizes are created in load_stats. Further changes to the load balance simulator: - apply_plan() updates the load_stats after a migration plan is issued by the load balancer, - adds the option to set a command line option which controls the tablet size deviation factor.	2026-02-12 12:54:55 +01:00
Ferenc Szili	e31870a02d	test, simulator: postpone creation of load_stats_ptr With size based load balancing, we will have to move the tablet size in load_stats after each internode migration issued by balance_tablets(). This will be done in a subsequent commit in apply_plan() which is called from rebalance_tablets(). Currently, rebalance_tablets() is passed a load_stats_ptr which is defined as: using load_stats_ptr = lw_shared_ptr<const load_stats>; Because this is a pointer to const, apply_plan() can't modify it. So, we pass a reference to load_stats to rebalance_tablets() and create a load_stats_ptr from it for each call to balance_tablets().	2026-02-12 12:54:55 +01:00
Aleksandra Martyniuk	f955a90309	test: fix test_remove_node_violating_rf_rack_with_rack_list test_remove_node_violating_rf_rack_with_rack_list creates a cluster with four nodes. One of the nodes is excluded, then another one is stopped, excluded, and removed. If the two stopped nodes were both voters, the majority is lost and the cluster loses its raft leader. As a result, the node cannot be removed and the operation times out. Add the 5th node to the cluster. This way the majority is always up. Fixes: https://github.com/scylladb/scylladb/issues/28596. Closes scylladb/scylladb#28610	2026-02-12 12:58:48 +02:00
Ferenc Szili	4ca40929ef	test: add read barrier to test_balance_empty_tablets The test creates a single node cluster, then creates 3 tables which remain empty. Then it adds another node with half the disk capacity of the first one, and then it waits for the balancer to migrate tablets to the newly added node by calling the quiesce topology API. The number of tablets on the smaller node should be exactly half the number of tablets on the larger node. After waiting for quiesce topology, we could have a situation where we query the number of tablets from the node which still hasn't processed the last tablet migrations and updated system.tablets. This patch adds a read barrier so that both nodes see the same tablets metadata before we query the number of tablets. Fixes: SCYLLADB-603 Closes scylladb/scylladb#28598	2026-02-12 11:16:34 +02:00
Karol Nowacki	079fe17e8b	vector_search: Fix missing timeout on TLS handshake Currently the TLS handshake in the vector search client does not have a timeout. This is because tls::connect does not perform handshake itself; the handshake is deferred until the first read/write operation is performed. This can lead to long hangs on ANN requests. This commit calls tls::check_session_is_resumed() after tls::connect to force the handshake to happen immediately and to run under with_timeout.	2026-02-12 10:08:37 +01:00
Karol Nowacki	aef5ff7491	vector_search: test: Fix flaky cert rewrite test The test is flaky most likely because when TLS certificate rewrite happens simultaneously with an ANN request, the handshake can hang for a long time (~60s). This leads to a timeout in the test case. This change introduces a checkpoint in the test so that it will wait for the certificate rewrite to happen before sending an ANN request, which should prevent the handshake from hanging and make the test more reliable. Fixes: #28012	2026-02-12 09:58:54 +01:00
Piotr Dulikowski	38c4a14a5b	Merge 'test: cluster: Fix test_sync_point' from Dawid Mędrek The test `test_sync_point` had a few shortcomings that made it flaky or simply wrong: 1. We were verifying that hints were written by checking the size of in-flight hints. However, that could potentially lead to problems in rare situations. For instance, if all of the hints failed to be written to disk, the size of in-flight hints would drop to zero, but creating a sync point would correspond to the empty state. In such a situation, we should fail immediately and indicate what the cause was. 2. A sync point corresponds to the hints that have already been written to disk. The number of those is tracked by the metric `written`. It's a much more reliable way to make sure that hints have been written to the commitlog. That ensures that the sync point we'll create will really correspond to those hints. 3. The auxiliary function `wait_for` used in the test works like this: it executes the passed callback and looks at the result. If it's `None`, it retries it. Otherwise, the callback is deemed to have finished its execution and no further retries will be attempted. Before this commit, we simply returned a bool, and so the code was wrong. We improve it. --- Note that this fixes scylladb/scylladb#28203, which was a manifestation of scylladb/scylladb#25879. We created a sync point that corresponded to the empty state, and so it immediately resolved, even when node 3 was still dead. As a bonus, we rewrite the auxiliary code responsible for fetching metrics and manipulating sync points. Now it's asynchronous and uses the existing standard mechanisms available to developers. Furthermore, we reduce the time needed for executing `test_sync_point` by 27 seconds. 
--- The total difference in time needed to execute the whole test file (on my local machine, in dev mode): Before: CPU utilization: 0.9% real 2m7.811s user 0m25.446s sys 0m16.733s After: CPU utilization: 1.1% real 1m40.288s user 0m25.218s sys 0m16.566s --- Refs scylladb/scylladb#25879 Fixes scylladb/scylladb#28203 Backport: This improves the stability of our CI, so let's backport it to all supported versions. Closes scylladb/scylladb#28602 * github.com:scylladb/scylladb: test: cluster: Reduce wait time in test_sync_point test: cluster: Fix test_sync_point test: cluster: Await sync points asynchronously test: cluster: Create sync points asynchronously test: cluster: Fetch hint metrics asynchronously	2026-02-12 09:34:09 +01:00
Dawid Mędrek	f83f911bae	test: cluster: Reduce wait time in test_sync_point If everything is OK, the sync point will not resolve with node 3 dead. As a result, the waiting will use all of the time we allocate for it, i.e. 30 seconds. That's a lot of time. There's no easy way to verify that the sync point will NOT resolve, but let's at least reduce the waiting to 3 seconds. If there's a bug, it should be enough to trigger it at some point, while reducing the average time needed for CI.	2026-02-10 17:05:02 +01:00
Dawid Mędrek	a256ba7de0	test: cluster: Fix test_sync_point The test had a few shortcomings that made it flaky or simply wrong: 1. We were verifying that hints were written by checking the size of in-flight hints. However, that could potentially lead to problems in rare situations. For instance, if all of the hints failed to be written to disk, the size of in-flight hints would drop to zero, but creating a sync point would correspond to the empty state. In such a situation, we should fail immediately and indicate what the cause was. 2. A sync point corresponds to the hints that have already been written to disk. The number of those is tracked by the metric `written`. It's a much more reliable way to make sure that hints have been written to the commitlog. That ensures that the sync point we'll create will really correspond to those hints. 3. The auxiliary function `wait_for` used in the test works like this: it executes the passed callback and looks at the result. If it's `None`, it retries it. Otherwise, the callback is deemed to have finished its execution and no further retries will be attempted. Before this commit, we simply returned a bool, and so the code was wrong. We improve it. Note that this fixes scylladb/scylladb#28203, which was a manifestation of scylladb/scylladb#25879. We created a sync point that corresponded to the empty state, and so it immediately resolved, even when node 3 was still dead. Refs scylladb/scylladb#25879 Fixes scylladb/scylladb#28203	2026-02-10 17:05:02 +01:00
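Point 3 above is easy to get wrong, so here is a minimal sketch of a helper with the described contract. It is not the helper from the test library; the signature is an assumption. Only the "retry while the callback returns `None`" rule comes from the commit message.

```python
import asyncio
import time


async def wait_for(pred, deadline: float, period: float = 0.1):
    """Call pred() until it returns something other than None.

    deadline is a time.monotonic() timestamp (assumed convention).
    """
    while True:
        result = await pred()
        if result is not None:
            return result  # note: False also ends the wait
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met before the deadline")
        await asyncio.sleep(period)
```

Under this contract, a predicate written as `return written > 0` stops the wait on the first call even when it evaluates to `False`; returning `True if written > 0 else None` keeps retrying as intended.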
Dawid Mędrek	c5239edf2a	test: cluster: Await sync points asynchronously There's a dedicated HTTP API for communicating with the cluster, so let's use it instead of yet another custom solution.	2026-02-10 17:05:02 +01:00
Dawid Mędrek	ac4af5f461	test: cluster: Create sync points asynchronously There's a dedicated HTTP API for communicating with the nodes, so let's use it instead of yet another custom solution.	2026-02-10 17:05:01 +01:00

51953 Commits