scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-12 19:02:12 +00:00

Author	SHA1	Message	Date
Piotr Dulikowski	bf369326d6	Merge 'vector_search: test: fix HTTPS client test flakiness' from Karol Nowacki The default 100ms timeout for client readiness in tests is too aggressive. In some test environments, this is not enough time for client creation, which involves address resolution and TLS certificate reading, leading to flaky tests. This commit increases the default client creation timeout to 10 seconds. This makes the tests more robust, especially in slower execution environments, and prevents similar flakiness in other test cases. Fixes: VECTOR-547, SCYLLADB-802, SCYLLADB-825, SCYLLADB-826 Backport to 2025.4 and 2026.1, as the same problem occurs on these branches and can potentially make the CI flaky there as well. Closes scylladb/scylladb#28846 * github.com:scylladb/scylladb: vector_search: test: include ANN error in assertion vector_search: test: fix HTTPS client test flakiness (cherry picked from commit `2fb981413a`) Closes scylladb/scylladb#28879	2026-03-04 18:09:04 +01:00
Botond Dénes	49ed97cec8	Merge '[Backport 2026.1] Fix regression in Alternator TTL with tablets and node going down' from Scylladb[bot] Recently we suffered a regression in how Alternator TTL behaves when a node goes down when tablets are used. Usually, expiration of data in a particular tablet is handled by this tablet's "primary replica". However, if that node is down, we want another node to perform these expirations until the primary replica goes back online. We created a function `tablet_map::get_secondary_replica()` to select that "other node". We don't care too much what the "secondary replica" means, but we do care that it's different from the primary replica - if it's the same, the expiration of that tablet will never be done. It turns out that recently, in commits `817fdad` and `d88036d`, the implementation of get_primary_replica() changed without a corresponding change to get_secondary_replica(). After those changes, the two functions are mismatched, and sometimes return the same node for both primary and secondary replica. Unfortunately, although we had a dtest for the handling of a dead node in Alternator TTL, it failed to reproduce this bug, so this regression was missed - nothing else besides Alternator TTL ever used the get_secondary_replica() function. So in this series, in addition to fixing the bug, we add two tests that reproduce this bug (fail before the fix, pass with the fix): 1. A unit test that checks that get_secondary_replica() always returns a different node from get_primary_replica() 2. A cluster test based on the original dtest, which does reproduce this bug in Alternator TTL where some of the data was never expired (but only failed in release build, for an unknown reason). Fixes SCYLLADB-777. 
- (cherry picked from commit `9ab3d5b946`) - (cherry picked from commit `0c7f499750`) - (cherry picked from commit `e463d528fe`) Parent PR: #28771 Closes scylladb/scylladb#28803 * github.com:scylladb/scylladb: test: add unit test for tablet_map::get_secondary_replica() test, alternator: add test for TTL expiration with a node down locator: fix get_secondary_replica() to match get_primary_replica()	2026-03-04 14:21:44 +02:00
Marcin Maliszkiewicz	81685b0d06	Merge 'db/batchlog_manager: re-add v1 support for mixed clusters' from Botond Dénes `3f7ee3ce5d` introduced system.batchlog_v2, with a schema designed to speed up batchlog replays and make post-replay cleanups much more effective. It did not introduce a cluster feature for the new table, because it is a node-local table, so the cluster can switch to the new table gradually, one node at a time. However, https://github.com/scylladb/scylladb/issues/27886 showed that the switching causes timeouts during upgrades, in mixed clusters. Furthermore, switching to the new table unconditionally on upgraded nodes means that on rollback, the batches saved into the v2 table are lost. This PR re-introduces v1 (`system.batchlog`) support and guards the use of the v2 table with a cluster feature, so mixed clusters keep using v1 and thus remain rollback-compatible. The re-introduced v1 support doesn't support post-replay cleanups for simplicity. The cleanup in v1 was never particularly effective anyway and we ended up disabling it for heavy batchlog users, so I don't think the lack of support for cleanup is a problem. 
Fixes: https://github.com/scylladb/scylladb/issues/27886 Needs backport to 2026.1, to fix upgrades for clusters using batches Closes scylladb/scylladb#28736 * github.com:scylladb/scylladb: test/boost/batchlog_manager_test: add tests for v1 batchlog test/boost/batchlog_manager_test: make prepare_batches() work with both v1 and v2 test/boost/batchlog_manager_test: fix indentation test/boost/batchlog_manager_test: extract prepare_batches() method test/lib/cql_assertions: is_rows(): add dump parameter tools/scylla-sstable: extract query result printers tools/scylla-sstable: add std::ostream& arg to query result printers repair/row_level: repair_flush_hints_batchlog_handler(): add all_replayed to finish log db/batchlog_manager: re-add v1 support db/batchlog_manager: return all_replayed from process_batch() db/batchlog_manager: process_bath() fix indentation db/batchlog_manager: make batch() a standalone function db/batchlog_manager: make structs stats public db/batchlog_manager: allocate limiter on the stack db/batchlog_manager: add feature_service dependency gms/feature_service: add batchlog_v2 feature (cherry picked from commit `a83ee6cf66`) Closes scylladb/scylladb#28853	2026-03-04 08:28:39 +02:00
Patryk Jędrzejczak	c4aa14c1a7	test: test_full_shutdown_during_replace: retry replace after the replacing node is removed from gossip The test is currently flaky with `reuse_ip = True`. The issue is that the test retries replace before the first replace is rolled back and the first replacing node is removed from gossip. The second replacing node can see the entry of the first replacing node in gossip. This entry has a newer generation than the entry of the node being replaced, and both replacing nodes have the same IP as the node being replaced. Therefore, the second replacing node incorrectly considers this entry as the entry of the node being replaced. This entry is missing rack and DC, so the second replace fails with ``` ERROR 2026-02-24 21:19:03,420 [shard 0:main] init - Startup failed: std::runtime_error (Cannot replace node 8762a9d2-3b30-4e66-83a1-98d16c5dd007/127.61.127.1 with a node on a different data center or rack. Current location=UNKNOWN_DC/UNKNOWN_RACK, new location=dc1/rack2) ``` Fixes SCYLLADB-805 Closes scylladb/scylladb#28829 (cherry picked from commit `ba7f314cdc`) Closes scylladb/scylladb#28850	2026-03-03 10:21:11 +01:00
Botond Dénes	0dfefc3f12	db/config: don't use RBNO for scaling Remove bootstrap and decommission from allowed_repair_based_node_ops. Using RBNO over streaming for these operations has no benefits, as they are not exposed to the out-of-date replica problem that replace, removenode and rebuild are. On top of that, RBNO is known to have problems with empty user tables. Using streaming for bootstrap and decommission is safe and faster than RBNO in all conditions, especially when the table is small. One test needs adjustment as it relies on RBNO being used for all node ops. Fixes: SCYLLADB-105 Closes scylladb/scylladb#28080 (cherry picked from commit `b637e17b19`) Closes scylladb/scylladb#28725	2026-02-27 06:32:15 +02:00
Łukasz Paszkowski	883e3e014a	compaction_manager: fix maybe_wait_for_sstable_count_reduction() hanging forever The futurization refactoring in `9d3755f276` ("replica: Futurize retrieval of sstable sets in compaction_group_view") changed maybe_wait_for_sstable_count_reduction() from a single predicated wait: ``` co_await cstate.compaction_done.wait([..] { return num_runs_for_compaction() <= threshold || !can_perform_regular_compaction(t); }); ``` to a while loop with a predicated wait: ``` while (can_perform_regular_compaction(t) && co_await num_runs_for_compaction() > threshold) { co_await cstate.compaction_done.wait([this, &t] { return !can_perform_regular_compaction(t); }); } ``` This was necessary because num_runs_for_compaction() became a coroutine (returns future<size_t>) and can no longer be called inside a condition_variable predicate (which must be synchronous). However, the inner wait's predicate, !can_perform_regular_compaction(t), only returns true when compaction is disabled or the table is being removed. During normal operation, every signal() from compaction_done wakes the waiter, the predicate returns false, and the waiter immediately goes back to sleep without ever re-checking the outer while loop's num_runs_for_compaction() condition. This causes memtable flushes to hang forever in maybe_wait_for_sstable_count_reduction() whenever the sstable run count exceeds the threshold, because completed compactions signal compaction_done but the signal is swallowed by the predicate. Fix by replacing the predicated wait with a bare wait(), so that any signal (including from completed compactions) causes the outer while loop to re-evaluate num_runs_for_compaction(). Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-610 Closes scylladb/scylladb#28801 (cherry picked from commit `bb57b0f3b7`)	2026-02-27 01:38:13 +02:00
Andrzej Jackowski	995df5dec6	test: fix configuration of test_autoretrain_dict `test_autoretrain_dict` sporadically fails because the default compression algorithm was changed after the test was written. `9ffa62a986815709d0a09c705d2d0caf64776249` was an attempt to fix it by changing the compression configuration during node startup. However, the configuration change had an incorrect YAML format and was ignored by ScyllaDB. This commit fixes it. Fixes: scylladb/scylladb#28204 Closes scylladb/scylladb#28746 (cherry picked from commit `cd4caed3d3`) Closes scylladb/scylladb#28794	2026-02-26 09:26:23 +02:00
Marcin Maliszkiewicz	502b7f296d	Merge '[Backport 2026.1] vector_search: return NaN for similarity_cosine with all-zero vectors' from Scylladb[bot] The ANN vector queries with all-zero vectors are allowed even on vector indexes with similarity function set to cosine. When enabling the rescoring option, those queries would fail as the rescoring calls `similarity_cosine` function underneath, causing an `InvalidRequest` exception as all-zero vectors were not allowed matching Cassandra's behaviour. To eliminate the discrepancy we want the all-zero vector `similarity_cosine` calls to pass, but return the NaN as the cosine similarity for zero vectors is mathematically incorrect. We decided not to use arbitrary values contrary to USearch, for which the distance (not to be confused with similarity) is defined as cos(0, 0) = 0, cos(0, x) = 1 while supporting the range of values [0, 2]. If we wanted to convert that to similarity, that would mean sim_cos(0, x) = 0.5, which does not support mathematical reasoning why that would be more similar than for example vectors marking obtuse angles. It's safe to assume that all-zero vectors for cosine similarity shouldn't make any impact, therefore we return NaN and eliminate them from best results. Adjusted the tests accordingly to check both proper Cassandra and Scylla's behaviour. Fixes: SCYLLADB-456 Backport to 2026.1 needed, as it fixes the bug for ANN vector queries using rescoring introduced there. - (cherry picked from commit `af0889d194`) - (cherry picked from commit `4e32502bb3`) Parent PR: #28609 Closes scylladb/scylladb#28775 * github.com:scylladb/scylladb: test/vector_search: add reproducer for rescoring with zero vectors vector_search: return NaN for similarity_cosine with all-zero vectors	2026-02-25 14:34:58 +01:00
Nadav Har'El	b251ee02a4	test: add unit test for tablet_map::get_secondary_replica() This patch adds a unit test for tablet_map::get_secondary_replica(). It was never officially defined how the "primary" and "secondary" replicas were chosen, and their implementation changed over time, but the one invariant that this test verifies is that the secondary replica and the primary replica must be different nodes. This test reproduces issue SCYLLADB-777, where we discovered that get_primary_replica() changed without a corresponding change to get_secondary_replica(). So before the previous patch, this test failed, and after the previous patch - it passes. Signed-off-by: Nadav Har'El <nyh@scylladb.com> (cherry picked from commit `e463d528fe`)	2026-02-25 12:59:26 +00:00
Nadav Har'El	f26d08dde2	test, alternator: add test for TTL expiration with a node down We have many single-node functional tests for Alternator TTL in test/alternator/test_ttl.py. This patch adds a multi-node test in test/cluster/test_alternator.py. The new test verifies that: 1. Even though Alternator TTL splits the work of scanning and expiring items between nodes, all the items get correctly expired. 2. When one node is down, all the items still expire because the "secondary" owner of each token range takes over expiring the items in this range while the "primary" owner is down. This new test is actually a port of a test we already had in dtest (alternator_ttl_tests.py::test_multinode_expiration). This port is faster and smaller than the original (fewer nodes, fewer rows), but it still found a regression (SCYLLADB-777) that dtest missed - the new test failed when running with tablets and in release build mode. Signed-off-by: Nadav Har'El <nyh@scylladb.com> (cherry picked from commit `0c7f499750`)	2026-02-25 12:59:26 +00:00
Avi Kivity	fdae3e4f3a	Merge '[Backport 2026.1] s3_client: Fix s3 part size and number of parts calculation' from Scylladb[bot] - Correct `calc_part_size` function since it could return more than 10k parts - Add tests - Add more checks in `calc_part_size` to comply with S3 limits Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-640 Must be ported back to 2025.3/4 and 2026.1 since we may encounter this bug in production clusters - (cherry picked from commit `289e910cec`) - (cherry picked from commit `6280cb91ca`) - (cherry picked from commit `960adbb439`) Parent PR: #28592 Closes scylladb/scylladb#28697 * github.com:scylladb/scylladb: s3_client: add more constrains to the calc_part_size s3_client: add tests for calc_part_size s3_client: correct multipart part-size logic to respect 10k limit	2026-02-24 14:23:33 +02:00
Dawid Pawlik	2feed49285	test/vector_search: add reproducer for rescoring with zero vectors Add reproducer for the SCYLLADB-456 issue following exception on ANN vector queries with rescoring with similarity cosine. (cherry picked from commit `4e32502bb3`)	2026-02-23 17:09:51 +00:00
Dawid Pawlik	3007cb6f37	vector_search: return NaN for similarity_cosine with all-zero vectors The ANN vector queries with all-zero vectors are allowed even on vector indexes with similarity function set to cosine. When enabling the rescoring option, those queries would fail as the rescoring calls `similarity_cosine` function underneath, causing an `InvalidRequest` exception as all-zero vectors were not allowed matching Cassandra's behaviour. To eliminate the discrepancy we want the all-zero vector `similarity_cosine` calls to pass, but return the NaN as the cosine similarity for zero vectors is mathematically incorrect. We decided not to use arbitrary values contrary to USearch, for which the distance (not to be confused with similarity) is defined as cos(0, 0) = 0, cos(0, x) = 1 while supporting the range of values [0, 2]. If we wanted to convert that to similarity, that would mean sim_cos(0, x) = 0.5, which does not support mathematical reasoning why that would be more similar than for example vectors marking obtuse angles. It's safe to assume that all-zero vectors for cosine similarity shouldn't make any impact, therefore we return NaN and eliminate them from best results. Adjusted the tests accordingly to check both proper Cassandra and Scylla's behaviour. Fixes: SCYLLADB-456 (cherry picked from commit `af0889d194`)	2026-02-23 17:09:50 +00:00
Tomasz Grabiec	e90449f770	test_tablets_parallel_decommission: Fix flakiness due to delayed task appearance Currently, the test assumes that when 'topology_coordinator_pause_before_processing_backlog: waiting' is logged, the task for decommission must be there. This was based on the assumption that topology coordinator is idle and decommission request wakes it up. But if the server is slow enough, it may still be running the load balancer in reaction to table creation, and block on that injection point before decommission request was added. Fix by waiting for the task to appear rather than the injection. Fixes SCYLLADB-715 (cherry picked from commit `d33d38139f`)	2026-02-20 16:35:39 +00:00
Tomasz Grabiec	98fd5c5e45	test: cluster: task_manager_client: Introduce wait_task_appears() (cherry picked from commit `2454de4f8f`)	2026-02-20 16:35:39 +00:00
Tomasz Grabiec	cca6a1c3dd	tests: pylib: util: Add exponential backoff to wait_for Allows balancing the trade-off between fast execution in case the condition is satisfied quickly and not adding load when it's not. (cherry picked from commit `e14eca46af`)	2026-02-20 16:35:39 +00:00
Szymon Malewski	86554e6192	vector: Improve similarity functions performance Improves performance of deserialization of vector data for calculating similarity functions. Instead of deserializing vector data into a std::vector<data_value>, we deserialize directly into a std::vector<float> and then pass it to similarity functions as a std::span<const float>. This avoids overhead of data_value allocations and conversions. Example QPS of `SELECT id, similarity_cosine({vector<float, 1536>}, {vector<float, 1536>}) ...`: client concurrency 1: before: ~135 QPS, after: ~1005 QPS client concurrency 20: before: ~280 QPS, after: ~2097 QPS Measured using https://github.com/zilliztech/VectorDBBench (modified to call above query without ANN search) Fixes https://scylladb.atlassian.net/browse/SCYLLADB-471 Closes scylladb/scylladb#28615 (cherry picked from commit `668d6fe019`) Closes scylladb/scylladb#28690	2026-02-19 14:14:39 +02:00
Asias He	637618560b	repair: Skip auto repair for tables using RF one There is no point running repair for tables using RF one. Row level repair will skip it but the auto repair scheduler will keep scheduling such repairs since repair_time could not be updated. Skip such repairs at the scheduler level for auto repair. If the request is issued by user, we will have to schedule such repair otherwise the user request will never be finished. Fixes SCYLLADB-561 Closes scylladb/scylladb#28640 (cherry picked from commit `1be80c9e86`) Closes scylladb/scylladb#28714	2026-02-19 13:07:37 +02:00
Avi Kivity	8c3c5777da	Merge '[Backport 2026.1] transport: fix connection code to consume only initially taken semaphore units' from Scylladb[bot] The connection's `cpu_concurrency_t` struct tracks the state of a connection to manage the admission of new requests and prevent CPU overload during connection storms. When a connection holds units (allowed only 0 or 1), it is considered to be in the "CPU state" and contributes to the concurrency limits used when accepting new connections. The bug stems from the fact that `counted_data_source_impl::get` and `counted_data_sink_impl::put` calls can interleave during execution. This occurs because of `should_parallelize` and `_ready_to_respond`, the latter being a future chain that can run in the background while requests are being read. Consequently, while reading request (N), the system may concurrently be writing the response for request (N-1) on the same connection. This interleaving allows `return_all()` to be called twice before the subsequent `consume_units()` is invoked. While the second `return_all()` call correctly returns 0 units, the matching `consume_units()` call would mistakenly take an extra unit from the semaphore. Over time, a connection blocked on a read operation could end up holding an unreturned semaphore unit. If this pattern repeats across multiple connections, the semaphore units are eventually depleted, preventing the server from accepting any new connections. The fix ensures that we always consume the exact number of units that were previously returned. With this change, interleaved operations behave as follows: get() return_all — returns 1 unit put() return_all — returns 0 units get() consume_units — takes back 1 unit put() consume_units — takes back 0 units Logically, the networking phase ends when the first network operation concludes. But more importantly, when a network operation starts, we no longer hold any units. 
Other solutions are possible but the chosen one seems to be the simplest and safest to backport. Fixes SCYLLADB-485 Backport: all supported affected versions, bug introduced with initial feature implementation in: `ed3e4f33fd` - (cherry picked from commit `0376d16ad3`) - (cherry picked from commit `3b98451776`) Parent PR: #28530 Closes scylladb/scylladb#28716 * github.com:scylladb/scylladb: test: auth_cluster: add test for hanged AUTHENTICATING connections transport: fix connection code to consume only initially taken semaphore units	2026-02-19 12:47:38 +02:00
Tomasz Grabiec	bb9a5261ec	Merge '[Backport 2026.1] test: fix flaky test_balance_empty_tablets' from Scylladb[bot] The test creates a single node cluster, then creates 3 tables which remain empty. Then it adds another node with half the disk capacity of the first one, and then it waits for the balancer to migrate tablets to the newly added node by calling the quiesce topology API. The number of tablets on the smaller node should be exactly half the number of tablets on the larger node. After waiting for quiesce topology, we could have a situation where we query the number of tablets from the node which still hasn't processed the last tablet migrations and updated system.tablets. This patch adds a read barrier so that both nodes see the same tablets metadata before we query the number of tablets. Fixes: SCYLLADB-603 The test is present in master and 2026.1, so we need to backport this. - (cherry picked from commit `4ca40929ef`) Parent PR: #28598 Closes scylladb/scylladb#28638 * github.com:scylladb/scylladb: test/cluster: Remove short_tablet_stats_refresh_interval injection test: add read barrier to test_balance_empty_tablets	2026-02-18 23:39:03 +01:00
Marcin Maliszkiewicz	d5d81cc066	test: auth_cluster: add test for hanged AUTHENTICATING connections Test runtime: Release - 2s Debug - 5s (cherry picked from commit `3b98451776`)	2026-02-18 19:43:02 +00:00
Botond Dénes	99a67484bf	Merge '[Backport 2026.1] cql3/statements/describe_statement: hide paxos state tables ' from Scylladb[bot] Paxos state tables are internal tables fully managed by Scylla and they shouldn't be exposed to the user, nor should they be backed up. This commit hides this kind of table from all listings, and if such a table is directly described with `DESC ks."tbl$paxos"`, the description is generated within a comment and a note for the user is added. Fixes https://github.com/scylladb/scylladb/issues/28183 LWT on tablets and paxos state tables are present in 2025.4, so the patch should be backported to this version. - (cherry picked from commit `f89a8c4ec4`) - (cherry picked from commit `9baaddb613`) Parent PR: #28230 Closes scylladb/scylladb#28508 * github.com:scylladb/scylladb: test/cqlpy: add reproducer for hidden Paxos table being shown by DESC cql3/statements/describe_statement: hide paxos state tables	2026-02-18 12:41:08 +02:00
Aleksandra Martyniuk	19cbaa1be2	test: fix test_remove_node_violating_rf_rack_with_rack_list test_remove_node_violating_rf_rack_with_rack_list creates a cluster with four nodes. One of the nodes is excluded, then another one is stopped, excluded, and removed. If the two stopped nodes were both voters, the majority is lost and the cluster loses its raft leader. As a result, the node cannot be removed and the operation times out. Add the 5th node to the cluster. This way the majority is always up. Fixes: https://github.com/scylladb/scylladb/issues/28596. Closes scylladb/scylladb#28610 (cherry picked from commit `f955a90309`) Closes scylladb/scylladb#28639	2026-02-18 12:36:52 +02:00
Ernest Zaslavsky	6e92ee1bb2	s3_client: add more constrains to the calc_part_size Enforce more checks on part size and object size as defined in "Amazon S3 multipart upload limits", see https://docs.aws.amazon.com/AmazonS3/latest/userguide/qfacts.html and https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingObjects.html (cherry picked from commit `960adbb439`)	2026-02-18 09:41:28 +00:00
Ernest Zaslavsky	4ecc402b79	s3_client: add tests for calc_part_size Introduce tests that validate the corrected multipart part-size calculation, including boundary conditions and error cases. (cherry picked from commit `6280cb91ca`)	2026-02-18 09:41:27 +00:00
Calle Wilund	cad92d5100	utils/gcp/object_storage: URL-encode object names in URL:s Fixes #28398 When used as path elements in google storage paths, the object names need to be URL encoded. Due to a.) tests not really using prefixes including non-url valid chars (i.e. / etc) and the mock server used for most testing not enforcing this particular aspect, this was missed. Modified unit tests to use prefixing for all names, so when run in real GS, any errors like this will show. (cherry picked from commit `87aa6c8387`)	2026-02-17 19:32:19 +00:00
Piotr Dulikowski	05a5bd542a	Merge '[Backport 2026.1] vector_search: Fix flaky vector_store_client_https_rewrite_ca_cert' from Scylladb[bot] Most likely, the root cause of the flaky test was that the TLS handshake hung for an extended period (60s). This caused the test case to fail because the ANN request duration exceeded the test case timeout. The PR introduces two changes: * Mitigation of the hanging TLS handshake: This issue likely occurred because the test performed certificate rewrites simultaneously with ANN requests that utilize those certificates. * Production code fix: This addresses a bug where the TLS handshake itself was not covered by the connection timeout. Since tls::connect does not perform the handshake immediately, the handshake only occurs during the first write operation, potentially bypassing connect timeout. Fixes: #28012 Backport to 2026.01 and 2025.04 is needed, as these branches are also affected and may experience CI flakiness due to this test. - (cherry picked from commit `aef5ff7491`) - (cherry picked from commit `079fe17e8b`) Parent PR: #28617 Closes scylladb/scylladb#28643 * github.com:scylladb/scylladb: vector_search: Fix missing timeout on TLS handshake vector_search: test: Fix flaky cert rewrite test	2026-02-17 10:44:13 +01:00
Patryk Jędrzejczak	8a626bb458	Merge '[Backport 2026.1] test: explicitly set compression algorithm in test_autoretrain_dict' from Scylladb[bot] When `test_autoretrain_dict` was originally written, the default `sstable_compression_user_table_options` was `LZ4Compressor`. The test assumed (correctly) that initially the compression doesn't use a trained dictionary, and later in the test scenario, it changed the algorithm to one with a dictionary. However, the default `sstable_compression_user_table_options` is now `LZ4WithDictsCompressor`, so the old assumption is no longer correct. As a result, the assertion that data is initially not compressed well may or may not fail depending on dictionary training timing. To fix this, this commit explicitly sets `ZstdCompressor` as the initial `sstable_compression_user_table_options`, ensuring that the assumption that initial compression is without a dictionary is always met. Note: `ZstdCompressor` differs from the former default `LZ4Compressor`. However, it's a better choice — the test aims to show the benefit of using a dictionary, not the benefit of Zstd over LZ4 (and the test uses ZstdWithDictsCompressor as the algorithm with the dictionary). Fixes: https://github.com/scylladb/scylladb/issues/28204 Backport: 2025.4, as test already failed there (and also backport to 2026.1 to make everything consistent). - (cherry picked from commit `e63cfc38b3`) - (cherry picked from commit `9ffa62a986`) Parent PR: #28625 Closes scylladb/scylladb#28667 * https://github.com/scylladb/scylladb: test: explicitly set compression algorithm in test_autoretrain_dict test: remove unneeded semicolons from python test	2026-02-17 10:19:26 +01:00
Patryk Jędrzejczak	3a56a0cf99	test: test_restart_leaving_replica_during_cleanup: reconnect driver after restart The test can currently fail like this: ``` > await cql.run_async(f"ALTER TABLE {ks}.test WITH tablets = {{'min_tablet_count': 1}}") E cassandra.cluster.NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 127.158.27.9:9042 datacenter1>: <Error from server: code=0000 [Server error] message="Failed to apply group 0 change due to concurrent modification">}) ``` The following happens: - node A is restarted and becomes the group0 leader, - the driver sends the ALTER TABLE request to node B, - the request hits group 0 concurrent modification error 10 times and fails because node A performs tablet migrations at the same time. What is unexpected is that even though the driver session uses the default retry policy, the driver doesn't retry the request on node A. The request is guaranteed to succeed on node A because it's the only node adding group0 entries. The driver doesn't retry the request on node A because of a missing `wait_for_cql_and_get_hosts` call. We add it in this commit. We also reconnect the driver just in case to prevent hitting scylladb/python-driver#295. Moreover, we can revert the workaround from `4c9efc08d8`, as the fix from this commit also prevents DROP KEYSPACE failures. The commit has been tested in byo with `_concurrent_ddl_retries{0}` to verify that node A really can't hit group 0 concurrent modification error and always receives the ALTER TABLE request from the driver. All 300 runs in each build mode passed. Fixes #25938 Closes scylladb/scylladb#28632 (cherry picked from commit `0693091aff`) Closes scylladb/scylladb#28673	2026-02-17 10:03:50 +01:00
Nikos Dragazis	0cdac69aab	test/cluster: Remove short_tablet_stats_refresh_interval injection The test `test_size_based_load_balancing.py::test_balance_empty_tablets` waits for tablet load stats to be refreshed and uses the `short_tablet_stats_refresh_interval` injection to speed up the refresh interval. This injection has no effect; it was replaced by the `tablet_load_stats_refresh_interval_in_seconds` config option (patch: `1d6808aec4`), so the test currently waits for 60 seconds (default refresh interval). Use the config option. This reduces the execution time to ~8 seconds. Fixes SCYLLADB-556. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com> Closes scylladb/scylladb#28536 (cherry picked from commit `5d1e6243af`)	2026-02-17 08:35:16 +01:00
Ferenc Szili	f04a3acf33	test: add read barrier to test_balance_empty_tablets The test creates a single node cluster, then creates 3 tables which remain empty. Then it adds another node with half the disk capacity of the first one, and then it waits for the balancer to migrate tablets to the newly added node by calling the quiesce topology API. The number of tablets on the smaller node should be exactly half the number of tablets on the larger node. After waiting for quiesce topology, we could have a situation where we query the number of tablets from the node which still hasn't processed the last tablet migrations and updated system.tablets. This patch adds a read barrier so that both nodes see the same tablets metadata before we query the number of tablets. Fixes: SCYLLADB-603 Closes scylladb/scylladb#28598 (cherry picked from commit `4ca40929ef`)	2026-02-17 08:31:50 +01:00
Andrzej Jackowski	ad716f9341	test: explicitly set compression algorithm in test_autoretrain_dict When `test_autoretrain_dict` was originally written, the default `sstable_compression_user_table_options` was `LZ4Compressor`. The test assumed (correctly) that initially the compression doesn't use a trained dictionary, and later in the test scenario, it changed the algorithm to one with a dictionary. However, the default `sstable_compression_user_table_options` is now `LZ4WithDictsCompressor`, so the old assumption is no longer correct. As a result, the assertion that data is initially not compressed well may or may not fail depending on dictionary training timing. To fix this, this commit explicitly sets `ZstdCompressor` as the initial `sstable_compression_user_table_options`, ensuring that the assumption that initial compression is without a dictionary is always met. Note: `ZstdCompressor` differs from the former default `LZ4Compressor`. However, it's a better choice — the test aims to show the benefit of using a dictionary, not the benefit of Zstd over LZ4 (and the test uses ZstdWithDictsCompressor as the algorithm with the dictionary). Fixes: scylladb/scylladb#28204 (cherry picked from commit `9ffa62a986`)	2026-02-16 16:23:36 +00:00
Andrzej Jackowski	2edd87f2e1	test: remove unneeded semicolons from python test (cherry picked from commit `e63cfc38b3`)	2026-02-16 16:23:36 +00:00
Karol Nowacki	cfebb52db0	vector_search: test: Fix flaky cert rewrite test The test is flaky most likely because when TLS certificate rewrite happens simultaneously with an ANN request, the handshake can hang for a long time (~60s). This leads to a timeout in the test case. This change introduces a checkpoint in the test so that it will wait for the certificate rewrite to happen before sending an ANN request, which should prevent the handshake from hanging and make the test more reliable. Fixes: #28012 (cherry picked from commit `aef5ff7491`)	2026-02-13 21:24:45 +00:00
Dawid Mędrek	37ef37e8ab	test: cluster: Reduce wait time in test_sync_point If everything is OK, the sync point will not resolve with node 3 dead. As a result, the waiting will use all of the time we allocate for it, i.e. 30 seconds. That's a lot of time. There's no easy way to verify that the sync point will NOT resolve, but let's at least reduce the waiting to 3 seconds. If there's a bug, it should be enough to trigger it at some point, while reducing the average time needed for CI. (cherry picked from commit `f83f911bae`)	2026-02-12 12:13:19 +00:00
Dawid Mędrek	fdad814aa3	test: cluster: Fix test_sync_point The test had a few shortcomings that made it flaky or simply wrong: 1. We were verifying that hints were written by checking the size of in-flight hints. However, that could potentially lead to problems in rare situations. For instance, if all of the hints failed to be written to disk, the size of in-flight hints would drop to zero, but creating a sync point would correspond to the empty state. In such a situation, we should fail immediately and indicate what the cause was. 2. A sync point corresponds to the hints that have already been written to disk. The number of those is tracked by the metric `written`. It's a much more reliable way to make sure that hints have been written to the commitlog. That ensures that the sync point we'll create will really correspond to those hints. 3. The auxiliary function `wait_for` used in the test works like this: it executes the passed callback and looks at the result. If it's `None`, it retries it. Otherwise, the callback is deemed to have finished its execution and no further retries will be attempted. Before this commit, we simply returned a bool, and so the code was wrong. We improve it. Note that this fixes scylladb/scylladb#28203, which was a manifestation of scylladb/scylladb#25879. We created a sync point that corresponded to the empty state, and so it immediately resolved, even when node 3 was still dead. Refs scylladb/scylladb#25879 Fixes scylladb/scylladb#28203 (cherry picked from commit `a256ba7de0`)	2026-02-12 12:13:19 +00:00
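The `wait_for` contract described in point 3 above can be illustrated with a minimal sketch. This is not the helper from the Scylla test suite: the signature, the `period` parameter, and the metric samples are assumptions made for illustration only. It shows why a callback that returns a bool ends the wait on the first call, while one that returns `None` triggers a retry.

```python
import asyncio
import time


async def wait_for(callback, deadline, period=0.01):
    """Retry `callback` until it returns something other than None (hypothetical sketch)."""
    while True:
        result = await callback()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met before deadline")
        await asyncio.sleep(period)


def make_checks(samples):
    """Build a buggy and a fixed callback over the same made-up metric samples."""
    buggy_it, fixed_it = iter(samples), iter(samples)

    async def buggy():
        # Returns a bool: False is not None, so wait_for stops on the first call.
        return next(buggy_it) > 0

    async def fixed():
        # Returns None to request a retry, and the value once the condition holds.
        value = next(fixed_it)
        return value if value > 0 else None

    return buggy, fixed


async def demo():
    buggy, fixed = make_checks([0, 0, 5])
    deadline = time.monotonic() + 1.0
    return await wait_for(buggy, deadline), await wait_for(fixed, deadline)


buggy_result, fixed_result = asyncio.run(demo())
```

With these samples the buggy callback returns `False` immediately, although the metric is still zero, while the fixed callback keeps polling until the metric becomes positive.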
Dawid Mędrek	0257f7cc89	test: cluster: Await sync points asynchronously There's a dedicated HTTP API for communicating with the cluster, so let's use it instead of yet another custom solution. (cherry picked from commit `c5239edf2a`)	2026-02-12 12:13:19 +00:00
Dawid Mędrek	07bfd920e7	test: cluster: Create sync points asynchronously There's a dedicated HTTP API for communicating with the nodes, so let's use it instead of yet another custom solution. (cherry picked from commit `ac4af5f461`)	2026-02-12 12:13:19 +00:00
Dawid Mędrek	698ba5bd0b	test: cluster: Fetch hint metrics asynchronously There's a dedicated API for fetching metrics now. Let's use it instead of developing yet another solution that's also worse. (cherry picked from commit `628e74f157`)	2026-02-12 12:13:19 +00:00
Pawel Pery	f4b79c1b1d	Revert "Merge 'vector_search: add validator tests' from Pawel Pery" This reverts commit `bcd1758911`, reversing changes made to `b2c2a99741`. There is a design decision not to introduce an additional test orchestration tool for scylladb.git (see comments for #27499). One commit has already been reverted in `55c7bc7`. The validator tests were flaky in recent CI runs, so it is time to remove all remaining validator tests. This needs a backport to 2026.1 to remove the remaining validator tests from there. Fixes: VECTOR-497 Closes scylladb/scylladb#28568 (cherry picked from commit `81d11a23ce`) Closes scylladb/scylladb#28577	2026-02-09 15:16:40 +02:00
Michał Hudobski	f633f57163	auth: add CDC streams and timestamps to vector search permissions It turns out that the cdc driver requires permissions to two additional system tables. This patch adds them to VECTOR_SEARCH_INDEXING and modifies the unit tests. The integration with vector store was tested manually, integration tests will be added in vector-store repository in a follow up PR. Fixes: SCYLLADB-522 Closes scylladb/scylladb#28519 (cherry picked from commit `6b9fcc6ca3`) Closes scylladb/scylladb#28538	2026-02-05 10:31:39 +01:00
Nadav Har'El	5b15c52f1e	test/cqlpy: add reproducer for hidden Paxos table being shown by DESC This patch adds a reproducer test showing issue #28183 - that when LWT is used, hidden tables "...$paxos" are created but they are unexpectedly shown by DESC TABLES, DESC SCHEMA and DESC KEYSPACE. The new test was failing (in three places) on Scylla, as those internal (and illegally-named) tables are listed, and passes on Cassandra (which doesn't add hidden tables for LWT). The commit also contains another test, which verifies that a direct description of the Paxos state table is wrapped in a comment. Refs #28183. Signed-off-by: Nadav Har'El <nyh@scylladb.com> (cherry picked from commit `9baaddb613`)	2026-02-02 23:31:58 +00:00
Patryk Jędrzejczak	b62e1b405b	test: test_maintenance_mode: enable maintenance mode properly The same issue as the one fixed in `394207fd69`. This one didn't cause real problems, but it's still cleaner to fix it. (cherry picked from commit `7e7b9977c5`)	2026-02-02 17:02:16 +00:00
Patryk Jędrzejczak	f3d2a16e66	test: test_maintenance_mode: shutdown cluster connections Leaked connections are known to cause inter-test issues. (cherry picked from commit `6c547e1692`)	2026-02-02 17:02:16 +00:00
Patryk Jędrzejczak	eee99ebb3d	test: test_maintenance_mode: run with different keyspace options We extend the test to provide a reproducer for #27988 and to avoid similar bugs in the future. The test slows down from ~14s to ~19s on my local machine in dev mode. It seems reasonable. (cherry picked from commit `867a1ca346`)	2026-02-02 17:02:16 +00:00
Patryk Jędrzejczak	c248744c5a	test: test_maintenance_mode: check that group0 is disabled by creating a keyspace In the following commit, we make the rest run with multiple keyspaces, and the old check becomes inconvenient. We also move it below to the part of the code that won't be executed for each keyspace. Additionally, we check if the error message is as expected. (cherry picked from commit `53f58b85b7`)	2026-02-02 17:02:16 +00:00
Patryk Jędrzejczak	4ba3c08d45	test: test_maintenance_mode: get rid of the conditional skip This skip has already caused trouble. After `0668c642a2`, the skip was always hit, and the test was silently doing nothing. This made us miss #26816 for a long time. The test was fixed in `222eab45f8`, but we should get rid of the skip anyway. We increase the number of writes from 256 to 1000 to make the chance of not finding the key on server A even lower. If that still happens, it must be due to a bug, so we fail the test. We also make the test insert rows until server A is a replica of one row. The expected number of inserted rows is a small constant, so it should, in theory, make the test faster and cleaner (we need one row on server A, so we insert exactly one such row). It's possible to make the test fully deterministic, by e.g., hardcoding the key and tokens of all nodes via `initial_token`, but I'm afraid it would make the test "too deterministic" and could hide a bug. (cherry picked from commit `408c6ea3ee`)	2026-02-02 17:02:16 +00:00
Patryk Jędrzejczak	c8c21cc29c	test: test_maintenance_mode: remove the redundant value from the query result (cherry picked from commit `c92962ca45`)	2026-02-02 17:02:16 +00:00
Ferenc Szili	523d529d27	test: add test and reproducer for load_stats refresh exception This patch adds a test and reproducer for the issue where the load_stats refresh procedure throws exceptions if any of the tables have been dropped since load_stats was produced. (cherry picked from commit `92dbde54a5`)	2026-02-01 00:34:26 +00:00
Botond Dénes	dc89e2ea37	Merge '[Backport 2026.1] test: test_alternator_proxy_protocol: fix race between node startup and test start' from Scylladb[bot] test_alternator_proxy_protocol starts a node and connects via the alternator ports. Starting a node, by default, waits until the CQL ports are up. This does not guarantee that the alternator ports are up (they will be up very soon after this), so there is a short window where a connection to the alternator ports will fail. Fix by adding a ServerUpState=SERVING mode, which waits for the node to report to its supervisor (systemd, which we are pretending to be) that its ports are open. The test is then adjusted to request this new ServerUpState. Fixes #28210 Fixes #28211 Flaky tests are only in master and branch-2026.1, so backporting there. - (cherry picked from commit `ebac810c4e`) - (cherry picked from commit `59f2a3ce72`) Parent PR: #28291 Closes scylladb/scylladb#28443 * github.com:scylladb/scylladb: test: test_alternator_proxy_protocol: wait for the node to report itself as serving test: cluster_manager: add ability to wait for supervisor STATUS=serving	2026-01-30 15:59:09 +02:00
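The supervisor-notification wait described in the entry above can be sketched as follows. This is a minimal sketch that assumes the node reports state using systemd-style notify datagrams (`KEY=value` lines sent to a Unix datagram socket). The helper name `wait_for_status`, the socket path, and the exact payloads are illustrative assumptions, not the code from the test framework.

```python
import os
import socket
import tempfile


def wait_for_status(sock, wanted, timeout=5.0):
    """Read notify datagrams until one carries STATUS=<wanted> (hypothetical helper)."""
    sock.settimeout(timeout)  # recv raises a timeout error if nothing arrives in time
    while True:
        datagram = sock.recv(4096).decode()
        # Each datagram holds newline-separated KEY=value assignments.
        fields = dict(line.split("=", 1) for line in datagram.splitlines() if "=" in line)
        if fields.get("STATUS") == wanted:
            return fields


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "notify.sock")
    # The test plays the role of the supervisor and owns the notify socket.
    supervisor = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    supervisor.bind(path)
    # Stand-in for the node: an early status update first, then the serving report.
    node = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    node.sendto(b"STATUS=starting", path)
    node.sendto(b"READY=1\nSTATUS=serving", path)
    seen = wait_for_status(supervisor, "serving")
    node.close()
    supervisor.close()
```

Waiting on the reported status rather than on a single port check is what closes the window in which the CQL ports are up but the alternator ports are not yet open.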


10691 Commits