scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-28 18:50:53 +00:00

Author	SHA1	Message	Date
Szymon Malewski	668d6fe019	vector: Improve similarity functions performance Improves performance of deserialization of vector data for calculating similarity functions. Instead of deserializing vector data into a std::vector<data_value>, we deserialize directly into a std::vector<float> and then pass it to similarity functions as a std::span<const float>. This avoids overhead of data_value allocations and conversions. Example QPS of `SELECT id, similarity_cosine({vector<float, 1536>}, {vector<float, 1536>}) ...`: client concurrency 1: before: ~135 QPS, after: ~1005 QPS client concurrency 20: before: ~280 QPS, after: ~2097 QPS Measured using https://github.com/zilliztech/VectorDBBench (modified to call above query without ANN search) Fixes https://scylladb.atlassian.net/browse/SCYLLADB-471 Closes scylladb/scylladb#28615	2026-02-18 00:33:34 +02:00
Emil Maskovsky	6b98f44485	init: fix infinite loop on npos wrap with updated Seastar Fixes parsing of comma-separated seed lists in "init.cc" and "cql_test_env.cc" to use the standard `split_comma_separated_list` utility, avoiding manual `npos` arithmetic. The previous code relied on `npos` being `uint32_t(-1)`, which would not overflow in `uint64_t` target and exit the loop as expected. With Seastar's upcoming change to make `npos` `size_t(-1)`, this would wrap around to zero and cause an infinite loop. Switch to `split_comma_separated_list` standardized way of tokenization that is also used in other places in the code. Empty tokens are handled as before. This prevents startup hangs and test failures when Seastar is updated. Refs: scylladb/seastar#3236	2026-02-17 17:57:13 +00:00
Emil Maskovsky	bda0fc9d93	init: remove unnecessary object creation in emplace calls Simplifies code by directly passing constructor arguments to emplace, avoiding redundant temporary gms::inet_address() object creation. Improves clarity and potentially performance in affected areas.	2026-02-17 17:57:12 +00:00
Marcin Maliszkiewicz	741969cf4c	test: boost: add auth cache tests The cache is covered already with general auth dtests but some cases are more tricky and easier to express directly as calls to cache class. For such tests boost test file was added.	2026-02-17 18:18:40 +01:00
Marcin Maliszkiewicz	c11eb73a59	auth: add cache size metrics	2026-02-17 18:18:40 +01:00
Marcin Maliszkiewicz	a23e503e7b	auth: remove old permissions cache	2026-02-17 17:56:27 +01:00
Marcin Maliszkiewicz	7eedf50c12	auth: ldap: add permissions reload to unified cache The LDAP server may change role-chain assignments without notifying Scylla. As a result, effective permissions can change, so some form of polling is required. Currently, this is handled via cache expiration. However, the unified cache is designed to be consistent and does not support expiration. To provide an equivalent mechanism for LDAP, we will periodically reload the permissions portion of the new cache at intervals matching the previously configured expiration time.	2026-02-17 17:56:27 +01:00
Marcin Maliszkiewicz	3b98451776	test: auth_cluster: add test for hanged AUTHENTICATING connections Test runtime: Release - 2s Debug - 5s	2026-02-17 17:55:48 +01:00
Botond Dénes	2e087882fa	Merge 'GCS object storage. Fix incompatibilty issues with "real" GCS' from Calle Wilund Fixes #28398 Fixes #28399 When used as path elements in Google Storage paths, object names need to be URL-encoded. This was missed because a.) the tests did not really use prefixes containing characters that are not valid in URLs (i.e. / etc.) and b.) the mock server used for most testing does not enforce this particular aspect. Modified unit tests to use prefixing for all names, so when running against real GCS, any errors like this will show. "Real" GCS also behaves a bit differently from the mock when listing with a pager: the former will not give a pager token for the last page, only for the penultimate one. Adds handling for this. Needs backport to the releases that have (though might not really use) the feature, as it is technically possible to use Google Storage for backup and whatnot there, and it should work as expected. Closes scylladb/scylladb#28400 * github.com:scylladb/scylladb: utils/gcp/object_storage: URL-encode object names in URL:s utils::gcp::object_storage: Fix list object pager end condition detection	2026-02-17 16:40:02 +02:00
Andrei Chekun	1b5789cd63	test.py: refactor manager fixture The current manager flow has a flaw. It triggers pytest.fail when it finds errors on teardown, regardless of whether the test has already failed. This creates an additional record in the JUnit report with the same name, and Jenkins is not able to show the logs correctly. To avoid this, this PR changes the logic slightly. The manager now checks whether the test failed, to avoid two failures for the same test in the report. If the test passed, the manager checks the cluster status and fails the test if something is wrong with it. There is no need to check the cluster status if the test failed. If the test passed and the cluster status is OK, but there are unexpected errors in the logs, the test fails as well. This check gathers all information about the errors and potential stack traces, and fails the test only if it has not already failed, to avoid a double entry in the report. Closes scylladb/scylladb#28633	2026-02-17 14:35:18 +01:00
Dawid Mędrek	5b5222d72f	Merge 'test: make test_different_group0_ids work with the Raft-based topology' from Patryk Jędrzejczak The test was marked with xfail in #28383, as it needed to be updated to work with the Raft-based topology. We are doing that in this patch. With the Raft-based topology, there is no reason to check that nodes with different group0 IDs cannot merge their topology/token_metadata. That is clearly impossible, as doing any topology change requires being in the same group0. So, the original regression test doesn't make sense. We can still test that nodes with different group0 IDs cannot gossip with each other, so we keep the test. It's very fast anyway. No backport, test update. Closes scylladb/scylladb#28571 * github.com:scylladb/scylladb: test: run test_different_group0_ids in all modes test: make test_different_group0_ids work with the Raft-based topology	2026-02-17 13:56:41 +01:00
Dawid Mędrek	1b80f6982b	Merge 'test: make the load balancer simulator tablet size aware' from Ferenc Szili Currently, the load balancing simulator computes node, shard and tablet load based on tablet count. This patch changes the load balancing simulator to be tablet size aware. It generates random tablet sizes with a normal distribution, and a mean value of `default_target_tablet_size`, and reports the computed load for nodes and tables based on tablet size sum, instead of tablet count. This is the last patch in the size based load balancing series. It is the last PR in the Size Based Load Balancing series: - First part for tablet size collection via load_stats: scylladb/scylladb#26035 - Second part reconcile load_stats: scylladb/scylladb#26152 - The third part for load_sketch changes: scylladb/scylladb#26153 - The fourth part which performs tablet load balancing based on tablet size: scylladb/scylladb#26254 - The fifth part changes the load balancing simulator: scylladb/scylladb#26438 This is a new feature and backport is not needed. Closes scylladb/scylladb#26438 * github.com:scylladb/scylladb: test, simulator: compute load based on tablet size instead of count test, simulator: generate tablet sizes and update load_stats test, simulator: postpone creation of load_stats_ptr	2026-02-17 13:29:37 +01:00
Andrei Chekun	767789304e	test.py: improve C++ fail summary in pytest Currently, if a test fails, pytest outputs only some basic information about the failure. With this change, it outputs the last 300 lines of the boost/seastar test output. Also capture the output of failed tests in the JUnit report, so it is present in the report on Jenkins. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-449 Closes scylladb/scylladb#28535	2026-02-17 14:25:28 +03:00
Pavel Emelyanov	6d4af84846	Merge 'test: increase open file limit for sstable tests' from Avi Kivity In `ebda2fd4db` ("test: cql_test_env: increase file descriptor limit"), we raised the open file limit for cql_test_env. Here, we raise it for sstables::test_env as well, to fix a couple of twcs resharding tests failing outside dbuild. These tests open 256 sstables, and with 2 files/sstable + resharding work it is understandable that they overflow the 1024 limit. No backport: this is a quality of life improvement for developers running outside dbuild, but they can use dbuild for branches. Closes scylladb/scylladb#28646 * github.com:scylladb/scylladb: test: sstables::test_env: adjust file open limit test: extract cql_test_env's adjust_rlimit() for reuse	2026-02-17 14:19:43 +03:00
Avi Kivity	41925083dc	test: minio: tune sync setting Disable O_DSYNC in minio to avoid unnecessary slowdown in S3 tests. Closes scylladb/scylladb#28579	2026-02-17 14:19:27 +03:00
Jakub Smolar	189b056605	scylla_gdb: use run_ctx to handle Scylla exe and remove pexpect Previous implementation of Scylla lifecycle brought flakiness to the test. This change leaves lifecycle management up to PythonTest.run_ctx, which implements more stability logic for setup/teardown. Replace pexpect-driven GDB interaction with GDB batch mode: - Avoids DeprecationWarning: "This process is multi-threaded, use of forkpty() may lead to deadlocks in the child.", which ultimately caused CI deadlocks. - Removes timeout-driven flakiness on slow systems - no interactive waits/timeouts. - Produces cleaner, more direct assertions around command execution and output. - Trade-off: batch mode adds ~10s per command per test, but with --dist=worksteal this is ~10% overall runtime increase across the suite. Closes scylladb/scylladb#28484	2026-02-17 11:36:20 +01:00
Łukasz Paszkowski	f45465b9f6	test_out_of_space_prevention.py: Lower the critical disk utilization threshold After PR https://github.com/scylladb/scylladb/pull/28396 reduced the test volumes to 20MiB to speed up test_out_of_space_prevention.py, keeping the original 0.8 critical disk utilization threshold can make the tests flaky: transient disk usage (e.g. commitlog segment churn) can push the node into ENOSPC during the run. These tests do not write much data, so reduce the critical disk utilization threshold to 0.5. With 20MiB volumes this leaves ~10MiB of headroom for temporary growth during the test. Fixes: https://github.com/scylladb/scylladb/issues/28463 Closes scylladb/scylladb#28593	2026-02-16 15:10:18 +02:00
Andrei Chekun	e26cf0b2d6	test/cluster: fix two flaky tests test_maintenance_socket is flaky with the new way of running. It looks like the driver tries to reconnect using an old maintenance socket from the previous driver and fails. This PR adds a whitelist for connections, which stabilizes the test. test_no_removed_node_event_on_ip_change was flaky on CI, while the issue never reproduced locally. The assumption is that under load there is a race condition, and the test checks the logs before the message has arrived. A small retry loop is added to avoid this situation. Closes scylladb/scylladb#28635	2026-02-16 14:50:54 +02:00
Patryk Jędrzejczak	0693091aff	test: test_restart_leaving_replica_during_cleanup: reconnect driver after restart The test can currently fail like this: ``` > await cql.run_async(f"ALTER TABLE {ks}.test WITH tablets = {{'min_tablet_count': 1}}") E cassandra.cluster.NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 127.158.27.9:9042 datacenter1>: <Error from server: code=0000 [Server error] message="Failed to apply group 0 change due to concurrent modification">}) ``` The following happens: - node A is restarted and becomes the group0 leader, - the driver sends the ALTER TABLE request to node B, - the request hits group 0 concurrent modification error 10 times and fails because node A performs tablet migrations at the same time. What is unexpected is that even though the driver session uses the default retry policy, the driver doesn't retry the request on node A. The request is guaranteed to succeed on node A because it's the only node adding group0 entries. The driver doesn't retry the request on node A because of a missing `wait_for_cql_and_get_hosts` call. We add it in this commit. We also reconnect the driver just in case to prevent hitting scylladb/python-driver#295. Moreover, we can revert the workaround from `4c9efc08d8`, as the fix from this commit also prevents DROP KEYSPACE failures. The commit has been tested in byo with `_concurrent_ddl_retries{0}` to verify that node A really can't hit group 0 concurrent modification error and always receives the ALTER TABLE request from the driver. All 300 runs in each build mode passed. Fixes #25938 Closes scylladb/scylladb#28632	2026-02-16 12:56:18 +01:00
Marcin Maliszkiewicz	6a4aef28ae	Merge 'test: explicitly set compression algorithm in test_autoretrain_dict' from Andrzej Jackowski When `test_autoretrain_dict` was originally written, the default `sstable_compression_user_table_options` was `LZ4Compressor`. The test assumed (correctly) that initially the compression doesn't use a trained dictionary, and later in the test scenario, it changed the algorithm to one with a dictionary. However, the default `sstable_compression_user_table_options` is now `LZ4WithDictsCompressor`, so the old assumption is no longer correct. As a result, the assertion that data is initially not compressed well may or may not fail depending on dictionary training timing. To fix this, this commit explicitly sets `ZstdCompressor` as the initial `sstable_compression_user_table_options`, ensuring that the assumption that initial compression is without a dictionary is always met. Note: `ZstdCompressor` differs from the former default `LZ4Compressor`. However, it's a better choice — the test aims to show the benefit of using a dictionary, not the benefit of Zstd over LZ4 (and the test uses ZstdWithDictsCompressor as the algorithm with the dictionary). Fixes: https://github.com/scylladb/scylladb/issues/28204 Backport: 2025.4, as test already failed there (and also backport to 2026.1 to make everything consistent). Closes scylladb/scylladb#28625 * github.com:scylladb/scylladb: test: explicitly set compression algorithm in test_autoretrain_dict test: remove unneeded semicolons from python test	2026-02-16 11:38:24 +01:00
Botond Dénes	9f57d6285b	Merge 'test: improve error reporting and retries in get_scylla_2025_1_executable' from Marcin Maliszkiewicz Harden get_scylla_2025_1_executable() by improving error reporting when subprocesses fail, increasing curl's retry count for more resilient downloads, and enabling --retry-all-errors to retry on all failures. Fixes https://github.com/scylladb/scylladb/issues/27745 Backport: no, it's not a bug fix Closes scylladb/scylladb#28628 * github.com:scylladb/scylladb: test: pylib: retry on all errors in get_scylla_2025_1_executable curl's call test: pylib: increase curl's number of retries when downloading scylla test: pylib: improve error reporting in get_scylla_2025_1_executable	2026-02-16 10:09:17 +02:00
Petr Gusev	c785d242a7	tests: extract get_topology_version helper This is a refactoring commit. We need to load the cluster version for a host in several places, so extract a helper for this.	2026-02-16 08:57:42 +01:00
Petr Gusev	ffe3262e8d	global tablets barrier: require all nodes to ack barrier_and_drain Previously, global_tablet_token_metadata_barrier() could proceed with fencing even if some nodes did not acknowledge the barrier_and_drain. This could cause problems: * In scylladb/scylladb#26864, replica locks did not provide mutual exclusion, because “fenced out” requests from old topology versions could run in parallel with requests using newer versions. * In scylladb/scylladb#26375, the barrier could succeed even though we did not wait for closed sessions to become unused. This could leave aborted repair or streaming tasks running concurrently after a tablet transition was aborted, and thus running concurrently with the next transition. In this commit we add a parameter drain_all_nodes: bool to the global_token_metadata_barrier function. If this parameter is set, the barrier waits for all nodes to acknowledge the barrier_and_drain round of RPCs. If any of the nodes are not accessible or throw an error, such errors are rethrown to the caller. We set this parameter only in global_tablet_token_metadata_barrier since for topology migrations the old behavior should be preserved. In case of errors, the tablet migration is blocked until the problem goes away by itself or the problematic node is added to the ignore_nodes list. The test_fenced_out_on_tablet_migration_while_handling_paxos_verb is removed: with tablets, we now drain all nodes, so after a successful barrier_and_drain round there can be no coordinators with an old topology version. The fence_token check after executing a request on a replica is therefore unnecessary for tablets, but still required for vnodes, where topology changes do not wait for all nodes. Topology fencing is covered by test_fence_lwt_during_bootstrap. Fixes scylladb/scylladb#26864 Fixes scylladb/scylladb#26375	2026-02-16 08:57:42 +01:00
Petr Gusev	df73f723a6	storage_proxy: hold erms in replica handlers Add explicit erm-holding variables in all replica-side RPC handlers. This is required to ensure that tablet migration waits for in-flight replica requests even if a non-replica coordinator has been fenced out. Holding erms on the replica side may increase the global-barrier wait time, since the barrier must drain these requests. We believe this is acceptable because: * We already hold erms during replica-side request execution, but in an ad-hoc, non-systemic way in lower layers of storage_proxy (e.g. in sp::mutate_locally and do_query_tablets). * Replica requests are bounded by replica-side timeouts, so the global-barrier wait time cannot exceed the maximum of these timeouts. For Paxos verbs, we use token_metadata_guard, which wraps the ERM and automatically refreshes it when tablet migration does not affect the current token; see the token_metadata_guard comments for details. We use this guard only for Paxos verbs because regular reads and writes already hold raw erms in storage_proxy and on the coordinators. The erms must be held in all RPC handlers that support fencing — that is, those with a fencing_token parameter in storage_proxy.idl. Counter updates already hold erms in mutate_counter_on_leader_and_replicate. Fix test_tablets2::test_timed_out_reader_after_cleanup: the tablets barrier now waits for all nodes. As a result, the replica read is expected to finish, rather than fail due to the tablet having moved as it did previously. The test is renamed to test_tablets_barrier_waits_for_replica_erms to better reflect its purpose. Refs scylladb/scylladb#26864	2026-02-16 08:57:42 +01:00
Andrei Chekun	8c5c1096c2	test: ensure that the table used in cqlpy/test_tools has at least 3 PKs One of the tests checks that the number of PKs is more than 2, but the method that creates the table can return a table with fewer keys. This leads to flakiness. To avoid it, this PR ensures that the table has at least 3 PKs. Closes scylladb/scylladb#28636	2026-02-16 09:50:58 +02:00
Andrei Chekun	e144d5b0bb	test.py: fix JUnit double test case records Move the hook for overwriting the XML reporter to be the first, to avoid double records. Closes scylladb/scylladb#28627	2026-02-15 19:02:24 +02:00
Avi Kivity	a365e2deaa	test: sstables::test_env: adjust file open limit The twcs compaction tests open more than 1024 files (not so good), and will fail in a user session with the default soft limit (1024). Attempt to raise the limit so the tests pass. On a modern systemd installation the hard limit is >500,000, so this will work. There's no problem in dbuild since it raises the file limit globally.	2026-02-15 14:27:37 +02:00
Avi Kivity	bab3afab88	test: extract cql_test_env's adjust_rlimit() for reuse The sstable-oriented sstable::test_env would also like to use it, so extract it into a neutral place.	2026-02-15 14:26:46 +02:00
Piotr Dulikowski	9c1e310b0d	Merge 'vector_search: Fix flaky vector_store_client_https_rewrite_ca_cert' from Karol Nowacki Most likely, the root cause of the flaky test was that the TLS handshake hung for an extended period (60s). This caused the test case to fail because the ANN request duration exceeded the test case timeout. The PR introduces two changes: * Mitigation of the hanging TLS handshake: This issue likely occurred because the test performed certificate rewrites simultaneously with ANN requests that utilize those certificates. * Production code fix: This addresses a bug where the TLS handshake itself was not covered by the connection timeout. Since tls::connect does not perform the handshake immediately, the handshake only occurs during the first write operation, potentially bypassing connect timeout. Fixes: #28012 Backport to 2026.01 and 2025.04 is needed, as these branches are also affected and may experience CI flakiness due to this test. Closes scylladb/scylladb#28617 * github.com:scylladb/scylladb: vector_search: Fix missing timeout on TLS handshake vector_search: test: Fix flaky cert rewrite test	2026-02-13 19:03:50 +01:00
Patryk Jędrzejczak	aebc108b1b	test: run test_different_group0_ids in all modes CI currently fails in release and debug modes if the PR only changes a test run only in dev mode. There is no reason to wait for the CI fix, as there is no reason to run this test only in dev mode in the first place. The test is very fast.	2026-02-13 13:30:29 +01:00
Patryk Jędrzejczak	59746ea035	test: make test_different_group0_ids work with the Raft-based topology The test was marked with xfail in #28383, as it needed to be updated to work with the Raft-based topology. We are doing that in this patch. With the Raft-based topology, there is no reason to check that nodes with different group0 IDs cannot merge their topology/token_metadata. That is clearly impossible, as doing any topology change requires being in the same group0. So, the original regression test doesn't make sense. We can still test that nodes with different group0 IDs cannot gossip with each other, so we keep the test. It's very fast anyway.	2026-02-13 13:30:28 +01:00
Marcin Maliszkiewicz	1b0a68d1de	test: pylib: retry on all errors in get_scylla_2025_1_executable curl's call It's difficult to say if our download backend would always return transient error correctly so that the curl could retry. Instead it's more robust to always retry on error.	2026-02-12 16:18:52 +01:00
Marcin Maliszkiewicz	8ca834d4a4	test: pylib: increase curl's number of retries when downloading scylla By default curl does exponential backoff, and we want to keep that, but there is a time cap of 10 minutes, so with 40 retries we'd wait a long time. Instead we set the cap to 60 seconds. Total waiting time (excluding receiving request time): before - 17m after - 35m	2026-02-12 16:18:52 +01:00
Marcin Maliszkiewicz	70366168aa	test: pylib: improve error reporting in get_scylla_2025_1_executable Curl or other tools this function calls will now log error in the place they fail instead of doing plain assert.	2026-02-12 16:18:52 +01:00
Andrzej Jackowski	9ffa62a986	test: explicitly set compression algorithm in test_autoretrain_dict When `test_autoretrain_dict` was originally written, the default `sstable_compression_user_table_options` was `LZ4Compressor`. The test assumed (correctly) that initially the compression doesn't use a trained dictionary, and later in the test scenario, it changed the algorithm to one with a dictionary. However, the default `sstable_compression_user_table_options` is now `LZ4WithDictsCompressor`, so the old assumption is no longer correct. As a result, the assertion that data is initially not compressed well may or may not fail depending on dictionary training timing. To fix this, this commit explicitly sets `ZstdCompressor` as the initial `sstable_compression_user_table_options`, ensuring that the assumption that initial compression is without a dictionary is always met. Note: `ZstdCompressor` differs from the former default `LZ4Compressor`. However, it's a better choice — the test aims to show the benefit of using a dictionary, not the benefit of Zstd over LZ4 (and the test uses ZstdWithDictsCompressor as the algorithm with the dictionary). Fixes: scylladb/scylladb#28204	2026-02-12 14:58:39 +01:00
Andrzej Jackowski	e63cfc38b3	test: remove unneeded semicolons from python test	2026-02-12 14:49:17 +01:00
Ferenc Szili	d7cfaf3f84	test, simulator: compute load based on tablet size instead of count This patch changes the load balancing simulator so that it computes table load based on tablet sizes instead of tablet count. best_shard_overcommit measured minimal allowed overcommit in cases where the number of tablets can not be evenly distributed across all the available shards. This is still the case, but instead of computing it as an integer div_ceil() of the average shard load, it is now computed by allocating the tablet sizes using the largest-tablet-first method. From these, we can get the lowest overcommit for the given set of nodes, shards and tablet sizes.	2026-02-12 12:54:55 +01:00
Ferenc Szili	216443c050	test, simulator: generate tablet sizes and update load_stats This change adds a random tablet size generator. The tablet sizes are created in load_stats. Further changes to the load balance simulator: - apply_plan() updates the load_stats after a migration plan is issued by the load balancer, - adds the option to set a command line option which controls the tablet size deviation factor.	2026-02-12 12:54:55 +01:00
Ferenc Szili	e31870a02d	test, simulator: postpone creation of load_stats_ptr With size based load balancing, we will have to move the tablet size in load_stats after each internode migration issued by balance_tablets(). This will be done in a subsequent commit in apply_plan() which is called from rebalance_tablets(). Currently, rebalance_tablets() is passed a load_stats_ptr which is defined as: using load_stats_ptr = lw_shared_ptr<const load_stats>; Because this is a pointer to const, apply_plan() can't modify it. So, we pass a reference to load_stats to rebalance_tablets() and create a load_stats_ptr from it for each call to balance_tablets().	2026-02-12 12:54:55 +01:00
Aleksandra Martyniuk	f955a90309	test: fix test_remove_node_violating_rf_rack_with_rack_list test_remove_node_violating_rf_rack_with_rack_list creates a cluster with four nodes. One of the nodes is excluded, then another one is stopped, excluded, and removed. If the two stopped nodes were both voters, the majority is lost and the cluster loses its raft leader. As a result, the node cannot be removed and the operation times out. Add the 5th node to the cluster. This way the majority is always up. Fixes: https://github.com/scylladb/scylladb/issues/28596. Closes scylladb/scylladb#28610	2026-02-12 12:58:48 +02:00
Ferenc Szili	4ca40929ef	test: add read barrier to test_balance_empty_tablets The test creates a single node cluster, then creates 3 tables which remain empty. Then it adds another node with half the disk capacity of the first one, and then it waits for the balancer to migrate tablets to the newly added node by calling the quiesce topology API. The number of tablets on the smaller node should be exactly half the number of tablets on the larger node. After waiting for quiesce topology, we could have a situation where we query the number of tablets from the node which still hasn't processed the last tablet migrations and updated system.tablets. This patch adds a read barrier so that both nodes see the same tablets metadata before we query the number of tablets. Fixes: SCYLLADB-603 Closes scylladb/scylladb#28598	2026-02-12 11:16:34 +02:00
Karol Nowacki	aef5ff7491	vector_search: test: Fix flaky cert rewrite test The test is flaky most likely because when TLS certificate rewrite happens simultaneously with an ANN request, the handshake can hang for a long time (~60s). This leads to a timeout in the test case. This change introduces a checkpoint in the test so that it will wait for the certificate rewrite to happen before sending an ANN request, which should prevent the handshake from hanging and make the test more reliable. Fixes: #28012	2026-02-12 09:58:54 +01:00
Dawid Pawlik	4e32502bb3	test/vector_search: add reproducer for rescoring with zero vectors Add reproducer for the SCYLLADB-456 issue following exception on ANN vector queries with rescoring with similarity cosine.	2026-02-11 13:41:09 +01:00
Dawid Pawlik	af0889d194	vector_search: return NaN for similarity_cosine with all-zero vectors The ANN vector queries with all-zero vectors are allowed even on vector indexes with similarity function set to cosine. When enabling the rescoring option, those queries would fail as the rescoring calls `similarity_cosine` function underneath, causing an `InvalidRequest` exception as all-zero vectors were not allowed matching Cassandra's behaviour. To eliminate the discrepancy we want the all-zero vector `similarity_cosine` calls to pass, but return the NaN as the cosine similarity for zero vectors is mathematically incorrect. We decided not to use arbitrary values contrary to USearch, for which the distance (not to be confused with similarity) is defined as cos(0, 0) = 0, cos(0, x) = 1 while supporting the range of values [0, 2]. If we wanted to convert that to similarity, that would mean sim_cos(0, x) = 0.5, which does not support mathematical reasoning why that would be more similar than for example vectors marking obtuse angles. It's safe to assume that all-zero vectors for cosine similarity shouldn't make any impact, therefore we return NaN and eliminate them from best results. Adjusted the tests accordingly to check both proper Cassandra and Scylla's behaviour. Fixes: SCYLLADB-456	2026-02-11 12:31:47 +01:00
Pavel Emelyanov	2a3a56850c	test: Fix the condition for streaming directions validation Commit `ea8a661119` tried to reduce the dataset for restoration tests. While doing so, it effectively disabled part of itself -- the checks for streaming directions were never run after this change. The thing is that this check only runs if the restored tablet count matches a hardcoded value of 512. This was the real dataset size of the test before the aforementioned commit, but after it the size changed to other values, and the comparison with 512 became always False. Fix it with a local variable to prevent such mistakes in the future. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-02-11 12:55:27 +03:00
Pavel Emelyanov	f187dceb1a	test: Split test_backup.py::check_data_is_back() into two This method does two things -- checks that the data is indeed back, and validates streaming directions. The latter is not quite about "data is back", so better to have it as explicit dedicated method. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-02-11 12:54:20 +03:00
Dawid Mędrek	f83f911bae	test: cluster: Reduce wait time in test_sync_point If everything is OK, the sync point will not resolve with node 3 dead. As a result, the waiting will use all of the time we allocate for it, i.e. 30 seconds. That's a lot of time. There's no easy way to verify that the sync point will NOT resolve, but let's at least reduce the waiting to 3 seconds. If there's a bug, it should be enough to trigger it at some point, while reducing the average time needed for CI.	2026-02-10 17:05:02 +01:00
Dawid Mędrek	a256ba7de0	test: cluster: Fix test_sync_point The test had a few shortcomings that made it flaky or simply wrong: 1. We were verifying that hints were written by checking the size of in-flight hints. However, that could potentially lead to problems in rare situations. For instance, if all of the hints failed to be written to disk, the size of in-flight hints would drop to zero, but creating a sync point would correspond to the empty state. In such a situation, we should fail immediately and indicate what the cause was. 2. A sync point corresponds to the hints that have already been written to disk. The number of those is tracked by the metric `written`. It's a much more reliable way to make sure that hints have been written to the commitlog. That ensures that the sync point we'll create will really correspond to those hints. 3. The auxiliary function `wait_for` used in the test works like this: it executes the passed callback and looks at the result. If it's `None`, it retries it. Otherwise, the callback is deemed to have finished its execution and no further retries will be attempted. Before this commit, we simply returned a bool, and so the code was wrong. We improve it. Note that this fixes scylladb/scylladb#28203, which was a manifestation of scylladb/scylladb#25879. We created a sync point that corresponded to the empty state, and so it immediately resolved, even when node 3 was still dead. Refs scylladb/scylladb#25879 Fixes scylladb/scylladb#28203	2026-02-10 17:05:02 +01:00
Dawid Mędrek	c5239edf2a	test: cluster: Await sync points asynchronously There's a dedicated HTTP API for communicating with the cluster, so let's use it instead of yet another custom solution.	2026-02-10 17:05:02 +01:00
Dawid Mędrek	ac4af5f461	test: cluster: Create sync points asynchronously There's a dedicated HTTP API for communicating with the nodes, so let's use it instead of yet another custom solution.	2026-02-10 17:05:01 +01:00
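
The `wait_for` behaviour described in commit a256ba7de0 (retry while the callback returns `None`, stop on any other value) can be illustrated with a minimal sketch. This is not the helper from the repository: `wait_for_sketch`, `make_metric_reader` and the sample metric values are hypothetical names and data made up for illustration, and the real helper may differ in signature and in being asynchronous.

```python
import time
from typing import Callable, Iterable, Optional, TypeVar

T = TypeVar("T")


def wait_for_sketch(pred: Callable[[], Optional[T]], timeout: float, period: float = 0.01) -> T:
    """Hypothetical helper: call pred until it returns something other than None."""
    deadline = time.monotonic() + timeout
    while True:
        result = pred()
        if result is not None:
            # Any non-None value ends the wait, including False.
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("predicate kept returning None until the deadline")
        time.sleep(period)


def make_metric_reader(samples: Iterable[int]) -> Callable[[], int]:
    """Hypothetical stand-in for reading a metric; yields one sample per call."""
    it = iter(samples)
    return lambda: next(it)


# Buggy pattern: the callback returns a bool. False is not None,
# so the wait ends on the very first call without the condition holding.
read_written = make_metric_reader([0, 0, 3])
buggy_result = wait_for_sketch(lambda: read_written() > 0, timeout=1.0)

# Fixed pattern: return None to request a retry, a real value to finish.
read_written = make_metric_reader([0, 0, 3])


def hints_written() -> Optional[int]:
    value = read_written()
    return value if value > 0 else None


fixed_result = wait_for_sketch(hints_written, timeout=1.0)
```

With these samples, `buggy_result` is `False` after a single call, while `fixed_result` is `3` after two retries.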


11801 Commits
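
The zero-vector handling described in commit af0889d194 (return NaN from `similarity_cosine` for all-zero vectors and drop such rows from the best results) can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the server's C++ implementation: `similarity_cosine_sketch` and `best_results` are hypothetical names, and the function returns the plain cosine of the angle without any rescaling the real function may apply.

```python
import math
from typing import List, Sequence, Tuple


def similarity_cosine_sketch(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain cosine of the angle between a and b; NaN if either vector is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # The cosine is undefined for a zero vector, so signal that with NaN
        # instead of raising an error or picking an arbitrary value.
        return math.nan
    return dot / (norm_a * norm_b)


def best_results(query: Sequence[float], rows: List[Tuple[str, Sequence[float]]]) -> List[str]:
    """Hypothetical rescoring step: rank rows by similarity, skipping NaN scores."""
    scored = [(similarity_cosine_sketch(query, vec), key) for key, vec in rows]
    ranked = sorted((s, k) for s, k in scored if not math.isnan(s))
    return [k for _, k in reversed(ranked)]


rows = [("same", [1.0, 0.0]), ("zero", [0.0, 0.0]), ("orthogonal", [0.0, 1.0])]
ranking = best_results([1.0, 0.0], rows)
```

Here `ranking` contains only `"same"` and `"orthogonal"`, in that order; the all-zero row gets a NaN score and is dropped rather than causing an error.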