scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-28 10:41:12 +00:00

Author	SHA1	Message	Date
Michael Litvak	bf7bc5b410	logstor: code cleanup misc code cleanup and small changes	2026-03-31 18:40:56 +02:00
Piotr Smaron	2ce409dca0	test: clean up fuzzy_test_config and add comments Remove the unused timeout field from fuzzy_test_config. It was declared, initialized per build mode, and logged, but never actually enforced anywhere. Document the intentionally small max_size (1024 bytes) passed to read_partitions_with_paged_scan in run_fuzzy_test_scan: it forces many pages per scan to stress the paging and result-merging logic.	2026-03-31 17:13:26 +02:00
Piotr Smaron	df2924b2a3	test: fix fuzzy_test timeout in release mode The multishard_query_test/fuzzy_test was timing out (SIGKILL after 15 minutes) in release mode CI. In release mode the test generates up to 64 partitions with up to 1000 clustering rows and 1000 range tombstones each. With deeply nested randomly-generated types (e.g. frozen<map<varint, frozen<map<frozen<tuple<...>>>>>>), this volume of data can exceed the 15-minute CI timeout. Reduce the release-mode clustering-row and range-tombstone distributions from 0-1000 to 0-200. This caps the worst case at ~12,800 rows -- still 2x the devel-mode maximum (0-100) and sufficient to exercise multi-partition paged scanning with many pages. Fixes: SCYLLADB-1270	2026-03-31 17:13:06 +02:00
Piotr Szymaniak	6d8ec8a0c0	alternator: fix flaky test_update_condition_unused_entries_short_circuit The test was flaky because it stopped dc2_node immediately after an LWT write, before cross-DC replication could complete. The LWT commit uses LOCAL_QUORUM, which only guarantees persistence in the coordinator's DC. Replication to the remote DC is async background work, and CAS mutations don't store hints. Stopping dc2_node could drop in-flight RPCs, leaving DC1 without the mutation. Fix by polling both live DC1 nodes after the write to confirm cross-DC replication completed before stopping dc2_node. Both nodes must have the data so that the later ConsistentRead=True (LOCAL_QUORUM) read on restarted node1 is guaranteed to succeed. Fixes SCYLLADB-1267 Closes scylladb/scylladb#29287	2026-03-31 16:50:51 +03:00
Avi Kivity	216d39883a	Merge 'test: audit: fix audit test syslog race' from Dario Mirovic Fix two independent race conditions in the syslog audit test that cause intermittent `assert 2 <= 1` failures in `assert_entries_were_added`. Datagram ordering race: `UnixSockerListener` used `ThreadingUnixDatagramServer`, where each datagram spawns a new thread. The notification barrier in `get_lines()` assumes FIFO handling, but the notification thread can win the lock before an audit entry thread, so `clear_audit_logs()` misses entries that arrive moments later. Fix: switch to sequential `UnixDatagramServer`. Config reload race: The live-update path used `wait_for_config` (REST API poll on shard 0) which can return before `broadcast_to_all_shards()` completes. Fix: wait for `"completed re-reading configuration file"` in the server log after each SIGHUP, which guarantees all shards have the new config. Fixes SCYLLADB-1277 This is CI improvement for the latest code. No need for backport. Closes scylladb/scylladb#29282 * github.com:scylladb/scylladb: test: cluster: wait for full config reload in audit live-update path test: cluster: fix syslog listener datagram ordering race	2026-03-31 13:53:01 +03:00
Tomasz Grabiec	b355bb70c2	dtest/alternator: stop concurrent-requests test when workers hit limit `test_limit_concurrent_requests` could create far more tables than intended because worker threads looped indefinitely and only the probe path terminated the test. In practice, workers often hit `RequestLimitExceeded` first, but the test kept running and creating tables, increasing memory pressure and causing flakiness due to bad_alloc errors in logs. Fix by replacing the old probe-driven termination with worker-driven termination. Workers now run until any worker sees `RequestLimitExceeded`. Fixes SCYLLADB-1181 Closes scylladb/scylladb#29270	2026-03-31 13:35:50 +03:00
Patryk Jędrzejczak	b9f82f6f23	raft_group0: join_group0: fix join hang when node joins group 0 before post_server_start A joining node hung forever if the topology coordinator added it to the group 0 configuration before the node reached `post_server_start`. In that case, `server->get_configuration().contains(my_id)` returned true and the node broke out of the join loop early, skipping `post_server_start`. `_join_node_group0_started` was therefore never set, so the node's `join_node_response` RPC handler blocked indefinitely. Meanwhile the topology coordinator's `respond_to_joining_node` call (which has no timeout) hung forever waiting for the reply that never came. Fix by only taking the early-break path when not starting as a follower (i.e. when the node is the discovery leader or is restarting). A joining node must always reach `post_server_start`. We also provide a regression test. It takes 6s in dev mode. Fixes SCYLLADB-959 Closes scylladb/scylladb#29266	2026-03-31 12:33:56 +02:00
Ferenc Szili	7b308f3aa0	test: verify hints are delivered during tablet RF reduction Add test_hint_to_leaving_when_reducing_rf which verifies that mutations stored as hints are delivered to the correct replicas when a tablet is removed due to RF reduction. The test sets up a 3-node cluster with RF=2, drops the hint for one replica via error injection, then reduces RF to 1 while hints are pending. It asserts that the mutation is readable after the topology change completes. Also adds a "drop_hint_for_host" error injection point in hint_endpoint_manager to selectively drop hints for a specific host.	2026-03-31 09:18:42 +02:00
Dario Mirovic	0cb63fb669	test: cluster: wait for full config reload in audit live-update path _apply_config_to_running_servers used wait_for_config (REST API poll) to confirm live config updates. The REST API reads from shard 0 only, so it can return before broadcast_to_all_shards() completes — other shards may still have stale audit config, generating unexpected entries. Additionally, server_remove_config_option for absent keys sent separate SIGHUPs before server_update_config, and the single wait_for_config at the end could match a completion from an earlier SIGHUP. Wait for "completed re-reading configuration file" in the server log after each SIGHUP-producing operation. This message is logged only after both read_config() and broadcast_to_all_shards() finish, guaranteeing all shards have the new config. Each operation gets its own mark+wait so no stale completion is matched. Fixes SCYLLADB-1277	2026-03-31 02:27:11 +02:00
Dario Mirovic	1d623196eb	test: cluster: fix syslog listener datagram ordering race UnixSockerListener used ThreadingUnixDatagramServer, which spawns a new thread per datagram. The notification barrier in get_lines() relies on all prior datagrams being handled before the notification. With threading, the notification handler can win the lock before an audit entry handler, so get_lines() returns before the entry is appended. clear_audit_logs() then clears an incomplete buffer, and the late entry leaks into the next test's before/after diff. Switch to sequential UnixDatagramServer. The server thread now handles datagrams in kernel FIFO order, so the notification is always processed after all preceding audit entries. Refs SCYLLADB-1277	2026-03-31 02:27:11 +02:00
Karol Nowacki	493a4433e7	index: fix DESC INDEX for vector index The `DESC INDEX` command returned incorrect results for local vector indexes and for vector indexes that included filtering columns. This patch corrects the implementation to ensure `DESCRIBE INDEX` accurately reflects the index configuration. This was a pre-existing issue, not a regression from recent serialization schema changes for vector index target options.	2026-03-30 16:46:48 +02:00
Karol Nowacki	a32e4bb9f4	vector_search: test: refactor boilerplate setup The test boilerplate setup for some vector store client tests has been extracted to a common function.	2026-03-30 16:46:48 +02:00
Karol Nowacki	6bc88e817f	vector_search: fix SELECT on local vector index Queries against local vector indexes were failing with the error: "ANN ordering by vector requires the column to be indexed using 'vector_index'" This was a regression introduced by `15788c3734`, which incorrectly assumed the first column in the targets list is always the vector column. For local vector indexes, the first column is the partition key, causing the failure. Previously, serialization logic for the target index option was shared between vector and secondary indexes. This is no longer viable due to the introduction of local vector indexes and vector indexes with filtering columns, which have different target format. This commit introduces a dedicated JSON-based serialization format for vector index targets, identifying the target column (tc), filtering columns (fc), and partition key columns (pk). This ensures unambiguous serialization and deserialization for all vector index types. This change is backward compatible for regular vector indexes. However, it breaks compatibility for local vector indexes and vector indexes with filtering columns created in version 2026.1.0. To mitigate this, usage of these specific index types will be blocked in the 2026.1.0 release by failing ANN queries against them in vector-store service. Fixes: SCYLLADB-895	2026-03-30 16:46:48 +02:00
Karol Nowacki	c0b78477a5	index: test: vector index target option serialization test This test ensures that the serialization format for vector index target options remains stable. Maintaining backward compatibility is critical because the index is restored from this property on startup. Any unintended changes to the serialization schema could break existing indexes after an upgrade. This option is also an interface for the vector-store service, which uses it to identify the indexed column.	2026-03-30 16:46:48 +02:00
Karol Nowacki	4dc28dfa52	index: test: secondary index target option serialization test Target option serialization must remain stable for backward compatibility. The index is restored from this property on startup, so unintentional changes to the serialization schema can break indexes after upgrade.	2026-03-30 16:46:47 +02:00
Andrzej Jackowski	ab43420d30	test: use exclusive driver connection in test_limited_concurrency_of_writes Use get_cql_exclusive(node1) so the driver only connects to node1 and never attempts to contact the stopped node2. The test was flaky because the driver received `Host has been marked down or removed` from node2. Fixes: SCYLLADB-1227 Closes scylladb/scylladb#29268	2026-03-30 11:50:44 +02:00
Botond Dénes	068a7894aa	test/cluster: fix flaky test_cleanup_stop by using asyncio.sleep The test was using time.sleep(1) (a blocking call) to wait after scheduling the stop_compaction task, intending to let it register on the server before releasing the sstable_cleanup_wait injection point. However, time.sleep() blocks the asyncio event loop entirely, so the asyncio.create_task(stop_compaction) task never gets to run during the sleep. After the sleep, the directly-awaited message_injection() runs first, releasing the injection point before stop_compaction is even sent. By the time stop_compaction reaches Scylla, the cleanup has already completed successfully -- no exception is raised and the test fails. Fix by replacing time.sleep(1) with await asyncio.sleep(1), which yields control to the event loop and allows the stop_compaction task to actually send its HTTP request before message_injection is called. Fixes: SCYLLADB-834 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Closes scylladb/scylladb#29202	2026-03-30 11:40:47 +03:00
Ernest Zaslavsky	f3a91df0b4	test/cmake: add missing tests to boost test suite Add symmetric_key_test (standalone, links encryption library) and auth_cache_test to the combined_tests binary. These tests already exist in configure.py; this aligns the CMake build.	2026-03-29 16:17:45 +03:00
Ernest Zaslavsky	de606cc17a	test/cmake: remove per-test LTO disable The per-test -fno-lto link option is now redundant since -fno-lto was added globally in mode.common.cmake. LTO-enabled targets (the scylla binary in RelWithDebInfo) override it via enable_lto().	2026-03-29 16:17:45 +03:00
Ernest Zaslavsky	7e72898150	cmake: move SEASTAR_TESTING_MAIN after seastar and abseil subdirs Place add_compile_definitions(SEASTAR_TESTING_MAIN) after both add_subdirectory(seastar) and add_subdirectory(abseil) are processed. This matches configure.py's global define without leaking into seastar's subdirectory build (which would cause a duplicate main symbol in seastar_testing). Remove the now-redundant per-test SEASTAR_TESTING_MAIN compile definition from test/CMakeLists.txt.	2026-03-29 16:17:45 +03:00
Nadav Har'El	d32fe72252	Merge 'alternator: check concurrency limit before memory acquisition' from Łukasz Paszkowski Fix the ordering of the concurrency limit check in the Alternator HTTP server so it happens before memory acquisition, and reduce test pressure to avoid LSA exhaustion on the memory-constrained test node. The patch moves the concurrency check to right after the content-length early-out, before any memory acquisition or I/O. The check was originally placed before memory acquisition but was inadvertently moved after it during a refactoring. This allowed unlimited requests to pile up consuming memory, reading bodies, verifying signatures, and decompressing — all before being rejected. Restores the original ordering and mirrors the CQL transport (`transport/server.cc`). Lowers `concurrent_requests_limit` from 5 to 3 and the thread multiplier from 5 to 2 (6 threads instead of 25). This is still sufficient to reliably trigger RequestLimitExceeded, while keeping flush pressure within what 512MB per shard can sustain. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1248 Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1181 The test started to fail quite recently. It affects master only. No backport is needed. We might want to consider backporting a commit moving the concurrency check earlier. Closes scylladb/scylladb#29272 * github.com:scylladb/scylladb: test: reduce concurrent-request-limit test pressure to avoid LSA exhaustion alternator: check concurrency limit before memory acquisition	2026-03-29 11:08:28 +03:00
Łukasz Paszkowski	b8e3ef0c64	test: reduce concurrent-request-limit test pressure to avoid LSA exhaustion The test_limit_concurrent_requests dtest uses concurrent CreateTable requests to verify Alternator's concurrency limiting. Each admitted CreateTable triggers Raft consensus, schema mutations, and memtable flushes—all of which consume LSA memory. On the 1 GB test node (2 SMP × 512 MB), the original settings (limit=5, 25 threads) created enough flush pressure to exhaust the LSA emergency reserve, producing logalloc::bad_alloc errors in the node log. The test was always marginal under these settings and became flaky as new system tables increased baseline LSA usage over time. Lower concurrent_requests_limit from 5 to 3 and the thread multiplier from 5 to 2 (6 threads total). This is still well above the limit and sufficient to reliably trigger RequestLimitExceeded, while keeping flush pressure within what 512 MB per shard can sustain.	2026-03-28 20:40:33 +01:00
Aleksandra Martyniuk	166b293d06	test: add test_failed_tablet_rebuild_is_retried_on_alter Test if alter keyspace statement with the current rf values will fix the state of replicas.	2026-03-27 17:29:31 +01:00
Aleksandra Martyniuk	9ec54a8207	test: add a test to ensure that failed rebuilds are retried	2026-03-27 17:29:31 +01:00
Botond Dénes	854c374ebf	test/encryption: wait for topology convergence after abrupt restart test_reboot uses a custom restart function that SIGKILLs and restarts nodes sequentially. After all nodes are back up, the test proceeded directly to reads after wait_for_cql_and_get_hosts(), which only confirms CQL reachability. While a node is restarted, other nodes might execute global token metadata barriers, which advance the topology fence version. The restarted node has to learn about the new version before it can send reads/writes to the other nodes. The test issues reads as soon as the CQL port is opened, which might happen before the last restarted node learns of the latest topology version. If this node acts as a coordinator for reads/writes before this happens, these will fail as the other nodes will reject the ops with the outdated topology fence version. Fix this by replacing wait_for_cql_and_get_hosts() on the abrupt-restart path with the more robust get_ready_cql(), which makes sure servers see each other before refreshing the cql connection. This should ensure that nodes have exchanged gossip and converged on topology state before any reads are executed. The rolling_restart() path is unaffected as it handles this internally. Fixes: SCYLLADB-557 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Closes scylladb/scylladb#29211	2026-03-27 09:52:27 +01:00
Avi Kivity	b708e5d7c9	Merge 'test: fix race condition in test_crashed_node_substitution' from Sergey Zolotukhin `test_crashed_node_substitution` intermittently failed: ```python assert len(gossiper_eps) == (len(server_eps) + 1) ``` The test crashed the node right after a single ACK2 handshake (`finished do_send_ack2_msg`), assuming the node state was visible to all peers. However, since gossip is eventually consistent, the update may not have propagated yet, so some nodes did not see the failed node. This change: Wait until the gossiper state is visible on peers before continuing the test and asserting. Fixes: [SCYLLADB-1256](https://scylladb.atlassian.net/browse/SCYLLADB-1256). backport: this issue may affect CI for all branches, so should be backported to all versions. [SCYLLADB-1256]: https://scylladb.atlassian.net/browse/SCYLLADB-1256?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ Closes scylladb/scylladb#29254 * github.com:scylladb/scylladb: test: test_crashed_node_substitution: add docstring and fix whitespace test: fix race condition in test_crashed_node_substitution	2026-03-26 21:40:33 +02:00
Petr Gusev	c38e312321	test_lwt_fencing_upgrade: fix quorum failure due to gossip lag If lwt_workload() sends an update immediately after a rolling restart, the coordinator might still see a replica as down due to gossip lagging behind. Concurrently restarting another node leaves only one available replica, failing the LOCAL_QUORUM requirement for learn or eventually consistent sp::query() in sp::cas() and resulting in a mutation_write_failure_exception. We fix this problem by waiting for the restarted server to see 2 other peers. The server_change_version doesn't do that by default -- it passes wait_others=0 to server_start(). Fixes SCYLLADB-1136 Closes scylladb/scylladb#29234	2026-03-26 21:25:53 +02:00
bitpathfinder	627a8294ed	test: test_crashed_node_substitution: add docstring and fix whitespace Add a description of the test's intent and scenario; remove extra blanks.	2026-03-26 18:40:17 +01:00
bitpathfinder	5a086ae9b7	test: fix race condition in test_crashed_node_substitution `test_crashed_node_substitution` intermittently failed: ``` assert len(gossiper_eps) == (len(server_eps) + 1) ``` The test crashed the node right after a single ACK2 handshake ("finished do_send_ack2_msg"), assuming the node state was visible to all peers. However, since gossip is eventually consistent, the update may not have propagated yet, so some nodes did not see the failed node. This change: Wait until the gossiper state is visible on peers before continuing the test and asserting. Fixes: SCYLLADB-1256.	2026-03-26 18:25:05 +01:00
Robert Bindar	c575bbf1e8	test_refresh_deletes_uploaded_sstables should wait for sstables to get deleted SSTable unlinking is async, so in some cases it may happen that the upload dir is not empty immediately after refresh is done. This patch adjusts test_refresh_deletes_uploaded_sstables so it waits with a timeout till the upload dir becomes empty instead of just assuming the API will sync on sstables being gone. Fixes SCYLLADB-1190 Signed-off-by: Robert Bindar <robert.bindar@scylladb.com> Closes scylladb/scylladb#29215	2026-03-26 08:43:14 +03:00
Nikos Dragazis	8789c95a85	test: cluster: Add test for migration of multiple keyspaces Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2026-03-25 19:11:29 +02:00
Nikos Dragazis	25af8bdc24	test: cluster: Add test for error conditions Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2026-03-25 19:11:29 +02:00
Nikos Dragazis	01a51817c4	test: cluster: Add vnodes->tablets migration test (rollback) Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2026-03-25 19:11:29 +02:00
Nikos Dragazis	56ec33d3e0	test: cluster: Add vnodes->tablets migration test (1 table, 3 nodes) Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2026-03-25 19:11:29 +02:00
Nikos Dragazis	58e930c490	test: cluster: Add vnodes->tablets migration test (1 table, 1 node) This test runs the vnodes-to-tablets migration for a single table on a single-node cluster. The node has multiple shards and multiple power-of-two aligned vnodes, so resharding is triggered. More details in the docstring. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2026-03-25 19:11:29 +02:00
Nikos Dragazis	2a5e6b832a	api: Add REST endpoint for vnode-to-tablet migration status If the keyspace is migrating, it reports the intended and actual storage mode for each node. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2026-03-25 19:11:24 +02:00
Marcin Maliszkiewicz	7fdd650009	Merge 'test: audit: clean up test helper class naming' from Dario Mirovic Remove unused `pytest.mark.single_node` marker from `TestCQLAudit`. Rename `TestCQLAudit` to `CQLAuditTester` to reflect that it is a test helper, not a test class. This avoids accidental pytest collection and subsequent warning about `__init__`. Logs before the fixes: ``` test/cluster/test_audit.py:514: 14 warnings /home/dario/dev/scylladb/test/cluster/test_audit.py:514: PytestCollectionWarning: cannot collect test class 'TestCQLAudit' because it has a __init__ constructor (from: cluster/test_audit.py) @pytest.mark.single_node ``` Fixes SCYLLADB-1237 This is an addition to the latest master code. No backport needed. Closes scylladb/scylladb#29237 * github.com:scylladb/scylladb: test: audit: rename TestCQLAudit to CQLAuditTester test: audit: remove unused pytest.mark.single_node	2026-03-25 15:30:16 +01:00
Radosław Cybulski	1dc20cc8f9	alternator/test: explain why 'always' write isolation mode is used in tests Improve test comments for test_streams_batchwrite_into_the_same_partition_deletes_existing_items and test_streams_batchwrite_into_the_same_partition_will_report_wrong_stream_data to explain why 'always' write isolation mode is required: in always_use_lwt mode all items in a batch get the same CDC timestamp, which triggers the squashing bug. In other modes each item gets a separate timestamp so the bug doesn't manifest. Also fix the example in the second test comment to use cleaner key values and correct event type (INSERT, not MODIFY, since items are inserted into an empty table), and fix the issue reference from #28452 (the PR) to #28439 (the issue).	2026-03-25 15:15:20 +01:00
Dario Mirovic	552a2d0995	test: audit: rename TestCQLAudit to CQLAuditTester pytest tries to collect tests for execution in several ways. One is to pick all classes that start with 'Test'. Those classes must not have a custom '__init__' constructor. TestCQLAudit does. TestCQLAudit after migration from test/cluster/dtest is not a test class anymore, but rather a helper class. There are two ways to fix this: 1. Add __test__ = False to the TestCQLAudit class 2. Rename it to not start with 'Test' Option 2 feels better because the new name itself does not convey the wrong message about its role. Fixes SCYLLADB-1237	2026-03-25 13:21:08 +01:00
Dario Mirovic	73de865ca3	test: audit: remove unused pytest.mark.single_node Remove unused pytest.mark.single_node in TestCQLAudit class. This is a leftover from audit tests migration from test/cluster/dtest to test/cluster. Refs SCYLLADB-1237	2026-03-25 13:18:37 +01:00
Radosław Cybulski	ded62b2c5e	alternator/test: add scylla_only to always write isolation fixture Add scylla_only fixture dependency to the test_table_ss_new_and_old_images_write_isolation_always fixture. This ensures all tests using the 'always' write isolation mode are skipped when running against DynamoDB (--aws), since the system:write_isolation tag is a Scylla-only feature.	2026-03-25 12:38:09 +01:00
Radosław Cybulski	7d404cdd51	alternator: fix BatchWriteItem squashed Streams entries BatchWriteItem with items for the same partition (and write isolation set to always) will trigger LWT and run a different cdc code path, which will result in wrong Streams data being returned to the user - changes will be randomly squashed together. For example batch write: batch.put_item(Item={'p': 'p', 'c': 'c0'}) batch.put_item(Item={'p': 'p', 'c': 'c1'}) batch.put_item(Item={'p': 'p', 'c': 'c2'}) instead of producing 3 modify / insert events will produce one: type=INSERT, key={'c': {'S': 'c0'}, 'p': {'S': 'p'}}, old_image=None, new_image={'c': {'S': 'c2'}, 'p': {'S': 'p'}} with `new_image` having a different `c` key from the `key` field. This happens because BatchWriteItem (when using LWT) emits its changes to cdc under the same timestamp. This results in all log entries being put in a single cdc "bucket" (under the same cdc$timestamp key). The previous parsing algorithm would interpret those changes as a change to a single item and squash them together. The patch rewrites the algorithm to use `std::unordered_map` for records based on the value of the clustering key, which is added to every cdc log entry. This allows rebuilding all item modifications. Fixes #28439 Fixes: SCYLLADB-540	2026-03-25 11:40:53 +01:00
Radosław Cybulski	85da03c88d	alternator: add BatchWriteItem test (failing) Add additional BatchWriteItem tests (some failing): - `test_streams_batchwrite_no_clustering_deletes_non_existing_items` `test_streams_batchwrite_no_clustering_deletes_existing_items` - those tests pass, we add them here for completeness, as non-clustering tables trigger different paths. - `test_streams_batchwrite_into_the_same_partition_deletes_existing_items` - failing test that checks combinations of puts and deletes in a single batch write (so for example 3 items, 2 puts and 1 delete). - `test_streams_batchwrite_into_the_same_partition_will_report_wrong_stream_data` - failing simple test. Tests fail because the current implementation, when writing cdc log entries, will squash all changes done to the same partition together. The data is still there, but when GetRecords is called and we parse cdc log entries, we don't correctly recover it (see issue #28439 for more details).	2026-03-25 11:40:53 +01:00
Marcin Maliszkiewicz	f988ec18cb	test/lib: fix port in-use detection in start_docker_service Previously, the result of when_all was discarded. when_all stores exceptions in the returned futures rather than throwing, so the outer catch(in_use&) could never trigger. Now we capture the when_all result and inspect each future individually to properly detect in_use from either stream. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1216 Closes scylladb/scylladb#29219	2026-03-25 11:45:53 +02:00
Artsiom Mishuta	cd1679934c	test/pylib: use exponential backoff in wait_for() Change wait_for() defaults from period=1s/no backoff to period=0.1s with 1.5x backoff capped at 1.0s. This catches fast conditions in 100ms instead of 1000ms, benefiting ~100 call sites automatically. Add completion logging with elapsed time and iteration count. Tested locally with test/cluster/test_fencing.py::test_fence_hints (dev mode), log output: wait_for(at_least_one_hint_failed) completed in 0.83s (4 iterations) wait_for(exactly_one_hint_sent) completed in 1.34s (5 iterations) Fixes SCYLLADB-738 Closes scylladb/scylladb#29173	2026-03-24 23:49:49 +02:00
Botond Dénes	d52fbf7ada	Merge 'test: cluster: Deflake test_startup_with_keyspaces_violating_rf_rack_valid_keyspaces' from Dawid Mędrek The test was flaky. The scenario looked like this: 1. Stop server 1. 2. Set its rf_rack_valid_keyspaces configuration option to true. 3. Create an RF-rack-invalid keyspace. 4. Start server 1 and expect a failure during start-up. It was wrong. We cannot predict when the Raft mutation corresponding to the newly created keyspace will arrive at the node or when it will be processed. If the check of the RF-rack-valid keyspaces we perform at start-up was done before that, it won't include the keyspace. This will lead to a test failure. Unfortunately, it's not feasible to perform a read barrier during start-up. What's more, although it would help the test, it wouldn't be useful otherwise. Because of that, we simply fix the test, at least for now. The new scenario looks like this: 1. Disable the rf_rack_valid_keyspaces configuration option on server 1. 2. Start the server. 3. Create an RF-rack-invalid keyspace. 4. Perform a read barrier on server 1. This will ensure that it has observed all Raft mutations, and we won't run into the same problem. 5. Stop the node. 6. Set its rf_rack_valid_keyspaces configuration option to true. 7. Try to start the node and observe a failure. This will make the test perform consistently. --- I ran the test (in dev mode, on my local machine) three times before these changes, and three times with them. I include the time results below. Before: ``` real 0m47.570s user 0m41.631s sys 0m8.634s real 0m50.495s user 0m42.499s sys 0m8.607s real 0m50.375s user 0m41.832s sys 0m8.789s ``` After: ``` real 0m50.509s user 0m43.535s sys 0m9.715s real 0m50.857s user 0m44.185s sys 0m9.811s real 0m50.873s user 0m44.289s sys 0m9.737s ``` Fixes SCYLLADB-1137 Backport: The test is present on all supported branches, and so we should backport these changes to them. Closes scylladb/scylladb#29218 * github.com:scylladb/scylladb: test: cluster: Deflake test_startup_with_keyspaces_violating_rf_rack_valid_keyspaces test: cluster: Mark test with @pytest.mark.asyncio in test_multidc.py	2026-03-24 21:09:19 +02:00
Patryk Jędrzejczak	141aa2d696	Merge 'test/cluster/test_incremental_repair.py: fix typo + enable compaction DEBUG logs' from Botond Dénes This PR contains two small improvements to `test_incremental_repair.py` motivated by the sporadic failure of `test_tablet_incremental_repair_and_scrubsstables_abort`. The test fails with `assert 3 == 2` on `len(sst_add)` in the second repair round. The extra SSTable has `repaired_at=0`, meaning scrub unexpectedly produced more unrepaired SSTables than anticipated. Since scrub (and compaction in general) logs at DEBUG level and the test did not enable debug logging, the existing logs do not contain enough information to determine the root cause. Commit 1 fixes a long-standing typo in the helper function name (`preapre` -> `prepare`). Commit 2 enables `compaction=debug` for the Scylla nodes started by `do_tablet_incremental_repair_and_ops`, which covers all `test_tablet_incremental_repair_and_` variants. This will capture full compaction/scrub activity on the next reproduction, making the failure diagnosable. Refs: SCYLLADB-1086 Backport: test improvement, no backport Closes scylladb/scylladb#29175 https://github.com/scylladb/scylladb: test/cluster/test_incremental_repair.py: enable compaction DEBUG logs in do_tablet_incremental_repair_and_ops test/cluster/test_incremental_repair.py: fix typo preapre -> prepare	2026-03-24 16:27:01 +01:00
Pavel Emelyanov	2d8540f1ee	transport: fix process_startup cert-auth path missing connection-ready setup When authenticate() returns a user directly (certificate-based auth, introduced in `20e9619bb1`), process_startup was missing the same post-authentication bookkeeping that the no-auth and SASL paths perform: - update_scheduling_group(): without it, the connection runs under the default scheduling group instead of the one mapped to the user's service level. - _authenticating = false / _ready = true: without them, system.clients reports connection_stage = AUTHENTICATING forever instead of READY. - on_connection_ready(): without it, the connection never releases its slot in the uninitialized-connections concurrency semaphore (acquired at connection creation), leaking one unit per cert-authenticated connection for the lifetime of the connection. The omission was introduced when on_connection_ready() was added to the else and SASL branches in `474e84199c` but the cert-auth branch was missed. Fixes: `20e9619bb1` ("auth: support certificate-based authentication") Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-24 18:02:46 +03:00
Pavel Emelyanov	da6fe14035	transport: test that connection_stage is READY after auth via all process_startup paths The cert-auth path in process_startup (introduced in `20e9619bb1`) was missing _ready = true, _authenticating = false, update_scheduling_group() and on_connection_ready(). The result is that connections authenticated via certificate show connection_stage = AUTHENTICATING in system.clients forever, run under the wrong service-level scheduling group, and hold the uninitialized-connections semaphore slot for the lifetime of the connection. Add a parametrized cluster test that verifies all three process_startup branches result in connection_stage = READY: - allow_all: AllowAllAuthenticator (no-auth path) - password: PasswordAuthenticator (SASL/process_auth_response path) - cert_bypass: CertificateAuthenticator with transport_early_auth_bypass error injection (cert-auth path -- the buggy one) The injection is added to certificate_authenticator::authenticate() so tests can bypass actual TLS certificate parsing while still exercising the cert-auth code path in process_startup. The cert_bypass case is marked xfail until the bug is fixed. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-24 18:01:28 +03:00
Benny Halevy	1a7b013377	test: add test_sstable_clone_preserves_staging_state	2026-03-24 16:48:01 +02:00
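
Note on commit 068a7894aa (test_cleanup_stop): the message explains that time.sleep() blocks the asyncio event loop, so a task created with asyncio.create_task() cannot start until the sleep returns. The snippet below is a minimal standalone sketch of that ordering effect. It is not the test's code: `background` is a hypothetical stand-in for the stop_compaction request, and the 0.05 s delay is arbitrary.

```python
import asyncio
import time


async def run(blocking: bool) -> list[str]:
    """Record the order in which the background task and the sleeper proceed."""
    events: list[str] = []

    async def background() -> None:
        # Hypothetical stand-in for the scheduled stop_compaction request.
        events.append("background ran")

    task = asyncio.create_task(background())
    if blocking:
        # Blocks the whole event loop: the task above cannot start yet.
        time.sleep(0.05)
    else:
        # Yields to the event loop: the task above runs during the sleep.
        await asyncio.sleep(0.05)
    events.append("after sleep")
    await task
    return events


blocking_order = asyncio.run(run(blocking=True))
yielding_order = asyncio.run(run(blocking=False))
print(blocking_order)
print(yielding_order)
```

With the blocking sleep the background task only runs once the coroutine reaches its next await, which mirrors the failure described in the commit message; with asyncio.sleep it runs first.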

11801 Commits
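
Note on commit cd1679934c (wait_for backoff): the message gives the new polling defaults as a 0.1 s initial period, a 1.5x multiplier, and a 1.0 s cap. The sketch below illustrates such a schedule under those numbers only. `backoff_periods` and `wait_for_sketch` are hypothetical names; the signature, return value, and timeout handling are assumptions and may differ from the real helper in test/pylib.

```python
import asyncio
import time


def backoff_periods(count, period=0.1, backoff=1.5, max_period=1.0):
    """Return the first `count` sleep periods of the capped backoff schedule."""
    periods = []
    for _ in range(count):
        periods.append(period)
        period = min(period * backoff, max_period)
    return periods


async def wait_for_sketch(pred, timeout_s, period=0.1, backoff=1.5, max_period=1.0):
    """Poll `pred` until it returns a non-None value (assumed convention)."""
    start = time.monotonic()
    iterations = 0
    while True:
        iterations += 1
        result = await pred()
        if result is not None:
            return result, iterations
        if time.monotonic() - start >= timeout_s:
            raise TimeoutError(f"condition not met after {iterations} iterations")
        await asyncio.sleep(period)
        # Grow the period geometrically, but never beyond the cap.
        period = min(period * backoff, max_period)


async def demo():
    calls = 0

    async def ready_on_third_poll():
        nonlocal calls
        calls += 1
        return "ready" if calls >= 3 else None

    # A short initial period keeps the demonstration fast.
    return await wait_for_sketch(ready_on_third_poll, timeout_s=5.0, period=0.01)


periods = backoff_periods(7)
demo_result = asyncio.run(demo())
print(periods)
print(demo_result)
```

Under these defaults the seventh period is the first to reach the 1.0 s cap, so a condition that becomes true quickly is noticed within roughly the first few hundred milliseconds while long waits settle at one poll per second.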