scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-24 10:30:38 +00:00

Author	SHA1	Message	Date
Piotr Dulikowski	bf369326d6	Merge 'vector_search: test: fix HTTPS client test flakiness' from Karol Nowacki The default 100ms timeout for client readiness in tests is too aggressive. In some test environments, this is not enough time for client creation, which involves address resolution and TLS certificate reading, leading to flaky tests. This commit increases the default client creation timeout to 10 seconds. This makes the tests more robust, especially in slower execution environments, and prevents similar flakiness in other test cases. Fixes: VECTOR-547, SCYLLADB-802, SCYLLADB-825, SCYLLADB-826 Backport to 2025.4 and 2026.1, as the same problem occurs on these branches and can potentially make the CI flaky there as well. Closes scylladb/scylladb#28846 * github.com:scylladb/scylladb: vector_search: test: include ANN error in assertion vector_search: test: fix HTTPS client test flakiness (cherry picked from commit `2fb981413a`) Closes scylladb/scylladb#28879	2026-03-04 18:09:04 +01:00
Anna Stuchlik	864774fb00	doc: fix the upgrade guide to 2026.1 The upgrade guide on branch-2026.1 has bugs caused by incorrectly resolved conflicts in the backport PR: https://github.com/scylladb/scylladb/pull/28835#issuecomment-3992474167 This commit fixes the issue. The fix only applies to branch-2026.1. Fixes https://github.com/scylladb/scylladb/issues/28871 Closes scylladb/scylladb#28872	2026-03-04 16:53:49 +02:00
Botond Dénes	49ed97cec8	Merge '[Backport 2026.1] Fix regression in Alternator TTL with tablets and node going down' from Scylladb[bot] Recently we suffered a regression in how Alternator TTL behaves when a node goes down when tablets are used. Usually, expiration of data in a particular tablet is handled by this tablet's "primary replica". However, if that node is down, we want another node to perform these expirations until the primary replica goes back online. We created a function `tablet_map::get_secondary_replica()` to select that "other node". We don't care too much what the "secondary replica" means, but we do care that it's different from the primary replica - if it's the same, the expiration of that tablet will never be done. It turns out that recently, in commits `817fdad` and `d88036d`, the implementation of get_primary_replica() changed without a corresponding change to get_secondary_replica(). After those changes, the two functions are mismatched, and sometimes return the same node for both primary and secondary replica. Unfortunately, although we had a dtest for the handling of a dead node in Alternator TTL, it failed to reproduce this bug, so this regression was missed - nothing else besides Alternator TTL ever used the get_secondary_replica() function. So in this series, in addition to fixing the bug, we add two tests that reproduce this bug (fail before the fix, pass with the fix): 1. A unit test that checks that get_secondary_replica() always returns a different node from get_primary_replica() 2. A cluster test based on the original dtest, which does reproduce this bug in Alternator TTL where some of the data was never expired (but only failed in release build, for an unknown reason). Fixes SCYLLADB-777. 
- (cherry picked from commit `9ab3d5b946`) - (cherry picked from commit `0c7f499750`) - (cherry picked from commit `e463d528fe`) Parent PR: #28771 Closes scylladb/scylladb#28803 * github.com:scylladb/scylladb: test: add unit test for tablet_map::get_secondary_replica() test, alternator: add test for TTL expiration with a node down locator: fix get_secondary_replica() to match get_primary_replica()	2026-03-04 14:21:44 +02:00
Marcin Maliszkiewicz	81685b0d06	Merge 'db/batchlog_manager: re-add v1 support for mixed clusters' from Botond Dénes `3f7ee3ce5d` introduced system.batchlog_v2, with a schema designed to speed up batchlog replays and make post-replay cleanups much more effective. It did not introduce a cluster feature for the new table, because it is a node-local table, so the cluster can switch to the new table gradually, one node at a time. However, https://github.com/scylladb/scylladb/issues/27886 showed that the switching causes timeouts during upgrades, in mixed clusters. Furthermore, switching to the new table unconditionally on upgraded nodes means that on rollback, the batches saved into the v2 table are lost. This PR re-introduces v1 (`system.batchlog`) support and guards the use of the v2 table with a cluster feature, so mixed clusters keep using v1 and thus remain rollback-compatible. The re-introduced v1 support doesn't support post-replay cleanups for simplicity. The cleanup in v1 was never particularly effective anyway and we ended up disabling it for heavy batchlog users, so I don't think the lack of support for cleanup is a problem. 
Fixes: https://github.com/scylladb/scylladb/issues/27886 Needs backport to 2026.1, to fix upgrades for clusters using batches Closes scylladb/scylladb#28736 * github.com:scylladb/scylladb: test/boost/batchlog_manager_test: add tests for v1 batchlog test/boost/batchlog_manager_test: make prepare_batches() work with both v1 and v2 test/boost/batchlog_manager_test: fix indentation test/boost/batchlog_manager_test: extract prepare_batches() method test/lib/cql_assertions: is_rows(): add dump parameter tools/scylla-sstable: extract query result printers tools/scylla-sstable: add std::ostream& arg to query result printers repair/row_level: repair_flush_hints_batchlog_handler(): add all_replayed to finish log db/batchlog_manager: re-add v1 support db/batchlog_manager: return all_replayed from process_batch() db/batchlog_manager: process_bath() fix indentation db/batchlog_manager: make batch() a standalone function db/batchlog_manager: make structs stats public db/batchlog_manager: allocate limiter on the stack db/batchlog_manager: add feature_service dependency gms/feature_service: add batchlog_v2 feature (cherry picked from commit `a83ee6cf66`) Closes scylladb/scylladb#28853	2026-03-04 08:28:39 +02:00
Anna Stuchlik	06013b2377	doc: add the upgrade guide from 2025.x to 2026.1 This commit adds the upgrade guide for version 2026.1. According to the new upgrade policy, the user can now upgrade to the major version (2026.1) from any previous minor version. So instead of adding a separate guide from 2025.4 to 2026.1, we need a guide from 2025.x to 2026.1. In addition, this commit: - Updates the upgrade policy to reflect the above change. - Removes the upgrade guides for the previous version. Fixes https://github.com/scylladb/scylladb/issues/28533 Fixes https://github.com/scylladb/scylladb/issues/28532 Closes scylladb/scylladb#28789 (cherry picked from commit `dfd46ad3fb`) Closes scylladb/scylladb#28835	2026-03-03 13:04:16 +02:00
Grzegorz Burzyński	4cc5c2605f	packaging: add systemctl command to dependencies The scylladb/scylla container image doesn't include the systemctl binary, although it is used by the perftune.py script shipped within the same image. Scylla Operator runs this script to tune Scylla nodes/containers, expecting all its dependencies to be available in the container's PATH. Without systemctl, the script fails on systems that run irqbalance (e.g., on EKS nodes) as the script tries to reconfigure irqbalance and restart it via systemctl afterwards. Fixes: scylladb/scylla-operator#3080 Closes scylladb/scylladb#28567 (cherry picked from commit `b4f0eb666f`) Closes scylladb/scylladb#28845	2026-03-03 13:03:46 +02:00
Anna Stuchlik	021851c5c5	doc: remove redundant Java-related information This commit removes: - Instructions to install scylla-jmx (and all references) - The Java 11 requirement for Ubuntu. Fixes https://github.com/scylladb/scylladb/issues/28249 Fixes https://github.com/scylladb/scylladb/issues/28252 Closes scylladb/scylladb#28254 (cherry picked from commit `64b1798513`) Closes scylladb/scylladb#28818	2026-03-03 10:39:46 +01:00
Patryk Jędrzejczak	c4aa14c1a7	test: test_full_shutdown_during_replace: retry replace after the replacing node is removed from gossip The test is currently flaky with `reuse_ip = True`. The issue is that the test retries replace before the first replace is rolled back and the first replacing node is removed from gossip. The second replacing node can see the entry of the first replacing node in gossip. This entry has a newer generation than the entry of the node being replaced, and both replacing nodes have the same IP as the node being replaced. Therefore, the second replacing node incorrectly considers this entry as the entry of the node being replaced. This entry is missing rack and DC, so the second replace fails with ``` ERROR 2026-02-24 21:19:03,420 [shard 0:main] init - Startup failed: std::runtime_error (Cannot replace node 8762a9d2-3b30-4e66-83a1-98d16c5dd007/127.61.127.1 with a node on a different data center or rack. Current location=UNKNOWN_DC/UNKNOWN_RACK, new location=dc1/rack2) ``` Fixes SCYLLADB-805 Closes scylladb/scylladb#28829 (cherry picked from commit `ba7f314cdc`) Closes scylladb/scylladb#28850	2026-03-03 10:21:11 +01:00
Jenkins Promoter	0fdb0961a2	Update ScyllaDB version to: 2026.1.0-rc4	2026-03-02 20:36:38 +02:00
Roy Dahan	2100ae2d0a	install.sh: fix REST API paths for nonroot installations In nonroot installations, the install.sh script was hardcoding the api_ui_dir and api_doc_dir paths to /opt/scylladb/ in scylla.yaml, even though the actual files were installed to a different location (typically ~/scylladb). This caused REST API endpoints like /api-doc/failure_detector/ to fail with "transfer closed with outstanding read data remaining" error because Scylla couldn't find the API documentation files at the configured paths. Fix this by using the $prefix variable instead of hardcoded /opt/scylladb/ paths. This ensures that: - In regular installations: $prefix = /opt/scylladb (no change) - In nonroot installations: $prefix = ~/scylladb (paths now correct) Fixes: SCYLLADB-721 Backport: The hardcoded paths in install.sh have been present since the nonroot installation feature was introduced, making REST API endpoints non-functional in all nonroot installations across all live versions of Scylla. Closes scylladb/scylladb#28805 (cherry picked from commit `822c1597c9`) Closes scylladb/scylladb#28836	2026-03-01 23:20:12 +02:00
Jenkins Promoter	51fc498314	Update pgo profiles - aarch64 scylla-2026.1.0-rc3 scylla-2026.1.0-rc3-candidate-20260301034102	2026-03-01 05:01:46 +02:00
Jenkins Promoter	f4b938df09	Update pgo profiles - x86_64	2026-02-28 21:23:33 -05:00
Botond Dénes	0dfefc3f12	db/config: don't use RBNO for scaling Remove bootstrap and decommission from allowed_repair_based_node_ops. Using RBNO over streaming for these operations has no benefits, as they are not exposed to the out-of-date replica problem that replace, removenode and rebuild are. On top of that, RBNO is known to have problems with empty user tables. Using streaming for bootstrap and decommission is safe and faster than RBNO in all conditions, especially when the table is small. One test needs adjustment as it relies on RBNO being used for all node ops. Fixes: SCYLLADB-105 Closes scylladb/scylladb#28080 (cherry picked from commit `b637e17b19`) Closes scylladb/scylladb#28725	2026-02-27 06:32:15 +02:00
Łukasz Paszkowski	883e3e014a	compaction_manager: fix maybe_wait_for_sstable_count_reduction() hanging forever The futurization refactoring in `9d3755f276` ("replica: Futurize retrieval of sstable sets in compaction_group_view") changed maybe_wait_for_sstable_count_reduction() from a single predicated wait: ``` co_await cstate.compaction_done.wait([..] { return num_runs_for_compaction() <= threshold || !can_perform_regular_compaction(t); }); ``` to a while loop with a predicated wait: ``` while (can_perform_regular_compaction(t) && co_await num_runs_for_compaction() > threshold) { co_await cstate.compaction_done.wait([this, &t] { return !can_perform_regular_compaction(t); }); } ``` This was necessary because num_runs_for_compaction() became a coroutine (returns future<size_t>) and can no longer be called inside a condition_variable predicate (which must be synchronous). However, the inner wait's predicate, !can_perform_regular_compaction(t), only returns true when compaction is disabled or the table is being removed. During normal operation, every signal() from compaction_done wakes the waiter, the predicate returns false, and the waiter immediately goes back to sleep without ever re-checking the outer while loop's num_runs_for_compaction() condition. This causes memtable flushes to hang forever in maybe_wait_for_sstable_count_reduction() whenever the sstable run count exceeds the threshold, because completed compactions signal compaction_done but the signal is swallowed by the predicate. Fix by replacing the predicated wait with a bare wait(), so that any signal (including from completed compactions) causes the outer while loop to re-evaluate num_runs_for_compaction(). Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-610 Closes scylladb/scylladb#28801 (cherry picked from commit `bb57b0f3b7`)	2026-02-27 01:38:13 +02:00
Yaron Kaikov	4ccb795beb	.github/workflows: enable automatic backport PR creation with Jira sub-issue integration This workflow calls the reusable backport-with-jira workflow from scylladb/github-automation to enable automatic backport PR creation with Jira sub-issue integration. The workflow triggers on: - Push to master/next-/branch- branches (for promotion events) - PR labeled with backport/X.X pattern (for manual backport requests) - PR closed/merged on version branches (for chain backport processing) Features enabled by calling the shared workflow: - Creates Jira sub-issues under the main issue for each backport version - Sorts versions descending (highest first: 2025.4 -> 2025.3 -> 2025.2) - Cherry-picks from previous version branch to avoid repeated conflicts - On Jira API failure: adds comment to main issue, applies 'jira-sub-issue-creation-failed' label, continues with PR Closes scylladb/scylladb#28804 (cherry picked from commit `b211590bc0`) Closes scylladb/scylladb#28812	2026-02-26 09:29:19 +02:00
Yaron Kaikov	9e02b0f45f	ci: harden trigger-scylla-ci workflow against credential leaks and untrusted PRs refs: https://github.com/scylladb/scylladb/security/advisories/GHSA-wrqg-xx2q-r3fv - Remove -v and -i flags from curl to prevent credentials from being logged in workflow output - Move PR_NUMBER and PR_REPO_NAME into the env block with proper quoting to prevent shell injection via crafted PR metadata - Add org membership verification step for pull_request_target events so that only PRs from scylladb org members can trigger Jenkins CI Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-796 Closes scylladb/scylladb#28785 (cherry picked from commit `98494e08eb`) Closes scylladb/scylladb#28811	2026-02-26 09:28:00 +02:00
Anna Stuchlik	eb9b8dbf62	doc: remove the tablets limitation for Alternator This commit removes the information that Alternator doesn't support tablets. The limitation is no longer valid. Fixes SCYLLADB-778 Closes scylladb/scylladb#28781 (cherry picked from commit `e2333a57ad`) Closes scylladb/scylladb#28795	2026-02-26 09:26:57 +02:00
Andrzej Jackowski	995df5dec6	test: fix configuration of test_autoretrain_dict `test_autoretrain_dict` sporadically fails because the default compression algorithm was changed after the test was written. `9ffa62a986815709d0a09c705d2d0caf64776249` was an attempt to fix it by changing the compression configuration during node startup. However, the configuration change had an incorrect YAML format and was ignored by ScyllaDB. This commit fixes it. Fixes: scylladb/scylladb#28204 Closes scylladb/scylladb#28746 (cherry picked from commit `cd4caed3d3`) Closes scylladb/scylladb#28794	2026-02-26 09:26:23 +02:00
Calle Wilund	beb781b829	gcp: Add handling of 429 (too many requests) to exponential backoff Fixes: SCYLLADB-611 Adds http error code 429 to codes handled by exponential backoff. Closes scylladb/scylladb#28588 (cherry picked from commit `8e71a6f52a`) Closes scylladb/scylladb#28724	2026-02-26 09:24:43 +02:00
Marcin Maliszkiewicz	502b7f296d	Merge '[Backport 2026.1] vector_search: return NaN for similarity_cosine with all-zero vectors' from Scylladb[bot] The ANN vector queries with all-zero vectors are allowed even on vector indexes with similarity function set to cosine. When enabling the rescoring option, those queries would fail as the rescoring calls `similarity_cosine` function underneath, causing an `InvalidRequest` exception as all-zero vectors were not allowed matching Cassandra's behaviour. To eliminate the discrepancy we want the all-zero vector `similarity_cosine` calls to pass, but return the NaN as the cosine similarity for zero vectors is mathematically incorrect. We decided not to use arbitrary values contrary to USearch, for which the distance (not to be confused with similarity) is defined as cos(0, 0) = 0, cos(0, x) = 1 while supporting the range of values [0, 2]. If we wanted to convert that to similarity, that would mean sim_cos(0, x) = 0.5, which does not support mathematical reasoning why that would be more similar than for example vectors marking obtuse angles. It's safe to assume that all-zero vectors for cosine similarity shouldn't make any impact, therefore we return NaN and eliminate them from best results. Adjusted the tests accordingly to check both proper Cassandra and Scylla's behaviour. Fixes: SCYLLADB-456 Backport to 2026.1 needed, as it fixes the bug for ANN vector queries using rescoring introduced there. - (cherry picked from commit `af0889d194`) - (cherry picked from commit `4e32502bb3`) Parent PR: #28609 Closes scylladb/scylladb#28775 * github.com:scylladb/scylladb: test/vector_search: add reproducer for rescoring with zero vectors vector_search: return NaN for similarity_cosine with all-zero vectors	2026-02-25 14:34:58 +01:00
Nadav Har'El	b251ee02a4	test: add unit test for tablet_map::get_secondary_replica() This patch adds a unit test for tablet_map::get_secondary_replica(). It was never officially defined how the "primary" and "secondary" replicas were chosen, and their implementation changed over time, but the one invariant that this test verifies is that the secondary replica and the primary replica must be different nodes. This test reproduces issue SCYLLADB-777, where we discovered that get_primary_replica() changed without a corresponding change to get_secondary_replica(). So before the previous patch, this test failed, and after the previous patch - it passes. Signed-off-by: Nadav Har'El <nyh@scylladb.com> (cherry picked from commit `e463d528fe`)	2026-02-25 12:59:26 +00:00
Nadav Har'El	f26d08dde2	test, alternator: add test for TTL expiration with a node down We have many single-node functional tests for Alternator TTL in test/alternator/test_ttl.py. This patch adds a multi-node test in test/cluster/test_alternator.py. The new test verifies that: 1. Even though Alternator TTL splits the work of scanning and expiring items between nodes, all the items get correctly expired. 2. When one node is down, all the items still expire because the "secondary" owner of each token range takes over expiring the items in this range while the "primary" owner is down. This new test is actually a port of a test we already had in dtest (alternator_ttl_tests.py::test_multinode_expiration). This port is faster and smaller than the original (fewer nodes, fewer rows), but it still found a regression (SCYLLADB-777) that dtest missed - the new test failed when running with tablets and in release build mode. Signed-off-by: Nadav Har'El <nyh@scylladb.com> (cherry picked from commit `0c7f499750`)	2026-02-25 12:59:26 +00:00
Nadav Har'El	9cd1038c7a	locator: fix get_secondary_replica() to match get_primary_replica() The function tablet_map::get_secondary_replica() is used by Alternator TTL to choose a node different from get_primary_replica(). Unfortunately, recently (commits `817fdad` and d88037d) the implementation of the latter function changed, without changing the former. So this patch changes the former to match. The next two patches will have two tests that fail before this patch, and pass with it: 1. A unit test that checks that get_secondary_replica() returns a different node than get_primary_replica(). 2. An Alternator TTL test that checks that when a node is down, expirations still happen because the secondary replica takes over the primary replica's work. Fixes SCYLLADB-777 Signed-off-by: Nadav Har'El <nyh@scylladb.com> (cherry picked from commit `9ab3d5b946`)	2026-02-25 12:59:25 +00:00
Avi Kivity	fdae3e4f3a	Merge '[Backport 2026.1] s3_client: Fix s3 part size and number of parts calculation' from Scylladb[bot] - Correct `calc_part_size` function since it could return more than 10k parts - Add tests - Add more checks in `calc_part_size` to comply with S3 limits Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-640 Must be ported back to 2025.3/4 and 2026.1 since we may encounter this bug in production clusters - (cherry picked from commit `289e910cec`) - (cherry picked from commit `6280cb91ca`) - (cherry picked from commit `960adbb439`) Parent PR: #28592 Closes scylladb/scylladb#28697 * github.com:scylladb/scylladb: s3_client: add more constrains to the calc_part_size s3_client: add tests for calc_part_size s3_client: correct multipart part-size logic to respect 10k limit	2026-02-24 14:23:33 +02:00
Avi Kivity	d47e4898ea	Merge '[Backport 2026.1] docs: update a documentation of adding/removing DC and rebuilding a node' from Scylladb[bot] Describe a procedure to convert tablet keyspace replication factor to rack list. Update the procedures of adding and removing a node to consider tablet keyspaces. Fixes: [SCYLLADB-398](https://scylladb.atlassian.net/browse/SCYLLADB-398) Fixes: https://github.com/scylladb/scylladb/issues/28306. Fixes: https://github.com/scylladb/scylladb/issues/28307. Fixes: https://github.com/scylladb/scylladb/issues/28270. Needs backport to all live branches as they all include tablets. - (cherry picked from commit `eefe66b2b2`) - (cherry picked from commit `e08ac60161`) - (cherry picked from commit `1c764cf6ea`) - (cherry picked from commit `e4c42acd8f`) - (cherry picked from commit `9ccc95808f`) Parent PR: #28521 Closes scylladb/scylladb#28780 * github.com:scylladb/scylladb: docs: update nodetool rebuild docs docs: update a procedure of decommissioning a DC docs: update a procedure of adding a DC docs: describe upgrade to enforce_rack_list option docs: describe conversion to rack-list RF	2026-02-24 14:21:50 +02:00
Aleksandra Martyniuk	7bc87de838	docs: update nodetool rebuild docs Update nodetool rebuild docs to mention that the command does not work for tablet keyspaces. Fixes: https://github.com/scylladb/scylladb/issues/28270. (cherry picked from commit `9ccc95808f`)	2026-02-24 09:26:27 +01:00
Aleksandra Martyniuk	2141b9b824	docs: update a procedure of decommissioning a DC Update a procedure of decommissioning a DC for tablet keyspaces. Fixes: https://github.com/scylladb/scylladb/issues/28307. (cherry picked from commit `e4c42acd8f`)	2026-02-24 09:26:13 +01:00
Aleksandra Martyniuk	aa50edbf17	docs: update a procedure of adding a DC Update a procedure of adding a DC for tablet keyspaces. Fixes: https://github.com/scylladb/scylladb/issues/28306. (cherry picked from commit `1c764cf6ea`)	2026-02-23 20:58:23 +00:00
Aleksandra Martyniuk	7f836aa3ec	docs: describe upgrade to enforce_rack_list option (cherry picked from commit `e08ac60161`)	2026-02-23 20:58:23 +00:00
Aleksandra Martyniuk	bd26803c1a	docs: describe conversion to rack-list RF Fixes: SCYLLADB-398 (cherry picked from commit `eefe66b2b2`)	2026-02-23 20:58:23 +00:00
Dawid Pawlik	2feed49285	test/vector_search: add reproducer for rescoring with zero vectors Add a reproducer for the SCYLLADB-456 issue: an exception on ANN vector queries that use rescoring with cosine similarity. (cherry picked from commit `4e32502bb3`)	2026-02-23 17:09:51 +00:00
Dawid Pawlik	3007cb6f37	vector_search: return NaN for similarity_cosine with all-zero vectors The ANN vector queries with all-zero vectors are allowed even on vector indexes with similarity function set to cosine. When enabling the rescoring option, those queries would fail as the rescoring calls `similarity_cosine` function underneath, causing an `InvalidRequest` exception as all-zero vectors were not allowed matching Cassandra's behaviour. To eliminate the discrepancy we want the all-zero vector `similarity_cosine` calls to pass, but return the NaN as the cosine similarity for zero vectors is mathematically incorrect. We decided not to use arbitrary values contrary to USearch, for which the distance (not to be confused with similarity) is defined as cos(0, 0) = 0, cos(0, x) = 1 while supporting the range of values [0, 2]. If we wanted to convert that to similarity, that would mean sim_cos(0, x) = 0.5, which does not support mathematical reasoning why that would be more similar than for example vectors marking obtuse angles. It's safe to assume that all-zero vectors for cosine similarity shouldn't make any impact, therefore we return NaN and eliminate them from best results. Adjusted the tests accordingly to check both proper Cassandra and Scylla's behaviour. Fixes: SCYLLADB-456 (cherry picked from commit `af0889d194`)	2026-02-23 17:09:50 +00:00
Ferenc Szili	1e2d1c7e85	load_stats: fix race condition when computing sum_tablet_sizes In storage_service::load_stats_for_tablet_based_tables(), we are passing a reference to sum_tablet_sizes to the lambda which increments this value on each shard via map_reduce0(). This means we could have a race condition because this is executed on separate threads/CPUs. This patch fixes the problem by collecting the per-shard sums into a vector, then summing those up. Refs: SCYLLADB-678 Closes scylladb/scylladb#28703 (cherry picked from commit `f1bc17bd4c`) Closes scylladb/scylladb#28729	2026-02-23 15:02:48 +01:00
Jenkins Promoter	55ad575c8f	Update ScyllaDB version to: 2026.1.0-rc3	2026-02-22 14:48:35 +02:00
Tomasz Grabiec	8982140cd9	Merge '[Backport 2026.1] test_tablets_parallel_decommission: Fix flakiness due to delayed task appearance' from Scylladb[bot] Currently, the test assumes that when 'topology_coordinator_pause_before_processing_backlog: waiting' is logged, the task for decommission must be there. This was based on the assumption that topology coordinator is idle and decommission request wakes it up. But if the server is slow enough, it may still be running the load balancer in reaction to table creation, and block on that injection point before decommission request was added. Fix by waiting for the task to appear rather than the injection. Fixes SCYLLADB-715 Only 2026.1 vulnerable. - (cherry picked from commit `e14eca46af`) - (cherry picked from commit `2454de4f8f`) - (cherry picked from commit `d33d38139f`) Parent PR: #28688 Closes scylladb/scylladb#28750 * github.com:scylladb/scylladb: test_tablets_parallel_decommission: Fix flakiness due to delayed task appearance test: cluster: task_manager_client: Introduce wait_task_appears() tests: pylib: util: Add exponential backoff to wait_for	2026-02-21 01:45:52 +01:00
Tomasz Grabiec	e90449f770	test_tablets_parallel_decommission: Fix flakiness due to delayed task appearance Currently, the test assumes that when 'topology_coordinator_pause_before_processing_backlog: waiting' is logged, the task for decommission must be there. This was based on the assumption that topology coordinator is idle and decommission request wakes it up. But if the server is slow enough, it may still be running the load balancer in reaction to table creation, and block on that injection point before decommission request was added. Fix by waiting for the task to appear rather than the injection. Fixes SCYLLADB-715 (cherry picked from commit `d33d38139f`)	2026-02-20 16:35:39 +00:00
Tomasz Grabiec	98fd5c5e45	test: cluster: task_manager_client: Introduce wait_task_appears() (cherry picked from commit `2454de4f8f`)	2026-02-20 16:35:39 +00:00
Tomasz Grabiec	cca6a1c3dd	tests: pylib: util: Add exponential backoff to wait_for Allows balancing the trade-off between fast execution in case the condition is satisfied quickly and not adding load when it's not. (cherry picked from commit `e14eca46af`)	2026-02-20 16:35:39 +00:00
Szymon Malewski	86554e6192	vector: Improve similarity functions performance Improves performance of deserialization of vector data for calculating similarity functions. Instead of deserializing vector data into a std::vector<data_value>, we deserialize directly into a std::vector<float> and then pass it to similarity functions as a std::span<const float>. This avoids overhead of data_value allocations and conversions. Example QPS of `SELECT id, similarity_cosine({vector<float, 1536>}, {vector<float, 1536>}) ...`: client concurrency 1: before: ~135 QPS, after: ~1005 QPS client concurrency 20: before: ~280 QPS, after: ~2097 QPS Measured using https://github.com/zilliztech/VectorDBBench (modified to call above query without ANN search) Fixes https://scylladb.atlassian.net/browse/SCYLLADB-471 Closes scylladb/scylladb#28615 (cherry picked from commit `668d6fe019`) Closes scylladb/scylladb#28690	2026-02-19 14:14:39 +02:00
Asias He	637618560b	repair: Skip auto repair for tables using RF one There is no point in running repair for tables using RF one. Row level repair will skip it, but the auto repair scheduler will keep scheduling such repairs since repair_time could not be updated. Skip such repairs at the scheduler level for auto repair. If the request is issued by a user, we will have to schedule such a repair, otherwise the user request will never be finished. Fixes SCYLLADB-561 Closes scylladb/scylladb#28640 (cherry picked from commit `1be80c9e86`) Closes scylladb/scylladb#28714	2026-02-19 13:07:37 +02:00
Avi Kivity	8c3c5777da	Merge '[Backport 2026.1] transport: fix connection code to consume only initially taken semaphore units' from Scylladb[bot] The connection's `cpu_concurrency_t` struct tracks the state of a connection to manage the admission of new requests and prevent CPU overload during connection storms. When a connection holds units (allowed only 0 or 1), it is considered to be in the "CPU state" and contributes to the concurrency limits used when accepting new connections. The bug stems from the fact that `counted_data_source_impl::get` and `counted_data_sink_impl::put` calls can interleave during execution. This occurs because of `should_parallelize` and `_ready_to_respond`, the latter being a future chain that can run in the background while requests are being read. Consequently, while reading request (N), the system may concurrently be writing the response for request (N-1) on the same connection. This interleaving allows `return_all()` to be called twice before the subsequent `consume_units()` is invoked. While the second `return_all()` call correctly returns 0 units, the matching `consume_units()` call would mistakenly take an extra unit from the semaphore. Over time, a connection blocked on a read operation could end up holding an unreturned semaphore unit. If this pattern repeats across multiple connections, the semaphore units are eventually depleted, preventing the server from accepting any new connections. The fix ensures that we always consume the exact number of units that were previously returned. With this change, interleaved operations behave as follows: get() return_all — returns 1 unit put() return_all — returns 0 units get() consume_units — takes back 1 unit put() consume_units — takes back 0 units Logically, the networking phase ends when the first network operation concludes. But more importantly, when a network operation starts, we no longer hold any units. 
Other solutions are possible but the chosen one seems to be the simplest and safest to backport. Fixes SCYLLADB-485 Backport: all supported affected versions, bug introduced with initial feature implementation in: `ed3e4f33fd` - (cherry picked from commit `0376d16ad3`) - (cherry picked from commit `3b98451776`) Parent PR: #28530 Closes scylladb/scylladb#28716 * github.com:scylladb/scylladb: test: auth_cluster: add test for hanged AUTHENTICATING connections transport: fix connection code to consume only initially taken semaphore units	2026-02-19 12:47:38 +02:00
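The unit accounting described in this message can be sketched as a small model. This is only an illustration, not the Scylla implementation: `Semaphore`, `ConnectionUnits`, `acquire_initial`, and `run_interleaved` are hypothetical names, and only `return_all` / `consume_units` are taken from the commit text. The sketch contrasts the fixed behaviour (consume exactly what was returned) with the buggy one (always take one unit back).

```python
class Semaphore:
    """Toy counting semaphore (hypothetical stand-in for the real one)."""

    def __init__(self, units):
        self.available = units

    def take(self, n):
        if n > self.available:
            raise RuntimeError("semaphore depleted")
        self.available -= n

    def give(self, n):
        self.available += n


class ConnectionUnits:
    """Hypothetical model of a connection that should hold 0 or 1 unit."""

    def __init__(self, sem):
        self.sem = sem
        self.held = 0

    def acquire_initial(self):
        self.sem.take(1)
        self.held = 1

    def return_all(self):
        # Give back whatever is held and report how many units that was.
        returned = self.held
        self.sem.give(returned)
        self.held = 0
        return returned

    def consume_units(self, n):
        # Take back exactly n units (n may be 0).
        self.sem.take(n)
        self.held += n


def run_interleaved(fixed):
    """Replay the get()/put() interleaving described in the commit message."""
    sem = Semaphore(4)
    conn = ConnectionUnits(sem)
    conn.acquire_initial()
    from_get = conn.return_all()   # get() starts: returns 1 unit
    from_put = conn.return_all()   # put() starts: returns 0 units
    # Fixed: consume what was returned. Buggy: always take one unit back.
    conn.consume_units(from_get if fixed else 1)
    conn.consume_units(from_put if fixed else 1)
    return sem.available, conn.held


print(run_interleaved(fixed=True))
print(run_interleaved(fixed=False))
```

In this model the fixed path ends with the connection holding one unit again, while the buggy path leaves it holding two, which is the leak that eventually depletes the semaphore.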
Tomasz Grabiec	bb9a5261ec	Merge '[Backport 2026.1] test: fix flaky test_balance_empty_tablets' from Scylladb[bot] The test creates a single node cluster, then creates 3 tables which remain empty. Then it adds another node with half the disk capacity of the first one, and then it waits for the balancer to migrate tablets to the newly added node by calling the quiesce topology API. The number of tablets on the smaller node should be exactly half the number of tablets on the larger node. After waiting for quiesce topology, we could have a situation where we query the number of tablets from the node which still hasn't processed the last tablet migrations and updated system.tablets. This patch adds a read barrier so that both nodes see the same tablets metadata before we query the number of tablets. Fixes: SCYLLADB-603 The test is present in master and 2026.1, so we need to backport this. - (cherry picked from commit `4ca40929ef`) Parent PR: #28598 Closes scylladb/scylladb#28638 * github.com:scylladb/scylladb: test/cluster: Remove short_tablet_stats_refresh_interval injection test: add read barrier to test_balance_empty_tablets scylla-2026.1.0-rc2 scylla-2026.1.0-rc2-candidate-20260219021348	2026-02-18 23:39:03 +01:00
Marcin Maliszkiewicz	d5d81cc066	test: auth_cluster: add test for hanged AUTHENTICATING connections Test runtime: Release - 2s Debug - 5s (cherry picked from commit `3b98451776`)	2026-02-18 19:43:02 +00:00
Marcin Maliszkiewicz	6a438543c2	transport: fix connection code to consume only initially taken semaphore units The connection's cpu_concurrency_t struct tracks the state of a connection to manage the admission of new requests and prevent CPU overload during connection storms. When a connection holds units (allowed only 0 or 1), it is considered to be in the "CPU state" and contributes to the concurrency limits used when accepting new connections. The bug stems from the fact that `counted_data_source_impl::get` and `counted_data_sink_impl::put` calls can interleave during execution. This occurs because of `should_parallelize` and `_ready_to_respond`, the latter being a future chain that can run in the background while requests are being read. Consequently, while reading request (N), the system may concurrently be writing the response for request (N-1) on the same connection. This interleaving allows `return_all()` to be called twice before the subsequent `consume_units()` is invoked. While the second `return_all()` call correctly returns 0 units, the matching `consume_units()` call would mistakenly take an extra unit from the semaphore. Over time, a connection blocked on a read operation could end up holding an unreturned semaphore unit. If this pattern repeats across multiple connections, the semaphore units are eventually depleted, preventing the server from accepting any new connections. The fix ensures that we always consume the exact number of units that were previously returned. With this change, interleaved operations behave as follows: get() return_all — returns 1 unit put() return_all — returns 0 units get() consume_units — takes back 1 unit put() consume_units — takes back 0 units Logically, the networking phase ends when the first network operation concludes. But more importantly, when a network operation starts, we no longer hold any units. Other solutions are possible but the chosen one seems to be the simplest and safest to backport. 
Fixes SCYLLADB-485 (cherry picked from commit `0376d16ad3`)	2026-02-18 19:43:02 +00:00
Botond Dénes	99a67484bf	Merge '[Backport 2026.1] cql3/statements/describe_statement: hide paxos state tables ' from Scylladb[bot] Paxos state tables are internal tables fully managed by Scylla; they should be neither exposed to the user nor backed up. This commit hides such tables from all listings, and if such a table is described directly with `DESC ks."tbl$paxos"`, the description is generated within a comment and a note for the user is added. Fixes https://github.com/scylladb/scylladb/issues/28183 LWT on tablets and paxos state tables are present in 2025.4, so the patch should be backported to this version. - (cherry picked from commit `f89a8c4ec4`) - (cherry picked from commit `9baaddb613`) Parent PR: #28230 Closes scylladb/scylladb#28508 * github.com:scylladb/scylladb: test/cqlpy: add reproducer for hidden Paxos table being shown by DESC cql3/statements/describe_statement: hide paxos state tables	2026-02-18 12:41:08 +02:00
Anna Stuchlik	cabf2845d9	doc: fix the links on the repair-related pages This is a follow-up to https://github.com/scylladb/scylladb/pull/28199. This commit fixes the syntax of the internal links. Fixes https://github.com/scylladb/scylladb/issues/28486 Closes scylladb/scylladb#28487 (cherry picked from commit `77480c9d8f`) Closes scylladb/scylladb#28512	2026-02-18 12:39:07 +02:00
Yaron Kaikov	ff4a0fc87e	ci: fix PR number extraction for unlabeled events When the workflow is triggered by removing the 'conflicts' label (pull_request_target unlabeled event), github.event.issue.number is not available. Use github.event.pull_request.number as fallback. Fixes: https://scylladb.atlassian.net/browse/RELENG-245 Closes scylladb/scylladb#28543 (cherry picked from commit `b30ecb72d5`) Closes scylladb/scylladb#28553	2026-02-18 12:38:10 +02:00
Botond Dénes	0a89dbb4d4	Merge '[Backport 2026.1] raft topology: generate notification about released nodes only once' from Scylladb[bot] Hints destined for some other node can only be drained after the other node is no longer a replica of any vnode or tablet. When tablets are present, a node might still technically be a replica of some tablets after it has moved to the left state. When it is no longer a replica of any tablet, it becomes "released" and storage service generates a notification about it. Hinted handoff listens to this notification and kicks off draining hints after getting it. The current implementation of the "released" notification would trigger every time raft topology state is reloaded and a left node without any tokens is present in the raft topology. Although draining hints is idempotent, generating duplicate notifications is wasteful and recently became very noisy after `44de563` increased the verbosity of the draining-related log messages. The verbosity increase itself makes sense as draining is supposed to be a rare operation, but the duplicate notification bug now needs to be addressed. Fix the duplicate notification problem by passing the list of previously released nodes to the `storage_service::raft_topology_update_ip` function and filtering based on it. If this function processes the topology state for the first time, it will not produce any notifications. This is fine as hinted handoff is prepared to detect "released" nodes during the startup sequence in main.cc and start draining the hints there, if needed. Fixes: scylladb/scylladb#28301 Refs: scylladb/scylladb#25031 The log messages added in `44de563` cause a lot of noise during topology operations and tablet migrations, so the fix should be backported to all affected versions (2025.4 and 2026.1). 
- (cherry picked from commit `10e9672852`) - (cherry picked from commit `d28c841fa9`) - (cherry picked from commit `29da20744a`) Parent PR: #28367 Closes scylladb/scylladb#28612 * github.com:scylladb/scylladb: storage_service: fix indentation after previous patch raft topology: generate notification about released nodes only once raft topology: extract "released" nodes calculation to external function	2026-02-18 12:37:33 +02:00
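The filtering idea in this message (notify only about nodes that were not already released on the previous reload, and stay silent on the first pass) can be sketched as follows. This is a minimal illustration, not the actual `storage_service` code: the function name `newly_released` and the use of plain sets of node IDs are assumptions.

```python
def newly_released(previous, current):
    """Return the nodes that should trigger a "released" notification.

    previous: set of nodes released as of the last topology reload,
              or None if the topology state is processed for the first time.
    current:  set of nodes released according to the reloaded state.
    """
    if previous is None:
        # First pass: produce no notifications. Per the commit message,
        # startup code detects already-released nodes on its own.
        return set()
    return current - previous


# Simulate three consecutive topology reloads.
seen = None
notifications = []
for released in [{"n1"}, {"n1", "n2"}, {"n1", "n2"}]:
    notifications.append(sorted(newly_released(seen, released)))
    seen = released
print(notifications)
```

Each released node is reported at most once, and reloading an unchanged topology state produces no further notifications.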
Aleksandra Martyniuk	19cbaa1be2	test: fix test_remove_node_violating_rf_rack_with_rack_list test_remove_node_violating_rf_rack_with_rack_list creates a cluster with four nodes. One of the nodes is excluded, then another one is stopped, excluded, and removed. If the two stopped nodes were both voters, the majority is lost and the cluster loses its raft leader. As a result, the node cannot be removed and the operation times out. Add the 5th node to the cluster. This way the majority is always up. Fixes: https://github.com/scylladb/scylladb/issues/28596. Closes scylladb/scylladb#28610 (cherry picked from commit `f955a90309`) Closes scylladb/scylladb#28639	2026-02-18 12:36:52 +02:00
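The majority argument in this message comes down to simple arithmetic. The sketch below assumes, for illustration only, that every node in the cluster is a voter; `has_majority` is a hypothetical helper, not a Scylla or test-framework function.

```python
def has_majority(voters, down_voters):
    """True if the voters still up form a strict majority of all voters."""
    alive = voters - down_voters
    return alive > voters // 2


# Four voters with two stopped: 2 of 4 is not a strict majority.
print(has_majority(voters=4, down_voters=2))
# Five voters with two stopped: 3 of 5 is a strict majority.
print(has_majority(voters=5, down_voters=2))
```

With a fifth node, losing two voters still leaves a majority, so the raft leader survives and the node removal can complete.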
Anna Mikhlin	9cf0f0998d	.github/workflows: ignore quoted comments for trigger CI prevent CI from being triggered when trigger-ci command appears inside quoted (>) comment text Fixes: https://scylladb.atlassian.net/browse/RELENG-271 Closes scylladb/scylladb#28604 (cherry picked from commit `33cf97d688`) Closes scylladb/scylladb#28652	2026-02-18 12:36:28 +02:00
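The check described in this message can be sketched as dropping quoted lines before looking for the command. This is an illustration only, not the workflow's actual logic: the `TRIGGER` string and both function names are hypothetical, and the real workflow may match the command differently.

```python
TRIGGER = "trigger-ci"  # hypothetical command text, assumed for this sketch


def strip_quoted(body):
    """Drop Markdown-quoted lines, i.e. lines starting with '>'."""
    kept = [line for line in body.splitlines()
            if not line.lstrip().startswith(">")]
    return "\n".join(kept)


def should_trigger(body):
    """True only if the command appears outside quoted comment text."""
    return TRIGGER in strip_quoted(body)


print(should_trigger("please trigger-ci"))
print(should_trigger("> please trigger-ci\nthanks, noted"))
```

A reply that merely quotes an earlier comment containing the command no longer starts a CI run, while a fresh, unquoted command still does.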

51814 Commits