Raft topology goes over all nodes in the 'left' state and triggers a 'remove
node' notification if the id/ip mapping is still available (meaning the node
left recently). The problem is that, since the mapping is not removed
immediately, when multiple nodes are removed in succession a notification
for the same node can be sent several times. Fix that by sending the
notification only if the node still exists in the peers table: the node is
removed from the peers table by the first notification, so subsequent
notifications will not be sent.
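A minimal sketch of the deduplication idea (illustrative pseudocode only; the
real implementation is C++ and the names here are hypothetical):

    # Sketch, not the actual implementation: send the 'remove node'
    # notification only while the node is still present in the peers table;
    # handling the first notification removes the entry, so later passes
    # become no-ops.
    def maybe_notify_left(node_id, id_to_ip, peers, notify_remove_node):
        if node_id not in id_to_ip:
            return  # mapping already gone, the node left long ago
        if node_id not in peers:
            return  # already notified; the first notification removed the entry
        notify_remove_node(node_id, id_to_ip[node_id])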
Closes scylladb/scylladb#27743
(cherry picked from commit 4a5292e815)
Closes scylladb/scylladb#27913
The `test_insert_failure_doesnt_report_success` test in `test/cluster/dtest/audit_test.py`
has an insert statement that is expected to fail. The dtest environment uses
`FlakyRetryPolicy`, which has `max_retries = 5`. One initial failure and 5 retry failures
mean we expect 6 error audit logs.
The test failed because `create keyspace ks` failed once, then succeeded on retry.
This allowed the test to proceed properly, but the last part of the test, which expects
exactly 6 failed queries, actually saw 7.
The goal of this patch is to make sure there are exactly 6 = 1 + `max_retries` failed
queries, counting only the query that is expected to fail. If other queries fail with
a successful retry, that's fine. If other queries fail without a successful retry, the
test will fail, as it should in such situations; they are not related to this expected
failed insert statement.
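A hedged sketch of how the final check could count only that statement's
failures, assuming table-based audit storage (`audit.audit_log`) with an
`error` flag and the statement text in `operation`; the statement and session
below are placeholders:

    # Count failed audit entries only for the insert that is expected to fail,
    # ignoring other queries that failed but succeeded on retry.
    failing_insert = "INSERT INTO ks.t (pk, c) VALUES (1, 1)"  # placeholder
    rows = session.execute("SELECT operation, error FROM audit.audit_log")
    failures = [r for r in rows if r.error and failing_insert in r.operation]
    assert len(failures) == 1 + 5  # 1 initial failure + max_retries retries = 6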
Fixes #27322
Closes scylladb/scylladb#27378
(cherry picked from commit f545ed37bc)
Closes scylladb/scylladb#27582
The initial problem:
Some of the tests in test_protocol_exceptions.py started failing. The failure is on the assertion that no more than `cpp_exception_threshold` exceptions happened.
Test logic:
These tests assert that specific code paths do not throw an exception anymore. The initial implementation ran a code path once and asserted there were 0 exceptions. Sometimes one or more exceptions occur that are not directly related to the code paths the tests check, but they would still fail the tests.
The solution was to run the tests multiple times. If there is a regression, there would be at least as many exceptions thrown as there are test runs. If there is no regression, a few exceptions might happen, up to 10 per 100 test runs. I have arbitrarily chosen the `run_count = 100` and `cpp_exception_threshold = 10` values.
Note that the exceptions are counted per shard, not per code path.
The new problem:
The occasional exceptions thrown by some parts of the server now occur a bit more often than before. Based on the logs linked in the issues, it is usually 12.
There are possibly multiple ways to resolve the issue. I have considered logging exceptions and parsing them. I would have to filter the exception logs for only the wanted exceptions. However, if a new, different exception is introduced, it might not be counted.
Another approach is to just increase the threshold a bit. The issue of other server modules throwing more exceptions than before should be addressed by a set of tests for those modules, just like these tests check protocol exceptions without caring about who exercised the protocol code paths.
For those reasons, the solution implemented here is to increase `cpp_exception_threshold` to `20`. It will not make the tests unreliable, because, as mentioned, if there is a regression, there would be at least `run_count` exceptions per `run_count` test runs (1 exception per single test run).
Still, to make the "background exceptions" occurrence a bit more normalized, `run_count` too is doubled, from `100` to `200`. At first glance this looks like nothing changed, but doubling both the run count and the exception threshold relies on the fact that the exception burst does not scale with the run count; the "jitter" is simply bigger than the old threshold.
Also, this patch series enables debug logging for the `exception` logger. This will allow us to inspect which exceptions happened if a protocol exceptions test fails again.
Fixes #27247
Fixes #27325
Issue observed on master and branch-2025.4. The tests, in the same form, exist on master, branch-2025.4, branch-2025.3, branch-2025.2, and branch-2025.1. The code change is simple, and no issue is expected with backport automation. Thus, backports for all the aforementioned versions are requested.
- (cherry picked from commit 807fc68dc5)
- (cherry picked from commit c30b326033)
Parent PR: #27412
Closes scylladb/scylladb#27555
* github.com:scylladb/scylladb:
test: cqlpy: test_protocol_exceptions.py: enable debug exception logging
test: cqlpy: test_protocol_exceptions.py: increase cpp exceptions threshold
Commit d3efb3ab6f added a streaming session for rebuild, but it set
the session at request submission time. The session should be set when
the request starts executing, so this patch moves it to the correct
place.
Closes scylladb/scylladb#27757
(cherry picked from commit 04976875cc)
Closes scylladb/scylladb#27867
The test test_truncate_during_topology_change tests TRUNCATE TABLE while
bootstrapping a new node. With tablets enabled, TRUNCATE is a global
topology operation which needs to serialize with bootstrap.
When TRUNCATE TABLE is issued, it first checks if there is an already
queued truncate for the same table. This can happen if a previous
TRUNCATE operation has timed out and the client retried. The newly
issued truncate will only join the queued one if it is waiting to be
processed, and will fail immediately if the TRUNCATE is already being
processed.
In this test, TRUNCATE will be retried after a timeout (1 minute) due to
the default retry policy, and will be retried up to 3 times, while the
bootstrap is delayed by 2 minutes. This means that the test can end up
validating the result of a truncate which was started after the bootstrap
was completed.
Because of the way truncate joins existing truncate operations, we can
also have the following scenario:
- TRUNCATE times out after one minute because the new node is being
bootstrapped
- the client retries the TRUNCATE command which also times out after 1m
- the third attempt is received during TRUNCATE being processed which
fails the test
This patch changes the retry policy of the TRUNCATE operation to
FallthroughRetryPolicy which guarantees that TRUNCATE will not be
retried on timeout. It also increases the timeout of the TRUNCATE from 1
to 4 minutes. This way the test will actually validate the performance
of the TRUNCATE operation which was issued during bootstrap, instead of
the subsequent, retried TRUNCATEs which could have been issued after the
bootstrap was complete.
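A minimal sketch of the changed statement in the test (Python driver; `ks.t`
and `session` are placeholders for the test's table and session):

    # FallthroughRetryPolicy disables client-side retries on timeout, and the
    # per-request timeout is raised from 1 to 4 minutes, so the single TRUNCATE
    # issued during bootstrap is the one being validated.
    from cassandra.policies import FallthroughRetryPolicy
    from cassandra.query import SimpleStatement

    stmt = SimpleStatement("TRUNCATE ks.t", retry_policy=FallthroughRetryPolicy())
    session.execute(stmt, timeout=240)  # seconds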
Fixes: #26347
Closes scylladb/scylladb#27245
(cherry picked from commit d883ff2317)
Closes scylladb/scylladb#27507
The `vector_store_client_test_dns_resolving_repeated` test had race
conditions causing it to be flaky. Two main issues were identified:
1. Race between initial refresh and manual trigger: The test assumes
a specific resolution sequence, but timing variations between the
initial DNS refresh (on client creation) and the first manual
trigger (in the test loop) can cause unexpected delayed scheduling.
2. Extra triggers from resolve_hostname fiber: During the client
refresh phase, the background DNS fiber clears the client list.
If resolve_hostname executes in the window after clearing but
before the update completes, pending triggers are processed,
incrementing the resolution count unexpectedly. At count 6, the
mock resolver returns a valid address (count % 3 == 0), causing
the test to fail.
The fix relaxes test assertions to verify retry behavior and client
clearing on DNS address loss, rather than enforcing exact resolution
counts.
Fixes: #27074
Closes scylladb/scylladb#27685
(cherry picked from commit addac8b3f7)
Closes scylladb/scylladb#27799
We currently allow restrictions on a single-column primary key,
but we ignore the restriction and return all results.
This can confuse users. We change it so that such a restriction
throws an error, and add a test to validate it.
Fixes: VECTOR-331
Closes scylladb/scylladb#27668
The test had a sporadic failure due to a broken promise exception.
The issue was in `test_pinger::ping()` which captured the promise by
move into the subscription lambda, causing the promise to be destroyed
when the lambda was destroyed during coroutine unwinding.
Simplify `test_pinger::ping()` by replacing manual abort_source/promise
logic with `seastar::sleep_abortable()`.
This removes the risk of promise lifetime/race issues and makes the code
simpler and more robust.
Fixes: scylladb/scylladb#27136
Backport to active branches: This fixes a CI test issue, so it is
beneficial to backport the fix. As this is a test-only fix, it is a low
risk change.
Closes scylladb/scylladb#27737
(cherry picked from commit 2a75b1374e)
Closes scylladb/scylladb#27784
We currently have races, for example between moving an sstable out of
staging using change_state, or taking a snapshot, and
rewrite_statistics, which replaces one of the sstable component files
when called, for example, from update_repaired_at by incremental repair.
Use a semaphore as a mutex to serialize those functions.
Note that there is no need for an rwlock since the operations
are rare and read-only operations like snapshot don't
need to run in parallel.
Fixes #25919
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 9e18cfbe17)
Closes scylladb/scylladb#27751
This test starts a 3-node cluster and creates a large blob file so that one
node reaches critical disk utilization, triggering write rejections on that
node. The test then writes data with CL=QUORUM and validates that the data:
- did not reach the critically utilized node
- did reach the remaining two nodes
By default, tables use speculative retries to determine when coordinators may
query additional replicas.
Since the validation uses CL=ONE, it is possible that an additional request
is sent to satisfy the consistency level. As a result:
- the first check may fail if the additional request is sent to a node that
already contains data, making it appear as if data reached the critically
utilized node
- the second check may fail if the additional request is sent to the critically
utilized node, making it appear as if data did not reach the healthy node
The patch fixes the flakiness by disabling the speculative retries.
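One way to disable speculative retries for the table under test (`ks.t` and
`session` are placeholder names):

    # With speculative_retry set to 'NONE', the coordinator never sends an
    # extra request to another replica for the CL=ONE validation reads.
    session.execute("ALTER TABLE ks.t WITH speculative_retry = 'NONE'")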
Fixes https://github.com/scylladb/scylladb/issues/27212
Closes scylladb/scylladb#27488
(cherry picked from commit 2cb9bb8f3a)
Closes scylladb/scylladb#27773
Currently, batchlog replay is considered successful even if all batches fail
to be sent (they are replayed later). However, repair requires all batches
to be sent successfully. Currently, if the batchlog isn't cleared, the repair
never learns about it and still updates the repair_time. If the GC mode is set
to "repair", this means that tombstones written before the repair_time (minus
propagation_delay) can be GC'd while not all batches have been replayed.
Consider a scenario:
- Table t has a row with (pk=1, v=0);
- There is an entry in the batchlog that sets (pk=1, v=1) in table t;
- The row with pk=1 is deleted from table t;
- Table t is repaired:
  - batchlog replay fails;
  - repair_time is updated;
- propagation_delay seconds pass and the tombstone of pk=1 is GC'd;
- the batchlog is replayed and (pk=1, v=1) is inserted - data resurrection!
Do not update repair_time if sending any batch fails. The data is still
repaired. For tablet repair, the repair runs, but at the end the exception is
passed to the topology coordinator. Thanks to that, the repair_time isn't
updated. The repair request isn't removed either, so the repair will need
to be rerun.
Apart from that, a batch is removed from the batchlog if its version is invalid
or unknown. The condition under which we consider a batch too fresh to replay
is updated to take propagation_delay into account.
Fixes: https://github.com/scylladb/scylladb/issues/24415
Data resurrection fix; needs backport to all versions
- (cherry picked from commit 502b03dbc6)
- (cherry picked from commit 904183734f)
- (cherry picked from commit 7f20b66eff)
- (cherry picked from commit e1b2180092)
- (cherry picked from commit d436233209)
- (cherry picked from commit 1935268a87)
- (cherry picked from commit 6fc43f27d0)
Parent PR: #26319
Closes scylladb/scylladb#26766
* github.com:scylladb/scylladb:
repair: throw if flush failed in get_flush_time
db: fix indentation
test: add reproducer for data resurrection
repair: fail tablet repair if any batch wasn't sent successfully
db/batchlog_manager: fix making decision to skip batch replay
db: repair: throw if replay fails
db/batchlog_manager: delete batch with incorrect or unknown version
db/batchlog_manager: coroutinize replay_all_failed_batches
Fix an unlikely use-after-free in `encode_paging_state`. The function
incorrectly assumes that the current position to encode will always have
data for all clustering columns the schema defines. It's possible to
encounter a current position that has fewer than all columns specified, for
example in the case of a range tombstone. Those don't happen in Alternator
tables, as DynamoDB doesn't allow range deletions and the clustering key
can have at most one column. However, the Alternator API can be used to read
Scylla system tables, and those do have range tombstones with more
than a single clustering column.
The fix is to stop trying to encode columns that don't have a value -
they are not needed anyway, as there is no possible position with those
values (the range tombstone made sure of that).
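An illustrative sketch of the fix (the real code is C++; helper names here are
hypothetical):

    # Encode clustering values only up to the first missing one, since a
    # position coming from a range tombstone may define only a prefix of the
    # clustering key.
    def encode_clustering_prefix(position_values, encode_value):
        encoded = []
        for value in position_values:
            if value is None:  # prefix position: the remaining columns are unset
                break
            encoded.append(encode_value(value))
        return encoded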
Fixes #27001
Fixes #27125
Closes scylladb/scylladb#26960
(cherry picked from commit b54a9f4613)
Closes scylladb/scylladb#27347
The test executes a LWT query in order to create a paxos state table and
verify the table properties. However, after executing the LWT query, the
table may not exist on all nodes but only on a quorum of nodes, thus
checking the properties of the table may fail if the table doesn't exist
on the queried node.
To fix that, execute a group0 read barrier to ensure the table is
created on all nodes.
Fixes scylladb/scylladb#27398
Closes scylladb/scylladb#27401
(cherry picked from commit 9213a163cb)
Closes scylladb/scylladb#27411
Currently some things are not supported for colocated tables: it's not
possible to repair a colocated table, and due to this it's also not
possible to use the tombstone_gc=repair mode on a colocated table.
Extend the documentation to explain what colocated tables are and
document these restrictions.
Fixes scylladb/scylladb#27261
Closes scylladb/scylladb#27516
(cherry picked from commit 33f7bc28da)
Closes scylladb/scylladb#27772
In several exception handlers, only raft::request_aborted was being
caught and rethrown, while seastar::abort_requested_exception was
falling through to the generic catch(...) block. This caused the
exception to be incorrectly treated as a failure that triggers
rollback, instead of being recognized as an abort signal.
For example, during tablet draining, the error log showed:
"tablets draining failed with seastar::abort_requested_exception
(abort requested). Aborting the topology operation"
This change adds seastar::abort_requested_exception handling
alongside raft::request_aborted in all places where it was missing.
When rethrown, these exceptions propagate up to the main run() loop
where handle_topology_coordinator_error() recognizes them as normal
abort signals and allows the coordinator to exit gracefully without
triggering unnecessary rollback operations.
Fixes: scylladb/scylladb#27255
(cherry picked from commit 37e3dacf33)
Closes scylladb/scylladb#27663
We saw that in large clusters the direct failure detector may cause large task queues to accumulate. The series addresses this issue and also moves the code into the correct scheduling group.
Fixes https://github.com/scylladb/scylladb/issues/27142
Backport to all versions where 60f1053087 was backported, since it should improve performance in large clusters.
- (cherry picked from commit 82f80478b8)
- (cherry picked from commit 6a6bbbf1a6)
- (cherry picked from commit 86dde50c0d)
Parent PR: #27387
Closes scylladb/scylladb#27483
* https://github.com/scylladb/scylladb:
direct_failure_detector: run direct failure detector in the gossiper scheduling group
raft: drop invoke_on from the pinger verb handler
direct_failure_detector: pass timeout to direct_fd_ping verb
The root cause of the hanging test is a concurrency deadlock.
`vector_store_client` runs a periodic DNS refresh and waits on a condition
variable. After aborting the DNS request, the test signals the condition
variable. Stopping the vector_store_client takes long enough to trigger the
next DNS refresh - and this time the condition variable won't be signalled -
so vector_store_client will wait forever for the DNS refresh fiber to finish.
The commit fixes the problem by waiting for the condition variable only once.
Fixes: #27237
Fixes: VECTOR-370
Closes scylladb/scylladb#27239
(cherry picked from commit b5c85d08bb)
Closes scylladb/scylladb#27393
The tablet scheduler should not emit conflicting migrations for the same
tablet. This was addressed initially in scylladb/scylladb#26038 but the
check is missing in the merge colocation plan, so add it there as well.
Without this check, the merge colocation plan could generate a
conflicting migration for a tablet that is already scheduled for
migration, as the test demonstrates.
This can cause correctness problems, because if the load balancer
generates two migrations for a single tablet, both will be written as
mutations, and the resulting mutation could contain mixed cells from
both migrations.
Fixes scylladb/scylladb#27304
Closes scylladb/scylladb#27312
(cherry picked from commit 97b7c03709)
Closes scylladb/scylladb#27331
This patch enforces that vector indexes can only be created on keyspaces
that use tablets. During index validation, `check_uses_tablets()` verifies
the base keyspace configuration and rejects creation otherwise.
To support this, the `custom_index::validate()` API now receives a
`const data_dictionary::database&` parameter, allowing index
implementations to access keyspace-level settings during DDL validation.
Fixes https://scylladb.atlassian.net/browse/VECTOR-322
Closes scylladb/scylladb#26786
(cherry picked from commit 68c7236acb)
Closes scylladb/scylladb#27272
Currently, _flush_time is stored as a std::optional<gc_clock::time_point>,
and std::nullopt indicates that the flush was needed but failed. It's confusing
for the caller and does not work as expected, since _flush_time is initialized
with a value (not nullopt).
Change the type of _flush_time to gc_clock::time_point. If a flush is needed but
failed, get_flush_time() throws an exception.
This was supposed to be a part of https://github.com/scylladb/scylladb/pull/26319
but it was mistakenly overwritten during rebases.
Refs: https://github.com/scylladb/scylladb/issues/24415.
Closes scylladb/scylladb#26794
(cherry picked from commit e3e81a9a7a)
Add a reproducer to check that the repair_time isn't updated
if the batchlog replay fails.
If repair_time was updated, tombstones could be GC'd before the
batchlog is replayed. The replay could later cause data
resurrection.
(cherry picked from commit 1935268a87)
If any batch replay failed, we cannot update repair_time, as we risk
data resurrection.
If the replay of any batch needs to be retried, run the whole repair but
fail at the very end, so that the repair_time for it won't be updated.
(cherry picked from commit d436233209)
Currently, we skip batch replay if less than batch_log_timeout has passed
from the moment the batch was written. The batch_log_timeout value can
be configured. If it is large, the batch won't be replayed for a long time.
If the tombstone is GC'd before the batch is replayed, then we
risk data resurrection.
To ensure safety, we can skip only the batches that won't be GC'd.
In this patch we skip replay of the batches for which:
now() < written_at + min(timeout, propagation_delay)
repair_time is set at the start of the batchlog replay, so at the moment
of the check we will have:
repair_time <= now()
So we know that:
repair_time < written_at + propagation_delay
With this condition we are sure that GC won't happen.
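The skip condition and the reasoning above, restated as a small sketch
(illustrative names only, not the actual C++ code):

    # Skip replaying a batch only while it is still fresh enough that its
    # related tombstones cannot have been GC'd yet.
    def should_skip_replay(now, written_at, timeout, propagation_delay):
        # now < written_at + min(timeout, propagation_delay) implies
        # repair_time <= now < written_at + propagation_delay, so GC of the
        # relevant tombstones cannot have happened before this replay.
        return now < written_at + min(timeout, propagation_delay)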
(cherry picked from commit e1b2180092)
Return a flag determining whether all the batches were sent successfully in
batchlog_manager::replay_all_failed_batches (batches skipped due to being
too fresh are not counted). Throw in repair_flush_hints_batchlog_handler
if not all batches were replayed, to ensure that repair_time isn't updated.
(cherry picked from commit 7f20b66eff)
batchlog_manager::replay_all_failed_batches skips batches that have an
unknown or incorrect version. The next round will process these batches
again.
Such batches will probably be skipped every time, so there is no point
in keeping them. Even if at some point the version becomes correct,
we should not replay the batch - it might be old, and this may lead
to data resurrection.
(cherry picked from commit 904183734f)
The cqlpy test test_materialized_view.py::test_view_in_system_tables
checks that the system table "system.built_views" can inform us that
a view has been built. This test was flaky, starting to fail quite
often recently, and this patch fixes the problem in the test.
For historic reasons this test began by calling a utility function
wait_for_view_built() - which uses a different system table,
system_distributed.view_build_status, to wait until the view was built.
The test then immediately tries to verify that also system.built_views
lists this view.
But there is no real reason why we could assume - or want to assume -
that these two tables are updated in this order, or how much time
passed between the two tables being changed. The authors of this
test already acknowledged there is a problem - they included a hack
purporting to be a "read barrier" that claimed to solve this exact
problem - but it seems it doesn't, or at least no longer does after
recent changes to the view builder's implementation.
The solution is simple - just remove the call to wait_for_view_built()
and the "hack" after it. We should just wait in a loop (until a timeout)
for the system table that we really wanted to check - system.built_views.
It's as simple as that. No need for any other assumptions or hacks.
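A minimal sketch of the polling loop (assuming the standard system schema,
where system.built_views is keyed by keyspace_name and view_name; cql is the
test's session):

    # Wait, with a timeout, for the view to show up in the table the test
    # actually wants to check - system.built_views.
    import time

    def wait_until_view_built(cql, keyspace, view, timeout=60):
        deadline = time.time() + timeout
        while time.time() < deadline:
            rows = list(cql.execute(
                "SELECT view_name FROM system.built_views "
                f"WHERE keyspace_name = '{keyspace}' AND view_name = '{view}'"))
            if rows:
                return
            time.sleep(0.1)
        raise TimeoutError(f"{keyspace}.{view} not listed in system.built_views")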
Fixes #27296
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes scylladb/scylladb#27626
(cherry picked from commit ccacea621f)
Closes scylladb/scylladb#27670
When creating an alternator table with tablets, if it has an index, LSI
or GSI, require the config option rf_rack_valid_keyspaces to be enabled.
The option is required for materialized views in tablets keyspaces to
function properly and avoid consistency issues that could happen due to
cross-rack migrations and pairing switches when RF-rack validity is not
enforced.
Currently the option is validated when creating a materialized view via
the CQL interface, but it's missing from the alternator interface. Since
alternator indexes are based on materialized views, the same check
should be added there as well.
Fixes scylladb/scylladb#27612
Closes scylladb/scylladb#27622
(cherry picked from commit b9ec1180f5)
Closes scylladb/scylladb#27671
Add pull_request_target event with unlabeled type to trigger-scylla-ci
workflow. This allows automatic CI triggering when the 'conflicts' label
is removed from a PR, in addition to the existing manual trigger via
comment.
The workflow now runs when:
- A user posts a comment with '@scylladbbot trigger-ci' (existing)
- The 'conflicts' label is removed from a PR (new)
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-84
Closes scylladb/scylladb#27521
(cherry picked from commit f7ffa395a8)
Closes scylladb/scylladb#27602
There is no 'regular' incremental mode anymore.
The example seems to have meant 'disabled'.
Fixes #27587
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 5fae4cdf80)
Enable debug logging for the "exception" logger inside the protocol exception tests.
The exceptions will be logged, and it will be possible to see which ones
occurred if a protocol exceptions test fails.
Refs #27272
Refs #27325
(cherry picked from commit c30b326033)
The initial problem:
Some of the tests in test_protocol_exceptions.py started failing. The failure is
on the assertion that no more than `cpp_exception_threshold` exceptions happened.
Test logic:
These tests assert that specific code paths do not throw an exception anymore.
The initial implementation ran a code path once and asserted there were 0 exceptions.
Sometimes one or more exceptions occur that are not directly related to the code
paths the tests check, but they would still fail the tests.
The solution was to run the tests multiple times. If there is a regression, there
would be at least as many exceptions thrown as there are test runs. If there is no
regression, a few exceptions might happen, up to 10 per 100 test runs.
I have arbitrarily chosen the `run_count = 100` and `cpp_exception_threshold = 10` values.
Note that the exceptions are counted per shard, not per code path.
The new problem:
The occasional exceptions thrown by some parts of the server now occur a bit more
often than before. Based on the logs linked in the issues, it is usually 12.
There are possibly multiple ways to resolve the issue. I have considered logging
exceptions and parsing them. I would have to filter the exception logs for only the
wanted exceptions. However, if a new, different exception is introduced, it might not
be counted.
Another approach is to just increase the threshold a bit. The issue of other server
modules throwing more exceptions than before should be addressed by a set of tests
for those modules, just like these tests check protocol exceptions without caring
about who exercised the protocol code paths.
For those reasons, the solution implemented here is to increase `cpp_exception_threshold`
to `20`. It will not make the tests unreliable, because, as mentioned, if there is a
regression, there would be at least `run_count` exceptions per `run_count` test runs
(1 exception per single test run).
Still, to make the "background exceptions" occurrence a bit more normalized, `run_count`
too is doubled, from `100` to `200`. At first glance this looks like nothing changed,
but doubling both the run count and the exception threshold relies on the fact that the
exception burst does not scale with the run count; the "jitter" is simply bigger than
the old threshold.
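A minimal sketch of the test shape this reasoning assumes (helper names are
hypothetical; the real tests read the per-shard exception counters from
Scylla's metrics):

    # A regression throws at least once per run (>= run_count new exceptions),
    # while unrelated "background" exceptions stay far below the threshold.
    run_count = 200
    cpp_exception_threshold = 20

    def check_no_protocol_exceptions(exercise_code_path, read_cpp_exception_count):
        before = read_cpp_exception_count()   # summed over shards
        for _ in range(run_count):
            exercise_code_path()              # the path that must not throw
        after = read_cpp_exception_count()
        assert after - before <= cpp_exception_threshold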
Fixes #27247
Fixes #27325
(cherry picked from commit 807fc68dc5)
This commit removes the now-redundant driver pages from
the ScyllaDB documentation. Instead, links to the pages
where the driver information was moved are added.
Also, the links are updated across the ScyllaDB manual.
Redirections are added for all the removed pages.
Fixes https://github.com/scylladb/scylladb/issues/26871
Closes scylladb/scylladb#27277
(cherry picked from commit c5580399a8)
Closes scylladb/scylladb#27442