Raft topology goes over all nodes in the 'left' state and triggers a 'remove
node' notification if the id/ip mapping is still available (meaning the node
left recently). The problem is that, since the mapping is not removed
immediately, when multiple nodes are removed in succession a notification
for the same node can be sent several times. Fix that by sending the
notification only if the node still exists in the peers table: the node is
removed from the peers table by the first notification, so subsequent
notifications will not be sent.
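A minimal sketch of the deduplication idea (illustrative pseudocode only; the
real implementation is C++ and the names here are hypothetical):

    # Sketch, not the actual implementation: send the 'remove node'
    # notification only while the node is still present in the peers table;
    # handling the first notification removes the entry, so later passes
    # become no-ops.
    def maybe_notify_left(node_id, id_to_ip, peers, notify_remove_node):
        if node_id not in id_to_ip:
            return  # mapping already gone, the node left long ago
        if node_id not in peers:
            return  # already notified; the first notification removed the entry
        notify_remove_node(node_id, id_to_ip[node_id])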
Closes scylladb/scylladb#27743
(cherry picked from commit 4a5292e815)
Closes scylladb/scylladb#27913
The `test_insert_failure_doesnt_report_success` test in `test/cluster/dtest/audit_test.py`
has an insert statement that is expected to fail. The dtest environment uses
`FlakyRetryPolicy`, which has `max_retries = 5`. One initial failure and 5 retry failures
mean we expect 6 error audit logs.
The test failed because `create keyspace ks` failed once, then succeeded on retry.
This allowed the test to proceed properly, but the last part of the test, which expects
exactly 6 failed queries, actually saw 7.
The goal of this patch is to make sure there are exactly 6 = 1 + `max_retries` failed
queries, counting only the query that is expected to fail. If other queries fail with
a successful retry, that's fine. If other queries fail without a successful retry, the
test will fail, as it should in such situations; they are not related to this expected
failed insert statement.
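A hedged sketch of how the final check could count only that statement's
failures, assuming table-based audit storage (`audit.audit_log`) with an
`error` flag and the statement text in `operation`; the statement and session
below are placeholders:

    # Count failed audit entries only for the insert that is expected to fail,
    # ignoring other queries that failed but succeeded on retry.
    failing_insert = "INSERT INTO ks.t (pk, c) VALUES (1, 1)"  # placeholder
    rows = session.execute("SELECT operation, error FROM audit.audit_log")
    failures = [r for r in rows if r.error and failing_insert in r.operation]
    assert len(failures) == 1 + 5  # 1 initial failure + max_retries retries = 6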
Fixes #27322
Closes scylladb/scylladb#27378
(cherry picked from commit f545ed37bc)
Closes scylladb/scylladb#27582
The initial problem:
Some of the tests in test_protocol_exceptions.py started failing. The failure is on the assertion that no more than `cpp_exception_threshold` exceptions happened.
Test logic:
These tests assert that specific code paths do not throw an exception anymore. The initial implementation ran a code path once and asserted there were 0 exceptions. Sometimes one or more exceptions occur that are not directly related to the code paths the tests check, but they would still fail the tests.
The solution was to run the tests multiple times. If there is a regression, there would be at least as many exceptions thrown as there are test runs. If there is no regression, a few exceptions might happen, up to 10 per 100 test runs. I have arbitrarily chosen the `run_count = 100` and `cpp_exception_threshold = 10` values.
Note that the exceptions are counted per shard, not per code path.
The new problem:
The occasional exceptions thrown by some parts of the server now occur a bit more often than before. Based on the logs linked in the issues, it is usually 12.
There are possibly multiple ways to resolve the issue. I have considered logging exceptions and parsing them. I would have to filter the exception logs for only the wanted exceptions. However, if a new, different exception is introduced, it might not be counted.
Another approach is to just increase the threshold a bit. The issue of other server modules throwing more exceptions than before should be addressed by a set of tests for those modules, just like these tests check protocol exceptions without caring about who exercised the protocol code paths.
For those reasons, the solution implemented here is to increase `cpp_exception_threshold` to `20`. It will not make the tests unreliable, because, as mentioned, if there is a regression, there would be at least `run_count` exceptions per `run_count` test runs (1 exception per single test run).
Still, to make the "background exceptions" occurrence a bit more normalized, `run_count` too is doubled, from `100` to `200`. At first glance this looks like nothing changed, but doubling both the run count and the exception threshold relies on the fact that the exception burst does not scale with the run count; the "jitter" is simply bigger than the old threshold.
Also, this patch series enables debug logging for the `exception` logger. This will allow us to inspect which exceptions happened if a protocol exceptions test fails again.
Fixes #27247
Fixes #27325
Issue observed on master and branch-2025.4. The tests, in the same form, exist on master, branch-2025.4, branch-2025.3, branch-2025.2, and branch-2025.1. The code change is simple, and no issue is expected with backport automation. Thus, backports for all the aforementioned versions are requested.
- (cherry picked from commit 807fc68dc5)
- (cherry picked from commit c30b326033)
Parent PR: #27412
Closes scylladb/scylladb#27555
* github.com:scylladb/scylladb:
test: cqlpy: test_protocol_exceptions.py: enable debug exception logging
test: cqlpy: test_protocol_exceptions.py: increase cpp exceptions threshold
Commit d3efb3ab6f added a streaming session for rebuild, but it set
the session at request submission time. The session should be set when
the request starts executing, so this patch moves it to the correct
place.
Closes scylladb/scylladb#27757
(cherry picked from commit 04976875cc)
Closes scylladb/scylladb#27867
The test test_truncate_during_topology_change tests TRUNCATE TABLE while
bootstrapping a new node. With tablets enabled, TRUNCATE is a global
topology operation which needs to serialize with bootstrap.
When TRUNCATE TABLE is issued, it first checks if there is an already
queued truncate for the same table. This can happen if a previous
TRUNCATE operation has timed out and the client retried. The newly
issued truncate will only join the queued one if it is waiting to be
processed, and will fail immediately if the TRUNCATE is already being
processed.
In this test, TRUNCATE will be retried after a timeout (1 minute) due to
the default retry policy, and will be retried up to 3 times, while the
bootstrap is delayed by 2 minutes. This means that the test can end up
validating the result of a truncate which was started after the bootstrap
was completed.
Because of the way truncate joins existing truncate operations, we can
also have the following scenario:
- TRUNCATE times out after one minute because the new node is being
bootstrapped
- the client retries the TRUNCATE command which also times out after 1m
- the third attempt is received during TRUNCATE being processed which
fails the test
This patch changes the retry policy of the TRUNCATE operation to
FallthroughRetryPolicy which guarantees that TRUNCATE will not be
retried on timeout. It also increases the timeout of the TRUNCATE from 1
to 4 minutes. This way the test will actually validate the performance
of the TRUNCATE operation which was issued during bootstrap, instead of
the subsequent, retried TRUNCATEs which could have been issued after the
bootstrap was complete.
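A minimal sketch of the changed statement in the test (Python driver; `ks.t`
and `session` are placeholders for the test's table and session):

    # FallthroughRetryPolicy disables client-side retries on timeout, and the
    # per-request timeout is raised from 1 to 4 minutes, so the single TRUNCATE
    # issued during bootstrap is the one being validated.
    from cassandra.policies import FallthroughRetryPolicy
    from cassandra.query import SimpleStatement

    stmt = SimpleStatement("TRUNCATE ks.t", retry_policy=FallthroughRetryPolicy())
    session.execute(stmt, timeout=240)  # seconds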
Fixes: #26347
Closes scylladb/scylladb#27245
(cherry picked from commit d883ff2317)
Closes scylladb/scylladb#27507
The `vector_store_client_test_dns_resolving_repeated` test had race
conditions causing it to be flaky. Two main issues were identified:
1. Race between initial refresh and manual trigger: The test assumes
a specific resolution sequence, but timing variations between the
initial DNS refresh (on client creation) and the first manual
trigger (in the test loop) can cause unexpected delayed scheduling.
2. Extra triggers from resolve_hostname fiber: During the client
refresh phase, the background DNS fiber clears the client list.
If resolve_hostname executes in the window after clearing but
before the update completes, pending triggers are processed,
incrementing the resolution count unexpectedly. At count 6, the
mock resolver returns a valid address (count % 3 == 0), causing
the test to fail.
The fix relaxes test assertions to verify retry behavior and client
clearing on DNS address loss, rather than enforcing exact resolution
counts.
Fixes: #27074
Closes scylladb/scylladb#27685
(cherry picked from commit addac8b3f7)
Closes scylladb/scylladb#27799
We currently allow restrictions on a single-column primary key,
but we ignore the restriction and return all results.
This can confuse users. We change it so that such a restriction
throws an error, and add a test to validate it.
Fixes: VECTOR-331
Closes scylladb/scylladb#27668
The test had a sporadic failure due to a broken promise exception.
The issue was in `test_pinger::ping()` which captured the promise by
move into the subscription lambda, causing the promise to be destroyed
when the lambda was destroyed during coroutine unwinding.
Simplify `test_pinger::ping()` by replacing manual abort_source/promise
logic with `seastar::sleep_abortable()`.
This removes the risk of promise lifetime/race issues and makes the code
simpler and more robust.
Fixes: scylladb/scylladb#27136
Backport to active branches: This fixes a CI test issue, so it is
beneficial to backport the fix. As this is a test-only fix, it is a low
risk change.
Closes scylladb/scylladb#27737
(cherry picked from commit 2a75b1374e)
Closes scylladb/scylladb#27784
We currently have races, for example between moving an sstable out of
staging using change_state, or taking a snapshot, and
rewrite_statistics, which replaces one of the sstable component files
when called, for example, from update_repaired_at by incremental repair.
Use a semaphore as a mutex to serialize those functions.
Note that there is no need for an rwlock since the operations
are rare and read-only operations like snapshot don't
need to run in parallel.
Fixes #25919
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 9e18cfbe17)
Closes scylladb/scylladb#27751
This test starts a 3-node cluster and creates a large blob file so that one
node reaches critical disk utilization, triggering write rejections on that
node. The test then writes data with CL=QUORUM and validates that the data:
- did not reach the critically utilized node
- did reach the remaining two nodes
By default, tables use speculative retries to determine when coordinators may
query additional replicas.
Since the validation uses CL=ONE, it is possible that an additional request
is sent to satisfy the consistency level. As a result:
- the first check may fail if the additional request is sent to a node that
already contains data, making it appear as if data reached the critically
utilized node
- the second check may fail if the additional request is sent to the critically
utilized node, making it appear as if data did not reach the healthy node
The patch fixes the flakiness by disabling the speculative retries.
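One way to disable speculative retries for the table under test (`ks.t` and
`session` are placeholder names):

    # With speculative_retry set to 'NONE', the coordinator never sends an
    # extra request to another replica for the CL=ONE validation reads.
    session.execute("ALTER TABLE ks.t WITH speculative_retry = 'NONE'")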
Fixes https://github.com/scylladb/scylladb/issues/27212
Closes scylladb/scylladb#27488
(cherry picked from commit 2cb9bb8f3a)
Closes scylladb/scylladb#27773
Currently, batchlog replay is considered successful even if all batches fail
to be sent (they are replayed later). However, repair requires all batches
to be sent successfully. Currently, if the batchlog isn't cleared, the repair
never learns about it and still updates the repair_time. If the GC mode is set
to "repair", this means that tombstones written before the repair_time (minus
propagation_delay) can be GC'd while not all batches have been replayed.
Consider a scenario:
- Table t has a row with (pk=1, v=0);
- There is an entry in the batchlog that sets (pk=1, v=1) in table t;
- The row with pk=1 is deleted from table t;
- Table t is repaired:
  - batchlog replay fails;
  - repair_time is updated;
- propagation_delay seconds pass and the tombstone of pk=1 is GC'd;
- the batchlog is replayed and (pk=1, v=1) is inserted - data resurrection!
Do not update repair_time if sending any batch fails. The data is still
repaired. For tablet repair, the repair runs, but at the end the exception is
passed to the topology coordinator. Thanks to that, the repair_time isn't
updated. The repair request isn't removed either, so the repair will need
to be rerun.
Apart from that, a batch is removed from the batchlog if its version is invalid
or unknown. The condition under which we consider a batch too fresh to replay
is updated to take propagation_delay into account.
Fixes: https://github.com/scylladb/scylladb/issues/24415
Data resurrection fix; needs backport to all versions
- (cherry picked from commit 502b03dbc6)
- (cherry picked from commit 904183734f)
- (cherry picked from commit 7f20b66eff)
- (cherry picked from commit e1b2180092)
- (cherry picked from commit d436233209)
- (cherry picked from commit 1935268a87)
- (cherry picked from commit 6fc43f27d0)
Parent PR: #26319
Closes scylladb/scylladb#26766
* github.com:scylladb/scylladb:
repair: throw if flush failed in get_flush_time
db: fix indentation
test: add reproducer for data resurrection
repair: fail tablet repair if any batch wasn't sent successfully
db/batchlog_manager: fix making decision to skip batch replay
db: repair: throw if replay fails
db/batchlog_manager: delete batch with incorrect or unknown version
db/batchlog_manager: coroutinize replay_all_failed_batches
Fix an unlikely use-after-free in `encode_paging_state`. The function
incorrectly assumes that the current position to encode will always have
data for all clustering columns the schema defines. It's possible to
encounter a current position that has fewer than all columns specified, for
example in the case of a range tombstone. Those don't happen in Alternator
tables, as DynamoDB doesn't allow range deletions and the clustering key
can have at most one column. However, the Alternator API can be used to read
Scylla system tables, and those do have range tombstones with more
than a single clustering column.
The fix is to stop trying to encode columns that don't have a value -
they are not needed anyway, as there is no possible position with those
values (the range tombstone made sure of that).
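An illustrative sketch of the fix (the real code is C++; helper names here are
hypothetical):

    # Encode clustering values only up to the first missing one, since a
    # position coming from a range tombstone may define only a prefix of the
    # clustering key.
    def encode_clustering_prefix(position_values, encode_value):
        encoded = []
        for value in position_values:
            if value is None:  # prefix position: the remaining columns are unset
                break
            encoded.append(encode_value(value))
        return encoded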
Fixes #27001
Fixes #27125
Closes scylladb/scylladb#26960
(cherry picked from commit b54a9f4613)
Closes scylladb/scylladb#27347
The test executes a LWT query in order to create a paxos state table and
verify the table properties. However, after executing the LWT query, the
table may not exist on all nodes but only on a quorum of nodes, thus
checking the properties of the table may fail if the table doesn't exist
on the queried node.
To fix that, execute a group0 read barrier to ensure the table is
created on all nodes.
Fixes scylladb/scylladb#27398
Closes scylladb/scylladb#27401
(cherry picked from commit 9213a163cb)
Closes scylladb/scylladb#27411
Currently some things are not supported for colocated tables: it's not
possible to repair a colocated table, and due to this it's also not
possible to use the tombstone_gc=repair mode on a colocated table.
Extend the documentation to explain what colocated tables are and
document these restrictions.
Fixes scylladb/scylladb#27261
Closes scylladb/scylladb#27516
(cherry picked from commit 33f7bc28da)
Closes scylladb/scylladb#27772
In several exception handlers, only raft::request_aborted was being
caught and rethrown, while seastar::abort_requested_exception was
falling through to the generic catch(...) block. This caused the
exception to be incorrectly treated as a failure that triggers
rollback, instead of being recognized as an abort signal.
For example, during tablet draining, the error log showed:
"tablets draining failed with seastar::abort_requested_exception
(abort requested). Aborting the topology operation"
This change adds seastar::abort_requested_exception handling
alongside raft::request_aborted in all places where it was missing.
When rethrown, these exceptions propagate up to the main run() loop
where handle_topology_coordinator_error() recognizes them as normal
abort signals and allows the coordinator to exit gracefully without
triggering unnecessary rollback operations.
Fixes: scylladb/scylladb#27255
(cherry picked from commit 37e3dacf33)
Closes scylladb/scylladb#27663
We saw that in large clusters the direct failure detector may cause large task queues to accumulate. The series addresses this issue and also moves the code into the correct scheduling group.
Fixes https://github.com/scylladb/scylladb/issues/27142
Backport to all versions where 60f1053087 was backported, since it should improve performance in large clusters.
- (cherry picked from commit 82f80478b8)
- (cherry picked from commit 6a6bbbf1a6)
- (cherry picked from commit 86dde50c0d)
Parent PR: #27387
Closes scylladb/scylladb#27483
* https://github.com/scylladb/scylladb:
direct_failure_detector: run direct failure detector in the gossiper scheduling group
raft: drop invoke_on from the pinger verb handler
direct_failure_detector: pass timeout to direct_fd_ping verb
The root cause of the hanging test is a concurrency deadlock.
`vector_store_client` runs a periodic DNS refresh and waits on a condition
variable. After aborting the DNS request, the test signals the condition
variable. Stopping the vector_store_client takes long enough to trigger the
next DNS refresh - and this time the condition variable won't be signalled -
so vector_store_client will wait forever for the DNS refresh fiber to finish.
The commit fixes the problem by waiting for the condition variable only once.
Fixes: #27237
Fixes: VECTOR-370
Closes scylladb/scylladb#27239
(cherry picked from commit b5c85d08bb)
Closes scylladb/scylladb#27393
The tablet scheduler should not emit conflicting migrations for the same
tablet. This was addressed initially in scylladb/scylladb#26038 but the
check is missing in the merge colocation plan, so add it there as well.
Without this check, the merge colocation plan could generate a
conflicting migration for a tablet that is already scheduled for
migration, as the test demonstrates.
This can cause correctness problems, because if the load balancer
generates two migrations for a single tablet, both will be written as
mutations, and the resulting mutation could contain mixed cells from
both migrations.
Fixes scylladb/scylladb#27304
Closes scylladb/scylladb#27312
(cherry picked from commit 97b7c03709)
Closes scylladb/scylladb#27331
This patch enforces that vector indexes can only be created on keyspaces
that use tablets. During index validation, `check_uses_tablets()` verifies
the base keyspace configuration and rejects creation otherwise.
To support this, the `custom_index::validate()` API now receives a
`const data_dictionary::database&` parameter, allowing index
implementations to access keyspace-level settings during DDL validation.
Fixes https://scylladb.atlassian.net/browse/VECTOR-322
Closes scylladb/scylladb#26786
(cherry picked from commit 68c7236acb)
Closes scylladb/scylladb#27272
Currently, _flush_time is stored as a std::optional<gc_clock::time_point>,
and std::nullopt indicates that the flush was needed but failed. It's confusing
for the caller and does not work as expected, since _flush_time is initialized
with a value (not nullopt).
Change the type of _flush_time to gc_clock::time_point. If a flush is needed but
failed, get_flush_time() throws an exception.
This was supposed to be a part of https://github.com/scylladb/scylladb/pull/26319
but it was mistakenly overwritten during rebases.
Refs: https://github.com/scylladb/scylladb/issues/24415.
Closes scylladb/scylladb#26794
(cherry picked from commit e3e81a9a7a)
Add a reproducer to check that the repair_time isn't updated
if the batchlog replay fails.
If repair_time was updated, tombstones could be GC'd before the
batchlog is replayed. The replay could later cause data
resurrection.
(cherry picked from commit 1935268a87)
If any batch replay failed, we cannot update repair_time, as we risk
data resurrection.
If the replay of any batch needs to be retried, run the whole repair but
fail at the very end, so that the repair_time for it won't be updated.
(cherry picked from commit d436233209)
Currently, we skip batch replay if less than batch_log_timeout has passed
from the moment the batch was written. The batch_log_timeout value can
be configured. If it is large, the batch won't be replayed for a long time.
If the tombstone is GC'd before the batch is replayed, then we
risk data resurrection.
To ensure safety, we can skip only the batches that won't be GC'd.
In this patch we skip replay of the batches for which:
now() < written_at + min(timeout, propagation_delay)
repair_time is set at the start of the batchlog replay, so at the moment
of the check we will have:
repair_time <= now()
So we know that:
repair_time < written_at + propagation_delay
With this condition we are sure that GC won't happen.
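The skip condition and the reasoning above, restated as a small sketch
(illustrative names only, not the actual C++ code):

    # Skip replaying a batch only while it is still fresh enough that its
    # related tombstones cannot have been GC'd yet.
    def should_skip_replay(now, written_at, timeout, propagation_delay):
        # now < written_at + min(timeout, propagation_delay) implies
        # repair_time <= now < written_at + propagation_delay, so GC of the
        # relevant tombstones cannot have happened before this replay.
        return now < written_at + min(timeout, propagation_delay)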
(cherry picked from commit e1b2180092)
Return a flag determining whether all the batches were sent successfully in
batchlog_manager::replay_all_failed_batches (batches skipped due to being
too fresh are not counted). Throw in repair_flush_hints_batchlog_handler
if not all batches were replayed, to ensure that repair_time isn't updated.
(cherry picked from commit 7f20b66eff)
batchlog_manager::replay_all_failed_batches skips batches that have an
unknown or incorrect version. The next round will process these batches
again.
Such batches will probably be skipped every time, so there is no point
in keeping them. Even if at some point the version becomes correct,
we should not replay the batch - it might be old, and this may lead
to data resurrection.
(cherry picked from commit 904183734f)
The cqlpy test test_materialized_view.py::test_view_in_system_tables
checks that the system table "system.built_views" can inform us that
a view has been built. This test was flaky, starting to fail quite
often recently, and this patch fixes the problem in the test.
For historic reasons this test began by calling a utility function
wait_for_view_built() - which uses a different system table,
system_distributed.view_build_status, to wait until the view was built.
The test then immediately tries to verify that also system.built_views
lists this view.
But there is no real reason why we could assume - or want to assume -
that these two tables are updated in this order, or how much time
passed between the two tables being changed. The authors of this
test already acknowledged there is a problem - they included a hack
purporting to be a "read barrier" that claimed to solve this exact
problem - but it seems it doesn't, or at least no longer does after
recent changes to the view builder's implementation.
The solution is simple - just remove the call to wait_for_view_built()
and the "hack" after it. We should just wait in a loop (until a timeout)
for the system table that we really wanted to check - system.built_views.
It's as simple as that. No need for any other assumptions or hacks.
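A minimal sketch of the polling loop (assuming the standard system schema,
where system.built_views is keyed by keyspace_name and view_name; cql is the
test's session):

    # Wait, with a timeout, for the view to show up in the table the test
    # actually wants to check - system.built_views.
    import time

    def wait_until_view_built(cql, keyspace, view, timeout=60):
        deadline = time.time() + timeout
        while time.time() < deadline:
            rows = list(cql.execute(
                "SELECT view_name FROM system.built_views "
                f"WHERE keyspace_name = '{keyspace}' AND view_name = '{view}'"))
            if rows:
                return
            time.sleep(0.1)
        raise TimeoutError(f"{keyspace}.{view} not listed in system.built_views")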
Fixes #27296
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes scylladb/scylladb#27626
(cherry picked from commit ccacea621f)
Closes scylladb/scylladb#27670
When creating an alternator table with tablets, if it has an index, LSI
or GSI, require the config option rf_rack_valid_keyspaces to be enabled.
The option is required for materialized views in tablets keyspaces to
function properly and avoid consistency issues that could happen due to
cross-rack migrations and pairing switches when RF-rack validity is not
enforced.
Currently the option is validated when creating a materialized view via
the CQL interface, but it's missing from the alternator interface. Since
alternator indexes are based on materialized views, the same check
should be added there as well.
Fixes scylladb/scylladb#27612
Closes scylladb/scylladb#27622
(cherry picked from commit b9ec1180f5)
Closes scylladb/scylladb#27671
Add pull_request_target event with unlabeled type to trigger-scylla-ci
workflow. This allows automatic CI triggering when the 'conflicts' label
is removed from a PR, in addition to the existing manual trigger via
comment.
The workflow now runs when:
- A user posts a comment with '@scylladbbot trigger-ci' (existing)
- The 'conflicts' label is removed from a PR (new)
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-84
Closes scylladb/scylladb#27521
(cherry picked from commit f7ffa395a8)
Closes scylladb/scylladb#27602
There is no 'regular' incremental mode anymore.
The example seems to have meant 'disabled'.
Fixes #27587
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 5fae4cdf80)
Enable debug logging for the "exception" logger inside the protocol exception tests.
The exceptions will be logged, and it will be possible to see which ones
occurred if a protocol exceptions test fails.
Refs #27272
Refs #27325
(cherry picked from commit c30b326033)
The initial problem:
Some of the tests in test_protocol_exceptions.py started failing. The failure is
on the assertion that no more than `cpp_exception_threshold` exceptions happened.
Test logic:
These tests assert that specific code paths do not throw an exception anymore.
The initial implementation ran a code path once and asserted there were 0 exceptions.
Sometimes one or more exceptions occur that are not directly related to the code
paths the tests check, but they would still fail the tests.
The solution was to run the tests multiple times. If there is a regression, there
would be at least as many exceptions thrown as there are test runs. If there is no
regression, a few exceptions might happen, up to 10 per 100 test runs.
I have arbitrarily chosen the `run_count = 100` and `cpp_exception_threshold = 10` values.
Note that the exceptions are counted per shard, not per code path.
The new problem:
The occasional exceptions thrown by some parts of the server now occur a bit more
often than before. Based on the logs linked in the issues, it is usually 12.
There are possibly multiple ways to resolve the issue. I have considered logging
exceptions and parsing them. I would have to filter the exception logs for only the
wanted exceptions. However, if a new, different exception is introduced, it might not
be counted.
Another approach is to just increase the threshold a bit. The issue of other server
modules throwing more exceptions than before should be addressed by a set of tests
for those modules, just like these tests check protocol exceptions without caring
about who exercised the protocol code paths.
For those reasons, the solution implemented here is to increase `cpp_exception_threshold`
to `20`. It will not make the tests unreliable, because, as mentioned, if there is a
regression, there would be at least `run_count` exceptions per `run_count` test runs
(1 exception per single test run).
Still, to make the "background exceptions" occurrence a bit more normalized, `run_count`
too is doubled, from `100` to `200`. At first glance this looks like nothing changed,
but doubling both the run count and the exception threshold relies on the fact that the
exception burst does not scale with the run count; the "jitter" is simply bigger than
the old threshold.
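A minimal sketch of the test shape this reasoning assumes (helper names are
hypothetical; the real tests read the per-shard exception counters from
Scylla's metrics):

    # A regression throws at least once per run (>= run_count new exceptions),
    # while unrelated "background" exceptions stay far below the threshold.
    run_count = 200
    cpp_exception_threshold = 20

    def check_no_protocol_exceptions(exercise_code_path, read_cpp_exception_count):
        before = read_cpp_exception_count()   # summed over shards
        for _ in range(run_count):
            exercise_code_path()              # the path that must not throw
        after = read_cpp_exception_count()
        assert after - before <= cpp_exception_threshold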
Fixes #27247
Fixes #27325
(cherry picked from commit 807fc68dc5)
This commit removes the now-redundant driver pages from
the ScyllaDB documentation. Instead, links to the pages
where the driver information was moved are added.
Also, the links are updated across the ScyllaDB manual.
Redirections are added for all the removed pages.
Fixes https://github.com/scylladb/scylladb/issues/26871
Closes scylladb/scylladb#27277
(cherry picked from commit c5580399a8)
Closes scylladb/scylladb#27442