- Shows complete review workflow from analysis to corrected code
- Demonstrates P0/P1/P2 prioritization with real examples
- Includes before/after comparisons and specific feedback templates
- Provides practical demonstration of key reviewer mantras
Co-authored-by: ptrsmrn <124208650+ptrsmrn@users.noreply.github.com>
Most likely, the root cause of the flaky test was that the TLS handshake hung for an extended period (60s). This caused
the test case to fail because the ANN request duration exceeded the test case timeout.
The PR introduces two changes:
* Mitigation of the hanging TLS handshake: This issue likely occurred because the test performed certificate rewrites
simultaneously with ANN requests that utilize those certificates.
* Production code fix: This addresses a bug where the TLS handshake itself was not covered by the connection timeout.
Since tls::connect does not perform the handshake immediately, the handshake only occurs during the first write
operation, potentially bypassing connect timeout.
Fixes: #28012
Backport to 2026.01 and 2025.04 is needed, as these branches are also affected and may experience CI flakiness due to this test.
Closesscylladb/scylladb#28617
* github.com:scylladb/scylladb:
vector_search: Fix missing timeout on TLS handshake
vector_search: test: Fix flaky cert rewrite test
test_remove_node_violating_rf_rack_with_rack_list creates a cluster
with four nodes. One of the nodes is excluded, then another one is
stopped, excluded, and removed. If the two stopped nodes were both
voters, the majority is lost and the cluster loses its raft leader.
As a result, the node cannot be removed and the operation times out.
Add the 5th node to the cluster. This way the majority is always up.
Fixes: https://github.com/scylladb/scylladb/issues/28596.
Closesscylladb/scylladb#28610
The test creates a single node cluster, then creates 3 tables which
remain empty. Then it adds another node with half the disk capacity of
the first one, and then it waits for the balancer to migrate tablets to
the newly added node by calling the quiesce topology API. The number of
tablets on the smaller node should be exactly half the number of tablets
on the larger node.
After waiting for quiesce topology, we could have a situation where we
query the number of tablets from the node which still hasn't processed
the last tablet migrations and updated system.tablets.
This patch adds a read barrier so that both nodes see the same tablets
metadata before we query the number of tablets.
Fixes: SCYLLADB-603
Closesscylladb/scylladb#28598
Currently the TLS handshake in the vector search client does not have a timeout.
This is because tls::connect does not perform handshake itself; the handshake
is deferred until the first read/write operation is performed. This can lead to long
hangs on ANN requests.
This commit calls tls::check_session_is_resumed() after tls::connect
to force the handshake to happen immediately and to run under with_timeout.
The test is flaky most likely because when TLS certificate rewrite
happens simultaneously with an ANN request, the handshake can hang for a
long time (~60s). This leads to a timeout in the test case.
This change introduces a checkpoint in the test so that it will
wait for the certificate rewrite to happen before sending an ANN request,
which should prevent the handshake from hanging and make the test more reliable.
Fixes: #28012
The test `test_sync_point` had a few shortcomings that made it flaky
or simply wrong:
1. We were verifying that hints were written by checking the size of
in-flight hints. However, that could potentially lead to problems
in rare situations.
For instance, if all of the hints failed to be written to disk, the
size of in-flight hints would drop to zero, but creating a sync point
would correspond to the empty state.
In such a situation, we should fail immediately and indicate what
the cause was.
2. A sync point corresponds to the hints that have already been written
to disk. The number of those is tracked by the metric `written`.
It's a much more reliable way to make sure that hints have been
written to the commitlog. That ensures that the sync point we'll
create will really correspond to those hints.
3. The auxiliary function `wait_for` used in the test works like this:
it executes the passed callback and looks at the result. If it's
`None`, it retries it. Otherwise, the callback is deemed to have
finished its execution and no further retries will be attempted.
Before this commit, we simply returned a bool, and so the code was
wrong. We improve it.
---
Note that this fixesscylladb/scylladb#28203, which was a manifestation
of scylladb/scylladb#25879. We created a sync point that corresponded
to the empty state, and so it immediately resolved, even when node 3
was still dead.
As a bonus, we rewrite the auxiliary code responsible for fetching
metrics and manipulating sync points. Now it's asynchronous and
uses the existing standard mechanisms available to developers.
Furthermore, we reduce the time needed for executing
`test_sync_point` by 27 seconds.
---
The total difference in time needed to execute the whole test file
(on my local machine, in dev mode):
Before:
CPU utilization: 0.9%
real 2m7.811s
user 0m25.446s
sys 0m16.733s
After:
CPU utilization: 1.1%
real 1m40.288s
user 0m25.218s
sys 0m16.566s
---
Refs scylladb/scylladb#25879Fixesscylladb/scylladb#28203
Backport: This improves the stability of our CI, so let's
backport it to all supported versions.
Closesscylladb/scylladb#28602
* github.com:scylladb/scylladb:
test: cluster: Reduce wait time in test_sync_point
test: cluster: Fix test_sync_point
test: cluster: Await sync points asynchronously
test: cluster: Create sync points asynchronously
test: cluster: Fetch hint metrics asynchronously
If everything is OK, the sync point will not resolve with node 3 dead.
As a result, the waiting will use all of the time we allocate for it,
i.e. 30 seconds. That's a lot of time.
There's no easy way to verify that the sync point will NOT resolve, but
let's at least reduce the waiting to 3 seconds. If there's a bug, it
should be enough to trigger it at some point, while reducing the average
time needed for CI.
The test had a few shortcomings that made it flaky or simply wrong:
1. We were verifying that hints were written by checking the size of
in-flight hints. However, that could potentially lead to problems
in rare situations.
For instance, if all of the hints failed to be written to disk, the
size of in-flight hints would drop to zero, but creating a sync point
would correspond to the empty state.
In such a situation, we should fail immediately and indicate what
the cause was.
2. A sync point corresponds to the hints that have already been written
to disk. The number of those is tracked by the metric `written`.
It's a much more reliable way to make sure that hints have been
written to the commitlog. That ensures that the sync point we'll
create will really correspond to those hints.
3. The auxiliary function `wait_for` used in the test works like this:
it executes the passed callback and looks at the result. If it's
`None`, it retries it. Otherwise, the callback is deemed to have
finished its execution and no further retries will be attempted.
Before this commit, we simply returned a bool, and so the code was
wrong. We improve it.
Note that this fixesscylladb/scylladb#28203, which was a manifestation
of scylladb/scylladb#25879. We created a sync point that corresponded
to the empty state, and so it immediately resolved, even when node 3
was still dead.
Refs scylladb/scylladb#25879Fixesscylladb/scylladb#28203
This reverts commit bcd1758911, reversing
changes made to b2c2a99741.
There is a design decision to not introduce additional test
orchestration tool for scylladb.git (see comments for #27499). One
commit has already been reverted in 55c7bc7. Last CI runs made validator
test flaky, so it is a time to remove all remaining validator tests.
It needs a backport to 2026.1 to remove remaining validator tests from there.
Fixes: VECTOR-497
Closesscylladb/scylladb#28568
When running a gdb command, we check that the string 'Error'
does not appear within the output. However, if the command output
includes the string 'Error' as part of its normal operation, this
generates a false positive. In fact the task_histogram can include
the string 'error::Error' from the Rust core::error module.
Allow for that and only match 'Error' that isn't 'error::Error'.
Fixes#28516.
Closesscylladb/scylladb#28574
The test `test_size_based_load_balancing.py::test_balance_empty_tablets`
waits for tablet load stats to be refreshed and uses the
`short_tablet_stats_refresh_interval` injection to speed up the refresh
interval.
This injection has no effect; it was replaced by the
`tablet_load_stats_refresh_interval_in_seconds` config option (patch: 1d6808aec4),
so the test currently waits for 60 seconds (default refresh interval).
Use the config option. This reduces the execution time to ~8 seconds.
Fixes SCYLLADB-556.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Closesscylladb/scylladb#28536
This field is only used to initialize the following _memtable_controller
one. It's simpler just to do the initialization with whatever value the
field itself is initialized and drop the field itself.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#28539
This patch fixes the few remaining cases of XPASS in test/cqlpy and test/alternator.
These are tests which, when written, reproduced a bug and therefore were marked "xfail", but some time later the bug was fixed and we either did not notice it was ever fixed, or just forgot to remove the xfail marker.
Removing the no-longer-needed xfail markers is good for test hygiene, but more importantly is needed to avoid regressions in those already-fixed areas (if a test is already marked xfail, it can start to fail in a new way and we wouldn't notice).
Backport not needed, xpass doesn't bother anyone.
Closesscylladb/scylladb#28441
* github.com:scylladb/scylladb:
test/cqlpy: remove xfail from tests for fixed issue 7972
test/cqlpy: remove xfail from tests for fixed issue 10358
test/cqlpy: remove xfail from passing test testInvalidNonFrozenUDTRelation
test/alternator: remove xfail from passing test_update_item_increases_metrics_for_new_item_size_only
The goal of this small pull request is to reproduce issue #28439, which found a bug in the Alternator Streams output when BatchWriteItem is called to write multiple items in the same partition, and always_use_lwt write isolation mode is used.
* The first patch reproduces this specific bug in Alternator Streams.
* The second patch adds missing (Fixes#28171) tests for BatchWriteItem in different write modes, and shows that BatchWriteItem itself works correctly - the bug is just in Alternator Streams' reporting of this write.
Closesscylladb/scylladb#28528
* github.com:scylladb/scylladb:
test/alternator: add test for BatchWriteItem with different write isolations
test/alternator: reproducer for Alternator Streams bug
It turns out that the cdc driver requires permissions to two additional system tables. This patch adds them to VECTOR_SEARCH_INDEXING and modifies the unit tests. The integration with vector store was tested manually, integration tests will be added in vector-store repository in a follow up PR.
Fixes: SCYLLADB-522
Closesscylladb/scylladb#28519
Alternator's various write operations have different code paths for the
different write isolation modes. Because most of the test suite runs in
only a single write mode (currently - only_rmw_uses_lwt), we already
introduced a test file test/alternator/test_write_isolation.py for
checking the different write operations in *all* four write isolation
modes.
But we missed testing one write operation - BatchWriteItem. This
operation isn't very "interesting" because it doesn't support *any*
read-modify-option option (it doesn't support UpdateExpression,
ConditionExpression or ReturnValues), but even without those, the
pure write code still has different code paths with and without LWT,
and should be tested. So we add the missing test here - and it passes.
In issue #28439 we discovered a bug that can be seen in Alternator
Streams in the case of BatchWriteItem with multiple writes to the
same partition and always_use_lwt mode. The fact that the test added
here passes shows that the bug is NOT in BatchWriteItem itself, which
works correctly in this case - but only in the Alternator Streams layer.
Fixes#28171
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a reproducer for an Alternator Streams bug described in
issue #28439, where the stream returns the wrong events (and fewer of
them) in the following specific combination of the following circumstances:
1. A BatchWriteItem operation writing multiple items to the *same*
partition.
2. The "always_use_lwt" write isolation mode is used. (the bug doesn't
occur in other write isolation modes).
We didn't catch this bug earlier because the Alternator Streams test
we had for BatchWriteItem had multiple items in multiple partitions,
and we missed the multiple-items-in-one-partition case. Moreover,
today we run all the tests in only_rmw_uses_lwt mode (in the past,
we did use always_use_lwt, but changed recently in commit e7257b1393
following commit 76a766c that changed test.py).
As issue #28439 explains, the underlying cause of the bug is that the
always_use_lwt causes the multiple items to be written with the same
timestamp, which confused the Alternator Streams code reading the CDC
log. The bug is not in BatchWriteItem itself, or in ScyllaDB CDC, but
just in the Alternator Streams layer.
The test in this patch is parameterized to run on each of the four
write isolation modes, and currently fails (and so marked xfail) just
for the one mode 'always_use_lwt'. The test is scylla_only, as its
purpose is to checks the different write isolation mode - which don't
exist in AWS DynamoDB.
Refs #28439
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Improve events printing, when test in test_streams.py failed.
New code will print both expected and received events (keys, previous
image, new image and type).
New code will explicitly mark, at which output event comparison failed.
Fixes#28455Closesscylladb/scylladb#28476
The usual Scylla shutdown in a cluster test takes ~2.1s. 2s come from
```
co_await sleep(std::chrono::milliseconds(_gcfg.shutdown_announce_ms));
```
as the default value of `shutdown_announce_in_ms` is 2000. This sleep
makes every `server_stop_gracefully` call 2s slower. There are ~300 such
calls in cluster tests (note that some come from `rolling_restart`). So,
it looks like this sleep makes cluster tests 300 * 2s = 10min slower.
Indeed, `./test.py --mode=dev cluster` takes 61min instead of 71min
on the potwor machine (the one in the Warsaw office) without it.
We set `shutdown_announce_in_ms` to 0 for all cluster tests to make them
faster.
The sleep is completely unnecessary in tests. Removing it could introduce
flakiness, but if that's the case, then the test for which it happens is
incorrect in the first place. Tests shouldn't assume that all nodes
receive and handle the shutdown message in 2s. They should use functions
like `server_not_sees_other_server` instead, which are faster and more
reliable.
Improvement of the tests running time, so no backport. The fix of
`test_tablets_parallel_decommission` may have to be backported to
2026.1, but it can be done manually.
Closesscylladb/scylladb#28464
* github.com:scylladb/scylladb:
test: pylib: scylla_cluster: set shutdown_announce_in_ms to 0
test: test_tablets_parallel_decommission: prevent group0 majority loss
test: delete test_service_levels_work_during_recovery
The handler appeared back in c9e710dca3. In this commit it performed the
"core" part of the task -- the do_build_range() method -- inside the
streaming sched group. The setup code looks seemingly was copied from the
view_builder::do_build_step() method and got the explicit switch of the
scheduling group.
The switch looks both -- justified and not. On one hand, it makes it
explict that the activity runs in the streaming scheduling group. On the
other hand, the verb already uses RPC index on 1, which is negotiated to
be run in streaming group anyway. On the "third hand", even though being
explicit the switch happens too late, as there exists a lot of other
activities performed by the handler that seems to also belong to the
same scheduling group, but which is not switched into explicitly.
By and large, it seems better to avoid the explicit switch and rely on
the RPC-level negotiation-based sched group switching.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#28397
This is the continuation of #28363 , this time about getting gossiper scheduling group via database.
Several places that do it already have gossiper at hand and should better get the group from it.
Eventually, this will allow to get rid of database::get_gossip_scheduling_group().
Refining inter-components API, not backporting
Closesscylladb/scylladb#28412
* github.com:scylladb/scylladb:
gossiper: Export its scheduling group for those who need it
migration_manager: Reorder members
Recently we had a question whether key columns can have any supported
type. I knew that actually - they can't, that key columns can have only
the types S(tring), B(inary) or N(umber), and that is all. But it turns
out we never had a test that confirms this understanding is true.
We did have a test for it for GSI key types already,
test_gsi.py::test_gsi_invalid_key_types, but we didn't have one for the
base table. So in this patch we add this missing test, and confirm that,
indeed, both DynamoDB and Alternator refuse a key attribute with any
type other than S, B or N.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#28479
Current way is checking only fail during the test phase, and it will
miss the cases when fail happens on another phase. This PR eliminate
this, so every phase will have modified node reporter to enrich the
JUnit XML report with custom attribute function_path.
Closesscylladb/scylladb#28462
Current way is always assumed that the error happened in the test file,
but that not always true. This PR will show the error from the boost
logger where actually error is happened.
Closesscylladb/scylladb#28429
1. fmt::localtime is deprecated.
2. We should really print times in UTC, especially on the cloud.
3. The current log message does not print the timezone so it'd unclear
to anyone reading the lof message if the expiration time is in the
local timezone or in GMT/UTC.
Fixes the following warning:
```
gms/gossiper.cc:2428:28: warning: 'localtime' is deprecated [-Wdeprecated-declarations]
2428 | endpoint, fmt::localtime(clk::to_time_t(expire_time)), expire_time.time_since_epoch().count(),
| ^
/usr/include/fmt/chrono.h:538:1: note: 'localtime' has been explicitly marked deprecated here
538 | FMT_DEPRECATED inline auto localtime(std::time_t time) -> std::tm {
| ^
/usr/include/fmt/base.h:207:28: note: expanded from macro 'FMT_DEPRECATED'
207 | # define FMT_DEPRECATED [[deprecated]]
| ^
```
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#28434
Some storage_service rpc verbs may checks that a handler is executed
inside gossiper scheduling group. For that, the expected group is
grabbed from database.
This patch puts the gossiper sched group into debug namespace and makes
this check use it from there. It removes one more place that uses
database as config provider.
Refs #28410
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#28427
The test test_to_json_double used to fail due to #7972, but this issue
was already fixed in Scylla 5.1 and we didn't notice.
So remove the xfail marker from this test, and also update another test
which still xfails but no longer due to this issue.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The tests testWithUnsetValues and testFilteringWithoutIndices used to fail
due to #10358, but this issue was already fixed three years ago, when the
UNSET-checking code was cleaned up, and the test is now passing.
So remove the xfail marker from these tests.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The test testInvalidNonFrozenUDTRelation used to fail due to #10632
(an incorrectly-printed column name in an error message) and was marked
"xfail". But this issue has already been fixed two years ago, and
the test is now passing. So remove the xfail marker.
The test test_metrics.py::test_update_item_increases_metrics_for_new_item_size_only
tests whether the Alternator metrics report the exactly-DynamoDB-compatible
WCU number. It is parameterized with two cases - one that uses
alternator_force_read_before_write and one which doesn't.
The case that uses alternator_force_read_before_write is expected to
measure the "accurate" WCU, and currently it doesn't, so the test
rightly xfails.
But the case that doesn't use alternator_force_read_before_write is not
expected to measure the "accurate" WCU and has a different expectation,
so this case actually passes. But because the entire test is marked
xfail, it is reported as "XPASS" - unexpected pass.
Fix this by marking only the "True" case with xfail, while the "False"
case is not marked. After this pass, the True case continues to XFAIL
and the False case passes normally, instead of XPASS.
Also removed a sentence promising that the failing case will be solved
"by the next PR". Clearly this didn't happen. Maybe we even have such
a PR open (?), but it won't the "the next PR" even if merged today.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>