This is about ungraceful stop, where the node is killed.
Test cases typically need to wait for other nodes to notice that the
node is down before proceeding. By default, that takes about 20s. Can
be reduced via config by reducing failure detector threshold, but it's
not the best solution:
- cannot set the threshold too low, or we'll introduce falkiness due to false
positives
- so it's still slow (a couple of seconds)
- developers forget about it and the test still works
This patch speeds this up by adding a way to convict the node immediately after stopping the node, controlled by the "convict" parameter.
At the end of the series the "convict" parameter is required, and each test decides what it wants. Commits are split into steps:
- the series starts with defaulting to convict=False
- each test case sets "convict" explicitly, and changes are split into 3 commits depending on whether convict=True is: useless, beneficial, undesirable
- finally, the "convict" parameter is made mandatory
There is also a dedicated test for natural failure detection (test_natural_failure_detection in test_gossiper.py) to ensure FD coverage is not lost.
Tested on dev-mode
cluster/test_tablets_parallel_decommission.py::test_node_lost_during_decommission_drain:
Wall clock time reduced from 41s to 16s
No backport: enhancement
Closesscylladb/scylladb#28495
* https://github.com/scylladb/scylladb:
test: gossiper: Add test for natural failure detection
test: pylib: Make convict a required parameter in server_stop()
test: Annotate server_stop() calls where conviction is harmful
test: Annotate server_stop() calls where conviction is beneficial
test: Annotate server_stop() calls where conviction is useless
test: pylib: Add convict option to server_stop()
api: failure_detector: Introduce convict-node API
gms: gossiper: Make convict() public and safe to call from any scheduling group
api: Extract validate functions to common header