This metric is used to catch execution of scans which go via row
cache, which can have bad effect on performance.
Since f344bd0aaa, aggregate queries go
via new statement class: parallelized_select_statement. This class
inherits from select_statement directly rather than from
primary_key_select_statement. The range scan detection logic
(_range_scan, _range_scan_no_bypass_cache) was only in
primary_key_select_statement's constructor, so parallelized queries
were not counted in select_partition_range_scan and
select_partition_range_scan_no_bypass_cache metrics.
Fix by moving the range scan detection into select_statement's
constructor, so that all subclasses get it.
No backport: enhancement
Closesscylladb/scylladb#29422
* github.com:scylladb/scylladb:
cql: Include parallelized queries in the scylla_cql_select_partition_range_scan_no_bypass_cache metric
test: cluster: dtest: Fix double-counting of metrics
get_node_metrics() in test/cluster/dtest/tools/metrics.py used
re.search(metric_name, metric) to match Prometheus metric lines. The
metric name select_partition_range_scan is a substring of
select_partition_range_scan_no_bypass_cache. So when querying for
select_partition_range_scan, the regex matched both Prometheus lines:
scylla_cql_select_partition_range_scan{shard="0",...} 1
scylla_cql_select_partition_range_scan_no_bypass_cache{shard="0",...} 1
And because the code does metrics_res[metric_name] += val, it summed
both values, making it look like the counter was incremented
by 2 when it was actually incremented by 1. The fix appends r"[\s{]"
to the regex so the metric name must be followed by { (labels) or
whitespace (value), preventing substring matches.
The fix for SCYLLADB-1373 (b4f652b7c1) changed get_session() to use
the default timeout=30 for the retry loop in patient_*_cql_connection
(previously timeout=0.1). This correctly allowed retrying transient
NoHostAvailable errors during node startup, but introduced a new
flakiness in test_login and other auth tests.
The failure chain:
1. test_login connects with bad credentials (e.g. user="doesntexist")
2. get_session() calls patient_exclusive_cql_connection(), which calls
retry_till_success() with bypassed_exception=NoHostAvailable
3. The first attempt correctly fails: the server rejects the credentials
with AuthenticationFailed, wrapped in NoHostAvailable
4. retry_till_success() catches NoHostAvailable indiscriminately and
retries, not distinguishing between transient errors (node not ready)
and permanent errors (bad credentials)
5. A subsequent retry attempt times out (connect_timeout=5), producing
OperationTimedOut wrapped in NoHostAvailable
6. After 30 seconds, the last NoHostAvailable is raised -- now wrapping
OperationTimedOut instead of the original AuthenticationFailed
7. The assertion `isinstance(..., AuthenticationFailed)` fails
With the old timeout=0.1, the deadline was already exceeded after the
first attempt, so the original AuthenticationFailed propagated.
Fix: Add a `should_retry` predicate parameter to retry_till_success()
and use it in patient_cql_connection() and
patient_exclusive_cql_connection() to immediately re-raise
NoHostAvailable when it wraps AuthenticationFailed. Retrying
authentication failures is never useful since the credentials won't
change between attempts.
Fixes: SCYLLADB-1382
Closesscylladb/scylladb#29348
To hopefully shut up CodeQL "Iterable can be either a string or a sequence".
This change makes the code more readable anyway, so it is more than just
a gratuitous change to make some code-scanner happy.
This contained only one routine; `corrupt_file`, which is
highly problematic, and not used. If you want to "corrupt" a
file, it should be done controlled, not at random.
The commitlog in the tests with big mutations were corrupted by
overwriting 10 chunks of 1KB with random data, which could not be enough
due to randomness and the big size of the commitlog (~65MB).
Change `corrupt_file` to overwrite based on a percentage
of the file's size instead of fixed number of chunks.
refs: #25627
Make `wait_for_any_log()` function to work closer to the original
dtest's version: use `ScyllaLogFile.grep()` method instead of
the usage of `ScyllaNode.wait_log_for()` with a small timeout to
have at least one try to find.
Also, add `max_count` argument to `.grep()` method for the
optimization purpose.
Technically, `new_node()`'s `bootstrap` parameter used to mark a node
as a seed if it's False. In test.py, seeds parameter passed on start of
a node, so, save it as `ScyllaNode.bootstrap` attribute to use in
`ScyllNode.start()` method.
Copy wait_for_any_log() function from dtest tools/log_utils.py
with few modifications:
- Add type hints;
- Change timeout for node.watch_log_for() calls from 0 to 0.1
because dtest shim's implementation uses asyncio.timeout()
and 0 means not "one time" but "never run";
- Use set() instead of list() for `ret` variable;
- Remove redundant `found` variable.
- Remove `remaining` variable and use shallow copies to make
the code more correct. As a side effect this makes the
TimeoutError message more correct too;
- Use f-string formatting for TimeoutError message;
As a part of the porting process, copy missed utility functions from scylla-dtest,
remove unused imports and markers, and add single_node marker description to pytest.ini
Enable the test in suite.yaml (run in dev mode only)
Use universalasync library to make test.py async code compatible
with synchronous code of dtest/ccm
Also, copied unmodified error_example_test.py from dtest as an example.
Run the test in `dev` mode only.