test: fix replica_read_timeout_no_exception flakiness on slow systems

The test uses a 10ms read timeout to exercise code paths that handle
timed-out reads without throwing C++ exceptions.  As part of setup, it
inserts rows and flushes them to two SSTables, then runs a warm-up
SELECT to populate internal caches (e.g. the auth cache) before the
real test begins.

The reason for this warm-up read was the possibility that the first
read does additional operations (such as reading and caching
authentication) that might throw exceptions internally. I couldn't
verify that such exceptions actually happen in today's code, but
they might (re)appear in the future, so we should keep the warm-up
SELECT.

On slow CI machines (aarch64, debug build), that warm-up SELECT can
take longer than 10ms to read from the two SSTables.  When it does, the
read times out: the coordinator receives 0 responses from the local
replica within the deadline and propagates a read_timeout_exception.
Since the exception is not caught, it escapes the test lambda, is
logged as "cql env callback failed", and causes Boost.Test to report a
C++ failure at the do_with_cql_env_thread call site.  This matches the
CI failure seen in SCYLLADB-1774:

  ERROR ... replica_read_timeout_no_exception: cql env callback failed,
  error: exceptions::read_timeout_exception (Operation timed out for
  replica_read_timeout_no_exception.tbl - received only 0 responses
  from 1 CL=ONE.)

The CI log also shows that only 12 reads were admitted (the warm-up
read plus the 11 reads from the two prepare() calls and CREATE/INSERT
statements made earlier), and the current permit was stuck in
need_cpu state -- the reactor hadn't had a chance to schedule the read
before the 10ms window elapsed.

The fix catches read_timeout_exception from the warm-up SELECT and
retries until the read succeeds. The warm-up is required for
correctness: some lazy-init code paths (e.g. auth cache population)
use C++ exceptions for control flow internally. Those exceptions must
be absorbed before the cxx_exceptions baseline is sampled inside
execute_test(); otherwise they would appear in the delta and cause a
false test failure. Simply ignoring a timed-out warm-up is not safe,
because the lazy-init exceptions would then fire during the 1000 test
reads, inflating cxx_exceptions_after relative to
cxx_exceptions_before.

No other calls in setup are susceptible to the 10ms read timeout:
- CREATE KEYSPACE, CREATE TABLE, INSERT, and flush use the write
  timeout (10s) and are not reads.
- e.prepare() goes through the query processor without reading table
  data, so it is not subject to the read timeout.
- The semaphore manipulation in Test 2 is internal and has no timeout.
- All 1000 reads in execute_test() are expected to fail, so a timeout
  there is the happy path, not a failure.

The 10ms timeout itself is fine for the test's purpose: it is
deliberately aggressive so that reads reliably time out on the hot path
being tested.  The problem was only that the pre-test warm-up was not
guarded against the same timeout.

Fixes: SCYLLADB-1774

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#29731
This commit is contained in:
Nadav Har'El
2026-05-03 21:24:15 +03:00
committed by Botond Dénes
parent afd9a55891
commit 1f15e05946

View File

@@ -30,6 +30,7 @@
#include "test/lib/simple_schema.hh"
#include "test/lib/test_utils.hh"
#include "test/lib/key_utils.hh"
#include "test/lib/eventually.hh"
#include "replica/database.hh"
#include "utils/assert.hh"
@@ -2327,8 +2328,17 @@ SEASTAR_TEST_CASE(replica_read_timeout_no_exception) {
e.execute_cql(format("INSERT INTO {}.{} (pk, ck, v) VALUES (1, 2, 2)", ks_name, tbl_name)).get();
// One successful read, so the auth cache is populated.
e.execute_cql(format("SELECT * FROM {}.{} WHERE pk = 1", ks_name, tbl_name)).get();
// Before starting the test's read loop, we perform a warmup read once.
// This first read might perform some one-time initialization that uses
// exceptions internally. We need all these exceptions to happen once,
// here, so that the real test below can validate that no exceptions at
// all are thrown during the test itself.
// We configured read_timeout to the 10ms, and on very loaded systems
// the read sometimes doesn't finish in 10ms (see SCYLLADB-1774),
// so we retry this warmup read until it succeeds.
eventually([&] {
e.execute_cql(format("SELECT * FROM {}.{} WHERE pk = 1", ks_name, tbl_name)).get();
});
auto get_cxx_exceptions = [&e] {
return e.db().map_reduce0([] (replica::database&) { return engine().cxx_exceptions(); }, 0, std::plus<size_t>()).get();