mirror of
https://github.com/scylladb/scylladb.git
synced 2026-05-12 19:02:12 +00:00
test: fix replica_read_timeout_no_exception flakiness on slow systems
The test uses a 10ms read timeout to exercise code paths that handle timed-out reads without throwing C++ exceptions. As part of setup, it inserts rows and flushes them to two SSTables, then runs a warm-up SELECT to populate internal caches (e.g. the auth cache) before the real test begins. The reason for this warm-up read was the possibility that the first read does additional operations (such as reading and caching authentication) that might throw exceptions internally. I couldn't verify that such exceptions actually happen in today's code, but they might (re)appear in the future, so we should keep the warm-up SELECT. On slow CI machines (aarch64, debug build), that warm-up SELECT can take longer than 10ms to read from the two SSTables. When it does, the read times out: the coordinator receives 0 responses from the local replica within the deadline and propagates a read_timeout_exception. Since the exception is not caught, it escapes the test lambda, is logged as "cql env callback failed", and causes Boost.Test to report a C++ failure at the do_with_cql_env_thread call site. This matches the CI failure seen in SCYLLADB-1774: ERROR ... replica_read_timeout_no_exception: cql env callback failed, error: exceptions::read_timeout_exception (Operation timed out for replica_read_timeout_no_exception.tbl - received only 0 responses from 1 CL=ONE.) The CI log also shows that only 12 reads were admitted (the warm-up read plus the 11 reads from the two prepare() calls and CREATE/INSERT statements made earlier), and the current permit was stuck in need_cpu state -- the reactor hadn't had a chance to schedule the read before the 10ms window elapsed. The fix catches read_timeout_exception from the warm-up SELECT and retries until the read succeeds. The warm-up is required for correctness: some lazy-init code paths (e.g. auth cache population) use C++ exceptions for control flow internally. Those exceptions must be absorbed before the cxx_exceptions baseline is sampled inside execute_test(); otherwise they would appear in the delta and cause a false test failure. Simply ignoring a timed-out warm-up is not safe, because the lazy-init exceptions would then fire during the 1000 test reads, inflating cxx_exceptions_after relative to cxx_exceptions_before. No other calls in setup are susceptible to the 10ms read timeout: - CREATE KEYSPACE, CREATE TABLE, INSERT, and flush use the write timeout (10s) and are not reads. - e.prepare() goes through the query processor without reading table data, so it is not subject to the read timeout. - The semaphore manipulation in Test 2 is internal and has no timeout. - All 1000 reads in execute_test() are expected to fail, so a timeout there is the happy path, not a failure. The 10ms timeout itself is fine for the test's purpose: it is deliberately aggressive so that reads reliably time out on the hot path being tested. The problem was only that the pre-test warm-up was not guarded against the same timeout. Fixes: SCYLLADB-1774 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#29731
This commit is contained in:
committed by
Botond Dénes
parent
afd9a55891
commit
1f15e05946
@@ -30,6 +30,7 @@
|
||||
#include "test/lib/simple_schema.hh"
|
||||
#include "test/lib/test_utils.hh"
|
||||
#include "test/lib/key_utils.hh"
|
||||
#include "test/lib/eventually.hh"
|
||||
|
||||
#include "replica/database.hh"
|
||||
#include "utils/assert.hh"
|
||||
@@ -2327,8 +2328,17 @@ SEASTAR_TEST_CASE(replica_read_timeout_no_exception) {
|
||||
|
||||
e.execute_cql(format("INSERT INTO {}.{} (pk, ck, v) VALUES (1, 2, 2)", ks_name, tbl_name)).get();
|
||||
|
||||
// One successful read, so the auth cache is populated.
|
||||
e.execute_cql(format("SELECT * FROM {}.{} WHERE pk = 1", ks_name, tbl_name)).get();
|
||||
// Before starting the test's read loop, we perform a warmup read once.
|
||||
// This first read might perform some one-time initialization that uses
|
||||
// exceptions internally. We need all these exceptions to happen once,
|
||||
// here, so that the real test below can validate that no exceptions at
|
||||
// all are thrown during the test itself.
|
||||
// We configured read_timeout to the 10ms, and on very loaded systems
|
||||
// the read sometimes doesn't finish in 10ms (see SCYLLADB-1774),
|
||||
// so we retry this warmup read until it succeeds.
|
||||
eventually([&] {
|
||||
e.execute_cql(format("SELECT * FROM {}.{} WHERE pk = 1", ks_name, tbl_name)).get();
|
||||
});
|
||||
|
||||
auto get_cxx_exceptions = [&e] {
|
||||
return e.db().map_reduce0([] (replica::database&) { return engine().cxx_exceptions(); }, 0, std::plus<size_t>()).get();
|
||||
|
||||
Reference in New Issue
Block a user