mirror of
https://github.com/scylladb/scylladb.git
synced 2026-04-29 20:57:00 +00:00
In do_execute_cql_with_timeout(), when the prepared statement was not found in the cache, we called qp.prepare() and stored the returned result_message::prepared in a local variable scoped to the 'if' block. We then extracted ps_ptr (a checked_weak_ptr to the prepared statement) from the message, let the message go out of scope at the end of the 'if', and used ps_ptr after a co_await on st->execute().
Since 3ac4e258e8 ("transport/messages: hold pinned prepared entry in PREPARE result"), result_message::prepared owns a strong pinned reference to the prepared cache entry. While qp.prepare() runs it also holds its own pin on the entry, so on return the entry has at least the pin owned by the returned message. As long as that message is alive, the cache entry cannot be purged and the weak handle inside ps_ptr remains promotable.
The lifetime gap manifested only in debug builds. qp.prepare() returns a ready future on the cache-miss path, so in release builds the co_await resumes synchronously: control flows from the assignment of ps_ptr straight into st->execute() with no opportunity for any other task (in particular, prepared cache invalidation triggered by a concurrent schema change) to run in between. Debug builds, however, force a reactor preemption point on every co_await even when the awaited future is ready. With prepared_msg already destroyed at the end of the 'if' block, the only remaining handle on the cache entry was the weak ps_ptr, and the preemption gave a concurrent cache purge
- triggered, for example, by Raft schema changes received during a node restart - the chance to drop the entry. The subsequent execute() then failed when promoting the weak pointer with
checked_ptr_is_null_exception.
The exception propagated out of the Paxos prepare path as a generic std::exception with no type information in the log, surfacing on the coordinator as:
WriteFailure: Failed to prepare ballot ... Replica errors:
host_id ... -> seastar::rpc::remote_verb_error (std::exception)
Hoist the result_message::prepared into the outer scope so the pinned cache entry stays alive across co_await st->execute(...), closing the window in which a concurrent cache purge could invalidate the weak handle.
Fixes SCYLLADB-1173
backport: the patch is simple, we can backport it to all versions with "LWT over tablets" feature. Note that the problem is only in test runs in debug configuration, production is not affected.
Closes scylladb/scylladb#29675
* https://github.com/scylladb/scylladb:
table_helper: retry insert prepare on concurrent cache invalidation
paxos_state: keep prepared message alive across statement execution