db::commitlog: Fix sanity check error on race between segment flushing and oversized alloc

Fixes #27992

When doing a commit log oversized allocation, we lock out all other writers by grabbing
the _request_controller semaphore fully (max capacity).
We thereafter assert that the semaphore is in fact zero. However, due to how things work
with the bookkeep here, the semaphore can in fact become negative (some paths will not
actually wait for the semaphore, because this could deadlock).

Thus, if, after we grab the semaphore and execution actually returns to us (task schedule),
new_buffer via segment::allocate is called (due to a non-fully-full segment), we might
in fact grab the segment overhead from zero, resulting in a negative semaphore.

The same problem applies later when we try to sanity check the return of our permits.

Fix is trivial, just accept less-than-zero values, and take same possible ltz-value
into account in exit check (returning units)

Added whitebox (special callback interface for sync) unit test that provokes/creates
the race condition explicitly (and reliably).

Closes scylladb/scylladb#27998
This commit is contained in:
Calle Wilund
2026-01-05 20:19:00 +01:00
committed by Botond Dénes
parent 7bf26ece4d
commit a7cdb602e1
3 changed files with 132 additions and 4 deletions

View File

@@ -502,6 +502,9 @@ public:
void flush_segments(uint64_t size_to_remove);
void check_no_data_older_than_allowed();
// whitebox testing
std::function<future<>()> _oversized_pre_wait_memory_func;
private:
class shutdown_marker{};
@@ -1597,8 +1600,15 @@ future<> db::commitlog::segment_manager::oversized_allocation(entry_writer& writ
scope_increment_counter allocating(totals.active_allocations);
// #27992 - whitebox testing. signal we are trying to lock out
// all allocators
if (_oversized_pre_wait_memory_func) {
co_await _oversized_pre_wait_memory_func();
}
auto permit = co_await std::move(fut);
SCYLLA_ASSERT(_request_controller.available_units() == 0);
// #27992 - task reordering _can_ force the available units to negative. this is ok.
SCYLLA_ASSERT(_request_controller.available_units() <= 0);
decltype(permit) fake_permit; // can't have allocate+sync release semaphore.
bool failed = false;
@@ -1859,13 +1869,15 @@ future<> db::commitlog::segment_manager::oversized_allocation(entry_writer& writ
}
}
}
SCYLLA_ASSERT(_request_controller.available_units() == 0);
auto avail = _request_controller.available_units();
SCYLLA_ASSERT(avail <= 0);
SCYLLA_ASSERT(permit.count() == max_request_controller_units());
auto nw = _request_controller.waiters();
permit.return_all();
// #20633 cannot guarantee controller avail is now full, since we could have had waiters when doing
// return all -> now will be less avail
SCYLLA_ASSERT(nw > 0 || _request_controller.available_units() == ssize_t(max_request_controller_units()));
SCYLLA_ASSERT(nw > 0 || _request_controller.available_units() == (avail + ssize_t(max_request_controller_units())));
if (!failed) {
clogger.trace("Oversized allocation succeeded.");
@@ -3949,6 +3961,9 @@ void db::commitlog::update_max_data_lifetime(std::optional<uint64_t> commitlog_d
_segment_manager->cfg.commitlog_data_max_lifetime_in_seconds = commitlog_data_max_lifetime_in_seconds;
}
void db::commitlog::set_oversized_pre_wait_memory_func(std::function<future<>()> f) {
_segment_manager->_oversized_pre_wait_memory_func = std::move(f);
}
future<std::vector<sstring>> db::commitlog::get_segments_to_replay() const {
return _segment_manager->get_segments_to_replay();