Files
seaweedfs/weed/mount/weedfs_write_buffer_evict.go
Chris Lu bdebd63f7c fix(mount): evict writable chunks when writeBufferSizeMB cap is reached (#9091)
* fix(mount): evict writable chunks when writeBufferSizeMB cap is reached

Reported on #8777: fio 4k randwrite with --nrfiles=1000 on a weed mount
hangs indefinitely after ~1 second of activity, volume server QPS
drops to zero, and the test never makes progress beyond a few hundred
writes. A scaled-down local repro (4 jobs x 250 nrfiles x 40 MiB with
-chunkSizeLimitMB=32 -writeBufferSizeMB=10240) reproduced the hang
exactly: fio stuck in fio_state=SETTING_UP, mount idle, all writer
goroutines parked in page_writer.(*WriteBufferAccountant).Reserve
with n=0x2000000 (32 MiB).

Root cause: UploadPipeline.SaveDataAt reserves a full chunkSize slot
against the global WriteBufferAccountant the first time it allocates
a writable chunk for a given file. The reservation is released by
SealedChunk.FreeReference only after the chunk's async upload
completes, and chunks only move from writable to sealed when they
fill or when the file is flushed/closed. For 4k random writes with
small per-file data and iodepth holding files open, writable chunks
never fill and never close, so their slots are pinned forever. Once
openFiles * chunkSize exceeds writeBufferSizeMB, every new writer
blocks in Reserve's cond.Wait with no possible waker, a hard deadlock.
The user-visible symptoms (volume QPS=0, waited 30min, still running)
are exactly that state. In the reported config, 8 jobs * 1000 nrfiles
* 32 MiB = 256 GB of reservations against a 10 GB cap.

Fix: add a SetEvictor hook on WriteBufferAccountant. When Reserve
would otherwise block on the cond, it single-flights an evictor
callback that walks wfs.fhMap, picks the first open file handle with
a writable chunk, and force-seals its fullest writable chunk via a
new UploadPipeline.EvictOneWritableChunk. Force-sealing submits the
chunk for async upload on the existing uploader pool; the upload's
SealedChunk.FreeReference releases the global slot, broadcasts on
the accountant cond, and unblocks the waiter. The fullest-first
heuristic matches the existing over-limit path in SaveDataAt and
keeps us from thrashing on repeatedly re-sealing half-empty chunks.

The original #9066 motivation (cap global write-pipeline growth so
swap can't be filled while uploads stall) is preserved: Reserve
still blocks when the cap is actually exhausted by in-flight
uploads, and a failed evictor (no writable chunks to seal) falls
through to cond.Wait exactly as before.

Verified end-to-end in a privileged debian container running a
full master + volume + filer + weed mount stack. Against the
reporter's config shrunk to the same memory footprint (4 jobs x
250 nrfiles x 40 MiB, -chunkSizeLimitMB=32 -writeBufferSizeMB=10240):
  - baseline binary (master): fio stalls after ~320 writes, killed
    by timeout at 400s with io=1280 KiB and bw=2383 B/s.
  - patched binary: fio completes in 1.9 s at 82.8 MiB/s, all
    40,000 writes issued, 156 MiB written, rc=0.
page_writer unit tests still pass.

Touches: WriteBufferAccountant (SetEvictor + single-flight in
Reserve), UploadPipeline.EvictOneWritableChunk, DirtyPages
interface + the two existing implementers (ChunkedDirtyPages,
PageWriter), and a new weedfs_write_buffer_evict.go holding the
WFS.evictOneWritableChunk callback that walks fhMap.

* fix(mount): re-check write-budget cap after evict and make eviction panic-safe

Addresses two review concerns on #9091:

1. CodeRabbit flagged that the original Reserve loop only considered the
   break condition when the evictor reported success:

     if evicted {
         if a.used+n <= a.cap || a.used == 0 {
             break
         }
     }
     a.cond.Wait()

   A concurrent Release firing during the evictor's execution window (we
   drop a.mu around the callback so uploader goroutines can drain) would
   land its Broadcast on an empty waiter set and then be missed, because
   we unconditionally call cond.Wait() on the same iteration. In practice
   an in-flight async upload eventually fires another broadcast so this
   would delay rather than permanently hang — but the logic is wrong
   either way and should re-check the cap after every evictor round.

2. Gemini flagged that setting `a.evicting = true` and dropping `a.mu`
   without a deferred unwind leaves the accountant in a broken state if
   the evictor panics: the flag stays set forever and no Reserve caller
   can ever run another eviction, even though the mutex gets unlocked by
   the normal stack unwind.

Pull the evictor call into a tiny helper runEvictorLocked whose defer
re-acquires a.mu and clears a.evicting regardless of panic status, then
broadcasts so other blocked Reservers re-evaluate. Reserve checks the
cap condition immediately after the helper returns; only falls through
to cond.Wait() when the budget is still exhausted. No behavior change on
the happy path; the fix is in the error and race corners.

* doc(mount): clarify eviction-scan cost on the blocked-Reserve path

Gemini review on #9091 flagged that the fhMap scan inside
evictOneWritableChunk is O(open files). Document why the cost is
bounded to the Reserve-blocked path and paid at most once per
chunkSize drained, rather than on the write hot path, and record the
reason we do not preempt with an LRU.
2026-04-15 14:06:06 -07:00

48 lines
2.2 KiB
Go

package mount
// evictOneWritableChunk is the callback the WriteBufferAccountant invokes
// when Reserve would otherwise block. It walks open file handles and asks
// the first one that has a writable chunk to force-seal it. Sealing moves
// the chunk into the async upload path, whose completion Releases a slot
// on the accountant and unblocks the waiter.
//
// Without this hook, workloads that open many files for write but never
// fill or close any single chunk (e.g. fio 4k randwrite across nrfiles
// where nrfiles * chunkSizeLimit > writeBufferSizeMB) deadlock
// permanently: every writable chunk reserves a full chunkSize slot on
// creation, and the pipeline only releases the slot when the chunk
// finishes uploading, which only happens after the chunk fills or the
// file closes.
//
// Performance note: this does a linear scan of fhMap.inode2fh looking
// for the first handle with a writable chunk. The scan is bounded by
// the number of open-for-write file handles at the moment Reserve
// blocks, not by nrfiles in the workload: handles that have no dirty
// pages fall through EvictOneWritableChunk in O(1) under their own
// chunksLock, so the hot-path cost is just the map walk. The scan is
// already serialized by the caller's single-flight `evicting` flag
// (one eviction across the entire accountant at a time), and happens
// only when Reserve is about to block — i.e., when the cap is actually
// exhausted and we would otherwise park a writer — so the O(N) walk is
// paid at most once per chunkSize drained, not per write. Replacing
// the map with an LRU / dirty-set would trade this for per-AddPage
// bookkeeping overhead on a hotter path; we keep the scan until a
// profile shows it mattering at the scale users actually hit.
//
// Called from WriteBufferAccountant.Reserve with accountant.mu dropped.
// Must not call anything that takes accountant.mu back; see the caller's
// single-flight `evicting` flag for safety against concurrent invocation.
func (wfs *WFS) evictOneWritableChunk(needBytes int64) bool {
wfs.fhMap.RLock()
defer wfs.fhMap.RUnlock()
for _, fh := range wfs.fhMap.inode2fh {
if fh == nil || fh.dirtyPages == nil {
continue
}
if fh.dirtyPages.EvictOneWritableChunk() {
return true
}
}
return false
}