mirror of
https://github.com/seaweedfs/seaweedfs.git
synced 2026-05-22 17:51:30 +00:00
* fix(mount): evict writable chunks when writeBufferSizeMB cap is reached Reported on #8777: fio 4k randwrite with --nrfiles=1000 on a weed mount hangs indefinitely after ~1 second of activity, volume server QPS drops to zero, and the test never makes progress beyond a few hundred writes. A scaled-down local repro (4 jobs x 250 nrfiles x 40 MiB with -chunkSizeLimitMB=32 -writeBufferSizeMB=10240) reproduced the hang exactly: fio stuck in fio_state=SETTING_UP, mount idle, all writer goroutines parked in page_writer.(*WriteBufferAccountant).Reserve with n=0x2000000 (32 MiB). Root cause: UploadPipeline.SaveDataAt reserves a full chunkSize slot against the global WriteBufferAccountant the first time it allocates a writable chunk for a given file. The reservation is released by SealedChunk.FreeReference only after the chunk's async upload completes, and chunks only move from writable to sealed when they fill or when the file is flushed/closed. For 4k random writes with small per-file data and iodepth holding files open, writable chunks never fill and never close, so their slots are pinned forever. Once openFiles * chunkSize exceeds writeBufferSizeMB, every new writer blocks in Reserve's cond.Wait with no possible waker, a hard deadlock. The user-visible symptoms (volume QPS=0, waited 30min, still running) are exactly that state. In the reported config, 8 jobs * 1000 nrfiles * 32 MiB = 256 GB of reservations against a 10 GB cap. Fix: add a SetEvictor hook on WriteBufferAccountant. When Reserve would otherwise block on the cond, it single-flights an evictor callback that walks wfs.fhMap, picks the first open file handle with a writable chunk, and force-seals its fullest writable chunk via a new UploadPipeline.EvictOneWritableChunk. Force-sealing submits the chunk for async upload on the existing uploader pool; the upload's SealedChunk.FreeReference releases the global slot, broadcasts on the accountant cond, and unblocks the waiter. The fullest-first heuristic matches the existing over-limit path in SaveDataAt and keeps us from thrashing on repeatedly re-sealing half-empty chunks. The original #9066 motivation (cap global write-pipeline growth so swap can't be filled while uploads stall) is preserved: Reserve still blocks when the cap is actually exhausted by in-flight uploads, and a failed evictor (no writable chunks to seal) falls through to cond.Wait exactly as before. Verified end-to-end in a privileged debian container running a full master + volume + filer + weed mount stack. Against the reporter's config shrunk to the same memory footprint (4 jobs x 250 nrfiles x 40 MiB, -chunkSizeLimitMB=32 -writeBufferSizeMB=10240): - baseline binary (master): fio stalls after ~320 writes, killed by timeout at 400s with io=1280 KiB and bw=2383 B/s. - patched binary: fio completes in 1.9 s at 82.8 MiB/s, all 40,000 writes issued, 156 MiB written, rc=0. page_writer unit tests still pass. Touches: WriteBufferAccountant (SetEvictor + single-flight in Reserve), UploadPipeline.EvictOneWritableChunk, DirtyPages interface + the two existing implementers (ChunkedDirtyPages, PageWriter), and a new weedfs_write_buffer_evict.go holding the WFS.evictOneWritableChunk callback that walks fhMap. * fix(mount): re-check write-budget cap after evict and make eviction panic-safe Addresses two review concerns on #9091: 1. CodeRabbit flagged that the original Reserve loop only considered the break condition when the evictor reported success: if evicted { if a.used+n <= a.cap || a.used == 0 { break } } a.cond.Wait() A concurrent Release firing during the evictor's execution window (we drop a.mu around the callback so uploader goroutines can drain) would land its Broadcast on an empty waiter set and then be missed, because we unconditionally call cond.Wait() on the same iteration. In practice an in-flight async upload eventually fires another broadcast so this would delay rather than permanently hang — but the logic is wrong either way and should re-check the cap after every evictor round. 2. Gemini flagged that setting `a.evicting = true` and dropping `a.mu` without a deferred unwind leaves the accountant in a broken state if the evictor panics: the flag stays set forever and no Reserve caller can ever run another eviction, even though the mutex gets unlocked by the normal stack unwind. Pull the evictor call into a tiny helper runEvictorLocked whose defer re-acquires a.mu and clears a.evicting regardless of panic status, then broadcasts so other blocked Reservers re-evaluate. Reserve checks the cap condition immediately after the helper returns; only falls through to cond.Wait() when the budget is still exhausted. No behavior change on the happy path; the fix is in the error and race corners. * doc(mount): clarify eviction-scan cost on the blocked-Reserve path Gemini review on #9091 flagged that the fhMap scan inside evictOneWritableChunk is O(open files). Document why the cost is bounded to the Reserve-blocked path and paid at most once per chunkSize drained, rather than on the write hot path, and record the reason we do not preempt with an LRU.
48 lines
2.2 KiB
Go
48 lines
2.2 KiB
Go
package mount
|
|
|
|
// evictOneWritableChunk is the callback the WriteBufferAccountant invokes
|
|
// when Reserve would otherwise block. It walks open file handles and asks
|
|
// the first one that has a writable chunk to force-seal it. Sealing moves
|
|
// the chunk into the async upload path, whose completion Releases a slot
|
|
// on the accountant and unblocks the waiter.
|
|
//
|
|
// Without this hook, workloads that open many files for write but never
|
|
// fill or close any single chunk (e.g. fio 4k randwrite across nrfiles
|
|
// where nrfiles * chunkSizeLimit > writeBufferSizeMB) deadlock
|
|
// permanently: every writable chunk reserves a full chunkSize slot on
|
|
// creation, and the pipeline only releases the slot when the chunk
|
|
// finishes uploading, which only happens after the chunk fills or the
|
|
// file closes.
|
|
//
|
|
// Performance note: this does a linear scan of fhMap.inode2fh looking
|
|
// for the first handle with a writable chunk. The scan is bounded by
|
|
// the number of open-for-write file handles at the moment Reserve
|
|
// blocks, not by nrfiles in the workload: handles that have no dirty
|
|
// pages fall through EvictOneWritableChunk in O(1) under their own
|
|
// chunksLock, so the hot-path cost is just the map walk. The scan is
|
|
// already serialized by the caller's single-flight `evicting` flag
|
|
// (one eviction across the entire accountant at a time), and happens
|
|
// only when Reserve is about to block — i.e., when the cap is actually
|
|
// exhausted and we would otherwise park a writer — so the O(N) walk is
|
|
// paid at most once per chunkSize drained, not per write. Replacing
|
|
// the map with an LRU / dirty-set would trade this for per-AddPage
|
|
// bookkeeping overhead on a hotter path; we keep the scan until a
|
|
// profile shows it mattering at the scale users actually hit.
|
|
//
|
|
// Called from WriteBufferAccountant.Reserve with accountant.mu dropped.
|
|
// Must not call anything that takes accountant.mu back; see the caller's
|
|
// single-flight `evicting` flag for safety against concurrent invocation.
|
|
func (wfs *WFS) evictOneWritableChunk(needBytes int64) bool {
|
|
wfs.fhMap.RLock()
|
|
defer wfs.fhMap.RUnlock()
|
|
for _, fh := range wfs.fhMap.inode2fh {
|
|
if fh == nil || fh.dirtyPages == nil {
|
|
continue
|
|
}
|
|
if fh.dirtyPages.EvictOneWritableChunk() {
|
|
return true
|
|
}
|
|
}
|
|
return false
|
|
}
|