seaweedfs/weed/command/mount.go
Chris Lu e8a8449553 feat(mount): pre-allocate file IDs in pool for writeback cache mode (#9038)
* feat(mount): pre-allocate file IDs in pool for writeback cache mode

When writeback caching is enabled, chunk uploads no longer block on a
per-chunk AssignVolume RPC. Instead, a FileIdPool pre-allocates file IDs
in batches using a single AssignVolume(Count=N, ExpectedDataSize=ChunkSize)
call and hands them out instantly to upload workers.

Pool size is 2x ConcurrentWriters, refilled in background when it drops
below ConcurrentWriters. Entries expire after 25s to respect JWT TTL.
Sequential needle keys are generated from the base file ID returned by
the master, so one Assign RPC produces N usable IDs.

This cuts per-chunk upload latency from 2 RTTs (assign + upload) to
1 RTT (upload only), with the assign cost amortized across the batch.
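
The sizing and expiry rules above can be sketched roughly as follows. This is
an illustration only, not the FileIdPool code: poolEntry, poolLimits, and
takeFresh are hypothetical names, and the only values taken from the commit
message are the 2x capacity, the 1x refill threshold, and the 25s TTL.

```go
package main

import (
	"fmt"
	"time"
)

// entryTTL follows the 25s expiry stated in the commit message.
const entryTTL = 25 * time.Second

// poolEntry is a hypothetical pooled file ID plus the time it was assigned.
type poolEntry struct {
	fileID     string
	assignedAt time.Time
}

// poolLimits derives the capacity (2x writers) and the refill threshold
// (1x writers) from the configured writer count.
func poolLimits(concurrentWriters int) (capacity, refillBelow int) {
	return 2 * concurrentWriters, concurrentWriters
}

// takeFresh discards expired entries from the front of the queue and returns
// the first fresh one, the remaining queue, and whether an entry was found.
func takeFresh(queue []poolEntry, now time.Time) (poolEntry, []poolEntry, bool) {
	for len(queue) > 0 {
		e := queue[0]
		queue = queue[1:]
		if now.Sub(e.assignedAt) < entryTTL {
			return e, queue, true
		}
	}
	return poolEntry{}, queue, false
}

func main() {
	capacity, refillBelow := poolLimits(128)
	fmt.Println("capacity:", capacity, "refill below:", refillBelow)

	now := time.Now()
	queue := []poolEntry{
		{fileID: "fid-old", assignedAt: now.Add(-30 * time.Second)},
		{fileID: "fid-new", assignedAt: now.Add(-1 * time.Second)},
	}
	e, rest, ok := takeFresh(queue, now)
	fmt.Println(e.fileID, len(rest), ok)
}
```

Dropping stale entries at hand-out time keeps the check next to the consumer,
so an entry that aged out while the pool sat idle is never given to an upload
worker.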

* test: add benchmarks for file ID pool vs direct assign

Benchmarks measure:
- Pool Get vs Direct AssignVolume at various simulated latencies
- Batch assign scaling (Count=1 through Count=32)
- Concurrent pool access with 1-64 workers

Results on Apple M4:
- Pool Get: constant ~3ns regardless of assign latency
- Batch=16: 15.7x more IDs/sec than individual assigns
- 64 concurrent workers: 19M IDs/sec throughput

* fix(mount): address review feedback on file ID pool

1. Fix race condition in Get(): use sync.Cond so callers wait for an
   in-flight refill instead of returning an error when the pool is empty.

2. Match default pool size to async flush worker count (128, not 16)
   when ConcurrentWriters is unset.

3. Add logging to UploadWithAssignFunc for consistency with UploadWithRetry.

4. Document that pooled assigns omit the Path field, bypassing path-based
   storage rules (filer.conf). This is an intentional tradeoff for
   writeback cache performance.

5. Fix flaky expiry test: widen time margin from 50ms to 1s.

6. Add TestFileIdPoolGetWaitsForRefill to verify concurrent waiters.
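
Item 1 describes a common sync.Cond pattern. The sketch below shows the idea
under simplifying assumptions: idPool, BeginRefill, and FinishRefill are
hypothetical names, IDs are plain strings, and expiry is left out. It is not
the actual pool implementation.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// idPool is a hypothetical, simplified stand-in for the pool described above.
type idPool struct {
	mu        sync.Mutex
	cond      *sync.Cond
	ids       []string
	refilling bool
}

func newIDPool() *idPool {
	p := &idPool{}
	p.cond = sync.NewCond(&p.mu)
	return p
}

// Get returns an ID. If the pool is empty but a refill is in flight, it waits
// for that refill instead of failing; it errors only when nothing is coming.
func (p *idPool) Get() (string, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for len(p.ids) == 0 {
		if !p.refilling {
			return "", errors.New("pool empty and no refill in flight")
		}
		p.cond.Wait() // releases mu while waiting, reacquires on wake-up
	}
	id := p.ids[0]
	p.ids = p.ids[1:]
	return id, nil
}

// BeginRefill marks a refill as in flight so that callers of Get wait.
func (p *idPool) BeginRefill() {
	p.mu.Lock()
	p.refilling = true
	p.mu.Unlock()
}

// FinishRefill adds the new IDs (possibly none) and wakes every waiter.
func (p *idPool) FinishRefill(ids []string) {
	p.mu.Lock()
	p.ids = append(p.ids, ids...)
	p.refilling = false
	p.mu.Unlock()
	p.cond.Broadcast()
}

func main() {
	p := newIDPool()
	p.BeginRefill()

	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			id, err := p.Get() // waits if the refill has not landed yet
			fmt.Println(id, err)
		}()
	}
	p.FinishRefill([]string{"fid-1", "fid-2", "fid-3"})
	wg.Wait()
}
```

The wait sits inside a for loop because a woken goroutine must re-check the
condition: another waiter may have taken the last ID, or the refill may have
returned nothing.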

* fix(mount): use individual Count=1 assigns to get per-fid JWTs

The master generates one JWT per AssignResponse, bound to the base file
ID (master_grpc_server_assign.go:158). The volume server validates that
the JWT's Fid matches the upload exactly (volume_server_handlers.go:367).
Using Count=N and deriving sequential IDs would fail this check.

Switch to individual Count=1 RPCs over a single gRPC connection. This
still amortizes connection overhead while getting a correct per-fid JWT
for each entry. Partial batches are accepted if some requests fail.

Remove unused needle import now that sequential ID generation is gone.
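
A rough sketch of the "partial batches are accepted" behavior follows.
assignOne is a hypothetical stand-in for a Count=1 AssignVolume call, and the
requests are issued sequentially for simplicity; whether the real code issues
them concurrently is not stated here.

```go
package main

import (
	"errors"
	"fmt"
)

// errAssignFailed is a placeholder error used by the fake assigner in main.
var errAssignFailed = errors.New("assign failed")

// assignedID pairs a file ID with the JWT issued for exactly that ID.
type assignedID struct {
	fileID string
	jwt    string
}

// assignOne is a hypothetical stand-in for one Count=1 AssignVolume RPC.
type assignOne func() (assignedID, error)

// fillBatch issues n individual assigns and keeps whatever succeeded.
// It returns an error only when every request failed.
func fillBatch(n int, assign assignOne) ([]assignedID, error) {
	batch := make([]assignedID, 0, n)
	var lastErr error
	for i := 0; i < n; i++ {
		id, err := assign()
		if err != nil {
			lastErr = err
			continue // accept a partial batch
		}
		batch = append(batch, id)
	}
	if len(batch) == 0 && lastErr != nil {
		return nil, fmt.Errorf("all %d assigns failed: %w", n, lastErr)
	}
	return batch, nil
}

func main() {
	calls := 0
	flaky := func() (assignedID, error) {
		calls++
		if calls%3 == 0 {
			return assignedID{}, errAssignFailed
		}
		return assignedID{
			fileID: fmt.Sprintf("fid-%d", calls),
			jwt:    fmt.Sprintf("jwt-%d", calls),
		}, nil
	}
	batch, err := fillBatch(6, flaky)
	fmt.Println(len(batch), err)
}
```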

* fix(mount): separate pprof from FUSE protocol debug logging

The -debug flag was enabling both the pprof HTTP server and the noisy
go-fuse protocol logging (rx/tx lines for every FUSE operation). This
makes profiling impractical as the log output dominates.

Split into two flags:
- -debug: enables pprof HTTP server only (for profiling)
- -debug.fuse: enables raw FUSE protocol request/response logging
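
The split can be illustrated with the standard flag package. This is a
standalone sketch: parseDebugArgs is a hypothetical helper, and only the flag
names -debug and -debug.fuse come from the change itself.

```go
package main

import (
	"flag"
	"fmt"
)

// parseDebugArgs registers the two independent switches on a private FlagSet
// and reports which ones were set.
func parseDebugArgs(args []string) (pprof, fuseLog bool, err error) {
	fs := flag.NewFlagSet("mount", flag.ContinueOnError)
	debug := fs.Bool("debug", false, "serve pprof over HTTP")
	debugFuse := fs.Bool("debug.fuse", false, "log raw FUSE protocol requests and responses")
	if err := fs.Parse(args); err != nil {
		return false, false, err
	}
	return *debug, *debugFuse, nil
}

func main() {
	// Profiling without the per-operation FUSE log lines.
	pprof, fuseLog, err := parseDebugArgs([]string{"-debug"})
	fmt.Println(pprof, fuseLog, err)
}
```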

* perf(mount): replace LevelDB read+write with in-memory overlay for dir mtime

Profile showed TouchDirMtimeCtime at 0.22s — every create/rename/unlink
in a directory did a LevelDB FindEntry (read) + UpdateEntry (write) just
to bump the parent dir's mtime/ctime.

Replace with an in-memory map (same pattern as existing atime overlay):
- touchDirMtimeCtimeLocal now stores inode→timestamp in dirMtimeMap
- applyInMemoryDirMtime overlays onto GetAttr/Lookup output
- No LevelDB I/O on the mutation hot path

The overlay only advances timestamps forward (max of stored vs overlay),
so stale entries are harmless. Map is bounded at 8192 entries.
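
A minimal sketch of such a forward-only overlay is shown below. The type and
method names are hypothetical, and timestamps are held as Unix nanoseconds
for brevity. The behavior at the 8192-entry bound is not described above;
evicting an arbitrary entry is an assumption made only for this sketch.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const maxOverlayEntries = 8192 // bound stated in the commit message

// dirMtimeOverlay maps a directory inode to its latest in-memory
// mtime/ctime, in Unix nanoseconds.
type dirMtimeOverlay struct {
	mu sync.Mutex
	ns map[uint64]int64
}

func newDirMtimeOverlay() *dirMtimeOverlay {
	return &dirMtimeOverlay{ns: make(map[uint64]int64)}
}

// touch records a timestamp for inode and never moves it backwards.
func (o *dirMtimeOverlay) touch(inode uint64, unixNano int64) {
	o.mu.Lock()
	defer o.mu.Unlock()
	if prev, ok := o.ns[inode]; ok {
		if unixNano > prev {
			o.ns[inode] = unixNano
		}
		return
	}
	if len(o.ns) >= maxOverlayEntries {
		// ASSUMPTION: drop one arbitrary entry to stay within the bound.
		for victim := range o.ns {
			delete(o.ns, victim)
			break
		}
	}
	o.ns[inode] = unixNano
}

// apply returns the later of the stored timestamp and the overlay value.
func (o *dirMtimeOverlay) apply(inode uint64, stored int64) int64 {
	o.mu.Lock()
	defer o.mu.Unlock()
	if ts, ok := o.ns[inode]; ok && ts > stored {
		return ts
	}
	return stored
}

func main() {
	o := newDirMtimeOverlay()
	stored := time.Now().Add(-time.Minute).UnixNano() // value persisted earlier
	o.touch(42, time.Now().UnixNano())                // a create/rename/unlink in dir 42
	fmt.Println(o.apply(42, stored) > stored)
}
```

Because apply takes the maximum, an overlay entry that is older than the
stored value simply has no effect, which is why stale entries are harmless.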

* perf(mount): skip self-originated metadata subscription events in writeback mode

With writeback caching, this mount is the single writer. All local
mutations are already applied to the local meta cache (via
applyLocalMetadataEvent or direct InsertEntry). The filer subscription
then delivers the same event back, causing redundant work:
proto.Clone, enqueue to apply loop, dedup ring check, and sometimes
redundant LevelDB writes when the dedup ring misses (deferred creates).

Check EventNotification.Signatures against selfSignature and skip
events that originated from this mount. This eliminates the redundant
processing for every self-originated mutation.
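
The signature check can be sketched as below. eventNotification mirrors only
the Signatures field named above, with an assumed element type of int32;
isSelfOriginated and shouldApply are hypothetical helper names.

```go
package main

import "fmt"

// eventNotification mirrors only the field this check needs.
// ASSUMPTION: signatures are int32 values.
type eventNotification struct {
	Signatures []int32
}

// isSelfOriginated reports whether this mount's signature is on the event.
func isSelfOriginated(ev eventNotification, selfSignature int32) bool {
	for _, s := range ev.Signatures {
		if s == selfSignature {
			return true
		}
	}
	return false
}

// shouldApply skips self-originated events only in writeback cache mode,
// where local mutations are already in the local meta cache.
func shouldApply(ev eventNotification, selfSignature int32, writebackCache bool) bool {
	return !(writebackCache && isSelfOriginated(ev, selfSignature))
}

func main() {
	const self int32 = 1234
	own := eventNotification{Signatures: []int32{self}}
	other := eventNotification{Signatures: []int32{5678}}
	fmt.Println(shouldApply(own, self, true), shouldApply(other, self, true))
}
```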

* perf(mount): increase kernel FUSE cache TTL in writeback cache mode

With writeback caching, this mount is the single writer — the local
meta cache is authoritative. Increase EntryValid and AttrValid from 1s
to 10s so the kernel doesn't re-issue Lookup/GetAttr for every path
component and stat call.

This reduces FUSE /dev/fuse round-trips which dominate the profile at
38% of CPU (syscall.rawsyscalln). Each saved round-trip eliminates a
kernel→userspace→kernel transition.

Normal (non-writeback) mode retains the 1s TTL for multi-mount
consistency.
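
The TTL selection amounts to a single branch. kernelCacheTTL is a
hypothetical helper name; the 1s and 10s values are the ones given above.

```go
package main

import (
	"fmt"
	"time"
)

// kernelCacheTTL returns the validity period reported to the kernel for
// entries and attributes.
func kernelCacheTTL(writebackCache bool) time.Duration {
	if writebackCache {
		// Single writer: the local meta cache is authoritative.
		return 10 * time.Second
	}
	// Other mounts may change the same paths, so stay short.
	return time.Second
}

func main() {
	fmt.Println(kernelCacheTTL(true), kernelCacheTTL(false))
}
```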
2026-04-11 20:02:42 -07:00


package command

import (
	"os"
	"time"
)

type MountOptions struct {
	filer                *string
	filerMountRootPath   *string
	dir                  *string
	dirAutoCreate        *bool
	collection           *string
	collectionQuota      *int
	replication          *string
	diskType             *string
	ttlSec               *int
	chunkSizeLimitMB     *int
	concurrentWriters    *int
	concurrentReaders    *int
	cacheMetaTtlSec      *int
	cacheDirForRead      *string
	cacheDirForWrite     *string
	cacheSizeMBForRead   *int64
	dataCenter           *string
	allowOthers          *bool
	defaultPermissions   *bool
	umaskString          *string
	nonempty             *bool
	volumeServerAccess   *string
	uidMap               *string
	gidMap               *string
	readOnly             *bool
	includeSystemEntries *bool
	debug                *bool
	debugPort            *int
	debugFuse            *bool
	localSocket          *string
	disableXAttr         *bool
	extraOptions         []string
	fuseCommandPid       int

	// Periodic metadata flush to protect against orphan chunk cleanup
	metadataFlushSeconds *int

	// RDMA acceleration options
	rdmaEnabled       *bool
	rdmaSidecarAddr   *string
	rdmaFallback      *bool
	rdmaReadOnly      *bool
	rdmaMaxConcurrent *int
	rdmaTimeoutMs     *int
	dirIdleEvictSec   *int

	// Distributed lock for cross-mount write coordination
	distributedLock *bool

	// POSIX compliance options
	posixDirNlink *bool

	// FUSE performance options
	writebackCache *bool
	asyncDio       *bool
	cacheSymlink   *bool

	// macOS-specific FUSE options
	novncache *bool

	// if true, we assume autofs exists over current mount point. Autofs (the kernel one, used by systemd automount)
	// is expected to be mounted as a shim between auto-mounted fs and original mount point to provide auto mount.
	// with this option, we ignore autofs mounted on the same point.
	hasAutofs *bool
}

var (
	mountOptions       MountOptions
	mountCpuProfile    *string
	mountMemProfile    *string
	mountReadRetryTime *time.Duration
)

func init() {
	cmdMount.Run = runMount // break init cycle
	mountOptions.filer = cmdMount.Flag.String("filer", "localhost:8888", "comma-separated weed filer location")
	mountOptions.filerMountRootPath = cmdMount.Flag.String("filer.path", "/", "mount this remote path from filer server")
	mountOptions.dir = cmdMount.Flag.String("dir", ".", "mount weed filer to this directory")
	mountOptions.dirAutoCreate = cmdMount.Flag.Bool("dirAutoCreate", false, "auto create the directory to mount to")
	mountOptions.collection = cmdMount.Flag.String("collection", "", "collection to create the files")
	mountOptions.collectionQuota = cmdMount.Flag.Int("collectionQuotaMB", 0, "quota for the collection")
	mountOptions.replication = cmdMount.Flag.String("replication", "", "replication (e.g. 000, 001) for created files. If empty, let filer decide.")
	mountOptions.diskType = cmdMount.Flag.String("disk", "", "[hdd|ssd|<tag>] hard drive or solid state drive or any tag")
	mountOptions.ttlSec = cmdMount.Flag.Int("ttl", 0, "file ttl in seconds")
	mountOptions.chunkSizeLimitMB = cmdMount.Flag.Int("chunkSizeLimitMB", 2, "local write buffer size, also chunk large files")
	mountOptions.concurrentWriters = cmdMount.Flag.Int("concurrentWriters", 128, "limit concurrent goroutine writers")
	mountOptions.concurrentReaders = cmdMount.Flag.Int("concurrentReaders", 128, "limit concurrent chunk fetches for read operations")
	mountOptions.cacheDirForRead = cmdMount.Flag.String("cacheDir", os.TempDir(), "local cache directory for file chunks and meta data")
	mountOptions.cacheSizeMBForRead = cmdMount.Flag.Int64("cacheCapacityMB", 128, "file chunk read cache capacity in MB")
	mountOptions.cacheDirForWrite = cmdMount.Flag.String("cacheDirWrite", "", "buffer writes mostly for large files")
	mountOptions.cacheMetaTtlSec = cmdMount.Flag.Int("cacheMetaTtlSec", 60, "metadata cache validity seconds")
	mountOptions.dataCenter = cmdMount.Flag.String("dataCenter", "", "prefer to write to the data center")
	mountOptions.allowOthers = cmdMount.Flag.Bool("allowOthers", true, "allows other users to access the file system")
	mountOptions.defaultPermissions = cmdMount.Flag.Bool("defaultPermissions", true, "enforce permissions by the operating system")
	mountOptions.umaskString = cmdMount.Flag.String("umask", "022", "octal umask, e.g., 022, 0111")
	mountOptions.nonempty = cmdMount.Flag.Bool("nonempty", false, "allows the mounting over a non-empty directory")
	mountOptions.volumeServerAccess = cmdMount.Flag.String("volumeServerAccess", "direct", "access volume servers by [direct|publicUrl|filerProxy]")
	mountOptions.uidMap = cmdMount.Flag.String("map.uid", "", "map local uid to uid on filer, comma-separated <local_uid>:<filer_uid>")
	mountOptions.gidMap = cmdMount.Flag.String("map.gid", "", "map local gid to gid on filer, comma-separated <local_gid>:<filer_gid>")
	mountOptions.readOnly = cmdMount.Flag.Bool("readOnly", false, "read only")
	mountOptions.includeSystemEntries = cmdMount.Flag.Bool("includeSystemEntries", false, "show filer system entries (e.g. /topics, /etc) in directory listings")
	mountOptions.debug = cmdMount.Flag.Bool("debug", false, "serves runtime profiling data, e.g., http://localhost:<debug.port>/debug/pprof/goroutine?debug=2")
	mountOptions.debugPort = cmdMount.Flag.Int("debug.port", 6061, "http port for debugging")
	mountOptions.debugFuse = cmdMount.Flag.Bool("debug.fuse", false, "log raw FUSE protocol requests and responses")
	mountOptions.localSocket = cmdMount.Flag.String("localSocket", "", "default to /tmp/seaweedfs-mount-<mount_dir_hash>.sock")
	mountOptions.disableXAttr = cmdMount.Flag.Bool("disableXAttr", false, "disable xattr")
	mountOptions.hasAutofs = cmdMount.Flag.Bool("autofs", false, "ignore autofs mounted on the same mountpoint (useful when systemd.automount and autofs are used)")
	mountOptions.fuseCommandPid = 0

	// Periodic metadata flush to protect against orphan chunk cleanup
	mountOptions.metadataFlushSeconds = cmdMount.Flag.Int("metadataFlushSeconds", 120, "periodically flush file metadata to filer in seconds (0 to disable). This protects chunks from being purged by volume.fsck for long-running writes")

	// RDMA acceleration flags
	mountOptions.rdmaEnabled = cmdMount.Flag.Bool("rdma.enabled", false, "enable RDMA acceleration for reads")
	mountOptions.rdmaSidecarAddr = cmdMount.Flag.String("rdma.sidecar", "", "RDMA sidecar address (e.g., localhost:8081)")
	mountOptions.rdmaFallback = cmdMount.Flag.Bool("rdma.fallback", true, "fallback to HTTP when RDMA fails")
	mountOptions.rdmaReadOnly = cmdMount.Flag.Bool("rdma.readOnly", false, "use RDMA for reads only (writes use HTTP)")
	mountOptions.rdmaMaxConcurrent = cmdMount.Flag.Int("rdma.maxConcurrent", 64, "max concurrent RDMA operations")
	mountOptions.rdmaTimeoutMs = cmdMount.Flag.Int("rdma.timeoutMs", 5000, "RDMA operation timeout in milliseconds")

	mountOptions.dirIdleEvictSec = cmdMount.Flag.Int("dirIdleEvictSec", 600, "seconds to evict idle cached directories (0 to disable)")

	mountCpuProfile = cmdMount.Flag.String("cpuprofile", "", "cpu profile output file")
	mountMemProfile = cmdMount.Flag.String("memprofile", "", "memory profile output file")
	mountReadRetryTime = cmdMount.Flag.Duration("readRetryTime", 6*time.Second, "maximum read retry wait time")

	// Distributed lock for cross-mount write coordination
	mountOptions.distributedLock = cmdMount.Flag.Bool("dlm", false, "enable distributed lock for cross-mount write coordination (only one mount can write a file at a time)")

	// POSIX compliance options
	mountOptions.posixDirNlink = cmdMount.Flag.Bool("posix.dirNLink", false, "report POSIX-compliant directory nlink (2 + subdirectory count); costs one directory listing per stat")

	// FUSE performance options
	mountOptions.writebackCache = cmdMount.Flag.Bool("writebackCache", false, "enable FUSE writeback cache for improved write performance (at risk of data loss on crash)")
	mountOptions.asyncDio = cmdMount.Flag.Bool("asyncDio", false, "enable async direct I/O for better concurrency")
	mountOptions.cacheSymlink = cmdMount.Flag.Bool("cacheSymlink", false, "enable symlink caching to reduce metadata lookups")

	// macOS-specific FUSE options
	mountOptions.novncache = cmdMount.Flag.Bool("sys.novncache", false, "(macOS only) disable vnode name caching to avoid stale data")
}

var cmdMount = &Command{
	UsageLine: "mount -filer=localhost:8888 -dir=/some/dir",
	Short:     "mount weed filer to a directory as file system in userspace(FUSE)",
	Long: `mount weed filer to userspace.

  Pre-requisites:
  1) have SeaweedFS master and volume servers running
  2) have a "weed filer" running
  These 2 requirements can be achieved with one command "weed server -filer=true"

  This uses github.com/seaweedfs/fuse, which enables writing FUSE file systems on
  Linux and OS X.

  On OS X, it requires OSXFUSE (https://osxfuse.github.io/).

  RDMA Acceleration:
  For ultra-fast reads, enable RDMA acceleration with an RDMA sidecar:
    weed mount -filer=localhost:8888 -dir=/mnt/seaweedfs \
      -rdma.enabled=true -rdma.sidecar=localhost:8081

  RDMA Options:
    -rdma.enabled=false      Enable RDMA acceleration for reads
    -rdma.sidecar=""         RDMA sidecar address (required if enabled)
    -rdma.fallback=true      Fallback to HTTP when RDMA fails
    -rdma.readOnly=false     Use RDMA for reads only (writes use HTTP)
    -rdma.maxConcurrent=64   Max concurrent RDMA operations
    -rdma.timeoutMs=5000     RDMA operation timeout in milliseconds

  `,
}