mirror of https://github.com/seaweedfs/seaweedfs.git synced 2026-07-25 01:22:39 +00:00

Chris Lu and GitHub c40db5a52d perf(filer): parallelize StreamMutateEntry with path-keyed scheduler (#9171)

* perf(filer): parallelize StreamMutateEntry with path-keyed scheduler

The server handler processed one mutation at a time per stream, capping a
mount's aggregate throughput at ~1/filer_store_service_time regardless of
client concurrency (see issue #9138). With 12 rclone processes this showed
as a ~500 QPS ceiling on a filer that previously served ~1000+ QPS via
unary CreateEntry.

Replace the serial for-loop with a per-request goroutine admitted by a
path-keyed scheduler, adapted directly from filer.sync's MetadataProcessor
(weed/command/filer_sync_jobs.go). Same four conflict indexes, same kind
taxonomy (file / barrier-dir / non-barrier-dir), same ancestor-barrier
and descendant-barrier rules. Cross-path mutations run in parallel; same-
path mutations serialize on arrival order; recursive delete and directory
rename act as subtree barriers; directory attribute bumps stay non-barrier
so they do not serialize file writes under them.
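A minimal sketch of the subtree-barrier idea described above, not the scheduler's actual code: `isUnder` and `blockedByBarrier` are hypothetical helpers, and paths are assumed to be absolute and already cleaned. It only illustrates why a barrier on a directory holds back mutations beneath it while leaving sibling paths free to run.

```go
package main

import (
	"fmt"
	"strings"
)

// isUnder reports whether path is dir itself or lies inside dir's subtree.
// Assumption: paths are absolute and cleaned (no trailing slash).
func isUnder(path, dir string) bool {
	if dir == "/" {
		return strings.HasPrefix(path, "/")
	}
	return path == dir || strings.HasPrefix(path, dir+"/")
}

// blockedByBarrier reports whether a mutation on path has to wait for any
// of the subtree barriers that are currently active (hypothetical helper).
func blockedByBarrier(path string, activeBarriers []string) bool {
	for _, b := range activeBarriers {
		if isUnder(path, b) {
			return true
		}
	}
	return false
}

func main() {
	barriers := []string{"/data/tmp"} // e.g. a recursive delete in flight

	fmt.Println(blockedByBarrier("/data/tmp/a.txt", barriers))   // true
	fmt.Println(blockedByBarrier("/data/tmpfile", barriers))     // false
	fmt.Println(blockedByBarrier("/data/other/b.txt", barriers)) // false
}
```

The `dir+"/"` comparison is what keeps `/data/tmpfile` from being mistaken for a child of `/data/tmp`.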

Correctness and safety:
- Per-stream goroutine cap (streamMutateConcurrency = 64) bounds resource
  use from a single noisy mount.
- syncStream wrapper serializes stream.Send across worker goroutines (gRPC
  Send is not concurrent-safe).
- Handler waits on in-flight workers before returning on recv EOF/error so
  no worker writes to a torn-down stream.
- First fatal Send error from any worker propagates as the handler's
  return, causing the stream to tear down.
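The Send serialization in the list above can be sketched roughly as follows. This is an illustration under assumptions, not the commit's syncStream: the `sender` interface, `lockedSender`, and the `recorder` fake are hypothetical stand-ins for the generated gRPC stream type.

```go
package main

import (
	"fmt"
	"sync"
)

// sender is a hypothetical stand-in for the server stream's Send method.
type sender interface {
	Send(msg string) error
}

// lockedSender lets many worker goroutines share one stream by taking a
// mutex around every Send call.
type lockedSender struct {
	mu    sync.Mutex
	inner sender
}

func (s *lockedSender) Send(msg string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.inner.Send(msg)
}

// recorder is a fake stream that is deliberately not safe for concurrent use.
type recorder struct{ sent []string }

func (r *recorder) Send(msg string) error {
	r.sent = append(r.sent, msg)
	return nil
}

// sendConcurrently sends n responses from n goroutines through the wrapper
// and returns how many reached the underlying stream.
func sendConcurrently(n int) int {
	rec := &recorder{}
	s := &lockedSender{inner: rec}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			_ = s.Send(fmt.Sprintf("resp-%d", id))
		}(i)
	}
	wg.Wait() // mirrors the handler waiting on in-flight workers before returning
	return len(rec.sent)
}

func main() {
	fmt.Println(sendConcurrently(64)) // prints 64
}
```

Without the mutex, the appends inside `recorder.Send` would race and responses could be lost or corrupted.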

Benchmark (2 ms simulated filer-store service delay, 12 client workers):
  serial    : 440 QPS
  sem only  : 4902 QPS (unsafe — reorders same-path ops)
  scheduler : 4934 QPS on distinct paths, 439 QPS on same path (correct)

The sem-only number shows the upper bound of raw parallelism; the
scheduler matches it on distinct paths (the realistic 12-rclone case) and
correctly falls back to serial when the workload demands ordering. Peak
concurrent mutations at the handler equals client worker count on the
distinct-path workload and pins to 1 on the same-path workload, as the
scheduler intends.

* perf(filer): decouple StreamMutateEntry admission from receive loop

The previous StreamMutateEntry handler called sched.Admit directly in the
Recv loop. A single request conflicting on path /hot would head-of-line
block stream.Recv, so later requests targeting unrelated paths could not
be received or admitted until /hot drained — cross-path parallelism then
depended on request ordering instead of being a property of the scheduler.

Spawn the worker goroutine immediately on Recv and move sched.Admit into
that goroutine. A new streamMutatePendingLimit (1024) caps total per-
stream outstanding goroutines (pending + active) so a client flooding a
conflicted path cannot explode goroutine count without bound.
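One common way to bound outstanding goroutines is a buffered channel used as a counting semaphore. The sketch below assumes that approach and assumes the receive loop blocks once the limit is reached; the commit may enforce streamMutatePendingLimit differently. `runBounded` is a hypothetical name.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// runBounded starts one goroutine per job but never allows more than limit
// to be outstanding at once. It returns the peak number observed.
func runBounded(jobs, limit int) int64 {
	slots := make(chan struct{}, limit)
	var wg sync.WaitGroup
	var cur, peak int64

	for i := 0; i < jobs; i++ {
		slots <- struct{}{} // blocks the loop while limit goroutines are outstanding
		wg.Add(1)
		go func() {
			defer wg.Done()
			defer func() { <-slots }() // release the slot after the work is done

			n := atomic.AddInt64(&cur, 1)
			for { // record the highest concurrency seen so far
				p := atomic.LoadInt64(&peak)
				if n <= p || atomic.CompareAndSwapInt64(&peak, p, n) {
					break
				}
			}
			atomic.AddInt64(&cur, -1)
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&peak)
}

func main() {
	peak := runBounded(10000, 1024)
	fmt.Println(peak <= 1024) // true: the cap holds regardless of scheduling
}
```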

Addresses #9171 review comment (coderabbitai, Major).

* fix(filer): reply with EINVAL on unknown StreamMutateEntry request type

Returning nil when req.Request is a future oneof variant or a malformed
request left the client's per-RequestId waiter blocked forever, because
no response was ever sent for that id. Reply with IsLast=true and EINVAL
so the waiter completes with a well-formed error.

Addresses #9171 review comments (gemini-code-assist, coderabbitai).

* fix(filer): make classifyMutation crash-free and correct for deletes

Two issues addressed together because they share one function:

1. Nil-entry panic. classifyMutation dereferenced req.Entry.Name without
   a nil guard; an empty create_request / update_request / rename_request
   from a misbehaving client crashed the scheduler. Guard each oneof
   variant and fall back to a "/" barrier; the handler then sends EINVAL
   via the unknown-request path.

2. Non-recursive delete vs concurrent dir attribute update. DeleteEntry-
   Request does not carry IsDirectory, so the previous kindMutateFile
   classification for non-recursive deletes did not conflict with an in-
   flight kindMutateNonBarrierDir (chmod / xattr / mtime) at the same
   path — a race in scheduler terms. Classify every delete as
   kindMutateBarrierDir regardless of IsRecursive. The incremental cost
   of a descendant-wait for a non-recursive delete of a non-empty dir is
   negligible since that call fails at the store anyway.

Adds classifyMutation tests for malformed create/update, empty oneof,
and updates the delete-non-recursive case to the new expected kind.

Addresses #9171 review comments (coderabbitai Critical, Major).

* fix(filer): route renameStreamProxy.SendMsg through the wrapping Send

The default pass-through SendMsg on renameStreamProxy bypassed the
syncStream mutex and the StreamMutateEntryResponse wrapping: anything
the rename helpers happened to push via SendMsg would have been emitted
on the wire as the wrong protobuf type and could interleave with other
workers' Sends. RecvMsg similarly raced with the outer StreamMutateEntry
Recv loop and could steal unrelated mutation requests.

Route SendMsg through the wrapping Send (rejecting other payload types)
and fail RecvMsg explicitly — the rename logic is a strictly server-push
stream and never calls RecvMsg, so loud failure is safer than silent
stealing.

Addresses #9171 review comment (coderabbitai, Major).

* test(filer): run exactly ops in stream-mutate workloads

perGoroutine := ops / concurrency silently truncated the total when the
values were not divisible — e.g. 2400 ops with 64 workers actually ran
2368 and with 256 workers ran 2304, making the logged "ops per run"
inaccurate and introducing measurement noise that varied across the
concurrency sweep.

Introduce opsForWorker(g, concurrency, ops) which distributes the
remainder to the first (ops % concurrency) workers so the three
workloads (unary, stream sync, stream async) each dispatch exactly
`ops` operations. No changes to the timing methodology.
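The remainder distribution described above can be sketched as follows. The signature `opsForWorker(g, concurrency, ops)` comes from the commit message; the body is an assumption about how such a helper would typically be written.

```go
package main

import "fmt"

// opsForWorker returns how many operations worker g (0-based) should run so
// that all workers together run exactly ops. The first ops%concurrency
// workers take one extra operation.
func opsForWorker(g, concurrency, ops int) int {
	n := ops / concurrency
	if g < ops%concurrency {
		n++
	}
	return n
}

func main() {
	const ops = 2400
	for _, concurrency := range []int{64, 256} {
		truncated := (ops / concurrency) * concurrency // the old perGoroutine behaviour
		exact := 0
		for g := 0; g < concurrency; g++ {
			exact += opsForWorker(g, concurrency, ops)
		}
		fmt.Printf("workers=%d truncated=%d exact=%d\n", concurrency, truncated, exact)
	}
	// workers=64 truncated=2368 exact=2400
	// workers=256 truncated=2304 exact=2400
}
```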

Addresses #9171 review comment (coderabbitai, Minor).

* fix(filer): enforce per-path FIFO admission in mutateScheduler

sync.Cond.Broadcast wakes every waiter; the first to re-acquire the
mutex wins, so two conflicting same-path admissions could be reordered
by the Go runtime even though they arrived serially on the stream. A
single stream is supposed to carry ordered mutations — the PR's original
#8770 claim — so admission must be FIFO per path.

Replace the single cond with a per-path FIFO queue. Each Admit enqueues
a waiter on every path it touches (primary, and on rename the secondary
too) and blocks on a ready channel. tryPromoteLocked admits any waiter
that is at the head of every queue it joined, passes pathConflictsLocked
against the active-state indexes, and is under concurrencyLimit. Done
removes the heads and re-runs tryPromoteLocked so waiters freed by the
completion move in arrival order.
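A heavily simplified sketch of per-path FIFO admission with ready channels, under stated assumptions: every operation on a path is treated as exclusive, and the kind taxonomy, rename secondary paths, and concurrencyLimit are all left out. `fifoGate`, `Enqueue`, and `runOrdered` are hypothetical names, not the commit's mutateScheduler API.

```go
package main

import (
	"fmt"
	"sync"
)

// fifoGate admits waiters on the same path strictly in arrival order.
type fifoGate struct {
	mu     sync.Mutex
	queues map[string][]chan struct{} // per-path waiters; index 0 holds the path
}

func newFIFOGate() *fifoGate {
	return &fifoGate{queues: make(map[string][]chan struct{})}
}

// Enqueue registers the caller on path and returns a channel that is closed
// once the caller reaches the head of the queue.
func (g *fifoGate) Enqueue(path string) <-chan struct{} {
	ready := make(chan struct{})
	g.mu.Lock()
	defer g.mu.Unlock()
	g.queues[path] = append(g.queues[path], ready)
	if len(g.queues[path]) == 1 {
		close(ready) // nobody ahead: admit immediately
	}
	return ready
}

// Done removes the current head and promotes the next waiter, if any.
func (g *fifoGate) Done(path string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	q := g.queues[path]
	if len(q) == 0 {
		return
	}
	q = q[1:]
	if len(q) == 0 {
		delete(g.queues, path)
		return
	}
	g.queues[path] = q
	close(q[0])
}

// runOrdered enqueues n operations on one path and returns the order in
// which they actually ran.
func runOrdered(n int) []int {
	g := newFIFOGate()
	var wg sync.WaitGroup
	order := make([]int, 0, n)
	for i := 0; i < n; i++ {
		ready := g.Enqueue("/a") // arrival order is fixed here, before the goroutine starts
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			<-ready
			order = append(order, id) // only the current holder reaches this line
			g.Done("/a")
		}(i)
	}
	wg.Wait()
	return order
}

func main() {
	fmt.Println(runOrdered(5)) // [0 1 2 3 4]
}
```

Because each waiter blocks on its own channel, only the next waiter in line is woken, which avoids the wake-up race that a shared Broadcast allows.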

Side effect: two non-barrier directory updates on the same path now
serialize instead of overlapping. filer.sync's MetadataProcessor
intentionally allows them to overlap because its events come from a
committed log where last-writer-wins coalescing is safe; streamed
mutations carry client operations whose order matters, so we drop that
optimization here. Added TestAdmitSamePathFIFO (20-waiter barrier
release) and TestAdmitSamePathNonBarrierSerializes to cover both.

Also refreshed the kindMutateFile doc comment that still referenced the
pre-#1ecf805f5 "non-recursive delete" classification.

Addresses #9171 review comments (coderabbitai Critical, Minor).

* test(filer): make TestAdmitSamePathFIFO deterministic without sleeps

The previous arrival-ordering sync (send to `started` before calling
Admit, plus a 1 ms sleep) relied on the goroutine actually entering
Admit and reaching the per-path queue during that sleep. Under -race on
a loaded CI that is a real flake source, which is ironic for a test
whose job is catching non-deterministic wake-ups.

Observe the scheduler's own pathQueue length between spawns instead —
waitQueueLen polls s.pathQueue["/a"] under s.mu until the expected
number of waiters (1 barrier holder + i+1 file waiters) is enqueued.
That's the exact event the test wants to synchronise on, so there is no
fudge factor. Verified by `go test -race -count=5`.

Addresses #9171 review comment (coderabbitai, Minor).

2026-04-21 11:25:09 -07:00

Name	Last commit	Date
.github	fix(s3api): route STS GetFederationToken to STS handler (#9157) (#9167)	2026-04-20 19:33:22 -07:00
.superset	chore: remove ~50k lines of unreachable dead code (#8913)	2026-04-03 16:04:27 -07:00
cmd	Move SQL engine and PostgreSQL server to their own binaries (#8417)	2026-02-23 16:27:08 -08:00
docker	ci(pjdfstest): cache docker layers via GHA to avoid apt mirror flakes (#9106)	2026-04-16 12:50:19 -07:00
k8s/charts	4.21	2026-04-19 14:38:29 -07:00
note	Update Wiki images (#8069)	2026-01-20 14:12:14 -08:00
other	peer chunk sharing 1/8: proto definitions (#9130)	2026-04-18 20:02:55 -07:00
postgres-examples	Message Queue: Add sql querying (#7185)	2025-09-09 01:01:03 -07:00
seaweed-volume	fix(volume): keep vacuum running past dangling .idx entries (#9115)	2026-04-16 22:01:34 -07:00
seaweedfs-rdma-sidecar	build(deps): bump rand from 0.9.2 to 0.9.4 in /seaweedfs-rdma-sidecar/rdma-engine (#9065)	2026-04-13 22:51:00 -07:00
snap	move to https://github.com/seaweedfs/seaweedfs	2022-07-29 00:17:28 -07:00
sw-block/design	doc: P14 S8 final bounded close — evidence matrix + P15 handoff (#9142)	2026-04-20 02:24:44 -07:00
telemetry	fix(telemetry): use correct TopologyId field in integration test (#8714)	2026-03-20 22:15:05 -07:00
test	fix(s3api): route STS GetFederationToken to STS handler (#9157) (#9167)	2026-04-20 19:33:22 -07:00
unmaintained	go fix	2026-02-20 18:42:00 -08:00
util	util: added gostd script	2019-04-30 03:23:20 +00:00
weed	perf(filer): parallelize StreamMutateEntry with path-keyed scheduler (#9171)	2026-04-21 11:25:09 -07:00
.gitignore	Rust volume server implementation with CI (#8539)	2026-03-26 17:24:35 -07:00
backers.md	chore: add nimbus web services to backers.md (#4769)	2023-08-20 15:31:23 -07:00
CODE_OF_CONDUCT.md	add code of conduct (#4109)	2023-01-05 11:01:22 -08:00
go.mod	build(deps): bump modernc.org/sqlite from 1.46.1 to 1.49.1 (#9155)	2026-04-20 12:20:55 -07:00
go.sum	build(deps): bump modernc.org/sqlite from 1.46.1 to 1.49.1 (#9155)	2026-04-20 12:20:55 -07:00
install.sh	Rust volume server implementation with CI (#8539)	2026-03-26 17:24:35 -07:00
LICENSE	Update LICENSE, fix copyright license year (#6405)	2025-01-01 01:55:42 -08:00
Makefile	(fix): Add templ install step in admin-generate (#8997)	2026-04-08 19:23:18 -07:00
README.md	Remove AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY exports	2026-01-14 13:09:11 -08:00
SECURITY.md	Add security policy for vulnerability reporting	2026-04-17 09:51:21 -07:00
VOLUME_SERVER_RUST_PLAN.md	Rust volume server implementation with CI (#8539)	2026-03-26 17:24:35 -07:00

README.md

SeaweedFS

SeaweedFS is an independent Apache-licensed open source project with its ongoing development made possible entirely thanks to the support of these awesome backers. If you'd like to grow SeaweedFS even stronger, please consider joining our sponsors on Patreon.

Your support will be really appreciated by me and other supporters!

Quick Start

Quick Start with weed mini

The easiest way to get started with SeaweedFS for development and testing:

Download the latest binary from https://github.com/seaweedfs/seaweedfs/releases and unzip a single binary file weed or weed.exe.

Example:

# remove quarantine on macOS
# xattr -d com.apple.quarantine  ./weed

./weed mini -dir=/data

This single command starts a complete SeaweedFS setup with:

Master UI: http://localhost:9333
Volume Server: http://localhost:9340
Filer UI: http://localhost:8888
S3 Endpoint: http://localhost:8333
WebDAV: http://localhost:7333
Admin UI: http://localhost:23646

Perfect for development, testing, learning SeaweedFS, and single node deployments!

Quick Start for S3 API on Docker

docker run -p 8333:8333 chrislusf/seaweedfs server -s3

Quick Start with Single Binary

Download the latest binary from https://github.com/seaweedfs/seaweedfs/releases and unzip the single binary file, weed or weed.exe. Alternatively, run go install github.com/seaweedfs/seaweedfs/weed@latest.
Run export AWS_ACCESS_KEY_ID=admin ; export AWS_SECRET_ACCESS_KEY=key to set the admin credentials for accessing the object store.
Run weed server -dir=/some/data/dir -s3 to start one master, one volume server, one filer, and one S3 gateway. Unlike weed mini, which auto-configures itself for a single-host environment, weed server requires manual configuration and is designed for production use.

Also, to increase capacity, just add more volume servers by running weed volume -dir="/some/data/dir2" -master="<master_host>:9333" -port=8081 locally, or on a different machine, or on thousands of machines. That is it!

Introduction

SeaweedFS is a simple and highly scalable distributed file system. There are two objectives:

to store billions of files!
to serve the files fast!

SeaweedFS started as a blob store to handle small files efficiently. Instead of managing all file metadata in a central master, the central master only manages volumes on volume servers, and these volume servers manage files and their metadata. This relieves concurrency pressure from the central master and spreads file metadata into volume servers, allowing faster file access (O(1), usually just one disk read operation).

Each file's metadata adds only 40 bytes of disk storage overhead. The design is so simple, with O(1) disk reads, that you are welcome to challenge the performance with your actual use cases.

SeaweedFS started by implementing Facebook's Haystack design paper. SeaweedFS also implements erasure coding with ideas from f4: Facebook’s Warm BLOB Storage System, and has a lot of similarities with Facebook’s Tectonic Filesystem and Google's Colossus File System.

On top of the blob store, the optional Filer supports directories and POSIX attributes. Filer is a separate, linearly scalable, stateless server with customizable metadata stores, e.g., MySql, Postgres, Redis, Cassandra, HBase, Mongodb, Elastic Search, LevelDB, RocksDB, Sqlite, MemSql, TiDB, Etcd, CockroachDB, YDB, etc.

SeaweedFS can transparently integrate with the cloud. With hot data on the local cluster and warm data on the cloud with O(1) access time, SeaweedFS can achieve both fast local access time and elastic cloud storage capacity. What's more, the cloud storage access API cost is minimized. Faster and cheaper than direct cloud storage!

System	File Metadata	File Content Read	POSIX	REST API	Optimized for large number of small files
SeaweedFS	lookup volume id, cacheable	O(1) disk seek		Yes	Yes
SeaweedFS Filer	Linearly Scalable, Customizable	O(1) disk seek	FUSE	Yes	Yes
GlusterFS	hashing		FUSE, NFS
Ceph	hashing + rules		FUSE	Yes
MooseFS	in memory		FUSE		No
MinIO	separate meta file for each file			Yes	No

SeaweedFS	comparable to Ceph	advantage
Master	MDS	simpler
Volume	OSD	optimized for small files
Filer	Ceph FS	linearly scalable, Customizable, O(1) or O(logN)

Table of Contents

Quick Start

Quick Start with weed mini

Quick Start for S3 API on Docker

Quick Start with Single Binary

Introduction

Features

Additional Blob Store Features

Filer Features

Kubernetes

Example: Using Seaweed Blob Store

Start Master Server

Start Volume Servers

Write A Blob

Save Blob Id

Read a Blob

Rack-Aware and Data Center-Aware Replication

Allocate Blob Key on Specific Data Center

Other Features

Blob Store Architecture

Master Server and Volume Server

Write and Read files

Saving memory

Tiered Storage to the cloud

SeaweedFS Filer

Compared to Other File Systems

Compared to HDFS

Compared to GlusterFS, Ceph

Compared to GlusterFS

Compared to MooseFS

Compared to Ceph

Compared to MinIO

Dev Plan

Installation Guide

Disk Related Topics

Hard Drive Performance

Solid State Disk

Benchmark

Run WARP and launch a mixed benchmark.

Enterprise

License

Stargazers over time
