Files
Chris Lu cfc08fbf6c fix(volume): tombstone integrity check no longer flips volumes read-only (fixes #9563) (#9565)
* fix(volume): pass on-disk tombstone size to ReadData in verifyDeletedNeedleIntegrity

verifyDeletedNeedleIntegrity was forwarding TombstoneFileSize (-1) into
Needle.ReadData. A deletion tombstone is appended to .dat with DataSize=0
so the on-disk needle header carries Size=0; TombstoneFileSize is only
the .idx sentinel for "this entry is deleted" and is never written into
a needle header.

ReadBytes' size check therefore mismatched on every tombstone
(-1 != 0), returned ErrorSizeMismatch, and triggered the
4-byte-offset wrap-around retry in ReadData (offset + 32 GB). On any
volume large enough that offset+32 GB exceeds dat fileSize the retry
read EOF, CheckVolumeDataIntegrity reported corruption, and the loader
set noWriteOrDelete = true. Every volume whose last 10 .idx entries
included a deletion went read-only on startup — i.e. any healthy
volume where the most recent operations included a delete.

Pass Size(0) so the size check matches the on-disk tombstone header.

Add a regression test that writes three needles, deletes one, and
asserts CheckVolumeDataIntegrity succeeds with a tombstone at the .idx
tail. Without this fix the test reproduces the exact log shape from
the bug report:

  read 0 dataSize 32 offset <orig+32GB> fileSize <much smaller>: EOF
  verifyDeletedNeedleIntegrity ...idx failed: read data [N,N+32) : EOF

The Rust port guards its integrity-check size comparison with
!size.is_deleted() (seaweed-volume/src/storage/volume.rs) and never
hits this path, so no Rust mirror change is needed.

* test(seaweed-volume): mirror Go regression for deletion-tombstone integrity

The Rust integrity check already guards its size-mismatch comparison
with !size.is_deleted() (volume.rs:1859) and reads tombstone AppendAtNs
with body_size=0, so the Go regression fixed in the previous commit
does not apply. Lock that guarantee in with a parallel reload test:
write three needles, delete one, sync, reopen via Volume::new, assert
the volume is not flipped read-only.

Catches any future change that removes the deleted-entry guard or
re-introduces a size-strict path in check_volume_data_integrity for
tombstones.

* fix(volume): propagate io.EOF and ErrorSizeMismatch from verifyDeletedNeedleIntegrity

CheckVolumeDataIntegrity relies on identity comparison against io.EOF
and ErrorSizeMismatch to walk back through the last ten .idx entries
and tolerate a partial truncation at the tail (the "fix and continue"
loop). The live-needle branch in doCheckAndFixVolumeData already
returns those sentinels unwrapped; the deletion branch wrapped them
in fmt.Errorf, so a genuine .dat truncation past a tombstone offset
broke the recovery and flipped the volume read-only.

Mirror the live-needle handling: both verifyDeletedNeedleIntegrity
and doCheckAndFixVolumeData now short-circuit on io.EOF /
ErrorSizeMismatch and pass them through unwrapped. Other errors keep
their existing context wrapping.

Also tighten the regression test to capture lastAppendAtNs and assert
it's non-zero, so a future regression that skips the tombstone body
(and therefore never populates AppendAtNs) is caught even when the
err check still passes.
2026-05-19 13:11:19 -07:00
..

SeaweedFS Volume Server (Rust)

A drop-in replacement for the SeaweedFS Go volume server, rewritten in Rust. It uses binary-compatible storage formats (.dat, .idx, .vif) and speaks the same HTTP and gRPC protocols, so it works with an unmodified Go master server.

Building

Requires Rust 1.75+ (2021 edition).

cd seaweed-volume
cargo build --release

The binary is produced at target/release/seaweed-volume.

Running

Start a Go master server first, then point the Rust volume server at it:

# Minimal
seaweed-volume --port 8080 --master localhost:9333 --dir /data/vol1 --max 7

# Multiple data directories
seaweed-volume --port 8080 --master localhost:9333 \
  --dir /mnt/ssd1,/mnt/ssd2 --max 100,100 --disk ssd

# With datacenter/rack topology
seaweed-volume --port 8080 --master localhost:9333 --dir /data/vol1 --max 7 \
  --dataCenter dc1 --rack rack1

# With JWT authentication
seaweed-volume --port 8080 --master localhost:9333 --dir /data/vol1 --max 7 \
  --securityFile /etc/seaweedfs/security.toml

# With TLS (configured in security.toml via [https.volume] and [grpc.volume] sections)
seaweed-volume --port 8080 --master localhost:9333 --dir /data/vol1 --max 7 \
  --securityFile /etc/seaweedfs/security.toml

Common flags

Flag Default Description
--port 8080 HTTP listen port
--port.grpc port+10000 gRPC listen port
--master localhost:9333 Comma-separated master server addresses
--dir /tmp Comma-separated data directories
--max 8 Max volumes per directory (comma-separated)
--ip auto-detect Server IP / identifier
--ip.bind same as --ip Bind address
--dataCenter Datacenter name
--rack Rack name
--disk Disk type tag: hdd, ssd, or custom
--index memory Needle map type: memory, leveldb, leveldbMedium, leveldbLarge
--readMode proxy Non-local read mode: local, proxy, redirect
--fileSizeLimitMB 256 Max upload file size
--minFreeSpace 1 (percent) Min free disk space before marking volumes read-only
--securityFile Path to security.toml for JWT keys and TLS certs
--metricsPort 0 (disabled) Prometheus metrics endpoint port
--whiteList Comma-separated IPs with write permission
--preStopSeconds 10 Graceful drain period before shutdown
--compactionMBps 0 (unlimited) Compaction I/O rate limit
--pprof false Enable pprof HTTP handlers

Set RUST_LOG=debug (or trace, info, warn) for log level control. Set SEAWEED_WRITE_QUEUE=1 to enable batched async write processing.

Features

  • Binary compatible -- reads and writes the same .dat/.idx/.vif files as the Go server; seamless migration with no data conversion.
  • HTTP + gRPC -- full implementation of the volume server HTTP API and all gRPC RPCs including streaming operations (copy, tail, incremental copy, vacuum).
  • Master heartbeat -- bidirectional streaming heartbeat with the Go master server; volume and EC shard registration, leader failover, graceful shutdown deregistration.
  • JWT authentication -- signing key configuration via security.toml with token source precedence (query > header > cookie), file_id claims validation, and separate read/write keys.
  • TLS -- HTTPS for the HTTP API and mTLS for gRPC, configured through security.toml.
  • Erasure coding -- Reed-Solomon EC shard management: mount/unmount, read, rebuild, copy, delete, and shard-to-volume reconstruction.
  • S3 remote storage -- FetchAndWriteNeedle reads from any S3-compatible backend (AWS, MinIO, Wasabi, Backblaze, etc.) and writes locally. Supports VolumeTierMoveDatToRemote/FromRemote for tiered storage.
  • Needle map backends -- in-memory HashMap, LevelDB (via rusty-leveldb), or redb (pure Rust disk-backed) needle maps.
  • Image processing -- on-the-fly resize/crop, JPEG EXIF orientation auto-fix, WebP support.
  • Streaming reads -- large files (>1MB) are streamed via spawn_blocking to avoid blocking the async runtime.
  • Auto-compression -- compressible file types (text, JSON, CSS, JS, SVG, etc.) are gzip-compressed on upload.
  • Prometheus metrics -- counters, histograms, and gauges exported at a dedicated metrics port; optional push gateway support.
  • Graceful shutdown -- SIGINT/SIGTERM handling with configurable preStopSeconds drain period.

Testing

Rust unit tests

cd seaweed-volume
cargo test

Go integration tests

The Go test suite can target either the Go or Rust volume server via the VOLUME_SERVER_IMPL environment variable:

# Run all HTTP + gRPC integration tests against the Rust server
VOLUME_SERVER_IMPL=rust go test -v -count=1 -timeout 1200s \
  ./test/volume_server/grpc/... ./test/volume_server/http/...

# Run a single test
VOLUME_SERVER_IMPL=rust go test -v -count=1 -timeout 60s \
  -run "TestName" ./test/volume_server/http/...

# Run S3 remote storage tests
VOLUME_SERVER_IMPL=rust go test -v -count=1 -timeout 180s \
  -run "TestFetchAndWriteNeedle" ./test/volume_server/grpc/...

Load testing

A load test harness is available at test/volume_server/loadtest/. See that directory for usage instructions and scenarios.

Architecture

The server runs three listeners concurrently:

  • HTTP (Axum 0.7) -- admin and public routers for file upload/download, status, and stats endpoints.
  • gRPC (Tonic 0.12) -- all VolumeServer RPCs from the SeaweedFS protobuf definition.
  • Metrics (optional) -- Prometheus scrape endpoint on a separate port.

Key source modules:

Path Description
src/main.rs Entry point, server startup, signal handling
src/config.rs CLI parsing and configuration resolution
src/server/volume_server.rs HTTP router setup and middleware
src/server/handlers.rs HTTP request handlers (read, write, delete, status)
src/server/grpc_server.rs gRPC service implementation
src/server/heartbeat.rs Master heartbeat loop
src/storage/volume.rs Volume read/write/delete logic
src/storage/needle.rs Needle (file entry) serialization
src/storage/store.rs Multi-volume store management
src/security.rs JWT validation and IP whitelist guard
src/remote_storage/ S3 remote storage backend

See DEV_PLAN.md for the full development history and feature checklist.