mirror of https://github.com/seaweedfs/seaweedfs.git synced 2026-05-22 09:41:28 +00:00

Chris Lu 122ca7c020 feat(s3/lifecycle): daily-replay worker behind algorithm flag (Phase 2) (#9446)

* docs(s3lifecycle): design for daily-replay worker

Captures the algorithm and dev plan iterated on in PR #9431 and the
discussion leading up to it: per-shard daily meta-log replay, walker
as a per-day pass for ExpirationDate/ExpiredDeleteMarker/NewerNoncurrent
plus a recovery branch over engine.RecoveryView(snap), explicit
retention-window input to RulesForShard, two cursor hashes
(ReplayContentHash + PromotedHash) that together detect every
invalidation case. Implementation phases are sequenced so each can
ship independently — Phase 1 (noncurrent_since stamp) just landed.

* feat(s3/lifecycle): daily-replay worker behind algorithm flag (Phase 2)

New weed/s3api/s3lifecycle/dailyrun package implementing the bounded
daily meta-log scan from the design doc. One pass per Execute per
shard: load cursor, scan events forward, route each through router.Route,
dispatch any due Match, advance the cursor on success. Halt-on-failure
keeps the cursor at the last fully-processed event so tomorrow resumes
from the same point — head-of-line blocking is the deliberate failure
signal.

Replay-only in this phase. Phase 4 wires the walker for ExpirationDate,
ExpiredDeleteMarker, NewerNoncurrent, and scan_only-promoted rules.
Until then a typed UnsupportedRuleError refuses runs on those buckets:
operators see the rejection in the activity log rather than silently
losing rules.

Behavior:
- Per-shard cursor {TsNs, RuleSetHash, PromotedHash} JSON-persisted
under /etc/s3/lifecycle/daily-cursors/. PromotedHash always-empty in
Phase 2; Phase 4 turns it on.
- Rule-change branch rewinds cursor to now - max_ttl when the
replay-content hash mismatches. Cold start uses the same floor.
- Transport errors retry 3x with exponential backoff capped at 5s;
server outcomes (RETRY_LATER / BLOCKED) halt the run without retry.
- Empty-replay sentinel: cursor TsNs=0 when no replay-eligible rules
exist, only the hash gates a future addition.
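
The retry rule in the Behavior list above can be sketched as follows. This is a minimal illustration, not the dailyrun code: the names Outcome, backoff, dispatchWithRetry, errTransport and noSleep are hypothetical, the 500ms base delay is an assumption (the commit only states the 5s cap), and "3x" is read here as three retries after the first attempt.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Outcome is a hypothetical stand-in for the server's reply classification.
type Outcome int

const (
	OutcomeOK Outcome = iota
	OutcomeRetryLater
	OutcomeBlocked
)

var (
	errHalt      = errors.New("server outcome halts the run")
	errTransport = errors.New("transport error (demo)")
	noSleep      = func(time.Duration) {} // demo helper: skip real waiting
)

// backoff returns base * 2^attempt, capped at limit.
func backoff(attempt int, base, limit time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt && d < limit; i++ {
		d *= 2
	}
	if d > limit {
		return limit
	}
	return d
}

// dispatchWithRetry retries transport errors up to `retries` times.
// A non-OK server outcome is not retried; it halts immediately.
func dispatchWithRetry(send func() (Outcome, error), retries int, sleep func(time.Duration)) error {
	var lastErr error
	for attempt := 0; attempt <= retries; attempt++ {
		out, err := send()
		if err == nil {
			if out != OutcomeOK {
				return errHalt
			}
			return nil
		}
		lastErr = err
		if attempt < retries {
			// Assumed base delay of 500ms; only the 5s cap comes from the commit.
			sleep(backoff(attempt, 500*time.Millisecond, 5*time.Second))
		}
	}
	return fmt.Errorf("giving up after %d retries: %w", retries, lastErr)
}

func main() {
	for attempt := 0; attempt < 5; attempt++ {
		fmt.Println(attempt, backoff(attempt, 500*time.Millisecond, 5*time.Second))
	}
	calls := 0
	err := dispatchWithRetry(func() (Outcome, error) {
		calls++
		return OutcomeOK, errTransport
	}, 3, noSleep)
	fmt.Println(calls, err)
}
```

Passing the sleep function in as a parameter keeps the retry budget testable without real delays.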

Worker shape:
- New admin config field "algorithm" with enum streaming|daily_replay,
default streaming. Existing deployments are unaffected.
- handler.Execute branches on the flag: streaming routes through the
current scheduler.Scheduler, daily_replay routes through
dailyrun.Run.
- dispatcher.NewFilerSiblingLister exported so both paths share the
same .versions/ + null-bare lookup.

Engine integration:
- Local replayContentHash + maxEffectiveTTL helpers in dailyrun. Phase
4's engine surface (ReplayContentHash, MaxEffectiveTTL) will replace
them with one-line redirects; the local versions hash the same
fields so the cursor stays valid across the swap.
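
One way to get a rule-set hash that is stable under reordering but sensitive to TTL edits is to hash a sorted canonical encoding. The sketch below assumes a reduced, hypothetical Rule type and field set; the fields the real replayContentHash covers and how it encodes them are not stated here. SHA-256 is chosen only because the commit describes a 32-byte hash.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
	"strings"
)

// Rule is a hypothetical, reduced view of a compiled lifecycle rule.
type Rule struct {
	Bucket string
	Kind   string
	Days   int
}

// contentHash encodes each rule as one line, sorts the lines, and hashes
// the result, so slice order does not influence the digest.
// An empty rule set maps to the all-zero hash, mirroring the
// empty-replay sentinel check `rsh == [32]byte{}` (assumed behavior).
func contentHash(rules []Rule) [32]byte {
	if len(rules) == 0 {
		return [32]byte{}
	}
	lines := make([]string, 0, len(rules))
	for _, r := range rules {
		lines = append(lines, fmt.Sprintf("%q|%q|%d", r.Bucket, r.Kind, r.Days))
	}
	sort.Strings(lines)
	return sha256.Sum256([]byte(strings.Join(lines, "\n")))
}

// maxTTLDays returns the largest Days value, used as the rewind floor
// (now - max_ttl) when the hash changes or on cold start.
func maxTTLDays(rules []Rule) int {
	maxDays := 0
	for _, r := range rules {
		if r.Days > maxDays {
			maxDays = r.Days
		}
	}
	return maxDays
}

func main() {
	a := []Rule{{"logs", "ExpirationDays", 30}, {"tmp", "AbortMPU", 7}}
	b := []Rule{a[1], a[0]}
	fmt.Println(contentHash(a) == contentHash(b)) // true: order does not matter
	fmt.Println(maxTTLDays(a))                    // 30
}
```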

Tests cover cursor persistence, unsupported-rule rejection,
hash stability under rule reordering, hash sensitivity to TTL edits,
max-TTL aggregation, dispatch retry budget, and request shape
including the identity-CAS witness.

Includes the design doc at weed/s3api/s3lifecycle/DESIGN.md so reviewers
and future phases share the same spec.

* feat(s3/lifecycle): default to daily_replay; streaming becomes the fallback knob

The streaming dispatcher hasn't shipped to users yet, so there's no
backward-compat surface to preserve. Flip the algorithm default from
streaming to daily_replay so the new path is the standard from day
one. Streaming stays as an explicit opt-in escape hatch during the
Phase 4 walker rollout; Phase 5 deletes both the flag and the
streaming code.

Buckets whose lifecycle rules require walker-bound dispatch
(ExpirationDate, ExpiredDeleteMarker, NewerNoncurrent, scan_only)
will fail the daily_replay run with the existing
UnsupportedRuleError until Phase 4 walker integration ships. Operators
hitting that case can set algorithm=streaming until the follow-up
lands.

Updates the test for the default value and renames the
unknown-value-fallback case to reflect the new default.
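
The flag handling described above could look roughly like this. The type and function names are hypothetical; only the two enum values, the daily_replay default, and the fallback for unknown values come from the commit message.

```go
package main

import "fmt"

// Algorithm is a hypothetical type for the admin config field "algorithm".
type Algorithm string

const (
	AlgorithmStreaming   Algorithm = "streaming"
	AlgorithmDailyReplay Algorithm = "daily_replay"
)

// parseAlgorithm returns streaming only on an explicit opt-in.
// Empty and unknown values fall back to the daily_replay default.
func parseAlgorithm(raw string) Algorithm {
	if Algorithm(raw) == AlgorithmStreaming {
		return AlgorithmStreaming
	}
	return AlgorithmDailyReplay
}

func main() {
	for _, raw := range []string{"", "streaming", "daily_replay", "bogus"} {
		fmt.Printf("%q -> %s\n", raw, parseAlgorithm(raw))
	}
}
```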

* fix(s3/lifecycle/dailyrun): drop per-rule done flag — it suppressed due matches

The done map was keyed by ActionKey = {Bucket, RuleHash, ActionKind}.
That's only safe when each event produces at most one match per
ActionKey with a single deterministic due-time formula —
ExpirationDays and AbortMPU fit that shape because due_time
= ev.TsNs + r.days is monotonic in event TsNs.

But NoncurrentDays paired with NewerNoncurrentVersions > 0 (allowed
in Phase 2 since it compiles to ActionKindNoncurrentDays) routes
through routePointerTransitionExpand, which emits matches for every
noncurrent sibling — each with its own SuccessorModTime taken from
the demoting event for that specific sibling. A single event can
therefore produce two matches for the same ActionKey on different
objects with wildly different DueTimes.

With the old code, a not-yet-due sibling encountered first would set
done[ActionKey] = true and then the next sibling — even though its
DueTime had already passed — would be skipped. Future events for the
same rule would also be suppressed for the rest of the run. Objects
that should have been deleted weren't.

Fix: drop the early-stop optimization. Process every match
independently. A future-DueTime match is now silently skipped without
affecting any later match. The performance hit is small (Phase 2 is a
single bounded daily pass, and the rate limiter is the real
throughput governor); the correctness gain is non-negotiable.
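
The fixed behavior can be illustrated with a reduced sketch in which every match is judged on its own DueTime and no per-rule flag carries over between matches. Match, at and dueObjects are hypothetical names for illustration; the real processMatches also dispatches and handles server outcomes.

```go
package main

import (
	"fmt"
	"time"
)

// Match is a hypothetical, reduced match: one object and its due time.
type Match struct {
	Object  string
	DueTime time.Time
}

// at is a demo helper that builds a time from Unix seconds.
func at(sec int64) time.Time { return time.Unix(sec, 0) }

// dueObjects returns the objects whose DueTime has passed.
// A not-yet-due match is skipped without affecting later matches.
func dueObjects(matches []Match, now time.Time) []string {
	var due []string
	for _, m := range matches {
		if m.DueTime.After(now) {
			continue // not yet due; do NOT mark the rule as done
		}
		due = append(due, m.Object)
	}
	return due
}

func main() {
	now := at(100)
	// The not-yet-due sibling comes first, as in the bug report.
	matches := []Match{{"v3", at(500)}, {"v2", at(90)}, {"v1", at(10)}}
	fmt.Println(dueObjects(matches, now)) // prints [v2 v1]
}
```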

Also fixes the inverted comment in processMatches that described the
old check as "due_time is past now" when it actually checked
DueTime.After(now) (i.e., NOT yet due).

Adds four targeted tests:
- not-yet-due match first in slice does not suppress two later
due matches for the same rule;
- reversed slice ordering produces identical dispatch;
- BLOCKED outcome halts the loop before later due matches are sent;
- empty match slice is a no-op.

Phase 4's walker-and-recovery integration can revisit a
per-(rule, object) memoization if profiling argues for it.

* fix(s3/lifecycle/dailyrun): address PR review — cursor advance, mode gate, ctx cancel, snapshot consistency

Addresses PR #9446 review feedback. Eight distinct fixes:

1. CURSOR ADVANCEMENT (gemini, critical). The old code advanced the
persisted cursor to lastOK = TsNs of the last event processed,
including events whose matches were skipped as not-yet-due. Those
skipped matches would never be re-scanned, so objects under
long-TTL rules would never expire.

Track a "stuck" flag in drainShardEvents: the first event with a
skipped (future-DueTime) match stops cursorAdvanceTo from rising,
but the loop keeps processing later events to dispatch any that ARE
due. The persisted cursor sits at the last fully-processed event so
tomorrow's run re-scans from the skipped event onward and the
future-due matches get re-evaluated when they age in.

processMatches now returns (skippedAny, halted, err) so the drain
loop can tell apart "event fully drained" from "event had pending
future-due matches."

2. MODE GATE (gemini). checkSnapshotForUnsupported only checked the
ActionKind. A replay-eligible kind with Mode != ModeEventDriven
(e.g. ModeScanOnly via retention promotion) passed the check but
then got silently ignored by router.Route, which gates dispatch
on Mode == ModeEventDriven. Reject loudly with the typed error
so admin sees the rejection in the activity log.

3. WORKERS CONFIG (gemini). The handler hardcoded 16 concurrent shard
goroutines regardless of cfg.Workers. Add a Workers field to
dailyrun.Config and gate the goroutine fan-out on a semaphore of
that size; the handler now passes cfg.Workers through.

4. SINGLE SNAPSHOT PER RUN (coderabbit). Run() validated against one
snapshot but runShard() pulled a fresh cfg.Engine.Snapshot() per
shard. Mid-run Compile would let shards process different rule
sets. Capture snap at the top of Run, pass it down to every shard.

5. FROZEN runNow (coderabbit). drainShardEvents and processMatches
accepted a `now func() time.Time` and called it multiple times.
DueTime comparisons would slip as the run wore on. Capture runNow
once at the top of Run and thread it through as a time.Time value.

6. CTX CANCELLATION (coderabbit). The drain loop's <-ctx.Done() case
broke out of the loop and returned nil, marking interrupted runs as
successful. Return ctx.Err() instead so the caller propagates the
interrupt; cursorAdvanceTo carries whatever progress was made.

7. CURSOR LOAD VALIDATION (coderabbit + gemini). The persister silently
accepted empty files, mismatched shard_ids, and hash slices shorter
than 32 bytes (copy() would zero-pad). Each now returns a typed
error so the run halts and an operator investigates rather than
silently re-scanning from time zero or persisting a zero-padded
hash that masks corruption forever.

8. DEAD BRANCH (coderabbit). The "lastOK < startTsNs → keep persisted"
guard in runShard was unreachable because drainShardEvents
initialized lastOK := startTsNs and only ever raised it. Removed
along with the new cursor-advancement semantics that handle the
"no events processed" case implicitly.

Plus markdown lint: DESIGN.md fenced code blocks now carry a `text`
language identifier to satisfy MD040.
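
The cursor-advancement rule from fix 1 above can be sketched as a small drain loop. This is an illustration under simplifying assumptions: Event, Match, drain and errBlocked are hypothetical, timestamps are plain int64 nanoseconds, and context cancellation, retries and the safety of re-dispatching already-handled events on the next run are not modeled.

```go
package main

import (
	"errors"
	"fmt"
)

// Match and Event are hypothetical, reduced shapes for illustration.
type Match struct {
	Object string
	DueNs  int64
}

type Event struct {
	TsNs    int64
	Matches []Match
}

var errBlocked = errors.New("dispatch halted (demo)")

// drain dispatches every due match and returns how far the cursor may move.
// Once an event has a skipped (future-due) match, the cursor stops rising,
// but later events are still processed so their due matches are dispatched.
func drain(events []Event, startTsNs, nowNs int64, dispatch func(Match) error) (int64, error) {
	cursorAdvanceTo := startTsNs
	stuck := false
	for _, ev := range events {
		skippedAny := false
		for _, m := range ev.Matches {
			if m.DueNs > nowNs {
				skippedAny = true
				continue
			}
			if err := dispatch(m); err != nil {
				// Halt: keep the cursor at the last fully-processed event.
				return cursorAdvanceTo, err
			}
		}
		if skippedAny {
			stuck = true
		}
		if !stuck {
			cursorAdvanceTo = ev.TsNs
		}
	}
	return cursorAdvanceTo, nil
}

func main() {
	events := []Event{
		{TsNs: 10, Matches: []Match{{"a", 50}}},
		{TsNs: 20, Matches: []Match{{"b", 200}}}, // not yet due at now=100
		{TsNs: 30, Matches: []Match{{"c", 60}}},
	}
	cur, err := drain(events, 0, 100, func(m Match) error {
		fmt.Println("dispatch", m.Object)
		return nil
	})
	fmt.Println(cur, err) // cursor stays at 10; a and c were dispatched
}
```

In this sketch the next run starts from the event at TsNs 20 again, which is what lets the future-due match be re-evaluated once it ages in.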

Skipped from the review:
- gemini's "maxTTL == 0 incorrectly skips immediate expirations":
actions with Days <= 0 don't compile to a CompiledAction (see
weed/s3api/s3lifecycle/action_kind.go: `if rule.X > 0`). The new
empty-replay sentinel uses `rsh == [32]byte{}` for clarity per
gemini's suggested form, but the behavior is equivalent.

Tests added/updated:
- TestProcessMatches_AllDueNoSkippedFlag pins skippedAny=false when
all matches are past their DueTime.
- TestCheckSnapshotForUnsupported_NonEventDrivenModeRejected pins
the new Mode check.
- TestFilerCursorPersister_EmptyFileReturnsError,
_ShardIDMismatchReturnsError, _HashLengthMismatchReturnsError pin
the new validation rules.
- Existing process-matches tests reshaped for the
(skippedAny, halted, err) return tuple.

Full build clean. Dailyrun + worker test packages green.

2026-05-11 18:07:17 -07:00

Name	Last commit	Date
.github	fix(s3tests): wire lifecycle worker for expiration suite (#9374)	2026-05-08 17:29:47 -07:00
.superset	chore: remove ~50k lines of unreachable dead code (#8913)	2026-04-03 16:04:27 -07:00
cmd	Move SQL engine and PostgreSQL server to their own binaries (#8417)	2026-02-23 16:27:08 -08:00
docker	ci(e2e): switch FUSE Mount build to Azure Ubuntu mirror, persist buildx cache	2026-05-05 00:22:59 -07:00
k8s/charts	4.23	2026-05-03 23:15:34 -07:00
note	docs(note): add production-setup slide deck	2026-04-23 02:36:58 -07:00
other	peer chunk sharing 1/8: proto definitions (#9130)	2026-04-18 20:02:55 -07:00
postgres-examples	Message Queue: Add sql querying (#7185)	2025-09-09 01:01:03 -07:00
seaweed-volume	fix(volume): don't panic on read when needle map is nil (#9342)	2026-05-06 18:23:06 -07:00
seaweedfs-rdma-sidecar	build(deps): bump rand from 0.9.2 to 0.9.4 in /seaweedfs-rdma-sidecar/rdma-engine (#9065)	2026-04-13 22:51:00 -07:00
snap	move to https://github.com/seaweedfs/seaweedfs	2022-07-29 00:17:28 -07:00
sw-block/design	doc: P14 S8 final bounded close — evidence matrix + P15 handoff (#9142)	2026-04-20 02:24:44 -07:00
telemetry	fix(telemetry): use correct TopologyId field in integration test (#8714)	2026-03-20 22:15:05 -07:00
test	feat(s3): stamp noncurrent_since on versioned demotions (#9431)	2026-05-11 13:41:33 -07:00
unmaintained	go fix	2026-02-20 18:42:00 -08:00
util	util: added gostd script	2019-04-30 03:23:20 +00:00
weed	feat(s3/lifecycle): daily-replay worker behind algorithm flag (Phase 2) (#9446)	2026-05-11 18:07:17 -07:00
.gitignore	test(s3tables): add Dremio Iceberg catalog integration tests (#9299)	2026-05-02 11:31:27 -07:00
backers.md	chore: add nimbus web services to backers.md (#4769)	2023-08-20 15:31:23 -07:00
CODE_OF_CONDUCT.md	add code of conduct (#4109)	2023-01-05 11:01:22 -08:00
go.mod	build(deps): bump github.com/apache/thrift from 0.22.0 to 0.23.0 (#9364)	2026-05-08 05:59:26 -07:00
go.sum	build(deps): bump github.com/apache/thrift from 0.22.0 to 0.23.0 (#9364)	2026-05-08 05:59:26 -07:00
install.sh	Rust volume server implementation with CI (#8539)	2026-03-26 17:24:35 -07:00
LICENSE	Update LICENSE, fix copyright license year (#6405)	2025-01-01 01:55:42 -08:00
Makefile	(fix): Add templ install step in admin-generate (#8997)	2026-04-08 19:23:18 -07:00
README.md	docs(readme): align Docker quick start with weed mini defaults	2026-05-04 00:09:13 -07:00
S3_LIFECYCLE_REDESIGN.md	docs(s3/lifecycle): reflect shipped reader, obsolete Phase 6 (#9419)	2026-05-10 10:40:33 -07:00
SECURITY.md	Add security policy for vulnerability reporting	2026-04-17 09:51:21 -07:00
VOLUME_SERVER_RUST_PLAN.md	Rust volume server implementation with CI (#8539)	2026-03-26 17:24:35 -07:00

README.md

SeaweedFS

Sponsor SeaweedFS via Patreon

SeaweedFS is an independent Apache-licensed open source project with its ongoing development made possible entirely thanks to the support of these awesome backers. If you'd like to grow SeaweedFS even stronger, please consider joining our sponsors on Patreon.

Your support will be really appreciated by me and other supporters!

Gold Sponsors

Quick Start
- Quick Start with weed mini
- Quick Start for S3 API on Docker
Introduction
Features
- Additional Features
- Filer Features
Example: Using Seaweed Object Store
Architecture
Compared to Other File Systems
Dev Plan
Installation Guide
Disk Related Topics
Benchmark
Enterprise
License

Quick Start

Quick Start with weed mini

Download the latest binary from https://github.com/seaweedfs/seaweedfs/releases and unzip the single weed (or weed.exe) file, or run go install github.com/seaweedfs/seaweedfs/weed@latest. Then start a ready-to-use S3 object store with credentials and a pre-created bucket in one command:

AWS_ACCESS_KEY_ID=admin \
AWS_SECRET_ACCESS_KEY=secret \
S3_BUCKET=my-bucket \
./weed mini -dir=/data

That's it — the S3 endpoint is at http://localhost:8333, my-bucket already exists, and admin/secret are valid credentials. S3_BUCKET accepts a comma-separated list (e.g. raw,processed); use S3_TABLE_BUCKET for S3 Tables (Iceberg) buckets. Drop any of the env vars to skip that piece (no AWS keys → S3 runs in unauthenticated "Allow All" mode for development).

The same command starts everything else too:

S3 Endpoint: http://localhost:8333
Master UI: http://localhost:9333
Volume Server: http://localhost:9340
Filer UI: http://localhost:8888
WebDAV: http://localhost:7333
Admin UI: http://localhost:23646

macOS: if the binary is quarantined, run xattr -d com.apple.quarantine ./weed first.

Perfect for development, testing, learning SeaweedFS, and single-node deployments. To scale out, add more volume servers by running weed volume -dir="/some/data/dir2" -master="<master_host>:9333" -port=8081 locally, on another machine, or on thousands of machines.

Quick Start for S3 API on Docker

docker run -p 8333:8333 \
  -e AWS_ACCESS_KEY_ID=admin \
  -e AWS_SECRET_ACCESS_KEY=secret \
  -e S3_BUCKET=my-bucket \
  chrislusf/seaweedfs

Same behavior as the weed mini command above — the S3 endpoint is at http://localhost:8333 with my-bucket pre-created. Drop the env vars to run anonymously for development.

Introduction

SeaweedFS is a simple and highly scalable distributed file system. There are two objectives:

1. to store billions of files!
2. to serve the files fast!

SeaweedFS started as a blob store to handle small files efficiently. Instead of managing all file metadata in a central master, the central master only manages volumes on volume servers, and these volume servers manage files and their metadata. This relieves concurrency pressure from the central master and spreads file metadata into volume servers, allowing faster file access (O(1), usually just one disk read operation).

There are only 40 bytes of disk storage overhead for each file's metadata. With O(1) disk reads, the design is simple enough that you are welcome to challenge the performance with your actual use cases.
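
The split of responsibilities described above can be pictured with a toy model: the master maps a volume id to a server, and each volume server keeps an in-memory index from file key to a position in the volume file, so a read needs two map lookups and one disk read. All type names, field names and the address below are hypothetical and are not SeaweedFS's actual data structures.

```go
package main

import "fmt"

// location is where a blob sits inside a volume file (hypothetical layout).
type location struct {
	Offset int64
	Size   int32
}

// master only tracks which server hosts each volume, not individual files.
type master struct {
	volumeToServer map[uint32]string
}

// volumeServer keeps a per-volume, in-memory index: file key -> location.
type volumeServer struct {
	index map[uint32]map[uint64]location
}

// lookup is a single map access on the master.
func (m *master) lookup(volumeID uint32) (string, bool) {
	addr, ok := m.volumeToServer[volumeID]
	return addr, ok
}

// locate is a single map access on the volume server; the caller would
// then do one disk read at loc.Offset for loc.Size bytes.
func (v *volumeServer) locate(volumeID uint32, key uint64) (location, bool) {
	loc, ok := v.index[volumeID][key]
	return loc, ok
}

func main() {
	m := &master{volumeToServer: map[uint32]string{3: "volume-a:8080"}}
	vs := &volumeServer{index: map[uint32]map[uint64]location{
		3: {1: {Offset: 4096, Size: 512}},
	}}
	addr, _ := m.lookup(3)
	loc, ok := vs.locate(3, 1)
	fmt.Println(addr, loc, ok)
}
```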

SeaweedFS started by implementing Facebook's Haystack design paper. Also, SeaweedFS implements erasure coding with ideas from f4: Facebook’s Warm BLOB Storage System, and has a lot of similarities with Facebook’s Tectonic Filesystem and Google's Colossus File System.

On top of the blob store, optional Filer can support directories and POSIX attributes. Filer is a separate linearly-scalable stateless server with customizable metadata stores, e.g., MySql, Postgres, Redis, Cassandra, HBase, Mongodb, Elastic Search, LevelDB, RocksDB, Sqlite, MemSql, TiDB, Etcd, CockroachDB, YDB, etc.

SeaweedFS can transparently integrate with the cloud. With hot data on the local cluster and warm data in the cloud with O(1) access time, SeaweedFS can achieve both fast local access time and elastic cloud storage capacity. What's more, the cloud storage access API cost is minimized. Faster and cheaper than direct cloud storage!

SeaweedFS also ships a built-in Iceberg REST Catalog, turning the same cluster into a self-contained lakehouse. Spark, Trino, Dremio, DuckDB, and RisingWave can query Iceberg tables directly — no Hive Metastore, Glue, or external catalog service required. Storage and table metadata live in one system, simplifying on-prem and small-team analytics stacks.

System	File Metadata	File Content Read	POSIX	REST API	Optimized for large number of small files
SeaweedFS	lookup volume id, cacheable	O(1) disk seek		Yes	Yes
SeaweedFS Filer	Linearly Scalable, Customizable	O(1) disk seek	FUSE	Yes	Yes
GlusterFS	hashing		FUSE, NFS
Ceph	hashing + rules		FUSE	Yes
MooseFS	in memory		FUSE		No
MinIO	separate meta file for each file			Yes	No

SeaweedFS	comparable to Ceph	advantage
Master	MDS	simpler
Volume	OSD	optimized for small files
Filer	Ceph FS	linearly scalable, Customizable, O(1) or O(logN)
