85abf3ca88 feat(shell): s3.lifecycle.run-shard + integration test (#9361)
Chris Lu, 2026-05-08 09:59:10 -07:00
* feat(shell): s3.lifecycle.run-shard for manual Phase 3 dispatch

Subscribes to the filer meta-log filtered to one (bucket, key-prefix-hash)
shard, routes events through the compiled lifecycle engine, and dispatches
due actions to the S3 server's LifecycleDelete RPC. Persists the per-shard
cursor to /etc/s3/lifecycle/cursors/shard-NN.json so subsequent runs resume.
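
For orientation, a minimal sketch of a cursor file of that shape; only the
path convention comes from this commit, the JSON fields are assumptions:

    import "fmt"

    // Hypothetical cursor shape: only the path convention is from the
    // commit; the field names are assumptions.
    type shardCursor struct {
        ShardID int   `json:"shard_id"`
        TsNs    int64 `json:"ts_ns"` // filer meta-log resume point
    }

    // shard-NN is assumed to be zero-padded to two digits (shards 00-15).
    func cursorPath(shard int) string {
        return fmt.Sprintf("/etc/s3/lifecycle/cursors/shard-%02d.json", shard)
    }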

Operator-runnable harness for end-to-end Phase 3 validation while the
plugin-worker auto-scheduler is still pending. EventBudget bounds a single
invocation; flags expose dispatch + checkpoint cadence.

Discovers buckets by walking the configured DirBuckets path and reading
each bucket entry's Extended[s3-bucket-lifecycle-configuration-xml]
through lifecycle_xml.ParseCanonical. All compiled actions are seeded
BootstrapComplete=true so the run dispatches whatever fires immediately;
production bootstrap walks set this incrementally per bucket.
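
A sketch of the discovery output under those conventions; bucketEntry and
compileInput are stand-ins for the real filer and lifecycle types:

    // Stand-in for a bucket entry read under the configured DirBuckets
    // path; the real code reads filer_pb.Entry.
    type bucketEntry struct {
        Name     string
        Extended map[string][]byte
    }

    const lifecycleXMLKey = "s3-bucket-lifecycle-configuration-xml"

    // compileInput pairs a bucket with its raw lifecycle XML; the real
    // code feeds this through lifecycle_xml.ParseCanonical and seeds
    // BootstrapComplete=true so already-due work fires immediately.
    type compileInput struct {
        Bucket            string
        LifecycleXML      []byte
        BootstrapComplete bool
    }

    func compileInputs(buckets []bucketEntry) []compileInput {
        var out []compileInput
        for _, b := range buckets {
            raw, ok := b.Extended[lifecycleXMLKey]
            if !ok || len(raw) == 0 {
                continue // bucket has no lifecycle configuration
            }
            out = append(out, compileInput{
                Bucket: b.Name, LifecycleXML: raw, BootstrapComplete: true,
            })
        }
        return out
    }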

* test(s3/lifecycle): integration test driving the run-shard shell command

Spins up 'weed mini', creates a bucket with a 1-day expiration on a prefix,
PUTs two objects under that prefix, then rewrites the target entry's Mtime
via filer UpdateEntry to 30 days ago. Runs 's3.lifecycle.run-shard' for
every shard via a 'weed shell' subprocess and asserts that the backdated
object is deleted within 30s while the in-prefix-but-recent object remains.

The S3 API rejects Expiration.Days < 1, so 'wait a day' is unworkable.
Backdating via the filer's gRPC sidesteps that constraint while still
exercising the real Reader -> Router -> Schedule -> Dispatcher ->
LifecycleDelete RPC path end-to-end.
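
In sketch form, the backdate step against the filer_pb API (directory and
object name arguments are illustrative):

    import (
        "context"
        "time"

        "github.com/seaweedfs/seaweedfs/weed/pb/filer_pb"
    )

    func backdateMtime(ctx context.Context, c filer_pb.SeaweedFilerClient,
        dir, name string, age time.Duration) error {
        resp, err := c.LookupDirectoryEntry(ctx, &filer_pb.LookupDirectoryEntryRequest{
            Directory: dir, Name: name,
        })
        if err != nil {
            return err
        }
        entry := resp.Entry
        entry.Attributes.Mtime = time.Now().Add(-age).Unix() // e.g. 30 * 24h
        _, err = c.UpdateEntry(ctx, &filer_pb.UpdateEntryRequest{
            Directory: dir, Entry: entry,
        })
        return err
    }

The lookup-then-modify round trip matters: UpdateEntry replaces the whole
entry, so fetching first keeps the chunks and other attributes intact.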

Wires a new s3-lifecycle-tests job into s3-go-tests.yml. The test runs
all 16 shards because ShardID(bucket, key) is hash-based and the test
shouldn't couple to which shard the objects happen to land in.

* fix(shell/s3.lifecycle.run-shard): address review findings

- Reject negative -events explicitly. Help text already defines 0 as
  unbounded; negative budgets created ambiguous behavior in pipeline.Run.
- Bound the gRPC dial with a 30s timeout instead of context.Background()
  so an unreachable S3 endpoint doesn't hang the shell.
- Paginate the bucket listing in loadLifecycleCompileInputs. SeaweedList
  takes a single-RPC limit; the prior 4096 silently dropped buckets
  past that page on large clusters. Loop with startFrom until a page
  comes back short (see the sketch after this list).
- Surface parse errors instead of swallowing them. Buckets with
  malformed lifecycle XML now print the first three errors verbatim
  and a count for the rest, so an operator running this command for
  diagnostics can find what's wrong.
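
A sketch of that pagination loop, with listPage standing in for the
single-RPC SeaweedList call (assumed to return up to limit names starting
after startFrom):

    const pageSize = 1024

    func listAllBuckets(listPage func(startFrom string, limit int) ([]string, error)) ([]string, error) {
        var all []string
        startFrom := ""
        for {
            page, err := listPage(startFrom, pageSize)
            if err != nil {
                return nil, err
            }
            all = append(all, page...)
            if len(page) < pageSize {
                return all, nil // short page: we've seen the last bucket
            }
            startFrom = page[len(page)-1] // resume after the last name seen
        }
    }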

* feat(shell/s3.lifecycle.run-shard): -shards range/set with one subscription

Adds -shards "lo-hi" or "a,b,c" to the manual run command and threads
the same model through Reader and Pipeline.

- reader.Reader gains ShardPredicate (func(int) bool) and StartTsNs;
  ShardID stays for the single-shard short form. Event carries the
  computed ShardID so consumers can route per-shard without rehashing.
- dispatcher.Pipeline gains Shards []int. When set, Run holds one
  Cursor + Schedule + Dispatcher per shard, opens one filer
  SubscribeMetadata stream with a predicate covering the whole set,
  and routes events into the matching shard's schedule from a single
  dispatch goroutine — no per-shard goroutine fan-out.
- shell command parses -shard or -shards (mutually exclusive),
  formats progress messages with a contiguous-range label when
  applicable, and validates against ShardCount (sketched below).

Integration test now uses -shards 0-15 (one subprocess invocation)
instead of a 16-iteration loop.
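
In sketch form (the committed signatures may differ):

    import (
        "fmt"
        "strconv"
        "strings"
    )

    func parseShardsSpec(spec string, shardCount int) ([]int, error) {
        var shards []int
        if lo, hi, ok := strings.Cut(spec, "-"); ok { // "lo-hi" contiguous range
            a, errA := strconv.Atoi(lo)
            b, errB := strconv.Atoi(hi)
            if errA != nil || errB != nil || a > b {
                return nil, fmt.Errorf("bad shard range %q", spec)
            }
            for s := a; s <= b; s++ {
                shards = append(shards, s)
            }
        } else {
            for _, part := range strings.Split(spec, ",") { // "a,b,c" set
                s, err := strconv.Atoi(strings.TrimSpace(part))
                if err != nil {
                    return nil, fmt.Errorf("bad shard %q", part)
                }
                shards = append(shards, s)
            }
        }
        for _, s := range shards {
            if s < 0 || s >= shardCount {
                return nil, fmt.Errorf("shard %d out of range [0,%d)", s, shardCount)
            }
        }
        return shards, nil
    }

    // One predicate over the whole set is what lets Pipeline.Run keep a
    // single SubscribeMetadata stream instead of one per shard.
    func shardPredicate(shards []int) func(int) bool {
        member := make(map[int]bool, len(shards))
        for _, s := range shards {
            member[s] = true
        }
        return func(shard int) bool { return member[shard] }
    }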

* fix(s3/lifecycle): allow Reader with StartTsNs=0 + Cursor=nil

The reader rejected the legitimate 'fresh subscription from epoch'
state when called from a fresh Pipeline.Run on a multi-shard worker
(no cursor file yet, all shards' MinTsNs=0). The downstream
SubscribeMetadata call handles SinceNs=0 fine; the up-front check
was over-defensive and broke the auto-scheduler completely (CI
showed 5-second-cadence retries with this exact error).

* fix(s3/lifecycle): schedule from ModTime not eventTime

A backdated or out-of-band entry update has eventTime ≈ now while
ModTime is far in the past; eventTime+Delay would push the dispatch
into the future even though the rule is already due. ModTime+Delay
is the correct fire moment. The dispatcher's identity-CAS still
catches drift between schedule and dispatch.
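
In sketch form (names illustrative):

    import "time"

    // scheduleAt computes when the action fires. For a 30-day-old ModTime
    // and a 1-day rule the result is already in the past (fire now), even
    // though the triggering meta-log event arrived just now;
    // eventTime.Add(delay) would wrongly defer it a full day.
    func scheduleAt(modTime time.Time, delay time.Duration) time.Time {
        return modTime.Add(delay)
    }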

* fix(s3/lifecycle): -runtime cap on run-shard so it exits on quiet shards

The CI integration test sets -events 200, expecting the subprocess to
return once the budget is spent. But -events counts only events that
pass the shard filter, and the test produces only ~5 such events (bucket
create, lifecycle PUT, two object PUTs, mtime backdate), so the budget
is never reached: the reader stays in stream.Recv forever and
runShellCommand blocks until the test deadline.

- weed/shell/command_s3_lifecycle_run_shard.go: add -runtime D flag.
  When > 0, Pipeline.Run runs under context.WithTimeout(D); on
  expiry the reader/dispatcher drain cleanly and the cursor saves
  (see the sketch below).
- weed/s3api/s3lifecycle/dispatcher/pipeline.go: treat
  context.DeadlineExceeded the same as context.Canceled at exit
  (both are graceful shutdown signals).
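
Together, roughly (pipelineRun stands in for Pipeline.Run):

    import (
        "context"
        "errors"
        "time"
    )

    func runWithCap(parent context.Context, runtime time.Duration,
        pipelineRun func(context.Context) error) error {
        ctx := parent
        if runtime > 0 {
            var cancel context.CancelFunc
            ctx, cancel = context.WithTimeout(parent, runtime)
            defer cancel()
        }
        err := pipelineRun(ctx)
        if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
            return nil // both count as graceful shutdown
        }
        return err
    }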

* test(s3/lifecycle): pass -runtime 10s to run-shard

Pairs with the new -runtime flag so the subprocess exits cleanly
after 10s instead of waiting on an event budget that is never reached
on quiet shards.

* refactor(s3/lifecycle): extract HashExtended to s3lifecycle pkg

The worker's router needs the same length-prefixed sha256 of the entry's
Extended map; pulling it out of the s3api private file lets both sides
import it.
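
A sketch of the shared helper; the commit pins down only "length-prefixed
sha256 of the Extended map", so the key ordering and prefix width below
are assumptions:

    import (
        "crypto/sha256"
        "encoding/binary"
        "sort"
    )

    func hashExtended(extended map[string][]byte) []byte {
        if len(extended) == 0 {
            return nil
        }
        keys := make([]string, 0, len(extended))
        for k := range extended {
            keys = append(keys, k)
        }
        sort.Strings(keys) // deterministic across map iteration order
        h := sha256.New()
        var lenBuf [8]byte
        writeLP := func(b []byte) {
            // length-prefix each field so ("ab","c") never hashes like ("a","bc")
            binary.BigEndian.PutUint64(lenBuf[:], uint64(len(b)))
            h.Write(lenBuf[:])
            h.Write(b)
        }
        for _, k := range keys {
            writeLP([]byte(k))
            writeLP(extended[k])
        }
        return h.Sum(nil)
    }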

* fix(s3/lifecycle): worker captures ExtendedHash for identity-CAS

Without this, the dispatcher sends ExpectedIdentity.ExtendedHash = nil
while the live entry on the server has a non-nil hash, so every dispatch
returns NOOP_RESOLVED:STALE_IDENTITY and nothing is ever deleted.

* fix(s3/lifecycle): identity HeadFid via GetFileIdString

Meta-log events go through BeforeEntrySerialization, which clears
FileChunk.FileId and writes the Fid struct instead. Reading .FileId
directly returns "" on the worker side while the server's freshly
fetched entry still has a populated string, so the identity-CAS
mismatched and every expiration ended in NOOP_RESOLVED:STALE_IDENTITY.
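
The equivalent logic in sketch form (the fid format string below is an
assumption, not the committed one):

    import (
        "fmt"

        "github.com/seaweedfs/seaweedfs/weed/pb/filer_pb"
    )

    func fileIdString(chunk *filer_pb.FileChunk) string {
        if chunk.FileId != "" {
            return chunk.FileId // HTTP-path entries still carry the string
        }
        if fid := chunk.Fid; fid != nil {
            // meta-log entries carry only the struct after
            // BeforeEntrySerialization clears the string form
            return fmt.Sprintf("%d,%x%08x", fid.VolumeId, fid.FileKey, fid.Cookie)
        }
        return ""
    }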

* fix(s3/lifecycle): treat gRPC Canceled/DeadlineExceeded as graceful exit

errors.Is doesn't unwrap a gRPC status error back to the stdlib ctx
errors, so a subscription that ends because runCtx was canceled was
being logged as a fatal reader error. Check status.Code as well so the
shell's -runtime cap exits cleanly.
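
A sketch of the combined check, using the standard grpc status and codes
packages:

    import (
        "context"
        "errors"

        "google.golang.org/grpc/codes"
        "google.golang.org/grpc/status"
    )

    func isGracefulStop(err error) bool {
        if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
            return true
        }
        // gRPC wraps cancellation in a *status.Error that errors.Is does
        // not unwrap to the stdlib sentinels, so also check the code.
        switch status.Code(err) {
        case codes.Canceled, codes.DeadlineExceeded:
            return true
        }
        return false
    }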

* fix(test/s3/lifecycle): pass the gRPC port (not HTTP) to run-shard

run-shard's -s3 flag dials the LifecycleDelete gRPC service, which
listens on s3.port + 10000. The integration test was passing the HTTP
port instead, so the dispatcher's RPC just timed out and the shell
command exited under -runtime with no work done.

* chore(test/s3/lifecycle): drop emoji from Makefile output

* docs(test/s3/lifecycle): correct '-shards 0-15' wording

* fix(s3/lifecycle): reject out-of-range shard IDs in Pipeline.Run

The shell's parseShardsSpec already validates, but a programmatic caller
(scheduler, future worker config) shouldn't be able to silently produce
a no-op run by passing -1 or 99.

* fix(s3/lifecycle): bound drain + final-save with their own timeouts

Shutdown was using context.Background, so a stuck dispatcher RPC or
filer save could keep Pipeline.Run from ever returning.
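
In sketch form, with drain and saveCursor standing in for the real steps
and illustrative budgets:

    import (
        "context"
        "log"
        "time"
    )

    func shutdown(drain, saveCursor func(context.Context) error) error {
        drainCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
        defer cancel()
        if err := drain(drainCtx); err != nil {
            log.Printf("drain: %v", err) // keep going: the save gets its own budget
        }
        saveCtx, cancelSave := context.WithTimeout(context.Background(), 5*time.Second)
        defer cancelSave()
        return saveCursor(saveCtx)
    }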

* fix(test/s3/lifecycle): drop self-killing pkill in stop-server

The pkill pattern "weed mini -dir=..." is also in the running shell's
argv (it's the recipe body), so pkill -f matches its own bash and the
recipe exits with Terminated. The CI test job passed but the cleanup
step failed with exit 2. The PID file is sufficient on its own.

* docs(test/s3/lifecycle): document S3_GRPC_ENDPOINT env var