Chris Lu 05d31a04b6 fix(s3tests): wire lifecycle worker for expiration suite (#9374)
* fix(s3tests): wire lifecycle worker for expiration suite

The upstream s3-tests `test_lifecycle_expiration` / `test_lifecyclev2_expiration`
exercise the "set rule, wait, verify deletion" path. Phase 4 (#9367) intentionally
stripped the PUT-time back-stamp, so pre-existing objects no longer pick up TtlSec
on a freshly-applied rule. The bare-bones `weed -s3` in the s3tests CI had nothing
left driving expiration.

Three changes that work together:

- Engine scales `Days` by `util.LifeCycleInterval`. Production keeps the 24h day;
  the `s3tests` build tag shrinks it to 10s so a `Days: 1` rule completes inside
  the suite's 30s polling window. Exported `DaysToDuration` so sibling-package
  tests pin to the same scale (see the sketch after this list).
- Scheduler/dispatcher tick defaults split into `_default` / `_s3tests` files.
  Production stays 5s/30s/5m; the test build runs at 500ms/2s/2s so deletions
  land within a couple ticks of becoming due.
- s3tests.yml spawns `weed shell s3.lifecycle.run-shard -shards 0-15 -events 0
  -runtime 1800s` alongside the s3 server in both the basic and SQL blocks; the
  shell command runs the full pipeline (reader + scheduler + dispatcher) for the
  duration of the suite. `test_lifecycle_expiration_versioning_enabled` is left
  out for now — versioned-bucket expiration via the worker still needs its own
  pass.
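A minimal sketch of the build-tag scaling; the file layout and the exact
`DaysToDuration` signature are illustrative, not the actual SeaweedFS source:

```go
//go:build s3tests

// Sketch only: the !s3tests twin of this file pins LifeCycleInterval to
// 24 * time.Hour, so only the scale differs between builds.
package util

import "time"

// LifeCycleInterval is the length of one lifecycle "day" under the
// s3tests build tag, so a Days: 1 rule expires inside the suite's
// 30s polling window.
const LifeCycleInterval = 10 * time.Second

// DaysToDuration converts a rule's Days count into a wall-clock delay,
// scaled by whichever LifeCycleInterval the build selected; exported so
// sibling-package tests can pin to the same scale.
func DaysToDuration(days int) time.Duration {
	return time.Duration(days) * LifeCycleInterval
}
```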

Drive-by: bump `TestWorkerDefaultJobTypes` to 7 to match the registered
handler count (8b87ceb0d updated `mini_plugin_test.go` for the s3_lifecycle
plugin but missed this twin test).

Two retention-gate engine tests `t.Skip` under the s3tests build because they
rely on absolute lookback-vs-retention math that the day rescale collapses; the
prod build still covers them.

* review: harden lifecycle worker spawn + assert handler identity

- Workflow: aliveness check on the backgrounded `weed shell` (a bad command
  exits in <1s, and the suite would otherwise just time out opaquely); move
  worker/server teardown into a `trap cleanup EXIT` so failure paths still
  print the worker log and reap the data dir.
- worker_test: check the actual job-type set by name, not just the count.
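Roughly what the name-based assertion looks like; the `defaultJobTypes` helper
and every name except `s3_lifecycle` are placeholders, not the real worker API:

```go
package worker_test

import "testing"

// defaultJobTypes stands in for however the worker package enumerates its
// registered handlers; all names except s3_lifecycle are placeholders.
func defaultJobTypes() map[string]bool {
	return map[string]bool{
		"job_a": true, "job_b": true, "job_c": true,
		"job_d": true, "job_e": true, "job_f": true,
		"s3_lifecycle": true,
	}
}

func TestWorkerDefaultJobTypes(t *testing.T) {
	want := []string{"job_a", "job_b", "job_c", "job_d", "job_e", "job_f", "s3_lifecycle"}
	got := defaultJobTypes()
	if len(got) != len(want) {
		t.Fatalf("registered %d job types, want %d", len(got), len(want))
	}
	// Assert the actual set by name, not just the count.
	for _, name := range want {
		if !got[name] {
			t.Fatalf("job type %q not registered; got %v", name, got)
		}
	}
}
```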

* fix(shell): keep s3.lifecycle.run-shard alive when no rules exist yet

The s3-tests CI runs the worker BEFORE any test creates a bucket, so
LoadCompileInputs returned an empty set and the shell command bailed out
with "no buckets with enabled lifecycle rules found" within ~1s. The
aliveness check then failed with exit 1 before tox ever started.

Two changes:

- Don't early-exit on empty inputs. Compile against the empty set, log a
  one-liner, and let the pipeline run normally — the meta-log subscription
  is already up, so events for buckets created later DO arrive; they just
  need the engine to know about them when they do.
- Add `-refresh <duration>` (default 5m, 2s in s3tests CI) that
  periodically re-runs LoadCompileInputs + engine.Compile so rules added
  after startup land in the snapshot the dispatcher reads on its next
  tick. Production deployments keep the 5m default; only the CI workflow
  drops to 2s.

Workflow passes `-refresh 2s` in both basic and SQL blocks.
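A sketch of the refresh loop under those two changes; the `Input` type and the
load/compile signatures are assumptions, not the real `LoadCompileInputs` /
`engine.Compile` shapes:

```go
package lifecycle

import (
	"context"
	"log"
	"time"
)

// Input is a placeholder for one bucket's compiled-rule input.
type Input struct{ Bucket string }

// RunWithRefresh re-runs load + compile every refresh interval so rules
// added after startup land in the snapshot the dispatcher reads on its
// next tick.
func RunWithRefresh(ctx context.Context, refresh time.Duration,
	load func(context.Context) ([]Input, error),
	compile func([]Input) error) error {

	recompile := func() {
		inputs, err := load(ctx)
		if err != nil {
			log.Printf("lifecycle: load inputs: %v", err)
			return
		}
		if len(inputs) == 0 {
			// Don't early-exit: the meta-log subscription is already up,
			// so compile the empty set and keep the pipeline running.
			log.Printf("lifecycle: no buckets with enabled rules yet")
		}
		if err := compile(inputs); err != nil {
			log.Printf("lifecycle: compile: %v", err)
		}
	}

	recompile()
	ticker := time.NewTicker(refresh) // 5m default, 2s in the s3tests CI
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			recompile()
		}
	}
}
```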

* fix(shell): backfill pre-rule entries via bootstrap walker

The reader-driven path only sees meta-log events created AFTER its
engine snapshot knows the rule. The s3-tests CI scenario PUTs objects
first, then PUTs the lifecycle config, so by the time the engine
refresh picks up the new bucket the object events have already been
seen-and-dropped (BucketActionKeys returned empty for the bucket).

Wire bootstrap.Walk into the shell command:

- bucketBootstrapper tracks buckets seen so far. kickOffNew spawns one
  loop goroutine per fresh bucket.
- Each goroutine re-walks the bucket every walkInterval (defaults to
  the same value as -refresh, i.e. 2s in s3tests CI, 5m in prod) and
  feeds each entry through bootstrap.Walk; due actions dispatch via a
  direct LifecycleDelete RPC. Not-yet-due entries are silently skipped
  and picked up on a later iteration once they age past their (rescaled
  or real) threshold.
- LifecycleDelete is called with no expected_identity; the server-side
  identityMatches treats nil as "skip CAS", which is the right call
  for bootstrap (the bootstrap entry doesn't carry chunk fid /
  extended hash anyway).

The dispatcher's pkg-private toProtoActionKind is duplicated in the
shell file rather than exported, since the helper is only six lines and
the reverse import would pull a proto dep into the s3lifecycle root.
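A sketch of the bootstrapper described above, with trimmed types; `walkBucket`
stands in for the bootstrap.Walk + LifecycleDelete path, and the walk-once
refactor later in this PR drops the ticker:

```go
package shell

import (
	"context"
	"sync"
	"time"
)

// bucketBootstrapper tracks which buckets already have a walker running.
type bucketBootstrapper struct {
	mu           sync.Mutex
	seen         map[string]bool
	walkInterval time.Duration // same as -refresh: 2s in s3tests CI, 5m in prod
	walkBucket   func(ctx context.Context, bucket string) error
}

// kickOffNew spawns one walker goroutine per bucket it has not seen before.
func (b *bucketBootstrapper) kickOffNew(ctx context.Context, buckets []string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.seen == nil {
		b.seen = map[string]bool{}
	}
	for _, bucket := range buckets {
		if b.seen[bucket] {
			continue
		}
		b.seen[bucket] = true
		go b.loop(ctx, bucket)
	}
}

// loop re-walks the bucket every walkInterval; entries that are not yet
// due are skipped and picked up on a later iteration.
func (b *bucketBootstrapper) loop(ctx context.Context, bucket string) {
	ticker := time.NewTicker(b.walkInterval)
	defer ticker.Stop()
	for {
		_ = b.walkBucket(ctx, bucket)
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}
```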

* refactor(s3/lifecycle): hoist bucket bootstrapper into scheduler pkg

The shell command got the backfill in the previous commit but the worker
plugin (weed/worker/tasks/s3_lifecycle/handler.go) drives Scheduler.Run
directly and missed it — same root cause: the reader-driven path only
sees events created after the rule lands, so a daily cron picking up a
freshly-PUT rule wouldn't expire any pre-rule object.

Move the looping bucket walker into scheduler.BucketBootstrapper:

- Scheduler.Run now constructs one and calls KickOffNew on every engine
  refresh. Per-bucket goroutines re-walk every BootstrapWalkInterval
  (defaults to RefreshInterval — 5m in prod, 2s under s3tests).
- The shell command consumes the same struct instead of its own copy
  so the two paths can't drift in semantics.

* refactor(s3/lifecycle): walk-once + schedule via event injection

The previous per-bucket walker re-listed every WalkInterval forever. For a
bucket with N objects under a long rule, the worker listed O(N * runtime /
walkInterval) entries even when nothing was newly due, far too much for
production-scale buckets.

New approach: walk each bucket exactly once on first sight, synthesize
one *reader.Event per existing entry, push it onto Pipeline.events.
Router.Route builds a Match with DueTime=mtime+delay; future-due matches
sit in the per-shard Schedule and fire when their DueTime arrives.
Currently-due matches fire on the very next dispatch tick.
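Roughly how that routing decision falls out; the types are illustrative and the
real Router.Route and per-shard Schedule differ:

```go
package dispatcher

import "time"

// Match pairs an object with the action that should fire at DueTime.
type Match struct {
	Bucket, Key string
	DueTime     time.Time
}

// route decides whether a synthesized event's match fires on the next
// dispatch tick or waits in the per-shard schedule until it comes due.
func route(mtime time.Time, delay time.Duration, now time.Time,
	dispatchNow func(Match), schedule func(Match)) {

	m := Match{DueTime: mtime.Add(delay)} // DueTime = mtime + delay
	if !m.DueTime.After(now) {
		dispatchNow(m) // already due: fires on the very next dispatch tick
		return
	}
	schedule(m) // future-due: sits in the shard's Schedule until DueTime
}
```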

Wiring:

- dispatcher.Pipeline lifts its events channel into a struct field
  with sync.Once init, and exposes InjectEvent(ctx, ev). Reader no
  longer closes the channel — the dispatch goroutine exits on runCtx
  cancellation, which works the same as channel-close did.
- scheduler.BucketBootstrapper drops the WalkInterval ticker. KickOffNew
  spawns one walker goroutine per fresh bucket; the goroutine lists,
  synthesizes events, then exits.
- scheduler.Scheduler builds its pipelines up front and exposes a
  pipelineFanout (shard -> Pipeline) as the EventInjector, so a multi-
  worker scheduler routes each synthesized event to the pipeline that
  owns its shard.
- Shell command's single-pipeline path passes pipeline.InjectEvent
  directly.

Synthesized events carry TsNs=0; dispatcher.advance treats that as a
no-op so the reader's persisted cursor isn't ratcheted past unprocessed
meta-log events. Identity (HeadFid + ExtendedHash) is still computed
from the real filer entry, so the server's identity-CAS catches an
overwrite between bootstrap and dispatch.
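A sketch of the injection path and the cursor guard, with the event and
pipeline shapes heavily simplified versus the real dispatcher package:

```go
package dispatcher

import (
	"context"
	"sync"
)

// Event is a trimmed stand-in for *reader.Event; synthesized bootstrap
// events carry TsNs == 0.
type Event struct {
	Bucket, Key string
	TsNs        int64
}

type Pipeline struct {
	initOnce sync.Once
	events   chan *Event
	cursorNs int64 // owned by the single dispatch goroutine in the real code
}

func (p *Pipeline) initEvents() {
	p.initOnce.Do(func() { p.events = make(chan *Event, 1024) })
}

// InjectEvent lets the bootstrap walker push synthesized events onto the
// same channel the reader feeds; nothing closes the channel, the dispatch
// goroutine exits on context cancellation instead.
func (p *Pipeline) InjectEvent(ctx context.Context, ev *Event) error {
	p.initEvents()
	select {
	case <-ctx.Done():
		return ctx.Err()
	case p.events <- ev:
		return nil
	}
}

// advance ratchets the persisted reader cursor; TsNs == 0 (a bootstrap
// event) is a no-op so the cursor never skips unprocessed meta-log events.
func (p *Pipeline) advance(tsNs int64) {
	if tsNs == 0 || tsNs <= p.cursorNs {
		return
	}
	p.cursorNs = tsNs
}
```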

* debug(s3tests): make lifecycle worker progress visible in CI logs

The previous CI failure dumped an empty $LC_LOG even though the worker
was running. Two reasons:

1. weed shell suppresses glog by default (logtostderr / alsologtostderr
   set to false). Pass `-debug` so the bootstrapper's V(0) lines reach
   stderr instead of disappearing into /tmp/weed.*.log.
2. cleanup used `kill -9`, which skips Go's stdout flush. Now it SIGTERMs
   first with a 1s grace, then SIGKILLs any holdout, then reads the log.

While here: bump the bootstrap walker's two informational logs to V(0)
so the diagnosis from CI doesn't require -v=1 on the worker.

* fix(s3/lifecycle/dispatcher): refresh snap on every event

Pipeline.Run captured snap at startup and only refreshed it on the
dispatch tick. With bootstrap event injection, the walker pushes events
seconds after engine.Compile sees the bucket — typically WITHIN the
same dispatch interval. Routing against the cached (empty) snap then
silently dropped every match because BucketActionKeys returned nil for
the bucket-not-yet-in-snapshot case.

Re-fetch on each event. Engine.Snapshot is an atomic.Pointer.Load, so
the cost is negligible. The dispatch-tick branch keeps using a fresh
local read for its own loop, so its semantics are unchanged.
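In sketch form; the Snapshot / BucketActionKeys shapes are assumed, and the
point is only the per-event atomic load:

```go
package dispatcher

import "sync/atomic"

// Snapshot is a stand-in for the engine's compiled-rule snapshot.
type Snapshot struct {
	actionsByBucket map[string][]string
}

func (s *Snapshot) BucketActionKeys(bucket string) []string {
	if s == nil {
		return nil
	}
	return s.actionsByBucket[bucket]
}

// Engine publishes its latest snapshot behind an atomic pointer, so a
// per-event Load is just a pointer read.
type Engine struct{ snap atomic.Pointer[Snapshot] }

func (e *Engine) Snapshot() *Snapshot { return e.snap.Load() }

// routeEvent re-fetches the snapshot for every event instead of reusing
// one captured at startup, so buckets compiled moments ago are visible
// to bootstrap-injected events within the same dispatch interval.
func routeEvent(e *Engine, bucket string, route func(actionKey string)) {
	snap := e.Snapshot() // fresh load per event; negligible cost
	for _, key := range snap.BucketActionKeys(bucket) {
		route(key)
	}
}
```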