Operator visibility was the last item on the daily-replay must-have
list. The `S3LifecycleCursorMinTsNs` gauge already existed but nothing
ever set it — leftover from the streaming worker that got deleted.
Wire it up and add a parallel one for the walker so a single PromQL
query answers "is this thing working?":
- `cursor_min_ts_ns{shard}` set after each cursor save. Operators read
`now - cursor_min_ts_ns` as the per-shard replay lag.
- `daily_run_last_walked_ns{shard}` is new. It is set alongside the
cursor gauge so operators can confirm `WalkerInterval` is actually
being honored. A stuck value means the scheduler isn't invoking the
worker, the throttle is too long, or the walker is failing.
- `saveCursorAndPublish` wraps every `Save` call site in `runShard` so
the gauges and the persisted state stay aligned (gauges only advance
on successful saves).
- Enhance the `daily_run: status=... duration=...` heartbeat with
`cursor_lag_max=` and `walked_max_age=` summary tokens for ops grep.
Existing tokens stay positional-stable; new ones append at the end.
The `cold` marker distinguishes "not started" from "0s" (caught up).
Tests pin the summary line: cold-start state, max-across-shards
selection, and partial-fill (some shards drained, others walked).
Stacked on #9485.