13340 Commits

Author SHA1 Message Date
pingqiu
e34952f648 docs(p15): update G9C component ready evidence 2026-05-02 19:56:49 -07:00
pingqiu
df7f3b183d docs(p15): update G9C post-close durable ack evidence 2026-05-02 19:54:28 -07:00
pingqiu
7643356f2d docs(p15): start G9C replica ready feed continuity plan 2026-05-02 19:49:20 -07:00
pingqiu
3367071d97 docs(p15): close G9B replica join lifecycle 2026-05-02 19:33:03 -07:00
pingqiu
0b840d8933 docs(p15): record G9B L2 join lifecycle smoke 2026-05-02 19:24:07 -07:00
pingqiu
08945de1a9 docs(p15): record G9B adapter lifecycle proof 2026-05-02 19:13:48 -07:00
pingqiu
db3160565c docs(p15): update G9B replica ready lifecycle progress 2026-05-02 19:06:48 -07:00
pingqiu
52addf2b35 docs(p15): start G9B replica join lifecycle plan 2026-05-02 18:58:01 -07:00
pingqiu
fc71db68ef docs(p15): record G9A RF3 ACK semantics 2026-05-02 18:46:53 -07:00
pingqiu
bf6775b856 docs(p15): close G9A ACK reintegration policy 2026-05-02 18:39:11 -07:00
pingqiu
e72ce0d530 docs(p15): mark G9A close-ready 2026-05-02 18:38:39 -07:00
pingqiu
f5670b5dc4 docs(p15): record G9A recovering status role 2026-05-02 18:35:40 -07:00
pingqiu
4b5a7d62d7 docs(p15): record G9A returned-replica readiness oracle 2026-05-02 18:27:57 -07:00
pingqiu
72f22ec46e docs(p15): record G9A best-effort ACK oracle 2026-05-02 18:21:47 -07:00
pingqiu
fa1ed676c2 docs(p15): update G9A ACK reintegration progress 2026-05-02 18:18:01 -07:00
pingqiu
5d411294e4 docs(p15): start G9A ACK reintegration plan 2026-05-02 16:43:30 -07:00
pingqiu
1a613afba4 docs(p15): close G8 failover data continuity 2026-05-02 16:22:57 -07:00
pingqiu
0483a035b0 docs(recovery): separate pin breach from primary flow control 2026-05-02 12:12:44 -07:00
pingqiu
baaa1e7e45 docs(recovery): record coordinator progress fact gaps 2026-05-02 11:09:25 -07:00
pingqiu
6905fb8278 docs(recovery): plan progress fact recover governance 2026-05-02 11:00:28 -07:00
pingqiu
eda40923a8 docs(recovery): add single-egress grill checklist 2026-05-01 23:24:59 -07:00
pingqiu
3910cbd1cb docs(architecture): V3 egress single-decision-core principle
Pins the architecture principle that V3 egress components (per-peer
shipper / session pump / flusher / barrier driver) must be modeled
as a single decision core with single-queue / single serializable
worker shape. Monotonic pointers (cursor, applied LSN, emit profile,
bound conn) advance only via the core's internal transitions; external
callers deliver commands/events; direct external mutation is a
design-debt side door.

Grounds the principle in three hardware-validated incidents on m01/M02
during 2026-05-02, all of which turn out to be the same shape:

  §3.1 executor.Ship overwrites the WalShipper's emit context mid-
       session (g7 #5: 498 LBA mismatches at concurrent-write range)
  §3.2 PrimaryBridge onStart/onClose dropped ReplicaID; engine and
       runtime peer state diverged (g7 #5/#6 dispatch never fires)
  §3.3 A-class sender→coord RecordBarrierWalLegOk side-write
       (reverted 2026-05-02 working tree per this principle)

Companion to v3-rebuild-from-lsn-pin-clarification.md. Anchors in
consensus: §I P1, §I P7, §6.8, INV-SINGLE.

No new wire field, predicate, or invariant; clarifies a rule that
earlier docs imply but don't articulate. Provides a judgment criterion
for the upcoming Ship/PushLiveWrite collapse and peer-state ownership
work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 20:57:31 -07:00
pingqiu
02ed4db104 docs(recovery): rebuild fromLSN pin sentinel translation clarification
Pins the rebuild path's `fromLSN=0` sentinel semantic for the current
`StartRebuild` signature. Closes the gap §I P7 calls out ("transport
silently overwriting `fromLSN := 0` violates parity") in the absence of
an engine-published `fromLSN` for rebuild.

Hardware-validated by seaweed_block@bc4286e g7-dual-lane on m01/M02:
G7-#2 PASS (dispatch=1s, complete=1s, total=2s, 1000 LBAs byte-equal).

Sentinel rule:
  - Caller passes 0 ⇒ "rebuild — primary picks the pin"
  - Transport translates: sessionFromLSN := targetLSN
  - Receiver-visible fromLSN = targetLSN
  - Future catch-up (engine surfaces fromLSN := replicaLSN): passthrough
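
A minimal Go sketch of the sentinel rule above. The helper name `sessionFromLSN` and the constant `rebuildSentinel` are hypothetical, chosen for illustration only; the real translation is described as living in startRebuildDualLane, whose signature is not shown here.

```go
package main

import "fmt"

// rebuildSentinel is the caller-side value meaning
// "rebuild: primary picks the pin" (hypothetical constant name).
const rebuildSentinel uint64 = 0

// sessionFromLSN is a hypothetical helper that applies the sentinel rule:
// a caller-supplied 0 is translated to targetLSN, any other value is
// passed through unchanged (the future catch-up case).
func sessionFromLSN(callerFromLSN, targetLSN uint64) uint64 {
	if callerFromLSN == rebuildSentinel {
		return targetLSN
	}
	return callerFromLSN
}

func main() {
	fmt.Println(sessionFromLSN(0, 1000))   // rebuild: receiver sees targetLSN
	fmt.Println(sessionFromLSN(640, 1000)) // catch-up: passthrough
}
```

Under this shape the receiver never observes a silent `fromLSN := 0`; it sees either the pin or the value the engine published.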

Three transport mechanics that satisfy the rule without violating any
consensus invariant: sentinel translation in startRebuildDualLane,
cursor-caught-up shortcut in WalShipper.DrainBacklog (preserves the
recycle gate's <= strictness verbatim), SeedWalApplied at SessionStart
so base-only rebuilds satisfy the A-class TryComplete conjunct.

Anti-discipline: no new wire fields, predicates, or invariants. Memo
clarifies the meaning of an existing transport-layer constant.

Anchors: v3-recovery-algorithm-consensus.md §I P7 / §6.9 / §6.10;
recover-semantics-adjustment-plan.md §1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 18:08:18 -07:00
pingqiu
d6636e3497 docs(sw-block): mini-plan §11.9 slice-2A drive+L SHAs + Legacy default receipt
Document 28fb142 (drive collapse) and d647db4 (LegacyOutOfOrderEmit false),
deprecation stance, Slice-2B Path B remainder; §8 v0.15; §12.1 bridge pointer.

Made-with: Cursor
2026-04-30 22:06:55 -07:00
pingqiu
f098e0e33c docs(sw-block): §11.8 WriteExtentDirect receipt; §6.3 CASE C ∅ noop v3.17
- Mini-plan §11.8 documents g7-redo/wal-shipper-impl 6fccc62 and 9f7d918
- Pillar 2 supersession note (faulty-store vs wire-abort); §12.4 #7-10 backlog
- Consensus §6.3 Drive pseudocode CASE C comment (architect ∅ nit)
- v3.17 revision; mini-plan revision v0.14

Made-with: Cursor
2026-04-30 20:16:08 -07:00
pingqiu
8110555978 docs(sw-block): dual-lane recv appendWAL vs writeExtentDirect; backlog iterator cost
Consensus §6.10: WAL path uses appendWAL + bitmap; base uses
writeExtentDirect (no WAL/LSN); substrate must expose both. §6.3:
ring/sequential backlog model, stateful vs stateless ReadAtLSN table,
O(N) vs O(N log N). §7 routing + INV rows aligned. Mini-plan §12.4
checklist expanded (v0.12). Wal-shipper spec §7.4 note. v3.15.

Made-with: Cursor
2026-04-30 13:26:33 -07:00
pingqiu
c48804479c docs(sw-block): normative Drive/Apply pseudocode and cross-links
Add §6.3 Drive(input) and §6.10 ApplyWAL/ApplyBASE blocks with state,
atomic envelope, and INV wiring; align cursor/head with exclusive-slot
semantics plus §6.1/CHK-REWIND note; revise §13 StrictRealtimeOrdering
for cursor convention. Wal-shipper spec §7.3 references consensus §6.3;
§7.4 points at §6.10 pseudocode. Mini-plan §12 defers Drive to consensus
§6.3 only (v0.11). Consensus v3.14.
2026-04-30 13:06:40 -07:00
pingqiu
7dfb0172d1 sw-block/design: mini-plan §11.7 — pillar 2/3 landed-SHA + test-name table
Annotates pillar 2 (dual-line execution) and pillar 3 (receiver
convergence) rows with the landed commits + concrete test names so
the design narrative tracks the code:

  Pillar 2 — assembled-stack fault-injection
    7d051e2 — C3 fault-injection (recovery-package layer, spy sink):
              TestC3FaultInjection_BaseError_WalExitsViaCtx
              TestC3FaultInjection_OuterCancel_BothLanesWindDown
              TestC3FaultInjection_WalError_RunTerminates_NarrowVariantA
    9f62ebe — Pillar 2 fault-injection on the assembled stack
              (BlockExecutor + RecoverySink + Sender + WalShipper):
              TestPillar2A_BaseError_AssembledStack_FailReason
              TestPillar2B_LiveWrites_HighPressure_BarrierIntegrity
              TestPillar2C_WireAbortMidSession_AssembledStack_RestoresEmitContext

  Pillar 3 — receiver convergence under same-LBA conflict
    291e652 — slice-1: same-LBA arbitration on transport stack
    3495a12 — slice-1 polish (AchievedLSN assertion, replica==primary
              comparison, comment fixes, companion-SHA list)
              TestPillar3Slice1_ReceiverConvergence_LiveOverwritesBacklog_SameLBAs
              (transport-stack only; slice-2 lifts to engine path
               via dual_lane_engine_test.go)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 11:51:20 -07:00
pingqiu
822c588d08 sw-block/design: mini-plan §11.6 Phase 0 SHA receipt + §11.7 heavy integration matrix
§11.6 startRebuildDualLane row updated with landed-2026-04-30 commit
receipts on g7-redo/wal-shipper-impl:
  - e354813 fix(transport): unify replicaID in startRebuildDualLane
  - e5a8763 fix(test): align component cluster harness to replica-%d
Module-wide go test green (25 packages). Phase 0 closed.

§11.7 added — heavy integration proof matrix on the manager-assembled
session (BlockExecutor + PrimaryBridge + RecoverySink + recovery.Sender +
resident WalShipper + PeerShipCoordinator under unified replicaID).
Three pillars: (1) WalShipper backlog/realtime modes, (2) dual-line
execution + writeMu interleave, (3) receiver convergence. Order:
Phase 0 + SHA receipt → C3 fault-injection → widen matrix → T2/T7
hardware soak. Stop rule: failure at assembled-session layer fixes
algorithm/wiring before Phase 1 attempt binder.

Revision rows v0.7 + v0.8 record the §11.6 + §11.7 deltas.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 11:21:14 -07:00
pingqiu
91a313e1e1 sw-block/design: §13 E-WALSHIPPER-DUAL-MODE + mini-plan §11.2a + spec sync
Architect-approved exception (consensus v3.9, 2026-04-30): WalShipper
dual-mode contract carves Realtime out of §6.3(B) timer + §6.8(3)(9)
send(incoming, debt) normative scope. Backlog mode is fully normative;
Realtime is per-append (T4a no-replay), with StrictRealtimeOrdering
as the production safety switch.

Files:

  v3-recovery-algorithm-consensus.md
    - NEW §13: Architect-approved exceptions, with E-WALSHIPPER-
      DUAL-MODE entry. Old §13 Revision → §14; old §14 Document
      map → §15. Cross-refs in §I, §6.3(A)(B), §6.8 checklist,
      §V §12 (CHK-WALSHIPPER-TIMER-DRAIN explicitly limited to
      ModeBacklog), §15 (index gains §13).
    - §14 revision row v3.9.

  v3-recovery-wal-shipper-mini-plan.md
    - §11.1 G-TIMER row updated to "C2 + dual-mode/§13".
    - §11.2 replaced (was: pre-merge "Realtime drains on cursor<head"
      bullets) with C2 dual-mode contract + §11.2a normative table:
        * Backlog: §6.3 / §6.8(3)(4)(9) fully apply (timer, scan,
          oldest-first, send(·, debt)).
        * Realtime: NotifyAppend per-LSN-in-order; lsn==cursor+1
          invariant; no substrate replay (§13 carve-out).
        * Production safety switch: StrictRealtimeOrdering.
    - §11.3 compliance receipt rebuilt as 1–9 ordered table with
      commit SHAs (P2d cb8ff1c, C1 294d4bf, C2 53c292f, C3 40f2935,
      review-fix 377bcb0).
    - §11.4 test anchor table gains StrictRealtimeOrdering opt-in.
    - §11.5 invariants gain (4) §13 dual-mode + (5) replicaID drift.
    - §11.6 reshaped as production-migration table:
        * Engine drives rebuild-on-gap before flipping
          StrictRealtimeOrdering=true.
        * Fix replicaID drift in startRebuildDualLane (sink uses
          engine replicaID; bridge.coord uses dl.ReplicaID).
        * Hardware-run carries: §IV T2 (barrier vs targetLSN),
          §IV T7 (R2 saturation under sustained pressure).
        * PR template line: ban `lsn > cursor + 1` debt detection
          (banned regression anchor; use `cursor < head` if it
          ever needs to come back, but only inside Backlog).
    - §8 revision: v0.6.

  v3-recovery-wal-shipper-spec.md
    - Goal/checklist front-matter cross-refs §13 + mini-plan §11.2a.
    - §10 revision row aligning with consensus v3.9.

Companion implementation on seaweed_block g7-redo/wal-shipper-impl:
  - 377bcb0 review-fix: T4a Realtime sequence guard + drainOpportunity TOCTOU
  - cb17338 docs: drainOpportunity comment matches §13 dual-mode contract
2026-04-30 09:48:10 -07:00
pingqiu
aeeb71f0e7 sw-block/design: mini-plan §11 — commit SHAs + review-derived invariants
§11 hardening sequence is fully landed on
seaweed_block/g7-redo/wal-shipper-impl:

  C1 294d4bf  shared writeMu + post-emit RecordShipped hook
  C2 53c292f  WalShipper self-driving Backlog drain + DisableTimerDrain
  C3 40f2935  Sender.Run errgroup BASE ∥ WAL parallel
  348cb35     §6.8 review-derived doc invariants (DisableTimerDrain +
              lock order shipMu → writeMu)

§11.3 — compliance receipt expanded into a 9-row table mapping each
  §6.8 MUST to its commit SHA and implementation locus.

§11.4 — test anchor table with §6.8 # / repo path. Includes
  TestC2_NoGapDenseLSNEdge as the regression anchor banning
  `lsn > cursor + 1` debt detection (the user's correction).

§11.5 — three review-derived invariants now documented in code:
  - DisableTimerDrain test-only contract
  - shipMu → writeMu lock hierarchy
  - RecordShipped four-path coverage audit

§11.6 — open items carried forward (§IV T2 barrier vs targetLSN;
  §IV T7 R2 saturation under sustained pressure; PR template line
  banning `lsn > cursor + 1`).
2026-04-30 08:57:10 -07:00
pingqiu
058305225b sw-block/design: mini-plan §11 C1-C3 hardening sequence
P2d ratified + implemented in seaweed_block (cb8ff1c). Architect
review against §6.8 / consensus v3.8 nine MUSTs identified three
correctness gaps to close, sequenced as C1..C3:

  G-WRITE-RACE   — Sender.writeFrame vs EmitFunc.conn.Write on
                   same dual-lane conn (no shared mutex).
  G-RECORDSHIPPED — WalShipper-routed emits never advance
                   coord.shipCursor.
  G-TIMER        — no self-driving periodic emit-from-cursor;
                   primary-idle starvation possible. NotifyAppend
                   ignores debt (cursor < head) and emits new tail
                   directly — violates send(incoming, debt).
  G-PARALLEL     — Sender.Run runs streamBase fully before
                   sink.DrainBacklog. P6 / G3 requires overlap.

C1 — shared writeMu + post-emit RecordShipped hook (duck-typed
     sink sub-interfaces WriteMu / SetPostEmitHook).
C2 — WalShipper internal timer + NotifyAppend debt-aware dispatch
     using cursor < head (NOT lsn > cursor + 1; that edge-case
     fails at dense single-LSN debt). data arg canonical only on
     no-debt path; debt path drain reads substrate.
C3 — Sender.Run errgroup BASE ∥ WAL parallel; both write through
     shared writeMu (mutex-bounded interleaving, not zero-blocking).
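
A sketch of the C1/C3 idea in Go, under stated assumptions: two lanes write whole frames to one connection through a shared mutex, so frames interleave only at frame boundaries. The `frameWriter` type, the two-byte header, and the use of `sync.WaitGroup` in place of errgroup are illustrative choices, not the seaweed_block implementation.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// frameWriter serializes whole frames onto one shared stream.
// conn is a bytes.Buffer standing in for the dual-lane connection.
type frameWriter struct {
	writeMu sync.Mutex
	conn    bytes.Buffer
}

// writeFrame emits lane tag, length, and payload as one unit under writeMu.
func (w *frameWriter) writeFrame(lane byte, payload []byte) {
	w.writeMu.Lock()
	defer w.writeMu.Unlock()
	w.conn.WriteByte(lane)
	w.conn.WriteByte(byte(len(payload)))
	w.conn.Write(payload)
}

// runLanes runs a BASE lane ('B') and a WAL lane ('W') in parallel.
func runLanes(w *frameWriter, framesPerLane int) {
	var wg sync.WaitGroup
	for _, lane := range []byte{'B', 'W'} {
		wg.Add(1)
		go func(lane byte) {
			defer wg.Done()
			for i := 0; i < framesPerLane; i++ {
				w.writeFrame(lane, []byte{lane, lane, lane})
			}
		}(lane)
	}
	wg.Wait()
}

// countIntactFrames counts frames whose payload bytes all match the lane tag;
// it stops at the first torn frame.
func countIntactFrames(stream []byte) int {
	n := 0
	for len(stream) >= 2 {
		lane, size := stream[0], int(stream[1])
		if len(stream) < 2+size {
			break
		}
		if !bytes.Equal(stream[2:2+size], bytes.Repeat([]byte{lane}, size)) {
			break
		}
		n++
		stream = stream[2+size:]
	}
	return n
}

func main() {
	w := &frameWriter{}
	runLanes(w, 50)
	fmt.Println(countIntactFrames(w.conn.Bytes()))
}
```

The lanes still block each other while a frame is being written, which matches the "mutex-bounded interleaving, not zero-blocking" wording above.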

§11.3 records the §6.8 nine-MUST compliance receipt expected
post-C3.

Companion implementation lands on
seaweed_block g7-redo/wal-shipper-impl as three commits.
2026-04-30 02:33:04 -07:00
pingqiu
361f5140a4 sw-block/design: v3-recovery-wal-shipper-mini-plan + §10 P2d decision request
Adds the WalShipper implementation mini-plan that bridges
v3-recovery-wal-shipper-spec.md to the seaweed_block layout (phased
PR rollout P0..P4, INV ↔ test mapping, reviewer checklist).

§10 P2d decision request — the architect-gated handoff:

P2c is closed (slice A / B-1 / B-2 merged on g7-redo/wal-shipper-impl).
The bridging senderBacklogSink owns the live-write buffer + flushAndSeal
under sinkMu; Sender.Run barriers as soon as sink.DrainBacklog returns;
Close/closeCh/liveQueue/drainAndSeal are deleted from Sender. Atomic-seal
contract migrates intact (capture-vs-reject from queueMu → sinkMu).

P2d is gated on a three-axis decision the architect must make before a
real transport.WalShipper sink can replace the bridging path:

  1. Body format on the dual-lane port:
     (A) MsgShipEntry payload (unify on legacy steady encoding), OR
     (B) frameWALEntry payloads (teach WalShipper.Emit to encode), OR
     (C) documented third (e.g. envelope byte).

  2. Single applier owner:
     recovery.Receiver vs transport replica handler.

  3. Replay source of truth:
     which encoding the on-disk WAL playback decoder reads.

§10 also lists pre-decision deliverables that can land in parallel:
adapter scaffolding (transport-side struct satisfying recovery.WalShipperSink
by duck typing) + integration tests for architect rules 1+2 (emit context
before StartSession; restore steady lineage after EndSession).

V2 wire-compat is gated separately per feedback_porting_discipline.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 23:38:29 -07:00
pingqiu
759d50ba5d sw-block/design: v3-recovery-inv-test-map — INV-WAL-CURSOR-MONOTONIC-FROM-PINLSN
Adds the new invariant pin row for §3.2 #3 unified WAL stream / cursor-
rewind. Companion to the seaweed_block g7-redo/unified-wal-impl branch
(checkpoint 3/N at commit 0550e44) which adds the 7 new tests cited.

Row content:
  Definition: sender pump rewinds cursor to fromLSN once at session
  start; cursor advances monotonically per ScanLBAs callback; never
  decreases. Receiver enforces matching wire-level monotonic
  discipline (4-case per kickoff v0.3 §5.1: ==applied+1 apply /
  >applied+1 gap → FailureContract / ==applied exact-duplicate →
  FailureProtocol / <applied backward → FailureProtocol).
  Tests pinned (7):
    TestSender_PumpHappyPath_OnMemoryWAL
    TestSender_LiveWritesDuringSession_OnMemoryWAL
    TestSender_KindByte_FlipsOnceAtCatchUp
    TestSender_StreamUntilHead_CtxCancel
    TestReceiver_RejectsBackwardLSN_InSession
    TestReceiver_RejectsGap_InSession
    TestReceiver_RejectsExactDuplicate_InSession
  Status:  pinned.
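
The receiver's 4-case discipline in the row above can be sketched in Go as follows. The `verdict` type and `classify` function are hypothetical names; only the four case outcomes come from the row text.

```go
package main

import "fmt"

// verdict is a hypothetical label for the receiver's decision.
type verdict string

const (
	apply           verdict = "apply"
	failureContract verdict = "FailureContract"
	failureProtocol verdict = "FailureProtocol"
)

// classify applies the 4-case rule to an incoming lsn given the
// receiver's last applied LSN.
func classify(applied, lsn uint64) verdict {
	switch {
	case lsn == applied+1:
		return apply // next in sequence
	case lsn > applied+1:
		return failureContract // gap
	default:
		// lsn == applied (exact duplicate) or lsn < applied (backward)
		return failureProtocol
	}
}

func main() {
	for _, lsn := range []uint64{11, 13, 10, 7} {
		fmt.Printf("applied=10 lsn=%d -> %s\n", lsn, classify(10, lsn))
	}
}
```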
2026-04-29 17:48:18 -07:00
pingqiu
04b9d1ea53 sw-block/design: relocate V3 recovery dev docs from seaweed_block
Per repository policy: dev/design docs live in
seaweedfs/sw-block/design/, not in seaweed_block/docs/. Formal
product docs come later. This commit relocates the 7 recovery
design markdown docs (4 trunk-merged in seaweed_block phase-15;
3 in-flight on g7-redo branches) plus the 1 hardware canonical
YAML to sw-block/design/ with v3-recovery-* prefix to match the
existing naming pattern (v3-recovery-live-line-backlog-spec.md).

Companion cleanup: a follow-on PR on seaweed_block removes the
docs from docs/ (and the YAML from testrunner/scenarios/) — that
PR is the seaweed_block side of the relocation.

Files added:

  v3-recovery-pin-floor-wire.md            — was docs/recovery-pin-floor-wire.md
                                             on seaweed_block phase-15 (PR #11+#16)
  v3-recovery-wiring-plan.md               — was docs/recovery-wiring-plan.md
                                             (PR #13)
  v3-recovery-execution-institution.md     — was docs/recovery-execution-institution.md
  v3-recovery-inv-test-map.md              — was docs/recovery-inv-test-map.md
                                             (PR #11/#14/#15)
  v3-recovery-unified-wal-stream-kickoff.md   — was docs/recovery-unified-wal-stream-kickoff.md
                                                 g7-redo/unified-wal-kickoff (v0.3)
  v3-recovery-unified-wal-stream-mini-plan.md — was docs/recovery-unified-wal-stream-mini-plan.md
                                                 g7-redo/unified-wal-mini-plan (v0.2)
  v3-recovery-dual-lane-canonical-runbook.md  — was docs/recovery-dual-lane-canonical-runbook.md
                                                 g7-redo/hardware-canonical-paper
  v3-recovery-dual-lane-canonical.yaml        — was testrunner/scenarios/recovery-dual-lane-canonical.yaml
                                                 g7-redo/hardware-canonical-paper

Internal cross-references updated in-place via sed:
  - docs/recovery-inv-test-map.md → v3-recovery-inv-test-map.md
  - docs/recovery-pin-floor-wire.md → v3-recovery-pin-floor-wire.md
  - docs/recovery-wiring-plan.md → v3-recovery-wiring-plan.md
  - testrunner/scenarios/recovery-dual-lane-canonical.yaml →
    v3-recovery-dual-lane-canonical.yaml

Hand-edits:
  - runbook §1 companion-YAML link: was
    "../v3-recovery-dual-lane-canonical.yaml" (parent dir from
    seaweed_block/docs); now same-directory link in design/.
  - runbook §8 §3.2 #3 reference: was relative to seaweed_block
    memory file (../../.claude/...); rewritten to point to
    v3-recovery-unified-wal-stream-kickoff.md §4 directly.
  - mini-plan Q15: docs/archive/ wording updated to
    sw-block/design/archive/.

Stages-of-evidence still readable from the docs themselves
(kickoff §11, mini-plan §10 resolution logs, inv-test-map row
versions). Original seaweed_block branches preserve git
history for the in-flight content; the cleanup PR closes them
once this lands.

NOTE: this commit does NOT include the user's unrelated
ongoing edits in feature/sw-block (M v3-batch-process.md,
M v3-dev-roadmap.md, M v3-phase-15-g6-mini-plan.md, etc.).
Those stay uncommitted for the user to handle separately.
2026-04-29 17:15:38 -07:00
pingqiu
3904730c5a G7 mini-plan v0.1 — §harness-notes correction per QA pre-work survey
QA's G7 pre-work surfaced a discrepancy: the v0.1 §harness-notes
pointed at `exec_rebuild_started` / `exec_rebuild_completed` as
the harness markers. Those are RecoveryLog event names (internal
Orchestrator.Log ring buffer, process-local) — NOT visible in
primary.log on hardware. Hardware harnesses can't scrape them
without a /recovery-log HTTP surface (G5-3 forward-carry).

Hardware-visible markers (corrected):
- START:    `executor: rebuild start replica=<id> sessionID=<n>
             epoch=<n> EV=<n> targetLSN=<n>`
            from core/transport/rebuild_sender.go:41
            (added at G6 #1, seaweed_block@85475cd)
- COMPLETE: `executor: rebuild complete, sent <n> blocks
             (targetLSN=<n>)`
            from core/transport/rebuild_sender.go:120
            (pre-existing T4d-4 part B / earlier)

Both produced via log.Printf in rebuild_sender.go and routed to
the daemon's stdout/stderr stream (which the iterate harness captures
to ${REMOTE_RUN_DIR}/logs/primary.log). Both are sessionID-
correlatable for chained-scenario filtering. The G6 hardware run
already proved the START marker pattern; COMPLETE follows the
same shape.
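
A hedged Go sketch of scraping the START marker from primary.log, assuming the literal format quoted above appears on a single line. The `parseStart` helper, the `rebuildStart` struct, and the sample values in `main` are illustrative and are not taken from g7-helpers.sh.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// startRe matches the START marker format quoted in the corrected notes.
var startRe = regexp.MustCompile(
	`executor: rebuild start replica=(\S+) sessionID=(\d+) epoch=\d+ EV=\d+ targetLSN=(\d+)`)

// rebuildStart holds the fields a harness would likely filter on.
type rebuildStart struct {
	Replica   string
	SessionID uint64
	TargetLSN uint64
}

// parseStart extracts the marker fields from one log line, if present.
func parseStart(line string) (rebuildStart, bool) {
	m := startRe.FindStringSubmatch(line)
	if m == nil {
		return rebuildStart{}, false
	}
	sid, err1 := strconv.ParseUint(m[2], 10, 64)
	lsn, err2 := strconv.ParseUint(m[3], 10, 64)
	if err1 != nil || err2 != nil {
		return rebuildStart{}, false
	}
	return rebuildStart{Replica: m[1], SessionID: sid, TargetLSN: lsn}, true
}

func main() {
	// Made-up sample line; field values are for illustration only.
	line := "2026/04/28 10:02:17 executor: rebuild start replica=r2 sessionID=7 epoch=3 EV=1 targetLSN=5000"
	if s, ok := parseStart(line); ok {
		fmt.Printf("replica=%s session=%d target=%d\n", s.Replica, s.SessionID, s.TargetLSN)
	}
}
```

Filtering on `SessionID` is what lets a chained scenario ignore markers from earlier sessions in the same log.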

Files corrected:
- §2 #7 acceptance row (harness helper text)
- §2 entry-marker table row
- §3 risks "Ambiguous rebuild done vs peer healthy" row
- §harness-notes (full rewrite with v0.1 correction note +
  marker table + RecoveryLog clarification + recommended helper
  shape with sessionID filter)

Negative-references to RecoveryLog event names retained in
explanatory context (so future readers don't re-introduce the
mistake by reading the engine code in isolation).

QA pre-work artifact V:\share\g5-test\scenarios\g7-helpers.sh is
already written against the corrected literals; this commit
brings the §harness-notes source-of-truth into alignment.

Standing by for architect §1.A ratification (Q1 topology / Q2
fold-G6 / Q3 deadline / etc.) before §1.H code-start audit.
2026-04-28 13:01:55 -07:00
pingqiu
4a876a9cd2 G6 §close + 5 INVs inscribed in ledger + roadmap closure
m01 single-run GREEN at 71 s on seaweed_block@96c51b4 — both §2 #4
(retention-OK catch-up) AND §2 #5 (sustained-write recycle →
rebuild dispatch + 5000-LBA byte-equal) in one closed-loop run per
architect §2 #6 binding.

Logs: V:\share\g5-test\logs\g6-20260428T100217Z.log
Scenarios: V:\share\g5-test\scenarios-g6.sh + scenarios\g6-d.sh

Mini-plan §close:
- §close.summary: 8-row table of bindings + commits + hardware
  + regression status, all GREEN.
- §close.evidence: software-pin (3 commits, 10 tests / 14 cases
  PASS); hardware-pin (5 acceptance rows, all GREEN; single 71 s run).
- §close.deltas: 2 entries documenting (a) physical-recycle NOT
  required for §2 #5 (engine recovery decision branch is
  load-bearing) and (b) harness discipline finding from QA.
- §close.findings: 2 findings — (1) data-vs-state convergence
  harness discipline → new INV; (2) §1.H audit verdict was
  correct + resolved in-batch.
- §close.forward-carries: G5-2/G5-6 (durability mode), G5-3
  (peer-state surface), future replica-aware retention (β/γ),
  G7 (rebuild path semantics).

5 INVs inscribed in v3-invariant-ledger.md:
- INV-G6-WALRECYCLE-DISPATCHES-REBUILD
- INV-G6-CATCHUP-CONVERGES-WITHIN-RETENTION
- INV-G6-RETENTION-POLICY-OPERATOR-VISIBLE
- INV-G6-ENGINE-NO-REBUILD-PINNED-ON-OTHER-FAILURES
- INV-G6-HARNESS-DATA-AND-STATE-CONVERGENCE (NEW from §close.findings #1)

INV-G6-RETENTION-POLICY-REPLICA-AWARE NOT inscribed — reserved for
future β/γ replica-aware retention batch (architect §1.A α
ratification 2026-04-29).

Roadmap §3 G6 line: next → closed 2026-04-28 (retention-aware
recovery; α config knob + escalation pin).
Roadmap §7: G6 row added to recently-closed table.
Roadmap §8 backlog: G6-T-WALRECYCLE-ESCALATE → "Closed backlog
tickets" section with verdict (a) + resolution narrative.

Awaiting architect single-sign on this §close.
2026-04-28 11:03:26 -07:00
pingqiu
050c3ff875 G6 mini-plan v0.1 (kickoff draft for architect ratification)
Per architect ruling 2026-04-28 on G6 scope (post-G5-5C close):

Bindings absorbed:
- G6-T-WALRECYCLE-ESCALATE folded into G6 main acceptance, not
  a separate sub-batch (architect ruling #1).
- §1 AC = single closed-loop covering retention-OK catch-up +
  recycle-triggered escalation in ONE hardware scenario
  (architect ruling #1, "observably it is one thing; do not re-run sustained").
- §1.A = WAL retention policy options (α config knob / β
  pin-window / γ replica-watermark-driven). sw recommends α for
  smallest diff + fastest ratification; β/γ are richer, naturally
  G6-followup territory if escalation path proves clean first.
- §1.H = audit-then-decide on code surface; do NOT pre-declare
  "zero code". Three possible verdicts: verify-only /
  minor-patch / engine-evolution-batch (halt condition).

§2 acceptance criteria (8 items):
- #1 §1.H audit published as commit note before any production
  code.
- #2 §1.A bound + landed.
- #3 engine-layer dispatch test pinning RecoveryFailureWALRecycled
  → RebuildPinned=true → next decide() emits StartRebuild.
- #4 hardware retention-OK catch-up GREEN.
- #5 hardware recycle-escalation GREEN (rebuild dispatch within
  deadline OR documented operator-failure-mode if §1 binds
  rebuild-as-NON-GOAL; the architect's product-statement escape clause).
- #6 #4 + #5 pass in SAME hardware run (one closed-loop AC).
- #7 no regression on G5-5C 6-step suite.
- #8 zero diff under master/authority/proto (carries
  INV-G5-5C-NO-MASTER-PROTOCOL-CHANGE discipline).
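
Criterion #3 describes a small state transition that can be sketched in Go. Everything below is a hypothetical model: `failureKind`, `peerState`, the `failureOther` value, and the catch-up fallback in `decide` are assumptions made for illustration, not the engine's actual types.

```go
package main

import "fmt"

// failureKind is a hypothetical enumeration of recovery failures.
type failureKind int

const (
	failureOther       failureKind = iota // any non-recycle failure (assumed)
	failureWALRecycled                    // stands in for RecoveryFailureWALRecycled
)

// peerState is a hypothetical per-peer engine record.
type peerState struct {
	RebuildPinned bool
}

// onRecoveryFailure pins rebuild only for the WAL-recycled case.
func (p *peerState) onRecoveryFailure(k failureKind) {
	if k == failureWALRecycled {
		p.RebuildPinned = true
	}
}

// decide returns the next command name. The catch-up fallback is an
// assumption; the source only states that a pinned peer gets StartRebuild.
func (p *peerState) decide() string {
	if p.RebuildPinned {
		return "StartRebuild"
	}
	return "StartCatchUp"
}

func main() {
	p := &peerState{}
	p.onRecoveryFailure(failureOther)
	fmt.Println(p.decide())
	p.onRecoveryFailure(failureWALRecycled)
	fmt.Println(p.decide())
}
```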

§3 INVs to inscribe at close (4 always + 1 conditional):
- INV-G6-WALRECYCLE-DISPATCHES-REBUILD
- INV-G6-CATCHUP-CONVERGES-WITHIN-RETENTION
- INV-G6-RETENTION-POLICY-OPERATOR-VISIBLE (only if α)
- INV-G6-RETENTION-POLICY-REPLICA-AWARE (only if β/γ)
- INV-G6-ENGINE-NO-REBUILD-PINNED-ON-OTHER-FAILURES

Forward-carry from G5-5C consumed (§5):
- G6-T-WALRECYCLE-ESCALATE — primary scope of this batch.
- Evidence: V:\share\g5-test\logs\bcd-20260428T072539Z.log D-section.
- QA wait_until_rebuild_dispatched helper held until §1-§6
  ratified.

§7 sign table awaits architect §1-§6 ratification (especially
§1.A α/β/γ pick) before sw runs §1.H audit.

Standing by for architect ratification.
2026-04-28 01:31:54 -07:00
pingqiu
1207fc5444 Roadmap §8: queue G6-T-WALRECYCLE-ESCALATE backlog ticket from G5-5C QA scenario D
Per architect ruling 2026-04-28 + sw §close.appendix: D's WALRecycled
boundary finding is G6 territory, not a G5-5C reopener. Adding the
backlog ticket here so it doesn't get lost between G5-5C close and
G6 kickoff.

Ticket text + evidence pointer + cross-references all preserved
from the §close.appendix; this is the dev-roadmap-side mirror so
the ticket surfaces when planning G6 scope.

Standing by for architect final §close single-sign on G5-5C.
2026-04-28 00:40:40 -07:00
pingqiu
5069e74445 G5-5C §close.appendix: QA scenario expansion (B/C confidence + D → G6 carry)
Per architect ruling 2026-04-28 on QA's expanded scenario report:
- A (capacity): 🐛 already-fixed at seaweed_block@a250b52, INV inscribed.
- B (500 random LBAs over 65536-LBA volume): GREEN. Confidence
  bump on dirty-map skew + ship order under random write pattern.
- C (kill replica mid-write-storm + restart + 200 LBAs converge):
  GREEN. Highest-signal recovery scenario in the expansion;
  validates G5-5C peer-recovery trigger under load.
- D (5000-LBA sustained write → WALRecycled past replica LSN):
  🐛 boundary finding. Architect: G6 territory, not G5-5C reopener.
  Catch-up requires WAL retention; rebuild path is for gap-beyond-
  WAL. Engine has dispatch-branch tests (Batch 4); runtime
  escalation path under sustained pressure is G6 acceptance scope.

Doc updates:
- New §close.appendix table with all 4 scenario rows + dispositions.
- Semantic clarification on D — catch-up vs WAL recycle vs rebuild.
- §close.forward-carries gets a NEW G6 entry with backlog ticket
  text, evidence pointer, cross-reference to INV-G5-5C-PROBE-BEFORE-
  CATCHUP, and explicit non-reopener rationale.
- Logs + scenario script paths recorded for QA continuity.

§close substance unchanged: G5-5C gate (verify_restart_catchup
GREEN within 30 s) was met on the canonical case at
seaweed_block@712cbc47 + capacity addendum at a250b52. B/C are
strengthening, not gating; D is forward-carry.

Awaiting architect final §close single-sign on this tree.
2026-04-28 00:39:54 -07:00
pingqiu
a5c39fde34 G5-5C addendum: inscribe INV-G5-FRONTEND-CAPACITY-FROM-DURABLE-CONFIG
Per architect ruling 2026-04-28 + sw addendum landing at
seaweed_block@a250b52: inscribe new INV in the ledger.

Statement: iSCSI/NVMe externally-visible volume capacity and block
size MUST derive from --durable-blocks × --durable-blocksize when
--durable-root is set, not silently fall back to frontend defaults
(DefaultVolumeBlocks=2048 × DefaultBlockSize=512 = 1 MiB). Without
this plumb-through, a daemon configured for N MiB durable storage
advertises a 1 MiB iSCSI/NVMe LUN and any workload above LBA 256
fails.
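
A minimal Go sketch of the capacity rule stated above, with the zero-rejection and overflow guard that the test names suggest. The function `frontendVolumeSize` and its signature are hypothetical; the actual computeFrontendVolumeSize in cmd/blockvolume/main.go is not reproduced here.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// frontendVolumeSize returns blocks x blockSize in bytes, rejecting zero
// inputs and products that would overflow uint64 (hypothetical helper).
func frontendVolumeSize(blocks, blockSize uint64) (uint64, error) {
	if blocks == 0 || blockSize == 0 {
		return 0, errors.New("durable blocks and block size must be non-zero")
	}
	if blocks > math.MaxUint64/blockSize {
		return 0, errors.New("durable blocks x block size overflows uint64")
	}
	return blocks * blockSize, nil
}

func main() {
	// Frontend defaults quoted above: 2048 blocks of 512 bytes.
	size, err := frontendVolumeSize(2048, 512)
	fmt.Println(size, err)
}
```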

Test pointers: cmd/blockvolume/frontend_capacity_test.go (6 tests:
ProductOfBlocksAndBlockSize, RejectsZero, OverflowGuard,
IscsiHandlerCapacity, NvmeHandlerCapacity, FrontendDefaults_
StillReturn1MiB). Source-side: cmd/blockvolume/main.go::
computeFrontendVolumeSize flows into both iscsi.TargetConfig and
nvme.TargetConfig handler.

First introduced: P15 G5-5C addendum (P0 product fix).
Owner layer: host (binary, frontend wiring).
Last verified: 2026-04-28 (G5-5C addendum P0; m01 hardware re-
verification pending QA).
Status: ACTIVE.

Awaiting m01 hardware re-run for full §close ledger update.
2026-04-27 23:19:32 -07:00
pingqiu
9a2c939b9a G5-5C §close: ALL 6 m01 hardware verify steps GREEN — L4 reached
m01 hardware run 4 at seaweed_block@712cbc47 (with Batch #7
per-peer adapter wiring) — full results:

  #1 verify_cluster_ready        GREEN
  #2 verify_byte_equal           GREEN
  #3 verify_network_catchup      GREEN (9s)
  #4 verify_restart_catchup      GREEN (9s)  ← Batch #7 unblocked
  #5 verify_race_stress (×10)    GREEN
  #6 verify_full_suite           GREEN

§close updated:
- Header: closes at L4 Replicated IO with peer-restart resilience.
- §close.evidence hardware-pin row table: run 4 results.
- Earlier-runs row table preserved for artifact retention (run 2
  port-release race; run 3 per-peer adapter gap; both root-caused
  and fixed).
- §close.findings 'per-peer adapter gap' marked RESOLVED by Batch #7.
- §close.deltas: forward-carry to G5-5D dropped (absorbed in-batch).
- §close.forward-carries: G5-5D removed; only G5-5 deferred ledger
  pointers + G5-2/G5-3/future master observability remain.
- architect-review-checklist: scope truth, engine impact, product
  level all updated to reflect L4 reached on hardware.

INV-G5-5C-PER-PEER-ADAPTER-PER-PEER-ENGINE inscribed at this close
(no longer deferred).

Awaiting QA evidence verification + architect single-sign per
v3-batch-process.md §5 + §8C.2.
2026-04-27 18:13:35 -07:00
pingqiu
88cff6145c G5-5C: §1.I plan extension for Batch #7 (per-peer adapter wiring)
Architect approved Option B 2026-04-27: absorb the hardware-revealed
gap into G5-5C as Batch #7 instead of carrying to G5-5D.

§1.I scope:
- core/host/volume/peer_command_executor.go (NEW, ~120 LOC)
- core/host/volume/peer_adapter_registry.go (NEW, ~100 LOC)
- core/replication/volume.go ConfigurePeerLifecycleHook (~30 LOC)
- core/host/volume/probe_loop_wiring.go router signature (~20 net)
- cmd/blockvolume/main.go registry wire-up (~20 net)
- ~10 new tests, ~250 LOC test code

INV INV-G5-5C-PER-PEER-ADAPTER-PER-PEER-ENGINE absorbed back
in-batch (was previously deferred to G5-5D in pre-architect-ruling
draft).

Pass criterion unchanged: m01 verify_restart_catchup GREEN within
30s deadline; #1-#3 regression GREEN in the same run.

§close updated: ceremony waits for Batch #7 land + hardware re-run;
G5-5C closes at full L4 in one shot.
2026-04-27 17:46:10 -07:00
pingqiu
389896b5e4 G5-5C §close: m01 #1-#3 GREEN, #4 RED — hardware-revealed gap, carries to G5-5D
m01 hardware run 3 at seaweed_block@ac9392d:
- #1 verify_cluster_ready   GREEN
- #2 verify_byte_equal      GREEN
- #3 verify_network_catchup  GREEN (9s)
- #4 verify_restart_catchup  RED (30s timeout)

Root cause (verified in code + log):
Primary log shows probe loop fired correctly post-restart and the
wire probe SUCCEEDED twice (R=2 S=1 H=3), but no StartCatchUp was ever
dispatched. Engine apply.go:117-128 checkReplicaID drops events
whose ReplicaID doesn't match the adapter's tracked Identity —
cmd/blockvolume's host adapter tracks the PRIMARY'S OWN slot
(ReplicaID=r1), not peer r2. Probe results for r2 are correctly
dropped as wrong_replica.

Component test (Batch #6) passed because cluster.go's
WithEngineDrivenRecovery constructs c.primary.adapters[] — one
per peer. cmd/blockvolume only constructs ONE adapter for the
host's own slot. The component test exercised a different
(architecturally-correct) wiring than production has.
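
The difference between the two wirings can be illustrated with a minimal Go sketch. Every name in it (`adapter`, `registry`, `probeResult`, `handle`, `route`) is hypothetical and does not correspond to the actual seaweed_block types; it only shows why a single adapter tracking r1 drops a probe result for r2, while a per-peer registry accepts it.

```go
package main

import "fmt"

// probeResult is a hypothetical stand-in for an engine event that
// carries the replica it refers to plus the R/S/H boundaries.
type probeResult struct {
	ReplicaID string
	R, S, H   uint64
}

// adapter tracks exactly one replica identity. Events for any other
// replica are dropped, mirroring the checkReplicaID behaviour
// described in the root-cause note.
type adapter struct {
	tracked  string
	accepted int
}

// handle returns "accepted" or "wrong_replica".
func (a *adapter) handle(ev probeResult) string {
	if ev.ReplicaID != a.tracked {
		return "wrong_replica"
	}
	a.accepted++
	return "accepted"
}

// registry holds one adapter per peer and routes events by ReplicaID.
type registry struct {
	byPeer map[string]*adapter
}

func newRegistry(peers ...string) *registry {
	r := &registry{byPeer: make(map[string]*adapter)}
	for _, p := range peers {
		r.byPeer[p] = &adapter{tracked: p}
	}
	return r
}

// route delivers the event to the adapter that owns the peer, if any.
func (r *registry) route(ev probeResult) string {
	a, ok := r.byPeer[ev.ReplicaID]
	if !ok {
		return "unknown_peer"
	}
	return a.handle(ev)
}

func main() {
	ev := probeResult{ReplicaID: "r2", R: 2, S: 1, H: 3}

	// Production-like wiring: one adapter for the host's own slot.
	single := &adapter{tracked: "r1"}
	fmt.Println("single adapter:", single.handle(ev))

	// Component-test-like wiring: one adapter per peer.
	reg := newRegistry("r2")
	fmt.Println("per-peer registry:", reg.route(ev))
}
```

In this sketch the drop is correct behaviour for the single adapter; the gap is that nothing else was constructed to receive the event.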

§1.H audit verdict was correct on engine SEMANTICS; it did not
extend to whether the production binary CONSTRUCTS per-peer engine
state. That layer was assumed; hardware revealed the assumption.

§close decision:
- G5-5C software pieces all sound, stay landed (50 unit + integ
  tests PASS; full ./... regression PASS).
- Hardware finding carries to G5-5D — Per-peer adapter wiring for
  primary-side recovery dispatch.
- G5-5D pass criterion = exact verify_restart_catchup case from
  this run; seed evidence = sw-block/design/g5-artifacts/primary-fail.log.
- New INV to inscribe at G5-5D close:
  INV-G5-5D-PER-PEER-ADAPTER-PER-PEER-ENGINE.

Doc updates:
- §close.evidence: hardware-pin row table filled with run 3 results.
- §close.deltas: 3 implicit assumptions surfaced.
- §close.findings: 2 findings (#1 per-peer adapter gap; #2 script
  port-release race already fixed).
- §close.forward-carries: G5-5D added as named carry.
- architect-review-checklist: scope/audit/engine-impact/product
  level all updated to reflect actual reached state (L3+, not L4).

Awaiting architect ratification of G5-5D scope at single-sign or
earlier; sw drafts G5-5D mini-plan once architect rules.
2026-04-27 17:40:41 -07:00
pingqiu
a15d13a02c G5-5C §close skeleton: software pin + hardware-pin TBD rows
Per v3-batch-process.md §2: §close drafted as soon as software is
ready. Hardware row table left as TBD; sw fills evidence pointers
once iterate-m01-replicated-write.sh completes. Forward-carries +
deferred ledger pointers + architect-review-checklist all populated
based on G5-5C scope already in-batch.

Awaiting:
1. m01 hardware run completion → fill #1-#4 evidence rows
2. QA evidence verification → §close.deltas / findings if needed
3. architect single-sign per v3-batch-process.md §5 + §8C.2
2026-04-27 17:33:51 -07:00
pingqiu
9245446b59 G5-5C §1.H code-start audit: PROCEED — all halt-conditions clear
Per v0.5 §1.H step 3, sw publishes audit findings as a commit note
before any G5-5C production code change.

AUDIT METHOD: grepped seaweed_block/core/{engine,replication,adapter}
for the structural backing of each in-scope INV; cited apply.go +
state.go + replication/volume.go + adapter/adapter.go line numbers
as evidence.

PER-INV FINDINGS:

[1] INV-G5-5C-PRIMARY-RECOVERY-AUTHORITY-BOUNDED
    Owner: core/replication/volume.go (ReplicationVolume.peers map)
    Status: PASS. peers map is sole probe target collection;
    UpdateReplicaSet is sole mutator and is master-fact-driven only.
    Halt-cond cleared.

[2] INV-G5-5C-GENERATION-FENCE
    Owner: core/engine/apply.go:132-166 (stale event rejection) +
    state.go:24-32 (IdentityTruth.{Epoch, EndpointVersion} carrier)
    Status: PASS. Engine rejects events with epoch < Identity.Epoch
    or (epoch == Identity.Epoch AND ev < Identity.EndpointVersion).
    identityChanged triggers wholesale Recovery reset (lines 166-169).
    Fence is
    carried on engine state, not re-derived per call site.
    Halt-cond cleared.

[3] INV-G5-5C-SINGLE-INFLIGHT-PER-PEER
    Owner: core/engine/state.go:144-151 (SessionTruth single-slot) +
    apply.go phase-guards at 183/236/364/417/442/455/472/507/536
    Status: PASS. ReplicaState.Session is one slot per peer.
    Engine FSM handlers explicitly skip / reject when Phase is
    PhaseStarting or PhaseRunning. apply.go:536 "Skip if a rebuild
    session already exists" pinned. In-flight is engine-explicit,
    not implicit. Halt-cond cleared.

[4] INV-G5-5C-PROBE-BEFORE-CATCHUP
    Owner: core/engine/state.go:84-121 (RecoveryTruth) +
    decide() probe-driven decision path
    Status: PASS. RecoveryTruth.Decision is derived from R/S/H
    (boundaries from probe), NOT from transport reachability.
    Engine's RebuildPinned guard prevents stale auto-probe from
    downgrading Rebuild back to CatchUp mid-flight (lines 105-120).
    Halt-cond cleared.

[5] INV-G5-5C-RECOVERY-BACKOFF
    Owner: engine retry budget (state.go:91-103
    RecoveryTruth.Attempts + RuntimePolicy.MaxRetries from T4c-3) +
    NEW G5-5C runtime cooldown (5s base → 10s → 20s → 40s → 60s cap;
    reset on success)
    Status: ⚠ PARTIAL — engine has retry budget but no exponential
    cooldown. G5-5C adds the cooldown as a primary-runtime policy on
    top of engine retry budget. NOT an engine FSM change. Acceptable
    under §1.H "minimum evolution" criterion. Halt-cond cleared.

[6] INV-G5-5C-STALE-ACK-NO-HEALTH-PROMOTION
    Owner: core/engine/apply.go:766-789 (Healthy gate)
    Status: PASS. Healthy = true requires three conjuncts:
    (a) Recovery.Decision == DecisionNone, (b) Reachability.Status
    == ProbeReachable, (c) Identity.Epoch <= Reachability.FencedEpoch.
    A barrier ack with AchievedLSN < TargetLSN does not transition
    SessionTruth, and decide() does not flip Decision to None on
    insufficient achieved LSN, so Healthy stays false. Halt-cond
    cleared.
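
The three-conjunct Healthy gate in finding [6] can be sketched in Go as below. This is an illustration under stated assumptions, not the code at apply.go:766-789: the type and function names (`replicaState`, `healthy`, `applyBarrierAck`) are hypothetical, and the ack handling is reduced to "a short ack leaves Decision untouched".

```go
package main

import "fmt"

type decision int

const (
	decisionNone decision = iota
	decisionCatchUp
	decisionRebuild
)

type probeStatus int

const (
	probeUnknown probeStatus = iota
	probeReachable
)

// replicaState is a hypothetical flattening of the fields the gate reads.
type replicaState struct {
	Decision    decision
	Status      probeStatus
	Epoch       uint64
	FencedEpoch uint64
}

// healthy is true only when all three conjuncts hold.
func healthy(s replicaState) bool {
	return s.Decision == decisionNone &&
		s.Status == probeReachable &&
		s.Epoch <= s.FencedEpoch
}

// applyBarrierAck clears the recovery decision only when the ack
// reaches the target LSN (assumed behaviour for this sketch).
func applyBarrierAck(s *replicaState, achievedLSN, targetLSN uint64) {
	if achievedLSN < targetLSN {
		return // short or stale ack: Decision stays as it was
	}
	s.Decision = decisionNone
}

func main() {
	s := replicaState{
		Decision:    decisionCatchUp,
		Status:      probeReachable,
		Epoch:       3,
		FencedEpoch: 3,
	}

	applyBarrierAck(&s, 90, 100)
	fmt.Println("after short ack, healthy:", healthy(s))

	applyBarrierAck(&s, 100, 100)
	fmt.Println("after full ack, healthy:", healthy(s))
}
```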

OVERALL VERDICT: PROCEED.

All six in-scope INVs have their backing infrastructure in engine
(state.go + apply.go) or replication (volume.go). G5-5C is a runtime
wiring batch + small policy extension (backoff). No engine FSM
rewrite needed. No halt-condition fires; no engine-evolution
mini-plan required.
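
As an illustration of the generation fence in finding [2], the stale-event check can be sketched as a lexicographic comparison on (epoch, endpoint version). The sketch assumes the second clause means "same epoch and lower endpoint version"; `identity` and `isStale` are hypothetical names, not the engine's own.

```go
package main

import "fmt"

// identity is a hypothetical carrier for the two fence fields.
type identity struct {
	Epoch           uint64
	EndpointVersion uint64
}

// isStale reports whether an event stamped (epoch, ev) is older than
// the identity currently held in engine state.
func isStale(cur identity, epoch, ev uint64) bool {
	if epoch < cur.Epoch {
		return true
	}
	return epoch == cur.Epoch && ev < cur.EndpointVersion
}

func main() {
	cur := identity{Epoch: 5, EndpointVersion: 2}

	fmt.Println("older epoch:", isStale(cur, 4, 9))
	fmt.Println("same epoch, older endpoint version:", isStale(cur, 5, 1))
	fmt.Println("current:", isStale(cur, 5, 2))
	fmt.Println("newer epoch:", isStale(cur, 6, 0))
}
```

Keeping the comparison in one function that reads from engine state is what "not re-derived per call site" amounts to in this sketch.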

NEXT STEP: implement primary-side probe loop +
ReplicaPeer.ProbeIfDegraded() + lifecycle/cooldown/dispatch tests +
component test, all under core/replication/. Probe loop owned by
ReplicationVolume lifecycle per architect binding. Test method
names to be concretized as code-start commit-note addendum to §2.
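
A minimal sketch of the single in-flight guard that the probe loop relies on (finding [3]) is shown below. The names (`session`, `replicaState`, `tryStart`, `finish`) are hypothetical; the only property illustrated is that a new session is refused while one is starting or running.

```go
package main

import "fmt"

type phase int

const (
	phaseIdle phase = iota
	phaseStarting
	phaseRunning
)

// session is the single recovery slot for one peer.
type session struct {
	ID    uint64
	Phase phase
}

// replicaState holds exactly one session slot per peer.
type replicaState struct {
	Session session
	nextID  uint64
}

// tryStart mints a new session unless one is already starting or
// running. It returns the active session ID and whether it was new.
func (r *replicaState) tryStart() (uint64, bool) {
	if r.Session.Phase == phaseStarting || r.Session.Phase == phaseRunning {
		return r.Session.ID, false
	}
	r.nextID++
	r.Session = session{ID: r.nextID, Phase: phaseStarting}
	return r.Session.ID, true
}

// finish releases the slot so a later probe can start a new session.
func (r *replicaState) finish() {
	r.Session.Phase = phaseIdle
}

func main() {
	var r replicaState

	id, started := r.tryStart()
	fmt.Println("first start:", id, started)

	id, started = r.tryStart()
	fmt.Println("second start while in flight:", id, started)

	r.finish()
	id, started = r.tryStart()
	fmt.Println("start after finish:", id, started)
}
```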

This audit commit fulfills §1.H step 3 (audit findings published) +
§2 #15 (audit commit note before production code).
2026-04-27 15:21:14 -07:00
pingqiu
74e92b974d G5-5C mini-plan v0.4.5 → v0.5: single-sign recorded + §1 scope-rule one-liner
Architect single-signed §1-§6 at seaweedfs@ba7bd0ba4 2026-04-27 with:
- Option B trigger source (primary-side degraded-peer probe loop)
- Probe loop placement = core/replication/ owned by ReplicationVolume
- Master protocol unchanged
- §1.H code-start audit gate before code

This commit:
1. Records the single-sign in the doc header.
2. Adds a §1 scope-rule one-liner near the top so future readers find
   the architect-bound boundary without re-reading the v0.1→v0.5 trail:
   "master owns identity/topology; primary+engine own data recovery;
   the protocol aligns the two via (PeerSetGeneration, epoch,
   EndpointVersion) fences."

§1.A already bound Option B in v0.4; no flip needed there. No design
change. §1.H audit is the next sw step before any production code.
2026-04-27 15:19:00 -07:00
pingqiu
ba7bd0ba48 G5-5C mini-plan v0.4.4 → v0.4.5: doc-hygiene cleanup + probe loop placement bound
Architect approves v0.4.4 substance (Option B; master unchanged;
no PeerSetGeneration change) but requires five doc-hygiene fixes
before single-sign:

1. §1 #3 V2 path "weed/server/" → V3 "core/replication/" + reword
   from "shipper re-arms" V2 vocabulary to "probe loop detects
   degraded peer reconnection".
2. §1 Architecture truth-domain check: dropped v0.3 / A1 /
   "publication / re-emission" residue. Now points cleanly to §1.C.
3. §2 "#3a/#3b/#4" v0.2 naming residue: rewritten to reference
   acceptance criteria #2-#15 with package-level verifier files
   (peer_test.go, probe_loop_test.go, volume_test.go, component/).
   Test method names concretized at code-start as commit-note
   addendum (no re-ratification needed).
4. Architect review checklist "Engine / adapter impact" reworded:
   "No new engine recovery primitive by default; engine-owned
   fences/state audited at §1.H code-start; if found insufficient,
   sw halts G5-5C and starts engine-evolution mini-plan rather
   than layering ifs in core/replication/."
5. §1.A loop owner row bound: probe loop placement = core/replication/
   owned by ReplicationVolume lifecycle (NOT host layer). Reasoning:
   admitted peers + peer state + close/teardown + in-flight guard
   all in core/replication/; host only forwards flags/config. §1.H
   halt rule preserved: audit may still escalate to engine-evolution.

§7 sign table records substance approval 2026-04-27 + probe loop
placement binding + awaits single-sign of v0.4.5.

Standing by for architect single-sign.
2026-04-27 15:15:40 -07:00
pingqiu
9b6e103dde G5-5C mini-plan v0.4.3 → v0.4.4: engine/runtime/master split + 6 boundary rules in scope, 3 forward-carry, audit gate
Architect framing 2026-04-27: enumerate ten protocol boundary rules
and address engine-evolution question.

Engine vs primary runtime vs master split:
- Engine owns: recovery FSM, single in-flight per peer,
  generation/epoch fence, probe→decision, backoff/cooldown policy,
  stale-ack-cannot-promote-health rule, recovery reason / projection
- Primary runtime/adapter owns: timer / degraded-peer loop, transport
  probe execution, feeding probe result into engine, executing
  engine-emitted commands, ReplicationVolume / ReplicaPeer connection
  lifecycle
- Master owns: identity / topology / assignment / health observation
  ONLY. No runtime recovery. No epoch bumps for short up/down.

Six in-scope boundary rules (#1, #2, #3, #4, #7, #8):
- #1 Admitted Peer Rule — already INV-G5-5C-PRIMARY-RECOVERY-AUTHORITY-BOUNDED
- #2 Generation Fence — NEW INV-G5-5C-GENERATION-FENCE
- #3 Single In-Flight Per Peer — NEW INV-G5-5C-SINGLE-INFLIGHT-PER-PEER
- #4 Probe Before Catch-Up — NEW INV-G5-5C-PROBE-BEFORE-CATCHUP
- #7 Backoff/Cooldown — NEW INV-G5-5C-RECOVERY-BACKOFF (extends v0.4
  fixed-5s into 5s→10s→20s→40s→60s cap, reset on success)
- #8 Stale Ack Guard — NEW INV-G5-5C-STALE-ACK-NO-HEALTH-PROMOTION
  (cross-refs G5-5 round-14 gate-degraded artifact)
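
The backoff progression named in rule #7 (5s, 10s, 20s, 40s, then a 60s cap, reset on success) can be sketched as follows. The `cooldown` type and its methods are hypothetical and only illustrate the arithmetic, not the runtime policy as implemented.

```go
package main

import (
	"fmt"
	"time"
)

const (
	baseCooldown = 5 * time.Second
	maxCooldown  = 60 * time.Second
)

// cooldown tracks consecutive recovery failures for one peer.
type cooldown struct {
	failures int
}

// next returns the wait before the next attempt and records a failure.
// The wait doubles per consecutive failure and is capped at maxCooldown.
func (c *cooldown) next() time.Duration {
	d := baseCooldown
	for i := 0; i < c.failures && d < maxCooldown; i++ {
		d *= 2
	}
	if d > maxCooldown {
		d = maxCooldown
	}
	c.failures++
	return d
}

// reset is called after a successful recovery.
func (c *cooldown) reset() {
	c.failures = 0
}

func main() {
	c := &cooldown{}
	for i := 0; i < 6; i++ {
		fmt.Println("attempt", i+1, "wait", c.next())
	}
	c.reset()
	fmt.Println("after success, wait", c.next())
}
```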

Three forward-carries OUT of G5-5C (per §5):
- #5 Durability Mode Explicit → G5-2 / G5-6
- #6 RF Health Reporting Separate From Recovery → future master
  observability batch
- #10 Status Surface (recovery reason, effective RF, last probe) →
  G5-3 metrics/backpressure

One citation (#9 Replica-side lineage check): already enforced by T4
acceptMutationLineage gate; G5-5C cites, no new code.

§1.H code-start audit gate: sw audits per-INV current owner location
BEFORE writing any code. Halt-condition: if recovery FSM is embedded
in ReplicationVolume, fence is re-derived per call site, in-flight is
implicit, or stale-ack guard is missing — sw stops and re-scopes as
engine-evolution batch instead of layering ifs in core/replication/.
Audit findings published as commit note pre-code; PR includes
audit-summary.

§2 acceptance criteria: add #13 (stale-ack guard), #14 (backoff
progression), #15 (code-start audit). Acceptance count now 15
covering 7 INVs (6 new + reconnect orthogonality from v0.4.3).

Standing by for architect single-sign of v0.4.4.
2026-04-27 15:11:45 -07:00
pingqiu
13eb8181d3 G5-5C mini-plan v0.4.2 → v0.4.3: add §1.F reconnect orthogonal axes
Architect framing 2026-04-27 (sharpening v0.4.2): reconnect splits
along two orthogonal dimensions — connection recovery vs identity /
lineage change. Each axis has different protocol semantics; G5-5C
must handle both correctly.

Architect's protocol judgment points:
1. PeerSetGeneration only changes for identity / address / lineage
   change. Brief disconnects / restarts / freshness flapping do NOT
   bump generation.
2. Primary's degraded-peer loop only acts on currently-admitted peers
   (§1.E reaffirmed).
3. After reconnect, primary still probes R/S/H — reconnect alone is
   not assumed sufficient.
4. If a higher PeerSetGeneration arrives during reconnect / probe,
   the in-flight recovery must stop or invalidate.
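
Points 1 and 4 together can be sketched as a generation-bound cancellation: a recovery is started under one PeerSetGeneration and is cancelled only when a strictly higher generation is observed. `peerRecovery`, `start`, and `observeGeneration` are hypothetical names; using `context` cancellation is an assumption of this sketch, not a statement about how the code does it.

```go
package main

import (
	"context"
	"fmt"
)

// peerRecovery binds one in-flight recovery to the generation it
// was started under.
type peerRecovery struct {
	generation uint64
	cancel     context.CancelFunc
}

// start begins a recovery under gen and returns the context the
// recovery work should watch for cancellation.
func (p *peerRecovery) start(gen uint64) context.Context {
	ctx, cancel := context.WithCancel(context.Background())
	p.generation = gen
	p.cancel = cancel
	return ctx
}

// observeGeneration cancels the in-flight recovery when a strictly
// higher generation arrives. An equal or lower generation (for
// example after a brief disconnect) changes nothing.
func (p *peerRecovery) observeGeneration(gen uint64) bool {
	if gen <= p.generation {
		return false
	}
	if p.cancel != nil {
		p.cancel()
		p.cancel = nil
	}
	p.generation = gen
	return true
}

func main() {
	p := &peerRecovery{}
	ctx := p.start(7)

	fmt.Println("same generation invalidates:", p.observeGeneration(7))
	fmt.Println("recovery cancelled:", ctx.Err() != nil)

	fmt.Println("higher generation invalidates:", p.observeGeneration(8))
	fmt.Println("recovery cancelled:", ctx.Err() != nil)
}
```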

Changes:
- New §1.F with two cases:
  Case 1 (identity unchanged): primary retries existing peer
    descriptor; new sessionID minted (session IDs are scoped to a
    session, not to a peer); probe R/S/H; catch-up / rebuild as needed;
    no master re-emit needed. This is G5-5C's core path.
  Case 2 (identity changed): existing UpdateReplicaSet T4a-5 path
    (volume.go:229-246) tears down + recreates; in-flight aborts via
    Close(); new peer with new lineage takes over.
- Misread guards documented: "primary keeps retrying old address
  forever" rejected by Case 2 + §1.E (c); "master must bump on every
  blip" rejected by Case 1 + §1.D.
- New INV-G5-5C-RECONNECT-ORTHOGONAL-AXES in §3.
- New §2 #11 (reconnect Case 1 — identity unchanged, no re-emit) and
  §2 #12 (reconnect Case 2 — lineage bump mid-flight).

This is structural reaffirmation: the V3 code already does Case 2
correctly (T4a-5 teardown). Case 1 is what the probe loop adds. The
new tests pin both axes against future drift.

Standing by for architect single-sign of v0.4.3.
2026-04-27 15:07:17 -07:00