13089 Commits

Author SHA1 Message Date
pingqiu
512bb5bcf6 fix: orchestrator owns full catch-up contract (budget + truncation)
CompleteCatchUp now integrates:
- BeginCatchUp with start tick (freezes target)
- RecordCatchUpProgress (skips if already converged, e.g., truncation-only)
- CheckBudget at completion tick (escalates to NeedsRebuild + logs)
- RecordTruncation before completion (logs truncation_recorded)
- Logs causal reason for every rejection/escalation

CatchUpOptions: StartTick/CompleteTick (separate) + TruncateLSN.

3 new orchestrator-level tests:
- ReplicaAhead_TruncateViaOrchestrator: truncation through entry path
- ReplicaAhead_NoTruncate_CompletionRejected: logs completion_rejected
- BudgetEscalation_ViaOrchestrator: budget violation → NeedsRebuild + logs

Observability tests relabeled as sender-level (not entry-path).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 11:04:34 -07:00
pingqiu
adaff8ddb3 fix: only log endpoint_changed when endpoint actually changed
ProcessAssignment now compares pre/post endpoint state before
logging session_invalidated with "endpoint_changed" reason.
A normal session supersede (same endpoint, assignment_intent) is no
longer mislabeled as an endpoint change.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 08:10:35 -07:00
pingqiu
5cdee4a011 fix: orchestrator owns zero-gap completion and per-replica invalidation logging
Zero-gap completion:
- ExecuteRecovery auto-completes zero-gap sessions (no sender call needed)
- RecoveryResult.FinalState = StateInSync for zero-gap

Epoch transition:
- UpdateSenderEpoch: orchestrator-owned epoch advancement with auto-log
- InvalidateEpoch: per-replica session_invalidated events (not aggregate)

Endpoint-change invalidation:
- ProcessAssignment detects session ID change from endpoint update
- Logs per-replica session_invalidated with "endpoint_changed" reason

All integration tests now use orchestrator exclusively for core lifecycle.
No direct sender API calls for recovery execution in integration tests.

1 new test: EndpointChange_LogsInvalidation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 01:01:53 -07:00
pingqiu
47238df0d7 fix: add RecoveryOrchestrator as real integrated entry path
New: orchestrator.go — RecoveryOrchestrator drives recovery lifecycle
from assignment through execution to completion/escalation:
- ProcessAssignment: reconcile + session creation + auto-log
- ExecuteRecovery: connect → handshake from RetainedHistory → outcome
- CompleteCatchUp: begin catch-up → progress → complete + auto-log
- CompleteRebuild: connect → handshake → history-driven source →
  transfer → tail replay → complete + auto-log
- InvalidateEpoch: invalidate stale sessions + auto-log

All integration tests rewritten to use orchestrator as entry path.
No direct sender API calls in recovery lifecycle.

SessionSnapshot now includes: TruncateRequired/ToLSN/Recorded,
RebuildSource, RebuildPhase.

RecoveryLog is auto-populated by orchestrator at every transition.

7 integration tests via orchestrator:
- ChangedAddress, NeedsRebuild→Rebuild, EpochBump, MultiReplica
- Observability: session snapshot, rebuild snapshot, auto-populated log

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 00:25:58 -07:00
pingqiu
7436b3b79c feat: add integration closure and observability (Phase 05 Slice 4)
New files:
- observe.go: RegistryStatus, SenderStatus, RecoveryLog for debugging
- integration_test.go: V2-boundary integration tests through real
  engine entry path

Observability:
- Registry.Status() returns full snapshot: per-sender state, session
  snapshots, counts by category (InSync, Recovering, Rebuilding)
- RecoveryLog: append-only event log for recovery lifecycle debugging

Integration tests (6):
- ChangedAddress_FullFlow: initial recovery → address change →
  sender preserved → new session → recovery with proof
- NeedsRebuild_ThenRebuildAssignment: catch-up fails → NeedsRebuild
  → rebuild assignment → history-driven source → InSync
- EpochBump_DuringRecovery: mid-recovery epoch bump → old session
  rejected → new assignment at new epoch → InSync
- MultiReplica_MixedOutcomes: 3 replicas, 3 outcomes via
  RetainedHistory proofs, registry status verified
- RegistryStatus_Snapshot: observability snapshot structure
- RecoveryLog: event recording and filtering

Engine module at 54 tests (12 + 18 + 18 + 6).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 00:15:46 -07:00
pingqiu
4d06622c01 fix: add nil check for RetainedHistory in sender APIs
RecordHandshakeFromHistory and SelectRebuildFromHistory now
return an error instead of panicking on nil history input.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 23:57:19 -07:00
pingqiu
cc8c529962 fix: connect recovery decisions to RetainedHistory, fix rebuild source
RetainedHistory as engine input:
- RecordHandshakeFromHistory: sender-level API consuming RetainedHistory
  directly, returns RecoverabilityProof alongside outcome
- SelectRebuildFromHistory: sender-level API consuming RetainedHistory
  for rebuild-source decision

RebuildSourceDecision soundness:
- Now requires BOTH trusted checkpoint AND replayable tail
  (CheckpointLSN >= TailLSN and CommittedLSN <= HeadLSN)
- Trusted checkpoint with unreplayable tail falls back to full_base
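
The soundness rule above can be sketched as a single predicate. This is a minimal illustration, not the engine's code: the function name, the parameter list, and the string return values are assumptions; only the two inequalities and the snapshot_tail/full_base names come from the commit messages.

```go
package main

import "fmt"

// rebuildSource is a hypothetical helper. It picks snapshot_tail only when
// the checkpoint is trusted AND the tail after it is still replayable;
// every other combination falls back to full_base.
func rebuildSource(trusted bool, checkpointLSN, tailLSN, committedLSN, headLSN uint64) string {
	replayable := checkpointLSN >= tailLSN && committedLSN <= headLSN
	if trusted && replayable {
		return "snapshot_tail"
	}
	return "full_base"
}

func main() {
	// Trusted checkpoint, but the WAL tail has advanced past it.
	fmt.Println(rebuildSource(true, 40, 60, 100, 100))
}
```

The second conjunct is what the regression test TrustedCheckpoint_UnreplayableTail exercises: trust alone is not enough.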

4 new tests:
- TrustedCheckpoint_UnreplayableTail (the regression case)
- SenderDriven_CatchUp (history → proof → outcome → complete)
- SenderDriven_Rebuild_SnapshotTail (history → source → rebuild)
- SenderDriven_Rebuild_FallsBackToFullBase (unreplayable tail)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 23:55:31 -07:00
pingqiu
ff7ea41099 feat: add engine data/recoverability core (Phase 05 Slice 3)
New file: history.go — RetainedHistory connects recovery decisions
to actual WAL retention state:
- IsRecoverable: checks gap against tail/head boundaries
- MakeHandshakeResult: generates HandshakeResult from retention state
- RebuildSourceDecision: chooses snapshot+tail vs full base from
  checkpoint state (trusted vs untrusted)
- ProveRecoverability: generates explicit proof explaining why
  recovery is or is not allowed

14 new tests (recoverability_test.go):
- Recoverable/unrecoverable gap (exact boundary, beyond head)
- Trusted/untrusted/no checkpoint → rebuild source selection
- Handshake from retained history → outcome classification
- Recoverability proofs (zero-gap, ahead, within retention, beyond)
- E2E: two replicas driven by retained history (catch-up + rebuild)
- Truncation required for replica ahead of committed

Engine module at 44 tests (12 + 18 + 14).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 23:04:51 -07:00
pingqiu
368a956aee fix: correct catch-up entry counting and rebuild transfer gate
Entry counting:
- Session.setRange now initializes recoveredTo = startLSN
- RecordCatchUpProgress delta counts only actual catch-up work
  (recoveredTo - startLSN), not the replica's pre-existing prefix

Rebuild transfer gate:
- BeginTailReplay requires TransferredTo >= SnapshotLSN
- Prevents tail replay on incomplete base transfer

3 new regression tests:
- BudgetEntries_NonZeroStart_CountsOnlyDelta (30 entries within 50 budget)
- BudgetEntries_NonZeroStart_ExceedsBudget (30 entries exceeds 20 budget)
- Rebuild_PartialTransfer_BlocksTailReplay

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 21:35:03 -07:00
pingqiu
930de4ba78 feat: add Slice 2 recovery execution tests (Phase 05)
15 new engine-level recovery execution tests:
- Zero-gap / catch-up / needs-rebuild branching (3 tests)
- Stale execution rejection during active recovery (2 tests)
- Bounded catch-up: frozen target, duration, entries, stall (5 tests)
- Completion before convergence rejected
- Rebuild exclusivity: catch-up APIs excluded (1 test)
- Rebuild lifecycle: snapshot+tail, full base, stale ID (3 tests)
- Assignment-driven recovery flow

Engine module now at 27 tests (12 Slice 1 + 15 Slice 2).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 21:14:18 -07:00
pingqiu
61e9408261 fix: separate stable ReplicaID from Endpoint in registry
Registry is now keyed by stable ReplicaID, not by address.
DataAddr changes preserve sender identity — the core V2 invariant.

Changes:
- ReplicaAssignment{ReplicaID, Endpoint} replaces map[string]Endpoint
- AssignmentIntent.Replicas uses []ReplicaAssignment
- Registry.Reconcile takes []ReplicaAssignment
- Tests use stable IDs ("replica-1", "r1") independent of addresses

New test: ChangedDataAddr_PreservesSenderIdentity
- Same ReplicaID, different DataAddr (10.0.0.1 → 10.0.0.2)
- Sender pointer preserved, session invalidated, new session attached
- This is the exact V1/V1.5 regression that V2 must fix

doc.go: clarified Slice 1 core vs carried-forward files

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 21:06:11 -07:00
pingqiu
bb24b4b039 fix: encapsulate engine sender/session authority state
All mutable state on Sender and Session is now unexported:
- Sender.state, .epoch, .endpoint, .session, .stopped → accessors
- Session.id, .phase, .kind, etc. → read-only accessors
- Session() replaced by SessionSnapshot() (returns disconnected copy)
- SessionID() and HasActiveSession() for common queries
- AttachSession returns (sessionID, error) not (*Session, error)
- SupersedeSession returns sessionID not *Session

Budget configuration via SessionOption:
- WithBudget(CatchUpBudget) passed to AttachSession
- No direct field mutation on session from external code

New test: Encapsulation_SnapshotIsReadOnly proves snapshot
mutation does not leak back to sender state.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 20:58:28 -07:00
pingqiu
20d70f9fb6 feat: add V2 engine replication core (Phase 05 Slice 1)
Creates sw-block/engine/replication/ — the real V2 engine ownership core,
promoted from sw-block/prototype/enginev2/ with all accepted invariants.

Files:
- types.go: Endpoint, ReplicaState, SessionKind, SessionPhase, FSM transitions
- sender.go: per-replica Sender with full execution + rebuild APIs
- session.go: Session with identity, phases, frozen target, truncation, budget
- registry.go: Registry with reconcile + assignment intent + epoch invalidation
- budget.go: CatchUpBudget (duration, entries, stall detection)
- rebuild.go: RebuildState FSM (snapshot+tail vs full base)
- outcome.go: HandshakeResult + ClassifyRecoveryOutcome

Tests (ownership_test.go, 13 tests):
- Changed-address invalidation (A10)
- Stale session ID rejected at all APIs (A3)
- Stale completion after supersede (A3)
- Epoch bump invalidates all sessions (A3)
- Stale assignment epoch rejected
- Rebuild exclusivity (catch-up APIs rejected)
- Rebuild full lifecycle
- Frozen target rejects chase (A5)
- Budget violation escalates (A5)
- E2E: 3 replicas, 3 outcomes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 20:51:01 -07:00
pingqiu
26a1b33c2e feat: add A5-A8 acceptance traceability and rebuild-source evidence
Cleanup: removed redundant TargetLSNAtStart from CatchUpBudget.
FrozenTargetLSN on RecoverySession is the single source of truth.

Acceptance traceability (acceptance_test.go):
- A5: 3 evidence tests (unrecoverable gap, budget escalation, frozen target)
- A6: 2 evidence tests (exact boundary, contiguity required)
- A7: 3 evidence tests (snapshot history, catch-up replay, truncation)
- A8: 2 evidence tests (convergence required, truncation required)

Rebuild-source decision evidence:
- snapshot_tail when trusted base exists
- full_base when no snapshot or untrusted
- 3 explicit tests

13 new tests total.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 15:42:48 -07:00
pingqiu
8f5070679c fix: make frozen target intrinsic and rebuild completion exclusive
Frozen target is now unconditional:
- FrozenTargetLSN field on RecoverySession, set by BeginCatchUp
- RecordCatchUpProgress enforces FrozenTargetLSN regardless of Budget
- Catch-up is always a bounded (R, H0] contract

Rebuild completion exclusivity:
- CompleteSessionByID explicitly rejects SessionRebuild by kind
- Rebuild sessions can ONLY complete via CompleteRebuild

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 15:30:17 -07:00
pingqiu
8e4028758f fix: make rebuild path exclusive, enforce phase discipline, require tick for stall budget
Rebuild exclusivity:
- BeginCatchUp rejects SessionRebuild ("must use rebuild APIs")
- RecordCatchUpProgress rejects SessionRebuild
- Rebuild sessions can only be completed via CompleteRebuild
- All legacy rebuild-through-catch-up paths in tests converted

Phase discipline:
- SelectRebuildSource requires session.Phase == PhaseHandshake
- Cannot skip BeginConnect + RecordHandshake

Stall budget:
- RecordCatchUpProgress requires tick parameter when
  ProgressDeadlineTicks > 0 (no silent stall budget bypass)

3 new tests: rebuild exclusivity (catch-up APIs rejected),
rebuild source requires handshake phase, stall budget requires tick.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 15:21:39 -07:00
pingqiu
5b66a85f92 fix: wire rebuild FSM into sender, enforce frozen target, fix entry counting
Rebuild execution path:
- newRecoverySession auto-initializes RebuildState for SessionRebuild
- Sender rebuild APIs: SelectRebuildSource, BeginRebuildTransfer,
  RecordRebuildTransferProgress, BeginRebuildTailReplay,
  RecordRebuildTailProgress, CompleteRebuild
- All rebuild APIs are sender-authority-gated by sessionID
- E2E rebuild test now drives through rebuild FSM, not catch-up APIs

Bounded CatchUp enforcement:
- BeginCatchUp freezes TargetLSNAtStart from session.TargetLSN
- RecordCatchUpProgress rejects progress beyond frozen target
- Entry counting uses LSN delta (recoveredTo - previous), not call count
- Merged RecordCatchUpProgressAt into RecordCatchUpProgress (tick param)

5 new tests: target-frozen enforcement, sender-level rebuild via
rebuild APIs, reject non-rebuild, reject stale ID on rebuild.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 15:16:56 -07:00
pingqiu
3f0048cbd9 feat: add bounded CatchUp budget and Rebuild mode state machine (Phase 4.5 P0)
Bounded CatchUp:
- CatchUpBudget: MaxDurationTicks, MaxEntries, ProgressDeadlineTicks
- BudgetCheck: runtime consumption tracker (StartTick, EntriesReplayed, LastProgressTick)
- Sender.CheckBudget: evaluates budget, escalates to NeedsRebuild on violation
- RecordCatchUpProgressAt: tracks progress tick for stall detection
- BeginCatchUp accepts optional startTick for budget tracking

Rebuild state machine:
- RebuildSource: snapshot_tail (preferred) vs full_base (fallback)
- RebuildPhase: init → source_select → transfer → tail_replay → completed|aborted
- SelectSource: chooses based on snapshot availability
- Phase ordering enforced, transfer regression rejected
- ReadyToComplete validates target reached

13 new tests: budget enforcement (duration, entries, stall, no-budget),
sender budget integration, rebuild lifecycle (snapshot+tail, full base,
abort, phase order, regression), E2E bounded catch-up → rebuild.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 14:33:06 -07:00
pingqiu
90c39b549d feat: add prototype scenario closure (Phase 04 P4)
Maps V2 acceptance criteria A1-A7, A10 to enginev2 prototype evidence.
Adds 4 V2-boundary scenarios against the prototype.

Scenario tests:
- A1: committed data survives promotion (WAL truncation boundary)
- A2: uncommitted data truncated, not revived
- A3: stale epoch fenced at sender + session + assignment layers
- A4: short-gap catch-up with WAL-backed proof + data verification
- A5: unrecoverable gap escalates to NeedsRebuild with proof
- A6: recoverability boundary exact (tail +/- 1 LSN)
- A7: historical data correct after tail advancement (snapshot)
- A10: changed-address → invalidation → new assignment → recovery

V2-boundary scenarios:
- NeedsRebuild persists across topology update
- catch-up does not overwrite safe data
- 5 disconnect/reconnect cycles preserve sender identity
- full V2 harness: 3 replicas, 3 outcomes (zero-gap, catch-up, rebuild)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 11:31:56 -07:00
pingqiu
942a0b7da7 fix: strengthen IsRecoverable contiguity check and StateAt snapshot correctness
IsRecoverable now verifies three conditions:
- startExclusive >= tailLSN (not recycled)
- endInclusive <= headLSN (within WAL)
- all LSNs in range exist contiguously (no holes)
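
A rough sketch of the three-part check, assuming a toy WAL that tracks retained entries in a map. The `wal` type and its fields are hypothetical stand-ins; only the three conditions and the startExclusive/endInclusive naming are taken from the message above.

```go
package main

import "fmt"

// wal is a hypothetical stand-in for the prototype's WALHistory.
type wal struct {
	tailLSN uint64          // entries at or below this LSN are recycled
	headLSN uint64          // highest LSN written
	entries map[uint64]bool // retained LSNs
}

// isRecoverable reports whether the range (startExclusive, endInclusive]
// can be replayed from retained WAL.
func (w *wal) isRecoverable(startExclusive, endInclusive uint64) bool {
	if startExclusive < w.tailLSN {
		return false // the range starts in recycled history
	}
	if endInclusive > w.headLSN {
		return false // the range extends beyond what was written
	}
	if startExclusive >= endInclusive {
		return true // empty range, nothing to replay
	}
	for lsn := startExclusive + 1; lsn <= endInclusive; lsn++ {
		if !w.entries[lsn] {
			return false // a hole in the range
		}
	}
	return true
}

func main() {
	w := &wal{tailLSN: 5, headLSN: 10, entries: map[uint64]bool{}}
	for lsn := uint64(6); lsn <= 10; lsn++ {
		w.entries[lsn] = true
	}
	fmt.Println(w.isRecoverable(5, 10), w.isRecoverable(4, 10))
}
```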

StateAt now uses base snapshot captured during AdvanceTail:
- returns nil for LSNs before snapshot boundary (unreconstructable)
- correctly includes block state from recycled entries via snapshot

5 new tests: end-beyond-head, missing entries, state after tail
advance, nil before snapshot, block last written before tail.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 18:52:11 -07:00
pingqiu
c89709e47e feat: add WAL history model and recoverability proof (Phase 04 P3)
Adds minimal historical-data prototype to enginev2:

- WALHistory: retained-prefix model with Append, Commit, AdvanceTail,
  Truncate, EntriesInRange, IsRecoverable, StateAt
- MakeHandshakeResult connects WAL state to outcome classification
- RecordTruncation execution API for divergent tail cleanup
- CompleteSessionByID gates on truncation when required
- Zero-gap requires exact equality (FlushedLSN == CommittedLSN)
- Replica-ahead classified as CatchUp with mandatory truncation

15 new tests: WAL basics, provable recoverability, unprovable gap,
exact boundary, truncation enforcement, WAL-backed end-to-end
recovery with data verification.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 11:29:27 -07:00
pingqiu
edec7098e8 feat: add V2 protocol simulator and enginev2 sender/session prototype
Adds sw-block/ directory with:

- distsim: protocol correctness simulator (96 tests)
  - cluster model with epoch fencing, barrier semantics, commit modes
  - endpoint identity, control-plane flow, candidate eligibility
  - timeout events, timer races, same-tick ordering
  - session ownership tracking with ID-based stale fencing

- enginev2: standalone V2 sender/session implementation (63 tests)
  - per-replica Sender with identity-preserving reconciliation
  - RecoverySession with FSM phase transitions and session ID
  - execution APIs: BeginConnect, RecordHandshake, BeginCatchUp,
    RecordCatchUpProgress, CompleteSessionByID — all sender-authority-gated
  - recovery outcome branching: zero-gap, catch-up, needs-rebuild
  - assignment-intent orchestration with epoch fencing

- design docs: acceptance criteria, open questions, first-slice spec,
  protocol development process

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 10:38:27 -07:00
pingqiu
abbc8bff2b fix: canonicalize host in AllocateBlockVolumeResponse (CP13-2 follow-up)
AllocateBlockVolumeResponse used bs.ListenAddr() to derive replica
addresses. When the VS binds to ":port" (no explicit IP), host
resolved to empty string, producing ":dataPort" as the replica
address. This ":port" propagated through master assignments to both
primary and replica sides.

Now canonicalizes empty/wildcard host using PreferredOutboundIP()
before constructing replication addresses. Also exported
PreferredOutboundIP for use by the server package.

This is the source fix — all downstream paths (heartbeat, API
response, assignment) inherit the canonical address.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 19:16:45 -07:00
pingqiu
ae87a31d22 fix: store canonical replica addresses in heartbeat state
setupReplicaReceiver now reads back canonical addresses from
the ReplicaReceiver (which applies CP13-2 canonicalization)
instead of storing raw assignment addresses in replStates.

This fixes the API-level leak where replica_data_addr showed
":port" instead of "ip:port" in /block/volumes responses,
even though the engine-level CP13-2 fix was working.

New BlockVol.ReplicaReceiverAddr() returns canonical addresses
from the running receiver. Falls back to assignment addresses
if the receiver did not report any.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 19:08:48 -07:00
pingqiu
aa4688d5d5 fix: sync flusher checkpointLSN after rebuild (CP13-7)
rebuildFullExtent updated superblock.WALCheckpointLSN but not the
flusher's internal checkpointLSN. NewReplicaReceiver then read
stale 0 from flusher.CheckpointLSN(), causing post-rebuild
flushedLSN to be wrong.

Added Flusher.SetCheckpointLSN() and call it after rebuild
superblock persist. TestRebuild_PostRebuild_FlushedLSN_IsCheckpoint
flips FAIL→PASS.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 17:22:55 -07:00
pingqiu
4ed54d04ba fix: close leaked replica in TestShip_DegradedDoesNotSilently
The test used createSyncAllPair(t) but discarded the replica
return value, leaving the volume file open. On Windows this
caused TempDir cleanup failure. All 7 CP13-1 baseline FAILs
now PASS.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 16:54:05 -07:00
pingqiu
3e9358f2be feat: rebuild fallback with per-replica heartbeat state (CP13-7)
Adds per-replica state reporting in heartbeat so master can identify
which specific replica needs rebuild, not just a volume-level boolean.

New ReplicaShipperStatus{DataAddr, State, FlushedLSN} type reported
via ReplicaShipperStates field on BlockVolumeInfoMessage. Populated
from ShipperGroup.ShipperStates() on each heartbeat. Scales to RF=3+.

V1 constraints (explicit):
- NeedsRebuild cleared only by control-plane reassignment (no local exit)
- Post-rebuild replica re-enters as Disconnected/bootstrap, not InSync
- flushedLSN = checkpointLSN after rebuild (durable baseline only)

4 new tests: heartbeat per-replica state, NeedsRebuild reporting,
rebuild-complete-reenters-InSync (full cycle), epoch mismatch abort.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 16:46:31 -07:00
Ping Qiu
47f0111cae feat: replica-aware WAL retention (CP13-6)
Flusher now holds WAL entries needed by recoverable replicas.
Both AdvanceTail (physical space) and checkpointLSN (scan gate)
are gated by the minimum flushed LSN across catch-up-eligible
replicas.

New methods on ShipperGroup:
- MinRecoverableFlushedLSN() (uint64, bool): pure read, returns
  min flushed LSN across InSync/Degraded/Disconnected/CatchingUp
  replicas with known progress. Excludes NeedsRebuild.
- EvaluateRetentionBudgets(timeout): separate mutation step,
  escalates replicas that exceed walRetentionTimeout (5m default)
  to NeedsRebuild, releasing their WAL hold.
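
The retention floor described above could look roughly like the following. The `shipper` struct and the string-typed state are assumptions made for illustration; the eligible-state list and the `(uint64, bool)` return shape follow the message.

```go
package main

import "fmt"

// shipper is a hypothetical, simplified view of one replica's shipper.
type shipper struct {
	state       string
	flushedLSN  uint64
	hasProgress bool // whether the replica has ever reported durable progress
}

// minRecoverableFlushedLSN returns the lowest flushed LSN among
// catch-up-eligible replicas with known progress. The bool is false
// when no replica qualifies, so the caller holds no WAL.
func minRecoverableFlushedLSN(shippers []shipper) (uint64, bool) {
	var floor uint64
	found := false
	for _, s := range shippers {
		switch s.state {
		case "InSync", "Degraded", "Disconnected", "CatchingUp":
		default:
			continue // e.g. NeedsRebuild no longer pins the WAL
		}
		if !s.hasProgress {
			continue
		}
		if !found || s.flushedLSN < floor {
			floor, found = s.flushedLSN, true
		}
	}
	return floor, found
}

func main() {
	fmt.Println(minRecoverableFlushedLSN([]shipper{
		{state: "InSync", flushedLSN: 90, hasProgress: true},
		{state: "Degraded", flushedLSN: 40, hasProgress: true},
		{state: "NeedsRebuild", flushedLSN: 3, hasProgress: true},
	}))
}
```

Keeping this a pure read, with escalation in a separate step, matches the split between MinRecoverableFlushedLSN and EvaluateRetentionBudgets.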

Flusher integration: evaluates budgets then queries floor on each
flush cycle. If floor < maxLSN, holds both checkpoint and tail.
Extent writes proceed normally (reads work), only WAL reclaim
is deferred.

LastContactTime on WALShipper: updated on barrier success,
handshake success, and catch-up completion. Not on Ship (TCP
write only). Avoids misclassifying idle-but-healthy replicas.

CP13-6 ships with timeout budget only. walRetentionMaxBytes
is deferred (documented as partial slice).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 22:04:23 -07:00
Ping Qiu
9e481a83e9 fix: serialize LSN allocation + shipping with shipMu
Concurrent WriteLBA/Trim calls could deliver WAL entries to replicas
out of LSN order: two goroutines allocate LSN 4 and 5 concurrently,
but LSN 5 could reach the replica first via ShipAll, causing the
replica to reject it as an LSN gap.

shipMu now wraps nextLSN.Add + wal.Append + ShipAll in both
WriteLBA and Trim, guaranteeing LSN-ordered delivery to replicas
under concurrent writers.

The dirty map update and WAL pressure check happen after shipMu
is released — they don't need ordering guarantees.
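
As a hedged illustration of the locking scope, the sketch below holds one mutex across LSN allocation and shipping. The `volume` type, the `shipped` slice (standing in for the replica's receive order), and `runWriters` are hypothetical; only the shipMu name and the allocate-append-ship grouping come from the message.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// volume is a hypothetical stand-in for BlockVol.
type volume struct {
	shipMu  sync.Mutex
	nextLSN atomic.Uint64
	shipped []uint64 // order in which a replica would receive entries
}

// write allocates an LSN and ships it under shipMu, so no later LSN
// can overtake an earlier one on the wire.
func (v *volume) write() uint64 {
	v.shipMu.Lock()
	lsn := v.nextLSN.Add(1)
	// A real implementation would append to the WAL here.
	v.shipped = append(v.shipped, lsn) // stand-in for ShipAll
	v.shipMu.Unlock()
	// Work that needs no ordering (dirty map, WAL pressure) goes here.
	return lsn
}

// runWriters issues n concurrent writes and returns the ship order.
func runWriters(n int) []uint64 {
	v := &volume{}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			v.write()
		}()
	}
	wg.Wait()
	return v.shipped
}

func main() {
	fmt.Println(runWriters(8))
}
```

Without the lock, the atomic counter alone would still hand out unique LSNs, but nothing would order the ship calls relative to the allocations.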

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 16:33:42 -07:00
Ping Qiu
4429f2b8d2 fix: use handshake-reported flushedLSN for catch-up, fix receiver init
doReconnectAndCatchUp() now uses the replicaFlushedLSN returned by
the reconnect handshake as the catch-up start point, not the
shipper's stale cached value. The replica may have less durable
progress than the shipper last knew.

ReplicaReceiver initialization: flushedLSN now set from the
volume's checkpoint LSN (durable by definition), not nextLSN
(which includes unflushed entries). receivedLSN still uses
nextLSN-1 since those entries are in the WAL buffer even if
not yet synced.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 15:54:23 -07:00
Ping Qiu
24de2cea2a fix: refactor reconnect tests to preserve shipper identity (CP13-5)
Updated 3 reconnect tests to stop/restart the ReplicaReceiver on
the same addresses WITHOUT calling SetReplicaAddr. This preserves
the shipper object, its ReplicaFlushedLSN, HasFlushedProgress flag,
and catch-up state across the disconnect/reconnect cycle.

All 3 tests now PASS:
- TestReconnect_CatchupFromRetainedWal
- CatchupReplay_DataIntegrity_AllBlocksMatch
- CatchupReplay_DuplicateEntry_Idempotent

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 15:46:02 -07:00
Ping Qiu
548e47e482 feat: reconnect handshake + WAL catch-up protocol (CP13-5)
Adds the sync_all reconnect protocol: when a degraded shipper
reconnects, it performs a handshake (ResumeShipReq/Resp) to
determine the replica's durable progress, then streams missed
WAL entries to close the gap before resuming live shipping.

New wire messages:
- MsgResumeShipReq (0x03): primary sends epoch, headLSN, retainStart
- MsgResumeShipResp (0x04): replica returns status + flushedLSN
- MsgCatchupDone (0x05): marks end of catch-up stream

Decision matrix after handshake:
- R == H: already caught up → InSync
- S <= R+1 <= H: recoverable gap → CatchingUp → stream → InSync
- R+1 < S: gap exceeds retained WAL → NeedsRebuild
- R > H: impossible progress → NeedsRebuild
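
The matrix can be written as a small classifier. In this sketch R is taken to be the replica's flushedLSN, S the primary's retainStart, and H its headLSN; that reading is inferred from the wire messages listed above and is an assumption, as are the function and type names.

```go
package main

import "fmt"

// Outcome is a hypothetical type; the values reuse the state names
// from the commit message.
type Outcome string

const (
	InSync       Outcome = "InSync"
	CatchingUp   Outcome = "CatchingUp"
	NeedsRebuild Outcome = "NeedsRebuild"
)

// classifyHandshake applies the four rows of the decision matrix.
// r: replica flushedLSN, s: retainStart, h: headLSN (assumed meanings).
func classifyHandshake(r, s, h uint64) Outcome {
	switch {
	case r > h:
		return NeedsRebuild // replica claims progress the primary never made
	case r == h:
		return InSync // nothing to stream
	case r+1 < s:
		return NeedsRebuild // first missing entry is already recycled
	default:
		return CatchingUp // s <= r+1 <= h: the gap is fully retained
	}
}

func main() {
	fmt.Println(classifyHandshake(7, 5, 10))
}
```

Checking r > h first means r+1 is only computed when r < h, so the addition cannot overflow.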

WALAccess interface: narrow abstraction (RetainedRange + StreamEntries)
avoids coupling shipper to raw WAL internals.

Bootstrap vs reconnect split: fresh shippers (HasFlushedProgress=false)
use CP13-4 bootstrap path. Previously-synced shippers use handshake.

Catch-up retry budget: maxCatchupRetries=3 before NeedsRebuild.

ReplicaReceiver now initializes receivedLSN/flushedLSN from volume's
nextLSN on construction (handles receiver restart on existing volume).

TestBug2_SyncAll_SyncCache_AfterDegradedShipperRecovers flips FAIL→PASS.
All previously-passing baseline tests remain green.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 15:38:06 -07:00
Ping Qiu
8d6379f841 feat: replica state machine + barrier eligibility gating (CP13-4)
Replaces binary degraded flag with ReplicaState type:
Disconnected, Connecting, CatchingUp, InSync, Degraded, NeedsRebuild.

Ship() allowed from Disconnected (bootstrap: data must flow before
first barrier) and InSync (steady state). Ship does NOT change state.

Barrier() gating:
- InSync: proceed normally
- Disconnected: bootstrap path (connect + barrier)
- Degraded: reconnect both data+ctrl connections, then barrier
- Connecting/CatchingUp/NeedsRebuild: rejected immediately

Only barrier success grants InSync. Reconnect alone does not.

IsDegraded() now means "not sync-eligible" (any non-InSync state).
InSyncCount() added to ShipperGroup.

dist_group_commit.go: removed AllDegraded short-circuit that
prevented bootstrap. Barrier attempts always run — individual
shippers handle their own state-based gating.

8 CP13-4 tests + TestBarrier_RejectsReplicaNotInSync flips FAIL→PASS.
All previously-passing baseline tests remain green.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 02:39:32 -07:00
Ping Qiu
499e244b8e feat: durable progress truth — replicaFlushedLSN in barrier (CP13-3)
Barrier response extended from 1-byte status to 9-byte payload
carrying the replica's durable WAL progress (FlushedLSN). Updated
only after successful fd.Sync(), never on receive/append/send.

Replica side: new flushedLSN field on ReplicaReceiver, advanced
only in handleBarrier after proven contiguous receipt + sync.
max() guard prevents regression.

Shipper side: new replicaFlushedLSN (authoritative) replacing
ShippedLSN (diagnostic only). Monotonic CAS update from barrier
response. hasFlushedProgress flag tracks whether replica supports
the extended protocol.

ShipperGroup: MinReplicaFlushedLSN() returns (uint64, bool) —
minimum across shippers with known progress. (0, false) for empty
groups or legacy replicas.

Backward compat: 1-byte legacy responses decoded as FlushedLSN=0.
Legacy replicas explicitly excluded from sync_all correctness.

7 new tests: roundtrip, backward compat, flush-only-after-sync,
not-on-receive, shipper update, monotonicity, group minimum.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 01:52:35 -07:00
Ping Qiu
4f3edffb0a fix: canonical replica address resolution (CP13-2)
ReplicaReceiver.DataAddr()/CtrlAddr() now return canonical ip:port
instead of raw listener addresses that may be wildcard (:port,
0.0.0.0:port, [::]:port).

New canonicalizeListenerAddr() resolves wildcard IPs using the
provided advertised host (from VS listen address). Falls back to
outbound-IP detection when no advertised host is available.
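
A simplified version of the wildcard resolution might look like this. It is a sketch only: `canonicalize` is a hypothetical name, and the outbound-IP fallback is left out and replaced by an error so that the example stays deterministic.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// canonicalize rewrites a wildcard listener address (":port",
// "0.0.0.0:port", "[::]:port") to use advertisedHost. Addresses that
// already name a concrete host are returned unchanged in host:port form.
func canonicalize(listenAddr, advertisedHost string) (string, error) {
	host, port, err := net.SplitHostPort(listenAddr)
	if err != nil {
		return "", err
	}
	if host == "" || host == "0.0.0.0" || host == "::" {
		if advertisedHost == "" {
			// The real code is described as falling back to
			// outbound-IP detection here; this sketch does not.
			return "", errors.New("wildcard bind with no advertised host")
		}
		host = advertisedHost
	}
	return net.JoinHostPort(host, port), nil
}

func main() {
	fmt.Println(canonicalize("[::]:3260", "10.0.0.5"))
}
```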

NewReplicaReceiver accepts optional advertisedHost parameter for
multi-NIC correctness. In production, the assignment path already
provides canonical addresses; this fix ensures test patterns with
:0 bind also produce routable addresses.

7 new tests. TestBug3_ReplicaAddr_MustBeIPPort_WildcardBind flips
from FAIL to PASS.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 01:38:55 -07:00
Ping Qiu
c263d082b5 fix: restart reconciliation — trust roles, upsert replicas
Same-epoch reconciliation now trusts reported roles first:
- one claims primary, other replica → trust roles
- both claim primary → WALHeadLSN heuristic tiebreak
- both claim replica → keep existing, log ambiguity

Replaced addServerAsReplica with upsertServerAsReplica: checks
for existing replica entry by server name before appending.
Prevents duplicate ReplicaInfo rows during restart/replay windows.

2 new tests: role-trusted same-epoch, duplicate replica prevention.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 01:24:53 -07:00
Ping Qiu
9137fa6486 fix: epoch-based reconciliation on master restart reconstruction
When a second server reports the same volume during master restart,
UpdateFullHeartbeat now uses epoch-based tie-breaking instead of
first-heartbeat-wins:

1. Higher epoch wins as primary — old entry demoted to replica
2. Same epoch — higher WALHeadLSN wins (heuristic, warning logged)
3. Lower epoch — added as replica
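
The three rules reduce to a two-level comparison. The sketch below assumes a minimal `report` struct and treats a full tie (same epoch, same WALHeadLSN) as "existing entry stays primary"; that tie behaviour and all names are assumptions, not taken from the commit.

```go
package main

import "fmt"

// report is a hypothetical summary of one server's heartbeat for a volume.
type report struct {
	server     string
	epoch      uint64
	walHeadLSN uint64
}

// incomingWins reports whether the newly heard server should become
// primary, demoting the existing entry to replica.
func incomingWins(existing, incoming report) bool {
	if incoming.epoch != existing.epoch {
		return incoming.epoch > existing.epoch // rules 1 and 3
	}
	// Rule 2: same epoch, heuristic tiebreak on WAL head.
	return incoming.walHeadLSN > existing.walHeadLSN
}

func main() {
	a := report{server: "vs1", epoch: 4, walHeadLSN: 100}
	b := report{server: "vs2", epoch: 5, walHeadLSN: 10}
	fmt.Println(incomingWins(a, b))
}
```

Because the rule depends only on the two reports and not on arrival order (apart from exact ties), reconstruction gives the same primary whichever heartbeat lands first.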

Applied in both code paths: the auto-register branch (no entry
exists yet for this name) and the unlinked-server branch (entry
exists but this server is not in it).

This is a deterministic reconstruction improvement, not ground
truth. The long-term fix is persisting authoritative volume state.

5 new tests covering all reconciliation scenarios.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 01:17:51 -07:00
Ping Qiu
a9a5e455c6 fix: Lookup/ListAll return copies, add UpdateEntry for safe mutation
Lookup() and ListAll() now return value copies (not pointers to
internal registry state). Callers can no longer mutate registry
entries without holding a lock.

Added clone() on BlockVolumeEntry with deep-copied Replicas slice.
Added UpdateEntry(name, func(*BlockVolumeEntry)) for locked mutation.
ListByServer() also returns copies.

Migrated 1 production mutation (ReplicaPlacement + Preset in create
handler) and ~20 test mutations to use UpdateEntry.

5 new copy-correctness tests: Lookup returns copy, Replicas slice
isolated, ListAll returns copies, UpdateEntry mutates, UpdateEntry
not-found error.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 01:00:27 -07:00
Ping Qiu
e8c921d9e8 fix: remove nil-optional superMu pattern, require in all FlusherConfigs
superMu is mandatory for correctness — all superblock mutation+persist
must be serialized. Remove the nil guard in updateSuperblockCheckpoint
and add SuperMu to all 7 test FlusherConfig sites.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 00:19:25 -07:00
Ping Qiu
3ddb87adc9 fix: superblock write coordination (superMu) + remove debug logs
Adds sync.Mutex (superMu) to BlockVol, shared between group commit's
syncWithWALProgress() and flusher's updateSuperblockCheckpoint().
Both paths now serialize superblock mutation + persist, preventing
WALTail/WALCheckpointLSN regression when flusher and group commit
write the full superblock concurrently.

persistSuperblock() also guarded for consistency.

Removes temporary log.Printf lines in the open/recovery path that
were added during BUG-RESTART-ZEROS investigation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 00:09:14 -07:00
Ping Qiu
e92263b4f4 fix: ioMu data-plane exclusion for restore/import/expand
Adds sync.RWMutex (ioMu) to BlockVol enforcing mutual exclusion
between normal I/O and destructive state operations.

Shared (RLock): WriteLBA, ReadLBA, Trim, SyncCache, replica
applyEntry, rebuild applyRebuildEntry — concurrent I/O safe.

Exclusive (Lock): RestoreSnapshot, ImportSnapshot, Expand,
PrepareExpand, CommitExpand, CancelExpand — drains all in-flight
I/O before modifying extent/WAL/dirtyMap.

Scope rule: RLock covers local data-structure mutation only.
Replication shipping is asynchronous and outside the lock, so
exclusive holders block only behind local I/O, not network stalls.

Lock ordering: ioMu > snapMu > assignMu > mu.
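
A simplified sketch of the shared/exclusive split, assuming a toy BlockVol whose state is a single map. The inner mu stands in for whatever protects the real data structures; only the ioMu name, the RLock/Lock split, and the ioMu-before-mu ordering are taken from the message.

```go
package main

import (
	"fmt"
	"sync"
)

// BlockVol here is a toy model, not the real type.
type BlockVol struct {
	ioMu sync.RWMutex    // outer: normal I/O (shared) vs destructive ops (exclusive)
	mu   sync.Mutex      // inner: protects data among concurrent shared holders
	data map[uint64]byte // stand-in for extent/WAL/dirtyMap
}

// WriteLBA takes ioMu shared, so many writers can run concurrently.
func (v *BlockVol) WriteLBA(lba uint64, b byte) {
	v.ioMu.RLock()
	defer v.ioMu.RUnlock()
	v.mu.Lock()
	v.data[lba] = b
	v.mu.Unlock()
}

// ReadLBA also takes ioMu shared.
func (v *BlockVol) ReadLBA(lba uint64) (byte, bool) {
	v.ioMu.RLock()
	defer v.ioMu.RUnlock()
	v.mu.Lock()
	defer v.mu.Unlock()
	b, ok := v.data[lba]
	return b, ok
}

// RestoreSnapshot takes ioMu exclusive: Lock returns only after every
// in-flight shared holder has released, so no write can interleave.
func (v *BlockVol) RestoreSnapshot(snap map[uint64]byte) {
	v.ioMu.Lock()
	defer v.ioMu.Unlock()
	fresh := make(map[uint64]byte, len(snap))
	for k, val := range snap {
		fresh[k] = val
	}
	v.data = fresh
}

func main() {
	v := &BlockVol{data: make(map[uint64]byte)}
	var wg sync.WaitGroup
	for i := uint64(0); i < 8; i++ {
		wg.Add(1)
		go func(lba uint64) {
			defer wg.Done()
			v.WriteLBA(lba, 0xAA)
		}(i)
	}
	wg.Wait()
	v.RestoreSnapshot(map[uint64]byte{0: 0x01})
	b, ok := v.ReadLBA(0)
	fmt.Println(b, ok) // 1 true
}
```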

Closes the critical ER item: the silent data corruption gap between
restore/import and a concurrent WriteLBA.

3 new tests: concurrent writes allowed, real restore-vs-write
contention with data integrity check, close coordination.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 20:40:41 -07:00
Ping Qiu
bb691a5458 feat: CP11B-4 observability pack — health state, alerts, dashboard
Health-state derivation: deriveHealthStateWithLiveness() computes
per-volume state (unsafe > rebuilding > degraded > healthy) using
role, replica count, durability mode, degraded flag, and primary
server liveness. Used consistently in both volume responses and
cluster summary.
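
The precedence (unsafe > rebuilding > degraded > healthy) can be illustrated as a first-match switch. Only the precedence and the state names come from the message; the Signals struct and the condition that maps to each state are placeholders, not the real inputs of deriveHealthStateWithLiveness().

```go
package main

import "fmt"

// HealthState is ordered from least to most severe.
type HealthState int

const (
	Healthy HealthState = iota
	Degraded
	Rebuilding
	Unsafe
)

func (s HealthState) String() string {
	return [...]string{"healthy", "degraded", "rebuilding", "unsafe"}[s]
}

// Signals is a hypothetical, reduced set of inputs for this sketch.
type Signals struct {
	PrimaryAlive bool
	Rebuilding   bool
	Degraded     bool
}

// deriveHealth returns the most severe state whose condition holds.
func deriveHealth(s Signals) HealthState {
	switch {
	case !s.PrimaryAlive: // placeholder condition for "unsafe"
		return Unsafe
	case s.Rebuilding:
		return Rebuilding
	case s.Degraded:
		return Degraded
	default:
		return Healthy
	}
}

// summarize counts volumes per state, using the same derivation as the
// per-volume path so the two views cannot disagree.
func summarize(vols []Signals) map[HealthState]int {
	counts := make(map[HealthState]int)
	for _, v := range vols {
		counts[deriveHealth(v)]++
	}
	return counts
}

func main() {
	s := Signals{PrimaryAlive: true, Rebuilding: true, Degraded: true}
	fmt.Println(deriveHealth(s)) // rebuilding
}
```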

Extended GET /block/status with health counts (healthy, degraded,
rebuilding, unsafe) and NVMe-capable server count. Response is now
typed BlockStatusResponse instead of untyped map.

Default alert pack: 7 Prometheus rules covering WAL pressure,
flusher errors, replica degradation, rebuilding, scrub errors.
Alert rules reference real seaweedfs_blockvol_* metric names.

Default dashboard: Grafana JSON with 17 panels — cluster health,
IOPS, latency P99, WAL pressure, flusher throughput, replication,
scrub, dirty map, epoch.

17 tests: 9 health derivation, 1 cluster summary, 2 handler/API,
2 alert validation, 2 dashboard validation, 1 liveness parity.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 02:12:42 -07:00
Ping Qiu
f501c63009 feat: CP11B-2 explainable placement / plan API
New POST /block/volume/plan endpoint returns full placement preview:
resolved policy, ordered candidate list, selected primary/replicas,
and per-server rejection reasons with stable string constants.

Core design: evaluateBlockPlacement() is a pure function with no
registry/topology dependency. gatherPlacementCandidates() is the
single topology bridge point. Plan and create share the same planner;
the parity contract is the same ordered candidate list for the same
cluster state.

Create path refactored: uses evaluateBlockPlacement() instead of
PickServer(), iterates all candidates (no 3-retry cap), recomputes
replica order after primary fallback. rf_not_satisfiable severity
is durability-mode-aware (warning for best_effort, error for strict).

15 unit tests + 20 QA adversarial tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 02:12:25 -07:00
Ping Qiu
683969086c feat: CP11B-1 provisioning presets + review fixes
Preset system: ResolvePolicy resolves named presets (database, general,
throughput) with per-field overrides into concrete volume parameters.
Create path now uses resolved policy instead of ad-hoc validation.
New /block/volume/resolve diagnostic endpoint for dry-run resolution.

Review fix 1 (MED): HasNVMeCapableServer now derives NVMe capability
from server-level heartbeat attribute (block_nvme_addr proto field)
instead of scanning volume entries. Fixes false "no NVMe" warning on
fresh clusters with NVMe-capable servers but no volumes yet.

Review fix 2 (LOW): /block/volume/resolve no longer proxied to leader —
read-only diagnostic endpoint can be served by any master.

Engine fix: ReadLBA retry loop closes stale dirty-map race when WAL
entry is recycled between lookup and read.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-22 14:44:24 -07:00
Ping Qiu
075ff52219 feat: CP11B-3 safe ops — promotion hardening, preflight, manual promote
Six-task checkpoint hardening the promotion and failover paths:

T1: 4-gate candidate evaluation (heartbeat freshness, WAL lag, role,
    server liveness) with structured rejection reasons.
T2: Orphaned-primary re-evaluation on replica reconnect (B-06/B-08).
T3: Deferred timer safety — epoch validation prevents stale timers
    from firing on recreated/changed volumes (B-07).
T4: Rebuild addr cleanup on promotion (B-11), NVMe publication
    refresh on heartbeat, and preflight endpoint wiring.
T5: Manual promote API — POST /block/volume/{name}/promote with
    force flag, target server selection, and structured rejection
    response. Shared applyPromotionLocked/finalizePromotion helpers
    eliminate duplication between auto and manual paths.
T6: Read-only preflight endpoint (GET /block/volume/{name}/preflight)
    and blockapi client wrappers (Preflight, Promote).
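
To illustrate the T3 idea (a deferred action validates the epoch before acting), here is a minimal sketch. The Volume type and method names are hypothetical, and the callback is invoked directly instead of through a real timer so the example stays deterministic.

```go
package main

import (
	"fmt"
	"sync"
)

// Volume is a hypothetical stand-in holding only what the sketch needs.
type Volume struct {
	mu       sync.Mutex
	epoch    uint64
	promoted bool
}

// BumpEpoch models the volume being recreated or changed.
func (v *Volume) BumpEpoch() {
	v.mu.Lock()
	v.epoch++
	v.mu.Unlock()
}

// scheduleDeferredPromote captures the epoch at scheduling time and returns
// the callback a timer would run. The callback reports whether it acted.
func (v *Volume) scheduleDeferredPromote() func() bool {
	v.mu.Lock()
	scheduled := v.epoch
	v.mu.Unlock()
	return func() bool {
		v.mu.Lock()
		defer v.mu.Unlock()
		if v.epoch != scheduled {
			return false // stale timer: the volume changed since scheduling
		}
		v.promoted = true
		return true
	}
}

func main() {
	v := &Volume{epoch: 3}
	stale := v.scheduleDeferredPromote()
	v.BumpEpoch() // volume changes before the first timer fires
	fresh := v.scheduleDeferredPromote()
	fmt.Println(stale(), fresh()) // false true
}
```

In real use the returned callback would be handed to something like time.AfterFunc; the epoch check is what makes a late firing harmless.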

BUG-T5-1: PromotionsTotal counter moved to finalizePromotion (shared
    by both auto and manual paths) to prevent metrics divergence.

24 files changed, ~6500 lines added. 42 new QA adversarial tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-13 17:21:17 -07:00
Ping Qiu
ed11a09a61 fix: CP11A-4 snapshot export/import safety — 3 bugs from review
BUG-CP11A4-1 (HIGH): ImportSnapshot now rejects when active snapshots
exist. Import overwrites the extent region that non-CoW'd snapshot blocks
read from, so snapshot reads would silently return imported data instead
of snapshot-time data. New ErrImportActiveSnapshots error and
snapMu-guarded check.

BUG-CP11A4-2 (HIGH): Double import without AllowOverwrite now correctly
rejected. Import bypasses WAL so nextLSN stays at 1; added FlagImported
(Superblock.Flags bit 0) set after successful import and checked alongside
nextLSN in the non-empty gate.

BUG-CP11A4-3 (MED): Replaced fixed exportTempSnapID (0xFFFFFFFE) with
atomic sequence counter (exportTempSnapBase + exportTempSnapSeq). Each
auto-export gets a unique temp snapshot ID, preventing concurrent export
races and user snapshot ID collisions.
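
A minimal sketch of the base-plus-sequence scheme. The value chosen for exportTempSnapBase and the nextExportTempSnapID helper are assumptions; the message names only exportTempSnapBase and exportTempSnapSeq.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// exportTempSnapBase is a placeholder value for this sketch.
const exportTempSnapBase uint32 = 0xF0000000

// exportTempSnapSeq is incremented atomically for each auto-export.
var exportTempSnapSeq uint32

// nextExportTempSnapID returns a temp snapshot ID unique to this call.
func nextExportTempSnapID() uint32 {
	return exportTempSnapBase + atomic.AddUint32(&exportTempSnapSeq, 1)
}

func main() {
	const n = 64
	ids := make([]uint32, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			ids[i] = nextExportTempSnapID()
		}(i)
	}
	wg.Wait()

	seen := make(map[uint32]bool, n)
	for _, id := range ids {
		seen[id] = true
	}
	fmt.Println(len(seen) == n) // true
}
```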

Also added beginOp()/endOp() lifecycle guards to both ExportSnapshot and
ImportSnapshot, and documented the non-atomic import failure semantics.

5 new regression tests + QA-EX-3 rewritten for rejection behavior.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-13 10:56:18 -07:00
Ping Qiu
7cc6467d09 feat: CP11A-4 snapshot export/import to S3 — artifact format, engine, and transport
Add crash-consistent snapshot export/import for single-profile block volumes.
Export creates a temp snapshot, streams the full volume image with inline
SHA-256, and uploads to S3. Import validates manifest + checksum and writes
directly to extent region. Admin HTTP endpoints /export and /import added
to the standalone iscsi-target binary.

Engine: snapshot_export.go (manifest types, ExportSnapshot, ImportSnapshot)
S3: snapshot_s3.go (AWS SDK v1 transport, pipe-based streaming upload)
Tests: 14 engine + 9 QA adversarial = 23 new tests, all passing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-13 00:15:27 -07:00
Ping Qiu
1c5b658170 feat: CP11A-3 WAL hardening foundations — pressure visibility, sizing guidance, preflight
Add PressureState() and writer wait tracking to WALAdmission, WALStatus
snapshot API on BlockVol, WAL sizing guidance pure functions, Prometheus
histogram/gauge/counter exports, and admin /status WAL fields. 23 new
tests (7 admission, 10 guidance, 6 QA adversarial).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 19:30:59 -07:00
Ping Qiu
67f6e73ca7 fix: B-09 stale entry during expand, B-10 heartbeat deletes during expand
B-09: ExpandBlockVolume re-reads the registry entry after acquiring
the expand inflight lock. Previously it used the entry from the
initial Lookup, which could be stale if failover changed VolumeServer
or Replicas between Lookup and PREPARE.

B-10: UpdateFullHeartbeat stale-cleanup now skips entries with
ExpandInProgress=true. Previously a primary VS restart during
coordinated expand would delete the entry (path not in heartbeat),
orphaning the volume and stranding the expand coordinator.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 15:12:40 -07:00
Ping Qiu
1b3edd7856 feat: CP11A-2 coordinated expand protocol for replicated block volumes
Two-phase prepare/commit/cancel protocol ensures all replicas expand
atomically. Standalone volumes use direct-commit (unchanged behavior).

Engine: PrepareExpand/CommitExpand/CancelExpand with on-disk
PreparedSize+ExpandEpoch in superblock, crash recovery clears stale
prepare state on open, v.mu serializes concurrent expand operations.

Proto: 3 new RPCs (PrepareExpand/CommitExpand/CancelExpandBlockVolume).

Coordinator: expandClean flag pattern — ReleaseExpandInflight only on
clean success or full cancel. Partial replica commit failure calls
MarkExpandFailed (keeps ExpandInProgress=true, suppresses heartbeat
size updates). ClearExpandFailed for manual reconciliation.
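
The coordinator flow can be sketched with an in-memory volume. The memVol type, the method signatures, and ErrExpandFailed are assumptions made for illustration; the real RPCs and registry calls (MarkExpandFailed, ReleaseExpandInflight) are represented only by comments.

```go
package main

import (
	"errors"
	"fmt"
)

// memVol is a hypothetical in-memory volume with the prepare state the
// message describes (PreparedSize + ExpandEpoch).
type memVol struct {
	size, preparedSize, expandEpoch uint64
	failPrepare                     bool // test hook
}

func (v *memVol) PrepareExpand(newSize, epoch uint64) error {
	if v.failPrepare {
		return errors.New("prepare refused")
	}
	if newSize <= v.size {
		return errors.New("new size must be larger")
	}
	v.preparedSize, v.expandEpoch = newSize, epoch
	return nil
}

func (v *memVol) CommitExpand(epoch uint64) error {
	if v.preparedSize == 0 || v.expandEpoch != epoch {
		return errors.New("no matching prepare")
	}
	v.size, v.preparedSize, v.expandEpoch = v.preparedSize, 0, 0
	return nil
}

func (v *memVol) CancelExpand(epoch uint64) {
	if v.expandEpoch == epoch {
		v.preparedSize, v.expandEpoch = 0, 0
	}
}

var ErrExpandFailed = errors.New("partial commit; manual reconciliation needed")

// coordinatedExpand prepares every node, then commits every node. A prepare
// failure cancels the nodes already prepared; a commit failure is reported
// as ErrExpandFailed and deliberately not rolled back.
func coordinatedExpand(nodes []*memVol, newSize, epoch uint64) error {
	var prepared []*memVol
	for _, n := range nodes {
		if err := n.PrepareExpand(newSize, epoch); err != nil {
			for _, p := range prepared {
				p.CancelExpand(epoch)
			}
			return fmt.Errorf("prepare: %w", err) // full cancel: safe to release inflight
		}
		prepared = append(prepared, n)
	}
	for _, n := range nodes {
		if err := n.CommitExpand(epoch); err != nil {
			return fmt.Errorf("%w: %v", ErrExpandFailed, err) // would call MarkExpandFailed
		}
	}
	return nil
}

func main() {
	a, b := &memVol{size: 10}, &memVol{size: 10}
	err := coordinatedExpand([]*memVol{a, b}, 20, 1)
	fmt.Println(err, a.size, b.size) // <nil> 20 20
}
```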

Registry: AcquireExpandInflight records PendingExpandSize+ExpandEpoch.
ExpandFailed state blocks new expands until cleared.

Tests: 15 engine + 4 VS + 10 coordinator + heartbeat suppression
regression + updated QA CP82/durability tests with prepare/commit mocks.

Also includes the remaining CP11A-1 work: QA storage profile tests, QA
io_backend config tests, testrunner perf-baseline scenarios and
coordinated-expand actions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 15:06:48 -07:00