fix: Phase 20 T3 — add V2 promotion observability to FailoverDiagnostic

FailoverDiagnostic now carries V2PromotionEnabled and V2PromotionReady
fields. MasterServer.FailoverDiagnosticSnapshot() enriches the failover
state diagnostic with rollout gate visibility so operators can confirm
whether the master is on V1, V2, or V2-fail-closed-placeholder mode.
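
As a minimal sketch of how an operator-side tool might turn the two new fields into the three modes named above (the `promotionMode` helper and the trimmed-down struct are illustrative assumptions, not part of this commit):

```go
package main

import "fmt"

// FailoverDiagnostic mirrors only the two fields added by this commit;
// the real struct carries more (Volumes, PendingRebuildCount, ...).
type FailoverDiagnostic struct {
	V2PromotionEnabled bool
	V2PromotionReady   bool
}

// promotionMode is a hypothetical helper that maps the two flags to the
// mode names used in the commit message.
func promotionMode(d FailoverDiagnostic) string {
	switch {
	case !d.V2PromotionEnabled:
		return "V1"
	case d.V2PromotionReady:
		return "V2"
	default:
		// Enabled, but no evidence querier is wired yet.
		return "V2-fail-closed-placeholder"
	}
}

func main() {
	for _, d := range []FailoverDiagnostic{
		{},
		{V2PromotionEnabled: true},
		{V2PromotionEnabled: true, V2PromotionReady: true},
	} {
		fmt.Printf("%+v => %s\n", d, promotionMode(d))
	}
}
```

Because V2PromotionReady is computed as enabled AND querier-present, the combination Ready=true with Enabled=false should not occur; the sketch treats it as V1.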

Update phase-20.md: document the default=false rollout policy (false is
the safe default until proto regen enables the evidence RPC; after that,
flip the default to true).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
pingqiu
2026-04-05 16:27:02 -07:00
parent 43016e6645
commit 2b97cd04b8
2 changed files with 15 additions and 2 deletions


@@ -207,9 +207,11 @@ authorization.
 6. Bump epoch, enqueue assignment to selected candidate
 **Legacy fallback policy**:
-- Add `--block.v2-promotion` flag (default `true`)
+- Add `--block.v2Promotion` flag (default `false`: safe rollout default
+  until proto regen enables the evidence RPC; once RPC is live, flip to
+  default `true`)
 - When `true`: `promoteReplicaV2()` with fail-closed on evidence failure
-- When `false`: existing `promoteReplica()` (V1 path)
+- When `false`: existing `promoteReplicaV1()` (V1 path)
 - The flag is observable via `/vol/status` and metrics
 - The flag is intended to be removed once V2 is validated, not permanent


@@ -54,6 +54,8 @@ type FailoverVolumeState struct {
 // Volume-oriented: each entry describes one volume's failover state.
 // Aggregate counts are derived from the volume list.
 type FailoverDiagnostic struct {
+	V2PromotionEnabled     bool // T3: whether durability-first V2 promotion is active
+	V2PromotionReady       bool // T3: whether V2 evidence querier is wired (false = fail-closed placeholder)
 	Volumes                []FailoverVolumeState
 	PendingRebuildCount    map[string]int // dead server → count of pending rebuilds
 	DeferredPromotionCount map[string]int // dead server → count of deferred promotion timers
@@ -93,6 +95,15 @@ func (fs *blockFailoverState) DiagnosticSnapshot() FailoverDiagnostic {
 	return diag
 }
+// FailoverDiagnosticSnapshot returns a FailoverDiagnostic enriched with
+// V2 promotion rollout state so operators can observe the active mode.
+func (ms *MasterServer) FailoverDiagnosticSnapshot() FailoverDiagnostic {
+	diag := ms.blockFailover.DiagnosticSnapshot()
+	diag.V2PromotionEnabled = ms.blockV2Promotion
+	diag.V2PromotionReady = ms.blockV2Promotion && ms.blockVSQueryEvidence != nil
+	return diag
+}
 // PublicationDiagnostic is a bounded read-only snapshot comparing the
 // operator-visible publication (LookupBlockVolume response) against the
 // registry authority for one volume. P3 diagnosability surface for S2.