Root cause: StatusSnapshot().CommittedLSN reports 0 in sync_all mode when
the replica shipper has no flushed progress (NeedsRebuild state). This is
correct for a lineage-safe committed boundary, but PlanRebuild uses
CommittedLSN as RebuildTargetLSN. With target=0, shouldStartSessionCommand
rejects the StartRebuildCommand, and the rebuild IO never executes.
Fix: PlanRebuild falls back to HeadLSN when CommittedLSN is 0. The
primary's WAL head IS the data boundary the replica needs to reach.
The fact that no replica has confirmed durability is exactly why we're
rebuilding.
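A minimal sketch of the fallback, assuming a snapshot type with only the two LSN fields named above. The `snapshot` type and `rebuildTargetLSN` helper are illustrative names, not the engine's real ones:

```go
package main

import "fmt"

// snapshot is a hypothetical stand-in for the fields PlanRebuild reads.
type snapshot struct {
	CommittedLSN uint64
	HeadLSN      uint64
}

// rebuildTargetLSN returns the committed boundary when one exists and
// falls back to the primary's WAL head when CommittedLSN is 0.
func rebuildTargetLSN(s snapshot) uint64 {
	if s.CommittedLSN == 0 {
		return s.HeadLSN
	}
	return s.CommittedLSN
}

func main() {
	fmt.Println(rebuildTargetLSN(snapshot{CommittedLSN: 0, HeadLSN: 120}))  // 120
	fmt.Println(rebuildTargetLSN(snapshot{CommittedLSN: 80, HeadLSN: 120})) // 80
}
```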
Also adds command type logging to coreApplyAndLog so the tester can verify
which commands are actually emitted vs silently dropped.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three correctness fixes for the remote rebuild path:
1. No double completion: for remote rebuilds, OnRebuildCompleted skips
RebuildCommitted since ObserveReplicaRebuildSessionAck already emitted
SessionCompleted on the accepted ack. One rebuild = one completion event.
2. SessionAckFailed with rejected observation: if OnAck rejects the failed
ack (stale session), don't use the sentinel errRebuildAckFailed. Return
a regular error so ExecutePendingRebuild emits the fallback SessionFailed.
No path leaves the engine session hanging.
3. Diagnostic logging in ExecutePendingRebuild: log the replicaID and
targetLSN on both nil-return (TakeRebuild mismatch) and successful take
paths. Also log the pending store in runRebuild with replicaID, targetLSN,
and IO type. This makes the TakeRebuild seam diagnosable on hardware
without rebuilding the engine package.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the broken primary-local rebuild executor with RemoteRebuildIO,
a server-side engine.RebuildIO implementation that coordinates remotely.
The primary sends SessionControlV2 (with RebuildAddr trailer) to the
replica's control channel; the replica starts a local rebuild session
and auto-connects to the primary's rebuild server for the base lane.
Single rebuild route: ALL core-present rebuilds use RemoteRebuildIO.
The entire command chain is preserved unchanged:
PlanRebuild → pending → RebuildStarted → StartRebuildCommand
→ ExecutePendingRebuild → RemoteRebuildIO.TransferFullBase
Key changes:
- SessionControlMsg v2: optional RebuildAddr trailer (len-based decode)
- ReplicaRebuilding shipper state: session-gated live WAL lane
- RemoteRebuildIO: dials replica ctrl, sends session control, reads acks
- Ack forwarding through ObserveReplicaRebuildSessionAck (pins/watchdog)
- Completion proof from replica's achievedLSN, not primary's local vol
- Transport failures emit SessionFailed (no double-emit on ack failures)
- Progress ack rejection fails closed (stale session = abort)
- Replica auto-starts base lane client on v2 session control
State transitions:
NeedsRebuild → [accepted ack] → Rebuilding → [completed] → InSync
Rebuilding → [failed/EOF] → NeedsRebuild → [next probe] → retry
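The transitions above can be read as a small table-driven state machine. This is a sketch only: the type, constant, and event names are illustrative and do not claim to match the shipper's real identifiers.

```go
package main

import "fmt"

type shipperState string

const (
	NeedsRebuild shipperState = "NeedsRebuild"
	Rebuilding   shipperState = "Rebuilding"
	InSync       shipperState = "InSync"
)

type rebuildEvent string

const (
	AckAccepted rebuildEvent = "accepted_ack"
	Completed   rebuildEvent = "completed"
	Failed      rebuildEvent = "failed" // also stands in for EOF on the control channel
)

// next returns the state after ev. ok is false for any transition that
// is not listed in the commit message, and the state is left unchanged.
func next(s shipperState, ev rebuildEvent) (shipperState, bool) {
	switch {
	case s == NeedsRebuild && ev == AckAccepted:
		return Rebuilding, true
	case s == Rebuilding && ev == Completed:
		return InSync, true
	case s == Rebuilding && ev == Failed:
		return NeedsRebuild, true
	}
	return s, false
}

func main() {
	s := NeedsRebuild
	for _, ev := range []rebuildEvent{AckAccepted, Failed, AckAccepted, Completed} {
		s, _ = next(s, ev)
		fmt.Println(ev, "->", s)
	}
}
```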
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace three bypass mechanisms with one unified model. When the
probe returns ProbeRebuildRequired, the host now starts the rebuild
through the existing recovery manager (StartRecoveryTask), which
resolves the rebuild address, plans the rebuild, and executes via
the v2bridge executor — the same path as master-driven RoleRebuilding.
New per-replica probe API:
- WALShipper.ProbeReconnect() → ReplicaProbeResult with typed outcome
- ShipperGroup.ProbeReconnectAll() → []ReplicaProbeResult
- BlockVol.ProbeReplicaOnboarding() / IsClosed()
Host-side wiring:
- handleReplicaProbeResult routes outcomes:
KeepUp → ShipperConnectedObserved
CatchUp → ShipperConnectedObserved (recovery manager handles session)
Rebuild → NeedsRebuildObserved + StartRecoveryTask (executes rebuild)
TemporaryFailure → no-op
- lastAssignmentsForPath reconstructs assignment for recovery manager
- onPrimaryRosterChanged probes all replicas (defined, called from watchdog)
- observePrimaryShipperConnectivity uses probe API
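The outcome routing listed above could look roughly like the following. The enum, the string event names, and the `routeProbe` helper are assumptions made for illustration; the real handler works with engine event types rather than strings.

```go
package main

import "fmt"

type probeOutcome int

const (
	ProbeKeepUp probeOutcome = iota
	ProbeCatchUp
	ProbeRebuildRequired
	ProbeTemporaryFailure
)

// routeProbe maps a probe outcome to the engine events to emit and
// reports whether a recovery task should be started.
func routeProbe(o probeOutcome) (events []string, startRecovery bool) {
	switch o {
	case ProbeKeepUp, ProbeCatchUp:
		// For catch-up, the recovery manager owns the session itself.
		return []string{"ShipperConnectedObserved"}, false
	case ProbeRebuildRequired:
		return []string{"NeedsRebuildObserved"}, true
	default:
		// Temporary failure (and anything unrecognized) is a no-op.
		return nil, false
	}
}

func main() {
	events, start := routeProbe(ProbeRebuildRequired)
	fmt.Println(events, start)
}
```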
Probe fires via syncProtocolExecutionState immediately after assignment
processing — same heartbeat cycle, no timer delay.
Deleted: startDirectRebuild, resolveCtrlAddrForShipper,
TryReconnect/TryReconnectAll/TryReconnectShippers.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When proactive reconnect finds that the WAL gap exceeds the retained range:
1. Emit per-replica NeedsRebuildObserved to engine (with ReplicaID)
2. Resolve replica ctrl address from shipper group
3. Start direct rebuild session: send sessionControl(start_rebuild)
to replica's ctrl channel, stream base blocks, emit RebuildStarted
The primary drives the rebuild directly without master round-trip.
The master sees the result via heartbeat projection (needs_rebuild →
rebuilding → healthy). This matches V2 authority: master owns identity,
primary owns data-control recovery.
Added WALShipper.CtrlAddr() getter for address resolution.
resolveCtrlAddrForShipper maps data address to ctrl address via
shipper group (works for RF=2 and RF=3+).
startDirectRebuild runs in a goroutine: dials replica ctrl, sends
start_rebuild, waits for accepted ack, serves base blocks, emits
RebuildStarted to engine on success.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Revert detectAndEnqueueRebuildFromHeartbeat (Bridge 2) — master
should not drive rebuild assignments from heartbeat. The primary
owns data-control recovery per the V2 authority split.
Fix Bridge 1: NeedsRebuildObserved now carries per-replica identity.
resolveReplicaIDForShipper maps shipper DataAddr to ReplicaID via
the shipper group (works for RF=2 and RF=3+). The engine receives
the specific replica that needs rebuild, not a volume-level broadcast.
Primary-direct rebuild: the primary detects which replica needs
rebuild and will drive the session directly. The master learns about
it via subsequent heartbeat projection (needs_rebuild → rebuilding →
healthy). No master round-trip needed for the rebuild decision.
Added WALShipper.DataAddr() getter for address resolution.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
After rejoin, the shipper is configured but no I/O triggers Ship(),
so the shipper stays Disconnected and the core stays at
awaiting_shipper_connected indefinitely.
Fix: observePrimaryShipperConnectivity now calls TryReconnectShippers
when ShipperConfigured=true but ShipperConnected=false. This triggers
the full reconnect protocol (dial + handshake + bounded catch-up)
proactively, bringing the replica current without waiting for I/O.
Option B approach: uses the same reconnect path as Barrier() — not a
fake write or bare dial probe. CatchUpTo(headLSN) replays any retained
WAL entries, bringing the replica fully current.
New methods:
- WALShipper.TryReconnect(): full reconnect without foreground I/O
- ShipperGroup.TryReconnectAll(): probes all disconnected shippers
- BlockVol.TryReconnectShippers(): volume-level entry point
Also fix pre-existing test expectation: engine now emits
start_recovery_task on primary assignment with replicas.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fix recover path TOCTOU: re-Lookup after AddReplica so the primary
refresh assignment includes the freshly added replica addresses.
Previously, Lookup (copy) was called before AddReplica modified the
registry, so entry.Replicas was empty → primary got replicas=0 →
shipper never configured.
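The stale-copy problem can be reproduced with a toy registry. This sketch assumes only that Lookup returns a copy and AddReplica mutates the stored entry; the types and the address are hypothetical.

```go
package main

import "fmt"

type entry struct{ Replicas []string }

type registry struct{ entries map[string]*entry }

// Lookup returns a copy, so later registry changes are not visible
// through a value obtained earlier.
func (r *registry) Lookup(name string) (entry, bool) {
	e, ok := r.entries[name]
	if !ok {
		return entry{}, false
	}
	return entry{Replicas: append([]string(nil), e.Replicas...)}, true
}

// AddReplica mutates the stored entry in place.
func (r *registry) AddReplica(name, addr string) {
	if e, ok := r.entries[name]; ok {
		e.Replicas = append(e.Replicas, addr)
	}
}

func main() {
	r := &registry{entries: map[string]*entry{"vol1": {}}}
	stale, _ := r.Lookup("vol1") // taken before the update
	r.AddReplica("vol1", "replica-a:9000")
	fresh, _ := r.Lookup("vol1") // re-Lookup after AddReplica
	fmt.Println(len(stale.Replicas), len(fresh.Replicas)) // 0 1
}
```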
Add 2 WAL pressure edge case tests:
- ShipperCatchUpOrEscalate: 64KB WAL, 200 writes, aggressive flusher.
Proves no hang/deadlock/corruption. Shipper either keeps up or
correctly escalates to NeedsRebuild.
- RebuildWithPinWhilePrimaryWrites: rebuild session active while
primary writes 7600+ blocks in 2s. Proves primary never freezes
— rebuild pin is on replica only, primary WAL recycles freely.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All 43 actions pass on m01/m02 hardware. Auto-failover PASS.
dd_write: 30s → 123ms. Post-failover write: 33,621 IOPS.
1. WAL retention: remove keepup retention floor (MinShippedLSN).
WAL cannot be pinned during sustained async writes — any pin
strategy either fills WAL (blocking writes) or over-recycles
(breaking catch-up). Flusher recycles freely. Future LBA map
will provide catch-up without WAL retention.
MinShippedLSN on ShipperGroup retained as diagnostic surface.
2. Registry stale-cleanup race: add RegisteredAt grace period.
Race: master registers volume → next VS heartbeat arrives before
VS discovers the volume → stale cleanup deletes the entry →
failover finds 0 entries. Fix: skip stale cleanup for entries
registered within 30s (> 2 heartbeat intervals).
2 new tests: grace protects new entry, old entry still cleaned.
3. Shutdown heartbeat: VS disconnect heartbeat no longer claims
block inventory authority. Previously, the shutdown beat's
empty inventory triggered stale cleanup, deleting the entry
before failover could use it.
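For the grace period in item 2, a sketch of the cleanup filter might look like this. Only RegisteredAt and the 30s window come from the text; the entry shape, the inventory map, and the `at` helper are illustrative.

```go
package main

import (
	"fmt"
	"time"
)

// registrationGrace mirrors the 30s window described above.
const registrationGrace = 30 * time.Second

type entry struct {
	Name         string
	RegisteredAt time.Time
}

// at builds a timestamp from Unix seconds to keep the example readable.
func at(sec int64) time.Time { return time.Unix(sec, 0) }

// staleEntries returns the names of entries absent from the reported
// inventory, skipping any entry registered within the grace period.
func staleEntries(entries []entry, reported map[string]bool, now time.Time) []string {
	var stale []string
	for _, e := range entries {
		if reported[e.Name] {
			continue // still reported by the volume server
		}
		if now.Sub(e.RegisteredAt) < registrationGrace {
			continue // too new: the VS may not have discovered it yet
		}
		stale = append(stale, e.Name)
	}
	return stale
}

func main() {
	entries := []entry{
		{Name: "new", RegisteredAt: at(100)},
		{Name: "old", RegisteredAt: at(0)},
	}
	fmt.Println(staleEntries(entries, map[string]bool{}, at(110))) // [old]
}
```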
Scenario fix: recovery-baseline-failover.yaml now kills the
correct node (discovered primary, not hardcoded), connects to
the correct new primary for post-failover verification.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Wire protocol messages and transport handlers for the rebuild MVP:
Protocol messages (rebuild_transport.go):
- SessionControlMsg: epoch, sessionID, command, baseLSN, targetLSN,
snapshotID. Encode/Decode with fixed 37-byte wire format.
- SessionAckMsg: epoch, sessionID, phase, walAppliedLSN, baseComplete,
achievedLSN. Encode/Decode with fixed 34-byte wire format.
- MsgSessionControl (0x10) and MsgSessionAck (0x11) on control channel.
- SendSessionControl/SendSessionAck convenience functions.
Transport handlers:
- RebuildTransportServer: primary-side, streams all extent blocks as
MsgRebuildExtent frames (reusing existing rebuild message type),
ends with MsgRebuildDone.
- RebuildTransportClient: replica-side, receives base blocks and
routes through vol.ApplyRebuildSessionBaseBlock, marks base
complete on MsgRebuildDone.
4 transport tests:
- SessionControl wire round-trip
- SessionAck wire round-trip
- BaseBlockStreaming: full TCP loop, 1024 blocks streamed and verified
- SessionControlOverTCP: real TCP send/receive with accepted ack
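One layout that adds up to the 34-byte SessionAckMsg is three 8-byte integers plus two 1-byte fields and a final 8-byte LSN. The sketch below assumes that layout, big-endian byte order, and the field order given above; none of these are confirmed by the commit, and the type and function names are illustrative.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

const sessionAckSize = 34 // 8 + 8 + 1 + 8 + 1 + 8

// sessionAck is a hypothetical mirror of the fields listed for SessionAckMsg.
type sessionAck struct {
	Epoch         uint64
	SessionID     uint64
	Phase         uint8
	WALAppliedLSN uint64
	BaseComplete  bool
	AchievedLSN   uint64
}

// encode writes the ack into a fixed-size buffer.
func (a sessionAck) encode() []byte {
	b := make([]byte, sessionAckSize)
	binary.BigEndian.PutUint64(b[0:8], a.Epoch)
	binary.BigEndian.PutUint64(b[8:16], a.SessionID)
	b[16] = a.Phase
	binary.BigEndian.PutUint64(b[17:25], a.WALAppliedLSN)
	if a.BaseComplete {
		b[25] = 1
	}
	binary.BigEndian.PutUint64(b[26:34], a.AchievedLSN)
	return b
}

// decodeSessionAck rejects any buffer that is not exactly 34 bytes.
func decodeSessionAck(b []byte) (sessionAck, error) {
	if len(b) != sessionAckSize {
		return sessionAck{}, errors.New("session ack: unexpected length")
	}
	return sessionAck{
		Epoch:         binary.BigEndian.Uint64(b[0:8]),
		SessionID:     binary.BigEndian.Uint64(b[8:16]),
		Phase:         b[16],
		WALAppliedLSN: binary.BigEndian.Uint64(b[17:25]),
		BaseComplete:  b[25] != 0,
		AchievedLSN:   binary.BigEndian.Uint64(b[26:34]),
	}, nil
}

func main() {
	in := sessionAck{Epoch: 3, SessionID: 7, Phase: 2, WALAppliedLSN: 90, BaseComplete: true, AchievedLSN: 100}
	out, err := decodeSessionAck(in.encode())
	fmt.Println(out == in, err)
}
```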
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add BlockService replica-side rebuild routing API that bridges
transport/host layer to BlockVol session surface:
StartReplicaRebuildSession(path, config)
ApplyReplicaRebuildWALEntry(path, sessionID, entry)
ApplyReplicaRebuildBaseBlock(path, sessionID, lba, data)
MarkReplicaRebuildBaseComplete(path, sessionID, totalBlocks)
TryCompleteReplicaRebuildSession(path, sessionID)
CancelReplicaRebuildSession(path, sessionID, reason)
ReplicaRebuildSession(path) → snapshot
Each method does one thing: validate → WithVolume → delegate to BlockVol.
No wire decoding, no protocol decisions, no state invention. Transport
wiring (sessionControl/walData/sessionData handlers) is the next step.
2 focused tests: skeleton routes correctly, stale session ID rejected.
Updated v2-rebuild-mvp-session-protocol.md with server skeleton section.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tighten the acceptance matrix with explicit per-boundary rows, a signoff
reading split into hard blockers vs product hardening, and a clear
rule: architecture-complete ≠ product-complete.
6 hard blockers before T6/T7:
1. WriteLBA/SyncCache/sync_all contract closure
2. Fresh replica bounded catch-up before live tail
3. Timeout/retention-loss classification for catch-up
4. publish_healthy alignment with one protocol contract
5. RF=2 stable identity on all shipping paths
6. Test audit for incorrect WriteLBA==commit assumptions
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
7-area acceptance matrix mapping current state vs product requirements:
write/durability contract, fresh replica bootstrap, host observation
completeness, serving/publish alignment, snapshot/rebuild convergence,
adapter consistency, test contract alignment.
Each item is marked with: current state, required for product, blocks
T6/T7, best test level. Items are priority-ordered into
must-close-before-Stage-1, should-close-before-Stage-2, and
can-close-after-T6/T7.
Key diagnosis: architecture-complete, execution-incomplete. The engine
thinks like a product; the data plane still behaves partly like a
prototype. The gap is end-to-end contract closure.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add host-side protocol state seam that derives per-replica execution
state from V2 sender/session snapshots and blocks live-tail WAL
shipping while an active recovery session is in progress.
New file: weed/server/block_protocol_state.go
- replicaProtocolExecutionState derived from engine snapshots
- LiveEligible=false during active catch-up/rebuild sessions
- bindProtocolExecutionPolicy wires policy into BlockVol
- syncProtocolExecutionState called after assignments + core events
Data plane changes:
- WALShipper.Ship() checks liveShippingPolicy before dial/send
- BlockVol.SetLiveShippingPolicy persists across shipper group rebuilds
- ShipperGroup propagates policy to all shippers
Design contract: sw-block/design/v2-protocol-aware-execution.md
Scope: WAL-first rollout only. Prevents illegal live-tail delivery
during active recovery. Does not change snapshot/build behavior or
move backlog. Next wave: bounded WAL catch-up under same contract.
Tests: 4 unit/component tests for phase gate behavior, plus bootstrap
seam tests that confirmed the two pre-existing bugs locally.
13 files changed, 900 insertions, 69 deletions.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Update coverage reading to reflect 49 tests (6 new component tests).
Add full roster status table with per-item strong/bounded/missing
marking and mapped test function names.
Unit+component: 32 of 33 items strong (T4-C7 NVMe bounded).
Integration: 6 of 10 missing (Tier 2 next).
Hardware: 4 of 4 missing (T6/T7 staged plan).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add detailed coverage mapping of 43 existing tests against the test
roster. Identify 7 missing component tests and 3 missing integration
tests with concrete scenarios, file placement, and must-prove criteria.
Key finding: every tester-found bug during T1-T5 was a wiring bug caught
by reviewing the production path, not by unit tests on pure logic. This
confirms component tests are the highest-value gap for CI/CD protection.
Priority order: Tier 1 (7 component tests, do now), Tier 2 (3 integration
tests, do before hardware), Tier 3 (4 hardware scenarios, T6/T7).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add ClusterReplicationMode and EngineProjectionMode to
FailoverVolumeState so each volume in the failover diagnostic
carries its cluster/engine mode at diagnosis time.
FailoverDiagnosticSnapshot() enriches volume entries by looking up
the registry entry for each volume. This covers both the block
volume API (GET /block/volume/{name}) and the failover diagnostic
snapshot surface.
Update phase doc to reflect actual exposure paths.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fix three tester findings on T5:
1. RF2 with missing replicas now reports "degraded" instead of
"no_replicas". Only RF=1 with no replicas returns "no_replicas".
Missing replica in an RF2 set is a degraded cluster state.
2. TransportDegraded signal now incorporated: if master-observed
transport is degraded, ClusterReplicationMode is at least
"degraded" regardless of individual replica health.
3. API surface exposure: EngineProjectionMode and
ClusterReplicationMode now appear on blockapi.VolumeInfo and are
populated in entryToVolumeInfo(). Operators can consume both
through GET /block/volume/{name} with distinct JSON field names.
12 tests: keepup, catching_up, stale degraded, LSN gap needs_rebuild,
rebuilding role, RF1 no_replicas, RF2 missing degraded, transport
degraded, distinctness, heartbeat update, worst dominates, API
surface distinct naming.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add ClusterReplicationMode as a distinct master-owned cluster-level
replication health judgment, computed from multi-replica facts:
replica LSN lag, heartbeat freshness, role state. Monotonic: worst
replica state dominates.
Modes: "no_replicas" (RF=1), "keepup" (all healthy), "catching_up"
(replica behind but recoverable), "degraded" (stale heartbeat or
barrier failure), "needs_rebuild" (unrecoverable gap or rebuilding
role).
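The "worst replica state dominates" rule can be sketched as a severity ranking over per-replica modes. The relative order of "degraded" and "needs_rebuild", the treatment of unknown input, and the function name are assumptions. Reporting an empty replica set at RF>1 as "degraded" follows the RF2 fix described elsewhere in this log.

```go
package main

import "fmt"

// severity ranks modes from healthiest to worst. Placing "needs_rebuild"
// above "degraded" is an assumption based on the order the modes are listed.
var severity = map[string]int{
	"keepup":        0,
	"catching_up":   1,
	"degraded":      2,
	"needs_rebuild": 3,
}

// clusterReplicationMode returns the worst per-replica mode.
func clusterReplicationMode(replicationFactor int, replicaModes []string) string {
	if len(replicaModes) == 0 {
		if replicationFactor <= 1 {
			return "no_replicas"
		}
		return "degraded" // replicas expected but none present
	}
	worst := "keepup"
	for _, m := range replicaModes {
		if _, known := severity[m]; !known {
			m = "degraded" // assumption: unknown input is never treated as healthy
		}
		if severity[m] > severity[worst] {
			worst = m
		}
	}
	return worst
}

func main() {
	fmt.Println(clusterReplicationMode(3, []string{"keepup", "needs_rebuild"}))
	fmt.Println(clusterReplicationMode(1, nil))
}
```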
Distinct from EngineProjectionMode (VS-local engine truth) and
VolumeMode (legacy). They answer different questions, live in
different fields, have different names. Tests explicitly prove the
two can differ without conflict.
Computed in recomputeReplicaState() alongside existing VolumeMode.
Updated on every heartbeat that touches the entry.
9 tests: keepup, catching_up, stale degraded, LSN gap needs_rebuild,
rebuilding role, no_replicas, distinctness from EngineProjectionMode,
heartbeat-driven update, worst-replica-dominates (RF3).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fix two tester findings:
1. Missing engine projection now fails closed: if v2Core is active but
CoreProjection(path) is missing, gate locally with reason
"missing_engine_projection". Mirrors T2's fail-closed posture.
Only skips enforcement when V2 core is entirely absent.
2. NVMe/TCP now gated alongside iSCSI: gateServing() calls both
targetServer.DisconnectVolume() and nvmeServer.RemoveVolume().
ungateServing() re-registers with both iSCSI and NVMe. A gated
volume is unreachable through all frontend paths.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fix three tester findings on T4 activation gate:
1. Real serving enforcement: evaluateActivationGate now calls
gateServing() → DisconnectVolume(iqn) on gate (terminates active
iSCSI sessions, removes volume from target). ungateServing() →
AddVolume(iqn, adapter) on clear (re-registers volume). This is
actual serving enforcement, not just bookkeeping.
2. Wire propagation: add activation_gated (field 25) and
activation_gate_reason (field 26) to proto BlockVolumeInfoMessage.
Add generated Go fields + getters. Add proto conversion in
InfoMessageToProto/InfoMessageFromProto. Gate state now rides the
real VS→master heartbeat wire.
3. Runtime ungate: evaluateActivationGate() now also runs in
applyCoreEvent() (the observation-driven path), not just
applyCoreAssignmentEvent(). Recovery/catch-up completion that
transitions the projection to publish_healthy/replica_ready now
clears the gate and re-registers the volume automatically.
ClearActivationGate() remains as an explicit override for edge cases
but is no longer the primary ungate path.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
After assignment executes through V2 core, evaluateActivationGate()
checks the resulting projection locally. If mode is degraded,
needs_rebuild, bootstrap_pending, or allocated_only, the volume is
gated from serving. Gate is enforced immediately after assignment,
before the next heartbeat round-trip.
Gate cleared only when projection reaches publish_healthy or
replica_ready. IsActivationGated() provides the query surface for
iSCSI/NVMe adapter enforcement. Heartbeat carries ActivationGated
and ActivationGateReason fields so master can observe the gated state
(report path, not enforcement path).
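The gate rule reads as a three-way decision on the projection mode. In this sketch the `gateState` type and `evaluateGate` function are illustrative, and using the mode string as the gate reason is an assumption. Leaving the gate untouched for any mode not named in the text is also an assumption.

```go
package main

import "fmt"

type gateState struct {
	Gated  bool
	Reason string
}

// evaluateGate applies the projection mode to the previous gate state.
func evaluateGate(prev gateState, mode string) gateState {
	switch mode {
	case "degraded", "needs_rebuild", "bootstrap_pending", "allocated_only":
		return gateState{Gated: true, Reason: mode}
	case "publish_healthy", "replica_ready":
		return gateState{} // the only modes that clear the gate
	default:
		return prev // assumption: other modes neither gate nor clear
	}
}

func main() {
	g := evaluateGate(gateState{}, "needs_rebuild")
	fmt.Println(g.Gated, g.Reason)
	g = evaluateGate(g, "publish_healthy")
	fmt.Println(g.Gated)
}
```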
activationGated map on BlockService tracks per-volume gate state.
Initialized in constructor. Test helper updated to include it.
6 tests: degraded gates, needs_rebuild gates, healthy clears gate,
gate enforced before heartbeat, recovery re-enables, assignment with
degraded projection triggers gate.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace misleading V2PromotionEnabled/V2PromotionReady booleans with
single V2PromotionMode string: "disabled", "placeholder_fail_closed",
or "transport_ready".
Previous V2PromotionReady was true whenever any querier was installed,
including the placeholder that always returns error. Now the diagnostic
accurately distinguishes placeholder (fail-closed until proto regen)
from real gRPC transport.
blockV2EvidenceTransport bool on MasterServer tracks whether the real
transport querier is installed. Currently always false (placeholder).
Set to true only when real gRPC querier replaces the placeholder after
proto regen.
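A sketch of how the single mode string could be derived from the flag and the transport marker. The function name and parameters are illustrative; only the three mode strings come from the text.

```go
package main

import "fmt"

// promotionMode collapses the rollout flag and the transport marker
// into one diagnostic string.
func promotionMode(v2Enabled, realTransport bool) string {
	switch {
	case !v2Enabled:
		return "disabled"
	case realTransport:
		return "transport_ready"
	default:
		// Flag is on but only the always-erroring placeholder is installed.
		return "placeholder_fail_closed"
	}
}

func main() {
	fmt.Println(promotionMode(false, false))
	fmt.Println(promotionMode(true, false))
	fmt.Println(promotionMode(true, true))
}
```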
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
FailoverDiagnostic now carries V2PromotionEnabled and V2PromotionReady
fields. MasterServer.FailoverDiagnosticSnapshot() enriches the failover
state diagnostic with rollout gate visibility so operators can confirm
whether the master is on V1, V2, or V2-fail-closed-placeholder mode.
Update phase-20.md: document default=false rollout policy (safe default
until proto regen enables evidence RPC, then flip to default true).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Wire V2 promotion into production binary:
- Add --block.v2Promotion CLI flag on weed master (default false)
- MasterOption.BlockV2Promotion → NewMasterServer wires flag + querier
- defaultBlockVSQueryEvidence placeholder (returns explicit error until
proto regen on M01 enables gRPC evidence RPC)
Fix three fail-closed violations found by tester:
1. blockV2Promotion=true + nil querier now fails closed with explicit
log instead of silently falling back to V1
2. Partial evidence (any candidate query failed) now fails closed —
unreachable candidate may be the most durable, promoting from
incomplete evidence violates durability-first ordering
3. Clear EngineProjectionMode in applyPromotionLocked (already in
previous commit, verified in tests here)
2 new tests: NilQuerier_FailsClosed, PartialEvidenceFailure_FailsClosed.
Total T3 tests: 7, all pass. Existing V1 failover tests unaffected.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Wire V2 promotion into the real master failover decision path:
promoteReplica() now dispatches to promoteReplicaV2() when
blockV2Promotion flag is true. V2 path queries each candidate for
fresh evidence via pluggable BlockPromotionEvidenceQuerier, selects
by CommittedLSN (durability-first), and fail-closes when no eligible
candidate exists. No silent fallback to V1.
Feature flag: blockV2Promotion bool on MasterServer. When false,
existing promoteReplicaV1() (health-score-first) is used unchanged.
Flag is explicit and observable, not a hidden rescue path.
Registry: add PromoteReplicaByServer() for V2 path where master
already knows the winner. Clear stale EngineProjectionMode in
applyPromotionLocked (complements T1 turnover fix).
T2 fix: fail-closed when V2 core projection is absent —
Eligible=false with reason "missing_engine_projection". CommittedLSN
from core used unconditionally (no WALHeadLSN overstatement).
5 T3 integration tests: higher CommittedLSN wins, all-ineligible
fail-closed, evidence-failure fail-closed, flag-off uses legacy,
epoch bump + assignment enqueue only after selection.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
VS-side evidence handler (QueryBlockPromotionEvidence) reads live
blockvol.Status() + V2 core projection at call time. Fail-closed:
no core projection → ineligible with reason "missing_engine_projection".
Engine CommittedLSN used unconditionally when core present (no WALHeadLSN
overstatement). Eligibility owned by local V2 engine, not master.
Master-side selection (selectDurabilityFirstCandidate): durability-first
ordering by CommittedLSN, tie-break WALHeadLSN then HealthScore. All
ineligible → fail-closed, no promotion. Pluggable querier
(BlockPromotionEvidenceQuerier) for T3 wiring.
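The ordering can be sketched as a single pass that skips ineligible candidates and compares the rest field by field. The `evidence` struct, the float HealthScore type, and "higher score is better" are assumptions. The real selectDurabilityFirstCandidate may differ in shape.

```go
package main

import "fmt"

// evidence is a hypothetical per-candidate record.
type evidence struct {
	Server       string
	Eligible     bool
	CommittedLSN uint64
	WALHeadLSN   uint64
	HealthScore  float64
}

// better reports whether a should be preferred over b:
// CommittedLSN first, then WALHeadLSN, then HealthScore.
func better(a, b evidence) bool {
	if a.CommittedLSN != b.CommittedLSN {
		return a.CommittedLSN > b.CommittedLSN
	}
	if a.WALHeadLSN != b.WALHeadLSN {
		return a.WALHeadLSN > b.WALHeadLSN
	}
	return a.HealthScore > b.HealthScore
}

// selectCandidate returns ok=false when no candidate is eligible,
// which the caller treats as fail-closed (no promotion).
func selectCandidate(cands []evidence) (winner evidence, ok bool) {
	for _, c := range cands {
		if !c.Eligible {
			continue
		}
		if !ok || better(c, winner) {
			winner, ok = c, true
		}
	}
	return winner, ok
}

func main() {
	w, ok := selectCandidate([]evidence{
		{Server: "a", Eligible: true, CommittedLSN: 10, WALHeadLSN: 50, HealthScore: 0.9},
		{Server: "b", Eligible: true, CommittedLSN: 12, WALHeadLSN: 12, HealthScore: 0.1},
	})
	fmt.Println(w.Server, ok)
}
```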
Proto messages added to volume_server.proto. gRPC transport binding
pending proto regen on M01 — this commit delivers evidence semantics
and selection substrate, not full end-to-end RPC closure.
Phase 20 doc updated with T2-T5 reviewer packs and cross-task guardrails.
13 tests: live facts, core projection mode, fail-closed no-core, 4 gated
modes, missing volume, epoch mismatch, CommittedLSN ordering, WALHeadLSN
tie-break, HealthScore tie-break, all-ineligible, mixed collection.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add engine_projection_mode as a distinct proto/wire/registry field
that carries pure V2 engine-derived local projection mode from VS
to master. Reads ONLY from CoreProjection — no ad-hoc fallback.
Separate from existing VolumeMode: EngineProjectionMode is VS-local
V2 engine truth, VolumeMode is the existing field that conflates V2
and V1 paths. Both exist during transition; only EngineProjectionMode
is V2-authoritative.
Clears stale value on primary turnover: when a newly promoted primary
heartbeats without the field, the old primary's projection is not
preserved (prevents synthetic master-side truth).
5 focused tests: propagation, distinctness (hard assertion), backward
compat preservation, turnover-clears, turnover-with-field.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Live HTTP evidence transport, continuous Loop2 service, bounded auto
failover trigger, runtime-managed frontend export, bounded replica
repair, end-to-end RF2 handoff with continued I/O on new primary,
bounded operator HTTP surface, and CSI V2 runtime backend adapter.
11 new proof tests covering the full M6-M10 chain plus CSI create/
lookup/publish through the V2 runtime path.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Freeze the first bounded pilot/preflight/stop/rollout-review artifact set and sync the global product ledgers so productionization can start from an explicit chosen-envelope discipline instead of ad hoc rollout judgment.
Made-with: Cursor
Freeze the first Phase 17 branch/contract/policy/envelope package, add review and supported-matrix artifacts, and sync the product-completion and claim-evidence ledgers to the new bounded post-Phase-16 checkpoint.
Made-with: Cursor
Bind non-authoritative inventory, restart primary-truth rebasing, and sparse replica readiness retention into the heartbeat/master seam, and package the bounded finish-line checkpoint with explicit claims, non-claims, and proof commands.
Made-with: Cursor
Carry explicit volume_mode_reason across the heartbeat/master/API seam so outward surfaces retain the bounded core-owned explanation behind mode transitions.
Made-with: Cursor
Use ReplicaEligible instead of PublishHealthy in the heartbeat collector test now that publish health is rebound to publication truth rather than receiver readiness.
Made-with: Cursor
Make the heartbeat/master boundary preserve explicit volume_mode truth so the master's consume path no longer reconstructs the outward mode only from secondary heartbeat signals. Keep backward compatibility by falling back to the previous reconstruction when older heartbeats do not send the field.
Made-with: Cursor
Make the heartbeat/master boundary preserve explicit publish_healthy truth so the master's consume path no longer reconstructs healthy publication only from secondary readiness and degraded heuristics. Keep backward compatibility by falling back to the previous reconstruction when older heartbeats do not send the field.
Made-with: Cursor
Make the heartbeat/master boundary preserve explicit needs_rebuild truth so the primary heartbeat consume path no longer collapses that stronger mode into a generic degraded signal. Keep backward compatibility by falling back to the previous heuristic when older heartbeats do not send the field.
Made-with: Cursor
Make the heartbeat/master boundary carry explicit replica readiness truth so the registry no longer depends only on replica transport-address presence as a readiness proxy. Keep backward compatibility by falling back to the old address heuristic when older heartbeats do not send the field.
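A sketch of the explicit-field-with-fallback pattern, assuming field presence is modeled as a nil pointer. The struct, its field names, and the presence mechanism are illustrative; the real heartbeat message may signal presence differently.

```go
package main

import "fmt"

// heartbeat is a hypothetical slice of the heartbeat payload.
type heartbeat struct {
	ReplicaReady    *bool  // nil when the sender predates the field
	ReplicaDataAddr string // legacy readiness proxy
}

// replicaReady prefers the explicit field and falls back to the old
// address-presence heuristic for older heartbeats.
func replicaReady(hb heartbeat) bool {
	if hb.ReplicaReady != nil {
		return *hb.ReplicaReady
	}
	return hb.ReplicaDataAddr != ""
}

func main() {
	notReady := false
	fmt.Println(replicaReady(heartbeat{ReplicaReady: &notReady, ReplicaDataAddr: "10.0.0.2:4000"}))
	fmt.Println(replicaReady(heartbeat{ReplicaDataAddr: "10.0.0.2:4000"}))
}
```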
Made-with: Cursor