# Cluster operations

## Drain a single node (hardware replacement, etc.)

```bash
anchorage admin drain nod_01h7rfxv...

# verify
curl -sf http://node-under-drain:8080/v1/ready   # now 503
```

The drained node:

- Returns 503 from `/v1/ready` so the LB stops routing new traffic to it.
- Cancels its `pin.jobs.` consumer after `cluster.drainGracePeriod` (default 2m), once in-flight jobs finish.
- Keeps publishing `node.heartbeat.` so peers know it is intentionally out of rotation.
- Gets its placements moved onto healthy nodes by the rebalancer, recorded with `reason=drain` in the audit log.

When the work is done, restore the node:

```bash
anchorage admin uncordon nod_01h7rfxv...
```

Pins that already migrated stay put; the uncordoned node re-acquires load gradually as new pins land.

## Cluster-wide maintenance (rolling upgrade)

Before bouncing every anchorage in turn:

```bash
anchorage admin maintenance on --reason "upgrade to v0.3.1" --ttl 30m
```

While this flag is set in the `ANCHORAGE_CLUSTER` NATS KV bucket:

- The rebalancer no-ops (so nodes briefly going `down` during restarts don't trigger replacement placements).
- The requeue sweeper no-ops (same reason).
- The API continues serving on live nodes; new pins are accepted and placed on the live set.
- `/v1/ready` keeps returning 200 on live nodes.

After the fleet has finished bouncing:

```bash
anchorage admin maintenance off
```

The safety rail `cluster.maintenance.maxDuration` (default 1h) means a forgotten flag logs loud warnings, so incidents aren't silently masked.

## Inspecting state

```bash
anchorage admin maintenance status
anchorage migrate status
```
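The drain verification step can be automated instead of eyeballed. A minimal sketch, assuming only the `/v1/ready` behavior described above; the helper name, URL, and timeout are hypothetical:

```bash
# wait_for_drain: poll a node's readiness endpoint until it reports 503,
# i.e. the drain has taken effect, or give up after a timeout.
# Usage: wait_for_drain http://node-under-drain:8080/v1/ready [timeout_seconds]
wait_for_drain() {
  local url="$1" timeout="${2:-120}" elapsed=0 code
  while [ "$elapsed" -lt "$timeout" ]; do
    # -o /dev/null -w '%{http_code}' prints only the HTTP status code
    code="$(curl -s -o /dev/null -w '%{http_code}' "$url" || true)"
    if [ "$code" = "503" ]; then
      echo "drained"
      return 0
    fi
    sleep 5
    elapsed=$((elapsed + 5))
  done
  echo "timed out waiting for drain" >&2
  return 1
}
```

Gating the hardware swap on this helper avoids pulling a node that is still accepting traffic.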
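Putting the pieces together, a rolling upgrade is maintenance mode wrapped around a per-node drain/restart/uncordon loop. A sketch using only the commands shown above; the function name and node-id arguments are hypothetical, and the actual restart step is deployment-specific and elided:

```bash
# rolling_upgrade: fleet-wide restart under cluster maintenance mode.
# Usage: rolling_upgrade "reason text" node_id [node_id ...]
rolling_upgrade() {
  local reason="$1"; shift
  # Freeze the rebalancer and requeue sweeper for the duration.
  anchorage admin maintenance on --reason "$reason" --ttl 30m
  for node in "$@"; do
    anchorage admin drain "$node"
    # ... restart the anchorage process on "$node" here ...
    anchorage admin uncordon "$node"
  done
  anchorage admin maintenance off
}
```

Keeping `--ttl` above the expected fleet-wide duration matters: if the loop outlives the TTL, the rebalancer wakes up mid-upgrade and may place replacements for nodes that are merely restarting.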