# Cluster operations

## Drain a single node (hardware replacement, etc.)

```bash
anchorage admin drain nod_01h7rfxv...

# verify
curl -sf http://node-under-drain:8080/v1/ready   # now 503
```

The drained node:

- Returns 503 from `/v1/ready` so the LB stops routing new traffic to it.
- Cancels its `pin.jobs.` consumer after `cluster.drainGracePeriod` (default 2m), once in-flight jobs finish.
- Keeps publishing `node.heartbeat.` so peers know it is intentionally out of rotation.
- Gets its placements moved onto healthy nodes by the rebalancer, recorded with `reason=drain` in the audit log.

When the work is done, restore the node:

```bash
anchorage admin uncordon nod_01h7rfxv...
```

Pins that already migrated stay put; the uncordoned node re-acquires load gradually as new pins land.

## Cluster-wide maintenance (rolling upgrade)

Before bouncing every anchorage in turn:

```bash
anchorage admin maintenance on --reason "upgrade to v0.3.1" --ttl 30m
```

While this flag is set in the `ANCHORAGE_CLUSTER` NATS KV bucket:

- The rebalancer no-ops (so nodes briefly going `down` during restarts don't trigger replacement placements).
- The requeue sweeper no-ops (same reason).
- The API continues serving on live nodes; new pins are accepted and placed on the live set.
- `/v1/ready` keeps returning 200 on live nodes.

After the fleet has finished bouncing:

```bash
anchorage admin maintenance off
```

The safety rail `cluster.maintenance.maxDuration` (default 1h) means a forgotten flag logs loud warnings, so incidents aren't silently masked.

## Inspecting state

```bash
anchorage admin maintenance status
anchorage migrate status
```
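The drain verification step can be automated instead of eyeballed. A minimal sketch, assuming only the `/v1/ready` behavior described above; the helper name, URL, and timeout are hypothetical:

```bash
# wait_for_drain: poll a node's readiness endpoint until it reports 503,
# i.e. the drain has taken effect, or give up after a timeout.
# Usage: wait_for_drain http://node-under-drain:8080/v1/ready [timeout_seconds]
wait_for_drain() {
  local url="$1" timeout="${2:-120}" elapsed=0 code
  while [ "$elapsed" -lt "$timeout" ]; do
    # -o /dev/null -w '%{http_code}' prints only the HTTP status code
    code="$(curl -s -o /dev/null -w '%{http_code}' "$url" || true)"
    if [ "$code" = "503" ]; then
      echo "drained"
      return 0
    fi
    sleep 5
    elapsed=$((elapsed + 5))
  done
  echo "timed out waiting for drain" >&2
  return 1
}
```

Gating the hardware swap on this helper avoids pulling a node that is still accepting traffic.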
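Putting the pieces together, a rolling upgrade is maintenance mode wrapped around a per-node drain/restart/uncordon loop. A sketch using only the commands shown above; the function name and node-id arguments are hypothetical, and the actual restart step is deployment-specific and elided:

```bash
# rolling_upgrade: fleet-wide restart under cluster maintenance mode.
# Usage: rolling_upgrade "reason text" node_id [node_id ...]
rolling_upgrade() {
  local reason="$1"; shift
  # Freeze the rebalancer and requeue sweeper for the duration.
  anchorage admin maintenance on --reason "$reason" --ttl 30m
  for node in "$@"; do
    anchorage admin drain "$node"
    # ... restart the anchorage process on "$node" here ...
    anchorage admin uncordon "$node"
  done
  anchorage admin maintenance off
}
```

Keeping `--ttl` above the expected fleet-wide duration matters: if the loop outlives the TTL, the rebalancer wakes up mid-upgrade and may place replacements for nodes that are merely restarting.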