# Cluster operations

## Drain a single node (hardware replacement, etc.)

```sh
anchorage admin drain nod_01h7rfxv...
# verify: /v1/ready now returns 503 (with -f, curl exits non-zero)
curl -sf http://node-under-drain:8080/v1/ready
```

The drained node:

- Returns 503 from `/v1/ready` so the LB stops routing new traffic.
- Cancels its `pin.jobs.<nodeID>` consumer after `cluster.drainGracePeriod` (default 2m), once in-flight jobs finish.
- Keeps publishing `node.heartbeat.<nodeID>` so peers know it is intentionally out of rotation.
- Has its placements moved onto healthy nodes by the rebalancer, recorded with `reason=drain` in the audit log.
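Verifying the drain is just a matter of polling the readiness probe until it flips to 503. A minimal POSIX-shell helper, assuming the port from the example above (8080); the function name and timeout are illustrative, not part of anchorage:

```shell
#!/bin/sh
# wait_until_status URL CODE [TIMEOUT]: poll URL every 5s until it returns
# the wanted HTTP status code, or TIMEOUT seconds (default 180) elapse.
wait_until_status() {
  url=$1; want=$2; timeout=${3:-180}
  deadline=$(( $(date +%s) + timeout ))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    code=$(curl -s -o /dev/null -w '%{http_code}' "$url")
    [ "$code" = "$want" ] && return 0
    sleep 5
  done
  echo "timed out waiting for $want from $url" >&2
  return 1
}

# after `anchorage admin drain ...`:
# wait_until_status http://node-under-drain:8080/v1/ready 503
```

The same helper, called with `200`, confirms a node is back in rotation after an uncordon.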

When work is done, restore:

```sh
anchorage admin uncordon nod_01h7rfxv...
```

Existing pins already migrated stay put; the uncordoned node re-acquires load gradually as new pins land.

## Cluster-wide maintenance (rolling upgrade)

Before bouncing each anchorage node in turn:

```sh
anchorage admin maintenance on --reason "upgrade to v0.3.1" --ttl 30m
```

While this flag is set in the `ANCHORAGE_CLUSTER` NATS KV bucket:

- The rebalancer no-ops, so nodes briefly going down during restarts don't trigger replacement placements.
- The requeue sweeper no-ops, for the same reason.
- The API continues serving on live nodes; new pins are accepted and placed on the live set.
- `/v1/ready` keeps returning 200 on live nodes.
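With the flag set, bouncing the fleet is a plain per-node restart loop that waits for each readiness probe before moving on. A sketch, in which the node list, SSH access, and `systemctl` unit name are all assumptions about your deployment, not anchorage conventions:

```shell
#!/bin/sh
# Restart each node in turn while the maintenance flag is set,
# waiting for its readiness probe to return 200 before moving on.
# NODES and the systemd unit name are hypothetical.
NODES="node-a node-b node-c"

# HTTP status of a node's readiness probe
ready_code() {
  curl -s -o /dev/null -w '%{http_code}' "http://$1:8080/v1/ready"
}

bounce_all() {
  for n in $NODES; do
    ssh "$n" 'systemctl restart anchorage' || return 1
    until [ "$(ready_code "$n")" = "200" ]; do sleep 5; done
  done
}
```

Run `anchorage admin maintenance off` only after the loop completes for every node.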

After the fleet has finished bouncing:

```sh
anchorage admin maintenance off
```

The `cluster.maintenance.maxDuration` safety rail (default 1h) ensures a forgotten flag triggers loud warnings rather than silently masking incidents.
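For reference, the two timing knobs named in this doc might sit together in config roughly like this; a sketch assuming a YAML config file with key paths mirroring the names above, not a verbatim excerpt of anchorage's config schema:

```yaml
cluster:
  drainGracePeriod: 2m     # grace before a drained node cancels its pin.jobs consumer
  maintenance:
    maxDuration: 1h        # warn loudly when the maintenance flag outlives this
```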

## Inspecting state

```sh
anchorage admin maintenance status
anchorage migrate status
```