Greenfield Go multi-tenant IPFS Pinning Service wire-compatible with the
IPFS Pinning Services API spec. Paired 1:1 with Kubo over localhost RPC,
clustered via embedded NATS JetStream, Postgres source-of-truth with
RLS-enforced tenancy, Fiber + huma v2 for the HTTP surface, Authentik
OIDC for session login with kid-rotated HS256 JWT API tokens.
Feature-complete against the 22-milestone build plan, including the
ship-it v1.0 gap items:
* admin CLIs: drain/uncordon, maintenance, mint-token, rotate-key,
prune-denylist, rebalance --dry-run, cache-stats, cluster-presences
* TTL leader election via NATS KV, fence tokens, JetStream dedup
* rebalancer (plan/apply split), reconciler, requeue sweeper
* ristretto caches with NATS-backed cross-node invalidation
(placements live-nodes + token denylist)
* maintenance watchdog for stuck cluster-pause flag
* Prometheus /metrics with CIDR ACL, HTTP/pin/scheduler/cache gauges
* rate limiting: session (10/min) + anonymous global (120/min)
* integration tests: rebalance, refcount multi-org, RLS belt
* goreleaser (tar + deb/rpm/apk + Alpine Docker) targeting Gitea
Stack: Cobra/Viper, Fiber v2 + huma v2, embedded NATS JetStream,
pgx/sqlc/golang-migrate, ristretto, TypeID, prometheus/client_golang,
testcontainers-go.
Deploying anchorage
Two supported shapes — both use the same container image + binaries produced by GoReleaser.
Before either path, configure your OIDC provider: see ../docs/authentik-setup.md for the Authentik walkthrough. Without valid
auth.authentik.issuer/clientID/audience values in anchorage.yaml the web UI login won't work — though anchorage admin mint-token and the API still do.
Option 1 — Linux packages (deb / rpm)
GoReleaser emits .deb and .rpm artifacts with a bundled systemd
unit, lifecycle hooks, and directory structure under
/etc/anchorage, /var/lib/anchorage, /var/log/anchorage.
# Debian / Ubuntu
apt install ./anchorage_${VERSION}_linux_amd64.deb
# RHEL / Fedora / Alma
dnf install ./anchorage_${VERSION}_linux_amd64.rpm
Post-install flow:
cp /etc/anchorage/anchorage.yaml.example /etc/anchorage/anchorage.yaml
# edit anchorage.yaml — postgres DSN, authentik issuer, ipfs.rpc, …
openssl rand -base64 48 > /etc/anchorage/jwt.key
chmod 0400 /etc/anchorage/jwt.key
chown anchorage:anchorage /etc/anchorage/jwt.key
# apply schema (advisory-lock-safe on a cluster)
/usr/bin/anchorage migrate up --config /etc/anchorage/anchorage.yaml
systemctl start anchorage
systemctl status anchorage
journalctl -u anchorage -f
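A minimal anchorage.yaml sketch to start from. It uses only keys this document mentions; the top-level shapes for the Postgres DSN and Kubo RPC address (postgres.dsn, ipfs.rpc) are assumptions — treat anchorage.yaml.example as authoritative and all values here as placeholders:

```yaml
# illustrative only — check anchorage.yaml.example for the real schema
postgres:
  dsn: "postgres://anchorage:CHANGE_ME@127.0.0.1:5432/anchorage?sslmode=disable"
ipfs:
  rpc: "http://127.0.0.1:5001"        # the 1:1-paired Kubo daemon
auth:
  authentik:
    issuer: "https://auth.example.com/application/o/anchorage/"
    clientID: "anchorage"
    audience: "anchorage"
  apiToken:
    signingKeys:
      - id: "2026-04"
        path: /etc/anchorage/jwt.key
        primary: true                 # the minting key
```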
Option 2 — Docker Swarm (three-node stack)
The stack in docker-compose.yml runs three
anchorage instances, each paired 1:1 with its own Kubo daemon, against
a single Postgres. An nginx LB fronts HTTP and upgrades /v1/events
to WebSocket.
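The /v1/events upgrade follows the standard nginx WebSocket shape. The bundled LB config in the stack may differ; the upstream name and port below are placeholders:

```nginx
# illustrative fragment — upstream name and port 8080 are assumptions
upstream anchorage_backend {
    server anchorage-1:8080;
    server anchorage-2:8080;
    server anchorage-3:8080;
}

location /v1/events {
    proxy_pass http://anchorage_backend;
    proxy_http_version 1.1;                  # required for the Upgrade handshake
    proxy_set_header Upgrade $http_upgrade;  # forward the client's upgrade request
    proxy_set_header Connection "upgrade";
    proxy_read_timeout 1h;                   # keep long-lived event streams open
}
```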
Prerequisites
- Three Docker Swarm nodes (or one node if you don't care about HA —
  just drop the placement.constraints lines).
- Each anchorage-hosting node needs anchorage.anchor-id as a label and
  anchorage.anchor=true:
docker swarm init
docker node update --label-add anchorage.db=true node-1
docker node update --label-add anchorage.anchor=true node-1
docker node update --label-add anchorage.anchor=true node-2
docker node update --label-add anchorage.anchor=true node-3
docker node update --label-add anchorage.anchor-id=anchor-1 node-1
docker node update --label-add anchorage.anchor-id=anchor-2 node-2
docker node update --label-add anchorage.anchor-id=anchor-3 node-3
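A mislabeled node silently fails the placement constraints, so it is worth verifying before deploying. The helper below (labels_ok is our name, not an anchorage command) checks the JSON label map that docker node inspect prints:

```shell
# Pure helper: given the JSON label map printed by
#   docker node inspect --format '{{json .Spec.Labels}}' NODE
# confirm both required anchorage labels are present.
labels_ok() {
  printf '%s' "$1" | grep -q '"anchorage.anchor":"true"' &&
    printf '%s' "$1" | grep -q '"anchorage.anchor-id":"anchor-'
}

# Against a live swarm (node name is a placeholder):
# labels_ok "$(docker node inspect --format '{{json .Spec.Labels}}' node-1)" \
#   && echo "node-1 labeled correctly"
```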
Secrets
openssl rand -base64 32 | docker secret create anchorage_postgres_password -
openssl rand -base64 48 | docker secret create anchorage_jwt_key -
Env file
cat > .env <<'EOF'
ANCHORAGE_IMAGE=git.anomalous.dev/alphacentri/anchorage:latest
ANCHORAGE_DOMAIN=anchor.example.com
ANCHORAGE_AUTHENTIK_URL=https://auth.example.com/application/o/anchorage/
POSTGRES_REPLICAS=0
EOF
Deploy
docker stack deploy -c docker-compose.yml anchorage
docker stack services anchorage
docker service logs anchorage_anchorage-1
Verify
curl -fsS https://anchor.example.com/v1/health
curl -fsS https://anchor.example.com/v1/ready
Upgrade
# Bump the image tag in .env, then:
docker stack deploy -c docker-compose.yml anchorage
# Before a disruptive rolling restart, pause the cluster rebalancer
# so brief node absences don't trigger placement thrash:
anchorage admin maintenance on --reason "upgrade to v1.2" --ttl 30m
# …wait for the stack to converge, then:
anchorage admin maintenance off
Drain a single node for hardware work:
anchorage admin drain nod_anchor_2 # also visible in audit log
anchorage admin uncordon nod_anchor_2
Minting a JWT for IPFS clients (ipfs pin remote)
Before any OIDC user exists — or when handing a long-lived token to a
CI pipeline or a headless service — use anchorage admin mint-token.
It reads the signing key directly off disk and emits a signed JWT
to stdout; no live anchorage process is required.
# Sysadmin break-glass token, default 395-day TTL (1 year + 30-day grace)
TOKEN=$(anchorage admin mint-token \
--signing-key /etc/anchorage/jwt.key \
--issuer https://auth.example.com/application/o/anchorage/ \
--audience anchorage)
# Hand it to the IPFS CLI:
ipfs pin remote service add anchor https://anchor.example.com/v1 "$TOKEN"
ipfs pin remote add --service=anchor --name "my-dataset" bafybeig...
--issuer and --audience must match the running anchorage's
auth.authentik.* config — when mint-token is run from the same host
as the server it reads these from anchorage.yaml automatically.
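Since mint-token writes the JWT to stdout, a stray log line would silently corrupt $TOKEN. A quick shape check catches that before the IPFS CLI does — a compact JWT is three non-empty base64url segments joined by dots (jwt_wellformed is our helper name, not an anchorage command):

```shell
# True iff the argument looks like a compact JWT:
# three non-empty base64url segments separated by dots.
jwt_wellformed() {
  printf '%s' "$1" | grep -Eq '^[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+$'
}

jwt_wellformed "${TOKEN:-}" || echo "warning: \$TOKEN does not look like a JWT" >&2

# then confirm the CLI accepted the service registration:
# ipfs pin remote service ls
```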
Shorter-lived tokens (e.g., a developer session):
anchorage admin mint-token --role member --org org_... --ttl 8h
Minted tokens are standalone — they don't appear in
GET /v1/tokens and can't be revoked individually. To revoke one,
either write its jti to the denylist via the /v1/tokens/{jti}
DELETE endpoint (if registered) or rotate the signing key to invalidate
every outstanding token at once.
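Finding a token's jti for the denylist only requires decoding its payload locally — no verification needed. Base64url must have its alphabet mapped back and its padding restored before base64 -d accepts it. A sketch (jwt_jti is our helper; the sed grab assumes jti is a plain string claim):

```shell
# Extract the `jti` claim from a compact JWT without verifying it.
jwt_jti() {
  p=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')    # payload: base64url -> base64
  while [ $(( ${#p} % 4 )) -ne 0 ]; do p="${p}="; done  # restore stripped padding
  printf '%s' "$p" | base64 -d | sed -n 's/.*"jti" *: *"\([^"]*\)".*/\1/p'
}

# Then, if the DELETE endpoint is registered ($ADMIN_TOKEN is a placeholder):
# curl -fsS -X DELETE -H "Authorization: Bearer $ADMIN_TOKEN" \
#   "https://anchor.example.com/v1/tokens/$(jwt_jti "$TOKEN")"
```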
Rotating the JWT signing key
anchorage supports overlap-style rotation — load the new key alongside the old, flip which one mints new tokens, then drop the retired key once outstanding tokens have expired or been re-minted. No mass re-auth event.
Every token carries a kid header naming the key that signed it.
The verifier picks the matching key from the currently-loaded set,
so "verify against either A or B" works unambiguously.
Config shape
auth.apiToken.signingKeys is a list. Exactly one entry has
primary: true — the minting key; any additional entries are
verify-only.
Steady state:
auth:
apiToken:
signingKeys:
- id: "2026-04"
path: /etc/anchorage/jwt.key
primary: true
During a rotation overlap:
auth:
apiToken:
signingKeys:
- id: "2026-04"
path: /etc/anchorage/jwt.key
primary: true # still the minting key
- id: "2026-10"
path: /etc/anchorage/jwt.key.2026-10
# verify-only until we flip `primary` below
Procedure
Step 1 — generate the new key and stage it.
anchorage admin rotate-signing-key --id 2026-10 --out /etc/anchorage/jwt.key.2026-10
# prints a YAML snippet to stdout — append it to auth.apiToken.signingKeys
Distribute the new key file to every anchorage node (Swarm secret, k8s Secret, Ansible, whatever you already use). The file must have identical bytes on every node.
Apply the config change adding the new entry (no primary: true) and
roll-restart the fleet. Every anchorage now verifies against both
keys but continues minting with the old primary.
Step 2 — flip primary. Edit the config so primary: true moves
from the old entry to the new one:
signingKeys:
- id: "2026-04"
path: /etc/anchorage/jwt.key
- id: "2026-10"
path: /etc/anchorage/jwt.key.2026-10
primary: true
Roll-restart. New mints now use kid=2026-10. Tokens already in the
wild with kid=2026-04 continue to verify.
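To confirm the flip took effect, mint a fresh token and decode its header — the kid should now read 2026-10. A small unverified-decode helper (jwt_kid is our name, not an anchorage command):

```shell
# Read the `kid` from a compact JWT's header without verifying it.
jwt_kid() {
  h=$(printf '%s' "$1" | cut -d. -f1 | tr '_-' '/+')    # header: base64url -> base64
  while [ $(( ${#h} % 4 )) -ne 0 ]; do h="${h}="; done  # restore stripped padding
  printf '%s' "$h" | base64 -d | sed -n 's/.*"kid" *: *"\([^"]*\)".*/\1/p'
}

# After the roll-restart, on a freshly minted token:
# jwt_kid "$TOKEN"   # expect: 2026-10
```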
Step 3 — drop the retired key. Wait until outstanding old-key
tokens have expired or been re-minted. auth.apiToken.maxTTL is the
upper bound:
- 24h default TTL + sessions only: wait 25h and you're done.
- 395-day IPFS client tokens: either wait the full window, or mass-revoke via the denylist and ask users to re-mint. Most shops pick the second path for security-driven rotations and the first for scheduled ones.
Remove the old entry:
signingKeys:
- id: "2026-10"
path: /etc/anchorage/jwt.key.2026-10
primary: true
Roll-restart. Any straggler token still signed with the old key is
now rejected with token: unknown kid "2026-04". Delete
/etc/anchorage/jwt.key from every node once the restart is complete.
When to rotate
- Scheduled (annual / per-security-policy) — follow the full three-step procedure. Invisible to users whose tokens renew inside the overlap window.
- Suspected compromise — do steps 1+2 immediately (seconds apart), then mass-denylist every outstanding old-key token or skip directly to step 3 and accept the breakage.
- Algorithm migration (HS256 → ed25519 / RS256) — not yet supported; the
  token package is HS256-only today. When it lands, the same three-step
  rotation pattern will apply.
Observability: Prometheus /metrics
anchorage serves a Prometheus scrape endpoint at /metrics at the
root (not under /v1) so standard service-discovery selectors work.
Gated by a CIDR allowlist on the direct TCP peer IP. Defaults to
loopback + RFC1918, which matches the typical compose / swarm / k8s
intra-cluster scrape path without leaking /metrics through a public
LB. Tighten or disable via server.metrics.allowCIDRs in
anchorage.yaml.
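An illustrative shape for tightening the allowlist to just the scrape subnet — the key path is as given above; the CIDR values are placeholders for your network:

```yaml
server:
  metrics:
    allowCIDRs:
      - 127.0.0.0/8     # local probes
      - 10.42.0.0/16    # Prometheus scrape network only
```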
Series exposed:
anchorage_http_requests_total{method,status_class}
anchorage_pin_ops_total{op,result}
anchorage_scheduler_fetch_total{node,result}
anchorage_scheduler_acks_total{node,status}
anchorage_cache_hits_total{name}
anchorage_cache_misses_total{name}
anchorage_leader_is_elected
anchorage_cluster_nodes_live
anchorage_placements_by_status{status}
Scrape with the standard Prometheus job config (scrape each anchorage
pod / container directly — the LB is bypassed). Alerting rules are
left to the operator; a reasonable starter set watches for
anchorage_leader_is_elected == 0 across every node (nobody is the
leader), rate(anchorage_pin_ops_total{result="err"}[5m]) spikes, and
anchorage_cluster_nodes_live falling below minReplicas.
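The starter set above, written out as hedged Prometheus alerting rules — the thresholds, durations, and the minReplicas value (here 2) are placeholders to tune per deployment:

```yaml
groups:
  - name: anchorage-starter
    rules:
      - alert: AnchorageNoLeader
        expr: max(anchorage_leader_is_elected) == 0   # nobody holds the lease
        for: 2m
        annotations:
          summary: "no anchorage node is the elected leader"
      - alert: AnchoragePinErrors
        expr: sum(rate(anchorage_pin_ops_total{result="err"}[5m])) > 0.1
        for: 10m
        annotations:
          summary: "sustained pin operation error rate"
      - alert: AnchorageNodesBelowMinReplicas
        expr: anchorage_cluster_nodes_live < 2        # placeholder for minReplicas
        for: 5m
        annotations:
          summary: "fewer live nodes than minReplicas"
```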
Rate limiting
Two layers:
- POST /v1/auth/session — capped per IP per minute
  (server.rateLimit.sessionPerMinute, default 10). Brute-force guard.
- All anonymous requests — capped per IP per minute
  (server.rateLimit.anonymousPerMinute, default 120). Authenticated
  traffic (valid Bearer or session cookie) is exempt. Probe paths
  (/v1/health, /v1/ready, /metrics) are exempt.
Storage is per-process in-memory. Sticky sessions at the LB make this
effectively global; without sticky sessions an attacker can burst
across N anchorage nodes for N× the throughput. If that matters in
your deployment, deploy behind a proxy that enforces its own global
limits (e.g., nginx limit_req_zone, envoy local_ratelimit).
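A minimal nginx global limit in front of the fleet, mirroring the two layers above — zone sizes, burst values, and the upstream name are illustrative:

```nginx
# http{} context: shared zones keyed by client IP, matching the defaults above
limit_req_zone $binary_remote_addr zone=anchorage_login:10m rate=10r/m;
limit_req_zone $binary_remote_addr zone=anchorage_anon:10m  rate=120r/m;

# server{} context
location = /v1/auth/session {
    limit_req zone=anchorage_login burst=5 nodelay;
    proxy_pass http://anchorage_backend;   # upstream name is a placeholder
}
location / {
    limit_req zone=anchorage_anon burst=40 nodelay;
    proxy_pass http://anchorage_backend;
}
```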
Backing up Postgres
docker exec -it $(docker ps -q -f name=anchorage_postgres) \
pg_dump -U anchorage -Fc anchorage > anchorage_$(date +%F).pgdump
Backing up NATS state
NATS state under /var/lib/anchorage/nats is non-authoritative — it
holds in-flight jobs and the leader / cluster-maintenance KV. Losing
it trips the requeue sweeper once, after which the cluster recovers on
its own; Postgres remains the source of truth.
Still, if you want it captured:
docker run --rm -v anchorage_anchorage_1_data:/data \
busybox tar czf - /data/nats > nats_1_$(date +%F).tar.gz