# Running an ATProto Relay for ATCR Hold Discovery
This document explains what it takes to run an ATProto relay for indexing ATCR hold records, including infrastructure requirements, configuration, and trade-offs.
## Overview
### What is an ATProto Relay?
An ATProto relay is a service that:
- **Subscribes to multiple PDS hosts** and aggregates their data streams
- **Outputs a combined "firehose"** event stream for real-time network updates
- **Validates data integrity** and identity signatures
- **Provides discovery endpoints** like `com.atproto.sync.listReposByCollection`
The relay acts as a network-wide indexer, making it possible to discover which DIDs have records of specific types (collections).
### Why ATCR Needs a Relay
ATCR uses hold captain records (`io.atcr.hold.captain`) stored in hold PDSs to enable hold discovery. The `listReposByCollection` endpoint allows AppViews to efficiently discover all holds in the network without crawling every PDS individually.
**The problem**: Standard Bluesky relays appear to only index collections from `did:plc` DIDs, not `did:web` DIDs. Since ATCR holds use `did:web` (e.g., `did:web:hold01.atcr.io`), they aren't discoverable via Bluesky's public relays.
## Recommended Approach: Phased Implementation
ATCR's discovery needs evolve as the network grows. Start simple, scale as needed.
## MVP: Minimal Discovery Service
For initial deployment with a small number of holds (dozens, not thousands), build a **lightweight custom discovery service** focused solely on `io.atcr.*` collections.
### Why Minimal Service for MVP?
- **Scope**: Only index `io.atcr.*` collections (manifests, tags, captain/crew, sailor profiles)
- **Opt-in**: Only crawls PDSs that explicitly call `requestCrawl`
- **Small scale**: Dozens of holds, not millions of users
- **Simple storage**: SQLite sufficient for current scale
- **Cost-effective**: $5-10/month VPS
### Architecture
**Inbound endpoints:**
```
POST /xrpc/com.atproto.sync.requestCrawl
→ Hold registers itself for crawling
GET /xrpc/com.atproto.sync.listReposByCollection?collection=io.atcr.hold.captain
→ AppView discovers holds
```
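The `requestCrawl` handler can stay small: validate the body, queue the hostname, return 200. A hedged sketch in Go — `parseRequestCrawl` and the `enqueue` hook are hypothetical names, with `enqueue` standing in for an INSERT into the `crawl_queue` table:

```go
package main

import (
	"encoding/json"
	"errors"
	"io"
	"net/http"
	"strings"
)

// parseRequestCrawl validates a requestCrawl body: {"hostname": "hold01.atcr.io"}.
func parseRequestCrawl(body []byte) (string, error) {
	var req struct {
		Hostname string `json:"hostname"`
	}
	if err := json.Unmarshal(body, &req); err != nil {
		return "", err
	}
	host := strings.TrimSpace(req.Hostname)
	if host == "" {
		return "", errors.New("hostname required")
	}
	return host, nil
}

// requestCrawlHandler wires the parser to HTTP. enqueue is a hypothetical
// hook standing in for an INSERT into the crawl_queue table.
func requestCrawlHandler(enqueue func(hostname string) error) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		body, err := io.ReadAll(io.LimitReader(r.Body, 1<<16))
		if err != nil {
			http.Error(w, "read error", http.StatusBadRequest)
			return
		}
		host, err := parseRequestCrawl(body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		if err := enqueue(host); err != nil {
			http.Error(w, "queue error", http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}
```

Registering it is one line: `http.Handle("/xrpc/com.atproto.sync.requestCrawl", requestCrawlHandler(queueCrawl))`, where `queueCrawl` is whatever the storage layer provides.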
**Outbound (client to PDS):**
```
1. com.atproto.repo.describeRepo → verify PDS exists
2. com.atproto.sync.getRepo → fetch full CAR file (initial backfill)
3. com.atproto.sync.subscribeRepos → WebSocket for real-time updates
4. Parse events → extract io.atcr.* records → index in SQLite
```
**Data flow:**
**Initial crawl (on requestCrawl):**
```
1. Hold POSTs requestCrawl → service queues crawl job
2. Service fetches getRepo (CAR file) from hold's PDS for backfill
3. Service parses CAR using indigo libraries
4. Service extracts io.atcr.* records (captain, crew, manifests, etc.)
5. Service stores: (did, collection, rkey, record_data) in SQLite
6. Service opens WebSocket to subscribeRepos for this DID
7. Service stores cursor for reconnection handling
```
**Ongoing updates (WebSocket):**
```
1. Receive commit events via subscribeRepos WebSocket
2. Parse event, filter to io.atcr.* collections only
3. Update indexed_records incrementally (insert/update/delete)
4. Update cursor after processing each event
5. On disconnect: reconnect with stored cursor to resume
```
**Discovery (AppView query):**
```
1. AppView GETs listReposByCollection?collection=io.atcr.hold.captain
2. Service queries SQLite WHERE collection='io.atcr.hold.captain'
3. Service returns list of DIDs with that collection
```
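The discovery path can be sketched end to end with the standard library. In this sketch the in-memory `index` type stands in for the SQLite query, and the response shape (`{"repos":[{"did":...}]}`) is modeled on the `com.atproto.sync.listReposByCollection` lexicon — treat it as an assumption to verify against the spec, not the service's final API:

```go
package main

import (
	"encoding/json"
	"net/http"
	"net/http/httptest"
	"net/url"
)

// index maps collection NSID -> DIDs; it stands in for the SQLite query
// SELECT DISTINCT did FROM indexed_records WHERE collection = ?.
type index map[string][]string

// listReposByCollection serves the discovery endpoint.
func (ix index) listReposByCollection(w http.ResponseWriter, r *http.Request) {
	collection := r.URL.Query().Get("collection")
	if collection == "" {
		http.Error(w, "collection required", http.StatusBadRequest)
		return
	}
	type repoRef struct {
		Did string `json:"did"`
	}
	out := struct {
		Repos []repoRef `json:"repos"`
	}{Repos: []repoRef{}}
	for _, did := range ix[collection] {
		out.Repos = append(out.Repos, repoRef{Did: did})
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(out)
}

// queryDIDs exercises the handler over an httptest server, the same way
// an AppView would call the real service.
func queryDIDs(ix index, collection string) ([]string, error) {
	srv := httptest.NewServer(http.HandlerFunc(ix.listReposByCollection))
	defer srv.Close()
	resp, err := http.Get(srv.URL + "/xrpc/com.atproto.sync.listReposByCollection?collection=" + url.QueryEscape(collection))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	var out struct {
		Repos []struct {
			Did string `json:"did"`
		} `json:"repos"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	dids := make([]string, 0, len(out.Repos))
	for _, rr := range out.Repos {
		dids = append(dids, rr.Did)
	}
	return dids, nil
}
```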
### Implementation Requirements
**Technologies:**
- Go (reuse indigo libraries for CAR parsing and WebSocket)
- SQLite (sufficient for dozens/hundreds of holds)
- Standard HTTP server + WebSocket client
**Core components:**
1. **HTTP handlers** (`cmd/atcr-discovery/handlers/`):
- `requestCrawl` - queue crawl jobs
- `listReposByCollection` - query indexed collections
2. **Crawler** (`pkg/discovery/crawler.go`):
- Fetch CAR files from PDSs for initial backfill
- Parse with `github.com/bluesky-social/indigo/repo`
- Extract records, filter to `io.atcr.*` only
3. **WebSocket subscriber** (`pkg/discovery/subscriber.go`):
- WebSocket client for `com.atproto.sync.subscribeRepos`
- Event parsing and filtering
- Cursor management and persistence
- Automatic reconnection with resume
4. **Storage** (`pkg/discovery/storage.go`):
- SQLite schema for indexed records
- Indexes on (collection, did) for fast queries
- Cursor storage for reconnection
5. **Worker** (`pkg/discovery/worker.go`):
- Background crawl job processor
- WebSocket connection manager
- Health monitoring for subscriptions
**Database schema:**
```sql
CREATE TABLE indexed_records (
    did TEXT NOT NULL,
    collection TEXT NOT NULL,
    rkey TEXT NOT NULL,
    record_data TEXT NOT NULL, -- JSON
    indexed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (did, collection, rkey)
);

CREATE INDEX idx_collection ON indexed_records(collection);
CREATE INDEX idx_did ON indexed_records(did);

CREATE TABLE crawl_queue (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    hostname TEXT NOT NULL UNIQUE,
    did TEXT,
    status TEXT DEFAULT 'pending', -- pending, in_progress, subscribed, failed
    last_crawled_at TIMESTAMP,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE subscriptions (
    did TEXT PRIMARY KEY,
    hostname TEXT NOT NULL,
    cursor INTEGER, -- Last processed sequence number
    status TEXT DEFAULT 'active', -- active, disconnected, failed
    last_event_at TIMESTAMP,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
```
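Two queries dominate this schema's hot path: the upsert applied on commit events and the lookup behind `listReposByCollection`. Hedged sketches in the SQLite dialect (exact column usage may differ in the final service):

```sql
-- Upsert a record on create/update events
INSERT INTO indexed_records (did, collection, rkey, record_data)
VALUES (?, ?, ?, ?)
ON CONFLICT (did, collection, rkey)
DO UPDATE SET record_data = excluded.record_data,
              indexed_at  = CURRENT_TIMESTAMP;

-- Discovery query behind listReposByCollection
SELECT DISTINCT did FROM indexed_records WHERE collection = ?;

-- Resume point for a subscription after reconnect
SELECT cursor FROM subscriptions WHERE did = ?;
```

`ON CONFLICT ... DO UPDATE` requires SQLite 3.24 or later.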
**Leveraging indigo libraries:**
```go
import (
	"bytes"
	"context"
	"fmt"
	"strings"

	comatproto "github.com/bluesky-social/indigo/api/atproto"
	"github.com/bluesky-social/indigo/events"
	"github.com/bluesky-social/indigo/events/schedulers/sequential"
	"github.com/bluesky-social/indigo/repo"
	"github.com/gorilla/websocket"
	"github.com/ipfs/go-cid"
)

// Store abstracts the SQLite layer described above.
type Store interface {
	IndexRecord(did, collection, rkey string, record []byte) error
	DeleteRecord(did, collection, rkey string) error
	UpdateCursor(did string, seq int64) error
}

// Initial backfill: parse a CAR file fetched via com.atproto.sync.getRepo.
func backfill(ctx context.Context, did string, carData []byte, store Store) error {
	r, err := repo.ReadRepoFromCar(ctx, bytes.NewReader(carData))
	if err != nil {
		return err
	}
	// Iterate all records; paths look like "io.atcr.hold.captain/self"
	return r.ForEach(ctx, "", func(path string, nodeCid cid.Cid) error {
		parts := strings.SplitN(path, "/", 2)
		if len(parts) != 2 {
			return nil // skip invalid paths
		}
		collection, rkey := parts[0], parts[1]
		// Filter to io.atcr.* only
		if !strings.HasPrefix(collection, "io.atcr.") {
			return nil
		}
		// Fetch raw record bytes and store them
		_, recordBytes, err := r.GetRecordBytes(ctx, path)
		if err != nil {
			return err
		}
		return store.IndexRecord(did, collection, rkey, *recordBytes)
	})
}

// Live updates: subscribe to a PDS firehose and index io.atcr.* ops.
// Note: commit ops carry a Path ("collection/rkey"), not separate
// Collection/Rkey fields, and record bytes travel in evt.Blocks.
func subscribe(ctx context.Context, hostname string, cursor int64, store Store) error {
	wsURL := fmt.Sprintf("wss://%s/xrpc/com.atproto.sync.subscribeRepos?cursor=%d", hostname, cursor)
	conn, _, err := websocket.DefaultDialer.Dial(wsURL, nil)
	if err != nil {
		return err
	}
	rsc := &events.RepoStreamCallbacks{
		RepoCommit: func(evt *comatproto.SyncSubscribeRepos_Commit) error {
			for _, op := range evt.Ops {
				parts := strings.SplitN(op.Path, "/", 2)
				if len(parts) != 2 || !strings.HasPrefix(parts[0], "io.atcr.") {
					continue
				}
				collection, rkey := parts[0], parts[1]
				switch op.Action {
				case "create", "update":
					// evt.Blocks is a CAR slice containing the new records;
					// parse it and look up the op's path.
					rr, err := repo.ReadRepoFromCar(ctx, bytes.NewReader(evt.Blocks))
					if err != nil {
						return err
					}
					_, recordBytes, err := rr.GetRecordBytes(ctx, op.Path)
					if err != nil {
						return err
					}
					if err := store.IndexRecord(evt.Repo, collection, rkey, *recordBytes); err != nil {
						return err
					}
				case "delete":
					if err := store.DeleteRecord(evt.Repo, collection, rkey); err != nil {
						return err
					}
				}
			}
			// Persist cursor so a reconnect can resume from evt.Seq
			return store.UpdateCursor(evt.Repo, evt.Seq)
		},
	}
	// Process stream events sequentially
	sched := sequential.NewScheduler(hostname, rsc.EventHandler)
	return events.HandleRepoStream(ctx, conn, sched)
}
```
### Infrastructure Requirements
**Minimum specs:**
- 1 vCPU
- 1-2GB RAM
- 20GB SSD
- Minimal bandwidth (<1GB/day for dozens of holds)
**Estimated cost:**
- Hetzner CX11: €4.15/month (~$5/month)
- DigitalOcean Basic: $6/month
- Fly.io: ~$5-10/month
**Deployment:**
```bash
# Build
go build -o atcr-discovery ./cmd/atcr-discovery
# Run
export DATABASE_PATH="/var/lib/atcr-discovery/discovery.db"
export HTTP_ADDR=":8080"
./atcr-discovery
```
### Limitations
**What it does NOT do:**
- ❌ Serve outbound `subscribeRepos` firehose (AppViews query via listReposByCollection)
- ❌ Full MST validation (trust PDS validation)
- ❌ Scale to millions of accounts (SQLite limits)
- ❌ Multi-instance deployment (single process with SQLite)
**When to migrate to full relay:** When you have 1000+ holds, need PostgreSQL, or multi-instance deployment.
## Future Scale: Full Relay (Sync v1.1)
When ATCR grows beyond dozens of holds and needs real-time indexing, migrate to Bluesky's relay v1.1 implementation.
### When to Upgrade
**Indicators:**
- 100+ holds requesting frequent crawls
- Need real-time updates (re-crawl latency too high)
- Multiple AppView instances need coordinated discovery
- SQLite performance becomes bottleneck
### Relay v1.1 Characteristics
Released May 2025, this is Bluesky's current reference implementation.
**Key features:**
- **Non-archival**: Doesn't mirror full repository data, only processes firehose
- **WebSocket subscriptions**: Real-time updates from PDSs
- **Scalable**: 2 vCPU, 12GB RAM handles ~100M accounts
- **PostgreSQL**: Required for production scale
- **Admin UI**: Web dashboard for management
**Source**: `github.com/bluesky-social/indigo/cmd/relay`
### Migration Path
**Step 1: Deploy relay v1.1**
```bash
git clone https://github.com/bluesky-social/indigo.git
cd indigo
go build -o relay ./cmd/relay
export DATABASE_URL="postgres://relay:password@localhost:5432/atcr_relay"
./relay --admin-password="secure-password"
```
**Step 2: Migrate data**
- Export indexed records from SQLite
- Trigger crawls in relay for all known holds
- Verify relay indexes correctly
**Step 3: Update AppView configuration**
```bash
# Point to new relay
export ATCR_RELAY_ENDPOINT="https://relay.atcr.io"
```
**Step 4: Decommission minimal service**
- Monitor relay for stability
- Shut down old discovery service
### Infrastructure Requirements (Full Relay)
**Minimum specs:**
- 2 vCPU cores
- 12GB RAM
- 100GB SSD
- 30 Mbps bandwidth
**Estimated cost:**
- Hetzner: ~$30-40/month
- DigitalOcean: ~$50/month (with managed PostgreSQL)
- Fly.io: ~$35-50/month
## Collection Indexing: The `collectiondir` Microservice
The `com.atproto.sync.listReposByCollection` endpoint is **not part of the relay core**. It's provided by a separate microservice called **`collectiondir`**.
### What is collectiondir?
- **Separate service** that indexes collections for efficient discovery
- **Optional**: Not required by the ATProto spec, but very useful for AppViews
- **Deployed alongside relay** by Bluesky's public instances
### Current Limitation: did:plc Only?
Based on testing, Bluesky's public relays (with collectiondir) appear to:
- ✅ Index `io.atcr.*` collections from `did:plc` DIDs
- ❌ NOT index `io.atcr.*` collections from `did:web` DIDs
This means:
- ATCR manifests from users (did:plc) are discoverable
- ATCR hold captain records (did:web) are NOT discoverable
- The relay still **stores** all data (CAR file includes did:web records)
- The issue is specifically with **indexing** for `listReposByCollection`
### Configuring collectiondir
Documentation on configuring collectiondir is sparse. Possible approaches:
1. **Fork and modify**: Clone indigo repo, modify collectiondir to index all DIDs
2. **Configuration file**: Check if collectiondir accepts whitelist/configuration for indexed collections
3. **No filtering**: Default behavior might be to index everything, but Bluesky's deployment filters
**Action item**: Review `indigo/cmd/collectiondir` source code to understand configuration options.
## Multi-Relay Strategy
Holds can request crawls from **multiple relays** simultaneously. This enables:
### Scenario: Bluesky + ATCR Relays
**Setup:**
1. Hold deploys with embedded PDS at `did:web:hold01.atcr.io`
2. Hold creates captain record (`io.atcr.hold.captain/self`)
3. Hold requests crawl from **both**:
- Bluesky relay: `https://bsky.network/xrpc/com.atproto.sync.requestCrawl`
- ATCR relay: `https://relay.atcr.io/xrpc/com.atproto.sync.requestCrawl`
**Result:**
- ✅ Bluesky relay indexes social posts (if hold owner posts)
- ✅ ATCR relay indexes hold captain records
- ✅ AppViews query ATCR relay for hold discovery
- ✅ Independent networks - Bluesky posts work regardless of ATCR relay
### Request Crawl Script
The existing script can be modified to support multiple relays:
```bash
#!/bin/bash
# deploy/request-crawl.sh
HOSTNAME=$1
BLUESKY_RELAY=${2:-"https://bsky.network"}
ATCR_RELAY=${3:-"https://relay.atcr.io"}
echo "Requesting crawl for $HOSTNAME from Bluesky relay..."
curl -X POST "$BLUESKY_RELAY/xrpc/com.atproto.sync.requestCrawl" \
  -H "Content-Type: application/json" \
  -d "{\"hostname\": \"$HOSTNAME\"}"

echo "Requesting crawl for $HOSTNAME from ATCR relay..."
curl -X POST "$ATCR_RELAY/xrpc/com.atproto.sync.requestCrawl" \
  -H "Content-Type: application/json" \
  -d "{\"hostname\": \"$HOSTNAME\"}"
```
Usage:
```bash
./deploy/request-crawl.sh hold01.atcr.io
```
## Deployment: Minimal Discovery Service
### 1. Infrastructure Setup
**Provision VPS:**
- Hetzner CX11, DigitalOcean Basic, or Fly.io
- Public domain (e.g., `discovery.atcr.io`)
- TLS certificate (Let's Encrypt)
**Configure reverse proxy (optional - nginx):**
```nginx
upstream discovery {
    server 127.0.0.1:8080;
}

server {
    listen 443 ssl http2;
    server_name discovery.atcr.io;

    ssl_certificate /etc/letsencrypt/live/discovery.atcr.io/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/discovery.atcr.io/privkey.pem;

    location / {
        proxy_pass http://discovery;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
```
### 2. Build and Deploy
```bash
# Clone ATCR repo
git clone https://github.com/atcr-io/atcr.git
cd atcr
# Build discovery service
go build -o atcr-discovery ./cmd/atcr-discovery
# Run
export DATABASE_PATH="/var/lib/atcr-discovery/discovery.db"
export HTTP_ADDR=":8080"
export CRAWL_INTERVAL="12h"
./atcr-discovery
```
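If the service runs under systemd rather than a shell session, a minimal unit might look like the following — the user, binary path, and intervals are illustrative, not part of the repo:

```ini
# /etc/systemd/system/atcr-discovery.service (illustrative)
[Unit]
Description=ATCR discovery service
After=network-online.target
Wants=network-online.target

[Service]
User=atcr
ExecStart=/usr/local/bin/atcr-discovery
Environment=DATABASE_PATH=/var/lib/atcr-discovery/discovery.db
Environment=HTTP_ADDR=:8080
Environment=CRAWL_INTERVAL=12h
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```

Enable with `systemctl enable --now atcr-discovery`.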
### 3. Update Hold Startup
Each hold should request crawl on startup:
```bash
# In hold startup script or environment
export ATCR_DISCOVERY_URL="https://discovery.atcr.io"
# Request crawl from both Bluesky and ATCR
curl -X POST "https://bsky.network/xrpc/com.atproto.sync.requestCrawl" \
  -H "Content-Type: application/json" \
  -d "{\"hostname\": \"$HOLD_PUBLIC_URL\"}"

curl -X POST "$ATCR_DISCOVERY_URL/xrpc/com.atproto.sync.requestCrawl" \
  -H "Content-Type: application/json" \
  -d "{\"hostname\": \"$HOLD_PUBLIC_URL\"}"
```
### 4. Update AppView Configuration
Point AppView discovery worker to the discovery service:
```bash
# In .env.appview or environment
export ATCR_RELAY_ENDPOINT="https://discovery.atcr.io"
export ATCR_HOLD_DISCOVERY_ENABLED="true"
export ATCR_HOLD_DISCOVERY_INTERVAL="6h"
```
### 5. Monitor and Maintain
**Monitoring:**
- Check crawl queue status
- Monitor SQLite database size
- Track failed crawls
**Maintenance:**
- Re-crawl on schedule (every 6-24 hours)
- Prune stale records (>7 days old)
- Backup SQLite database regularly
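The pruning task can be a scheduled SQL statement. The 7-day retention below matches the guideline above but is a policy choice, and it assumes `indexed_at` is refreshed on each re-crawl so active holds are never pruned:

```sql
-- Prune records not refreshed within 7 days
DELETE FROM indexed_records
WHERE indexed_at < datetime('now', '-7 days');

-- Reclaim space afterwards (optional)
VACUUM;
```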
## Trade-Offs and Considerations
### Running Your Own Relay
**Pros:**
- ✅ Full control over indexing (can index `did:web` holds)
- ✅ No dependency on third-party relay policies
- ✅ Can customize collection filters for ATCR-specific needs
- ✅ Relatively lightweight with modern relay implementation
**Cons:**
- ❌ Infrastructure cost (~$30-50/month minimum)
- ❌ Operational overhead (monitoring, updates, backups)
- ❌ Need to maintain as network grows
- ❌ Single point of failure for discovery (unless multi-relay)
### Alternatives to Running a Relay
#### 1. Direct Registration API
Holds POST to AppView on startup to register themselves:
**Pros:**
- ✅ Simplest implementation
- ✅ No relay infrastructure needed
- ✅ Immediate registration (no crawl delay)
**Cons:**
- ❌ Ties holds to specific AppView instances
- ❌ Breaks decentralized discovery model
- ❌ Each AppView has different hold registry
#### 2. Static Discovery File
Maintain `https://atcr.io/.well-known/holds.json`:
**Pros:**
- ✅ No infrastructure beyond static hosting
- ✅ All AppViews share same registry
- ✅ Simple to implement
**Cons:**
- ❌ Manual process (PRs/issues to add holds)
- ❌ Not real-time discovery
- ❌ Centralized control point
#### 3. Hybrid Approach
Combine multiple discovery mechanisms:
```go
func (w *HoldDiscoveryWorker) DiscoverHolds(ctx context.Context) error {
	// 1. Fetch static registry
	staticHolds := w.fetchStaticRegistry()
	// 2. Query relay (if available)
	relayHolds := w.queryRelay(ctx)
	// 3. Accept direct registrations
	registeredHolds := w.getDirectRegistrations()
	// Merge and deduplicate
	allHolds := mergeHolds(staticHolds, relayHolds, registeredHolds)
	// Cache in database
	for _, hold := range allHolds {
		w.cacheHold(hold)
	}
	return nil
}
```
**Pros:**
- ✅ Multiple discovery paths (resilient)
- ✅ Gradual migration to relay-based discovery
- ✅ Supports both centralized bootstrap and decentralized growth
**Cons:**
- ❌ More complex implementation
- ❌ Potential for stale data if sources conflict
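The `mergeHolds` step in the hybrid sketch is where conflicting sources meet. One possible dedupe, assuming holds are keyed by hostname (a simplification) and earlier sources take precedence — `Hold` and its fields are illustrative, not types from the repo:

```go
package main

// Hold is a minimal stand-in for whatever the AppView tracks per hold.
type Hold struct {
	Hostname string
	DID      string
	Source   string // e.g. "relay", "static", "registration"
}

// mergeHolds deduplicates by hostname; earlier sources win, so pass the
// most authoritative list first (e.g. relay before static registry).
func mergeHolds(sources ...[]Hold) []Hold {
	seen := make(map[string]bool)
	var out []Hold
	for _, src := range sources {
		for _, h := range src {
			if h.Hostname == "" || seen[h.Hostname] {
				continue
			}
			seen[h.Hostname] = true
			out = append(out, h)
		}
	}
	return out
}
```

Source precedence is the policy knob here: putting relay results first means live data overrides the static bootstrap registry when both know about a hold.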
## Recommendations for ATCR
### Phase 1: MVP (Now - 1000 holds)
**Build minimal discovery service with WebSocket** (~$5-10/month):
1. Implement `requestCrawl` + `listReposByCollection` endpoints
2. Initial backfill via `getRepo` (CAR file parsing)
3. Real-time updates via WebSocket `subscribeRepos`
4. SQLite storage with cursor management
5. Filter to `io.atcr.*` collections only
**Deliverables:**
- `cmd/atcr-discovery` service
- SQLite schema with cursor storage
- CAR file parser (indigo libraries)
- WebSocket subscriber with reconnection
- Deployment scripts
**Cost**: ~$5-10/month VPS
**Why**: Minimal infrastructure, real-time updates, full control over indexing, sufficient for hundreds of holds.
### Phase 2: Migrate to Full Relay (1000+ holds)
**Deploy Bluesky relay v1.1** when scaling needed (~$30-50/month):
1. Set up PostgreSQL database
2. Deploy indigo relay with admin UI
3. Migrate indexed data from SQLite
4. Configure for `io.atcr.*` collection filtering (if possible)
5. Handle thousands of concurrent WebSocket connections
**Cost**: ~$30-50/month
**Why**: Proven scalability to 100M+ accounts, standardized protocol, community support, production-ready infrastructure.
### Phase 3: Multi-Relay Federation (Future)
**Decentralized relay network:**
1. Multiple ATCR relays operated independently
2. AppViews query multiple relays (fallback/redundancy)
3. Holds request crawls from all known ATCR relays
4. Cross-relay synchronization (optional)
**Why**: No single point of failure, fully decentralized discovery, geographic distribution.
## Next Steps
### For MVP Implementation
1. **Create `cmd/atcr-discovery` package structure**
- HTTP handlers for XRPC endpoints (`requestCrawl`, `listReposByCollection`)
- Crawler with indigo CAR parsing for initial backfill
- WebSocket subscriber for real-time updates
- SQLite storage layer with cursor management
- Background worker for managing subscriptions
2. **Database schema**
- `indexed_records` table for collection data
- `crawl_queue` table for crawl job management
- `subscriptions` table for WebSocket cursor tracking
- Indexes for efficient queries
3. **WebSocket implementation**
- Use `github.com/bluesky-social/indigo/events` for event handling
- Implement reconnection logic with cursor resume
- Filter events to `io.atcr.*` collections only
- Health monitoring for active subscriptions
4. **Testing strategy**
- Unit tests for CAR parsing
- Unit tests for event filtering
- Integration tests with mock PDSs and WebSocket
- Connection failure and reconnection testing
- Load testing with SQLite
5. **Deployment**
- Dockerfile for discovery service
- Deployment scripts (systemd, docker-compose)
- Monitoring setup (logs, metrics, WebSocket health)
- Alert on subscription failures
6. **Documentation**
- API documentation for XRPC endpoints
- Deployment guide
- Troubleshooting guide (WebSocket connection issues)
### Open Questions
1. **CAR parsing edge cases**: How to handle malformed CAR files or invalid records?
2. **WebSocket reconnection**: What's the optimal backoff strategy for reconnection attempts?
3. **Subscription management**: How many concurrent WebSocket connections can SQLite handle?
4. **Rate limiting**: Should discovery service rate-limit requestCrawl to prevent abuse?
5. **Authentication**: Should requestCrawl require authentication, or remain open?
6. **Cursor storage**: Should cursors be persisted immediately or batched for performance?
7. **Monitoring**: What metrics are most important for operational visibility (active subs, event rate, lag)?
8. **Error handling**: When a WebSocket dies, should we re-backfill via getRepo or trust cursor resume?
## References
### ATProto Specifications
- [ATProto Sync Specification](https://atproto.com/specs/sync)
- [Repository Specification](https://atproto.com/specs/repository)
- [CAR File Format](https://ipld.io/specs/transport/car/)
### Indigo Libraries
- [Indigo Repository](https://github.com/bluesky-social/indigo)
- [Indigo Repo Package](https://pkg.go.dev/github.com/bluesky-social/indigo/repo)
- [Indigo ATProto Package](https://pkg.go.dev/github.com/bluesky-social/indigo/atproto)
### Relay Reference (Future)
- [Relay v1.1 Updates](https://docs.bsky.app/blog/relay-sync-updates)
- [Indigo Relay Implementation](https://github.com/bluesky-social/indigo/tree/main/cmd/relay)
- [Running a Full-Network Relay](https://whtwnd.com/bnewbold.net/3kwzl7tye6u2y)