# Running an ATProto Relay for ATCR Hold Discovery
This document explains what it takes to run an ATProto relay for indexing ATCR hold records, including infrastructure requirements, configuration, and trade-offs.
## Overview
### What is an ATProto Relay?
An ATProto relay is a service that:
- **Subscribes to multiple PDS hosts** and aggregates their data streams
- **Outputs a combined "firehose"** event stream for real-time network updates
- **Validates data integrity** and identity signatures
- **Provides discovery endpoints** like `com.atproto.sync.listReposByCollection`
The relay acts as a network-wide indexer, making it possible to discover which DIDs have records of specific types (collections).
### Why ATCR Needs a Relay
ATCR uses hold captain records (`io.atcr.hold.captain`) stored in hold PDSs to enable hold discovery. The `listReposByCollection` endpoint allows AppViews to efficiently discover all holds in the network without crawling every PDS individually.
**The problem**: Standard Bluesky relays appear to only index collections from `did:plc` DIDs, not `did:web` DIDs. Since ATCR holds use `did:web` (e.g., `did:web:hold01.atcr.io`), they aren't discoverable via Bluesky's public relays.
## Recommended Approach: Phased Implementation
ATCR's discovery needs evolve as the network grows. Start simple, scale as needed.
## MVP: Minimal Discovery Service
For initial deployment with a small number of holds (dozens, not thousands), build a **lightweight custom discovery service** focused solely on `io.atcr.*` collections.
### Why Minimal Service for MVP?
- **Scope**: Only index `io.atcr.*` collections (manifests, tags, captain/crew, sailor profiles)
- **Opt-in**: Only crawls PDSs that explicitly call `requestCrawl`
- **Small scale**: Dozens of holds, not millions of users
- **Simple storage**: SQLite sufficient for current scale
- **Cost-effective**: $5-10/month VPS
### Architecture
**Inbound endpoints:**
```
POST /xrpc/com.atproto.sync.requestCrawl
→ Hold registers itself for crawling
GET /xrpc/com.atproto.sync.listReposByCollection?collection=io.atcr.hold.captain
→ AppView discovers holds
```
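The `requestCrawl` handler can stay small: validate the body, queue the hostname, return 200. A hedged sketch in Go — `parseRequestCrawl` and the `enqueue` hook are hypothetical names, with `enqueue` standing in for an INSERT into the `crawl_queue` table:

```go
package main

import (
	"encoding/json"
	"errors"
	"io"
	"net/http"
	"strings"
)

// parseRequestCrawl validates a requestCrawl body: {"hostname": "hold01.atcr.io"}.
func parseRequestCrawl(body []byte) (string, error) {
	var req struct {
		Hostname string `json:"hostname"`
	}
	if err := json.Unmarshal(body, &req); err != nil {
		return "", err
	}
	host := strings.TrimSpace(req.Hostname)
	if host == "" {
		return "", errors.New("hostname required")
	}
	return host, nil
}

// requestCrawlHandler wires the parser to HTTP. enqueue is a hypothetical
// hook standing in for an INSERT into the crawl_queue table.
func requestCrawlHandler(enqueue func(hostname string) error) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		body, err := io.ReadAll(io.LimitReader(r.Body, 1<<16))
		if err != nil {
			http.Error(w, "read error", http.StatusBadRequest)
			return
		}
		host, err := parseRequestCrawl(body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		if err := enqueue(host); err != nil {
			http.Error(w, "queue error", http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}
```

Registering it is one line: `http.Handle("/xrpc/com.atproto.sync.requestCrawl", requestCrawlHandler(queueCrawl))`, where `queueCrawl` is whatever the storage layer provides.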
**Outbound (client to PDS):**
```
1. com.atproto.repo.describeRepo → verify PDS exists
2. com.atproto.sync.getRepo → fetch full CAR file (initial backfill)
3. com.atproto.sync.subscribeRepos → WebSocket for real-time updates
4. Parse events → extract io.atcr.* records → index in SQLite
```
**Data flow:**
**Initial crawl (on requestCrawl):**
```
1. Hold POSTs requestCrawl → service queues crawl job
2. Service fetches getRepo (CAR file) from hold's PDS for backfill
3. Service parses CAR using indigo libraries
4. Service extracts io.atcr.* records (captain, crew, manifests, etc.)
5. Service stores: (did, collection, rkey, record_data) in SQLite
6. Service opens WebSocket to subscribeRepos for this DID
7. Service stores cursor for reconnection handling
```
**Ongoing updates (WebSocket):**
```
1. Receive commit events via subscribeRepos WebSocket
2. Parse event, filter to io.atcr.* collections only
3. Update indexed_records incrementally (insert/update/delete)
4. Update cursor after processing each event
5. On disconnect: reconnect with stored cursor to resume
```
**Discovery (AppView query):**
```
1. AppView GETs listReposByCollection?collection=io.atcr.hold.captain
2. Service queries SQLite WHERE collection='io.atcr.hold.captain'
3. Service returns list of DIDs with that collection
```
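The discovery path can be sketched end to end with the standard library. In this sketch the in-memory `index` type stands in for the SQLite query, and the response shape (`{"repos":[{"did":...}]}`) is modeled on the `com.atproto.sync.listReposByCollection` lexicon — treat it as an assumption to verify against the spec, not the service's final API:

```go
package main

import (
	"encoding/json"
	"net/http"
	"net/http/httptest"
	"net/url"
)

// index maps collection NSID -> DIDs; it stands in for the SQLite query
// SELECT DISTINCT did FROM indexed_records WHERE collection = ?.
type index map[string][]string

// listReposByCollection serves the discovery endpoint.
func (ix index) listReposByCollection(w http.ResponseWriter, r *http.Request) {
	collection := r.URL.Query().Get("collection")
	if collection == "" {
		http.Error(w, "collection required", http.StatusBadRequest)
		return
	}
	type repoRef struct {
		Did string `json:"did"`
	}
	out := struct {
		Repos []repoRef `json:"repos"`
	}{Repos: []repoRef{}}
	for _, did := range ix[collection] {
		out.Repos = append(out.Repos, repoRef{Did: did})
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(out)
}

// queryDIDs exercises the handler over an httptest server, the same way
// an AppView would call the real service.
func queryDIDs(ix index, collection string) ([]string, error) {
	srv := httptest.NewServer(http.HandlerFunc(ix.listReposByCollection))
	defer srv.Close()
	resp, err := http.Get(srv.URL + "/xrpc/com.atproto.sync.listReposByCollection?collection=" + url.QueryEscape(collection))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	var out struct {
		Repos []struct {
			Did string `json:"did"`
		} `json:"repos"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	dids := make([]string, 0, len(out.Repos))
	for _, rr := range out.Repos {
		dids = append(dids, rr.Did)
	}
	return dids, nil
}
```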
### Implementation Requirements
**Technologies:**
- Go (reuse indigo libraries for CAR parsing and WebSocket)
- SQLite (sufficient for dozens/hundreds of holds)
- Standard HTTP server + WebSocket client
**Core components:**
1. **HTTP handlers** (`cmd/atcr-discovery/handlers/`):
- `requestCrawl` - queue crawl jobs
- `listReposByCollection` - query indexed collections
2. **Crawler** (`pkg/discovery/crawler.go`):
- Fetch CAR files from PDSs for initial backfill
- Parse with `github.com/bluesky-social/indigo/repo`
- Extract records, filter to `io.atcr.*` only
3. **WebSocket subscriber** (`pkg/discovery/subscriber.go`):
- WebSocket client for `com.atproto.sync.subscribeRepos`
- Event parsing and filtering
- Cursor management and persistence
- Automatic reconnection with resume
4. **Storage** (`pkg/discovery/storage.go`):
- SQLite schema for indexed records
- Indexes on (collection, did) for fast queries
- Cursor storage for reconnection
5. **Worker** (`pkg/discovery/worker.go`):
- Background crawl job processor
- WebSocket connection manager
- Health monitoring for subscriptions
**Database schema:**
```sql
CREATE TABLE indexed_records (
    did TEXT NOT NULL,
    collection TEXT NOT NULL,
    rkey TEXT NOT NULL,
    record_data TEXT NOT NULL, -- JSON
    indexed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (did, collection, rkey)
);

CREATE INDEX idx_collection ON indexed_records(collection);
CREATE INDEX idx_did ON indexed_records(did);

CREATE TABLE crawl_queue (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    hostname TEXT NOT NULL UNIQUE,
    did TEXT,
    status TEXT DEFAULT 'pending', -- pending, in_progress, subscribed, failed
    last_crawled_at TIMESTAMP,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE subscriptions (
    did TEXT PRIMARY KEY,
    hostname TEXT NOT NULL,
    cursor INTEGER, -- Last processed sequence number
    status TEXT DEFAULT 'active', -- active, disconnected, failed
    last_event_at TIMESTAMP,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
```
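Two queries dominate this schema's hot path: the upsert applied on commit events and the lookup behind `listReposByCollection`. Hedged sketches in the SQLite dialect (exact column usage may differ in the final service):

```sql
-- Upsert a record on create/update events
INSERT INTO indexed_records (did, collection, rkey, record_data)
VALUES (?, ?, ?, ?)
ON CONFLICT (did, collection, rkey)
DO UPDATE SET record_data = excluded.record_data,
              indexed_at  = CURRENT_TIMESTAMP;

-- Discovery query behind listReposByCollection
SELECT DISTINCT did FROM indexed_records WHERE collection = ?;

-- Resume point for a subscription after reconnect
SELECT cursor FROM subscriptions WHERE did = ?;
```

`ON CONFLICT ... DO UPDATE` requires SQLite 3.24 or later.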
**Leveraging indigo libraries:**
```go
import (
	"bytes"
	"context"
	"fmt"
	"strings"

	comatproto "github.com/bluesky-social/indigo/api/atproto"
	"github.com/bluesky-social/indigo/events"
	"github.com/bluesky-social/indigo/events/schedulers/sequential"
	"github.com/bluesky-social/indigo/repo"
	"github.com/gorilla/websocket"
	"github.com/ipfs/go-cid"
)

// Store abstracts the SQLite layer described above.
type Store interface {
	IndexRecord(did, collection, rkey string, record []byte) error
	DeleteRecord(did, collection, rkey string) error
	UpdateCursor(did string, seq int64) error
}

// Initial backfill: parse a CAR file fetched via com.atproto.sync.getRepo.
func backfill(ctx context.Context, did string, carData []byte, store Store) error {
	r, err := repo.ReadRepoFromCar(ctx, bytes.NewReader(carData))
	if err != nil {
		return err
	}
	// Iterate all records; paths look like "io.atcr.hold.captain/self"
	return r.ForEach(ctx, "", func(path string, nodeCid cid.Cid) error {
		parts := strings.SplitN(path, "/", 2)
		if len(parts) != 2 {
			return nil // skip invalid paths
		}
		collection, rkey := parts[0], parts[1]
		// Filter to io.atcr.* only
		if !strings.HasPrefix(collection, "io.atcr.") {
			return nil
		}
		// Fetch raw record bytes and store them
		_, recordBytes, err := r.GetRecordBytes(ctx, path)
		if err != nil {
			return err
		}
		return store.IndexRecord(did, collection, rkey, *recordBytes)
	})
}

// Live updates: subscribe to a PDS firehose and index io.atcr.* ops.
// Note: commit ops carry a Path ("collection/rkey"), not separate
// Collection/Rkey fields, and record bytes travel in evt.Blocks.
func subscribe(ctx context.Context, hostname string, cursor int64, store Store) error {
	wsURL := fmt.Sprintf("wss://%s/xrpc/com.atproto.sync.subscribeRepos?cursor=%d", hostname, cursor)
	conn, _, err := websocket.DefaultDialer.Dial(wsURL, nil)
	if err != nil {
		return err
	}
	rsc := &events.RepoStreamCallbacks{
		RepoCommit: func(evt *comatproto.SyncSubscribeRepos_Commit) error {
			for _, op := range evt.Ops {
				parts := strings.SplitN(op.Path, "/", 2)
				if len(parts) != 2 || !strings.HasPrefix(parts[0], "io.atcr.") {
					continue
				}
				collection, rkey := parts[0], parts[1]
				switch op.Action {
				case "create", "update":
					// evt.Blocks is a CAR slice containing the new records;
					// parse it and look up the op's path.
					rr, err := repo.ReadRepoFromCar(ctx, bytes.NewReader(evt.Blocks))
					if err != nil {
						return err
					}
					_, recordBytes, err := rr.GetRecordBytes(ctx, op.Path)
					if err != nil {
						return err
					}
					if err := store.IndexRecord(evt.Repo, collection, rkey, *recordBytes); err != nil {
						return err
					}
				case "delete":
					if err := store.DeleteRecord(evt.Repo, collection, rkey); err != nil {
						return err
					}
				}
			}
			// Persist cursor so a reconnect can resume from evt.Seq
			return store.UpdateCursor(evt.Repo, evt.Seq)
		},
	}
	// Process stream events sequentially
	sched := sequential.NewScheduler(hostname, rsc.EventHandler)
	return events.HandleRepoStream(ctx, conn, sched)
}
```
### Infrastructure Requirements
**Minimum specs:**
- 1 vCPU
- 1-2GB RAM
- 20GB SSD
- Minimal bandwidth (<1GB/day for dozens of holds)
**Estimated cost:**
- Hetzner CX11: €4.15/month (~$5/month)
- DigitalOcean Basic: $6/month
- Fly.io: ~$5-10/month
**Deployment:**
```bash
# Build
go build -o atcr-discovery ./cmd/atcr-discovery
# Run
export DATABASE_PATH="/var/lib/atcr-discovery/discovery.db"
export HTTP_ADDR=":8080"
./atcr-discovery
```
### Limitations
**What it does NOT do:**
- ❌ Serve outbound `subscribeRepos` firehose (AppViews query via listReposByCollection)
- ❌ Full MST validation (trust PDS validation)
- ❌ Scale to millions of accounts (SQLite limits)
- ❌ Multi-instance deployment (single process with SQLite)
**When to migrate to full relay:** When you have 1000+ holds, need PostgreSQL, or multi-instance deployment.
## Future Scale: Full Relay (Sync v1.1)
When ATCR grows beyond dozens of holds and needs real-time indexing, migrate to Bluesky's relay v1.1 implementation.
### When to Upgrade
**Indicators:**
- 100+ holds requesting frequent crawls
- Need real-time updates (re-crawl latency too high)
- Multiple AppView instances need coordinated discovery
- SQLite performance becomes bottleneck
### Relay v1.1 Characteristics
Released May 2025, this is Bluesky's current reference implementation.
**Key features:**
- **Non-archival**: Doesn't mirror full repository data, only processes firehose
- **WebSocket subscriptions**: Real-time updates from PDSs
- **Scalable**: 2 vCPU, 12GB RAM handles ~100M accounts
- **PostgreSQL**: Required for production scale
- **Admin UI**: Web dashboard for management
**Source**: `github.com/bluesky-social/indigo/cmd/relay`
### Migration Path
**Step 1: Deploy relay v1.1**
```bash
git clone https://github.com/bluesky-social/indigo.git
cd indigo
go build -o relay ./cmd/relay
export DATABASE_URL="postgres://relay:password@localhost:5432/atcr_relay"
./relay --admin-password="secure-password"
```
**Step 2: Migrate data**
- Export indexed records from SQLite
- Trigger crawls in relay for all known holds
- Verify relay indexes correctly
**Step 3: Update AppView configuration**
```bash
# Point to new relay
export ATCR_RELAY_ENDPOINT="https://relay.atcr.io"
```
**Step 4: Decommission minimal service**
- Monitor relay for stability
- Shut down old discovery service
### Infrastructure Requirements (Full Relay)
**Minimum specs:**
- 2 vCPU cores
- 12GB RAM
- 100GB SSD
- 30 Mbps bandwidth
**Estimated cost:**
- Hetzner: ~$30-40/month
- DigitalOcean: ~$50/month (with managed PostgreSQL)
- Fly.io: ~$35-50/month
## Collection Indexing: The `collectiondir` Microservice
The `com.atproto.sync.listReposByCollection` endpoint is **not part of the relay core**. It's provided by a separate microservice called **`collectiondir`**.
### What is collectiondir?
- **Separate service** that indexes collections for efficient discovery
- **Optional**: Not required by the ATProto spec, but very useful for AppViews
- **Deployed alongside relay** by Bluesky's public instances
### Current Limitation: did:plc Only?
Based on testing, Bluesky's public relays (with collectiondir) appear to:
- ✅ Index `io.atcr.*` collections from `did:plc` DIDs
- ❌ NOT index `io.atcr.*` collections from `did:web` DIDs
This means:
- ATCR manifests from users (did:plc) are discoverable
- ATCR hold captain records (did:web) are NOT discoverable
- The relay still **stores** all data (CAR file includes did:web records)
- The issue is specifically with **indexing** for `listReposByCollection`
### Configuring collectiondir
Documentation on configuring collectiondir is sparse. Possible approaches:
1. **Fork and modify**: Clone indigo repo, modify collectiondir to index all DIDs
2. **Configuration file**: Check if collectiondir accepts whitelist/configuration for indexed collections
3. **No filtering**: Default behavior might be to index everything, but Bluesky's deployment filters
**Action item**: Review `indigo/cmd/collectiondir` source code to understand configuration options.
## Multi-Relay Strategy
Holds can request crawls from **multiple relays** simultaneously. This enables:
### Scenario: Bluesky + ATCR Relays
**Setup:**
1. Hold deploys with embedded PDS at `did:web:hold01.atcr.io`
2. Hold creates captain record (`io.atcr.hold.captain/self`)
3. Hold requests crawl from **both**:
- Bluesky relay: `https://bsky.network/xrpc/com.atproto.sync.requestCrawl`
- ATCR relay: `https://relay.atcr.io/xrpc/com.atproto.sync.requestCrawl`
**Result:**
- ✅ Bluesky relay indexes social posts (if hold owner posts)
- ✅ ATCR relay indexes hold captain records
- ✅ AppViews query ATCR relay for hold discovery
- ✅ Independent networks - Bluesky posts work regardless of ATCR relay
### Request Crawl Script
The existing script can be modified to support multiple relays:
```bash
#!/bin/bash
# deploy/request-crawl.sh
HOSTNAME=$1
BLUESKY_RELAY=${2:-"https://bsky.network"}
ATCR_RELAY=${3:-"https://relay.atcr.io"}
echo "Requesting crawl for $HOSTNAME from Bluesky relay..."
curl -X POST "$BLUESKY_RELAY/xrpc/com.atproto.sync.requestCrawl" \
  -H "Content-Type: application/json" \
  -d "{\"hostname\": \"$HOSTNAME\"}"

echo "Requesting crawl for $HOSTNAME from ATCR relay..."
curl -X POST "$ATCR_RELAY/xrpc/com.atproto.sync.requestCrawl" \
  -H "Content-Type: application/json" \
  -d "{\"hostname\": \"$HOSTNAME\"}"
```
Usage:
```bash
./deploy/request-crawl.sh hold01.atcr.io
```
## Deployment: Minimal Discovery Service
### 1. Infrastructure Setup
**Provision VPS:**
- Hetzner CX11, DigitalOcean Basic, or Fly.io
- Public domain (e.g., `discovery.atcr.io`)
- TLS certificate (Let's Encrypt)
**Configure reverse proxy (optional - nginx):**
```nginx
upstream discovery {
    server 127.0.0.1:8080;
}

server {
    listen 443 ssl http2;
    server_name discovery.atcr.io;

    ssl_certificate /etc/letsencrypt/live/discovery.atcr.io/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/discovery.atcr.io/privkey.pem;

    location / {
        proxy_pass http://discovery;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
```
### 2. Build and Deploy
```bash
# Clone ATCR repo
git clone https://github.com/atcr-io/atcr.git
cd atcr
# Build discovery service
go build -o atcr-discovery ./cmd/atcr-discovery
# Run
export DATABASE_PATH="/var/lib/atcr-discovery/discovery.db"
export HTTP_ADDR=":8080"
export CRAWL_INTERVAL="12h"
./atcr-discovery
```
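If the service runs under systemd rather than a shell session, a minimal unit might look like the following — the user, binary path, and intervals are illustrative, not part of the repo:

```ini
# /etc/systemd/system/atcr-discovery.service (illustrative)
[Unit]
Description=ATCR discovery service
After=network-online.target
Wants=network-online.target

[Service]
User=atcr
ExecStart=/usr/local/bin/atcr-discovery
Environment=DATABASE_PATH=/var/lib/atcr-discovery/discovery.db
Environment=HTTP_ADDR=:8080
Environment=CRAWL_INTERVAL=12h
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```

Enable with `systemctl enable --now atcr-discovery`.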
### 3. Update Hold Startup
Each hold should request crawl on startup:
```bash
# In hold startup script or environment
export ATCR_DISCOVERY_URL="https://discovery.atcr.io"
# Request crawl from both Bluesky and ATCR
curl -X POST "https://bsky.network/xrpc/com.atproto.sync.requestCrawl" \
  -H "Content-Type: application/json" \
  -d "{\"hostname\": \"$HOLD_PUBLIC_URL\"}"

curl -X POST "$ATCR_DISCOVERY_URL/xrpc/com.atproto.sync.requestCrawl" \
  -H "Content-Type: application/json" \
  -d "{\"hostname\": \"$HOLD_PUBLIC_URL\"}"
```
### 4. Update AppView Configuration
Point AppView discovery worker to the discovery service:
```bash
# In .env.appview or environment
export ATCR_RELAY_ENDPOINT="https://discovery.atcr.io"
export ATCR_HOLD_DISCOVERY_ENABLED="true"
export ATCR_HOLD_DISCOVERY_INTERVAL="6h"
```
### 5. Monitor and Maintain
**Monitoring:**
- Check crawl queue status
- Monitor SQLite database size
- Track failed crawls
**Maintenance:**
- Re-crawl on schedule (every 6-24 hours)
- Prune stale records (>7 days old)
- Backup SQLite database regularly
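The pruning task can be a scheduled SQL statement. The 7-day retention below matches the guideline above but is a policy choice, and it assumes `indexed_at` is refreshed on each re-crawl so active holds are never pruned:

```sql
-- Prune records not refreshed within 7 days
DELETE FROM indexed_records
WHERE indexed_at < datetime('now', '-7 days');

-- Reclaim space afterwards (optional)
VACUUM;
```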
## Trade-Offs and Considerations
### Running Your Own Relay
**Pros:**
- ✅ Full control over indexing (can index `did:web` holds)
- ✅ No dependency on third-party relay policies
- ✅ Can customize collection filters for ATCR-specific needs
- ✅ Relatively lightweight with modern relay implementation
**Cons:**
- ❌ Infrastructure cost (~$30-50/month minimum)
- ❌ Operational overhead (monitoring, updates, backups)
- ❌ Need to maintain as network grows
- ❌ Single point of failure for discovery (unless multi-relay)
### Alternatives to Running a Relay
#### 1. Direct Registration API
Holds POST to AppView on startup to register themselves:
**Pros:**
- ✅ Simplest implementation
- ✅ No relay infrastructure needed
- ✅ Immediate registration (no crawl delay)
**Cons:**
- ❌ Ties holds to specific AppView instances
- ❌ Breaks decentralized discovery model
- ❌ Each AppView has different hold registry
#### 2. Static Discovery File
Maintain `https://atcr.io/.well-known/holds.json`:
**Pros:**
- ✅ No infrastructure beyond static hosting
- ✅ All AppViews share same registry
- ✅ Simple to implement
**Cons:**
- ❌ Manual process (PRs/issues to add holds)
- ❌ Not real-time discovery
- ❌ Centralized control point
#### 3. Hybrid Approach
Combine multiple discovery mechanisms:
```go
func (w *HoldDiscoveryWorker) DiscoverHolds(ctx context.Context) error {
	// 1. Fetch static registry
	staticHolds := w.fetchStaticRegistry()
	// 2. Query relay (if available)
	relayHolds := w.queryRelay(ctx)
	// 3. Accept direct registrations
	registeredHolds := w.getDirectRegistrations()
	// Merge and deduplicate
	allHolds := mergeHolds(staticHolds, relayHolds, registeredHolds)
	// Cache in database
	for _, hold := range allHolds {
		w.cacheHold(hold)
	}
	return nil
}
```
**Pros:**
- ✅ Multiple discovery paths (resilient)
- ✅ Gradual migration to relay-based discovery
- ✅ Supports both centralized bootstrap and decentralized growth
**Cons:**
- ❌ More complex implementation
- ❌ Potential for stale data if sources conflict
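The `mergeHolds` step in the hybrid sketch is where conflicting sources meet. One possible dedupe, assuming holds are keyed by hostname (a simplification) and earlier sources take precedence — `Hold` and its fields are illustrative, not types from the repo:

```go
package main

// Hold is a minimal stand-in for whatever the AppView tracks per hold.
type Hold struct {
	Hostname string
	DID      string
	Source   string // e.g. "relay", "static", "registration"
}

// mergeHolds deduplicates by hostname; earlier sources win, so pass the
// most authoritative list first (e.g. relay before static registry).
func mergeHolds(sources ...[]Hold) []Hold {
	seen := make(map[string]bool)
	var out []Hold
	for _, src := range sources {
		for _, h := range src {
			if h.Hostname == "" || seen[h.Hostname] {
				continue
			}
			seen[h.Hostname] = true
			out = append(out, h)
		}
	}
	return out
}
```

Source precedence is the policy knob here: putting relay results first means live data overrides the static bootstrap registry when both know about a hold.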
## Recommendations for ATCR
### Phase 1: MVP (Now - 1000 holds)
**Build minimal discovery service with WebSocket** (~$5-10/month):
1. Implement `requestCrawl` + `listReposByCollection` endpoints
2. Initial backfill via `getRepo` (CAR file parsing)
3. Real-time updates via WebSocket `subscribeRepos`
4. SQLite storage with cursor management
5. Filter to `io.atcr.*` collections only
**Deliverables:**
- `cmd/atcr-discovery` service
- SQLite schema with cursor storage
- CAR file parser (indigo libraries)
- WebSocket subscriber with reconnection
- Deployment scripts
**Cost**: ~$5-10/month VPS
**Why**: Minimal infrastructure, real-time updates, full control over indexing, sufficient for hundreds of holds.
### Phase 2: Migrate to Full Relay (1000+ holds)
**Deploy Bluesky relay v1.1** when scaling needed (~$30-50/month):
1. Set up PostgreSQL database
2. Deploy indigo relay with admin UI
3. Migrate indexed data from SQLite
4. Configure for `io.atcr.*` collection filtering (if possible)
5. Handle thousands of concurrent WebSocket connections
**Cost**: ~$30-50/month
**Why**: Proven scalability to 100M+ accounts, standardized protocol, community support, production-ready infrastructure.
### Phase 3: Multi-Relay Federation (Future)
**Decentralized relay network:**
1. Multiple ATCR relays operated independently
2. AppViews query multiple relays (fallback/redundancy)
3. Holds request crawls from all known ATCR relays
4. Cross-relay synchronization (optional)
**Why**: No single point of failure, fully decentralized discovery, geographic distribution.
## Next Steps
### For MVP Implementation
1. **Create `cmd/atcr-discovery` package structure**
- HTTP handlers for XRPC endpoints (`requestCrawl`, `listReposByCollection`)
- Crawler with indigo CAR parsing for initial backfill
- WebSocket subscriber for real-time updates
- SQLite storage layer with cursor management
- Background worker for managing subscriptions
2. **Database schema**
- `indexed_records` table for collection data
- `crawl_queue` table for crawl job management
- `subscriptions` table for WebSocket cursor tracking
- Indexes for efficient queries
3. **WebSocket implementation**
- Use `github.com/bluesky-social/indigo/events` for event handling
- Implement reconnection logic with cursor resume
- Filter events to `io.atcr.*` collections only
- Health monitoring for active subscriptions
4. **Testing strategy**
- Unit tests for CAR parsing
- Unit tests for event filtering
- Integration tests with mock PDSs and WebSocket
- Connection failure and reconnection testing
- Load testing with SQLite
5. **Deployment**
- Dockerfile for discovery service
- Deployment scripts (systemd, docker-compose)
- Monitoring setup (logs, metrics, WebSocket health)
- Alert on subscription failures
6. **Documentation**
- API documentation for XRPC endpoints
- Deployment guide
- Troubleshooting guide (WebSocket connection issues)
### Open Questions
1. **CAR parsing edge cases**: How to handle malformed CAR files or invalid records?
2. **WebSocket reconnection**: What's the optimal backoff strategy for reconnection attempts?
3. **Subscription management**: How many concurrent WebSocket connections can SQLite handle?
4. **Rate limiting**: Should discovery service rate-limit requestCrawl to prevent abuse?
5. **Authentication**: Should requestCrawl require authentication, or remain open?
6. **Cursor storage**: Should cursors be persisted immediately or batched for performance?
7. **Monitoring**: What metrics are most important for operational visibility (active subs, event rate, lag)?
8. **Error handling**: When a WebSocket dies, should we re-backfill via getRepo or trust cursor resume?
## References
### ATProto Specifications
- [ATProto Sync Specification](https://atproto.com/specs/sync)
- [Repository Specification](https://atproto.com/specs/repository)
- [CAR File Format](https://ipld.io/specs/transport/car/)
### Indigo Libraries
- [Indigo Repository](https://github.com/bluesky-social/indigo)
- [Indigo Repo Package](https://pkg.go.dev/github.com/bluesky-social/indigo/repo)
- [Indigo ATProto Package](https://pkg.go.dev/github.com/bluesky-social/indigo/atproto)
### Relay Reference (Future)
- [Relay v1.1 Updates](https://docs.bsky.app/blog/relay-sync-updates)
- [Indigo Relay Implementation](https://github.com/bluesky-social/indigo/tree/main/cmd/relay)
- [Running a Full-Network Relay](https://whtwnd.com/bnewbold.net/3kwzl7tye6u2y)