# Running an ATProto Relay for ATCR Hold Discovery

This document explains what it takes to run an ATProto relay for indexing ATCR hold records, including infrastructure requirements, configuration, and trade-offs.

## Overview

### What is an ATProto Relay?

An ATProto relay is a service that:

- **Subscribes to multiple PDS hosts** and aggregates their data streams
- **Outputs a combined "firehose"** event stream for real-time network updates
- **Validates data integrity** and identity signatures
- **Provides discovery endpoints** like `com.atproto.sync.listReposByCollection`

The relay acts as a network-wide indexer, making it possible to discover which DIDs have records of specific types (collections).

### Why ATCR Needs a Relay

ATCR uses hold captain records (`io.atcr.hold.captain`) stored in hold PDSs to enable hold discovery. The `listReposByCollection` endpoint allows AppViews to efficiently discover all holds in the network without crawling every PDS individually.

**The problem**: Standard Bluesky relays appear to only index collections from `did:plc` DIDs, not `did:web` DIDs. Since ATCR holds use `did:web` (e.g., `did:web:hold01.atcr.io`), they aren't discoverable via Bluesky's public relays.

## Recommended Approach: Phased Implementation

ATCR's discovery needs evolve as the network grows. Start simple, scale as needed.

## MVP: Minimal Discovery Service

For initial deployment with a small number of holds (dozens, not thousands), build a **lightweight custom discovery service** focused solely on `io.atcr.*` collections.

### Why a Minimal Service for MVP?

- **Scope**: Only index `io.atcr.*` collections (manifests, tags, captain/crew, sailor profiles)
- **Opt-in**: Only crawls PDSs that explicitly call `requestCrawl`
- **Small scale**: Dozens of holds, not millions of users
- **Simple storage**: SQLite is sufficient for current scale
- **Cost-effective**: $5-10/month VPS

### Architecture

**Inbound endpoints:**

```
POST /xrpc/com.atproto.sync.requestCrawl
  → Hold registers itself for crawling

GET /xrpc/com.atproto.sync.listReposByCollection?collection=io.atcr.hold.captain
  → AppView discovers holds
```

**Outbound (client to PDS):**

```
1. com.atproto.repo.describeRepo   → verify PDS exists
2. com.atproto.sync.getRepo        → fetch full CAR file (initial backfill)
3. com.atproto.sync.subscribeRepos → WebSocket for real-time updates
4. Parse events → extract io.atcr.* records → index in SQLite
```

**Data flow:**

**Initial crawl (on requestCrawl):**

```
1. Hold POSTs requestCrawl → service queues crawl job
2. Service fetches getRepo (CAR file) from hold's PDS for backfill
3. Service parses CAR using indigo libraries
4. Service extracts io.atcr.* records (captain, crew, manifests, etc.)
5. Service stores: (did, collection, rkey, record_data) in SQLite
6. Service opens WebSocket to subscribeRepos for this DID
7. Service stores cursor for reconnection handling
```

**Ongoing updates (WebSocket):**

```
1. Receive commit events via subscribeRepos WebSocket
2. Parse event, filter to io.atcr.* collections only
3. Update indexed_records incrementally (insert/update/delete)
4. Update cursor after processing each event
5. On disconnect: reconnect with stored cursor to resume
```

**Discovery (AppView query):**

```
1. AppView GETs listReposByCollection?collection=io.atcr.hold.captain
2. Service queries SQLite WHERE collection='io.atcr.hold.captain'
3. Service returns list of DIDs with that collection
```

### Implementation Requirements

**Technologies:**

- Go (reuse indigo libraries for CAR parsing and WebSocket)
- SQLite (sufficient for dozens/hundreds of holds)
- Standard HTTP server + WebSocket client

**Core components:**

1. **HTTP handlers** (`cmd/atcr-discovery/handlers/`):
   - `requestCrawl` - queue crawl jobs
   - `listReposByCollection` - query indexed collections
2. **Crawler** (`pkg/discovery/crawler.go`):
   - Fetch CAR files from PDSs for initial backfill
   - Parse with `github.com/bluesky-social/indigo/repo`
   - Extract records, filter to `io.atcr.*` only
3. **WebSocket subscriber** (`pkg/discovery/subscriber.go`):
   - WebSocket client for `com.atproto.sync.subscribeRepos`
   - Event parsing and filtering
   - Cursor management and persistence
   - Automatic reconnection with resume
4. **Storage** (`pkg/discovery/storage.go`):
   - SQLite schema for indexed records
   - Indexes on (collection, did) for fast queries
   - Cursor storage for reconnection
5. **Worker** (`pkg/discovery/worker.go`):
   - Background crawl job processor
   - WebSocket connection manager
   - Health monitoring for subscriptions

**Database schema:**

```sql
CREATE TABLE indexed_records (
    did TEXT NOT NULL,
    collection TEXT NOT NULL,
    rkey TEXT NOT NULL,
    record_data TEXT NOT NULL, -- JSON
    indexed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (did, collection, rkey)
);
CREATE INDEX idx_collection ON indexed_records(collection);
CREATE INDEX idx_did ON indexed_records(did);

CREATE TABLE crawl_queue (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    hostname TEXT NOT NULL UNIQUE,
    did TEXT,
    status TEXT DEFAULT 'pending', -- pending, in_progress, subscribed, failed
    last_crawled_at TIMESTAMP,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE subscriptions (
    did TEXT PRIMARY KEY,
    hostname TEXT NOT NULL,
    cursor INTEGER, -- Last processed sequence number
    status TEXT DEFAULT 'active', -- active, disconnected, failed
    last_event_at TIMESTAMP,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
```

**Leveraging indigo libraries** (a sketch; check the current indigo API, which evolves between releases):

```go
import (
    comatproto "github.com/bluesky-social/indigo/api/atproto"
    "github.com/bluesky-social/indigo/events"
    "github.com/bluesky-social/indigo/events/schedulers/sequential"
    "github.com/bluesky-social/indigo/repo"
    "github.com/gorilla/websocket"
    "github.com/ipfs/go-cid"
)

// Initial backfill: parse the CAR file fetched via getRepo
r, err := repo.ReadRepoFromCar(ctx, bytes.NewReader(carData))
if err != nil {
    return err
}

// Iterate records
err = r.ForEach(ctx, "", func(path string, nodeCid cid.Cid) error {
    // Parse collection from path (e.g., "io.atcr.hold.captain/self")
    parts := strings.Split(path, "/")
    if len(parts) != 2 {
        return nil // skip invalid paths
    }
    collection, rkey := parts[0], parts[1]

    // Filter to io.atcr.* only
    if !strings.HasPrefix(collection, "io.atcr.") {
        return nil
    }

    // Get raw record bytes
    _, recordBytes, err := r.GetRecordBytes(ctx, path)
    if err != nil {
        return err
    }

    // Store in database
    return store.IndexRecord(did, collection, rkey, *recordBytes)
})

// WebSocket subscription: listen for updates
wsURL := fmt.Sprintf("wss://%s/xrpc/com.atproto.sync.subscribeRepos", hostname)
conn, _, err := websocket.DefaultDialer.Dial(wsURL, nil)
if err != nil {
    return err
}

// Handle commit events; each op carries a "collection/rkey" path, and
// record bytes for creates/updates live in the event's CAR block slice
rsc := &events.RepoStreamCallbacks{
    RepoCommit: func(evt *comatproto.SyncSubscribeRepos_Commit) error {
        for _, op := range evt.Ops {
            // Filter to io.atcr.* collections only
            if !strings.HasPrefix(op.Path, "io.atcr.") {
                continue
            }
            parts := strings.SplitN(op.Path, "/", 2)
            if len(parts) != 2 {
                continue
            }
            collection, rkey := parts[0], parts[1]

            switch op.Action {
            case "create", "update":
                // recordFromBlocks (not shown) would extract the record
                // bytes for op.Path from evt.Blocks, a CAR slice
                store.IndexRecord(evt.Repo, collection, rkey, recordFromBlocks(evt, op))
            case "delete":
                store.DeleteRecord(evt.Repo, collection, rkey)
            }
        }
        // Update cursor
        return store.UpdateCursor(evt.Repo, evt.Seq)
    },
}

// Process the stream with a sequential scheduler
scheduler := sequential.NewScheduler("discovery-worker", rsc.EventHandler)
return events.HandleRepoStream(ctx, conn, scheduler)
```

### Infrastructure Requirements

**Minimum specs:**

- 1 vCPU
- 1-2GB RAM
- 20GB SSD
- Minimal bandwidth (<1GB/day for dozens of holds)

**Estimated cost:**

- Hetzner CX11: €4.15/month (~$5/month)
- DigitalOcean Basic: $6/month
- Fly.io: ~$5-10/month

**Deployment:**

```bash
# Build
go build -o atcr-discovery ./cmd/atcr-discovery

# Run
export DATABASE_PATH="/var/lib/atcr-discovery/discovery.db"
export HTTP_ADDR=":8080"
./atcr-discovery
```

### Limitations

**What it does NOT do:**

- ❌ Serve an outbound `subscribeRepos` firehose (AppViews query via `listReposByCollection`)
- ❌ Full MST validation (trusts PDS validation)
- ❌ Scale to millions of accounts (SQLite limits)
- ❌ Multi-instance deployment (single process with SQLite)

**When to migrate to a full relay:** When you have 1000+ holds, need PostgreSQL, or require multi-instance deployment.

## Future Scale: Full Relay (Sync v1.1)

When ATCR grows beyond dozens of holds and needs real-time indexing, migrate to Bluesky's relay v1.1 implementation.
### When to Upgrade

**Indicators:**

- 100+ holds requesting frequent crawls
- Need for real-time updates (re-crawl latency too high)
- Multiple AppView instances need coordinated discovery
- SQLite performance becomes a bottleneck

### Relay v1.1 Characteristics

Released May 2025, this is Bluesky's current reference implementation.

**Key features:**

- **Non-archival**: Doesn't mirror full repository data, only processes the firehose
- **WebSocket subscriptions**: Real-time updates from PDSs
- **Scalable**: 2 vCPU, 12GB RAM handles ~100M accounts
- **PostgreSQL**: Required for production scale
- **Admin UI**: Web dashboard for management

**Source**: `github.com/bluesky-social/indigo/cmd/relay`

### Migration Path

**Step 1: Deploy relay v1.1**

```bash
git clone https://github.com/bluesky-social/indigo.git
cd indigo
go build -o relay ./cmd/relay

export DATABASE_URL="postgres://relay:password@localhost:5432/atcr_relay"
./relay --admin-password="secure-password"
```

**Step 2: Migrate data**

- Export indexed records from SQLite
- Trigger crawls in the relay for all known holds
- Verify the relay indexes correctly

**Step 3: Update AppView configuration**

```bash
# Point to new relay
export ATCR_RELAY_ENDPOINT="https://relay.atcr.io"
```

**Step 4: Decommission minimal service**

- Monitor the relay for stability
- Shut down the old discovery service

### Infrastructure Requirements (Full Relay)

**Minimum specs:**

- 2 vCPU cores
- 12GB RAM
- 100GB SSD
- 30 Mbps bandwidth

**Estimated cost:**

- Hetzner: ~$30-40/month
- DigitalOcean: ~$50/month (with managed PostgreSQL)
- Fly.io: ~$35-50/month

## Collection Indexing: The `collectiondir` Microservice

The `com.atproto.sync.listReposByCollection` endpoint is **not part of the relay core**. It's provided by a separate microservice called **`collectiondir`**.

### What is collectiondir?

- **Separate service** that indexes collections for efficient discovery
- **Optional**: Not required by the ATProto spec, but very useful for AppViews
- **Deployed alongside the relay** by Bluesky's public instances

### Current Limitation: did:plc Only?

Based on testing, Bluesky's public relays (with collectiondir) appear to:

- ✅ Index `io.atcr.*` collections from `did:plc` DIDs
- ❌ NOT index `io.atcr.*` collections from `did:web` DIDs

This means:

- ATCR manifests from users (did:plc) are discoverable
- ATCR hold captain records (did:web) are NOT discoverable
- The relay still **stores** all the data (the CAR file includes did:web records)
- The issue is specifically with **indexing** for `listReposByCollection`

### Configuring collectiondir

Documentation on configuring collectiondir is sparse. Possible approaches:

1. **Fork and modify**: Clone the indigo repo and modify collectiondir to index all DIDs
2. **Configuration file**: Check whether collectiondir accepts a whitelist/configuration for indexed collections
3. **No filtering**: The default behavior might be to index everything, with Bluesky's deployment applying the filter

**Action item**: Review the `indigo/cmd/collectiondir` source code to understand configuration options.

## Multi-Relay Strategy

Holds can request crawls from **multiple relays** simultaneously. This enables:

### Scenario: Bluesky + ATCR Relays

**Setup:**

1. Hold deploys with embedded PDS at `did:web:hold01.atcr.io`
2. Hold creates a captain record (`io.atcr.hold.captain/self`)
3. Hold requests a crawl from **both**:
   - Bluesky relay: `https://bsky.network/xrpc/com.atproto.sync.requestCrawl`
   - ATCR relay: `https://relay.atcr.io/xrpc/com.atproto.sync.requestCrawl`

**Result:**

- ✅ Bluesky relay indexes social posts (if the hold owner posts)
- ✅ ATCR relay indexes hold captain records
- ✅ AppViews query the ATCR relay for hold discovery
- ✅ Independent networks - Bluesky posts work regardless of the ATCR relay

### Request Crawl Script

The existing script can be modified to support multiple relays:

```bash
#!/bin/bash
# deploy/request-crawl.sh
HOSTNAME=$1
BLUESKY_RELAY=${2:-"https://bsky.network"}
ATCR_RELAY=${3:-"https://relay.atcr.io"}

echo "Requesting crawl for $HOSTNAME from Bluesky relay..."
curl -X POST "$BLUESKY_RELAY/xrpc/com.atproto.sync.requestCrawl" \
    -H "Content-Type: application/json" \
    -d "{\"hostname\": \"$HOSTNAME\"}"

echo "Requesting crawl for $HOSTNAME from ATCR relay..."
curl -X POST "$ATCR_RELAY/xrpc/com.atproto.sync.requestCrawl" \
    -H "Content-Type: application/json" \
    -d "{\"hostname\": \"$HOSTNAME\"}"
```

Usage:

```bash
./deploy/request-crawl.sh hold01.atcr.io
```

## Deployment: Minimal Discovery Service

### 1. Infrastructure Setup

**Provision VPS:**

- Hetzner CX11, DigitalOcean Basic, or Fly.io
- Public domain (e.g., `discovery.atcr.io`)
- TLS certificate (Let's Encrypt)

**Configure reverse proxy (optional - nginx):**

```nginx
upstream discovery {
    server 127.0.0.1:8080;
}

server {
    listen 443 ssl http2;
    server_name discovery.atcr.io;

    ssl_certificate /etc/letsencrypt/live/discovery.atcr.io/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/discovery.atcr.io/privkey.pem;

    location / {
        proxy_pass http://discovery;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
```

### 2. Build and Deploy

```bash
# Clone ATCR repo
git clone https://github.com/atcr-io/atcr.git
cd atcr

# Build discovery service
go build -o atcr-discovery ./cmd/atcr-discovery

# Run
export DATABASE_PATH="/var/lib/atcr-discovery/discovery.db"
export HTTP_ADDR=":8080"
export CRAWL_INTERVAL="12h"
./atcr-discovery
```

### 3. Update Hold Startup

Each hold should request a crawl on startup:

```bash
# In hold startup script or environment
export ATCR_DISCOVERY_URL="https://discovery.atcr.io"

# Request crawl from both Bluesky and ATCR
curl -X POST "https://bsky.network/xrpc/com.atproto.sync.requestCrawl" \
    -H "Content-Type: application/json" \
    -d "{\"hostname\": \"$HOLD_PUBLIC_URL\"}"

curl -X POST "$ATCR_DISCOVERY_URL/xrpc/com.atproto.sync.requestCrawl" \
    -H "Content-Type: application/json" \
    -d "{\"hostname\": \"$HOLD_PUBLIC_URL\"}"
```

### 4. Update AppView Configuration

Point the AppView discovery worker to the discovery service:

```bash
# In .env.appview or environment
export ATCR_RELAY_ENDPOINT="https://discovery.atcr.io"
export ATCR_HOLD_DISCOVERY_ENABLED="true"
export ATCR_HOLD_DISCOVERY_INTERVAL="6h"
```

### 5. Monitor and Maintain

**Monitoring:**

- Check crawl queue status
- Monitor SQLite database size
- Track failed crawls

**Maintenance:**

- Re-crawl on schedule (every 6-24 hours)
- Prune stale records (>7 days old)
- Back up the SQLite database regularly

## Trade-Offs and Considerations

### Running Your Own Relay

**Pros:**

- ✅ Full control over indexing (can index `did:web` holds)
- ✅ No dependency on third-party relay policies
- ✅ Can customize collection filters for ATCR-specific needs
- ✅ Relatively lightweight with the modern relay implementation

**Cons:**

- ❌ Infrastructure cost (~$30-50/month minimum)
- ❌ Operational overhead (monitoring, updates, backups)
- ❌ Needs ongoing maintenance as the network grows
- ❌ Single point of failure for discovery (unless multi-relay)

### Alternatives to Running a Relay

#### 1. Direct Registration API

Holds POST to the AppView on startup to register themselves:

**Pros:**

- ✅ Simplest implementation
- ✅ No relay infrastructure needed
- ✅ Immediate registration (no crawl delay)

**Cons:**

- ❌ Ties holds to specific AppView instances
- ❌ Breaks the decentralized discovery model
- ❌ Each AppView has a different hold registry

#### 2. Static Discovery File

Maintain `https://atcr.io/.well-known/holds.json`:

**Pros:**

- ✅ No infrastructure beyond static hosting
- ✅ All AppViews share the same registry
- ✅ Simple to implement

**Cons:**

- ❌ Manual process (PRs/issues to add holds)
- ❌ Not real-time discovery
- ❌ Centralized control point

#### 3. Hybrid Approach

Combine multiple discovery mechanisms:

```go
func (w *HoldDiscoveryWorker) DiscoverHolds(ctx context.Context) error {
    // 1. Fetch static registry
    staticHolds := w.fetchStaticRegistry()

    // 2. Query relay (if available)
    relayHolds := w.queryRelay(ctx)

    // 3. Accept direct registrations
    registeredHolds := w.getDirectRegistrations()

    // Merge and deduplicate
    allHolds := mergeHolds(staticHolds, relayHolds, registeredHolds)

    // Cache in database
    for _, hold := range allHolds {
        w.cacheHold(hold)
    }
    return nil
}
```

**Pros:**

- ✅ Multiple discovery paths (resilient)
- ✅ Gradual migration to relay-based discovery
- ✅ Supports both centralized bootstrap and decentralized growth

**Cons:**

- ❌ More complex implementation
- ❌ Potential for stale data if sources conflict

## Recommendations for ATCR

### Phase 1: MVP (Now - 1000 holds)

**Build the minimal discovery service with WebSocket support** (~$5-10/month):

1. Implement the `requestCrawl` + `listReposByCollection` endpoints
2. Initial backfill via `getRepo` (CAR file parsing)
3. Real-time updates via WebSocket `subscribeRepos`
4. SQLite storage with cursor management
5. Filter to `io.atcr.*` collections only

**Deliverables:**

- `cmd/atcr-discovery` service
- SQLite schema with cursor storage
- CAR file parser (indigo libraries)
- WebSocket subscriber with reconnection
- Deployment scripts

**Cost**: ~$5-10/month VPS

**Why**: Minimal infrastructure, real-time updates, full control over indexing, sufficient for hundreds of holds.

### Phase 2: Migrate to Full Relay (1000+ holds)

**Deploy Bluesky relay v1.1** when scaling is needed (~$30-50/month):

1. Set up a PostgreSQL database
2. Deploy the indigo relay with admin UI
3. Migrate indexed data from SQLite
4. Configure `io.atcr.*` collection filtering (if possible)
5. Handle thousands of concurrent WebSocket connections

**Cost**: ~$30-50/month

**Why**: Proven scalability to 100M+ accounts, standardized protocol, community support, production-ready infrastructure.

### Phase 3: Multi-Relay Federation (Future)

**Decentralized relay network:**

1. Multiple ATCR relays operated independently
2. AppViews query multiple relays (fallback/redundancy)
3. Holds request crawls from all known ATCR relays
4. Cross-relay synchronization (optional)

**Why**: No single point of failure, fully decentralized discovery, geographic distribution.

## Next Steps

### For MVP Implementation

1. **Create `cmd/atcr-discovery` package structure**
   - HTTP handlers for XRPC endpoints (`requestCrawl`, `listReposByCollection`)
   - Crawler with indigo CAR parsing for initial backfill
   - WebSocket subscriber for real-time updates
   - SQLite storage layer with cursor management
   - Background worker for managing subscriptions
2. **Database schema**
   - `indexed_records` table for collection data
   - `crawl_queue` table for crawl job management
   - `subscriptions` table for WebSocket cursor tracking
   - Indexes for efficient queries
3. **WebSocket implementation**
   - Use `github.com/bluesky-social/indigo/events` for event handling
   - Implement reconnection logic with cursor resume
   - Filter events to `io.atcr.*` collections only
   - Health monitoring for active subscriptions
4. **Testing strategy**
   - Unit tests for CAR parsing
   - Unit tests for event filtering
   - Integration tests with mock PDSs and WebSocket
   - Connection failure and reconnection testing
   - Load testing with SQLite
5. **Deployment**
   - Dockerfile for the discovery service
   - Deployment scripts (systemd, docker-compose)
   - Monitoring setup (logs, metrics, WebSocket health)
   - Alerts on subscription failures
6. **Documentation**
   - API documentation for XRPC endpoints
   - Deployment guide
   - Troubleshooting guide (WebSocket connection issues)

### Open Questions

1. **CAR parsing edge cases**: How should malformed CAR files or invalid records be handled?
2. **WebSocket reconnection**: What's the optimal backoff strategy for reconnection attempts?
3. **Subscription management**: How many concurrent WebSocket connections can SQLite handle?
4. **Rate limiting**: Should the discovery service rate-limit requestCrawl to prevent abuse?
5. **Authentication**: Should requestCrawl require authentication, or remain open?
6. **Cursor storage**: Should cursors be persisted immediately or batched for performance?
7. **Monitoring**: Which metrics matter most for operational visibility (active subs, event rate, lag)?
8. **Error handling**: When a WebSocket dies, should we re-backfill via getRepo or trust cursor resume?
## References

### ATProto Specifications

- [ATProto Sync Specification](https://atproto.com/specs/sync)
- [Repository Specification](https://atproto.com/specs/repository)
- [CAR File Format](https://ipld.io/specs/transport/car/)

### Indigo Libraries

- [Indigo Repository](https://github.com/bluesky-social/indigo)
- [Indigo Repo Package](https://pkg.go.dev/github.com/bluesky-social/indigo/repo)
- [Indigo ATProto Package](https://pkg.go.dev/github.com/bluesky-social/indigo/atproto)

### Relay Reference (Future)

- [Relay v1.1 Updates](https://docs.bsky.app/blog/relay-sync-updates)
- [Indigo Relay Implementation](https://github.com/bluesky-social/indigo/tree/main/cmd/relay)
- [Running a Full-Network Relay](https://whtwnd.com/bnewbold.net/3kwzl7tye6u2y)