# ATCR Troubleshooting Guide
This document provides troubleshooting guidance for common ATCR deployment and operational issues.
## OAuth Authentication Failures
### JWT Timestamp Validation Errors
**Symptom:**
```
error: invalid_client
error_description: Validation of "client_assertion" failed: "iat" claim timestamp check failed (it should be in the past)
```
**Root Cause:**
The AppView server's system clock is ahead of the PDS server's clock. When the AppView generates a JWT for OAuth client authentication (confidential client mode), the "iat" (issued at) claim appears to be in the future from the PDS's perspective.
**Diagnosis:**
1. Check AppView system time:
```bash
date -u
timedatectl status
```
2. Check if NTP is active and synchronized:
```bash
timedatectl show-timesync --all
```
3. Compare AppView time with PDS time (if accessible):
```bash
# On AppView
date +%s
# On PDS (or via HTTP headers)
curl -I https://your-pds.example.com | grep -i date
```
4. Check AppView logs for clock information (logged at startup):
```bash
docker logs atcr-appview 2>&1 | grep "Configured confidential OAuth client"
```
Example log output:
```
level=INFO msg="Configured confidential OAuth client"
key_id=did:key:z...
system_time_unix=1731844215
system_time_rfc3339=2025-11-17T14:30:15Z
timezone=UTC
```
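To quantify the skew directly, you can compare the local clock with the `Date` header the PDS returns. A sketch, assuming GNU `date`; the PDS hostname is a placeholder:
```shell
# Sketch: compute clock skew against the PDS's HTTP Date header.
# Assumes GNU date; https://your-pds.example.com is a placeholder.
skew_seconds() {
  # $1 = remote epoch seconds, $2 = local epoch seconds
  echo $(( $1 - $2 ))
}

# Example usage (uncomment to run against a real PDS):
# pds_date=$(curl -sI https://your-pds.example.com | tr -d '\r' \
#   | awk -F': ' 'tolower($1) == "date" {print $2}')
# echo "skew: $(skew_seconds "$(date -d "$pds_date" +%s)" "$(date +%s)")s"

skew_seconds 1731844220 1731844215   # prints 5
```
A positive result means the remote clock is ahead of yours; a negative result means it is behind.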
**Solution:**
1. **Enable NTP synchronization** (recommended):
On most Linux systems using systemd:
```bash
# Enable and start systemd-timesyncd
sudo timedatectl set-ntp true
# Verify NTP is active
timedatectl status
```
Expected output:
```
System clock synchronized: yes
NTP service: active
```
2. **Alternative: Use chrony** (if systemd-timesyncd is not available):
```bash
# Install chrony
sudo apt-get install chrony # Debian/Ubuntu
sudo yum install chrony # RHEL/CentOS
# Enable and start chronyd
sudo systemctl enable chronyd
sudo systemctl start chronyd
# Check sync status
chronyc tracking
```
3. **Force immediate sync**:
```bash
# systemd-timesyncd
sudo systemctl restart systemd-timesyncd
# Or with chrony
sudo chronyc makestep
```
4. **In Docker/Kubernetes environments:**
The container inherits the host's system clock, so fix NTP on the **host** machine:
```bash
# On Docker host
sudo timedatectl set-ntp true
# Restart AppView container to pick up correct time
docker restart atcr-appview
```
5. **Verify clock skew is resolved**:
```bash
# Should show clock offset < 1 second
timedatectl timesync-status
```
**Acceptable Clock Skew:**
- Most OAuth implementations tolerate ±30-60 seconds of clock skew
- DPoP proof validation is typically stricter (±10 seconds)
- Aim for < 1 second skew for reliable operation
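As a rough sanity check, those tolerances can be encoded in a few lines of shell; the limits here are illustrative, not normative:
```shell
# Sketch: is a measured skew (in seconds, possibly negative) within an
# OAuth-style tolerance? Default limit of 30s is illustrative.
within_tolerance() {
  local skew=$1 limit=${2:-30}
  local abs=$(( skew < 0 ? -skew : skew ))
  [ "$abs" -le "$limit" ]
}

within_tolerance 5 && echo "within OAuth tolerance"
within_tolerance 15 10 || echo "too much skew for DPoP"
```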
**Prevention:**
- Configure NTP synchronization in your infrastructure-as-code (Terraform, Ansible, etc.)
- Monitor clock skew in production (e.g., Prometheus node_exporter includes clock metrics)
- Use managed container platforms (ECS, GKE, AKS) that handle NTP automatically
---
### DPoP Nonce Mismatch Errors
**Symptom:**
```
error: use_dpop_nonce
error_description: DPoP "nonce" mismatch
```
Repeated multiple times, potentially followed by:
```
error: server_error
error_description: Server error
```
**Root Cause:**
DPoP (Demonstrating Proof of Possession, RFC 9449) requires a server-provided nonce for replay protection. These errors typically occur when:
1. Multiple concurrent requests create a DPoP nonce race condition
2. Clock skew causes DPoP proof timestamps to fail validation
3. PDS session state becomes corrupted after repeated failures
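The nonce exchange itself is a simple retry loop; the indigo library performs it for ATCR, but the shape is roughly this (a sketch — `atcr_token_request` and `extract_nonce` are hypothetical helpers, not real ATCR commands):
```shell
# Illustration of the DPoP nonce "dance": on use_dpop_nonce, capture the
# server-supplied nonce and retry. atcr_token_request and extract_nonce
# are hypothetical stand-ins for the real client calls.
request_with_nonce_retry() {
  local nonce="" resp attempt
  for attempt in 1 2 3; do
    resp=$(atcr_token_request "$nonce") || true
    if [[ "$resp" == *use_dpop_nonce* ]]; then
      nonce=$(extract_nonce "$resp")   # from the DPoP-Nonce response header
      continue
    fi
    echo "$resp"
    return 0
  done
  return 1   # persistent mismatch: fall through to diagnosis
}
```
One retry with a fresh nonce is expected and harmless; it is the *repeated* mismatches that indicate a race or clock problem.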
**Diagnosis:**
1. Check if errors occur during concurrent operations:
```bash
# During docker push with multiple layers
docker logs atcr-appview 2>&1 | grep "use_dpop_nonce" | wc -l
```
2. Check for clock skew (see section above):
```bash
timedatectl status
```
3. Look for session lock acquisition in logs:
```bash
docker logs atcr-appview 2>&1 | grep "Acquired session lock"
```
**Solution:**
1. **If caused by clock skew**: Fix NTP synchronization (see section above)
2. **If caused by session corruption**:
```bash
# The AppView will automatically delete corrupted sessions
# User just needs to re-authenticate
docker login atcr.io
```
3. **If persistent despite clock sync**:
- Check PDS health and logs (may be a PDS-side issue)
- Verify network connectivity between AppView and PDS
- Check if PDS supports latest OAuth/DPoP specifications
**What ATCR does automatically:**
- Per-DID locking prevents concurrent DPoP nonce races
- Indigo library automatically retries with fresh nonces
- Sessions are auto-deleted after repeated failures
- Service token cache prevents excessive PDS requests
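The per-DID locking idea generalizes: external tooling that drives multiple pushes for the same account can approximate the same serialization with `flock(1)`. A sketch; the `/tmp` lock path is an assumption for illustration:
```shell
# Sketch: serialize operations per DID with flock(1). ATCR does this
# in-process; the /tmp lock path here is just an illustration.
with_did_lock() {
  local did="$1"; shift
  local lock="/tmp/atcr-$(echo "$did" | tr ':/' '__').lock"
  (
    flock -x 9          # blocks until no other holder for this DID
    "$@"
  ) 9>"$lock"
}

# with_did_lock did:plc:abc123 docker push atcr.example.com/user/repo:tag
```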
**Prevention:**
- Ensure reliable NTP synchronization
- Use a stable, well-maintained PDS implementation
- Monitor AppView error rates for DPoP-related issues
---
### OAuth Session Not Found
**Symptom:**
```
error: failed to get OAuth session: no session found for DID
```
**Root Cause:**
- User has never authenticated via OAuth
- OAuth session was deleted due to corruption or expiry
- Database migration cleared sessions
**Solution:**
1. User re-authenticates via OAuth flow:
```bash
docker login atcr.io
# Or for web UI: visit https://atcr.io/login
```
2. If using app passwords (legacy), refresh the cached token:
```bash
# Re-authenticate to regenerate the cached app-password token
docker logout atcr.io
docker login atcr.io -u your.handle -p your-app-password
```
---
## AppView Deployment Issues
### Client Metadata URL Not Accessible
**Symptom:**
```
error: unauthorized_client
error_description: Client metadata endpoint returned 404
```
**Root Cause:**
PDS cannot fetch OAuth client metadata from `{ATCR_BASE_URL}/client-metadata.json`
**Diagnosis:**
1. Verify client metadata endpoint is accessible:
```bash
curl https://your-atcr-instance.com/client-metadata.json
```
2. Check AppView logs for startup errors:
```bash
docker logs atcr-appview 2>&1 | grep "client-metadata"
```
3. Verify `ATCR_BASE_URL` is set correctly:
```bash
echo $ATCR_BASE_URL
```
**Solution:**
1. Ensure `ATCR_BASE_URL` matches your public URL:
```bash
export ATCR_BASE_URL=https://atcr.example.com
```
2. Verify reverse proxy (nginx, Caddy, etc.) routes `/.well-known/*` and `/client-metadata.json`:
```nginx
location / {
    proxy_pass http://localhost:5000;
    proxy_set_header Host $host;
    proxy_set_header X-Forwarded-Proto $scheme;
}
```
3. Check firewall rules allow inbound HTTPS:
```bash
sudo ufw status
sudo iptables -L -n | grep 443
```
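The three checks above can be combined into one end-to-end probe that confirms the PDS will be able to fetch and parse the metadata (a sketch; pass your own public base URL):
```shell
# Sketch: fetch client metadata the way a PDS would and confirm it is
# valid JSON. Pass your public base URL, e.g. https://atcr.example.com.
check_client_metadata() {
  curl -fsS "$1/client-metadata.json" | python3 -m json.tool >/dev/null \
    && echo "client metadata OK"
}
```
Run it from *outside* your network if possible; the PDS fetches the URL from the public internet, so an internal-only success can still mask the 404.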
---
## Hold Service Issues
### Blob Storage Connectivity
**Symptom:**
```
error: failed to upload blob: connection refused
```
**Diagnosis:**
1. Check hold service logs:
```bash
docker logs atcr-hold 2>&1 | grep -i error
```
2. Verify S3 credentials are correct:
```bash
# Test S3 access
aws s3 ls s3://your-bucket --endpoint-url=$S3_ENDPOINT
```
3. Check hold configuration:
```bash
env | grep -E "(S3_|AWS_|STORAGE_)"
```
**Solution:**
1. Verify environment variables in hold service:
```bash
export AWS_ACCESS_KEY_ID=your-key
export AWS_SECRET_ACCESS_KEY=your-secret
export S3_BUCKET=your-bucket
export S3_ENDPOINT=https://s3.us-west-2.amazonaws.com
```
2. Test S3 connectivity from hold container:
```bash
docker exec atcr-hold curl -v $S3_ENDPOINT
```
3. Check S3 bucket permissions (the hold service needs `s3:PutObject`, `s3:GetObject`, and `s3:DeleteObject`)
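To confirm all three permissions at once, a round-trip through a throwaway object works. A sketch using the AWS CLI; `S3_BUCKET` and `S3_ENDPOINT` come from the hold service's environment:
```shell
# Sketch: exercise PutObject, GetObject, and DeleteObject in one pass
# using a throwaway key. Requires the aws CLI and the hold's S3 env vars.
s3_permission_check() {
  local key="atcr-permcheck-$$"
  echo ok | aws s3 cp - "s3://$S3_BUCKET/$key" --endpoint-url "$S3_ENDPOINT" \
    && aws s3 cp "s3://$S3_BUCKET/$key" - --endpoint-url "$S3_ENDPOINT" >/dev/null \
    && aws s3 rm "s3://$S3_BUCKET/$key" --endpoint-url "$S3_ENDPOINT"
}
```
Whichever step fails first tells you which permission the bucket policy is missing.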
---
## Performance Issues
### High Database Lock Contention
**Symptom:**
Slow Docker push/pull operations, high CPU usage on AppView
**Diagnosis:**
1. Check SQLite database size:
```bash
ls -lh /var/lib/atcr/ui.db
```
2. Look for long-running queries:
```bash
docker logs atcr-appview 2>&1 | grep "database is locked"
```
**Solution:**
1. For production, migrate to PostgreSQL (recommended):
```bash
export ATCR_UI_DATABASE_TYPE=postgres
export ATCR_UI_DATABASE_URL=postgresql://user:pass@localhost/atcr
```
2. Or increase the SQLite busy timeout and serialize writers:
```go
// With mattn/go-sqlite3: set a busy timeout via the DSN and call
// db.SetMaxOpenConns(1) so writers don't contend for the lock.
db, _ := sql.Open("sqlite3", "file:/var/lib/atcr/ui.db?_busy_timeout=5000")
db.SetMaxOpenConns(1)
```
3. Vacuum the database to reclaim space:
```bash
sqlite3 /var/lib/atcr/ui.db "VACUUM;"
```
---
## Logging and Debugging
### Enable Debug Logging
Set log level to debug for detailed troubleshooting:
```bash
export ATCR_LOG_LEVEL=debug
docker restart atcr-appview
```
### Useful Log Queries
**OAuth token exchange errors:**
```bash
docker logs atcr-appview 2>&1 | grep "OAuth callback failed"
```
**Service token request failures:**
```bash
docker logs atcr-appview 2>&1 | grep "OAuth authentication failed during service token request"
```
**Clock diagnostics:**
```bash
docker logs atcr-appview 2>&1 | grep "system_time"
```
**DPoP nonce issues:**
```bash
docker logs atcr-appview 2>&1 | grep -E "(use_dpop_nonce|DPoP)"
```
### Health Checks
**AppView health:**
```bash
curl http://localhost:5000/v2/
# Should return: {"errors":[{"code":"UNAUTHORIZED",...}]}
```
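If you script these probes, note that a 401 from `/v2/` still means the registry endpoint is alive; only connection failures indicate an outage. A sketch, assuming the AppView listens on port 5000:
```shell
# Sketch: liveness probe for the AppView registry endpoint. A 401 is
# healthy (auth required); only connection failures are outages.
appview_healthy() {
  local code
  code=$(curl -s -o /dev/null -w '%{http_code}' "http://localhost:5000/v2/") || return 1
  [ "$code" = "200" ] || [ "$code" = "401" ]
}

# appview_healthy && echo "AppView is up"
```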
**Hold service health:**
```bash
curl http://localhost:8080/.well-known/did.json
# Should return DID document
```
---
## Getting Help
If issues persist after following this guide:
1. **Check GitHub Issues**: https://github.com/ericvolp12/atcr/issues
2. **Collect logs**: Include output from `docker logs` for AppView and Hold services
3. **Include diagnostics**:
- `timedatectl status` output
- AppView version: `docker exec atcr-appview cat /VERSION` (if available)
- PDS implementation and version (Bluesky reference PDS, or other)
4. **File an issue** with reproducible steps
---
## Common Error Reference
| Error Code | Component | Common Cause | Fix |
|------------|-----------|--------------|-----|
| `invalid_client` (iat timestamp) | OAuth | Clock skew | Enable NTP sync |
| `use_dpop_nonce` | OAuth/DPoP | Concurrent requests or clock skew | Fix NTP, wait for auto-retry |
| `server_error` (500) | PDS | PDS internal error | Check PDS logs |
| `invalid_grant` | OAuth | Expired auth code | Retry OAuth flow |
| `unauthorized_client` | OAuth | Client metadata unreachable | Check ATCR_BASE_URL and firewall |
| `RecordNotFound` | ATProto | Manifest doesn't exist | Verify repository name |
| Connection refused | Hold/S3 | Network/credentials | Check S3 config and connectivity |