# ATCR Troubleshooting Guide
This document provides troubleshooting guidance for common ATCR deployment and operational issues.
## OAuth Authentication Failures
### JWT Timestamp Validation Errors
**Symptom:**
```
error: invalid_client
error_description: Validation of "client_assertion" failed: "iat" claim timestamp check failed (it should be in the past)
```
**Root Cause:**
The AppView server's system clock is ahead of the PDS server's clock. When the AppView generates a JWT for OAuth client authentication (confidential client mode), the "iat" (issued at) claim appears to be in the future from the PDS's perspective.
**Diagnosis:**
1. Check AppView system time:
```bash
date -u
timedatectl status
```
2. Check if NTP is active and synchronized:
```bash
timedatectl show-timesync --all
```
3. Compare AppView time with PDS time (if accessible):
```bash
# On AppView
date +%s
# On PDS (or via HTTP headers)
curl -I https://your-pds.example.com | grep -i date
```
4. Check AppView logs for clock information (logged at startup):
```bash
docker logs atcr-appview 2>&1 | grep "Configured confidential OAuth client"
```
Example log output:
```
level=INFO msg="Configured confidential OAuth client"
key_id=did:key:z...
system_time_unix=1731844215
system_time_rfc3339=2025-11-17T14:30:15Z
timezone=UTC
```
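To quantify the skew directly, you can compare the local clock with the `Date` header the PDS returns. A sketch, assuming GNU `date`; the PDS hostname is a placeholder:
```shell
# Sketch: compute clock skew against the PDS's HTTP Date header.
# Assumes GNU date; https://your-pds.example.com is a placeholder.
skew_seconds() {
  # $1 = remote epoch seconds, $2 = local epoch seconds
  echo $(( $1 - $2 ))
}

# Example usage (uncomment to run against a real PDS):
# pds_date=$(curl -sI https://your-pds.example.com | tr -d '\r' \
#   | awk -F': ' 'tolower($1) == "date" {print $2}')
# echo "skew: $(skew_seconds "$(date -d "$pds_date" +%s)" "$(date +%s)")s"

skew_seconds 1731844220 1731844215   # prints 5
```
A positive result means the remote clock is ahead of yours; a negative result means it is behind.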
**Solution:**
1. **Enable NTP synchronization** (recommended):
On most Linux systems using systemd:
```bash
# Enable and start systemd-timesyncd
sudo timedatectl set-ntp true
# Verify NTP is active
timedatectl status
```
Expected output:
```
System clock synchronized: yes
NTP service: active
```
2. **Alternative: Use chrony** (if systemd-timesyncd is not available):
```bash
# Install chrony
sudo apt-get install chrony # Debian/Ubuntu
sudo yum install chrony # RHEL/CentOS
# Enable and start chronyd
sudo systemctl enable chronyd
sudo systemctl start chronyd
# Check sync status
chronyc tracking
```
3. **Force immediate sync**:
```bash
# systemd-timesyncd
sudo systemctl restart systemd-timesyncd
# Or with chrony
sudo chronyc makestep
```
4. **In Docker/Kubernetes environments:**
The container inherits the host's system clock, so fix NTP on the **host** machine:
```bash
# On Docker host
sudo timedatectl set-ntp true
# Restart AppView container to pick up correct time
docker restart atcr-appview
```
5. **Verify clock skew is resolved**:
```bash
# Should show clock offset < 1 second
timedatectl timesync-status
```
**Acceptable Clock Skew:**
- Most OAuth implementations tolerate ±30-60 seconds of clock skew
- DPoP proof validation is typically stricter (±10 seconds)
- Aim for < 1 second skew for reliable operation
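As a rough sanity check, those tolerances can be encoded in a few lines of shell; the limits here are illustrative, not normative:
```shell
# Sketch: is a measured skew (in seconds, possibly negative) within an
# OAuth-style tolerance? Default limit of 30s is illustrative.
within_tolerance() {
  local skew=$1 limit=${2:-30}
  local abs=$(( skew < 0 ? -skew : skew ))
  [ "$abs" -le "$limit" ]
}

within_tolerance 5 && echo "within OAuth tolerance"
within_tolerance 15 10 || echo "too much skew for DPoP"
```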
**Prevention:**
- Configure NTP synchronization in your infrastructure-as-code (Terraform, Ansible, etc.)
- Monitor clock skew in production (e.g., Prometheus node_exporter includes clock metrics)
- Use managed container platforms (ECS, GKE, AKS) that handle NTP automatically
---
### DPoP Nonce Mismatch Errors
**Symptom:**
```
error: use_dpop_nonce
error_description: DPoP "nonce" mismatch
```
Repeated multiple times, potentially followed by:
```
error: server_error
error_description: Server error
```
**Root Cause:**
DPoP (Demonstrating Proof of Possession, RFC 9449) requires a server-provided nonce for replay protection. These errors typically occur when:
1. Multiple concurrent requests create a DPoP nonce race condition
2. Clock skew causes DPoP proof timestamps to fail validation
3. PDS session state becomes corrupted after repeated failures
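The nonce exchange itself is a simple retry loop; the indigo library performs it for ATCR, but the shape is roughly this (a sketch — `atcr_token_request` and `extract_nonce` are hypothetical helpers, not real ATCR commands):
```shell
# Illustration of the DPoP nonce "dance": on use_dpop_nonce, capture the
# server-supplied nonce and retry. atcr_token_request and extract_nonce
# are hypothetical stand-ins for the real client calls.
request_with_nonce_retry() {
  local nonce="" resp attempt
  for attempt in 1 2 3; do
    resp=$(atcr_token_request "$nonce") || true
    if [[ "$resp" == *use_dpop_nonce* ]]; then
      nonce=$(extract_nonce "$resp")   # from the DPoP-Nonce response header
      continue
    fi
    echo "$resp"
    return 0
  done
  return 1   # persistent mismatch: fall through to diagnosis
}
```
One retry with a fresh nonce is expected and harmless; it is the *repeated* mismatches that indicate a race or clock problem.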
**Diagnosis:**
1. Check if errors occur during concurrent operations:
```bash
# During docker push with multiple layers
docker logs atcr-appview 2>&1 | grep "use_dpop_nonce" | wc -l
```
2. Check for clock skew (see section above):
```bash
timedatectl status
```
3. Look for session lock acquisition in logs:
```bash
docker logs atcr-appview 2>&1 | grep "Acquired session lock"
```
**Solution:**
1. **If caused by clock skew**: Fix NTP synchronization (see section above)
2. **If caused by session corruption**:
```bash
# The AppView will automatically delete corrupted sessions
# User just needs to re-authenticate
docker login atcr.io
```
3. **If persistent despite clock sync**:
- Check PDS health and logs (may be a PDS-side issue)
- Verify network connectivity between AppView and PDS
- Check if PDS supports latest OAuth/DPoP specifications
**What ATCR does automatically:**
- Per-DID locking prevents concurrent DPoP nonce races
- Indigo library automatically retries with fresh nonces
- Sessions are auto-deleted after repeated failures
- Service token cache prevents excessive PDS requests
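The per-DID locking idea generalizes: external tooling that drives multiple pushes for the same account can approximate the same serialization with `flock(1)`. A sketch; the `/tmp` lock path is an assumption for illustration:
```shell
# Sketch: serialize operations per DID with flock(1). ATCR does this
# in-process; the /tmp lock path here is just an illustration.
with_did_lock() {
  local did="$1"; shift
  local lock="/tmp/atcr-$(echo "$did" | tr ':/' '__').lock"
  (
    flock -x 9          # blocks until no other holder for this DID
    "$@"
  ) 9>"$lock"
}

# with_did_lock did:plc:abc123 docker push atcr.example.com/user/repo:tag
```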
**Prevention:**
- Ensure reliable NTP synchronization
- Use a stable, well-maintained PDS implementation
- Monitor AppView error rates for DPoP-related issues
---
### OAuth Session Not Found
**Symptom:**
```
error: failed to get OAuth session: no session found for DID
```
**Root Cause:**
- User has never authenticated via OAuth
- OAuth session was deleted due to corruption or expiry
- Database migration cleared sessions
**Solution:**
1. User re-authenticates via OAuth flow:
```bash
docker login atcr.io
# Or for web UI: visit https://atcr.io/login
```
2. If using app passwords (legacy), refresh the cached token:
```bash
# Re-authenticate to regenerate the cached app-password token
docker logout atcr.io
docker login atcr.io -u your.handle -p your-app-password
```
---
## AppView Deployment Issues
### Client Metadata URL Not Accessible
**Symptom:**
```
error: unauthorized_client
error_description: Client metadata endpoint returned 404
```
**Root Cause:**
PDS cannot fetch OAuth client metadata from `{ATCR_BASE_URL}/client-metadata.json`
**Diagnosis:**
1. Verify client metadata endpoint is accessible:
```bash
curl https://your-atcr-instance.com/client-metadata.json
```
2. Check AppView logs for startup errors:
```bash
docker logs atcr-appview 2>&1 | grep "client-metadata"
```
3. Verify `ATCR_BASE_URL` is set correctly:
```bash
echo $ATCR_BASE_URL
```
**Solution:**
1. Ensure `ATCR_BASE_URL` matches your public URL:
```bash
export ATCR_BASE_URL=https://atcr.example.com
```
2. Verify reverse proxy (nginx, Caddy, etc.) routes `/.well-known/*` and `/client-metadata.json`:
```nginx
location / {
    proxy_pass http://localhost:5000;
    proxy_set_header Host $host;
    proxy_set_header X-Forwarded-Proto $scheme;
}
```
3. Check firewall rules allow inbound HTTPS:
```bash
sudo ufw status
sudo iptables -L -n | grep 443
```
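The three checks above can be combined into one end-to-end probe that confirms the PDS will be able to fetch and parse the metadata (a sketch; pass your own public base URL):
```shell
# Sketch: fetch client metadata the way a PDS would and confirm it is
# valid JSON. Pass your public base URL, e.g. https://atcr.example.com.
check_client_metadata() {
  curl -fsS "$1/client-metadata.json" | python3 -m json.tool >/dev/null \
    && echo "client metadata OK"
}
```
Run it from *outside* your network if possible; the PDS fetches the URL from the public internet, so an internal-only success can still mask the 404.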
---
## Hold Service Issues
### Blob Storage Connectivity
**Symptom:**
```
error: failed to upload blob: connection refused
```
**Diagnosis:**
1. Check hold service logs:
```bash
docker logs atcr-hold 2>&1 | grep -i error
```
2. Verify S3 credentials are correct:
```bash
# Test S3 access
aws s3 ls s3://your-bucket --endpoint-url=$S3_ENDPOINT
```
3. Check hold configuration:
```bash
env | grep -E "(S3_|AWS_|STORAGE_)"
```
**Solution:**
1. Verify environment variables in hold service:
```bash
export AWS_ACCESS_KEY_ID=your-key
export AWS_SECRET_ACCESS_KEY=your-secret
export S3_BUCKET=your-bucket
export S3_ENDPOINT=https://s3.us-west-2.amazonaws.com
```
2. Test S3 connectivity from hold container:
```bash
docker exec atcr-hold curl -v $S3_ENDPOINT
```
3. Check S3 bucket permissions (the hold service needs `s3:PutObject`, `s3:GetObject`, and `s3:DeleteObject`)
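To confirm all three permissions at once, a round-trip through a throwaway object works. A sketch using the AWS CLI; `S3_BUCKET` and `S3_ENDPOINT` come from the hold service's environment:
```shell
# Sketch: exercise PutObject, GetObject, and DeleteObject in one pass
# using a throwaway key. Requires the aws CLI and the hold's S3 env vars.
s3_permission_check() {
  local key="atcr-permcheck-$$"
  echo ok | aws s3 cp - "s3://$S3_BUCKET/$key" --endpoint-url "$S3_ENDPOINT" \
    && aws s3 cp "s3://$S3_BUCKET/$key" - --endpoint-url "$S3_ENDPOINT" >/dev/null \
    && aws s3 rm "s3://$S3_BUCKET/$key" --endpoint-url "$S3_ENDPOINT"
}
```
Whichever step fails first tells you which permission the bucket policy is missing.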
---
## Performance Issues
### High Database Lock Contention
**Symptom:**
Slow Docker push/pull operations, high CPU usage on AppView
**Diagnosis:**
1. Check SQLite database size:
```bash
ls -lh /var/lib/atcr/ui.db
```
2. Look for long-running queries:
```bash
docker logs atcr-appview 2>&1 | grep "database is locked"
```
**Solution:**
1. For production, migrate to PostgreSQL (recommended):
```bash
export ATCR_UI_DATABASE_TYPE=postgres
export ATCR_UI_DATABASE_URL=postgresql://user:pass@localhost/atcr
```
2. Or increase the SQLite busy timeout and serialize writers:
```go
// With mattn/go-sqlite3: set a busy timeout via the DSN and call
// db.SetMaxOpenConns(1) so writers don't contend for the lock.
db, _ := sql.Open("sqlite3", "file:/var/lib/atcr/ui.db?_busy_timeout=5000")
db.SetMaxOpenConns(1)
```
3. Vacuum the database to reclaim space:
```bash
sqlite3 /var/lib/atcr/ui.db "VACUUM;"
```
---
## Logging and Debugging
### Enable Debug Logging
Set log level to debug for detailed troubleshooting:
```bash
export ATCR_LOG_LEVEL=debug
docker restart atcr-appview
```
### Useful Log Queries
**OAuth token exchange errors:**
```bash
docker logs atcr-appview 2>&1 | grep "OAuth callback failed"
```
**Service token request failures:**
```bash
docker logs atcr-appview 2>&1 | grep "OAuth authentication failed during service token request"
```
**Clock diagnostics:**
```bash
docker logs atcr-appview 2>&1 | grep "system_time"
```
**DPoP nonce issues:**
```bash
docker logs atcr-appview 2>&1 | grep -E "(use_dpop_nonce|DPoP)"
```
### Health Checks
**AppView health:**
```bash
curl http://localhost:5000/v2/
# Should return: {"errors":[{"code":"UNAUTHORIZED",...}]}
```
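If you script these probes, note that a 401 from `/v2/` still means the registry endpoint is alive; only connection failures indicate an outage. A sketch, assuming the AppView listens on port 5000:
```shell
# Sketch: liveness probe for the AppView registry endpoint. A 401 is
# healthy (auth required); only connection failures are outages.
appview_healthy() {
  local code
  code=$(curl -s -o /dev/null -w '%{http_code}' "http://localhost:5000/v2/") || return 1
  [ "$code" = "200" ] || [ "$code" = "401" ]
}

# appview_healthy && echo "AppView is up"
```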
**Hold service health:**
```bash
curl http://localhost:8080/.well-known/did.json
# Should return DID document
```
---
## Getting Help
If issues persist after following this guide:
1. **Check GitHub Issues**: https://github.com/ericvolp12/atcr/issues
2. **Collect logs**: Include output from `docker logs` for AppView and Hold services
3. **Include diagnostics**:
- `timedatectl status` output
- AppView version: `docker exec atcr-appview cat /VERSION` (if available)
- PDS implementation and version (Bluesky reference PDS, or other)
4. **File an issue** with reproducible steps
---
## Common Error Reference
| Error Code | Component | Common Cause | Fix |
|------------|-----------|--------------|-----|
| `invalid_client` (iat timestamp) | OAuth | Clock skew | Enable NTP sync |
| `use_dpop_nonce` | OAuth/DPoP | Concurrent requests or clock skew | Fix NTP, wait for auto-retry |
| `server_error` (500) | PDS | PDS internal error | Check PDS logs |
| `invalid_grant` | OAuth | Expired auth code | Retry OAuth flow |
| `unauthorized_client` | OAuth | Client metadata unreachable | Check ATCR_BASE_URL and firewall |
| `RecordNotFound` | ATProto | Manifest doesn't exist | Verify repository name |
| Connection refused | Hold/S3 | Network/credentials | Check S3 config and connectivity |