# ATCR Troubleshooting Guide

This document provides troubleshooting guidance for common ATCR deployment and operational issues.

## OAuth Authentication Failures

### JWT Timestamp Validation Errors

**Symptom:**

```
error: invalid_client
error_description: Validation of "client_assertion" failed: "iat" claim timestamp check failed (it should be in the past)
```

**Root Cause:**

The AppView server's system clock is ahead of the PDS server's clock. When the AppView generates a JWT for OAuth client authentication (confidential client mode), the "iat" (issued at) claim appears to be in the future from the PDS's perspective.

**Diagnosis:**

1. Check AppView system time:

   ```bash
   date -u
   timedatectl status
   ```

2. Check if NTP is active and synchronized:

   ```bash
   timedatectl show-timesync --all
   ```

3. Compare AppView time with PDS time, if accessible (a sketch that quantifies this check follows this list):

   ```bash
   # On AppView
   date +%s

   # On PDS (or via HTTP headers)
   curl -I https://your-pds.example.com | grep -i date
   ```

4. Check AppView logs for clock information (logged at startup):

   ```bash
   docker logs atcr-appview 2>&1 | grep "Configured confidential OAuth client"
   ```

   Example log output:

   ```
   level=INFO msg="Configured confidential OAuth client" key_id=did:key:z... system_time_unix=1763389815 system_time_rfc3339=2025-11-17T14:30:15Z timezone=UTC
   ```
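To turn diagnosis step 3 into a single number rather than two timestamps to eyeball, the following Go sketch estimates the skew by comparing the local clock against the PDS's HTTP `Date` response header. The PDS URL is a placeholder, and the `Date` header has only one-second resolution, so treat the result as a rough estimate rather than a substitute for proper NTP monitoring.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"time"
)

func main() {
	// Placeholder PDS URL; substitute your own instance.
	const pdsURL = "https://your-pds.example.com"

	resp, err := http.Head(pdsURL)
	if err != nil {
		log.Fatalf("HEAD %s: %v", pdsURL, err)
	}
	resp.Body.Close()

	// HTTP Date headers use RFC 1123 format, e.g.
	// "Mon, 17 Nov 2025 14:30:15 GMT".
	remote, err := http.ParseTime(resp.Header.Get("Date"))
	if err != nil {
		log.Fatalf("parsing Date header: %v", err)
	}

	// Positive skew means the local clock is ahead of the PDS.
	skew := time.Since(remote)
	fmt.Printf("estimated skew (local minus PDS): %v\n", skew.Round(time.Second))
	if skew > 30*time.Second || skew < -30*time.Second {
		fmt.Println("WARNING: skew exceeds typical OAuth tolerance; check NTP")
	}
}
```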
**Solution:**

1. **Enable NTP synchronization** (recommended):

   On most Linux systems using systemd:

   ```bash
   # Enable and start systemd-timesyncd
   sudo timedatectl set-ntp true

   # Verify NTP is active
   timedatectl status
   ```

   Expected output:

   ```
   System clock synchronized: yes
   NTP service: active
   ```

2. **Alternative: Use chrony** (if systemd-timesyncd is not available):

   ```bash
   # Install chrony
   sudo apt-get install chrony   # Debian/Ubuntu
   sudo yum install chrony       # RHEL/CentOS

   # Enable and start chronyd
   sudo systemctl enable chronyd
   sudo systemctl start chronyd

   # Check sync status
   chronyc tracking
   ```

3. **Force immediate sync:**

   ```bash
   # systemd-timesyncd
   sudo systemctl restart systemd-timesyncd

   # Or with chrony
   sudo chronyc makestep
   ```

4. **In Docker/Kubernetes environments:**

   The container inherits the host's system clock, so fix NTP on the **host** machine:

   ```bash
   # On Docker host
   sudo timedatectl set-ntp true

   # Restart AppView container to pick up correct time
   docker restart atcr-appview
   ```

5. **Verify clock skew is resolved:**

   ```bash
   # Should show clock offset < 1 second
   timedatectl timesync-status
   ```

**Acceptable Clock Skew:**

- Most OAuth implementations tolerate ±30-60 seconds of clock skew
- DPoP proof validation is typically stricter (±10 seconds)
- Aim for < 1 second skew for reliable operation

**Prevention:**

- Configure NTP synchronization in your infrastructure-as-code (Terraform, Ansible, etc.)
- Monitor clock skew in production (e.g., Prometheus node_exporter includes clock metrics)
- Use managed container platforms (ECS, GKE, AKS) that handle NTP automatically

---

### DPoP Nonce Mismatch Errors

**Symptom:**

```
error: use_dpop_nonce
error_description: DPoP "nonce" mismatch
```

Repeated multiple times, potentially followed by:

```
error: server_error
error_description: Server error
```

**Root Cause:**

DPoP (Demonstrating Proof-of-Possession) requires a server-provided nonce for replay protection. These errors typically occur when:

1. Multiple concurrent requests create a DPoP nonce race condition
2. Clock skew causes DPoP proof timestamps to fail validation
3. PDS session state becomes corrupted after repeated failures

**Diagnosis:**

1. Check if errors occur during concurrent operations:

   ```bash
   # During docker push with multiple layers
   docker logs atcr-appview 2>&1 | grep "use_dpop_nonce" | wc -l
   ```

2. Check for clock skew (see section above):

   ```bash
   timedatectl status
   ```

3. Look for session lock acquisition in logs:

   ```bash
   docker logs atcr-appview 2>&1 | grep "Acquired session lock"
   ```

**Solution:**

1. **If caused by clock skew:** Fix NTP synchronization (see section above)

2. **If caused by session corruption:**

   ```bash
   # The AppView will automatically delete corrupted sessions;
   # the user just needs to re-authenticate
   docker login atcr.io
   ```

3. **If persistent despite clock sync:**

   - Check PDS health and logs (may be a PDS-side issue)
   - Verify network connectivity between AppView and PDS
   - Check if the PDS supports the latest OAuth/DPoP specifications

**What ATCR does automatically:**

- Per-DID locking prevents concurrent DPoP nonce races (see the sketch after this list)
- The Indigo library automatically retries with fresh nonces
- Sessions are auto-deleted after repeated failures
- The service token cache prevents excessive PDS requests
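The per-DID locking mentioned above looks roughly like the sketch below. This is the general shape of the technique, not ATCR's actual code: one mutex per DID serializes PDS requests for a single account, so each request observes the latest server-issued DPoP nonce, while pushes for different accounts still run in parallel.

```go
package main

import "sync"

// didLocker hands out one mutex per DID so that all PDS requests for a
// given account are serialized, while different accounts stay concurrent.
type didLocker struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func newDIDLocker() *didLocker {
	return &didLocker{locks: make(map[string]*sync.Mutex)}
}

// lockFor returns the mutex guarding requests for did, creating it on
// first use. Note: entries are never evicted in this sketch.
func (l *didLocker) lockFor(did string) *sync.Mutex {
	l.mu.Lock()
	defer l.mu.Unlock()
	m, ok := l.locks[did]
	if !ok {
		m = &sync.Mutex{}
		l.locks[did] = m
	}
	return m
}

func main() {
	locker := newDIDLocker()

	// Hypothetical usage around a PDS call during a docker push:
	m := locker.lockFor("did:plc:example123")
	m.Lock()
	// ...send the request, record any rotated DPoP nonce from the response...
	m.Unlock()
}
```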
**Prevention:**

- Ensure reliable NTP synchronization
- Use a stable, well-maintained PDS implementation
- Monitor AppView error rates for DPoP-related issues

---

### OAuth Session Not Found

**Symptom:**

```
error: failed to get OAuth session: no session found for DID
```

**Root Cause:**

- User has never authenticated via OAuth
- OAuth session was deleted due to corruption or expiry
- Database migration cleared sessions

**Solution:**

1. User re-authenticates via the OAuth flow:

   ```bash
   docker login atcr.io
   # Or for the web UI: visit https://atcr.io/login
   ```

2. If using app passwords (legacy), verify the token is cached:

   ```bash
   # Check if an app-password token exists
   docker logout atcr.io
   docker login atcr.io -u your.handle -p your-app-password
   ```

---

## AppView Deployment Issues

### Client Metadata URL Not Accessible

**Symptom:**

```
error: unauthorized_client
error_description: Client metadata endpoint returned 404
```

**Root Cause:**

The PDS cannot fetch OAuth client metadata from `{ATCR_BASE_URL}/client-metadata.json`.

**Diagnosis:**

1. Verify the client metadata endpoint is accessible:

   ```bash
   curl https://your-atcr-instance.com/client-metadata.json
   ```

2. Check AppView logs for startup errors:

   ```bash
   docker logs atcr-appview 2>&1 | grep "client-metadata"
   ```

3. Verify `ATCR_BASE_URL` is set correctly:

   ```bash
   echo $ATCR_BASE_URL
   ```

**Solution:**

1. Ensure `ATCR_BASE_URL` matches your public URL:

   ```bash
   export ATCR_BASE_URL=https://atcr.example.com
   ```

2. Verify the reverse proxy (nginx, Caddy, etc.) routes `/.well-known/*` and `/client-metadata.json`:

   ```nginx
   location / {
       proxy_pass http://localhost:5000;
       proxy_set_header Host $host;
       proxy_set_header X-Forwarded-Proto $scheme;
   }
   ```

3. Check that firewall rules allow inbound HTTPS:

   ```bash
   sudo ufw status
   sudo iptables -L -n | grep 443
   ```

---

## Hold Service Issues

### Blob Storage Connectivity

**Symptom:**

```
error: failed to upload blob: connection refused
```

**Diagnosis:**

1. Check hold service logs:

   ```bash
   docker logs atcr-hold 2>&1 | grep -i error
   ```

2. Verify S3 credentials are correct:

   ```bash
   # Test S3 access
   aws s3 ls s3://your-bucket --endpoint-url=$S3_ENDPOINT
   ```

3. Check the hold configuration:

   ```bash
   env | grep -E "(S3_|AWS_|STORAGE_)"
   ```

**Solution:**

1. Verify the environment variables in the hold service:

   ```bash
   export AWS_ACCESS_KEY_ID=your-key
   export AWS_SECRET_ACCESS_KEY=your-secret
   export S3_BUCKET=your-bucket
   export S3_ENDPOINT=https://s3.us-west-2.amazonaws.com
   ```

2. Test S3 connectivity from the hold container:

   ```bash
   docker exec atcr-hold curl -v $S3_ENDPOINT
   ```

3. Check S3 bucket permissions (requires `PutObject`, `GetObject`, `DeleteObject`)

---

## Performance Issues

### High Database Lock Contention

**Symptom:** Slow Docker push/pull operations, high CPU usage on AppView

**Diagnosis:**

1. Check the SQLite database size:

   ```bash
   ls -lh /var/lib/atcr/ui.db
   ```

2. Look for long-running queries:

   ```bash
   docker logs atcr-appview 2>&1 | grep "database is locked"
   ```

**Solution:**

1. For production, migrate to PostgreSQL (recommended):

   ```bash
   export ATCR_UI_DATABASE_TYPE=postgres
   export ATCR_UI_DATABASE_URL=postgresql://user:pass@localhost/atcr
   ```

2. Or serialize SQLite access so writers queue instead of failing:

   ```go
   // SQLite allows only one writer at a time; limiting the database/sql
   // pool to a single connection avoids "database is locked" errors.
   db.SetMaxOpenConns(1)
   ```

   Raising SQLite's `busy_timeout` pragma can also help; the exact option depends on your driver's DSN syntax.

3. Vacuum the database to reclaim space:

   ```bash
   sqlite3 /var/lib/atcr/ui.db "VACUUM;"
   ```

---

## Logging and Debugging

### Enable Debug Logging

Set the log level to debug for detailed troubleshooting:

```bash
export ATCR_LOG_LEVEL=debug
docker restart atcr-appview
```

### Useful Log Queries

**OAuth token exchange errors:**

```bash
docker logs atcr-appview 2>&1 | grep "OAuth callback failed"
```

**Service token request failures:**

```bash
docker logs atcr-appview 2>&1 | grep "OAuth authentication failed during service token request"
```

**Clock diagnostics:**

```bash
docker logs atcr-appview 2>&1 | grep "system_time"
```

**DPoP nonce issues:**

```bash
docker logs atcr-appview 2>&1 | grep -E "(use_dpop_nonce|DPoP)"
```

### Health Checks

**AppView health:**

```bash
curl http://localhost:5000/v2/
# Should return: {"errors":[{"code":"UNAUTHORIZED",...}]}
```

**Hold service health:**

```bash
curl http://localhost:8080/.well-known/did.json
# Should return DID document
```

A programmatic version of these checks is sketched at the end of this guide.

---

## Getting Help

If issues persist after following this guide:

1. **Check GitHub Issues:** https://github.com/ericvolp12/atcr/issues
2. **Collect logs:** Include output from `docker logs` for the AppView and Hold services
3. **Include diagnostics:**
   - `timedatectl status` output
   - AppView version: `docker exec atcr-appview cat /VERSION` (if available)
   - PDS version and implementation (Bluesky PDS, other)
4. **File an issue** with reproducible steps

---

## Common Error Reference

| Error Code | Component | Common Cause | Fix |
|------------|-----------|--------------|-----|
| `invalid_client` (iat timestamp) | OAuth | Clock skew | Enable NTP sync |
| `use_dpop_nonce` | OAuth/DPoP | Concurrent requests or clock skew | Fix NTP, wait for auto-retry |
| `server_error` (500) | PDS | PDS internal error | Check PDS logs |
| `invalid_grant` | OAuth | Expired auth code | Retry OAuth flow |
| `unauthorized_client` | OAuth | Client metadata unreachable | Check `ATCR_BASE_URL` and firewall |
| `RecordNotFound` | ATProto | Manifest doesn't exist | Verify repository name |
| Connection refused | Hold/S3 | Network/credentials | Check S3 config and connectivity |
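For automated monitoring, the two health checks above can be wrapped into a small probe. The sketch below is an illustration using the default local ports from the Health Checks section, not shipped ATCR tooling: a healthy AppView answers `/v2/` with HTTP 401 (the Docker Registry auth challenge), and a healthy hold service serves its DID document with HTTP 200.

```go
package main

import (
	"fmt"
	"net/http"
	"os"
)

// probe fetches url and reports an error unless the status matches want.
func probe(url string, want int) error {
	resp, err := http.Get(url)
	if err != nil {
		return fmt.Errorf("GET %s: %w", url, err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != want {
		return fmt.Errorf("GET %s: got %d, want %d", url, resp.StatusCode, want)
	}
	return nil
}

func main() {
	// Assumed default local ports; adjust for your deployment.
	checks := map[string]int{
		"http://localhost:5000/v2/":                  http.StatusUnauthorized, // AppView auth challenge
		"http://localhost:8080/.well-known/did.json": http.StatusOK,           // Hold DID document
	}
	exitCode := 0
	for url, want := range checks {
		if err := probe(url, want); err != nil {
			fmt.Fprintln(os.Stderr, err)
			exitCode = 1
		}
	}
	os.Exit(exitCode)
}
```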