Mirror of https://tangled.org/evan.jarrett.net/at-container-registry, synced 2026-04-22 17:30:33 +00:00
# ATCR Troubleshooting Guide

This document provides troubleshooting guidance for common ATCR deployment and operational issues.

## OAuth Authentication Failures

### JWT Timestamp Validation Errors

**Symptom:**

```
error: invalid_client
error_description: Validation of "client_assertion" failed: "iat" claim timestamp check failed (it should be in the past)
```

**Root Cause:**

The AppView server's system clock is ahead of the PDS server's clock. When the AppView generates a JWT for OAuth client authentication (confidential client mode), the "iat" (issued at) claim appears to be in the future from the PDS's perspective.

**Diagnosis:**

1. Check AppView system time:

   ```bash
   date -u
   timedatectl status
   ```

2. Check if NTP is active and synchronized:

   ```bash
   timedatectl show-timesync --all
   ```

3. Compare AppView time with PDS time (if accessible):

   ```bash
   # On AppView
   date +%s

   # On PDS (or via HTTP headers)
   curl -I https://your-pds.example.com | grep -i date
   ```

4. Check AppView logs for clock information (logged at startup):

   ```bash
   docker logs atcr-appview 2>&1 | grep "Configured confidential OAuth client"
   ```

   Example log output:

   ```
   level=INFO msg="Configured confidential OAuth client"
   key_id=did:key:z...
   system_time_unix=1731844215
   system_time_rfc3339=2025-11-17T14:30:15Z
   timezone=UTC
   ```


**Solution:**

1. **Enable NTP synchronization** (recommended):

   On most Linux systems using systemd:

   ```bash
   # Enable and start systemd-timesyncd
   sudo timedatectl set-ntp true

   # Verify NTP is active
   timedatectl status
   ```

   Expected output:

   ```
   System clock synchronized: yes
   NTP service: active
   ```

2. **Alternative: Use chrony** (if systemd-timesyncd is not available):

   ```bash
   # Install chrony
   sudo apt-get install chrony   # Debian/Ubuntu
   sudo yum install chrony       # RHEL/CentOS

   # Enable and start chronyd
   sudo systemctl enable chronyd
   sudo systemctl start chronyd

   # Check sync status
   chronyc tracking
   ```

3. **Force immediate sync**:

   ```bash
   # systemd-timesyncd
   sudo systemctl restart systemd-timesyncd

   # Or with chrony
   sudo chronyc makestep
   ```

4. **In Docker/Kubernetes environments:**

   The container inherits the host's system clock, so fix NTP on the **host** machine:

   ```bash
   # On Docker host
   sudo timedatectl set-ntp true

   # Restart AppView container to pick up correct time
   docker restart atcr-appview
   ```

5. **Verify clock skew is resolved**:

   ```bash
   # Should show clock offset < 1 second
   timedatectl timesync-status
   ```


**Acceptable Clock Skew:**

- Most OAuth implementations tolerate ±30-60 seconds of clock skew
- DPoP proof validation is typically stricter (±10 seconds)
- Aim for < 1 second skew for reliable operation


**Prevention:**

- Configure NTP synchronization in your infrastructure-as-code (Terraform, Ansible, etc.)
- Monitor clock skew in production (e.g., Prometheus node_exporter includes clock metrics)
- Use managed container platforms (ECS, GKE, AKS) that handle NTP automatically

---

### DPoP Nonce Mismatch Errors

**Symptom:**

```
error: use_dpop_nonce
error_description: DPoP "nonce" mismatch
```

Repeated multiple times, potentially followed by:

```
error: server_error
error_description: Server error
```

**Root Cause:**

DPoP (Demonstrating Proof-of-Possession) requires a server-provided nonce for replay protection. These errors typically occur when:

1. Multiple concurrent requests create a DPoP nonce race condition
2. Clock skew causes DPoP proof timestamps to fail validation
3. PDS session state becomes corrupted after repeated failures

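
A single `use_dpop_nonce` response is a normal part of the protocol: the server rejects the first proof and supplies a nonce to retry with. The sketch below models that handshake (the types and function names are illustrative; the indigo OAuth client performs this retry internally):

```go
package main

import "fmt"

// response is a stripped-down view of a PDS token response.
type response struct {
	errCode   string // e.g. "use_dpop_nonce"
	dpopNonce string // DPoP-Nonce header offered by the server
}

// doWithNonceRetry models the nonce handshake: if the server rejects
// the first attempt with use_dpop_nonce, retry once with the nonce it
// supplied. Repeated mismatches after the retry indicate a real problem
// (races or skew), not normal protocol flow.
func doWithNonceRetry(send func(nonce string) response) response {
	resp := send("")
	if resp.errCode == "use_dpop_nonce" && resp.dpopNonce != "" {
		resp = send(resp.dpopNonce)
	}
	return resp
}

func main() {
	serverNonce := "nonce-123"
	send := func(nonce string) response {
		if nonce != serverNonce {
			return response{errCode: "use_dpop_nonce", dpopNonce: serverNonce}
		}
		return response{} // success
	}
	fmt.Println(doWithNonceRetry(send).errCode == "")
}
```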

**Diagnosis:**

1. Check if errors occur during concurrent operations:

   ```bash
   # During docker push with multiple layers
   docker logs atcr-appview 2>&1 | grep "use_dpop_nonce" | wc -l
   ```

2. Check for clock skew (see section above):

   ```bash
   timedatectl status
   ```

3. Look for session lock acquisition in logs:

   ```bash
   docker logs atcr-appview 2>&1 | grep "Acquired session lock"
   ```

**Solution:**

1. **If caused by clock skew**: Fix NTP synchronization (see section above)

2. **If caused by session corruption**:

   ```bash
   # The AppView will automatically delete corrupted sessions
   # User just needs to re-authenticate
   docker login atcr.io
   ```

3. **If persistent despite clock sync**:

   - Check PDS health and logs (may be a PDS-side issue)
   - Verify network connectivity between AppView and PDS
   - Check if PDS supports latest OAuth/DPoP specifications

**What ATCR does automatically:**

- Per-DID locking prevents concurrent DPoP nonce races
- Indigo library automatically retries with fresh nonces
- Sessions are auto-deleted after repeated failures
- Service token cache prevents excessive PDS requests

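
The per-DID locking mentioned above can be sketched as a map of mutexes keyed by DID, so concurrent requests for the same account serialize their nonce round-trips. This is an illustrative sketch, not ATCR's actual implementation:

```go
package main

import (
	"fmt"
	"sync"
)

// didLocks hands out one mutex per DID: requests for the same DID
// take the same lock, while different DIDs proceed independently.
type didLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func (d *didLocks) lockFor(did string) *sync.Mutex {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.locks == nil {
		d.locks = make(map[string]*sync.Mutex)
	}
	if _, ok := d.locks[did]; !ok {
		d.locks[did] = &sync.Mutex{}
	}
	return d.locks[did]
}

func main() {
	var d didLocks
	var wg sync.WaitGroup
	counter := 0
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			l := d.lockFor("did:plc:example")
			l.Lock()
			counter++ // serialized: all goroutines share one DID's lock
			l.Unlock()
		}()
	}
	wg.Wait()
	fmt.Println(counter)
}
```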

**Prevention:**

- Ensure reliable NTP synchronization
- Use a stable, well-maintained PDS implementation
- Monitor AppView error rates for DPoP-related issues

---

### OAuth Session Not Found

**Symptom:**

```
error: failed to get OAuth session: no session found for DID
```

**Root Cause:**

- User has never authenticated via OAuth
- OAuth session was deleted due to corruption or expiry
- Database migration cleared sessions

**Solution:**

1. User re-authenticates via OAuth flow:

   ```bash
   docker login atcr.io
   # Or for web UI: visit https://atcr.io/login
   ```

2. If using app passwords (legacy), verify token is cached:

   ```bash
   # Check if app-password token exists
   docker logout atcr.io
   docker login atcr.io -u your.handle -p your-app-password
   ```

---


## AppView Deployment Issues

### Client Metadata URL Not Accessible

**Symptom:**

```
error: unauthorized_client
error_description: Client metadata endpoint returned 404
```

**Root Cause:**

The PDS cannot fetch the OAuth client metadata from `{ATCR_BASE_URL}/client-metadata.json`.

**Diagnosis:**

1. Verify client metadata endpoint is accessible:

   ```bash
   curl https://your-atcr-instance.com/client-metadata.json
   ```

2. Check AppView logs for startup errors:

   ```bash
   docker logs atcr-appview 2>&1 | grep "client-metadata"
   ```

3. Verify `ATCR_BASE_URL` is set correctly:

   ```bash
   echo $ATCR_BASE_URL
   ```


**Solution:**

1. Ensure `ATCR_BASE_URL` matches your public URL:

   ```bash
   export ATCR_BASE_URL=https://atcr.example.com
   ```

2. Verify reverse proxy (nginx, Caddy, etc.) routes `/.well-known/*` and `/client-metadata.json`:

   ```nginx
   location / {
       proxy_pass http://localhost:5000;
       proxy_set_header Host $host;
       proxy_set_header X-Forwarded-Proto $scheme;
   }
   ```

3. Check firewall rules allow inbound HTTPS:

   ```bash
   sudo ufw status
   sudo iptables -L -n | grep 443
   ```

---


## Hold Service Issues

### Blob Storage Connectivity

**Symptom:**

```
error: failed to upload blob: connection refused
```

**Diagnosis:**

1. Check hold service logs:

   ```bash
   docker logs atcr-hold 2>&1 | grep -i error
   ```

2. Verify S3 credentials are correct:

   ```bash
   # Test S3 access
   aws s3 ls s3://your-bucket --endpoint-url=$S3_ENDPOINT
   ```

3. Check hold configuration:

   ```bash
   env | grep -E "(S3_|AWS_|STORAGE_)"
   ```

**Solution:**

1. Verify environment variables in hold service:

   ```bash
   export AWS_ACCESS_KEY_ID=your-key
   export AWS_SECRET_ACCESS_KEY=your-secret
   export S3_BUCKET=your-bucket
   export S3_ENDPOINT=https://s3.us-west-2.amazonaws.com
   ```

2. Test S3 connectivity from hold container:

   ```bash
   docker exec atcr-hold curl -v $S3_ENDPOINT
   ```

3. Check S3 bucket permissions (requires PutObject, GetObject, DeleteObject)


---

## Performance Issues

### High Database Lock Contention

**Symptom:**

Slow Docker push/pull operations, high CPU usage on AppView.

**Diagnosis:**

1. Check SQLite database size:

   ```bash
   ls -lh /var/lib/atcr/ui.db
   ```

2. Look for lock errors in the logs:

   ```bash
   docker logs atcr-appview 2>&1 | grep "database is locked"
   ```

**Solution:**

1. For production, migrate to PostgreSQL (recommended):

   ```bash
   export ATCR_UI_DATABASE_TYPE=postgres
   export ATCR_UI_DATABASE_URL=postgresql://user:pass@localhost/atcr
   ```

2. Or limit SQLite to a single connection so writers queue instead of failing:

   ```go
   // In code: serialize SQLite access
   db.SetMaxOpenConns(1)
   ```

3. Vacuum the database to reclaim space:

   ```bash
   sqlite3 /var/lib/atcr/ui.db "VACUUM;"
   ```


---

## Logging and Debugging

### Enable Debug Logging

Set log level to debug for detailed troubleshooting:

```bash
export ATCR_LOG_LEVEL=debug
docker restart atcr-appview
```

### Useful Log Queries

**OAuth token exchange errors:**

```bash
docker logs atcr-appview 2>&1 | grep "OAuth callback failed"
```

**Service token request failures:**

```bash
docker logs atcr-appview 2>&1 | grep "OAuth authentication failed during service token request"
```

**Clock diagnostics:**

```bash
docker logs atcr-appview 2>&1 | grep "system_time"
```

**DPoP nonce issues:**

```bash
docker logs atcr-appview 2>&1 | grep -E "(use_dpop_nonce|DPoP)"
```

### Health Checks

**AppView health:**

```bash
curl http://localhost:5000/v2/
# Should return: {"errors":[{"code":"UNAUTHORIZED",...}]}
```

**Hold service health:**

```bash
curl http://localhost:8080/.well-known/did.json
# Should return DID document
```


---

## Getting Help

If issues persist after following this guide:

1. **Check GitHub Issues**: https://github.com/ericvolp12/atcr/issues
2. **Collect logs**: Include output from `docker logs` for AppView and Hold services
3. **Include diagnostics**:
   - `timedatectl status` output
   - AppView version: `docker exec atcr-appview cat /VERSION` (if available)
   - PDS version and implementation (Bluesky PDS, other)
4. **File an issue** with reproducible steps

---

## Common Error Reference

| Error Code | Component | Common Cause | Fix |
|------------|-----------|--------------|-----|
| `invalid_client` (iat timestamp) | OAuth | Clock skew | Enable NTP sync |
| `use_dpop_nonce` | OAuth/DPoP | Concurrent requests or clock skew | Fix NTP, wait for auto-retry |
| `server_error` (500) | PDS | PDS internal error | Check PDS logs |
| `invalid_grant` | OAuth | Expired auth code | Retry OAuth flow |
| `unauthorized_client` | OAuth | Client metadata unreachable | Check ATCR_BASE_URL and firewall |
| `RecordNotFound` | ATProto | Manifest doesn't exist | Verify repository name |
| Connection refused | Hold/S3 | Network/credentials | Check S3 config and connectivity |