# ATCR Quota System

This document describes ATCR's storage quota implementation, inspired by Harbor's proven approach to per-project blob tracking with deduplication.

## Table of Contents

- [Overview](#overview)
- [Harbor's Approach (Reference Implementation)](#harbors-approach-reference-implementation)
- [Storage Options](#storage-options)
- [Quota Data Model](#quota-data-model)
- [Push Flow (Detailed)](#push-flow-detailed)
- [Delete Flow](#delete-flow)
- [Garbage Collection](#garbage-collection)
- [Quota Reconciliation](#quota-reconciliation)
- [Configuration](#configuration)
- [Trade-offs & Design Decisions](#trade-offs--design-decisions)
- [Future Enhancements](#future-enhancements)
## Overview

ATCR implements per-user storage quotas to:

1. **Limit storage consumption** on shared hold services
2. **Track actual S3 costs** (what new data was added)
3. **Benefit from deduplication** (users only pay once per layer)
4. **Provide transparency** (show users their storage usage)

**Key principle:** Users pay for layers they've uploaded, but only ONCE per layer regardless of how many images reference it.

### Example Scenario

```
Alice pushes myapp:v1 (layers A, B, C - each 100MB)
  → Alice's quota: +300MB (all new layers)

Alice pushes myapp:v2 (layers A, B, D)
  → Layers A, B already claimed by Alice
  → Layer D is new (100MB)
  → Alice's quota: +100MB (only D is new)
  → Total: 400MB

Bob pushes his-app:latest (layers A, E)
  → Layer A already exists in S3 (uploaded by Alice)
  → Bob claims it for first time → +100MB to Bob's quota
  → Layer E is new → +100MB to Bob's quota
  → Bob's quota: 200MB

Physical S3 storage:    500MB (A, B, C, D, E)
Claimed storage:        600MB (Alice: 400MB, Bob: 200MB)
Deduplication savings:  100MB (layer A shared)
```
## Harbor's Approach (Reference Implementation)

Harbor is built on distribution/distribution (same as ATCR) and implements quotas as middleware. Their approach:

### Key Insights from Harbor

1. **"Shared blobs are only computed once per project"**
   - Each project tracks which blobs it has uploaded
   - Same blob used in multiple images counts only once per project
   - Different projects claiming the same blob each pay for it

2. **Quota checked when manifest is pushed**
   - Blobs upload first (presigned URLs, can't intercept)
   - Manifest pushed last → quota check happens here
   - Can reject manifest if quota exceeded (orphaned blobs cleaned by GC)

3. **Middleware-based implementation**
   - distribution/distribution has NO built-in quota support
   - Harbor added it as request preprocessing middleware
   - Uses database (PostgreSQL) or Redis for quota storage

4. **Per-project ownership model**
   - Blobs are physically deduplicated globally
   - Quota accounting is logical (per-project claims)
   - Total claimed storage can exceed physical storage

### References

- Harbor Quota Documentation: https://goharbor.io/docs/1.10/administration/configure-project-quotas/
- Harbor Source: https://github.com/goharbor/harbor (see `src/controller/quota`)
## Storage Options

The hold service needs to store quota data somewhere. Two options:

### Option 1: S3-Based Storage (Recommended for BYOS)

Store quota metadata alongside blobs in the same S3 bucket:

```
Bucket structure:
/docker/registry/v2/blobs/sha256/ab/abc123.../data   ← actual blobs
/atcr/quota/did:plc:alice.json                       ← quota tracking
/atcr/quota/did:plc:bob.json
```

**Pros:**
- ✅ No separate database needed
- ✅ Single S3 bucket (better UX - no second bucket to configure)
- ✅ Quota data lives with the blobs
- ✅ Hold service stays relatively stateless
- ✅ Works with any S3-compatible service (Storj, MinIO, UpCloud, Fly.io)

**Cons:**
- ❌ Slower than local database (network round-trip)
- ❌ Eventual consistency issues
- ❌ Race conditions on concurrent updates
- ❌ Extra S3 API costs (GET/PUT per upload)

**Performance:**
- Each blob upload: 1 HEAD (blob exists?) + 1 GET (quota) + 1 PUT (update quota)
- Typical latency: 100-200ms total overhead
- For high-throughput registries, consider SQLite
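For the S3-backed option, here is a minimal sketch of what the quota read/write path could look like, assuming a storage driver that exposes `GetContent`/`PutContent` (as the distribution storage driver does) and the quota file format described under [Quota Data Model](#quota-data-model). The `S3Manager` name and prefix handling are illustrative, not the actual implementation.

```go
// Minimal sketch of an S3-backed quota store.
package quota

import (
	"context"
	"encoding/json"
	"fmt"
	"time"
)

// blobStore is the subset of the storage driver this sketch needs.
type blobStore interface {
	GetContent(ctx context.Context, path string) ([]byte, error)
	PutContent(ctx context.Context, path string, content []byte) error
}

// Quota mirrors the quota file format shown in the Quota Data Model section.
type Quota struct {
	DID           string           `json:"did"`
	Limit         int64            `json:"limit"`
	Used          int64            `json:"used"`
	ClaimedLayers map[string]int64 `json:"claimed_layers"`
	LastUpdated   time.Time        `json:"last_updated"`
	Version       int              `json:"version"`
}

type S3Manager struct {
	driver blobStore
	prefix string // e.g. "/atcr/quota/"
}

func (m *S3Manager) path(did string) string {
	return fmt.Sprintf("%s%s.json", m.prefix, did)
}

// GetQuota reads and parses the per-user quota file.
func (m *S3Manager) GetQuota(ctx context.Context, did string) (*Quota, error) {
	data, err := m.driver.GetContent(ctx, m.path(did))
	if err != nil {
		return nil, err
	}
	var q Quota
	if err := json.Unmarshal(data, &q); err != nil {
		return nil, err
	}
	return &q, nil
}

// SaveQuota writes the quota file back (last writer wins; see the race
// condition notes in the Push Flow section).
func (m *S3Manager) SaveQuota(ctx context.Context, q *Quota) error {
	data, err := json.Marshal(q)
	if err != nil {
		return err
	}
	return m.driver.PutContent(ctx, m.path(q.DID), data)
}
```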
### Option 2: SQLite Database (Recommended for Shared Holds)

Local database in the hold service:

```bash
/var/lib/atcr/hold-quota.db
```

**Pros:**
- ✅ Fast local queries (no network latency)
- ✅ ACID transactions (no race conditions)
- ✅ Efficient for high-throughput registries
- ✅ Can use foreign keys and joins

**Cons:**
- ❌ Makes hold service stateful (persistent volume needed)
- ❌ Not ideal for ephemeral BYOS deployments
- ❌ Backup/restore complexity
- ❌ Multi-instance scaling requires shared database

**Schema:**

```sql
CREATE TABLE user_quotas (
    did TEXT PRIMARY KEY,
    quota_limit INTEGER NOT NULL DEFAULT 10737418240, -- 10GB
    quota_used INTEGER NOT NULL DEFAULT 0,
    updated_at TIMESTAMP
);

CREATE TABLE claimed_layers (
    did TEXT NOT NULL,
    digest TEXT NOT NULL,
    size INTEGER NOT NULL,
    claimed_at TIMESTAMP,
    PRIMARY KEY(did, digest)
);
```

### Recommendation

- **BYOS (user-owned holds):** S3-based (keeps hold service ephemeral)
- **Shared holds (multi-user):** SQLite (better performance and consistency)
- **High-traffic production:** SQLite or PostgreSQL (Harbor uses this)
## Quota Data Model

### Quota File Format (S3-based)

```json
{
  "did": "did:plc:alice123",
  "limit": 10737418240,
  "used": 5368709120,
  "claimed_layers": {
    "sha256:abc123...": 104857600,
    "sha256:def456...": 52428800,
    "sha256:789ghi...": 209715200
  },
  "last_updated": "2025-10-09T12:34:56Z",
  "version": 1
}
```

**Fields:**
- `did`: User's ATProto DID
- `limit`: Maximum storage in bytes (default: 10GB)
- `used`: Current storage usage in bytes (sum of `claimed_layers`)
- `claimed_layers`: Map of digest → size for all layers the user has uploaded
- `last_updated`: Timestamp of last quota update
- `version`: Schema version for future migrations
### Why Track Individual Layers?

**Q: Can't we just track a counter?**

**A: We need layer tracking for:**

1. **Deduplication detection**
   - Check if user already claimed a layer → free upload
   - Example: Updating an image reuses most layers

2. **Accurate deletes**
   - When a manifest is deleted, only decrement unclaimed layers
   - User may have 5 images sharing layer A - deleting 1 image doesn't free layer A

3. **Quota reconciliation**
   - Verify quota matches reality by listing user's manifests
   - Recalculate from layers in manifests vs claimed_layers map

4. **Auditing**
   - "Show me what I'm storing"
   - Users can see which layers consume their quota
## Push Flow (Detailed)
|
|
|
|
### Step-by-Step: User Pushes Image
|
|
|
|
```
|
|
┌──────────┐ ┌──────────┐ ┌──────────┐
|
|
│ Client │ │ Hold │ │ S3 │
|
|
│ (Docker) │ │ Service │ │ Bucket │
|
|
└──────────┘ └──────────┘ └──────────┘
|
|
│ │ │
|
|
│ 1. PUT /v2/.../blobs/ │ │
|
|
│ upload?digest=sha256:abc│ │
|
|
├───────────────────────────>│ │
|
|
│ │ │
|
|
│ │ 2. Check if blob exists │
|
|
│ │ (Stat/HEAD request) │
|
|
│ ├───────────────────────────>│
|
|
│ │<───────────────────────────┤
|
|
│ │ 200 OK (exists) or │
|
|
│ │ 404 Not Found │
|
|
│ │ │
|
|
│ │ 3. Read user quota │
|
|
│ │ GET /atcr/quota/{did} │
|
|
│ ├───────────────────────────>│
|
|
│ │<───────────────────────────┤
|
|
│ │ quota.json │
|
|
│ │ │
|
|
│ │ 4. Calculate quota impact │
|
|
│ │ - If digest in │
|
|
│ │ claimed_layers: 0 │
|
|
│ │ - Else: size │
|
|
│ │ │
|
|
│ │ 5. Check quota limit │
|
|
│ │ used + impact <= limit? │
|
|
│ │ │
|
|
│ │ 6. Update quota │
|
|
│ │ PUT /atcr/quota/{did} │
|
|
│ ├───────────────────────────>│
|
|
│ │<───────────────────────────┤
|
|
│ │ 200 OK │
|
|
│ │ │
|
|
│ 7. Presigned URL │ │
|
|
│<───────────────────────────┤ │
|
|
│ {url: "https://s3..."} │ │
|
|
│ │ │
|
|
│ 8. Upload blob to S3 │ │
|
|
├────────────────────────────┼───────────────────────────>│
|
|
│ │ │
|
|
│ 9. 200 OK │ │
|
|
│<───────────────────────────┼────────────────────────────┤
|
|
│ │ │
|
|
```
|
|
|
|
### Implementation (Pseudocode)
|
|
|
|
```go
|
|
// cmd/hold/main.go - HandlePutPresignedURL
|
|
|
|
func (s *HoldService) HandlePutPresignedURL(w http.ResponseWriter, r *http.Request) {
|
|
var req PutPresignedURLRequest
|
|
json.NewDecoder(r.Body).Decode(&req)
|
|
|
|
// Step 1: Check if blob already exists in S3
|
|
blobPath := fmt.Sprintf("/docker/registry/v2/blobs/%s/%s/%s/data",
|
|
algorithm, digest[:2], digest)
|
|
|
|
_, err := s.driver.Stat(ctx, blobPath)
|
|
blobExists := (err == nil)
|
|
|
|
// Step 2: Read quota from S3 (or SQLite)
|
|
quota, err := s.quotaManager.GetQuota(req.DID)
|
|
if err != nil {
|
|
// First upload - create quota with defaults
|
|
quota = &Quota{
|
|
DID: req.DID,
|
|
Limit: s.config.QuotaDefaultLimit,
|
|
Used: 0,
|
|
ClaimedLayers: make(map[string]int64),
|
|
}
|
|
}
|
|
|
|
// Step 3: Calculate quota impact
|
|
quotaImpact := req.Size // Default: assume new layer
|
|
|
|
if _, alreadyClaimed := quota.ClaimedLayers[req.Digest]; alreadyClaimed {
|
|
// User already uploaded this layer before
|
|
quotaImpact = 0
|
|
log.Printf("Layer %s already claimed by %s, no quota impact",
|
|
req.Digest, req.DID)
|
|
} else if blobExists {
|
|
// Blob exists in S3 (uploaded by another user)
|
|
// But this user is claiming it for first time
|
|
// Still counts against their quota
|
|
log.Printf("Layer %s exists globally but new to %s, quota impact: %d",
|
|
req.Digest, req.DID, quotaImpact)
|
|
} else {
|
|
// Brand new blob - will be uploaded to S3
|
|
log.Printf("New layer %s for %s, quota impact: %d",
|
|
req.Digest, req.DID, quotaImpact)
|
|
}
|
|
|
|
// Step 4: Check quota limit
|
|
if quota.Used + quotaImpact > quota.Limit {
|
|
http.Error(w, fmt.Sprintf(
|
|
"quota exceeded: used=%d, impact=%d, limit=%d",
|
|
quota.Used, quotaImpact, quota.Limit,
|
|
), http.StatusPaymentRequired) // 402
|
|
return
|
|
}
|
|
|
|
// Step 5: Update quota (optimistic - before upload completes)
|
|
quota.Used += quotaImpact
|
|
if quotaImpact > 0 {
|
|
quota.ClaimedLayers[req.Digest] = req.Size
|
|
}
|
|
quota.LastUpdated = time.Now()
|
|
|
|
if err := s.quotaManager.SaveQuota(quota); err != nil {
|
|
http.Error(w, "failed to update quota", http.StatusInternalServerError)
|
|
return
|
|
}
|
|
|
|
// Step 6: Generate presigned URL
|
|
presignedURL, err := s.getUploadURL(ctx, req.Digest, req.Size, req.DID)
|
|
if err != nil {
|
|
// Rollback quota update on error
|
|
quota.Used -= quotaImpact
|
|
delete(quota.ClaimedLayers, req.Digest)
|
|
s.quotaManager.SaveQuota(quota)
|
|
|
|
http.Error(w, "failed to generate presigned URL", http.StatusInternalServerError)
|
|
return
|
|
}
|
|
|
|
// Step 7: Return presigned URL + quota info
|
|
resp := PutPresignedURLResponse{
|
|
URL: presignedURL,
|
|
ExpiresAt: time.Now().Add(15 * time.Minute),
|
|
QuotaInfo: QuotaInfo{
|
|
Used: quota.Used,
|
|
Limit: quota.Limit,
|
|
Available: quota.Limit - quota.Used,
|
|
Impact: quotaImpact,
|
|
AlreadyClaimed: quotaImpact == 0,
|
|
},
|
|
}
|
|
|
|
w.Header().Set("Content-Type", "application/json")
|
|
json.NewEncoder(w).Encode(resp)
|
|
}
|
|
```
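The handler above uses `PutPresignedURLRequest`, `PutPresignedURLResponse`, and `QuotaInfo` without showing them. Here is a minimal sketch of what those types might look like, inferred from how the pseudocode uses their fields; the JSON tags are assumptions.

```go
// Sketch of the request/response types used by HandlePutPresignedURL above.
package main

import "time"

type PutPresignedURLRequest struct {
	DID    string `json:"did"`    // uploader's ATProto DID
	Digest string `json:"digest"` // e.g. "sha256:abc123..."
	Size   int64  `json:"size"`   // blob size in bytes
}

type QuotaInfo struct {
	Used           int64 `json:"used"`
	Limit          int64 `json:"limit"`
	Available      int64 `json:"available"`
	Impact         int64 `json:"impact"`          // bytes charged for this upload
	AlreadyClaimed bool  `json:"already_claimed"` // true if the layer was free
}

type PutPresignedURLResponse struct {
	URL       string    `json:"url"`
	ExpiresAt time.Time `json:"expires_at"`
	QuotaInfo QuotaInfo `json:"quota_info"`
}
```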
|
|
|
|
### Race Condition Handling

**Problem:** Two concurrent uploads of the same blob

```
Time     User A                     User B
0ms      Upload layer X (100MB)
10ms                                Upload layer X (100MB)
20ms     Check exists: NO           Check exists: NO
30ms     Quota impact: 100MB        Quota impact: 100MB
40ms     Update quota A: +100MB     Update quota B: +100MB
50ms     Generate presigned URL     Generate presigned URL
100ms    Upload to S3 completes     Upload to S3 (overwrites A's)
```

**Result:** Both users charged 100MB, but only 100MB stored in S3. That outcome is expected under per-user claims; the sharper hazard is two concurrent uploads from the *same* user, where both requests read the quota file, see the layer as unclaimed, and write back, so one update can be lost or the layer double-counted.

**Mitigation strategies:**

1. **Accept eventual consistency** (recommended for S3-based)
   - Run periodic reconciliation to fix discrepancies
   - Small inconsistency window (minutes) is acceptable
   - Reconciliation uses PDS as source of truth

2. **Optimistic locking** (S3 ETags, on backends that support conditional writes)
   ```go
   // Use S3 ETags for conditional writes
   oldETag := getQuotaFileETag(did)
   err := putQuotaFileWithCondition(quota, oldETag)
   if err == PreconditionFailed {
       // Retry with fresh read
   }
   ```

3. **Database transactions** (SQLite-based; see the sketch after this list)
   ```sql
   BEGIN IMMEDIATE;  -- take the write lock up front (SQLite has no SELECT ... FOR UPDATE)
   SELECT quota_used, quota_limit FROM user_quotas WHERE did = ?;
   UPDATE user_quotas SET quota_used = quota_used + ? WHERE did = ?;
   COMMIT;
   ```
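A minimal sketch of the transactional claim for the SQLite backend, assuming the `user_quotas`/`claimed_layers` schema from [Storage Options](#storage-options) and a standard `database/sql` SQLite driver. It also assumes a `user_quotas` row already exists for the DID (created on first upload); names are illustrative.

```go
// Sketch of an atomic per-user layer claim against the SQLite schema above.
package quota

import (
	"context"
	"database/sql"
	"errors"
)

var ErrQuotaExceeded = errors.New("quota exceeded")

// ClaimLayer charges a layer to a user exactly once, atomically.
func ClaimLayer(ctx context.Context, db *sql.DB, did, digest string, size int64) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op after a successful Commit

	// Already claimed? Then this upload is free for the user.
	var count int
	err = tx.QueryRowContext(ctx,
		`SELECT COUNT(*) FROM claimed_layers WHERE did = ? AND digest = ?`,
		did, digest).Scan(&count)
	if err != nil {
		return err
	}
	if count > 0 {
		return tx.Commit()
	}

	// Enforce the limit and increment usage in a single statement.
	res, err := tx.ExecContext(ctx,
		`UPDATE user_quotas
		    SET quota_used = quota_used + ?, updated_at = CURRENT_TIMESTAMP
		  WHERE did = ? AND quota_used + ? <= quota_limit`,
		size, did, size)
	if err != nil {
		return err
	}
	if n, _ := res.RowsAffected(); n == 0 {
		return ErrQuotaExceeded
	}

	// Record the claim so future pushes of this layer are free.
	_, err = tx.ExecContext(ctx,
		`INSERT INTO claimed_layers (did, digest, size, claimed_at)
		 VALUES (?, ?, ?, CURRENT_TIMESTAMP)`,
		did, digest, size)
	if err != nil {
		return err
	}
	return tx.Commit()
}
```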
## Delete Flow

### Manifest Deletion via AppView UI

When a user deletes a manifest through the AppView web interface:

```
|
|
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
|
|
│ User │ │ AppView │ │ Hold │ │ PDS │
|
|
│ UI │ │ Database │ │ Service │ │ │
|
|
└──────────┘ └──────────┘ └──────────┘ └──────────┘
|
|
│ │ │ │
|
|
│ DELETE manifest │ │ │
|
|
├─────────────────────>│ │ │
|
|
│ │ │ │
|
|
│ │ 1. Get manifest │ │
|
|
│ │ and layers │ │
|
|
│ │ │ │
|
|
│ │ 2. Check which │ │
|
|
│ │ layers still │ │
|
|
│ │ referenced by │ │
|
|
│ │ user's other │ │
|
|
│ │ manifests │ │
|
|
│ │ │ │
|
|
│ │ 3. DELETE manifest │ │
|
|
│ │ from PDS │ │
|
|
│ ├──────────────────────┼─────────────────────>│
|
|
│ │ │ │
|
|
│ │ 4. POST /quota/decrement │
|
|
│ ├─────────────────────>│ │
|
|
│ │ {layers: [...]} │ │
|
|
│ │ │ │
|
|
│ │ │ 5. Update quota │
|
|
│ │ │ Remove unclaimed │
|
|
│ │ │ layers │
|
|
│ │ │ │
|
|
│ │ 6. 200 OK │ │
|
|
│ │<─────────────────────┤ │
|
|
│ │ │ │
|
|
│ │ 7. Delete from DB │ │
|
|
│ │ │ │
|
|
│ 8. Success │ │ │
|
|
│<─────────────────────┤ │ │
|
|
│ │ │ │
|
|
```
|
|
|
|
### AppView Implementation
|
|
|
|
```go
|
|
// pkg/appview/handlers/manifest.go
|
|
|
|
func (h *ManifestHandler) DeleteManifest(w http.ResponseWriter, r *http.Request) {
|
|
did := r.Context().Value("auth.did").(string)
|
|
repository := chi.URLParam(r, "repository")
|
|
digest := chi.URLParam(r, "digest")
|
|
|
|
// Step 1: Get manifest and its layers from database
|
|
manifest, err := db.GetManifest(h.db, digest)
|
|
if err != nil {
|
|
http.Error(w, "manifest not found", 404)
|
|
return
|
|
}
|
|
|
|
layers, err := db.GetLayersForManifest(h.db, manifest.ID)
|
|
if err != nil {
|
|
http.Error(w, "failed to get layers", 500)
|
|
return
|
|
}
|
|
|
|
// Step 2: For each layer, check if user still references it
|
|
// in other manifests
|
|
layersToDecrement := []LayerInfo{}
|
|
|
|
for _, layer := range layers {
|
|
// Query: does this user have other manifests using this layer?
|
|
stillReferenced, err := db.CheckLayerReferencedByUser(
|
|
h.db, did, repository, layer.Digest, manifest.ID,
|
|
)
|
|
|
|
if err != nil {
|
|
http.Error(w, "failed to check layer references", 500)
|
|
return
|
|
}
|
|
|
|
if !stillReferenced {
|
|
// This layer is no longer used by user
|
|
layersToDecrement = append(layersToDecrement, LayerInfo{
|
|
Digest: layer.Digest,
|
|
Size: layer.Size,
|
|
})
|
|
}
|
|
}
|
|
|
|
// Step 3: Delete manifest from user's PDS
|
|
atprotoClient := atproto.NewClient(manifest.PDSEndpoint, did, accessToken)
|
|
err = atprotoClient.DeleteRecord(ctx, atproto.ManifestCollection, manifestRKey)
|
|
if err != nil {
|
|
http.Error(w, "failed to delete from PDS", 500)
|
|
return
|
|
}
|
|
|
|
// Step 4: Notify hold service to decrement quota
|
|
if len(layersToDecrement) > 0 {
|
|
holdClient := &http.Client{}
|
|
|
|
decrementReq := QuotaDecrementRequest{
|
|
DID: did,
|
|
Layers: layersToDecrement,
|
|
}
|
|
|
|
body, _ := json.Marshal(decrementReq)
|
|
resp, err := holdClient.Post(
|
|
manifest.HoldEndpoint + "/quota/decrement",
|
|
"application/json",
|
|
bytes.NewReader(body),
|
|
)
|
|
|
|
if err != nil || resp.StatusCode != 200 {
|
|
log.Printf("Warning: failed to update quota on hold service: %v", err)
|
|
// Continue anyway - GC reconciliation will fix it
|
|
}
|
|
}
|
|
|
|
// Step 5: Delete from AppView database
|
|
err = db.DeleteManifest(h.db, did, repository, digest)
|
|
if err != nil {
|
|
http.Error(w, "failed to delete from database", 500)
|
|
return
|
|
}
|
|
|
|
w.WriteHeader(http.StatusNoContent)
|
|
}
|
|
```
|
|
|
|
### Hold Service Decrement Endpoint
|
|
|
|
```go
|
|
// cmd/hold/main.go
|
|
|
|
type QuotaDecrementRequest struct {
|
|
DID string `json:"did"`
|
|
Layers []LayerInfo `json:"layers"`
|
|
}
|
|
|
|
type LayerInfo struct {
|
|
Digest string `json:"digest"`
|
|
Size int64 `json:"size"`
|
|
}
|
|
|
|
func (s *HoldService) HandleQuotaDecrement(w http.ResponseWriter, r *http.Request) {
|
|
var req QuotaDecrementRequest
|
|
if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
|
|
http.Error(w, "invalid request", 400)
|
|
return
|
|
}
|
|
|
|
// Read current quota
|
|
quota, err := s.quotaManager.GetQuota(req.DID)
|
|
if err != nil {
|
|
http.Error(w, "quota not found", 404)
|
|
return
|
|
}
|
|
|
|
// Decrement quota for each layer
|
|
for _, layer := range req.Layers {
|
|
if size, claimed := quota.ClaimedLayers[layer.Digest]; claimed {
|
|
// Remove from claimed layers
|
|
delete(quota.ClaimedLayers, layer.Digest)
|
|
quota.Used -= size
|
|
|
|
log.Printf("Decremented quota for %s: layer %s (%d bytes)",
|
|
req.DID, layer.Digest, size)
|
|
} else {
|
|
log.Printf("Warning: layer %s not in claimed_layers for %s",
|
|
layer.Digest, req.DID)
|
|
}
|
|
}
|
|
|
|
// Ensure quota.Used doesn't go negative (defensive)
|
|
if quota.Used < 0 {
|
|
log.Printf("Warning: quota.Used went negative for %s, resetting to 0", req.DID)
|
|
quota.Used = 0
|
|
}
|
|
|
|
// Save updated quota
|
|
quota.LastUpdated = time.Now()
|
|
if err := s.quotaManager.SaveQuota(quota); err != nil {
|
|
http.Error(w, "failed to save quota", 500)
|
|
return
|
|
}
|
|
|
|
// Return updated quota info
|
|
json.NewEncoder(w).Encode(map[string]any{
|
|
"used": quota.Used,
|
|
"limit": quota.Limit,
|
|
})
|
|
}
|
|
```
|
|
|
|
### SQL Query: Check Layer References

```sql
-- pkg/appview/db/queries.go

-- Check if user still references this layer in other manifests
SELECT COUNT(*)
FROM layers l
JOIN manifests m ON l.manifest_id = m.id
WHERE m.did = ?      -- User's DID
  AND l.digest = ?   -- Layer digest
  AND m.id != ?      -- Exclude the manifest being deleted
```
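The AppView handler earlier calls `db.CheckLayerReferencedByUser`; here is a minimal sketch of that helper wrapping the query above. The signature and package layout are assumptions, and the `repository` argument is accepted only to mirror the handler's call, since the per-user check does not need it.

```go
// Sketch of a helper around the reference-count query above.
package db

import "database/sql"

// CheckLayerReferencedByUser reports whether the user still references the
// layer from any manifest other than the one being deleted.
func CheckLayerReferencedByUser(dbc *sql.DB, did, repository, digest string, excludeManifestID int64) (bool, error) {
	const q = `
		SELECT COUNT(*)
		  FROM layers l
		  JOIN manifests m ON l.manifest_id = m.id
		 WHERE m.did = ?
		   AND l.digest = ?
		   AND m.id != ?`

	var count int
	if err := dbc.QueryRow(q, did, digest, excludeManifestID).Scan(&count); err != nil {
		return false, err
	}
	return count > 0, nil
}
```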
## Garbage Collection

### Background: Orphaned Blobs

Orphaned blobs accumulate when:
1. Manifest push fails after blobs uploaded (presigned URLs bypass hold)
2. Quota exceeded - manifest rejected, blobs already in S3
3. User deletes manifest - blobs no longer referenced

**GC periodically cleans these up.**

### GC Cron Implementation

Similar to AppView's backfill worker, the hold service can run periodic GC:

```go
|
|
// cmd/hold/gc/gc.go
|
|
|
|
type GarbageCollector struct {
|
|
driver storagedriver.StorageDriver
|
|
appviewURL string
|
|
holdURL string
|
|
quotaManager *quota.Manager
|
|
}
|
|
|
|
// Run garbage collection
|
|
func (gc *GarbageCollector) Run(ctx context.Context) error {
|
|
log.Println("Starting garbage collection...")
|
|
|
|
// Step 1: Get list of referenced blobs from AppView
|
|
referenced, err := gc.getReferencedBlobs()
|
|
if err != nil {
|
|
return fmt.Errorf("failed to get referenced blobs: %w", err)
|
|
}
|
|
|
|
referencedSet := make(map[string]bool)
|
|
for _, digest := range referenced {
|
|
referencedSet[digest] = true
|
|
}
|
|
|
|
log.Printf("AppView reports %d referenced blobs", len(referenced))
|
|
|
|
// Step 2: Walk S3 blobs
|
|
deletedCount := 0
|
|
reclaimedBytes := int64(0)
|
|
|
|
err = gc.driver.Walk(ctx, "/docker/registry/v2/blobs", func(fileInfo storagedriver.FileInfo) error {
|
|
if fileInfo.IsDir() {
|
|
return nil // Skip directories
|
|
}
|
|
|
|
// Extract digest from path
|
|
// Path: /docker/registry/v2/blobs/sha256/ab/abc123.../data
|
|
digest := extractDigestFromPath(fileInfo.Path())
|
|
|
|
if !referencedSet[digest] {
|
|
// Unreferenced blob - delete it
|
|
size := fileInfo.Size()
|
|
|
|
if err := gc.driver.Delete(ctx, fileInfo.Path()); err != nil {
|
|
log.Printf("Failed to delete blob %s: %v", digest, err)
|
|
return nil // Continue anyway
|
|
}
|
|
|
|
deletedCount++
|
|
reclaimedBytes += size
|
|
|
|
log.Printf("GC: Deleted unreferenced blob %s (%d bytes)", digest, size)
|
|
}
|
|
|
|
return nil
|
|
})
|
|
|
|
if err != nil {
|
|
return fmt.Errorf("failed to walk blobs: %w", err)
|
|
}
|
|
|
|
log.Printf("GC complete: deleted %d blobs, reclaimed %d bytes",
|
|
deletedCount, reclaimedBytes)
|
|
|
|
return nil
|
|
}
|
|
|
|
// Get referenced blobs from AppView
|
|
func (gc *GarbageCollector) getReferencedBlobs() ([]string, error) {
|
|
// Query AppView for all blobs referenced by manifests
|
|
// stored in THIS hold service
|
|
url := fmt.Sprintf("%s/internal/blobs/referenced?hold=%s",
|
|
gc.appviewURL, url.QueryEscape(gc.holdURL))
|
|
|
|
resp, err := http.Get(url)
|
|
if err != nil {
|
|
return nil, err
|
|
}
|
|
defer resp.Body.Close()
|
|
|
|
var result struct {
|
|
Blobs []string `json:"blobs"`
|
|
}
|
|
|
|
if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
|
|
return nil, err
|
|
}
|
|
|
|
return result.Blobs, nil
|
|
}
|
|
```
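The walk callback above relies on `extractDigestFromPath`, which is not shown. A minimal sketch for the standard registry blob layout `/docker/registry/v2/blobs/<algorithm>/<prefix>/<hex>/data` follows; it is illustrative, not the actual implementation.

```go
// Sketch of digest extraction from a registry blob path.
package gc

import "strings"

// extractDigestFromPath turns
// "/docker/registry/v2/blobs/sha256/ab/abc123.../data" into "sha256:abc123...",
// or returns "" if the path does not match the expected layout.
func extractDigestFromPath(path string) string {
	parts := strings.Split(strings.Trim(path, "/"), "/")
	// Expect: docker registry v2 blobs <algorithm> <prefix> <hex> data
	if len(parts) < 8 || parts[3] != "blobs" || parts[len(parts)-1] != "data" {
		return ""
	}
	algorithm := parts[4]
	hex := parts[len(parts)-2]
	return algorithm + ":" + hex
}
```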
|
|
|
|
### AppView Internal API
|
|
|
|
```go
|
|
// pkg/appview/handlers/internal.go
|
|
|
|
// Get all referenced blobs for a specific hold
|
|
func (h *InternalHandler) GetReferencedBlobs(w http.ResponseWriter, r *http.Request) {
|
|
holdEndpoint := r.URL.Query().Get("hold")
|
|
if holdEndpoint == "" {
|
|
http.Error(w, "missing hold parameter", 400)
|
|
return
|
|
}
|
|
|
|
// Query database for all layers in manifests stored in this hold
|
|
query := `
|
|
SELECT DISTINCT l.digest
|
|
FROM layers l
|
|
JOIN manifests m ON l.manifest_id = m.id
|
|
WHERE m.hold_endpoint = ?
|
|
`
|
|
|
|
rows, err := h.db.Query(query, holdEndpoint)
|
|
if err != nil {
|
|
http.Error(w, "database error", 500)
|
|
return
|
|
}
|
|
defer rows.Close()
|
|
|
|
blobs := []string{}
|
|
for rows.Next() {
|
|
var digest string
|
|
if err := rows.Scan(&digest); err != nil {
|
|
continue
|
|
}
|
|
blobs = append(blobs, digest)
|
|
}
|
|
|
|
json.NewEncoder(w).Encode(map[string]any{
|
|
"blobs": blobs,
|
|
"count": len(blobs),
|
|
"hold": holdEndpoint,
|
|
})
|
|
}
|
|
```
|
|
|
|
### GC Cron Schedule

```go
// cmd/hold/main.go

func main() {
	// ... service setup ...

	// Start GC cron if enabled
	if os.Getenv("GC_ENABLED") == "true" {
		gcInterval := 24 * time.Hour // default: daily
		// Honour GC_INTERVAL from the environment if it parses
		if v := os.Getenv("GC_INTERVAL"); v != "" {
			if d, err := time.ParseDuration(v); err == nil {
				gcInterval = d
			}
		}

		go func() {
			ticker := time.NewTicker(gcInterval)
			defer ticker.Stop()

			for range ticker.C {
				if err := garbageCollector.Run(context.Background()); err != nil {
					log.Printf("GC error: %v", err)
				}
			}
		}()

		log.Printf("GC cron started: runs every %v", gcInterval)
	}

	// Start server...
}
```
## Quota Reconciliation

### PDS as Source of Truth

**Key insight:** Manifest records in PDS are publicly readable (no OAuth needed for reads).

Each manifest contains:
- Repository name
- Digest
- Layers array with digest + size
- Hold endpoint

The hold service can query the PDS to calculate the user's true quota:

```
1. List all io.atcr.manifest records for user
2. Filter manifests where holdEndpoint == this hold service
3. Extract unique layers (deduplicate by digest)
4. Sum layer sizes = true quota usage
5. Compare to quota file
6. Fix discrepancies
```

### Implementation

```go
|
|
// cmd/hold/quota/reconcile.go
|
|
|
|
type Reconciler struct {
|
|
quotaManager *Manager
|
|
atprotoResolver *atproto.Resolver
|
|
holdURL string
|
|
}
|
|
|
|
// ReconcileUser recalculates quota from PDS manifests
|
|
func (r *Reconciler) ReconcileUser(ctx context.Context, did string) error {
|
|
log.Printf("Reconciling quota for %s", did)
|
|
|
|
// Step 1: Resolve user's PDS endpoint
|
|
identity, err := r.atprotoResolver.ResolveIdentity(ctx, did)
|
|
if err != nil {
|
|
return fmt.Errorf("failed to resolve DID: %w", err)
|
|
}
|
|
|
|
// Step 2: Create unauthenticated ATProto client
|
|
// (manifest records are public - no OAuth needed)
|
|
client := atproto.NewClient(identity.PDSEndpoint, did, "")
|
|
|
|
// Step 3: List all manifest records for this user
|
|
manifests, err := client.ListRecords(ctx, atproto.ManifestCollection, 1000)
|
|
if err != nil {
|
|
return fmt.Errorf("failed to list manifests: %w", err)
|
|
}
|
|
|
|
// Step 4: Filter manifests stored in THIS hold service
|
|
// and extract unique layers
|
|
uniqueLayers := make(map[string]int64) // digest -> size
|
|
|
|
for _, record := range manifests {
|
|
var manifest atproto.ManifestRecord
|
|
if err := json.Unmarshal(record.Value, &manifest); err != nil {
|
|
log.Printf("Warning: failed to parse manifest: %v", err)
|
|
continue
|
|
}
|
|
|
|
// Only count manifests stored in this hold
|
|
if manifest.HoldEndpoint != r.holdURL {
|
|
continue
|
|
}
|
|
|
|
// Add config blob
|
|
if manifest.Config.Digest != "" {
|
|
uniqueLayers[manifest.Config.Digest] = manifest.Config.Size
|
|
}
|
|
|
|
// Add layer blobs
|
|
for _, layer := range manifest.Layers {
|
|
uniqueLayers[layer.Digest] = layer.Size
|
|
}
|
|
}
|
|
|
|
// Step 5: Calculate true quota usage
|
|
trueUsage := int64(0)
|
|
for _, size := range uniqueLayers {
|
|
trueUsage += size
|
|
}
|
|
|
|
log.Printf("User %s true usage from PDS: %d bytes (%d unique layers)",
|
|
did, trueUsage, len(uniqueLayers))
|
|
|
|
// Step 6: Compare with current quota file
|
|
quota, err := r.quotaManager.GetQuota(did)
|
|
if err != nil {
|
|
log.Printf("No existing quota for %s, creating new", did)
|
|
quota = &Quota{
|
|
DID: did,
|
|
Limit: r.quotaManager.DefaultLimit,
|
|
ClaimedLayers: make(map[string]int64),
|
|
}
|
|
}
|
|
|
|
// Step 7: Fix discrepancies
|
|
if quota.Used != trueUsage || len(quota.ClaimedLayers) != len(uniqueLayers) {
|
|
log.Printf("Quota mismatch for %s: recorded=%d, actual=%d (diff=%d)",
|
|
did, quota.Used, trueUsage, trueUsage - quota.Used)
|
|
|
|
// Update quota to match PDS truth
|
|
quota.Used = trueUsage
|
|
quota.ClaimedLayers = uniqueLayers
|
|
quota.LastUpdated = time.Now()
|
|
|
|
if err := r.quotaManager.SaveQuota(quota); err != nil {
|
|
return fmt.Errorf("failed to save reconciled quota: %w", err)
|
|
}
|
|
|
|
log.Printf("Reconciled quota for %s: %d bytes", did, trueUsage)
|
|
} else {
|
|
log.Printf("Quota for %s is accurate", did)
|
|
}
|
|
|
|
return nil
|
|
}
|
|
|
|
// ReconcileAll reconciles all users (run periodically)
|
|
func (r *Reconciler) ReconcileAll(ctx context.Context) error {
|
|
// Get list of all users with quota files
|
|
users, err := r.quotaManager.ListUsers()
|
|
if err != nil {
|
|
return err
|
|
}
|
|
|
|
log.Printf("Starting reconciliation for %d users", len(users))
|
|
|
|
for _, did := range users {
|
|
if err := r.ReconcileUser(ctx, did); err != nil {
|
|
log.Printf("Failed to reconcile %s: %v", did, err)
|
|
// Continue with other users
|
|
}
|
|
}
|
|
|
|
log.Println("Reconciliation complete")
|
|
return nil
|
|
}
|
|
```
|
|
|
|
### Reconciliation Cron
|
|
|
|
```go
|
|
// cmd/hold/main.go
|
|
|
|
func main() {
|
|
// ... setup ...
|
|
|
|
// Start reconciliation cron
|
|
if os.Getenv("QUOTA_RECONCILE_ENABLED") == "true" {
|
|
reconcileInterval := 24 * time.Hour // Daily
|
|
|
|
go func() {
|
|
ticker := time.NewTicker(reconcileInterval)
|
|
defer ticker.Stop()
|
|
|
|
for range ticker.C {
|
|
if err := reconciler.ReconcileAll(context.Background()); err != nil {
|
|
log.Printf("Reconciliation error: %v", err)
|
|
}
|
|
}
|
|
}()
|
|
|
|
log.Printf("Quota reconciliation cron started: runs every %v", reconcileInterval)
|
|
}
|
|
|
|
// ... start server ...
|
|
}
|
|
```
|
|
|
|
### Why PDS as Source of Truth Works

1. **Manifests are canonical** - If manifest exists in PDS, user owns those layers
2. **Public reads** - No OAuth needed, just resolve DID → PDS endpoint
3. **ATProto durability** - PDS is user's authoritative data store
4. **AppView is cache** - AppView database might lag or have inconsistencies
5. **Reconciliation fixes drift** - Periodic sync from PDS ensures accuracy

**Example reconciliation scenarios:**

- **Orphaned quota entries:** User deleted manifest from PDS, but hold quota still has it
  → Reconciliation removes from claimed_layers

- **Missing quota entries:** User pushed manifest, but quota update failed
  → Reconciliation adds to claimed_layers

- **Race condition duplicates:** Two concurrent pushes double-counted a layer
  → Reconciliation fixes to actual usage
## Configuration

### Hold Service Environment Variables

```bash
# .env.hold

# ============================================================================
# Quota Configuration
# ============================================================================

# Enable quota enforcement
QUOTA_ENABLED=true

# Default quota limit per user (bytes)
# 10GB  = 10737418240
# 50GB  = 53687091200
# 100GB = 107374182400
QUOTA_DEFAULT_LIMIT=10737418240

# Storage backend for quota data
# Options: s3, sqlite
QUOTA_STORAGE_BACKEND=s3

# For S3-based storage:
# Quota files stored in same bucket as blobs
QUOTA_STORAGE_PREFIX=/atcr/quota/

# For SQLite-based storage:
QUOTA_DB_PATH=/var/lib/atcr/hold-quota.db

# ============================================================================
# Garbage Collection
# ============================================================================

# Enable periodic garbage collection
GC_ENABLED=true

# GC interval (default: 24h)
GC_INTERVAL=24h

# AppView URL for GC reference checking
APPVIEW_URL=https://atcr.io

# ============================================================================
# Quota Reconciliation
# ============================================================================

# Enable quota reconciliation from PDS
QUOTA_RECONCILE_ENABLED=true

# Reconciliation interval (default: 24h)
QUOTA_RECONCILE_INTERVAL=24h

# ============================================================================
# Hold Service Identity (Required)
# ============================================================================

# Public URL of this hold service
HOLD_PUBLIC_URL=https://hold1.example.com

# Owner DID (for auto-registration)
HOLD_OWNER=did:plc:xyz123
```
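A minimal sketch of how the hold service might load these settings at startup; the `QuotaConfig` struct and helper names are assumptions, but the environment variable names match the file above.

```go
// Sketch of quota-related configuration loading for the hold service.
package main

import (
	"log"
	"os"
	"strconv"
	"time"
)

type QuotaConfig struct {
	Enabled        bool
	DefaultLimit   int64  // bytes
	StorageBackend string // "s3" or "sqlite"
	StoragePrefix  string
	DBPath         string
	GCInterval     time.Duration
}

func loadQuotaConfig() QuotaConfig {
	cfg := QuotaConfig{
		Enabled:        os.Getenv("QUOTA_ENABLED") == "true",
		DefaultLimit:   10 * 1024 * 1024 * 1024, // 10GB default
		StorageBackend: getenvDefault("QUOTA_STORAGE_BACKEND", "s3"),
		StoragePrefix:  getenvDefault("QUOTA_STORAGE_PREFIX", "/atcr/quota/"),
		DBPath:         getenvDefault("QUOTA_DB_PATH", "/var/lib/atcr/hold-quota.db"),
		GCInterval:     24 * time.Hour,
	}
	if v := os.Getenv("QUOTA_DEFAULT_LIMIT"); v != "" {
		if n, err := strconv.ParseInt(v, 10, 64); err == nil {
			cfg.DefaultLimit = n
		} else {
			log.Printf("invalid QUOTA_DEFAULT_LIMIT %q, using default", v)
		}
	}
	if v := os.Getenv("GC_INTERVAL"); v != "" {
		if d, err := time.ParseDuration(v); err == nil {
			cfg.GCInterval = d
		}
	}
	return cfg
}

func getenvDefault(key, fallback string) string {
	if v := os.Getenv(key); v != "" {
		return v
	}
	return fallback
}
```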
### AppView Configuration

```bash
# .env.appview

# Internal API endpoint for hold services
# Used for GC reference checking
ATCR_INTERNAL_API_ENABLED=true

# Optional: authentication token for internal APIs
ATCR_INTERNAL_API_TOKEN=secret123
```
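If `ATCR_INTERNAL_API_TOKEN` is set, the internal endpoints should verify it before serving reference data. A minimal middleware sketch follows; the header name and wiring are assumptions, not the actual implementation.

```go
// Sketch of a shared-token guard for the AppView internal API.
package handlers

import (
	"net/http"
	"os"
)

// requireInternalToken rejects requests that don't carry the shared token.
func requireInternalToken(next http.Handler) http.Handler {
	token := os.Getenv("ATCR_INTERNAL_API_TOKEN")
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if token != "" && r.Header.Get("X-Internal-Token") != token {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}
```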
## Trade-offs & Design Decisions

### 1. Claimed Storage vs Physical Storage

**Decision:** Track claimed storage (logical accounting)

**Why:**
- Predictable for users: "you pay for what you upload"
- No complex cross-user dependencies
- Delete always gives you quota back
- Matches Harbor's proven model

**Trade-off:**
- Total claimed can exceed physical storage
- Users might complain "I uploaded 10GB but S3 only has 6GB"

**Mitigation:**
- Show deduplication savings metric
- Educate users: "You claimed 10GB, but deduplication saved 4GB"

### 2. S3 vs SQLite for Quota Storage

**Decision:** Support both, recommend based on use case

**S3 Pros:**
- No database to manage
- Quota data lives with blobs
- Better for ephemeral BYOS

**SQLite Pros:**
- Faster (no network)
- ACID transactions (no race conditions)
- Better for high-traffic shared holds

**Trade-off:**
- S3: eventual consistency, race conditions
- SQLite: stateful service, scaling challenges

**Mitigation:**
- Reconciliation fixes S3 inconsistencies
- SQLite can use shared DB for multi-instance

### 3. Optimistic Quota Update

**Decision:** Update quota BEFORE upload completes

**Why:**
- Prevent race conditions (two users uploading simultaneously)
- Can reject before presigned URL generated
- Simpler flow

**Trade-off:**
- If upload fails, quota already incremented (user "paid" for nothing)

**Mitigation:**
- Reconciliation from PDS fixes orphaned quota entries
- Acceptable for MVP (upload failures are rare)

### 4. AppView as Intermediary

**Decision:** AppView notifies hold service on deletes

**Why:**
- AppView already has manifest/layer database
- Can efficiently check if layer still referenced
- Hold service doesn't need to query PDS on every delete

**Trade-off:**
- AppView → Hold dependency
- Network hop on delete

**Mitigation:**
- If notification fails, reconciliation fixes quota
- Eventually consistent is acceptable

### 5. PDS as Source of Truth

**Decision:** Use PDS manifests for reconciliation

**Why:**
- Manifests in PDS are canonical user data
- Public reads (no OAuth for reconciliation)
- AppView database might lag or be inconsistent

**Trade-off:**
- Reconciliation requires PDS queries (slower)
- Limited to 1000 manifests per query

**Mitigation:**
- Run reconciliation daily (not real-time)
- Paginate if user has >1000 manifests
## Future Enhancements

### 1. Quota API Endpoints

```
GET  /quota/usage      - Get current user's quota
GET  /quota/breakdown  - Get storage by repository
POST /quota/limit      - Update user's quota limit (admin)
GET  /quota/stats      - Get hold-wide statistics
```
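A minimal sketch of what `GET /quota/usage` could return, reusing the `HoldService` and quota manager types from the push-flow pseudocode; the auth context key and response fields are assumptions.

```go
// Sketch of a quota usage endpoint on the hold service.
package main

import (
	"encoding/json"
	"net/http"
)

func (s *HoldService) HandleQuotaUsage(w http.ResponseWriter, r *http.Request) {
	did, _ := r.Context().Value("auth.did").(string)
	if did == "" {
		http.Error(w, "unauthenticated", http.StatusUnauthorized)
		return
	}

	quota, err := s.quotaManager.GetQuota(did)
	if err != nil {
		http.Error(w, "quota not found", http.StatusNotFound)
		return
	}

	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(map[string]any{
		"did":       quota.DID,
		"used":      quota.Used,
		"limit":     quota.Limit,
		"available": quota.Limit - quota.Used,
		"layers":    len(quota.ClaimedLayers),
	})
}
```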
### 2. Quota Alerts

Notify users when approaching limit:
- Email/webhook at 80%, 90%, 95%
- Reject uploads at 100% (currently implemented)
- Grace period: allow 105% temporarily

### 3. Tiered Quotas

Different limits based on user tier:
- Free: 10GB
- Pro: 100GB
- Enterprise: unlimited

### 4. Quota Purchasing

Allow users to buy additional storage:
- Stripe integration
- $0.10/GB/month pricing
- Dynamic limit updates

### 5. Cross-Hold Deduplication

If multiple holds share the same S3 bucket:
- Track blob ownership globally
- Split costs proportionally
- More complex, but maximizes deduplication

### 6. Manifest-Based Quota (Alternative Model)

Instead of tracking layers, track manifests:
- Simpler: just count manifest sizes
- No deduplication benefits for users
- Might be acceptable for some use cases

### 7. Redis-Based Quota (High Performance)

For high-traffic registries:
- Use Redis instead of S3/SQLite
- Sub-millisecond quota checks
- Harbor-proven approach

### 8. Quota Visualizations

Web UI showing:
- Storage usage over time
- Top consumers by repository
- Deduplication savings graph
- Layer size distribution
## Appendix: SQL Queries

### Check if User Still References Layer

```sql
-- After deleting a manifest, check if the user has other manifests using this layer
SELECT COUNT(*)
FROM layers l
JOIN manifests m ON l.manifest_id = m.id
WHERE m.did = ?      -- User's DID
  AND l.digest = ?   -- Layer digest to check
  AND m.id != ?      -- Exclude the manifest being deleted
```

### Get All Unique Layers for User

```sql
-- Calculate true quota usage for a user
SELECT DISTINCT l.digest, l.size
FROM layers l
JOIN manifests m ON l.manifest_id = m.id
WHERE m.did = ?
  AND m.hold_endpoint = ?
```

### Get Referenced Blobs for Hold

```sql
-- For GC: get all blobs still referenced by any user of this hold
SELECT DISTINCT l.digest
FROM layers l
JOIN manifests m ON l.manifest_id = m.id
WHERE m.hold_endpoint = ?
```

### Get Storage Stats by Repository

```sql
-- User's storage broken down by repository
SELECT
    m.repository,
    COUNT(DISTINCT m.id) AS manifest_count,
    COUNT(DISTINCT l.digest) AS unique_layers,
    SUM(l.size) AS total_size
FROM manifests m
JOIN layers l ON l.manifest_id = m.id
WHERE m.did = ?
  AND m.hold_endpoint = ?
GROUP BY m.repository
ORDER BY total_size DESC
```
## References

- **Harbor Quotas:** https://goharbor.io/docs/1.10/administration/configure-project-quotas/
- **Harbor Source:** https://github.com/goharbor/harbor
- **ATProto Spec:** https://atproto.com/specs/record
- **OCI Distribution Spec:** https://github.com/opencontainers/distribution-spec
- **S3 API Reference:** https://docs.aws.amazon.com/AmazonS3/latest/API/
- **Distribution GC:** https://github.com/distribution/distribution/blob/main/registry/storage/garbagecollect.go

---

**Document Version:** 1.0
**Last Updated:** 2025-10-09
**Author:** Generated from implementation research and Harbor analysis
|