New design doc

Signed-off-by: Joseph <jvaikath@redhat.com>
Joseph
2025-08-01 13:25:44 -04:00
parent 69e307918b
commit c0699c443b


@@ -1,407 +1,79 @@
# Wildcard Namespace Support Design
# Wildcard namespace includes/excludes support for backups and restores
## Abstract
This proposal introduces wildcard pattern support for namespace inclusion and exclusion in Velero backups (e.g., `prod-*`, `*-staging`).
The implementation uses lazy evaluation within the existing `ShouldInclude()` method to resolve wildcards on-demand with request-scoped caching.
Based on [Issue #1874](https://github.com/vmware-tanzu/velero/issues/1874).
One to two sentences that describes the goal of this proposal and the problem being solved by the proposed change.
The reader should be able to tell by the title, and the opening paragraph, if this document is relevant to them.
Velero currently does not have any support for wildcard characters in the namespace spec.
It fully expects the namespaces to be string literals.
The only notable exception is a lone "*", which acts as an "include all" option and ignores excludes.
Internally Velero treats not specifying anything as the "*" case.
This document details the approach to implementing wildcard namespaces while keeping the behavior of the lone "*" character intact for backward compatibility.
## Background
- Currently, Velero users must explicitly list each namespace for backup operations
- In environments with many namespaces following naming conventions (e.g., `prod-app`, `prod-db`, `prod-cache`), this becomes:
- Cumbersome to maintain
- Error-prone to manage
- Users have requested wildcard support to enable patterns like `--include-namespaces "prod-*"`
This was raised in Issue [#1874](https://github.com/vmware-tanzu/velero/issues/1874)
## Goals
- Enable wildcard pattern support for namespace includes and excludes in Velero backup specifications
- Maintain optimal performance with lazy evaluation and request-scoped caching
- Preserve original wildcard patterns in backup specifications for audit and readability purposes
- A short list of things which will be accomplished by implementing this proposal.
- Two things is ok.
- Three is pushing it.
- More than three goals suggests that the proposal's scope is too large.
- Add support for wildcard namespaces in --include-namespaces and --exclude-namespaces
- Ensure legacy "*" support is not affected
## Non Goals
- Support for complex regex patterns beyond basic glob-style wildcards (`*`)
- Persistent caching of namespace resolution across backup requests
- Real-time namespace discovery that changes during backup execution
- A short list of items which are:
- a. out of scope
- b. follow on items which are deliberately excluded from this proposal.
- Completely rethinking the way "*" is treated and allowing it to work with wildcard excludes.
## High-Level Design
One to two paragraphs that describe the high level changes that will be made to implement this proposal.
**Core Approach:** We're making the existing concrete type (`*IncludesExcludes`) polymorphic so we can substitute our new lazy evaluation type (`*LazyNamespaceIncludesExcludes`) without changing any calling code.
Points of interest are two funcs within the utility layer, in file `velero/pkg/backup/item_collector.go`
- [collectNamespaces](https://github.com/vmware-tanzu/velero/blob/1535afb45e33a3d3820088e4189800a21ba55293/pkg/backup/item_collector.go#L742)
- [getNamespacesToList](https://github.com/vmware-tanzu/velero/blob/1535afb45e33a3d3820088e4189800a21ba55293/pkg/backup/item_collector.go#L638)
- Implementation at **backup request level** within the `ShouldInclude()` method
- Uses lazy evaluation with `LazyNamespaceIncludesExcludes` wrapper
- On-demand namespace resolution with thread-safe caching
- First call triggers Kubernetes API namespace enumeration and wildcard resolution
- Results cached for subsequent calls within the same backup request
`collectNamespaces` gets all the active namespaces and matches them against the user's spec for included namespaces (`r.backupRequest.Backup.Spec.IncludedNamespaces`).
This is an ideal point where wildcard expansion can take place.
The implementation would mean that just like "*", namespaces with wildcard symbols would also be passed through without an existence check.
The resolved namespaces are stored in new status fields on the backup.
## Detailed Design
A detailed design describing how the changes to the product should be made.
### Polymorphic Interface Approach
The names of types, fields, interfaces, and methods should be agreed on here, not debated in code review.
The same applies to changes in CRDs, YAML examples, and so on.
The key insight is that all existing backup code already calls the same 4 methods on namespace filtering:
- `ShouldInclude(namespace string) bool` - Core filtering logic
- `IncludesString() string` - Logging display
- `ExcludesString() string` - Logging display
- `IncludeEverything() bool` - Optimization checks
Ideally the changes should be made in sequence so that the work required to implement this design can be done incrementally, possibly in parallel.
By creating a `NamespaceIncludesExcludesInterface` with these methods, we can:
1. **Standard case**: Use existing `*IncludesExcludes` (no wildcards)
2. **Wildcard case**: Use new `*LazyNamespaceIncludesExcludes` (with K8s API enumeration)
**No calling code changes needed** - the interface abstraction handles everything.
**Cache Scope:** Single backup request only - automatic cleanup when request completes.
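A minimal sketch of the interface described above. The four method signatures come from the list in this section; `literalNamespaces` is a hypothetical, simplified stand-in for the existing `*IncludesExcludes` and does not reproduce Velero's real logic (it ignores the special `*` case entirely).

```go
package main

import (
	"fmt"
	"strings"
)

// NamespaceIncludesExcludesInterface holds the four methods that the backup
// code already calls on namespace filters.
type NamespaceIncludesExcludesInterface interface {
	ShouldInclude(namespace string) bool
	IncludesString() string
	ExcludesString() string
	IncludeEverything() bool
}

// literalNamespaces is a hypothetical stand-in for the literal-only filter.
type literalNamespaces struct {
	includes []string
	excludes []string
}

func contains(list []string, s string) bool {
	for _, v := range list {
		if v == s {
			return true
		}
	}
	return false
}

// ShouldInclude applies excludes first, then includes (empty includes = all).
func (l *literalNamespaces) ShouldInclude(ns string) bool {
	if contains(l.excludes, ns) {
		return false
	}
	return len(l.includes) == 0 || contains(l.includes, ns)
}

func (l *literalNamespaces) IncludesString() string { return strings.Join(l.includes, ", ") }
func (l *literalNamespaces) ExcludesString() string { return strings.Join(l.excludes, ", ") }

// IncludeEverything is simplified here: true only when no filters are set.
func (l *literalNamespaces) IncludeEverything() bool {
	return len(l.includes) == 0 && len(l.excludes) == 0
}

// Compile-time check that the stand-in satisfies the interface.
var _ NamespaceIncludesExcludesInterface = (*literalNamespaces)(nil)

func main() {
	var f NamespaceIncludesExcludesInterface = &literalNamespaces{
		includes: []string{"prod-app"},
		excludes: []string{"kube-system"},
	}
	fmt.Println(f.ShouldInclude("prod-app"), f.ShouldInclude("kube-system"))
}
```

Callers that hold the interface type do not need to know which implementation they were handed, which is the property the design relies on.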
### Implementation Strategy
**Location:** `pkg/util/collections/includes_excludes.go`
- New interface defining the 4 required methods
- `LazyNamespaceIncludesExcludes` struct embedding `*IncludesExcludes` for fallback
- Lazy resolution with thread-safe caching using mutex
- Special case handling for lone `*` to preserve existing efficient behavior
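The lazy resolution with mutex-guarded caching could look roughly like the sketch below. All names are hypothetical: `lazyNamespaces` is a simplified stand-in for `LazyNamespaceIncludesExcludes`, and `resolve` stands in for the Kubernetes namespace enumeration plus wildcard expansion. Unlike the interface in this document, this `ShouldInclude` also returns an error so the failure path is visible.

```go
package main

import (
	"fmt"
	"sync"
)

// lazyNamespaces resolves its namespace set on first use and caches it.
type lazyNamespaces struct {
	mu       sync.RWMutex
	done     bool
	resolved map[string]bool
	resolve  func() ([]string, error) // hypothetical: enumerate + expand wildcards
}

func (l *lazyNamespaces) ShouldInclude(ns string) (bool, error) {
	// Fast path: concurrent readers share the read lock once resolved.
	l.mu.RLock()
	if l.done {
		ok := l.resolved[ns]
		l.mu.RUnlock()
		return ok, nil
	}
	l.mu.RUnlock()

	// Slow path: take the write lock and re-check, since another
	// goroutine may have resolved while we were waiting.
	l.mu.Lock()
	defer l.mu.Unlock()
	if !l.done {
		names, err := l.resolve()
		if err != nil {
			return false, err
		}
		l.resolved = make(map[string]bool, len(names))
		for _, n := range names {
			l.resolved[n] = true
		}
		l.done = true
	}
	return l.resolved[ns], nil
}

func main() {
	calls := 0
	l := &lazyNamespaces{resolve: func() ([]string, error) {
		calls++ // only ever runs under the write lock
		return []string{"prod-app", "prod-db"}, nil
	}}

	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			l.ShouldInclude("prod-app")
		}()
	}
	wg.Wait()

	ok, _ := l.ShouldInclude("prod-new")
	fmt.Println(calls, ok) // prints "1 false"
}
```

Because the cache lives on the struct, it is dropped together with the backup request that owns it, which matches the request-scoped lifetime described above.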
**Integration:** `pkg/backup/backup.go`
- Wildcard detection logic determines which implementation to return
- Lone `*` pattern → standard `IncludesExcludes` (preserve current behavior)
- Any other wildcards → lazy `LazyNamespaceIncludesExcludes`
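The detection rule above can be sketched as a small helper. The function name `needsWildcardResolution` is hypothetical; the rule itself (a lone `*` stays on the standard path, any other string containing `*` triggers lazy resolution) is taken from the bullets above.

```go
package main

import (
	"fmt"
	"strings"
)

// needsWildcardResolution reports whether any pattern other than a lone "*"
// contains a wildcard, in which case the lazy implementation would be chosen.
func needsWildcardResolution(patterns []string) bool {
	for _, p := range patterns {
		if p != "*" && strings.Contains(p, "*") {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(needsWildcardResolution([]string{"*"}))                // false
	fmt.Println(needsWildcardResolution([]string{"prod-app"}))         // false
	fmt.Println(needsWildcardResolution([]string{"prod-*", "kube-system"})) // true
}
```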
**Type Updates:** Change struct fields from concrete `*IncludesExcludes` to interface type
- `pkg/backup/request.go` - Request struct field type
- `pkg/backup/item_collector.go` - Function parameter types
### Performance Characteristics
- **First `ShouldInclude()` call:** ~500ms (K8s API namespace enumeration + wildcard resolution)
- **Subsequent calls:** ~1ms (cached lookup with read lock)
- **Memory overhead:** Minimal (resolved namespace list stored once per backup request)
- **Concurrency:** Full concurrent read access to cached results
## Namespace Discovery Timing and Behavior
### Snapshot Timing
**Wildcard patterns are resolved at backup start time** and remain fixed for the entire backup duration. This provides:
- **Consistent behavior**: All resources in a backup come from the same namespace set
- **Predictable results**: Backup contents don't change mid-execution
- **Performance**: No repeated namespace enumeration during backup processing
### Runtime Namespace Changes
When namespaces are created or deleted during backup execution:
**Newly Created Namespaces:**
- If `prod-new` is created after backup starts, it will **NOT** be included even if it matches `prod-*`
- The resolved namespace list is fixed at backup start time
**Deleted Namespaces:**
- If a namespace matching the pattern is deleted during backup, the backup continues
- Resources already processed from that namespace remain in the backup
- Subsequent resource enumeration for that namespace may result in "not found" errors (handled gracefully)
- Ideally the backup should fail in this case so that the user can re-run it; a namespace being deleted while a backup is running is rare.
**User Expectations:**
This behavior should be explicitly documented with examples:
```
# At backup start: namespaces [prod-app, prod-db] exist
velero backup create --include-namespaces "prod-*"
# During backup: prod-cache namespace is created
# Result: prod-cache is NOT included in this backup
# Recommendation: Run another backup to capture newly created namespaces
```
1. Add new status fields to the backup CRD to store expanded wildcard namespaces
## Pattern Complexity and Validation
### Supported Patterns
**Basic Wildcard Support (`*` only):**
- `prefix-*` - Matches namespaces starting with "prefix-"
- `*-suffix` - Matches namespaces ending with "-suffix"
- `*-middle-*` - Matches namespaces containing "-middle-"
- `*` - Special case: matches all namespaces (preserves current behavior)
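One way to match the `*`-only patterns listed above without pulling in `?` or character-class semantics is to split the pattern on `*` and check the pieces in order. This is an illustrative sketch; `matchStar` is a hypothetical name, not an existing Velero function.

```go
package main

import (
	"fmt"
	"strings"
)

// matchStar reports whether name matches pattern, where "*" matches any
// (possibly empty) run of characters and every other character is literal.
func matchStar(pattern, name string) bool {
	parts := strings.Split(pattern, "*")
	if len(parts) == 1 {
		return pattern == name // no wildcard: exact match
	}
	// The text before the first "*" must be a prefix.
	if !strings.HasPrefix(name, parts[0]) {
		return false
	}
	name = name[len(parts[0]):]
	// Each middle piece must appear, in order, in what remains.
	for _, part := range parts[1 : len(parts)-1] {
		idx := strings.Index(name, part)
		if idx < 0 {
			return false
		}
		name = name[idx+len(part):]
	}
	// The text after the last "*" must be a suffix of the remainder.
	return strings.HasSuffix(name, parts[len(parts)-1])
}

func main() {
	fmt.Println(matchStar("prod-*", "prod-app"))       // true
	fmt.Println(matchStar("*-suffix", "app-suffix"))   // true
	fmt.Println(matchStar("*-middle-*", "a-middle-b")) // true
	fmt.Println(matchStar("prod-*", "staging-app"))    // false
}
```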
### Unsupported Patterns
**Not supported in initial implementation:**
- `?` for single character matching (e.g., `prod-?-app`)
- Character classes (e.g., `prod-[abc]-app`)
- Regex patterns (e.g., `prod-\d+-app`)
### Pattern Validation
**Creation-time validation:**
- Invalid patterns containing unsupported characters will be rejected at backup creation
- Validation occurs in CLI and API server admission controller
- Clear error messages guide users to supported patterns
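A sketch of what the creation-time check might look like. The function name `validatePattern` and the exact set of rejected characters (`?`, `[`, `]`, `\`) are assumptions derived from the unsupported patterns listed above; the real validation may cover more.

```go
package main

import (
	"fmt"
	"strings"
)

// validatePattern rejects patterns that use glob or regex syntax other
// than "*". The rejected character set is an assumption for illustration.
func validatePattern(p string) error {
	if i := strings.IndexAny(p, `?[]\`); i >= 0 {
		return fmt.Errorf("pattern %q contains unsupported character %q; only '*' wildcards are supported", p, p[i])
	}
	return nil
}

func main() {
	for _, p := range []string{"prod-*", "prod-?-app", "prod-[abc]-app"} {
		if err := validatePattern(p); err != nil {
			fmt.Println("invalid:", err)
			continue
		}
		fmt.Println("valid:", p)
	}
}
```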
**Example validation errors:**
```bash
# Unsupported pattern
velero backup create --include-namespaces "prod-?-app"
# Error: Pattern 'prod-?-app' contains unsupported character '?'. Only '*' wildcards are supported.
# Valid patterns
velero backup create --include-namespaces "prod-*,*-staging"
# Success: Patterns validated successfully
```
2. Create a util package for wildcard expansion
## Error Handling
### Kubernetes API Failures
**Namespace enumeration failures:**
- If initial namespace list API call fails → backup fails with clear error message
- Transient failures are retried using standard Kubernetes client retry logic
- No fallback to cached/partial data to ensure consistent behavior
**Error response example:**
```
Error: Failed to enumerate namespaces for wildcard resolution: unable to connect to Kubernetes API
Backup creation aborted. Please verify cluster connectivity and try again.
```
### Zero Namespace Matches
**When wildcard patterns match no namespaces:**
- **Behavior**: Warning logged, backup proceeds with empty namespace set
- **User notification**: Warning in backup status and logs
- **Rationale**: Allows for valid scenarios (e.g., temporary namespace absence)
**Warning example:**
```
Warning: Wildcard pattern 'prod-*' matched 0 namespaces. Backup will include no namespaces from this pattern.
```
### Dry-Run Support
**Preview functionality:**
```bash
# New flag to preview wildcard resolution
velero backup create my-backup --include-namespaces "prod-*" --dry-run=wildcards
# Output:
Wildcard pattern 'prod-*' would include namespaces: [prod-app, prod-db, prod-cache]
Wildcard pattern '*-staging' would include namespaces: [app-staging, db-staging]
Total namespaces: 5
```
## Restore Operations
### Wildcard Behavior During Restore
**Restore uses namespaces captured at backup time:**
- Wildcard patterns in backup specs are **not** re-evaluated during restore
- Restore operates on the concrete namespace list that was resolved during backup
- This ensures restore consistency even if cluster namespace state has changed
**Implementation approach:**
1. **Backup metadata storage**: Store both original patterns and resolved namespace lists
2. **Restore processing**: Use resolved namespace lists, ignore original patterns
3. **Audit trail**: Both patterns and resolved lists visible in backup metadata
**Example scenario:**
```yaml
# Original backup spec
includedNamespaces: ["prod-*"]
# Stored in backup metadata
resolvedNamespaces: ["prod-app", "prod-db"]
originalPatterns: ["prod-*"]
# During restore (even if prod-cache now exists)
# Only prod-app and prod-db are restored
```
### Disaster Recovery Scenarios
**Cross-cluster restore behavior:**
- Restore attempts to create resources in target namespaces
- If target namespaces don't exist, Velero creates them (existing behavior)
- Wildcard patterns are not re-evaluated against target cluster
## Scheduled Backups
### Namespace State Changes Between Runs
**Each scheduled backup run performs fresh wildcard resolution:**
- Pattern `prod-*` may include different namespaces in each backup run
- This allows scheduled backups to automatically capture newly created namespaces
- **Trade-off**: Backup contents may vary between runs vs. automatic inclusion of new resources
**Storage implications:**
- Varying namespace sets between runs may affect deduplication efficiency
- Each backup stores its own resolved namespace list independently
**Example behavior:**
```
# Monday backup: prod-* matches [prod-app, prod-db]
# Tuesday: prod-cache namespace created
# Tuesday backup: prod-* matches [prod-app, prod-db, prod-cache]
```
**User expectations:**
- Document that scheduled backups automatically include newly matching namespaces
- Provide guidance on namespace naming conventions for predictable backup behavior
## Testing Strategy
### Unit Tests
**Pattern matching tests:**
```go
func TestWildcardPatterns(t *testing.T) {
	tests := []struct {
		pattern   string
		namespace string
		expected  bool
	}{
		{"prod-*", "prod-app", true},
		{"prod-*", "staging-app", false},
		{"*-staging", "app-staging", true},
		{"*-test-*", "app-test-db", true},
	}
	// ... test implementation
}
```
**Edge cases:**
- Empty pattern list
- Pattern with no matches
- Pattern matching single namespace
- Multiple overlapping patterns
- Special case lone `*` behavior
### Integration Tests
**Kubernetes cluster scenarios:**
- Create namespaces, verify wildcard resolution
- Test namespace creation/deletion during backup
- Verify thread safety with concurrent backup operations
- Error scenarios (API failures, network issues)
**Concurrency testing:**
- Multiple concurrent `ShouldInclude()` calls
- Thread safety verification
- Cache hit ratio measurement
## Example Usage
### CLI Usage
```bash
# Single wildcard pattern
velero backup create prod-backup --include-namespaces "prod-*"
# Multiple patterns
velero backup create env-backup --include-namespaces "prod-*,staging-*,dev-*"
# Mixed literal and wildcard
velero backup create mixed-backup --include-namespaces "prod-*,kube-system,monitoring"
# Exclude patterns
velero backup create no-test --include-namespaces "*" --exclude-namespaces "*-test,*-temp"
# Preview before creating
velero backup create my-backup --include-namespaces "prod-*" --dry-run=wildcards
```
### Backup Specification YAML
```yaml
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: production-backup
  namespace: velero
spec:
  # Wildcard patterns in includedNamespaces
  includedNamespaces:
    - "prod-*"        # All namespaces starting with "prod-"
    - "production-*"  # All namespaces starting with "production-"
    - "critical-app"  # Literal namespace (mixed with wildcards)
  # Wildcard patterns in excludedNamespaces
  excludedNamespaces:
    - "*-test"        # Exclude any test namespaces
    - "*-temp"        # Exclude any temporary namespaces
  # Other backup configuration
  storageLocation: default
  volumeSnapshotLocations:
    - default
  includeClusterResources: false
```
### Stored Backup Metadata
```yaml
# What gets stored in backup metadata
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: production-backup
status:
  # Original user patterns preserved for audit
  originalIncludePatterns: ["prod-*", "production-*", "critical-app"]
  originalExcludePatterns: ["*-test", "*-temp"]
  # Resolved concrete namespace lists (used for restore)
  resolvedIncludedNamespaces: ["prod-app", "prod-db", "production-web", "critical-app"]
  resolvedExcludedNamespaces: ["prod-app-test", "staging-temp"]
  # Resolution timestamp
  namespaceResolutionTime: "2024-01-15T10:30:00Z"
```
3. If required, expand wildcards and replace the request's includes and excludes with expanded namespaces
4. Populate the expanded namespace status field with the namespaces.
## Alternatives Considered
### CLI-Level Resolution
**Problem:** Resolving wildcards during `velero backup create` command
**Why rejected:**
- **Lost User Intent:** Backup specs store resolved lists instead of original patterns
- **Audit Trail Issues:** Original wildcard intent not visible when examining backup specifications
- **CLI Complexity:** CLI requires cluster access and namespace enumeration capabilities
### Server-Level (Controller) Resolution
**Problem:** Resolving wildcards in backup controller with persistent caching
**Why rejected:**
- **Architectural Complexity:** Requires additional API schema changes for storing resolved namespace lists
- **Cache Management:** Need cache invalidation, storage, and lifecycle management
- **Limited Benefit:** Performance gain only applies to narrow controller reconciliation retry scenarios
- **State Management:** Introduces persistent state maintained across backup lifecycle
### Request-Level (ShouldInclude) Resolution
**Chosen Approach:** Lazy evaluation within backup request processing
**Benefits:**
- **Preserved Intent:** Original wildcard patterns remain in backup specifications
- **Optimal Performance:** First resolution (~500ms), subsequent calls (~1ms) with request-scoped caching
- **Clean Architecture:** No persistent state, no API schema changes, minimal code changes
- **Thread Safety:** Proper mutex usage for concurrent worker access
- **Scoped Lifetime:** Cache automatically cleaned up when backup request completes
If there are alternative high level or detailed designs that were not pursued they should be called out here with a brief explanation of why they were not pursued.
## Security Considerations
- Implementation requires Velero service account to have `list` permissions on namespace resources
- Aligns with existing Velero RBAC requirements
- No additional privileges or security surface area introduced
## Addressing Implementation Concerns
### Multiple Pattern Support
Multiple wildcards work naturally: `--include-namespaces "prod-*,staging-*,dev-*"` - each pattern evaluated independently during lazy resolution.
### Mixed Literal and Wildcard Detection
Simple approach: strings containing `*` are wildcards, others use existing literal namespace logic. Zero breaking changes for existing validation.
### Include/Exclude Conflict Detection
Runtime resolution simplifies conflicts - wildcards resolve to actual namespace lists first, then standard include/exclude precedence applies.
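As an illustration of resolving first and applying precedence second, the sketch below expands both lists against the known namespaces and then lets excludes win. `expand` and `selectNamespaces` are hypothetical helpers, "excludes win" is assumed to be the standard precedence referred to above, and `path.Match` is used on the assumption that patterns contain only `*`.

```go
package main

import (
	"fmt"
	"path"
)

// expand returns the set of namespaces matched by any of the patterns.
// Literal names match only themselves.
func expand(patterns, all []string) map[string]bool {
	out := make(map[string]bool)
	for _, p := range patterns {
		for _, ns := range all {
			if ok, err := path.Match(p, ns); err == nil && ok {
				out[ns] = true
			}
		}
	}
	return out
}

// selectNamespaces resolves includes and excludes to concrete sets, then
// keeps every included namespace that is not also excluded.
func selectNamespaces(includes, excludes, all []string) []string {
	inc := expand(includes, all)
	exc := expand(excludes, all)
	var out []string
	for _, ns := range all {
		if inc[ns] && !exc[ns] {
			out = append(out, ns)
		}
	}
	return out
}

func main() {
	all := []string{"prod-app", "prod-app-test", "staging"}
	fmt.Println(selectNamespaces([]string{"prod-*"}, []string{"*-test"}, all)) // [prod-app]
}
```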
### Backward Compatibility
Lazy evaluation triggers only when wildcards detected. Non-wildcard backups have zero overhead and identical behavior to current implementation.
## Special Consideration: Existing `*` Behavior
**Current Velero Behavior:** `--include-namespaces "*"` (the CLI default) means "include all namespaces" and uses special logic that doesn't enumerate namespaces - it simply bypasses namespace filtering entirely.
**Potential Breaking Change:** Our wildcard implementation would treat `*` as a glob pattern, resolving it to a specific list of namespaces at backup start time, which changes the behavior from "include everything" to "include these specific namespaces."
**Required Solution:** Special-case handling for the lone `*` pattern to preserve existing behavior by using original `IncludesExcludes` logic instead of wildcard resolution.
This ensures that `--include-namespaces "*"` continues to work exactly as before, while enabling new wildcard patterns like `prod-*`, `*-staging`, etc.
If this proposal has an impact to the security of the product, its users, or data stored or transmitted via the product, they must be addressed here.
## Compatibility
- Full backward compatibility with existing backup specifications using literal namespace lists
- No changes required to CLI commands, existing backups, or restore operations
A discussion of any compatibility issues that need to be considered
## Implementation
The implementation consists of approximately 200 lines of new code across four files:
- `pkg/util/collections/includes_excludes.go`: Core lazy evaluation logic (~150 lines)
- `pkg/backup/backup.go`: Wildcard detection logic (~20 lines)
- `pkg/backup/request.go`: Interface type usage (~5 lines)
- `pkg/backup/item_collector.go`: Compatibility method calls (~25 lines)
A description of the implementation, timelines, and any resources that have agreed to contribute.
## Open Issues
None.
A discussion of issues relating to this proposal for which the author does not know the solution. This section may be omitted if there are none.