Update the community page to add the correct links to community meeting
and meeting notes.
I also removed the referece of google group as I confirmed the last
message was sent 2 years ago.
Signed-off-by: Daniel Jiang <daniel.jiang@broadcom.com>
* Add CI check for invalid characters in file paths
Go's module zip rejects filenames containing certain characters (shell
special chars like " ' * < > ? ` |, path separators : \, and non-letter
Unicode such as control/format characters). This caused a build failure
when a changelog file contained an invisible U+200E LEFT-TO-RIGHT MARK
(see PR #9552).
Add a GitHub Actions workflow that validates all tracked file paths on
every PR to catch these issues before they reach downstream consumers.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
* Fix changelog filenames containing invisible U+200E characters
Remove LEFT-TO-RIGHT MARK unicode characters from changelog filenames
that would cause Go module zip failures.
Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)
Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
---------
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Happy <yesreply@happy.engineering>
The `getDataUpload` function in the CSI PVC backup plugin was
previously making a cluster-scoped list query to retrieve DataUpload
CRs. In environments with strict minimum-privilege RBAC, this would
fail with forbidden errors.
This explicitly passes the backup namespace into the `ListOptions`
when calling `crClient.List`, correctly scoping the queries to the
backup's namespace. Unit tests have also been updated to ensure
cross-namespace queries are rejected appropriately.
Signed-off-by: Adam Zhang <adam.zhang@broadcom.com>
Kubernetes 1.34 introduced VolumeGroupSnapshot v1beta2 API and
deprecated v1beta1. Distributions running K8s 1.34+ (e.g. OpenShift
4.21+) have removed v1beta1 VGS CRDs entirely, breaking Velero's
VGS functionality on those clusters.
This change bumps external-snapshotter/client/v8 from v8.2.0 to
v8.4.0 and migrates all VGS API usage from v1beta1 to v1beta2.
The v1beta2 API is structurally compatible - the Spec-level types
(GroupSnapshotHandles, VolumeGroupSnapshotContentSource) are
unchanged. The Status-level change (VolumeSnapshotHandlePairList
replaced by VolumeSnapshotInfoList) does not affect Velero as it
does not directly consume that type.
Fixes#9694
Signed-off-by: Shubham Pampattiwar <spampatt@redhat.com>
* Fix VolumeGroupSnapshot restore on Ceph RBD
This PR fixes two related issues affecting CSI snapshot restore on Ceph RBD:
1. VolumeGroupSnapshot restore fails because Ceph RBD populates
volumeGroupSnapshotHandle on pre-provisioned VSCs, but Velero doesn't
create the required VGSC during restore.
2. CSI snapshot restore fails because VolumeSnapshotClassName is removed
from restored VSCs, preventing the CSI controller from getting
credentials for snapshot verification.
Changes:
- Capture volumeGroupSnapshotHandle during backup as VS annotation
- Create stub VGSC during restore with matching handle in status
- Look up VolumeSnapshotClass by driver and set on restored VSC
Fixes#9512Fixes#9515
Signed-off-by: Shubham Pampattiwar <spampatt@redhat.com>
* Add changelog for VGS restore fix
Signed-off-by: Shubham Pampattiwar <spampatt@redhat.com>
* Fix gofmt import order
Signed-off-by: Shubham Pampattiwar <spampatt@redhat.com>
* Add changelog for VGS restore fix
Signed-off-by: Shubham Pampattiwar <spampatt@redhat.com>
* Fix import alias corev1 to corev1api per lint config
Signed-off-by: Shubham Pampattiwar <spampatt@redhat.com>
* Fix: Add snapshot handles to existing stub VGSC and add unit tests
When multiple VolumeSnapshots from the same VolumeGroupSnapshot are
restored, they share the same VolumeGroupSnapshotHandle but have
different individual snapshot handles. This commit:
1. Fixes incomplete logic where existing VGSC wasn't updated with
new snapshot handles (addresses review feedback)
2. Fixes race condition where Create returning AlreadyExists would
skip adding the snapshot handle
3. Adds comprehensive unit tests for ensureStubVGSCExists (5 cases)
and addSnapshotHandleToVGSC (4 cases) functions
Signed-off-by: Shubham Pampattiwar <spampatt@redhat.com>
* Clean up stub VolumeGroupSnapshotContents during restore finalization
Add cleanup logic for stub VGSCs created during VolumeGroupSnapshot restore.
The stub VGSCs are temporary objects needed to satisfy CSI controller
validation during VSC reconciliation. Once all related VSCs become
ReadyToUse, the stub VGSCs are no longer needed and should be removed.
The cleanup runs in the restore finalizer controller's execute() phase.
Before deleting each VGSC, it polls until all related VolumeSnapshotContents
(correlated by snapshot handle) are ReadyToUse, with a timeout fallback.
Deletion failures and CRD-not-installed scenarios are treated as warnings
rather than errors to avoid failing the restore.
Signed-off-by: Shubham Pampattiwar <spampatt@redhat.com>
* Fix lint: remove unused nolint directive and simplify cleanupStubVGSC return
The cleanupStubVGSC function only produces warnings (not errors), so
simplify its return signature. Also remove the now-unused nolint:unparam
directive on execute() since warnings are no longer always nil.
Signed-off-by: Shubham Pampattiwar <spampatt@redhat.com>
---------
Signed-off-by: Shubham Pampattiwar <spampatt@redhat.com>
Restrict the listing of PodVolumeBackup resources to the specific
restore namespace in both the core restore controller and the pod
volume restore action plugin. This prevents "Forbidden" errors when
Velero is configured with namespace-scoped minimum privileges,
avoiding the need for cluster-scoped list permissions for
PodVolumeBackups.
Fixes: #9681
Signed-off-by: Adam Zhang <adam.zhang@broadcom.com>
The tag used to latest. Due to latest tag v0.23.3 already used
Golang v1.26, Velero main still uses v1.25. Build failed.
To fix this, pin the controller-runtime to v0.23.2
Signed-off-by: Xun Jiang <xun.jiang@broadcom.com>
* feat: support backup hooks on sidecars
Add support for configuring Kubernates native
Sidecars as target containrs for Backup Hooks
commands. This is purely a validation level
patch as the actual pods/exec API doesn't make
any distinction between standard and sidecar
containers.
Signed-off-by: Gabriele Fedi <gabriele.fedi@enterprisedb.com>
* test: extend unit tests
Signed-off-by: Gabriele Fedi <gabriele.fedi@enterprisedb.com>
* chore: changelog
Signed-off-by: Gabriele Fedi <gabriele.fedi@enterprisedb.com>
* style: fix linter issues
Signed-off-by: Gabriele Fedi <gabriele.fedi@enterprisedb.com>
---------
Signed-off-by: Gabriele Fedi <gabriele.fedi@enterprisedb.com>
* fix configmap lookup in non-default namespaces
o.Namespace is empty when Validate runs (Complete hasn't been called yet),
causing VerifyJSONConfigs to query the default namespace instead of the
intended one. Replace o.Namespace with f.Namespace() in all three ConfigMap
validation calls so the factory's already-resolved namespace is used.
Signed-off-by: Adam Zhang <adam.zhang@broadcom.com>
* switch the call order of validate/complete
switch the call order of validate/complete which accomplish
the same effect.
Signed-off-by: Adam Zhang <adam.zhang@broadcom.com>
---------
Signed-off-by: Adam Zhang <adam.zhang@broadcom.com>
From 1.18.1, Velero adds some default affinity in the backup/restore pod,
so we can't directly compare the whole affinity,
but we can verify if the expected affinity is contained in the pod affinity.
Signed-off-by: Xun Jiang <xun.jiang@broadcom.com>
The itemOperationTimeout field was missing from the Schedule API type
documentation even though it is supported in the Schedule CRD template.
This led users to believe the field was not available per-schedule.
Fixes#9598
Signed-off-by: Shubham Pampattiwar <spampatt@redhat.com>
* Fix DBR stuck when CSI snapshot no longer exists in cloud provider
During backup deletion, VolumeSnapshotContentDeleteItemAction creates a
new VSC with the snapshot handle from the backup and polls for readiness.
If the underlying snapshot no longer exists (e.g., deleted externally),
the CSI driver reports Status.Error but checkVSCReadiness() only checks
ReadyToUse, causing it to poll for the full 10-minute timeout instead of
failing fast. Additionally, the newly created VSC is never cleaned up on
failure, leaving orphaned resources in the cluster.
This commit:
- Adds Status.Error detection in checkVSCReadiness() to fail immediately
on permanent CSI driver errors (e.g., InvalidSnapshot.NotFound)
- Cleans up the dangling VSC when readiness polling fails
Fixes#9579
Signed-off-by: Shubham Pampattiwar <spampatt@redhat.com>
* Add changelog for PR #9581
Signed-off-by: Shubham Pampattiwar <spampatt@redhat.com>
* Fix typo in pod_volume_test.go: colume -> volume
Signed-off-by: Shubham Pampattiwar <spampatt@redhat.com>
---------
Signed-off-by: Shubham Pampattiwar <spampatt@redhat.com>
* Issue #9544: Add test coverage and fix validation for MRAP ARN bucket names
S3 Multi-Region Access Point (MRAP) ARNs have the format:
arn:aws:s3::{account-id}:accesspoint/{mrap-alias}.mrap
These ARNs contain a '/' as part of the ARN path, which caused Velero's
BSL bucket validation to reject them with an error asking the user to
put the value in the Prefix field instead.
Fix the bucket name validation in objectBackupStoreGetter.Get() to
exempt ARNs (identified by the "arn:" prefix) from the slash check,
since slashes are a valid and required part of ARN syntax.
Add unit tests in object_store_mrap_test.go covering:
- A plain MRAP ARN as bucket name succeeds
- A MRAP ARN with a trailing slash is trimmed and accepted
Signed-off-by: Sabir Ali <testsabirweb@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
* Address review comments: fix changelog filename and import grouping
Signed-off-by: Sabir Ali <testsabirweb@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
* Restrict MRAP ARN bucket validation to arn:aws:s3: prefix
Per review, use HasPrefix(bucket, "arn:aws:s3:") instead of
HasPrefix(bucket, "arn:") so only S3 ARNs (e.g. MRAP) are exempt
from the slash check, not any ARN from other AWS services.
Signed-off-by: Sabir Ali <sabir.ali@spectrocloud.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
* Move MRAP bucket tests into TestNewObjectBackupStoreGetter
Consolidate MRAP ARN test cases into the existing table in
object_store_test.go and remove object_store_mrap_test.go.
Signed-off-by: Sabir Ali <sabir.ali@spectrocloud.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
---------
Signed-off-by: Sabir Ali <testsabirweb@gmail.com>
Signed-off-by: Sabir Ali <sabir.ali@spectrocloud.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Add a new Prometheus gauge metric that exposes the expected interval
between consecutive scheduled backups. This enables dynamic alerting
thresholds per schedule backups.
Signed-off-by: Quang Ngo <quang.ngo@canonical.com>
* Support all glob wildcard characters in namespace validation
Expand namespace validation to allow all valid glob pattern characters
(*, ?, {}, [], ,) by replacing them with valid characters during RFC 1123
validation. The actual glob pattern validation is handled separately by
the wildcard package.
Also add validation to reject unsupported characters (|, (), !) that are
not valid in glob patterns, and update terminology from "regex" to "glob"
for clarity since this implementation uses glob patterns, not regex.
Changes:
- Replace all glob wildcard characters in validateNamespaceName
- Add test coverage for valid glob patterns in includes/excludes
- Add test coverage for unsupported characters
- Reject exclamation mark (!) in wildcard patterns
- Clarify comments and error messages about glob vs regex
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Joseph <jvaikath@redhat.com>
* Changelog
Signed-off-by: Joseph <jvaikath@redhat.com>
* Add documentation: glob patterns are now accepted
Signed-off-by: Joseph <jvaikath@redhat.com>
* Error message fix
Signed-off-by: Joseph <jvaikath@redhat.com>
* Remove negation glob char test
Signed-off-by: Joseph <jvaikath@redhat.com>
* Add bracket pattern validation for namespace glob patterns
Extends wildcard validation to support square bracket patterns [] used in glob character classes. Validates bracket syntax including empty brackets, unclosed brackets, and unmatched brackets. Extracts ValidateNamespaceName as a public function to enable reuse in namespace validation logic.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Joseph <jvaikath@redhat.com>
* Reduce scope to *, ?, [ and ]
Signed-off-by: Joseph <jvaikath@redhat.com>
* Fix tests
Signed-off-by: Joseph <jvaikath@redhat.com>
* Add namespace glob patterns documentation page
Adds dedicated documentation explaining supported glob patterns
for namespace include/exclude filtering to help users understand
the wildcard syntax.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Joseph <jvaikath@redhat.com>
* Fix build-image Dockerfile envtest download
Replace inaccessible go.kubebuilder.io URL with setup-envtest and update envtest version to 1.33.0 to match Kubernetes v0.33.3 dependencies.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Joseph <jvaikath@redhat.com>
* kubebuilder binaries mv
Signed-off-by: Joseph <jvaikath@redhat.com>
* Reject brace patterns and update documentation
Add {, }, and , to unsupported characters list to explicitly reject
brace expansion patterns. Remove { from wildcard detection since these
patterns are not supported in the 1.18 release.
Update all documentation to show supported patterns inline (*, ?, [abc])
with clickable links to the detailed namespace-glob-patterns page.
Simplify YAML comments by removing non-clickable URLs.
Update tests to expect errors when brace patterns are used.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Joseph <jvaikath@redhat.com>
* Document brace expansion as unsupported
Add {} and , to the unsupported patterns section to clarify that
brace expansion patterns like {a,b,c} are not supported.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Joseph <jvaikath@redhat.com>
* Update tests to expect brace pattern rejection
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Joseph <jvaikath@redhat.com>
---------
Signed-off-by: Joseph <jvaikath@redhat.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Use typed error approach: Make GetPVForPVC return ErrPVNotFoundForPVC
when PV is not expected to be found (unbound PVC), then use errors.Is
to check for this error type. When a matching policy exists (e.g.,
pvcPhase: [Pending, Lost] with action: skip), apply the action without
error. When no policy matches, return the original error to preserve
default behavior.
Changes:
- Add ErrPVNotFoundForPVC sentinel error to pvc_pv.go
- Update ShouldPerformSnapshot to handle unbound PVCs with policies
- Update ShouldPerformFSBackup to handle unbound PVCs with policies
- Update item_backupper.go to handle Lost PVCs in tracking functions
- Remove checkPVCOnlySkip helper (no longer needed)
- Update tests to reflect new behavior
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Ensure the RBAC resources are restored before pods.
The change help to avoid pod starting error when pod depends on the RBAC resources,
e.g., prometheus operator check whether it has enough permission before launching
controller, if prometheus operator pod starts before RBAC resources created, it
will not launch controllers, and it will not retry.
f7f07bcdfb/cmd/operator/main.go (L392-L400)
Signed-off-by: Xun Jiang <xun.jiang@broadcom.com>
Remove 'self' from MIGRATE_FROM_VELERO_VERSION.
* Need to modify the BYOT case to specify MIGRATE_FROM_VELERO_VERSION.
Remove the CSI plugin installation check for no older than v1.14
Signed-off-by: Xun Jiang <xun.jiang@broadcom.com>
This commit implements VolumePolicy support for PVC Phase conditions, resolving
vmware-tanzu/velero#7233 where backups fail with ''PVC has no volume backing this claim''
for Pending PVCs.
Changes made:
- Extended VolumePolicy API to support PVC phase conditions
- Added pvcPhaseCondition struct with matching logic
- Modified getMatchAction() to evaluate policies for unbound PVCs before returning errors
- Added case to GetMatchAction() to handle PVC-only scenarios (nil PV)
- Added comprehensive unit tests for PVC phase parsing and matching
Users can now skip Pending PVCs through volume policy configuration:
apiVersion: v1
kind: ConfigMap
metadata:
name: volume-policy
namespace: velero
data:
policy.yaml: |
version: v1
volumePolicies:
- conditions:
pvcPhase: [Pending]
action:
type: skip
chore: rename changelog file to match PR #9166
Renamed changelogs/unreleased/7233-claude to changelogs/unreleased/9166-claude
to match the opened PR at https://github.com/vmware-tanzu/velero/pull/9166
docs: Add PVC phase condition support to VolumePolicy documentation
- Added pvcPhase field to YAML template example
- Documented pvcPhase as a supported condition in the list
- Added comprehensive examples for using PVC phase conditions
- Included examples for Pending, Bound, and Lost phases
- Demonstrated combining PVC phase with other conditions
Co-Authored-By: Tiger Kaovilai <kaovilai@users.noreply.github.com>
Ensure plugin init container names satisfy DNS-1123 label constraints
(max 63 chars). Long names are truncated with an 8-char hash suffix to
maintain uniqueness.
Fixes: #9444
Signed-off-by: Michal Pryc <mpryc@redhat.com>
The nolint:staticcheck directives are not needed in the test file because
it calls NewVolumeHelperImpl within the same package, which doesn't
trigger deprecation warnings. Only cross-package calls need the directive.
Signed-off-by: Shubham Pampattiwar <spampatt@redhat.com>
The ShouldPerformSnapshotWithVolumeHelper function and tests intentionally
use NewVolumeHelperImpl (deprecated) for backwards compatibility with
third-party plugins. Add nolint:staticcheck to suppress the linter
warnings with explanatory comments.
Signed-off-by: Shubham Pampattiwar <spampatt@redhat.com>
Remove deprecated functions that were marked for removal per review:
- Remove GetPodsUsingPVC (replaced by GetPodsUsingPVCWithCache)
- Remove IsPVCDefaultToFSBackup (replaced by IsPVCDefaultToFSBackupWithCache)
- Remove associated tests for deprecated functions
- Add deprecation marker to NewVolumeHelperImpl
- Add deprecation marker to ShouldPerformSnapshotWithBackup
Signed-off-by: Shubham Pampattiwar <spampatt@redhat.com>
This commit addresses reviewer feedback on PR #9441 regarding
concurrent backup caching concerns. Key changes:
1. Added lazy per-namespace caching for the CSI PVC BIA plugin path:
- Added IsNamespaceBuilt() method to check if namespace is cached
- Added BuildCacheForNamespace() for lazy, per-namespace cache building
- Plugin builds cache incrementally as namespaces are encountered
2. Added NewVolumeHelperImplWithCache constructor for plugins:
- Accepts externally-managed PVC-to-Pod cache
- Follows pattern from PR #9226 (Scott Seago's design)
3. Plugin instance lifecycle clarification:
- Plugin instances are unique per backup (created via newPluginManager)
- Cleaned up via CleanupClients at backup completion
- No mutex or backup UID tracking needed
4. Test coverage:
- Added tests for IsNamespaceBuilt and BuildCacheForNamespace
- Added tests for NewVolumeHelperImplWithCache constructor
- Added test verifying cache usage for fs-backup determination
This maintains the O(N+M) complexity improvement from issue #9179
while addressing architectural concerns about concurrent access.
Signed-off-by: Shubham Pampattiwar <spampatt@redhat.com>
Address review feedback to have a global VolumeHelper instance per
plugin process instead of creating one on each ShouldPerformSnapshot
call.
Changes:
- Add volumeHelper, cachedForBackup, and mu fields to pvcBackupItemAction
struct for caching the VolumeHelper per backup
- Add getOrCreateVolumeHelper() method for thread-safe lazy initialization
- Update Execute() to use cached VolumeHelper via
ShouldPerformSnapshotWithVolumeHelper()
- Update filterPVCsByVolumePolicy() to accept VolumeHelper parameter
- Add ShouldPerformSnapshotWithVolumeHelper() that accepts optional
VolumeHelper for reuse across multiple calls
- Add NewVolumeHelperForBackup() factory function for BIA plugins
- Add comprehensive unit tests for both nil and non-nil VolumeHelper paths
This completes the fix for issue #9179 by ensuring the PVC-to-Pod cache
is built once per backup and reused across all PVC processing, avoiding
O(N*M) complexity.
Fixes#9179
Signed-off-by: Shubham Pampattiwar <spampatt@redhat.com>
- Rename NewVolumeHelperImplWithCache to NewVolumeHelperImplWithNamespaces
- Move cache building logic from backup.go into volumehelper
- Return error from NewVolumeHelperImplWithNamespaces if cache build fails
- Remove fallback in main backup path - backup fails if cache build fails
- Update NewVolumeHelperImpl to call NewVolumeHelperImplWithNamespaces
- Add comments clarifying fallback is only used by plugins
- Update tests for new error return signature
This addresses review comments from @Lyndon-Li and @kaovilai:
- Cache building is now encapsulated in volumehelper
- No fallback in main backup path ensures predictable performance
- Code reuse between constructors
Fixes#9179
Signed-off-by: Shubham Pampattiwar <spampatt@redhat.com>
- Use ResolveNamespaceList() instead of GetIncludes() for more accurate
namespace resolution when building the PVC-to-Pod cache
- Refactor NewVolumeHelperImpl to call NewVolumeHelperImplWithCache with
nil cache parameter to avoid code duplication
Signed-off-by: Shubham Pampattiwar <spampatt@redhat.com>
Add test case to verify that the PVC-to-Pod cache is used even when
no volume policy is configured. When defaultVolumesToFSBackup is true,
the cache is used to find pods using the PVC to determine if fs-backup
should be used instead of snapshot.
Signed-off-by: Shubham Pampattiwar <spampatt@redhat.com>
Add TestVolumeHelperImplWithCache_ShouldPerformFSBackup to verify:
- Volume policy match with cache returns correct fs-backup decision
- Volume policy match with snapshot action skips fs-backup
- Fallback to direct lookup when cache is not built
Signed-off-by: Shubham Pampattiwar <spampatt@redhat.com>
Add TestVolumeHelperImplWithCache_ShouldPerformSnapshot to verify:
- Volume policy match with cache returns correct snapshot decision
- fs-backup via opt-out with cache properly skips snapshot
- Fallback to direct lookup when cache is not built
These tests verify the cache-enabled code path added in the previous
commit for improved volume policy performance.
Signed-off-by: Shubham Pampattiwar <spampatt@redhat.com>
The GetPodsUsingPVC function had O(N*M) complexity - for each PVC,
it listed ALL pods in the namespace and iterated through each pod.
With many PVCs and pods, this caused significant performance
degradation (2+ seconds per PV in some cases).
This change introduces a PVC-to-Pod cache that is built once per
backup and reused for all PVC lookups, reducing complexity from
O(N*M) to O(N+M).
Changes:
- Add PVCPodCache struct with thread-safe caching in podvolume pkg
- Add NewVolumeHelperImplWithCache constructor for cache support
- Build cache before backup item processing in backup.go
- Add comprehensive unit tests for cache functionality
- Graceful fallback to direct lookups if cache fails
Fixes#9179
Signed-off-by: Shubham Pampattiwar <spampatt@redhat.com>
This change enables BSL validation to work when using caCertRef
(Secret-based CA certificate) by resolving the certificate from
the Secret in velero core before passing it to the object store
plugin as 'caCert' in the config map.
This approach requires no changes to provider plugins since they
already understand the 'caCert' config key.
Changes:
- Add SecretStore to objectBackupStoreGetter struct
- Add NewObjectBackupStoreGetterWithSecretStore constructor
- Update Get method to resolve caCertRef from Secret
- Update server.go to use new constructor with SecretStore
- Add CACertRef builder method and unit tests
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
- Introduced `CACertRef` field in `ObjectStorageLocation` to reference a Secret containing the CA certificate, replacing the deprecated `CACert` field.
- Implemented validation logic to ensure mutual exclusivity between `CACert` and `CACertRef`.
- Updated BSL controller and repository provider to handle the new certificate resolution logic.
- Enhanced CLI to support automatic certificate discovery from BSL configurations.
- Added unit and integration tests to validate new functionality and ensure backward compatibility.
- Documented migration strategy for users transitioning from inline certificates to Secret-based management.
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
Explain that the duration histogram tracks distribution of individual
job durations, not accumulated sums, to address reviewer concerns.
Signed-off-by: Shubham Pampattiwar <spampatt@redhat.com>
This commit addresses three review comments on PR #9321:
1. Keep sanitization in controller (response to @ywk253100)
- Maintaining centralized error handling for easier extension
- Azure-specific patterns detected and others passed through unchanged
2. Sanitize unavailableErrors array (@priyansh17)
- Now using sanitizeStorageError() for both unavailableErrors array
and location.Status.Message for consistency
3. Add SAS token scrubbing (@anshulahuja98)
- Scrubs Azure SAS token parameters to prevent credential leakage
- Redacts: sig, se, st, sp, spr, sv, sr, sip, srt, ss
- Example: ?sig=secret becomes ?sig=***REDACTED***
Added comprehensive test coverage for SAS token scrubbing with 4 new
test cases covering various scenarios.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Shubham Pampattiwar <spampatt@redhat.com>
Azure storage errors include verbose HTTP response details and XML
in error messages, making the BSL status.message field cluttered
and hard to read. This change adds sanitization to extract only
the error code and meaningful message.
Before:
BackupStorageLocation "test" is unavailable: rpc error: code = Unknown
desc = GET https://...
RESPONSE 404: 404 The specified container does not exist.
ERROR CODE: ContainerNotFound
<?xml version="1.0"...>
After:
BackupStorageLocation "test" is unavailable: rpc error: code = Unknown
desc = ContainerNotFound: The specified container does not exist.
AWS and GCP error messages are preserved as-is since they don't
contain verbose HTTP responses.
Fixes#8368
Signed-off-by: Shubham Pampattiwar <spampatt@redhat.com>
- Rename metric constants from maintenance_job_* to repo_maintenance_*
- Update metric help text to clarify these are for repo maintenance
- Rename functions: RegisterMaintenanceJob* → RegisterRepoMaintenance*
- Update all test references to use new names
Addresses review comments from @Lyndon-Li on PR #9414
Signed-off-by: Shubham Pampattiwar <spampatt@redhat.com>
Adds three new Prometheus metrics to track backup repository
maintenance job execution:
- velero_maintenance_job_success_total: Counter for successful jobs
- velero_maintenance_job_failure_total: Counter for failed jobs
- velero_maintenance_job_duration_seconds: Histogram for job duration
Metrics use repository_name label to identify specific BackupRepositories.
Duration is recorded for both successful and failed jobs (when job runs),
but not when job fails to start.
Includes comprehensive unit and integration tests.
Fixes#9225
Signed-off-by: Shubham Pampattiwar <spampatt@redhat.com>
* Add wildcard status fields
Signed-off-by: Joseph <jvaikath@redhat.com>
* Implement wildcard namespace expansion in item collector
- Introduced methods to get active namespaces and expand wildcard includes/excludes in the item collector.
- Updated getNamespacesToList to handle wildcard patterns and return expanded lists.
- Added utility functions for setting includes and excludes in the IncludesExcludes struct.
- Created a new package for wildcard handling, including functions to determine when to expand wildcards and to perform the expansion.
This enhances the backup process by allowing more flexible namespace selection based on wildcard patterns.
Signed-off-by: Joseph <jvaikath@redhat.com>
* Enhance wildcard expansion logic and logging in item collector
- Improved logging to include original includes and excludes when expanding wildcards.
- Updated the ShouldExpandWildcards function to check for wildcard patterns in excludes.
- Added comments for clarity in the expandWildcards function regarding pattern handling.
These changes enhance the clarity and functionality of the wildcard expansion process in the backup system.
Signed-off-by: Joseph <jvaikath@redhat.com>
* Add wildcard namespace fields to Backup CRD and update deepcopy methods
- Introduced `wildcardIncludedNamespaces` and `wildcardExcludedNamespaces` fields to the Backup CRD to support wildcard patterns for namespace inclusion and exclusion.
- Updated deepcopy methods to handle the new fields, ensuring proper copying of data during object manipulation.
These changes enhance the flexibility of namespace selection in backup operations, aligning with recent improvements in wildcard handling.
Signed-off-by: Joseph <jvaikath@redhat.com>
* Refactor Backup CRD to rename wildcard namespace fields
- Updated `BackupStatus` struct to rename `WildcardIncludedNamespaces` to `WildcardExpandedIncludedNamespaces` and `WildcardExcludedNamespaces` to `WildcardExpandedExcludedNamespaces` for clarity.
- Adjusted associated comments to reflect the new naming and ensure consistency in documentation.
- Modified deepcopy methods to accommodate the renamed fields, ensuring proper data handling during object manipulation.
These changes enhance the clarity and maintainability of the Backup CRD, aligning with recent improvements in wildcard handling.
Signed-off-by: Joseph <jvaikath@redhat.com>
* Fix
Signed-off-by: Joseph <jvaikath@redhat.com>
* Refactor where wildcard expansion happens
Signed-off-by: Joseph <jvaikath@redhat.com>
* Refactor Backup CRD and related components for expanded namespace handling
- Updated `BackupStatus` struct to rename fields for clarity: `WildcardExpandedIncludedNamespaces` and `WildcardExpandedExcludedNamespaces` are now `ExpandedIncludedNamespaces` and `ExpandedExcludedNamespaces`, respectively.
- Adjusted associated comments and deepcopy methods to reflect the new naming conventions.
- Removed the `getActiveNamespaces` function from the item collector, streamlining the namespace handling process.
- Enhanced logging during wildcard expansion to provide clearer insights into the process.
These changes improve the clarity and maintainability of the Backup CRD and enhance the functionality of namespace selection in backup operations.
Signed-off-by: Joseph <jvaikath@redhat.com>
* Refactor wildcard expansion logic in item collector and enhance testing
- Moved the wildcard expansion logic into a dedicated method, `expandNamespaceWildcards`, improving code organization and readability.
- Updated logging to provide detailed insights during the wildcard expansion process.
- Introduced comprehensive unit tests for wildcard handling, covering various scenarios and edge cases.
- Enhanced the `ShouldExpandWildcards` function to better identify wildcard patterns and validate inputs.
These changes improve the maintainability and robustness of the wildcard handling in the backup system.
Signed-off-by: Joseph <jvaikath@redhat.com>
* Enhance Restore CRD with expanded namespace fields and update logic
- Added `ExpandedIncludedNamespaces` and `ExpandedExcludedNamespaces` fields to the `RestoreStatus` struct to support expanded wildcard namespace handling.
- Updated the `DeepCopyInto` method to ensure proper copying of the new fields.
- Implemented logic in the restore process to expand wildcard patterns for included and excluded namespaces, improving flexibility in namespace selection during restores.
- Enhanced logging to provide insights into the expanded namespaces.
These changes improve the functionality and maintainability of the restore process, aligning with recent enhancements in wildcard handling.
Signed-off-by: Joseph <jvaikath@redhat.com>
* Refactor Backup and Restore CRDs to enhance wildcard namespace handling
- Renamed fields in `BackupStatus` and `RestoreStatus` from `ExpandedIncludedNamespaces` and `ExpandedExcludedNamespaces` to `IncludeWildcardMatches` and `ExcludeWildcardMatches` for clarity.
- Introduced a new field `WildcardResult` to record the final namespaces after applying wildcard logic.
- Updated the `DeepCopyInto` methods to accommodate the new field names and ensure proper data handling.
- Enhanced comments to reflect the changes and improve documentation clarity.
These updates improve the maintainability and clarity of the CRDs, aligning with recent enhancements in wildcard handling.
Signed-off-by: Joseph <jvaikath@redhat.com>
* Enhance wildcard namespace handling in Backup and Restore processes
- Updated `BackupRequest` and `Restore` status structures to include a new field `WildcardResult`, which captures the final list of namespaces after applying wildcard logic.
- Renamed existing fields to `IncludeWildcardMatches` and `ExcludeWildcardMatches` for improved clarity.
- Enhanced logging to provide detailed insights into the expanded namespaces and final results during backup and restore operations.
- Introduced a new utility function `GetWildcardResult` to streamline the selection of namespaces based on include/exclude criteria.
These changes improve the clarity and functionality of namespace selection in both backup and restore processes, aligning with recent enhancements in wildcard handling.
Signed-off-by: Joseph <jvaikath@redhat.com>
* Refactor namespace wildcard expansion logic in restore process
- Moved the wildcard expansion logic into a dedicated method, `expandNamespaceWildcards`, improving code organization and readability.
- Enhanced error handling and logging to provide detailed insights into the expanded namespaces during the restore operation.
- Updated the restore context with expanded namespace patterns and final results, ensuring clarity in the restore status.
These changes improve the maintainability and clarity of the restore process, aligning with recent enhancements in wildcard handling.
Signed-off-by: Joseph <jvaikath@redhat.com>
* Add checks for "*" in exclude
Signed-off-by: Joseph <jvaikath@redhat.com>
* Rebase
Signed-off-by: Joseph <jvaikath@redhat.com>
* Create NamespaceIncludesExcludes to get full NS listing for backup w/
Signed-off-by: Scott Seago <sseago@redhat.com>
Signed-off-by: Joseph <jvaikath@redhat.com>
* Add new NamespaceIncludesExcludes struct
Signed-off-by: Joseph <jvaikath@redhat.com>
* Move namespace expansion logic
Signed-off-by: Joseph <jvaikath@redhat.com>
* Update backup status with expansion
Signed-off-by: Joseph <jvaikath@redhat.com>
* Wildcard status update
Signed-off-by: Joseph <jvaikath@redhat.com>
* Skip ns check if wildcard expansion
Signed-off-by: Joseph <jvaikath@redhat.com>
* Move wildcard expansion to getResourceItems
Signed-off-by: Joseph <jvaikath@redhat.com>
* lint
Signed-off-by: Joseph <jvaikath@redhat.com>
* Changelog
Signed-off-by: Joseph <jvaikath@redhat.com>
* linting issues
Signed-off-by: Joseph <jvaikath@redhat.com>
* Remove wildcard restore to check if tests pass
Signed-off-by: Joseph <jvaikath@redhat.com>
* Fix namespace mapping test bug from lint fix
The previous commit (0a4aabcf4) attempted to fix linting issues by
using strings.Builder, but incorrectly wrote commas to a separate
builder and concatenated them at the end instead of between namespace
mappings.
This caused the namespace mapping string to be malformed:
Before: ns-1:ns-1-mapped,ns-2:ns-2-mapped
Bug: ns-1:ns-1-mappedns-2:ns-2-mapped,,
The malformed string was parsed as a single mapping with an invalid
namespace name containing a colon, causing Kubernetes to reject it:
"ns-1-mappedns-2:ns-2-mapped" is invalid
Fix by properly using strings.Builder to construct the mapping string
with commas between entries, addressing both the linting concern and
the functional bug.
Fixes the MultiNamespacesMappingResticTest and
MultiNamespacesMappingSnapshotTest failures.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
Signed-off-by: Joseph <jvaikath@redhat.com>
* Fix wildcard namespace expansion edge cases
This commit fixes two bugs in the wildcard namespace expansion feature:
1. Empty wildcard results: When a wildcard pattern (e.g., "invalid*")
matched no namespaces, the backup would incorrectly back up ALL
namespaces instead of backing up nothing. This was because the empty
includes list was indistinguishable from "no filter specified".
Fix: Added wildcardExpanded flag to NamespaceIncludesExcludes to
track when wildcard expansion has occurred. When true and the
includes list is empty, ShouldInclude now correctly returns false.
2. Premature namespace filtering: An earlier attempt to fix bug #1
filtered namespaces too early in collectNamespaces, breaking
LabelSelector tests where namespaces should be included based on
resources within them matching the label selector.
Fix: Removed the premature filtering and rely on the existing
filterNamespaces call at the end of getAllItems, which correctly
handles both wildcard expansion and label selector scenarios.
The fixes ensure:
- Wildcard patterns matching nothing result in empty backups
- Label selectors still work correctly (namespace included if any
resource in it matches the selector)
- State is preserved across multiple ResolveNamespaceList calls
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
Signed-off-by: Joseph <jvaikath@redhat.com>
* Run wildcard expansion during backup processing
Signed-off-by: Joseph <jvaikath@redhat.com>
* Lint fix
Signed-off-by: Joseph <jvaikath@redhat.com>
* Improve coverage
Signed-off-by: Joseph <jvaikath@redhat.com>
* gofmt fix
Signed-off-by: Joseph <jvaikath@redhat.com>
* Add wildcard details to describe backup status
Signed-off-by: Joseph <jvaikath@redhat.com>
* Revert "Remove wildcard restore to check if tests pass"
This reverts commit 4e22c2af855b71447762cb0a9fab7e7049f38a5f.
Signed-off-by: Joseph <jvaikath@redhat.com>
* Add restore describe for wildcard namespaces Revert restore wildcard removal
Signed-off-by: Joseph <jvaikath@redhat.com>
* Add coverage
Signed-off-by: Joseph <jvaikath@redhat.com>
* Lint
Signed-off-by: Joseph <jvaikath@redhat.com>
* Remove unintentional changes
Signed-off-by: Joseph <jvaikath@redhat.com>
* Remove wildcard status fields and mentionsRemove usage of wildcard fields for backup and restore status.
Signed-off-by: Joseph <jvaikath@redhat.com>
* Remove status update changelog line
Signed-off-by: Joseph <jvaikath@redhat.com>
* Rename getNamespaceIncludesExcludes
Signed-off-by: Scott Seago <sseago@redhat.com>
Signed-off-by: Scott Seago <sseago@redhat.com>
* Rewrite brace pattern validation
Signed-off-by: Joseph <jvaikath@redhat.com>
* Different var for internal loop
Signed-off-by: Joseph <jvaikath@redhat.com>
---------
Signed-off-by: Joseph <jvaikath@redhat.com>
Signed-off-by: Scott Seago <sseago@redhat.com>
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
Co-authored-by: Scott Seago <sseago@redhat.com>
Co-authored-by: Tiger Kaovilai <tkaovila@redhat.com>
Co-authored-by: Claude <noreply@anthropic.com>
Bump golangci-lint to v1.25.0, because golangci-lint start to support
Golang v1.25 since v1.24.0, and v1.26.x was not stable yet.
Align action pr-linter-check's golangci-lint version to v1.25.0
Signed-off-by: Xun Jiang <xun.jiang@broadcom.com>
Add documentation explaining how volume policies are applied before
VGS grouping, including examples and troubleshooting guidance for the
multiple CSI drivers scenario.
Signed-off-by: Shubham Pampattiwar <spampatt@redhat.com>
VolumeGroupSnapshots were querying all PVCs with matching labels
directly from the cluster without respecting volume policies. This
caused errors when labeled PVCs included both CSI and non-CSI volumes,
or volumes from different CSI drivers that were excluded by policies.
This change filters PVCs by volume policy before VGS grouping,
ensuring only PVCs that should be snapshotted are included in the
group. A warning is logged when PVCs are excluded from VGS due to
volume policy.
Fixes#9344
Signed-off-by: Shubham Pampattiwar <spampatt@redhat.com>
* Fix the Job build error when BackupReposiotry name longer than 63.
Fix the Job build error.
Consider the name length limitation change in job list code.
Use hash to replace the GetValidName function.
Signed-off-by: Xun Jiang <xun.jiang@broadcom.com>
* Use ref_name to replace ref.
Signed-off-by: Xun Jiang <xun.jiang@broadcom.com>
---------
Signed-off-by: Xun Jiang <xun.jiang@broadcom.com>
Update test expectations to include createdName field for resources
with action 'created'. Also ensure namespaces track their created
names when created via EnsureNamespaceExistsAndIsReady.
Signed-off-by: Shubham Pampattiwar <spampatt@redhat.com>
When restoring resources with GenerateName, Kubernetes assigns the actual name
after creation, but Velero only tracked the original name from the backup in
itemKey. This caused volume information collection to fail when trying to fetch
PVCs using the original name instead of the actual created name.
Example:
- Original PVC name from backup: "test-vm-disk-1"
- Actual created PVC name: "test-vm-backup-2025-10-27-test-vm-disk-1-mdjkd"
- Volume info tried to fetch: "test-vm-disk-1" → Failed with "not found"
This affects any plugin or workflow using GenerateName during restore:
- kubevirt-velero-plugin (VMFR use case with PVC collision avoidance)
- Custom restore item actions using generateName
- Secrets/ConfigMaps restored with generateName
Changes:
1. Add createdName field to restoredItemStatus struct (pkg/restore/request.go)
2. Capture actual name from createdObj.GetName() (pkg/restore/restore.go:1520)
3. Use createdName in RestoredResourceList() when available (pkg/restore/request.go:93-95)
This fix is backwards compatible:
- createdName defaults to empty string
- When empty, falls back to itemKey.name (original behavior)
- Only populated for GenerateName resources where needed
Fixes volume information collection errors like:
"Failed to get PVC" error="persistentvolumeclaims \"<original-name>\" not found"
Signed-off-by: Shubham Pampattiwar <spampatt@redhat.com>
When restoring resources with GenerateName (where name is empty and K8s
assigns the actual name), the managed fields patch was failing with error
"name is required" because it was using obj.GetName() which returns empty
for GenerateName resources.
The fix uses createdObj.GetName() instead, which contains the actual name
assigned by Kubernetes after resource creation.
This affects any resource using GenerateName for restore, including:
- PersistentVolumeClaims restored by kubevirt-velero-plugin
- Secrets and ConfigMaps created with generateName
- Any custom resources using generateName
Changes:
- Line 1707: Use createdObj.GetName() instead of obj.GetName() in Patch call
- Lines 1702, 1709, 1713, 1716: Use createdObj in error/info messages for accuracy
This is a backwards-compatible fix since:
- For resources WITHOUT generateName: obj.GetName() == createdObj.GetName()
- For resources WITH generateName: createdObj.GetName() has the actual name
The managed fields patch was already correctly using createdObj (lines 1698-1700),
only the Patch() call was incorrectly using obj.
Fixes restore status showing FinalizingPartiallyFailed with "name is required"
error when restoring resources with GenerateName.
Signed-off-by: Shubham Pampattiwar <spampatt@redhat.com>
Add error message in the velero install CLI output if VerifyJSONConfigs fail.
Only allow one element in node-agent-configmap's Data.
Signed-off-by: Xun Jiang <xun.jiang@broadcom.com>
main branch will read go version from go.mod's go primitive, and
only keep major and minor version, because we want the actions to use
the lastest patch version automatically, even the go.mod specify version
like 1.24.0.
release branch can read the go version from go.mod file by setup-go
action's own logic.
Refactor the get Go version to reusable workflow.
Signed-off-by: Xun Jiang <xun.jiang@broadcom.com>
The Bitnami MinIO image bitnami/minio:2021.6.17-debian-10-r7 is no longer
available on Docker Hub, causing E2E tests to fail.
This change implements a solution to build the MinIO image locally from
Bitnami's public Dockerfile and cache it for subsequent runs:
- Fetches the latest commit hash of the Bitnami MinIO Dockerfile
- Uses GitHub Actions cache to store/retrieve built images
- Only rebuilds when the upstream Dockerfile changes
- Maintains compatibility with existing environment variables
Fixes#9279🤖 Generated with [Claude Code](https://claude.ai/code)
Update .github/workflows/e2e-test-kind.yaml
Signed-off-by: Tiger Kaovilai <passawit.kaovilai@gmail.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
Fix GetResourceWithLabel's bug: labels were not applied.
Add workOS for deployment and pod creationg.
Add OS label for select node.
Enlarge the context timeout to 10 minutes. 5 min is not enough for Windows.
Enlarge the Kibishii test context to 15 minutes for Windows.
Signed-off-by: Xun Jiang <xun.jiang@broadcom.com>
- Expanded the design to include detailed implementation steps for wildcard expansion in both backup and restore operations.
- Added new status fields to the backup and restore CRDs to track expanded wildcard namespaces.
- Clarified the approach to ensure backward compatibility with existing `*` behavior.
- Addressed limitations and provided insights on restore operations handling wildcard-expanded backups.
This update aims to provide a comprehensive and clear framework for implementing wildcard namespace support in Velero.
Signed-off-by: Joseph <jvaikath@redhat.com>
- Clarified the use of the standalone `*` character in namespace specifications.
- Ensured consistent formatting for `*` throughout the document.
- Maintained focus on backward compatibility and limitations regarding wildcard usage.
This update enhances the clarity and consistency of the design document for implementing wildcard namespace support in Velero.
Signed-off-by: Joseph <jvaikath@redhat.com>
- Updated the abstract to clarify the current limitations of namespace specifications in Velero.
- Expanded the goals section to include specific objectives for implementing wildcard patterns in `--include-namespaces` and `--exclude-namespaces`.
- Detailed the high-level design and implementation steps, including the addition of new status fields in the backup CRD and the creation of a utility package for wildcard expansion.
- Addressed backward compatibility and known limitations regarding the use of wildcards alongside the existing "*" character.
This update aims to provide a comprehensive overview of the proposed changes for improved namespace selection flexibility.
Signed-off-by: Joseph <jvaikath@redhat.com>
This commit add content to cover "includeExcludePolicy" in resource
policies.
It also tweak the wordings to clarify the "volume policy" and "resource
policies"
Signed-off-by: Daniel Jiang <daniel.jiang@broadcom.com>
Fixes#4201: Ensure PriorityClasses are restored before pods that
reference them, preventing restoration failures when pods depend on
custom PriorityClasses.
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
It failed with fetching the wrong VolumeInfo. Correct it.
Add AdditionalBSL label for applied cases.
Remove not needed default BSL kibishii test for additional bsl case.
Signed-off-by: Xun Jiang <xun.jiang@broadcom.com>
PVB and PVR used to print related pod namespace in output.
In v1.17, the behavior changed. Use backup or restore name to filter them.
Shorten the timeout context from 1h to 5m, because AWS was not covered anymore.
Remove an used if branch for vSphere.
Signed-off-by: Xun Jiang <xun.jiang@broadcom.com>
- Add --server-priority-class-name and --node-agent-priority-class-name flags to velero install command
- Configure data mover pods (PVB/PVR/DataUpload/DataDownload) to use priority class from node-agent-configmap
- Configure maintenance jobs to use priority class from repo-maintenance-job-configmap (global config only)
- Add priority class validation with ValidatePriorityClass and GetDataMoverPriorityClassName utilities
- Update e2e tests to include PriorityClass testing utilities
- Move priority class design document to Implemented folder
- Add comprehensive unit tests for all priority class implementations
- Update documentation for priority class configuration
- Add changelog entry for #8883
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
remove unused test utils
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
feat: add unit test for getting priority class name in maintenance jobs
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
doc update
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
feat: add priority class validation for repository maintenance jobs
- Add ValidatePriorityClassWithClient function to validate priority class existence
- Integrate validation in maintenance.go when creating maintenance jobs
- Update tests to cover the new validation functionality
- Return boolean from ValidatePriorityClass to allow fallback behavior
This ensures maintenance jobs don't fail due to non-existent priority classes,
following the same pattern used for data mover pods.
Addresses feedback from:
https://github.com/vmware-tanzu/velero/pull/8883#discussion_r2238681442
Refs #8869
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
refactor: clean up priority class handling for data mover pods
- Fix comment in node_agent.go to clarify PriorityClassName is only for data mover pods
- Simplify server.go to use dataPathConfigs.PriorityClassName directly
- Remove redundant priority class logging from controllers as it's already logged during server startup
- Keep logging centralized in the node-agent server initialization
This reduces code duplication and clarifies the scope of priority class configuration.
🤖 Generated with [Claude Code](https://claude.ai/code)
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
refactor: remove GetDataMoverPriorityClassName from kube utilities
Remove GetDataMoverPriorityClassName function and its tests as priority
class is now read directly from dataPathConfigs instead of parsing from
ConfigMap. This simplifies the codebase by eliminating the need for
indirect ConfigMap parsing.
Refs #8869🤖 Generated with [Claude Code](https://claude.ai/code)
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
refactor: remove priority class validation from install command
Remove priority class validation during install as it's redundant
since validation already occurs during server startup. Users cannot
see console logs during install, making the validation warnings
ineffective at this stage.
The validation remains in place during server and node-agent startup
where it's more appropriate and visible to users.
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
Co-Authored-By: Claude <noreply@anthropic.com>
fixes#8610
This commit extends the resources policy, such that user can define
resource include exclude filters in the policy and reuse it in different backups.
Signed-off-by: Daniel Jiang <daniel.jiang@broadcom.com>
Addresses #9133 by adding clear documentation about the current limitation
where only the first element in the loadAffinity array is processed.
Changes:
- Added prominent warning at the beginning of loadAffinity section
- Updated misleading examples that showed multiple array elements
- Added warnings to each multi-element example explaining the limitation
- Clarified that the recommended approach is to combine all conditions
into a single loadAffinity element using both matchLabels and matchExpressions
This provides the "bare minimum" documentation clarification requested
in the issue until a code fix can be implemented.
🤖 Generated with [Claude Code](https://claude.ai/code)
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
Apply suggestion from @kaovilai
Signed-off-by: Tiger Kaovilai <passawit.kaovilai@gmail.com>
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
Apply suggestion from @kaovilai
Signed-off-by: Tiger Kaovilai <passawit.kaovilai@gmail.com>
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
Apply suggestion from @kaovilai
Signed-off-by: Tiger Kaovilai <passawit.kaovilai@gmail.com>
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
feat: Add CA cert fallback when caCertFile fails in download requests
- Fallback to BSL cert when caCertFile cannot be opened
- Combine certificate handling blocks to reuse CA pool initialization
- Add comprehensive unit tests for fallback behavior
This improves robustness by allowing downloads to proceed with BSL CA cert
when the provided CA cert file is unavailable or unreadable.
🤖 Generated with [Claude Code](https://claude.ai/code)
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
Co-Authored-By: Claude <noreply@anthropic.com>
If distributed snapshotting is enabled in the external snapshotter a manager label is added to the volume snapshot content. When exposing the snapshot velero needs to keep this label around otherwise the exposed snapshot will never become ready.
Signed-off-by: Felix Prasse <1330854+flx5@users.noreply.github.com>
* Enhance File System Backup documentation with details on volume snapshot behavior and backup method decision flow
Fixes: Velero AWS snapshots not occurring with the AWS plugin #9090
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
* Clarify conditions for excluding volumes from File System Backup and enhance decision flow for volume snapshots
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
---------
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
The ResticIdentifier field in BackupRepository is only relevant for restic
repositories. For kopia repositories, this field is unused and should be
omitted. This change:
- Adds omitempty tag to ResticIdentifier field in BackupRepository CRD
- Updates controller to only populate ResticIdentifier for restic repos
- Adds tests to verify behavior for both restic and kopia repository types
This ensures backward compatibility while properly handling kopia repositories
that don't require a restic-compatible identifier.
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
The reason is toolchain cannot update automatically.
If 1.24 can force CI use the latest patch version,
and it will not force user to upgrade their local go version,
this should be the better approach.
Signed-off-by: Xun Jiang <xun.jiang@broadcom.com>
add changelog file
Show defaultVolumesToFsBackup in describe only when set by the user
minor ut fix
minor fix
Signed-off-by: Shubham Pampattiwar <spampatt@redhat.com>
* Refactor backup volume info retrieval and snapshot checkpoint building in e2e tests
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>
log backup volume info retrieval and snapshot checkpoint building
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>
Add error handling for volume info retrieval in backup tests
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>
Add error handling for volume info retrieval in backup tests
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>
* Update snapshot checkpoint building to use DefaultKibishiiWorkerCounts
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>
---------
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>
This PR fixes issue #8870 where Velero was unnecessarily adding the restore-wait init container when restoring pods with volumes that were backed up using native datamover or CSI.
When restoring pods with volumes, Velero was always adding the restore-wait init container, even when the volumes were backed up using native datamover or CSI and didn't need file system restores. This was causing unnecessary overhead and potential issues.
PVR action to remove restore-wait init container on restore
Changes:
- Remove ALL existing restore-wait init containers before deciding whether to add a new one
- This covers both scenarios: when no file system restore is needed AND when preventing duplicates
- Simplify the add logic since we've already cleaned up existing containers
- Add better logging to show how many containers were removed
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
* Clarify thirdparty label/annotations on the maintenance jobs
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
* Clarify that maintenance jobs do not inherit all labels/annotations
- Address PR review feedback and issue #8974
- Make it explicit that only specific predefined third-party labels and annotations are propagated
- Add Important note to prevent user confusion about label/annotation inheritance behavior
- Currently only azure.workload.identity/use label and iam.amazonaws.com/role annotation are inherited
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
---------
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
Co-authored-by: Xun Jiang/Bruce Jiang <59276555+blackpiglet@users.noreply.github.com>
* Add support for pod labels and service account annotations in Velero configuration
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>
* Refactor Velero configuration to use string types for pod labels and service account annotations
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>
* Please notice only Kibishii workload support Windows test,
because the other work loads use busybox image, and not support Windows.
* Refactor CreateFileToPod to support Windows.
* Add skip logic for migration test if the version is under 1.16.
* Add main in semver check.
Signed-off-by: Xun Jiang <xun.jiang@broadcom.com>
Migration cases use the Kibishii as the workload, and SC mapping
ConfigMap was needed for all scenarios, because standby cluster
doesn't have the Kibishii SC after setting up.
Signed-off-by: Xun Jiang <xun.jiang@broadcom.com>
The restore workflow used name represents the backup resource and the
restore to be restored, but the restored resource name may be different
from the backup one, e.g. PV and VSC are global resources, to avoid
conflict, need to rename them.
Reanme the name variable to backupResourceName, and use obj.GetName()
for restore operation.
Signed-off-by: Xun Jiang <xun.jiang@broadcom.com>
Enables the node-agent to start even if the
/host_pods path does not exist.
If the path is present, the existing logic
remains unchanged, ensuring it is readable.
Signed-off-by: Michal Pryc <mpryc@redhat.com>
Combine existing PVC non-CSI RIAs and move annotation
removal out of the CSI plugin to fix issues with
CSI volumes when using fs-backup
Signed-off-by: Scott Seago <sseago@redhat.com>
This change fixes a minor typo in the Backup Hooks documentation, changing "Defaults is" to "Defaults to".
Signed-off-by: Andreas Lindhé <7773090+lindhe@users.noreply.github.com>
Also bumped to support upgraded k8s.io/ deps.
- controller-gen to v0.16.5
- sigs.k8s.io/controller-runtime v0.19.2
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
Run backup post hooks inside ItemBlock synchronously as the ItemBlocks are handled asynchronously
Fixes#8516
Signed-off-by: Wenkai Yin(尹文开) <yinw@vmware.com>
This commit makes sure when a backup is deleted the controller will
delete the CSI snapshot even when the bakckup tarball is not uploaded.
fixes#8160
Signed-off-by: Daniel Jiang <daniel.jiang@broadcom.com>
Check the PVB status via podvolume Backupper rather than calling API server to avoid API server issue
Fixes#8587
Signed-off-by: Wenkai Yin(尹文开) <yinw@vmware.com>
This commit makes change in restore finalizer controller, to make it
check the status in item operation of a PVC before patch the PV that is
bound to it. If the operation is not successful it will skip patching
the PV.
Signed-off-by: Daniel Jiang <daniel.jiang@broadcom.com>
The issue is caused by the changes of controller-runtime: WithEventFilter() doesn't apply to WatchesRawSource(),
this commit set Predicate for WatchesRawSource() seperatedly
Fixes#8437
Signed-off-by: Wenkai Yin(尹文开) <yinw@vmware.com>
* issue 8433: add ask label to data mover pods
Signed-off-by: Lyndon-Li <lyonghui@vmware.com>
* check existence of the same label from node-agent
Signed-off-by: Lyndon-Li <lyonghui@vmware.com>
---------
Signed-off-by: Lyndon-Li <lyonghui@vmware.com>
This commit adds SecurityContext that complies with "restricted" level
per Pod Security Standards to "restore-helper" initContainer.
It ensures the restore won't fail when the cluster enforces PSA.
Signed-off-by: Daniel Jiang <daniel.jiang@broadcom.com>
* Only install and uninstall SC and VSC once for default cluster.
* Install and uninstall SC and VSC for standby cluster on migration case.
* Refactor the StorageClass and VolumeSnapshotClass YAMLs.
* Prettify the e2e_suite_test.go
Signed-off-by: Xun Jiang <xun.jiang@broadcom.com>
Run the E2E test on kind / run-e2e-test (1.23.17, (NamespaceMapping && Single && Restic) || (NamespaceMapping && Multiple && Restic)) (push) Has been skipped
Run the E2E test on kind / run-e2e-test (1.23.17, ResourceModifier || (Backups && BackupsSync) || PrivilegesMgmt || OrderedResources) (push) Has been skipped
Run the E2E test on kind / run-e2e-test (1.24.17, (NamespaceMapping && Single && Restic) || (NamespaceMapping && Multiple && Restic)) (push) Has been skipped
Run the E2E test on kind / run-e2e-test (1.24.17, ResourceModifier || (Backups && BackupsSync) || PrivilegesMgmt || OrderedResources) (push) Has been skipped
Run the E2E test on kind / run-e2e-test (1.25.16, (NamespaceMapping && Single && Restic) || (NamespaceMapping && Multiple && Restic)) (push) Has been skipped
Run the E2E test on kind / run-e2e-test (1.25.16, ResourceModifier || (Backups && BackupsSync) || PrivilegesMgmt || OrderedResources) (push) Has been skipped
Run the E2E test on kind / run-e2e-test (1.26.13, (NamespaceMapping && Single && Restic) || (NamespaceMapping && Multiple && Restic)) (push) Has been skipped
Run the E2E test on kind / run-e2e-test (1.26.13, ResourceModifier || (Backups && BackupsSync) || PrivilegesMgmt || OrderedResources) (push) Has been skipped
Run the E2E test on kind / run-e2e-test (1.27.10, (NamespaceMapping && Single && Restic) || (NamespaceMapping && Multiple && Restic)) (push) Has been skipped
Run the E2E test on kind / run-e2e-test (1.27.10, ResourceModifier || (Backups && BackupsSync) || PrivilegesMgmt || OrderedResources) (push) Has been skipped
Run the E2E test on kind / run-e2e-test (1.28.6, (NamespaceMapping && Single && Restic) || (NamespaceMapping && Multiple && Restic)) (push) Has been skipped
Run the E2E test on kind / run-e2e-test (1.28.6, ResourceModifier || (Backups && BackupsSync) || PrivilegesMgmt || OrderedResources) (push) Has been skipped
Run the E2E test on kind / run-e2e-test (1.29.1, (NamespaceMapping && Single && Restic) || (NamespaceMapping && Multiple && Restic)) (push) Has been skipped
Run the E2E test on kind / run-e2e-test (1.29.1, ResourceModifier || (Backups && BackupsSync) || PrivilegesMgmt || OrderedResources) (push) Has been skipped
* Add new flag HAS_VSPHERE_PLUGIN for E2E test.
* Modify the E2E README for the new parameter.
* Add the VolumeSnapshotClass for VKS.
* Modify the plugin install logic.
* Modify the cases to support data mover case in VKS.
Signed-off-by: Xun Jiang <xun.jiang@broadcom.com>
Always test latest available patch version of each supported k8s version available in Kindest/node images.
ie. This adds v1.31, v1.30 to test matrix and upgrade patch versions for others.
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
This commit implements a new --annotations flag in the backup and restore create commands.
This allows users to specify key-value pairs for annotations directly at the time of backup and restore creation, in the same way as the --labels flag.
Signed-off-by: Alvaro Romero <alromero@redhat.com>
by changing output type to image.
Then you can execute command like so to create a multi-arch image
```
BUILDX_PLATFORMS=linux/amd64,linux/arm64 BUILDX_OUTPUT_TYPE=image make container
```
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
As with other plugin types, the information on how to implement
an IBA plugin will be in the velero-plugin-example repo.
Signed-off-by: Scott Seago <sseago@redhat.com>
Rename backup-repository-config to backup-repository-configmap.
Rename repo-maintenance-job-config to repo-maintenance-job-configmap.
Rename node-agent-config to node-agent-configmap.
Add those three parameters to `velero install` CLI.
Modify the design and the site documents.
Signed-off-by: Xun Jiang <xun.jiang@broadcom.com>
package "k8s.io/api/core/v1" is being imported more than once (ST1019)
pvc_pv_test.go:32:2: other import of "k8s.io/api/core/v1"
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
The code changes are related to the `NewPeriodicalEnqueueSource` function in the `kube/periodical_enqueue_source.go` file. This function is used to create a new instance of the `PeriodicalEnqueueSource` struct, which is responsible for periodically enqueueing objects into a work queue.
The changes involve adding two new parameters to this function: `controllerName string` and modifying the existing `logger` parameter to include additional fields.
Here's what changed:
1. A new `controllerName` parameter was added to the `NewPeriodicalEnqueueSource` function.
These changes are to adding more context or metadata to the logging output, possibly for debugging or monitoring purposes.
The other files (`restore_operations_controller.go`, `schedule_controller.go`, and their respective test files) were modified to use this updated `NewPeriodicalEnqueueSource` function with the new `controllerName` parameter.
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
Changed the tests to use mocked function that will not read actual
secrets from env variables nor AWS config file that may be
on the system that is running tests.
As a second guard against exposed secrets comparison for the values
does not shows the actual values for the AWS data. This is to prevent
situation where programming error may still allow the test to read
AWS config/env variables instead of using mocked function.
Signed-off-by: Michal Pryc <mpryc@redhat.com>
* Correcting Openshift on IBM Documentation Error
I have to admit to some significant error and embarrassment regarding the documentation update
about Openshift on IBM Cloud pull request https://github.com/vmware-tanzu/velero/pull/8069.
I will correct my error before it gets any further.
Just exposing /var/data/kubelet/pods is incorrect and host path /var/lib/kubelet/pods should remain unchanged.
The errors with the defaults during csi snapshot data movement were:
data path backup failed: Failed to run kopia backup: unable to get local
block device entry: resolveSymlink: lstat /var/data/: no such
file or directory
I suspected this was the same as RancherOS and Nutanix. It is not.
The original tested changes changed both /var/lib/kubelet/{pods,plugins} to
/var/data/kubelet/{pods,plugins}.
The published changes only result in the error:
```
status:
completionTimestamp: '2024-08-02T17:12:29Z'
message: >-
data path backup failed: Failed to run kopia backup: unable to get local
block device entry: resolveSymlink: lstat /var/data/kubelet/plugins: no such
file or directory
node: 10.240.0.5
phase: Failed
progress: {}
startTimestamp: '2024-08-02T17:12:11Z'
```
After making continued modifications to the daemonset the correct configuration was:
```
volumeMounts:
- name: host-pods
mountPath: /host_pods
mountPropagation: HostToContainer
- name: host-plugins
mountPath: /var/data/kubelet/plugins
mountPropagation: HostToContainer
```
```
volumes:
- name: host-pods
hostPath:
path: /var/lib/kubelet/pods
type: ''
- name: host-plugins
hostPath:
path: /var/data/kubelet/plugins
type: ''
```
Only the changes to the plugin path were required.
The plugin path changes were required to both the mount path and the host path.
Regardless of whether /var/lib/kubelet/pods or /var/data/kubelet/pods host path, backups and restore
succeeded provided the plugin path was modified.
```
volumeMounts:
- name: host-pods
mountPath: /host_pods
mountPropagation: HostToContainer
- name: host-plugins
mountPath: /var/data/kubelet/plugins
mountPropagation: HostToContainer
```
```
volumes:
- name: host-pods
hostPath:
path: /var/data/kubelet/pods
type: ''
- name: host-plugins
hostPath:
path: /var/data/kubelet/plugins
type: ''
```
After getting on-host access was able to confirm. Pods are at /var/lib/kubelet/pods.
```
ls /var/lib/kubelet/pods
07c0be63-335d-4cfb-b39f-816bc2fb32cd
51f31b3e-4710-4ef0-8626-5f1a78a624b2
a4802fd3-3b62-45a4-8f21-974880b6f92a
cccb35c9-b4f9-4ca9-a697-736ae64f09ad
0a5d4366-7fa1-4525-9e45-a43a362b8542
558b0643-0661-4d4a-b03e-aac60c6ad710
a4b106fb-5b7b-48e5-828a-ea7b41ba0e59
ce1290e1-4330-4df6-8166-14784bcce930
```
On host the volumes are in /var/data/kubelet/plugins.
```
ls /var/data/kubelet/plugins/kubernetes.io/csi/openshift-storage.cephfs.csi.ceph.com/231e04896c4f528efb95d23a3c153db9fc4a7206b7320f74443f30de7228dba5/globalmount/velero/backups/backup-resources-41d84d16-47a7-4ea8-a9cb-6348d01bcb2c/
backup-resources-41d84d16-47a7-4ea8-a9cb-6348d01bcb2c-csi-volumesnapshotclasses.json.gz
backup-resources-41d84d16-47a7-4ea8-a9cb-6348d01bcb2c-resource-list.json.gz
backup-resources-41d84d16-47a7-4ea8-a9cb-6348d01bcb2c-csi-volumesnapshotcontents.json.gz
backup-resources-41d84d16-47a7-4ea8-a9cb-6348d01bcb2c-results.gz
backup-resources-41d84d16-47a7-4ea8-a9cb-6348d01bcb2c-csi-volumesnapshots.json.gz
backup-resources-41d84d16-47a7-4ea8-a9cb-6348d01bcb2c-volumesnapshots.json.gz
backup-resources-41d84d16-47a7-4ea8-a9cb-6348d01bcb2c-itemoperations.json.gz
backup-resources-41d84d16-47a7-4ea8-a9cb-6348d01bcb2c.tar.gz
backup-resources-41d84d16-47a7-4ea8-a9cb-6348d01bcb2c-logs.gz
velero-backup.json
backup-resources-41d84d16-47a7-4ea8-a9cb-6348d01bcb2c-podvolumebackups.json.gz
```
With the volume config changed to expose /var/data/kubelet/plugins for the plugin hostPath, the DataUploads and DataDownloads
succeed for both Filesystem and Block mode PVCs.
```
status:
completionTimestamp: '2024-08-02T17:23:33Z'
node: 10.240.0.5
path: >-
/host_pods/7fcb9d56-7885-437c-acd3-67db6b1ee8ae/volumeDevices/kubernetes.io~csi/pvc-47b91f56-db8c-44bf-9ecc-737170561b4b
phase: Completed
progress:
bytesDone: 5368709120
totalBytes: 5368709120
snapshotID: 8faae36b3592fee4efbfad024f26033e
startTimestamp: '2024-08-02T17:21:22Z'
```
```
status:
completionTimestamp: '2024-08-02T18:42:19Z'
node: 10.240.0.5
phase: Completed
progress:
bytesDone: 5368709120
totalBytes: 5368709120
startTimestamp: '2024-08-02T18:41:00Z'
```
My apologies for the error.
Signed-off-by: Michael Fruchtman <msfrucht@us.ibm.com>
* Add context to plugins mountPath
Signed-off-by: Michael Fruchtman <msfrucht@us.ibm.com>
---------
Signed-off-by: Michael Fruchtman <msfrucht@us.ibm.com>
This commit makes sure the dbr's status is "Processed" when an error
happens before the actual deletion is started
fixes#7812
Signed-off-by: Daniel Jiang <daniel.jiang@broadcom.com>
Added option checksumAlgorith, this stops 403 errors as per https://github.com/vmware-tanzu/velero/issues/7543
Added plugins line as velero install failed without this option in version 1.14.0
Removed the volumesnapshotlocation as it does not exist in 1.14.0
Signed-off-by: Gareth Anderson <gareth.anderson03@gmail.com>
Updates the documentation for CSI snapshot data movement for OpenShift
on IBM Cloud.
The default hostpath /var/lib/kubelet/pods cannot find
PersistentVolumeClaims with volumeMode: Block on host.
The correct hostpath for OpenShift on IBM Cloud is
/var/data/kubelet/pods.
Signed-off-by: Michael Fruchtman <msfrucht@us.ibm.com>
Breaking change (can be mitigated if needed in the future): v1 branch accepted an optional seconds field at the beginning of the cron spec. This is non-standard and has led to a lot of confusion. The new default parser conforms to the standard as described by [the Cron wikipedia page.](https://en.wikipedia.org/wiki/Cron). It is unlikely that this affects us per https://github.com/vmware-tanzu/velero/pull/31
Other notes:
> CRON_TZ is now the recommended way to specify the timezone of a single schedule, which is sanctioned by the specification. The legacy "TZ=" prefix will continue to be supported since it is unambiguous and easy to do so.
References: https://pkg.go.dev/github.com/robfig/cron/v3#readme-upgrading-to-v3-june-2019
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
add changelog file
change log level and add more detailed comments
make update
add return for sc get call if error
Signed-off-by: Shubham Pampattiwar <spampatt@redhat.com>
Check whether the namespaces specified in the
backup.Spec.IncludeNamespaces exist during backup resource collcetion
If not, log error to mark the backup as PartiallyFailed.
Signed-off-by: Xun Jiang <blackpigletbruce@gmail.com>
This commit fixes#7849.
It will use PVC instead of PV to track CSI snapshots to generate restore
volume info metadata. So that in the case the PVC is not bound to PV
the metadata can be populated correctly.
Signed-off-by: Daniel Jiang <daniel.jiang@broadcom.com>
When dry-run the tag-release.sh, there's an error
"fatal: detected dubious ownership in repository at
'/github.com/vmware-tanzu/velero'"
This commit works around this issue to make sure "tag-release.sh"
can finish successful
Signed-off-by: Daniel Jiang <daniel.jiang@broadcom.com>
- [ ] [Accepted the DCO](https://velero.io/docs/v1.5/code-standards/#dco-sign-off). Commits without the DCO will delay acceptance.
- [ ] [Created a changelog file](https://velero.io/docs/v1.5/code-standards/#adding-a-changelog) or added`/kind changelog-not-required`as a comment on this pull request.
- [ ] [Created a changelog file (`make new-changelog`)](https://velero.io/docs/main/code-standards/#adding-a-changelog) or comment`/kind changelog-not-required`on this PR.
- [ ] Updated the corresponding documentation in `site/content/docs/main`.
stale-issue-message:"This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. If a Velero team member has requested log or more information, please provide the output of the shared commands."
@@ -20,4 +20,4 @@ jobs:
days-before-pr-close:-1
# Only issues made after Feb 09 2021.
start-date:"2021-09-02T00:00:00"
exempt-issue-labels:"Epic,Area/CLI,Area/Cloud/AWS,Area/Cloud/Azure,Area/Cloud/GCP,Area/Cloud/vSphere,Area/CSI,Area/Design,Area/Documentation,Area/Plugins,Bug,Enhancement/User,kind/requirement,kind/refactor,kind/tech-debt,limitation,Needs investigation,Needs triage,Needs Product,P0 - Hair on fire,P1 - Important,P2 - Long-term important,P3 - Wouldn't it be nice if...,Product Requirements,Restic - GA,Restic,release-blocker,Security"
exempt-issue-labels:"Epic,Area/CLI,Area/Cloud/AWS,Area/Cloud/Azure,Area/Cloud/GCP,Area/Cloud/vSphere,Area/CSI,Area/Design,Area/Documentation,Area/Plugins,Bug,Enhancement/User,kind/requirement,kind/refactor,kind/tech-debt,limitation,Needs investigation,Needs triage,Needs Product,P0 - Hair on fire,P1 - Important,P2 - Long-term important,P3 - Wouldn't it be nice if...,Product Requirements,Restic - GA,Restic,release-blocker,Security,backlog"
Below is a list of adopters of Velero in **production environments** that have
@@ -68,6 +69,9 @@ Replicated uses the Velero open source project to enable snapshots in [KOTS][101
**[Microsoft Azure][105]**<br>
[Azure Backup for AKS][106] is an Azure native, Kubernetes aware, Enterprise ready backup for containerized applications deployed on Azure Kubernetes Service (AKS). AKS Backup utilizes Velero to perform backup and restore operations to protect stateful applications in AKS clusters.<br>
**[Broadcom][107]**<br>
[VMware Cloud Foundation][108] (VCF) offers built-in [vSphere Kubernetes Service][109] (VKS), a Kubernetes runtime that includes a CNCF certified Kubernetes distribution, to deploy and manage containerized workloads. VCF empowers platform engineers with native [Kubernetes multi-cluster management][110] capability for managing Kubernetes (K8s) infrastructure at scale. VCF utilizes Velero for Kubernetes data protection enabling platform engineers to back up and restore containerized workloads manifests & persistent volumes, helping to increase the resiliency of stateful applications in VKS cluster.
## Adding your organization to the list of Velero Adopters
If you are using Velero and would like to be included in the list of `Velero Adopters`, add an SVG version of your logo to the `site/static/img/adopters` directory in this repo and submit a [pull request][3] with your change. Name the image file something that reflects your company (e.g., if your company is called Acme, name the image acme.png). See this for an example [PR][4].
@@ -125,3 +129,8 @@ If you would like to add your logo to a future `Adopters of Velero` section on [
@@ -107,6 +107,29 @@ Lazy consensus does _not_ apply to the process of:
* Removal of maintainers from Velero
## Deprecation Policy
### Deprecation Process
Any contributor may introduce a request to deprecate a feature or an option of a feature by opening a feature request issue in the vmware-tanzu/velero GitHub project. The issue should describe why the feature is no longer needed or has become detrimental to Velero, as well as whether and how it has been superseded. The submitter should give as much detail as possible.
Once the issue is filed, a one-month discussion period begins. Discussions take place within the issue itself as well as in the community meetings. The person who opens the issue, or a maintainer, should add the date and time marking the end of the discussion period in a comment on the issue as soon as possible after it is opened. A decision on the issue needs to be made within this one-month period.
The feature will be deprecated by a supermajority vote of 50% plus one of the project maintainers at the time of the vote tallying, which is 72 hours after the end of the community meeting that is the end of the comment period. (Maintainers are permitted to vote in advance of the deadline, but should hold their votes until as close as possible to hear all possible discussion.) Votes will be tallied in comments on the issue.
Non-maintainers may add non-binding votes in comments to the issue as well; these are opinions to be taken into consideration by maintainers, but they do not count as votes.
If the vote passes, the deprecation window takes effect in the subsequent release, and the removal follows the schedule.
### Schedule
If depreciation proposal passes by supermajority votes, the feature is deprecated in the next minor release and the feature can be removed completely after two minor version or equivalent major version e.g., if feature gets deprecated in Nth minor version, then feature can be removed after N+2 minor version or its equivalent if the major version number changes.
### Deprecation Window
The deprecation window is the period from the release in which the deprecation takes effect through the release in which the feature is removed. During this period, only critical security vulnerabilities and catastrophic bugs should be fixed.
**Note:** If a backup relies on a deprecated feature, then backups made with the last Velero release before this feature is removed must still be restorable in version `n+2`. For instance, something like restic feature support, that might mean that restic is removed from the list of supported uploader types in version `n` but the underlying implementation required to restore from a restic backup won't be removed until release `n+2`.
## Updating Governance
All substantive changes in Governance require a supermajority agreement by all maintainers.
@@ -12,13 +12,13 @@ The Velero project maintains the following [governance document](https://github.
Security is of the highest importance and all security vulnerabilities or suspected security vulnerabilities should be reported to Velero privately, to minimize attacks against current users of Velero before they are fixed. Vulnerabilities will be investigated and patched on the next patch (or minor) release as soon as possible. This information could be kept entirely internal to the project.
If you know of a publicly disclosed security vulnerability for Velero, please **IMMEDIATELY** contact the VMware Security Team (security@vmware.com).
If you know of a publicly disclosed security vulnerability for Velero, please **IMMEDIATELY** contact the Security Team (velero-security.pdl@broadcom.com).
**IMPORTANT: Do not file public issues on GitHub for security vulnerabilities**
To report a vulnerability or a security-related issue, please contact the VMware email address with the details of the vulnerability. The email will be fielded by the VMware Security Team and then shared with the Velero maintainers who have committer and release permissions. Emails will be addressed within 3 business days, including a detailed plan to investigate the issue and any potential workarounds to perform in the meantime. Do not report non-security-impacting bugs through this channel. Use [GitHub issues](https://github.com/vmware-tanzu/velero/issues/new/choose) instead.
To report a vulnerability or a security-related issue, please contact the email address with the details of the vulnerability. The email will be fielded by the Security Team and then shared with the Velero maintainers who have committer and release permissions. Emails will be addressed within 3 business days, including a detailed plan to investigate the issue and any potential workarounds to perform in the meantime. Do not report non-security-impacting bugs through this channel. Use [GitHub issues](https://github.com/vmware-tanzu/velero/issues/new/choose) instead.
## Proposed Email Content
@@ -29,7 +29,7 @@ Provide a descriptive subject line and in the body of the email include the foll
* Basic identity information, such as your name and your affiliation or company.
* Detailed steps to reproduce the vulnerability (POC scripts, screenshots, and logs are all helpful to us).
* Description of the effects of the vulnerability on Velero and the related hardware and software configurations, so that the VMware Security Team can reproduce it.
* Description of the effects of the vulnerability on Velero and the related hardware and software configurations, so that the Security Team can reproduce it.
* How the vulnerability affects Velero usage and an estimation of the attack surface, if there is one.
* List other projects or dependencies that were used in conjunction with Velero to produce the vulnerability.
@@ -49,7 +49,7 @@ Provide a descriptive subject line and in the body of the email include the foll
## Patch, Release, and Disclosure
The VMware Security Team will respond to vulnerability reports as follows:
The Security Team will respond to vulnerability reports as follows:
@@ -62,7 +62,7 @@ The VMware Security Team will respond to vulnerability reports as follows:
5. The Security Team will also create a [CVSS](https://www.first.org/cvss/specification-document) using the [CVSS Calculator](https://www.first.org/cvss/calculator/3.0). The Security Team makes the final call on the calculated CVSS; it is better to move quickly than making the CVSS perfect. Issues may also be reported to [Mitre](https://cve.mitre.org/) using this [scoring calculator](https://nvd.nist.gov/vuln-metrics/cvss/v3-calculator). The CVE will initially be set to private.
6. The Security Team will work on fixing the vulnerability and perform internal testing before preparing to roll out the fix.
7. The Security Team will provide early disclosure of the vulnerability by emailing the [Velero Distributors](https://groups.google.com/u/1/g/projectvelero-distributors) mailing list. Distributors can initially plan for the vulnerability patch ahead of the fix, and later can test the fix and provide feedback to the Velero team. See the section **Early Disclosure to Velero Distributors List** for details about how to join this mailing list.
8. A public disclosure date is negotiated by the VMware SecurityTeam, the bug submitter, and the distributors list. We prefer to fully disclose the bug as soon as possible once a user mitigation or patch is available. It is reasonable to delay disclosure when the bug or the fix is not yet fully understood, the solution is not well-tested, or for distributor coordination. The timeframe for disclosure is from immediate (especially if it’s already publicly known) to a few weeks. For a critical vulnerability with a straightforward mitigation, we expect the report date for the public disclosure date to be on the order of 14 business days. The VMware Security Team holds the final say when setting a public disclosure date.
8. A public disclosure date is negotiated by the SecurityTeam, the bug submitter, and the distributors list. We prefer to fully disclose the bug as soon as possible once a user mitigation or patch is available. It is reasonable to delay disclosure when the bug or the fix is not yet fully understood, the solution is not well-tested, or for distributor coordination. The timeframe for disclosure is from immediate (especially if it’s already publicly known) to a few weeks. For a critical vulnerability with a straightforward mitigation, we expect the report date for the public disclosure date to be on the order of 14 business days. The Security Team holds the final say when setting a public disclosure date.
9. Once the fix is confirmed, the Security Team will patch the vulnerability in the next patch or minor release, and backport a patch release into all earlier supported releases. Upon release of the patched version of Velero, we will follow the **Public Disclosure Process**.
@@ -79,7 +79,7 @@ The Security Team will also publish any mitigating steps users can take until th
* Use security@vmware.com to report security concerns to the VMware Security Team, who uses the list to privately discuss security issues and fixes prior to disclosure.
* Use velero-security.pdl@broadcom.com to report security concerns to the Security Team, who uses the list to privately discuss security issues and fixes prior to disclosure.
* Join the [Velero Distributors](https://groups.google.com/u/1/g/projectvelero-distributors) mailing list for early private information and vulnerability disclosure. Early disclosure may include mitigating steps and additional information on security patch releases. See below for information on how Velero distributors or vendors can apply to join this list.
@@ -107,11 +107,11 @@ To be eligible to join the [Velero Distributors](https://groups.google.com/u/1/g
## Embargo Policy
The information that members receive on the Velero Distributors mailing list must not be made public, shared, or even hinted at anywhere beyond those who need to know within your specific team, unless you receive explicit approval to do so from the VMware Security Team. This remains true until the public disclosure date/time agreed upon by the list. Members of the list and others cannot use the information for any reason other than to get the issue fixed for your respective distribution's users.
The information that members receive on the Velero Distributors mailing list must not be made public, shared, or even hinted at anywhere beyond those who need to know within your specific team, unless you receive explicit approval to do so from the Security Team. This remains true until the public disclosure date/time agreed upon by the list. Members of the list and others cannot use the information for any reason other than to get the issue fixed for your respective distribution's users.
Before you share any information from the list with members of your team who are required to fix the issue, these team members must agree to the same terms, and only be provided with information on a need-to-know basis.
In the unfortunate event that you share information beyond what is permitted by this policy, you must urgently inform the VMware Security Team (security@vmware.com) of exactly what information was leaked and to whom. If you continue to leak information and break the policy outlined here, you will be permanently removed from the list.
In the unfortunate event that you share information beyond what is permitted by this policy, you must urgently inform the Security Team (velero-security.pdl@broadcom.com) of exactly what information was leaked and to whom. If you continue to leak information and break the policy outlined here, you will be permanently removed from the list.
@@ -123,6 +123,6 @@ Send new membership requests to projectvelero-distributors@googlegroups.com. In
## Confidentiality, integrity and availability
We consider vulnerabilities leading to the compromise of data confidentiality, elevation of privilege, or integrity to be our highest priority concerns. Availability, in particular in areas relating to DoS and resource exhaustion, is also a serious security concern. The VMware Security Team takes all vulnerabilities, potential vulnerabilities, and suspected vulnerabilities seriously and will investigate them in an urgent and expeditious manner.
We consider vulnerabilities leading to the compromise of data confidentiality, elevation of privilege, or integrity to be our highest priority concerns. Availability, in particular in areas relating to DoS and resource exhaustion, is also a serious security concern. The Security Team takes all vulnerabilities, potential vulnerabilities, and suspected vulnerabilities seriously and will investigate them in an urgent and expeditious manner.
Note that we do not currently consider the default settings for Velero to be secure-by-default. It is necessary for operators to explicitly configure settings, role based access control, and other resource related features in Velero to provide a hardened Velero environment. We will not act on any security disclosure that relates to a lack of safe defaults. Over time, we will work towards improved safe-by-default configuration, taking into account backwards compatibility.
#### The maintenance work for kopia backup repositories is run in jobs
#### The maintenance work for kopia/restic backup repositories is run in jobs
Since velero started using kopia as the approach for filesystem-level backup/restore, we've noticed an issue when velero connects to the kopia backup repositories and performs maintenance, it sometimes consumes excessive memory that can cause the velero pod to get OOM Killed. To mitigate this issue, the maintenance work will be moved out of velero pod to a separate kubernetes job, and the user will be able to specify the resource request in "velero install".
#### Volume Policies are extended to support more actions to handle volumes
In an earlier release, a flexible volume policy was introduced to skip certain volumes from a backup. In v1.14 we've made enhancement to this policy to allow the user to set how the volumes should be backed up. The user will be able to set "fs-backup" or "snapshot" as value of “action" in the policy and velero will backup the volumes accordingly. This enhancement allows the user to achieve a fine-grained control like "opt-in/out" without having to update the target workload. For more details please refer to https://velero.io/docs/v1.14/resource-filtering/#supported-volumepolicy-actions
@@ -38,6 +38,7 @@ Besides the service principal with secret(password)-based authentication, Velero
* CSI plugin has been merged into velero repo in v1.14 release. It will be installed by default as an internal plugin, and should not be installed via "–plugins " parameter in "velero install" command.
* The default resource requests and limitations for node agent are removed in v1.14, to make the node agent pods have the QoS class of "BestEffort", more details please refer to #7391
* There's a change in namespace filtering behavior during backup: In v1.14, when the includedNamespaces/excludedNamespaces fields are not set and the labelSelector/OrLabelSelectors are set in the backup spec, the backup will only include the namespaces which contain the resources that match the label selectors, while in previous releases all namespaces will be included in the backup with such settings. More details refer to #7105
* Patching the PV in the "Finalizing" state may cause the restore to be in "PartiallyFailed" state when the PV is blocked in "Pending" state, while in the previous release the restore may end up being in "Complete" state. For more details refer to #7866
### All Changes
* Fix backup log to show error string, not index (#7805, @piny940)
Data transfer activities for CSI Snapshot Data Movement are moved from node-agent pods to dedicate backupPods or restorePods. This brings many benefits such as:
- This avoids to access volume data through host path, while host path access is privileged and may involve security escalations, which are concerned by users.
- This enables users to to control resource (i.e., cpu, memory) allocations in a granular manner, e.g., control them per backup/restore of a volume.
- This enhances the resilience, crash of one data movement activity won't affect others.
- This prevents unnecessary full backup because of host path changes after workload pods restart.
- For more information, check the design https://github.com/vmware-tanzu/velero/blob/main/design/Implemented/vgdp-micro-service/vgdp-micro-service.md.
#### Item Block concepts and ItemBlockAction (IBA) plugin
Item Block concepts are introduced for resource backups to help to achieve multiple thread backups. Specifically, correlated resources are categorized in the same item block and item blocks could be processed concurrently in multiple threads.
ItemBlockAction plugin is introduced to help Velero to categorize resources into item blocks. At present, Velero provides built-in IBAs for pods and PVCs and Velero also supports customized IBAs for any resources.
In v1.15, Velero doesn't support multiple thread process of item blocks though item block concepts and IBA plugins are fully supported. The multiple thread support will be delivered in future releases.
For more information, check the design https://github.com/vmware-tanzu/velero/blob/main/design/backup-performance-improvements.md.
#### Node selection for repository maintenance job
Repository maintenance are resource consuming tasks, Velero now allows you to configure the nodes to run repository maintenance jobs, so that you can run repository maintenance jobs in idle nodes or avoid them to run in nodes hosting critical workloads.
To support the configuration, a new repository maintenance configuration configMap is introduced.
For more information, check the document https://velero.io/docs/v1.15/repository-maintenance/.
#### Backup PVC read-only configuration
In 1.15, Velero allows you to configure the data mover backupPods to read-only mount the backupPVCs. In this way, the data mover expose process could be significantly accelerated for some storages (i.e., ceph).
To support the configuration, a new backup PVC configuration configMap is introduced.
For more information, check the document https://velero.io/docs/v1.15/data-movement-backup-pvc-configuration/.
#### Backup PVC storage class configuration
In 1.15, Velero allows you to configure the storageclass used by the data mover backupPods. In this way, the provision of backupPVCs don't need to adhere to the same pattern as workload PVCs, e.g., for a backupPVC, it only needs one replica, whereas, the a workload PVC may have multiple replicas.
To support the configuration, the same backup PVC configuration configMap is used.
For more information, check the document https://velero.io/docs/v1.15/data-movement-backup-pvc-configuration/.
#### Backup repository data cache configuration
The backup repository may need to cache data on the client side during various repository operations, i.e., read, write, maintenance, etc. The cache consumes the root file system space of the pod where the repository access happens.
In 1.15, Velero allows you to configure the total size of the cache per repository. In this way, if your pod doesn't have enough space in its root file system, the pod won't be evicted due to running out of ephemeral storage.
To support the configuration, a new backup repository configuration configMap is introduced.
For more information, check the document https://velero.io/docs/v1.15/backup-repository-configuration/.
#### Performance improvements
In 1.15, several performance related issues/enhancements are included, which makes significant performance improvements in specific scenarios:
- There was a memory leak of Velero server after plugin calls, now it is fixed, see issue https://github.com/vmware-tanzu/velero/issues/7925
- The `client-burst/client-qps` parameters are automatically inherited to plugins, so that you can use the same velero server parameters to accelerate the plugin executions when large number of API server calls happen, see issue https://github.com/vmware-tanzu/velero/issues/7806
- Maintenance of Kopia repository takes huge memory in scenarios that huge number of files have been backed up, Velero 1.15 has included the Kopia upstream enhancement to fix the problem, see issue https://github.com/vmware-tanzu/velero/issues/7510
### Runtime and dependencies
Golang runtime: v1.22.8
kopia: v0.17.0
### Limitations/Known issues
#### Read-only backup PVC may not work on SELinux environments
Due to an issue of Kubernetes upstream, if a volume is mounted as read-only in SELinux environments, the read privilege is not granted to any user, as a result, the data mover backup will fail. On the other hand, the backupPVC must be mounted as read-only in order to accelerate the data mover expose process.
Therefore, a user option is added in the same backup PVC configuration configMap, once the option is enabled, the backupPod container will run as a super privileged container and disable SELinux access control. If you have concern in this super privileged container or you have configured [pod security admissions](https://kubernetes.io/docs/concepts/security/pod-security-admission/) and don't allow super privileged containers, you will not be able to use this read-only backupPVC feature and lose the benefit to accelerate the data mover expose process.
### Breaking changes
#### Deprecation of Restic
Restic path for fs-backup is in deprecation process starting from 1.15. According to [Velero deprecation policy](https://github.com/vmware-tanzu/velero/blob/v1.15/GOVERNANCE.md#deprecation-policy), for 1.15, if Restic path is used the backup/restore of fs-backup still creates and succeeds, but you will see warnings in below scenarios:
- When `--uploader-type=restic` is used in Velero installation
- When Restic path is used to create backup/restore of fs-backup
#### node-agent configuration name is configurable
Previously, a fixed name is searched for node-agent configuration configMap. Now in 1.15, Velero allows you to customize the name of the configMap, on the other hand, the name must be specified by node-agent server parameter `node-agent-configmap`.
#### Repository maintenance job configurations in Velero server parameter are moved to repository maintenance job configuration configMap
In 1.15, below Velero server parameters for repository maintenance jobs are moved to the repository maintenance job configuration configMap. While for back compatibility reason, the same Velero sever parameters are preserved as is. But the configMap is recommended and the same values in the configMap take preference if they exist in both places:
```
--keep-latest-maintenance-jobs
--maintenance-job-cpu-request
--maintenance-job-mem-request
--maintenance-job-cpu-limit
--maintenance-job-mem-limit
```
#### Changing PVC selected-node feature is deprecated
In 1.15, the [Changing PVC selected-node feature](https://velero.io/docs/v1.15/restore-reference/#changing-pvc-selected-node) enters deprecation process and will be removed in future releases according to [Velero deprecation policy](https://github.com/vmware-tanzu/velero/blob/v1.15/GOVERNANCE.md#deprecation-policy). Usage of this feature for any purpose is not recommended.
### All Changes
* add no-relabeling option to backupPVC configmap (#8288, @sseago)
* only set spec.volumes readonly if PVC is readonly for datamover (#8284, @sseago)
* Add labels to maintenance job pods (#8256, @shubham-pampattiwar)
* Add the Carvel package related resources to the restore priority list (#8228, @ywk253100)
* Reduces indirect imports for plugin/framework importers (#8208, @kaovilai)
* Add controller name to periodical_enqueue_source. The logger parameter now includes an additional field with the value of reflect.TypeOf(objList).String() and another field with the value of controllerName. (#8198, @kaovilai)
* Update Openshift SCC docs link (#8170, @shubham-pampattiwar)
* Modify E2E and perf test report generated directory (#8129, @blackpiglet)
* Add docs for backup pvc config support (#8119, @shubham-pampattiwar)
* Delete generated k8s client and informer. (#8114, @blackpiglet)
* Add support for backup PVC configuration (#8109, @shubham-pampattiwar)
* ItemBlock model and phase 1 (single-thread) workflow changes (#8102, @sseago)
* Fix issue #8032, make node-agent configMap name configurable (#8097, @Lyndon-Li)
* Fix issue #8072, add the warning messages for restic deprecation (#8096, @Lyndon-Li)
* Fix issue #7620, add backup repository configuration implementation and support cacheLimit configuration for Kopia repo (#8093, @Lyndon-Li)
* Patch dbr's status when error happens (#8086, @reasonerjt)
* According to design #7576, after node-agent restarts, if a DU/DD is in InProgress status, re-capture the data mover ms pod and continue the execution (#8085, @Lyndon-Li)
* Updates to IBM COS documentation to match current version (#8082, @gjanders)
* Data mover micro service DUCR/DDCR controller refactor according to design #7576 (#8074, @Lyndon-Li)
* add retries with timeout to existing patch calls that moves a backup/restore from InProgress/Finalizing to a final status phase. (#8068, @kaovilai)
* Data mover micro service restore according to design #7576 (#8061, @Lyndon-Li)
In v1.16, Velero supports to run in Windows clusters and backup/restore Windows workloads, either stateful or stateless:
* Hybrid build and all-in-one image: the build process is enhanced to build an all-in-one image for hybrid CPU architecture and hybrid platform. For more information, check the design https://github.com/vmware-tanzu/velero/blob/main/design/multiple-arch-build-with-windows.md
* Deployment in Windows clusters: Velero node-agent, data mover pods and maintenance jobs now support to run in both linux and Windows nodes
* Data mover backup/restore Windows workloads: Velero built-in data mover supports Windows workloads throughout its full cycle, i.e., discovery, backup, restore, pre/post hook, etc. It automatically identifies Windows workloads and schedules data mover pods to the right group of nodes
Check the epic issue https://github.com/vmware-tanzu/velero/issues/8289 for more information.
#### Parallel Item Block backup
v1.16 now supports to back up item blocks in parallel. Specifically, during backup, correlated resources are grouped in item blocks and Velero backup engine creates a thread pool to back up the item blocks in parallel. This significantly improves the backup throughput, especially when there are large scale of resources.
Pre/post hooks also belongs to item blocks, so will also run in parallel along with the item blocks.
Users are allowed to configure the parallelism through the `--item-block-worker-count` Velero server parameter. If not configured, the default parallelism is 1.
For more information, check issue https://github.com/vmware-tanzu/velero/issues/8334.
#### Data mover restore enhancement in scalability
In previous releases, for each volume of WaitForFirstConsumer mode, data mover restore is only allowed to happen in the node that the volume is attached. This severely degrades the parallelism and the balance of node resource(CPU, memory, network bandwidth) consumption for data mover restore (https://github.com/vmware-tanzu/velero/issues/8044).
In v1.16, users are allowed to configure data mover restores running and spreading evenly across all nodes in the cluster. The configuration is done through a new flag `ignoreDelayBinding` in node-agent configuration (https://github.com/vmware-tanzu/velero/issues/8242).
#### Data mover enhancements in observability
In 1.16, some observability enhancements are added:
* Output various statuses of intermediate objects for failures of data mover backup/restore (https://github.com/vmware-tanzu/velero/issues/8267)
* Output the errors when Velero fails to delete intermediate objects during clean up (https://github.com/vmware-tanzu/velero/issues/8125)
The outputs are in the same node-agent log and enabled automatically.
#### CSI snapshot backup/restore enhancement in usability
In previous releases, a unnecessary VolumeSnapshotContent object is retained for each backup and synced to other clusters sharing the same backup storage location. And during restore, the retained VolumeSnapshotContent is also restored unnecessarily.
In 1.16, the retained VolumeSnapshotContent is removed from the backup, so no unnecessary CSI objects are synced or restored.
For more information, check issue https://github.com/vmware-tanzu/velero/issues/8725.
#### Backup Repository Maintenance enhancement in resiliency and observability
In v1.16, some enhancements of backup repository maintenance are added to improve the observability and resiliency:
* A new backup repository maintenance history section, called `RecentMaintenance`, is added to the BackupRepository CR. Specifically, for each BackupRepository, including start/completion time, completion status and error message. (https://github.com/vmware-tanzu/velero/issues/7810)
* Running maintenance jobs are now recaptured after Velero server restarts. (https://github.com/vmware-tanzu/velero/issues/7753)
* The maintenance job will not be launched for readOnly BackupStorageLocation. (https://github.com/vmware-tanzu/velero/issues/8238)
* The backup repository will not try to initialize a new repository for readOnly BackupStorageLocation. (https://github.com/vmware-tanzu/velero/issues/8091)
* Users now are allowed to configure the intervals of an effective maintenance in the way of `normalGC`, `fastGC` and `eagerGC`, through the `fullMaintenanceInterval` parameter in backupRepository configuration. (https://github.com/vmware-tanzu/velero/issues/8364)
#### Volume Policy enhancement of filtering volumes by PVC labels
In v1.16, Volume Policy is extended to support filtering volumes by PVC labels. (https://github.com/vmware-tanzu/velero/issues/8256).
#### Resource Status restore per object
In v1.16, users are allowed to define whether to restore resource status per object through an annotation `velero.io/restore-status` set on the object. (https://github.com/vmware-tanzu/velero/issues/8204).
#### Velero Restore Helper binary is merged into Velero image
In v1.16, Velero banaries, i.e., velero, velero-helper and velero-restore-helper, are all included into the single Velero image. (https://github.com/vmware-tanzu/velero/issues/8484).
### Runtime and dependencies
Golang runtime: 1.23.7
kopia: 0.19.0
### Limitations/Known issues
#### Limitations of Windows support
* fs-backup is not supported for Windows workloads and so fs-backup runs only in linux nodes for linux workloads
* Backup/restore of NTFS extended attributes/advanced features are not supported, i.e., Security Descriptors, System/Hidden/ReadOnly attributes, Creation Time, NTFS Streams, etc.
### All Changes
* Add third party annotation support for maintenance job, so that the declared third party annotations could be added to the maintenance job pods (#8812, @Lyndon-Li)
* Fix issue #8803, use deterministic name to create backupRepository (#8808, @Lyndon-Li)
* Refactor restoreItem and related functions to differentiate the backup resource name and the restore target resource name. (#8797, @blackpiglet)
* ensure that PV is removed before VS is deleted (#8777, @ix-rzi)
* host_pods should not be mandatory to node-agent (#8774, @mpryc)
* Log doesn't show pv name, but displays %!s(MISSING) instead (#8771, @hu-keyu)
* Fix issue #8754, add third party annotation support for data mover (#8770, @Lyndon-Li)
* Add docs for volume policy with labels as a criteria (#8759, @shubham-pampattiwar)
* Move pvc annotation removal from CSI RIA to regular PVC RIA (#8755, @sseago)
* Add doc for maintenance history (#8747, @Lyndon-Li)
* Fix issue #8733, add doc for restorePVC (#8737, @Lyndon-Li)
* Fix issue #8426, add doc for Windows support (#8736, @Lyndon-Li)
* Return directly if no pod volme backup are tracked (#8728, @ywk253100)
* Fix issue #8706, for immediate volumes, there is no selected-node annotation on PVC, so deduce the attached node from VolumeAttachment CRs (#8715, @Lyndon-Li)
* Add labels as a criteria for volume policy (#8713, @shubham-pampattiwar)
* Copy SecurityContext from Containers[0] if present for PVR (#8712, @sseago)
* Support pushing images to an insecure registry (#8703, @ywk253100)
* Modify golangci configuration to make it work. (#8695, @blackpiglet)
* Run backup post hooks inside ItemBlock synchronously (#8694, @ywk253100)
* Add docs for object level status restore (#8693, @shubham-pampattiwar)
* Clean artifacts generated during CSI B/R. (#8684, @blackpiglet)
* Don't run maintenance on the ReadOnly BackupRepositories. (#8681, @blackpiglet)
* Fix issue #8418, add Windows toleration to data mover pods (#8606, @Lyndon-Li)
* Check the PVB status via podvolume Backupper rather than calling API server to avoid API server issue (#8603, @ywk253100)
* Fix issue #8067, add tmp folder (/tmp for linux, C:\Windows\Temp for Windows) as an alternative of udmrepo's config file location (#8602, @Lyndon-Li)
* Data mover restore for Windows (#8594, @Lyndon-Li)
* Skip patching the PV in finalization for failed operation (#8591, @reasonerjt)
* Fix issue #8579, set event burst to block event broadcaster from filtering events (#8590, @Lyndon-Li)
* Configurable Kopia Maintenance Interval. backup-repository-configmap adds an option for configurable`fullMaintenanceInterval` where fastGC (12 hours), and eagerGC (6 hours) allowing for faster removal of deleted velero backups from kopia repo. (#8581, @kaovilai)
* Fix issue #7753, recall repo maintenance history on Velero server restart (#8580, @Lyndon-Li)
* Clear validation errors when schedule is valid (#8575, @ywk253100)
* Merge restore helper image into Velero server image (#8574, @ywk253100)
* Don't include excluded items in ItemBlocks (#8572, @sseago)
* fs uploader and block uploader support Windows nodes (#8569, @Lyndon-Li)
* Fix issue #8418, support data mover backup for Windows nodes (#8555, @Lyndon-Li)
* Fix issue #8044, allow users to ignore delay binding the restorePVC of data mover when it is in WaitForFirstConsumer mode (#8550, @Lyndon-Li)
* Fix issue #8539, validate uploader types when o.CRDsOnly is set to false only since CRD installation doesn't rely on uploader types (#8538, @Lyndon-Li)
* Fix issue #7810, add maintenance history for backupRepository CRs (#8532, @Lyndon-Li)
* Make fs-backup work on linux nodes with the new Velero deployment and disable fs-backup if the source/target pod is running in non-linux node (#8424) (#8518, @Lyndon-Li)
* Fix issue: backup schedule pause/unpause doesn't work (#8512, @ywk253100)
* Fix backup post hook issue #8159 (caused by #7571): always execute backup post hooks after PVBs are handled (#8509, @ywk253100)
* Fix issue #8267, enhance the error message when expose fails (#8508, @Lyndon-Li)
* Fix issue #8416, #8417, deploy Velero server and node-agent in linux/Windows hybrid env (#8504, @Lyndon-Li)
* Design to add label selector as a criteria for volume policy (#8503, @shubham-pampattiwar)
* Related to issue #8485, move the acceptedByNode and acceptedTimestamp to Status of DU/DD CRD (#8498, @Lyndon-Li)
* Add SecurityContext to restore-helper (#8491, @reasonerjt)
* Fix issue #8433, add third party labels to data mover pods when the same labels exist in node-agent pods (#8487, @Lyndon-Li)
* Fix issue #8485, add an accepted time so as to count the prepare timeout (#8486, @Lyndon-Li)
* Fix issue #8125, log diagnostic info for data mover exposers when expose timeout (#8482, @Lyndon-Li)
* Fix issue #8415, implement multi-arch build and Windows build (#8476, @Lyndon-Li)
* Pin kopia to 0.18.2 (#8472, @Lyndon-Li)
* Add nil check for updating DataUpload VolumeInfo in finalizing phase (#8471, @blackpiglet)
* Allowing Object-Level Resource Status Restore (#8464, @shubham-pampattiwar)
* For issue #8429. Add the design for multi-arch build and windows build (#8459, @Lyndon-Li)
* Upgrade go.mod k8s.io/ go.mod to v0.31.3 and implemented proper logger configuration for both client-go and controller-runtime libraries. This change ensures that logging format and level settings are properly applied throughout the codebase. The update improves logging consistency and control across the Velero system. (#8450, @kaovilai)
* Add Design for Allowing Object-Level Resource Status Restore (#8403, @shubham-pampattiwar)
* Fix issue #8391, check ErrCancelled from suffix of data mover pod's termination message (#8396, @Lyndon-Li)
* Fix issue #8394, don't call closeDataPath in VGDP callbacks, otherwise, the VGDP cleanup will hang (#8395, @Lyndon-Li)
* Adding support in velero Resource Policies for filtering PVs based on additional VolumeAttributes properties under CSI PVs (#8383, @mayankagg9722)
* Add --item-block-worker-count flag to velero install and server (#8380, @sseago)
* Make BackedUpItems thread safe (#8366, @sseago)
* Include --annotations flag in backup and restore create commands (#8354, @alromeros)
* Use aggregated discovery API to discovery API groups and resources (#8353, @ywk253100)
* Copy "envFrom" from Velero server when creating maintenance jobs (#8343, @evhan)
* Set hinting region to use for GetBucketRegion() in pkg/repository/config/aws.go (#8297, @kaovilai)
* Bump up version of client-go and controller-runtime (#8275, @ywk253100)
* fix(pkg/repository/maintenance): don't panic when there's no container statuses (#8271, @mcluseau)
* Add Backup warning for inclusion of NS managed by ArgoCD (#8257, @shubham-pampattiwar)
* Added tracking for deleted namespace status check in restore flow. (#8233, @sangitaray2021)
In v1.17, Velero fs-backup is modernized to the micro-service architecture, which brings below benefits:
- Many features that were absent to fs-backup are now available, i.e., load concurrency control, cancel, resume on restart, etc.
- fs-backup is more robust, the running backup/restore could survive from node-agent restart; and the resource allocation is in a more granular manner, the failure of one backup/restore won't impact others.
- The resource usage of node-agent is steady, especially, the node-agent pods won't request huge memory and hold it for a long time.
Check design https://github.com/vmware-tanzu/velero/blob/main/design/vgdp-micro-service-for-fs-backup/vgdp-micro-service-for-fs-backup.md for more details.
#### fs-backup support Windows cluster
In v1.17, Velero fs-backup supports to backup/restore Windows workloads. By leveraging the new micro-service architecture for fs-backup, data mover pods could run in Windows nodes and backup/restore Windows volumes. Together with CSI snapshot data movement for Windows which is delivered in 1.16, Velero now supports Windows workload backup/restore in full scenarios.
Check design https://github.com/vmware-tanzu/velero/blob/main/design/vgdp-micro-service-for-fs-backup/vgdp-micro-service-for-fs-backup.md for more details.
#### Volume group snapshot support
In v1.17, Velero supports [volume group snapshots](https://kubernetes.io/blog/2024/12/18/kubernetes-1-32-volume-group-snapshot-beta/) which is a beta feature in Kubernetes upstream, for both CSI snapshot backup and CSI snapshot data movement. This allows a snapshot to be taken from multiple volumes at the same point-in-time to achieve write order consistency, which is helpful to achieve better data consistency when multiple volumes being backed up are correlated.
Check the document https://velero.io/docs/main/volume-group-snapshots/ for more details.
#### Priority class support
In v1.17, [Kubernetes priority class](https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/#priorityclass) is supported for all modules across Velero. Specifically, users are allowed to configure priority class to Velero server, node-agent, data mover pods, backup repository maintenance jobs separately.
Check design https://github.com/vmware-tanzu/velero/blob/main/design/Implemented/priority-class-name-support_design.md for more details.
#### Scalability and Resiliency improvements of data movers
##### Reduce excessive number of data mover pods in Pending state
In v1.17, Velero allows users to set a `PrepareQueueLength` in the node-agent configuration, data mover pods and volumes out of this number won't be created until data path quota is available, so that excessive number cluster resources won't be taken unnecessarily, which is particularly helpful for large scale environments. This improvement applies to all kinds of data movements, including fs-backup and CSI snapshot data movement.
Check design https://github.com/vmware-tanzu/velero/blob/main/design/node-agent-load-soothing.md for more details.
##### Enhancement on node-agent restart handling for data movements
In v1.17, data movements in all phases could survive from node-agent restart and resume themselves; when a data movement gets orphaned in special cases, e.g., cluster node absent, it could also be canceled appropriately after the restart. This improvement applies to all kinds of data movements, including fs-backup and CSI snapshot data movement.
Check issue https://github.com/vmware-tanzu/velero/issues/8534 for more details.
##### CSI snapshot data movement restore node-selection and node-selection by storage class
In v1.17, CSI snapshot data movement restore acquires the same node-selection capability as backup, that is, users could specify which nodes can/cannot run data mover pods for both backup and restore now. And users are also allowed to configure the node-selection per storage class, which is particularly helpful to the environments where a storage class are not usable by all cluster nodes.
Check issue https://github.com/vmware-tanzu/velero/issues/8186 and https://github.com/vmware-tanzu/velero/issues/8223 for more details.
#### Include/exclude policy support for resource policy
In v1.17, Velero resource policy supports `includeExcludePolicy` besides the existing `volumePolicy`. This allows users to set include/exclude filters for resources in a resource policy configmap, so that these filters are reusable among multiple backups.
Check the document https://velero.io/docs/main/resource-filtering/#creating-resource-policies:~:text=resources%3D%22*%22-,Resource%20policies,-Velero%20provides%20resource for more details.
### Runtime and dependencies
Golang runtime: 1.24.6
kopia: 0.21.1
### Limitations/Known issues
### Breaking changes
#### Deprecation of Restic
According to [Velero deprecation policy](https://github.com/vmware-tanzu/velero/blob/main/GOVERNANCE.md#deprecation-policy), backup of fs-backup under Restic path is removed in v1.17, so `--uploader-type=restic` is not a valid installation configuration anymore. This means you cannot create a backup under Restic path, but you can still restore from the previous backups under Restic path until v1.19.
#### Repository maintenance job configurations are removed from Velero server parameter
Since the repository maintenance job configurations are moved to repository maintenance job configMap, in v1.17 below Velero sever parameters are removed:
- --keep-latest-maintenance-jobs
- --maintenance-job-cpu-request
- --maintenance-job-mem-request
- --maintenance-job-cpu-limit
- --maintenance-job-mem-limit
### All Changes
* Add ConfigMap parameters validation for install CLI and server start. (#9200, @blackpiglet)
* Add priorityclasses to high priority restore list (#9175, @kaovilai)
* Introduced context-based logger for backend implementations (Azure, GCS, S3, and Filesystem) (#9168, @priyansh17)
* Fix issue #9140, add os=windows:NoSchedule toleration for Windows pods (#9165, @Lyndon-Li)
* Remove the repository maintenance job parameters from velero server. (#9147, @blackpiglet)
* Add include/exclude policy to resources policy (#9145, @reasonerjt)
* Add ConfigMap support for keepLatestMaintenanceJobs with CLI parameter fallback (#9135, @shubham-pampattiwar)
* Fix the dd and du's node affinity issue. (#9130, @blackpiglet)
* Remove the WaitUntilVSCHandleIsReady from vs BIA. (#9124, @blackpiglet)
* Add comprehensive Volume Group Snapshots documentation with workflow diagrams and examples (#9123, @shubham-pampattiwar)
* Update CSI Snapshot Data Movement doc for issue #8534, #8185 (#9113, @Lyndon-Li)
* Fix issue #8986, refactor fs-backup doc after VGDP Micro Service for fs-backup (#9112, @Lyndon-Li)
* Return error if timeout when checking server version (#9111, @ywk253100)
* Update "Default Volumes to Fs Backup" to "File System Backup (Default)" (#9105, @shubham-pampattiwar)
* Fix issue #9077, don't block backup deletion on list VS error (#9100, @Lyndon-Li)
* Bump up Kopia to v0.21.1 (#9098, @Lyndon-Li)
* Add imagePullSecrets inheritance for VGDP pod and maintenance job. (#9096, @blackpiglet)
* Avoid checking the VS and VSC status in the backup finalizing phase. (#9092, @blackpiglet)
* Fix issue #9053, Always remove selected-node annotation during PVC restore when no node mapping exists. Breaking change: Previously, the annotation was preserved if the node existed. (#9076, @Lyndon-Li)
* Enable parameterized kubelet mount path during node-agent installation (#9074, @longxiucai)
* Fix issue #8857, support third party tolerations for data mover pods (#9072, @Lyndon-Li)
* Fix issue #8813, remove restic from the valid uploader type (#9069, @Lyndon-Li)
* Fix issue #8185, allow users to disable pod volume host path mount for node-agent (#9068, @Lyndon-Li)
* Fix #8344, add the design for a mechanism to soothe creation of data mover pods for DataUpload, DataDownload, PodVolumeBackup and PodVolumeRestore (#9067, @Lyndon-Li)
* Fix #8344, add a mechanism to soothe creation of data mover pods for DataUpload, DataDownload, PodVolumeBackup and PodVolumeRestore (#9064, @Lyndon-Li)
* Add Gauge metric for BSL availability (#9059, @reasonerjt)
* Fix missing defaultVolumesToFsBackup flag output in Velero describe backup cmd (#9056, @shubham-pampattiwar)
* Allow for proper tracking of multiple hooks per container (#9048, @sseago)
* Make the backup repository controller doesn't invalidate the BSL on restart (#9046, @blackpiglet)
* Removed username/password credential handling from newConfigCredential as azidentity.UsernamePasswordCredentialOptions is reported as deprecated. (#9041, @priyansh17)
* Remove dependency with VolumeSnapshotClass in DataUpload. (#9040, @blackpiglet)
* Fix issue #8961, cancel PVB/PVR on Velero server restart (#9031, @Lyndon-Li)
* fix: update mc command in minio-deployment example (#8982, @vishal-chdhry)
* Fix issue #8957, add design for VGDP MS for fs-backup (#8979, @Lyndon-Li)
* Add BSL status check for backup/restore operations. (#8976, @blackpiglet)
* Mark BackupRepository not ready when BSL changed (#8975, @ywk253100)
* Add support for [distributed snapshotting](https://github.com/kubernetes-csi/external-snapshotter/tree/4cedb3f45790ac593ebfa3324c490abedf739477?tab=readme-ov-file#distributed-snapshotting) (#8969, @flx5)
* Fix issue #8534, refactor dm controllers to tolerate cancel request in more cases, e.g., node restart, node drain (#8952, @Lyndon-Li)
* The backup and restore VGDP affinity enhancement implementation. (#8949, @blackpiglet)
* Remove CSI VS and VSC metadata from backup. (#8946, @blackpiglet)
* Extend PVCAction itemblock plugin to support grouping PVCs under VGS label key (#8944, @shubham-pampattiwar)
* Copy security context from origin pod (#8943, @farodin91)
* Add support for configuring VGS label key (#8938, @shubham-pampattiwar)
* Add VolumeSnapshotContent into the RIA and the mustHave resource list. (#8924, @blackpiglet)
* Mounted cloud credentials should not be world-readable (#8919, @sseago)
* Warn for not found error in patching managed fields (#8902, @sseago)
In v1.18, Velero is capable to process multiple backups concurrently. This is a significant usability improvement, especially for multiple tenants or multiple users case, backups submitted from different users could run their backups simultaneously without interfering with each other.
Check design https://github.com/vmware-tanzu/velero/blob/main/design/Implemented/concurrent-backup-processing.md for more details.
#### Cache volume for data movers
In v1.18, Velero allows users to configure cache volumes for data mover pods during restore for CSI snapshot data movement and fs-backup. This brings below benefits:
- Solve the problem that data mover pods fail to when pod's ephemeral disk is limited
- Solve the problem that multiple data mover pods fail to run concurrently in one node when the node's ephemeral disk is limited
- Working together with backup repository's cache limit configuration, cache volume with appropriate size helps to improve the restore throughput
Check design https://github.com/vmware-tanzu/velero/blob/main/design/Implemented/backup-repo-cache-volume.md for more details.
#### Incremental size for data movers
In v1.18, Velero allows users to observe the incremental size of data movers backups for CSI snapshot data movement and fs-backup, so that users could visually see the data reduction due to incremental backup.
#### Wildcard support for namespaces
In v1.18, Velero allows to use Glob regular expressions for namespace filters during backup and restore, so that users could filter namespaces in a batch manner.
#### VolumePolicy for PVC phase
In v1.18, Velero VolumePolicy supports actions by PVC phase, which help users to do special operations for PVCs with a specific phase, e.g., skip PVCs in Pending/Lost status from the backup.
#### Scalability and Resiliency improvements
##### Prevent Velero server OOM Kill for large backup repositories
In v1.18, some backup repository operations are delay executed out of Velero server, so Velero server won't be OOM Killed.
#### Performance improvement for VolumePolicy
In v1.18, VolumePolicy is enhanced for large number of pods/PVCs so that the performance is significantly improved.
#### Events for data mover pod diagnostic
In v1.18, events are recorded into data mover pod diagnostic, which allows user to see more information for troubleshooting when the data mover pod fails.
### Runtime and dependencies
Golang runtime: 1.25.7
kopia: 0.22.3
### Limitations/Known issues
### Breaking changes
#### Deprecation of PVC selected node feature
According to [Velero deprecation policy](https://github.com/vmware-tanzu/velero/blob/main/GOVERNANCE.md#deprecation-policy), PVC selected node feature is deprecated in v1.18. Velero could appropriately handle PVC's selected-node annotation, so users don't need to do anything particularly.
### All Changes
* Remove backup from running list when backup fails validation (#9498, @sseago)
* Maintenance Job only uses the first element of the LoadAffinity array (#9494, @blackpiglet)
* Fix issue #9478, add diagnose info on expose peek fails (#9481, @Lyndon-Li)
* Add Role, RoleBinding, ClusterRole, and ClusterRoleBinding in restore sequence. (#9474, @blackpiglet)
* Add maintenance job and data mover pod's labels and annotations setting. (#9452, @blackpiglet)
* Implement concurrency control for cache of native VolumeSnapshotter plugin. (#9281, @0xLeo258)
* Fix issue #7904, remove the code and doc for PVC node selection (#9269, @Lyndon-Li)
* Fix schedule controller to prevent backup queue accumulation during extended blocking scenarios by properly handling empty backup phases (#9264, @shubham-pampattiwar)
* Fix repository maintenance jobs to inherit allowlisted tolerations from Velero deployment (#9256, @shubham-pampattiwar)
* Implement wildcard namespace pattern expansion for backup namespace includes/excludes. This change adds support for wildcard patterns (*, ?, [abc], {a,b,c}) in namespace includes and excludes during backup operations (#9255, @Joeavaikath)
* Protect VolumeSnapshot field from race condition during multi-thread backup (#9248, @0xLeo258)
* Update AzureAD Microsoft Authentication Library to v1.5.0 (#9244, @priyansh17)
* Get pod list once per namespace in pvc IBA (#9226, @sseago)
Fix VolumeGroupSnapshot restore failure with Ceph RBD CSI driver by creating stub VolumeGroupSnapshotContent during restore and looking up VolumeSnapshotClass by driver for credential support
Fix issue #9659, in the case that PVB/PVR/DU/DD is cancelled before the data path is really started, call EndEvent to prevent data mover pod from crashing because of delay event distribution
- Update VolumePolicy action type validation to account for `fs-backup` and `snapshot` as valid VolumePolicy actions.
- Modifications needed for `fs-backup` action:
- Now based on the specification of volume policy on backup request we will decide whether to go via legacy pod annotations approach or the newer volume policy based fs-backup action approach.
- If there is a presence of volume policy(fs-backup/snapshot) on the backup request that matches as an action for a volume we use the newer volume policy approach to get the list of the volumes for `fs-backup` action
- If there is a presence of volume policy(fs-backup/snapshot) on the backup request that matches as an action for a volume we use the newer volume policy approach to get the list of the volumes for `fs-backup` action
- Else continue with the annotation based legacy approach workflow.
- Modifications needed for `snapshot` action:
@@ -276,7 +276,7 @@ func (v *volumeHelperImpl) ShouldPerformSnapshot(obj runtime.Unstructured, group
if!boolptr.IsSetToFalse(v.snapshotVolumes){
// If the backup.Spec.SnapshotVolumes is not set, or set to true, then should take the snapshot.
v.logger.Infof("performing snapshot action for pv %s as the snapshotVolumes is not set to false")
v.logger.Infof("performing snapshot action for pv %s as the snapshotVolumes is not set to false",pv.Name)
Add an `--apply` flag to the install command that enables applying existing resources rather than creating them. This can be useful as part of the upgrade process for existing installations.
## Background
The current Velero install command creates resources but doesn't provide a direct way to apply updates to an existing installation.
Users attempting to run the install command on an existing installation receive "already exists" messages.
Upgrade steps for existing installs typically involve a three (or more) step process to apply updated CRDs (using `--dry-run` and piping to `kubectl apply`) and then updating/setting images on the Velero deployment and node-agent.
## Goals
- Provide a simple flag to enable applying resources on an existing Velero installation.
- Use server-side apply to update existing resources rather than attempting to create them.
- Maintain consistency with the regular install flow.
## Non Goals
- Implement special logic for specific version-to-version upgrades (i.e. resource deletion, etc).
- Add complex upgrade validation or pre/post-upgrade hooks.
- Provide rollback capabilities.
## High-Level Design
The `--apply` flag will be added to the Velero install command.
When this flag is set, the installation process will use server-side apply to update existing resources instead of using create on new resources.
This flag can be used as _part_ of the upgrade process, but will not always fully handle an upgrade.
## Detailed Design
The implementation adds a new boolean flag `--apply` to the install command.
This flag will be passed through to the underlying install functions where the resource creation logic resides.
When the flag is set to true:
- The `createOrApplyResource` function will use server-side apply with field manager "velero-cli" and `force=true` to update resources.
- Resources will be applied in the same order as they would be created during installation.
- Custom Resource Definitions will still be processed first, and the system will wait for them to be established before continuing.
The server-side apply approach with `force=true` ensures that resources are updated even if there are conflicts with the last applied state.
This provides a best-effort mechanism to apply resources that follows the same flow as installation but updates resources instead of creating them.
No special handling is added for specific versions or resource structures, making this a general-purpose mechanism for applying resources.
## Alternatives Considered
1. Creating a separate `upgrade` command that would duplicate much of the install command logic.
- Rejected due to code duplication and maintenance overhead.
2. Implementing version-specific upgrade logic to handle breaking changes between versions.
- Rejected as overly complex and difficult to maintain across multiple version paths.
- This could be considered again in the future, but is not in the scope of the current design.
3. Adding automatic detection of existing resources and switching to apply mode.
- Rejected as it could lead to unexpected behavior and confusion if users unintentionally apply changes to existing resources.
## Security Considerations
The apply flag maintains the same security profile as the install command.
No additional permissions are required beyond what is needed for resource creation.
The use of `force=true` with server-side apply could potentially override manual changes made to resources, but this is a necessary trade-off to ensure apply is successful.
## Compatibility
This enhancement is compatible with all existing Velero installations as it is a new opt-in flag.
It does not change any resource formats or API contracts.
The apply process is best-effort and does not guarantee compatibility between arbitrary versions of Velero.
Users should still consult release notes for any breaking changes that may require manual intervention.
This flag could be adopted by the helm chart, specifically for CRD updates, to simplify the CRD update job.
## Implementation
The implementation involves:
1. Adding support for `Apply` to the existing Kubernetes client code.
1. Adding the `--apply` flag to the install command options.
1. Changing `createResource` to `createOrApplyResource` and updating it to use server-side apply when the `apply` boolean is set.
The implementation is straightforward and follows existing code patterns.
No migration of state or special handling of specific resources is required.
# Velero Backup performance Improvements and VolumeGroupSnapshot enablement
There are two different goals here, linked by a single primary missing feature in the Velero backup workflow.
The first goal is to enhance backup performance by allowing the primary backup controller to run in multiple threads, enabling Velero to back up multiple items at the same time for a given backup.
The second goal is to enable Velero to eventually support VolumeGroupSnapshots.
For both of these goals, Velero needs a way to determine which items should be backed up together.
This design proposal will include two development phases:
- Phase 1 will refactor the backup workflow to identify blocks of related items that should be backed up together, and then coordinate backup hooks among items in the block.
- Phase 2 will add multiple worker threads for backing up item blocks, so instead of backing up each block as it identified, the velero backup workflow will instead add the block to a channel and one of the workers will pick it up.
- Actual support for VolumeGroupSnapshots is out-of-scope here and will be handled in a future design proposal, but the item block refactor introduced in Phase 1 is a primary building block for this future proposal.
## Background
Currently, during backup processing, the main Velero backup controller runs in a single thread, completely finishing the primary backup processing for one resource before moving on to the next one.
We can improve the overall backup performance by backing up multiple items for a backup at the same time, but before we can do this we must first identify resources that need to be backed up together.
Generally speaking, resources that need to be backed up together are resources with interdependencies -- pods with their PVCs, PVCs with their PVs, groups of pods that form a single application, CRs, pods, and other resources that belong to the same operator, etc.
As part of this initial refactoring, once these "Item Blocks" are identified, an additional change will be to move pod hook processing up to the ItemBlock level.
If there are multiple pods in the ItemBlock, pre-hooks for all pods will be run before backing up the items, followed by post-hooks for all pods.
This change to hook processing is another prerequisite for future VolumeGroupSnapshot support, since supporting this will require backing up the pods and volumes together for any volumes which belong to the same group.
Once we are backing up items by block, the next step will be to create multiple worker threads to process and back up ItemBlocks, so that we can back up multiple ItemBlocks at the same time.
In looking at the different kinds of large backups that Velero must deal with, two obvious scenarios come to mind:
1. Backups with a relatively small number of large volumes
2. Backups with a large number of relatively small volumes.
In case 1, the majority of the time spent on the backup is in the asynchronous phases -- CSI snapshot creation actions after the snaphandle exists, and DataUpload processing. In that case, parallel item processing will likely have a minimal impact on overall backup completion time.
In case 2, the majority of time spent on the backup will likely be during the synchronous actions. Especially as regards CSI snapshot creation, the waiting for the VSC snaphandle to exist will result in significant passage of time with thousands of volumes. This is the sort of use case which will benefit the most from parallel item processing.
## Goals
- Identify groups of related items to back up together (ItemBlocks).
- Manage backup hooks at the ItemBlock level rather than per-item.
- Using worker threads, back up ItemBlocks at the same time.
## Non Goals
- Support VolumeGroupSnapshots: this is a future feature, although certain prerequisites for this enhancement are included in this proposal.
- Process multiple backups in parallel: this is a future feature, although certain prerequisites for this enhancement are included in this proposal.
- Refactoring plugin infrastructure to avoid RPC calls for internal plugins.
- Restore performance improvements: this is potentially a future feature
## High-Level Design
### ItemBlock concept
The updated design is based on a new struct/type called `ItemBlock`.
Essentially, an `ItemBlock` is a group of items that must be backed up together in order to guarantee backup integrity.
When we eventually split item backup across multiple worker threads, `ItemBlocks` will be kept together as the basic unit of backup.
To facilitate this, a new plugin type, `ItemBlockAction` will allow relationships between items to be identified by velero -- any resources that must be backed up with other resources will need IBA plugins defined for them.
Examples of `ItemBlocks` include:
1. A pod, its mounted PVCs, and the bound PVs for those PVCs.
2. A VolumeGroup (related PVCs and PVs) along with any pods mounting these volumes.
3. For a ReadWriteMany PVC, the PVC, its bound PV, and all pods mounting this PVC.
### Phase 1: ItemBlock processing
- A new plugin type, `ItemBlockAction`, will be created
-`ItemBlockAction` will contain the API method `GetRelatedItems`, which will be needed for determining which items to group together into `ItemBlocks`.
- When processing the list of items returned from the item collector, instead of simply calling `BackupItem` on each in turn, we will use the `GetRelatedItems` API call to determine other items to include with the current item in an ItemBlock. Repeat recursively on each item returned.
- Don't include an item in more than one ItemBlock -- if the next item from the item collector is already in a block, skip it.
- Once ItemBlock is determined, call new func `BackupItemBlock` instead of `BackupItem`.
- New func `BackupItemBlock` will call pre hooks for any pods in the block, then back up the items in the block (`BackupItem` will no longer run hooks directly), then call post hooks for any pods in the block.
- The finalize phase will not be affected by the ItemBlock design, since this is just updating resources after async operations are completed on the items and there is no need to run these updates in parallel.
### Phase 2: Process ItemBlocks for a single backup in multiple threads
- Concurrent `BackupItemBlock` operations will be executed by worker threads invoked by the backup controller, which will communicate with the backup controller operation via a shared channel.
- The ItemBlock processing loop implemented in Phase 1 will be modified to send each newly-created ItemBlock to the shared channel rather than calling `BackupItemBlock` inline.
- Users will be able to configure the number of workers available for concurrent `BackupItemBlock` operations.
- Access to the BackedUpItems map must be synchronized
## Detailed Design
### Phase 1: ItemBlock processing
#### New ItemBlockAction plugin type
In order for Velero to identify groups of items to back up together in an ItemBlock, we need a way to identify items which need to be backed up along with the current item. While the current `Execute` BackupItemAction method does return a list of additional items which are required by the current item, we need to know this *before* we start the item backup. To support this, we need a new plugin type, `ItemBlockAction` (IBA) with an API method, `GetRelatedItems` which Velero will call on each item as it processes. The expectation is that the registered IBA plugins will return the same items as returned as additional items by the BIA `Execute` method, with the exception that items which are not created until calling `Execute` should not be returned here, as they don't exist yet.
#### Proto definition (compiled into golang by protoc)
The ItemBlockAction plugin type is defined as follows:
A new PluginKind, `ItemBlockAction`, will be created, and the backup process will be modified to use this plugin kind.
For any BIA plugins which return additional items from `Execute()` that need to be backed up at the same time or sequentially in the same worker thread as the current items should add a new IBA plugin to return these same items (minus any which won't exist before BIA `Execute()` is called).
This mainly applies to plugins that operate on pods which reference resources which must be backed up along with the pod and are potentially affected by pod hooks or for plugins which connect multiple pods whose volumes should be backed up at the same time.
### Changes to processing item list from the Item Collector
#### New structs BackupItemBlock, ItemBlock, and ItemBlockItem
```go
packagebackup
typeBackupItemBlockstruct{
itemblock.ItemBlock
// This is a reference to the shared itemBackupper for the backup
itemBackupper*itemBackupper
}
packageitemblock
typeItemBlockstruct{
Loglogrus.FieldLogger
Items[]ItemBlockItem
}
typeItemBlockItemstruct{
Grschema.GroupResource
Item*unstructured.Unstructured
PreferredGVRschema.GroupVersionResource
}
```
#### Current workflow
In the `BackupWithResolvers` func, the current Velero implementation iterates over the list of items for backup returned by the Item Collector. For each item, Velero loads the item from the file created by the Item Collector, we call `backupItem`, update the GR map if successful, remove the (temporary) file containing item metadata, and update progress for the backup.
#### Modifications to the loop over ItemCollector results
The `kubernetesResource` struct used by the item collector will be modified to add an `orderedResource` bool which will be set true for all of the resources moved to the beginning for each GroupResource as a result of being ordered resources.
In addition, an `inItemBlock` bool is added to the struct which will be set to true later when processing the list when each item is added to an ItemBlock.
While the item collector already puts ordered resources first for each GR, there is no indication in the list which of these initial items are from the ordered resources list and which are the remaining (unordered) items.
Velero needs to know which resources are ordered because when we process them later, the ordered resources for each GroupResource must be processed sequentially in a single ItemBlock.
The current workflow within each iteration of the ItemCollector.items loop will replaced with the following:
- (note that some of the below should be pulled out into a helper func to facilitate recursive call to it for items returned from `GetRelatedItems`.)
- Before loop iteration, create a pointer to a `BackupItemBlock` which will represent the current ItemBlock being processed.
- If `item` has `inItemBlock==true`, continue. This one has already been processed.
- If current `itemBlock` is nil, create it.
- Add `item` to `itemBlock`.
- Load item from ItemCollector file. Close/remove file after loading (on error return or not, possibly with similar anonymous func to current impl)
- If other versions of the same item exist (via EnableAPIGroupVersions), add these to the `itemBlock` as well (and load from ItemCollector file)
- Get matching IBA plugins for item, call `GetRelatedItems` for each. For each item returned, get full item content from ItemCollector (if present in item list, pulling from file, removing file when done) or from cluster (if not present in item list), add item to the current block, add item to `itemsInBlock` map, and then recursively apply current step to each (i.e. call IBA method, add to block, etc.)
- If current item and next item are both ordered items for the same GR, then continue to next item, adding to current `itemBlock`.
- Once full ItemBlock list is generated, call `backupItemBlock(block ItemBlock)
- Add `backupItemBlock` return values to `backedUpGroupResources` map
#### New func `backupItemBlock`
Method signature for new func `backupItemBlock` is as follows:
The return value is a slice of GRs for resources which were backed up. Velero tracks these to determine which CRDs need to be included in the backup. Note that we need to make sure we include in this not only those resources that were backed up directly, but also those backed up indirectly via additional items BIA execute returns.
In order to handle backup hooks, this func will first take the input item list (`block.items`) and get a list of included pods, filtered to include only those not yet backed up (using `block.itemBackupper.backupRequest.BackedUpItems`). Iterate over this list and execute pre hooks (pulled out of `itemBackupper.backupItemInternal`) for each item.
Now iterate over the full list (`block.items`) and call `backupItem` for each. After the first, the later items should already have been backed up, but calling a second time is harmless, since the first thing Velero does is check the `BackedUpItems` map, exiting if item is already backed up). We still need this call in case there's a plugin which returns something in `GetAdditionalItems` but forgets to return it in the `Execute` additional items return value. If we don't do this, we could end up missing items.
After backing up the items in the block, we now execute post hooks using the same filtered item list we used for pre hooks, again taking the logic from `itemBackupper.backupItemInternal`).
#### `itemBackupper.backupItemInternal` cleanup
After implementing backup hooks in `backupItemBlock`, hook processing should be removed from `itemBackupper.backupItemInternal`.
### Phase 2: Process ItemBlocks for a single backup in multiple threads
#### New input field for number of ItemBlock workers
The velero installer and server CLIs will get a new input field `itemBlockWorkerCount`, which will be passed along to the `backupReconciler`.
The `backupReconciler` struct will also have this new field added.
#### Worker pool for item block processing
A new type, `ItemBlockWorker` will be added which will manage a pool of worker goroutines which will process item blocks, a shared input channel for passing blocks to workers, and a WaitGroup to shut down cleanly when the reconciler exits.
The worker pool will be started by calling `StartItemBlockWorkerPool` in `NewBackupReconciler()`, passing in the worker count and reconciler context.
`backupreconciler.prepareBackupRequest` will also add the input channel to the `backupRequest` so that it will be available during backup processing.
The func `StartItemBlockWorkerPool` will create the `ItemBlockWorkerPool` with a shared buffered input channel (fixed buffer size) and start `workers` gororoutines which will each call `processItemBlockWorker`.
The `processItemBlockWorker` func (run by the worker goroutines) will read from `itemBlockChannel`, call `BackupItemBlock` on the retrieved `ItemBlock`, and then send the return value to the retrieved `returnChan`, and then process the next block.
#### Modify ItemBlock processing loop to send ItemBlocks to the worker pool rather than backing them up directly
The ItemBlock processing loop implemented in Phase 1 will be modified to send each newly-created ItemBlock to the shared channel rather than calling `BackupItemBlock` inline, using a WaitGroup to manage in-process items. A separate goroutine will be created to process returns for this backup. After completion of the ItemBlock processing loop, velero will use the WaitGroup to wait for all ItemBlock processing to complete before moving forward.
A simplified example of what this response goroutine might look like:
```go
// omitting cancel handling, context, etc
ret:=make(chanItemBlockReturn)
wg:=&sync.WaitGroup{}
// Handle returns
gofunc(){
for{
select{
caseresponse:=<-ret:// process each BackupItemBlock response
func(){
deferwg.Done()
responses=append(responses,response)
}()
case<-ctx.Done():
return
}
}
}()
// Simplified illustration, looping over and assumed already-determined ItemBlock list
// responses from BackupItemBlock calls are in responses
```
When processing the responses, the main thing is to set `backedUpGroupResources[item.groupResource]=true` for each GR returned, which will give the same result as the current implementation calling items one-by-one and setting that field as needed.
The ItemBlock processing loop described above will be split into two separate iterations. For the first iteration, velero will only process those items at the beginning of the loop identified as `orderedResources` -- when the groups generated from these resources are passed to the worker channel, velero will wait for the response before moving on to the next ItemBlock.
This is to ensure that the ordered resources are processed in the required order. Once the last ordered resource is processed, the remaining ItemBlocks will be processed and sent to the worker channel without waiting for a response, in order to allow these ItemBlocks to be processed in parallel.
The reason we must execute `ItemBlocks` with ordered resources first (and one at a time) is that this is a list of resources identified by the user as resources which must be backed up first, and in a particular order.
#### Synchronize access to the BackedUpItems map
Velero uses a map of BackedUpItems to track which items have already been backed up. This prevents velero from attempting to back up an item more than once, as well as guarding against creating infinite loops due to circular dependencies in the additional items returns. Since velero will now be accessing this map from the parallel goroutines, access to the map must be synchronized with mutexes.
### Backup Finalize phase
The finalize phase will not be affected by the ItemBlock design, since this is just updating resources after async operations are completed on the items and there is no need to run these updates in parallel.
## Alternatives considered
### BackpuItemAction v3 API
Instead of adding a new `ItemBlockAction` plugin type, we could add a `GetAdditionalItems` method to BackupItemAction.
This was rejected because the new plugin type provides a cleaner interface, and keeps the function of grouping related items separate from the function of modifying item content for the backup.
### Per-backup worker pool
The current design makes use of a permanent worker pool, started at backup controller startup time. With this design, when we follow on with running multiple backups in parallel, the same set of workers will take ItemBlock inputs from more than one backup. Another approach that was initially considered was a temporary worker pool, created while processing a backup, and deleted upon backup completion.
#### User-visible API differences between the two approaches
The main user-visible difference here is in the configuration API. For the permanent worker approach, the worker count represents the total worker count for all backups. The concurrent backup count represents the number of backups running at the same time. At any given time, though, the maximum number of worker threads backing up items concurrently is equal to the worker count. If worker count is 15 and the concurrent backup count is 3, then there will be, at most, 15 items being processed at the same time, split among up to three running backups.
For the per-backup worker approach, the worker count represents the worker count for each backup. The concurrent backup count, as before, represents the number of backups running at the same time. If worker count is 15 and the concurrent backup count is 3, then there will be, at most, 45 items being processed at the same time, up to 15 for each of up to three running backups.
#### Comparison of the two approaches
- Permanent worker pool advantages:
- This is the more commonly-followed Kubernetes pattern. It's generally better to follow standard practices, unless there are genuine reasons for the use case to go in a different way.
- It's easier for users to understand the maximum number of concurrent items processed, which will have performance impact and impact on the resource requirements for the Velero pod. Users will not have to multiply the config numbers in their heads when working out how many total workers are present.
- It will give us more flexibility for future enhancements around concurrent backups. One possible use case: backup priority. Maybe a user wants scheduled backups to have a lower priority than user-generated backups, since a user is sitting there waiting for completion -- a shared worker pool could react to the priority by taking ItemBlocks for the higher priority backup first, which would allow a large lower-priority backup's items to be preempted by a higher-priority backup's items without needing to explicitly stop the main controller flow for that backup.
- Per-backup worker pool advantages:
- Lower memory consumption than permanent worker pool, but the total memory used by a worker blocked on input will be pretty low, so if we're talking only 10-20 workers, the impact will be minimal.
## Compatibility
### Example IBA implementation for BIA plugins which return additional items
Included below is an example of what might be required for a BIA plugin which returns additional items.
The code is taken from the internal velero `pod_action.go` which identifies the items required for a given pod.
In this particular case, the only function of pod_action is to return additional items, so we can really just convert this plugin to an IBA plugin. If there were other actions, such as modifying the pod content on backup, then we would still need the pod action, and the related items vs. content manipulation functions would need to be separated.
```go
// PodAction implements ItemBlockAction.
typePodActionstruct{
loglogrus.FieldLogger
}
// NewPodAction creates a new ItemAction for pods.
**Velero Generic Data Path (VGDP)**: VGDP is the collective modules that is introduced in [Unified Repository design][1]. Velero uses these modules to finish data transfer for various purposes (i.e., PodVolume backup/restore, Volume Snapshot Data Movement). VGDP modules include uploaders and the backup repository.
**Exposer**: Exposer is a module that is introduced in [Volume Snapshot Data Movement Design][2]. Velero uses this module to expose the volume snapshots to Velero node-agent pods or node-agent associated pods so as to complete the data movement from the snapshots.
**backupPVC**: The intermediate PVC created by the exposer for VGDP to access data from, see [Volume Snapshot Data Movement Design][2] for more details.
**backupPod**: The pod consumes the backupPVC so that VGDP could access data from the backupPVC, see [Volume Snapshot Data Movement Design][2] for more details.
**sourcePVC**: The PVC to be backed up, see [Volume Snapshot Data Movement Design][2] for more details.
## Background
As elaberated in [Volume Snapshot Data Movement Design][2], a backupPVC may be created by the Exposer and the VGDP reads data from the backupPVC.
In some scenarios, users may need to configure some advanced settings of the backupPVC so that the data movement could work in best performance in their environments. Specifically:
- For some storage providers, when creating a read-only volume from a snapshot, it is very fast; whereas, if a writable volume is created from the snapshot, they need to clone the entire disk data, which is time consuming. If the backupPVC's `accessModes` is set as `ReadOnlyMany`, the volume driver is able to tell the storage to create a read-only volume, which may dramatically shorten the snapshot expose time. On the other hand, `ReadOnlyMany` is not supported by all volumes. Therefore, users should be allowed to configure the `accessModes` for the backupPVC.
- Some storage providers create one or more replicas when creating a volume, the number of replicas is defined in the storage class. However, it doesn't make any sense to keep replicas when an intermediate volume used by the backup. Therefore, users should be allowed to configure another storage class specifically used by the backupPVC.
## Goals
- Create a mechanism for users to specify various configurations for backupPVC
## Non-Goals
## Solution
We will use the ConfigMap specified by `velero node-agent` CLI's parameter `--node-agent-configmap` to host the backupPVC configurations.
This configMap is not created by Velero, users should create it manually on demand. The configMap should be in the same namespace where Velero is installed. If multiple Velero instances are installed in different namespaces, there should be one configMap in each namespace which applies to node-agent in that namespace only.
Node-agent server checks these configurations at startup time and use it to initiate the related Exposer modules. Therefore, users could edit this configMap any time, but in order to make the changes effective, node-agent server needs to be restarted.
Inside the ConfigMap we will add one new kind of configuration as the data in the configMap, the name is ```backupPVC```.
Users may want to set different backupPVC configurations for different volumes, therefore, we define the configurations as a map and allow users to specific configurations by storage class. Specifically, the key of the map element is the storage class name used by the sourcePVC and the value is the set of configurations for the backupPVC created for the sourcePVC.
The data structure is as below:
```go
type Configs struct {
// LoadConcurrency is the config for data path load concurrency per node.
// ReadOnly sets the backupPVC's access mode as read only.
ReadOnly bool `json:"readOnly,omitempty"`
}
```
### Sample
A sample of the ConfigMap is as below:
```json
{
"backupPVC": {
"storage-class-1": {
"storageClass": "snapshot-storage-class",
"readOnly": true
},
"storage-class-2": {
"storageClass": "snapshot-storage-class"
},
"storage-class-3": {
"readOnly": true
}
}
}
```
To create the configMap, users need to save something like the above sample to a json file and then run below command:
```
kubectl create cm <ConfigMap name> -n velero --from-file=<json file name>
```
### Implementation
The `backupPVC` is passed to the exposer and the exposer sets the related specification and create the backupPVC.
If `backupPVC.storageClass` doesn't exist or set as empty, the sourcePVC's storage class will be used.
If `backupPVC.readOnly` is set to true, `ReadOnlyMany` will be the only value set to the backupPVC's `accessModes`, otherwise, `ReadWriteOnce` is used.
Once `backupPVC.storageClass` is set, users must make sure that the specified storage class exists in the cluster and can be used the the backupPVC, otherwise, the corresponding DataUpload CR will stay in `Accepted` phase until the prepare timeout (by default 30min).
Once `backupPVC.readOnly` is set to true, users must make sure that the storage supports to create a `ReadOnlyMany` PVC from a snapshot, otherwise, the corresponding DataUpload CR will stay in `Accepted` phase until the prepare timeout (by default 30min).
Once above problems happen, the DataUpload CR is cancelled after prepare timeout and the backupPVC and backupPod will be deleted, so there is no way to tell the cause is one of the above problems or others.
To help the troubleshooting, we can add some diagnostic mechanism to discover the status of the backupPod before deleting it as a result of the prepare timeout.
**Backup Storage**: The storage to store the backup data. Check [Unified Repository design][1] for details.
**Backup Repository**: Backup repository is layered between BR data movers and Backup Storage to provide BR related features that is introduced in [Unified Repository design][1].
**Velero Generic Data Path (VGDP)**: VGDP is the collective of modules that is introduced in [Unified Repository design][1]. Velero uses these modules to finish data transfer for various purposes (i.e., PodVolume backup/restore, Volume Snapshot Data Movement). VGDP modules include uploaders and the backup repository.
**Data Mover Pods**: Intermediate pods which hold VGDP and complete the data transfer. See [VGDP Micro Service for Volume Snapshot Data Movement][2] and [VGDP Micro Service For fs-backup][3] for details.
**Repository Maintenance Pods**: Pods for [Repository Maintenance Jobs][4], which holds VGDP to run repository maintenance.
## Background
According to the [Unified Repository design][1] Velero uses selectable backup repositories for various backup/restore methods, i.e., fs-backup, volume snapshot data movement, etc. Some backup repositories may need to cache data on the client side for various repository operation, so as to accelerate the execution.
In the existing [Backup Repository Configuration][5], we allow users to configure the cache data size (`cacheLimitMB`). However, the cache data is still stored in the root file system of data mover pods/repository maintenance pods, so stored in the root file system of the node. This is not good enough, reasons:
- In many distributions, the node's system disk size is predefined, non configurable and limit, e.g., the system disk size may be 20G or less
- Velero supports concurrent data movements in each node. The cache in each of the concurrent data mover pods could quickly run out of the system disk and cause problems like pod eviction, failure of pod creation, degradation of Kubernetes QoS, etc.
We need to allow users to prepare a dedicated location, e.g., a dedictated volume, for the cache.
Not all backup repositories or not all backup repository operations require cache, we need to define the details when and how the cache is used.
## Goals
- Create a mechanism for users to configure cache volumes for various pods running VGDP
- Design the workflow to assign the cache volume pod path to backup repositories
- Describe when and how the cache volume is used
## Non-Goals
- The solution is based on [Unified Repository design][1], [VGDP Micro Service for Volume Snapshot Data Movement][2] and [VGDP Micro Service For fs-backup][3], legacy data paths are not supported. E.g., when a pod volume restore (PVR) runs with legacy Restic path, if any data is cached, the cache still resides in the root file system.
## Solution
### Cache Data
Varying on backup repositoires, cache data may include payload data or repository metadata, e.g., indexes to the payload data chunks.
Payload data is highly related to the backup data, and normally take the majority of the repository data as well as the cache data.
Repository metadata is related to the backup repository's chunking algorithm, data chunk mapping method, etc, and so the size is not proportional to the backup data size.
On the other hand for some backup repository, in extreme cases, the repository metadata may be significantly large. E.g., Kopia's indexes are per chunks, if there are huge number of small files in the repository, Kopia's index data may be in the same level of or even larger than the payload data.
However, in the cases that repository metadata data become the majority, other bottlenecks may emerge and concurrency of data movers may be significantly constrained, so the requirement to cache volumes may go away.
Therefore, for now we only consider the cache volume requirement for payload data, and leave the consideration for metadata as a future enhancement.
### Scenarios
Backup repository cache varies on backup repositories and backup repository operation during VGDP runs. Below are the scenarios when VGDP runs:
- Data Upload for Backup: this is the process to upload/write the backup data into the backup repository, e.g., DataUpload or PodVolumeBackup. The pieces of data is almost directly written to the repository, sometimes with a small group staying shortly in the local place. That is to say, there should not be large scale data cached for this scenario, so we don't prepare dedicated cache for this scenario.
- Repository Maintenance: Repository maintenance most often visits the backup repository's metadata and sometimes it needs to visit the file system directories from the backed up data. On the other hand, it is not practical to run concurrent maintenance jobs in one node. So the cache data is neither large nor affect the root file system too much. Therefore, we don't need to prepare dedicated cache for this scenario.
- Data Download for Restore: this is the process to download/read the backup data from the backup repository during restore, e.g., DataDownload or PodVolumeRestore. For backup repositories for which data are stored in remote backup storages (e.g., Kopia repository stores data in remote object stores), large scale of data are cached locally to accerlerate the restore. Therefore, we need dedicate cache volumes for this scenario.
- Backup Deletion: During this scenario, backup repository is connected, metadata is enumerated to find the repository snapshot representing the backup data. That is to say, only metadata is cached if any. Therefore, dedicated cache volumes are not required in this scenario.
The above analyses are based on the common behavior of backup repositories and they are not considering the case that backup repository metadata takes majority or siginficant proportion of the cache data.
As a conclusion of the analyses, we will create dedicated cache volumes for restore scenarios.
For other scenarios, we can add them regarded to the future changes/requirements. The mechanism to expose and connect the cache volumes should work for all scenarios. E.g., if we need to consider the backup repository metadata case, we may need cache volumes for backup and repository maintenance as well, then we can just reuse the same cache volume provision and connection mechanism to backup and repository maintenance scenarios.
### Cache Data and Lifecycle
If available, one cache volume is dedicately assigned to one data mover pod. That is, the cached data is destroyed when the data mover pod completes. Then the backup repository instance also closes.
Cache data are fully managed by the specific backup repository. So the backup repository may also have its own way to GC the cache data.
That is to say, cache data GC may be launched by the backup repository instance during the running of the data mover pod; then the left data are automatically destroyed when the data mover pod and the cache PVC are destroyed (cache PVC's `reclaimPolicy` is always `Deleted`, so once the cache PVC is destroyed, the volume will also be destroyed). So no specially logics are needed for cache data GC.
### Data Size
Cache volumes take storage space and cluster resources (PVC, PV), therefore, cache volumes should be created only when necessary and the volumes should be with reasonable size based on the cache data size:
- It is not a good bargain to have cache volumes for small backups, small backups will use resident cache location (the cache location in the root file system)
- The cache data size has a limit, the existing `cacheLimitMB` is used for this purpose. E.g., it could be set as 1024 for a 1TB backup, which means 1GB of data is cached and the old cache data exceeding this size will be cleared. Therefore, it is meaningless to set the cache volume size much larger than `cacheLimitMB`
### Cache Volume Size
The cache volume size is calculated from below factors (for Restore scenarios):
- **Limit**: The limit of the cache data, that is represented by `cacheLimitMB`, the default value is 5GB
- **backupSize**: The size of the backup as a reference to evaluate whether to create a cache volume. It doesn't mean the backup data really decides the cache data all the time, it is just a reference to evaluate the scale of the backup, small scale backups may need small cache data. Sometimes, backupSize is not irrelevant to the size of cache data, in this case, ResidentThreshold should not be set, Limit will be used directly. It is unlikely that backupSize is unavailable, but once that happens, ResidentThreshold is ignored, Limit will be used directly.
- **ResidentThreshold**: The minimum backup size that a cache volume is created
- **InflationPercentage**: Considering the overhead of the file system and the possible delay of the cache cleanup, there should be an inflation for the final volume size vs. the logical size, otherwise, the cache volume may be overrun. This inflation percentage is hardcoded, e.g., 20%.
Finally, the `cacheVolumeSize` will be rounded up to GiB considering the UX friendliness, storage friendliness and management friendliness.
### PVC/PV
The PVC for a cache volume is created in Velero namespace and a storage class is required for the cache PVC. The PVC's accessMode is `ReadWriteOnce` and volumeMode is `FileSystem`, so the storage class provided should support this specification. Otherwise, if the storageclass doesn't support either of the specifications, the data mover pod may be hang in `Pending` state until a timeout setting with the data movement (e.g. `prepareTimeout`) and the data movement will finally fail.
It is not expected that the cache volume is retained after data mover pod is deleted, so the `reclaimPolicy` for the storageclass must be `Delete`.
To detect the problems in the storageclass and fail earlier, a validation is applied to the storageclass and once the validation fails, the cache configuration will be ignored, so the data mover pod will be created without a cache volume.
### Cache Volume Configurations
Below configurations are introduced:
- **residentThresholdMB**: the minimum data size(in MB) to be processed (if available) that a cache volume is created
- **cacheStorageClass**: the name of the storage class to provision the cache PVC
Not like `cacheLimitMB` which is set to and affect the backup repository, the above two configurations are actually data mover configurations of how to create cache volumes to data mover pods; and the two configurations don't need to be per backup repository. So we add them to the node-agent Configuration.
### Sample
Below are some examples of the node-agent configMap with the configurations:
Sample-1:
```json
{
"cacheVolume":{
"storageClass":"sc-1",
"residentThresholdMB":1024
}
}
```
Sample-2:
```json
{
"cacheVolume":{
"storageClass":"sc-1",
}
}
```
Sample-3:
```json
{
"cacheVolume":{
"residentThresholdMB":1024
}
}
```
**sample-1**: This is a valid configuration. Restores with backup data size larger than 1G will be assigned a cache volume using storage class `sc-1`.
**sample-2**: This is a valid configuration. Data mover pods are always assigned a cache volume using storage class `sc-1`.
**sample-3**: This is not a valid configuration because the storage class is absent. Velero gives up creating a cache volume.
To create the configMap, users need to save something like the above sample to a json file and then run below command:
```
kubectl create cm <ConfigMap name> -n velero --from-file=<json file name>
```
The cache volume configurations will be visited by node-agent server, so they also need to specify the `--node-agent-configmap` to the `velero node-agent` parameters.
## Detailed Design
### Backup and Restore
The restore needs to know the backup size so as to calculate the cache volume size, some new fields are added to the DataDownload and PodVolumeRestore CRDs.
`snapshotSize` field is also added to DataDownload and PodVolumeRestore's `spec`:
```yaml
spec:
snapshotID:
description:SnapshotID is the ID of the Velero backup snapshot to
be restored from.
type:string
snapshotSize:
description:SnapshotSize is the logical size of the snapshot.
format:int64
type:integer
```
`snapshotSize` represents the total size of the backup; during restore, the value is transferred from DataUpload/PodVolumeBackup's `Status.Progress.TotalBytes` to DataDownload/PodVolumeRestore.
It is unlikely that `Status.Progress.TotalBytes` from DataUpload/PodVolumeBackup is unavailable, but once it happens, according to the above formula, `residentThresholdMB` is ignored, cache volume size is calculated directly from cache limit for the corresponding backup repository.
### Exposer
Cache volume configurations are retrieved by node-agent and passed through DataDownload/PodVolumeRestore to GenericRestore exposer/PodVolume exposer.
The exposers are responsible to calculate cache volume size, create cache PVCs and mount them to the restorePods.
If the calculated cache volume size is 0, or any of the critical parameters is missing (e.g., cache volume storage class), the exposers ignore the cache volume configuration and continue with creating restorePods without cache volumes, so no impact to the result of the restore.
Exposers mount the cache volume to a predefined directory and pass the directory to the data mover pods through the `cache-volume-path` parameter.
Below data structure is added to the exposers' expose parameters:
```go
typeGenericRestoreExposeParamstruct{
// RestoreSize specifies the data size for the volume to be restored
RestoreSizeint64
// CacheVolume specifies the info for cache volumes
CacheVolume*CacheVolumeInfo
}
typePodVolumeExposeParamstruct{
// RestoreSize specifies the data size for the volume to be restored
RestoreSizeint64
// CacheVolume specifies the info for cache volumes
CacheVolume*repocache.CacheConfigs
}
typeCacheConfigsstruct{
// StorageClass specifies the storage class for cache volumes
StorageClassstring
// Limit specifies the maximum size of the cache data
Limitint64
// ResidentThreshold specifies the minimum size of the cache data to create a cache volume
ResidentThresholdint64
}
```
### Data Mover Pods
Data mover pods retrieve the cache volume directory from `cache-volume-path` parameter and pass it to Unified Repository.
If the directory is empty, Unified Repository uses the resident location for data cache, that is, the root file system.
### Kopia Repository
Kopia repository supports cache directory configuration for both metadata and data. The existing `SetupConnectOptions` is modified to customize the `CacheDirectory`:
**Backup Storage**: The storage to store the backup data. Check [Unified Repository design][1] for details.
**Backup Repository**: Backup repository is layered between BR data movers and Backup Storage to provide BR related features that is introduced in [Unified Repository design][1].
## Background
According to the [Unified Repository design][1] Velero uses selectable backup repositories for various backup/restore methods, i.e., fs-backup, volume snapshot data movement, etc. To achieve the best performance, backup repositories may need to be configured according to the running environments.
For example, if there are sufficient CPU and memory resources in the environment, users may enable compression feature provided by the backup repository, so as to achieve the best backup throughput.
As another example, if the local disk space is not sufficient, users may want to constraint the backup repository's cache size, so as to prevent the repository from running out of the disk space.
Therefore, it is worthy to allow users to configure some essential parameters of the backup repsoitories, and the configuration may vary from backup repositories.
## Goals
- Create a mechanism for users to specify configurations for backup repositories
## Non-Goals
## Solution
### BackupRepository CRD
After a backup repository is initialized, a BackupRepository CR is created to represent the instance of the backup repository. The BackupRepository's spec is a core parameter used by Unified Repo modules when interactive with the backup repsoitory. Therefore, we can add the configurations into the BackupRepository CR called ```repositoryConfig```.
The configurations may be different varying from backup repositories, therefore, we will not define each of the configurations explicitly. Instead, we add a map in the BackupRepository's spec to take any configuration to be set to the backup repository.
During various operations to the backup repository, the Unified Repo modules will retrieve from the map for the specific configuration that is required at that time. So even though it is specified, a configuration may not be visited/hornored if the operations don't require it for the specific backup repository, this won't bring any issue. When and how a configuration is hornored is decided by the configuration itself and should be clarified in the configuration's specification.
Below is the new BackupRepository's spec after adding the configuration map:
```yaml
spec:
description: BackupRepositorySpec is the specification for a BackupRepository.
properties:
backupStorageLocation:
description: |-
BackupStorageLocation is the name of the BackupStorageLocation
that should contain this repository.
type: string
maintenanceFrequency:
description: MaintenanceFrequency is how often maintenance should
be run.
type: string
repositoryConfig:
additionalProperties:
type: string
description: RepositoryConfig contains configurations for the specific
repository.
type: object
repositoryType:
description: RepositoryType indicates the type of the backend repository
enum:
- kopia
- restic
- ""
type: string
resticIdentifier:
description: |-
ResticIdentifier is the full restic-compatible string for identifying
this repository.
type: string
volumeNamespace:
description: |-
VolumeNamespace is the namespace this backup repository contains
pod volume backups for.
type: string
required:
- backupStorageLocation
- maintenanceFrequency
- resticIdentifier
- volumeNamespace
type: object
```
### BackupRepository configMap
The BackupRepository CR is not created explicitly by a Velero CLI, but created as part of the backup/restore/maintenance operation if the CR doesn't exist. As a result, users don't have any way to specify the configurations before the BackupRepository CR is created.
Therefore, a BackupRepository configMap is introduced as a template of the configurations to be applied to the backup repository CR.
When the backup repository CR is created by the BackupRepository controller, the configurations in the configMap are copied to the ```repositoryConfig``` field.
For an existing BackupRepository CR, the configMap is never visited, if users want to modify the configuration value, they should directly edit the BackupRepository CR.
The BackupRepository configMap is created by users in velero installation namespace. The configMap name must be specified in the velero server parameter ```--backup-repository-configmap```, otherwise, it won't effect.
If the configMap name is specified but the configMap doesn't exist by the time of a backup repository is created, the configMap name is ignored.
For any reason, if the configMap doesn't effect, nothing is specified to the backup repository CR, so the Unified Repo modules use the hard-coded values to configure the backup repository.
The BackupRepository configMap supports backup repository type specific configurations, even though users can only specify one configMap.
So in the configMap struct, multiple entries are supported, indexed by the backup repository type. During the backup repository creation, the configMap is searched by the repository type.
### Configurations
With the above mechanisms, any kind of configuration could be added. Here list the configurations defined at present:
```cacheLimitMB```: specifies the size limit(in MB) for the local data cache. The more data is cached locally, the less data may be downloaded from the backup storage, so the better performance may be achieved. Practically, users can specify any size that is smaller than the free space so that the disk space won't run out. This parameter is for each repository connection, that is, users could change it before connecting to the repository. If a backup repository doesn't use local cache, this parameter will be ignored. For Kopia repository, this parameter is supported.
```enableCompression```: specifies to enable/disable compression for a backup repsotiory. Most of the backup repositories support the data compression feature, if it is not supported by a backup repository, this parameter is ignored. Most of the backup repositories support to dynamically enable/disable compression, so this parameter is defined to be used whenever creating a write connection to the backup repository, if the dynamically changing is not supported, this parameter will be hornored only when initializing the backup repository. For Kopia repository, this parameter is supported and can be dynamically modified.
### Sample
Below is an example of the BackupRepository configMap with the configurations:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: <config-name>
namespace: velero
data:
<repository-type-1>: |
{
"cacheLimitMB": 2048,
"enableCompression": true
}
<repository-type-2>: |
{
"cacheLimitMB": 1,
"enableCompression": false
}
```
To create the configMap, users need to save something like the above sample to a file and then run below commands:
This design document describes the enhancement of BackupStorageLocation (BSL) certificate management in Velero, introducing a Secret-based certificate reference mechanism (`caCertRef`) alongside the existing inline certificate field (`caCert`). This enhancement provides a more secure, Kubernetes-native approach to certificate management while enabling future CLI improvements for automatic certificate discovery.
## Background
Currently, Velero supports TLS certificate verification for object storage providers through an inline `caCert` field in the BSL specification. While functional, this approach has several limitations:
- **Security**: Certificates are stored directly in the BSL YAML, potentially exposing sensitive data
- **Management**: Certificate rotation requires updating the BSL resource itself
- **CLI Usability**: Users must manually specify certificates when using CLI commands
- **Size Limitations**: Large certificate bundles can make BSL resources unwieldy
Issue #9097 and PR #8557 highlight the need for improved certificate management that addresses these concerns while maintaining backward compatibility.
## Goals
- Provide a secure, Secret-based certificate storage mechanism
- Maintain full backward compatibility with existing BSL configurations
- Enable future CLI enhancements for automatic certificate discovery
- Simplify certificate rotation and management
- Provide clear migration path for existing users
## Non-Goals
- Removing support for inline certificates immediately
- Changing the behavior of existing BSL configurations
- Implementing client-side certificate validation
- Supporting certificates from ConfigMaps or other resource types
## High-Level Design
### API Changes
#### New Field: CACertRef
```go
typeObjectStorageLocationstruct{
// Existing field (now deprecated)
// +optional
// +kubebuilder:deprecatedversion:warning="caCert is deprecated, use caCertRef instead"
returnnil,errors.Wrap(err,"error getting CA certificate from secret")
}
return[]byte(certString),nil
}
// Fall back to caCert (deprecated)
ifbsl.Spec.ObjectStorage.CACert!=nil{
returnbsl.Spec.ObjectStorage.CACert,nil
}
returnnil,nil
}
```
### CLI Certificate Discovery Integration
#### Background: PR #8557 Implementation
PR #8557 ("CLI automatically discovers and uses cacert from BSL") was merged in August 2025, introducing automatic CA certificate discovery from BackupStorageLocation for Velero CLI download operations. This eliminated the need for users to manually specify the `--cacert` flag when performing operations like `backup describe`, `backup download`, `backup logs`, and `restore logs`.
#### Current Implementation (Post PR #8557)
The CLI now automatically discovers certificates from BSL through the `pkg/cmd/util/cacert/bsl_cacert.go` module:
```go
// Current implementation only supports inline caCert
This design provides a secure, Kubernetes-native approach to certificate management in Velero while maintaining backward compatibility. It establishes the foundation for enhanced CLI functionality and improved user experience, addressing the concerns raised in issue #9097 and enabling the features proposed in PR #8557.
The phased approach ensures smooth migration for existing users while delivering immediate security benefits for new deployments.
# Design to clean the artifacts generated in the CSI backup and restore workflows
## Terminology
* VSC: VolumeSnapshotContent
* VS: VolumeSnapshot
## Abstract
* The design aims to delete the unnecessary VSs and VSCs generated during CSI backup and restore process.
* The design stop creating related VSCs during backup syncing.
## Background
In the current CSI backup and restore workflows, please notice the CSI B/R workflows means only using the CSI snapshots in the B/R, not including the CSI snapshot data movement workflows, some generated artifacts are kept after the backup or the restore process completion.
Some of them are kept due to design, for example, the VolumeSnapshotContents generated during the backup are kept to make sure the backup deletion can clean the snapshots in the storage providers.
Some of them are kept by accident, for example, after restore, two VolumeSnapshotContents are generated for the same VolumeSnapshot. One is from the backup content, and one is dynamically generated from the restore's VolumeSnapshot.
The design aims to clean the unnecessary artifacts, and make the CSI B/R workflow more concise and reliable.
## Goals
- Clean the redundant VSC generated during CSI backup and restore.
- Remove the VSCs in the backup sync process.
## Non Goals
- There were some discussion about whether Velero backup should include VSs and VSCs not generated in during the backup. By far, the conclusion is not including them is a better option. Although that is a useful enhancement, that is not included this design.
- Delete all the CSI-related metadata files in the BSL is not the aim of this design.
## Detailed Design
### Backup
During backup, the main change is the backup-generated VSCs should not kept anymore.
The reasons is we don't need them to ensure the snapshots clean up during backup deletion. Please reference to the [Backup Deletion section](#backup-deletion) section for detail.
As a result, we can simplify the VS deletion logic in the backup. Before, we need to not only delete the VS, but also recreate a static VSC pointing a non-exiting VS.
The deletion code in VS BackupItemAction can be simplify to the following:
``` go
if backup.Status.Phase == velerov1api.BackupPhaseFinalizing ||
logger.Errorf("VolumeSnapshot %s/%s is not ready. This is not expected.",
vs.Namespace, vs.Name)
return
}
if vs.Status != nil && vs.Status.BoundVolumeSnapshotContentName != nil {
// Patch the DeletionPolicy of the VolumeSnapshotContent to set it to Retain.
// This ensures that the volume snapshot in the storage provider is kept.
if err := SetVolumeSnapshotContentDeletionPolicy(
vsc.Name,
client,
snapshotv1api.VolumeSnapshotContentRetain,
); err != nil {
logger.Warnf("Failed to patch DeletionPolicy of volume snapshot %s/%s",
vs.Namespace, vs.Name)
return
}
if err := client.Delete(context.TODO(), &vsc); err != nil {
logger.Warnf("Failed to delete the VSC %s: %s", vsc.Name, err.Error())
}
}
if err := client.Delete(context.TODO(), &vs); err != nil {
logger.Warnf("Failed to delete volumesnapshot %s/%s: %v", vs.Namespace, vs.Name, err)
} else {
logger.Infof("Deleted volumesnapshot with volumesnapshotContent %s/%s",
vs.Namespace, vs.Name)
}
}
```
### Restore
#### Restore the VolumeSnapshotContent
The current behavior of VSC restoration is that the VSC from the backup is restore, and the restored VS also triggers creating a new VSC dynamically.
Two VSCs created for the same VS in one restore seems not right.
Skip restore the VSC from the backup is not a viable alternative, because VSC may reference to a [snapshot create secret](https://kubernetes-csi.github.io/docs/secrets-and-credentials-volume-snapshot-class.html?highlight=snapshotter-secret-name#createdelete-volumesnapshot-secret).
If the `SkipRestore` is set true in the restore action's result, the secret returned in the additional items is ignored too.
As a result, restore the VSC from the backup, and setup the VSC and the VS's relation is a better choice.
Another consideration is the VSC name should not be the same as the backed-up VSC's, because the older version Velero's restore and backup keep the VSC after completion.
There's high possibility that the restore will fail due to the VSC already exists in the cluster.
Multiple restores of the same backup will also meet the same problem.
The proposed solution is using the restore's UID and the VS's name to generate sha256 hash value as the new VSC name. Both the VS and VSC RestoreItemAction can access those UIDs, and it will avoid the conflicts issues.
The restored VS name also shares the same generated name.
The VS-referenced VSC name and the VSC's snapshot handle name are in their status.
Velero restore process purges the restore resources' metadata and status before running the RestoreItemActions.
As a result, we cannot read these information in the VS and VSC RestoreItemActions.
Fortunately, RestoreItemAction input parameters includes the `ItemFromBackup`. The status is intact in `ItemFromBackup`.
csi-volumesnapshotclasses.json, csi-volumesnapshotcontents.json, and csi-volumesnapshots.json are CSI-related metadata files in the BSL for each backup.
csi-volumesnapshotcontents.json and csi-volumesnapshots.json are not needed anymore, but csi-volumesnapshotclasses.json is still needed.
One concrete scenario is that a backup is created in cluster-A, then the backup is synced to cluster-B, and the backup is deleted in the cluster-B. In this case, we don't have a chance to create the VS and VSC needed VolumeSnapshotClass.
The VSC deletion workflow proposed by this design needs to create the VSC first. If the VSC's referenced VolumeSnapshotClass doesn't exist in cluster, the creation of VSC will fail.
As a result, the VolumeSnapshotClass should still be synced in the backup sync process.
### Backup Deletion
Two factors are worthy for consideration for the backup deletion change:
* Because the VSCs generated by the backup are not synced anymore, and the VSCs generated during the backup will not be kept too. The backup deletion needs to generate a VSC, then deletes it to make sure the snapshots in the storage provider are clean too.
* The VSs generated by the backup are already deleted in the backup process, we don't need a DeleteItemAction for the VS anymore. As a result, the `velero.io/csi-volumesnapshot-delete` plugin is unneeded.
For the VSC DeleteItemAction, we need to generate a VSC. Because we only care about the snapshot deletion, we don't need to create a VS associated with the VSC.
Create a static VSC, then point it to a pseudo VS, and reference to the snapshot handle should be enough.
To avoid the created VSC conflict with older version Velero B/R generated ones, the VSC name is set to `vsc-uuid`.
The following is an example of the implementation.
``` go
uuid, err := uuid.NewRandom()
if err != nil {
p.log.WithError(err).Errorf("Fail to generate the UUID to create VSC %s", snapCont.Name)
return errors.Wrapf(err, "Fail to generate the UUID to create VSC %s", snapCont.Name)
if err := p.crClient.Get(ctx, crclient.ObjectKeyFromObject(&snapCont), tmpVSC); err != nil {
return false, errors.Wrapf(
err, "failed to get VolumeSnapshotContent %s", snapCont.Name,
)
}
if tmpVSC.Status != nil && boolptr.IsSetToTrue(tmpVSC.Status.ReadyToUse) {
return true, nil
}
return false, nil
},
); err != nil {
return errors.Wrapf(err, "fail to wait VolumeSnapshotContent %s becomes ready.", snapCont.Name)
}
```
## Security Considerations
Security is not relevant to this design.
## Compatibility
In this design, no new information is added in backup and restore. As a result, this design doesn't have any compatibility issue.
## Open Issues
Please notice the CSI snapshot backup and restore mechanism not supporting all file-store-based volume, e.g. Azure Files, EFS or vSphere CNS File Volume. Only block-based volumes are supported.
Refer to [this comment](https://github.com/vmware-tanzu/velero/issues/3151#issuecomment-2623507686) for more details.
This enhancement will enable Velero to process multiple backups at the same time. This is largely a usability enhancement rather than a performance enhancement, since the overall backup throughput may not be significantly improved over the current implementation, since we are already processing individual backup items in parallel. It is a significant usability improvement, though, as with the current design, a user who submits a small backup may have to wait significantly longer than expected if the backup is submitted immediately after a large backup.
## Background
With the current implementation, only one backup may be `InProgress` at a time. A second backup created will not start processing until the first backup moves on to `WaitingForPluginOperations` or `Finalizing`. This is a usability concern, especially in clusters when multiple users are initiating backups. With this enhancement, we intend to allow multiple backups to be processed concurrently. This will allow backups to start processing immediately, even if a large backup was just submitted by another user. This enhancement will build on top of the prior parallel item processing feature by creating a dedicatede ItemBlock worker pool for each running backup. The pool will be created at the beginning of the backup reconcile, and the input channel will be passed to the Kubernetes backupper just like it is in the current release.
The primary challenge is to make sure that the same workload in multiple backups is not backed up concurrently. If that were to happen, we would risk data corruption, especially around the processing of pod hooks and volume backup. For this first release we will take a conservative, high-level approach to overlap detection. Two backups will not run concurrently if there is any overlap in included namespaces. For example, if a backup that includes `ns1` and `ns2` is running, then a second backup for `ns2` and `ns3` will not be started. If a backup which does not filter namespaces is running (either a whole cluster backup or a non-namespace-limited backup with a label selector) then no other backups will be started, since a backup across all namespaces overlaps with any other backup. Calculating item-level overlap for queued backups is problematic since we don't know which items are included in a backup until backup processing has begun. A future release may add ItemBlock overlap detection, where at the item block worker level, the same item will not be processed by two different workers at the same time. This works together with workload conflict detection to further detect conflicts in a more granular level for shared resources between backups. Eventually, with a more complete understanding of individual workloads (either via ItemBlocks or some higher level model), the namespace level overlap detection may be relaxed in future versions.
## Goals
- Process multiple backups concurrently
- Detect namespace overlap to avoid conflicts
- For queued backups (not yet runnable due to concurrency limits or overlap), indicate the queue position in status
## Non Goals
- Handling NFS PVs when more than one PV point to the same underlying NFS share
- Handling VGDP cancellation for failed backups on restart
- Mounting a PVC for scenarios in which /tmp is too small for the number of concurrent backups
- Providing a mechanism to identify high priority backups which get preferential treatment in terms of ItemBlock worker availability
- Item-level overlap detection (future feature)
- Providing the ability to disable namespace-level overlap detection once Item-level overlap detection is in place (although this may be supported in a future version).
## High-Level Design
### Backup CRD changes
Two new backup phases will be added: `Queued` and `ReadyToStart`. In the Backup workflow, new backups will be moved to the Queued phase when they are added to the backup queue. When a backup is removed from the queue because it is now able to run, it will be moved to the `ReadyToStart` phase, which will allow the backup controller to start processing it.
In addition, a new Status field, `QueuePosition`, will be added to track the backup's current position in the queue.
### New Controller: `backupQueueReconciler`
A new reconciler will be added, `backupQueueReconciler` which will use the current `backupReconciler` logic for reconciling `New` backups but instead of running the backup, it will move the Backup to the `Queued` phase and set `QueuePosition`.
In addition, this reconciler will periodically reconcile all queued backups (on some configurable time interval) and if there is a runnable backup, remove it from the queue, update `QueuePosition` for any queued backups behind it, and update its phase to `ReadyToStart`.
Queued backups will be reconciled in order based on `QueuePosition`, so the first runnable backup found will be processed. A backup is runnable if both of the following conditions are true:
1) The total number of backups either `InProgress` or `ReadyToStart` is less than the configured number of concurrent backups.
2) The backup has no overlap with any backups currently `InProgress` or `ReadyToStart` or with any `Queued` backups with a higher (i.e. closer to 1) queue position than this backup.
### Updates to Backup controller
The current `backupReconciler` will change its reconciling rules. Instead of watching and reconciling New backups, it will reconcile `ReadyToStart` backups. In addition, it will be configured to run in parallel by setting `MaxConcurrentReconciles` based on the `concurrent-backups` server arg.
The startup (and shutdown) of the ItemBlock worker pool will be moved from reconciler startup to the backup reconcile, which will give each running backup its own dedicated worker pool. The per-backup worker pool will will use the existing `--item-block-worker-count` installer/server arg. This means that the maximum number of ItemBlock workers for the entire Velero pod will be the ItemBlock worker count multiplied by concurrentBackups. For example, if concurrentBackups is 5, and itemBlockWorkerCount is 6, then there will be, at most, 30 worker threads active, 5 dedicated to each InProgress backup, but this maximum will only be achieved when the maximum number of backups are InProgress. This also means that each InProgress backup will have a dedicated ItemBlock input channel with the same fixed buffer size.
## Detailed Design
### New Install/Server configuration args
A new install/server arg, `concurrent-backups` will be added. This will be an int-valued field specifying the number of backups which may be processed concurrently (with phase `InProgress`). If not specified, the default value of 1 will be used.
### Consideration of backup overlap and concurrent backup processing
The primary consideration for running additional backups concurrently is the configured `concurrent-backups` parameter. If the total number of `InProgress` and `ReadyToStart` backups is equal to `concurrent-backups` then any `Queued` backups will remain in the queue.
The second consideration is backup overlap. In order to prevent interaction between running backups (particularly around volume backup and pod hooks), we cannot allow two overlapping backups to run at the same time. For now, we will define overlap broadly -- requiring that two concurrent backups don't include any of the same namespaces. A backup for `ns1` can run concurrently with a backup for `ns2`, but a backup for `[ns1,ns2]` cannot run concurrently with a backup for `ns1`. One consequence of this approach is that a backup which includes all namespaces (even if further filtered by resource or label) cannot run concurrently with *any other backup*.
When determining which queued backup to run next, velero will look for the next queued backup which has no overlap with any InProgress backup or any Queued backup ahead of it. The reason we need to consider queued as well as running backups for overlap detection is as follows.
Consider the following scenario. These are the current not-completed backups (ordered from oldest to newest)
Assuming `concurrent-backups` is 2, on the next reconcile, Velero will be able to start a second backup if there is one with no overlap. `backup2` cannot run, since `ns2` overlaps between it and the running `backup1`. If we only considered running overlap (and not queued overlap), then `backup3` could run now. It conflicts with the queued `backup2` on `ns3` but it does not conflict with the running backup. However, if it runs now, then when `backup1` completes, then `backup2` still can't run (since it now overlaps with running `backup3`on `ns3`), so `backup4` starts instead. Now when `backup3` completes, `backup2` still can't run (since it now conflicts with `backup4` on `ns5`). This means that even though it was the second backup created, it's the fourth to run -- providing worse time to completion than without parallel backups. If a queued backup has a large number of namespaces (a full-cluster backup for example), it would never run as long as new single-namespace backups keep being added to the queue.
To resolve this problem we consider both running backups as well as backups ahead in the queue when resolving overlap conflicts. In the above scenario, `backup2` can't run yet since it overlaps with the running backup on `ns2`. In addition, `backup3` and `backup4` also can't run yet since they overlap with queued `backup2`. Therefore, `backup5` will run now. Once `backup1` completes, `backup2` will be free to run.
### Backup CRD changes
New Backup phases:
```go
const(
// BackupPhaseQueued means the backup has been added to the
// queue by the BackupQueueReconciler.
BackupPhaseQueuedBackupPhase="Queued"
// BackupPhaseReadyToStart means the backup has been removed from the
// queue by the BackupQueueReconciler and is ready to start.
BackupPhaseReadyToStartBackupPhase="ReadyToStart"
)
```
In addition, a new Status field, `queuePosition`, will be added to track the backup's current position in the queue.
```go
// QueuePosition is the position held by the backup in the queue.
// QueuePosition=1 means this backup is the next to be considered.
// Only relevant when Phase is "Queued"
// +optional
QueuePositionint`json:"queuePosition,omitempty"`
```
### New Controller: `backupQueueReconciler`
A new reconciler will be added, `backupQueueReconciler` which will reconcile backups under these conditions:
1) Watching Create/Update for backups in `New` (or empty) phase
2) Watching for Backup phase transition from `InProgress` to something else to reconcile all `Queued` backups
2) Watching for Backup phase transition from `New` (or empty) to `Queued` to reconcile all `Queued` backups
2) Periodic reconcile of `Queued` backups to handle backups queued at server startup as well as to make sure we never have a situation where backups are queued indefinitely because of a race condition or was otherwise missed in the reconcile on prior backup completion.
The reconciler will be set up as follows -- note that New backups are reconciled on Create/Update, while Queued backups are reconciled when an InProgress backup moves on to another state or when a new backup moves to the Queued state. We also reconcile Queued backups periodically to handle the case of a Velero pod restart with Queued backups, as well as to handle possible edge cases where a queued backup doesn't get moved out of the queue at the point of backup completion or an error occurs during a prior Queued backup reconcile.
New backups will be queued: Phase will be set to `Queued`, and `QueuePosition` will be set to a int value incremented from the highest current `QueuePosition` value among Queued backups.
Queued backups will be removed from the queue if runnable:
1) If the total number of backups either InProgress or ReadyToStart is greater than or equal to the concurrency limit, then exit without removing from the queue.
2) If the current backup overlaps with any InProgress, ReadyToStart, or Queued backup with `QueuePosition < currentBackup.QueuePosition` then exit without removing from the queue.
3) If we get here, the backup is runnable. To resolve a potential race condition where an InProgress backup completes between reconciling the backup with QueuePosition `n-1` and reconciling the current backup with QueuePosition `n`, we also check to see whether there are any runnable backups in the queue ahead of this one. The only time this will happen is if a backup completes immediately before reconcile starts which either frees up a concurrency slot or removes a namespace conflict. In this case, we don't want to run the current backup since the one ahead of this one in the queue (which was recently passed over before the InProgress backup completed) must run first. In this case, exit without removing from the queue.
4) If we get here, remove the backup from the queue by setting Phase to `ReadyToStart` and `QueuePosition` to zero. Decrement the `QueuePosition` of any other Queued backups with a `QueuePosition` higher than the current backup's queue position prior to dequeuing. At this point, the backup reconciler will start the backup.
// enqueue backup -- set phase=Queued, set queuePosition=maxCurrentQueuePosition+1
}
// We should only ever get these events when added in order by the periodical enqueue source
// so as long as the current backup has not conflicts ahead of it or running, we should be good to
// dequeue
case "", velerov1api.BackupPhaseQueued:
// list backups, filter on Queued, ReadyToStart, and InProgress
// if number of InProgress backups + number of ReadyToStart backups >= concurrency limit, exit
// generate list of all namespaces included in InProgress, ReadyToStart, and Queued backups with
// queuePosition < backup.Status.QueuePosition
// if overlap found, exit
// check backups ahead of this one in the queue for runnability. If any are runnable, exit
// dequeue backup: set Phase to ReadyToStart, QueuePosition to 0, and decrement QueuePosition
// for all QueuedBackups behind this one in the queue
}
```
The queue controller will run as a single reconciler thread, so we will not need to deal with concurrency issues when moving backups from New to Queued or from Queued to ReadyToStart, and all of the updates to QueuePosition will be from a single thread.
### Updates to Backup controller
The Reconcile logic will be updated to respond to ReadyToStart backups instead of New backups:
The controller-runtime core reconciler logic already prevents the same resource from being reconciled by two different reconciler threads, so we don't need to worry about concurrency issues at the controller level.
The workerPool reference will be moved from the backupReconciler to the backupRequest, since this will now be backup-specific, and the initialization code for the worker pool will be moved from the reconciler init into the backup reconcile. This worker pool will be shut down upon exiting the Reconcile method.
### Resilience to restart of velero pod
The new backup phases (`Queued` and `ReadyToStart`) will be resilient to velero pod restarts. If the velero pod crashes or is restarted, only backups in the `InProgress` phase will be failed, so there is no change to current behavior. Queued backups will retain their queue position on restart, and ReadyToStart backups will move to InProgress when reconciled.
### Observability
#### Logging
When a backup is dequeued, an info log message will also include the wait time, calculated as `now - creationTimestamp`. When a backup is passed over due to overlap, an info log message will indicate which namespaces were in conflict.
#### Velero CLI
The `velero backup describe` output will include the current queue position for queued backups.
# Proposal to add include exclude policy to resource policy
This enhancement will allow the user to set include and exclude filters for resources in a resource policy configmap, so that
these filters are reusable and the user will not need to set them each time they create a backup.
## Background
As mentioned in issue [#8610](https://github.com/vmware-tanzu/velero/issues/8610). When there's a long list of resources
to include or exclude in a backup, it can be cumbersome to set them each time a backup is created. There's a requirement to
set these filters in a separate data structure so that they can be reused in multiple backups.
## High-Level Design
We may extend the data structure of resource policy to add `includeExcludePolicy`, which include the include and exclude filters
in the BackupSpec. When the user creates a backup which references the resource policy config `velero backup create --resource-policies-configmap <configmap-name>`,
the filters in "includeExcludePolicy" will take effect to filter the resources when velero collects the resources to backup.
## Detailed Design
### Data Structure
The map `includeExcludePolicy` contains four fields `includedClusterScopedResources`, `excludedClusterScopedResources`,
`includedNamespaceScopedResources`,`excludedNamespaceScopedResources`. These filters work exactly as the filters defined BackupSpec with
the same names. An example of the policy looks like:
```yaml
#omitted other irrelevant fields like 'version', 'volumePolicies'
includeExcludePolicy:
includedClusterScopedResources:
- "cr"
- "crd"
- "pv"
excludedClusterScopedResources:
- "volumegroupsnapshotclass"
- "ingressclass"
includedNamespaceScopedResources:
- "pod"
- "service"
- "deployment"
- "pvc"
excludedNamespaceScopedResources:
- "configmap"
```
These filters are in the form of scoped include/exclude filters, which by design will not work with the "old" resource filters.
Therefore, when a Backup references a resource policy configmap which has `includeExcludePolicy`, and at the same time it has
the "old" resource filters, i.e. `includedResources`, `excludedResources`, `includeClusterResources` set in the BackupSpec, the
Backup will fail with a validation error.
### Priorities
A user may set the include/exclude filters in Backupspec and also in the resource policy configmap. In this case, the filters
in both the Backupspec and the resource policy configmap will take effect. When there's a conflict, the filters in the Backupspec
will take precedence. For example, if resource X is in the list of `includedNamespaceScopedResources` filter in the Backupspec, but
it's also in the list of `excludedClusterScopedResources` in the resource policy configmap, then resource X will be included in the backup.
In this way, users can set the filters in the resource policy configmap to cover most of their use cases, and then override them
in the Backupspec when needed.
### Implementation
In addition to the data structure change, we will need to implement the following changes:
1. A new function `CombineWithPolicy` will be added to the struct `ScopeIncludesExcludes`, which will combine the include/exclude filters
in the resource policy configmap with the include/exclude filters in the Backupspec:
@@ -65,7 +65,7 @@ This page contains a pre-migration checklist for ensuring a repo migration goes
#### Updating Netlify
The settings for Netflify should remain the same, except that it now needs to be installed in the new repo. The instructions on how to install Netlify on the new repo are here: https://www.netlify.com/docs/github-permissions/.
The settings for Netlify should remain the same, except that it now needs to be installed in the new repo. The instructions on how to install Netlify on the new repo are here: https://www.netlify.com/docs/github-permissions/.
At present, Velero images could be built for linux-amd64 and linux-arm64. We need to support other platforms, i.e., windows-amd64.
At present, for linux image build, we leverage Buildkit's `--platform` option to create the image manifest list in one build call. However, it is a limited way and doesn't fully support all multi-arch scenarios. Specifically, since the build is done in one call with the same parameters, it is impossbile to build images with different configurations (e.g., Windows build requires a different Dockerfile).
At present, Velero by default build images locally, or no image or manifest is pushed to registry. However, docker doesn't support multi-arch build locally. We need to clarify the behavior of local build.
## Goals
- Refactor the `make container` process to fully support multi-arch build
- Add Windows build to the existing build process
- Clarify the behavior of local build with multi-arch build capabilities
- Don't change the pattern of the final image tag to be used by users
## Non-Goals
- There may be some workarounds to make the multi-arch image/manifest fully available locally. These workarounds will not be adopted, so local build always build single-arch images
## Local Build
For local build, two values of `--output` parameter for `docker buildx build` are supported:
-`docker`: a docker format image is built, but the image is only built for the platform (`<os>/<arch>`) as same as the building env. E.g., when building from linux-amd64 env, a single manifest of linux-amd64 is created regardless how the input parameters are configured.
-`tar`: one or more images are built as tarballs according to the input platform (`<os>/<arch>`) parameters. Specifically, one tarball is generated for each platform. The build process is the same with the `Build Separate Manifests` of `Push Build` as detailed below. Merely, the `--output` parameter diffs, as `type=tar;dest=<tarball generated path>`. The tarball is generated to the `_output` folder and named with the platform info, e.g., `_output/velero-main-linux-amd64.tar`.
## Push Build
For push build, the `--output` parameter for `docker buildx build` is always `registry`. And build will go according to the input parameters and create multi-arch manifest lists.
### Step 1: Build Separate Manifests
Instead of specifying multiple platforms (`<os>/<arch>`) to `--platform` option, we add multiple `container-%` targets in Makefile and each target builds one platform representively.
The goal here is to build multiple manifests through the multiple targets. However, `docker buildx build` by default creates a manifest list even though there is only one element in `--platform`. Therefore, two flags `--provenance=false` and `--sbom=false` will be set additionally to force `docker buildx build` to create manifests.
Each manifest has a unique tag, the OS type and arch is added to the tag, in the pattern `$(REGISTRY)/$(BIN):$(VERSION)-$(OS)-$(ARCH)`. For example, `velero/velero:main-linux-amd64`.
All the created manifests will be pushed to registry so that the all-in-one manifest list could be created.
### Step 2: Create All-In-One Manifest List
The next step is to create a manifest list to include all the created manifests. This could be done by `docker manifest create` command, the tags created and pushed at Step 1 are passed to this command.
A tag is also created for the manifest list, in the pattern `$(REGISTRY)/$(BIN):$(VERSION)`. For example, `velero/velero:main`.
### Step 3: Push All-In-One Manifest List
The created manifest will be pushed to registry by command `docker manifest push`.
## Input Parameters
Below are the input parameters that are configurable to meet different build purposes during Dev and release cycle:
- BUILD_OUTPUT_TYPE: the type of output for the build, i.e., `docker`, `tar`, `registry`, while `docker` and `tar` is for local build; `registry` means push build. Default value is `docker`
- BUILD_OS: which types of OS should be built for. Multiple values are accepted, e.g., `linux,windows`. Default value is `linux`
- BUILD_ARCH: which types of architecture should be built for. Multiple values are accepted, e.g., `amd64,arm64`. Default value is `amd64`
- BUILDX_INSTANCE: an existing buildx instance to be used by the build. Default value is <empty> which indicates the build to create a new buildx instance
## Windows Build
Windows container images vary from Windows OS versions, e.g., `ltsc2022` for Windows server 2022 and `1809` for Windows server 2019. Images for different OS versions should be built separately.
Therefore, separate build targets are added for each OS version, like `container-windows-%`.
For the same reason, a new input parameter is added, `BUILD_WINDOWS_VERSION`. The default value is `ltsc2022`. Windows server 2022 is the only base image we will deliver officially, Windows server 2019 is not supported. In future, we may need to support Windows server 2025 base image.
For local build to tar, the Windows OS version is also added to the name of the tarball, e.g., `_output/velero-main-windows-ltsc2022-amd64.tar`.
At present, Windows container image only supports `amd64` as the architecture, so `BUILD_ARCH` is ignored for Windows.
The Windows manifests need to be annotated with os type, arch, and os version. This will be done through `docker manifest annotate` command.
## Use Malti-arch Images
In order to use the images, the manifest list's tag should be provided to `velero install` command or helm, the individual manifests are covered by the manifest list. During launch time, the container engine will load the right image to the container according to the platform of the running node.
## Build Samples
**Local build to docker**
```
make container
```
The built image could be listed by `docker image ls`.
**Local build for linux-amd64 and windows-amd64 to tar**
```
BUILD_OUTPUT_TYPE=tar BUILD_OS=linux,windows make container
```
Under `_output` directory, below files are generated:
```
velero-main-linux-amd64.tar
velero-main-windows-ltsc2022-amd64.tar
```
**Local build for linux-amd64, linux-arm64 and windows-amd64 to tar**
```
BUILD_OUTPUT_TYPE=tar BUILD_OS=linux,windows BUILD_ARCH=amd64,arm64 make container
```
Under `_output` directory, below files are generated:
```
velero-main-linux-amd64.tar
velero-main-linux-arm64.tar
velero-main-windows-ltsc2022-amd64.tar
```
**Push build for linux-amd64 and windows-amd64**
Prerequisite: login to registry, e.g., through `docker login`
```
BUILD_OUTPUT_TYPE=registry REGISTRY=<registry> BUILD_OS=linux,windows make container
```
Nothing is available locally, in the registry 3 tags are available:
```
velero/velero:main
velero/velero:main-windows-ltsc2022-amd64
velero/velero:main-linux-amd64
```
**Push build for linux-amd64, linux-arm64 and windows-amd64**
Prerequisite: login to registry, e.g., through `docker login`
```
BUILD_OUTPUT_TYPE=registry REGISTRY=<registry> BUILD_OS=linux,windows BUILD_ARCH=amd64,arm64 make container
```
Nothing is available locally, in the registry 4 tags are available:
```
velero/velero:main
velero/velero:main-windows-ltsc2022-amd64
velero/velero:main-linux-amd64
velero/velero:main-linux-arm64
```
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.