Commit Graph

898 Commits

Author SHA1 Message Date
Priyansh Choudhary
8a6ac7af1c fix: backup deletion silently succeeds when tarball download fails (#9693)
Some checks failed
Run the E2E test on kind / get-go-version (push) Failing after 1m13s
Run the E2E test on kind / build (push) Has been skipped
Run the E2E test on kind / setup-test-matrix (push) Successful in 11s
Run the E2E test on kind / run-e2e-test (push) Has been skipped
Main CI / get-go-version (push) Successful in 13s
Main CI / Build (push) Failing after 34s
Close stale issues and PRs / stale (push) Successful in 14s
Trivy Nightly Scan / Trivy nightly scan (velero, main) (push) Failing after 1m34s
Trivy Nightly Scan / Trivy nightly scan (velero-plugin-for-aws, main) (push) Failing after 1m5s
Trivy Nightly Scan / Trivy nightly scan (velero-plugin-for-gcp, main) (push) Failing after 1m1s
Trivy Nightly Scan / Trivy nightly scan (velero-plugin-for-microsoft-azure, main) (push) Failing after 1m4s
* Enhance backup deletion logic to handle tarball download failures and clean up associated CSI VolumeSnapshotContents
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* added changelog
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* Refactor error handling in backup deletion
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* Refactor backup deletion logic to skip CSI snapshot cleanup on tarball download failure
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* prevent backup deletion when errors occur
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* added logger
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>
2026-04-14 16:36:40 -04:00
lyndon-li
4a6756d57b Merge pull request #9683 from Lyndon-Li/increase-repo-maintenance-history-queue-length
Some checks failed
Run the E2E test on kind / get-go-version (push) Failing after 1m18s
Run the E2E test on kind / build (push) Has been skipped
Run the E2E test on kind / setup-test-matrix (push) Successful in 4s
Run the E2E test on kind / run-e2e-test (push) Has been skipped
Main CI / get-go-version (push) Successful in 16s
Main CI / Build (push) Failing after 31s
Close stale issues and PRs / stale (push) Successful in 15s
Trivy Nightly Scan / Trivy nightly scan (velero, main) (push) Failing after 1m24s
Trivy Nightly Scan / Trivy nightly scan (velero-plugin-for-aws, main) (push) Failing after 1m4s
Trivy Nightly Scan / Trivy nightly scan (velero-plugin-for-gcp, main) (push) Failing after 1m9s
Trivy Nightly Scan / Trivy nightly scan (velero-plugin-for-microsoft-azure, main) (push) Failing after 1m4s
Issue 9428: increase repo maintenance history queue length
2026-04-10 11:50:24 +08:00
Xun Jiang/Bruce Jiang
e1cc07cec3 Merge pull request #9695 from shubham-pampattiwar/bump-ext-snapshotter-v8.4-vgs-v1beta2
Bump external-snapshotter to v8.4.0 for VGS v1beta2 support
2026-04-10 11:38:24 +08:00
Lyndon-Li
1730b7f414 issue 9428: incremental repo maintenance history queue length
Signed-off-by: Lyndon-Li <lyonghui@vmware.com>
2026-04-09 14:41:07 +08:00
lyndon-li
37abfb4bfa Merge pull request #9682 from adam-jian-zhang/fix-restore-pvr-scope
Some checks failed
Run the E2E test on kind / get-go-version (push) Failing after 1m4s
Run the E2E test on kind / build (push) Has been skipped
Run the E2E test on kind / setup-test-matrix (push) Successful in 3s
Run the E2E test on kind / run-e2e-test (push) Has been skipped
Main CI / get-go-version (push) Successful in 13s
Main CI / Build (push) Failing after 30s
Close stale issues and PRs / stale (push) Successful in 11s
Trivy Nightly Scan / Trivy nightly scan (velero, main) (push) Failing after 1m34s
Trivy Nightly Scan / Trivy nightly scan (velero-plugin-for-aws, main) (push) Failing after 1m2s
Trivy Nightly Scan / Trivy nightly scan (velero-plugin-for-gcp, main) (push) Failing after 1m2s
Trivy Nightly Scan / Trivy nightly scan (velero-plugin-for-microsoft-azure, main) (push) Failing after 1m3s
Fix PodVolumeBackup list scope during restore
2026-04-09 10:59:26 +08:00
Shubham Pampattiwar
1b5503e20b Bump external-snapshotter to v8.4.0 for VGS v1beta2 support
Kubernetes 1.34 introduced VolumeGroupSnapshot v1beta2 API and
deprecated v1beta1. Distributions running K8s 1.34+ (e.g. OpenShift
4.21+) have removed v1beta1 VGS CRDs entirely, breaking Velero's
VGS functionality on those clusters.

This change bumps external-snapshotter/client/v8 from v8.2.0 to
v8.4.0 and migrates all VGS API usage from v1beta1 to v1beta2.

The v1beta2 API is structurally compatible - the Spec-level types
(GroupSnapshotHandles, VolumeGroupSnapshotContentSource) are
unchanged. The Status-level change (VolumeSnapshotHandlePairList
replaced by VolumeSnapshotInfoList) does not affect Velero as it
does not directly consume that type.

Fixes #9694

Signed-off-by: Shubham Pampattiwar <spampatt@redhat.com>
2026-04-08 15:46:06 -07:00
Shubham Pampattiwar
e439977117 Fix VolumeGroupSnapshot restore failure with Ceph RBD CSI driver (#9516)
Some checks failed
Run the E2E test on kind / get-go-version (push) Failing after 1m5s
Run the E2E test on kind / build (push) Has been skipped
Run the E2E test on kind / setup-test-matrix (push) Successful in 3s
Run the E2E test on kind / run-e2e-test (push) Has been skipped
Main CI / get-go-version (push) Successful in 14s
Main CI / Build (push) Failing after 30s
* Fix VolumeGroupSnapshot restore on Ceph RBD

This PR fixes two related issues affecting CSI snapshot restore on Ceph RBD:

1. VolumeGroupSnapshot restore fails because Ceph RBD populates
   volumeGroupSnapshotHandle on pre-provisioned VSCs, but Velero doesn't
   create the required VGSC during restore.

2. CSI snapshot restore fails because VolumeSnapshotClassName is removed
   from restored VSCs, preventing the CSI controller from getting
   credentials for snapshot verification.

Changes:
- Capture volumeGroupSnapshotHandle during backup as VS annotation
- Create stub VGSC during restore with matching handle in status
- Look up VolumeSnapshotClass by driver and set on restored VSC

Fixes #9512
Fixes #9515

Signed-off-by: Shubham Pampattiwar <spampatt@redhat.com>

* Add changelog for VGS restore fix

Signed-off-by: Shubham Pampattiwar <spampatt@redhat.com>

* Fix gofmt import order

Signed-off-by: Shubham Pampattiwar <spampatt@redhat.com>

* Add changelog for VGS restore fix

Signed-off-by: Shubham Pampattiwar <spampatt@redhat.com>

* Fix import alias corev1 to corev1api per lint config

Signed-off-by: Shubham Pampattiwar <spampatt@redhat.com>

* Fix: Add snapshot handles to existing stub VGSC and add unit tests

When multiple VolumeSnapshots from the same VolumeGroupSnapshot are
restored, they share the same VolumeGroupSnapshotHandle but have
different individual snapshot handles. This commit:

1. Fixes incomplete logic where existing VGSC wasn't updated with
   new snapshot handles (addresses review feedback)

2. Fixes race condition where Create returning AlreadyExists would
   skip adding the snapshot handle

3. Adds comprehensive unit tests for ensureStubVGSCExists (5 cases)
   and addSnapshotHandleToVGSC (4 cases) functions

Signed-off-by: Shubham Pampattiwar <spampatt@redhat.com>

* Clean up stub VolumeGroupSnapshotContents during restore finalization

Add cleanup logic for stub VGSCs created during VolumeGroupSnapshot restore.
The stub VGSCs are temporary objects needed to satisfy CSI controller
validation during VSC reconciliation. Once all related VSCs become
ReadyToUse, the stub VGSCs are no longer needed and should be removed.

The cleanup runs in the restore finalizer controller's execute() phase.
Before deleting each VGSC, it polls until all related VolumeSnapshotContents
(correlated by snapshot handle) are ReadyToUse, with a timeout fallback.
Deletion failures and CRD-not-installed scenarios are treated as warnings
rather than errors to avoid failing the restore.

Signed-off-by: Shubham Pampattiwar <spampatt@redhat.com>

* Fix lint: remove unused nolint directive and simplify cleanupStubVGSC return

The cleanupStubVGSC function only produces warnings (not errors), so
simplify its return signature. Also remove the now-unused nolint:unparam
directive on execute() since warnings are no longer always nil.

Signed-off-by: Shubham Pampattiwar <spampatt@redhat.com>

---------

Signed-off-by: Shubham Pampattiwar <spampatt@redhat.com>
2026-04-08 12:08:56 -07:00
Adam Zhang
dd82645909 Fix PodVolumeBackup list scope during restore
Restrict the listing of PodVolumeBackup resources to the specific
restore namespace in both the core restore controller and the pod
volume restore action plugin. This prevents "Forbidden" errors when
Velero is configured with namespace-scoped minimum privileges,
avoiding the need for cluster-scoped list permissions for
PodVolumeBackups.

Fixes: #9681

Signed-off-by: Adam Zhang <adam.zhang@broadcom.com>
2026-04-08 16:50:09 +08:00
Lyndon-Li
9598c50295 Merge branch 'main' into remove-restic-for-repo 2026-04-08 13:37:34 +08:00
Lyndon-Li
dca3d3001f remove restic for repo
Signed-off-by: Lyndon-Li <lyonghui@vmware.com>
2026-04-08 11:11:15 +08:00
Lyndon-Li
fca4d405b1 remove restic for uploader
Signed-off-by: Lyndon-Li <lyonghui@vmware.com>
2026-04-07 18:07:51 +08:00
Lyndon-Li
235e579581 remove restic for repo
Signed-off-by: Lyndon-Li <lyonghui@vmware.com>
2026-04-07 07:35:25 +00:00
Quang Ngo
6c3d81a146 Add schedule_expected_interval_seconds metric
Add a new Prometheus gauge metric that exposes the expected interval
between consecutive scheduled backups. This enables dynamic alerting
thresholds per schedule backups.

Signed-off-by: Quang Ngo <quang.ngo@canonical.com>
2026-03-02 10:20:09 +11:00
Scott Seago
dfb1d45831 Remove backup from running list when backup fails validation (#9498)
Some checks failed
Run the E2E test on kind / get-go-version (push) Failing after 1m4s
Run the E2E test on kind / build (push) Has been skipped
Run the E2E test on kind / setup-test-matrix (push) Successful in 3s
Run the E2E test on kind / run-e2e-test (push) Has been skipped
Main CI / get-go-version (push) Successful in 12s
Main CI / Build (push) Failing after 32s
Close stale issues and PRs / stale (push) Successful in 15s
Trivy Nightly Scan / Trivy nightly scan (velero, main) (push) Failing after 1m34s
Trivy Nightly Scan / Trivy nightly scan (velero-plugin-for-aws, main) (push) Failing after 1m16s
Trivy Nightly Scan / Trivy nightly scan (velero-plugin-for-gcp, main) (push) Failing after 1m30s
Trivy Nightly Scan / Trivy nightly scan (velero-plugin-for-microsoft-azure, main) (push) Failing after 1m16s
Signed-off-by: Scott Seago <sseago@redhat.com>
2026-01-27 16:25:30 -05:00
Wenkai Yin(尹文开)
7442d20f9d Merge pull request #9481 from Lyndon-Li/issue-fix-9478
Some checks failed
Run the E2E test on kind / get-go-version (push) Failing after 1m3s
Run the E2E test on kind / build (push) Has been skipped
Run the E2E test on kind / setup-test-matrix (push) Successful in 3s
Run the E2E test on kind / run-e2e-test (push) Has been skipped
Main CI / get-go-version (push) Successful in 15s
Main CI / Build (push) Failing after 35s
Close stale issues and PRs / stale (push) Successful in 13s
Trivy Nightly Scan / Trivy nightly scan (velero, main) (push) Failing after 1m46s
Trivy Nightly Scan / Trivy nightly scan (velero-plugin-for-aws, main) (push) Failing after 1m34s
Trivy Nightly Scan / Trivy nightly scan (velero-plugin-for-gcp, main) (push) Failing after 1m22s
Trivy Nightly Scan / Trivy nightly scan (velero-plugin-for-microsoft-azure, main) (push) Failing after 1m18s
Issue 9478: Diagnose expose on peek error
2026-01-15 16:53:57 +08:00
Lyndon-Li
e72fea8ecd fix issue for cache volume
Signed-off-by: Lyndon-Li <lyonghui@vmware.com>
2026-01-14 17:45:01 +08:00
Lyndon-Li
e703e06eeb diagnose expose on peek error
Signed-off-by: Lyndon-Li <lyonghui@vmware.com>
2026-01-13 16:33:14 +08:00
lyndon-li
2d93ab261e Merge pull request #9141 from kaovilai/9097
Some checks failed
Run the E2E test on kind / get-go-version (push) Failing after 1m8s
Run the E2E test on kind / build (push) Has been skipped
Run the E2E test on kind / setup-test-matrix (push) Successful in 3s
Run the E2E test on kind / run-e2e-test (push) Has been skipped
Main CI / get-go-version (push) Successful in 16s
Main CI / Build (push) Failing after 45s
feat: Enhance BackupStorageLocation with Secret-based CA certificate support
2025-12-19 13:13:47 +08:00
Xun Jiang/Bruce Jiang
aa3bd251dd Merge branch 'main' into 9097 2025-12-18 14:18:04 +08:00
Xun Jiang
e39374f335 Add maintenance job and data mover pod's labels and annotations setting.
Add wait in file_system_test's async test cases.
Add related documents.

Signed-off-by: Xun Jiang <xun.jiang@broadcom.com>
2025-12-17 13:21:07 +08:00
Tiger Kaovilai
61bf2ef777 feat: Enhance BackupStorageLocation with Secret-based CA certificate support
- Introduced `CACertRef` field in `ObjectStorageLocation` to reference a Secret containing the CA certificate, replacing the deprecated `CACert` field.
- Implemented validation logic to ensure mutual exclusivity between `CACert` and `CACertRef`.
- Updated BSL controller and repository provider to handle the new certificate resolution logic.
- Enhanced CLI to support automatic certificate discovery from BSL configurations.
- Added unit and integration tests to validate new functionality and ensure backward compatibility.
- Documented migration strategy for users transitioning from inline certificates to Secret-based management.

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
2025-12-12 21:07:37 +07:00
Shubham Pampattiwar
14b34f08cc Merge pull request #9321 from shubham-pampattiwar/fix-azure-bsl-status-message-8368
Some checks failed
Run the E2E test on kind / get-go-version (push) Failing after 1m11s
Run the E2E test on kind / build (push) Has been skipped
Run the E2E test on kind / setup-test-matrix (push) Successful in 3s
Run the E2E test on kind / run-e2e-test (push) Has been skipped
Main CI / get-go-version (push) Successful in 13s
Main CI / Build (push) Failing after 33s
Sanitize Azure HTTP responses in BSL status messages
2025-12-11 22:00:18 -08:00
Xun Jiang
096436507e Remove VolumeSnapshotClass from CSI restore and deletion process.
Some checks failed
Run the E2E test on kind / get-go-version (push) Failing after 1m1s
Run the E2E test on kind / build (push) Has been skipped
Run the E2E test on kind / setup-test-matrix (push) Successful in 3s
Run the E2E test on kind / run-e2e-test (push) Has been skipped
Remove VolumeSnapshotClass from backup sync process.

Signed-off-by: Xun Jiang <xun.jiang@broadcom.com>
2025-12-11 17:56:11 +08:00
Shubham Pampattiwar
f0c97c489d Merge pull request #9414 from shubham-pampattiwar/add-maintenance-job-metrics
Some checks failed
Run the E2E test on kind / get-go-version (push) Failing after 1m8s
Run the E2E test on kind / build (push) Has been skipped
Run the E2E test on kind / setup-test-matrix (push) Successful in 5s
Run the E2E test on kind / run-e2e-test (push) Has been skipped
Main CI / get-go-version (push) Successful in 14s
Main CI / Build (push) Failing after 37s
Close stale issues and PRs / stale (push) Successful in 15s
Trivy Nightly Scan / Trivy nightly scan (velero, main) (push) Failing after 1m43s
Trivy Nightly Scan / Trivy nightly scan (velero-plugin-for-aws, main) (push) Failing after 58s
Trivy Nightly Scan / Trivy nightly scan (velero-plugin-for-gcp, main) (push) Failing after 1m8s
Trivy Nightly Scan / Trivy nightly scan (velero-plugin-for-microsoft-azure, main) (push) Failing after 58s
Add Prometheus metrics for maintenance jobs
2025-12-08 09:23:44 -08:00
Scott Seago
7286d24c35 Updates for merge conflict and to refine reconciler queueing logic
Signed-off-by: Scott Seago <sseago@redhat.com>
2025-12-03 16:55:59 -05:00
Scott Seago
7e4797f588 Track running backup count via BackupTracker
This avoids an unnecessary apiserver List call when
the backup reconciler is already at capacity.

Signed-off-by: Scott Seago <sseago@redhat.com>
2025-12-02 17:23:47 -05:00
Scott Seago
f238a7e47b make update
Signed-off-by: Scott Seago <sseago@redhat.com>
2025-12-02 17:23:21 -05:00
Scott Seago
0b2e7d1238 Minor refactoring
Signed-off-by: Scott Seago <sseago@redhat.com>
2025-12-02 17:23:21 -05:00
Scott Seago
73864e31ff Fix linters
Signed-off-by: Scott Seago <sseago@redhat.com>
2025-12-02 17:04:55 -05:00
Scott Seago
8a95d512b3 make update, changelog
Signed-off-by: Scott Seago <sseago@redhat.com>
2025-12-02 17:04:07 -05:00
Scott Seago
4d1802233a add various scenarios to queue controller unit tests
Signed-off-by: Scott Seago <sseago@redhat.com>
2025-12-02 17:01:09 -05:00
Scott Seago
f73443659a Backup queue controller implementation
Signed-off-by: Scott Seago <sseago@redhat.com>
2025-12-02 16:57:18 -05:00
Scott Seago
845eee4e60 feat: Create backup queue controller and add to disableable list
Co-authored-by: aider (gemini/gemini-2.5-pro) <aider@aider.chat>
Signed-off-by: Scott Seago <sseago@redhat.com>
2025-12-02 16:46:56 -05:00
Scott Seago
6a3f821606 fix lint
Signed-off-by: Scott Seago <sseago@redhat.com>

Signed-off-by: Scott Seago <sseago@redhat.com>
2025-12-02 16:39:10 -05:00
Scott Seago
34dc381182 Refactor after review
Signed-off-by: Scott Seago <sseago@redhat.com>

Signed-off-by: Scott Seago <sseago@redhat.com>
2025-12-02 16:39:10 -05:00
Scott Seago
29b01c3170 make update
Signed-off-by: Scott Seago <sseago@redhat.com>
2025-12-02 16:39:10 -05:00
Scott Seago
9c1c7d20ff Minor refactoring
Signed-off-by: Scott Seago <sseago@redhat.com>
2025-12-02 16:39:10 -05:00
Scott Seago
7bc57b5a5f Refactor queue controller to reduce apiserver list calls
Signed-off-by: Scott Seago <sseago@redhat.com>
2025-12-02 16:39:10 -05:00
Scott Seago
e7b5d20f4c Fix linters
Signed-off-by: Scott Seago <sseago@redhat.com>
2025-12-02 16:39:10 -05:00
Scott Seago
aedc0fe5e2 make update, changelog
Signed-off-by: Scott Seago <sseago@redhat.com>
2025-12-02 16:39:07 -05:00
Scott Seago
91357b28c4 Move worker pool creation to backup reconcile.
ItemBlockWorkerPool is now created for each backup.

Signed-off-by: Scott Seago <sseago@redhat.com>
2025-12-02 16:38:41 -05:00
Scott Seago
e0c08f03cf add various scenarios to queue controller unit tests
Signed-off-by: Scott Seago <sseago@redhat.com>
2025-12-02 16:38:41 -05:00
Scott Seago
a56ab10f23 Move debug logs to info
Signed-off-by: Scott Seago <sseago@redhat.com>
2025-12-02 16:38:41 -05:00
Scott Seago
d39ad6f208 run multiple backup reconcilers, only reconcile ReadyToStart backups
Signed-off-by: Scott Seago <sseago@redhat.com>
2025-12-02 16:38:41 -05:00
Scott Seago
13041b40c2 Backup queue controller implementation
Signed-off-by: Scott Seago <sseago@redhat.com>
2025-12-02 16:38:41 -05:00
Scott Seago
fe799d7546 feat: Add concurrent backups configuration to backup reconciler
Co-authored-by: aider (gemini/gemini-2.5-pro) <aider@aider.chat>
Signed-off-by: Scott Seago <sseago@redhat.com>
2025-12-02 16:28:08 -05:00
Scott Seago
d91d50f696 feat: Add concurrentBackups to backupQueueReconciler
Co-authored-by: aider (gemini/gemini-2.5-pro) <aider@aider.chat>
Signed-off-by: Scott Seago <sseago@redhat.com>
2025-12-02 16:28:08 -05:00
Scott Seago
5d02af3ce3 feat: Create backup queue controller and add to disableable list
Co-authored-by: aider (gemini/gemini-2.5-pro) <aider@aider.chat>
Signed-off-by: Scott Seago <sseago@redhat.com>
2025-12-02 16:28:08 -05:00
Shubham Pampattiwar
20af2c20c5 Address PR review comments: sanitize errors and add SAS token scrubbing
This commit addresses three review comments on PR #9321:

1. Keep sanitization in controller (response to @ywk253100)
   - Maintaining centralized error handling for easier extension
   - Azure-specific patterns detected and others passed through unchanged

2. Sanitize unavailableErrors array (@priyansh17)
   - Now using sanitizeStorageError() for both unavailableErrors array
     and location.Status.Message for consistency

3. Add SAS token scrubbing (@anshulahuja98)
   - Scrubs Azure SAS token parameters to prevent credential leakage
   - Redacts: sig, se, st, sp, spr, sv, sr, sip, srt, ss
   - Example: ?sig=secret becomes ?sig=***REDACTED***

Added comprehensive test coverage for SAS token scrubbing with 4 new
test cases covering various scenarios.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Shubham Pampattiwar <spampatt@redhat.com>
2025-12-02 11:37:50 -08:00
Shubham Pampattiwar
a5d32f29da Sanitize Azure HTTP responses in BSL status messages
Azure storage errors include verbose HTTP response details and XML
in error messages, making the BSL status.message field cluttered
and hard to read. This change adds sanitization to extract only
the error code and meaningful message.

Before:
  BackupStorageLocation "test" is unavailable: rpc error: code = Unknown
  desc = GET https://...
  RESPONSE 404: 404 The specified container does not exist.
  ERROR CODE: ContainerNotFound
  <?xml version="1.0"...>

After:
  BackupStorageLocation "test" is unavailable: rpc error: code = Unknown
  desc = ContainerNotFound: The specified container does not exist.

AWS and GCP error messages are preserved as-is since they don't
contain verbose HTTP responses.

Fixes #8368

Signed-off-by: Shubham Pampattiwar <spampatt@redhat.com>
2025-12-02 11:37:50 -08:00