Merge pull request #9418 from Lyndon-Li/cache-volume-doc

Issue 9276: doc for cache volume
2025-12-23 14:25:22 +00:00 · 2025-11-18 21:29:09 -08:00
parent fa374b6143 9dc27555bc
commit e4726b2389
5 changed files with 64 additions and 7 deletions
--- a/changelogs/unreleased/9418-Lyndon-Li
+++ b/changelogs/unreleased/9418-Lyndon-Li
@@ -0,0 +1 @@
+Fix issue #9276, add doc for cache volume support
--- a/site/content/docs/main/csi-snapshot-data-movement.md
+++ b/site/content/docs/main/csi-snapshot-data-movement.md
@@ -376,7 +376,10 @@ For Velero built-in data mover, Velero uses [BestEffort as the QoS][13] for data
 If you want to constraint the CPU/memory usage, you need to [Customize Data Mover Pod Resource Limits][11]. The CPU/memory consumption is always related to the scale of data to be backed up/restored, refer to [Performance Guidance][12] for more details, so it is highly recommended that you perform your own testing to find the best resource limits for your data.  

 During the restore, the repository may also cache data/metadata so as to reduce the network footprint and speed up the restore. The repository uses its own policy to store and clean up the cache.  
-For Kopia repository, the cache is stored in the data mover pod's root file system. Velero allows you to configure a limit of the cache size so that the data mover pod won't be evicted due to running out of the ephemeral storage. For more details, check [Backup Repository Configuration][17]. 
+For Kopia repository, by default, the cache is stored in the data mover pod's root file system. If your root file system space is limited, the data mover pods may be evicted due to running out of ephemeral storage, which causes the restore to fail. To cope with this problem, Velero allows you to:
+- configure a limit on the cache size per backup repository. For more details, see [Backup Repository Configuration][17].  
+- configure a dedicated volume for cache data. For more details, see [Data Movement Cache Volume][21].  
+

 ### Node Selection

@@ -416,4 +419,6 @@ Sometimes, `RestorePVC` needs to be configured to increase the performance of re
 [18]: https://github.com/vmware-tanzu/velero/pull/7576
 [19]: data-movement-restore-pvc-configuration.md
 [20]: node-agent-prepare-queue-length.md
+[21]: data-movement-cache-volume.md
+

--- a/site/content/docs/main/data-movement-cache-volume.md
+++ b/site/content/docs/main/data-movement-cache-volume.md
@@ -0,0 +1,46 @@
+---
+title: "Cache PVC Configuration for Data Movement Restore"
+layout: docs
+---
+
+Velero data movement restore (i.e., for CSI snapshot data movement and fs-backup) may request the backup repository to cache data locally so as to reduce the number of data requests to the remote backup storage.  
+The cache behavior is decided by the specific backup repository, and Velero allows you to configure a cache limit for the backup repositories that support it (i.e., the Kopia repository). For more details, see [Backup Repository Configuration][1].  
+The cache size may significantly impact performance. Specifically, if the cache size is too small, the restore throughput will be severely reduced and much more data will be downloaded from the backup storage.  
+By default, the cache data is stored in the data mover pods' root disk. In some environments, the pods' root disk size is very limited, so a large cache size would cause the data mover pods to be evicted because they run out of ephemeral disk.  
+
+To cope with these problems and guarantee that the data mover pods always run with a fine-tuned local cache, Velero supports dedicated cache PVCs for data movement restore, for both CSI snapshot data movement and fs-backup.  
+
+By default, Velero data mover pods run without cache PVCs. To enable cache PVCs, you need to fill in the cache PVC configuration in the node-agent configMap.  
+
+A sample of cache PVC configuration as part of the ConfigMap would look like:
+```json
+{
+    "cachePVC": {
+        "thresholdInGB": 1,
+        "storageClass": "sc-wffc"
+    }
+}
+```
+
+To create the configMap, save something like the above sample to a file and then run the command below:  
+```shell
+kubectl create cm node-agent-config -n velero --from-file=<json file name>
+```
+
+The `storageClass` field is required; it tells Velero which storage class to use to provision the cache PVC. Velero relies on Kubernetes dynamic provisioning to provision the PVC; static provisioning is not supported.  
+
+The cache PVC behavior can be further fine-tuned through `thresholdInGB`. Its value is compared to the size of the backup: if the backup size is smaller than this value, no cache PVC is created when restoring from the backup. This ensures that cache PVCs are not created needlessly when the backup is small enough to be accommodated in the data mover pods' root disk.  
+
+This configuration decides whether and how to provision cache PVCs, but it doesn't decide their size. Instead, the size is decided by the specific backup repository. Specifically, Velero asks the backup repository for a cache limit and uses this limit to calculate the cache PVC size.  
+The cache limit is decided by the backup repository itself. For Kopia repository, if `cacheLimitMB` is specified in the backup repository configuration, its value is used; otherwise, a default limit (5 GB) is used.  
+Velero then inflates the limit by 20% to account for non-payload overhead and delayed cache cleanup, both of which vary across backup repositories.  
+
+Taking the Kopia repository and the above cache PVC configuration as an example:  
+- When `cacheLimitMB` is not set for the repository, a 6GB cache PVC is created for a backup that is larger than 1GB; otherwise, no cache volume is created  
+- When `cacheLimitMB` is set to `10240` for the repository, a 12GB cache PVC is created for a backup that is larger than 1GB; otherwise, no cache volume is created  
+
+To enable both the node-agent configMap and the backup repository configMap, specify the following flags when installing Velero with the CLI:
+`velero install --node-agent-configmap=<ConfigMap-Name> --backup-repository-configmap=<ConfigMap-Name>`
+
+
+[1]: backup-repository-configuration.md
--- a/site/content/docs/main/file-system-backup.md
+++ b/site/content/docs/main/file-system-backup.md
@@ -693,7 +693,7 @@ spec:

 ## Priority Class Configuration

-For Velero built-in data mover, data mover pods launched during file system backup will use the priority class name configured in the node-agent configmap. The node-agent daemonset itself gets its priority class from the `--node-agent-priority-class-name` flag during Velero installation. This can help ensure proper scheduling behavior in resource-constrained environments. For more details on configuring data mover pod resources, see [Data Movement Pod Resource Configuration][data-movement-config].
+For Velero built-in data mover, data mover pods launched during file system backup will use the priority class name configured in the node-agent configmap. The node-agent daemonset itself gets its priority class from the `--node-agent-priority-class-name` flag during Velero installation. This can help ensure proper scheduling behavior in resource-constrained environments. For more details on configuring data mover pod resources, see [Data Movement Pod Resource Configuration][21].

 ## Resource Consumption

@@ -712,7 +712,9 @@ totalPreservedMemory = (128M + 24M * numOfCPUCores) * numOfWorkerNodes
 However, whether and when this limit is reached is related to the data you are backing up/restoring.  

 During the restore, the repository may also cache data/metadata so as to reduce the network footprint and speed up the restore. The repository uses its own policy to store and clean up the cache.  
-For Kopia repository, the cache is stored in the node-agent pod's root file system. Velero allows you to configure a limit of the cache size so that the node-agent pod won't be evicted due to running out of the ephemeral storage. For more details, check [Backup Repository Configuration][18].  
+For Kopia repository, by default, the cache is stored in the data mover pod's root file system. If your root file system space is limited, the data mover pods may be evicted due to running out of ephemeral storage, which causes the restore to fail. To cope with this problem, Velero allows you to:
+- configure a limit on the cache size per backup repository. For more details, see [Backup Repository Configuration][18].  
+- configure a dedicated volume for cache data. For more details, see [Data Movement Cache Volume][22].  

 ## Restic Deprecation  

@@ -766,4 +768,5 @@ Velero still effectively manage restic repository, though you cannot write any n
 [18]: backup-repository-configuration.md
 [19]: node-agent-concurrency.md
 [20]: node-agent-prepare-queue-length.md
-[data-movement-config]: data-movement-pod-resource-configuration.md
+[21]: data-movement-pod-resource-configuration.md
+[22]: data-movement-cache-volume.md
--- a/site/data/docs/main-toc.yml
+++ b/site/data/docs/main-toc.yml
@@ -45,8 +45,6 @@ toc:
        url: /restore-resource-modifiers
      - page: Run in any namespace
        url: /namespace       
-      - page: File system backup
-        url: /file-system-backup        
      - page: CSI Support
        url: /csi
      - page: Volume Group Snapshots
@@ -67,6 +65,8 @@ toc:
    subfolderitems:
      - page: CSI Snapshot Data Mover
        url: /csi-snapshot-data-movement
+      - page: File system backup
+        url: /file-system-backup        
      - page: Data Movement Backup PVC Configuration
        url: /data-movement-backup-pvc-configuration
      - page: Data Movement Restore PVC Configuration
@@ -75,6 +75,8 @@ toc:
        url: /data-movement-pod-resource-configuration        
      - page: Data Movement Node Selection Configuration
        url: /data-movement-node-selection
+      - page: Data Movement Cache PVC Configuration
+        url: /data-movement-cache-volume
      - page: Node-agent Concurrency
        url: /node-agent-concurrency
  - title: Plugins
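
Note on the sizing rule documented in `data-movement-cache-volume.md`: the arithmetic described there (threshold check, then the repository cache limit inflated by 20%) can be sketched as below. This is only an illustrative shell sketch, not Velero's implementation. The function name `cache_pvc_size_mb` and its argument order are hypothetical; the 5 GB (5120 MB) default and the 20% inflation are taken from the doc text, and sizes are treated as whole numbers for simplicity.

```shell
#!/bin/sh
# Hypothetical helper (not part of Velero): prints the cache PVC size in MB,
# or 0 when no cache PVC would be created.
# Args: <backup size in GB> <thresholdInGB> [cacheLimitMB, default 5120]
cache_pvc_size_mb() {
    backup_gb=$1
    threshold_gb=$2
    limit_mb=${3:-5120}

    # Backups smaller than the threshold keep using the pod's root disk.
    if [ "$backup_gb" -lt "$threshold_gb" ]; then
        echo 0
        return
    fi

    # Inflate the repository cache limit by 20%.
    echo $(( limit_mb * 120 / 100 ))
}

cache_pvc_size_mb 2 1          # 6144  (default 5 GB limit -> 6 GB PVC)
cache_pvc_size_mb 2 1 10240    # 12288 (10 GB limit -> 12 GB PVC)
cache_pvc_size_mb 0 1          # 0     (below threshold, no cache PVC)
```

The three calls mirror the two bullet examples in the doc plus the below-threshold case.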