Update README and move the implemented Designs for v1.14
Signed-off-by: Daniel Jiang <daniel.jiang@broadcom.com>
# Extend VolumePolicies to support more actions

## Abstract

Currently, the [VolumePolicies feature](https://github.com/vmware-tanzu/velero/blob/main/design/Implemented/handle-backup-of-volumes-by-resources-filters.md), which can be used to filter/handle volumes during backup, only supports the `skip` action on matching conditions. Users need more actions to be supported.

## Background

The `VolumePolicies` feature was introduced in Velero 1.11 as a flexible way to handle volumes. The main goal of introducing the VolumePolicies feature was to improve the overall user experience when performing backup operations for volume resources. The feature enables users to group volumes according to the `conditions` (criteria) specified, and also lets them specify the `action` that Velero needs to take for these grouped volumes during the backup operation. The limitation is that `VolumePolicies` currently only supports `skip` as an action. We want to extend the `action` functionality to support more usable options like `fs-backup` (file system backup) and `snapshot` (volume snapshots).

## Goals

- Extend the VolumePolicies to support more actions like `fs-backup` (file system backup) and `snapshot` (volume snapshots).
- Improve the user experience when backing up volumes via Velero.

## Non-Goals

- No changes to existing approaches to opt-in/opt-out annotations for volumes
- No changes to existing `VolumePolicies` functionalities
- No additions or implementations to support more granular actions like `snapshot-csi` and `snapshot-datamover`. These actions can be implemented as a future enhancement.

## Use-cases/Scenarios

**Use-case 1:**
- A user wants to use the `snapshot` (volume snapshot) backup option for all CSI-supported volumes and `fs-backup` for the rest of the volumes.
- Currently, Velero supports this use-case, but the user experience is not great.
- The user has to individually annotate each volume-mounting pod with the "backup.velero.io/backup-volumes" annotation for `fs-backup`.
- This becomes cumbersome at scale.
- Using `VolumePolicies`, the user can just specify two simple `VolumePolicies`: CSI-supported volumes get the `snapshot` action, and the rest are backed up with the `fs-backup` action:
```yaml
version: v1
volumePolicies:
  - conditions:
      storageClass:
        - gp2
    action:
      type: snapshot
  - conditions: {}
    action:
      type: fs-backup
```

**Use-case 2:**
- A user wants to use `fs-backup` for NFS volumes pertaining to a particular server.
- In such a scenario the user can just specify a `VolumePolicy` like:
```yaml
version: v1
volumePolicies:
  - conditions:
      nfs:
        server: 192.168.200.90
    action:
      type: fs-backup
```

## High-Level Design

- When the VolumePolicy action is set to `fs-backup`, the backup workflow modifications would be:
  - We call [backupItem() -> backupItemInternal()](https://github.com/vmware-tanzu/velero/blob/main/pkg/backup/item_backupper.go#L95) on all the items that are to be backed up.
  - When we encounter a [Pod as an item](https://github.com/vmware-tanzu/velero/blob/main/pkg/backup/item_backupper.go#L195), we will have to modify the backup workflow to account for the `fs-backup` VolumePolicy action.

- When the VolumePolicy action is set to `snapshot`, the backup workflow modifications would be:
  - Once again, we call [backupItem() -> backupItemInternal()](https://github.com/vmware-tanzu/velero/blob/main/pkg/backup/item_backupper.go#L95) on all the items that are to be backed up.
  - When we encounter a [Persistent Volume as an item](https://github.com/vmware-tanzu/velero/blob/d4128542590470b204a642ee43311921c11db880/pkg/backup/item_backupper.go#L253), we call the [takePVSnapshot func](https://github.com/vmware-tanzu/velero/blob/d4128542590470b204a642ee43311921c11db880/pkg/backup/item_backupper.go#L508).
  - We need to modify the takePVSnapshot function to account for the `snapshot` VolumePolicy action.
  - In the case of CSI snapshots for PVC objects, the snapshot actions are taken by the velero-plugin-for-csi, so we also need to modify the [executeActions()](https://github.com/vmware-tanzu/velero/blob/512fe0dabdcb3bbf1ca68a9089056ae549663bcf/pkg/backup/item_backupper.go#L232) function to account for the `snapshot` VolumePolicy action.

**Note:** The `snapshot` action can either be a native snapshot or a CSI snapshot, as is the case with the current flow where Velero itself makes the decision based on the backup CR.
## Detailed Design

- Update the VolumePolicy action type validation to accept `fs-backup` and `snapshot` as valid VolumePolicy actions (a hedged sketch of this validation follows this list).
- Modifications needed for the `fs-backup` action:
  - Based on the volume policy specified on the backup request, we will decide whether to go via the legacy pod-annotations approach or the newer volume-policy-based `fs-backup` action approach.
  - If a volume policy (fs-backup/snapshot) on the backup request matches as an action for a volume, we use the newer volume policy approach to get the list of volumes for the `fs-backup` action.
  - Else, continue with the annotation-based legacy workflow.

- Modifications needed for the `snapshot` action:
  - In the [takePVSnapshot function](https://github.com/vmware-tanzu/velero/blob/d4128542590470b204a642ee43311921c11db880/pkg/backup/item_backupper.go#L508) we will check whether the PV fits the volume policy criteria and whether the associated action is `snapshot`.
  - If it is not `snapshot`, then we skip the rest of the workflow and avoid taking a snapshot of the PV.
  - Similarly, for the CSI snapshot of a PVC object, we need to make similar changes in the [executeAction() function](https://github.com/vmware-tanzu/velero/blob/512fe0dabdcb3bbf1ca68a9089056ae549663bcf/pkg/backup/item_backupper.go#L348): we will check whether the PVC fits the volume policy criteria and whether the associated action is `snapshot` via the CSI plugin.
  - If it is not `snapshot`, then we skip the CSI BIA execute action and avoid taking a snapshot of the PVC by not invoking the CSI plugin action for the PVC.
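
For illustration, a minimal sketch of what the extended action-type validation could look like in the `resourcepolicies` package is shown below. The constant and function names here are assumptions for this sketch, not the final API:

```go
package resourcepolicies

import "fmt"

// Supported volume policy action types: "skip" already exists,
// "fs-backup" and "snapshot" are the newly added actions.
const (
	Skip     = "skip"
	FSBackup = "fs-backup"
	Snapshot = "snapshot"
)

// validateActionType is a hypothetical helper that rejects unknown action types
// when a volume policy is parsed.
func validateActionType(t string) error {
	switch t {
	case Skip, FSBackup, Snapshot:
		return nil
	default:
		return fmt.Errorf("invalid volume policy action type %q", t)
	}
}
```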

**Note:**
- When we are using the `VolumePolicy` approach for backing up volumes, the volume policy criteria and action need to be specific and explicit; there is no default behavior. If a volume matches the `fs-backup` action, then the `fs-backup` method will be used for that volume, and similarly, if the volume matches the criteria for the `snapshot` action, then the snapshot workflow will be used for the volume backup.
- Another thing to note is that the workflow proposed in this design uses the legacy `opt-in/opt-out` approach as a fallback option. For instance, if the user specifies a VolumePolicy but a particular volume included in the backup matches no action (fs-backup/snapshot) in that policy, then the legacy approach will be used for backing up that volume.
- The relation between the `VolumePolicy` and the backup's legacy parameter `SnapshotVolumes`:
  - A `VolumePolicy` `snapshot` action matching the volume has higher priority. When there is a `snapshot` action matching the selected volume, it will be backed up via snapshot, regardless of the `backup.Spec.SnapshotVolumes` setting.
  - If there is no `snapshot` action matching the selected volume in the `VolumePolicy`, then the volume will be backed up via snapshot as long as `backup.Spec.SnapshotVolumes` is not set to false.
- The relation between the `VolumePolicy` and the backup's legacy filesystem `opt-in/opt-out` approach:
  - A `VolumePolicy` `fs-backup` action matching the volume has higher priority. When there is a `fs-backup` action matching the selected volume, it will be backed up via fs-backup, regardless of the `backup.Spec.DefaultVolumesToFsBackup` setting and the pod's `opt-in/opt-out` annotation setting.
  - If there is no `fs-backup` action matching the selected volume in the `VolumePolicy`, then the volume will be backed up via the legacy `opt-in/opt-out` way.

## Implementation

- The implementation should be included in Velero 1.14.
- We will introduce a `VolumeHelper` interface. It will consist of two methods:
```go
type VolumeHelper interface {
	ShouldPerformSnapshot(obj runtime.Unstructured, groupResource schema.GroupResource) (bool, error)
	ShouldPerformFSBackup(volume corev1api.Volume, pod corev1api.Pod) (bool, error)
}
```
- The `volumeHelperImpl` struct will implement the `VolumeHelper` interface and will consist of the functions that we will use throughout the backup workflow to accommodate volume policies for PVs and PVCs.
```go
type volumeHelperImpl struct {
	volumePolicy             *resourcepolicies.Policies
	snapshotVolumes          *bool
	logger                   logrus.FieldLogger
	client                   crclient.Client
	defaultVolumesToFSBackup bool
	backupExcludePVC         bool
}
```
- We will create an instance of the structure `volumeHelperImpl` in `item_backupper.go`:
```go
itemBackupper := &itemBackupper{
	...
	volumeHelperImpl: volumehelper.NewVolumeHelperImpl(
		resourcePolicy,
		backupRequest.Spec.SnapshotVolumes,
		log,
		kb.kbClient,
		boolptr.IsSetToTrue(backupRequest.Spec.DefaultVolumesToFsBackup),
		!backupRequest.ResourceIncludesExcludes.ShouldInclude(kuberesource.PersistentVolumeClaims.String()),
	),
}
```

#### FS-Backup

- Regarding the `fs-backup` action, to decide whether to use the legacy annotation-based approach or the volume-policy-based approach:
  - We will use the `vh.ShouldPerformFSBackup()` function from the `volumehelper` package.
- The functions involved in processing the `fs-backup` volume policy action will look roughly like:
```go
func (v volumeHelperImpl) ShouldPerformFSBackup(volume corev1api.Volume, pod corev1api.Pod) (bool, error) {
	if !v.shouldIncludeVolumeInBackup(volume) {
		v.logger.Debugf("skip fs-backup action for pod %s's volume %s, due to not pass volume check.", pod.Namespace+"/"+pod.Name, volume.Name)
		return false, nil
	}

	if v.volumePolicy != nil {
		pvc, err := kubeutil.GetPVCForPodVolume(&volume, &pod, v.client)
		if err != nil {
			v.logger.WithError(err).Errorf("fail to get PVC for pod %s", pod.Namespace+"/"+pod.Name)
			return false, err
		}
		pv, err := kubeutil.GetPVForPVC(pvc, v.client)
		if err != nil {
			v.logger.WithError(err).Errorf("fail to get PV for PVC %s", pvc.Namespace+"/"+pvc.Name)
			return false, err
		}

		action, err := v.volumePolicy.GetMatchAction(pv)
		if err != nil {
			v.logger.WithError(err).Errorf("fail to get VolumePolicy match action for PV %s", pv.Name)
			return false, err
		}

		if action != nil {
			if action.Type == resourcepolicies.FSBackup {
				v.logger.Infof("Perform fs-backup action for volume %s of pod %s due to volume policy match",
					volume.Name, pod.Namespace+"/"+pod.Name)
				return true, nil
			} else {
				v.logger.Infof("Skip fs-backup action for volume %s for pod %s because the action type is %s",
					volume.Name, pod.Namespace+"/"+pod.Name, action.Type)
				return false, nil
			}
		}
	}

	if v.shouldPerformFSBackupLegacy(volume, pod) {
		v.logger.Infof("Perform fs-backup action for volume %s of pod %s due to opt-in/out way",
			volume.Name, pod.Namespace+"/"+pod.Name)
		return true, nil
	} else {
		v.logger.Infof("Skip fs-backup action for volume %s of pod %s due to opt-in/out way",
			volume.Name, pod.Namespace+"/"+pod.Name)
		return false, nil
	}
}
```

- The main function from the above will be called when we encounter Pods during the backup workflow:
```go
for _, volume := range pod.Spec.Volumes {
	shouldDoFSBackup, err := ib.volumeHelperImpl.ShouldPerformFSBackup(volume, *pod)
	if err != nil {
		backupErrs = append(backupErrs, errors.WithStack(err))
	}
	...
}
```

#### Snapshot (PV)

- To make sure that the `snapshot` action is skipped for PVs that do not fit the volume policy criteria, we will use `vh.ShouldPerformSnapshot` from the `volumeHelperImpl (vh)` receiver:
```go
func (v *volumeHelperImpl) ShouldPerformSnapshot(obj runtime.Unstructured, groupResource schema.GroupResource) (bool, error) {
	// Check if a volume policy exists, check whether the object (PV/PVC) fits a volume
	// policy criteria, and see if the associated action is snapshot.
	// If it is not snapshot, then skip the code path for snapshotting the PV/PVC.
	pvc := new(corev1api.PersistentVolumeClaim)
	pv := new(corev1api.PersistentVolume)
	var err error

	if groupResource == kuberesource.PersistentVolumeClaims {
		if err = runtime.DefaultUnstructuredConverter.FromUnstructured(obj.UnstructuredContent(), pvc); err != nil {
			return false, err
		}

		pv, err = kubeutil.GetPVForPVC(pvc, v.client)
		if err != nil {
			return false, err
		}
	}

	if groupResource == kuberesource.PersistentVolumes {
		if err = runtime.DefaultUnstructuredConverter.FromUnstructured(obj.UnstructuredContent(), pv); err != nil {
			return false, err
		}
	}

	if v.volumePolicy != nil {
		action, err := v.volumePolicy.GetMatchAction(pv)
		if err != nil {
			return false, err
		}

		// If there is a match action and the action type is snapshot, return true;
		// if the action type is not snapshot, return false.
		// If there is no match action, go on to the next check.
		if action != nil {
			if action.Type == resourcepolicies.Snapshot {
				v.logger.Infof("performing snapshot action for pv %s", pv.Name)
				return true, nil
			} else {
				v.logger.Infof("Skip snapshot action for pv %s as the action type is %s", pv.Name, action.Type)
				return false, nil
			}
		}
	}

	// If this PV is claimed, see if we've already taken a (pod volume backup)
	// snapshot of the contents of this PV. If so, don't take a snapshot.
	if pv.Spec.ClaimRef != nil {
		pods, err := podvolumeutil.GetPodsUsingPVC(
			pv.Spec.ClaimRef.Namespace,
			pv.Spec.ClaimRef.Name,
			v.client,
		)
		if err != nil {
			v.logger.WithError(err).Errorf("fail to get pod for PV %s", pv.Name)
			return false, err
		}

		for _, pod := range pods {
			for _, vol := range pod.Spec.Volumes {
				if vol.PersistentVolumeClaim != nil &&
					vol.PersistentVolumeClaim.ClaimName == pv.Spec.ClaimRef.Name &&
					v.shouldPerformFSBackupLegacy(vol, pod) {
					v.logger.Infof("Skipping snapshot of pv %s because it is backed up with PodVolumeBackup.", pv.Name)
					return false, nil
				}
			}
		}
	}

	if !boolptr.IsSetToFalse(v.snapshotVolumes) {
		// If backup.Spec.SnapshotVolumes is not set, or is set to true, then we should take the snapshot.
		v.logger.Infof("performing snapshot action for pv %s as snapshotVolumes is not set to false", pv.Name)
		return true, nil
	}

	v.logger.Infof("skipping snapshot action for pv %s possibly due to no volume policy setting or snapshotVolumes is false", pv.Name)
	return false, nil
}
```

- The function `ShouldPerformSnapshot` will be used as follows in the `takePVSnapshot` function of the backup workflow:
```go
snapshotVolume, err := ib.volumeHelperImpl.ShouldPerformSnapshot(obj, kuberesource.PersistentVolumes)
if err != nil {
	return err
}

if !snapshotVolume {
	log.Info(fmt.Sprintf("skipping volume snapshot for PV %s as it does not fit the volume policy criteria specified by the user for snapshot action", pv.Name))
	ib.trackSkippedPV(obj, kuberesource.PersistentVolumes, volumeSnapshotApproach, "does not satisfy the criteria for volume policy based snapshot action", log)
	return nil
}
```

#### Snapshot (PVC)

- To make sure that the `snapshot` action is skipped for PVCs that do not fit the volume policy criteria, we will again use `vh.ShouldPerformSnapshot` from the `volumeHelperImpl (vh)` receiver.
- We will pass the `volumeHelperImpl (vh)` instance into the `executeActions` method so that it is available for use there (a hedged sketch follows).
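
Since `executeActions` is a method on `itemBackupper`, and the `itemBackupper` literal shown earlier already carries `volumeHelperImpl`, the helper is naturally in scope there. The following is only a hypothetical illustration of that wiring; the parameter list is illustrative, not the actual Velero signature:

```go
// Hypothetical sketch: the helper travels with the itemBackupper receiver, so
// executeActions can consult it directly when it processes PVC items.
func (ib *itemBackupper) executeActions(
	log logrus.FieldLogger,
	obj runtime.Unstructured,
	groupResource schema.GroupResource,
	name, namespace string,
) (runtime.Unstructured, error) {
	// The volume helper built when the itemBackupper is constructed is reachable
	// via the receiver, e.g.:
	//   ib.volumeHelperImpl.ShouldPerformSnapshot(obj, kuberesource.PersistentVolumeClaims)
	// The concrete PVC handling is shown in the snippet below.

	// ... existing action execution logic ...
	return obj, nil
}
```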

- `ShouldPerformSnapshot` will be used as follows in the `executeActions` function of the backup workflow.
- Considering that the vSphere plugin doesn't support the VolumePolicy yet, the VolumePolicy is not used for the vSphere plugin for now.
```go
if groupResource == kuberesource.PersistentVolumeClaims {
	if actionName == csiBIAPluginName {
		snapshotVolume, err := ib.volumeHelperImpl.ShouldPerformSnapshot(obj, kuberesource.PersistentVolumeClaims)
		if err != nil {
			return nil, itemFiles, errors.WithStack(err)
		}

		if !snapshotVolume {
			log.Info(fmt.Sprintf("skipping csi volume snapshot for PVC %s as it does not fit the volume policy criteria specified by the user for snapshot action", namespace+"/"+name))
			ib.trackSkippedPV(obj, kuberesource.PersistentVolumeClaims, volumeSnapshotApproach, "does not satisfy the criteria for volume policy based snapshot action", log)
			continue
		}
	}
}
```

## Future Implementation

It makes sense to add more specific actions in the future, once we deprecate the legacy opt-in/opt-out approach, to keep things simple. Another point to note is that CSI-related actions will be easier to implement once we decide to merge the CSI plugin into the main Velero code flow.
In the future, we envision that the following actions could be implemented:
- `snapshot-native`: only use the volume snapshotter (native cloud provider snapshots); do nothing if not present/not compatible
- `snapshot-csi`: only use the csi-plugin; don't use the volume snapshotter (native cloud provider snapshots); don't use the datamover even if snapshotMoveData is true
- `snapshot-datamover`: only use CSI with the datamover; don't use the volume snapshotter (native cloud provider snapshots); use the datamover even if snapshotMoveData is false

**Note:** The above actions are just suggestions for future scope; we may not use/implement them as is. We could merge these suggested actions into the `snapshot` action and use volume policy parameters and criteria to segregate them, instead of making the user explicitly supply action names at such a granular level.

## Related to Design

[Handle backup of volumes by resources filters](https://github.com/vmware-tanzu/velero/blob/main/design/Implemented/handle-backup-of-volumes-by-resources-filters.md)

## Alternatives Considered

Same as the earlier design, as this is an extension of the original VolumePolicies design.

---

`design/Implemented/node-agent-affinity.md`

# Node-agent Load Affinity Design

## Glossary & Abbreviation

**Velero Generic Data Path (VGDP)**: VGDP is the collective term for the modules introduced in the [Unified Repository design][1]. Velero uses these modules to finish data transfer for various purposes (i.e., PodVolume backup/restore, Volume Snapshot Data Movement). VGDP modules include uploaders and the backup repository.

**Exposer**: Exposer is a module introduced in the [Volume Snapshot Data Movement design][2]. Velero uses this module to expose the volume snapshots to Velero node-agent pods or node-agent associated pods so as to complete the data movement from the snapshots.
## Background

Velero node-agent is a daemonset hosting controllers and VGDP modules to complete the concrete work of backups/restores, i.e., PodVolume backup/restore and Volume Snapshot Data Movement backup/restore.
Specifically, node-agent runs DataUpload controllers to watch DataUpload CRs for Volume Snapshot Data Movement backups, so there is one controller instance in each node. A controller instance takes a DataUpload CR and then launches a VGDP instance, which initializes an uploader instance and the backup repository connection, to finish the data transfer. The VGDP instance runs inside a node-agent pod or in a pod associated to the node-agent pod in the same node.

Depending on the data size, data complexity, and resource availability, VGDP may take a long time and consume significant resources (CPU, memory, network bandwidth, etc.).
Technically, VGDP instances are able to run in any node that allows pod scheduling. On the other hand, users may want to constrain the nodes where VGDP instances run for various reasons; below are some examples:
- Prevent VGDP instances from running in specific nodes because users have more critical workloads in those nodes
- Constrain VGDP instances to run in specific nodes because these nodes have more resources than others
- Constrain VGDP instances to run in specific nodes because the storage allows volume/snapshot provisioning in these nodes only

Therefore, in order to improve compatibility, it is worthwhile to make the affinity of VGDP instances to nodes configurable, especially for backups, for which VGDP instances run frequently and centrally.
## Goals

- Define the behaviors of node affinity of VGDP instances in node-agent for volume snapshot data movement backups
- Create a mechanism for users to specify the node affinity of VGDP instances for volume snapshot data movement backups

## Non-Goals

- It would also be beneficial to support VGDP instance affinity for PodVolume backup/restore; however, it is not possible since VGDP instances for PodVolume backup/restore must always run in the node where the source/target pods are created.
- It would also be beneficial to support VGDP instance affinity for data movement restores; however, it is not possible in some cases. For example, when the `volumeBindingMode` in the storageclass is `WaitForFirstConsumer`, the restore volume must be mounted in the node where the target pod is scheduled, so the VGDP instance must run in the same node. On the other hand, considering that restores may not run frequently or centrally, we will not support data movement restores.
- As elaborated in the [Volume Snapshot Data Movement design][2], the Exposer may take different ways to expose snapshots, i.e., through backup pods (this is the only way supported at present). The implementation section below only considers this approach; if a new expose method is introduced in the future, the definition of the affinity configurations and behaviors should still work, but we may need a new implementation.
## Solution

We will use the ```node-agent-config``` configMap to host the node affinity configurations.
This configMap is not created by Velero; users should create it manually on demand. The configMap should be in the same namespace where Velero is installed. If multiple Velero instances are installed in different namespaces, there should be one configMap in each namespace, applying to the node-agent in that namespace only.
The node-agent server checks these configurations at startup time and uses them to initiate the related VGDP modules. Therefore, users can edit this configMap at any time, but in order to make the changes effective, the node-agent server needs to be restarted.
Inside the ```node-agent-config``` configMap we will add one new kind of configuration as data in the configMap; its name is ```loadAffinity```.
Users may want to set different LoadAffinity configurations according to different conditions (i.e., for different storages represented by StorageClass, CSI driver, etc.), so we define ```loadAffinity``` as an array. This is for extensibility; at present, we don't implement support for multiple configurations, so if there are multiple configurations, we always take the first one in the array.

The data structure for ```node-agent-config``` is as below:
```go
type Configs struct {
	// LoadConcurrency is the config for load concurrency per node.
	LoadConcurrency *LoadConcurrency `json:"loadConcurrency,omitempty"`

	// LoadAffinity is the config for data path load affinity.
	LoadAffinity []*LoadAffinity `json:"loadAffinity,omitempty"`
}

type LoadAffinity struct {
	// NodeSelector specifies the label selector to match nodes
	NodeSelector metav1.LabelSelector `json:"nodeSelector"`
}
```

### Affinity
Affinity configuration allows VGDP instances to run only in the specified nodes. There are two ways to define it:
- It could be defined by `MatchLabels` of `metav1.LabelSelector`. The labels defined in `MatchLabels` imply a `LabelSelectorOpIn` operation by default, so in the current context they are treated as affinity rules.
- It could be defined by `MatchExpressions` of `metav1.LabelSelector`. The labels are defined in `Key` and `Values` of `MatchExpressions` and the `Operator` should be defined as `LabelSelectorOpIn` or `LabelSelectorOpExists`.

### Anti-affinity
Anti-affinity configuration prevents VGDP instances from running in the specified nodes. Below is the way to define it:
- It could be defined by `MatchExpressions` of `metav1.LabelSelector`. The labels are defined in `Key` and `Values` of `MatchExpressions` and the `Operator` should be defined as `LabelSelectorOpNotIn` or `LabelSelectorOpDoesNotExist`.

### Sample
A sample of the ```node-agent-config``` configMap is as below:
```json
{
    "loadAffinity": [
        {
            "nodeSelector": {
                "matchLabels": {
                    "beta.kubernetes.io/instance-type": "Standard_B4ms"
                },
                "matchExpressions": [
                    {
                        "key": "kubernetes.io/hostname",
                        "values": [
                            "node-1",
                            "node-2",
                            "node-3"
                        ],
                        "operator": "In"
                    },
                    {
                        "key": "xxx/critical-workload",
                        "operator": "DoesNotExist"
                    }
                ]
            }
        }
    ]
}
```

This sample showcases two affinity configurations:
- matchLabels: VGDP instances will run only in nodes with the label key `beta.kubernetes.io/instance-type` and value `Standard_B4ms`
- matchExpressions: VGDP instances will run in nodes `node-1`, `node-2` and `node-3` (selected by the `kubernetes.io/hostname` label)

This sample showcases one anti-affinity configuration:
- matchExpressions: VGDP instances will not run in nodes with the label key `xxx/critical-workload`

To create the configMap, users need to save something like the above sample to a json file and then run the command below:
```
kubectl create cm node-agent-config -n velero --from-file=<json file name>
```

### Implementation
As mentioned in the [Volume Snapshot Data Movement design][2], the exposer decides where to launch the VGDP instances. At present, for volume snapshot data movement backups, the exposer creates backupPods, and the VGDP instances are initiated in the nodes where the backupPods are scheduled. So the loadAffinity will be translated (from `metav1.LabelSelector` to `corev1.Affinity`) and set on the backupPods.
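
As an illustration only, the translation could look roughly like the sketch below; the helper name and placement are assumptions, not the exposer's actual code:

```go
package exposer // hypothetical placement, for illustration only

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// toNodeAffinity converts the configured label selector into a node affinity
// that can be set on the backupPod spec. MatchLabels become "In" requirements;
// MatchExpressions are carried over as-is, which covers both the affinity
// (In/Exists) and anti-affinity (NotIn/DoesNotExist) operators.
func toNodeAffinity(sel metav1.LabelSelector) *corev1.Affinity {
	reqs := []corev1.NodeSelectorRequirement{}

	for k, v := range sel.MatchLabels {
		reqs = append(reqs, corev1.NodeSelectorRequirement{
			Key:      k,
			Operator: corev1.NodeSelectorOpIn,
			Values:   []string{v},
		})
	}

	for _, exp := range sel.MatchExpressions {
		reqs = append(reqs, corev1.NodeSelectorRequirement{
			Key:      exp.Key,
			Operator: corev1.NodeSelectorOperator(exp.Operator),
			Values:   exp.Values,
		})
	}

	return &corev1.Affinity{
		NodeAffinity: &corev1.NodeAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: &corev1.NodeSelector{
				NodeSelectorTerms: []corev1.NodeSelectorTerm{{MatchExpressions: reqs}},
			},
		},
	}
}
```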

It is possible that node-agent pods, as a daemonset, don't run on every worker node; users can achieve this by specifying a `nodeSelector` or `nodeAffinity` in the node-agent daemonset spec. On the other hand, at present, VGDP instances must be assigned to nodes where node-agent pods are running. Therefore, if there is any node selection for node-agent pods, users must factor it into this load affinity configuration, so as to guarantee that VGDP instances are always assigned to nodes where node-agent pods are available. This is done by users; we don't inherit any node selection configuration from the node-agent daemonset, because the daemonset scheduler works differently from the plain pod scheduler, and simply inheriting all of its configuration may cause unexpected backupPod scheduling results.
Otherwise, if a backupPod is scheduled to a node where the node-agent pod is absent, the corresponding DataUpload CR will stay in the `Accepted` phase until the prepare timeout (by default 30 minutes).

At present, as part of the expose operations, the exposer creates a volume, represented by a backupPVC, from the snapshot. The backupPVC uses the same storageClass as the source volume. If the `volumeBindingMode` in the storageClass is `Immediate`, the volume is immediately allocated from the underlying storage without waiting for the backupPod. On the other hand, the loadAffinity is set on the backupPod's affinity. If the backupPod is scheduled to a node where the snapshot volume is not accessible, e.g., because of storage topologies, the backupPod won't get into the Running state, and consequently, the data movement won't complete.
Once this problem happens, the backupPod stays in the `Pending` phase, and the corresponding DataUpload CR stays in the `Accepted` phase until the prepare timeout (by default 30 minutes).

There is a common solution for both problems:
- We have an existing logic to periodically enqueue the DataUpload CRs which are in the `Accepted` phase for timeout and cancel checks
- We add a new check to this existing logic to see whether the corresponding backupPods are in an unrecoverable status
- The above problems are covered by this check, because in both cases the backupPods are in an abnormal and unrecoverable status
- If a backupPod is unrecoverable, the dataupload controller cancels the dataupload and deletes the backupPod

Specifically, when the above problems happen, the status of a backupPod looks like below:
```
status:
  conditions:
  - lastProbeTime: null
    message: '0/2 nodes are available: 1 node(s) didn''t match Pod''s node affinity/selector,
      1 node(s) had volume node affinity conflict. preemption: 0/2 nodes are available:
      2 Preemption is not helpful for scheduling..'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending
```

[1]: Implemented/unified-repo-and-kopia-integration/unified-repo-and-kopia-integration.md
[2]: volume-snapshot-data-movement/volume-snapshot-data-movement.md

---

`design/Implemented/pv_restore_info.md`

# Volume information for restore design

## Background
Velero has different ways to handle data in the volumes during restore. Users want more clarity about how the volumes are handled in the restore process, via either the Velero CLI or other downstream products which consume Velero.
## Goals
- Create new metadata to store the information of the restored volumes, which will have the same life-cycle as the restore CR.
- Consume the metadata in the Velero CLI to enable it to display more details for volumes in the output of `velero restore describe --details`.

## Non Goals
- Provide finer-grained control of the volume restore process. The focus of the design is to enable displaying more details.
- Persist additional metadata, like podvolumes, datadownloads, etc., to the restore folder in the backup-location.
## Design

### Structure of the restore volume info
The restore volume info will be stored in a file named like `${restore_name}-vol-info.json`. The content of the file will be a list of volume info objects, each of which maps to a volume that was restored. Each object will contain information like the name of the restored PV/PVC, the restore method, and related objects providing details depending on the way the volume was restored. It will look like this:
```
[
  {
    "pvcName": "nginx-logs-2",
    "pvcNamespace": "nginx-app-restore",
    "pvName": "pvc-e320d75b-a788-41a3-b6ba-267a553efa5e",
    "restoreMethod": "PodVolumeRestore",
    "snapshotDataMoved": false,
    "pvrInfo": {
      "snapshotHandle": "81973157c3a945a5229285c931b02c68",
      "uploaderType": "kopia",
      "volumeName": "nginx-logs",
      "podName": "nginx-deployment-79b56c644b-mjdhp",
      "podNamespace": "nginx-app-restore"
    }
  },
  {
    "pvcName": "nginx-logs-1",
    "pvcNamespace": "nginx-app-restore",
    "pvName": "pvc-98c151f4-df47-4980-ba6d-470842f652cc",
    "restoreMethod": "CSISnapshot",
    "snapshotDataMoved": false,
    "csiSnapshotInfo": {
      "snapshotHandle": "snap-01a3b21a5e9f85528",
      "size": 2147483648,
      "driver": "ebs.csi.aws.com",
      "vscName": "velero-velero-nginx-logs-1-jxmbg-hx9x5"
    }
  }
  ......
]
```

Each field will have the same meaning as the corresponding field in the backup volume info. It will not have the fields that were introduced to help with the backup process, like `pvInfo`, `dataupload`, etc.

### How the restore volume info is generated
Two steps are involved in generating the restore volume info. The first is "collection", which gathers the information about how the volumes were restored; the second is "generation", which iterates through the data collected in the first step and generates the volume info list as described above.

Unlike backup, the CR objects created during the restore process will not be persisted to the backup storage location.
Therefore, to gather the information needed to generate the volume information, we either need to collect the CRs in the middle of the restore process, or retrieve the objects based on the `resource-list.json` of the restore via the API server.

The information to be collected is:
- **PV/PVC mapping relationship:** It will be collected via the `restore-resource-list.json`, because at the time the json is ready, all PVCs and PVs have already been created.
- **Native snapshot information:** It will be collected in the restore workflow when each snapshot is restored.
- **podvolumerestore CRs:** They will be collected in the restore workflow after each PVR is created.
- **volumesnapshot CRs for CSI snapshots:** They will be collected in the step of collecting PVC info, by reading the `dataSource` field in the spec of the PVC.
- **datadownload CRs:** They will be collected in the phase of collecting PVC info, by querying the API server to list the datadownload CRs labeled with the restore name.

After the collection step, the generation step is relatively straightforward, as we have all the information needed in the data structures.

The whole collection and generation process will be done in a "best-effort" manner, i.e., if there are any failures we will only log the error in the restore log, rather than failing the whole restore process. We will not put these errors or warnings into the `result.json`, because they don't impact the restored resources.

Depending on the number of restored PVCs, the "collection" step may involve many API calls, but this is considered acceptable because at that time the resources are already created, so the actual RTO is not impacted. By using the controller-runtime client we can make the collection step more efficient via the API server cache. We may consider making improvements, such as using multiple go-routines in the collection, if we observe performance issues.
### Implementation
Because the restore volume info shares the same data structures with the backup volume info, we will refactor the code in the package `internal/volume` to make the sub-components in the backup volume info shared by both backup and restore volume info.

We'll introduce a struct called `RestoreVolumeInfoTracker`, which encapsulates the logic of collecting and generating the restore volume info:
```
// RestoreVolumeInfoTracker is used to track the volume information during restore.
// It is used to generate the RestoreVolumeInfo array.
type RestoreVolumeInfoTracker struct {
	*sync.Mutex
	restore *velerov1api.Restore
	log     logrus.FieldLogger
	client  kbclient.Client
	pvPvc   *pvcPvMap

	// map of PV name to the NativeSnapshotInfo from which the PV is restored
	pvNativeSnapshotMap map[string]NativeSnapshotInfo
	// map of PV name to the CSISnapshot object from which the PV is restored
	pvCSISnapshotMap map[string]snapshotv1api.VolumeSnapshot
	datadownloadList *velerov2alpha1.DataDownloadList
	pvrs             []*velerov1api.PodVolumeRestore
}
```
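
As a hedged illustration of the collection step, listing the datadownload CRs created for this restore could look like the sketch below; the method name, the `velero.io/restore-name` label key, and the import paths are assumptions for this sketch, not a confirmed part of the design:

```go
import (
	"context"

	kbclient "sigs.k8s.io/controller-runtime/pkg/client"

	velerov2alpha1 "github.com/vmware-tanzu/velero/pkg/apis/velero/v2alpha1"
)

// collectDataDownloads is a hypothetical helper that fills t.datadownloadList by
// querying the API server for DataDownload CRs labeled with the restore's name.
func (t *RestoreVolumeInfoTracker) collectDataDownloads(ctx context.Context) error {
	list := &velerov2alpha1.DataDownloadList{}
	if err := t.client.List(ctx, list,
		kbclient.InNamespace(t.restore.Namespace),
		kbclient.MatchingLabels{"velero.io/restore-name": t.restore.Name},
	); err != nil {
		return err
	}
	t.datadownloadList = list
	return nil
}
```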

The `RestoreVolumeInfoTracker` will be created when the restore request is initialized, and it will be passed to the `restoreContext` and carried over the whole restore process.

The `client` in this struct is used to query the resources in the restored namespace, while the current client in the restore reconciler only watches the resources in the namespace where Velero is installed. Therefore, we need to introduce the `CrClient`, which has the same life-cycle as the Velero server, to the restore reconciler, because this is the client that watches all the resources in the cluster.

In addition to that, we will make small changes in the restore workflow to collect the information needed. We'll keep the changes unintrusive and make sure not to change the logic of the restore, to avoid breaking changes or regressions.
We'll also introduce routine changes in the package `pkg/persistence` to persist the restore volume info to the backup storage location.

Last but not least, `velero restore describe --details` will be updated to display the volume info in the output.

## Alternatives Considered
There was a suggestion that, to provide more details about volumes, we could query the `backup-vol-info.json` with the resource identifiers in `restore-resource-list.json`. This will not work when resource modifiers are involved in the restore process, which may change the metadata of the PVC/PV. In addition, we may add more detailed restore-specific information about the volumes that is not available in the `backup-vol-info.json`. Therefore, the `restore-vol-info.json` is a better approach.

## Security Considerations
There should be no security impact introduced by this design.

## Compatibility
The restore volume info will be consumed by the Velero CLI and downstream products for displaying details, so the functionality of backup and restore will not be impacted for restores created by older versions of Velero which do not have the restore volume info metadata. The client should properly handle the case when the restore volume info does not exist.

The data structures referenced by the volume info are shared between both restore and backup and are not versioned, so in the future we must make sure there will only be incremental changes to the metadata, such that no breaking change will be introduced to the client.

## Open Issues
https://github.com/vmware-tanzu/velero/issues/7546
https://github.com/vmware-tanzu/velero/issues/6478

---

`design/Implemented/repository-maintenance.md`

# Design for repository maintenance job

## Abstract
This design proposal aims to decouple repository maintenance from the Velero server by launching a maintenance job when needed, to mitigate the impact on the Velero server during backups.

## Background
During backups, Velero performs periodic maintenance on the repository. This operation may consume significant CPU and memory resources in some cases, leading to potential issues such as the Velero server being killed by the OOM killer. This proposal addresses these challenges by separating repository maintenance from the Velero server.
## Goals
1. **Independent Repository Maintenance**: Decouple maintenance from Velero's main logic to reduce the impact on the Velero server pod.

2. **Configurable Resources Usage**: Make the resources used by the maintenance job configurable.

3. **No API Changes**: Retain existing APIs and workflow in the backup repository controller.
## Non Goals
|
||||
We have lots of concerns over parallel maintenance, which will increase the complexity of our design currently.
|
||||
|
||||
- Non-blocking maintenance job: it may conflict with updating the same `backuprepositories` CR when parallel maintenance.
|
||||
|
||||
- Maintenance job concurrency control: there is no one suitable mechanism in Kubernetes to control the concurrency of different jobs.
|
||||
|
||||
- Parallel maintenance: Maintaining the same repo by multiple jobs at the same time would have some compatible cases that some providers may not support.
|
||||
|
||||
Unfortunately, parallel maintenance is currently not a priority because of the concerns above, improving maintenance efficiency is not the primary focus at this stage.
|
||||
|
||||
## High-Level Design
1. **Add Maintenance Subcommand**: Introduce a new Velero server subcommand for repository maintenance.

2. **Create Jobs by Repository Manager**: Modify the backup repository controller to create a maintenance job instead of directly making the multiple chained calls for Kopia or Restic maintenance.

3. **Update Maintenance Job Result in BackupRepository CR**: Retrieve the result of the maintenance job and update the status of the `BackupRepository` CR accordingly.

4. **Add Settings for Maintenance Jobs**: Introduce configuration options for maintenance jobs, including resource limits (CPU and memory) and keeping the latest N maintenance jobs for each repository.
## Detailed Design

### 1. Add Maintenance sub-command

A new command will be added to the Velero CLI; the command is designed for use in the pod of a maintenance job.

Our CLI command is designed as follows:
```shell
$ velero repo-maintenance --repo-name $repo-name --repo-type $repo-type --backup-storage-location $bsl
```

Compared with other CLI commands, the maintenance command runs in the pod of a maintenance job and is not meant for direct user use, and the job should show the result of the maintenance after it finishes.

Here we will write the error message into one specific file which can be read by the maintenance job.

On the whole, we record two kinds of logs:

- the log output of the intermediate maintenance process: this log, including the error log, can be retrieved via the Kubernetes API server.

- the result of the command, which indicates whether the execution failed or not: the result is redirected to a file that the maintenance job itself can read, and the file only contains the error message.

We will write the error message into the `/dev/termination-log` file if the execution fails.

The main maintenance logic uses the repository provider to do the maintenance.
```golang
func checkError(err error, file *os.File) {
	if err != nil {
		if err != context.Canceled {
			if _, errWrite := file.WriteString(fmt.Sprintf("An error occurred: %v", err)); errWrite != nil {
				fmt.Fprintf(os.Stderr, "Failed to write error to termination log file: %v\n", errWrite)
			}
			file.Close()
			os.Exit(1) // indicate the command executed failed
		}
	}
}

func (o *Options) Run(f veleroCli.Factory) {
	logger := logging.DefaultLogger(o.LogLevelFlag.Parse(), o.FormatFlag.Parse())
	logger.SetOutput(os.Stdout)

	errorFile, err := os.Create("/dev/termination-log")
	if err != nil {
		fmt.Fprintf(os.Stderr, "Failed to create termination log file: %v\n", err)
		return
	}
	defer errorFile.Close()
	...

	err = o.runRepoPrune(cli, f.Namespace(), logger)
	checkError(err, errorFile)
	...
}

func (o *Options) runRepoPrune(cli client.Client, namespace string, logger logrus.FieldLogger) error {
	...
	var repoProvider provider.Provider
	if o.RepoType == velerov1api.BackupRepositoryTypeRestic {
		repoProvider = provider.NewResticRepositoryProvider(credentialFileStore, filesystem.NewFileSystem(), logger)
	} else {
		repoProvider = provider.NewUnifiedRepoProvider(
			credentials.CredentialGetter{
				FromFile:   credentialFileStore,
				FromSecret: credentialSecretStore,
			}, o.RepoType, cli, logger)
	}
	...

	err = repoProvider.BoostRepoConnect(context.Background(), para)
	if err != nil {
		return errors.Wrap(err, "failed to boost repo connect")
	}

	err = repoProvider.PruneRepo(context.Background(), para)
	if err != nil {
		return errors.Wrap(err, "failed to prune repo")
	}
	return nil
}
```

### 2. Create Jobs by Repository Manager
Currently, the backup repository controller calls the repository manager to do the `PruneRepo`, and Kopia or Restic maintenance is then finally invoked through multiple chained calls.

We will keep using the `PruneRepo` function in the repository manager, but we cut off the chained calls by creating a maintenance job.

The job definition would look like below:
```yaml
apiVersion: v1
items:
- apiVersion: batch/v1
  kind: Job
  metadata:
    # labels, affinity, and topology settings are inherited from the velero deployment
    labels:
      # label the job with its name so jobs can later be listed by name
      job-name: nginx-example-default-kopia-pqz6c
    name: nginx-example-default-kopia-pqz6c
    namespace: velero
  spec:
    # do not retry the job
    backoffLimit: 1
    # only one completion for the job
    completions: 1
    # no parallel pods for the job
    parallelism: 1
    template:
      metadata:
        labels:
          job-name: nginx-example-default-kopia-pqz6c
        name: kopia-maintenance-job
      spec:
        containers:
        # arguments for the repo maintenance job
        - args:
          - repo-maintenance
          - --repo-name=nginx-example
          - --repo-type=kopia
          - --backup-storage-location=default
          # inherited from the Velero server
          - --log-level=debug
          command:
          - /velero
          # environment variables are inherited from the velero deployment
          env:
          - name: AZURE_CREDENTIALS_FILE
            value: /credentials/cloud
          # image is inherited from the velero deployment
          image: velero/velero:main
          imagePullPolicy: IfNotPresent
          name: kopia-maintenance-container
          # resource limits are set by the Velero server configuration;
          # if not specified, the best-effort resource allocation strategy applies
          resources: {}
          # the error message would be written to /dev/termination-log
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          # volume mounts are inherited from the velero deployment
          volumeMounts:
          - mountPath: /credentials
            name: cloud-credentials
        dnsPolicy: ClusterFirst
        restartPolicy: Never
        schedulerName: default-scheduler
        securityContext: {}
        # service account is inherited from the velero deployment
        serviceAccount: velero
        serviceAccountName: velero
        volumes:
        # cloud credentials are inherited from the velero deployment
        - name: cloud-credentials
          secret:
            defaultMode: 420
            secretName: cloud-credentials
    # ttlSecondsAfterFinished sets when the finished job expires
    ttlSecondsAfterFinished: 86400
  status:
    # contains the result after maintenance
    message: ""
    lastMaintenanceTime: ""
```

Now, the backup repository controller calls the repository manager to create one maintenance job and waits for the job to complete. The chained Kopia or Restic maintenance calls are made by the job.

### 3. Update the Result of the Maintenance Job into the BackupRepository CR

The backup repository controller will update the result of the maintenance job in the backup repository CR.

For how to get the result of the maintenance job, refer to [this documentation](https://kubernetes.io/docs/tasks/debug/debug-application/determine-reason-pod-failure/#writing-and-reading-a-termination-message).

After the maintenance job is finished, we can get the result of the maintenance by reading the termination message from the related pod:
```golang
func GetContainerTerminatedMessage(pod *v1.Pod) string {
	...
	for _, containerStatus := range pod.Status.ContainerStatuses {
		if containerStatus.LastTerminationState.Terminated != nil {
			return containerStatus.LastTerminationState.Terminated.Message
		}
	}
	...
	return ""
}
```

Then we can update the status of the backupRepository CR with the message.

### 4. Add Setting for Resource Usage of Maintenance
Add configuration for setting the resource limits of maintenance jobs as below:
```shell
velero server --maintenance-job-cpu-request $cpu-request --maintenance-job-mem-request $mem-request --maintenance-job-cpu-limit $cpu-limit --maintenance-job-mem-limit $mem-limit
```
The default value is 0, which means we don't limit the resources, and the resource allocation strategy would be [best effort](https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/#besteffort).
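
For illustration only, these flag values could be turned into the job container's resource requirements roughly as sketched below; the helper name and flag plumbing are assumptions, not the final implementation:

```golang
import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// buildMaintenanceJobResources is a hypothetical helper: an empty or "0" flag
// value means "do not constrain", which keeps the container at BestEffort QoS.
func buildMaintenanceJobResources(cpuRequest, memRequest, cpuLimit, memLimit string) (corev1.ResourceRequirements, error) {
	res := corev1.ResourceRequirements{
		Requests: corev1.ResourceList{},
		Limits:   corev1.ResourceList{},
	}
	set := func(list corev1.ResourceList, name corev1.ResourceName, value string) error {
		if value == "" || value == "0" {
			return nil
		}
		q, err := resource.ParseQuantity(value)
		if err != nil {
			return err
		}
		list[name] = q
		return nil
	}
	if err := set(res.Requests, corev1.ResourceCPU, cpuRequest); err != nil {
		return res, err
	}
	if err := set(res.Requests, corev1.ResourceMemory, memRequest); err != nil {
		return res, err
	}
	if err := set(res.Limits, corev1.ResourceCPU, cpuLimit); err != nil {
		return res, err
	}
	if err := set(res.Limits, corev1.ResourceMemory, memLimit); err != nil {
		return res, err
	}
	return res, nil
}
```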

### 5. Automatic Cleanup of Finished Maintenance Jobs
Add configuration for cleaning up maintenance jobs:

- keep-latest-maintenance-jobs: the number of latest maintenance jobs to keep for each repository.

```shell
velero server --keep-latest-maintenance-jobs $num
```

We check and keep only the latest N jobs after a new job finishes.
```golang
func deleteOldMaintenanceJobs(cli client.Client, repo string, keep int) error {
	// Get the maintenance job list by label
	jobList := &batchv1.JobList{}
	err := cli.List(context.TODO(), jobList, client.MatchingLabels(map[string]string{RepositoryNameLabel: repo}))
	if err != nil {
		return err
	}

	// Delete old maintenance jobs
	if len(jobList.Items) > keep {
		sort.Slice(jobList.Items, func(i, j int) bool {
			return jobList.Items[i].CreationTimestamp.Before(&jobList.Items[j].CreationTimestamp)
		})
		for i := 0; i < len(jobList.Items)-keep; i++ {
			err = cli.Delete(context.TODO(), &jobList.Items[i], client.PropagationPolicy(metav1.DeletePropagationBackground))
			if err != nil {
				return err
			}
		}
	}

	return nil
}
```

### 6. Velero Install with Maintenance Options
All the above maintenance options should be supported by the `velero install` command.
### 7. Observability and Debuggability
Some monitoring metrics are added for backup repository maintenance:
- repo_maintenance_total
- repo_maintenance_success_total
- repo_maintenance_failed_total
- repo_maintenance_duration_seconds
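
As a hedged illustration, such metrics could be declared with prometheus/client_golang roughly as below; Velero's own metrics package may name, label, and register them differently:

```golang
var (
	// Counter of maintenance runs per repository; the success/failed counters
	// would follow the same pattern.
	repoMaintenanceTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "repo_maintenance_total",
			Help: "Total number of repository maintenance runs.",
		},
		[]string{"repository"},
	)

	// Histogram of how long each maintenance run takes.
	repoMaintenanceDurationSeconds = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name: "repo_maintenance_duration_seconds",
			Help: "Duration of repository maintenance runs, in seconds.",
		},
		[]string{"repository"},
	)
)
```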

We will keep the latest N maintenance jobs for each repo, and users can get the log from the job. The job log level is inherited from the Velero server setting.

Also, we will integrate maintenance job logs and `backuprepositories` CRs into `velero debug`.

Roughly, the process is as follows:
1. The backup repository controller will check the BackupRepository requests in the queue periodically.

2. If the maintenance period of the repository, checked by `runMaintenanceIfDue` in `Reconcile`, is due, then the backup repository controller will call the repository manager to execute `PruneRepo`.

3. The `PruneRepo` of the repository manager will create one maintenance job; the resource limits, environment variables, service account, image, etc. are inherited from the Velero server pod. Also, a cleanup TTL is set on the maintenance job.

4. The maintenance job executes the Velero maintenance command, waits for the maintenance to finish, and writes the maintenance result into the terminationMessagePath file of the related pod.

5. Kubernetes can show the result in the status of the pod by reading the termination message in the pod.

6. The backup repository controller waits for the maintenance job to finish, reads the status of the maintenance job, and then updates the message field and phase in the status of the `backuprepositories` CR accordingly.

7. Old maintenance jobs are cleaned up, keeping only the N latest for each repository.
### 8. Code Refinement
Once the `backuprepositories` CR status is modified, the CR is re-queued to be reconciled and re-executes the reconcile logic shortly afterwards, not respecting the re-queue frequency configured by `repoSyncPeriod`.
In one abnormal scenario, if the maintenance job fails, the status of the `backuprepositories` CR is updated and the CR re-queues immediately; if the new maintenance job still fails, then it re-queues again, making the `backuprepositories` CR re-queue logic behave like an endless loop.

So we change the predicate logic in the controller manager, making it re-queue only if the Spec of the `backuprepositories` CR has changed.

```golang
ctrl.NewControllerManagedBy(mgr).For(&velerov1api.BackupRepository{}, builder.WithPredicates(kube.SpecChangePredicate{}))
```
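
For illustration, a spec-change predicate of this kind could be built on controller-runtime's `predicate.Funcs` roughly as below; this is a hedged sketch, and Velero's actual `kube.SpecChangePredicate` may be implemented differently:

```golang
import (
	"k8s.io/apimachinery/pkg/api/equality"
	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// specChangePredicate only lets update events through when the object's spec changed.
var specChangePredicate = predicate.Funcs{
	UpdateFunc: func(e event.UpdateEvent) bool {
		oldObj, errOld := runtime.DefaultUnstructuredConverter.ToUnstructured(e.ObjectOld)
		newObj, errNew := runtime.DefaultUnstructuredConverter.ToUnstructured(e.ObjectNew)
		if errOld != nil || errNew != nil {
			return true // fail open: reconcile if the comparison cannot be made
		}
		return !equality.Semantic.DeepEqual(oldObj["spec"], newObj["spec"])
	},
}
```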

This change makes the behavior different from before: errors that occurred in the maintenance job are retried in the next reconciliation period instead of being retried immediately.

## Prospects for Future Work
Future work may focus on improving the efficiency of Velero maintenance through non-blocking parallel modes. Potential areas for enhancement include:

**Non-blocking Mode**: Explore the implementation of a non-blocking mode for parallel maintenance to enhance overall efficiency.

**Concurrency Control**: Investigate mechanisms for better concurrency control of different maintenance jobs.

**Provider Support for Parallel Maintenance**: Evaluate the feasibility of parallel maintenance for different providers and address any compatibility issues.

**Efficiency Improvements**: Investigate strategies to optimize maintenance efficiency without compromising reliability.

By considering these areas, future iterations of Velero may benefit from enhanced parallelization and improved resource utilization during repository maintenance.
|
||||

120 design/Implemented/restore-finalizing-phase_design.md Normal file
@@ -0,0 +1,120 @@

# Design for Adding Finalization Phase in Restore Workflow

## Abstract

This design proposes adding a finalization phase to the restore workflow. The finalization phase would be entered after all item restoration and plugin operations have been completed, similar to the way the backup process proceeds. Its purpose is to perform any wrap-up work necessary before transitioning the restore process to a terminal phase.

## Background

Currently, the restore process enters a terminal phase once all item restoration and plugin operations have been completed. However, there are some wrap-up tasks that need to be performed after item restoration and plugin operations have been fully executed, and there is no suitable opportunity to perform them at present.

To address this, a new finalization phase should be added to the existing restore workflow. In this phase, all plugin operations and item restoration have been fully completed, which provides a clean opportunity to perform any wrap-up work before termination, improving the overall restore process.

Wrap-up tasks in Velero can serve several purposes:
- Post-restore modification - Velero can modify restored data that was temporarily changed for some purpose but needs to be changed back at the end, or data that was newly created but is missing some information. For example, [issue6435](https://github.com/vmware-tanzu/velero/issues/6435) indicates that some custom settings (like labels and reclaim policy) on restored PVs were lost because those PVs were newly dynamically provisioned. Velero can address this by patching the PVs' custom settings back in the finalization phase (see the sketch after this list).
- Clean up unused data - Velero can identify and delete any data that is no longer needed after a successful restore in the finalization phase.
- Post-restore validation - Velero can validate the state of restored data and report any errors to help users locate the issue in the finalization phase.
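
To make the first example above concrete, below is a minimal, hypothetical sketch of such a wrap-up task, assuming controller-runtime's client; the function name and the way the desired settings are obtained are illustrative only.

```golang
package finalizer

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// patchPVSettings is a hypothetical wrap-up task: it patches a dynamically
// provisioned PV so that the labels and reclaim policy recorded at backup
// time are applied again after the restore.
func patchPVSettings(ctx context.Context, cli client.Client, pvName string, labels map[string]string, policy corev1.PersistentVolumeReclaimPolicy) error {
	pv := &corev1.PersistentVolume{}
	if err := cli.Get(ctx, client.ObjectKey{Name: pvName}, pv); err != nil {
		return err
	}

	original := pv.DeepCopy()
	pv.Spec.PersistentVolumeReclaimPolicy = policy
	if pv.Labels == nil {
		pv.Labels = map[string]string{}
	}
	for k, v := range labels {
		pv.Labels[k] = v
	}

	// Patch only the difference against the object we read.
	return cli.Patch(ctx, pv, client.MergeFrom(original))
}
```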

The uses of wrap-up tasks are not limited to these examples. Additional needs may be addressed as they develop over time.

## Goals

- Add the finalization phase and the corresponding controller to the restore workflow.

## Non Goals

- Implement the specific wrap-up work.

## High-Level Design

- The finalization phase will be added to the current restore workflow.
- The logic for handling phase transitions in the restore and restore operations controllers will be modified with the introduction of the finalization phase.
- A new restore finalizer controller will be implemented to handle the finalization phase.

## Detailed Design

### phase transition

Two new phases related to finalization will be added to the restore workflow: `FinalizingPartiallyFailed` and `Finalizing`. The new phase transition will be similar to the backup workflow, proceeding as follows:

![](restore-phases-transition.png)

### restore finalizer controller

The new restore finalizer controller will be implemented to watch for restores in the `FinalizingPartiallyFailed` and `Finalizing` phases. Any wrap-up work that needs to wait for the completion of item restoration and plugin operations will be executed by this controller, and the phase will be set to either `Completed` or `PartiallyFailed` based on the results of these tasks.

Points worth noting about the new restore finalizer controller:

A new structure `finalizerContext` will be created to facilitate the implementation of any wrap-up tasks. It includes all the dependencies the tasks require, as well as a function `execute()` that runs the task logic in order.

```
// finalizerContext includes all the dependencies required by wrap-up tasks
type finalizerContext struct {
    .......
    restore *velerov1api.Restore
    log     logrus.FieldLogger
    .......
}

// execute executes all the wrap-up tasks and returns the results
func (ctx *finalizerContext) execute() (results.Result, results.Result) {
    // execute task1
    .......

    // execute task2
    .......

    // the task execution logic will be expanded as new tasks are included
    .......
}

// newFinalizerContext returns a finalizerContext object; the parameters will be added as new tasks are included.
func newFinalizerContext(restore *velerov1api.Restore, log logrus.FieldLogger, ...) *finalizerContext {
    return &finalizerContext{
        .......
        restore: restore,
        log:     log,
        .......
    }
}
```

The finalizer controller is responsible for collecting all dependencies and creating a `finalizerContext` object using those dependencies. It then invokes the `execute` function.

```
func (r *restoreFinalizerReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    .......

    // collect all dependencies required by wrap-up tasks
    .......

    // create a finalizerContext object and invoke execute()
    finalizerCtx := newFinalizerContext(restore, log, ...)
    warnings, errs := finalizerCtx.execute()

    .......
}
```

After completing all necessary tasks, the result metadata in object storage will be updated if any errors or warnings occur during the execution. This behavior breaks the feature of keeping metadata files in object storage immutable; however, we believe the tradeoff is justified because it gives users access to the error/warning details when the wrap-up tasks go wrong.

```
// UpdateResults updates the result metadata in object storage if necessary
func (r *restoreFinalizerReconciler) UpdateResults(restore *api.Restore, newWarnings *results.Result, newErrs *results.Result, backupStore persistence.BackupStore) error {
    originResults, err := backupStore.GetRestoreResults(restore.Name)
    if err != nil {
        return errors.Wrap(err, "error getting restore results")
    }
    warnings := originResults["warnings"]
    errs := originResults["errors"]
    warnings.Merge(newWarnings)
    errs.Merge(newErrs)

    m := map[string]results.Result{
        "warnings": warnings,
        "errors":   errs,
    }
    if err := putResults(restore, m, backupStore); err != nil {
        return errors.Wrap(err, "error putting restore results")
    }

    return nil
}
```

## Compatibility

The new finalization phases are added without modifying the existing phases in the restore workflow. Both new and ongoing restore processes will continue to eventually transition to a terminal phase from any prior phase, ensuring backward compatibility.

## Implementation

This will be implemented during the Velero 1.14 development cycle.

BIN design/Implemented/restore-phases-transition.png Normal file (binary file not shown; 65 KiB)