seaweedfs/weed/shell/command_volume_move.go
Chris Lu 0fed72d95a volume.tier.move: fulfill target replication before deleting old replicas (#8950)
* volume.tier.move: fulfill target replication before deleting old replicas

When -toReplication is specified, volume.tier.move now creates all
required replicas on the destination tier before deleting old replicas.
This closes the data-loss window where only one copy existed on the
target tier while awaiting volume.fix.replication.

If replication fulfillment fails, old replicas are preserved and marked
writable so the volume remains accessible.

Also extracts replicateVolumeToServer and configureVolumeReplication
helpers to reduce duplication across volume.tier.move and
volume.fix.replication.

Fixes #8937

* volume.tier.move: always fulfill replication before deleting old replicas

When -toReplication is specified, use that replication setting.
Otherwise, read the volume's existing replication from the super block.
In both cases, all required replicas are created on the destination
tier before old replicas are deleted.
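
Roughly, the selection reads like this (a minimal sketch; existingReplicationFromSuperBlock is a hypothetical stand-in for the super-block read described above):

	replication := *toReplication
	if replication == "" {
		// fall back to the volume's existing replication, read from its super block
		replication = existingReplicationFromSuperBlock
	}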

If replication fulfillment fails (e.g. not enough destination nodes),
old replicas are preserved and marked writable so no data is lost.

* volume.tier.move: address review feedback on ensureReplicationFulfilled

- Add 5s delay before re-collecting topology to allow master heartbeat
  propagation after the move
- Add nil guard for targetTierReplicas to prevent panic if the moved
  replica is not yet visible in the topology (see the sketch after this list)
- Treat configureVolumeReplication failure as a hard error instead of a
  warning, so the rollback logic preserves old replicas
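
A minimal sketch of the delay and nil guard (collectTopologyInfo and
findTargetTierReplicas are assumed helper names here, not taken from this file):

	time.Sleep(5 * time.Second) // allow master heartbeat propagation after the move
	topologyInfo, _, err := collectTopologyInfo(commandEnv, 0)
	if err != nil {
		return err
	}
	targetTierReplicas := findTargetTierReplicas(topologyInfo, volumeId, toDiskType)
	if targetTierReplicas == nil {
		return fmt.Errorf("volume %d is not yet visible on the target tier", volumeId)
	}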

* volume.tier.move: harden replication config error handling

- Make configureVolumeReplication failure on the primary moved replica a
  hard error that aborts the move, instead of logging and continuing
- Configure replication metadata on all existing target-tier replicas
  (not just newly created ones) when -toReplication is specified, as
  sketched after this list
- Deletion of old replicas cannot affect new replicas since the
  locations list only contains pre-move servers (verified, no change)
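
An illustrative loop for that configuration pass (targetTierReplicaServers is
a hypothetical name; configureVolumeReplication is defined in this file):

	for _, server := range targetTierReplicaServers {
		if err := configureVolumeReplication(grpcDialOption, volumeId, server, toReplication); err != nil {
			return err // hard error: abort the move so old replicas are preserved
		}
	}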

* volume.tier.move: fix cleanup deleting fulfilled replicas and broken recovery

Fix 1: The cleanup loop now preserves pre-existing target-tier replicas
that ensureReplicationFulfilled counted toward the replication target.
Previously, a mixed-tier volume with an existing replica on the target
tier could have that replica deleted right after being counted as
fulfilled, leaving the volume under-replicated.

ensureReplicationFulfilled now returns a preserveServers set that the
deletion loop checks before removing any old replica.
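
For illustration, the guarded deletion loop looks roughly like this (only
preserveServers comes from the description above; the other names are hypothetical):

	for _, loc := range oldLocations {
		if preserveServers[loc.ServerAddress()] {
			continue // counted toward the target replication; keep it
		}
		if err := deleteVolume(grpcDialOption, volumeId, loc.ServerAddress(), false); err != nil {
			return err
		}
	}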

Fix 2: Failure paths after LiveMoveVolume (which deletes the source
replica) now use restoreSurvivingReplicasWritable instead of
markVolumeReplicasWritable. The old helper stopped on first error, so
attempting to mark the already-deleted source writable would prevent
all surviving replicas from being restored. The new helper skips the
deleted source and continues through all remaining locations, logging
per-replica errors instead of aborting.

* volume.tier.move: mark preserved replicas writable, skip nodes with existing volume

Fix 1: Preserved pre-existing target-tier replicas were left read-only
after the move completed. They were marked read-only at the start
(along with all other replicas) but never restored since the old code
deleted them. Now they are explicitly marked writable before cleanup.

Fix 2: The fulfillment loop could pick a candidate node that already
hosts this volume on a different disk type, causing a VolumeCopy
conflict. Added a guard that skips any node already hosting the volume
(on any disk) before attempting replication.
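
A minimal sketch of that guard, assuming the master_pb topology types
(hasVolumeOnAnyDisk is a hypothetical helper name):

	func hasVolumeOnAnyDisk(dn *master_pb.DataNodeInfo, volumeId needle.VolumeId) bool {
		for _, diskInfo := range dn.DiskInfos {
			for _, v := range diskInfo.VolumeInfos {
				if needle.VolumeId(v.Id) == volumeId {
					return true // already hosts this volume on some disk type
				}
			}
		}
		return false
	}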

package shell

import (
	"context"
	"errors"
	"flag"
	"fmt"
	"io"
	"log"
	"time"

	"github.com/seaweedfs/seaweedfs/weed/operation"
	"github.com/seaweedfs/seaweedfs/weed/pb"
	"github.com/seaweedfs/seaweedfs/weed/pb/volume_server_pb"
	"github.com/seaweedfs/seaweedfs/weed/storage/needle"
	"github.com/seaweedfs/seaweedfs/weed/util"
	"github.com/seaweedfs/seaweedfs/weed/wdclient"
	"google.golang.org/grpc"
)
func init() {
	Commands = append(Commands, &commandVolumeMove{})
}

type commandVolumeMove struct {
}

func (c *commandVolumeMove) Name() string {
	return "volume.move"
}

func (c *commandVolumeMove) Help() string {
	return `move a live volume from one volume server to another volume server

	volume.move -source <source volume server host:port> -target <target volume server host:port> -volumeId <volume id>
	volume.move -source <source volume server host:port> -target <target volume server host:port> -volumeId <volume id> -disk [hdd|ssd|<tag>]

	This command moves a live volume from one volume server to another. Here are the steps:

	1. This command marks the source volume as read-only, copies it to the target volume server, and records the last entry timestamp.
	2. This command asks the target volume server to mount the new volume.
	3. This command asks the target volume server to tail the source volume for updates after the timestamp, for 1 minute, to drain any in-flight requests.
	4. This command asks the source volume server to delete the source volume.

	The option "-disk [hdd|ssd|<tag>]" can be used to change the volume disk type.
`
}

func (c *commandVolumeMove) HasTag(CommandTag) bool {
	return false
}
func (c *commandVolumeMove) Do(args []string, commandEnv *CommandEnv, writer io.Writer) (err error) {

	volMoveCommand := flag.NewFlagSet(c.Name(), flag.ContinueOnError)
	volumeIdInt := volMoveCommand.Int("volumeId", 0, "the volume id")
	sourceNodeStr := volMoveCommand.String("source", "", "the source volume server <host>:<port>")
	targetNodeStr := volMoveCommand.String("target", "", "the target volume server <host>:<port>")
	diskTypeStr := volMoveCommand.String("disk", "", "[hdd|ssd|<tag>] hard drive or solid state drive or any tag")
	ioBytePerSecond := volMoveCommand.Int64("ioBytePerSecond", 0, "limit the speed of move")
	noLock := volMoveCommand.Bool("noLock", false, "do not lock the admin shell at one's own risk")
	if err = volMoveCommand.Parse(args); err != nil {
		return nil
	}

	if *noLock {
		commandEnv.noLock = true
	} else {
		if err = commandEnv.confirmIsLocked(args); err != nil {
			return
		}
	}

	sourceVolumeServer, targetVolumeServer := pb.ServerAddress(*sourceNodeStr), pb.ServerAddress(*targetNodeStr)
	volumeId := needle.VolumeId(*volumeIdInt)

	if sourceVolumeServer == targetVolumeServer {
		return fmt.Errorf("source and target volume servers are the same")
	}

	return LiveMoveVolume(commandEnv.option.GrpcDialOption, writer, volumeId, sourceVolumeServer, targetVolumeServer, 5*time.Second, *diskTypeStr, *ioBytePerSecond, false)
}
// LiveMoveVolume moves one volume from one source volume server to one target volume server, with idleTimeout to drain the incoming requests.
func LiveMoveVolume(grpcDialOption grpc.DialOption, writer io.Writer, volumeId needle.VolumeId, sourceVolumeServer, targetVolumeServer pb.ServerAddress, idleTimeout time.Duration, diskType string, ioBytePerSecond int64, skipTailError bool) (err error) {

	log.Printf("copying volume %d from %s to %s", volumeId, sourceVolumeServer, targetVolumeServer)
	lastAppendAtNs, err := copyVolume(grpcDialOption, writer, volumeId, sourceVolumeServer, targetVolumeServer, diskType, ioBytePerSecond, false)
	if err != nil {
		return fmt.Errorf("copy volume %d from %s to %s: %v", volumeId, sourceVolumeServer, targetVolumeServer, err)
	}

	log.Printf("tailing volume %d from %s to %s", volumeId, sourceVolumeServer, targetVolumeServer)
	if err = tailVolume(grpcDialOption, volumeId, sourceVolumeServer, targetVolumeServer, lastAppendAtNs, idleTimeout); err != nil {
		if skipTailError {
			fmt.Fprintf(writer, "tail volume %d from %s to %s: %v\n", volumeId, sourceVolumeServer, targetVolumeServer, err)
		} else {
			return fmt.Errorf("tail volume %d from %s to %s: %v", volumeId, sourceVolumeServer, targetVolumeServer, err)
		}
	}

	log.Printf("deleting volume %d from %s", volumeId, sourceVolumeServer)
	if err = deleteVolume(grpcDialOption, volumeId, sourceVolumeServer, false); err != nil {
		return fmt.Errorf("delete volume %d from %s: %v", volumeId, sourceVolumeServer, err)
	}

	log.Printf("moved volume %d from %s to %s", volumeId, sourceVolumeServer, targetVolumeServer)
	return nil
}
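
// Example (illustrative, with hypothetical addresses): move volume 37 onto an
// SSD disk on another server, draining in-flight writes for 5 seconds before
// the source copy is deleted:
//
//	err := LiveMoveVolume(grpcDialOption, os.Stdout, needle.VolumeId(37),
//		pb.ServerAddress("10.0.0.1:8080"), pb.ServerAddress("10.0.0.2:8080"),
//		5*time.Second, "ssd", 0, false)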
func copyVolume(grpcDialOption grpc.DialOption, writer io.Writer, volumeId needle.VolumeId, sourceVolumeServer, targetVolumeServer pb.ServerAddress, diskType string, ioBytePerSecond int64, restoreWritable bool) (lastAppendAtNs uint64, err error) {

	// Check whether the volume is already read-only. If it is not, mark it
	// read-only for the copy, and undo that before returning (on error, or
	// always when restoreWritable is set).
	var shouldMarkWritable bool
	defer func() {
		if !shouldMarkWritable {
			return
		}
		if !restoreWritable && err == nil {
			return
		}
		clientErr := operation.WithVolumeServerClient(false, sourceVolumeServer, grpcDialOption, func(volumeServerClient volume_server_pb.VolumeServerClient) error {
			_, writableErr := volumeServerClient.VolumeMarkWritable(context.Background(), &volume_server_pb.VolumeMarkWritableRequest{
				VolumeId: uint32(volumeId),
			})
			return writableErr
		})
		if clientErr != nil {
			log.Printf("failed to mark volume %d as writable after copy from %s: %v", volumeId, sourceVolumeServer, clientErr)
		}
	}()

	err = operation.WithVolumeServerClient(false, sourceVolumeServer, grpcDialOption, func(volumeServerClient volume_server_pb.VolumeServerClient) error {
		resp, statusErr := volumeServerClient.VolumeStatus(context.Background(), &volume_server_pb.VolumeStatusRequest{
			VolumeId: uint32(volumeId),
		})
		if statusErr == nil && !resp.IsReadOnly {
			shouldMarkWritable = true
			_, readonlyErr := volumeServerClient.VolumeMarkReadonly(context.Background(), &volume_server_pb.VolumeMarkReadonlyRequest{
				VolumeId: uint32(volumeId),
				Persist:  false,
			})
			return readonlyErr
		}
		return statusErr
	})
	if err != nil {
		return
	}

	err = operation.WithVolumeServerClient(true, targetVolumeServer, grpcDialOption, func(volumeServerClient volume_server_pb.VolumeServerClient) error {
		stream, replicateErr := volumeServerClient.VolumeCopy(context.Background(), &volume_server_pb.VolumeCopyRequest{
			VolumeId:        uint32(volumeId),
			SourceDataNode:  string(sourceVolumeServer),
			DiskType:        diskType,
			IoBytePerSecond: ioBytePerSecond,
		})
		if replicateErr != nil {
			return replicateErr
		}
		for {
			resp, recvErr := stream.Recv()
			if recvErr != nil {
				if recvErr == io.EOF {
					break
				} else {
					return recvErr
				}
			}
			if resp.LastAppendAtNs != 0 {
				lastAppendAtNs = resp.LastAppendAtNs
			} else {
				fmt.Fprintf(writer, "%s => %s volume %d processed %s\n", sourceVolumeServer, targetVolumeServer, volumeId, util.BytesToHumanReadable(uint64(resp.ProcessedBytes)))
			}
		}
		return nil
	})

	return
}
func tailVolume(grpcDialOption grpc.DialOption, volumeId needle.VolumeId, sourceVolumeServer, targetVolumeServer pb.ServerAddress, lastAppendAtNs uint64, idleTimeout time.Duration) (err error) {

	return operation.WithVolumeServerClient(true, targetVolumeServer, grpcDialOption, func(volumeServerClient volume_server_pb.VolumeServerClient) error {
		_, replicateErr := volumeServerClient.VolumeTailReceiver(context.Background(), &volume_server_pb.VolumeTailReceiverRequest{
			VolumeId:           uint32(volumeId),
			SinceNs:            lastAppendAtNs,
			IdleTimeoutSeconds: uint32(idleTimeout.Seconds()),
			SourceVolumeServer: string(sourceVolumeServer),
		})
		return replicateErr
	})
}

func deleteVolume(grpcDialOption grpc.DialOption, volumeId needle.VolumeId, sourceVolumeServer pb.ServerAddress, onlyEmpty bool) (err error) {
	return operation.WithVolumeServerClient(false, sourceVolumeServer, grpcDialOption, func(volumeServerClient volume_server_pb.VolumeServerClient) error {
		_, deleteErr := volumeServerClient.VolumeDelete(context.Background(), &volume_server_pb.VolumeDeleteRequest{
			VolumeId:  uint32(volumeId),
			OnlyEmpty: onlyEmpty,
		})
		return deleteErr
	})
}
func markVolumeWritable(grpcDialOption grpc.DialOption, volumeId needle.VolumeId, sourceVolumeServer pb.ServerAddress, writable, persist bool) (err error) {
	return operation.WithVolumeServerClient(false, sourceVolumeServer, grpcDialOption, func(volumeServerClient volume_server_pb.VolumeServerClient) error {
		if writable {
			_, err = volumeServerClient.VolumeMarkWritable(context.Background(), &volume_server_pb.VolumeMarkWritableRequest{
				VolumeId: uint32(volumeId),
			})
		} else {
			_, err = volumeServerClient.VolumeMarkReadonly(context.Background(), &volume_server_pb.VolumeMarkReadonlyRequest{
				VolumeId: uint32(volumeId),
				Persist:  persist,
			})
		}
		return err
	})
}

func markVolumeReplicaWritable(grpcDialOption grpc.DialOption, volumeId needle.VolumeId, location wdclient.Location, writable, persist bool) error {
	// log which direction this replica is being flipped
	if writable {
		fmt.Printf("markVolumeWritable %d on %s ...\n", volumeId, location.Url)
	} else {
		fmt.Printf("markVolumeReadonly %d on %s ...\n", volumeId, location.Url)
	}
	return markVolumeWritable(grpcDialOption, volumeId, location.ServerAddress(), writable, persist)
}
// markVolumeReplicasWritable flips all replicas at the given locations, stopping at the first error.
func markVolumeReplicasWritable(grpcDialOption grpc.DialOption, volumeId needle.VolumeId, locations []wdclient.Location, writable, persist bool) error {
	for _, location := range locations {
		if err := markVolumeReplicaWritable(grpcDialOption, volumeId, location, writable, persist); err != nil {
			return err
		}
	}
	return nil
}
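
// restoreSurvivingReplicasWritable is the recovery helper described in the
// commit message above; the body below is an illustrative sketch, not the
// shipped implementation. Unlike markVolumeReplicasWritable, it skips the
// already-deleted source replica and logs per-replica errors instead of
// stopping at the first one, so every surviving replica gets restored.
func restoreSurvivingReplicasWritable(grpcDialOption grpc.DialOption, volumeId needle.VolumeId, locations []wdclient.Location, deletedSource pb.ServerAddress) {
	for _, location := range locations {
		if location.ServerAddress() == deletedSource {
			continue // the source replica was deleted by LiveMoveVolume; nothing to restore
		}
		if err := markVolumeReplicaWritable(grpcDialOption, volumeId, location, true, false); err != nil {
			log.Printf("failed to mark surviving replica of volume %d on %s writable: %v", volumeId, location.Url, err)
		}
	}
}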
// replicateVolumeToServer copies a volume from sourceAddress to targetAddress via the VolumeCopy gRPC stream.
func replicateVolumeToServer(grpcDialOption grpc.DialOption, writer io.Writer, volumeId needle.VolumeId, sourceAddress, targetAddress pb.ServerAddress, diskType string) error {
	return operation.WithVolumeServerClient(false, targetAddress, grpcDialOption, func(volumeServerClient volume_server_pb.VolumeServerClient) error {
		stream, replicateErr := volumeServerClient.VolumeCopy(context.Background(), &volume_server_pb.VolumeCopyRequest{
			VolumeId:       uint32(volumeId),
			SourceDataNode: string(sourceAddress),
			DiskType:       diskType,
		})
		if replicateErr != nil {
			return replicateErr
		}
		for {
			resp, recvErr := stream.Recv()
			if recvErr != nil {
				if recvErr == io.EOF {
					break
				}
				return recvErr
			}
			if resp.ProcessedBytes > 0 {
				fmt.Fprintf(writer, "volume %d processed %s\n", volumeId, util.BytesToHumanReadable(uint64(resp.ProcessedBytes)))
			}
		}
		return nil
	})
}
// configureVolumeReplication sets the replication setting on a volume at the given server.
func configureVolumeReplication(grpcDialOption grpc.DialOption, volumeId needle.VolumeId, targetAddress pb.ServerAddress, replicationString string) error {
	return operation.WithVolumeServerClient(false, targetAddress, grpcDialOption, func(volumeServerClient volume_server_pb.VolumeServerClient) error {
		resp, configureErr := volumeServerClient.VolumeConfigure(context.Background(), &volume_server_pb.VolumeConfigureRequest{
			VolumeId:    uint32(volumeId),
			Replication: replicationString,
		})
		if configureErr != nil {
			return configureErr
		}
		if resp.Error != "" {
			return errors.New(resp.Error)
		}
		return nil
	})
}
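
// fulfillOneReplica is an illustrative sketch (the name is hypothetical) of
// how the tier-move path composes the two helpers above, per the commit
// message: copy the replica to a candidate server first, then stamp the
// target replication setting on it, treating configuration failure as a hard
// error so the caller's rollback can preserve the old replicas.
func fulfillOneReplica(grpcDialOption grpc.DialOption, writer io.Writer, volumeId needle.VolumeId, movedReplica, candidate pb.ServerAddress, diskType, toReplication string) error {
	if err := replicateVolumeToServer(grpcDialOption, writer, volumeId, movedReplica, candidate, diskType); err != nil {
		return fmt.Errorf("replicate volume %d to %s: %v", volumeId, candidate, err)
	}
	if err := configureVolumeReplication(grpcDialOption, volumeId, candidate, toReplication); err != nil {
		return fmt.Errorf("configure replication %s on %s: %v", toReplication, candidate, err)
	}
	return nil
}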