From 7e4691f2dcf0d99c058a328e6dcc57e0f52da8a0 Mon Sep 17 00:00:00 2001
From: Chris Lu <chrislusf@users.noreply.github.com>
Date: Thu, 21 May 2026 00:17:14 -0700
Subject: [PATCH] test(ec): make multi-disk EC balance disk-spread assertion
 deterministic (#9595)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

test(ec): pre-populate disks so multi-disk EC balance spread is deterministic

The multidisk shard-loss regression asserts that EC shards spread across
more than one disk per node, but that only holds for disks the balancer
can see. The master enumerates a physical disk only when it already holds
a volume or EC shard. An empty disk leaves no trace, since heartbeats
aggregate capacity per disk type, not per physical disk. Whether the
post-encode balance spread shards therefore depended on how the master
happened to place the filler volumes across disks, which varies by
environment: the test passed locally (shards on 5 disks) but left shards
on one disk per node in CI and failed the "got 3 disks across 3 nodes"
assertion.

Grow a few volumes on each server before encoding so every physical disk
holds a volume and is visible to the balancer. The volume server places
each new volume on its least-loaded disk, so a handful of grows touches
every disk, making the spread deterministic. The assertion still has teeth:
it counts disks that actually hold shard files, so a balancer that failed
to spread would still collapse to one disk per node and fail the test.
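
A rough, standalone sketch of the reasoning above, not the volume server's
actual code. It assumes three disks per node, with all filler volumes
initially on disk 0. placeOnLeastLoaded and visibleDisks are hypothetical
names used only for illustration: the first models "place each new volume
on the least-loaded disk", the second models "a disk is visible only if it
holds at least one volume".

```go
package main

import "fmt"

// placeOnLeastLoaded is a hypothetical model of the placement rule described
// above: each grown volume lands on the disk currently holding the fewest
// volumes (ties go to the lowest disk index). It returns a new slice.
func placeOnLeastLoaded(volumesPerDisk []int, grows int) []int {
	out := append([]int(nil), volumesPerDisk...)
	for g := 0; g < grows; g++ {
		least := 0
		for d := range out {
			if out[d] < out[least] {
				least = d
			}
		}
		out[least]++
	}
	return out
}

// visibleDisks is a hypothetical model of what the balancer can see: only
// disks that already hold at least one volume are counted.
func visibleDisks(volumesPerDisk []int) int {
	n := 0
	for _, v := range volumesPerDisk {
		if v > 0 {
			n++
		}
	}
	return n
}

func main() {
	// Assumed starting layout: every filler volume landed on disk 0.
	before := []int{3, 0, 0}
	after := placeOnLeastLoaded(before, 4) // mirrors "-count 4" in the test

	fmt.Println(visibleDisks(before), visibleDisks(after)) // prints: 1 3
}
```

Under these assumptions, four grows are enough to put a volume on each of
the three disks, so every disk becomes a valid balance target regardless of
where the fillers landed.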
---
 .../multidisk_shardloss_test.go               | 20 +++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/test/erasure_coding/multidisk_shardloss_test.go b/test/erasure_coding/multidisk_shardloss_test.go
index 5fceb8965..6dc8186cc 100644
--- a/test/erasure_coding/multidisk_shardloss_test.go
+++ b/test/erasure_coding/multidisk_shardloss_test.go
@@ -72,6 +72,26 @@ func TestMultiDiskECBalanceNoShardLoss(t *testing.T) {
 	t.Logf("using volume %d", volumeId)
 	time.Sleep(3 * time.Second)
 
+	// Populate every server's disks with volumes so the balancer can see and
+	// target each physical disk. The master only enumerates disks that already
+	// hold a volume or EC shard — an empty disk leaves no trace in the topology
+	// (heartbeats aggregate capacity per disk type, not per physical disk). So
+	// without pre-populating, the post-encode balance would collapse each node's
+	// shards onto the single disk that happened to hold data, and whether the
+	// fillers spread across disks is environment-dependent (master volume-growth
+	// timing). Growing a few volumes per server makes the multi-disk layout
+	// deterministic: the volume server places each new volume on its least-loaded
+	// disk, so a handful of grows touches every disk.
+	for i := 0; i < 3; i++ {
+		server := fmt.Sprintf("127.0.0.1:809%d", i)
+		out, growErr := captureCommandOutput(t, shell.Commands[findCommandIndex("volume.grow")],
+			[]string{"-collection", "test", "-dataNode", server, "-count", "4"}, commandEnv)
+		require.NoError(t, growErr, "volume.grow on %s failed: %s", server, out)
+	}
+	// Let the freshly grown volumes reach the master via heartbeat before encoding
+	// so collectEcNodes sees every disk.
+	time.Sleep(5 * time.Second)
+
 	locked, unlock := tryLockWithTimeout(t, commandEnv, 15*time.Second)
 	require.True(t, locked, "could not acquire shell lock")
 	defer unlock()