os-pradipbabar d1b1338558 Fix stale cache fallback for empty volume locations in wdclient (#10081)
fix(wdclient): prevent stale cache fallback for empty volume locations

## Problem
During Kubernetes pod restarts, volume servers temporarily disconnect and their
locations are removed from vidMap. The deleteLocation function leaves an empty
array [] in the vid2Locations map instead of removing the key entirely.

GetLocations() checked 'if found && len(locations) > 0'. For an empty array
that condition is false, so the lookup fell through to the cache chain and
returned STALE locations from before the restart. The S3 gateway then tried to
connect to old pod IPs that no longer exist, resulting in connection timeouts
and hanging registry sync jobs.

Example timeline:
1. Volume pod at 10.131.1.28:8081 registers volumes 10,12
2. S3 gateway caches: vid2Locations[10] = [10.131.1.28:8081]
3. Pod restarts, gets new IP 10.131.1.65:8081
4. Master sends delete → vid2Locations[10] = [] (empty, but key exists)
5. BUG: GetLocations(10) sees found=true, len=0 → falls back to cache
6. Returns stale 10.131.1.28:8081 instead of waiting for new location
7. S3 requests time out trying to reach the unreachable old IP
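
The timeline above can be reproduced with a small model. This is a sketch
only: the vidMap and Location types below are trimmed-down stand-ins, the
single-level cache pointer is an assumption about how the cache chain is
wired, and getLocationsOld and reproduceStaleFallback are hypothetical names
used for illustration, not the real wdclient code.

```go
package main

import "fmt"

// Location is a simplified stand-in for the wdclient location type.
type Location struct{ Url string }

// vidMap is a hypothetical, reduced model: the current map plus an
// older map that acts as the cache.
type vidMap struct {
	vid2Locations map[uint32][]Location
	cache         *vidMap
}

// deleteLocation removes one location but never deletes the key, so
// removing the last location leaves an empty slice behind.
func (vc *vidMap) deleteLocation(vid uint32, loc Location) {
	locs, found := vc.vid2Locations[vid]
	if !found {
		return
	}
	for i, l := range locs {
		if l.Url == loc.Url {
			vc.vid2Locations[vid] = append(locs[:i], locs[i+1:]...)
			return
		}
	}
}

// getLocationsOld models the pre-fix check: an empty entry is treated
// the same as a missing one and falls through to the cache.
func (vc *vidMap) getLocationsOld(vid uint32) ([]Location, bool) {
	locs, found := vc.vid2Locations[vid]
	if found && len(locs) > 0 {
		return locs, true
	}
	if vc.cache != nil {
		return vc.cache.getLocationsOld(vid)
	}
	return nil, false
}

// reproduceStaleFallback walks steps 1-6 of the timeline for volume 10.
func reproduceStaleFallback() (string, bool) {
	old := Location{Url: "10.131.1.28:8081"}
	cache := &vidMap{vid2Locations: map[uint32][]Location{10: {old}}}
	current := &vidMap{
		vid2Locations: map[uint32][]Location{10: {old}},
		cache:         cache,
	}
	current.deleteLocation(10, old) // key 10 stays, value is now empty
	locs, ok := current.getLocationsOld(10)
	if !ok {
		return "", false
	}
	return locs[0].Url, true
}

func main() {
	url, ok := reproduceStaleFallback()
	fmt.Println(url, ok) // the old pod address comes back from the cache
}
```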

## Solution
Distinguish between two cases:
- found=true, locations=[] : Volume explicitly has no locations (e.g. restart)
  → Return nil, false (no fallback to cache)
- found=false : Volume never seen in current map
  → Check cache (preserve cache benefits for unknown volumes)

An empty array explicitly means 'this volume currently has no locations',
which is semantically different from 'volume unknown'. Don't fall back to
stale cache for explicitly empty volumes.
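
A minimal sketch of the lookup described above, under the same assumptions as
the commit text: the types are simplified stand-ins, the cache is modeled as a
single pointer to an older map, and getLocations is a hypothetical lower-case
name rather than the actual patched function.

```go
package main

import "fmt"

// Location is a simplified stand-in for the wdclient location type.
type Location struct{ Url string }

// vidMap is a hypothetical, reduced model: the current map plus an
// older map that acts as the cache.
type vidMap struct {
	vid2Locations map[uint32][]Location
	cache         *vidMap
}

// getLocations separates "key present but empty" from "key absent".
func (vc *vidMap) getLocations(vid uint32) ([]Location, bool) {
	locs, found := vc.vid2Locations[vid]
	if found {
		if len(locs) == 0 {
			// Explicitly empty: the volume currently has no
			// locations, so do not consult the cache.
			return nil, false
		}
		return locs, true
	}
	// Never seen in the current map: the cache may still help.
	if vc.cache != nil {
		return vc.cache.getLocations(vid)
	}
	return nil, false
}

func main() {
	old := Location{Url: "10.131.1.28:8081"}
	cache := &vidMap{vid2Locations: map[uint32][]Location{
		10: {old},
		12: {old},
	}}
	current := &vidMap{
		// Volume 10 was deleted after the restart; volume 12 has not
		// been seen in the current map at all.
		vid2Locations: map[uint32][]Location{10: {}},
		cache:         cache,
	}

	_, ok10 := current.getLocations(10)
	locs12, ok12 := current.getLocations(12)
	fmt.Println("vid 10 found:", ok10)
	fmt.Println("vid 12 found:", ok12, "locations:", len(locs12))
}
```

In this model a caller that gets found=false for volume 10 has to retry or
wait for the master to publish the new location, which is the behavior the
fix is after.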

## Testing
Added tests:
- TestGetLocationsEmptyArrayNoFallback: Verifies empty arrays don't use cache
- TestGetLocationsUnknownVolumeUsesCache: Verifies unknown volumes still use cache
- All existing tests pass

## Impact
Fixes registry sync job hangs during SeaweedFS upgrades/restarts. The S3
gateway now waits for updated volume locations instead of using stale cached
IPs.

Related: OutSystems.SeaWeedfs Helm chart, vega cluster incident 2026-06-24
2026-06-24 16:31:32 -07:00