Revert "master: bind heartbeat claims to the connecting peer (#9443)"
This reverts commit f28c7ce6df.
The strict heartbeat-ip-vs-peer match in authorizeHeartbeatPeer rejects
every hostname-based deployment. In docker-compose / k8s the volume
server is started with -ip=<service-name> and the gRPC peer surfaces
as the container/pod IP, so the two never match and every heartbeat
fails with `heartbeat ip "volume" does not match peer "172.18.0.3"`.
The master therefore never learns about any volume, growth fails, and
fio writes against the mount return EIO.
After the #9440 revert merged (43a8c4fdc), the e2e workflow is still
failing for this reason; see
https://github.com/seaweedfs/seaweedfs/actions/runs/25767265775 .
Reverting to unblock e2e. A narrower re-do should accept the heartbeat
when heartbeat.Ip resolves (DNS) to the peer address, so the spoof
hardening can return without breaking hostname-based clusters.
SendHeartbeat used to accept whatever Ip/Port/Volumes the caller put on
the wire. Three changes tighten that:
- Reject heartbeats whose Ip does not match the gRPC peer's source
address. Loopback peers are still trusted; operators behind a proxy
can opt out with -master.allowUntrustedHeartbeat.
- Track which (ip, port) first claimed a volume id or an ec shard slot
and drop foreign re-claims. Non-EC volume claims are bounded by the
replica copy count so legitimate replicas still register. EC
ownership is keyed by (vid, shard_id) so the same vid can legitimately
be split across many peers as long as their EcIndexBits are disjoint;
rejected bits are cleared from the bitmap and the parallel ShardSizes
array is compacted in lock-step.
- Maintain reverse indexes owner -> volumes and owner -> ec shard slots
so disconnect cleanup is O(M) in what that peer held rather than O(N)
over the whole map.
Bindings are also released when a heartbeat reports that the peer no
longer holds an id, either via explicit Deleted{Volumes,EcShards}
entries or by omitting it from a full snapshot. Without this, a planned
rebalance that moved a vid or an ec shard from peer A to peer B would
leave B's heartbeats permanently filtered out until A disconnected,
breaking ec encode/decode flows that delete shards on the source as
soon as the move completes.
The (vid -> owners) binding still does not track which replica slot
each peer occupies, so the first N claims under the copy count win;
strict per-slot mapping is a follow-up.
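The ownership-binding idea described above can be sketched as follows — claims bounded by the replica copy count, plus a reverse index owner -> vids so disconnect cleanup is O(M) in what the peer held. This is a minimal illustration of the (reverted) mechanism, with made-up names; the real code also handles EC shards keyed by (vid, shard_id):

```go
package main

// owner identifies the (ip, port) that first claimed an id.
type owner struct {
	ip   string
	port int
}

type bindings struct {
	copyCount int
	byVid     map[uint32]map[owner]bool // vid -> owners that claimed it
	byOwner   map[owner]map[uint32]bool // reverse index for O(M) cleanup
}

func newBindings(copyCount int) *bindings {
	return &bindings{
		copyCount: copyCount,
		byVid:     map[uint32]map[owner]bool{},
		byOwner:   map[owner]map[uint32]bool{},
	}
}

// claim accepts a heartbeat claim if the owner already holds the vid or
// the replica copy count is not yet exhausted; foreign re-claims beyond
// that are dropped.
func (b *bindings) claim(o owner, vid uint32) bool {
	owners := b.byVid[vid]
	if owners[o] {
		return true // re-claim by an existing owner
	}
	if len(owners) >= b.copyCount {
		return false // foreign re-claim beyond the replica count
	}
	if owners == nil {
		owners = map[owner]bool{}
		b.byVid[vid] = owners
	}
	owners[o] = true
	if b.byOwner[o] == nil {
		b.byOwner[o] = map[uint32]bool{}
	}
	b.byOwner[o][vid] = true
	return true
}

// releaseOwner drops everything a disconnecting peer held, walking only
// its own reverse-index entry rather than the whole vid map.
func (b *bindings) releaseOwner(o owner) {
	for vid := range b.byOwner[o] {
		delete(b.byVid[vid], o)
	}
	delete(b.byOwner, o)
}
```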
* refactor(command): expand "~" in all path-style CLI flags
Many of weed's path-bearing flags (-s3.config, -s3.iam.config,
-admin.dataDir, -webdav.cacheDir, -volume.dir.idx, TLS cert/key
files, profile output paths, mount cache dirs, sftp key files, ...)
were never run through util.ResolvePath, so a value like "~/iam.json"
was used literally. Tilde expansion only worked when the shell
performed it, which silently fails for the common -flag=~/path form
(bash leaves the tilde literal after the "=").
- Extend util.ResolvePath to also handle "~user" / "~user/rest",
matching shell tilde expansion. Add unit tests.
- Apply util.ResolvePath at the top of each shared start* function
(s3, webdav, sftp) so mini/server/filer/standalone callers all
inherit it; resolve at the few one-off use sites (mount cache
dirs, volume idx folder, mini admin.dataDir, profile paths).
- Drop the duplicate expandHomeDir helper from admin.go in favor of
the now-equivalent util.ResolvePath.
* fixup: handle comma-separated -dir flags for tilde expansion
`weed mini -dir`, `weed server -dir`, and `weed volume -dir` accept
comma-separated paths (`dir[,dir]...`). Calling util.ResolvePath on
the whole string mishandled multi-folder values with tilde, e.g.
"~/d1,~/d2" would resolve as if "d1,~/d2" were a single subpath.
- Add util.ResolveCommaSeparatedPaths: split on ",", run each entry
through ResolvePath, rejoin. Short-circuits when no "~" present.
- Use it for *miniDataFolders (mini.go), *volumeDataFolders (server.go),
and resolve each entry of v.folders in-place (volume.go) so all
downstream consumers see resolved paths.
- Add 7-case TestResolveCommaSeparatedPaths covering empty, single,
multiple, and mixed inputs.
* address PR review: metaFolder + Windows backslash
- master.go: resolve *m.metaFolder at the top of runMaster so
util.FullPath(*m.metaFolder) on the next line sees an expanded
path. Drop the now-redundant ResolvePath in TestFolderWritable.
- server.go: same treatment for *masterOptions.metaFolder, paired
with the existing cpu/mem profile resolves. Drop the redundant
inner ResolvePath at TestFolderWritable.
- file_util.go: ResolvePath now accepts filepath.Separator as a
separator after the tilde, so "~\\data" works on Windows. Other
platforms keep current behaviour (backslash stays literal because
it is a valid filename character in usernames and paths).
- file_util_test.go: add two cases using filepath.Separator that
exercise the new code path on Windows and remain a no-op on Unix.
* address PR review: resolve "~" in remaining command path flags
Comprehensive sweep of path-bearing flags across every weed
subcommand, applying util.ResolvePath in-place at the top of each
run* function so all downstream consumers see expanded paths.
- webdav.go: resolve *wo.cacheDir at the top of startWebDav so
mini/server/filer/standalone callers all inherit it.
- mount_std.go: cpu/mem profile paths.
- filer_sync.go: cpu/mem profile paths.
- mq_broker.go: cpu/mem profile paths.
- benchmark.go: cpuprofile output path.
- backup.go: -dir resolved once at runBackup; drop the duplicated
inline ResolvePath in NewVolume calls.
- compact.go: -dir resolved at runCompact; drop inline ResolvePath.
- export.go: -dir and -o resolved at runExport; drop inline
ResolvePath in LoadFromIdx and ScanVolumeFile.
- download.go: -dir resolved at runDownload; drop inline.
- update.go: -dir resolved at runUpdate so filepath.Join uses the
expanded path; drop inline ResolvePath in TestFolderWritable.
- scaffold.go: -output expanded before filepath.Join.
- worker.go: -workingDir expanded before being passed to runtime.
* address PR review: resolve option-struct paths at run* entry points
server.go:381 propagates s3Options.config to filerOptions.s3ConfigFile
*before* startS3Server runs, which meant the filer-side code saw the
unresolved tilde-prefixed pointer. Same pattern for webdavOptions and
sftpOptions (and equivalent in mini.go / filer.go).
The fix: hoist resolution from the shared start* functions up to the
run* entry points, where every shared pointer is set up before any
propagation happens.
- s3.go, webdav.go, sftp.go: extract a resolvePaths() method on each
Options struct that runs every path field through util.ResolvePath
in-place. Idempotent.
- runS3, runWebDav, runSftp: call the standalone struct's resolvePaths
before starting metrics / loading security config.
- runServer, runMini, runFiler: call resolvePaths on every embedded
options struct, plus resolve loose flags (serverIamConfig,
miniS3Config, miniIamConfig, miniMasterOptions.metaFolder, and
filer's defaultLevelDbDirectory) so they're expanded before any
pointer copy or use.
- Drop the now-redundant inline ResolvePath at filer's
defaultLevelDbDirectory composition.
* address PR review: re-resolve mini -dir post-config, cover misc paths
- mini.go: applyConfigFileOptions can overwrite -dir with a literal
~/data from mini.options. Re-resolve *miniDataFolders after the
config-file apply, alongside the other path resolves, so the mini
filer no longer ends up with a literal ~/data/filerldb2.
- benchmark.go: resolve *b.idListFile (-list).
- filer_sync.go: resolve *syncOptions.aSecurity / .bSecurity
(-a.security / -b.security) before LoadClientTLSFromFile.
- filer_cat.go: resolve *filerCat.output (-o) before os.OpenFile.
- admin.go: drop trailing blank line at EOF (git diff --check).
* address PR review: resolve -a.security/-b.security/-config before use
Three follow-up fixes:
- filer_sync.go: the -a.security / -b.security resolves were placed
*after* LoadClientTLSFromFile / LoadHTTPClientFromFile were called,
so weed filer.sync -a.security=~/a.toml still passed the literal
tilde path. Hoist the resolves above the security-loading block so
TLS clients see expanded paths.
- filer_sync_verify.go: same flag pair was never resolved at all in
the verify command; resolve at the top of runFilerSyncVerify.
- filer_meta_backup.go: -config (the backup_filer.toml path) was
passed directly to viper. Resolve at the top of runFilerMetaBackup.
- mini.go: master.dir defaulted to the entire comma-joined
miniDataFolders. With weed mini -dir=~/d1,~/d2 (or any multi-dir
setup), TestFolderWritable then stat'd the joined string instead
of a single directory. Default to the first entry via StringSplit
to mirror the disk-space calculation a few lines below, and drop
the now-redundant ResolvePath in TestFolderWritable.
* fix(weed/command) address unhandled errors
* fix(command): don't log graceful-shutdown sentinels; plug response-body leak
- s3: Serve on unix socket treated http.ErrServerClosed as fatal; now
excluded like the other Serve/ServeTLS paths in this file.
- mq_agent, mq_broker: filter grpc.ErrServerStopped so clean shutdown
doesn't log as an error.
- worker_runtime: the added decodeErr early-continue skipped
resp.Body.Close(); drop it since the existing check below already
surfaces the decode error.
- mount_std: the pre-mount Unmount commonly fails when nothing is
mounted; demote to V(1) Infof.
- fuse_std: tidy panic message to match sibling cases.
* fix(mq_broker): filter grpc.ErrServerStopped on localhost listener
The localhost listener goroutine logged any Serve error unconditionally,
which includes grpc.ErrServerStopped on graceful shutdown. Match the
main listener's check so clean stops don't surface as errors.
---------
Co-authored-by: Chris Lu <chris.lu@gmail.com>
* feat(security): hot-reload HTTPS certs for master/volume/filer/webdav/admin
S3 and filer already use a refreshing pemfile provider for their HTTPS
cert, so rotated certificates (e.g. from k8s cert-manager) are picked up
without a restart. Master, volume, webdav, and admin, however, passed
cert/key paths straight to ServeTLS/ListenAndServeTLS and loaded once at
startup — rotating those certs required a pod restart.
Add a small helper NewReloadingServerCertificate in weed/security that
wraps pemfile.Provider and returns a tls.Config.GetCertificate closure,
then wire it into the four remaining HTTPS entry points. httpdown now
also calls ServeTLS when TLSConfig carries a GetCertificate/Certificates
but CertFile/KeyFile are empty, so volume server can pre-populate
TLSConfig.
A unit test exercises the rotation path (write cert, rotate on disk,
assert the callback returns the new cert) with a short refresh window.
* refactor(security): route filer/s3 HTTPS through the shared cert reloader
Before: filer.go and s3.go each kept a *certprovider.Provider on the
options struct plus a duplicated GetCertificateWithUpdate method. Both
were loading pemfile themselves. Behaviorally they already reloaded, but
the logic was duplicated two ways and neither path was shared with the
newly-added master/volume/webdav/admin wiring.
After: both use security.NewReloadingServerCertificate like the other
servers. The per-struct certProvider field and GetCertificateWithUpdate
method are removed, along with the now-unused certprovider and pemfile
imports. Net: -32 lines, one code path for all HTTPS cert reloading.
No behavior change — the refresh window, cache, and handshake contract
are identical (the helper wraps the same pemfile.NewProvider).
* feat(security): hot-reload HTTPS client certs for mount/backup/upload/etc
The HTTP client in weed/util/http/client loaded the mTLS client cert
once at startup via tls.LoadX509KeyPair. That left every long-lived
HTTPS client process (weed mount, backup, filer.copy, filer→volume,
s3→filer/volume) unable to pick up a rotated client cert without a
restart — even though the same cert-manager setup was already rotating
the server side fine.
Swap the client cert loader for a tls.Config.GetClientCertificate
callback backed by the same refreshing pemfile provider. New TLS
handshakes pick up the rotated cert; in-flight pooled connections keep
their old cert and drop as normal transport churn happens.
To keep this reusable from both server and client TLS code without an
import cycle (weed/security already imports weed/util/http/client for
LoadHTTPClientFromFile), extract the pemfile wrapper into a new
weed/security/certreload subpackage. weed/security keeps its thin
NewReloadingServerCertificate wrapper. The existing unit test moves
with the implementation.
gRPC mTLS was already handled by security.LoadServerTLS /
LoadClientTLS; this PR does not change any gRPC paths. MQ broker, MQ
agent, Kafka gateway, and FUSE mount control plane are gRPC-only and
therefore already rotate.
CA bundles (ClientCAs / RootCAs / grpc.ca) are still loaded once — noted
as a known limitation in the wiki.
* fix(security): address PR review feedback on cert reloader
Bots (gemini-code-assist + coderabbit) flagged three real issues and a
couple of nits. Addressing them here:
1. KeyMaterial used context.Background(). The grpc pemfile provider's
KeyMaterial blocks until material arrives or the context deadline
expires; with Background() a slow disk could hang the TLS handshake
indefinitely. Switched both the server and client callbacks to use
hello.Context() / cri.Context() so a stuck read is bounded by the
handshake timeout.
2. Admin server loaded TLS inside the serve goroutine. If the cert was
bad, the goroutine returned but startAdminServer kept blocking on
<-ctx.Done() with no listener, making the process look healthy with
nothing bound. Moved TLS setup to run before the goroutine starts
and propagate errors via fmt.Errorf; also captures the provider and
defers Close().
3. HTTP client discarded the certprovider.Provider from
NewClientGetCertificate. That leaked the refresh goroutine, and
NewHttpClientWithTLS had a worse case where a CA-file failure after
provider creation orphaned the provider entirely. Added a
certProvider field and a Close() method on HTTPClient, and made
the constructors close the provider on subsequent error paths.
4. Server-side paths (master/volume/filer/s3/webdav/admin) now retain
the provider. filer and webdav run ServeTLS synchronously, so a
plain defer works. master/volume/s3 dispatch goroutines and return
while the server keeps running, so they hook Close() into
grace.OnInterrupt.
5. Test: certreload_test now tolerates transient read/parse errors
during file rotation (writeSelfSigned rewrites cert before key) and
reports the last error only if the deadline expires.
No user-visible behavior change for the happy path.
* test(tls): add end-to-end HTTPS cert rotation integration test
Boots a real `weed master` with HTTPS enabled, captures the leaf cert
served at TLS handshake time, atomically rewrites the cert/key files
on disk (the same rename-in-place pattern kubelet does when it swaps
a cert-manager Secret), and asserts that a subsequent TLS handshake
observes the rotated leaf — with no process restart, no SIGHUP, no
reloader sidecar. Verifies the full path: on-disk change → pemfile
refresh tick → provider.KeyMaterial → tls.Config.GetCertificate →
server TLS handshake.
Runtime is ~1s by exposing the reloader's refresh window as an env
var (WEED_TLS_CERT_REFRESH_INTERVAL) and setting it to 500ms for the
test. The same env var is user-facing — documented in the wiki — so
operators running short-lived certs (Vault, cert-manager with
duration: 24h, etc.) can tighten the rotation-pickup window without a
rebuild. Defaults to 5h to preserve prior behavior.
security.CredRefreshingInterval is kept for API compatibility but now
aliases certreload.DefaultRefreshInterval so the same env controls
both gRPC mTLS and HTTPS reload.
* ci(tls): wire the TLS rotation integration test into GitHub Actions
Mirrors the existing vacuum-integration-tests.yml shape: Ubuntu runner,
Go 1.25, build weed, run `go test` in test/tls_rotation, upload master
logs on failure. 10-minute job timeout; the test itself finishes in
about a second because WEED_TLS_CERT_REFRESH_INTERVAL is set to 500ms
inside the test.
Runs on every push to master and on every PR to master.
* fix(tls): address follow-up PR review comments
Three new comments on the integration test + volume shutdown path:
1. Test: peekServerCert was swallowing every dial/handshake error,
which meant waitForCert's "last err: <nil>" fatal message lost all
diagnostic value. Thread errors back through: peekServerCert now
returns (*x509.Certificate, error), and waitForCert records the
latest error so a CI flake points at the actual cause (master
didn't come up, handshake rejected, CA pool mismatch, etc.).
2. Test: set HOME=<tempdir> on the master subprocess. Viper today
registers the literal path "$HOME/.seaweedfs" without env
expansion, so a developer's ~/.seaweedfs/security.toml is
accidentally invisible — the test was relying on that. Pinning
HOME is belt-and-braces against a future viper upgrade that does
expand env vars.
3. volume.go: startClusterHttpService's provider close was registered
via grace.OnInterrupt, which fires on SIGTERM but NOT on the
v.shutdownCtx.Done() path used by mini / integration tests. The
pemfile refresh goroutine leaked in that shutdown path. Now the
helper returns a close func and the caller invokes it on BOTH
shutdown paths for parity.
Also add MinVersion: TLS 1.2 to the test's tls.Config to quiet the
ast-grep static-analysis nit — zero-risk since the pool only trusts
our in-memory CA.
Test runs clean 3/3.
* fix(mini): shut down admin/s3/webdav/filer before volume/master on Ctrl+C
Interrupts fired grace hooks in registration order, so master (started
first) shut down before its clients, producing heartbeat-canceled errors
and masterClient reconnection noise during weed mini shutdown. Admin/s3/
webdav had no interrupt hooks at all and were killed at os.Exit.
- grace: execute interrupt hooks in LIFO (defer-style) order so later-
started services tear down first.
- filer: consolidate the three separate interrupt hooks (gRPC / HTTP /
DB) into one that runs in order, so filer shutdown stays correct
independent of FIFO/LIFO semantics.
- mini: add MiniClientsShutdownCtx (separate from test-facing
MiniClusterCtx) plus an OnMiniClientsShutdown helper. Admin, S3,
WebDAV and the maintenance worker observe it; runMini registers a
cancel hook after startup so under LIFO it fires first and waits up to
10s on a WaitGroup for those services to drain before filer, volume,
and master shut down.
Resulting order on Ctrl+C: admin/s3/webdav/worker -> filer (gRPC -> HTTP
-> DB) -> volume -> master.
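The LIFO hook change is the classic defer-style reversal — a minimal sketch (the real grace package also handles signals and one-shot semantics):

```go
package main

// runInterruptHooks executes registered interrupt hooks in LIFO
// (defer-style) order, so services started last tear down first:
// admin/s3/webdav before filer, filer before volume and master.
func runInterruptHooks(hooks []func()) {
	for i := len(hooks) - 1; i >= 0; i-- {
		hooks[i]()
	}
}
```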
* refactor(mini): group mini-client shutdown into one state struct
The first pass spread the shutdown plumbing across three globals
(MiniClientsShutdownCtx, miniClientsWg, cancelMiniClients) and two
ctx-derivation sites (OnMiniClientsShutdown and startMiniAdminWithWorker).
Group into a private miniClientsState (ctx/cancel/wg) rebuilt per runMini
invocation, and chain its ctx from MiniClusterCtx so clients only observe
one signal. Tests that cancel MiniClusterCtx still trigger client
shutdown via parent-child propagation.
- resetMiniClients() installs fresh state at the top of runMini, so
in-process test reruns don't inherit stale ctx/wg.
- onMiniClientsShutdown(fn) replaces the exported OnMiniClientsShutdown
and only observes one ctx.
- trackMiniClient() replaces the manual wg.Add/Done dance for the admin
goroutine.
- miniClientsCtx() gives the admin startup a ctx without re-deriving.
- triggerMiniClientsShutdown(timeout) is the interrupt hook body.
No behaviour change; existing tests pass.
* refactor: generalize shutdown ctx as an option, not a mini-specific helper
Several service files (s3, webdav, filer, master, volume) observed the
mini-specific MiniClusterCtx or called onMiniClientsShutdown directly.
That leaked mini orchestration into code that also runs under weed s3,
weed webdav, weed filer, weed master, and weed volume standalone.
Replace with a generic `shutdownCtx context.Context` field on each
service's Options struct. When non-nil, the server watches it and shuts
down gracefully; when nil (standalone), the shutdown path is a no-op.
Mini wires the contexts up from a single place (runMini):
- miniMasterOptions/miniOptions.v/miniFilerOptions.shutdownCtx =
MiniClusterCtx (drives test-triggered teardown)
- miniS3Options/miniWebDavOptions.shutdownCtx = miniClientsCtx() (drives
Ctrl+C teardown before filer/volume/master)
All knowledge of MiniClusterCtx now lives in mini.go.
* fix(mini): stop worker before clients ctx so admin shutdown isn't blocked
Symptom on Ctrl+C of a clean weed mini: mini's "Shutting down admin/s3/
webdav" hook sat for 10s then logged "timed out". Admin had started its
shutdown but was blocked inside StopWorkerGrpcServer's GracefulStop,
waiting for the still-connected worker stream. That in turn left filer
clients connected and cascaded into filer's own 10s gRPC graceful-stop
timeout.
Two root causes, both fixed, plus a mini-side ordering change:
1. worker.Stop() deadlocked on clean shutdown. It sent ActionStop (which
makes managerLoop `break out` and exit), then called getTaskLoad()
which sends to the same unbuffered cmd channel — no receiver, hangs
forever. Reorder Stop() to snapshot the admin client and drain tasks
BEFORE sending ActionStop, and call Disconnect() via the local
snapshot afterwards.
2. Worker's taskRequestLoop raced with Disconnect(): RequestTask reads
from c.incoming, which Disconnect closes, yielding a nil response and
a panic on response.Message. Handle the closed channel explicitly.
3. Mini now has a preCancel phase (beforeMiniClientsShutdown) that runs
synchronously BEFORE the clients ctx is cancelled. Register worker
shutdown there so admin's worker-gRPC GracefulStop finds the worker
already disconnected and returns immediately, instead of waiting on
a stream that is about to close anyway.
Observed shutdown of a clean mini: admin/s3/webdav down in <10ms; full
process exit in ~11s (the remaining 10s is a pre-existing filer gRPC
graceful-stop timeout, not cascaded from the clients tier).
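The unbuffered-channel deadlock and its reordering fix can be reduced to this sketch (names like manager/stopWorker are illustrative, not the actual worker code):

```go
package main

type cmd struct {
	kind  string
	reply chan int
}

// manager models the worker managerLoop: it serves load queries until it
// receives "stop", then exits. After that, nothing reads the channel, so
// any later send on the unbuffered channel blocks forever — the original
// Stop() bug.
func manager(cmds chan cmd, load int) {
	for c := range cmds {
		if c.kind == "stop" {
			return
		}
		c.reply <- load
	}
}

// stopWorker shows the fixed ordering: drain state (the load query) while
// the loop is still receiving, and only then send "stop".
func stopWorker(cmds chan cmd) int {
	reply := make(chan int)
	cmds <- cmd{kind: "load", reply: reply} // safe: loop is still live
	load := <-reply
	cmds <- cmd{kind: "stop"} // now safe: nothing else will be sent
	return load
}
```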
* feat(mini): cap filer gRPC graceful stop at 1s under weed mini
Full weed mini shutdown was ~11s on a clean exit, dominated by the
filer's default 10s gRPC GracefulStop timeout while background
SubscribeLocalMetadata streams drained.
Expose the timeout as a FilerOptions.gracefulStopTimeout field (default
10s for standalone weed filer) and set it to 1s in mini. Clean weed mini
shutdown now takes ~2s.
* fix(master): skip redundant DoJoinCommand on resumeState to prevent deadlock
When fastResume is active (single-master + resumeState + non-empty log),
the raft server becomes leader within ~1ms. DoJoinCommand then enters
the leaderLoop's processCommand path, which calls setCommitIndex to
commit all pending entries. The goraft setCommitIndex implementation
returns early when it encounters a JoinCommand entry (to recalculate
quorum), which can prevent the new entry's event channel from being
notified — leaving DoJoinCommand blocked forever.
Each restart appends a new raft:join entry to the log, while the conf
file's commitIndex (only persisted on AddPeer) lags behind. After 3-4
restarts the uncommitted range contains old JoinCommand entries that
trigger the early return before the new entry is reached.
Fix: skip DoJoinCommand when the raft log already has entries (the
server was already joined in a previous run). The fastResume mechanism
handles leader election independently.
* fix(master): handle Hashicorp Raft in HasExistingState
Add Hashicorp Raft support to HasExistingState by checking
AppliedIndex, consistent with how other RaftServer methods
handle both raft implementations.
* fix(master): use LastIndex() instead of AppliedIndex() for Hashicorp Raft
AppliedIndex() reflects in-memory FSM state which starts at 0 before
log replay completes. LastIndex() reads from persisted stable storage,
correctly mirroring the non-Hashicorp IsLogEmpty() check.
* fix(master): fast resume state and default resumeState to true
When resumeState is enabled in single-master mode, the raft server had
existing log entries so the self-join path couldn't promote to leader.
The server waited the full election timeout (10-20s) before self-electing.
Fix by temporarily setting election timeout to 1ms before Start() when
in single-master + resumeState mode with existing log, then restoring
the original timeout after leader election. This makes resume near-instant.
Also change the default for resumeState from false to true across all
CLI commands (master, mini, server) so state is preserved by default.
* fix(master): prevent fastResume goroutine from hanging forever
Use defer to guarantee election timeout is always restored, and bound
the polling loop with a timeout so it cannot spin indefinitely if
leader election never succeeds.
* fix(master): use ticker instead of time.After in fastResume polling loop
* Use Unix sockets for gRPC between co-located services in mini mode
In `weed mini`, all services run in one process. Previously, inter-service
gRPC traffic (volume↔master, filer↔master, S3↔filer, worker↔admin, etc.)
went through TCP loopback. This adds a gRPC Unix socket registry in the pb
package: mini mode registers a socket path per gRPC port at startup, each
gRPC server additionally listens on its socket, and GrpcDial transparently
routes to the socket via WithContextDialer when a match is found.
Standalone commands (weed master, weed filer, etc.) are unaffected since
no sockets are registered. TCP listeners are kept for external clients.
* Handle Serve error and clean up socket file in ServeGrpcOnLocalSocket
Log non-expected errors from grpcServer.Serve (ignoring
grpc.ErrServerStopped) and always remove the Unix socket file
when Serve returns, ensuring cleanup on Stop/GracefulStop.
* fix: clear raft vote state file on non-resume startup
The seaweedfs/raft library v1.1.7 added a persistent `state` file for
currentTerm and votedFor. When RaftResumeState=false (the default), the
log, conf, and snapshot directories are cleared but this state file was
not. On repeated restarts, different masters accumulate divergent terms,
causing AppendEntries rejections and preventing leader election.
Fixes #8690
* fix: recover TopologyId from snapshot before clearing raft state
When RaftResumeState=false clears log/conf/snapshot, the TopologyId
(used for license validation) was lost. Now extract it from the latest
snapshot before cleanup and restore it on the topology.
Both seaweedfs/raft and hashicorp/raft paths are handled, with a shared
recoverTopologyIdFromState helper in raft_common.go.
* fix: stagger multi-master bootstrap delay by peer index
Previously all masters used a fixed 1500ms delay before the bootstrap
check. Now the delay is proportional to the peer's sorted index with
randomization (matching the hashicorp raft path), giving the designated
bootstrap node (peer 0) a head start while later peers wait for gRPC
servers to be ready.
Also adds diagnostic logging showing why DoJoinCommand was or wasn't
called, making leader election issues easier to diagnose from logs.
* fix: skip unreachable masters during leader reconnection
When a master leader goes down, non-leader masters still redirect
clients to the stale leader address. The masterClient would follow
these redirects, fail, and retry — wasting round-trips each cycle.
Now tryAllMasters tracks which masters failed within a cycle and skips
redirects pointing to them, reducing log spam and connection overhead
during leader failover.
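The per-cycle skip logic reduces to something like this sketch (pickNextMaster and its parameters are illustrative names, not the actual tryAllMasters signature):

```go
package main

// pickNextMaster follows a redirect only when that master has not already
// failed within the current cycle; otherwise it falls back to the next
// healthy master, avoiding wasted round-trips to a stale leader address.
func pickNextMaster(redirect string, failed map[string]bool, fallback []string) string {
	if redirect != "" && !failed[redirect] {
		return redirect
	}
	for _, m := range fallback {
		if !failed[m] {
			return m
		}
	}
	return "" // everyone failed this cycle; caller backs off and retries
}
```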
* fix: take snapshot after TopologyId generation for recovery
After generating a new TopologyId on the leader, immediately take a raft
snapshot so the ID can be recovered from the snapshot on future restarts
with RaftResumeState=false. Without this, short-lived clusters would
lose the TopologyId on restart since no automatic snapshot had been
taken yet.
* test: add multi-master raft failover integration tests
Integration test framework and 5 test scenarios for 3-node master
clusters:
- TestLeaderConsistencyAcrossNodes: all nodes agree on leader and
TopologyId
- TestLeaderDownAndRecoverQuickly: leader stops, new leader elected,
old leader rejoins as follower
- TestLeaderDownSlowRecover: leader gone for extended period, cluster
continues with 2/3 quorum
- TestTwoMastersDownAndRestart: quorum lost (2/3 down), recovered
when both restart
- TestAllMastersDownAndRestart: full cluster restart, leader elected,
all nodes agree on TopologyId
* fix: address PR review comments
- peerIndex: return -1 (not 0) when self not found, add warning log
- recoverTopologyIdFromSnapshot: defer dir.Close()
- tests: check GetTopologyId errors instead of discarding them
* fix: address review comments on failover tests
- Assert no leader after quorum loss (was only logging)
- Verify follower cs.Leader matches expected leader via
ServerAddress.ToHttpAddress() comparison
- Check GetTopologyId error in TestTwoMastersDownAndRestart
Capture global MiniClusterCtx into local variables before goroutine/select
evaluation to prevent nil dereference/data race when context is reset to nil
after nil check. Applied to filer, master, volume, and s3 commands.
- Introduce MiniClusterCtx to coordinate shutdown across mini services
- Update Master, Volume, Filer, S3, and WebDAV servers to respect context cancellation
- Ensure all resources are cleaned up properly during test teardown
- Integrate MiniClusterCtx in s3tables integration tests
* Add consistent -debug and -debug.port flags to commands
Add -debug and -debug.port flags to weed master, weed volume, weed s3,
weed mq.broker, and weed filer.sync commands for consistency with
weed filer.
When -debug is enabled, an HTTP server starts on the specified port
(default 6060) serving runtime profiling data at /debug/pprof/.
For mq.broker, replaced the older -port.pprof flag with the new
-debug and -debug.port pattern for consistency.
* Update weed/util/grace/pprof.go
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
---------
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* weed master -peers=none
* single master mode only when peers is none
* refactoring
* revert duplicated code
* revert
* Update weed/command/master.go
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* preventing "none" passed to other components if master is not started
---------
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* adjust "weed benchmark" CLI to use readOnly/writeOnly
* consistently use "-master" CLI option
* If both -readOnly and -writeOnly are specified, the current logic silently allows it with -writeOnly taking precedence. This is confusing and could lead to unexpected behavior.
* Added/Updated:
- Added metrics ip options for all servers;
- Fixed a bug with the selection of the bindIp or ip parameter for the metrics handler;
* Fixed cmd flags
* Added context for the MasterClient's methods to avoid endless loops
* Returned WithClient function. Added WithClientCustomGetMaster function
* Hid unused ctx arguments
* Using a common context for the KeepConnectedToMaster and WaitUntilConnected functions
* Changed the context termination check in the tryConnectToMaster function
* Added a child context to the tryConnectToMaster function
* Added a common context for KeepConnectedToMaster and WaitUntilConnected functions in benchmark
`weed server` was not correctly propagating
`-master.raftHashicorp` and `-master.raftBootstrap` flags when
starting the master server.
Related to #4307
* refactor(net_timeout): `listner` -> `listener`
Signed-off-by: Ryan Russell <git@ryanrussell.org>
* refactor(s3): `s3ApiLocalListner` -> `s3ApiLocalListener`
Signed-off-by: Ryan Russell <git@ryanrussell.org>
* refactor(filer): `localPublicListner` -> `localPublicListener`
Signed-off-by: Ryan Russell <git@ryanrussell.org>
* refactor(command): `masterLocalListner` -> `masterLocalListener`
Signed-off-by: Ryan Russell <git@ryanrussell.org>
* refactor(net_timeout): `ipListner` -> `ipListener`
Signed-off-by: Ryan Russell <git@ryanrussell.org>