* feat(security): hot-reload HTTPS certs for master/volume/filer/webdav/admin
S3 and filer already use a refreshing pemfile provider for their HTTPS
cert, so rotated certificates (e.g. from k8s cert-manager) are picked up
without a restart. Master, volume, webdav, and admin, however, passed
cert/key paths straight to ServeTLS/ListenAndServeTLS and loaded once at
startup — rotating those certs required a pod restart.
Add a small helper NewReloadingServerCertificate in weed/security that
wraps pemfile.Provider and returns a tls.Config.GetCertificate closure,
then wire it into the four remaining HTTPS entry points. httpdown now
also calls ServeTLS when TLSConfig carries a GetCertificate/Certificates
but CertFile/KeyFile are empty, so volume server can pre-populate
TLSConfig.
A unit test exercises the rotation path (write cert, rotate on disk,
assert the callback returns the new cert) with a short refresh window.
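
For illustration, a stdlib-only sketch of a reloading GetCertificate closure. The helper described above wraps the gRPC pemfile provider instead; the names certCache and newReloadingGetCertificate, the file names, and the lazy reload-on-handshake strategy are assumptions made here, not the code in weed/security.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"sync"
	"time"
)

// certCache keeps the last loaded certificate and reloads it once it is
// older than refresh. Hypothetical type, for illustration only.
type certCache struct {
	mu       sync.Mutex
	load     func() (*tls.Certificate, error)
	refresh  time.Duration
	cached   *tls.Certificate
	loadedAt time.Time
}

func (c *certCache) get() (*tls.Certificate, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.cached != nil && time.Since(c.loadedAt) < c.refresh {
		return c.cached, nil
	}
	cert, err := c.load()
	if err != nil {
		if c.cached != nil {
			// Keep serving the last good certificate during a bad rotation.
			return c.cached, nil
		}
		return nil, err
	}
	c.cached, c.loadedAt = cert, time.Now()
	return cert, nil
}

// newReloadingGetCertificate returns a closure suitable for
// tls.Config.GetCertificate that re-reads certFile/keyFile from disk.
func newReloadingGetCertificate(certFile, keyFile string, refresh time.Duration) func(*tls.ClientHelloInfo) (*tls.Certificate, error) {
	c := &certCache{
		refresh: refresh,
		load: func() (*tls.Certificate, error) {
			cert, err := tls.LoadX509KeyPair(certFile, keyFile)
			if err != nil {
				return nil, err
			}
			return &cert, nil
		},
	}
	return func(*tls.ClientHelloInfo) (*tls.Certificate, error) { return c.get() }
}

func main() {
	cfg := &tls.Config{
		MinVersion:     tls.VersionTLS12,
		GetCertificate: newReloadingGetCertificate("server.crt", "server.key", 5*time.Hour),
	}
	// With GetCertificate set, http.Server.ServeTLS(listener, "", "") can be
	// called with empty cert/key paths.
	fmt.Println("GetCertificate configured:", cfg.GetCertificate != nil)
}
```

Because the callback runs per handshake, a rotated file pair is picked up by new connections once the refresh window has elapsed, with no listener restart.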
* refactor(security): route filer/s3 HTTPS through the shared cert reloader
Before: filer.go and s3.go each kept a *certprovider.Provider on the
options struct plus a duplicated GetCertificateWithUpdate method. Both
created their own pemfile provider. Behaviorally they already reloaded,
but the logic existed in two copies and neither was shared with the
newly added master/volume/webdav/admin wiring.
After: both use security.NewReloadingServerCertificate like the other
servers. The per-struct certProvider field and GetCertificateWithUpdate
method are removed, along with the now-unused certprovider and pemfile
imports. Net: -32 lines, one code path for all HTTPS cert reloading.
No behavior change — the refresh window, cache, and handshake contract
are identical (the helper wraps the same pemfile.NewProvider).
* feat(security): hot-reload HTTPS client certs for mount/backup/upload/etc
The HTTP client in weed/util/http/client loaded the mTLS client cert
once at startup via tls.LoadX509KeyPair. That left every long-lived
HTTPS client process (weed mount, backup, filer.copy, filer→volume,
s3→filer/volume) unable to pick up a rotated client cert without a
restart — even though the same cert-manager setup was already rotating
the server side fine.
Swap the client cert loader for a tls.Config.GetClientCertificate
callback backed by the same refreshing pemfile provider. New TLS
handshakes pick up the rotated cert; in-flight pooled connections keep
their old cert until normal transport churn closes them.
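
A minimal sketch of the client-side wiring, assuming a callback that simply re-reads the key pair on every handshake. The actual change uses the refreshing pemfile provider; clientCertCallback, newMTLSClient, and the file names are hypothetical and exist only for this example.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
)

// clientCertCallback returns a tls.Config.GetClientCertificate callback that
// loads the key pair from disk each time the server requests a client cert.
func clientCertCallback(certFile, keyFile string) func(*tls.CertificateRequestInfo) (*tls.Certificate, error) {
	return func(*tls.CertificateRequestInfo) (*tls.Certificate, error) {
		cert, err := tls.LoadX509KeyPair(certFile, keyFile)
		if err != nil {
			return nil, fmt.Errorf("load client cert: %w", err)
		}
		return &cert, nil
	}
}

// newMTLSClient builds an HTTP client whose TLS handshakes call the callback
// instead of using a certificate loaded once at startup.
func newMTLSClient(certFile, keyFile string) *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{
				MinVersion:           tls.VersionTLS12,
				GetClientCertificate: clientCertCallback(certFile, keyFile),
			},
		},
	}
}

func main() {
	client := newMTLSClient("client.crt", "client.key")
	// Optional: dropping idle pooled connections makes the next request
	// perform a fresh handshake, and therefore a fresh callback invocation.
	client.CloseIdleConnections()
	fmt.Println("client ready")
}
```

Established connections are unaffected by the callback, which matches the behavior described above: only new handshakes see the rotated certificate.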
To keep this reusable from both server and client TLS code without an
import cycle (weed/security already imports weed/util/http/client for
LoadHTTPClientFromFile), extract the pemfile wrapper into a new
weed/security/certreload subpackage. weed/security keeps its thin
NewReloadingServerCertificate wrapper. The existing unit test moves
with the implementation.
gRPC mTLS was already handled by security.LoadServerTLS /
LoadClientTLS; this PR does not change any gRPC paths. MQ broker, MQ
agent, Kafka gateway, and FUSE mount control plane are gRPC-only and
therefore already rotate.
CA bundles (ClientCAs / RootCAs / grpc.ca) are still loaded once — noted
as a known limitation in the wiki.
* fix(security): address PR review feedback on cert reloader
Bots (gemini-code-assist + coderabbit) flagged three real issues and a
couple of nits. Addressing them here:
1. KeyMaterial used context.Background(). The grpc pemfile provider's
KeyMaterial blocks until material arrives or the context deadline
expires; with Background() a slow disk could hang the TLS handshake
indefinitely. Switched both the server and client callbacks to use
hello.Context() / cri.Context() so a stuck read is bounded by the
handshake timeout.
2. Admin server loaded TLS inside the serve goroutine. If the cert was
bad, the goroutine returned but startAdminServer kept blocking on
<-ctx.Done() with no listener, making the process look healthy with
nothing bound. TLS setup now runs before the goroutine starts and
propagates errors via fmt.Errorf; it also captures the provider and
defers Close().
3. HTTP client discarded the certprovider.Provider from
NewClientGetCertificate. That leaked the refresh goroutine, and
NewHttpClientWithTLS had a worse case where a CA-file failure after
provider creation orphaned the provider entirely. Added a
certProvider field and a Close() method on HTTPClient, and made
the constructors close the provider on subsequent error paths.
4. Server-side paths (master/volume/filer/s3/webdav/admin) now retain
the provider. filer and webdav run ServeTLS synchronously, so a
plain defer works. master/volume/s3 dispatch goroutines and return
while the server keeps running, so they hook Close() into
grace.OnInterrupt.
5. Test: certreload_test now tolerates transient read/parse errors
during file rotation (writeSelfSigned rewrites cert before key) and
reports the last error only if the deadline expires.
No user-visible behavior change for the happy path.
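
The context fix in item 1 can be sketched as follows. blockingSource is a hypothetical stand-in for a provider whose KeyMaterial blocks until material arrives; it is not the gRPC pemfile type. The point of the sketch is only that passing the handshake's context, rather than context.Background(), lets a stalled lookup return.

```go
package main

import (
	"context"
	"crypto/tls"
	"errors"
	"fmt"
	"time"
)

// blockingSource blocks in KeyMaterial until a certificate is sent on ready
// or the supplied context is done. Hypothetical, for illustration only.
type blockingSource struct{ ready chan *tls.Certificate }

func (s *blockingSource) KeyMaterial(ctx context.Context) (*tls.Certificate, error) {
	select {
	case cert := <-s.ready:
		return cert, nil
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}

// getCertificate forwards the handshake's own context, so the lookup is
// abandoned when that context is canceled instead of blocking forever.
func getCertificate(src *blockingSource) func(*tls.ClientHelloInfo) (*tls.Certificate, error) {
	return func(hello *tls.ClientHelloInfo) (*tls.Certificate, error) {
		return src.KeyMaterial(hello.Context())
	}
}

// stalledLookup simulates a source that never delivers material and shows
// that a deadline on the context bounds the wait.
func stalledLookup(timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	src := &blockingSource{ready: make(chan *tls.Certificate)}
	_, err := src.KeyMaterial(ctx)
	return err
}

func main() {
	err := stalledLookup(50 * time.Millisecond)
	fmt.Println("bounded by deadline:", errors.Is(err, context.DeadlineExceeded))
}
```

The client side is analogous, using the Context() method on tls.CertificateRequestInfo inside GetClientCertificate.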
* test(tls): add end-to-end HTTPS cert rotation integration test
Boots a real `weed master` with HTTPS enabled, captures the leaf cert
served at TLS handshake time, atomically rewrites the cert/key files
on disk (the same rename-in-place pattern kubelet does when it swaps
a cert-manager Secret), and asserts that a subsequent TLS handshake
observes the rotated leaf — with no process restart, no SIGHUP, no
reloader sidecar. Verifies the full path: on-disk change → pemfile
refresh tick → provider.KeyMaterial → tls.Config.GetCertificate →
server TLS handshake.
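
A sketch of an atomic file rewrite of the kind the test relies on: write to a temporary file in the same directory, then rename over the target so readers never observe a half-written file. atomicWriteFile and demoRotate are hypothetical names; the integration test's actual helper may differ.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// atomicWriteFile writes data to a temp file next to path and renames it into
// place. The temp file must be on the same filesystem for the rename to be atomic.
func atomicWriteFile(path string, data []byte, perm os.FileMode) error {
	tmp, err := os.CreateTemp(filepath.Dir(path), ".tmp-*")
	if err != nil {
		return err
	}
	// Cleans up on failure; harmless after a successful rename.
	defer os.Remove(tmp.Name())
	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Chmod(perm); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}

// demoRotate writes an "old" file, replaces it atomically with a "new" one,
// and returns what a reader sees afterwards.
func demoRotate() (string, error) {
	dir, err := os.MkdirTemp("", "rotate")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "tls.crt")
	for _, v := range []string{"old-cert", "new-cert"} {
		if err := atomicWriteFile(path, []byte(v), 0o600); err != nil {
			return "", err
		}
	}
	got, err := os.ReadFile(path)
	return string(got), err
}

func main() {
	got, err := demoRotate()
	fmt.Println(got, err)
}
```

Note that the cert and key are still two separate renames, so a reader can briefly see a new cert with an old key; a reloader has to tolerate that window.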
Runtime is ~1s by exposing the reloader's refresh window as an env
var (WEED_TLS_CERT_REFRESH_INTERVAL) and setting it to 500ms for the
test. The same env var is user-facing — documented in the wiki — so
operators running short-lived certs (Vault, cert-manager with
duration: 24h, etc.) can tighten the rotation-pickup window without a
rebuild. Defaults to 5h to preserve prior behavior.
security.CredRefreshingInterval is kept for API compatibility but now
aliases certreload.DefaultRefreshInterval so the same env controls
both gRPC mTLS and HTTPS reload.
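
One plausible way to read the refresh window from the environment, assuming the value uses Go duration syntax (as the 500ms test setting suggests) and that empty, invalid, or non-positive values fall back to the 5h default. parseRefreshInterval and defaultRefreshInterval are illustrative names, not necessarily those in certreload.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// defaultRefreshInterval mirrors the 5h default mentioned above.
const defaultRefreshInterval = 5 * time.Hour

// parseRefreshInterval returns the parsed duration, or the default when the
// value is empty, malformed, or not positive (fallback policy is an assumption).
func parseRefreshInterval(raw string) time.Duration {
	d, err := time.ParseDuration(raw)
	if err != nil || d <= 0 {
		return defaultRefreshInterval
	}
	return d
}

func main() {
	interval := parseRefreshInterval(os.Getenv("WEED_TLS_CERT_REFRESH_INTERVAL"))
	fmt.Println("refresh interval:", interval)
}
```

With this shape, an operator would run the server with something like `WEED_TLS_CERT_REFRESH_INTERVAL=10m` to tighten the pickup window.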
* ci(tls): wire the TLS rotation integration test into GitHub Actions
Mirrors the existing vacuum-integration-tests.yml shape: Ubuntu runner,
Go 1.25, build weed, run `go test` in test/tls_rotation, upload master
logs on failure. 10-minute job timeout; the test itself finishes in
about a second because WEED_TLS_CERT_REFRESH_INTERVAL is set to 500ms
inside the test.
Runs on every push to master and on every PR to master.
* fix(tls): address follow-up PR review comments
Three new comments on the integration test + volume shutdown path:
1. Test: peekServerCert was swallowing every dial/handshake error,
which meant waitForCert's "last err: <nil>" fatal message lost all
diagnostic value. Thread errors back through: peekServerCert now
returns (*x509.Certificate, error), and waitForCert records the
latest error so a CI flake points at the actual cause (master
didn't come up, handshake rejected, CA pool mismatch, etc.).
2. Test: set HOME=<tempdir> on the master subprocess. Viper today
registers the literal path "$HOME/.seaweedfs" without env
expansion, so a developer's ~/.seaweedfs/security.toml is
accidentally invisible — the test was relying on that. Pinning
HOME is belt-and-braces against a future viper upgrade that does
expand env vars.
3. volume.go: startClusterHttpService's provider close was registered
via grace.OnInterrupt, which fires on SIGTERM but NOT on the
v.shutdownCtx.Done() path used by mini / integration tests. The
pemfile refresh goroutine leaked in that shutdown path. Now the
helper returns a close func and the caller invokes it on BOTH
shutdown paths for parity.
Also add MinVersion: TLS 1.2 to the test's tls.Config to quiet the
ast-grep static-analysis nit — zero-risk since the pool only trusts
our in-memory CA.
Test runs clean 3/3.
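
The error-threading change in item 1 can be sketched like this. The function names follow the commit text, but the bodies, the predicate parameter, and the httptest-based demo are assumptions written for this example rather than the test's actual code.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
	"net"
	"net/http"
	"net/http/httptest"
	"time"
)

// peekServerCert performs one TLS handshake and returns the leaf certificate,
// surfacing dial and handshake errors instead of swallowing them.
func peekServerCert(addr string, cfg *tls.Config) (*x509.Certificate, error) {
	d := &net.Dialer{Timeout: 2 * time.Second}
	conn, err := tls.DialWithDialer(d, "tcp", addr, cfg)
	if err != nil {
		return nil, fmt.Errorf("dial %s: %w", addr, err)
	}
	defer conn.Close()
	certs := conn.ConnectionState().PeerCertificates
	if len(certs) == 0 {
		return nil, errors.New("server presented no certificate")
	}
	return certs[0], nil
}

// waitForCert polls until want(cert) is true or the deadline passes, and
// reports the most recent failure so a timeout message names the real cause.
func waitForCert(addr string, cfg *tls.Config, want func(*x509.Certificate) bool, deadline time.Duration) (*x509.Certificate, error) {
	var lastErr error
	stop := time.Now().Add(deadline)
	for time.Now().Before(stop) {
		cert, err := peekServerCert(addr, cfg)
		switch {
		case err != nil:
			lastErr = err
		case want(cert):
			return cert, nil
		default:
			lastErr = errors.New("certificate not yet rotated")
		}
		time.Sleep(50 * time.Millisecond)
	}
	return nil, fmt.Errorf("timed out waiting for cert, last err: %v", lastErr)
}

// demoPeek exercises the helpers against a throwaway local TLS server.
func demoPeek() error {
	srv := httptest.NewTLSServer(http.NotFoundHandler())
	defer srv.Close()
	pool := x509.NewCertPool()
	pool.AddCert(srv.Certificate())
	cfg := &tls.Config{RootCAs: pool, MinVersion: tls.VersionTLS12}
	want := func(c *x509.Certificate) bool { return c.Equal(srv.Certificate()) }
	_, err := waitForCert(srv.Listener.Addr().String(), cfg, want, 2*time.Second)
	return err
}

func main() {
	fmt.Println("demo error:", demoPeek())
}
```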
* fix(sync): use per-cluster TLS for HTTP volume connections in filer.sync (#8965)
When filer.sync runs with -a.security and -b.security flags, only gRPC
connections received per-cluster TLS configuration. HTTP clients for
volume server reads and uploads used a global singleton with the default
security.toml, causing TLS verification failures when clusters use
different self-signed certificates.
Load per-cluster HTTPS client config from the security files and pass
dedicated HTTP clients to FilerSource (for downloads) and FilerSink
(for uploads) so each direction uses the correct cluster's certificates.
* fix(sync): address review feedback for per-cluster HTTP TLS
- Add insecure_skip_verify support to NewHttpClientWithTLS and read it
from per-cluster security config via https.client.insecure_skip_verify
- Error on partial mTLS config (cert without key or vice versa)
- Add nil-check for client parameter in DownloadFileWithClient
- Document SetUploader as init-only (same pattern as SetChunkConcurrency)
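
A hedged sketch of what a per-cluster HTTPS client constructor with these checks could look like. clusterTLSOptions and newClusterHTTPClient are hypothetical; the real NewHttpClientWithTLS signature is not shown in this message and may differ.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
	"net/http"
	"os"
)

// clusterTLSOptions holds the HTTPS client settings for one cluster
// (hypothetical struct, loosely mirroring an [https.client] section).
type clusterTLSOptions struct {
	CAFile, CertFile, KeyFile string
	InsecureSkipVerify        bool
}

// newClusterHTTPClient builds a dedicated client so each sync direction can
// trust its own cluster's CA and present its own client certificate.
func newClusterHTTPClient(o clusterTLSOptions) (*http.Client, error) {
	// Reject partial mTLS config: cert without key, or key without cert.
	if (o.CertFile == "") != (o.KeyFile == "") {
		return nil, errors.New("partial mTLS config: cert and key must be set together")
	}
	cfg := &tls.Config{
		MinVersion:         tls.VersionTLS12,
		InsecureSkipVerify: o.InsecureSkipVerify, // disables server cert verification
	}
	if o.CAFile != "" {
		pem, err := os.ReadFile(o.CAFile)
		if err != nil {
			return nil, fmt.Errorf("read CA file: %w", err)
		}
		pool := x509.NewCertPool()
		if !pool.AppendCertsFromPEM(pem) {
			return nil, errors.New("CA file contains no usable certificates")
		}
		cfg.RootCAs = pool
	}
	if o.CertFile != "" {
		cert, err := tls.LoadX509KeyPair(o.CertFile, o.KeyFile)
		if err != nil {
			return nil, fmt.Errorf("load client key pair: %w", err)
		}
		cfg.Certificates = []tls.Certificate{cert}
	}
	return &http.Client{Transport: &http.Transport{TLSClientConfig: cfg}}, nil
}

func main() {
	_, err := newClusterHTTPClient(clusterTLSOptions{CertFile: "a.crt"})
	fmt.Println("partial config rejected:", err != nil)
}
```

A caller would build one such client per cluster and hand each to the side that talks to that cluster, rather than sharing a global singleton.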
* Add -insecureSkipVerify flag and config option for filer.sync HTTPS connections
When using filer.sync between clusters with different CAs (e.g., separate
OpenShift clusters), TLS certificate verification fails with "x509:
certificate signed by unknown authority". This adds two ways to skip TLS
certificate verification:
1. CLI flag: `weed filer.sync -insecureSkipVerify ...`
2. Config option: `insecure_skip_verify = true` under [https.client] in
security.toml
Closes #8778
* Add insecure_skip_verify option for HTTPS client in security.toml
When using filer.sync between clusters with different CAs (e.g., separate
OpenShift clusters), TLS certificate verification fails. Adding
insecure_skip_verify = true under [https.client] in security.toml allows
skipping TLS certificate verification.
The option is read during global HTTP client initialization so it applies
to all HTTPS connections including filer.sync proxy reads and writes.
Closes #8778
---------
Co-authored-by: Copilot <copilot@github.com>
* Added global http client
* Added Do func for global http client
* Changed the code to use the global http client
* Fix http client in volume uploader
* Fixed pkg name
* Fixed http util funcs
* Fixed http client for bench_filer_upload
* Fixed http client for stress_filer_upload
* Fixed http client for filer_server_handlers_proxy
* Fixed http client for command_fs_merge_volumes
* Fixed http client for command_fs_merge_volumes and command_volume_fsck
* Fixed http client for s3api_server
* Added init global client for main funcs
* Rename global_client to client
* Changed:
- fixed NewHttpClient
- added CheckIsHttpsClientEnabled func
- updated security.toml in scaffold
* Reduce the visibility of some functions in the util/http/client pkg
* Added the loadSecurityConfig function
* Use util.LoadSecurityConfiguration() in NewHttpClient func