seaweedfs/weed/util/http/client/http_client.go
Chris Lu 9ae905e456 feat(security): hot-reload HTTPS certs without restart (k8s cert-manager) (#9181)
* feat(security): hot-reload HTTPS certs for master/volume/filer/webdav/admin

S3 and filer already use a refreshing pemfile provider for their HTTPS
cert, so rotated certificates (e.g. from k8s cert-manager) are picked up
without a restart. Master, volume, webdav, and admin, however, passed
cert/key paths straight to ServeTLS/ListenAndServeTLS and loaded once at
startup — rotating those certs required a pod restart.

Add a small helper NewReloadingServerCertificate in weed/security that
wraps pemfile.Provider and returns a tls.Config.GetCertificate closure,
then wire it into the four remaining HTTPS entry points. httpdown now
also calls ServeTLS when TLSConfig carries a GetCertificate/Certificates
but CertFile/KeyFile are empty, so volume server can pre-populate
TLSConfig.

A unit test exercises the rotation path (write cert, rotate on disk,
assert the callback returns the new cert) with a short refresh window.

* refactor(security): route filer/s3 HTTPS through the shared cert reloader

Before: filer.go and s3.go each kept a *certprovider.Provider on the
options struct plus a duplicated GetCertificateWithUpdate method. Both
loaded pemfile themselves. Behaviorally they already reloaded, but the
logic was duplicated in two places and neither path was shared with the
newly added master/volume/webdav/admin wiring.

After: both use security.NewReloadingServerCertificate like the other
servers. The per-struct certProvider field and GetCertificateWithUpdate
method are removed, along with the now-unused certprovider and pemfile
imports. Net: -32 lines, one code path for all HTTPS cert reloading.

No behavior change — the refresh window, cache, and handshake contract
are identical (the helper wraps the same pemfile.NewProvider).

* feat(security): hot-reload HTTPS client certs for mount/backup/upload/etc

The HTTP client in weed/util/http/client loaded the mTLS client cert
once at startup via tls.LoadX509KeyPair. That left every long-lived
HTTPS client process (weed mount, backup, filer.copy, filer→volume,
s3→filer/volume) unable to pick up a rotated client cert without a
restart — even though the same cert-manager setup was already rotating
the server side fine.

Swap the client cert loader for a tls.Config.GetClientCertificate
callback backed by the same refreshing pemfile provider. New TLS
handshakes pick up the rotated cert; in-flight pooled connections keep
their old cert and drop as normal transport churn happens.

To keep this reusable from both server and client TLS code without an
import cycle (weed/security already imports weed/util/http/client for
LoadHTTPClientFromFile), extract the pemfile wrapper into a new
weed/security/certreload subpackage. weed/security keeps its thin
NewReloadingServerCertificate wrapper. The existing unit test moves
with the implementation.

gRPC mTLS was already handled by security.LoadServerTLS /
LoadClientTLS; this PR does not change any gRPC paths. MQ broker, MQ
agent, Kafka gateway, and FUSE mount control plane are gRPC-only and
therefore already rotate.

CA bundles (ClientCAs / RootCAs / grpc.ca) are still loaded once — noted
as a known limitation in the wiki.

* fix(security): address PR review feedback on cert reloader

Bots (gemini-code-assist + coderabbit) flagged three real issues and a
couple of nits. Addressing them here:

1. KeyMaterial used context.Background(). The grpc pemfile provider's
   KeyMaterial blocks until material arrives or the context deadline
   expires; with Background() a slow disk could hang the TLS handshake
   indefinitely. Switched both the server and client callbacks to use
   hello.Context() / cri.Context() so a stuck read is bounded by the
   handshake timeout.

2. Admin server loaded TLS inside the serve goroutine. If the cert was
   bad, the goroutine returned but startAdminServer kept blocking on
   <-ctx.Done() with no listener, making the process look healthy with
   nothing bound. Moved TLS setup to run before the goroutine starts
   and to propagate errors via fmt.Errorf; it also captures the
   provider and defers Close().

3. HTTP client discarded the certprovider.Provider from
   NewClientGetCertificate. That leaked the refresh goroutine, and
   NewHttpClientWithTLS had a worse case where a CA-file failure after
   provider creation orphaned the provider entirely. Added a
   certProvider field and a Close() method on HTTPClient, and made
   the constructors close the provider on subsequent error paths.

4. Server-side paths (master/volume/filer/s3/webdav/admin) now retain
   the provider. filer and webdav run ServeTLS synchronously, so a
   plain defer works. master/volume/s3 dispatch goroutines and return
   while the server keeps running, so they hook Close() into
   grace.OnInterrupt.

5. Test: certreload_test now tolerates transient read/parse errors
   during file rotation (writeSelfSigned rewrites cert before key) and
   reports the last error only if the deadline expires.

No user-visible behavior change for the happy path.

* test(tls): add end-to-end HTTPS cert rotation integration test

Boots a real `weed master` with HTTPS enabled, captures the leaf cert
served at TLS handshake time, atomically rewrites the cert/key files
on disk (the same rename-in-place pattern kubelet does when it swaps
a cert-manager Secret), and asserts that a subsequent TLS handshake
observes the rotated leaf — with no process restart, no SIGHUP, no
reloader sidecar. Verifies the full path: on-disk change → pemfile
refresh tick → provider.KeyMaterial → tls.Config.GetCertificate →
server TLS handshake.

Runtime is ~1s by exposing the reloader's refresh window as an env
var (WEED_TLS_CERT_REFRESH_INTERVAL) and setting it to 500ms for the
test. The same env var is user-facing — documented in the wiki — so
operators running short-lived certs (Vault, cert-manager with
duration: 24h, etc.) can tighten the rotation-pickup window without a
rebuild. Defaults to 5h to preserve prior behavior.

security.CredRefreshingInterval is kept for API compatibility but now
aliases certreload.DefaultRefreshInterval so the same env controls
both gRPC mTLS and HTTPS reload.

* ci(tls): wire the TLS rotation integration test into GitHub Actions

Mirrors the existing vacuum-integration-tests.yml shape: Ubuntu runner,
Go 1.25, build weed, run `go test` in test/tls_rotation, upload master
logs on failure. 10-minute job timeout; the test itself finishes in
about a second because WEED_TLS_CERT_REFRESH_INTERVAL is set to 500ms
inside the test.

Runs on every push to master and on every PR to master.

* fix(tls): address follow-up PR review comments

Three new comments on the integration test + volume shutdown path:

1. Test: peekServerCert was swallowing every dial/handshake error,
   which meant waitForCert's "last err: <nil>" fatal message lost all
   diagnostic value. Thread errors back through: peekServerCert now
   returns (*x509.Certificate, error), and waitForCert records the
   latest error so a CI flake points at the actual cause (master
   didn't come up, handshake rejected, CA pool mismatch, etc.).

2. Test: set HOME=<tempdir> on the master subprocess. Viper today
   registers the literal path "$HOME/.seaweedfs" without env
   expansion, so a developer's ~/.seaweedfs/security.toml is
   accidentally invisible — the test was relying on that. Pinning
   HOME is belt-and-braces against a future viper upgrade that does
   expand env vars.

3. volume.go: startClusterHttpService's provider close was registered
   via grace.OnInterrupt, which fires on SIGTERM but NOT on the
   v.shutdownCtx.Done() path used by mini / integration tests. The
   pemfile refresh goroutine leaked in that shutdown path. Now the
   helper returns a close func and the caller invokes it on BOTH
   shutdown paths for parity.
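
When a close func is invoked from more than one shutdown path, as in item 3, it helps if calling it twice is harmless. A minimal sketch of that idea follows; `onceCloser` is a hypothetical helper, and whether the actual volume.go close func is idempotent is not stated here.

```go
package main

import "sync"

// onceCloser wraps closeFn so that it runs at most once, no matter how many
// shutdown paths end up calling the returned func.
func onceCloser(closeFn func()) func() {
	var once sync.Once
	return func() { once.Do(closeFn) }
}
```

Both the signal handler and the context-cancellation path can then call the same returned func without coordinating with each other.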

Also add MinVersion: TLS 1.2 to the test's tls.Config to quiet the
ast-grep static-analysis nit — zero-risk since the pool only trusts
our in-memory CA.

Test runs clean 3/3.
2026-04-21 20:20:11 -07:00

package client

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"io"
	"net/http"
	"net/url"
	"os"
	"strings"
	"sync"

	"google.golang.org/grpc/credentials/tls/certprovider"

	"github.com/seaweedfs/seaweedfs/weed/security/certreload"
	util "github.com/seaweedfs/seaweedfs/weed/util"
	"github.com/spf13/viper"
)

var (
	loadSecurityConfigOnce sync.Once
)

type HTTPClient struct {
	Client            *http.Client
	Transport         *http.Transport
	expectHttpsScheme bool
	// certProvider, when non-nil, owns a background refresh goroutine for
	// the client mTLS cert/key pair. Close() must be called to stop it.
	certProvider certprovider.Provider
}

// Close stops any background cert refresh goroutine. Safe to call on a
// client that was constructed without mTLS. Idle pooled connections are
// also closed via CloseIdleConnections; connections still in use are not
// affected.
func (httpClient *HTTPClient) Close() {
	if httpClient == nil {
		return
	}
	if httpClient.certProvider != nil {
		httpClient.certProvider.Close()
		httpClient.certProvider = nil
	}
	if httpClient.Client != nil {
		httpClient.Client.CloseIdleConnections()
	}
}

func (httpClient *HTTPClient) Do(req *http.Request) (*http.Response, error) {
	req.URL.Scheme = httpClient.GetHttpScheme()
	return httpClient.Client.Do(req)
}

func (httpClient *HTTPClient) Get(url string) (resp *http.Response, err error) {
	url, err = httpClient.NormalizeHttpScheme(url)
	if err != nil {
		return nil, err
	}
	return httpClient.Client.Get(url)
}

func (httpClient *HTTPClient) Post(url, contentType string, body io.Reader) (resp *http.Response, err error) {
	url, err = httpClient.NormalizeHttpScheme(url)
	if err != nil {
		return nil, err
	}
	return httpClient.Client.Post(url, contentType, body)
}

func (httpClient *HTTPClient) PostForm(url string, data url.Values) (resp *http.Response, err error) {
	url, err = httpClient.NormalizeHttpScheme(url)
	if err != nil {
		return nil, err
	}
	return httpClient.Client.PostForm(url, data)
}

func (httpClient *HTTPClient) Head(url string) (resp *http.Response, err error) {
	url, err = httpClient.NormalizeHttpScheme(url)
	if err != nil {
		return nil, err
	}
	return httpClient.Client.Head(url)
}

func (httpClient *HTTPClient) CloseIdleConnections() {
	httpClient.Client.CloseIdleConnections()
}

func (httpClient *HTTPClient) GetClientTransport() *http.Transport {
	return httpClient.Transport
}

func (httpClient *HTTPClient) GetHttpScheme() string {
	if httpClient.expectHttpsScheme {
		return "https"
	}
	return "http"
}

// NormalizeHttpScheme prepends the expected scheme to a scheme-less URL and
// rewrites an explicit http/https scheme that does not match the client.
func (httpClient *HTTPClient) NormalizeHttpScheme(rawURL string) (string, error) {
	expectedScheme := httpClient.GetHttpScheme()
	if !(strings.HasPrefix(rawURL, "http://") || strings.HasPrefix(rawURL, "https://")) {
		return expectedScheme + "://" + rawURL, nil
	}
	parsedURL, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	if expectedScheme != parsedURL.Scheme {
		parsedURL.Scheme = expectedScheme
	}
	return parsedURL.String(), nil
}

func NewHttpClient(clientName ClientName, opts ...HttpClientOpt) (*HTTPClient, error) {
	httpClient := HTTPClient{}
	httpClient.expectHttpsScheme = checkIsHttpsClientEnabled(clientName)
	var tlsConfig *tls.Config
	if httpClient.expectHttpsScheme {
		certFileName, keyFileName, hasClientCert, err := clientCertPaths(clientName)
		if err != nil {
			return nil, err
		}
		clientCaCert, clientCaCertName, err := getClientCaCert(clientName)
		if err != nil {
			return nil, err
		}
		if hasClientCert || len(clientCaCert) != 0 {
			caCertPool, err := createHTTPClientCertPool(clientCaCert, clientCaCertName)
			if err != nil {
				return nil, err
			}
			tlsConfig = &tls.Config{
				RootCAs:            caCertPool,
				InsecureSkipVerify: false,
			}
			if hasClientCert {
				// The callback re-reads the cert/key pair on rotation; the
				// provider owns the refresh goroutine and is stopped by Close().
				getClientCert, provider, err := certreload.NewClientGetCertificate(certFileName, keyFileName)
				if err != nil {
					return nil, fmt.Errorf("error loading client certificate and key: %w", err)
				}
				tlsConfig.GetClientCertificate = getClientCert
				httpClient.certProvider = provider
			}
		}
		if getBoolOptionFromSecurityConfiguration(clientName, "insecure_skip_verify") {
			if tlsConfig == nil {
				tlsConfig = &tls.Config{}
			}
			tlsConfig.InsecureSkipVerify = true
		}
	}
	httpClient.Transport = &http.Transport{
		MaxIdleConns:        1024,
		MaxIdleConnsPerHost: 1024,
		TLSClientConfig:     tlsConfig,
	}
	httpClient.Client = &http.Client{
		Transport: httpClient.Transport,
	}
	for _, opt := range opts {
		opt(&httpClient)
	}
	return &httpClient, nil
}

func getStringOptionFromSecurityConfiguration(clientName ClientName, stringOptionName string) string {
	util.LoadSecurityConfiguration()
	return viper.GetString(fmt.Sprintf("https.%s.%s", clientName.LowerCaseString(), stringOptionName))
}

func getBoolOptionFromSecurityConfiguration(clientName ClientName, boolOptionName string) bool {
	util.LoadSecurityConfiguration()
	return viper.GetBool(fmt.Sprintf("https.%s.%s", clientName.LowerCaseString(), boolOptionName))
}

func checkIsHttpsClientEnabled(clientName ClientName) bool {
	return getBoolOptionFromSecurityConfiguration(clientName, "enabled")
}

func getFileContentFromSecurityConfiguration(clientName ClientName, fileType string) ([]byte, string, error) {
	if fileName := getStringOptionFromSecurityConfiguration(clientName, fileType); fileName != "" {
		fileContent, err := os.ReadFile(fileName)
		if err != nil {
			return nil, fileName, err
		}
		return fileContent, fileName, nil
	}
	return nil, "", nil
}

// clientCertPaths reads the https.<clientName>.{cert,key} paths from the
// security config, validates they're either both set or both empty, and
// returns them along with a hasClientCert flag. Loading is deferred to
// certreload so the cert/key pair is picked up from disk on rotation.
func clientCertPaths(clientName ClientName) (certFile, keyFile string, hasClientCert bool, err error) {
	certFile = getStringOptionFromSecurityConfiguration(clientName, "cert")
	keyFile = getStringOptionFromSecurityConfiguration(clientName, "key")
	if certFile == "" && keyFile == "" {
		return "", "", false, nil
	}
	if certFile == "" || keyFile == "" {
		return "", "", false, fmt.Errorf("https.%s: both cert and key must be set (got cert=%q key=%q)", clientName.LowerCaseString(), certFile, keyFile)
	}
	return certFile, keyFile, true, nil
}

func getClientCaCert(clientName ClientName) ([]byte, string, error) {
	return getFileContentFromSecurityConfiguration(clientName, "ca")
}

// NewHttpClientWithTLS creates an HTTPClient with explicit TLS certificate
// parameters instead of reading from the global security configuration.
// This is used by filer.sync to create per-cluster HTTP clients when clusters
// use different certificates.
func NewHttpClientWithTLS(certFile, keyFile, caFile string, insecureSkipVerify bool, opts ...HttpClientOpt) (*HTTPClient, error) {
	httpClient := HTTPClient{}
	httpClient.expectHttpsScheme = true
	var tlsConfig *tls.Config
	if (certFile == "") != (keyFile == "") {
		return nil, fmt.Errorf("both cert and key are required for mTLS, got cert=%q key=%q", certFile, keyFile)
	}
	var getClientCert func(*tls.CertificateRequestInfo) (*tls.Certificate, error)
	if certFile != "" && keyFile != "" {
		cb, provider, err := certreload.NewClientGetCertificate(certFile, keyFile)
		if err != nil {
			return nil, fmt.Errorf("error loading client certificate and key: %w", err)
		}
		getClientCert = cb
		httpClient.certProvider = provider
	}
	// closeProviderOnError ensures the cert reloader's background refresh
	// goroutine is shut down if any subsequent step fails before we hand
	// the client back to the caller.
	closeProviderOnError := func() {
		if httpClient.certProvider != nil {
			httpClient.certProvider.Close()
			httpClient.certProvider = nil
		}
	}
	var caCertPool *x509.CertPool
	if caFile != "" {
		caCert, err := os.ReadFile(caFile)
		if err != nil {
			closeProviderOnError()
			return nil, fmt.Errorf("error reading CA cert %s: %w", caFile, err)
		}
		caCertPool, err = createHTTPClientCertPool(caCert, caFile)
		if err != nil {
			closeProviderOnError()
			return nil, err
		}
	}
	if getClientCert != nil || caCertPool != nil || insecureSkipVerify {
		tlsConfig = &tls.Config{
			GetClientCertificate: getClientCert,
			RootCAs:              caCertPool,
			InsecureSkipVerify:   insecureSkipVerify,
		}
	}
	httpClient.Transport = &http.Transport{
		MaxIdleConns:        1024,
		MaxIdleConnsPerHost: 1024,
		TLSClientConfig:     tlsConfig,
	}
	httpClient.Client = &http.Client{
		Transport: httpClient.Transport,
	}
	for _, opt := range opts {
		opt(&httpClient)
	}
	return &httpClient, nil
}

// createHTTPClientCertPool builds a cert pool from PEM content. Empty
// content yields an empty pool; unparsable content is an error.
func createHTTPClientCertPool(certContent []byte, fileName string) (*x509.CertPool, error) {
	certPool := x509.NewCertPool()
	if len(certContent) == 0 {
		return certPool, nil
	}
	ok := certPool.AppendCertsFromPEM(certContent)
	if !ok {
		return nil, fmt.Errorf("error processing certificate in %s", fileName)
	}
	return certPool, nil
}