112 Commits

Author SHA1 Message Date
Chris Lu
25beb7ec48 admin: expose Prometheus metrics (#9652)
* admin: add -metricsPort flag to expose Prometheus metrics

The admin command had no metrics endpoint, so passing -metricsPort
(as the operator does for spec.admin.metricsPort) crashed the process
with "flag provided but not defined". Wire up -metricsPort/-metricsIp
and start the shared Prometheus metrics server, matching filer, master,
and volume.

* admin: emit maintenance task and worker fleet metrics

Add Prometheus metrics for the admin server's distinctive work: the
maintenance task queue and the worker fleet that executes it.

Task lifecycle: maintenance_tasks_by_status / _by_type gauges (snapshot
of the queue), maintenance_tasks_completed_total{type,outcome} counter
and maintenance_task_duration_seconds{type} histogram (recorded when a
task reaches a terminal state), and last/next scan timestamp gauges.

Worker fleet: workers_connected and worker_slots{used,max} gauges, plus
worker_events_total{event} counting register/unregister/stale removals.

Gauges are snapshotted by a background goroutine on the admin server;
counters and the histogram are recorded at their event sites.

* admin: read worker slot totals under lock, clear next-scan gauge when idle

GetWorkers returns live worker pointers; summing CurrentLoad/MaxConcurrent
outside the queue lock races with task assignment and completion. Add
GetWorkerSlotTotals to aggregate under the lock.

Also reset maintenance_next_scan_timestamp_seconds to 0 when the scanner
is not running, so it can't retain a stale value after a stop.
2026-05-24 14:09:02 -07:00
Chris Lu
d5e54f217d feat(s3/lifecycle): publish per-shard cursor + walker gauges and heartbeat (#9486)
Operator visibility was the last item on the daily-replay must-have
list. The `S3LifecycleCursorMinTsNs` gauge already existed but nothing
ever set it — leftover from the streaming worker that got deleted.
Wire it up and add a parallel one for the walker so a single PromQL
query answers "is this thing working?":

- `cursor_min_ts_ns{shard}` set after each cursor save. Operators read
  `now - cursor_min_ts_ns` as the per-shard replay lag.
- `daily_run_last_walked_ns{shard}` new — set in parallel so operators
  can confirm WalkerInterval is actually being honored. A stuck value
  means the scheduler isn't invoking the worker, the throttle is too
  long, or the walker is failing.
- saveCursorAndPublish wraps every Save call site in runShard so the
  gauges and the persisted state stay aligned (gauges only advance on
  successful saves).
- Enhance the `daily_run: status=... duration=...` heartbeat with
  `cursor_lag_max=` and `walked_max_age=` summary tokens for ops grep.
  Existing tokens stay positional-stable; new ones append at the end.
  Marker `cold` distinguishes "not started" from "0s caught up."

Tests pin the summary line: cold-start state, max-across-shards
selection, and partial-fill (some shards drained, others walked).

Stacked on #9485.
2026-05-13 14:18:35 -07:00
Chris Lu
495632730c feat(s3/lifecycle): daily-replay observability — metrics + summary log (Phase 6) (#9462)
* feat(s3/lifecycle): daily-replay observability metrics + per-run summary log

Operators have no Prometheus signal today for the daily_replay path
beyond the cluster-rate-limiter wait histogram. Phase 6 adds the
three baseline questions: how long does a shard take, how many events
did it scan, and what did dispatch produce.

  - S3LifecycleDailyRunShardDurationSeconds (histogram, label=shard):
    wall-clock per shard. p95 climbing toward MaxRuntime means the
    shard is brushing its budget.
  - S3LifecycleDailyRunEventsScanned (counter, label=shard): meta-log
    events drainShardEvents processed. Pairs with the duration so a
    spike in events-per-shard correlates with a slow shard.
  - S3LifecycleDispatchCounter (existing, reused): processMatches now
    increments this with the outcome label, so streaming and
    daily_replay paths share one outcome view. Transport errors are
    counted under outcome="TRANSPORT_ERROR".

dailyrun.Run logs a per-run summary at V(0): status / shards /
errors / duration. The summary is the at-a-glance line operators read
in /var/log to confirm a run completed.

Test pins the dispatch-counter increment with a unique
bucket/kind/outcome triple so a refactor that drops the
instrumentation call surfaces as a test failure.

* fix(s3/lifecycle): align dispatch error label + clean test labels

Two PR-9462 review fixes from gemini:

1. processMatches' transport-failure label was "TRANSPORT_ERROR";
   streaming's dispatcher uses "RPC_ERROR" for the same condition
   (see dispatcher/dispatcher.go). Use "RPC_ERROR" here too so
   the same Prometheus query covers both delete paths.

2. The dispatch-counter assertion test now deletes its label row
   on exit so the in-process Prometheus registry doesn't accumulate
   per-test state across the suite.
2026-05-12 12:15:20 -07:00
Chris Lu
884b0bcbfd feat(s3/lifecycle): cluster rate-limit allocation (Phase 3) (#9456)
* feat(s3/lifecycle): cluster rate-limit allocation (Phase 3)

Admin computes a per-worker share of cluster_deletes_per_second at
ExecuteJob time and ships it to the worker via
ClusterContext.Metadata. The worker reads the share, constructs a
golang.org/x/time/rate.Limiter, and passes it to dailyrun.Run via
cfg.Limiter (Phase 2 already plumbed the field). Phase 5 deletes the
streaming path; until then streaming ignores the cap.

Why allocate at admin: the cluster cap is a single knob operators
care about. Dividing it locally per worker would either need
out-of-band coordination or accept N× the configured budget. Admin
is the only party that knows how many execute-capable workers there
are, so it owns the math.

Admin side (weed/admin/plugin):
- Registry.CountCapableExecutors(jobType) returns the number of
  non-stale workers with CanExecute=true.
- New file cluster_rate_limit.go: decorateClusterContextForJob clones
  the input ClusterContext and injects two metadata keys for
  s3_lifecycle. cloneClusterContext duplicates Metadata so per-job
  decoration doesn't race shared base state.
- executeJobWithExecutor calls the decorator after loading the admin
  config; other job types pass through unchanged.

Worker side (weed/worker/tasks/s3_lifecycle):
- New cluster_rate_limit.go declares the constants both sides agree
  on (admin-config field names, metadata keys). Plain strings on the
  admin side keep weed/admin/plugin free of a dependency on the
  s3_lifecycle worker package; the two sets of constants are pinned
  to identical values and a mismatch would silently disable rate
  limiting.
- handler.go executeDailyReplay reads ClusterContext.Metadata,
  builds a rate.Limiter, and passes it into dailyrun.Config{Limiter}.
  Missing/empty/non-positive values → no limiter (legacy unlimited
  behavior). burst defaults to 2 × rate, clamped to ≥1 to avoid a
  bucket that never refills.
- Admin form gains two fields under "Scope": cluster_deletes_per_second
  (rate, 0 = unlimited) and cluster_deletes_burst (0 = 2 × rate).

Metric:
- New S3LifecycleDispatchLimiterWaitSeconds histogram observes how
  long each Limiter.Wait blocks before a LifecycleDelete RPC.
  Operators tune the cap by reading p95 — near-zero means the cap
  isn't binding, a long tail at 1/rate means it is.

Tests:
- weed/admin/plugin/cluster_rate_limit_test.go: 9 cases covering
  pass-through for non-allocator job types, rps=0 / no-executors
  skip, even sharing, burst sharing, burst=0 omit (worker default
  kicks in), burst floor of 1, no mutation of input metadata, nil
  input.
- weed/worker/tasks/s3_lifecycle/cluster_rate_limit_test.go: 7 cases
  covering nil/empty/missing metadata, non-positive/invalid rate,
  positive rate builds correctly, burst missing defaults to 2× rate,
  tiny rate clamps burst to ≥1.

Build clean. Phase 2 (#9446) and Phase 4 engine (#9447) are the
parents; this branch stacks on Phase 2 since it consumes
dailyrun.Config{Limiter} which lands there.

* fix(s3/lifecycle): divide cluster budget by active workers, not all capable

gemini pointed out that s3_lifecycle has MaxJobsPerDetection=1
(handler.go:189) — it's a singleton job, only one worker is ever active.
Dividing the cluster_deletes_per_second budget by the count of capable
executors gave the single active worker just 1/N of the configured cap.

Pass adminRuntime.MaxJobsPerDetection through to the decorator. Divisor
is now min(executors, maxJobsPerDetection), clamped to >=1. For
s3_lifecycle (maxJobs=1) the active worker gets the full budget; for a
hypothetical parallel-dispatch job (maxJobs>1) the budget divides
across the running-set.

Tests swap the SharedEvenly case for two pinned scenarios:
  - SingletonJobGetsFullBudget: maxJobs=1 across 4 executors => 100/1
  - SharedEvenlyWhenParallelLimited: maxJobs=4 across 4 executors => 25/worker
  - MaxJobsExceedsExecutors: maxJobs=10 across 4 executors => divisor 4

* feat(s3/lifecycle): drop Worker Count knob from admin config form

The "Worker Count" admin field controlled in-process pipeline goroutines
across the 16-shard space — per-worker tuning, not a cluster-wide scope
concern. Operators looking at the form alongside Cluster Delete Rate
reasonably misread it as the number of workers in the cluster.

Drop the form field and DefaultValues entry. cfg.Workers is now hardcoded
to shardPipelineGoroutines (=1) inside ParseConfig; the rest of the
plumbing through dailyrun.Config.Workers stays so a future need can
re-introduce it as a worker-local knob (or just bump the constant).

handler_test.go pins that "workers" must NOT appear in the form so the
removal doesn't silently regress.
2026-05-11 19:17:06 -07:00
Chris Lu
af2a359e45 feat(s3/lifecycle): metadata_only_total Prometheus counter (#9399)
Operator-visible signal for the metadata-only delete path landed in
PR 9390. Increment seaweedfs_s3_lifecycle_metadata_only_total{bucket,
rule_hash} after each successful unversioned or noncurrent / expired-
marker delete that took the skip-chunk path. Suspended-versioning
null delete is intentionally not counted: that path's nil err can
mean "deleted" or "NotFound", so a count there would over-report.
rule_hash is hex-encoded for label safety; nil bytes collapse to
"". DeleteBucketMetrics tears the new series down alongside the
existing lifecycle counters when a bucket is removed.
2026-05-09 20:02:26 -07:00
Chris Lu
e55db58ca9 feat(s3/lifecycle): expose Prometheus metrics (Phase 7) (#9375)
* feat(s3/lifecycle): expose Prometheus metrics (Phase 7)

Five new gauges/counters under the s3_lifecycle subsystem so operators
can see what the worker is doing without grepping logs:

- dispatch_total{bucket,kind,outcome} — every LifecycleDelete RPC
  bumps this. Outcome is the proto enum name (DONE, NOOP_RESOLVED,
  RETRY_LATER, BLOCKED, …) plus a synthetic "RPC_ERROR" for transport
  failures classified as RETRY_LATER.
- schedule_depth{shard} — pending matches in each shard's schedule,
  sampled on the dispatcher tick.
- cursor_min_ts_ns{shard} — per-shard min cursor timestamp; lag is
  derived as (now - min) by the scrape side.
- events_total{shard} — meta-log events the reader fed to the router.
- bootstrap_dispatch_total{bucket,kind} — bootstrap-walk dispatches.

Test asserts the dispatch counter increments for both DONE and
RPC_ERROR paths.

* fix(stats): purge lifecycle bucket label series in DeleteBucketMetrics

The two new bucket-labeled lifecycle counters
(S3LifecycleDispatchCounter, S3LifecycleBootstrapDispatchCounter)
weren't included in DeleteBucketMetrics, so explicit bucket teardown
left their label series behind — same cardinality leak the existing
counters above already avoid. Tack them onto the same DeletePartialMatch
chain.
2026-05-08 17:49:10 -07:00
Lisandro Pin
3f3aaa7cc8 Export Prometheus metrics for scrubbing operations. (#9264)
This PR introduces three new metrics...

  - `scrub_last_time_seconds`
  - `scrub_volume_failures`
  - `scrub_shard_failures`

...capturing overall volume scrub results, and allowing to construct alerts
and dashboards to monitor scrubbing progress.

Note that these metrics are aggregated at the volume/EC shard level, and not
intended for fine-grained tracking of scrubbing operations.
2026-04-28 12:34:02 -07:00
Lisandro Pin
2c404f66bc Export file_read_invalid_needles metric for REST read requests on invalid file IDs. (#9241)
Provides a straightforward metric to count read requests with incorrect file/needle IDs,
which can indicate client issues.

Note that the metric does not cover gRPC calls, as the current proto service API
does not support seeking files by ID.
2026-04-27 12:22:42 -07:00
Lisandro Pin
fff243d463 Export gRPC file_{read,write}_failures metrics on volume servers. (#9177)
Allows to track overall R/W errors in real time through Prometheus.
Will follow up with a PR for Seaweed's REST API.

Co-authored-by: Lisandro Pin <lisandro.pin@proton.ch>
2026-04-22 11:22:21 -07:00
Lisandro Pin
6bcacedda9 Export master_disconnections metrics on volume servers. (#9104)
This allows to track connection issues and master failovers in real time via Prometheus metrics.

Co-authored-by: Lisandro Pin <lisandro.pin@proton.ch>
2026-04-17 15:15:26 -07:00
Lisandro Pin
67a2810d2d Export start_time_seconds metrics on both master & volume servers. (#9046)
These are to be used to track uptimes.

See https://github.com/seaweedfs/seaweedfs/issues/8535 for details.

Co-authored-by: Lisandro Pin <lisandro.pin@proton.ch>
2026-04-13 09:34:08 -07:00
Chris Lu
69218c88fe fix(stats): fix build on openbsd, solaris, and windows (#8951)
fix(stats): replace undefined calculateDiskRemaining with inline calculation

disk_openbsd.go, disk_solaris.go, and disk_windows.go all call
calculateDiskRemaining() which is never defined, causing build failures
on those platforms. Replace with the same inline calculation used in
disk_supported.go.
2026-04-06 12:48:48 -07:00
Chris Lu
995dfc4d5d chore: remove ~50k lines of unreachable dead code (#8913)
* chore: remove unreachable dead code across the codebase

Remove ~50,000 lines of unreachable code identified by static analysis.

Major removals:
- weed/filer/redis_lua: entire unused Redis Lua filer store implementation
- weed/wdclient/net2, resource_pool: unused connection/resource pool packages
- weed/plugin/worker/lifecycle: unused lifecycle plugin worker
- weed/s3api: unused S3 policy templates, presigned URL IAM, streaming copy,
  multipart IAM, key rotation, and various SSE helper functions
- weed/mq/kafka: unused partition mapping, compression, schema, and protocol functions
- weed/mq/offset: unused SQL storage and migration code
- weed/worker: unused registry, task, and monitoring functions
- weed/query: unused SQL engine, parquet scanner, and type functions
- weed/shell: unused EC proportional rebalance functions
- weed/storage/erasure_coding/distribution: unused distribution analysis functions
- Individual unreachable functions removed from 150+ files across admin,
  credential, filer, iam, kms, mount, mq, operation, pb, s3api, server,
  shell, storage, topology, and util packages

* fix(s3): reset shared memory store in IAM test to prevent flaky failure

TestLoadIAMManagerFromConfig_EmptyConfigWithFallbackKey was flaky because
the MemoryStore credential backend is a singleton registered via init().
Earlier tests that create anonymous identities pollute the shared store,
causing LookupAnonymous() to unexpectedly return true.

Fix by calling Reset() on the memory store before the test runs.

* style: run gofmt on changed files

* fix: restore KMS functions used by integration tests

* fix(plugin): prevent panic on send to closed worker session channel

The Plugin.sendToWorker method could panic with "send on closed channel"
when a worker disconnected while a message was being sent. The race was
between streamSession.close() closing the outgoing channel and sendToWorker
writing to it concurrently.

Add a done channel to streamSession that is closed before the outgoing
channel, and check it in sendToWorker's select to safely detect closed
sessions without panicking.
2026-04-03 16:04:27 -07:00
Chris Lu
5fa5507234 Add Prometheus metric to count upload errors (#8788)
Add Prometheus metric to count upload errors (#8775)

Add SeaweedFS_upload_error_total counter labeled by HTTP status code,
so operators can alert on write/replication failures. Code "0" indicates
a transport error (no HTTP response received).

Also add an "Upload Errors" panel to the Grafana dashboard.
2026-03-26 16:58:05 -07:00
Chris Lu
0a5c5ed4ce Persist S3 bucket counter metrics across idle periods (#8595)
* Stop deleting counter metrics during bucket TTL cleanup

Counter metrics (traffic bytes, request counts, object counts) are
monotonically increasing by design. Deleting them after 10 minutes of
bucket inactivity causes them to vanish from /metrics output and reset
to zero when traffic resumes, breaking Prometheus rate()/increase()
queries and making historical traffic reporting impossible.

Only delete gauges and histograms in the TTL cleanup loop, as these
represent current state and are safely re-populated on next activity.

Fixes https://github.com/seaweedfs/seaweedfs/issues/8521

* Clean up all bucket metrics on bucket deletion

Add DeleteBucketMetrics() to delete all metrics (including counters)
for a bucket when it is explicitly deleted. This prevents unbounded
label cardinality from accumulating for buckets that no longer exist.

Called from DeleteBucketHandler after successful bucket deletion.

* Reduce mutex scope in bucket metrics TTL sweep

Collect expired bucket names under the lock, then release before
calling DeletePartialMatch on Prometheus metrics. This prevents
RecordBucketActiveTime from blocking during the expensive cleanup.
2026-03-10 19:00:40 -07:00
Chris Lu
0a2dac1e56 Reduce mutex scope in bucket metrics TTL sweep
Collect expired bucket names under the lock, then release before
calling DeletePartialMatch on Prometheus metrics. This prevents
RecordBucketActiveTime from blocking during the expensive cleanup.
2026-03-10 18:43:35 -07:00
Chris Lu
e4b70c2521 go fix 2026-02-20 18:42:00 -08:00
promalert
9012069bd7 chore: execute goimports to format the code (#7983)
* chore: execute goimports to format the code

Signed-off-by: promalert <promalert@outlook.com>

* goimports -w .

---------

Signed-off-by: promalert <promalert@outlook.com>
Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-01-07 13:06:08 -08:00
Chris Lu
f5c666052e feat: add S3 bucket size and object count metrics (#7776)
* feat: add S3 bucket size and object count metrics

Adds periodic collection of bucket size metrics:
- SeaweedFS_s3_bucket_size_bytes: logical size (deduplicated across replicas)
- SeaweedFS_s3_bucket_physical_size_bytes: physical size (including replicas)
- SeaweedFS_s3_bucket_object_count: object count (deduplicated)

Collection runs every 1 minute via background goroutine that queries
filer Statistics RPC for each bucket's collection.

Also adds Grafana dashboard panels for:
- S3 Bucket Size (logical vs physical)
- S3 Bucket Object Count

* address PR comments: fix bucket size metrics collection

1. Fix collectCollectionInfoFromMaster to use master VolumeList API
   - Now properly queries master for topology info
   - Uses WithMasterClient to get volume list from master
   - Correctly calculates logical vs physical size based on replication

2. Return error when filerClient is nil to trigger fallback
   - Changed from 'return nil, nil' to 'return nil, error'
   - Ensures fallback to filer stats is properly triggered

3. Implement pagination in listBucketNames
   - Added listBucketPageSize constant (1000)
   - Uses StartFromFileName for pagination
   - Continues fetching until fewer entries than limit returned

4. Handle NewReplicaPlacementFromByte error and prevent division by zero
   - Check error return from NewReplicaPlacementFromByte
   - Default to 1 copy if error occurs
   - Add explicit check for copyCount == 0

* simplify bucket size metrics: remove filer fallback, align with quota enforcement

- Remove fallback to filer Statistics RPC
- Use only master topology for collection info (same as s3.bucket.quota.enforce)
- Updated comments to clarify this runs the same collection logic as quota enforcement
- Simplified code by removing collectBucketSizeFromFilerStats

* use s3a.option.Masters directly instead of querying filer

* address PR comments: fix dashboard overlaps and improve metrics collection

Grafana dashboard fixes:
- Fix overlapping panels 55 and 59 in grafana_seaweedfs.json (moved 59 to y=30)
- Fix grid collision in k8s dashboard (moved panel 72 to y=48)
- Aggregate bucket metrics with max() by (bucket) for multi-instance S3 gateways

Go code improvements:
- Add graceful shutdown support via context cancellation
- Use ticker instead of time.Sleep for better shutdown responsiveness
- Distinguish EOF from actual errors in stream handling

* improve bucket size metrics: multi-master failover and proper error handling

- Initial delay now respects context cancellation using select with time.After
- Use WithOneOfGrpcMasterClients for multi-master failover instead of hardcoding Masters[0]
- Properly propagate stream errors instead of just logging them (EOF vs real errors)

* improve bucket size metrics: distributed lock and volume ID deduplication

- Add distributed lock (LiveLock) so only one S3 instance collects metrics at a time
- Add IsLocked() method to LiveLock for checking lock status
- Fix deduplication: use volume ID tracking instead of dividing by copyCount
  - Previous approach gave wrong results if replicas were missing
  - Now tracks seen volume IDs and counts each volume only once
- Physical size still includes all replicas for accurate disk usage reporting

* rename lock to s3.leader

* simplify: remove StartBucketSizeMetricsCollection wrapper function

* fix data race: use atomic operations for LiveLock.isLocked field

- Change isLocked from bool to int32
- Use atomic.LoadInt32/StoreInt32 for all reads/writes
- Sync shared isLocked field in StartLongLivedLock goroutine

* add nil check for topology info to prevent panic

* fix bucket metrics: use Ticker for consistent intervals, fix pagination logic

- Use time.Ticker instead of time.After for consistent interval execution
- Fix pagination: count all entries (not just directories) for proper termination
- Update lastFileName for all entries to prevent pagination issues

* address PR comments: remove redundant atomic store, propagate context

- Remove redundant atomic.StoreInt32 in StartLongLivedLock (AttemptToLock already sets it)
- Propagate context through metrics collection for proper cancellation on shutdown
  - collectAndUpdateBucketSizeMetrics now accepts ctx
  - collectCollectionInfoFromMaster uses ctx for VolumeList RPC
  - listBucketNames uses ctx for ListEntries RPC
2025-12-15 19:23:25 -08:00
Chris Lu
848bec6d24 Metrics: Add Prometheus metrics for concurrent upload tracking (#7555)
* metrics: add Prometheus metrics for concurrent upload tracking

Add Prometheus metrics to monitor concurrent upload activity for both
filer and S3 servers. This provides visibility into the upload limiting
feature added in the previous PR.

New Metrics:
- SeaweedFS_filer_in_flight_upload_bytes: Current bytes being uploaded to filer
- SeaweedFS_filer_in_flight_upload_count: Current number of uploads to filer
- SeaweedFS_s3_in_flight_upload_bytes: Current bytes being uploaded to S3
- SeaweedFS_s3_in_flight_upload_count: Current number of uploads to S3

The metrics are updated atomically whenever uploads start or complete,
providing real-time visibility into upload concurrency levels.

This helps operators:
- Monitor upload concurrency in real-time
- Set appropriate limits based on actual usage patterns
- Detect potential bottlenecks or capacity issues
- Track the effectiveness of upload limiting configuration

* grafana: add dashboard panels for concurrent upload metrics

Add 4 new panels to the Grafana dashboard to visualize the concurrent
upload metrics added in this PR:

Filer Section:
- Filer Concurrent Uploads: Shows current number of concurrent uploads
- Filer Concurrent Upload Bytes: Shows current bytes being uploaded

S3 Gateway Section:
- S3 Concurrent Uploads: Shows current number of concurrent uploads
- S3 Concurrent Upload Bytes: Shows current bytes being uploaded

These panels help operators monitor upload concurrency in real-time and
tune the upload limiting configuration based on actual usage patterns.

* more efficient
2025-11-26 15:51:38 -08:00
Chris Lu
5f7a292334 add build info metrics (#7525)
* add build info metrics

* unused

* metrics on build

* size limit

* once
2025-11-21 16:55:28 -08:00
Konstantin Lebedev
93007c1842 [volume] refactor and add metrics for flight upload and download data limit condition (#6920)
* refactor concurrentDownloadLimit

* fix loop

* fix cmdServer

* fix: resolve conversation pr 6920

* Changes logging function (#6919)

* updated logging methods for stores

* updated logging methods for stores

* updated logging methods for filer

* updated logging methods for uploader and http_util

* updated logging methods for weed server

---------

Co-authored-by: akosov <a.kosov@kryptonite.ru>

* Improve lock ring (#6921)

* fix flaky lock ring test

* add more tests

* fix: build

* fix: rm import util/version

* fix: serverOptions

* refactoring

---------

Co-authored-by: Aleksey Kosov <rusyak777@list.ru>
Co-authored-by: akosov <a.kosov@kryptonite.ru>
Co-authored-by: Chris Lu <chrislusf@users.noreply.github.com>
Co-authored-by: chrislu <chris.lu@gmail.com>
2025-07-02 18:03:49 -07:00
Aleksey Kosov
ef4eda0761 added re-generating and writing the Volume UUID if it is empty (#6568) 2025-02-24 07:58:43 -08:00
chrislu
b78a897f14 freebsd has fs.Bavail as int64
fix freebsd building https://github.com/seaweedfs/seaweedfs/actions/runs/13380531394/job/37368195169
2025-02-17 20:19:52 -08:00
zouyixiong
7392161894 fix: The free disk size and percentage are quite different from the output of df -h. (#6550)
* fix: record and delete bucket metrics after inactive

* feat: match available disk size with system cmd `dh -h`

* feat: move temp test to unmaintained/

---------

Co-authored-by: XYZ <XYZ>
2025-02-17 15:49:11 -08:00
zouyixiong
8eab76c5db fix: record and delete bucket metrics after inactive (#6523)
Co-authored-by: XYZ <XYZ>
2025-02-07 10:26:39 -08:00
Hadi Zamani
a2330f624b Add metrics for uploaded and deleted s3 objects (#6475) 2025-01-25 21:55:06 -08:00
Hadi Zamani
c7ae969c06 Add bucket's traffic metrics (#6444)
* Add bucket's traffic metrics

* Add bucket traffic to dashboards

* Fix bucket metrics help messages

* Fix variable names
2025-01-16 08:23:35 -08:00
zouyixiong
d6f3e1970d fix: filer may crash by bucketLastActiveTsNs concurrency access. (#6350) 2024-12-13 05:30:21 -08:00
zouyixiong
9987a65e8a fix: record and delete bucket metrics after inactive (#6349) 2024-12-12 20:34:02 -08:00
Konstantin Lebedev
167b50be88 fix missing register master metric MasterPickForWriteErrorCounter (#6277) 2024-11-25 08:59:11 -08:00
wyang
a7973ed7d1 fix deadlock hang when broadcast to clients (#6184)
fix deadlock when broadcast to clients

when master thransfer leader, the old master will disconnect with all
filers and volumeServers, if the cluster is a big , the broadcast
messages may be more big than the max of the channel len 100, then if the
KeepConnect was not listen on the channel in disconnect, it will
deadlock. and the whole cluster will not serve!
2024-11-03 23:20:48 -08:00
steve.wei
cfbe45c765 feat: add in-flight metric for s3/file/volume-server (#6120) 2024-10-14 12:10:05 -07:00
zemul
028ebb1d0d Feat:merge small chunk (#6049)
* fix:mount deadlock

* feat: merge small chunk

* adjust MergeChunkMinCount

* fix

---------

Co-authored-by: zemul <zhouzemiao@ihuman.com>
2024-09-23 08:51:03 -07:00
Konstantin Lebedev
67a252ee8a [master] refactor func ShouldGrowVolumes (#5884) 2024-09-04 08:16:44 -07:00
Konstantin Lebedev
b2ffcdaab2 [master] do sync grow request only if absolutely necessary (#5821)
* do sync grow request only if absolutely necessary
https://github.com/seaweedfs/seaweedfs/pull/5819

* remove check VolumeGrowStrategy Threshold on PickForWrite

* fix fmt.Errorf
2024-07-30 13:21:35 -07:00
Konstantin Lebedev
33964fa292 metrics stats of volume layout depends on the data center (#5775)
stats volume layout depends on the data center
2024-07-12 12:32:25 -07:00
Dominic Pearson
d8bde9b96e Solaris disk support (#5638) 2024-06-02 22:10:28 -07:00
chrislu
ba98f02d02 go fmt 2024-05-23 08:25:16 -07:00
Dave St.Germain
731b3aadbe Add support for OpenBSD (#5578)
Co-authored-by: Dave St.Germain <dcs@adullmoment.com>
2024-05-10 14:35:41 -07:00
steve.wei
0bdf121e51 rename VolumeServerVolumeGauge (#5504) 2024-04-17 04:49:50 -07:00
Konstantin Lebedev
33537ae29f [s3] fix s3 test_multipart_get_part (#5476)
* try fix s3  test_multipart_get_part

* add passed s3 tests

* fix SeaweedFSUploadId

* rm spaces

* convert part request to range

* add passed s3 tests of multipart
2024-04-14 10:41:32 -07:00
Konstantin Lebedev
3e25ed1b11 [s3] add s3 pass test_multipart_upload_size_too_small (#5475)
* add s3 pass test_multipart_upload_size_too_small

* refactor metric names

* return ErrNoSuchUpload if empty parts

* fix test
2024-04-07 11:52:35 -07:00
Konstantin Lebedev
d42a04cceb [s3] fix s3 test_multipart_resend_first_finishes_last (#5471)
* try fix s3 test
https://github.com/seaweedfs/seaweedfs/pull/5466

* add error handler metrics

* refactor

* refactor multipartExt

* delete bad entry parts
2024-04-06 10:56:39 -07:00
Konstantin Lebedev
9c1e0f5811 [master] grow volumes if no writable volumes in current dataCenter (#5434)
* grow volumes if no writable volumes in current dataCenter
https://github.com/seaweedfs/seaweedfs/issues/3886

* fix tests with volume grow

* automatic volume grow one volume

* add ErrorChunkAssign metrics
2024-03-29 00:38:27 -07:00
Konstantin Lebedev
dc9568fc0d [master] add test for PickForWrite add metrics for volume layout (#5413) 2024-03-22 07:39:11 -07:00
Konstantin Lebedev
a7fc723ae0 chore: add status code for request_total metrics (#5188) 2024-01-10 10:05:27 -08:00
chrislu
81f11883e3 go fmt 2023-11-26 11:47:20 -08:00
SmsS4
f95848ba7d Add ErrorGetNotFound and ErrorGetInternal to volume server metrics (#4960) 2023-10-30 07:38:03 -07:00
chrislu
6ebe26a765 Revert "Revert "Revert "Add disk type to prometheus metrics" (#4777)""
This reverts commit 567d788928.
2023-10-03 08:28:52 -07:00