seaweedfs

14,008 Commits 160 Branches 353 Tags

3 Commits

Include renames

Author	SHA1	Message	Date
Jaehoon KimandGitHub	d00acded8a	fix(vacuum): batch all replicas in a single plugin worker task (#9702 ) * fix(vacuum): batch all replicas in a single plugin worker task The plugin worker vacuum path emitted one TaskDetectionResult per (volume, server) replica, but the dispatcher gates duplicate tasks per volume via ActiveTopology.HasAnyTask. The first replica's task was created and the remaining N-1 replicas were silently dropped, so only one replica per volume was ever vacuumed — leaving the others with all their garbage intact. Mirror the master built-in flow (topology.vacuumOneVolumeId → batchVacuumVolumeCheck/Compact/Commit/Cleanup) by: - aggregating detection metrics by VolumeID so a single task carries every replica in TaskParams.Sources - having VacuumTask accept []string servers (instead of a single string), re-check each replica's garbage ratio at execute time to derive a vacuumTargets subset, and run Compact/Commit/Cleanup against only that subset - updating the dispatcher (plugin_handler.Execute, register.CreateTask) to forward every Sources node to NewVacuumTask * fix(vacuum): run all-replica vacuum in two phases to keep failure atomic The prior implementation iterated Compact → Commit → Cleanup against each replica in sequence. A Compact failure on the second replica left the first one already committed (its active files swapped with the .cp* files), producing replica divergence with no automatic recovery. Split performVacuum into two phases, matching topology.vacuumOneVolumeId: Phase 1 — Compact all targets. If any fails, run VacuumVolumeCleanup on every target to drop the .cpd/.cpx/.cpldb temp files, then abort. No replica has swapped yet, so every replica returns to its original state. Phase 2 — Commit all targets. Best-effort, matching batchVacuumVolumeCommit: per-replica errors are collected and surfaced together. Once any replica has swapped there is no clean rollback, so a partial Phase 2 failure requires operator reconciliation. Adds compactOne / commitOne / cleanupOne / cleanupAll helpers and removes the old performVacuumOne. * fix(vacuum): abort when any replica's garbage check fails The prior check tolerated per-replica RPC errors and only failed the task if every replica errored — partial failures were silently treated as "ineligible" so the responding replicas would still be vacuumed. That produces divergence the moment the unreachable replica comes back: it still carries the original garbage while the others have been compacted. Match topology.batchVacuumVolumeCheck's contract instead — its return value (errCount == 0 && len(vacuumLocationList.list) > 0) gates the whole vacuum on every replica's check succeeding. If any replica is unreachable or its VacuumVolumeCheck RPC errors, abort the task; the volume will be retried on the next detection cycle once the replica is healthy. * fix(vacuum): guard against nil metrics and TaskSource entries Detection's bucket-building loop dereferenced m.VolumeID without checking m for nil. VacuumTask.Validate built sourceSet from params.Sources without checking each entry for nil. Both paths would panic on a malformed protobuf payload that managed to deliver a nil slot. Skip nil entries in both loops — neutral with the existing nil/empty filtering already done in register.CreateTask and plugin_handler.Execute. * test(vacuum): success path no longer calls VacuumVolumeCleanup The plugin worker vacuum is now two-phase (Compact-all → Commit-all, with Cleanup only invoked on Compact failure to roll back .cp* temp files). This matches topology.vacuumOneVolumeId, where batchVacuumVolumeCleanup runs only on the Compact-failure branch. On a successful Commit the temp files do not linger: - CommitCompactVolume renames .cpd → .dat and .cpx → .idx - leveldb needle map renames .cpldb → .ldb (needle_map_leveldb.go) so calling VacuumVolumeCleanup afterwards is a redundant no-op. The prior worker code called it unconditionally and the integration test asserted that — switch the expectation to cleanupCalls == 0 to document the new (and master-aligned) contract.	2026-05-27 11:15:25 -07:00
Chris LuandGitHub	1f6f473995	refactor(worker): co-locate plugin handlers with their task packages (#9301 ) * refactor(worker): co-locate plugin handlers with their task packages Move every per-task plugin handler from weed/plugin/worker/ into the matching weed/worker/tasks/<name>/ package, so each task owns its detection, scheduling, execution, and plugin handler in one place. Step 0 (within pluginworker, no behavior change): extract shared helpers that previously lived inside individual handler files into dedicated files and export the ones now consumed across packages. - activity.go: BuildExecutorActivity, BuildDetectorActivity - config.go: ReadStringConfig/Double/Int64/Bytes/StringList, MapTaskPriority - interval.go: ShouldSkipDetectionByInterval - volume_state.go: VolumeState + consts, FilterMetricsByVolumeState/Location - collection_filter.go: CollectionFilterMode + consts - volume_metrics.go: export CollectVolumeMetricsFromMasters, MasterAddressCandidates, FetchVolumeList - testing_senders_test.go: shared test stubs Phase 1: move the per-task plugin handlers (and the iceberg subpackage) into their task packages. weed/plugin/worker/vacuum_handler.go -> weed/worker/tasks/vacuum/plugin_handler.go weed/plugin/worker/ec_balance_handler.go -> weed/worker/tasks/ec_balance/plugin_handler.go weed/plugin/worker/erasure_coding_handler.go -> weed/worker/tasks/erasure_coding/plugin_handler.go weed/plugin/worker/volume_balance_handler.go -> weed/worker/tasks/balance/plugin_handler.go weed/plugin/worker/iceberg/ -> weed/worker/tasks/iceberg/ weed/plugin/worker/handlers/handlers.go now blank-imports all five task subpackages so their init() registrations fire. weed/command/mini.go and the worker tests construct the handler with vacuum.DefaultMaxExecutionConcurrency (the constant moved with the vacuum handler). admin_script remains in weed/plugin/worker/ because there is no underlying weed/worker/tasks/admin_script/ package to merge with. * refactor(worker): update test/plugin_workers imports for moved handlers Three handler constructors moved out of pluginworker into their task packages — update the integration test files in test/plugin_workers/ to import from the new locations: pluginworker.NewVacuumHandler -> vacuum.NewVacuumHandler pluginworker.NewVolumeBalanceHandler -> balance.NewVolumeBalanceHandler pluginworker.NewErasureCodingHandler -> erasure_coding.NewErasureCodingHandler The pluginworker import is kept where the file still uses pluginworker.WorkerOptions / pluginworker.JobHandler. * refactor(worker): update test/s3tables iceberg import path The iceberg subpackage moved from weed/plugin/worker/iceberg/ to weed/worker/tasks/iceberg/. test/s3tables/maintenance/maintenance_integration_test.go still imported the old path, breaking S3 Tables / RisingWave / Trino / Spark / Iceberg-catalog / STS integration test builds. Mirrors the OSS-side fix needed by every job in the run that transitively imports test/s3tables/maintenance. * chore: gofmt PR-touched files The S3 Tables Format Check job runs `gofmt -l` over weed/s3api/s3tables and test/s3tables, then fails if anything is unformatted. Files this PR moved or modified had import-grouping and trailing-spacing issues introduced by perl-based renames; reformat them with gofmt -w. Touched files: test/plugin_workers/erasure_coding/{detection,execution}_test.go test/s3tables/maintenance/maintenance_integration_test.go weed/plugin/worker/handlers/handlers.go weed/worker/tasks/{balance,ec_balance,erasure_coding,vacuum}/plugin_handler.go refactor(worker): bounds-checked int conversions for plugin config values CodeQL flagged 18 go/incorrect-integer-conversion warnings on the moved plugin handler files: results of pluginworker.ReadInt64Config (which ultimately calls strconv.ParseInt with bit size 64) were being narrowed to int32/uint32/int without an upper-bound check, so a malicious or malformed admin/worker config value could overflow the target type. Add three helpers in weed/plugin/worker/config.go that wrap ReadInt64Config and clamp out-of-range values back to the caller's fallback: ReadInt32Config (math.MinInt32 .. math.MaxInt32) ReadUint32Config (0 .. math.MaxUint32) ReadIntConfig (math.MinInt32 .. math.MaxInt32, platform-portable) Update each flagged call site in the four moved task packages to use the bounds-checked helper. For protobuf uint32 fields (volume IDs) the variable type also becomes uint32, removing the trailing uint32(volumeID) casts and changing the "missing volume_id" check from `<= 0` to `== 0`. Touched files: weed/plugin/worker/config.go weed/worker/tasks/balance/plugin_handler.go weed/worker/tasks/erasure_coding/plugin_handler.go weed/worker/tasks/vacuum/plugin_handler.go * refactor(worker): use ReadIntConfig for clamped derive-worker-config helpers CodeQL still flagged three call sites where ReadInt64Config was being narrowed to int after a value-range clamp (max_concurrent_moves <= 50, batch_size <= 100, min_server_count >= 2). The clamp is correct but CodeQL's flow analysis didn't recognize the bound, so it flagged them as unbounded narrowing. Switch to ReadIntConfig (already int32-bounded by the helper) for those three sites, drop the now-redundant int64 intermediate variables. Also drops the now-unused `> math.MaxInt32` clamp in ec_balance.deriveECBalanceWorkerConfig (the helper covers it).	2026-05-02 18:03:13 -07:00
Chris LuandGitHub	453310b057	Add plugin worker integration tests for erasure coding (#8450 ) * test: add plugin worker integration harness * test: add erasure coding detection integration tests * test: add erasure coding execution integration tests * ci: add plugin worker integration workflow * test: extend fake volume server for vacuum and balance * test: expand erasure coding detection topologies * test: add large erasure coding detection topology * test: add vacuum plugin worker integration tests * test: add volume balance plugin worker integration tests * ci: run plugin worker tests per worker * fixes * erasure coding: stop after placement failures * erasure coding: record hasMore when early stopping * erasure coding: relax large topology expectations	2026-02-25 22:11:41 -08:00

Author

SHA1

Message

Date

Jaehoon KimandGitHub

d00acded8a

fix(vacuum): batch all replicas in a single plugin worker task (#9702 )

* fix(vacuum): batch all replicas in a single plugin worker task

The plugin worker vacuum path emitted one TaskDetectionResult per
(volume, server) replica, but the dispatcher gates duplicate tasks per
volume via ActiveTopology.HasAnyTask. The first replica's task was
created and the remaining N-1 replicas were silently dropped, so only
one replica per volume was ever vacuumed — leaving the others with all
their garbage intact.

Mirror the master built-in flow (topology.vacuumOneVolumeId →
batchVacuumVolumeCheck/Compact/Commit/Cleanup) by:

- aggregating detection metrics by VolumeID so a single task carries
  every replica in TaskParams.Sources
- having VacuumTask accept []string servers (instead of a single
  string), re-check each replica's garbage ratio at execute time to
  derive a vacuumTargets subset, and run Compact/Commit/Cleanup against
  only that subset
- updating the dispatcher (plugin_handler.Execute, register.CreateTask)
  to forward every Sources node to NewVacuumTask

* fix(vacuum): run all-replica vacuum in two phases to keep failure atomic

The prior implementation iterated Compact → Commit → Cleanup against
each replica in sequence. A Compact failure on the second replica left
the first one already committed (its active files swapped with the
.cp* files), producing replica divergence with no automatic recovery.

Split performVacuum into two phases, matching topology.vacuumOneVolumeId:

  Phase 1 — Compact all targets. If any fails, run VacuumVolumeCleanup
  on every target to drop the .cpd/.cpx/.cpldb temp files, then abort.
  No replica has swapped yet, so every replica returns to its original
  state.

  Phase 2 — Commit all targets. Best-effort, matching
  batchVacuumVolumeCommit: per-replica errors are collected and
  surfaced together. Once any replica has swapped there is no clean
  rollback, so a partial Phase 2 failure requires operator
  reconciliation.

Adds compactOne / commitOne / cleanupOne / cleanupAll helpers and
removes the old performVacuumOne.

* fix(vacuum): abort when any replica's garbage check fails

The prior check tolerated per-replica RPC errors and only failed the
task if every replica errored — partial failures were silently treated
as "ineligible" so the responding replicas would still be vacuumed.
That produces divergence the moment the unreachable replica comes
back: it still carries the original garbage while the others have
been compacted.

Match topology.batchVacuumVolumeCheck's contract instead — its return
value (errCount == 0 && len(vacuumLocationList.list) > 0) gates the
whole vacuum on every replica's check succeeding. If any replica is
unreachable or its VacuumVolumeCheck RPC errors, abort the task; the
volume will be retried on the next detection cycle once the replica
is healthy.

* fix(vacuum): guard against nil metrics and TaskSource entries

Detection's bucket-building loop dereferenced m.VolumeID without
checking m for nil. VacuumTask.Validate built sourceSet from
params.Sources without checking each entry for nil. Both paths would
panic on a malformed protobuf payload that managed to deliver a nil
slot. Skip nil entries in both loops — neutral with the existing
nil/empty filtering already done in register.CreateTask and
plugin_handler.Execute.

* test(vacuum): success path no longer calls VacuumVolumeCleanup

The plugin worker vacuum is now two-phase (Compact-all → Commit-all,
with Cleanup only invoked on Compact failure to roll back .cp* temp
files). This matches topology.vacuumOneVolumeId, where
batchVacuumVolumeCleanup runs only on the Compact-failure branch.

On a successful Commit the temp files do not linger:
  - CommitCompactVolume renames .cpd → .dat and .cpx → .idx
  - leveldb needle map renames .cpldb → .ldb (needle_map_leveldb.go)

so calling VacuumVolumeCleanup afterwards is a redundant no-op. The
prior worker code called it unconditionally and the integration test
asserted that — switch the expectation to cleanupCalls == 0 to
document the new (and master-aligned) contract.

2026-05-27 11:15:25 -07:00

Chris LuandGitHub

1f6f473995

refactor(worker): co-locate plugin handlers with their task packages (#9301 )

* refactor(worker): co-locate plugin handlers with their task packages

Move every per-task plugin handler from weed/plugin/worker/ into the
matching weed/worker/tasks/<name>/ package, so each task owns its
detection, scheduling, execution, and plugin handler in one place.

Step 0 (within pluginworker, no behavior change): extract shared helpers
that previously lived inside individual handler files into dedicated
files and export the ones now consumed across packages.

  - activity.go: BuildExecutorActivity, BuildDetectorActivity
  - config.go: ReadStringConfig/Double/Int64/Bytes/StringList, MapTaskPriority
  - interval.go: ShouldSkipDetectionByInterval
  - volume_state.go: VolumeState + consts, FilterMetricsByVolumeState/Location
  - collection_filter.go: CollectionFilterMode + consts
  - volume_metrics.go: export CollectVolumeMetricsFromMasters,
    MasterAddressCandidates, FetchVolumeList
  - testing_senders_test.go: shared test stubs

Phase 1: move the per-task plugin handlers (and the iceberg subpackage)
into their task packages.

  weed/plugin/worker/vacuum_handler.go         -> weed/worker/tasks/vacuum/plugin_handler.go
  weed/plugin/worker/ec_balance_handler.go     -> weed/worker/tasks/ec_balance/plugin_handler.go
  weed/plugin/worker/erasure_coding_handler.go -> weed/worker/tasks/erasure_coding/plugin_handler.go
  weed/plugin/worker/volume_balance_handler.go -> weed/worker/tasks/balance/plugin_handler.go
  weed/plugin/worker/iceberg/                   -> weed/worker/tasks/iceberg/

  weed/plugin/worker/handlers/handlers.go now blank-imports all five
  task subpackages so their init() registrations fire.

  weed/command/mini.go and the worker tests construct the handler with
  vacuum.DefaultMaxExecutionConcurrency (the constant moved with the
  vacuum handler).

admin_script remains in weed/plugin/worker/ because there is no
underlying weed/worker/tasks/admin_script/ package to merge with.

* refactor(worker): update test/plugin_workers imports for moved handlers

Three handler constructors moved out of pluginworker into their task
packages — update the integration test files in test/plugin_workers/
to import from the new locations:

  pluginworker.NewVacuumHandler        -> vacuum.NewVacuumHandler
  pluginworker.NewVolumeBalanceHandler -> balance.NewVolumeBalanceHandler
  pluginworker.NewErasureCodingHandler -> erasure_coding.NewErasureCodingHandler

The pluginworker import is kept where the file still uses
pluginworker.WorkerOptions / pluginworker.JobHandler.

* refactor(worker): update test/s3tables iceberg import path

The iceberg subpackage moved from weed/plugin/worker/iceberg/ to
weed/worker/tasks/iceberg/. test/s3tables/maintenance/maintenance_integration_test.go
still imported the old path, breaking S3 Tables / RisingWave / Trino /
Spark / Iceberg-catalog / STS integration test builds.

Mirrors the OSS-side fix needed by every job in the run that
transitively imports test/s3tables/maintenance.

* chore: gofmt PR-touched files

The S3 Tables Format Check job runs `gofmt -l` over weed/s3api/s3tables
and test/s3tables, then fails if anything is unformatted. Files this
PR moved or modified had import-grouping and trailing-spacing issues
introduced by perl-based renames; reformat them with gofmt -w.

Touched files:
  test/plugin_workers/erasure_coding/{detection,execution}_test.go
  test/s3tables/maintenance/maintenance_integration_test.go
  weed/plugin/worker/handlers/handlers.go
  weed/worker/tasks/{balance,ec_balance,erasure_coding,vacuum}/plugin_handler*.go

* refactor(worker): bounds-checked int conversions for plugin config values

CodeQL flagged 18 go/incorrect-integer-conversion warnings on the moved
plugin handler files: results of pluginworker.ReadInt64Config (which
ultimately calls strconv.ParseInt with bit size 64) were being narrowed
to int32/uint32/int without an upper-bound check, so a malicious or
malformed admin/worker config value could overflow the target type.

Add three helpers in weed/plugin/worker/config.go that wrap
ReadInt64Config and clamp out-of-range values back to the caller's
fallback:

  ReadInt32Config (math.MinInt32 .. math.MaxInt32)
  ReadUint32Config (0 .. math.MaxUint32)
  ReadIntConfig    (math.MinInt32 .. math.MaxInt32, platform-portable)

Update each flagged call site in the four moved task packages to use
the bounds-checked helper. For protobuf uint32 fields (volume IDs)
the variable type also becomes uint32, removing the trailing
uint32(volumeID) casts and changing the "missing volume_id" check
from `<= 0` to `== 0`.

Touched files:
  weed/plugin/worker/config.go
  weed/worker/tasks/balance/plugin_handler.go
  weed/worker/tasks/erasure_coding/plugin_handler.go
  weed/worker/tasks/vacuum/plugin_handler.go

* refactor(worker): use ReadIntConfig for clamped derive-worker-config helpers

CodeQL still flagged three call sites where ReadInt64Config was being
narrowed to int after a value-range clamp (max_concurrent_moves <= 50,
batch_size <= 100, min_server_count >= 2). The clamp is correct but
CodeQL's flow analysis didn't recognize the bound, so it flagged them
as unbounded narrowing.

Switch to ReadIntConfig (already int32-bounded by the helper) for
those three sites, drop the now-redundant int64 intermediate variables.

Also drops the now-unused `> math.MaxInt32` clamp in
ec_balance.deriveECBalanceWorkerConfig (the helper covers it).

2026-05-02 18:03:13 -07:00

Chris LuandGitHub

453310b057

Add plugin worker integration tests for erasure coding (#8450 )

* test: add plugin worker integration harness

* test: add erasure coding detection integration tests

* test: add erasure coding execution integration tests

* ci: add plugin worker integration workflow

* test: extend fake volume server for vacuum and balance

* test: expand erasure coding detection topologies

* test: add large erasure coding detection topology

* test: add vacuum plugin worker integration tests

* test: add volume balance plugin worker integration tests

* ci: run plugin worker tests per worker

* fixes

* erasure coding: stop after placement failures

* erasure coding: record hasMore when early stopping

* erasure coding: relax large topology expectations

2026-02-25 22:11:41 -08:00