Commit Graph

3 Commits

Author SHA1 Message Date
Jaehoon Kim
d00acded8a fix(vacuum): batch all replicas in a single plugin worker task (#9702)
* fix(vacuum): batch all replicas in a single plugin worker task

The plugin worker vacuum path emitted one TaskDetectionResult per
(volume, server) replica, but the dispatcher gates duplicate tasks per
volume via ActiveTopology.HasAnyTask. The first replica's task was
created and the remaining N-1 replicas were silently dropped, so only
one replica per volume was ever vacuumed — leaving the others with all
their garbage intact.

Mirror the master built-in flow (topology.vacuumOneVolumeId →
batchVacuumVolumeCheck/Compact/Commit/Cleanup) by:

- aggregating detection metrics by VolumeID so a single task carries
  every replica in TaskParams.Sources
- having VacuumTask accept []string servers (instead of a single
  string), re-check each replica's garbage ratio at execute time to
  derive a vacuumTargets subset, and run Compact/Commit/Cleanup against
  only that subset
- updating the dispatcher (plugin_handler.Execute, register.CreateTask)
  to forward every Sources node to NewVacuumTask

* fix(vacuum): run all-replica vacuum in two phases to keep failure atomic

The prior implementation iterated Compact → Commit → Cleanup against
each replica in sequence. A Compact failure on the second replica left
the first one already committed (its active files swapped with the
.cp* files), producing replica divergence with no automatic recovery.

Split performVacuum into two phases, matching topology.vacuumOneVolumeId:

  Phase 1 — Compact all targets. If any fails, run VacuumVolumeCleanup
  on every target to drop the .cpd/.cpx/.cpldb temp files, then abort.
  No replica has swapped yet, so every replica returns to its original
  state.

  Phase 2 — Commit all targets. Best-effort, matching
  batchVacuumVolumeCommit: per-replica errors are collected and
  surfaced together. Once any replica has swapped there is no clean
  rollback, so a partial Phase 2 failure requires operator
  reconciliation.

Adds compactOne / commitOne / cleanupOne / cleanupAll helpers and
removes the old performVacuumOne.

* fix(vacuum): abort when any replica's garbage check fails

The prior check tolerated per-replica RPC errors and only failed the
task if every replica errored — partial failures were silently treated
as "ineligible" so the responding replicas would still be vacuumed.
That produces divergence the moment the unreachable replica comes
back: it still carries the original garbage while the others have
been compacted.

Match topology.batchVacuumVolumeCheck's contract instead — its return
value (errCount == 0 && len(vacuumLocationList.list) > 0) gates the
whole vacuum on every replica's check succeeding. If any replica is
unreachable or its VacuumVolumeCheck RPC errors, abort the task; the
volume will be retried on the next detection cycle once the replica
is healthy.

* fix(vacuum): guard against nil metrics and TaskSource entries

Detection's bucket-building loop dereferenced m.VolumeID without
checking m for nil. VacuumTask.Validate built sourceSet from
params.Sources without checking each entry for nil. Both paths would
panic on a malformed protobuf payload that managed to deliver a nil
slot. Skip nil entries in both loops — neutral with the existing
nil/empty filtering already done in register.CreateTask and
plugin_handler.Execute.

* test(vacuum): success path no longer calls VacuumVolumeCleanup

The plugin worker vacuum is now two-phase (Compact-all → Commit-all,
with Cleanup only invoked on Compact failure to roll back .cp* temp
files). This matches topology.vacuumOneVolumeId, where
batchVacuumVolumeCleanup runs only on the Compact-failure branch.

On a successful Commit the temp files do not linger:
  - CommitCompactVolume renames .cpd → .dat and .cpx → .idx
  - leveldb needle map renames .cpldb → .ldb (needle_map_leveldb.go)

so calling VacuumVolumeCleanup afterwards is a redundant no-op. The
prior worker code called it unconditionally and the integration test
asserted that — switch the expectation to cleanupCalls == 0 to
document the new (and master-aligned) contract.
2026-05-27 11:15:25 -07:00
Chris Lu
1f6f473995 refactor(worker): co-locate plugin handlers with their task packages (#9301)
* refactor(worker): co-locate plugin handlers with their task packages

Move every per-task plugin handler from weed/plugin/worker/ into the
matching weed/worker/tasks/<name>/ package, so each task owns its
detection, scheduling, execution, and plugin handler in one place.

Step 0 (within pluginworker, no behavior change): extract shared helpers
that previously lived inside individual handler files into dedicated
files and export the ones now consumed across packages.

  - activity.go: BuildExecutorActivity, BuildDetectorActivity
  - config.go: ReadStringConfig/Double/Int64/Bytes/StringList, MapTaskPriority
  - interval.go: ShouldSkipDetectionByInterval
  - volume_state.go: VolumeState + consts, FilterMetricsByVolumeState/Location
  - collection_filter.go: CollectionFilterMode + consts
  - volume_metrics.go: export CollectVolumeMetricsFromMasters,
    MasterAddressCandidates, FetchVolumeList
  - testing_senders_test.go: shared test stubs

Phase 1: move the per-task plugin handlers (and the iceberg subpackage)
into their task packages.

  weed/plugin/worker/vacuum_handler.go         -> weed/worker/tasks/vacuum/plugin_handler.go
  weed/plugin/worker/ec_balance_handler.go     -> weed/worker/tasks/ec_balance/plugin_handler.go
  weed/plugin/worker/erasure_coding_handler.go -> weed/worker/tasks/erasure_coding/plugin_handler.go
  weed/plugin/worker/volume_balance_handler.go -> weed/worker/tasks/balance/plugin_handler.go
  weed/plugin/worker/iceberg/                   -> weed/worker/tasks/iceberg/

  weed/plugin/worker/handlers/handlers.go now blank-imports all five
  task subpackages so their init() registrations fire.

  weed/command/mini.go and the worker tests construct the handler with
  vacuum.DefaultMaxExecutionConcurrency (the constant moved with the
  vacuum handler).

admin_script remains in weed/plugin/worker/ because there is no
underlying weed/worker/tasks/admin_script/ package to merge with.

* refactor(worker): update test/plugin_workers imports for moved handlers

Three handler constructors moved out of pluginworker into their task
packages — update the integration test files in test/plugin_workers/
to import from the new locations:

  pluginworker.NewVacuumHandler        -> vacuum.NewVacuumHandler
  pluginworker.NewVolumeBalanceHandler -> balance.NewVolumeBalanceHandler
  pluginworker.NewErasureCodingHandler -> erasure_coding.NewErasureCodingHandler

The pluginworker import is kept where the file still uses
pluginworker.WorkerOptions / pluginworker.JobHandler.

* refactor(worker): update test/s3tables iceberg import path

The iceberg subpackage moved from weed/plugin/worker/iceberg/ to
weed/worker/tasks/iceberg/. test/s3tables/maintenance/maintenance_integration_test.go
still imported the old path, breaking S3 Tables / RisingWave / Trino /
Spark / Iceberg-catalog / STS integration test builds.

Mirrors the OSS-side fix needed by every job in the run that
transitively imports test/s3tables/maintenance.

* chore: gofmt PR-touched files

The S3 Tables Format Check job runs `gofmt -l` over weed/s3api/s3tables
and test/s3tables, then fails if anything is unformatted. Files this
PR moved or modified had import-grouping and trailing-spacing issues
introduced by perl-based renames; reformat them with gofmt -w.

Touched files:
  test/plugin_workers/erasure_coding/{detection,execution}_test.go
  test/s3tables/maintenance/maintenance_integration_test.go
  weed/plugin/worker/handlers/handlers.go
  weed/worker/tasks/{balance,ec_balance,erasure_coding,vacuum}/plugin_handler*.go

* refactor(worker): bounds-checked int conversions for plugin config values

CodeQL flagged 18 go/incorrect-integer-conversion warnings on the moved
plugin handler files: results of pluginworker.ReadInt64Config (which
ultimately calls strconv.ParseInt with bit size 64) were being narrowed
to int32/uint32/int without an upper-bound check, so a malicious or
malformed admin/worker config value could overflow the target type.

Add three helpers in weed/plugin/worker/config.go that wrap
ReadInt64Config and clamp out-of-range values back to the caller's
fallback:

  ReadInt32Config (math.MinInt32 .. math.MaxInt32)
  ReadUint32Config (0 .. math.MaxUint32)
  ReadIntConfig    (math.MinInt32 .. math.MaxInt32, platform-portable)

Update each flagged call site in the four moved task packages to use
the bounds-checked helper. For protobuf uint32 fields (volume IDs)
the variable type also becomes uint32, removing the trailing
uint32(volumeID) casts and changing the "missing volume_id" check
from `<= 0` to `== 0`.

Touched files:
  weed/plugin/worker/config.go
  weed/worker/tasks/balance/plugin_handler.go
  weed/worker/tasks/erasure_coding/plugin_handler.go
  weed/worker/tasks/vacuum/plugin_handler.go

* refactor(worker): use ReadIntConfig for clamped derive-worker-config helpers

CodeQL still flagged three call sites where ReadInt64Config was being
narrowed to int after a value-range clamp (max_concurrent_moves <= 50,
batch_size <= 100, min_server_count >= 2). The clamp is correct but
CodeQL's flow analysis didn't recognize the bound, so it flagged them
as unbounded narrowing.

Switch to ReadIntConfig (already int32-bounded by the helper) for
those three sites, drop the now-redundant int64 intermediate variables.

Also drops the now-unused `> math.MaxInt32` clamp in
ec_balance.deriveECBalanceWorkerConfig (the helper covers it).
2026-05-02 18:03:13 -07:00
Chris Lu
453310b057 Add plugin worker integration tests for erasure coding (#8450)
* test: add plugin worker integration harness

* test: add erasure coding detection integration tests

* test: add erasure coding execution integration tests

* ci: add plugin worker integration workflow

* test: extend fake volume server for vacuum and balance

* test: expand erasure coding detection topologies

* test: add large erasure coding detection topology

* test: add vacuum plugin worker integration tests

* test: add volume balance plugin worker integration tests

* ci: run plugin worker tests per worker

* fixes

* erasure coding: stop after placement failures

* erasure coding: record hasMore when early stopping

* erasure coding: relax large topology expectations
2026-02-25 22:11:41 -08:00