pve-exporter/docs/superpowers/specs/2026-03-20-remaining-collectors-design.md
Davíð Steinn Geirsson e10156323b docs: address spec review feedback
- Split pve_ha_service_info into _config and _status to avoid stale series
- Handle wearout "N/A" and health "UNKNOWN" edge cases for physical disks
- Clarify node label convention and rootfs available vs free naming
- Note QEMU-only scope for VM pressure (LXC lacks PSI in PVE API)
- Add full node_status/lrm_status examples showing all cluster nodes
- Document mutex-guarded nodes pattern and test fixture requirements

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 15:10:22 +00:00


# Remaining Collectors Design Spec

Add 4 new collectors to pve-exporter covering the TODO items from the README: node status, VM pressure, HA status, and physical disks. SDN is excluded (config-only, not operationally useful). Kernel version and CPU model are excluded from node status (static, low value).

## Collectors

### 1. Node Status Collector

**File:** collector/node_status.go
**API:** /nodes/{node}/status (per-node fan-out)
**Type:** NodeAwareCollector

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| pve_node_load1 | Gauge | node | 1-minute load average |
| pve_node_load5 | Gauge | node | 5-minute load average |
| pve_node_load15 | Gauge | node | 15-minute load average |
| pve_node_swap_total_bytes | Gauge | node | Total swap in bytes |
| pve_node_swap_used_bytes | Gauge | node | Used swap in bytes |
| pve_node_swap_free_bytes | Gauge | node | Free swap in bytes |
| pve_node_rootfs_total_bytes | Gauge | node | Root filesystem total in bytes |
| pve_node_rootfs_used_bytes | Gauge | node | Root filesystem used in bytes |
| pve_node_rootfs_available_bytes | Gauge | node | Root filesystem available in bytes |
| pve_node_ksm_shared_bytes | Gauge | node | KSM shared memory in bytes |
| pve_node_boot_mode_info | Gauge | node, mode, secureboot | Boot mode info (always 1) |

The pve_node_ prefix disambiguates these from cluster_resources metrics which use id labels like node/node01. The node label here is the bare node name (e.g., node01), consistent with how other NodeAwareCollectors label per-node data.

Rootfs uses available (not free) because the API field avail reflects usable space after reserved blocks, which is the operationally relevant value. Swap has no reserved blocks so free is correct there.

API response structure:

{
  "data": {
    "loadavg": ["3.12", "2.88", "2.79"],
    "swap": {"total": 8589930496, "used": 0, "free": 8589930496},
    "rootfs": {"used": 28747304960, "total": 100861726720, "avail": 66943684608},
    "ksm": {"shared": 0},
    "boot-info": {"mode": "efi", "secureboot": 0}
  }
}

Load averages are strings in the API; parse them with strconv.ParseFloat. The secureboot label value is "1" or "0" (a string, not a bool).
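A minimal sketch of the response types and the load-average parsing, assuming the field names from the example above (these are illustrative, not the exporter's actual structs):

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// nodeStatus mirrors the /nodes/{node}/status payload shown above.
// Field names follow the example response; sketch only.
type nodeStatus struct {
	LoadAvg []string `json:"loadavg"` // strings in the API
	Swap    struct {
		Total, Used, Free uint64
	} `json:"swap"`
	RootFS struct {
		Total, Used uint64
		Avail       uint64 `json:"avail"` // available, not free
	} `json:"rootfs"`
	KSM struct {
		Shared uint64 `json:"shared"`
	} `json:"ksm"`
	BootInfo struct {
		Mode       string `json:"mode"`
		SecureBoot int    `json:"secureboot"`
	} `json:"boot-info"`
}

// parseLoadAvg converts the API's string load averages to floats.
func parseLoadAvg(raw []string) ([]float64, error) {
	out := make([]float64, len(raw))
	for i, s := range raw {
		f, err := strconv.ParseFloat(s, 64)
		if err != nil {
			return nil, fmt.Errorf("loadavg[%d] %q: %w", i, s, err)
		}
		out[i] = f
	}
	return out, nil
}

func main() {
	payload := []byte(`{"loadavg":["3.12","2.88","2.79"],"boot-info":{"mode":"efi","secureboot":0}}`)
	var st nodeStatus
	if err := json.Unmarshal(payload, &st); err != nil {
		panic(err)
	}
	loads, _ := parseLoadAvg(st.LoadAvg)
	// secureboot becomes a string label, matching the note above
	fmt.Println(loads[0], st.BootInfo.Mode, strconv.Itoa(st.BootInfo.SecureBoot))
	// → 3.12 efi 0
}
```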

### 2. VM Pressure Collector

**File:** collector/vm_pressure.go
**API:** /nodes/{node}/qemu (per-node fan-out)
**Type:** NodeAwareCollector

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| pve_vm_pressure_cpu_some_ratio | Gauge | id, node | CPU pressure (some) |
| pve_vm_pressure_cpu_full_ratio | Gauge | id, node | CPU pressure (full) |
| pve_vm_pressure_memory_some_ratio | Gauge | id, node | Memory pressure (some) |
| pve_vm_pressure_memory_full_ratio | Gauge | id, node | Memory pressure (full) |
| pve_vm_pressure_io_some_ratio | Gauge | id, node | I/O pressure (some) |
| pve_vm_pressure_io_full_ratio | Gauge | id, node | I/O pressure (full) |

API response structure (per VM entry):

{
  "vmid": 112,
  "status": "running",
  "pressurecpusome": 0,
  "pressurecpufull": 0,
  "pressurememorysome": 0,
  "pressurememoryfull": 0,
  "pressureiosome": 0,
  "pressureiofull": 0
}
  • Only emit metrics for running VMs (stopped VMs lack pressure fields).
  • QEMU only — LXC containers run in the host kernel namespace and do not expose per-container PSI metrics through the PVE API.
  • id label: constructed as fmt.Sprintf("qemu/%d", vmid) to match the existing convention used by cluster_resources and node_config collectors.
  • API returns values 0–100. Divide by 100 to produce 0.0–1.0 ratios.
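The rules above can be sketched as follows; the vmEntry type and its field names are illustrative stand-ins, not the exporter's actual JSON types:

```go
package main

import "fmt"

// vmEntry models only the fields this sketch needs from one
// /nodes/{node}/qemu list item.
type vmEntry struct {
	VMID            int
	Status          string
	PressureCPUSome float64
}

// pressureRatio converts the API's 0–100 value into a 0.0–1.0 ratio.
func pressureRatio(v float64) float64 { return v / 100 }

// vmID builds the id label matching the cluster_resources convention.
func vmID(vmid int) string { return fmt.Sprintf("qemu/%d", vmid) }

func main() {
	vms := []vmEntry{
		{VMID: 112, Status: "running", PressureCPUSome: 7},
		{VMID: 113, Status: "stopped"}, // skipped: stopped VMs lack pressure fields
	}
	for _, vm := range vms {
		if vm.Status != "running" {
			continue
		}
		fmt.Printf("%s %.2f\n", vmID(vm.VMID), pressureRatio(vm.PressureCPUSome))
	}
	// → qemu/112 0.07
}
```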

### 3. HA Status Collector

**File:** collector/ha_status.go
**API:** /cluster/ha/status/manager_status + /cluster/ha/resources (cluster-level, no per-node fan-out)
**Type:** Plain Collector (not NodeAwareCollector); both API calls happen inside a single Update() method.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| pve_ha_crm_master | Gauge | node | 1 if node is CRM master, 0 otherwise (all nodes) |
| pve_ha_node_status | Gauge | node, status | Per-node HA status (always 1) |
| pve_ha_lrm_timestamp_seconds | Gauge | node | Last LRM heartbeat as unix timestamp |
| pve_ha_lrm_mode | Gauge | node, mode | LRM mode per node (always 1) |
| pve_ha_service_config | Gauge | sid, type, max_restart, max_relocate, failback | Service config (always 1) |
| pve_ha_service_status | Gauge | sid, node, state | Service runtime state (always 1) |

Service metrics are split into _config (from /cluster/ha/resources, static labels) and _status (from manager_status.service_status, runtime labels that change). This avoids stale series when a service migrates between nodes or changes state.

API response structure (/cluster/ha/status/manager_status):

{
  "data": {
    "manager_status": {
      "master_node": "node03",
      "node_status": {
        "node01": "online",
        "node02": "online",
        "node03": "online",
        "node04": "online",
        "node05": "online"
      },
      "service_status": {
        "vm:106": {"node": "node04", "running": 1, "state": "started"}
      }
    },
    "lrm_status": {
      "node01": {"mode": "active", "state": "wait_for_agent_lock", "timestamp": 1774016351},
      "node02": {"mode": "active", "state": "wait_for_agent_lock", "timestamp": 1774016351},
      "node03": {"mode": "active", "state": "wait_for_agent_lock", "timestamp": 1774016351},
      "node04": {"mode": "active", "state": "active", "timestamp": 1774016350},
      "node05": {"mode": "active", "state": "wait_for_agent_lock", "timestamp": 1774016351}
    }
  }
}

API response structure (/cluster/ha/resources):

{
  "data": [
    {"sid": "vm:106", "type": "vm", "state": "started", "max_restart": 2, "max_relocate": 2, "failback": 1}
  ]
}
  • pve_ha_crm_master: iterate node_status keys from manager_status, emit 1 for master_node, 0 for all others. The node_status map contains all cluster nodes.
  • pve_ha_service_config: from /cluster/ha/resources. Numeric config values (max_restart, max_relocate, failback) become string labels.
  • pve_ha_service_status: from manager_status.service_status. The state label reflects runtime state (e.g., started, stopped, migrate).
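The pve_ha_crm_master rule can be sketched like this; emitCRMMaster and its map-based return type are illustrative helpers, not the collector's actual shape:

```go
package main

import (
	"fmt"
	"sort"
)

// emitCRMMaster iterates the node_status keys from manager_status and
// returns 1 for the master node, 0 for every other node, so all cluster
// nodes get a series.
func emitCRMMaster(master string, nodeStatus map[string]string) map[string]float64 {
	out := make(map[string]float64, len(nodeStatus))
	for node := range nodeStatus {
		if node == master {
			out[node] = 1
		} else {
			out[node] = 0
		}
	}
	return out
}

func main() {
	vals := emitCRMMaster("node03", map[string]string{
		"node01": "online", "node02": "online", "node03": "online",
	})
	// print in sorted order for deterministic output
	nodes := make([]string, 0, len(vals))
	for n := range vals {
		nodes = append(nodes, n)
	}
	sort.Strings(nodes)
	for _, n := range nodes {
		fmt.Printf("pve_ha_crm_master{node=%q} %g\n", n, vals[n])
	}
}
```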

### 4. Physical Disks Collector

**File:** collector/physical_disk.go
**API:** /nodes/{node}/disks/list (per-node fan-out)
**Type:** NodeAwareCollector

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| pve_physical_disk_health | Gauge | node, devpath, model, serial, type | 1 if SMART PASSED, 0 otherwise |
| pve_physical_disk_wearout_remaining_ratio | Gauge | node, devpath | Wearout remaining, 0.0–1.0 |
| pve_physical_disk_size_bytes | Gauge | node, devpath | Disk size in bytes |
| pve_physical_disk_info | Gauge | node, devpath, model, serial, type, used | Disk info (always 1) |
| pve_physical_disk_osd | Gauge | node, devpath, osd | Disk-to-OSD mapping (always 1, one per OSD) |

API response structure (per disk entry):

{
  "devpath": "/dev/nvme0n1",
  "health": "PASSED",
  "wearout": 100,
  "size": 7681501126656,
  "model": "VV007680KYFFL",
  "serial": "ADD3NA317I0104K2N",
  "type": "nvme",
  "used": "LVM",
  "osdid": "8",
  "osdid-list": ["8"]
}
  • wearout: API returns a 0–100 integer representing percentage remaining (100 = new, 0 = fully worn). Divide by 100 to get the ratio directly (no inversion needed). The API may return "N/A" (string) for disks that don't support wear leveling — use json.Number or similar to handle this. Skip emitting pve_physical_disk_wearout_remaining_ratio when the value is not a valid number.
  • health: compare string to "PASSED", emit 1 or 0. If health is empty or "UNKNOWN", emit 0.
  • osd label: format as "osd.N" (e.g., osd.8) matching Ceph daemon naming convention.
  • Multi-OSD disks: emit one pve_physical_disk_osd entry per item in osdid-list.
  • Non-OSD disks (osdid is -1 / osdid-list is null): no pve_physical_disk_osd entry emitted.

## Implementation Pattern

All 4 collectors follow the established patterns:

  • Self-register via init() + registerCollector()
  • NodeAwareCollector interface for per-node endpoints (node_status, vm_pressure, physical_disk)
  • Plain Collector for cluster-level endpoints (ha_status)
  • NodeAwareCollector implementations guard the nodes slice with sync.Mutex, copy it at the start of Update(), matching the pattern in node_config.go and replication.go
  • Per-node fan-out uses sync.WaitGroup + semaphore from client.MaxConcurrent()
  • JSON response types as unexported structs in each collector file
  • Unit tests with JSON fixtures in collector/fixtures/
  • testCollectorAdapter pattern for testutil.GatherAndCompare
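The mutex-guarded nodes pattern plus fan-out can be sketched as below; type and method names here are illustrative, and the semaphore size stands in for client.MaxConcurrent():

```go
package main

import (
	"fmt"
	"sync"
)

// nodeAware sketches the pattern from node_config.go / replication.go:
// SetNodes may run concurrently with Update, so Update snapshots the
// slice under the lock before fanning out.
type nodeAware struct {
	mu    sync.Mutex
	nodes []string
}

func (c *nodeAware) SetNodes(nodes []string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.nodes = append([]string(nil), nodes...)
}

func (c *nodeAware) Update() []string {
	c.mu.Lock()
	nodes := append([]string(nil), c.nodes...) // copy under the lock
	c.mu.Unlock()

	var wg sync.WaitGroup
	sem := make(chan struct{}, 2) // stand-in for client.MaxConcurrent()
	results := make([]string, len(nodes))
	for i, node := range nodes {
		wg.Add(1)
		go func(i int, node string) {
			defer wg.Done()
			sem <- struct{}{}
			defer func() { <-sem }()
			results[i] = "scraped:" + node // per-node API call would go here
		}(i, node)
	}
	wg.Wait()
	return results
}

func main() {
	c := &nodeAware{}
	c.SetNodes([]string{"node01", "node02"})
	fmt.Println(c.Update())
	// → [scraped:node01 scraped:node02]
}
```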

## Test Fixtures

Each collector needs fixture files in collector/fixtures/:

  • node_status.json — full /nodes/{node}/status response
  • node_qemu_pressure.json — /nodes/{node}/qemu response with running + stopped VMs (reuse or extend existing node_qemu.json if pressure fields can be added)
  • ha_manager_status.json — /cluster/ha/status/manager_status response
  • ha_resources.json — /cluster/ha/resources response
  • node_disks.json — /nodes/{node}/disks/list response

The HA collector test needs two routes mapped in the test server (one per endpoint), unlike NodeAwareCollector tests which share a single route pattern.

## Scope Exclusions

  • SDN/Network: Excluded — API exposes config only, not operational state.
  • Kernel version info: Excluded — static, low operational value.
  • CPU model info: Excluded — static, low operational value.

## README Update

Remove the implemented items from the TODO section. Remove the SDN, kernel version, and CPU model entries entirely. If no TODO items remain, remove the section. Add metrics tables for all 4 new collectors following the existing table format.