pve-exporter/docs/superpowers/specs/2026-03-20-remaining-collectors-design.md
Davíð Steinn Geirsson e10156323b docs: address spec review feedback
- Split pve_ha_service_info into _config and _status to avoid stale series
- Handle wearout "N/A" and health "UNKNOWN" edge cases for physical disks
- Clarify node label convention and rootfs available vs free naming
- Note QEMU-only scope for VM pressure (LXC lacks PSI in PVE API)
- Add full node_status/lrm_status examples showing all cluster nodes
- Document mutex-guarded nodes pattern and test fixture requirements

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 15:10:22 +00:00


# Remaining Collectors Design Spec

Add 4 new collectors to pve-exporter covering the TODO items from the README: node status, VM pressure, HA status, and physical disks. SDN is excluded (config-only, not operationally useful). Kernel version and CPU model are excluded from node status (static, low value).

## Collectors

### 1. Node Status Collector

**File:** collector/node_status.go
**API:** /nodes/{node}/status (per-node fan-out)
**Type:** NodeAwareCollector

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| pve_node_load1 | Gauge | node | 1-minute load average |
| pve_node_load5 | Gauge | node | 5-minute load average |
| pve_node_load15 | Gauge | node | 15-minute load average |
| pve_node_swap_total_bytes | Gauge | node | Total swap in bytes |
| pve_node_swap_used_bytes | Gauge | node | Used swap in bytes |
| pve_node_swap_free_bytes | Gauge | node | Free swap in bytes |
| pve_node_rootfs_total_bytes | Gauge | node | Root filesystem total in bytes |
| pve_node_rootfs_used_bytes | Gauge | node | Root filesystem used in bytes |
| pve_node_rootfs_available_bytes | Gauge | node | Root filesystem available in bytes |
| pve_node_ksm_shared_bytes | Gauge | node | KSM shared memory in bytes |
| pve_node_boot_mode_info | Gauge | node, mode, secureboot | Boot mode info (always 1) |

The pve_node_ prefix disambiguates these from cluster_resources metrics which use id labels like node/node01. The node label here is the bare node name (e.g., node01), consistent with how other NodeAwareCollectors label per-node data.

Rootfs uses available (not free) because the API field avail reflects usable space after reserved blocks, which is the operationally relevant value. Swap has no reserved blocks so free is correct there.

API response structure:

{
  "data": {
    "loadavg": ["3.12", "2.88", "2.79"],
    "swap": {"total": 8589930496, "used": 0, "free": 8589930496},
    "rootfs": {"used": 28747304960, "total": 100861726720, "avail": 66943684608},
    "ksm": {"shared": 0},
    "boot-info": {"mode": "efi", "secureboot": 0}
  }
}

Load averages are strings in the API; parse them with strconv.ParseFloat. The secureboot label value is "1" or "0" (a string, not a bool).
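A minimal sketch of the response types and the load-average parsing, assuming the field names from the example above (these are illustrative, not the exporter's actual structs):

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// nodeStatus mirrors the /nodes/{node}/status payload shown above.
// Field names follow the example response; sketch only.
type nodeStatus struct {
	LoadAvg []string `json:"loadavg"` // strings in the API
	Swap    struct {
		Total, Used, Free uint64
	} `json:"swap"`
	RootFS struct {
		Total, Used uint64
		Avail       uint64 `json:"avail"` // available, not free
	} `json:"rootfs"`
	KSM struct {
		Shared uint64 `json:"shared"`
	} `json:"ksm"`
	BootInfo struct {
		Mode       string `json:"mode"`
		SecureBoot int    `json:"secureboot"`
	} `json:"boot-info"`
}

// parseLoadAvg converts the API's string load averages to floats.
func parseLoadAvg(raw []string) ([]float64, error) {
	out := make([]float64, len(raw))
	for i, s := range raw {
		f, err := strconv.ParseFloat(s, 64)
		if err != nil {
			return nil, fmt.Errorf("loadavg[%d] %q: %w", i, s, err)
		}
		out[i] = f
	}
	return out, nil
}

func main() {
	payload := []byte(`{"loadavg":["3.12","2.88","2.79"],"boot-info":{"mode":"efi","secureboot":0}}`)
	var st nodeStatus
	if err := json.Unmarshal(payload, &st); err != nil {
		panic(err)
	}
	loads, _ := parseLoadAvg(st.LoadAvg)
	// secureboot becomes a string label, matching the note above
	fmt.Println(loads[0], st.BootInfo.Mode, strconv.Itoa(st.BootInfo.SecureBoot))
	// → 3.12 efi 0
}
```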

### 2. VM Pressure Collector

**File:** collector/vm_pressure.go
**API:** /nodes/{node}/qemu (per-node fan-out)
**Type:** NodeAwareCollector

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| pve_vm_pressure_cpu_some_ratio | Gauge | id, node | CPU pressure (some) |
| pve_vm_pressure_cpu_full_ratio | Gauge | id, node | CPU pressure (full) |
| pve_vm_pressure_memory_some_ratio | Gauge | id, node | Memory pressure (some) |
| pve_vm_pressure_memory_full_ratio | Gauge | id, node | Memory pressure (full) |
| pve_vm_pressure_io_some_ratio | Gauge | id, node | I/O pressure (some) |
| pve_vm_pressure_io_full_ratio | Gauge | id, node | I/O pressure (full) |

API response structure (per VM entry):

{
  "vmid": 112,
  "status": "running",
  "pressurecpusome": 0,
  "pressurecpufull": 0,
  "pressurememorysome": 0,
  "pressurememoryfull": 0,
  "pressureiosome": 0,
  "pressureiofull": 0
}
  • Only emit metrics for running VMs (stopped VMs lack pressure fields).
  • QEMU only — LXC containers run in the host kernel namespace and do not expose per-container PSI metrics through the PVE API.
  • id label: constructed as fmt.Sprintf("qemu/%d", vmid) to match the existing convention used by cluster_resources and node_config collectors.
  • API returns values 0–100. Divide by 100 to produce 0.0–1.0 ratios.
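The rules above can be sketched as follows; the vmEntry type and its field names are illustrative stand-ins, not the exporter's actual JSON types:

```go
package main

import "fmt"

// vmEntry models only the fields this sketch needs from one
// /nodes/{node}/qemu list item.
type vmEntry struct {
	VMID            int
	Status          string
	PressureCPUSome float64
}

// pressureRatio converts the API's 0–100 value into a 0.0–1.0 ratio.
func pressureRatio(v float64) float64 { return v / 100 }

// vmID builds the id label matching the cluster_resources convention.
func vmID(vmid int) string { return fmt.Sprintf("qemu/%d", vmid) }

func main() {
	vms := []vmEntry{
		{VMID: 112, Status: "running", PressureCPUSome: 7},
		{VMID: 113, Status: "stopped"}, // skipped: stopped VMs lack pressure fields
	}
	for _, vm := range vms {
		if vm.Status != "running" {
			continue
		}
		fmt.Printf("%s %.2f\n", vmID(vm.VMID), pressureRatio(vm.PressureCPUSome))
	}
	// → qemu/112 0.07
}
```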

### 3. HA Status Collector

**File:** collector/ha_status.go
**API:** /cluster/ha/status/manager_status + /cluster/ha/resources (cluster-level, no per-node fan-out)
**Type:** Plain Collector (not NodeAwareCollector); both API calls happen inside a single Update() method.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| pve_ha_crm_master | Gauge | node | 1 if node is CRM master, 0 otherwise (all nodes) |
| pve_ha_node_status | Gauge | node, status | Per-node HA status (always 1) |
| pve_ha_lrm_timestamp_seconds | Gauge | node | Last LRM heartbeat as unix timestamp |
| pve_ha_lrm_mode | Gauge | node, mode | LRM mode per node (always 1) |
| pve_ha_service_config | Gauge | sid, type, max_restart, max_relocate, failback | Service config (always 1) |
| pve_ha_service_status | Gauge | sid, node, state | Service runtime state (always 1) |

Service metrics are split into _config (from /cluster/ha/resources, static labels) and _status (from manager_status.service_status, runtime labels that change). This avoids stale series when a service migrates between nodes or changes state.

API response structure (/cluster/ha/status/manager_status):

{
  "data": {
    "manager_status": {
      "master_node": "node03",
      "node_status": {
        "node01": "online",
        "node02": "online",
        "node03": "online",
        "node04": "online",
        "node05": "online"
      },
      "service_status": {
        "vm:106": {"node": "node04", "running": 1, "state": "started"}
      }
    },
    "lrm_status": {
      "node01": {"mode": "active", "state": "wait_for_agent_lock", "timestamp": 1774016351},
      "node02": {"mode": "active", "state": "wait_for_agent_lock", "timestamp": 1774016351},
      "node03": {"mode": "active", "state": "wait_for_agent_lock", "timestamp": 1774016351},
      "node04": {"mode": "active", "state": "active", "timestamp": 1774016350},
      "node05": {"mode": "active", "state": "wait_for_agent_lock", "timestamp": 1774016351}
    }
  }
}

API response structure (/cluster/ha/resources):

{
  "data": [
    {"sid": "vm:106", "type": "vm", "state": "started", "max_restart": 2, "max_relocate": 2, "failback": 1}
  ]
}
  • pve_ha_crm_master: iterate node_status keys from manager_status, emit 1 for master_node, 0 for all others. The node_status map contains all cluster nodes.
  • pve_ha_service_config: from /cluster/ha/resources. Numeric config values (max_restart, max_relocate, failback) become string labels.
  • pve_ha_service_status: from manager_status.service_status. The state label reflects runtime state (e.g., started, stopped, migrate).
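The pve_ha_crm_master rule can be sketched like this; emitCRMMaster and its map-based return type are illustrative helpers, not the collector's actual shape:

```go
package main

import (
	"fmt"
	"sort"
)

// emitCRMMaster iterates the node_status keys from manager_status and
// returns 1 for the master node, 0 for every other node, so all cluster
// nodes get a series.
func emitCRMMaster(master string, nodeStatus map[string]string) map[string]float64 {
	out := make(map[string]float64, len(nodeStatus))
	for node := range nodeStatus {
		if node == master {
			out[node] = 1
		} else {
			out[node] = 0
		}
	}
	return out
}

func main() {
	vals := emitCRMMaster("node03", map[string]string{
		"node01": "online", "node02": "online", "node03": "online",
	})
	// print in sorted order for deterministic output
	nodes := make([]string, 0, len(vals))
	for n := range vals {
		nodes = append(nodes, n)
	}
	sort.Strings(nodes)
	for _, n := range nodes {
		fmt.Printf("pve_ha_crm_master{node=%q} %g\n", n, vals[n])
	}
}
```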

### 4. Physical Disks Collector

**File:** collector/physical_disk.go
**API:** /nodes/{node}/disks/list (per-node fan-out)
**Type:** NodeAwareCollector

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| pve_physical_disk_health | Gauge | node, devpath, model, serial, type | 1 if SMART PASSED, 0 otherwise |
| pve_physical_disk_wearout_remaining_ratio | Gauge | node, devpath | Wearout remaining, 0.0–1.0 |
| pve_physical_disk_size_bytes | Gauge | node, devpath | Disk size in bytes |
| pve_physical_disk_info | Gauge | node, devpath, model, serial, type, used | Disk info (always 1) |
| pve_physical_disk_osd | Gauge | node, devpath, osd | Disk-to-OSD mapping (always 1, one per OSD) |

API response structure (per disk entry):

{
  "devpath": "/dev/nvme0n1",
  "health": "PASSED",
  "wearout": 100,
  "size": 7681501126656,
  "model": "VV007680KYFFL",
  "serial": "ADD3NA317I0104K2N",
  "type": "nvme",
  "used": "LVM",
  "osdid": "8",
  "osdid-list": ["8"]
}
  • wearout: API returns a 0–100 integer representing percentage remaining (100 = new, 0 = fully worn). Divide by 100 to get the ratio directly (no inversion needed). The API may return "N/A" (string) for disks that don't support wear leveling — use json.Number or similar to handle this. Skip emitting pve_physical_disk_wearout_remaining_ratio when the value is not a valid number.
  • health: compare string to "PASSED", emit 1 or 0. If health is empty or "UNKNOWN", emit 0.
  • osd label: format as "osd.N" (e.g., osd.8) matching Ceph daemon naming convention.
  • Multi-OSD disks: emit one pve_physical_disk_osd entry per item in osdid-list.
  • Non-OSD disks (osdid is -1 / osdid-list is null): no pve_physical_disk_osd entry emitted.

## Implementation Pattern

All 4 collectors follow the established patterns:

  • Self-register via init() + registerCollector()
  • NodeAwareCollector interface for per-node endpoints (node_status, vm_pressure, physical_disk)
  • Plain Collector for cluster-level endpoints (ha_status)
  • NodeAwareCollector implementations guard the nodes slice with sync.Mutex, copy it at the start of Update(), matching the pattern in node_config.go and replication.go
  • Per-node fan-out uses sync.WaitGroup + semaphore from client.MaxConcurrent()
  • JSON response types as unexported structs in each collector file
  • Unit tests with JSON fixtures in collector/fixtures/
  • testCollectorAdapter pattern for testutil.GatherAndCompare
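The mutex-guarded nodes pattern plus fan-out can be sketched as below; type and method names here are illustrative, and the semaphore size stands in for client.MaxConcurrent():

```go
package main

import (
	"fmt"
	"sync"
)

// nodeAware sketches the pattern from node_config.go / replication.go:
// SetNodes may run concurrently with Update, so Update snapshots the
// slice under the lock before fanning out.
type nodeAware struct {
	mu    sync.Mutex
	nodes []string
}

func (c *nodeAware) SetNodes(nodes []string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.nodes = append([]string(nil), nodes...)
}

func (c *nodeAware) Update() []string {
	c.mu.Lock()
	nodes := append([]string(nil), c.nodes...) // copy under the lock
	c.mu.Unlock()

	var wg sync.WaitGroup
	sem := make(chan struct{}, 2) // stand-in for client.MaxConcurrent()
	results := make([]string, len(nodes))
	for i, node := range nodes {
		wg.Add(1)
		go func(i int, node string) {
			defer wg.Done()
			sem <- struct{}{}
			defer func() { <-sem }()
			results[i] = "scraped:" + node // per-node API call would go here
		}(i, node)
	}
	wg.Wait()
	return results
}

func main() {
	c := &nodeAware{}
	c.SetNodes([]string{"node01", "node02"})
	fmt.Println(c.Update())
	// → [scraped:node01 scraped:node02]
}
```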

## Test Fixtures

Each collector needs fixture files in collector/fixtures/:

  • node_status.json — full /nodes/{node}/status response
  • node_qemu_pressure.json — /nodes/{node}/qemu response with running + stopped VMs (reuse or extend existing node_qemu.json if pressure fields can be added)
  • ha_manager_status.json — /cluster/ha/status/manager_status response
  • ha_resources.json — /cluster/ha/resources response
  • node_disks.json — /nodes/{node}/disks/list response

The HA collector test needs two routes mapped in the test server (one per endpoint), unlike NodeAwareCollector tests which share a single route pattern.

## Scope Exclusions

  • SDN/Network: Excluded — API exposes config only, not operational state.
  • Kernel version info: Excluded — static, low operational value.
  • CPU model info: Excluded — static, low operational value.

## README Update

Remove the implemented items from the TODO section. Remove the SDN, kernel version, and CPU model entries entirely. If no TODO items remain, remove the section. Add metrics tables for all 4 new collectors following the existing table format.