docs: address spec review feedback

- Split pve_ha_service_info into _config and _status to avoid stale series
- Handle wearout "N/A" and health "UNKNOWN" edge cases for physical disks
- Clarify node label convention and rootfs available vs free naming
- Note QEMU-only scope for VM pressure (LXC lacks PSI in PVE API)
- Add full node_status/lrm_status examples showing all cluster nodes
- Document mutex-guarded nodes pattern and test fixture requirements

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Davíð Steinn Geirsson 2026-03-20 15:10:22 +00:00
parent b4ed302009
commit e10156323b


@ -27,6 +27,14 @@ not operationally useful). Kernel version and CPU model are excluded from node s
| `pve_node_ksm_shared_bytes` | Gauge | `node` | KSM shared memory in bytes |
| `pve_node_boot_mode_info` | Gauge | `node`, `mode`, `secureboot` | Boot mode info (always 1) |
The `pve_node_` prefix disambiguates these from `cluster_resources` metrics which use
`id` labels like `node/node01`. The `node` label here is the bare node name (e.g.,
`node01`), consistent with how other NodeAwareCollectors label per-node data.
Rootfs uses `available` (not `free`) because the API field `avail` reflects usable
space after reserved blocks, which is the operationally relevant value. Swap has no
reserved blocks so `free` is correct there.
**API response structure:**
```json
{
@ -73,14 +81,18 @@ Load averages are strings in the API, parse with `strconv.ParseFloat`.
```
- Only emit metrics for running VMs (stopped VMs lack pressure fields).
- QEMU only — LXC containers run in the host kernel namespace and do not expose
per-container PSI metrics through the PVE API.
- `id` label: constructed as `fmt.Sprintf("qemu/%d", vmid)` to match the existing
convention used by `cluster_resources` and `node_config` collectors.
- API returns values 0–100. Divide by 100 to produce 0.0–1.0 ratios.
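The conversions described in the bullets above can be sketched as small helpers. This is an assumption-laden sketch: the exact pressure field names in the PVE API response are not shown here, so the helpers only take the raw values.

```go
package main

import (
	"fmt"
	"strconv"
)

// pressureRatio scales the API's 0-100 pressure value to a 0.0-1.0 ratio.
func pressureRatio(v float64) float64 {
	return v / 100
}

// qemuID builds the `id` label following the "qemu/{vmid}" convention.
func qemuID(vmid int) string {
	return fmt.Sprintf("qemu/%d", vmid)
}

// parseLoadAvg handles fields the API returns as strings (e.g. load averages).
func parseLoadAvg(s string) (float64, error) {
	return strconv.ParseFloat(s, 64)
}

func main() {
	fmt.Println(qemuID(112))       // qemu/112
	fmt.Println(pressureRatio(50)) // 0.5
	v, _ := parseLoadAvg("1.42")
	fmt.Println(v) // 1.42
}
```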
### 3. HA Status Collector
**File:** `collector/ha_status.go`
**API:** `/cluster/ha/status/manager_status` + `/cluster/ha/resources` (cluster-level, no per-node fan-out)
**Type:** Plain Collector (not NodeAwareCollector). Both API calls happen inside a
single `Update()` method.
| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
@ -88,7 +100,12 @@ Load averages are strings in the API, parse with `strconv.ParseFloat`.
| `pve_ha_node_status` | Gauge | `node`, `status` | Per-node HA status (always 1) |
| `pve_ha_lrm_timestamp_seconds` | Gauge | `node` | Last LRM heartbeat as unix timestamp |
| `pve_ha_lrm_mode` | Gauge | `node`, `mode` | LRM mode per node (always 1) |
| `pve_ha_service_config` | Gauge | `sid`, `type`, `max_restart`, `max_relocate`, `failback` | Service config (always 1) |
| `pve_ha_service_status` | Gauge | `sid`, `node`, `state` | Service runtime state (always 1) |
Service metrics are split into `_config` (from `/cluster/ha/resources`, static labels)
and `_status` (from `manager_status.service_status`, runtime labels that change). This
avoids stale series when a service migrates between nodes or changes state.
**API response structure (`/cluster/ha/status/manager_status`):**
```json
@ -96,13 +113,23 @@ Load averages are strings in the API, parse with `strconv.ParseFloat`.
"data": {
"manager_status": {
"master_node": "node03",
"node_status": {
"node01": "online",
"node02": "online",
"node03": "online",
"node04": "online",
"node05": "online"
},
"service_status": {
"vm:106": {"node": "node04", "running": 1, "state": "started"}
}
},
"lrm_status": {
"node01": {"mode": "active", "state": "wait_for_agent_lock", "timestamp": 1774016351},
"node02": {"mode": "active", "state": "wait_for_agent_lock", "timestamp": 1774016351},
"node03": {"mode": "active", "state": "wait_for_agent_lock", "timestamp": 1774016351},
"node04": {"mode": "active", "state": "active", "timestamp": 1774016350},
"node05": {"mode": "active", "state": "wait_for_agent_lock", "timestamp": 1774016351}
}
}
}
@ -117,8 +144,12 @@ Load averages are strings in the API, parse with `strconv.ParseFloat`.
}
```
- `pve_ha_crm_master`: iterate `node_status` keys from `manager_status`, emit 1 for
`master_node`, 0 for all others. The `node_status` map contains all cluster nodes.
- `pve_ha_service_config`: from `/cluster/ha/resources`. Numeric config values
(`max_restart`, `max_relocate`, `failback`) become string labels.
- `pve_ha_service_status`: from `manager_status.service_status`. The `state` label
reflects runtime state (e.g., `started`, `stopped`, `migrate`).
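The `pve_ha_crm_master` iteration described above reduces to a small value computation. A minimal sketch (metric registration and emission omitted; only the per-node values are computed):

```go
package main

import "fmt"

// crmMasterValues returns, for each node in the node_status map, 1 if it is
// the CRM master and 0 otherwise.
func crmMasterValues(masterNode string, nodeStatus map[string]string) map[string]float64 {
	out := make(map[string]float64, len(nodeStatus))
	for node := range nodeStatus {
		var v float64
		if node == masterNode {
			v = 1
		}
		out[node] = v
	}
	return out
}

func main() {
	vals := crmMasterValues("node03", map[string]string{
		"node01": "online", "node02": "online", "node03": "online",
	})
	fmt.Println(vals["node03"], vals["node01"]) // 1 0
}
```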
### 4. Physical Disks Collector
@ -150,8 +181,13 @@ Load averages are strings in the API, parse with `strconv.ParseFloat`.
}
```
- `wearout`: API returns a 0–100 integer representing percentage remaining (100 = new,
0 = fully worn). Divide by 100 to get ratio directly (no inversion needed). The API
may return `"N/A"` (string) for disks that don't support wear leveling — use
`json.Number` or similar to handle this. Skip emitting `pve_physical_disk_wearout_remaining_ratio`
when the value is not a valid number.
- `health`: compare string to `"PASSED"`, emit 1 or 0. If `health` is empty or
`"UNKNOWN"`, emit 0.
- `osd` label: format as `"osd.N"` (e.g., `osd.8`) matching Ceph daemon naming convention.
- Multi-OSD disks: emit one `pve_physical_disk_osd` entry per item in `osdid-list`.
- Non-OSD disks (`osdid` is -1 / `osdid-list` is null): no `pve_physical_disk_osd` entry emitted.
@ -163,11 +199,27 @@ All 4 collectors follow the established patterns:
- Self-register via `init()` + `registerCollector()`
- NodeAwareCollector interface for per-node endpoints (node_status, vm_pressure, physical_disk)
- Plain Collector for cluster-level endpoints (ha_status)
- NodeAwareCollector implementations guard the `nodes` slice with a `sync.Mutex` and copy
  it at the start of `Update()`, matching the pattern in `node_config.go` and `replication.go`
- Per-node fan-out uses `sync.WaitGroup` + semaphore from `client.MaxConcurrent()`
- JSON response types as unexported structs in each collector file
- Unit tests with JSON fixtures in `collector/fixtures/`
- `testCollectorAdapter` pattern for `testutil.GatherAndCompare`
### Test Fixtures
Each collector needs fixture files in `collector/fixtures/`:
- `node_status.json` — full `/nodes/{node}/status` response
- `node_qemu_pressure.json` — `/nodes/{node}/qemu` response with running + stopped VMs
  (reuse or extend existing `node_qemu.json` if pressure fields can be added)
- `ha_manager_status.json` — `/cluster/ha/status/manager_status` response
- `ha_resources.json` — `/cluster/ha/resources` response
- `node_disks.json` — `/nodes/{node}/disks/list` response
The HA collector test needs two routes mapped in the test server (one per endpoint),
unlike NodeAwareCollector tests which share a single route pattern.
## Scope Exclusions
- **SDN/Network**: Excluded — API exposes config only, not operational state.
@ -177,4 +229,5 @@ All 4 collectors follow the established patterns:
## README Update
Remove the implemented items from the TODO section. Remove the SDN, kernel version,
and CPU model entries entirely. If no TODO items remain, remove the section. Add
metrics tables for all 4 new collectors following the existing table format.