docs: update README with new collector metrics, remove TODO section

Add metrics tables for node_status, vm_pressure, ha_status, and
physical_disk collectors. Remove the TODO section as all planned
metrics are now implemented (SDN excluded by design).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Davíð Steinn Geirsson 2026-03-20 15:35:00 +00:00
parent a88c696bfd
commit 771c3dc126


@@ -130,42 +130,57 @@ Create a PVE API token with at least `PVEAuditor` role. Provide it via:
| `pve_subscription_status` | Gauge | `id`, `status` | Subscription status |
| `pve_subscription_next_due_timestamp_seconds` | Gauge | `id` | Next due date as Unix timestamp |
### Node Status
| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `pve_node_load1` | Gauge | `node` | 1-minute load average |
| `pve_node_load5` | Gauge | `node` | 5-minute load average |
| `pve_node_load15` | Gauge | `node` | 15-minute load average |
| `pve_node_swap_total_bytes` | Gauge | `node` | Total swap in bytes |
| `pve_node_swap_used_bytes` | Gauge | `node` | Used swap in bytes |
| `pve_node_swap_free_bytes` | Gauge | `node` | Free swap in bytes |
| `pve_node_rootfs_total_bytes` | Gauge | `node` | Root filesystem total in bytes |
| `pve_node_rootfs_used_bytes` | Gauge | `node` | Root filesystem used in bytes |
| `pve_node_rootfs_available_bytes` | Gauge | `node` | Root filesystem available in bytes |
| `pve_node_ksm_shared_bytes` | Gauge | `node` | KSM shared memory in bytes |
| `pve_node_boot_mode_info` | Gauge | `node`, `mode`, `secureboot` | Boot mode info (always 1) |
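
As a rough illustration of how the node status rows above could be produced, the sketch below flattens a `/nodes/{node}/status` style payload into metric samples. The payload field names (`loadavg`, `swap`, `rootfs`, `ksm`, `boot-info`) and the helper `node_status_samples` are assumptions made for this example and are not taken from the exporter's source.

```python
def node_status_samples(node, status):
    """Flatten a node status payload into (metric, labels, value) tuples."""
    labels = {"node": node}
    samples = []

    # Assumption: load averages arrive as a list of three numeric strings.
    for name, raw in zip(("load1", "load5", "load15"), status.get("loadavg", [])):
        samples.append((f"pve_node_{name}", labels, float(raw)))

    # Assumption: swap and rootfs are nested objects holding byte counts.
    for field in ("total", "used", "free"):
        value = status.get("swap", {}).get(field)
        if value is not None:
            samples.append((f"pve_node_swap_{field}_bytes", labels, float(value)))

    for key, suffix in (("total", "total"), ("used", "used"), ("avail", "available")):
        value = status.get("rootfs", {}).get(key)
        if value is not None:
            samples.append((f"pve_node_rootfs_{suffix}_bytes", labels, float(value)))

    shared = status.get("ksm", {}).get("shared")
    if shared is not None:
        samples.append(("pve_node_ksm_shared_bytes", labels, float(shared)))

    # Info-style metric: the value is always 1, the data lives in the labels.
    boot = status.get("boot-info")
    if boot:
        info_labels = {
            "node": node,
            "mode": str(boot.get("mode", "")),
            "secureboot": str(boot.get("secureboot", 0)),
        }
        samples.append(("pve_node_boot_mode_info", info_labels, 1.0))
    return samples
```

Missing fields are skipped rather than reported as zero, so an absent value is not mistaken for a real measurement.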
### VM Pressure
| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `pve_vm_pressure_cpu_some_ratio` | Gauge | `id`, `node` | CPU pressure (some) |
| `pve_vm_pressure_cpu_full_ratio` | Gauge | `id`, `node` | CPU pressure (full) |
| `pve_vm_pressure_memory_some_ratio` | Gauge | `id`, `node` | Memory pressure (some) |
| `pve_vm_pressure_memory_full_ratio` | Gauge | `id`, `node` | Memory pressure (full) |
| `pve_vm_pressure_io_some_ratio` | Gauge | `id`, `node` | I/O pressure (some) |
| `pve_vm_pressure_io_full_ratio` | Gauge | `id`, `node` | I/O pressure (full) |
### HA Status
| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `pve_ha_crm_master` | Gauge | `node` | 1 if node is CRM master, 0 otherwise |
| `pve_ha_node_status` | Gauge | `node`, `status` | Per-node HA status (always 1) |
| `pve_ha_lrm_timestamp_seconds` | Gauge | `node` | Last LRM heartbeat as Unix timestamp |
| `pve_ha_lrm_mode` | Gauge | `node`, `mode` | LRM mode per node (always 1) |
| `pve_ha_service_config` | Gauge | `sid`, `type`, `max_restart`, `max_relocate`, `failback` | Service config (always 1) |
| `pve_ha_service_status` | Gauge | `sid`, `node`, `state` | Service runtime state (always 1) |
### Physical Disks
| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `pve_physical_disk_health` | Gauge | `node`, `devpath`, `model`, `serial`, `type` | 1 if SMART PASSED, 0 otherwise |
| `pve_physical_disk_wearout_remaining_ratio` | Gauge | `node`, `devpath` | Wearout remaining (1.0 = new) |
| `pve_physical_disk_size_bytes` | Gauge | `node`, `devpath` | Disk size in bytes |
| `pve_physical_disk_info` | Gauge | `node`, `devpath`, `model`, `serial`, `type`, `used` | Disk info (always 1) |
| `pve_physical_disk_osd` | Gauge | `node`, `devpath`, `osd` | Disk-to-OSD mapping (always 1) |
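
The health and wearout rows above imply a small normalization step. The sketch below shows one way it might look, assuming the API reports wearout as a percentage remaining (or a non-numeric placeholder when unknown) and SMART health as a string. Both helper names and the input formats are hypothetical.

```python
def wearout_ratio(wearout):
    """Return remaining wearout as a ratio in [0.0, 1.0], or None if unknown.

    Assumption: the input is a percentage remaining (100 = new) or a
    non-numeric placeholder such as "N/A".
    """
    try:
        percent = float(wearout)
    except (TypeError, ValueError):
        return None
    # Clamp so out-of-range readings cannot leave the documented 0..1 scale.
    return max(0.0, min(1.0, percent / 100.0))


def health_value(health):
    """Return 1.0 if SMART reports PASSED, 0.0 for anything else."""
    return 1.0 if health == "PASSED" else 0.0
```

Returning `None` for an unknown wearout lets the caller omit the sample instead of exporting a misleading 0.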
### Scrape Meta
| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `pve_scrape_collector_duration_seconds` | Gauge | `collector` | Scrape duration per collector |
| `pve_scrape_collector_success` | Gauge | `collector` | 1 if collector succeeded |
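
The two scrape meta metrics above can be derived by wrapping each collector call. The following is a minimal sketch under that assumption; `run_collector` and the tuple-based sample format are illustrative and do not reflect the exporter's actual code.

```python
import time


def run_collector(name, collect):
    """Run one collector callable and return its scrape meta samples."""
    start = time.monotonic()
    try:
        collect()
        success = 1.0
    except Exception:
        # A failing collector is reported through the success gauge
        # rather than aborting the whole scrape.
        success = 0.0
    duration = time.monotonic() - start
    labels = {"collector": name}
    return [
        ("pve_scrape_collector_duration_seconds", labels, duration),
        ("pve_scrape_collector_success", labels, success),
    ]
```

A monotonic clock is used so that wall-clock adjustments during a scrape cannot produce negative durations.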
## TODO: Future Metrics
The following metrics are available from the PVE API but not yet implemented:
### Per-node detailed status (`/nodes/{node}/status`)
- Load averages (1m, 5m, 15m)
- Swap usage (total, used, free)
- Root filesystem usage (total, used, available)
- KSM shared memory
- Kernel version info
- Boot mode and secure boot status
- CPU model info (model, sockets, cores, MHz)
### Per-VM pressure metrics (`/nodes/{node}/qemu`)
- `pressurecpusome`, `pressurecpufull`
- `pressurememorysome`, `pressurememoryfull`
- `pressureiosome`, `pressureiofull`
### HA detailed status (`/cluster/ha/status/current`)
- CRM master node and status
- Per-node LRM status (idle/active) and timestamps
- Per-service HA config (failback, max_restart, max_relocate)
### Physical disks (`/nodes/{node}/disks/list`)
- Disk health (SMART status)
- Wearout level
- Size and model info
- OSD mapping
### SDN/Network (`/cluster/resources` type=sdn)
- Zone status per node
- Zone type info