Add pve-exporter design spec

Full design for a Go Prometheus exporter for Proxmox VE, replacing
the Python prometheus-pve-exporter with corosync metrics added.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Davíð Steinn Geirsson 2026-03-20 10:59:14 +00:00
parent 4aa8a6d579
commit 154a46f3cf

# pve-exporter Design Spec
A Prometheus exporter for Proxmox VE written in Go. Replaces the Python
prometheus-pve-exporter with a single static binary, matching all existing
metric names for dashboard compatibility, and adding corosync cluster metrics.
## Goals
- Drop-in metric compatibility with prometheus-pve-exporter (same metric names
and labels where possible) so existing Grafana dashboards work unchanged
- Add corosync/quorum metrics not available in the Python exporter
- Single statically-linked binary for easy deployment via Ansible
- Cluster-wide scrape from a single instance (no per-node exporter deployment)
## Non-Goals
- Ceph metrics (collected separately via ceph-mgr)
- General-purpose PVE API client library
- Full parity with PVE's web UI
## Architecture
### Project Structure
```
pve-exporter/
├── main.go # Entry point, flag parsing, HTTP server
├── collector/
│ ├── collector.go # Collector interface, registry, PVECollector
│ ├── client.go # PVE API client (HTTP, auth, JSON parsing)
│ ├── cluster_status.go # pve_up, pve_node_info, pve_cluster_info
│ ├── cluster_resources.go # CPU, memory, disk, network, storage, guest/HA info
│ ├── corosync.go # pve_cluster_quorate, nodes_total, expected_votes, node_online
│ ├── version.go # pve_version_info
│ ├── backup.go # pve_not_backed_up_*
│ ├── node_config.go # pve_onboot_status
│ ├── replication.go # pve_replication_*
│ └── subscription.go # pve_subscription_*
├── go.mod
├── go.sum
├── Makefile
└── README.md
```
### Collector Framework
Follows the node_exporter pattern:
```go
type Collector interface {
    Update(client *Client, ch chan<- prometheus.Metric) error
}
```
Collectors self-register via `init()` + `registerCollector()`. The framework
runs all collectors in parallel (goroutines + WaitGroup) and emits per-collector
scrape duration and success metrics automatically.
Collectors that need the node list or shared `/cluster/resources` data implement
additional interfaces:
```go
type NodeAwareCollector interface {
    Collector
    SetNodes(nodes []string)
}

type ResourceAwareCollector interface {
    Collector
    SetResources(data []byte)
}
```
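The registration and fan-out machinery is not spelled out above, so here is a minimal sketch of the pattern under the stated design (names like `registerCollector`, `runAll`, and the `versionCollector` example are illustrative, not the actual implementation; `Metric` and `Client` stand in for the real types):

```go
package main

import "sync"

type Metric struct{} // stand-in for prometheus.Metric

type Client struct{} // stand-in for the PVE API client

type Collector interface {
	Update(client *Client, ch chan<- Metric) error
}

var (
	factoriesMu sync.Mutex
	factories   = make(map[string]func() Collector)
)

// registerCollector is called from each collector file's init().
func registerCollector(name string, factory func() Collector) {
	factoriesMu.Lock()
	defer factoriesMu.Unlock()
	factories[name] = factory
}

// Example collector self-registering at package init time.
type versionCollector struct{}

func (versionCollector) Update(client *Client, ch chan<- Metric) error {
	ch <- Metric{}
	return nil
}

func init() { registerCollector("version", func() Collector { return versionCollector{} }) }

// runAll executes every registered collector in its own goroutine and
// waits for all of them. The real framework would also time each call
// and emit pve_scrape_collector_duration_seconds / _success here.
func runAll(client *Client, ch chan<- Metric) {
	var wg sync.WaitGroup
	for name, factory := range factories {
		wg.Add(1)
		go func(name string, c Collector) {
			defer wg.Done()
			_ = c.Update(client, ch)
		}(name, factory())
	}
	wg.Wait()
}
```

The map of factories (rather than instances) lets the framework construct fresh collector state per scrape if it ever needs to.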
### Scrape Flow
1. Prometheus hits `/metrics`
2. `PVECollector.Collect()` fetches `/cluster/resources` first (needed by
multiple collectors and provides the node list)
3. Node list and resources data are passed to collectors that need them
4. All collectors run in parallel
5. Per-node API calls within collectors (subscription, replication, node_config)
are parallelized across nodes with bounded concurrency (5 concurrent requests)
6. Framework measures duration, catches errors, emits scrape meta-metrics
### API Client
```go
type Client struct {
    httpClient *http.Client
    hosts      []string // tried in order on failure
    token      string   // PVEAPIToken=user@realm!tokenid=uuid
}
```
- Tries hosts in order; on a connection or HTTP error, falls through to the next
host. Remembers the last working host and tries it first on subsequent scrapes.
- 1-second TCP connect timeout for fast failover to the next host.
- TLS certificate verification enabled by default. `--pve.tls-insecure` to
disable.
- Single `Get(path string) ([]byte, error)` method. No caching; each scrape
makes fresh API calls.
- Context-aware with scrape timeout propagated from Prometheus.
### Authentication
- `--pve.api-token` flag or `PVE_API_TOKEN` env var for token string
- `--pve.token-file` for reading the token from a file at startup (keeps the
token out of the process list; Ansible-friendly)
- Sent as `Authorization: PVEAPIToken=...` header
## CLI & HTTP
```
pve-exporter \
  --pve.host=https://node02:8006 \
  --pve.host=https://node01:8006 \
  --pve.token-file=/etc/pve-exporter/apikey \
  --web.listen-address=:9221 \
  --web.telemetry-path=/metrics
```
| Flag | Default | Description |
|------|---------|-------------|
| `--pve.host` | (required, repeatable) | PVE API base URLs, tried in order |
| `--pve.api-token` | — | API token string (mutually exclusive with token-file) |
| `--pve.token-file` | — | Path to file containing API token |
| `--pve.tls-insecure` | `false` | Disable TLS certificate verification |
| `--web.listen-address` | `:9221` | Address to listen on |
| `--web.telemetry-path` | `/metrics` | Path for metrics endpoint |
| `--log.level` | `info` | Log level (debug, info, warn, error) |
| `--log.format` | `logfmt` | Log format (logfmt, json) |
HTTP endpoints:
- `/metrics` — Prometheus metrics
- `/` — Landing page with link to metrics
Port 9221 matches the Python exporter for drop-in compatibility.
## Metrics
All metrics use namespace `pve`.
### cluster_status collector
API: `/cluster/status`
| Metric | Type | Labels |
|--------|------|--------|
| `pve_up` | Gauge | `id` |
| `pve_node_info` | Gauge | `id`, `level`, `name`, `nodeid` |
| `pve_cluster_info` | Gauge | `id`, `nodes`, `quorate`, `version` |
### corosync collector
API: `/cluster/status`, `/cluster/config/nodes`
| Metric | Type | Labels |
|--------|------|--------|
| `pve_cluster_quorate` | Gauge | — |
| `pve_cluster_nodes_total` | Gauge | — |
| `pve_cluster_expected_votes` | Gauge | — |
| `pve_node_online` | Gauge | `name`, `nodeid` |
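Extracting these gauges from `/cluster/status` is mostly JSON plumbing; the sketch below assumes the PVE convention of a `data` array whose entries carry `type: "cluster"` (with `quorate`, `nodes`) and `type: "node"` (with `name`, `nodeid`, `online`) — field names should be verified against a live response:

```go
package main

import "encoding/json"

// statusEntry covers the subset of /cluster/status fields this
// collector reads; PVE returns both cluster and node entries.
type statusEntry struct {
	Type    string `json:"type"`
	Name    string `json:"name"`
	NodeID  int    `json:"nodeid"`
	Online  int    `json:"online"`
	Quorate int    `json:"quorate"`
	Nodes   int    `json:"nodes"`
}

type corosyncStats struct {
	Quorate    float64
	NodesTotal float64
	NodeOnline map[string]float64 // keyed by node name
}

// parseClusterStatus pulls the quorum gauges out of a raw
// /cluster/status response body.
func parseClusterStatus(data []byte) (corosyncStats, error) {
	var wrapper struct {
		Data []statusEntry `json:"data"`
	}
	s := corosyncStats{NodeOnline: map[string]float64{}}
	if err := json.Unmarshal(data, &wrapper); err != nil {
		return s, err
	}
	for _, e := range wrapper.Data {
		switch e.Type {
		case "cluster":
			s.Quorate = float64(e.Quorate)
			s.NodesTotal = float64(e.Nodes)
		case "node":
			s.NodeOnline[e.Name] = float64(e.Online)
		}
	}
	return s, nil
}
```

`pve_cluster_expected_votes` would come from `/cluster/config/nodes` (or the votes fields, if present) rather than this response.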
### cluster_resources collector
API: `/cluster/resources`
| Metric | Type | Labels |
|--------|------|--------|
| `pve_cpu_usage_ratio` | Gauge | `id` |
| `pve_cpu_usage_limit` | Gauge | `id` |
| `pve_memory_usage_bytes` | Gauge | `id` |
| `pve_memory_size_bytes` | Gauge | `id` |
| `pve_disk_usage_bytes` | Gauge | `id` |
| `pve_disk_size_bytes` | Gauge | `id` |
| `pve_network_transmit_bytes_total` | Counter | `id` |
| `pve_network_receive_bytes_total` | Counter | `id` |
| `pve_disk_written_bytes_total` | Counter | `id` |
| `pve_disk_read_bytes_total` | Counter | `id` |
| `pve_uptime_seconds` | Gauge | `id` |
| `pve_storage_shared` | Gauge | `id` |
| `pve_guest_info` | Gauge | `id`, `node`, `name`, `type`, `template`, `tags` |
| `pve_storage_info` | Gauge | `id`, `node`, `storage`, `plugintype`, `content` |
| `pve_ha_state` | Gauge | `id`, `state` |
| `pve_lock_state` | Gauge | `id`, `state` |
### version collector
API: `/version`
| Metric | Type | Labels |
|--------|------|--------|
| `pve_version_info` | Gauge | `release`, `repoid`, `version` |
### backup collector
API: `/cluster/backup-info/not-backed-up`
| Metric | Type | Labels |
|--------|------|--------|
| `pve_not_backed_up_total` | Gauge | `id` |
| `pve_not_backed_up_info` | Gauge | `id` |
### node_config collector
API: `/nodes/{node}/qemu/{vmid}/config`, `/nodes/{node}/lxc/{vmid}/config`
| Metric | Type | Labels |
|--------|------|--------|
| `pve_onboot_status` | Gauge | `id`, `node`, `type` |
### replication collector
API: `/nodes/{node}/replication`
| Metric | Type | Labels |
|--------|------|--------|
| `pve_replication_info` | Gauge | `id`, `type`, `source`, `target`, `guest` |
| `pve_replication_duration_seconds` | Gauge | `id` |
| `pve_replication_last_sync_timestamp_seconds` | Gauge | `id` |
| `pve_replication_last_try_timestamp_seconds` | Gauge | `id` |
| `pve_replication_next_sync_timestamp_seconds` | Gauge | `id` |
| `pve_replication_failed_syncs` | Gauge | `id` |
### subscription collector
API: `/nodes/{node}/subscription`
| Metric | Type | Labels |
|--------|------|--------|
| `pve_subscription_info` | Gauge | `id`, `level` |
| `pve_subscription_status` | Gauge | `id`, `status` |
| `pve_subscription_next_due_timestamp_seconds` | Gauge | `id` |
### Scrape meta-metrics
| Metric | Type | Labels |
|--------|------|--------|
| `pve_scrape_collector_duration_seconds` | Gauge | `collector` |
| `pve_scrape_collector_success` | Gauge | `collector` |
## Dependencies
- `github.com/alecthomas/kingpin/v2` — CLI flags
- `github.com/prometheus/client_golang` — Prometheus client
- `github.com/prometheus/common` — logging (promslog)
- `github.com/prometheus/exporter-toolkit` — TLS, web config, landing page
## Testing Strategy
- Unit tests per collector with mock API responses (JSON fixtures)
- Integration test: start exporter, scrape `/metrics`, verify expected metric
names and labels are present
- Manual validation against live PVE cluster
## Future Metrics (TODO)
The following metrics are available from the PVE API but are deferred to future work:
### Per-node detailed status (`/nodes/{node}/status`)
- Load averages (1m, 5m, 15m)
- Swap usage (total, used, free)
- Root filesystem usage (total, used, available)
- KSM shared memory
- Kernel version info
- Boot mode and secure boot status
- CPU model info (model, sockets, cores, MHz)
### Per-VM pressure metrics (`/nodes/{node}/qemu`)
- `pressurecpusome`, `pressurecpufull`
- `pressurememorysome`, `pressurememoryfull`
- `pressureiosome`, `pressureiofull`
### HA detailed status (`/cluster/ha/status/current`)
- CRM master node and status
- Per-node LRM status (idle/active) and timestamps
- Per-service HA config (failback, max_restart, max_relocate)
### Physical disks (`/nodes/{node}/disks/list`)
- Disk health (SMART status)
- Wearout level
- Size and model info
- OSD mapping
### SDN/Network (`/cluster/resources` type=sdn)
- Zone status per node
- Zone type info