Proxmox VE Prometheus exporter

Davíð Steinn Geirsson a88c696bfd feat: add physical_disk collector (health, wearout, size, OSD mapping) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>		2026-03-20 15:33:46 +00:00
collector	feat: add physical_disk collector (health, wearout, size, OSD mapping)	2026-03-20 15:33:46 +00:00
docs/superpowers	docs: add implementation plan for remaining collectors	2026-03-20 15:20:17 +00:00
.gitignore	Add flake.nix for Nix builds and dev shell	2026-03-20 12:44:50 +00:00
flake.lock	Add flake.nix for Nix builds and dev shell	2026-03-20 12:44:50 +00:00
flake.nix	Add flake.nix for Nix builds and dev shell	2026-03-20 12:44:50 +00:00
go.mod	feat: add version collector (pve_version_info)	2026-03-20 11:31:06 +00:00
go.sum	feat: add main entry point with CLI flags and HTTP server	2026-03-20 11:27:53 +00:00
main.go	feat: add main entry point with CLI flags and HTTP server	2026-03-20 11:27:53 +00:00
Makefile	feat: add main entry point with CLI flags and HTTP server	2026-03-20 11:27:53 +00:00
README.md	docs: add README with usage, metrics reference, and future metrics TODO	2026-03-20 11:38:17 +00:00

README.md

pve-exporter

A Prometheus exporter for Proxmox VE written in Go. Produces a single static binary for easy deployment.

Designed as a drop-in replacement for prometheus-pve-exporter with matching metric names for dashboard compatibility, plus additional corosync cluster metrics.

Installation

CGO_ENABLED=0 go build -o pve-exporter .

Usage

pve-exporter \
  --pve.host=https://node01:8006 \
  --pve.host=https://node02:8006 \
  --pve.token-file=/etc/pve-exporter/apikey \
  --web.listen-address=:9221

The exporter scrapes all cluster data from a single PVE API endpoint. Multiple --pve.host values provide failover: hosts are tried in the order given, each with a 1-second connect timeout, so an unreachable host is skipped quickly.

Flags

Flag	Default	Description
`--pve.host`	(required)	PVE API base URL (repeatable)
`--pve.api-token`		API token string (`user@realm!tokenid=uuid`)
`--pve.token-file`		Path to file containing API token
`--pve.tls-insecure`	`false`	Disable TLS certificate verification
`--pve.max-concurrent`	`5`	Max concurrent API requests for per-node fan-out
`--web.listen-address`	`:9221`	Address to listen on
`--web.telemetry-path`	`/metrics`	Path for metrics endpoint
`--log.level`	`info`	Log level (debug, info, warn, error)
`--log.format`	`logfmt`	Log format (logfmt, json)
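
To illustrate what a limit such as `--pve.max-concurrent` means for per-node fan-out, here is a minimal sketch that uses a buffered channel as a counting semaphore. The function name `fanOut` and the stand-in request are assumptions made for the example; the exporter may bound concurrency differently.

```go
package main

import (
	"fmt"
	"sync"
)

// fanOut runs fn once per node with at most maxConcurrent calls in flight
// and returns the results in the same order as nodes.
func fanOut(nodes []string, maxConcurrent int, fn func(node string) string) []string {
	if maxConcurrent < 1 {
		maxConcurrent = 1
	}
	results := make([]string, len(nodes))
	sem := make(chan struct{}, maxConcurrent) // counting semaphore
	var wg sync.WaitGroup
	for i, node := range nodes {
		wg.Add(1)
		sem <- struct{}{} // blocks while maxConcurrent calls are running
		go func(i int, node string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			results[i] = fn(node)    // each goroutine writes its own index
		}(i, node)
	}
	wg.Wait()
	return results
}

func main() {
	nodes := []string{"node01", "node02", "node03"}
	out := fanOut(nodes, 5, func(node string) string {
		// Stand-in for an API request against a per-node endpoint.
		return "/nodes/" + node + "/status"
	})
	fmt.Println(out)
}
```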

Authentication

Create a PVE API token with at least the PVEAuditor role. Provide it via one of:

--pve.api-token=user@realm!tokenid=uuid (visible in process list)
--pve.token-file=/path/to/file (recommended)
PVE_API_TOKEN environment variable

--pve.api-token and --pve.token-file are mutually exclusive.
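
The rules above could be implemented along these lines. This is a hedged sketch: `resolveToken` and `authHeader` are hypothetical names, the precedence of flags over the environment variable is an assumption (the README does not state an order), and the token in `main` is a placeholder. The `PVEAPIToken=` prefix is the Authorization header format the PVE API uses for token authentication.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strings"
)

// resolveToken picks the API token from the flag value, the token file, or
// the environment value. Supplying both the flag and the file is an error.
// Assumption: flags take precedence over the environment variable.
func resolveToken(flagToken, tokenFile, envToken string, readFile func(string) ([]byte, error)) (string, error) {
	if flagToken != "" && tokenFile != "" {
		return "", errors.New("--pve.api-token and --pve.token-file are mutually exclusive")
	}
	if flagToken != "" {
		return strings.TrimSpace(flagToken), nil
	}
	if tokenFile != "" {
		data, err := readFile(tokenFile)
		if err != nil {
			return "", fmt.Errorf("reading token file: %w", err)
		}
		// Trim the trailing newline that editors usually add.
		if t := strings.TrimSpace(string(data)); t != "" {
			return t, nil
		}
		return "", errors.New("token file is empty")
	}
	if t := strings.TrimSpace(envToken); t != "" {
		return t, nil
	}
	return "", errors.New("no API token provided")
}

// authHeader formats the token as an Authorization header value.
func authHeader(token string) string {
	return "PVEAPIToken=" + token
}

func main() {
	// Placeholder token, not a real credential.
	token, err := resolveToken("monitor@pve!exporter=00000000-0000-0000-0000-000000000000", "", "", os.ReadFile)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("Authorization:", authHeader(token))
}
```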

Metrics

Cluster Status

Metric	Type	Labels	Description
`pve_node_info`	Gauge	`id`, `level`, `name`, `nodeid`	Node info (always 1)
`pve_cluster_info`	Gauge	`id`, `nodes`, `quorate`, `version`	Cluster info (always 1)

Corosync

Metric	Type	Labels	Description
`pve_cluster_quorate`	Gauge		1 if cluster has quorum
`pve_cluster_nodes_total`	Gauge		Total node count
`pve_cluster_expected_votes`	Gauge		Sum of quorum votes from config
`pve_node_online`	Gauge	`name`, `nodeid`	1 if node is online

Cluster Resources

Metric	Type	Labels	Description
`pve_up`	Gauge	`id`	1 if node/VM/CT is online/running
`pve_cpu_usage_ratio`	Gauge	`id`	CPU utilization ratio
`pve_cpu_usage_limit`	Gauge	`id`	Number of available CPUs
`pve_memory_usage_bytes`	Gauge	`id`	Used memory in bytes
`pve_memory_size_bytes`	Gauge	`id`	Total memory in bytes
`pve_disk_usage_bytes`	Gauge	`id`	Used disk space in bytes
`pve_disk_size_bytes`	Gauge	`id`	Total disk space in bytes
`pve_uptime_seconds`	Gauge	`id`	Uptime in seconds
`pve_network_transmit_bytes_total`	Counter	`id`	Network bytes sent
`pve_network_receive_bytes_total`	Counter	`id`	Network bytes received
`pve_disk_written_bytes_total`	Counter	`id`	Disk bytes written
`pve_disk_read_bytes_total`	Counter	`id`	Disk bytes read
`pve_guest_info`	Gauge	`id`, `node`, `name`, `type`, `template`, `tags`	VM/CT info (always 1)
`pve_storage_info`	Gauge	`id`, `node`, `storage`, `plugintype`, `content`	Storage info (always 1)
`pve_storage_shared`	Gauge	`id`	1 if storage is shared
`pve_ha_state`	Gauge	`id`, `state`	HA service status
`pve_lock_state`	Gauge	`id`, `state`	Guest config lock state
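
As a rough illustration of how a 0/1 gauge such as `pve_up` might be derived, the sketch below maps a resource's type and status to a sample line. It assumes that nodes report the status `online`, that guests report `running`, and that `id` values look like `qemu/100`; `upValue` and `sampleLine` are hypothetical helpers, not the exporter's code.

```go
package main

import "fmt"

// upValue maps a resource's type and status to the 0/1 value of pve_up.
// Assumption: nodes are up when "online", guests when "running".
func upValue(resourceType, status string) float64 {
	switch resourceType {
	case "node":
		if status == "online" {
			return 1
		}
	case "qemu", "lxc":
		if status == "running" {
			return 1
		}
	}
	return 0
}

// sampleLine renders one pve_up sample in the Prometheus text format.
func sampleLine(id string, value float64) string {
	return fmt.Sprintf("pve_up{id=%q} %g", id, value)
}

func main() {
	fmt.Println(sampleLine("node/node01", upValue("node", "online")))
	fmt.Println(sampleLine("qemu/100", upValue("qemu", "stopped")))
}
```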

Version

Metric	Type	Labels	Description
`pve_version_info`	Gauge	`release`, `repoid`, `version`	PVE version info (always 1)

Backup

Metric	Type	Labels	Description
`pve_not_backed_up_total`	Gauge	`id`	1 if guest has no backup job
`pve_not_backed_up_info`	Gauge	`id`	1 if guest has no backup job

Node Config

Metric	Type	Labels	Description
`pve_onboot_status`	Gauge	`id`, `node`, `type`	VM/CT onboot config value

Replication

Metric	Type	Labels	Description
`pve_replication_info`	Gauge	`id`, `type`, `source`, `target`, `guest`	Replication job info (always 1)
`pve_replication_duration_seconds`	Gauge	`id`	Last replication duration
`pve_replication_last_sync_timestamp_seconds`	Gauge	`id`	Last successful sync time
`pve_replication_last_try_timestamp_seconds`	Gauge	`id`	Last sync attempt time
`pve_replication_next_sync_timestamp_seconds`	Gauge	`id`	Next scheduled sync time
`pve_replication_failed_syncs`	Gauge	`id`	Failed sync count

Subscription

Metric	Type	Labels	Description
`pve_subscription_info`	Gauge	`id`, `level`	Subscription info (always 1)
`pve_subscription_status`	Gauge	`id`, `status`	Subscription status
`pve_subscription_next_due_timestamp_seconds`	Gauge	`id`	Next due date as Unix timestamp

Scrape Meta

Metric	Type	Labels	Description
`pve_scrape_collector_duration_seconds`	Gauge	`collector`	Scrape duration per collector
`pve_scrape_collector_success`	Gauge	`collector`	1 if collector succeeded

TODO: Future Metrics

The following metrics are available from the PVE API but not yet implemented:

Per-node detailed status (`/nodes/{node}/status`)

Load averages (1m, 5m, 15m)
Swap usage (total, used, free)
Root filesystem usage (total, used, available)
KSM shared memory
Kernel version info
Boot mode and secure boot status
CPU model info (model, sockets, cores, MHz)

Per-VM pressure metrics (`/nodes/{node}/qemu`)

pressurecpusome, pressurecpufull
pressurememorysome, pressurememoryfull
pressureiosome, pressureiofull

HA detailed status (`/cluster/ha/status/current`)

CRM master node and status
Per-node LRM status (idle/active) and timestamps
Per-service HA config (failback, max_restart, max_relocate)

Physical disks (`/nodes/{node}/disks/list`)

Disk health (SMART status)
Wearout level
Size and model info
OSD mapping

SDN/Network (`/cluster/resources` type=sdn)

Zone status per node
Zone type info