diff --git a/README.md b/README.md
new file mode 100644
index 0000000..d02176b
--- /dev/null
+++ b/README.md
@@ -0,0 +1,171 @@
+# pve-exporter
+
+A Prometheus exporter for Proxmox VE, written in Go. It builds to a single
+static binary for easy deployment.
+
+It is designed as a drop-in replacement for
+[prometheus-pve-exporter](https://github.com/prometheus-community/prometheus-pve-exporter)
+with matching metric names for dashboard compatibility, plus additional
+corosync cluster metrics.
+
+## Installation
+
+```bash
+CGO_ENABLED=0 go build -o pve-exporter .
+```
+
+## Usage
+
+```bash
+pve-exporter \
+  --pve.host=https://node01:8006 \
+  --pve.host=https://node02:8006 \
+  --pve.token-file=/etc/pve-exporter/apikey \
+  --web.listen-address=:9221
+```
+
+The exporter scrapes all cluster data from a single PVE API endpoint. Multiple
+`--pve.host` values provide failover: hosts are tried in order, with a
+1-second connect timeout so that an unreachable host is skipped quickly.
+
+### Flags
+
+| Flag | Default | Description |
+|------|---------|-------------|
+| `--pve.host` | (required) | PVE API base URL (repeatable) |
+| `--pve.api-token` | | API token string (`user@realm!tokenid=uuid`) |
+| `--pve.token-file` | | Path to file containing API token |
+| `--pve.tls-insecure` | `false` | Disable TLS certificate verification |
+| `--pve.max-concurrent` | `5` | Max concurrent API requests for per-node fan-out |
+| `--web.listen-address` | `:9221` | Address to listen on |
+| `--web.telemetry-path` | `/metrics` | Path for metrics endpoint |
+| `--log.level` | `info` | Log level (debug, info, warn, error) |
+| `--log.format` | `logfmt` | Log format (logfmt, json) |
+
+### Authentication
+
+Create a PVE API token with at least the `PVEAuditor` role. Provide it via:
+
+- `--pve.api-token=user@realm!tokenid=uuid` (visible in the process list)
+- `--pve.token-file=/path/to/file` (recommended)
+- `PVE_API_TOKEN` environment variable
+
+`--pve.api-token` and `--pve.token-file` are mutually exclusive.
+
+## Metrics
+
+### Cluster Status
+
+| Metric | Type | Labels | Description |
+|--------|------|--------|-------------|
+| `pve_node_info` | Gauge | `id`, `level`, `name`, `nodeid` | Node info (always 1) |
+| `pve_cluster_info` | Gauge | `id`, `nodes`, `quorate`, `version` | Cluster info (always 1) |
+
+### Corosync
+
+| Metric | Type | Labels | Description |
+|--------|------|--------|-------------|
+| `pve_cluster_quorate` | Gauge | | 1 if cluster has quorum |
+| `pve_cluster_nodes_total` | Gauge | | Total node count |
+| `pve_cluster_expected_votes` | Gauge | | Sum of quorum votes from config |
+| `pve_node_online` | Gauge | `name`, `nodeid` | 1 if node is online |
+
+### Cluster Resources
+
+| Metric | Type | Labels | Description |
+|--------|------|--------|-------------|
+| `pve_up` | Gauge | `id` | 1 if node/VM/CT is online/running |
+| `pve_cpu_usage_ratio` | Gauge | `id` | CPU utilization ratio |
+| `pve_cpu_usage_limit` | Gauge | `id` | Number of available CPUs |
+| `pve_memory_usage_bytes` | Gauge | `id` | Used memory in bytes |
+| `pve_memory_size_bytes` | Gauge | `id` | Total memory in bytes |
+| `pve_disk_usage_bytes` | Gauge | `id` | Used disk space in bytes |
+| `pve_disk_size_bytes` | Gauge | `id` | Total disk space in bytes |
+| `pve_uptime_seconds` | Gauge | `id` | Uptime in seconds |
+| `pve_network_transmit_bytes_total` | Counter | `id` | Network bytes sent |
+| `pve_network_receive_bytes_total` | Counter | `id` | Network bytes received |
+| `pve_disk_written_bytes_total` | Counter | `id` | Disk bytes written |
+| `pve_disk_read_bytes_total` | Counter | `id` | Disk bytes read |
+| `pve_guest_info` | Gauge | `id`, `node`, `name`, `type`, `template`, `tags` | VM/CT info (always 1) |
+| `pve_storage_info` | Gauge | `id`, `node`, `storage`, `plugintype`, `content` | Storage info (always 1) |
+| `pve_storage_shared` | Gauge | `id` | 1 if storage is shared |
+| `pve_ha_state` | Gauge | `id`, `state` | HA service status |
+| `pve_lock_state` | Gauge | `id`, `state` | Guest config lock state |
+
+### Version
+
+| Metric | Type | Labels | Description |
+|--------|------|--------|-------------|
+| `pve_version_info` | Gauge | `release`, `repoid`, `version` | PVE version info (always 1) |
+
+### Backup
+
+| Metric | Type | Labels | Description |
+|--------|------|--------|-------------|
+| `pve_not_backed_up_total` | Gauge | `id` | 1 if guest has no backup job |
+| `pve_not_backed_up_info` | Gauge | `id` | 1 if guest has no backup job |
+
+### Node Config
+
+| Metric | Type | Labels | Description |
+|--------|------|--------|-------------|
+| `pve_onboot_status` | Gauge | `id`, `node`, `type` | VM/CT onboot config value |
+
+### Replication
+
+| Metric | Type | Labels | Description |
+|--------|------|--------|-------------|
+| `pve_replication_info` | Gauge | `id`, `type`, `source`, `target`, `guest` | Replication job info (always 1) |
+| `pve_replication_duration_seconds` | Gauge | `id` | Last replication duration |
+| `pve_replication_last_sync_timestamp_seconds` | Gauge | `id` | Last successful sync time |
+| `pve_replication_last_try_timestamp_seconds` | Gauge | `id` | Last sync attempt time |
+| `pve_replication_next_sync_timestamp_seconds` | Gauge | `id` | Next scheduled sync time |
+| `pve_replication_failed_syncs` | Gauge | `id` | Failed sync count |
+
+### Subscription
+
+| Metric | Type | Labels | Description |
+|--------|------|--------|-------------|
+| `pve_subscription_info` | Gauge | `id`, `level` | Subscription info (always 1) |
+| `pve_subscription_status` | Gauge | `id`, `status` | Subscription status |
+| `pve_subscription_next_due_timestamp_seconds` | Gauge | `id` | Next due date as Unix timestamp |
+
+### Scrape Meta
+
+| Metric | Type | Labels | Description |
+|--------|------|--------|-------------|
+| `pve_scrape_collector_duration_seconds` | Gauge | `collector` | Scrape duration per collector |
+| `pve_scrape_collector_success` | Gauge | `collector` | 1 if collector succeeded |
+
+## TODO: Future Metrics
+
+The following metrics are available from the PVE API but not yet implemented:
+
+### Per-node detailed status (`/nodes/{node}/status`)
+- Load averages (1m, 5m, 15m)
+- Swap usage (total, used, free)
+- Root filesystem usage (total, used, available)
+- KSM shared memory
+- Kernel version info
+- Boot mode and secure boot status
+- CPU model info (model, sockets, cores, MHz)
+
+### Per-VM pressure metrics (`/nodes/{node}/qemu`)
+- `pressurecpusome`, `pressurecpufull`
+- `pressurememorysome`, `pressurememoryfull`
+- `pressureiosome`, `pressureiofull`
+
+### HA detailed status (`/cluster/ha/status/current`)
+- CRM master node and status
+- Per-node LRM status (idle/active) and timestamps
+- Per-service HA config (failback, max_restart, max_relocate)
+
+### Physical disks (`/nodes/{node}/disks/list`)
+- Disk health (SMART status)
+- Wearout level
+- Size and model info
+- OSD mapping
+
+### SDN/Network (`/cluster/resources` type=sdn)
+- Zone status per node
+- Zone type info