pve-exporter Implementation Plan
14 tasks covering: Go module setup, API client, collector framework, main entry point, and all 8 collectors (version, cluster_status, corosync, cluster_resources, backup, subscription, node_config, replication), plus README and integration testing.
For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.
Goal: Build a Go Prometheus exporter for Proxmox VE that matches prometheus-pve-exporter's metrics and adds corosync cluster metrics.
Architecture: node_exporter-style collector framework. A shared PVE API client with multi-host failover feeds self-registering collectors that run in parallel. Each collector owns one API domain (cluster status, resources, corosync, etc.) and emits metrics to a Prometheus channel.
Tech Stack: Go 1.22+, github.com/prometheus/client_golang, github.com/alecthomas/kingpin/v2, github.com/prometheus/common (promslog), github.com/prometheus/exporter-toolkit
Spec: docs/superpowers/specs/2026-03-20-pve-exporter-design.md
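For orientation, here is a self-contained toy that models the registration-and-parallel-scrape shape described under Architecture above. It is only a sketch: the names mirror Task 3, but the real Collector takes a *Client and a chan<- prometheus.Metric, and registration happens in each collector file's init().

package main

import (
	"fmt"
	"sync"
)

// Collector mirrors the Task 3 interface in miniature (no client, no metrics channel).
type Collector interface {
	Update() error
}

// factories maps collector names to constructors, filled by registerCollector.
var factories = map[string]func() Collector{}

func registerCollector(name string, f func() Collector) { factories[name] = f }

type versionCollector struct{}

func (versionCollector) Update() error {
	fmt.Println("version: scraped /version")
	return nil
}

func main() {
	// In the real exporter this happens in each collector's init().
	registerCollector("version", func() Collector { return versionCollector{} })

	// PVECollector.Collect runs every registered collector in parallel like this.
	var wg sync.WaitGroup
	for name, factory := range factories {
		wg.Add(1)
		go func(name string, c Collector) {
			defer wg.Done()
			if err := c.Update(); err != nil {
				fmt.Println(name, "failed:", err)
			}
		}(name, factory())
	}
	wg.Wait()
}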
File Structure
| File | Responsibility |
|---|---|
| go.mod | Module definition and dependencies |
| main.go | CLI flags, HTTP server, wiring |
| collector/collector.go | Collector interface, registry, PVECollector (prometheus.Collector), scrape orchestration |
| collector/client.go | PVE API HTTP client with multi-host failover, auth, TLS config |
| collector/client_test.go | Client tests with httptest server |
| collector/cluster_status.go | pve_up, pve_node_info, pve_cluster_info |
| collector/cluster_status_test.go | Tests with JSON fixtures |
| collector/corosync.go | pve_cluster_quorate, pve_cluster_nodes_total, pve_cluster_expected_votes, pve_node_online |
| collector/corosync_test.go | Tests with JSON fixtures |
| collector/cluster_resources.go | 16 metrics: CPU, memory, disk, network, storage, guest info, HA/lock state |
| collector/cluster_resources_test.go | Tests with JSON fixtures |
| collector/version.go | pve_version_info |
| collector/version_test.go | Tests with JSON fixtures |
| collector/backup.go | pve_not_backed_up_total, pve_not_backed_up_info |
| collector/backup_test.go | Tests with JSON fixtures |
| collector/node_config.go | pve_onboot_status (per-node fan-out) |
| collector/node_config_test.go | Tests with JSON fixtures |
| collector/replication.go | 6 replication metrics (per-node fan-out) |
| collector/replication_test.go | Tests with JSON fixtures |
| collector/subscription.go | 3 subscription metrics (per-node fan-out) |
| collector/subscription_test.go | Tests with JSON fixtures |
| collector/testutil_test.go | Shared test helpers: mock client, fixture loader |
| collector/fixtures/ | JSON fixture files for API responses |
| Makefile | Build, test, lint targets |
| README.md | Usage docs, metric list, TODO for future metrics |
Task 1: Go Module and Dependencies
Files:
- Create: go.mod
- Step 1: Initialize Go module
cd /home/user/git/pve-exporter
go mod init github.com/dsgeis/pve-exporter
- Step 2: Add dependencies
cd /home/user/git/pve-exporter
go get github.com/alecthomas/kingpin/v2@v2.4.0
go get github.com/prometheus/client_golang@latest
go get github.com/prometheus/common@latest
go get github.com/prometheus/exporter-toolkit@latest
- Step 3: Commit
git add go.mod go.sum
git commit -m "feat: initialize Go module with dependencies"
Task 2: PVE API Client
Files:
- Create: collector/client.go
- Create: collector/client_test.go
- Step 1: Write the failing test
Create collector/client_test.go:
package collector
import (
"net/http"
"net/http/httptest"
"testing"
)
func TestClientGet(t *testing.T) {
server := httptest.NewTLSServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
if r.Header.Get("Authorization") != "PVEAPIToken=test@pve!token=secret" {
t.Errorf("unexpected auth header: %s", r.Header.Get("Authorization"))
}
if r.URL.Path != "/api2/json/version" {
t.Errorf("unexpected path: %s", r.URL.Path)
}
w.Write([]byte(`{"data":{"version":"8.0"}}`))
}))
defer server.Close()
client, err := NewClient([]string{server.URL}, "test@pve!token=secret", true, 5)
if err != nil {
t.Fatal(err)
}
// Override HTTP client to trust test server's TLS cert
client.httpClient = server.Client()
data, err := client.Get("/version")
if err != nil {
t.Fatal(err)
}
if string(data) != `{"data":{"version":"8.0"}}` {
t.Errorf("unexpected response: %s", string(data))
}
}
func TestClientFailover(t *testing.T) {
// First server always fails
bad := httptest.NewTLSServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
w.WriteHeader(http.StatusInternalServerError)
}))
defer bad.Close()
// Second server works
good := httptest.NewTLSServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
w.Write([]byte(`{"data":"ok"}`))
}))
defer good.Close()
client, err := NewClient([]string{bad.URL, good.URL}, "token", true, 5)
if err != nil {
t.Fatal(err)
}
// Use a client that trusts both test servers
client.httpClient = bad.Client()
data, err := client.Get("/test")
if err != nil {
t.Fatal(err)
}
if string(data) != `{"data":"ok"}` {
t.Errorf("unexpected response: %s", string(data))
}
// After success, good host should be tried first (remembered)
// Make bad server unreachable by closing it
bad.Close()
data, err = client.Get("/test")
if err != nil {
t.Fatal(err)
}
if string(data) != `{"data":"ok"}` {
t.Errorf("second request failed or returned wrong data: %s", string(data))
}
}
func TestClientAllHostsFail(t *testing.T) {
bad := httptest.NewTLSServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
w.WriteHeader(http.StatusInternalServerError)
}))
defer bad.Close()
client, err := NewClient([]string{bad.URL}, "token", true, 5)
if err != nil {
t.Fatal(err)
}
client.httpClient = bad.Client()
_, err = client.Get("/test")
if err == nil {
t.Error("expected error when all hosts fail")
}
}
- Step 2: Run test to verify it fails
cd /home/user/git/pve-exporter && go test ./collector/ -run TestClient -v
Expected: Compilation error — NewClient not defined.
- Step 3: Write the implementation
Create collector/client.go:
package collector
import (
"crypto/tls"
"fmt"
"io"
"net"
"net/http"
"sync"
"time"
)
// Client is an HTTP client for the Proxmox VE API.
// It supports multiple hosts with automatic failover.
type Client struct {
httpClient *http.Client
hosts []string
token string
maxConcurrent int
mu sync.Mutex
lastGoodHost int // index into hosts
}
// NewClient creates a new PVE API client.
// hosts is a list of PVE API base URLs tried in order on failure.
// token is the PVE API token string (user@realm!tokenid=uuid).
// tlsInsecure disables TLS certificate verification when true.
// maxConcurrent limits parallel per-node API requests.
func NewClient(hosts []string, token string, tlsInsecure bool, maxConcurrent int) (*Client, error) {
if len(hosts) == 0 {
return nil, fmt.Errorf("at least one PVE host is required")
}
if maxConcurrent < 1 {
maxConcurrent = 5
}
transport := &http.Transport{
DialContext: (&net.Dialer{
Timeout: 1 * time.Second,
}).DialContext,
TLSClientConfig: &tls.Config{
InsecureSkipVerify: tlsInsecure,
},
MaxIdleConnsPerHost: 10,
IdleConnTimeout: 90 * time.Second,
}
return &Client{
httpClient: &http.Client{
Transport: transport,
Timeout: 30 * time.Second,
},
hosts: hosts,
token: token,
maxConcurrent: maxConcurrent,
}, nil
}
// Get makes a GET request to the PVE API at the given path.
// It tries hosts in order, starting with the last successful host.
// The path should not include /api2/json prefix — it is added automatically.
func (c *Client) Get(path string) ([]byte, error) {
c.mu.Lock()
startIdx := c.lastGoodHost
c.mu.Unlock()
var lastErr error
for i := 0; i < len(c.hosts); i++ {
idx := (startIdx + i) % len(c.hosts)
host := c.hosts[idx]
url := host + "/api2/json" + path
req, err := http.NewRequest("GET", url, nil)
if err != nil {
lastErr = fmt.Errorf("creating request for %s: %w", host, err)
continue
}
req.Header.Set("Authorization", "PVEAPIToken="+c.token)
resp, err := c.httpClient.Do(req)
if err != nil {
lastErr = fmt.Errorf("requesting %s: %w", url, err)
continue
}
body, err := io.ReadAll(resp.Body)
resp.Body.Close()
if err != nil {
lastErr = fmt.Errorf("reading response from %s: %w", url, err)
continue
}
if resp.StatusCode != http.StatusOK {
lastErr = fmt.Errorf("%s returned status %d: %s", url, resp.StatusCode, string(body))
continue
}
c.mu.Lock()
c.lastGoodHost = idx
c.mu.Unlock()
return body, nil
}
return nil, fmt.Errorf("all PVE hosts failed: %w", lastErr)
}
// MaxConcurrent returns the configured max concurrent API requests.
func (c *Client) MaxConcurrent() int {
return c.maxConcurrent
}
- Step 4: Run tests to verify they pass
cd /home/user/git/pve-exporter && go test ./collector/ -run TestClient -v
Expected: All 3 tests PASS.
- Step 5: Commit
git add collector/client.go collector/client_test.go
git commit -m "feat: add PVE API client with multi-host failover"
Task 3: Collector Framework
Files:
- Create: collector/collector.go
- Create: collector/testutil_test.go
- Step 1: Write collector.go
Create collector/collector.go with the collector interface, registry, and PVECollector:
package collector
import (
"encoding/json"
"fmt"
"log/slog"
"sync"
"time"
"github.com/prometheus/client_golang/prometheus"
)
const namespace = "pve"
var (
scrapeDurationDesc = prometheus.NewDesc(
prometheus.BuildFQName(namespace, "scrape", "collector_duration_seconds"),
"pve_exporter: Duration of a collector scrape.",
[]string{"collector"}, nil,
)
scrapeSuccessDesc = prometheus.NewDesc(
prometheus.BuildFQName(namespace, "scrape", "collector_success"),
"pve_exporter: Whether a collector succeeded.",
[]string{"collector"}, nil,
)
)
// Collector is the interface each metric collector implements.
type Collector interface {
Update(client *Client, ch chan<- prometheus.Metric) error
}
// NodeAwareCollector is implemented by collectors that need the cluster node list.
type NodeAwareCollector interface {
Collector
SetNodes(nodes []string)
}
// ResourceAwareCollector is implemented by collectors that consume /cluster/resources data.
type ResourceAwareCollector interface {
Collector
SetResources(data []byte)
}
var factories = make(map[string]func(logger *slog.Logger) Collector)
func registerCollector(name string, factory func(logger *slog.Logger) Collector) {
factories[name] = factory
}
// PVECollector implements prometheus.Collector and orchestrates all registered collectors.
type PVECollector struct {
client *Client
collectors map[string]Collector
logger *slog.Logger
}
// NewPVECollector creates a PVECollector with all registered collectors.
func NewPVECollector(client *Client, logger *slog.Logger) *PVECollector {
collectors := make(map[string]Collector)
for name, factory := range factories {
collectors[name] = factory(logger.With("collector", name))
}
return &PVECollector{
client: client,
collectors: collectors,
logger: logger,
}
}
// Describe implements prometheus.Collector.
func (p *PVECollector) Describe(ch chan<- *prometheus.Desc) {
ch <- scrapeDurationDesc
ch <- scrapeSuccessDesc
}
// Collect implements prometheus.Collector.
// It fetches /cluster/resources first to get the node list and resource data,
// then runs all collectors in parallel.
func (p *PVECollector) Collect(ch chan<- prometheus.Metric) {
// Pre-fetch cluster resources for shared data
resourcesData, err := p.client.Get("/cluster/resources")
var nodes []string
if err != nil {
p.logger.Error("failed to fetch cluster resources", "err", err)
} else {
nodes = extractNodeNames(resourcesData)
}
// Distribute shared data to collectors that need it
for _, c := range p.collectors {
if rac, ok := c.(ResourceAwareCollector); ok && resourcesData != nil {
rac.SetResources(resourcesData)
}
if nac, ok := c.(NodeAwareCollector); ok {
nac.SetNodes(nodes)
}
}
// Run all collectors in parallel
wg := sync.WaitGroup{}
wg.Add(len(p.collectors))
for name, c := range p.collectors {
go func(name string, c Collector) {
defer wg.Done()
begin := time.Now()
err := c.Update(p.client, ch)
duration := time.Since(begin)
var success float64
if err != nil {
p.logger.Error("collector failed", "name", name, "duration_seconds", duration.Seconds(), "err", err)
success = 0
} else {
p.logger.Debug("collector succeeded", "name", name, "duration_seconds", duration.Seconds())
success = 1
}
ch <- prometheus.MustNewConstMetric(scrapeDurationDesc, prometheus.GaugeValue, duration.Seconds(), name)
ch <- prometheus.MustNewConstMetric(scrapeSuccessDesc, prometheus.GaugeValue, success, name)
}(name, c)
}
wg.Wait()
}
// extractNodeNames parses /cluster/resources response and returns node names.
func extractNodeNames(data []byte) []string {
// Lightweight JSON parsing — we only need node names from type=node entries
type resource struct {
Type string `json:"type"`
Node string `json:"node"`
}
type response struct {
Data []resource `json:"data"`
}
var resp response
if err := json.Unmarshal(data, &resp); err != nil {
return nil
}
seen := make(map[string]bool)
var nodes []string
for _, r := range resp.Data {
if r.Type == "node" && !seen[r.Node] {
seen[r.Node] = true
nodes = append(nodes, r.Node)
}
}
return nodes
}
- Step 2: Create shared test utilities
Create collector/testutil_test.go:
package collector
import (
"net/http"
"net/http/httptest"
"os"
"path/filepath"
"testing"
"github.com/prometheus/client_golang/prometheus"
)
// newTestClient creates a Client backed by an httptest.Server that serves
// fixture files. Routes maps API paths (e.g. "/cluster/status") to fixture
// file basenames (e.g. "cluster_status.json") under collector/fixtures/.
func newTestClient(t *testing.T, routes map[string]string) *Client {
t.Helper()
server := httptest.NewTLSServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
// Strip /api2/json prefix to match route keys
path := r.URL.Path
const prefix = "/api2/json"
if len(path) > len(prefix) {
path = path[len(prefix):]
}
fixture, ok := routes[path]
if !ok {
w.WriteHeader(http.StatusNotFound)
return
}
data, err := os.ReadFile(filepath.Join("fixtures", fixture))
if err != nil {
// The handler runs in a separate goroutine, so avoid t.Fatal here.
t.Errorf("reading fixture %s: %v", fixture, err)
w.WriteHeader(http.StatusInternalServerError)
return
}
w.Header().Set("Content-Type", "application/json")
w.Write(data)
}))
t.Cleanup(server.Close)
client, err := NewClient([]string{server.URL}, "test-token", true, 5)
if err != nil {
t.Fatal(err)
}
client.httpClient = server.Client()
return client
}
// testCollectorAdapter wraps a Collector into a prometheus.Collector for testutil.
type testCollectorAdapter struct {
client *Client
collector Collector
}
func (a *testCollectorAdapter) Describe(ch chan<- *prometheus.Desc) {
ch <- prometheus.NewDesc("dummy", "dummy", nil, nil)
}
func (a *testCollectorAdapter) Collect(ch chan<- prometheus.Metric) {
a.collector.Update(a.client, ch)
}
- Step 3: Verify it compiles
cd /home/user/git/pve-exporter && go build ./collector/
Expected: Compiles successfully (no tests to run yet for collector.go itself since it needs at least one registered collector).
- Step 4: Commit
git add collector/collector.go collector/testutil_test.go
git commit -m "feat: add collector framework with registry and parallel scrape orchestration"
Task 4: Main Entry Point (Minimal)
Files:
- Create: main.go
- Create: Makefile
- Step 1: Write main.go
Create main.go:
package main
import (
"fmt"
"log/slog"
"net/http"
"os"
"strings"
"github.com/alecthomas/kingpin/v2"
"github.com/prometheus/client_golang/prometheus"
versioncollector "github.com/prometheus/client_golang/prometheus/collectors/version"
"github.com/prometheus/client_golang/prometheus/promhttp"
"github.com/prometheus/common/promslog"
"github.com/prometheus/common/promslog/flag"
"github.com/prometheus/common/version"
"github.com/prometheus/exporter-toolkit/web"
"github.com/prometheus/exporter-toolkit/web/kingpinflag"
"github.com/dsgeis/pve-exporter/collector"
)
func main() {
var (
pveHosts = kingpin.Flag(
"pve.host",
"PVE API base URL (repeatable, tried in order on failure).",
).Required().Strings()
pveAPIToken = kingpin.Flag(
"pve.api-token",
"PVE API token string (user@realm!tokenid=uuid). Mutually exclusive with --pve.token-file.",
).String()
pveTokenFile = kingpin.Flag(
"pve.token-file",
"Path to file containing PVE API token. Mutually exclusive with --pve.api-token.",
).String()
pveTLSInsecure = kingpin.Flag(
"pve.tls-insecure",
"Disable TLS certificate verification for PVE API.",
).Default("false").Bool()
pveMaxConcurrent = kingpin.Flag(
"pve.max-concurrent",
"Max concurrent API requests for per-node fan-out.",
).Default("5").Int()
metricsPath = kingpin.Flag(
"web.telemetry-path",
"Path under which to expose metrics.",
).Default("/metrics").String()
toolkitFlags = kingpinflag.AddFlags(kingpin.CommandLine, ":9221")
)
promslogConfig := &promslog.Config{}
flag.AddFlags(kingpin.CommandLine, promslogConfig)
kingpin.Version(version.Print("pve_exporter"))
kingpin.CommandLine.UsageWriter(os.Stdout)
kingpin.HelpFlag.Short('h')
kingpin.Parse()
logger := promslog.New(promslogConfig)
// Resolve API token
token, err := resolveToken(*pveAPIToken, *pveTokenFile)
if err != nil {
logger.Error("failed to resolve API token", "err", err)
os.Exit(1)
}
client, err := collector.NewClient(*pveHosts, token, *pveTLSInsecure, *pveMaxConcurrent)
if err != nil {
logger.Error("failed to create PVE client", "err", err)
os.Exit(1)
}
pveCollector := collector.NewPVECollector(client, logger)
r := prometheus.NewRegistry()
r.MustRegister(versioncollector.NewCollector("pve_exporter"))
r.MustRegister(pveCollector)
handler := promhttp.HandlerFor(r, promhttp.HandlerOpts{
ErrorLog: slog.NewLogLogger(logger.Handler(), slog.LevelError),
ErrorHandling: promhttp.ContinueOnError,
})
http.Handle(*metricsPath, handler)
if *metricsPath != "/" {
landingConfig := web.LandingConfig{
Name: "PVE Exporter",
Description: "Prometheus Exporter for Proxmox VE",
Version: version.Info(),
Links: []web.LandingLinks{
{Address: *metricsPath, Text: "Metrics"},
},
}
landingPage, err := web.NewLandingPage(landingConfig)
if err != nil {
logger.Error(err.Error())
os.Exit(1)
}
http.Handle("/", landingPage)
}
logger.Info("Starting pve_exporter", "version", version.Info())
server := &http.Server{}
if err := web.ListenAndServe(server, toolkitFlags, logger); err != nil {
logger.Error(err.Error())
os.Exit(1)
}
}
func resolveToken(apiToken, tokenFile string) (string, error) {
if apiToken != "" && tokenFile != "" {
return "", fmt.Errorf("--pve.api-token and --pve.token-file are mutually exclusive")
}
if apiToken == "" && tokenFile == "" {
// Try environment variable
if env := os.Getenv("PVE_API_TOKEN"); env != "" {
return env, nil
}
return "", fmt.Errorf("one of --pve.api-token, --pve.token-file, or PVE_API_TOKEN is required")
}
if tokenFile != "" {
data, err := os.ReadFile(tokenFile)
if err != nil {
return "", fmt.Errorf("reading token file: %w", err)
}
return strings.TrimSpace(string(data)), nil
}
return apiToken, nil
}
- Step 2: Write Makefile
Create Makefile:
.PHONY: build test clean
BINARY := pve-exporter

build:
	CGO_ENABLED=0 go build -o $(BINARY) .

test:
	go test -v ./...

clean:
	rm -f $(BINARY)
- Step 3: Verify it compiles and runs
cd /home/user/git/pve-exporter && go mod tidy && make build
Expected: Compiles. Binary pve-exporter created.
./pve-exporter --help
Expected: Shows usage with all flags listed.
- Step 4: Commit
git add main.go Makefile go.mod go.sum
git commit -m "feat: add main entry point with CLI flags and HTTP server"
Task 5: Version Collector
The simplest collector — good for validating the framework end-to-end.
Files:
- Create: collector/version.go
- Create: collector/version_test.go
- Create: collector/fixtures/version.json
- Step 1: Create fixture
Create collector/fixtures/version.json:
{"data":{"version":"9.1.4","release":"9.1","repoid":"5ac30304265fbd8e"}}
- Step 2: Write the failing test
Create collector/version_test.go:
package collector
import (
"strings"
"testing"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/testutil"
"github.com/prometheus/common/promslog"
)
func TestVersionCollector(t *testing.T) {
client := newTestClient(t, map[string]string{
"/version": "version.json",
})
logger := promslog.NewNopLogger()
c := newVersionCollector(logger)
expected := `
# HELP pve_version_info Proxmox VE version info.
# TYPE pve_version_info gauge
pve_version_info{release="9.1",repoid="5ac30304265fbd8e",version="9.1.4"} 1
`
reg := prometheus.NewRegistry()
collector := &testCollectorAdapter{client: client, collector: c}
reg.MustRegister(collector)
if err := testutil.GatherAndCompare(reg, strings.NewReader(expected), "pve_version_info"); err != nil {
t.Error(err)
}
}
Note: testCollectorAdapter is defined in testutil_test.go (Task 3) and shared by all collector tests.
- Step 3: Run test to verify it fails
cd /home/user/git/pve-exporter && go test ./collector/ -run TestVersionCollector -v
Expected: Compilation error — newVersionCollector not defined.
- Step 4: Write the implementation
Create collector/version.go:
package collector
import (
"encoding/json"
"fmt"
"log/slog"
"github.com/prometheus/client_golang/prometheus"
)
type versionCollector struct {
infoDesc *prometheus.Desc
logger *slog.Logger
}
func init() {
registerCollector("version", func(logger *slog.Logger) Collector {
return newVersionCollector(logger)
})
}
func newVersionCollector(logger *slog.Logger) *versionCollector {
return &versionCollector{
infoDesc: prometheus.NewDesc(
prometheus.BuildFQName(namespace, "", "version_info"),
"Proxmox VE version info.",
[]string{"release", "repoid", "version"}, nil,
),
logger: logger,
}
}
func (c *versionCollector) Update(client *Client, ch chan<- prometheus.Metric) error {
body, err := client.Get("/version")
if err != nil {
return fmt.Errorf("fetching version: %w", err)
}
var resp struct {
Data struct {
Version string `json:"version"`
Release string `json:"release"`
RepoID string `json:"repoid"`
} `json:"data"`
}
if err := json.Unmarshal(body, &resp); err != nil {
return fmt.Errorf("parsing version response: %w", err)
}
ch <- prometheus.MustNewConstMetric(c.infoDesc, prometheus.GaugeValue, 1,
resp.Data.Release, resp.Data.RepoID, resp.Data.Version)
return nil
}
- Step 5: Run tests
cd /home/user/git/pve-exporter && go test ./collector/ -run TestVersionCollector -v
Expected: PASS.
- Step 6: End-to-end smoke test
Build and run against the live PVE cluster to verify the full stack works:
cd /home/user/git/pve-exporter && make build
./pve-exporter --pve.host=https://node02.freyja.cloud.sip.is:8006 --pve.tls-insecure --pve.token-file=.apikey &
sleep 2
curl -s http://localhost:9221/metrics | grep pve_version_info
kill %1
Expected: Output contains pve_version_info{release="9.1",repoid="5ac30304265fbd8e",version="9.1.4"} 1 plus scrape meta-metrics.
- Step 7: Commit
git add collector/version.go collector/version_test.go collector/fixtures/version.json
git commit -m "feat: add version collector (pve_version_info)"
Task 6: Cluster Status Collector
Files:
- Create: collector/cluster_status.go
- Create: collector/cluster_status_test.go
- Create: collector/fixtures/cluster_status.json
- Step 1: Create fixture
Create collector/fixtures/cluster_status.json with realistic data from the live API. The response contains both cluster-type and node-type entries:
{"data":[{"type":"cluster","id":"cluster","name":"freyja","version":9,"nodes":5,"quorate":1},{"type":"node","id":"node/node01","name":"node01","nodeid":1,"online":1,"local":0,"ip":"10.99.0.1","level":"b"},{"type":"node","id":"node/node02","name":"node02","nodeid":2,"online":1,"local":1,"ip":"10.99.0.2","level":"b"},{"type":"node","id":"node/node03","name":"node03","nodeid":3,"online":1,"local":0,"ip":"10.99.0.3","level":"b"},{"type":"node","id":"node/node04","name":"node04","nodeid":4,"online":1,"local":0,"ip":"10.99.0.4","level":"b"},{"type":"node","id":"node/node05","name":"node05","nodeid":5,"online":1,"local":0,"ip":"10.99.0.5","level":"b"}]}
- Step 2: Write the failing test
Create collector/cluster_status_test.go:
package collector
import (
"strings"
"testing"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/testutil"
"github.com/prometheus/common/promslog"
)
func TestClusterStatusCollector(t *testing.T) {
client := newTestClient(t, map[string]string{
"/cluster/status": "cluster_status.json",
})
logger := promslog.NewNopLogger()
c := newClusterStatusCollector(logger)
reg := prometheus.NewRegistry()
adapter := &testCollectorAdapter{client: client, collector: c}
reg.MustRegister(adapter)
// Check pve_up for nodes
expected := `
# HELP pve_up Node/VM/CT-Status is online/running.
# TYPE pve_up gauge
pve_up{id="node/node01"} 1
pve_up{id="node/node02"} 1
pve_up{id="node/node03"} 1
pve_up{id="node/node04"} 1
pve_up{id="node/node05"} 1
`
if err := testutil.GatherAndCompare(reg, strings.NewReader(expected), "pve_up"); err != nil {
t.Error(err)
}
// Check pve_node_info
if err := testutil.GatherAndCompare(reg, strings.NewReader(`
# HELP pve_node_info Node info.
# TYPE pve_node_info gauge
pve_node_info{id="node/node01",level="b",name="node01",nodeid="1"} 1
pve_node_info{id="node/node02",level="b",name="node02",nodeid="2"} 1
pve_node_info{id="node/node03",level="b",name="node03",nodeid="3"} 1
pve_node_info{id="node/node04",level="b",name="node04",nodeid="4"} 1
pve_node_info{id="node/node05",level="b",name="node05",nodeid="5"} 1
`), "pve_node_info"); err != nil {
t.Error(err)
}
// Check pve_cluster_info
if err := testutil.GatherAndCompare(reg, strings.NewReader(`
# HELP pve_cluster_info Cluster info.
# TYPE pve_cluster_info gauge
pve_cluster_info{id="cluster",nodes="5",quorate="1",version="9"} 1
`), "pve_cluster_info"); err != nil {
t.Error(err)
}
}
- Step 3: Run test to verify it fails
cd /home/user/git/pve-exporter && go test ./collector/ -run TestClusterStatusCollector -v
Expected: Compilation error — newClusterStatusCollector not defined.
- Step 4: Write the implementation
Create collector/cluster_status.go:
package collector
import (
"encoding/json"
"fmt"
"log/slog"
"strconv"
"github.com/prometheus/client_golang/prometheus"
)
type clusterStatusCollector struct {
upDesc *prometheus.Desc
nodeInfoDesc *prometheus.Desc
clusterInfoDesc *prometheus.Desc
logger *slog.Logger
}
func init() {
registerCollector("cluster_status", func(logger *slog.Logger) Collector {
return newClusterStatusCollector(logger)
})
}
func newClusterStatusCollector(logger *slog.Logger) *clusterStatusCollector {
return &clusterStatusCollector{
upDesc: prometheus.NewDesc(
prometheus.BuildFQName(namespace, "", "up"),
"Node/VM/CT-Status is online/running.",
[]string{"id"}, nil,
),
nodeInfoDesc: prometheus.NewDesc(
prometheus.BuildFQName(namespace, "", "node_info"),
"Node info.",
[]string{"id", "level", "name", "nodeid"}, nil,
),
clusterInfoDesc: prometheus.NewDesc(
prometheus.BuildFQName(namespace, "", "cluster_info"),
"Cluster info.",
[]string{"id", "nodes", "quorate", "version"}, nil,
),
logger: logger,
}
}
func (c *clusterStatusCollector) Update(client *Client, ch chan<- prometheus.Metric) error {
body, err := client.Get("/cluster/status")
if err != nil {
return fmt.Errorf("fetching cluster status: %w", err)
}
var resp struct {
Data []json.RawMessage `json:"data"`
}
if err := json.Unmarshal(body, &resp); err != nil {
return fmt.Errorf("parsing cluster status: %w", err)
}
for _, raw := range resp.Data {
var entry struct {
Type string `json:"type"`
ID string `json:"id"`
Name string `json:"name"`
Online int `json:"online"`
NodeID int `json:"nodeid"`
Level string `json:"level"`
Nodes int `json:"nodes"`
Quorate int `json:"quorate"`
Version int `json:"version"`
}
if err := json.Unmarshal(raw, &entry); err != nil {
c.logger.Warn("skipping unparseable cluster status entry", "err", err)
continue
}
switch entry.Type {
case "node":
ch <- prometheus.MustNewConstMetric(c.upDesc, prometheus.GaugeValue,
float64(entry.Online), entry.ID)
ch <- prometheus.MustNewConstMetric(c.nodeInfoDesc, prometheus.GaugeValue, 1,
entry.ID, entry.Level, entry.Name, strconv.Itoa(entry.NodeID))
case "cluster":
ch <- prometheus.MustNewConstMetric(c.clusterInfoDesc, prometheus.GaugeValue, 1,
entry.ID,
strconv.Itoa(entry.Nodes),
strconv.Itoa(entry.Quorate),
strconv.Itoa(entry.Version))
}
}
return nil
}
- Step 5: Run tests
cd /home/user/git/pve-exporter && go test ./collector/ -run TestClusterStatusCollector -v
Expected: PASS.
- Step 6: Commit
git add collector/cluster_status.go collector/cluster_status_test.go collector/fixtures/cluster_status.json
git commit -m "feat: add cluster_status collector (pve_up, pve_node_info, pve_cluster_info)"
Task 7: Corosync Collector
Files:
- Create: collector/corosync.go
- Create: collector/corosync_test.go
- Create: collector/fixtures/cluster_config_nodes.json
- Step 1: Create fixture
Create collector/fixtures/cluster_config_nodes.json:
{"data":[{"node":"node01","nodeid":"1","quorum_votes":"1","ring0_addr":"10.99.0.1"},{"node":"node02","nodeid":"2","quorum_votes":"1","ring0_addr":"10.99.0.2"},{"node":"node03","nodeid":"3","quorum_votes":"1","ring0_addr":"10.99.0.3"},{"node":"node04","nodeid":"4","quorum_votes":"1","ring0_addr":"10.99.0.4"},{"node":"node05","nodeid":"5","quorum_votes":"1","ring0_addr":"10.99.0.5"}]}
This collector reuses cluster_status.json from Task 6 for quorate and node online state.
- Step 2: Write the failing test
Create collector/corosync_test.go:
package collector
import (
"strings"
"testing"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/testutil"
"github.com/prometheus/common/promslog"
)
func TestCorosyncCollector(t *testing.T) {
client := newTestClient(t, map[string]string{
"/cluster/status": "cluster_status.json",
"/cluster/config/nodes": "cluster_config_nodes.json",
})
logger := promslog.NewNopLogger()
c := newCorosyncCollector(logger)
reg := prometheus.NewRegistry()
adapter := &testCollectorAdapter{client: client, collector: c}
reg.MustRegister(adapter)
expected := `
# HELP pve_cluster_quorate Whether the cluster is quorate.
# TYPE pve_cluster_quorate gauge
pve_cluster_quorate 1
# HELP pve_cluster_nodes_total Total number of nodes in the cluster.
# TYPE pve_cluster_nodes_total gauge
pve_cluster_nodes_total 5
# HELP pve_cluster_expected_votes Total expected votes in the cluster.
# TYPE pve_cluster_expected_votes gauge
pve_cluster_expected_votes 5
# HELP pve_node_online Whether a cluster node is online.
# TYPE pve_node_online gauge
pve_node_online{name="node01",nodeid="1"} 1
pve_node_online{name="node02",nodeid="2"} 1
pve_node_online{name="node03",nodeid="3"} 1
pve_node_online{name="node04",nodeid="4"} 1
pve_node_online{name="node05",nodeid="5"} 1
`
if err := testutil.GatherAndCompare(reg, strings.NewReader(expected),
"pve_cluster_quorate", "pve_cluster_nodes_total", "pve_cluster_expected_votes", "pve_node_online"); err != nil {
t.Error(err)
}
}
- Step 3: Run test to verify it fails
cd /home/user/git/pve-exporter && go test ./collector/ -run TestCorosyncCollector -v
Expected: Compilation error — newCorosyncCollector not defined.
- Step 4: Write the implementation
Create collector/corosync.go:
package collector
import (
"encoding/json"
"fmt"
"log/slog"
"strconv"
"github.com/prometheus/client_golang/prometheus"
)
type corosyncCollector struct {
quorateDesc *prometheus.Desc
nodesTotalDesc *prometheus.Desc
expectedVotesDesc *prometheus.Desc
nodeOnlineDesc *prometheus.Desc
logger *slog.Logger
}
func init() {
registerCollector("corosync", func(logger *slog.Logger) Collector {
return newCorosyncCollector(logger)
})
}
func newCorosyncCollector(logger *slog.Logger) *corosyncCollector {
return &corosyncCollector{
quorateDesc: prometheus.NewDesc(
prometheus.BuildFQName(namespace, "cluster", "quorate"),
"Whether the cluster is quorate.",
nil, nil,
),
nodesTotalDesc: prometheus.NewDesc(
prometheus.BuildFQName(namespace, "cluster", "nodes_total"),
"Total number of nodes in the cluster.",
nil, nil,
),
expectedVotesDesc: prometheus.NewDesc(
prometheus.BuildFQName(namespace, "cluster", "expected_votes"),
"Total expected votes in the cluster.",
nil, nil,
),
nodeOnlineDesc: prometheus.NewDesc(
prometheus.BuildFQName(namespace, "", "node_online"),
"Whether a cluster node is online.",
[]string{"name", "nodeid"}, nil,
),
logger: logger,
}
}
func (c *corosyncCollector) Update(client *Client, ch chan<- prometheus.Metric) error {
// Fetch cluster status for quorate and node online state
statusBody, err := client.Get("/cluster/status")
if err != nil {
return fmt.Errorf("fetching cluster status: %w", err)
}
var statusResp struct {
Data []struct {
Type string `json:"type"`
Name string `json:"name"`
NodeID int `json:"nodeid"`
Online int `json:"online"`
Quorate int `json:"quorate"`
Nodes int `json:"nodes"`
} `json:"data"`
}
if err := json.Unmarshal(statusBody, &statusResp); err != nil {
return fmt.Errorf("parsing cluster status: %w", err)
}
for _, entry := range statusResp.Data {
switch entry.Type {
case "cluster":
ch <- prometheus.MustNewConstMetric(c.quorateDesc, prometheus.GaugeValue, float64(entry.Quorate))
ch <- prometheus.MustNewConstMetric(c.nodesTotalDesc, prometheus.GaugeValue, float64(entry.Nodes))
case "node":
ch <- prometheus.MustNewConstMetric(c.nodeOnlineDesc, prometheus.GaugeValue,
float64(entry.Online), entry.Name, strconv.Itoa(entry.NodeID))
}
}
// Fetch config nodes for expected votes
configBody, err := client.Get("/cluster/config/nodes")
if err != nil {
return fmt.Errorf("fetching cluster config nodes: %w", err)
}
var configResp struct {
Data []struct {
QuorumVotes string `json:"quorum_votes"`
} `json:"data"`
}
if err := json.Unmarshal(configBody, &configResp); err != nil {
return fmt.Errorf("parsing cluster config nodes: %w", err)
}
var totalVotes float64
for _, node := range configResp.Data {
votes, _ := strconv.ParseFloat(node.QuorumVotes, 64)
totalVotes += votes
}
ch <- prometheus.MustNewConstMetric(c.expectedVotesDesc, prometheus.GaugeValue, totalVotes)
return nil
}
- Step 5: Run tests
cd /home/user/git/pve-exporter && go test ./collector/ -run TestCorosyncCollector -v
Expected: PASS.
- Step 6: Commit
git add collector/corosync.go collector/corosync_test.go collector/fixtures/cluster_config_nodes.json
git commit -m "feat: add corosync collector (quorate, nodes_total, expected_votes, node_online)"
Task 8: Cluster Resources Collector
The largest collector — 16 metrics across nodes, VMs, containers, and storage.
Files:
- Create: collector/cluster_resources.go
- Create: collector/cluster_resources_test.go
- Create: collector/fixtures/cluster_resources.json
- Step 1: Create fixture
Create collector/fixtures/cluster_resources.json with a representative subset: 1 node, 2 VMs (1 running, 1 stopped), 1 storage. Include fields for all 16 metrics. This fixture should be a realistic but minimal JSON response. Use data modeled on the live API:
{"data":[
{"type":"node","id":"node/node01","node":"node01","status":"online","cpu":0.05,"maxcpu":256,"mem":50000000000,"maxmem":1555325325312,"disk":5000000000,"maxdisk":100000000000,"uptime":2081781},
{"type":"qemu","id":"qemu/100","node":"node01","name":"testvm1","status":"running","vmid":100,"cpu":0.02,"maxcpu":4,"mem":2147483648,"maxmem":4294967296,"disk":0,"maxdisk":34359738368,"netin":1000000,"netout":500000,"diskread":200000000,"diskwrite":100000000,"uptime":86400,"template":0,"tags":"web;prod","hastate":"started","lock":""},
{"type":"qemu","id":"qemu/101","node":"node01","name":"testvm2","status":"stopped","vmid":101,"cpu":0,"maxcpu":2,"mem":0,"maxmem":2147483648,"disk":0,"maxdisk":17179869184,"netin":0,"netout":0,"diskread":0,"diskwrite":0,"uptime":0,"template":1,"tags":"","hastate":"","lock":"backup"},
{"type":"storage","id":"storage/node01/local","node":"node01","storage":"local","plugintype":"dir","content":"iso,vztmpl,backup","disk":5000000000,"maxdisk":100000000000,"shared":0,"status":"available"}
]}
- Step 2: Write the failing test
Create collector/cluster_resources_test.go:
package collector
import (
"strings"
"testing"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/testutil"
"github.com/prometheus/common/promslog"
)
func TestClusterResourcesCollector(t *testing.T) {
client := newTestClient(t, map[string]string{
"/cluster/resources": "cluster_resources.json",
})
logger := promslog.NewNopLogger()
c := newClusterResourcesCollector(logger)
reg := prometheus.NewRegistry()
adapter := &testCollectorAdapter{client: client, collector: c}
reg.MustRegister(adapter)
// Test CPU metrics
if err := testutil.GatherAndCompare(reg, strings.NewReader(`
# HELP pve_cpu_usage_ratio CPU utilization.
# TYPE pve_cpu_usage_ratio gauge
pve_cpu_usage_ratio{id="node/node01"} 0.05
pve_cpu_usage_ratio{id="qemu/100"} 0.02
pve_cpu_usage_ratio{id="qemu/101"} 0
# HELP pve_cpu_usage_limit Number of available CPUs.
# TYPE pve_cpu_usage_limit gauge
pve_cpu_usage_limit{id="node/node01"} 256
pve_cpu_usage_limit{id="qemu/100"} 4
pve_cpu_usage_limit{id="qemu/101"} 2
`), "pve_cpu_usage_ratio", "pve_cpu_usage_limit"); err != nil {
t.Error(err)
}
// Test guest info
if err := testutil.GatherAndCompare(reg, strings.NewReader(`
# HELP pve_guest_info VM/CT info.
# TYPE pve_guest_info gauge
pve_guest_info{id="qemu/100",name="testvm1",node="node01",tags="web;prod",template="0",type="qemu"} 1
pve_guest_info{id="qemu/101",name="testvm2",node="node01",tags="",template="1",type="qemu"} 1
`), "pve_guest_info"); err != nil {
t.Error(err)
}
// Test storage info
if err := testutil.GatherAndCompare(reg, strings.NewReader(`
# HELP pve_storage_info Storage info.
# TYPE pve_storage_info gauge
pve_storage_info{content="iso,vztmpl,backup",id="storage/node01/local",node="node01",plugintype="dir",storage="local"} 1
`), "pve_storage_info"); err != nil {
t.Error(err)
}
// Test HA state (only for VMs with hastate set)
if err := testutil.GatherAndCompare(reg, strings.NewReader(`
# HELP pve_ha_state HA service status.
# TYPE pve_ha_state gauge
pve_ha_state{id="qemu/100",state="started"} 1
`), "pve_ha_state"); err != nil {
t.Error(err)
}
// Test lock state (only for VMs with lock set)
if err := testutil.GatherAndCompare(reg, strings.NewReader(`
# HELP pve_lock_state Guest config lock state.
# TYPE pve_lock_state gauge
pve_lock_state{id="qemu/101",state="backup"} 1
`), "pve_lock_state"); err != nil {
t.Error(err)
}
// Test pve_up includes VM status
if err := testutil.GatherAndCompare(reg, strings.NewReader(`
# HELP pve_up Node/VM/CT-Status is online/running.
# TYPE pve_up gauge
pve_up{id="node/node01"} 1
pve_up{id="qemu/100"} 1
pve_up{id="qemu/101"} 0
`), "pve_up"); err != nil {
t.Error(err)
}
// Test storage shared
if err := testutil.GatherAndCompare(reg, strings.NewReader(`
# HELP pve_storage_shared Whether or not the storage is shared among cluster nodes.
# TYPE pve_storage_shared gauge
pve_storage_shared{id="storage/node01/local"} 0
`), "pve_storage_shared"); err != nil {
t.Error(err)
}
}
- Step 3: Run test to verify it fails
cd /home/user/git/pve-exporter && go test ./collector/ -run TestClusterResourcesCollector -v
Expected: Compilation error — newClusterResourcesCollector not defined.
- Step 4: Write the implementation
Create collector/cluster_resources.go:
package collector
import (
"encoding/json"
"fmt"
"log/slog"
"strconv"
"github.com/prometheus/client_golang/prometheus"
)
type clusterResourcesCollector struct {
upDesc *prometheus.Desc
cpuUsageDesc *prometheus.Desc
cpuLimitDesc *prometheus.Desc
memUsageDesc *prometheus.Desc
memSizeDesc *prometheus.Desc
diskUsageDesc *prometheus.Desc
diskSizeDesc *prometheus.Desc
netTransmitDesc *prometheus.Desc
netReceiveDesc *prometheus.Desc
diskWrittenDesc *prometheus.Desc
diskReadDesc *prometheus.Desc
uptimeDesc *prometheus.Desc
storageSharedDesc *prometheus.Desc
guestInfoDesc *prometheus.Desc
storageInfoDesc *prometheus.Desc
haStateDesc *prometheus.Desc
lockStateDesc *prometheus.Desc
logger *slog.Logger
}
func init() {
registerCollector("cluster_resources", func(logger *slog.Logger) Collector {
return newClusterResourcesCollector(logger)
})
}
func newClusterResourcesCollector(logger *slog.Logger) *clusterResourcesCollector {
return &clusterResourcesCollector{
upDesc: prometheus.NewDesc(
prometheus.BuildFQName(namespace, "", "up"),
"Node/VM/CT-Status is online/running.",
[]string{"id"}, nil,
),
cpuUsageDesc: prometheus.NewDesc(
prometheus.BuildFQName(namespace, "", "cpu_usage_ratio"),
"CPU utilization.",
[]string{"id"}, nil,
),
cpuLimitDesc: prometheus.NewDesc(
prometheus.BuildFQName(namespace, "", "cpu_usage_limit"),
"Number of available CPUs.",
[]string{"id"}, nil,
),
memUsageDesc: prometheus.NewDesc(
prometheus.BuildFQName(namespace, "", "memory_usage_bytes"),
"Used memory in bytes.",
[]string{"id"}, nil,
),
memSizeDesc: prometheus.NewDesc(
prometheus.BuildFQName(namespace, "", "memory_size_bytes"),
"Number of available memory in bytes.",
[]string{"id"}, nil,
),
diskUsageDesc: prometheus.NewDesc(
prometheus.BuildFQName(namespace, "", "disk_usage_bytes"),
"Used disk space in bytes.",
[]string{"id"}, nil,
),
diskSizeDesc: prometheus.NewDesc(
prometheus.BuildFQName(namespace, "", "disk_size_bytes"),
"Storage size in bytes.",
[]string{"id"}, nil,
),
netTransmitDesc: prometheus.NewDesc(
prometheus.BuildFQName(namespace, "", "network_transmit_bytes_total"),
"Network bytes transmitted since guest start.",
[]string{"id"}, nil,
),
netReceiveDesc: prometheus.NewDesc(
prometheus.BuildFQName(namespace, "", "network_receive_bytes_total"),
"Network bytes received since guest start.",
[]string{"id"}, nil,
),
diskWrittenDesc: prometheus.NewDesc(
prometheus.BuildFQName(namespace, "", "disk_written_bytes_total"),
"Disk bytes written since guest start.",
[]string{"id"}, nil,
),
diskReadDesc: prometheus.NewDesc(
prometheus.BuildFQName(namespace, "", "disk_read_bytes_total"),
"Disk bytes read since guest start.",
[]string{"id"}, nil,
),
uptimeDesc: prometheus.NewDesc(
prometheus.BuildFQName(namespace, "", "uptime_seconds"),
"Uptime in seconds.",
[]string{"id"}, nil,
),
storageSharedDesc: prometheus.NewDesc(
prometheus.BuildFQName(namespace, "", "storage_shared"),
"Whether or not the storage is shared among cluster nodes.",
[]string{"id"}, nil,
),
guestInfoDesc: prometheus.NewDesc(
prometheus.BuildFQName(namespace, "", "guest_info"),
"VM/CT info.",
[]string{"id", "node", "name", "type", "template", "tags"}, nil,
),
storageInfoDesc: prometheus.NewDesc(
prometheus.BuildFQName(namespace, "", "storage_info"),
"Storage info.",
[]string{"id", "node", "storage", "plugintype", "content"}, nil,
),
haStateDesc: prometheus.NewDesc(
prometheus.BuildFQName(namespace, "", "ha_state"),
"HA service status.",
[]string{"id", "state"}, nil,
),
lockStateDesc: prometheus.NewDesc(
prometheus.BuildFQName(namespace, "", "lock_state"),
"Guest config lock state.",
[]string{"id", "state"}, nil,
),
logger: logger,
}
}
func (c *clusterResourcesCollector) Update(client *Client, ch chan<- prometheus.Metric) error {
body, err := client.Get("/cluster/resources")
if err != nil {
return fmt.Errorf("fetching cluster resources: %w", err)
}
var resp struct {
Data []struct {
Type string `json:"type"`
ID string `json:"id"`
Node string `json:"node"`
Name string `json:"name"`
Status string `json:"status"`
CPU float64 `json:"cpu"`
MaxCPU float64 `json:"maxcpu"`
Mem float64 `json:"mem"`
MaxMem float64 `json:"maxmem"`
Disk float64 `json:"disk"`
MaxDisk float64 `json:"maxdisk"`
NetIn float64 `json:"netin"`
NetOut float64 `json:"netout"`
DiskRead float64 `json:"diskread"`
DiskWrite float64 `json:"diskwrite"`
Uptime float64 `json:"uptime"`
Template int `json:"template"`
Tags string `json:"tags"`
HAState string `json:"hastate"`
Lock string `json:"lock"`
Storage string `json:"storage"`
PluginType string `json:"plugintype"`
Content string `json:"content"`
Shared int `json:"shared"`
} `json:"data"`
}
if err := json.Unmarshal(body, &resp); err != nil {
return fmt.Errorf("parsing cluster resources: %w", err)
}
for _, r := range resp.Data {
switch r.Type {
case "node":
online := 0.0
if r.Status == "online" {
online = 1.0
}
ch <- prometheus.MustNewConstMetric(c.upDesc, prometheus.GaugeValue, online, r.ID)
ch <- prometheus.MustNewConstMetric(c.cpuUsageDesc, prometheus.GaugeValue, r.CPU, r.ID)
ch <- prometheus.MustNewConstMetric(c.cpuLimitDesc, prometheus.GaugeValue, r.MaxCPU, r.ID)
ch <- prometheus.MustNewConstMetric(c.memUsageDesc, prometheus.GaugeValue, r.Mem, r.ID)
ch <- prometheus.MustNewConstMetric(c.memSizeDesc, prometheus.GaugeValue, r.MaxMem, r.ID)
ch <- prometheus.MustNewConstMetric(c.diskUsageDesc, prometheus.GaugeValue, r.Disk, r.ID)
ch <- prometheus.MustNewConstMetric(c.diskSizeDesc, prometheus.GaugeValue, r.MaxDisk, r.ID)
ch <- prometheus.MustNewConstMetric(c.uptimeDesc, prometheus.GaugeValue, r.Uptime, r.ID)
case "qemu", "lxc":
online := 0.0
if r.Status == "running" {
online = 1.0
}
ch <- prometheus.MustNewConstMetric(c.upDesc, prometheus.GaugeValue, online, r.ID)
ch <- prometheus.MustNewConstMetric(c.cpuUsageDesc, prometheus.GaugeValue, r.CPU, r.ID)
ch <- prometheus.MustNewConstMetric(c.cpuLimitDesc, prometheus.GaugeValue, r.MaxCPU, r.ID)
ch <- prometheus.MustNewConstMetric(c.memUsageDesc, prometheus.GaugeValue, r.Mem, r.ID)
ch <- prometheus.MustNewConstMetric(c.memSizeDesc, prometheus.GaugeValue, r.MaxMem, r.ID)
ch <- prometheus.MustNewConstMetric(c.diskUsageDesc, prometheus.GaugeValue, r.Disk, r.ID)
ch <- prometheus.MustNewConstMetric(c.diskSizeDesc, prometheus.GaugeValue, r.MaxDisk, r.ID)
ch <- prometheus.MustNewConstMetric(c.netTransmitDesc, prometheus.CounterValue, r.NetOut, r.ID)
ch <- prometheus.MustNewConstMetric(c.netReceiveDesc, prometheus.CounterValue, r.NetIn, r.ID)
ch <- prometheus.MustNewConstMetric(c.diskWrittenDesc, prometheus.CounterValue, r.DiskWrite, r.ID)
ch <- prometheus.MustNewConstMetric(c.diskReadDesc, prometheus.CounterValue, r.DiskRead, r.ID)
ch <- prometheus.MustNewConstMetric(c.uptimeDesc, prometheus.GaugeValue, r.Uptime, r.ID)
ch <- prometheus.MustNewConstMetric(c.guestInfoDesc, prometheus.GaugeValue, 1,
r.ID, r.Node, r.Name, r.Type, strconv.Itoa(r.Template), r.Tags)
if r.HAState != "" {
ch <- prometheus.MustNewConstMetric(c.haStateDesc, prometheus.GaugeValue, 1, r.ID, r.HAState)
}
if r.Lock != "" {
ch <- prometheus.MustNewConstMetric(c.lockStateDesc, prometheus.GaugeValue, 1, r.ID, r.Lock)
}
case "storage":
ch <- prometheus.MustNewConstMetric(c.diskUsageDesc, prometheus.GaugeValue, r.Disk, r.ID)
ch <- prometheus.MustNewConstMetric(c.diskSizeDesc, prometheus.GaugeValue, r.MaxDisk, r.ID)
ch <- prometheus.MustNewConstMetric(c.storageSharedDesc, prometheus.GaugeValue, float64(r.Shared), r.ID)
ch <- prometheus.MustNewConstMetric(c.storageInfoDesc, prometheus.GaugeValue, 1,
r.ID, r.Node, r.Storage, r.PluginType, r.Content)
}
}
return nil
}
Important note: Both cluster_status and cluster_resources emit pve_up. This will cause a duplicate descriptor error. Fix: remove pve_up from cluster_status collector — cluster_resources is the canonical source since it covers nodes, VMs, and CTs. Update cluster_status.go to remove the upDesc field and its usage, and update the cluster_status_test accordingly.
- Step 5: Fix pve_up duplication — remove from cluster_status
Edit collector/cluster_status.go: remove the upDesc field and all pve_up emissions. Edit collector/cluster_status_test.go: remove the pve_up assertions.
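For reference, a sketch of the trimmed type switch in cluster_status.go after this step (assuming the rest of Task 6's code is unchanged). It sits inside the existing loop over resp.Data; the "node" case now emits only pve_node_info, since pve_up comes from cluster_resources:

switch entry.Type {
case "node":
	// pve_up removed; cluster_resources is now the canonical source.
	ch <- prometheus.MustNewConstMetric(c.nodeInfoDesc, prometheus.GaugeValue, 1,
		entry.ID, entry.Level, entry.Name, strconv.Itoa(entry.NodeID))
case "cluster":
	ch <- prometheus.MustNewConstMetric(c.clusterInfoDesc, prometheus.GaugeValue, 1,
		entry.ID,
		strconv.Itoa(entry.Nodes),
		strconv.Itoa(entry.Quorate),
		strconv.Itoa(entry.Version))
}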
- Step 6: Run all tests
cd /home/user/git/pve-exporter && go test ./collector/ -v
Expected: All tests PASS.
- Step 7: Commit
git add collector/cluster_resources.go collector/cluster_resources_test.go collector/fixtures/cluster_resources.json collector/cluster_status.go collector/cluster_status_test.go
git commit -m "feat: add cluster_resources collector (16 metrics: CPU, memory, disk, network, storage, guest info, HA/lock)"
Task 9: Backup Collector
Files:
- Create: collector/backup.go
- Create: collector/backup_test.go
- Create: collector/fixtures/backup_not_backed_up.json
- Step 1: Create fixture
Create collector/fixtures/backup_not_backed_up.json:
{"data":[{"vmid":100,"name":"pve-backup.freyja.sip.is","type":"qemu"}]}
- Step 2: Write the failing test
Create collector/backup_test.go:
package collector
import (
"strings"
"testing"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/testutil"
"github.com/prometheus/common/promslog"
)
func TestBackupCollector(t *testing.T) {
client := newTestClient(t, map[string]string{
"/cluster/backup-info/not-backed-up": "backup_not_backed_up.json",
})
logger := promslog.NewNopLogger()
c := newBackupCollector(logger)
reg := prometheus.NewRegistry()
adapter := &testCollectorAdapter{client: client, collector: c}
reg.MustRegister(adapter)
expected := `
# HELP pve_not_backed_up_info Present if guest is not covered by any backup job.
# TYPE pve_not_backed_up_info gauge
pve_not_backed_up_info{id="qemu/100"} 1
# HELP pve_not_backed_up_total Total number of guests not covered by any backup job.
# TYPE pve_not_backed_up_total gauge
pve_not_backed_up_total{id="qemu/100"} 1
`
if err := testutil.GatherAndCompare(reg, strings.NewReader(expected),
"pve_not_backed_up_info", "pve_not_backed_up_total"); err != nil {
t.Error(err)
}
}
- Step 3: Run test to verify it fails
cd /home/user/git/pve-exporter && go test ./collector/ -run TestBackupCollector -v
Expected: Compilation error.
- Step 4: Write the implementation
Create collector/backup.go:
package collector
import (
"encoding/json"
"fmt"
"log/slog"
"github.com/prometheus/client_golang/prometheus"
)
type backupCollector struct {
notBackedUpTotalDesc *prometheus.Desc
notBackedUpInfoDesc *prometheus.Desc
logger *slog.Logger
}
func init() {
registerCollector("backup", func(logger *slog.Logger) Collector {
return newBackupCollector(logger)
})
}
func newBackupCollector(logger *slog.Logger) *backupCollector {
return &backupCollector{
notBackedUpTotalDesc: prometheus.NewDesc(
prometheus.BuildFQName(namespace, "", "not_backed_up_total"),
"Total number of guests not covered by any backup job.",
[]string{"id"}, nil,
),
notBackedUpInfoDesc: prometheus.NewDesc(
prometheus.BuildFQName(namespace, "", "not_backed_up_info"),
"Present if guest is not covered by any backup job.",
[]string{"id"}, nil,
),
logger: logger,
}
}
func (c *backupCollector) Update(client *Client, ch chan<- prometheus.Metric) error {
body, err := client.Get("/cluster/backup-info/not-backed-up")
if err != nil {
return fmt.Errorf("fetching backup info: %w", err)
}
var resp struct {
Data []struct {
VMID int `json:"vmid"`
Type string `json:"type"`
} `json:"data"`
}
if err := json.Unmarshal(body, &resp); err != nil {
return fmt.Errorf("parsing backup info: %w", err)
}
for _, vm := range resp.Data {
id := fmt.Sprintf("%s/%d", vm.Type, vm.VMID)
ch <- prometheus.MustNewConstMetric(c.notBackedUpTotalDesc, prometheus.GaugeValue, 1, id)
ch <- prometheus.MustNewConstMetric(c.notBackedUpInfoDesc, prometheus.GaugeValue, 1, id)
}
return nil
}
- Step 5: Run tests
cd /home/user/git/pve-exporter && go test ./collector/ -run TestBackupCollector -v
Expected: PASS.
- Step 6: Commit
git add collector/backup.go collector/backup_test.go collector/fixtures/backup_not_backed_up.json
git commit -m "feat: add backup collector (pve_not_backed_up_total, pve_not_backed_up_info)"
Task 10: Subscription Collector
Files:
- Create: collector/subscription.go
- Create: collector/subscription_test.go
- Create: collector/fixtures/node_subscription.json
- Step 1: Create fixture
Create collector/fixtures/node_subscription.json:
{"data":{"status":"active","level":"b","productname":"Proxmox VE Basic Subscription","nextduedate":"2027-02-03","regdate":"2025-02-03","key":"pve2b-test","sockets":2,"checktime":1773896474}}
- Step 2: Write the failing test
Create collector/subscription_test.go:
package collector
import (
"strings"
"testing"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/testutil"
"github.com/prometheus/common/promslog"
)
func TestSubscriptionCollector(t *testing.T) {
client := newTestClient(t, map[string]string{
"/nodes/node01/subscription": "node_subscription.json",
})
logger := promslog.NewNopLogger()
c := newSubscriptionCollector(logger)
// Manually set nodes since this is a NodeAwareCollector
c.SetNodes([]string{"node01"})
reg := prometheus.NewRegistry()
adapter := &testCollectorAdapter{client: client, collector: c}
reg.MustRegister(adapter)
if err := testutil.GatherAndCompare(reg, strings.NewReader(`
# HELP pve_subscription_info Proxmox VE subscription info.
# TYPE pve_subscription_info gauge
pve_subscription_info{id="node/node01",level="b"} 1
# HELP pve_subscription_status Proxmox VE subscription status.
# TYPE pve_subscription_status gauge
pve_subscription_status{id="node/node01",status="active"} 1
# HELP pve_subscription_next_due_timestamp_seconds Subscription next due date as Unix timestamp.
# TYPE pve_subscription_next_due_timestamp_seconds gauge
pve_subscription_next_due_timestamp_seconds{id="node/node01"} 1.8016128e+09
`), "pve_subscription_info", "pve_subscription_status", "pve_subscription_next_due_timestamp_seconds"); err != nil {
t.Error(err)
}
}
Note: 2027-02-03 parsed at midnight UTC is Unix timestamp 1801612800.
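To double-check that value, a throwaway program using the same layout string as the collector (time.Parse with a date-only layout yields 00:00:00 UTC):

package main

import (
	"fmt"
	"time"
)

func main() {
	// Same layout the subscription collector uses for nextduedate.
	t, err := time.Parse("2006-01-02", "2027-02-03")
	if err != nil {
		panic(err)
	}
	fmt.Println(t.Unix()) // 1801612800
}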
- Step 3: Run test to verify it fails
cd /home/user/git/pve-exporter && go test ./collector/ -run TestSubscriptionCollector -v
Expected: Compilation error.
- Step 4: Write the implementation
Create collector/subscription.go:
package collector
import (
"encoding/json"
"fmt"
"log/slog"
"sync"
"time"
"github.com/prometheus/client_golang/prometheus"
)
type subscriptionCollector struct {
infoDesc *prometheus.Desc
statusDesc *prometheus.Desc
nextDueDesc *prometheus.Desc
logger *slog.Logger
mu sync.Mutex
nodes []string
}
func init() {
registerCollector("subscription", func(logger *slog.Logger) Collector {
return newSubscriptionCollector(logger)
})
}
func newSubscriptionCollector(logger *slog.Logger) *subscriptionCollector {
return &subscriptionCollector{
infoDesc: prometheus.NewDesc(
prometheus.BuildFQName(namespace, "", "subscription_info"),
"Proxmox VE subscription info.",
[]string{"id", "level"}, nil,
),
statusDesc: prometheus.NewDesc(
prometheus.BuildFQName(namespace, "", "subscription_status"),
"Proxmox VE subscription status.",
[]string{"id", "status"}, nil,
),
nextDueDesc: prometheus.NewDesc(
prometheus.BuildFQName(namespace, "", "subscription_next_due_timestamp_seconds"),
"Subscription next due date as Unix timestamp.",
[]string{"id"}, nil,
),
logger: logger,
}
}
func (c *subscriptionCollector) SetNodes(nodes []string) {
c.mu.Lock()
defer c.mu.Unlock()
c.nodes = nodes
}
func (c *subscriptionCollector) Update(client *Client, ch chan<- prometheus.Metric) error {
c.mu.Lock()
nodes := c.nodes
c.mu.Unlock()
if len(nodes) == 0 {
return nil
}
sem := make(chan struct{}, client.MaxConcurrent())
var wg sync.WaitGroup
var mu sync.Mutex
var firstErr error
for _, node := range nodes {
wg.Add(1)
sem <- struct{}{}
go func(node string) {
defer wg.Done()
defer func() { <-sem }()
body, err := client.Get(fmt.Sprintf("/nodes/%s/subscription", node))
if err != nil {
mu.Lock()
if firstErr == nil {
firstErr = fmt.Errorf("fetching subscription for %s: %w", node, err)
}
mu.Unlock()
return
}
var resp struct {
Data struct {
Status string `json:"status"`
Level string `json:"level"`
NextDueDate string `json:"nextduedate"`
} `json:"data"`
}
if err := json.Unmarshal(body, &resp); err != nil {
c.logger.Warn("parsing subscription response", "node", node, "err", err)
return
}
id := "node/" + node
ch <- prometheus.MustNewConstMetric(c.infoDesc, prometheus.GaugeValue, 1, id, resp.Data.Level)
ch <- prometheus.MustNewConstMetric(c.statusDesc, prometheus.GaugeValue, 1, id, resp.Data.Status)
if resp.Data.NextDueDate != "" {
if t, err := time.Parse("2006-01-02", resp.Data.NextDueDate); err == nil {
ch <- prometheus.MustNewConstMetric(c.nextDueDesc, prometheus.GaugeValue,
float64(t.Unix()), id)
}
}
}(node)
}
wg.Wait()
return firstErr
}
- Step 5: Run tests
cd /home/user/git/pve-exporter && go test ./collector/ -run TestSubscriptionCollector -v
Expected: PASS.
- Step 6: Commit
git add collector/subscription.go collector/subscription_test.go collector/fixtures/node_subscription.json
git commit -m "feat: add subscription collector (info, status, next_due_timestamp)"
Task 11: Node Config Collector
Files:
- Create: collector/node_config.go
- Create: collector/node_config_test.go
- Create: collector/fixtures/node_qemu.json
- Create: collector/fixtures/node_qemu_config_100.json
- Create: collector/fixtures/node_qemu_config_101.json
- Create: collector/fixtures/node_lxc.json
- Step 1: Create fixtures
Create collector/fixtures/node_qemu.json:
{"data":[{"vmid":100,"name":"testvm1","status":"running"},{"vmid":101,"name":"testvm2","status":"stopped"}]}
Create collector/fixtures/node_qemu_config_100.json:
{"data":{"onboot":1,"name":"testvm1","memory":4096}}
Create collector/fixtures/node_qemu_config_101.json:
{"data":{"onboot":0,"name":"testvm2","memory":2048}}
Create collector/fixtures/node_lxc.json:
{"data":[]}
- Step 2: Write the failing test
Create collector/node_config_test.go:
package collector
import (
"strings"
"testing"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/testutil"
"github.com/prometheus/common/promslog"
)
func TestNodeConfigCollector(t *testing.T) {
client := newTestClient(t, map[string]string{
"/nodes/node01/qemu": "node_qemu.json",
"/nodes/node01/qemu/100/config": "node_qemu_config_100.json",
"/nodes/node01/qemu/101/config": "node_qemu_config_101.json",
"/nodes/node01/lxc": "node_lxc.json",
})
logger := promslog.NewNopLogger()
c := newNodeConfigCollector(logger)
c.SetNodes([]string{"node01"})
reg := prometheus.NewRegistry()
adapter := &testCollectorAdapter{client: client, collector: c}
reg.MustRegister(adapter)
if err := testutil.GatherAndCompare(reg, strings.NewReader(`
# HELP pve_onboot_status Proxmox VM/CT onboot config value.
# TYPE pve_onboot_status gauge
pve_onboot_status{id="qemu/100",node="node01",type="qemu"} 1
pve_onboot_status{id="qemu/101",node="node01",type="qemu"} 0
`), "pve_onboot_status"); err != nil {
t.Error(err)
}
}
- Step 3: Run test to verify it fails
cd /home/user/git/pve-exporter && go test ./collector/ -run TestNodeConfigCollector -v
Expected: Compilation error.
- Step 4: Write the implementation
Create collector/node_config.go:
package collector
import (
"encoding/json"
"fmt"
"log/slog"
"sync"
"github.com/prometheus/client_golang/prometheus"
)
type nodeConfigCollector struct {
onbootDesc *prometheus.Desc
logger *slog.Logger
mu sync.Mutex
nodes []string
}
func init() {
registerCollector("node_config", func(logger *slog.Logger) Collector {
return newNodeConfigCollector(logger)
})
}
func newNodeConfigCollector(logger *slog.Logger) *nodeConfigCollector {
return &nodeConfigCollector{
onbootDesc: prometheus.NewDesc(
prometheus.BuildFQName(namespace, "", "onboot_status"),
"Proxmox VM/CT onboot config value.",
[]string{"id", "node", "type"}, nil,
),
logger: logger,
}
}
func (c *nodeConfigCollector) SetNodes(nodes []string) {
c.mu.Lock()
defer c.mu.Unlock()
c.nodes = nodes
}
func (c *nodeConfigCollector) Update(client *Client, ch chan<- prometheus.Metric) error {
c.mu.Lock()
nodes := c.nodes
c.mu.Unlock()
if len(nodes) == 0 {
return nil
}
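// Two-level fan-out: one goroutine per node here, plus one per guest inside
// collectGuestConfigs, each level bounded by the client's concurrency limit.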
sem := make(chan struct{}, client.MaxConcurrent())
var wg sync.WaitGroup
var mu sync.Mutex
var firstErr error
for _, node := range nodes {
wg.Add(1)
sem <- struct{}{}
go func(node string) {
defer wg.Done()
defer func() { <-sem }()
if err := c.collectGuestConfigs(client, ch, node, "qemu"); err != nil {
mu.Lock()
if firstErr == nil {
firstErr = err
}
mu.Unlock()
}
if err := c.collectGuestConfigs(client, ch, node, "lxc"); err != nil {
mu.Lock()
if firstErr == nil {
firstErr = err
}
mu.Unlock()
}
}(node)
}
wg.Wait()
return firstErr
}
func (c *nodeConfigCollector) collectGuestConfigs(client *Client, ch chan<- prometheus.Metric, node, guestType string) error {
// List guests
body, err := client.Get(fmt.Sprintf("/nodes/%s/%s", node, guestType))
if err != nil {
return fmt.Errorf("listing %s on %s: %w", guestType, node, err)
}
var listResp struct {
Data []struct {
VMID int `json:"vmid"`
} `json:"data"`
}
if err := json.Unmarshal(body, &listResp); err != nil {
return fmt.Errorf("parsing %s list for %s: %w", guestType, node, err)
}
// Fetch config for each guest
sem := make(chan struct{}, client.MaxConcurrent())
var wg sync.WaitGroup
for _, guest := range listResp.Data {
wg.Add(1)
sem <- struct{}{}
go func(vmid int) {
defer wg.Done()
defer func() { <-sem }()
configBody, err := client.Get(fmt.Sprintf("/nodes/%s/%s/%d/config", node, guestType, vmid))
if err != nil {
c.logger.Warn("fetching config", "node", node, "type", guestType, "vmid", vmid, "err", err)
return
}
var configResp struct {
Data struct {
Onboot *int `json:"onboot"`
} `json:"data"`
}
if err := json.Unmarshal(configBody, &configResp); err != nil {
c.logger.Warn("parsing config", "node", node, "type", guestType, "vmid", vmid, "err", err)
return
}
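// The config API omits "onboot" when it was never set; treat a missing value as 0 (no autostart).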
onboot := 0.0
if configResp.Data.Onboot != nil {
onboot = float64(*configResp.Data.Onboot)
}
id := fmt.Sprintf("%s/%d", guestType, vmid)
ch <- prometheus.MustNewConstMetric(c.onbootDesc, prometheus.GaugeValue, onboot, id, node, guestType)
}(guest.VMID)
}
wg.Wait()
return nil
}
- Step 5: Run tests
cd /home/user/git/pve-exporter && go test ./collector/ -run TestNodeConfigCollector -v
Expected: PASS.
- Step 6: Commit
git add collector/node_config.go collector/node_config_test.go collector/fixtures/node_qemu.json collector/fixtures/node_qemu_config_100.json collector/fixtures/node_qemu_config_101.json collector/fixtures/node_lxc.json
git commit -m "feat: add node_config collector (pve_onboot_status)"
Task 12: Replication Collector
Files:
- Create: collector/replication.go
- Create: collector/replication_test.go
- Create: collector/fixtures/node_replication.json
- Step 1: Create fixture
Create collector/fixtures/node_replication.json with a single local replication job so the parsing path is exercised:
{"data":[{"id":"100-0","type":"local","source":"node01","target":"node02","guest":100,"duration":5.2,"last_sync":1710000000,"last_try":1710000060,"next_sync":1710003600,"fail_count":0}]}
- Step 2: Write the failing test
Create collector/replication_test.go:
package collector
import (
"strings"
"testing"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/testutil"
"github.com/prometheus/common/promslog"
)
func TestReplicationCollector(t *testing.T) {
client := newTestClient(t, map[string]string{
"/nodes/node01/replication": "node_replication.json",
})
logger := promslog.NewNopLogger()
c := newReplicationCollector(logger)
c.SetNodes([]string{"node01"})
reg := prometheus.NewRegistry()
adapter := &testCollectorAdapter{client: client, collector: c}
reg.MustRegister(adapter)
if err := testutil.GatherAndCompare(reg, strings.NewReader(`
# HELP pve_replication_info Proxmox VM replication info.
# TYPE pve_replication_info gauge
pve_replication_info{guest="100",id="100-0",source="node01",target="node02",type="local"} 1
# HELP pve_replication_duration_seconds Proxmox VM replication duration.
# TYPE pve_replication_duration_seconds gauge
pve_replication_duration_seconds{id="100-0"} 5.2
# HELP pve_replication_last_sync_timestamp_seconds Proxmox VM replication last_sync.
# TYPE pve_replication_last_sync_timestamp_seconds gauge
pve_replication_last_sync_timestamp_seconds{id="100-0"} 1.71e+09
# HELP pve_replication_last_try_timestamp_seconds Proxmox VM replication last_try.
# TYPE pve_replication_last_try_timestamp_seconds gauge
pve_replication_last_try_timestamp_seconds{id="100-0"} 1.71000006e+09
# HELP pve_replication_next_sync_timestamp_seconds Proxmox VM replication next_sync.
# TYPE pve_replication_next_sync_timestamp_seconds gauge
pve_replication_next_sync_timestamp_seconds{id="100-0"} 1.7100036e+09
# HELP pve_replication_failed_syncs Proxmox VM replication fail_count.
# TYPE pve_replication_failed_syncs gauge
pve_replication_failed_syncs{id="100-0"} 0
`), "pve_replication_info", "pve_replication_duration_seconds",
"pve_replication_last_sync_timestamp_seconds", "pve_replication_last_try_timestamp_seconds",
"pve_replication_next_sync_timestamp_seconds", "pve_replication_failed_syncs"); err != nil {
t.Error(err)
}
}
- Step 3: Run test to verify it fails
cd /home/user/git/pve-exporter && go test ./collector/ -run TestReplicationCollector -v
Expected: Compilation error.
- Step 4: Write the implementation
Create collector/replication.go:
package collector
import (
"encoding/json"
"fmt"
"log/slog"
"strconv"
"sync"
"github.com/prometheus/client_golang/prometheus"
)
type replicationCollector struct {
infoDesc *prometheus.Desc
durationDesc *prometheus.Desc
lastSyncDesc *prometheus.Desc
lastTryDesc *prometheus.Desc
nextSyncDesc *prometheus.Desc
failCountDesc *prometheus.Desc
logger *slog.Logger
mu sync.Mutex
nodes []string
}
func init() {
registerCollector("replication", func(logger *slog.Logger) Collector {
return newReplicationCollector(logger)
})
}
func newReplicationCollector(logger *slog.Logger) *replicationCollector {
return &replicationCollector{
infoDesc: prometheus.NewDesc(
prometheus.BuildFQName(namespace, "", "replication_info"),
"Proxmox VM replication info.",
[]string{"id", "type", "source", "target", "guest"}, nil,
),
durationDesc: prometheus.NewDesc(
prometheus.BuildFQName(namespace, "", "replication_duration_seconds"),
"Proxmox VM replication duration.",
[]string{"id"}, nil,
),
lastSyncDesc: prometheus.NewDesc(
prometheus.BuildFQName(namespace, "", "replication_last_sync_timestamp_seconds"),
"Proxmox VM replication last_sync.",
[]string{"id"}, nil,
),
lastTryDesc: prometheus.NewDesc(
prometheus.BuildFQName(namespace, "", "replication_last_try_timestamp_seconds"),
"Proxmox VM replication last_try.",
[]string{"id"}, nil,
),
nextSyncDesc: prometheus.NewDesc(
prometheus.BuildFQName(namespace, "", "replication_next_sync_timestamp_seconds"),
"Proxmox VM replication next_sync.",
[]string{"id"}, nil,
),
failCountDesc: prometheus.NewDesc(
prometheus.BuildFQName(namespace, "", "replication_failed_syncs"),
"Proxmox VM replication fail_count.",
[]string{"id"}, nil,
),
logger: logger,
}
}
func (c *replicationCollector) SetNodes(nodes []string) {
c.mu.Lock()
defer c.mu.Unlock()
c.nodes = nodes
}
func (c *replicationCollector) Update(client *Client, ch chan<- prometheus.Metric) error {
c.mu.Lock()
nodes := c.nodes
c.mu.Unlock()
if len(nodes) == 0 {
return nil
}
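// Same bounded per-node fan-out pattern as the subscription and node_config collectors.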
sem := make(chan struct{}, client.MaxConcurrent())
var wg sync.WaitGroup
var mu sync.Mutex
var firstErr error
for _, node := range nodes {
wg.Add(1)
sem <- struct{}{}
go func(node string) {
defer wg.Done()
defer func() { <-sem }()
body, err := client.Get(fmt.Sprintf("/nodes/%s/replication", node))
if err != nil {
mu.Lock()
if firstErr == nil {
firstErr = fmt.Errorf("fetching replication for %s: %w", node, err)
}
mu.Unlock()
return
}
var resp struct {
Data []struct {
ID string `json:"id"`
Type string `json:"type"`
Source string `json:"source"`
Target string `json:"target"`
Guest int `json:"guest"`
Duration float64 `json:"duration"`
LastSync float64 `json:"last_sync"`
LastTry float64 `json:"last_try"`
NextSync float64 `json:"next_sync"`
FailCount float64 `json:"fail_count"`
} `json:"data"`
}
if err := json.Unmarshal(body, &resp); err != nil {
c.logger.Warn("parsing replication response", "node", node, "err", err)
return
}
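// Emit one info series (value 1, descriptive labels) plus five numeric gauges per replication job.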
for _, r := range resp.Data {
ch <- prometheus.MustNewConstMetric(c.infoDesc, prometheus.GaugeValue, 1,
r.ID, r.Type, r.Source, r.Target, strconv.Itoa(r.Guest))
ch <- prometheus.MustNewConstMetric(c.durationDesc, prometheus.GaugeValue, r.Duration, r.ID)
ch <- prometheus.MustNewConstMetric(c.lastSyncDesc, prometheus.GaugeValue, r.LastSync, r.ID)
ch <- prometheus.MustNewConstMetric(c.lastTryDesc, prometheus.GaugeValue, r.LastTry, r.ID)
ch <- prometheus.MustNewConstMetric(c.nextSyncDesc, prometheus.GaugeValue, r.NextSync, r.ID)
ch <- prometheus.MustNewConstMetric(c.failCountDesc, prometheus.GaugeValue, r.FailCount, r.ID)
}
}(node)
}
wg.Wait()
return firstErr
}
- Step 5: Run tests
cd /home/user/git/pve-exporter && go test ./collector/ -run TestReplicationCollector -v
Expected: PASS.
- Step 6: Commit
git add collector/replication.go collector/replication_test.go collector/fixtures/node_replication.json
git commit -m "feat: add replication collector (6 replication metrics)"
Task 13: README with TODO Metrics
Files:
- Create: README.md
- Step 1: Write README.md
Create README.md with usage documentation, the full metric list, and a TODO section for future metrics. Include:
- Project description
- Installation (build from source)
- Usage (CLI flags, example command)
- Complete metric table (all implemented metrics with type and labels); a sketch follows this list
- TODO section listing all deferred metrics from the spec's "Future Metrics" section:
  - Per-node detailed status (load avg, swap, rootfs, KSM, kernel, boot, CPU model)
  - Per-VM pressure metrics
  - HA detailed status (CRM, LRM, per-service config)
  - Physical disks (SMART, wearout, OSD mapping)
  - SDN/Network (zone status)
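A minimal sketch of the usage example and the first metric-table rows. The host and token path are placeholders, only flags and metrics that already appear in this plan are shown, and the final table should cover every implemented metric:
./pve-exporter --pve.host=https://pve.example.com:8006 --pve.token-file=/etc/pve-exporter/token
curl -s http://localhost:9221/metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| pve_onboot_status | gauge | id, node, type | Proxmox VM/CT onboot config value. |
| pve_subscription_status | gauge | id, status | Proxmox VE subscription status. |
| pve_replication_info | gauge | id, type, source, target, guest | Proxmox VM replication info. |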
- Step 2: Commit
git add README.md
git commit -m "docs: add README with usage, metrics reference, and future metrics TODO"
Task 14: Integration Test and Final Verification
Files:
- No new files (uses existing code)
- Step 1: Run all unit tests
cd /home/user/git/pve-exporter && go test ./... -v
Expected: All tests PASS.
- Step 2: Build static binary
cd /home/user/git/pve-exporter && CGO_ENABLED=0 go build -o pve-exporter .
file pve-exporter
Expected: pve-exporter: ELF 64-bit LSB executable, x86-64, ... statically linked
- Step 3: End-to-end smoke test against live PVE
cd /home/user/git/pve-exporter
./pve-exporter --pve.host=https://node02.freyja.cloud.sip.is:8006 --pve.tls-insecure --pve.token-file=.apikey &
EXPORTER_PID=$!
sleep 2
curl -s http://localhost:9221/metrics > /tmp/pve-metrics.txt
Leave the exporter running; it is stopped in Step 4 after the timing check.
Verify key metrics are present:
grep "pve_version_info" /tmp/pve-metrics.txt
grep "pve_cluster_quorate" /tmp/pve-metrics.txt
grep "pve_node_online" /tmp/pve-metrics.txt
grep "pve_cluster_info" /tmp/pve-metrics.txt
grep "pve_cpu_usage_ratio" /tmp/pve-metrics.txt
grep "pve_guest_info" /tmp/pve-metrics.txt
grep "pve_storage_info" /tmp/pve-metrics.txt
grep "pve_subscription_info" /tmp/pve-metrics.txt
grep "pve_not_backed_up" /tmp/pve-metrics.txt
grep "pve_scrape_collector_success" /tmp/pve-metrics.txt
Expected: All metrics present with correct labels and values.
- Step 4: Verify scrape performance, then stop the exporter
time curl -s http://localhost:9221/metrics > /dev/null
kill "$EXPORTER_PID"
Expected: Scrape completes in under 5 seconds.
- Step 5: Commit any fixes needed
If the integration test reveals issues, fix and commit.