
pve-exporter Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Build a Go Prometheus exporter for Proxmox VE that matches prometheus-pve-exporter's metrics and adds corosync cluster metrics.

Architecture: node_exporter-style collector framework. A shared PVE API client with multi-host failover feeds self-registering collectors that run in parallel. Each collector owns one API domain (cluster status, resources, corosync, etc.) and emits metrics to a Prometheus channel.

Tech Stack: Go 1.22+, github.com/prometheus/client_golang, github.com/alecthomas/kingpin/v2, github.com/prometheus/common (promslog), github.com/prometheus/exporter-toolkit

Spec: docs/superpowers/specs/2026-03-20-pve-exporter-design.md
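
For orientation, this is the shape of the collector pattern described under Architecture, condensed here as a preview; the actual definitions are written in Tasks 3 and 5 and take precedence:

// collector/collector.go — every collector implements Update.
type Collector interface {
	Update(client *Client, ch chan<- prometheus.Metric) error
}

// collector/version.go — collectors self-register at package init time,
// so adding a collector means adding one file with an init() like this.
func init() {
	registerCollector("version", func(logger *slog.Logger) Collector {
		return newVersionCollector(logger)
	})
}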


File Structure

| File | Responsibility |
|------|----------------|
| go.mod | Module definition and dependencies |
| main.go | CLI flags, HTTP server, wiring |
| collector/collector.go | Collector interface, registry, PVECollector (prometheus.Collector), scrape orchestration |
| collector/client.go | PVE API HTTP client with multi-host failover, auth, TLS config |
| collector/client_test.go | Client tests with httptest server |
| collector/cluster_status.go | pve_up, pve_node_info, pve_cluster_info |
| collector/cluster_status_test.go | Tests with JSON fixtures |
| collector/corosync.go | pve_cluster_quorate, pve_cluster_nodes_total, pve_cluster_expected_votes, pve_node_online |
| collector/corosync_test.go | Tests with JSON fixtures |
| collector/cluster_resources.go | 16 metrics: CPU, memory, disk, network, storage, guest info, HA/lock state |
| collector/cluster_resources_test.go | Tests with JSON fixtures |
| collector/version.go | pve_version_info |
| collector/version_test.go | Tests with JSON fixtures |
| collector/backup.go | pve_not_backed_up_total, pve_not_backed_up_info |
| collector/backup_test.go | Tests with JSON fixtures |
| collector/node_config.go | pve_onboot_status (per-node fan-out) |
| collector/node_config_test.go | Tests with JSON fixtures |
| collector/replication.go | 6 replication metrics (per-node fan-out) |
| collector/replication_test.go | Tests with JSON fixtures |
| collector/subscription.go | 3 subscription metrics (per-node fan-out) |
| collector/subscription_test.go | Tests with JSON fixtures |
| collector/testutil_test.go | Shared test helpers: mock client, fixture loader |
| collector/fixtures/ | JSON fixture files for API responses |
| Makefile | Build, test, lint targets |
| README.md | Usage docs, metric list, TODO for future metrics |

Task 1: Go Module and Dependencies

Files:

  • Create: go.mod

  • Step 1: Initialize Go module

cd /home/user/git/pve-exporter
go mod init github.com/dsgeis/pve-exporter
  • Step 2: Add dependencies
cd /home/user/git/pve-exporter
go get github.com/alecthomas/kingpin/v2@v2.4.0
go get github.com/prometheus/client_golang@latest
go get github.com/prometheus/common@latest
go get github.com/prometheus/exporter-toolkit@latest
  • Step 3: Commit
git add go.mod go.sum
git commit -m "feat: initialize Go module with dependencies"

Task 2: PVE API Client

Files:

  • Create: collector/client.go

  • Create: collector/client_test.go

  • Step 1: Write the failing test

Create collector/client_test.go:

package collector

import (
	"net/http"
	"net/http/httptest"
	"testing"
)

func TestClientGet(t *testing.T) {
	server := httptest.NewTLSServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("Authorization") != "PVEAPIToken=test@pve!token=secret" {
			t.Errorf("unexpected auth header: %s", r.Header.Get("Authorization"))
		}
		if r.URL.Path != "/api2/json/version" {
			t.Errorf("unexpected path: %s", r.URL.Path)
		}
		w.Write([]byte(`{"data":{"version":"8.0"}}`))
	}))
	defer server.Close()

	client, err := NewClient([]string{server.URL}, "test@pve!token=secret", true, 5)
	if err != nil {
		t.Fatal(err)
	}
	// Override HTTP client to trust test server's TLS cert
	client.httpClient = server.Client()

	data, err := client.Get("/version")
	if err != nil {
		t.Fatal(err)
	}
	if string(data) != `{"data":{"version":"8.0"}}` {
		t.Errorf("unexpected response: %s", string(data))
	}
}

func TestClientFailover(t *testing.T) {
	// First server always fails
	bad := httptest.NewTLSServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusInternalServerError)
	}))
	defer bad.Close()

	// Second server works
	good := httptest.NewTLSServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte(`{"data":"ok"}`))
	}))
	defer good.Close()

	client, err := NewClient([]string{bad.URL, good.URL}, "token", true, 5)
	if err != nil {
		t.Fatal(err)
	}
	// Use a client that trusts both test servers
	client.httpClient = bad.Client()

	data, err := client.Get("/test")
	if err != nil {
		t.Fatal(err)
	}
	if string(data) != `{"data":"ok"}` {
		t.Errorf("unexpected response: %s", string(data))
	}

	// After success, good host should be tried first (remembered)
	// Make bad server unreachable by closing it
	bad.Close()

	data, err = client.Get("/test")
	if err != nil {
		t.Fatal(err)
	}
	if string(data) != `{"data":"ok"}` {
		t.Errorf("second request failed or returned wrong data: %s", string(data))
	}
}

func TestClientAllHostsFail(t *testing.T) {
	bad := httptest.NewTLSServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusInternalServerError)
	}))
	defer bad.Close()

	client, err := NewClient([]string{bad.URL}, "token", true, 5)
	if err != nil {
		t.Fatal(err)
	}
	client.httpClient = bad.Client()

	_, err = client.Get("/test")
	if err == nil {
		t.Error("expected error when all hosts fail")
	}
}
  • Step 2: Run test to verify it fails
cd /home/user/git/pve-exporter && go test ./collector/ -run TestClient -v

Expected: Compilation error — NewClient not defined.

  • Step 3: Write the implementation

Create collector/client.go:

package collector

import (
	"crypto/tls"
	"fmt"
	"io"
	"net"
	"net/http"
	"sync"
	"time"
)

// Client is an HTTP client for the Proxmox VE API.
// It supports multiple hosts with automatic failover.
type Client struct {
	httpClient     *http.Client
	hosts          []string
	token          string
	maxConcurrent  int

	mu            sync.Mutex
	lastGoodHost  int // index into hosts
}

// NewClient creates a new PVE API client.
// hosts is a list of PVE API base URLs tried in order on failure.
// token is the PVE API token string (user@realm!tokenid=uuid).
// tlsInsecure disables TLS certificate verification when true.
// maxConcurrent limits parallel per-node API requests.
func NewClient(hosts []string, token string, tlsInsecure bool, maxConcurrent int) (*Client, error) {
	if len(hosts) == 0 {
		return nil, fmt.Errorf("at least one PVE host is required")
	}
	if maxConcurrent < 1 {
		maxConcurrent = 5
	}

	transport := &http.Transport{
		DialContext: (&net.Dialer{
			Timeout: 1 * time.Second,
		}).DialContext,
		TLSClientConfig: &tls.Config{
			InsecureSkipVerify: tlsInsecure,
		},
		MaxIdleConnsPerHost: 10,
		IdleConnTimeout:     90 * time.Second,
	}

	return &Client{
		httpClient: &http.Client{
			Transport: transport,
			Timeout:   30 * time.Second,
		},
		hosts:         hosts,
		token:         token,
		maxConcurrent: maxConcurrent,
	}, nil
}

// Get makes a GET request to the PVE API at the given path.
// It tries hosts in order, starting with the last successful host.
// The path should not include /api2/json prefix — it is added automatically.
func (c *Client) Get(path string) ([]byte, error) {
	c.mu.Lock()
	startIdx := c.lastGoodHost
	c.mu.Unlock()

	var lastErr error
	for i := 0; i < len(c.hosts); i++ {
		idx := (startIdx + i) % len(c.hosts)
		host := c.hosts[idx]

		url := host + "/api2/json" + path

		req, err := http.NewRequest("GET", url, nil)
		if err != nil {
			lastErr = fmt.Errorf("creating request for %s: %w", host, err)
			continue
		}
		req.Header.Set("Authorization", "PVEAPIToken="+c.token)

		resp, err := c.httpClient.Do(req)
		if err != nil {
			lastErr = fmt.Errorf("requesting %s: %w", url, err)
			continue
		}

		body, err := io.ReadAll(resp.Body)
		resp.Body.Close()

		if err != nil {
			lastErr = fmt.Errorf("reading response from %s: %w", url, err)
			continue
		}

		if resp.StatusCode != http.StatusOK {
			lastErr = fmt.Errorf("%s returned status %d: %s", url, resp.StatusCode, string(body))
			continue
		}

		c.mu.Lock()
		c.lastGoodHost = idx
		c.mu.Unlock()

		return body, nil
	}

	return nil, fmt.Errorf("all PVE hosts failed: %w", lastErr)
}

// MaxConcurrent returns the configured max concurrent API requests.
func (c *Client) MaxConcurrent() int {
	return c.maxConcurrent
}
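
A usage sketch from the caller's side, as main.go will wire it in Task 4. Host URLs and the token below are placeholders, not real credentials:

package main

import (
	"fmt"
	"log"

	"github.com/dsgeis/pve-exporter/collector"
)

func main() {
	// Hypothetical hosts and token; replace with real values.
	hosts := []string{"https://pve1.example.com:8006", "https://pve2.example.com:8006"}
	client, err := collector.NewClient(hosts, "monitor@pve!exporter=00000000-0000-0000-0000-000000000000", true, 5)
	if err != nil {
		log.Fatal(err)
	}
	// Get prepends /api2/json and fails over to the next host on error.
	body, err := client.Get("/cluster/status")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(body))
}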
  • Step 4: Run tests to verify they pass
cd /home/user/git/pve-exporter && go test ./collector/ -run TestClient -v

Expected: All 3 tests PASS.

  • Step 5: Commit
git add collector/client.go collector/client_test.go
git commit -m "feat: add PVE API client with multi-host failover"

Task 3: Collector Framework

Files:

  • Create: collector/collector.go

  • Create: collector/testutil_test.go

  • Step 1: Write collector.go

Create collector/collector.go with the collector interface, registry, and PVECollector:

package collector

import (
	"encoding/json"
	"log/slog"
	"sync"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

const namespace = "pve"

var (
	scrapeDurationDesc = prometheus.NewDesc(
		prometheus.BuildFQName(namespace, "scrape", "collector_duration_seconds"),
		"pve_exporter: Duration of a collector scrape.",
		[]string{"collector"}, nil,
	)
	scrapeSuccessDesc = prometheus.NewDesc(
		prometheus.BuildFQName(namespace, "scrape", "collector_success"),
		"pve_exporter: Whether a collector succeeded.",
		[]string{"collector"}, nil,
	)
)

// Collector is the interface each metric collector implements.
type Collector interface {
	Update(client *Client, ch chan<- prometheus.Metric) error
}

// NodeAwareCollector is implemented by collectors that need the cluster node list.
type NodeAwareCollector interface {
	Collector
	SetNodes(nodes []string)
}

// ResourceAwareCollector is implemented by collectors that consume /cluster/resources data.
type ResourceAwareCollector interface {
	Collector
	SetResources(data []byte)
}

var factories = make(map[string]func(logger *slog.Logger) Collector)

func registerCollector(name string, factory func(logger *slog.Logger) Collector) {
	factories[name] = factory
}

// PVECollector implements prometheus.Collector and orchestrates all registered collectors.
type PVECollector struct {
	client     *Client
	collectors map[string]Collector
	logger     *slog.Logger
}

// NewPVECollector creates a PVECollector with all registered collectors.
func NewPVECollector(client *Client, logger *slog.Logger) *PVECollector {
	collectors := make(map[string]Collector)
	for name, factory := range factories {
		collectors[name] = factory(logger.With("collector", name))
	}
	return &PVECollector{
		client:     client,
		collectors: collectors,
		logger:     logger,
	}
}

// Describe implements prometheus.Collector.
func (p *PVECollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- scrapeDurationDesc
	ch <- scrapeSuccessDesc
}

// Collect implements prometheus.Collector.
// It fetches /cluster/resources first to get the node list and resource data,
// then runs all collectors in parallel.
func (p *PVECollector) Collect(ch chan<- prometheus.Metric) {
	// Pre-fetch cluster resources for shared data
	resourcesData, err := p.client.Get("/cluster/resources")
	var nodes []string
	if err != nil {
		p.logger.Error("failed to fetch cluster resources", "err", err)
	} else {
		nodes = extractNodeNames(resourcesData)
	}

	// Distribute shared data to collectors that need it
	for _, c := range p.collectors {
		if rac, ok := c.(ResourceAwareCollector); ok && resourcesData != nil {
			rac.SetResources(resourcesData)
		}
		if nac, ok := c.(NodeAwareCollector); ok {
			nac.SetNodes(nodes)
		}
	}

	// Run all collectors in parallel
	wg := sync.WaitGroup{}
	wg.Add(len(p.collectors))
	for name, c := range p.collectors {
		go func(name string, c Collector) {
			defer wg.Done()
			begin := time.Now()
			err := c.Update(p.client, ch)
			duration := time.Since(begin)

			var success float64
			if err != nil {
				p.logger.Error("collector failed", "name", name, "duration_seconds", duration.Seconds(), "err", err)
				success = 0
			} else {
				p.logger.Debug("collector succeeded", "name", name, "duration_seconds", duration.Seconds())
				success = 1
			}
			ch <- prometheus.MustNewConstMetric(scrapeDurationDesc, prometheus.GaugeValue, duration.Seconds(), name)
			ch <- prometheus.MustNewConstMetric(scrapeSuccessDesc, prometheus.GaugeValue, success, name)
		}(name, c)
	}
	wg.Wait()
}

// extractNodeNames parses /cluster/resources response and returns node names.
func extractNodeNames(data []byte) []string {
	// Lightweight JSON parsing — we only need node names from type=node entries
	type resource struct {
		Type string `json:"type"`
		Node string `json:"node"`
	}
	type response struct {
		Data []resource `json:"data"`
	}

	var resp response
	if err := json.Unmarshal(data, &resp); err != nil {
		return nil
	}

	seen := make(map[string]bool)
	var nodes []string
	for _, r := range resp.Data {
		if r.Type == "node" && !seen[r.Node] {
			seen[r.Node] = true
			nodes = append(nodes, r.Node)
		}
	}
	return nodes
}


  • Step 2: Create shared test utilities

Create collector/testutil_test.go:

package collector

import (
	"net/http"
	"net/http/httptest"
	"os"
	"path/filepath"
	"testing"

	"github.com/prometheus/client_golang/prometheus"
)

// newTestClient creates a Client backed by an httptest.Server that serves
// fixture files. Routes maps API paths (e.g. "/cluster/status") to fixture
// file basenames (e.g. "cluster_status.json") under collector/fixtures/.
func newTestClient(t *testing.T, routes map[string]string) *Client {
	t.Helper()
	server := httptest.NewTLSServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Strip /api2/json prefix to match route keys
		path := r.URL.Path
		const prefix = "/api2/json"
		if len(path) > len(prefix) {
			path = path[len(prefix):]
		}

		fixture, ok := routes[path]
		if !ok {
			w.WriteHeader(http.StatusNotFound)
			return
		}
		data, err := os.ReadFile(filepath.Join("fixtures", fixture))
		if err != nil {
			// t.Fatalf must not be called from the server goroutine; fail the test and the request instead.
			t.Errorf("reading fixture %s: %v", fixture, err)
			w.WriteHeader(http.StatusInternalServerError)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		w.Write(data)
	}))
	t.Cleanup(server.Close)

	client, err := NewClient([]string{server.URL}, "test-token", true, 5)
	if err != nil {
		t.Fatal(err)
	}
	client.httpClient = server.Client()
	return client
}

// testCollectorAdapter wraps a Collector into a prometheus.Collector for testutil.
type testCollectorAdapter struct {
	client    *Client
	collector Collector
}

func (a *testCollectorAdapter) Describe(ch chan<- *prometheus.Desc) {
	// Intentionally empty: sending no descriptors makes this an "unchecked"
	// collector, so the registry accepts whatever the wrapped collector emits.
}

func (a *testCollectorAdapter) Collect(ch chan<- prometheus.Metric) {
	a.collector.Update(a.client, ch)
}


  • Step 3: Verify it compiles
cd /home/user/git/pve-exporter && go build ./collector/

Expected: Compiles successfully. There are no tests for the framework itself yet; it is exercised end-to-end once the first collector registers in Task 5.

  • Step 4: Commit
git add collector/collector.go collector/testutil_test.go
git commit -m "feat: add collector framework with registry and parallel scrape orchestration"

Task 4: Main Entry Point (Minimal)

Files:

  • Create: main.go

  • Create: Makefile

  • Step 1: Write main.go

Create main.go:

package main

import (
	"fmt"
	"log/slog"
	"net/http"
	"os"
	"strings"

	"github.com/alecthomas/kingpin/v2"
	"github.com/prometheus/client_golang/prometheus"
	versioncollector "github.com/prometheus/client_golang/prometheus/collectors/version"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	"github.com/prometheus/common/promslog"
	"github.com/prometheus/common/promslog/flag"
	"github.com/prometheus/common/version"
	"github.com/prometheus/exporter-toolkit/web"
	"github.com/prometheus/exporter-toolkit/web/kingpinflag"

	"github.com/dsgeis/pve-exporter/collector"
)

func main() {
	var (
		pveHosts = kingpin.Flag(
			"pve.host",
			"PVE API base URL (repeatable, tried in order on failure).",
		).Required().Strings()
		pveAPIToken = kingpin.Flag(
			"pve.api-token",
			"PVE API token string (user@realm!tokenid=uuid). Mutually exclusive with --pve.token-file.",
		).String()
		pveTokenFile = kingpin.Flag(
			"pve.token-file",
			"Path to file containing PVE API token. Mutually exclusive with --pve.api-token.",
		).String()
		pveTLSInsecure = kingpin.Flag(
			"pve.tls-insecure",
			"Disable TLS certificate verification for PVE API.",
		).Default("false").Bool()
		pveMaxConcurrent = kingpin.Flag(
			"pve.max-concurrent",
			"Max concurrent API requests for per-node fan-out.",
		).Default("5").Int()
		metricsPath = kingpin.Flag(
			"web.telemetry-path",
			"Path under which to expose metrics.",
		).Default("/metrics").String()
		toolkitFlags = kingpinflag.AddFlags(kingpin.CommandLine, ":9221")
	)

	promslogConfig := &promslog.Config{}
	flag.AddFlags(kingpin.CommandLine, promslogConfig)
	kingpin.Version(version.Print("pve_exporter"))
	kingpin.CommandLine.UsageWriter(os.Stdout)
	kingpin.HelpFlag.Short('h')
	kingpin.Parse()
	logger := promslog.New(promslogConfig)

	// Resolve API token
	token, err := resolveToken(*pveAPIToken, *pveTokenFile)
	if err != nil {
		logger.Error("failed to resolve API token", "err", err)
		os.Exit(1)
	}

	client, err := collector.NewClient(*pveHosts, token, *pveTLSInsecure, *pveMaxConcurrent)
	if err != nil {
		logger.Error("failed to create PVE client", "err", err)
		os.Exit(1)
	}

	pveCollector := collector.NewPVECollector(client, logger)

	r := prometheus.NewRegistry()
	r.MustRegister(versioncollector.NewCollector("pve_exporter"))
	r.MustRegister(pveCollector)

	handler := promhttp.HandlerFor(r, promhttp.HandlerOpts{
		ErrorLog:      slog.NewLogLogger(logger.Handler(), slog.LevelError),
		ErrorHandling: promhttp.ContinueOnError,
	})

	http.Handle(*metricsPath, handler)
	if *metricsPath != "/" {
		landingConfig := web.LandingConfig{
			Name:        "PVE Exporter",
			Description: "Prometheus Exporter for Proxmox VE",
			Version:     version.Info(),
			Links: []web.LandingLinks{
				{Address: *metricsPath, Text: "Metrics"},
			},
		}
		landingPage, err := web.NewLandingPage(landingConfig)
		if err != nil {
			logger.Error(err.Error())
			os.Exit(1)
		}
		http.Handle("/", landingPage)
	}

	logger.Info("Starting pve_exporter", "version", version.Info())

	server := &http.Server{}
	if err := web.ListenAndServe(server, toolkitFlags, logger); err != nil {
		logger.Error(err.Error())
		os.Exit(1)
	}
}

func resolveToken(apiToken, tokenFile string) (string, error) {
	if apiToken != "" && tokenFile != "" {
		return "", fmt.Errorf("--pve.api-token and --pve.token-file are mutually exclusive")
	}
	if apiToken == "" && tokenFile == "" {
		// Try environment variable
		if env := os.Getenv("PVE_API_TOKEN"); env != "" {
			return env, nil
		}
		return "", fmt.Errorf("one of --pve.api-token, --pve.token-file, or PVE_API_TOKEN is required")
	}
	if tokenFile != "" {
		data, err := os.ReadFile(tokenFile)
		if err != nil {
			return "", fmt.Errorf("reading token file: %w", err)
		}
		return strings.TrimSpace(string(data)), nil
	}
	return apiToken, nil
}
  • Step 2: Write Makefile

Create Makefile:

.PHONY: build test clean

BINARY := pve-exporter

build:
	CGO_ENABLED=0 go build -o $(BINARY) .

test:
	go test -v ./...

clean:
	rm -f $(BINARY)
  • Step 3: Verify it compiles and runs
cd /home/user/git/pve-exporter && go mod tidy && make build

Expected: Compiles. Binary pve-exporter created.

./pve-exporter --help

Expected: Shows usage with all flags listed.
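
For a manual sanity check, an invocation would look roughly like this. Hosts and token are placeholders; --web.listen-address comes from exporter-toolkit and defaults to :9221:

./pve-exporter \
  --pve.host=https://pve1.example.com:8006 \
  --pve.host=https://pve2.example.com:8006 \
  --pve.api-token='monitor@pve!exporter=00000000-0000-0000-0000-000000000000' \
  --pve.tls-insecure \
  --web.listen-address=:9221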

  • Step 4: Commit
git add main.go Makefile go.mod go.sum
git commit -m "feat: add main entry point with CLI flags and HTTP server"

Task 5: Version Collector

The simplest collector — good for validating the framework end-to-end.

Files:

  • Create: collector/version.go

  • Create: collector/version_test.go

  • Create: collector/fixtures/version.json

  • Step 1: Create fixture

Create collector/fixtures/version.json:

{"data":{"version":"9.1.4","release":"9.1","repoid":"5ac30304265fbd8e"}}
  • Step 2: Write the failing test

Create collector/version_test.go:

package collector

import (
	"strings"
	"testing"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/testutil"
	"github.com/prometheus/common/promslog"
)

func TestVersionCollector(t *testing.T) {
	client := newTestClient(t, map[string]string{
		"/version": "version.json",
	})
	logger := promslog.NewNopLogger()
	c := newVersionCollector(logger)

	expected := `
		# HELP pve_version_info Proxmox VE version info.
		# TYPE pve_version_info gauge
		pve_version_info{release="9.1",repoid="5ac30304265fbd8e",version="9.1.4"} 1
	`

	reg := prometheus.NewRegistry()
	collector := &testCollectorAdapter{client: client, collector: c}
	reg.MustRegister(collector)

	if err := testutil.GatherAndCompare(reg, strings.NewReader(expected), "pve_version_info"); err != nil {
		t.Error(err)
	}
}

Note: testCollectorAdapter is defined in testutil_test.go (Task 3) and shared by all collector tests.

  • Step 3: Run test to verify it fails
cd /home/user/git/pve-exporter && go test ./collector/ -run TestVersionCollector -v

Expected: Compilation error — newVersionCollector not defined.

  • Step 4: Write the implementation

Create collector/version.go:

package collector

import (
	"encoding/json"
	"fmt"
	"log/slog"

	"github.com/prometheus/client_golang/prometheus"
)

type versionCollector struct {
	infoDesc *prometheus.Desc
	logger   *slog.Logger
}

func init() {
	registerCollector("version", func(logger *slog.Logger) Collector {
		return newVersionCollector(logger)
	})
}

func newVersionCollector(logger *slog.Logger) *versionCollector {
	return &versionCollector{
		infoDesc: prometheus.NewDesc(
			prometheus.BuildFQName(namespace, "", "version_info"),
			"Proxmox VE version info.",
			[]string{"release", "repoid", "version"}, nil,
		),
		logger: logger,
	}
}

func (c *versionCollector) Update(client *Client, ch chan<- prometheus.Metric) error {
	body, err := client.Get("/version")
	if err != nil {
		return fmt.Errorf("fetching version: %w", err)
	}

	var resp struct {
		Data struct {
			Version string `json:"version"`
			Release string `json:"release"`
			RepoID  string `json:"repoid"`
		} `json:"data"`
	}
	if err := json.Unmarshal(body, &resp); err != nil {
		return fmt.Errorf("parsing version response: %w", err)
	}

	ch <- prometheus.MustNewConstMetric(c.infoDesc, prometheus.GaugeValue, 1,
		resp.Data.Release, resp.Data.RepoID, resp.Data.Version)

	return nil
}
  • Step 5: Run tests
cd /home/user/git/pve-exporter && go test ./collector/ -run TestVersionCollector -v

Expected: PASS.

  • Step 6: End-to-end smoke test

Build and run against the live PVE cluster to verify the full stack works:

cd /home/user/git/pve-exporter && make build
./pve-exporter --pve.host=https://node02.freyja.cloud.sip.is:8006 --pve.tls-insecure --pve.token-file=.apikey &
sleep 2
curl -s http://localhost:9221/metrics | grep pve_version_info
kill %1

Expected: Output contains pve_version_info{release="9.1",repoid="5ac30304265fbd8e",version="9.1.4"} 1 plus scrape meta-metrics.

  • Step 7: Commit
git add collector/version.go collector/version_test.go collector/fixtures/version.json
git commit -m "feat: add version collector (pve_version_info)"

Task 6: Cluster Status Collector

Files:

  • Create: collector/cluster_status.go

  • Create: collector/cluster_status_test.go

  • Create: collector/fixtures/cluster_status.json

  • Step 1: Create fixture

Create collector/fixtures/cluster_status.json with realistic data from the live API. The response contains both cluster-type and node-type entries:

{"data":[{"type":"cluster","id":"cluster","name":"freyja","version":9,"nodes":5,"quorate":1},{"type":"node","id":"node/node01","name":"node01","nodeid":1,"online":1,"local":0,"ip":"10.99.0.1","level":"b"},{"type":"node","id":"node/node02","name":"node02","nodeid":2,"online":1,"local":1,"ip":"10.99.0.2","level":"b"},{"type":"node","id":"node/node03","name":"node03","nodeid":3,"online":1,"local":0,"ip":"10.99.0.3","level":"b"},{"type":"node","id":"node/node04","name":"node04","nodeid":4,"online":1,"local":0,"ip":"10.99.0.4","level":"b"},{"type":"node","id":"node/node05","name":"node05","nodeid":5,"online":1,"local":0,"ip":"10.99.0.5","level":"b"}]}
  • Step 2: Write the failing test

Create collector/cluster_status_test.go:

package collector

import (
	"strings"
	"testing"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/testutil"
	"github.com/prometheus/common/promslog"
)

func TestClusterStatusCollector(t *testing.T) {
	client := newTestClient(t, map[string]string{
		"/cluster/status": "cluster_status.json",
	})
	logger := promslog.NewNopLogger()
	c := newClusterStatusCollector(logger)

	reg := prometheus.NewRegistry()
	adapter := &testCollectorAdapter{client: client, collector: c}
	reg.MustRegister(adapter)

	// Check pve_up for nodes
	expected := `
		# HELP pve_up Node/VM/CT-Status is online/running.
		# TYPE pve_up gauge
		pve_up{id="node/node01"} 1
		pve_up{id="node/node02"} 1
		pve_up{id="node/node03"} 1
		pve_up{id="node/node04"} 1
		pve_up{id="node/node05"} 1
	`
	if err := testutil.GatherAndCompare(reg, strings.NewReader(expected), "pve_up"); err != nil {
		t.Error(err)
	}

	// Check pve_node_info
	if err := testutil.GatherAndCompare(reg, strings.NewReader(`
		# HELP pve_node_info Node info.
		# TYPE pve_node_info gauge
		pve_node_info{id="node/node01",level="b",name="node01",nodeid="1"} 1
		pve_node_info{id="node/node02",level="b",name="node02",nodeid="2"} 1
		pve_node_info{id="node/node03",level="b",name="node03",nodeid="3"} 1
		pve_node_info{id="node/node04",level="b",name="node04",nodeid="4"} 1
		pve_node_info{id="node/node05",level="b",name="node05",nodeid="5"} 1
	`), "pve_node_info"); err != nil {
		t.Error(err)
	}

	// Check pve_cluster_info
	if err := testutil.GatherAndCompare(reg, strings.NewReader(`
		# HELP pve_cluster_info Cluster info.
		# TYPE pve_cluster_info gauge
		pve_cluster_info{id="cluster",nodes="5",quorate="1",version="9"} 1
	`), "pve_cluster_info"); err != nil {
		t.Error(err)
	}
}
  • Step 3: Run test to verify it fails
cd /home/user/git/pve-exporter && go test ./collector/ -run TestClusterStatusCollector -v

Expected: Compilation error — newClusterStatusCollector not defined.

  • Step 4: Write the implementation

Create collector/cluster_status.go:

package collector

import (
	"encoding/json"
	"fmt"
	"log/slog"
	"strconv"

	"github.com/prometheus/client_golang/prometheus"
)

type clusterStatusCollector struct {
	upDesc          *prometheus.Desc
	nodeInfoDesc    *prometheus.Desc
	clusterInfoDesc *prometheus.Desc
	logger          *slog.Logger
}

func init() {
	registerCollector("cluster_status", func(logger *slog.Logger) Collector {
		return newClusterStatusCollector(logger)
	})
}

func newClusterStatusCollector(logger *slog.Logger) *clusterStatusCollector {
	return &clusterStatusCollector{
		upDesc: prometheus.NewDesc(
			prometheus.BuildFQName(namespace, "", "up"),
			"Node/VM/CT-Status is online/running.",
			[]string{"id"}, nil,
		),
		nodeInfoDesc: prometheus.NewDesc(
			prometheus.BuildFQName(namespace, "", "node_info"),
			"Node info.",
			[]string{"id", "level", "name", "nodeid"}, nil,
		),
		clusterInfoDesc: prometheus.NewDesc(
			prometheus.BuildFQName(namespace, "", "cluster_info"),
			"Cluster info.",
			[]string{"id", "nodes", "quorate", "version"}, nil,
		),
		logger: logger,
	}
}

func (c *clusterStatusCollector) Update(client *Client, ch chan<- prometheus.Metric) error {
	body, err := client.Get("/cluster/status")
	if err != nil {
		return fmt.Errorf("fetching cluster status: %w", err)
	}

	var resp struct {
		Data []json.RawMessage `json:"data"`
	}
	if err := json.Unmarshal(body, &resp); err != nil {
		return fmt.Errorf("parsing cluster status: %w", err)
	}

	for _, raw := range resp.Data {
		var entry struct {
			Type    string `json:"type"`
			ID      string `json:"id"`
			Name    string `json:"name"`
			Online  int    `json:"online"`
			NodeID  int    `json:"nodeid"`
			Level   string `json:"level"`
			Nodes   int    `json:"nodes"`
			Quorate int    `json:"quorate"`
			Version int    `json:"version"`
		}
		if err := json.Unmarshal(raw, &entry); err != nil {
			c.logger.Warn("skipping unparseable cluster status entry", "err", err)
			continue
		}

		switch entry.Type {
		case "node":
			ch <- prometheus.MustNewConstMetric(c.upDesc, prometheus.GaugeValue,
				float64(entry.Online), entry.ID)
			ch <- prometheus.MustNewConstMetric(c.nodeInfoDesc, prometheus.GaugeValue, 1,
				entry.ID, entry.Level, entry.Name, strconv.Itoa(entry.NodeID))
		case "cluster":
			ch <- prometheus.MustNewConstMetric(c.clusterInfoDesc, prometheus.GaugeValue, 1,
				entry.ID,
				strconv.Itoa(entry.Nodes),
				strconv.Itoa(entry.Quorate),
				strconv.Itoa(entry.Version))
		}
	}

	return nil
}
  • Step 5: Run tests
cd /home/user/git/pve-exporter && go test ./collector/ -run TestClusterStatusCollector -v

Expected: PASS.

  • Step 6: Commit
git add collector/cluster_status.go collector/cluster_status_test.go collector/fixtures/cluster_status.json
git commit -m "feat: add cluster_status collector (pve_up, pve_node_info, pve_cluster_info)"

Task 7: Corosync Collector

Files:

  • Create: collector/corosync.go

  • Create: collector/corosync_test.go

  • Create: collector/fixtures/cluster_config_nodes.json

  • Step 1: Create fixture

Create collector/fixtures/cluster_config_nodes.json:

{"data":[{"node":"node01","nodeid":"1","quorum_votes":"1","ring0_addr":"10.99.0.1"},{"node":"node02","nodeid":"2","quorum_votes":"1","ring0_addr":"10.99.0.2"},{"node":"node03","nodeid":"3","quorum_votes":"1","ring0_addr":"10.99.0.3"},{"node":"node04","nodeid":"4","quorum_votes":"1","ring0_addr":"10.99.0.4"},{"node":"node05","nodeid":"5","quorum_votes":"1","ring0_addr":"10.99.0.5"}]}

This collector reuses cluster_status.json from Task 6 for quorate and node online state.

  • Step 2: Write the failing test

Create collector/corosync_test.go:

package collector

import (
	"strings"
	"testing"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/testutil"
	"github.com/prometheus/common/promslog"
)

func TestCorosyncCollector(t *testing.T) {
	client := newTestClient(t, map[string]string{
		"/cluster/status":       "cluster_status.json",
		"/cluster/config/nodes": "cluster_config_nodes.json",
	})
	logger := promslog.NewNopLogger()
	c := newCorosyncCollector(logger)

	reg := prometheus.NewRegistry()
	adapter := &testCollectorAdapter{client: client, collector: c}
	reg.MustRegister(adapter)

	expected := `
		# HELP pve_cluster_quorate Whether the cluster is quorate.
		# TYPE pve_cluster_quorate gauge
		pve_cluster_quorate 1
		# HELP pve_cluster_nodes_total Total number of nodes in the cluster.
		# TYPE pve_cluster_nodes_total gauge
		pve_cluster_nodes_total 5
		# HELP pve_cluster_expected_votes Total expected votes in the cluster.
		# TYPE pve_cluster_expected_votes gauge
		pve_cluster_expected_votes 5
		# HELP pve_node_online Whether a cluster node is online.
		# TYPE pve_node_online gauge
		pve_node_online{name="node01",nodeid="1"} 1
		pve_node_online{name="node02",nodeid="2"} 1
		pve_node_online{name="node03",nodeid="3"} 1
		pve_node_online{name="node04",nodeid="4"} 1
		pve_node_online{name="node05",nodeid="5"} 1
	`
	if err := testutil.GatherAndCompare(reg, strings.NewReader(expected),
		"pve_cluster_quorate", "pve_cluster_nodes_total", "pve_cluster_expected_votes", "pve_node_online"); err != nil {
		t.Error(err)
	}
}
  • Step 3: Run test to verify it fails
cd /home/user/git/pve-exporter && go test ./collector/ -run TestCorosyncCollector -v

Expected: Compilation error — newCorosyncCollector not defined.

  • Step 4: Write the implementation

Create collector/corosync.go:

package collector

import (
	"encoding/json"
	"fmt"
	"log/slog"
	"strconv"

	"github.com/prometheus/client_golang/prometheus"
)

type corosyncCollector struct {
	quorateDesc       *prometheus.Desc
	nodesTotalDesc    *prometheus.Desc
	expectedVotesDesc *prometheus.Desc
	nodeOnlineDesc    *prometheus.Desc
	logger            *slog.Logger
}

func init() {
	registerCollector("corosync", func(logger *slog.Logger) Collector {
		return newCorosyncCollector(logger)
	})
}

func newCorosyncCollector(logger *slog.Logger) *corosyncCollector {
	return &corosyncCollector{
		quorateDesc: prometheus.NewDesc(
			prometheus.BuildFQName(namespace, "cluster", "quorate"),
			"Whether the cluster is quorate.",
			nil, nil,
		),
		nodesTotalDesc: prometheus.NewDesc(
			prometheus.BuildFQName(namespace, "cluster", "nodes_total"),
			"Total number of nodes in the cluster.",
			nil, nil,
		),
		expectedVotesDesc: prometheus.NewDesc(
			prometheus.BuildFQName(namespace, "cluster", "expected_votes"),
			"Total expected votes in the cluster.",
			nil, nil,
		),
		nodeOnlineDesc: prometheus.NewDesc(
			prometheus.BuildFQName(namespace, "", "node_online"),
			"Whether a cluster node is online.",
			[]string{"name", "nodeid"}, nil,
		),
		logger: logger,
	}
}

func (c *corosyncCollector) Update(client *Client, ch chan<- prometheus.Metric) error {
	// Fetch cluster status for quorate and node online state
	statusBody, err := client.Get("/cluster/status")
	if err != nil {
		return fmt.Errorf("fetching cluster status: %w", err)
	}

	var statusResp struct {
		Data []struct {
			Type    string `json:"type"`
			Name    string `json:"name"`
			NodeID  int    `json:"nodeid"`
			Online  int    `json:"online"`
			Quorate int    `json:"quorate"`
			Nodes   int    `json:"nodes"`
		} `json:"data"`
	}
	if err := json.Unmarshal(statusBody, &statusResp); err != nil {
		return fmt.Errorf("parsing cluster status: %w", err)
	}

	for _, entry := range statusResp.Data {
		switch entry.Type {
		case "cluster":
			ch <- prometheus.MustNewConstMetric(c.quorateDesc, prometheus.GaugeValue, float64(entry.Quorate))
			ch <- prometheus.MustNewConstMetric(c.nodesTotalDesc, prometheus.GaugeValue, float64(entry.Nodes))
		case "node":
			ch <- prometheus.MustNewConstMetric(c.nodeOnlineDesc, prometheus.GaugeValue,
				float64(entry.Online), entry.Name, strconv.Itoa(entry.NodeID))
		}
	}

	// Fetch config nodes for expected votes
	configBody, err := client.Get("/cluster/config/nodes")
	if err != nil {
		return fmt.Errorf("fetching cluster config nodes: %w", err)
	}

	var configResp struct {
		Data []struct {
			QuorumVotes string `json:"quorum_votes"`
		} `json:"data"`
	}
	if err := json.Unmarshal(configBody, &configResp); err != nil {
		return fmt.Errorf("parsing cluster config nodes: %w", err)
	}

	var totalVotes float64
	for _, node := range configResp.Data {
		votes, _ := strconv.ParseFloat(node.QuorumVotes, 64)
		totalVotes += votes
	}
	ch <- prometheus.MustNewConstMetric(c.expectedVotesDesc, prometheus.GaugeValue, totalVotes)

	return nil
}
  • Step 5: Run tests
cd /home/user/git/pve-exporter && go test ./collector/ -run TestCorosyncCollector -v

Expected: PASS.

  • Step 6: Commit
git add collector/corosync.go collector/corosync_test.go collector/fixtures/cluster_config_nodes.json
git commit -m "feat: add corosync collector (quorate, nodes_total, expected_votes, node_online)"

Task 8: Cluster Resources Collector

The largest collector — 16 metrics across nodes, VMs, containers, and storage.

Files:

  • Create: collector/cluster_resources.go

  • Create: collector/cluster_resources_test.go

  • Create: collector/fixtures/cluster_resources.json

  • Step 1: Create fixture

Create collector/fixtures/cluster_resources.json with a representative subset: 1 node, 2 VMs (1 running, 1 stopped), 1 storage. Include fields for all 16 metrics. This fixture should be a realistic but minimal JSON response. Use data modeled on the live API:

{"data":[
  {"type":"node","id":"node/node01","node":"node01","status":"online","cpu":0.05,"maxcpu":256,"mem":50000000000,"maxmem":1555325325312,"disk":5000000000,"maxdisk":100000000000,"uptime":2081781},
  {"type":"qemu","id":"qemu/100","node":"node01","name":"testvm1","status":"running","vmid":100,"cpu":0.02,"maxcpu":4,"mem":2147483648,"maxmem":4294967296,"disk":0,"maxdisk":34359738368,"netin":1000000,"netout":500000,"diskread":200000000,"diskwrite":100000000,"uptime":86400,"template":0,"tags":"web;prod","hastate":"started","lock":""},
  {"type":"qemu","id":"qemu/101","node":"node01","name":"testvm2","status":"stopped","vmid":101,"cpu":0,"maxcpu":2,"mem":0,"maxmem":2147483648,"disk":0,"maxdisk":17179869184,"netin":0,"netout":0,"diskread":0,"diskwrite":0,"uptime":0,"template":1,"tags":"","hastate":"","lock":"backup"},
  {"type":"storage","id":"storage/node01/local","node":"node01","storage":"local","plugintype":"dir","content":"iso,vztmpl,backup","disk":5000000000,"maxdisk":100000000000,"shared":0,"status":"available"}
]}
  • Step 2: Write the failing test

Create collector/cluster_resources_test.go:

package collector

import (
	"strings"
	"testing"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/testutil"
	"github.com/prometheus/common/promslog"
)

func TestClusterResourcesCollector(t *testing.T) {
	client := newTestClient(t, map[string]string{
		"/cluster/resources": "cluster_resources.json",
	})
	logger := promslog.NewNopLogger()
	c := newClusterResourcesCollector(logger)

	reg := prometheus.NewRegistry()
	adapter := &testCollectorAdapter{client: client, collector: c}
	reg.MustRegister(adapter)

	// Test CPU metrics
	if err := testutil.GatherAndCompare(reg, strings.NewReader(`
		# HELP pve_cpu_usage_ratio CPU utilization.
		# TYPE pve_cpu_usage_ratio gauge
		pve_cpu_usage_ratio{id="node/node01"} 0.05
		pve_cpu_usage_ratio{id="qemu/100"} 0.02
		pve_cpu_usage_ratio{id="qemu/101"} 0
		# HELP pve_cpu_usage_limit Number of available CPUs.
		# TYPE pve_cpu_usage_limit gauge
		pve_cpu_usage_limit{id="node/node01"} 256
		pve_cpu_usage_limit{id="qemu/100"} 4
		pve_cpu_usage_limit{id="qemu/101"} 2
	`), "pve_cpu_usage_ratio", "pve_cpu_usage_limit"); err != nil {
		t.Error(err)
	}

	// Test guest info
	if err := testutil.GatherAndCompare(reg, strings.NewReader(`
		# HELP pve_guest_info VM/CT info.
		# TYPE pve_guest_info gauge
		pve_guest_info{id="qemu/100",name="testvm1",node="node01",tags="web;prod",template="0",type="qemu"} 1
		pve_guest_info{id="qemu/101",name="testvm2",node="node01",tags="",template="1",type="qemu"} 1
	`), "pve_guest_info"); err != nil {
		t.Error(err)
	}

	// Test storage info
	if err := testutil.GatherAndCompare(reg, strings.NewReader(`
		# HELP pve_storage_info Storage info.
		# TYPE pve_storage_info gauge
		pve_storage_info{content="iso,vztmpl,backup",id="storage/node01/local",node="node01",plugintype="dir",storage="local"} 1
	`), "pve_storage_info"); err != nil {
		t.Error(err)
	}

	// Test HA state (only for VMs with hastate set)
	if err := testutil.GatherAndCompare(reg, strings.NewReader(`
		# HELP pve_ha_state HA service status.
		# TYPE pve_ha_state gauge
		pve_ha_state{id="qemu/100",state="started"} 1
	`), "pve_ha_state"); err != nil {
		t.Error(err)
	}

	// Test lock state (only for VMs with lock set)
	if err := testutil.GatherAndCompare(reg, strings.NewReader(`
		# HELP pve_lock_state Guest config lock state.
		# TYPE pve_lock_state gauge
		pve_lock_state{id="qemu/101",state="backup"} 1
	`), "pve_lock_state"); err != nil {
		t.Error(err)
	}

	// Test pve_up includes VM status
	if err := testutil.GatherAndCompare(reg, strings.NewReader(`
		# HELP pve_up Node/VM/CT-Status is online/running.
		# TYPE pve_up gauge
		pve_up{id="node/node01"} 1
		pve_up{id="qemu/100"} 1
		pve_up{id="qemu/101"} 0
	`), "pve_up"); err != nil {
		t.Error(err)
	}

	// Test storage shared
	if err := testutil.GatherAndCompare(reg, strings.NewReader(`
		# HELP pve_storage_shared Whether or not the storage is shared among cluster nodes.
		# TYPE pve_storage_shared gauge
		pve_storage_shared{id="storage/node01/local"} 0
	`), "pve_storage_shared"); err != nil {
		t.Error(err)
	}
}
  • Step 3: Run test to verify it fails
cd /home/user/git/pve-exporter && go test ./collector/ -run TestClusterResourcesCollector -v

Expected: Compilation error — newClusterResourcesCollector not defined.

  • Step 4: Write the implementation

Create collector/cluster_resources.go:

package collector

import (
	"encoding/json"
	"fmt"
	"log/slog"
	"strconv"

	"github.com/prometheus/client_golang/prometheus"
)

type clusterResourcesCollector struct {
	upDesc               *prometheus.Desc
	cpuUsageDesc         *prometheus.Desc
	cpuLimitDesc         *prometheus.Desc
	memUsageDesc         *prometheus.Desc
	memSizeDesc          *prometheus.Desc
	diskUsageDesc        *prometheus.Desc
	diskSizeDesc         *prometheus.Desc
	netTransmitDesc      *prometheus.Desc
	netReceiveDesc       *prometheus.Desc
	diskWrittenDesc      *prometheus.Desc
	diskReadDesc         *prometheus.Desc
	uptimeDesc           *prometheus.Desc
	storageSharedDesc    *prometheus.Desc
	guestInfoDesc        *prometheus.Desc
	storageInfoDesc      *prometheus.Desc
	haStateDesc          *prometheus.Desc
	lockStateDesc        *prometheus.Desc
	logger               *slog.Logger
}

func init() {
	registerCollector("cluster_resources", func(logger *slog.Logger) Collector {
		return newClusterResourcesCollector(logger)
	})
}

func newClusterResourcesCollector(logger *slog.Logger) *clusterResourcesCollector {
	return &clusterResourcesCollector{
		upDesc: prometheus.NewDesc(
			prometheus.BuildFQName(namespace, "", "up"),
			"Node/VM/CT-Status is online/running.",
			[]string{"id"}, nil,
		),
		cpuUsageDesc: prometheus.NewDesc(
			prometheus.BuildFQName(namespace, "", "cpu_usage_ratio"),
			"CPU utilization.",
			[]string{"id"}, nil,
		),
		cpuLimitDesc: prometheus.NewDesc(
			prometheus.BuildFQName(namespace, "", "cpu_usage_limit"),
			"Number of available CPUs.",
			[]string{"id"}, nil,
		),
		memUsageDesc: prometheus.NewDesc(
			prometheus.BuildFQName(namespace, "", "memory_usage_bytes"),
			"Used memory in bytes.",
			[]string{"id"}, nil,
		),
		memSizeDesc: prometheus.NewDesc(
			prometheus.BuildFQName(namespace, "", "memory_size_bytes"),
			"Number of available memory in bytes.",
			[]string{"id"}, nil,
		),
		diskUsageDesc: prometheus.NewDesc(
			prometheus.BuildFQName(namespace, "", "disk_usage_bytes"),
			"Used disk space in bytes.",
			[]string{"id"}, nil,
		),
		diskSizeDesc: prometheus.NewDesc(
			prometheus.BuildFQName(namespace, "", "disk_size_bytes"),
			"Storage size in bytes.",
			[]string{"id"}, nil,
		),
		netTransmitDesc: prometheus.NewDesc(
			prometheus.BuildFQName(namespace, "", "network_transmit_bytes_total"),
			"Network bytes transmitted since guest start.",
			[]string{"id"}, nil,
		),
		netReceiveDesc: prometheus.NewDesc(
			prometheus.BuildFQName(namespace, "", "network_receive_bytes_total"),
			"Network bytes received since guest start.",
			[]string{"id"}, nil,
		),
		diskWrittenDesc: prometheus.NewDesc(
			prometheus.BuildFQName(namespace, "", "disk_written_bytes_total"),
			"Disk bytes written since guest start.",
			[]string{"id"}, nil,
		),
		diskReadDesc: prometheus.NewDesc(
			prometheus.BuildFQName(namespace, "", "disk_read_bytes_total"),
			"Disk bytes read since guest start.",
			[]string{"id"}, nil,
		),
		uptimeDesc: prometheus.NewDesc(
			prometheus.BuildFQName(namespace, "", "uptime_seconds"),
			"Uptime in seconds.",
			[]string{"id"}, nil,
		),
		storageSharedDesc: prometheus.NewDesc(
			prometheus.BuildFQName(namespace, "", "storage_shared"),
			"Whether or not the storage is shared among cluster nodes.",
			[]string{"id"}, nil,
		),
		guestInfoDesc: prometheus.NewDesc(
			prometheus.BuildFQName(namespace, "", "guest_info"),
			"VM/CT info.",
			[]string{"id", "node", "name", "type", "template", "tags"}, nil,
		),
		storageInfoDesc: prometheus.NewDesc(
			prometheus.BuildFQName(namespace, "", "storage_info"),
			"Storage info.",
			[]string{"id", "node", "storage", "plugintype", "content"}, nil,
		),
		haStateDesc: prometheus.NewDesc(
			prometheus.BuildFQName(namespace, "", "ha_state"),
			"HA service status.",
			[]string{"id", "state"}, nil,
		),
		lockStateDesc: prometheus.NewDesc(
			prometheus.BuildFQName(namespace, "", "lock_state"),
			"Guest config lock state.",
			[]string{"id", "state"}, nil,
		),
		logger: logger,
	}
}

func (c *clusterResourcesCollector) Update(client *Client, ch chan<- prometheus.Metric) error {
	body, err := client.Get("/cluster/resources")
	if err != nil {
		return fmt.Errorf("fetching cluster resources: %w", err)
	}

	var resp struct {
		Data []struct {
			Type       string  `json:"type"`
			ID         string  `json:"id"`
			Node       string  `json:"node"`
			Name       string  `json:"name"`
			Status     string  `json:"status"`
			CPU        float64 `json:"cpu"`
			MaxCPU     float64 `json:"maxcpu"`
			Mem        float64 `json:"mem"`
			MaxMem     float64 `json:"maxmem"`
			Disk       float64 `json:"disk"`
			MaxDisk    float64 `json:"maxdisk"`
			NetIn      float64 `json:"netin"`
			NetOut     float64 `json:"netout"`
			DiskRead   float64 `json:"diskread"`
			DiskWrite  float64 `json:"diskwrite"`
			Uptime     float64 `json:"uptime"`
			Template   int     `json:"template"`
			Tags       string  `json:"tags"`
			HAState    string  `json:"hastate"`
			Lock       string  `json:"lock"`
			Storage    string  `json:"storage"`
			PluginType string  `json:"plugintype"`
			Content    string  `json:"content"`
			Shared     int     `json:"shared"`
		} `json:"data"`
	}
	if err := json.Unmarshal(body, &resp); err != nil {
		return fmt.Errorf("parsing cluster resources: %w", err)
	}

	for _, r := range resp.Data {
		switch r.Type {
		case "node":
			online := 0.0
			if r.Status == "online" {
				online = 1.0
			}
			ch <- prometheus.MustNewConstMetric(c.upDesc, prometheus.GaugeValue, online, r.ID)
			ch <- prometheus.MustNewConstMetric(c.cpuUsageDesc, prometheus.GaugeValue, r.CPU, r.ID)
			ch <- prometheus.MustNewConstMetric(c.cpuLimitDesc, prometheus.GaugeValue, r.MaxCPU, r.ID)
			ch <- prometheus.MustNewConstMetric(c.memUsageDesc, prometheus.GaugeValue, r.Mem, r.ID)
			ch <- prometheus.MustNewConstMetric(c.memSizeDesc, prometheus.GaugeValue, r.MaxMem, r.ID)
			ch <- prometheus.MustNewConstMetric(c.diskUsageDesc, prometheus.GaugeValue, r.Disk, r.ID)
			ch <- prometheus.MustNewConstMetric(c.diskSizeDesc, prometheus.GaugeValue, r.MaxDisk, r.ID)
			ch <- prometheus.MustNewConstMetric(c.uptimeDesc, prometheus.GaugeValue, r.Uptime, r.ID)

		case "qemu", "lxc":
			online := 0.0
			if r.Status == "running" {
				online = 1.0
			}
			ch <- prometheus.MustNewConstMetric(c.upDesc, prometheus.GaugeValue, online, r.ID)
			ch <- prometheus.MustNewConstMetric(c.cpuUsageDesc, prometheus.GaugeValue, r.CPU, r.ID)
			ch <- prometheus.MustNewConstMetric(c.cpuLimitDesc, prometheus.GaugeValue, r.MaxCPU, r.ID)
			ch <- prometheus.MustNewConstMetric(c.memUsageDesc, prometheus.GaugeValue, r.Mem, r.ID)
			ch <- prometheus.MustNewConstMetric(c.memSizeDesc, prometheus.GaugeValue, r.MaxMem, r.ID)
			ch <- prometheus.MustNewConstMetric(c.diskUsageDesc, prometheus.GaugeValue, r.Disk, r.ID)
			ch <- prometheus.MustNewConstMetric(c.diskSizeDesc, prometheus.GaugeValue, r.MaxDisk, r.ID)
			ch <- prometheus.MustNewConstMetric(c.netTransmitDesc, prometheus.CounterValue, r.NetOut, r.ID)
			ch <- prometheus.MustNewConstMetric(c.netReceiveDesc, prometheus.CounterValue, r.NetIn, r.ID)
			ch <- prometheus.MustNewConstMetric(c.diskWrittenDesc, prometheus.CounterValue, r.DiskWrite, r.ID)
			ch <- prometheus.MustNewConstMetric(c.diskReadDesc, prometheus.CounterValue, r.DiskRead, r.ID)
			ch <- prometheus.MustNewConstMetric(c.uptimeDesc, prometheus.GaugeValue, r.Uptime, r.ID)
			ch <- prometheus.MustNewConstMetric(c.guestInfoDesc, prometheus.GaugeValue, 1,
				r.ID, r.Node, r.Name, r.Type, strconv.Itoa(r.Template), r.Tags)

			if r.HAState != "" {
				ch <- prometheus.MustNewConstMetric(c.haStateDesc, prometheus.GaugeValue, 1, r.ID, r.HAState)
			}
			if r.Lock != "" {
				ch <- prometheus.MustNewConstMetric(c.lockStateDesc, prometheus.GaugeValue, 1, r.ID, r.Lock)
			}

		case "storage":
			ch <- prometheus.MustNewConstMetric(c.diskUsageDesc, prometheus.GaugeValue, r.Disk, r.ID)
			ch <- prometheus.MustNewConstMetric(c.diskSizeDesc, prometheus.GaugeValue, r.MaxDisk, r.ID)
			ch <- prometheus.MustNewConstMetric(c.storageSharedDesc, prometheus.GaugeValue, float64(r.Shared), r.ID)
			ch <- prometheus.MustNewConstMetric(c.storageInfoDesc, prometheus.GaugeValue, 1,
				r.ID, r.Node, r.Storage, r.PluginType, r.Content)
		}
	}

	return nil
}

Important note: Both cluster_status and cluster_resources emit pve_up for nodes, so a scrape would fail with duplicate-metric errors (the same pve_up series collected twice). Fix: remove pve_up from the cluster_status collector; cluster_resources is the canonical source since it covers nodes, VMs, and CTs. Update cluster_status.go to remove the upDesc field and its usage, and update cluster_status_test.go accordingly.

  • Step 5: Fix pve_up duplication — remove from cluster_status

Edit collector/cluster_status.go: remove the upDesc field and all pve_up emissions. Edit collector/cluster_status_test.go: remove the pve_up assertions.
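
After the removal, the type switch in cluster_status.go's Update should end up roughly like this (the Task 6 code minus the upDesc emission):

	switch entry.Type {
	case "node":
		ch <- prometheus.MustNewConstMetric(c.nodeInfoDesc, prometheus.GaugeValue, 1,
			entry.ID, entry.Level, entry.Name, strconv.Itoa(entry.NodeID))
	case "cluster":
		ch <- prometheus.MustNewConstMetric(c.clusterInfoDesc, prometheus.GaugeValue, 1,
			entry.ID, strconv.Itoa(entry.Nodes), strconv.Itoa(entry.Quorate), strconv.Itoa(entry.Version))
	}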

  • Step 6: Run all tests
cd /home/user/git/pve-exporter && go test ./collector/ -v

Expected: All tests PASS.

  • Step 7: Commit
git add collector/cluster_resources.go collector/cluster_resources_test.go collector/fixtures/cluster_resources.json collector/cluster_status.go collector/cluster_status_test.go
git commit -m "feat: add cluster_resources collector (16 metrics: CPU, memory, disk, network, storage, guest info, HA/lock)"

Task 9: Backup Collector

Files:

  • Create: collector/backup.go

  • Create: collector/backup_test.go

  • Create: collector/fixtures/backup_not_backed_up.json

  • Step 1: Create fixture

Create collector/fixtures/backup_not_backed_up.json:

{"data":[{"vmid":100,"name":"pve-backup.freyja.sip.is","type":"qemu"}]}
  • Step 2: Write the failing test

Create collector/backup_test.go:

package collector

import (
	"strings"
	"testing"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/testutil"
	"github.com/prometheus/common/promslog"
)

func TestBackupCollector(t *testing.T) {
	client := newTestClient(t, map[string]string{
		"/cluster/backup-info/not-backed-up": "backup_not_backed_up.json",
	})
	logger := promslog.NewNopLogger()
	c := newBackupCollector(logger)

	reg := prometheus.NewRegistry()
	adapter := &testCollectorAdapter{client: client, collector: c}
	reg.MustRegister(adapter)

	expected := `
		# HELP pve_not_backed_up_info Present if guest is not covered by any backup job.
		# TYPE pve_not_backed_up_info gauge
		pve_not_backed_up_info{id="qemu/100"} 1
		# HELP pve_not_backed_up_total Total number of guests not covered by any backup job.
		# TYPE pve_not_backed_up_total gauge
		pve_not_backed_up_total{id="qemu/100"} 1
	`
	if err := testutil.GatherAndCompare(reg, strings.NewReader(expected),
		"pve_not_backed_up_info", "pve_not_backed_up_total"); err != nil {
		t.Error(err)
	}
}
  • Step 3: Run test to verify it fails
cd /home/user/git/pve-exporter && go test ./collector/ -run TestBackupCollector -v

Expected: Compilation error.

  • Step 4: Write the implementation

Create collector/backup.go:

package collector

import (
	"encoding/json"
	"fmt"
	"log/slog"

	"github.com/prometheus/client_golang/prometheus"
)

type backupCollector struct {
	notBackedUpTotalDesc *prometheus.Desc
	notBackedUpInfoDesc  *prometheus.Desc
	logger               *slog.Logger
}

func init() {
	registerCollector("backup", func(logger *slog.Logger) Collector {
		return newBackupCollector(logger)
	})
}

func newBackupCollector(logger *slog.Logger) *backupCollector {
	return &backupCollector{
		notBackedUpTotalDesc: prometheus.NewDesc(
			prometheus.BuildFQName(namespace, "", "not_backed_up_total"),
			"Total number of guests not covered by any backup job.",
			[]string{"id"}, nil,
		),
		notBackedUpInfoDesc: prometheus.NewDesc(
			prometheus.BuildFQName(namespace, "", "not_backed_up_info"),
			"Present if guest is not covered by any backup job.",
			[]string{"id"}, nil,
		),
		logger: logger,
	}
}

func (c *backupCollector) Update(client *Client, ch chan<- prometheus.Metric) error {
	body, err := client.Get("/cluster/backup-info/not-backed-up")
	if err != nil {
		return fmt.Errorf("fetching backup info: %w", err)
	}

	var resp struct {
		Data []struct {
			VMID int    `json:"vmid"`
			Type string `json:"type"`
		} `json:"data"`
	}
	if err := json.Unmarshal(body, &resp); err != nil {
		return fmt.Errorf("parsing backup info: %w", err)
	}

	for _, vm := range resp.Data {
		id := fmt.Sprintf("%s/%d", vm.Type, vm.VMID)
		ch <- prometheus.MustNewConstMetric(c.notBackedUpTotalDesc, prometheus.GaugeValue, 1, id)
		ch <- prometheus.MustNewConstMetric(c.notBackedUpInfoDesc, prometheus.GaugeValue, 1, id)
	}

	return nil
}
  • Step 5: Run tests
cd /home/user/git/pve-exporter && go test ./collector/ -run TestBackupCollector -v

Expected: PASS.

  • Step 6: Commit
git add collector/backup.go collector/backup_test.go collector/fixtures/backup_not_backed_up.json
git commit -m "feat: add backup collector (pve_not_backed_up_total, pve_not_backed_up_info)"

Task 10: Subscription Collector

Files:

  • Create: collector/subscription.go

  • Create: collector/subscription_test.go

  • Create: collector/fixtures/node_subscription.json

  • Step 1: Create fixture

Create collector/fixtures/node_subscription.json:

{"data":{"status":"active","level":"b","productname":"Proxmox VE Basic Subscription","nextduedate":"2027-02-03","regdate":"2025-02-03","key":"pve2b-test","sockets":2,"checktime":1773896474}}
  • Step 2: Write the failing test

Create collector/subscription_test.go:

package collector

import (
	"strings"
	"testing"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/testutil"
	"github.com/prometheus/common/promslog"
)

func TestSubscriptionCollector(t *testing.T) {
	client := newTestClient(t, map[string]string{
		"/nodes/node01/subscription": "node_subscription.json",
	})
	logger := promslog.NewNopLogger()
	c := newSubscriptionCollector(logger)

	// Manually set nodes since this is a NodeAwareCollector
	c.SetNodes([]string{"node01"})

	reg := prometheus.NewRegistry()
	adapter := &testCollectorAdapter{client: client, collector: c}
	reg.MustRegister(adapter)

	if err := testutil.GatherAndCompare(reg, strings.NewReader(`
		# HELP pve_subscription_info Proxmox VE subscription info.
		# TYPE pve_subscription_info gauge
		pve_subscription_info{id="node/node01",level="b"} 1
		# HELP pve_subscription_status Proxmox VE subscription status.
		# TYPE pve_subscription_status gauge
		pve_subscription_status{id="node/node01",status="active"} 1
		# HELP pve_subscription_next_due_timestamp_seconds Subscription next due date as Unix timestamp.
		# TYPE pve_subscription_next_due_timestamp_seconds gauge
		pve_subscription_next_due_timestamp_seconds{id="node/node01"} 1.8016128e+09
	`), "pve_subscription_info", "pve_subscription_status", "pve_subscription_next_due_timestamp_seconds"); err != nil {
		t.Error(err)
	}
}

Note: 2027-02-03 at midnight UTC as a Unix timestamp is 1801612800, which is what the collector's time.Parse("2006-01-02", ...) produces.
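
A quick way to verify that value (a standalone snippet, not part of the exporter):

package main

import (
	"fmt"
	"time"
)

func main() {
	// Parses to midnight UTC, matching the collector's parsing of nextduedate.
	t, _ := time.Parse("2006-01-02", "2027-02-03")
	fmt.Println(t.Unix()) // 1801612800
}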

  • Step 3: Run test to verify it fails
cd /home/user/git/pve-exporter && go test ./collector/ -run TestSubscriptionCollector -v

Expected: Compilation error.

  • Step 4: Write the implementation

Create collector/subscription.go:

package collector

import (
	"encoding/json"
	"fmt"
	"log/slog"
	"sync"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

type subscriptionCollector struct {
	infoDesc    *prometheus.Desc
	statusDesc  *prometheus.Desc
	nextDueDesc *prometheus.Desc
	logger      *slog.Logger

	mu    sync.Mutex
	nodes []string
}

func init() {
	registerCollector("subscription", func(logger *slog.Logger) Collector {
		return newSubscriptionCollector(logger)
	})
}

func newSubscriptionCollector(logger *slog.Logger) *subscriptionCollector {
	return &subscriptionCollector{
		infoDesc: prometheus.NewDesc(
			prometheus.BuildFQName(namespace, "", "subscription_info"),
			"Proxmox VE subscription info.",
			[]string{"id", "level"}, nil,
		),
		statusDesc: prometheus.NewDesc(
			prometheus.BuildFQName(namespace, "", "subscription_status"),
			"Proxmox VE subscription status.",
			[]string{"id", "status"}, nil,
		),
		nextDueDesc: prometheus.NewDesc(
			prometheus.BuildFQName(namespace, "", "subscription_next_due_timestamp_seconds"),
			"Subscription next due date as Unix timestamp.",
			[]string{"id"}, nil,
		),
		logger: logger,
	}
}

func (c *subscriptionCollector) SetNodes(nodes []string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.nodes = nodes
}

func (c *subscriptionCollector) Update(client *Client, ch chan<- prometheus.Metric) error {
	c.mu.Lock()
	nodes := c.nodes
	c.mu.Unlock()

	if len(nodes) == 0 {
		return nil
	}

	sem := make(chan struct{}, client.MaxConcurrent())
	var wg sync.WaitGroup
	var mu sync.Mutex
	var firstErr error

	for _, node := range nodes {
		wg.Add(1)
		sem <- struct{}{}
		go func(node string) {
			defer wg.Done()
			defer func() { <-sem }()

			body, err := client.Get(fmt.Sprintf("/nodes/%s/subscription", node))
			if err != nil {
				mu.Lock()
				if firstErr == nil {
					firstErr = fmt.Errorf("fetching subscription for %s: %w", node, err)
				}
				mu.Unlock()
				return
			}

			var resp struct {
				Data struct {
					Status      string `json:"status"`
					Level       string `json:"level"`
					NextDueDate string `json:"nextduedate"`
				} `json:"data"`
			}
			if err := json.Unmarshal(body, &resp); err != nil {
				c.logger.Warn("parsing subscription response", "node", node, "err", err)
				return
			}

			id := "node/" + node
			ch <- prometheus.MustNewConstMetric(c.infoDesc, prometheus.GaugeValue, 1, id, resp.Data.Level)
			ch <- prometheus.MustNewConstMetric(c.statusDesc, prometheus.GaugeValue, 1, id, resp.Data.Status)

			if resp.Data.NextDueDate != "" {
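				// time.Parse with a date-only layout parses in UTC, so this exports the due date at 00:00:00 UTC.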
				if t, err := time.Parse("2006-01-02", resp.Data.NextDueDate); err == nil {
					ch <- prometheus.MustNewConstMetric(c.nextDueDesc, prometheus.GaugeValue,
						float64(t.Unix()), id)
				}
			}
		}(node)
	}
	wg.Wait()

	return firstErr
}
  • Step 5: Run tests
cd /home/user/git/pve-exporter && go test ./collector/ -run TestSubscriptionCollector -v

Expected: PASS.

  • Step 6: Commit
git add collector/subscription.go collector/subscription_test.go collector/fixtures/node_subscription.json
git commit -m "feat: add subscription collector (info, status, next_due_timestamp)"

Task 11: Node Config Collector

Files:

  • Create: collector/node_config.go

  • Create: collector/node_config_test.go

  • Create: collector/fixtures/node_qemu.json

  • Create: collector/fixtures/node_qemu_config_100.json

  • Create: collector/fixtures/node_qemu_config_101.json

  • Create: collector/fixtures/node_lxc.json

  • Step 1: Create fixtures

Create collector/fixtures/node_qemu.json:

{"data":[{"vmid":100,"name":"testvm1","status":"running"},{"vmid":101,"name":"testvm2","status":"stopped"}]}

Create collector/fixtures/node_qemu_config_100.json:

{"data":{"onboot":1,"name":"testvm1","memory":4096}}

Create collector/fixtures/node_qemu_config_101.json:

{"data":{"onboot":0,"name":"testvm2","memory":2048}}

Create collector/fixtures/node_lxc.json:

{"data":[]}
  • Step 2: Write the failing test

Create collector/node_config_test.go:

package collector

import (
	"strings"
	"testing"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/testutil"
	"github.com/prometheus/common/promslog"
)

func TestNodeConfigCollector(t *testing.T) {
	client := newTestClient(t, map[string]string{
		"/nodes/node01/qemu":            "node_qemu.json",
		"/nodes/node01/qemu/100/config": "node_qemu_config_100.json",
		"/nodes/node01/qemu/101/config": "node_qemu_config_101.json",
		"/nodes/node01/lxc":             "node_lxc.json",
	})
	logger := promslog.NewNopLogger()
	c := newNodeConfigCollector(logger)
	c.SetNodes([]string{"node01"})

	reg := prometheus.NewRegistry()
	adapter := &testCollectorAdapter{client: client, collector: c}
	reg.MustRegister(adapter)

	if err := testutil.GatherAndCompare(reg, strings.NewReader(`
		# HELP pve_onboot_status Proxmox VM/CT onboot config value.
		# TYPE pve_onboot_status gauge
		pve_onboot_status{id="qemu/100",node="node01",type="qemu"} 1
		pve_onboot_status{id="qemu/101",node="node01",type="qemu"} 0
	`), "pve_onboot_status"); err != nil {
		t.Error(err)
	}
}
  • Step 3: Run test to verify it fails
cd /home/user/git/pve-exporter && go test ./collector/ -run TestNodeConfigCollector -v

Expected: Compilation error.

  • Step 4: Write the implementation

Create collector/node_config.go:

package collector

import (
	"encoding/json"
	"fmt"
	"log/slog"
	"sync"

	"github.com/prometheus/client_golang/prometheus"
)

type nodeConfigCollector struct {
	onbootDesc *prometheus.Desc
	logger     *slog.Logger

	mu    sync.Mutex
	nodes []string
}

func init() {
	registerCollector("node_config", func(logger *slog.Logger) Collector {
		return newNodeConfigCollector(logger)
	})
}

func newNodeConfigCollector(logger *slog.Logger) *nodeConfigCollector {
	return &nodeConfigCollector{
		onbootDesc: prometheus.NewDesc(
			prometheus.BuildFQName(namespace, "", "onboot_status"),
			"Proxmox VM/CT onboot config value.",
			[]string{"id", "node", "type"}, nil,
		),
		logger: logger,
	}
}

func (c *nodeConfigCollector) SetNodes(nodes []string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.nodes = nodes
}

func (c *nodeConfigCollector) Update(client *Client, ch chan<- prometheus.Metric) error {
	c.mu.Lock()
	nodes := c.nodes
	c.mu.Unlock()

	if len(nodes) == 0 {
		return nil
	}

	sem := make(chan struct{}, client.MaxConcurrent())
	var wg sync.WaitGroup
	var mu sync.Mutex
	var firstErr error

	for _, node := range nodes {
		wg.Add(1)
		sem <- struct{}{}
		go func(node string) {
			defer wg.Done()
			defer func() { <-sem }()

			if err := c.collectGuestConfigs(client, ch, node, "qemu"); err != nil {
				mu.Lock()
				if firstErr == nil {
					firstErr = err
				}
				mu.Unlock()
			}
			if err := c.collectGuestConfigs(client, ch, node, "lxc"); err != nil {
				mu.Lock()
				if firstErr == nil {
					firstErr = err
				}
				mu.Unlock()
			}
		}(node)
	}
	wg.Wait()

	return firstErr
}

func (c *nodeConfigCollector) collectGuestConfigs(client *Client, ch chan<- prometheus.Metric, node, guestType string) error {
	// List guests
	body, err := client.Get(fmt.Sprintf("/nodes/%s/%s", node, guestType))
	if err != nil {
		return fmt.Errorf("listing %s on %s: %w", guestType, node, err)
	}

	var listResp struct {
		Data []struct {
			VMID int `json:"vmid"`
		} `json:"data"`
	}
	if err := json.Unmarshal(body, &listResp); err != nil {
		return fmt.Errorf("parsing %s list for %s: %w", guestType, node, err)
	}

	// Fetch config for each guest
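	// This semaphore is local to each collectGuestConfigs call and sits inside the
	// per-node fan-out above, so total in-flight requests can briefly exceed MaxConcurrent.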
	sem := make(chan struct{}, client.MaxConcurrent())
	var wg sync.WaitGroup

	for _, guest := range listResp.Data {
		wg.Add(1)
		sem <- struct{}{}
		go func(vmid int) {
			defer wg.Done()
			defer func() { <-sem }()

			configBody, err := client.Get(fmt.Sprintf("/nodes/%s/%s/%d/config", node, guestType, vmid))
			if err != nil {
				c.logger.Warn("fetching config", "node", node, "type", guestType, "vmid", vmid, "err", err)
				return
			}

			var configResp struct {
				Data struct {
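					// onboot may be absent from the config response; a pointer lets a
					// missing key default to 0 while still accepting an explicit 0.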
					Onboot *int `json:"onboot"`
				} `json:"data"`
			}
			if err := json.Unmarshal(configBody, &configResp); err != nil {
				c.logger.Warn("parsing config", "node", node, "type", guestType, "vmid", vmid, "err", err)
				return
			}

			onboot := 0.0
			if configResp.Data.Onboot != nil {
				onboot = float64(*configResp.Data.Onboot)
			}

			id := fmt.Sprintf("%s/%d", guestType, vmid)
			ch <- prometheus.MustNewConstMetric(c.onbootDesc, prometheus.GaugeValue, onboot, id, node, guestType)
		}(guest.VMID)
	}
	wg.Wait()

	return nil
}
  • Step 5: Run tests
cd /home/user/git/pve-exporter && go test ./collector/ -run TestNodeConfigCollector -v

Expected: PASS.

  • Step 6: Commit
git add collector/node_config.go collector/node_config_test.go collector/fixtures/node_qemu.json collector/fixtures/node_qemu_config_100.json collector/fixtures/node_qemu_config_101.json collector/fixtures/node_lxc.json
git commit -m "feat: add node_config collector (pve_onboot_status)"

Task 12: Replication Collector

Files:

  • Create: collector/replication.go

  • Create: collector/replication_test.go

  • Create: collector/fixtures/node_replication.json

  • Step 1: Create fixture

Create collector/fixtures/node_replication.json (the live cluster has no replication jobs, so this is a synthetic single-job fixture that exercises the parsing path):

{"data":[{"id":"100-0","type":"local","source":"node01","target":"node02","guest":100,"duration":5.2,"last_sync":1710000000,"last_try":1710000060,"next_sync":1710003600,"fail_count":0}]}
  • Step 2: Write the failing test

Create collector/replication_test.go:

package collector

import (
	"strings"
	"testing"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/testutil"
	"github.com/prometheus/common/promslog"
)

func TestReplicationCollector(t *testing.T) {
	client := newTestClient(t, map[string]string{
		"/nodes/node01/replication": "node_replication.json",
	})
	logger := promslog.NewNopLogger()
	c := newReplicationCollector(logger)
	c.SetNodes([]string{"node01"})

	reg := prometheus.NewRegistry()
	adapter := &testCollectorAdapter{client: client, collector: c}
	reg.MustRegister(adapter)

	if err := testutil.GatherAndCompare(reg, strings.NewReader(`
		# HELP pve_replication_info Proxmox VM replication info.
		# TYPE pve_replication_info gauge
		pve_replication_info{guest="100",id="100-0",source="node01",target="node02",type="local"} 1
		# HELP pve_replication_duration_seconds Proxmox VM replication duration.
		# TYPE pve_replication_duration_seconds gauge
		pve_replication_duration_seconds{id="100-0"} 5.2
		# HELP pve_replication_last_sync_timestamp_seconds Proxmox VM replication last_sync.
		# TYPE pve_replication_last_sync_timestamp_seconds gauge
		pve_replication_last_sync_timestamp_seconds{id="100-0"} 1.71e+09
		# HELP pve_replication_last_try_timestamp_seconds Proxmox VM replication last_try.
		# TYPE pve_replication_last_try_timestamp_seconds gauge
		pve_replication_last_try_timestamp_seconds{id="100-0"} 1.71000006e+09
		# HELP pve_replication_next_sync_timestamp_seconds Proxmox VM replication next_sync.
		# TYPE pve_replication_next_sync_timestamp_seconds gauge
		pve_replication_next_sync_timestamp_seconds{id="100-0"} 1.7100036e+09
		# HELP pve_replication_failed_syncs Proxmox VM replication fail_count.
		# TYPE pve_replication_failed_syncs gauge
		pve_replication_failed_syncs{id="100-0"} 0
	`), "pve_replication_info", "pve_replication_duration_seconds",
		"pve_replication_last_sync_timestamp_seconds", "pve_replication_last_try_timestamp_seconds",
		"pve_replication_next_sync_timestamp_seconds", "pve_replication_failed_syncs"); err != nil {
		t.Error(err)
	}
}
  • Step 3: Run test to verify it fails
cd /home/user/git/pve-exporter && go test ./collector/ -run TestReplicationCollector -v

Expected: Compilation error.

  • Step 4: Write the implementation

Create collector/replication.go:

package collector

import (
	"encoding/json"
	"fmt"
	"log/slog"
	"strconv"
	"sync"

	"github.com/prometheus/client_golang/prometheus"
)

type replicationCollector struct {
	infoDesc      *prometheus.Desc
	durationDesc  *prometheus.Desc
	lastSyncDesc  *prometheus.Desc
	lastTryDesc   *prometheus.Desc
	nextSyncDesc  *prometheus.Desc
	failCountDesc *prometheus.Desc
	logger        *slog.Logger

	mu    sync.Mutex
	nodes []string
}

func init() {
	registerCollector("replication", func(logger *slog.Logger) Collector {
		return newReplicationCollector(logger)
	})
}

func newReplicationCollector(logger *slog.Logger) *replicationCollector {
	return &replicationCollector{
		infoDesc: prometheus.NewDesc(
			prometheus.BuildFQName(namespace, "", "replication_info"),
			"Proxmox VM replication info.",
			[]string{"id", "type", "source", "target", "guest"}, nil,
		),
		durationDesc: prometheus.NewDesc(
			prometheus.BuildFQName(namespace, "", "replication_duration_seconds"),
			"Proxmox VM replication duration.",
			[]string{"id"}, nil,
		),
		lastSyncDesc: prometheus.NewDesc(
			prometheus.BuildFQName(namespace, "", "replication_last_sync_timestamp_seconds"),
			"Proxmox VM replication last_sync.",
			[]string{"id"}, nil,
		),
		lastTryDesc: prometheus.NewDesc(
			prometheus.BuildFQName(namespace, "", "replication_last_try_timestamp_seconds"),
			"Proxmox VM replication last_try.",
			[]string{"id"}, nil,
		),
		nextSyncDesc: prometheus.NewDesc(
			prometheus.BuildFQName(namespace, "", "replication_next_sync_timestamp_seconds"),
			"Proxmox VM replication next_sync.",
			[]string{"id"}, nil,
		),
		failCountDesc: prometheus.NewDesc(
			prometheus.BuildFQName(namespace, "", "replication_failed_syncs"),
			"Proxmox VM replication fail_count.",
			[]string{"id"}, nil,
		),
		logger: logger,
	}
}

func (c *replicationCollector) SetNodes(nodes []string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.nodes = nodes
}

func (c *replicationCollector) Update(client *Client, ch chan<- prometheus.Metric) error {
	c.mu.Lock()
	nodes := c.nodes
	c.mu.Unlock()

	if len(nodes) == 0 {
		return nil
	}

	sem := make(chan struct{}, client.MaxConcurrent())
	var wg sync.WaitGroup
	var mu sync.Mutex
	var firstErr error

	for _, node := range nodes {
		wg.Add(1)
		sem <- struct{}{}
		go func(node string) {
			defer wg.Done()
			defer func() { <-sem }()

			body, err := client.Get(fmt.Sprintf("/nodes/%s/replication", node))
			if err != nil {
				mu.Lock()
				if firstErr == nil {
					firstErr = fmt.Errorf("fetching replication for %s: %w", node, err)
				}
				mu.Unlock()
				return
			}

			var resp struct {
				Data []struct {
					ID        string  `json:"id"`
					Type      string  `json:"type"`
					Source    string  `json:"source"`
					Target    string  `json:"target"`
					Guest     int     `json:"guest"`
					Duration  float64 `json:"duration"`
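					// last_sync, last_try and next_sync are Unix timestamps in seconds, exported unchanged.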
					LastSync  float64 `json:"last_sync"`
					LastTry   float64 `json:"last_try"`
					NextSync  float64 `json:"next_sync"`
					FailCount float64 `json:"fail_count"`
				} `json:"data"`
			}
			if err := json.Unmarshal(body, &resp); err != nil {
				c.logger.Warn("parsing replication response", "node", node, "err", err)
				return
			}

			for _, r := range resp.Data {
				ch <- prometheus.MustNewConstMetric(c.infoDesc, prometheus.GaugeValue, 1,
					r.ID, r.Type, r.Source, r.Target, strconv.Itoa(r.Guest))
				ch <- prometheus.MustNewConstMetric(c.durationDesc, prometheus.GaugeValue, r.Duration, r.ID)
				ch <- prometheus.MustNewConstMetric(c.lastSyncDesc, prometheus.GaugeValue, r.LastSync, r.ID)
				ch <- prometheus.MustNewConstMetric(c.lastTryDesc, prometheus.GaugeValue, r.LastTry, r.ID)
				ch <- prometheus.MustNewConstMetric(c.nextSyncDesc, prometheus.GaugeValue, r.NextSync, r.ID)
				ch <- prometheus.MustNewConstMetric(c.failCountDesc, prometheus.GaugeValue, r.FailCount, r.ID)
			}
		}(node)
	}
	wg.Wait()

	return firstErr
}
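
Note: the per-node fan-out skeleton (semaphore sized by client.MaxConcurrent(), WaitGroup, first-error capture) is now repeated verbatim in the subscription, node_config, and replication collectors. If that duplication becomes a maintenance burden, it could be factored into a small package-level helper. A minimal sketch, assuming the Client type and MaxConcurrent method from Task 2; the forEachNode name is illustrative and not part of this plan:

package collector

import "sync"

// forEachNode runs fn once per node, bounding concurrency with a semaphore
// sized by client.MaxConcurrent(), and returns the first error produced.
func forEachNode(client *Client, nodes []string, fn func(node string) error) error {
	sem := make(chan struct{}, client.MaxConcurrent())
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		firstErr error
	)

	for _, node := range nodes {
		wg.Add(1)
		sem <- struct{}{}
		go func(node string) {
			defer wg.Done()
			defer func() { <-sem }()
			if err := fn(node); err != nil {
				mu.Lock()
				if firstErr == nil {
					firstErr = err
				}
				mu.Unlock()
			}
		}(node)
	}
	wg.Wait()
	return firstErr
}

Each collector's Update body would then shrink to a single forEachNode call; adopting this would be an optional follow-up refactor, not one of the tasks above.
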
  • Step 5: Run tests
cd /home/user/git/pve-exporter && go test ./collector/ -run TestReplicationCollector -v

Expected: PASS.

  • Step 6: Commit
git add collector/replication.go collector/replication_test.go collector/fixtures/node_replication.json
git commit -m "feat: add replication collector (6 replication metrics)"

Task 13: README with TODO Metrics

Files:

  • Create: README.md

  • Step 1: Write README.md

Create README.md with usage documentation, full metric list, and the TODO section for future metrics. Include:

  • Project description

  • Installation (build from source)

  • Usage (CLI flags, example command)

  • Complete metric table (all implemented metrics with type and labels)

  • TODO section listing all deferred metrics from the spec's "Future Metrics" section:

    • Per-node detailed status (load avg, swap, rootfs, KSM, kernel, boot, CPU model)
    • Per-VM pressure metrics
    • HA detailed status (CRM, LRM, per-service config)
    • Physical disks (SMART, wearout, OSD mapping)
    • SDN/Network (zone status)
  • Step 2: Commit

git add README.md
git commit -m "docs: add README with usage, metrics reference, and future metrics TODO"

Task 14: Integration Test and Final Verification

Files:

  • No new files — uses existing code

  • Step 1: Run all unit tests

cd /home/user/git/pve-exporter && go test ./... -v

Expected: All tests PASS.

  • Step 2: Build static binary
cd /home/user/git/pve-exporter && CGO_ENABLED=0 go build -o pve-exporter .
file pve-exporter

Expected: pve-exporter: ELF 64-bit LSB executable, x86-64, ... statically linked

  • Step 3: End-to-end smoke test against live PVE
cd /home/user/git/pve-exporter
./pve-exporter --pve.host=https://node02.freyja.cloud.sip.is:8006 --pve.tls-insecure --pve.token-file=.apikey &
sleep 2
curl -s http://localhost:9221/metrics > /tmp/pve-metrics.txt

Leave the exporter running; Step 4 reuses it before shutting it down.

Verify key metrics are present:

grep "pve_version_info" /tmp/pve-metrics.txt
grep "pve_cluster_quorate" /tmp/pve-metrics.txt
grep "pve_node_online" /tmp/pve-metrics.txt
grep "pve_cluster_info" /tmp/pve-metrics.txt
grep "pve_cpu_usage_ratio" /tmp/pve-metrics.txt
grep "pve_guest_info" /tmp/pve-metrics.txt
grep "pve_storage_info" /tmp/pve-metrics.txt
grep "pve_subscription_info" /tmp/pve-metrics.txt
grep "pve_not_backed_up" /tmp/pve-metrics.txt
grep "pve_scrape_collector_success" /tmp/pve-metrics.txt

Expected: All metrics present with correct labels and values.

  • Step 4: Verify scrape performance, then stop the exporter
time curl -s http://localhost:9221/metrics > /dev/null
kill %1

Expected: Scrape completes in under 5 seconds.

  • Step 5: Commit any fixes needed

If the integration test reveals issues, fix and commit.