34 commits

Author SHA1 Message Date
3bad7963af fix: resolve deadlock in node_config collector causing request exhaustion
The outer goroutine per-node acquired a semaphore slot and held it while
collectNode spawned inner goroutines needing slots from the same semaphore.
With maxConc=5 and 5+ nodes, all slots were consumed by outer goroutines,
inner goroutines blocked forever, and Collect() never returned — permanently
consuming an HTTP MaxRequestsInFlight slot until the server stopped responding.

Remove the redundant outer semaphore acquire (inner goroutines already manage
their own slots) and add a 120s HTTP timeout as defense-in-depth.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 11:30:54 +00:00
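The fix described in 3bad7963af can be illustrated with a minimal sketch. It is not the repository's code: `collectAll`, `tasksPerNode`, and the counter are hypothetical names for this example. The point is that only the innermost goroutines take a slot from the shared semaphore, so the outer per-node goroutines can never hold every slot while waiting on their children.

```go
package main

import (
	"fmt"
	"sync"
)

// collectAll fans out over nodes and runs tasksPerNode units of work per node.
// Only the inner goroutines acquire a slot from sem, so the outer goroutines
// cannot starve them, even when len(nodes) >= maxConc.
func collectAll(nodes []string, tasksPerNode, maxConc int) int {
	sem := make(chan struct{}, maxConc)
	var (
		mu    sync.Mutex
		done  int
		outer sync.WaitGroup
	)
	for range nodes {
		outer.Add(1)
		go func() {
			defer outer.Done()
			// No semaphore acquire here: holding a slot while waiting on the
			// inner goroutines is the pattern that deadlocked.
			var inner sync.WaitGroup
			for i := 0; i < tasksPerNode; i++ {
				inner.Add(1)
				go func() {
					defer inner.Done()
					sem <- struct{}{}        // acquire a slot
					defer func() { <-sem }() // release it when done
					mu.Lock()
					done++ // stand-in for one API request
					mu.Unlock()
				}()
			}
			inner.Wait()
		}()
	}
	outer.Wait()
	return done
}

func main() {
	nodes := []string{"n1", "n2", "n3", "n4", "n5", "n6"}
	fmt.Println(collectAll(nodes, 3, 5))
}
```

With six nodes and `maxConc` set to 5, the earlier pattern (outer acquire plus inner acquire on the same channel) would block forever, while this version completes all 18 units of work.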
5e066a5c4b fix: normalize HA service IDs to match cluster_resources format
Convert HA API service IDs (vm:106, ct:200) to the resource ID format
used by /cluster/resources and the Python exporter (qemu/106, lxc/200).
Rename label from "sid" to "id" so HA metrics can be joined with
pve_ha_state, pve_guest_info, and other id-labeled metrics.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 15:12:01 +00:00
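The mapping described in 5e066a5c4b might look roughly like the sketch below. The function name `normalizeHAServiceID` is an assumption for illustration, and passing unrecognized inputs through unchanged is a choice made for this example, not something the commit message states.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeHAServiceID converts an HA service ID such as "vm:106" or "ct:200"
// into the /cluster/resources style "qemu/106" or "lxc/200".
// Inputs that do not match either prefix are returned unchanged.
func normalizeHAServiceID(sid string) string {
	kind, vmid, ok := strings.Cut(sid, ":")
	if !ok {
		return sid
	}
	switch kind {
	case "vm":
		return "qemu/" + vmid
	case "ct":
		return "lxc/" + vmid
	default:
		return sid
	}
}

func main() {
	fmt.Println(normalizeHAServiceID("vm:106")) // qemu/106
	fmt.Println(normalizeHAServiceID("ct:200")) // lxc/200
}
```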
01dbc7cee4 Strip trailing slash from PVE host URLs
A trailing slash in --pve.host (e.g. https://host:8006/) caused API
requests to fail with status 500 due to double slashes in the path.
2026-03-23 11:34:26 +00:00
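A minimal sketch of the kind of normalization 01dbc7cee4 describes, assuming a hypothetical helper `normalizeHost` and an illustrative request path. It shows how a trailing slash on the host produces a double slash once a path is appended, and how trimming avoids it.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeHost removes any trailing slashes so that appending an API path
// beginning with "/" does not yield "//" in the request URL.
func normalizeHost(host string) string {
	return strings.TrimRight(host, "/")
}

func main() {
	const path = "/api2/json/version" // illustrative path only
	raw := "https://host:8006/"
	fmt.Println(raw + path)                // https://host:8006//api2/json/version
	fmt.Println(normalizeHost(raw) + path) // https://host:8006/api2/json/version
}
```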
771c3dc126 docs: update README with new collector metrics, remove TODO section
Add metrics tables for node_status, vm_pressure, ha_status, and
physical_disk collectors. Remove the TODO section as all planned
metrics are now implemented (SDN excluded by design).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 15:35:00 +00:00
a88c696bfd feat: add physical_disk collector (health, wearout, size, OSD mapping)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-20 15:33:46 +00:00
0afa5b0e19 test: add physical_disk collector test and fixture
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-20 15:33:43 +00:00
6244100886 feat: add ha_status collector (CRM master, node/LRM status, service config)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-20 15:30:42 +00:00
16cfba4587 test: add ha_status collector test and fixtures
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-20 15:30:39 +00:00
d458894b0e feat: add vm_pressure collector (PSI cpu/memory/io for QEMU VMs)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-20 15:28:22 +00:00
1e4e3af1d5 test: add vm_pressure collector test and fixture
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-20 15:28:17 +00:00
496a46460c feat: add node_status collector (load, swap, rootfs, ksm, boot mode)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-20 15:23:09 +00:00
2097451d15 test: add node_status collector test and fixture
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-20 15:23:04 +00:00
18bb43394e docs: add implementation plan for remaining collectors
9 tasks covering node_status, vm_pressure, ha_status, and physical_disk
collectors with TDD approach, fixtures, and README update.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 15:20:17 +00:00
e10156323b docs: address spec review feedback
- Split pve_ha_service_info into _config and _status to avoid stale series
- Handle wearout "N/A" and health "UNKNOWN" edge cases for physical disks
- Clarify node label convention and rootfs available vs free naming
- Note QEMU-only scope for VM pressure (LXC lacks PSI in PVE API)
- Add full node_status/lrm_status examples showing all cluster nodes
- Document mutex-guarded nodes pattern and test fixture requirements

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 15:10:22 +00:00
b4ed302009 docs: add design spec for remaining collectors
Covers node status, VM pressure, HA status, and physical disks
collectors with metric definitions, API structures, and scope
exclusions (SDN, kernel version, CPU model).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 15:07:12 +00:00
07a03c0578 Add flake.nix for Nix builds and dev shell
- Package builds with buildGoModule and CGO_ENABLED=0
- Dev shell provides go_latest, gopls, gotools

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 12:44:50 +00:00
2bdb508672 fix: normalize API token format in client
Accept tokens both with and without PVEAPIToken= prefix,
since token files may contain the full Authorization header value.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 11:40:07 +00:00
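The token handling in 2bdb508672 could be sketched as follows. `authHeaderValue` is a hypothetical name, and trimming surrounding whitespace (for example a trailing newline in a token file) is an added assumption rather than something the commit message confirms.

```go
package main

import (
	"fmt"
	"strings"
)

const tokenPrefix = "PVEAPIToken="

// authHeaderValue returns the Authorization header value for a PVE API token,
// accepting input with or without the "PVEAPIToken=" prefix.
func authHeaderValue(token string) string {
	token = strings.TrimSpace(token) // assumption: token files may end in a newline
	token = strings.TrimPrefix(token, tokenPrefix)
	return tokenPrefix + token
}

func main() {
	fmt.Println(authHeaderValue("user@pve!exporter=secret"))
	fmt.Println(authHeaderValue("PVEAPIToken=user@pve!exporter=secret\n"))
}
```

Both calls print the same header value, so the client can treat the two token file styles identically.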
56fe551700 docs: add README with usage, metrics reference, and future metrics TODO
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 11:38:17 +00:00
3bafb67aa0 feat: add replication collector (6 replication metrics)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 11:36:54 +00:00
b59abd59d3 feat: add node_config collector (pve_onboot_status)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 11:36:17 +00:00
7708a64408 feat: add subscription collector (info, status, next_due_timestamp)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 11:35:36 +00:00
5e61f224c4 feat: add backup collector (pve_not_backed_up_total, pve_not_backed_up_info)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 11:33:33 +00:00
a62264edf8 feat: add cluster_resources collector (16 metrics)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 11:33:03 +00:00
2a51e00fe1 feat: add corosync collector (quorate, nodes_total, expected_votes, node_online)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 11:32:00 +00:00
63494d0fcb feat: add cluster_status collector (pve_node_info, pve_cluster_info)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 11:31:31 +00:00
c8ae97d777 feat: add version collector (pve_version_info)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 11:31:06 +00:00
1a13f19b1f feat: add main entry point with CLI flags and HTTP server
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 11:27:53 +00:00
af71e7d729 feat: add collector framework with registry and parallel scrape orchestration
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 11:26:55 +00:00
210e22e030 feat: add PVE API client with multi-host failover
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 11:25:48 +00:00
b8d69f2589 feat: initialize Go module with Prometheus and kingpin dependencies
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 11:24:50 +00:00
b590245a53 Add pve-exporter implementation plan
14 tasks covering: Go module setup, API client, collector framework,
main entry point, and all 8 collectors (version, cluster_status,
corosync, cluster_resources, backup, subscription, node_config,
replication), plus README and integration testing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 11:20:11 +00:00
5196a441ef Add --pve.max-concurrent flag to spec
Configurable bounded concurrency for per-node API fan-out,
defaulting to 5.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 11:02:21 +00:00
154a46f3cf Add pve-exporter design spec
Full design for a Go Prometheus exporter for Proxmox VE, replacing
the Python prometheus-pve-exporter with corosync metrics added.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 10:59:14 +00:00
4aa8a6d579 Add .gitignore
2026-03-19 16:29:23 +00:00