Open source tools for GPU clusters.
Two tools. One goal: stop guessing about your cluster's real performance. Both free, both open source.
DeepLM Insights
Real-time Grafana dashboards and Prometheus metrics for HPC/SLURM GPU clusters. Track job performance, GPU utilization, power consumption, and checkpoint efficiency across your entire fleet — setup takes minutes.
5 Grafana dashboards included
How it works
- SLURM prologue/epilogue hooks collect per-job metrics at start and end
- Flask metrics API exposes Prometheus-format data on :5000/metrics (see the sketch after this list)
- Cassandra stores job history for long-term trend analysis
- Optional NVIDIA BCM integration for real GPU power data (falls back to TDP estimate)
- Full stack deployable in one command via Docker Compose
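To give a feel for the metrics surface, here is a minimal sketch of a Prometheus-format endpoint like the one Insights serves on :5000/metrics. The metric names and labels are hypothetical stand-ins, not Insights' actual schema:

```python
# Minimal Prometheus-format endpoint on :5000/metrics (hypothetical metric names).
from flask import Flask, Response
from prometheus_client import CONTENT_TYPE_LATEST, Gauge, generate_latest

app = Flask(__name__)

# Stand-in gauges for the per-job GPU stats the SLURM hooks would report.
gpu_util = Gauge("slurm_job_gpu_utilization", "GPU utilization (%)", ["job_id", "gpu"])
gpu_power = Gauge("slurm_job_gpu_power_watts", "GPU power draw (W)", ["job_id", "gpu"])
gpu_util.labels(job_id="12345", gpu="0").set(93.0)    # example values only
gpu_power.labels(job_id="12345", gpu="0").set(412.0)

@app.route("/metrics")
def metrics():
    # Serialize every registered metric in the Prometheus text exposition format.
    return Response(generate_latest(), mimetype=CONTENT_TYPE_LATEST)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

Prometheus then scrapes this endpoint like any other target.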
Stack
Flask, Prometheus, Grafana, Cassandra, Docker Compose, SLURM prologue/epilogue hooks
Get started in 5 minutes.
Requires Docker Compose and a SLURM cluster. Cassandra is included in the compose stack.
```bash
git clone https://github.com/DeepLM/Insights.git
cd Insights
cp .env.example .env
# Edit .env with your Cassandra host, compute nodes, BCM credentials
docker compose up -d
sudo cp slurm_hooks/prologue.sh /etc/slurm/prologue.sh
sudo cp slurm_hooks/epilogue.sh /etc/slurm/epilogue.sh
sudo chmod +x /etc/slurm/prologue.sh /etc/slurm/epilogue.sh
```
Grafana → http://localhost:3000 (admin / changeme)
Prometheus → http://localhost:9090
Metrics API → http://localhost:5000/metrics
Cassandra UI → http://localhost:5002
Prefer to run without Docker? pip install . then run the metrics API directly.
DeepLM Baseline
Multi-stage performance benchmarking for GPU clusters. Run three targeted test suites — compute, interconnect, and network — to establish ground-truth baselines before and after any infrastructure change. Structured JSON output for every stage.
GPU Compute & Memory
How fast are your GPUs — really?
Establish per-GPU compute throughput baselines, detect thermal throttling and vendor-imposed power limits, and validate memory bandwidth. Compare actual performance against vendor specs to find GPUs that aren't pulling their weight.
What you learn
- Actual TFLOPS vs. vendor-published peak (FP16/BF16), as sketched after this list
- Whether power limits are set below spec
- HBM bandwidth — are you hitting ±5% of rated speed?
- Thermal throttle thresholds under sustained load
- Which GPUs in your fleet are underperforming
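As an illustration of the kind of check this stage automates, here is a rough GEMM throughput probe in PyTorch. It is not Baseline's actual test, and PEAK_TFLOPS is a value you supply from your vendor's datasheet:

```python
# A rough GEMM throughput probe: sustained BF16 matmul TFLOPS on one GPU.
import torch

PEAK_TFLOPS = 989.0  # supply your GPU's published dense BF16 peak (example value)

def achieved_tflops(n=8192, iters=50):
    a = torch.randn(n, n, device="cuda", dtype=torch.bfloat16)
    b = torch.randn(n, n, device="cuda", dtype=torch.bfloat16)
    for _ in range(5):          # warm up so clocks ramp and kernels are selected
        a @ b
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        a @ b
    end.record()
    torch.cuda.synchronize()
    seconds = start.elapsed_time(end) / 1000 / iters   # elapsed_time is in ms
    return 2 * n**3 / seconds / 1e12                   # 2*N^3 FLOPs per matmul

t = achieved_tflops()
print(f"{t:.1f} TFLOPS achieved, {100 * t / PEAK_TFLOPS:.0f}% of stated peak")
```

Run it under sustained load (larger iters) to expose thermal throttling rather than burst clocks.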
Tests included
Pass / Fail Thresholds
GPU Interconnect & Intra-Node
Are your GPUs actually talking to each other?
Validate GPU-to-GPU bandwidth within a node. Catch misconfigured NVLink bridges, disabled NVSwitch lanes, PCIe gen mismatches, and NUMA affinity issues that silently kill distributed training performance.
What you learn
- NVLink peer-to-peer bandwidth between all GPU pairs (see the sketch after this list)
- Whether NVSwitch mesh is fully connected or degraded
- PCIe bandwidth matching expected gen (4 vs 5)
- NUMA affinity — is each GPU on the right node?
- Which link pairs are bottlenecking your all-reduce
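For intuition, a crude peer-to-peer check is just a timed device-to-device copy between every GPU pair; pairs that fall well below NVLink-class bandwidth stand out immediately. This sketch is illustrative only, not Baseline's interconnect suite, and assumes PyTorch with two or more GPUs:

```python
# Timed device-to-device copies between every GPU pair; slow pairs stand out.
import time
import torch

def pair_bandwidth(src, dst, size_mb=1024, iters=10):
    x = torch.empty(size_mb * 2**20, dtype=torch.uint8, device=f"cuda:{src}")
    y = torch.empty(size_mb * 2**20, dtype=torch.uint8, device=f"cuda:{dst}")
    y.copy_(x)                        # warm up; also establishes peer access
    torch.cuda.synchronize(src)
    torch.cuda.synchronize(dst)
    t0 = time.perf_counter()
    for _ in range(iters):
        y.copy_(x)
    torch.cuda.synchronize(src)
    torch.cuda.synchronize(dst)
    return size_mb / 1024 * iters / (time.perf_counter() - t0)  # GiB/s

n = torch.cuda.device_count()
for s in range(n):
    for d in range(n):
        if s != d:
            print(f"GPU{s} -> GPU{d}: {pair_bandwidth(s, d):6.1f} GiB/s")
```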
Tests included
Pass / Fail Thresholds
Network & Multi-Node Collectives
Can your cluster actually scale?
Baseline the full network stack — Ethernet throughput, InfiniBand RDMA bandwidth, and multi-node collective communication. This is where most clusters silently lose 30–50% of their theoretical distributed training performance.
What you learn
- Per-NIC Ethernet throughput and asymmetry issues
- InfiniBand port speeds and RDMA bandwidth at line rate
- Whether NCCL is using IB transport or falling back to TCP
- AllReduce bus bandwidth across your actual node count (see the sketch after this list)
- Which node pairs have bad ports, cables, or switches
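The bus-bandwidth figure above follows the standard nccl-tests convention: busbw = 2*(n-1)/n * bytes / time. A minimal torch.distributed version might look like the sketch below (illustrative, not Baseline's implementation; the script name is hypothetical). Launching with NCCL_DEBUG=INFO also logs whether NCCL picked the IB transport or fell back to sockets:

```python
# AllReduce bus-bandwidth probe. Launch with e.g.:
#   NCCL_DEBUG=INFO torchrun --nproc_per_node=8 allreduce_bw.py
import os
import time
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

x = torch.randn(64 * 2**20, device="cuda")   # 256 MiB of fp32
for _ in range(5):                           # warm up communicators and rings
    dist.all_reduce(x)
torch.cuda.synchronize()

iters = 20
t0 = time.perf_counter()
for _ in range(iters):
    dist.all_reduce(x)
torch.cuda.synchronize()
elapsed = (time.perf_counter() - t0) / iters

n = dist.get_world_size()
size = x.numel() * x.element_size()
busbw = 2 * (n - 1) / n * size / elapsed / 1e9   # nccl-tests convention
if dist.get_rank() == 0:
    print(f"AllReduce bus bandwidth: {busbw:.1f} GB/s across {n} ranks")
dist.destroy_process_group()
```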
Tests included
Pass / Fail Thresholds
Coming next
Storage baselining, checkpoint latency, and the full pre-flight harness
Stages 4 and 5 add storage I/O benchmarking (single-node and distributed), checkpoint latency profiling (sync and async), and a unified regression harness that runs as a SLURM prolog or K8s init container.
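As a preview of what checkpoint-latency profiling measures, here is a toy comparison of a blocking torch.save against one moved to a background thread. Everything here (model, paths) is a hypothetical stand-in; stages 4 and 5 are not yet released:

```python
# Toy sync-vs-async checkpoint timing (hypothetical model and paths).
import threading
import time
import torch

model = torch.nn.Linear(8192, 8192)  # stand-in for a real model
# Snapshot the weights first so a background save can't race later updates.
state = {k: v.detach().clone() for k, v in model.state_dict().items()}

t0 = time.perf_counter()
torch.save(state, "/tmp/ckpt_sync.pt")   # synchronous: training blocks here
sync_s = time.perf_counter() - t0

t0 = time.perf_counter()
w = threading.Thread(target=torch.save, args=(state, "/tmp/ckpt_async.pt"))
w.start()   # asynchronous: only the thread launch blocks the training loop
launch_s = time.perf_counter() - t0
w.join()    # a real harness would join at the next checkpoint interval

print(f"sync save {sync_s:.2f}s; async launch blocked {launch_s * 1e3:.2f}ms")
```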
Start with Insights today.
Real-time dashboards for your SLURM cluster. Free forever.
Get Insights on GitHub