Open source tools for GPU clusters.

Two tools. One goal: stop guessing about your cluster's real performance. Both free, both open source.

Free & Open Source

DeepLM Insights

Real-time Grafana dashboards and Prometheus metrics for HPC/SLURM GPU clusters. Track job performance, GPU utilization, power consumption, and checkpoint efficiency across your entire fleet — setup takes minutes.

5 Grafana dashboards included

  • Job Insights: per-job CPU/GPU utilization, memory, priority, and power consumption
  • System Overview: cluster-wide power, GPU temperature, fan speed, and CPU/memory per node
  • Live Jobs: real-time active-job monitoring with 5-second refresh
  • Historical Jobs: job duration analysis, completion rates, and CPU hours by user
  • Checkpoint Analysis: sync vs. async checkpoint strategy comparison (stall time and overhead)

How it works

  • SLURM prologue/epilogue hooks collect per-job metrics at job start and end (a hook sketch follows this list)
  • Flask metrics API exposes Prometheus-format data on :5000/metrics
  • Cassandra stores job history for long-term trend analysis
  • Optional NVIDIA BCM integration for real GPU power data (falls back to TDP estimate)
  • Full stack deployable in one command via Docker Compose
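
To make the hook mechanism concrete, here is a minimal sketch of what an epilogue-style collector can look like. It is not the shipped slurm_hooks/epilogue.sh: the /job_end endpoint and JSON payload are illustrative assumptions, while SLURM_JOB_ID and SLURM_JOB_USER are standard variables SLURM exports in the epilogue environment.

#!/bin/bash
# Illustrative epilogue-style collector (not the project's slurm_hooks/epilogue.sh).
# The /job_end endpoint and payload shape below are assumptions for illustration.

# Gather per-GPU utilization and power draw, joined onto one line
GPU_STATS=$(nvidia-smi --query-gpu=utilization.gpu,power.draw --format=csv,noheader | paste -sd ';' -)

# Forward the job record to the metrics API host
curl -s -X POST "http://metrics-host:5000/job_end" \
     -H "Content-Type: application/json" \
     -d "{\"job_id\": \"${SLURM_JOB_ID}\", \"user\": \"${SLURM_JOB_USER}\", \"gpu_stats\": \"${GPU_STATS}\"}"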

Stack

SLURM hooks · Flask metrics API · Cassandra · Prometheus · Grafana

Get started in 5 minutes.

Requires Docker Compose and a SLURM cluster. Cassandra is included in the compose stack.

1. Clone
git clone https://github.com/DeepLM/Insights.git
cd Insights
2. Configure
cp .env.example .env
# Edit .env with your Cassandra host, compute nodes, BCM credentials
3. Start the stack
docker compose up -d
4. Install SLURM hooks
sudo cp slurm_hooks/prologue.sh /etc/slurm/prologue.sh
sudo cp slurm_hooks/epilogue.sh /etc/slurm/epilogue.sh
sudo chmod +x /etc/slurm/prologue.sh /etc/slurm/epilogue.sh
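
Depending on how your site configures SLURM, you may also need to point the controller at these scripts and reload its configuration. A minimal sketch, assuming cluster-wide Prolog/Epilog settings in /etc/slurm/slurm.conf:

# Add (or verify) these lines in /etc/slurm/slurm.conf:
#   Prolog=/etc/slurm/prologue.sh
#   Epilog=/etc/slurm/epilogue.sh
# Then reload the controller configuration:
sudo scontrol reconfigure
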
Services
Grafana      → http://localhost:3000  (admin / changeme)
Prometheus   → http://localhost:9090
Metrics API  → http://localhost:5000/metrics
Cassandra UI → http://localhost:5002
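
Once the stack is up, a quick sanity check against the ports above confirms the pipeline is wired together. The commands assume the default ports from the Services list; Prometheus's /-/healthy endpoint is part of stock Prometheus.

# Metrics API should return Prometheus-format text
curl -s http://localhost:5000/metrics | head -n 20
# Prometheus itself should report healthy
curl -s http://localhost:9090/-/healthy
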
A bare-metal install is also supported: run pip install . and start the metrics API directly.
Coming Soon

DeepLM Baseline

Multi-stage performance benchmarking for GPU clusters. Run three targeted test suites — compute, interconnect, and network — to establish ground-truth baselines before and after any infrastructure change. Structured JSON output for every stage.

01

GPU Compute & Memory

How fast are your GPUs — really?

Establish per-GPU compute throughput baselines, detect thermal throttling and vendor-imposed power limits, and validate memory bandwidth. Compare actual performance against vendor specs to find GPUs that aren't pulling their weight.

What you learn

  • Actual TFLOPS vs. vendor-published peak (FP16/BF16)
  • Whether power limits are set below spec
  • HBM bandwidth — are you hitting ±5% of rated speed?
  • Thermal throttle thresholds under sustained load
  • Which GPUs in your fleet are underperforming

Tests included

  • Peak FP16/BF16 TFLOPS: GEMM microbenchmark (cuBLAS / rocBLAS)
  • Vector math saturation: custom CUDA kernel with power monitoring
  • HBM bandwidth: stream benchmark, peak read/write GB/s
  • Thermal ramp test: sustained GEMM, logging temperature and SM clock once per second
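
As an illustration of the data the thermal ramp test depends on, temperature, SM clock, power draw, and throttle state can be sampled once per second with stock nvidia-smi while the sustained GEMM load runs. The query fields below are standard nvidia-smi fields; the output file name is just an example.

# Sample thermal/clock/power/throttle state every second during the sustained load
nvidia-smi --query-gpu=timestamp,index,temperature.gpu,clocks.sm,power.draw,clocks_throttle_reasons.active \
           --format=csv -l 1 > thermal_ramp.csv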

Pass / Fail Thresholds

TFLOPS within ±5% of vendor spec
HBM bandwidth within ±5% of spec
Temperature stabilizes below throttle threshold
No throttle reason codes beyond [Active]

02

GPU Interconnect & Intra-Node

Are your GPUs actually talking to each other?

Validate GPU-to-GPU bandwidth within a node. Catch misconfigured NVLink bridges, disabled NVSwitch lanes, PCIe gen mismatches, and NUMA affinity issues that silently kill distributed training performance.

What you learn

  • NVLink peer-to-peer bandwidth between all GPU pairs
  • Whether NVSwitch mesh is fully connected or degraded
  • PCIe bandwidth matching expected gen (4 vs 5)
  • NUMA affinity — is each GPU on the right node?
  • Which link pairs are bottlenecking your all-reduce

Tests included

  • NVLink p2p bandwidth: p2pBandwidthLatencyTest across all GPU pairs
  • NVLink topology check: nvidia-smi nvlink --status
  • PCIe bandwidth: host-to-device transfer, generation validation
  • NUMA affinity check: nvidia-smi topo -m vs. expected mapping
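
The topology and NUMA checks above map directly onto stock NVIDIA tooling; a minimal sketch follows. p2pBandwidthLatencyTest ships with the CUDA samples, so its path will vary by install.

# Per-GPU NVLink link state and speed
nvidia-smi nvlink --status
# GPU/NIC/CPU topology matrix, including NUMA affinity
nvidia-smi topo -m
# Peer-to-peer bandwidth and latency across all GPU pairs (CUDA samples binary)
./p2pBandwidthLatencyTest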

Pass / Fail Thresholds

NVLink p2p ≥ 450 GB/s per link pair (H100 SXM)
Zero NVLink errors
PCIe matches gen spec (~64 GB/s per direction for PCIe 5.0 x16)
All GPUs on expected NUMA nodes

03

Network & Multi-Node Collectives

Can your cluster actually scale?

Baseline the full network stack — Ethernet throughput, InfiniBand RDMA bandwidth, and multi-node collective communication. This is where most clusters silently lose 30–50% of their theoretical distributed training performance.

What you learn

  • Per-NIC Ethernet throughput and asymmetry issues
  • InfiniBand port speeds and RDMA bandwidth at line rate
  • Whether NCCL is using IB transport or falling back to TCP
  • AllReduce bus bandwidth across your actual node count
  • Which node pairs have bad ports, cables, or switches

Tests included

  • Ethernet throughput: iperf3 single/multi-stream, bidirectional
  • IB/RDMA bandwidth: ib_read_bw, ib_write_bw (perftest suite)
  • NCCL transport validation: all_reduce_perf with NCCL_DEBUG=INFO
  • AllReduce sweep: nccl-tests, 1KB → 8GB message sizes
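
A sketch of how the tools above are typically invoked. Hostnames, the HCA device name, the hostfile, and process counts are placeholders to replace for your fabric; the flags shown are standard for each tool (iperf3's --bidir needs version 3.7 or newer).

# Ethernet: 8 parallel streams, bidirectional, against a peer node
iperf3 -c <peer-host> -P 8 --bidir
# InfiniBand RDMA write bandwidth (perftest suite): start the server side first
ib_write_bw -d mlx5_0                  # on the server node
ib_write_bw -d mlx5_0 <server-host>    # on the client node
# NCCL AllReduce sweep across nodes, with transport logging enabled
NCCL_DEBUG=INFO mpirun -np 16 --hostfile hosts.txt \
    ./all_reduce_perf -b 1K -e 8G -f 2 -g 1 | tee allreduce_sweep.log
# Confirm NCCL chose the IB transport rather than a TCP fallback
grep -E 'NET/(IB|Socket)' allreduce_sweep.log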

Pass / Fail Thresholds

IB link speed matches fabric spec (e.g., NDR 400Gb/s)
RDMA write bandwidth ≥ 90% of line rate
NCCL transport confirmed as IB, not NET/Socket
AllReduce regression < 10% from baseline

Coming next

Storage baselining, checkpoint latency, and the full pre-flight harness

Stages 4 and 5 add storage I/O benchmarking (single-node and distributed), checkpoint latency profiling (sync and async), and a unified regression harness that runs as a SLURM prolog or K8s init container.


Start with Insights today.

Real-time dashboards for your SLURM cluster. Free forever.

Get Insights on GitHub