Try DeepLM.
Fork a project, spin it up on your cluster, and see the difference. All projects are open source.
Real-time Grafana dashboards and Prometheus metrics for HPC/SLURM GPU clusters. Track job performance, GPU utilization, power consumption, and checkpoint efficiency. Deploys with Docker Compose, with optional NVIDIA Base Command Manager (BCM) integration and Cassandra-backed historical analysis.
Baseline your GPU cluster's real performance in one run. Tests compute throughput (TFLOPS, HBM bandwidth, thermal throttling), interconnect health (NVLink, NVSwitch, PCIe, NUMA), and network scaling (InfiniBand/RDMA, NCCL, AllReduce). Each result is reported as pass or fail against thresholds derived from vendor specs.
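As a rough illustration of what a pass/fail check against a vendor spec can look like, here is a minimal Python sketch. It is not taken from the DeepLM code base: the `Check` class, the `evaluate` function, the metric names, and every number below are hypothetical placeholders, and the rule used (pass when the measured value reaches a chosen fraction of the spec) is an assumption about how such thresholds might be defined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    """One benchmark result compared against a vendor figure (hypothetical model)."""
    name: str
    measured: float       # value observed on the cluster
    vendor_spec: float    # figure published by the vendor, same unit as measured
    min_fraction: float   # fraction of the spec required to pass (assumed rule)

    @property
    def threshold(self) -> float:
        # Lowest measured value that still counts as a pass.
        return self.min_fraction * self.vendor_spec

    @property
    def passed(self) -> bool:
        return self.measured >= self.threshold


def evaluate(checks):
    """Map each check name to True (pass) or False (fail)."""
    return {c.name: c.passed for c in checks}


# Placeholder numbers chosen only to show one pass and one fail.
sample_checks = [
    Check("compute_throughput", measured=92.0, vendor_spec=100.0, min_fraction=0.9),
    Check("allreduce_bandwidth", measured=70.0, vendor_spec=100.0, min_fraction=0.8),
]

results = evaluate(sample_checks)

for check in sample_checks:
    status = "PASS" if check.passed else "FAIL"
    print(f"{check.name}: {status} (measured {check.measured}, needs >= {check.threshold})")
```

Expressing the threshold as a fraction of the spec keeps the tolerance explicit per metric, which matters because sustained results rarely match peak vendor figures.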
Command-line interface for DeepLM. Manage clusters, view dashboards, and trigger optimizations from your terminal.
All repositories are hosted on GitHub under the DeepLM organization.