Try DeepLM.

Fork a project, spin up on your cluster, and see the difference. All projects are open source.

Real-time Grafana dashboards and Prometheus metrics for HPC/SLURM GPU clusters. Track job performance, GPU utilization, power consumption, and checkpoint efficiency. Docker Compose deploy, optional NVIDIA BCM integration, Cassandra-backed historical analysis.

Python—

Baseline

Baseline your GPU cluster's real performance in one run. Tests compute throughput (TFLOPS, HBM bandwidth, thermal throttling), interconnect health (NVLink, NVSwitch, PCIe, NUMA), and network scaling (IB/RDMA, NCCL, AllReduce). Pass/fail thresholds against vendor specs.

Python—

deeplm-cli

Command-line interface for DeepLM. Manage clusters, view dashboards, and trigger optimizations from your terminal.

Rust—

All repositories are hosted on GitHub under the DeepLM organization.