Name: The GPU Power-Performance Curve Most Clusters Ignore | Researcher Conversations at GTC
Uploaded: 2026-06-25T20:00:38+00:00
Duration: 603 s

Speakers

Keval Shah — AI Research Lead at Pebble

Summary

Keval Shah, AI Research Lead at Pebble, explains that GPU clusters follow a non-linear power-to-performance curve: past a certain point, extra power no longer yields a proportional gain in tokens. Pebble dynamically caps power and clock frequency per GPU by reading telemetry from vLLM, SLURM, and NVIDIA exporters, and it learns each workload before enforcing caps so that SLAs are maintained. The company also targets grid-responsive data centers, which could tap into 100 GW of flexible US power by curtailing consumption during peaks. The discussion covers the memory-bound nature of inference, deployment via Kubernetes Helm charts, and continuous optimization across diurnal load patterns.
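The idea of capping power on a non-linear curve can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the saturating `throughput` curve, the 700 W ceiling, the 95% SLA fraction, and the function names are hypothetical and are not Pebble's model or measured data.

```python
import math

MAX_W = 700.0  # assumed uncapped power limit per GPU (illustrative)

def throughput(power_w, peak_tps=1000.0, knee_w=250.0):
    """Hypothetical saturating curve: tokens/s as a function of the power cap."""
    return peak_tps * (1.0 - math.exp(-power_w / knee_w))

def tokens_per_watt(power_w):
    """Efficiency at a given cap under the hypothetical curve."""
    return throughput(power_w) / power_w

def pick_cap(caps_w, sla_fraction=0.95):
    """Lowest candidate cap that keeps throughput within sla_fraction of uncapped."""
    floor = sla_fraction * throughput(MAX_W)
    feasible = [c for c in caps_w if throughput(c) >= floor]
    return min(feasible) if feasible else MAX_W

candidate_caps = list(range(300, 701, 50))
cap = pick_cap(candidate_caps)
print(cap, round(tokens_per_watt(cap), 3), round(tokens_per_watt(MAX_W), 3))
```

Under these made-up numbers the chosen cap sits below the ceiling while throughput stays within the assumed SLA margin, so tokens per watt rises. A real system would fit the curve from telemetry rather than assume its shape.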

Pebble optimizes AI data centers for more tokens per watt by exploiting the non-linear power-performance curve.
Inference workloads are memory-bound: during decode, SMs sit idle while the GPU still draws power, which creates room for savings.
Pebble dynamically caps power and clock frequency per GPU based on workload characteristics and load patterns.
The system installs as a Kubernetes Helm chart, learns the workload for days before applying caps, and avoids SLA violations.
Grid-responsive data centers could access 100 GW of flexible US power by curtailing during peak periods.
Pebble aims to make AI clusters grid-responsive with minimal performance impact.
The interview took place at NVIDIA GTC 2026 with SemiAnalysis.
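The grid-responsive point above can be sketched as a simple curtailment step: when the grid signals a peak, per-GPU caps are scaled toward a cluster power target. This is only an illustration under stated assumptions; the `curtail` function, the proportional scaling rule, and the 300 W floor are hypothetical and do not describe Pebble's actual policy.

```python
def curtail(caps_w, target_total_w, floor_w=300.0):
    """Scale per-GPU power caps toward a cluster-wide target during a grid peak.

    Caps are scaled proportionally and never pushed below floor_w, so the
    result can exceed the target if many GPUs are already near the floor.
    """
    total = sum(caps_w)
    if total <= target_total_w:
        return list(caps_w)  # already under the target, nothing to do
    scale = target_total_w / total
    return [max(floor_w, c * scale) for c in caps_w]

current_caps = [700.0, 700.0, 600.0, 400.0]      # watts, illustrative values
peak_caps = curtail(current_caps, target_total_w=1800.0)
print(peak_caps, sum(peak_caps))
```

Proportional scaling is the simplest choice here; a workload-aware policy would more likely cut memory-bound inference GPUs first, since the summary suggests they lose the least throughput per watt removed.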
