Buzzberg Cup Live

Ep. 017 - DeepSeek V4 and Huawei Ascend NPU Performance (InferenceX)

July 01, 2026 at 22:30  |  35:21  |  SemiAnalysis Weekly
Speakers
Bryan Shan — Substack author, SemiAnalysis
Kimbo — Analyst

Summary

The SemiAnalysis podcast breaks down DeepSeek V4's architectural innovations (sparse attention, Mega MoE, 1M context length) and the multi-week engineering grind to achieve day-zero runtime performance on NVIDIA, AMD, and Huawei hardware. The discussion covers kernel-level optimizations, the vLLM versus SGLang competition, and emerging evidence that Huawei's Ascend NPU is becoming a viable AI inference platform.

  • DeepSeek V4 introduces compressed sparse attention and Mega MoE to reduce KV cache needs by ~100x and enable 1M context length.
  • Day-zero support for a new model architecture requires extensive work across inference runtimes, and NVIDIA initially tripped on a new MHC dimension.
  • AMD's inference throughput improved notably through FP4 support, kernel rewrites, and step-by-step optimization compounding over weeks.
  • Huawei's Ascend NPU demonstrated competent DeepSeek V4 inference with open-source CANN code, good documentation, and rapid community-driven optimization.
  • Competition between vLLM and SGLang accelerates feature velocity and innovation, though it also fragments development effort.
  • The team plans to move beyond fixed-length benchmarks to agentic workloads, using real-world Claude Code traces and advanced disaggregation techniques.