Name: Ep. 017 - DeepSeek V4 and Huawei Ascend NPU Performance (InferenceX)
Uploaded: 2026-07-01T22:30:06+00:00
Duration: 2121 s

Speakers

Bryan Shan — Substack author, SemiAnalysis

Kimbo — Analyst

Summary

The SemiAnalysis podcast breaks down DeepSeek V4's architectural innovations (sparse attention, Mega MoE, 1M context length) and the multi-week engineering grind to achieve day-zero runtime performance on NVIDIA, AMD, and Huawei hardware. The discussion covers kernel-level optimizations, the vLLM versus SGLang competition, and emerging evidence that Huawei's Ascend NPU is becoming a viable AI inference platform.

DeepSeek V4 introduces compressed sparse attention and Mega MoE to reduce KV cache needs by ~100x and enable 1M context length.
Day-zero support for a new model architecture requires extensive work across inference runtimes, and NVIDIA initially tripped on a new MHC dimension.
AMD's inference throughput improved notably through FP4 support, kernel rewrites, and step-by-step optimization compounding over weeks.
Huawei's Ascend NPU demonstrated competent DeepSeek V4 inference with open-source CANN code, good documentation, and rapid community-driven optimization.
Competition between vLLM and SGLang accelerates feature velocity and innovation, though it also fragments development effort.
The team plans to move beyond fixed-length benchmarks to agentic workloads using real-world Claude Code traces and advanced disaggregation techniques.
