Summary
The SemiAnalysis podcast breaks down DeepSeek V4's architectural innovations (sparse attention, Mega MoE, 1M context length) and the multi-week engineering grind to achieve day-zero runtime performance on NVIDIA, AMD, and Huawei hardware. The discussion covers kernel-level optimizations, the vLLM versus SGLang competition, and emerging evidence that Huawei's Ascend NPU is becoming a viable AI inference platform.
- DeepSeek V4 introduces compressed sparse attention and Mega MoE, cutting KV cache needs by ~100x and enabling a 1M-token context length.
- Day-zero support for a new model architecture requires extensive work across inference runtimes; NVIDIA initially stumbled on a new MHC dimension.
- AMD's inference throughput improved notably through FP4 support, kernel rewrites, and incremental optimizations that compounded over weeks.
- Huawei's Ascend NPU demonstrated competent DeepSeek V4 inference with open-source CANN code, good documentation, and rapid community-driven optimization.
- Competition between vLLM and SGLang accelerates feature velocity and innovation, though it also fragments development effort.
- The team plans to move beyond fixed-length benchmarks to agentic workloads, using real-world Claude Code traces and advanced disaggregation techniques.
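
To give a feel for why a ~100x KV cache reduction matters at 1M tokens, here is a rough back-of-the-envelope sketch. It uses the standard per-token KV cache formula for plain multi-head or grouped-query attention. The layer count, KV head count, and head dimension are hypothetical placeholders, not DeepSeek V4's actual configuration, and the 100x divisor is simply the figure quoted in the summary rather than a model of how compressed sparse attention works.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache for one sequence under conventional attention."""
    # Factor of 2: one key vector and one value vector per token,
    # per layer, per KV head. bytes_per_value=2 assumes FP16/BF16 storage.
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_value


# Hypothetical dimensions chosen for illustration only.
TOKENS = 1_000_000
baseline = kv_cache_bytes(TOKENS, layers=60, kv_heads=8, head_dim=128)

# Apply the ~100x reduction cited in the summary (integer division is exact here).
compressed = baseline // 100

GIB = 1024 ** 3
print(f"baseline:   {baseline / GIB:.1f} GiB per sequence")
print(f"compressed: {compressed / GIB:.2f} GiB per sequence")
```

Under these assumed dimensions, a single 1M-token sequence would need hundreds of gibibytes of cache without compression, which is more than one accelerator's memory; a 100x cut brings it down to a few gibibytes.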