Dissecting Nvidia Blackwell - Tensor Cores, PTX Instructions, SASS, Floorsweep, Yield

Kimbo Chen · SemiAnalysis · March 31, 2026 at 22:00 · 23 min read
Summary
The article presents microbenchmark results for Nvidia's Blackwell GPU architecture. Tensor core MMAs reach near-peak throughput, but small instruction shapes are bottlenecked by SMEM bandwidth, and reaching peak performance requires careful configuration such as 2SM MMA and TMA multicast. The results support Nvidia's performance claims and reinforce its dominance in AI hardware, while identifying specific trade-offs that kernel developers must navigate.
  • Blackwell introduces 2SM MMA (cta_group::2) that allows a CTA pair across two SMs to collaboratively execute MMA, achieving near-perfect weak scaling with up to 2x speedup over 1SM MMA.
  • For MMA with both operands in shared memory (SS), instructions with N<128 are SMEM bandwidth bound at ~128 B/cycle; TMEM mode (TS) avoids this bottleneck and achieves peak throughput.
  • TMA multicast loads can reduce L2 traffic significantly by coalescing requests through the L2 Request Coalescer (LRC); implicit multicast performs on par with explicit for throughput but loses L2 efficiency above 64 KiB in-flight.
  • Distributed shared memory (DSMEM) access via `ld.shared::cluster` achieves lower throughput than local SMEM; peak DSMEM throughput requires `cp.async.bulk` (UBLKCP) for high-volume transfers.
  • Async copy (LDGSTS) saturates at ~6.6 TB/s with 32 KiB in-flight, while TMA can scale to 128 KiB in-flight, making TMA preferable for larger, regular memory loads and async copy for smaller or irregular patterns.
  • The number of SMs per GPC varies due to floorsweeping; kernels using clusters may leave SMs idle if cluster size doesn't evenly divide the GPC's SMs, but Blackwell introduces preferred/fallback cluster sizes to mitigate this.
  • Die-to-die latency in the B200 package is approximately 300 extra cycles, identified via SM-to-L2 latency matrix measurements.
  • The article benchmarks configurations used in FlashInfer attention kernels, showing that combining TMA and async copy is common, with TMA used for page loading in MHA kernels and async copy for dynamic page loading in MLA kernels.
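The SMEM bandwidth bullet above can be checked with a back-of-the-envelope roofline. The sketch below is illustrative only: the ~128 B/cycle figure comes from the summary, while the tile shape (M=128, K=16, 2-byte elements) and the tensor core rate of 4096 multiply-accumulates per cycle per SM are assumptions chosen for the example, not values stated in the article.

```python
# Rough roofline for one MMA instruction: compare tensor core cycles
# against the cycles needed to stream the operands out of shared memory.

SMEM_BYTES_PER_CYCLE = 128      # from the summary: ~128 B/cycle
MACS_PER_CYCLE = 4096           # ASSUMPTION: per-SM tensor core rate
M, K, ELEM_BYTES = 128, 16, 2   # ASSUMPTION: tile shape and element size


def mma_cycles(n, a_in_smem):
    """Return (tensor_core_cycles, smem_cycles) for one M x n x K MMA.

    a_in_smem=True models SS mode (A and B both read from SMEM).
    a_in_smem=False models TS mode (A comes from TMEM, only B from SMEM).
    """
    math_cycles = M * n * K / MACS_PER_CYCLE
    a_bytes = M * K * ELEM_BYTES if a_in_smem else 0
    b_bytes = n * K * ELEM_BYTES
    smem_cycles = (a_bytes + b_bytes) / SMEM_BYTES_PER_CYCLE
    return math_cycles, smem_cycles


def bound(n, a_in_smem):
    """Classify the instruction as 'smem' bound or 'math' bound."""
    math_cycles, smem_cycles = mma_cycles(n, a_in_smem)
    return "smem" if smem_cycles > math_cycles else "math"


if __name__ == "__main__":
    for n in (32, 64, 128, 256):
        print(n, "SS:", bound(n, True), "TS:", bound(n, False))
```

Under these assumed numbers, SS mode is SMEM bound for N below 128 and breaks even at N=128, while TS mode stays math bound for every N, which is consistent with the behavior the summary describes.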
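The floorsweeping bullet above comes down to remainder arithmetic. The sketch below is a simplified model, not the actual hardware scheduler: it assumes a cluster must fit entirely inside one GPC, and that a fallback cluster size is simply packed into whatever SMs the preferred size leaves over. The function name `idle_sms` and the per-GPC SM counts are hypothetical.

```python
def idle_sms(sms_in_gpc, cluster_size, fallback_size=None):
    """Count SMs left idle in one GPC under a simplified packing model.

    Clusters of `cluster_size` are packed first; if `fallback_size` is
    given, smaller clusters are then packed into the leftover SMs.
    """
    leftover = sms_in_gpc % cluster_size
    if fallback_size:
        leftover %= fallback_size
    return leftover


if __name__ == "__main__":
    # HYPOTHETICAL per-GPC SM counts after floorsweeping.
    for sms in (18, 19, 20):
        print(sms,
              "cluster=4:", idle_sms(sms, 4),
              "cluster=4, fallback=2:", idle_sms(sms, 4, 2))
```

In this model, a GPC with 18 usable SMs leaves 2 idle at cluster size 4 but none once a fallback size of 2 is allowed, which illustrates why preferred/fallback cluster sizes help.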
Length 23,085 chars
Category finance
Trade Ideas
Kimbo Chen Substack author, SemiAnalysis
The article explicitly thanks Nebius for providing B200 nodes with 'correct hardware counters enabled that makes NCU profiling possible,' indicating that Nebius offers high-quality cloud infrastructure for deep learning research and inference. Risk: Nebius's customer concentration and competitive pricing pressure from hyperscalers could affect margins.
Kimbo Chen Substack author, SemiAnalysis
The microbenchmarking confirms that Blackwell's UMMA (Tensor Core MMA) achieves near-peak throughput for all data formats and CTA groups, validating Nvidia's claimed performance and reinforcing its competitive advantage in AI accelerators. Risk: Performance gains require careful kernel tuning to avoid SMEM bandwidth bottlenecks and to leverage new features like 2SM MMA and TMA multicast; naive usage may underperform.

This newsletter, published March 31, 2026, features Kimbo Chen discussing NBIS, NVDA. 2 trade ideas extracted by AI with direction and confidence scoring.

Speakers: Kimbo Chen  · Tickers: NBIS, NVDA