How Much Do GPU Clusters Really Cost?

Jordan Nanos · SemiAnalysis · April 20, 2026 at 14:21 · 51 min read
Summary
The article argues that GPU cluster total cost of ownership (TCO) extends far beyond per-GPU-hour pricing: reliability, support, storage, networking, and fault tolerance create 5-15% TCO differences even at equal GPU prices. Gold-tier neoclouds (Nebius, Fluidstack, Crusoe) come out ahead of hyperscalers and silver-tier providers on TCO, so buyers who focus solely on headline GPU pricing miss hidden costs that can add 10-60% to their bills.
  • SemiAnalysis defines eight TCO components beyond GPU price: storage, networking, control plane, support, goodput, setup expense, debugging expense, and non-GPU compute.
  • Gold-tier providers (Nebius, Fluidstack, Crusoe) have 5-15% lower TCO than hyperscalers and silver-tier neoclouds when GPU pricing is held equal, due to better support, hot spare pools, and lower setup/debugging costs.
  • In the Large LLM Pretrain scenario (5,184 GB300 GPUs, 3-year term), gold-tier TCO is the 1.00x baseline, hyperscaler TCO is 1.10x, and silver-tier TCO is 1.15x. The main drivers are support costs and EFA tuning time for the hyperscaler, and goodput loss for silver-tier.
  • In the Multimodal RL Research scenario (2,048 B200 GPUs), hyperscaler GPU pricing is 29% higher (50th-percentile versus 25th-percentile pricing), which contributes to a 61% TCO premium over gold-tier. Silver-tier TCO is 15% higher than gold-tier, driven by storage and debugging costs.
  • In the Inference Endpoints scenario (512 H200 GPUs), virtually all of the TCO difference comes from GPU pricing: the hyperscaler is 59% more expensive, while gold and silver tiers are nearly identical at equal GPU price because serving frameworks are fault tolerant.
  • Fault-tolerant training frameworks (TorchFT, AWS HyperPod Checkpointless, Clockwork TorchPass) each have trade-offs in performance, memory, or idle node costs, and are critical for scaling beyond 1k GPUs.
  • Cluster quality varies even among providers that follow NVIDIA's reference architecture; performance differences show up in collective-bound workloads because of interconnect tuning and reliability.
  • SemiAnalysis adds Core42, BitDeer, FPT Smart Cloud, and Ori to its ClusterMAX ratings, with Core42 showing strong support but AMD/Broadcom NIC compatibility issues, and FPT and Ori flagged for security misconfigurations (PKey/SAKey).
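
The way these components combine into a relative TCO figure such as 1.10x can be illustrated with a minimal Python sketch. This is not the article's model: the class ClusterQuote, its field names, and every number in the example are hypothetical. The sketch assumes that support is billed as a fraction of the GPU bill (the summary cites 3-10% for the hyperscaler case), that the remaining components can be lumped into one dollar figure, and that goodput is applied by dividing total cost by the GPU-hours that actually do useful work.

```python
from dataclasses import dataclass

HOURS_PER_YEAR = 8760  # 365 days * 24 hours


@dataclass
class ClusterQuote:
    """Hypothetical inputs for one provider; none of these values come from the article."""
    gpus: int
    years: float
    gpu_hour_price: float  # USD per GPU-hour
    goodput: float         # fraction of paid GPU-hours doing useful work, 0 < goodput <= 1
    support_rate: float    # support fee as a fraction of the GPU bill
    other_costs: float     # lump-sum USD: storage, networking, control plane,
                           # setup, debugging, and non-GPU compute


def paid_gpu_hours(q: ClusterQuote) -> float:
    """GPU-hours billed over the contract term."""
    return q.gpus * q.years * HOURS_PER_YEAR


def total_cost(q: ClusterQuote) -> float:
    """GPU bill plus support (assumed proportional to the bill) plus other costs."""
    gpu_bill = paid_gpu_hours(q) * q.gpu_hour_price
    return gpu_bill * (1.0 + q.support_rate) + q.other_costs


def cost_per_useful_gpu_hour(q: ClusterQuote) -> float:
    """Total cost divided by the GPU-hours that do useful work."""
    return total_cost(q) / (paid_gpu_hours(q) * q.goodput)


def relative_tco(q: ClusterQuote, baseline: ClusterQuote) -> float:
    """Cost per useful GPU-hour as a multiple of the baseline (1.0 means equal)."""
    return cost_per_useful_gpu_hour(q) / cost_per_useful_gpu_hour(baseline)


if __name__ == "__main__":
    # Same headline GPU price; only the non-GPU inputs differ (all values made up).
    baseline = ClusterQuote(gpus=1024, years=1, gpu_hour_price=3.00,
                            goodput=0.97, support_rate=0.0, other_costs=500_000)
    other = ClusterQuote(gpus=1024, years=1, gpu_hour_price=3.00,
                         goodput=0.92, support_rate=0.05, other_costs=900_000)
    print(f"relative TCO: {relative_tco(other, baseline):.2f}x")
```

Dividing by goodput rather than adding a separate line item reflects one possible reading of the summary: lost training time does not appear on the invoice, but it raises the price of every useful GPU-hour. That is how two providers with identical headline prices can still end up several percent apart.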
Length 51,173 chars
Category finance
Trade Ideas
Jordan Nanos Member of Technical Staff at SemiAnalysis
Nebius (NBIS) is explicitly named as a gold-tier provider alongside Fluidstack and Crusoe. The article states that gold-tier providers command a pricing premium because their TCO is 5-15% lower than that of silver-tier providers or hyperscalers at equal GPU pricing, which supports the view that Nebius holds a competitive advantage in the GPU cloud market. Risk: Nebius is a smaller player, and its growth depends on continued capital availability and on scaling Blackwell clusters. The pricing data is from Aug 2025, and market dynamics are changing.
Every cluster scenario in the article uses NVIDIA GPUs (H200, plus the Blackwell-generation B200 and GB300). The article points to massive demand: 'unicorn startups have thousands of GPUs' and 'companies spending over 80% of initial funding on GPUs'. The detailed TCO analysis treats NVIDIA hardware as the standard, reinforcing NVIDIA's dominance in the AI training and inference market despite competitive pressure from custom ASICs. Risk: The article also notes that providers like Core42 use AMD MI300X, and that networking differences (InfiniBand vs EFA) affect performance. Both are potential long-term threats if AMD or custom chips gain traction.
The article uses AWS (AMZN) as the primary hyperscaler example and highlights several disadvantages: higher GPU pricing (50th-75th percentile), poor default storage performance that costs extra to fix, significant setup time for EFA tuning (weeks to months), separate support charges (3-10% of the bill), and orchestration premiums (e.g., SageMaker vs EC2). In the Large LLM Pretrain scenario, these add 10% to TCO versus gold-tier; in the RL research scenario, the premium is 61%, driven by higher GPU pricing. This suggests AWS AI cloud offerings face margin pressure and the risk of customers migrating to neoclouds. Risk: Large enterprises may still prefer AWS for compliance, ecosystem, and long-term contracts, and the article's assumptions (e.g., no fault tolerance code) may not apply to all customers.

This newsletter, published April 20, 2026, features Jordan Nanos discussing NBIS, NVDA, and AMZN. Three trade ideas were extracted by AI, with direction and confidence scoring.

Speakers: Jordan Nanos  · Tickers: NBIS, NVDA, AMZN