
GPU

Also known as: Graphics Processing Unit
GPUs were originally designed for graphics rendering, and they turned out to be a perfect fit for AI: they can perform thousands of mathematical operations at the same time. Training and running AI models is essentially large-scale matrix multiplication, which is exactly what GPUs excel at. NVIDIA dominates this market.

Why It Matters

GPUs are the physical bottleneck of the entire AI industry. Why models cost what they do, why some providers are faster, why there is a global chip shortage: it all comes back to GPU supply and VRAM.

Deep Dive

The reason GPUs dominate AI isn't raw speed on any single calculation — a CPU actually handles individual operations faster. The advantage is parallelism. A modern CPU has 8-64 cores; an NVIDIA H100 has 16,896 CUDA cores. Neural networks are built on matrix multiplications, where you're doing the same operation on thousands of independent data points simultaneously. That's exactly the workload GPUs were designed for back when their job was calculating the color of millions of pixels every frame. The AI community just happened to notice that the same hardware architecture was perfect for training neural networks, and the modern GPU compute era was born.
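The independence that makes this workload parallel-friendly is easy to see in a plain-Python sketch: every output cell of a matrix product is its own dot product and depends on no other cell, so a GPU can hand thousands of cells to thousands of cores simultaneously. (Illustrative only; real frameworks dispatch to tuned GPU kernels rather than Python loops.)

```python
# Naive matrix multiply: each output cell C[i][j] is an independent
# dot product of row i of A and column j of B. Because no cell depends
# on any other, all cells could be computed in parallel -- this is the
# structure GPUs exploit with their thousands of cores.

def matmul(A, B):
    rows, inner, cols = len(A), len(B), len(B[0])
    return [
        [sum(A[i][k] * B[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]

A = [[1, 2],
     [3, 4]]
B = [[5, 6],
     [7, 8]]
print(matmul(A, B))  # [[19, 22], [43, 50]]
```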

The CUDA Moat

NVIDIA's dominance in AI GPUs isn't just about hardware — it's about CUDA, the software ecosystem they've been building since 2006. CUDA is the programming framework that lets developers write code for NVIDIA GPUs, and virtually every major AI framework (PyTorch, TensorFlow, JAX) is built on top of it. AMD makes competitive hardware with their MI300X (192GB of HBM3 memory), and they've got ROCm as their CUDA alternative, but the ecosystem gap is enormous. Most AI researchers and engineers have spent years writing CUDA code and aren't eager to port it. Google's TPUs (Tensor Processing Units) are the other major player, but those are only available through Google Cloud — you can't buy one.

The Hardware Tiers

The GPU landscape has clear tiers. On the datacenter side, NVIDIA's H100 (80GB HBM3) has been the workhorse of AI training since 2023, with the H200 (141GB HBM3e) offering more memory for larger models. The B200 and GB200 represent the next generation. For inference specifically, the L40S (48GB GDDR6) offers a cheaper alternative when you don't need the raw training throughput. On the consumer side, the RTX 4090 with 24GB of GDDR6X is the king of local AI — enough VRAM to run quantized 14B-parameter models comfortably, though training anything serious on it is impractical. The gap between consumer and datacenter isn't just VRAM — it's memory bandwidth. An H100 pushes over 3 TB/s of memory bandwidth versus the 4090's 1 TB/s, and for large language model inference, memory bandwidth is often the actual bottleneck.
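A back-of-the-envelope calculation shows why bandwidth dominates inference: generating each token requires streaming essentially all model weights through the GPU once, so an upper bound on tokens per second is roughly memory bandwidth divided by model size in bytes. A hedged sketch (the bandwidth figures are rough public specs, and real systems change the picture with batching, KV caches, and overheads):

```python
def max_tokens_per_sec(params_billions, bytes_per_param, bandwidth_tb_s):
    """Rough upper bound on single-stream decode speed: each generated
    token streams all weights once, so throughput is limited by
    memory bandwidth / model footprint in bytes."""
    model_bytes = params_billions * 1e9 * bytes_per_param
    bandwidth_bytes_per_s = bandwidth_tb_s * 1e12
    return bandwidth_bytes_per_s / model_bytes

# A 70B model in fp16 (2 bytes/param) is ~140 GB of weights.
print(round(max_tokens_per_sec(70, 2, 3.35), 1))  # H100-class (~3.35 TB/s): ~23.9
print(round(max_tokens_per_sec(70, 2, 1.0), 1))   # 4090-class (~1 TB/s): ~7.1
```

The same arithmetic explains why quantization helps speed as well as fit: halving bytes per parameter roughly doubles the bandwidth-limited token rate.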

Scaling Beyond One Card

One thing practitioners learn quickly is that "having a GPU" and "having enough GPU" are very different situations. Running inference on a single model is one thing, but training a modern LLM requires multiple GPUs working together, connected by high-speed interconnects like NVLink or InfiniBand. An 8-GPU H100 node (DGX H100) costs around $300,000 and can train a 70B model — but frontier models like GPT-4 or Claude likely required thousands of GPUs for months. This is why cloud GPU rental (from providers like Lambda, DataCrunch, CoreWeave, or the hyperscalers) has become the standard approach: you rent a cluster for your training run and give it back when you're done, rather than buying hardware that will be outdated in two years.
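The rent-versus-buy math behind that standard approach is simple multiplication. A minimal sketch with illustrative numbers (the cluster size, duration, and ~$2.50/GPU-hour rate below are hypothetical; real rates vary widely by provider and contract):

```python
def training_run_cost(num_gpus, hours, price_per_gpu_hour):
    """Total cost of renting a GPU cluster for one training run."""
    return num_gpus * hours * price_per_gpu_hour

# Hypothetical run: 256 H100s for 3 weeks at $2.50 per GPU-hour.
gpus, weeks, rate = 256, 3, 2.50
hours = weeks * 7 * 24  # 504 hours
print(f"${training_run_cost(gpus, hours, rate):,.0f}")  # $322,560
```

At that scale a single run already costs more than an 8-GPU DGX node, but the cluster is returned the moment the run ends instead of depreciating on a rack for two years.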
