Zubnet AI Learning Wiki › Inference
Infrastructure

Inference

The process of running a trained model to produce outputs. Training is learning; inference is using what was learned. Every time you send a prompt to Claude, or generate an image with Stable Diffusion, that's inference. Inference is what costs providers GPU-hours, and it's the part you pay for by the token.

Why It Matters

The cost and speed of inference determine the economics of AI products. Faster inference = lower latency = better experience. Cheaper inference = lower prices = broader adoption. The entire quantization and optimization industry exists to make inference more efficient.

Deep Dive

For large language models, inference happens in two distinct phases, and understanding them explains most of the performance characteristics you'll observe. The first phase is called "prefill" or "prompt processing" — the model reads your entire input prompt and builds up its internal state (the KV cache). This phase is compute-bound and benefits from GPU parallelism because all input tokens can be processed simultaneously. The second phase is "decode" or "generation" — the model produces output tokens one at a time, each one depending on all previous tokens. This phase is memory-bandwidth-bound because the model needs to read its weights from VRAM for each token but does relatively little computation per read. This is why Time to First Token (TTFT) and tokens-per-second are measured separately: they reflect fundamentally different bottlenecks.
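The decode-phase bandwidth bottleneck can be put into numbers with a back-of-envelope estimate: if every generated token requires streaming the full set of weights from VRAM once, tokens-per-second is bounded by memory bandwidth divided by model size. The sketch below uses illustrative figures (an H100-class ~3350 GB/s bandwidth, a 70B-parameter model), not benchmarks:

```python
# Back-of-envelope decode speed estimate: each generated token streams the
# model weights from VRAM roughly once, so single-request decode throughput
# is bounded by memory bandwidth divided by model size in bytes.

def est_decode_tokens_per_sec(params_billions: float,
                              bytes_per_param: float,
                              bandwidth_gb_s: float) -> float:
    """Rough upper bound on single-request decode speed (tokens/second)."""
    model_gb = params_billions * bytes_per_param
    return bandwidth_gb_s / model_gb

# A 70B model in FP16 (2 bytes/param) on ~3350 GB/s of HBM bandwidth:
fp16 = est_decode_tokens_per_sec(70, 2.0, 3350)   # ceiling around 24 tok/s
# The same model quantized to 4-bit (~0.5 bytes/param) lifts the ceiling 4x:
int4 = est_decode_tokens_per_sec(70, 0.5, 3350)   # ceiling around 96 tok/s

print(f"FP16: {fp16:.0f} tok/s, INT4: {int4:.0f} tok/s")
```

This also shows why quantization helps decode speed, not just memory footprint: fewer bytes per parameter means fewer bytes to stream per token.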

Throughput vs. Latency

The economics of inference are dominated by a concept called "throughput vs. latency." If you're serving a chatbot where one user is waiting for a response, you want low latency — get that first token out fast. But if you're running batch processing (summarizing 10,000 documents overnight), you want high throughput — process as many tokens per second as possible, even if each individual request is slower. Inference engines like vLLM and TensorRT-LLM use a technique called "continuous batching" to dynamically group multiple requests together, which dramatically improves throughput. A single H100 might generate 40 tokens/second for one request, but by batching cleverly, the same GPU can serve 20+ concurrent users at acceptable latency because the memory bandwidth is shared more efficiently.
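The scheduling pattern behind continuous batching can be sketched in a few lines, with token generation reduced to a counter. This is an illustration of the idea, not vLLM's or TensorRT-LLM's actual scheduler: the key property is that finished requests free their slot immediately, so waiting requests join mid-flight instead of waiting for the whole batch to drain.

```python
from collections import deque

# Toy continuous-batching loop. Each decode step advances every active
# request by one token; a finished request frees its slot at once, so a
# waiting request can join the batch mid-flight (unlike static batching).

def continuous_batching(requests, max_batch_size):
    """requests: list of (request_id, tokens_to_generate).
    Returns {request_id: decode step at which it finished}."""
    waiting = deque(requests)
    active = {}            # request_id -> tokens still to generate
    finished_at = {}
    step = 0
    while waiting or active:
        # Admit waiting requests into any free slots (no drain-and-refill).
        while waiting and len(active) < max_batch_size:
            rid, n = waiting.popleft()
            active[rid] = n
        step += 1
        for rid in list(active):
            active[rid] -= 1          # one decode step = one token per request
            if active[rid] == 0:
                del active[rid]
                finished_at[rid] = step

    return finished_at

# Three requests, two slots: "c" is admitted the moment short "a" finishes.
print(continuous_batching([("a", 2), ("b", 5), ("c", 3)], max_batch_size=2))
# → {'a': 2, 'b': 5, 'c': 5}
```

With static batching, "c" would have had to wait until both "a" and "b" were done; here it rides along with "b" and finishes at the same step.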

The Serving Landscape

The inference serving landscape has splintered into distinct approaches. Cloud API providers (Anthropic, OpenAI, Google) run massive GPU clusters and sell inference as a service, priced per token. Inference-focused providers like Groq bet on custom hardware — Groq's LPU (Language Processing Unit) is specifically designed for the sequential decode phase and achieves remarkably fast token generation. On the open-source side, llama.cpp brought LLM inference to CPUs and consumer GPUs through aggressive quantization, and tools like Ollama wrapped it in a user-friendly package. For production self-hosting, vLLM with PagedAttention has become the default choice, offering throughput that rivals commercial offerings when tuned correctly.
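For self-hosting, the first question is whether the weights fit in VRAM at a given quantization level. Here's a hedged sanity-check sketch: the 20% working overhead is an assumption, and real footprints also depend on the KV cache, context length, and the serving engine.

```python
# Rough self-hosting memory check: weights at a given quantization level,
# plus an assumed working overhead, must fit in VRAM. The overhead factor
# is a loose assumption, not a measured constant.

QUANT_BYTES = {"fp16": 2.0, "int8": 1.0, "q4": 0.5}   # bytes per parameter

def fits_in_vram(params_billions: float, quant: str,
                 vram_gb: float, overhead: float = 1.2) -> bool:
    weights_gb = params_billions * QUANT_BYTES[quant]
    return weights_gb * overhead <= vram_gb

print(fits_in_vram(70, "fp16", 24))  # False: 140 GB of weights vs a 24 GB card
print(fits_in_vram(70, "q4", 48))    # True: ~35 GB of 4-bit weights fit in 48 GB
```

This arithmetic is why llama.cpp-style 4-bit quantization was such a turning point: it moved large models from multi-GPU servers onto single consumer cards.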

The Cost Reality

A common misconception is that inference is "cheap" compared to training. For a single request, yes — generating a response costs a fraction of a cent. But inference is ongoing. A popular chatbot handles millions of requests per day, indefinitely. OpenAI reportedly spends more on inference than training at this point. This is why inference optimization is such a hot area: speculative decoding (using a small "draft" model to predict what the large model will say), KV cache compression, and prefix caching (reusing computation for shared system prompts) all aim to squeeze more responses out of the same hardware. Every percentage point of efficiency improvement translates directly into millions of dollars saved at scale.
