The broad set of techniques used to make AI models faster, smaller, or cheaper to run. This includes training optimizations (mixed precision, gradient checkpointing, data parallelism), inference optimizations (quantization, pruning, distillation, speculative decoding), and serving optimizations (batching, caching, load balancing). Optimization is the reason you can run a 14B-parameter model on a laptop.
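As a concrete instance of one inference optimization named above, here is a minimal sketch of symmetric per-tensor int8 quantization: float32 weights are mapped to 8-bit integers plus a single scale factor, cutting storage 4x at a bounded precision cost. The function names and the use of NumPy are illustrative assumptions, not any particular library's API.

```python
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    # Symmetric per-tensor quantization: one scale maps the float range
    # [-max|w|, +max|w|] onto the int8 range [-127, 127].
    scale = float(np.abs(weights).max()) / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover an approximation of the original weights.
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# int8 uses 1 byte per weight vs 4 for float32.
print(q.nbytes * 4 == w.nbytes)            # 4x smaller
# Rounding error per weight is at most half the quantization step.
print(float(np.abs(w - w_hat).max()) <= scale / 2 + 1e-6)
```

Real quantization schemes (per-channel scales, GPTQ, AWQ, 4-bit formats) are more sophisticated, but the core trade of precision for memory and bandwidth is the same.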
Why it matters
Raw capability means nothing if you can't afford to run it. Optimization is the difference between a research demo and a production product. It's why open-weights models can compete with API providers, why mobile AI exists, and why inference costs keep dropping.