A training technique where human evaluators rank model outputs by quality, and that feedback is used to train a reward model that guides the AI toward better responses. It's what turns a raw pre-trained model (which just predicts the next token) into a helpful, harmless assistant.
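The reward-model step can be sketched with the standard pairwise (Bradley-Terry) preference loss: the model learns to score the human-preferred response above the rejected one. This is a minimal toy, assuming a linear reward over hypothetical feature vectors rather than a real language model.

```python
import math

def reward(w, features):
    """Scalar reward: dot product of weights and response features."""
    return sum(wi * fi for wi, fi in zip(w, features))

def train_step(w, chosen, rejected, lr=0.1):
    """One gradient step on -log sigmoid(r_chosen - r_rejected).
    The gradient is -(1 - sigmoid(margin)) * (chosen - rejected)."""
    margin = reward(w, chosen) - reward(w, rejected)
    sig = 1.0 / (1.0 + math.exp(-margin))
    grad = [-(1.0 - sig) * (c - r) for c, r in zip(chosen, rejected)]
    return [wi - lr * gi for wi, gi in zip(w, grad)]

# Toy preference pair: hypothetical feature vectors, not real model data.
chosen = [1.0, 0.8]     # response the human ranked higher
rejected = [0.2, 0.1]   # response the human ranked lower
w = [0.0, 0.0]
for _ in range(100):
    w = train_step(w, chosen, rejected)

# The trained reward model now scores the preferred response higher;
# that score is what steers the policy during the RL phase.
print(reward(w, chosen) > reward(w, rejected))
```

In the full pipeline, this learned reward replaces the human: the policy is then fine-tuned (typically with PPO) to maximize it.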
Why it matters
RLHF is the key ingredient that made ChatGPT feel different from GPT-3. The base model already contained the relevant knowledge, but RLHF taught it to present that knowledge in a way humans actually find useful. It's also how safety behaviors, such as refusing harmful requests, are reinforced.