Training

Corpus

Also known as: Dataset, Training Data
The body of text (or other data) used to train a model. A corpus can range from a curated collection of books and papers to a massive scrape of the public web. The quality and composition of the corpus fundamentally shape what the model knows and how it behaves.

Why it matters

Garbage in, garbage out. A model trained on Reddit talks differently from one trained on scientific papers. This is why we curated our own corpus for Sarah: generic web crawls produced confused, incoherent results.
