The complete pipeline from raw data to a trained model.
The journey from raw data to a working AI model involves a complex pipeline of collection, cleaning, preprocessing, training, and evaluation. Data quality is often more important than model architecture; the AI community's saying "garbage in, garbage out" has never been more relevant. Understanding this pipeline helps you appreciate why some models outperform others and how to build effective fine-tuned models.
The data pipeline is where the real competitive advantage lies. Companies like OpenAI, Anthropic, and Google invest heavily in data curation: not just finding more data, but finding better data. Filtering, deduplication, and synthetic data generation have become entire subdisciplines as the AI community realizes that the quality ceiling of a model is set by its training data.
Data Collection at Scale
Web crawls (Common Crawl, a petabyte-scale archive of web pages), digitized books, GitHub code repositories, scientific papers (arXiv, PubMed), and Wikipedia. The scale is staggering: trillions of tokens from billions of documents.
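A minimal sketch of iterating plain-text records from a Common Crawl WET file, assuming the warcio library and a locally downloaded segment (the file path below is hypothetical):

```python
# Sketch: stream plain-text records from a downloaded Common Crawl WET segment.
# Assumes the warcio package; the file path is hypothetical.
from warcio.archiveiterator import ArchiveIterator

def iter_wet_documents(path):
    """Yield (url, text) pairs from a (gzipped) WET file."""
    with open(path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "conversion":  # WET stores extracted text as "conversion" records
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            text = record.content_stream().read().decode("utf-8", errors="replace")
            yield url, text

for url, text in iter_wet_documents("CC-MAIN-example.warc.wet.gz"):
    print(url, len(text))
    break
```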
Data Cleaning
Removing duplicates, low-quality content, boilerplate HTML, personally identifiable information (PII), and machine-generated spam. Up to 90% of raw crawled data may be discarded during cleaning.
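A minimal sketch of heuristic document cleaning, assuming simple rules (minimum length, alphabetic ratio, email redaction as a stand-in for PII removal); the thresholds are illustrative and real pipelines use far more elaborate filters:

```python
# Sketch: heuristic cleaning filters; thresholds here are illustrative, not tuned.
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def clean_document(text, min_chars=200, min_alpha_ratio=0.6):
    """Return a cleaned document, or None if it should be discarded."""
    text = EMAIL_RE.sub("[EMAIL]", text)          # crude PII redaction
    if len(text) < min_chars:                     # too short to be useful
        return None
    alpha = sum(ch.isalpha() for ch in text) / len(text)
    if alpha < min_alpha_ratio:                   # mostly markup, numbers, or noise
        return None
    return text

docs = ["Buy now!!!", "A longer article about transformer training data. " * 10]
kept = [d for d in map(clean_document, docs) if d is not None]
```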
Deduplication
Exact and near-duplicate removal to prevent memorization and data leakage. MinHash, SimHash, and suffix array techniques identify similar content. Critical for preventing models from memorizing and regurgitating specific texts.
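A self-contained MinHash sketch for near-duplicate detection, estimating Jaccard similarity over word shingles; the shingle size, signature length, and threshold are illustrative:

```python
# Sketch: MinHash signatures over word 5-gram shingles; a high estimated
# Jaccard similarity flags a near-duplicate pair. Parameters are illustrative.
import hashlib

def shingles(text, n=5):
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash_signature(text, num_perm=128):
    sig = []
    for seed in range(num_perm):
        # Seeded hashing simulates an independent permutation per position.
        best = min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(), "big"
            )
            for s in shingles(text)
        )
        sig.append(best)
    return sig

def estimated_jaccard(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash_signature("the quick brown fox jumps over the lazy dog near the river bank")
b = minhash_signature("the quick brown fox jumps over the lazy dog by the river bank")
print(estimated_jaccard(a, b) > 0.8)  # likely near-duplicates
```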
Content Filtering
Removing harmful, toxic, or copyrighted content from training data. Classifier-based filtering, keyword blocklists, and domain-level decisions. Balancing thorough filtering with preserving data diversity is a core challenge.
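A minimal sketch combining a keyword blocklist with a classifier score; the blocklist terms and the `toxicity_score` function are placeholders for a real blocklist and a trained classifier:

```python
# Sketch: two-stage content filter. BLOCKLIST and toxicity_score() are
# placeholders for a real blocklist and a trained toxicity classifier.
BLOCKLIST = {"example-banned-term", "another-banned-term"}  # hypothetical terms

def toxicity_score(text):
    """Placeholder for a trained classifier returning a score in [0, 1]."""
    return 0.0

def passes_content_filter(text, toxicity_threshold=0.5):
    lowered = text.lower()
    if any(term in lowered for term in BLOCKLIST):   # cheap keyword pass first
        return False
    return toxicity_score(text) < toxicity_threshold  # then the classifier

corpus = ["A benign paragraph about gardening.", "Text containing example-banned-term."]
filtered = [doc for doc in corpus if passes_content_filter(doc)]
```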
Tokenization and Preprocessing
Converting raw text into tokens the model can process. BPE (Byte Pair Encoding) and SentencePiece are dominant methods. Vocabulary size (32K to 100K+ tokens) trades embedding-table memory against encoding efficiency: larger vocabularies represent text in fewer tokens. Multilingual tokenizers must balance vocabulary allocation across languages so no single language is split into excessively many tokens.
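A toy sketch of the core BPE training loop (count adjacent symbol pairs, merge the most frequent, repeat); production tokenizers such as SentencePiece or Hugging Face tokenizers operate on bytes and are far more optimized:

```python
# Sketch: toy BPE on a tiny corpus. Each word is a sequence of symbols; each
# step merges the most frequent adjacent pair into a new token.
from collections import Counter

def bpe_merges(words, num_merges=10):
    vocab = Counter(tuple(w) for w in words)   # word -> frequency, as symbol tuples
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        merged_vocab = Counter()
        for symbols, freq in vocab.items():    # re-segment every word with the new merge
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_vocab[tuple(out)] += freq
        vocab = merged_vocab
    return merges

print(bpe_merges(["lower", "lowest", "newer", "newest"], num_merges=5))
```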
Dataset Formats
JSONL (human-readable, one example per line), Parquet (columnar, compressed), Arrow (in-memory, zero-copy). Efficient storage is critical when datasets reach terabytes; the Hugging Face datasets library standardizes access across formats.
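A minimal sketch converting a JSONL file to Parquet with pyarrow, assuming each line is a flat JSON object; the file names are hypothetical:

```python
# Sketch: JSONL -> Parquet conversion with pyarrow. File names are hypothetical.
import json

import pyarrow as pa
import pyarrow.parquet as pq

def jsonl_to_parquet(jsonl_path, parquet_path):
    with open(jsonl_path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    table = pa.Table.from_pylist(records)                  # infer schema from the records
    pq.write_table(table, parquet_path, compression="zstd")

jsonl_to_parquet("corpus.jsonl", "corpus.parquet")
```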
Data Quality vs Quantity
Smaller high-quality datasets can outperform larger noisy ones. Microsoft's Phi models proved this: carefully curated "textbook quality" data produced models that punch well above their weight for their parameter count.
Synthetic Data Generation
Using existing AI models to generate training data for specific capabilities. Self-instruct, Evol-Instruct, and distillation pipelines create millions of instruction-response pairs. Enables training on domains where real data is scarce or expensive.
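A hedged sketch of a self-instruct-style generation loop; `llm_generate` is a hypothetical stand-in for whatever model client you use, and the prompt template is illustrative:

```python
# Sketch: self-instruct-style synthetic data loop. llm_generate() is a
# hypothetical stand-in for a real model call; the prompt is illustrative.
SEED_TASKS = [
    "Explain what tokenization is in one paragraph.",
    "Write a Python function that deduplicates a list while preserving order.",
]

def llm_generate(prompt: str) -> str:
    """Hypothetical call to a language model; replace with a real client."""
    raise NotImplementedError

def generate_pairs(seed_tasks, n_new=100):
    pairs = []
    for i in range(n_new):
        seed = seed_tasks[i % len(seed_tasks)]
        new_instruction = llm_generate(
            f"Write a new task similar in style to, but different from: {seed}"
        )
        response = llm_generate(new_instruction)
        pairs.append({"instruction": new_instruction, "response": response})
    return pairs

# Generated pairs would then be quality-filtered and written out as JSONL, e.g.:
#   import json
#   with open("synthetic.jsonl", "w") as f:
#       for p in generate_pairs(SEED_TASKS):
#           f.write(json.dumps(p) + "\n")
```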
Human Data Annotation
Human labelers creating supervised examples for fine-tuning and RLHF. Annotation quality varies widely; detailed guidelines, multiple annotators per example, and inter-annotator agreement checks are essential.
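A minimal sketch of Cohen's kappa for two annotators assigning categorical labels, one common inter-annotator agreement check; the labels are illustrative:

```python
# Sketch: Cohen's kappa for two annotators over categorical labels.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance, from each annotator's label distribution.
    expected = sum(
        (counts_a[c] / n) * (counts_b[c] / n) for c in set(labels_a) | set(labels_b)
    )
    return (observed - expected) / (1 - expected)

a = ["good", "good", "bad", "good", "bad", "good"]
b = ["good", "bad", "bad", "good", "bad", "good"]
print(round(cohens_kappa(a, b), 2))  # agreement corrected for chance
```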
Open Datasets
The Pile (EleutherAI), RedPajama (Together AI), FineWeb (HuggingFace), and SlimPajama are the open datasets that power open-source models. Understanding their composition helps explain model capabilities and biases.
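A minimal sketch of streaming an open dataset with the Hugging Face datasets library; the dataset ID and the "text" field name are assumptions based on FineWeb's published dataset card:

```python
# Sketch: stream a slice of an open pretraining dataset without downloading it
# all. The dataset ID and "text" field are assumed from FineWeb's dataset card.
from datasets import load_dataset

stream = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)

for i, example in enumerate(stream):
    print(example["text"][:200])
    if i >= 2:
        break
```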
Common Crawl: Massive open web archive containing petabytes of web pages, used as a primary data source for training most LLMs.
Synthetic Data: Training data generated by AI models rather than collected from real sources, enabling training on scarce domains.
Data Deduplication: Removing duplicate or near-duplicate examples using hashing techniques to prevent memorization and improve quality.
BPE Tokenization: Byte Pair Encoding, the dominant method for splitting text into sub-word tokens that models can process.