How models are trained from scratch and adapted for specific tasks.
Model training happens in stages. Pre-training teaches the model general language understanding by predicting next tokens on internet-scale data; this costs millions of dollars and requires thousands of GPUs. Fine-tuning then adapts this general model for specific tasks using much smaller, curated datasets. Finally, alignment training (RLHF or DPO) teaches the model to be helpful, honest, and safe.
The rise of parameter-efficient methods like LoRA has democratized fine-tuning: you can now adapt a model with tens of billions of parameters on a single GPU. This has created a vibrant ecosystem of community-created fine-tunes for specific domains, languages, and use cases. Understanding when to fine-tune versus when to simply prompt better is a key skill for AI practitioners.
Pre-training
Massive compute (thousands of GPUs for months) on trillions of tokens. The model learns language structure, world knowledge, and reasoning by predicting the next token. Costs $10M-$100M+ for frontier models.
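As a concrete illustration, here is a minimal PyTorch sketch of that next-token prediction objective; the tiny model and random token IDs are placeholders, not a real pre-training setup.

```python
import torch
import torch.nn.functional as F

# Toy stand-in for a language model: embed tokens, then predict a distribution over the vocabulary.
vocab_size, d_model = 1000, 64
model = torch.nn.Sequential(
    torch.nn.Embedding(vocab_size, d_model),
    torch.nn.Linear(d_model, vocab_size),
)

tokens = torch.randint(0, vocab_size, (1, 128))  # a batch of token IDs from the corpus
logits = model(tokens)                           # shape: (1, 128, vocab_size)

# Next-token prediction: the output at position t is trained to predict token t+1.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),  # predictions for positions 0..126
    tokens[:, 1:].reshape(-1),               # targets: the same sequence shifted left by one
)
loss.backward()  # pre-training repeats this step over trillions of tokens
```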
Supervised Fine-Tuning (SFT)
Training on curated instruction-response pairs to teach the model to follow instructions and produce useful outputs. Thousands to millions of examples, typically taking hours to days on a few GPUs.
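A sketch of how SFT examples are commonly prepared in PyTorch: prompt and response are concatenated, and the prompt positions are masked out of the loss via the -100 ignore_index convention. The token IDs and vocabulary size below are made up for illustration.

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # positions with this label are skipped by cross_entropy

def build_sft_example(prompt_ids, response_ids):
    """Concatenate prompt and response; only response tokens contribute to the loss."""
    input_ids = prompt_ids + response_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    return torch.tensor([input_ids]), torch.tensor([labels])

# Hypothetical token IDs standing in for an instruction and its target response.
input_ids, labels = build_sft_example(prompt_ids=[5, 17, 42, 8], response_ids=[99, 23, 7])

vocab_size = 128
logits = torch.randn(1, input_ids.shape[1], vocab_size, requires_grad=True)  # stand-in for model output

# Shift as in pre-training; the masked prompt labels keep those positions out of the loss.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    labels[:, 1:].reshape(-1),
    ignore_index=IGNORE_INDEX,
)
loss.backward()
```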
RLHF
Reinforcement Learning from Human Feedback: humans compare model outputs, a reward model learns their preferences, then the LLM is optimized to maximize that reward. The technique that made ChatGPT work.
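At the core of the reward-modeling step is a pairwise preference loss: the reward model should score the human-preferred response above the rejected one. A minimal sketch, where the scalar rewards are placeholders for what a real reward model would compute from the prompt and each response:

```python
import torch
import torch.nn.functional as F

# Scalar scores a reward model assigned to two candidate responses for the same prompts.
reward_chosen = torch.tensor([1.3, 0.2], requires_grad=True)     # human-preferred responses
reward_rejected = torch.tensor([0.9, -0.5], requires_grad=True)  # rejected responses

# Bradley-Terry-style preference loss: push the chosen score above the rejected one.
loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
loss.backward()

# Once the reward model is trained, the LLM is optimized (typically with PPO) to maximize
# its reward, with a KL penalty keeping it close to the original model.
```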
DPO
Direct Preference Optimization: a simpler alternative to RLHF that eliminates the separate reward model. Directly optimizes the LLM on preference pairs. Increasingly popular for its stability and simplicity.
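A sketch of the DPO loss itself, using placeholder log-probabilities; in practice these are the summed token log-probs of each response under the policy being trained and under a frozen reference model (usually the SFT checkpoint).

```python
import torch
import torch.nn.functional as F

beta = 0.1  # a common default; controls how far the policy may drift from the reference

# Log-probabilities of chosen / rejected responses under the policy being trained...
policy_chosen = torch.tensor([-12.0], requires_grad=True)
policy_rejected = torch.tensor([-15.0], requires_grad=True)
# ...and under the frozen reference model.
ref_chosen = torch.tensor([-13.0])
ref_rejected = torch.tensor([-14.0])

# DPO loss: grow the margin by which the policy prefers the chosen response,
# measured relative to how much the reference model already preferred it.
margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
loss = -F.logsigmoid(beta * margin).mean()
loss.backward()
```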
LoRA
Low-Rank Adaptation: fine-tuning only small additional matrices (typically a few percent of parameters or less) while keeping the original model frozen. Produces tiny adapter files (10-100MB) that can be swapped in and out.
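The idea in code: the pretrained weight stays frozen, and the update is factored into two small matrices A and B whose product is added to the layer's output. A from-scratch PyTorch sketch; the rank and scaling values are illustrative, not tuned.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear and adds a trainable low-rank update B @ A."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # the pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen path plus scaled low-rank update: W x + scale * B (A x)
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable parameters: {trainable} of {total}")  # only A and B are trained
```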
QLoRA
Quantized LoRA: combines 4-bit model quantization with LoRA adapters, enabling fine-tuning of a 65B-parameter model on a single 48GB GPU (and ~33B models on a 24GB card). A breakthrough for accessible AI customization.
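A common way to set this up is the Hugging Face transformers + peft + bitsandbytes stack. The sketch below assumes those libraries are installed; the model ID and hyperparameters are examples rather than a recommended recipe.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the base model with 4-bit NF4 quantization so large models fit in limited VRAM.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",           # example model ID
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Attach small trainable LoRA adapters on top of the frozen, quantized weights.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections are a common choice
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # typically well under 1% of total parameters
```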
Full Fine-tuning vs PEFT
Full fine-tuning updates all parameters: maximum quality, but it requires multi-GPU setups and risks catastrophic forgetting. Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA are cheaper and preserve base model knowledge.
When to Fine-tune vs Prompt
Fine-tune for: consistent style/format, domain-specific knowledge, specific output structures. Prompt for: flexible tasks, rapid iteration, no training data available. Fine-tuning is a commitment; prompting is an experiment.
Training Cost Spectrum
Pre-training ($10M+) vs. fine-tuning (roughly $100-$10K) vs. prompting (free or cheap). API-based fine-tuning (e.g., OpenAI) costs pennies per training example. Self-hosted fine-tuning requires GPU rental ($1-$5/hr for an A100). A back-of-envelope estimate appears below.
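A quick worked estimate makes the gap concrete; every figure here is illustrative, not a quote.

```python
# Back-of-envelope estimate for a self-hosted LoRA fine-tune; all figures are illustrative.
gpu_hourly_rate = 2.50   # USD/hr for a rented A100 (within the $1-$5/hr range above)
training_hours = 4       # e.g., a 7B model with LoRA on tens of thousands of examples
experiment_runs = 3      # hyperparameter sweeps and reruns add up

total_cost = gpu_hourly_rate * training_hours * experiment_runs
print(f"Estimated GPU cost: ${total_cost:.2f}")  # $30.00, orders of magnitude below pre-training
```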
Evaluation and Iteration
Measuring fine-tuned model quality with held-out test sets, automated metrics (perplexity, BLEU, ROUGE), and human evaluation. Always compare against the base model to quantify improvement.
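Perplexity on a held-out set is the cheapest of these checks. A sketch comparing base and fine-tuned models, assuming the Hugging Face transformers API and placeholder model IDs:

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model_name: str, texts: list[str]) -> float:
    """Average perplexity of a causal LM over held-out texts (lower is better)."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    losses = []
    with torch.no_grad():
        for text in texts:
            enc = tokenizer(text, return_tensors="pt")
            out = model(**enc, labels=enc["input_ids"])  # loss is mean next-token cross-entropy
            losses.append(out.loss.item())
    return math.exp(sum(losses) / len(losses))

held_out = ["Example held-out document one.", "Example held-out document two."]
print("base:      ", perplexity("base-model-id", held_out))          # placeholder model IDs
print("fine-tuned:", perplexity("my-org/my-fine-tuned-model", held_out))
```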
RLHF: Reinforcement Learning from Human Feedback, training models to align with human preferences using a learned reward model.
LoRA: Low-Rank Adaptation, efficient fine-tuning that trains only small additional matrices while keeping the base model frozen.
SFT: Supervised Fine-Tuning, training on curated instruction-response pairs to teach the model to follow instructions.
DPO: Direct Preference Optimization, a simpler alternative to RLHF that directly optimizes on human preference pairs without a separate reward model.