How models are trained from scratch and adapted for specific tasks.
Model training happens in stages. Pre-training teaches the model general language understanding by predicting next tokens on internet-scale data; this costs millions of dollars and requires thousands of GPUs. Fine-tuning then adapts this general model for specific tasks using much smaller, curated datasets. Finally, alignment training (RLHF or DPO) teaches the model to be helpful, honest, and safe.
The rise of parameter-efficient methods like LoRA has democratized fine-tuning: you can now adapt a model with tens of billions of parameters on a single GPU. This has created a vibrant ecosystem of community-created fine-tunes for specific domains, languages, and use cases. Understanding when to fine-tune versus when to simply prompt better is a key skill for AI practitioners.
Pre-training
Massive compute (thousands of GPUs for months) on trillions of tokens. The model learns language structure, world knowledge, and reasoning by predicting the next token. Costs $10M-$100M+ for frontier models.
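As a concrete illustration, here is a minimal PyTorch sketch of that next-token prediction objective; the tiny model and random token IDs are placeholders, not a real pre-training setup.

```python
import torch
import torch.nn.functional as F

# Toy stand-in for a language model: embed tokens, then predict a distribution over the vocabulary.
vocab_size, d_model = 1000, 64
model = torch.nn.Sequential(
    torch.nn.Embedding(vocab_size, d_model),
    torch.nn.Linear(d_model, vocab_size),
)

tokens = torch.randint(0, vocab_size, (1, 128))  # a batch of token IDs from the corpus
logits = model(tokens)                           # shape: (1, 128, vocab_size)

# Next-token prediction: the output at position t is trained to predict token t+1.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),  # predictions for positions 0..126
    tokens[:, 1:].reshape(-1),               # targets: the same sequence shifted left by one
)
loss.backward()  # pre-training repeats this step over trillions of tokens
```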
Supervised Fine-Tuning (SFT)
Training on curated instruction-response pairs to teach the model to follow instructions and produce useful outputs. Thousands to millions of examples, typically taking hours to days on a few GPUs.
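A sketch of how SFT examples are commonly prepared in PyTorch: prompt and response are concatenated, and the prompt positions are masked out of the loss via the -100 ignore_index convention. The token IDs and vocabulary size below are made up for illustration.

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # positions with this label are skipped by cross_entropy

def build_sft_example(prompt_ids, response_ids):
    """Concatenate prompt and response; only response tokens contribute to the loss."""
    input_ids = prompt_ids + response_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    return torch.tensor([input_ids]), torch.tensor([labels])

# Hypothetical token IDs standing in for an instruction and its target response.
input_ids, labels = build_sft_example(prompt_ids=[5, 17, 42, 8], response_ids=[99, 23, 7])

vocab_size = 128
logits = torch.randn(1, input_ids.shape[1], vocab_size, requires_grad=True)  # stand-in for model output

# Shift as in pre-training; the masked prompt labels keep those positions out of the loss.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    labels[:, 1:].reshape(-1),
    ignore_index=IGNORE_INDEX,
)
loss.backward()
```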
RLHF
Reinforcement Learning from Human Feedback: humans compare model outputs, a reward model learns their preferences, then the LLM is optimized to maximize that reward. The technique that made ChatGPT work.
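At the core of the reward-modeling step is a pairwise preference loss: the reward model should score the human-preferred response above the rejected one. A minimal sketch, where the scalar rewards are placeholders for what a real reward model would compute from the prompt and each response:

```python
import torch
import torch.nn.functional as F

# Scalar scores a reward model assigned to two candidate responses for the same prompts.
reward_chosen = torch.tensor([1.3, 0.2], requires_grad=True)     # human-preferred responses
reward_rejected = torch.tensor([0.9, -0.5], requires_grad=True)  # rejected responses

# Bradley-Terry-style preference loss: push the chosen score above the rejected one.
loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
loss.backward()

# Once the reward model is trained, the LLM is optimized (typically with PPO) to maximize
# its reward, with a KL penalty keeping it close to the original model.
```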
DPO
Direct Preference Optimization: a simpler alternative to RLHF that eliminates the separate reward model. Directly optimizes the LLM on preference pairs. Increasingly popular for its stability and simplicity.
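A sketch of the DPO loss itself, using placeholder log-probabilities; in practice these are the summed token log-probs of each response under the policy being trained and under a frozen reference model (usually the SFT checkpoint).

```python
import torch
import torch.nn.functional as F

beta = 0.1  # a common default; controls how far the policy may drift from the reference

# Log-probabilities of chosen / rejected responses under the policy being trained...
policy_chosen = torch.tensor([-12.0], requires_grad=True)
policy_rejected = torch.tensor([-15.0], requires_grad=True)
# ...and under the frozen reference model.
ref_chosen = torch.tensor([-13.0])
ref_rejected = torch.tensor([-14.0])

# DPO loss: grow the margin by which the policy prefers the chosen response,
# measured relative to how much the reference model already preferred it.
margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
loss = -F.logsigmoid(beta * margin).mean()
loss.backward()
```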
LoRA
Low-Rank Adaptation: fine-tuning only small additional matrices (typically a few percent of parameters or less) while keeping the original model frozen. Produces tiny adapter files (10-100MB) that can be swapped in and out.
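The idea in code: the pretrained weight stays frozen, and the update is factored into two small matrices A and B whose product is added to the layer's output. A from-scratch PyTorch sketch; the rank and scaling values are illustrative, not tuned.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear and adds a trainable low-rank update B @ A."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # the pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen path plus scaled low-rank update: W x + scale * B (A x)
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable parameters: {trainable} of {total}")  # only A and B are trained
```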
QLoRA
Quantized LoRA: combines 4-bit model quantization with LoRA adapters, enabling fine-tuning of a 65B-parameter model on a single 48GB GPU (and ~33B models on a 24GB card). A breakthrough for accessible AI customization.
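A common way to set this up is the Hugging Face transformers + peft + bitsandbytes stack. The sketch below assumes those libraries are installed; the model ID and hyperparameters are examples rather than a recommended recipe.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the base model with 4-bit NF4 quantization so large models fit in limited VRAM.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",           # example model ID
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Attach small trainable LoRA adapters on top of the frozen, quantized weights.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections are a common choice
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # typically well under 1% of total parameters
```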
Full Fine-tuning vs PEFT
Full fine-tuning updates all parameters: maximum quality, but it requires multi-GPU setups and risks catastrophic forgetting. Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA are cheaper and preserve base model knowledge.
When to Fine-tune vs Prompt
Fine-tune for: consistent style/format, domain-specific knowledge, specific output structures. Prompt for: flexible tasks, rapid iteration, no training data available. Fine-tuning is a commitment; prompting is an experiment.
Training Cost Spectrum
Pre-training ($10M+) vs. fine-tuning (roughly $100-$10K) vs. prompting (free or cheap). API-based fine-tuning (e.g., OpenAI) costs pennies per training example. Self-hosted fine-tuning requires GPU rental ($1-$5/hr for an A100). A back-of-envelope estimate appears below.
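A quick worked estimate makes the gap concrete; every figure here is illustrative, not a quote.

```python
# Back-of-envelope estimate for a self-hosted LoRA fine-tune; all figures are illustrative.
gpu_hourly_rate = 2.50   # USD/hr for a rented A100 (within the $1-$5/hr range above)
training_hours = 4       # e.g., a 7B model with LoRA on tens of thousands of examples
experiment_runs = 3      # hyperparameter sweeps and reruns add up

total_cost = gpu_hourly_rate * training_hours * experiment_runs
print(f"Estimated GPU cost: ${total_cost:.2f}")  # $30.00, orders of magnitude below pre-training
```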
Evaluation and Iteration
Measuring fine-tuned model quality with held-out test sets, automated metrics (perplexity, BLEU, ROUGE), and human evaluation. Always compare against the base model to quantify improvement.
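Perplexity on a held-out set is the cheapest of these checks. A sketch comparing base and fine-tuned models, assuming the Hugging Face transformers API and placeholder model IDs:

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model_name: str, texts: list[str]) -> float:
    """Average perplexity of a causal LM over held-out texts (lower is better)."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    losses = []
    with torch.no_grad():
        for text in texts:
            enc = tokenizer(text, return_tensors="pt")
            out = model(**enc, labels=enc["input_ids"])  # loss is mean next-token cross-entropy
            losses.append(out.loss.item())
    return math.exp(sum(losses) / len(losses))

held_out = ["Example held-out document one.", "Example held-out document two."]
print("base:      ", perplexity("base-model-id", held_out))          # placeholder model IDs
print("fine-tuned:", perplexity("my-org/my-fine-tuned-model", held_out))
```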
RLHF: Reinforcement Learning from Human Feedback, training models to align with human preferences using a learned reward model.
LoRA: Low-Rank Adaptation, efficient fine-tuning that trains only small additional matrices while keeping the base model frozen.
SFT: Supervised Fine-Tuning, training on curated instruction-response pairs to teach the model to follow instructions.
DPO: Direct Preference Optimization, a simpler alternative to RLHF that directly optimizes on human preference pairs without a separate reward model.