Making models faster, smaller, and cheaper to run.
Running large AI models requires expensive hardware. Model optimization techniques reduce computational requirements while maintaining quality. Quantization (reducing numerical precision), pruning (removing unnecessary connections), and distillation (training smaller models from larger ones) make it possible to run models on consumer hardware that would otherwise require data center GPUs.
Optimization is what makes AI practical. Without it, only companies with massive GPU clusters could use frontier models. Thanks to quantization and efficient inference engines like llama.cpp and vLLM, a 70B parameter model can run on a gaming PC, and inference costs have dropped 100x in two years. This democratization drives the open-source AI movement.
Quantization Basics
Reducing numerical precision from FP32 (32 bits per weight) → FP16 → INT8 → INT4 shrinks model size 2-8x. Each step trades a small amount of quality for major size and speed improvements.
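A minimal sketch of the idea, assuming a toy symmetric per-tensor INT8 scheme (real quantizers typically use per-channel or per-group scales): map FP32 weights to int8 with one scale factor, then dequantize to see the rounding error that quantization trades for a 4x size reduction.

```python
# Toy symmetric INT8 quantization: w ≈ scale * q, with q in [-127, 127].
# Illustrative only; production quantizers use finer-grained scales.
import numpy as np

def quantize_int8(weights: np.ndarray):
    scale = np.abs(weights).max() / 127.0                     # one FP32 scale per tensor
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)           # fake FP32 weight matrix
q, scale = quantize_int8(w)

print(f"FP32 size: {w.nbytes / 1e6:.0f} MB, INT8 size: {q.nbytes / 1e6:.0f} MB")
print(f"mean abs rounding error: {np.abs(w - dequantize_int8(q, scale)).mean():.5f}")
```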
GPTQ and AWQ
GPU-optimized post-training quantization methods. GPTQ minimizes layer-wise quantization error using calibration data and approximate second-order information; AWQ (Activation-aware Weight Quantization) protects the weight channels that matter most, identified from activation statistics. Both enable fast GPU inference.
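In practice you usually load a checkpoint someone has already quantized rather than running GPTQ or AWQ yourself. A hedged sketch using Hugging Face transformers, assuming the matching backend (autoawq, or auto-gptq/optimum for GPTQ) is installed; the model id is illustrative, not a recommendation.

```python
# Loading a pre-quantized AWQ checkpoint; any GPTQ/AWQ repo on the Hub
# follows the same pattern. Model id below is an assumption for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Mistral-7B-Instruct-v0.2-AWQ"   # illustrative repo name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Explain AWQ in one sentence.", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```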
GGUF and llama.cpp
GGUF is the model file format used by llama.cpp, enabling mixed CPU+GPU inference on consumer machines. It supports quantization levels from Q2 (smallest) to Q8 (highest quality); Q4_K_M is the popular sweet spot.
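A hedged sketch using the llama-cpp-python bindings to run a Q4_K_M GGUF file with part of the model offloaded to the GPU. The file path and layer count are assumptions; adjust n_gpu_layers to whatever fits your VRAM.

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-3-70b-instruct.Q4_K_M.gguf",  # hypothetical path
    n_gpu_layers=40,   # offload this many transformer layers to the GPU
    n_ctx=4096,        # context window to allocate
)

out = llm("Q: Why is Q4_K_M a popular quantization level?\nA:", max_tokens=128)
print(out["choices"][0]["text"])
```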
Pruning
Removing weights close to zero that contribute little to output quality. Structured pruning removes entire neurons or attention heads. Can reduce model size 50-90% when combined with careful calibration or retraining to recover accuracy.
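A minimal sketch of unstructured magnitude pruning using PyTorch's built-in pruning utilities: zero out the 50% of weights with the smallest absolute value in one Linear layer. Real pipelines prune gradually and fine-tune afterwards; this shows the mechanism only.

```python
import torch
import torch.nn.utils.prune as prune

layer = torch.nn.Linear(4096, 4096)

prune.l1_unstructured(layer, name="weight", amount=0.5)  # mask the smallest 50% by |w|
prune.remove(layer, "weight")                            # bake the mask into the weights

sparsity = (layer.weight == 0).float().mean().item()
print(f"sparsity after pruning: {sparsity:.0%}")
```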
Knowledge Distillation
Training a small "student" model from a large "teacher" model. The student learns from the teacher's probability distributions, not just hard labels, capturing richer information than the labels alone. Compact commercial models such as GPT-4o mini are widely believed to be distilled from larger siblings.
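A sketch of the standard distillation loss: the student matches the teacher's softened distribution via KL divergence, mixed with ordinary cross-entropy on the hard labels. The temperature T and mixing weight alpha below are typical but arbitrary choices, not values from any particular system.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL between softened teacher and student distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                    # rescale gradients by T^2
    # Hard targets: usual cross-entropy against ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage: random logits for a batch of 8 examples over a 100-token vocab.
s, t = torch.randn(8, 100), torch.randn(8, 100)
y = torch.randint(0, 100, (8,))
print(distillation_loss(s, t, y))
```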
Flash Attention
An exact, memory-efficient attention algorithm by Tri Dao and collaborators that fuses operations and uses tiling so the full attention matrix is never materialized in GPU memory. Reduces attention memory usage 5-20x and speeds up training by 2-4x.
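You rarely call FlashAttention kernels directly. A hedged sketch, assuming a CUDA GPU: PyTorch's scaled_dot_product_attention can dispatch to a FlashAttention backend on supported hardware, giving the fused, tiled computation without ever building the seq_len x seq_len attention matrix.

```python
import torch
import torch.nn.functional as F

B, H, S, D = 1, 32, 4096, 128          # batch, heads, sequence length, head dim
q = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)
k = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)
v = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)

# is_causal=True applies the autoregressive mask inside the fused kernel.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 32, 4096, 128])
```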
Speculative Decoding
A fast, small draft model proposes several tokens at once, and the large model verifies them in a single forward pass. This yields a 2-3x speedup with no quality loss: verification guarantees the output matches what the large model would have generated on its own, and any rejected draft token is replaced by the large model's own choice.
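A toy sketch of the greedy variant of this loop. draft_next and target_logits are hypothetical stand-ins (random numbers here, not real models); the point is the control flow: draft k tokens cheaply, score them with one pass of the big model, keep the longest agreeing prefix, and take the big model's token at the first disagreement.

```python
import torch

VOCAB = 1000

def draft_next(tokens, k=4):
    """Stand-in draft model: proposes k candidate tokens (random here)."""
    return torch.randint(0, VOCAB, (k,)).tolist()

def target_logits(tokens):
    """Stand-in target model: logits[j] predicts the token after position j."""
    return torch.randn(len(tokens), VOCAB)

def speculative_step(tokens, k=4):
    draft = draft_next(tokens, k)
    logits = target_logits(tokens + draft)          # one "forward pass" over the drafts
    accepted = []
    for i, tok in enumerate(draft):
        target_choice = int(logits[len(tokens) + i - 1].argmax())
        if target_choice == tok:
            accepted.append(tok)                    # draft token matches: keep it
        else:
            accepted.append(target_choice)          # mismatch: use the target's token, stop
            break
    else:
        accepted.append(int(logits[-1].argmax()))   # all accepted: free bonus token
    return tokens + accepted

seq = [1, 2, 3]
for _ in range(5):
    seq = speculative_step(seq)
print(seq)
```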
KV Cache Optimization
During generation, each new token must attend to all previous tokens. The KV cache stores the key and value vectors for those earlier tokens so they are not recomputed, but it grows linearly with sequence length and batch size. PagedAttention (vLLM) manages this memory in fixed-size blocks, much like virtual memory pages in an operating system.
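A hedged sketch of vLLM's offline API, whose PagedAttention engine allocates the KV cache in block-sized pages instead of one contiguous buffer per sequence. The model id is illustrative; any Hub model vLLM supports works the same way.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")   # illustrative model id
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["What does PagedAttention optimize?"], params)
print(outputs[0].outputs[0].text)
```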
Model Merging
Combining weights from multiple fine-tuned models without additional training. SLERP interpolates between checkpoints, while methods like TIES and DARE merge task-specific weight deltas and resolve conflicts between them. Community merges on Hugging Face often outperform the individual fine-tunes they are built from.
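A toy sketch of merging two fine-tunes by spherical interpolation (SLERP) of their weight tensors, assuming identical architectures. Real merges (for example with the mergekit tool) operate on full checkpoints with per-layer settings; this shows only the core idea.

```python
import torch

def slerp(a: torch.Tensor, b: torch.Tensor, t: float = 0.5, eps: float = 1e-8):
    """Spherical interpolation between two weight tensors of the same shape."""
    a_flat, b_flat = a.flatten().float(), b.flatten().float()
    a_n = a_flat / (a_flat.norm() + eps)
    b_n = b_flat / (b_flat.norm() + eps)
    omega = torch.acos(torch.clamp(torch.dot(a_n, b_n), -1.0, 1.0))
    if omega.abs() < eps:                          # nearly parallel: fall back to lerp
        return ((1 - t) * a_flat + t * b_flat).view_as(a)
    so = torch.sin(omega)
    out = (torch.sin((1 - t) * omega) / so) * a_flat + (torch.sin(t * omega) / so) * b_flat
    return out.view_as(a)

def merge_state_dicts(sd_a, sd_b, t=0.5):
    return {k: slerp(sd_a[k], sd_b[k], t) for k in sd_a}

# Usage on two toy "checkpoints" with identical shapes.
sd1 = {"w": torch.randn(16, 16)}
sd2 = {"w": torch.randn(16, 16)}
merged = merge_state_dicts(sd1, sd2, t=0.5)
print(merged["w"].shape)
```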
Practical Impact
A 70B model quantized to Q4 fits in roughly 48GB of VRAM (two consumer GPUs). vLLM reports up to 24x higher throughput than naive Hugging Face inference. These optimizations mean a $2000 PC can run models that cost $100K+ to train.
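Back-of-the-envelope arithmetic behind the figure above; the bits-per-weight and overhead values are rough assumptions, not measurements.

```python
# Estimate VRAM: parameters * bits/weight, plus a rough allowance for the
# KV cache and activations. ~4.5 bits/weight approximates Q4_K_M on average.
def model_vram_gb(n_params: float, bits_per_weight: float, overhead_gb: float = 4.0) -> float:
    return n_params * bits_per_weight / 8 / 1e9 + overhead_gb

print(f"70B @ FP16:   {model_vram_gb(70e9, 16):.0f} GB")   # ~144 GB, data-center territory
print(f"70B @ Q4_K_M: {model_vram_gb(70e9, 4.5):.0f} GB")  # ~43 GB, fits on 2x 24GB GPUs
```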
Quantization: Reducing numerical precision of model weights (FP32→INT4) to decrease size 2-8x and speed up inference.
Distillation: Training a smaller student model to mimic a larger teacher model's probability distributions and capabilities.
Flash Attention: Memory-efficient attention implementation that avoids materializing the full attention matrix, reducing GPU memory 5-20x.
Speculative Decoding: Using a fast draft model to generate candidate tokens, verified in batch by the larger model for a 2-3x speedup.