Making models faster, smaller, and cheaper to run.
Running large AI models requires expensive hardware. Model optimization techniques reduce computational requirements while maintaining quality. Quantization (reducing numerical precision), pruning (removing unnecessary connections), and distillation (training smaller models from larger ones) make it possible to run models on consumer hardware that would otherwise require data center GPUs.
Optimization is what makes AI practical. Without it, only companies with massive GPU clusters could use frontier models. Thanks to quantization and efficient inference engines like llama.cpp and vLLM, a 70B parameter model can run on a gaming PC, and inference costs have dropped 100x in two years. This democratization drives the open-source AI movement.
Quantization Basics
Reducing numerical precision from FP32 (32 bits per weight) → FP16 → INT8 → INT4 shrinks model size 2-8x. Each step trades a small amount of quality for major size and speed improvements.
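A minimal sketch of the idea, assuming a toy symmetric per-tensor INT8 scheme (real quantizers typically use per-channel or per-group scales): map FP32 weights to int8 with one scale factor, then dequantize to see the rounding error that quantization trades for a 4x size reduction.

```python
# Toy symmetric INT8 quantization: w ≈ scale * q, with q in [-127, 127].
# Illustrative only; production quantizers use finer-grained scales.
import numpy as np

def quantize_int8(weights: np.ndarray):
    scale = np.abs(weights).max() / 127.0                     # one FP32 scale per tensor
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)           # fake FP32 weight matrix
q, scale = quantize_int8(w)

print(f"FP32 size: {w.nbytes / 1e6:.0f} MB, INT8 size: {q.nbytes / 1e6:.0f} MB")
print(f"mean abs rounding error: {np.abs(w - dequantize_int8(q, scale)).mean():.5f}")
```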
GPTQ and AWQ
GPU-optimized post-training quantization methods. GPTQ minimizes layer-wise quantization error using calibration data and approximate second-order information; AWQ (Activation-aware Weight Quantization) protects the weight channels that matter most, identified from activation statistics. Both enable fast GPU inference.
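In practice you usually load a checkpoint someone has already quantized rather than running GPTQ or AWQ yourself. A hedged sketch using Hugging Face transformers, assuming the matching backend (autoawq, or auto-gptq/optimum for GPTQ) is installed; the model id is illustrative, not a recommendation.

```python
# Loading a pre-quantized AWQ checkpoint; any GPTQ/AWQ repo on the Hub
# follows the same pattern. Model id below is an assumption for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Mistral-7B-Instruct-v0.2-AWQ"   # illustrative repo name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Explain AWQ in one sentence.", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```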
GGUF and llama.cpp
GGUF is the model file format used by llama.cpp, enabling mixed CPU+GPU inference on consumer machines. It supports quantization levels from Q2 (smallest) to Q8 (highest quality); Q4_K_M is the popular sweet spot.
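A hedged sketch using the llama-cpp-python bindings to run a Q4_K_M GGUF file with part of the model offloaded to the GPU. The file path and layer count are assumptions; adjust n_gpu_layers to whatever fits your VRAM.

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-3-70b-instruct.Q4_K_M.gguf",  # hypothetical path
    n_gpu_layers=40,   # offload this many transformer layers to the GPU
    n_ctx=4096,        # context window to allocate
)

out = llm("Q: Why is Q4_K_M a popular quantization level?\nA:", max_tokens=128)
print(out["choices"][0]["text"])
```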
Pruning
Removing weights close to zero that contribute little to output quality. Structured pruning removes entire neurons or attention heads. Can reduce model size 50-90% when combined with careful calibration or retraining to recover accuracy.
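A minimal sketch of unstructured magnitude pruning using PyTorch's built-in pruning utilities: zero out the 50% of weights with the smallest absolute value in one Linear layer. Real pipelines prune gradually and fine-tune afterwards; this shows the mechanism only.

```python
import torch
import torch.nn.utils.prune as prune

layer = torch.nn.Linear(4096, 4096)

prune.l1_unstructured(layer, name="weight", amount=0.5)  # mask the smallest 50% by |w|
prune.remove(layer, "weight")                            # bake the mask into the weights

sparsity = (layer.weight == 0).float().mean().item()
print(f"sparsity after pruning: {sparsity:.0%}")
```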
Knowledge Distillation
Training a small "student" model from a large "teacher" model. The student learns from the teacher's probability distributions, not just hard labels, capturing richer information than the labels alone. Compact commercial models such as GPT-4o mini are widely believed to be distilled from larger siblings.
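A sketch of the standard distillation loss: the student matches the teacher's softened distribution via KL divergence, mixed with ordinary cross-entropy on the hard labels. The temperature T and mixing weight alpha below are typical but arbitrary choices, not values from any particular system.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL between softened teacher and student distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                    # rescale gradients by T^2
    # Hard targets: usual cross-entropy against ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage: random logits for a batch of 8 examples over a 100-token vocab.
s, t = torch.randn(8, 100), torch.randn(8, 100)
y = torch.randint(0, 100, (8,))
print(distillation_loss(s, t, y))
```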
Flash Attention
An exact, memory-efficient attention algorithm by Tri Dao and collaborators that fuses operations and uses tiling so the full attention matrix is never materialized in GPU memory. Reduces attention memory usage 5-20x and speeds up training by 2-4x.
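You rarely call FlashAttention kernels directly. A hedged sketch, assuming a CUDA GPU: PyTorch's scaled_dot_product_attention can dispatch to a FlashAttention backend on supported hardware, giving the fused, tiled computation without ever building the seq_len x seq_len attention matrix.

```python
import torch
import torch.nn.functional as F

B, H, S, D = 1, 32, 4096, 128          # batch, heads, sequence length, head dim
q = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)
k = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)
v = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)

# is_causal=True applies the autoregressive mask inside the fused kernel.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 32, 4096, 128])
```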
Speculative Decoding
A fast, small draft model proposes several tokens at once, and the large model verifies them in a single forward pass. This yields a 2-3x speedup with no quality loss: verification guarantees the output matches what the large model would have generated on its own, and any rejected draft token is replaced by the large model's own choice.
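A toy sketch of the greedy variant of this loop. draft_next and target_logits are hypothetical stand-ins (random numbers here, not real models); the point is the control flow: draft k tokens cheaply, score them with one pass of the big model, keep the longest agreeing prefix, and take the big model's token at the first disagreement.

```python
import torch

VOCAB = 1000

def draft_next(tokens, k=4):
    """Stand-in draft model: proposes k candidate tokens (random here)."""
    return torch.randint(0, VOCAB, (k,)).tolist()

def target_logits(tokens):
    """Stand-in target model: logits[j] predicts the token after position j."""
    return torch.randn(len(tokens), VOCAB)

def speculative_step(tokens, k=4):
    draft = draft_next(tokens, k)
    logits = target_logits(tokens + draft)          # one "forward pass" over the drafts
    accepted = []
    for i, tok in enumerate(draft):
        target_choice = int(logits[len(tokens) + i - 1].argmax())
        if target_choice == tok:
            accepted.append(tok)                    # draft token matches: keep it
        else:
            accepted.append(target_choice)          # mismatch: use the target's token, stop
            break
    else:
        accepted.append(int(logits[-1].argmax()))   # all accepted: free bonus token
    return tokens + accepted

seq = [1, 2, 3]
for _ in range(5):
    seq = speculative_step(seq)
print(seq)
```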
KV Cache Optimization
During generation, each new token must attend to all previous tokens. The KV cache stores the key and value vectors for those earlier tokens so they are not recomputed, but it grows linearly with sequence length and batch size. PagedAttention (vLLM) manages this memory in fixed-size blocks, much like virtual memory pages in an operating system.
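A hedged sketch of vLLM's offline API, whose PagedAttention engine allocates the KV cache in block-sized pages instead of one contiguous buffer per sequence. The model id is illustrative; any Hub model vLLM supports works the same way.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")   # illustrative model id
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["What does PagedAttention optimize?"], params)
print(outputs[0].outputs[0].text)
```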
Model Merging
Combining weights from multiple fine-tuned models without additional training. SLERP interpolates between checkpoints, while methods like TIES and DARE merge task-specific weight deltas and resolve conflicts between them. Community merges on Hugging Face often outperform the individual fine-tunes they are built from.
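A toy sketch of merging two fine-tunes by spherical interpolation (SLERP) of their weight tensors, assuming identical architectures. Real merges (for example with the mergekit tool) operate on full checkpoints with per-layer settings; this shows only the core idea.

```python
import torch

def slerp(a: torch.Tensor, b: torch.Tensor, t: float = 0.5, eps: float = 1e-8):
    """Spherical interpolation between two weight tensors of the same shape."""
    a_flat, b_flat = a.flatten().float(), b.flatten().float()
    a_n = a_flat / (a_flat.norm() + eps)
    b_n = b_flat / (b_flat.norm() + eps)
    omega = torch.acos(torch.clamp(torch.dot(a_n, b_n), -1.0, 1.0))
    if omega.abs() < eps:                          # nearly parallel: fall back to lerp
        return ((1 - t) * a_flat + t * b_flat).view_as(a)
    so = torch.sin(omega)
    out = (torch.sin((1 - t) * omega) / so) * a_flat + (torch.sin(t * omega) / so) * b_flat
    return out.view_as(a)

def merge_state_dicts(sd_a, sd_b, t=0.5):
    return {k: slerp(sd_a[k], sd_b[k], t) for k in sd_a}

# Usage on two toy "checkpoints" with identical shapes.
sd1 = {"w": torch.randn(16, 16)}
sd2 = {"w": torch.randn(16, 16)}
merged = merge_state_dicts(sd1, sd2, t=0.5)
print(merged["w"].shape)
```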
Practical Impact
A 70B model quantized to Q4 fits in roughly 48GB of VRAM (two consumer GPUs). vLLM reports up to 24x higher throughput than naive Hugging Face inference. These optimizations mean a $2000 PC can run models that cost $100K+ to train.
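Back-of-the-envelope arithmetic behind the figure above; the bits-per-weight and overhead values are rough assumptions, not measurements.

```python
# Estimate VRAM: parameters * bits/weight, plus a rough allowance for the
# KV cache and activations. ~4.5 bits/weight approximates Q4_K_M on average.
def model_vram_gb(n_params: float, bits_per_weight: float, overhead_gb: float = 4.0) -> float:
    return n_params * bits_per_weight / 8 / 1e9 + overhead_gb

print(f"70B @ FP16:   {model_vram_gb(70e9, 16):.0f} GB")   # ~144 GB, data-center territory
print(f"70B @ Q4_K_M: {model_vram_gb(70e9, 4.5):.0f} GB")  # ~43 GB, fits on 2x 24GB GPUs
```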
Quantization: Reducing numerical precision of model weights (FP32→INT4) to decrease size 2-8x and speed up inference.
Distillation: Training a smaller student model to mimic a larger teacher model's probability distributions and capabilities.
Flash Attention: Memory-efficient attention implementation that avoids materializing the full attention matrix, reducing GPU memory 5-20x.
Speculative Decoding: Using a fast draft model to generate candidate tokens, verified in batch by the larger model for a 2-3x speedup.