
Basic Theory

🚀 Level 4 – Master

Model Formats

Understanding different model distribution and execution formats.

AI models need to be serialized into files for distribution and loading. Different formats optimize for different goals: GGUF prioritizes running on consumer hardware with CPU+GPU splitting, GPTQ and AWQ are GPU-optimized for maximum throughput, SafeTensors ensures safe loading without code execution risks, and ONNX provides cross-platform compatibility.

Understanding model formats is essential for local AI deployment. The format you choose determines which inference engine you can use (llama.cpp, vLLM, TensorRT-LLM), what hardware it runs on, how much memory it needs, and how fast it generates tokens. Most models on Hugging Face are available in multiple formats.

Key Topics Covered
GGUF (llama.cpp)
The most versatile format for local inference. Supports CPU, GPU, and mixed CPU+GPU execution. Single-file distribution with embedded metadata. The go-to format for running models on consumer hardware.
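A minimal sketch of mixed CPU+GPU execution using the llama-cpp-python bindings, one common way to run GGUF files; the file path and layer count are illustrative placeholders:

from llama_cpp import Llama  # pip install llama-cpp-python

# Load a GGUF file; n_gpu_layers controls how many layers are offloaded
# to the GPU, with the remainder running on the CPU.
llm = Llama(
    model_path="models/llama-3-8b-instruct.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=20,   # partial offload: 20 layers on GPU, rest on CPU
    n_ctx=4096,        # context window
)

out = llm("Explain GGUF in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])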
GPTQ
GPU-optimized post-training quantization format. Models are quantized to 4-bit or 8-bit with calibration data. Faster than GGUF on pure GPU but requires the full model to fit in VRAM.
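As a sketch, a pre-quantized GPTQ checkpoint can be loaded through the Hugging Face transformers integration, assuming a GPTQ backend such as auto-gptq or gptqmodel is installed; the repository name is only an example:

from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ"  # example pre-quantized repo
tokenizer = AutoTokenizer.from_pretrained(repo)

# The 4-bit weights load straight onto the GPU; unlike GGUF, the whole
# model has to fit in VRAM.
model = AutoModelForCausalLM.from_pretrained(repo, device_map="auto")

prompt = "GPTQ keeps inference on the GPU because"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))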
AWQ (Activation-aware Weight Quantization)
Advanced GPU quantization that preserves important weights based on activation patterns. Generally better quality than GPTQ at the same bit width. Supported by vLLM and TensorRT-LLM.
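A minimal vLLM sketch for serving an AWQ checkpoint; the model name is a placeholder, and the quantization flag is set explicitly here for clarity:

from vllm import LLM, SamplingParams  # pip install vllm

llm = LLM(model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ", quantization="awq")  # example repo
params = SamplingParams(temperature=0.7, max_tokens=64)

# Batch generation; vLLM handles scheduling and KV-cache management.
outputs = llm.generate(["What does activation-aware quantization preserve?"], params)
print(outputs[0].outputs[0].text)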
SafeTensors
Hugging Face's safe serialization format. It prevents arbitrary code execution on load, supports memory-mapping for fast loading, and is now the default format on the Hugging Face Hub.
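A quick sketch of saving and reloading raw tensors with the safetensors library shows why loading is safe: only plain tensors are stored, so nothing is unpickled and no code runs on load.

import torch
from safetensors.torch import save_file, load_file

# Plain tensors in, plain tensors out; no arbitrary objects are deserialized.
tensors = {"embedding.weight": torch.randn(1000, 128)}
save_file(tensors, "model.safetensors")

restored = load_file("model.safetensors")  # the format also supports lazy, memory-mapped access
print(restored["embedding.weight"].shape)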
ONNX (Open Neural Network Exchange)
Cross-platform format originally developed by Microsoft and Meta, with broad industry support. Enables running models on different hardware (CPU, GPU, NPU) through ONNX Runtime. Used for edge deployment and mobile inference.
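A short onnxruntime sketch; "model.onnx", the provider list, and the input shape are illustrative, and the session falls back to the CPU provider if CUDA is unavailable:

import numpy as np
import onnxruntime as ort  # pip install onnxruntime

# Providers are tried in order; swap in whatever your hardware supports.
session = ort.InferenceSession(
    "model.onnx",  # placeholder path to an exported model
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)  # example input
result = session.run(None, {session.get_inputs()[0].name: dummy})
print(result[0].shape)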
ExLlamaV2 and EXL2
Highly optimized GPU inference with variable quantization: different layers can use different bit widths. Often delivers among the best quality per bit of the quantized formats. Popular for enthusiast setups.
TensorRT-LLM
NVIDIA's high-performance inference engine. It compiles models into optimized execution plans for NVIDIA GPUs. Maximum throughput for production serving, but it requires NVIDIA hardware and a compilation step.
Quantization Levels
Q2 (smallest, lowest quality) through Q8 (largest, highest quality). Q4_K_M is the sweet spot for GGUF: good quality at a reasonable size. Q5 or higher is recommended for reasoning-heavy tasks.
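A rough back-of-the-envelope size check, an approximation only (K-quants mix bit widths, and metadata and the KV cache add overhead), multiplies parameter count by bits per weight:

def approx_size_gb(params_billions: float, bits_per_weight: float) -> float:
    # parameters * bits / 8 bytes, ignoring metadata and runtime overhead
    return params_billions * bits_per_weight / 8

print(approx_size_gb(7, 4.8))   # ~4.2 GB: a 7B model around Q4_K_M (~4.5-5 bits/weight)
print(approx_size_gb(7, 8.5))   # ~7.4 GB: the same model near Q8_0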
Model Distribution
Hugging Face is the primary hub. Models are uploaded in multiple formats by quantization specialists (TheBloke, bartowski). Ollama and LM Studio download GGUF models with one-click setup.
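A small sketch of pulling a single GGUF file with the huggingface_hub client; the repository and file names are examples of the naming scheme quantizers typically use:

from huggingface_hub import hf_hub_download  # pip install huggingface_hub

# Download one quantization level instead of the whole multi-file repo.
path = hf_hub_download(
    repo_id="bartowski/Meta-Llama-3.1-8B-Instruct-GGUF",   # example repo
    filename="Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",     # example file
)
print(path)  # local cache path, ready to hand to llama.cpp or LM Studio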
Choosing the Right Format
Consumer GPU: GGUF or EXL2. Production GPU server: AWQ or TensorRT-LLM. CPU only: GGUF. Cross-platform: ONNX. Mobile/edge: ONNX or CoreML. When in doubt, start with GGUF Q4_K_M.
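A hypothetical helper (not from any library) that simply encodes the rules of thumb above, handy as a quick reference:

# Hypothetical mapping of deployment scenario to recommended format.
RECOMMENDED_FORMAT = {
    "consumer_gpu": "GGUF or EXL2",
    "production_gpu_server": "AWQ or TensorRT-LLM",
    "cpu_only": "GGUF",
    "cross_platform": "ONNX",
    "mobile_edge": "ONNX or CoreML",
}

def choose_format(scenario: str) -> str:
    # Default mirrors the "when in doubt" advice above.
    return RECOMMENDED_FORMAT.get(scenario, "GGUF Q4_K_M")

print(choose_format("consumer_gpu"))  # GGUF or EXL2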
Key Terms
GGUF: the llama.cpp model format supporting CPU+GPU inference; the most popular format for running models on consumer hardware.
SafeTensors: a secure model serialization format that prevents code-execution attacks during model loading.
Quantization Level: the bit precision of model weights (Q2-Q8); lower bits mean smaller files but reduced quality.
VRAM: video RAM on the GPU; the primary constraint determining which model sizes and formats can run on your hardware.