
Basic Theory

🚀 Level 4: Master

Hardware Basics

Hardware requirements for running AI models locally.

Running AI models locally requires understanding the hardware constraints. The key bottleneck is memory, specifically GPU VRAM for fast inference. A 7B-parameter model needs about 4-6GB of VRAM when quantized, while a 70B model needs 35-48GB. Your hardware determines which models you can run and how fast they generate tokens.
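If you want to check what your machine can see before downloading anything, here is a minimal sketch using PyTorch (PyTorch is an assumption here; the tools covered below do their own hardware detection):

```python
# Minimal sketch: report which accelerator PyTorch can see and how much memory it has.
# Assumes PyTorch is installed; llama.cpp-based tools detect hardware on their own.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"CUDA GPU: {props.name}, {props.total_memory / 1e9:.1f} GB VRAM")
elif torch.backends.mps.is_available():
    # Apple Silicon: the GPU shares the system's unified memory pool
    print("Apple Silicon (MPS) backend available; usable memory = unified system RAM")
else:
    print("No GPU backend detected; CPU-only inference will be slow")
```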

Capable hardware has become far more accessible. Apple Silicon Macs with unified memory can run surprisingly large models. Consumer NVIDIA GPUs (an RTX 4090 with 24GB VRAM) handle 13B-34B models well. For larger models, cloud GPU providers offer pay-per-hour access. Understanding these options helps you choose the right balance of cost, speed, and capability.

Key Topics Covered
GPU vs CPU Inference
GPUs are 10-50x faster than CPUs for AI inference due to massive parallelism. CPUs work for small models, or for GGUF models where only some layers are offloaded to a GPU. For any serious local AI work, a GPU is essential.
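As a sketch of how that split looks in practice with llama-cpp-python, where you choose how many layers to offload to the GPU (the model path and layer count below are placeholders, not course-provided values):

```python
# Hypothetical sketch with llama-cpp-python: offload part of a GGUF model to the
# GPU and keep the remaining layers on the CPU. Path and layer count are examples.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-2-7b.Q4_K_M.gguf",  # any GGUF file you have locally
    n_gpu_layers=35,   # layers to offload to VRAM; 0 = pure CPU, -1 = all layers
    n_ctx=4096,        # context window
)

out = llm("Explain in one sentence why GPUs beat CPUs for LLM inference.", max_tokens=64)
print(out["choices"][0]["text"])
```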
VRAM Requirements
7B model: ~4GB (Q4), ~14GB (FP16). 13B: ~8GB (Q4). 34B: ~20GB (Q4). 70B: ~40GB (Q4). Rule of thumb: at Q4, model size in GB is roughly half the parameter count in billions.
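A minimal sketch of that rule of thumb as arithmetic; the 4.5 bits per weight used for Q4 approximates common Q4_K_M files, and real usage adds a few GB on top for the KV cache and activations:

```python
# Weights-only VRAM estimate: parameter count times bytes per weight.
# The KV cache and activations add extra memory that grows with context length.
def weight_vram_gb(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

for params in (7, 13, 34, 70):
    print(f"{params}B: ~{weight_vram_gb(params, 4.5):.0f} GB at Q4, "
          f"~{weight_vram_gb(params, 16):.0f} GB at FP16")
```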
NVIDIA Consumer GPUs
RTX 4090 (24GB, ~$1600) is the king of local LLMs. RTX 4080 (16GB). RTX 3090 (24GB, ~$800 used) is the best value. RTX 4060 Ti 16GB is the budget option. For LLMs, VRAM capacity matters more than compute speed.
Apple Silicon
M1/M2/M3/M4 Macs with unified memory can run large models. M2 Ultra (192GB) can run 70B+ models. M3 Max (128GB) handles 34B well. Slower than NVIDIA but memory bandwidth is excellent.
Data Center GPUs
A100 (80GB), H100 (80GB), H200 (141GB): the hardware powering AI labs. 10-20x more expensive than consumer GPUs. Available through cloud providers for hourly rental.
Cloud GPU Providers
RunPod, Vast.ai, Lambda Labs: rent GPUs by the hour ($0.50-$4/hr for an A100). Good for occasional use or for models too large for local hardware. No upfront investment.
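A back-of-the-envelope comparison of renting versus buying, using the ballpark prices above (not live quotes):

```python
# Rough break-even: how many rented cloud-GPU hours equal the upfront cost of a
# local card? Prices are ballpark figures from this page, not current quotes.
local_gpu_cost = 1600.0    # e.g. RTX 4090
cloud_rate_per_hr = 2.0    # mid-range A100 rental

break_even_hours = local_gpu_cost / cloud_rate_per_hr
print(f"Break-even after ~{break_even_hours:.0f} rented hours "
      f"(~{break_even_hours / 8:.0f} eight-hour workdays)")
```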
Multi-GPU Setups
Split large models across multiple GPUs. NVLink provides fast GPU-to-GPU communication on matching NVIDIA cards. Consumer GPUs can use PCIe with slower but functional model sharding.
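One common way to shard a model across whatever GPUs are present is Hugging Face Transformers with Accelerate's device_map="auto", sketched below with a placeholder model name:

```python
# Hypothetical sketch: let Accelerate split a model's layers across all visible
# GPUs (spilling to CPU RAM if needed). The model name is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-13b-hf"  # substitute any causal LM you have access to
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",          # shard layers across available GPUs via Accelerate
    torch_dtype=torch.float16,  # halve memory vs FP32
)

inputs = tokenizer("Model sharding lets you", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=30)[0]))
```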
Inference Engines
llama.cpp (CPU+GPU, versatile), vLLM (high-throughput GPU serving), Ollama (easy local setup), LM Studio (GUI), TGI (HuggingFace serving). Each optimizes for different use cases.
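As an example of the easy end of that spectrum, here are a few hypothetical lines with the official ollama Python client (this assumes the Ollama server is running and the model has already been pulled):

```python
# Hypothetical sketch using the ollama Python package against a local Ollama
# server (`ollama serve` must be running; pull the model first with `ollama pull`).
import ollama

response = ollama.chat(
    model="llama3",  # any model you have pulled locally
    messages=[{"role": "user", "content": "In one sentence, what is VRAM?"}],
)
print(response["message"]["content"])
```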
RAM and Storage
System RAM matters for CPU inference and model loading. 32GB minimum, 64GB+ recommended. NVMe SSD dramatically speeds up model loading times (30-70B models are 20-40GB files).
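To see why the drive matters, a quick load-time estimate under assumed sequential read speeds:

```python
# Rough model-loading time: file size divided by sequential read speed.
# Drive speeds are typical ballpark figures, not benchmark results.
model_size_gb = 40  # e.g. a 70B model at Q4
drives = {"SATA SSD (~0.5 GB/s)": 0.5, "NVMe Gen3 (~3 GB/s)": 3.0, "NVMe Gen4 (~7 GB/s)": 7.0}

for name, gbps in drives.items():
    print(f"{name}: ~{model_size_gb / gbps:.0f} s to read a {model_size_gb} GB model")
```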
Budget Configurations
Entry (~$500): used RTX 3060 12GB, runs 7B models. Mid (~$2000): RTX 4090 24GB, runs up to 34B. High ($3000+): Mac Studio M2 Ultra or a dual-GPU build. Minimal budget: use cloud APIs instead of local hardware.
Key Terms
VRAM: Video RAM on the GPU; the primary constraint on which AI models can run locally.
Unified Memory: Apple Silicon architecture where the CPU and GPU share the same memory pool, enabling larger models on a Mac.
Model Sharding: Splitting a model across multiple GPUs when it is too large to fit in a single GPU's VRAM.
Inference Engine: Software that loads and runs AI models; llama.cpp, vLLM, Ollama, and TensorRT-LLM each optimize for different scenarios.