
Basic Theory

🚀 Level 4: Master

Hardware Basics

Hardware requirements for running AI models locally.

Running AI models locally requires understanding the hardware constraints. The key bottleneck is memory, specifically GPU VRAM for fast inference. A 7B-parameter model needs about 4-6GB of VRAM when quantized, while a 70B model needs 35-48GB. Your hardware determines which models you can run and how fast they generate tokens.
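If you want to check what your machine can see before downloading anything, here is a minimal sketch using PyTorch (PyTorch is an assumption here; the tools covered below do their own hardware detection):

```python
# Minimal sketch: report which accelerator PyTorch can see and how much memory it has.
# Assumes PyTorch is installed; llama.cpp-based tools detect hardware on their own.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"CUDA GPU: {props.name}, {props.total_memory / 1e9:.1f} GB VRAM")
elif torch.backends.mps.is_available():
    # Apple Silicon: the GPU shares the system's unified memory pool
    print("Apple Silicon (MPS) backend available; usable memory = unified system RAM")
else:
    print("No GPU backend detected; CPU-only inference will be slow")
```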

Capable hardware has become far more accessible. Apple Silicon Macs with unified memory can run surprisingly large models. Consumer NVIDIA GPUs (an RTX 4090 with 24GB VRAM) handle 13B-34B models well. For larger models, cloud GPU providers offer pay-per-hour access. Understanding these options helps you choose the right balance of cost, speed, and capability.

Key Topics Covered
GPU vs CPU Inference
GPUs are 10-50x faster than CPUs for AI inference due to massive parallelism. CPUs work for small models, or for GGUF models where only some layers are offloaded to a GPU. For any serious local AI work, a GPU is essential.
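As a sketch of how that split looks in practice with llama-cpp-python, where you choose how many layers to offload to the GPU (the model path and layer count below are placeholders, not course-provided values):

```python
# Hypothetical sketch with llama-cpp-python: offload part of a GGUF model to the
# GPU and keep the remaining layers on the CPU. Path and layer count are examples.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-2-7b.Q4_K_M.gguf",  # any GGUF file you have locally
    n_gpu_layers=35,   # layers to offload to VRAM; 0 = pure CPU, -1 = all layers
    n_ctx=4096,        # context window
)

out = llm("Explain in one sentence why GPUs beat CPUs for LLM inference.", max_tokens=64)
print(out["choices"][0]["text"])
```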
VRAM Requirements
7B model: ~4GB (Q4), ~14GB (FP16). 13B: ~8GB (Q4). 34B: ~20GB (Q4). 70B: ~40GB (Q4). Rule of thumb: at Q4, model size in GB is roughly half the parameter count in billions.
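A minimal sketch of that rule of thumb as arithmetic; the 4.5 bits per weight used for Q4 approximates common Q4_K_M files, and real usage adds a few GB on top for the KV cache and activations:

```python
# Weights-only VRAM estimate: parameter count times bytes per weight.
# The KV cache and activations add extra memory that grows with context length.
def weight_vram_gb(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

for params in (7, 13, 34, 70):
    print(f"{params}B: ~{weight_vram_gb(params, 4.5):.0f} GB at Q4, "
          f"~{weight_vram_gb(params, 16):.0f} GB at FP16")
```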
NVIDIA Consumer GPUs
RTX 4090 (24GB, ~$1600) is the king of local LLMs. RTX 4080 (16GB). RTX 3090 (24GB, ~$800 used) is the best value. RTX 4060 Ti 16GB is the budget option. For LLMs, VRAM capacity matters more than compute speed.
Apple Silicon
M1/M2/M3/M4 Macs with unified memory can run large models. M2 Ultra (192GB) can run 70B+ models. M3 Max (128GB) handles 34B well. Slower than NVIDIA but memory bandwidth is excellent.
Data Center GPUs
A100 (80GB), H100 (80GB), H200 (141GB): the hardware powering AI labs. 10-20x more expensive than consumer GPUs. Available through cloud providers for hourly rental.
Cloud GPU Providers
RunPod, Vast.ai, Lambda Labs: rent GPUs by the hour ($0.50-$4/hr for an A100). Good for occasional use or for models too large for local hardware. No upfront investment.
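A back-of-the-envelope comparison of renting versus buying, using the ballpark prices above (not live quotes):

```python
# Rough break-even: how many rented cloud-GPU hours equal the upfront cost of a
# local card? Prices are ballpark figures from this page, not current quotes.
local_gpu_cost = 1600.0    # e.g. RTX 4090
cloud_rate_per_hr = 2.0    # mid-range A100 rental

break_even_hours = local_gpu_cost / cloud_rate_per_hr
print(f"Break-even after ~{break_even_hours:.0f} rented hours "
      f"(~{break_even_hours / 8:.0f} eight-hour workdays)")
```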
Multi-GPU Setups
Split large models across multiple GPUs. NVLink provides fast GPU-to-GPU communication on matching NVIDIA cards. Consumer GPUs can use PCIe with slower but functional model sharding.
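One common way to shard a model across whatever GPUs are present is Hugging Face Transformers with Accelerate's device_map="auto", sketched below with a placeholder model name:

```python
# Hypothetical sketch: let Accelerate split a model's layers across all visible
# GPUs (spilling to CPU RAM if needed). The model name is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-13b-hf"  # substitute any causal LM you have access to
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",          # shard layers across available GPUs via Accelerate
    torch_dtype=torch.float16,  # halve memory vs FP32
)

inputs = tokenizer("Model sharding lets you", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=30)[0]))
```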
Inference Engines
llama.cpp (CPU+GPU, versatile), vLLM (high-throughput GPU serving), Ollama (easy local setup), LM Studio (GUI), TGI (HuggingFace serving). Each optimizes for different use cases.
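As an example of the easy end of that spectrum, here are a few hypothetical lines with the official ollama Python client (this assumes the Ollama server is running and the model has already been pulled):

```python
# Hypothetical sketch using the ollama Python package against a local Ollama
# server (`ollama serve` must be running; pull the model first with `ollama pull`).
import ollama

response = ollama.chat(
    model="llama3",  # any model you have pulled locally
    messages=[{"role": "user", "content": "In one sentence, what is VRAM?"}],
)
print(response["message"]["content"])
```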
RAM and Storage
System RAM matters for CPU inference and model loading. 32GB minimum, 64GB+ recommended. NVMe SSD dramatically speeds up model loading times (30-70B models are 20-40GB files).
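To see why the drive matters, a quick load-time estimate under assumed sequential read speeds:

```python
# Rough model-loading time: file size divided by sequential read speed.
# Drive speeds are typical ballpark figures, not benchmark results.
model_size_gb = 40  # e.g. a 70B model at Q4
drives = {"SATA SSD (~0.5 GB/s)": 0.5, "NVMe Gen3 (~3 GB/s)": 3.0, "NVMe Gen4 (~7 GB/s)": 7.0}

for name, gbps in drives.items():
    print(f"{name}: ~{model_size_gb / gbps:.0f} s to read a {model_size_gb} GB model")
```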
Budget Configurations
Entry (~$500): used RTX 3060 12GB, runs 7B models. Mid (~$2000): RTX 4090 24GB, runs up to 34B. High ($3000+): Mac Studio M2 Ultra or a dual-GPU build. Minimal budget: use cloud APIs instead of local hardware.
Key Terms
VRAM: Video RAM on the GPU; the primary constraint on which AI models can run locally.
Unified Memory: Apple Silicon architecture where the CPU and GPU share the same memory pool, enabling larger models on a Mac.
Model Sharding: Splitting a model across multiple GPUs when it is too large to fit in a single GPU's VRAM.
Inference Engine: Software that loads and runs AI models; llama.cpp, vLLM, Ollama, and TensorRT-LLM each optimize for different scenarios.