
Basic Theory

⚡ Level 3 - Professional

Model Types & Structures

Different model architectures and their trade-offs.

Not all neural networks are structured the same way. Different architectures have different strengths. Decoder-only transformers (GPT, Llama) excel at text generation. Encoder-decoder models (T5) are great for translation and summarization. MoE (Mixture of Experts) architectures enable much larger models by only activating a subset of parameters per input, and newer state-space models (Mamba) offer alternatives to the quadratic cost of attention.

Architecture choice has massive practical implications. MoE models like Mixtral deliver quality comparable to much larger dense models at a fraction of the inference compute, because only a small portion of the parameters activates per token. State-space models promise efficient processing of extremely long sequences. Understanding these trade-offs helps you choose the right model for your specific performance, cost, and latency requirements.

Key Topics Covered
Transformer Variants
Encoder-only (BERT: great for classification and embeddings), decoder-only (GPT, Llama: text generation), encoder-decoder (T5, BART: translation and summarization). Each variant processes text differently based on its attention mask pattern.
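
A minimal sketch of that mask difference; the sequence length and tensors here are illustrative, not taken from any particular model:

    # How the attention mask differs between encoder-style (bidirectional)
    # and decoder-style (causal) transformer variants.
    import torch

    seq_len = 5

    # Encoder-only (BERT-style): every token may attend to every other token.
    bidirectional_mask = torch.ones(seq_len, seq_len).bool()

    # Decoder-only (GPT-style): token i may attend only to positions <= i,
    # so generation can proceed left to right.
    causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()

    print(causal_mask.int())
    # tensor([[1, 0, 0, 0, 0],
    #         [1, 1, 0, 0, 0],
    #         [1, 1, 1, 0, 0],
    #         [1, 1, 1, 1, 0],
    #         [1, 1, 1, 1, 1]])
    # An encoder-decoder (T5-style) uses a bidirectional mask over the source,
    # a causal mask over the target, and cross-attention between the two.
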
Decoder-Only Dominance
GPT, Claude, Gemini, Llama all use decoder-only architecture. It won out because it scales efficiently, can be pre-trained with simple next-token prediction, and handles both understanding and generation in one model.
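
A small sketch of that next-token prediction objective; the logits and token batch below are random stand-ins for a real model and tokenizer:

    # Next-token prediction: position t is trained to predict token t+1.
    # This one objective covers both understanding the context and generating.
    import torch
    import torch.nn.functional as F

    vocab_size, batch, seq_len = 1000, 2, 8
    token_ids = torch.randint(0, vocab_size, (batch, seq_len))  # stand-in batch
    logits = torch.randn(batch, seq_len, vocab_size)            # stand-in model output

    pred_logits, targets = logits[:, :-1], token_ids[:, 1:]     # shift by one position
    loss = F.cross_entropy(pred_logits.reshape(-1, vocab_size), targets.reshape(-1))
    print(loss.item())
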
Mixture of Experts (MoE)
A learned router sends each token to a small subset of specialized expert sub-networks (for example, 2 of 8 in Mixtral). This means a model with 46B total parameters may use only about 12B per forward pass, making it dramatically cheaper to run than a dense model of the same total size, at comparable quality.
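
A toy top-2 routing sketch under made-up dimensions, just to show the mechanics (this is not the code of any real MoE model):

    # Minimal top-2 Mixture-of-Experts routing sketch (illustrative only).
    import torch
    import torch.nn as nn

    d_model, n_experts, top_k = 16, 8, 2
    experts = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_experts)])
    router = nn.Linear(d_model, n_experts)   # learned router

    x = torch.randn(10, d_model)             # 10 tokens
    weights, chosen = router(x).softmax(dim=-1).topk(top_k, dim=-1)
    weights = weights / weights.sum(dim=-1, keepdim=True)   # renormalize over top-k

    out = torch.zeros_like(x)
    for k in range(top_k):
        for e in range(n_experts):
            hit = chosen[:, k] == e          # tokens routed to expert e in slot k
            if hit.any():
                out[hit] += weights[hit][:, k:k + 1] * experts[e](x[hit])
    # Only 2 of the 8 expert MLPs run for each token, so per-token compute is a
    # fraction of what a dense layer of the same total size would cost.
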
MoE in Practice
Mixtral 8x7B (46B total, 12B active), DeepSeek-V3, and likely GPT-4. MoE models need more memory (all experts loaded) but compute less per token. Ideal for deployment where memory is cheap but compute/latency matter.
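
A rough back-of-envelope comparison using the Mixtral-style figures above, assuming fp16 weights (2 bytes per parameter) and roughly 2 FLOPs per active parameter per generated token; both assumptions are approximations:

    # Memory is set by total parameters (all experts loaded); per-token compute
    # is set by active parameters only.
    total_params = 46e9    # every expert must be resident in memory
    active_params = 12e9   # parameters actually exercised per token

    print(f"weight memory       : {total_params * 2 / 1e9:.0f} GB")        # ~92 GB
    print(f"MoE compute / token : {2 * active_params / 1e9:.0f} GFLOPs")   # ~24
    print(f"dense 46B / token   : {2 * total_params / 1e9:.0f} GFLOPs")    # ~92
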
Dense vs Sparse Models
Dense models (Llama, Claude) activate all parameters for every token: predictable and easier to optimize. Sparse models (MoE) activate only a fraction: more efficient, but harder to train and to balance across experts.
Multi-Head Attention
Multiple parallel attention heads capture different types of relationships: syntax, semantics, long-range dependencies. Modern models use 32-128 heads. Grouped Query Attention (GQA) reduces memory by sharing key/value heads.
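
A shape-level sketch of GQA, assuming 8 query heads sharing 2 key/value heads (all numbers here are illustrative):

    # Grouped Query Attention: only the 2 KV heads are cached, shrinking the
    # KV cache 4x relative to 8 full KV heads. Projections are random stand-ins.
    import torch
    import torch.nn.functional as F

    batch, seq, d_head = 1, 16, 64
    n_q_heads, n_kv_heads = 8, 2
    group = n_q_heads // n_kv_heads          # 4 query heads per KV head

    q = torch.randn(batch, n_q_heads, seq, d_head)
    k = torch.randn(batch, n_kv_heads, seq, d_head)   # only these are cached
    v = torch.randn(batch, n_kv_heads, seq, d_head)

    # Broadcast each KV head to its group of query heads, then attend as usual.
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    print(out.shape)   # torch.Size([1, 8, 16, 64])
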
State-Space Models
Mamba and S4 process sequences in O(n) time vs O(n^2) for attention. They maintain a compressed state that evolves as they read each token. Promising for extremely long sequences (100K+ tokens) where attention becomes prohibitively expensive.
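
A minimal linear state-space recurrence showing the core idea only; real Mamba layers add input-dependent parameters and a parallelized scan:

    # One pass over the sequence with a fixed-size state, so cost grows as O(n).
    import torch

    d_state, d_in, seq_len = 4, 1, 32
    A = torch.rand(d_state, d_state) * 0.1   # state transition (small, for stability)
    B = torch.randn(d_state, d_in)           # input projection
    C = torch.randn(d_in, d_state)           # output projection

    x = torch.randn(seq_len, d_in)
    h = torch.zeros(d_state, 1)              # compressed state carried across tokens
    ys = []
    for t in range(seq_len):
        h = A @ h + B @ x[t:t + 1].T         # state update: fixed cost per token
        ys.append((C @ h).T)
    y = torch.cat(ys)                        # (seq_len, d_in)
    print(y.shape)
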
Hybrid Architectures
Combining attention layers with state-space or linear attention layers. Jamba (AI21) mixes Mamba and Transformer layers. Hybrids aim to get the best of both: global attention for reasoning, efficient processing for long contexts.
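
An illustrative layer schedule for a hybrid stack (the ratio below is made up, not Jamba's actual configuration):

    # Cheap SSM blocks do most of the sequence processing, with periodic
    # attention layers mixed in for global context.
    n_layers = 24
    schedule = ["attention" if (i + 1) % 6 == 0 else "ssm" for i in range(n_layers)]
    print(f"{schedule.count('ssm')} SSM layers, {schedule.count('attention')} attention layers")
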
Model Depth vs Width
More layers (deeper) vs wider layers (more parameters per layer). Deeper models tend to reason better but are slower. Wider models process faster but may reason less deeply. Scaling laws help labs find optimal proportions.
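
A rough way to see the trade-off: the common approximation params ≈ 12 * n_layers * d_model^2 (transformer blocks only, ignoring embeddings) lets you compare a deep/narrow and a shallow/wide model at roughly the same parameter budget. The shapes below are illustrative:

    def approx_params(n_layers, d_model):
        # Attention + MLP weights per block, embeddings ignored.
        return 12 * n_layers * d_model ** 2

    deep_narrow = approx_params(n_layers=80, d_model=4096)    # deeper
    shallow_wide = approx_params(n_layers=32, d_model=6480)   # wider

    print(f"deep/narrow : {deep_narrow / 1e9:.1f}B params")   # ~16.1B
    print(f"shallow/wide: {shallow_wide / 1e9:.1f}B params")  # ~16.1B
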
Vision and Multimodal Architectures
Vision Transformers (ViT) apply attention to image patches. Multimodal models combine text transformers with vision encoders (CLIP, SigLIP). Diffusion models generate images through a different process entirely, iteratively denoising from random noise, typically with a U-Net or transformer backbone.
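
A sketch of ViT-style patch embedding with standard 224x224 image and 16x16 patch sizes (illustrative, not any specific model's code):

    # Split the image into patches and project each one to a token the
    # transformer can attend over, just like word tokens in text.
    import torch
    import torch.nn as nn

    img = torch.randn(1, 3, 224, 224)            # (batch, channels, height, width)
    patch, d_model = 16, 768
    to_tokens = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)

    tokens = to_tokens(img)                      # (1, 768, 14, 14)
    tokens = tokens.flatten(2).transpose(1, 2)   # (1, 196, 768): 196 patch tokens
    print(tokens.shape)
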
Key Terms
MoE (Mixture of Experts): an architecture that routes each token to specialized sub-networks, using only a fraction of the total parameters per forward pass.
Decoder-Only: a transformer variant that generates text autoregressively, predicting one token at a time; the dominant architecture for modern LLMs.
State-Space Model: an alternative to attention (Mamba, S4) that processes sequences in linear O(n) time rather than quadratic O(n^2).
Grouped Query Attention: a memory optimization where multiple attention heads share key/value projections, reducing KV cache memory 4-8x.