
Basic Theory

⚡ Level 3 - Professional

Model Types & Structures

Different model architectures and their trade-offs.

Not all neural networks are structured the same way. Different architectures have different strengths. Decoder-only transformers (GPT, Llama) excel at text generation. Encoder-decoder models (T5) are great for translation and summarization. MoE (Mixture of Experts) architectures enable much larger models by only activating a subset of parameters per input, and newer state-space models (Mamba) offer alternatives to the quadratic cost of attention.

Architecture choice has massive practical implications. MoE models like Mixtral deliver quality comparable to much larger dense models at a fraction of the inference compute, because only a small portion of the parameters activates per token. State-space models promise efficient processing of extremely long sequences. Understanding these trade-offs helps you choose the right model for your specific performance, cost, and latency requirements.

Key Topics Covered
Transformer Variants
Encoder-only (BERT: great for classification and embeddings), decoder-only (GPT, Llama: text generation), encoder-decoder (T5, BART: translation and summarization). Each variant processes text differently based on its attention mask pattern.
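
A minimal sketch of that mask difference; the sequence length and tensors here are illustrative, not taken from any particular model:

    # How the attention mask differs between encoder-style (bidirectional)
    # and decoder-style (causal) transformer variants.
    import torch

    seq_len = 5

    # Encoder-only (BERT-style): every token may attend to every other token.
    bidirectional_mask = torch.ones(seq_len, seq_len).bool()

    # Decoder-only (GPT-style): token i may attend only to positions <= i,
    # so generation can proceed left to right.
    causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()

    print(causal_mask.int())
    # tensor([[1, 0, 0, 0, 0],
    #         [1, 1, 0, 0, 0],
    #         [1, 1, 1, 0, 0],
    #         [1, 1, 1, 1, 0],
    #         [1, 1, 1, 1, 1]])
    # An encoder-decoder (T5-style) uses a bidirectional mask over the source,
    # a causal mask over the target, and cross-attention between the two.
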
Decoder-Only Dominance
GPT, Claude, Gemini, Llama all use decoder-only architecture. It won out because it scales efficiently, can be pre-trained with simple next-token prediction, and handles both understanding and generation in one model.
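
A small sketch of that next-token prediction objective; the logits and token batch below are random stand-ins for a real model and tokenizer:

    # Next-token prediction: position t is trained to predict token t+1.
    # This one objective covers both understanding the context and generating.
    import torch
    import torch.nn.functional as F

    vocab_size, batch, seq_len = 1000, 2, 8
    token_ids = torch.randint(0, vocab_size, (batch, seq_len))  # stand-in batch
    logits = torch.randn(batch, seq_len, vocab_size)            # stand-in model output

    pred_logits, targets = logits[:, :-1], token_ids[:, 1:]     # shift by one position
    loss = F.cross_entropy(pred_logits.reshape(-1, vocab_size), targets.reshape(-1))
    print(loss.item())
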
Mixture of Experts (MoE)
A learned router sends each token to a small subset of specialized expert sub-networks (for example, 2 of 8 in Mixtral). This means a model with 46B total parameters may use only about 12B per forward pass, making it dramatically cheaper to run than a dense model of the same total size, at comparable quality.
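
A toy top-2 routing sketch under made-up dimensions, just to show the mechanics (this is not the code of any real MoE model):

    # Minimal top-2 Mixture-of-Experts routing sketch (illustrative only).
    import torch
    import torch.nn as nn

    d_model, n_experts, top_k = 16, 8, 2
    experts = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_experts)])
    router = nn.Linear(d_model, n_experts)   # learned router

    x = torch.randn(10, d_model)             # 10 tokens
    weights, chosen = router(x).softmax(dim=-1).topk(top_k, dim=-1)
    weights = weights / weights.sum(dim=-1, keepdim=True)   # renormalize over top-k

    out = torch.zeros_like(x)
    for k in range(top_k):
        for e in range(n_experts):
            hit = chosen[:, k] == e          # tokens routed to expert e in slot k
            if hit.any():
                out[hit] += weights[hit][:, k:k + 1] * experts[e](x[hit])
    # Only 2 of the 8 expert MLPs run for each token, so per-token compute is a
    # fraction of what a dense layer of the same total size would cost.
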
MoE in Practice
Mixtral 8x7B (46B total, 12B active), DeepSeek-V3, and likely GPT-4. MoE models need more memory (all experts loaded) but compute less per token. Ideal for deployment where memory is cheap but compute/latency matter.
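
A rough back-of-envelope comparison using the Mixtral-style figures above, assuming fp16 weights (2 bytes per parameter) and roughly 2 FLOPs per active parameter per generated token; both assumptions are approximations:

    # Memory is set by total parameters (all experts loaded); per-token compute
    # is set by active parameters only.
    total_params = 46e9    # every expert must be resident in memory
    active_params = 12e9   # parameters actually exercised per token

    print(f"weight memory       : {total_params * 2 / 1e9:.0f} GB")        # ~92 GB
    print(f"MoE compute / token : {2 * active_params / 1e9:.0f} GFLOPs")   # ~24
    print(f"dense 46B / token   : {2 * total_params / 1e9:.0f} GFLOPs")    # ~92
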
Dense vs Sparse Models
Dense models (Llama, Claude) activate all parameters for every token: predictable and easier to optimize. Sparse models (MoE) activate only a fraction: more efficient, but harder to train and to balance across experts.
Multi-Head Attention
Multiple parallel attention heads capture different types of relationships: syntax, semantics, long-range dependencies. Modern models use 32-128 heads. Grouped Query Attention (GQA) reduces memory by sharing key/value heads.
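
A shape-level sketch of GQA, assuming 8 query heads sharing 2 key/value heads (all numbers here are illustrative):

    # Grouped Query Attention: only the 2 KV heads are cached, shrinking the
    # KV cache 4x relative to 8 full KV heads. Projections are random stand-ins.
    import torch
    import torch.nn.functional as F

    batch, seq, d_head = 1, 16, 64
    n_q_heads, n_kv_heads = 8, 2
    group = n_q_heads // n_kv_heads          # 4 query heads per KV head

    q = torch.randn(batch, n_q_heads, seq, d_head)
    k = torch.randn(batch, n_kv_heads, seq, d_head)   # only these are cached
    v = torch.randn(batch, n_kv_heads, seq, d_head)

    # Broadcast each KV head to its group of query heads, then attend as usual.
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    print(out.shape)   # torch.Size([1, 8, 16, 64])
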
State-Space Models
Mamba and S4 process sequences in O(n) time vs O(n^2) for attention. They maintain a compressed state that evolves as they read each token. Promising for extremely long sequences (100K+ tokens) where attention becomes prohibitively expensive.
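
A minimal linear state-space recurrence showing the core idea only; real Mamba layers add input-dependent parameters and a parallelized scan:

    # One pass over the sequence with a fixed-size state, so cost grows as O(n).
    import torch

    d_state, d_in, seq_len = 4, 1, 32
    A = torch.rand(d_state, d_state) * 0.1   # state transition (small, for stability)
    B = torch.randn(d_state, d_in)           # input projection
    C = torch.randn(d_in, d_state)           # output projection

    x = torch.randn(seq_len, d_in)
    h = torch.zeros(d_state, 1)              # compressed state carried across tokens
    ys = []
    for t in range(seq_len):
        h = A @ h + B @ x[t:t + 1].T         # state update: fixed cost per token
        ys.append((C @ h).T)
    y = torch.cat(ys)                        # (seq_len, d_in)
    print(y.shape)
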
Hybrid Architectures
Combining attention layers with state-space or linear attention layers. Jamba (AI21) mixes Mamba and Transformer layers. Hybrids aim to get the best of both: global attention for reasoning, efficient processing for long contexts.
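
An illustrative layer schedule for a hybrid stack (the ratio below is made up, not Jamba's actual configuration):

    # Cheap SSM blocks do most of the sequence processing, with periodic
    # attention layers mixed in for global context.
    n_layers = 24
    schedule = ["attention" if (i + 1) % 6 == 0 else "ssm" for i in range(n_layers)]
    print(f"{schedule.count('ssm')} SSM layers, {schedule.count('attention')} attention layers")
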
Model Depth vs Width
More layers (deeper) vs wider layers (more parameters per layer). Deeper models tend to reason better but are slower. Wider models process faster but may reason less deeply. Scaling laws help labs find optimal proportions.
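
A rough way to see the trade-off: the common approximation params ≈ 12 * n_layers * d_model^2 (transformer blocks only, ignoring embeddings) lets you compare a deep/narrow and a shallow/wide model at roughly the same parameter budget. The shapes below are illustrative:

    def approx_params(n_layers, d_model):
        # Attention + MLP weights per block, embeddings ignored.
        return 12 * n_layers * d_model ** 2

    deep_narrow = approx_params(n_layers=80, d_model=4096)    # deeper
    shallow_wide = approx_params(n_layers=32, d_model=6480)   # wider

    print(f"deep/narrow : {deep_narrow / 1e9:.1f}B params")   # ~16.1B
    print(f"shallow/wide: {shallow_wide / 1e9:.1f}B params")  # ~16.1B
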
Vision and Multimodal Architectures
Vision Transformers (ViT) apply attention to image patches. Multimodal models combine text transformers with vision encoders (CLIP, SigLIP). Diffusion models generate images through a different process entirely, iteratively denoising from random noise, typically with a U-Net or transformer backbone.
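
A sketch of ViT-style patch embedding with standard 224x224 image and 16x16 patch sizes (illustrative, not any specific model's code):

    # Split the image into patches and project each one to a token the
    # transformer can attend over, just like word tokens in text.
    import torch
    import torch.nn as nn

    img = torch.randn(1, 3, 224, 224)            # (batch, channels, height, width)
    patch, d_model = 16, 768
    to_tokens = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)

    tokens = to_tokens(img)                      # (1, 768, 14, 14)
    tokens = tokens.flatten(2).transpose(1, 2)   # (1, 196, 768): 196 patch tokens
    print(tokens.shape)
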
Key Terms
MoE (Mixture of Experts): an architecture that routes each token to specialized sub-networks, using only a fraction of the total parameters per forward pass.
Decoder-Only: a transformer variant that generates text autoregressively, predicting one token at a time; the dominant architecture for modern LLMs.
State-Space Model: an alternative to attention (Mamba, S4) that processes sequences in linear O(n) time rather than quadratic O(n^2).
Grouped Query Attention: a memory optimization where multiple attention heads share key/value projections, reducing KV cache memory 4-8x.