Large Language Models and the GPT architecture that started the revolution.
Large Language Models (LLMs) are neural networks trained on massive text datasets that can understand and generate human language. The "large" refers to both the number of parameters (billions) and the scale of training data (trillions of tokens from the internet). LLMs are the backbone of modern AI assistants like ChatGPT, Claude, and Gemini.
GPT (Generative Pre-trained Transformer) is the specific architecture family from OpenAI that popularized LLMs. The key insight was that scaling up a simple next-token prediction objective on internet-scale data produces remarkably capable models. This pattern of "pre-train at scale, then fine-tune for tasks" has become the dominant paradigm across all AI labs.
What is an LLM
A neural network with billions of parameters trained on massive text data. It learns the statistical structure of language and can generate, analyze, translate, and reason about text.
The Transformer Architecture
The 2017 "Attention Is All You Need" paper introduced self-attention β allowing each token to attend to every other token in the sequence. This replaced slower recurrent architectures and enabled parallelization.
GPT: Generative Pre-trained Transformer
OpenAI's decoder-only architecture. "Generative" = generates text, "Pre-trained" = trained on broad data first, "Transformer" = the underlying architecture. This design became the template for all modern LLMs.
Next-Token Prediction
LLMs generate text one token at a time: given all previous tokens, the model produces a probability distribution over the next token and picks one (greedily or by sampling). This simple objective, at scale, produces remarkably capable models.
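A toy sketch of that autoregressive loop. The toy_model function below is a stand-in that returns random logits over a tiny made-up vocabulary; the loop structure (condition on all previous tokens, turn logits into a distribution, pick a token, append it) is the same one a real LLM runs.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat", "<eos>"]

def toy_model(token_ids):
    """Stand-in for an LLM forward pass: returns logits for the next token."""
    return rng.normal(size=len(vocab))

def generate(prompt_ids, max_new_tokens=5, temperature=1.0):
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = toy_model(ids) / temperature           # condition on all previous tokens
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                            # softmax -> next-token distribution
        next_id = int(rng.choice(len(vocab), p=probs))  # sample one token
        ids.append(next_id)                             # feed it back in (autoregression)
        if vocab[next_id] == "<eos>":
            break
    return [vocab[i] for i in ids]

print(generate([0]))  # starts with "the", then a few randomly sampled toy tokens
```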
Scaling Laws
Research from OpenAI (Kaplan et al., 2020) and DeepMind's Chinchilla paper (2022) showed that model performance improves predictably with more compute, data, and parameters, and that compute is best spent by scaling parameters and training tokens together. These mathematical relationships drive the industry push toward larger models.
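A back-of-the-envelope sketch using two commonly cited rules of thumb from the Chinchilla line of work: train on roughly 20 tokens per parameter, and approximate training compute as C ≈ 6·N·D FLOPs. Both are approximations for illustration, not exact results.

```python
def chinchilla_estimate(n_params):
    """Rough compute-optimal token count and training FLOPs for a given model size."""
    tokens = 20 * n_params            # ~20 training tokens per parameter (rule of thumb)
    flops = 6 * n_params * tokens     # C ~= 6 * N * D (rule of thumb)
    return tokens, flops

for n in (7e9, 70e9, 405e9):
    tokens, flops = chinchilla_estimate(n)
    print(f"{n/1e9:>5.0f}B params -> ~{tokens/1e12:.2f}T tokens, ~{flops:.1e} training FLOPs")
```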
Emergent Abilities
At certain scales, models suddenly gain capabilities not present in smaller versions: in-context learning, chain-of-thought reasoning, code generation. These emerge from scale, not explicit programming.
Key Model Families
GPT-4/o1 (OpenAI), Claude 3.5/4 (Anthropic), Gemini (Google), Llama 3 (Meta), Qwen 2.5 (Alibaba), Mistral/Mixtral (Mistral AI). Each has different strengths and trade-offs.
Model Sizes
From 1B-parameter "small" models that can run on phones to 1T+-parameter frontier models that require data centers. Common sizes: 7B, 13B, 34B, 70B, 405B. Larger usually means more capable, but slower and more expensive to run.
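A rough sketch of how parameter count maps to weight memory at different precisions (fp16/bf16 uses 2 bytes per parameter, 4-bit quantization about 0.5). Real deployments also need memory for the KV cache and activations, which this ignores.

```python
def weight_memory_gb(n_params, bytes_per_param=2.0):
    """Memory needed just to hold the model weights, in gigabytes."""
    return n_params * bytes_per_param / 1e9

for n in (7e9, 70e9, 405e9):
    print(f"{n/1e9:>5.0f}B params: "
          f"{weight_memory_gb(n, 2.0):>7.1f} GB at fp16/bf16, "
          f"{weight_memory_gb(n, 0.5):>6.1f} GB at 4-bit")
```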
Training Pipeline
Pre-training (trillions of tokens, months of GPU time) → Supervised Fine-Tuning with human-curated examples → RLHF/DPO alignment to make models helpful and safe.
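A schematic sketch of the three stages. The functions are placeholders that only mark what each stage consumes; none of this is a real training loop.

```python
def pretrain(model, web_corpus):
    """Stage 1: next-token prediction over trillions of tokens of broad text."""
    return {**model, "stage": "pretrained"}

def supervised_fine_tune(model, curated_examples):
    """Stage 2: imitate human-written (prompt, response) pairs."""
    return {**model, "stage": "sft"}

def align(model, preference_data):
    """Stage 3: RLHF or DPO against human preference comparisons."""
    return {**model, "stage": "aligned"}

model = {"name": "toy-llm"}
model = pretrain(model, web_corpus="trillions of web tokens")
model = supervised_fine_tune(model, curated_examples="human-curated demonstrations")
model = align(model, preference_data="human preference pairs")
print(model)  # {'name': 'toy-llm', 'stage': 'aligned'}
```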
Inference
How models actually run: a forward pass through the network, a KV cache for efficient generation, batching of multiple requests together, and streaming tokens to the user as they are generated.
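A minimal NumPy sketch of KV caching for a single attention head: keys and values for past tokens are computed once and appended to a cache, so each decode step only projects the single newest token before attending over the full history. Weights and shapes are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = d_head = 8
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

def decode_step(new_token_vec, kv_cache):
    q = new_token_vec @ W_q                            # query for the new token only
    kv_cache["k"].append(new_token_vec @ W_k)          # cache its key and value once
    kv_cache["v"].append(new_token_vec @ W_v)
    K, V = np.stack(kv_cache["k"]), np.stack(kv_cache["v"])
    scores = K @ q / np.sqrt(d_head)                   # attend over all cached positions
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                           # softmax over the history
    return weights @ V                                 # contextualized vector for the new token

cache = {"k": [], "v": []}
for step in range(3):                                  # pretend three tokens are generated
    out = decode_step(rng.normal(size=d_model), cache)
print(len(cache["k"]), out.shape)                      # 3 cached positions, (8,) output
```

Without the cache, every decode step would recompute keys and values for the entire prefix, which is why KV caching is central to fast generation.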
Key Terms
LLM: Large Language Model; a neural network with billions of parameters trained to predict and generate text.
Transformer: Neural network architecture using self-attention, enabling parallel processing of sequences.
Next-Token Prediction: The core training objective: given previous tokens, predict the most likely next one.
Scaling Laws: Mathematical relationships showing model performance improves predictably with more compute, data, and parameters.
Autoregressive: Generating output one token at a time, where each new token depends on all previous ones.