Context windows, how models use context, and managing context effectively.
The context window is the total amount of text (measured in tokens) that a model can process in a single request, including both your input and the model's output. Think of it as the model's "working memory." Anything outside the context window simply doesn't exist for the model.
Context windows have grown dramatically: from 4K tokens in early GPT-3.5 to 200K (Claude) and 1M+ (Gemini). But bigger isn't always better: models often struggle to effectively use information in the middle of very long contexts. Understanding these dynamics is key to building effective AI applications.
What Is a Context Window
The total tokens (input + output) a model processes in one request: its working memory. Everything outside the context window simply doesn't exist for the model.
Context Window Sizes
GPT-4 (128K), Claude (200K), Gemini (1M+), open models (8K-128K). Bigger context means more information available, but cost and latency increase with context size.
How Attention Works
Each token "attends" to every other token β computational cost grows quadratically (O(n^2)). This is why very long contexts are expensive and why efficient attention methods matter.
Lost-in-the-Middle Problem
Models attend better to the beginning and end of context than the middle. Important information placed in the middle of a long context may be overlooked or given less weight.
Context Management Strategies
Summarization (compress older messages), chunking (process documents in pieces), prioritization (put most relevant info first/last). Essential skills for building production AI apps.
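As one concrete illustration of prioritization, here is a minimal sketch (`order_for_recall` is a hypothetical helper) that places the highest-scoring chunks at the start and end of the context, pushing the least relevant into the middle where attention tends to be weakest:

```python
def order_for_recall(chunks: list[str], relevance) -> list[str]:
    """Alternate ranked chunks between the front and back of the
    context so the most relevant material sits at the edges,
    mitigating the lost-in-the-middle effect."""
    ranked = sorted(chunks, key=relevance, reverse=True)
    front, back = [], []
    for i, chunk in enumerate(ranked):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]
```

The `relevance` callable is whatever scoring your retriever already produces; the reordering itself is free.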
RAG (Retrieval-Augmented Generation)
Pull relevant documents into context on demand rather than stuffing everything in. A search retrieves the most relevant chunks, which are then added to the prompt before generation.
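A toy end-to-end sketch of this flow, using word-overlap scoring as a stand-in for the vector search a real system would use (`retrieve` and `build_prompt` are hypothetical names):

```python
def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Toy lexical retriever: score each document by word overlap
    with the query and return the top-k. Real systems use embeddings."""
    q = set(query.lower().split())
    scored = sorted(docs, key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    """Add only the retrieved chunks to the prompt, not the whole corpus."""
    context = "\n".join(retrieve(query, docs))
    return f"Context:\n{context}\n\nQuestion: {query}"
```

The key point is the shape of the pipeline: search first, then generate with only the winning chunks in context.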
Conversation Memory
Chatbots simulate long-term memory by managing context: summarizing old messages, maintaining key facts, and selectively including relevant history in each new request.
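A minimal sketch of that pattern, assuming a `summarize` callable you supply (in practice, a cheap LLM call) and a hypothetical `MemoryManager` class:

```python
class MemoryManager:
    """Keeps context bounded: pins key facts, keeps the last
    `recent_n` messages verbatim, and summarizes everything older."""

    def __init__(self, recent_n: int = 4):
        self.recent_n = recent_n
        self.facts: list[str] = []            # pinned long-term facts
        self.history: list[tuple[str, str]] = []  # full transcript

    def add(self, role: str, text: str) -> None:
        self.history.append((role, text))

    def context(self, summarize) -> str:
        old = self.history[:-self.recent_n]
        recent = self.history[-self.recent_n:]
        parts = []
        if self.facts:
            parts.append("Known facts: " + "; ".join(self.facts))
        if old:
            parts.append("Summary of earlier conversation: " + summarize(old))
        parts += [f"{role}: {text}" for role, text in recent]
        return "\n".join(parts)
```

Each request then sends `context(...)` instead of the raw transcript, so cost stays roughly constant as the conversation grows.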
Context Engineering
Deliberate structuring of what goes into the context window: what to include, what to summarize, what to omit. Arguably more important than prompt engineering for complex applications.
Sliding Window Processing
For documents longer than the context window, process in overlapping chunks that "slide" through the content. Each chunk shares some overlap with the previous for continuity.
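A minimal sketch of such a window over a token list (`sliding_chunks` is a hypothetical helper; a real pipeline would operate on tokenizer output rather than arbitrary items):

```python
def sliding_chunks(tokens: list, window: int = 512, overlap: int = 64):
    """Yield overlapping windows over `tokens`. The stride is
    window - overlap, so each chunk shares `overlap` tokens with
    the previous one for continuity."""
    step = window - overlap
    for start in range(0, max(len(tokens) - overlap, 1), step):
        yield tokens[start:start + window]
```

Larger overlap preserves more cross-chunk continuity at the cost of reprocessing more tokens.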
Multi-Turn Conversation Costs
Each message in a conversation consumes context. As conversations grow, old messages get truncated or summarized. Understanding this helps you design chatbots that remain coherent over time.
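The simplest truncation strategy can be sketched as follows, using whitespace word count as a stand-in for a real tokenizer (`fit_to_budget` and `count_tokens` are hypothetical names):

```python
def fit_to_budget(messages: list[str], budget: int,
                  count_tokens=lambda m: len(m.split())) -> list[str]:
    """Drop the oldest messages until the conversation fits the
    token budget; the newest messages are kept intact."""
    kept, used = [], 0
    for msg in reversed(messages):       # walk newest-first
        cost = count_tokens(msg)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return kept[::-1]                    # restore chronological order
```

Production systems usually combine this with summarization so dropped messages leave a compressed trace instead of vanishing entirely.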
Context Window: Maximum tokens a model processes at once, its working memory for a single request.
Lost-in-the-Middle: Models attend better to the start and end of context, often missing information in the middle.
RAG: Retrieval-Augmented Generation, dynamically retrieving relevant documents to add to the context.
Context Engineering: The practice of deliberately structuring and managing what information enters the model's context.