Context windows, how models use context, and managing context effectively.
The context window is the total amount of text (measured in tokens) that a model can process in a single request, including both your input and the model's output. Think of it as the model's "working memory." Anything outside the context window simply doesn't exist for the model.
Context windows have grown dramatically: from 4K tokens in early GPT-3.5 to 200K (Claude) and 1M+ (Gemini). But bigger isn't always better: models often struggle to effectively use information in the middle of very long contexts. Understanding these dynamics is key to building effective AI applications.
What Is a Context Window
The total tokens (input + output) a model processes in one request: its working memory. Everything outside the context window simply doesn't exist for the model.
Context Window Sizes
GPT-4 (128K), Claude (200K), Gemini (1M+), open models (8K-128K). Bigger context means more information available, but cost and latency increase with context size.
How Attention Works
Each token "attends" to every other token β computational cost grows quadratically (O(n^2)). This is why very long contexts are expensive and why efficient attention methods matter.
Lost-in-the-Middle Problem
Models attend better to the beginning and end of context than the middle. Important information placed in the middle of a long context may be overlooked or given less weight.
Context Management Strategies
Summarization (compress older messages), chunking (process documents in pieces), prioritization (put most relevant info first/last). Essential skills for building production AI apps.
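As one concrete illustration of prioritization, here is a minimal sketch (`order_for_recall` is a hypothetical helper) that places the highest-scoring chunks at the start and end of the context, pushing the least relevant into the middle where attention tends to be weakest:

```python
def order_for_recall(chunks: list[str], relevance) -> list[str]:
    """Alternate ranked chunks between the front and back of the
    context so the most relevant material sits at the edges,
    mitigating the lost-in-the-middle effect."""
    ranked = sorted(chunks, key=relevance, reverse=True)
    front, back = [], []
    for i, chunk in enumerate(ranked):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]
```

The `relevance` callable is whatever scoring your retriever already produces; the reordering itself is free.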
RAG (Retrieval-Augmented Generation)
Pull relevant documents into context on demand rather than stuffing everything in. A search retrieves the most relevant chunks, which are then added to the prompt before generation.
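A toy end-to-end sketch of this flow, using word-overlap scoring as a stand-in for the vector search a real system would use (`retrieve` and `build_prompt` are hypothetical names):

```python
def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Toy lexical retriever: score each document by word overlap
    with the query and return the top-k. Real systems use embeddings."""
    q = set(query.lower().split())
    scored = sorted(docs, key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    """Add only the retrieved chunks to the prompt, not the whole corpus."""
    context = "\n".join(retrieve(query, docs))
    return f"Context:\n{context}\n\nQuestion: {query}"
```

The key point is the shape of the pipeline: search first, then generate with only the winning chunks in context.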
Conversation Memory
Chatbots simulate long-term memory by managing context: summarizing old messages, maintaining key facts, and selectively including relevant history in each new request.
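A minimal sketch of that pattern, assuming a `summarize` callable you supply (in practice, a cheap LLM call) and a hypothetical `MemoryManager` class:

```python
class MemoryManager:
    """Keeps context bounded: pins key facts, keeps the last
    `recent_n` messages verbatim, and summarizes everything older."""

    def __init__(self, recent_n: int = 4):
        self.recent_n = recent_n
        self.facts: list[str] = []            # pinned long-term facts
        self.history: list[tuple[str, str]] = []  # full transcript

    def add(self, role: str, text: str) -> None:
        self.history.append((role, text))

    def context(self, summarize) -> str:
        old = self.history[:-self.recent_n]
        recent = self.history[-self.recent_n:]
        parts = []
        if self.facts:
            parts.append("Known facts: " + "; ".join(self.facts))
        if old:
            parts.append("Summary of earlier conversation: " + summarize(old))
        parts += [f"{role}: {text}" for role, text in recent]
        return "\n".join(parts)
```

Each request then sends `context(...)` instead of the raw transcript, so cost stays roughly constant as the conversation grows.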
Context Engineering
Deliberate structuring of what goes into the context window: what to include, what to summarize, what to omit. Arguably more important than prompt engineering for complex applications.
Sliding Window Processing
For documents longer than the context window, process in overlapping chunks that "slide" through the content. Each chunk shares some overlap with the previous for continuity.
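A minimal sketch of such a window over a token list (`sliding_chunks` is a hypothetical helper; a real pipeline would operate on tokenizer output rather than arbitrary items):

```python
def sliding_chunks(tokens: list, window: int = 512, overlap: int = 64):
    """Yield overlapping windows over `tokens`. The stride is
    window - overlap, so each chunk shares `overlap` tokens with
    the previous one for continuity."""
    step = window - overlap
    for start in range(0, max(len(tokens) - overlap, 1), step):
        yield tokens[start:start + window]
```

Larger overlap preserves more cross-chunk continuity at the cost of reprocessing more tokens.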
Multi-Turn Conversation Costs
Each message in a conversation consumes context. As conversations grow, old messages get truncated or summarized. Understanding this helps you design chatbots that remain coherent over time.
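The simplest truncation strategy can be sketched as follows, using whitespace word count as a stand-in for a real tokenizer (`fit_to_budget` and `count_tokens` are hypothetical names):

```python
def fit_to_budget(messages: list[str], budget: int,
                  count_tokens=lambda m: len(m.split())) -> list[str]:
    """Drop the oldest messages until the conversation fits the
    token budget; the newest messages are kept intact."""
    kept, used = [], 0
    for msg in reversed(messages):       # walk newest-first
        cost = count_tokens(msg)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return kept[::-1]                    # restore chronological order
```

Production systems usually combine this with summarization so dropped messages leave a compressed trace instead of vanishing entirely.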
Context Window: Maximum tokens a model processes at once, its working memory for a single request.
Lost-in-the-Middle: Models attend better to the start and end of context, often missing information in the middle.
RAG: Retrieval-Augmented Generation, dynamically retrieving relevant documents to add to the context.
Context Engineering: The practice of deliberately structuring and managing what information enters the model's context.