
Basic Theory

💡 Level 2 — User

Token

How models process text through tokenization, and why the token is the fundamental unit of LLM computation.

Tokens are the fundamental units that LLMs work with. They are not characters, not words, but subword pieces — typically 3-4 characters of English text. The word "tokenization" becomes roughly ["token", "ization"]. Understanding tokens is critical because they determine costs, context window limits, and model behavior.
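You can see the split for yourself with a tokenizer library. Below is a minimal sketch using OpenAI's tiktoken package and its cl100k_base encoding; the exact pieces and IDs vary by tokenizer.

```python
# pip install tiktoken
import tiktoken

# cl100k_base is the encoding used by GPT-3.5/GPT-4 era models.
enc = tiktoken.get_encoding("cl100k_base")

ids = enc.encode("tokenization")
pieces = [enc.decode([i]) for i in ids]

print(ids)     # numeric token IDs
print(pieces)  # subword pieces, e.g. ['token', 'ization']
```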

Every interaction with an AI model involves counting tokens: your input is measured in tokens, the model's output is counted in tokens, and you pay per token. The context window — how much text the model can "see" at once — is measured in tokens. A typical page of English text is about 500 tokens.
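To make these numbers concrete, here is a rough back-of-the-envelope sketch; the ~500 tokens-per-page figure is the approximation above, and real counts depend on the tokenizer.

```python
# Rough arithmetic using the ~500 tokens/page approximation above.
TOKENS_PER_PAGE = 500

book_pages = 300
book_tokens = book_pages * TOKENS_PER_PAGE  # ~150,000 tokens

# Context window sizes discussed in this lesson.
for window in (4_000, 128_000, 200_000, 1_000_000):
    fits = "fits" if book_tokens <= window else "does not fit"
    print(f"A {book_pages}-page book (~{book_tokens:,} tokens) {fits} in a {window:,}-token window")
```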

Key Topics Covered
What Is a Token
Subword units — not characters or words. "Hello world" is 2 tokens. "Tokenization" becomes ["token", "ization"]. Typically 3-4 characters of English text per token.
Tokenization Algorithms
BPE (Byte Pair Encoding) iteratively merges frequent character pairs. SentencePiece operates directly on raw text, so it handles any language or script. tiktoken is OpenAI's fast tokenizer. Each model family uses its own tokenizer, as the toy merge loop sketched below illustrates.
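As a rough illustration of the BPE idea (not any model's actual tokenizer), the toy sketch below repeatedly merges the most frequent adjacent pair of symbols in a tiny corpus:

```python
from collections import Counter

def bpe_merges(words, num_merges=10):
    """Toy BPE trainer: words start as character sequences, and the most
    frequent adjacent pair is merged into a single symbol each round."""
    corpus = Counter(tuple(w) for w in words)  # word -> frequency, word as symbol tuple
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the corpus, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge everywhere it occurs.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

merges, corpus = bpe_merges(["token", "tokens", "tokenization", "tokenize"], num_merges=8)
print(merges)        # learned merge rules, e.g. ('t', 'o'), ('to', 'k'), ...
print(list(corpus))  # words as merged subword sequences
```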
Context Window Sizes
4K tokens (early GPT-3.5) → 128K (GPT-4 Turbo and GPT-4o) → 200K (Claude) → 1M+ (Gemini). Context windows have grown roughly 250x in just two years, dramatically expanding what models can process.
Token Pricing
Typical costs range from under $1 to tens of dollars per million tokens depending on model tier: Claude 3 Haiku ~$0.25/M input, GPT-4o ~$2.50/M input, Claude Opus ~$15/M input. Understanding pricing enables cost optimization.
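As a sketch of how these rates translate to real requests, here is a quick comparison using the illustrative input prices quoted above; prices change over time, so check the providers' pricing pages before relying on them.

```python
# Input prices in dollars per million tokens, taken from the figures above.
# Treat them as illustrative only.
INPUT_PRICE_PER_M = {
    "claude-3-haiku": 0.25,
    "gpt-4o": 2.50,
    "claude-opus": 15.00,
}

prompt_tokens = 10_000  # roughly a 20-page document at ~500 tokens/page

for model, price in INPUT_PRICE_PER_M.items():
    cost = prompt_tokens / 1_000_000 * price
    print(f"{model:>15}: ${cost:.4f} for {prompt_tokens:,} input tokens")
```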
Language Differences
Ukrainian, Chinese, Arabic, and other non-Latin scripts use 2-3x more tokens than English for equivalent content. This directly impacts costs and effective context window size.
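A quick way to see the effect is to tokenize the same sentence in two languages. The sketch below uses tiktoken's cl100k_base encoding and a rough Ukrainian translation of my own; exact counts vary by tokenizer, but the non-English version typically needs noticeably more tokens.

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

english = "Tokens are the fundamental units that language models work with."
ukrainian = "Токени є базовими одиницями, з якими працюють мовні моделі."  # rough translation for illustration

for label, text in [("English", english), ("Ukrainian", ukrainian)]:
    n = len(enc.encode(text))
    print(f"{label}: {n} tokens for {len(text)} characters")
```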
Special Tokens
Control tokens like <|im_start|>, <|im_end|>, [PAD], [SEP] are used internally by models to mark message boundaries, roles, and sequence structure. You rarely see them but they consume context.
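Chat APIs wrap every message in these control tokens, so each message carries a small fixed overhead on top of its visible text. The counting sketch below treats that overhead as roughly 4 tokens per message; the exact framing differs per model and chat format, so this is an approximation only.

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Assumed overhead per message for boundary/role tokens such as
# <|im_start|> and <|im_end|>; the exact number depends on the model's chat format.
TOKENS_PER_MESSAGE = 4

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain tokenization in one sentence."},
]

total = sum(len(enc.encode(m["content"])) + TOKENS_PER_MESSAGE for m in messages)
print(f"Approximate prompt size: {total} tokens (including chat-format overhead)")
```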
Token Counting Tools
tiktoken (OpenAI), the Anthropic tokenizer, Hugging Face tokenizers — use these to predict costs and check if your prompt fits within the context window before sending.
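A minimal pre-flight check with tiktoken might look like the sketch below; the window size and the budget reserved for the reply are assumptions you would adjust per model.

```python
# pip install tiktoken
import tiktoken

def fits_in_context(prompt: str, context_window: int = 128_000,
                    reserved_for_output: int = 4_000) -> bool:
    """Return True if the prompt leaves room for the reply within the window."""
    enc = tiktoken.get_encoding("cl100k_base")  # pick the encoding for your model
    prompt_tokens = len(enc.encode(prompt))
    return prompt_tokens + reserved_for_output <= context_window

print(fits_in_context("Summarize the attached report. " * 1000))
```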
Cost Optimization
Shorter prompts = cheaper, but too short = worse quality. The art is finding the minimum effective prompt length. Removing unnecessary context and boilerplate saves money at scale.
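One way to make this concrete is to measure the token difference between a verbose prompt and a trimmed one, then scale it by request volume. The texts, price, and call volume below are illustrative placeholders.

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

verbose = (
    "Hello! I hope you are doing well today. I was wondering, if it is not "
    "too much trouble, whether you could possibly summarize the following text "
    "for me in three sentences. Thank you so much in advance!\n\n"
)
concise = "Summarize the following text in three sentences.\n\n"

saved_per_call = len(enc.encode(verbose)) - len(enc.encode(concise))
calls_per_month = 1_000_000       # illustrative volume
price_per_m_input = 2.50          # illustrative $/M input tokens

monthly_savings = saved_per_call * calls_per_month / 1_000_000 * price_per_m_input
print(f"Tokens saved per call: {saved_per_call}")
print(f"Approximate monthly savings: ${monthly_savings:,.2f}")
```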
Prompt Caching
Many APIs cache common prompt prefixes to reduce costs on repeated calls. Anthropic's prompt caching cuts the price of cached input tokens by about 90%, and OpenAI's automatic caching discounts them by roughly 50%, which adds up quickly for long, repeated system prompts.
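Here is a sketch of how this looks with the Anthropic API's cache_control marker on a long, reused system prompt; the model name and prompt text are placeholders, so check the provider documentation for current details and minimum cacheable prompt lengths.

```python
# pip install anthropic
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_SYSTEM_PROMPT = "You are a support assistant for ExampleCorp. ..."  # imagine several thousand tokens here

response = client.messages.create(
    model="claude-3-5-haiku-latest",  # placeholder; use whichever model you target
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Mark the prefix as cacheable so repeated calls reuse it at a discount.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Where is my order #1234?"}],
)

# usage reports how many input tokens were written to / read from the cache.
print(response.usage)
```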
Input vs Output Token Pricing
Output tokens are typically 2-5x more expensive than input tokens. Generating text costs more than reading it. This incentivizes concise outputs and affects application design decisions.
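A small sketch of how the asymmetry plays out for a single request, using illustrative rates with output priced at 5x input, in line with the ratio above:

```python
# Illustrative prices in $/M tokens; output at 5x input, per the ratio above.
input_price_per_m = 3.00
output_price_per_m = 15.00

input_tokens = 2_000    # a short prompt plus some context
output_tokens = 1_500   # a long generated answer

input_cost = input_tokens / 1_000_000 * input_price_per_m
output_cost = output_tokens / 1_000_000 * output_price_per_m

print(f"Input cost:  ${input_cost:.4f}")
print(f"Output cost: ${output_cost:.4f}")   # larger despite fewer tokens
```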
Key Terms
Token
The basic unit of text that LLMs process — a subword piece typically 3-4 characters long.
BPE
Byte Pair Encoding — a tokenization algorithm that iteratively merges the most frequent character pairs.
Context Window
The maximum number of tokens a model can process in a single request (input + output combined).
Prompt Caching
API feature that caches common prompt prefixes to reduce cost on repeated similar requests.