
Basic Theory

💡 Level 2 — User

Token

How models process text through tokenization, and why the token is the fundamental unit of LLM computation.

Tokens are the fundamental units that LLMs work with. They are not characters, not words, but subword pieces — typically 3-4 characters of English text. The word "tokenization" becomes roughly ["token", "ization"]. Understanding tokens is critical because they determine costs, context window limits, and model behavior.
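You can see the split for yourself with a tokenizer library. Below is a minimal sketch using OpenAI's tiktoken package and its cl100k_base encoding; the exact pieces and IDs vary by tokenizer.

```python
# pip install tiktoken
import tiktoken

# cl100k_base is the encoding used by GPT-3.5/GPT-4 era models.
enc = tiktoken.get_encoding("cl100k_base")

ids = enc.encode("tokenization")
pieces = [enc.decode([i]) for i in ids]

print(ids)     # numeric token IDs
print(pieces)  # subword pieces, e.g. ['token', 'ization']
```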

Every interaction with an AI model involves counting tokens: your input is measured in tokens, the model's output is counted in tokens, and you pay per token. The context window — how much text the model can "see" at once — is measured in tokens. A typical page of English text is about 500 tokens.
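To make these numbers concrete, here is a rough back-of-the-envelope sketch; the ~500 tokens-per-page figure is the approximation above, and real counts depend on the tokenizer.

```python
# Rough arithmetic using the ~500 tokens/page approximation above.
TOKENS_PER_PAGE = 500

book_pages = 300
book_tokens = book_pages * TOKENS_PER_PAGE  # ~150,000 tokens

# Context window sizes discussed in this lesson.
for window in (4_000, 128_000, 200_000, 1_000_000):
    fits = "fits" if book_tokens <= window else "does not fit"
    print(f"A {book_pages}-page book (~{book_tokens:,} tokens) {fits} in a {window:,}-token window")
```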

Key Topics Covered
What Is a Token
Subword units — not characters or words. "Hello world" is 2 tokens. "Tokenization" becomes ["token", "ization"]. Typically 3-4 characters of English text per token.
Tokenization Algorithms
BPE (Byte Pair Encoding) iteratively merges frequent character pairs. SentencePiece operates directly on raw text, so it handles any language or script. tiktoken is OpenAI's fast tokenizer. Each model family uses its own tokenizer, as the toy merge loop sketched below illustrates.
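As a rough illustration of the BPE idea (not any model's actual tokenizer), the toy sketch below repeatedly merges the most frequent adjacent pair of symbols in a tiny corpus:

```python
from collections import Counter

def bpe_merges(words, num_merges=10):
    """Toy BPE trainer: words start as character sequences, and the most
    frequent adjacent pair is merged into a single symbol each round."""
    corpus = Counter(tuple(w) for w in words)  # word -> frequency, word as symbol tuple
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the corpus, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge everywhere it occurs.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

merges, corpus = bpe_merges(["token", "tokens", "tokenization", "tokenize"], num_merges=8)
print(merges)        # learned merge rules, e.g. ('t', 'o'), ('to', 'k'), ...
print(list(corpus))  # words as merged subword sequences
```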
Context Window Sizes
4K tokens (early GPT-3.5) → 128K (GPT-4 Turbo and GPT-4o) → 200K (Claude) → 1M+ (Gemini). Context windows have grown roughly 250x in just two years, dramatically expanding what models can process.
Token Pricing
Typical costs range from under $1 to tens of dollars per million tokens depending on model tier: Claude 3 Haiku ~$0.25/M input, GPT-4o ~$2.50/M input, Claude Opus ~$15/M input. Understanding pricing enables cost optimization.
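As a sketch of how these rates translate to real requests, here is a quick comparison using the illustrative input prices quoted above; prices change over time, so check the providers' pricing pages before relying on them.

```python
# Input prices in dollars per million tokens, taken from the figures above.
# Treat them as illustrative only.
INPUT_PRICE_PER_M = {
    "claude-3-haiku": 0.25,
    "gpt-4o": 2.50,
    "claude-opus": 15.00,
}

prompt_tokens = 10_000  # roughly a 20-page document at ~500 tokens/page

for model, price in INPUT_PRICE_PER_M.items():
    cost = prompt_tokens / 1_000_000 * price
    print(f"{model:>15}: ${cost:.4f} for {prompt_tokens:,} input tokens")
```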
Language Differences
Ukrainian, Chinese, Arabic, and other non-Latin scripts use 2-3x more tokens than English for equivalent content. This directly impacts costs and effective context window size.
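A quick way to see the effect is to tokenize the same sentence in two languages. The sketch below uses tiktoken's cl100k_base encoding and a rough Ukrainian translation of my own; exact counts vary by tokenizer, but the non-English version typically needs noticeably more tokens.

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

english = "Tokens are the fundamental units that language models work with."
ukrainian = "Токени є базовими одиницями, з якими працюють мовні моделі."  # rough translation for illustration

for label, text in [("English", english), ("Ukrainian", ukrainian)]:
    n = len(enc.encode(text))
    print(f"{label}: {n} tokens for {len(text)} characters")
```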
Special Tokens
Control tokens like <|im_start|>, <|im_end|>, [PAD], [SEP] are used internally by models to mark message boundaries, roles, and sequence structure. You rarely see them but they consume context.
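Chat APIs wrap every message in these control tokens, so each message carries a small fixed overhead on top of its visible text. The counting sketch below treats that overhead as roughly 4 tokens per message; the exact framing differs per model and chat format, so this is an approximation only.

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Assumed overhead per message for boundary/role tokens such as
# <|im_start|> and <|im_end|>; the exact number depends on the model's chat format.
TOKENS_PER_MESSAGE = 4

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain tokenization in one sentence."},
]

total = sum(len(enc.encode(m["content"])) + TOKENS_PER_MESSAGE for m in messages)
print(f"Approximate prompt size: {total} tokens (including chat-format overhead)")
```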
Token Counting Tools
tiktoken (OpenAI), the Anthropic tokenizer, Hugging Face tokenizers — use these to predict costs and check if your prompt fits within the context window before sending.
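A minimal pre-flight check with tiktoken might look like the sketch below; the window size and the budget reserved for the reply are assumptions you would adjust per model.

```python
# pip install tiktoken
import tiktoken

def fits_in_context(prompt: str, context_window: int = 128_000,
                    reserved_for_output: int = 4_000) -> bool:
    """Return True if the prompt leaves room for the reply within the window."""
    enc = tiktoken.get_encoding("cl100k_base")  # pick the encoding for your model
    prompt_tokens = len(enc.encode(prompt))
    return prompt_tokens + reserved_for_output <= context_window

print(fits_in_context("Summarize the attached report. " * 1000))
```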
Cost Optimization
Shorter prompts = cheaper, but too short = worse quality. The art is finding the minimum effective prompt length. Removing unnecessary context and boilerplate saves money at scale.
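One way to make this concrete is to measure the token difference between a verbose prompt and a trimmed one, then scale it by request volume. The texts, price, and call volume below are illustrative placeholders.

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

verbose = (
    "Hello! I hope you are doing well today. I was wondering, if it is not "
    "too much trouble, whether you could possibly summarize the following text "
    "for me in three sentences. Thank you so much in advance!\n\n"
)
concise = "Summarize the following text in three sentences.\n\n"

saved_per_call = len(enc.encode(verbose)) - len(enc.encode(concise))
calls_per_month = 1_000_000       # illustrative volume
price_per_m_input = 2.50          # illustrative $/M input tokens

monthly_savings = saved_per_call * calls_per_month / 1_000_000 * price_per_m_input
print(f"Tokens saved per call: {saved_per_call}")
print(f"Approximate monthly savings: ${monthly_savings:,.2f}")
```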
Prompt Caching
Many APIs cache common prompt prefixes to reduce costs on repeated calls. Anthropic's prompt caching cuts the price of cached input tokens by about 90%, and OpenAI's automatic caching discounts them by roughly 50%, which adds up quickly for long, repeated system prompts.
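Here is a sketch of how this looks with the Anthropic API's cache_control marker on a long, reused system prompt; the model name and prompt text are placeholders, so check the provider documentation for current details and minimum cacheable prompt lengths.

```python
# pip install anthropic
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_SYSTEM_PROMPT = "You are a support assistant for ExampleCorp. ..."  # imagine several thousand tokens here

response = client.messages.create(
    model="claude-3-5-haiku-latest",  # placeholder; use whichever model you target
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Mark the prefix as cacheable so repeated calls reuse it at a discount.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Where is my order #1234?"}],
)

# usage reports how many input tokens were written to / read from the cache.
print(response.usage)
```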
Input vs Output Token Pricing
Output tokens are typically 2-5x more expensive than input tokens. Generating text costs more than reading it. This incentivizes concise outputs and affects application design decisions.
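A small sketch of how the asymmetry plays out for a single request, using illustrative rates with output priced at 5x input, in line with the ratio above:

```python
# Illustrative prices in $/M tokens; output at 5x input, per the ratio above.
input_price_per_m = 3.00
output_price_per_m = 15.00

input_tokens = 2_000    # a short prompt plus some context
output_tokens = 1_500   # a long generated answer

input_cost = input_tokens / 1_000_000 * input_price_per_m
output_cost = output_tokens / 1_000_000 * output_price_per_m

print(f"Input cost:  ${input_cost:.4f}")
print(f"Output cost: ${output_cost:.4f}")   # larger despite fewer tokens
```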
Key Terms
Token
The basic unit of text that LLMs process — a subword piece typically 3-4 characters long.
BPE
Byte Pair Encoding — a tokenization algorithm that iteratively merges the most frequent character pairs.
Context Window
The maximum number of tokens a model can process in a single request (input + output combined).
Prompt Caching
API feature that caches common prompt prefixes to reduce cost on repeated similar requests.