Categorizing AI models by what data types they handle as input and output.
AI models can be categorized by the types of data they process as inputs and produce as outputs. Understanding this classification helps you choose the right model for each task. A text-to-text model (LLM) handles different tasks than an image-to-text model (captioning) or a text-to-image model (diffusion).
Modern models increasingly blur these boundaries — multimodal foundation models can handle multiple data types in a single conversation. But understanding the underlying classification helps you design effective AI pipelines and choose appropriate APIs.
Text → Text (LLMs)
Chat, writing, analysis, translation, summarization. Models: GPT-4, Claude, Gemini, Llama. The most mature and widely used category of generative AI.
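A minimal sketch of a text-to-text call, assuming the OpenAI Python SDK and a placeholder model name; any chat-style LLM API follows the same request/response shape:

```python
# Sketch of a text -> text (LLM) request. Model name and prompt are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # assumed model name; substitute whichever LLM you use
    messages=[
        {"role": "system", "content": "You are a concise technical summarizer."},
        {"role": "user", "content": "Summarize the benefits of unit testing in two sentences."},
    ],
)
print(response.choices[0].message.content)
```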
Text → Image (Diffusion)
Generate images from text descriptions. Models: DALL-E 3, Midjourney, Stable Diffusion, Flux. Quality has improved from abstract art to photorealistic outputs in just 2 years.
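A sketch of local text-to-image generation, assuming the Hugging Face diffusers library and a Stable Diffusion checkpoint (the checkpoint name and prompt are illustrative):

```python
# Sketch of text -> image generation with a diffusion model.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # assumed checkpoint; any SD-compatible model works
    torch_dtype=torch.float16,
).to("cuda")

image = pipe("a watercolor painting of a lighthouse at dawn").images[0]
image.save("lighthouse.png")
```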
Image → Text (Vision)
Captioning, OCR, visual question-answering. Models: GPT-4V, Claude Vision, Gemini Pro Vision. Enables AI to "see" and reason about images, screenshots, and documents.
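A sketch of an image-to-text (captioning) request, again assuming the OpenAI SDK; the model name and image URL are placeholders, and Claude and Gemini expose similar multimodal chat APIs:

```python
# Sketch of an image -> text request: send text + image, get a description back.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",  # assumed vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},  # placeholder URL
        ],
    }],
)
print(response.choices[0].message.content)
```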
Text → Audio (TTS)
Voice synthesis from text. Models: ElevenLabs, OpenAI TTS, Bark. Modern TTS produces near-human quality speech with emotion, accents, and multiple languages.
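A sketch of text-to-speech synthesis, assuming OpenAI's TTS endpoint; the model and voice names are assumptions, and ElevenLabs and other providers offer comparable APIs:

```python
# Sketch of text -> audio synthesis.
from openai import OpenAI

client = OpenAI()
speech = client.audio.speech.create(
    model="tts-1",   # assumed TTS model name
    voice="alloy",   # assumed voice
    input="Welcome back! Your build finished successfully.",
)
speech.write_to_file("welcome.mp3")  # save the generated speech locally
```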
Audio → Text (Speech Recognition)
Transcription and speech-to-text. Models: Whisper, AssemblyAI, Deepgram. Enables voice interfaces, meeting transcription, and accessibility features.
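A sketch of audio-to-text transcription using the open-source Whisper package; the audio filename is a placeholder, and hosted services like AssemblyAI or Deepgram work similarly over HTTP:

```python
# Sketch of audio -> text transcription with open-source Whisper.
import whisper

model = whisper.load_model("base")        # smaller checkpoints trade accuracy for speed
result = model.transcribe("meeting.mp3")  # assumed local audio file
print(result["text"])
```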
Text → Video
Generate video clips from text descriptions. Models: Sora, Runway Gen-3, Kling, Pika. The newest frontier: quality is improving rapidly, but outputs are still limited to short clips.
Text → Code
Code generation and completion from natural language. Models: GPT-4, Claude, Codex, StarCoder. Powers tools like GitHub Copilot, Cursor, and Claude Code.
Code → Text
Code explanation, documentation generation, and review. All major LLMs excel at reading and explaining code, making it one of the highest-value AI applications.
Image → Image
Image editing, style transfer, super-resolution, inpainting. Models: ControlNet, InstructPix2Pix. These models transform existing images rather than generating them from scratch.
Audio → Audio
Voice conversion, music remixing, noise removal, audio enhancement. Specialized models that transform audio inputs without passing through text as an intermediate representation.
Modality: The type of data a model works with (text, image, audio, video, or code).
Pipeline: A chain of models processing data, e.g., audio→text→text→audio for a voice chatbot.
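A sketch of that audio→text→text→audio pipeline (a voice chatbot), assuming OpenAI-hosted models at each stage; model names and filenames are placeholders, and each stage can be swapped for any provider with the same modality:

```python
# Sketch of a voice-chatbot pipeline: audio -> text -> text -> audio.
from openai import OpenAI

client = OpenAI()

# 1. Audio -> text: transcribe the user's spoken question.
with open("question.wav", "rb") as f:  # assumed local recording
    transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

# 2. Text -> text: generate a reply with an LLM.
reply = client.chat.completions.create(
    model="gpt-4o",  # assumed model name
    messages=[{"role": "user", "content": transcript.text}],
)
answer = reply.choices[0].message.content

# 3. Text -> audio: synthesize the spoken response.
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=answer)
speech.write_to_file("answer.mp3")
```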
Embedding: A numerical representation of data (text, image) in a vector space, enabling semantic search.
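A sketch of semantic search with embeddings: embed documents and a query, then rank by cosine similarity. The embedding model name is an assumption; any API that returns fixed-length vectors fits this pattern.

```python
# Sketch of embedding-based semantic search.
import numpy as np
from openai import OpenAI

client = OpenAI()

docs = ["How to reset a password", "Quarterly revenue report", "Password recovery steps"]
query = "I forgot my login credentials"

def embed(texts):
    out = client.embeddings.create(model="text-embedding-3-small", input=texts)  # assumed model
    return np.array([item.embedding for item in out.data])

doc_vecs = embed(docs)
query_vec = embed([query])[0]

# Cosine similarity ranks documents by semantic closeness to the query.
scores = doc_vecs @ query_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec))
print(docs[int(np.argmax(scores))])  # expected: one of the password-related documents
```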