Categorizing AI models by what data types they handle as input and output.
AI models can be categorized by the types of data they process as inputs and produce as outputs. Understanding this classification helps you choose the right model for each task. A text-to-text model (LLM) handles different tasks than an image-to-text model (captioning) or a text-to-image model (diffusion).
Modern models increasingly blur these boundaries — multimodal foundation models can handle multiple data types in a single conversation. But understanding the underlying classification helps you design effective AI pipelines and choose appropriate APIs.
Text → Text (LLMs)
Chat, writing, analysis, translation, summarization. Models: GPT-4, Claude, Gemini, Llama. The most mature and widely used category of generative AI.
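A minimal sketch of a text-to-text call, assuming the OpenAI Python SDK and a placeholder model name; any chat-style LLM API follows the same request/response shape:

```python
# Sketch of a text -> text (LLM) request. Model name and prompt are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # assumed model name; substitute whichever LLM you use
    messages=[
        {"role": "system", "content": "You are a concise technical summarizer."},
        {"role": "user", "content": "Summarize the benefits of unit testing in two sentences."},
    ],
)
print(response.choices[0].message.content)
```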
Text → Image (Diffusion)
Generate images from text descriptions. Models: DALL-E 3, Midjourney, Stable Diffusion, Flux. Quality has improved from abstract art to photorealistic outputs in just 2 years.
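A sketch of local text-to-image generation, assuming the Hugging Face diffusers library and a Stable Diffusion checkpoint (the checkpoint name and prompt are illustrative):

```python
# Sketch of text -> image generation with a diffusion model.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # assumed checkpoint; any SD-compatible model works
    torch_dtype=torch.float16,
).to("cuda")

image = pipe("a watercolor painting of a lighthouse at dawn").images[0]
image.save("lighthouse.png")
```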
Image → Text (Vision)
Captioning, OCR, visual question-answering. Models: GPT-4V, Claude Vision, Gemini Pro Vision. Enables AI to "see" and reason about images, screenshots, and documents.
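A sketch of an image-to-text (captioning) request, again assuming the OpenAI SDK; the model name and image URL are placeholders, and Claude and Gemini expose similar multimodal chat APIs:

```python
# Sketch of an image -> text request: send text + image, get a description back.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",  # assumed vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},  # placeholder URL
        ],
    }],
)
print(response.choices[0].message.content)
```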
Text → Audio (TTS)
Voice synthesis from text. Models: ElevenLabs, OpenAI TTS, Bark. Modern TTS produces near-human quality speech with emotion, accents, and multiple languages.
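A sketch of text-to-speech synthesis, assuming OpenAI's TTS endpoint; the model and voice names are assumptions, and ElevenLabs and other providers offer comparable APIs:

```python
# Sketch of text -> audio synthesis.
from openai import OpenAI

client = OpenAI()
speech = client.audio.speech.create(
    model="tts-1",   # assumed TTS model name
    voice="alloy",   # assumed voice
    input="Welcome back! Your build finished successfully.",
)
speech.write_to_file("welcome.mp3")  # save the generated speech locally
```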
Audio → Text (Speech Recognition)
Transcription and speech-to-text. Models: Whisper, AssemblyAI, Deepgram. Enables voice interfaces, meeting transcription, and accessibility features.
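A sketch of audio-to-text transcription using the open-source Whisper package; the audio filename is a placeholder, and hosted services like AssemblyAI or Deepgram work similarly over HTTP:

```python
# Sketch of audio -> text transcription with open-source Whisper.
import whisper

model = whisper.load_model("base")        # smaller checkpoints trade accuracy for speed
result = model.transcribe("meeting.mp3")  # assumed local audio file
print(result["text"])
```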
Text → Video
Generate video clips from text descriptions. Models: Sora, Runway Gen-3, Kling, Pika. The newest frontier: quality is improving rapidly, but outputs are still limited to short clips.
Text → Code
Code generation and completion from natural language. Models: GPT-4, Claude, Codex, StarCoder. Powers tools like GitHub Copilot, Cursor, and Claude Code.
Code → Text
Code explanation, documentation generation, and review. All major LLMs excel at reading and explaining code, making it one of the highest-value AI applications.
Image → Image
Image editing, style transfer, super-resolution, inpainting. Models: ControlNet, InstructPix2Pix. These models transform existing images rather than generating them from scratch.
Audio → Audio
Voice conversion, music remixing, noise removal, audio enhancement. Specialized models that transform audio inputs without passing through text as an intermediate representation.
Modality: The type of data a model works with (text, image, audio, video, or code).
Pipeline: A chain of models processing data, e.g., audio→text→text→audio for a voice chatbot.
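A sketch of that audio→text→text→audio pipeline (a voice chatbot), assuming OpenAI-hosted models at each stage; model names and filenames are placeholders, and each stage can be swapped for any provider with the same modality:

```python
# Sketch of a voice-chatbot pipeline: audio -> text -> text -> audio.
from openai import OpenAI

client = OpenAI()

# 1. Audio -> text: transcribe the user's spoken question.
with open("question.wav", "rb") as f:  # assumed local recording
    transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

# 2. Text -> text: generate a reply with an LLM.
reply = client.chat.completions.create(
    model="gpt-4o",  # assumed model name
    messages=[{"role": "user", "content": transcript.text}],
)
answer = reply.choices[0].message.content

# 3. Text -> audio: synthesize the spoken response.
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=answer)
speech.write_to_file("answer.mp3")
```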
Embedding: A numerical representation of data (text, image) in a vector space, enabling semantic search.
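A sketch of semantic search with embeddings: embed documents and a query, then rank by cosine similarity. The embedding model name is an assumption; any API that returns fixed-length vectors fits this pattern.

```python
# Sketch of embedding-based semantic search.
import numpy as np
from openai import OpenAI

client = OpenAI()

docs = ["How to reset a password", "Quarterly revenue report", "Password recovery steps"]
query = "I forgot my login credentials"

def embed(texts):
    out = client.embeddings.create(model="text-embedding-3-small", input=texts)  # assumed model
    return np.array([item.embedding for item in out.data])

doc_vecs = embed(docs)
query_vec = embed([query])[0]

# Cosine similarity ranks documents by semantic closeness to the query.
scores = doc_vecs @ query_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec))
print(docs[int(np.argmax(scores))])  # expected: one of the password-related documents
```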