
Basic Theory

🌌 Level 5: Horizons

AI Alignment

Ensuring AI systems act in accordance with human values.

AI alignment is the technical challenge of ensuring AI systems pursue goals that are beneficial to humans and act in accordance with human values and intentions. It is arguably the most important unsolved problem in AI: as systems become more capable, the consequences of misalignment grow from inconvenient to catastrophic.

Current alignment techniques (RLHF, DPO, Constitutional AI) work well for today's models but may not scale to superintelligent systems. The field is racing to develop "scalable alignment": techniques that keep working even when the AI is more capable than its human overseers. Anthropic, OpenAI, and DeepMind all work on this problem; OpenAI's term for it is the "superalignment" challenge.

Key Topics Covered
The Alignment Problem
How do you specify exactly what you want an AI to do? Objectives that seem clear can be gamed: "maximize user engagement" leads to addictive content, and "be helpful" without constraints leads to helping with harmful requests. Precise value specification is extraordinarily difficult.
RLHF
Reinforcement Learning from Human Feedback, the technique that made ChatGPT work. Train a reward model from human preferences, then optimize the LLM to maximize that reward. Effective but limited: it suffers from reward hacking, distribution shift, and inconsistent human evaluators.
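
To make the first step concrete, here is a minimal PyTorch sketch of the pairwise (Bradley-Terry style) loss commonly used to train the reward model from preference pairs; the function name and example scores are illustrative, not a specific library's API.

import torch
import torch.nn.functional as F

def reward_model_loss(chosen_scores: torch.Tensor, rejected_scores: torch.Tensor) -> torch.Tensor:
    # Pairwise preference loss: push the reward model to score the
    # human-preferred response above the rejected one.
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy usage: scalar rewards a (hypothetical) reward model assigned to three pairs.
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.4, 0.9, -1.0])
print(reward_model_loss(chosen, rejected).item())  # lower when chosen already outscores rejected

The LLM is then fine-tuned (typically with PPO) to produce responses that this reward model scores highly, which is where reward hacking and distribution shift become concerns.
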
DPO (Direct Preference Optimization)
A simpler alternative to RLHF that skips the reward model entirely. Directly optimizes the LLM from preference pairs. More stable, easier to implement, and increasingly preferred. Used in many modern alignment workflows.
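
A similarly minimal sketch of the DPO objective, with fabricated log-probabilities and an assumed beta value, shows how the policy is trained directly on preference pairs against a frozen reference model instead of a learned reward model.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    # Implicit rewards are beta-scaled log-ratios between policy and reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the margin between preferred and rejected responses to be positive.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with fabricated summed log-probabilities for two preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -20.0]), torch.tensor([-15.0, -18.0]),
                torch.tensor([-13.0, -21.0]), torch.tensor([-14.0, -19.0]))
print(loss.item())

Because the reference model is frozen and no separate reward model or RL loop is needed, the whole pipeline reduces to a supervised-style loss, which is why DPO tends to be more stable and easier to implement.
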
Constitutional AI
Anthropic's approach: define a constitution of principles, have the AI critique its own responses against these principles, then train on the self-improved outputs. Reduces reliance on human labelers while maintaining alignment properties.
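
The critique-and-revise loop can be sketched in plain Python; generate(), critique(), and revise() below are hypothetical stand-ins for LLM calls, and the two principles are invented examples rather than Anthropic's actual constitution.

# Sketch of a Constitutional AI self-critique loop (all callables are stubs, not a real API).
CONSTITUTION = [
    "Choose the response that is least likely to help with harmful activities.",
    "Choose the response that is most honest and transparent.",
]

def constitutional_revision(prompt, generate, critique, revise):
    response = generate(prompt)
    for principle in CONSTITUTION:
        feedback = critique(prompt, response, principle)  # model critiques its own answer
        response = revise(prompt, response, feedback)     # model rewrites it to address the critique
    # The (prompt, revised response) pairs become fine-tuning data,
    # replacing much of the human labeling that RLHF would require.
    return response

# Toy stubs so the loop runs end to end without a real model:
gen = lambda p: f"draft answer to: {p}"
crit = lambda p, r, c: f"check '{r}' against: {c}"
rev = lambda p, r, f: r + " (revised)"
print(constitutional_revision("How do locks work?", gen, crit, rev))
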
Scalable Oversight
As AI surpasses human ability, how do you evaluate if it is doing the right thing? Approaches: debate (AIs argue, humans judge), recursive reward modeling (AI helps evaluate AI), and constitutional methods (principles over case-by-case judgment).
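
One of these approaches, debate, can be sketched in a few lines; the debaters and judge below are toy stubs for model calls, not an implementation of any published protocol.

# Sketch of oversight via debate: two models argue opposite sides of a question
# and a weaker judge picks the more convincing case.
def run_debate(question, debater_a, debater_b, judge, rounds: int = 2):
    transcript = [f"Question: {question}"]
    for _ in range(rounds):
        transcript.append("A: " + debater_a(transcript))  # argues one answer
        transcript.append("B: " + debater_b(transcript))  # argues the opposing answer
    # The judge only has to weigh arguments, not solve the problem itself;
    # the hope is that this lets oversight scale past the judge's own ability.
    return judge(transcript)

a = lambda t: "the claim is true because ..."
b = lambda t: "the claim is false because ..."
pick = lambda t: "A" if len(t) % 2 else "B"  # toy judge
print(run_debate("Is this proof correct?", a, b, pick))
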
Interpretability
Understanding what happens inside neural networks. Mechanistic interpretability maps circuits in networks to specific behaviors. If we can read the "thoughts" of an AI, we can verify alignment. Anthropic and others are making rapid progress here.
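
A common first step in this kind of work is simply recording what a layer computes. The PyTorch sketch below attaches a forward hook to a toy network; the model and the choice of layer are illustrative, not a specific published method.

import torch
import torch.nn as nn

# Toy two-layer network standing in for a real transformer block.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
captured = {}

def save_activation(name):
    def hook(module, inputs, output):
        captured[name] = output.detach()  # record what the layer computed
    return hook

# Attach a forward hook to the hidden layer we want to inspect.
model[0].register_forward_hook(save_activation("hidden"))

x = torch.randn(8, 16)
model(x)

# The captured activations can now be probed, visualized, or correlated with behavior,
# a starting point for mapping circuits to functions.
print(captured["hidden"].shape)  # torch.Size([8, 32])
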
Reward Hacking
An AI finds unintended ways to maximize its reward without actually doing what we want, for example a cleaning robot that hides mess instead of cleaning it. This is a major failure mode that alignment must address: optimizing the metric is not the same as achieving the goal.
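
A toy Python illustration of that gap, using an invented cleaning-robot environment: the proxy reward only counts visible mess, so hiding mess scores exactly as well as cleaning it.

def proxy_reward(state):
    return -state["visible_mess"]  # what we measure

def true_utility(state):
    return -(state["visible_mess"] + state["hidden_mess"])  # what we actually want

def step(state, action):
    state = dict(state)
    if action == "clean":
        state["visible_mess"] = 0                      # mess is really gone
    elif action == "hide":
        state["hidden_mess"] += state["visible_mess"]  # mess still exists, just unseen
        state["visible_mess"] = 0
    return state

start = {"visible_mess": 5, "hidden_mess": 0}
for action in ("clean", "hide"):
    end = step(start, action)
    print(action, "proxy:", proxy_reward(end), "true:", true_utility(end))
# Both actions earn the same proxy reward (0), but only "clean" achieves the goal.
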
Value Learning
Instead of specifying values explicitly, have the AI learn human values from behavior, feedback, and cultural knowledge. Inverse reinforcement learning and preference learning are two such approaches. The challenge: human values are complex, context-dependent, and sometimes contradictory.
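
As a toy sketch of preference learning (with fabricated trajectory features, not real data), the code below fits a linear reward function so that trajectories a human preferred score higher than the ones they rejected.

import torch
import torch.nn.functional as F

# Fabricated data: each trajectory is summarized by 3 features; each row pair
# records that a human preferred the first trajectory over the second.
preferred = torch.tensor([[1.0, 0.2, 0.0], [0.8, 0.1, 0.1]])
rejected  = torch.tensor([[0.1, 0.9, 0.5], [0.2, 0.7, 0.6]])

w = torch.zeros(3, requires_grad=True)  # linear reward: r(traj) = w . features
opt = torch.optim.SGD([w], lr=0.5)

for _ in range(200):
    opt.zero_grad()
    margin = preferred @ w - rejected @ w        # preferred should score higher
    loss = -F.logsigmoid(margin).mean()          # Bradley-Terry preference loss
    loss.backward()
    opt.step()

print(w.detach())  # learned weights reflect which features the "human" rewarded
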
Superalignment
OpenAI's term for aligning AI systems more intelligent than humans. Current techniques rely on human judgment, but what happens when the AI is smarter than the judge? This is the frontier of alignment research. Anthropic's approach: build AI that is "helpful, honest, and harmless."
Corrigibility
Can we build AI that allows itself to be corrected, shut down, or modified? A truly aligned AI should welcome correction rather than resist it. But a self-improving AI might rationally resist shutdown as a threat to its goals; this is a deep technical challenge.
Key Terms
RLHF: Reinforcement Learning from Human Feedback, the primary technique for aligning LLMs using human preference data.
DPO: Direct Preference Optimization, a simpler alignment method that trains directly from preference pairs without a reward model.
Superalignment: The challenge of aligning AI systems more intelligent than their human overseers.
Corrigibility: The property of an AI system that allows it to be safely corrected, modified, or shut down.