Architecture of neural networks - layers, activation functions, and how learning happens.
Neural networks are the mathematical foundation underlying all modern AI. They are loosely inspired by biological neurons but in practice are systems of matrix multiplications and nonlinear functions organized into layers. Understanding how they work (forward propagation, loss computation, and backpropagation) is essential for anyone wanting to go beyond surface-level AI usage.
The field has evolved from simple perceptrons in the 1950s to today's trillion-parameter transformer networks. Each architectural breakthrough, from convolutional layers for vision to attention mechanisms for language, expanded what neural networks could do. Knowing these fundamentals helps you understand why certain models excel at certain tasks and what the actual limitations of "AI" are.
Neurons and Weighted Sums
A neuron computes a weighted sum of its inputs, adds a bias term, then passes the result through an activation function. This simple operation, repeated billions of times across layers, is how neural networks compute.
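A minimal sketch of that operation, using NumPy and made-up input, weight, and bias values (ReLU chosen as the activation purely for illustration):

```python
import numpy as np

inputs  = np.array([0.5, -1.2, 3.0])   # hypothetical feature values
weights = np.array([0.8, 0.1, -0.4])   # one learned weight per input
bias    = 0.2

z = np.dot(weights, inputs) + bias     # weighted sum plus bias
activation = max(0.0, z)               # nonlinearity (ReLU here)
print(z, activation)
```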
Network Layers
Input layer receives raw data, hidden layers extract increasingly abstract features, output layer produces the final prediction. "Deep" learning means many hidden layers; modern LLMs have 80-120+ layers.
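A rough sketch of how a stack of layers translates into weight matrices and bias vectors (hypothetical sizes, NumPy assumed, not any particular framework's API):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical shape: 4 input features, two hidden layers of 16 units, 3 output classes
layer_sizes = [4, 16, 16, 3]
layers = [{"W": rng.normal(scale=0.1, size=(n_in, n_out)), "b": np.zeros(n_out)}
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

for i, layer in enumerate(layers):
    print(f"layer {i}: W {layer['W'].shape}, b {layer['b'].shape}")
```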
Activation Functions
ReLU (most common, simple max(0,x)), GELU (used in transformers, smoother), Sigmoid (squashes to 0-1), Softmax (outputs probability distribution). These introduce nonlinearity; without them, the entire network would collapse to a single linear transformation.
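A sketch of these four functions in NumPy (the GELU shown is the common tanh approximation; exact formulations vary by framework):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def gelu(x):                          # tanh approximation used in many transformers
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))         # subtract the max for numerical stability
    return e / e.sum()
```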
Forward Propagation
Data flows through the network layer by layer: each layer transforms its input and passes the result to the next. The final output is a prediction that can be compared to the true answer to compute error.
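A minimal forward pass, assuming a toy two-layer network with ReLU on the hidden layer (sizes and names are made up for illustration):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def forward(x, layers):
    """One forward pass: each layer transforms its input and hands it to the next."""
    *hidden, (W_out, b_out) = layers
    for W, b in hidden:
        x = relu(x @ W + b)            # hidden layers apply a nonlinearity
    return x @ W_out + b_out           # final layer produces the raw prediction

rng = np.random.default_rng(0)
layers = [(rng.normal(scale=0.1, size=(4, 8)), np.zeros(8)),   # 4 inputs -> 8 hidden
          (rng.normal(scale=0.1, size=(8, 3)), np.zeros(3))]   # 8 hidden -> 3 outputs
prediction = forward(rng.normal(size=4), layers)
print(prediction.shape)                # (3,)
```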
Loss Functions
Measuring how wrong the prediction is. Cross-entropy loss for classification, MSE for regression, next-token prediction loss for language models. The entire training process is about minimizing this loss function.
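Minimal NumPy sketches of the two classic losses (toy values, not any particular framework's API):

```python
import numpy as np

def mse(pred, target):
    """Mean squared error, typical for regression."""
    return np.mean((pred - target) ** 2)

def cross_entropy(probs, label):
    """Negative log-probability of the correct class, typical for classification."""
    return -np.log(probs[label] + 1e-12)

# Hypothetical example: model assigns 70% probability to the true class
print(cross_entropy(np.array([0.2, 0.7, 0.1]), label=1))   # ~0.357
print(mse(np.array([2.5, 0.0]), np.array([3.0, -0.5])))    # 0.25
```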
Backpropagation
The algorithm that makes learning possible. It computes how much each weight contributed to the error by applying the chain rule of calculus backwards through the network, hence "back" propagation.
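A toy illustration of that chain rule on a one-hidden-unit network with squared-error loss (made-up scalar values; real frameworks automate this over millions of parameters):

```python
x, target = 2.0, 1.0            # single training example
w1, b1 = 0.5, 0.1               # hidden-layer parameters
w2, b2 = -0.3, 0.2              # output-layer parameters

# Forward pass
z1 = w1 * x + b1                # pre-activation
h  = max(0.0, z1)               # ReLU
y  = w2 * h + b2                # prediction
loss = (y - target) ** 2        # squared error

# Backward pass: chain rule from the loss back to each weight
dloss_dy = 2 * (y - target)
dh_dz1   = 1.0 if z1 > 0 else 0.0          # ReLU derivative

grad_w2 = dloss_dy * h
grad_b2 = dloss_dy
grad_w1 = dloss_dy * w2 * dh_dz1 * x
grad_b1 = dloss_dy * w2 * dh_dz1
print(grad_w1, grad_b1, grad_w2, grad_b2)
```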
Gradient Descent Optimization
Adjusting weights in the direction that reduces loss. Adam optimizer (used by almost all modern models) adapts learning rates per-parameter. Learning rate scheduling, warmup, and weight decay are critical training hyperparameters.
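A sketch of a single Adam update using the standard published defaults and a made-up gradient (learning-rate schedules, warmup, and weight decay are omitted):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: moving averages of the gradient and its square, bias-corrected."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)           # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.array([0.5, -0.3])
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 4):                      # three toy steps
    grad = 2 * w                           # e.g. gradient of loss = sum(w**2)
    w, m, v = adam_step(w, grad, m, v, t)
print(w)
```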
Convolutional Networks (CNNs)
Specialized for spatial data like images. Convolutional filters slide across the input detecting edges, textures, and patterns. Still used in vision AI but increasingly replaced by Vision Transformers (ViT).
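A naive sketch of a single filter sliding over a 2-D image (no padding, stride 1; the vertical-edge kernel and image values are just an illustration, and real CNNs vectorize this):

```python
import numpy as np

def conv2d(image, kernel):
    """Slide a kernel across the image, taking a weighted sum at each position."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

edge_filter = np.array([[1., 0., -1.],
                        [1., 0., -1.],
                        [1., 0., -1.]])    # responds to vertical edges
image = np.arange(36, dtype=float).reshape(6, 6)
print(conv2d(image, edge_filter).shape)    # (4, 4)
```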
Recurrent Networks (RNNs, LSTMs)
Designed for sequential data: text, time series, audio. They maintain a hidden state that carries information across time steps. Largely replaced by Transformers, which process sequences in parallel.
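A minimal sketch of one recurrent step carrying a hidden state across time (hypothetical dimensions and weight names; LSTMs add gates on top of this basic recurrence):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """New hidden state from the current input and the previous hidden state."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

rng = np.random.default_rng(0)
# Hypothetical sizes: 8-dim inputs, 16-dim hidden state
W_xh = rng.normal(scale=0.1, size=(8, 16))
W_hh = rng.normal(scale=0.1, size=(16, 16))
b_h  = np.zeros(16)

h = np.zeros(16)
for x_t in rng.normal(size=(5, 8)):        # a sequence of 5 time steps
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)  # state carries information forward
print(h.shape)                             # (16,)
```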
The Transformer Architecture
The 2017 breakthrough that powers all modern LLMs. Self-attention allows each token to attend to every other token in the sequence, capturing long-range dependencies that RNNs struggled with. Multi-head attention runs several attention computations in parallel.
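A sketch of single-head scaled dot-product self-attention in NumPy (hypothetical dimensions; real transformers add multiple heads, masking, and separate learned projections per head):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Each token scores every other token, then mixes their value vectors."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # relevance of every token to every token
    return softmax(scores, axis=-1) @ V        # context-weighted combination

rng = np.random.default_rng(0)
d = 16
X = rng.normal(size=(10, d))                   # 10 tokens, 16-dim embeddings
W_q, W_k, W_v = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)  # (10, 16)
```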
Backpropagation: Algorithm for computing how each weight contributes to the error by applying the chain rule backwards through the network.
Gradient Descent: Optimization algorithm that iteratively adjusts weights in the direction that reduces error, using computed gradients.
Self-Attention: Mechanism where each element in a sequence computes relevance scores with every other element, enabling context-aware processing.
Activation Function: Nonlinear function applied after weighted sums; without it, neural networks could only model linear relationships.