CSI 4106 - Fall 2024
Version: Oct 23, 2024 15:18
Deep learning (DL) is a machine learning technique that can be applied to supervised learning (including regression and classification), unsupervised learning, and reinforcement learning.
Inspired by the structure and function of biological neural networks found in animals.
Comprises interconnected neurons (or units) arranged into layers.
The universal approximation theorem states that a feedforward neural network with a single hidden layer containing a finite number of neurons can approximate any continuous function on a compact subset of \(\mathbb{R}^n\), given appropriate weights and activation functions.
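As a quick illustration, here is a minimal sketch in Python with NumPy (the center, width, and steepness values are arbitrary choices for illustration): two sigmoid hidden units can form a localized "bump", and sums of such bumps underlie the standard intuition for the theorem.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bump(x, center=0.0, width=0.5, steepness=50.0):
    """Difference of two shifted, steep sigmoids: a single-hidden-layer
    building block that is ~1 on an interval and ~0 elsewhere."""
    left = sigmoid(steepness * (x - (center - width / 2)))
    right = sigmoid(steepness * (x - (center + width / 2)))
    return left - right

x = np.linspace(-2, 2, 9)
print(np.round(bump(x), 2))  # ~1 near the center, ~0 away from it
```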
A two-layer perceptron computes:
\[ y = \phi_2(\phi_1(X)) \]
where
\[ \phi_l(Z) = \phi(W_l Z + b_l) \]
A 3-layer perceptron computes:
\[ y = \phi_3(\phi_2(\phi_1(X))) \]
where
\[ \phi_l(Z) = \phi(W_l Z + b_l) \]
A \(k\)-layer perceptron computes:
\[ y = \phi_k( \ldots \phi_2(\phi_1(X)) \ldots ) \]
where
\[ \phi_l(Z) = \phi(W_l Z + b_l) \]
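A minimal sketch of this computation in Python with NumPy (the column-vector convention, the choice of sigmoid for \(\phi\), and the layer sizes are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, weights, biases):
    """Compute y = phi_k(... phi_2(phi_1(X)) ...) layer by layer."""
    Z = X
    for W, b in zip(weights, biases):
        Z = sigmoid(W @ Z + b)  # phi_l(Z) = phi(W_l Z + b_l)
    return Z

# A hypothetical 3-layer perceptron: 4 inputs -> 5 -> 3 -> 1 output.
rng = np.random.default_rng(0)
sizes = [4, 5, 3, 1]
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes, sizes[1:])]
biases = [rng.standard_normal((m, 1)) for m in sizes[1:]]
y = forward(rng.standard_normal((sizes[0], 1)), weights, biases)
```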
Learning representations by back-propagating errors
David E. Rumelhart, Geoffrey E. Hinton & Ronald J. Williams. Nature 323, 533–536 (1986).
We describe a new learning procedure, back-propagation, for networks of neurone-like units. The procedure repeatedly adjusts the weights of the connections in the network so as to minimize a measure of the difference between the actual output vector of the net and the desired output vector. As a result of the weight adjustments, internal ‘hidden’ units which are not part of the input or output come to represent important features of the task domain, and the regularities in the task are captured by the interactions of these units. The ability to create useful new features distinguishes back-propagation from earlier, simpler methods such as the perceptron-convergence procedure.
Limitations, such as the inability to solve the XOR classification task, essentially stalled research on neural networks.
The perceptron was limited to a single layer, and there was no known method for training a multi-layer perceptron.
Single-layer perceptrons are limited to solving classification tasks that are linearly separable.
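To make the XOR point concrete, here is a sketch of a two-layer perceptron with hand-chosen weights (one of many possible choices) computing XOR via the decomposition XOR(x1, x2) = AND(OR(x1, x2), NAND(x1, x2)), something no single-layer perceptron can represent:

```python
import numpy as np

def step(z):
    return (z >= 0).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hidden layer: one unit computes OR, the other NAND.
W1 = np.array([[1, 1], [-1, -1]])
b1 = np.array([-0.5, 1.5])
# Output unit computes AND of the two hidden units.
W2 = np.array([1, 1])
b2 = -1.5

h = step(X @ W1.T + b1)
y = step(h @ W2 + b2)
print(y)  # [0 1 1 0]
```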
The model employs mean squared error as its loss function.
Gradient descent is used to minimize loss.
A sigmoid activation function is used instead of a step function, as its derivative provides valuable information for gradient descent.
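For reference, a minimal sketch of the sigmoid and its derivative; the identity \(\sigma'(z) = \sigma(z)(1 - \sigma(z))\) is what makes the derivative cheap to reuse during the backward pass:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # peaks at 0.25 when z = 0

print(sigmoid_prime(np.array([-4.0, 0.0, 4.0])))  # small, 0.25, small
```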
Shows how to update internal weights using a two-pass algorithm consisting of a forward pass and a backward pass.
Enables the training of multi-layer perceptrons.
1. Initialization
2. Forward Pass
3. Compute Loss
4. Backward Pass (Backpropagation)
5. Update Weights and Biases
6. Repeat 2 to 5.
Initialize the weights and biases of the neural network.
For each example in the training set (or in a mini-batch):
Input Layer: Pass the input features to the first layer.
Hidden Layers: For each hidden layer, compute the activations (outputs) by taking the weighted sum of the inputs plus a bias, then applying an activation function (e.g., sigmoid, ReLU).
Output Layer: Same process as the hidden layers; the output-layer activations represent the predicted values.
Calculate the loss (error) using a suitable loss function by comparing the predicted values to the actual target values.
Output Layer: Compute the gradient of the loss with respect to the output layer’s weights and biases using the chain rule of calculus.
Hidden Layers: Propagate the error backward through the network, layer by layer. For each layer, compute the gradient of the loss with respect to the weights and biases. Use the derivative of the activation function to help calculate these gradients.
Update Weights and Biases: Adjust the weights and biases using the calculated gradients and a learning rate, which determines the step size for each update (see the end-to-end sketch below).
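Below is a minimal end-to-end sketch of steps 2 to 5 for a 2-4-1 network on the XOR task; the sigmoid activations, mean squared error, learning rate, and epoch count are illustrative assumptions, and the constant factor from the MSE derivative is absorbed into the learning rate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)

# Toy task: XOR, with inputs as rows.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([[0], [1], [1], [0]], dtype=float)

# 1. Initialization (random weights break symmetry; biases may be zero).
W1, b1 = rng.standard_normal((2, 4)), np.zeros((1, 4))
W2, b2 = rng.standard_normal((4, 1)), np.zeros((1, 1))
lr = 0.5

for epoch in range(10000):
    # 2. Forward pass.
    h = sigmoid(X @ W1 + b1)      # hidden activations
    y = sigmoid(h @ W2 + b2)      # predicted values

    # 3. Compute loss (mean squared error).
    loss = np.mean((y - t) ** 2)

    # 4. Backward pass: chain rule, output layer first, then hidden layer.
    delta2 = (y - t) * y * (1 - y)           # uses sigmoid'(z) = y(1 - y)
    grad_W2 = h.T @ delta2
    grad_b2 = delta2.sum(axis=0, keepdims=True)
    delta1 = (delta2 @ W2.T) * h * (1 - h)
    grad_W1 = X.T @ delta1
    grad_b1 = delta1.sum(axis=0, keepdims=True)

    # 5. Update weights and biases (step size set by the learning rate).
    W1 -= lr * grad_W1; b1 -= lr * grad_b1
    W2 -= lr * grad_W2; b2 -= lr * grad_b2

print(np.round(y.ravel(), 2))  # should approach [0, 1, 1, 0]
```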
Activation Functions: Functions like sigmoid, ReLU, and tanh introduce non-linearity, which allows the network to learn complex patterns.
Learning Rate: A hyperparameter that controls how much to change the model in response to the estimated error each time the model weights are updated.
Gradient Descent: An optimization algorithm used to minimize the loss function by iteratively moving towards the steepest descent as defined by the negative of the gradient.
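A tiny, self-contained illustration of the learning rate's effect, using gradient descent on \(f(w) = w^2\) (so \(f'(w) = 2w\)); the two rates are chosen to show convergence versus divergence:

```python
# Gradient descent on f(w) = w**2, starting from w = 1.
for lr in (0.1, 1.1):
    w = 1.0
    for _ in range(20):
        w -= lr * 2 * w  # w <- w - lr * f'(w)
    print(f"lr={lr}: w={w:.4f}")
# lr=0.1 shrinks w toward the minimum at 0; lr=1.1 overshoots and diverges.
```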
Vanishing gradient problem: Gradients become too small, hindering weight updates.
Stalled neural network research (again) in the early 2000s.
The sigmoid activation function was a key factor: its derivative takes values only between 0 and 0.25.
Common initialization: Weights/biases from \(\mathcal{N}(0, 1)\) contributed to the issue.
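A back-of-the-envelope illustration: each sigmoid layer contributes a factor of at most 0.25 to the backpropagated gradient, so even in the best case the gradient shrinks geometrically with depth (weight magnitudes below 1 make it worse):

```python
# Upper bound on the product of sigmoid derivatives across layers.
for depth in (5, 10, 20):
    print(depth, 0.25 ** depth)
# 5  -> ~9.8e-04
# 10 -> ~9.5e-07
# 20 -> ~9.1e-13
```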
Glorot and Bengio (2010) shed light on the problems.
Alternative activation functions: Rectified Linear Unit (ReLU) and its variants (e.g., Leaky ReLU, Parametric ReLU, and Exponential Linear Unit).
Weight Initialization: Xavier (Glorot) or He initialization.
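A minimal sketch of ReLU and Leaky ReLU in Python with NumPy (alpha = 0.01 is a common default, but it is a tunable hyperparameter); because ReLU's derivative is 1 wherever the unit is active, gradients pass through without the sigmoid's 0.25 shrink factor:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # Small negative slope keeps "dead" units trainable.
    return np.where(z >= 0, z, alpha * z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))        # [0.  0.  0.  0.5 2. ]
print(leaky_relu(z))  # [-0.02  -0.005  0.  0.5  2. ]
```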
Objective: Mitigate the unstable gradients problem in deep neural networks.
Signal Flow: The signal must flow properly in both directions: forward when making predictions, and backward when propagating gradients.
Variance Matching:
Forward Pass: Ensure the output variance of each layer matches its input variance.
Backward Pass: Maintain equal gradient variance before and after passing through each layer.
He initialization: a similar but slightly different method, designed to work with ReLU as well as Leaky ReLU, ELU, GELU, Swish, and Mish.
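A sketch of both schemes in Python with NumPy, assuming the normal-distribution variants (uniform variants also exist); fan_in and fan_out denote the layer's input and output sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 256, 128

# Xavier (Glorot): variance 2 / (fan_in + fan_out); suits sigmoid/tanh.
W_glorot = rng.normal(0.0, np.sqrt(2.0 / (fan_in + fan_out)),
                      size=(fan_out, fan_in))

# He: variance 2 / fan_in; suits ReLU and its variants.
W_he = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))
```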
Randomly initializing the weights is sufficient to break symmetry in a neural network, allowing the bias terms to be set to zero without impacting the network's ability to learn effectively.
A series of videos, with animations, providing the intuition behind the backpropagation algorithm.
Neural networks (playlist)
One of the most thorough series of videos on the backpropagation algorithm.
Introduction to neural networks (playlist)
Marcel Turcotte
School of Electrical Engineering and Computer Science (EECS)
University of Ottawa