Training Artificial Neural Networks (Part 1)

CSI 4106 - Fall 2024

Marcel Turcotte

Version: Oct 23, 2024 15:18

Preamble

Quote of the Day

Learning objectives

  • Explain the architecture and function of feedforward neural networks (FNNs).
  • Describe the backpropagation algorithm and its role in training neural networks.
  • Identify common activation functions and understand their impact on network performance.
  • Understand the vanishing gradient problem and strategies to mitigate it.

Summary

3Blue1Brown

Summary - DL

  • Deep learning (DL) is a machine learning technique that can be applied to supervised learning (including regression and classification), unsupervised learning, and reinforcement learning.

  • Inspired by the structure and function of biological neural networks found in animals.

  • Comprises interconnected neurons (or units) arranged into layers.

Summary - FNN

Summary - units

Common Activation Functions

Universal Approximation

The universal approximation theorem states that a feedforward neural network with a single hidden layer containing a finite number of neurons can approximate any continuous function on a compact subset of \(\mathbb{R}^n\), given appropriate weights and activation functions.
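
For a sigmoidal activation \(\sigma\), the statement can be written out as follows. This is a sketch in the form used by Cybenko (1989); the symbols \(K\), \(N\), \(\alpha_i\), \(w_i\), \(b_i\), and \(\varepsilon\) are introduced here for illustration and are not part of the notation used elsewhere in these slides. For every continuous function \(f\) on a compact set \(K \subset \mathbb{R}^n\) and every \(\varepsilon > 0\), there exist a finite \(N\) and parameters \(\alpha_i, b_i \in \mathbb{R}\), \(w_i \in \mathbb{R}^n\) such that

\[ \left| f(x) - \sum_{i=1}^{N} \alpha_i \, \sigma(w_i^\top x + b_i) \right| < \varepsilon \quad \text{for all } x \in K \]

The theorem guarantees that such parameters exist; it does not say how many neurons are needed, nor that training will find them.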

Notation

A two-layer perceptron computes:

\[ y = \phi_2(\phi_1(X)) \]

where

\[ \phi_l(Z) = \phi(W_l Z + b_l) \]

Notation

A 3-layer perceptron computes:

\[ y = \phi_3(\phi_2(\phi_1(X))) \]

where

\[ \phi_l(Z) = \phi(W_l Z + b_l) \]

Notation

A \(k\)-layer perceptron computes:

\[ y = \phi_k( \ldots \phi_2(\phi_1(X)) \ldots ) \]

where

\[ \phi_l(Z) = \phi(W_l Z + b_l) \]
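
The nested composition can be sketched as a loop over layers. This is a minimal NumPy illustration, not code from the course: the function names, the layer sizes (3 inputs, 4 hidden units, 2 outputs), and the choice of a sigmoid for \(\phi\) are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers, phi=sigmoid):
    """Compute phi_k(... phi_2(phi_1(x)) ...), where phi_l(z) = phi(W_l @ z + b_l)."""
    z = x
    for W, b in layers:
        z = phi(W @ z + b)
    return z

rng = np.random.default_rng(0)
# Hypothetical two-layer perceptron: 3 inputs -> 4 hidden units -> 2 outputs
layers = [
    (rng.normal(size=(4, 3)), np.zeros(4)),  # W_1, b_1
    (rng.normal(size=(2, 4)), np.zeros(2)),  # W_2, b_2
]
x = np.array([0.5, -1.0, 2.0])
y = forward(x, layers)
print(y.shape)  # (2,)
```

Adding a layer only means appending another `(W, b)` pair to the list, which mirrors how the formula grows from two to \(k\) layers.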

Back-propagation

3Blue1Brown

Back-propagation

Learning representations by back-propagating errors

David E. Rumelhart, Geoffrey E. Hinton & Ronald J. Williams

We describe a new learning procedure, back-propagation, for networks of neurone-like units. The procedure repeatedly adjusts the weights of the connections in the network so as to minimize a measure of the difference between the actual output vector of the net and the desired output vector. As a result of the weight adjustments, internal ‘hidden’ units which are not part of the input or output come to represent important features of the task domain, and the regularities in the task are captured by the interactions of these units. The ability to create useful new features distinguishes back-propagation from earlier, simpler methods such as the perceptron-convergence procedure.

Before the back-propagation

  • Limitations, such as the inability to solve the XOR classification task, essentially stalled research on neural networks.

  • The perceptron was limited to a single layer, and there was no known method for training a multi-layer perceptron.

  • Single-layer perceptrons are limited to solving classification tasks that are linearly separable.

Back-propagation: contributions

  • The model employs mean squared error as its loss function.

  • Gradient descent is used to minimize loss.

  • A sigmoid activation function is used instead of a step function, as its derivative provides valuable information for gradient descent.

  • Shows how to update internal weights using a two-pass algorithm consisting of a forward pass and a backward pass.

  • Enables training multi-layer perceptrons.

Backpropagation: top level

  1. Initialization

  2. Forward Pass

  3. Compute Loss

  4. Backward Pass (Backpropagation)

  5. Repeat steps 2 to 4.
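
The steps above can be sketched end to end on the XOR task. This is a minimal NumPy illustration under stated assumptions, not the course's reference implementation: the hidden-layer size (4), learning rate (0.5), number of iterations, and seed are arbitrary choices, and examples are stored as rows, so the forward pass is written `X @ W + b`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR data: 4 examples (rows), 2 features
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# 1. Initialization: random weights, zero biases
rng = np.random.default_rng(42)
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
lr = 0.5  # hypothetical learning rate
losses = []

for _ in range(5000):
    # 2. Forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # 3. Compute loss (mean squared error)
    losses.append(np.mean((out - y) ** 2))

    # 4. Backward pass: chain rule, output layer first, then hidden layer
    d_out = 2 * (out - y) / out.size * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)

    # Gradient descent update
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0)

print(losses[0], losses[-1])
```

The hidden-layer error `d_h` is computed before `W2` is updated, so both gradients refer to the same set of weights.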

Backpropagation: 1. Initialization

Initialize the weights and biases of the neural network.

  1. Zero Initialization
    • All weights are initialized to zero.
    • Symmetry problem: all neurons in a layer produce identical outputs and receive identical updates, preventing effective learning.
  2. Random Initialization
    • Weights are initialized randomly, often using a uniform or normal distribution.
    • Breaks the symmetry between neurons, allowing them to learn.
    • If not scaled properly, leads to slow convergence or vanishing/exploding gradients.
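
The symmetry problem can be observed directly. The sketch below is an illustration with made-up data (5 examples, 3 features, 4 hidden units, sigmoid activation, all assumptions): with zero weights every hidden unit outputs the same value, whereas random weights give each unit a different response.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))  # hypothetical inputs

# Zero initialization: every hidden unit computes sigmoid(0) = 0.5
W_zero = np.zeros((3, 4))
h_zero = sigmoid(X @ W_zero)

# Random initialization: hidden units respond differently to the same input
W_rand = rng.normal(scale=0.1, size=(3, 4))
h_rand = sigmoid(X @ W_rand)

# Spread of activations across the 4 units, for each example
spread_zero = h_zero.max(axis=1) - h_zero.min(axis=1)
spread_rand = h_rand.max(axis=1) - h_rand.min(axis=1)
print(spread_zero)  # all zeros
print(spread_rand)  # strictly positive
```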

Backpropagation: 2. Forward Pass

For each example in the training set (or in a mini-batch):

  • Input Layer: Pass input features to first layer.

  • Hidden Layers: For each hidden layer, compute the weighted sum of the inputs plus a bias, then apply an activation function (e.g., sigmoid, ReLU) to obtain the layer’s activations (outputs).

  • Output Layer: Same process as hidden layers. Output layer activations represent the predicted values.

Backpropagation: 3. Compute Loss

Calculate the loss (error) using a suitable loss function by comparing the predicted values to the actual target values.

Backpropagation: 4. Backward Pass

  • Output Layer: Compute the gradient of the loss with respect to the output layer’s weights and biases using the chain rule of calculus.

  • Hidden Layers: Propagate the error backward through the network, layer by layer. For each layer, compute the gradient of the loss with respect to the weights and biases. Use the derivative of the activation function to help calculate these gradients.

  • Update Weights and Biases: Adjust the weights and biases using the calculated gradients and a learning rate, which determines the step size for each update.
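
For a single sigmoid layer trained with mean squared error, the chain-rule gradient can be checked against a finite-difference estimate. This is a sketch under assumptions: the data are random, and the helper names `loss_fn` and `analytic_grads`, the perturbation `eps`, and the learning rate `lr` are hypothetical choices for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_fn(W, b, X, y):
    return np.mean((sigmoid(X @ W + b) - y) ** 2)

def analytic_grads(W, b, X, y):
    a = sigmoid(X @ W + b)
    # Chain rule: dL/dz = dL/da * da/dz, with da/dz = a * (1 - a) for the sigmoid
    delta = 2 * (a - y) / a.size * a * (1 - a)
    return X.T @ delta, delta.sum(axis=0)

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 3))
y = rng.uniform(size=(6, 2))
W = rng.normal(size=(3, 2))
b = np.zeros(2)

gW, gb = analytic_grads(W, b, X, y)

# Central finite difference for one weight, as a sanity check
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[0, 0] += eps
W_minus[0, 0] -= eps
numeric = (loss_fn(W_plus, b, X, y) - loss_fn(W_minus, b, X, y)) / (2 * eps)

# Update step: move against the gradient, scaled by the learning rate
lr = 0.1
W_new, b_new = W - lr * gW, b - lr * gb
```

In a multi-layer network the same `delta` would also be multiplied by the transpose of the layer's weights and passed to the previous layer, which is the "propagate the error backward" step.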

Key Concepts

  • Activation Functions: Functions like sigmoid, ReLU, and tanh introduce non-linearity, which allows the network to learn complex patterns.

  • Learning Rate: A hyperparameter that controls how much to change the model in response to the estimated error each time the model weights are updated.

  • Gradient Descent: An optimization algorithm that minimizes the loss function by iteratively moving in the direction of steepest descent, given by the negative of the gradient.
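
The effect of the learning rate can be seen on a one-dimensional toy problem. The function \(f(w) = (w - 3)^2\), the starting point, and the two learning rates below are assumptions chosen for illustration only.

```python
def gradient_descent(lr, steps=50, w0=0.0):
    """Minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3)."""
    w = w0
    for _ in range(steps):
        w -= lr * 2 * (w - 3)  # step against the gradient
    return w

w_small = gradient_descent(lr=0.1)  # error shrinks by a factor 0.8 per step
w_large = gradient_descent(lr=1.1)  # error grows by a factor 1.2 per step
print(w_small, w_large)
```

With the smaller learning rate the iterate approaches the minimum at \(w = 3\); with the larger one each step overshoots by more than the previous error and the iterate diverges.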

Summary

Training

Vanishing gradients

  • Vanishing gradient problem: Gradients become too small, hindering weight updates.

  • Stalled neural network research (again) in the early 2000s.

  • The sigmoid activation, whose derivative lies between 0 and 0.25, was a key factor.

  • Common initialization: Weights/biases from \(\mathcal{N}(0, 1)\) contributed to the issue.

Glorot and Bengio (2010) shed light on the problems.
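
The role of the sigmoid derivative can be illustrated numerically. This sketch only bounds the product of activation derivatives along one path through the network; the actual gradient also involves the weights, and the depth of 10 layers is an arbitrary assumption.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

z = np.linspace(-10, 10, 2001)
max_derivative = sigmoid_prime(z).max()  # 0.25, reached at z = 0

# Each sigmoid layer contributes a factor of at most 0.25 to the backward signal
depth = 10
bound = max_derivative ** depth
print(max_derivative, bound)
```

After 10 layers the product of derivatives is already below one millionth, before the weights are even taken into account.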

Vanishing gradients: solutions

  • Alternative activation functions: Rectified Linear Unit (ReLU) and its variants (e.g., Leaky ReLU, Parametric ReLU, and Exponential Linear Unit).

  • Weight Initialization: Xavier (Glorot) or He initialization.
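
ReLU and Leaky ReLU can be sketched in a few lines. The slope `alpha = 0.01` for negative inputs is a common but arbitrary default and is an assumption here, as is the convention of using slope `alpha` at exactly zero.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    """Identity for z > 0, small slope alpha otherwise."""
    return np.where(z > 0, z, alpha * z)

def leaky_relu_prime(z, alpha=0.01):
    """Derivative: 1 for z > 0, alpha otherwise (never exactly zero)."""
    return np.where(z > 0, 1.0, alpha)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))
print(leaky_relu(z))
print(leaky_relu_prime(z))
```

For positive inputs the derivative is 1 rather than at most 0.25, so the backward signal is not shrunk by the activation; the leaky variant additionally keeps a small non-zero gradient for negative inputs.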

Glorot and Bengio

Figure 6

Figure 7

Glorot and Bengio

Objective: Mitigate the unstable gradients problem in deep neural networks.

Signal Flow:

  • Forward Direction: Ensure stable signal propagation for accurate predictions.
  • Reverse Direction: Maintain consistent gradient flow during backpropagation.

Glorot and Bengio

Variance Matching:

  • Forward Pass: Ensure the output variance of each layer matches its input variance.

  • Backward Pass: Maintain equal gradient variance before and after passing through each layer.
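
The variance-matching idea leads to the Glorot (Xavier) scheme, in which the weight variance is \(2 / (\text{fan}_{\text{in}} + \text{fan}_{\text{out}})\). The sketch below uses the uniform variant and checks the forward-pass variance in the linear regime; the layer width (256), batch size, and function name are assumptions for the example.

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng):
    """Uniform on [-limit, limit], which has variance 2 / (fan_in + fan_out)."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
fan_in = fan_out = 256
W = glorot_uniform(fan_in, fan_out, rng)

X = rng.normal(size=(1000, fan_in))  # unit-variance inputs
Z = X @ W                            # weighted sums, before any activation

target_var = 2.0 / (fan_in + fan_out)
print(W.var(), target_var)  # close to each other
print(X.var(), Z.var())     # both close to 1
```

When `fan_in` and `fan_out` differ, the forward and backward conditions cannot both hold exactly, and the formula is a compromise between the two.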

He initialization

A similar but slightly different initialization method, designed to work with ReLU as well as Leaky ReLU, ELU, GELU, Swish, and Mish.

Ensure that the initialization method matches the chosen activation function.

import tensorflow as tf
from tensorflow.keras.layers import Dense

# Dense layer with 50 ReLU units whose kernel weights use He normal initialization
dense = Dense(50, activation="relu", kernel_initializer="he_normal")
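
Why the scale matters can be sketched with NumPy alone. The example below pushes random inputs through a stack of ReLU layers and records the mean squared activation after each layer, once with He scaling (weight variance \(2 / \text{fan}_{\text{in}}\)) and once with \(\mathcal{N}(0, 1)\) weights. The width (512), depth (5), and helper name `relu_stack` are assumptions, and a plain normal distribution is used for simplicity; library implementations may differ in detail.

```python
import numpy as np

def relu_stack(std, n=512, depth=5, seed=0):
    """Mean squared activation after each ReLU layer, for weights ~ N(0, std^2)."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(1000, n))  # hypothetical unit-variance inputs
    moments = []
    for _ in range(depth):
        W = rng.normal(0.0, std, size=(n, n))
        a = np.maximum(0.0, a @ W)
        moments.append(float(np.mean(a ** 2)))
    return moments

n = 512
he_moments = relu_stack(std=np.sqrt(2.0 / n), n=n)  # He scaling
naive_moments = relu_stack(std=1.0, n=n)            # N(0, 1) weights
print(he_moments)
print(naive_moments)
```

With He scaling the signal stays at roughly the same magnitude from layer to layer, while with unit-variance weights it grows by a factor of about `n / 2` per layer.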

Note

Randomly initializing the weights is sufficient to break symmetry in a neural network, so the bias terms can be set to zero without impairing the network’s ability to learn effectively.

Activation Function: Leaky ReLU

Prologue

Summary

  • Artificial Neural Networks (ANNs):
    • Inspired by biological neural networks.
    • Consist of interconnected neurons arranged in layers.
    • Applicable to supervised, unsupervised, and reinforcement learning.
  • Feedforward Neural Networks (FNNs):
    • Information flows unidirectionally from input to output.
    • Comprised of input, hidden, and output layers.
    • Can vary in the number of layers and nodes per layer.
  • Activation Functions:
    • Introduce non-linearity to enable learning complex patterns.
    • Common functions: Sigmoid, Tanh, ReLU, Leaky ReLU.
    • Choice of activation function affects gradient flow and network performance.
  • Universal Approximation Theorem:
    • A feedforward network with a single hidden layer and enough neurons can approximate any continuous function on a compact subset of \(\mathbb{R}^n\).
  • Backpropagation Algorithm:
    • Training involves forward pass, loss computation, backward pass, and weight updates.
    • Utilizes gradient descent to minimize the loss function.
    • Enables training of multi-layer perceptrons by adjusting internal weights.
  • Vanishing Gradient Problem:
    • Gradients become too small during backpropagation, hindering training.
    • Mitigation strategies include using ReLU activation functions and proper weight initialization (Glorot or He initialization).
  • Weight Initialization:
    • Random initialization breaks symmetry and allows effective learning.
    • Glorot initialization suits sigmoid and tanh activations.
    • He initialization is designed for ReLU and its variants.
  • Key Concepts:
    • Learning rate determines the step size during optimization.
    • Gradient descent is used to update weights in the direction of minimizing loss.
    • Proper selection of activation functions and initialization methods is crucial for effective training.

3Blue1Brown

A series of videos, with animations, providing the intuition behind the backpropagation algorithm.

StatQuest

Herman Kamper

One of the most thorough series of videos on the backpropagation algorithm.

Next lecture

  • We will talk about softmax, cross-entropy, and regularization.

References

Angermueller, Christof, Tanel Pärnamaa, Leopold Parts, and Oliver Stegle. 2016. “Deep Learning for Computational Biology.” Mol Syst Biol 12 (7): 878. https://doi.org/10.15252/msb.20156651.
Cybenko, George V. 1989. “Approximation by Superpositions of a Sigmoidal Function.” Mathematics of Control, Signals and Systems 2: 303–14. https://api.semanticscholar.org/CorpusID:3958369.
Géron, Aurélien. 2022. Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow. 3rd ed. O’Reilly Media, Inc.
Glorot, Xavier, and Yoshua Bengio. 2010. “Understanding the Difficulty of Training Deep Feedforward Neural Networks.” In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, edited by Yee Whye Teh and Mike Titterington, 9:249–56. Proceedings of Machine Learning Research. Chia Laguna Resort, Sardinia, Italy: PMLR. https://proceedings.mlr.press/v9/glorot10a.html.
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. “Deep Residual Learning for Image Recognition.” In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770–78. https://doi.org/10.1109/CVPR.2016.90.
Hornik, Kurt, Maxwell Stinchcombe, and Halbert White. 1989. “Multilayer Feedforward Networks Are Universal Approximators.” Neural Networks 2 (5): 359–66. https://doi.org/10.1016/0893-6080(89)90020-8.
Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. 1986. “Learning Representations by Back-Propagating Errors.” Nature 323 (6088): 533–36. https://doi.org/10.1038/323533a0.
Russell, Stuart, and Peter Norvig. 2020. Artificial Intelligence: A Modern Approach. 4th ed. Pearson. http://aima.cs.berkeley.edu/.

Marcel Turcotte

Marcel.Turcotte@uOttawa.ca

School of Electrical Engineering and Computer Science (EECS)

University of Ottawa