Training Artificial Neural Networks (Part 1)

CSI 4106 - Fall 2025

Marcel Turcotte

Version: Oct 20, 2025 12:35

Preamble

Message of the Day

Learning objectives

  • Explain the architecture and function of feed-forward neural networks (FNNs).
  • Describe the backpropagation algorithm and its role in training neural networks.
  • Identify common activation functions and understand their impact on network performance.

Summary

3Blue1Brown (1/2)

3Blue1Brown (2/2)

Summary - DL

  • Deep learning (DL) is a machine learning technique that can be applied to supervised learning (including regression and classification), unsupervised learning, and reinforcement learning.

  • Inspired by the structure and function of biological neural networks found in animals.

  • Comprises interconnected neurons (or units) arranged into layers.

Summary - FNN

Summary - FNN

Summary - units

Common Activation Functions

# Attribution: https://github.com/ageron/handson-ml3/blob/main/10_neural_nets_with_keras.ipynb

import numpy as np
import matplotlib.pyplot as plt

from scipy.special import expit as sigmoid

def relu(z):
    return np.maximum(0, z)

def derivative(f, z, eps=0.000001):
    return (f(z + eps) - f(z - eps))/(2 * eps)

max_z = 4.5
z = np.linspace(-max_z, max_z, 200)

plt.figure(figsize=(11, 3.1))

plt.subplot(121)
plt.plot(z, relu(z), "m-.", linewidth=2, label="ReLU")
plt.plot(z, sigmoid(z), "g--", linewidth=2, label="Sigmoid")
plt.plot(z, np.tanh(z), "b-", linewidth=1, label="Tanh")
plt.grid(True)
plt.title("Activation functions")
plt.axis([-max_z, max_z, -1.65, 2.4])
plt.gca().set_yticks([-1, 0, 1, 2])
plt.legend(loc="lower right", fontsize=13)

plt.subplot(122)
plt.plot(z, derivative(sigmoid, z), "g--", linewidth=2, label="Sigmoid")
plt.plot(z, derivative(np.tanh, z), "b-", linewidth=1, label="Tanh")
plt.plot([-max_z, 0], [0, 0], "m-.", linewidth=2)
plt.plot([0, max_z], [1, 1], "m-.", linewidth=2)
plt.plot([0, 0], [0, 1], "m-.", linewidth=1.2)
plt.plot(0, 1, "mo", markersize=5)
plt.plot(0, 1, "mx", markersize=10)
plt.grid(True)
plt.title("Derivatives")
plt.axis([-max_z, max_z, -0.2, 1.2])

plt.show()
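
As a quick sanity check of the right-hand panel, the central-difference estimate used above can be compared with the closed-form derivatives \(\sigma^\prime(z) = \sigma(z)(1-\sigma(z))\) and \(\tanh^\prime(z) = 1 - \tanh^2(z)\). This is only an illustrative sketch: it defines `sigmoid` with NumPy instead of importing it from SciPy, and the variable names are not part of the original code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def derivative(f, z, eps=1e-6):
    # Central-difference estimate, as in the plotting code
    return (f(z + eps) - f(z - eps)) / (2 * eps)

z = np.linspace(-4.5, 4.5, 200)

# Closed-form derivatives
sigmoid_exact = sigmoid(z) * (1.0 - sigmoid(z))
tanh_exact = 1.0 - np.tanh(z) ** 2

sigmoid_err = np.max(np.abs(derivative(sigmoid, z) - sigmoid_exact))
tanh_err = np.max(np.abs(derivative(np.tanh, z) - tanh_exact))

print(f"max |numeric - exact|, sigmoid: {sigmoid_err:.2e}")
print(f"max |numeric - exact|, tanh:    {tanh_err:.2e}")
```

Both discrepancies should be tiny, which suggests that the plotted curves are faithful to the analytic derivatives.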

Universal Approximation

The universal approximation theorem states that a feed-forward neural network with a single hidden layer containing a finite number of neurons can approximate any continuous function on a compact subset of \(\mathbb{R}^n\) to any desired accuracy, given appropriate weights and a suitable non-linear activation function.
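
The theorem is an existence result, but a small experiment can make it plausible. The sketch below approximates \(\sin(x)\) on \([-\pi, \pi]\) with one hidden layer of tanh units. As a simplifying assumption (this is not how networks are trained in this lecture), the hidden weights are random and fixed, and only the output weights are fitted by least squares. The helper `fit_rmse` and all numerical choices are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: a continuous function on the compact interval [-pi, pi]
x = np.linspace(-np.pi, np.pi, 400).reshape(-1, 1)
target = np.sin(x).ravel()

def fit_rmse(n_hidden, W, b):
    """Fit only the output layer on top of the first n_hidden tanh units."""
    H = np.tanh(x @ W[:, :n_hidden] + b[:n_hidden])   # hidden activations
    H = np.hstack([H, np.ones((x.shape[0], 1))])      # output bias column
    coef, *_ = np.linalg.lstsq(H, target, rcond=None)
    return np.sqrt(np.mean((H @ coef - target) ** 2))

# Random, fixed hidden weights; only the output weights are fitted
W = rng.normal(0.0, 2.0, size=(1, 50))
b = rng.uniform(-np.pi, np.pi, size=50)

rmse_small = fit_rmse(3, W, b)
rmse_large = fit_rmse(50, W, b)

print(f"RMSE with  3 hidden units: {rmse_small:.4f}")
print(f"RMSE with 50 hidden units: {rmse_large:.6f}")
```

Adding hidden units can only improve the least-squares fit here, because the 3-unit model is a special case of the 50-unit one.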

Naïve MLP

Data

# Generate and plot the "circles" dataset
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles

# Generate synthetic data
X, y = make_circles(n_samples=1200, factor=0.35, noise=0.06, random_state=42)

# Separate coordinates for plotting
x1, x2 = X[:, 0], X[:, 1]

# Plot the two classes
plt.figure(figsize=(5, 5))
plt.scatter(x1[y==0], x2[y==0], color="C0", label="class 0 (outer ring)")
plt.scatter(x1[y==1], x2[y==1], color="C1", label="class 1 (inner circle)")
plt.xlabel("x₁")
plt.ylabel("x₂")
plt.title("Dataset generated with make_circles")
plt.axis("equal") # ensures circles look round
plt.legend()
plt.show()

Architecture

Utilities

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(y_true, y_prob, eps=1e-9):

    """Binary cross-entropy loss (average over data)."""

    y_prob = np.clip(y_prob, eps, 1 - eps)

    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
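
To get a feel for the scale of this loss, the following sketch evaluates it on three made-up prediction vectors. The labels and probabilities are invented for illustration.

```python
import numpy as np

def bce_loss(y_true, y_prob, eps=1e-9):
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

y_true = np.array([1, 0, 1, 0])

uninformed = bce_loss(y_true, np.full(4, 0.5))                       # always predicts 0.5
confident_right = bce_loss(y_true, np.array([0.9, 0.1, 0.9, 0.1]))   # mostly correct
confident_wrong = bce_loss(y_true, np.array([0.1, 0.9, 0.1, 0.9]))   # mostly wrong

print(uninformed, confident_right, confident_wrong)
```

A model that always outputs 0.5 scores \(\ln 2 \approx 0.693\), which is a useful reference value when reading the training logs of a freshly initialized network.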

NaïveMLP

The complete implementation is presented below and will be examined in the subsequent slides.

class NaiveMLP:

    """
    A minimal multilayer perceptron (MLP) utilizing a brute force training 
    algorithm that does not require derivative calculations.

    Please note that the suggested training algorithm is intended solely for 
    didactic purposes and should not be mistaken for a genuine training algorithm.
    """

    def __init__(self, layer_sizes, step=0.1, seed=None):
        
        self.sizes = list(layer_sizes)
        self.step = float(step)
        rng = np.random.default_rng(seed)

        # Initialize weights and biases

        self.W = [rng.standard_normal(size=(in_d, out_d)) * 0.5
                  for in_d, out_d in zip(layer_sizes[:-1], layer_sizes[1:])]

        self.b = [np.zeros(out_d) for out_d in layer_sizes[1:]]

    def forward(self, X):

        """
        Simple forward pass: compute output activations only.

        X: shape (N, input_dim)

        Returns: output probabilities, shape (N,)
        """

        a = X
        for W, b in zip(self.W, self.b):
            a = sigmoid(a @ W + b)

        return a.ravel()

    def predict(self, X, threshold=0.5):

        return (self.forward(X) >= threshold).astype(int)

    def loss(self, X, y):

        return bce_loss(y, self.forward(X))

    def _all_param_tags(self):

        """
        Yields tags referencing every scalar parameter:
        ('W', layer_idx, i, j) or ('b', layer_idx, j)
        """

        for l, W in enumerate(self.W):
            for i in range(W.shape[0]):
                for j in range(W.shape[1]):
                    yield ('W', l, i, j)
            for j in range(self.b[l].shape[0]):
                yield ('b', l, j)

    def _get_param(self, tag):
        kind = tag[0]
        if kind == 'W':
            _, l, i, j = tag
            return self.W[l][i, j]
        else:
            _, l, j = tag
            return self.b[l][j]

    def _set_param(self, tag, val):
        kind = tag[0]
        if kind == 'W':
            _, l, i, j = tag
            self.W[l][i, j] = val
        else:
            _, l, j = tag
            self.b[l][j] = val

    def train(self, X, y, epochs=10, verbose=True):

        """
        Simultaneous update:
        - For each scalar parameter θ, try θ + δ for δ in {−step, 0, +step},
          pick the δ that gives minimal loss.
        - Collect all chosen δ’s, then apply all updates together.
        """

        for ep in range(1, epochs + 1):

            base_loss = self.loss(X, y)
            updates = {}

            # Probe all parameters
            for tag in self._all_param_tags():

                theta = self._get_param(tag)
                best_delta = 0.0
                best_loss = base_loss

                for delta in (-self.step, 0.0, +self.step):
                    self._set_param(tag, theta + delta)
                    trial_loss = self.loss(X, y)
                    if trial_loss < best_loss:
                        best_loss = trial_loss
                        best_delta = delta

                # restore original
                self._set_param(tag, theta)
                updates[tag] = best_delta

            # Apply all deltas together
            for tag, d in updates.items():
                if d != 0.0:
                    self._set_param(tag, self._get_param(tag) + d)

            new_loss = self.loss(X, y)

            if verbose:
                print(f"Epoch {ep:3d}: loss {base_loss:.5f} → {new_loss:.5f}")

            # optional early stop
            if abs(new_loss - base_loss) < 1e-12:
                break

Class Definition

class NaiveMLP:

    """
    A minimal multilayer perceptron (MLP) utilizing a brute force training 
    algorithm that does not require derivative calculations.

    Please note that the suggested training algorithm is intended solely for 
    didactic purposes and should not be mistaken for a genuine training algorithm.
    """

Constructor

    def __init__(self, layer_sizes, step=0.1, seed=None):
        
        self.sizes = list(layer_sizes)
        self.step = float(step)
        rng = np.random.default_rng(seed)

        # Initialize weights and biases

        self.W = [rng.standard_normal(size=(in_d, out_d)) * 0.5
                  for in_d, out_d in zip(layer_sizes[:-1], layer_sizes[1:])]

        self.b = [np.zeros(out_d) for out_d in layer_sizes[1:]]

Python

seed = 0

rng = np.random.default_rng(seed)

layer_sizes = [2, 4, 4, 1]

[(in_d, out_d) for in_d, out_d in zip(layer_sizes[:-1], layer_sizes[1:])]
[(2, 4), (4, 4), (4, 1)]
[rng.standard_normal(size=(in_d, out_d)) * 0.5 for in_d, out_d in zip(layer_sizes[:-1], layer_sizes[1:])]
[array([[ 0.06286511, -0.06605243,  0.32021133,  0.05245006],
        [-0.26783469,  0.18079753,  0.65200002,  0.47354048]]),
 array([[-0.35186762, -0.63271074, -0.31163723,  0.02066299],
        [-1.16251539, -0.10939583, -0.62295547, -0.36613368],
        [-0.27212949, -0.15815008,  0.20581527,  0.52125668],
        [-0.06426733,  0.68323174, -0.33259734,  0.17575504]]),
 array([[ 0.45173509],
        [ 0.04700615],
        [-0.37174962],
        [-0.46086269]])]

Python

[out_d for out_d in layer_sizes[1:]]
[4, 4, 1]
[np.zeros(out_d) for out_d in layer_sizes[1:]]
[array([0., 0., 0., 0.]), array([0., 0., 0., 0.]), array([0.])]

Forward Pass

    def forward(self, X):

        """
        Simple forward pass: compute output activations.
        X: shape (N, input_dim)
        Returns: output probabilities, shape (N,)
        """

        a = X
        for W, b in zip(self.W, self.b):
            a = sigmoid(a @ W + b)

        return a.ravel()
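
The loop relies on matrix shapes lining up from one layer to the next. The following standalone sketch traces those shapes for the `[2, 4, 4, 1]` architecture; the batch of 5 random inputs is made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
layer_sizes = [2, 4, 4, 1]

W = [rng.standard_normal(size=(in_d, out_d)) * 0.5
     for in_d, out_d in zip(layer_sizes[:-1], layer_sizes[1:])]
b = [np.zeros(out_d) for out_d in layer_sizes[1:]]

X = rng.standard_normal(size=(5, 2))   # N = 5 made-up examples

a = X
shapes = [a.shape]
for Wl, bl in zip(W, b):
    a = sigmoid(a @ Wl + bl)           # (N, in_d) @ (in_d, out_d) + (out_d,)
    shapes.append(a.shape)

print(shapes)             # [(5, 2), (5, 4), (5, 4), (5, 1)]
print(a.ravel().shape)    # (5,)
```

The bias vector of shape `(out_d,)` is broadcast across the `N` rows, and `ravel()` flattens the final `(N, 1)` column into one probability per example.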

Making predictions

    def predict(self, X, threshold=0.5):

        return (self.forward(X) >= threshold).astype(int)

Computing the loss

    def loss(self, X, y):

        return bce_loss(y, self.forward(X))

Discussion

With the exception of the training algorithm, our neural network implementation is now complete.

For those who are not familiar with the back-propagation algorithm, how do you propose to learn the parameters of the model?

Change weights → compute loss → keep if better → repeat.

Pseudocode

for each epoch:
    for each parameter w in network:
        best_delta = 0
        best_loss = current_loss
        for delta in [-0.01, 0, +0.01]:
            w_temp = w + delta
            loss_temp = compute_loss(w_temp, data)
            if loss_temp < best_loss:
                best_loss = loss_temp
                best_delta = delta
        w += best_delta
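
The pseudocode can be tried on a problem small enough to follow by hand. In this sketch a two-parameter quadratic stands in for the network loss; the function, the starting point, and the step size are assumptions chosen for illustration.

```python
def loss(w):
    # Toy quadratic bowl with its minimum at (1, -2); stands in for the network loss
    return (w[0] - 1.0) ** 2 + (w[1] + 2.0) ** 2

w = [0.0, 0.0]
step = 0.1

for epoch in range(30):
    for k in range(len(w)):
        original = w[k]
        best_delta, best_loss = 0.0, loss(w)
        for delta in (-step, 0.0, +step):
            w[k] = original + delta
            trial = loss(w)
            if trial < best_loss:
                best_loss, best_delta = trial, delta
        w[k] = original + best_delta   # keep the best of the three candidates

print(w, loss(w))
```

Each parameter moves by at most one step per epoch, so reaching the minimum takes about 20 epochs here, and every probe costs a full evaluation of the loss.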

Python

    def _all_param_tags(self):

        """
        Yields tags referencing every scalar parameter:
        ('W', layer_idx, i, j) or ('b', layer_idx, j)
        """

        for l, W in enumerate(self.W):
            for i in range(W.shape[0]):
                for j in range(W.shape[1]):
                    yield ('W', l, i, j)
            for j in range(self.b[l].shape[0]):
                yield ('b', l, j)

Python

class Demo:

    def __init__(self, layer_sizes):
        self.sizes = list(layer_sizes)
        rng = np.random.default_rng(0)
        self.W = [rng.standard_normal(size=(in_d, out_d)) * 0.5
                  for in_d, out_d in zip(layer_sizes[:-1], layer_sizes[1:])]
        self.b = [np.zeros(out_d) for out_d in layer_sizes[1:]]

    def _all_param_tags(self):

        """
        Yields tags referencing every scalar parameter:
        ('W', layer_idx, i, j) or ('b', layer_idx, j)
        """

        for l, W in enumerate(self.W):
            for i in range(W.shape[0]):
                for j in range(W.shape[1]):
                    yield ('W', l, i, j)
            for j in range(self.b[l].shape[0]):
                yield ('b', l, j)

    def show(self):

      for tag in self._all_param_tags():
        print(tag)
d = Demo([2,4,4,1])
d.show()

Python

('W', 0, 0, 0)
('W', 0, 0, 1)
('W', 0, 0, 2)
('W', 0, 0, 3)
('W', 0, 1, 0)
('W', 0, 1, 1)
('W', 0, 1, 2)
('W', 0, 1, 3)
('b', 0, 0)
('b', 0, 1)
('b', 0, 2)
('b', 0, 3)
('W', 1, 0, 0)
('W', 1, 0, 1)
('W', 1, 0, 2)
('W', 1, 0, 3)
('W', 1, 1, 0)
('W', 1, 1, 1)
('W', 1, 1, 2)
('W', 1, 1, 3)
('W', 1, 2, 0)
('W', 1, 2, 1)
('W', 1, 2, 2)
('W', 1, 2, 3)
('W', 1, 3, 0)
('W', 1, 3, 1)
('W', 1, 3, 2)
('W', 1, 3, 3)
('b', 1, 0)
('b', 1, 1)
('b', 1, 2)
('b', 1, 3)
('W', 2, 0, 0)
('W', 2, 1, 0)
('W', 2, 2, 0)
('W', 2, 3, 0)
('b', 2, 0)

Python

    def _get_param(self, tag):
        kind = tag[0]
        if kind == 'W':
            _, l, i, j = tag
            return self.W[l][i, j]
        else:
            _, l, j = tag
            return self.b[l][j]

Python

    def _set_param(self, tag, val):
        kind = tag[0]
        if kind == 'W':
            _, l, i, j = tag
            self.W[l][i, j] = val
        else:
            _, l, j = tag
            self.b[l][j] = val

Training (learning)

    def train(self, X, y, epochs=10, verbose=True):

        for ep in range(1, epochs + 1):

            base_loss = self.loss(X, y)
            updates = {}

            # Probe all parameters
            for tag in self._all_param_tags():

                theta = self._get_param(tag)
                best_delta = 0.0
                best_loss = base_loss

                for delta in (-self.step, 0.0, +self.step):
                    self._set_param(tag, theta + delta)
                    trial_loss = self.loss(X, y)
                    if trial_loss < best_loss:
                        best_loss = trial_loss
                        best_delta = delta

                # restore original
                self._set_param(tag, theta)
                updates[tag] = best_delta

            # Apply all deltas together
            for tag, d in updates.items():
                if d != 0.0:
                    self._set_param(tag, self._get_param(tag) + d)

            new_loss = self.loss(X, y)

            if verbose:
                print(f"Epoch {ep:3d}: loss {base_loss:.5f} → {new_loss:.5f}")

            # optional early stop
            if abs(new_loss - base_loss) < 1e-12:
                break

Phew!

Does it work?

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

model = NaiveMLP([2, 4, 4, 1], step=0.06, seed=0)

print("Initial loss:", model.loss(X_train, y_train))

model.train(X_train, y_train, epochs=100)

print("Train acc:", accuracy_score(y_train, model.predict(X_train)))
print("Test acc: ", accuracy_score(y_test, model.predict(X_test)))
Initial loss: 0.7001969055705487
Epoch   1: loss 0.70020 → 0.69315
Epoch   2: loss 0.69315 → 0.69547
Epoch   3: loss 0.69547 → 0.69387
Epoch   4: loss 0.69387 → 0.69542
Epoch   5: loss 0.69542 → 0.69384
Epoch   6: loss 0.69384 → 0.69537
Epoch   7: loss 0.69537 → 0.69382
Epoch   8: loss 0.69382 → 0.69531
Epoch   9: loss 0.69531 → 0.69379
Epoch  10: loss 0.69379 → 0.69525
Epoch  11: loss 0.69525 → 0.69361
Epoch  12: loss 0.69361 → 0.69517
Epoch  13: loss 0.69517 → 0.69331
Epoch  14: loss 0.69331 → 0.69503
Epoch  15: loss 0.69503 → 0.69298
Epoch  16: loss 0.69298 → 0.69468
Epoch  17: loss 0.69468 → 0.69257
Epoch  18: loss 0.69257 → 0.69364
Epoch  19: loss 0.69364 → 0.69200
Epoch  20: loss 0.69200 → 0.69173
Epoch  21: loss 0.69173 → 0.69118
Epoch  22: loss 0.69118 → 0.68965
Epoch  23: loss 0.68965 → 0.68970
Epoch  24: loss 0.68970 → 0.68716
Epoch  25: loss 0.68716 → 0.68672
Epoch  26: loss 0.68672 → 0.68417
Epoch  27: loss 0.68417 → 0.68170
Epoch  28: loss 0.68170 → 0.67855
Epoch  29: loss 0.67855 → 0.67406
Epoch  30: loss 0.67406 → 0.66867
Epoch  31: loss 0.66867 → 0.66203
Epoch  32: loss 0.66203 → 0.65492
Epoch  33: loss 0.65492 → 0.64672
Epoch  34: loss 0.64672 → 0.63852
Epoch  35: loss 0.63852 → 0.62948
Epoch  36: loss 0.62948 → 0.61977
Epoch  37: loss 0.61977 → 0.60918
Epoch  38: loss 0.60918 → 0.59844
Epoch  39: loss 0.59844 → 0.58893
Epoch  40: loss 0.58893 → 0.57598
Epoch  41: loss 0.57598 → 0.56310
Epoch  42: loss 0.56310 → 0.55035
Epoch  43: loss 0.55035 → 0.53809
Epoch  44: loss 0.53809 → 0.52214
Epoch  45: loss 0.52214 → 0.50660
Epoch  46: loss 0.50660 → 0.49073
Epoch  47: loss 0.49073 → 0.47591
Epoch  48: loss 0.47591 → 0.45758
Epoch  49: loss 0.45758 → 0.44074
Epoch  50: loss 0.44074 → 0.42251
Epoch  51: loss 0.42251 → 0.41069
Epoch  52: loss 0.41069 → 0.38858
Epoch  53: loss 0.38858 → 0.36998
Epoch  54: loss 0.36998 → 0.35227
Epoch  55: loss 0.35227 → 0.34356
Epoch  56: loss 0.34356 → 0.32577
Epoch  57: loss 0.32577 → 0.31462
Epoch  58: loss 0.31462 → 0.29240
Epoch  59: loss 0.29240 → 0.27704
Epoch  60: loss 0.27704 → 0.25851
Epoch  61: loss 0.25851 → 0.25409
Epoch  62: loss 0.25409 → 0.23884
Epoch  63: loss 0.23884 → 0.23260
Epoch  64: loss 0.23260 → 0.21815
Epoch  65: loss 0.21815 → 0.21238
Epoch  66: loss 0.21238 → 0.19964
Epoch  67: loss 0.19964 → 0.19350
Epoch  68: loss 0.19350 → 0.17787
Epoch  69: loss 0.17787 → 0.16963
Epoch  70: loss 0.16963 → 0.15270
Epoch  71: loss 0.15270 → 0.14804
Epoch  72: loss 0.14804 → 0.13478
Epoch  73: loss 0.13478 → 0.13357
Epoch  74: loss 0.13357 → 0.12381
Epoch  75: loss 0.12381 → 0.12041
Epoch  76: loss 0.12041 → 0.10789
Epoch  77: loss 0.10789 → 0.10512
Epoch  78: loss 0.10512 → 0.09204
Epoch  79: loss 0.09204 → 0.08493
Epoch  80: loss 0.08493 → 0.07447
Epoch  81: loss 0.07447 → 0.07243
Epoch  82: loss 0.07243 → 0.06423
Epoch  83: loss 0.06423 → 0.06479
Epoch  84: loss 0.06479 → 0.06053
Epoch  85: loss 0.06053 → 0.05940
Epoch  86: loss 0.05940 → 0.05501
Epoch  87: loss 0.05501 → 0.05465
Epoch  88: loss 0.05465 → 0.05370
Epoch  89: loss 0.05370 → 0.05036
Epoch  90: loss 0.05036 → 0.04376
Epoch  91: loss 0.04376 → 0.04596
Epoch  92: loss 0.04596 → 0.04283
Epoch  93: loss 0.04283 → 0.04213
Epoch  94: loss 0.04213 → 0.04031
Epoch  95: loss 0.04031 → 0.03879
Epoch  96: loss 0.03879 → 0.03629
Epoch  97: loss 0.03629 → 0.03574
Epoch  98: loss 0.03574 → 0.03499
Epoch  99: loss 0.03499 → 0.03293
Epoch 100: loss 0.03293 → 0.03075
Train acc: 1.0
Test acc:  1.0

Visualization

# Plot helper: decision boundary in the original (x1, x2) plane

def plot_decision_boundary(model, X, y, title="Naïve MLP decision boundary"):

    # grid over the input plane
    pad = 0.3
    x1_min, x1_max = X[:,0].min()-pad, X[:,0].max()+pad
    x2_min, x2_max = X[:,1].min()-pad, X[:,1].max()+pad

    xx, yy = np.meshgrid(
        np.linspace(x1_min, x1_max, 400),
        np.linspace(x2_min, x2_max, 400)
    )
    grid = np.c_[xx.ravel(), yy.ravel()]

    # predict probabilities on the grid
    p = model.forward(grid).reshape(xx.shape)

    # filled probabilities + p=0.5 contour + data points
    plt.figure(figsize=(3.75, 3.75), dpi=140)
    plt.contourf(xx, yy, p, levels=50, alpha=0.7)
    cs = plt.contour(xx, yy, p, levels=[0.5], linewidths=2)
    plt.scatter(X[:,0], X[:,1], c=y, s=18, edgecolor="k", linewidth=0.2)
    plt.clabel(cs, fmt={0.5: "p=0.5"})
    plt.title(title)
    plt.xlabel("x₁")
    plt.ylabel("x₂")
    plt.tight_layout()
    plt.show()

plot_decision_boundary(model, X, y)

XOR-like data

n_samples = 800
rng = np.random.default_rng(42)

X = rng.uniform(-6, 6, size=(n_samples, 2))
x1, x2 = X[:, 0], X[:, 1]

y = ((x1 * x2) > 0).astype(int)

plt.figure(figsize=(4.5, 4.5))
plt.scatter(X[y == 0, 0], X[y == 0, 1],
            color="C0", label="class 0", edgecolor="k", linewidth=0.3)
plt.scatter(X[y == 1, 0], X[y == 1, 1],
            color="C1", label="class 1", edgecolor="k", linewidth=0.3)

plt.axhline(0, color="gray", linestyle="--", linewidth=1)
plt.axvline(0, color="gray", linestyle="--", linewidth=1)

plt.xlabel("x₁")
plt.ylabel("x₂")
plt.title("XOR-like data")
plt.xlim(-6, 6)
plt.ylim(-6, 6)
plt.axis("equal")
plt.legend()
plt.tight_layout()
plt.show()

XOR-like data (continued)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

model = NaiveMLP([2, 4, 4, 1], step=0.06, seed=0)

print("Initial loss:", model.loss(X_train, y_train))

model.train(X_train, y_train, epochs=1000, verbose=False)

print("Final loss:", model.loss(X_train, y_train))

print("Train acc:", accuracy_score(y_train, model.predict(X_train)))
print("Test acc: ", accuracy_score(y_test, model.predict(X_test)))
Initial loss: 0.7029764953722529
Final loss: 3.4494531195516116e-07
Train acc: 1.0
Test acc:  0.99

XOR-like data (continued)

plot_decision_boundary(model, X, y)

Drawbacks

  • Computational inefficiency.
  • Scalability limitations.
  • Fixed step size (±η) lacks adaptivity.
  • Poor coordination of parameters.
  • No directional or magnitude information.
  • Lack of sophisticated optimizer features.
  • Potential for over-fitting or poor generalisation.
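
The first two drawbacks can be quantified with a rough count. The naïve algorithm evaluates the loss over the whole training set three times per scalar parameter and per epoch. The helper names below are hypothetical, and the second architecture is a made-up, larger example used only to show how the cost scales.

```python
def count_parameters(layer_sizes):
    """Number of scalar weights and biases in a fully connected network."""
    return sum(in_d * out_d + out_d
               for in_d, out_d in zip(layer_sizes[:-1], layer_sizes[1:]))

def probes_per_epoch(layer_sizes, candidates=3):
    # One full forward pass over the training set per candidate delta and per parameter
    return candidates * count_parameters(layer_sizes)

for sizes in ([2, 4, 4, 1], [784, 128, 64, 10]):
    print(sizes, count_parameters(sizes), probes_per_epoch(sizes))
```

The small network has 37 parameters, hence 111 full forward passes per epoch. The larger one has 109,386 parameters and would need 328,158 passes per epoch, whereas back-propagation obtains all partial derivatives from one forward and one backward pass.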

Notation

Notation

A two-layer perceptron computes:

\[ \hat{y} = \phi_2(\phi_1(X)) \]

where

\[ \phi_l(Z) = \phi(W_l Z + b_l) \]

Notation

A 3-layer perceptron computes:

\[ \hat{y} = \phi_3(\phi_2(\phi_1(X))) \]

where

\[ \phi_l(Z) = \phi(W_l Z + b_l) \]

Notation

A \(k\)-layer perceptron computes:

\[ \hat{y} = \phi_k( \ldots \phi_2(\phi_1(X)) \ldots ) \]

where

\[ \phi_l(Z) = \phi(W_l Z + b_l) \]
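
The nested notation corresponds directly to function composition. The sketch below builds one function per layer and folds them over the input. It assumes column vectors (so the product is `W @ Z`, unlike the row-wise `a @ W` used in `NaiveMLP`), a sigmoid for \(\phi\), and random weights; `make_layer` is a hypothetical helper.

```python
from functools import reduce
import numpy as np

def phi(z):
    # Activation shared by all layers in this sketch (sigmoid)
    return 1.0 / (1.0 + np.exp(-z))

def make_layer(W, b):
    # phi_l(Z) = phi(W_l Z + b_l), with Z a column vector
    return lambda Z: phi(W @ Z + b)

rng = np.random.default_rng(0)
sizes = [2, 4, 4, 1]   # a 3-layer example
layers = [make_layer(rng.standard_normal((out_d, in_d)), np.zeros((out_d, 1)))
          for in_d, out_d in zip(sizes[:-1], sizes[1:])]

X = np.array([[0.5], [-1.0]])                           # one input, shape (2, 1)
y_hat = reduce(lambda Z, layer: layer(Z), layers, X)    # phi_3(phi_2(phi_1(X)))

print(y_hat.shape)   # (1, 1)
```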

Back-propagation

3Blue1Brown

Back-propagation

Learning representations by back-propagating errors

David E. Rumelhart, Geoffrey E. Hinton & Ronald J. Williams

We describe a new learning procedure, back-propagation, for networks of neurone-like units. The procedure repeatedly adjusts the weights of the connections in the network so as to minimize a measure of the difference between the actual output vector of the net and the desired output vector. As a result of the weight adjustments, internal ‘hidden’ units which are not part of the input or output come to represent important features of the task domain, and the regularities in the task are captured by the interactions of these units. The ability to create useful new features distinguishes back-propagation from earlier, simpler methods such as the perceptron-convergence procedure.

Before back-propagation

  • Limitations, such as the inability to solve the XOR classification task, essentially stalled research on neural networks.

  • The perceptron was limited to a single layer, and there was no known method for training a multi-layer perceptron.

  • Single-layer perceptrons are limited to solving classification tasks that are linearly separable.
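
The XOR limitation can be checked by brute force. The sketch below tries every single threshold unit whose weights and bias lie on a coarse grid and reports the best number of XOR points classified correctly. The grid is an arbitrary choice; the conclusion does not depend on it, because XOR is not linearly separable.

```python
from itertools import product

# The four XOR points and their labels
points = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 1, 1, 0]

def n_correct(w1, w2, b):
    # Single threshold unit: predict 1 when w1*x1 + w2*x2 + b > 0
    preds = [int(w1 * x1 + w2 * x2 + b > 0) for x1, x2 in points]
    return sum(p == t for p, t in zip(preds, labels))

grid = [v / 2 for v in range(-8, 9)]   # -4.0, -3.5, ..., 4.0
best = max(n_correct(w1, w2, b) for w1, w2, b in product(grid, repeat=3))

print(best)   # 3
```

No setting gets all four points right. For example, \(w_1 = w_2 = 1\), \(b = -0.5\) implements OR and misclassifies only \((1, 1)\).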

Back-propagation: contributions

  • The model employs mean squared error as its loss function.

  • Gradient descent is used to minimize loss.

  • A sigmoid activation function is used instead of a step function, as its derivative provides valuable information for gradient descent.

  • Shows how to update internal weights using a two-pass algorithm consisting of a forward pass and a backward pass.

  • Enables training multi-layer perceptrons.

Conceptual Idea

Conceptual Idea (continued)

Conceptual Idea (continued)

Backpropagation

  • Backpropagation is an algorithm for methodically computing the partial derivatives of a neural network’s loss function with respect to each weight and bias parameter.

  • Backpropagation applies the chain rule of calculus recursively to compute \(\frac{\partial J}{\partial w_{i,j}^{(\ell)}}\) for all network parameters \(w_{i,j}^{(\ell)}\) efficiently, using intermediate quantities from the forward pass, where \(w_{i,j}^{(\ell)}\) denotes the parameter \(w_{i,j}\) of the layer \(\ell\).

Chain rule

Given,

\[ h(x) = f(g(x)) \]

using the Lagrange notation, we have

\[ h^\prime(x) = f^\prime(g(x)) g^\prime(x) \]

or, equivalently, using Leibniz notation with \(y = g(x)\) and \(z = f(y)\),

\[ \frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx} \]
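
A numerical check makes the rule concrete. The sketch below picks \(g(x) = x^2\) and \(f(y) = \sin(y)\) (arbitrary choices for illustration) and compares the chain-rule derivative with a central-difference estimate.

```python
import math

def g(x):
    return x ** 2

def f(y):
    return math.sin(y)

def h(x):
    return f(g(x))

def h_prime(x):
    # f'(g(x)) * g'(x), with f' = cos and g'(x) = 2x
    return math.cos(g(x)) * 2 * x

x0, eps = 0.7, 1e-6
numeric = (h(x0 + eps) - h(x0 - eps)) / (2 * eps)

print(h_prime(x0), numeric)
```

The two values should agree to many decimal places.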

Applying the chain rule recursively

Computational graph

Scalar input; one hidden node

Let

\[ J = -\Bigl[y \,\log(\hat y) + (1-y)\,\log(1-\hat y)\Bigr] \]

\[ \hat y = a_2 = \sigma(z_2), \quad z_2 = w_2 \cdot a_1 + b_2 \]

\[ a_1 = \sigma(z_1), \quad z_1 = w_1 \cdot x + b_1 \]

Derivatives

\[ \frac{\partial J}{\partial \hat{y}} \]

\[ \frac{\partial \hat{y}}{\partial z_2} \]

\[ \frac{\partial z_2}{\partial w_2}, \quad \frac{\partial z_2}{\partial b_2}, \quad \frac{\partial z_2}{\partial a_1} \]

\[ \frac{\partial a_1}{\partial z_1} \]

\[ \frac{\partial z_1}{\partial w_1}, \quad \frac{\partial z_1}{\partial b_1}, \quad \frac{\partial z_1}{\partial x} \]

Derivatives

Loss derivative w.r.t. \(\hat{y}\):

\[ \frac{\partial J}{\partial \hat{y}} = -\left(\frac{y}{\hat{y}}-\frac{1-y}{1-\hat{y}}\right) \]

Derivatives

\(\hat{y}\) derivative w.r.t. \(z_2\):

\[ \frac{\partial \hat{y}}{\partial z_2}=\sigma^{\prime}\left(z_2\right)=\hat{y}(1-\hat{y}) \]

Derivatives

Derivative \(z_2 = w_2 a_1 + b_2\):

\[ \frac{\partial z_2}{\partial w_2}=a_1, \quad \frac{\partial z_2}{\partial b_2}=1, \quad \frac{\partial z_2}{\partial a_1}=w_2 \]

Derivatives

Derivative \(a_1 = \sigma(z_1)\):

\[ \frac{\partial a_1}{\partial z_1}=\sigma^{\prime}\left(z_1\right)=a_1\left(1-a_1\right) \]

Derivatives

Derivative \(z_1 = w_1 x + b_1\):

\[ \frac{\partial z_1}{\partial w_1}=x, \quad \frac{\partial z_1}{\partial b_1}=1, \quad \frac{\partial z_1}{\partial x} = w_1 \]

Combined derivatives

For \(w_2\):

\[ \frac{\partial J}{\partial w_2}=\frac{\partial J}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_2} \cdot \frac{\partial z_2}{\partial w_2}=\left[-\left(\frac{y}{\hat{y}}-\frac{1-y}{1-\hat{y}}\right)\right] \cdot(\hat{y}(1-\hat{y})) \cdot a_1 \]

Simplifies to:

\[ \frac{\partial J}{\partial w_2}=(\hat{y}-y) a_1 \]
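
The simplification comes from multiplying the first two factors, which cancels both denominators:

\[ -\left(\frac{y}{\hat{y}}-\frac{1-y}{1-\hat{y}}\right) \hat{y}(1-\hat{y}) = -\bigl(y(1-\hat{y}) - (1-y)\hat{y}\bigr) = -(y - \hat{y}) = \hat{y} - y \]

Hence \(\frac{\partial J}{\partial z_2} = \hat{y} - y\), and the remaining factor \(a_1\) is \(\frac{\partial z_2}{\partial w_2}\).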

Combined derivatives

For \(b_2\):

\[ \frac{\partial J}{\partial b_2}=\frac{\partial J}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_2} \cdot \frac{\partial z_2}{\partial b_2}=(\hat{y}-y) \cdot 1=\hat{y}-y \]

Combined derivatives

For \(w_1\):

\[ \frac{\partial J}{\partial w_1}=\frac{\partial J}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_2} \cdot \frac{\partial z_2}{\partial a_1} \cdot \frac{\partial a_1}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_1} \]

Plug in:

\[ =\left[-\left(\frac{y}{\hat{y}}-\frac{1-y}{1-\hat{y}}\right)\right] \cdot(\hat{y}(1-\hat{y})) \cdot w_2 \cdot\left(a_1\left(1-a_1\right)\right) \cdot x \]

Simplifies to:

\[ \frac{\partial J}{\partial w_1}=(\hat{y}-y) w_2\left(a_1\left(1-a_1\right)\right) x \]

Combined derivatives

For \(b_1\):

\[ \frac{\partial J}{\partial b_1}=\frac{\partial J}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_2} \cdot \frac{\partial z_2}{\partial a_1} \cdot \frac{\partial a_1}{\partial z_1} \cdot \frac{\partial z_1}{\partial b_1} \]

Plug in:

\[ =(\hat{y}-y) w_2\left(a_1\left(1-a_1\right)\right) \cdot 1 \]

Simplifies to:

\[ \frac{\partial J}{\partial b_1}=(\hat{y}-y) w_2\left(a_1\left(1-a_1\right)\right) \]

Forward

import math
import random

random.seed(42)

def sigma(x):
    return 1 / (1 + math.exp(-x))

w1 = random.random()
w2 = random.random()
b1 = 0
b2 = 0

x = 3.14
y = 1

Forward

z1 = w1 * x + b1
a1 = sigma(z1)

z2 = w2 * a1 + b2
y_hat = sigma(z2)

J = -(y * math.log(y_hat) + (1-y) * math.log(1 - y_hat))
J
0.6821830425123782

Backward

nabla_J_w2 = (y_hat - y) * a1
nabla_J_b2 = y_hat - y

nabla_J_w1 = (y_hat - y) * w2 * (a1 * (1-a1)) * x
nabla_J_b1 = (y_hat - y) * w2 * (a1 * (1-a1))

print((nabla_J_w2, nabla_J_b2, nabla_J_w1, nabla_J_b1))
(-0.43594714797256023, -0.4944877677583355, -0.00405314422749428, -0.0012908102635332103)
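
A common way to gain confidence in hand-derived gradients is to compare them with finite differences. The sketch below repeats the setup above in a self-contained form and perturbs one parameter at a time. The `loss` helper and the dictionaries are hypothetical names introduced for this check.

```python
import math
import random

random.seed(42)

def sigma(x):
    return 1 / (1 + math.exp(-x))

def loss(w1, b1, w2, b2, x, y):
    a1 = sigma(w1 * x + b1)
    y_hat = sigma(w2 * a1 + b2)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

w1, w2 = random.random(), random.random()
b1, b2 = 0.0, 0.0
x, y = 3.14, 1

# Analytic gradients from the derivation
a1 = sigma(w1 * x + b1)
y_hat = sigma(w2 * a1 + b2)
analytic = {
    "w2": (y_hat - y) * a1,
    "b2": (y_hat - y),
    "w1": (y_hat - y) * w2 * a1 * (1 - a1) * x,
    "b1": (y_hat - y) * w2 * a1 * (1 - a1),
}

# Central-difference estimates, perturbing one parameter at a time
eps = 1e-6
params = {"w1": w1, "b1": b1, "w2": w2, "b2": b2}
numeric = {}
for name in params:
    plus = dict(params)
    plus[name] += eps
    minus = dict(params)
    minus[name] -= eps
    numeric[name] = (loss(x=x, y=y, **plus) - loss(x=x, y=y, **minus)) / (2 * eps)

for name in analytic:
    print(f"{name}: analytic {analytic[name]:+.8f}  numeric {numeric[name]:+.8f}")
```

Note that the finite-difference check needs two loss evaluations per parameter, which is essentially the cost profile of the naïve training algorithm; the analytic gradients reuse the values from a single forward pass.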

Backpropagation: top level

  1. Computational Graph

  2. Initialization

  3. Forward Pass

  4. Compute Loss

  5. Backward Pass (Backpropagation)

  6. Update the parameters and repeat 3 to 6.

Backpropagation: detailed

  1. Create the computational graph.

  2. Initialize the weights and biases.

  3. Forward pass: starting from the input, compute the output of each operation in the graph, and store these values.

  4. Compute loss.

  5. Backward pass: starting from the output and moving backward, for each operation:

    1. Compute the derivative of the output with respect to each of the inputs.

    2. For each input \(u\) of an operation with output \(z\),

\[ \delta_u = \frac{\partial J}{\partial u} = \frac{\partial z}{\partial u} \cdot \frac{\partial J}{\partial z} \]

  6. Update the parameters and repeat 3 to 6.

Backpropagation: 2. Initialization

Initialize the weights and biases of the neural network.

  1. Zero Initialization
    • All weights are initialized to zero.
    • Causes a symmetry problem: all neurons in a layer produce identical outputs and receive identical updates, preventing effective learning.
  2. Random Initialization
    • Weights are initialized randomly, often using a uniform or normal distribution.
    • Breaks the symmetry between neurons, allowing them to learn.
    • If not scaled properly, leads to slow convergence or vanishing/exploding gradients.
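
The symmetry problem can be observed directly. The sketch below computes the gradient of the loss with respect to the hidden-layer weights of a small 2-3-1 sigmoid network, once with constant weights and once with random weights. A non-zero constant is used instead of zero so that the gradients are not trivially zero; the data, the architecture, and the helper name are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden_weight_gradient(W1, W2, X, y):
    """Gradient of the mean BCE loss w.r.t. W1 for a 2-3-1 sigmoid network (zero biases)."""
    a1 = sigmoid(X @ W1)                        # (N, 3)
    y_hat = sigmoid(a1 @ W2).ravel()            # (N,)
    delta2 = (y_hat - y)[:, None] / len(y)      # (N, 1)
    delta1 = (delta2 @ W2.T) * a1 * (1.0 - a1)  # (N, 3)
    return X.T @ delta1                         # (2, 3), one column per hidden unit

rng = np.random.default_rng(0)
X = rng.standard_normal(size=(8, 2))   # made-up inputs
y = rng.integers(0, 2, size=8)         # made-up binary labels

# Constant initialization: every hidden unit starts out identical
grad_const = hidden_weight_gradient(np.full((2, 3), 0.5), np.full((3, 1), 0.5), X, y)

# Random initialization breaks the symmetry
grad_rand = hidden_weight_gradient(rng.standard_normal((2, 3)),
                                   rng.standard_normal((3, 1)), X, y)

print(np.allclose(grad_const, grad_const[:, [0]]))   # True: all columns equal
print(np.allclose(grad_rand, grad_rand[:, [0]]))     # False
```

With identical gradients, the hidden units remain copies of one another after every update, so the layer behaves like a single unit.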

Backpropagation: 3. Forward Pass

For each example in the training set (or in a mini-batch):

  • Input Layer: Pass input features to first layer.

  • Hidden Layers: For each hidden layer, compute the activations (output) by applying the weighted sum of inputs plus bias, followed by an activation function (e.g., sigmoid, ReLU).

  • Output Layer: Same process as hidden layers. Output layer activations represent the predicted values.

Backpropagation: 4. Compute Loss

Calculate the loss (error) using a suitable loss function by comparing the predicted values to the actual target values.

Backpropagation: 5. Backward Pass

  • Output Layer: Compute the gradient of the loss with respect to the output layer’s weights and biases using the chain rule of calculus.

  • Hidden Layers: Propagate the error backward through the network, layer by layer. For each layer, compute the gradient of the loss with respect to the weights and biases. Use the derivative of the activation function to help calculate these gradients.

  • Update Weights and Biases: Adjust the weights and biases using the calculated gradients and a learning rate, which determines the step size for each update.

Key Concepts

  • Activation Functions: Functions like sigmoid, ReLU, and tanh introduce non-linearity, which allows the network to learn complex patterns.

  • Learning Rate: A hyperparameter that controls how much to change the model in response to the estimated error each time the model weights are updated.

  • Gradient Descent: An optimization algorithm that minimizes the loss function by iteratively moving the parameters in the direction of steepest descent, that is, along the negative of the gradient.
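
These concepts fit together in a few lines for the scalar network with one hidden node. The sketch below runs forward pass, backward pass, and parameter update in a loop. The starting weights, the single training example, and the learning rate are arbitrary assumptions.

```python
import math

def sigma(x):
    return 1 / (1 + math.exp(-x))

# Hypothetical starting point and a single training example
w1, b1, w2, b2 = 0.5, 0.0, 0.5, 0.0
x, y = 3.14, 1
lr = 0.5   # learning rate (arbitrary choice)

losses = []
for step in range(50):
    # Forward pass
    a1 = sigma(w1 * x + b1)
    y_hat = sigma(w2 * a1 + b2)
    losses.append(-(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat)))

    # Backward pass (both deltas use the weights from before the update)
    dz2 = y_hat - y
    dz1 = dz2 * w2 * a1 * (1 - a1)

    # Gradient descent update: move against the gradient
    w2 -= lr * dz2 * a1
    b2 -= lr * dz2
    w1 -= lr * dz1 * x
    b1 -= lr * dz1

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

A larger learning rate takes bigger steps and may overshoot; a smaller one converges more slowly.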

Implementation

SimpleMLP

The complete implementation is presented below and will be examined in the subsequent slides.

# Activations & loss

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def relu(z):
    return np.maximum(0.0, z)

def relu_prime(z):
    return (z > 0).astype(z.dtype)

def bce_loss(y_true, y_prob, eps=1e-9):

    """Binary cross-entropy averaged over samples (with clipping for stability)."""

    y_prob = np.clip(y_prob, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

# Initializers

def he_init(rng, fan_in, fan_out):

    # He normal: good for ReLU

    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def xavier_init(rng, fan_in, fan_out):

    # Glorot/Xavier normal: good for sigmoid/tanh

    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))


# SimpleMLP (API mirrors NaiveMLP)

class SimpleMLP:

    """
    Minimal MLP for binary classification.

    - Hidden: ReLU (default) with He init; or 'sigmoid' with Xavier init
    - Output: Sigmoid + BCE (δ_L = a_L - y)
    - API: forward -> probas (N,), predict_proba, predict, loss, train
    """

    def __init__(self, layer_sizes, lr=0.1, seed=None, l2=0.0,
                 hidden_activation="relu", lr_decay=None):
        """
        layer_sizes: e.g., [2, 4, 4, 1]
        lr: learning rate
        l2: L2 regularization strength (0 disables)
        hidden_activation: 'relu' (default) or 'sigmoid'
        lr_decay: optional float in (0,1); multiply lr by this every epoch (e.g., 0.9)
        """
        self.sizes = list(layer_sizes)
        self.lr = float(lr)
        self.base_lr = float(lr)
        self.lr_decay = lr_decay
        self.l2 = float(l2)
        self.hidden_activation = hidden_activation
        rng = np.random.default_rng(seed)

        # Initialize weights/biases per layer
        self.W = []
        for din, dout in zip(self.sizes[:-1], self.sizes[1:]):
            if hidden_activation == "relu":
                Wk = he_init(rng, din, dout)
            else:
                Wk = xavier_init(rng, din, dout)
            self.W.append(Wk)
        self.b = [np.zeros(dout) for dout in self.sizes[1:]]

    # activations (hidden vs output)

    def _act(self, z, last=False):
        if last:
            return sigmoid(z)  # output layer
        return relu(z) if self.hidden_activation == "relu" else sigmoid(z)

    def _act_prime(self, z, last=False):
        if last:
            return sigmoid_prime(z)  # rarely needed with BCE+sigmoid
        return relu_prime(z) if self.hidden_activation == "relu" else sigmoid_prime(z)

    # forward (public): returns probabilities (N,)

    def forward(self, X):
        a = X
        L = len(self.W)
        for ell, (W, b) in enumerate(zip(self.W, self.b), start=1):
            a = self._act(a @ W + b, last=(ell == L))
        return a.ravel()

    # Aliases to match NaiveMLP

    def predict_proba(self, X):
        return self.forward(X)

    def predict(self, X, threshold=0.5):
        return (self.predict_proba(X) >= threshold).astype(int)

    def loss(self, X, y):

        # BCE + optional L2
        p = self.predict_proba(X)
        base = bce_loss(y, p)
        if self.l2 > 0:
            reg = 0.5 * self.l2 * sum((W**2).sum() for W in self.W)
            # Normalize reg by number of samples to be consistent with mean loss
            base += reg / max(1, X.shape[0])
        return base

    # internal: forward caches for backprop

    def _forward_full(self, X):
        a = X
        activations = [a]
        zs = []
        L = len(self.W)
        for ell, (W, b) in enumerate(zip(self.W, self.b), start=1):
            z = a @ W + b
            a = self._act(z, last=(ell == L))
            zs.append(z)
            activations.append(a)
        return activations, zs

    # training: mini-batch gradient descent with backprop

    def train(self, X, y, epochs=30, batch_size=None, verbose=True, shuffle=True):

        """
        X: (N, d), y: (N,) in {0,1}
        batch_size: None -> full-batch; else int
        """

        N = X.shape[0]
        idx = np.arange(N)
        B = N if batch_size is None else int(batch_size)

        for ep in range(1, epochs + 1):
            if shuffle:
                np.random.shuffle(idx)
            if self.lr_decay:
                self.lr = self.base_lr * (self.lr_decay ** (ep - 1))

            base_loss = self.loss(X, y)

            for start in range(0, N, B):
                sl = idx[start:start+B]
                Xb = X[sl]
                yb = y[sl].reshape(-1, 1)  # (B,1)

                # Forward caches
                activations, zs = self._forward_full(Xb)
                A_L = activations[-1]          # (B,1)
                Bsz = Xb.shape[0]

                # Backprop
                # Output layer: BCE + sigmoid => delta_L = (A_L - y)

                delta = (A_L - yb)             # (B,1)

                grads_W = [None] * len(self.W)
                grads_b = [None] * len(self.b)

                # Last layer grads

                grads_W[-1] = activations[-2].T @ delta / Bsz   # (n_{L-1}, 1)
                grads_b[-1] = delta.mean(axis=0)                # (1,)

                # Hidden layers: l = L-1 down to 1

                for l in range(2, len(self.sizes)):
                    z = zs[-l]                                  # (B, n_l)
                    sp = self._act_prime(z, last=False)         # (B, n_l)
                    delta = (delta @ self.W[-l+1].T) * sp       # (B, n_l)
                    grads_W[-l] = activations[-l-1].T @ delta / Bsz  # (n_{l-1}, n_l)
                    grads_b[-l] = delta.mean(axis=0)                 # (n_l,)

                # L2 regularization (add to grads); scaled by 1/N to match loss()

                if self.l2 > 0:
                    for k in range(len(self.W)):
                        grads_W[k] = grads_W[k] + (self.l2 / N) * self.W[k]

                # Gradient step

                for k in range(len(self.W)):
                    self.W[k] -= self.lr * grads_W[k]
                    self.b[k] -= self.lr * grads_b[k]

            new_loss = self.loss(X, y)
            if verbose:
                print(f"Epoch {ep:3d} | loss {base_loss:.5f} → {new_loss:.5f} | Δ={base_loss - new_loss:.5f} | lr={self.lr:.4f}")

Activation functions

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def relu(z):
    return np.maximum(0.0, z)

def relu_prime(z):
    return (z > 0).astype(z.dtype)
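
A quick sanity check, offered as a sketch rather than as part of the original implementation: the analytic derivatives above can be compared with a central finite difference. The helper `central_difference` and the test points are illustrative choices; z = 0 is avoided because ReLU is not differentiable there.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def relu(z):
    return np.maximum(0.0, z)

def relu_prime(z):
    return (z > 0).astype(z.dtype)

def central_difference(f, z, eps=1e-6):
    # Numerical derivative: (f(z + eps) - f(z - eps)) / (2 * eps)
    return (f(z + eps) - f(z - eps)) / (2 * eps)

# Illustrative test points (assumption); none of them is the kink at z = 0
z = np.array([-3.0, -1.0, -0.5, 0.5, 1.0, 3.0])

sigmoid_gap = np.max(np.abs(sigmoid_prime(z) - central_difference(sigmoid, z)))
relu_gap = np.max(np.abs(relu_prime(z) - central_difference(relu, z)))

print(f"max |analytic - numeric| sigmoid: {sigmoid_gap:.2e}")
print(f"max |analytic - numeric| ReLU:    {relu_gap:.2e}")
```

Both gaps should be tiny, limited only by floating-point round-off.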

Loss

def bce_loss(y_true, y_prob, eps=1e-9):

    """Binary cross-entropy averaged over samples (with clipping for stability)."""

    y_prob = np.clip(y_prob, eps, 1 - eps)

    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
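
The training code relies on the shortcut δ_L = a_L − y for a sigmoid output trained with binary cross-entropy. The derivation below is a sketch for a single example, with label y in {0, 1}, pre-activation z, and output a = σ(z); the symbol ℓ for the per-example loss is notation introduced here.

```latex
\begin{aligned}
\ell &= -\bigl[\, y \ln a + (1 - y) \ln (1 - a) \,\bigr] \\
\frac{\partial \ell}{\partial a} &= -\frac{y}{a} + \frac{1 - y}{1 - a} = \frac{a - y}{a(1 - a)} \\
\frac{\partial a}{\partial z} &= \sigma'(z) = a(1 - a) \\
\delta &= \frac{\partial \ell}{\partial z}
        = \frac{\partial \ell}{\partial a}\,\frac{\partial a}{\partial z}
        = a - y
\end{aligned}
```

The factor a(1 − a) cancels, which is why the output-layer delta never needs `sigmoid_prime`.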

Initializers

def he_init(rng, fan_in, fan_out):

    # He normal: good for ReLU

    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def xavier_init(rng, fan_in, fan_out):

    # Glorot/Xavier normal: good for sigmoid/tanh

    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))
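
To see what the two scales do in practice, here is a small experiment (a sketch; the layer width of 256 and the batch of 2000 samples are arbitrary choices). It checks the empirical standard deviation of each initializer, then pushes unit-variance inputs through one ReLU layer and measures the mean squared activation.

```python
import numpy as np

def he_init(rng, fan_in, fan_out):
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def xavier_init(rng, fan_in, fan_out):
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
fan_in, fan_out = 256, 256            # illustrative layer sizes (assumption)

W_he = he_init(rng, fan_in, fan_out)
W_xavier = xavier_init(rng, fan_in, fan_out)

print("He std:    ", W_he.std(), "target", np.sqrt(2.0 / fan_in))
print("Xavier std:", W_xavier.std(), "target", np.sqrt(2.0 / (fan_in + fan_out)))

# Unit-variance inputs through a single ReLU layer
X = rng.normal(0.0, 1.0, size=(2000, fan_in))
ms_in = np.mean(X ** 2)
ms_he = np.mean(np.maximum(0.0, X @ W_he) ** 2)
ms_xavier = np.mean(np.maximum(0.0, X @ W_xavier) ** 2)

print(f"mean square: input {ms_in:.3f} | ReLU+He {ms_he:.3f} | ReLU+Xavier {ms_xavier:.3f}")
```

With He initialization the mean squared activation stays close to that of the input (about 1). With Xavier initialization and equal fan-in and fan-out it is roughly halved, because ReLU zeroes out about half of the pre-activations.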

Class definition + constructor

class SimpleMLP:

    def __init__(self, layer_sizes, lr=0.1, seed=None, l2=0.0,
                 hidden_activation="relu", lr_decay=None):

        self.sizes = list(layer_sizes)
        self.lr = float(lr)
        self.base_lr = float(lr)
        self.lr_decay = lr_decay
        self.l2 = float(l2)
        self.hidden_activation = hidden_activation
        rng = np.random.default_rng(seed)

        # Initialize weights/biases per layer
        self.W = []
        for din, dout in zip(self.sizes[:-1], self.sizes[1:]):
            if hidden_activation == "relu":
                Wk = he_init(rng, din, dout)
            else:
                Wk = xavier_init(rng, din, dout)
            self.W.append(Wk)
        self.b = [np.zeros(dout) for dout in self.sizes[1:]]

Forward (public)

    def forward(self, X):
        a = X
        L = len(self.W)
        for ell, (W, b) in enumerate(zip(self.W, self.b), start=1):
            a = self._act(a @ W + b, last=(ell == L))
        return a.ravel()

Forward (private)

    def _forward_full(self, X):
        a = X
        activations = [a]
        zs = []
        L = len(self.W)
        for ell, (W, b) in enumerate(zip(self.W, self.b), start=1):
            z = a @ W + b
            a = self._act(z, last=(ell == L))
            zs.append(z)
            activations.append(a)
        return activations, zs
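
The caches returned by `_forward_full` are easier to follow with concrete shapes. The sketch below re-creates the same loop outside the class for a [2, 4, 4, 1] network and a batch of 5 samples; the random weights and the batch size are illustrative assumptions.

```python
import numpy as np

sizes = [2, 4, 4, 1]                  # same layout as the example model
rng = np.random.default_rng(0)
W = [rng.normal(size=(din, dout)) for din, dout in zip(sizes[:-1], sizes[1:])]
b = [np.zeros(dout) for dout in sizes[1:]]

X = rng.normal(size=(5, 2))           # batch of 5 samples (illustrative)
a = X
activations, zs = [a], []
for k, (Wk, bk) in enumerate(zip(W, b), start=1):
    z = a @ Wk + bk
    # ReLU on hidden layers, sigmoid on the output layer
    a = 1.0 / (1.0 + np.exp(-z)) if k == len(W) else np.maximum(0.0, z)
    zs.append(z)
    activations.append(a)

print("activations:", [arr.shape for arr in activations])
# activations: [(5, 2), (5, 4), (5, 4), (5, 1)]
print("zs:         ", [arr.shape for arr in zs])
# zs:          [(5, 4), (5, 4), (5, 1)]
```

`activations` has one more entry than `zs` because it starts with the input X. This is why the backward pass pairs `zs[-l]` with `activations[-l-1]`.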

Training

    def train(self, X, y, epochs=30, batch_size=None, verbose=True, shuffle=True):

        """
        X: (N, d), y: (N,) in {0,1}
        batch_size: None -> full-batch; else int
        """

        N = X.shape[0]
        idx = np.arange(N)
        B = N if batch_size is None else int(batch_size)

        for ep in range(1, epochs + 1):
            if shuffle:
                np.random.shuffle(idx)
            if self.lr_decay:
                self.lr = self.base_lr * (self.lr_decay ** (ep - 1))

            base_loss = self.loss(X, y)

            for start in range(0, N, B):
                sl = idx[start:start+B]
                Xb = X[sl]
                yb = y[sl].reshape(-1, 1)  # (B,1)

                # Forward caches
                activations, zs = self._forward_full(Xb)
                A_L = activations[-1]          # (B,1)
                Bsz = Xb.shape[0]

                # Backprop
                # Output layer: BCE + sigmoid => delta_L = (A_L - y)

                delta = (A_L - yb)             # (B,1)

                grads_W = [None] * len(self.W)
                grads_b = [None] * len(self.b)

                # Last layer grads

                grads_W[-1] = activations[-2].T @ delta / Bsz   # (n_{L-1}, 1)
                grads_b[-1] = delta.mean(axis=0)                # (1,)

                # Hidden layers: l = L-1 down to 1

                for l in range(2, len(self.sizes)):
                    z = zs[-l]                                  # (B, n_l)
                    sp = self._act_prime(z, last=False)         # (B, n_l)
                    delta = (delta @ self.W[-l+1].T) * sp       # (B, n_l)
                    grads_W[-l] = activations[-l-1].T @ delta / Bsz  # (n_{l-1}, n_l)
                    grads_b[-l] = delta.mean(axis=0)                 # (n_l,)

                # L2 regularization (add to grads); scaled by 1/N to match loss()

                if self.l2 > 0:
                    for k in range(len(self.W)):
                        grads_W[k] = grads_W[k] + (self.l2 / N) * self.W[k]

                # Gradient step

                for k in range(len(self.W)):
                    self.W[k] -= self.lr * grads_W[k]
                    self.b[k] -= self.lr * grads_b[k]

            new_loss = self.loss(X, y)
            if verbose:
                print(f"Epoch {ep:3d} | loss {base_loss:.5f} → {new_loss:.5f} | Δ={base_loss - new_loss:.5f} | lr={self.lr:.4f}")
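
A common way to gain confidence in a hand-written backward pass is a gradient check: compare the backprop gradients with central finite differences of the loss. The sketch below does this for a stand-alone 2-3-1 network rather than for `SimpleMLP` itself. It uses a sigmoid hidden layer so that the loss is smooth everywhere; the helper names (`forward`, `backprop`, `numerical_grads`) and the toy data are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(y, p, eps=1e-9):
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def forward(params, X):
    W1, b1, W2, b2 = params
    a1 = sigmoid(X @ W1 + b1)                      # hidden layer, (B, 3)
    a2 = sigmoid(a1 @ W2 + b2)                     # output layer, (B, 1)
    return a1, a2

def backprop(params, X, y):
    """Same recursions as the training loop, written out for two layers."""
    W1, b1, W2, b2 = params
    a1, a2 = forward(params, X)
    Bsz = X.shape[0]
    delta2 = a2 - y.reshape(-1, 1)                 # BCE + sigmoid output
    gW2 = a1.T @ delta2 / Bsz
    gb2 = delta2.mean(axis=0)
    delta1 = (delta2 @ W2.T) * a1 * (1.0 - a1)     # sigmoid'(z1) = a1 * (1 - a1)
    gW1 = X.T @ delta1 / Bsz
    gb1 = delta1.mean(axis=0)
    return [gW1, gb1, gW2, gb2]

def numerical_grads(params, X, y, eps=1e-6):
    """Central differences, perturbing one parameter at a time."""
    grads = []
    for p in params:
        g = np.zeros_like(p)
        for i in np.ndindex(p.shape):
            old = p[i]
            p[i] = old + eps
            loss_plus = bce(y, forward(params, X)[1].ravel())
            p[i] = old - eps
            loss_minus = bce(y, forward(params, X)[1].ravel())
            p[i] = old                             # restore the parameter
            g[i] = (loss_plus - loss_minus) / (2 * eps)
        grads.append(g)
    return grads

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2))
y = np.array([0, 1, 1, 0, 1, 0, 0, 1], dtype=float)
params = [rng.normal(size=(2, 3)), np.zeros(3), rng.normal(size=(3, 1)), np.zeros(1)]

analytic = backprop(params, X, y)
numeric = numerical_grads(params, X, y)
max_gap = max(np.max(np.abs(a - n)) for a, n in zip(analytic, numeric))
print(f"largest |backprop - finite difference| = {max_gap:.2e}")
```

If the two sets of gradients agree to within round-off, the delta recursion and the division by the batch size are consistent with the mean loss. A large gap usually points to an indexing or transposition mistake.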

Testing

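from sklearn.datasets import make_circles
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler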

X, y = make_circles(n_samples=200, factor=0.5, noise=0.08, random_state=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = SimpleMLP([2, 4, 4, 1], lr=0.3, seed=42, hidden_activation="relu", l2=0.0, lr_decay=0.95)

model.train(X_train, y_train, epochs=150, batch_size=32, verbose=False)

print("Train acc:", accuracy_score(y_train, model.predict(X_train)))
print(" Test acc:", accuracy_score(y_test, model.predict(X_test)))
Train acc: 0.95
 Test acc: 0.9166666666666666

Automatic differentiation

Automatic differentiation (autodiff) systematically applies the chain rule to compute exact derivatives of functions expressed as computer programs. It propagates derivatives through elementary operations, either forward (from inputs to outputs) or backward (from outputs to inputs), enabling efficient and precise gradient computation essential for optimization and learning algorithms.
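
As an illustration of the idea, here is a minimal sketch of forward-mode autodiff using dual numbers: each value carries its derivative, and every elementary operation applies the corresponding differentiation rule. The `Dual` class, `dual_sigmoid`, and `neuron` are hypothetical names written for this example, not part of any library. Forward mode is used only because it fits in a few lines; backpropagation corresponds to the backward (reverse) mode.

```python
import math

class Dual:
    """A value together with its derivative with respect to one chosen input."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Sum rule
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__

def dual_sigmoid(x):
    # Chain rule: d/dx sigmoid(u) = sigmoid(u) * (1 - sigmoid(u)) * u'
    s = 1.0 / (1.0 + math.exp(-x.value))
    return Dual(s, s * (1.0 - s) * x.deriv)

def neuron(w, x, b):
    return dual_sigmoid(w * x + b)

# Derivative of the neuron's output with respect to w at w=0.5, x=2.0, b=-1.0
w = Dual(0.5, 1.0)                    # seed: dw/dw = 1
out = neuron(w, 2.0, -1.0)
print(out.value, out.deriv)           # 0.5 0.5
```

Here w·x + b = 0, so the output is σ(0) = 0.5 and the derivative is σ'(0)·x = 0.25 × 2 = 0.5, obtained without symbolic manipulation or finite differences.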

Summary

Prologue

Summary

  • Artificial Neural Networks (ANNs):
    • Inspired by biological neural networks.
    • Consist of interconnected neurons arranged in layers.
    • Applicable to supervised, unsupervised, and reinforcement learning.
  • Feed-forward Neural Networks (FNNs):
    • Information flows unidirectionally from input to output.
    • Composed of input, hidden, and output layers.
    • Can vary in the number of layers and nodes per layer.
  • Activation Functions:
    • Introduce non-linearity to enable learning complex patterns.
    • Common functions: Sigmoid, Tanh, ReLU, Leaky ReLU.
    • Choice of activation function affects gradient flow and network performance.
  • Universal Approximation Theorem:
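    • Given enough hidden units and a suitable non-linear activation, a network with a single hidden layer can approximate any continuous function on a compact domain to arbitrary accuracy (Cybenko 1989; Hornik, Stinchcombe, and White 1989).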
  • Backpropagation Algorithm:
    • Training involves forward pass, loss computation, backward pass, and weight updates.
    • Computes the gradients of the loss that gradient descent uses to update the weights.
    • Enables training of multi-layer perceptrons by adjusting internal weights.
  • Key Concepts:
    • Learning rate determines the step size during optimization.
    • Gradient descent is used to update weights in the direction of minimizing loss.
    • Proper selection of activation functions and initialization methods is crucial for effective training.
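
The point about the learning rate can be made concrete with a one-dimensional toy problem, sketched here under simple assumptions: minimize f(w) = w², whose gradient is 2w, starting from w = 1. The function name and the three learning rates are illustrative.

```python
def gradient_descent(lr, w0=1.0, steps=20):
    """Minimize f(w) = w**2 with plain gradient descent; the gradient is 2*w."""
    w = w0
    for _ in range(steps):
        w -= lr * 2.0 * w             # w <- w - lr * f'(w)
    return w

for lr in (0.01, 0.1, 1.1):
    print(f"lr={lr}: w after 20 steps = {gradient_descent(lr):.4f}")
```

Each step multiplies w by (1 − 2·lr). A small rate (0.01) makes slow progress, a moderate rate (0.1) approaches the minimum at 0 quickly, and a rate above 1 (here 1.1) makes |w| grow at every step, so the iterates diverge.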

3Blue1Brown

A series of videos, with animations, providing the intuition behind the backpropagation algorithm.

StatQuest

Herman Kamper

One of the most thorough series of videos on the backpropagation algorithm.

Free book with implementation

In his book, Neural Networks and Deep Learning, Michael Nielsen provides a comprehensive Python implementation of a neural network.

Next lecture

  • We will talk about the vanishing gradient, softmax, and regularization.

References

Angermueller, Christof, Tanel Pärnamaa, Leopold Parts, and Oliver Stegle. 2016. “Deep Learning for Computational Biology.” Mol Syst Biol 12 (7): 878. https://doi.org/10.15252/msb.20156651.
Baydin, Atılım Güneş, Barak A. Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. 2017. “Automatic Differentiation in Machine Learning: A Survey.” J. Mach. Learn. Res. 18 (1): 5595–5637.
Cybenko, George V. 1989. “Approximation by Superpositions of a Sigmoidal Function.” Mathematics of Control, Signals and Systems 2: 303–14. https://api.semanticscholar.org/CorpusID:3958369.
Géron, Aurélien. 2022. Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow. 3rd ed. O’Reilly Media, Inc.
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. “Deep Residual Learning for Image Recognition.” In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770–78. https://doi.org/10.1109/CVPR.2016.90.
Hornik, Kurt, Maxwell Stinchcombe, and Halbert White. 1989. “Multilayer Feedforward Networks Are Universal Approximators.” Neural Networks 2 (5): 359–66. https://doi.org/10.1016/0893-6080(89)90020-8.
Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. 1986. “Learning Representations by Back-Propagating Errors.” Nature 323 (6088): 533–36. https://doi.org/10.1038/323533a0.
Russell, Stuart, and Peter Norvig. 2020. Artificial Intelligence: A Modern Approach. 4th ed. Pearson. http://aima.cs.berkeley.edu/.

Marcel Turcotte

Marcel.Turcotte@uOttawa.ca

School of Electrical Engineering and Computer Science (EECS)

University of Ottawa