Training Artificial Neural Networks (Part 1)

CSI 4106 - Fall 2025

Marcel Turcotte

Version: Nov 4, 2025 15:11

Preamble

Message of the Day

Learning objectives

  • Explain the architecture and function of feed-forward neural networks (FNNs).
  • Identify common activation functions and understand their impact on network performance.
  • Introduce a simple but functional implementation of a feed-forward neural network.

Summary

3Blue1Brown (1/2)

3Blue1Brown (2/2)

Summary - DL

  • Deep learning (DL) is a machine learning technique that can be applied to supervised learning (including regression and classification), unsupervised learning, and reinforcement learning.

  • Inspired by the structure and function of biological neural networks found in animals.

  • Comprises interconnected neurons (or units) arranged into layers.

Summary - FNN

Summary - units

Common Activation Functions

# Attribution: https://github.com/ageron/handson-ml3/blob/main/10_neural_nets_with_keras.ipynb

import numpy as np
import matplotlib.pyplot as plt

from scipy.special import expit as sigmoid

def relu(z):
    return np.maximum(0, z)

def derivative(f, z, eps=0.000001):
    return (f(z + eps) - f(z - eps))/(2 * eps)

max_z = 4.5
z = np.linspace(-max_z, max_z, 200)

plt.figure(figsize=(11, 3.1))

plt.subplot(121)
plt.plot(z, relu(z), "m-.", linewidth=2, label="ReLU")
plt.plot(z, sigmoid(z), "g--", linewidth=2, label="Sigmoid")
plt.plot(z, np.tanh(z), "b-", linewidth=1, label="Tanh")
plt.grid(True)
plt.title("Activation functions")
plt.axis([-max_z, max_z, -1.65, 2.4])
plt.gca().set_yticks([-1, 0, 1, 2])
plt.legend(loc="lower right", fontsize=13)

plt.subplot(122)
plt.plot(z, derivative(sigmoid, z), "g--", linewidth=2, label="Sigmoid")
plt.plot(z, derivative(np.tanh, z), "b-", linewidth=1, label="Tanh")
plt.plot([-max_z, 0], [0, 0], "m-.", linewidth=2)
plt.plot([0, max_z], [1, 1], "m-.", linewidth=2)
plt.plot([0, 0], [0, 1], "m-.", linewidth=1.2)
plt.plot(0, 1, "mo", markersize=5)
plt.plot(0, 1, "mx", markersize=10)
plt.grid(True)
plt.title("Derivatives")
plt.axis([-max_z, max_z, -0.2, 1.2])

plt.show()
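As a sanity check on the right-hand panel, the finite-difference `derivative` helper can be compared with the closed-form derivatives, \(\sigma'(z) = \sigma(z)(1 - \sigma(z))\) and \(\tanh'(z) = 1 - \tanh^2(z)\). The sketch below is illustrative only; it defines `sigmoid` with NumPy instead of `scipy.special.expit` so that it has no SciPy dependency.

```python
import numpy as np

def sigmoid(z):
    # Logistic function; stands in for scipy.special.expit on this range of z
    return 1.0 / (1.0 + np.exp(-z))

def derivative(f, z, eps=1e-6):
    # Central finite difference, same formula as in the plotting code
    return (f(z + eps) - f(z - eps)) / (2 * eps)

z = np.linspace(-4.5, 4.5, 200)

# Closed-form derivatives
d_sigmoid = sigmoid(z) * (1.0 - sigmoid(z))
d_tanh = 1.0 - np.tanh(z) ** 2

sigmoid_ok = np.allclose(derivative(sigmoid, z), d_sigmoid, atol=1e-6)
tanh_ok = np.allclose(derivative(np.tanh, z), d_tanh, atol=1e-6)

# Peak slopes on the grid: at most 0.25 for the sigmoid, at most 1 for tanh
max_d_sigmoid = d_sigmoid.max()
max_d_tanh = d_tanh.max()

print(sigmoid_ok, tanh_ok)
print(max_d_sigmoid, max_d_tanh)
```

The sigmoid's slope never exceeds 0.25 and the slope of tanh never exceeds 1, while ReLU has slope exactly 1 for every positive input.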

Universal Approximation

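The universal approximation theorem states that a feed-forward neural network with a single hidden layer containing a finite number of neurons can approximate any continuous function on a compact subset of \(\mathbb{R}^n\) to arbitrary accuracy, given appropriate weights and a suitable (for example, sigmoidal) activation function. The theorem guarantees that such weights exist; it does not say how to find them or how many neurons are needed.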
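A small numerical illustration may help, with the caveat that it is not a proof and not a standard training procedure. The sketch below assumes a hidden layer of 50 tanh units whose weights are drawn at random and left untrained; only the linear output layer is fitted, by least squares, to approximate \(\sin(x)\) on \([-\pi, \pi]\). The target function, the number of units, and the weight scale are all arbitrary choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: a continuous function on the compact interval [-pi, pi]
x = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)
target = np.sin(x).ravel()

# One hidden layer of 50 tanh units with random, untrained weights (assumption)
n_hidden = 50
W1 = rng.normal(0.0, 2.0, size=(1, n_hidden))
b1 = rng.normal(0.0, 2.0, size=n_hidden)
H = np.tanh(x @ W1 + b1)              # hidden activations, shape (200, 50)

# Linear output layer fitted by least squares; the column of ones is the output bias
A = np.column_stack([H, np.ones(len(x))])
coef, *_ = np.linalg.lstsq(A, target, rcond=None)

approx = A @ coef
max_error = np.max(np.abs(approx - target))
print(f"max |error| on the grid: {max_error:.2e}")
```

The error is measured on the same grid used for fitting, so it says nothing about generalization; it only shows that a single hidden layer can represent a close approximation.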

Naïve MLP

Data

# Generate and plot the "circles" dataset
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles

# Generate synthetic data
X, y = make_circles(n_samples=1200, factor=0.35, noise=0.06, random_state=42)

# Separate coordinates for plotting
x1, x2 = X[:, 0], X[:, 1]

# Plot the two classes
plt.figure(figsize=(5, 5))
plt.scatter(x1[y==0], x2[y==0], color="C0", label="class 0 (outer ring)")
plt.scatter(x1[y==1], x2[y==1], color="C1", label="class 1 (inner circle)")
plt.xlabel("x₁")
plt.ylabel("x₂")
plt.title("Dataset generated with make_circles")
plt.axis("equal") # ensures circles look round
plt.legend()
plt.show()

Architecture

Utilities

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(y_true, y_prob, eps=1e-9):

    """Binary cross-entropy loss (average over data)."""

    y_prob = np.clip(y_prob, eps, 1 - eps)

    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
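To build intuition for the scale of this loss, the sketch below evaluates `bce_loss` on three hand-made prediction vectors (the labels and probabilities are invented for illustration). A model that always outputs 0.5 scores \(\ln 2 \approx 0.693\), which is the plateau visible at the start of the training log later in these slides.

```python
import numpy as np

def bce_loss(y_true, y_prob, eps=1e-9):
    """Binary cross-entropy, averaged over the examples."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

y_true = np.array([1, 0, 1, 0])

# Always predicting 0.5 carries no information about the labels
loss_uninformed = bce_loss(y_true, np.full(4, 0.5))

# Confident and correct predictions give a small loss
loss_confident = bce_loss(y_true, np.array([0.9, 0.1, 0.9, 0.1]))

# Confident and wrong predictions are penalized heavily
loss_wrong = bce_loss(y_true, np.array([0.1, 0.9, 0.1, 0.9]))

print(loss_uninformed, loss_confident, loss_wrong)
```

The three values are \(\ln 2\), \(-\ln 0.9\), and \(-\ln 0.1\), respectively.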

NaïveMLP

The complete implementation is presented below and will be examined in the subsequent slides.

class NaiveMLP:

    """
    A minimal multilayer perceptron (MLP) utilizing a brute force training 
    algorithm that does not require derivative calculations.

    Please note that the suggested training algorithm is intended solely for 
    didactic purposes and should not be mistaken for a genuine training algorithm.
    """

    def __init__(self, layer_sizes, step=0.1, seed=None):
        
        self.sizes = list(layer_sizes)
        self.step = float(step)
        rng = np.random.default_rng(seed)

        # Initialize weights and biases

        self.W = [rng.standard_normal(size=(in_d, out_d)) * 0.5
                  for in_d, out_d in zip(layer_sizes[:-1], layer_sizes[1:])]

        self.b = [np.zeros(out_d) for out_d in layer_sizes[1:]]

    def forward(self, X):

        """
        Simple forward pass: compute output activations only.

        X: shape (N, input_dim)

        Returns: output probabilities, shape (N,)
        """

        a = X
        for W, b in zip(self.W, self.b):
            a = sigmoid(a @ W + b)

        return a.ravel()

    def predict(self, X, threshold=0.5):

        return (self.forward(X) >= threshold).astype(int)

    def loss(self, X, y):

        return bce_loss(y, self.forward(X))

    def _all_param_tags(self):

        """
        Yields tags referencing every scalar parameter:
        ('W', layer_idx, i, j) or ('b', layer_idx, j)
        """

        for l, W in enumerate(self.W):
            for i in range(W.shape[0]):
                for j in range(W.shape[1]):
                    yield ('W', l, i, j)
            for j in range(self.b[l].shape[0]):
                yield ('b', l, j)

    def _get_param(self, tag):
        kind = tag[0]
        if kind == 'W':
            _, l, i, j = tag
            return self.W[l][i, j]
        else:
            _, l, j = tag
            return self.b[l][j]

    def _set_param(self, tag, val):
        kind = tag[0]
        if kind == 'W':
            _, l, i, j = tag
            self.W[l][i, j] = val
        else:
            _, l, j = tag
            self.b[l][j] = val

    def train(self, X, y, epochs=10, verbose=True):

        """
        Simultaneous update:
        - For each scalar parameter θ, try θ + δ for δ in {−step, 0, +step},
          pick the δ that gives minimal loss.
        - Collect all chosen δ’s, then apply all updates together.
        """

        for ep in range(1, epochs + 1):

            base_loss = self.loss(X, y)
            updates = {}

            # Probe all parameters
            for tag in self._all_param_tags():

                theta = self._get_param(tag)
                best_delta = 0.0
                best_loss = base_loss

                for delta in (-self.step, 0.0, +self.step):
                    self._set_param(tag, theta + delta)
                    trial_loss = self.loss(X, y)
                    if trial_loss < best_loss:
                        best_loss = trial_loss
                        best_delta = delta

                # restore original
                self._set_param(tag, theta)
                updates[tag] = best_delta

            # Apply all deltas together
            for tag, d in updates.items():
                if d != 0.0:
                    self._set_param(tag, self._get_param(tag) + d)

            new_loss = self.loss(X, y)

            if verbose:
                print(f"Epoch {ep:3d}: loss {base_loss:.5f} → {new_loss:.5f}")

            # optional early stop
            if abs(new_loss - base_loss) < 1e-12:
                break

Class Definition

class NaiveMLP:

    """
    A minimal multilayer perceptron (MLP) utilizing a brute force training 
    algorithm that does not require derivative calculations.

    Please note that the suggested training algorithm is intended solely for 
    didactic purposes and should not be mistaken for a genuine training algorithm.
    """

Constructor

    def __init__(self, layer_sizes, step=0.1, seed=None):
        
        self.sizes = list(layer_sizes)
        self.step = float(step)
        rng = np.random.default_rng(seed)

        # Initialize weights and biases

        self.W = [rng.standard_normal(size=(in_d, out_d)) * 0.5
                  for in_d, out_d in zip(layer_sizes[:-1], layer_sizes[1:])]

        self.b = [np.zeros(out_d) for out_d in layer_sizes[1:]]

Python

seed = 0

rng = np.random.default_rng(seed)

layer_sizes = [2, 4, 4, 1]

[(in_d, out_d) for in_d, out_d in zip(layer_sizes[:-1], layer_sizes[1:])]
[(2, 4), (4, 4), (4, 1)]
[rng.standard_normal(size=(in_d, out_d)) * 0.5 for in_d, out_d in zip(layer_sizes[:-1], layer_sizes[1:])]
[array([[ 0.06286511, -0.06605243,  0.32021133,  0.05245006],
        [-0.26783469,  0.18079753,  0.65200002,  0.47354048]]),
 array([[-0.35186762, -0.63271074, -0.31163723,  0.02066299],
        [-1.16251539, -0.10939583, -0.62295547, -0.36613368],
        [-0.27212949, -0.15815008,  0.20581527,  0.52125668],
        [-0.06426733,  0.68323174, -0.33259734,  0.17575504]]),
 array([[ 0.45173509],
        [ 0.04700615],
        [-0.37174962],
        [-0.46086269]])]

Python

[out_d for out_d in layer_sizes[1:]]
[4, 4, 1]
[np.zeros(out_d) for out_d in layer_sizes[1:]]
[array([0., 0., 0., 0.]), array([0., 0., 0., 0.]), array([0.])]

Forward Pass

    def forward(self, X):

        """
        Simple forward pass: compute output activations.
        X: shape (N, input_dim)
        Returns: output probabilities, shape (N,)
        """

        a = X
        for W, b in zip(self.W, self.b):
            a = sigmoid(a @ W + b)

        return a.ravel()
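It can help to trace the array shapes through the loop. The sketch below repeats the forward pass outside the class for the `[2, 4, 4, 1]` architecture, using a batch of five made-up inputs, and records the shape after each layer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
layer_sizes = [2, 4, 4, 1]
W = [rng.standard_normal(size=(i, o)) * 0.5
     for i, o in zip(layer_sizes[:-1], layer_sizes[1:])]
b = [np.zeros(o) for o in layer_sizes[1:]]

X = rng.uniform(-1, 1, size=(5, 2))   # N = 5 hypothetical examples

a = X
shapes = [a.shape]
for W_l, b_l in zip(W, b):
    # (N, in_d) @ (in_d, out_d) + (out_d,) -> (N, out_d); the bias is broadcast over rows
    a = sigmoid(a @ W_l + b_l)
    shapes.append(a.shape)

probs = a.ravel()   # (N, 1) -> (N,)

print(shapes)       # [(5, 2), (5, 4), (5, 4), (5, 1)]
print(probs.shape)  # (5,)
```

Because every layer, including the last, applies the sigmoid, each entry of `probs` lies strictly between 0 and 1.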

Making predictions

    def predict(self, X, threshold=0.5):

        return (self.forward(X) >= threshold).astype(int)
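The thresholding step can be seen in isolation with a few made-up probabilities standing in for the output of `forward`:

```python
import numpy as np

probs = np.array([0.10, 0.49, 0.50, 0.93])   # hypothetical forward() outputs

labels_default = (probs >= 0.5).astype(int)  # default threshold
labels_strict = (probs >= 0.9).astype(int)   # stricter threshold, fewer positives

print(labels_default)  # [0 0 1 1]
print(labels_strict)   # [0 0 0 1]
```

Note that a probability exactly equal to the threshold is assigned to class 1, because the comparison is `>=`.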

Computing the loss

    def loss(self, X, y):

        return bce_loss(y, self.forward(X))

Discussion

With the exception of the training algorithm, our neural network implementation is now complete.

For those who are not familiar with the back-propagation algorithm, how do you propose to learn the parameters of the model?

Change weights → compute loss → keep if better → repeat.

Pseudocode

for each epoch:
    deltas = {}  # store best delta for each parameter
    for each parameter w in network:
        best_delta = 0
        best_loss = current_loss  # loss computed with current weights
        for delta in [-0.01, 0, +0.01]:
            w_temp = w + delta
            loss_temp = compute_loss_with_replacement(w_temp, w, data)
            if loss_temp < best_loss:
                best_loss = loss_temp
                best_delta = delta
        deltas[w] = best_delta

    # Apply all updates simultaneously
    for each parameter w in network:
        w += deltas[w]
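The pseudocode can be tried on a problem small enough to follow by hand. The sketch below is an illustration, not the network's training code: it applies the same probe-then-update scheme to a made-up two-parameter quadratic loss whose minimum is at \((1, -2)\).

```python
import numpy as np

def loss(w):
    # Hypothetical toy loss with its minimum at w = (1, -2)
    return (w[0] - 1.0) ** 2 + (w[1] + 2.0) ** 2

w = np.array([0.0, 0.0])
step = 0.25

for epoch in range(10):
    current_loss = loss(w)
    deltas = np.zeros_like(w)

    # Probe each parameter on its own, holding the others fixed
    for k in range(len(w)):
        best_delta, best_loss = 0.0, current_loss
        for delta in (-step, 0.0, +step):
            w_trial = w.copy()
            w_trial[k] += delta
            trial_loss = loss(w_trial)
            if trial_loss < best_loss:
                best_loss, best_delta = trial_loss, delta
        deltas[k] = best_delta

    # Apply all updates simultaneously
    w += deltas

print(w)   # [ 1. -2.]
```

This loss is a sum of one term per parameter, so the individually chosen deltas never conflict. That is a favourable special case; in a neural network the parameters interact.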

Python

    def _all_param_tags(self):

        """
        Yields tags referencing every scalar parameter:
        ('W', layer_idx, i, j) or ('b', layer_idx, j)
        """

        for l, W in enumerate(self.W):
            for i in range(W.shape[0]):
                for j in range(W.shape[1]):
                    yield ('W', l, i, j)
            for j in range(self.b[l].shape[0]):
                yield ('b', l, j)

Python

class Demo:

    def __init__(self, layer_sizes):
        self.sizes = list(layer_sizes)
        rng = np.random.default_rng(0)
        self.W = [rng.standard_normal(size=(in_d, out_d)) * 0.5
                  for in_d, out_d in zip(layer_sizes[:-1], layer_sizes[1:])]
        self.b = [np.zeros(out_d) for out_d in layer_sizes[1:]]

    def _all_param_tags(self):

        """
        Yields tags referencing every scalar parameter:
        ('W', layer_idx, i, j) or ('b', layer_idx, j)
        """

        for l, W in enumerate(self.W):
            for i in range(W.shape[0]):
                for j in range(W.shape[1]):
                    yield ('W', l, i, j)
            for j in range(self.b[l].shape[0]):
                yield ('b', l, j)

    def show(self):

        for tag in self._all_param_tags():
            print(tag)

d = Demo([2, 4, 4, 1])
d.show()

Python

('W', 0, 0, 0)
('W', 0, 0, 1)
('W', 0, 0, 2)
('W', 0, 0, 3)
('W', 0, 1, 0)
('W', 0, 1, 1)
('W', 0, 1, 2)
('W', 0, 1, 3)
('b', 0, 0)
('b', 0, 1)
('b', 0, 2)
('b', 0, 3)
('W', 1, 0, 0)
('W', 1, 0, 1)
('W', 1, 0, 2)
('W', 1, 0, 3)
('W', 1, 1, 0)
('W', 1, 1, 1)
('W', 1, 1, 2)
('W', 1, 1, 3)
('W', 1, 2, 0)
('W', 1, 2, 1)
('W', 1, 2, 2)
('W', 1, 2, 3)
('W', 1, 3, 0)
('W', 1, 3, 1)
('W', 1, 3, 2)
('W', 1, 3, 3)
('b', 1, 0)
('b', 1, 1)
('b', 1, 2)
('b', 1, 3)
('W', 2, 0, 0)
('W', 2, 1, 0)
('W', 2, 2, 0)
('W', 2, 3, 0)
('b', 2, 0)
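The length of this listing can be checked with a short calculation: each layer contributes `in_d * out_d` weights and `out_d` biases.

```python
layer_sizes = [2, 4, 4, 1]

# Weights plus biases for each pair of consecutive layers
per_layer = [in_d * out_d + out_d
             for in_d, out_d in zip(layer_sizes[:-1], layer_sizes[1:])]
n_params = sum(per_layer)

print(per_layer, n_params)   # [12, 20, 5] 37
```

The total, 37, matches the number of tags printed above.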

Python

    def _get_param(self, tag):
        kind = tag[0]
        if kind == 'W':
            _, l, i, j = tag
            return self.W[l][i, j]
        else:
            _, l, j = tag
            return self.b[l][j]

Python

    def _set_param(self, tag, val):
        kind = tag[0]
        if kind == 'W':
            _, l, i, j = tag
            self.W[l][i, j] = val
        else:
            _, l, j = tag
            self.b[l][j] = val

Training (learning)

    def train(self, X, y, epochs=10, verbose=True):

        for ep in range(1, epochs + 1):

            base_loss = self.loss(X, y)
            updates = {}

            # Probe all parameters
            for tag in self._all_param_tags():

                theta = self._get_param(tag)
                best_delta = 0.0
                best_loss = base_loss

                for delta in (-self.step, 0.0, +self.step):
                    self._set_param(tag, theta + delta)
                    trial_loss = self.loss(X, y)
                    if trial_loss < best_loss:
                        best_loss = trial_loss
                        best_delta = delta

                # restore original
                self._set_param(tag, theta)
                updates[tag] = best_delta

            # Apply all deltas together
            for tag, d in updates.items():
                if d != 0.0:
                    self._set_param(tag, self._get_param(tag) + d)

            new_loss = self.loss(X, y)

            if verbose:
                print(f"Epoch {ep:3d}: loss {base_loss:.5f} → {new_loss:.5f}")

            # optional early stop
            if abs(new_loss - base_loss) < 1e-12:
                break

Ouf!

Does it work?

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

model = NaiveMLP([2, 4, 4, 1], step=0.06, seed=0)

print("Initial loss:", model.loss(X_train, y_train))

model.train(X_train, y_train, epochs=100)

print("Train acc:", accuracy_score(y_train, model.predict(X_train)))
print("Test acc: ", accuracy_score(y_test, model.predict(X_test)))
Initial loss: 0.7001969055705487
Epoch   1: loss 0.70020 → 0.69315
Epoch   2: loss 0.69315 → 0.69547
Epoch   3: loss 0.69547 → 0.69387
Epoch   4: loss 0.69387 → 0.69542
Epoch   5: loss 0.69542 → 0.69384
Epoch   6: loss 0.69384 → 0.69537
Epoch   7: loss 0.69537 → 0.69382
Epoch   8: loss 0.69382 → 0.69531
Epoch   9: loss 0.69531 → 0.69379
Epoch  10: loss 0.69379 → 0.69525
Epoch  11: loss 0.69525 → 0.69361
Epoch  12: loss 0.69361 → 0.69517
Epoch  13: loss 0.69517 → 0.69331
Epoch  14: loss 0.69331 → 0.69503
Epoch  15: loss 0.69503 → 0.69298
Epoch  16: loss 0.69298 → 0.69468
Epoch  17: loss 0.69468 → 0.69257
Epoch  18: loss 0.69257 → 0.69364
Epoch  19: loss 0.69364 → 0.69200
Epoch  20: loss 0.69200 → 0.69173
Epoch  21: loss 0.69173 → 0.69118
Epoch  22: loss 0.69118 → 0.68965
Epoch  23: loss 0.68965 → 0.68970
Epoch  24: loss 0.68970 → 0.68716
Epoch  25: loss 0.68716 → 0.68672
Epoch  26: loss 0.68672 → 0.68417
Epoch  27: loss 0.68417 → 0.68170
Epoch  28: loss 0.68170 → 0.67855
Epoch  29: loss 0.67855 → 0.67406
Epoch  30: loss 0.67406 → 0.66867
Epoch  31: loss 0.66867 → 0.66203
Epoch  32: loss 0.66203 → 0.65492
Epoch  33: loss 0.65492 → 0.64672
Epoch  34: loss 0.64672 → 0.63852
Epoch  35: loss 0.63852 → 0.62948
Epoch  36: loss 0.62948 → 0.61977
Epoch  37: loss 0.61977 → 0.60918
Epoch  38: loss 0.60918 → 0.59844
Epoch  39: loss 0.59844 → 0.58893
Epoch  40: loss 0.58893 → 0.57598
Epoch  41: loss 0.57598 → 0.56310
Epoch  42: loss 0.56310 → 0.55035
Epoch  43: loss 0.55035 → 0.53809
Epoch  44: loss 0.53809 → 0.52214
Epoch  45: loss 0.52214 → 0.50660
Epoch  46: loss 0.50660 → 0.49073
Epoch  47: loss 0.49073 → 0.47591
Epoch  48: loss 0.47591 → 0.45758
Epoch  49: loss 0.45758 → 0.44074
Epoch  50: loss 0.44074 → 0.42251
Epoch  51: loss 0.42251 → 0.41069
Epoch  52: loss 0.41069 → 0.38858
Epoch  53: loss 0.38858 → 0.36998
Epoch  54: loss 0.36998 → 0.35227
Epoch  55: loss 0.35227 → 0.34356
Epoch  56: loss 0.34356 → 0.32577
Epoch  57: loss 0.32577 → 0.31462
Epoch  58: loss 0.31462 → 0.29240
Epoch  59: loss 0.29240 → 0.27704
Epoch  60: loss 0.27704 → 0.25851
Epoch  61: loss 0.25851 → 0.25409
Epoch  62: loss 0.25409 → 0.23884
Epoch  63: loss 0.23884 → 0.23260
Epoch  64: loss 0.23260 → 0.21815
Epoch  65: loss 0.21815 → 0.21238
Epoch  66: loss 0.21238 → 0.19964
Epoch  67: loss 0.19964 → 0.19350
Epoch  68: loss 0.19350 → 0.17787
Epoch  69: loss 0.17787 → 0.16963
Epoch  70: loss 0.16963 → 0.15270
Epoch  71: loss 0.15270 → 0.14804
Epoch  72: loss 0.14804 → 0.13478
Epoch  73: loss 0.13478 → 0.13357
Epoch  74: loss 0.13357 → 0.12381
Epoch  75: loss 0.12381 → 0.12041
Epoch  76: loss 0.12041 → 0.10789
Epoch  77: loss 0.10789 → 0.10512
Epoch  78: loss 0.10512 → 0.09204
Epoch  79: loss 0.09204 → 0.08493
Epoch  80: loss 0.08493 → 0.07447
Epoch  81: loss 0.07447 → 0.07243
Epoch  82: loss 0.07243 → 0.06423
Epoch  83: loss 0.06423 → 0.06479
Epoch  84: loss 0.06479 → 0.06053
Epoch  85: loss 0.06053 → 0.05940
Epoch  86: loss 0.05940 → 0.05501
Epoch  87: loss 0.05501 → 0.05465
Epoch  88: loss 0.05465 → 0.05370
Epoch  89: loss 0.05370 → 0.05036
Epoch  90: loss 0.05036 → 0.04376
Epoch  91: loss 0.04376 → 0.04596
Epoch  92: loss 0.04596 → 0.04283
Epoch  93: loss 0.04283 → 0.04213
Epoch  94: loss 0.04213 → 0.04031
Epoch  95: loss 0.04031 → 0.03879
Epoch  96: loss 0.03879 → 0.03629
Epoch  97: loss 0.03629 → 0.03574
Epoch  98: loss 0.03574 → 0.03499
Epoch  99: loss 0.03499 → 0.03293
Epoch 100: loss 0.03293 → 0.03075
Train acc: 1.0
Test acc:  1.0
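The log is not monotone: epoch 2, for instance, raises the loss from 0.69315 to 0.69547. One plausible explanation is that each delta is chosen with all other parameters held fixed, yet all deltas are applied together. The sketch below reproduces the effect on a made-up two-parameter loss; it does not analyse the network itself.

```python
def loss(a, b):
    # Hypothetical toy loss: minimized whenever a + b = 1
    return (a + b - 1.0) ** 2

a, b, step = 0.3, 0.3, 0.5

base = loss(a, b)                 # about 0.16
only_a = loss(a + step, b)        # about 0.01, so the probe picks +step for a
only_b = loss(a, b + step)        # about 0.01, so the probe picks +step for b
both = loss(a + step, b + step)   # about 0.36, worse than the starting point

print(base, only_a, only_b, both)
```

Each move helps on its own, but applying both overshoots the minimum.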

Visualization

# Plot helper: decision boundary in the original (x1, x2) plane

def plot_decision_boundary(model, X, y, title="Naïve MLP decision boundary"):

    # grid over the input plane
    pad = 0.3
    x1_min, x1_max = X[:,0].min()-pad, X[:,0].max()+pad
    x2_min, x2_max = X[:,1].min()-pad, X[:,1].max()+pad

    xx, yy = np.meshgrid(
        np.linspace(x1_min, x1_max, 400),
        np.linspace(x2_min, x2_max, 400)
    )
    grid = np.c_[xx.ravel(), yy.ravel()]

    # predict probabilities on the grid
    p = model.forward(grid).reshape(xx.shape)

    # filled probabilities + p=0.5 contour + data points
    plt.figure(figsize=(3.75, 3.75), dpi=140)
    plt.contourf(xx, yy, p, levels=50, alpha=0.7)
    cs = plt.contour(xx, yy, p, levels=[0.5], linewidths=2)
    plt.scatter(X[:,0], X[:,1], c=y, s=18, edgecolor="k", linewidth=0.2)
    plt.clabel(cs, fmt={0.5: "p=0.5"})
    plt.title(title)
    plt.xlabel("x₁")
    plt.ylabel("x₂")
    plt.tight_layout()
    plt.show()

plot_decision_boundary(model, X, y)

XOR-like data

n_samples = 800
rng = np.random.default_rng(42)

X = rng.uniform(-6, 6, size=(n_samples, 2))
x1, x2 = X[:, 0], X[:, 1]

y = ((x1 * x2) > 0).astype(int)

plt.figure(figsize=(4.5, 4.5))
plt.scatter(X[y == 0, 0], X[y == 0, 1],
            color="C0", label="class 0", edgecolor="k", linewidth=0.3)
plt.scatter(X[y == 1, 0], X[y == 1, 1],
            color="C1", label="class 1", edgecolor="k", linewidth=0.3)

plt.axhline(0, color="gray", linestyle="--", linewidth=1)
plt.axvline(0, color="gray", linestyle="--", linewidth=1)

plt.xlabel("x₁")
plt.ylabel("x₂")
plt.title("XOR-like data")
plt.xlim(-6, 6)
plt.ylim(-6, 6)
plt.axis("equal")
plt.legend()
plt.tight_layout()
plt.show()

XOR-like data (continued)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

model = NaiveMLP([2, 4, 4, 1], step=0.06, seed=0)

print("Initial loss:", model.loss(X_train, y_train))

model.train(X_train, y_train, epochs=1000, verbose=False)

print("Final loss:", model.loss(X_train, y_train))

print("Train acc:", accuracy_score(y_train, model.predict(X_train)))
print("Test acc: ", accuracy_score(y_test, model.predict(X_test)))
Initial loss: 0.7029764953722529
Final loss: 3.4494531195516116e-07
Train acc: 1.0
Test acc:  0.99

XOR-like data (continued)

plot_decision_boundary(model, X, y)

Drawbacks

  • Computational inefficiency.
  • Scalability limitations.
  • Fixed step size (±η, called step in the code) lacks adaptivity.
  • Poor coordination of parameters.
  • No directional or magnitude information.
  • Lack of sophisticated optimizer features.
  • Potential for over-fitting or poor generalisation.
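The first two drawbacks can be made concrete by counting loss evaluations. In `train`, every epoch evaluates the loss once for the baseline, three times per parameter (one for each candidate delta), and once more after the update, and each evaluation is a full forward pass over the training set. The sketch below counts these passes; the helper names and the `[784, 128, 10]` architecture are hypothetical, chosen only to show the growth.

```python
def n_parameters(layer_sizes):
    # Weights plus biases for each pair of consecutive layers
    return sum(i * o + o for i, o in zip(layer_sizes[:-1], layer_sizes[1:]))

def forward_passes_per_epoch(layer_sizes, n_probes=3):
    # One pass for the baseline loss, n_probes per parameter, one for the new loss
    return 1 + n_probes * n_parameters(layer_sizes) + 1

small = forward_passes_per_epoch([2, 4, 4, 1])
larger = forward_passes_per_epoch([784, 128, 10])   # hypothetical larger network

print(small, larger)   # 113 305312
```

The cost per epoch grows linearly with the number of parameters, on top of the cost of each forward pass.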

Notation

A two-layer perceptron computes:

\[ \hat{y} = \phi_2(\phi_1(X)) \]

where

\[ \phi_l(Z) = \phi(W_lZ_l + b_l) \]

Notation

A 3-layer perceptron computes:

\[ \hat{y} = \phi_3(\phi_2(\phi_1(X))) \]

where

\[ \phi_l(Z) = \phi(W_lZ_l + b_l) \]

Notation

A \(k\)-layer perceptron computes:

\[ \hat{y} = \phi_k( \ldots \phi_2(\phi_1(X)) \ldots ) \]

where

\[ \phi_l(Z) = \phi(W_lZ_l + b_l) \]
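The nested composition can be written as a fold over a list of layer functions. The sketch below is one possible reading of the notation, with several assumptions: it uses the column-vector convention \(W_l Z + b_l\) of the formulas (the `NaiveMLP` code instead stores examples as rows and computes `a @ W + b`), it takes \(\phi\) to be tanh, and `make_layer` is a helper invented for this example.

```python
from functools import reduce
import numpy as np

def make_layer(W, b, phi=np.tanh):
    # Returns phi_l, the map Z -> phi(W Z + b), in the column-vector convention
    return lambda Z: phi(W @ Z + b)

rng = np.random.default_rng(0)
sizes = [2, 4, 4, 1]
layers = [make_layer(rng.standard_normal((o, i)), np.zeros((o, 1)))
          for i, o in zip(sizes[:-1], sizes[1:])]

X = rng.standard_normal((2, 5))   # 5 hypothetical examples stored as columns

# y_hat = phi_k( ... phi_2(phi_1(X)) ... ), applying the layers from first to last
y_hat = reduce(lambda Z, phi_l: phi_l(Z), layers, X)

# The same computation written out explicitly for k = 3
y_nested = layers[2](layers[1](layers[0](X)))

print(y_hat.shape)   # (1, 5)
```

The fold and the explicit nesting perform the same operations in the same order, so the two results are identical.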

Prologue

Summary

  • Framed deep learning as layered function approximation across tasks.
  • Described FNNs: inputs → hidden layers → outputs; information flowed forward only.
  • Noted units used bias and activations; clarified why non-linearity mattered.
  • Reviewed sigmoid/tanh/ReLU ranges and derivative behavior.
  • Stated the Universal Approximation Theorem and its practical limits.
  • Built a tiny MLP and computed predictions and BCE loss on toy data.
  • Demonstrated a naïve, non-gradient training algorithm; it worked but scaled poorly and was brittle.
  • Established compact layer notation, \(\hat{y} = \phi_k( \ldots \phi_2(\phi_1(X)) \ldots )\) where \(\phi_l(Z) = \phi(W_lZ_l + b_l)\), to prepare for backprop.

Next lecture

  • We will introduce backpropagation and discuss the vanishing gradient problem, softmax, and regularization.

References

Cybenko, George V. 1989. “Approximation by Superpositions of a Sigmoidal Function.” Mathematics of Control, Signals and Systems 2: 303–14. https://api.semanticscholar.org/CorpusID:3958369.
Géron, Aurélien. 2022. Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow. 3rd ed. O’Reilly Media, Inc.
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. “Deep Residual Learning for Image Recognition.” In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770–78. https://doi.org/10.1109/CVPR.2016.90.
Hornik, Kurt, Maxwell Stinchcombe, and Halbert White. 1989. “Multilayer Feedforward Networks Are Universal Approximators.” Neural Networks 2 (5): 359–66. https://doi.org/10.1016/0893-6080(89)90020-8.
Russell, Stuart, and Peter Norvig. 2020. Artificial Intelligence: A Modern Approach. 4th ed. Pearson. http://aima.cs.berkeley.edu/.

Marcel Turcotte

Marcel.Turcotte@uOttawa.ca

School of Electrical Engineering and Computer Science (EECS)

University of Ottawa