Neural networks evolved from simple, biologically inspired perceptrons to deep, multilayer architectures that rely on nonlinear activation functions for learning complex patterns. The universal approximation theorem underpins their ability to approximate any continuous function, and modern frameworks like PyTorch, TensorFlow, and Keras enable practical deep learning applications.
Learning Objectives
Explain basic neural network models (perceptrons and MLPs) and their computational foundations.
Appreciate the limitations of single-layer networks and the necessity for hidden layers.
Describe the role and impact of nonlinear activation functions (sigmoid, tanh, ReLU) in learning.
Articulate the universal approximation theorem and its significance.
Implement and evaluate deep learning models using modern frameworks such as TensorFlow and Keras.
James Zou, Mikael Huss, Abubakar Abid, Pejman Mohammadi, Ali Torkamani, and Amalio Telenti, A primer on deep learning in genomics, Nat Genet 51:1, 12–18, 2019.
With \(\theta = 2\), the neurode implements an AND logic gate.
With \(\theta = 1\), the neurode implements an OR logic gate.
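As a quick illustration, a two-input neurode can be sketched in a few lines of Python, assuming binary inputs that are simply summed and compared with the threshold \(\theta\) (a minimal sketch, not a full McCulloch–Pitts formalism):

import numpy as np

def neurode(inputs, theta):
    """Fire (output 1) when the sum of the binary inputs reaches the threshold theta."""
    return int(np.sum(inputs) >= theta)

# Check all binary input combinations for two inputs
for x1 in (0, 1):
    for x2 in (0, 1):
        and_out = neurode([x1, x2], theta=2)  # theta = 2 -> AND gate
        or_out = neurode([x1, x2], theta=1)   # theta = 1 -> OR gate
        print(f"x1={x1}, x2={x2}: AND={and_out}, OR={or_out}")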
Computations with Neurodes
Digital computations can be broken down into a sequence of logical operations, enabling neurode networks to execute any computation.
McCulloch and Pitts (1943) did not address how the parameter \(\theta\) could be learned.
They introduced a machine that can compute any such function, but it cannot learn.
Threshold Logic Unit
Simple Step Functions
\[
\text{heaviside}(t) =
\begin{cases}
1, & \text{if } t \geq 0 \\
0, & \text{if } t < 0
\end{cases}
\]
\[
\text{sign}(t) =
\begin{cases}
1, & \text{if } t > 0 \\
0, & \text{if } t = 0 \\
-1, & \text{if } t < 0
\end{cases}
\]
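Both step functions are available in NumPy; the short check below is an illustrative sketch, assuming np.heaviside with 1 as the value at zero, which matches the definition above.

import numpy as np

t = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])

# Heaviside step: the second argument is the value returned at t == 0
print(np.heaviside(t, 1))   # [0. 0. 1. 1. 1.]

# Sign function: -1 for negative, 0 at zero, 1 for positive
print(np.sign(t))           # [-1. -1.  0.  1.  1.]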
Perceptron
Notation
\(X\) is the input data matrix where each row corresponds to an example and each column represents one of the \(D\) features.
\(W\) is the weight matrix, structured with one row per input (feature) and one column per neuron.
Bias terms can either be folded into \(W\) (by adding a constant input of 1) or represented separately; both approaches appear in the literature. Here, \(b\) is a vector whose length equals the number of neurons.
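A small shape check helps make the notation concrete; the sizes below (5 examples, 3 features, 2 neurons) are arbitrary values chosen purely for illustration.

import numpy as np

N, D, M = 5, 3, 2          # examples, features, neurons (illustrative sizes)
X = np.random.rand(N, D)   # one row per example, one column per feature
W = np.random.rand(D, M)   # one row per feature, one column per neuron
b = np.zeros(M)            # one bias per neuron

Z = X @ W + b              # pre-activation values, shape (N, M)
print(Z.shape)             # (5, 2)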
Discussion
The algorithm to train the perceptron closely resembles stochastic gradient descent.
In the interest of time and to avoid confusion, we will skip this algorithm and focus on the multilayer perceptron (MLP) and its training algorithm, backpropagation.
As will be discussed later, the training algorithm, known as backpropagation, employs gradient descent, necessitating the calculation of the partial derivatives of the loss function.
In the multilayer perceptron, the step function had to be replaced because it consists only of flat segments: its derivative is zero wherever it is defined, so gradient descent cannot make progress.
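To see why, recall the gradient descent update for a weight \(w\) with learning rate \(\eta\):
\[
w \leftarrow w - \eta \, \frac{\partial L}{\partial w}
\]
If the derivative of the loss with respect to \(w\) is zero everywhere, the update vanishes and the weights never change.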
Activation Function
Nonlinear activation functions are paramount because, without them, multiple layers in the network would only compute a linear function of the inputs.
According to the Universal Approximation Theorem, sufficiently large deep networks with nonlinear activation functions can approximate any continuous function. See Universal Approximation Theorem.
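A quick numerical sketch (with arbitrary random weights) illustrates the first point: composing two purely linear layers collapses to a single linear layer.

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((4, 3))                        # 4 examples, 3 features

W1, b1 = rng.random((3, 5)), rng.random(5)    # first linear "layer"
W2, b2 = rng.random((5, 2)), rng.random(2)    # second linear "layer"

# Two linear layers applied in sequence (no activation in between)
out_two_layers = (X @ W1 + b1) @ W2 + b2

# The same computation expressed as a single linear layer
W = W1 @ W2
b = b1 @ W2 + b2
out_single_layer = X @ W + b

print(np.allclose(out_two_layers, out_single_layer))   # True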
Sigmoid
Code
import numpy as np
import matplotlib.pyplot as plt

# Sigmoid function
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Generate x values
x = np.linspace(-10, 10, 400)

# Compute y values for the sigmoid function
y = sigmoid(x)

plt.figure(figsize=(4, 3))
plt.plot(x, y, color='black', linewidth=2)
plt.grid(True)
plt.show()
\[
\sigma(t) = \frac{1}{1 + e^{-t}}
\]
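A convenient property for gradient-based training is that the sigmoid's derivative can be written in terms of the sigmoid itself:
\[
\sigma'(t) = \sigma(t)\bigl(1 - \sigma(t)\bigr)
\]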
Hyperbolic Tangent Function
Code
# Compute y values for the hyperbolic tangent function
y = np.tanh(x)

plt.figure(figsize=(4, 3))
plt.plot(x, y, color='black', linewidth=2)
plt.grid(True)
plt.show()
Rectified linear unit function (ReLU)
Code
# Compute y values for the rectified linear unit (ReLU) function
y = np.maximum(0, x)

plt.figure(figsize=(4, 3))
plt.plot(x, y, color='black', linewidth=2)
plt.grid(True)
plt.show()
The Universal Approximation Theorem (UAT) states that a feedforward neural network with a single hidden layer containing a finite number of neurons can approximate any continuous function on a compact subset of \(\mathbb{R}^n\), given appropriate weights and activation functions.
Single Hidden Layer
\[
y = \sum_{i=1}^N \alpha_i \sigma(w_{1,i} x + b_i)
\]
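This sum can be written directly in code; the parameters \(\alpha_i\), \(w_{1,i}\), and \(b_i\) below are arbitrary illustrative values rather than a trained network.

import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

def single_hidden_layer(x, alpha, w, b):
    """y = sum_i alpha_i * sigmoid(w_i * x + b_i) for a scalar input x."""
    return np.sum(alpha * sigmoid(w * x + b))

# Illustrative parameters for N = 3 hidden neurons
alpha = np.array([1.0, -0.5, 2.0])
w = np.array([2.0, 1.0, 0.5])
b = np.array([0.0, -1.0, 1.0])

print(single_hidden_layer(0.5, alpha, w, b))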
Effect of Varying w
Code
def logistic(x, w, b):
    """Compute the logistic function with parameters w and b."""
    return 1 / (1 + np.exp(-(w * x + b)))

# Define a range for x values.
x = np.linspace(-10, 10, 400)

# Plot 1: Varying w (steepness) with b fixed at 0.
plt.figure(figsize=(6, 4))
w_values = [0.5, 1, 2, 5]  # different steepness values
b = 0                      # fixed bias
for w in w_values:
    plt.plot(x, logistic(x, w, b), label=f'w = {w}, b = {b}')
plt.title('Effect of Varying w (with b = 0)')
plt.xlabel('x')
plt.ylabel(r'$\sigma(wx+b)$')
plt.legend()
plt.grid(True)
plt.show()
Effect of Varying b
Code
# Plot 2: Varying b (horizontal shift) with w fixed at 1.
plt.figure(figsize=(6, 4))
w = 1                         # fixed steepness
b_values = [-5, -2, 0, 2, 5]  # different bias values
for b in b_values:
    plt.plot(x, logistic(x, w, b), label=f'w = {w}, b = {b}')
plt.title('Effect of Varying b (with w = 1)')
plt.xlabel('x')
plt.ylabel(r'$\sigma(wx+b)$')
plt.legend()
plt.grid(True)
plt.show()
Effect of Varying w
Code
def relu(x, w, b):
    """Compute the ReLU activation with parameters w and b."""
    return np.maximum(0, w * x + b)

# Define a range for x values.
x = np.linspace(-10, 10, 400)

# Plot 1: Varying w (scaling) with b fixed at 0.
plt.figure(figsize=(6, 4))
w_values = [0.5, 1, 2, 5]  # different scaling values
b = 0                      # fixed bias
for w in w_values:
    plt.plot(x, relu(x, w, b), label=f'w = {w}, b = {b}')
plt.title('Effect of Varying w (with b = 0) on ReLU Activation')
plt.xlabel('x')
plt.ylabel('ReLU(wx+b)')
plt.legend()
plt.grid(True)
plt.show()
Effect of Varying b
Code
# Plot 2: Varying b (horizontal shift) with w fixed at 1.
plt.figure(figsize=(6, 4))
w = 1                         # fixed scaling
b_values = [-5, -2, 0, 2, 5]  # different bias values
for b in b_values:
    plt.plot(x, relu(x, w, b), label=f'w = {w}, b = {b}')
plt.title('Effect of Varying b (with w = 1) on ReLU Activation')
plt.xlabel('x')
plt.ylabel('ReLU(wx+b)')
plt.legend()
plt.grid(True)
plt.show()
Single Hidden Layer
\[
y = \sum_{i=1}^N \alpha_i \sigma(w_{1,i} x + b_i)
\]
Demonstration with Code
# Defining the function to be approximated
def f(x):
    return 2 * x**3 + 4 * x**2 - 5 * x + 1

# Generating a dataset, x in [-4, 2), f(x) as above
X = 6 * np.random.rand(1000, 1) - 4
y = f(X).flatten()
Increasing the Number of Neurons
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.1, random_state=42)

models = []
sizes = [1, 2, 5, 10, 100]
for i, n in enumerate(sizes):
    models.append(MLPRegressor(hidden_layer_sizes=[n], max_iter=5000, random_state=42))
    models[i].fit(X_train, y_train)
Increasing the Number of Neurons
Code
# Create a colormap
colors = plt.colormaps['cool'].resampled(len(sizes))

X_valid = np.sort(X_valid, axis=0)
for i, n in enumerate(sizes):
    y_pred = models[i].predict(X_valid)
    plt.plot(X_valid, y_pred, "-", color=colors(i), label="Number of neurons = {}".format(n))

y_true = f(X_valid)
plt.plot(X_valid, y_true, "r.", label='Actual')
plt.legend()
plt.show()
Increasing the Number of Neurons
Code
for i, n in enumerate(sizes):
    plt.plot(models[i].loss_curve_, "-", color=colors(i), label="Number of neurons = {}".format(n))
plt.title('MLPRegressor Loss Curves')
plt.xlabel('Iterations')
plt.ylabel('Loss')
plt.legend()
plt.show()
PyTorch has gained considerable traction in the research community. Initially developed by Meta AI, it is now part of the Linux Foundation.
TensorFlow, created by Google, is widely adopted in industry for deploying models in production environments.
Keras
Keras is a high-level API designed to build, train, evaluate, and run models on top of several backends, including PyTorch, TensorFlow, and JAX (Google's high-performance numerical computing library).
Fashion-MNIST dataset
“Fashion-MNIST is a dataset of Zalando’s article images—consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes.”
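As a minimal sketch of the workflow, the code below loads Fashion-MNIST through the Keras API bundled with TensorFlow and builds a small feedforward classifier; the layer sizes, optimizer, and number of epochs are arbitrary illustrative choices, not a tuned model.

import tensorflow as tf

# Load the Fashion-MNIST dataset (60,000 training and 10,000 test images)
(X_train_full, y_train_full), (X_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()

# Scale pixel values to [0, 1] and hold out a validation set
X_train_full = X_train_full / 255.0
X_test = X_test / 255.0
X_valid, X_train = X_train_full[:5000], X_train_full[5000:]
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]

# A small feedforward network: flatten each 28x28 image, two hidden ReLU layers, softmax output
model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(300, activation="relu"),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

model.compile(loss="sparse_categorical_crossentropy",
              optimizer="sgd",
              metrics=["accuracy"])

# Train, evaluate, and predict
history = model.fit(X_train, y_train, epochs=5,
                    validation_data=(X_valid, y_valid))
model.evaluate(X_test, y_test)
y_proba = model.predict(X_test[:3])   # class probabilities for the first three test images

Here model.evaluate reports the test loss and accuracy, and model.predict returns one probability per class for each input image.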
Neural Networks Foundations:
We introduced bio-inspired computation with neurodes and threshold logic units, outlining the perceptron model and its limitations (e.g., the XOR problem).
From Perceptrons to Deep Networks:
We explained the evolution to multilayer perceptrons (MLPs) and feedforward architectures, emphasizing the critical role of nonlinear activation functions (sigmoid, tanh, ReLU) in enabling gradient-based learning and complex function approximation.
Universal Approximation:
We discussed how even single hidden layer networks can approximate any continuous function on a compact set, highlighting the theoretical underpinning of deep learning.
Practical Frameworks and Applications:
Finally, we reviewed leading deep learning frameworks (PyTorch, TensorFlow, Keras) and demonstrated practical model building with the Fashion-MNIST dataset, covering model training, evaluation, and prediction.
D’haeseleer, Patrik. 2006. “How Does DNA Sequence Motif Discovery Work?” Nature Biotechnology 24 (8): 959–61. https://doi.org/10.1038/nbt0806-959.
Géron, Aurélien. 2022. Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow. 3rd ed. O’Reilly Media, Inc.
Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. Adaptive Computation and Machine Learning. MIT Press. https://dblp.org/rec/books/daglib/0040158.
Hornik, Kurt, Maxwell Stinchcombe, and Halbert White. 1989. “Multilayer Feedforward Networks Are Universal Approximators.” Neural Networks 2 (5): 359–66. https://doi.org/10.1016/0893-6080(89)90020-8.
LeNail, Alexander. 2019. “NN-SVG: Publication-Ready Neural Network Architecture Schematics.” Journal of Open Source Software 4 (33): 747. https://doi.org/10.21105/joss.00747.
McCulloch, Warren S., and Walter Pitts. 1943. “A logical calculus of the ideas immanent in nervous activity.” The Bulletin of Mathematical Biophysics 5 (4): 115–33. https://doi.org/10.1007/bf02478259.
Minsky, Marvin, and Seymour Papert. 1969. Perceptrons: An Introduction to Computational Geometry. Cambridge, MA, USA: MIT Press.
Rosenblatt, F. 1958. “The perceptron: A probabilistic model for information storage and organization in the brain.” Psychological Review 65 (6): 386–408. https://doi.org/10.1037/h0042519.
Wasserman, WW, and A Sandelin. 2004. “Applied bioinformatics for the identification of regulatory elements.” Nature Reviews Genetics 5 (4): 276–87. https://doi.org/10.1038/nrg1315.
Zou, James, Mikael Huss, Abubakar Abid, Pejman Mohammadi, Ali Torkamani, and Amalio Telenti. 2019. “A Primer on Deep Learning in Genomics.” Nature Genetics 51 (1): 12–18. https://doi.org/10.1038/s41588-018-0295-5.