CSI 4106 - Fall 2024
Version: Oct 23, 2024 15:24
The softmax function is an activation function used in multi-class classification problems to convert a vector of raw scores into probabilities that sum to 1.
Given a vector \(\mathbf{z} = [z_1, z_2, \ldots, z_n]\):
\[ \sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}} \]
where \(\sigma(\mathbf{z})_i\) is the probability of the \(i\)-th class, and \(n\) is the number of classes.
| \(z_1\) | \(z_2\) | \(z_3\) | \(\sigma(\mathbf{z})_1\) | \(\sigma(\mathbf{z})_2\) | \(\sigma(\mathbf{z})_3\) | \(\sum\) |
|---|---|---|---|---|---|---|
| 1.47 | -0.39 | 0.22 | 0.69 | 0.11 | 0.20 | 1.00 |
| 5.00 | 6.00 | 4.00 | 0.24 | 0.67 | 0.09 | 1.00 |
| 0.90 | 0.80 | 1.10 | 0.32 | 0.29 | 0.39 | 1.00 |
| -2.00 | 2.00 | -3.00 | 0.02 | 0.98 | 0.01 | 1.00 |
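As a quick check, the first row of the table can be reproduced with a short NumPy sketch; the softmax function below is written for illustration rather than taken from a library.

import numpy as np

def softmax(z):
    # Subtracting the maximum improves numerical stability without
    # changing the result (softmax is invariant to adding a constant).
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([1.47, -0.39, 0.22])
print(np.round(softmax(z), 2))  # [0.69 0.11 0.2 ], summing to 1.00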
For a single example in a multi-class classification task, the cross-entropy loss is:
\[ J(W) = -\sum_{k=1}^{K} y_k \log(\hat{y}_k) \]
where \(y_k\) is the true (one-hot encoded) label for class \(k\), \(\hat{y}_k\) is the predicted probability for class \(k\), and \(K\) is the number of classes.
For a dataset with \(N\) examples, the average cross-entropy loss over all examples is computed as:
\[ L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_{i,k} \log(\hat{y}_{i,k}) \]
where \(y_{i,k}\) and \(\hat{y}_{i,k}\) are, respectively, the true label and the predicted probability of class \(k\) for example \(i\), and \(N\) is the number of examples.
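Both formulas can be checked numerically. The sketch below uses two illustrative one-hot labels and predicted probability vectors (not values taken from these notes) to compute the per-example loss \(J(W)\) and its average \(L\).

import numpy as np

# Illustrative one-hot labels and predicted probabilities (rows sum to 1).
y_true = np.array([[0, 1, 0],
                   [1, 0, 0]])
y_pred = np.array([[0.11, 0.69, 0.20],
                   [0.24, 0.67, 0.09]])

# Per-example cross-entropy: J(W) = -sum_k y_k log(y_hat_k)
per_example = -np.sum(y_true * np.log(y_pred), axis=1)

# Average cross-entropy over the N examples
L = per_example.mean()
print(per_example, L)  # approximately [0.371 1.427] and 0.899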
Regularization comprises a set of techniques designed to enhance a model’s ability to generalize by mitigating overfitting. By discouraging excessive model complexity, these methods improve the model’s robustness and performance on unseen data.
In numerical optimization, it is standard practice to incorporate additional terms into the objective function to deter undesirable model characteristics.
In a minimization problem, the optimization process is thereby steered away from solutions that incur the large costs associated with these penalty terms.
Consider the mean absolute error loss function:
\[ \mathrm{MAE}(X,W) = \frac{1}{N} \sum_{i=1}^N | h_W(x_i) - y_i | \]
where \(h_W(x_i)\) is the model's prediction for example \(x_i\), \(y_i\) is the corresponding target value, \(W\) denotes the model parameters, and \(N\) is the number of examples.
One or more terms can be added to the loss:
\[ \mathrm{MAE}(X,W) = \frac{1}{N} \sum_{i=1}^N | h_W(x_i) - y_i | + \mathrm{penalty} \]
A norm assigns a non-negative length to a vector.
The \(\ell_p\) norm of a vector \(\mathbf{z} = [z_1, z_2, \ldots, z_n]\) is defined as:
\[ \|\mathbf{z}\|_p = \left( \sum_{i=1}^{n} |z_i|^p \right)^{1/p} \]
The \(\ell_1\) norm (Manhattan norm) is:
\[ \|\mathbf{z}\|_1 = \sum_{i=1}^{n} |z_i| \]
The \(\ell_2\) norm (Euclidean norm) is:
\[ \|\mathbf{z}\|_2 = \sqrt{\sum_{i=1}^{n} z_i^2} \]
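For instance, both norms are available through np.linalg.norm; the vector below is arbitrary.

import numpy as np

z = np.array([3.0, -4.0])
print(np.linalg.norm(z, ord=1))  # l1 norm: |3| + |-4| = 7.0
print(np.linalg.norm(z, ord=2))  # l2 norm: sqrt(3^2 + 4^2) = 5.0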
Below, \(\alpha\) and \(\beta\) determine the degree of regularization applied to the weights; setting either value to zero disables the corresponding penalty term.
\[ \mathrm{MAE}(X,W) = \frac{1}{N} \sum_{i=1}^N | h_W(x_i) - y_i | + \alpha \|W\|_1 + \beta \|W\|_2 \]
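As a sketch of how this penalized loss might be computed directly, assuming W is a flat vector of model weights and y_pred holds the model's predictions (the function name and default coefficients are our own):

import numpy as np

def regularized_mae(y_true, y_pred, W, alpha=0.1, beta=0.1):
    # Mean absolute error plus l1 and l2 penalties on the weight vector W.
    # alpha and beta control how strongly large weights are discouraged.
    mae = np.mean(np.abs(y_pred - y_true))
    penalty = alpha * np.linalg.norm(W, ord=1) + beta * np.linalg.norm(W, ord=2)
    return mae + penalty

In Keras, a comparable per-layer penalty can be attached through the kernel_regularizer argument of a layer (e.g., keras.regularizers.L1L2), keeping in mind that the built-in L2 regularizer penalizes the sum of squared weights rather than the \(\ell_2\) norm itself.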
Dropout is a regularization technique in neural networks where randomly selected neurons are ignored during training, reducing overfitting by preventing co-adaptation of features.
During each training step, every neuron in a dropout layer has a probability \(p\) of being excluded from the computation; typical values for \(p\) range from 10% to 50%.
While seemingly counterintuitive, this approach prevents the network from depending on specific neurons, promoting the distribution of learned representations across multiple neurons.
Dropout is one of the most popular and effective methods for reducing overfitting.
The typical improvement in performance is modest, usually around 1 to 2%.
import keras
from keras.models import Sequential
from keras.layers import InputLayer, Dropout, Flatten, Dense

model = Sequential([
    InputLayer(shape=[28, 28]),      # 28 x 28 input images
    Flatten(),                       # flatten each image into a 784-dimensional vector
    Dropout(rate=0.2),               # randomly drop 20% of the inputs at each training step
    Dense(300, activation="relu"),
    Dropout(rate=0.2),
    Dense(100, activation="relu"),
    Dropout(rate=0.2),
    Dense(10, activation="softmax")  # 10-class probability output
])
Early stopping is a regularization technique that halts training once the model’s performance on a validation set begins to degrade, preventing overfitting by stopping before the model learns noise.
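A minimal sketch of early stopping in Keras, reusing the model defined above and assuming that training and validation arrays X_train, y_train, X_valid, y_valid are already loaded; the patience of 10 epochs is an arbitrary choice.

from keras.callbacks import EarlyStopping

# Stop when the validation loss has not improved for 10 consecutive epochs,
# then roll back to the weights of the best epoch seen so far.
early_stopping = EarlyStopping(monitor="val_loss", patience=10,
                               restore_best_weights=True)

model.compile(loss="sparse_categorical_crossentropy",
              optimizer="sgd", metrics=["accuracy"])

history = model.fit(X_train, y_train, epochs=100,
                    validation_data=(X_valid, y_valid),
                    callbacks=[early_stopping])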
Marcel Turcotte
School of Electrical Engineering and Computer Science (EECS)
University of Ottawa