Model evaluation

CSI 4106 - Fall 2024

Marcel Turcotte

Version: Dec 2, 2024 12:34

Preamble

Quote of the Day

Learning objectives

  • Clarify the concepts of underfitting and overfitting in machine learning.
  • Describe the primary metrics used to evaluate model performance.
  • Contrast micro- and macro-averaged performance metrics.

Model fitting

Model fitting

During our class discussions, we have touched upon the concepts of underfitting and overfitting. To delve deeper into these topics, let’s examine them in the context of polynomial regression.

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X, y)

Generating a nonlinear dataset

import numpy as np
np.random.seed(42)

X = 6 * np.random.rand(100, 1) - 3
y = 0.5 * X ** 2 - X + 2 + np.random.randn(100, 1)

Linear regression

A linear model inadequately represents this dataset

Definition

Feature engineering is the process of creating, transforming, and selecting variables (attributes) from raw data to improve the performance of machine learning models.

PolynomialFeatures

from sklearn.preprocessing import PolynomialFeatures

poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)
X[0]
array([-0.75275929])
X_poly[0]
array([-0.75275929,  0.56664654])

Generate a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree. For example, if an input sample is two dimensional and of the form \([a, b]\), the degree-2 polynomial features are \([1, a, b, a^2, ab, b^2]\).

PolynomialFeatures

Given two features \(a\) and \(b\), PolynomialFeatures with degree=3 would add \(a^2\), \(a^3\), \(b^2\), \(b^3\), as well as \(ab\), \(a^2b\), and \(ab^2\)!

Warning

PolynomialFeatures(degree=d) transforms an array of \(D\) original features into an array containing \(\frac{(D+d)!}{d!D!}\) features, including the bias term. Beware of the combinatorial explosion of the number of features!
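As a quick check of this formula, here is a minimal sketch (the two-feature sample \([a, b]\) is hypothetical) that lists the monomials generated for degree=3 and compares the resulting feature count with \(\frac{(D+d)!}{d!D!}\).

from math import comb
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_demo = np.array([[2.0, 3.0]])  # hypothetical sample [a, b], so D = 2
d = 3

poly = PolynomialFeatures(degree=d, include_bias=True)
X_demo_poly = poly.fit_transform(X_demo)

print(poly.get_feature_names_out())  # 1, x0, x1, x0^2, x0 x1, x1^2, x0^3, x0^2 x1, x0 x1^2, x1^3
print(X_demo_poly.shape[1])          # 10
print(comb(2 + d, d))                # (D+d)!/(d!D!) = 10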

Polynomial regression

LinearRegression on PolynomialFeatures

lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)

Polynomial regression

The data was generated according to the following equation, with the inclusion of Gaussian noise.

\[ y = 0.5 x^2 - 1.0 x + 2.0 \]

Presented below is the learned model.

\[ \hat{y} = 0.56 x^2 + (-1.06) x + 1.78 \]

lin_reg.coef_, lin_reg.intercept_
(array([[-1.06633107,  0.56456263]]), array([1.78134581]))

Overfitting and underfitting

A low loss value on the training set does not necessarily indicate a “better” model.

Under- and over- fitting

  • Underfitting:
    • Your model is too simple (here, linear).
    • Uninformative features.
    • Poor performance on both training and test data.
  • Overfitting:
    • Your model is too complex (tall decision tree, deep and wide neural networks, etc.).
    • Too many features given the number of examples available.
    • Excellent performance on the training set, but poor performance on the test set.

Learning curves

  • One way to assess our models is to visualize the learning curves:
    • A learning curve shows the performance of our model, here using RMSE, on both the training set and the test set.
    • Multiple measurements are obtained by repeatedly training the model on larger and larger subsets of the data (see the sketch below).
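Below is a minimal sketch of such a computation, using scikit-learn's learning_curve on the quadratic dataset generated earlier (the degree-2 pipeline and 5-fold cross-validation are illustrative assumptions, not the exact setup behind the figures).

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

np.random.seed(42)
X = 6 * np.random.rand(100, 1) - 3
y = (0.5 * X ** 2 - X + 2 + np.random.randn(100, 1)).ravel()

model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                      LinearRegression())

# Train on larger and larger subsets; measure RMSE on the training and validation folds
train_sizes, train_scores, valid_scores = learning_curve(
    model, X, y, train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5, scoring="neg_root_mean_squared_error")

plt.plot(train_sizes, -train_scores.mean(axis=1), "o-", label="training")
plt.plot(train_sizes, -valid_scores.mean(axis=1), "o-", label="validation")
plt.xlabel("training set size")
plt.ylabel("RMSE")
plt.legend()
plt.show()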

Learning curve – underfitting

Poor performance on both training and test data.

Learning curve – overfitting

Excellent performance on the training set, but poor performance on the test set.

Overfitting - deep nets - loss

Overfitting - deep nets - accuracy

Bias/Variance Tradeoff

  • Bias:
    • Error from overly simplistic models
    • High bias can lead to underfitting
  • Variance:
    • Error from overly complex models
    • Sensitivity to fluctuations in the training data
    • High variance can lead to overfitting
  • Tradeoff:
    • Aim for a model that generalizes well to new data
    • Methods: cross-validation, regularization, ensemble learning (see the sketch below)
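One way to see the tradeoff concretely is to compare models of increasing complexity with cross-validation; a minimal sketch on the quadratic dataset from earlier (the chosen degrees and the 5-fold setup are illustrative assumptions).

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

np.random.seed(42)
X = 6 * np.random.rand(100, 1) - 3
y = (0.5 * X ** 2 - X + 2 + np.random.randn(100, 1)).ravel()

for degree in (1, 2, 30):  # high bias, reasonable balance, high variance
    model = make_pipeline(PolynomialFeatures(degree=degree, include_bias=False),
                          StandardScaler(),
                          LinearRegression())
    rmse = -cross_val_score(model, X, y, cv=5,
                            scoring="neg_root_mean_squared_error")
    print(f"degree={degree:2d}  cross-validated RMSE={rmse.mean():.2f}")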

Performance metrics

Confusion matrix

                      Positive (Predicted)    Negative (Predicted)
Positive (Actual)     True positive (TP)      False negative (FN)
Negative (Actual)     False positive (FP)     True negative (TN)

sklearn.metrics.confusion_matrix

from sklearn.metrics import confusion_matrix

y_actual = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
y_pred   = [0, 1, 1, 0, 0, 0, 1, 1, 1, 1]

confusion_matrix(y_actual,y_pred)
array([[1, 2],
       [3, 4]])
tn, fp, fn, tp = confusion_matrix(y_actual, y_pred).ravel()
(tn, fp, fn, tp)
(1, 2, 3, 4)

Perfect prediction

y_actual = [0, 1, 0, 0, 1, 1, 1, 0, 1, 1]
y_pred   = [0, 1, 0, 0, 1, 1, 1, 0, 1, 1]

confusion_matrix(y_actual,y_pred)
array([[4, 0],
       [0, 6]])
tn, fp, fn, tp = confusion_matrix(y_actual, y_pred).ravel()    
(tn, fp, fn, tp)
(4, 0, 0, 6)

Confusion matrix - multiple classes

Source code

import numpy as np
np.random.seed(42)

from sklearn.datasets import load_digits
digits = load_digits()

X = digits.data
y = digits.target

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)

from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

clf = OneVsRestClassifier(LogisticRegression())

clf = clf.fit(X_train, y_train)

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

X_test = scaler.transform(X_test)
y_pred = clf.predict(X_test)

ConfusionMatrixDisplay.from_predictions(y_test, y_pred)

plt.show()

Visualizing errors

mask = (y_test == 9) & (y_pred == 8)

X_9_as_8 = X_test[mask]

y_9_as_8 = y_test[mask]

Confusion matrix - multiple classes

Accuracy

How accurate is this result?

\[ \mathrm{accuracy} = \frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN}} = \frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{N}} \]

from sklearn.metrics import accuracy_score

y_actual = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
y_pred   = [0, 1, 1, 0, 0, 0, 1, 1, 1, 1]

accuracy_score(y_actual,y_pred)
0.5

Accuracy

y_actual = [0, 1, 0, 0, 1, 1, 1, 0, 1, 1]
y_pred   = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0]

accuracy_score(y_actual,y_pred)
0.0
y_actual = [0, 1, 0, 0, 1, 1, 1, 0, 1, 1]
y_pred   = [0, 1, 0, 0, 1, 1, 1, 0, 1, 1]

accuracy_score(y_actual,y_pred)
1.0

Accuracy can be misleading

y_actual = [0, 0, 0, 0, 1, 1, 0, 0, 0, 0]
y_pred   = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

accuracy_score(y_actual,y_pred)
0.8

Precision

AKA positive predictive value (PPV).

\[ \mathrm{precision} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}} \]

from sklearn.metrics import precision_score

y_actual = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
y_pred   = [0, 1, 1, 0, 0, 0, 1, 1, 1, 1]

precision_score(y_actual, y_pred)
0.6666666666666666

Precision alone is not enough

y_actual = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
y_pred   = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]

precision_score(y_actual,y_pred)
1.0

Recall

AKA sensitivity or true positive rate (TPR).

\[ \mathrm{recall} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}} \]

from sklearn.metrics import recall_score

y_actual = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
y_pred   = [0, 1, 1, 0, 0, 0, 1, 1, 1, 1]

recall_score(y_actual,y_pred)
0.5714285714285714

F\(_1\) score

\[ \begin{align*} F_1~\mathrm{score} &= \frac{2}{\frac{1}{\mathrm{precision}}+\frac{1}{\mathrm{recall}}} = 2 \times \frac{\mathrm{precision}\times\mathrm{recall}}{\mathrm{precision}+\mathrm{recall}} \\ &= \frac{\mathrm{TP}}{\mathrm{TP}+\frac{\mathrm{FN}+\mathrm{FP}}{2}} \end{align*} \]

from sklearn.metrics import f1_score

y_actual = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
y_pred   = [0, 1, 1, 0, 0, 0, 1, 1, 1, 1]

f1_score(y_actual,y_pred)
0.6153846153846154

Micro Performance Metrics

Micro performance metrics aggregate the contributions of all classes (pooling the TP, FP, and FN counts) before computing the metric, such as precision, recall, or F1 score. This approach treats each individual prediction equally, so the result is dominated by the performance on the frequent classes.

Macro Performance Metrics

Macro performance metrics compute the performance metric independently for each class and then average these metrics. This approach treats each class equally, regardless of its frequency, providing an evaluation that equally considers performance across both frequent and infrequent classes.

Micro/macro metrics

from sklearn.metrics import ConfusionMatrixDisplay

# Sample data
y_true = ['Cat'] * 42 + ['Dog'] *  7 + ['Fox'] * 11
y_pred = ['Cat'] * 39 + ['Dog'] *  1 + ['Fox'] *  2 + \
         ['Cat'] *  4 + ['Dog'] *  3 + ['Fox'] *  0 + \
         ['Cat'] *  5 + ['Dog'] *  1 + ['Fox'] *  5

ConfusionMatrixDisplay.from_predictions(y_true, y_pred)

Micro/macro precision

from sklearn.metrics import classification_report, precision_score

print(classification_report(y_true, y_pred), "\n")

print("Micro precision: {:.2f}".format(precision_score(y_true, y_pred, average='micro')))
print("Macro precision: {:.2f}".format(precision_score(y_true, y_pred, average='macro')))
              precision    recall  f1-score   support

         Cat       0.81      0.93      0.87        42
         Dog       0.60      0.43      0.50         7
         Fox       0.71      0.45      0.56        11

    accuracy                           0.78        60
   macro avg       0.71      0.60      0.64        60
weighted avg       0.77      0.78      0.77        60
 

Micro precision: 0.78
Macro precision: 0.71

Macro-average precision is calculated as the mean of the precision scores for each class: \(\frac{0.81 + 0.60 + 0.71}{3} = 0.71\).

In contrast, the micro-average precision is calculated from the formula \(\frac{TP}{TP+FP}\) applied to the pooled counts of the entire confusion matrix: \(\frac{39+3+5}{39+3+5+9+2+2} = \frac{47}{60} = 0.78\).

Micro/macro recall

              precision    recall  f1-score   support

         Cat       0.81      0.93      0.87        42
         Dog       0.60      0.43      0.50         7
         Fox       0.71      0.45      0.56        11

    accuracy                           0.78        60
   macro avg       0.71      0.60      0.64        60
weighted avg       0.77      0.78      0.77        60
 

Micro recall: 0.78
Macro recall: 0.60

Macro-average recall is calculated as the mean of the recall scores for each class: \(\frac{0.93 + 0.43 + 0.45}{3} = 0.60\).

In contrast, the micro-average recall is calculated from the formula \(\frac{TP}{TP+FN}\) applied to the pooled counts of the entire confusion matrix: \(\frac{39+3+5}{39+3+5+3+4+6} = \frac{47}{60} = 0.78\).
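The recall values above can presumably be reproduced in the same way as the precision scores; a minimal sketch, assuming the same y_true and y_pred as before.

from sklearn.metrics import recall_score

print("Micro recall: {:.2f}".format(recall_score(y_true, y_pred, average='micro')))
print("Macro recall: {:.2f}".format(recall_score(y_true, y_pred, average='macro')))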

Micro/macro metrics (medical data)

Consider a medical dataset, such as one from diagnostic testing or imaging, comprising 990 normal samples and 10 abnormal (tumour) samples. These labels represent the ground truth.
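The report on the next slide can be reproduced with, for example, the following hypothetical predictions (the exact error counts, 5 false positives and 4 false negatives for the Tumour class, are an assumption consistent with the scores shown).

from sklearn.metrics import classification_report

# Ground truth: 990 normal samples followed by 10 abnormal (tumour) samples
y_true = ['Normal'] * 990 + ['Tumour'] * 10

# Hypothetical predictions: 5 normals flagged as tumours, 4 tumours missed
y_pred = ['Normal'] * 985 + ['Tumour'] * 5 + \
         ['Normal'] *   4 + ['Tumour'] * 6

print(classification_report(y_true, y_pred))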

Micro/macro metrics (medical data)

              precision    recall  f1-score   support

      Normal       1.00      0.99      1.00       990
      Tumour       0.55      0.60      0.57        10

    accuracy                           0.99      1000
   macro avg       0.77      0.80      0.78      1000
weighted avg       0.99      0.99      0.99      1000
 

Micro precision: 0.99
Macro precision: 0.77


Micro recall: 0.99
Macro recall: 0.80

Hand-written digits (revisited)

Loading the dataset

import numpy as np
np.random.seed(42)

from sklearn.datasets import fetch_openml

digits = fetch_openml('mnist_784', as_frame=False)
X, y = digits.data, digits.target

Plotting the first five examples

These images have dimensions of \(28 \times 28\) pixels.

Creating a binary classification task

# Creating a binary classification task (one vs the rest)

some_digit = X[0]
some_digit_y = y[0]

y = (y == some_digit_y)
y
array([ True, False, False, ..., False,  True, False])
# Creating the training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

SGDClassifier

from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()
clf.fit(X_train, y_train)

clf.predict(X[0:5]) # small sanity check
array([ True, False, False, False, False])

Performance

from sklearn.metrics import accuracy_score

y_pred = clf.predict(X_test)

accuracy_score(y_test, y_pred)
0.9572857142857143

Wow!

Not so fast
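The baseline below relies on a dummy classifier, which is not defined on the slide; here is a minimal sketch (the most_frequent strategy is an assumption consistent with the accuracy reported).

from sklearn.dummy import DummyClassifier

# Baseline that always predicts the majority class (here, "not some_digit")
dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(X_train, y_train)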

y_pred = dummy_clf.predict(X_test)

accuracy_score(y_test, y_pred)
0.906

Precision-recall trade-off

Precision-recall trade-off

Precision/Recall curve
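A minimal sketch of how such a curve can be computed for the binary MNIST classifier above, using out-of-fold decision scores (the 3-fold setup is an assumption).

import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_recall_curve

# Out-of-fold decision scores for the SGDClassifier trained above
y_scores = cross_val_predict(clf, X_train, y_train, cv=3,
                             method="decision_function")

precisions, recalls, thresholds = precision_recall_curve(y_train, y_scores)

# Precision and recall as functions of the decision threshold
plt.plot(thresholds, precisions[:-1], label="precision")
plt.plot(thresholds, recalls[:-1], label="recall")
plt.xlabel("decision threshold")
plt.legend()
plt.show()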

ROC curve

Receiver Operating Characteristics (ROC) curve

  • True positive rate (TPR) against false positive rate (FPR)
  • An ideal classifier has TPR close to 1.0 and FPR close to 0.0
  • \(\mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}\) (recall, sensitivity)
  • TPR approaches one when the number of false negative predictions is low
  • \(\mathrm{FPR} = \frac{\mathrm{FP}}{\mathrm{FP}+\mathrm{TN}}\) (aka 1 − specificity)
  • FPR approaches zero when the number of false positives is low (see the sketch below)
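A minimal sketch plotting the ROC curve for the same classifier, reusing the out-of-fold decision scores from the precision/recall sketch above.

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

fpr, tpr, thresholds = roc_curve(y_train, y_scores)
auc = roc_auc_score(y_train, y_scores)

plt.plot(fpr, tpr, label=f"SGDClassifier (AUC = {auc:.2f})")
plt.plot([0, 1], [0, 1], "k--", label="random classifier")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate (recall)")
plt.legend()
plt.show()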

ROC curve

AUC/ROC

The 7 steps of machine learning

Prologue

Further reading

Next lecture

  • We will examine cross-validation and hyperparameter tuning.

References

Chollet, François. 2017. Deep Learning with Python. Manning Publications.
Géron, Aurélien. 2022. Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow. 3rd ed. O’Reilly Media, Inc.
Hastie, Trevor, Robert Tibshirani, and Jerome H. Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edition. Springer Series in Statistics. Springer. https://doi.org/10.1007/978-0-387-84858-7.
Japkowicz, Nathalie, and Mohak Shah. 2011. Evaluating Learning Algorithms: A Classification Perspective. Cambridge: Cambridge University Press. http://assets.cambridge.org/97805211/96000/cover/9780521196000.jpg.
Knowler, William C., David J. Pettitt, Peter J. Savage, and Peter H. Bennett. 1981. “Diabetes Incidence in Pima Indians: Contributions of Obesity and Parental Diabetes.” American Journal of Epidemiology 113 2: 144–56. https://api.semanticscholar.org/CorpusID:25209675.
Russell, Stuart, and Peter Norvig. 2020. Artificial Intelligence: A Modern Approach. 4th ed. Pearson. http://aima.cs.berkeley.edu/.

Marcel Turcotte

Marcel.Turcotte@uOttawa.ca

School of Electrical Engineering and Computer Science (EECS)

University of Ottawa