Performance Evaluation

CSI 5180 - Machine Learning for Bioinformatics

Marcel Turcotte

Version: Feb 15, 2025 11:00

Preamble

Quote of the Day (1/3)

The American Society for Microbiology altered its website to remove references to diversity and equity, and temporarily removed articles about scientists from under-represented groups — raising an outcry from some of its members. The organization’s president says it was following legal advice in the hope of protecting its federally funded programmes from the impact of wide-ranging executive orders issued by President Donald Trump, which banned federal funding related to topics including diversity, equity and inclusion.

Quote of the Day (2/3)

Quote of the Day (3/3)

Summary

This lecture covers classification model evaluation, focusing on confusion matrices and key metrics: accuracy, precision, recall, and F₁ score. It addresses accuracy’s limitations on imbalanced datasets, introducing micro and macro averaging. The precision-recall trade-off and ROC analysis, including AUC, are also explored. Practical insights are provided through Python implementations, such as logistic regression trained by gradient descent.

Learning Outcomes

  • Describe the structure and role of the confusion matrix in model evaluation.
  • Compute and interpret accuracy, precision, recall, and \(F_1\) score.
  • Identify the pitfalls of using accuracy with imbalanced datasets.
  • Differentiate between micro and macro averaging for performance metrics.
  • Analyze precision-recall trade-offs and construct ROC curves, including the calculation of AUC.
  • Implement the calculation of ROC curves and AUC in Python.

On Performance Measures

  • Sokolova, M. & Lapalme, G. (2009). A systematic analysis of performance measures for classification tasks. Information Processing and Management, 45(4), 427–437.
    • Scopus: 4,222 citations
    • Google Scholar: 6,839 citations

Evaluating Learning Algorithms

  • This book, rated 4.6 stars on Amazon, delves into the evaluation process, with a particular focus on classification algorithms (Japkowicz and Shah 2011).

  • Nathalie Japkowicz previously served as a professor at the University of Ottawa and is currently affiliated with American University in Washington.

  • Mohak Shah, who earned his PhD from the University of Ottawa, has held numerous industry roles, including Vice President of AI and Machine Learning at LG Electronics.

Performance Metrics

Confusion Matrix

                      Positive (Predicted)    Negative (Predicted)
Positive (Actual)     True positive (TP)      False negative (FN)
Negative (Actual)     False positive (FP)     True negative (TN)

Confusion Matrix

Given a test set with \(N\) examples and a classifier \(h(x):\)

\[ C_{i,j} = \sum_{k = 1}^N [y_k = i \wedge h(x_k) = j] \]

where \(C\) is an \(l \times l\) matrix for a dataset with \(l\) classes, and \([\cdot]\) is the indicator (Iverson) bracket, equal to 1 when the condition holds and 0 otherwise.
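The definition above translates directly into a few lines of NumPy. The sketch below is illustrative (it assumes integer class labels \(0, \dots, l-1\)) and reproduces the output of sklearn.metrics.confusion_matrix shown later in this lecture.

import numpy as np

def confusion_matrix_from_definition(y_actual, y_pred, l):
    # C[i, j] counts the examples of actual class i predicted as class j
    C = np.zeros((l, l), dtype=int)
    for i, j in zip(y_actual, y_pred):
        C[i, j] += 1
    return C

y_actual = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
y_pred   = [0, 1, 1, 0, 0, 0, 1, 1, 1, 1]

confusion_matrix_from_definition(y_actual, y_pred, 2)
array([[1, 2],
       [3, 4]])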

Confusion Matrix

  • The total number of examples of the (actual) class \(i\) is \[ C_{i \cdot} = \sum_{j=1}^l C_{i,j} \]

  • The total number of examples assigned to the (predicted) class \(j\) by classifier \(h\) is \[ C_{\cdot j} = \sum_{i=1}^l C_{i,j} \]

Confusion Matrix

  • Terms on the diagonal denote the total number of examples classified correctly by classifier \(h\). Hence, the number of correctly classified examples is \[ \sum_{i=1}^l C_{i,i} \]

  • Non-diagonal terms represent misclassifications.

Confusion Matrix - Multi-Class

To evaluate performance in a multi-class setting, one typically derives “one-vs-all” metrics for each class from the confusion matrix. These metrics are then averaged using specific weighting schemes.

Confusion Matrix - Multi-Class

Confusion Matrix - True Positive

Confusion Matrix - False Positive

Confusion Matrix - False Negative

Confusion Matrix - True Negative

Confusion Matrix - Multi-Class

Multi-Class

To evaluate performance in a multi-class setting, one typically derives “one-vs-all” metrics for each class from the confusion matrix. These metrics are then averaged using specific weighting schemes.

  • True Positives (\(\mathrm{TP}_i\)): Diagonal entry \(C_{i,i}\).
  • False Positives (\(\mathrm{FP}_i\)): Sum of column \(i\) excluding \(C_{i,i}\).
  • False Negatives (\(\mathrm{FN}_i\)): Sum of row \(i\) excluding \(C_{i,i}\) .
  • True Negatives (\(\mathrm{TN}_i\)): \(N - (\mathrm{TP}_i + \mathrm{FP}_i + \mathrm{FN}_i)\)

Multi-Class

To evaluate performance in a multi-class setting, one typically derives “one-vs-all” metrics for each class from the confusion matrix. These metrics are then averaged using specific weighting schemes. A short NumPy sketch of these per-class counts follows the list below.

  • \(\mathrm{TP}_i = C_{i,i}\).
  • \(\mathrm{FP}_i = \sum_{k \ne i} C_{k,i}\).
  • \(\mathrm{FN}_i = \sum_{k \ne i} C_{i,k}\).
  • \(\mathrm{TN}_i = \sum_{j \ne i} \sum_{k \ne i} C_{j,k}\).
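A compact NumPy sketch of these per-class counts; the same quantities are implemented as separate helper functions (true_positive, false_positive, …) later in this lecture.

import numpy as np

def one_vs_all_counts(C, i):
    # Per-class TP, FP, FN, TN for class i of an l x l confusion matrix C
    TP = C[i, i]
    FP = C[:, i].sum() - C[i, i]   # column i minus the diagonal entry
    FN = C[i, :].sum() - C[i, i]   # row i minus the diagonal entry
    TN = C.sum() - (TP + FP + FN)  # everything else
    return TP, FP, FN, TN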

sklearn.metrics.confusion_matrix

from sklearn.metrics import confusion_matrix

y_actual = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
y_pred   = [0, 1, 1, 0, 0, 0, 1, 1, 1, 1]

confusion_matrix(y_actual,y_pred)
array([[1, 2],
       [3, 4]])
tn, fp, fn, tp = confusion_matrix(y_actual, y_pred).ravel()
(tn, fp, fn, tp)
(1, 2, 3, 4)

Perfect Prediction

y_actual = [0, 1, 0, 0, 1, 1, 1, 0, 1, 1]
y_pred   = [0, 1, 0, 0, 1, 1, 1, 0, 1, 1]

confusion_matrix(y_actual,y_pred)
array([[4, 0],
       [0, 6]])
tn, fp, fn, tp = confusion_matrix(y_actual, y_pred).ravel()    
(tn, fp, fn, tp)
(4, 0, 0, 6)

Confusion Matrix - Multiple Classes

Code
from sklearn.datasets import load_digits

import numpy as np
np.random.seed(42)

digits = load_digits()

X = digits.data
y = digits.target

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)

from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

clf = OneVsRestClassifier(LogisticRegression())

clf = clf.fit(X_train, y_train)

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

X_test = scaler.transform(X_test)
y_pred = clf.predict(X_test)

ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
plt.show()

Visualizing errors

mask = (y_test == 9) & (y_pred == 8)

X_9_as_8 = X_test[mask]

y_9_as_8 = y_test[mask]
Code
import numpy as np
np.random.seed(42)

from sklearn.datasets import load_digits
digits = load_digits()

X = digits.data
y = digits.target

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)

from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

clf = OneVsRestClassifier(LogisticRegression())

clf = clf.fit(X_train, y_train)

X_test = scaler.transform(X_test)
y_pred = clf.predict(X_test)

mask = (y_test == 9) & (y_pred == 8)

X_9_as_8 = X_test[mask]

y_9_as_8 = y_test[mask]

import matplotlib.pyplot as plt

plt.figure(figsize=(4,2))

for index, (image, label) in enumerate(zip(X_9_as_8, y_9_as_8)):
    plt.subplot(1, len(X_9_as_8), index + 1)
    plt.imshow(np.reshape(image, (8,8)), cmap=plt.cm.gray)
    plt.title(f'y = {label}')

Confusion Matrix - Multiple Classes

Accuracy

How accurate is this result?

\[ \mathrm{accuracy} = \frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN}} = \frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{N}} \]

from sklearn.metrics import accuracy_score

y_actual = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
y_pred   = [0, 1, 1, 0, 0, 0, 1, 1, 1, 1]

accuracy_score(y_actual,y_pred)
0.5

Accuracy

y_actual = [0, 1, 0, 0, 1, 1, 1, 0, 1, 1]
y_pred   = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0]

accuracy_score(y_actual,y_pred)
0.0
y_actual = [0, 1, 0, 0, 1, 1, 1, 0, 1, 1]
y_pred   = [0, 1, 0, 0, 1, 1, 1, 0, 1, 1]

accuracy_score(y_actual,y_pred)
1.0

Accuracy can be misleading

y_actual = [0, 0, 0, 0, 1, 1, 0, 0, 0, 0]
y_pred   = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

accuracy_score(y_actual,y_pred)
0.8

Precision

AKA, positive predictive value (PPV).

\[ \mathrm{precision} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}} \]

from sklearn.metrics import precision_score

y_actual = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
y_pred   = [0, 1, 1, 0, 0, 0, 1, 1, 1, 1]

precision_score(y_actual, y_pred)
0.6666666666666666

Precision alone is not enough

y_actual = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
y_pred   = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]

precision_score(y_actual,y_pred)
1.0

Recall

AKA sensitivity or true positive rate (TPR) \[ \mathrm{recall} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}} \]

from sklearn.metrics import recall_score

y_actual = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
y_pred   = [0, 1, 1, 0, 0, 0, 1, 1, 1, 1]

recall_score(y_actual,y_pred)
0.5714285714285714

F\(_1\) score

\[ \begin{align*} F_1~\mathrm{score} &= \frac{2}{\frac{1}{\mathrm{precision}}+\frac{1}{\mathrm{recall}}} = 2 \times \frac{\mathrm{precision}\times\mathrm{recall}}{\mathrm{precision}+\mathrm{recall}} \\ &= \frac{\mathrm{TP}}{\mathrm{TP}+\frac{\mathrm{FN}+\mathrm{FP}}{2}} \end{align*} \]

from sklearn.metrics import f1_score

y_actual = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
y_pred   = [0, 1, 1, 0, 0, 0, 1, 1, 1, 1]

f1_score(y_actual,y_pred)
0.6153846153846154
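As a quick numerical check, the two equivalent forms of the \(F_1\) score agree on this example, where \(\mathrm{TP} = 4\), \(\mathrm{FP} = 2\), and \(\mathrm{FN} = 3\):

precision = 4 / (4 + 2)
recall = 4 / (4 + 3)

2 * precision * recall / (precision + recall)  # ≈ 0.6154
4 / (4 + (3 + 2) / 2)                          # ≈ 0.6154, the TP-based form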

Micro and Macro Averaging

Definition

The class imbalance problem is a scenario in which the number of instances in one class significantly outnumbers the number of instances in the other classes.

Models tend to be biased towards the majority class, leading to poor performance on the minority class.

Micro Performance Metrics

  • Micro performance metrics aggregate the contributions of all instances to compute average performance metrics like precision, recall, or F1 score.
  • This approach treats each individual prediction equally, regardless of its class, as it considers the total number of true positives, false positives, and false negatives across all classes.
  • Consequently, micro metrics are particularly sensitive to the performance on frequent classes because they are more numerous and thus have a greater influence on the overall metric.

Macro Performance Metrics

  • Macro performance metrics compute the performance metric independently for each class and then average these metrics.
  • This approach treats each class equally, regardless of its frequency, providing an evaluation that equally considers performance across both frequent and infrequent classes.
  • Consequently, macro metrics are less sensitive to the performance on frequent classes.

Multi-Class

When calculating precision, recall, and \(F_1\), one usually computes “one-vs-all” metrics for each class. These are then averaged using a weighting scheme (macro or micro).

  • True Positives (\(\mathrm{TP}_i\)): Diagonal entry \(C_{i,i}\).
  • False Positives (\(\mathrm{FP}_i\)): Sum of column \(i\) excluding \(C_{i,i}\).
  • False Negatives (\(\mathrm{FN}_i\)): Sum of row \(i\) excluding \(C_{i,i}\) .
  • True Negatives (\(\mathrm{TN}_i\)): \(N - (\mathrm{TP}_i + \mathrm{FP}_i + \mathrm{FN}_i)\)

Multi-Class

When calculating precision, recall, and \(F_1\), one usually computes “one-vs-all” metrics for each class. These are then averaged using a weighting scheme (macro or micro).

  • \(\mathrm{TP}_i = C_{i,i}\).
  • \(\mathrm{FP}_i = \sum_{k \ne i} C_{k,i}\).
  • \(\mathrm{FN}_i = \sum_{k \ne i} C_{i,k}\).
  • \(\mathrm{TN}_i = \sum_{j \ne i} \sum_{k \ne i} C_{j,k}\).

Micro/Macro Metrics

from sklearn.metrics import ConfusionMatrixDisplay

# Sample data
y_true = ['Cat'] * 42 + ['Dog'] *  7 + ['Fox'] * 11
y_pred = ['Cat'] * 39 + ['Dog'] *  1 + ['Fox'] *  2 + \
         ['Cat'] *  4 + ['Dog'] *  3 + ['Fox'] *  0 + \
         ['Cat'] *  5 + ['Dog'] *  1 + ['Fox'] *  5

ConfusionMatrixDisplay.from_predictions(y_true, y_pred)

Micro/Macro Precision

from sklearn.metrics import classification_report, precision_score

print(classification_report(y_true, y_pred), "\n")

print("Micro precision: {:.2f}".format(precision_score(y_true, y_pred, average='micro')))
print("Macro precision: {:.2f}".format(precision_score(y_true, y_pred, average='macro')))
              precision    recall  f1-score   support

         Cat       0.81      0.93      0.87        42
         Dog       0.60      0.43      0.50         7
         Fox       0.71      0.45      0.56        11

    accuracy                           0.78        60
   macro avg       0.71      0.60      0.64        60
weighted avg       0.77      0.78      0.77        60
 

Micro precision: 0.78
Macro precision: 0.71

Micro/Macro Precision

  • Macro-average precision is calculated as the mean of the per-class precision scores: \(\frac{0.81 + 0.60 + 0.71}{3} = 0.71\).

  • In contrast, the micro-average precision applies the formula \(\frac{TP}{TP+FP}\) to the counts pooled over the entire confusion matrix: \(\frac{39+3+5}{39+3+5+9+2+2} = \frac{47}{60} = 0.78\).

Micro/Macro Recall

              precision    recall  f1-score   support

         Cat       0.81      0.93      0.87        42
         Dog       0.60      0.43      0.50         7
         Fox       0.71      0.45      0.56        11

    accuracy                           0.78        60
   macro avg       0.71      0.60      0.64        60
weighted avg       0.77      0.78      0.77        60
 

Micro recall: 0.78
Macro recall: 0.60

Micro/Macro Recall

  • Macro-average recall is calculated as the mean of the recall scores for each class: \(\frac{0.93 + 0.43 + 0.45}{3} = 0.60\).

  • In contrast, the micro-average recall applies the formula \(\frac{TP}{TP+FN}\) to the counts pooled over the entire confusion matrix: \(\frac{39+3+5}{39+3+5+3+4+6} = \frac{47}{60} = 0.78\). The sketch below verifies both averages with scikit-learn.
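These hand calculations can be checked with scikit-learn, reusing the y_true and y_pred lists from the Cat/Dog/Fox example above; the printed values are approximate.

from sklearn.metrics import precision_score, recall_score

precision_score(y_true, y_pred, average='micro')  # ≈ 0.78
precision_score(y_true, y_pred, average='macro')  # ≈ 0.71
recall_score(y_true, y_pred, average='micro')     # ≈ 0.78
recall_score(y_true, y_pred, average='macro')     # ≈ 0.60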

Example

Using the 20 newsgroups text dataset from scikit-learn.org.

Comprises around 18,000 newsgroups posts on 20 topics.

Code
## https://scikit-learn.org/stable/auto_examples/text/plot_document_classification_20newsgroups.html

from time import time

## Load Dataset

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

categories = [
    "alt.atheism",
    "talk.religion.misc",
    "comp.graphics",
    "sci.space",
]

def size_mb(docs):
    return sum(len(s.encode("utf-8")) for s in docs) / 1e6

def load_dataset(verbose=False, remove=()):
    """Load and vectorize the 20 newsgroups dataset."""

    data_train = fetch_20newsgroups(
        subset="train",
        categories=categories,
        shuffle=True,
        random_state=42,
        remove=remove,
    )

    data_test = fetch_20newsgroups(
        subset="test",
        categories=categories,
        shuffle=True,
        random_state=42,
        remove=remove,
    )

    # order of labels in `target_names` can be different from `categories`
    target_names = data_train.target_names

    # split target in a training set and a test set
    y_train, y_test = data_train.target, data_test.target

    # Extracting features from the training data using a sparse vectorizer
    t0 = time()
    vectorizer = TfidfVectorizer(
        sublinear_tf=True, max_df=0.5, min_df=5, stop_words="english"
    )
    X_train = vectorizer.fit_transform(data_train.data)
    duration_train = time() - t0

    # Extracting features from the test data using the same vectorizer
    t0 = time()
    X_test = vectorizer.transform(data_test.data)
    duration_test = time() - t0

    feature_names = vectorizer.get_feature_names_out()

    if verbose:
        # compute size of loaded data
        data_train_size_mb = size_mb(data_train.data)
        data_test_size_mb = size_mb(data_test.data)

        # print(
        #     f"{len(data_train.data)} documents - "
        #     f"{data_train_size_mb:.2f}MB (training set)"
        # )
        # print(f"{len(data_test.data)} documents - {data_test_size_mb:.2f}MB (test set)")
        # print(f"{len(target_names)} categories")
        # print(
        #     f"vectorize training done in {duration_train:.3f}s "
        #     f"at {data_train_size_mb / duration_train:.3f}MB/s"
        # )
        # print(f"n_samples: {X_train.shape[0]}, n_features: {X_train.shape[1]}")
        # print(
        #     f"vectorize testing done in {duration_test:.3f}s "
        #     f"at {data_test_size_mb / duration_test:.3f}MB/s"
        # )
        # print(f"n_samples: {X_test.shape[0]}, n_features: {X_test.shape[1]}")

    return X_train, X_test, y_train, y_test, feature_names, target_names

X_train, X_test, y_train, y_test, feature_names, target_names = load_dataset(
    verbose=True
)

## Training and Prediction

from sklearn.linear_model import RidgeClassifier

clf = RidgeClassifier(tol=1e-2, solver="sparse_cg")
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

## Display the Confusion Matrix

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

fig, ax = plt.subplots(figsize=(10, 5))
ConfusionMatrixDisplay.from_predictions(y_test, y_pred, ax=ax)
ax.xaxis.set_ticklabels(target_names)
ax.yaxis.set_ticklabels(target_names)
_ = ax.set_title(
    f"Confusion Matrix for {clf.__class__.__name__}"
)

Example

Example

cm = confusion_matrix(y_test, y_pred)

TP, FP, FN, TN

def true_positive(cm, i):
    return cm[i,i] # diagonal entry i,i

def false_positive(cm, i):
    return np.sum(cm[:, i]) - cm[i,i] # col - TP_i

def false_negative(cm, i):
    return np.sum(cm[i, :]) - cm[i,i] # row - TP_i

def true_negative(cm, i):
  N = cm.sum()
  TP = true_positive(cm, i)
  FP = false_positive(cm, i)
  FN = false_negative(cm, i)
  return N - (TP + FP + FN)

Precision

def precision_micro(cm):
    _, l = cm.shape
    tp = fp = 0
    for i in range(l):
        tp += true_positive(cm, i)
        fp += false_positive(cm, i)
    return tp / (tp+fp)

def precision_macro(cm):
    _, l = cm.shape
    precision = 0
    for i in range(l):
        tp = true_positive(cm, i)
        fp = false_positive(cm, i)
        precision += tp/(tp+fp)
    return precision/l

Precision Micro Average

\[ \frac{(258+380+371+199)}{(258+380+371+199)+(40+38+22+45)} \] where

  • 40 = 2 + 1 + 37
  • 38 = 7 + 22 + 9
  • 22 = 12 + 4 + 6
  • 45 = 42 + 3 + 0

Precision Macro Average

  • \(\mathrm{Precision}_0 = \frac{258}{258+(2+1+37)} = 0.8657718121\)
  • \(\mathrm{Precision}_1 = \frac{380}{380+(7+22+9)} = 0.9090909091\)
  • \(\mathrm{Precision}_2 = \frac{371}{371+(12+4+6)} = 0.9440203562\)
  • \(\mathrm{Precision}_3 = \frac{199}{199+(42+3+0)} = 0.8155737705\)

\(\mathrm{Precision}_{\mathrm{macro}} = \frac{0.8657718121 + 0.9090909091 + 0.9440203562 + 0.8155737705}{4} \approx 0.8836\)

Recall

def recall_micro(cm):
    _, l = cm.shape
    tp = fn = 0
    for i in range(l):
        tp += true_positive(cm, i)
        fn += false_negative(cm, i)
    return tp / (tp+fn)

def recall_macro(cm):
    _, l = cm.shape
    recall = 0
    for i in range(l):
        tp = true_positive(cm, i)
        fn = false_negative(cm, i)
        recall += tp / (tp+fn)
    return recall/l
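For completeness, micro- and macro-averaged \(F_1\) can be derived from the same helper functions. This is a minimal sketch; it assumes every class has at least one actual and one predicted example, so no division by zero occurs.

def f1_micro(cm):
    p = precision_micro(cm)
    r = recall_micro(cm)
    return 2 * p * r / (p + r)

def f1_macro(cm):
    _, l = cm.shape
    f1 = 0
    for i in range(l):
        tp = true_positive(cm, i)
        fp = false_positive(cm, i)
        fn = false_negative(cm, i)
        p = tp / (tp + fp)   # per-class precision
        r = tp / (tp + fn)   # per-class recall
        f1 += 2 * p * r / (p + r)
    return f1 / l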

Micro/Macro Metrics (Medical Data)

Consider a medical dataset, such as those involving diagnostic tests or imaging, comprising 990 normal samples and 10 abnormal (tumor) samples. This represents the ground truth.

Micro/macro metrics (medical data)

              precision    recall  f1-score   support

      Normal       1.00      0.99      1.00       990
      Tumour       0.55      0.60      0.57        10

    accuracy                           0.99      1000
   macro avg       0.77      0.80      0.78      1000
weighted avg       0.99      0.99      0.99      1000
 

Micro precision: 0.99
Macro precision: 0.77


Micro recall: 0.99
Macro recall: 0.80
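The report above can be reproduced with a labelling that is consistent with its rounded figures. The counts below (6 true positives, 4 false negatives, and 5 false positives for the tumour class) are an illustrative assumption chosen to match the report, not the original data.

from sklearn.metrics import classification_report

y_true = ['Normal'] * 990 + ['Tumour'] * 10
y_pred = ['Normal'] * 985 + ['Tumour'] * 5 + ['Normal'] * 4 + ['Tumour'] * 6

print(classification_report(y_true, y_pred))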

Precision-Recall Trade-Off

Hand-Written Digits (Revisited)

Loading the dataset

import numpy as np
np.random.seed(42)

from sklearn.datasets import fetch_openml

digits = fetch_openml('mnist_784', as_frame=False)
X, y = digits.data, digits.target

Plotting the first five examples

These images have dimensions of \(28 \times 28\) pixels.

Creating a Binary Classification Task

# Creating a binary classification task (one vs the rest)

some_digit = X[0]
some_digit_y = y[0]

y = (y == some_digit_y)
y
array([ True, False, False, ..., False,  True, False])
# Creating the training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

SGDClassifier

from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()
clf.fit(X_train, y_train)

clf.predict(X[0:5]) # small sanity check
array([ True, False, False, False, False])

Performance

from sklearn.metrics import accuracy_score

y_pred = clf.predict(X_test)

accuracy_score(y_test, y_pred)
0.9572857142857143

Wow!

Not so Fast

from sklearn.dummy import DummyClassifier

dummy_clf = DummyClassifier()

dummy_clf.fit(X_train, y_train)
y_pred = dummy_clf.predict(X_test)

accuracy_score(y_test, y_pred)
0.906
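The dummy classifier reaches 90.6% accuracy only because roughly 90% of the test examples belong to the negative class; its precision, recall, and \(F_1\) on the positive class expose what accuracy hides. Here y_pred holds the dummy classifier's predictions from the cell above.

from sklearn.metrics import precision_score, recall_score, f1_score

# The dummy classifier always predicts the majority class (False),
# so it never produces a true positive.
precision_score(y_test, y_pred, zero_division=0)  # 0.0
recall_score(y_test, y_pred)                      # 0.0
f1_score(y_test, y_pred)                          # 0.0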

Precision-Recall Trade-Off

Precision-Recall Trade-Off

Code
from sklearn.model_selection import cross_val_predict
y_scores = cross_val_predict(clf, X_train, y_train, cv=3, method="decision_function")

from sklearn.metrics import precision_recall_curve

precisions, recalls, thresholds = precision_recall_curve(y_train, y_scores)

threshold = 3000

plt.figure(figsize=(8, 4))  # extra code – it's not needed, just formatting
plt.plot(thresholds, precisions[:-1], "b--", label="Precision", linewidth=2)
plt.plot(thresholds, recalls[:-1], "g-", label="Recall", linewidth=2)
plt.vlines(threshold, 0, 1.0, "k", "dotted", label="threshold")

# extra code – this section just beautifies and saves Figure 3–5
idx = (thresholds >= threshold).argmax()  # first index ≥ threshold
plt.plot(thresholds[idx], precisions[idx], "bo")
plt.plot(thresholds[idx], recalls[idx], "go")
plt.axis([-50000, 50000, 0, 1])
plt.grid()
plt.xlabel("Threshold")
plt.legend(loc="center right")

plt.show()

Precision/Recall Curve

Code
import matplotlib.patches as patches  # extra code – for the curved arrow

plt.figure(figsize=(5, 5))  # extra code – not needed, just formatting

plt.plot(recalls, precisions, linewidth=2, label="Precision/Recall Curve")

# extra code – just beautifies and saves Figure 3–6
plt.plot([recalls[idx], recalls[idx]], [0., precisions[idx]], "k:")
plt.plot([0.0, recalls[idx]], [precisions[idx], precisions[idx]], "k:")
plt.plot([recalls[idx]], [precisions[idx]], "ko",
         label="Point at threshold 3,000")
plt.gca().add_patch(patches.FancyArrowPatch(
    (0.79, 0.60), (0.61, 0.78),
    connectionstyle="arc3,rad=.2",
    arrowstyle="Simple, tail_width=1.5, head_width=8, head_length=10",
    color="#444444"))
plt.text(0.56, 0.62, "Higher\nthreshold", color="#333333")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.axis([0, 1, 0, 1])
plt.grid()
plt.legend(loc="lower left")

plt.show()

ROC Curve

ROC Curve

Receiver Operating Characteristics (ROC) curve

  • True positive rate (TPR) against false positive rate (FPR)
  • An ideal classifier has TPR close to 1.0 and FPR close to 0.0
  • \(\mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}\) (recall, sensitivity)
  • TPR approaches one when the number of false negative predictions is low
  • \(\mathrm{FPR} = \frac{\mathrm{FP}}{\mathrm{FP}+\mathrm{TN}}\) (also known as 1 − specificity)
  • FPR approaches zero when the number of false positives is low (see the sketch below)
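A minimal example with sklearn.metrics.roc_curve, using four made-up scores, shows how the (FPR, TPR) pairs are traced out as the decision threshold varies. Toy variable names are used so the running MNIST example above is not overwritten.

import numpy as np
from sklearn.metrics import roc_curve

toy_y      = np.array([0, 0, 1, 1])
toy_scores = np.array([0.1, 0.4, 0.35, 0.8])

toy_fpr, toy_tpr, toy_thresholds = roc_curve(toy_y, toy_scores)
toy_fpr  # array([0. , 0. , 0.5, 0.5, 1. ])
toy_tpr  # array([0. , 0.5, 0.5, 1. , 1. ])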

ROC Curve

ROC Curve

Code
idx_for_90_precision = (precisions >= 0.90).argmax()
threshold_for_90_precision = thresholds[idx_for_90_precision]
y_train_pred_90 = (y_scores >= threshold_for_90_precision)

from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_train, y_scores)

idx_for_threshold_at_90 = (thresholds <= threshold_for_90_precision).argmax()
tpr_90, fpr_90 = tpr[idx_for_threshold_at_90], fpr[idx_for_threshold_at_90]

plt.figure(figsize=(5, 5))  # extra code – not needed, just formatting
plt.plot(fpr, tpr, linewidth=2, label="ROC curve")
plt.plot([0, 1], [0, 1], 'k:', label="Random classifier's ROC curve")
plt.plot([fpr_90], [tpr_90], "ko", label="Threshold for 90% precision")

# extra code – just beautifies and saves Figure 3–7
plt.gca().add_patch(patches.FancyArrowPatch(
    (0.20, 0.89), (0.07, 0.70),
    connectionstyle="arc3,rad=.4",
    arrowstyle="Simple, tail_width=1.5, head_width=8, head_length=10",
    color="#444444"))
plt.text(0.12, 0.71, "Higher\nthreshold", color="#333333")
plt.xlabel('False Positive Rate (Fall-Out)')
plt.ylabel('True Positive Rate (Recall)')
plt.grid()
plt.axis([0, 1, 0, 1])
plt.legend(loc="lower right", fontsize=13)

plt.show()

Dataset - openml

OpenML is an open platform for sharing datasets, algorithms, and experiments - to learn how to learn better, together.

import numpy as np
np.random.seed(42)

from sklearn.datasets import fetch_openml

diabetes = fetch_openml(name='diabetes', version=1)
print(diabetes.DESCR)

Dataset - openml

Author: Vincent Sigillito

Source: Obtained from UCI

Please cite: UCI citation policy

  1. Title: Pima Indians Diabetes Database

  2. Sources:

    1. Original owners: National Institute of Diabetes and Digestive and Kidney Diseases
    2. Donor of database: Vincent Sigillito (vgs@aplcen.apl.jhu.edu) Research Center, RMI Group Leader Applied Physics Laboratory The Johns Hopkins University Johns Hopkins Road Laurel, MD 20707 (301) 953-6231
    3. Date received: 9 May 1990
  3. Past Usage:

    1. Smith, J.W., Everhart, J.E., Dickson, W.C., Knowler, W.C., & Johannes, R.S. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Symposium on Computer Applications and Medical Care (pp. 261–265). IEEE Computer Society Press.

      The diagnostic, binary-valued variable investigated is whether the patient shows signs of diabetes according to World Health Organization criteria (i.e., if the 2 hour post-load plasma glucose was at least 200 mg/dl at any survey examination or if found during routine medical care). The population lives near Phoenix, Arizona, USA.

      Results: Their ADAP algorithm makes a real-valued prediction between 0 and 1. This was transformed into a binary decision using a cutoff of 0.448. Using 576 training instances, the sensitivity and specificity of their algorithm was 76% on the remaining 192 instances.

  4. Relevant Information: Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage. ADAP is an adaptive learning routine that generates and executes digital analogs of perceptron-like devices. It is a unique algorithm; see the paper for details.

  5. Number of Instances: 768

  6. Number of Attributes: 8 plus class

  7. For Each Attribute: (all numeric-valued)

    1. Number of times pregnant
    2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
    3. Diastolic blood pressure (mm Hg)
    4. Triceps skin fold thickness (mm)
    5. 2-Hour serum insulin (mu U/ml)
    6. Body mass index (weight in kg/(height in m)^2)
    7. Diabetes pedigree function
    8. Age (years)
    9. Class variable (0 or 1)
  8. Missing Attribute Values: None

  9. Class Distribution: (class value 1 is interpreted as “tested positive for diabetes”)

    Class Value    Number of instances
    0              500
    1              268

  10. Brief statistical analysis:

    Attribute number: Mean: Standard Deviation:

    1.                 3.8     3.4
    2.               120.9    32.0
    3.                69.1    19.4
    4.                20.5    16.0
    5.                79.8   115.2
    6.                32.0     7.9
    7.                 0.5     0.3
    8.                33.2    11.8

Relabeled values in attribute ‘class’ From: 0 To: tested_negative
From: 1 To: tested_positive

Downloaded from openml.org.

Pima Indians Diabetes Dataset

from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

# Load the Pima Indians Diabetes dataset
pima = fetch_openml(name='diabetes', version=1, as_frame=True)

# Extract the features and target
X = pima.data
y = pima.target

# Convert target labels 'tested_negative' and 'tested_positive' to 0 and 1
y = y.map({'tested_negative': 0, 'tested_positive': 1})

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Comparing Multiple Classifiers

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

Comparing Multiple Classifiers

lr = LogisticRegression()
lr.fit(X_train, y_train)

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)

rf = RandomForestClassifier()
rf.fit(X_train, y_train)

AUC/ROC

Code
from sklearn.metrics import roc_auc_score, roc_curve

y_pred_prob_lr = lr.predict_proba(X_test)[:, 1]
y_pred_prob_knn = knn.predict_proba(X_test)[:, 1]
y_pred_prob_dt = dt.predict_proba(X_test)[:, 1]
y_pred_prob_rf = rf.predict_proba(X_test)[:, 1]

# Compute ROC curves
fpr_lr, tpr_lr, _ = roc_curve(y_test, y_pred_prob_lr)
fpr_knn, tpr_knn, _ = roc_curve(y_test, y_pred_prob_knn)
fpr_dt, tpr_dt, _ = roc_curve(y_test, y_pred_prob_dt)
fpr_rf, tpr_rf, _ = roc_curve(y_test, y_pred_prob_rf)

# Compute AUC scores
auc_lr = roc_auc_score(y_test, y_pred_prob_lr)
auc_knn = roc_auc_score(y_test, y_pred_prob_knn)
auc_dt = roc_auc_score(y_test, y_pred_prob_dt)
auc_rf = roc_auc_score(y_test, y_pred_prob_rf)

# Plot ROC curves
plt.figure(figsize=(5, 5)) # plt.figure()
plt.plot(fpr_lr, tpr_lr, color='blue', label=f'Logistic Regression (AUC = {auc_lr:.2f})')
plt.plot(fpr_knn, tpr_knn, color='green', label=f'K-Nearest Neighbors (AUC = {auc_knn:.2f})')
plt.plot(fpr_dt, tpr_dt, color='orange', label=f'Decision Tree (AUC = {auc_dt:.2f})')
plt.plot(fpr_rf, tpr_rf, color='purple', label=f'Random Forest (AUC = {auc_rf:.2f})')
plt.plot([0, 1], [0, 1], color='red', linestyle='--')  # Diagonal line for random chance
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves for Logistic Regression, KNN, Decision Tree, and Random Forest')
plt.legend(loc="lower right")
plt.show()

Implementation: Logistic Regression

Below is our implementation of logistic regression.

Code
def sigmoid(z):
    """Compute the sigmoid function."""
    return 1 / (1 + np.exp(-z))

def cost_function(theta, X, y):
    """
    Compute the binary cross-entropy cost.
    theta: parameter vector
    X: feature matrix (each row is an example)
    y: true binary labels (0 or 1)
    """
    m = len(y)
    h = sigmoid(X.dot(theta))
    # Add a small epsilon to avoid log(0)
    epsilon = 1e-5
    cost = -(1/m) * np.sum(y * np.log(h + epsilon) + (1 - y) * np.log(1 - h + epsilon))
    return cost

def gradient(theta, X, y):
    """Compute the gradient of the cost with respect to theta."""
    m = len(y)
    h = sigmoid(X.dot(theta))
    return (1/m) * X.T.dot(h - y)

def logistic_regression(X, y, learning_rate=0.1, iterations=1000):
    """
    Train logistic regression using gradient descent.
    Returns the optimized parameter vector theta and the history of cost values.
    """
    m, n = X.shape
    theta = np.zeros(n)
    cost_history = []
    for i in range(iterations):
        theta -= learning_rate * gradient(theta, X, y)
        cost_history.append(cost_function(theta, X, y))
    return theta, cost_history

def predict_probabilities(theta, X):
    """Return predicted probabilities for the positive class."""
    return sigmoid(X.dot(theta))

Implementation: ROC

def compute_roc_curve(y_true, y_scores, thresholds):
    """Compute the (FPR, TPR) pair for each decision threshold."""
    tpr_list, fpr_list = [], []
    for thresh in thresholds:
        # Classify as positive if predicted probability >= threshold
        y_pred = (y_scores >= thresh).astype(int)
        TP = np.sum((y_true == 1) & (y_pred == 1))
        FN = np.sum((y_true == 1) & (y_pred == 0))
        FP = np.sum((y_true == 0) & (y_pred == 1))
        TN = np.sum((y_true == 0) & (y_pred == 0))
        TPR = TP / (TP + FN) if (TP + FN) > 0 else 0
        FPR = FP / (FP + TN) if (FP + TN) > 0 else 0
        tpr_list.append(TPR)
        fpr_list.append(FPR)
    # TPR and FPR both decrease monotonically as the threshold increases,
    # so sorting each list in ascending order simply reverses the curve
    # while keeping the (FPR, TPR) pairs aligned.
    tpr_list.sort()
    fpr_list.sort()
    return np.array(fpr_list), np.array(tpr_list)

Implementation: AUC ROC

def compute_auc(fpr, tpr):
    """
    Compute the Area Under the Curve (AUC) using the trapezoidal rule.
    
    fpr: array of false positive rates
    tpr: array of true positive rates
    """
    return np.trapz(tpr, fpr)
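As a sanity check, the trapezoidal estimate can be compared against scikit-learn's own auc function on the same points. The fpr_example and tpr_example arrays below are made-up values for illustration; in practice they would come from compute_roc_curve above.

import numpy as np
from sklearn.metrics import auc

fpr_example = np.array([0.0, 0.25, 0.5, 1.0])
tpr_example = np.array([0.0, 0.6, 0.8, 1.0])

compute_auc(fpr_example, tpr_example)  # ≈ 0.70
auc(fpr_example, tpr_example)          # ≈ 0.70 (sklearn applies the same trapezoidal rule)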

Example: Generate Data + Predictions

# Generate synthetic data for binary classification
np.random.seed(0)
m = 100  # number of samples
X = np.random.randn(m, 2)
noise = 0.5 * np.random.randn(m)

# Define labels: a noisy linear combination thresholded at 0
y = (X[:, 0] + X[:, 1] + noise > 0).astype(int)

# Add an intercept term (a column of ones) to X
X_intercept = np.hstack([np.ones((m, 1)), X])

# Train logistic regression model using gradient descent
theta, cost_history = logistic_regression(X_intercept, y, learning_rate=0.1, iterations=1000)

Example: Plot

Code
# Compute predicted probabilities for the positive class
y_probs = predict_probabilities(theta, X_intercept)

# Define a set of threshold values between 0 and 1 (e.g., 100 equally spaced thresholds)
thresholds = np.linspace(0, 1, 100)

# Compute the ROC curve (FPR and TPR for each threshold)
fpr, tpr = compute_roc_curve(y, y_probs, thresholds)
auc_value = compute_auc(fpr, tpr)

# Plot the ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='blue', lw=2, label='ROC curve (AUC = %0.2f)' % auc_value)
plt.plot([0, 1], [0, 1], color='gray', lw=1, linestyle='--', label='Random classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()

See Also

Cross-Validation

Training and test set

Sometimes called the holdout method.

  • Guideline: Typically, allocate 80% of your dataset for training and reserve the remaining 20% for testing; a minimal sketch follows this list.

  • Training Set: This subset of data is utilized to train your model.

  • Test Set: This is an independent subset used exclusively at the final stage to assess the model’s performance.
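A minimal sketch of the holdout protocol with scikit-learn; the breast-cancer dataset and the logistic regression model are illustrative choices, not part of this lecture's running examples.

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)

# 80% of the data for training, 20% held out for the final assessment
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = LogisticRegression(max_iter=10_000).fit(X_train, y_train)

accuracy_score(y_train, clf.predict(X_train))  # training accuracy, typically optimistic
accuracy_score(y_test, clf.predict(X_test))    # held-out estimate of generalization performance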

Training and test set

Training Error:

  • Generally tends to be low
  • Achieved by optimizing learning algorithms to minimize error through parameter adjustments (e.g., weights)

Definition

Generalization Error: The error rate observed when the model is evaluated on new, unseen data.

Prologue

Summary

  • Examined classification model evaluation techniques, focusing on confusion matrices and key metrics: accuracy, precision, recall, and \(F_1\) score.
  • Addressed the limitations of accuracy in imbalanced datasets, introducing micro and macro averaging techniques.
  • Explored the precision-recall trade-off and ROC analysis, including the area under the curve (AUC).
  • Provided practical insights through Python implementations.

Next lecture

  • Cross-validation

References

Géron, Aurélien. 2022. Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow. 3rd ed. O’Reilly Media, Inc.
Japkowicz, Nathalie, and Mohak Shah. 2011. Evaluating Learning Algorithms: A Classification Perspective. Cambridge: Cambridge University Press.
Knowler, William C., David J. Pettitt, Peter J. Savage, and Peter H. Bennett. 1981. “Diabetes Incidence in Pima Indians: Contributions of Obesity and Parental Diabetes.” American Journal of Epidemiology 113 2: 144–56. https://api.semanticscholar.org/CorpusID:25209675.
Rafi, Abdul Muntakim, Brett Kiyota, Nozomu Yachie, and Carl G de Boer. 2025. Detecting and avoiding homology-based data leakage in genome-trained sequence models.” https://doi.org/10.1101/2025.01.22.634321.
Sokolova, Marina, and Guy Lapalme. 2009. A systematic analysis of performance measures for classification tasks.” Information Processing and Management 45 (4): 427–37. https://doi.org/10.1016/j.ipm.2009.03.002.
Walsh, Ian, Dmytro Fishman, Dario Garcia-Gasulla, Tiina Titma, Gianluca Pollastri, ELIXIR Machine Learning Focus Group, Emidio Capriotti, et al. 2021. DOME: recommendations for supervised machine learning validation in biology.” Nature Methods 18 (10): 1122–27. https://doi.org/10.1038/s41592-021-01205-4.
Whalen, Sean, Jacob Schreiber, William S. Noble, and Katherine S. Pollard. 2022. Navigating the pitfalls of applying machine learning in genomics.” Nature Reviews Genetics 23 (3): 169–81. https://doi.org/10.1038/s41576-021-00434-9.

Marcel Turcotte

Marcel.Turcotte@uOttawa.ca

School of Electrical Engineering and Computer Science (EECS)

University of Ottawa