Model Evaluation and Hyperparameter Tuning

CSI 4106 - Fall 2024

Marcel Turcotte

Version: Nov 14, 2024 09:02

Preamble

Quote of the Day

Learning Objectives

  1. Understand the Purpose of Data Splitting:
    • Describe the roles of the training, validation, and test sets in model evaluation.
    • Explain why and how datasets are divided for effective model training and evaluation.
  2. Explain Cross-Validation Techniques:
    • Define cross-validation and its importance in model evaluation.
    • Illustrate the process of \(k\)-fold cross-validation and its advantages over a single train-test split.
    • Discuss the concepts of underfitting and overfitting in the context of cross-validation.
  3. Hyperparameter Tuning:
    • Explain the difference between model parameters and hyperparameters.
    • Describe methods for tuning hyperparameters, including grid search and randomized search.
    • Implement hyperparameter tuning using GridSearchCV in scikit-learn.
  4. Evaluate Model Performance:
    • Interpret cross-validation results and understand metrics like mean and standard deviation of scores.
    • Discuss how cross-validation helps in assessing model generalization and reducing variability.
  5. Machine Learning Engineering Workflow:
    • Outline the steps involved in preparing data for machine learning models.
    • Utilize scikit-learn pipelines for efficient data preprocessing and model training.
    • Emphasize the significance of consistent data transformations across training and production environments.
  6. Critical Evaluation of Machine Learning Models:
    • Assess the limitations and challenges associated with hyperparameter tuning and model selection.
    • Recognize potential pitfalls in data preprocessing, such as incorrect handling of missing values or inconsistent encoding.
    • Advocate for thorough testing and validation to ensure model reliability and generalizability.
  7. Integrate Knowledge in Practical Applications:
    • Apply the learned concepts to real-world datasets (e.g., OpenML datasets like ‘diabetes’ and ‘adult’).
    • Interpret and analyze the results of model evaluations and experiments.
    • Develop a comprehensive understanding of the end-to-end machine learning pipeline.

Introduction

Dataset - openml

OpenML is an open platform for sharing datasets, algorithms, and experiments - to learn how to learn better, together.

import numpy as np
np.random.seed(42)  # for reproducibility

from sklearn.datasets import fetch_openml

# Fetch the 'diabetes' dataset from OpenML; DESCR holds its documentation
diabetes = fetch_openml(name='diabetes', version=1)
print(diabetes.DESCR)

Dataset - openml

Author: Vincent Sigillito

Source: Obtained from UCI

Please cite: UCI citation policy

  1. Title: Pima Indians Diabetes Database

  2. Sources:

    1. Original owners: National Institute of Diabetes and Digestive and Kidney Diseases
    2. Donor of database: Vincent Sigillito (vgs@aplcen.apl.jhu.edu), Research Center, RMI Group Leader, Applied Physics Laboratory, The Johns Hopkins University, Johns Hopkins Road, Laurel, MD 20707, (301) 953-6231
    3. Date received: 9 May 1990
  3. Past Usage:

    1. Smith, J.W., Everhart, J.E., Dickson, W.C., Knowler, W.C., & Johannes, R.S. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Symposium on Computer Applications and Medical Care (pp. 261–265). IEEE Computer Society Press.

      The diagnostic, binary-valued variable investigated is whether the patient shows signs of diabetes according to World Health Organization criteria (i.e., if the 2 hour post-load plasma glucose was at least 200 mg/dl at any survey examination or if found during routine medical care). The population lives near Phoenix, Arizona, USA.

      Results: Their ADAP algorithm makes a real-valued prediction between 0 and 1. This was transformed into a binary decision using a cutoff of 0.448. Using 576 training instances, the sensitivity and specificity of their algorithm were 76% on the remaining 192 instances.

  4. Relevant Information: Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage. ADAP is an adaptive learning routine that generates and executes digital analogs of perceptron-like devices. It is a unique algorithm; see the paper for details.

  5. Number of Instances: 768

  6. Number of Attributes: 8 plus class

  7. For Each Attribute: (all numeric-valued)

    1. Number of times pregnant
    2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
    3. Diastolic blood pressure (mm Hg)
    4. Triceps skin fold thickness (mm)
    5. 2-Hour serum insulin (mu U/ml)
    6. Body mass index (weight in kg/(height in m)^2)
    7. Diabetes pedigree function
    8. Age (years)
    9. Class variable (0 or 1)
  8. Missing Attribute Values: None

  9. Class Distribution: (class value 1 is interpreted as “tested positive for diabetes”)

    Class Value   Number of instances
    0             500
    1             268

  10. Brief statistical analysis:

    Attribute number: Mean: Standard Deviation:

    1.                 3.8     3.4
    2.               120.9    32.0
    3.                69.1    19.4
    4.                20.5    16.0
    5.                79.8   115.2
    6.                32.0     7.9
    7.                 0.5     0.3
    8.                33.2    11.8

Relabeled values in attribute ‘class’ From: 0 To: tested_negative
From: 1 To: tested_positive

Downloaded from openml.org.

Dataset - return_X_y

Depending on its arguments, fetch_openml returns a Bunch, a single DataFrame, or the pair (X, y).

from sklearn.datasets import fetch_openml

X, y = fetch_openml(name='diabetes', version=1, return_X_y=True)
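
For reference, a brief sketch of the other two return forms (the as_frame argument controls whether a pandas DataFrame is produced):

from sklearn.datasets import fetch_openml

# Default call: a Bunch bundling the data, the target, and metadata such as DESCR
diabetes = fetch_openml(name='diabetes', version=1)
print(type(diabetes.data), type(diabetes.target))

# as_frame=True: a single pandas DataFrame (features plus target) available under .frame
diabetes = fetch_openml(name='diabetes', version=1, as_frame=True)
print(diabetes.frame.shape)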

The classes show mild imbalance (majority-to-minority ratio below 3 or 4):

print(y.value_counts())
class
tested_negative    500
tested_positive    268
Name: count, dtype: int64

Converting the target labels to 0 and 1

y = y.map({'tested_negative': 0, 'tested_positive': 1})

Cross-validation

Training and test set

Sometimes called the holdout method (a minimal split sketch follows the list below).

  • Guideline: Typically, allocate 80% of your dataset for training and reserve the remaining 20% for testing.

  • Training Set: This subset of data is utilized to train your model.

  • Test Set: This is an independent subset used exclusively at the final stage to assess the model’s performance.
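
A minimal sketch of such a split, assuming the X and y loaded earlier (the stratify and random_state arguments are illustrative choices):

from sklearn.model_selection import train_test_split

# 80% for training, 20% held out for the final test;
# stratify keeps the class proportions similar in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

print(X_train.shape, X_test.shape)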

Training and test set

Training Error:

  • Generally tends to be low
  • Achieved by optimizing learning algorithms to minimize error through parameter adjustments (e.g., weights)

Training and test set

Generalization Error: The error rate observed when the model is evaluated on new, unseen data.

Training and test set

Underfitting:

  • High training error
  • Model is too simple to capture underlying patterns
  • Poor performance on both training and new data

Overfitting:

  • Low training error, but high generalization error
  • Model captures noise or irrelevant patterns
  • Poor performance on new, unseen data (illustrated in the sketch below)
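
To make this concrete, a small sketch (assuming the X and y loaded earlier): an unconstrained decision tree typically scores near-perfectly on the data it was trained on, while cross-validation reveals a much lower estimate of generalization.

from sklearn import tree
from sklearn.model_selection import cross_val_score

# Unconstrained tree: grows until it fits the training data almost perfectly
clf = tree.DecisionTreeClassifier()
clf.fit(X, y)
print("Training accuracy:", clf.score(X, y))  # typically close to 1.0

# Cross-validation gives a more realistic estimate of performance on unseen data
print("Cross-validation accuracy:", cross_val_score(clf, X, y, cv=5).mean())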

Definition

Cross-validation is a method used to evaluate and improve the performance of machine learning models.

It involves partitioning the dataset into multiple subsets, training the model on some subsets while validating it on the remaining ones.

k-fold cross-validation

  1. Divide the dataset into \(k\) equally sized parts (folds).
  2. Training and validation:
    • In each iteration, one fold is used as the validation set, while the remaining \(k-1\) folds form the training set.
  3. Evaluation: The model’s performance is evaluated in each iteration, resulting in \(k\) performance measures.
  4. Aggregation: Summary statistics (e.g., mean and standard deviation) are computed over the \(k\) performance measures; a sketch follows below.
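
A minimal sketch of this procedure with scikit-learn's KFold, assuming the X and y loaded earlier (the shuffle and random_state settings are illustrative choices):

from sklearn import tree
from sklearn.model_selection import KFold
import numpy as np

kf = KFold(n_splits=5, shuffle=True, random_state=42)

scores = []
for train_index, val_index in kf.split(X):
    # One fold serves as the validation set; the remaining k-1 folds form the training set
    clf = tree.DecisionTreeClassifier()
    clf.fit(X.iloc[train_index], y.iloc[train_index])
    scores.append(clf.score(X.iloc[val_index], y.iloc[val_index]))

# Aggregation: one performance measure per fold
print(np.mean(scores), np.std(scores))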

3-Fold Cross-validation

5-Fold Cross-validation

More Reliable Model Evaluation

  • More reliable estimate of model performance compared to a single train-test split.
  • Reduces the variability associated with a single split, leading to a more stable and unbiased evaluation.
  • For larger values of \(k\), report the average, the variance, and a confidence interval (a sketch of such a calculation follows).
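
As an illustration, a sketch of such an aggregation, assuming the X and y loaded earlier and a normal approximation for the interval:

import numpy as np
from sklearn import tree
from sklearn.model_selection import cross_val_score

# One accuracy value per fold (10-fold cross-validation)
scores = cross_val_score(tree.DecisionTreeClassifier(max_depth=3), X, y, cv=10)

mean, std = scores.mean(), scores.std()
# Approximate 95% confidence interval for the mean score (normal approximation)
half_width = 1.96 * std / np.sqrt(len(scores))
print(f"Mean: {mean:.2f} +/- {half_width:.2f}")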

Better Generalization

  • Helps in assessing how the model generalizes to an independent dataset.
  • It ensures that the model’s performance is not overly optimistic or pessimistic by averaging results over multiple folds.

Efficient Use of Data

  • Particularly beneficial for small datasets, cross-validation ensures that every data point is used for both training and validation.
  • This maximizes the use of available data, leading to more accurate and reliable model training.

Hyperparameter Tuning

  • Commonly used during hyperparameter tuning, allowing for the selection of the best model parameters based on their performance across multiple folds.
  • This helps in identifying the optimal configuration that balances bias and variance.

Challenges

  • Computational Cost: Requires multiple model trainings.
    • Leave-One-Out (LOO): Extreme case where \(k = N\), the number of instances.
  • Class Imbalance: Folds may not represent minority classes.
    • Use Stratified Cross-Validation to maintain class proportions (see the sketch below).
  • Complexity: Implementation is error-prone, especially for nested cross-validation, bootstrapping, or integration into larger pipelines.
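
A minimal sketch addressing the class-imbalance point, assuming the X and y loaded earlier (the shuffle and random_state settings are illustrative):

from sklearn import tree
from sklearn.model_selection import cross_val_score, StratifiedKFold, LeaveOneOut

clf = tree.DecisionTreeClassifier(max_depth=3)

# Stratified folds preserve the class proportions in every fold.
# Note: for classifiers, an integer cv already uses stratified folds by default;
# passing StratifiedKFold makes this explicit and allows shuffling.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
print(cross_val_score(clf, X, y, cv=skf).mean())

# Leave-One-Out: as many folds as instances; unbiased but computationally costly
# print(cross_val_score(clf, X, y, cv=LeaveOneOut()).mean())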

cross_val_score

from sklearn import tree

clf = tree.DecisionTreeClassifier()

from sklearn.model_selection import cross_val_score    

clf_scores = cross_val_score(clf, X, y, cv=5)

print("\nScores:", clf_scores)
print(f"\nMean: {clf_scores.mean():.2f}")
print(f"\nStandard deviation: {clf_scores.std():.2f}")

Scores: [0.71428571 0.66883117 0.71428571 0.79738562 0.73202614]

Mean: 0.73

Standard deviation: 0.04
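
For richer diagnostics, cross_validate reports several metrics and per-fold timing; a sketch with the same classifier and data (the choice of metrics is illustrative):

from sklearn.model_selection import cross_validate

results = cross_validate(clf, X, y, cv=5,
                         scoring=['accuracy', 'f1', 'roc_auc'])

# One array entry per fold for each requested metric, plus fit and score times
for key in ['test_accuracy', 'test_f1', 'test_roc_auc']:
    print(f"{key}: {results[key].mean():.2f}")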

Hyperparameter tuning

Workflow

Workflow - implementation

from sklearn.datasets import fetch_openml

X, y = fetch_openml(name='diabetes', version=1, return_X_y=True)

y = y.map({'tested_negative': 0, 'tested_positive': 1})

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

Definition

A hyperparameter is a configuration external to the model that is set prior to the training process and governs the learning process, influencing model performance and complexity.
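
To make the distinction concrete, a small sketch (using the decision tree and training data above): max_depth is a hyperparameter fixed before training, whereas the tree structure itself is a set of parameters learned by fit.

from sklearn import tree

# Hyperparameter: chosen before training and held fixed during it
clf = tree.DecisionTreeClassifier(max_depth=3)

# Parameters: the splits and leaf values are learned from the data
clf.fit(X_train, y_train)
print("Depth:", clf.get_depth(), "Nodes:", clf.tree_.node_count)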

Hyperparameters - Decision Tree

  • criterion: gini, entropy, or log_loss; the function used to measure the quality of a split.
  • max_depth: limits the number of levels in the tree to prevent overfitting.

Hyperparameters - Logistic Regression

  • penalty: l1, l2, or None; regularization helps prevent overfitting.
  • solver: liblinear, newton-cg, lbfgs, sag, or saga.
  • max_iter: maximum number of iterations allowed for the solver to converge.
  • tol: tolerance for the stopping criterion; smaller values mean higher precision.

Hyperparameters - KNN

  • n_neighbors: number of neighbors to use for \(k\)-neighbors queries.
  • weights: uniform or distance; neighbors count equally or are weighted by the inverse of their distance.

Experiment: max_depth

for value in [3, 5, 7, None]:

  clf = tree.DecisionTreeClassifier(max_depth=value)

  clf_scores = cross_val_score(clf, X_train, y_train, cv=10)

  print("\nmax_depth = ", value)
  print(f"Mean: {clf_scores.mean():.2f}")
  print(f"Standard deviation: {clf_scores.std():.2f}")

max_depth =  3
Mean: 0.74
Standard deviation: 0.04

max_depth =  5
Mean: 0.76
Standard deviation: 0.04

max_depth =  7
Mean: 0.73
Standard deviation: 0.04

max_depth =  None
Mean: 0.71
Standard deviation: 0.05

Experiment: criterion

for value in ["gini", "entropy", "log_loss"]:

  clf = tree.DecisionTreeClassifier(max_depth=5, criterion=value)

  clf_scores = cross_val_score(clf, X_train, y_train, cv=10)

  print("\ncriterion = ", value)
  print(f"Mean: {clf_scores.mean():.2f}")
  print(f"Standard deviation: {clf_scores.std():.2f}")

criterion =  gini
Mean: 0.76
Standard deviation: 0.04

criterion =  entropy
Mean: 0.75
Standard deviation: 0.05

criterion =  log_loss
Mean: 0.75
Standard deviation: 0.05

Experiment: n_neighbors

from sklearn.neighbors import KNeighborsClassifier

for value in range(1, 11):

  clf = KNeighborsClassifier(n_neighbors=value)

  clf_scores = cross_val_score(clf, X_train, y_train, cv=10)

  print("\nn_neighbors = ", value)
  print(f"Mean: {clf_scores.mean():.2f}")
  print(f"Standard deviation: {clf_scores.std():.2f}")

Experiment: n_neighbors


n_neighbors =  1
Mean: 0.67
Standard deviation: 0.05

n_neighbors =  2
Mean: 0.71
Standard deviation: 0.03

n_neighbors =  3
Mean: 0.69
Standard deviation: 0.05

n_neighbors =  4
Mean: 0.73
Standard deviation: 0.03

n_neighbors =  5
Mean: 0.72
Standard deviation: 0.03

n_neighbors =  6
Mean: 0.73
Standard deviation: 0.05

n_neighbors =  7
Mean: 0.74
Standard deviation: 0.04

n_neighbors =  8
Mean: 0.75
Standard deviation: 0.04

n_neighbors =  9
Mean: 0.73
Standard deviation: 0.05

n_neighbors =  10
Mean: 0.73
Standard deviation: 0.04

Experiment: weights

from sklearn.neighbors import KNeighborsClassifier

for value in ["uniform", "distance"]:

  clf = KNeighborsClassifier(n_neighbors=5, weights=value)

  clf_scores = cross_val_score(clf, X_train, y_train, cv=10)

  print("\nweights = ", value)
  print(f"Mean: {clf_scores.mean():.2f}")
  print(f"Standard deviation: {clf_scores.std():.2f}")

weights =  uniform
Mean: 0.72
Standard deviation: 0.03

weights =  distance
Mean: 0.73
Standard deviation: 0.04

GridSearchCV

from sklearn.model_selection import GridSearchCV

param_grid = [
  {'max_depth': range(1, 10),
   'criterion': ["gini", "entropy", "log_loss"]}
]

clf = tree.DecisionTreeClassifier()

grid_search = GridSearchCV(clf, param_grid, cv=5)

grid_search.fit(X_train, y_train)

(grid_search.best_params_, grid_search.best_score_)
({'criterion': 'gini', 'max_depth': 5}, 0.7481910124074653)
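
The fitted search object also exposes the refitted best model and a full table of per-combination results; a brief sketch (the displayed columns are illustrative):

import pandas as pd

# best_estimator_ has been refit on the whole training set (refit=True by default)
best_tree = grid_search.best_estimator_

# cv_results_ is a dictionary of arrays; a DataFrame makes it easy to inspect
results = pd.DataFrame(grid_search.cv_results_)
print(results[['param_max_depth', 'param_criterion', 'mean_test_score']]
      .sort_values('mean_test_score', ascending=False)
      .head())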

GridSearchCV

param_grid = [
  {'n_neighbors': range(1, 15),
   'weights': ["uniform", "distance"]}
]

clf = KNeighborsClassifier()

grid_search = GridSearchCV(clf, param_grid, cv=5)

grid_search.fit(X_train, y_train)

(grid_search.best_params_, grid_search.best_score_)
({'n_neighbors': 14, 'weights': 'uniform'}, 0.7554165363361485)

GridSearchCV

from sklearn.linear_model import LogisticRegression

# 3 * 5 * 5 * 3 = 225 hyperparameter combinations, each evaluated with 5-fold CV!
# Not every penalty/solver pair is valid; failed fits are scored as NaN by default.

param_grid = [
  {'penalty': ["l1", "l2", None],
   'solver' : ['liblinear', 'newton-cg', 'lbfgs', 'sag', 'saga'],
   'max_iter' : [100, 200, 400, 800, 1600],
   'tol' : [0.01, 0.001, 0.0001]}
]

clf = LogisticRegression()

grid_search = GridSearchCV(clf, param_grid, cv=5)

grid_search.fit(X_train, y_train)

(grid_search.best_params_, grid_search.best_score_)
({'max_iter': 100, 'penalty': 'l2', 'solver': 'newton-cg', 'tol': 0.001},
 0.7756646856427901)
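
Randomized search, mentioned in the learning objectives, samples a fixed number of configurations rather than exhaustively trying them all; a sketch over the same logistic regression hyperparameters (the distributions and n_iter value are illustrative choices):

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import loguniform

param_distributions = {
    'solver': ['liblinear', 'lbfgs', 'saga'],  # all support the default l2 penalty
    'max_iter': [100, 200, 400, 800, 1600],
    'tol': loguniform(1e-5, 1e-1),             # sample tol from a continuous distribution
}

random_search = RandomizedSearchCV(LogisticRegression(), param_distributions,
                                   n_iter=20, cv=5, random_state=42)
random_search.fit(X_train, y_train)

(random_search.best_params_, random_search.best_score_)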

Workflow

Finally, we evaluate the selected model on the held-out test set.

clf = LogisticRegression(max_iter=100, penalty='l2', solver='newton-cg', tol=0.001)

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.83      0.83      0.83        52
           1       0.64      0.64      0.64        25

    accuracy                           0.77        77
   macro avg       0.73      0.73      0.73        77
weighted avg       0.77      0.77      0.77        77

Prologue

Summary

  • Training Set Size: Impact on model efficacy and generalization.
  • Attribute Encoding: Evaluation of techniques to capture biological phenomena.
  • Preprocessing:
    • Data Scaling
    • Handling Missing Values
    • Managing Class Imbalance

Next lecture

  • We will further discuss machine learning engineering.


Marcel Turcotte

Marcel.Turcotte@uOttawa.ca

School of Electrical Engineering and Computer Science (EECS)

University of Ottawa