from sklearn.metrics import confusion_matrix
y_actual = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
y_pred = [0, 1, 1, 0, 0, 0, 1, 1, 1, 1]
confusion_matrix(y_actual,y_pred)
array([[1, 2],
[3, 4]])
CSI 5180 - Machine Learning for Bioinformatics
Version: Feb 15, 2025 11:00
The American Society for Microbiology altered its website to remove references to diversity and equity, and temporarily removed articles about scientists from under-represented groups — raising an outcry from some of its members. The organization’s president says it was following legal advice in the hope of protecting its federally funded programmes from the impact of wide-ranging executive orders issued by President Donald Trump, which banned federal funding related to topics including diversity, equity and inclusion.
This lecture covers classification model evaluation, focusing on confusion matrices and key metrics: accuracy, precision, recall, and F₁ score. It addresses accuracy’s limitations in imbalanced datasets, introducing micro and macro averaging. The precision-recall trade-off and ROC analysis, including AUC, are also explored. Practical insights are provided through Python implementations like logistic regression via gradient descent.
This book, rated 4.6 stars on Amazon, delves into the evaluation process, with a particular focus on classification algorithms (Japkowicz and Shah 2011).
Nathalie Japkowicz previously served as a professor at the University of Ottawa and is currently affiliated with American University in Washington.
Mohak Shah, who earned his PhD from the University of Ottawa, has held numerous industry roles, including Vice President of AI and Machine Learning at LG Electronics.
| | Positive (Predicted) | Negative (Predicted) |
|---|---|---|
| Positive (Actual) | True positive (TP) | False negative (FN) |
| Negative (Actual) | False positive (FP) | True negative (TN) |
Given a test set with \(N\) examples and a classifier \(h(x):\)
\[ C_{i,j} = \sum_{k = 1}^N [y_k = i \wedge h(x_k) = j] \]
where \(C\) is an \(l \times l\) matrix for a dataset with \(l\) classes.
The total number of examples of the (actual) class \(i\) is \[ C_{i \cdot} = \sum_{j=1}^l C_{i,j} \]
The total number of examples assigned to the (predicted) class \(j\) by classifier \(h\) is \[ C_{\cdot j} = \sum_{i=1}^l C_{i,j} \]
Terms on the diagonal denote the total number of examples classified correctly by classifier \(h\). Hence, the number of correctly classified examples is \[ \sum_{i=1}^l C_{i,i} \]
Non-diagonal terms represent misclassifications.
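These quantities are easy to compute with NumPy. Here is a minimal sketch using the \(2 \times 2\) confusion matrix from the opening example:
import numpy as np
C = np.array([[1, 2],
              [3, 4]])      # rows = actual class i, columns = predicted class j
row_totals = C.sum(axis=1)  # C_{i.}: examples whose actual class is i -> [3 7]
col_totals = C.sum(axis=0)  # C_{.j}: examples predicted as class j    -> [4 6]
correct = np.trace(C)       # sum of the diagonal entries              -> 5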
To evaluate performance in a multi-class setting, one typically derives “one-vs-all” metrics for each class from the confusion matrix. These metrics are then averaged using specific weighting schemes.
from sklearn.datasets import load_digits
import numpy as np
np.random.seed(42)
digits = load_digits()
X = digits.data
y = digits.target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
clf = OneVsRestClassifier(LogisticRegression())
clf = clf.fit(X_train, y_train)
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay
X_test = scaler.transform(X_test)
y_pred = clf.predict(X_test)
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
plt.show()
import numpy as np
np.random.seed(42)
from sklearn.datasets import load_digits
digits = load_digits()
X = digits.data
y = digits.target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
clf = OneVsRestClassifier(LogisticRegression())
clf = clf.fit(X_train, y_train)
X_test = scaler.transform(X_test)
y_pred = clf.predict(X_test)
mask = (y_test == 9) & (y_pred == 8)
X_9_as_8 = X_test[mask]
y_9_as_8 = y_test[mask]
import matplotlib.pyplot as plt
plt.figure(figsize=(4,2))
for index, (image, label) in enumerate(zip(X_9_as_8, y_9_as_8)):
plt.subplot(1, len(X_9_as_8), index + 1)
plt.imshow(np.reshape(image, (8,8)), cmap=plt.cm.gray)
plt.title(f'y = {label}')
How accurate is this result?
\[ \mathrm{accuracy} = \frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN}} = \frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{N}} \]
AKA, positive predictive value (PPV).
\[ \mathrm{precision} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}} \]
AKA sensitivity or true positive rate (TPR) \[ \mathrm{recall} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}} \]
\[ \begin{align*} F_1~\mathrm{score} &= \frac{2}{\frac{1}{\mathrm{precision}}+\frac{1}{\mathrm{recall}}} = 2 \times \frac{\mathrm{precision}\times\mathrm{recall}}{\mathrm{precision}+\mathrm{recall}} \\ &= \frac{\mathrm{TP}}{\mathrm{TP}+\frac{\mathrm{FN}+\mathrm{FP}}{2}} \end{align*} \]
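To make these definitions concrete, here is a small sketch applying scikit-learn's metric functions to the toy example from the opening slide (positive class = 1):
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
y_actual = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
y_pred = [0, 1, 1, 0, 0, 0, 1, 1, 1, 1]
# TP = 4, TN = 1, FP = 2, FN = 3
print(accuracy_score(y_actual, y_pred))   # (4 + 1) / 10 = 0.5
print(precision_score(y_actual, y_pred))  # 4 / (4 + 2) ≈ 0.67
print(recall_score(y_actual, y_pred))     # 4 / (4 + 3) ≈ 0.57
print(f1_score(y_actual, y_pred))         # ≈ 0.62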
The class imbalance problem arises when the instances of one class significantly outnumber those of the other classes.
Models tend to be biased towards the majority class, leading to poor performance on the minority class.
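A small illustration of the problem: on a dataset with 990 negatives and 10 positives (the same proportions as the medical example later in this lecture), a classifier that always predicts the majority class looks excellent on accuracy while never detecting a positive case.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score
y_true = np.array([0] * 990 + [1] * 10)  # hypothetical imbalanced labels
y_pred = np.zeros_like(y_true)           # always predict the majority class
print(accuracy_score(y_true, y_pred))    # 0.99
print(recall_score(y_true, y_pred))      # 0.00 for the minority class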
When calculating precision, recall, and \(F_1\) in a multi-class setting, one usually computes “one-vs-all” metrics for each class and then averages them using a weighting scheme (macro or micro).
from sklearn.metrics import ConfusionMatrixDisplay
# Sample data
y_true = ['Cat'] * 42 + ['Dog'] * 7 + ['Fox'] * 11
y_pred = ['Cat'] * 39 + ['Dog'] * 1 + ['Fox'] * 2 + \
['Cat'] * 4 + ['Dog'] * 3 + ['Fox'] * 0 + \
['Cat'] * 5 + ['Dog'] * 1 + ['Fox'] * 5
ConfusionMatrixDisplay.from_predictions(y_true, y_pred)
from sklearn.metrics import classification_report, precision_score
print(classification_report(y_true, y_pred), "\n")
print("Micro precision: {:.2f}".format(precision_score(y_true, y_pred, average='micro')))
print("Macro precision: {:.2f}".format(precision_score(y_true, y_pred, average='macro')))
precision recall f1-score support
Cat 0.81 0.93 0.87 42
Dog 0.60 0.43 0.50 7
Fox 0.71 0.45 0.56 11
accuracy 0.78 60
macro avg 0.71 0.60 0.64 60
weighted avg 0.77 0.78 0.77 60
Micro precision: 0.78
Macro precision: 0.71
Macro-average precision is calculated as the mean of the precision scores for each class: \(\frac{0.81 + 0.60 + 0.71}{3} = 0.71\).
The micro-average precision, in contrast, is calculated using the formula \(\frac{TP}{TP+FP}\) with counts pooled from the entire confusion matrix: \(\frac{39+3+5}{39+3+5+9+2+2} = \frac{47}{60} = 0.78\).
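The same numbers can be recovered directly from the confusion matrix; a short sketch (rows are the actual classes Cat, Dog, Fox; columns are the predicted classes):
import numpy as np
cm = np.array([[39, 1, 2],
               [ 4, 3, 0],
               [ 5, 1, 5]])
tp = np.diag(cm)                   # [39, 3, 5]
fp = cm.sum(axis=0) - tp           # [9, 2, 2]
print((tp / (tp + fp)).mean())     # macro precision: 0.71
print(tp.sum() / (tp + fp).sum())  # micro precision: 47/60 ≈ 0.78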
precision recall f1-score support
Cat 0.81 0.93 0.87 42
Dog 0.60 0.43 0.50 7
Fox 0.71 0.45 0.56 11
accuracy 0.78 60
macro avg 0.71 0.60 0.64 60
weighted avg 0.77 0.78 0.77 60
Micro recall: 0.78
Macro recall: 0.60
Macro-average recall is calculated as the mean of the recall scores for each class: \(\frac{0.93 + 0.43 + 0.45}{3} = 0.60\).
The micro-average recall, in contrast, is calculated using the formula \(\frac{TP}{TP+FN}\) with counts pooled from the entire confusion matrix: \(\frac{39+3+5}{39+3+5+3+4+6} = \frac{47}{60} = 0.78\).
Using the 20 newsgroups text dataset from scikit-learn.org.
Comprises around 18,000 newsgroups posts on 20 topics.
## https://scikit-learn.org/stable/auto_examples/text/plot_document_classification_20newsgroups.html
from time import time
## Load Dataset
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
categories = [
"alt.atheism",
"talk.religion.misc",
"comp.graphics",
"sci.space",
]
def size_mb(docs):
return sum(len(s.encode("utf-8")) for s in docs) / 1e6
def load_dataset(verbose=False, remove=()):
"""Load and vectorize the 20 newsgroups dataset."""
data_train = fetch_20newsgroups(
subset="train",
categories=categories,
shuffle=True,
random_state=42,
remove=remove,
)
data_test = fetch_20newsgroups(
subset="test",
categories=categories,
shuffle=True,
random_state=42,
remove=remove,
)
# order of labels in `target_names` can be different from `categories`
target_names = data_train.target_names
# split target in a training set and a test set
y_train, y_test = data_train.target, data_test.target
# Extracting features from the training data using a sparse vectorizer
t0 = time()
vectorizer = TfidfVectorizer(
sublinear_tf=True, max_df=0.5, min_df=5, stop_words="english"
)
X_train = vectorizer.fit_transform(data_train.data)
duration_train = time() - t0
# Extracting features from the test data using the same vectorizer
t0 = time()
X_test = vectorizer.transform(data_test.data)
duration_test = time() - t0
feature_names = vectorizer.get_feature_names_out()
if verbose:
# compute size of loaded data
data_train_size_mb = size_mb(data_train.data)
data_test_size_mb = size_mb(data_test.data)
# print(
# f"{len(data_train.data)} documents - "
# f"{data_train_size_mb:.2f}MB (training set)"
# )
# print(f"{len(data_test.data)} documents - {data_test_size_mb:.2f}MB (test set)")
# print(f"{len(target_names)} categories")
# print(
# f"vectorize training done in {duration_train:.3f}s "
# f"at {data_train_size_mb / duration_train:.3f}MB/s"
# )
# print(f"n_samples: {X_train.shape[0]}, n_features: {X_train.shape[1]}")
# print(
# f"vectorize testing done in {duration_test:.3f}s "
# f"at {data_test_size_mb / duration_test:.3f}MB/s"
# )
# print(f"n_samples: {X_test.shape[0]}, n_features: {X_test.shape[1]}")
return X_train, X_test, y_train, y_test, feature_names, target_names
X_train, X_test, y_train, y_test, feature_names, target_names = load_dataset(
verbose=True
)
## Training and Prediction
from sklearn.linear_model import RidgeClassifier
clf = RidgeClassifier(tol=1e-2, solver="sparse_cg")
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
## Display the Confusion Matrix
from sklearn.metrics import ConfusionMatrixDisplay
fig, ax = plt.subplots(figsize=(10, 5))
ConfusionMatrixDisplay.from_predictions(y_test, y_pred, ax=ax)
ax.xaxis.set_ticklabels(target_names)
ax.yaxis.set_ticklabels(target_names)
_ = ax.set_title(
f"Confusion Matrix for {clf.__class__.__name__}"
)
def true_positive(cm, i):
return cm[i,i] # diagonal entry i,i
def false_positive(cm, i):
return np.sum(cm[:, i]) - cm[i,i] # col - TP_i
def false_negative(cm, i):
return np.sum(cm[i, :]) - cm[i,i] # row - TP_i
def true_negative(cm, i):
N = cm.sum()
TP = true_positive(cm, i)
FP = false_positive(cm, i)
FN = false_negative(cm, i)
return N - (TP + FP + FN)
def precision_micro(cm):
_, l = cm.shape
tp = fp = 0
for i in range(l):
tp += true_positive(cm, i)
fp += false_positive(cm, i)
return tp / (tp+fp)
def precision_macro(cm):
_, l = cm.shape
precision = 0
for i in range(l):
tp = true_positive(cm, i)
fp = false_positive(cm, i)
precision += tp/(tp+fp)
return precision/l
Micro-average precision pools the counts from the entire confusion matrix: \[ \mathrm{Precision}_{\mathrm{micro}} = \frac{258+380+371+199}{(258+380+371+199)+(40+38+22+45)} \] where the numerator sums the diagonal (true positive) entries and the terms \(40+38+22+45\) are the per-class false positives.
Macro-average precision is the mean of the per-class precisions: \(\mathrm{Precision}_{\mathrm{macro}} = \frac{0.8657718121 + 0.9090909091 + 0.9440203562 + 0.8155737705}{4}\)
def recall_micro(cm):
_, l = cm.shape
tp = fn = 0
for i in range(l):
tp += true_positive(cm, i)
fn += false_negative(cm, i)
return tp / (tp+fn)
def recall_macro(cm):
_, l = cm.shape
recall = 0
for i in range(l):
tp = true_positive(cm, i)
fn = false_negative(cm, i)
recall += tp / (tp+fn)
return recall/l
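A usage sketch, checking these helper functions against scikit-learn on the predictions computed above (the two should agree):
from sklearn.metrics import confusion_matrix, precision_score, recall_score
cm = confusion_matrix(y_test, y_pred)
print(precision_micro(cm), precision_score(y_test, y_pred, average='micro'))
print(precision_macro(cm), precision_score(y_test, y_pred, average='macro'))
print(recall_micro(cm), recall_score(y_test, y_pred, average='micro'))
print(recall_macro(cm), recall_score(y_test, y_pred, average='macro'))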
Consider a medical dataset, such as those involving diagnostic tests or imaging, comprising 990 normal samples and 10 abnormal (tumor) samples. This represents the ground truth.
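The code producing the report below is not shown; here is a sketch with synthetic predictions chosen to be consistent with it (an assumption: the classifier detects 6 of the 10 tumours and mislabels 5 normal samples).
from sklearn.metrics import classification_report, precision_score, recall_score
y_true = ['Normal'] * 990 + ['Tumour'] * 10
y_pred = ['Normal'] * 985 + ['Tumour'] * 5 + \
         ['Tumour'] * 6 + ['Normal'] * 4
print(classification_report(y_true, y_pred))
print("Micro precision: {:.2f}".format(precision_score(y_true, y_pred, average='micro')))
print("Macro precision: {:.2f}".format(precision_score(y_true, y_pred, average='macro')))
print("Micro recall: {:.2f}".format(recall_score(y_true, y_pred, average='micro')))
print("Macro recall: {:.2f}".format(recall_score(y_true, y_pred, average='macro')))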
precision recall f1-score support
Normal 1.00 0.99 1.00 990
Tumour 0.55 0.60 0.57 10
accuracy 0.99 1000
macro avg 0.77 0.80 0.78 1000
weighted avg 0.99 0.99 0.99 1000
Micro precision: 0.99
Macro precision: 0.77
Micro recall: 0.99
Macro recall: 0.80
Loading the dataset
Plotting the first five examples
These images have dimensions of \(28 \times 28\) pixels.
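The loading and plotting code is not reproduced here; a minimal sketch, assuming the MNIST dataset ('mnist_784') from OpenML:
from sklearn.datasets import fetch_openml
import matplotlib.pyplot as plt
mnist = fetch_openml('mnist_784', as_frame=False)  # assumed dataset
X, y = mnist.data, mnist.target
fig, axes = plt.subplots(1, 5, figsize=(8, 2))
for ax, image, label in zip(axes, X[:5], y[:5]):
    ax.imshow(image.reshape(28, 28), cmap=plt.cm.gray)
    ax.set_title(f'y = {label}')
    ax.axis('off')
plt.show()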
SGDClassifier
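The training step is also omitted. Below is a sketch under two assumptions: a binary target (the decision-function analysis that follows requires one; the choice of the digit 5 is arbitrary) and an 80/20 train/test split.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
y = (y == '5').astype(int)  # assumed binary target: "is the digit a 5?"
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = SGDClassifier(random_state=42)
clf.fit(X_train, y_train)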
from sklearn.metrics import accuracy_score
y_pred = clf.predict(X_test)
accuracy_score(y_test, y_pred)
0.9572857142857143
Wow!
from sklearn.model_selection import cross_val_predict
y_scores = cross_val_predict(clf, X_train, y_train, cv=3, method="decision_function")
from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(y_train, y_scores)
threshold = 3000
plt.figure(figsize=(8, 4)) # extra code – it's not needed, just formatting
plt.plot(thresholds, precisions[:-1], "b--", label="Precision", linewidth=2)
plt.plot(thresholds, recalls[:-1], "g-", label="Recall", linewidth=2)
plt.vlines(threshold, 0, 1.0, "k", "dotted", label="threshold")
# extra code – this section just beautifies and saves Figure 3–5
idx = (thresholds >= threshold).argmax() # first index ≥ threshold
plt.plot(thresholds[idx], precisions[idx], "bo")
plt.plot(thresholds[idx], recalls[idx], "go")
plt.axis([-50000, 50000, 0, 1])
plt.grid()
plt.xlabel("Threshold")
plt.legend(loc="center right")
plt.show()
import matplotlib.patches as patches # extra code – for the curved arrow
plt.figure(figsize=(5, 5)) # extra code – not needed, just formatting
plt.plot(recalls, precisions, linewidth=2, label="Precision/Recall Curve")
# extra code – just beautifies and saves Figure 3–6
plt.plot([recalls[idx], recalls[idx]], [0., precisions[idx]], "k:")
plt.plot([0.0, recalls[idx]], [precisions[idx], precisions[idx]], "k:")
plt.plot([recalls[idx]], [precisions[idx]], "ko",
label="Point at threshold 3,000")
plt.gca().add_patch(patches.FancyArrowPatch(
(0.79, 0.60), (0.61, 0.78),
connectionstyle="arc3,rad=.2",
arrowstyle="Simple, tail_width=1.5, head_width=8, head_length=10",
color="#444444"))
plt.text(0.56, 0.62, "Higher\nthreshold", color="#333333")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.axis([0, 1, 0, 1])
plt.grid()
plt.legend(loc="lower left")
plt.show()
Receiver Operating Characteristic (ROC) curve
idx_for_90_precision = (precisions >= 0.90).argmax()
threshold_for_90_precision = thresholds[idx_for_90_precision]
y_train_pred_90 = (y_scores >= threshold_for_90_precision)
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_train, y_scores)
idx_for_threshold_at_90 = (thresholds <= threshold_for_90_precision).argmax()
tpr_90, fpr_90 = tpr[idx_for_threshold_at_90], fpr[idx_for_threshold_at_90]
plt.figure(figsize=(5, 5)) # extra code – not needed, just formatting
plt.plot(fpr, tpr, linewidth=2, label="ROC curve")
plt.plot([0, 1], [0, 1], 'k:', label="Random classifier's ROC curve")
plt.plot([fpr_90], [tpr_90], "ko", label="Threshold for 90% precision")
# extra code – just beautifies and saves Figure 3–7
plt.gca().add_patch(patches.FancyArrowPatch(
(0.20, 0.89), (0.07, 0.70),
connectionstyle="arc3,rad=.4",
arrowstyle="Simple, tail_width=1.5, head_width=8, head_length=10",
color="#444444"))
plt.text(0.12, 0.71, "Higher\nthreshold", color="#333333")
plt.xlabel('False Positive Rate (Fall-Out)')
plt.ylabel('True Positive Rate (Recall)')
plt.grid()
plt.axis([0, 1, 0, 1])
plt.legend(loc="lower right", fontsize=13)
plt.show()
OpenML is an open platform for sharing datasets, algorithms, and experiments - to learn how to learn better, together.
Author: Vincent Sigillito
Source: Obtained from UCI
Please cite: UCI citation policy
Title: Pima Indians Diabetes Database
Sources:
Past Usage:
Smith, J.W., Everhart, J.E., Dickson, W.C., Knowler, W.C., & Johannes, R.S. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Symposium on Computer Applications and Medical Care (pp. 261–265). IEEE Computer Society Press.
The diagnostic, binary-valued variable investigated is whether the patient shows signs of diabetes according to World Health Organization criteria (i.e., if the 2 hour post-load plasma glucose was at least 200 mg/dl at any survey examination or if found during routine medical care). The population lives near Phoenix, Arizona, USA.
Results: Their ADAP algorithm makes a real-valued prediction between 0 and 1. This was transformed into a binary decision using a cutoff of 0.448. Using 576 training instances, the sensitivity and specificity of their algorithm were 76% on the remaining 192 instances.
Relevant Information: Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage. ADAP is an adaptive learning routine that generates and executes digital analogs of perceptron-like devices. It is a unique algorithm; see the paper for details.
Number of Instances: 768
Number of Attributes: 8 plus class
For Each Attribute: (all numeric-valued)
Missing Attribute Values: None
Class Distribution: (class value 1 is interpreted as “tested positive for diabetes”)
| Class Value | Number of instances |
|---|---|
| 0 | 500 |
| 1 | 268 |
Brief statistical analysis:
| Attribute number | Mean | Standard Deviation |
|---|---|---|
| 1 | 3.8 | 3.4 |
| 2 | 120.9 | 32.0 |
| 3 | 69.1 | 19.4 |
| 4 | 20.5 | 16.0 |
| 5 | 79.8 | 115.2 |
| 6 | 32.0 | 7.9 |
| 7 | 0.5 | 0.3 |
| 8 | 33.2 | 11.8 |
Relabeled values in attribute ‘class’ From: 0 To: tested_negative
From: 1 To: tested_positive
Downloaded from openml.org.
from sklearn.datasets import fetch_openml
# Load the Pima Indians Diabetes dataset
pima = fetch_openml(name='diabetes', version=1, as_frame=True)
# Extract the features and target
X = pima.data
y = pima.target
# Convert target labels 'tested_negative' and 'tested_positive' to 0 and 1
y = y.map({'tested_negative': 0, 'tested_positive': 1})
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
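The ROC comparison below uses four fitted models (lr, knn, dt, rf) whose training code is not shown; a plausible sketch with default hyperparameters (an assumption):
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
# Assumed model definitions; the original only uses the fitted objects below
lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)
knn = KNeighborsClassifier().fit(X_train, y_train)
dt = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
rf = RandomForestClassifier(random_state=42).fit(X_train, y_train)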
from sklearn.metrics import roc_auc_score
y_pred_prob_lr = lr.predict_proba(X_test)[:, 1]
y_pred_prob_knn = knn.predict_proba(X_test)[:, 1]
y_pred_prob_dt = dt.predict_proba(X_test)[:, 1]
y_pred_prob_rf = rf.predict_proba(X_test)[:, 1]
# Compute ROC curves
fpr_lr, tpr_lr, _ = roc_curve(y_test, y_pred_prob_lr)
fpr_knn, tpr_knn, _ = roc_curve(y_test, y_pred_prob_knn)
fpr_dt, tpr_dt, _ = roc_curve(y_test, y_pred_prob_dt)
fpr_rf, tpr_rf, _ = roc_curve(y_test, y_pred_prob_rf)
# Compute AUC scores
auc_lr = roc_auc_score(y_test, y_pred_prob_lr)
auc_knn = roc_auc_score(y_test, y_pred_prob_knn)
auc_dt = roc_auc_score(y_test, y_pred_prob_dt)
auc_rf = roc_auc_score(y_test, y_pred_prob_rf)
# Plot ROC curves
plt.figure(figsize=(5, 5)) # plt.figure()
plt.plot(fpr_lr, tpr_lr, color='blue', label=f'Logistic Regression (AUC = {auc_lr:.2f})')
plt.plot(fpr_knn, tpr_knn, color='green', label=f'K-Nearest Neighbors (AUC = {auc_knn:.2f})')
plt.plot(fpr_dt, tpr_dt, color='orange', label=f'Decision Tree (AUC = {auc_dt:.2f})')
plt.plot(fpr_rf, tpr_rf, color='purple', label=f'Random Forest (AUC = {auc_rf:.2f})')
plt.plot([0, 1], [0, 1], color='red', linestyle='--') # Diagonal line for random chance
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves for Logistic Regression, KNN, Decision Tree, and Random Forest')
plt.legend(loc="lower right")
plt.show()
Below is our implementation of logistic regression, trained with gradient descent.
def sigmoid(z):
"""Compute the sigmoid function."""
return 1 / (1 + np.exp(-z))
def cost_function(theta, X, y):
"""
Compute the binary cross-entropy cost.
theta: parameter vector
X: feature matrix (each row is an example)
y: true binary labels (0 or 1)
"""
m = len(y)
h = sigmoid(X.dot(theta))
# Add a small epsilon to avoid log(0)
epsilon = 1e-5
cost = -(1/m) * np.sum(y * np.log(h + epsilon) + (1 - y) * np.log(1 - h + epsilon))
return cost
def gradient(theta, X, y):
"""Compute the gradient of the cost with respect to theta."""
m = len(y)
h = sigmoid(X.dot(theta))
return (1/m) * X.T.dot(h - y)
def logistic_regression(X, y, learning_rate=0.1, iterations=1000):
"""
Train logistic regression using gradient descent.
Returns the optimized parameter vector theta and the history of cost values.
"""
m, n = X.shape
theta = np.zeros(n)
cost_history = []
for i in range(iterations):
theta -= learning_rate * gradient(theta, X, y)
cost_history.append(cost_function(theta, X, y))
return theta, cost_history
def predict_probabilities(theta, X):
"""Return predicted probabilities for the positive class."""
return sigmoid(X.dot(theta))
def compute_roc_curve(y_true, y_scores, thresholds):
tpr_list, fpr_list = [], []
for thresh in thresholds:
# Classify as positive if predicted probability >= threshold
y_pred = (y_scores >= thresh).astype(int)
TP = np.sum((y_true == 1) & (y_pred == 1))
FN = np.sum((y_true == 1) & (y_pred == 0))
FP = np.sum((y_true == 0) & (y_pred == 1))
TN = np.sum((y_true == 0) & (y_pred == 0))
TPR = TP / (TP + FN) if (TP + FN) > 0 else 0
FPR = FP / (FP + TN) if (FP + TN) > 0 else 0
tpr_list.append(TPR)
fpr_list.append(FPR)
tpr_list.sort()
fpr_list.sort()
return np.array(fpr_list), np.array(tpr_list)
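The plotting code further down calls compute_auc, which is not defined above; a minimal sketch using the trapezoidal rule over the (FPR, TPR) points:
def compute_auc(fpr, tpr):
    """Approximate the area under the ROC curve with the trapezoidal rule.
    Assumes fpr is sorted in increasing order, as returned by compute_roc_curve."""
    return np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)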
# Generate synthetic data for binary classification
np.random.seed(0)
m = 100 # number of samples
X = np.random.randn(m, 2)
noise = 0.5 * np.random.randn(m)
# Define labels: a noisy linear combination thresholded at 0
y = (X[:, 0] + X[:, 1] + noise > 0).astype(int)
# Add an intercept term (a column of ones) to X
X_intercept = np.hstack([np.ones((m, 1)), X])
# Train logistic regression model using gradient descent
theta, cost_history = logistic_regression(X_intercept, y, learning_rate=0.1, iterations=1000)
# Compute predicted probabilities for the positive class
y_probs = predict_probabilities(theta, X_intercept)
# Define a set of threshold values between 0 and 1 (e.g., 100 equally spaced thresholds)
thresholds = np.linspace(0, 1, 100)
# Compute the ROC curve (FPR and TPR for each threshold)
fpr, tpr = compute_roc_curve(y, y_probs, thresholds)
auc_value = compute_auc(fpr, tpr)
# Plot the ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='blue', lw=2, label='ROC curve (AUC = %0.2f)' % auc_value)
plt.plot([0, 1], [0, 1], color='gray', lw=1, linestyle='--', label='Random classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()
Sometimes called the holdout method.
Guideline: Typically, allocate 80% of your dataset for training and reserve the remaining 20% for testing.
Training Set: This subset of data is utilized to train your model.
Test Set: This is an independent subset used exclusively at the final stage to assess the model’s performance.
Training Error: The error rate observed when the model is evaluated on the same data it was trained on.
Generalization Error: The error rate observed when the model is evaluated on new, unseen data.
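A minimal sketch of the holdout method, reusing the digits dataset from earlier in the lecture and the 80/20 guideline:
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
X, y = load_digits(return_X_y=True)
# Hold out 20% of the data for the final evaluation; train on the remaining 80%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
training_error = 1 - model.score(X_train, y_train)  # error on the data used for fitting
test_error = 1 - model.score(X_test, y_test)        # estimate of the generalization error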
Marcel Turcotte
School of Electrical Engineering and Computer Science (EECS)
University of Ottawa