CSI 5180 - Machine Learning for Bioinformatics
Version: Feb 6, 2025 10:53
Distinguished Lecture
Leland McInnes, author of UMAP, on April 7, 2025, at 1:30 p.m.
In this lecture, we will introduce concepts essential for understanding machine learning, including the paradigms (types) and tasks (problems).
Let’s start by telling the truth: machines don’t learn. (…) just like artificial intelligence is not intelligence, machine learning is not learning.
Mitchell (1997), page 2
A computer program is said to learn from experience \(E\) with respect to some class of tasks \(T\) and performance measure \(P\), if its performance at tasks in \(T\), as measured by \(P\), improves with experience \(E\).
For instance, in the examples below, the task \(T\) is classifying penguins by species, the performance measure \(P\) is accuracy, and the experience \(E\) is a collection of labelled examples.
There are three distinct types of feedback, giving rise to three learning paradigms:
- Supervised learning: each example is paired with a label.
- Unsupervised learning: the examples are unlabelled.
- Reinforcement learning: the agent receives rewards or penalties for its actions.
Supervised learning is the most extensively studied and arguably the most intuitive type of learning. It is typically the first type of learning introduced in educational contexts.
The data set (“experience”) is a collection of labelled examples, \(\{(x_i, y_i)\}_{i=1}^{N}\), where each \(x_i\) is a feature vector and \(y_i\) is its label.
Problem: Given the data set as input, create a model that can be used to predict the value of \(y\) for an unseen \(x\).
When the label \(y_i\) is a class, taken from a finite list of classes, \(\{1, 2, \ldots, C\}\), we call the task a classification task.
When the label \(y_i\) is a real number, we call the task a regression task.
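To make the distinction concrete, here is a minimal sketch contrasting the two tasks in scikit-learn; the data sets and estimators here are illustrative choices, not from the lecture.

```python
from sklearn.datasets import load_diabetes, load_iris
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification: the label is a class from a finite set {0, ..., C-1}
X_cls, y_cls = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0).fit(X_cls, y_cls)
print(clf.predict(X_cls[:1]))  # e.g. [0], a class index

# Regression: the label is a real number
X_reg, y_reg = load_diabetes(return_X_y=True)
reg = DecisionTreeRegressor(random_state=0).fit(X_reg, y_reg)
print(reg.predict(X_reg[:1]))  # e.g. [151.], a real value
```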
Prediction of Chemical Carcinogenicity in Humans
Additional learning paradigms include self-supervised learning and contrastive learning.
Scikit-learn
Scikit-learn is an open-source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection, model evaluation, and many other utilities.
Scikit-learn provides dozens of built-in machine learning algorithms and models, called estimators.
It is built on NumPy, SciPy, and Matplotlib.
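The prediction and evaluation snippets that follow assume a fitted classifier `clf`, a feature matrix `X`, labels `y`, and class names `target_names`; the corresponding setup is not shown in this section. Here is a minimal sketch of what it might look like — the choice of `DecisionTreeClassifier` and the random seed are assumptions, not confirmed by the source.

```python
import seaborn as sns
from sklearn.tree import DecisionTreeClassifier

# Load the Palmer penguins dataset and keep rows with complete measurements
penguins = sns.load_dataset('penguins')
features = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']
penguins = penguins.dropna(subset=features)

X = penguins[features]   # feature vectors
y = penguins['species']  # labels: Adelie, Chinstrap, or Gentoo
target_names = ['Adelie', 'Chinstrap', 'Gentoo']  # class names for the report

# Fit a classifier on the (entire) data set; clf is reused in the snippets below
clf = DecisionTreeClassifier(random_state=42)  # assumed estimator
clf.fit(X, y)
```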
import pandas as pd
# Creating 2 test examples
column_names = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']
X_test = pd.DataFrame([[34.2, 17.9, 186.8, 2945.0], [51.0, 15.2, 223.7, 5560.0]], columns=column_names)
# Prediction
y_test = clf.predict(X_test)
# Printing the predicted labels for our two examples
print(y_test)
['Adelie' 'Gentoo']
from sklearn.metrics import classification_report, accuracy_score
# Make predictions on the full data set (the same data used to fit clf)
y_pred = clf.predict(X)
# Evaluate the model
accuracy = accuracy_score(y, y_pred)
report = classification_report(y, y_pred, target_names=target_names)
print(f'Accuracy: {accuracy:.2f}')
print('Classification Report:')
print(report)
Accuracy: 1.00
Classification Report:
              precision    recall  f1-score   support

      Adelie       0.99      1.00      1.00       152
   Chinstrap       1.00      1.00      1.00        68
      Gentoo       1.00      0.99      1.00       124

    accuracy                           1.00       344
   macro avg       1.00      1.00      1.00       344
weighted avg       1.00      1.00      1.00       344
We have demonstrated a complete example: creating test examples, predicting their labels, and evaluating the model.
Important
This example is misleading, or even flawed: the model is evaluated on the very same data it was trained on, so the near-perfect accuracy tells us nothing about how well it generalizes to unseen examples.
| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN | 2007 |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 |
| | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | year |
|---|---|---|---|---|---|
| count | 342.000000 | 342.000000 | 342.000000 | 342.000000 | 344.000000 |
| mean | 43.921930 | 17.151170 | 200.915205 | 4201.754386 | 2008.029070 |
| std | 5.459584 | 1.974793 | 14.061714 | 801.954536 | 0.818356 |
| min | 32.100000 | 13.100000 | 172.000000 | 2700.000000 | 2007.000000 |
| 25% | 39.225000 | 15.600000 | 190.000000 | 3550.000000 | 2007.000000 |
| 50% | 44.450000 | 17.300000 | 197.000000 | 4050.000000 | 2008.000000 |
| 75% | 48.500000 | 18.700000 | 213.000000 | 4750.000000 | 2009.000000 |
| max | 59.600000 | 21.500000 | 231.000000 | 6300.000000 | 2009.000000 |
Seaborn
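The report below was produced after holding out part of the data for testing; the splitting code is not shown in this section. Here is a minimal sketch of an assumed pipeline, reusing `X`, `y`, and `target_names` from the sketch above. The split ratio, random seed, and estimator are assumptions, and the exact numbers in the output depend on the split.

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Hold out 20% of the 342 complete examples (69 rows) for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit on the training portion only
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Evaluate on examples the model has never seen
y_pred = clf.predict(X_test)
print(f'Accuracy: {accuracy_score(y_test, y_pred):.2f}')
print('Classification Report:')
print(classification_report(y_test, y_pred, target_names=target_names))
```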
Accuracy: 0.94
Classification Report:
              precision    recall  f1-score   support

      Adelie       0.96      0.90      0.93        30
   Chinstrap       0.94      1.00      0.97        15
      Gentoo       0.92      0.96      0.94        24

    accuracy                           0.94        69
   macro avg       0.94      0.95      0.95        69
weighted avg       0.94      0.94      0.94        69
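The next example repeats the same workflow on Fisher's Iris data set. As before, the fitting step is not shown in this section; here is a minimal sketch of the assumed setup. The estimator choice is an assumption, and this sketch redefines `X`, `y`, and `clf` for the Iris data.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Load the Iris data set: X holds the four measurements, y the class indices
iris = load_iris()
X, y = iris.data, iris.target

# Fit a classifier on the (entire) data set; clf is reused below
clf = DecisionTreeClassifier(random_state=42)  # assumed estimator
clf.fit(X, y)
```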
# Creating 2 test examples
# 'sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'
X_test = [[5.1, 3.5, 1.4, 0.2],[6.7, 3.0, 5.2, 2.3]]
# Prediction
y_test = clf.predict(X_test)
# Printing the predicted labels for our two examples
print(iris.target_names[y_test])
['setosa' 'virginica']
from sklearn.metrics import classification_report, accuracy_score
# Make predictions on the full data set (again, the same data used to fit clf)
y_pred = clf.predict(X)
# Evaluate the model
accuracy = accuracy_score(y, y_pred)
report = classification_report(y, y_pred, target_names=iris.target_names)
print(f'Accuracy: {accuracy:.2f}')
print('Classification Report:')
print(report)
Accuracy: 1.00
Classification Report:
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        50
  versicolor       1.00      1.00      1.00        50
   virginica       1.00      1.00      1.00        50

    accuracy                           1.00       150
   macro avg       1.00      1.00      1.00       150
weighted avg       1.00      1.00      1.00       150
We have demonstrated a complete example: creating test examples, predicting their labels, and evaluating the model.
Important
This example is misleading, or even flawed: once again, the model is evaluated on its own training data, so the perfect scores are not evidence of generalization.
Dataset Description:
Iris plants dataset
--------------------
**Data Set Characteristics:**
:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
- class:
- Iris-Setosa
- Iris-Versicolour
- Iris-Virginica
:Summary Statistics:
============== ==== ==== ======= ===== ====================
Min Max Mean SD Class Correlation
============== ==== ==== ======= ===== ====================
sepal length: 4.3 7.9 5.84 0.83 0.7826
sepal width: 2.0 4.4 3.05 0.43 -0.4194
petal length: 1.0 6.9 3.76 1.76 0.9490 (high!)
petal width: 0.1 2.5 1.20 0.76 0.9565 (high!)
============== ==== ==== ======= ===== ====================
:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988
The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.
This is perhaps the best known database to be found in the
pattern recognition literature. Fisher's paper is a classic in the field and
is referenced frequently to this day. (See Duda & Hart, for example.) The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant. One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.
References
- Fisher, R.A. "The use of multiple measurements in taxonomic problems"
Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
Mathematical Statistics" (John Wiley, NY, 1950).
- Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
(Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
- Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
Structure and Classification Rule for Recognition in Partially Exposed
Environments". IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol. PAMI-2, No. 1, 67-71.
- Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions
on Information Theory, May 1972, 431-433.
- See also: 1988 MLC Proceedings, 54-64. Cheeseman et al.'s AUTOCLASS II conceptual clustering system finds 3 classes in the data.
- Many, many more ...
Feature Names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Pandas
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species |
|---|---|---|---|---|---|
| count | 150.000000 | 150.000000 | 150.000000 | 150.000000 | 150.000000 |
| mean | 5.843333 | 3.057333 | 3.758000 | 1.199333 | 1.000000 |
| std | 0.828066 | 0.435866 | 1.765298 | 0.762238 | 0.819232 |
| min | 4.300000 | 2.000000 | 1.000000 | 0.100000 | 0.000000 |
| 25% | 5.100000 | 2.800000 | 1.600000 | 0.300000 | 0.000000 |
| 50% | 5.800000 | 3.000000 | 4.350000 | 1.300000 | 1.000000 |
| 75% | 6.400000 | 3.300000 | 5.100000 | 1.800000 | 2.000000 |
| max | 7.900000 | 4.400000 | 6.900000 | 2.500000 | 2.000000 |
Seaborn
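The evaluation below uses a held-out test set of 30 examples; the splitting code is not shown in this section. Here is a minimal sketch of the assumed setup; the split ratio, random seed, and estimator are assumptions, and this sketch redefines `clf` so that it is fitted on the training portion only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Hold out 20% of the 150 examples (30 rows) for testing
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

# Fit on the training portion only
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
```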
from sklearn.metrics import classification_report, accuracy_score
# Make predictions on the held-out test set
y_pred = clf.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, target_names=iris.target_names)
print(f'Accuracy: {accuracy:.2f}')
print('Classification Report:')
print(report)
Accuracy: 0.90
Classification Report:
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00         7
  versicolor       0.91      0.83      0.87        12
   virginica       0.83      0.91      0.87        11

    accuracy                           0.90        30
   macro avg       0.91      0.91      0.91        30
weighted avg       0.90      0.90      0.90        30
Marcel Turcotte
School of Electrical Engineering and Computer Science (EECS)
University of Ottawa