Definitions, Paradigms, and Tasks

CSI 5180 - Machine Learning for Bioinformatics

Marcel Turcotte

Version: Feb 6, 2025 10:53

Preamble

Quote of the Day

Mark Your Calendar

Distinguished Lecture

Leland McInnes, author of UMAP, on April 7, 2025, at 1:30 p.m.

Summary

In this lecture, we will introduce concepts essential for understanding machine learning, including the paradigms (types) and tasks (problems).

General Objective:

Describe the fundamental concepts of machine learning

Learning Outcomes

Identify the key components of a learning problem (data, tasks, and performance measures).
Differentiate between major machine learning paradigms (supervised, unsupervised, semi-supervised, and reinforcement).
Recognize common tasks (classification, regression, clustering, dimensionality reduction) and link them to appropriate paradigms.
Outline the two-phase process of model creation (training) and usage (inference).
Appreciate the importance of partitioning data (e.g., train/test splits) for fair model evaluation.

Explainability, causality, bias, etc.

In this video, Barbara Engelhardt elucidates several foundational concepts in machine learning. She begins by contrasting black-box models with open models. A significant critique of certain machine learning models is their opacity, despite their exceptional performance in tasks like classifying novel, unseen examples. These models often fail to provide sufficient information for mechanistic explanations.

The field of explainable AI is highly active, addressing this challenge by developing models that are more interpretable and devising methods to illuminate the workings of black-box models. For instance, AlphaFold demonstrates performance comparable to experimental techniques like X-ray crystallography and nuclear magnetic resonance. However, while it excels in practical applications such as drug design and understanding protein interactions, it does not contribute to the derivation of new physical principles underlying protein folding. In some scenarios, open models are preferable.

Another critical area in machine learning that requires greater emphasis is causality. Although machine learning models excel at pattern recognition and association discovery, their ability to infer causality remains nascent.

The video primarily focuses on the supervised learning paradigm, where each input example is associated with a label. When the label is categorical, the task is termed classification; when it is a continuous value, it is called regression. In Engelhardt’s first example, the task is binary classification, where the label indicates the presence or absence of a disease.

Each example is characterized by multiple attributes, known as features. Engelhardt highlights gene expression, mutations, and tissue types as examples of such features. Another recurring theme is the necessity for a large dataset to achieve robust model performance. The speaker also addresses the concept of bias, specifically technical bias, which arises from experimental conditions, including the specific equipment used for measurements.

Introduction

What Do You Think?

(Burkov 2019)

Let’s start by telling the truth: machines don’t learn. (…) just like artificial intelligence is not intelligence, machine learning is not learning.

Rationale

Why a computer program should learn?

Definition

Mitchell (1997), page 2

A computer program is said to learn from experience \(E\) with respect to some class of tasks \(T\) and performance measure \(P\), if its performance at tasks in \(T\), as measured by \(P\), improves with experience \(E\).

Concepts

Learning Types (Paradigms)

There are three (3) distinct types of feedback:

Unsupervised Learning: No feedback is provided to the algorithm.
Supervised Learning: Each example is accompanied by a label.
Reinforcement Learning: The algorithm receives a reward or a punishment following each action.

Supervised learning is the most extensively studied and arguably the most intuitive type of learning. It is typically the first type of learning introduced in educational contexts.

Two phases

Learning (building a model)
Inference (using the model)

Learning (Building a Model)

Inference (Using a Model)

Formal definitions

Supervised Learning (Notation)

The data set (“experience”) is a collection of labelled examples.

\(\{(x_i, y_i)\}_{i=1}^N\)
- Each \(x_i\) is a feature (attribute) vector with \(D\) dimensions.
- \(x^{(j)}_i\) is the value of the feature \(j\) of the example \(i\), for \(j \in 1 \ldots D\) and \(i \in 1 \ldots N\).
- The label \(y_i\) is either a class, taken from a finite list of classes, \(\{1, 2, \ldots, C\}\), or a real number, or a complex object (tree, graph, etc.).

Problem: Given the data set as input, create a model that can be used to predict the value of \(y\) for an unseen \(x\).

Supervised learning (notation, contd)

When the label \(y_i\) is a class, taken from a finite list of classes, \(\{1, 2, \ldots, C\}\), we call the task a classification task.
When the label \(y_i\) is a real number, we call the task a regression task.

Supervised Learning - An Example

Prediction of Chemical Carcinogenicity in Human

Input is a list of chemical compounds with information about their carcinogenicity.
- Each compound is represented as a feature vector: electronegativity, octanol-water partition, molecular weight, Pka, volume, dipole, etc.
Label
- Classification: \(y_i \in \{\text{Carcinogenic}, \text{Not carcinogenic}\}\)
- Regression: \(y_i\) is a real number

Supervised Learning - Classification

Classification
- Binary classification: two classes, positive and negative, for instance.
- Multi-class: each example belongs to one-and-only-one class. However, there are multiple classes.
- Multi-label: each example belongs to one or more classes.

Unsupervised Learning

The data set (“experience”) is a collection of unlabelled examples.
- \(\{(x_i)\}_{i=1}^N\)
  - Each \(x_i\) is a feature (attribute) vector with \(D\) dimensions.
  - \(x_k^{(j)}\) is the value of the feature \(j\) of the example \(k\), for \(j \in 1 \ldots D\) and \(k \in 1 \ldots N\).
Problem: given the data set as input, create a “model” that captures relationships in the data. In clustering, the task is to assign each example to a cluster. In dimensionality reduction, the task is to reduce the number of features in the input space.

Unsupervised Learning - Problems

Clustering
- K-Means, DBSCAN, hierarchical
Anomaly detection
- One-class SVM
Dimensionality reduction
- Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE), Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP)

Clustering: Grouping a set of objects in such a way that objects in the same group (or cluster) are more similar than those in other groups. Examples include K-means clustering and hierarchical clustering.
Dimensionality Reduction: Reducing the number of random variables under consideration by obtaining a set of principal variables. Techniques include Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE).
Anomaly Detection: Identifying rare items, events, or observations that raise suspicions by differing significantly from the majority of the data. This is often used in fraud detection and network security.
Association Rule Learning: Discovering interesting relations between variables in large databases. A well-known example is the Apriori algorithm used for market basket analysis.
Density Estimation: Determining the probability distribution for a dataset, which can be useful in identifying the data structure and anomalies.

Semi-supervised Learning

The data set (“experience”) is a collection of labelled and unlabelled examples.
- Generally, there are many more unlabelled examples than labelled examples. Presumably, the cost of labelling examples is high.
Problem: given the data set as input, create a “model” that can be used to predict the value of \(y\) for an unseen \(x\). The goal is the same as for supervised learning. Having access to more examples is expected to help the algorithm.

Reinforcement Learning

In reinforcement learning, the agent “lives” in an environment.
The state of the environment is represented as a feature vector.
The agent is capable of actions that (possibly) change the state of the environment.
Each action brings a reward (or punishment).
Problem: learn a policy (a model) that takes as input a feature vector representing the environment and produces as output the optimal action—the action that maximizes the expected average reward.

Others

Additional learning paradigms encompass self-supervised learning and contrastive learning.

Supervised Learning

`Scikit-learn`

scikit-learn.org

Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection, model evaluation, and many other utilities.

Scikit-learn provides dozens of built-in machine learning algorithms and models, called estimators.

Built on NumPy, SciPy, and matplotlib.

`Scikit-learn`

Decisionn Trees in Bioinformatics

Interpretable Models
Ensemble Learning: Random Forest
Next lecture

Interpretable Models

Interpretable Models

Example: Palmer Pinguins Dataset

Example: In Case of a Missing Library

try:
  from palmerpenguins import load_penguins
except:
  ! pip install palmerpenguins
  from palmerpenguins import load_penguins

Example: Loading the Data

# It is customary to use X and y for the data and labels

X, y = load_penguins(return_X_y = True)

Example: Using a DecisionTree

from sklearn import tree

clf = tree.DecisionTreeClassifier(random_state=42)

Example: Training

# Training

clf = clf.fit(X, y)

All the classifiers inherit from sklearn.base.BaseEstimator and sklearn.base.ClassifierMixin. Accordingly, all the classifiers implement fit, predict, and score.

The DecisionTreeClassifier in scikit-learn constructs a decision tree by recursively splitting the dataset into subsets based on the attribute that results in the highest information gain (e.g., Gini impurity, entropy). Here is a concise description of the process:

Initialization: The algorithm starts with the entire dataset as the root node.
Splitting Criteria: For each node, it evaluates all possible splits across all attributes to find the one that best separates the classes. This is typically done by minimizing a criterion such as Gini impurity or entropy.
Recursive Splitting: The dataset is divided into subsets based on the selected attribute and threshold, creating child nodes. This process is repeated recursively for each child node.
Stopping Conditions: Splitting stops when a predefined criterion is reached, such as a maximum tree depth, a minimum number of samples per leaf, or if further splitting does not significantly improve information gain.
Terminal Nodes: Once splitting is complete, each terminal node is assigned a class label based on the majority class of the samples in that node.

The resulting tree can then be used to classify new samples by traversing from the root to a terminal node, following the decision rules defined at each node.

Example: Visualizing the tree (1/2)

import matplotlib.pyplot as plt

tree.plot_tree(clf)
plt.show()

Example: Visualizing the tree (2/2)

target_names = ['Adelie','Chinstrap','Gentoo']

tree.plot_tree(clf, 
               feature_names = X.columns,
               class_names = target_names,
               label = 'none',
               filled = True)
plt.show()

Example: Prediction

import pandas as pd

# Creating 2 test examples

columns_names = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']
X_test = pd.DataFrame([[34.2, 17.9, 186.8, 2945.0], [51.0, 15.2, 223.7, 5560.0]], columns=columns_names)

# Prediction

y_test = clf.predict(X_test)

# Printing the predicted labels for our two examples

print(y_test)

['Adelie' 'Gentoo']

Example: Complete

X, y = load_penguins(return_X_y = True)
clf = tree.DecisionTreeClassifier(random_state=123)
clf = clf.fit(X, y)
tree.plot_tree(clf)
X_test = pd.DataFrame([[34.2, 17.9, 186.8, 2945.0], [51.0, 15.2, 223.7, 5560.0]], columns=columns_names)
y_test = clf.predict(X_test)
print(y_test)

['Adelie' 'Gentoo']

Example: Performance

from sklearn.metrics import classification_report, accuracy_score

# Make predictions

y_pred = clf.predict(X)

# Evaluate the model

accuracy = accuracy_score(y, y_pred)
report = classification_report(y, y_pred, target_names=target_names)

print(f'Accuracy: {accuracy:.2f}')
print('Classification Report:')
print(report)

Example: Performance

Accuracy: 1.00
Classification Report:
              precision    recall  f1-score   support

      Adelie       0.99      1.00      1.00       152
   Chinstrap       1.00      1.00      1.00        68
      Gentoo       1.00      0.99      1.00       124

    accuracy                           1.00       344
   macro avg       1.00      1.00      1.00       344
weighted avg       1.00      1.00      1.00       344

Example: Discussion

We have demonstrated a complete example:

Loading the data
Selecting a classifier
Training the model
Visualizing the model
Making a prediction

Example: Wait a Minute!

from sklearn.metrics import classification_report, accuracy_score

# Make predictions

y_pred = clf.predict(X)

# Evaluate the model

accuracy = accuracy_score(y, y_pred)
report = classification_report(y, y_pred, target_names=target_names)

print(f'Accuracy: {accuracy:.2f}')
print('Classification Report:')
print(report)

Important

This example is misleading, or even flawed!

Example: Exploration

penguins = load_penguins()

type(penguins)

pandas.core.frame.DataFrame

penguins.head()

Example: Exploration

	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
0	Adelie	Torgersen	39.1	18.7	181.0	3750.0	male	2007
1	Adelie	Torgersen	39.5	17.4	186.0	3800.0	female	2007
2	Adelie	Torgersen	40.3	18.0	195.0	3250.0	female	2007
3	Adelie	Torgersen	NaN	NaN	NaN	NaN	NaN	2007
4	Adelie	Torgersen	36.7	19.3	193.0	3450.0	female	2007

Example: Exploration

penguins.describe()

Example: Exploration

	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	year
count	342.000000	342.000000	342.000000	342.000000	344.000000
mean	43.921930	17.151170	200.915205	4201.754386	2008.029070
std	5.459584	1.974793	14.061714	801.954536	0.818356
min	32.100000	13.100000	172.000000	2700.000000	2007.000000
25%	39.225000	15.600000	190.000000	3550.000000	2007.000000
50%	44.450000	17.300000	197.000000	4050.000000	2008.000000
75%	48.500000	18.700000	213.000000	4750.000000	2009.000000
max	59.600000	21.500000	231.000000	6300.000000	2009.000000

Example: Using `Seaborn`

import seaborn as sns

# Pairplot using seaborn

sns.pairplot(penguins, hue='species', markers=["o", "s", "D"])
plt.suptitle("Pairwise Scatter Plots of Penguins Features")
plt.show()

Example: Using `Seaborn`

Example: Training and Test Set

from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)

Example: Creating a New Classifier

clf = tree.DecisionTreeClassifier()

Example: Training the New Classifier

clf.fit(X_train, y_train)

DecisionTreeClassifier()

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Example: Visualizing the Tree

tree.plot_tree(clf, 
               feature_names = X.columns,
               class_names = target_names,
               label = 'none',
               filled = True)
plt.show()

Example: Making Predictions

# Make predictions
y_pred = clf.predict(X_test)

Example: Measuring the Performance

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, target_names=target_names)

print(f'Accuracy: {accuracy:.2f}')
print('Classification Report:')
print(report)

Example: Measuring the Performance

Accuracy: 0.94
Classification Report:
              precision    recall  f1-score   support

      Adelie       0.96      0.90      0.93        30
   Chinstrap       0.94      1.00      0.97        15
      Gentoo       0.92      0.96      0.94        24

    accuracy                           0.94        69
   macro avg       0.94      0.95      0.95        69
weighted avg       0.94      0.94      0.94        69

Summary

We introduced relevant terminology.
Next, we will explore a complete example using scikit-learn.
We performed a simple exploration of our data.
Finally, we recognized the necessity of an independent test set to accurately measure performance.

Prologue

Next Lecture

Detailed presentation of decision trees

Appendix: Iris

Example: iris data set

Example: loading the Data

from sklearn.datasets import load_iris

# Load the Iris dataset

iris = load_iris()

Example: Using a DecisionTree

from sklearn import tree

clf = tree.DecisionTreeClassifier()

Example: Training

# It is customary to use X and y for the data and labels

X, y = iris.data, iris.target

# Training

clf = clf.fit(X, y)

All the classifiers inherit from sklearn.base.BaseEstimator and sklearn.base.ClassifierMixin. Accordingly, all the classifiers implement fit, predict, and score.

Le DecisionTreeClassifier de scikit-learn construit un arbre de décision en divisant récursivement l’ensemble de données en sous-ensembles, basé sur l’attribut qui résulte dans le gain d’information le plus élevé (par exemple, l’impureté de Gini, l’entropie). Voici une description concise du processus :

Initialisation : L’algorithme commence avec l’ensemble de données entier comme nœud racine.
Critères de division : Pour chaque nœud, il évalue toutes les divisions possibles à travers toutes les attributs pour trouver celle qui sépare le mieux les classes. Cela est généralement fait en minimisant un critère comme l’impureté de Gini ou l’entropie.
Division récursive : L’ensemble de données est divisé en sous-ensembles basés sur l’attribut et le seuil sélectionnés, créant des nœuds enfants. Ce processus est répété récursivement pour chaque nœud enfant.
Conditions d’arrêt : La division s’arrête lorsqu’un critère prédéfini est atteint, comme une profondeur maximale de l’arbre, un nombre minimum d’échantillons par feuille, ou si une division supplémentaire n’améliore pas significativement le gain d’information.
Nœuds terminaux : Une fois la division terminée, chaque nœud terminal se voit attribuer une étiquette de classe basée sur la classe majoritaire des échantillons dans ce nœud.

L’arbre résultant peut ensuite être utilisé pour classifier de nouveaux échantillons en parcourant de la racine à un nœud terminal, en suivant les règles de décision définies à chaque nœud.

Example: Visualizing the tree (1/2)

import matplotlib.pyplot as plt

tree.plot_tree(clf)
plt.show()

Example: Visualizing the tree (2/2)

tree.plot_tree(clf, 
               feature_names=iris.feature_names, 
               class_names=iris.target_names,
               label='none',
               filled=True)
plt.show()

In a DecisionTreeClassifier, each internal node of the tree represents a decision based on a feature, each branch represents the outcome of that decision, and each leaf node represents a class label. The decision tree makes predictions by traversing from the root to a leaf node, following the decision rules defined at each node. Decision trees are intuitive and easy to interpret but can be prone to overfitting if not properly regulated.

To build a decision tree, the method fit follows these steps:

Select the Best Attribute: Choose the attribute that best splits the data based on a criterion like Entropy, Gini Index (default), or Log Loss.
Create a Node: Make this attribute the root node of the tree, and create branches for each possible value of the attribute.
Split the Dataset: Divide the dataset into subsets, one for each branch, based on the attribute’s values.
Repeat Recursively: For each subset, repeat steps 1-3 using only the data in that subset and excluding the attribute used at the parent node.
Stop Conditions: Stop the recursion when one of the following conditions is met:
- All instances in a subset belong to the same class.
- No more attributes are available for splitting.
- A predefined depth limit or minimum number of instances per node is reached.
Assign Labels: For each leaf node, assign a class label based on the majority class of instances in that subset.

This process results in a tree where each path from the root to a leaf represents a classification rule.

In the figure above, the decision nodes contain the following information. - The decision rule, e.g. petal width (cm) <= 0.8 - The Geni score. - The number of examples in the subset corresponding to this node of the tree. - The number of examples for each of the classes, in the subset corresponding to this node of the tree. - A prediction.

Decision trees are constructed by incrementally adding decision nodes, guided by labeled training examples to determine optimal splits. An effective decision rule ideally segregates the training examples perfectly into their respective classes. For instance, the rule petal width (cm) <= 0.8 exemplifies this: when the rule holds true (left child), all instances are classified as Setosa. Conversely, when the rule does not hold (right child), the subset contains only Versicolor and Virginica, with no Setosa instances. In essence, a good decision rule is one that significantly reduces entropy.

See: - Kingsford and Salzberg (2008), you can access the paper here, html or PDF, from a computer with a uOttawa IP address.

Example: Prediction

# Creating 2 test examples
# 'sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'

X_test = [[5.1, 3.5, 1.4, 0.2],[6.7, 3.0, 5.2, 2.3]]

# Prediction

y_test = clf.predict(X_test)

# Printing the predicted labels for our two examples

print(iris.target_names[y_test])

['setosa' 'virginica']

Example: Complete

iris = load_iris()
clf = tree.DecisionTreeClassifier()
X, y = iris.data, iris.target
clf = clf.fit(X, y)
tree.plot_tree(clf)
X_test = [[5.1, 3.5, 1.4, 0.2],[6.7, 3.0, 5.2, 2.3]]
y_test = clf.predict(X_test)
print(iris.target_names[y_test])

['setosa' 'virginica']

Example: Performance

from sklearn.metrics import classification_report, accuracy_score

# Make predictions

y_pred = clf.predict(X)

# Evaluate the model

accuracy = accuracy_score(y, y_pred)
report = classification_report(y, y_pred, target_names=iris.target_names)

print(f'Accuracy: {accuracy:.2f}')
print('Classification Report:')
print(report)

Example: Performance

Accuracy: 1.00
Classification Report:
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        50
  versicolor       1.00      1.00      1.00        50
   virginica       1.00      1.00      1.00        50

    accuracy                           1.00       150
   macro avg       1.00      1.00      1.00       150
weighted avg       1.00      1.00      1.00       150

Example: Discussion

We have demonstrated a complete example:

Loading the data
Selecting a classifier
Training the model
Visualizing the model
Making a prediction

Example: Take 2

from sklearn.metrics import classification_report, accuracy_score

# Make predictions

y_pred = clf.predict(X)

# Evaluate the model

accuracy = accuracy_score(y, y_pred)
report = classification_report(y, y_pred, target_names=iris.target_names)

print(f'Accuracy: {accuracy:.2f}')
print('Classification Report:')
print(report)

Important

This example is misleading, or even flawed!

Example: Exploration

print(f'Dataset Description:\n{iris["DESCR"]}\n')

Dataset Description:
.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
    - sepal length in cm
    - sepal width in cm
    - petal length in cm
    - petal width in cm
    - class:
            - Iris-Setosa
            - Iris-Versicolour
            - Iris-Virginica

:Summary Statistics:

============== ==== ==== ======= ===== ====================
                Min  Max   Mean    SD   Class Correlation
============== ==== ==== ======= ===== ====================
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
============== ==== ==== ======= ===== ====================

:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.

This is perhaps the best known database to be found in the
pattern recognition literature.  Fisher's paper is a classic in the field and
is referenced frequently to this day.  (See Duda & Hart, for example.)  The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant.  One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

.. dropdown:: References

  - Fisher, R.A. "The use of multiple measurements in taxonomic problems"
    Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
    Mathematical Statistics" (John Wiley, NY, 1950).
  - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
    (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
  - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
    Structure and Classification Rule for Recognition in Partially Exposed
    Environments".  IEEE Transactions on Pattern Analysis and Machine
    Intelligence, Vol. PAMI-2, No. 1, 67-71.
  - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions
    on Information Theory, May 1972, 431-433.
  - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II
    conceptual clustering system finds 3 classes in the data.
  - Many, many more ...

Example: Exploration

print(f'Feature Names: {iris.feature_names}')

Feature Names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

print(f'Target Names: {iris.target_names}')

Target Names: ['setosa' 'versicolor' 'virginica']

print(f'Data Shape: {iris.data.shape}')

Data Shape: (150, 4)

print(f'Target Shape: {iris.target.shape}')

Target Shape: (150,)

Example: Using `Pandas` (continued)

import pandas as pd

# Create a DataFrame

df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target

Example: Using `Pandas` (continued)

# Display the first few rows of the DataFrame

print(df.head())

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   

   species  
0        0  
1        0  
2        0  
3        0  
4        0

Example: Using `Pandas` (continued)

# Summary statistics

print(df.describe())

       sepal length (cm)  sepal width (cm)  petal length (cm)  \
count         150.000000        150.000000         150.000000   
mean            5.843333          3.057333           3.758000   
std             0.828066          0.435866           1.765298   
min             4.300000          2.000000           1.000000   
25%             5.100000          2.800000           1.600000   
50%             5.800000          3.000000           4.350000   
75%             6.400000          3.300000           5.100000   
max             7.900000          4.400000           6.900000   

       petal width (cm)     species  
count        150.000000  150.000000  
mean           1.199333    1.000000  
std            0.762238    0.819232  
min            0.100000    0.000000  
25%            0.300000    0.000000  
50%            1.300000    1.000000  
75%            1.800000    2.000000  
max            2.500000    2.000000

Example: Using `Seaborn`

import seaborn as sns

# Map target values to species names

df['species'] = df['species'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})

# Pairplot using seaborn

sns.pairplot(df, hue='species', markers=["o", "s", "D"])
plt.suptitle("Pairwise Scatter Plots of Iris Features", y=1.02)
plt.show()

Example: Using `Seaborn`

Example: Training and test set

from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)

Example: Creating a new classifier

# Train the model
clf = tree.DecisionTreeClassifier()

Example: Training the new classifier

# Train the model
clf.fit(X_train, y_train)

DecisionTreeClassifier()

Example: Making predictions

# Make predictions
y_pred = clf.predict(X_test)

Example: measuring the performance

from sklearn.metrics import classification_report, accuracy_score
# Make predictions

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, target_names=iris.target_names)

print(f'Accuracy: {accuracy:.2f}')
print('Classification Report:')
print(report)

Example: measuring the performance

Accuracy: 0.90
Classification Report:
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00         7
  versicolor       0.91      0.83      0.87        12
   virginica       0.83      0.91      0.87        11

    accuracy                           0.90        30
   macro avg       0.91      0.91      0.91        30
weighted avg       0.90      0.90      0.90        30

References

Burkov, Andriy. 2019. The Hundred-Page Machine Learning Book. Andriy Burkov.

Deisenroth, Marc Peter, A. Aldo Faisal, and Cheng Soon Ong. 2020. Mathematics for Machine Learning. Cambridge University Press. https://doi.org/10.1017/9781108679930.

Géron, Aurélien. 2022. Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow. 3rd ed. O’Reilly Media, Inc.

Geurts, Pierre, Alexandre Irrthum, and Louis Wehenkel. 2009. “Supervised Learning with Decision Tree-Based Methods in Computational and Systems Biology.” Molecular bioSystems 5 (12): 1593–1605. https://doi.org/10.1039/b907946g.

Kingsford, C, and Steven L Salzberg. 2008. “What Are Decision Trees?” Nature Biotechnology 26 (9): 1011–13. https://doi.org/10.1038/nbt0908-1011.

Mitchell, Tom M. 1997. Machine Learning. New York: McGraw-Hill.

Schietgat, Leander, Celine Vens, Jan Struyf, Hendrik Blockeel, Dragi Kocev, and Saso Dzeroski. 2010. “Predicting Gene Function Using Hierarchical Multi-Label Decision Tree Ensembles.” BMC Bioinformatics 11 (1): 2. https://doi.org/10.1186/1471-2105-11-2.

Marcel Turcotte

Marcel.Turcotte@uOttawa.ca

School of Electrical Engineering and Computer Science (EECS)

University of Ottawa

Definitions, Paradigms, and Tasks

Preamble

Quote of the Day

Mark Your Calendar

Summary

General Objective:

Learning Outcomes

Explainability, causality, bias, etc.

Introduction

What Do You Think?

Rationale

Definition

Concepts

Concepts

Learning Types (Paradigms)

Two phases

Learning (Building a Model)

Inference (Using a Model)

Formal definitions

Supervised Learning (Notation)

Supervised learning (notation, contd)

Supervised Learning - An Example

Supervised Learning - Classification

Unsupervised Learning

Unsupervised Learning - Problems

Semi-supervised Learning

Reinforcement Learning

Others

Supervised Learning

Scikit-learn

Scikit-learn

Decisionn Trees in Bioinformatics

Interpretable Models

Interpretable Models

Example: Palmer Pinguins Dataset

Example: In Case of a Missing Library

Example: Loading the Data

Example: Using a DecisionTree

Example: Training

Example: Visualizing the tree (1/2)

Example: Visualizing the tree (2/2)

Example: Prediction

Example: Complete

Example: Performance

Example: Performance

Example: Discussion

Example: Wait a Minute!

Example: Exploration

Example: Exploration

Example: Exploration

Example: Exploration

Example: Using Seaborn

Example: Using Seaborn

Example: Training and Test Set

Example: Creating a New Classifier

Example: Training the New Classifier

Example: Visualizing the Tree

Example: Making Predictions

Example: Measuring the Performance

Example: Measuring the Performance

Summary

Prologue

Further readings (1/3)

Further readings (2/3)

Further readings (3/3)

Next Lecture

Appendix: Iris

Example: iris data set

Example: loading the Data

Example: Using a DecisionTree

Example: Training

Example: Visualizing the tree (1/2)

Example: Visualizing the tree (2/2)

Example: Prediction

Example: Complete

Example: Performance

Example: Performance

Example: Discussion

Example: Take 2

Example: Exploration

`Scikit-learn`

`Scikit-learn`

Example: Using `Seaborn`

Example: Using `Seaborn`

Example: Using `Pandas` (continued)

Example: Using `Pandas` (continued)

Example: Using `Pandas` (continued)

Example: Using `Seaborn`

Example: Using `Seaborn`