This lecture explores decision trees, emphasizing their interpretability as a key advantage. Decision trees handle diverse feature types, making it straightforward to integrate heterogeneous data, and their predictive accuracy improves markedly when they are combined in ensemble learning frameworks.
General objective:
Explain what decision trees are, how they are built, and how they can be used to classify data.
Learning Outcomes
Describe the basic structure of a decision tree and how it processes data.
Explain the concepts of impurity, overfitting, and regularization in the context of decision trees.
Recognize how decision trees can be combined in ensemble methods (e.g., random forests) for improved stability and accuracy.
Demonstrate the application of decision trees to real-world bioinformatics tasks (e.g., single-cell classification).
Decision Trees
Rationale
Understanding foundational learning algorithms is crucial for contextualizing advanced topics.
Many learning algorithms rely on gradient-based optimization; decision trees do not, which makes familiarity with a diverse range of algorithms essential.
Fundamental concepts, including generalization, underfitting, overfitting, regularization, and ensemble learning, will be explored in greater detail in subsequent discussions.
Rationale
Decision Trees are algorithms employed in supervised learning paradigms.
Applications:
Classification: Assigns instances to predefined classes based on input features (\(y_i\) is a class).
Regression: Predicts continuous output values (\(y_i\) is a real value).
Rationale
Decision trees are essential components of Random Forest algorithms, which are particularly effective with small datasets.
Produce models that are easily interpretable by humans.
Rationale
Handle both categorical and continuous features, with certain implementations accommodating missing data.
As non-parametric models, they do not assume any specific data distribution, offering flexibility across diverse data types and distributions.
Capable of modeling complex non-linear relationships between features and target variables without requiring explicit data transformations.
Interpretable Models
What is a Decision Tree?
A decision tree is a hierarchical structure represented as a directed acyclic graph, utilized for classification and regression tasks.
Each internal node conducts a binary test on a specific feature (\(j\)), such as determining if the expression level of a gene in a sample exceeds a defined threshold.
The leaves are decision nodes: each leaf assigns a class label (or an output value).
The tree’s structure is inferred (learnt) from the training data.
What is a Decision Tree?
Decision trees can extend beyond binary splits, as exemplified by algorithms like ID3, which accommodate nodes with multiple children.
Classifying New Instances (Inference)
Begin at the root node of the decision tree. Proceed by answering a sequence of binary questions until a leaf node is reached. The label associated with this leaf denotes the classification of the instance.
Alternatively, some implementations store a probability distribution at each leaf: for every class \(k\), the fraction of the training samples reaching that leaf that belong to class \(k\).
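A minimal sketch of this traversal on a hand-built toy tree (the features, thresholds, and leaf probabilities are invented for illustration); each internal node answers one binary test until a leaf is reached:

```python
# A minimal, illustrative sketch of inference in a decision tree.
# The features, thresholds, and leaf probabilities below are invented
# for illustration; real trees are learnt from the training data.

class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None,
                 class_probs=None):
        self.feature = feature          # index j of the feature tested at this node
        self.threshold = threshold      # binary test: x[j] <= threshold ?
        self.left = left                # subtree followed when the test is true
        self.right = right              # subtree followed when the test is false
        self.class_probs = class_probs  # distribution stored only at leaves

def predict_proba(node, x):
    """Walk from the root to a leaf, answering one binary test per node."""
    while node.class_probs is None:     # not a leaf yet
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.class_probs             # class distribution stored at the leaf

# Toy tree on two (hypothetical) gene-expression features
tree = Node(feature=0, threshold=3.2,
            left=Node(class_probs={"healthy": 0.9, "tumour": 0.1}),
            right=Node(feature=1, threshold=7.5,
                       left=Node(class_probs={"healthy": 0.3, "tumour": 0.7}),
                       right=Node(class_probs={"healthy": 0.05, "tumour": 0.95})))

x = [4.1, 8.0]                  # expression levels of gene 0 and gene 1
print(predict_proba(tree, x))   # {'healthy': 0.05, 'tumour': 0.95}
```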
Decision Boundary
Palmer Penguins Dataset
```python
# Loading our dataset
try:
    from palmerpenguins import load_penguins
except:
    !pip install palmerpenguins
    from palmerpenguins import load_penguins

penguins = load_penguins()

# Pairplot using seaborn
import matplotlib.pyplot as plt
import seaborn as sns

sns.pairplot(penguins, hue='species', markers=["o", "s", "D"])
plt.suptitle("Pairwise Scatter Plots of Penguins Features")
plt.show()
```
Palmer Penguins Dataset
Binary Classification Problem
Several scatter plots reveal a distinct clustering of Gentoo instances.
To illustrate the next example, we set up a binary classification problem: Gentoo versus non-Gentoo.
Our analysis will concentrate on two key features: body mass and bill depth.
Definition
A decision boundary is a “boundary” that partitions the underlying feature space into regions corresponding to different class labels.
Decision Boundary
The decision boundary between these attributes can be represented as a line.
Code
```python
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

try:
    from palmerpenguins import load_penguins
except:
    !pip install palmerpenguins
    from palmerpenguins import load_penguins

# Load the Palmer Penguins dataset
df = load_penguins()

# Preserve only the necessary features: 'bill_depth_mm' and 'body_mass_g'
features = ['bill_depth_mm', 'body_mass_g']
df = df[features + ['species']]

# Drop rows with missing values
df.dropna(inplace=True)

# Create a binary problem: 'Gentoo' vs 'Not Gentoo'
df['species_binary'] = df['species'].apply(lambda x: 1 if x == 'Gentoo' else 0)

# Define feature matrix X and target vector y
X = df[features].values
y = df['species_binary'].values

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Function to plot initial scatter of data
def plot_scatter(X, y):
    plt.figure(figsize=(9, 5))
    plt.scatter(X[y == 1, 0], X[y == 1, 1], color='orange', edgecolors='k',
                marker='o', label='Gentoo')
    plt.scatter(X[y == 0, 0], X[y == 0, 1], color='blue', edgecolors='k',
                marker='o', label='Not Gentoo')
    plt.xlabel('Bill Depth (mm)')
    plt.ylabel('Body Mass (g)')
    plt.title('Scatter Plot of Bill Depth vs. Body Mass')
    plt.legend()
    plt.show()

# Plot the initial scatter plot
plot_scatter(X_train, y_train)
```
Decision Boundary
The decision boundary between these attributes can be represented as a line.
Code
```python
# Train a logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Function to plot decision boundary
def plot_decision_boundary(X, y, model):
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                         np.arange(y_min, y_max, 0.1))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.figure(figsize=(9, 5))
    plt.contourf(xx, yy, Z, alpha=0.3, cmap='RdYlBu')
    plt.scatter(X[y == 1, 0], X[y == 1, 1], color='orange', edgecolors='k',
                marker='o', label='Gentoo')
    plt.scatter(X[y == 0, 0], X[y == 0, 1], color='blue', edgecolors='k',
                marker='o', label='Not Gentoo')
    plt.xlabel('Bill Depth (mm)')
    plt.ylabel('Body Mass (g)')
    plt.title('Logistic Regression Decision Boundary')
    plt.legend()
    plt.show()

# Plot the decision boundary on the training set
plot_decision_boundary(X_train, y_train, model)
```
Decision Boundary
Definition
We say that the data is linearly separable when two classes of data can be perfectly separated by a single linear boundary, such as a line in two-dimensional space or a hyperplane in higher dimensions.
Simple Decision Boundary
(a) training data, (b) quadratic curve, and (c) linear function.
Decision trees are capable of generating irregular and non-linear decision boundaries.
Attribution: ibidem.
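To contrast the linear boundary above with a tree's boundary, the following sketch (assuming the X_train and y_train arrays from the earlier decision-boundary cells are still in memory) fits a shallow DecisionTreeClassifier on the same two features and plots its axis-aligned, piecewise-constant decision boundary:

```python
# Sketch: a decision tree's axis-aligned, piecewise-constant boundary.
# Assumes X_train, y_train from the earlier logistic-regression cells.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier

tree_clf = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_clf.fit(X_train, y_train)

# Evaluate the tree on a grid covering the feature space
x_min, x_max = X_train[:, 0].min() - 1, X_train[:, 0].max() + 1
y_min, y_max = X_train[:, 1].min() - 100, X_train[:, 1].max() + 100
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 300),
                     np.linspace(y_min, y_max, 300))
Z = tree_clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.figure(figsize=(9, 5))
plt.contourf(xx, yy, Z, alpha=0.3, cmap='RdYlBu')
plt.scatter(X_train[y_train == 1, 0], X_train[y_train == 1, 1],
            color='orange', edgecolors='k', label='Gentoo')
plt.scatter(X_train[y_train == 0, 0], X_train[y_train == 0, 1],
            color='blue', edgecolors='k', label='Not Gentoo')
plt.xlabel('Bill Depth (mm)')
plt.ylabel('Body Mass (g)')
plt.title('Decision Tree Decision Boundary (axis-aligned splits)')
plt.legend()
plt.show()
```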
Decision Boundary
Code
```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Function to generate points
def generate_points_above_below_plane(num_points=100):
    # Define the plane z = ax + by + c
    a, b, c = 1, 1, 0  # Plane coefficients
    # Generate random points
    x1 = np.random.uniform(-10, 10, num_points)
    x2 = np.random.uniform(-10, 10, num_points)
    y1 = np.random.uniform(-10, 10, num_points)
    y2 = np.random.uniform(-10, 10, num_points)
    # Points above the plane
    z_above = a * x1 + b * y1 + c + np.random.normal(20, 2, num_points)
    # Points below the plane
    z_below = a * x2 + b * y2 + c - np.random.normal(20, 2, num_points)
    # Stack the points into arrays
    points_above = np.vstack((x1, y1, z_above)).T
    points_below = np.vstack((x2, y2, z_below)).T
    return points_above, points_below

# Generate points
points_above, points_below = generate_points_above_below_plane()

# Visualization
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')

# Plot points above the plane
ax.scatter(points_above[:, 0], points_above[:, 1], points_above[:, 2],
           c='r', label='Positive')

# Plot points below the plane
ax.scatter(points_below[:, 0], points_below[:, 1], points_below[:, 2],
           c='b', label='Negative')

# Plot the plane itself for reference
xx, yy = np.meshgrid(range(-10, 11), range(-10, 11))
zz = 1 * xx + 1 * yy + 0
ax.plot_surface(xx, yy, zz, alpha=0.2, color='gray')

# Set labels
ax.set_xlabel('X1')
ax.set_ylabel('X2')
ax.set_zlabel('X3')
ax.view_init(elev=-90, azim=90)

# Set title and legend
ax.set_title('Binary classification')
ax.legend()

# Show plot
plt.show()
```
Separating the data using only these two attributes, \(x_1\) and \(x_2\), is infeasible.
Decision Boundary
Code
```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Function to generate points
def generate_points_above_below_plane(num_points=100):
    # Define the plane z = ax + by + c
    a, b, c = 1, 1, 0  # Plane coefficients
    # Generate random points
    x1 = np.random.uniform(-10, 10, num_points)
    x2 = np.random.uniform(-10, 10, num_points)
    y1 = np.random.uniform(-10, 10, num_points)
    y2 = np.random.uniform(-10, 10, num_points)
    # Points above the plane
    z_above = a * x1 + b * y1 + c + np.random.normal(20, 2, num_points)
    # Points below the plane
    z_below = a * x2 + b * y2 + c - np.random.normal(20, 2, num_points)
    # Stack the points into arrays
    points_above = np.vstack((x1, y1, z_above)).T
    points_below = np.vstack((x2, y2, z_below)).T
    return points_above, points_below

# Generate points
points_above, points_below = generate_points_above_below_plane()

# Visualization
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')

# Plot points above the plane
ax.scatter(points_above[:, 0], points_above[:, 1], points_above[:, 2],
           c='r', label='Above the plane (positive)')

# Plot points below the plane
ax.scatter(points_below[:, 0], points_below[:, 1], points_below[:, 2],
           c='b', label='Below the plane (negative)')

# Plot the plane itself for reference
xx, yy = np.meshgrid(range(-10, 11), range(-10, 11))
zz = 1 * xx + 1 * yy + 0
ax.plot_surface(xx, yy, zz, alpha=0.2, color='gray')

# Set labels
ax.set_xlabel('X1')
ax.set_ylabel('X2')
ax.set_zlabel('X3')
ax.view_init(elev=10, azim=-35)

# Set title and legend
ax.set_title('Binary classification, 3 attributes, linear decision boundary')
ax.legend()

# Show plot
plt.show()
```
Adding attributes can help make the data (linearly) separable.
Definition (revised)
A decision boundary is a hypersurface that partitions the underlying feature space into regions corresponding to different class labels.
Construction
Constructing a Decision Tree
How do we construct (learn) a decision tree?
Are there some trees that are “better” than others?
Is it feasible to construct an optimal decision tree with computational efficiency?
Optimality
Let \(X = \{x_1, \ldots, x_n\}\) be a finite set of objects.
Let \(\mathcal{T} = \{T_1, \ldots, T_t\}\) be a finite set of tests.
For each object and test, we have:
\(T_i(x_j)\) is either true or false.
An optimal tree is one that completely identifies all the objects in \(X\) while using a minimum number of tests. Hyafil and Rivest (1976) showed that constructing such an optimal binary decision tree is NP-complete, which is why practical algorithms rely on greedy heuristics.
Constructing a Decision Tree
Iterative development: Initiate with an empty tree. Progressively introduce nodes, each informed by the training dataset, continuing until the dataset is completely classified or alternative termination criteria, such as maximum tree depth, are met.
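The following is a minimal, self-contained sketch of this greedy procedure (illustrative only; not sklearn's CART implementation). It uses the Gini impurity and the exhaustive split search described on the following slides:

```python
# Illustrative sketch of greedy, recursive tree construction (not CART itself).
import numpy as np

def gini(y):
    """Gini impurity of a set of labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Exhaustive search for the (feature, threshold) pair with the lowest cost."""
    best = (None, None, np.inf)
    m = len(y)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            mask = X[:, j] <= t
            if mask.all() or (~mask).all():
                continue  # degenerate split: one side is empty
            cost = (mask.sum() / m) * gini(y[mask]) + ((~mask).sum() / m) * gini(y[~mask])
            if cost < best[2]:
                best = (j, t, cost)
    return best[0], best[1]

def grow_tree(X, y, depth=0, max_depth=3):
    """Add nodes recursively until the data is pure or max_depth is reached."""
    if len(np.unique(y)) == 1 or depth == max_depth:
        values, counts = np.unique(y, return_counts=True)
        return {"leaf": True, "prediction": values[np.argmax(counts)]}
    j, t = best_split(X, y)
    if j is None:  # no useful split found: stop and predict the majority class
        values, counts = np.unique(y, return_counts=True)
        return {"leaf": True, "prediction": values[np.argmax(counts)]}
    mask = X[:, j] <= t
    return {"leaf": False, "feature": j, "threshold": t,
            "left": grow_tree(X[mask], y[mask], depth + 1, max_depth),
            "right": grow_tree(X[~mask], y[~mask], depth + 1, max_depth)}

# Example: grow a depth-2 tree on toy data
X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy = np.array([0, 0, 1, 1])
print(grow_tree(X_toy, y_toy, max_depth=2))
```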
Constructing a Decision Tree
Initial Node Construction:
To establish the root node, evaluate all available \(D\) features.
For each feature, assess various threshold values derived from the observed data within the training set.
Constructing a Decision Tree
For a numerical feature, the algorithm considers all possible split points (thresholds) in the feature’s range.
These split points are typically the midpoints between two consecutive, sorted unique values of the feature.
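A short sketch of this computation on toy values (the numbers below are invented for illustration):

```python
# Candidate thresholds for a numerical feature: midpoints between
# consecutive sorted unique values (illustrative sketch, toy values).
import numpy as np

bill_depth = np.array([13.2, 14.8, 14.8, 15.0, 17.1, 18.4])
unique_sorted = np.unique(bill_depth)                     # sorted unique values
thresholds = (unique_sorted[:-1] + unique_sorted[1:]) / 2
print(thresholds)   # [14.   14.9  16.05 17.75]
```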
Constructing a Decision Tree
For a categorical feature with \(k\) unique values, the algorithm considers all possible ways of splitting the categories into two groups.
For instance, if the feature (tissue type) has values \(\{\mathrm{brain}, \mathrm{heart}, \mathrm{liver}\}\), it might evaluate splits like \(\{\mathrm{brain}\}\) vs. \(\{\mathrm{heart}, \mathrm{liver}\}\), \(\{\mathrm{liver}\}\) vs. \(\{\mathrm{brain}, \mathrm{heart}\}\), etc.
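A small sketch enumerating these binary partitions for the toy tissue-type feature (illustrative only; real implementations use more efficient strategies):

```python
# Enumerate the binary partitions of a categorical feature's values
# (each non-empty proper subset versus its complement; illustrative sketch).
from itertools import combinations

values = ["brain", "heart", "liver"]
splits = []
for r in range(1, len(values)):
    for subset in combinations(values, r):
        complement = tuple(v for v in values if v not in subset)
        # Avoid listing the same partition twice (A vs. B == B vs. A)
        if (complement, subset) not in splits:
            splits.append((subset, complement))

for left, right in splits:
    print(left, "vs.", right)
# ('brain',) vs. ('heart', 'liver')
# ('heart',) vs. ('brain', 'liver')
# ('liver',) vs. ('brain', 'heart')
```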
Evaluation
What defines a “good” data split?
\(\{\mathrm{brain}\}\) vs. \(\{\mathrm{heart}, \mathrm{liver}\}\): \([20,15]\) and \([20,15]\).
\(\{\mathrm{liver}\}\) vs. \(\{\mathrm{brain}, \mathrm{heart}\}\): \([40,0]\) and \([0,30]\).
Evaluation
Heterogeneity (also referred to as impurity) and homogeneity are critical metrics for evaluating the composition of resulting data partitions.
Optimally, each of these partitions should contain data entries from a single class to achieve maximum homogeneity.
Entropy and the Gini index are two widely utilized metrics for assessing these characteristics.
Evaluation
Objective function for sklearn.tree.DecisionTreeClassifier (CART):
\[
J(k, t_k) = \frac{m_{\text{left}}}{m} G_{\text{left}} + \frac{m_{\text{right}}}{m} G_{\text{right}}
\]
\(J(k, t_k)\) is the cost of partitioning the data using feature \(k\) and threshold \(t_k\).
\(m_{\text{left}}\) and \(m_{\text{right}}\) are the numbers of examples in the left and right subsets, respectively, and \(m\) is the number of examples before splitting the data.
\(G_{\text{left}}\) and \(G_{\text{right}}\) are the impurities of the left and right subsets, respectively.
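As a worked example, the cost of the two candidate tissue-type splits shown earlier can be computed directly (here \(G\) is the Gini impurity introduced on the next slide); the second split, which produces two pure subsets, has the lower cost:

```python
# CART cost J(k, t_k) for the two candidate splits of the tissue-type
# example above (class counts are those given in the slides).
def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def cart_cost(left_counts, right_counts):
    m_left, m_right = sum(left_counts), sum(right_counts)
    m = m_left + m_right
    return (m_left / m) * gini(left_counts) + (m_right / m) * gini(right_counts)

# {brain} vs. {heart, liver}: [20, 15] and [20, 15]
print(cart_cost([20, 15], [20, 15]))   # ~0.49 (both children impure)
# {liver} vs. {brain, heart}: [40, 0] and [0, 30]
print(cart_cost([40, 0], [0, 30]))     # 0.0  (both children pure)
```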
Gini Index
Gini index (default)
\[
G_i = 1 - \sum_{k=1}^n p_{i,k}^2
\]
\(p_{i,k}\) is the proportion of class-\(k\) examples among the training instances in node \(i\).
What is the maximum value of the Gini index?
Gini Index
Examples:
\(1 - \left[(0/100)^2 + (100/100)^2\right] = 0\) (pure)
\(1 - \left[(25/100)^2 + (75/100)^2\right] = 0.375\)
\(1 - \left[(50/100)^2 + (50/100)^2\right] = 0.5\)
Gini Index
Code
```python
import numpy as np
import matplotlib.pyplot as plt

def gini_index(p):
    """Calculate the Gini index."""
    return 1 - (p**2 + (1 - p)**2)

# Probability values for class 1
p_values = np.linspace(0, 1, 100)

# Calculate Gini index for each probability
gini_values = [gini_index(p) for p in p_values]

# Plot the Gini index
plt.figure(figsize=(8, 6))
plt.plot(p_values, gini_values, label='Gini Index', color='b')
plt.title('Gini Index for Binary Classification')
plt.xlabel('Probability of Class 1 (p)')
plt.ylabel('Gini Index')
plt.grid(True)
plt.legend()
plt.show()
```
Iris Dataset
Complete Example
Entropy
Entropy in information theory quantifies the uncertainty or unpredictability of a random variable’s possible outcomes. It measures the average amount of information produced by a stochastic source of data and is typically expressed in bits for binary systems. The entropy \(H\) of a discrete random variable \(X\) with possible outcomes \(\{x_1, x_2, \ldots, x_n\}\) and probability mass function \(P(X)\) is given by:
\[
H(X) = -\sum_{i=1}^n P(x_i) \log_2 P(x_i)
\]
Entropy
Entropy is maximized when all outcomes are equally likely, in which case it equals the logarithm of the number of outcomes:
\[
H_{\text{max}} = \log_2(n)
\]
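A quick numerical check of this maximum, assuming \(n\) equally likely outcomes:

```python
# Quick check: for n equally likely outcomes, H = log2(n).
import numpy as np

for n in (2, 4, 8):
    p = np.full(n, 1 / n)               # uniform distribution over n outcomes
    H = -np.sum(p * np.log2(p))
    print(n, H, np.log2(n))             # e.g. 2 1.0 1.0 ; 4 2.0 2.0 ; 8 3.0 3.0
```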
Entropy
Code
```python
import numpy as np
import matplotlib.pyplot as plt

# Function to compute entropy
def entropy(p):
    if p == 0 or p == 1:
        return 0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

# Generate probabilities from 0 to 1
probabilities = np.linspace(0, 1, 1000)

# Compute entropy for each probability
entropies = [entropy(p) for p in probabilities]

# Plot the results
plt.figure(figsize=(10, 6))
plt.plot(probabilities, entropies, label='Entropy H(p)', color='blue')
plt.title('Entropy for a Single Variable with Two Outcomes')
plt.xlabel('Probability p')
plt.ylabel('Entropy H(p)')
plt.grid(True)
plt.legend()
plt.show()
```
Small changes to the data set produce vastly different trees
Large Trees
Small Changes to the Dataset
Code
```python
import matplotlib.pyplot as plt
from palmerpenguins import load_penguins
from sklearn import tree
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import train_test_split

# Loading the dataset
X, y = load_penguins(return_X_y=True)
target_names = ['Adelie', 'Chinstrap', 'Gentoo']

# Split the dataset into training and testing sets
for seed in (4, 7, 90, 96, 99, 2):
    print(f'Seed: {seed}')

    # Create new training and test sets based on a different random seed
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)

    # Creating a new classifier
    clf = tree.DecisionTreeClassifier(random_state=seed)

    # Training
    clf.fit(X_train, y_train)

    # Make predictions
    y_pred = clf.predict(X_test)

    # Plotting the tree
    tree.plot_tree(clf, feature_names=X.columns, class_names=target_names, filled=True)
    plt.show()

    # Evaluating the model
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred, target_names=target_names)
    print(f'Accuracy: {accuracy:.2f}')
    print('Classification Report:')
    print(report)
```
Ensemble methods will be discussed in greater detail later.
“Although single decision trees can be excellent classifiers, increased accuracy often can be achieved by combining the results of a collection of decision trees.” (Kingsford and Salzberg 2008)
Random Forest
A Random Forest is a collection of decision trees.
Strategies to build a collection of trees:
Creating new data sets using a sampling with replacement procedure (bootstrap sampling);
Using a random subset of the features for splitting (typically the square root of the total number of features);
Taking advantage of the stochastic nature of the tree-building procedure itself.
Random Forest
Prediction: the most common prediction (majority vote) amongst all the trees; the fraction of trees that agree with this vote can be used as an indication of the strength of the prediction (see the sketch below).
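A minimal sketch of this recipe (illustrative only; sklearn.ensemble.RandomForestClassifier bundles all of these steps), assuming the binary Gentoo-versus-not X_train, y_train, and X_test arrays from the decision-boundary cells are still in memory:

```python
# Illustrative random-forest recipe: bootstrap samples, random feature
# subsets at each split, stochastic tree building, majority vote.
# Assumes the binary X_train, y_train, X_test arrays from earlier cells.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
n_trees, n = 100, len(X_train)
forest = []

for _ in range(n_trees):
    idx = rng.integers(0, n, size=n)                # bootstrap sample (with replacement)
    clf = DecisionTreeClassifier(
        max_features='sqrt',                        # random feature subset at each split
        random_state=int(rng.integers(1_000_000)))  # stochastic tree building
    clf.fit(X_train[idx], y_train[idx])
    forest.append(clf)

# Majority vote across trees; the vote fraction indicates prediction strength.
votes = np.array([clf.predict(X_test) for clf in forest])   # shape (n_trees, n_test)
gentoo_fraction = votes.mean(axis=0)                        # classes are 0/1
y_pred = (gentoo_fraction >= 0.5).astype(int)
print(y_pred[:5], gentoo_fraction[:5])
```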
Ensemble Learning
Other ensemble learning techniques, such as bagging, pasting, boosting, and stacking will be discussed later.
A Bagging classifier is an ensemble meta-estimator that fits base classifiers, each on a random subset of the original dataset, and then aggregates their individual predictions (either by voting or by averaging) to form a final prediction.
BaggingClassifier
When random subsets of the dataset are drawn as random subsets of the samples, then this algorithm is known as Pasting. If samples are drawn with replacement, then the method is known as Bagging. When random subsets of the dataset are drawn as random subsets of the features, then the method is known as Random Subspaces. Finally, when base estimators are built on subsets of both samples and features, then the method is known as Random Patches.
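As a sketch, the four variants map onto scikit-learn's BaggingClassifier parameters roughly as follows (assuming the binary X_train, y_train, X_test, y_test split from the decision-boundary cells):

```python
# Bagging, pasting, random subspaces, and random patches via BaggingClassifier
# (sketch; assumes the binary X_train, y_train, X_test, y_test from earlier cells).
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

base = DecisionTreeClassifier()

variants = {
    # Bagging: random subsets of samples, drawn with replacement
    'bagging': BaggingClassifier(base, n_estimators=100, max_samples=0.8,
                                 bootstrap=True, random_state=42),
    # Pasting: random subsets of samples, drawn without replacement
    'pasting': BaggingClassifier(base, n_estimators=100, max_samples=0.8,
                                 bootstrap=False, random_state=42),
    # Random Subspaces: all samples, random subsets of features
    'random subspaces': BaggingClassifier(base, n_estimators=100, bootstrap=False,
                                          max_features=0.5, random_state=42),
    # Random Patches: random subsets of both samples and features
    'random patches': BaggingClassifier(base, n_estimators=100, max_samples=0.8,
                                        bootstrap=True, max_features=0.5,
                                        random_state=42),
}

for name, model in variants.items():
    model.fit(X_train, y_train)
    print(f'{name}: {model.score(X_test, y_test):.2f}')
```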
SingleCellNet (Tan and Cahan 2019): a multi-class Random Forest classifier that classifies single-cell RNA-seq data across different platforms and species.
Shown to be resilient to confounding factors, such as the cell cycle stage, which often affects clustering analysis in scRNA-seq data.
SCN-TP outperforms other methods like SCMAP in terms of mean AUPR and classification accuracy.
Applications in bioinformatics
Synthetic sick and lethal (SSL) genetic interactions between genes A and B occur when the organism exhibits poor growth (or death) when both A and B are knocked out but not when either A or B is disabled individually. (Kingsford and Salzberg 2008)
Decision trees can perform both classification and regression, readily handling categorical and numerical attributes without relying on rigid assumptions about the data. They offer high interpretability but can suffer from instability and overfitting, issues often mitigated by pruning or ensemble methods like random forests. Overall, decision trees remain a foundational tool in machine learning, used in everything from gene expression analysis to clinical diagnostics.
Next Lecture
Linear models
References
Barlin, Joyce N., Qin Zhou, Caryn M. St. Clair, Alexia Iasonos, Robert A. Soslow, Kaled M. Alektiar, Martee L. Hensley, Mario M. Leitao, Richard R. Barakat, and Nadeem R. Abu-Rustum. 2013. “Classification and regression tree (CART) analysis of endometrial carcinoma: Seeing the forest for the trees.” Gynecologic Oncology 130 (3): 452–56. https://doi.org/10.1016/j.ygyno.2013.06.009.
Géron, Aurélien. 2019. Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow. 2nd ed. O’Reilly Media.
Geurts, Pierre, Alexandre Irrthum, and Louis Wehenkel. 2009. “Supervised Learning with Decision Tree-Based Methods in Computational and Systems Biology.” Molecular bioSystems 5 (12): 1593–1605. https://doi.org/10.1039/b907946g.
Hyafil, Laurent, and Ronald L. Rivest. 1976. “Constructing Optimal Binary Decision Trees Is NP-Complete.” Inf. Process. Lett. 5 (1): 15–17. https://doi.org/10.1016/0020-0190(76)90095-8.
Kingsford, C, and Steven L Salzberg. 2008. “What Are Decision Trees?” Nature Biotechnology 26 (9): 1011–13. https://doi.org/10.1038/nbt0908-1011.
Leboeuf, Jean-Samuel, Frédéric LeBlanc, and Mario Marchand. 2022. “Generalization Properties of Decision Trees on Real-Valued and Categorical Features.” https://arxiv.org/abs/2210.10781.
Schietgat, Leander, Celine Vens, Jan Struyf, Hendrik Blockeel, Dragi Kocev, and Saso Dzeroski. 2010. “Predicting Gene Function Using Hierarchical Multi-Label Decision Tree Ensembles.” BMC Bioinformatics 11 (1): 2. https://doi.org/10.1186/1471-2105-11-2.
Stiglic, Gregor, Simon Kocbek, Igor Pernek, and Peter Kokol. 2012. “Comprehensive Decision Tree Models in Bioinformatics.” Edited by Ahmed Moustafa. PLoS ONE 7 (3): e33812. https://doi.org/10.1371/journal.pone.0033812.
Tan, Yuqi, and Patrick Cahan. 2019. “SingleCellNet: A Computational Tool to Classify Single Cell RNA-Seq Data Across Platforms and Across Species.” Cell Systems 9 (2): 207–213.e2. https://doi.org/10.1016/j.cels.2019.06.004.