A decision tree is a hierarchical structure represented as a directed acyclic graph, used for classification and regression tasks.
Each internal node performs a binary test on a particular feature (\(j\)), such as evaluating whether the number of connections at a school surpasses a specified threshold.
The leaves serve as decision nodes, each holding the final prediction, such as a class label.
Classifying New Instances (Inference)
Begin at the root node of the decision tree. Proceed by answering a sequence of binary questions until a leaf node is reached. The label associated with this leaf denotes the classification of the instance.
Alternatively, some algorithms store a probability distribution at the leaf: for each class \(k\), the fraction of the training samples reaching that leaf that belong to class \(k\).
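To make the walk from root to leaf concrete, here is a minimal sketch; the Node class and its fields are our own illustration, not scikit-learn's internal representation.

class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None, distribution=None):
        self.feature = feature            # index j of the feature tested at this internal node
        self.threshold = threshold        # binary test: x[feature] <= threshold ?
        self.left = left                  # subtree followed when the test is true
        self.right = right                # subtree followed when the test is false
        self.distribution = distribution  # class proportions stored at a leaf (None for internal nodes)

def predict(node, x):
    """Walk from the root to a leaf, answering one binary test per level."""
    while node.distribution is None:      # keep descending while we are at an internal node
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.distribution              # the leaf's class distribution

# A one-test tree (a stump): is feature 0 (say, body mass in grams) at most 4500?
light = Node(distribution={'Gentoo': 0.05, 'Not Gentoo': 0.95})
heavy = Node(distribution={'Gentoo': 0.90, 'Not Gentoo': 0.10})
root = Node(feature=0, threshold=4500, left=light, right=heavy)
print(predict(root, [3800]))              # reaches the 'light' leaf

Returning the most frequent class of the distribution yields a label; returning the distribution itself gives a probability estimate.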
Decision Boundary
Palmer Penguins Dataset
# Loading our dataset
try:
    from palmerpenguins import load_penguins
except:
    ! pip install palmerpenguins
    from palmerpenguins import load_penguins

penguins = load_penguins()

# Pairplot using seaborn
import matplotlib.pyplot as plt
import seaborn as sns

sns.pairplot(penguins, hue='species', markers=["o", "s", "D"])
plt.suptitle("Pairwise Scatter Plots of Penguins Features")
plt.show()
Palmer Penguins Dataset
Binary Classification Problem
Several scatter plots reveal a distinct clustering of Gentoo instances.
To illustrate our next example, we consider a binary classification problem: Gentoo versus non-Gentoo.
Our analysis will concentrate on two key features: body mass and bill depth.
Definition
A decision boundary is a “boundary” that partitions the underlying feature space into regions corresponding to different class labels.
Decision Boundary
In the space of these two features, the decision boundary between the two classes can be represented as a line.
Code
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

try:
    from palmerpenguins import load_penguins
except:
    ! pip install palmerpenguins
    from palmerpenguins import load_penguins

# Load the Palmer Penguins dataset
df = load_penguins()

# Preserve only the necessary features: 'bill_depth_mm' and 'body_mass_g'
features = ['bill_depth_mm', 'body_mass_g']
df = df[features + ['species']]

# Drop rows with missing values
df.dropna(inplace=True)

# Create a binary problem: 'Gentoo' vs 'Not Gentoo'
df['species_binary'] = df['species'].apply(lambda x: 1 if x == 'Gentoo' else 0)

# Define feature matrix X and target vector y
X = df[features].values
y = df['species_binary'].values

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Function to plot initial scatter of data
def plot_scatter(X, y):
    plt.figure(figsize=(9, 5))
    plt.scatter(X[y == 1, 0], X[y == 1, 1], color='orange', edgecolors='k', marker='o', label='Gentoo')
    plt.scatter(X[y == 0, 0], X[y == 0, 1], color='blue', edgecolors='k', marker='o', label='Not Gentoo')
    plt.xlabel('Bill Depth (mm)')
    plt.ylabel('Body Mass (g)')
    plt.title('Scatter Plot of Bill Depth vs. Body Mass')
    plt.legend()
    plt.show()

# Plot the initial scatter plot
plot_scatter(X_train, y_train)
Decision Boundary
Decision Boundary
In the space of these two features, the decision boundary between the two classes can be represented as a line.
Code
# Train a logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Function to plot decision boundary
def plot_decision_boundary(X, y, model):
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                         np.arange(y_min, y_max, 0.1))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.figure(figsize=(9, 5))
    plt.contourf(xx, yy, Z, alpha=0.3, cmap='RdYlBu')
    plt.scatter(X[y == 1, 0], X[y == 1, 1], color='orange', edgecolors='k', marker='o', label='Gentoo')
    plt.scatter(X[y == 0, 0], X[y == 0, 1], color='blue', edgecolors='k', marker='o', label='Not Gentoo')
    plt.xlabel('Bill Depth (mm)')
    plt.ylabel('Body Mass (g)')
    plt.title('Logistic Regression Decision Boundary')
    plt.legend()
    plt.show()

# Plot the decision boundary on the training set
plot_decision_boundary(X_train, y_train, model)
Decision Boundary
Definition
We say that the data is linearly separable when two classes of data can be perfectly separated by a single linear boundary, such as a line in two-dimensional space or a hyperplane in higher dimensions.
Simple Decision Boundary
(a) training data, (b) quadratic curve, and (c) linear function.
Decision trees are capable of generating irregular and non-linear decision boundaries.
Attribution: ibidem.
Definition (revised)
A decision boundary is a hypersurface that partitions the underlying feature space into regions corresponding to different class labels.
Decision Tree (contd)
Constructing a Decision Tree
How do we construct (learn) a decision tree?
Are there some trees that are “better” than others?
Is it feasible to construct an optimal decision tree with computational efficiency?
Optimality
Let \(X = \{x_1, \ldots, x_n\}\) be a finite set of objects.
Let \(\mathcal{T} = \{T_1, \ldots, T_t\}\) be a finite set of tests.
For each object and test, we have:
\(T_i(x_j)\) is either true or false.
An optimal tree is one that completely identifies all the objects in \(X\) and whose number of tests \(|T|\) is minimum. Hyafil and Rivest (1976) showed that constructing such an optimal binary decision tree is NP-complete, which is why practical algorithms rely on greedy heuristics.
Constructing a Decision Tree
Iterative development: Initiate with an empty tree. Progressively introduce nodes, each informed by the training dataset, continuing until the dataset is completely classified or alternative termination criteria, such as maximum tree depth, are met.
Constructing a Decision Tree
Initial Node Construction:
To establish the root node, evaluate all available \(D\) features.
For each feature, assess various threshold values derived from the observed data within the training set.
Constructing a Decision Tree
For a numerical feature, the algorithm considers all possible split points (thresholds) in the feature’s range.
These split points are typically the midpoints between two consecutive, sorted unique values of the feature.
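For illustration, a small sketch of how these candidates could be enumerated; the helper name candidate_thresholds is ours, not part of scikit-learn.

import numpy as np

def candidate_thresholds(values):
    """Midpoints between consecutive, sorted unique values of a numerical feature."""
    v = np.sort(np.unique(values))
    return (v[:-1] + v[1:]) / 2

# Example: a handful of body-mass measurements (grams)
print(candidate_thresholds([3500, 3800, 3800, 4200, 5000]))   # [3650. 4000. 4600.]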
Constructing a Decision Tree
For a categorical feature with \(k\) unique values, the algorithm considers all possible ways of splitting the categories into two groups.
For instance, if the feature (forecast) takes the values ‘Rainy’, ‘Cloudy’, and ‘Sunny’, the algorithm evaluates the following splits:
\(\{\mathrm{Rainy}\}\) vs. \(\{\mathrm{Cloudy}, \mathrm{Sunny}\}\),
\(\{\mathrm{Cloudy}\}\) vs. \(\{\mathrm{Rainy}, \mathrm{Sunny}\}\),
\(\{\mathrm{Sunny}\}\) vs. \(\{\mathrm{Rainy}, \mathrm{Cloudy}\}\).
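The sketch below enumerates these candidate partitions programmatically; the helper binary_splits is our own name, and each partition is produced once (possibly with the two sides swapped relative to the list above).

from itertools import combinations

def binary_splits(categories):
    """All ways to split k categories into two non-empty groups (2**(k-1) - 1 splits)."""
    cats = list(categories)
    first, rest = cats[0], cats[1:]
    splits = []
    # Fixing the first category on the left avoids producing mirror images twice.
    for r in range(len(rest) + 1):
        for subset in combinations(rest, r):
            left = {first, *subset}
            right = set(cats) - left
            if right:                       # skip the degenerate split with an empty side
                splits.append((left, right))
    return splits

for left, right in binary_splits(['Rainy', 'Cloudy', 'Sunny']):
    print(left, 'vs.', right)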
Evaluation
What defines a “good” data split?
\(\{\mathrm{Rainy}\}\) vs. \(\{\mathrm{Cloudy}, \mathrm{Sunny}\}\): class counts \([20,10,5]\) and \([10,10,15]\).
\(\{\mathrm{Cloudy}\}\) vs. \(\{\mathrm{Rainy}, \mathrm{Sunny}\}\): class counts \([40,0,0]\) and \([0,30,0]\).
Evaluation
Heterogeneity (also referred to as impurity) and homogeneity are critical metrics for evaluating the composition of resulting data partitions.
Optimally, each of these partitions should contain data entries from a single class to achieve maximum homogeneity.
Entropy and the Gini index are two widely utilized metrics for assessing these characteristics.
Evaluation
Objective function for sklearn.tree.DecisionTreeClassifier (CART):
\(J(k, t_k)\) is the cost of partitioning the data using feature \(k\) and threshold \(t_k\):
\[
J(k, t_k) = \frac{m_{\text{left}}}{m} G_{\text{left}} + \frac{m_{\text{right}}}{m} G_{\text{right}}
\]
\(m_{\text{left}}\) and \(m_{\text{right}}\) are the numbers of examples in the left and right subsets, respectively, and \(m\) is the number of examples before splitting the data.
\(G_{\text{left}}\) and \(G_{\text{right}}\) are the impurities of the left and right subsets, respectively.
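As a sketch (the helper names are ours), the cost \(J\) can be evaluated for the two weather splits from the evaluation example, using the Gini impurity defined on the next slide as the impurity measure \(G\).

import numpy as np

def gini(counts):
    """Gini impurity of a node described by its per-class counts."""
    p = np.asarray(counts, dtype=float) / sum(counts)
    return 1.0 - np.sum(p ** 2)

def cart_cost(left_counts, right_counts):
    """Weighted impurity J(k, t_k) of a candidate split."""
    m_left, m_right = sum(left_counts), sum(right_counts)
    m = m_left + m_right
    return (m_left / m) * gini(left_counts) + (m_right / m) * gini(right_counts)

print(cart_cost([20, 10, 5], [10, 10, 15]))   # Rainy vs. {Cloudy, Sunny}: about 0.61
print(cart_cost([40, 0, 0], [0, 30, 0]))      # Cloudy vs. {Rainy, Sunny}: 0.0, both sides pure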
Gini Index
Gini index (default)
\[
G_i = 1 - \sum_{k=1}^n p_{i,k}^2
\]
\(p_{i,k}\) is the proportion of examples of class \(k\) in node \(i\).
What is the maximum value of the Gini index?
Gini Index
Considering a binary classification problem:
\(1 - [(0/100)^2 + (100/100)^2] = 0\) (pure)
\(1 - [(25/100)^2 + (75/100)^2] = 0.375\)
\(1 - [(50/100)^2 + (50/100)^2] = 0.5\)
Gini Index
Code
def gini_index(p):
    """Calculate the Gini index."""
    return 1 - (p**2 + (1 - p)**2)

# Probability values for class 1
p_values = np.linspace(0, 1, 100)

# Calculate Gini index for each probability
gini_values = [gini_index(p) for p in p_values]

# Plot the Gini index
plt.figure(figsize=(8, 6))
plt.plot(p_values, gini_values, label='Gini Index', color='b')
plt.title('Gini Index for Binary Classification')
plt.xlabel('Probability of Class 1 (p)')
plt.ylabel('Gini Index')
plt.grid(True)
plt.legend()
plt.show()
Iris Dataset
Complete Example
Stopping Criteria
All the examples in a given node belong to the same class.
Depth of the tree would exceed max_depth.
The number of examples in the node is less than min_samples_split.
None of the splits decreases impurity sufficiently (min_impurity_decrease).
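For reference, these stopping criteria map onto hyperparameters of sklearn.tree.DecisionTreeClassifier; the values below are arbitrary and only illustrate the interface.

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    max_depth=3,                 # stop before the depth would exceed 3
    min_samples_split=10,        # do not split a node holding fewer than 10 examples
    min_impurity_decrease=0.01,  # require each split to decrease impurity by at least 0.01
    random_state=42,
)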
Small changes to the data set can produce vastly different trees.
Large Trees
Small Changes to the Dataset
Code
from sklearn import tree
from sklearn.metrics import classification_report, accuracy_score

# Loading the dataset
X, y = load_penguins(return_X_y=True)
target_names = ['Adelie', 'Chinstrap', 'Gentoo']

# Split the dataset into training and testing sets
for seed in (4, 7, 90, 96, 99, 2):
    print(f'Seed: {seed}')

    # Create new training and test sets based on a different random seed
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)

    # Creating a new classifier
    clf = tree.DecisionTreeClassifier(random_state=seed)

    # Training
    clf.fit(X_train, y_train)

    # Make predictions
    y_pred = clf.predict(X_test)

    # Plotting the tree
    tree.plot_tree(clf, feature_names=X.columns, class_names=target_names, filled=True)
    plt.show()

    # Evaluating the model
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred, target_names=target_names)
    print(f'Accuracy: {accuracy:.2f}')
    print('Classification Report:')
    print(report)
The training data is a collection of labelled examples.
\(\{(x_i,y_i)\}_{i=1}^N\)
Each \(x_i\) is a feature vector with \(D\) dimensions.
\(x_i^{(j)}\) is the value of feature \(j\) of example \(i\), for \(j \in 1 \ldots D\) and \(i \in 1 \ldots N\).
The label \(y_i\) is a real number.
Problem: Given the data set as input, create a model that can be used to predict the value of \(y\) for an unseen \(x\).
Rationale
Linear regression is introduced to conveniently present a well-known training algorithm, gradient descent. Additionally, it serves as a foundation for introducing logistic regression, a classification algorithm, which in turn facilitates the discussion of artificial neural networks.
Linear Regression
Gradient Descent
Logistic Regression
Neural Networks
Linear Regression
A linear model assumes that the value of the label, \(\hat{y_i}\), can be expressed as a linear combination of the feature values, \(x_i^{(j)}\): \[
\hat{y_i} = \theta_0 + \theta_1 x_i^{(1)} + \theta_2 x_i^{(2)} + \ldots + \theta_D x_i^{(D)}
\]
Here, \(\theta_{j}\) is the \(j\)th parameter of the (linear) model, with \(\theta_0\) being the bias term/parameter, and \(\theta_1 \ldots \theta_D\) being the feature weights.
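A minimal sketch of this prediction in NumPy; the parameter and feature values are arbitrary, chosen only to show the computation.

import numpy as np

theta_0 = 1.5                            # bias term
theta = np.array([0.2, -0.7, 3.0])       # feature weights theta_1 .. theta_D, here D = 3

x_i = np.array([4.0, 1.0, 0.5])          # one example with D = 3 feature values

# y_hat = theta_0 + theta_1 * x^(1) + ... + theta_D * x^(D)
y_hat = theta_0 + theta @ x_i
print(y_hat)                             # 1.5 + 0.8 - 0.7 + 1.5 = 3.1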
Definition
Problem: find values for all the model parameters so that the model “best fits” the training data.
The Root Mean Square Error is a common performance measure for regression problems.
\[
\sqrt{\frac{1}{N}\sum_{i=1}^N [h(x_i) - y_i]^2}
\]
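A sketch of RMSE computed directly from this definition, with made-up predictions \(h(x_i)\) and targets \(y_i\).

import numpy as np

def rmse(predictions, targets):
    """Root Mean Square Error between the predictions h(x_i) and the labels y_i."""
    predictions = np.asarray(predictions, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return np.sqrt(np.mean((predictions - targets) ** 2))

print(rmse([2.5, 0.0, 2.1], [3.0, -0.5, 2.0]))   # sqrt((0.25 + 0.25 + 0.01) / 3), about 0.41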
Minimizing RMSE
Characteristics
A typical learning algorithm comprises the following components:
A model, often consisting of a set of weights whose values will be “learnt”.
An objective function.
In the case of regression, this is often a loss function, a function that quantifies the error of the predictions. The Root Mean Square Error is a common loss function for regression problems: \(\sqrt{\frac{1}{N}\sum_{i=1}^N [h(x_i) - y_i]^2}\)
Optimization algorithm
Optimization
Until some termination criterion is met:
Evaluate the loss function, comparing \(h(x_i)\) to \(y_i\).
Make small changes to the weights, in a way that reduces the value of the loss function.
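A sketch of this loop for linear regression, using batch gradient descent on the mean squared error; the toy data, learning rate, and fixed number of steps are placeholders, not prescribed values.

import numpy as np

# Toy data generated from y = 4 + 3x plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 2, size=(100, 1))
y = 4 + 3 * X[:, 0] + rng.normal(scale=0.5, size=100)

Xb = np.c_[np.ones(len(X)), X]                # prepend a column of 1s for the bias term
theta = np.zeros(Xb.shape[1])                 # the model: weights to be learnt
eta = 0.1                                     # learning rate

for step in range(1000):                      # termination criterion: a fixed number of steps
    residuals = Xb @ theta - y                # h(x_i) - y_i for every training example
    gradient = 2 / len(y) * Xb.T @ residuals  # gradient of the mean squared error
    theta -= eta * gradient                   # small change that reduces the loss

print(theta)                                  # close to [4, 3]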
Remarks
It is crucial to separate the optimization algorithm from the problem it addresses.
For linear regression, an exact analytical solution, the normal equation, exists, but it presents certain limitations (see the sketch after these remarks).
Gradient descent serves as a general algorithm applicable not only to linear regression, but also to logistic regression, deep learning, t-SNE (t-distributed Stochastic Neighbor Embedding), among various other problems.
There exists a diverse range of optimization algorithms that do not rely on gradient-based methods.
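For completeness, a sketch of the exact analytical solution mentioned above, the normal equation \(\hat{\theta} = (X^\top X)^{-1} X^\top y\); among its limitations are the cost of solving the system when the number of features is large and the possibility that \(X^\top X\) is singular.

import numpy as np

# Same toy data as in the gradient-descent sketch: y = 4 + 3x plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 2, size=(100, 1))
y = 4 + 3 * X[:, 0] + rng.normal(scale=0.5, size=100)
Xb = np.c_[np.ones(len(X)), X]

# Normal equation, solved without forming an explicit matrix inverse.
theta_exact = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)
print(theta_exact)   # also close to [4, 3]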
Summary
The lecture surveyed three learning algorithms, k-nearest neighbours (KNN), decision trees, and linear regression, and framed them via model, objective, and optimization.
We then constructed decision trees, showed that regression leaves return the sample mean, minimized the weighted impurity \(J\), and analyzed the Gini index.
Decision boundaries were illustrated for linear and non-linear models.
Finally, we formulated linear regression with a bias term.
Prologue
References
Géron, Aurélien. 2019. Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow. 2nd ed. O’Reilly Media.
Geurts, Pierre, Alexandre Irrthum, and Louis Wehenkel. 2009. “Supervised Learning with Decision Tree-Based Methods in Computational and Systems Biology.” Molecular BioSystems 5 (12): 1593–1605. https://doi.org/10.1039/b907946g.
Hyafil, Laurent, and Ronald L. Rivest. 1976. “Constructing Optimal Binary Decision Trees Is NP-Complete.” Information Processing Letters 5 (1): 15–17. https://doi.org/10.1016/0020-0190(76)90095-8.
Russell, Stuart, and Peter Norvig. 2020. Artificial Intelligence: A Modern Approach. 4th ed. Pearson. http://aima.cs.berkeley.edu/.
Stanton, Jeffrey M. 2001. “Galton, Pearson, and the Peas: A Brief History of Linear Regression for Statistics Instructors.” Journal of Statistics Education 9 (3). https://doi.org/10.1080/10691898.2001.11910537.
Stiglic, Gregor, Simon Kocbek, Igor Pernek, and Peter Kokol. 2012. “Comprehensive Decision Tree Models in Bioinformatics.” Edited by Ahmed Moustafa. PLoS ONE 7 (3): e33812. https://doi.org/10.1371/journal.pone.0033812.