Linear models, Logistic Regression

CSI 5180 - Machine Learning for Bioinformatics

Marcel Turcotte

Version: Feb 19, 2025 11:50

Preamble

Quote of the Day

Summary

Introduces linear models for classification tasks, with a focus on logistic regression. Demonstrates how logistic regression leverages a sigmoidal (logistic) function to transform linear combinations of features into probabilities. Explains binary vs. multi-class classification (via one-vs-all), and illustrates how parameters are learned using gradient descent. Highlights the geometric interpretation of the decision boundary in high-dimensional spaces.

Learning Outcomes

  • Explain the concept of logistic regression and its relationship to linear models.
  • Apply logistic regression to both binary and multi-class classification tasks.
  • Interpret the logistic (sigmoid) function as a mapping from linear combinations of features to probabilities.
  • Use one-vs-all strategies to extend logistic regression to multi-class problems.
  • Analyze decision boundaries and describe their geometric interpretation in feature space.

Logistic Regression Example

  • Study: “An elastic-net logistic regression approach to generate classifiers and gene signatures for types of immune cells and T helper cell subsets”
  • Summary:
    • Use logistic regression as the basis for classifying immune cell types from gene expression data.
    • Focus is on using logistic regression to generate classifiers and extract gene signatures in a biological context.

Recall

Linear regression was introduced to conveniently present a well-known training algorithm, gradient descent. Additionally, it serves as a foundation for introducing logistic regression, a classification algorithm, which further facilitates discussions on artificial neural networks.

  • Linear Regression
    • Gradient Descent
    • Logistic Regression
      • Neural Networks

Classification Tasks

Definitions

  • Binary classification is a supervised learning task where the objective is to categorize instances (examples) into one of two discrete classes.

  • A multi-class classification task is a type of supervised learning problem where the objective is to categorize instances into one of three or more discrete classes.

Binary Classification

  • Some machine learning algorithms are specifically designed to solve binary classification problems.
    • Logistic regression and support vector machines (SVMs) are such examples.

Multi-Class Classification

  • Any multi-class classification problem can be transformed into a set of binary classification problems.
  • One-vs-All (OvA), aka one-vs-rest (OvR):
    • A separate binary classifier is trained for each class.
    • For each classifier, one class is treated as the positive class, and all other classes are treated as the negative class.
    • The final class label is assigned by the classifier that outputs the highest confidence score for a given input (see the sketch below).
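A minimal one-vs-rest sketch, assuming scikit-learn and the palmerpenguins package are available: OneVsRestClassifier fits one binary logistic regression per species and predicts the class whose classifier is most confident. The single-feature setup (flipper length) mirrors the example used later in this lecture.

Code
from palmerpenguins import load_penguins
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Three-class problem: Adelie, Chinstrap, Gentoo
penguins = load_penguins().dropna(subset=['flipper_length_mm', 'species'])
X_ova = penguins[['flipper_length_mm']]
y_ova = penguins['species']

# One binary logistic regression per class (positive vs. rest);
# predict() returns the class whose classifier is most confident.
ova = OneVsRestClassifier(LogisticRegression())
ova.fit(X_ova, y_ova)
print(ova.predict(X_ova.head()))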

Logistic Regression

Data and Problem

  • Dataset: Palmer Penguins
  • Task: Binary classification to distinguish Gentoo penguins from non-Gentoo species
  • Feature of Interest: Flipper length

Histogram

Code
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

try:
    from palmerpenguins import load_penguins
except ImportError:
    ! pip install palmerpenguins
    from palmerpenguins import load_penguins

# Load the Palmer Penguins dataset
df = load_penguins()

# Keep only 'flipper_length_mm' and 'species'
df = df[['flipper_length_mm', 'species']]

# Drop rows with missing values (NaNs)
df.dropna(inplace=True)

# Create a binary label: 1 if Gentoo, 0 otherwise
df['is_gentoo'] = (df['species'] == 'Gentoo').astype(int)

# Separate features (X) and labels (y)
X = df[['flipper_length_mm']]
y = df['is_gentoo']

# Plot the distribution of flipper lengths by binary species label
plt.figure(figsize=(10, 6))
sns.histplot(data=df, x='flipper_length_mm', hue='is_gentoo', kde=True, bins=30, palette='Set1')
plt.title('Distribution of Flipper Length (Gentoo vs. Others)')
plt.xlabel('Flipper Length (mm)')
plt.ylabel('Frequency')
plt.legend(title='Species', labels=['Gentoo', 'Non Gentoo'])
plt.show()

Model

  • General Case: \(P(y = k | x, \theta)\), where \(k\) is a class label.
  • Binary Case: \(y \in \{0,1\}\)
    • Predict \(P(y = 1 | x, \theta)\)

Logistic Regression

Code
# Scatter plot of flipper length vs. binary label (Gentoo or Not Gentoo)
plt.figure(figsize=(10, 6))

# Plot points labeled as Gentoo (is_gentoo = 1)
plt.scatter(
    df.loc[df['is_gentoo'] == 1, 'flipper_length_mm'],
    df.loc[df['is_gentoo'] == 1, 'is_gentoo'],
    color='blue',
    label='Gentoo'
)

# Plot points labeled as Not Gentoo (is_gentoo = 0)
plt.scatter(
    df.loc[df['is_gentoo'] == 0, 'flipper_length_mm'],
    df.loc[df['is_gentoo'] == 0, 'is_gentoo'],
    color='red',
    label='Not Gentoo'
)

plt.title('Flipper Length vs. Gentoo Indicator')
plt.xlabel('Flipper Length (mm)')
plt.ylabel('Binary Label (1 = Gentoo, 0 = Not Gentoo)')
plt.legend(loc='best')
plt.grid(True)
plt.show()

Intuition

Fitting a linear regression is not the answer, but \(\ldots\)

Code
from sklearn.linear_model import LinearRegression
import pandas as pd

lin_reg = LinearRegression()
lin_reg.fit(X, y)

X_new = pd.DataFrame([X.min(), X.max()], columns=X.columns)

y_pred = lin_reg.predict(X_new)

# Plot the scatter plot
plt.figure(figsize=(5, 3))
plt.scatter(X, y, c=y, cmap='bwr', edgecolor='k')
plt.plot(X_new, y_pred, "r-")
plt.xlabel('Flipper Length (mm)')
plt.ylabel('Binary Label (Gentoo or Not Gentoo)')
plt.title('Flipper Length vs. Binary Label (Gentoo or Not Gentoo)')
plt.yticks([0, 1], ['Not Gentoo', 'Gentoo'])
plt.grid(True)
plt.show()

Intuition (continued)

  • A high flipper_length_mm typically results in a model output approaching 1.

  • Conversely, a low flipper_length_mm generally yields a model output near 0.

  • Notably, the model outputs are not confined to the [0, 1] interval and may occasionally fall below 0 or surpass 1.

Intuition (continued)

  • For a single feature, the decision boundary is a specific point.
  • In this case, the decision boundary is approximately 205 mm (a quick check with scikit-learn follows below).
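As a rough check of that figure, the sketch below fits scikit-learn's LogisticRegression on flipper length alone (using the X and y defined in the earlier code block) and recovers the boundary as the point where the linear term crosses zero. The exact value depends on the data and the solver's default regularization, so treat 205 as approximate.

Code
from sklearn.linear_model import LogisticRegression

# Fit a one-feature logistic regression: P(Gentoo | flipper length)
log_reg = LogisticRegression()
log_reg.fit(X, y)

# The decision boundary is where theta_0 + theta_1 * x = 0,
# i.e. x = -theta_0 / theta_1.
theta_0 = log_reg.intercept_[0]
theta_1 = log_reg.coef_[0][0]
print("Decision boundary (mm):", -theta_0 / theta_1)  # roughly 205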

Intuition (continued)

  • As flipper_length_mm increases from 205 to 230, confidence in classifying the example as Gentoo rises.
  • Conversely, as flipper_length_mm decreases from 205 to 170, confidence in classifying the example as non-Gentoo rises.

Intuition (continued)

  • For flipper lengths near the decision boundary (about 205 mm), some examples are Gentoo while others are not, so the classification is about as uncertain as a coin flip (probability near 0.5).

Logistic Function

In mathematics, the standard logistic function maps a real-valued input from \(\mathbb{R}\) to the open interval \((0,1)\). The function is defined as:

\[ \sigma(t) = \frac{1}{1+e^{-t}} \]

Code
# Sigmoid function
def sigmoid(t):
    return 1 / (1 + np.exp(-t))

# Generate t values
t = np.linspace(-6, 6, 1000)

# Compute y values for the sigmoid function
sigma = sigmoid(t)

# Create a figure
fig, ax = plt.subplots()
ax.plot(t, sigma, color='blue', linewidth=2)  # Keep the curve opaque

# Draw vertical axis at x = 0
ax.axvline(x=0, color='black', linewidth=1)

# Add labels on the vertical axis
ax.set_yticks([0, 0.5, 1.0])

# Add labels to the axes
ax.set_xlabel('t')
ax.set_ylabel(r'$\sigma(t)$')

plt.grid(True)
plt.show()

Logistic Regression (intuition)

  • When the distance to the decision boundary is zero, uncertainty is high, making a probability of 0.5 appropriate.
  • As we move away from the decision boundary, confidence increases, warranting higher or lower probabilities accordingly.

Logistic Function

An S-shaped curve, such as the standard logistic function (aka sigmoid), is termed a squashing function because it maps a wide input domain to a constrained output range.

\[ \sigma(t) = \frac{1}{1+e^{-t}} \]

Logistic (Logit) Regression

  • Analogous to linear regression, logistic regression computes a weighted sum of the input features, expressed as: \[ \theta_0 + \theta_1 x_i^{(1)} + \theta_2 x_i^{(2)} + \ldots + \theta_D x_i^{(D)} \]

  • However, using the sigmoid function limits its output to the range \((0,1)\): \[ \sigma(\theta_0 + \theta_1 x_i^{(1)} + \theta_2 x_i^{(2)} + \ldots + \theta_D x_i^{(D)}) \]

Notation

  • Equation for the logistic regression: \[ \sigma(\theta_0 + \theta_1 x_i^{(1)} + \theta_2 x_i^{(2)} + \ldots + \theta_D x_i^{(D)}) \]

  • Multiplying \(\theta_0\) (intercept/bias) by 1: \[ \sigma(\theta_0 \times 1 + \theta_1 x_i^{(1)} + \theta_2 x_i^{(2)} + \ldots + \theta_D x_i^{(D)}) \]

  • Multiplying \(\theta_0\) by \(x_i^{(0)} = 1\): \[ \sigma(\theta_0 x_i^{(0)} + \theta_1 x_i^{(1)} + \theta_2 x_i^{(2)} + \ldots + \theta_D x_i^{(D)}) \] (a numerical sketch of this bias trick follows)
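A minimal numerical sketch of the bias trick, assuming NumPy; the parameter values and feature values below are made up for illustration. Prepending a constant 1 to the feature vector absorbs the intercept into the parameter vector, so the whole weighted sum becomes a single dot product.

Code
import numpy as np

# Hypothetical parameters and one example with D = 2 features
theta = np.array([-45.0, 0.1, 0.2])   # [theta_0, theta_1, theta_2]
x = np.array([210.0, 18.0])           # raw features x^(1), x^(2)

# Prepend x^(0) = 1 so the intercept is handled like any other weight
x_aug = np.concatenate(([1.0], x))

weighted_sum = theta @ x_aug   # theta_0*1 + theta_1*x^(1) + theta_2*x^(2)
print(weighted_sum)                      # -45 + 21 + 3.6 = -20.4
print(1 / (1 + np.exp(-weighted_sum)))   # sigmoid squashes it into (0, 1)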

Logistic regression

The Logistic Regression model, in its vectorized form, is defined as:

\[ h_\theta(x_i) = \sigma(\theta^\top x_i) = \frac{1}{1+e^{- \theta^\top x_i}} \]

Logistic regression (two attributes)

\[ h_\theta(x_i) = \sigma(\theta^\top x_i) \]

  • In logistic regression, the model's confidence (the predicted probability of the assigned class) increases as the example's distance from the decision boundary increases.
  • This principle holds for both positive and negative classes.
  • An example lying on the decision boundary has a 50% probability of belonging to either class.

Logistic regression

  • The Logistic Regression model, in its vectorized form, is defined as:

    \[ h_\theta(x_i) = \sigma(\theta^\top x_i) = \frac{1}{1+e^{- \theta^\top x_i}} \]

  • Predictions are made as follows:

    • \(y_i = 0\), if \(h_\theta(x_i) < 0.5\)
    • \(y_i = 1\), if \(h_\theta(x_i) \geq 0.5\)
  • The values of \(\theta\) are learned using gradient descent (a minimal sketch follows; the loss function is derived next lecture).
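As a preview of the next lecture, here is a minimal gradient-descent sketch, assuming NumPy, a fixed learning rate, and the standard average cross-entropy gradient \(\frac{1}{m}X^\top(\sigma(X\theta) - y)\). The toy data and learning rate are made up for illustration.

Code
import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

# Toy data: a single centred feature plus a bias column; labels in {0, 1}.
# (Raw flipper lengths would need scaling for plain gradient descent.)
X_b = np.c_[np.ones(4), np.array([-2.0, -1.0, 1.0, 2.0])]
y_toy = np.array([0.0, 0.0, 1.0, 1.0])

theta = np.zeros(2)
eta = 0.1            # learning rate, chosen arbitrarily for this sketch
m = len(y_toy)

for _ in range(5000):
    # Gradient of the average cross-entropy loss: X^T (sigma(X theta) - y) / m
    gradient = X_b.T @ (sigmoid(X_b @ theta) - y_toy) / m
    theta -= eta * gradient

print(theta)                 # learned [theta_0, theta_1]
print(sigmoid(X_b @ theta))  # predicted probabilities for the four examples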

Geometric Interpretation

Geometric Interpretation

  • Do you recognize this equation? \[ w_1 x_1 + w_2 x_2 + \ldots + w_D x_D \]

  • This is the dot product of \(\mathbf{w}\) and \(\mathbf{x}\), \(\mathbf{w} \cdot \mathbf{x}\).

  • What is the geometric interpretation of the dot product?

\[ \mathbf{w} \cdot \mathbf{x} = \|\mathbf{w}\| \|\mathbf{x}\| \cos \theta \]

Geometric Interpretation

\[ \mathbf{w} \cdot \mathbf{x} = \|\mathbf{w}\| \|\mathbf{x}\| \cos \theta \]

  • Together with the vector norms, the dot product determines the angle \((\theta)\) between the vectors.

  • It quantifies how much one vector extends in the direction of another.

  • Its value is zero if the vectors are perpendicular \((\theta = 90^\circ)\), as the small example below illustrates.
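A small numerical illustration of these statements, assuming NumPy; the vectors are arbitrary.

Code
import numpy as np

w = np.array([3.0, 4.0])
x_aligned = np.array([2.0, 1.0])    # roughly in the same direction as w
x_perp = np.array([-4.0, 3.0])      # perpendicular to w

for x in (x_aligned, x_perp):
    dot = w @ x
    cos_angle = dot / (np.linalg.norm(w) * np.linalg.norm(x))
    print(dot, np.degrees(np.arccos(cos_angle)))
# First pair: positive dot product, small angle (about 27 degrees).
# Second pair: dot product 0, angle of 90 degrees.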

Geometric Interpretation

  • Logistic regression uses a linear combination of the input features, \(\mathbf{w} \cdot \mathbf{x} + b\), as the argument to the sigmoid (logistic) function.

  • Geometrically, \(\mathbf{w}\) can be viewed as a vector normal to a hyperplane in the feature space, and any point \(\mathbf{x}\) is projected onto \(\mathbf{w}\) via the dot product \(\mathbf{w} \cdot \mathbf{x}\).

Geometric Interpretation

  • The decision boundary is where this linear combination equals zero, i.e., \(\mathbf{w} \cdot \mathbf{x} + b = 0\).

  • Points on one side of the boundary have a positive dot product and are more likely to be classified as the positive class (1).

  • Points on the other side have a negative dot product and are more likely to be in the opposite class (0).

  • The sigmoid function simply turns this signed score, which is proportional to the signed distance from the boundary, into a probability between 0 and 1 (see the sketch below).
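The sketch below ties these points together, assuming NumPy; the weights and points are made up. The sign of \(\mathbf{w} \cdot \mathbf{x} + b\) indicates the side of the hyperplane, dividing by \(\|\mathbf{w}\|\) gives the signed distance, and the sigmoid converts the raw score into a probability.

Code
import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

w = np.array([1.5, -0.5])   # normal vector of the hyperplane (made up)
b = -1.0

points = [np.array([2.0, 1.0]),    # positive side of the boundary
          np.array([0.5, 1.0]),    # negative side
          np.array([1.0, 1.0])]    # exactly on the boundary

for x in points:
    score = w @ x + b
    distance = score / np.linalg.norm(w)    # signed distance to the hyperplane
    print(score, distance, sigmoid(score))  # boundary point maps to 0.5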

Prologue

Summary

  • Introduced linear models for classification tasks, focusing on logistic regression.
  • Demonstrated how logistic regression leverages a sigmoidal (logistic) function to transform linear combinations of features into probabilities.
  • Explained binary vs. multi-class classification (via one-vs-all).
  • Highlighted the geometric interpretation of the decision boundary in high-dimensional spaces.

Next lecture

  • Negative log-likelihood, implementation, and interpretation of parameter values.

Resources

References

Alharbi, Fadi, and Aleksandar Vakanski. 2023. “Machine Learning Methods for Cancer Classification Using Gene Expression Data: A Review.” Bioengineering 10 (2): 173. https://doi.org/10.3390/bioengineering10020173.
Torang, Arezo, Paraag Gupta, and David J. Klinke. 2019. “An elastic-net logistic regression approach to generate classifiers and gene signatures for types of immune cells and T helper cell subsets.” BMC Bioinformatics 20 (1): 433. https://doi.org/10.1186/s12859-019-2994-z.
Wu, Qianfan, Adel Boueiz, Alican Bozkurt, Arya Masoomi, Allan Wang, Dawn L. DeMeo, Scott T. Weiss, and Weiliang Qiu. 2018. “Deep Learning Methods for Predicting Disease Status Using Genomic Data.” Journal of Biometrics & Biostatistics 9 (5).

Marcel Turcotte

Marcel.Turcotte@uOttawa.ca

School of Electrical Engineering and Computer Science (EECS)

University of Ottawa