Linear models, Logistic Regression

CSI 5180 - Machine Learning for Bioinformatics

Marcel Turcotte

Version: Feb 19, 2025 11:50

Preamble

Quote of the Day

Summary

Introduces linear models for classification tasks, with a focus on logistic regression. Demonstrates how logistic regression leverages a sigmoidal (logistic) function to transform linear combinations of features into probabilities. Explains binary vs. multi-class classification (via one-vs-all), and illustrates how parameters are learned using gradient descent. Highlights the geometric interpretation of the decision boundary in high-dimensional spaces.

Learning Outcomes

  • Explain the concept of logistic regression and its relationship to linear models.
  • Apply logistic regression to both binary and multi-class classification tasks.
  • Interpret the logistic (sigmoid) function as a mapping from linear combinations of features to probabilities.
  • Use one-vs-all strategies to extend logistic regression to multi-class problems.
  • Analyze decision boundaries and describe their geometric interpretation in feature space.

Logistic Regression Example

  • Study: “An elastic-net logistic regression approach to generate classifiers and gene signatures for types of immune cells and T helper cell subsets”
  • Summary:
    • Use logistic regression as the basis for classifying immune cell types from gene expression data.
    • Focus is on using logistic regression to generate classifiers and extract gene signatures in a biological context.

Recall

Linear regression was introduced to conveniently present a well-known training algorithm, gradient descent. Additionally, it serves as a foundation for introducing logistic regression, a classification algorithm, which further facilitates discussions on artificial neural networks.

  • Linear Regression
    • Gradient Descent
    • Logistic Regression
      • Neural Networks

Classification Tasks

Definitions

  • Binary classification is a supervised learning task where the objective is to categorize instances (examples) into one of two discrete classes.

  • A multi-class classification task is a type of supervised learning problem where the objective is to categorize instances into one of three or more discrete classes.

Binary Classification

  • Some machine learning algorithms are specifically designed to solve binary classification problems.
    • Logistic regression and support vector machines (SVMs) are such examples.

Multi-Class Classification

  • Any multi-class classification problem can be transformed into a set of binary classification problems.
  • One-vs-All (OvA), aka one-vs-rest (OvR):
    • A separate binary classifier is trained for each class.
    • For each classifier, one class is treated as the positive class, and all other classes are treated as the negative class.
    • The final class label is assigned by the classifier that outputs the highest confidence score for a given input (see the sketch below).
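A minimal one-vs-rest sketch, assuming scikit-learn and the palmerpenguins package are available: OneVsRestClassifier fits one binary logistic regression per species and predicts the class whose classifier is most confident. The single-feature setup (flipper length) mirrors the example used later in this lecture.

Code
from palmerpenguins import load_penguins
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Three-class problem: Adelie, Chinstrap, Gentoo
penguins = load_penguins().dropna(subset=['flipper_length_mm', 'species'])
X_ova = penguins[['flipper_length_mm']]
y_ova = penguins['species']

# One binary logistic regression per class (positive vs. rest);
# predict() returns the class whose classifier is most confident.
ova = OneVsRestClassifier(LogisticRegression())
ova.fit(X_ova, y_ova)
print(ova.predict(X_ova.head()))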

Logistic Regression

Data and Problem

  • Dataset: Palmer Penguins
  • Task: Binary classification to distinguish Gentoo penguins from non-Gentoo species
  • Feature of Interest: Flipper length

Histogram

Code
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

try:
    from palmerpenguins import load_penguins
except ImportError:
    ! pip install palmerpenguins
    from palmerpenguins import load_penguins

# Load the Palmer Penguins dataset
df = load_penguins()

# Keep only 'flipper_length_mm' and 'species'
df = df[['flipper_length_mm', 'species']]

# Drop rows with missing values (NaNs)
df.dropna(inplace=True)

# Create a binary label: 1 if Gentoo, 0 otherwise
df['is_gentoo'] = (df['species'] == 'Gentoo').astype(int)

# Separate features (X) and labels (y)
X = df[['flipper_length_mm']]
y = df['is_gentoo']

# Plot the distribution of flipper lengths by binary species label
plt.figure(figsize=(10, 6))
sns.histplot(data=df, x='flipper_length_mm', hue='is_gentoo', kde=True, bins=30, palette='Set1')
plt.title('Distribution of Flipper Length (Gentoo vs. Others)')
plt.xlabel('Flipper Length (mm)')
plt.ylabel('Frequency')
plt.legend(title='Species', labels=['Gentoo', 'Non Gentoo'])
plt.show()

Model

  • General Case: \(P(y = k | x, \theta)\), where \(k\) is a class label.
  • Binary Case: \(y \in \{0,1\}\)
    • Predict \(P(y = 1 | x, \theta)\)

Logistic Regression

Code
# Scatter plot of flipper length vs. binary label (Gentoo or Not Gentoo)
plt.figure(figsize=(10, 6))

# Plot points labeled as Gentoo (is_gentoo = 1)
plt.scatter(
    df.loc[df['is_gentoo'] == 1, 'flipper_length_mm'],
    df.loc[df['is_gentoo'] == 1, 'is_gentoo'],
    color='blue',
    label='Gentoo'
)

# Plot points labeled as Not Gentoo (is_gentoo = 0)
plt.scatter(
    df.loc[df['is_gentoo'] == 0, 'flipper_length_mm'],
    df.loc[df['is_gentoo'] == 0, 'is_gentoo'],
    color='red',
    label='Not Gentoo'
)

plt.title('Flipper Length vs. Gentoo Indicator')
plt.xlabel('Flipper Length (mm)')
plt.ylabel('Binary Label (1 = Gentoo, 0 = Not Gentoo)')
plt.legend(loc='best')
plt.grid(True)
plt.show()

Intuition

Fitting a linear regression is not the answer, but \(\ldots\)

Code
from sklearn.linear_model import LinearRegression
import pandas as pd

lin_reg = LinearRegression()
lin_reg.fit(X, y)

X_new = pd.DataFrame([X.min(), X.max()], columns=X.columns)

y_pred = lin_reg.predict(X_new)

# Plot the scatter plot
plt.figure(figsize=(5, 3))
plt.scatter(X, y, c=y, cmap='bwr', edgecolor='k')
plt.plot(X_new, y_pred, "r-")
plt.xlabel('Flipper Length (mm)')
plt.ylabel('Binary Label (Gentoo or Not Gentoo)')
plt.title('Flipper Length vs. Binary Label (Gentoo or Not Gentoo)')
plt.yticks([0, 1], ['Not Gentoo', 'Gentoo'])
plt.grid(True)
plt.show()

Intuition (continued)

  • A high flipper_length_mm typically results in a model output approaching 1.

  • Conversely, a low flipper_length_mm generally yields a model output near 0.

  • Notably, the model outputs are not confined to the [0, 1] interval and may occasionally fall below 0 or surpass 1.

Intuition (continued)

  • For a single feature, the decision boundary is a specific point.
  • In this case, the decision boundary is approximately 205 mm (a quick check with scikit-learn follows below).
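As a rough check of that figure, the sketch below fits scikit-learn's LogisticRegression on flipper length alone (using the X and y defined in the earlier code block) and recovers the boundary as the point where the linear term crosses zero. The exact value depends on the data and the solver's default regularization, so treat 205 as approximate.

Code
from sklearn.linear_model import LogisticRegression

# Fit a one-feature logistic regression: P(Gentoo | flipper length)
log_reg = LogisticRegression()
log_reg.fit(X, y)

# The decision boundary is where theta_0 + theta_1 * x = 0,
# i.e. x = -theta_0 / theta_1.
theta_0 = log_reg.intercept_[0]
theta_1 = log_reg.coef_[0][0]
print("Decision boundary (mm):", -theta_0 / theta_1)  # roughly 205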

Intuition (continued)

  • As flipper_length_mm increases from 205 to 230, confidence in classifying the example as Gentoo rises.
  • Conversely, as flipper_length_mm decreases from 205 to 170, confidence in classifying the example as non-Gentoo rises.

Intuition (continued)

  • For flipper lengths near the decision boundary (about 205 mm), some examples are Gentoo while others are not, so the classification is about as uncertain as a coin flip (probability near 0.5).

Logistic Function

In mathematics, the standard logistic function maps a real-valued input from \(\mathbb{R}\) to the open interval \((0,1)\). The function is defined as:

\[ \sigma(t) = \frac{1}{1+e^{-t}} \]

Code
# Sigmoid function
def sigmoid(t):
    return 1 / (1 + np.exp(-t))

# Generate t values
t = np.linspace(-6, 6, 1000)

# Compute y values for the sigmoid function
sigma = sigmoid(t)

# Create a figure
fig, ax = plt.subplots()
ax.plot(t, sigma, color='blue', linewidth=2)  # Keep the curve opaque

# Draw vertical axis at x = 0
ax.axvline(x=0, color='black', linewidth=1)

# Add labels on the vertical axis
ax.set_yticks([0, 0.5, 1.0])

# Add labels to the axes
ax.set_xlabel('t')
ax.set_ylabel(r'$\sigma(t)$')

plt.grid(True)
plt.show()

Logistic Regression (intuition)

  • When the distance to the decision boundary is zero, uncertainty is high, making a probability of 0.5 appropriate.
  • As we move away from the decision boundary, confidence increases, warranting higher or lower probabilities accordingly.

Logistic Function

An S-shaped curve, such as the standard logistic function (aka sigmoid), is termed a squashing function because it maps a wide input domain to a constrained output range.

\[ \sigma(t) = \frac{1}{1+e^{-t}} \]

Logistic (Logit) Regression

  • Analogous to linear regression, logistic regression computes a weighted sum of the input features, expressed as: \[ \theta_0 + \theta_1 x_i^{(1)} + \theta_2 x_i^{(2)} + \ldots + \theta_D x_i^{(D)} \]

  • However, using the sigmoid function limits its output to the range \((0,1)\): \[ \sigma(\theta_0 + \theta_1 x_i^{(1)} + \theta_2 x_i^{(2)} + \ldots + \theta_D x_i^{(D)}) \]

Notation

  • Equation for the logistic regression: \[ \sigma(\theta_0 + \theta_1 x_i^{(1)} + \theta_2 x_i^{(2)} + \ldots + \theta_D x_i^{(D)}) \]

  • Multiplying \(\theta_0\) (intercept/bias) by 1: \[ \sigma(\theta_0 \times 1 + \theta_1 x_i^{(1)} + \theta_2 x_i^{(2)} + \ldots + \theta_D x_i^{(D)}) \]

  • Multiplying \(\theta_0\) by \(x_i^{(0)} = 1\): \[ \sigma(\theta_0 x_i^{(0)} + \theta_1 x_i^{(1)} + \theta_2 x_i^{(2)} + \ldots + \theta_D x_i^{(D)}) \] (a numerical sketch of this bias trick follows)
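A minimal numerical sketch of the bias trick, assuming NumPy; the parameter values and feature values below are made up for illustration. Prepending a constant 1 to the feature vector absorbs the intercept into the parameter vector, so the whole weighted sum becomes a single dot product.

Code
import numpy as np

# Hypothetical parameters and one example with D = 2 features
theta = np.array([-45.0, 0.1, 0.2])   # [theta_0, theta_1, theta_2]
x = np.array([210.0, 18.0])           # raw features x^(1), x^(2)

# Prepend x^(0) = 1 so the intercept is handled like any other weight
x_aug = np.concatenate(([1.0], x))

weighted_sum = theta @ x_aug   # theta_0*1 + theta_1*x^(1) + theta_2*x^(2)
print(weighted_sum)                      # -45 + 21 + 3.6 = -20.4
print(1 / (1 + np.exp(-weighted_sum)))   # sigmoid squashes it into (0, 1)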

Logistic regression

The Logistic Regression model, in its vectorized form, is defined as:

\[ h_\theta(x_i) = \sigma(\theta^\top x_i) = \frac{1}{1+e^{- \theta^\top x_i}} \]

Logistic regression (two attributes)

\[ h_\theta(x_i) = \sigma(\theta^\top x_i) \]

  • In logistic regression, the model's confidence (the predicted probability of the assigned class) increases as the example's distance from the decision boundary increases.
  • This principle holds for both positive and negative classes.
  • An example lying on the decision boundary has a 50% probability of belonging to either class.

Logistic regression

  • The Logistic Regression model, in its vectorized form, is defined as:

    \[ h_\theta(x_i) = \sigma(\theta^\top x_i) = \frac{1}{1+e^{- \theta^\top x_i}} \]

  • Predictions are made as follows:

    • \(y_i = 0\), if \(h_\theta(x_i) < 0.5\)
    • \(y_i = 1\), if \(h_\theta(x_i) \geq 0.5\)
  • The values of \(\theta\) are learned using gradient descent (a minimal sketch follows; the loss function is derived next lecture).
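As a preview of the next lecture, here is a minimal gradient-descent sketch, assuming NumPy, a fixed learning rate, and the standard average cross-entropy gradient \(\frac{1}{m}X^\top(\sigma(X\theta) - y)\). The toy data and learning rate are made up for illustration.

Code
import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

# Toy data: a single centred feature plus a bias column; labels in {0, 1}.
# (Raw flipper lengths would need scaling for plain gradient descent.)
X_b = np.c_[np.ones(4), np.array([-2.0, -1.0, 1.0, 2.0])]
y_toy = np.array([0.0, 0.0, 1.0, 1.0])

theta = np.zeros(2)
eta = 0.1            # learning rate, chosen arbitrarily for this sketch
m = len(y_toy)

for _ in range(5000):
    # Gradient of the average cross-entropy loss: X^T (sigma(X theta) - y) / m
    gradient = X_b.T @ (sigmoid(X_b @ theta) - y_toy) / m
    theta -= eta * gradient

print(theta)                 # learned [theta_0, theta_1]
print(sigmoid(X_b @ theta))  # predicted probabilities for the four examples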

Geometric Interpretation

Geometric Interpretation

  • Do you recognize this equation? \[ w_1 x_1 + w_2 x_2 + \ldots + w_D x_D \]

  • This is the dot product of \(\mathbf{w}\) and \(\mathbf{x}\), \(\mathbf{w} \cdot \mathbf{x}\).

  • What is the geometric interpretation of the dot product?

\[ \mathbf{w} \cdot \mathbf{x} = \|\mathbf{w}\| \|\mathbf{x}\| \cos \theta \]

Geometric Interpretation

\[ \mathbf{w} \cdot \mathbf{x} = \|\mathbf{w}\| \|\mathbf{x}\| \cos \theta \]

  • Together with the vector norms, the dot product determines the angle \((\theta)\) between the vectors.

  • It quantifies how much one vector extends in the direction of another.

  • Its value is zero if the vectors are perpendicular \((\theta = 90^\circ)\), as the small example below illustrates.
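A small numerical illustration of these statements, assuming NumPy; the vectors are arbitrary.

Code
import numpy as np

w = np.array([3.0, 4.0])
x_aligned = np.array([2.0, 1.0])    # roughly in the same direction as w
x_perp = np.array([-4.0, 3.0])      # perpendicular to w

for x in (x_aligned, x_perp):
    dot = w @ x
    cos_angle = dot / (np.linalg.norm(w) * np.linalg.norm(x))
    print(dot, np.degrees(np.arccos(cos_angle)))
# First pair: positive dot product, small angle (about 27 degrees).
# Second pair: dot product 0, angle of 90 degrees.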

Geometric Interpretation

  • Logistic regression uses a linear combination of the input features, \(\mathbf{w} \cdot \mathbf{x} + b\), as the argument to the sigmoid (logistic) function.

  • Geometrically, \(\mathbf{w}\) can be viewed as a vector normal to a hyperplane in the feature space, and any point \(\mathbf{x}\) is projected onto \(\mathbf{w}\) via the dot product \(\mathbf{w} \cdot \mathbf{x}\).

Geometric Interpretation

  • The decision boundary is where this linear combination equals zero, i.e., \(\mathbf{w} \cdot \mathbf{x} + b = 0\).

  • Points on one side of the boundary have a positive dot product and are more likely to be classified as the positive class (1).

  • Points on the other side have a negative dot product and are more likely to be in the opposite class (0).

  • The sigmoid function simply turns this signed score, which is proportional to the signed distance from the boundary, into a probability between 0 and 1 (see the sketch below).
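The sketch below ties these points together, assuming NumPy; the weights and points are made up. The sign of \(\mathbf{w} \cdot \mathbf{x} + b\) indicates the side of the hyperplane, dividing by \(\|\mathbf{w}\|\) gives the signed distance, and the sigmoid converts the raw score into a probability.

Code
import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

w = np.array([1.5, -0.5])   # normal vector of the hyperplane (made up)
b = -1.0

points = [np.array([2.0, 1.0]),    # positive side of the boundary
          np.array([0.5, 1.0]),    # negative side
          np.array([1.0, 1.0])]    # exactly on the boundary

for x in points:
    score = w @ x + b
    distance = score / np.linalg.norm(w)    # signed distance to the hyperplane
    print(score, distance, sigmoid(score))  # boundary point maps to 0.5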

Prologue

Summary

  • Introduced linear models for classification tasks, focusing on logistic regression.
  • Demonstrated how logistic regression leverages a sigmoidal (logistic) function to transform linear combinations of features into probabilities.
  • Explained binary vs. multi-class classification (via one-vs-all).
  • Highlighted the geometric interpretation of the decision boundary in high-dimensional spaces.

Next lecture

  • Negative log-likelihood, implementation, and interpretation of parameter values.

Resources

References

Alharbi, Fadi, and Aleksandar Vakanski. 2023. “Machine Learning Methods for Cancer Classification Using Gene Expression Data: A Review.” Bioengineering 10 (2): 173. https://doi.org/10.3390/bioengineering10020173.
Torang, Arezo, Paraag Gupta, and David J. Klinke. 2019. “An elastic-net logistic regression approach to generate classifiers and gene signatures for types of immune cells and T helper cell subsets.” BMC Bioinformatics 20 (1): 433. https://doi.org/10.1186/s12859-019-2994-z.
Wu, Qianfan, Adel Boueiz, Alican Bozkurt, Arya Masoomi, Allan Wang, Dawn L. DeMeo, Scott T. Weiss, and Weiliang Qiu. 2018. “Deep Learning Methods for Predicting Disease Status Using Genomic Data.” Journal of Biometrics & Biostatistics 9 (5).

Marcel Turcotte

Marcel.Turcotte@uOttawa.ca

School of Electrical Engineering and Computer Science (EECS)

University of Ottawa