CSI 4106 - Fall 2024
Version: Nov 14, 2024 09:02
GridSearchCV in scikit-learn
OpenML is an open platform for sharing datasets, algorithms, and experiments - to learn how to learn better, together.
Author: Vincent Sigillito
Source: Obtained from UCI
Please cite: UCI citation policy
Title: Pima Indians Diabetes Database
Sources:
Past Usage:
Smith, J.W., Everhart, J.E., Dickson, W.C., Knowler, W.C., & Johannes, R.S. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Symposium on Computer Applications and Medical Care (pp. 261–265). IEEE Computer Society Press.
The diagnostic, binary-valued variable investigated is whether the patient shows signs of diabetes according to World Health Organization criteria (i.e., if the 2 hour post-load plasma glucose was at least 200 mg/dl at any survey examination or if found during routine medical care). The population lives near Phoenix, Arizona, USA.
Results: Their ADAP algorithm makes a real-valued prediction between 0 and 1. This was transformed into a binary decision using a cutoff of 0.448. Using 576 training instances, the sensitivity and specificity of their algorithm were both 76% on the remaining 192 instances.
Relevant Information: Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage. ADAP is an adaptive learning routine that generates and executes digital analogs of perceptron-like devices. It is a unique algorithm; see the paper for details.
Number of Instances: 768
Number of Attributes: 8 plus class
For Each Attribute: (all numeric-valued)
Missing Attribute Values: None
Class Distribution: (class value 1 is interpreted as “tested positive for diabetes”)
Class Value    Number of instances
0              500
1              268
Brief statistical analysis:
Attribute number    Mean     Standard Deviation
1                   3.8      3.4
2                   120.9    32.0
3                   69.1     19.4
4                   20.5     16.0
5                   79.8     115.2
6                   32.0     7.9
7                   0.5      0.3
8                   33.2     11.8
Relabeled values in attribute ‘class’ From: 0 To: tested_negative
From: 1 To: tested_positive
Downloaded from openml.org.
Depending on its parameters (as_frame and return_X_y), fetch_openml returns a Bunch, a DataFrame, or X and y.
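For instance, a minimal loading sketch (assuming the OpenML dataset name "diabetes", version 1, which corresponds to the Pima Indians Diabetes Database described above; the variable names are illustrative):
from sklearn.datasets import fetch_openml

# as_frame=True: a Bunch whose .data is a DataFrame and .target a Series
diabetes = fetch_openml(name="diabetes", version=1, as_frame=True)
X, y = diabetes.data, diabetes.target

# return_X_y=True: the features and labels directly
X, y = fetch_openml(name="diabetes", version=1, as_frame=True, return_X_y=True)
print(X.shape)  # expected (768, 8)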
The class distribution (500 negative vs. 268 positive, a ratio of about 1.9) shows mild imbalance (ratio less than 3 or 4).
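A quick check of the class ratio (a sketch assuming y loaded as above):
counts = y.value_counts()
print(counts)                       # tested_negative 500, tested_positive 268
print(counts.max() / counts.min())  # about 1.9, i.e., mild imbalance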
Splitting the data into training and test sets is sometimes called the holdout method.
Guideline: Typically, allocate 80% of your dataset for training and reserve the remaining 20% for testing.
Training Set: This subset of data is utilized to train your model.
Test Set: This is an independent subset used exclusively at the final stage to assess the model’s performance.
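A minimal sketch of such a split (assuming X and y loaded as above; stratify and random_state are illustrative choices, not necessarily those used for the results later in these notes):
from sklearn.model_selection import train_test_split

# Hold out 20% of the data for the final evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)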
Training Error: The error rate measured on the same data the model was trained on.
Generalization Error: The error rate observed when the model is evaluated on new, unseen data.
Underfitting: The model is too simple to capture the underlying patterns, resulting in high error on both the training data and unseen data.
Overfitting: The model captures noise specific to the training data, resulting in low training error but high generalization error.
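To make these notions concrete, a brief sketch comparing training and test accuracy for an unconstrained decision tree (assumes the split above; exact numbers will vary):
from sklearn import tree

clf = tree.DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# A near-perfect training score with a much lower test score is the
# typical signature of overfitting.
print("Training accuracy:", clf.score(X_train, y_train))
print("Test accuracy:", clf.score(X_test, y_test))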
Cross-validation is a method used to evaluate and improve the performance of machine learning models.
It involves partitioning the dataset into multiple subsets, training the model on some subsets while validating it on the remaining ones.
cross_val_score
from sklearn import tree
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation of a decision tree on the full dataset
clf = tree.DecisionTreeClassifier()
clf_scores = cross_val_score(clf, X, y, cv=5)

print("\nScores:", clf_scores)
print(f"\nMean: {clf_scores.mean():.2f}")
print(f"\nStandard deviation: {clf_scores.std():.2f}")
Scores: [0.71428571 0.66883117 0.71428571 0.79738562 0.73202614]
Mean: 0.73
Standard deviation: 0.04
A hyperparameter is a configuration external to the model that is set prior to the training process and governs the learning process, influencing model performance and complexity.
Decision tree:
criterion: gini, entropy, or log_loss; measures the quality of a split.
max_depth: limits the number of levels in the tree to prevent overfitting.
Logistic regression:
penalty: l1 or l2; helps in preventing overfitting.
solver: liblinear, newton-cg, lbfgs, sag, or saga.
max_iter: maximum number of iterations taken for the solvers to converge.
tol: stopping criterion; smaller values mean higher precision.
k-nearest neighbors:
n_neighbors: number of neighbors to use for k-neighbors queries.
weights: uniform or distance; equal weight or distance-based weight.
max_depth
# Explore max_depth with 10-fold cross-validation on the training set
for value in [3, 5, 7, None]:
    clf = tree.DecisionTreeClassifier(max_depth=value)
    clf_scores = cross_val_score(clf, X_train, y_train, cv=10)
    print("\nmax_depth = ", value)
    print(f"Mean: {clf_scores.mean():.2f}")
    print(f"Standard deviation: {clf_scores.std():.2f}")
max_depth = 3
Mean: 0.74
Standard deviation: 0.04
max_depth = 5
Mean: 0.76
Standard deviation: 0.04
max_depth = 7
Mean: 0.73
Standard deviation: 0.04
max_depth = None
Mean: 0.71
Standard deviation: 0.05
criterion
# Compare splitting criteria, keeping max_depth fixed at 5
for value in ["gini", "entropy", "log_loss"]:
    clf = tree.DecisionTreeClassifier(max_depth=5, criterion=value)
    clf_scores = cross_val_score(clf, X_train, y_train, cv=10)
    print("\ncriterion = ", value)
    print(f"Mean: {clf_scores.mean():.2f}")
    print(f"Standard deviation: {clf_scores.std():.2f}")
criterion = gini
Mean: 0.76
Standard deviation: 0.04
criterion = entropy
Mean: 0.75
Standard deviation: 0.05
criterion = log_loss
Mean: 0.75
Standard deviation: 0.05
n_neighbors
from sklearn.neighbors import KNeighborsClassifier

# Explore the number of neighbors with 10-fold cross-validation
for value in range(1, 11):
    clf = KNeighborsClassifier(n_neighbors=value)
    clf_scores = cross_val_score(clf, X_train, y_train, cv=10)
    print("\nn_neighbors = ", value)
    print(f"Mean: {clf_scores.mean():.2f}")
    print(f"Standard deviation: {clf_scores.std():.2f}")
n_neighbors = 1
Mean: 0.67
Standard deviation: 0.05
n_neighbors = 2
Mean: 0.71
Standard deviation: 0.03
n_neighbors = 3
Mean: 0.69
Standard deviation: 0.05
n_neighbors = 4
Mean: 0.73
Standard deviation: 0.03
n_neighbors = 5
Mean: 0.72
Standard deviation: 0.03
n_neighbors = 6
Mean: 0.73
Standard deviation: 0.05
n_neighbors = 7
Mean: 0.74
Standard deviation: 0.04
n_neighbors = 8
Mean: 0.75
Standard deviation: 0.04
n_neighbors = 9
Mean: 0.73
Standard deviation: 0.05
n_neighbors = 10
Mean: 0.73
Standard deviation: 0.04
weights
from sklearn.neighbors import KNeighborsClassifier
# Compare uniform and distance-based neighbor weighting
for value in ["uniform", "distance"]:
    clf = KNeighborsClassifier(n_neighbors=5, weights=value)
    clf_scores = cross_val_score(clf, X_train, y_train, cv=10)
    print("\nweights = ", value)
    print(f"Mean: {clf_scores.mean():.2f}")
    print(f"Standard deviation: {clf_scores.std():.2f}")
weights = uniform
Mean: 0.72
Standard deviation: 0.03
weights = distance
Mean: 0.73
Standard deviation: 0.04
Many hyperparameters need tuning
Manual exploration of combinations is tedious
Grid search is more systematic
Enumerate all possible hyperparameter combinations
Train on training set, evaluate on validation set
GridSearchCV
from sklearn.model_selection import GridSearchCV
# Exhaustive search over max_depth and criterion with 5-fold cross-validation
param_grid = [
    {'max_depth': range(1, 10),
     'criterion': ["gini", "entropy", "log_loss"]}
]

clf = tree.DecisionTreeClassifier()
grid_search = GridSearchCV(clf, param_grid, cv=5)
grid_search.fit(X_train, y_train)
(grid_search.best_params_, grid_search.best_score_)
({'criterion': 'gini', 'max_depth': 5}, 0.7481910124074653)
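Beyond best_params_ and best_score_, the per-combination results can be inspected through cv_results_; a brief sketch (assumes pandas is available):
import pandas as pd

# One row per hyperparameter combination, with mean and std over the folds
results = pd.DataFrame(grid_search.cv_results_)
print(results[["param_criterion", "param_max_depth", "mean_test_score", "std_test_score"]]
      .sort_values("mean_test_score", ascending=False)
      .head())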
# Exhaustive search over n_neighbors and weights for k-NN
param_grid = [
    {'n_neighbors': range(1, 15),
     'weights': ["uniform", "distance"]}
]

clf = KNeighborsClassifier()
grid_search = GridSearchCV(clf, param_grid, cv=5)
grid_search.fit(X_train, y_train)
(grid_search.best_params_, grid_search.best_score_)
({'n_neighbors': 14, 'weights': 'uniform'}, 0.7554165363361485)
GridSearchCV
from sklearn.linear_model import LogisticRegression
# 3 * 5 * 5 * 3 = 225 combinations!
# Note: a few penalty/solver pairs are not supported; GridSearchCV scores them as NaN and warns.
param_grid = [
    {'penalty': ["l1", "l2", None],
     'solver': ['liblinear', 'newton-cg', 'lbfgs', 'sag', 'saga'],
     'max_iter': [100, 200, 400, 800, 1600],
     'tol': [0.01, 0.001, 0.0001]}
]
clf = LogisticRegression()
grid_search = GridSearchCV(clf, param_grid, cv=5)
grid_search.fit(X_train, y_train)
(grid_search.best_params_, grid_search.best_score_)
({'max_iter': 100, 'penalty': 'l2', 'solver': 'newton-cg', 'tol': 0.001},
0.7756646856427901)
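Since refit=True by default, GridSearchCV retrains the best configuration on the whole training set, so the tuned model can also be used directly; the explicit refit below is equivalent:
y_pred = grid_search.best_estimator_.predict(X_test)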
from sklearn.metrics import classification_report

# Retrain with the best hyperparameters and evaluate on the held-out test set
clf = LogisticRegression(max_iter=100, penalty='l2', solver='newton-cg', tol=0.001)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))
precision recall f1-score support
0 0.83 0.83 0.83 52
1 0.64 0.64 0.64 25
accuracy 0.77 77
macro avg 0.73 0.73 0.73 77
weighted avg 0.77 0.77 0.77 77
Marcel Turcotte
School of Electrical Engineering and Computer Science (EECS)
University of Ottawa