CSI 4106 - Fall 2024
Version: Nov 14, 2024 09:02
Data augmentation is a technique used to increase the diversity of a dataset by applying various transformations to the existing data.
Author: Ronny Kohavi and Barry Becker
Source: UCI - 1996
Please cite: Ron Kohavi, “Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid”, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 1996
Prediction task is to determine whether a person makes over 50K a year. Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1)&& (HRSWK>0))
This is the original version from the UCI repository, with training and test sets merged.
Variables are all self-explanatory except fnlwgt. This is a proxy for the demographic background of the people: “People with similar demographic characteristics should have similar weights”. This similarity-statement is not transferable across the 51 different states.
Description from the donor of the database:
The weights on the CPS files are controlled to independent estimates of the civilian noninstitutional population of the US. These are prepared monthly for us by Population Division here at the Census Bureau. We use 3 sets of controls. These are: 1. A single cell estimate of the population 16+ for each state. 2. Controls for Hispanic Origin by age and sex. 3. Controls by Race, age and sex.
We use all three sets of controls in our weighting program and “rake” through them 6 times so that by the end we come back to all the controls we used. The term estimate refers to population totals derived from CPS by creating “weighted tallies” of any specified socio-economic characteristics of the population. People with similar demographic characteristics should have similar weights. There is one important caveat to remember about this statement. That is that since the CPS sample is actually a collection of 51 state samples, each with its own probability of selection, the statement only applies within state.
Ronny Kohavi and Barry Becker. Data Mining and Visualization, Silicon Graphics.
e-mail: ronnyk ‘@’ live.com for questions.
Downloaded from openml.org.
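The category listings below come from inspecting the loaded dataframe. Here is a minimal sketch of how the dataset might be fetched and inspected, assuming fetch_openml is used as in the later examples; the exact name and version arguments are assumptions, since the original loading cell is not shown.
from sklearn.datasets import fetch_openml

# Fetch the Adult census dataset from OpenML as pandas dataframes
# (name and version are assumptions; the original loading call is not shown)
adult = fetch_openml(name='adult', version=2, as_frame=True)

# Distinct categories of two of the categorical attributes
print(adult.data['workclass'].unique())
print(adult.data['marital-status'].unique())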
['Private', 'Local-gov', NaN, 'Self-emp-not-inc', 'Federal-gov', 'State-gov', 'Self-emp-inc', 'Without-pay', 'Never-worked']
Categories (8, object): ['Federal-gov', 'Local-gov', 'Never-worked', 'Private', 'Self-emp-inc', 'Self-emp-not-inc', 'State-gov', 'Without-pay']
['Never-married', 'Married-civ-spouse', 'Widowed', 'Divorced', 'Separated', 'Married-spouse-absent', 'Married-AF-spouse']
Categories (7, object): ['Divorced', 'Married-AF-spouse', 'Married-civ-spouse', 'Married-spouse-absent', 'Never-married', 'Separated', 'Widowed']
Key Points on Data Representation
Consider the workclass attribute, which has 8 distinct values such as 'Federal-gov', 'Local-gov', and so on. Which encoding method is preferable, and why?
Integer encoding: \(w = 1, 2, 3, 4, 5, 6, 7\), or \(8\)
Binary encoding: \(w = [0,0,0], [0,0,1], [0,1,0], \ldots\), or \([1,1,1]\)
One-hot encoding: \(w = [1,0,0,0,0,0,0,0], [0,1,0,0,0,0,0,0], \ldots\), or \([0,0,0,0,0,0,0,1]\)
One-Hot Encoding: This method should be preferred for categorical data.
Integer encoding (\(w = 1, 2, 3\), etc.) implies that 'Federal-gov' and 'Local-gov' are similar, while 'Federal-gov' and 'Without-pay' are not.
Binary encoding (\(w = [0,0,0], [0,0,1]\), etc.) might mislead the algorithm by suggesting similarity based on numeric patterns.
One-Hot Encoding: A technique that converts categorical variables into a binary vector representation, where each category is represented by a vector with a single '1' and all other elements as '0'.
from sklearn.preprocessing import OneHotEncoder

# 'adult' is the Adult census dataset loaded earlier from OpenML
work = adult.data[['workclass']]

onehot_encoder = OneHotEncoder()
onehot_encoder.fit(work)
values_encoded = onehot_encoder.transform(work)

# Show the encoding of the first five examples
for i in range(5):
    print(values_encoded.toarray()[i])
[0. 0. 0. 1. 0. 0. 0. 0. 0.]
[0. 0. 0. 1. 0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 1. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 1.]
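To interpret the columns of the encoded output, one can inspect the fitted encoder. A minimal follow-up sketch; note that, with the configuration used here, missing values (NaN) appear as their own category, which is why there are nine columns rather than eight.
# Categories, in column order, for each encoded input feature
print(onehot_encoder.categories_)

# Human-readable names for the encoded columns
print(onehot_encoder.get_feature_names_out())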
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
# Load the 'Heart-Disease' dataset from OpenML
data = fetch_openml(name='Heart-Disease', version=1, as_frame=True)
df = data.frame
# Replace '?' with NaN and convert columns to numeric
for col in df.columns:
    df[col] = pd.to_numeric(df[col], errors='coerce')
# Drop rows with missing values
df.dropna(inplace=True)
# Define features and target
X = df.drop(columns=['target'])
y = df['target']
# Columns to encode with OneHotEncoder
columns_to_encode = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'thal']
# Identify numerical columns
numerical_columns = X.columns.difference(columns_to_encode)
# Split the dataset into training and testing sets before transformations
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Apply OneHotEncoder and StandardScaler using ColumnTransformer
column_transformer = ColumnTransformer(
transformers=[
('onehot', OneHotEncoder(), columns_to_encode),
('scaler', StandardScaler(), numerical_columns)
]
)
# Fit the transformer on the training data and transform both training and test data
X_train_processed = column_transformer.fit_transform(X_train)
X_test_processed = column_transformer.transform(X_test)
# Initialize and train logistic regression model
model = LogisticRegression(max_iter=1000)
model = model.fit(X_train_processed, y_train)
# Retrieve feature names after transformation using get_feature_names_out()
feature_names = column_transformer.get_feature_names_out()
# Get coefficients and map them to feature names
coefficients = model.coef_[0]
# Create a DataFrame with feature names and coefficients
coef_df = pd.DataFrame({
'Feature': feature_names,
'Coefficient': coefficients
})
# Display coefficients associated with 'cp'
cp_features = coef_df[coef_df['Feature'].str.contains('_cp')]
print("\nCoefficients associated with 'cp':")
print(cp_features)
Coefficients associated with 'cp':
Feature Coefficient
2 onehot__cp_0.0 -1.013382
3 onehot__cp_1.0 -0.212284
4 onehot__cp_2.0 0.599934
5 onehot__cp_3.0 0.628824
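The snippet above imports classification_report and creates a test split but never evaluates the model. A minimal sketch of the evaluation step that would typically follow (this continuation is an assumption; it does not appear in the original snippet):
# Evaluate the trained model on the transformed, held-out test set
y_pred = model.predict(X_test_processed)
print(classification_report(y_test, y_pred))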
Ordinal encoding is a technique that assigns numerical values to categorical attributes based on their inherent order or rank.
For attributes with values such as 'Poor', 'Average', and 'Good', an ordinal encoding would make sense.
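A minimal sketch of ordinal encoding with scikit-learn's OrdinalEncoder, using an invented quality attribute with the values above; listing the categories explicitly preserves the intended order Poor < Average < Good.
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data (invented for illustration)
quality = pd.DataFrame({'quality': ['Average', 'Poor', 'Good', 'Good']})

# The explicit category list fixes the order: Poor < Average < Good
encoder = OrdinalEncoder(categories=[['Poor', 'Average', 'Good']])
print(encoder.fit_transform(quality))
# Poor -> 0.0, Average -> 1.0, Good -> 2.0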
Discretization involves grouping continuous or ordinal values into discrete categories (bins).
Example: Categorizing ages into bins such as ‘infant’, ‘child’, ‘teen’, ‘adult’, and ‘senior citizen’.
Advantages:
Disadvantages:
import pandas as pd
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Bin edges and the label attached to each bin
bins = [0, 1, 13, 20, 60, np.inf]
labels = ['infant', 'kid', 'teen', 'adult', 'senior citizen']

# Wrap pd.cut so that binning can be used as a transformer
transformer = FunctionTransformer(
    pd.cut, kw_args={'bins': bins, 'labels': labels, 'retbins': False}
)

X = np.array([0.5, 2, 15, 25, 97])
transformer.fit_transform(X)
['infant', 'kid', 'teen', 'adult', 'senior citizen']
Categories (5, object): ['infant' < 'kid' < 'teen' < 'adult' < 'senior citizen']
Many learning algorithms perform best when feature values fall within similar ranges, such as [-1, 1] or [0, 1].
Normalization (min-max scaling) rescales each feature into the range [0, 1]: \[ \frac{x_i^{(j)} - \min^{(j)}}{\max^{(j)} - \min^{(j)}} \]
Standardization (also known as z-score normalization) rescales each feature so that it has a mean (\(\mu\)) of 0 and a standard deviation (\(\sigma\)) of 1.
\[ \frac{x_i^{(j)} - \mu^{(j)}}{\sigma^{(j)}} \]
Note: The range of values is not bounded!
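A minimal sketch contrasting the two rescalings with scikit-learn; the toy feature column is invented for illustration.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy feature column (invented for illustration)
X = np.array([[1.0], [2.0], [4.0], [10.0]])

# Normalization (min-max scaling): values are mapped into [0, 1]
print(MinMaxScaler().fit_transform(X).ravel())

# Standardization (z-score): zero mean, unit standard deviation, unbounded range
print(StandardScaler().fit_transform(X).ravel())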
Missing values refer to the absence of data points or entries in a dataset where a value is expected.
Some learning algorithms, such as XGBoost, can handle missing values natively, whereas others, such as sklearn.linear_model.LinearRegression, cannot.
Data imputation is the process of replacing missing values in a dataset with substituted values, typically using statistical or machine learning methods.
Mean/Median Imputation: Replace missing values with the mean or median of the attribute.
Cons: Ignores feature correlations and complex relationships.
Mode Imputation: Replace missing values with the most frequent value; also ignores feature correlations.
Special Value Method: Replace missing values with a value outside the normal range (e.g., use -1 or 2 for data normalized between [0,1]).
Middle-Range Imputation: Replace missing values with a value in the middle of the normal range (e.g., use 0 for data distributed in the range [-1,1]).
Instance-Based Methods: Impute a missing value from similar examples, e.g., k-nearest neighbours.
Model-Based Methods: Train a model to predict the missing attribute from the other attributes.
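A minimal sketch of simple, instance-based, and model-based imputation with scikit-learn; the toy matrix is invented for illustration.
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy data with one missing entry (invented for illustration)
X = np.array([[1.0, 2.0], [3.0, 4.0], [np.nan, 6.0], [7.0, 8.0]])

# Mean imputation: replace NaN with the column mean
print(SimpleImputer(strategy='mean').fit_transform(X))

# Instance-based: impute from the k nearest neighbours
print(KNNImputer(n_neighbors=2).fit_transform(X))

# Model-based: iteratively model each feature as a function of the others
print(IterativeImputer(random_state=0).fit_transform(X))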
The class imbalance problem is a scenario where the number of instances in one class significantly outnumbers the instances in other classes.
Models tend to be biased towards the majority class, leading to poor performance on the minority class.
Resampling: Techniques such as oversampling the minority class or undersampling the majority class.
Algorithmic Adjustments: Using cost-sensitive learning or modifying decision thresholds.
Synthetic Data: Generating synthetic samples for the minority class using methods like SMOTE (Synthetic Minority Over-sampling Technique).
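A minimal sketch of two of these mitigations using scikit-learn only: cost-sensitive learning via class_weight, and random oversampling via resample (SMOTE itself is provided by the separate imbalanced-learn package). The toy data is invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

# Toy imbalanced data (invented for illustration): 95 negatives, 5 positives
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.array([0] * 95 + [1] * 5)

# Cost-sensitive learning: weight classes inversely to their frequency
clf_weighted = LogisticRegression(class_weight='balanced').fit(X, y)

# Random oversampling: replicate minority examples to match the majority class
X_min_up, y_min_up = resample(X[y == 1], y[y == 1],
                              replace=True, n_samples=95, random_state=0)
X_bal = np.vstack([X[y == 0], X_min_up])
y_bal = np.concatenate([y[y == 0], y_min_up])
clf_oversampled = LogisticRegression().fit(X_bal, y_bal)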
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Pipeline for the numerical attributes: impute missing values with the median,
# add derived attributes (CombinedAttributesAdder is a user-defined transformer,
# not shown here), then standardize.
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])

training_num_tr = num_pipeline.fit_transform(training_num)

# Apply the numerical pipeline to the numerical attributes and
# one-hot encode the categorical attribute 'sequence'.
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), ["sequence"]),
])

training_prepared = full_pipeline.fit_transform(training)
Marcel Turcotte
School of Electrical Engineering and Computer Science (EECS)
University of Ottawa
I presented GPT-4o with an image of the main character from Marcel the Shell With Shoes On and had the following interactions.
Marcel: How would you describe the subject of the image, its artistic quality, technique, and rendering style?
GPT-4o: The image depicts a whimsical character that resembles a small shell with one large, expressive eye. It stands on two legs wearing shoes, giving it a cute and almost anthropomorphic appearance. The character has a surreal and quirky look, which lends it a sense of personality and charm.
Artistic Quality:
The image is rendered with a high degree of polish and smoothness, possibly created digitally. The textures, shading, and lighting are soft and subtle, giving the character a smooth and rounded look. The rendering captures a delicate balance between realism (in the textures of the shell and shoes) and cartoon-like abstraction (in the exaggerated eye and small stature).
Technique:
The artist appears to have employed a blend of digital painting and 3D-like modeling techniques, with careful attention to detail in the reflections in the eye and the textures of the shell and shoes. The color palette is muted, using soft tones that evoke a gentle, friendly atmosphere.
Rendering Style:
The rendering style is a mix of minimalist and hyperrealist elements. The minimalism is seen in the simplified forms of the character—particularly the body and legs—while the hyperrealism comes through in the detailing of the eye and the subtle textures. This gives the character a tactile, almost tangible quality while maintaining its fantastical, imaginative essence.
The overall effect of the design is both charming and surreal, combining everyday objects (a shell, shoes) with anthropomorphic and fantastical elements to create a unique, playful character.
Marcel: Create an image based on the following description: The image depicts a whimsical character that resembles a small shell with one large, expressive eye. It stands on two legs wearing shoes, giving it a cute and almost anthropomorphic appearance. The character has a surreal and quirky look, which lends it a sense of personality and charm.
Artistic Quality:
The image is rendered with a high degree of polish and smoothness, possibly created digitally. The textures, shading, and lighting are soft and subtle, giving the character a smooth and rounded look. The rendering captures a delicate balance between realism (in the textures of the shell and shoes) and cartoon-like abstraction (in the exaggerated eye and small stature).
Technique:
The artist appears to have employed a blend of digital painting and 3D-like modeling techniques, with careful attention to detail in the reflections in the eye and the textures of the shell and shoes. The color palette is muted, using soft tones that evoke a gentle, friendly atmosphere.
Rendering Style:
The rendering style is a mix of minimalist and hyperrealist elements. The minimalism is seen in the simplified forms of the character—particularly the body and legs—while the hyperrealism comes through in the detailing of the eye and the subtle textures. This gives the character a tactile, almost tangible quality while maintaining its fantastical, imaginative essence.
The overall effect of the design is both charming and surreal, combining everyday objects (a shell, shoes) with anthropomorphic and fantastical elements to create a unique, playful character.
GPT-4o: Here is the image based on the description you provided. Let me know if you’d like to make any adjustments!