Scaling

CSI4106 Introduction to Artificial Intelligence

Author
Affiliations

Marcel Turcotte

School of Electrical Engineering and Computer Science

University of Ottawa

Published

October 5, 2025

Scenario

Suppose we want to predict a house price using k-Nearest Neighbors (KNN) regression with two features:

  • \(x_1\): number of rooms (small scale)
  • \(x_2\): square footage (large scale)

We create three examples a, b, c chosen so that:

  • Without scaling, a is closer to b (because square footage dominates).
  • With scaling (z-score), a becomes closer to c (rooms difference matters after rescaling).

Data (three houses)

import numpy as np
import pandas as pd

# Three examples (rooms, sqft); prices only for b and c (training)
point_names = ["a", "b", "c"]
X = np.array([
    [4, 1500.0],  # a (query)
    [8, 1520.0],  # b (train)
    [4, 1300.0],  # c (train)
], dtype=float)

prices = pd.Series([np.nan, 520_000, 390_000], index=point_names, name="price")

df = pd.DataFrame(X, columns=["rooms", "sqft"], index=point_names)
display(df)
display(prices.to_frame())
   rooms    sqft
a    4.0  1500.0
b    8.0  1520.0
c    4.0  1300.0

      price
a       NaN
b  520000.0
c  390000.0

Note. We’ll treat b and c as the training set, and a as the query whose price we want to predict.

Euclidean distances (unscaled)

The (squared) Euclidean distance between \(u\) and \(v\) is \[ \|u-v\|_2^2 = \sum_j (u_j - v_j)^2. \]

When one feature has a much larger scale (e.g., square footage), it can dominate the sum.
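To see the dominance concretely, we can break the squared distance between a and b into per-feature contributions (a quick check reusing df from the previous cell):

# Per-feature contributions to the squared distance between a and b
diff_ab = df.loc["a"] - df.loc["b"]
contrib = diff_ab ** 2
print(contrib)                   # rooms: 16.0, sqft: 400.0
print(contrib / contrib.sum())   # sqft accounts for ~96% of the total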

from sklearn.metrics import pairwise_distances

dist_unscaled = pd.DataFrame(
    pairwise_distances(df.values, metric="euclidean"),
    index=df.index, columns=df.index
)
dist_unscaled
            a           b           c
a    0.000000   20.396078  200.000000
b   20.396078    0.000000  220.036361
c  200.000000  220.036361    0.000000
print("Nearest to 'a' (unscaled):", dist_unscaled.loc["a"].drop("a").idxmin())
Nearest to 'a' (unscaled): b

As expected, a is nearest to b: the sqft dimension dominates the distance, and a and b have similar square footage.
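Concretely:

\[ d(a,b)^2 = (4-8)^2 + (1500-1520)^2 = 16 + 400 = 416 \;\Rightarrow\; d(a,b) \approx 20.40, \]

\[ d(a,c)^2 = (4-4)^2 + (1500-1300)^2 = 40{,}000 \;\Rightarrow\; d(a,c) = 200. \]

Even though a and c have identical rooms, their 200 sqft gap makes them look far apart; the 4-room gap between a and b barely registers.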

Proper scaling for modeling (fit scaler on the training set)

For a fair ML workflow, compute scaling parameters on the training data (b, c) only, then transform both train and query:

\[ z(x) = \frac{x-\mu_{\text{train}}}{\sigma_{\text{train}}}. \]

from sklearn.preprocessing import StandardScaler

train_idx = ["b", "c"]
query_idx = ["a"]

scaler = StandardScaler()

scaler.fit(df.loc[train_idx])     # fit only on training points

Z = pd.DataFrame(
    scaler.transform(df),
    columns=df.columns, index=df.index
)

Z
   rooms      sqft
a   -1.0  0.818182
b    1.0  1.000000
c   -1.0 -1.000000
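As a sanity check, the fitted parameters should be the mean and (population) standard deviation of the training points b and c only; StandardScaler exposes them as mean_ and scale_:

# Parameters learned from the training set {b, c} alone:
# mean_ = [6.0, 1410.0] (rooms, sqft), scale_ = [2.0, 110.0]
print("mean_ :", scaler.mean_)
print("scale_:", scaler.scale_)

For the query a, the standardized sqft value is \((1500 - 1410)/110 \approx 0.818\), matching the table above.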

Euclidean distances (after scaling)

dist_scaled = pd.DataFrame(
    pairwise_distances(Z.values, metric="euclidean"),
    index=Z.index, columns=Z.index
)
dist_scaled
          a         b         c
a  0.000000  2.008247  1.818182
b  2.008247  0.000000  2.828427
c  1.818182  2.828427  0.000000
print("Nearest to 'a' (scaled):", dist_scaled.loc["a"].drop("a").idxmin())
Nearest to 'a' (scaled): c

Now: a is nearest to c (rooms difference matters once features are on comparable scales).
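From the standardized coordinates above:

\[ d(a,b)^2 = (-1-1)^2 + (0.818-1)^2 \approx 4.033, \qquad d(a,c)^2 = 0^2 + (0.818-(-1))^2 \approx 3.306, \]

so \(d(a,b) \approx 2.008 > d(a,c) \approx 1.818\): the two-room gap now dominates the comparison.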

KNN regressor: flip in the prediction

We’ll run a 1-NN regressor (so the prediction is exactly the nearest neighbor’s price) with and without scaling.

from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline

X_train = df.loc[train_idx].values          # b, c
y_train = prices.loc[train_idx].values      # prices for b, c
X_query = df.loc[query_idx].values          # a

# 1) No scaling
knn_plain = KNeighborsRegressor(n_neighbors=1, metric="euclidean")
knn_plain.fit(X_train, y_train)
pred_plain = knn_plain.predict(X_query)[0]

# 2) With scaling (pipeline fits scaler only on training, then KNN on scaled)
knn_scaled = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsRegressor(n_neighbors=1, metric="euclidean"))
])
knn_scaled.fit(X_train, y_train)
pred_scaled = knn_scaled.predict(X_query)[0]

# Identify each model's nearest neighbor directly, rather than inferring it
# by comparing predicted prices (floating-point equality is fragile, and two
# training prices could coincide).
nn_plain = train_idx[knn_plain.kneighbors(X_query)[1][0, 0]]
nn_scaled = train_idx[knn_scaled.named_steps["knn"].kneighbors(
    knn_scaled.named_steps["scaler"].transform(X_query))[1][0, 0]]

pd.DataFrame(
    {
        "prediction (no scaling)": [pred_plain],
        "prediction (with scaling)": [pred_scaled],
        "nearest neighbor (no scaling)": [nn_plain],
        "nearest neighbor (with scaling)": [nn_scaled],
    },
    index=["a"]
)
   prediction (no scaling)  prediction (with scaling)  nearest neighbor (no scaling)  nearest neighbor (with scaling)
a                 520000.0                   390000.0                              b                               c

Takeaway:

  • Unscaled: a ↔ b ⇒ prediction ≈ $520,000
  • Scaled: a ↔ c ⇒ prediction ≈ $390,000

Same model, same data: changing only the feature scale changed the nearest neighbor, and with it the prediction.

Why this happens

  • (Squared) Euclidean distance aggregates per-feature squared differences:

\[ \|u-v\|_2^2 = \sum_j (u_j - v_j)^2. \]

  • A large-scale feature (e.g., sqft) can dwarf small-scale features (e.g., rooms), so KNN effectively “ignores” the smaller-scale dimensions.
  • Standardization (\(z\)-scores) or min-max scaling puts dimensions on comparable footing (a min-max sketch follows this list).
  • Rule of thumb: For distance-based methods (KNN, k-means, RBF kernels, etc.), always scale features.
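A minimal min-max sketch, reusing df and train_idx from above and again fitting only on the training points:

from sklearn.preprocessing import MinMaxScaler

# Min-max scaling to [0, 1], fit on the training points (b, c) only
mm = MinMaxScaler().fit(df.loc[train_idx])
Z_mm = pd.DataFrame(mm.transform(df), columns=df.columns, index=df.index)
print(Z_mm)  # here too, a ends up nearer to c than to b

Min-max maps each training feature to \([0, 1]\); the conclusion (a is nearest to c) is the same as with \(z\)-scores in this example.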

Distances from the query to its neighbors

Distances from a to {b, c} before and after scaling.

def show_pair(name_from, names_to, D):
    return D.loc[name_from, names_to].to_frame("distance")

print("Unscaled distances from a → {b,c}")
display(show_pair("a", ["b", "c"], dist_unscaled))

print("Scaled distances from a → {b,c}")
display(show_pair("a", ["b", "c"], dist_scaled))
Unscaled distances from a → {b,c}
     distance
b   20.396078
c  200.000000

Scaled distances from a → {b,c}
   distance
b  2.008247
c  1.818182

Switch to Manhattan distance?

Even with \(L_1\) distance, scale still matters:

\[ \|u-v\|_1 = \sum_j |u_j - v_j|. \]

Try replacing metric="euclidean" with metric="manhattan"—you’ll see the same sensitivity to feature scale.
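Before rerunning the regressor, here is a quick look at the \(L_1\) distances themselves, reusing df, Z, and pairwise_distances from the earlier cells:

dist_l1_unscaled = pd.DataFrame(
    pairwise_distances(df.values, metric="manhattan"),
    index=df.index, columns=df.index
)
dist_l1_scaled = pd.DataFrame(
    pairwise_distances(Z.values, metric="manhattan"),
    index=Z.index, columns=Z.index
)
print(dist_l1_unscaled.loc["a", ["b", "c"]])  # b: 24.0,  c: 200.0  (b nearer)
print(dist_l1_scaled.loc["a", ["b", "c"]])    # b: ~2.18, c: ~1.82  (c nearer)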

from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline

X_train = df.loc[train_idx].values          # b, c
y_train = prices.loc[train_idx].values      # prices for b, c
X_query = df.loc[query_idx].values          # a

# 1) No scaling
knn_plain = KNeighborsRegressor(n_neighbors=1, metric="manhattan")
knn_plain.fit(X_train, y_train)
pred_plain = knn_plain.predict(X_query)[0]

# 2) With scaling (pipeline fits scaler only on training, then KNN on scaled)
knn_scaled = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsRegressor(n_neighbors=1, metric="manhattan"))
])
knn_scaled.fit(X_train, y_train)
pred_scaled = knn_scaled.predict(X_query)[0]

# As before, read the nearest neighbor off the fitted models directly.
nn_plain = train_idx[knn_plain.kneighbors(X_query)[1][0, 0]]
nn_scaled = train_idx[knn_scaled.named_steps["knn"].kneighbors(
    knn_scaled.named_steps["scaler"].transform(X_query))[1][0, 0]]

pd.DataFrame(
    {
        "prediction (no scaling)": [pred_plain],
        "prediction (with scaling)": [pred_scaled],
        "nearest neighbor (no scaling)": [nn_plain],
        "nearest neighbor (with scaling)": [nn_scaled],
    },
    index=["a"]
)
   prediction (no scaling)  prediction (with scaling)  nearest neighbor (no scaling)  nearest neighbor (with scaling)
a                 520000.0                   390000.0                              b                               c

TL;DR

  • Distance-based models are highly sensitive to feature scales.
  • Always scale your inputs (fit the scaler on the training set only).
  • Scaling can change nearest neighbors and therefore change predictions—as seen here with 1-NN regression.