School of Electrical Engineering and Computer Science
University of Ottawa
Published
October 5, 2025
Scenario
In this toy scenario, we predict a house price using k-Nearest Neighbors (KNN) regression with two features:
\(x_1\): number of rooms (small scale)
\(x_2\): square footage (large scale)
We create three examples a, b, c chosen so that:
Without scaling, a is closer to b (because square footage dominates).
With scaling (z-score), a becomes closer to c (rooms difference matters after rescaling).
Data (three houses)
import numpy as np
import pandas as pd

# Three examples (rooms, sqft); prices only for b and c (training)
point_names = ["a", "b", "c"]
X = np.array([
    [4, 1500.0],  # a (query)
    [8, 1520.0],  # b (train)
    [4, 1300.0],  # c (train)
], dtype=float)

prices = pd.Series([np.nan, 520_000, 390_000], index=point_names, name="price")
df = pd.DataFrame(X, columns=["rooms", "sqft"], index=point_names)

display(df)
display(prices.to_frame())
   rooms    sqft
a    4.0  1500.0
b    8.0  1520.0
c    4.0  1300.0
      price
a       NaN
b  520000.0
c  390000.0
Note. We’ll treat b and c as the training set, and a as the query whose price we want to predict.
Euclidean distances (unscaled)
The (squared) Euclidean distance between \(u\) and \(v\) is \[
\|u-v\|_2^2 = \sum_j (u_j - v_j)^2.
\]
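When one feature has a much larger scale (e.g., square footage), it can dominate the sum. The distances reported below are the square roots of this quantity; taking the square root does not change which point is nearest.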
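To make the dominance concrete, the per-feature terms of the sum can be inspected directly. The following is a minimal sketch using the three houses defined earlier; the helper name `squared_contributions` is hypothetical and introduced only for illustration.

```python
import numpy as np

# The three houses from the data section: (rooms, sqft)
a = np.array([4.0, 1500.0])
b = np.array([8.0, 1520.0])
c = np.array([4.0, 1300.0])

def squared_contributions(u, v):
    """Per-feature terms (u_j - v_j)^2 of the squared Euclidean distance."""
    return (u - v) ** 2

contrib_ab = squared_contributions(a, b)  # [rooms term, sqft term]
contrib_ac = squared_contributions(a, c)

print("a-b terms:", contrib_ab, "sum:", contrib_ab.sum())
print("a-c terms:", contrib_ac, "sum:", contrib_ac.sum())
```

For the pair (a, b), a difference of 4 rooms contributes 16 to the sum, while a difference of only 20 square feet contributes 400.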
from sklearn.metrics import pairwise_distances

dist_unscaled = pd.DataFrame(
    pairwise_distances(df.values, metric="euclidean"),
    index=df.index,
    columns=df.index,
)
dist_unscaled
            a           b           c
a    0.000000   20.396078  200.000000
b   20.396078    0.000000  220.036361
c  200.000000  220.036361    0.000000
print("Nearest to 'a' (unscaled):", dist_unscaled.loc["a"].drop("a").idxmin())
Nearest to 'a' (unscaled): b
Expectation: a is nearest to b (the similar square footage overwhelms the difference in rooms).
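The consequence for the prediction itself can be sketched with scikit-learn's `KNeighborsRegressor`. This is a minimal illustration, not necessarily the workflow used in the rest of this document; the choice of k = 1 and the variable names `X_train`, `y_train`, and `X_query` are assumptions made for the example.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Training points b and c: (rooms, sqft) with their known prices
X_train = np.array([[8, 1520.0], [4, 1300.0]])
y_train = np.array([520_000, 390_000])

# Query point a, whose price is unknown
X_query = np.array([[4, 1500.0]])

# Assumption for this sketch: k = 1, so the prediction is the nearest neighbour's price
knn = KNeighborsRegressor(n_neighbors=1)
knn.fit(X_train, y_train)

pred = knn.predict(X_query)
print("Predicted price for 'a' (unscaled):", pred[0])
```

With unscaled features the nearest neighbour is b, so a inherits b's price of 520,000 even though b has twice as many rooms.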
Proper scaling for modeling (fit scaler on the training set)
For a fair ML workflow, compute scaling parameters on the training data (b, c) only, then transform both train and query:
from sklearn.preprocessing import StandardScaler

train_idx = ["b", "c"]
query_idx = ["a"]

scaler = StandardScaler()
scaler.fit(df.loc[train_idx])  # fit only on training points

Z = pd.DataFrame(
    scaler.transform(df),
    columns=df.columns,
    index=df.index,
)
Z
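As a cross-check, the same z-scores can be computed by hand. The sketch below assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses; the names `mu`, `sigma`, and `Z_manual` are introduced here for illustration only.

```python
import numpy as np

X = np.array([
    [4, 1500.0],  # a (query)
    [8, 1520.0],  # b (train)
    [4, 1300.0],  # c (train)
])
train = X[1:]  # rows b and c only

# Scaling parameters estimated from the training rows
mu = train.mean(axis=0)    # per-feature mean: rooms 6, sqft 1410
sigma = train.std(axis=0)  # population std (ddof=0): rooms 2, sqft 110

# Apply the training parameters to every row, including the query
Z_manual = (X - mu) / sigma

# Distances from a to each training point in the scaled space
d_ab = np.linalg.norm(Z_manual[0] - Z_manual[1])
d_ac = np.linalg.norm(Z_manual[0] - Z_manual[2])
print("a-b:", d_ab, "a-c:", d_ac)
```

After scaling, the 4-room difference between a and b amounts to two standard deviations, so a ends up closer to c than to b.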