
Semester 1 · Autumn 2025

Machine Learning 02452 · Week 2 – Summary statistics, similarity, nearest neighbors and decision trees

Attribute statistics

When working with variables, it is useful to have basic statistical measures such as the mean and variance.

Suppose we record a number of observations $\{x_1, \dots, x_N\}$, for instance the weights of $N = 20$ schoolchildren. We can compute the empirical mean of the sample and use it as our best guess of the average of the larger population.

Mean (Average)

The sample mean (or average) is computed as:

$$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$$

This gives us the central tendency of our data.

Variance

The sample variance measures how spread out the data is around the mean:

$$\hat{s} = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2$$

Note: We divide by $(N-1)$ instead of $N$ to get an unbiased estimate of the population variance (this is called Bessel’s correction).

Standard Deviation

The sample standard deviation is simply the square root of the variance:

$$s = \sqrt{\hat{s}} = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2}$$

This gives us a measure of spread in the same units as the original data.
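As a quick sanity check, here is a minimal NumPy sketch computing all three statistics. The weight values are invented stand-ins for the $N = 20$ schoolchildren example:

```python
import numpy as np

# Hypothetical sample: weights (kg) of N = 20 schoolchildren
x = np.array([32.5, 35.1, 30.2, 41.0, 28.7, 36.4, 33.3, 39.8, 31.6, 34.9,
              29.5, 37.2, 40.1, 27.8, 35.6, 33.0, 38.4, 30.9, 36.7, 32.1])

x_bar = x.mean()        # sample mean
s2 = x.var(ddof=1)      # sample variance; ddof=1 applies Bessel's correction (N-1)
s = x.std(ddof=1)       # sample standard deviation, same units as x

print(f"mean = {x_bar:.2f}, variance = {s2:.2f}, std = {s:.2f}")
```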

4 Summary statistics and measures of similarity

The mean provides important information about the data sample, but it can be affected by outliers. A way to avoid this is to use the median, which finds the value of $x$ that is in the middle of the dataset.

Median

The median is the middle value when the data is sorted:

  • For odd $N$: the median is the value at position $\frac{N+1}{2}$
  • For even $N$: the median is the average of the values at positions $\frac{N}{2}$ and $\frac{N}{2}+1$

The median is robust to outliers because it depends only on the middle value(s), not on extreme values. Equivalently, the median is the value of $x$ such that half of the observations are lower than $x$.

Percentile

We can generalize the median to a percentile. If a sample $x$ is at the $p$-th percentile, this means that $p\%$ of the observations in the dataset are less than or equal to $x$.

Definition: The $p$-th percentile is the value below which $p$ percent of the data falls.

Examples:

  • 50th percentile = median (50% of data below this value)
  • 25th percentile = first quartile (Q1)
  • 75th percentile = third quartile (Q3)
  • 100th percentile = maximum value

Computing Percentiles:

For a sorted dataset $x'_1 \leq x'_2 \leq \dots \leq x'_N$, we can approximate the $p$-th percentile as:

$$x_p \approx x'_{\lceil Np \rceil}$$

where $\lceil \cdot \rceil$ rounds up to the nearest integer.

However, this is an approximation. The exact percentile requires interpolation because:

  • If 180 students have grades less than 11.7, they also have grades less than 11.7001
  • We need to choose a value between adjacent data points

Linear Interpolation Method:

  1. Sort the data: $x'_1 \leq x'_2 \leq \dots \leq x'_N$

  2. Calculate the position index: $z = Np + \frac{1}{2}$

  3. The percentile value is found by linear interpolation: $x_p = x'_{\lfloor z \rfloor} + (z - \lfloor z \rfloor)\,(x'_{\lfloor z \rfloor + 1} - x'_{\lfloor z \rfloor})$

What this means:

  • If $z$ is an integer, use that exact data point: $x_p = x'_z$
  • If $z$ is between two integers (e.g., $z = 5.7$), interpolate between $x'_5$ and $x'_6$
  • The fractional part $(z - \lfloor z \rfloor) = 0.7$ tells us how much weight to give to the higher value

Example: To find the 90th percentile with $N = 200$ students:

  • $z = 200 \times 0.90 + 0.5 = 180.5$
  • $\lfloor z \rfloor = 180$, so interpolate between the 180th and 181st values
  • Using the interpolation formula: $x_{0.90} = x'_{180} + (180.5 - 180)(x'_{181} - x'_{180}) = x'_{180} + 0.5\,(x'_{181} - x'_{180})$
  • This simplifies to the average of the two values: $x_{0.90} = \frac{x'_{180} + x'_{181}}{2}$

Why this works:

  • The fractional part is $0.5$, meaning we’re exactly halfway between the two data points
  • So we weight each value equally: $x'_{180} + 0.5 \times (x'_{181} - x'_{180}) = 0.5 \times x'_{180} + 0.5 \times x'_{181}$

Notation:

  • $\lceil a \rceil$ = ceiling function (rounds up): $\lceil 2.8 \rceil = 3$
  • $\lfloor a \rfloor$ = floor function (rounds down): $\lfloor 2.8 \rfloor = 2$
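Here is a minimal Python sketch of the $z = Np + \frac{1}{2}$ interpolation rule above, with randomly generated grades standing in for the 200 students. Note that NumPy’s built-in `np.percentile` uses a slightly different default interpolation rule, so its results can differ a little:

```python
import numpy as np

def percentile_interp(x, p):
    """p-th percentile (p in (0, 1)) via the z = N*p + 1/2 rule with linear interpolation."""
    xs = np.sort(x)
    N = len(xs)
    z = N * p + 0.5
    lo = int(np.floor(z))      # 1-based position of the lower neighbour
    if lo < 1:                 # clamp when z falls below the first value
        return xs[0]
    if lo >= N:                # clamp when z falls above the last value
        return xs[-1]
    frac = z - lo              # fractional part = weight on the higher value
    return xs[lo - 1] + frac * (xs[lo] - xs[lo - 1])   # 1-based -> 0-based indexing

rng = np.random.default_rng(0)
grades = rng.normal(7, 2, size=200)        # hypothetical grades for N = 200 students
print(percentile_interp(grades, 0.90))     # z = 180.5: average of 180th/181st values
print(percentile_interp(grades, 0.50))     # 50th percentile ...
print(np.median(grades))                   # ... agrees with the median
```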

Mode

The mode is the most frequently occurring value in the dataset.

A dataset can be:

  • Unimodal: one mode
  • Bimodal: two modes
  • Multimodal: more than two modes
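A minimal sketch of finding the mode(s) with the standard library; the grade list is invented for illustration:

```python
from collections import Counter

grades = [7, 10, 7, 4, 12, 7, 10, 10, 2, 7]        # hypothetical discrete grades
counts = Counter(grades)                            # value -> frequency
top = max(counts.values())
modes = [v for v, c in counts.items() if c == top]
print(modes)   # [7]: unimodal; a tie between two values would make it bimodal
```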

Covariance and Correlation

Covariance measures how much a variable $y$ changes when another variable $x$ changes, and vice versa.

Covariance

The sample covariance between two variables $x$ and $y$ is:

$$\text{Cov}(x, y) = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})$$

Interpretation:

  • Positive covariance: When $x$ increases, $y$ tends to increase
  • Negative covariance: When $x$ increases, $y$ tends to decrease
  • Zero covariance: No linear relationship between $x$ and $y$

Covariance Matrix

Given a dataset with $M$ attributes $x_1, \dots, x_M$, we can compute the pairwise covariance between any two attributes and collect them in an $M \times M$ covariance matrix $\tilde{\Sigma}$:

$$\tilde{\Sigma}_{ij} = \text{Cov}(x_i, x_j)$$

This matrix is:

  • Symmetric: $\tilde{\Sigma}_{ij} = \tilde{\Sigma}_{ji}$
  • Diagonal elements are variances: $\tilde{\Sigma}_{ii} = \text{Var}(x_i)$
  • Off-diagonal elements are covariances between different attributes

Drawback: Covariance is affected by the scale of each attribute, making it difficult to compare covariances between different pairs of variables.
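A short NumPy sketch of the covariance matrix and the properties listed above, on synthetic data where two of the three attributes are deliberately correlated:

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic dataset: N = 100 observations (rows) of M = 3 attributes (columns)
X = rng.normal(size=(100, 3))
X[:, 1] += 0.8 * X[:, 0]            # make attributes 0 and 1 positively correlated

Sigma = np.cov(X, rowvar=False)     # M x M sample covariance matrix (divides by N-1)
print(Sigma)
print(np.allclose(Sigma, Sigma.T))                           # symmetric
print(np.allclose(np.diag(Sigma), X.var(axis=0, ddof=1)))    # diagonal = variances
```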

Correlation

To overcome the scale dependence of covariance, we standardize by dividing by the standard deviations of both variables. This gives us the correlation coefficient:

$$\hat{\rho}_{xy} = \frac{\widehat{\text{Cov}}(x, y)}{\hat{\sigma}_x \hat{\sigma}_y} = \frac{\frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\hat{\sigma}_x \hat{\sigma}_y}$$

Expanding the denominator:

$$\hat{\rho}_{xy} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2}}$$

where $\hat{\sigma}_x$ and $\hat{\sigma}_y$ are the sample standard deviations of $x$ and $y$.

Properties:

  • Always between $-1$ and $1$: $-1 \leq \rho_{xy} \leq 1$
  • Scale-invariant: Changing the units of $x$ or $y$ doesn’t change $\rho_{xy}$
  • $\rho_{xy} = 1$: Perfect positive linear relationship
  • $\rho_{xy} = -1$: Perfect negative linear relationship
  • $\rho_{xy} = 0$: No linear relationship

A correlation of 0 means that $x$ tells us nothing about $y$ through a linear relationship. A positive correlation tells us that when $x$ is large, $y$ is also likely to be large; with a negative correlation, $y$ is likely to be small when $x$ is large.
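The following sketch illustrates this interpretation and the scale-invariance property on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(scale=0.5, size=500)   # strong positive linear relationship

rho = np.corrcoef(x, y)[0, 1]       # sample correlation coefficient
print(rho)                          # close to +1

# Scale invariance: rescaling x (e.g. changing its units) leaves rho unchanged
print(np.corrcoef(100 * x, y)[0, 1])
```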

Distance Metrics

There is no single definition of distance, except that it is a function of two observations $x, y$ whose value is large if they are not very similar and small if they are very similar.

Formal Properties of a Metric:

A proper distance measure dd should satisfy:

  1. Non-negativity: $d(x, y) \geq 0$
  2. Identity of indiscernibles: $d(x, y) = 0$ if and only if $x = y$
  3. Symmetry: $d(x, y) = d(y, x)$
  4. Triangle inequality: $d(x, y) \leq d(x, z) + d(z, y)$

The triangle inequality states that the direct distance between two points is never greater than going through an intermediate point. For example, the distance from home to work is not greater than the distance from home to the bakery plus the distance from the bakery to work.

Distance from Norms:

A common way to define distances is using norms. A norm $\|x\|$ is the magnitude of a vector and must satisfy:

  1. Non-negativity: $\|x\| > 0$ if $x \neq 0$
  2. Scaling: $\|ax\| = |a|\,\|x\|$
  3. Triangle inequality: $\|x + y\| \leq \|x\| + \|y\|$

We can define distance from a norm as:

$$d(x, y) = \|x - y\|$$

Euclidean Distance

The most common distance metric, measuring the straight-line distance between two points:

$$d(x, y) = \sqrt{\sum_{j=1}^{M} (x_j - y_j)^2}$$

where $x$ and $y$ are $M$-dimensional vectors.

Manhattan Distance (L1 Norm)

Sum of absolute differences along each dimension:

$$d_1(x, y) = \|x - y\|_1 = \sum_{j=1}^{M} |x_j - y_j|$$

Also called “taxicab distance” or “city block distance” because it’s like traveling along a grid of streets.

Minkowski Distance (Lp Norm)

A generalization that includes both Euclidean and Manhattan distances:

$$d_p(x, y) = \left(\sum_{j=1}^{M} |x_j - y_j|^p\right)^{1/p}$$

Special cases:

  • $p = 1$: Manhattan distance
  • $p = 2$: Euclidean distance
  • $p \to \infty$: Chebyshev distance (maximum difference): $d_\infty(x, y) = \max_j |x_j - y_j|$

Effect of $p$: As $p$ increases, the distance metric becomes less sensitive to differences in individual dimensions and more dominated by the largest difference.
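A small sketch of these special cases on two example vectors, with np.linalg.norm as a cross-check:

```python
import numpy as np

def minkowski(x, y, p):
    """L_p (Minkowski) distance between vectors x and y."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.5])

print(minkowski(x, y, 1))          # Manhattan: same as np.linalg.norm(x - y, 1)
print(minkowski(x, y, 2))          # Euclidean: same as np.linalg.norm(x - y, 2)
print(np.max(np.abs(x - y)))       # Chebyshev (p -> inf): np.linalg.norm(x - y, np.inf)
```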

Mahalanobis Distance

Unlike the previous metrics, Mahalanobis distance takes into account the covariance structure of the data:

$$d_M(x, y) = \sqrt{(x - y)^T \Sigma^{-1} (x - y)}$$

where $\Sigma$ is the covariance matrix of the dataset.

Key properties:

  • If $\Sigma = I$ (identity matrix), it reduces to Euclidean distance
  • Accounts for correlations between variables
  • Scale-invariant (adjusts for different variances in each dimension)

Intuition: The Mahalanobis distance is lower when two points lie along the direction where the data naturally varies (within the “point cloud”).

Example interpretation:

  • Two pairs of points might have the same Euclidean distance ($\approx 5.65$)
  • But if one pair lies along the main axis of variation in the data: $d_M = 4.15$
  • While the other pair lies perpendicular to that variation: $d_M = 13$
  • The Mahalanobis distance captures that movement along the natural spread of the data is “cheaper” than movement against it

This makes Mahalanobis distance useful when features have different scales or are correlated.
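Finally, a sketch that reproduces this intuition on a synthetic correlated point cloud (the specific numbers above come from a different example, so the values here will differ): two displacements of equal Euclidean length get very different Mahalanobis distances depending on whether they follow or cut across the data’s spread:

```python
import numpy as np

def mahalanobis(x, y, Sigma):
    """Mahalanobis distance between x and y under covariance matrix Sigma."""
    diff = x - y
    return np.sqrt(diff @ np.linalg.solve(Sigma, diff))   # solve avoids explicit inverse

rng = np.random.default_rng(3)
# Synthetic point cloud whose main axis of variation is the x = y direction
X = rng.multivariate_normal([0, 0], [[4.0, 3.5], [3.5, 4.0]], size=500)
Sigma = np.cov(X, rowvar=False)

origin = np.zeros(2)
a = np.array([4.0, 4.0])     # displacement along the main axis of variation
b = np.array([4.0, -4.0])    # same Euclidean length, but against the variation

print(np.linalg.norm(a), np.linalg.norm(b))   # identical Euclidean distances
print(mahalanobis(a, origin, Sigma))          # small: moving "with" the spread
print(mahalanobis(b, origin, Sigma))          # large: moving "against" it
# With Sigma = I this reduces to the ordinary Euclidean distance
```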