Why Do We Use the Mean?

Oct 16, 2019 00:00 · 855 words · 5 minute read animation dataviz math

On the occasions that I’ve had to teach statistics to people who are brand new to the topic, I’ve noticed something. People instinctively understand median and mode, but they just accept using the mean. Sometimes they push back, and want to know why we don’t use the median for everything. I think it’s easy to have learned averages by rote in elementary school, and to have never stopped to think about why we use the mean. It’d be nice to have a place to point them to, and what better place than here?

The motivation of the arithmetic mean is one of fairness. You take a total and distribute it evenly amongst the sample. As an example: three people want a representative number for how much pocket change they have, so they pool all their pocket change and redistribute it evenly. The amount of money in each portion is the mean. This is a little more abstract when you apply it to something like heights, but the logic is just the same. You take the total amount of height and apportion it out evenly among the sample.
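The pooling story translates almost word for word into code. Here is a minimal sketch; the three amounts are made up for illustration and are not from any real data.

```python
# Hypothetical pocket change (in cents) for three people.
pocket_change = [35, 120, 85]

pooled = sum(pocket_change)           # everyone puts their change in one pile
share = pooled / len(pocket_change)   # split the pile evenly

print(share)  # 80.0
```

Each person walks away with 80 cents, and the three equal shares add back up to the original pile.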

The mean is so ubiquitous in statistics because it’s the best number to “stand in” for a given set of numbers. There are two big advantages to using the arithmetic mean: it is unbiased, and has minimal variance¹.

The mean is unbiased

We call the deviation from the mean, \(x_i-\bar{x}\), a residual. We want a centrality measure that’s in the center. If we add all our residuals together, we’d like the sum to be zero. Residuals that sum to zero indicate balance: we’re off by the same amount on one side as we are on the other. The mean is the only number for which this happens.

Let’s visualize that:

I generated a set of 10 points drawn from a standard normal distribution. I tried 1000 “potential means” (numbers we’re using as if they were the mean) within the range of the data, and calculated the sum of residuals for each (plotted in black). The mean (red line) is the value where the sum of residuals equals 0 (dashed line).
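If you’d like to reproduce the experiment, here is a rough sketch using only Python’s standard library. The seed, the variable names, and the `residual_sum` helper are my own assumptions for illustration, so the numbers will not match the plot.

```python
import random
import statistics

random.seed(0)  # arbitrary seed (an assumption) so the run is repeatable
data = [random.gauss(0, 1) for _ in range(10)]  # 10 draws from a standard normal

def residual_sum(values, candidate):
    """Sum of (x - candidate) over all values."""
    return sum(x - candidate for x in values)

lo, hi = min(data), max(data)
# 1000 evenly spaced "potential means" across the range of the data
candidates = [lo + (hi - lo) * k / 999 for k in range(1000)]
sums = [residual_sum(data, c) for c in candidates]

mean = statistics.mean(data)
print(residual_sum(data, mean))  # zero, up to floating-point rounding
```

The values in `sums` fall in a straight line from positive to negative as the candidate moves from the smallest data point to the largest, and the line crosses zero at the mean.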

We can also show this algebraically. We want to find the “potential mean” \(\bar{x}'\) such that the residual sum equals 0. For a set of \(n\) values \(x_1, \dots, x_n\): \[ 0 = \sum_{i=1}^{n} (x_i - \bar{x}') \\ 0 = \left ( \sum_{i=1}^{n} x_i \right ) - \left ( \sum_{i=1}^{n} \bar{x}' \right ) \\ 0 = \left ( \sum_{i=1}^{n} x_i \right ) - n\bar{x}' \\ n\bar{x}' = \sum_{i=1}^{n} x_i \\ \bar{x}' = \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]

The mean reduces variance

I want to use a bit of backward logic for a moment. The variance of a set of numbers is their average squared distance² from the mean³: \(\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\). Let’s forget the ‘average’ part for a moment, though. The part of that equation that I’m interested in is the sum of squared deviations. Let’s visualize that.
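To make the formula concrete, here is a small sketch that computes the sum of squared deviations and the variance by hand. The data values are illustrative. `statistics.pvariance` is the standard library’s divide-by-\(n\) variance, so it should agree with the hand calculation.

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # illustrative values

mean = sum(data) / len(data)
sum_of_squares = sum((x - mean) ** 2 for x in data)
variance = sum_of_squares / len(data)  # divide by n, no (n - 1) correction

print(mean, sum_of_squares, variance)  # 5.0 32.0 4.0
print(statistics.pvariance(data))      # 4.0
```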

Here I’ve placed our numbers on the number line. The colored squares represent the squared deviation from a point. The mean is the number that makes the total area of these squares the smallest. Since the variance is only a scaling factor away from this sum-of-squares measure, we can also say the mean is the number that gives you the lowest variance when you use it as a measure of central tendency.

The square term means that larger deviations from the mean are emphasized. This is what gives the mean its lack of robustness. A distant outlier will drag the mean toward it as the mean attempts to minimize the variance.
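A tiny example shows the effect. The numbers below are made up: five well-behaved values, then the same five plus one distant outlier.

```python
import statistics

values = [1, 2, 3, 4, 5]
with_outlier = values + [100]

print(statistics.mean(values), statistics.median(values))              # 3 3
print(statistics.mean(with_outlier), statistics.median(with_outlier))  # mean is 115/6 (about 19.17), median is 3.5
```

One extra point moves the mean from 3 to about 19, while the median barely shifts.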

We see here that the total area of the squares (which is proportional to the sample variance) is lowest when our potential mean is close to the actual arithmetic mean, which for this set of numbers is 0.1322028. Let’s visualize this another way.

Again we see that the sample variance is smallest when the potential mean equals the arithmetic mean (illustrated by the red line).
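The same check can be run numerically with a brute-force search over potential means. This is only a sketch: the data values and the `sum_of_squares` helper are assumptions for illustration, not the sample used in the plots.

```python
import statistics

data = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.1]  # illustrative values, not the post's sample

def sum_of_squares(values, candidate):
    """Total area of the squares drawn from each value to the candidate."""
    return sum((x - candidate) ** 2 for x in values)

lo, hi = min(data), max(data)
step = (hi - lo) / 999
candidates = [lo + step * k for k in range(1000)]  # 1000 potential means

# Pick the candidate with the smallest total squared deviation.
best = min(candidates, key=lambda c: sum_of_squares(data, c))

mean = statistics.mean(data)
print(best, mean)  # the best grid point lands within one grid step of the mean
```

The grid can only get as close to the mean as its spacing allows, which is why the comparison is “within one step” rather than exact.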

We can show that the mean minimizes variance mathematically too. The variance is just the sum of squares scaled by a constant, so we’ll minimize the sum of squares directly. We want \(\bar{x} = \mathrm{argmin}_{\bar{x}'} \sum (x_i-\bar{x}')^2\). Start by setting the derivative with respect to \(\bar{x}'\) to zero: \[ \frac{d}{d\bar{x}'} \sum_{i=1}^{n}(x_{i}-\bar{x}')^2 = 0 \\ \sum_{i=1}^{n}-2(x_{i}-\bar{x}') = 0 \\ -2 \sum_{i=1}^{n}(x_{i}-\bar{x}') = 0 \\ \sum_{i=1}^{n}(x_{i}-\bar{x}') = 0 \\ n\bar{x}' = \sum_{i=1}^{n}x_{i} \\ \bar{x}' = \bar{x} = \frac{1}{n} \sum_{i=1}^{n}x_{i} \]
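A zero derivative only tells us we’ve found a stationary point. As a quick sanity check that it is a minimum and not a maximum, we can look at the second derivative, which is positive for any non-empty sample: \[ \frac{d^2}{d\bar{x}'^2} \sum_{i=1}^{n}(x_{i}-\bar{x}')^2 = \sum_{i=1}^{n} 2 = 2n > 0 \]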

¹ I’m abusing these terms somewhat. The minimum-variance unbiased estimator already has an established meaning in terms of identifying population parameters from samples. I’m using the same idea, but only talking in terms of sample statistics.
² Statistics starts to make a lot more sense when you understand the important role of geometry and distance in everything you do. The arithmetic mean is just one of many Fréchet means that can be calculated from a dataset. The median is another. There’s a good Math StackExchange post describing mean, median, and mode as numbers that minimize \(L^p\) norms.
³ I’m not using the \((n-1)\) correction here for simplicity. If you’re trying to estimate population parameters from a sample, don’t use this formula.