
2.3 Unbiased estimates

We’ll never know the exact true value of the average length of grass on the library lawn. The only way to know it would be to go through all the blades of grass and measure every one. The same goes for human height: we can’t measure everyone on the planet. But we can come up with estimates of these quantities.

For example, we can pick 100 blades of grass and get an average of 6.1”. We could do it again with different blades and get 5.9”. Suppose we did that a million times. The average of all those results had better be the true value (6”). If it isn’t, we’re in trouble, and we’d call our estimate “biased”. Is it biased? Let’s use a bar to denote a finite average:

{\bar{x}}={1\over n}\sum_{i=1}^{n}x_{i}

(2.1)

and brackets $\langle\dots\rangle$ to denote the “expectation value”

\langle x\rangle={1\over N}\sum_{i=1}^{N}x_{i}

(2.2)

We talked about this earlier, and it meant the “true” average, that is, the average over the entire population of $N$ members, where $N$ could be virtually infinite.

So I’m saying that if we take the expectation value of the finite average ${\bar{x}}$, we’d better get the true average. Let’s see if that works:

\langle{\bar{x}}\rangle={1\over n}\sum_{i=1}^{n}\langle x_{i}\rangle

(2.3)

But every $x_{i}$ is drawn from the same population, so $\langle x_{i}\rangle$ is the true average for each $i$; call it $\langle x\rangle$. Putting that into the right-hand side gives $n$ identical terms divided by $n$, so indeed

\langle{\bar{x}}\rangle=\langle x\rangle

(2.4)

So the estimate of the mean, eqn. 2.1, is unbiased. That’s good.
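The thought experiment above (repeat the 100-blade measurement many times and average the results) can be tried numerically. The following is only a sketch under assumptions that are not in the notes: blade lengths are taken to be normally distributed with a true average of 6” and a spread of 1”, and the experiment is repeated 10,000 times rather than a million to keep it quick.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

TRUE_MEAN = 6.0     # assumed true average blade length, in inches
TRUE_SD = 1.0       # assumed spread of blade lengths (hypothetical)
N_BLADES = 100      # blades per sample, the n of eqn. 2.1
N_REPEATS = 10_000  # how many times the whole measurement is repeated


def sample_mean(n):
    """Finite average x-bar of n randomly drawn blade lengths (eqn. 2.1)."""
    blades = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(n)]
    return sum(blades) / n


sample_means = [sample_mean(N_BLADES) for _ in range(N_REPEATS)]
mean_of_means = statistics.fmean(sample_means)

print(f"first sample mean:       {sample_means[0]:.3f}")
print(f"average of sample means: {mean_of_means:.3f}")
```

Any single sample mean wanders around 6”, but the average over many repetitions should sit very close to the assumed true value, which is what eqn. 2.4 says.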

Now what’s a good estimate of the variance? The definition in eqn. 1.38 can be said in words, though it’s a mouthful: the average (expectation value) of the square of the differences of data values from their mean. That definition relies on something that you’d never really be able to measure. You’re averaging over all members of your population. So since we can’t measure all the blades of grass, what’s a good measure using only 100 of them?

The first guess you’d have is that you take

{1\over n}\sum_{i=1}^{n}\langle(x_{i}-{\bar{x}})^{2}\rangle\qquad\mathrm{BIASED!}

(2.5)

But it turns out that this is a biased estimate. In other words, if you calculate this with 10 randomly picked blades of grass ( $n=10$ ), then do the same with another 10, and keep repeating, the average of the variances you get will not be the true variance of the whole population!

The culprit is that ${\bar{x}}$, your estimate of $\langle x\rangle$, is itself off, and this messes up your answer. Let’s illustrate with $n=2$ . Suppose the true variance is $\sigma^{2}$ and the true average is 0. This simplifies the calculation because then $\sigma^{2}=\langle x^{2}\rangle$ . Then ${\bar{x}}=(x_{1}+x_{2})/2$ and the above formula is
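The repeated 10-blade experiment just described can also be simulated. This is a rough sketch, assuming a normally distributed population with true average 0 and a hypothetical true variance of 4; the function name `naive_variance` is introduced here only for illustration.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

TRUE_VARIANCE = 4.0  # hypothetical population variance (standard deviation 2)
N = 10               # blades per sample
N_REPEATS = 100_000  # number of independent 10-blade samples


def naive_variance(xs):
    """Eqn. 2.5 form: squared deviations from the sample mean, divided by n."""
    xbar = sum(xs) / len(xs)
    return sum((x - xbar) ** 2 for x in xs) / len(xs)


estimates = []
for _ in range(N_REPEATS):
    sample = [random.gauss(0.0, TRUE_VARIANCE ** 0.5) for _ in range(N)]
    estimates.append(naive_variance(sample))

average_estimate = statistics.fmean(estimates)

print(f"true variance:              {TRUE_VARIANCE:.3f}")
print(f"average of naive estimates: {average_estimate:.3f}")
```

Under these assumptions the average of the naive estimates settles visibly below the true variance, no matter how many times the experiment is repeated.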

{1\over 2}\left(\left(x_{1}-{x_{1}+x_{2}\over 2}\right)^{2}+\left(x_{2}-{x_{1}+x_{2}\over 2}\right)^{2}\right)=

(2.6)

{1\over 2}\left(\left({x_{1}-x_{2}\over 2}\right)^{2}+\left({x_{2}-x_{1}\over 2}\right)^{2}\right)={1\over 4}(x_{1}^{2}+x_{2}^{2}-2x_{1}x_{2})

(2.7)

If we now take the expectation value of this, we see that the final “cross term” vanishes: $\langle x_{1}x_{2}\rangle=\langle x_{1}\rangle\langle x_{2}\rangle=0$ , assuming independence of the data. So we get

{1\over 4}\langle x_{1}^{2}+x_{2}^{2}\rangle={1\over 2}\langle x^{2}\rangle={\sigma^{2}\over 2}

(2.8)
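Eqn. 2.8 can be checked exactly on a toy population, with no random sampling involved. The sketch below assumes a hypothetical population in which $x$ is $-1$ or $+1$ with equal probability, so that $\langle x\rangle=0$ and $\sigma^{2}=\langle x^{2}\rangle=1$, and averages the $n=2$ formula over all four equally likely pairs $(x_{1},x_{2})$.

```python
from fractions import Fraction
from itertools import product

# Hypothetical population: x is -1 or +1 with equal probability,
# so <x> = 0 and sigma^2 = <x^2> = 1.
VALUES = [Fraction(-1), Fraction(1)]
sigma2 = sum(v * v for v in VALUES) / len(VALUES)


def naive_variance(x1, x2):
    """Left-hand side of eqn. 2.6: the divide-by-n formula for n = 2."""
    xbar = (x1 + x2) / 2
    return ((x1 - xbar) ** 2 + (x2 - xbar) ** 2) / 2


# All independent, equally likely draws of (x1, x2).
pairs = list(product(VALUES, repeat=2))
expected_naive = sum(naive_variance(a, b) for a, b in pairs) / len(pairs)

print("sigma^2          =", sigma2)
print("<naive estimate> =", expected_naive)
```

Two of the four pairs have equal values and contribute 0, the other two contribute 1 each, so the expectation is exactly half of $\sigma^{2}$.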

So in this case, we’re off by a factor of 2. If you do the general calculation, for any $n$ and any $\langle x\rangle$ , you see that eqn. 2.5 gives, on average, the true variance times $(n-1)/n$ . So to get an unbiased estimate, the correct formula for the variance is

\sigma^{2}={1\over n-1}\sum_{i=1}^{n}\langle(x_{i}-{\bar{x}})^{2}\rangle\qquad\mathrm{CORRECT}

(2.9)
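For reference, the general calculation can be sketched as follows. This is one standard route and not necessarily the one intended in these notes; $\mu$ is shorthand introduced here for the true average $\langle x\rangle$, and the data are assumed independent, each with variance $\sigma^{2}$.

```latex
\begin{align*}
\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}
  &= \sum_{i=1}^{n}(x_{i}-\mu)^{2}-n(\bar{x}-\mu)^{2}, \\
\left\langle\sum_{i=1}^{n}(x_{i}-\mu)^{2}\right\rangle
  &= n\sigma^{2}, \qquad
\langle(\bar{x}-\mu)^{2}\rangle
  = {1\over n^{2}}\sum_{i=1}^{n}\langle(x_{i}-\mu)^{2}\rangle
  = {\sigma^{2}\over n}, \\
\left\langle{1\over n}\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}\right\rangle
  &= {1\over n}\left(n\sigma^{2}-\sigma^{2}\right)
  = {n-1\over n}\,\sigma^{2}.
\end{align*}
```

The cross terms $\langle(x_{i}-\mu)(x_{j}-\mu)\rangle$ with $i\neq j$ drop out of the middle line by independence, just as the cross term did in the $n=2$ case. Setting $n=2$ reproduces the $\sigma^{2}/2$ of eqn. 2.8, and dividing by $n-1$ instead of $n$ removes the factor.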

This seems like a detail if $n=100$ , and there it is: the correction is only about 1%. But if $n=3$ , which it sometimes is, the factor is $2/3$ and that makes a pretty big difference.
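To see the size of the effect for a small sample, the two formulas can be compared directly. The sketch below uses Python’s standard `statistics` module, where `pvariance` divides by $n$ and `variance` divides by $n-1$; the three blade lengths are made-up numbers for illustration.

```python
import statistics

sample = [5.8, 6.1, 6.4]  # three hypothetical blade lengths, in inches
n = len(sample)

divide_by_n = statistics.pvariance(sample)         # eqn. 2.5 form
divide_by_n_minus_1 = statistics.variance(sample)  # eqn. 2.9 form
ratio = divide_by_n / divide_by_n_minus_1          # should equal (n - 1) / n

print(f"divide by n:     {divide_by_n:.4f}")
print(f"divide by n - 1: {divide_by_n_minus_1:.4f}")
print(f"ratio:           {ratio:.4f}")
```

For any sample the ratio of the two results is $(n-1)/n$, so which formula you use matters a lot for three data points and hardly at all for a hundred.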

So in summary, we now know how to estimate the average and variance of a population from a finite number of data points. Now we’ll get to the harder part, how to determine if two populations are the same or different.