Let’s get back to the putative difference in height between Swedish men and women. We have to form a statistic to use on the data. The obvious one is the difference in the means between the men and the women. You have 25 men and 25 women. Let’s call the mean you measured for the men using eqn. 2.1 $\bar{x}_m$; for women, call it $\bar{x}_w$. Let’s call the difference $d = \bar{x}_m - \bar{x}_w$. Remember this isn’t the true difference you’d get if you used the whole population, only an estimate.
Maybe you get some particular value, $d_{\text{obs}}$. Remember that if we did the experiment again we might get a somewhat different value, or even a very different one. How do we calculate the likelihood of finding a difference of $d_{\text{obs}}$ if women and men have precisely the same distribution of height (that is, if the null hypothesis $H_0$ is true)? To figure that out, we have to know what the probability distribution is for our estimate of the mean difference $d$. But assuming, as we are, that there’s no difference in the two distributions, the true difference is 0. In that case we have some brontosaurus-like curve for the probability of mean-differences. Let’s plot what we’d expect it to look like assuming $H_0$:
The actual functional form of this is quite subtle and something that I’ll discuss a little later. Let’s try to understand the basics first.
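While the exact functional form is left for later, a quick simulation can give a feel for what the curve looks like. The sketch below is only an illustration, not the method used in these notes: it assumes both groups are drawn from the same normal distribution, and the mean of 175 cm and spread of 7 cm are made-up numbers chosen for the example. Only the group size of 25 comes from the text.

```python
import random
import statistics


def simulate_null_differences(n_per_group=25, mean=175.0, sd=7.0,
                              n_repeats=5000, seed=0):
    """Repeat the experiment many times under "no difference".

    mean and sd are hypothetical values for illustration; both groups
    are drawn from the SAME normal distribution.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_repeats):
        men = [rng.gauss(mean, sd) for _ in range(n_per_group)]
        women = [rng.gauss(mean, sd) for _ in range(n_per_group)]
        # The statistic: difference between the two sample means.
        diffs.append(statistics.fmean(men) - statistics.fmean(women))
    return diffs


null_diffs = simulate_null_differences()
print("centre of the curve:", round(statistics.fmean(null_diffs), 2))
print("width of the curve: ", round(statistics.stdev(null_diffs), 2))
```

A histogram of `null_diffs` should show a single hump centred near 0. Its width should be close to `sd * sqrt(2 / n_per_group)`, which is about 2 cm for these made-up numbers.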
Assuming the null hypothesis $H_0$ (no difference between the heights of men and women), it’s most likely that you’ll get a difference of 0. But we’ve found a nonzero difference, $d_{\text{obs}}$. That doesn’t seem so likely. How likely is it? Well, we’re asking slightly the wrong question. The probability of getting exactly some number is 0 for a continuous distribution. We want to know the probability that we’d get $d_{\text{obs}}$ or something even more unlikely, that is, an even larger difference. So what we really want to know is the probability of finding a value of $d_{\text{obs}}$ or greater by chance.
We already know how to calculate this from eqn 1.11; it’s just the area under the curve:
So the area in red is the probability that you’d get a value as large as the one you measured, or larger, by chance, when in fact there’s no difference.
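One way to estimate that tail area without knowing the curve’s formula is a permutation test, which is a swapped-in technique and not the calculation from eqn 1.11. If there really is no difference, the labels "man" and "woman" are arbitrary, so shuffling them shows which differences arise by chance alone. The data below are invented for the example (the means of 180 cm and 167 cm and the spread of 7 cm are assumptions), and the function names are hypothetical.

```python
import random
import statistics


def tail_probability(null_diffs, observed):
    """Fraction of null differences >= the observed one (the 'red area')."""
    return sum(d >= observed for d in null_diffs) / len(null_diffs)


def permutation_null(men, women, n_shuffles=2000, seed=0):
    """Differences in means obtained after randomly reshuffling the labels."""
    rng = random.Random(seed)
    pooled = list(men) + list(women)
    n_men = len(men)
    diffs = []
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        diffs.append(statistics.fmean(pooled[:n_men])
                     - statistics.fmean(pooled[n_men:]))
    return diffs


# Invented measurements, for illustration only.
data_rng = random.Random(1)
men = [data_rng.gauss(180.0, 7.0) for _ in range(25)]
women = [data_rng.gauss(167.0, 7.0) for _ in range(25)]

observed = statistics.fmean(men) - statistics.fmean(women)
p_value = tail_probability(permutation_null(men, women), observed)
print(f"observed difference: {observed:.2f} cm")
print(f"estimated one-sided tail area: {p_value:.4f}")
```

With only 2000 shuffles the estimate cannot resolve tail areas much smaller than about 1/2000, so a printed value of 0 should be read as "very small", not as exactly zero.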
If the area in red is less than your threshold $\alpha$ (0.05 is a common choice), then you’d say “the probability of getting a difference this large by chance, if there’s no difference in the height statistics between the groups, is small, therefore the assumption of no difference is unlikely to be true”. So in that case you conclude there’s a statistically significant difference between the heights of men and women based on your data.
If the area in red is greater than your threshold $\alpha$, then you’d say “the probability of getting a difference this large by chance, if there’s no difference in the height statistics between the groups, isn’t that small, therefore we can’t reject the assumption that there’s no difference”. In that case you conclude that the difference in height is not statistically significant. Of course, if you used 5000 measurements instead, you might then change your tune. You do the best you can based on the available data.
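To see why 5000 measurements might change your tune, here is a rough sketch of the decision rule under a simplifying assumption: the null curve is treated as a normal distribution with a known spread, which is not necessarily the functional form these notes arrive at later. The observed difference of 2 cm, the spread of 7 cm and the threshold of 0.05 are all made-up values for the example.

```python
from math import sqrt
from statistics import NormalDist


def one_sided_tail(observed_diff, sd, n_per_group):
    """Area under the assumed-normal null curve at or above observed_diff."""
    # Spread of a difference between two sample means of size n_per_group each.
    standard_error = sd * sqrt(2.0 / n_per_group)
    return 1.0 - NormalDist(mu=0.0, sigma=standard_error).cdf(observed_diff)


def verdict(p_value, alpha=0.05):
    """Compare the tail area with the threshold alpha (0.05 is an assumption)."""
    if p_value < alpha:
        return "statistically significant"
    return "not statistically significant"


# Same hypothetical 2 cm difference, two different sample sizes.
for n in (25, 5000):
    p = one_sided_tail(observed_diff=2.0, sd=7.0, n_per_group=n)
    print(f"n = {n:4d} per group: tail area = {p:.4f} -> {verdict(p)}")
```

The same measured difference falls on the "not significant" side with 25 people per group and on the "significant" side with 5000, because the null curve gets narrower as the sample grows.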
You might get ridiculed for saying that there’s no statistically significant difference in this case. “Come on, just look down the street. You can clearly see an average difference in height”, or “everybody knows that there’s a difference”. But that kind of talk is what kept doctors blood-letting for centuries. They had some wacky ideas about the four humours, blood, bile, and so on, whose relative proportions told you about a person’s mood and health. Of course they knew it worked. Do you think anyone did a proper study of this? They just thought from theory and empirical observation that it worked. It worked in the sense that some ill people would sometimes get better despite being exsanguinated. But did it work better than nothing? These days, people have decided it wasn’t such a great idea. Now medicine routinely uses, or attempts to use, statistics as a way of devising new treatments.