Let’s start by asking whether two populations have different means (i.e. averages). What does that mean? An example would be men and women. If you take the distributions of the heights of all men and all women, are their means really different? You don’t want to measure several billion people, so instead you take a sizable sample of, say, fifty people. This is often the point of statistics: you do a study with a group much smaller than the total and try to generalize your results from the small group to the whole. But how do you come to the right conclusion from a small group?
For example, you could take just four people: Alice, Wendy, Bob, and Zack. Is the average height of the women (Alice and Wendy) smaller than that of the men (Bob and Zack)? It may not be. There are zillions of cases of women who are taller than men. So with just four people, you might well come to the wrong conclusion.
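To get a feel for how often a tiny sample misleads, here is a minimal simulation sketch. It assumes heights are Gaussian with illustrative means of 65 inches (women) and 70 inches (men) and a standard deviation of 3 inches in both groups. The function name `small_sample_reversal_rate` and the seed are made up for this example.

```python
import random


def small_sample_reversal_rate(n_per_group, trials=20_000, seed=0):
    """Fraction of trials where the women's sample mean is >= the men's.

    Assumed (illustrative) populations: women ~ N(65, 3), men ~ N(70, 3).
    """
    rng = random.Random(seed)
    reversals = 0
    for _ in range(trials):
        women = [rng.gauss(65, 3) for _ in range(n_per_group)]
        men = [rng.gauss(70, 3) for _ in range(n_per_group)]
        if sum(women) / n_per_group >= sum(men) / n_per_group:
            reversals += 1
    return reversals / trials


rate_2 = small_sample_reversal_rate(n_per_group=2)
rate_25 = small_sample_reversal_rate(n_per_group=25, trials=5_000)
print(f"2 per group:  women at least as tall on average in {rate_2:.1%} of trials")
print(f"25 per group: women at least as tall on average in {rate_25:.1%} of trials")
```

Under these assumptions, two women and two men give the "wrong" ordering roughly 5% of the time, while 25 per group essentially never do. The exact figures depend entirely on the assumed means and spread.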
Say you take a group of 25 men and 25 women and formulate a question: do the populations these two groups were drawn from have the same mean? Another question that you see all the time in papers is whether some nutritional supplement is effective in some way or other. Does echinacea, for instance, shorten the total time that you have a cold? You can do the same kind of experiment. Take a group of 50 people, give 25 of them echinacea, and give the other 25 a sugar pill. Then you inoculate both groups with some cold virus. Is the mean time that they’re sick different between the two groups?
Getting back to the height question, let’s plot (fake) data for the heights of women and men.
Here I made the women’s data by generating 25 random numbers from a Gaussian distribution with a mean of 65 (inches) and a standard deviation of 3 (inches). I did the same for the men, but gave them a mean of 70 and a standard deviation of 3. This is what you might expect from some rather homogeneous population of people, say from Sweden, and all about the same age.
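One way such fake data could be generated is sketched below, using only Python's standard library. This is not necessarily how the original data were produced. The seed and the variable names are arbitrary choices for this example, so the numbers will not match the plotted ones.

```python
import random
import statistics

rng = random.Random(42)  # arbitrary fixed seed so the run is repeatable

N = 25  # people per group

# Gaussian heights in inches: women ~ N(65, 3), men ~ N(70, 3)
women = [rng.gauss(65, 3) for _ in range(N)]
men = [rng.gauss(70, 3) for _ in range(N)]

for label, heights in (("women", women), ("men", men)):
    sample_mean = statistics.mean(heights)
    sample_sd = statistics.stdev(heights)  # sample standard deviation
    print(f"{label}: mean = {sample_mean:.2f} in, sd = {sample_sd:.2f} in")
```

The sample means and standard deviations will be close to, but not exactly, the values fed in, and that gap is the whole issue.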
Looking at this, it’s not really clear that you can draw any conclusions because there’s quite a lot of overlap. For example, let’s generate another 25 random numbers from the women’s population and compare them to the first ones that were generated.
Hmm, these look different too, but we know that in fact they’re generated from the same distribution. In other words, they’re just different individuals drawn from the population.
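To see how large a difference pure chance can produce, the sketch below repeatedly draws two independent samples of 25 from the same assumed women’s distribution (mean 65, standard deviation 3) and records the gap between their sample means. The number of repetitions and the seed are arbitrary.

```python
import random
import statistics

rng = random.Random(7)  # arbitrary seed for repeatability

N = 25          # sample size, as in the text
REPEATS = 1000  # arbitrary number of repeated experiments


def draw_sample():
    """Draw N heights from the assumed women's population, N(65, 3)."""
    return [rng.gauss(65, 3) for _ in range(N)]


# Gap between the means of two samples from the SAME distribution
gaps = [
    statistics.mean(draw_sample()) - statistics.mean(draw_sample())
    for _ in range(REPEATS)
]

print(f"typical gap (sd of gaps): {statistics.stdev(gaps):.2f} in")
print(f"largest gap seen:         {max(abs(g) for g in gaps):.2f} in")
```

For samples of 25 with a standard deviation of 3, theory puts the spread of these gaps at 3·sqrt(2/25), about 0.85 inches, so two samples from the same population can easily differ by an inch or so in their means.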
There are a few things here that we’ll need to do in order to proceed. The first thing we’ll do is figure out how to correctly estimate a mean and standard deviation. This is important because a lot of statistical tests rely on these two quantities. It’s not quite as trivial as you might think.
The second thing to do is formulate a very specific procedure to determine whether we have a difference or not and, if so, to find a measure of the chance that we’ve come to the right conclusion. This is really the hardest part about statistics.