2.4 Hypothesis testing

If you flip a coin 100 times and it always comes out heads, you start to feel that maybe the coin is biased, that is, it’s more likely to come out heads than tails, on average. If it’s heads 51 times but tails 49 times, you’d think that it’s probably unbiased. But how do you quantify this intuition?

If it was heads 100 times in a row you’d say: “Come on, I kept on flipping the darned coin and it was always heads. What are the odds of that? $1/2^{100} \approx 7.9 \times 10^{-31}$. So it’s got to be a biased coin.” But if it was 51 heads versus 49 tails you’d say: “It’s probably unbiased, this is the kind of thing you’d see by chance. If you repeat it again you might see 46 heads and 54 tails.”
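To put rough numbers on that gut feeling, here is a small sketch of the arithmetic, assuming a fair coin and independent flips. The 45-to-55 window is an arbitrary choice made for this illustration, not something from the argument above.

```python
from math import comb

N_FLIPS = 100  # number of flips, as in the text

# Chance of 100 heads in a row with a fair coin: (1/2)^100.
p_all_heads = 0.5 ** N_FLIPS

# Chance that a fair coin lands heads between 45 and 55 times (inclusive).
# The 45..55 window is an arbitrary choice for illustration only.
p_near_even = sum(comb(N_FLIPS, k) for k in range(45, 56)) / 2 ** N_FLIPS

print(f"P(100 heads in a row) = {p_all_heads:.2e}")
print(f"P(45 to 55 heads)     = {p_near_even:.3f}")
```

The first number is absurdly small, while a near-even split is what a fair coin does most of the time.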

With this in mind, we’ll try to quantify what you do.

1. Form hypotheses

Yes, that’s right: use a fancy word like that and everyone will listen to you. When you’ve got their attention, say “Hypothesis, which is from the Greek hupotithenai, look at me, ain’t I clever.” After you’ve learned to pronounce it correctly, you’ve actually got to come up with hypotheses, like “this is an unbiased coin.” Another hypothesis is “this is a biased coin.” Wow.

Another example of this kind of hypothesis could be H0: “the mean heights of the men and of the women in a population are the same”. An alternative hypothesis is H1: “the mean heights of the men and of the women in a population are different”.

When a hypothesis says “you’re not going to see any difference”, like H0 above, it’s often referred to as “the null hypothesis”. Here’s another example of an H0: “Echinacea does zip for colds”.

When a hypothesis is an alternative to the null, like H1 above, it’s often called “the alternative hypothesis”. Fancy terminology for something seemingly pretty simple.

2. Form a statistic

You’ve also got to decide on a statistic to measure, like the fraction of times you get heads, or maybe something more complicated than that, like the difference in the means of the heights of men and women.
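The two statistics just mentioned can be sketched in a few lines. The function names and the sample data below are made up for illustration; they are not real measurements.

```python
def fraction_heads(flips):
    """Fraction of flips that came up heads; flips is a string of 'H' and 'T'."""
    return flips.count("H") / len(flips)


def difference_in_means(group_a, group_b):
    """Mean of group_a minus mean of group_b."""
    return sum(group_a) / len(group_a) - sum(group_b) / len(group_b)


# Hypothetical data, invented purely for this example.
flips = "HTHHTTHTHH"
heights_a_cm = [178.0, 182.5, 169.0, 175.5]
heights_b_cm = [165.0, 170.5, 160.0, 172.5]

print(fraction_heads(flips))
print(difference_in_means(heights_a_cm, heights_b_cm))
```

Either way, the statistic boils a pile of data down to one number that the rest of the procedure works with.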

3. Do the experiment

How can you determine if the hypothesis is correct? You’ve got to do a real experiment. We’re not Einstein here, figuring out if the coin is biased just by solving the equation for the universe. We need to get real data drawn from a sample of our populations.

4. Likelihood of result

Now call the statistic you’ve calculated μ. You calculate the probability p0 that you’d get μ under the assumption of H0, the null hypothesis. In the case of coins, getting no heads would be μ=0. Under the null hypothesis, that your coin is not biased, we know the probability of this happening is $1/2^{100}$, which even an ant would say is a very small number.
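For the coin, this probability can be worked out exactly. The sketch below assumes independent flips of a fair coin, so the number of heads follows a binomial distribution; the helper name `prob_heads_count` is hypothetical.

```python
from fractions import Fraction
from math import comb


def prob_heads_count(k, n):
    """Exact probability of exactly k heads in n flips of a fair coin (H0)."""
    # comb(n, k) sequences have exactly k heads, out of 2**n equally likely ones.
    return Fraction(comb(n, k), 2 ** n)


p_no_heads = prob_heads_count(0, 100)
print(p_no_heads == Fraction(1, 2 ** 100))  # the no-heads case from the text
print(float(prob_heads_count(50, 100)))     # an exact 50/50 split, for comparison
```

Using `Fraction` keeps the tiny probabilities exact instead of relying on floating point.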

In the height difference example, we measure the difference in heights between men and women, μ. Assuming the null hypothesis, what’s the probability that you’d get a difference μ? That can be quite a tough question to answer, involving some quite complicated math.

In the case of real statistical measures, when I say “complicated” I’m not kidding. Some of these formulas are pretty involved. We’ll go through how to estimate this probability a little later.
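One way to sidestep the heavy formulas is to let the computer shuffle the data. The sketch below is a permutation test, which is a technique swapped in here for illustration and not necessarily the method these notes develop later. The function name, the number of shuffles, and the seed are all assumptions.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_p_value(group_a, group_b, n_shuffles=10_000, seed=0):
    """Estimate how often random relabeling gives a difference in means
    at least as large (in absolute value) as the one observed."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_shuffles):
        # Under H0 the group labels are interchangeable, so shuffle them away.
        rng.shuffle(pooled)
        shuffled_diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if shuffled_diff >= observed:
            hits += 1
    return hits / n_shuffles
```

If the null hypothesis holds, the labels “men” and “women” carry no information about height, so shuffling them should produce differences as big as the observed one fairly often. If it almost never does, the observed difference is hard to explain by chance.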

5. The verdict

So you get some number, like p0 = $1/2^{100}$, or maybe 0.4. This is the probability that you’d get your result (strictly speaking, a result at least as extreme as yours) assuming the null hypothesis H0, for example assuming the coin is unbiased. If it’s $1/2^{100}$ you can be pretty sure that H0 is not true. If p0 is 0.4, then you can’t reliably reject the null hypothesis. You just throw up your hands, or if you want to sound more erudite, you say there’s no statistically significant difference between the means of the two populations. Where do you set the threshold? In scientific work, you set it reasonably low, but not too low, typically 0.05.
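Here is a sketch of the verdict step for the coin. It assumes a two-sided test, meaning it adds up every head count at least as far from an even split as the one observed. That is a choice made for this example, so for 100 heads it gives $2/2^{100}$ (all heads or all tails) rather than the one-tail $1/2^{100}$ quoted above. The function names are hypothetical; the 0.05 threshold is the one from the text.

```python
from math import comb

ALPHA = 0.05  # significance threshold mentioned in the text


def two_sided_p_value(heads, n):
    """Probability, for a fair coin, of a head count at least as far
    from n/2 as the observed one (in either direction)."""
    observed_distance = abs(heads - n / 2)
    extreme_sequences = sum(
        comb(n, k) for k in range(n + 1) if abs(k - n / 2) >= observed_distance
    )
    return extreme_sequences / 2 ** n


def reject_null(p_value, alpha=ALPHA):
    """True if the result is statistically significant at level alpha."""
    return p_value < alpha


for heads in (51, 100):
    p = two_sided_p_value(heads, 100)
    print(heads, p, "reject H0" if reject_null(p) else "cannot reject H0")
```

With 51 heads the p-value is large and you throw up your hands; with 100 heads it is far below the threshold.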

Should you reverse items 2 and 3, that is, choose the statistic once you’ve seen your data? That’s not really a good idea because it messes up your analysis of statistical significance. You could then peruse some thick book on statistical tests and likely find some test, “Applecore’s 5-sided, triple-blind, non-parametric, variation of means test” (I just made that up), that’ll give you the desired answer. What do I mean by desired? Desired by the drug company or the funding agency? That’s why you want to design the experiment and the test before seeing the actual data. Is what I said always followed in practice? Of course not. Statisticians call this “the problem of multiplicity”.
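A back-of-the-envelope sketch shows why shopping around for a test is dangerous. It assumes the tests are independent and that the null hypothesis is true for every one of them, which is a simplification; the function name is hypothetical.

```python
def chance_of_false_alarm(num_tests, alpha=0.05):
    """Probability that at least one of num_tests independent tests comes out
    'significant' at level alpha even though H0 is true every time."""
    return 1 - (1 - alpha) ** num_tests


for num_tests in (1, 5, 20):
    print(num_tests, round(chance_of_false_alarm(num_tests), 3))
```

Under these assumptions, trying 20 different tests gives you better than even odds of finding at least one that “works”, even when there is nothing to find.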