In this module, we are going to discuss and explore a statistical test used for “goodness of fit”. What does this mean? You know whether your shoes fit your feet based on whether they cause pain, right?
In sort of the same way, you can decide whether your data fits your expectations using a “goodness of fit” test. And believe me, if your data doesn't fit, it can cause a lot of pain…
Note: having a calculator on hand will make things go faster. You can also use a spreadsheet, calculator software on your computer, or Google (to use Google, type the calculation into the search bar, followed by an equals sign, and hit search).
I want to start with some data and a model from outside of biology. The "data" (such as it is) comes from a Dilbert cartoon, and the competing hypotheses about the data come from Dilbert (the hard-working and long-suffering engineer) and his boss (the Evil Pointy-Haired Boss). We will work through a statistical test to show that Dilbert is right and the boss is wrong -- of course!
In this cartoon, Dilbert's evil pointy-haired boss decides he's found a new way that employees are cheating him: they are taking fake "sick days" on Mondays and Fridays in order to get longer weekends.
Apparently we have saved the day ... Monday and Friday are 2 of the 5 weekdays, so 40% of sick days SHOULD fall on Monday or Friday, which means that employees are not abusing the system.
But wait. What if next year, Evil Pointy-Haired Boss (EPHB) finds that 42% of sick days fell on Monday or Friday??? Proof positive, in his view, that employees are out to get him.
Let's be Dilbert for a minute. How could we confirm or disprove EPHB's claim? Clearly 42% is more than 40% -- but how much is too much? Do the extra 2% just represent the natural "slop" around 40%?
Or, what if next year 90% of sickdays fell on Monday or Friday? Would that make you think that Dilbert was wrong, and sick-days were not random? What about 50% of sickdays on M/F?
When you do statistics, you are doing two things: first, putting numbers on common sense, and second, using a method that allows you to decide on the gray areas. So, what we expect out of statistics is the following:
Let's start with the 42% M/F sickdays. For simplicity, we'll assume this means 42 out of 100 (rather than 84 out of 200 or 420 out of 1000, etc). That's the data that was observed. Using the laws of probability, we also know that (approximately) 40 out of 100 sickdays should fall on M/F. That's the expected value.
What we want to do is test how far apart the "observed" and "expected" answers are, right? So a logical first step is to subtract one from the other -- that tells us how different they are. We'll do this both for M/F sickdays and for midweek sickdays:
| | observed | expected | difference |
| Mon/Fri | 42 | 40 | |
| Midweek | 58 | 60 | |
Then we want to know how important this difference is. Is it big compared to what we expected, or small? To compare the size of two numbers, you need to find a ratio -- in other words, use division. You need to find out how big the difference is compared to the number you expected to get. So, divide the difference (between the observed and expected) by the expected value:
| | observed | expected | difference | difference compared to expected |
| Mon/Fri | 42 | 40 | +2 | |
| Midweek | 58 | 60 | -2 | |
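As a quick check on the arithmetic, here is a minimal Python sketch that fills in the two calculated columns of the table. The variable names are mine, chosen only for illustration.

```python
# Observed and expected counts from the table (out of 100 sick days).
observed = {"Mon/Fri": 42, "Midweek": 58}
expected = {"Mon/Fri": 40, "Midweek": 60}

rows = {}
for category in observed:
    # Step 1: how far apart are observed and expected?
    difference = observed[category] - expected[category]
    # Step 2: how big is that difference compared to what we expected?
    relative = difference / expected[category]
    rows[category] = (difference, relative)
    print(f"{category}: difference = {difference:+d}, "
          f"difference / expected = {relative:+.3f}")
```

Note that the same difference of 2 counts for a little more in the Mon/Fri row, because it is being compared to a smaller expected value.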
Big deviations would mean that we probably have the wrong explanation, whereas small deviations would probably mean we're on the right track. Since we're trying to show that sick days are RANDOM, big deviations are bad for our case, while small deviations are good for our case.
The method I showed you on the last page was not quite right. For reasons that are difficult to explain without a degree in statistics, you need to SQUARE the deviation before dividing by the expected value. So we have the following sequence:
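In symbols (using o for an observed count and e for the matching expected count), the sequence is: for each row, find the deviation, square it, and divide by the expected value; then add up the results for all the rows. Written as a formula, that would be something like:

```latex
\[
\text{deviation} = o - e
\qquad\longrightarrow\qquad
(o - e)^2
\qquad\longrightarrow\qquad
\frac{(o - e)^2}{e}
\]
\[
\chi^2_{\text{calc}} \;=\; \sum_{\text{rows}} \frac{(o - e)^2}{e}
\]
```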
On the form below, you can put this all together. As you fill out each column, click on the button below it to check your numbers. BTW, this is where having a calculator will come in handy. As long as you use 2 significant digits in your calculations, the program will fill in additional digits for you.
Once again, recall that there were 42 mon/fri sick days out of 100.
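If you would rather check your arithmetic with a few lines of code than a calculator, a minimal Python sketch might look like this. The function name `chi_square_calc` is my own label, not part of any library.

```python
def chi_square_calc(observed, expected):
    """Add up (o - e)^2 / e over all rows of the table."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# 42 Mon/Fri and 58 midweek sick days, against 40 and 60 expected.
result = chi_square_calc([42, 58], [40, 60])
print(round(result, 3))  # 0.167
```

The exact answer is 0.1666..., which is the number quoted as 0.166 elsewhere in this module.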
So now you have calculated a number which is the chi-square statistic for this test, also called the "chi-square-calc". It's the one that's obnoxiously flashing at you. But what do you DO with it? You know that a big chi-square-calc is bad (because it means that the data deviate a lot from the model) and a small chi-square-calc is good (because it means the data doesn't deviate). But how big is big, or how small is small?
Before we answer that question, we need to take a brief detour to discuss degrees of freedom. After that, we can finally answer the question, are Dilbert's colleagues really out fishing on their long weekends?
If your shoes don't fit a little, they might cause a little pain, but not enough to pay attention to. But somewhere there's a threshold. If the shoe is too small, you go out and buy new ones.
Something similar happens with statistical tests such as the chi-square. If your calculated statistic value (i.e., the chi-square-calc) is a "little bit" big, that's not enough to contradict your hypothesis. But if it's a LOT too big, then it does matter -- it is "significant".
I know this is still rather vague, so hang on. Statisticians measure how significant the calculated value is using what they call a "p-value" (p stands for "probability", not "pain"). A big p-value means that the calculated value could "probably" have happened by a chance process -- like a little random slop. A small p-value means there's only a small probability that a calculated value this big arose from a little random slop. A p-value of 0.05 means that, if nothing but random slop were going on, only about 5% of datasets would give a calculated value this big or bigger; anything beyond that threshold we call "significant". In fact, this is the famous p=0.05 threshold that most scientists use (well, not famous like American Idol, but trust me, famous among statisticians and scientists).
So, so far we have a chi-square-calc, which has a p-value associated with it. This would be fine and dandy IF we actually knew what that p-value was. But we don't. And in fact, finding out the p-value for any given chi-square-calc would involve a complicated mathematical formula. Believe it or not, biologists are not actually big on complicated mathematical formulas (or formuli either). So instead we have a lookup table. Or as I like to say, a Magic Lookup Table, because for our purposes, it might as well have appeared magically.
What the lookup table tells you is, for your specific dataset, which chi-square-calc value would correspond to p=0.05. This special number is called the "chi-square-crit", as in the critical value or threshold value of the chi-square-calc.
And how do you know that this chi-square-crit is the one and only chi-square-crit that fits your exact dataset? It turns out that you only need to know one thing about your dataset, which is how many rows are in the chi-square table. Below is the Magic Lookup Table. If your chi-square table had 2 rows (like ours did), then you look up the chi-square crit under df = 1 (cuz 2-1 = 1, more about that on the next page).
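For the curious, the numbers in such a table are not really magic. The sketch below is my own illustration, not the way any particular published table was produced. It uses only Python's standard library and a standard identity for the chi-square tail probability to find, for each df, the chi-square value whose p-value is 0.05. The function names are hypothetical.

```python
import math

def chi_square_p_value(chi_sq, df):
    """Probability of a chi-square value at least this big by chance alone."""
    x = chi_sq / 2.0
    # Closed-form starting points: df = 1 (odd case) or df = 2 (even case).
    if df % 2 == 1:
        a, p = 0.5, math.erfc(math.sqrt(x))
    else:
        a, p = 1.0, math.exp(-x)
    # Step up two degrees of freedom at a time until we reach df.
    while a < df / 2.0:
        p += x ** a * math.exp(-x) / math.gamma(a + 1.0)
        a += 1.0
    return p

def chi_square_crit(df, p=0.05):
    """Find the chi-square value whose p-value equals p, by bisection."""
    low, high = 0.0, 100.0
    for _ in range(60):
        mid = (low + high) / 2.0
        if chi_square_p_value(mid, df) > p:
            low = mid   # p-value still too big, so the threshold is higher
        else:
            high = mid
    return (low + high) / 2.0

for df in range(1, 6):
    print(f"df = {df}: chi-square-crit = {chi_square_crit(df):.2f}")
```

For df = 1 this comes out to about 3.84, and the threshold grows as the degrees of freedom go up.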
When doing a chi-square goodness of fit test, there is one last wrinkle to iron out, called degrees of freedom.
On the last page, I said you should look up the chi-square-crit under "number of rows minus one". Why?
When I told you that 42 out of 100 sick days were on Mondays or Fridays, you automatically knew that 58 had to be in the middle of the week, right? I was "free" to specify how many were on Monday/Friday, but then I was NOT "free" to decide how many were on non-Monday/Friday. So we say that, in this problem, there is only 1 degree of freedom.
It is possible to do chi-square tests using more than 2 categories. For example, let's say I got data on how many sickdays fell on EACH of the five weekdays:
| day | observed | expected |
| mon | 22 | 20 |
| tues | 19 | 20 |
| wed | 19 | 20 |
| thurs | 20 | 20 |
| fri | 20 | 20 |
We could do a chi-square test to check whether the distribution of sick days matched our expectations for ALL FIVE weekdays. With five rows in the table, that test has 5 - 1 = 4 degrees of freedom.
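Here is a rough Python sketch of that five-row test, using the counts from the table above. The critical value of 9.49 for 4 degrees of freedom at p = 0.05 is the standard textbook value; it is not quoted anywhere in this module, so treat it as my addition.

```python
# Counts from the five-weekday table.
observed = {"mon": 22, "tues": 19, "wed": 19, "thurs": 20, "fri": 20}
expected = {day: 20 for day in observed}

# Add up (o - e)^2 / e over all five rows.
chi_sq = sum((observed[d] - expected[d]) ** 2 / expected[d] for d in observed)

# Once four of the five counts are known, the fifth is forced.
df = len(observed) - 1

CRIT_DF4_P05 = 9.49  # standard table value for df = 4, p = 0.05 (assumption)
fits = chi_sq < CRIT_DF4_P05
print(f"chi-square-calc = {chi_sq:.2f}, df = {df}, model fits: {fits}")
```

The deviations here are tiny compared to the threshold, so the five-day data would be consistent with sick days falling at random.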
Once you know the degrees of freedom (or df) in your test, you can use a chi-square table to look up the chi-square-crit corresponding to a p-value of 0.05. That's the whole detour summed up in one sentence. Whew. For Dilbert's test, with 1 df, the chi-square-crit is 3.84.
So, we have a chi-square value that we calculated, called chi-square-calc (0.166), and a chi-square value that we looked up, called chi-square-crit (3.84). Comparing these two values, we find that our chi-square-calc is much smaller than the chi-square-crit (0.166 < 3.84), which means the deviations were small and the model fits the data. This supports Dilbert's hypothesis that sick days were random. So employees are probably really sick and not out fishing on those Mondays and Fridays.
Here's a summary of the steps needed to do a chi-square goodness of fit test:
| General Steps | In the Dilbert example... |
| 1. Decide on a null hypothesis -- a "model" that the data should fit | Dilbert's null hypothesis was that the sick days were randomly distributed. |
| 2. Note your "expected" and "observed" values | Since 40% of weekdays fall on Monday or Friday, the same should be true of sick days -- or 40 out of 100. The observed value was 42 out of 100. |
| 3. Find the chi-square-calc [add up (o-e)^2 / e ] | We got 0.166 |
| 4. Look up the chi-square-crit based on your p-value and degrees of freedom. | With p=0.05 and df=1, chi-square-crit = 3.84. |
| 5. Determine whether chi-square-calc < chi-square-crit -- if so, we say the model fits the data well. | Chi-square-calc < chi-square-crit, so the deviations are small and the data fit the null model of random sick days. |
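Steps 3 through 5 are mechanical enough to sketch in a few lines of Python. This is only an illustration: the function name and the lookup dictionary are mine, and apart from the df = 1 entry (3.84), the critical values are standard textbook numbers rather than anything quoted in this module.

```python
# Standard chi-square critical values for p = 0.05, keyed by degrees of freedom.
# Only the df = 1 entry (3.84) appears in the text; the rest are textbook values.
CHI_SQUARE_CRIT_P05 = {1: 3.84, 2: 5.99, 3: 7.81, 4: 9.49, 5: 11.07}

def goodness_of_fit(observed, expected):
    """Steps 3-5: compute chi-square-calc, look up chi-square-crit, compare."""
    calc = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1          # number of rows minus one
    crit = CHI_SQUARE_CRIT_P05[df]
    return calc, crit, calc < crit

# The Dilbert example: 42 Mon/Fri and 58 midweek, against 40 and 60 expected.
calc, crit, fits = goodness_of_fit([42, 58], [40, 60])
print(f"calc = {calc:.3f}, crit = {crit}, model fits: {fits}")
```

Steps 1 and 2 (choosing the null model and working out the expected values) still have to be done by a human.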
What if 90% of the sickdays were on M/F?
You should have gotten a chi-square-calc of 104.166, compared to the chi-square-crit of 3.84. So, the chi-square-calc is much greater than the chi-square-crit -- so the data does not fit the model and you reject your null hypothesis. In other words, Dilbert's random sickday model does NOT hold up.
In the last two examples (42% and 90%), it was pretty obvious what the chi-square test would say. In this last case, where 50% of sick days fall on M/F, it's not so obvious. This is a case where the statistical test can help resolve a gray area. Here goes...
You should have gotten a chi-square-calc of 4.166, compared to the chi-square-crit of 3.84. So, it's a close call, but the test says that Dilbert's random sickday model probably does NOT hold up. The test can't tell you this for sure, but it still gives you a way to say "(probably) yes" or "(probably) no" when you're in a "gray" area.
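To see all three scenarios side by side, here is a small Python sketch. It also estimates the p-value for each one, using the fact that with 1 degree of freedom the chi-square tail probability can be written with the complementary error function. The helper names are mine, and the p-values are my addition, not numbers from this module.

```python
import math

def chi_square_calc(observed, expected):
    """Add up (o - e)^2 / e over all rows."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def p_value_df1(chi_sq):
    """Tail probability for a chi-square value with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi_sq / 2.0))

results = {}
for mon_fri in (42, 50, 90):
    # Out of 100 sick days, the midweek count is whatever is left over.
    calc = chi_square_calc([mon_fri, 100 - mon_fri], [40, 60])
    results[mon_fri] = (calc, p_value_df1(calc))
    verdict = "fits" if calc < 3.84 else "does NOT fit"
    print(f"{mon_fri}% M/F: chi-square-calc = {calc:.3f}, "
          f"p = {results[mon_fri][1]:.3g}, model {verdict}")
```

The 50% case lands just past the threshold, with a p-value a little under 0.05, which is exactly what a close call looks like.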
The following table compares the steps necessary for the two types of goodness-of-fit models in this module. I changed the tables a little from those given earlier, to emphasize the similarities between the two tests.
| Chi-square steps |
| 1. Decide on a null hypothesis -- a "model" that the data should fit |
| 2. Decide on your p-value (usually 0.05). |
| 3. Note your "expected" and "observed" values |
| 4. Calculate the chi-square [add up (o-e)^2 / e ] |
| 5. Look up the chi-square-crit based on your p-value and degrees of freedom (df=rows-1). Determine whether chi-square-calc < chi-square-crit. If so, we say the model fits the data well. |
In both tests, the hardest steps are 1 (deciding on your null model) and 3 (figuring out what you "expected" to see based on the null model).
Usually your null model is that "chance alone" is responsible for any patterns in the observed data. For example, the 9:3:3:1 ratio for a dihybrid cross is what happens by chance alone, given that you are mating 2 dihybrids.
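To turn a ratio like 9:3:3:1 into expected counts, you split your total according to the ratio. A minimal Python sketch follows; the total of 160 offspring is a made-up number for illustration only.

```python
from fractions import Fraction

ratio = [9, 3, 3, 1]
total_offspring = 160  # hypothetical sample size, not from the text

# Each class gets its share of the total: part / (9 + 3 + 3 + 1).
expected = [total_offspring * Fraction(part, sum(ratio)) for part in ratio]
print([float(e) for e in expected])  # [90.0, 30.0, 30.0, 10.0]
```

These expected counts would then go in the "expected" column of a four-row chi-square table, giving 4 - 1 = 3 degrees of freedom.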
This step (#1) also encompasses setting up your chi-square table or your simulations. For the chi-square table, you need to think in terms of how many pieces of observed data you have to test. Each of these becomes a row. Now you also know the degrees of freedom for your test, which is the number of rows minus 1.
Step #3, finding the expected values, basically means doing some probability calculations, using the Laws of AND and OR.
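As a small illustration of those two laws, here is a Python sketch. It assumes a sick day is equally likely to fall on each of the five weekdays (the null model from the Dilbert example), and it assumes the usual 3/4 chance of showing the dominant trait for one gene when two hybrids are mated. The variable names are mine.

```python
from fractions import Fraction

# OR (mutually exclusive events): add the probabilities.
p_monday = Fraction(1, 5)
p_friday = Fraction(1, 5)
p_mon_or_fri = p_monday + p_friday          # 2/5, i.e. 40%

# AND (independent events): multiply the probabilities.
p_dominant_one_gene = Fraction(3, 4)        # assumed Mendelian value
p_dominant_both = p_dominant_one_gene ** 2  # 9/16, the "9" in 9:3:3:1

# Expected Mon/Fri count out of 100 sick days.
expected_mon_fri = p_mon_or_fri * 100
print(p_mon_or_fri, p_dominant_both, expected_mon_fri)
```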
Once you know the expected values, filling out the rest of the chi-square table is just a matter of arithmetic.
Now go back to the main menu and try your hand at the quiz!