A t-test is a way of determining whether two averages are the same (statistically speaking) or different. In order to do this, of course, you need to have data that can be averaged. Things like length, height, weight, speed, temperature ... you get the idea. This kind of data is called "quantitative", because you can measure the quantity. Data like color, shape, or emotion is called "qualitative" because you can only state the quality, not the quantity. Qualitative data cannot be evaluated with a t-test; instead, you need to use a test built for qualitative data, like a chi-square.
But let's get back to the t-test, with an example: a punkrockologist is trying to figure out whether different bands tend to write songs that are the same length or not.
First she has a random sample of the lengths of Green Day songs (in seconds) from the American Idiot CD:
548, 260, 285, 332, 246, 558
She also measured the lengths of 6 Nirvana songs, from the Bleach CD:
137, 162, 245, 250, 203, 222
It seems pretty clear by eyeballing the data that Green Day has, on average, longer songs. But when you're doing science or even government studies, you can't say “we eyeballed the data and it seems like …”.
Finally she measured 6 Linkin Park songs from the Meteora CD:
188, 175, 204, 198, 175, 145
These songs seem a little shorter than Nirvana's, but it's pretty close. Maybe she just happened to pick the shorter songs for her sample?
Just to get an intuitive sense of what the t-test does, first try to plot all three sets of data on one graph:

Looking at the plot of the song length data, which two groups look the most similar? Which look the most different?
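If you don't have graph paper handy, here is one rough way to draw such a plot as plain text. This is only a sketch: the helper name `strip_plot_row` and the 100 to 600 second axis range are my own choices, not part of the original exercise.

```python
# Song lengths in seconds, copied from the three samples above.
songs = {
    "Green Day": [548, 260, 285, 332, 246, 558],
    "Nirvana": [137, 162, 245, 250, 203, 222],
    "Linkin Park": [188, 175, 204, 198, 175, 145],
}


def strip_plot_row(values, lo=100, hi=600, width=50):
    """Return a text row with an 'o' at each value's position between lo and hi.

    lo, hi and width are assumed plotting choices; values that land on the
    same column simply overlap.
    """
    row = [" "] * (width + 1)
    for v in values:
        col = round((v - lo) / (hi - lo) * width)
        col = min(max(col, 0), width)  # keep out-of-range values on the axis
        row[col] = "o"
    return "".join(row)


for band, lengths in songs.items():
    print(f"{band:>12} |{strip_plot_row(lengths)}|")
print(f"{'':>12}  axis runs from 100 s (left) to 600 s (right)")
```

Each row is one band, and each `o` is one song, so you can see at a glance which rows overlap and which don't.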
Intuitively, you could say that
If you guessed that there is a statistical test to determine whether a set of numbers (like the length of Green Day songs) has a higher average than another set (like the Nirvana songs), then you are right! Telepathic maybe.
What do you think the first step is? Calculate the averages, maybe?
| Band | Raw Data (seconds) | Average (seconds) |
|---|---|---|
| Green Day | 548, 260, 285, 332, 246, 558 | 371 |
| Nirvana | 137, 162, 245, 250, 203, 222 | 203 |
| Linkin Park | 188, 175, 204, 198, 175, 145 | 181 |
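If you'd rather let a computer do the adding and dividing, a minimal sketch with Python's standard `statistics` module might look like this. The variable names are just for illustration.

```python
from statistics import mean

# Song lengths in seconds, copied from the table above.
songs = {
    "Green Day": [548, 260, 285, 332, 246, 558],
    "Nirvana": [137, 162, 245, 250, 203, 222],
    "Linkin Park": [188, 175, 204, 198, 175, 145],
}

# Average = sum of the lengths divided by the number of songs.
averages = {band: mean(lengths) for band, lengths in songs.items()}

for band, avg in averages.items():
    print(f"{band}: {avg:.1f} seconds")
```

The Green Day average works out to exactly 371.5 seconds, which the table shows as 371.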
OK, so the Green Day songs average almost three minutes (168 seconds) longer than Nirvana songs, which are in turn about 20 seconds longer than Linkin Park songs. But how long is long? How short is short? And how do we know if 20 seconds is a lot or a little?
For example, one year would be a large age difference between teenagers in the same school grade ...So, how can I get a handle on how big a difference has to be in order to matter?
you've either got a good memory or you're definitely telepathic.
Do you remember how to calculate a standard deviation (SD)? We'll do it with the Green Day songs below:
| Band | Raw Data (seconds) | Average (seconds) | SD (seconds) |
|---|---|---|---|
| Green Day | 548, 260, 285, 332, 246, 558 | 371 | 144 |
| Nirvana | 137, 162, 245, 250, 203, 222 | 203 | 46 |
| Linkin Park | 188, 175, 204, 198, 175, 145 | 181 | 21 |
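In case the steps have faded from memory, here is a sketch of the calculation spelled out one step at a time. It assumes the sample standard deviation, which divides by n - 1 (the same convention the worked examples in this module use). The function name `sample_sd` is my own.

```python
from math import sqrt
from statistics import stdev

green_day = [548, 260, 285, 332, 246, 558]
nirvana = [137, 162, 245, 250, 203, 222]
linkin_park = [188, 175, 204, 198, 175, 145]


def sample_sd(values):
    """Sample standard deviation: sqrt(sum of squared deviations / (n - 1))."""
    n = len(values)
    avg = sum(values) / n                                   # step 1: the average
    squared_deviations = [(v - avg) ** 2 for v in values]   # step 2: square each distance from it
    return sqrt(sum(squared_deviations) / (n - 1))          # step 3: divide by n - 1, take the root


for name, lengths in [("Green Day", green_day), ("Nirvana", nirvana), ("Linkin Park", linkin_park)]:
    # statistics.stdev uses the same n - 1 formula, so the two columns should agree.
    print(f"{name}: by hand = {sample_sd(lengths):.1f}, statistics.stdev = {stdev(lengths):.1f}")
```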
We're going to take a short detour here, into the Land of Variability. You just figured out some standard deviations. A useful question is, what happens when you collect more data? Does the standard deviation get bigger, or smaller, or stay the same?
Let's say for a moment that you only measure 2 songs, and they are 120 seconds and 140 seconds. So the average is 130 seconds and the standard deviation is
$\sqrt{\dfrac{(-10)^2 + 10^2}{1}} = \sqrt{200} \approx 14.1$
Now let's say we take a bigger sample, which also has average = 130 seconds:
110, 120, 125, 135, 140, 150
Now the standard deviation is:
$\sqrt{\dfrac{(-20)^2 + (-10)^2 + (-5)^2 + 5^2 + 10^2 + 20^2}{5}} = \sqrt{210} \approx 14.5$
So we did a lot more work, but the standard deviation did not change much. In fact, it got slightly bigger. Why?
The answer is that the standard deviation tells you how much the population varies. As you do more sampling, your standard deviation should stay approximately the same. There is variability in the population, and the standard deviation is measuring it.
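You can check the two samples from this detour yourself. A quick sketch, using the standard library's `stdev` (which divides by n - 1, like the hand calculations here):

```python
from statistics import mean, stdev

small_sample = [120, 140]                         # the 2-song sample
larger_sample = [110, 120, 125, 135, 140, 150]    # the 6-song sample

for sample in (small_sample, larger_sample):
    print(f"n = {len(sample)}: average = {mean(sample):.0f} s, SD = {stdev(sample):.1f} s")
```

Three times as many songs, and the SD barely moves.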
But when we do more and more sampling, we are also getting closer and closer to figuring out the real average. Otherwise why do more sampling? What we need is a new number that tells us how close we are to the actual mean.
I won't explain why this works, but it is a well-established fact that if you divide the standard deviation by the square root of the sample size, you get a number called a standard error (SE), and that number tells you how close you are to the true mean.
Standard error: $SE = \dfrac{SD}{\sqrt{n}}$
A rule of thumb: 95% of the time, the true average lies within 2 SEs of your sample average.
So, as I do more and more sampling, n gets bigger and the standard error gets smaller. That means I can narrow in on the true average.
Let's try some examples. Let's say I have measured 9 songs (in Statisticalese, I say n=9, where “n” means “number in sample”).
but if I measure 100 songs:
So all that extra sampling paid off -- we can narrow down the range around the true average from 40 seconds to 12 seconds.
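Here is a small sketch of that comparison. One loud assumption: the SD of 60 seconds is not stated in the text. I picked it because it reproduces the 40-second and 12-second ranges quoted above under the 2-SE rule of thumb.

```python
from math import sqrt


def standard_error(sd, n):
    """SE = SD / sqrt(n)."""
    return sd / sqrt(n)


ASSUMED_SD = 60  # seconds; an assumed value, not taken from the original text

for n in (9, 100):
    se = standard_error(ASSUMED_SD, n)
    # Rule of thumb: the true average is within about 2 SEs of the sample average.
    print(f"n = {n}: SE = {se:.0f} s, so the true average is within about +/- {2 * se:.0f} s")
```

Notice that it took more than ten times as many songs to shrink the range to about a third of its size, because n sits under a square root.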
OK, so we are trying to get back onto the road after our detour. Remember we calculated the average length of Green Day songs and the Nirvana songs. Then we wanted to compare that difference to some sort of variability, in order to figure out if the difference is significant or not.
$SE_{combined} = \sqrt{SE_1^2 + SE_2^2}$
Does that remind you of anything? Like the Pythagorean theorem? The combined SE ends up being a little bigger than either individual SE, but less than their sum. A pretty neat trick.

So for Green Day (SE = 59 seconds) and Nirvana (SE = 19 seconds), the combined SE is $\sqrt{59^2 + 19^2} \approx 62$ seconds, or 61.5 seconds if you use the unrounded SEs.
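To see where the 59, the 19, and the 61.5 come from, here is a sketch that starts from the raw song lengths. The function names are my own, and the SD is the n - 1 version used throughout this module.

```python
from math import sqrt
from statistics import stdev

green_day = [548, 260, 285, 332, 246, 558]
nirvana = [137, 162, 245, 250, 203, 222]


def standard_error(values):
    """SE = SD / sqrt(n), using the sample (n - 1) standard deviation."""
    return stdev(values) / sqrt(len(values))


def combined_se(a, b):
    """Combine two standard errors Pythagoras-style: sqrt(SE1^2 + SE2^2)."""
    return sqrt(standard_error(a) ** 2 + standard_error(b) ** 2)


print(f"Green Day SE: {standard_error(green_day):.1f} s")
print(f"Nirvana SE:   {standard_error(nirvana):.1f} s")
print(f"Combined SE:  {combined_se(green_day, nirvana):.1f} s")
```

Because the code never rounds the individual SEs, it lands on 61.5 rather than 62.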
Now we need to compare the difference in song lengths to the combined standard error.
In order to compare two numbers, we need to use the ratio:
$\text{ratio} = \dfrac{\text{difference between the averages}}{SE_{combined}}$
Based on the argument above, it seems that
Hmm, what we need is another Magic Lookup Table!! (see the chi-square module for the original magic lookup table).
| degrees of freedom (df) | $t_{crit}$ (for p-value = 0.05) |
|---|---|
| 1 | 12.7 |
| 2 | 4.3 |
| 3 | 3.2 |
| 4 | 2.8 |
| 5 | 2.6 |
| 6 | 2.4 |
| 7 | 2.4 |
| 8 | 2.3 |
| 9 | 2.3 |
| 10 | 2.2 |
| 20+ | 2.0 |
Remember that degrees of freedom tell you how many pieces of information are “free” to vary. We found the length of 6 songs (n=6). If you tell me the average and 5 of the lengths, then I can tell you how long the last song is. So there are 5 degrees of freedom (df = n-1 = 5).
The difference between average song length was 168 seconds, and the combined standard error was 61.5. That means the ratio of average difference to error was 2.74. So the difference was pretty big compared to the variability, and it seems like there is a real difference between the song lengths.
Now for the lookup table:
Our test had 5 degrees of freedom, so the critical number is 2.6. In other words, if the average difference is AT LEAST 2.6 times as great as the combined standard error, then the two sets of numbers really are different.
So our threshold was 2.6 and our calculated value was 2.74, which is bigger than our threshold. Since the difference in song length is at least 2.6 times as big as the combined standard error, we won … I mean, we showed that a difference this big would turn up less than 5% of the time by chance alone if the bands' true averages were the same. Most likely the song lengths really are different.
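Putting all the steps together, the Green Day vs. Nirvana test can be sketched in a few lines. The function name `t_ratio` is mine. The threshold of 2.6 is the df = 5 row of the lookup table, following the df = n - 1 reasoning used here. Many textbooks use n1 + n2 - 2 = 10 degrees of freedom for two groups of 6, which gives a lower threshold, so the choice made here is the stricter one.

```python
from math import sqrt
from statistics import mean, stdev

green_day = [548, 260, 285, 332, 246, 558]
nirvana = [137, 162, 245, 250, 203, 222]

T_CRIT_DF5 = 2.6  # from the lookup table: df = 5, p-value = 0.05


def t_ratio(a, b):
    """Difference between the averages divided by the combined standard error."""
    se_a = stdev(a) / sqrt(len(a))
    se_b = stdev(b) / sqrt(len(b))
    se_combined = sqrt(se_a ** 2 + se_b ** 2)
    return abs(mean(a) - mean(b)) / se_combined


ratio = t_ratio(green_day, nirvana)
is_different = ratio >= T_CRIT_DF5
print(f"ratio = {ratio:.2f}, threshold = {T_CRIT_DF5}, different: {is_different}")
```

The same function should work for any two of the samples, so you can use it to check your answers to the exercises.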
Try to do the next two tests on your own. First, here is the difference between the Green Day and the Linkin Park song lengths:
This time see if you can do the steps on your own
If you're having trouble, go back to the last page to review the steps.
Are the lengths of Linkin Park and Nirvana songs different??? And the answer is: