# How standardized language and speech testing works

When you get the results of your child's language and speech testing, you may have some difficulty understanding what it all means. Especially if you have limited experience with standardized testing, the speech-language pathologist's explanations may sound like complete gibberish.

And not to bad-mouth my colleagues, but there are a lot of SLPs who do speech testing without a very thorough understanding of how standardized testing works. It's pretty dry, tedious stuff, particularly for those of us who don't particularly enjoy math. Still, the better you understand it, the better equipped you will be to understand what your child's abilities and needs are.

Some SLPs are pretty good at explaining what the different scores mean, and I'd like to think I'm one of them. If you are working with one of my colleagues whose gifts lie in other areas and you need some help making sense of it all, this is a page you will want to read. Fair warning, though: although I have done my utmost to keep it simple and accessible, it does involve some discussion of math and statistics. If there's stuff you don't understand, that's okay; keep going and focus on the big picture.

**Types of tests**

We're all familiar with **criterion-referenced tests**, the kind where the score you get determines whether or not you pass, or what letter grade you earn. Most of the tests children take in school are of this type. Criterion-referenced tests are generally used to gauge whether a student has achieved up to a given standard.

Most speech testing is done with **norm-referenced** tests. This means that your child's performance is compared to a large sample of other children the same age. The creators of a norm-referenced test administer their test to a large number of children within a given age range before making it available for use with the general public. If you give a well-designed test to a large random sample of children, a chart of all the scores will take the form of a mathematical **bell curve**.

A few children will score very high or very low, but the majority will score toward the middle of the range. On a perfect bell curve, the average score (mean) will also be the most commonly occurring score (mode) and the score with the same number of results above it as below it (median).
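To see how these three measures line up, here is a minimal Python sketch using a small, made-up set of raw scores (a real norming sample would be far larger):

```python
import statistics

# Made-up raw scores, roughly bell-shaped around 21.
scores = [17, 19, 20, 21, 21, 21, 22, 23, 25]

print(statistics.mean(scores))    # arithmetic average (mean)
print(statistics.median(scores))  # middle score (median)
print(statistics.mode(scores))    # most commonly occurring score (mode)
# On this symmetric sample, all three come out to 21.
```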

Your child's results on norm-referenced speech testing will typically consist of several different numbers.

**Raw score**

The raw score is the number of items the child answered correctly. If the test has 30 items and the child responded correctly on 23 of them, the raw score is 23. Simple, right? However, the raw score does not tell us much about the child's ability. Is 23 a good score? Since many of us are used to criterion-referenced tests, we'll be tempted to convert the score to a percentage (76.67%) and view it in terms of letter grades. If 90% is an A, and 80% is a B, and 70% is a C, then a score of 23 would get you a C. However, that's not the way norm-referenced tests work. Because the same test may be given to children of different ages, a 9-year-old child would be expected to score higher than a 6-year-old; older children generally have better language skills than younger ones. So the answer to "Is 23 a good score?" is, "It depends." Not a very satisfying answer, is it? That's why we also calculate the child's ...

**Percentile rank**

If you've taken the SAT or a similar test, you may remember seeing a percentile rank when you got your score. The percentile rank tells what percentage of the sample scored equal to or less than your score. If your percentile rank is 50, you scored as well as or better than 50% of the sample. On the SAT, the 'sample' is everyone who took the test on the same day as you. For speech testing, the sample is the group of children who took the test before it was published. More specifically, it's the children *close to your child's age* who took the test before it was published. A 6-year-old will be compared to other 6-year-olds, and a 9-year-old will be compared to other 9-year-olds. In fact, many tests divide the sample into finer age ranges, such as six years, zero months (6;0) to six years, three months (6;3). This is because a child's speech and language skills can change a lot in a year's time, and comparing the test results of a child who just turned six with those of a child about to turn seven is often misleading.

So, what's a good percentile rank? If you're used to the traditional 90-80-70-60 grading scale where any score below 60 *percent* is a failure, a *percentile* score of 73 may look mediocre while 37 may seem abysmal. Actually, both of these scores are average--in the C range, if that's how you like to think of it. In fact, any score between the 16th and 84th percentile is in the average range. Eighty-four minus 16 is 68, so 68% of scores fall within this range. Anything sixty-eight percent of the population does can reasonably be considered 'normal'. A score above the 84th percentile is considered above average, and one below the 16th percentile is considered below average. (Note that statisticians reserve the term *mean* for the single number the rest of us call the *average*; in this discussion, *average* refers to a range of scores, not a single score.)

Why set the cutoff lines at 16 and 84? These seemingly arbitrary numbers are based on statistical principles having to do with the analysis of large random samples. Before I attempt to explain how this works, let's first take a look at the ...

**Standard score**

The standard score is a way of showing how far a score is from the average score in the sample. Recall that the sample on a speech and language test consists of children who took the test before it was published. The people who make these tests carefully tweak the sample to make sure that it is representative of the general public; then they tweak the difficulty of the test itself to make sure that the distribution of the speech testing scores falls as closely as possible to a normal bell curve pattern.

Statisticians use a measure called the *standard deviation* (represented by **σ**, the lower-case Greek letter *sigma*) to measure how widely spread the values in a data set are. The closer the majority of scores are to the average, the smaller the standard deviation. If you like math, here's how you calculate the standard deviation:

- Calculate the mean (average) of all the scores (sum of scores divided by the number of scores).
- For each score, calculate its distance, or *deviation*, from the mean you found in step 1 (the score minus the mean). [Note: Some of these results will be negative numbers, since scores below the mean produce negative deviations.]
- Calculate the square of each deviation from step 2 (multiply each deviation by itself). [Note: All of these results will be positive numbers, since a negative number times a negative number equals a positive number.]
- Calculate the mean (average) of all the squared deviations from step 3. This number is called the *variance*, represented as **σ²** (lower-case *sigma* squared).
- Find the square root of the variance. This is the *standard deviation*, or **σ**.

If you're me, the way you find the standard deviation is by entering all the scores into an Excel spreadsheet and letting the computer do the calculations.
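If you prefer code to a spreadsheet, the five steps above can be sketched in Python (with a small set of made-up scores):

```python
import math

def standard_deviation(scores):
    """Population standard deviation, following the steps above."""
    mean = sum(scores) / len(scores)              # step 1: the mean
    deviations = [s - mean for s in scores]       # step 2: deviations from the mean
    squared = [d * d for d in deviations]         # step 3: squared deviations
    variance = sum(squared) / len(squared)        # step 4: the variance (sigma squared)
    return math.sqrt(variance)                    # step 5: the standard deviation (sigma)

# Made-up raw scores for illustration:
scores = [18, 20, 21, 21, 22, 24]
print(round(standard_deviation(scores), 2))
```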

The standard deviation is why the cutoff lines are set at the 16th and 84th percentiles.
Take another look at the bell curve.
Approximately 68% of the scores on the bell curve are within one standard deviation of the mean (which should also be the median, or the 50th percentile); 34% will be one standard deviation or less below the mean (16th to 50th percentile), and 34% will be one standard deviation or less above it (50th to 84th percentile).
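Where do 16, 68, and 84 come from? They fall straight out of the normal (bell curve) distribution, which a short calculation can verify:

```python
import math

def normal_cdf(z):
    """Cumulative probability of the standard normal distribution."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Fraction of scores within one standard deviation of the mean:
within_one_sd = normal_cdf(1) - normal_cdf(-1)
print(round(within_one_sd * 100, 1))   # about 68.3%

# Percentile cutoffs at -1 SD and +1 SD:
print(round(normal_cdf(-1) * 100))     # about the 16th percentile
print(round(normal_cdf(1) * 100))      # about the 84th percentile
```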

The common practice on standardized tests used for speech testing is to substitute the number 100 for the mean score and 15 for the standard deviation. This makes it easy to compare scores from different tests and subtests. With this system, standard scores between 85 and 115 are within the average range; scores above 115 (**+1σ**, or one standard deviation above the mean) are considered above average, while scores below 85 (**-1σ**, or one standard deviation below the mean) are considered below average.

So, if a child gets a raw score of 23 out of 30 on a test, the way we can tell if it's a high score or a low score is to convert it to a standard score. The SLP who administered the speech testing does this by looking up the raw score on a table in the test manual showing raw-to-standard-score conversions for the age range that applies to the child who took the test. If the mean score is 21 and the standard deviation is 3, a raw score of 23 is two-thirds of a standard deviation above the mean; the standard score would be two-thirds of 15 (i.e., 10) points above 100, or 110.

It's also possible to set the mean as zero and the standard deviation as one. Under this system, our raw score of 23 would convert to +0.67; a raw score of 19 would convert to -0.67. Probably because so many people find a score expressed with a negative number kind of depressing, the 100 +/- 15 system is more common.
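As a sketch of the conversion (assuming the test manual gives you the sample mean and standard deviation for the child's age band), here is the arithmetic for both systems, using the example numbers from above:

```python
def standard_score(raw, sample_mean, sample_sd, mean=100, sd=15):
    """Convert a raw score to a standard score on the chosen scale."""
    z = (raw - sample_mean) / sample_sd   # distance from the mean, in standard deviations
    return mean + sd * z

# Worked example from the text: sample mean 21, standard deviation 3.
print(standard_score(23, 21, 3))                          # 110.0
print(standard_score(19, 21, 3))                          # 90.0

# The same scores under the mean-0, SD-1 (z-score) system:
print(round(standard_score(23, 21, 3, mean=0, sd=1), 2))  # 0.67
print(round(standard_score(19, 21, 3, mean=0, sd=1), 2))  # -0.67
```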

**Age equivalency score**

Next to the *raw score*, the *age equivalency score* is the least helpful of the scores you'll see. But, because it is fairly easy to understand conceptually, it's the speech testing result parents often remember. I wish I had a dollar for every time I've heard a parent say something like "My 31-month-old child had speech testing and is at the 17-month level."

This is an unfortunate and misguided use of the age equivalency score, because it makes it sound as if the 31-month-old talks like a 17-month-old. This is rarely the case. No test has ever been written that measures the full range of a child's speaking ability. Most children have areas of speech and language where they are relatively strong and areas where they are not as strong.

What the age equivalency score really means is that the child's raw score corresponds to the mean score for children that age. If a child's raw score is 23, and 23 is the mean score for children aged 6 years, 9 months, the child's age equivalency score will be 6;9. But remember that statistically speaking, *average* is a range, not a single score. Depending on the standard deviation of the sample used for norm-referencing, an age equivalency score that appears to be significantly below or above the child's actual (chronological) age could in fact be within the normal range.

So how do you tell whether an age equivalency score represents a true delay or deficit? Easy--you ignore it, and look instead at the child's *standard score* or *percentile rank*. If the standard score is within one standard deviation of the mean (usually between 85 and 115), or if the percentile rank is between 16 and 84, the child's performance on the speech testing was age-appropriate.

If you are a typical parent, understanding the raw score, the percentile rank, the standard score, and the age equivalency score will be plenty good enough, and you can probably stop reading here. However, if you want an even deeper understanding of how speech testing (and other standardized testing) works, or if you're a glutton for punishment, read on.

**Standard error of measurement**

Once you have the scores and know what they mean, the next question for many parents is, 'How can we be sure that the test is accurate?' We've all had good days and bad days, and what kind of day we're having on the day we take a test can affect our performance. Some people are 'bad test-takers' and may freeze up when they are being tested. When we aren't absolutely sure about the answer to a question, some of us make a random guess; others say "I don't know;" still others will mull it over, eliminate some possibilities, and make an educated guess. How we deal with not knowing the answer can affect our performance on a test.

Because factors other than the child's ability can affect speech testing performance, all tests have what is known as a *standard error of measurement*. This is a statistical estimate of how far a given score is likely to stray from the test-taker's true ability. On some tests, the results will include *confidence intervals* for each score. The confidence interval is a range from a few points below the child's score to a few points above it, along with the statistical probability that the child's true ability lies somewhere in that range. For example, if the child's standard score is 100 and the 90% confidence interval is five points, the probability that the child's true ability is somewhere between 95 and 105 is 90%.

A higher confidence level will yield a wider interval, and a lower confidence level will give you a narrower interval. In other words, the more wiggle room we give ourselves, the more certain we can be that we are right. I can tell you right now with 100% certainty that your child's ability (assuming you have a child) is somewhere between zero and infinity, but that isn't really very useful information.
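As a rough sketch: the standard error of measurement is commonly derived from the test's standard deviation and its reliability coefficient, and the confidence interval is the score plus or minus a multiple of the SEM. The reliability of 0.96 below is a made-up number for illustration:

```python
import math

# Standard normal critical values for common confidence levels.
Z = {90: 1.645, 95: 1.960, 99: 2.576}

def sem(sd, reliability):
    """Standard error of measurement from the test's reliability coefficient."""
    return sd * math.sqrt(1 - reliability)

def confidence_interval(score, sem_points, level=90):
    """Range likely to contain the child's true score at the given confidence level."""
    margin = Z[level] * sem_points
    return (score - margin, score + margin)

# A test with SD 15 and a hypothetical reliability of 0.96 has an SEM of 3 points.
print(round(sem(15, 0.96)))

low, high = confidence_interval(100, 3, level=90)
print(round(low), round(high))    # roughly 95 to 105, as in the example above

# Raising the confidence level widens the interval:
low99, high99 = confidence_interval(100, 3, level=99)
print(round(low99), round(high99))
```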

The way to narrow the confidence interval, or to increase the level of certainty, is to administer multiple tests measuring the same ability, preferably on different days. If a child scores at the same level on two different tests of, say, expressive vocabulary, it could be a fluke, but it's less likely. You can flip a coin twice in a row and get heads both times, but the more times you flip the coin, the less likely it is that you'll get all heads (unless it's a two-headed coin). The same is true for speech testing. However, giving multiple tests of the same ability area is time-consuming and therefore pretty rare except in cases of a dispute or a question over the accuracy of the speech testing results.

**Reliability and validity**

Along with the standard error of measurement, it is legitimate to ask how *reliable* a test is. A reliable test will give you consistent results for children of similar ability levels. On most tests, if a student re-takes the test, she will score slightly higher on the second try, but a test on which students consistently get vastly different results from one attempt to the next is not reliable. Such things *can* happen by chance, but they usually don't.

Reliability is checked in a number of ways. *Test-retest reliability* is what we've just been discussing--re-administering the test to a group of students and comparing the difference between the first and second attempts. *Internal reliability* is based on a comparison of questions within the test that address the same concept or skill.

Reliability is not everything, however. If you ask 1,000 fifth graders "What's one plus one?" and then ask them the same question six months later, I predict that close to 100% of them will give you the same answer. These results are extremely reliable, but all they tell us is that fifth graders know what one plus one is, and that some of them know how to give smart-alec answers. Results like this are not much help in speech testing.

The *validity* of a test addresses the question of whether it actually measures what it is meant to measure. A well-known controversy over validity has focused on tests of intelligence (IQ). Many IQ tests used in the United States (and probably in other countries, too) have been criticized for being culturally biased, resulting in a skewing of the scores in favor of white, middle class children over poorer, minority, and immigrant children. A culturally biased test, according to this argument, does not actually test how intelligent the test-taker is, but rather how familiar she is with mainstream American culture.

As you might have guessed, no language or speech testing measure has 100% validity. Although it is certainly not the only threat to test validity, cultural bias has reared its ugly head in speech testing, too. If a child grows up in a community that uses Hawai'i Creole English (HCE), she will hear adults and older children using sentences like *He neva see nobody take da bus*. If her language development is normal, she will learn to talk like them. If she takes a language or speech testing measure based on Standard American English (SAE), her use of a double negative and her non-standard use of *neva* corresponding to SAE *didn't* may negatively affect her score and incorrectly identify her as having expressive deficits in grammar and vocabulary. In addition, if she takes the *Goldman-Fristoe Test of Articulation*, she will be penalized for her deletion of final /r/ in *neva (never)* and her stopping of the voiced TH sound in *da (the)*, both of which are normal in HCE.

Validity also refers to the conclusions we draw from the results of speech testing. The Peabody Picture Vocabulary Test, 3rd Edition (PPVT-III) is a measure of receptive vocabulary. On this test, the child is shown four pictures and the examiner says a word; the child then points to the picture that corresponds with the word. It's not a bad test, but it gets misused a lot. I've actually seen speech testing reports where the PPVT-III is the only language test used, and the report states, "Language skills are within normal limits," based on a PPVT-III score of between 85 and 115. This is not a valid conclusion, because the PPVT-III does not even claim to be a test of 'language skills,' but of one very specific skill, receptive vocabulary. Anyone who has studied a foreign language knows there's a lot more to learning a language than just being able to understand individual words!

**Sensitivity and specificity**

Sensitivity and specificity measure how accurately a test sorts children into 'impaired' and 'non-impaired' groups.

The *sensitivity* of a test is its ability to identify children with impairments as impaired. If a child with an impairment achieves a high score and is incorrectly identified as non-impaired, this is called a *miss* or a *false negative*. A test with 100% sensitivity will yield no false negatives.

*Specificity* is the ability of a test to identify children without impairments as non-impaired. If a child without an impairment scores low enough to be incorrectly identified as impaired, this is a *false positive*.

Sensitivity is the number of true positives divided by the sum of true positives plus false negatives.

Specificity is the number of true negatives divided by the sum of true negatives plus false positives.

In language and speech testing, you want children with language and speech impairments to score low and children without impairments to score high. If a typically developing child scores low and is identified as having a disorder, this is a *false positive*; if a child with an impairment scores high enough to indicate that she is not impaired, this is a *miss*.
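The two formulas can be sketched directly. The counts below are made-up, as if a test had been tried out on a validation sample of 50 impaired and 200 typically developing children:

```python
def sensitivity(true_pos, false_neg):
    """Share of impaired children the test correctly flags as impaired."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """Share of non-impaired children the test correctly identifies as non-impaired."""
    return true_neg / (true_neg + false_pos)

# Hypothetical validation sample: 50 impaired children (45 flagged, 5 missed)
# and 200 typical children (180 cleared, 20 false positives).
print(sensitivity(45, 5))    # 0.9
print(specificity(180, 20))  # 0.9
```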

It's easy to design a test with 100% sensitivity; all you have to do is identify all scores as low and all children who take it as impaired. Of course, if you do this, you'll get lousy specificity. All the non-impaired children who do speech testing of this sort will be identified as impaired, and the number of false positives will be through the roof.

Likewise, if you want 100% specificity, just call all speech testing scores 'normal' and you will have zero false positives. Of course, your sensitivity will plummet because you'll have misses on all of the kids with impairments.

Obviously, it's not realistic for test makers to use either of these approaches. It's also not very likely there will ever be speech testing that has both 100% sensitivity and 100% specificity. So far, I know of none. On a well-designed test, however, both numbers will be fairly high, and not too far apart.

It is not standard procedure in speech testing for the examiner's report to include information on standard error of measurement, reliability and validity, or sensitivity and specificity, so if you want this information, you'll have to ask for it. I don't necessarily recommend this, however, as long as your child's speech testing includes several different measures testing a range of speech and/or language skills. The most likely situation in which I could see myself asking for these figures is if I were told my child doesn't qualify for services because her speech testing scores are just a bit on the high side, or if the goals recommended don't square with what I feel she needs. In most cases, the report will include the child's raw score, standard score (possibly with confidence interval), and percentile rank. It may also include age equivalency scores, but as I noted earlier, I don't put much stock in age equivalency scores, and typically do not list them in my speech testing reports.

I hope all this has not made your head spin too much. In most cases, you won't need to worry about most of it, but it can be useful to know the basics. If there is anything you don't understand about your child's testing, by all means, ask the person or persons who did the speech testing! If they know what they are doing, they should be able to explain it to you.
