The Beautiful Math

This is Part 2 of “The Measurement Gap,” a series examining NWEA’s MAP testing and RIT scores—how they work, why teachers don’t trust them, and how they shape acceleration decisions in Oak Park District 97.

In Part 1, I documented MAP testing’s troubled history: teacher boycotts, ethics violations, a federal study showing no impact, and a for-profit conversion that raises new questions.

But controversy doesn’t mean the underlying design is bad. Before we can fairly evaluate MAP’s role in acceleration decisions, we need to understand what it’s actually trying to do—and the genuinely innovative mathematics that make it work.

The 1977 Problem

Traditional standardized tests had a fundamental limitation: they gave you a snapshot, not a trajectory.

A student might score in the 75th percentile in third grade and the 75th percentile in fifth grade. Did they grow? Stagnate? The percentile stayed the same, but that tells you nothing about actual learning.

Worse, tests designed for specific grade levels couldn’t measure students performing far above or below that level. A first grader doing third-grade math would max out a first-grade test. A struggling fifth grader would bottom out on a fifth-grade test. Neither score told you what the student actually knew.

In 1977, a group of educators in Portland, Oregon—Allan Olson, George Ingebo, and Vic Doherty—founded the Northwest Evaluation Association to solve this problem. They wanted a continuous scale that could track a student’s actual knowledge level from kindergarten through high school, regardless of grade.

The solution came from a Danish mathematician whose key work had been published nearly two decades earlier.

The Rasch Model

Georg Rasch (1901-1980) was a Danish mathematician and statistician who developed a deceptively simple insight: you can put students AND test questions on the same scale.

Here’s the core idea:

Every student has an ability level. Every question has a difficulty level. If we measure both on the same scale, we can predict the probability that a specific student will answer a specific question correctly.

The formula depends only on the gap between the two. When a student’s ability equals a question’s difficulty, they have a 50% chance of getting it right. If their ability is higher, the probability goes up. If lower, it goes down.
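For readers who want to see the formula itself, here is a minimal sketch in Python. The logistic form is the standard Rasch model: P(correct) = 1 / (1 + e^-(ability - difficulty)). The conversion of 10 RIT points per model unit (a "logit") is an assumption on my part about how the scale is commonly described, and the function name `p_correct` is my own.

```python
import math

# Assumption: 10 RIT points correspond to one logit, the Rasch model's natural unit.
RIT_PER_LOGIT = 10.0


def p_correct(ability_rit: float, difficulty_rit: float) -> float:
    """Rasch probability that a student answers a question correctly."""
    gap_in_logits = (ability_rit - difficulty_rit) / RIT_PER_LOGIT
    return 1.0 / (1.0 + math.exp(-gap_in_logits))


print(p_correct(200, 200))            # ability equals difficulty: 0.5
print(round(p_correct(210, 200), 2))  # question 10 RIT below the student: 0.73
print(round(p_correct(190, 200), 2))  # question 10 RIT above the student: 0.27
```

Notice that only the gap matters. A 210 student facing a 200 question has the same odds as a 160 student facing a 150 question.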

This seems obvious, but the implications are profound.

Equal Intervals

Because students and questions share the same scale, and the model depends only on the gap between them, a given distance means the same thing everywhere on the scale.

The learning growth from 150 to 160 RIT represents the same amount of knowledge gain as growth from 220 to 230 RIT. This “equal interval” property is what makes growth measurement meaningful.
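One way to check the equal-interval claim is to look at odds. Under the Rasch model, a 10-point gain multiplies a student's odds of answering a nearby question correctly by the same factor, wherever on the scale it happens. The sketch below assumes 10 RIT points per logit (my assumption, not a figure from NWEA documentation quoted in this series), and the helper names are hypothetical.

```python
import math

# Assumption: 10 RIT points correspond to one logit.
RIT_PER_LOGIT = 10.0


def odds_correct(ability_rit: float, difficulty_rit: float) -> float:
    """Odds, p / (1 - p), of a correct answer under the Rasch model."""
    return math.exp((ability_rit - difficulty_rit) / RIT_PER_LOGIT)


def odds_multiplier(start_rit: float, end_rit: float, difficulty_rit: float) -> float:
    """How much a student's odds on one question improve after growing."""
    return odds_correct(end_rit, difficulty_rit) / odds_correct(start_rit, difficulty_rit)


low = odds_multiplier(150, 160, difficulty_rit=155)
high = odds_multiplier(220, 230, difficulty_rit=225)
print(round(low, 3), round(high, 3))  # 2.718 2.718
```

Strictly speaking, this shows that equal RIT gains mean equal changes in odds. Reading that as "the same amount of knowledge" is the interpretation the scale is built to support.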

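On a percentile scale, this isn’t true. Moving from the 50th to the 60th percentile isn’t the same as moving from the 90th to the 95th, which takes a larger gain in underlying skill even though it covers half as many percentile points. Percentiles squeeze large differences at the extremes into just a few points.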
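To put rough numbers on the percentile comparison, here is a small sketch. It assumes scores follow a normal (bell-curve) distribution, which is a simplification of real test data, and measures each jump in standard deviations. The function name `skill_gap` is hypothetical.

```python
from statistics import NormalDist


def skill_gap(percentile_low: float, percentile_high: float) -> float:
    """Distance between two percentiles, in standard deviations.

    Assumption: scores are normally distributed.
    """
    curve = NormalDist()
    return curve.inv_cdf(percentile_high / 100) - curve.inv_cdf(percentile_low / 100)


middle = skill_gap(50, 60)  # a 10-point percentile jump near the average
top = skill_gap(90, 95)     # a 5-point percentile jump near the top
print(round(middle, 2), round(top, 2))  # 0.25 0.36
```

Under that assumption, the 5-point jump at the top covers more ground than the 10-point jump in the middle.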

Grade Independence

The RIT scale doesn’t care what grade a student is in. A first grader with a 200 RIT and a fifth grader with a 200 RIT have demonstrated the same level of mathematical knowledge.

This is exactly what you need for acceleration decisions: a way to measure whether a younger student is performing at an older student’s level.

Adaptive Testing

Because questions are calibrated on the same scale as student ability, the test can adapt in real-time.

The algorithm:

Start with a medium-difficulty question
Student answers correctly → give a harder question
Student answers incorrectly → give an easier question
Repeat until you’ve zeroed in on their level

The test is searching for the difficulty level where the student answers correctly about 50% of the time. That’s their RIT score.
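The steps above can be sketched as a simple "staircase" search. This is an illustration of the idea, not NWEA's actual algorithm, which is not described here; operational adaptive tests draw from a calibrated item bank and use statistical estimation rather than fixed steps. The starting point, step size, item count, and the idealized student who always answers correctly at or below their level are all assumptions.

```python
from typing import Callable


def run_adaptive_test(
    answers_correctly: Callable[[float], bool],
    start: float = 200.0,   # assumption: medium-difficulty starting question
    step: float = 16.0,     # assumption: initial jump in difficulty
    n_items: int = 12,      # assumption: number of questions asked
) -> float:
    """Staircase search: go harder after a correct answer, easier after a miss.

    The step is halved each time the student flips between right and wrong,
    so the questions close in on the student's level.
    """
    difficulty = start
    last_correct = None
    for _ in range(n_items):
        correct = answers_correctly(difficulty)
        if last_correct is not None and correct != last_correct:
            step /= 2  # direction changed: take smaller steps from now on
        difficulty += step if correct else -step
        last_correct = correct
    return difficulty


def make_student(true_level: float) -> Callable[[float], bool]:
    """Idealized student: always right at or below their level, never above."""
    return lambda difficulty: difficulty <= true_level


print(run_adaptive_test(make_student(205)))  # 205.25
print(run_adaptive_test(make_student(150)))  # 150.5
```

A real student is not this consistent. Near their level they answer correctly about half the time, which is why a real test needs many more questions and reports a margin of error.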

This is far more efficient than fixed-form tests. A high-achieving student doesn’t waste time on easy questions. A struggling student isn’t demoralized by impossible ones. The test meets each student where they are.

What the Numbers Mean

A RIT score typically ranges from about 100 to 350, though most K-12 students fall between 140 and 300.

Typical ranges by grade (Math):

Kindergarten: 140-160
2nd grade: 170-190
4th grade: 190-210
6th grade: 210-225
8th grade: 220-240

My daughter scored 205 on the Winter 2025 MAP Math assessment. For a first grader, that’s the 99th percentile nationally—a score more typical of a fourth or fifth grader.

The percentile tells you how she compares to other first graders. The RIT score tells you what level of math she’s actually ready for.

How Precise Is It?

No measurement is perfect. MAP’s Standard Error of Measurement (SEM) is typically ±3 to 3.5 RIT points.

What this means, taking an SEM of 3:

If a student scores 200, their “true” score is probably between 197 and 203 (68% confidence)
With 95% confidence, it’s between 194 and 206

This matters when scores are close to a threshold. A student scoring 197 and a student scoring 203 might actually have the same underlying ability—the difference could be measurement noise.
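The arithmetic behind those ranges can be sketched as follows. It assumes measurement errors are normally distributed, that both students have an SEM of 3, and that their errors are independent. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def score_band(score: float, sem: float, confidence: float) -> tuple[float, float]:
    """Range expected to contain the true score at the given confidence level."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return score - z * sem, score + z * sem


def gap_in_standard_errors(score_a: float, score_b: float, sem: float) -> float:
    """Size of the gap between two scores, in standard errors of the difference.

    Assumption: both scores have the same SEM and independent errors,
    so the difference has a standard error of sem * sqrt(2).
    """
    return abs(score_a - score_b) / (sem * math.sqrt(2))


low68, high68 = score_band(200, sem=3.0, confidence=0.68)
low95, high95 = score_band(200, sem=3.0, confidence=0.95)
print(round(low68), round(high68))  # 197 203
print(round(low95), round(high95))  # 194 206
print(round(gap_in_standard_errors(197, 203, sem=3.0), 2))  # 1.41
```

A gap of about 1.4 standard errors falls short of the roughly 2 usually required to call two scores different with 95% confidence.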

For what it’s worth, MAP’s reliability coefficients are in the 0.90s, which is considered excellent for standardized assessments. The test measures consistently.

What NWEA Says It’s For

NWEA is clear about MAP’s intended uses:

Universal screening: Identify students who need intervention or enrichment
Progress monitoring: Track growth across fall, winter, and spring testing
Program evaluation: Assess whether instructional changes are working
Differentiated instruction: Help teachers tailor instruction to student levels
Goal setting: Establish realistic growth targets

What NWEA Says It’s NOT For

NWEA is also clear about what MAP shouldn’t be used for:

High-stakes personnel decisions: Don’t evaluate teachers based on MAP scores alone
Single-score decisions: Don’t make important choices based on one test administration
Diagnostic purposes: MAP is a monitoring tool, not a diagnostic instrument

This last point is important. MAP tells you where a student is. It doesn’t tell you why they’re there or what specifically they need to learn next.

The Tension

So here’s what we have:

A rigorously designed assessment system built on sound psychometric principles
An equal-interval scale that enables meaningful growth measurement
An adaptive algorithm that efficiently measures performance across grade levels
High reliability coefficients
Clear intended uses and explicit warnings about misuse

This should be exactly what districts need to identify students ready for acceleration.

A student in the 99th percentile—performing years ahead of grade level—would seem to be an obvious candidate for advanced instruction.

And yet.

Teachers distrust MAP scores. Districts hedge with multiple additional measures. A 99th percentile score earns the maximum 7 points on Oak Park’s rubric, but that’s only 15% of the 46 points needed to qualify for acceleration.

If the math is this good, why doesn’t anyone trust it?

Next in the series: The Cracks — The legitimate concerns that drive skepticism about MAP—from ceiling effects that limit precision for high achievers to motivation problems that undermine validity.

This is part of an ongoing series documenting one family’s experience with gifted education acceleration in Oak Park Elementary School District 97.