The sausage-making of test scores

My work regularly involves consuming, exploring, and explaining data — Census numbers, education statistics, social and economic trend-lines. We want people-stories, of course, but we also need evidence-based, data-driven public decision-making.

So it came as an instructive dose of reality to read Brian A. Jacob’s paper, issued last week by the Brookings Institution, on the calculation of student scores on standardized tests. His basic point: A student who gets 30 answers correct on a 50-question test does not necessarily get a score of 60.

“Contrary to popular belief, modern cognitive assessments – including the new Common Core tests – produce test scores based on sophisticated statistical models rather than the simple percent of items a student answers correctly,” writes Jacob, a professor of education policy at the Gerald R. Ford School of Public Policy at the University of Michigan. Unaware of the decisions test designers make for legitimate statistical reasons, public policymakers “frequently misuse test scores,” he says.

The paper carries a provocative title, “Student test scores: How the sausage is made and why you should care.” It has become a cliché to describe the legislative process as sausage-making; it’s a bit unnerving to have it applied to test scores that now drive so much education policy.

In North Carolina, public schools get a letter grade, A through F, based on the overall scores, and the change in scores, of their students. Under the teacher-pay plan enacted by the General Assembly this year, third-grade teachers may qualify for bonus pay based on the improvement of their students’ scores.

Jacob makes an especially pointed observation about teacher evaluations:

“In recent years, many districts have started to use measures of teacher value-added as part of its determination of promotion, tenure, and even compensation. A teacher’s ‘value-added’ is based on how much improvement his or her students make on standardized tests during the school year (sometimes adjusted for various demographic characteristics). A teacher whose students grew by, say, 15 points is considered more effective than a teacher whose students only grew 10 points. However, if the students in these classrooms started from a different baseline, then this type of comparison depends entirely on the scaling of the exam. For example, it might be the case that a teacher who raises the scores of low-achieving students by 10 points has provided the students more than her colleague who manages to raise the scores of higher-achieving students by 15 points.”
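Jacob's point about scaling can be made concrete with a small sketch. The class averages below are hypothetical, and the logit is only one illustrative way to rescale a share-correct score; it is not the scale used by any particular test or by Jacob. The sketch compares the two teachers' gains on the raw scale and on the rescaled one.

```python
import math

def logit(share_correct):
    """Log-odds of a share-correct score (an illustrative rescaling, assumed here)."""
    return math.log(share_correct / (1 - share_correct))

# Hypothetical class averages: (start of year, end of year), as share of items correct.
teacher_a = (0.05, 0.15)  # low-achieving class, 10-point raw gain
teacher_b = (0.50, 0.65)  # higher-achieving class, 15-point raw gain

# Gains measured in raw percentage points.
raw_gain_a = teacher_a[1] - teacher_a[0]
raw_gain_b = teacher_b[1] - teacher_b[0]

# The same gains measured on the log-odds scale.
logit_gain_a = logit(teacher_a[1]) - logit(teacher_a[0])
logit_gain_b = logit(teacher_b[1]) - logit(teacher_b[0])

print(f"raw gains:   A={raw_gain_a:.2f}  B={raw_gain_b:.2f}")
print(f"logit gains: A={logit_gain_a:.2f}  B={logit_gain_b:.2f}")
```

On the raw scale Teacher B looks more effective; on the log-odds scale Teacher A's gain is roughly twice Teacher B's. Neither scale is "the right one" here. The ranking simply depends on which one is chosen.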

Jacob points out that the length of tests has a bearing on scores. Longer tests, he writes, diminish measurement errors. Shorter tests tend to place more students in the top and bottom achievement categories.
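The effect of test length can be illustrated with a rough simulation. Everything in it is an assumption made for illustration: the spread of student ability, the cut-offs for the bottom and top categories, and the idea that each item is answered correctly with a fixed probability. It is a sketch of the general mechanism, not a model of any actual exam.

```python
import random

def extreme_share(n_items, n_students=5000, low_cut=0.4, high_cut=0.7, seed=0):
    """Share of simulated students landing in the bottom or top category."""
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_students):
        # Assumed "true" chance of answering any item correctly, kept within 5%-95%.
        p = min(0.95, max(0.05, rng.gauss(0.55, 0.10)))
        # Observed score: number of items answered correctly out of n_items.
        correct = sum(rng.random() < p for _ in range(n_items))
        share = correct / n_items
        if share < low_cut or share > high_cut:
            extreme += 1
    return extreme / n_students

short_share = extreme_share(10)   # 10-item test
long_share = extreme_share(100)   # 100-item test

print(f"10-item test:  {short_share:.1%} of students in top or bottom category")
print(f"100-item test: {long_share:.1%} of students in top or bottom category")
```

Under these assumptions the students are drawn from the same ability distribution in both runs, yet the shorter test sorts noticeably more of them into the extreme categories, because each observed score carries more chance variation.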

Some test designs, including the influential National Assessment of Educational Progress, incorporate scores adjusted to take into consideration a student’s background. “As a consequence, if a black and white student respond identically to questions on the NAEP assessment,” Jacob writes, “the reported ability for the black student will be lower than for the white student, reflecting the lower average performance of black students in this assessment.” He adds that such weighted averages do not introduce a bias in the overall racial test-score gap, but can skew individuals’ scores.

At a time when testing is both changing and under challenge, Jacob asks, “So, what should the conscientious analyst or policymaker do?” And he responds, “There are no easy answers.”

In North Carolina, as across America, data remain a critical factor in evaluating our national educational enterprise. Tests give teachers information to respond to the needs of their students. Test scores provide insight, and apply pressure, in meeting the challenge of narrowing achievement gaps along lines of race and income. Test scores can help guide more resources to schools that need more, and they make it possible to compare North Carolina with its international competitors.

In the making of education decisions, Jacob writes, “Researchers, policy analysts and the public need to better understand the tradeoffs embedded in the various decisions underlying test scores.”

Indeed, whatever the outcome of the 2016 elections, North Carolina needs to use 2017 as a year of reflection toward a deeper understanding of the dynamics of schooling and what evidence and experience teach. That would mean less legislating on the basis of predispositions, anecdotes, one-dimensional data, and fiat; and more consultation with scholars, teachers, administrators, and parents. We need data well-marinated.


Ferrel Guillory

Ferrel Guillory is a founder and serves on the board of directors of EducationNC.
