Researchers Debate Merits of ‘Value Added’ Measures

Much of the controversy about using value-added assessments to measure the effectiveness of schools and teachers involves the strengths and weaknesses of existing models for tracking individual students’ growth.

“We know that every model currently in use contains some amount of statistical bias, and these imperfections are not well understood,” Daniel Fallon, the chairman of the education division at the Carnegie Corporation of New York, said during a meeting on value-added assessment in Columbus, Ohio, last month.

Even though value-added methods should not be used in isolation, Mr. Fallon argued, “we should not let the desire to be absolutely right interfere with the progress we can make with this promising new technology.”

At present, there are at least four or five different ways of computing value-added results, ranging from relatively simple models to extremely complex methods that require extensive computing power and expertise.
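To make the simple end of that range concrete, here is a minimal sketch of a gain-score calculation, assuming each student has one prior and one current score on a comparable scale. The function name, the record layout, and the sample scores are hypothetical and chosen only for illustration; this is not the Tennessee model or any other system named in this article.

```python
from collections import defaultdict
from statistics import mean


def gain_score_value_added(records):
    """Return each teacher's mean student gain minus the overall mean gain.

    records: iterable of (teacher, prior_score, current_score) tuples.
    This layout is an assumption made for the sketch.
    """
    gains_by_teacher = defaultdict(list)
    for teacher, prior, current in records:
        gains_by_teacher[teacher].append(current - prior)

    # Pool every student's gain to get the reference point.
    all_gains = [g for gains in gains_by_teacher.values() for g in gains]
    overall_mean_gain = mean(all_gains)

    return {
        teacher: mean(gains) - overall_mean_gain
        for teacher, gains in gains_by_teacher.items()
    }


# Hypothetical scores, invented for illustration only.
records = [
    ("Teacher A", 40, 52),
    ("Teacher A", 70, 80),
    ("Teacher B", 45, 50),
    ("Teacher B", 65, 72),
]
value_added = gain_score_value_added(records)
print(value_added)
```

A calculation this simple ignores measurement error, student background, and multiple years of history, which is where the more complex methods come in.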

One of the biggest issues is whether such models should control for student or school characteristics, such as race or poverty, that can influence the rate at which children learn. Some argue that unless the analyses account for such differences, they will not accurately isolate the contributions of teachers or schools to student learning.

The Tennessee Value-Added Assessment System, developed by researcher William L. Sanders, does not explicitly control for such background characteristics and has been criticized by some researchers.

But Mr. Sanders, who now manages the value-added assessment and research center at the Cary, N.C.-based SAS Institute, which provides such analyses to more than 400 districts nationwide, maintains that his model does not need to control for individual student demographics. That’s because it simultaneously analyzes each student’s previous achievement on tests in multiple subjects and grades across multiple years to predict his or her future performance. Those relationships essentially capture the influence of background characteristics, he said.

Mr. Sanders acknowledges, though, that debating whether to control for demographic characteristics at the school level is a “legitimate academic argument.”

Those who favor such controls contend that high concentrations of poor, minority, or low-achieving students in a school may influence the rate at which individual children learn.

The problem, notes Dale Ballou, a professor of education at Vanderbilt University in Nashville, Tenn., is that students are not assigned to teachers and schools at random. If disadvantaged students are systematically assigned to less effective schools and teachers—largely as a byproduct of where they live—then controlling for socioeconomic status in the models can mask genuine differences in school and teacher quality.

Researcher Daniel F. McCaffrey and his colleagues at the Santa Monica-based RAND Corp. concluded that the Tennessee model is relatively robust as long as students are well mixed across schools within a district. But if schools are highly stratified by race and class, then such factors may need to be taken into account when estimating growth. Just how integrated a school system needs to be before such controls are necessary, however, remains unclear.

Missing Data

Another concern is how the models account for missing student data—for example, when students are absent on a testing day one year. Lots of data are missing from many state and district databases, particularly in urban areas with highly mobile populations.

A number of the models treat missing data as random, essentially assuming that such students would make the same average amount of growth as their peers, given average teachers and schools.

Work conducted by John Stevens, a professor of education at the University of New Mexico, and his colleagues, using data in that state, suggests the assumption is probably erroneous.

“In our work,” said Mr. Stevens, speaking at a recent conference at the University of Maryland, “we find that the data are not missing at random.”

An analysis by Mr. Stevens and his colleagues of four years of data for New Mexico students in grades 6-9 on the TerraNova mathematics tests found complete records for 75 percent of them. Students with missing information tended to be poorer, more likely to speak limited English, and more likely to come from certain racial or ethnic groups.

“If you don’t appropriately deal with that,” cautioned Mr. Sanders, “you’re going to bias the heck out of the results.”
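A small sketch can show why that matters. It assumes a hypothetical group of eight students whose true gains are known and in which the students with missing scores tend to have lower gains. All numbers are invented for illustration and do not come from the New Mexico analysis.

```python
from statistics import mean

# (true_gain, observed) pairs. Hypothetical values: the last three
# students have low true gains and no recorded score.
students = [
    (10, True), (12, True), (11, True), (9, True),
    (4, True), (5, False), (3, False), (6, False),
]

# The growth we would see if every student had been tested.
true_mean = mean([gain for gain, _ in students])

# Average over only the students who have complete records.
observed_mean = mean([gain for gain, seen in students if seen])

# Treating missing data as random amounts to assuming the untested
# students grew as much as their tested peers did on average.
imputed = [gain if seen else observed_mean for gain, seen in students]
imputed_mean = mean(imputed)

print(f"true mean gain:     {true_mean:.2f}")
print(f"observed mean gain: {observed_mean:.2f}")
print(f"imputed mean gain:  {imputed_mean:.2f}")
```

In this made-up example, filling in the average leaves the estimate unchanged from the observed mean, so the overstatement of growth stays in the result.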

Defining Growth

Perhaps the biggest concern is whether existing state tests can accurately measure growth across grades. Critics such as William H. Schmidt, a professor of education at Michigan State University in East Lansing, argue that the math skills and knowledge tested in grade 3, for example, are significantly different from the content tested in grade 10.

“That really hits very hard at the very heart of the value-added research,” he said at the University of Maryland conference.

Testing experts try to link exams across grades through a process known as “vertical equating,” in which they include some items in common across tests given in consecutive grades so they can score the results on a similar scale.

But the quality of vertical equating is widely debated. Even if done well, many question whether the technique can be accurately used to draw inferences about students’ growth across more than two grade levels. Others assert that value-added methods can be done without using vertically equated tests at all.

At a minimum, experts suggest, states need far better articulation of content and performance standards across grades and subjects than many have at present, and far tighter alignment between standards and tests to increase their usefulness for value-added analyses.

States also need to ensure that their tests have enough “stretch” to measure accurately the growth of students at the top and bottom of the achievement spectrum.

Right now, many tests concentrate most items around the proficiency bar and are less sensitive to movement within different achievement levels.
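The effect of limited stretch can be sketched under a strong simplifying assumption: that a test reports nothing below a floor or above a ceiling. Real tests lose precision gradually rather than clipping scores outright, and the floor, the ceiling, and the student scores below are hypothetical.

```python
def measured_score(true_score, floor=20, ceiling=80):
    """Clip a true score to the range the test can report (assumed hard limits)."""
    return max(floor, min(ceiling, true_score))


def measured_gain(before, after, floor=20, ceiling=80):
    """Gain as the test would report it, after clipping both scores."""
    return (measured_score(after, floor, ceiling)
            - measured_score(before, floor, ceiling))


# Three hypothetical students who each truly gain 13 points.
students = {
    "low achiever": (5, 18),
    "mid achiever": (45, 58),
    "high achiever": (78, 91),
}
gains = {name: measured_gain(before, after)
         for name, (before, after) in students.items()}
print(gains)
```

Only the student in the middle of the scale shows the full gain. The same growth at either end is partly or wholly invisible to the test.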

One of the most important things about growth models, said Dennie Palmer Wolf, the director of opportunity and accountability initiatives at the Annenberg Institute for School Reform, based in Providence, R.I., is that they change the public discourse.

Although labeling a student’s performance as “below basic” on a state reading test may reinforce expectations that the designation is the best he or she can do, Ms. Wolf noted, “the emphasis becomes one of movement” with growth models.

Though the technical issues can’t be minimized, Ms. Wolf argued during a recent meeting in New Hampshire, organized by the Dover, N.H.-based Center for Assessment, “there may be pieces we can bite off to make progress.”

Lynn Olson

Lynn Olson was managing editor of special projects for Education Week. She also covered national policy (including P-16 issues, NCLB standards, accountability, and reform), assessment, and testing.

A version of this article appeared in the November 17, 2004 edition of Education Week as Researchers Debate Merits of ‘Value Added’ Measures
