Researchers Rate Whole-School Reform Models

Only three of 24 popular school reform models have strong evidence that they improve student achievement, according to a report released last week that provides the most comprehensive rating of such programs by an independent research group.

Direct Instruction, High Schools That Work, and Success for All received the best marks from “An Educators’ Guide to Schoolwide Reform,” which was released at a press conference in Washington.

The 141-page guide from the Washington-based American Institutes for Research was commissioned by five leading education groups.

The consumer-oriented guide rates 24 whole-school reform models according to whether they improve achievement in such measurable ways as higher test scores and attendance rates. It also evaluates the assistance provided by the developers to schools that adopt their strategies, and compares the first-year costs of such programs.

“We wanted to have a document that really, critically evaluated the evidence base underpinning these programs,” said Marcella R. Dianda, a senior program associate at the National Education Association, which helped underwrite the $90,000 study. “We felt that our members really wanted that. They wanted us to get to the bottom line.”

The study comes as districts around the country seek proven, reliable solutions to the problem of low-performing schools. But as they spend greater amounts of tax dollars on the various reform models, questions remain about how well the programs work. Experts say that research such as the AIR report is needed to fill the gaps.

About 8,300 schools nationwide were using one of the 24 designs rated in the study as of Oct. 30, the report says. Congress gave a major impetus to such “whole school” reforms in 1997, when it authorized nearly $150 million in federal grants for low-performing schools to adopt “research-based, schoolwide” efforts. (“Who’s In, Who’s Out,” Jan. 20, 1999.)

Yet, according to the report, “most of the prose describing these approaches remains uncomfortably silent about their effectiveness.” That leaves schools in the tough position of deciding which model to choose with little evidence to go on.

“Before this guide came along, about the only way educators could judge the worth of some of these programs was by the quality of the developers’ advertising and the firmness of their handshakes,” said Paul D. Houston, the executive director of the American Association of School Administrators. “Now, superintendents, principals, and classroom teachers can sit down together and make reasonable decisions about which are best for their district’s needs.”

The study was sponsored by the NEA, the AASA, the American Federation of Teachers, the National Association of Elementary School Principals, and the National Association of Secondary School Principals.

Ratings Questioned

While the report is a big step forward in helping schools sort out the value of such programs, it also underscores how hard it is to judge effectiveness in education.

Last week, several of the organizations behind reform models evaluated in the report contested its ratings. In particular, developers questioned how AIR decided which studies to include as evidence of a program’s effectiveness. Several developers maintained that they have more evidence of positive results than AIR gave them credit for.

Henry M. Levin, a Stanford University economist and scholar whose Accelerated Schools program received only a “marginal” rating, described the study as “fairly amateurish.”

“Basically, they discounted anything, as far as I can tell, that comes in and changes test scores over time for a particular school,” Mr. Levin said. “And anything that said it had a comparison group was given a gold standard.”

The guide reviews all 17 whole-school models that were originally identified in the 1997 federal legislation that created the $150 million Comprehensive School Reform Demonstration Program. It also rates seven other prominent or widely used programs that schools could potentially adopt when seeking Obey-Porter grants, as the federal program is commonly known.

The evaluators used a two-step process to rate whether the programs had evidence that they raised student achievement.

First, AIR gathered almost any document about a program that reported student outcomes, including articles in scholarly journals, unpublished case studies and reports, and changes in raw test scores reported by the developers. “We tried to cast a really wide net in collecting the research,” said Rebecca Herman, the project director.

More than 130 studies were then reviewed and rated for their methodological rigor in 10 categories, based on such criteria as the quality and objectivity of the measurement instruments used, the period of time over which the data were collected, the use of comparison or control groups, and the number of students and schools included. Each study was assigned a final methodology rating by averaging across the 10 categories.

Only studies that met AIR’s criteria for rigor were used to rate whether a program was effective in raising student achievement.

For example, a number of developers submitted changes in state or local test scores as evidence that their programs were working. But “we didn’t really consider test scores alone, without some sort of context,” Ms. Herman said, “because there are a lot of things that can explain changes in test scores.”

Leaping to Conclusions?

The study gave a “strong” rating to the programs with the most conclusive research backing, notably four or more studies that used rigorous methodology and found improved achievement. In at least three studies, the gains had to be statistically significant.

A “promising” rating went to models with three or more rigorous studies that showed some evidence of success.

Reform models that earned a “marginal” rating had fewer rigorous studies with positive findings, or a higher proportion of studies showing negative or no effects. A “mixed or weak” label was assigned to programs with study findings that were ambiguous or negative. And AIR gave a “no research” rating to programs for which there were no methodologically rigorous studies.

Eight of the programs received the “no research” rating. Ms. Herman said that was not surprising, given the newness of many of the models.

“It takes a good three years to implement a reform model across a school, and another two years to come up with a decent study,” she said. “What we’re looking at is the first wave of research, and we’re hoping for an ocean to follow it.”

Janith Jordan, the vice president of Audrey Cohen College in New York City, whose design received a “no research” rating, said that “because of the fact that we are a younger design team, to leap to a conclusion about our potential or our effectiveness really is premature.”

More Research Needed

More than anything, experts said last week, the study underscores the need for strong, third-party evaluations of schoolwide reform models. Several other efforts are now completed or in the works.

“The fact is that the capacity to do this kind of research is very limited in this country,” said Marc S. Tucker, a founder of America’s Choice, one of the 24 models reviewed. “I believe that it’s very important for the federal government to put a fair amount of money on the table to make this kind of research possible.”

Ellen Condliffe Lagemann, the president of the National Academy of Education, a group of education researchers and scholars, agreed. “It’s amazing how little evaluation there is,” she said. “Since the early 20th century, the people who have peddled the educational reform strategies that we all hear about tend to be successful because they’re the best entrepreneurs. It doesn’t necessarily have to do with any research credibility.”

AIR rated the support that developers provide to schools based on the variety of help available; the frequency of on-site technical assistance; the number of years the support is given; and the tools schools receive to help monitor their own implementation.

To prepare the tables and a profile for each program, AIR interviewed the developers, gathered and reviewed all available studies, and collected additional information from schools that used the approach.

Lynn Olson

Lynn Olson was managing editor of special projects for Education Week. She also covered national policy (including P-16 issues, NCLB standards, accountability, and reform), assessment, and testing.

A version of this article appeared in the February 17, 1999 edition of Education Week as Researchers Rate Whole-School Reform Models