September 12, 2014

Scholars Say High-Stakes Tests Deserve a Failing Grade

Studies suggest students and educators are judged by faulty yardsticks

As the father of twin third graders in Austin, Tex., Richard R. Valencia finds his state's devotion to standardized tests alarming. "I'm trying to get my children to develop intellectually," he says. "That's difficult to do when they're spending almost an hour a night on test preparation."

As a professor of education at the University of Texas, Mr. Valencia finds the state's tests downright pernicious. In his view, a system widely praised for improving schools and bolstering the achievement of minority students is, in fact, misguided and discriminatory. "It's the wrong way to reach equity," he says.

Texas is one of at least 27 states that use the results of standardized tests to make so-called high-stakes decisions: to hold students back a grade or withhold their diplomas; or to punish teachers, principals, and schools that perform poorly. During his presidential campaign, George W. Bush boasted about Texas' record of holding students and educators accountable for failure; now he proposes to withhold federal aid from schools that consistently flunk.

Scholars agree with educators and policymakers that tests are useful for tracking children's progress and identifying weaknesses in teaching. But Mr. Valencia and other education researchers have begun describing testing's dark side. Standardized tests, they say, are too limited, too imprecise, and too easily misunderstood to form the basis of crucial decisions about students. And, they say, the high-stakes consequences interfere with good teaching and discriminate against disadvantaged and minority students who need help the most.

High-stakes testing policy "is not based on science. If we launched a spaceship with this lack of knowledge and evaluation, the people responsible would lose their jobs," says Lorrie Shepard, an education professor at the University of Colorado at Boulder and a former president of the American Educational Research Association. "The question is, when kids show improvement on the tests, do they really know what the tests test?"

For one thing, tests are imprecise yardsticks of a student's abilities. Ideally, a child would earn the same score on variations of the same test given on different days. (Psychometricians would say such a test had a perfect reliability coefficient of 1.0.) But that threshold is beyond reach. Students' scores vary from day to day, depending on their health, their mood, or even what they ate for breakfast.

Furthermore, it's difficult to keep exams consistent from year to year. Test designers must constantly refresh the test questions, but the new items are never precisely comparable to the old ones. That's why designers publish the margins of error of their products, expressed as "reliability coefficients" between 0 and 1.

Most standardized tests used to evaluate elementary and secondary students claim a reliability coefficient in the neighborhood of .9, "plenty good for most purposes," says David R. Rogosa, a professor of education at Stanford University and an expert in educational assessment. "But a reliability of .9 ain't all it's cracked up to be."

To explain why, Mr. Rogosa recently did a study of the popular Stanford 9 exam, which tests reading and mathematics skills in the second through 11th grades, and translated its margins of error into everyday terms. The ninth-grade math test, for example, has a reliability of .84. That means, as Mr. Rogosa tells it, that a student whose "true" ability is at the nationwide median has a 70 percent chance of scoring more than five percentile points above or below the median.

Put another way, a student who actually improved 10 percentile points from grade nine to grade 10 has a 26-percent chance of doing worse on the 10th-grade test -- with potentially dire consequences. Falling short even by one point of what the state has established as a passing score -- such as 70 percent on the Texas Assessment of Academic Skills, or TAAS -- could mean flunking a grade or attending summer school.
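The scale of this imprecision can be sketched with a small Monte Carlo simulation under classical test theory, where a reliability coefficient r implies that a fraction (1 - r) of the total score variance is measurement error. The model and the function name below are illustrative assumptions for this article, not Mr. Rogosa's actual calculation, which was based on the Stanford 9's real score distributions:

```python
import math
import random

def pct_far_from_median(reliability, band=5, trials=100_000, seed=1):
    """Estimate how often a student whose true ability sits exactly at the
    nationwide median scores more than `band` percentile points away from
    the 50th percentile. Assumes classical test theory with normal scores:
    observed = true + noise, noise variance = (1 - reliability) of a unit
    total variance."""
    rng = random.Random(seed)
    error_sd = math.sqrt(1.0 - reliability)
    # Standard normal CDF via erf: maps an observed z-score to a percentile.
    cdf = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    far = 0
    for _ in range(trials):
        observed = rng.gauss(0.0, error_sd)  # true score fixed at the median
        percentile = 100.0 * cdf(observed)
        if abs(percentile - 50.0) > band:
            far += 1
    return far / trials

print(f"reliability .84: {pct_far_from_median(0.84):.0%} score >5 points off")
print(f"reliability .90: {pct_far_from_median(0.90):.0%} score >5 points off")
```

Even this toy model lands in the same ballpark as the figures cited above: with a reliability near .9, roughly seven students in ten whose true ability is at the median will score more than five percentile points above or below it on any given sitting.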

Design error isn't the only reason tests are imprecise. They cover just a fraction of everything students are supposed to have learned in class -- not unlike drawing broad conclusions about public opinion from a poll with a tiny sample. "Policymakers assume that a standardized achievement test measures what a school has taught," says W. James Popham, an education consultant retired from the school of education at the University of California at Los Angeles. "In fact, it doesn't."

Mr. Popham has opposed basing high-stakes decisions on such tests for decades. For one thing, he says, the goal of accountability is ostensibly to raise the achievement of all students above some passing mark. But achievement tests have traditionally been used to spread students out along the proverbial bell curve, so designers eliminate questions that "too many" test-takers answer correctly.

The desire to spread scores out, he says, leads designers to use items that are "unsuitable for measuring the quality of instruction." On two popular national achievement tests for elementary students, he estimates, 20 percent of mathematics questions, 40 to 50 percent of reading items, and 70 to 85 percent of language-arts items were better suited to measuring I.Q. or socioeconomic advantages.

One example, from a fourth-grade science test, required test-takers to know that peaches, limes, and pumpkins have seeds and celery stalks do not. Children whose parents can afford to buy fresh produce or make jack-o'-lanterns at Halloween, he says, will have an easier time with that question than children living on food stamps.

A longstanding criticism of standardized testing is that teachers learn to "teach to the test" -- substituting the shallow content of test preparation for more challenging curriculums and more sophisticated skills. Testing advocates counter that teaching to a well-designed test is valuable. Whether students really suffer is difficult to quantify, but evidence suggests that teachers do become better at preparing their classes for tests, which defeats the purpose of giving them.

Robert L. Linn, an education professor at the University of Colorado at Boulder, has identified a "sawtooth effect" in standardized testing. In general, when a state introduces a new version of its achievement test, scores initially drop in comparison to those on the last version, then rise for several years, then level off -- until the test is replaced again. That trend, which on a graph resembles the shape of a saw blade, suggests scores move only because teachers take a few years to hone their test preparations, not because instruction is perpetually improving.

Mr. Linn believes that phenomenon suits the political imperatives behind accountability. "Poor results in the beginning are desirable for policymakers who want to show they have had an effect," he wrote last March in Educational Researcher.

"Based on past experience, policymakers can reasonably expect increases in scores in the first few years of a program with or without real improvement in the broader achievement constructs that tests and assessments are intended to measure," he wrote. "The resulting overly rosy picture that is painted by short-term gains observed in most new testing programs gives the impression of improvement right on schedule for the next election."

Researchers have seen it happen. In the early 1990's, Kentucky's accountability system appeared to drive test scores up. In a 1998 report published by the Rand Corporation, however, Daniel M. Koretz and Sheila L. Barron attributed much of the improvement to "score inflation."

Kentucky students' gains on the state's high-stakes mathematics tests, they wrote, were "implausibly large," while the students' scores on other tests, such as the ACT or the National Assessment of Educational Progress, showed little or no improvement. And students consistently performed better on test questions recycled from previous years than on new items, perhaps because teachers got better at coaching students on test techniques and previous years' exams.

But the problem isn't just with the tests, say critics -- it's the way educators use them. And they point to the "Texas miracle" of rising test scores as a prime offender.

Since 1990, when its Legislature created the TAAS, Texas has been an exemplar of educational accountability. Each year, the state ranks schools and districts by their scores on TAAS tests, which measure achievement in grades three through eight and grade 10, as well as by their dropout rates. Schools get credit both for high performance and for improving scores, but only if that progress is reflected in the average score of each ethnic group. The annual rankings are always front-page news; teachers and principals at low-performing schools can be reassigned or fired.

Policymakers in Texas boast that black and Hispanic students have been gaining on white ones. But critics say that students -- particularly minority students -- are really getting shortchanged.

Linda M. McNeil, a professor of education at Rice University, studied the effect of high-stakes testing on teaching in Houston public schools. "In many urban schools, particularly those whose students are predominantly poor and minority, the TAAS system of testing reduces both the quality of what is taught and the quantity of what is taught because commercial test-prep materials are substituted for the regular curriculum," she writes in Contradictions of Reform: Educational Costs of Standardized Testing (Routledge, 2000).

"Teaching to the test" may be fine if the test is worth teaching to, writes Ms. McNeil, but she observed classes spending hours learning multiple-choice-answer strategies and rote templates for essay questions. She also found that high-performing schools have not "narrowed their curriculum" in this way, because students and teachers there have little to fear from the TAAS.

Because those schools tend to be in white, affluent areas, Ms. McNeil worries that high-stakes testing in Texas has harmed black and Latino students disproportionately. In fact, a number of scholars argue that the TAAS is a violation of civil rights.

In fall 1999, a Mexican-American civil-rights group sued the state of Texas on the grounds that the TAAS disproportionately harmed the educational prospects of minority students, particularly Latinos. On behalf of the plaintiffs, scholars testified that high-stakes testing not only hurt classroom instruction, but encouraged minority students to drop out.

Proponents of the Texas accountability system say it has helped shrink the achievement gap between white and minority students. In 1994, black and Latino 10th graders were about half as likely to pass the TAAS as their white classmates; by 1998, that ratio had increased to two-thirds. At the same time, the dropout rate for minority students appeared to decline.

Walter M. Haney, a professor of education at Boston College, testified that Texas' success in the 10th-grade TAAS is an illusion.

He calculated that minority students were nearly three times more likely to have to repeat ninth grade -- when no TAAS is given -- than were white students. And he found that minority students were more likely to drop out in that grade than in any other.

"I basically found that a substantial proportion of the apparent increase in test scores was actually due to the exclusion of increasing numbers of students, especially black and Hispanic students," he says.

In Texas, he says, there are three ways to exclude poor test-takers: flunk them in ninth grade, classify them as learning disabled, or encourage them to leave and study for the general-equivalency diploma, or G.E.D., instead of pursuing a high-school diploma. Such students are counted as "school leavers," he says, not dropouts.

Angela Valenzuela, an associate professor of education at the University of Texas at Austin, offered the court another explanation for dropouts: Such tests contribute to Mexican-American "alienation" from school. Based on her case study of a heavily Latino high school in the Houston area, she concluded that many Mexican-American and Mexican immigrant students regard high-stakes tests as a barrier to graduation and college. Furthermore, the TAAS makes no attempt to account for limited proficiency in English.

Mr. Valencia, of the University of Texas, testified that holding minority students and schools accountable for low scores is unfair. "There is a robust relationship between segregation and achievement," he says, because minority students' performance is hampered by discrimination in school funding, teacher training, and other inequities.

In the end, the trial judge ruled that the benefits of accountability outweighed the drawbacks of the test's disparate impact on minority students. Criticism of the policy's inequities, however, may accelerate. Gary A. Orfield, a professor of education and social policy at Harvard University, plans to make his Civil Rights Project at Harvard a clearinghouse for research such as Ms. McNeil's and Mr. Haney's -- beginning this summer with an edited volume on "inequality and high-stakes testing" from the Century Foundation.

Some scholars believe that the flaws of high-stakes testing have been exaggerated. Jay P. Greene, an education researcher at the Manhattan Institute who once taught in the department of government at the University of Texas at Austin, describes himself as a former skeptic who testified for the Texas plaintiffs against the TAAS in 1999. "While the naysayers aren't completely off base about TAAS's weaknesses," he writes in the Summer 2000 issue of City Journal, "they're dead wrong about Texas's educational gains."

As independent evidence, he cites Texas students' improved performance on the NAEP, a national test with no high-stakes consequences. And he maintains that the dropout rate for minority students, though comparatively high, has declined throughout the 1990's. Unlike Mr. Haney, who calculated the attrition of high-school students from ninth grade on, Mr. Greene tracked students' fates from the seventh grade. The growing number of students held back in ninth grade, he writes, distorts Mr. Haney's calculations, and reflects a benign policy: to give weak students more time to prepare for the 10th-grade TAAS, which they must pass to graduate.

There are many good reasons not to like accountability tests, Mr. Greene concludes. But it's worth "stomaching the potential drawbacks" if it drives teachers "to make sure that students know how to read, write, and do arithmetic."

To prove it, supporters of such tests are studying their effects district by district. "Do I believe there are schools and districts that are using the Texas accountability system in positive ways? Yes -- I've been to some," says James Scheurich, an associate professor of education at the University of Texas at Austin.

Although he says that making high-stakes decisions based on a single measure "violates everything we know about measurement," he insists that everyone in the accountability debate wants the same thing: equity. "Before the accountability system" in Texas, he says, "the failure of Hispanic students was acceptable." The state's accountability system "has moved us out of a very nasty basement up to a decent main floor."

He and colleagues at the University of Texas at Austin and Texas A&M have conducted research on four Texas school districts with high numbers of low-income and minority students. Those districts, he says, have responded to accountability in the intended way: They raised their expectations for all students; reformed their curriculums; improved their teaching methods; and meted out reasonable consequences for success and failure.

In general, scholars recognize that the momentum for linking accountability to testing is unstoppable. The best that critics hope for is to improve the way it will be done.

At the request of Congress, the National Academy of Sciences in 1999 published a volume of caveats and recommendations on the proper use of tests for high-stakes decisions about students. In July, the American Educational Research Association issued a similar warning about the drawbacks of high-stakes testing. "Decisions that affect students' life chances or educational opportunities should not be made on the basis of test scores alone," the statement reads.

"Assessment is good, but we should use it primarily for diagnostic principles, not high stakes," says Mr. Orfield. Like most of the critics of high-stakes testing, he says he supports the idea of accountability for failure. In fact, he says, he helped pioneer the concept of "reconstitution" in the 1980's, when a failing San Francisco public school was remade with a completely new staff and curriculum. "But we had 17 indicators of success. If you do it the wrong way with the wrong measures, it will not work."


http://chronicle.com Section: Research & Publishing Page: A14
