EdisWatching

New Study Examines Impacts of Evaluation Reform Across America, Findings Decidedly Unscary

You know that feeling you had when you were a kid and you got a new book? The excited rush to rip it open and start devouring it? Well, I’m that way with educational research. Some folks might say that makes me a “nerd.” Those folks would be right. Today I proudly embrace my nerdiness and present: Little Eddie’s Thursday Research Roundup.

Okay, “roundup” is probably overselling it a little. I actually just want to talk about a single new study on teacher evaluation reform in America. The study, conducted by Matthew Kraft of Brown University and Allison Gilmour of Vanderbilt University, takes a look at the effects of evaluation reform on teacher effectiveness ratings in 19 states across the country. It also digs into the issue a little deeper with surveys and interviews in a large urban school district.

Most of you know that I am a big fan of strong evaluation systems that can meaningfully differentiate teacher performance. We need such systems to support tenure reform and pay-for-performance systems, and to make sure every kid has an effective teacher in his or her classroom. As it turns out, that last point is especially important, as teachers are consistently found to be the most important school-related factor in student achievement. One well-known study found that replacing an ineffective teacher with even an average teacher would result in a $250,000 increase in the combined lifetime income of a single classroom’s students. Do you have any idea how many toys you could buy with $250,000?!

Older evaluation systems based purely on subjective observations famously produced the “Widget Effect,” or the tendency to rate nearly 100 percent of teachers effective year after year despite evidence to the contrary (and common sense). With that problem in mind, the idea has long been that including multiple objective measures of student growth alongside subjective observations would help paint a clearer picture of teacher performance.

The unions and their supporters, on the other hand, have spent the last few years shrieking about how terrible such evaluation systems are and the apocalyptic effects they would have. Teachers will be getting fired willy-nilly, they say, and public education will come crumbling down in a sea of rubble and smoke and colorful backpacks.

After the spectacular death of Senate Bill 105 at the Colorado Capitol this year, it seems like we’re finally going to get around to fully implementing SB 191—the law that reforms educator evaluations and teacher tenure in Colorado. That means we’re finally going to see how this all plays out in practice. For now, this new study provides some interesting perspective on where we’re headed.

Here’s the abstract from the study:

In 2009, TNTP’s The Widget Effect documented the failure of U.S. public education to recognize and act on differences in teacher effectiveness. We revisit these findings by compiling teacher performance ratings across 19 states that have adopted major reforms to their teacher evaluation systems. In a majority of these states, less than 3% of teachers are rated below Proficient. We also find substantial differences in the percentage of teachers rated below Proficient and Highly Effective across states. We present original survey data from an urban district illustrating that evaluators perceive more than three times as many teachers in their schools as below Proficient than they actually rate as such. Interviews with principals reveal several potential explanations for these patterns.

There’s variation between states (most notably in New Mexico), and we should note that not every included state requires the use of multiple measures of student growth. Those exceptions include Colorado, where the only available data come from a 26-district pilot program that does not include objective measures—and that rated 97 percent of participating teachers proficient or higher. So the study’s overarching generalizations should be taken with a grain of salt.

Even with those methodological caveats in mind, though, this new research seems to indicate that new, improved teacher evaluations may not get us to the level of differentiation we need. Places that have moved ahead with tying test scores to evaluations in various forms—Hawaii, Florida, and New York, for instance—have not seen the type of differentiation we reformers hoped for. Check out the graph below for an illustration of what I mean.

From Kraft and Gilmour, “Revisiting the Widget Effect: Teacher Evaluation Reforms and the Distribution of Teacher Effectiveness.”

Aha, shout the unions, reform critics, and skeptics. This proves that this was all just a huge waste of time! I knew we were right!

Not quite. I see three big takeaways from this study:

1. Fear-mongering speculation about the catastrophic effects of evaluation reform is plainly false. We can stop all the silliness about what stronger teacher evaluation systems will supposedly do. There will be no guillotines in the school yards, no crowds of unfairly unemployed teachers begging on street corners, and no implosion of the education universe. If anything, most of these new systems are still skewed far too heavily toward positive results.

2. The new evaluation systems have moved the differentiation needle in a positive direction. They have not achieved the level of differentiation needed for things like tenure reform and pay for performance to function optimally, but they have resulted in a meaningful positive change. The study does indicate that these evaluation systems alone are not enough to get us to the level of performance-based differentiation we’d like. It does not indicate that we ought to move backward or revert to older systems that did an even worse job of differentiating teacher performance.

3. This study is yet another testament to the importance of culture and leadership in successful schools. As usual, the secret sauce in education is people, leadership, and culture. The researchers’ interviews with principals in the spotlighted urban district reveal that things like time constraints, personal discomfort with rating teachers as ineffective, and subjective judgments of “teachers’ potential and motivation” all led to evaluators shying away from assigning less-than-effective ratings. These attitudes and perspectives probably go some way toward explaining why places like Harrison can leverage strong evaluations to such tremendous effect, while other districts fall flat.

So here we stand. Going backward isn’t an option. Weaker evaluation systems will help no one, and particularly not kids. At the same time, it is clear that we’re going to have to find a way to promote real, lasting cultural change in schools and school districts if we truly want to move toward a more effective system. Policy changes can’t accomplish much if the folks responsible for implementing them refuse to do so with any level of fidelity.

I hope that Colorado (and other states) will begin to experience such a cultural shift as we continue down the road of fully implementing and developing better evaluations, and as folks start to realize that the sky will not fall when we hold educators accountable for their crucial role in students’ outcomes. Until then, though, I see this as one more reminder of how critically important it is to have the right people and leaders in place if we want to provide a great education.