Russian Dolls and Education Policy: New Study Looks More Closely at Teacher Evaluations

Ever heard of a matryoshka doll? You may not have heard the name before, but I bet you’ll recognize the concept. You start with a big doll, break it open, and discover a smaller doll inside. That doll contains a still smaller doll, and inside that one is an even smaller one. You’ve got to dig down through an awful lot of layers before you reach the center. (Do you feel the education policy analogy coming on?)

Teacher evaluation is like the center of many education policy matryoshka dolls. In particular, strategic compensation and tenure policies are heavily dependent on the reliability and validity of the teacher evaluations being used. That realization raises some big questions regarding evaluation, some of which I’ve written about before.

As it turns out, even “evaluation” may be too big a doll. A new study by Matthew Chingos, Russ Whitehurst, and Katharine Lindquist argues that the area of greatest concern is more specific still: the portion of evaluations based on classroom observation. As Chingos puts it in a related article:

“Based on the furor that this requirement has elicited from teachers’ unions, one might assume that students’ test scores feature prominently in these new evaluation systems. That is hardly the case. In our new study, published today in Education Next, my colleagues and I found that only 22 percent of teachers were evaluated based on test score gains in the four urban school districts we studied. This is largely because most teachers lead classrooms that are outside the grades and subjects subject to standardized tests … Most of the action and nearly all the opportunities for improving teacher evaluations lie in the area of classroom observations.”

More concerning is the fact that the study “discovered … bias in the classroom observation scores due to student ability.” In other words, teachers tended to fare better on observational assessments when they got a better crop of incoming students to work with.

And although observation scores were generally more stable from year to year than value-added measures, the authors note that this could be due to preconceived notions on the part of the observers:

“If a principal is positively disposed toward a particular teacher because of prior knowledge, the teacher may receive a higher observation score than the teacher would have received if the principal were unfamiliar with her or had a prior negative disposition. If the administrator’s impression of individual teachers is relatively sticky from year to year, then it will be less reflective of true teacher performance as observed at a particular point of time. For this reason, maximizing stability may not increase the effectiveness of the evaluation system.”

To help address these concerns, Chingos and his colleagues suggest that teachers be observed three times per year, with one of those observations conducted by a trained observer from outside the school. They also recommend that observation scores be statistically adjusted for the same variables accounted for in value-added calculations.

I remain a firm believer that ensuring teacher quality in every subject is essential to student success. Evaluations are and will always be a necessary and important part of that endeavor. Still, it’s our responsibility to make sure all components of those evaluations are reasonably fair, valid, and reliable.

With that in mind, I welcome further discussion on the issue.

Just don’t tell my friends I play with dolls. I have an image to maintain.