Student evaluations of teaching are biased and unreliable

Universities should rethink how they use student evaluations of teaching because of their bias towards male instructors, argue Anne Boring, Kellie Ottoboni and Philip B. Stark

October 3, 2018

Many universities rely heavily or exclusively on student evaluations of teaching (SET) for hiring, promoting and firing instructors. After all, who experiences teaching more directly than students? But to what extent do SET measure what universities expect them to measure – teaching effectiveness?

To answer this question, we apply nonparametric permutation tests to data from a natural experiment at a French university (based on an original study by Anne Boring), and a randomised, controlled, blind experiment in the US (based on an original study by Lillian MacNell, Adam Driscoll and Andrea N. Hunt). We confirm and extend the studies’ main conclusion: student evaluations of teaching are strongly associated with the gender of the instructor. Female instructors receive lower scores than male instructors. SET are also significantly correlated with students’ grade expectations: students who expect to get higher grades give higher SET, on average. But SET are not strongly associated with learning outcomes.

Some studies have found little difference between average SET for male and female instructors, but the design of those studies has serious flaws. Not only are they observational studies rather than experiments, they ask the wrong question, namely, “do male and female instructors get similar SET?”. A better question is, “would female instructors get higher SET but for the mere fact that they are women?”. We can answer that question using these unique data sets: yes.

The French data

Since effective teaching should promote student learning, students of more effective instructors should have better learning outcomes on average. Students in different sections of each course, taught by different instructors, take the same final exam, allowing us to compare learning outcomes. We find that SET are, at best, weakly associated with student performance.

Figure 1. Average correlation between SET and final exam score, by subject

Note: p-values are one-sided, since, if SET measured teaching effectiveness, mean SET should be positively associated with mean final exam scores. Correlations are computed for course-level averages of SET and final exam score within years, then averaged across years. *** p<0.01, * p<0.1
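As a rough illustration of the kind of test the note describes, the sketch below computes the within-year correlation between course-level mean SET and mean final exam score, averages it across years, and obtains a one-sided permutation p-value by shuffling exam means within each year. It is not the authors' code: the function names, the years and all the numbers are invented for illustration, and the stratification used in the actual analysis may differ.

```python
import random


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def avg_correlation(by_year):
    """Mean over years of the within-year correlation."""
    return sum(pearson(s, e) for s, e in by_year.values()) / len(by_year)


def perm_test_avg_correlation(by_year, n_perm=5_000, seed=0):
    """One-sided permutation p-value for a positive average correlation.

    by_year maps a year to (mean SET per section, mean exam score per section).
    """
    rng = random.Random(seed)
    observed = avg_correlation(by_year)
    hits = 0
    for _ in range(n_perm):
        shuffled = {}
        for year, (set_means, exam_means) in by_year.items():
            exams = list(exam_means)
            rng.shuffle(exams)  # permute only within the same year
            shuffled[year] = (set_means, exams)
        # One-sided: count permutations at least as positive as observed
        if avg_correlation(shuffled) >= observed:
            hits += 1
    # Adding 1 to both counts keeps the estimated p-value above zero
    return observed, (hits + 1) / (n_perm + 1)


# Invented section-level averages for two hypothetical years
toy_data = {
    2010: ([3.1, 3.4, 3.8, 4.0, 4.2, 3.6], [11.2, 12.5, 10.9, 11.8, 12.1, 11.5]),
    2011: ([3.3, 3.9, 3.5, 4.1, 3.7, 3.0], [12.0, 10.7, 11.9, 11.4, 12.3, 11.1]),
}
r, p = perm_test_avg_correlation(toy_data)
print(f"average correlation = {r:.3f}, one-sided p = {p:.3f}")
```

Shuffling within years only keeps year-to-year differences in grading or rating levels from masquerading as an association.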

On the other hand, SET correlate significantly with instructor gender (male students gave higher SET to male instructors, Figure 2) and with students’ expected grades. This adds evidence to the hypothesis that instead of promoting better teaching, SET contribute to grade inflation. We find no evidence that male teachers are more effective than female teachers. If anything, students of male instructors perform worse on the final exam.

Note: p-values are two-sided. *** p<0.01, ** p<0.05, * p<0.1

Figure 2. Average correlation between SET and gender concordance

The US data

Lillian MacNell, Adam Driscoll and Andrea Hunt collected data from four online sections of a course, two taught by a male instructor and two by a female instructor. Students were assigned randomly to the four sections. The male instructor taught one section using his own identity and switched identities with the female instructor for the other section, and vice versa. 

This lets us see how believing that an instructor is male or female affects SET for the very same instructor. We confirm the original authors’ main finding that students generally rate perceived female instructors lower in several dimensions of teaching.

Even on measures one would expect to be objective, ratings were lower for perceived female instructors. For instance, graded assignments were returned simultaneously in all four sections, but students reported that the perceived female instructor was less prompt in returning assignments. 

Note: The scale is 1-5 points, so a difference of 0.8 is 20% of the full range. p-values are two-sided. *** p<0.01, * p<0.1

Figure 3. Difference in mean ratings and reported instructor gender (male minus female)
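A comparison like the one in Figure 3 can be sketched as a two-sample permutation test: pool the ratings, repeatedly reassign them at random to the two perceived-gender groups, and see how often the shuffled difference in means is at least as large in magnitude as the observed one. This is a minimal sketch under stated assumptions, not the authors' analysis: the function name and the ratings are made up, and the real experiment's randomisation scheme would dictate which reassignments are allowed.

```python
import random
from statistics import mean


def perm_test_mean_diff(group_a, group_b, n_perm=5_000, seed=0):
    """Two-sided permutation p-value for mean(group_a) - mean(group_b)."""
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabelling of the pooled ratings
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        # Two-sided: compare magnitudes of the differences
        if abs(diff) >= abs(observed):
            hits += 1
    # Adding 1 to both counts keeps the estimated p-value above zero
    return observed, (hits + 1) / (n_perm + 1)


# Invented 1-5 ratings for one item, grouped by perceived instructor gender
perceived_male = [5, 4, 4, 5, 3, 4, 5, 4, 4, 5]
perceived_female = [4, 3, 4, 3, 3, 4, 2, 4, 3, 4]

diff, p_value = perm_test_mean_diff(perceived_male, perceived_female)
# A 1-5 scale spans 4 points, so the difference is divided by 4
share_of_range = diff / (5 - 1)
print(f"male minus female = {diff:.2f} ({share_of_range:.0%} of the range), p = {p_value:.3f}")
```

The division by 4 mirrors the figure note: on a 1-5 scale, a difference of 0.8 is 0.8 / 4 = 20% of the full range.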

In both the French and US data, male instructors got higher SET, but in the US data, female students tended to give higher scores to perceived male instructors, whereas in the French data, male students tended to give higher scores to male instructors.

Figure 4. Difference in mean SET by student gender, for perceived and actual instructor gender (male minus female)

Researchers in another study, conducted in the Netherlands, are finding that female instructors receive lower scores because male students rate them lower.

Differences among these studies could be cultural or related to topic, class size, mode of instruction (online versus face-to-face), ethnicity, race, physical attractiveness, or other confounding variables that have been found to affect SET. Clearly, there can be no simple adjustment for the bias.

The French data show that bias varies by course subject, further complicating any attempt to correct for these biases. The only field in which male students do not rate male instructors significantly higher is sociology. This is especially interesting because sociology is the only field in which there was near gender balance among instructors (46.4 per cent female instructors). This could suggest that gender balance in a field affects gender stereotypes and might reduce bias against female instructors.

Why don’t universities use better methods? SET are the familiar devil. Habits are hard to change. Alternatives (reviewing teaching materials, peer observation, surveying past students, and others) are more expensive and time-consuming, and this cost falls on faculty and administrators rather than on students. 

The mere fact that SET are numerical gives them an unearned air of scientific precision and reliability. And reducing the complexity of teaching to a single (albeit meaningless) number makes it possible to compare teachers. This might seem useful to administrators, but it is a gross oversimplification of teaching quality.

Evidence of any connection between SET and teaching effectiveness is murky, whereas the associations between SET and grade expectations and between SET and instructor gender are clear and significant. Because SET are evidently biased against women (and likely against other underrepresented and protected groups) and, worse, do not reliably measure teaching effectiveness, the onus should be on universities either to abandon SET for employment decisions or to prove that their reliance on SET does not have disparate impact.

This blog post is based on a ScienceOpen preprint: Student Evaluations of Teaching (Mostly) Do Not Measure Teaching Effectiveness

Anne Boring is a research fellow at Sciences Po and a research affiliate at Paris Dauphine University. Kellie Ottoboni is a PhD student in the statistics department at the University of California, Berkeley and a fellow at the Berkeley Institute for Data Science. Philip B. Stark is professor of statistics and associate dean of mathematical and physical sciences at the University of California, Berkeley. 
