When 270 researchers spend several years replicating 100 psychology experiments, one expects momentous insights. That only 36 per cent of results could be replicated, and that social psychological research was less reproducible than research in cognitive psychology, is, on the face of it, shocking (“Majority of psychology papers are not reproducible, study discovers”, News, 3 September). But are the findings of the Reproducibility Project: Psychology really that unexpected, and do they mean that we can no longer believe psychology textbooks?
Although these results have made headlines, they should not have been a surprise to research psychologists. In 1962, Jacob Cohen reported in the Journal of Abnormal and Social Psychology that the average statistical power of research in these fields was only 0.48. Several subsequent reviews indicated that the power of this type of research has not increased.
The power of a study, the probability that it will detect an effect that really exists, can be compared to the magnification of a microscope: with too little magnification, we may fail to see things that are there, or believe we see things that are not. When conducting scientific studies, researchers look for “significant” results – in technical terms, for a “p-value” of less than 0.05. If one tried to replicate a study that had a p-value of less than 0.05 and power of 0.50, one would have only a 50 per cent chance of success. So why do psychologists not aim for a maximum power of 1? Power is determined by the study’s sample size, the significance level and the strength of the effect (relationship) studied. In some cases that effect might be statistically strong, as when we study the impact of study hours on grades; in other cases the effect may be weaker, as between class size and grades. Since in novel research the strength of an effect is often unknown, researchers frequently underestimate the large sample size they need to achieve acceptable levels of power.
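To make the arithmetic concrete, the relationship between effect size, sample size and power can be sketched with the standard normal approximation for a two-sided, two-sample test. This is a minimal illustration of the general principle, not the Reproducibility Project’s method; the function name and the numbers are our own:

```python
import math
from statistics import NormalDist

def two_sample_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample test (normal
    approximation): the probability of detecting a true effect of
    standardised size d with n_per_group participants per group."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)    # critical value (about 1.96)
    noncentrality = d * math.sqrt(n_per_group / 2)  # expected z when the effect is real
    return NormalDist().cdf(noncentrality - z_crit)

# A "medium" effect (d = 0.5) with 30 participants per group yields power
# near 0.5, echoing Cohen's 0.48 average; about 64 per group are needed
# to approach the conventional 0.8.
print(round(two_sample_power(0.5, 30), 2))
print(round(two_sample_power(0.5, 64), 2))
```

The sketch shows why a significant original finding tested with power of 0.50 gives a replication only an even chance of success, and why adequate power demands far larger samples than intuition suggests.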
The power of the original studies is not reported in the Reproducibility Project’s new Science article. However, it notes that, for technical reasons, studies in cognitive psychology often have more power than social psychological studies. (In the latter field especially, researchers have to ensure that hypotheses are not obvious.) This would be one explanation of why social psychology fared less well. Another is that whereas the variables investigated by cognitive psychologists are relatively unaffected by cultural norms and other social factors, variables studied by social psychologists are typically influenced by historical change and local context. The arguments used by US researchers to persuade US students in the 1980s would be unlikely to persuade European students, or US students tested 30 years later. Persuasive arguments change with time and context. So another factor in the lower percentage of successful replications for social versus cognitive psychological studies was probably that “exact replications” often failed to capture the same theoretical variables manipulated in the original studies.
Reactions to the Reproducibility Project will remind social psychologists of the replicability crisis of the 1970s. This was triggered by complaints that social psychological knowledge was not cumulative: for every study demonstrating some significant effect, there were replications that were not significant. Thus a reviewer who added up significant and non-significant effects in tests of some theory (known as the “box score method”) often found that non-replications outweighed successful replications. This led to the development of powerful new meta-analytic methods, which statistically combined the results of different experiments.
Even when many individual studies fail to yield significant results, meta-analysis may reveal a significant overall result. This prevents us from concluding that a finding is not reliable when in fact it is. Because non-significant findings typically remain unpublished, meta-analyses that relied exclusively on published findings would be subject to publication bias. To avoid this, meta-analysts attempt to trace and include relevant unpublished studies. Furthermore, sophisticated methods have been developed to identify publication bias and even to correct for it.
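The logic of statistically combining experiments can be sketched with standard inverse-variance (fixed-effect) pooling. The effect sizes and variances below are hypothetical, chosen only to illustrate how a significant original study combined with a non-significant replication can still yield a significant pooled effect:

```python
import math
from statistics import NormalDist

def fixed_effect_meta(effects, variances):
    """Pool effect sizes by inverse-variance weighting (fixed-effect model).
    Returns the pooled effect, its standard error and a two-sided p-value."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    z = pooled / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return pooled, se, p

# Hypothetical numbers: a significant original study (d = 0.60) and a
# weaker, non-significant replication (d = 0.20).
pooled, se, p = fixed_effect_meta([0.60, 0.20], [0.05, 0.04])
_, _, p_replication = fixed_effect_meta([0.20], [0.04])
# The replication alone is non-significant, yet the pooled effect is significant.
print(p_replication > 0.05, p < 0.05)
```

Precisely this asymmetry is why counting significant and non-significant studies separately (the box-score logic) understates the evidence that pooling extracts.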
Although the Open Science Collaboration did not use simple box scores, the statement that only 36 per cent of the findings could be replicated is based on the same logic. But the Open Science Collaboration also conducted a meta-analysis, combining the effect sizes of the original studies with those of the replications to yield an overall effect size. Not surprisingly, the proportion of studies with significant effects increased to 68 per cent. In other words, two-thirds of the results could be replicated when evaluated with a simple meta-analysis based on both original and replication studies.
As meta-analyses published in psychology journals typically combine the results of hundreds of studies, it is hardly surprising that they give a much more positive picture of the replicability of psychological research than the Reproducibility Project does. The conclusions of textbooks should be based not on single studies but on multiple replications and large-scale meta-analyses, so the results of the Open Science Collaboration will not undermine our faith in good psychology textbooks.
Reporting the percentage of successful replications is not very informative. More usefully, the project could have identified aspects of studies that predicted replication failure. But here the report disappoints. Since meta-analysis permits us to evaluate the validity of research without the need to collect new data, one can question whether the meagre results of this project justify the time investment of 270 researchers and thousands of undergraduate research participants.
Wolfgang Stroebe is adjunct professor of social psychology at the University of Groningen in the Netherlands. Miles Hewstone is professor of social psychology and fellow of New College, Oxford.