What the replication of experiments does and doesn’t achieve

– by Stuart West & Max Burton-Chellew

Replication, or the repeating of experiments, is a key part of scientific methodology. If a result can be repeated, you can trust it more: it shows that the result was not just due to some unnoticed bias or a particular setup. This helps to filter out false positives and expose questionable research practices. That is great.

Suppose, however, that you have developed a hypothesis to explain a result. In that case, repeating the same experiment might not increase your confidence in that hypothesis. If your aim is hypothesis testing, then it can be more useful to alter the experiment, so that you can test between competing hypotheses or make a strong attempt to falsify a hypothesis.

Our paper was about what happens if these two points get confused. This could happen anywhere, but we illustrate the potential problem with the literature on public goods games.

The public goods game is an economic experiment designed to investigate the tension between an individual’s selfish interest and the interest of the group. Individuals can contribute money to a group fund. The group fund is multiplied by a factor and then divided equally among the players. Crucially, the size of the multiplier means that each individual gets back less from their own contribution than they put in. Consequently, while everyone would do best if everyone contributed maximally, individuals maximise their personal income by not contributing.
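To make the payoff structure concrete, here is a minimal sketch of the payoff calculation. The group size, multiplier, and endowment (four players, a multiplier of 1.6, an endowment of 20) are illustrative assumptions, not parameters from any particular study; what matters is that the multiplier is greater than 1 but smaller than the group size, so each unit contributed returns less than a unit to the contributor.

```python
def payoff(contributions, endowment=20, multiplier=1.6):
    """Payoff to each player in a one-shot public goods game.

    Each player keeps (endowment - contribution) and receives an equal
    share of the multiplied group fund. Parameters are illustrative.
    """
    n = len(contributions)
    group_fund = multiplier * sum(contributions)
    share = group_fund / n
    return [endowment - c + share for c in contributions]

# If everyone contributes fully, everyone earns more than the endowment...
print(payoff([20, 20, 20, 20]))   # [32.0, 32.0, 32.0, 32.0]
# ...but a lone free-rider earns more still, at the contributors' expense.
print(payoff([0, 20, 20, 20]))    # [44.0, 24.0, 24.0, 24.0]
```

Because each contributed unit returns only multiplier/n (here 0.4) to the contributor, contributing nothing maximises individual earnings whatever the others do, even though full contribution maximises the group total.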

But when people play the public goods game, they do contribute. On average, individuals contribute a significant amount. If they play the game multiple times they contribute less, but they keep contributing. This is an incredibly robust result: the experiment has been repeated more than 100 times. The common conclusion is that, because individuals contribute more than 0%, humans have ‘prosocial preferences’ which make them cooperate even when it is not in their best interest.

Can you see a potential problem with this conclusion? Individuals only appear to be maximising their personal gain if they contribute nothing (0%). Any other amount (>0%) can be argued to reflect prosocial preferences. But it is possible to come up with a huge number of alternative hypotheses for why individuals might contribute >0% (see figure above). ‘Prosocial preferences’ is a hypothesis, not an unavoidable conclusion. Alternative hypotheses include, but are not limited to:

  • Individuals are prone to playing as if it were the real world, where there would be repeated interactions, and hence the opportunity for reciprocal helping.
  • Individuals might consider extreme strategies, such as contributing 0% or 100%, to be risky.
  • Individuals might start out a bit confused and want to use the game to learn how best to maximise their gain.
  • Individuals might want to explore the options available.

So, while repeating the basic public goods game experiment was great for establishing the robustness of the initial result, it doesn’t help distinguish between competing hypotheses. To do that, you need to adjust your experiment. You could play the game in a way that removes the potential concern for others. For example, by making individuals play with computers, or by not letting them know about the game they are playing (contributions go into a ‘black box’). Or you could change the game so that the best selfish option is to contribute >0%. Or you could analyse individual behaviour and see what makes individuals alter their contribution. And so on.

Another way of thinking about this issue is that we need to get to the right answer by as rigorous a scientific method as possible. This can involve actively developing different controls and null hypotheses. We need to work hard to try to falsify hypotheses and test competing hypotheses.

That is our main point, and you can stop here. If you want to know the gory details from public goods games, how different controls can be developed, and how to test competing hypotheses, then go and read our paper. You might also be wondering what would happen if this approach overturned the accepted conclusions. No spoilers, but there is an obvious comparison to a tale by Hans Christian Andersen: The Emperor’s New Clothes.

Stuart A. West & Maxwell N. Burton-Chellew. (2026). Replication of experiments and the canonisation of incorrect conclusions. Evolution & Human Behavior, 46, 106749.