You may have heard of an issue in scientific research known as the reproducibility problem: when scientists try to reproduce the results of some published works, they are not always able to arrive at the same result.
In some cases, a careful review reveals serious issues with the quality of the analysis. In other cases, the issues are due to more subtle biases, such as confirmation bias, where a researcher has some preconceived notion of what the results should be, and this influences the analysis towards producing those expected results.
However, what if a careful review of someone's analysis does not reveal any obvious issues, but another, independent analyst still fails to reproduce the original findings?
Many Analysts, One Dataset
In a 2018 academic paper entitled Many Analysts, One Data Set: Making Transparent How Variations in Analytic Choices Affect Results, led by R. Silberzahn, E. L. Uhlmann, and D. P. Martin, researchers set out to show how different, experienced teams can be given the exact same dataset, and the same hypothesis to test, and still produce different results.
In this case, they posed the question of whether soccer referees are more likely to hand out red cards to players with darker skin. Of the 29 independent teams who participated, 20 found a statistically significant effect that, yes, referees were more likely to give red cards to players with darker skin, versus 9 teams that found no significant relationship.
How Did Experienced Analysts Arrive at Different Conclusions?
As part of this study, the researchers investigated different factors that one might expect could affect the results. These include surveying the different participants in the study on their prior beliefs about the question at hand (of referees potentially having some racial bias) and looking at how the level of expertise of the participants affected the results. Furthermore, they gave participants the chance to review their peers' analyses, in order to rank the quality of the analyses of the 29 different teams. None of these factors could explain the variation in the results.
So, how did the analyses of the 29 teams differ?
First of all, there was a large variety in the algorithms used, from Spearman correlation to "multilevel binomial logistic regression using Bayesian inference".
Second, the teams differed in which variables from the dataset they chose to include in their analyses: each team used between 1 and 7 covariates, out of a list of 14 potential covariates.
Another difference was whether the teams removed outliers and, if so, how they set the threshold for outlier removal.
In all these cases, there are subjective decisions to make that could affect the outcome, starting right from the choice of algorithm. In machine learning, for example, there are often multiple algorithms that one can apply to a problem, and there is not necessarily one absolute correct answer as to which to use. The same can be said for which features to include and how to remove outliers.
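To make this concrete, here is a minimal sketch of how the choice of covariates alone can shift an estimated effect. Note that this uses synthetic data and a plain ordinary-least-squares fit in NumPy, not the paper's actual dataset or models; the variable names are purely illustrative.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 500
# Hypothetical data: one predictor of interest plus three candidate covariates.
z = rng.normal(size=(n, 3))                  # candidate covariates
x = 0.5 * z[:, 0] + rng.normal(size=n)       # predictor, correlated with one covariate
y = 0.2 * x + 0.4 * z[:, 0] + rng.normal(size=n)  # outcome

estimates = {}
for k in range(4):                           # control for 0, 1, 2, or 3 covariates
    for cols in itertools.combinations(range(3), k):
        X = np.column_stack([np.ones(n), x] + [z[:, c] for c in cols])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        estimates[cols] = beta[1]            # coefficient on the predictor of interest

# Same data, same question -- but the estimated effect of x depends on the covariate set.
print(sorted(round(v, 3) for v in estimates.values()))
```

Because the predictor is correlated with one of the covariates, specifications that omit it suffer from omitted-variable bias and report a larger effect than specifications that include it. All eight covariate sets are defensible choices, yet they do not agree on the size of the effect.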
Takeaways for Data Scientists
So, what is the takeaway here? I think the first takeaway is that we should not immediately jump to conclusions when one person or team's results cannot be exactly reproduced. Two different people or teams may make different subjective choices and assumptions throughout their analysis, all of which are completely valid and defensible in an objective way, yet lead to slightly different outcomes.
The second takeaway is that, at least in large organizations with multiple data teams, it may be worth having several teams work in parallel on the same business question when the resulting decision could have large consequences for the company.
For the example in Many Analysts, One Dataset, two-thirds of the teams found a significant effect. Put another way, if this problem had been assigned to just one analyst or team, there would have been a 2 in 3 chance of getting a significant positive result. Therefore, by chance, it is fairly likely that an experienced analyst, or team of analysts, could have ended up with either a positive or a null result.
In the case of racial bias in sports, a 2010 paper showed racial bias in NBA foul calls, which resulted in a lot of publicity and controversy. Hence, when working on sensitive issues like this one, it is important to make sure there are no issues with the analysis before publicizing the results.
However, what if assigning a problem to multiple analysts is not possible? The authors of Many Analysts, One Dataset suggest using a specification curve. In this approach, the solitary analyst works out a range of different, reasonable approaches to the problem and runs each version of the analysis. They can then look at how many versions of the analysis gave a significant result, and use that to inform the business decision.
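As a rough illustration of the idea, here is a sketch of a tiny "garden of forking paths" run on made-up data, with simple correlation tests standing in for a real analysis. Each specification is one reasonable combination of correlation method and outlier-removal threshold, and at the end we count how many specifications reached significance:

```python
import itertools
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 300
x = rng.normal(size=n)                 # hypothetical predictor
y = 0.15 * x + rng.normal(size=n)      # outcome with a weak true effect

# Each spec: one correlation method x one outlier-removal rule (None = keep all).
methods = {"pearson": stats.pearsonr, "spearman": stats.spearmanr}
outlier_sds = [None, 2.5, 3.0]

results = []
for (name, corr), sd in itertools.product(methods.items(), outlier_sds):
    keep = np.ones(n, dtype=bool)
    if sd is not None:
        keep = np.abs(stats.zscore(y)) < sd   # drop outcome outliers beyond sd sigmas
    r, p = corr(x[keep], y[keep])
    results.append((name, sd, r, p))

# The "curve": what fraction of reasonable specifications came out significant?
n_sig = sum(p < 0.05 for *_, p in results)
print(f"{n_sig} of {len(results)} specifications significant at p < 0.05")
```

A real specification curve (in the sense of Simonsohn et al., cited below) would enumerate far more specifications and plot the estimates sorted by size, but the core move is the same: rather than trusting one arbitrary path through the analysis, look at how the conclusion holds up across all the defensible ones.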
Hopefully, this summary of Many Analysts, One Dataset gives you a flavor for how different, experienced data scientists can make different, but all reasonable, decisions during their analysis that could lead to different outcomes. Even if the crowdsourcing option is not practical or feasible, a data scientist still has the option of running a particular analysis in at least a few different ways, making different, but defensible, decisions each time, and then seeing how many versions of the analysis point towards the same outcome.
As many data scientists, including myself, are using Python, I will occasionally include some tips and tricks or interesting packages to share.
Enjoy the rest of your week!
References and Notes
 R. Silberzahn, E. L. Uhlmann, D. P. Martin, et al., Many Analysts, One Data Set: Making Transparent How Variations in Analytic Choices Affect Results (2018), Advances in Methods and Practices in Psychological Science
 U. Simonsohn, J. P. Simmons, and L. D. Nelson, Specification Curve: Descriptive and Inferential Statistics on All Reasonable Specifications (2019), SSRN
 For reference, a follow-up academic paper has been published, similarly testing how subjective, but reasonable, assumptions and decisions made during data analysis can affect the results.