Reproducibility#

Related terms: Replicability, Repeatability.

In brief#

Reproducibility is the ability of independent investigators to draw the same conclusions from an experiment by following the documentation shared by the original investigators [1].

Abstract#

This entry first introduces the motivation for reproducibility in the scientific process in general and then in artificial intelligence and machine learning. Due to the rather wide range of meanings attached to reproducibility in the literature and the ambiguity of the terms, a brief review of the most important definitions is provided and discussed, and the most stable formulation of the definition is promoted. Practical guidelines and various standards for documenting code, the technical experimental setup, and data are also discussed.

Motivation and Background#

Reproducibility in science means that one can repeat or replicate the same (or a sufficiently similar) experiment and obtain the same (or sufficiently similar) research results as the original scientists on the basis of their publications and descriptions. To this end, and to ease replication, the discovered claims, methods and analyses should be described in a sufficiently detailed and transparent way. Diverse reproducibility settings have been identified in the literature (see, e.g., [1], [2]), but from a more general standpoint, reproducibility entails that studies are reproduced by independent researchers.

Reproducibility is an essential ingredient of the scientific method, meant to verify published results and claims and to enable a continuous self-correcting process in scientific discovery. Unfortunately, a so-called research replication crisis has lately been pointed out [3]. According to several surveys, a worryingly large share of published research results in disciplines such as chemistry, biology, medicine and pharmacy, and earth and environmental sciences cannot be repeated. This may suggest issues with these results, or at least with how well they are described.

Reproducibility in artificial intelligence (AI) and, in particular, machine learning (ML) is especially challenging. The continuously increasing complexity of new methods (often having many hyper-parameters that need specialized optimization strategies), the size of the studied datasets and the use of advanced computational resources make it much harder to communicate the necessary details than in older work. The paper [4] presents the view of researchers such as J. Pineau (citing her interview) that ML used to be more theoretically grounded, whereas in the past decade it has become a more experimental science, with many proposals of new models, in particular deep networks, emerging from running many experiments with the intensive use of available data. In this context, the authors of [5] indicate growing difficulties in reproducing the work of others. Other sources of difficulty in reproducibility include: lack of access to the same training data or differences in data distribution; mis-specification or under-specification of the model or training procedure; lack of availability of the code necessary to run the experiments, or errors in the code; under-specification of the metrics used to report results; selective reporting of results and ignorance of the danger of adaptive overfitting; as well as the use of adaptation strategies embedded in the development libraries.

Nevertheless, software solutions and systems based on AI and ML are gaining momentum. Many of them are used in high-stake applications where their decisions can have an impact on people and society, and their improper operation may cause harm. In this context, the quest for reproducibility of such methods is even more urgent, and reproducibility becomes one of the key postulates of Responsible AI and Trustworthy AI. [6] also argues that reproducibility in AI is important for further reasons. Researchers, students and R&D engineers need to gain a good understanding of new and often quite complex methods, reproduce them (sometimes through their own re-implementations), carefully check their correctness, examine their working conditions and limitations, and verify the presented results, especially if they intend to use these methods in their own systems, which are often applied to complex tasks. Moreover, many new AI projects receive either public or business funding, so they should be subject to accountability, and it is necessary to convince others that these projects can produce reliable results.

Terminology#

In this handbook, we follow the concept of reproducibility introduced by [7]. According to this concept, which is also adopted in a number of more recent papers (e.g., [1]; [8]; [5]), reproducibility refers to the ability of an independent researcher to reproduce the same, or reasonably similar, results using the data and the experimental setup provided by the original authors.

Reproducibility should not be confused with other terms describing the ability to replicate results in science, such as replicability and repeatability [2]. Replicability, defined in a way consistent with our understanding of reproducibility, is the ability of an independent researcher to produce results that are consistent with the conclusions of the original work, using new data or a different experimental setup. The term repeatability appears in some references (e.g., [9], which uses a notion of reproducibility inconsistent with our definition); it should be understood as the ability of researchers to repeat their own experimental procedures, using the same experimental setup and data, while achieving reasonably repeatable results that support the same conclusions.

In order to compare these reproducibility-related terms, their main conceptual dimensions need to be identified. Based on an analysis of the literature, the following dimensions can be distinguished: (i) the availability of the components originally deployed in the experimental workflow (i.e., data, code and analysis, as considered by [5]; [10]; [11]); (ii) the teams involved in the experimentation (i.e., whether or not the experiment was conducted by the same group that is running the reproducibility validation); (iii) the reasons why the experiment, or part of it, is re-conducted (i.e., validating the repeatability of the experiment or, as suggested by [1], corroborating the scientific hypothesis and theory the experiment aims to support). With respect to these conceptual dimensions, the reproducibility-related terms used in the literature can be clustered in the following way:

  • Most of the literature (including [5]; [1]; [12]) refers to reproducibility as the attempt to replicate an experiment as closely as possible to the original one, that is, by using the original data, code and analysis when available. The terms computational reproducibility, method reproducibility, direct replication and recomputation are used in lieu of reproducibility by [10], [13], [14], [15] and [16], respectively. [1] distinguishes the notion of reproducibility from the corroboration of the scientific hypotheses or theory that the experiment is designed to support.

  • The term replicability is highlighted by [7], [8], [12] and [5] for the situation in which an independent team can obtain the same result using data that may be slightly different and methods that they develop completely independently or change slightly. Furthermore, [8] and [5] use another name, robust, for carrying out the experiments with the same data but with some changes in the analysis or code implementation.

  • Some works, such as [9], [8], [17] and [1], use repeatability to indicate a weaker level of reproducibility, where the replication of the experiment is achieved by the same team that performed the original experiments.

In the context of the above literature review, it is also worth clarifying what is reproduced as a result of the above activities and how the term result should be understood. In the case of AI works, [1] distinguishes between different possible results to reproduce:

  • Outcome – the result of applying the model implementation to selected data (e.g., predictions, i.e., labels for test examples)

  • Analysis – calculated measures or other indicators (e.g. prediction accuracy values)

  • Interpretation – more general conclusions from the experiments. According to Gundersen, the last point is the most important for reproducibility, because in the scientific method certain hypotheses are tested or certain beliefs are confirmed.

A similarly important refinement of the levels of reproducibility is the division proposed in [13]:

  • Reproducibility of methods: the ability to implement, as exactly as possible, the experimental and computational procedures, with the same data and tools, to obtain the same results

  • Reproducibility of results: the production of corroborating results in a new study, having used the same experimental methods

  • Reproducibility of inference: the drawing of qualitatively similar conclusions from either an independent replication of a study or a reanalysis of the original study

These general definitions should, however, be made more specific when applied to contemporary artificial intelligence research, and to the sub-field of machine learning in particular. The reason lies in the high complexity of modern software processing pipelines, which often depend on third-party software (frameworks, libraries), use an extended set of hyper-parameters that are crucial for arriving at the correct results, and require modern hardware (e.g., recent GPU cards) with its specific architecture and drivers. These features of AI research and applications make this field different from science in general, where reproducibility refers primarily to the careful documentation of the experimental procedure.

In AI systems, the main components of the experimental setup are software and data. The software plays the role of the experimental setup: although, depending on the specific context, hardware components may be involved as well (e.g., in computer vision or robotics), most AI-related research is conducted on pre-recorded datasets, so we can limit our scope to the software. The other dimension is data. Together, software and data define the conceptual dimensions of the space on which the terms defined above are spanned in AI. However, as noted earlier, AI is a very broad field, with a number of distinctive sub-fields that have specific requirements when it comes to defining the exact elements of the software, and sometimes specific requirements as to the data, such as the elimination of biases or privacy issues. This motivates the introduction of guidelines or “best practices” for reproducibility, which often also include terms that define the degree to which the postulate of full reproducibility is met, usually in relation to the amount of code, technical detail and data that the authors share with readers.
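As a minimal illustration of what documenting the software and data dimensions can look like in practice, the sketch below (our own illustration, not taken from any of the cited guidelines) records the Python version, the platform, the versions of the libraries used and a hash identifying the exact dataset file; the dataset path and package names are hypothetical placeholders.

```python
import hashlib
import json
import platform
import sys
from importlib.metadata import version


def dataset_fingerprint(path: str) -> str:
    """Return a SHA-256 hash that identifies the exact dataset file used."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def experiment_setup(dataset_path: str, packages: list[str]) -> dict:
    """Collect the software and data facts needed to re-create this run."""
    return {
        "python": sys.version,
        "platform": platform.platform(),
        "packages": {name: version(name) for name in packages},
        "dataset_sha256": dataset_fingerprint(dataset_path),
    }


if __name__ == "__main__":
    # Both the dataset path and the package list are hypothetical placeholders.
    setup = experiment_setup("data/train.csv", ["numpy", "scikit-learn"])
    with open("experiment_setup.json", "w") as out:
        json.dump(setup, out, indent=2)
```

Sharing such a snapshot alongside the code and data does not by itself guarantee reproducibility, but it removes much of the under-specification of the experimental setup discussed above.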

Guidelines#

Definitions of the different reproducibility-related terms are often accompanied by badges and guidelines that help make the definitions operational.

  • Some definitions differentiate the notion of reproducibility according to the kind of resource shared. For example, [10] focus on computational reproducibility with bronze, silver and gold standards. [11] and [1] propose increasing levels R1, R2, R3 and R4, depending on whether experiment descriptions, code, data and the complete experiment are made available. The ACM Artifact Review and Badging policy [9] recommends that three separate badges related to artifact review be associated with research articles in ACM publications: Artifacts Evaluated, Artifacts Available and Results Validated.

  • Guidelines ease the description of experiments. For example, [5] provides a dedicated Machine Learning Reproducibility Checklist; datasheets [18], model cards [19] and factsheets [20] provide templates for describing datasets and the deployed AI models, increasing the transparency and accountability of experiments and operational intelligent systems.

Below, a few of the above guidelines are described in more detail. Following [10]’s proposal, the three degrees of reproducibility standards for ML are based on the availability of data, model and code, as well as of other analyses or programming dependencies. In the bronze standard (the minimal requirement for reproducibility), the authors should make the data, the model and its source code publicly available for download. The silver standard extends it by additionally providing the dependencies of the analysis (in a form that can be installed with a single command) and recording key details of the analysis and the software requirements; furthermore, all elements of the analysis should be documented and set to be deterministic. Within the gold standard, the authors should also make the analysis reproducible with a single command, which is the most demanding requirement with respect to full automation of the reproducibility process.
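To make the determinism and single-command requirements more tangible, the sketch below shows one possible way (our illustration, not the procedure prescribed by [10]) of structuring an experiment script: the random seeds are fixed explicitly and the whole analysis is exposed behind one command. Pinned dependencies are assumed to live in an accompanying requirements file installable with a single `pip install -r requirements.txt`, and the script name `run_experiment.py` is hypothetical.

```python
import argparse
import random

import numpy as np


def set_determinism(seed: int) -> None:
    """Seed the pseudo-random generators used by the analysis.

    A deep-learning framework, if one were used, would need its own
    seeding calls in addition to these.
    """
    random.seed(seed)
    np.random.seed(seed)


def run_analysis(seed: int) -> float:
    """Stand-in for the full pipeline: load data, train, evaluate."""
    set_determinism(seed)
    # Hypothetical computation; a real pipeline would train and score a model.
    scores = np.random.rand(100)
    return float(scores.mean())


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Reproduce the full analysis.")
    parser.add_argument("--seed", type=int, default=42)
    args = parser.parse_args()
    print(f"mean score: {run_analysis(args.seed):.4f}")
```

In the spirit of the gold standard, `python run_experiment.py --seed 42` would then re-create the reported number end to end, provided the same data and pinned environment are used.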

[5] specify the necessary elements to be documented and made public with respect to the following categories: model and algorithm, theoretical claims, datasets used in the experiments, shared code including dependency specifications, and all reported experimental results (with all details of the experimental setup, hyper-parameters, training details, definitions of evaluation measures, and a description of the computing infrastructure used). [8] provide similar recommendations for ML in robotics, focusing on the reproducibility of computational experiments on real robots. They stress the role of properly managing software dependencies, distinguishing between experimental code and library code, and documenting the measurement metrics, which is essential for reinforcement learning.
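A lightweight way of keeping such documentation together during experimentation is to store it as a structured record. The sketch below shows one possible layout inspired by the categories listed above; it is not the checklist of [5] itself, and all field values are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ExperimentReport:
    """Fields loosely following the documentation categories listed above."""

    model_description: str
    dataset: str
    code_repository: str
    dependencies: dict
    hyperparameters: dict
    evaluation_measures: list
    computing_infrastructure: str
    results: dict = field(default_factory=dict)


# Illustrative values only; nothing here refers to a real experiment.
report = ExperimentReport(
    model_description="Gradient-boosted trees for tabular classification",
    dataset="UCI Adult, standard train/test split",
    code_repository="https://example.org/our-repo",
    dependencies={"python": "3.11", "scikit-learn": "1.4"},
    hyperparameters={"n_estimators": 200, "learning_rate": 0.1},
    evaluation_measures=["accuracy", "F1"],
    computing_infrastructure="single CPU node, 16 GB RAM",
    results={"accuracy": 0.86, "F1": 0.71},
)
print(json.dumps(asdict(report), indent=2))
```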

Datasheets by [18] specify how to document the motivation, composition, collection process and recommended uses of the data deployed in systems and experiments; model cards by [19] ease the description of a model’s intended use cases, limiting its usage in contexts for which it is not well suited; factsheets [20] provide a template for describing the purpose, performance, safety, security and provenance information to be completed by AI service providers for examination by consumers.
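For illustration, a minimal model card can be kept as a simple structured document whose sections roughly follow the template of [19]; the section names below are paraphrased from that template and all content is hypothetical.

```python
# A toy model card as a plain dictionary; every entry is invented for illustration.
model_card = {
    "model_details": "Logistic-regression classifier, version 1.0, trained in 2023",
    "intended_use": "Ranking customer support tickets by urgency; not for medical triage",
    "factors": "Evaluated separately per language of the ticket text",
    "metrics": "Accuracy and F1 on a held-out test set",
    "evaluation_data": "10,000 tickets sampled from 2022 logs",
    "training_data": "90,000 tickets sampled from 2020-2021 logs",
    "ethical_considerations": "Tickets may contain personal data; anonymized before training",
    "caveats_and_recommendations": "Performance degrades on languages unseen in training",
}

for section, text in model_card.items():
    print(f"{section}: {text}")
```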

Software frameworks supporting reproducibility#

Lately, a paradigm based on tailoring the DevOps approach to AI and ML has been emerging as a practical tool for ensuring reproducibility. This paradigm makes use of frameworks for Machine Learning Model Operationalization Management (MLOps), which streamline the whole development lifecycle of AI and ML models. MLOps enables developers and auditors to keep track of and inspect the various choices made and the artefacts produced in the different phases of AI and ML design and development (i.e., data gathering, data analysis, data transformation/preparation, model training and development, model validation, and model serving). [21] analyze some of the available open-source tools for MLOps. Such tracking allows for maintaining comprehensive documentation, which is at the basis of model reproducibility.
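The interfaces of these frameworks differ, but the core mechanism, namely recording the parameters, metrics and artefacts of every run so that each phase can be inspected later, can be sketched in a framework-agnostic way. The tracker below is a hypothetical stand-in written for illustration; it is not the API of any specific tool analyzed in [21].

```python
import json
import time
import uuid
from pathlib import Path


class RunTracker:
    """Minimal stand-in for an MLOps experiment tracker."""

    def __init__(self, root: str = "runs") -> None:
        self.run_id = uuid.uuid4().hex[:8]
        self.dir = Path(root) / self.run_id
        self.dir.mkdir(parents=True, exist_ok=True)
        self.record = {
            "run_id": self.run_id,
            "started": time.time(),
            "params": {},
            "metrics": {},
            "artifacts": [],
        }

    def log_param(self, key: str, value) -> None:
        self.record["params"][key] = value

    def log_metric(self, key: str, value: float) -> None:
        self.record["metrics"][key] = value

    def log_artifact(self, path: str) -> None:
        # Record where a produced artefact (model file, plot, report) lives.
        self.record["artifacts"].append(path)

    def close(self) -> None:
        with open(self.dir / "run.json", "w") as out:
            json.dump(self.record, out, indent=2)


# Usage sketch: every choice and produced artefact is written down, so an
# auditor can later reconstruct what was run and with which settings.
tracker = RunTracker()
tracker.log_param("learning_rate", 0.01)      # hypothetical values
tracker.log_metric("val_accuracy", 0.93)
tracker.log_artifact("models/model.pkl")
tracker.close()
```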

Bibliography#

1

Odd Erik Gundersen. The fundamental principles of reproducibility. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 379(2197):20200210, 2021. URL: https://royalsocietypublishing.org/doi/abs/10.1098/rsta.2020.0210, doi:10.1098/rsta.2020.0210.

2

Hans E. Plesser. Reproducibility vs. Replicability: A Brief History of a Confused Terminology. Frontiers in Neuroinformatics, 11:76, January 2018. URL: http://journal.frontiersin.org/article/10.3389/fninf.2017.00076/full (visited on 2021-11-18), doi:10.3389/fninf.2017.00076.

3

Monya Baker. 1,500 scientists lift the lid on reproducibility. Nature, 533:452–454, 2016.

4

Will Douglas Heaven. AI is wrestling with a replication crisis. MIT Technology Review, 2020.

5

Joelle Pineau, Philippe Vincent-Lamarre, Koustuv Sinha, Vincent Larivière, Alina Beygelzimer, Florence d'Alché-Buc, Emily Fox, and Hugo Larochelle. Improving Reproducibility in Machine Learning Research (A Report from the NeurIPS 2019 Reproducibility Program). arXiv:2003.12206 [cs, stat], December 2020. URL: http://arxiv.org/abs/2003.12206 (visited on 2021-09-26).

6

Edward Raff. A step toward quantifying independently reproducible machine learning research. 2019. arXiv:1909.06674.

7

Jon F. Claerbout and Martin Karrenbach. Electronic documents give reproducible research a new meaning. In SEG Technical Program Expanded Abstracts 1992, 601–604. 2005.

8

Nicolai A. Lynnerup, Laura Nolling, Rasmus Hasle, and John Hallam. A survey on reproducibility by evaluating deep reinforcement learning algorithms on real-world robots. In Leslie Pack Kaelbling, Danica Kragic, and Komei Sugiura, editors, Proceedings of the Conference on Robot Learning, volume 100 of Proceedings of Machine Learning Research, 466–489. PMLR, 30 Oct–01 Nov 2020. URL: https://proceedings.mlr.press/v100/lynnerup20a.html.

9

ACM Artifact Review and Badging - Version 1.1. August 2020. URL: https://www.acm.org/publications/policies/artifact-review-and-badging-current (visited on 2022-01-20).

10

Benjamin J. Heil, Michael M. Hoffman, Florian Markowetz, Su-In Lee, Casey S. Greene, and Stephanie C. Hicks. Reproducibility standards for machine learning in the life sciences. Nature Methods, October 2021. URL: https://www.nature.com/articles/s41592-021-01256-7 (visited on 2021-11-18), doi:10.1038/s41592-021-01256-7.

11

Odd Erik Gundersen and Sigbjørn Kjensmo. State of the Art: Reproducibility in Artificial Intelligence. In Proceedings of the AAAI Conference on Artificial Intelligence. 2018.

12

National Academies of Sciences, Engineering, and Medicine. Reproducibility and Replicability in Science. The National Academies Press, Washington, DC, 2019. ISBN 978-0-309-48616-3. URL: https://www.nap.edu/catalog/25303/reproducibility-and-replicability-in-science, doi:10.17226/25303.

13

Steven N. Goodman, Daniele Fanelli, and John P. A. Ioannidis. What does research reproducibility mean? Science Translational Medicine, 8(341):341ps12–341ps12, 2016. URL: https://www.science.org/doi/abs/10.1126/scitranslmed.aaf5027, doi:10.1126/scitranslmed.aaf5027.

14

Stephan Guttinger. The limits of replicability. European Journal for Philosophy of Science, 10(2):1–17, 2020.

15

Ian P. Gent and Lars Kotthoff. Recomputation.org: experiences of its first year and lessons learned. In 2014 IEEE/ACM 7th International Conference on Utility and Cloud Computing, 968–973. IEEE, 2014.

16

Victoria C. Stodden. Trust your science? Open your data and code. 2011. URL: https://magazine.amstat.org/blog/2011/07/01/trust-your-science/.

17

Joint Committee for Guides in Metrology. The international vocabulary of metrology – basic and general concepts and associated terms, 3rd edition with minor corrections. JCGM, 2012. URL: https://www.bipm.org/utils/common/documents/jcgm/JCGM_200_2012.pdf.

18

Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for Datasets. arXiv:1803.09010 [cs], March 2020. URL: http://arxiv.org/abs/1803.09010 (visited on 2021-09-14).

19

Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. Model Cards for Model Reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, 220–229. Atlanta GA USA, January 2019. ACM. URL: https://dl.acm.org/doi/10.1145/3287560.3287596 (visited on 2021-09-14), doi:10.1145/3287560.3287596.

20

Matthew Arnold, Rachel K. E. Bellamy, Michael Hind, Stephanie Houde, Sameep Mehta, Aleksandra Mojsilovic, Ravi Nair, Karthikeyan Natesan Ramamurthy, Darrell Reimer, Alexandra Olteanu, David Piorkowski, Jason Tsay, and Kush R. Varshney. FactSheets: Increasing Trust in AI Services through Supplier's Declarations of Conformity. arXiv:1808.07261 [cs], February 2019. URL: http://arxiv.org/abs/1808.07261 (visited on 2021-09-14).

21

Philipp Ruf, Manav Madan, Christoph Reich, and Djaffar Ould-Abdeslam. Demystifying mlops and presenting a recipe for the selection of open-source tools. Applied Sciences, 2021. URL: https://www.mdpi.com/2076-3417/11/19/8861, doi:10.3390/app11198861.

This entry was written by Riccardo Albertoni, Sara Colantonio, Piotr Skrzypczyński, and Jerzy Stefanowski.