
Data Management Planning: Reproducibility

Information on best practices and standards for data management planning.


Reproducibility

The ultimate goal of data management is the reproducibility of an experiment and the reuse of its results.

Reproducibility of scientific research is imperative: it helps researchers verify results, and it allows others to build on them, advancing the global body of scientific knowledge.

However, as experiments become increasingly complex and digital, researchers trying to reproduce a result often have only the data described in the paper, or secondary data if it is supplied. This leaves out information critical to understanding how the experiment was put together: descriptions of column names in tabular data, libraries used in scripting or computational experiments, algorithms used in machine learning, even the software used to view files.

 

Funders, award-granting institutions, and peer-reviewed journals are beginning to take notice of the general lack of reproducibility plaguing many scientific communities. Websites such as Retraction Watch have sprung up to track which journal articles are being retracted, most often because of problems with the data, chiefly results that cannot be reproduced.

By taking proper care of your data throughout its lifecycle, as detailed in this guide, you can avoid the embarrassing fate of having a paper retracted or research defunded.

Refer in particular to converting your data to open-source file formats and documenting your data and research process.
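
As one illustration of documenting data and the research process, the short Python sketch below writes column descriptions and library versions into a single record that can be saved alongside a dataset. It is a minimal sketch, not a prescribed format: the column names, the function names, and the choice of JSON are all assumptions made for the example.

```python
import json
import platform
import sys
from importlib import metadata


def describe_environment(packages):
    """Return interpreter, OS, and version info for the named packages."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
    }


def build_record(columns, packages):
    """Combine column descriptions and environment details into one record."""
    return {"columns": columns, "environment": describe_environment(packages)}


if __name__ == "__main__":
    # Hypothetical column names for an illustrative tabular dataset.
    columns = {
        "site_id": "Identifier of the sampling site",
        "temp_c": "Water temperature in degrees Celsius",
    }
    record = build_record(columns, ["pip"])
    print(json.dumps(record, indent=2))
```

Saving the printed record next to the data file gives a later reader the column meanings and the library versions that a paper alone usually omits.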

ReproZip, the Reproducibility Packer!

ReproZip is software developed by the ViDA (Visualization and Data Analysis) group at NYU. It is a tool aimed at simplifying the process of creating reproducible research from command-line executions. It creates a self-contained package that has all the binaries, files, and dependencies required to reproduce the research as it ran in the author’s computational environment. A reviewer can then unpack the research in their own environment to reproduce the results, even if that environment has a different operating system from the original one.

ReproZip has users across domains, from digital humanities to machine learning. Check out some examples here, and on the YouTube channel!

ReproZip has two main steps:

 

The packing step happens in the original environment and generates a compendium of the experiment so as to make it reproducible. ReproZip creates a .rpz file, which contains all the necessary information and components for the experiment.

 

The unpacking step reproduces the experiment from the .rpz file. ReproUnzip offers several unpacking methods, ranging from simply decompressing the files in a directory to starting a full virtual machine, and they can be used interchangeably from the same packed experiment.
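
The two steps above can be sketched as command lines. The Python sketch below only builds and prints the commands rather than running them, so it works without ReproZip installed. The script name analysis.py, the package name my_experiment.rpz, the target directory, and the helper function names are hypothetical; the reprozip and reprounzip subcommands shown follow the tool's usual command-line usage and should be checked against the documentation for the installed version.

```python
import shlex


def packing_commands(experiment_cmd, package="my_experiment.rpz"):
    """Commands for the packing step, run in the original environment."""
    return [
        ["reprozip", "trace"] + list(experiment_cmd),  # record what the run uses
        ["reprozip", "pack", package],                 # write the .rpz file
    ]


def unpacking_commands(package, target, unpacker="directory"):
    """Commands for the unpacking step, run on the reviewer's machine."""
    return [
        ["reprounzip", unpacker, "setup", package, target],  # unpack the .rpz
        ["reprounzip", unpacker, "run", target],             # re-run the experiment
    ]


if __name__ == "__main__":
    # Hypothetical experiment: a single Python script.
    steps = packing_commands(["python", "analysis.py"])
    steps += unpacking_commands("my_experiment.rpz", "reproduced")
    for argv in steps:
        print(shlex.join(argv))
```

Passing a different unpacker name, such as docker or vagrant where those plugins are installed, changes only the unpacking commands; the same .rpz file is reused. To execute a step rather than print it, each list can be handed to subprocess.run(argv, check=True).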

 



Creative Commons License
Original work in this LibGuide is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.