Research Guides: Research Software: Designing for Publication and Reproducibility: Using Containers and Reproducibility Tools

Why Use Containers?

Image credit: Department of Defense. American Forces Information Service. Defense Visual Information Center. 1994. Retrieved from the Digital Public Library of America.

Much of the documentation this guide discusses involves writing detailed instructions, intended for both another person and another computer, that describe how to recreate the environment where you initially wrote your software. A container takes these instructions a step further by actually creating a virtual environment in which a user can run the software. When you run software in a container, it can only access the programs, libraries, and data you have placed inside the container. This isolation ensures that other versions of software installed on someone else's computer won't sneak into the workflow you've designed and cause problems.

This is useful for several reasons:

You can verify whether you have specified all of the correct dependencies your program requires.
You can effectively share a computing environment with someone who doesn't actually share your computer. For example, if you have a collaborator, you probably don't coordinate when you each update your computer's software and operating system. Developing within a container lets you ensure that you're both working consistently in the same environment.
Similarly, if you need to run processes on one of NYU's high performance computing environments, it is helpful and in some cases strongly encouraged to create a container that the giant computer brain of the HPC can use to process data using your software and specifications.

Docker and Singularity are both open source tools for containerizing software. They function as operating-system-level virtual environments, which require fewer computing resources than an entire virtual machine. The tech industry tends to use Docker, and consequently it has more features and integrations. However, Singularity is required by the NYU HPC and by most other university-based high performance computing environments because of how it handles permissions and security. Luckily, Singularity can open and convert Docker containers.
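
To make this concrete, a Docker container is usually described by a Dockerfile that names a base image and the dependencies to install on top of it. The Python sketch below only generates the text of a minimal Dockerfile; it is an illustration rather than a recommended template. The base image tag, requirements.txt, and analysis.py are assumed names for a hypothetical Python project.

```python
def render_dockerfile(base_image: str, requirements_file: str, entrypoint: str) -> str:
    """Return the text of a minimal Dockerfile for a hypothetical Python project."""
    lines = [
        # Start from a known base image so everyone gets the same operating system layer.
        f"FROM {base_image}",
        "WORKDIR /app",
        # Install the declared dependencies before copying the rest of the project.
        f"COPY {requirements_file} .",
        f"RUN pip install --no-cache-dir -r {requirements_file}",
        "COPY . .",
        # Run the (hypothetical) analysis script when the container starts.
        f'CMD ["python", "{entrypoint}"]',
    ]
    return "\n".join(lines) + "\n"


# Assumed example values; replace them with your project's own.
print(render_dockerfile("python:3.11-slim", "requirements.txt", "analysis.py"))
```

If a library your code needs is missing from the requirements file, the software will fail inside the container, which is exactly how a container helps you verify that you have declared every dependency.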

You can also export a Docker or Singularity container to a file, which you can then use to share or archive your project and its environment.
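
As a rough sketch of what exporting and converting can look like, the Python snippet below assembles two commands: `docker save`, which writes a Docker image to a tar archive, and `singularity build` with a `docker-archive://` source, which converts that archive into a Singularity image file. It assumes both command-line tools are installed, and the image and file names are hypothetical. The functions only build the commands; they don't run them.

```python
import shlex
from typing import List


def docker_save_command(image: str, archive: str) -> List[str]:
    """Command that exports a Docker image to a tar archive."""
    return ["docker", "save", "--output", archive, image]


def singularity_build_command(sif_file: str, archive: str) -> List[str]:
    """Command that converts a saved Docker archive into a Singularity image file."""
    return ["singularity", "build", sif_file, f"docker-archive://{archive}"]


# Hypothetical image and file names, used for illustration only.
commands = [
    docker_save_command("my-project:1.0", "my-project.tar"),
    singularity_build_command("my-project.sif", "my-project.tar"),
]
for command in commands:
    # Print each command as it would be typed in a shell.
    print(shlex.join(command))
```

To actually run a command, you could pass the list to `subprocess.run(command, check=True)` on a machine where the tools are available.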

Resources

Singularity with Miniconda
A guide to using Singularity on the NYU HPC
Ten simple rules for writing Dockerfiles for reproducible data science

Reproducibility and Replay On the Web

Several projects focused on scientific computation and reproducibility allow you to test, run, and display your project online, either using a container or using a well-documented GitHub repository. NYU provides access to several of these services.

Resources

JupyterHub - NYU
At NYU, PIs and faculty can request access to the university's centralized JupyterHub environment for labs and classes.
JupyterHub
Generally used for classroom activities but also a way to share data and computational environments, JupyterHub allows you to configure a workspace with dependencies and load datasets. It's essentially a web-based IDE.
WholeTale
WholeTale also essentially functions as a web-based IDE: you write or paste code into the environment and the service will create a Docker container behind the scenes that incorporates your code and dependencies. You can then run the software in-browser, although it is currently not possible to export the Docker container.
Binder
Binder can open and replay software in-browser directly from a repository like GitHub or GitLab, provided it has an appropriate configuration file that declares its dependencies.
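
For a Python project, one configuration file Binder understands is a requirements.txt at the top of the repository. The sketch below shows one way to produce such a file with pinned versions; the package names and version numbers are illustrative assumptions, not recommendations, and the helper name is hypothetical.

```python
from typing import Dict


def pinned_requirements(versions: Dict[str, str]) -> str:
    """Return requirements.txt content with one 'name==version' line per package."""
    # Sort by package name so the file is stable from one run to the next.
    return "".join(f"{name}=={version}\n" for name, version in sorted(versions.items()))


# Assumed example dependencies; list the packages your own code imports.
content = pinned_requirements({"pandas": "2.2.2", "numpy": "1.26.4"})
print(content, end="")
```

Pinning exact versions means that whoever opens the repository later gets the same library versions you used, rather than whatever is newest at that time.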

Reproducibility for long-term preservation

We strongly recommend using the NYU-created tool ReproZip to preserve projects at their conclusion.

ReproZip is software developed by the ViDA (Visualization and Data Analysis) Center at NYU. It's a tool aimed at simplifying the process of creating reproducible research from command-line executions. It creates a self-contained package that has all the binaries, files, and dependencies required to reproduce research run in the author’s computational environment. A reviewer can then unpack the research in their own environment to reproduce the results, even if that environment has a different operating system from the original one.

ReproZip has two main steps:

The packing step happens in the original environment and generates a compendium of the experiment so as to make it reproducible. ReproZip creates a .rpz file, which contains all the necessary information and components for the experiment.

The unpacking step reproduces the experiment from the .rpz file. ReproUnzip offers several unpacking methods, ranging from simply decompressing the files in a directory to starting a full virtual machine, and they can be used interchangeably from the same packed experiment.
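
The two steps above can be sketched as command lines. The Python snippet below assembles the `reprozip` and `reprounzip` commands for a hypothetical experiment; it assumes ReproZip is installed (the docker and vagrant unpackers are separate plugins), and the script, package, and directory names are made up for illustration. It builds the commands without running them.

```python
import shlex
from typing import List, Sequence

# Unpackers offered by ReproUnzip; docker and vagrant require their plugins.
UNPACKERS = ("directory", "chroot", "docker", "vagrant")


def pack_commands(experiment: Sequence[str], package: str) -> List[List[str]]:
    """Commands for the packing step: trace the experiment, then create the .rpz file."""
    return [
        ["reprozip", "trace", *experiment],
        ["reprozip", "pack", package],
    ]


def unpack_commands(package: str, target: str, unpacker: str = "directory") -> List[List[str]]:
    """Commands for the unpacking step: set up the experiment, then run it."""
    if unpacker not in UNPACKERS:
        raise ValueError(f"unknown unpacker: {unpacker}")
    return [
        ["reprounzip", unpacker, "setup", package, target],
        ["reprounzip", unpacker, "run", target],
    ]


# Hypothetical experiment: a script named analysis.py packed as experiment.rpz.
steps = pack_commands(["python", "analysis.py"], "experiment.rpz")
steps += unpack_commands("experiment.rpz", "unpacked", unpacker="docker")
for step in steps:
    print(shlex.join(step))
```

Because every unpacker reads the same .rpz file, switching from a plain directory to Docker or a virtual machine only changes the unpacker name in the commands.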

ReproZip has users across domains, from digital humanities to machine learning. Check out some multi-disciplinary examples and ReproZip video demos on YouTube!