Research Guides: Research Software: Designing for Publication and Reproducibility: Dependencies

Documenting and Managing Dependencies

Don't be like Ruby and Avi! Image CC BY, retrieved from Reproducibility In Cancer Informatics

Most research software projects use external dependencies to save time and work. However, the second you introduce third-party code, you invite a bevy of complications. What if that code has its own dependencies, and one of your dependency's dependencies breaks? What if an "improvement" or update to that code changes the way it works in the context of your program? How can you tell a third-party user exactly which versions of the different libraries you have picked will reliably produce the results you got?

Here are a few tips for using, managing, and capturing dependencies.

Use Dependencies Wisely

You can manage dependencies using the same strategy you use to avoid getting COVID-19: interact only with your household (your code), keep a social distance with friends (especially helpful dependencies), and avoid large groups of strangers.

-Noah Brenowitz, "Managing Dependencies For Reproducible (Scientific) Software"

Given the complications that dependencies can add, consider which dependencies you actually need, and whether those dependencies have stable releases. Make sure your program does not import dependencies that it never uses.
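One way to spot imports that a program never uses is to inspect its syntax tree. The following is a minimal sketch, assuming a single Python source file; the function name `unused_imports` is hypothetical, and the check is deliberately simple (it ignores names used only inside strings or re-exported via `__all__`). Linters such as pyflakes perform this check more thoroughly.

```python
import ast


def unused_imports(source):
    """Return the imported names in `source` that are never referenced."""
    tree = ast.parse(source)
    imported = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # `import a.b` binds the name `a`; `import a.b as c` binds `c`.
                imported.add(alias.asname or alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                if alias.name != "*":  # star imports cannot be tracked by name
                    imported.add(alias.asname or alias.name)
    # Every bare identifier that appears in the file, e.g. `os` in `os.getcwd()`.
    used = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    return sorted(imported - used)


example = "import os\nimport json\nfrom math import sqrt\nprint(os.getcwd(), sqrt(4))\n"
print(unused_imports(example))  # ['json']
```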

Use a Package Manager

You've probably already installed software using a package manager if you've typed "pip install," "brew install" or "apt-get" into your terminal. Installing software via a package manager (as opposed to from source) saves time. The package manager also helps you avoid installing redundant copies of software packages, and it ensures that your system has the correct versions not only of your dependencies but also of your dependencies' dependencies. If you plan to update the software you're creating regularly and expect others to use it (for instance, if you've made a software library), you may wish to make your own software installable through a package manager as well.

On a project level, renv and conda (for R and Python, respectively) are useful because they manage dependencies within a virtual environment. Isolating a project's dependencies this way gives you a space where you can easily control the versions of software you're using, whatever happens on the rest of your computer. Python also ships with a built-in module for creating virtual environments, called venv.
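As an illustration of the venv module mentioned above, here is a minimal sketch that creates an isolated environment inside a project folder. The function name `create_project_env` and the `.venv` folder name are assumptions chosen for this example (`.venv` is a common convention, not a requirement). Most people create environments from the terminal with `python -m venv .venv`; the code below does the same thing from within Python.

```python
import tempfile
import venv
from pathlib import Path


def create_project_env(project_dir, with_pip=True):
    """Create a virtual environment in `<project_dir>/.venv` and return its path."""
    env_dir = Path(project_dir) / ".venv"
    # with_pip=True installs pip into the new environment so packages can be added.
    venv.create(env_dir, with_pip=with_pip)
    return env_dir


# Demonstration in a throwaway folder; pip is skipped here only to keep it quick.
with tempfile.TemporaryDirectory() as tmp:
    env_dir = create_project_env(tmp, with_pip=False)
    # pyvenv.cfg is the marker file that identifies a folder as a virtual environment.
    print((env_dir / "pyvenv.cfg").is_file())
```

Once created, the environment still has to be activated (or its own Python executable called directly) before packages installed with pip land inside it rather than in the system-wide installation.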

Create a Metadata File That Lists Your Dependencies, and Test the Software In a New Environment

Most languages allow you to print the versions of dependencies you have loaded in a given environment, and then to save those versions in a structured file that can be used to recreate that environment. You can do this manually, by creating a file with the appropriate name (for example, environment.yml or requirements.txt) and recording all the versions in use, or, in some cases, with "freeze" or "lock" commands. The conventions for recording dependencies differ between programming languages. Because the full output of a command like "pip freeze" includes precise versions not only of your dependencies but also of your dependencies' dependencies, it can be overkill for a user, although it could be useful to have on hand. Make sure to include any constraints that you know are required--e.g., if you know that your program only works in a specific version of Python and will break in a later version.
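For Python, a shorter alternative to the full "pip freeze" output is to record versions only for the packages your program imports directly. The sketch below uses the standard library's `importlib.metadata` to look up installed versions; the function name `pin_requirements` and the package names in the example call are hypothetical placeholders for your own dependencies.

```python
import platform
from importlib import metadata


def pin_requirements(direct_dependencies):
    """Build requirements.txt-style lines for the named packages."""
    # requirements.txt has no field for the interpreter, so note it in a comment.
    lines = [f"# Recorded with Python {platform.python_version()}"]
    for name in direct_dependencies:
        try:
            lines.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            # Flag the gap instead of guessing a version.
            lines.append(f"# {name}: not installed in this environment")
    return lines


# Hypothetical list of the packages a project imports directly.
print("\n".join(pin_requirements(["numpy", "pandas"])))
```

Saving the printed lines to a file named requirements.txt gives other users something they can pass to `pip install -r requirements.txt`.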

To test whether you have documented dependencies correctly, try to run your program in another environment--in a virtual environment, on a virtual machine, on a different computer in your lab, etc. See if the program runs, but also ensure that the environment uses the correct versions of the dependencies you listed in your metadata file rather than default versions or the most recent updates of those dependencies.
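The version check described above can be partly automated. The following is a minimal sketch, assuming a requirements.txt-style list in which each dependency is pinned as `name==version`; the function name `check_pins` is hypothetical, and lines that use other specifiers (such as `>=`) are skipped rather than interpreted.

```python
from importlib import metadata


def check_pins(requirement_lines):
    """Return (name, expected, found) for every pin the environment does not match."""
    problems = []
    for raw in requirement_lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line:
            continue  # blank line, or a specifier this sketch does not handle
        name, expected = (part.strip() for part in line.split("==", 1))
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None  # the package is missing from this environment
        if found != expected:
            problems.append((name, expected, found))
    return problems


# Hypothetical pin for a package that does not exist, to show a reported mismatch.
print(check_pins(["no-such-package-for-this-example==1.0"]))
```

Run inside the new environment, an empty list means every pinned package is present at the recorded version.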

Another option is to test your code in a web-based replay tool like Binder. Binder requires a configuration file that declares the dependencies you've called, uses that file to build an environment you can reach from a web browser, and then runs the code from a GitHub repo within that environment. If you've done your work correctly, you can not only verify that you have captured your dependencies accurately, but also share a runnable version of your software with your friends, family, and peers.

Incorporate Continuous Integration

For projects meant to have a long life--i.e., a library or a program used repeatedly over time in your lab--continuous integration (CI) can improve dependency management. In these types of projects you may not want to freeze the exact versions of the dependencies you've used, because updates to those dependencies might fix bugs or security issues. Because a CI service like GitLab CI automatically tests the code in your repository, it will detect when changes to third-party software cause your code to break.

Resources

List of Configuration Files
The replay service Binder offers this page on configuration files for loading dependencies for several commonly used programming languages.
Managing dependencies for reproducible (scientific) software
A useful blog post by climate scientist Noah Brenowitz on dependency management best practices, with a focus on Python.
Introduction to renv
A how-to for using renv for package management in R.
Devops for Data Science: Making Your Python Project Reproducible
Step by step instructions for dependency management in Python virtual environments and conda.
Developing a Modern Workflow For Regularly Updated Data
This pre-print details the workflow used by a biology lab that incorporates CI.