Research Software: Designing for Publication and Reproducibility

How to prepare scholarly code for submission to journals or repositories.

Making Reproducible Research Software

Welcome!

An infamous 2016 article in Nature, reporting on a survey of 1,500 scientists, begins with an alarming statement: "More than 70% of researchers have tried and failed to reproduce another scientist's experiments, and more than half have failed to reproduce their own experiments." A reader might immediately wonder why. Is it for nefarious reasons, because scientists work sloppily or falsify results? Or is there a much more mundane explanation: that the people attempting to reproduce these experiments lacked key pieces of information that could have made replication successful? With research software, this is often the case; something as seemingly minor as running an identical script on two different operating systems can produce completely different results.

A cartoon image shows a number of people typing on computers, surrounded by computing symbols.

Image created by Scriberia for The Turing Way community and is used under a CC-BY licence.  DOI 10.5281/zenodo.3332807.

Partially as a result of attention being drawn to the "replication crisis," many grants and journals have started to implement code sharing requirements alongside existing data sharing requirements. These requirements create challenges for scholars. While an academic paper in a text format might be readable forever, software is much trickier to share, preserve, and reuse. Losing the ability to run a piece of research software can lead to a range of problems, from a creator needing to redo work to restore the software to a workable state, to a paper being retracted because its conclusions can no longer be supported. This guide aims to answer questions and provide resources that will help you prepare your software so it's fit to share and publish.

If you are just beginning your project, please take a look at the Data Management Planning guide. Many of the ideas and tools that the guide discusses, such as good file organization, storage, and backup practices, apply to software as well as to data.

If you want to know more about open science and open access in general, please see the Open Access guide.

What do you mean by "reproducibility" and what does software have to do with it?

Here's what it boils down to: make sure your software will work on someone else's computer.

  • Can another researcher locate your dataset and your software? Do they know what the software is supposed to do?
  • Does this other researcher--and the other researcher's computer--have enough information to get the software to run and produce the intended results, without calling you on the phone?
  • Are there decisions that you've made or specifics about the environment where you did your programming that you have not documented?
  • Have you hard coded file paths to refer to specific places on your personal computer's file system?

Note that "someone else's computer" might also be a different place where you want to run your own software, such as a high-performance computing (HPC) cluster.
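The hard-coded-path pitfall in the last question above is straightforward to avoid. As a minimal sketch (the folder layout and the `data_path` helper here are hypothetical, invented for illustration), Python's `pathlib` can build paths relative to a project root so the same script works on any machine and operating system:

```python
from pathlib import Path

def data_path(project_root, filename):
    # Join path pieces with "/" instead of hard coding something like
    # "C:/Users/ruby/project/data/results.csv", which exists only on
    # one person's machine. pathlib chooses the correct separator for
    # Windows, macOS, and Linux automatically.
    return Path(project_root) / "data" / filename
```

A collaborator can then point `project_root` at wherever they downloaded the project, rather than editing every path buried in the script.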

Isn't This All Just Adding Work?

An image from 1978 shows a white man pressing buttons on an old, room-sized computer.

Image CC BY. Retrieved from the Digital Public Library of America

Many of our recommendations in the following pages, particularly around documentation, will add time and effort. At the Libraries, we advocate strongly for open scholarship. We support students, staff, and faculty in publishing their data, code, and related documentation and materials in support of this belief, as well as to meet funder and journal mandates. We also acknowledge that open scholarship replicates some of the historical biases that exist in scholarship, academia, and the world more broadly, and are open to conversations about how we can work together with the NYU community to dismantle some of these barriers.

Ideals aside, here are some advantages that come with publishing reproducible code:

  • Citeability: You can add code you publish to your CV, and other researchers can cite your code.
  • Contributions To Your Field: Other scientists working on similar problems might find your methods, code, and data helpful.
  • Institutional Memory: Many labs have a fair amount of turnover. Students graduate, postdocs move on, PIs begin new projects. The better you document a project, the less likely it is that someone joining your lab in the future will have to spend time trying to recreate others' work.
  • Helping Your Future Self: Documenting your code and processes will help you retrace your steps and remember how you did things. If you end up looking back at the code after some time, because a journal's review process is slow, or because you decide to use a similar method some time down the road, having clear notes and explanations packaged with the software will save you time and effort.

Let's say that again: you will benefit from making well-documented, reusable code. The less time you spend redoing or relearning what you did previously--or starting completely from scratch because you did not back up your work--the more you can focus on improving your work and building your skills.

On the left half of the image, text says "Now Ruby" over a picture of a happy white woman with a computer that has a green check mark on the screen. On the right half, the text "Future Ruby" is over a picture of an older white woman with a distressed face, next to a computer that says "ERROR."

Image CC BY, retrieved from Reproducibility in Cancer Informatics.

Why We Love Open Source

There are a few reasons why we strongly advocate for open source software--we encourage you to apply open source licenses to your work, and to use open source products to the extent possible.

  • Transparency: If something goes wrong with a piece of open-source code, it's possible to review the code and try to figure out where the error has occurred. With closed-source software, you might see an error message but have no way to trace it back and figure out how to fix the issue.
  • Knowledge production: When you and your fellow scholars can review each other's code, you can find solutions or shortcuts for your own computational problems. Sharing in this way is especially valuable outside computer science, in fields where coding skills are not part of formal training.
  • Accessibility: Many scholars, especially those outside the US, do not have institutional access to programs like MATLAB and SPSS due to their cost. Using free and open source software eliminates this economic barrier. 
  • Reusability: To reuse a file created by a closed-source program, a user needs a copy of that program, which is not always easy to find. Software becomes obsolete quickly. If the company that made the software has gone out of business, or has simply stopped supporting the product, the user must hunt down the appropriate version of the software, along with documentation about how to use it. This is usually much harder than locating an older, deprecated version of an open source product.