Research Data Management

Information on best practices and standards for data and code management.

VERSION CONTROL

Version control is "the management of changes to documents, computer programs, large web sites, and other collections of information." (Wikipedia). It's a way that we can keep track of our projects across time, space, different users, and different systems!

But why learn version control techniques and software? The Software Carpentry Git lesson outlines some great reasons:

  • Nothing that is committed to version control is ever lost, unless you work really, really hard at it. Since all old versions of files are saved, it’s always possible to go back in time to see exactly who wrote what on a particular day, or what version of a program was used to generate a particular set of results.

  • As we have this record of who made what changes when, we know who to ask if we have questions later on, and, if needed, revert to a previous version, much like the “undo” feature in an editor.

  • When several people collaborate in the same project, it’s possible to accidentally overlook or overwrite someone’s changes. The version control system automatically notifies users whenever there’s a conflict between one person’s work and another’s.

Teams are not the only ones to benefit from version control: lone researchers can benefit immensely. Keeping a record of what was changed, when, and why is extremely useful for all researchers if they ever need to come back to the project later on (e.g., a year later, when memory has faded).

Version control is the lab notebook of the digital world: it’s what professionals use to keep track of what they’ve done and to collaborate with other people. Every large software development project relies on it, and most programmers use it for their small jobs as well. And it isn’t just for software: books, papers, small data sets, and anything that changes over time or needs to be shared can and should be stored in a version control system.

GIT

Git is one of the most popular version control systems -- it's a program to manage your source code history. Git lets us compare, restore, and merge changes to our "stuff," where stuff is any plain-text file. You can use Git to version CSV files, for example, but it doesn't work well for XLSX files -- CSV is plain text, while XLSX is a binary format (a compressed archive of XML files). Git can still store each version of a binary file, but it can't look inside to show meaningful line-by-line changes to the content.
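To make the plain-text versus binary distinction concrete, here is a minimal Python sketch. It does not use Git itself: it uses the standard library's `difflib` to produce the kind of line-based comparison that version control tools show for text, and `zipfile` to build a stand-in for an XLSX file (XLSX files are ZIP archives of XML). The sample data, the file names, and the helper names `text_diff` and `zipped` are all made up for illustration.

```python
import difflib
import io
import zipfile

# Two versions of a tiny, made-up CSV file.
old_csv = "sample,temp_c\nA,20.1\nB,19.8\n"
new_csv = "sample,temp_c\nA,20.1\nB,21.3\n"


def text_diff(old, new):
    """Return a unified, line-by-line diff of two plain-text versions."""
    return list(difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        fromfile="old.csv", tofile="new.csv", lineterm=""))


def zipped(content):
    """Stand-in for an XLSX file: the content compressed inside a ZIP container."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("sheet1.xml", content)
    return buffer.getvalue()


# Plain text: the changed row is easy to pick out.
diff_lines = text_diff(old_csv, new_csv)
print("\n".join(diff_lines))

# Binary container: the same change is buried in compressed bytes.
old_bin, new_bin = zipped(old_csv), zipped(new_csv)
print(old_bin[:2], old_bin == new_bin)
```

The text diff points at the single row that changed, while the two archives can only be compared as opaque bytes. That is roughly the situation Git is in with a spreadsheet saved as XLSX.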

Git was created in 2005 by Linus Torvalds for development of the Linux kernel, with other kernel developers contributing to its initial development. Since then, it's been widely adopted in software development and the academic community alike!

If you want to use GitHub, GitLab, or any other hosting platform (some listed below), then you'll first need to learn how to use Git. It's a command line utility, so if you're not comfortable using the terminal, then that's a good first step as well.
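To give a feel for what working with Git at the command line involves, here is a minimal sketch that drives a few common Git commands (`init`, `add`, `commit`, `log`, `show`) from Python. It assumes Git is installed and on your PATH, and it works in a throwaway temporary folder. The file name, commit messages, author details, and the helper names `git` and `demo_history` are placeholders made up for this example.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def git(repo, *args):
    """Run one git command inside `repo` and return its standard output."""
    cmd = ["git",
           "-c", "user.name=Demo Author",        # placeholder identity
           "-c", "user.email=demo@example.org",
           "-c", "commit.gpgsign=false",         # avoid signing prompts
           *args]
    done = subprocess.run(cmd, cwd=repo, check=True,
                          capture_output=True, text=True)
    return done.stdout


def demo_history():
    """Commit two versions of a file, then read the log and the older version."""
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        notes = repo / "notes.txt"
        git(repo, "init", "--quiet")
        notes.write_text("first draft\n")
        git(repo, "add", "notes.txt")
        git(repo, "commit", "--quiet", "-m", "Add first draft")
        notes.write_text("second draft\n")
        git(repo, "commit", "--quiet", "-am", "Revise draft")
        # Who changed what, newest commit first.
        log = git(repo, "log", "--format=%an: %s").splitlines()
        # The file as it was one commit ago.
        previous = git(repo, "show", "HEAD~1:notes.txt")
        return log, previous


if shutil.which("git"):  # only run the demo when Git is available
    log, previous = demo_history()
    print(log)
    print(previous)
```

In day-to-day use you would type these commands directly in a terminal (for example, `git commit -m "Add first draft"`) rather than wrapping them in Python. The wrapper is only here to keep the example self-contained.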

Vicky Rampin has created an online video to guide you through collaborating with Git and GitHub, embedded below:

REPOSITORY HOSTING PLATFORMS

GitLab is an open source Git hosting platform that is rapidly rising in popularity, for a few key reasons: its free tier includes a broad set of features, many of which set GitLab apart from other services. It has continuous integration built into each repository, and free Git Large File Storage (LFS), so we can share larger files within a repository. Another big plus – GitLab integrates with a lot of great tools and services, like Jira, Kubernetes, and the Open Science Framework! You can read here about why GitLab is particularly great for research reproducibility: https://about.gitlab.com/2017/08/25/gitlab-and-reproducibility/

One other minor reason we like GitLab -- it has a new release (with new features and fixes) every month!

Side note: if you want to collaborate between platforms, you can! GitLab has an automatic mirroring function to sync changes between GitHub and GitLab, so you can work on GitLab and make your work discoverable on GitHub, or collaborate with your community on both platforms.

Some tutorials for GitLab are listed below, but you can contact Vicky Rampin <vs77@nyu.edu> if you have any specific questions.

GitHub is a commercial Git hosting platform that has gained a lot of popularity in the academic community. It offers a web interface and a mixture of free and paid services for working with Git repositories. The majority of the content that GitHub hosts is open source software, though it is increasingly being used for other projects. It also integrates with third-party software, like the Open Science Framework and Travis CI.

There are two ways that you can interact with repositories hosted on GitHub: the command line or, if you use macOS or Windows, GitHub Desktop.

Some tutorials for GitHub are listed below, but you can contact Vicky Rampin <vs77@nyu.edu> if you have any specific questions.

CC

Original work in this LibGuide is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.