Hello! I am Vicky Rampin, the Librarian for Research Data Management and Reproducibility. I am also the liaison to computer science and data science programs at NYU! I am here to help you navigate the resources for both at NYU and beyond. You can set up an appointment with me or always email me at: vs77@nyu.edu.
If you need help with a specific quantitative, GIS, or qualitative software package, you should reach out to Data Services.
While doing your data science work, you will inevitably be programming. And if you are programming, by yourself or with others, you should be using version control for your code.
Version control is "the management of changes to documents, computer programs, large web sites, and other collections of information." (Wikipedia). It's a way that we can keep track of our projects across time, space, different users, and different systems!
But why learn version control techniques and software? The Software Carpentry Git lesson outlines some great reasons:
Nothing that is committed to version control is ever lost, unless you work really, really hard at it.
Because we have this record of who made what changes when, we know who to ask if we have questions later on, and, if needed, we can revert to a previous version.
When several people collaborate on the same project, it’s possible to accidentally overlook or overwrite someone’s changes. With version control, you have to make a conscious decision before being able to overwrite anyone's work (so no accidentally deleting things!).
Keeping a record of what was changed, when, and why is extremely useful for all researchers when we need to come back to the project later on (e.g., a year later, when memory has faded).
Version control is the lab notebook of the digital world: it’s what professionals use to keep track of what they’ve done and to collaborate with other people. Every large software development project relies on it, and most programmers use it for their small jobs as well. And it isn’t just for software: books, papers, small data sets, and anything that changes over time or needs to be shared can and should be stored in a version control system.
Git is one of the most popular version control systems: it's a program that manages the history of your source code. Git lets us compare, restore, and merge changes to our "stuff," where stuff is any plain-text file. You can use Git to version CSV files, for example, but it works poorly with XLSX files: CSV is plain text, while XLSX is a compressed bundle of XML files, so Git can store each version but can't look inside to show or merge the actual changes to the content.
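To see why plain text matters, here is a minimal sketch of a line-by-line comparison of two versions of a small CSV file. It uses Python's built-in difflib module rather than Git itself, so it only illustrates the idea; the file name and the data are made up for this example.

```python
import difflib

# Two hypothetical versions of the same small CSV file.
old_version = "name,age\nalice,30\nbob,25\n"
new_version = "name,age\nalice,30\nbob,26\ncarol,41\n"


def line_diff(old, new):
    """Return a unified diff (a list of lines) between two plain-text contents."""
    return list(
        difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile="people.csv (old)",
            tofile="people.csv (new)",
            lineterm="",
        )
    )


diff = line_diff(old_version, new_version)
print("\n".join(diff))
```

Lines starting with "-" were removed and lines starting with "+" were added. Because every row of a CSV file is a readable line of text, a tool can point to exactly what changed; a compressed file gives it no readable lines to compare.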
If you want to use GitHub, GitLab, or any other hosting platform (some listed below), then you'll first need to learn how to use Git. It's a command line utility, so if you're not comfortable using the terminal, then that's a good first step as well.
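As a rough sketch of what a first Git session looks like, the Python script below runs the basic commands (git init, git add, git commit, git log) in a throwaway folder. It assumes Git is installed on your computer; the file name, commit messages, and demo identity are placeholders invented for this example. In practice you would type the same git commands directly in your terminal.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def run_git_demo():
    """Create a throwaway repository, commit twice, and return the log.

    Returns None when the git executable is not installed.
    """
    if shutil.which("git") is None:
        return None

    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)

        def git(*args):
            # The identity below is a placeholder used only for this demo.
            cmd = [
                "git",
                "-c", "user.name=Demo User",
                "-c", "user.email=demo@example.com",
                "-c", "commit.gpgsign=false",
                *args,
            ]
            result = subprocess.run(
                cmd, cwd=repo, check=True, capture_output=True, text=True
            )
            return result.stdout

        git("init")  # start tracking this folder

        notes = repo / "notes.txt"
        notes.write_text("first draft\n")
        git("add", "notes.txt")  # stage the new file
        git("commit", "-m", "Add first draft of notes")  # record a snapshot

        notes.write_text("first draft\nsecond thoughts\n")
        git("add", "notes.txt")
        git("commit", "-m", "Add second thoughts")

        # One line per commit, newest first.
        return git("log", "--oneline")


log = run_git_demo()
print(log if log is not None else "git is not installed")
```

Each commit is a saved snapshot with a message explaining the change, and the log lists those snapshots so you can find and return to any of them later.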
GitLab is an open source Git hosting platform that is rapidly rising in popularity for a few key reasons: its core features are free to all users, and many of its features set it apart from other services. It has continuous integration built into each repository, and free Git Large File Storage (LFS), so we can share larger files within a repository. Another big plus: GitLab integrates with a lot of great tools and services, like the Open Science Framework! You can read here about why GitLab is particularly great for research reproducibility: https://about.gitlab.com/2017/08/25/gitlab-and-reproducibility.
One other minor reason we like GitLab: it has a new release, with new features and fixes, every month!
Side note: if you want to collaborate between platforms, you can! GitLab has an automatic mirroring function to sync changes between GitHub and GitLab, so you can work on GitLab and make your work discoverable on GitHub, or collaborate with your community on both platforms.
Some tutorials for GitLab are listed below, but you can contact Vicky Rampin <vs77@nyu.edu> if you have any specific questions.
GitHub is a commercial version control hosting platform that has gained a lot of popularity in the academic community. It offers a web interface and a mixture of free and paid services for working with Git repositories. The majority of the content that GitHub hosts is open source software, though it is increasingly being used for other kinds of projects. It also integrates with third-party software, like the Open Science Framework and Travis CI.
There are two main ways that you can interact with repositories hosted on GitHub: the command line (via Git or the GitHub command line interface) or, if you have macOS or Windows, GitHub Desktop.
Some tutorials for GitHub are listed below, but you can contact Vicky Rampin <vs77@nyu.edu> if you have any specific questions.
Here is a narrative version of Data Services' introduction to Git and GitHub tutorial: https://nyu-dataservices.gitlab.io/rdm-instruction/intro-to-git-and-github.html. The tutorial is also embedded below.
Here are further tutorials to aid your learning: