Data Science

A guide with resources for the data science community on campus.

STORAGE & BACKUP

You will be working with a lot of different project files when doing your data science work: code, data, documentation, presentations, visualizations, articles, and more! So you really have to pay attention to how you store your materials. Even if your data management practices are pristine, your data is still at risk if it has no backups or if the storage medium isn't reliable. Luckily, we have some resources at NYU and a few good rules of thumb to help you!

NYU storage options

NYU ITS has a helpful chart comparing NYU storage options. They include:

  • NYU Drive for faculty, staff, and students (all-purpose file sharing via Google Apps for Education)

  • NYU Research Workspace for faculty, staff, and by request, students, designed for fast access to large datasets. Up to 5TB is available free after a consultation.

  • NYU Box for faculty, staff, and by request, students, geared towards secure data needs

  • NYU Stream for faculty, staff, and students, specifically for audio, video, and image files with a focus on collaborative editing and linking with NYU Classes

  • NYU High Performance Computing Backups and Storage for those already using HPC for a project via the /archive data storage

We recommend that backups be saved in open or standard file formats, and not be compressed or encrypted (though sensitive data may require encryption). The UK Data Service also has a nice guide to data backups. Do not use CDs or DVDs as these have been known to fail frequently.

Storage for sensitive or restricted materials

If you are planning on working with sensitive data, you should first review NYU's policy on transmitting and storing sensitive data and NYU's policy on data classification.

There are two options for storing secure data:

  1. The NYU Secure Research Data Environment service is new and available on a case-by-case basis. It is a custom secure research environment where you can store and analyze sensitive data.
  2. NYU Box has been designed and deployed for sensitive and restricted data, with HIPAA compliance, and should be used for any short-term file storage needs requiring such protection. However, researchers whose schools, departments, or units deploy their own secure data storage for sensitive or restricted data should use that service. For information about NYU Box and restricted data, see the ITS NYU Box FAQs on sensitive data and managing permissions.

3-2-1 RULE

To keep data safe, we recommend following the 3-2-1 Rule: maintain three copies of your data on two different storage types, with one of those copies offsite:

[Image: the 3-2-1 rule as described above]

3-2-1 WITH NYU RESOURCES

Both Google Drive and Box have desktop applications (Google Drive for Desktop, Box Drive) that let you mount and access your files quickly. Once installed, the application creates a folder that appears just like a My Documents folder, only it's connected to your account on that service (so it shows up as Google Drive or Box in your file explorer). It then operates like a two-way door: changes are synced in both directions between your local computer and the service in the cloud.

This helps us stick to the 3-2-1 rule pretty nicely as well:

  1. Sync data between local copies (on all your computers) and the Google Drive server located elsewhere.
    1. So this is 2 copies on 2 different storage media, with 1 copy offsite
  2. Back up the Google Drive folder on your laptop to an external hard drive whenever there are changes.
    1. This brings us to 3 copies on 2 media with 1 offsite copy!

In practice, this looks something like:

[Image: the 3-2-1 rule using Google Drive filestream]
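
The external hard drive step can be scripted. Below is a minimal sketch in Python, using only the standard library, that copies new or changed files from a locally synced folder to a backup folder. It is an illustration rather than an NYU-provided tool: the function names `file_digest` and `backup_changed_files` are made up for this example, and it assumes the desktop application has already synced the folder to your computer. It never deletes anything from the backup, so files removed from the source stay on the external drive.

```python
import hashlib
import shutil
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_changed_files(source: Path, destination: Path) -> list:
    """Copy new or changed files from source to destination.

    A file is copied when it is missing from the destination or when its
    contents differ from the backed-up copy. Returns the relative paths
    of the files that were copied.
    """
    copied = []
    for src in sorted(source.rglob("*")):
        if not src.is_file():
            continue  # skip directories; they are created as needed below
        rel = src.relative_to(source)
        dst = destination / rel
        if dst.exists() and file_digest(dst) == file_digest(src):
            continue  # the backup already holds an identical copy
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copy2 also preserves modification times
        copied.append(rel)
    return copied
```

To use it, call `backup_changed_files` with the path of your synced folder and the path of a folder on the external drive. Both locations depend on your operating system and setup, so check where the desktop application mounts your files before running it. Hashing every file on each run is simple but slow for very large datasets; comparing file sizes and modification times first would be a reasonable alternative.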

Connecting storage to analysis

During the data cleaning and data analysis phases, it is often necessary to push and pull data to and from an external storage source efficiently so that the data can be integrated into a workflow. The following tools can provide useful ways of doing this:

Google Sheets/Google Drive Integrations:

Python:
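
When a storage service is mounted as a local folder through its desktop application, one simple way to pull data in and push results back is to read and write files in that folder and let the application handle the syncing. The sketch below assumes that setup and uses only Python's standard library. The function names (`load_rows`, `drop_incomplete`, `save_rows`) are hypothetical, chosen for this example, and are not part of any NYU or Google tool.

```python
import csv
from pathlib import Path


def load_rows(csv_path: Path) -> list:
    """Read a CSV file into a list of dicts keyed by the header row."""
    with csv_path.open(newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def drop_incomplete(rows: list) -> list:
    """Keep only the rows in which every field has a non-empty value."""
    return [
        row for row in rows
        if all(value not in (None, "") for value in row.values())
    ]


def save_rows(rows: list, csv_path: Path) -> None:
    """Write a list of dicts to a CSV file, creating parent folders."""
    if not rows:
        raise ValueError("no rows to write")
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
```

A typical pattern is to load a raw file from the synced folder with `load_rows`, clean it with `drop_incomplete`, and write the result to a separate file with `save_rows`, so the raw data is never overwritten. Once the cleaned file is written, the desktop application syncs it back to the cloud like any other change.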

R: