
Data Science

A guide with resources for the data science community on campus.

NEW COMPUTE RESOURCE

NYU's High Speed Research Network department has an alpha version of Kubernetes available for the NYU community to test. More information here: https://k8s-docs.hsrn.nyu.edu/

COMPUTE RESOURCES

At some point in your data science career, you will probably need access to medium- or high-performance computing infrastructure. You would typically turn to HPC when your data is too large to process on your local machine.

HPC infrastructure consists of clusters (groups of networked computers, called nodes) to which people submit jobs (scripts); each job waits in a queue until resources are available, then runs. Runs last from minutes to days, depending on the size of the input data. You typically access clusters through the command line, so a typical workflow might look like this:

  1. SSH (securely log in) into a cluster (if you don't know how to use the command line, we have an introduction to that).
  2. Copy or transfer data onto the storage layer of the cluster. This storage is typically distributed, meaning data is stored across multiple nodes so that it can be accessed by compute resources.
  3. Run some data clean-up operations via the command line, through Python scripts, R scripts, bash, etc. These operations typically involve formatting the data consistently so analysis pipelines run correctly, filtering out bad or unnecessary data, or validating data.
  4. Run some analysis pipeline via the command line, through Python scripts, R scripts, bash, etc. Batch processing is one method that allows researchers to analyze large datasets: break the data into smaller pieces, schedule each piece on a different compute node as a job (analysis script), and combine the results once the jobs complete.
  5. Visualize what happened during the analysis: make graphs, time series, etc. that help others understand what happened during the analysis stage.
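
As a concrete sketch of step 3, a command-line clean-up might filter out malformed rows and then validate the result. The file name and column layout below are invented for illustration only:

```shell
# Create a toy CSV standing in for raw data on the cluster
# (file and column names are hypothetical).
cat > raw_measurements.csv <<'EOF'
id,temp_c,reading
1,21.5,ok
2,,ok
3,19.0,ok
4,abc,bad
EOF

# Filter out rows whose temp_c field (column 2) is missing or
# non-numeric; NR == 1 keeps the header line.
awk -F, 'NR == 1 || $2 ~ /^-?[0-9]+(\.[0-9]+)?$/' \
    raw_measurements.csv > clean_measurements.csv

# Quick validation: count how many data rows survived.
tail -n +2 clean_measurements.csv | wc -l
```

The same idea scales up: the filtering rule changes with the dataset, but the pattern of "normalize, filter, then sanity-check the row count" stays the same.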

NYU HPC

NYU Research Technology's High Performance Computing department maintains computing infrastructure that is available to the NYU community for research, teaching, and learning. The HPC team also provides classes (live and online) and support for using the clusters.

Below is a table showing the different compute infrastructure offered. There is also the Secure Research Data Environment service, available on a case-by-case basis: a custom secure environment for storing and analyzing sensitive data.

Cluster | Short info, Use cases

Greene - HPC Cluster
The Greene cluster is named after Greene Street in SoHo, a neighborhood in Lower Manhattan in New York City, near NYU. The cluster also has "green" characteristics: most of its nodes are water cooled, and it is deployed in a power-efficient data center.
  • General-purpose cluster suitable for the majority of computing and data analytics tasks, such as numerical simulations and AI.
  • Uses SLURM for job submission
  • Detailed hardware specs and software specs
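
Since Greene uses SLURM, submitting work typically means writing a batch script and handing it to `sbatch`. The resource requests, module name, and script paths below are placeholders for illustration, not Greene's actual configuration; check the HPC documentation and `module avail` for the real values:

```shell
#!/bin/bash
#SBATCH --job-name=clean-data       # name shown in the queue
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16GB
#SBATCH --time=02:00:00             # wall-clock limit (HH:MM:SS)
#SBATCH --output=slurm_%j.out       # %j expands to the job ID

# Load software (module names vary by cluster).
module load python/3.11

# Run the analysis script against data on the cluster's storage
# (paths are hypothetical).
python clean_data.py --input /scratch/$USER/raw/ --output /scratch/$USER/clean/
```

You would save this as, say, `clean_job.sbatch`, submit it with `sbatch clean_job.sbatch`, and monitor it with `squeue -u $USER`.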
Cloud

There are a few options for those who need cloud infrastructure for their work:

NYU HPC DATA TRANSFER GUIDES

The HPC team has also put together documentation on how to transfer large amounts of data to and from the NYU HPC:
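
As a rough illustration of what such a transfer looks like, the commands below copy data from a local machine to a cluster over SSH. The hostname and paths are placeholders; use the addresses given in the HPC documentation, and replace <NetID> with your own NetID:

```shell
# Copy a single file from your machine to the cluster's scratch space
# (hostname and paths are illustrative; see the HPC docs).
scp results.csv <NetID>@greene.hpc.nyu.edu:/scratch/<NetID>/

# rsync is usually preferable for large or repeated transfers:
# -a preserves permissions and timestamps, -v is verbose, and
# -P shows progress and lets interrupted transfers resume.
rsync -avP data/ <NetID>@greene.hpc.nyu.edu:/scratch/<NetID>/data/
```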