Data Services Class Descriptions

Information, materials, and schedules for all currently offered Data Services classes.
This tutorial provides a basic understanding of Apache Spark and its role in the Hadoop ecosystem. It includes hands-on examples of how to use Apache Spark and step-by-step instructions for running Spark jobs on NYU's Dumbo (Hadoop) cluster.
Software: Apache Spark
Duration: 120 min

Room description:

Some tutorials are held remotely and require NYU single sign-on to access, while others are held in person, without a remote component. Please note the correct modality and location of the tutorial when registering.

Prerequisites:
Skills Taught / Learning Outcomes:
  • Brief overview of the Spark ecosystem
  • RDDs, transformations, and actions
  • Running spark-shell and pyspark
  • Compiling Java code with Maven
  • Submitting jobs with spark-submit
  • Spark SQL
  • Accessing a Hive database from Spark
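The core RDD idea covered above is that transformations (map, filter) are lazy and only build a lineage, while an action (collect, count) triggers actual computation. A minimal sketch of that model in plain Python, using generators in place of RDDs so no Spark installation is assumed; in PySpark the equivalent pipeline would be something like `sc.parallelize(data).map(...).filter(...).collect()`:

```python
# Sketch of Spark's lazy-transformation model using plain Python generators.
# Nothing here requires Spark; the generator chain stands in for RDD lineage.

data = range(1, 11)

# "Transformations" build a lazy pipeline; no computation happens yet.
squared = (x * x for x in data)               # analogous to rdd.map(lambda x: x * x)
evens = (x for x in squared if x % 2 == 0)    # analogous to .filter(lambda x: x % 2 == 0)

# An "action" forces evaluation of the whole pipeline,
# the way rdd.collect() or rdd.count() would in Spark.
result = list(evens)
print(result)  # [4, 16, 36, 64, 100]
```

The practical consequence, which the tutorial demonstrates on the cluster, is that a long chain of transformations costs nothing until an action runs, so Spark can plan and distribute the whole job at once.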
Class Materials: Link to Class Materials
Related Classes:

Introduction to Unix/Linux and the Shell

Big Data Tutorial 1: MapReduce

Big Data Tutorial 2: Using Hive

Additional Training Materials:

Available via LinkedIn Learning (NYU NetID required):

Feedback: bit.ly/feedbackds


Upcoming sessions for this tutorial