Apache Spark for Beginners




Apache Spark is a fast, parallel cluster-computing engine that supports interactive computing on large-scale datasets in popular languages including Python, R, SQL, Scala and Java.

This training session will cover the basics: how to import data into an Apache Spark cluster, followed by an overview of some analytic tools that can be used with Spark, including Python (PySpark) in Jupyter notebooks and R (SparkR) for interactive data analysis.

To illustrate the tools, we will show how Spark clusters can be used to perform analysis on both semi-structured data (for applications such as text analysis and genomics) and tabular/columnar formatted data (such as an SQL database).
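
As a rough illustration of the two kinds of analysis described above, here is a minimal PySpark sketch. It is not material from the session itself: the sample sentences, the `people` table, its column names, and the helper names `tokenize` and `run_demo` are all hypothetical, and it assumes a local Spark installation rather than a cluster.

```python
import re


def tokenize(line):
    """Lower-case a line of text and split it into words."""
    return re.findall(r"[a-z']+", line.lower())


def run_demo():
    # pyspark is imported here so tokenize() can be tried without Spark installed.
    from pyspark.sql import SparkSession

    # local[*] runs Spark on this machine; a real cluster would use its own master URL.
    spark = (
        SparkSession.builder.master("local[*]")
        .appName("spark-intro-sketch")
        .getOrCreate()
    )

    # Semi-structured data: free text handled as an RDD of lines (hypothetical sample).
    lines = spark.sparkContext.parallelize([
        "Spark is fast",
        "Spark supports Python and R",
    ])
    word_counts = (
        lines.flatMap(tokenize)
        .map(lambda word: (word, 1))
        .reduceByKey(lambda a, b: a + b)
        .collect()
    )
    print(sorted(word_counts))

    # Tabular data: a DataFrame queried with SQL (hypothetical rows and column names).
    df = spark.createDataFrame(
        [("alice", 34), ("bob", 45), ("carol", 29)],
        ["name", "age"],
    )
    df.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 30 ORDER BY name").show()

    spark.stop()
```

Calling `run_demo()` from a Jupyter notebook or Python shell with PySpark installed should run both steps; on a shared cluster the `master("local[*]")` setting would normally be replaced by whatever the cluster provides.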

We will also look at what it takes to set up a Spark cluster, introduce OIT’s Spark services, and work through some hands-on data analysis that illustrates how to optimize compute jobs for Spark.

Subjects: intermediate, research computing, spark



Status: Archived
Date: Monday, February 11th, 2019
Time: 1:00pm - 3:00pm
Location: TEC - Classroom
Leader: Mark McCahill
Enrolled: 14 of 30
The TEC (Technology Engagement Center) Classroom is in the Telcom Building. To get there, walk down the stairs from Perkins/Bostock as if you were heading towards CIEMAS. When you reach the road halfway down, turn right; Telcom is the building ahead of you on the left. The Technology Engagement Center is on the first floor, so just enter through the front of the building. The classroom is to the right of the entrance, adjacent to the central circular room. http://maps.duke.edu/map/?id=21&mrkIid=2765