The popular “Intro to Spark for Data Science” weekend workshop is back!
After completing this weekend workshop, you’ll be better prepared to use Apache Spark for real projects and problems on your own. We’ll use Spark to power product recommendations and natural language processing tasks. In just 2 days, you will take your data skills to the next level by using Spark to build data pipelines.
Workshop instructors will be on hand all weekend to teach, live code, and help debug as you work through the course materials.
This is an introductory course on Spark for anyone who has a personal or professional interest in data science.
We don’t assume you know anything in particular about Spark. All you need is a working knowledge of programming, a Mac or Linux laptop, and a readiness to learn.
If you don’t have a Mac or Linux computer, we can provide a Galvanize workstation for your use during the workshop.
The more you know before the course, the more you’ll get out of it, so we do recommend the pre-course materials below:
March 18 & 19, 2017, 9:00 AM – 5:00 PM (Lunch & snacks provided)
In this two-day, in-person, hands-on Spark course, we will:
Jeff is a Senior Data Scientist and Instructor in the Galvanize Data Science Immersive program. Prior to joining Galvanize, Jeff was an Assistant Professor at one of the leading engineering schools in France. He managed large-scale multidisciplinary research projects in partnership between industry and academia. He has used Spark and Natural Language Processing for mining consumer sentiment and brand perception from user comments, and for mining concepts from scientific papers.
Miles is a Data Scientist and Associate Instructor in the Galvanize Data Science Immersive program. Before joining Galvanize, Miles worked as a systems/network engineering consultant and taught college-level classes in IT infrastructure and security. Miles has contributed to the development of widely recognized certification exams for server engineers. Miles is a graduate of the University of Washington and is a co-organizer of the local Python community in Seattle.
Spark is a powerful, open-source processing engine for data distributed across large clusters. It is optimized for speed and ease of use: by caching data in memory, it can run some distributed algorithms up to 100x faster than Hadoop MapReduce. Spark can be used for batch processing and for processing data in near real time.
Spark is used by data professionals at Amazon, eBay, NASA, and 200+ other organizations, and its community is one of the fastest growing in the world.
This workshop series is for anyone who wants to use Spark to analyze data at scale.
Course examples and exercises will use Python and PySpark. Basic working knowledge of the Python programming language (i.e. the ability to write scripts and functions in Python) is required.
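As a rough gauge of the Python level described above, the sketch below is the kind of short script with functions you should be comfortable reading and writing. It is an illustration only and is not taken from the workshop's pre-course materials; the function names `tokenize` and `word_counts` are made up for this example, and it uses plain Python rather than PySpark.

```python
from collections import Counter


def tokenize(line):
    """Lowercase a line and split it into words on whitespace."""
    return line.lower().split()


def word_counts(lines):
    """Count how often each word appears across an iterable of lines."""
    counts = Counter()
    for line in lines:
        counts.update(tokenize(line))
    return counts


if __name__ == "__main__":
    # Hypothetical sample data, used only to exercise the functions.
    sample = ["Spark is fast", "Spark is open source"]
    for word, count in word_counts(sample).items():
        print(word, count)
```

If you can follow how the two functions fit together and could change the script to, say, ignore punctuation, you likely meet the Python prerequisite.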
Command Line & Version Control:
Basic knowledge of Unix commands (i.e. command line) is required.
We will use GitHub for sharing and maintaining code. If you do not already have a GitHub account, you will need to create one before class begins.
Students are expected to bring a personal computer running the Mac OS X or Linux operating system, with at least 4 GB of RAM and at least 6 GB of free disk space.