Apache Spark

Resources to Start Learning


At Galvanize we want to do everything we can to help every learner on their journey! As the amount of data is ever increasing, more and more data scientists and analysts find themselves needing to scale up their analyses, but the complexity involved in working with a cluster can be daunting.

Apache Spark is a unique framework that strikes a subtle balance between developer accessibility, through rich APIs in Python, R, and SQL (in addition to the native Scala and Java), and performance, by leveraging fault-tolerant, distributed, in-memory data structures. We have compiled here a set of resources intended to get data scientists and developers up and running with Apache Spark with as few headaches as possible.

Getting Set Up

Installing and configuring a new framework always has its complications, especially given the multiplicity of operating systems, hardware, and versions. We have created a step-by-step guide to get you set up quickly no matter your system.

Data Science Applications Workshop

This workshop teaches practicing data scientists the best practices of using Spark in the context of a data scientist's standard workflow. By leveraging Spark's APIs for Python, R, and SQL to present practical applications, it lowers the barrier to entry and makes the technology far more accessible.

The code, exercises, and data are freely accessible and released under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) license. The workshop has also been produced as an online video series, available on InformIT as well as Safari Online, with introductory videos available for free.