The ‘sexy’ new role of data scientist has driven tremendous action from universities. New data science programs are popping up regularly at the undergraduate, graduate, and professional education levels. Schools have typically responded by building on multiple existing programs (computer science, math, statistics, business) and filling in the gaps as needed to quickly create new certificate and degree programs rooted in their current strengths and capabilities.
The result is that it’s very hard to compare programs, and even harder to pin down exactly what constitutes data science. Is data science simply a field of applied machine learning fundamentals? Or is a broader definition needed to take full advantage of modern data-rich applications?
Data science is better thought of as a broad field with numerous subfields, not unlike physics, which has five major fields (applied, astro, atomic/molecular/optical, condensed matter, and particle) that in turn have numerous subfields. But while physics is a mature discipline, data science is nascent. We expect the major fields and initial subfields to emerge quickly as we strive to understand data science. Because data science crosses numerous existing fields, we’re already seeing a variety of programs with different focuses:
- Applied Data Science fits the scope of the majority of initial programs, such as Case Western’s BS degree.
- Data science specializations within existing fields, such as medical and bio informatics and actuarial science. The National Institutes of Health, for example, has created a new group focused on data science.
- Computational math, science, and engineering programs, often taking root in high-performance computing centers or created to meet the emerging need.
- Computer science-led labs such as the Berkeley AMPlab (Algorithms, Machines, People).
But a major field of data science that’s receiving little notice is data engineering. Data engineering is at least a distinct subfield of data science, if not its own field altogether. Let’s explore why this is the case.
What is Data Engineering?
Those on the leading edge of data engineering are not only analyzing data, but are increasingly implementing solutions that change how businesses are run—in software by machines. Recommendation engines, fraud detection, real-time pricing and bidding—the list of possibilities is endless.
These solutions often combine data from multiple sources and must run in near-real time at scale, supporting thousands, if not millions, of simultaneous users, all while remaining secure and protecting individual privacy. These are not the skills of the data science modeler. They’re the skills of the data engineer.
Data engineers are sometimes pigeonholed as data wranglers or plumbers who do nothing more than clean up data and make it ready to analyze. This is a key part of the work, but it’s only a component.
The data engineer’s work requires a broad range of knowledge and skills. Here are some job requirements you might find for a data engineering position:
- Extract, clean, and integrate data (wrangling).
- Bridge between data science models and production systems.
- Implement machine learning and computational algorithms at scale.
- Put the right data system to work for the job at hand. This requires a deep understanding of transactional ACID databases along with a growing variety of NoSQL databases, including JSON document stores, graph databases, column stores, and partitioned row stores.
- Demonstrate a deep understanding of distributed computing and database considerations for consistency, scalability, and security.
- Protect customer privacy and anonymity.
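A few of these requirements can be made concrete with a small sketch. The example below cleans and integrates records from two made-up sources and replaces the customer identifier with a keyed hash before the data is handed on. Everything in it is an assumption for illustration: the field names (`email`, `region`, `amount`), the helper functions, and the placeholder key are hypothetical, and keyed hashing is pseudonymization rather than full anonymization.

```python
import hashlib
import hmac

# Hypothetical placeholder; a real system would load this from a secrets manager.
SECRET_KEY = b"replace-with-a-managed-secret"


def pseudonymize(customer_id, key=SECRET_KEY):
    """Replace an identifier with a keyed SHA-256 hash (pseudonymization only)."""
    digest = hmac.new(key, customer_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def clean_email(raw):
    """Normalize whitespace and case so the two sources can be joined."""
    return raw.strip().lower()


def integrate(crm_rows, order_rows):
    """Join hypothetical CRM records with order records on the cleaned email."""
    totals = {}
    for order in order_rows:
        email = clean_email(order["email"])
        totals[email] = totals.get(email, 0.0) + float(order["amount"])

    result = []
    for row in crm_rows:
        email = clean_email(row["email"])
        result.append({
            "customer": pseudonymize(email),          # no raw email leaves this step
            "region": row["region"].strip().upper(),
            "total_spent": totals.get(email, 0.0),
        })
    return result


if __name__ == "__main__":
    crm = [{"email": " Ada@Example.com ", "region": "eu "}]
    orders = [
        {"email": "ada@example.com", "amount": "10.50"},
        {"email": "ADA@example.com", "amount": "4.50"},
    ]
    print(integrate(crm, orders))
```

At production scale the same steps would run on a distributed system rather than in a single process, but the shape of the work (normalize, join, protect identifiers) stays the same.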
While educators are hard at work creating data science programs, very few are focused on data engineering. Yet data engineer job openings actually outnumber data scientist job openings, according to LinkedIn (52,065 data engineers needed vs. 24,869 data scientists, as of December 4, 2015).
Where will the skilled data engineers of the future come from? Certainly not from typical undergraduate computer science or information systems programs. The typical undergraduate program offers one course in databases (often only as an elective), which is typically focused on older relational (RDBMS) thinking and system administration.
Master’s programs provide a few more options, but not many. Courses are typically offered as electives, and cover only a subset of the skills needed by a data engineer. Big data master’s programs that aren’t just about the data scientist are emerging, but they are comparatively rare (UCSD and Dundee, for example).
Working With Big Data
In the Galvanize data engineering course, we cover the state-of-the-art technology that makes data science possible with so-called “big data.” Big data is frequently described in terms of three traditional V’s: variety, volume, and velocity. With all three, it’s certainly the case that, to quote Philip W. Anderson, “more is different.” In each case, there’s a situation where traditional tools simply won’t work, on both a technological and a theoretical level. At the same time, opportunities open that would never have been possible in the old world of “small data.”
One of the things that makes living in the digital age so interesting is that we have access to so many different forms of data, and the capacity to cross-reference and evaluate them. Traditional science tends to focus on the data emitted from one sort of measuring device, but data science often contends with data from many different sources and of many different types. How do you predict stock prices from Twitter status updates? New approaches to how you use and combine different sources of data require a diverse set of skills that are taught in our program.
Big data has been described by some as “more data than can fit in memory.” This definition, while limited, is operationally useful. As soon as the data no longer fits in memory, traditional tools, from Excel to R, simply stop working. Python can be made to work, but it requires a fundamental shift in how one approaches the problem. The same goes for other languages. Similarly, traditional statistical analysis begins to break down. As N reaches into the millions, p-values frequently shrink toward zero even for negligible effects, providing ample opportunity for spurious correlations to lead to faulty conclusions. Both of these problems (the technological and the theoretical) can be sidestepped by sub-sampling, but that ignores the new opportunities that arise, which may be far more effective.
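Both halves of that argument can be illustrated with a minimal Python sketch. It assumes a generator standing in for a data source too large to load at once, uses Welford's online algorithm to compute the mean and variance in a single pass with constant memory, and then runs a simple z-test to show how a practically negligible effect becomes "significant" once N is large. The function names, the simulated effect size of 0.01, and the choice of a z-test are illustrative assumptions, not something prescribed by the course.

```python
import math
import random


def stream_values(n, true_mean, seed=0):
    """Yield n noisy measurements one at a time (stand-in for a huge file)."""
    rng = random.Random(seed)
    for _ in range(n):
        yield rng.gauss(true_mean, 1.0)


def running_mean_and_variance(values):
    """Welford's online algorithm: one pass over the data, constant memory."""
    count, mean, m2 = 0, 0.0, 0.0
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    variance = m2 / (count - 1) if count > 1 else float("nan")
    return count, mean, variance


def two_sided_p_value(count, mean, variance):
    """z-test of 'the true mean is 0' using the normal approximation."""
    z = mean / math.sqrt(variance / count)
    return math.erfc(abs(z) / math.sqrt(2))


if __name__ == "__main__":
    # The simulated effect (0.01 standard deviations) is tiny by construction.
    for n in (1_000, 1_000_000):
        count, mean, variance = running_mean_and_variance(
            stream_values(n, true_mean=0.01)
        )
        print(n, round(mean, 4), two_sided_p_value(count, mean, variance))
```

The streaming function never holds more than one value in memory, which is the shift in approach described above. The p-value, meanwhile, says more about N than about whether the effect matters.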
The appearance of large volumes of data makes permutation analysis and cross-validation easy, making possible analyses that are both more ecologically valid and easier to interpret. However, in order to do this, we need tools that allow us to work with huge amounts of data that won’t fit on a laptop.
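As a rough illustration of permutation analysis, the sketch below estimates a p-value for the difference in means between two groups by repeatedly shuffling the group labels. It is a minimal single-machine version under stated assumptions: the function name, the number of permutations, and the toy data are hypothetical, and at real scale the shuffles would be distributed across a cluster.

```python
import random


def mean(xs):
    return sum(xs) / len(xs)


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Estimate how often shuffled labels give a mean difference at least
    as large as the observed one (two-sided)."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign group labels at random
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1

    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


if __name__ == "__main__":
    control = [12.1, 11.8, 12.4, 12.0, 11.9, 12.2]
    variant = [12.9, 13.1, 12.7, 13.0, 12.8, 13.2]
    print(permutation_p_value(control, variant))
```

The result is easy to interpret because it makes no distributional assumptions: it simply reports how often chance alone produces a difference as large as the one observed.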
Many companies today, particularly web companies (but increasingly other industries as well), have access to huge amounts of data that they don’t know what to do with and can’t make use of. Simply having terabytes (or petabytes) of access logs in a Hadoop cluster does not make a data-driven organization. That data is useful only if it can be made accessible in a meaningful way. Enter the data engineer. A data engineer makes it possible to take huge amounts of data and translate it into insights. The important phrase here is “makes it possible.” To many data scientists and most data analysts (and nearly all managers), huge volumes of data are, in themselves, quite useless. A savvy data engineer can provide the interface that makes that data useful.
High velocity in data also provides a qualitative shift in data science, but again, only if the engineering is in place to make it possible. Some traditional approaches to data science are already able to make use of huge amounts of data, but they tend to do so as though it were a one-time event. This is fine if you are trying to understand what happened last year, but is not useful if you are trying to make decisions about what’s happening right now.
Data engineering allows us to tap into a stream of data as it’s happening and do something about it. In some cases, that simply means providing a fault-tolerant, scalable architecture that can remain responsive while millions of users try to access a service at the same time (something that every successful web company must deal with). But to take it a step further, we see the potential of having systems learn and adapt to their users in a way that was not previously conceivable. This is already happening at large companies like Google and Facebook, but the technology is available to all, if only there were people trained to make it possible. Enter the data engineer.
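To make "doing something about a stream as it's happening" a little more concrete, here is a minimal sketch of a sliding-window counter that flags a key (a user, a card, an IP address) as soon as it produces too many events within a time window. The class name, the threshold, and the assumption that timestamps arrive in non-decreasing order per key are all hypothetical; a production system would typically sit on a distributed stream processor rather than a single in-memory object.

```python
from collections import defaultdict, deque


class SlidingWindowCounter:
    """Track recent event times per key and flag bursts as they happen."""

    def __init__(self, window_seconds, threshold):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._events = defaultdict(deque)  # key -> timestamps inside the window

    def observe(self, key, timestamp):
        """Record one event; return True if the key now exceeds the threshold.

        Assumes timestamps for a given key arrive in non-decreasing order.
        """
        times = self._events[key]
        times.append(timestamp)
        # Evict events that have fallen out of the window.
        while times and times[0] <= timestamp - self.window_seconds:
            times.popleft()
        return len(times) > self.threshold


if __name__ == "__main__":
    counter = SlidingWindowCounter(window_seconds=10, threshold=3)
    stream = [("alice", 0), ("alice", 1), ("bob", 1), ("alice", 2), ("alice", 3)]
    for user, ts in stream:
        if counter.observe(user, ts):
            print(f"burst detected for {user} at t={ts}")
```

Each event is handled in amortized constant time and only the events inside the window are kept, so the decision is available the moment the event arrives rather than in next year's report.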