Data scientists are often drawn to the profession excited by the chance to spend their days on cutting-edge research and development and working with fantastic new machine learning algorithms. While this is indeed a fun and exciting part of the job, as most data scientists in the field will tell you, much of one’s time is spent cleaning, transforming, and engineering the data.
The common wisdom is that, given enough data, most standard algorithms will be able to (eventually) detect the signal. This is the thesis that in large N, when you have enough data points, all machine learning algorithms tend to converge on the same answer. They’re all going to be able to separate out the ‘Yes’s from the ‘No’s. But some algorithms are going to be able to do it quicker (and with a lot less training and data) than others—the best of these able to do it many orders of magnitude more quickly.
This is why data scientists spend so much time on managing their data directly, transforming it, making sure that they have the right features, and making sure that the data is best prepared before feeding it into an algorithm—to get that good signal. One of the fundamental parts of getting good signal—perhaps the first thing that the data scientist has to pay attention to—is the number of features itself. You sometimes hear this referred to as the “number of dimensions” of your data space.
When you think of multiple dimensions, what comes to mind? A lego videogame? A crazed scientist and his grandson traveling through funky alternate universes? For some, maybe. But most people probably think of three-dimensional space---that is, how we perceive objects as having three spatial dimensions: length, width, and height. Physicists like to refer to time as a fourth dimension of variation: things can be located---not just by their latitude, longitude and how high up they are---but also by what date or moment they were there. But what about five dimensions? Six? What about 10, or 100?
This is the world data scientists live in. A data point can have hundreds of different properties to it, and each of those is going to be a “dimension of variation.” Ambient temperature. Color. Humidity. Time of day. Every different property you can give to a data point---every way you can differentiate it---is another dimension to that data point. So when we talk about the different features (properties), or the number of columns in a data matrix, we're talking about the dimensionality of points in the “data space.”
So why does this matter? The more information (the more features) you have about a particular data point, the better---right? This is true to a certain degree, so long as you're not adding redundant data or features that aren't useful for separating what's important and what's unimportant. This is what's known as the curse of dimensionality: as you increase the number of dimensions, the need for data points increases as well. This might seem obvious, but the problem---the curse, if you will---is that as the number of dimensions increases linearly, the need for data points increases exponentially.
The trick here is to think about the following question:
How many more data points does it take to keep the data space just as full if we add another feature?
Or as a data scientists would say:
How many more points would it take to keep the same density of data in the new higher dimensional space?
Let’s consider an example. Say we are looking at the data for automobiles. Let's also consider only one dimension (one feature) when we are representing these automobiles: car vs. truck. In order to keep our “space” 50 percent full, we only need one data point. If we have a car data point in our space, then the space is 50 percent full or if we have a truck data point in our space, then the space is 50 percent full.
But what happens if I add a dimension? Let's also consider if the automobiles have sunroofs or not. Now there are four possible kinds of automobiles in our “data space:”
- Cars without sunroofs
- Cars with sunroofs
- Trucks without sunroofs
- Trucks with sunroofs
So now we can ask our question again: How many data points will it take to keep this data space 50 percent full. The answer, of course, is two---because we need to fill out two of the four possible automobiles in the data space. While two is not terribly many data points, observe, it is double the number of data points we needed for when we had only one dimension to consider.
Let's do it again, and consider if the vehicle is yellow or not. Now we have eight possible kinds of automobiles in our “data space:”
- Yellow cars without sunroofs
- Yellow cars with sunroofs
- Yellow trucks without sunroofs
- Yellow trucks with sunroofs
- Non-yellow cars without sunroofs
- Non-yellow cars with sunroofs
- Non-yellow trucks without sunroofs
- Non-yellow trucks with sunroofs
And now we can again ask how many data points it will take to fill 50 percent of the space. The answer is eight divided by two, or four data points (again double the amount we needed with 1 less dimension).
Note the pattern:
- For 1 dimension we needed 20 = 1 data point.
- For 2 dimensions we needed 21 = 2 data points.
- For 3 dimensions we needed 22 = 4 data points.
And if we keep going:
- For 4 dimensions we need 23 = 8 data points.
In general, to maintain 50 percent density in a space with N features we need 2N-1 data points. For example, we go from 11 dimensions to 12 dimensions, we go from needing 1024 data points to needing 2048 (an extra ~1000 data points). But what if we go from 21 to 22 dimensions? Things get out of hand quite quickly, even at only 20ish dimensions, an increase of paying attention to just one additional feature means we need over one million more data points to maintain the same data density in our space.
So why should we care so much about data density?
This has to do with the more general question central to doing machine learning as a data scientist: overfitting vs. underfitting. We will come back to this topic in greater detail in a subsequent post. For now, let's think about what a supervised machine learning algorithm is really up to.
Recall our car example; what if we want to build a classifier algorithm whose job it is to find out which cars tend to be bought by a given population and which don't. If we only have a few examples, then our classifier, might detect a pattern, but only because it hasn't looked at enough data points yet.
For instance, if we only have two data points e.g. we know that a yellow car with a sunroof was bought by this population, but a non-yellow truck with no sunroof was not, then how is our classifier supposed to tell what makes the difference between whether or not a purchase will be made. Maybe it could draw the line between being yellow vs. non-yellow, but it could also draw the line between having a sunroof vs. not having a sunroof, or between being a car or a truck.
In order to know which of these classification rules is the right one in general, we need more data points. We need our data space to be so filled up---the volume of data density needs to be so high---that a classifier looking at it to decide between “vehicles this population type will buy” vs. “vehicles this population type will not buy” will be able to find a general pattern that will also work for new data points. It needs to be able to generalize beyond the data it learns from to new data in the future, and the way that happens is by maintaining sufficient data density as it is training.
The problem, the so-called “curse,” is that while we want to get more features to look at in order to figure out the difference between a “yes” (will buy) and a “no” (won't buy), doing so can get expensive. More and more data is required as we add each new dimension, and data is expensive. Even then, if a data scientist is lucky enough to have plenty of data, it's still expensive to process. Hence, data scientists must develop skills that allow them to analyze data with more features while strategically looking only at the dimensions that are most rich with information, or are most important to the questions that they are trying to ask.
How Do We Manage the Curse of Dimensionality?
Some of the most popular methods of managing curse of dimensionality are question-independent---that is, they don't care about the question your classifier is trying to answer, they simply reduce the dimensionality to where most of the data is located. Dimensionality reduction uses linear algebra to reduce the dimensions you're looking at to the part of the data space where your data is most dense, ignoring parts where the data is sparse. PCA and SVD are two such techniques for accomplishing this. These techniques (which are more or less mathematically equivalent), work by looking for areas---called sub-spaces---of the entire data space where most of the data points tend to vary.
For example, let's say most of our data actually was for only yellow automobiles in the above example, with only a small number of non-yellow. By paying attention to the yellow/non-yellow dimension, we would be looking at a feature without much variation, but would still have to pay the cost of the extra dimensionality. SVD and PCA find individual features and a sort of “combo-features” that tend to come together (e.g. 4-wheel drive and trucks vs. non-4-wheel drive and cars), and score those dimensions from best to worst. Combo-feature dimensions that are densely populated get high scores, and ones where there is not a lot of information get low scores. The data scientist can then select the number of dimensions, by picking only the ones with the highest scores.
There are ways of managing the curse of dimensionality that do pay attention to the question being asked. As your machine learning classifier is trying to figure out how to optimize in a supervised learning context, it can use a family of techniques known as regularization to figure out which data points are important. Essentially, regularization penalizes the classifier for paying attention to features that don't help much toward answering the question at hand, while favoring features that provide relevant information. This filters out the irrelevant and sparse sections of your data, leaving your classifier to only pay attention to the robust features of your dataset.
More advanced tactics that start to delve into the realm of “representation learning” also exist. The matrix factorization and latent feature detection techniques that drive much of the technology behind recommendation engines are one example of this. Continuous word representations such as Word2Vec or GloVe are likewise focused on finding more “compact” (low dimensional) data spaces in which data points can be represented without losing key information.
While data cleaning is not always the most exciting aspect of a data scientist’s day, techniques like these allow a data scientist to treat the problem of feature engineering with the same principles we find so exciting in machine learning.