We’ve discussed the power of Word2Vec, a way to generate vector representations of words that can then be added, subtracted, and otherwise manipulated like vectors in space. But what does this really do for us? Why is it so important?
When it comes to data, the more the better, right? You’d think, then, when building a data matrix for a set of documents that it would be worth including every possible bit of available information. As it turns out, this is not 100 percent correct.
Say you want to classify a collection of professional journal articles, and your goal is to separate all the medical-related documents from all the ones that have nothing to do with doctors or nurses or hospitals and such. To represent the documents in a data matrix, you keep a record of the total count of every word used: for each document (represented in the rows of your matrix), you count up how many times every word in your vocabulary appears (recorded in the matrix columns).
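This counting scheme, often called a bag-of-words matrix, can be sketched in a few lines of Python. The tiny two-document corpus here is made up purely for illustration:

```python
from collections import Counter

# A toy corpus: two "documents" as raw strings (illustrative only).
docs = [
    "the nurse handed the doctor a scalpel",
    "the striker scored a goal in the match",
]

# The shared vocabulary forms the matrix columns.
vocab = sorted({word for doc in docs for word in doc.split()})

# One row of counts per document: how often each vocabulary word appears.
matrix = [[Counter(doc.split())[word] for word in vocab] for doc in docs]

for row in matrix:
    print(row)
```

Even with two short sentences, most cells in each row are already zero, which previews the sparsity problem discussed below.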
The problem is, representing every word in your vocabulary as an individual feature means you’re looking at a very high-dimensional data space. If you recall from our previous post, we discussed the curse of dimensionality: data in high-dimensional spaces with lots of columns—lots of features—for each data point behaves very differently from what we’re used to in a small number of dimensions. As you add new features (properties), the volume of data points needed to maintain the same data density increases exponentially. As such, you need to ensure every new feature you add is a valuable one, because the cost of acquiring and processing that exponentially large amount of data is considerable.
To make matters worse, for any document—even an especially long book—the words used represent only a tiny fraction of the words in your total vocabulary. This means you’re going to need hundreds of thousands or millions of dimensions to record your word-frequency counts—the count of each word in each document—and the vast majority of those counts are going to be zero.
This is what data scientists call sparsity—when you have a lot of features (the columns of your matrix) and very little information for each row for most of those features. Sparsity is the enemy of finding signal in your data, so as a data scientist, your goal is to get more robust, “more compressed,” denser data, and find the important features in that data that allow you to get better signal.
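A quick way to quantify sparsity is the fraction of zero cells in the count matrix. The toy matrix below stands in for a real one with hundreds of thousands of columns:

```python
# Sparsity: the fraction of zero entries in a document-term count matrix.
matrix = [
    [1, 0, 0, 3, 0, 0, 0, 2],  # toy count rows; real matrices have
    [0, 0, 4, 0, 0, 1, 0, 0],  # hundreds of thousands of columns
]

total = sum(len(row) for row in matrix)       # total number of cells
zeros = sum(row.count(0) for row in matrix)   # cells holding no information

print(f"sparsity: {zeros / total:.2%}")  # prints "sparsity: 68.75%"
```

In a realistic document-term matrix that figure routinely exceeds 99 percent, which is exactly why denser representations are worth pursuing.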
Let’s return to the medical document classification example we mentioned above. If we want to figure out if a document is a medical document, what we want to do is create a data space where our algorithm can carve out a boundary with all and only the medical documents on one side of that boundary, and all the documents that don’t relate to medicine on the other side.
The problem with sparse data is that while your algorithm may be able to carve out a slice that works for the data points you have, it might not be good for any future data points you might add. In other words, you don’t know if it’s capturing a pattern in the data, or it’s just capturing a slice that works for the data you’ve seen so far. This is why data density is so important. You want to have enough data so that adding a new data point isn’t going to change the game. To do that, you want to reduce the dimensionality.
Cutting Things Down To Size
So you want to reduce the dimensionality? The first thing you might think is “OK, I’m going to get rid of some of my dimensions. Some of them are important, some of them aren’t. Which ones should I choose?”
One thing you can do is pick the terms that are most different from each other. In the case of our medical document example, you look at all the medical documents and all the non-medical documents, and then look for terms that are present in one group and absent from the other. Maybe in the medical documents words like “doctor” and “scalpel” and “nurse” appear frequently, while they rarely show up in the non-medical documents. This is, in a sense, what happens in regularization.
Another technique is to simply eliminate the most common words from your data. Words like “the” and “a” and “and”—we call these “stop words”—are used in all contexts, so they don’t provide us any useful classification information. Similarly, words that are very rare provide very little information, so we eliminate them as well. It doesn’t matter if they’re incredibly relevant; there’s no reason to add a whole dimension for a word that shows up in only one out of every five or ten thousand documents.
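Both prunings can be driven by document frequency: how many documents a word appears in. Here is a minimal sketch with made-up thresholds (`max_df`, `min_df`) on a hypothetical three-document corpus:

```python
from collections import Counter

docs = [
    "the doctor and the nurse reviewed the chart",
    "the team won the game and the fans cheered",
    "a doctor and nurse prescribed the medicine",
]

# Document frequency: in how many documents does each word appear?
df = Counter(word for doc in docs for word in set(doc.split()))
n_docs = len(docs)

# Illustrative thresholds: drop words appearing in (almost) every document
# (stop words) and words appearing in too few documents (rare words).
max_df = 0.9
min_df = 2 / n_docs

kept = {w for w, c in df.items() if min_df <= c / n_docs <= max_df}
print(sorted(kept))  # only the mid-frequency, informative terms survive
```

With these thresholds, ubiquitous words like “the” and “and” fall out at the top, one-off words fall out at the bottom, and discriminative terms like “doctor” and “nurse” remain.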
But while these techniques are a decent start, they don’t make significant headway in cutting a very large dimensional space—upwards of a hundred thousand or a million dimensions—down to something much smaller. This is when we turn to Word2Vec.
As we discussed before, Word2Vec is an algorithm that takes every word in your vocabulary and turns it into a unique vector that can be added, subtracted, and manipulated in other ways just like a vector in space. When you apply Word2Vec to every term in an entire document, the word vectors of those terms can be combined in different mathematical ways so that we end up with a single aggregate vector that represents the document’s total topic direction (a process of this sort is sometimes called Doc2Vec).
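One common way to combine word vectors into a single document vector is simply to average them. The sketch below uses made-up three-dimensional vectors for clarity; real Word2Vec embeddings are learned from data and typically have hundreds of dimensions:

```python
# Toy "word vectors" (invented for illustration; real embeddings are learned).
word_vecs = {
    "doctor": [0.9, 0.1, 0.0],
    "nurse":  [0.8, 0.2, 0.1],
    "chart":  [0.7, 0.0, 0.2],
}

def doc_vector(words, vecs):
    """Average the known word vectors to get one aggregate document vector."""
    known = [vecs[w] for w in words if w in vecs]
    n = len(known)
    return [sum(dim) / n for dim in zip(*known)]

print(doc_vector("the doctor and nurse updated the chart".split(), word_vecs))
```

Averaging is only one aggregation choice; weighted schemes and the paragraph-vector approach usually associated with the name Doc2Vec are alternatives, but the idea is the same: many sparse word dimensions collapse into one dense document vector.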
Doc2Vec lets you take a document represented in a huge dimensional space (millions of dimensions) and re-represent it in a suitably small data space (e.g. mere hundreds of dimensions). The trick to this process is figuring out just how much to compress your matrix’s dimensionality. If you compress down too much, you’ll end up scrunching up your data too closely, not leaving enough space to carve between the important differences in your data points. What you want is a nice balance where your documents cluster together with other similar documents, but remain spread out significantly enough to discern one from another.
Let’s look again at our medical document example. Applying Doc2Vec to the collection of documents, each document is now a vector in a vector space. If we’ve tuned our dimensionality well, we’ll find that all the medical documents—represented as vectors whose aggregate topic is something medical-related—will be located in a single region of the space, while all the non-medical documents will sit elsewhere. Furthermore, we now have enough density of data points that the boundary drawn by our machine learning algorithm, based upon where those data points lie in the vector space, will be the “true boundary” that separates medical and non-medical document vectors. Finding this “true boundary” means that it not only successfully separates the documents we used to train the algorithm but also marks the division that will separate new medical documents from non-medical ones.
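That clustering can be illustrated with cosine similarity on hypothetical two-dimensional document vectors (real ones would have hundreds of dimensions). The only point is that similar documents end up closer together than dissimilar ones:

```python
import math

def cosine(u, v):
    """Cosine similarity: 1.0 for same direction, 0.0 for orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical compressed document vectors after a Doc2Vec-style step.
med_1 = [0.9, 0.1]   # a medical document
med_2 = [0.8, 0.2]   # another medical document
sports = [0.1, 0.9]  # a non-medical document

print(cosine(med_1, med_2) > cosine(med_1, sports))  # prints "True"
```

A classifier drawing a boundary in this space has an easy job: the two medical vectors point in nearly the same direction, and the sports vector points somewhere else entirely.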
In other words, we’ve defeated sparsity and found the signal in our data. Thanks, Doc2Vec.
This post was written with contributions from Mike Tamir.
Want more data science tutorials and content? Subscribe to our data science newsletter.