Traditional text classification can be difficult to scale: as the number of classes in your taxonomy grows, so does the amount of training required. Moreover, with taxonomies of thousands or tens of thousands of classes, it becomes increasingly difficult (or expensive) to gather a sufficient volume of labeled text examples for each taxonomic class.
One solution to this problem is to use Word2Vec to process your unstructured text data. Word2Vec (W2V) is an algorithm that takes every word in your vocabulary (that is, the set of words appearing in the text you are classifying) and turns it into a unique vector that can be added, subtracted, and otherwise manipulated just like a vector in space.
At a high level, the W2V embedding of your vocabulary into a vector space is a kind of “side effect” of training certain neural network algorithms designed for tasks like autocompletion or predicting likely adjacent words in a document. As the neural net “reads” through document after document, learning how to transform the vocabulary into a format its “hidden layer(s)” can process in order to predict the most likely missing words, the algorithm learns something about how the terms in the vocabulary relate to one another, based on how frequently they appear together. These patterns end up encoded in a matrix that can map any word in the vocabulary to a vector (a point) in a much lower-dimensional vector space.
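The intuition that words are characterized by the words they appear near can be sketched without a neural network at all. The toy below (not the actual Word2Vec training procedure, which learns dense vectors via a neural net) builds raw co-occurrence count vectors over a made-up two-sentence corpus; words used in similar contexts come out with similar vectors:

```python
from collections import Counter

def cooccurrence_vectors(sentences, window=2):
    """Map each word to a vector of co-occurrence counts with every vocab word."""
    vocab = sorted({w for s in sentences for w in s})
    counts = {w: Counter() for w in vocab}
    for sentence in sentences:
        for i, word in enumerate(sentence):
            lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][sentence[j]] += 1
    # Each word's vector has one dimension per vocabulary word.
    return {w: [counts[w][v] for v in vocab] for w in vocab}

corpus = [
    ["the", "king", "rules", "the", "land"],
    ["the", "queen", "rules", "the", "land"],
]
vectors = cooccurrence_vectors(corpus)
# "king" and "queen" share identical contexts here, so their count
# vectors come out the same -- the frequency signal W2V compresses.
```

Word2Vec goes further by compressing these high-dimensional frequency patterns into short dense vectors, but the underlying signal it learns from is the same.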
Once embedded, these word-vectors end up displaying very interesting relationships between one another. Since vectors can be added and subtracted, we can ask questions by creating word vector equations, like what happens if we take the word vectors for King, Man, and Woman as follows:
King + (Woman - Man) = ??
When you take the vector for King and add to it the vector for Woman minus the vector for Man, what do you get?
King + (Woman - Man) = Queen
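You can see the arithmetic work with tiny hand-made vectors. Real W2V vectors have hundreds of dimensions learned from data; the 2-d values below (one “royalty” axis, one “gender” axis) are invented purely for illustration:

```python
import math

# Hypothetical 2-d embeddings: [royalty, gender]
embeddings = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
}

def cosine(u, v):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# King + (Woman - Man)
target = [k + (w - m) for k, w, m in zip(embeddings["king"],
                                         embeddings["woman"],
                                         embeddings["man"])]

# Find the vocabulary word closest to the target, excluding the
# words used in the equation itself (the standard analogy setup).
nearest = max(
    (w for w in embeddings if w not in {"king", "man", "woman"}),
    key=lambda w: cosine(embeddings[w], target),
)
print(nearest)  # → queen
```

In a real trained model the match is rarely exact, so the answer is taken to be the nearest vector by cosine similarity rather than an exact equality.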
This works because what the neural network learned about the relative frequencies of terms got encoded into the W2V matrix. Analogous relationships, like the difference in the relative occurrences of Man and Woman, end up matching the difference in the relative occurrences of King and Queen in certain ways that W2V captures.
Doc2Vec is an application of Word2Vec that expands the tool to work on entire documents, such as an article. In its simplest form, “naive” Doc2Vec takes the Word2Vec vectors of every word in your text and aggregates them by taking a normalized sum or arithmetic average. As you add word vectors together over and over, most of the terms behave as noise and cancel each other out, a random walk, so to speak. But while the walk is mostly random, it has a bit of drift. By looking at that drift, the aggregate direction of the text’s vectors, you recover the overall topic direction.
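The naive averaging scheme is only a few lines of code. Here is a minimal sketch, with invented toy embeddings (in practice these would come from a trained Word2Vec model); note how frequent filler words near the origin contribute little, so the average drifts toward the document’s topic terms:

```python
def doc_vector(words, embeddings):
    """Naive Doc2Vec: the arithmetic mean of the words' vectors."""
    vecs = [embeddings[w] for w in words if w in embeddings]
    n = len(vecs)
    return [sum(dim) / n for dim in zip(*vecs)]

# Hypothetical 2-d word vectors, invented for illustration.
embeddings = {
    "king":       [1.0,  1.0],
    "revolution": [-1.0, 0.2],
    "paris":      [-0.8, -0.5],
    "the":        [0.01, 0.02],  # filler words sit near the origin
}

doc = ["the", "king", "the", "revolution", "paris"]
vec = doc_vector(doc, embeddings)
# The repeated "the" barely moves the average; the document vector
# drifts toward the content words' region of the space.
```

Two documents can then be compared by the cosine similarity of their averaged vectors, which is the usual way this naive representation gets used for retrieval or clustering.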
For example: we might imagine averaging all the words in the book A Tale of Two Cities. If you convert the entire text to Word2Vec vectors, the direction of the resulting single aggregate vector will drift towards embedded concepts such as “class struggle,” representing the major theme of the book.
Contributions by Mike Tamir, Chief Data Science Officer at Galvanize, and Bo Moore, Storyteller at Galvanize.
Want more data science tutorials and content? Subscribe to our data science newsletter.