In this post, I’ll outline a project I did to experiment with a neural-network-based machine learning technique commonly called Doc2Vec. As the name implies, the algorithm takes a collection of documents as input and, as output, positions each document in the collection as an element in a vector space. The dimension of this vector space is one of the model parameters, so you can be as granular or as coarse as desired in your vector description of each document (though as the dimension of the vector space increases, the model becomes harder to train). This vector can then be used as a feature in other models. There are simpler ways to encode documents as vectors (naive bag of words, TF-IDF, etc.), but these methods have the major disadvantage of throwing away the ordering of the words in the documents, focusing only on the counts.
set in sentence, the in just order all to we of you a it. the know meaning that interpret than more need unordered words of
Oh I’m sorry, I meant: We all know that in order to easily interpret the meaning of a sentence, you need more than just the unordered set of words in it. A major advantage of Doc2Vec is that it takes into account the surrounding context of words and sentences when doing the encoding.
Encoding a document as a vector is a natural counterpart for a clustering algorithm such as KMeans, since once you’ve embedded two documents in a vector space, there is a well-defined “distance” between them: the Euclidean distance between the two points. I wanted to see how well this relatively simple two-step approach worked for clustering articles. A real-life use case for this could be a recommendation engine that presents similar articles after you have finished reading the current one. Now, onto the code!
I needed a plentiful supply of high-quality articles about a variety of topics, so I chose The New York Times. They have a convenient API that allowed me to scrape a large collection of metadata about their past content.
Using the requests Python library, I hit their Archive API to get URLs for all of their articles over a fixed period of time, then used BeautifulSoup, a great Python library for parsing HTML, to pull the full text of each article.
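A minimal sketch of that scraping step might look like the following. The Archive API endpoint shape is the documented one, but the API key placeholder and the HTML selector for the article body are assumptions; the real NYT markup may call for something more specific.

```python
import requests
from bs4 import BeautifulSoup

API_KEY = "YOUR_NYT_API_KEY"  # assumption: an Archive API key from developer.nytimes.com
ARCHIVE_URL = "https://api.nytimes.com/svc/archive/v1/{year}/{month}.json"

def get_article_urls(year, month):
    """Fetch one month of metadata from the Archive API and return the article URLs."""
    resp = requests.get(ARCHIVE_URL.format(year=year, month=month),
                        params={"api-key": API_KEY})
    resp.raise_for_status()
    docs = resp.json()["response"]["docs"]
    return [doc["web_url"] for doc in docs if doc.get("web_url")]

def get_article_text(url):
    """Download an article page and pull the paragraph text out of the HTML."""
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    # Assumption: the article body lives in <p> tags; a more specific selector
    # may be needed for the real markup.
    return " ".join(p.get_text() for p in soup.find_all("p"))

urls = get_article_urls(2016, 1)                 # e.g. January 2016
articles = [get_article_text(u) for u in urls]
```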
Now we have to turn all of this raw text into something palatable for a machine learning model like Doc2Vec. To compress the universe of symbols the neural network had to consider, I removed punctuation and stopwords (common words like “the”, “is”, etc.) and used the Porter stemming algorithm to truncate variants of words down to their stems (dive, diving, dived -> dive).
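The original cleaning code isn’t reproduced here, but with NLTK (one common choice, assumed for this sketch) the preprocessing might look roughly like this:

```python
import string
from nltk.corpus import stopwords        # requires nltk.download("stopwords")
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize  # requires nltk.download("punkt")

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()
punct_table = str.maketrans("", "", string.punctuation)

def preprocess(text):
    """Lowercase, strip punctuation, drop stopwords, and stem whatever is left."""
    tokens = word_tokenize(text.lower().translate(punct_table))
    return [stemmer.stem(tok) for tok in tokens
            if tok.isalpha() and tok not in stop_words]

processed_articles = [preprocess(article) for article in articles]
```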
Next, I used the gensim library, which contains a great implementation of the Doc2Vec algorithm. The “size” keyword argument specifies the dimension of the vector space we want to use as our representation. I trained the coefficients of the neural network repeatedly with a gradually decreasing learning rate, in the hope that the model would settle into a reasonable local optimum.
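A sketch of that training loop, using the older gensim API implied by the “size” keyword (gensim 4.x renamed it to “vector_size” and “docvecs” to “dv”); the vector dimension, epoch count, and learning-rate schedule here are illustrative choices, not the post’s exact settings:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each article becomes a TaggedDocument with an integer tag we can look up later.
tagged_docs = [TaggedDocument(words=tokens, tags=[i])
               for i, tokens in enumerate(processed_articles)]

# "size" is the dimension of the document vectors.
model = Doc2Vec(size=300, min_count=5, workers=4, alpha=0.025, min_alpha=0.025)
model.build_vocab(tagged_docs)

# Train repeatedly, manually stepping the learning rate down each pass,
# hoping the model settles into a reasonable local optimum.
for epoch in range(10):
    model.train(tagged_docs, total_examples=model.corpus_count, epochs=1)
    model.alpha -= 0.002
    model.min_alpha = model.alpha

# One vector per article (model.dv in gensim 4.x).
doc_vectors = [model.docvecs[i] for i in range(len(tagged_docs))]
```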
The next step was to feed these document vectors into KMeans clustering with a variety of values for k and see which one gave the best results. Here, “best result” means that sweet spot where we’ve captured the majority of the gains in minimizing distances from cluster centers, and further increases in k yield significantly diminishing returns.
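With scikit-learn’s KMeans, that elbow search might look like the sketch below; the particular range of k values tried is an assumption.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.array(doc_vectors)

# Fit KMeans for a range of k and record the inertia
# (sum of squared distances from points to their cluster centers).
ks = range(10, 210, 10)
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, random_state=42).fit(X)
    inertias.append(km.inertia_)

# The "elbow" is where adding more clusters stops buying much reduction in inertia.
plt.plot(list(ks), inertias, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("within-cluster sum of squared distances")
plt.show()
```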
Plotting the results, I determined 100 was an appropriate cluster count.
By this point, you’re just itching for results, right? On to the final function that ties everything together. In the code below, we pick an arbitrary article, and the code finds the nearest articles within the cluster that article belongs to (of course, you could ignore the clustering altogether and just use the Euclidean distance alone).
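A sketch of that lookup, assuming the X matrix and urls list from the earlier snippets and a final KMeans fit at the k chosen from the elbow plot:

```python
import numpy as np
from sklearn.cluster import KMeans

# Final clustering at the cluster count chosen above.
kmeans = KMeans(n_clusters=100, random_state=42).fit(X)

def similar_articles(article_idx, n=5):
    """Return the indices of the n articles closest (by Euclidean distance) to the
    chosen article, restricted to the cluster that article was assigned to."""
    target_vec = X[article_idx]
    target_cluster = kmeans.labels_[article_idx]
    candidates = [i for i in range(len(X))
                  if kmeans.labels_[i] == target_cluster and i != article_idx]
    candidates.sort(key=lambda i: np.linalg.norm(X[i] - target_vec))
    return candidates[:n]

# Example: the five articles most similar to the first one we scraped.
for i in similar_articles(0):
    print(urls[i])
```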
Let’s see how it does.