Using neural networks to cluster news articles

In this post, I’ll outline a project I did to experiment with a neural-network-based machine-learning technique commonly called Doc2Vec. As the name implies, the algorithm takes a collection of documents as input and, as output, positions each document as a point in a vector space. The dimension of this vector space is one of the model parameters, so your vector description of each document can be as granular or as coarse as you like (though as the dimension increases, the model becomes harder to train). This vector can then be used as a feature in other models. There are simpler ways to encode documents as vectors (naive bag of words, TF-IDF, etc.), but these methods have the major disadvantage of throwing away the ordering of the words in the documents, focusing only on the counts.

set in sentence, the in just order all to we of you a it. the know meaning that interpret than more need unordered words of

Oh I’m sorry, I meant: We all know that in order to easily interpret the meaning of a sentence, you need more than just the unordered set of words in it.  A major advantage of Doc2Vec is that it takes into account the surrounding context of words and sentences when doing the encoding.
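
To make the point about word order concrete, here is a minimal sketch of a naive bag-of-words encoding. The two sentences and the bag_of_words helper are my own illustrative assumptions, not part of the project code; the sketch only shows that two sentences with opposite meanings can get identical counts.

```python
from collections import Counter

def bag_of_words(text):
    """Count lowercase word occurrences, ignoring order entirely."""
    return Counter(text.lower().split())

# Two hypothetical sentences with the same words but opposite meanings.
first = "the dog bit the man"
second = "the man bit the dog"

print(bag_of_words(first) == bag_of_words(second))  # True
print(first == second)                              # False
```

A count-based encoding cannot tell these two apart, which is exactly the information a context-aware model gets to use.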

Encoding a document as a vector pairs naturally with a clustering algorithm such as KMeans: once you’ve embedded two documents in a vector space, there is a well-defined “distance” between them, namely the Euclidean distance between the two points. I wanted to see how well this relatively simple two-step approach worked for clustering articles. A real-life use case for this could be a recommendation engine that presents similar articles after you have finished reading the current one. Now, onto the code!
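
As a warm-up, here is a small sketch of the Euclidean distance that the whole approach rests on. The three vectors are made-up, 3-dimensional stand-ins for document vectors (the real ones in this project have 50 dimensions), and the euclidean helper is just for illustration.

```python
import math

# Hypothetical 3-dimensional document vectors; the post uses 50 dimensions.
doc_a = [0.0, 3.0, 1.0]
doc_b = [4.0, 0.0, 1.0]
doc_c = [0.0, 2.5, 1.0]

def euclidean(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

print(euclidean(doc_a, doc_b))  # 5.0
print(euclidean(doc_a, doc_c))  # 0.5
```

Under this measure doc_c is a much closer match to doc_a than doc_b is, which is the sense of "similar" used in the rest of the post.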

I needed a plentiful supply of high-quality articles about a variety of topics, so I chose The New York Times. They have a convenient API that allowed me to pull a large collection of metadata about their past content.

Using the requests Python library, I hit their archive API to get URLs for all of their articles over a fixed period of time, then used BeautifulSoup, a great Python library for scraping HTML, to get the full text of each article.

import json

import requests
from bs4 import BeautifulSoup

# API_KEY is assumed to be defined elsewhere (your NYT developer API key).

def save_all_article_text(year, month):
    # Fetch the metadata for everything published in the given month.
    r = requests.get('https://api.nytimes.com/svc/archive/v1/{}/{}.json?api-key={}'
                     .format(year, month, API_KEY))
    raw_data = r.json()['response']['docs']
    # Keep only articles, and only the fields we need.
    cleaned_metadata = [{'title': x['headline']['main'],
                         'keywords': [y['value'] for y in x['keywords']],
                         'summary': x['lead_paragraph'],
                         'url': x['web_url']} for x in raw_data
                        if x['document_type'] == 'article']
    # Scrape the full text of each article from its URL.
    for x in cleaned_metadata:
        x['article_text'] = get_text(x['url'])
    with open('NYTArticleData{}_{}_Full.txt'.format(year, month), 'w') as outfile:
        json.dump(cleaned_metadata, outfile)

def get_text(url):
    result = requests.get(url)
    soup = BeautifulSoup(result.content, "html.parser")
    article_text = ''
    # Article body paragraphs carried these CSS classes when I scraped them.
    paragraphs = soup.find_all("p", {"class": "story-body-text story-content"})
    for element in paragraphs:
        article_text += ''.join(element.find_all(text=True))
    return article_text

Now, we have to turn all of this unprocessed text into something palatable for a machine learning model like Doc2Vec. In order to compress the universe of symbols the neural network had to consider, I removed punctuation and stopwords (common words like “the”, “is”, etc.) and used the Porter stemmer algorithm to truncate modifications of words down to their stems (dive, diving, dived -> dive).

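import string
import sys
import unicodedata

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# Translation tables that delete ASCII punctuation and Unicode punctuation.
punc_table = str.maketrans({key: None for key in string.punctuation})
uni_table = dict.fromkeys(i for i in range(sys.maxunicode)
                          if unicodedata.category(chr(i)).startswith('P'))
eng_stopwords = set(stopwords.words('english'))
ps = PorterStemmer()

def preprocess_text(article_text):
    # Tokenize, drop stopwords, stem what is left, then strip punctuation.
    return " ".join([ps.stem(w).translate(punc_table).translate(uni_table)
                     for w in word_tokenize(article_text)
                     if w not in eng_stopwords])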
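
If you want to see the effect of the first two cleaning steps without installing NLTK, here is a dependency-free sketch. It is not the code I ran: the stopword list is a tiny hypothetical stand-in for NLTK's English list, the toy_preprocess name is made up, and no stemming is done at all.

```python
import string

# A tiny hypothetical stopword list; NLTK's English list is much longer.
TOY_STOPWORDS = {"the", "is", "a", "of", "and"}
PUNC_TABLE = str.maketrans("", "", string.punctuation)

def toy_preprocess(text):
    """Strip ASCII punctuation, then drop stopwords. No stemming is done."""
    words = text.translate(PUNC_TABLE).split()
    return " ".join(w for w in words if w.lower() not in TOY_STOPWORDS)

print(toy_preprocess("The diver is diving, and the water is cold."))
# diver diving water cold
```

The stemming step in the real pipeline would go further and collapse related word forms toward a shared stem, shrinking the vocabulary again.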

Next, I used the gensim library, which contains a great implementation of the Doc2Vec algorithm. The “size” keyword argument here specifies the dimension of the vector space we want to use as our representation space. I made 100 training passes over the corpus with a gradually decreasing learning rate, in the hope that the model would settle into a reasonable local optimum.

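from gensim.models.doc2vec import Doc2Vec

# docs is assumed to be an iterable of gensim tagged documents built from
# the preprocessed articles (its construction is not shown in this post).
# This is written against the gensim API of the time; newer releases rename
# `size` to `vector_size` and require extra arguments to train().
model = Doc2Vec(alpha=0.025, min_alpha=0.025, size=50,
                window=5, min_count=5, dm=1,
                workers=8, sample=1e-5)
model.build_vocab(docs)
for i in range(100):
    print(i)
    model.train(docs)
    # Decay the learning rate by 1% after each pass; hold it fixed within a pass.
    model.alpha *= 0.99
    model.min_alpha = model.alpha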
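
To get a feel for how far the learning rate actually falls, here is a small sketch that reproduces only the decay arithmetic from the loop (starting rate 0.025, multiplied by 0.99 after each of 100 passes). The alpha_schedule helper is a name I made up for illustration; it does not touch gensim at all.

```python
def alpha_schedule(start=0.025, decay=0.99, passes=100):
    """Learning rate used on each pass when it is multiplied by `decay` after every pass."""
    alphas = []
    alpha = start
    for _ in range(passes):
        alphas.append(alpha)  # rate in effect during this pass
        alpha *= decay        # decay applied after the pass finishes
    return alphas

alphas = alpha_schedule()
print(len(alphas))           # 100
print(alphas[0])             # 0.025
print(round(alphas[-1], 5))  # 0.00924
```

So the final pass trains at a bit more than a third of the initial rate, a fairly gentle decay.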

The next step was to feed these document vectors into KMeans clustering with a variety of values for k, and see which one has the best results.  Here, “best result” means that sweet spot where we’ve gotten the majority of gains in minimizing distances from cluster centers, and further increases in k result in significantly diminishing returns.

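from sklearn.cluster import KMeans

X = [v for v in model.docvecs]
scores = []
for k in range(1, 300):
    kmeans_test = KMeans(n_clusters=k,
                         random_state=0).fit(X)
    # score() returns the negative sum of squared distances to the cluster centers.
    scores.append(kmeans_test.score(X))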

Plotting the results, I determined 100 was an appropriate cluster count.
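
I picked the cluster count by eye from the plot, but the "diminishing returns" idea can also be sketched in code. The heuristic below is an assumption of mine, not what I actually ran: it takes a list of inertias (the negated KMeans scores, so lower is better) for k = 1, 2, 3, ... and returns the first k after which adding one more cluster improves the inertia by less than a chosen fraction. The pick_elbow name, the 5% threshold, and the sample numbers are all hypothetical.

```python
def pick_elbow(inertias, threshold=0.05):
    """Return the first k (1-based) where going to k + 1 clusters
    reduces the inertia by less than `threshold` (as a fraction)."""
    for i in range(1, len(inertias)):
        prev, curr = inertias[i - 1], inertias[i]
        if prev <= 0 or (prev - curr) / prev < threshold:
            return i  # inertias[i - 1] belongs to k = i
    return len(inertias)

# Hypothetical inertias for k = 1..6; with sklearn scores, use [-s for s in scores].
toy_inertias = [100.0, 60.0, 40.0, 30.0, 29.0, 28.5]
print(pick_elbow(toy_inertias))  # 4
```

On real data the curve is noisier, so a threshold like this is best treated as a sanity check on what the plot shows rather than a replacement for looking at it.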

By this point, you’re just itching for results, right? Onto the final function that ties everything together. In the code below, we can choose an arbitrary article, and the code will find the nearest articles in the cluster belonging to the original article (of course, you could ignore the clustering altogether and just use the Euclidean distance alone).

from scipy.spatial import distance

# article_data is assumed to be a list of article dicts that already carry
# their 'doc_vector' and their assigned 'model_cluster'.

def find_closest_articles(article_index):
    main_article = article_data[article_index]
    article_cluster = main_article['model_cluster']
    coordinates = main_article['doc_vector']
    # Candidates are the other articles assigned to the same cluster.
    neighbors = [y for i, y in enumerate(article_data)
                 if y['model_cluster'] == article_cluster
                 and i != article_index]
    for n in neighbors:
        n['distance'] = distance.euclidean(coordinates, n['doc_vector'])
    neighbors.sort(key=lambda x: x['distance'])
    print('Title: ' + str(main_article['title']))
    print('Keywords: ' + str(main_article['keywords']))
    print('Summary: ' + str(main_article['summary']))
    print('\n****Suggested Articles****\n')
    # Slicing avoids an IndexError when the cluster has fewer than five other articles.
    for neighbor in neighbors[:5]:
        print('Title: ' + str(neighbor['title']))
        print('Keywords: ' + str(neighbor['keywords']))
        print('Summary: ' + str(neighbor['summary']))
        print('\n**')
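
For comparison, here is a minimal sketch of the alternative mentioned above: skipping the clusters and ranking every other article by Euclidean distance alone. The closest_by_distance helper and the toy 2-dimensional data are hypothetical; the only thing assumed about each article is a 'doc_vector' entry, as in the function above.

```python
import heapq
import math

def closest_by_distance(articles, article_index, n=5):
    """Return the n articles nearest to articles[article_index], ignoring clusters."""
    target = articles[article_index]['doc_vector']
    others = [a for i, a in enumerate(articles) if i != article_index]
    return heapq.nsmallest(
        n, others, key=lambda a: math.dist(target, a['doc_vector']))

# Hypothetical toy data with 2-dimensional vectors.
toy_articles = [
    {'title': 'A', 'doc_vector': [0.0, 0.0]},
    {'title': 'B', 'doc_vector': [1.0, 0.0]},
    {'title': 'C', 'doc_vector': [5.0, 5.0]},
    {'title': 'D', 'doc_vector': [0.0, 2.0]},
]
print([a['title'] for a in closest_by_distance(toy_articles, 0, n=2)])  # ['B', 'D']
```

The trade-off is that this scans the whole collection for every query, whereas restricting the search to one cluster only compares against a fraction of the articles.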

Let’s see how it does.

find_closest_articles(0)
Title: Angela Merkel, Russia’s Next Target
Keywords: ['Cyberwarfare and Defense', 'Presidential Election of 2016', 'Merkel, Angela', 'Putin, Vladimir V']
Summary: With a friend entering the White House, Vladimir Putin can turn his hacking army on Germany. The history of the Cold War can tell us what to expect.

****Suggested Articles****

Title: Fake News, Fake Ukrainians: How a Group of Russians Tilted a Dutch Vote
Keywords: ['Netherlands', 'Rumors and Misinformation', 'Ukraine', 'Russia', 'Politics and Government', 'Van Bommel, Harry', 'Referendums', 'International Relations', 'Cyberwarfare and Defense', 'Elections', 'Espionage and Intelligence Services']
Summary: With Europe facing critical elections, fears of Russian meddling are high. Many officials say the first test comes in March in the Netherlands.

**
Title: Fearful of Hacking, Dutch Will Count Ballots by Hand
Keywords: ['Elections', 'Computer Security', 'Computers and the Internet', 'University of Amsterdam', 'Trump, Donald J', 'Netherlands', 'Russia']
Summary: The decision to forgo electronic counting, in national elections scheduled for next month, was a response to fears that outside actors, like Russia, might try to tamper with the election.

**
Title: In Election Hacking, Julian Assange’s Years-Old Vision Becomes Reality
Keywords: ['Classified Information and State Secrets', 'Assange, Julian P', 'WikiLeaks', 'Democratic National Committee', 'Democratic Party', 'Presidential Election of 2016', 'Putin, Vladimir V', 'Trump, Donald J', 'Clinton, Hillary Rodham', 'Russia']
Summary: In a 2006 essay, Mr. Assange, the founder of WikiLeaks, outlined the politically disruptive potential of technology. Hillary Clinton’s loss might be a realization of his vision.

**
Title: Mr. Trump, We Need an Answer
Keywords: ['Presidential Election of 2016', 'Russia', 'Trump, Donald J', 'Putin, Vladimir V']
Summary: He needs to address questions about his campaign’s involvement with Russia’s effort to influence the election.

**
Title: Russian Hackers Find Ready Bullhorns in the Media
Keywords: ['Cyberwarfare and Defense', 'News and News Media', 'Russia', 'Presidential Election of 2016', 'News Sources, Confidential Status of', 'Social Media', 'Computer Security', 'Cyberattacks and Hackers']
Summary: Journalists seek to serve the public interest, but sometimes find themselves unwilling — or unwitting — accomplices to a source’s agenda.

**

find_closest_articles(10)
Title: On a Fijian Island, Hunters Become Conservators of Endangered Turtles
Keywords: ['Turtles and Tortoises', 'Endangered and Extinct Species', 'Conservation of Resources', 'Poaching (Wildlife)', 'World Wildlife Fund', 'International Union for Conservation of Nature', 'Fiji']
Summary: A moratorium on harvesting turtles and a World Wildlife Fund program have helped replenish Fiji’s turtle population after decades of decline.

****Suggested Articles****

Title: A Bumblebee Gets New Protection on Obama’s Way Out
Keywords: ['Bees', 'Endangered and Extinct Species', 'Fish and Wildlife Service', 'Global Warming']
Summary: The administration added the rusty-patched bumblebee, which once covered 28 states but is threatened by pesticides, disease and climate change, to the endangered species list.

**
Title: Most Primate Species Threatened With Extinction, Scientists Find
Keywords: ['Endangered and Extinct Species', 'Monkeys and Apes', 'Lemurs', 'Agriculture and Farming', 'Fish Farming', 'Biodiversity', 'Hunting and Trapping', 'Mines and Mining']
Summary: From gorillas to gibbons, a wide-ranging survey finds that the world’s primates are in steep decline.

**
Title: When the National Bird Is a Burden
Keywords: ['Bald Eagles', 'Endangered and Extinct Species', 'Fish and Wildlife Service', 'Agriculture Department', 'Livestock']
Summary: Bald eagles have been the emblem of the United States for more than two centuries. Now, in some parts of the country, they’re a nuisance.

**
Title: Tilikum, the Killer Whale Featured in ‘Blackfish,’ Dies
Keywords: ['Whales and Whaling', 'SeaWorld Entertainment Inc', 'Amusement and Theme Parks', 'Blackfish (Movie)', 'Fish and Other Marine Life']
Summary: SeaWorld announced that orca, thought to have been around 36 years old, died after suffering health problems, including a lung infection.

**
Title: Gene-Modified Ants Shed Light on How Societies Are Organized
Keywords: ['Ants', 'Genetics and Heredity', 'Genetic Engineering', 'Kronauer, Daniel', 'Biology and Biochemistry', 'Proceedings of the National Academy of Sciences', 'Journal of Experimental Biology']
Summary: Daniel Kronauer’s transgenic ants offer scientists the chance to explore the evolution of animal societies — and, perhaps, our own.

**

find_closest_articles(20)
Title: New York’s Unequal Justice for the Poor
Keywords: ['Legal Aid for the Poor (Civil)', 'Cuomo, Andrew M', 'Law and Legislation', 'New York State']
Summary: Gov. Andrew Cuomo’s veto of a bill requiring the state to pay more for indigent defense was disappointing, and it can’t be the end of the matter.

****Suggested Articles****

Title: De Blasio Steps Away From Trump Turmoil to Defend His Ideas in Albany
Keywords: ['Budgets and Budgeting', 'de Blasio, Bill', 'Felder, Simcha', 'New York City', 'Plastic Bags']
Summary: New York City’s mayor, who has publicly opposed the new president, attended the state budget hearing, where he spoke about education, children’s services and affordable housing.

**
Title: After Victory Lap for Second Avenue Subway, M.T.A. Chief Will Retire
Keywords: ['Second Avenue Subway (NYC)', 'Metropolitan Transportation Authority', 'Prendergast, Thomas F', 'Cuomo, Andrew M', 'Manhattan (NYC)']
Summary: Thomas F. Prendergast, who has led the agency since 2013 and is revered by New York leaders, transit advocates and union chiefs, is expected to step down early this year.

**
Title: Governor Cuomo’s Tuition Plan
Keywords: ['Colleges and Universities', 'New York State', 'Tuition', 'Cuomo, Andrew M']
Summary: The union president representing CUNY’s faculty applauds the governor’s plan and calls for more funding for public colleges and universities.

**
Title: Contracts to Defend de Blasio and Aides May Cost City $11.6 Million
Keywords: ['de Blasio, Bill', 'New York City', 'Campaign Finance', 'Campaign for One New York', 'Debevoise & Plimpton', 'Lankler, Siffert & Wohl', 'Carter Ledyard & Millburn LLP', 'Walden Macht & Haran LLP', 'Cunningham Levy Muse LLP', 'Bergman, Paul B, PC']
Summary: New York City has contracts with six law firms in connection with state and federal criminal investigations into fund-raising practices by the mayor and other officials.

**
Title: New York Legislature Begins Work With 2016 Battles Still Fresh
Keywords: ['State Legislatures', 'Politics and Government', 'State of the State Message (NYS)', 'Independent Democratic Conference (New York State Senate)', 'Cuomo, Andrew M', 'Klein, Jeffrey D', 'New York State']
Summary: The 2017 session started with lawmakers, including some of Gov. Andrew Cuomo’s fellow Democrats, unhappy after last year’s scrapped special session.

**

Not bad!
