Using neural networks to cluster news articles

In this post, I’ll outline a project I did to experiment with a neural network based machine-learning technique commonly called Doc2Vec.  As the name implies, the algorithm takes as input a collection of documents, and as output positions each document in the collection as an element in a vector space.  The dimension of this vector space is one of the model parameters, so you can be as granular or as coarse as desired in your vector description of each document (though as the dimension of the vector space increases, it becomes harder to  train the model).  This vector can than be used as a feature in other models.  There are simpler ways to encode documents as vectors (naive bag of words, TFIDF etc.), but these methods have the major disadvantage of throwing away the ordering  of the words in the documents, focusing only on the counts.

set in sentence, the in just order all to we of you a it. the know meaning that interpret than more need unordered words of

Oh I’m sorry, I meant: We all know that in order to easily interpret the meaning of a sentence, you need more than just the unordered set of words in it.  A major advantage of Doc2Vec is that it takes into account the surrounding context of words and sentences when doing the encoding.

Encoding a document as a vector is a natural counterpart for a clustering  algorithm such as KMeans (since once you’ve embedded two documents in a vector space, there is a well defined “distance” between them simply defined by the Euclidean distance between the two points).  I wanted to see how well this relatively simple two-step approach worked for clustering articles.   A real life use case for this could be a recommendation engine that presents similar articles after you have finished reading the current one.  Now, onto the code!

I needed a plentiful supply of high quality articles about a variety of topics, so I choose The New York Times.  They have a convenient API that allowed me to scrape a large collection of metadata about their past content.

Using the requests python library, I hit their archive API  to get urls for all of their articles over a fixed period of time, then used BeautifulSoup, a great python library for scraping html, to get the full text of each article.

def save_all_article_text(year, month):
    r = requests.get('https://api.nytimes.com/svc/archive/v1/{}/{}.json?api-key={}'
					 .format(year, month, API_KEY))
    raw_data = r.json()
    raw_data = raw_data['response']['docs']
    cleaned_metadata = [{'title':x['headline']['main'],
                     'keywords':[y['value'] for y in x['keywords']],
                     'summary':x['lead_paragraph'],
                     'url':x['web_url']} for x in raw_data
                    if x['document_type'] in ('article')
                   ]
    for x in cleaned_metadata:
        x['article_text'] = get_text(x['url'])
    with open('NYTArticleData{}_{}_Full.txt'.format(year, month), 'w') as outfile:
        json.dump(cleaned_data, outfile)
		
def get_text(url):
    result = requests.get(url)
    content = result.content
    soup = BeautifulSoup(content,"html.parser")
    article_text = ''
    paragraphs = soup.findAll("p", {"class":"story-body-text story-content"})
    for element in paragraphs:
        article_text += ''.join(element.findAll(text = True))
    return article_text

Now, we have to prep all of this unprocessed text into something palatable for a machine learning model like Doc2Vec.  In order to compress the universe of symbols the neural network had to consider, I removed punctuation, stopwords (common words like “the”, “is”, etc) and used the Porter stemmer algorithm to truncate modifications of words down to their stems (dive, diving, dived -> dive).

punc_table = str.maketrans({key: None for key in string.punctuation})
uni_table = dict.fromkeys(i for i in range(sys.maxunicode)
                      if unicodedata.category(chr(i)).startswith('P'))
eng_stopwords = stopwords.words('english')
ps = PorterStemmer()

def preprocess_text(article_text):
    return " ".join([ps.stem(w).translate(punc_table).translate(uni_table)
                     for w in word_tokenize(article_text)
                     if w not in eng_stopwords])

 

Next, I used the gensim library which contains a great implementation of the Doc2Vec algorithm. The “size” keyword argument here specifies the dimension of the vector space we want to use as our representation space.  I trained the coefficients of the neural network repeatedly with a gradually decreasing learning rate, with the hopes that the model would settle into a reasonable local optimum.

model = Doc2Vec(alpha=0.025, min_alpha=0.025, size=50,
				window=5, min_count=5, dm=1,
				workers=8, sample=1e-5)
model.build_vocab(docs)
for i in range(100):
    print(i)
    model.train(docs )
    model.alpha *= 0.99
    model.min_alpha = model.alpha

The next step was to feed these document vectors into KMeans clustering with a variety of values for k, and see which one has the best results.  Here, “best result” means that sweet spot where we’ve gotten the majority of gains in minimizing distances from cluster centers, and further increases in k result in significantly diminishing returns.

X = [v for v in model.docvecs]
scores=[]
for k in range(1, 300):
    kmeans_test = KMeans(n_clusters=k,
						 random_state=0).fit(X)
    score = kmeans_test.score(X)
    scores += [score]

 

Plotting the results, I determined 100 was an appropriate cluster count.

By this point, you’re just itching for results, right?  Onto the final function that ties everything together.  In the code below, we can choose an arbitrary article, and the code will find the nearest articles in the cluster belonging to the original article (of course, you could ignore the clustering all together and just use the Euclidean distance alone).

def find_closest_articles(article_index):
    main_article = article_data[article_index]
    article_cluster = article_data[article_index]['model_cluster']
    coordinates = main_article['doc_vector']
    neighbors = [y for i, y in enumerate(article_data)
				 if y['model_cluster']==article_cluster
				 and i !=article_index]
    for n in neighbors:
        n['distance'] = distance.euclidean(coordinates, n['doc_vector'])
    neighbors.sort(key=lambda x: x['distance'])
    print('Title: ' + str(main_article['title']))
    print('Keywords: ' + str(main_article['keywords']))
    print('Summary: ' + str(main_article['summary']))
    print('\n****Suggested Articles****\n')
    for x in range(5):
        print('Title: ' + str(neighbors[x]['title']))
        print('Keywords: ' + str(neighbors[x]['keywords']))
        print('Summary: ' + str(neighbors[x]['summary']))
        print('\n**')

Let’s see how it does.

find_closest_articles(0)
Title: Angela Merkel, Russia’s Next Target
Keywords: ['Cyberwarfare and Defense', 'Presidential Election of 2016', 'Merkel, Angela', 'Putin, Vladimir V']
Summary: With a friend entering the White House, Vladimir Putin can turn his hacking army on Germany. The history of the Cold War can tell us what to expect.

****Suggested Articles****

Title: Fake News, Fake Ukrainians: How a Group of Russians Tilted a Dutch Vote
Keywords: ['Netherlands', 'Rumors and Misinformation', 'Ukraine', 'Russia', 'Politics and Government', 'Van Bommel, Harry', 'Referendums', 'International Relations', 'Cyberwarfare and Defense', 'Elections', 'Espionage and Intelligence Services']
Summary: With Europe facing critical elections, fears of Russian meddling are high. Many officials say the first test comes in March in the Netherlands.

**
Title: Fearful of Hacking, Dutch Will Count Ballots by Hand
Keywords: ['Elections', 'Computer Security', 'Computers and the Internet', 'University of Amsterdam', 'Trump, Donald J', 'Netherlands', 'Russia']
Summary: The decision to forgo electronic counting, in national elections scheduled for next month, was a response to fears that outside actors, like Russia, might try to tamper with the election.

**
Title: In Election Hacking, Julian Assange’s Years-Old Vision Becomes Reality
Keywords: ['Classified Information and State Secrets', 'Assange, Julian P', 'WikiLeaks', 'Democratic National Committee', 'Democratic Party', 'Presidential Election of 2016', 'Putin, Vladimir V', 'Trump, Donald J', 'Clinton, Hillary Rodham', 'Russia']
Summary: In a 2006 essay, Mr. Assange, the founder of WikiLeaks, outlined the politically disruptive potential of technology. Hillary Clinton’s loss might be a realization of his vision.

**
Title: Mr. Trump, We Need an Answer
Keywords: ['Presidential Election of 2016', 'Russia', 'Trump, Donald J', 'Putin, Vladimir V']
Summary: He needs to address questions about his campaign’s involvement with Russia’s effort to influence the election.

**
Title: Russian Hackers Find Ready Bullhorns in the Media
Keywords: ['Cyberwarfare and Defense', 'News and News Media', 'Russia', 'Presidential Election of 2016', 'News Sources, Confidential Status of', 'Social Media', 'Computer Security', 'Cyberattacks and Hackers']
Summary: Journalists seek to serve the public interest, but sometimes find themselves unwilling — or unwitting — accomplices to a source’s agenda.

**
find_closest_articles(10)
Title: On a Fijian Island, Hunters Become Conservators of Endangered Turtles
Keywords: ['Turtles and Tortoises', 'Endangered and Extinct Species', 'Conservation of Resources', 'Poaching (Wildlife)', 'World Wildlife Fund', 'International Union for Conservation of Nature', 'Fiji']
Summary: A moratorium on harvesting turtles and a World Wildlife Fund program have helped replenish Fiji’s turtle population after decades of decline.

****Suggested Articles****

Title: A Bumblebee Gets New Protection on Obama’s Way Out
Keywords: ['Bees', 'Endangered and Extinct Species', 'Fish and Wildlife Service', 'Global Warming']
Summary: The administration added the rusty-patched bumblebee, which once covered 28 states but is threatened by pesticides, disease and climate change, to the endangered species list.

**
Title: Most Primate Species Threatened With Extinction, Scientists Find
Keywords: ['Endangered and Extinct Species', 'Monkeys and Apes', 'Lemurs', 'Agriculture and Farming', 'Fish Farming', 'Biodiversity', 'Hunting and Trapping', 'Mines and Mining']
Summary: From gorillas to gibbons, a wide-ranging survey finds that the world’s primates are in steep decline.

**
Title: When the National Bird Is a Burden
Keywords: ['Bald Eagles', 'Endangered and Extinct Species', 'Fish and Wildlife Service', 'Agriculture Department', 'Livestock']
Summary: Bald eagles have been the emblem of the United States for more than two centuries. Now, in some parts of the country, they’re a nuisance.

**
Title: Tilikum, the Killer Whale Featured in ‘Blackfish,’ Dies
Keywords: ['Whales and Whaling', 'SeaWorld Entertainment Inc', 'Amusement and Theme Parks', 'Blackfish (Movie)', 'Fish and Other Marine Life']
Summary: SeaWorld announced that orca, thought to have been around 36 years old, died after suffering health problems, including a lung infection.

**
Title: Gene-Modified Ants Shed Light on How Societies Are Organized
Keywords: ['Ants', 'Genetics and Heredity', 'Genetic Engineering', 'Kronauer, Daniel', 'Biology and Biochemistry', 'Proceedings of the National Academy of Sciences', 'Journal of Experimental Biology']
Summary: Daniel Kronauer’s transgenic ants offer scientists the chance to explore the evolution of animal societies — and, perhaps, our own.

**
find_closest_articles(20)
Title: New York’s Unequal Justice for the Poor
Keywords: ['Legal Aid for the Poor (Civil)', 'Cuomo, Andrew M', 'Law and Legislation', 'New York State']
Summary: Gov. Andrew Cuomo’s veto of a bill requiring the state to pay more for indigent defense was disappointing, and it can’t be the end of the matter.

****Suggested Articles****

Title: De Blasio Steps Away From Trump Turmoil to Defend His Ideas in Albany
Keywords: ['Budgets and Budgeting', 'de Blasio, Bill', 'Felder, Simcha', 'New York City', 'Plastic Bags']
Summary: New York City’s mayor, who has publicly opposed the new president, attended the state budget hearing, where he spoke about education, children’s services and affordable housing.

**
Title: After Victory Lap for Second Avenue Subway, M.T.A. Chief Will Retire
Keywords: ['Second Avenue Subway (NYC)', 'Metropolitan Transportation Authority', 'Prendergast, Thomas F', 'Cuomo, Andrew M', 'Manhattan (NYC)']
Summary: Thomas F. Prendergast, who has led the agency since 2013 and is revered by New York leaders, transit advocates and union chiefs, is expected to step down early this year.

**
Title: Governor Cuomo’s Tuition Plan
Keywords: ['Colleges and Universities', 'New York State', 'Tuition', 'Cuomo, Andrew M']
Summary: The union president representing CUNY’s faculty applauds the governor’s plan and calls for more funding for public colleges and universities.

**
Title: Contracts to Defend de Blasio and Aides May Cost City $11.6 Million
Keywords: ['de Blasio, Bill', 'New York City', 'Campaign Finance', 'Campaign for One New York', 'Debevoise & Plimpton', 'Lankler, Siffert & Wohl', 'Carter Ledyard & Millburn LLP', 'Walden Macht & Haran LLP', 'Cunningham Levy Muse LLP', 'Bergman, Paul B, PC']
Summary: New York City has contracts with six law firms in connection with state and federal criminal investigations into fund-raising practices by the mayor and other officials.

**
Title: New York Legislature Begins Work With 2016 Battles Still Fresh
Keywords: ['State Legislatures', 'Politics and Government', 'State of the State Message (NYS)', 'Independent Democratic Conference (New York State Senate)', 'Cuomo, Andrew M', 'Klein, Jeffrey D', 'New York State']
Summary: The 2017 session started with lawmakers, including some of Gov. Andrew Cuomo’s fellow Democrats, unhappy after last year’s scrapped special session.

**

Not bad!

Regularization from a Bayesian Viewpoint

I’ve read far too many articles on generalized linear models (GLMs) where regularization is introduced as a magical trick that can help you avoid overfitting your model to the data.  In this short post, I’d like to show in a relatively introductory way how many regularization methods such as l_2 regularization emerge naturally when GLMS are considered in a Bayesian context.

Estimation Techniques

The ultimate goal for many modeling techniques is to estimate a parameter Y using some data X.  One way to solve this problem is to choose a model for which you can easily compute Pr(X|Y), and use the value of Y which was most likely to generate the data we’ve seen.

For example, suppose we had a 6-sided die that may or not be fair. At the outset, we make no assumptions about the probability of each number appearing.  Let’s suppose we roll it five times, and observe that we roll three 4s, and two 1s.  Based only on this data, the most probable distribution for each value of the die is 3/5 for 4 and 2/5 for 1, and 0 for all other values.  Many would argue that this conclusion is absurd – how can we possibly conclude anything about the die based on such little data?  But what if rolled the die 120 times and observed 25 1s, 20 2s, 23 3s, 15 4s, 17 5s and 20 6s? The most likely probabilities would be the frequency of each values over 120, not 1/6 uniformly like our intuition might suggest.  Is that a more reasonable thing to conclude?  At what point do we say that we have seen enough data to make our final conclusion?  The above method is sometimes known as the “frequentist” approach.

A follower of the Bayesian school of thought would say that we should not draw conclusions solely based on the data we’ve seen. Data can be noisy, and you can never be sure you’ve seen enough data to obtain a true signal about the quantity you are trying to estimate.  Instead, we should integrate our prior beliefs about the mechanism being observed, and allow the data to influence our beliefs.  As we see more data, our beliefs develop to reflect what we’ve observed.  This approach allows us to balance the influence of the data we see with what we believe to be reasonable a priori assumptions about the measurement of interest.  The approach all flows from Bayes’ Formula, which states the following:

Pr(Y|X) = \frac{Pr(X|Y)Pr(Y)}{Pr(X)}

The goal now will be to find the Y that maximizes Pr(Y|X) – we’ve turned the frequentist method on its’ head!  Let’s pick apart Bayes’ Formula and look at the individual ingredients on the right-hand side.  Pr(X|Y) is something that can often be easily computed based on a mathematical model of the situation we are analyzing.  For example, if Y is the parameter determining the likelihood that a coin flip gives “heads”, and X is the number of heads we observe in 10 flips of a coin, then we can compute Pr(X=k|Y) using the binomial distribution.  Pr(X) is the probability of seeing our data and can in certain circumstances be very difficult to compute, as it often involves integration over our distribution for Y. The good news is that since we only care about maximizing Pr(Y|X) with respect to Y (and don’t usually care about the specific value of this maximal probability), when we take the derivative of Bayes’ Formula with respect to Y, Pr(X) will just be some constant and not affect our computation of the optimal Y.  So let’s just sweep that under the rug.   Pr(Y) is drawn from what is known as our prior distribution for the quantity Y that we are trying to estimate.   It represents our “reasonable initial assumptions” about the possible values Y could take.

If we’re flipping a coin, our prior distribution for the probability of heads occurring could look something similar to a Gaussian distribution centered at 1/2 or it could be a uniform distribution on the interval [0,1], for example.  The most mathematically convenient choice for our prior distribution in this particular case is called a beta distribution, which the uniform distribution is a special case of.  The convenience comes from the fact that Pr(Y|X) \propto Pr(X|Y)Pr(Y) turns out to look like a beta distribution with different parameters, reflecting the additional “information” that the data X gives us.  The mathematical terminology for this compatibility would be to say that the beta distribution is a conjugate prior for the binomial distribution.

This compatibility allows us to iterate on our estimations as new data comes in, without having to do a brand new analysis each time: we can apply Bayes’ Formula with new data X’ by replacing our initial prior distribution Pr(Y) with our updated posterior distribution Pr(Y|X), and repeat this process for as long as we like, updating our posterior distribution for Y as we see more and more data.

To summarize both approaches outlined above:

Frequentist Approach
  • Choose a mathematical model for Pr(X|Y)
  • Choose the Y which maximizes this conditional probability.
Bayesian Approach
  • Choose a mathematical model for Pr(X|Y)
  • Choose a prior distribution for Y
  •  Choose the Y which maximizes Pr(Y|X) using Bayes’ Formula

While the Bayesian approach can be more computationally intensive and is more complex mathematically, it can help you avoid the mistake of over-relying on noisy data, and is my preferred approach to making estimates.

Regularization from Bayesian Assumptions

This section will assume you are familiar with some of the basic facts about linear regression. The fundamental assumption of linear regression is that our quantity of interest Q depends on our dependent variables in a linear fashion, plus an error term which is normally distributed and centered at 0. That is,

Q = f(\mathbf{X}) = a_0X_0 + a_1X_1 + ... + a_kX_k + N(0,\sigma)

The parameter we want to determine in this situation is the vector of coefficients, \mathbf{W} = (a_0,...,a_k). Given a bunch of data points (Q_i,\mathbf{X_i}), we can compute Pr(Q_i|\mathbf{W}, \mathbf{X_i}) for any particular choice of \mathbf{W}. The familiar least squares solution to linear regression is obtained by finding the vector of coefficients which maximizes the product \prod Pr(Q_i|\mathbf{W}, \mathbf{X_i}), akin to the frequentist approach in the first section.  This method can break down in many scenarios, such as when there are many independent variables and not a correspondingly high number of data points.  In this case, the model may rely too heavily on the scant data that is given, and your model can fail to produce viable results in the future.

Since we want to act like Bayesians here, a natural thing to do here is to introduce prior distributions for each coefficient we’re trying to estimate.  Let us assume a priori that each coefficient is distributed normally around 0 with a variance of c, and see where the math takes us.  Let’s look at a single data point (Q,\mathbf{X}) for simplicity.  The full solution can be obtained by simply multiplying the probabilities of the individual data points (since we usually assume that each observation is independent).  Let us first collect the building blocks of our analysis:

    Pr((Q,\mathbf{X})|\mathbf{W}) \propto e ^  {-\frac{(Q - f(\mathbf{X}))^2} {2\sigma^2}} \propto e ^ {-\frac{(Q - a_0X_0 + a_1X_1 + ... + a_kX_k)^2}   {2\sigma^2}}

                    Pr(\mathbf{W}) \propto \prod \limits_{i=1}^{k} e ^ {-\frac{a_i^2}{2c^2}}

Note that we’ve dropped some constants from the above statements that don’t affect the conclusion which follows. Now, let us apply Bayes’ Formula to determine  Pr(\mathbf{W}|Q,\mathbf{X}):

Pr(\mathbf{W}|Q,\mathbf{X}) \propto Pr((Q,\mathbf{X})|\mathbf{W}) Pr(\textbf{W})

Pr(\mathbf{W}|Q,\mathbf{X}) \propto e ^ {-\frac{(Q - a_0X_0 + a_1X_1 + ... + a_kX_k)^2} {\sigma^2}} \prod \limits_{i=1}^{k} e ^ {-\frac{a_i^2}{c^2}}

We want to find the \mathbf{W} that maximizes this probability, which looks to be quite messy to compute.  Let’s take the negative natural logarithm to turn the nasty product into a sum, and find the optimal \mathbf{W} that minimizes our new function instead.  This composed function, called the negative log likelihood (NLL) will have its minimum where our original function had its maximum, because the natural logarithm is monotonically increasing.

 NLL \propto \frac{(Q - a_0X_0 + a_1X_1 + ... + a_kX_k)^2} {\sigma^2} + \frac{1}{c^2}\sum\limits_{i=1}^{k} {a_i^2}
or,
 NLL \propto (Q - a_0X_0 + a_1X_1 + ... + a_kX_k)^2 + \frac{\sigma^2}{c^2}\sum\limits_{i=1}^{k} {a_i^2}

This now looks very much like the “standard” formula for the l_2 regularized linear regression loss function:

  (Q - \mathbf{W} \cdot \mathbf{X})^2 + \lambda ||\mathbf{W}||^2

except now we have an interpretation for what \lambda really means!  As we increase the variance c^2 of our prior distribution, the corresponding \lambda decreases, imposing less of a penalty on the coefficients of \mathbf{W}.  In other words, the weaker our prior beliefs (larger variance), the weaker the regularization, and vice versa.  If we choose different priors for \mathbf{W}, we get different regularizations.  If we assume that all of our priors are Laplace distributions with the same parameter, we get l_1 regularization.  We can even choose different priors for different coefficients to get hybrid regularization methods.  It’s a deep rabbit hole we could start going down. In the end, it’s important to remember that we’re trying to build models that help us make predictions using new unseen data, so cross validation using various forms of regularization is a good way to try out different regularization methods.

For me, the best way to learn a concept is to build a logical narrative that leads naturally to the new idea. When I first encountered regression, regularization, and model building (in a haphazard self-taught way) I didn’t fully appreciate how connected the various concepts were, until they were boiled down to their mathematical essence.  I hope I have constructed a useful narrative here that sheds some light on one of these connections.

NYC Subway Visualization

For the past month or so, I’ve been working on a project that combines the task of data processing, analysis, and visualization.  After considering many possibilities, I finally settled on looking at the MTA turnstile dataset, which lists the cumulative turnstile exits and entries for every turnstile bank in every station in the NYC subway system, binned into semi-regular 4 hour chunks.  The main goal of this project was for me to learn how to scrape, clean, and package the data into a nice format.  The finished visualization is up and running at http://arettines.com/MTA-Graphic, and you can look at all of the code I wrote to generate it here.   There are several trends that become apparent after looking at the data, which I’ll describe in another post.  For now, I’d like to talk about the work involved in the project itself.

To begin the project, I wrote a short python script to download the 200+ files from the web to see what I was working with.  Each of these text files contains roughly 200,000 lines of text, containing information about the individual turnstile unit, the particular subway stations, when the data was recorded, etc.  The files needed some serious cleaning before I could use them to accomplish my goal of looking at trends in subway usage.  Each station could have many different turnstiles that would need to be grouped together.  In some situations, turnstile data that should have been grouped together was listed separately.   Many stations had irregular reporting times, and some turnstiles would occasionally fail to report their totals.  Another quirk I had to deal with is that the .csv files containing the turnstile data have a different format prior to 10/18/2014.  To deal with this, I normalized the data using some carefully chosen conventions and I binned all of the reported numbers into one of six 4-hour intervals and ignored data with erroneous or missing numbers.  Luckily, the pandas package in python has many built-in features to handle the sorting and rearranging of this large data set (200 files x 200,000 rows = 40,000,000 rows in our dataframe).

While python is great for taking a deep look at the data, I also wanted to work on an interactive visualization for users to look at some aspects of the data.  I decided to compute a weekly profile for each station, so users could see a “typical” week of usage for every station.  This involved learning a ton of new things – starting with html and javascript.  I gave myself a crash course on website construction and data visualization, with an emphasis on using amazing javascript library D3.  D3 has a ton of built in scripts and functionality to streamline the creation of beautiful data-driven websites.

I imagined creating a map of NYC, with a marker for each station.  The user could click on the area around each station, and be presented with an overview of the usage at that particular station.  I needed to find the precise latitude and longitude coordinates of each subway station.  I cleaned the data available at this NYC government website to get the coordinates.   I also had to acquire a map of NYC that could “speak” with D3 in order to produce a map.  This is where I discovered Geojson and Topojson – two formats for encoding geographic data that are fairly ubiquitous in the digital map-making community.  They encode the positions of vertices of a collection of polygons which represent the geographic data.  D3 has built in features that read a geojson or topojson file, and produce a map.  That’s great news, but there was a small hitch – I wanted my map to be interactive, and since the files I found at this github page were too large and unwieldy for an interactive page, I had to find a way to simplify the files.  The polygons used to describe NYC were just too detailed, and were unnecessarily slowing down the webpage.  Luckily, mathematics comes to the rescue.  Visvalingam’s algorithm  is a method which takes a polygon and greatly reduces the number of edges, without distorting the shape too much.  Even better, I found an amazing tool at http://www.mapshaper.org/  which runs the algorithm on a topojson file that you upload.

NYCsimp1
The original NYC boundaries, and the simplified version obtained using the algorithm.

With this simplified (but still reasonably accurate) map of NYC, I could begin the next step of the visualization – partition NYC into areas for each of the subway stations.  A Voronoi diagram is a natural way to do this.  Each subway station is assigned the area for which it is the closest station.  D3 has a built in voronoi method – you just have to feed it an array of coordinates for the stations, and it overlays a voronoi diagram on your page.  I used my topojson map of NYC to serve as a clip-path for the voronoi diagram, which resulted in the diagram being restricted to the landmass of NYC – exactly as I wanted.

NYCsubwayvoronoi
The Voronoi diagram created using the locations of subway stations.

 

               The next step was to link up the weekly usage with each voronoi region, so that when a user clicks on the region, a chart displays the information.  This is precisely the kind of situation where D3 shines.  The final product is at www.arettines.com/MTA-Graphic.  Each station is color-coded according to how heavily it is used.  The darker the green, the heavier the usage.  Hover over a region to see the name of the station and the lines it serves, and click on a station to see a typical week of usage for that station.  You can hover over the bars in the graph for more precise information.  Have fun playing with it!  In a future post, I’ll take a look at some interesting facets of the dataset.

NYCSample

 

Graphs are everywhere: Graph theory and the “sharing economy”

 

Matching up users with specific needs to suppliers who can meet those needs is a common theme in the plethora of apps and services that are budding in today’s “sharing economy”.   In this post, I’ll explore a simple mathematical concept underlying several different real world scenarios.

Imagine you run a dating company.  You have a group of 50 men and 50 women, and your job is to pair up men and women with one another to go off on a blind date.  You’ve done your due diligence, and have come up with a way of measuring how good of a match a chosen man and woman might be.  Perhaps you’ve obtained information about each participants interests, preferences etc.  In any case, you can assign a score to each possible pairing of man and woman, so in this case you have a list of 502 = 2500 scores.  How do we efficiently pick the best matching?  We want to maximize the chance that the customers have a good time during their dates, after all.

Now imagine you run Uber.  Seems daunting.  Let’s say you run the logistics department at Uber, because you know next to nothing about running a business.  Here’s a problem you might encounter:  In a highly populated area like NYC, in a given span of 20 seconds during rush hour, Uber might receive 100 requests from costumers to call a cab.  Let’s say there are 150 drivers available, each a certain amount of time away from each customer.  For each user, we can assign a score to each potential pairing of a customer with a driver.  The sooner the driver could arrive at the customer, the better the score of the pairing.  We don’t want the customers waiting too long, so how can we efficiently allocate drivers to customers?

One more scenario.  I promise this will be the last one.  Imagine you are hosting an educational service that attempts to pair up people with complementary skills so that they can both gain from the others experience.  Suppose each person has a list of their current qualifications, and a list of skills they would like to learn.  If Person A is a skilled programmer, and would like to learn about optimization analysis, and Person B is an optimization analyst and would like to learn how to program, then these two people could probably learn a lot from one another!  So give a large group of people, each with a set of current skills and desired skills, how might we pair these people off in an optimal way?  There are various notions of what an “optimal” pairing might be, but let’s say we use the number of complementary skills that the given pair has to determine its quality.

These scenarios all turn out to have identical mathematical descriptions, despite their varied settings.  The method for solving the problems presents itself by translating everything into the language of graph theory.

The notion of a graph is a very fundamental mathematical concept.  For our purposes a graph consists of a set of vertices, and a set of edges connecting some pairs of distinct vertices.  Graphs are ubiquitous throughout mathematics because they capture the notion of a set of objects (vertices) and some form of relationship between the objects (edges).  Depending on the setting, the objects and relations that vertices and edges stand for can vary widely.  One can also consider a graph as a mathematical object in its own right.  Graph theory is the study of mathematical properties of graphs.

Once we have down the notion of a graph, we can consider a weighted graph, which is just a graph with a number attached to each edge.  We’ll actually use weighted graphs when thinking about the earlier scenarios.  We need one more mathematical concept to complete the translation of our problems into graph theoretic terms.

A matching of a graph G is a selection of edges of G such that each vertex occurs in at most one pairing.

Graph1
A weighted graph and two possible matchings

Observe that different matchings can have a different number of total edges in it.  This depends on the exact nature of the graph.  Therefore, it makes mathematical sense to ask the question: Given a graph G, what is the matching with the largest number of edges? Can we find these maximal matchings algorithmically and efficiently?  If our graph G is weighted, we can ask what is the matching which maximizes the sum of the weights?  If all of the weights in G are equal, then the second question is equivalent to the first one.  In our example above, the matching in the upper right maximizes the sum of the weights, while the lower matching maximizes the number of edges in the matching.

One way to solve this problem for a given graph is to simply try out all possible matchings and take the best one.  However, if the graph is large, there is a computationally unfeasible number of pairings to iterate through.  It turns out there is an elegant and efficient solution to this problem, called Edmond’s Blossom Algorithm, discovered by Jack Edmonds in 1967.

Very roughly speaking, Edmond’s algorithm works by first picking a matching at random.  It then uses an ingenious method to find an improvement to the matching if one exists.  It keeps repeating this process until no further improvement can be found.

I’ll go into some of the more technical details of the algorithm in a later post.  For now let’s treat it as a black box that produces an optimal answer to the graph theoretic questions in the previous paragraph.  Let’s go through our earlier scenarios and frame them in this new language.

Scenario 1

Graph2

In the first scenario, we construct a bipartite graph (which means the vertices can be separated into two groups such that within each group, no two vertices are connected by an edge).    We add an edge for each potential matching, and we weight the edge by how good of a match we think they’ll make.

 

Scenario 2

Graph2

In the second scenario, we also construct a bipartite graph (the two groups of vertices are drivers and customers) and weight each edge by how long it would take the driver to arrive at the passenger.

 

Scenario 3

Graph4

In our last hypothetical scenario, each vertex represents a user, and two users share an edge if they have complementary skills.  Each edge could be weighted by some measure of how good the matching is.

In each of these scenarios, the question we want to answer is to find the maximal matching of the associated weighted graph.  Edmond’s algorithm gives us the solution in each case!

 

Yet another blog…

What will this blog be about?  I’m a mathematician by training,  so I plan to explore a variety of topics, ranging from the abstract to the concrete, with a focus on the patterns and symmetries that are sometimes hidden at a casual glance.  I hope you enjoy!