For the past month or so, I’ve been working on a project that combines data processing, analysis, and visualization. After considering many possibilities, I settled on the MTA turnstile dataset, which lists the cumulative entries and exits for every turnstile bank in every station in the NYC subway system, binned into semi-regular 4-hour chunks. The main goal of this project was for me to learn how to scrape, clean, and package the data into a nice format. The finished visualization is up and running at http://arettines.com/MTA-Graphic, and you can look at all of the code I wrote to generate it here. Several trends become apparent after looking at the data, which I’ll describe in another post. For now, I’d like to talk about the work involved in the project itself.
To begin the project, I wrote a short Python script to download the 200+ files from the web to see what I was working with. Each of these text files contains roughly 200,000 lines, with information about the individual turnstile unit, the station it belongs to, when the data was recorded, etc. The files needed some serious cleaning before I could use them to look at trends in subway usage. Each station could have many different turnstiles that needed to be grouped together, and in some cases turnstile data that should have been grouped together was listed separately. Many stations had irregular reporting times, and some turnstiles would occasionally fail to report their totals. Another quirk is that the .csv files containing the turnstile data have a different format prior to 10/18/2014. To deal with all of this, I normalized the data using some carefully chosen conventions, binned each reported number into one of six 4-hour intervals, and ignored data with erroneous or missing numbers. Luckily, the pandas package in Python has many built-in features to handle the sorting and rearranging of this large data set (200 files x 200,000 rows = 40,000,000 rows in our dataframe).
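To make the cleaning step a little more concrete, here is a minimal sketch of the idea. It is not the code from the project: the row layout, the turnstile id "T1", the timestamp format, and the MAX_PLAUSIBLE cutoff are all assumptions made up for illustration, and it uses only the standard library so it runs on its own. It turns cumulative counter readings into per-interval counts, assigns each count to one of six 4-hour bins, and drops differences that look erroneous (such as a counter reset).

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sample rows: (turnstile id, timestamp, cumulative entries).
# The layout and values are invented for illustration only.
readings = [
    ("T1", "2014-10-18 00:00:00", 1000),
    ("T1", "2014-10-18 04:00:00", 1010),
    ("T1", "2014-10-18 08:00:00", 1100),
    ("T1", "2014-10-18 12:00:00", 5),    # counter reset: produces a negative difference
    ("T1", "2014-10-18 16:00:00", 45),
]

MAX_PLAUSIBLE = 10_000  # assumed upper bound on riders per turnstile per interval


def interval_counts(rows):
    """Turn cumulative readings into counts keyed by (turnstile, date, bin).

    Each difference between consecutive readings is attributed to the
    4-hour bin (0-5) containing the earlier reading. Negative or
    implausibly large differences are skipped.
    """
    by_turnstile = defaultdict(list)
    for turnstile, stamp, total in rows:
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        by_turnstile[turnstile].append((when, total))

    counts = defaultdict(int)
    for turnstile, series in by_turnstile.items():
        series.sort()  # chronological order within each turnstile
        for (start, prev_total), (_, total) in zip(series, series[1:]):
            delta = total - prev_total
            if delta < 0 or delta > MAX_PLAUSIBLE:
                continue  # ignore erroneous readings
            # hours 0-3 -> bin 0, hours 4-7 -> bin 1, ..., hours 20-23 -> bin 5
            counts[(turnstile, start.date(), start.hour // 4)] += delta
    return dict(counts)


counts = interval_counts(readings)
```

With pandas, the same differencing step is typically written as a groupby on the turnstile followed by diff(); the loop above just spells out what that does.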
I imagined creating a map of NYC, with a marker for each station. The user could click on the area around each station and be presented with an overview of the usage at that particular station. To do this, I needed the precise latitude and longitude coordinates of each subway station, which I got by cleaning the data available at this NYC government website. I also had to acquire a map of NYC that could “speak” with D3. This is where I discovered GeoJSON and TopoJSON, two formats for encoding geographic data that are fairly ubiquitous in the digital map-making community. They encode the positions of the vertices of a collection of polygons which represent the geographic data. D3 has built-in features that read a GeoJSON or TopoJSON file and produce a map. That’s great news, but there was a small hitch: I wanted my map to be interactive, and the files I found at this GitHub page were too large and unwieldy for an interactive page, so I had to find a way to simplify them. The polygons used to describe NYC were just too detailed, and were unnecessarily slowing down the webpage. Luckily, mathematics comes to the rescue. Visvalingam’s algorithm takes a polygon and greatly reduces the number of edges without distorting the shape too much, by repeatedly removing the vertex that forms the smallest-area triangle with its two neighbors. Even better, I found an amazing tool at http://www.mapshaper.org/ which runs the algorithm on a TopoJSON file that you upload.
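For the curious, the core idea of Visvalingam’s algorithm fits in a few lines. The sketch below is my own simplified version for an open polyline, written for clarity rather than speed (it rescans every point on each pass instead of using a priority queue), and it is not how mapshaper implements it. The function names and the sample line are made up for illustration.

```python
def triangle_area(a, b, c):
    """Area of the triangle with vertices a, b, c, each an (x, y) pair."""
    return abs((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])) / 2


def visvalingam(points, keep):
    """Simplify an open polyline down to `keep` points (never fewer than 2).

    On each pass, remove the interior point whose triangle with its two
    neighbours has the smallest area. The endpoints are never removed.
    """
    pts = list(points)
    while len(pts) > max(keep, 2):
        areas = [
            triangle_area(pts[i - 1], pts[i], pts[i + 1])
            for i in range(1, len(pts) - 1)
        ]
        smallest = areas.index(min(areas)) + 1  # +1 because areas[0] belongs to pts[1]
        del pts[smallest]
    return pts


# Hypothetical polyline: the small bump at (1, 0.1) contributes the least area.
line = [(0, 0), (1, 0.1), (2, 0), (3, 2), (4, 0)]
simplified = visvalingam(line, keep=4)
```

Here the nearly collinear point (1, 0.1) is the first to go, while the sharp peak at (3, 2) survives, which is the behavior you want when thinning out a coastline.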
With this simplified (but still reasonably accurate) map of NYC, I could begin the next step of the visualization: partitioning NYC into areas for each of the subway stations. A Voronoi diagram is a natural way to do this. Each subway station is assigned the region of the map that is closer to it than to any other station. D3 has a built-in Voronoi method; you just have to feed it an array of coordinates for the stations, and it overlays a Voronoi diagram on your page. I used my TopoJSON map of NYC as a clip-path for the Voronoi diagram, which restricted the diagram to the landmass of NYC, exactly as I wanted.
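The property that defines a Voronoi region is easy to state in code: a point belongs to the region of whichever station is nearest. The sketch below is only an illustration of that property, not what D3 does internally (D3 computes the actual cell polygons). The station names and coordinates are hypothetical, and I’m assuming planar coordinates such as projected screen positions, where straight-line distance makes sense.

```python
import math

# Hypothetical stations with planar (x, y) coordinates, e.g. projected screen positions.
stations = {
    "Station A": (0.0, 0.0),
    "Station B": (4.0, 0.0),
    "Station C": (0.0, 3.0),
}


def nearest_station(point, stations):
    """Return the name of the station whose Voronoi region contains `point`."""
    return min(stations, key=lambda name: math.dist(point, stations[name]))


# A click at (3, 0.5) falls inside Station B's region.
clicked = nearest_station((3.0, 0.5), stations)
```

Checking every station for every point is fine for a lookup like this, but drawing the regions themselves is where a proper Voronoi routine like D3’s earns its keep.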
The next step was to link up the weekly usage with each Voronoi region, so that when a user clicks on a region, a chart displays the information. This is precisely the kind of situation where D3 shines. The final product is at www.arettines.com/MTA-Graphic. Each station is color-coded according to how heavily it is used: the darker the green, the heavier the usage. Hover over a region to see the name of the station and the lines it serves, and click on a station to see a typical week of usage for that station. You can hover over the bars in the graph for more precise information. Have fun playing with it! In a future post, I’ll take a look at some interesting facets of the dataset.