As the creator of the popular NYC blog I Quant NY, Ben Wellington has a knack for turning data scrapes into civic action. Mr. Wellington, a visiting assistant professor at Pratt with a Ph.D. in Computer Science from NYU, is behind a lot of small reforms with big impact, among them the fire hydrant he found that cost New Yorkers over $50,000 in annual fines, the flaws he unearthed within the city’s restaurant grading system, and the $27.25 MetroCard button (giving purchasers exactly 11 rides). Through I Quant NY, Mr. Wellington uses data released by government agencies to find and tell interesting stories that often end up influencing local policy changes: “I’ve picked at the limited number of data sets that have become public, and have shown that opening up data leads to a world where government and citizens become partners in making our City better,” he wrote on the site’s About page.
The Observer talked to Mr. Wellington about other leaders in open data, the lack of NYPD data, and the best ways to tell stories with numbers.
Your work seems like it’s half urban planning, half data analysis. Is that an accurate description?
I do what I call ‘urban data science.’ It’s about looking for interesting trends, patterns, and insights in the vast amount of public data the city releases.
Last month you wrote a post about Eric Garner and the lack of police data. Can you go into a little more detail about that?
New York has plowed ahead on its journey to open up more data to the public. When city agencies release data, it allows reporters, citizens, and really anyone to ask their own questions of the data set. Unfortunately, many agencies have been very slow to release their data. That could be for one of a few reasons: technical, political, or privacy-related.
To date, the NYPD has not released crime data at an incident level to the public. What I mean by that is, instead of just giving us aggregate numbers (‘there are this many crimes in this neighborhood’), we want to see data on every individual crime, where it was, and what happened, so we can do better analysis on it.
What other agencies have been behind the curve?
The Department of Education is as well. We’re seeing school survey data, but we’re not seeing attendance records or many other aspects of their data set that could make it much richer. With [Mayor] de Blasio, we’re seeing some momentum in this space.
Outside of parking tickets, what other really rich data sets are you drawn to?
Permitting data is really interesting. You can see how things are being built. With licensing data, you can see the zip code of workers from different industries. For example, home improvement workers need to have a license. You can see in certain industries where people are coming from. I found that [most] pedicab drivers come from Brighton Beach. You can see that right in the data.
How do you decide what to investigate?
Sometimes I have an observation as someone who lives in New York. Walking by a fast food place, I had this idea that the combination fast food restaurants, as in Taco Bell/Pizza Hut, might not be as up to standard as the individual ones. I looked at the health inspection data to see if combination restaurants do worse on health inspections than the single [restaurants]. And the answer was a resounding yes. Across the combination chains I looked at, they almost always performed worse. So Taco Bell did okay and Pizza Hut did okay, but Taco Bell/Pizza Hut, in general, did not do great.
So that’s an example of when I make an observation and then try to find the data to back it up. The other half of the time, I see a data set and just try to think of the observations that could be found in it. It really depends on what the city is releasing at any given time.
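The combination-restaurant comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Mr. Wellington’s actual analysis: the records, the field names, and the rule that a “/” in the name marks a combination restaurant are all hypothetical, and the sketch assumes that a higher inspection score means more violation points.

```python
from statistics import mean

# Hypothetical inspection records; names and scores are made up.
# Assumption: a higher score means more violation points (worse).
inspections = [
    {"name": "Taco Bell", "score": 9},
    {"name": "Pizza Hut", "score": 10},
    {"name": "Taco Bell/Pizza Hut", "score": 17},
    {"name": "Taco Bell/Pizza Hut", "score": 15},
    {"name": "Taco Bell", "score": 11},
]


def is_combination(name):
    """Hypothetical rule: a slash in the name marks a combination restaurant."""
    return "/" in name


def mean_score_by_group(rows):
    """Average the inspection scores for combination and single restaurants."""
    groups = {"combination": [], "single": []}
    for row in rows:
        key = "combination" if is_combination(row["name"]) else "single"
        groups[key].append(row["score"])
    # Skip a group with no rows so mean() is never called on an empty list.
    return {key: mean(scores) for key, scores in groups.items() if scores}


result = mean_score_by_group(inspections)
print(result)
```

With real inspection data, the same grouping would be run per chain, so that each combination restaurant is compared against its own single-brand counterparts.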
What’s the best way to tell stories with data?
If you can tell a story simply, you should. There’s this idea in computer science and math called ‘Occam’s razor’ and that’s just the idea that simpler ways of problem solving will outperform more complicated ones. I think that holds true for telling stories with data.
I found the fire hydrant in New York that was generating the most parking ticket revenue. It was on the Lower East Side, and there were two of them making more than minimum wage just by being fire hydrants. They were making about $55,000 a year just because of a mislabeled parking spot. When I wrote about that, a reporter reached out and said, ‘Well, how did you find this?’ My response was, ‘I counted.’ It speaks to the simplicity of some of this stuff.
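To show how little machinery “I counted” requires, here is a minimal Python sketch. It is an assumption-laden illustration rather than the method used on the blog: the ticket records, addresses, violation labels, and fine amounts below are invented, and real parking ticket data would need cleaning before locations could be matched.

```python
from collections import Counter

# Hypothetical ticket records: (violation, location, fine in dollars).
# All values are made up for illustration.
tickets = [
    ("FIRE HYDRANT", "100 Example St", 115),
    ("FIRE HYDRANT", "100 Example St", 115),
    ("FIRE HYDRANT", "22 Sample Ave", 115),
    ("NO STANDING", "100 Example St", 115),
    ("FIRE HYDRANT", "100 Example St", 115),
]


def revenue_by_location(records, violation):
    """Sum the fines for one violation type at each location."""
    totals = Counter()
    for kind, location, fine in records:
        if kind == violation:
            totals[location] += fine
    return totals


hydrant_totals = revenue_by_location(tickets, "FIRE HYDRANT")
# most_common(1) returns the single location with the largest total.
top_location, top_revenue = hydrant_totals.most_common(1)[0]
print(top_location, top_revenue)
```

Sorting locations by total fines is the whole analysis; the outliers at the top of the list are the places worth visiting in person.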
You spoke at an NYCEDC event where you said that there’s a value in ‘making work reproducible.’ What does that mean?
When somebody does some sort of analysis, if they don’t give access to the raw data that they’re working on, it makes it irreproducible. That means someone can’t come along and do a similar analysis and confirm your results. As more and more people use data, there are going to be more and more mistakes made. The best way to keep the quality of work good is to always release the underlying data with your analysis. The scientific method is basically saying that reproducibility is key, and I think we should take that into data science as well.
Even with data journalism, reporters will file FOIL requests to get data and then write on it. I would challenge those same reporters to also release the data set along with their writing so that others can follow up on their great work instead of just wondering how things were done.
So what’s the long-term goal here? What do you want this work to accomplish?
I think that in the end, one of my biggest goals is to help change the talking points around open data from transparency and watchdogging to partnership. I really just believe that as more people do things and help make the city run better, our agencies will benefit from this and the city as a whole will benefit. That’s the high-level goal of the blog. I hope that I’m setting an example and others can join the movement and help make the city better.