We’ll never be able to read all these documents

Salty old newsroom hands glorify investigations where reporters spend months reading boxes of documents in a sunless room to find the smoking gun proving a politician is corrupt. What if we could cut that down to a week and a half?

We’re working on projects – and documenting our process – to help journalists do exactly that, using machine learning tools and code. We’ll be adding how-to guides to this site in the months to come.

In the meantime, here are some real-world examples where reporters have already done this kind of work. (You can see an overview and links to more examples here.)

Finding serious assaults misclassified as minor ones

The Los Angeles Times double-checked the LAPD’s crime statistics, finding that the department had misclassified many serious assaults as minor crimes. They used a machine-learning algorithm to figure out which keywords commonly appeared in reports of serious assaults but were uncommon in reports of minor ones, and vice versa. Then they found instances where keywords that indicated serious assaults were present in reports police had classified as minor. (Then they hand-checked their work!)
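To make the idea concrete, here is a minimal sketch in Python. It is not the Times’ actual code or algorithm; it assumes a simple smoothed log-odds score per word, and the sample reports, function names, and threshold are all made up for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase the text and keep only runs of letters.
    return re.findall(r"[a-z]+", text.lower())

def keyword_scores(serious_reports, minor_reports):
    """Score each word: positive means it shows up more in serious reports."""
    # Count how many reports of each kind contain each word.
    serious = Counter(w for r in serious_reports for w in set(tokenize(r)))
    minor = Counter(w for r in minor_reports for w in set(tokenize(r)))
    n_s, n_m = len(serious_reports), len(minor_reports)
    scores = {}
    for w in set(serious) | set(minor):
        # Add-one smoothing so unseen words don't cause division by zero.
        p_s = (serious[w] + 1) / (n_s + 2)
        p_m = (minor[w] + 1) / (n_m + 2)
        scores[w] = math.log(p_s / p_m)
    return scores

def flag_suspect(reports_classified_minor, scores, threshold=1.0):
    """Return 'minor' reports whose words look more like serious assaults."""
    flagged = []
    for r in reports_classified_minor:
        total = sum(scores.get(w, 0.0) for w in set(tokenize(r)))
        if total > threshold:  # threshold is an arbitrary, hypothetical cutoff
            flagged.append(r)
    return flagged

# Invented examples, standing in for reports whose category we trust.
serious_examples = [
    "Suspect stabbed victim with a knife",
    "Victim was shot and hospitalized",
    "Suspect stabbed victim, victim hospitalized",
]
minor_examples = [
    "Suspect pushed victim during argument",
    "Suspect slapped victim, no injury",
    "Verbal argument, suspect pushed victim",
]
scores = keyword_scores(serious_examples, minor_examples)

# Reports the police classified as minor; the first one looks suspicious.
to_check = [
    "Victim stabbed and hospitalized after argument",
    "Suspect pushed victim during argument",
]
flagged = flag_suspect(to_check, scores)
```

Anything that lands in `flagged` is only a lead. As in the Times’ project, a person still has to read each flagged report.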

Detecting sexual abuse complaints among disciplinary reports

The Atlanta Journal-Constitution was faced with almost that exact situation: too many reports of doctors being disciplined for misconduct and not enough time to read them all to find which were about sexual abuse. After reading a substantial subset and categorizing them by hand, they picked words they thought would differentiate sexual-abuse-related reports from others. As they added and removed keywords, they tested the results, looking for a short keyword list that struck a balance between hiding irrelevant documents and keeping relevant ones. Then they set the keywords loose on all the documents, producing a set of reports that were likely relevant to the story and worth a close read.
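One way to test a keyword list against hand-categorized documents is to measure precision (how many of the matched documents are actually relevant) and recall (how many of the relevant documents got matched). The sketch below is an assumption about how such a test could look, not the AJC’s code; the sample documents and the `evaluate_keywords` helper are hypothetical.

```python
def evaluate_keywords(keywords, labeled_docs):
    """Return (precision, recall) for a keyword list.

    labeled_docs is a list of (text, is_relevant) pairs categorized by hand.
    A document "matches" if any keyword appears anywhere in its text,
    including inside a longer word.
    """
    tp = fp = fn = 0
    for text, is_relevant in labeled_docs:
        lowered = text.lower()
        hit = any(k in lowered for k in keywords)
        if hit and is_relevant:
            tp += 1          # matched and relevant
        elif hit and not is_relevant:
            fp += 1          # matched but irrelevant
        elif not hit and is_relevant:
            fn += 1          # relevant but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Invented, hand-labeled sample: True means "about sexual abuse".
labeled = [
    ("Physician engaged in sexual misconduct with a patient", True),
    ("Doctor made inappropriate contact during an examination", True),
    ("Physician failed to maintain adequate patient records", False),
    ("Doctor prescribed controlled substances inappropriately", False),
]

narrow = evaluate_keywords(["sexual"], labeled)
broad = evaluate_keywords(["sexual", "inappropriate"], labeled)
```

In this toy sample the narrow list matches only relevant documents but misses one, while the broad list catches both relevant documents and also pulls in an irrelevant one. That tension is the tradeoff being tuned each time a keyword is added or removed.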

Spotting political advertising

When automatically “reading” documents is quick and easy, possibilities open up for quantitatively reporting on text without a human reading it at all. ProPublica’s Facebook Political Ad Collector aimed to build a public database of political ads crowdsourced from readers who used Facebook. The database collected all ads seen by the participants and then used machine learning to sort the political ones from the non-political ones. For this project I used a technique called Naive Bayes to learn which words’ presence in an ad was most associated with the ad being political or not, and then automatically classified ads as they came in. (It worked pretty well!)
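For a feel of how Naive Bayes works, here is a small from-scratch sketch in Python. It is not the Ad Collector’s code: the `NaiveBayes` class, the tokenizer, and the six training ads are invented for illustration, and a real project would need far more training data (and would probably use an existing library).

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

class NaiveBayes:
    """A tiny multinomial Naive Bayes text classifier with add-one smoothing."""

    def fit(self, texts, labels):
        self.doc_counts = Counter(labels)   # number of ads per label
        self.word_counts = {}               # label -> word frequencies
        for text, label in zip(texts, labels):
            self.word_counts.setdefault(label, Counter()).update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # Start from how common the label is overall...
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            # ...then add evidence from each known word in the ad.
            for w in tokenize(text):
                if w in self.vocab:
                    score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Invented training ads, labeled by hand.
ads = [
    "Vote for Smith for Senate this November",
    "Tell Congress to pass the climate bill",
    "Donate to the campaign and vote early",
    "Huge sale on running shoes this weekend",
    "Try our new pizza delivery app",
    "Free shipping on all shoes and boots",
]
labels = ["political"] * 3 + ["other"] * 3

model = NaiveBayes().fit(ads, labels)
guess_a = model.predict("Vote early for the climate")
guess_b = model.predict("Sale on pizza and shoes")
```

The “naive” part is that each word counts as independent evidence. That is rarely true of real language, but it often works well enough for sorting text into a few categories.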

Finding out what topics Members of Congress talk about

For another project at ProPublica, where I used to work, I used machine learning to sort press releases into a set of topics created by the Library of Congress to categorize bills. The goal was to find which legislative topics each member of Congress talks about most to their constituents. First, I hand-picked keywords for many different topics. For example, farm, ranch, and USDA were matched with the “Agriculture and Food” topic. Next, the computer scanned each document to measure how mathematically “close” its text was to each topic. The algorithm, called Doc2Vec, is pretty good at guessing when a document may be related to a topic even if that document doesn’t contain any of the topic’s keywords.
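Doc2Vec needs a trained model, so the sketch below swaps in a much simpler stand-in: plain word-count vectors compared with cosine similarity. It shows what measuring “closeness” between a document and a topic’s keywords can look like, but it is not Doc2Vec and not the project’s code. The “Health” topic, its keywords, and the helper names are assumptions made up for the example.

```python
import math
import re
from collections import Counter

def vectorize(text):
    # Represent text as a bag of word counts.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    # Cosine similarity: 1.0 means same direction, 0.0 means no shared words.
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_topics(document, topic_keywords):
    """Return (topic, similarity) pairs, closest topic first."""
    doc_vec = vectorize(document)
    scores = {
        topic: cosine(doc_vec, vectorize(" ".join(words)))
        for topic, words in topic_keywords.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

topics = {
    "Agriculture and Food": ["farm", "ranch", "USDA"],
    "Health": ["hospital", "medicare", "doctor"],  # hypothetical keywords
}

ranked = rank_topics("Senator visits family farm to discuss USDA loans", topics)
no_overlap = rank_topics("Drought hurts corn growers", topics)
```

The second call shows the weakness of pure word matching: a release about drought and corn scores zero for every topic because it shares no keywords with any of them. That gap is what an embedding approach like Doc2Vec is meant to close.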

Identifying toxic comments

The New York Times and Alphabet’s Jigsaw division (aka Google) collaborated to automatically score reader comments by how “toxic” they were – in essence checking how similar new comments were to those editors had previously accepted or rejected. This let editors concentrate on the comments that were neither clearly okay nor clearly not.
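Once each comment has a score, the triage step itself can be very simple. This sketch assumes scores between 0 and 1 that come from some model not shown here; the cutoffs, queue names, and sample comments are hypothetical and do not describe how the Times’ system is actually configured.

```python
def triage(scored_comments, approve_below=0.2, reject_above=0.8):
    """Split (comment, toxicity_score) pairs into three queues.

    Scores are assumed to run from 0 (clearly fine) to 1 (clearly toxic).
    The two cutoffs are made-up values an editor would tune.
    """
    queues = {"likely_ok": [], "needs_review": [], "likely_toxic": []}
    for comment, score in scored_comments:
        if score < approve_below:
            queues["likely_ok"].append(comment)
        elif score > reject_above:
            queues["likely_toxic"].append(comment)
        else:
            # The uncertain middle is where human attention matters most.
            queues["needs_review"].append(comment)
    return queues

# Invented comments with invented scores.
scored = [
    ("Great reporting, thank you.", 0.03),
    ("I disagree, and the author should be ashamed.", 0.55),
    ("You are all idiots.", 0.93),
]
queues = triage(scored)
```

Moving the cutoffs closer together sends fewer comments to editors but trusts the model more. Moving them apart does the opposite.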

You can do this, too

We’re convinced more journalists can use these techniques to help with investigations and day-to-day stories. Throughout this calendar year, we’ll be posting guides, tips, tools, and code to do just that. If you’d like to know when we post more information, enter your email in the box below. And if you have a project you think might benefit from machine learning now, let us know at bots@qz.com. Maybe we can collaborate.