We need to find more records like these in a huge pile of data

It’s probably no surprise that computers can process numeric data remarkably well. So if you’ve got thousands or millions of records, you can use an algorithm to find records similar to those you already know about.

Those records – whether they represent airplanes, taxpayers, or whatever else – can be classified with algorithms like “random forests” that are reasonably easy to understand and explain.

We’ll explain how these algorithms work right here in the coming weeks, so drop your email address in the box at the bottom of the page if you’d like updates. In the meantime, we’ll post existing examples below to get you thinking about projects where machine learning can help.

Spotting surveillance planes in flight data

BuzzFeed News used a random forest algorithm to sort through thousands of airplane flights to find which ones were hidden spy planes operated by law enforcement, the military, and military contractors.

Their algorithm automatically compared each flight to a set of flights that had been hand-categorized as either surveillance planes or something else. The algorithm then determined on its own which attributes of each flight were relevant and which weren’t – for instance, that sharp turns mattered (these planes often fly in tight circles), but that the total number of transponder pings from the aircraft didn’t.
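To make that idea concrete, here is a minimal sketch in Python. It is not BuzzFeed’s code (theirs was written in R), and everything in it is an assumption made for illustration: the flight data is randomly generated, the feature names turn_rate, pings, and altitude are hypothetical, and each “tree” is only a single yes/no split rather than the deeper trees a real random forest grows. It does follow the same basic recipe: train many small trees on random resamples of hand-labeled flights, let them vote, and see which attributes the trees end up relying on.

```python
import random
from collections import Counter

# Hypothetical attribute names; the real analysis used its own set of features.
FEATURES = ["turn_rate", "pings", "altitude"]

def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means perfectly pure)."""
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def fit_stump(rows, labels, features):
    """Find the single (feature, threshold) split with the lowest impurity."""
    best = None
    for feature in features:
        for threshold in sorted({row[feature] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            if not right:
                continue  # this threshold does not split the data at all
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or impurity < best["impurity"]:
                best = {
                    "impurity": impurity,
                    "feature": feature,
                    "threshold": threshold,
                    # each side predicts its majority class
                    "left": int(2 * sum(left) >= len(left)),
                    "right": int(2 * sum(right) >= len(right)),
                }
    return best

def fit_forest(rows, labels, n_trees=101, n_features=2, seed=0):
    """Train one-split trees on bootstrap resamples with random feature subsets."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        picks = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        subset = rng.sample(FEATURES, n_features)         # this tree sees only some features
        stump = fit_stump([rows[i] for i in picks], [labels[i] for i in picks], subset)
        if stump is not None:
            forest.append(stump)
    return forest

def predict(forest, row):
    """Fraction of trees voting 'surveillance' (label 1) for one flight."""
    votes = sum(s["right"] if row[s["feature"]] > s["threshold"] else s["left"] for s in forest)
    return votes / len(forest)

def feature_usage(forest):
    """Crude importance proxy: share of trees that chose to split on each feature."""
    counts = Counter(s["feature"] for s in forest)
    return {f: counts[f] / len(forest) for f in FEATURES}

def make_flights(seed=1):
    """Made-up training data: only turn_rate actually differs between the classes."""
    rng = random.Random(seed)
    rows, labels = [], []
    for label in (1, 0):
        for _ in range(20):
            rows.append({
                "turn_rate": rng.uniform(6, 12) if label else rng.uniform(0, 4),
                "pings": rng.uniform(100, 1000),
                "altitude": rng.uniform(1000, 10000),
            })
            labels.append(label)
    return rows, labels

rows, labels = make_flights()
forest = fit_forest(rows, labels)
usage = feature_usage(forest)
circling = {"turn_rate": 9.0, "pings": 500, "altitude": 5000}
straight = {"turn_rate": 0.0, "pings": 500, "altitude": 5000}
print(usage)
print(predict(forest, circling), predict(forest, straight))
```

In this toy setup the trees should lean on turn_rate far more often than on the other two attributes, because it is the only one that separates the hand-labeled groups. Counting how often a feature gets picked is a rough stand-in for the importance measures that full random forest libraries report.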

Once the algorithm categorized all the flights, BuzzFeed then checked, by hand, the ones the computer thought were spy planes. Interestingly, the algorithm also flagged several planes owned by skydiving companies, which often fly in tight circles.
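The hand-checking step can be sketched in a few lines of Python, too. This is only an illustration under our own assumptions: the flight IDs, scores, and verdicts below are invented, and the 0.5 cutoff is an arbitrary choice, not a value from BuzzFeed’s analysis. The point is that the model’s output is a to-do list for reporters, not a final answer.

```python
def review_queue(scores, threshold=0.5):
    """Return (flight_id, score) pairs at or above the cutoff, most suspicious first."""
    flagged = [(flight_id, score) for flight_id, score in scores.items() if score >= threshold]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)

# Invented model scores: the share of trees that voted "surveillance" for each flight.
scores = {"flight_a": 0.92, "flight_b": 0.15, "flight_c": 0.71, "flight_d": 0.48}
queue = review_queue(scores)

# Invented verdicts a reporter might record after checking each flagged flight by hand.
verdicts = {"flight_a": "surveillance", "flight_c": "skydiving"}
false_positives = [flight_id for flight_id, _ in queue if verdicts.get(flight_id) != "surveillance"]

print(queue)
print(false_positives)
```

Flights the model flags but a human rules out, like the skydiving planes, land in false_positives. Keeping track of them shows where the model’s idea of “suspicious” differs from yours.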

There are more sophisticated algorithms than random forests, but they’re also harder to use and often far harder to understand. Many don’t, for instance, tell you which variables matter most to the computer’s decision, which is important for explaining to readers or sources why you trust that the computer made the right sorts of decisions.

You can do this, too

BuzzFeed published the data and code used for this story, though replicating it might be tricky for someone who isn’t already familiar with the R programming language or who is working with a different data set. We hope to provide some tools, guides, and resources geared toward journalists wishing to do similar work. Give us your email below and we’ll let you know when we do.