Machine learning can detect what’s particular to Lyft’s financial filings compared to other companies, or Donald Trump’s speeches versus those of other presidents.
We think it would be useful to help journalists use machine learning this way, and we’re actively building projects and guides we’ll publish on this site in the coming weeks. But to give you a sense of what’s possible, here are some real-world examples where I used this kind of sorting. (You can see an overview and links to more examples here.)
Detecting what Members of Congress care about
Members of Congress put out a lot of press releases, thousands a year. Many are pretty boring, but ProPublica hypothesized that by reading them all you might get a sense of a member’s political priorities – and even summarize what Congress is concerned about each week.
When I worked at ProPublica, I used an “old-fashioned” machine-learning technique called TF-IDF (term frequency-inverse document frequency) to find the words in a single member’s set of press releases that appeared at a higher rate than in the combined set of all Congressional press releases. Essentially it showed the things a member cared about (or, at least, talked about) more often than their peers.
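To make the idea concrete, here is a minimal sketch of that kind of scoring in plain Python. It is not the code ProPublica ran. It assumes each member's press releases have been joined into one long text, and it uses the simplest unsmoothed form of the formula. The member names and sentences are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def distinctive_terms(docs, target, top_n=5):
    """Rank the words in docs[target] by a basic TF-IDF score.

    docs maps a (hypothetical) member name to all of that member's
    press releases joined into one string.
    """
    counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    n_docs = len(counts)

    # Document frequency: how many members used each word at least once.
    doc_freq = Counter()
    for word_counts in counts.values():
        doc_freq.update(word_counts.keys())

    target_counts = counts[target]
    total_words = sum(target_counts.values())
    scores = {}
    for term, count in target_counts.items():
        tf = count / total_words                  # share of this member's words
        idf = math.log(n_docs / doc_freq[term])   # 0 when every member uses it
        scores[term] = tf * idf

    # Highest score first; ties broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Made-up example text, not real press releases.
docs = {
    "member_a": "Today I voted to fund rural broadband. Rural broadband connects farms.",
    "member_b": "Today I voted to protect veterans. Veterans deserve care.",
    "member_c": "Today I voted to lower drug prices.",
}
top_terms = [term for term, _ in distinctive_terms(docs, "member_a", top_n=2)]
print(top_terms)
```

Words that every member uses, such as "today" and "voted," score zero, so the words that rise to the top are the ones this member uses and the others don't.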
We could also find newsy topics by comparing occurrences in a single week to occurrences over a few years’ worth of press releases. This worked pretty well to automatically highlight, for example, “government shutdown” and “furloughed” in January.
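One way to sketch that week-versus-history comparison is to compute how often each word or two-word phrase appears in the recent batch, divide by how often it appears in the long-run batch, and rank by the ratio. The add-one smoothing, the `min_count` cutoff and the function names below are my assumptions for illustration, not a description of the original pipeline, and the sample sentences are invented.

```python
import re
from collections import Counter


def terms(text):
    """Return single words plus two-word phrases (bigrams) from the text."""
    words = re.findall(r"[a-z']+", text.lower())
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + bigrams


def trending_terms(week_texts, baseline_texts, min_count=2, top_n=5):
    """Rank terms by (rate this week) / (rate in the baseline period)."""
    week = Counter(t for text in week_texts for t in terms(text))
    base = Counter(t for text in baseline_texts for t in terms(text))
    week_total = sum(week.values())
    base_total = sum(base.values())
    vocab_size = len(set(week) | set(base))

    scores = {}
    for term, count in week.items():
        if count < min_count:
            continue  # skip one-off terms, which are mostly noise
        # Add-one smoothing so terms missing from the baseline don't divide by zero.
        week_rate = (count + 1) / (week_total + vocab_size)
        base_rate = (base[term] + 1) / (base_total + vocab_size)
        scores[term] = week_rate / base_rate

    # Highest ratio first; ties broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Made-up sentences standing in for press releases.
this_week = [
    "The government shutdown has left workers furloughed.",
    "Furloughed workers need back pay during the government shutdown.",
]
past_years = [
    "The committee passed the highway bill.",
    "Workers need fair pay and the government must act.",
    "The senator praised the new highway funding.",
]
newsy = [term for term, _ in trending_terms(this_week, past_years, top_n=3)]
print(newsy)
```

Common words like "the" show up in both periods, so their ratio stays low, while terms that are new this week float to the top.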
Spotting risk factors unique to Lyft
Here at Quartz, we’re doing something similar with financial filings. Public companies must share with investors the factors management thinks could pose risks to the company’s success.
I had been noodling on this already, and had a corpus of those risk statements from the annual filings of every company in the S&P 500. Then, in the hour after Lyft filed paperwork for its IPO, I was able to see which words Lyft mentioned as risk factors that other companies did not.
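The simplest version of that comparison is a lookup: collect the words used in every existing filing, then list the words in the new filing that none (or very few) of the others use. The sketch below shows that idea only. The function names, the `max_share` parameter and the sample sentences are hypothetical, and the text is invented, not taken from any real filing.

```python
import re
from collections import Counter


def words(text):
    """Return the set of distinct lowercase words, keeping hyphenated terms whole."""
    return set(re.findall(r"[a-z]+(?:[-'][a-z]+)*", text.lower()))


def unusual_words(new_filing, corpus, max_share=0.0):
    """List words in new_filing used by at most max_share of the corpus filings.

    corpus maps a company name to the text of its risk-factor section.
    max_share=0.0 keeps only words no other company used at all.
    """
    doc_freq = Counter()
    for text in corpus.values():
        doc_freq.update(words(text))  # count each company once per word

    limit = max_share * len(corpus)
    return sorted(w for w in words(new_filing) if doc_freq[w] <= limit)


# Invented risk statements for illustration only.
corpus = {
    "company_a": "Competition and regulation could harm our business.",
    "company_b": "Cyberattacks and regulation could harm our reputation.",
}
new_filing = "Regulation could harm our business, our drivers and our scooters."

only_here = unusual_words(new_filing, corpus)
print(only_here)
```

Raising `max_share` loosens the filter, so words that only a small fraction of other companies mention also make the list.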
You can do this, too
We’re writing up the process and code I used for the Lyft story so anyone can use the same techniques. We also have a couple of similar projects in the works, and will share details about those in the coming weeks, too.
If you’d like an alert when we post that information, drop your email into the box below. And if you have a project you think might benefit from this kind of machine learning, let us know at firstname.lastname@example.org.