Exposing Alleged Corruption with Universal Sentence Encoder and Annoy

More than 700,000 leaked documents, weighing in at 356 gigabytes, reveal how Isabel dos Santos, the wealthiest woman in Africa and the daughter of Angola’s former president, siphoned hundreds of millions of dollars in public money out of one of the poorest countries on the planet. Digging into such leaks is something the International Consortium of Investigative Journalists (ICIJ) has plenty of experience with. But they had a problem:

That’s a heck of a lot of files.

We turned each of the 29,630,810 sentences, extracted from those 700,000 leaked documents with DataShare, into a 512-dimensional vector using the Universal Sentence Encoder. Similar sentences should have vectors that are close together. Rather than training our own model, we used Google’s off-the-shelf one; after all, we had no training set telling us which sentences are similar to each other. We then indexed those vectors with Annoy so that searching them is quick and easy.
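To make "vectors that are close together" concrete, here is a minimal sketch of the search step in plain Python. It is not the project's code: the three-dimensional vectors are hypothetical stand-ins for the real 512-dimensional sentence embeddings, the function names are made up for illustration, and the exhaustive scan stands in for Annoy, which returns approximate nearest neighbors from a prebuilt index instead of comparing the query against every vector. The distance is the one Annoy calls "angular": the Euclidean distance between the two vectors after scaling them to unit length.

```python
import math


def angular_distance(u, v):
    """Euclidean distance between unit-normalized u and v: sqrt(2 * (1 - cos))."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = dot / (norm_u * norm_v)
    # Clamp at zero to guard against tiny negative values from rounding.
    return math.sqrt(max(0.0, 2.0 * (1.0 - cos)))


def nearest(query, vectors, k=2):
    """Return the indices of the k vectors closest to the query (exact scan)."""
    ranked = sorted(range(len(vectors)),
                    key=lambda i: angular_distance(query, vectors[i]))
    return ranked[:k]


# Hypothetical toy data: in the real pipeline each vector would come from the
# Universal Sentence Encoder and have 512 dimensions.
sentences = [
    "Minutes of the board meeting",
    "Ata da reunião do conselho",
    "Invoice for catering services",
]
vectors = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.1, 0.9],
]

# Hypothetical embedding of the reporter's query.
query_vector = [1.0, 0.1, 0.0]

hits = nearest(query_vector, vectors, k=2)
for i in hits:
    print(sentences[i], round(angular_distance(query_vector, vectors[i]), 3))
```

An exact scan like this has to touch every vector for every query, which is impractical for roughly 30 million sentences. That is the reason to use an approximate index such as Annoy, which trades a little accuracy for much faster lookups.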

So, suppose you’re looking for board meeting minutes for Acme Corp. In a traditional keyword-based search system, you could easily search for “Acme minutes.” But then you’ll miss the meeting agendas. You’ll also miss documents in which a text scanner mistranscribed the word as “mimites.” And you’ll miss documents in Portuguese that say “ata da reunião do conselho.” This semantic-search system will catch all (or most!) of those.