We just finished a project that used AI to help reporters dig through the 200,000 documents that made up the Mauritius Leaks, in collaboration with other Quartz journalists and the International Consortium of Investigative Journalists.
Our AI aided the investigation by taking a journalist’s human judgment about what makes a document interesting – such as a tax return or a business plan – and applying it across the entire trove. You can read more about it here.
The whole trove remains secret, which means we can’t publish the exact steps we took for this analysis. But we’ve posted a GitHub repository with very similar steps using a public document dump from a New York City court case.
We used the doc2vec implementation from the Gensim topic modeling library and trained a model from scratch, with 20 epochs, which took about 13 hours on my laptop. The documents themselves had already been turned into plain text by ICIJ, which uses Apache Tika as part of its document search system.
To access the content of the documents quickly, and to simulate what would be returned from a given search query in ICIJ’s search interface, I additionally indexed them to a local Elasticsearch instance (but a SQL database would’ve worked fine).
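The indexing step might look something like this. The index name and documents are invented for illustration, and the actual client call is left commented out since it needs a running Elasticsearch instance:

```python
# Stand-in documents: one bulk action per document, keyed by its ID, so a
# later search can pull back the full text quickly.
documents = [
    ("doc-001", "annual tax return for the fiscal year with reported income"),
    ("doc-002", "business plan outlining projected revenue and expansion"),
]

actions = [
    {"_index": "leaks", "_id": doc_id, "text": text}
    for doc_id, text in documents
]

# With the elasticsearch-py client and a local instance running:
#   from elasticsearch import Elasticsearch, helpers
#   es = Elasticsearch("http://localhost:9200")
#   helpers.bulk(es, actions)
```

Storing documents under stable IDs is what lets the doc2vec tags and the search index refer to the same files.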
What is Doc2Vec?
Doc2vec is an extension of the more widely-known word2vec algorithm that — using math you don’t have to understand — maps words into a 100-dimensional space where similar words are close together and the relationships between words are represented spatially. This math makes it possible for the meanings of words to be added and subtracted in ways that often make sense. (A common example is that, if you ask a word2vec model the meaning of “king” minus the meaning of “man” plus the meaning of “woman”, it will respond “queen”).
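Here’s a toy illustration of that arithmetic, using hand-made two-dimensional vectors instead of a trained model. Real embeddings are learned from text and have far more dimensions, but the nearest-neighbor lookup works the same way:

```python
from math import sqrt

# Toy "word vectors" with two hand-picked axes (roughly: gender, royalty).
vectors = {
    "man":    (-1.0, 0.0),
    "woman":  ( 1.0, 0.0),
    "king":   (-1.0, 1.0),
    "queen":  ( 1.0, 1.0),
    "prince": (-1.0, 0.9),
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b))
    return dot / norm

def nearest(target, exclude):
    # The remaining word whose vector is closest to the target,
    # by cosine similarity.
    return max((w for w in vectors if w not in exclude),
               key=lambda w: cosine(vectors[w], target))

# "king" - "man" + "woman", done coordinate by coordinate
target = tuple(k - m + w for k, m, w in
               zip(vectors["king"], vectors["man"], vectors["woman"]))
print(nearest(target, exclude={"king", "man", "woman"}))  # queen
```

Excluding the query words mirrors what word2vec libraries do when answering analogy queries, so the model doesn’t just answer with “king.”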
Doc2vec maps documents into this same space, so similar documents are close together.
Read more of Quartz’s coverage of the Mauritius Leaks.