When we launched the Quartz AI Studio, we promised ourselves we wouldn’t do any projects that involved Twitter data. And we promised ourselves we wouldn’t do any projects that involved fact-checking.
But then Dan Keemahill at the Austin American-Statesman asked if we could automatically detect statements of fact … on Twitter. Specifically, he was interested in fact-checking tweets containing the hashtag #txlege for the Texas legislative session.
So we did. And it’s working pretty nicely.
The process boils down to four parts, which could apply to many text-classification projects:
- Prep the data: Pull together a sample of hand-categorized tweets, and put them into a format for training.
- Make a language model: Teach a computer to understand the language we’re using (English) infused with the corpus of text we’re using (#txlege tweets).
- Make a categorization model: Using the language model, train a new model to distinguish between checkable and non-checkable tweets.
- Make real-time guesses: Use the categorization model to evaluate new tweets … and post the fact-checkable ones into a Slack channel, as in the image above.
The data at hand
All of our projects start with data: images, spreadsheets, or text. In this case, we started with a CSV of 1,971 tweets that Dan, Madlin Mekelburg, and others at the newspaper had already categorized by hand, deeming them checkable or not.
I did some experimenting on another set of tweets, using the steps below, and asked Dan and his team to correct the computer’s work.
That gave us 3,797 tweets, which I put into a CSV with two columns:
- tweet_text, which was the text of the tweet, and
- checkable, which was “True” or “False.”
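As a minimal sketch of that two-column layout, here is one way to write and read such a file with Python's standard csv module. The helper names and the sample tweets are made up for illustration; only the column names tweet_text and checkable come from the description above.

```python
import csv
import io


def write_training_csv(rows, handle):
    """Write (text, is_checkable) pairs as a two-column CSV with a header."""
    writer = csv.writer(handle)
    writer.writerow(["tweet_text", "checkable"])
    for text, is_checkable in rows:
        # Store the label as the string "True" or "False".
        writer.writerow([text, "True" if is_checkable else "False"])


def read_training_csv(handle):
    """Read the CSV back into (text, is_checkable) pairs."""
    reader = csv.DictReader(handle)
    return [(row["tweet_text"], row["checkable"] == "True") for row in reader]


# Made-up example tweets, not taken from the real dataset.
sample = [
    ("The budget adds $5 billion for schools, says the chair. #txlege", True),
    ("Long day at the Capitol. Coffee, please. #txlege", False),
]

buffer = io.StringIO()  # stands in for a file on disk
write_training_csv(sample, buffer)
buffer.seek(0)
loaded = read_training_csv(buffer)
```

The csv module quotes any tweet text containing commas, so the two columns stay intact on the way back in.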
The language model
Many approaches to natural language processing use word probabilities for language predictions. So, for example, is a tweet more likely to be fact-checkable if it contains “hot” or “dog” or “hot dog”?
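To make the word-probability idea concrete, here is a toy sketch that computes, for each word, the share of tweets containing it that were labeled checkable. This is a deliberately simple stand-in for that family of approaches, not the method used in this project, and the three labeled examples are invented.

```python
from collections import Counter


def word_checkable_rates(labeled_tweets):
    """Return {word: fraction of tweets containing it that are checkable}."""
    total = Counter()
    checkable = Counter()
    for text, is_checkable in labeled_tweets:
        # Count each word once per tweet, ignoring case.
        for word in set(text.lower().split()):
            total[word] += 1
            if is_checkable:
                checkable[word] += 1
    return {word: checkable[word] / total[word] for word in total}


# Invented examples for illustration only.
toy = [
    ("taxes rose 10 percent", True),
    ("taxes are too high", False),
    ("enrollment rose 3 percent", True),
]
rates = word_checkable_rates(toy)
```

In this toy data, "percent" only ever appears in checkable tweets, while "taxes" appears in one of each.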
We’re using a different approach – a neural network – that is instead trained to predict the word most likely to follow “I would like to eat a hot … .” We then leverage this base “understanding” of English to further predict whether the phrases being used look like a fact-checkable tweet.
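The "predict the next word" task can be illustrated without a neural network. The sketch below uses plain bigram counts, which is a much cruder technique than the model described here, but it shows the same input and output: given a word, guess what comes next. The tiny corpus is made up.

```python
from collections import Counter, defaultdict


def build_bigram_counts(sentences):
    """Count, for every word, which words follow it and how often."""
    following = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            following[current][nxt] += 1
    return following


def predict_next(following, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    candidates = following.get(word.lower())
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]


# Made-up sentences for illustration only.
toy_corpus = [
    "i would like to eat a hot dog",
    "she ordered a hot dog",
    "he drank a hot coffee",
]
bigrams = build_bigram_counts(toy_corpus)
next_word = predict_next(bigrams, "hot")
```

A neural language model learns far richer context than the single preceding word, but the training signal is the same kind of guess-the-next-word exercise.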
But 4,000 tweets is not nearly enough information to teach a computer patterns of the English language. So we start with an existing model called Wikitext 103, which was trained on nearly 30,000 Wikipedia articles. (I learned about this model because I’m using the fast.ai open-source library and had seen this excellent, free fast.ai class.)
But the English of Wikipedia is still quite different from the English of Twitter, with its hashtags, brevity, quips, and snark. Fortunately, fast.ai makes it easy to infuse an existing language model with a new corpus – whether it be medical texts, newspaper articles, or #txlege tweets. This is known as “transfer learning.”
My corpus consists of the tweet_text column from our original 3,797 tweets plus the text from another 3,688 tweets I collected using a nifty recipe in IFTTT. (You’ll have to sign up to see the recipe, or applet.) Every time someone tweeted with the #txlege hashtag, IFTTT added the tweet as a row in a Google spreadsheet. I set this up at the very beginning of the project, so after a few days I had plenty to work from.
I’ve published the code I used to get tweets, but not retweets, from the sheets. And there’s much more detail in the Google Colab notebook about how I used fast.ai to infuse the Wikitext model with the corpus of tweets.
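The published code is not reproduced here, but the general shape of "keep tweets, drop retweets, combine into one corpus" might look like the sketch below. It assumes retweets can be recognized by text that begins with "RT @", which is a common convention but an assumption on my part; the function names and sample rows are hypothetical.

```python
def is_retweet(text):
    """Assumption: a retweet's text starts with 'RT @'."""
    return text.lstrip().startswith("RT @")


def build_corpus(labeled_texts, collected_texts):
    """Combine hand-labeled tweet text with collected, non-retweet text."""
    originals = [text for text in collected_texts if not is_retweet(text)]
    return list(labeled_texts) + originals


# Made-up rows standing in for the spreadsheet contents.
corpus = build_corpus(
    ["Made-up labeled tweet about the budget #txlege"],
    [
        "RT @someone: Made-up retweet #txlege",
        "Made-up original tweet #txlege",
    ],
)
```

The language model only needs raw text, so the labels are dropped at this stage and the two sources are simply concatenated.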
The categorization model
With a language model in hand, I could use the original, hand-coded tweets to train a categorization model to guess which tweets were fact-checkable and which were not.
In fast.ai it was easy to randomly split those tweets into a “training” set used to build the model, and a smaller “validation” set used to gauge how well it was doing along the way.
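fast.ai handles the split internally, but the idea is simple enough to sketch in plain Python. The 20 percent validation share and the fixed seed below are illustrative assumptions, not figures from this project.

```python
import random


def split_train_valid(rows, valid_fraction=0.2, seed=42):
    """Shuffle rows reproducibly, then hold out a fraction for validation."""
    shuffled = list(rows)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_fraction)
    # The first n_valid shuffled rows become validation; the rest train.
    return shuffled[n_valid:], shuffled[:n_valid]


# Placeholder rows in the same (text, label) shape as the training data.
rows = [(f"tweet {i}", i % 2 == 0) for i in range(10)]
train, valid = split_train_valid(rows)
```

The model never trains on the validation rows, which is what makes the accuracy measured on them a fair check.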
After a few cycles of training, it was guessing correctly on my sample sets 93 percent of the time.
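That accuracy figure is just the share of validation tweets where the model's guess matched the human label. A minimal sketch of the arithmetic, using made-up predictions rather than the real results:

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the hand-assigned labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must be the same length")
    if not labels:
        raise ValueError("need at least one labeled example")
    correct = sum(guess == truth for guess, truth in zip(predictions, labels))
    return correct / len(labels)


# Made-up example: 93 matching guesses out of 100 labeled tweets.
example_labels = [True] * 100
example_guesses = [True] * 93 + [False] * 7
score = accuracy(example_guesses, example_labels)
```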
If I had a pile of old tweets, I could write a little loop to test each one for fact-checkability and get a nice list of guesses. Then I’d be done.
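Such a loop might look like the sketch below. The classify argument stands in for whatever the trained model exposes; the has_digit function is a throwaway placeholder so the example runs on its own, not the real classifier.

```python
def guess_all(tweets, classify):
    """Pair every tweet with the classifier's guess for it."""
    return [(tweet, classify(tweet)) for tweet in tweets]


def has_digit(text):
    """Placeholder classifier: calls a tweet checkable if it has a digit."""
    return any(character.isdigit() for character in text)


# Made-up tweets for illustration only.
old_tweets = [
    "The bill passed 98 to 47 #txlege",
    "What a session it has been #txlege",
]
guesses = guess_all(old_tweets, has_digit)
```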
But I wanted to evaluate new tweets as they happened.
That meant putting my code into the wild.
Making real-time guesses
I wanted my little project to watch for #txlege tweets, evaluate them, and post the fact-checkable ones into Slack – 24/7.
To watch for tweets, I set up a slightly different recipe in IFTTT that would send any new tweets to a webhook – which is just a URL set up to receive data.
Usually I set up webhooks on Glitch or Amazon Web Services (using AWS Lambda and Claudia API Builder). But the code used to run my pre-trained model is a little complex for both of those, so I signed up for an account on a hosting service called Render. It works wonderfully, though it does cost $5/month for what I’m doing.
The code running on Render:
- Reads my trained model from a file called export.pkl, which I exported from my original notebook and put on Amazon’s S3 web service (I could have put it in a public Dropbox or Google Drive file instead).
- Runs the text from the latest tweet through the model.
- If the result is “True” – fact-checkable – it …
  - Sends a Slack message to the American-Statesman with a link to that tweet.
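The steps above can be sketched as a small handler. This is an outline under several assumptions, not the deployed code: the payload keys "text" and "url" are hypothetical, classify stands in for the model loaded from export.pkl, and the Slack side assumes an incoming webhook that accepts a JSON body with a "text" field.

```python
import json
import urllib.request


def build_slack_message(tweet_url):
    """Build the JSON body for a Slack incoming webhook."""
    return {"text": f"Possibly fact-checkable #txlege tweet: {tweet_url}"}


def post_to_slack(webhook_url, message):
    """POST the message as JSON to the given Slack webhook URL."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


def handle_tweet(payload, classify, send):
    """Classify one incoming tweet and notify Slack if it looks checkable.

    payload: dict with hypothetical keys "text" and "url".
    classify: callable returning the string "True" or "False".
    send: callable that delivers a Slack message dict.
    Returns True if a message was sent.
    """
    if classify(payload["text"]) != "True":
        return False
    send(build_slack_message(payload["url"]))
    return True


# Dry run with stand-ins, so nothing is sent over the network.
sent = []
was_sent = handle_tweet(
    {"text": "Made-up tweet with 3 numbers", "url": "https://example.com/tweet/1"},
    classify=lambda text: "True",
    send=sent.append,
)
```

In a live service, send would be something like `lambda message: post_to_slack(webhook_url, message)`, with the webhook URL kept in configuration rather than in the code.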
The final step was putting the URL for this Render “web service” into my IFTTT recipe, which diligently sends every new #txlege tweet to my code for processing.
I’m pretty happy with how this all turned out. Like the American-Statesman’s earlier efforts, the model is drawn toward tweets with numbers …
… but not always …
Unfortunately, I finally finished this project after the Texas legislative session ended. So now I’m looking into monitoring a list of Texas officials instead.