A Crash Course for Journalists In Classifying Text with Machine Learning

When journalists ask their audience for help, success creates a whole new problem: what do you do with thousands of tips?

Or what do you do with thousands of textual descriptions of … anything … potholes, disciplinary actions at prisons, aircraft safety incidents? There are too many to really read.

And any time you feel “there are too many to really read,” that’s when you should consider getting help from machine learning.

The folks at ProPublica’s Documenting Hate project had this problem, with around 6,000 tips about hate crimes and bias incidents contributed by readers. To report a hate incident, someone only has to provide a written description of what happened.

If they choose, they can also fill out checkboxes for why the victim was targeted — e.g. because of their race, religion or immigrant status.

Only some people check any of those boxes. But that “targeted because” data is important for analysis and for getting tips to the right reporter. Could we train a computer to guess at what kind of target was involved based on the written description alone?

Who’s the target?

Take this made-up (and simple) tip about a real hate crime: “On Monday morning, the synagogue’s education director found swastika graffiti on the door of the Havurah House. She was just arriving to start preparing for Hebrew School. Havurah House is a Jewish community in Middlebury, Vermont.”

A human being obviously knows that this is a report of an antisemitic hate crime — that is, one targeted on the basis of religion. Our goal was to build a model that could correctly identify “religion” given just the text.

This is called “text classification,” but you can think of it as a very fancy way of filtering a column in Excel. Imagine you had a spreadsheet of 311 reports, with a text description and the city department they were assigned to (police, sanitation, homeless services) and you filtered it just to see sanitation-related tips. Now imagine you could do that — but without requiring the “department” column to even exist.

[If you’re not sure if text classification solves your problem, or if you want more info on other kinds of familiar journalism situations where AI can help you, read our first post, How you’re feeling when machine learning might help.]

Note that we’re not intuiting something a knowledgeable human couldn’t figure out at a glance. Some tips are just unclear or vague, and it’s not as if this algorithm can suss out the absolutely correct answer by magic. Just like there are borderline examples that two humans might disagree about, the computer will sometimes give the wrong answer in hard cases. But unlike a human, it’s no good at saying, “Hmmm, this was a tough one, but here’s my best guess.” Even though that’s obvious with something written in English, about which we have strong intuitions, it’s important to keep in mind with all machine-learning or AI tools: the computer is guessing, not exposing a hidden reality.

And, helpfully, it can read thousands, or even millions, of examples in minutes.

We trained a computer model that solved this problem pretty well. Before we dive into how we did it, you can experiment with the output of our model in this interactive demo. (No coding necessary!)

Our plan and execution

Here’s how we went about the project, the choices that we made (including the dead-ends!) and what we ended up with. While the nitty-gritty choices we made won’t apply to your project, the steps might.

If you’re a coder, you can try this plan with your own data in a Jupyter notebook. If you’re not comfortable with code, and/or don’t have time to dive in, this is also a road map for working with someone who is — either at your organization or a local university.

Our workflow — and likely the workflow of any successful journalism project using machine-learning — had 7 steps:

  1. Figuring out what question we wanted to answer: can we automatically classify hate crimes by type?
  2. Getting the data
  3. Cleaning the data to remove the things that might confuse a computer.
  4. Choosing an algorithm. (Or maybe more than one.)
  5. Formatting the data in the way that your chosen algorithm requires it.
  6. Feeding most of your data to your algorithm and perhaps waiting a few minutes.
  7. Looking at the results and deciding if it’s good enough or not — and if it isn’t, repeating steps 2-6 as necessary.

Step 1: A Question

We knew from the beginning what we wanted to do: predict the “targeted because” reason from the written, free-form description. In more human terms, we’re trying to answer the question: which of disability, ethnicity, gender, immigrant status, race, religion or sexual orientation would the tipster have chosen?

We have the right answer to that question for some of the tips. We can use those provided answers as what’s called “training data” — examples of the description along with the right answer we want the model to generate. And once we have a model that works, it should predict the answers when the tipster didn’t include the “targeted because” classification.

Step 2: Getting the data

Luckily, this part was easy. The data already existed at ProPublica and they agreed to share it with us — as a spreadsheet.

Step 3: Cleaning the data

Before we got started with the oh-so-sexy machine-learning bit of this project, we had to look at our data and figure out how to clean up whatever might distract the computer. By cleaning, I mean changing the data in little ways to make it easier for a computer to understand. A simple example of data cleaning for a text-processing task like this one is removing verb endings from words, so that the computer treats “punch,” “punched” and “punching” as the same thing.
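Here’s a minimal sketch of that verb-ending idea, using NLTK’s SnowballStemmer. It’s just one common choice, not necessarily what the Documenting Hate project used:

```python
# Stemming: one common way to make "punch," "punched" and "punching"
# look identical to the computer. (Illustration only; the project may
# have cleaned its text differently.)
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

for word in ["punch", "punched", "punching"]:
    print(word, "->", stemmer.stem(word))
# punch -> punch
# punched -> punch
# punching -> punch
```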

You can think of your role in the data-cleaning process as asking yourself “which two things are so similar that the computer should see them as exactly the same?”

It’s better to do this data cleaning with code rather than by hand in, say, Excel. That way you can apply exactly the same cleaning to any new data you later feed to the model.

That data cleaning can involve changing each datapoint or it can involve removing some entirely. We ignored tips that the Documenting Hate manager had marked as submitted by trolls or not applicable (e.g. people writing to ProPublica about other topics).

To start with, that’s all we did, then moved on to the next step.

Later on, once we’d built a basic model and were working to improve it, we came back to the data cleaning step. In the Documenting Hate database, some tips were submitted in Spanish. I wrote some (simple) custom code to guess if a tip was in English or Spanish and, if it was Spanish, to ignore it.
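That custom language-guessing code isn’t shown in the post. As a hedged sketch, here’s one way to get the same effect with the langdetect package (an assumption on my part, not the author’s actual approach):

```python
# Filtering out Spanish-language tips. The original project used simple
# custom code; this sketch leans on the langdetect package instead.
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

def looks_english(description):
    """Return True if the tip appears to be written in English."""
    try:
        return detect(description) == "en"
    except LangDetectException:
        # Very short or empty strings can't be detected reliably.
        return False

tips = [
    "Someone spray-painted a slur on my neighbor's garage.",
    "Alguien pintó un insulto en el garaje de mi vecino.",
]
english_tips = [tip for tip in tips if looks_english(tip)]
```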

Every single piece of the data cleaning process is specific to the question you’re asking. I can’t tell you how to do it. But it’s necessary in basically every project, and it will get you far bigger improvements than fiddling with the settings on your model.

Step 4: Choosing an algorithm.

There’s a dizzying array of machine-learning algorithms, with more appearing every day, but it’s key to remember that each one is usually just an incremental improvement on its predecessors. I’m here to tell you that — for your purposes as a journalist — it doesn’t matter much which you pick, as long as you pick an algorithm that does the right kind of thing.

The kind of thing that we’re trying to do is called supervised classification. It’s called supervised machine learning since we have the right answers for some of our data. And we’re trying to get the computer to guess categories (as opposed to, say, numbers), so this is a classification problem.

Classification is usually framed as a yes-or-no choice. So when we’re classifying each hate incident, we’re actually training seven models — one each for sexual orientation, religion, race and so on.
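To make that concrete, here’s a rough sketch of the “one yes/no model per category” idea with pandas. The file and column names are placeholders, not the real Documenting Hate fields:

```python
# Building one True/False target column per category; each column then
# gets its own classifier. File and column names are placeholders.
import pandas as pd

tips = pd.read_csv("tips.csv")  # hypothetical export of the tips

categories = ["disability", "ethnicity", "gender", "immigrant",
              "race", "religion", "sexual-orientation"]

for category in categories:
    # True if the tipster's "targeted because" answer mentions this category.
    tips["is_" + category] = tips["targeted_because"].str.contains(
        category, case=False, na=False)
```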

Step 5. Formatting the data in the way that your chosen algorithm requires it.

Our tips are in English, in a regular ol’ spreadsheet: one column containing a sentence or two, maybe a couple paragraphs of text. But computers can’t read! So we’re going to have to convert the data into the particular “format” the algorithm we chose requires.

Whatever algorithm you choose probably requires the data you feed into it to be in a slightly different “format.” Sometimes words will have to be represented by numbers, and different techniques exist to get the data into exactly the right format. Consult the documentation for the algorithm you chose.

For instance, the “Naive Bayes” algorithm I used needs the data to be “vectorized” — turning each written string into a list of numbers. For this project I used a Python library called scikit-learn, and for this step its TfidfVectorizer class vectorized my data automatically. (But I didn’t know that was the right choice at the start; I experimented with some other options first!)
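Here’s roughly what that vectorizing step looks like in scikit-learn. The example descriptions are made up:

```python
# Turning text into numbers with scikit-learn's TfidfVectorizer.
from sklearn.feature_extraction.text import TfidfVectorizer

descriptions = [
    "Swastika graffiti was found on the synagogue door.",
    "A man shouted slurs at a woman wearing a hijab on the bus.",
    "Someone keyed a car that had a rainbow bumper sticker.",
]

vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(descriptions)

# One row per tip, one column per word the vectorizer kept.
print(features.shape)
```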

Step 6. Feeding most of your data to your algorithm and perhaps waiting a few minutes.

This is the part where the computer is learning. And it’s pretty simple, from your perspective. After setting the scene, one or two lines of computer code with words like “fit” or “predict” and maybe a minute or two with your computer’s fan spinning real loud… you just machine-learned!

But why “most” of the data? Won’t the model do better with more training data? Yes, but we keep some portion to the side so we can evaluate how the model does on data it wasn’t trained on (but that we know the right answers for).
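Continuing the made-up example from Step 5, here’s what that looks like for a single category (“religion”) with a scikit-learn Naive Bayes model. In real life you’d have thousands of labeled tips, not five:

```python
# A sketch of train-then-evaluate for one category. The tips and labels
# are made up; a real project would have thousands of each.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

descriptions = [
    "Swastika graffiti was found on the synagogue door.",
    "A man shouted slurs at a woman wearing a hijab on the bus.",
    "Someone keyed a car that had a rainbow bumper sticker.",
    "A mosque's window was smashed overnight.",
    "Flyers targeting immigrants appeared on campus.",
]
is_religion = [True, True, False, True, False]

features = TfidfVectorizer().fit_transform(descriptions)

# Hold some labeled tips out so we can check the model's work later.
X_train, X_test, y_train, y_test = train_test_split(
    features, is_religion, test_size=0.2, random_state=42)

model = MultinomialNB()
model.fit(X_train, y_train)          # the "learning" part
predictions = model.predict(X_test)  # guesses for the held-out tips
```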

Step 7. Looking at the results and deciding if it’s good enough or not — and if it isn’t, repeating steps 2-6 as necessary.

What’s “good enough”? Well that’s really up to you! If there are mistakes in your training set — where someone made the wrong choice — or just tough cases that are confusing to humans, you can’t expect the computer to get the right answer. So there’s a theoretical maximum accuracy that’s less than 100%. And it’s important to keep in mind that there’s no high score chart here: your goal is to help you in your journalism, so whatever will save you time is a success.

But how do you tell how well it’s doing? We use that held-out data (that we didn’t use to train the model) to check whether the model got the right answer. Inevitably, it’ll get some wrong and we’ll have to find a metric to summarize how right it is. There are a lot of choices of metrics, but you’ll want to read up on precision and recall — the balance of which is something you’d have to decide. If you’re looking for sexual-orientation-related hate crimes, would you rather include some false positives (hate crimes actually about, say, religion) or exclude some hate crimes that really were related to sexual orientation?
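scikit-learn can compute both metrics for you. A tiny standalone example, with made-up true labels and guesses:

```python
# Precision: of the tips the model flagged, how many really were about religion?
# Recall: of the tips really about religion, how many did the model catch?
from sklearn.metrics import precision_score, recall_score

actual  = [True, False, True,  True, False]   # what the tipsters said
guessed = [True, False, False, True, True]    # what the model predicted

print("precision:", precision_score(actual, guessed))  # 2 of the 3 flagged were right
print("recall:   ", recall_score(actual, guessed))     # caught 2 of the 3 real ones
```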

For this task, we cared about each kind of hate incident equally. But for other kinds of classification, that might not be true. For instance, if you were trying to reduce the number of documents you have to read by classifying them as irrelevant or not, then skipping a relevant document may be worse than reading an irrelevant one, and so you may want to only classify documents as uninteresting if the classifier is pretty sure.
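One way to act on that with scikit-learn is to look at predicted probabilities rather than hard yes/no answers, continuing the Naive Bayes sketch from Step 6. The 90% cutoff below is an arbitrary choice, something you’d tune by checking what it does to your held-out data:

```python
# Only skip a document when the model is quite sure it's irrelevant.
# `model` and `X_test` come from the Naive Bayes sketch above; the 90%
# cutoff is arbitrary.
probabilities = model.predict_proba(X_test)

# predict_proba's columns line up with model.classes_ (here: False, True).
relevant_column = list(model.classes_).index(True)

# Flag a tip for human reading unless the model is at least 90% sure
# it's irrelevant (i.e., its "relevant" probability is below 10%).
needs_reading = probabilities[:, relevant_column] >= 0.1
```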

You may even want to look at the data points that the model got wrong and double-check that the reader who submitted each tip gave it the right label. If it’s wrong, fix it!

If you’re not a coder, here are premade tools you might use.

The workflow I described above requires some coding knowledge, and we have a “template” notebook on GitHub. But if that’s too much coding, you can try Google’s AutoML Natural Language tool. That tool uses some quite fancy techniques to try to train a high-performing model on your data without you having to make any choices. You still have to do the data cleaning and format the spreadsheet in the right way.

Each of my model-training experiments cost about $10, but Google Cloud offers $300 of credit to start. Of course, this involves uploading your data to Google, so it may not be a good choice if you’re investigating something super confidential and/or investigating Google.

Here’s a step-by-step guide:

Formatting your data for AutoML can be a bit finicky. You’ll want to generate a CSV:

  1. with no header row
  2. with the textual descriptions in the first column
  3. with a label in the second column, the “answer” you’re trying to get the model to guess (this can be anything you want: “hatecrime” or “not”, “1.0” or “0.0”, whatever)
  4. with no blank textual descriptions, or you’ll get an inscrutable error message

So each row is just the free-text description, a comma and its label.
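Here’s a made-up illustration of one way to produce such a file with pandas; the rows and column names are placeholders:

```python
# Writing a two-column, header-free CSV for AutoML. Rows are made up.
import pandas as pd

tips = pd.DataFrame({
    "description": [
        "Swastika graffiti was found on the synagogue door.",
        "A neighbor complained about loud music.",
    ],
    "label": ["hatecrime", "not"],
})

# Drop blank descriptions -- they trigger that inscrutable error.
tips = tips.dropna(subset=["description"])
tips = tips[tips["description"].str.strip() != ""]

tips.to_csv("automl_training.csv", header=False, index=False)
# The resulting rows look like:
# Swastika graffiti was found on the synagogue door.,hatecrime
# A neighbor complained about loud music.,not
```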

It’s not instant: Training my model, with about 3000 records, took a few hours. And you’ll get better results if you spend a few hours cleaning your data beforehand (though the effect wasn’t THAT dramatic when I experimented with it).

To use AutoML, you’ll have to sign up for a Google Cloud account.

Once you’re in, click the menu button in the top left corner and scroll to the very bottom to find “Natural Language” under the “Artificial Intelligence” header.

That will take you to a new page with two big options. Click “Create a custom model.”

Click + New Dataset on the top bar.

Now give your dataset a name (no spaces or dashes!), leave the single-label classification button selected, and upload your dataset under the “Upload a CSV file from your computer” option.

Then click “Create dataset.”

Now go check Twitter or get a cup of coffee. This can take a few hours, but Google will email you when it’s done. (Also, you might get an inscrutable error like “AutoML Natural Language was unable to process your dataset” and “Uri is not found in CSV row “,1.0”.” Try to fix that if you can — empty text description rows can be a culprit — and try again.)

Once training is done, you’ll get that email. Go back to the AutoML page and click the Evaluate tab. You’ll see a detailed but well-designed dashboard of metrics about how well Google’s models were able to do with your data.

“Average precision” is the metric Google uses to measure how well its model is doing with your data. Mine got 87%, which is pretty good!

On the Predict tab, you can type new, fake data and see what the model guesses about it. There’s also a pre-built API you can integrate into your app (but how to do that is outside the scope of this tutorial).

For a deeper dive, here’s the code and tools we used.

Read this Jupyter Notebook: https://github.com/Quartz/aistudio-dochate-public/blob/master/DocHate.ipynb
