Vermonters were abuzz on social media this week after Wired magazine published an article saying the Green Mountain State is home to the most toxic internet commenters in America. But making sense of news stories about data can be tricky.
While some of the comments about Wired's article seemed to prove it right, others raised questions about the statistics behind that conclusion. Librarian Jessamyn West was one of the Twitter users discussing Wired’s story. (Wired used data from Disqus, which is the same commenting platform VPR uses for comments on vpr.net.)
As the former director of operations for MetaFilter (an online content platform), West knows quite a bit about how people engage with each other online. As a librarian in rural Vermont, she also has a good sense of how people engage with news and information. She saw the conversation about the new Wired story as an opportunity to talk about how news consumers can figure out what to make of statistics in an era of fake news.
#1 Look Under The Hood
West says the most important thing to do when a piece of content draws a conclusion from data or a study is to try to get an understanding of the process that turned raw data or information into a conclusion.
“As with any sort of study, you have to look at ‘Well, how was the study made? What were the parts of the study that went together?’” West says.
Understanding the methodology of a study or analysis can help answer some basic questions.
For example, Vermont is regularly at the top (or bottom) of data-based rankings of all 50 states. Does this mean Vermont is special? As much as we’d like to think so, West says there’s often another explanation.
Is that total number of comments, or per capita? I think the argument is VT has unusually low commenting generally, so trolls stand out.
— Jessamyn! MLib. (@jessamyn) August 24, 2017
“Vermont does suffer from this thing that we like to call the ‘Per Capita Effect,’ which is: Because it has such a tiny population – 600,000 people is smaller than most metro areas – things that rely on ‘It’s got a higher percentage of X than anything else’ – Vermont’s got a ton of those just because it’s so tiny,” West says.
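One way to see why small places land at the extremes of rankings is to look at how much an observed rate can wobble by chance. The sketch below is an illustration only: it assumes a simple binomial model, and the 10 percent rate and the comment counts are hypothetical numbers, not figures from Wired's data.

```python
import math


def rate_standard_error(true_rate, sample_size):
    """Standard error of an observed proportion under a simple binomial model."""
    return math.sqrt(true_rate * (1 - true_rate) / sample_size)


# Hypothetical numbers for illustration only; not taken from Wired's data.
TRUE_TOXIC_RATE = 0.10
comment_counts = {"small state": 2_000, "large state": 200_000}

for name, count in comment_counts.items():
    se = rate_standard_error(TRUE_TOXIC_RATE, count)
    # Roughly 95 percent of observed rates fall within two standard errors.
    low = TRUE_TOXIC_RATE - 2 * se
    high = TRUE_TOXIC_RATE + 2 * se
    print(f"{name}: observed rate plausibly between {low:.3f} and {high:.3f}")
```

Both hypothetical states have the same underlying rate, but the one with 100 times fewer comments has a range of plausible observed rates that is 10 times wider, so it is more likely to show up at the top or bottom of a ranking by chance alone.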
The economy of the news business plays a role in this too, West said, because news organizations that depend on web traffic in order to collect advertising money have an incentive to cover these national rankings.
“These 50-state things are great because then you can get people from all 50 states linking to information about their own people. Or you know the great New Hampshire-Vermont rivalry can spring up as a result of this,” she said. “That’s good for people who author things that are advertiser-driven online.”
#2 Figure Out What’s Really Being Measured
Sometimes, a study uses one statistic as a proxy, or indicator, for another. For example, economists study lumber sales to try to gain insights about how many houses are being built. Having a clear understanding of what is really being measured is important too, West says.
In other words, West says, “What are you really measuring when you’re saying ‘the most toxic trolls’?”
For Wired, West says, the definition of a “toxic” online comment is left up to a piece of software.
“Looking into this, the word ‘toxic’ is a very specific term of art for the tool, this tool Perspective that’s made by this company Alphabet, who you may know as Google, that is trying to bring [Artificial Intelligence] into commenting,” she said.
In other words: Wired’s analysis used a computer program to decide whether a comment was respectful and positive or toxic to the discussion.
The results are imperfect, West says, partly because words can carry different meanings that a computer might not understand.
“We talk a lot in the comment world [about] the use-mention distinction,” she says. “So talking about a racial slur is different than using a racial slur against someone. Or for instance more simply: Talking about gay people doesn’t necessarily mean you’re fighting about gay people.”
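The use-mention problem is easy to reproduce with a deliberately crude filter. The sketch below is not how Perspective works (it is a machine-learning tool, and its internals are not described here). It is a hypothetical keyword matcher, with a made-up word list and made-up comments, that shows how software can treat talking about a word the same as using it.

```python
import re

# Hypothetical word list for illustration; a real system would be far more complex.
FLAGGED_WORDS = {"idiot"}


def naive_is_toxic(comment):
    """Flag a comment if it contains any flagged word, ignoring all context."""
    words = re.findall(r"[a-z]+", comment.lower())
    return any(word in FLAGGED_WORDS for word in words)


using = "You are an idiot."
mentioning = "Calling someone an 'idiot' in a comment thread rarely changes their mind."

for comment in (using, mentioning):
    print(naive_is_toxic(comment), "-", comment)
```

The filter flags both comments, even though only the first one insults anyone. Any tool that leans on the words themselves rather than how they are used can make the same mistake.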
It’s impossible to know for sure (we’ll get to that in a moment), but West says Vermont commenters might have been labeled as toxic simply because they’re having conversations online about difficult topics.
“It may actually be that Vermonters are talking about social justice issues more than their companions across the river in New Hampshire, and it may actually be that they’re more engaged, not necessarily more toxic,” she says.
#3 Swim Upstream
Understanding how someone else reached a conclusion from data can be helpful in figuring out how reliable it is, but West says one of the big advantages of using data is that anyone can "swim upstream" and do their own analysis.
“The great thing about data is if you can access the same numbers, you can assess it and draw your own conclusions,” she said.
Using advanced spreadsheet skills to work through a data set is a daunting challenge for some people, and in this case Wired refused to release the underlying data set. But West says there are still ways to reverse-engineer parts of the study to get a better understanding.
For example: “You can look at what makes a comment toxic just by Googling the Perspective API, which is the interface to the programming on the back end, and type in your own comments to see if something that you would say would be considered toxic, and then you can learn a little bit more about how the big data analysis happened.”
Try typing a comment into the Perspective API and see if it's considered "toxic."
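Readers comfortable with code can also query the tool directly. The sketch below only builds a request and reads a score out of a response; it does not contact the service. The endpoint and field names are written from the publicly documented request format as best recalled, so treat them as assumptions and check the current Perspective API documentation. An API key is required to actually send the request, and the sample response here is made up for illustration.

```python
import json

# Assumed endpoint; verify against the current Perspective API documentation.
ANALYZE_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"


def build_request(text):
    """Build a request body asking for a toxicity score for one comment."""
    return {
        "comment": {"text": text},
        "requestedAttributes": {"TOXICITY": {}},
    }


def toxicity_score(response):
    """Pull the overall toxicity score (0 to 1) out of a response body."""
    return response["attributeScores"]["TOXICITY"]["summaryScore"]["value"]


# Made-up response for illustration; not real output from the service.
sample_response = {
    "attributeScores": {
        "TOXICITY": {"summaryScore": {"value": 0.12, "type": "PROBABILITY"}}
    }
}

print(json.dumps(build_request("Talking about a slur is not using one."), indent=2))
print(toxicity_score(sample_response))
```

Sending the same idea phrased a few different ways and comparing the scores is a small-scale version of the "swim upstream" approach West describes.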
Ultimately, science is based on the premise that the results of a study can be reproduced by following the same process.
Even if news consumers don’t actually recreate an experiment or data analysis, taking the time to understand it and think critically can lead to new insights and also help prevent the spread of fake news.
Update 12:19 p.m. 8/25/2017 This story has been updated to note that VPR's website uses Disqus, which is the same company that provided the source data for Wired's analysis of "toxic" comments.