As large social media companies face the rising challenge of moderating the billions of words posted on their platforms every day and blocking hate speech as well as other types of harmful content before it sees the light of day, they are hoping this huge workload can someday be taken on by robots. Google, for one, released an API called Perspective in 2017 that claimed to be able to detect “toxic” text-based content with artificial intelligence.
Perspective defines “toxic” as “a rude, disrespectful, or unreasonable comment that is likely to make you leave a discussion.” The neural network used to flag such content was trained using a large set of text-based data rated by people on a scale from “very healthy” to “very toxic.” Based on that information, the algorithm then evaluates new content by assigning it a likelihood of being toxic.
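For illustration, the scoring flow described above can be sketched as a call to Perspective's REST endpoint. The request and response shapes below follow Google's published `comments:analyze` format, but treat the exact field names and endpoint as assumptions to verify against the current API documentation; the response here is canned rather than fetched over the network.

```python
import json

# Sketch of scoring a comment with the Perspective API.
# The request body follows the publicly documented
# commentanalyzer.googleapis.com "analyze" endpoint; the exact
# fields are an assumption to check against Google's docs.

API_URL = ("https://commentanalyzer.googleapis.com/"
           "v1alpha1/comments:analyze")

def build_request(text: str) -> dict:
    """Build the JSON body asking Perspective to score TOXICITY."""
    return {
        "comment": {"text": text},
        "languages": ["en"],
        "requestedAttributes": {"TOXICITY": {}},
    }

def extract_score(response: dict) -> float:
    """Pull the 0-1 toxicity probability out of a Perspective response."""
    return (response["attributeScores"]["TOXICITY"]
                    ["summaryScore"]["value"])

# A canned response shaped like the API's output, standing in for
# an actual HTTP call (which would POST build_request(...) to API_URL).
fake_response = {
    "attributeScores": {
        "TOXICITY": {"summaryScore": {"value": 0.87, "type": "PROBABILITY"}}
    }
}

body = build_request("Wassup, bro")
print(json.dumps(body))
print(extract_score(fake_response))
```

The returned `summaryScore.value` is the probability-like toxicity likelihood the article refers to.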
But a new study by a group of AI researchers at the University of Washington found that the tool has a racial bias against African Americans: it disproportionately flags content posted by black Americans as toxic even when that content is harmless.
The study, led by University of Washington PhD student Maarten Sap, tested Google’s Perspective on several widely used sets of U.S. Twitter posts and found that tweets written by black people were twice as likely as those written by white people to be labeled toxic, and were more often mistaken for offensive content.
For example, a Twitter post that reads “Wassup, nigga” has an 87% likelihood of being flagged as toxic, while a differently worded post reading “Wassup, bro” has only a 4% likelihood. But if the first message were posted by a black person, its intent would most likely be harmless given the cultural context of the African American community.
Sap said the reason for such biases is machine learning algorithms’ inherent ignorance of the social context of speech, such as the dialect of English being used and the identity of the speaker. The algorithms behind the Google AI tool are trained on text-only annotations, meaning they have no information about the author of the content they evaluate.
“In America, when a white person [says] the N-word, it’s considered a lot more offensive than if a black person were to say the N-word,” Sap explained during a presentation about his study.
Sap said his team chose racial bias and Twitter as the focus of this study because Twitter is an important platform for black activism and research has shown that racial minority populations are most often the target of hate speech.
Because self-reported race information isn’t readily available in Twitter profiles, the study relied on African American English (AAE) dialect as a proxy to infer whether a tweet’s author was black or white.
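The proxy step can be sketched as follows. In the study, a pre-trained demographic dialect model assigns each tweet a probability of being AAE; the threshold, scores, and function below are illustrative assumptions standing in for that model's output, not the study's actual code.

```python
# Hypothetical sketch of using dialect probabilities as a race proxy.
# A dialect model scores each tweet with p_aae, the probability that
# it is written in African American English; tweets above a confidence
# threshold are grouped for the bias comparison. The threshold and
# scores below are illustrative assumptions.

AAE_THRESHOLD = 0.8

def group_by_dialect(tweets_with_scores):
    """Split tweets into AAE-aligned and white-aligned groups
    based on a dialect model's posterior probability."""
    aae, white = [], []
    for text, p_aae in tweets_with_scores:
        if p_aae >= AAE_THRESHOLD:
            aae.append(text)
        elif p_aae <= 1 - AAE_THRESHOLD:
            white.append(text)
        # tweets with uncertain dialect are left out of the comparison
    return aae, white

scored = [("tweet A", 0.95), ("tweet B", 0.10), ("tweet C", 0.55)]
aae_tweets, white_tweets = group_by_dialect(scored)
print(aae_tweets, white_tweets)
```

Comparing toxicity-flag rates between the two resulting groups is what yields the disparity the study reports.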
Sap also warned that natural language processing (NLP) algorithms with this issue will often amplify the biases present in their training datasets. Observer’s previous reporting on the use of natural language processing in online advertising reached similar conclusions about gender biases.
“The key with language is that we know from socio-linguistics that language conveys social meaning beyond the literal meaning of our words,” Sap told Observer. “This unearths all sorts of challenges relating to social dynamics between people of different identities and makes the problem of deciding whether something is offensive much more complex and subjective than deciding whether an image contains a human face or not.”
One potential solution to the racial bias found in his study, Sap said, is to introduce a “balancing” algorithm that helps a machine learning model recognize certain English dialects or identify the race of the speaker. But it won’t be perfect, at least in the foreseeable future, because whether a sentence is offensive is a highly subjective question.
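One common way such a balancing step is implemented, sketched below as an assumption rather than a description of the study's proposal, is to reweight training examples so each dialect group contributes equal total weight to the model's loss, countering over-representation of one group in the data.

```python
from collections import Counter

# Hypothetical sketch of a dialect-aware balancing step: reweight
# training examples so each dialect group contributes equal total
# weight during training. Group labels would come from a dialect
# classifier; the labels below are illustrative assumptions.

def balanced_weights(dialect_labels):
    """Return one weight per example, inversely proportional to
    the size of that example's dialect group. Weights sum to the
    number of examples, so the overall loss scale is unchanged."""
    counts = Counter(dialect_labels)
    n_groups = len(counts)
    total = len(dialect_labels)
    return [total / (n_groups * counts[d]) for d in dialect_labels]

# One AAE example among three Standard American English examples:
labels = ["aae", "sae", "sae", "sae"]
weights = balanced_weights(labels)
print(weights)  # the lone AAE example gets weight 2.0
```

A model trained with these weights sees each dialect group as equally important, which can reduce, though not eliminate, the disparity in false positives.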