Almost two years ago, Lean In and Getty Images partnered to address bias in the way women are portrayed in stock photography by creating The Lean In Collection, over 6,000 images portraying women as leaders and/or partners.

In March of this year, data from Procter & Gamble’s Always® Confidence and Puberty Survey revealed that more than half of girls surveyed felt that female emoji are stereotypical, while 75 percent wanted to see girls portrayed more progressively.

In July, Google announced that the Unicode Consortium had approved 11 new professional emoji — such as farmer, mechanic and welder — with both female and male options, and in a range of skin tones.

The consortium also approved a set of male and female versions of existing emoji. (For you emoji fans out there, this means you’ll now be able to express yourself using a female runner or a man getting a haircut. The consortium stopped short, however, of including gender-diverse emoji in this proposal.)

But now it’s not only stock photos and emoji that are at issue. A July 27 article in MIT Technology Review reported that researchers from Boston University and Microsoft have discovered that a popular data set called “Word2Vec,” which is based on a group of 300 million words taken from Google News, is “blatantly sexist.”

This is important because, according to the authors, “Numerous researchers have begun to use [Word2Vec] to better understand everything from machine translation to intelligent Web searching.” In other words, biases in the Google Word2Vec data set could potentially spread to and distort other research and applications that use this set of data.

The analysis was conducted using vector space mathematics, essentially a mathematical model that represents each term as a vector so that the relationships between terms can be measured. Simply put, these relationships work like analogies: Paris is to France as Tokyo is to X, where X = Japan. But, says the MIT article:

“ask the database ‘father : doctor :: mother : x’ and it will say x = nurse. And the query ‘man : computer programmer :: woman : x’ gives x = homemaker.”

Not good.
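To make the analogy idea concrete, here is a minimal sketch of how such a query can be answered with vector arithmetic: subtract the first term from the second, add the third, and look for the nearest remaining word. The three-dimensional vectors and the function names below are invented for illustration. They are not taken from Word2Vec, whose vectors are learned from text and are far larger.

```python
import math

# Toy 3-dimensional embeddings, invented for illustration only.
# Real embeddings are learned from text and have hundreds of dimensions.
EMBEDDINGS = {
    "paris": [0.9, 0.1, 0.8],
    "france": [0.9, 0.1, 0.1],
    "tokyo": [0.1, 0.9, 0.8],
    "japan": [0.1, 0.9, 0.1],
    "berlin": [0.5, 0.5, 0.8],
    "germany": [0.5, 0.5, 0.1],
}


def cosine(u, v):
    """Cosine similarity: near 1.0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(a, b, c, embeddings):
    """Answer 'a : b :: c : x' by finding the word nearest to b - a + c."""
    target = [
        vb - va + vc
        for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])
    ]
    # Exclude the query words themselves from the candidate answers.
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(embeddings[w], target))


answer = solve_analogy("paris", "france", "tokyo", EMBEDDINGS)
print(answer)
```

The same arithmetic that recovers a capital-to-country relationship will also surface whatever gender associations are present in the text the vectors were learned from, which is how queries like the ones quoted above end up with stereotyped answers.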

This may not seem surprising; of course language embeds biases. But, say the researchers, “One might have hoped that the Google News embedding would exhibit little gender bias because many of its authors are professional journalists.”

A blog post by University of Maryland professor Hal Daumé uses “the black sheep problem” to illustrate the difference between truth and meaning, and show how data can be problematic in the absence of context. He says:

The “black sheep problem” is that if you were to try to guess what color most sheep were by looking [at] language data, it would be very difficult for you to conclude that they weren’t almost all black. In English, “black sheep” outnumbers “white sheep” about 25:1 (many “black sheeps” are movie references); in French it’s 3:1; in German it’s 12:1. Some languages get it right; in Korean it’s 1:1.5 in favor of white sheep.

So, while we mention black sheep more often, in reality white sheep are more prevalent. Furthermore, the phrase “black sheep” is unlikely to be used literally, while “white sheep” is highly likely to refer to actual sheep.
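A rough sketch of the counting behind this kind of comparison might look like the following. The sentences are made up for illustration and are not real corpus data, and the helper name `count_phrases` is hypothetical. The point is only that phrase frequency in text measures how often something is mentioned, not how common it is in the world.

```python
import re
from collections import Counter

# Invented sentences for illustration; not drawn from any real corpus.
TOY_CORPUS = [
    "He was always the black sheep of the family.",
    "Every office has a black sheep.",
    "She felt like the black sheep at the reunion.",
    "The farmer sheared a white sheep this morning.",
]


def count_phrases(texts, phrases):
    """Count whole-phrase, case-insensitive occurrences of each phrase."""
    counts = Counter()
    for text in texts:
        lowered = text.lower()
        for phrase in phrases:
            pattern = r"\b" + re.escape(phrase) + r"\b"
            counts[phrase] += len(re.findall(pattern, lowered))
    return counts


counts = count_phrases(TOY_CORPUS, ["black sheep", "white sheep"])
print(counts["black sheep"], counts["white sheep"])
```

In this toy corpus every mention of “black sheep” is figurative, so a system that only counts would badly overestimate how many sheep are black.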

Why Does This Matter?

As language becomes encoded in technical systems, it can have real-world impact for organizations and people. Bias in data, whatever the origin, can blind us to opportunities and risks and can have financial consequences.

This is a critical issue for organizations looking to incorporate digital strategies, such as personalization, neural networks/artificial intelligence, and other emerging technologies into their offerings, because competitiveness in the future will depend on the ability to detect, interpret, analyze and act on complex market signals in real time. The more we depend on predictive intelligence and automated decision-making, the more important it is to identify and, to the extent possible, remove bias from data to prevent it from contaminating future analysis.

Hidden bias in data also has human consequences. It’s a fundamental issue for law, policy, healthcare and climate change: pretty much any sector or issue that depends on data. In fact, the White House Office of Science and Technology Policy recently released a request for information (RFI) on issues related to the future of artificial intelligence.

How Can Organizations Address Data Bias?

The researchers have proposed a way to “retrain” the data set by combining vector space mathematics with human intelligence gathered through Amazon’s Mechanical Turk service, an approach that, they say, has been shown to reduce bias. At the same time, it raises an entirely new series of questions about what “de-biasing” methodologies should and will look like in the future, and who will make those decisions.
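As a very simplified illustration of the vector space half of that idea, the sketch below estimates a “gender direction” from a pair of gendered words and then removes that component from a word that should be gender-neutral. This is an assumption-laden toy, not the researchers’ full procedure: the vectors, the word list and the function names are invented, and the human judgment step (deciding which words ought to be neutral) is simply hard-coded here.

```python
import math

# Toy vectors invented for illustration; the first axis loosely plays
# the role of a "gender" axis. Real embeddings are not this tidy.
EMBEDDINGS = {
    "he": [1.0, 0.0, 0.2],
    "she": [-1.0, 0.0, 0.2],
    "programmer": [0.4, 0.8, 0.1],
}


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    length = math.sqrt(dot(v, v))
    return [a / length for a in v]


def gender_direction(embeddings, pairs):
    """Average the differences of gendered word pairs, then normalize."""
    dims = len(next(iter(embeddings.values())))
    total = [0.0] * dims
    for male, female in pairs:
        diff = [m - f for m, f in zip(embeddings[male], embeddings[female])]
        total = [t + d for t, d in zip(total, diff)]
    return normalize(total)


def neutralize(vector, direction):
    """Remove the component of `vector` that lies along `direction`."""
    scale = dot(vector, direction)
    return [a - scale * d for a, d in zip(vector, direction)]


direction = gender_direction(EMBEDDINGS, [("he", "she")])
before = dot(EMBEDDINGS["programmer"], direction)
neutral_programmer = neutralize(EMBEDDINGS["programmer"], direction)
after = dot(neutral_programmer, direction)
print(before, after)
```

Choosing which words get this treatment (“programmer” probably should, “mother” probably should not) is exactly the kind of decision that requires human input, and it is where the questions about who decides come in.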

If you’d like to dive deeper, you can find the original paper here.

Thanks to Margaret, Alistair and friends for alerting me to the MIT Technology Review article and, as always, deepening the conversation.