The idea of a humanlike artificial intelligence assistant that you can speak with has been alive in many people’s imaginations since the release of “Her,” Spike Jonze’s 2013 film about a man who falls in love with a Siri-like AI named Samantha. Over the course of the film, the protagonist grapples with the ways in which Samantha, real as she may seem, is not and never will be human.
Twelve years on, this is no longer the stuff of science fiction. Generative AI tools like ChatGPT and digital assistants like Apple’s Siri and Amazon’s Alexa help people get driving directions, make grocery lists, and plenty else. But just like Samantha, automatic speech recognition systems still cannot do everything that a human listener can.
You have probably had the frustrating experience of calling your bank or utility company and needing to repeat yourself so that the digital customer service bot on the other end of the line can understand you. Maybe you’ve dictated a note on your phone, only to spend time editing garbled words.
Linguistics and computer science researchers have shown that these systems work worse for some people than for others. They tend to make more errors if you have a non-native or regional accent, are Black, speak African American Vernacular English, code-switch, are a woman, are older or very young, or have a speech impediment.
Tin ear
Unlike you or me, automatic speech recognition systems are not what researchers call “sympathetic listeners.” Instead of trying to understand you by taking in other useful cues, like intonation or facial gestures, they simply give up. Or they make a probabilistic guess, a move that can sometimes result in an error.
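To make that choice concrete, here is a minimal, hypothetical sketch in Python: a recognizer produces ranked guesses with confidence scores, and the system either commits to its best guess or, in a more “sympathetic” design, asks the caller to repeat themselves when confidence is low. The function names, scores and threshold are invented for illustration and are not drawn from any particular product.

```python
# A hypothetical sketch of how an ASR front end might decide between
# committing to its best guess and asking the caller to repeat themselves.
# The hypotheses, confidence scores and threshold are invented; real systems
# derive these values from acoustic and language models.

from typing import NamedTuple


class Hypothesis(NamedTuple):
    text: str
    confidence: float  # the model's estimate that this transcript is correct


def respond(hypotheses: list[Hypothesis], threshold: float = 0.80) -> str:
    """Return the top transcript if the system is confident enough,
    otherwise fall back to a clarification prompt instead of guessing."""
    best = max(hypotheses, key=lambda h: h.confidence)
    if best.confidence >= threshold:
        return best.text
    # A more "sympathetic" design: admit uncertainty rather than commit to an error.
    return "Sorry, I didn't catch that. Could you say it again?"


# Example: ranked guesses for one utterance (made-up numbers).
print(respond([Hypothesis("pay my bill", 0.62), Hypothesis("play my bill", 0.31)]))
```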
As companies and public agencies increasingly adopt automatic speech recognition tools to cut costs, people have little choice but to interact with them. But the more these systems are used in critical fields, ranging from emergency response and health care to education and law enforcement, the more likely it is that there will be grave consequences when they fail to recognize what people say.
Imagine sometime in the near future you’ve been hurt in a car crash. You dial 911 to call for help, but instead of being connected to a human dispatcher, you get a bot that’s designed to weed out nonemergency calls. It takes you several rounds to be understood, wasting time and raising your anxiety level at the worst moment.
What causes this kind of error to occur? Some of the inequalities that result from these systems are baked into the reams of linguistic data that developers use to build large language models. Developers train artificial intelligence systems to understand and mimic human language by feeding them vast quantities of text and audio files containing real human speech. But whose speech are they feeding them?
If a system achieves high accuracy when listening to affluent white Americans in their mid-30s, it is reasonable to guess that it was trained on plenty of audio recordings of people who fit this profile.
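Researchers typically quantify these gaps with word error rate, the standard accuracy metric for speech recognition: the number of substituted, deleted and inserted words divided by the length of what was actually said, computed separately for each group of speakers and then compared. The sketch below shows the idea; the transcripts and group labels are invented placeholders, not real study data.

```python
# A sketch of how disparities in recognition accuracy can be measured:
# compute word error rate (WER) per speaker group and compare the averages.
# The sample sentences and group labels below are invented for illustration.

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length,
    computed with a standard word-level edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution or match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)


# Hypothetical evaluation set: (speaker group, what was said, what the system heard).
samples = [
    ("group_a", "turn off the kitchen lights", "turn off the kitchen lights"),
    ("group_a", "call my sister tonight", "call my sister tonight"),
    ("group_b", "turn off the kitchen lights", "turn of the chicken lights"),
    ("group_b", "call my sister tonight", "call my system to night"),
]

by_group: dict[str, list[float]] = {}
for group, ref, hyp in samples:
    by_group.setdefault(group, []).append(wer(ref, hyp))

for group, scores in by_group.items():
    print(f"{group}: average WER = {sum(scores) / len(scores):.2f}")
```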
With rigorous data collection from a diverse range of sources, AI developers could reduce these errors. But building AI systems that can understand the infinite variation in human speech arising from things like gender, age, race, first versus second language, socioeconomic status, ability and plenty else requires significant time and resources.
‘Proper’ English
For people who do not speak English – which is to say, most people around the world – the challenges are even greater. Most of the world’s largest generative AI systems were built in English, and they work far better in English than in any other language. On paper, AI has lots of civic potential for translation and increasing people’s access to information in different languages, but for now, most languages have a smaller digital footprint, making it difficult for them to power large language models.
Even within languages well-served by large language models, like English and Spanish, your experience varies depending on which dialect of the language you speak.
Right now, most speech recognition systems and generative AI chatbots reflect the linguistic biases of the datasets they are trained on. They echo prescriptive, sometimes prejudiced notions of “correctness” in speech.
In fact, AI has been shown to “flatten” linguistic diversity. There are now AI startups that offer to erase their users’ accents, on the assumption that their primary clientele will be customer service providers with call centers in countries such as India or the Philippines. The offering perpetuates the notion that some accents are less valid than others.
Human connection
AI will presumably get better at processing language and accounting for variables such as accents and code-switching. In the U.S., public services are obligated under federal law to guarantee equitable access to services regardless of what language a person speaks. But it is not clear whether that alone will be enough incentive for the tech industry to move toward eliminating linguistic inequities.
Many people might prefer to talk to a real person when asking questions about a bill or medical issue, or at least to have the ability to opt out of interacting with automated systems when seeking key services. That is not to say that miscommunication never happens in interpersonal communication, but when you speak to a real person, they are primed to be a sympathetic listener.
With AI, at least for now, it either works or it doesn’t. If the system can process what you say, you are good to go. If it cannot, the onus is on you to make yourself understood.