So, Umm, Google Duplex’s Chatter Is Not Quite Human

A systems scientist breaks down the intricacies of making a machine that can fool humans into thinking it’s one of us

Google’s Duplex voice assistant drew applause last week at the company’s annual I/O developer conference after CEO Sundar Pichai demonstrated the artificially intelligent technology autonomously booking a hair salon appointment and a restaurant reservation, apparently fooling the people who took the calls. But enthusiasm has since been tempered with unease over the ethics of a computer making phone calls under the guise of being human. Such a mixed reception has become increasingly common for Google, Amazon, Facebook and other tech companies as they push AI’s boundaries in ways that do not always seem to consider consumer privacy or safety concerns.

Audio of Google Duplex booking a women's haircut. Courtesy of Google


The Duplex saga also highlights the intricacies of human conversation, along with the difficulties of replicating real-time speech in a machine mimicking a natural voice. Google trained the voice assistant by feeding its artificial neural network data from phone conversations—including the audio itself, but also contextual information such as the time of day and purpose of the call. This machine-learning process is similar in some ways to teaching AI to recognize and reproduce images, another ability that has aroused ethical and privacy concerns. Google has made clear, however, that for now Duplex can be trained to have only very specific verbal exchanges with people; it cannot handle general, open-ended conversations. The company also claims it is “experimenting with the right approach” to inform people they are on the phone with Duplex as opposed to a real person.

Scientific American asked Timo Baumann, a systems scientist researching speech processing at Carnegie Mellon University’s Language Technologies Institute, to explain how artificial neural networks can be trained to identify and reproduce images and sounds. Baumann also analyzed what Google has achieved with Duplex, along with the ethical challenges that advances in AI can create.

[An edited transcript of the interview follows.]

What is the difference between a neural network that can produce realistic images and one that can carry on a conversation in natural language?

Different artificial neural networks are used to train different types of AI. For images you want to identify objects, understand how they relate to one another, capture the style and so on—information that is spread over the full image. Neural networks for images detect edges to find the shapes of objects and get a grasp of what's going on—similar to what your visual cortex does. A conversation, however, develops over time, so you have to have a notion of how things evolve. The same word may mean something different when it appears in a different position.

That’s a fundamental difference between images, which convolutional neural networks handle brilliantly, and speech, which consists of variable-length signals. Recurrent neural networks like the ones Google used to train Duplex are a way to deal with those variable lengths. “Recurrent” means the network analyzes small parts of the signal—10 or 20 milliseconds at a time—and integrates the outcome of each analysis into the next step, gradually accumulating information over time. It’s similar to the way we recognize a word by piecing together the sounds we hear as they are spoken. In both cases—images and speech—you can also run a network in reverse to produce output. In conversation [the network] has to go back and forth between understanding what the user says and speaking itself.
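To make the idea of frame-by-frame accumulation concrete, here is a minimal sketch of a recurrent network consuming short audio frames from utterances of different lengths. It is an illustration only, not Duplex's architecture; the roughly 20-millisecond frames, the 40-dimensional features and the choice of a GRU are assumptions.

```python
import torch
import torch.nn as nn

N_FEATURES = 40   # assumed per-frame features (e.g., mel-filterbank energies)
HIDDEN = 128      # assumed size of the accumulated state

class FrameByFrameEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        # A GRU integrates each ~20-ms frame into the state carried forward.
        self.rnn = nn.GRU(input_size=N_FEATURES, hidden_size=HIDDEN, batch_first=True)

    def forward(self, frames):
        # frames: (batch, n_frames, N_FEATURES); n_frames varies per utterance
        outputs, final_state = self.rnn(frames)
        # `outputs` holds the running summary after every frame;
        # `final_state` summarizes the whole variable-length utterance.
        return outputs, final_state

encoder = FrameByFrameEncoder()
short_utterance = torch.randn(1, 50, N_FEATURES)   # ~1 second of audio frames
long_utterance = torch.randn(1, 300, N_FEATURES)   # ~6 seconds of audio frames
_, short_summary = encoder(short_utterance)        # same weights,
_, long_summary = encoder(long_utterance)          # different numbers of steps
```

Both utterances pass through the same network; only the number of recurrent steps differs, which is what lets one model handle signals of any length.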

How do you train a neural network to actually have a conversation—as opposed to a scripted exchange of words?

Duplex seems to consist of multiple partial neural networks, and each of these subnetworks focuses on a different aspect of natural language. One part, for example, is responsible for learning the actions required to perform a particular task, or domain—whether it is booking a restaurant reservation or making a haircut appointment. Google also says it integrates other types of data across all domains, such as filler expressions that, for example, convey hesitation (“umm”) or understanding (“ahh”). That’s a smart strategy because Google can train the neural network responsible for [those expressions] on far more data than would be available for any single type of conversation.
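The division of labor Baumann describes might look something like the following sketch, in which one shared component is trained on data pooled across tasks while each task keeps its own head. The class names, dimensions and structure are hypothetical and are not taken from Duplex.

```python
import torch
import torch.nn as nn

HIDDEN = 128
FILLERS = ["<none>", "umm", "ahh"]   # assumed inventory of filler expressions

class SharedFillerModel(nn.Module):
    """Scores filler expressions; trainable on data pooled from all domains."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(HIDDEN, len(FILLERS))

    def forward(self, state):
        return self.net(state)

class DomainModel(nn.Module):
    """Scores task actions; trainable only on one domain's conversations."""
    def __init__(self, n_actions=8):
        super().__init__()
        self.net = nn.Linear(HIDDEN, n_actions)

    def forward(self, state):
        return self.net(state)

class ModularAssistant(nn.Module):
    def __init__(self, domains):
        super().__init__()
        self.filler = SharedFillerModel()                        # shared, data-rich
        self.domains = nn.ModuleDict({d: DomainModel() for d in domains})

    def forward(self, state, domain):
        # The domain head decides what to do next in this task; the shared
        # component decides whether a filler such as "umm" should accompany it.
        return self.domains[domain](state), self.filler(state)

assistant = ModularAssistant(["salon_booking", "restaurant_reservation"])
action_scores, filler_scores = assistant(torch.randn(1, HIDDEN), "salon_booking")
```

Because the shared component sees examples from every domain, it can be trained on far more data than any single task provides, which is the advantage Baumann points to.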

How important are these conversational expressions and their timing when creating AI that can carry on a realistic conversation?

Randomly putting a filler like the expression “umm” into a sentence doesn’t make a lot of sense. When used properly, however, that type of expression actually can serve an important purpose in a conversation. In one of the examples that Google provides, Duplex uses the word “so” to let the listener know that information is coming and the expression “umm” to give the listener a little extra time to prepare for this information. The AI may have learned from data that that was a good place to put the “umm” as a marker—to warn the listener to really listen because this is when the information will be conveyed. In that way the “umm” isn’t just a filler but is conveying meaning.

That said, Duplex is probably not using fillers strategically but merely adds them in reasonable places. Other aspects [that appear to be] missing from Duplex are back-channel expressions that provide important feedback from the listener to the speaker. If I’m talking on the phone and I hear “uh-huh, uh-huh” on the other end, it has the very important function of notifying me that you’re listening and understanding—and that I should continue on. You don’t hear any of that in Duplex. Feedback information also has to be delivered at precisely the right time and with low delay; otherwise it creates awkwardness or confusion. Those expressions might be tiny, but they can have a large effect on a conversation. If the AI system is speaking too slowly, people interacting with it will be unsure and repeat themselves because they’ll think that whoever they’re talking to didn’t get the message.
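As a toy illustration of the timing point, the sketch below gates a back-channel on two assumed thresholds: the caller's pause has to be long enough to invite feedback, and the system's own delay has to stay under a tight budget. The thresholds and the function are hypothetical, not measurements from Duplex or any other real system.

```python
BACKCHANNEL_DEADLINE_S = 0.3   # assumed budget: respond within ~300 ms
PAUSE_THRESHOLD_S = 0.5        # assumed silence length that invites feedback

def maybe_backchannel(silence_duration_s, processing_delay_s):
    """Decide whether to emit 'uh-huh' after a pause in the caller's speech."""
    if silence_duration_s < PAUSE_THRESHOLD_S:
        return None            # the caller is still mid-utterance
    if processing_delay_s > BACKCHANNEL_DEADLINE_S:
        return None            # too late; the feedback would now read as awkward
    return "uh-huh"

print(maybe_backchannel(silence_duration_s=0.6, processing_delay_s=0.1))  # "uh-huh"
print(maybe_backchannel(silence_duration_s=0.6, processing_delay_s=0.8))  # None
```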

What place does ethics have in the development of conversational AI?

Should research into simulating natural-language conversations be done? Absolutely. It’s genuinely interesting to find out how human language works. Although we do it every day, we have very little clue as to what’s important and what isn’t. Should Google be doing its Duplex research in the way it has, testing its AI with real people who are unaware they are talking to a computer? I don’t know. Of course, Google needs input on how well its system is performing, but there are clear ethical implications to having people speak with a machine without knowing that it is one.

How will these ethical problems become more difficult as AI improves?

One question that will increasingly come up when discussing the ethics of AI is whether AI is the right tool for the job. In the case of self-driving cars, for example, it’s a very exciting challenge to address. But is the solution [to concerns about traffic safety and congestion] a self-driving car? Or is it a bus that efficiently gets people where they want to go, and may or may not be driven by a machine? When you improve self-driving technology to the point it’s mainstream, are you now increasing the number of cars on the road? And is that the right solution for society?

In the case of [Duplex] scheduling an appointment at the hairdresser, the minimal solution would be just to make it easier for people's phones to make and change appointments via automated interfaces. No AI involved, only classic computer science and engineering work—but it would allow my hairdresser to give haircuts rather than reschedule appointments all day.
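For the sake of illustration, that "no AI involved" alternative could be as plain as the sketch below: a structured booking call that a phone app invokes directly, with no natural-language understanding in the loop. The data model and function names are hypothetical, not an existing API.

```python
from datetime import datetime

# Open slots published by the salon, keyed by start time (None means free).
open_slots = {
    datetime(2018, 5, 14, 10, 0): None,
    datetime(2018, 5, 14, 12, 0): None,
}

def book_appointment(slots, start, client_name):
    """Reserve a slot if it is free; no conversation needed on either side."""
    if slots.get(start, "taken") is not None:
        return False                      # slot missing or already booked
    slots[start] = client_name
    return True

print(book_appointment(open_slots, datetime(2018, 5, 14, 10, 0), "Lisa"))  # True
print(book_appointment(open_slots, datetime(2018, 5, 14, 10, 0), "Anna"))  # False
```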

When does it make sense to create AI that can have a natural conversation with a person?

In short: where the conversation itself is the goal, not an information exchange that can easily be automated. Many academics have researched speech technology for use in elderly care, as one example, to fight the loneliness and the intellectual and interactional starvation that older people can experience. There are also—even larger—ethical issues in depending on AI to care for the elderly. But the main reason eldercare AI is being researched is that people themselves are not willing to [care for the elderly]. If someone is worried about the elderly having to interact with machines rather than people, the answer might be to tell that person to go and spend more time with their grandmother. If we can’t solve the problem by changing our priorities and behavior, at least giving the elderly a machine to interact with to improve their quality of life is better than nothing.