Are you interested in chatbots? If so, you are probably familiar with the challenging topic of vocal interaction... As a SoftBank Robotics linguist working on the subject with our robots, I attended a linguistics conference to check the state of the art on this issue. Here’s what I learned there!
In June 2019, the 16th International Pragmatics Conference took place in Hong Kong - no, it was not a gathering of very practical people, although the organisation certainly was: “pragmatics” is a subfield of linguistics and semiotics that studies the ways in which context contributes to meaning (dealing with language in use and the contexts in which it is used). As a linguist working on human-robot interaction at SoftBank Robotics, that in itself was interesting, but it got even better: one of the panels, named “Posthumanist Pragmatics: linguistic encounters in the digital uncanny valley”, was described as follows:
In this panel, we treat pragmatic aspects of human-machine interaction in everyday life, including:
- the reading and producing of algorithmic texts
- discursive engagements with the internet of things, with software agents such as Alexa or Siri, and with social media bots
- or the consumption of AI-generated/enhanced media products
These various forms of language production are shaped by the interaction of human and non-human agents. Embedded in the larger notion of posthumanist applied linguistics (Pennycook 2018), they demonstrate the generally precarious nature of an understanding of the human as essentially different from non-human communicative agents.
From a sociolinguistic and pragmatic point of view, one recurring theme in human-machine interaction is striking: the language that is produced here is often perceived as divergent. Sometimes, it may produce inadvertent humor and double-entendre (e.g. autocorrect effects), surprising creativity and even machine-generated beauty. In other cases, the linguistic effects may be more unsettling: algorithms bring taboo discourse to the fore; social media sites foster and create interactions that some users experience as transgressive or even abusive; technological artefacts may become sexualized, anthropomorphized or otherwise imbued with social meaning.
Some of these pragmatic conditions may be linked to a linguistic uncanny valley effect (see Mori et al. 2012). It may be precisely their semiotic semblance of humanness which makes their diverging qualities all the more unsettling and transgressive - in brief, uncanny. Potential effects could be that pragmatic conditions of sayability are changed; assumptions about im/politeness and common ground may be altered; patterns of conversational structure may be rearranged in talking to machines, and prosodic features of computer-generated voices may trigger complex patterns of uptake.
Pepper and Nao are humanoid robots, and as such, people often expect them to be able to talk as humans do, and are disappointed when they don’t. I attended this panel to check the state of the art and get a new perspective on this topic.
1. Introducing Posthumanism
Humanism is a philosophical and ethical stance that emphasizes the value and agency of human beings, individually and collectively. Posthumanism may be understood both as a broad stance on what it means to be human (posthuman-ism) and as a more specific critique of the philosophy of humanism (post-humanism). Posthumanism is therefore not about giving up on humans, or announcing the end of humanity, but rather a call for rethinking the relationship between humans and their environment: as technology evolves, humans evolve with it. This idea is expressed in the Posthuman Manifesto.
In the posthuman era, machines will no longer be machines. As computers develop to be more like humans, so humans develop to like computers more.
Pepperell, 2005 (complete text: The Posthuman Manifesto)
In recent years, companion robots and voice assistants have made great progress, and have become an important part of our lives. While the “bodies” of voice assistants are not the main focus, and are therefore as minimalist as possible (usually a speaker or a smartphone), this is not the case for robots: they exist in all sizes and shapes. Even limiting ourselves to talking robots opens a wide range of possibilities, and among them, a lot of humanoid robots.
Humanoid robots are specifically built to resemble the human body. The design may be for functional purposes, such as interacting with humans, as well as human tools and environments, or for experimental purposes, such as the study of bipedal locomotion. Pepper and Nao are humanoid robots, created to interact with humans. They have human shapes, but their features were deliberately kept simple and cute, to avoid any frightening or unsettling effect. Other humanoids, on the other hand, were made to be as close as possible to a human, but none has yet been made that could be mistaken for such: they look almost like humans, but not quite.
The uncanny is that class of the frightening which leads back to what is known of old and long familiar.
The concept of the uncanny valley suggests that humanoid objects which imperfectly resemble actual human beings provoke uncanny or strangely familiar feelings of eeriness and revulsion in observers. This relationship between the degree of an object's resemblance to a human being and the emotional response to such an object is well-explored in computer science and aesthetics, as well as in robotics of course, but the uncanny valley effect can also be seen in divergent language: the more we expect of our chatbots, the more even the slightest error on their part will stand out and mar the interaction. Strange patterns in the conversational structure can also be observed in the speech of a human talking to a chatbot.
This is a recurring theme here: the more we engage with machines, the more we find ourselves engaging with our own humanism, as we unconsciously reproduce the same kind of mechanisms used in human-human communication, and try to find ways to cope with failures by adapting to what we understand of the device. Posthumanism therefore also means finding the terms of an enriched relationship with machines: conversing with robots and voice assistants should be as natural as talking to another human being.
2. Linguistic encounters in the digital uncanny valley
Chatbots will fundamentally revolutionize how computing is experienced by everybody. In time, human language will be taught to all computers and become the new interface.
Satya Nadella, Microsoft CEO, 2016
Nao and Pepper have an advantage here, thanks to their humanoid shape that makes it easier for users to interact with them - oral communication seems natural with a humanoid robot. The fact that it also creates unrealistic expectations is more of a driving force to constantly improve than a downside!
Communication between humans requires efficiency and relevance, so it continually supplements the implicit: non-verbalized information in fact constitutes the bulk of the communication. Contributions build on each other in a coherent way, as human speakers relate to each other based on a network of common ground and earlier interactions. Communication therefore presupposes understanding (of contents and of each other), as people pursue dialog objectives purposefully.
That’s an enormous challenge for chatbots, as their system does not go beyond the question/answer mechanism and can only partly anticipate the spontaneous behaviour of users. They act on the basis of a plan, so they can make decisions and keep a history of sorts, but they’re not autonomous: they lack the spontaneity and flexibility that come naturally in a human-human conversation. The plan-based approach is safer but less flexible, while the newer systems using big data have become really uncanny, but have other problems: for example, it is impossible to determine the relevance of a contribution to the current dialog, leading to an inconsistent chatbot persona that can be easily hijacked by trolls. And as for the artificial semantic networks, they’re too inflexible for linguistic applications. There’s no clear solution yet, so this could in fact be considered the uncanny valley of dialog flexibility!
Flexibility is one of the bases of human-human conversation, as people naturally align with each other when communicating: lexicon is used as a sort of common ground in any dialog, which is a normal and common phenomenon. To do so with a chatbot therefore indicates a transfer of a key feature of human-human communication, and the longer it lasts, the more users have the illusion of having a natural conversation.
Young and unbiased users tend to align more with their chatbots than adults do: they like to speak with them, even though the interaction is not always successful. It’s a common occurrence with children, as not only do their voice and pronunciation make them harder for the devices to understand, but when they move away from simple commands, their pragmatic intention often does not conform to the program’s scripts - for example, when they just want to tell a story. Of course, this is not limited to children: adults are also often misunderstood by their devices. And in case of failure, the previously established lexicon is simplified even further, as users move away from natural language as a repair strategy (“What’s the weather like today?” vs “Weather forecast”).
Talking with a chatbot, especially in a failure situation, leads to a more passive attitude on the users’ part: they know they’re interacting with a keyword-based machine, so they unconsciously align themselves and therefore relinquish autonomy. Users adapt to chatbots, but the problem is that chatbots can’t adapt back. So users learn how to get the chatbot to do what they want it to do, and repeat the “winning” patterns. But in that case, can it still be considered a dialog?
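To make the idea of a “keyword-based machine” more concrete, here is a minimal Python sketch of how such matching could work. Everything in it (the intent names, the keyword table, the `match_intent` function) is hypothetical and invented for this illustration; it is not how Alexa, Pepper or any other real system is implemented.

```python
import re
from typing import Optional

# Hypothetical keyword table, invented for this example:
# each intent is triggered by any one of its keywords.
KEYWORD_INTENTS = {
    "weather": {"weather", "forecast"},
    "radio_on": {"radio"},
    "time": {"time", "clock"},
}


def match_intent(utterance: str) -> Optional[str]:
    """Return the first intent whose keywords appear in the utterance, else None."""
    # Lowercase the utterance and split it into plain words.
    words = set(re.findall(r"[a-z']+", utterance.lower()))
    for intent, keywords in KEYWORD_INTENTS.items():
        if words & keywords:  # at least one keyword was spoken
            return intent
    return None


# The natural question and the stripped-down "repair" form land on the same intent.
print(match_intent("What's the weather like today?"))
print(match_intent("Weather forecast"))

# A story matches nothing at all, or worse, matches by accident.
print(match_intent("Let me tell you about my dragon"))
print(match_intent("Once upon a time there was a dragon"))
```

In this toy setup, the full question and the two-word command are strictly equivalent, so the user gains nothing by being natural, while the child’s story is either ignored or mistaken for a request for the time because it happens to contain the word “time”.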
Maybe the reason why we fall so readily into this trap of establishing a uniformized and simplified lexicon and sticking to it is not only because we’re talking to chatbots, but because we are culturally prone to it.
It could be, in fact, that we think of languages as autonomous and shared precisely because we are used to grammars and dictionaries, because the experiences with language we are most self-conscious about (school experiences, for example) tend to involve the standardized written varieties that are codified in grammars and dictionaries.
Indeed, printing technology brought about more linear, homogenized representations than when books were copied by hand or stories were transmitted orally. Writing and printing can therefore be viewed from a posthumanist perspective, as what we think of language is linked to this widespread uniformization: literary texts become standard models.
With the rise of new technologies such as AI-based automated translation, new problems arise. For example, DeepL and Linguee are widely used translation tools based on the content of websites that exist in two languages. While this often provides useful contextual translations, it can also produce errors and literal translations, which have more impact through sheer frequency and therefore tend to enter spoken language more.
The same thing happens with the highly standardized and limited forms that users tend to resort to when speaking with chatbots: their pronunciation is hyper-correct according to traditional standards, and they mostly use imperative speech forms (“Alexa, turn the radio on.”). These patterns then feed back into the chatbots’ databases, strengthening the traditional norms through an endless loop of user feedback.
A cyborg is a cybernetic organism, a hybrid of machine and organism, a creature of social reality as well as a creature of fiction. [...] The cyborg is a condensed image of both imagination and material reality, the two joined centers structuring any possibility of historical transformation.
Donna Haraway, 1991
Human beings have always been “naturally born cyborgs”, entering into cybernetic relationships with the tools we develop to interact with our environments and monitor our behaviour. We communicate not only through machines but with machines (Siri, Google Assistant, Alexa, Cortana… and also Pepper and Nao!). This engagement with machines is nothing new (people have always talked to their cars or TVs), but now that machines are talking back, we’re moving into human-computer communication, or post-humanist communication.
A study by Prof. Britta Schneider reveals that there are two radically different types of users regarding their relationship with Alexa (Amazon’s voice assistant): some see their device as a technical instrument, or an “extended arm”, and others consider it a flatmate.
For the first category, the relation is technical and instrumental, as Alexa is used primarily for things previously done by hand, such as checking the time or the weather, switching lights or the radio on and off, or playing music. These users refer to their device as a “machine” or an “assistant”.
The other category of users refers to Alexa as an animate being. The device is greeted and praised, and there is a clear development of an emotional relationship, which links back to the Posthuman Manifesto:
In the posthuman era, machines will no longer be machines. As computers develop to be more like humans, so humans develop to like computers more.
It appears that oral communication with a device affects the way we perceive it, in contrast to devices with which we communicate via keyboards. The material nature of sound seems to enhance the interaction: emotional attachment leads to the avoidance of imperative and instrumental speech forms. This is the first step towards a natural interaction between human and machine: the next will come when these machines are able to understand more, and better.
[The posthuman] introduces a qualitative shift in our thinking about what exactly is the basic unit of common reference for our species, our polity and our relationship to the other inhabitants of this planet.
Oral communication with a machine is more challenging than written communication, because the possibilities for a misunderstanding are infinite: even a perfectly constructed and pronounced sentence can be misunderstood, or even completely unheard because of background noise, detection errors, hardware issues, etc. So it’s easier for the interaction to fail when speaking to a chatbot instead of typing, but that’s the price to pay to be able to establish a stronger bond: with speech comes empathy, and a sense of connection. In today’s world, these moments are sadly rare and short-lived, but so precious when they do happen! To make these rare moments of shared connection a norm, both humans and chatbots need to work on their communication skills.
Humans have to rethink the way they talk with machines: if they want to have a real dialog, they need to move away from the standardized and limited imperative forms they use, and strive to talk as they would to a fellow human. They should not have to invent dialog shortcuts to get the chatbot to do what they want it to, nor should they over-simplify their speech to cope with a communication failure. Users should be able to explain the misunderstanding, and the chatbot should be able to understand and correct it, so they can both move on with the dialog.
The key improvement for chatbots is their ability to understand the user correctly, because their current level is not enough for the users to move away from their set patterns of speech. Improvements first need to be made in the hardware so that they can catch human voices in any kind of environment, even with a lot of background noise or echoes. Once they’ve heard what the user said, their understanding of the syntax and vocabulary needs to be more flexible, so they can accommodate all types of users - native and non-native speakers, regional accents, as well as various levels of politeness (and impoliteness!).
But the most important improvement is also the most difficult: becoming able to understand the underlying pragmatic intention. It’s something that we humans do naturally when we interact with each other: beyond the words that are pronounced, we understand the underlying message that is conveyed by more than words - gestures, facial expressions, and more. There are a lot of possibilities: the user may want the machine to give information, or to do something, or they may just want to talk. This is very difficult for machines: some don’t have access to this information at all (voice assistants that are often limited to smartphones or speakers for example), but even those that do, such as robots, are not yet able to interpret it usefully.
When chatbots start to understand the intention behind the user’s speech, real dialog will become possible: this is how far the communication between humans and chatbots needs to go to equal the one between humans!
- “Posthumanist pragmatics: linguistic encounters in the digital uncanny valley”, chaired by Prof. Theresa Heyd and Prof. Britta Schneider
- “Posthumanist pragmatics: linguistic encounters in the digital uncanny valley”, by Prof. Theresa Heyd and Prof. Britta Schneider (introduction)
- “Science or Fiction? The trans-humanist debate, the state of the art in A.I. and the changing user perception”, by Dr. Netaya Lotze
- “Cyborg Languages - Collective Language Norms in the Age of Artificial Intelligence”, by Prof. Britta Schneider
In this article, I used my notes and the slides from three talks of the panel “Posthumanist Pragmatics: linguistic encounters in the digital uncanny valley” as listed in the credits, merging them together to create a narrative. I kept a lot of the original material that was shown and said at the conference as is, but I also added comments of my own to explore some notions further, and clarify references to our robots, so any imprecision or error you may find would therefore be entirely my fault!
Chatbot: A chatbot is a program able to converse via text or voice. Chatbots can be embedded in various systems, from computers for the first ones to smartphones and speakers for the latest generation of voice assistants, and even robots!
Posthumanism: There are several definitions associated with this word, but we will only present the one used at the conference. Posthumanism is a call for rethinking the relationship between humans and their environment: as technology evolves, humans evolve with it.
Pragmatics: Pragmatics is a subfield of linguistics that studies the ways in which context contributes to meaning (dealing with language in use and the contexts in which it is used). This brings us to the notion of "pragmatic intention", the underlying message contained in any verbal and non-verbal communication (for example, saying "I'm cold" may mean "close the window" in a certain context, which is usually easy for a human to understand, less so for a chatbot).
Uncanny valley: The uncanny valley is an unsettling feeling people experience when androids (humanoid robots) and audio/visual simulations closely resemble humans in many respects but are not quite convincingly realistic. This effect can also be observed in language, when chatbots fail to live up to their users' expectations.