Wikipedia VUI
Exploring voice interfaces for the web
YEAR
2019
DURATION
16 Weeks
Context
University
How might we design voice interfaces beyond hotword commands?
Communicating by voice, especially through speech, is one of the most natural ways humans exchange information, give commands, and express feelings.
With the arrival of machine learning, typical error rates in speech recognition dropped significantly; the technology matured into something usable and production-ready.
The brief asked us to build an alternative to Siri, Cortana, or Alexa for a specific (simple) use case. We explored the potential and implications of speech recognition technology as a material from an interaction design point of view.
Browse, search, and explore the world's biggest encyclopedia using only your voice.
Meet Caren, Alex and Daniel
A conversation with three experts
We aimed to make the conversation feel as human as possible, not like an automated process. Each of the three voices has its own task and personality. They complement each other in how they present content and in their voice pitch, which also helps users distinguish the levels of the app.
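As a rough illustration, distinct voice profiles like this can be set up with the browser's Web Speech API; the pitch values and the role assignments in the comments below are illustrative assumptions, not our actual configuration.

```typescript
// Minimal sketch: giving each "speaker" its own voice settings with the
// Web Speech API. Pitch/rate values and role assignments are illustrative.
type Speaker = "caren" | "alex" | "daniel";

const voiceProfiles: Record<Speaker, { pitch: number; rate: number }> = {
  caren:  { pitch: 1.1, rate: 1.0 },  // e.g. navigation and feedback
  alex:   { pitch: 0.9, rate: 1.0 },  // e.g. article content
  daniel: { pitch: 1.0, rate: 1.1 },  // e.g. meta information and help
};

function speakAs(speaker: Speaker, text: string): void {
  const utterance = new SpeechSynthesisUtterance(text);
  utterance.pitch = voiceProfiles[speaker].pitch;
  utterance.rate = voiceProfiles[speaker].rate;
  window.speechSynthesis.speak(utterance);
}
```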
Auditive information architecture
The system answers user input with navigation, feedback, or content from the Wikipedia API. Because the microphone records constantly, it enables a fluent conversation and so-called cross-level commands.
Some commands depend on the content of the current article, while cross-level commands always work, no matter how deep the user has navigated into the wiki.
Structure of the available voice command types
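A minimal sketch of how the two command types could be separated by a simple resolver on top of the recognized text; the command names are taken from this case study, while the resolver itself and its return values are illustrative assumptions.

```typescript
// Sketch of the two command types: cross-level commands always resolve,
// while context commands need an open article.
interface AppState {
  currentArticle: string | null; // title of the article in view, if any
}

const crossLevelCommands = ["search", "stop", "rewind", "table of contents"];
const contextCommands = ["subcategory", "links"];

function resolveCommand(spoken: string, state: AppState): string {
  if (crossLevelCommands.some((c) => spoken.startsWith(c))) {
    return `execute:${spoken}`;                      // works at any level
  }
  if (contextCommands.some((c) => spoken.startsWith(c))) {
    return state.currentArticle
      ? `execute:${spoken}@${state.currentArticle}`  // bound to the open article
      : "feedback:open an article first";
  }
  return "feedback:command not recognized";
}
```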
Simple accompanying graphical user interfaces
Onboarding users
We were able to present a working proof-of-concept at the university's exhibition.
To make the system as accessible as possible, we designed a straightforward user interface that keeps the voice interactions in focus. It eases the learning curve and also helped us with debugging.
Hacking on the shoulders of giants.
Before creating our voice user interface concept, we explored existing technologies such as the Google AIY Kit V1, for which we published a hacky tutorial. Until this project, we weren't regularly using APIs or engaging in developer forums; we learned the value of good documentation and open source. Group work with Roman.
Interaction design learnings
Humane input = humane output
Until the development of AI-powered speech recognition, speech was a form of communication performed exclusively between humans. That is no longer the case: machines can now hold decent, fluent conversations. We should make sure the design of those dialogues leans towards the human user. While commands and speech output should generally be kept short, adding human character to the machine's output is a good idea. It enables personal identification and makes the conversation feel more like a friendly talk than a cold exchange of information.
We introduced roles and even personalities to features of the app and learned that this concept helps tackle many pain points. For example, the user's error tolerance increases: it's harder to be mad at Caren than at a computer-generated voice without any personality. It also makes the experience much more comfortable and eases the strangeness of speaking with a machine. The immersion isn't perfect, and you remain aware that you are talking to a computer. We can't wait to see how the next generation of speech technology will improve this design even further.
Command length and complexity
Keeping commands short and straightforward helps both the machine and the human. On the machine side, the error rate for recognizing hotwords drops as commands get shorter. The user benefits too, because they don't need to remember and recite complex incantations. You wouldn't want overly complex interactions in a GUI either; efficiency is key. Error tolerance towards VUIs is lower than towards other human-machine interfaces: having to speak a command out loud over and over again is far more annoying than clicking the mouse a couple of times.
Readback
Readback is a term from air traffic communication. It's used to double-check exchanged information and make sure the other party got it right. We can adopt this model for VUIs. Especially when there is no visual feedback, it helps users confirm that the VUI understood them correctly. For example, our Wikipedia VUI uses readback when replying to a search request with "Here are your results for [search term]..." instead of just "Here are your results...".
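A small sketch of this readback pattern, assuming the public Wikipedia opensearch endpoint; the function name and surrounding structure are illustrative, only the spoken phrasing comes from our design.

```typescript
// Sketch: echoing the recognized search term back to the user ("readback")
// before listing the results from the Wikipedia API.
async function announceSearchResults(searchTerm: string): Promise<string> {
  const url =
    "https://en.wikipedia.org/w/api.php" +
    `?action=opensearch&search=${encodeURIComponent(searchTerm)}` +
    "&limit=5&format=json&origin=*";
  const response = await fetch(url);
  // opensearch returns [term, titles, descriptions, urls]; we only need the titles
  const [, titles] = (await response.json()) as [string, string[]];

  // Readback: repeat the term the system understood, then the results.
  return `Here are your results for ${searchTerm}: ${titles.join(", ")}.`;
}
```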
No dead ends
We implemented two things to avoid dead ends. First, the speech recognition runs constantly, so the user can always stop, rewind, or execute an action, no matter what the VUI is doing at that moment. Being able to interrupt the speech output lowers frustration when human and machine misunderstand each other. The user stays in control.
Second, we use so-called cross-level commands. Almost every voice command can be performed at any time, no matter how deeply the user has navigated into the application. Only the controls for "subcategory" and "links" require the user to be inside an article. It is possible to start a new search or pull up the table of contents of a completely different article at any point.
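A sketch of the underlying idea, assuming the browser's Web Speech API with continuous recognition; only the "stop" handling is shown, and handing other commands to a resolver (like the one sketched earlier) is an assumption about the wiring, not our exact code.

```typescript
// Sketch: continuous recognition so spoken output can always be interrupted.
// Uses the Web Speech API; browser support and vendor prefixes vary.
const SpeechRecognitionImpl =
  (window as any).SpeechRecognition || (window as any).webkitSpeechRecognition;

const recognition = new SpeechRecognitionImpl();
recognition.continuous = true;       // keep listening, even while the app speaks
recognition.interimResults = false;

recognition.onresult = (event: any) => {
  const spoken: string =
    event.results[event.results.length - 1][0].transcript.trim();
  if (spoken.toLowerCase().startsWith("stop")) {
    window.speechSynthesis.cancel(); // cut off the current speech output
  }
  // ...hand other commands to the command resolver sketched above
};

recognition.start();
```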
Structure
Wikipedia was an excellent subject for our prototype because it gave us access to a huge number of more or less identically structured pages. When building websites today, developers mostly think mobile-first and focus on the visual experience. We should always consider providing well-structured fallback text so that a site works with VUIs and screen readers. Letting the VUI describe images isn't practical yet. We should reconsider whether images need to carry richer information in their alt tags, or whether machine learning will soon become good enough at interpreting images for us.
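As a sketch of what well-structured fallback text looks like in practice, a plain-text article extract can be requested from the public Wikipedia API (the TextExtracts parameters below are part of that API); the function itself is illustrative, not our production code.

```typescript
// Sketch: pulling a plain-text intro extract from the Wikipedia API so the
// VUI has clean, structured text to read instead of a visual-first page.
async function fetchIntroText(title: string): Promise<string> {
  const url =
    "https://en.wikipedia.org/w/api.php" +
    "?action=query&prop=extracts&exintro=1&explaintext=1" +
    `&titles=${encodeURIComponent(title)}&format=json&origin=*`;
  const response = await fetch(url);
  const data = await response.json();
  const pages = data.query.pages;
  const firstPage = pages[Object.keys(pages)[0]];
  return firstPage.extract ?? "No readable text found for this article.";
}
```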