Best practices for developing voice user interfaces
Learn how developing a voice user interface (VUI) differs from building a traditional graphical user interface, and which best practices will take your VUI to the next level.
A voice user interface is a technology that allows people to interact with a computer or device using spoken commands. Think of Captain Kirk standing on the bridge of the Starship Enterprise asking the computer for an analysis. Once science fiction, VUI is now one of the fastest growing technologies in the world.
One billion searches are conducted by voice every month, and 72% of people who use voice search do so every day. Google’s language technology now recognizes over 100 languages, and recent analysis indicates that the natural language processing stack is over 95% accurate. There is no denying that VUIs are making tremendous strides in terms of adoption and accuracy. One could even argue that the gains in acceptance are a result of the gains in accuracy. There’s some truth to that, but it’s not the whole story.
How VUI helps create future-proof digital products
Humans are hardwired for language. As a species, we have been communicating with the spoken word for at least 50,000 years. On average, we can speak 125 to 150 words per minute: that’s more than three times the average typing speed. Put that way, one wonders if future generations will even bother to learn to type.
If you are developing a digital product or service, chances are a VUI is or will be on your roadmap. Twenty years ago, adding a voice user interface to an application required a team of specialized engineers and expensive hardware, and the result often sounded like a Speak & Spell.
Today, even a beginner can create their first voice application in under an hour with something like the Alexa Skills Kit. But it’s not just the technology that makes or breaks your VUI. To create a voice user interface that will take your digital offering to the next level, you need to understand a few best practices and philosophies.
VUI best practices
Start with the ideal interaction
You should start designing your voice interaction by mapping an end-to-end dialog flow. Start with the golden path and then work on filling in the branches and edge cases. Watch out for dead ends in your conversation trees. Just like when talking to people, awkward silence is a conversation killer.
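One way to catch dead ends early is to write the dialog flow down as a graph and check it mechanically. The following is a minimal sketch, not a prescribed method: the state names, the `DIALOG_FLOW` mapping, and the `find_dead_ends` helper are all hypothetical and exist only for illustration.

```python
from collections import deque

# Hypothetical dialog flow: each state maps to the states a user can reach next.
DIALOG_FLOW = {
    "welcome": ["ask_order", "ask_status"],   # golden path starts here
    "ask_order": ["confirm_order"],
    "confirm_order": ["goodbye"],
    "ask_status": ["read_status"],
    "read_status": [],                        # no follow-up prompt: a dead end
    "goodbye": [],
}
# States that are allowed to end the conversation on purpose.
TERMINAL = {"goodbye"}


def find_dead_ends(flow, start, terminal):
    """Return reachable states that have no next step and are not terminal."""
    seen = {start}
    queue = deque([start])
    dead_ends = []
    while queue:
        state = queue.popleft()
        next_states = flow.get(state, [])
        if not next_states and state not in terminal:
            dead_ends.append(state)
        for nxt in next_states:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return sorted(dead_ends)


print(find_dead_ends(DIALOG_FLOW, "welcome", TERMINAL))  # ['read_status']
```

Running a check like this whenever the flow changes makes it harder for a branch to end in silence without anyone noticing.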
More options don’t mean more value
Keep in mind that users start without a clear indication of the options available, so proper onboarding is essential. Start with an overview of what the interface can do. Keep lists short – usually three or fewer options. Consider prefixing these options with numeric identifiers to make it easier for your users to remember. It’s also important to remember that text-to-speech engines typically recite information much more slowly than people read it, so keep your menu options concise.
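The menu guidance above can be enforced in code so that a long list never reaches the text-to-speech engine. This is a rough sketch under stated assumptions: `build_menu_prompt`, `MAX_OPTIONS`, and the sample wording are hypothetical, and the limit of three simply mirrors the rule of thumb in the text.

```python
MAX_OPTIONS = 3  # assumption: mirrors the "three or fewer" rule of thumb


def build_menu_prompt(options, max_options=MAX_OPTIONS):
    """Build a short spoken menu whose options carry numeric identifiers."""
    if not options:
        raise ValueError("at least one option is required")
    if len(options) > max_options:
        raise ValueError(
            f"too many options for a spoken menu: {len(options)} > {max_options}"
        )
    numbered = [f"{i}, {text}" for i, text in enumerate(options, start=1)]
    if len(numbered) == 1:
        listing = numbered[0]
    else:
        listing = "; ".join(numbered[:-1]) + "; or " + numbered[-1]
    return f"You can say: {listing}."


print(build_menu_prompt(["check a balance", "pay a bill", "talk to an agent"]))
# You can say: 1, check a balance; 2, pay a bill; or 3, talk to an agent.
```

Raising an error for an oversized list is a deliberate choice here: it pushes the designer to split the menu into steps rather than silently truncating what the user hears.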
Context, context, context
Programmatically deciphering and maintaining context is difficult, both within a single session and across multiple sessions. When people interact with each other, we are privy to a variety of non-verbal cues. Pitch, tone, and even facial expressions provide additional context. Most commercial VUI platforms are unaware of these contextual cues. Interestingly, however, almost all of them can convey additional context in their responses via Speech Synthesis Markup Language (SSML). SSML allows a developer to build pauses, pitch changes, and even some emotion into responses, increasing the conversational feel of your VUI.
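As an illustration of what such markup looks like, the sketch below assembles a small SSML response with a pause and a prosody change. The `<speak>`, `<break>`, and `<prosody>` elements come from the W3C SSML specification; the `build_ssml` helper and its parameters are hypothetical, and individual voice platforms support different subsets of SSML, so check your platform's reference before relying on any tag.

```python
from xml.sax.saxutils import escape

# Named values the W3C SSML specification defines for <prosody>.
RATES = {"x-slow", "slow", "medium", "fast", "x-fast", "default"}
PITCHES = {"x-low", "low", "medium", "high", "x-high", "default"}


def build_ssml(lead, detail, pause_ms=400, rate="medium", pitch="medium"):
    """Return SSML that speaks `lead`, pauses, then speaks `detail` with prosody."""
    if rate not in RATES:
        raise ValueError(f"unsupported rate: {rate!r}")
    if pitch not in PITCHES:
        raise ValueError(f"unsupported pitch: {pitch!r}")
    if pause_ms < 0:
        raise ValueError("pause_ms must not be negative")
    return (
        "<speak>"
        + escape(lead)                      # escape &, <, > so the XML stays valid
        + f'<break time="{int(pause_ms)}ms"/>'
        + f'<prosody rate="{rate}" pitch="{pitch}">'
        + escape(detail)
        + "</prosody></speak>"
    )


print(build_ssml("Your order has shipped.", "It should arrive on Tuesday.", rate="slow"))
```

Escaping the spoken text matters because user-supplied or database-supplied strings often contain characters such as `&` that would otherwise produce malformed markup.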
Language-specific error handling
Error handling on a VUI has specific challenges. Error messages must be specific and suggest a course of action for the user. For example: “I’m afraid I don’t know how to help you with this. As a reminder, I can help you with the following…”
You should also beware of a generic try-catch error handler that pushes a system-level error up to your TTS. You don’t want your voice assistant telling users that “socket is closed by remote host” or some other common, low-level occurrence. Logging is your best friend when it comes to debugging a VUI. Remember that your logs contain what the VUI heard, not necessarily what the user said.
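A minimal sketch of that separation, assuming a hypothetical `respond` wrapper, `UnknownIntentError` exception, and fallback wording: the user hears a specific, actionable message, while the low-level detail and the recognized text go to the log only.

```python
import logging

logger = logging.getLogger("vui")

# Hypothetical wording for failures the user cannot act on.
FALLBACK_PROMPT = "Sorry, something went wrong on my end. Please try again in a moment."


class UnknownIntentError(Exception):
    """Raised by a (hypothetical) handler when the request is not understood."""


def respond(handler, heard_text, capabilities):
    """Run a handler and always return text that is safe to send to TTS."""
    try:
        return handler(heard_text)
    except UnknownIntentError:
        # Log what the VUI heard, which may differ from what the user said.
        logger.info("unrecognized request; heard=%r", heard_text)
        return (
            "I'm afraid I don't know how to help you with this. "
            "As a reminder, I can help you with the following: "
            + ", ".join(capabilities) + "."
        )
    except Exception:
        # Full traceback goes to the log, never to the speaker.
        logger.exception("handler failed; heard=%r", heard_text)
        return FALLBACK_PROMPT


def flaky_handler(text):
    raise ConnectionError("socket is closed by remote host")


print(respond(flaky_handler, "where is my order", ["orders", "deliveries"]))
```

The broad `except Exception` is acceptable here only because it logs the full exception and replaces the spoken text; it should not be used to hide failures from the people maintaining the system.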
One of the more challenging parts of creating a good VUI is training your model on all of the different ways your users might ask for the same thing. You’ll never be able to come up with all the variations yourself, and surveys generally don’t work because people write differently than they speak.
Instead, you need to observe (and, if possible, record) users in real life so that you can cover a reasonable range of user input at launch. Make sure you’re watching users who are representative of your target audience: doctors use very different shorthand and abbreviations than mechanics or soldiers.
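After launch, the same idea can be continued by collecting the phrasings your VUI failed to match and reviewing them as candidates for new training examples. The sketch below is only an illustration: the intents, sample utterances, and helper names are hypothetical, and the exact-match lookup stands in for whatever trained model or platform intent matcher you actually use.

```python
import re
from collections import Counter

# Hypothetical intents with the sample utterances known at launch.
SAMPLE_UTTERANCES = {
    "check_status": ["where is my order", "track my package", "order status"],
    "cancel_order": ["cancel my order", "i want to cancel"],
}

# Phrasings that matched nothing, counted so frequent ones surface first.
unmatched = Counter()


def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9' ]+", " ", text.lower())
    return " ".join(text.split())


def match_intent(heard_text):
    """Return the intent with an exactly matching sample utterance, else None."""
    cleaned = normalize(heard_text)
    for intent, samples in SAMPLE_UTTERANCES.items():
        if cleaned in samples:
            return intent
    unmatched[cleaned] += 1  # keep the miss for later review
    return None


print(match_intent("Where is my order?"))        # check_status
print(match_intent("Has my stuff shipped yet"))  # None
print(unmatched.most_common(1))
```

Reviewing the most common misses with representative users in mind gives you real spoken variations rather than the ones the team imagined.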
Don’t forget about privacy and security
When developing a VUI, it is your responsibility to understand privacy and security concerns. Commercial smart speakers are always listening for a wake word. Once engaged, however, they typically record and interpret everything that is said, waiting up to eight seconds between commands before reverting to passive listening.
Developers need to be aware of any sensitive information that might be required for a specific use case and the policies and regulations governing the handling of that data. Also remember that it is impossible to know who might enter a room between the time information was requested and the actual response.
How to choose the right VUI technology
Today there is quite an extensive list of options to speed up the development of your voice user interface. Before committing to any particular solution, make sure you have a good handle on your non-functional needs:
- Will the device be constantly connected to the internet?
- Speed and accuracy
  - Does the translation have to be done in real time?
  - What is the trade-off between speed and accuracy?
- Domain data models
  - How well trained are the models in your domain?
  - Do you need to understand whole sentences or just pick out keywords?
- Is there a keyboard or touchscreen in case voice input fails?
- Does an incorrectly processed voice command result in an irreversible action?
- Under what environmental conditions does your solution need to work?
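Some of the answers to these questions translate directly into runtime behavior. As a rough sketch, the function below decides what to do with a recognized command based on whether the action is irreversible and whether a touch fallback exists. Everything here is an assumption for illustration: the `next_step` name, the returned labels, the 0.8 threshold, and the premise that your speech engine reports a confidence score between 0 and 1.

```python
def next_step(confidence, irreversible, has_touch_fallback, threshold=0.8):
    """Pick a follow-up for a recognized command.

    confidence: recognition confidence in [0, 1] (assumed to be available).
    irreversible: True if the action cannot be undone.
    has_touch_fallback: True if a keyboard or touchscreen is present.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence >= threshold:
        # Even a confident match is read back first when it cannot be undone.
        return "confirm" if irreversible else "execute"
    if has_touch_fallback:
        return "offer_touch"   # let the user finish on the screen instead
    return "reprompt"          # voice-only device: ask again


print(next_step(0.95, irreversible=False, has_touch_fallback=False))  # execute
print(next_step(0.95, irreversible=True, has_touch_fallback=False))   # confirm
print(next_step(0.40, irreversible=True, has_touch_fallback=True))    # offer_touch
```

The right threshold depends on the environment the device operates in, so it is worth tuning against recordings made under realistic conditions.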
VUIs represent a fundamental change in human-computer interaction. When creating a speech-enabled application, designers and developers need to rethink their approach. Focus on voice-first, truly conversational experiences, and your customers will thank you.