Creating Harmony Between Voice and Touch Inputs in Mobile Applications

By Jon Holtan and Ardy Rahman

Over the last 4 years, we’ve seen a dramatic improvement in the technology that underlies voice recognition and synthesis. The relatively rapid user adoption of both Amazon Alexa and Google Assistant devices is indicative of the utility voice interfaces bring to our everyday lives. However, voice has, for the most part, developed as its own platform. If you build a Google Action, for example, the idea is that the user would be interacting only with the Action at any given moment, even if launched from the Google Assistant on a mobile device. While this separation of platforms (voice and mobile) has been necessary as voice platforms refined their technology, we are now entering a time when voice stops being its own platform and, instead, becomes a new interface for multiple platforms.

At TribalScale, we’ve been at the forefront of this wave of advanced voice interfaces; we have worked on integrating both voice and touch into the same application for mobile and embedded platforms. However, being at this bleeding edge means not having all the answers upfront. Even our voice technology partners don’t have all the answers, because the issues surrounding the marriage of voice and touch are not yet well defined. That being the case, we’ve outlined some of the technical challenges and considerations we’ve encountered to help you layer voice interactions into your applications.

Synchronizing Actions from Multiple Inputs

Mobile apps are a good example of how touch and voice have historically been built as two different experiences. Typically, upon activation, the OS’s voice assistant takes over the device and diverts all interactions through its own interface. For most apps this is not a problem, but for brands that want to add advanced voice interactions inside their application, it can add friction to the user experience. The OS’s built-in voice assistant is powerful, but it might not accurately represent a brand’s desired tone or provide the functionality its specific application needs. In these instances, where voice interactions are wanted inside an existing application without leaving the app experience, a custom-built voice assistant might be warranted.

However, the added complexity for these apps is that voice needs to be treated as another form of input, just like touch. In simple terms, you create an action when you touch or tap a phone and, similarly, you can create an action when you give a specific voice command. One of the biggest considerations for these mixed-input applications is that a single voice command can carry several pieces of data at once, whereas a single touch typically cannot. Some of the data in a voice command will be intents, and some will be values that reflect state. For instance, the user could say “Reserve a hotel room nearby at 5:30pm,” where the intent is Reserve and the state values are hotel room, nearby, and 5:30pm. Your application will need to store this information and retrieve it if the user switches between voice input and touch input. This synchronization is necessary, but you will still run into issues with the contextual awareness of the voice input system if you do not build in a handoff process between voice and touch.
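As a rough illustration of the synchronization described above, the following platform-neutral Python sketch keeps one shared state object that both input layers write to. The names Command and SessionState, and the field names item, location, and time, are assumptions made for this example; they do not come from any particular voice SDK, and the parsed voice command is hard-coded rather than produced by a real recognizer.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Command:
    """One user action, regardless of which input produced it (hypothetical type)."""
    source: str                      # "voice" or "touch"
    intent: Optional[str] = None     # e.g. "reserve"; a touch edit may omit it
    values: Dict[str, str] = field(default_factory=dict)


@dataclass
class SessionState:
    """Single store that both input layers read from and write to."""
    intent: Optional[str] = None
    values: Dict[str, str] = field(default_factory=dict)

    def apply(self, command: Command) -> None:
        # A new intent replaces the old one; values are merged so that a
        # later edit overrides only the fields it actually changed.
        if command.intent is not None:
            self.intent = command.intent
        self.values.update(command.values)


state = SessionState()

# Assumed output of a speech/NLU layer for:
# "Reserve a hotel room nearby at 5:30pm"
state.apply(Command(
    source="voice",
    intent="reserve",
    values={"item": "hotel room", "location": "nearby", "time": "5:30pm"},
))

# The user then taps a time picker: one value, no intent.
state.apply(Command(source="touch", values={"time": "6:00pm"}))

print(state.intent, state.values)
```

After both commands, the intent and the untouched values from the voice command are still present, and only the time reflects the touch edit. That is the handoff behavior the paragraph above asks for.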

Contextual Navigation

When dealing with a single form of input, you have a fairly clear understanding of how a user will navigate a given screen. The landscape changes once you introduce another form of input, because you must keep an accurate account of the app’s current and historical state whenever an action, or a series of actions, is invoked from a different input. We noted above that it is important to think in terms of state, and navigation is a prime example of this. Moving from one screen to another usually depends on state that earlier actions have set, whichever input those actions came from. Technically speaking, this means treating an action the same way regardless of whether it was triggered by voice or by touch.
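One possible way to picture this is a small navigator that records both the current screen and the path taken to reach it, and that handles a transition identically for either input. This is a minimal Python sketch under assumed names (Navigator, the screen names, and the source labels are all hypothetical), not a description of any platform's navigation API.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Navigator:
    """Tracks the current screen plus the path taken to reach it."""
    current: str = "home"
    # Each entry is (screen we left, input that triggered the move).
    history: List[Tuple[str, str]] = field(default_factory=list)

    def navigate(self, screen: str, source: str) -> None:
        # The source is stored for debugging only; the transition itself
        # is the same for voice and touch.
        self.history.append((self.current, source))
        self.current = screen

    def back(self) -> None:
        # Return to the previous screen, if there is one.
        if self.history:
            self.current, _ = self.history.pop()


nav = Navigator()
nav.navigate("search", source="touch")
nav.navigate("booking", source="voice")   # e.g. a spoken "book the first result"
nav.back()                                 # behaves the same after a voice jump
```

Because the voice-triggered jump is recorded in the same history as the tap, going back lands on the search screen rather than losing the user's place.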

One way to do this in an Android application is to categorize everything as an event and use an EventBus to route the events to a centralized location for processing. The benefit is that you can construct your voice input layer independently of your app (e.g., in Dialogflow) and have the two communicate via the bus. The downside is that you can get bogged down with too many events, and problems become harder to debug when they arise. Therefore, it is extremely important that you test your functions and confirm that they are actually pushing the events. We recommend that you draw a diagram of how events flow through the app so you can refer to it if there is an issue. Once the flow of events is set up, you can ensure that your user is able to navigate all areas of the app with touch or voice without losing context.
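The event-based approach can be sketched in a few lines. The class below is a deliberately small, hand-written publish/subscribe bus in Python, written only to show the shape of the idea; it is not the API of any Android EventBus library, and the event type open_screen and the payload keys are assumptions made for this example. It also keeps a log of posted events, since the paragraph above stresses verifying that events are actually pushed.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List, Tuple

Handler = Callable[[Dict[str, str]], None]


class EventBus:
    """Minimal in-process publish/subscribe bus (illustrative only)."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)
        # Every posted event is logged so a test can check that it fired.
        self.log: List[Tuple[str, Dict[str, str]]] = []

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._handlers[event_type].append(handler)

    def post(self, event_type: str, payload: Dict[str, str]) -> None:
        self.log.append((event_type, payload))
        # Iterate over a copy so handlers may subscribe during dispatch.
        for handler in list(self._handlers[event_type]):
            handler(payload)


bus = EventBus()
opened: List[str] = []

# One central handler processes the event, whichever layer posted it.
bus.subscribe("open_screen", lambda payload: opened.append(payload["screen"]))

# Touch layer: a button tap.
bus.post("open_screen", {"screen": "search", "source": "touch"})
# Voice layer: a recognized phrase mapped to the same event type.
bus.post("open_screen", {"screen": "booking", "source": "voice"})
```

Inspecting bus.log in a unit test is one simple way to check that a given function really posts the event you expect, which helps with the debugging difficulty noted above.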

Elegant User Feedback

Even with perfectly synchronized voice and touch systems inside an application, unpredictable errors can arise. When the user is relying on a voice system for both input and output, receiving no feedback during an unexpected error can lead to distrust of the application. This is easier to handle when the user is looking at a screen and using touch, but when a user is engaging the application with voice and not necessarily looking at a screen, you must provide thoughtful voice feedback. Not informing users of what is going on can be detrimental to the entire experience. Just as we would show an alert dialog or some other message when an API call fails in a touch flow, we need to give equivalent feedback when a task fails in a voice flow. For example, if an API call fails while the user is engaging the voice input layer of an application, the application could trigger an event that prompts the voice system to state the error and ask the user whether they would like to retry. While it may seem easier to reset the entire system upon failure, you should resist this sort of fatal error handling. If you are able to provide user feedback that is informative and elegant, you will build trust with your users even when the application behaves erratically.
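The failure scenario above could be handled along these lines. This Python sketch assumes the app knows which input is currently active and has two callbacks available, one that speaks a message and one that shows a dialog; run_with_feedback, failing_api_call, and the callback names are hypothetical, and the lists stand in for a real text-to-speech engine and a real dialog.

```python
from typing import Callable, List


def run_with_feedback(
    task: Callable[[], str],
    active_input: str,
    speak: Callable[[str], None],
    show_dialog: Callable[[str], None],
) -> bool:
    """Run a task; on failure, report it through the active input's channel.

    Returns True on success, False if the user was told about a failure.
    """
    try:
        task()
        return True
    except Exception as error:  # broad on purpose: this is only a sketch
        message = f"Sorry, that did not work: {error}."
        if active_input == "voice":
            # The user may not be looking at the screen, so say it aloud
            # and offer a retry instead of resetting the session.
            speak(message + " Would you like to try again?")
        else:
            show_dialog(message)
        return False


def failing_api_call() -> str:
    # Stand-in for a network request that fails.
    raise ConnectionError("the booking service is unavailable")


spoken: List[str] = []   # stands in for a text-to-speech engine
shown: List[str] = []    # stands in for an on-screen dialog

ok = run_with_feedback(failing_api_call, "voice", spoken.append, shown.append)
```

The session state is left untouched on failure, so if the user answers yes, the app can simply run the same task again rather than starting over.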


We recommend keeping the points above in mind when building your mixed-input application: synchronize your inputs so you have a single flow of data, provide context to your systems so navigation can be handled, and, most importantly, give your users helpful feedback during failure states. When all three of these principles are followed, you will be amazed at how quickly users begin to trust your app. With these new paradigms of interaction, applications will need to keep building trust before they can expect broad user adoption. Hopefully this article has given you some insight into how to think about voice and touch as equal, and not separate. We know it’s a wild west out there when it comes to building voice-enabled apps on emerging platforms, so if you have any questions about including voice within your application, don’t hesitate to drop us a line.


About the authors

Jon found his passion for programming while studying game development at UC Santa Cruz. He is an Agile Engineer at TribalScale Orange County where he spends his time building quality applications for the biggest brands, teaching Android, and exploring the latest technologies. Outside of work, he enjoys cooking delicious meals and spending time with the Orange County tech community.

Ardy is the Director of US West for TribalScale and consults partner companies on how to best build innovative software for emerging platforms. His strategic insights are instrumental to brands across industries, including automotive, insurance, fitness, and healthcare. His current focus is helping TribalScale’s partners integrate voice and ML into their applications in order to bring more intelligent digital experiences to consumers.

