Allowing the user to give or receive information via multiple modes of interaction
User Interface・Conversational Interfaces

Work In Progress

Our Elements guide is still in progress, and therefore lacks full visual and technical assets. We hope to release them by summer of 2020. Thanks for reading Lingua Franca!


Multi-modality refers to the ability of an interface to accept multiple kinds of input and output. One example is a voice assistant that lets users speak or tap on the screen. Another is a search bar that accepts either a photo or a phrase as the search query. An interface should encourage multi-modal interaction when the information it handles is diverse and unstructured, since in these cases the user may not be able to express or interpret that information in a single ‘dimension’. While more challenging for designers, multi-modal interfaces often allow for greater creative freedom, as well as more delight in uncertainty and serendipity.


There is research showing that children learn to think and behave multi-modally from a very young age[1] (also termed crossmodal learning). This makes sense, given that most of our interactions in the physical world occur in multiple modes, from interpersonal communication to hazard response. While designers may hesitate to add modalities to a given interaction for fear of ‘overloading’ the user, the actual effect may be the opposite: a well-timed visual interaction that takes place simultaneously with an audible one can improve usability and clarify ambiguity.


AI systems make multi-modal interaction challenging, as the requirements for training data and model interaction design grow significantly. Recent attempts have been made to create multi-modal AI systems that intermingle visual and language representations so that users may interact in several dimensions[2].

However, a multi-modal interface need not require such intermingling at the model level. The interface can simply represent its output or action in two different modalities to improve users' comprehension. For example, a gesture-recognition system may output both a visual and a textual indication of the detected gesture, so that users understand whether their gesture is being mistaken for a similar one. If the output were given only textually, users may not understand how to overcome the misinterpretation.
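A minimal sketch of this idea, assuming a hypothetical gesture classifier with an illustrative label set and icon mapping (none of these names come from a real system):

```python
from dataclasses import dataclass

# Assumed, purely illustrative mapping from gesture labels to on-screen glyphs.
ICONS = {"swipe_left": "←", "swipe_right": "→", "thumbs_up": "👍"}

@dataclass
class MultiModalOutput:
    text: str   # textual modality: the detected label spelled out
    icon: str   # visual modality: a glyph standing in for an on-screen graphic

def render_gesture(label: str, confidence: float) -> MultiModalOutput:
    """Represent one detected gesture in two modalities at once.

    Showing both an icon and a sentence lets the user cross-check the
    system's interpretation: if the icon looks wrong, the text states
    exactly what the system detected, and vice versa.
    """
    text = f"Detected '{label.replace('_', ' ')}' ({confidence:.0%} confident)"
    return MultiModalOutput(text=text, icon=ICONS.get(label, "?"))

out = render_gesture("swipe_left", 0.87)
print(out.icon, out.text)  # → ← Detected 'swipe left' (87% confident)
```

The point of the redundancy is not decoration: when the classifier mistakes one gesture for a similar one, the user sees the wrong icon and reads the wrong label simultaneously, making the error legible in whichever modality they attend to.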



  1. Infants are superior in implicit crossmodal learning and use other learning mechanisms than adults by Rohlf, et al. ↩︎

  2. One Model To Learn Them All by Kaiser, et al. ↩︎