Multimodal interaction has long promised to make interfaces more intuitive, adaptive, and effective by combining complementary inputs. Among these, gaze and speech form a compelling pairing: gaze provides cues to attention, while speech conveys meaning and intent. With the increasing availability of eye tracking and advances in language processing, their integration is now feasible across extended reality and mobile contexts. Yet despite decades of exploration, research on gaze-speech interaction remains fragmented. This scoping review systematically examines 103 papers published between 1995 and 2025, categorising them as explicit, where users deliberately provide gaze and speech input, and implicit, where systems interpret these behaviours to support interaction. Across both, we identify recurring strategies for combining gaze and speech to resolve ambiguity, ground references, and support adaptivity. We contribute a synthesis of research on their combined use, highlight challenges of temporal alignment, fusion, and privacy, and offer guidance for future research toward richer multimodal interaction.