CHI 2026

Gaze and Speech in Multimodal Human-Computer Interaction: A Scoping Review

Anam Ahmad Khan, Florian Weidner, Jungwoo Rhee, Yasmeen Abdrabou, Andrea Bianchi, Eduardo Velloso, Hans Gellersen, Joshua Newn

Multimodal interaction has long promised to make interfaces more intuitive, adaptive, and effective by combining complementary inputs. Among these, gaze and speech form a compelling pairing: gaze provides cues to attention, while speech conveys meaning and intent. With the increasing availability of eye tracking and advances in language processing, their integration is now feasible across extended reality and mobile contexts. Yet despite decades of exploration, research on gaze-speech interaction remains fragmented. This scoping review systematically examines 103 papers published between 1995 and 2025, categorising them as explicit, where users deliberately provide gaze and speech as input, or implicit, where systems interpret these behaviours to support interaction. Across both, we identify recurring ways of combining gaze and speech to resolve ambiguity, ground references, and support adaptivity. We contribute a synthesis of research on their combined use while highlighting challenges of temporal alignment, fusion, and privacy, offering guidance for future research toward richer multimodal interaction.
