Multimodal interaction has long promised to make interfaces more intuitive, adaptive, and effective by combining complementary inputs. Among these, gaze and speech form a compelling pairing: gaze provides cues to attention, while speech conveys meaning and intent. With the increasing availability of eye tracking and advances in language processing, their integration is now feasible across extended reality and mobile contexts. Yet despite decades of exploration, research on gaze-speech interaction remains fragmented. This scoping review systematically examines 103 papers published between 1995 and 2025, categorising them as explicit, where users deliberately provide gaze and speech input, and implicit, where systems interpret these behaviours to support interaction. Across both, we identify recurring strategies for combining gaze and speech to resolve ambiguity, ground references, and support adaptivity. We contribute a synthesis of research on their combined use, highlight challenges of temporal alignment, fusion, and privacy, and offer guidance for future research toward richer multimodal interaction.