Several key aspects of joint attention modeling remain unexplored or underdeveloped. Addressing these gaps is the primary focus of the present proposal. More precisely, we will work on:

1. Multimodal integration of interactional cues: to date, our models have included only visual and pose cues, without incorporating speaking-status information or transcripts (a minimal fusion sketch follows at the end of this section).
2. Object-level modeling: we will include referential objects, enabling more precise modeling of joint attention and word–referent mapping.
3. Tool refinement: we will enhance our annotation tools to make them scalable and user-friendly for behavioral research.

The key goals and innovation of the present project lie in the multimodal integration of language and interaction cues with automated verification of objects’ presence, while measuring how attention is directed toward these objects during communication. This approach is novel in combining gaze, language, and referential object information, and will be validated in naturalistic, out-of-the-lab field settings, providing a scalable and ecologically valid framework for modeling joint attention and word–referent mapping. Through these advances, we aim to set new standards in the predictive modeling of joint attention and to develop scalable tools that enable the community to advance the study of early language development. The project brings together expertise in gaze modeling (Idiap) and early child language development (UZH).
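To make the intended multimodal integration concrete, the sketch below illustrates one possible late-fusion architecture in PyTorch: per-frame gaze/pose features, a binary speaking-status flag, and a transcript embedding are concatenated and scored against each candidate referential object. All module names, feature dimensions, and the late-fusion design itself are illustrative assumptions made for this sketch, not a committed implementation of the project's method.

```python
import torch
import torch.nn as nn

class JointAttentionFusion(nn.Module):
    """Minimal sketch: fuse gaze/pose features, speaking status, and a
    transcript embedding, then score each candidate referential object.
    Dimensions and architecture are hypothetical placeholders."""

    def __init__(self, gaze_dim=64, text_dim=128, obj_dim=32, hidden=128):
        super().__init__()
        # One shared MLP scores the fused context against each object.
        self.fuse = nn.Sequential(
            nn.Linear(gaze_dim + 1 + text_dim + obj_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # joint-attention score per object
        )

    def forward(self, gaze_feat, speaking, text_emb, obj_feat):
        # gaze_feat: (B, gaze_dim)        per-frame gaze/pose features
        # speaking:  (B, 1)               binary speaking-status flag
        # text_emb:  (B, text_dim)        transcript embedding
        # obj_feat:  (B, n_obj, obj_dim)  candidate referential objects
        n_obj = obj_feat.size(1)
        ctx = torch.cat([gaze_feat, speaking, text_emb], dim=-1)
        ctx = ctx.unsqueeze(1).expand(-1, n_obj, -1)  # repeat per object
        x = torch.cat([ctx, obj_feat], dim=-1)
        return self.fuse(x).squeeze(-1)  # (B, n_obj) logits over objects

# Example with random inputs: batch of 2 frames, 5 candidate objects.
model = JointAttentionFusion()
scores = model(torch.randn(2, 64), torch.ones(2, 1),
               torch.randn(2, 128), torch.randn(2, 5, 32))
probs = scores.softmax(dim=-1)  # distribution over candidate referents
```

Under these assumptions, the softmax over object scores yields a per-frame distribution over candidate referents, which is one way the gaze, language, and object streams could be combined for word–referent mapping.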