Towards a Unified Understanding of Attention and Social Behaviors in the Wild

Understanding and interpreting people’s states, activities, and behaviors stands as a foundational objective in the field of artificial intelligence. Indeed, the abilities to discern non-verbal behaviors, interpret social signals, and infer the mental states of others are crucial components of social intelligence, enabling individuals to predict and understand the behavior of others, anticipate their reactions, gauge their interest, and engage in more meaningful and effective social exchanges. In particular, gaze and attention are central elements driving many cognitive processes related to intentions, actions, and communication, and are closely connected to other behavioral cues such as head and hand gestures, speaking status, facial expressions, and interactions. While past research efforts have attempted to model these cues, the associated tasks were often treated independently, which forgoes possible relationships between them, or were addressed only in specific settings.

In light of this, the goal of this research project is to develop holistic and comprehensive models for human attention and social behavior understanding in the wild, where videos feature a wide diversity of scenes, environments, people, objects, and activities. This implies the integration of social prediction tasks at two different levels: (i) the subject level, where we seek to model head dynamics and facial behavior (e.g. facial expressions, gaze estimation, head/hand gestures), and (ii) the scene level, where we analyze people’s states and behaviors to infer gaze following patterns (where and at what a person looks), attentiveness, interactions, and social communication, which requires context modeling.

The core idea is to leverage a multi-task co-training framework to model these tasks jointly, thereby accounting for intra-task and inter-task dynamics. The unified model is poised to bring the following benefits: (a) efficiency, owing to a single model supporting multiple tasks; (b) improved performance, by virtue of multi-task supervision; and (c) strong person/scene representations that can transfer well to person-centric downstream tasks.

Achieving the desired result hinges upon addressing multiple challenges related to the specific tasks themselves or to the overarching unification objective. These can be framed as a set of research questions that will drive our investigations: (a) How can we leverage vision-language models to incorporate a semantic layer into the gaze following task? (b) How should we model head-face-gaze dynamics to infer head gestures and gaze directions? (c) How can we represent person-centric information given head and body streams? (d) How can we exploit the graph message passing framework or cross-attention models to design fusion mechanisms for subject-level and scene-level information? (e) How can we curate data for the tasks of interest and capitalize on label propagation to learn from a combination of heterogeneous datasets and annotations?
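To make the co-training idea and question (e) more concrete, the following is a minimal sketch, assuming hypothetical task names, loss types, and weights (none of which are specified in this project description), of how per-sample annotation masks could let a single model be trained on mixed batches drawn from heterogeneous datasets, where each sample carries labels for only a subset of tasks.

# Illustrative sketch only: task names, shapes, and weights are assumptions.
import torch
import torch.nn.functional as F

# Per-task loss weights (hypothetical values).
TASK_WEIGHTS = {"gaze": 1.0, "gesture": 0.5, "gaze_heatmap": 1.0}

def cotraining_loss(predictions, targets, label_masks):
    """Combine task losses over a mixed batch of heterogeneous datasets.

    predictions : dict of task name -> predicted tensor (batch first)
    targets     : dict of task name -> ground-truth tensor
    label_masks : dict of task name -> bool tensor (batch,) marking which
                  samples actually carry annotations for that task
    """
    total = predictions["gaze"].new_zeros(())
    for task, weight in TASK_WEIGHTS.items():
        mask = label_masks[task]
        if not mask.any():          # no labels for this task in the batch
            continue
        pred, tgt = predictions[task][mask], targets[task][mask]
        if task == "gesture":       # classification head
            loss = F.cross_entropy(pred, tgt)
        else:                       # regression heads (gaze vector, heatmap)
            loss = F.mse_loss(pred, tgt)
        total = total + weight * loss
    return total

In an actual setup, label propagation could additionally be used to fill in some of the missing annotations with pseudo-labels before applying such masks.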
Given the unification end goal, our approach is to leverage a consistent token-based representation across all tasks, thereby ensuring compatibility between the different architectural components developed separately. We envision the final architecture to include scene and person transformer encoders, followed by a fusion module that allows the exchange of information between each person and the scene or other people. Finally, the updated person tokens will serve as input queries to a transformer decoder before feeding into task-specific prediction heads (a sketch is given below).

By investigating unified computational models of social behavior and communication in the wild, including how visual attention is influenced by and coordinated with other cues, we expect to impact the research community by advancing the state of the art in human-centric computer vision, providing novel tasks and benchmarks, and contributing to other disciplines that rely on such models to automatically code these behaviors for subsequent large-scale analyses.
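The following is a minimal sketch of the envisioned pipeline, assuming PyTorch and illustrative module sizes, token shapes, and prediction heads that are not specified in the project description: scene and person transformer encoders, a cross-attention fusion step, a transformer decoder queried by the updated person tokens, and task-specific heads.

# Illustrative sketch only: dimensions, tasks, and head designs are assumptions.
import torch
import torch.nn as nn


class UnifiedAttentionModel(nn.Module):
    def __init__(self, d_model=256, n_heads=8, n_layers=4):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # Scene encoder over contextual tokens (e.g. image patch features).
        self.scene_encoder = nn.TransformerEncoder(enc_layer, n_layers)
        # Person encoder over per-subject tokens (e.g. head/body features over time).
        self.person_encoder = nn.TransformerEncoder(enc_layer, n_layers)
        # Fusion: each person token attends to the scene and to the other people.
        self.fusion = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Decoder queried by the updated (fused) person tokens.
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, n_layers)
        # Task-specific prediction heads sharing the same person representation.
        self.gaze_head = nn.Linear(d_model, 3)            # 3D gaze direction
        self.gesture_head = nn.Linear(d_model, 5)         # head-gesture classes (assumed)
        self.heatmap_head = nn.Linear(d_model, 64 * 64)   # gaze-following heatmap (assumed size)

    def forward(self, scene_tokens, person_tokens):
        # scene_tokens: (batch, n_scene_tokens, d_model)
        # person_tokens: (batch, n_people, d_model)
        scene_ctx = self.scene_encoder(scene_tokens)
        person_ctx = self.person_encoder(person_tokens)
        # Exchange information between each person, the scene, and other people.
        memory = torch.cat([scene_ctx, person_ctx], dim=1)
        fused, _ = self.fusion(person_ctx, memory, memory)
        # Updated person tokens serve as input queries to the transformer decoder.
        decoded = self.decoder(fused, scene_ctx)
        b, p, _ = decoded.shape
        return {
            "gaze": self.gaze_head(decoded),        # (b, p, 3)
            "gesture": self.gesture_head(decoded),  # (b, p, 5)
            "gaze_heatmap": self.heatmap_head(decoded).reshape(b, p, 64, 64),
        }

The fusion step is shown here with cross-attention, but as raised in question (d), it could equally be realized with graph message passing between person and scene nodes.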
Idiap Research Institute
SNSF
Sep 01, 2025
Aug 31, 2029