The increasing ubiquity of social media in our society, extensive user participation, and the explosion of multimedia content available online have generated extraordinary interest in understanding how people behave and interact in social media outlets such as Facebook, Twitter, or YouTube, to name a few. Existing efforts to study user behavior in social media have mainly focused on the automatic analysis of text content, links, and metadata from blogging, microblogging, and social networks. However, despite the popularity of online video and the emergence of new forms of video interaction, the behavioral content of most online video remains unexplored.

Research in social computing has used cameras and microphones to sense and computationally analyse human social interactions. In particular, some works have focused on the nonverbal aspect of the conversational interaction that takes place in group meetings. The nonverbal channel is produced in parallel to the spoken words, through audio cues (speaking patterns, prosody) and visual cues (gaze, facial expression, posture and gestures), and carries information that is useful to predict human behavior, mood, personality, and social relations, in a very wide range of situations.

VlogSense is an innovative approach that aims to leverage research from both social media and behavioral computing for the automatic analysis of conversational video blogs (vlogs for short). This research aims to analyze conversational vlogs using automatic multimodal techniques that extract actually displayed behavior and that are applicable at large scale. Because vlogs are inherently multimodal and depict natural behavior, they are complex in terms of content, and VlogSense therefore requires the integration of robust methods for multimodal analysis and for social media understanding.

The ultimate goal of research on vlogs is to obtain a rich, robust, multimodal characterization of vloggers that can be used to build automated tools for discovering and interacting with and through vlog content.


In their typical single talking-head setting, conversational vlogs can be thought of as a multimodal extension of traditional text-based blogging, where vloggers implicitly or explicitly share information about themselves that words, either written or spoken, cannot convey.

Acknowledgements: This research is supported by the Swiss National Center of Competence in Research (NCCR) on Interactive Multimodal Information Management (IM)2.
