
Current and past participants



Mateu Aguilo
Harm op den Akker
Matthew Aylett
Naresh Bansal
Joan Isaac Biel Tres
Mihaela Bobeica
Liudmila Boldareva
Mila Boldareva
Vincent Bozzo
Andreu Cabrero
Barbara Caputo
Octavian Cheng
Mathias Creutz
Leucio Antonio Cutillo
Zacharie Degreve
Suyog Deshpande
Lina Dib
Ferran Diego
Dennis Doubovski (1)
Dennis Doubovski (2)
Jochen Ehnes
Carl Ek
Robert Eklund
Marc Ferras

Arlo Faria
Joe Frankel
Weina Ge
Sebastian Germesin
Frantisek Grezl
Guillaume Heusch
Ivan Himawan
Beatriz Trueba Hornero
Marijn Huijbregts
Cuong Huy To
Martin Karafiat
Thomas Kleinbauer
Jachym Kolar
Matej Konecny
Kenichi Kumatani
Jean-Christophe Lacroix
Quoc Anh Le
Lukas Matena
Hari Krishna Maganti
Fernando Martinez
Rosa Martinez
Jozef Milch
Xavier Anguera Miro
Binit Mohanty

Darren Moore
Anh Nguyen
Gaurav Pandey
Vikas Panwar
Jan Peciva
Volha Petukhova
Benjamin Picart
Marianna Pronobis
Michael Pucher (1)
Michael Pucher (2)
Santhosh Kumar Chellappan Pillai
Bogdan Raducanu
Anand Ramamoorthy
Ramandeep Singh
Kumutha Swampillai
Korbinian Riedhammer
Javier Tejedor
Sophie-Anne Thobie
Muhammad Muneeb Ullah
Nynke Van der Vliet
Gerwin Van Doorn
Roel Vertegaal
Oriol Vinyals
Junichi Yamagishi
Shasha Xie

ICSI Trainee: Mateu Aguilo (Masters candidate)
Visiting From: Polytechnical University of Catalonia (UPC), Barcelona
Period: started March 15, 2005, for 6 months

Speaker segmentation and tracking performance can be improved by the use of a speech/non-speech detector. In meeting applications, noise such as papers dropping, people breathing, or coughing may become an important problem. Past speaker activity detection approaches have used domain-dependent acoustic modelling or feature-domain signal processing, achieving different levels of accuracy. For this project, we will explore a mixed technique which combines both standard and non-standard acoustic features, such as energy, harmonicity, and syllable rate, in a training-free bootstrap system. The system will be evaluated on the available meeting corpora and compared to existing systems. Novel approaches will be studied in order to reduce the segmentation error rate.
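As an illustration of the training-free bootstrap idea, the sketch below thresholds short-time frame energy to separate speech from non-speech. The frame length, the margin heuristic, and the function names are illustrative assumptions, not the project's actual system (which also uses harmonicity and syllable rate):

```python
import math

def frame_energies(samples, frame_len=400):
    """Short-time log energy per non-overlapping frame."""
    energies = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        e = sum(s * s for s in frame) / frame_len
        energies.append(math.log(e + 1e-10))
    return energies

def bootstrap_threshold(energies, margin=0.3):
    """Training-free bootstrap: place the decision threshold between the
    lowest-energy frames (assumed non-speech) and the rest."""
    lo, hi = min(energies), max(energies)
    return lo + margin * (hi - lo)

def speech_nonspeech(energies, threshold):
    """Label each frame: True = speech, False = non-speech."""
    return [e > threshold for e in energies]
```

A real detector would smooth these frame decisions and add the non-standard features; this only shows the bootstrap-thresholding skeleton.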

DFKI Trainee: Harm op den Akker (Masters Internship)
Visiting From: UT
Period: 17 September 2007 - 14 December 2007
Project Title: Automatic Dialogue Act Segmentation using Machine Learning

The traineeship addresses the task of Dialogue Act segmentation. Dialogue Acts (DAs) are sentence-like units that express a speaker's intention and the desired influence on the listener. Segmenting utterances into dialogue acts is an important first step in recognizing higher levels of structure in a discourse. The task here is to split a series of uttered words into DA segments. The approach is to use a machine classifier to assign every word either the "boundary" or "non-boundary" class. To reach high performance with such a classifier, useful features have to be extracted and evaluated. The task for this traineeship is first to define as many features as possible, most of which may already have been used in other work, and see how they individually affect the performance of a classifier; then an optimal subset of these features has to be created. Finally, many different classifiers, such as Bayesian Networks, Neural Networks, and Support Vector Machines, are tested to see which one performs best.
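The per-word boundary/non-boundary formulation can be sketched as follows. The pause-threshold rule stands in for the real classifiers (Bayesian Networks, SVMs, ...), and the feature names are illustrative assumptions:

```python
def extract_features(words):
    """words: list of (token, pause_after_sec). One feature dict per word."""
    feats = []
    for i, (tok, pause) in enumerate(words):
        feats.append({
            "pause": pause,                    # silence after the word
            "is_filler": tok in {"uh", "um"},  # a simple lexical cue
            "position": i / max(len(words) - 1, 1),
        })
    return feats

def classify_boundaries(feats, pause_threshold=0.5):
    """Baseline classifier: a long pause after a word marks a DA boundary."""
    return ["boundary" if f["pause"] >= pause_threshold else "non-boundary"
            for f in feats]
```

In the actual work, each candidate feature would be evaluated individually and a trained classifier would replace the threshold rule.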

ICSI Trainee: Matthew Aylett (Postdoctoral Fellow)
Visiting From: Rhetorical Systems and Univ. of Edinburgh
Period: started April 1, 2005 (staying to November)


Hot spots in dialog are points where one or more participants are highly involved in the discussion. These regions are likely to contain important information for users who are browsing a meeting or for information retrieval applications. It has been shown that human coders can reliably rate the level of engagement of a speaker at the utterance level. There is substantial evidence that such rating is based strongly on acoustic cues such as f0 and energy.

Non-lexical prosodic analysis (NLPA) attempts to extract as much prosodic structure as possible from an utterance without the aid of lexical information. There are two main motivations for NLPA: 1) Applications may not have access to reliable recognition output, 2) NLPA establishes how relevant acoustic-only cues may be in prosodic interpretation.

In this project we will apply NLPA to the ICSI Meeting Corpus. We will extend the NLPA paradigm to include probabilistic output for phrasing, disfluency and prominence structure and investigate how such output relates to hot spots and to what extent machine learning algorithms can harness this output to meaningfully segment dialog.

UEDIN Trainee: Naresh Bansal (U/G Internship)
Visiting From:  Indian Institute of Technology Guwahati
Period: 2 May 05 - 31 July 05
Project Title: Speaker diarization using dynamic Bayesian networks

Given today's ever-increasing volumes of audio data, such as meeting recordings, a key problem is the indexing, effective searching, and efficient accessing of these information archives. Speaker diarization is one such technology, which deals with marking and categorizing audio sources. The goal is to divide a multi-speaker audio stream into homogeneous segments based on speaker information, where each segment contains the voice of only one speaker; combining all these segments yields the answer to "Who spoke when?". Challenges include correctly identifying the boundaries between speakers, recognizing the speaker in each segment, and finding the number of unique speakers in the recording. Though the simplicity of the well-known hidden Markov model (HMM) makes it the default choice for automatic speech recognition (ASR) tasks, HMMs do not provide a unifying statistical framework in which new models can be tested without modifying the existing software. HMMs are just a small region of a huge family of statistical models: an HMM can be viewed as a specific case of the more general dynamic graphical models, dynamic Bayesian networks (DBNs), which are Bayesian networks (directed acyclic graphs whose nodes represent variables in the domain) evolving over time. We propose DBN-based single-stream and multi-stream (asynchronous) acoustic models for speaker diarization. The task discussed in this project is in the context of AMI meeting conversations.
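Whatever acoustic model produces them (HMM or DBN), the per-frame speaker decisions must be collapsed into the "who spoke when" output. A minimal sketch, with hypothetical frame labels:

```python
def frames_to_segments(labels, frame_sec=0.01):
    """Collapse per-frame speaker labels into (start, end, speaker) segments,
    the 'who spoke when' output of a diarization system."""
    segments = []
    start = 0
    for i in range(1, len(labels) + 1):
        # Close a segment whenever the speaker changes or the stream ends.
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((round(start * frame_sec, 2),
                             round(i * frame_sec, 2), labels[start]))
            start = i
    return segments
```

The number of unique speakers then falls out as the number of distinct labels across the segments.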

ICSI Trainee: Joan Isaac Biel Tres (MSc Internship)
Visiting From: UPC Technical Uni of Catalonia
Period: 1 March 2007 - 31 December 2007
Project Title: Advanced language identification combining sub-word unit counts with prosodic features

Automatic Language Identification (LID) is the task of determining the language being spoken on the basis of a given sample of speech. It is an important technology in many applications, such as spoken language translation, multilingual speech recognition, and spoken document retrieval, where systems must distinguish the spoken language in a very first step. Among the various approaches to LID, the PPRLM system (Parallel Phone Recognizers followed by Language Models) has been shown to be successful. However, its main drawback is that it requires annotated corpora for each of the languages to be detected. The purpose of this project is to build a LID system that follows the latest improvements of this approach. First, the PPR front-end will be substituted by a sub-word unit recognizer trained in an unsupervised fashion using unlabeled data. Second, the LM back-end will be replaced by a Support Vector Machine that takes the relative counts of n-grams of the sub-word units as input features. Subsequently, this system will be combined with another component capturing the prosodic characteristics of each language.
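The SVM back-end described above consumes relative counts of sub-word-unit n-grams. A minimal sketch of that feature extraction (the function name and the toy unit alphabet are illustrative):

```python
from collections import Counter

def relative_ngram_counts(units, n=2):
    """Relative frequency of each n-gram in a decoded sub-word unit
    sequence; these sparse vectors would be the SVM input features."""
    grams = [tuple(units[i:i + n]) for i in range(len(units) - n + 1)]
    total = len(grams)
    counts = Counter(grams)
    return {g: c / total for g, c in counts.items()}
```

In the full system these counts are computed per utterance, for each parallel recognizer, and mapped into a fixed-dimensional vector before SVM training.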

TNO Trainee: Mihaela Bobeica (Ph.D. Internship)
Visiting From: University of Nice Sophia Antipolis
Period: 1 Oct 04 - 1 May 05
Project Title: Development of meeting ontology models for meeting data annotation and navigation

In accordance with previous research relating to ontology design and implementation, as well as to meeting domain modeling (e.g. CoBrA, Compendium, Dialog Mapping), the steps that will be taken toward building the meeting ontologies are as follows: acquisition of the domain knowledge (annotation schemes, annotated data and upper models), conceptual design, fleshing out the designed schema, consistency check and deployment (ontology-based annotation). Different aspects that will be modeled are: (on a lower level) agents, time, devices, meeting items (presentations, agendas etc), individual actions, meeting task; (on a higher level) argumentation structure and decision-making process. Considering the fact that the meeting ontologies will be subsequently applied to different kinds of specialised knowledge, the models will be designed to ensure domain-independence. The ontology building process will focus on modelling parliamentary debates, as a basis for testing the deployment of the meeting models.

TUM Trainee: Mila Boldareva
Visiting From: TNO
Period: December 04 to November 05

AMI video footage is challenging for the retrieval task because of its specific content: the meeting recordings are very different from the data used in, e.g., the TREC video evaluation forum (broadcast news). To complement retrieval of events and objects known through detection and/or annotation, interactive retrieval of video fragments for an unforeseen information need is proposed. Using algorithms for salient point identification [1], retrieval units are indexed as (sets of) interesting regions, which can be matched against areas of interest pointed to by the user in the retrieval session. The answers to the user's query contain the best-matching regions. The corpus is pre-processed to enable fast interaction with the user to refine their information need [2]. References: [1] David G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. of Computer Vision, 60, 2 (2004), pp. 91-110. [2] L. Boldareva, D. Hiemstra. Interactive Content-Based Retrieval Using Pre-computed Object-Object Similarities. In: Int. Conf. on Image and Video Retrieval, LNCS 3115, pp. 308-316.

TNO Trainee: Liudmila Boldareva (Ph.D. Internship)
Visiting From: University of Twente
Period: 1 Dec 04 - 1 Dec 05
Project Title: Interactive retrieval of video fragments based on salient regions

AMI video footage is challenging for the retrieval task because of its specific content: the meeting recordings are very different from the data used in, e.g., the TREC video evaluation forum (broadcast news). To complement retrieval of events and objects known through detection and/or annotation, interactive retrieval of video fragments for an unforeseen information need is proposed. Using algorithms for salient point identification [1], retrieval units are indexed as (sets of) interesting regions, which can be matched against areas of interest pointed to by the user in the retrieval session. The answers to the user's query contain the best-matching regions. The corpus is pre-processed to enable fast interaction with the user to refine their information need [2]. References: [1] David G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. of Computer Vision, 60, 2 (2004), pp. 91-110. [2] L. Boldareva, D. Hiemstra. Interactive Content-Based Retrieval Using Pre-computed Object-Object Similarities. In: Int. Conf. on Image and Video Retrieval, LNCS 3115, pp. 308-316.

IDIAP Trainee: Vincent Bozzo (Masters Internship)
Visiting From: EPFL
Period: 15 September 2008 - 13 March 2009
Project Title: Design of a Flash user interface for browsing and searching audio-visually captured conference talks

A well-built, functional interface is a key feature of any modern software: it is the entry point for users and also acts as a showroom for the program's capabilities. My job will therefore focus on building a portable and usable RIA ("Rich Internet Application") for browsing, playing back, and searching meeting and conference talks. Ensuring portability and easy access is key. The interface will be developed with Adobe Flex Builder and embedded into a web page, in particular on the AMI&AMIDA corpus web portal. It will also include a search tool, in addition to a number of novel Web 2.0 features such as online sharing and commenting capabilities. This project involves not only a new GUI design but also rebuilding and enhancing the current back-end architecture. Throughout the design process, from the preliminary design to the final version, end-users will be involved as much as possible, following a user-centred design in accordance with standard Human-Computer Interaction and Interaction Design methodologies.

UEDIN Trainee: Andreu Cabrero (MSc Internship)
Visiting From: University Polytechnic Catalunya
Period: 1 Feb 06 - 31 July 06
Project Title:




IDIAP Trainee: Barbara Caputo (Postdoctoral Visit)
Visiting From: KTH, Stockholm
Period: 1 Dec 05 - 30 June 06
Project Title:




IDIAP Trainee: Octavian Cheng (Ph.D. Internship)
Visiting From: University of Auckland
Period: 19 Oct 05 - 18 Oct 06
Project Title:




ICSI Trainee: Mathias Creutz (Ph.D. Internship)
Visiting From: Helsinki University of Technology
Period: 1 Nov 05 - 1 April 06
Project Title:




IDIAP Trainee: Leucio Antonio Cutillo (Masters Internship)
Visiting From: Eurecom, Univ Nice & Polytechnic Turin
Period: 1 April 2007 - 14 September 2007
Project Title: Enhancing intelligent presentation acquisition systems: the automatic cameraman


IDIAP is developing new technologies for intelligent, automatic presentation acquisition and broadcasting systems. The system currently comprises three cameras and microphones, as well as an automatic slide capturing and indexing system, all synchronized. The cameras point at the speaker (one close-up view, one large field-of-view) and at the audience. To enhance the system, the aim of the project is to develop and implement a visual servoing algorithm to control a pan/tilt/zoom (PTZ) camera, in order to keep the speaker always visible in the close-up camera view.

IDIAP Trainee: Zacharie Degreve (Masters Internship)
Visiting From: Faculte Polytechnique de Mons (Belgium)
Period: 2 February 2007 - 20 June 2007
Project Title: Keyword Spotting on Word Lattices

The goal of this project is to perform confidence-based keyword spotting (KWS) on word lattices, which are a compact way of storing the most probable hypotheses previously generated by a Large Vocabulary Continuous Speech Recognizer. By doing so, more knowledge (lexical and syntactic) is taken into account than in single-pass, one-best approaches. Classically, the latter techniques yield good results for simple tasks such as small vocal interfaces, where the system has to deal with only a few extraneous words around the keywords, but they are not sufficient when huge databases, such as the AMI meeting databases, need to be searched. In our work, we intend to process the lattices generated for the AMI meeting databases, and we will compare the performance of different KWS rescoring methods, namely those based on keyword posteriors and on likelihood ratios.
References: Wessel F., Schluter R., Macherey K., Ney H. Confidence Measures for Large Vocabulary Speech Recognition. IEEE Transactions on Speech and Audio Processing, Vol. 9, No. 3, March 2001. M. Weintraub. LVCSR Log-Likelihood Ratio Rescoring for Keyword Spotting. In Proceedings of the IEEE Conference on Acoustics, Speech and Signal Processing, vol. 1, pp. 297-300, Detroit, 1995.
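On a toy lattice, the keyword-posterior idea reduces to summing the probability mass of paths containing the keyword and normalising by the total mass of all paths. Real systems use forward-backward recursions over large lattices rather than path enumeration, so this is only an illustrative sketch with a hypothetical lattice format:

```python
def path_enumerate(lattice, node, path, prob, out):
    """Enumerate all complete paths through a toy word lattice.
    lattice: node -> list of (word, next_node, arc_prob); 'end' terminates."""
    if node == "end":
        out.append((path, prob))
        return
    for word, nxt, p in lattice[node]:
        path_enumerate(lattice, nxt, path + [word], prob * p, out)

def keyword_posterior(lattice, keyword):
    """Posterior of a keyword: probability mass of lattice paths containing
    it, normalised by the mass of all paths."""
    paths = []
    path_enumerate(lattice, "start", [], 1.0, paths)
    total = sum(p for _, p in paths)
    hit = sum(p for words, p in paths if keyword in words)
    return hit / total
```

A spot is then accepted when this posterior (or a likelihood ratio against a background model) exceeds a confidence threshold.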

USFD Trainee: Suyog Deshpande (U/G Internship)
Visiting From: Indian Institute of Technology (IIT Guwahati)
Period: 11 May 2007 - 23 July 2007
Project Title: Interface to support multitasking in AMI-AMIDA meetings
I worked on designing interfaces that support multitasking. The main aim of my project was to develop an interface that allows a user to follow a remote meeting while simultaneously carrying out another task. This is achieved by a notification system that provides contextual information along with each notification; using this information, the user can decide whether or not to attend to the meeting. The project also involved a study of the effect of interruption on human behaviour, which helped me categorise alerts by level of importance: users were alerted on different scales depending on importance. A comparative study of the proposed models is an important step in deciding on the better notification system. Digital prototypes of the proposed systems were developed for the experiment, and analysis of the results will help identify the better notification system.

USFD Trainee: Lina Dib (PhD Internship)
Visiting From: Rice University
Period: 1 June 2008 - 30 November 2008
Project Title: Exploring the Significance of Conversation and Sound in Remembering
The AMIDA project at the University of Sheffield engages in research specifically focused on the development of digital tools to help record conversational information. During my stay, I will carry out fieldwork and conduct semi-structured interviews with ordinary people who will be collecting recordings of everyday conversations and sounds they consider significant. Participants will record these sounds in the context of their holidays, a situation in which we believe they will be highly motivated to do so. Our study will seek to understand how and why these recordings are significant. It will also examine how sound as a particular medium relates to memory practices in general.

Interview and ethnographic settings can be very informative, and users' actual practices can articulate important subtleties that have technical ramifications. Interest in users' daily lives and environments, both before and after their interaction with recording tools, can lead to valuable insight. How and when do these records help users remember past conversations and events, and/or construct and share personal narratives? Fieldwork implies the possibility of a more longitudinal study that looks at the staging of conversational and sonic memories, observing the phases involved in remembering (creating, collecting, archiving, editing and viewing) so that memory can be addressed as a series of contextual manipulations, a 'process' or procedure, rather than simply as an amorphous thing or place we access using various tools and cues. How might conversational recording devices affect users, and how might users create new practices around these devices? Designed with the context of memory in mind, these tools might engage users in new and creative ways. Attending to their practices and experiences is part of an iterative design method, key to the creation of usable technologies.

UEDIN Trainee: Ferran Diego (Master Internship)
Visiting From: UPC Technical Uni of Catalonia
Period: 1 Feb 05 - 1 July 05
Project Title:




TUM Trainee: Dennis Doubovski
Visiting From: Twente, Netherlands
Period: Starting around February 1, 2005, for 3 months

The individual actions and gestures during a meeting give strong cues about what is going on in the meeting; this is therefore an important research area in AMI.

First the student will collect data, useful for training various probabilistic classifiers. Then he will implement algorithms for segmentation and recognition of individual gestures. In this context several techniques for recognizing gestures will be evaluated, such as graphical models, HMMs, and other probabilistic methods. Also different feature extraction mechanisms should be tested, like image subtraction, model specific characteristics and other algorithms that are supported by particle filtering and template based methods.

A final evaluation of the different approaches will complete the work on individual gesture recognition.

TUM Training Applicant: Dennis Doubovski
Visiting From: University of Twente
Period: 16th February 05 for 3 months

The detection and tracking of humans is an elementary step in the video processing chain, especially for high-level tasks like gesture or event recognition.

For this reason, a tracking algorithm utilizing a particle filter, known as Condensation, will be implemented. Based on this framework, various features, such as skin colour or the silhouette of humans, will be used for evaluating each hypothesis. In this connection, a suitable technique should be deployed to combine the hypotheses' output for the different features so that both the position and the number of persons visible in each frame of the video can be derived.
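A minimal, one-dimensional sketch of one Condensation predict-weight-resample cycle follows. The Gaussian motion and observation models are simplifying assumptions; the real tracker works in image space and multiplies likelihoods from several features (skin colour, silhouette, ...) into the weight:

```python
import math
import random

def condensation_step(particles, observation, motion_std=1.0, obs_std=2.0,
                      rng=None):
    """One predict-weight-resample cycle of a Condensation-style particle
    filter tracking a 1-D position."""
    rng = rng or random.Random(0)
    # Predict: diffuse each hypothesis with the motion model.
    predicted = [p + rng.gauss(0, motion_std) for p in particles]
    # Weight: likelihood of the observation under each hypothesis.
    weights = [math.exp(-((p - observation) ** 2) / (2 * obs_std ** 2))
               for p in predicted]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Resample: draw new particles proportionally to their weights.
    return rng.choices(predicted, weights=weights, k=len(particles))
```

Iterating this step pulls the particle cloud toward the observed target position while preserving multiple hypotheses.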

UEDIN Trainee: Jochen Ehnes (Postdoctoral Visit)
Visiting From: University of Tokyo
Period: 15 April 2007 - 14 January 2008
Project Title: Projected Interactive Collaborative Meeting Environment (or PIC-ME! for short)


We intend to build a system that can project documents on the table’s surface in front of the meeting participants. The system will actively support ongoing meetings by presenting information of relevance to the current discussion. While a personal computer at each participant’s place may be counterproductive, as the participants would each interact with their computers, a projection on the tabletop should be less intrusive because all participants can see the projected documents. As a result, the participants can discuss these documents in a natural way.
In order to provide a resolution high enough to be able to project text documents, we plan to use multiple projectors, most probably one per user. However the software running on the computers that generate the projected images will enable the participants to easily move documents from their place to that of other participants.
In order to make the interaction with the system as unintrusive as possible, we plan to make the projected objects ‘graspable’. In order to move around the projected objects we will track physical objects associated with the projected ones. We may also investigate possibilities to move the projected documents using only hand gestures.

In addition to the table, we plan to extend the system to incorporate the whiteboard as well. This way, participants will be able to interact with content on the whiteboard directly from their place and move content between their space and the whiteboard easily.

ICSI Trainee: Carl Ek (ICSI Scheme)
Visiting From: University of Manchester
Period: 1 February 2009 - 30 September 2009
Project Title: Gaussian Process Latent Variable Models for Feature Fusion


Using several sources of observations can significantly improve the accuracy of many decision processes. Feature fusion is usually approached in one of two ways: early or late fusion. In early fusion, the observations are first merged into a single representation upon which a decision is made. In late fusion, by contrast, the information is merged at the decision level; typically, each feature has a vote and the majority vote becomes the decision. Both fusion schemes have advantages and disadvantages. In this project we are developing a latent variable model for feature fusion that shares characteristics of both early and late fusion. Our model is generative and capable of making decisions given any subset of the modelled features. With each feature space we associate a likelihood function, which means that we can detect irregularities in the features and remove such instantiations from the decision process.
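The early/late distinction can be made concrete with a toy sketch. The function names and the majority-vote rule are illustrative only; the project's latent variable model is considerably more sophisticated:

```python
def early_fusion(feature_vectors, classify):
    """Early fusion: concatenate the per-modality feature vectors into a
    single representation, then make one decision."""
    merged = [x for vec in feature_vectors for x in vec]
    return classify(merged)

def late_fusion(decisions):
    """Late fusion: each feature stream votes; the majority vote wins."""
    return max(set(decisions), key=decisions.count)
```

A model that can decide from any subset of features, as described above, sits between these two extremes: features are modelled jointly, but a missing or irregular stream can simply be dropped from the decision.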

ICSI Trainee: Robert Eklund (Postdoctoral Visit)
Visiting From: Linkoping University
Period: June 05 to Feb 06
Project Title:
Automatic indexing and browsing of meetings are becoming increasingly important issues. However, speech contains features and phenomena that are not readily found in text alone. One such thing is "involvement", which is easier to detect from prosody alone than from text. Such events have been called "hot-spots", and are possible to find automatically (see Aylett), and also yield a fairly high agreement between human annotators. Automatic content summarization is also an increasingly important field under development and several different methods are being tried out with varying results.

What is less clear, however, is to what extent hot-spots are related to "points-of-interests" from a content perspective, like decisions made, tasks being appointed, deadlines set, goals met/agreed upon etc.

The aim of this study is to find out to what extent hot-spots are correlated with "content-heavy" stretches in the meeting corpus.

While hot-spots have been annotated at ICSI, content labelling and automatic summarization is being carried out at several different sites, using various tools and (tag) taxonomies.

Moreover, there is clearly a many-to-many mapping between the different kinds of content analysis/labeling carried out (on the one side) and the different kinds of hot-spot labels that you find (on the other side), which means that there is the need to settle on a subset of what kind of content-labels are of primary importance before comparing these to the hot-spot labels.

The goals of the research plan are as follows: create an overview of existing labeling schemas, and find out to what extent these are congruent/overlapping and how they could be merged and aligned with hot-spot annotations.

Decide how any potential new annotation will be carried out, both with regard to tag set and what tool(s) to use.

Analysis of the data, most likely by hand but (hopefully) also by automatic means given that existing tools can be used to do this.

Statistical evaluation of correlations, as well as (most likely) other kinds of analyses.

The end result will be presented in the form of a conference paper (if written with co-author(s)), or as an internal report or journal paper.

A tentative time plan is as follows:

New data set/base where hot-spots are merged with content annotations ready early in November

Analysis Period: November and December.

Evaluation: January.

Final report/paper: February (or possibly March).

ICSI Trainee: Frantisek Grezl (Ph.D. Student)
Visiting From: Brno University of Technology, Czech Republic
Period: 1 Nov 2004 - 31 March 2005 (then continuing for 5 months at IDIAP)

State-of-the-art feature extraction is now moving beyond the standard simple cepstrum computation of single speech frames accompanied by their deltas. Emerging techniques involve nonlinear transformations (e.g. via neural nets), phone/state class posterior estimation, and feature combination. In addition, the signal duration for feature computation is expanding from 25 ms up to 500 ms. Front-end processing may incorporate a combination of standard short-term cepstral features plus deltas (timespan <100 ms) together with either TANDEM features (timespan up to 200 ms) or TRAPS-based features (timespan up to 500 ms). The combination of these features can be as simple as concatenation, or can involve more sophisticated methods, e.g. HLDA transforms. Front-ends using simple concatenation of long-term features with standard cepstral features have recently been used with great success in automatic speech recognition (ASR) systems for transcribing conversational telephone speech, achieving relative reductions of up to 10% in word error rate. The aim of this project is to address long-term (TRAPS-based) features in the context of meeting recognition, especially newly proposed techniques for deriving such features (HATS, TMLP), and to explore possibilities for combining short-term and long-term features using more advanced techniques such as HLDA.

UEDIN Trainee: Arlo Faria (U/G Internship)
Visiting From: ICSI
Period: 1 Feb 05 - 1 July 05
Project Title:




ICSI Trainee: Joe Frankel (Postdoctoral Visit)
Visiting From: CSTR Edinburgh
Period: 1 Sept 05 - 1 April 06
Project Title: Transfer learning for tandem ASR feature extraction

Automatic speech recognition (ASR) systems typically employ spectral-based acoustic features such as Mel-frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP) cepstra. In the tandem approach developed at OGI, IDIAP, and ICSI, such acoustic observations are augmented to include features derived using non-linear mappings operating on long (up to 500ms) windows onto the acoustic signal. The non-linear mappings take the form of multi-layer perceptrons (MLP) which are trained to estimate phone posteriors. To calculate the final MLP-based features, the posteriors are subject to a log transform and dimensionality reduction using a Karhunen-Loeve transform (KLT). This project aims to improve MLP-based features for use in tandem ASR systems through the application of multi-task learning (MTL), which seeks to improve accuracy on a primary task (here phone posterior estimation) by using the training patterns of a set of related secondary tasks as an inductive bias. Possible secondary tasks for investigation include estimating posteriors for gender, speaker/speaker cluster, phone at preceding/following frames, segment start/end and multi-level articulatory features. Another secondary task which is of particular relevance to the meeting-room environment is speech enhancement, in which the MLP is trained to provide a mapping from far-field to near-field channels in addition to estimation of phone posteriors.
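The tandem pipeline described above (a log transform of the MLP phone posteriors followed by a Karhunen-Loeve transform) can be sketched for the two-dimensional case, where the KLT eigen-solve has a closed form. The dimensionality and function name are illustrative assumptions:

```python
import math

def tandem_features(posteriors):
    """Turn 2-dim MLP posterior vectors into tandem features: log
    transform, then a Karhunen-Loeve (PCA) rotation estimated from the
    data. Restricted to 2 dimensions so the eigen-solve is closed-form."""
    logs = [[math.log(p + 1e-10) for p in row] for row in posteriors]
    n = len(logs)
    mean = [sum(r[d] for r in logs) / n for d in (0, 1)]
    centered = [[r[0] - mean[0], r[1] - mean[1]] for r in logs]
    # Covariance of the centered log-posteriors.
    a = sum(r[0] * r[0] for r in centered) / n
    b = sum(r[0] * r[1] for r in centered) / n
    c = sum(r[1] * r[1] for r in centered) / n
    # Closed-form eigenvector angle of the symmetric 2x2 covariance.
    t = 0.5 * math.atan2(2 * b, a - c)
    rot = [[math.cos(t), math.sin(t)], [-math.sin(t), math.cos(t)]]
    return [[rot[0][0] * r[0] + rot[0][1] * r[1],
             rot[1][0] * r[0] + rot[1][1] * r[1]] for r in centered]
```

In a real system the posteriors have one dimension per phone and the KLT also truncates dimensions; the rotated features are then appended to the cepstral stream.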

IDIAP Trainee: Weina Ge (PhD Internship)
Visiting From: The Pennsylvania State University
Period: 1 June 2007 - 31 August 2007
Project Title: Shape alphabets for visual scene analysis

This project aims at exploring the shape aspect of visual features, an important perceptual cue that has not yet been well exploited in visual vocabulary (bag-of-visterms) approaches for visual scene analysis. As a starting point, we will study the most recent literature on shape-based object recognition/detection, and implement and analyze some of the state-of-the-art techniques (e.g. [1, 2]). Experiments will be carried out to assess the repeatability of the shape features under various changes. We plan to extend the visual vocabulary with these shape alphabets and evaluate the performance on larger-scale datasets.
[1] V. Ferrari, L. Fevrier, F. Jurie, and C. Schmid, "Groups of Adjacent Contour Segments for Object Detection", PAMI 2007 (to appear).
[2] S. Belongie, J. Malik, and J. Puzicha, "Shape Matching and Object Recognition Using Shape Contexts", PAMI 2002.

UEDIN Trainee: Sebastian Germesin (PhD Internship)
Visiting From: DFKI
Period: 1 February 2009 - 30 April 2009
Project Title: Automatic Detection of Agreement/Disagreement in Multi-Party Interaction

The automatic derivation of summaries is one major aspect of the AMIDA project's goal in developing technology to support human interaction in meetings. Different types of automatically collected information are needed to support this process. One type of information of particular importance to meeting summaries is information about when someone agreed or disagreed to a (previous) statement of someone else.

Focussing on this specific type of information, the main goal of my work during the AMIDA training program is the development of an automatic system for detecting the aforementioned agreements and disagreements. For that, I use the already existing subjectivity annotations of the project's corpora, which were defined and annotated at the University of Edinburgh by Theresa Wilson and include annotations of agreements and disagreements. A thorough investigation of different supervised machine learning techniques, using the WEKA machine learning toolkit, and a detailed inspection of different types of features will be carried out.

Furthermore, a comparison of the results with previous work on, e.g., the ICSI corpus is intended. As time permits, I will devote the last part of my work to visualising the results to support existing AMIDA applications.
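For intuition, here is a minimal stand-in for the kind of supervised classifier that would be trained for this task. The actual work uses WEKA and the project's subjectivity annotations; the toy Naive Bayes, cue phrases and labels below are purely illustrative.

```python
from collections import Counter, defaultdict
import math

# Toy training data standing in for subjectivity-annotated dialogue acts.
train = [
    ("yeah i agree with that", "agree"),
    ("right exactly that works", "agree"),
    ("no i don't think so", "disagree"),
    ("i disagree that won't work", "disagree"),
]

def train_nb(data):
    """Count class priors and per-class word frequencies."""
    class_counts = Counter(label for _, label in data)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in data:
        for w in text.split():
            word_counts[label][w] += 1
            vocab.add(w)
    return class_counts, word_counts, vocab

def classify(text, class_counts, word_counts, vocab):
    """Naive Bayes with add-one smoothing over the shared vocabulary."""
    total = sum(class_counts.values())
    best, best_lp = None, -math.inf
    for label, c in class_counts.items():
        lp = math.log(c / total)
        n = sum(word_counts[label].values())
        for w in text.split():
            lp += math.log((word_counts[label][w] + 1) / (n + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

model = train_nb(train)
print(classify("yeah exactly i agree", *model))   # → agree
```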

IDIAP Trainee: Guillaume Heusch
Visiting From: EPFL, CH
Period: Started September 1, 2004, for 6 months

Face image processing is an important research area in AMI (detection, tracking and recognition), especially in the context of meeting room data analysis. Lighting is a significant factor affecting the appearance of faces. The goal of this project is to study and to implement some state-of-the-art face image lighting normalization techniques [1,2]. As a first step, the student will study the effect of lighting changes on the face recognition algorithms developed at IDIAP [3]. Then, he will study and implement the above image normalization techniques. Finally, an experimental comparison of selected techniques will be performed on a specific face recognition task.
References:
[1] A. Georghiades, D. Kriegman, and P. Belhumeur, "From Few to Many: Generative Models for Recognition Under Variable Pose and Illumination", IEEE PAMI, 2001.
[2] R. Gross and V. Brajovic, "An Image Preprocessing Algorithm for Illumination Invariant Face Recognition", 4th International Conference on Audio- and Video-Based Biometric Person Authentication (AVBPA), 2003.
[3] F. Cardinaux, C. Sanderson, and S. Marcel, "Comparison of MLP and GMM Classifiers for Face Verification on XM2VTS", 4th International Conference on Audio- and Video-Based Biometric Person Authentication (AVBPA), Guildford, UK, 2003, pp. 91.

UEDIN Trainee: Ivan Himawan (PhD Internship)
Visiting From: Queensland University of Technology
Period: 23 April 2007 - 22 April 2008
Project Title: Towards Speech Recognition for Ad-hoc Microphone Arrays

With advances in sensor and sensor network technology, multimedia-capable devices and ad-hoc computing networks are becoming ubiquitous. In this new context, there is potential for applications that employ ad-hoc networks of microphone-equipped devices collaboratively as a virtual microphone array. This dynamic deployment of microphones presents a number of challenges, such as maintaining synchronisation between channels, determining the positions of microphones and speakers, and reducing deviations in microphone responses.

In this project, strategies for dealing with uncertainty in microphone array beamforming will first be studied using traditional approaches such as array calibration, together with their effects on speech recognition accuracy. Departing from this traditional approach, practical algorithms will then be developed for dealing with less constrained networks.

ICSI Trainee: Beatriz Trueba Hornero (MSc Internship)
Visiting From: UPC Technical Uni of Catalonia
Period: 1 March 2007 - 31 December 2007
Project Title: Overlap Speech Detection in Meetings

More and more speech processing is being done on meetings rather than single-speaker sources. For practical reasons, scientists seek to record and process meetings using a single recording source. This has turned out to be a major new challenge for speech processing, comprising not only most of the previous problems but also further specific ones that arise in the new scenario with multiple interactive participants. The issue to be investigated in my research at ICSI concerns a major meeting-specific problem: overlapping speech regions (also known as co-channel speech). Current speech recognition and speaker diarization systems either ignore hand-labeled overlapped regions or do not handle them at all and perform badly on them. In the next six months, I will study two major topics:
- Strategies for detecting overlapping speech: the characteristics of co-channel speech will be studied in depth from different points of view, including a novel prosodic approach, in order to identify a set of discriminative features able to define a model for overlapping speech. This work will be performed together with Kofi Boake, a PhD student at ICSI.
- Strategies for dealing with overlapping speech: questions like the following will be answered: given a perfect overlapping speech detector, what is a good strategy for a speaker diarization system to deal with it? How much improvement can be expected given an overlap detector with a certain accuracy?

ICSI Trainee: Marijn Huijbregts (PhD Internship)
Visiting From: University of Twente
Period: 1 October 2006 - 30 April 2007
Project Title: The Blame Game: Performance Analysis of Speaker Diarization System Components

The goal of speaker diarization is to automatically segment an audio recording into speaker-homogeneous regions. Although the identity of each speaker is not known and even the number of speakers is unknown, a diarization system should be able to anonymously label each speaker in the recording and answer the question: ‘Who spoke when?’. The International Computer Science Institute (ICSI) has successfully participated in the speaker diarization task of the NIST Rich Transcription benchmark evaluations with a system based on a Hidden Markov Model architecture and Gaussian Mixture Models that are trained using only the speech in the data under evaluation.

In this project we performed an analysis of this speaker diarization system. The analysis, based on a series of oracle experiments, provides a good understanding of the performance of each system component on a test set of twelve conference meetings used in previous NIST benchmarks. Our analysis shows that the Speech Activity Detection (SAD) component contributes most to the total diarization error rate (23%). The lack of ability to model overlapping speech is also a large source of errors (22%), followed by the component that creates the initial system models (15%).

In order to improve SAD for use in the diarization system, we have implemented another SAD component that is similar to the component used in last year's NIST evaluation (RT06s). The two main differences are that this system creates its initial segmentation based on bootstrap models (instead of energy) and that it allows for small silence segments (of less than 300 ms). On our development set this improved the Diarization Error Rate (DER) by two percent absolute: one percent from the SAD improvement and another percent from better performance of the diarization system itself.
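The error accounting described above can be illustrated with a toy frame-level scorer. This is a simplification written for illustration only: the official NIST scoring tool works on time regions, finds an optimal mapping between anonymous hypothesis labels and reference speakers, and scores overlapped speech.

```python
def frame_der(reference, hypothesis, nonspeech="-"):
    """Toy frame-level diarization error: (missed speech + false-alarm
    speech + speaker confusion) / scored speech frames.  Assumes one
    label per frame and an already-resolved speaker-name mapping."""
    assert len(reference) == len(hypothesis)
    missed = sum(1 for r, h in zip(reference, hypothesis)
                 if r != nonspeech and h == nonspeech)
    false_alarm = sum(1 for r, h in zip(reference, hypothesis)
                      if r == nonspeech and h != nonspeech)
    confusion = sum(1 for r, h in zip(reference, hypothesis)
                    if r != nonspeech and h != nonspeech and r != h)
    scored = sum(1 for r in reference if r != nonspeech)
    return (missed + false_alarm + confusion) / scored

# seven frames: one confusion, one false alarm, one miss -> 3/6 = 0.5
ref = list("AAAB-BB")
hyp = list("AABBBB-")
print(frame_der(ref, hyp))  # → 0.5
```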

EPFL Trainee: Cuong Huy To (Ph.D. Internship)
Visiting From: IDIAP
Period: 1 March 06 - 31 Dec 06
Project Title:




University Sheffield Trainee: Martin Karafiat (Ph.D Internship)
Visiting From: Brno
Period: 28 June 04 - 30 September 05
Project Title:




ICSI Trainee: Thomas Kleinbauer (ICSI Scheme)
Visiting From: DFKI
Period: 1 June 2007 - 28 February 2008
Project Title: Applying FrameNet to meeting discourse data

The FrameNet project aims at creating a lexicon of conceptual structures of meaning, called "frames", based on Fillmore's theory of frame semantics. In a second part of the project, documents are annotated with FrameNet frames to demonstrate how certain words in a sentence "evoke" specific frames. So far, these annotations have been performed solely on written texts. The AMIDA project, on the other hand, is concerned with spontaneous interactions of people, specifically within meetings. A corpus of about 100 hours of meetings has been recorded and fully transcribed.
In the nine months of my training program, I will study if and how the dialogs of real-life interactions in meetings can be analyzed on a semantic level using FrameNet.
Given the different nature of spoken language discourse when compared to written documents, this task poses a number of interesting challenges. For instance, people typically speak less grammatically, introducing speech disfluencies, such as, false starts, repeats, corrections. Furthermore, wordings in spontaneous speech typically differ from written language. In addition, in a computational setting, an automatic system will have to face errors introduced by an imperfect speech recognition system. It is certainly interesting to study how these effects are reflected in FrameNet semantics.
My work will provide first insights into how ready FrameNet is for application to speech data, with its attendant difficulties. In order to accomplish this task, I plan to define (in collaboration with the FrameNet staff) new frames to be added to the 88+ frames currently implemented in FrameNet whenever new semantic concepts are encountered during my analysis.
In the bigger picture, my work will be beneficial for possible subsequent processing steps, such as automatic abstractive summarization or machine translation. Another potential application of my work would be semantic feedback to ASR systems, with the potential to improve language models and recognition rates.

ICSI Trainee: Jachym Kolar (Postdoctoral Visit)
Visiting From: University of West Bohemia in Pilsen
Period: 1 Oct 05 - 1 April 06
Project Title: Utilizing prosody for automatic sentence and dialog act segmentation of meeting data

During his visit at ICSI, Jachym Kolar will focus on improving methods for segmenting speech into utterance units (sentences, dialog acts) and classifying them according to their function within the discourse (statement, question, disruption, backchannel, etc.), mainly on meeting speech data. This segmentation and classification is crucial for applying downstream higher-level Natural Language Processing (NLP) methods (information retrieval, speech-to-speech translation, etc.), since these methods typically require formatted input. As previous work on this task on meeting data usually employed only features based on pause duration and recognized words, he will explore the contribution of various prosodic features (pitch, duration, energy, voice quality) and shallow syntactic features (parts of speech, speech chunks) to system performance. Several different modeling techniques will also be explored.
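As a toy illustration of the simplest such cue, the sketch below marks candidate boundaries from pause durations alone. The word timings, threshold and names are invented; the actual work combines many prosodic and lexical features in a statistical model.

```python
def pause_boundaries(words, threshold=0.3):
    """Mark a candidate sentence/dialog-act boundary after each word whose
    following pause exceeds `threshold` seconds.  A stand-in for the most
    basic prosodic feature (pause duration at a word boundary)."""
    boundaries = []
    for (w, start, end), (_, nxt_start, _) in zip(words, words[1:]):
        pause = nxt_start - end
        boundaries.append((w, pause, pause >= threshold))
    return boundaries

# (word, start_time, end_time) triples, as from a forced alignment
words = [("okay", 0.0, 0.3), ("let's", 0.9, 1.1), ("start", 1.15, 1.5),
         ("now", 1.55, 1.8)]
for w, pause, boundary in pause_boundaries(words):
    print(w, round(pause, 2), boundary)   # long pause only after "okay"
```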

TNO Trainee: Matej Konecny (MSc Internship)
Visiting From: BRNO
Period: 1 October 2006 - 31 March 2007
Project Title:




IDIAP Trainee: Kenichi Kumatani (PhD Internship)
Visiting From: University of Karlsruhe
Period: 1 July 2007 - 30 June 2008
Project Title: Microphone array processing for far-field speech recognition


The recognition of speech in meetings poses a number of challenges to current automatic speech recognition (ASR) techniques. In such a situation, microphone array processing has the potential to relieve users of the necessity of donning close-talking microphones (CTMs) before interacting with ASR systems. Adaptive beamforming is a promising approach because it can maintain a distortionless constraint on the speech signal in the look direction. Specifically, in this project, I address three subjects:

1) the filter bank design method for subband beamforming,

2) the beamforming algorithm with the maximum negentropy criterion for the scenario where a single speaker is stationary, and

3) the speech separation algorithm for overlapping speech.

I demonstrate the effectiveness of each technique through a set of automatic speech recognition experiments on the multi-channel data collected by the European Union integrated project Augmented Multi-party Interaction (AMI).
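For intuition about the simplest fixed beamformer that the adaptive methods above improve on, here is a toy delay-and-sum sketch. The signals, delays and function names are invented for illustration; the project itself uses subband adaptive beamforming with a negentropy criterion.

```python
import numpy as np

def delay_and_sum(channels, delays, fs):
    """Align each channel by its steering delay (rounded to whole samples)
    and average.  A minimal time-domain sketch; subband beamforming applies
    per-frequency-band complex weights instead."""
    n = min(len(c) for c in channels)
    out = np.zeros(n)
    for sig, d in zip(channels, delays):
        shift = int(round(d * fs))       # steering delay in samples
        out += np.roll(sig[:n], -shift)  # np.roll wraps; fine for a demo tone
    return out / len(channels)

fs = 16000
t = np.arange(0, 0.01, 1 / fs)
clean = np.sin(2 * np.pi * 440 * t)
ch1 = clean
ch2 = np.roll(clean, 8)                  # second mic receives it 8 samples later
out = delay_and_sum([ch1, ch2], [0.0, 8 / fs], fs)
print(np.allclose(out, clean))           # → True
```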


University Sheffield Trainee: Jean-Christophe Lacroix (Master Internship)
Visiting From: ENST
Period: 1 July 04 - 1 Nov 04
Project Title:



IDIAP Trainee: Quoc Anh Le (Masters Internship)
Visiting From: University of Namur
Period: 1 August 2008 - 31 January 2009
Project Title: Automatic true-false question answering in meetings

Archives of meeting recordings are useful for checking or retrieving information from past meetings. However, such archives are usable only if appropriate tools, generally named meeting browsers, are available to locate the information that is searched for. One approach to meeting browsing is to design general-purpose browsers that help human users locate relevant information; another possibility is to design browsers that locate information automatically, for instance for verification (fact checking) purposes.

The goal of this internship project is to design such an automatic browser, and to assess its performance on a set of pairs of true-false statements which were initially used to evaluate human-directed browsers. In other words, the goal is to design and implement a system that determines the true and the false statement in each pair, to evaluate its performance over a set of several hundred such pairs (for 4 recorded meetings), and to compare it with human subjects using existing meeting browsers. A comparative analysis of the system and the human scores on specific questions should indicate whether or not the system and humans have the same difficulties answering such questions. This work will thus provide a baseline score against which human-directed browsers must be compared, to demonstrate significant improvement over a fully automated tool.
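A minimal sketch of the kind of word-overlap scoring such an automatic browser might start from. This is entirely my illustration, not the project's actual method; a real system would use passage retrieval over the transcript and more robust matching.

```python
import re
from collections import Counter

def overlap_score(statement, transcript_words):
    """Score a statement by how well its words are supported by the
    meeting transcript (simple normalized word overlap)."""
    words = re.findall(r"[a-z']+", statement.lower())
    return sum(transcript_words[w] for w in words) / max(len(words), 1)

def pick_true(pair, transcript):
    """Given a pair of statements of which exactly one is true,
    return the better-supported one."""
    counts = Counter(re.findall(r"[a-z']+", transcript.lower()))
    a, b = pair
    return a if overlap_score(a, counts) >= overlap_score(b, counts) else b

transcript = "we decided the remote control case will be yellow plastic"
pair = ("the case will be yellow", "the case will be red")
print(pick_true(pair, transcript))   # → the case will be yellow
```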

IDIAP Trainee: Lukas Matena (Masters Internship)
Visiting From: BRNO
Period: 16 August 2007 - 30 July 2008
Project Title: Mobile interface for remote participation in augmented meeting

Advanced tools for automatic detection of conversational events, social interactions, etc. were developed in the AMI project. The forthcoming AMIDA project focuses on real-time access to ongoing meetings, enabling remote participants to get involved in the discussion. A mobile interface is therefore needed to support the remote participant's interaction even with limited bandwidth (e.g. a GSM/UMTS network). The emphasis will be put on the efficiency of using such an interface. The goal is to find out and show in which ways the remote participant can be supported by his (mobile) device.
The first step is an application showing the current situation in the meeting room (a graphical representation of look direction, speaker identification, and the text transcript). First tests with real users should show whether the decision to use a graphical representation instead of a real image meets all users' requirements, and which additional features would be useful. The next step is a 3D interface for a mobile phone connected to the data communication centre - "the Hub".

IDIAP Trainee: Hari Krishna Maganti
Visiting From: Ulm University, Germany
Period: Started October 1, 2004, for one year

Speaker segmentation and tracking are crucial to the AMI project and to many applications, such as speech acquisition and recognition in meeting rooms. In the context of meeting room conversations, the speech stream is continuous and there is no information about the location of boundaries between speakers (the "speaker segmentation" problem) or about which portions of the speech belong to which speaker (the "speaker tracking" problem). The goal of this project is to find solutions to "who (which speaker) spoke when (at what time), where (location), and what (transcription)" using source localization and acoustic information.

ICSI Trainee: Rosa Martinez (Masters candidate)
Visiting From: Polytechnical University of Catalonia (UPC), Barcelona
Period: started March 15, 2005, for 6 months

Dividing an input audio stream into acoustically homogeneous segments according to speaker identity -- so-called "speaker diarization" -- is an important process for automatic speech recognition systems, especially in the context of meetings, characterized by multiple (overlapping) speakers and noisy environments. The aim of this project is to incorporate techniques commonly used in state-of-the-art speaker ID systems -- specifically, construction of cluster models via adaptation from a universal background model, and feature warping -- in order to improve diarization performance. The effectiveness of such techniques, as demonstrated by LIMSI in NIST's recent RT-04 evaluation for Broadcast News diarization, has motivated us to implement them and assess their performance in the meetings context. The new module will be inserted after the standard clustering is performed in order to reach the optimum number of clusters more accurately. After a partial clustering using standard methods, new cluster models will be obtained using speaker adaptation of background models from the data available for each cluster. Standard merging techniques will be used to reach the final clustering. In addition, since the distribution of features can be strongly affected by noise, reverberation, and the motion of speakers with respect to the microphones, a more robust feature representation can be obtained by using feature warping in conjunction with the speaker ID module.
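Feature warping, mentioned above, can be sketched as mapping each feature's empirical distribution onto a standard normal (rank, then empirical CDF, then inverse Gaussian CDF). The toy global version below is my illustration; the actual technique is applied per feature dimension over a sliding window.

```python
import numpy as np
from statistics import NormalDist

def feature_warp(x):
    """Warp a 1-D feature sequence so its empirical distribution matches a
    standard normal.  The warped values depend only on ranks, which is what
    makes the representation robust to channel and noise distortions."""
    n = len(x)
    ranks = np.argsort(np.argsort(x))    # rank of each value, 0..n-1
    cdf = (ranks + 0.5) / n              # keep strictly inside (0, 1)
    inv = NormalDist().inv_cdf
    return np.array([inv(p) for p in cdf])

x = np.array([0.1, 5.0, 2.0, 100.0, 3.0])   # arbitrary, outlier-heavy features
print(np.round(feature_warp(x), 2))         # median maps to 0.0
```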

TUM Trainee: Jozef Milch (PhD Internship)
Visiting From: BRNO
Period: 2 February 2009 - 30 June 2009
Project Title: Feature Point based object tracking

Understanding video content is very important for various applications, e.g. meeting analysis. State-of-the-art approaches often try to spot objects directly in videos, track them in time, and detect events in their trajectories. This work focuses on object detection and tracking based on corner points. More precisely, the proposed approach deals with interest-point-based background modelling and foreground segmentation. The foreground objects should be modelled as graphs of salient points which accurately represent the object. Each node is denoted by its position and a descriptor based on additional features, such as gradients, colour histograms, etc., while the graph's edges are represented by node distances and directions. The detected object graph can subsequently be tracked by applying graph similarity measures. More robustness can be gained by adding some elasticity into the graph structure, which is important for non-rigid objects. Finally, the algorithm should be extended to 3D space and applied to several synchronized cameras.

IDIAP Trainee: Fernando Fernandez Martinez (Ph.D. Internship)
Visiting From: Universidad Politecnica de Madrid
Period: 1 June 2006 - 31 August 2006
Project Title: Addressee Identification in AMI Multi-party Meetings

Automatic analysis of recorded meetings has become an emerging research domain focused on the different aspects of interactions among meeting participants. Addressing is an aspect of every form of communication. It represents a form of orientation and directionality of the act the current actor performs toward the particular other(s) who are involved in an interaction. In conversational communication involving two participants, the hearer is always the addressee of the speech act that the speaker performs; however, addressing becomes a real issue in multi-party conversation.

Addressing could provide useful information about meetings in such a way that questions like ‘Who was asked to prepare a presentation for the next meeting?’ or ‘Were there any arguments between participants A and B?’ could be answered. In addition, the identification of the addressees of the dialogue acts is also important for inferring the dialogue structure in those multi-party dialogues.

Addressing information could be helpful as well to those who develop communicative agents in interactive intelligent environments and remote meeting assistants. These agents need to recognize when they are being addressed and how they should address people in the environment. My work will be focused on addressee identification in face-to-face meetings from the AMI data collection restricting the analysis to small group meetings (four participants).

The main goals will be (1) to find relevant features for addressee classification in meeting conversations using information obtained from multi-modal resources (gaze, speech, conversational and meeting context), and (2) to explore to what extent the performance of the classifiers can be improved by combining different types of features obtained from these resources. To that effect, a static Bayesian Network (BN) approach will be studied for the addressee classification task. However, as the contextual feature set includes information about the addressee of the immediately preceding dialogue act, we will also explore how well the addressee of the current dialogue act can be identified using the predicted, instead of the gold standard, value of the addressee of the previous dialogue act. For that purpose, Dynamic Bayesian Network (DBN) classifiers will be employed.

ICSI Trainee: Xavier Anguera Miro (Ph.D. student)
Visiting From: Polytechnical University of Catalonia (UPC), Barcelona
Period: Started September 1, 2004, for one year

One of the main challenges for automatic processing of speech data collected from tabletop microphones in meeting rooms is the mix of multiple speakers within the same audio stream. Complicating the problem, the meeting room speech stream is affected by channel distortions and speaker overlaps. The speaker clustering task tries to determine where each speaker in the meeting starts and stops speaking, clustering all the segments generated by the same speaker into a single group. Such speaker clustering is essential for an array of meeting processing tasks, including mark-up of transcripts with speaker labels as well as improved speech recognition through speaker-specific adaptations. The goal of this project is to determine who spoke when, given the recordings by far-field microphones from the meeting rooms, including identification of regions of speaker overlap. In addition to techniques currently employed for speaker segmentation and clustering in more established domains such as broadcast news, we will explore approaches designed specifically for the meeting room data, such as preprocessing of the signal for speech enhancement and improved separation by estimating time delays between microphones using delay-sum techniques.
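The time-delay estimation mentioned above is commonly done with a GCC-PHAT weighted cross-correlation between a microphone pair; here is a minimal sketch (toy signals and function names are mine, not the project's code).

```python
import numpy as np

def gcc_phat_delay(sig, ref):
    """Estimate the delay (in samples) of `sig` relative to `ref` using
    GCC-PHAT: whiten the cross-spectrum so only phase (i.e. timing)
    information remains, then pick the cross-correlation peak."""
    n = len(sig) + len(ref)
    X = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    X /= np.maximum(np.abs(X), 1e-12)   # phase transform: keep phase only
    cc = np.fft.irfft(X, n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return int(np.argmax(cc)) - max_shift

rng = np.random.default_rng(0)
ref = rng.standard_normal(1024)
sig = np.concatenate((np.zeros(5), ref))[:1024]  # copy of ref, 5 samples late
print(gcc_phat_delay(sig, ref))  # → 5
```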

University Sheffield Trainee: Binit Mohanty (U/G Internship)
Visiting From: Indian Institute of Technology, Kanpur
Period: 1 May 05 - 1 July 05
Project Title:




ICSI Trainee: Korbinian Riedhammer (ICSI Scheme)
Visiting From: University of Erlangen-Nuremberg
Period: 2 January 2008 - 30 June 2008
Project Title: Generating Automatic Summaries

Thanks to new technology, more and more text and speech data is acquired and stored, for example from newspapers, broadcast news or meetings. However, this information can only be used successfully if there is an efficient way to search and access it. A good way to start is to generate automatic summaries that produce a condensed version of the data.

In general, two tasks can be distinguished: text and speech summarization. Although the objective is the same, the data basis is very different: written or broadcast news usually features well-formed sentences and a continuous story line. In contrast, speech data obtained from meetings is much more stereotyped in terms of sentence structure, word choice, and dialog interaction (e.g. acknowledgments, questions and responses).

In my work at ICSI, I focus on the latter problem using two well-annotated data sets, the AMI and ICSI meeting corpora, which together comprise about 200 recordings that were fully transcribed and summarized by human labelers. In detail, the task is to generate so-called extractive summaries by selecting important or representative parts of the meeting. In previous work, G. Murray [1,2,3] successfully applied techniques like MMR and TF-IDF. I work on extensions of these approaches by integrating different sources of information such as dialog act tags, prosodic attributes, and ASR confidence scores. I also put emphasis on treating extractive summarization as a binary classification task, i.e. determining whether or not a sentence is important. This makes it possible to embed various kinds of features and their context into well-studied machine learning environments.
[1] G. Murray, S. Renals, J. Carletta. Extractive Summarization of Meeting Recordings, Proc. Eurospeech 2005, Lisboa, Portugal
[2] G. Murray, S. Renals, J. Carletta, J. Moore. Incorporating Speaker and Discourse Features into Speech Summarization. In Proc. Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pp. 367-374.
[3] G. Murray, S. Renals. Towards Online Speech Summarization. In Proc. Interspeech 2007, Antwerp, Belgium, pp. 2785-2788.
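The TF-IDF sentence-ranking baseline referred to above can be sketched as follows. This is a toy version written for illustration (MMR additionally penalizes redundancy with sentences already selected; the actual work adds dialog act, prosodic and confidence features on top).

```python
import math
import re
from collections import Counter

def tfidf_summary(sentences, k=1):
    """Rank sentences by the summed TF-IDF weight of their words and keep
    the top k as an extractive summary."""
    docs = [re.findall(r"[a-z']+", s.lower()) for s in sentences]
    n = len(docs)
    # document frequency: in how many sentences does each word occur
    df = Counter(w for d in docs for w in set(d))

    def score(doc):
        tf = Counter(doc)
        return sum(c * math.log(n / df[w]) for w, c in tf.items())

    ranked = sorted(zip(sentences, docs), key=lambda sd: score(sd[1]),
                    reverse=True)
    return [s for s, _ in ranked[:k]]

sentences = ["um okay", "the battery design uses solar power", "okay right"]
print(tfidf_summary(sentences, k=1))   # the contentful sentence wins
```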

UEDIN Trainee: Javier Tejedor (PhD Internship)
Visiting From: University Autonoma of Madrid
Period: 15 February 2008 - 31 July 2008
Project Title: Vocabulary-Independent Spoken Term Detection over meetings data


The task of Spoken Term Detection (STD), defined by NIST in 2006, deals with the search for words and phrases in spoken audio archives. The nature of the STD task means that there is a bias toward terms which do not normally occur in the vocabularies of Large Vocabulary Continuous Speech Recognition (LVCSR) systems, such as proper names and entity names. Furthermore, the search terms are unknown a priori to the speech processing modules, and a two-stage process is employed in which the audio is first indexed (as a lattice) and then a search for terms is performed on the lattice. Therefore, vocabulary independence is a key component in STD research.

This project will develop methods for vocabulary-independent spoken term detection from meetings data, and in doing so we will extend the scope of the AMIDA meetings browser. We will develop techniques for retrieving terms which are either out-of-vocabulary or incorrectly recognized by the recognizer, meaning that a simple search through the LVCSR transcript would not provide the desired results. These techniques will also be applied to the enrollment of OOV words, and to the automatic retrieval of relevant extractive summaries, which will allow audio-based search for meeting elements such as agreements, commitments and so on. The English meetings domain of the AMI corpus will be used to evaluate all of these techniques.

UT Trainee: Anh Nguyen (MSc Internship)
Visiting From: Vietnam National University
Period: 1 April 06 - 31 Sep 06
Project Title:




BRNO Trainee: Gaurav Pandey
Visiting From: Indian Institute of Information Technology
Period: 15 March 05 - 30 June 05
Project Title: Keyword spotting on continuous speech data using semantic categories

The work was aimed at enhancing acoustic (HMM-based) keyword spotting (KWS) by introducing an LVCSR loop with semantic categories into the background model. The first experiments aimed at learning and reproducing the results obtained with the acoustic keyword spotter (the TRAP-NN-LCRC40hPostTrans system) on ICSI data. In these baseline experiments, keyword spotting is done by traversing the data in the normal forward direction (taking the left context into account). The following experiments aimed at detecting keywords by traversing the speech data also in the reverse direction (right context). The accuracy in terms of figure-of-merit (FOM) was evaluated for all systems, and we tried to combine the two directions. Later on, the work was extended to include semantic categories: first by introducing the most common words into the free phoneme loop, to add a little context to the process of keyword detection. Based on satisfactory results, the work was extended to full semantic categories. The keywords were treated according to their part-of-speech categories (nouns, verbs, adjectives and adverbs) and were detected using bi-gram networks created for the different categories. In the experiments, this system outperformed the baseline acoustic KWS; the improvement was especially important for the 'noun' category, where the FOM increased from 73.0% to 77.5%.

Trainee: Vikas Panwar
Visiting From: Twente
Period: 8 May 2005 - 22 July 2005

Interaction between participants and the objects displayed on a table is an elementary aspect of a meeting. In a virtual meeting environment, it becomes necessary that any such interaction is designed to look natural, and the person who wants to discuss an object should be able to present arguments clearly. One of the future developments foreseen for this virtual meeting room is the possibility to discuss virtual objects that have been visualized in the virtual meeting environment. During design meetings, these discussions include making changes, or suggestions for changes, to these virtual objects. Participants should be able to provide these suggestions by examining and modifying the object of attention. For such interaction, there is a need for a suitable interface providing the functions or actions needed for examining or modifying the object by changing its physical properties such as size, color, position, orientation, etc. Such an interface could be a combination of different modalities, for example speech and graphical user interfaces. Also, there should be an effective turn-taking process for making updates in a virtual environment, especially to solve the problems associated with making updates and changes in the environment, e.g. in a design meeting, when several participants want to make changes to objects at the same time.

Trainee: Jan Peciva
Visiting From: Brno University of Technology
Period: 1 April - 23 November 2005

The virtual meeting room project at the University of Twente is focused on the creation of a dynamic 3D representation of a meeting. This virtual meeting environment is meant to validate models of face-to-face verbal and nonverbal meeting behavior, but it also allows experimenting with remote participation by one or more meeting participants. In the context of the AMI design meetings, it has become interesting to look at the meeting environment as an environment for collaborative work. In this traineeship, the problems associated with maintaining consistency in the environment for the various collaborating participants will be the topic of concern. In the collaborative virtual meeting room that connects distributed meeting participants, the participants can see the other participants represented by their avatars. They can also see the head movements of other participants and whom they are looking at, their hand movements, and possibly all other information that will be present in a virtual meeting room. The collaborative environment should be realized through the development of methods for data sharing in a time-sensitive manner, optimizing them for different network conditions, e.g. long latency or low bandwidth, and through the implementation and testing of real-time interaction between users of the collaborative virtual meeting room.

IDIAP Trainee: Benjamin Picart (Masters Internship)
Visiting From: Faculte Polytechnique de Mons (Belgium)
Period: 2 February 2009 - 15 June 2009
Project Title: Improved phone posterior estimation through kNN and MLP-based distance

In this work, we investigate the use of a Multi-Layer Perceptron (MLP) to compute distances between phone posterior feature vectors, in order to improve the performance of the k-Nearest-Neighbors (kNN) classification rule. Given these posterior feature vectors, we search for the optimal configuration of the MLP minimizing the classification error rate, mainly by selecting the optimal number of units in the hidden layer and the number k of nearest neighbors in the kNN classification rule. The use of posteriors as input is motivated by the fact that they are speaker- and environment-independent (so they capture much of the phonetic information contained in the signal) and can be seen as an optimal phonetic representation of speech features, as we will see in this work. The results will be compared to those obtained using other distances (e.g. the standard Euclidean distance, the Bhattacharyya distance, the Kullback-Leibler (KL) divergence and the cosine function). To achieve this objective, we use the TIMIT database.
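The kNN rule with a pluggable distance function can be sketched as below. The toy posterior vectors are invented, and a symmetrized KL divergence stands in for the learned MLP distance, which is the actual object of study in the project.

```python
import numpy as np

def knn_classify(query, train_x, train_y, k, dist):
    """k-nearest-neighbor vote under an arbitrary distance function."""
    d = np.array([dist(query, x) for x in train_x])
    nearest = np.argsort(d)[:k]
    votes = {}
    for i in nearest:
        votes[train_y[i]] = votes.get(train_y[i], 0) + 1
    return max(votes, key=votes.get)

def euclidean(a, b):
    return float(np.linalg.norm(a - b))

def symmetric_kl(p, q, eps=1e-10):
    """Symmetrized KL divergence, a natural distance for posterior vectors."""
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

# toy posterior vectors over three phone classes, labeled by phone
train_x = np.array([[0.8, 0.1, 0.1], [0.7, 0.2, 0.1],
                    [0.1, 0.8, 0.1], [0.2, 0.7, 0.1]])
train_y = ["a", "a", "e", "e"]
q = np.array([0.75, 0.15, 0.10])
print(knn_classify(q, train_x, train_y, 3, symmetric_kl))  # → a
```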

IDIAP Trainee: Marianna Pronobis (Masters Internship)
Visiting From: Royal Institute of Technology (KTH), Stockholm, Sweden
Period: 1 April 2008 - 31 August 2008
Project Title: Automatic Gender Recognition Based on Audio-Visual Cues

The ability to perform automatic recognition of human gender is crucial for artificial systems employed in a number of applications. Typical examples are information retrieval and human-computer interaction systems. The outcome of an Automatic Gender Recognition (AGR) system can be used to generate meta-data, such as how many males or females were present in a meeting, or to annotate audio and video files. Moreover, such a system can provide contextual information that can be exploited by other systems: for instance, improving the intelligibility of human-computer interaction, selecting and estimating better, more specialized acoustic models for speech recognition, or simply reducing the search space in speaker recognition or surveillance systems. The goal of the thesis is to develop a multimodal AGR system that, when applied in realistic scenarios, is sufficiently robust to the various types of noise that occur in natural settings such as public places, meeting rooms, or outdoor environments. In the proposed solution, both audio and visual cues will be used simultaneously as sources of information. They are expected to provide a more comprehensive description of a subject than a single modality and, in consequence, greater robustness to degradation or even temporary unavailability of the input signals. In this work we would like to identify which audio and visual features yield a better AGR system under varying conditions, and how much the performance of an AGR system in a realistic scenario can be improved by combining the two modalities. To address these questions, we will first separately study the effectiveness of different audio and visual features for the specific task on clean datasets, and then increase the difficulty of the problem by repeating the studies under conditions that occur in everyday settings.
Finally, the information provided by different cues will be integrated to build a robust audio-visual AGR system.
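One simple way to integrate the two cues is weighted late fusion of per-class posteriors. The sketch below is purely illustrative (the class ordering and the weight `w` are assumptions, not the thesis's design):

```python
import numpy as np

def late_fusion(p_audio, p_visual, w=0.6):
    """Weighted-sum fusion of class posteriors from two modalities.

    Lowering w shifts trust toward the visual stream, e.g. when audio is noisy.
    """
    p = w * np.asarray(p_audio, dtype=float) + (1 - w) * np.asarray(p_visual, dtype=float)
    return p / p.sum()

def classify(p):
    # Assumed class order: index 0 = female, index 1 = male.
    return ["female", "male"][int(np.argmax(p))]
```

With a reliable visual stream, a noisy audio vote can be overridden: `classify(late_fusion([0.4, 0.6], [0.9, 0.1], w=0.3))` follows the visual cue.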

UT Trainee: Volha Petukhova (Masters candidate)
Visiting From: Tilburg University, Netherlands
Period: Started February 1, 2005, for five months

Communication has a central place in meetings. Communication is defined as the transmission of content X from a sender Y to a recipient Z using an expression W and a medium Q in an environment E with a purpose/function F (Allwood, 2002).

Expressions can be verbal or non-verbal in nature. The main aims of the research project are to study the interaction of verbal and non-verbal dialogue acts; to explore the semantic and pragmatic information that is available in the individual modalities; to investigate the function of non-verbal behavior, gestures in particular; and to examine the multidimensional interaction of verbal and non-verbal communicative acts and/or linguistic and non-linguistic components of utterances.

References: Allwood, J. (2000). Bodily Communication - Dimensions of Expression and Content. In Björn Granström, David House and Inger Karlsson (Eds.), Multimodality in Language and Speech Systems. Kluwer Academic Publishers, Dordrecht, The Netherlands.

ICSI Trainee: Michael Pucher (Ph.D. Internship)
Visiting From: Telecommunications Research Center, Vienna
Period: 1 Jan - 1 June 05
Project Title: Latent semantic analysis based language models for meeting recognition

Language models that combine N-gram models with Latent Semantic Analysis (LSA) based models have been successfully applied for conversational speech recognition and for broadcast news. LSA defines a semantic similarity space using a large training corpus. This semantic similarity can be used for dealing with long distance dependencies, which are a problem for N-gram based models. Since LSA based models are sensitive to the topics of the training data and meetings mostly have a restricted topic or agenda, we think that these models can improve speech recognition accuracy on meetings. In this project the performance of LSA based language models on meeting recognition will be evaluated. For the training of the LSA model we will use topicalized meeting data together with larger training corpora. There are two crucial aspects of LSA based language models that we want to work on. The first is the conversion from the semantic similarity space to the probabilistic space of language models. The second is the integration of N-gram models and LSA based semantic models. We want to investigate different methods for dealing with these two issues in the meeting domain.

ICSI Trainee: Michael Pucher (Ph.D. student)
Visiting From: Telecommunications Research Center, Vienna
Period: started February 1, 2005, for six months

Language models that combine N-gram models with Latent Semantic Analysis (LSA) based models have been successfully applied for conversational speech recognition and for broadcast news. LSA defines a semantic similarity space using a large training corpus. This semantic similarity can be used for dealing with long distance dependencies, which are a problem for N-gram based models. Since LSA based models are sensitive to the topics of the training data and meetings mostly have a restricted topic or agenda, we think that these models can improve speech recognition accuracy on meetings.

In this project the performance of LSA based language models on meeting recognition will be evaluated. For the training of the LSA model we will use topicalized meeting data together with larger training corpora. There are two crucial aspects of LSA based language models that we want to work on. The first is the conversion from the semantic similarity space to the probabilistic space of language models. The second is the integration of N-gram models and LSA based semantic models. We want to investigate different methods for dealing with these two issues in the meeting domain.
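The semantic similarity space that LSA defines can be illustrated in a few lines: build a term-by-document count matrix, reduce it by SVD, and compare words by cosine in the latent space. This is a generic textbook sketch, not the project's code:

```python
import numpy as np

def lsa_similarity(term_doc, i, j, rank=2):
    """Cosine similarity of words i and j in the rank-reduced LSA space.

    term_doc: term-by-document count matrix (rows = words, columns = documents).
    """
    U, s, _ = np.linalg.svd(term_doc, full_matrices=False)
    word_vecs = U[:, :rank] * s[:rank]   # word coordinates in the latent space
    a, b = word_vecs[i], word_vecs[j]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
```

Words that co-occur in the same documents land close together in the latent space even if they never appear side by side, which is what lets an LSA component capture long-distance dependencies that an N-gram window misses.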

BRNO Trainee: Santhosh Kumar Chellappan Pillai (PhD Internship)
Visiting From: ECE Dept. Amrita Vishwa Vidyapeetham, Ettimadai
Period: 28 May 2007 - 31 August 2007
Project Title: Multilingual Phoneme Recognition

Speech recognition has matured enough today as a technology to be used in commercial applications. Yet, dealing with speech recognition in multilingual societies and under real meeting conditions is far from satisfactory. Some of the issues that need to be addressed in this context are:
1. Automatic language identification
2. Identifying out-of-vocabulary words
In dealing with multilingual speech recognition, it is always convenient to think of multilingual phone recognizers rather than mono-lingual phone recognizers. In the first part of the work, we will address the use of multilingual phone sets for language identification, and study if the language identification could be done using fewer multilingual phone recognizers to obtain the same accuracy as using many mono-lingual phone recognizers.
Later, we will also address the detection of out-of-vocabulary (OOV) words in multicultural environments where English is the primary language but many words are mixed across languages. More importantly, these words need not be spoken using the native English phone set, but using the speaker's own phone set, which may differ from it. In this study, we address how some of the issues in detecting OOVs can be solved using a multilingual phone set.

IDIAP Trainee: Bogdan Raducanu (Postdoctoral Visit)
Visiting From: Computer Vision Center, Barcelona
Period: 1 February 2008 – 31 July 2008
Project Title: Social Interaction Modeling


Considering social interaction from the audio perspective, I plan to investigate during this traineeship some aspects of vocal prosody that are fundamental to modeling interaction between people and that could give hints about people's social behavior and the outcome of the interaction process. Concretely, I will study automatic labeling of prosodic features such as voice pitch, accentuation, loudness, tonality, spectral entropy, speaking rate, and speaking time, with the goal of applying them to model conversational dynamics. Possible target applications are the modeling of turn-taking, dominance (who is in charge of a conversation), emphasis (variation in intonation), and activity level (engagement).
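As one example of how such a prosodic feature can be labeled automatically, here is a simple autocorrelation-based pitch estimator for a single voiced frame (a textbook sketch, not necessarily the method the traineeship will use):

```python
import numpy as np

def autocorr_pitch(frame, sr=16000, fmin=60, fmax=400):
    """Estimate the fundamental frequency (Hz) of a voiced frame by
    picking the autocorrelation peak in the plausible pitch-lag range."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # lags 0..N-1
    lo, hi = int(sr / fmax), int(sr / fmin)                        # lag search range
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag
```

Running such an estimator over successive frames yields the pitch track; loudness, speaking rate, and speaking time are extracted analogously from frame energy and voicing decisions.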

USFD Trainee: Anand Ramamoorthy (Masters Internship)
Visiting From:
Period: 21 January 2008 - 20 July 2008
Project Title: Towards a Robust Means of Assessing the Utility of Interactive Compression Techniques


Interactive Compression (IC) has been proposed as an effective means of extracting salient units of information from textual sources (for instance, meeting transcripts). Interactive Compression works by allowing the user to control the manner in which textual information is presented: bits of the text (either words or utterances) are excised or highlighted on the basis of salience ratings. Recent qualitative and quantitative experiments, which drew extensively on the AMI Meeting Corpus, have demonstrated the efficacy of IC techniques for information extraction in the context of a well-defined task. However, the results of these studies have been criticised on the grounds that they depend on the task chosen for the experiment. Moreover, the human brain appears to have evolved strategies for extracting salient units of information from unmodified texts and documents. This presents an interesting challenge to those investigating the IC approach.

There are two parts to this challenge. On the one hand, the relative superiority of IC-enabled document search over everyday visual search needs to be demonstrated beyond reasonable doubt; on the other, an objective measure of the effectiveness of IC techniques needs to be developed to ensure that the experiments reflect 'real-world' concerns about information extraction from textual sources. We hypothesise that interactive compression will make the extraction of salient units of information more efficient, in terms of time, than ordinary visual search applied to unmodified text.

USFD Trainee: Ramandeep Singh (UG Internship)
Visiting From: Indian Institute of Technology (IIT, Kanpur)
Period: 8 May 2007 - 25 July 2007
Project Title: Informed discriminative training of meeting data acoustic models

Discriminative training techniques such as Minimum Bayes Risk (MBR) may be used to improve the performance of HMM-based acoustic models. In MBR training, knowledge of the correct transcription of the training data is used to assign a loss to competing transcriptions. There is often a mismatch between the final evaluation procedure and the loss function used in discriminative training. For example, when evaluating meeting data transcriptions, the hesitations 'UM' and 'UH' are mapped onto a generic hesitation token, and no penalty is incurred for mis-transcribing 'UH' as 'UM'. At the discriminative training stage, however, the loss function currently used will assign a loss to a competing transcription that substitutes 'UM' for 'UH'. There are many examples of such mismatches. The aim of this project is to evaluate how much (if at all) discriminative training can benefit from knowledge of the final evaluation procedure.
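The mismatch can be made concrete with a small sketch: applying the evaluation's hesitation mapping before computing the word-level edit distance removes the spurious loss on 'UM'/'UH' substitutions. The token name `%HESITATION` is illustrative, not taken from the scoring setup:

```python
def normalize(words, hesitations=frozenset({"UM", "UH"}), token="%HESITATION"):
    """Map hesitation variants onto one token, mirroring the evaluation procedure."""
    return [token if w in hesitations else w for w in words]

def levenshtein(ref, hyp):
    """Word-level edit distance, the usual loss in MBR-style training."""
    d = [[i + j if i * j == 0 else 0 for j in range(len(hyp) + 1)]
         for i in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]))  # substitution
    return d[len(ref)][len(hyp)]

def informed_loss(ref, hyp):
    """Loss computed after the same mapping the final evaluation applies."""
    return levenshtein(normalize(ref), normalize(hyp))
```

For ref "I UM think" and hypothesis "I UH think", the plain loss is 1 while the evaluation-informed loss is 0, which is exactly the discrepancy the project aims to exploit.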

USFD Trainee: Kumutha Swampillai (Ph.D. Internship)
Visiting From:
Period: 16 January 06 - 16 July 06
Project Title:



AMI Applicant: Sophie-Anne Thobie
Visiting From: LIMSI-CNRS, Bordeaux
Period: 1 March - 1 September 2005

The main aim of the work to be done in the AMI training programme is to find verbal and nonverbal characteristics of confusion during meetings. How does a meeting participant, or its representation as an embodied conversational agent, act when it lacks understanding of the situation? What kinds of verbal and nonverbal signals (gaze, gestures, facial expressions, posture) show this confusion, and how can we express them in embodied agents? During the traineeship an attempt will be made to model verbal and nonverbal communication in a situation where there is a misunderstanding among meeting participants. We will also look at the possibility of generating this type of behavior among communicating embodied conversational agents in a meeting environment. Some concrete examples of behavior, based on our model, will be generated, and methods for reducing data and improving the smoothness of the movements (taken from Thobie's earlier Ph.D. work) will be employed.

IDIAP Trainee: Muhammad Muneeb Ullah (Masters Internship)
Visiting From: Royal Institute of Technology (KTH), Stockholm, Sweden
Period: 1 May 2007 - 31 January 2008
Project Title:




UEDIN Trainee: Nynke Van der Vliet (MSc Internship)
Visiting From: University of Twente
Period: 1 Oct 05 - 30 April 06
Project Title:

During meetings, even when several people talk at once, the participants often perceive one person as having the floor. For this reason, the floor can be an important concept for understanding the content of a meeting. Previous attempts at statistical modelling to predict the floor have been limited to considering patterns of who has spoken previously. In this project, we will review the literature describing the process of establishing the floor, multi-modal cues for passing the turn from one participant to another, and markers for the types of utterances, such as "backchannels", that do not take the floor. We will then use the ICSI and AMI meeting corpora to describe the relationship between floor states and other properties that have been annotated for these corpora, such as hand gestures, gaze, and the use of explicit addressing.

IDIAP Trainee: Gerwin van Doorn (MSc Internship)
Visiting From: University of Twente
Period: 1 Sept 05 - 31 March 06
Project Title: Accelerated Media Playback of Meeting Recordings

Meetings can take up a lot of time without containing dense information, and people do not want to watch or listen to recorded meetings of an hour or more. Most of the information in a meeting is contained in the recorded speech, so being able to play back speech faster than it was produced, while maintaining good comprehensibility, may be a valuable feature of a meeting browser. The goal of this project is to test several interactive browsing techniques that make it possible to play back meetings in a shorter time period than they were recorded. The outcome of this research should establish whether the accelerated playback techniques increase browsing performance. Techniques I will be looking at include interactive control of time-compression and overlapping audio using binaural placement.
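One common way to realize accelerated playback is overlap-add time compression: read windowed frames from the input faster than they are written to the output. The sketch below is a crude variant without the waveform-alignment search that SOLA/WSOLA adds to preserve pitch quality, and is not the browser's actual implementation:

```python
import numpy as np

def time_compress(x, rate=1.5, frame=400, hop=200):
    """Overlap-add time compression: read a frame every rate*hop samples,
    write it every hop samples, and renormalize by the summed windows."""
    win = np.hanning(frame)
    out_len = int(len(x) / rate) + frame
    out = np.zeros(out_len)
    norm = np.zeros(out_len)
    t_in, t_out = 0, 0
    while t_in + frame <= len(x):
        out[t_out:t_out + frame] += x[t_in:t_in + frame] * win
        norm[t_out:t_out + frame] += win
        t_in += int(rate * hop)   # read pointer advances faster...
        t_out += hop              # ...than the write pointer
    norm[norm == 0] = 1.0
    return out / norm
```

At `rate=2.0`, an hour of speech plays in half an hour; the interactive-control idea in the project is essentially letting the listener vary `rate` on the fly.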

ICSI Trainee: Oriol Vinyals (Masters Internship)
Visiting From: Technical University of Catalonia (UPC)
Period: 2 April 2007 - 30 September 2007
Project Title: Improving the ICSI Diarization Engine

The goal of speaker diarization is to segment an audio recording into speaker-homogeneous regions. This task is sometimes referred to as the "Who Spoke When" task. Knowing when each speaker is speaking is useful as a pre-processing step in speech-to-text (STT) systems to improve the quality of the output. Such pre-processing may include vocal tract length normalization (VTLN) and/or speaker adaptation. Automatic speaker segmentation may also be useful in information retrieval and as part of the indexing information of audio archives.

One of the aspects we are working on is increasing the speed of the diarization engine. We have proposed several modifications, such as a faster logarithm implementation and a faster way to perform the agglomerative clustering using a Fast Match technique. With these techniques, the engine runs faster than real time.

We are also looking to improve the accuracy of the overall system. To do so, we are exploring new audio features, techniques for merging clusters that are not based on the Bayesian Information Criterion, and better initialization.
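For reference, the BIC-based merging decision that such alternatives would replace can be sketched as a ΔBIC between two clusters, each modeled by a single full-covariance Gaussian (a generic formulation, not ICSI's engine; the penalty weight `lam` is a tunable):

```python
import numpy as np

def delta_bic(X, Y, lam=1.0):
    """ΔBIC for merging two clusters of feature vectors; a negative
    value favors merging them into one Gaussian."""
    def data_term(Z):
        # 0.5 * n * log|Σ|: the likelihood part of the BIC for one Gaussian
        cov = np.cov(Z, rowvar=False) + 1e-6 * np.eye(Z.shape[1])
        _, logdet = np.linalg.slogdet(cov)
        return 0.5 * len(Z) * logdet
    n, d = len(X) + len(Y), X.shape[1]
    # Penalty for the extra parameters (mean + full covariance) of a second model
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return data_term(np.vstack([X, Y])) - data_term(X) - data_term(Y) - penalty
```

Agglomerative clustering repeatedly merges the pair with the lowest ΔBIC and stops when every remaining pair is positive.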

Some of the future goals include the implementation of an Online Diarization System or a Multimodal Diarization System that will use image information.

This work is being done in collaboration with Gerald Friedland, a staff member of the ICSI speech group, and with Yan Huang, a PhD student at UC Berkeley, among other collaborations with members of the Speech Group.

ICSI Trainee: Marc Ferras (Masters candidate)
Visiting From: Polytechnical University of Catalonia (UPC), Barcelona
Period: Started September 1, 2004, for 6 months (with possible extension to full year)

Word accuracy of ASR systems falls off dramatically when using far-field microphones, yet the use of tabletop microphones is both convenient and common in meeting room recordings. Thus, it is essential to have a preprocessing stage which copes with reverberation while trying to simultaneously maximize word accuracy of the ASR system. For this project, some already existing dereverberation techniques will be studied, implemented and evaluated for the available meeting corpora. These techniques are focused on both beamforming and speech modelling at the signal level (LPC, HNM). At the same time, other related and novel approaches will also be examined, aimed at joint beamforming-LPC (or PLP) modelling, and may involve pitch tracking or working on different metrics for LPC residual minimization.
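The simplest of the beamforming techniques in this family, delay-and-sum, can be sketched in a few lines (integer sample delays only, for illustration; a real far-field front end would estimate fractional, time-varying delays per channel):

```python
import numpy as np

def delay_and_sum(channels, delays):
    """Align each microphone channel by its sample delay and average.

    channels: list of 1-D signal arrays; delays: per-channel delay in samples.
    Coherent speech adds up while uncorrelated noise and reverberation
    partially cancel, improving the signal fed to the recognizer.
    """
    n = min(len(ch) - d for ch, d in zip(channels, delays))
    return sum(ch[d:d + n] for ch, d in zip(channels, delays)) / len(channels)
```

The delays are normally obtained from cross-correlation between channels; the LPC/HNM modelling mentioned above would then operate on the beamformed signal.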

University of Sheffield Trainee: Darren Moore (Postdoctoral Visit)
Visiting From: IDIAP
Period: 1 Aug 04 - 1 Sept 04
Project Title:




UT Trainee: Roel Vertegaal (Postdoctoral Visit)
Visiting From: Queen’s University
Period: 1 Nov 05 - 1 May 06
Project Title:




UEDIN Trainee: Junichi Yamagishi (Postdoctoral Visit)
Visiting From: Tokyo Institute of Technology
Period: 1 April 2006 - 31 March 2007
Project Title: Online learning algorithm of decision trees for HMM-based speech recognition


In real speech data such as meeting recordings, the target speakers and/or the environment change suddenly. Hence, to recognize a new target speaker's speech in a new target environment, we need to tune the parameters of the HMM-based acoustic model to the new speaker or environment effectively and rapidly. For this purpose, many online parameter adaptation or estimation approaches have been proposed. However, in an HMM-based ASR system, the acoustic model has a complex topology, and this topology affects accuracy because it is used for tying model parameters and for model selection for unseen contexts. Therefore, the structure of the decision tree, which defines the model topology, should also be reconstructed for the new target speaker/environment. In this training program, I intend to tackle an online algorithm for learning and growing the decision tree of the HMM. Generalization of this online decision-tree learning algorithm and of dynamic model selection algorithms proposed in the field of statistical learning is of great interest.

ICSI Trainee: Shasha Xie (PhD Internship)
Visiting From: The University of Texas at Dallas
Period: 15 February 2009 - 14 August 2009
Project Title: Improving the performance of extractive meeting summarization


Automatic summarization is a useful technique for helping users browse a large amount of data and obtain information more efficiently from either text or audio sources. Extractive meeting summarization provides a concise and informative summary of lengthy meetings and is an effective tool for efficient information access. In previous research, a global optimization framework was introduced for the meeting summarization task, based on the hypothesis that utterances convey bits of information, or concepts. The best set of utterances is selected to cover as many concepts as possible while satisfying a length constraint. This optimization method can be extended to consider sentence importance weights, so that the summary utterances are selected to cover both salient concepts and important sentences. We will study different measures of sentence importance, such as the cosine similarity of each sentence to the entire meeting document, or confidence scores from supervised learning approaches. We will also investigate different ways of combining concept weights and sentence importance weights. We hope that introducing sentence importance scores will improve the readability of the extractive summary, and thus improve system performance. Furthermore, because we extract concepts as the information units for extractive summarization, we hope these relatively small but robust units will improve performance in the ASR condition, where using small units can reduce the impact of a high WER.
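The concept-coverage selection is usually posed as an integer linear program; a greedy approximation makes the idea concrete. The sketch below is illustrative (not the authors' system) and normalizes each candidate's newly covered concept weight by its length, which is where a sentence-importance term could also be folded in:

```python
def greedy_summary(sentences, concept_weights, budget):
    """Greedy concept-coverage summarization under a word-length budget.

    sentences: list of (text, concept_set) pairs; concept_weights: dict
    mapping each concept to its weight. Repeatedly picks the sentence with
    the highest uncovered-concept weight per word that still fits.
    """
    covered, chosen, length = set(), [], 0
    remaining = list(range(len(sentences)))
    while remaining:
        def gain(i):
            text, concepts = sentences[i]
            new = sum(concept_weights.get(c, 0) for c in concepts - covered)
            return new / max(len(text.split()), 1)
        best = max(remaining, key=gain)
        text, concepts = sentences[best]
        if gain(best) <= 0 or length + len(text.split()) > budget:
            remaining.remove(best)   # adds nothing new or does not fit
            continue
        chosen.append(text)
        covered |= concepts
        length += len(text.split())
        remaining.remove(best)
    return chosen
```

A sentence whose concepts are already covered contributes zero gain and is skipped, which is exactly why small, robust concept units help in the ASR condition: a misrecognized word corrupts one concept, not a whole sentence score.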
