The event took place on Wednesday, November 23rd, 2016, at EPFL.
- 9:00 – 9:30
- Robustness of classifiers beyond random noise (Seyed Mohsen Moosavi Dezfooli, EPFL)
- 9:30 – 10:00
- Importance Sampling Tree for Large-scale Empirical Expectation (Olivier Canévet, IDIAP)
- 10:00 – 10:30
- Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering (Michaël Defferrard, EPFL)
- 10:30 – 10:45
- Coffee break
- 10:45 – 11:15
- ManifoldNet: Manifold-Guided Training of Neural Networks (Dengxin Dai, ETHZ)
- 11:15 – 11:45
- Beyond Sharing Weights for Deep Domain Adaptation (Artem Rozantsev, EPFL)
- 11:45 – 12:15
- Segmenting aerial images into polygonal shapes (Nadine Rüegg, ETHZ)
- 12:15 – 14:00
- Lunch break
- 14:00 – 14:30
- Learning robust sequences of goal-oriented tasks (Jose Ramon Medina, EPFL)
- 14:30 – 15:00
- Learning adaptive dressing assistance from human demonstration (Emmanuel Pignat, IDIAP)
- 15:00 – 15:30
- What Foursquare Can't Tell: Characterizing Perception in Urban Environments (Darshan Santani, IDIAP)
- 15:30 – 15:45
- Coffee break
- 15:45 – 16:15
- Learning with Feature Side-information (Amina Mollaysa, UNIGE)
- 16:15 – 16:45
- CNN-based presentation attack detection for trustworthy speaker verification (Hannah Muckenhirn, IDIAP)
- 16:45 – 17:15
- Learning Multimodal Temporal Representation for Dubbing Detection in Broadcast Media (Nam Le, IDIAP)
Several recent works have shown that state-of-the-art classifiers are vulnerable to adversarial perturbations of the datapoints. On the other hand, it has been empirically observed that these same classifiers are relatively robust to random noise. In this talk, we propose a semi-random noise regime that generalizes both the random and adversarial noise regimes. We provide precise theoretical bounds on the robustness of classifiers in this general regime, which depend on the curvature of the classifier's decision boundary. These bounds are in line with previous empirical observations that classifiers satisfying curvature constraints are robust to random noise. Moreover, we quantify the robustness of classifiers in terms of the subspace dimension in the semi-random noise regime, and show that our bounds interpolate between the adversarial and random noise regimes. This result suggests bounds on the curvature of the classifiers' decision boundaries that we support experimentally, and offers insights into the geometry of high-dimensional classification problems.
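As a rough illustration of the semi-random regime, the sketch below considers only the simplest case, a linear classifier f(x) = w.x + b, whose decision boundary has zero curvature. It is not the method from the talk: the dimension, the weight vector, and the helper names are assumptions chosen for the example. The perturbation is restricted to the span of the first m vectors of a random orthonormal basis, so m = 1 corresponds to a single random direction and m = dim to the unrestricted adversarial case.

```python
import math
import random

def orthonormal_basis(dim, rng):
    """Gram-Schmidt on Gaussian vectors: a random orthonormal basis of R^dim."""
    basis = []
    while len(basis) < dim:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for q in basis:
            dot = sum(a * b for a, b in zip(v, q))
            v = [a - dot * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-8:  # skip the (unlikely) degenerate draw
            basis.append([a / norm for a in v])
    return basis

def distance_in_subspace(w, margin, basis, m):
    """Smallest perturbation norm, restricted to span(basis[:m]), that reaches
    the boundary of f(x) = w.x + b, where margin = |f(x)|.
    For a linear classifier this equals margin / ||P_S w||."""
    projected_sq = sum(sum(a * b for a, b in zip(w, q)) ** 2 for q in basis[:m])
    return margin / math.sqrt(projected_sq)

rng = random.Random(0)
dim = 50                                        # hypothetical input dimension
w = [rng.gauss(0.0, 1.0) for _ in range(dim)]   # hypothetical classifier weights
margin = 1.0                                    # |f(x)| at the datapoint of interest
basis = orthonormal_basis(dim, rng)

# Unrestricted (adversarial) distance to the boundary: |f(x)| / ||w||.
adversarial = margin / math.sqrt(sum(a * a for a in w))

# Nested random subspaces of growing dimension m.
distances = {m: distance_in_subspace(w, margin, basis, m) for m in (1, 5, 25, 50)}
for m, d in distances.items():
    print(f"m = {m:2d}: distance = {d:.3f} (adversarial = {adversarial:.3f})")
```

Because the subspaces are nested, the distance can only shrink as m grows, and it reaches the adversarial value once the subspace is the whole space. Curved decision boundaries, which the bounds in the talk address, are outside the scope of this sketch.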
We present an "Importance Sampling Tree", which is a tree-based procedure inspired by the Monte-Carlo Tree Search that dynamically modulates an importance-based sampling to prioritize computation, while getting unbiased estimates of weighted sums. We apply this generic method to learning on very large training sets, and to the evaluation of large-scale SVMs.
The core idea is to reformulate the estimation of a score (whether a loss or a prediction estimate) as an empirical expectation, and to use such a tree, whose leaves carry the samples, to focus effort on the problematic "heavy weight" ones.
We illustrate the potential of this approach on three problems: improving Adaboost and a multi-layer perceptron on 2D synthetic tasks with several million points, training a large-scale convolutional network on several million deformations of the InfiMNIST and CIFAR data sets, and computing the response of an SVM with several hundred thousand support vectors. In each case, we show that it cuts down computation by more than one order of magnitude, yields better loss estimates, or both.
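The unbiasedness mentioned above can be illustrated with plain (flat) importance sampling. The sketch below is not the tree-based procedure from the talk; the toy losses, the proposal weights, and the function name are assumptions made for the example. It estimates a large sum from a few draws by sampling index i with probability q_i and averaging value_i / q_i.

```python
import random

def importance_sampled_sum(values, proposal, n_samples, rng):
    """Unbiased estimate of sum(values) from draws i ~ proposal.

    `proposal` holds unnormalized, strictly positive sampling weights.
    """
    total_q = sum(proposal)
    indices = rng.choices(range(len(values)), weights=proposal, k=n_samples)
    # Each term values[i] / q_i has expectation sum(values).
    return sum(values[i] / (proposal[i] / total_q) for i in indices) / n_samples

rng = random.Random(0)

# Toy per-sample losses: a few "heavy" samples dominate the sum.
losses = [0.01] * 990 + [5.0] * 10
exact = sum(losses)

uniform = [1.0] * len(losses)               # plain Monte Carlo
focused = [v + 1e-3 for v in losses]        # hypothetical proposal, roughly proportional to the loss

est_uniform = importance_sampled_sum(losses, uniform, 200, rng)
est_focused = importance_sampled_sum(losses, focused, 200, rng)
print(f"exact = {exact:.2f}, uniform = {est_uniform:.2f}, focused = {est_focused:.2f}")
```

Both estimators are unbiased, but the focused proposal spends most of its draws on the heavy samples, so its individual terms stay close to the true sum. In practice a good proposal is not known in advance, which is the kind of problem a dynamically adapted sampling scheme is meant to address.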
Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering (Michaël Defferrard, EPFL)
Convolutional neural networks (CNNs) have greatly improved state-of-the-art performance in computer vision and speech analysis tasks, due to their ability to extract multiple levels of representation from data. In this work, we are interested in generalizing CNNs from low-dimensional regular grids, where image, video and speech are represented, to high-dimensional irregular domains, such as social networks, telecommunication networks, or word embeddings. We present a formulation of CNNs in the context of spectral graph theory, which provides the necessary mathematical background and efficient numerical schemes to design fast localized convolutional filters on graphs. Importantly, the proposed technique offers the same computational complexity as classical CNNs, while being universal to any graph structure. Numerical experiments on MNIST and 20NEWS demonstrate the ability of this novel deep learning system to learn local, stationary, and compositional features on graphs.
Manifold learning techniques have received extensive attention, with applications to tasks as varied as dimensionality reduction, clustering, semi-supervised learning, feature encoding, and visualization. In this talk, we will present our recent research on manifold-guided training of neural networks. We will focus on how to exploit manifold structures to train a neural network, in the context of dimensionality reduction, network distillation, and semi-supervised learning. We demonstrate that our approach outperforms competing methods on all three tasks.
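One common way to obtain a localized filter on a graph is to apply a low-degree polynomial of the graph Laplacian to the signal. The sketch below assumes that construction with a plain monomial basis on a small path graph; the graph, the coefficients in `theta`, and the helper names are hypothetical, and the talk's actual parameterization may differ. It needs only matrix-vector products, and a degree-K filter cannot spread a signal further than K hops.

```python
def laplacian(adjacency):
    """Combinatorial graph Laplacian L = D - A as a dense list of lists."""
    n = len(adjacency)
    return [[(sum(adjacency[i]) if i == j else 0) - adjacency[i][j]
             for j in range(n)] for i in range(n)]

def matvec(matrix, vector):
    """Dense matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def polynomial_filter(lap, signal, theta):
    """Return sum_k theta[k] * L^k * signal using only matrix-vector products."""
    out = [0.0] * len(signal)
    power = list(signal)                  # L^0 applied to the signal
    for coeff in theta:
        out = [o + coeff * p for o, p in zip(out, power)]
        power = matvec(lap, power)        # next power of L applied to the signal
    return out

# Path graph on 6 nodes: 0 - 1 - 2 - 3 - 4 - 5
n = 6
A = [[1 if abs(i - j) == 1 else 0 for j in range(n)] for i in range(n)]
L = laplacian(A)

impulse = [1.0] + [0.0] * (n - 1)         # unit signal at node 0
theta = [0.5, -0.3, 0.2]                  # hypothetical coefficients, degree K = 2
response = polynomial_filter(L, impulse, theta)
print(response)
```

The response is zero at every node more than two hops from node 0, which is the locality property: the filter's support is controlled by the polynomial degree rather than by the size of the graph. On a sparse graph each product costs time proportional to the number of edges; the dense lists here are only for readability.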
Deep Neural Networks have demonstrated outstanding performance in many Computer Vision tasks but typically require large amounts of labeled training data to achieve it. This is a serious limitation when such data is difficult to obtain. In traditional Machine Learning, Domain Adaptation is an approach to overcoming this problem by leveraging annotated data from a source domain, in which it is abundant, to train a classifier to operate in a target domain, in which labeled data is either sparse or even lacking altogether. In the Deep Learning case, most existing methods use the same architecture with the same weights for both source and target data, which essentially amounts to learning domain invariant features. Here, we show that it is more effective to explicitly model the shift from one domain to the other. To this end, we introduce a two-stream architecture, one of which operates in the source domain and the other in the target domain. In contrast to other approaches, the weights in corresponding layers are related but not shared to account for differences between the two domains. We demonstrate that this both yields higher accuracy than state-of-the-art methods on several object recognition and detection tasks and consistently outperforms networks with shared weights in both supervised and unsupervised settings.
We present a supervised deep learning approach for segmenting aerial images into polygonal shapes. While most existing approaches address the problem of automated generation of online maps as a pixel-wise segmentation task, we instead frame this problem as constructing polygons representing objects such as roads and buildings. Our approach uses a combination of a convolutional neural network to detect object corners and a siamese network to find matching corners that belong to the same object. We use large amounts of data to train our model by making use of readily available online maps that can be downloaded at virtually no cost, at an arbitrary scale, and for a large number of cities across the globe. Preliminary experimental results demonstrate that this approach performs on par with traditional approaches while presenting additional benefits. The polygonal representation produced by our approach opens the door to the use of object-level constraints, which have proven to be extremely difficult to model with conventional pixel-based approaches or conditional random fields.
The majority of the learning from demonstration literature has been devoted to encoding the human behavior solving a single task in an accurate and general way. The segmentation of complex tasks into simpler sequences of subtasks is typically addressed as a decoupled problem, done either by the human or by an isolated segmentation algorithm. In this work, we study the problem of jointly modeling sequences of simple tasks and their parameters while ensuring robustness by means of a Hidden Semi-Markov Model. Each state (subtask) is represented by a stable dynamical system (DS), and transitions between DSs happen only when relevant state-dependent conditions, such as convergence to the DS attractor, are fulfilled. This way, the segmentation and the parameterization of the DSs are considered jointly, and the state-dependent transitions significantly increase the robustness to perturbations. Experimental results on a robot manipulator show the increased reproduction capabilities and robustness with respect to state-of-the-art approaches.
For tasks such as dressing assistance, robots should be able to adapt to different user morphologies, preferences and requirements. We present a learning from demonstration method to efficiently acquire and adapt such skills. Our method encodes robot state (e.g., position and velocity of end-effector) and object state (e.g., coat or shoe position) in a hidden semi-Markov model (HSMM). The parameters of this model are learned from a set of demonstrations performed by a human. During execution of the task, the HSMM acts as a cost function combined with optimal control techniques privileging minimal intervention. The signals are encoded in multiple frames of reference simultaneously, allowing a fast adaptation to different user postures and morphologies.
What Foursquare Can't Tell: Characterizing Perception in Urban Environments (Darshan Santani, IDIAP)
There is an increasing interest in social media and ubiquitous computing to characterize places beyond their function and towards psychological and emotional constructs. In this context, an area of active research is the development of "a better idea of how people perceive and experience places". As soon as we walk into a bar, cafe or a restaurant, we judge if the place is appropriate for us. In other words, we form place impressions combining perceptual cues that involve most senses as well as prior knowledge of both the physical space and its inhabitants.
In this talk, we will present a large-scale study to examine how crowdsourced images can be used to characterize and automatically infer urban perception of indoor and outdoor places. We first show that reliable estimates of urban perception can be obtained using images as visual stimuli in an online setting for both place types. Urban perceptions were elicited across several physical and psychological constructs. Indoor places were judged along categories including romantic, bohemian, formal and trendy, among others. Outdoor perceptions were assessed along dimensions including dangerous, dirty, happy, picturesque, etc. Second, using generic deep learning features, we demonstrate the feasibility of automatically inferring urban perception, with a maximum r-squared of 0.52 and 0.49 for indoor and outdoor places respectively.
In most real-life problems, the attributes come with their own vectorial descriptions, which give more detailed information about the attributes' properties. We refer to these vectorial descriptions of attributes as feature side-information. The feature side-information is usually discarded or used for feature selection prior to model fitting. In this study, we propose a framework to incorporate feature side-information in the learning phase to improve the prediction accuracy. Our approach is applicable to both linear and nonlinear representations, without making assumptions about the mapping or requiring hand engineering. Results show an accuracy improvement over the baseline model, which does not use the side-information.
CNN-based presentation attack detection for trustworthy speaker verification (Hannah Muckenhirn, IDIAP)
Automatic Speaker Verification (ASV) systems can achieve high accuracy in the presence of zero-effort impostors, i.e., speakers that simply attempt to be accepted by the system as another person while using their own voice. However, these systems have been shown to be vulnerable to more elaborate attacks. Presentation attacks, also called spoofing attacks, refer to the presentation of falsified or altered samples to a biometric sensor to induce illegitimate acceptance. Three types of presentation attacks represent a real threat to ASV systems: replay, voice conversion and speech synthesis.
Traditionally, the detection of presentation attacks has been addressed by using handcrafted features based on short-term processing, followed by a classifier such as a neural network or a Gaussian mixture model. In this work, we present a Convolutional Neural Network (CNN) based approach, where we make minimal assumptions and learn the features and the classifier jointly. We demonstrate the potential of this approach through an investigation on the AVspoof database.
Learning Multimodal Temporal Representation for Dubbing Detection in Broadcast Media (Nam Le, IDIAP)
Person discovery in the absence of prior identity knowledge requires accurate association of visual and auditory cues. In broadcast data, multimodal analysis faces additional challenges due to narrated voices over muted scenes or dubbing in different languages. In this presentation, we will define and explore the problem of dubbing detection in broadcast data. We propose a method to represent the temporal relationship between the auditory and visual streams. This method combines canonical correlation analysis, to learn a joint multimodal space, with long short-term memory (LSTM) networks, to model cross-modality temporal dependencies. Our contributions also include the introduction of a newly acquired dataset of face-speech segments from TV data, which is publicly available.